# Towards Enabling Meta-Learning from Target Models

Su Lu, Han-Jia Ye, Le Gan, De-Chuan Zhan
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
{lus,yehj}@lamda.nju.edu.cn, {ganl,zhandc}@nju.edu.cn
(De-Chuan Zhan is the corresponding author.)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Meta-learning can extract an inductive bias from previous learning experience and assist the training of new tasks. It is often realized by optimizing a meta-model with the evaluation loss of task-specific solvers. Most existing algorithms sample non-overlapping support sets and query sets to train and evaluate the solvers, respectively, for the sake of simplicity (the S/Q protocol). Different from the S/Q protocol, we can also evaluate a task-specific solver by comparing it to a target model T, i.e., the optimal model for this task or a model that behaves well enough on this task (the S/T protocol). Although under-explored, the S/T protocol has unique advantages such as offering more informative supervision, but it is computationally expensive. This paper looks into this special evaluation method and takes a step towards putting it into practice. We find that with a small ratio of tasks armed with target models, classic meta-learning algorithms can be improved substantially without consuming many resources. We empirically verify the effectiveness of the S/T protocol in a typical application of meta-learning, i.e., few-shot learning. In detail, after constructing target models by fine-tuning the pre-trained network on hard tasks, we match the task-specific solvers and target models via knowledge distillation.

1 Introduction

Meta-learning means improving performance measures over a family of tasks by using their training experience [22]. It has been studied in various fields such as image classification [11, 16] and reinforcement learning [6, 14]. By reusing transferable meta-knowledge extracted from previous tasks, we can learn new tasks with higher efficiency or with a shortage of data.

A typical meta-learning algorithm can be decomposed into two iterative phases. In the first phase, we train a solver of a task on its training set with the assistance of a meta-model. In the second phase, we optimize the solver's performance to update the meta-model. One key factor in this procedure is the way we evaluate the solver, because the evaluation result acts as the supervision signal for the meta-model. Early meta-learning algorithms [19, 23] directly use the solver's training loss as its performance metric and optimize this metric over a distribution of tasks. Obviously, inner-task over-fitting may happen during the training of task-specific solvers, resulting in an inaccurate supervision signal for the meta-model. This drawback is further amplified in applications where the training set of each task is limited, such as few-shot learning and noisy learning.

Intuitively, the assessment of solvers should be independent of their training sets. This principle draws forth two important meta-learning algorithms from 2016 [24, 28], which export solver evaluation from the perspective of data and of models, respectively. In this paper, we call these two methodologies the S/Q protocol and the S/T protocol. In the S/Q protocol, S means support set and Q means query set.
They contain non-overlapping instances sampled from the same distribution. By training the solver on S and evaluating it on Q, we obtain an approximate generalization error of the solver and eventually provide the meta-model with a reliable supervision signal. Another choice is to compare the trained solver with an ideal target model T. Assuming that T works well on a task, we can minimize the discrepancy between the trained solver and T to pull the solver closer to T. Here T can be the Bayes-optimal solution to a task or a model trained on a sufficiently informative dataset. Figure 1 illustrates both the S/Q protocol and the S/T protocol.

Figure 1: Comparison between the S/Q protocol and the S/T protocol. (a) In the S/Q protocol, each task contains a support set S and a query set Q. We train a solver on S and evaluate it on Q, and the query loss is used to optimize the meta-model. (b) In the S/T protocol, each task contains a support set S and a target model T. After training a solver on S, we directly minimize the discrepancy between it and T.

Although both protocols appeared in the same year, the S/Q protocol is more widely accepted by the meta-learning community [4, 8, 13, 10], while research on how to leverage target models remains immature. The main reason is the simplicity of S/Q and the computational hardness of S/T. However, the S/T protocol has some unique advantages. Firstly, it does not depend on possibly biased and noisy query sets. Secondly, by viewing support sets and their corresponding target models as (feature, label) samples, meta-learning is reduced to supervised learning, and we can transfer insights from supervised learning to improve meta-learning [2]. Thirdly, we can treat the target model as a teacher and incorporate a teacher-student framework such as knowledge distillation [7] and curriculum learning [1] into meta-learning. Thus, it is necessary and meaningful to study the S/T protocol in meta-learning.

This paper looks into the S/T protocol and takes a step towards enabling meta-learning from target models. We mainly answer two questions: (1) If we already have access to target models, how do we learn from them, and what are the benefits of doing so? (2) In a real-world application, how do we obtain target models efficiently and make the S/T protocol computationally tractable? For the first question, we propose to match the task-specific solver to the target model in output space. Learning from target models brings us more robust solvers. For the second question, we focus on a typical application scenario of meta-learning, i.e., few-shot learning. We construct target models by fine-tuning the globally pre-trained network on hard tasks to maintain efficiency.

2 Related Work

Meta-Learning. Meta-learning aims at extracting task-level experience (so-called meta-knowledge) from seen tasks and generalizing the learned meta-knowledge to unseen tasks efficiently. Researchers have studied several kinds of meta-knowledge, such as model initialization [3, 25], embedding network [21, 12, 9, 20, 4], external memory [19, 5], optimization strategy [15, 18], and data augmentation strategy [13]. Despite their diversity in meta-knowledge, most existing models are trained under the S/Q protocol and rely on a randomly sampled and possibly biased query set. Actually, most algorithms are protocol-agnostic, and both the S/Q protocol and the S/T protocol can be applied to them. Thus, our work on the S/T protocol is general and has a wide field of application.
Learning from Target Models. The idea of learning from target models in meta-learning was first proposed by [28]. In [28], the authors constructed a model regression network that explicitly regresses between small-sample classifiers and target models in parameter space. There, both solvers and target models are limited to low-dimensional linear classifiers, which makes it feasible to regress between them. From our perspective, matching two models' parameters is not practical when the dimension of the parameters is too high. Thus, in this paper we match the two models in output space. Other papers also focus on meta-learning from target models [27, 31]. The work most similar to ours is [31], which constructs target models with abundant instances and matches task-specific solvers to target models. However, all of these works assume that every single task has a target model, which increases both the space and time complexity of the S/T protocol. To summarize, we claim that one key point in putting the S/T protocol into practice is reducing the requirement for target models. In this paper, we focus on hard tasks, and find that by learning from a small ratio of informative target models, classic meta-learning algorithms can be improved.

3 Preliminary

Meta-learning extracts high-level knowledge into a meta-model from meta-training tasks sampled from a task distribution p(τ), and reuses the learned meta-model on new tasks belonging to the same distribution. Each task τ has a task-specific support set S = {(x_i, y_i)}_{i=1}^{|S|}, on which we can train a solver g : X → Y parameterized by γ_g. Without loss of generality, a meta-model can be defined as a mapping f : 𝕊 → 𝔾 parameterized by θ_f that receives a support set as input and outputs a solver, where 𝕊 is the space of support sets and 𝔾 is the space of solvers. In other words, f encodes the training process of g on S under the supervision of the meta-knowledge θ_f. Taking two well-known meta-learning algorithms, MAML [3] and ProtoNet [21], as examples, we have the following concrete forms of f.

MAML meta-learns a model initialization θ_f and fine-tunes it on each S with one gradient descent step to obtain a task-specific solver g, as in Equ (1), where η is the step size and ℓ : Y × Y → R_+ is some loss function:

$$f(\mathcal{S}; \theta_f) = g\Big(\cdot\,;\; \gamma_g = \theta_f - \eta \nabla_{\theta_f} \sum_{(x_i, y_i) \in \mathcal{S}} \ell\big(g(x_i; \theta_f), y_i\big)\Big). \quad (1)$$

ProtoNet meta-learns an embedding function φ_{θ_f} parameterized by θ_f and generates a lazy solver which classifies an instance into the category of its nearest class center. Here g is implicitly parameterized by both θ_f and the embedded support instances:

$$f(\mathcal{S}; \theta_f) = g\big(\cdot\,;\; \theta_f, \{\varphi_{\theta_f}(x_i) \mid (x_i, y_i) \in \mathcal{S}\}\big). \quad (2)$$

S/Q Protocol. How do we evaluate the solver g trained on S? The answer to this question distinguishes the conventional S/Q protocol [24] from the S/T protocol. In the S/Q protocol, we sample another query set Q = {(x_j, y_j)}_{j=1}^{|Q|} apart from S for each task. Instances in S and Q are i.i.d. and share the same label set, and we evaluate g by its loss on Q. Since S and Q contain non-overlapping instances, the loss on Q is a more reliable supervision signal. The S/Q protocol can be formulated as Equ (3), where D^tr is the meta-training set from which we sample meta-training tasks τ^tr:

$$\min_{\theta_f} \; \mathbb{E}_{\tau^{tr} = (\mathcal{S}^{tr}, \mathcal{Q}^{tr}) \sim D^{tr}} \sum_{(x_j, y_j) \in \mathcal{Q}^{tr}} \ell\big(f(\mathcal{S}^{tr})(x_j), y_j\big). \quad (3)$$
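To make the protocols concrete, here is a minimal PyTorch-style sketch of one S/Q meta-update implementing Equ (3); the `f`, `tasks`, and optimizer interfaces are our own illustration and not taken from the paper's released code:

```python
import torch.nn.functional as F

def sq_meta_step(f, tasks, meta_opt):
    """One S/Q meta-update (Equ (3)).

    `f` maps a support set to a task-specific solver (e.g. a MAML inner
    loop or a ProtoNet head); `tasks` yields (S_x, S_y, Q_x, Q_y) tuples.
    """
    meta_opt.zero_grad()
    meta_loss = 0.0
    for S_x, S_y, Q_x, Q_y in tasks:
        solver = f(S_x, S_y)          # train the solver g = f(S) on the support set
        meta_loss = meta_loss + F.cross_entropy(solver(Q_x), Q_y)  # evaluate on Q
    meta_loss.backward()              # the query loss supervises theta_f
    meta_opt.step()
    return float(meta_loss)
```

Under the S/T protocol below, only the evaluation line changes: the query-loss term is replaced by a discrepancy between the solver and a target model.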
S/T Protocol. Any sampled query set Q can be biased or noisy, which may cause an inaccurate evaluation of the solver. An alternative is to directly match the task-specific solver g = f(S) against a target model T that works well on the corresponding task. By computing the distance from the solver to the target model, we obtain a more robust training signal for updating the meta-model. Replacing the solver-evaluation part in Equ (3) yields the S/T protocol in Equ (4), where L : 𝔾 × 𝔾 → R_+ is some loss function measuring the discrepancy between g and the target model T:

$$\min_{\theta_f} \; \mathbb{E}_{\tau^{tr} = (\mathcal{S}^{tr}, \mathcal{T}^{tr}) \sim D^{tr}} \; L\big(f(\mathcal{S}^{tr}), \mathcal{T}^{tr}\big). \quad (4)$$

4 Effect of Target Model

We introduced the basic concepts of meta-learning and formulated the S/Q and S/T protocols in Section 3. In this section, we assume that target models are available and study how to utilize them to assist meta-learning. Firstly, we propose a model matching framework based on output comparison. Secondly, we verify the effectiveness of our proposal in a synthetic experiment. Moreover, we try to decrease the ratio of tasks that have target models, and show that it is possible to reduce the resource consumption of the S/T protocol.

4.1 Model Matching

Figure 2: Two approaches to matching g and the target model T. Left: matching them in parameter space (parameter regression). Right: matching them in output space (knowledge distillation).

In the S/T protocol, one key point is how to match the solver g and its target model T. In other words, we need to specify the concrete form of L(g, T). Generally, methods to match g to T fall into two categories. Firstly, we can directly match the two models' parameters, or use another model to regress between them [28]. For example, letting γ_g and γ_T be the parameters of g and T, we can set

$$L(g, \mathcal{T}) = \sum_{(x_i, y_i) \in \mathcal{S}^{tr}} \ell\big(g(x_i), y_i\big) + \lambda \|\gamma_g - \gamma_{\mathcal{T}}\|_2^2,$$

where λ is a balancing hyper-parameter. This method may work well for low-dimensional parameters, but it is not suitable for complex models like deep neural networks. A better alternative is to match the two models in their output space, i.e.,

$$L(g, \mathcal{T}) = \sum_{(x_i, y_i) \in \mathcal{S}^{tr}} \big[(1-\lambda)\,\ell\big(g(x_i), y_i\big) + \lambda\, D\big(\mathcal{T}(x_i), g(x_i)\big)\big],$$

where D(·, ·) is a function measuring the discrepancy between T(x_i) and g(x_i). If we instantiate D(·, ·) as the KL divergence KL(·||·) for classification problems, this loss function is equivalent to that of knowledge distillation. Figure 2 illustrates both approaches to matching a solver to a target model.
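As a sketch of the two matching strategies in Section 4.1, assuming a classification solver and PyTorch-style tensors (the temperature `tau` and all names are our additions; the paper itself only specifies the losses):

```python
import torch
import torch.nn.functional as F

def match_in_parameter_space(solver_logits, y, gamma_g, gamma_T, lam):
    # Supervised loss plus an L2 pull towards the target model's parameters;
    # only sensible when g and T share a small, aligned parameterization.
    return F.cross_entropy(solver_logits, y) + lam * (gamma_g - gamma_T).pow(2).sum()

def match_in_output_space(solver_logits, target_logits, y, lam, tau=1.0):
    # (1 - lam) * CE(g(x), y) + lam * KL(T(x) || g(x)): the knowledge-
    # distillation form; F.kl_div expects log-probs of the student.
    ce = F.cross_entropy(solver_logits, y)
    kd = F.kl_div(F.log_softmax(solver_logits / tau, dim=1),
                  F.softmax(target_logits / tau, dim=1),
                  reduction="batchmean")
    return (1 - lam) * ce + lam * kd
```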
4.2 Empirical Study: Sinusoid Regression

In this part, we assume that target models are available and evaluate the effectiveness of our proposed matching approach. We construct a synthetic regression problem and try to answer the following questions: (1) Can the S/T protocol outperform the S/Q protocol when target models are available? (2) Is it possible to improve meta-learning with only a few target models?

Setting. Consider regression tasks T(x) = a sin(bx − c), where a, b, and c are uniformly sampled from [0.1, 5], [0.5, 2], and [0.5, 2π], respectively. For each task, we generate 10 support instances by uniformly sampling x in the range [−5, 5]. For the S/Q protocol, we additionally sample 30 query instances for each task. We then set y = T(x) + ε, where ε ∼ N(0, 0.5) is Gaussian noise. 10000 tasks are used for both meta-training and meta-testing, and 500 tasks are used for meta-validation.

Algorithms. We consider two classic meta-learning algorithms, MAML [3] and ProtoNet [21]. MAML can be directly applied to a regression task, but ProtoNet is originally designed for classification, so we modify it slightly to fit the regression problem. In detail, we meta-learn an embedding function φ : R → R^100, with the assistance of which the similarity-based regression model g(·; {φ(x_i) | (x_i, y_i) ∈ S}) works well across all tasks. For any instance (x, y), the prediction is

$$\hat{y} = g(x) = \sum_{(x_i, y_i) \in \mathcal{S}} w_i y_i, \qquad w_i = \frac{\exp\{\langle \varphi(x_i), \varphi(x) \rangle\}}{\sum_{(x_j, y_j) \in \mathcal{S}} \exp\{\langle \varphi(x_j), \varphi(x) \rangle\}}.$$

The same embedding network is used in both algorithms. We train MAML and ProtoNet under the S/Q protocol and the S/T protocol. When using the S/Q protocol, we minimize the MSE loss on the 30 query instances to optimize φ. For the S/T protocol, we match the solver and the target model in output space and set D(T(x_i), g(x_i)) = ‖T(x_i) − g(x_i)‖_2^2. Thus, the loss function under the S/T protocol is

$$L(g, \mathcal{T}) = \sum_{(x_i, y_i) \in \mathcal{S}^{tr}} \big[(1-\lambda)\,\|g(x_i) - y_i\|_2^2 + \lambda\,\|g(x_i) - \mathcal{T}(x_i)\|_2^2\big],$$

where λ is a hyper-parameter. More implementation details can be found in the supplementary material.

Superiority of the S/T Protocol. Table 1 shows the MSE of the four models on meta-testing tasks. Models trained under the S/T protocol consistently outperform models trained under the S/Q protocol. In Figure 3, we visualize a randomly chosen meta-testing task; different colors denote different meta-learning algorithms, and dotted and dashed lines denote the S/Q and S/T protocols, respectively. Models trained under the S/T protocol fit the target sinusoid curve better.

Table 1: Average test MSE of two meta-learning algorithms. Models trained under the S/T protocol outperform those trained under the S/Q protocol.

| Method | MAML (S/Q) | MAML (S/T) | ProtoNet (S/Q) | ProtoNet (S/T) |
|---|---|---|---|---|
| MSE on D^ts | 4.933 | 3.621 | 4.706 | 3.332 |

Figure 3: Visualization of a randomly sampled meta-testing task. Dotted lines are used for the S/Q protocol and dashed lines for the S/T protocol. Models trained under the S/T protocol fit the target sinusoid curve better.

It is worth discussing why target models improve meta-learning algorithms. In this empirical study, distillation from target models can be interpreted as label denoising. In detail, we can prove (the proof is left to the supplementary material) that the meta-learning loss under the S/T protocol, (1−λ)‖g(x) − y‖_2^2 + λ‖g(x) − T(x)‖_2^2, is an upper bound of ‖g(x) − (y − λε)‖_2^2, which is the standard MSE loss between the output of the solver g and the cleaner label y − λε (the raw label y equals T(x) + ε). Therefore, the larger λ is, the cleaner the training labels are. Table 2 is an ablation study on the hyper-parameter λ. As expected, both algorithms trained under S/T achieve better performance with larger λ. These results demonstrate the superiority of the S/T protocol when target models are available.

Table 2: Average test MSE of two meta-learning algorithms with different λ values. Larger λ offers cleaner labels, resulting in better models.

| λ | 1 | 0.8 | 0.5 | 0.2 |
|---|---|---|---|---|
| MSE: MAML (S/T) | 3.220 | 3.419 | 3.621 | 3.833 |
| MSE: ProtoNet (S/T) | 3.137 | 3.304 | 3.332 | 3.550 |
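The upper bound used in this label-denoising argument follows in one line from the convexity of the squared loss; a sketch of the step (the full proof is in the paper's supplementary material):

```latex
% Jensen's inequality on the convex map t -> ||g(x) - t||^2 with weights
% (1 - lambda, lambda), then substituting T(x) = y - epsilon:
(1-\lambda)\,\|g(x)-y\|_2^2 + \lambda\,\|g(x)-\mathcal{T}(x)\|_2^2
  \;\ge\; \big\|g(x) - \big[(1-\lambda)\,y + \lambda\,\mathcal{T}(x)\big]\big\|_2^2
  \;=\; \|g(x) - (y-\lambda\epsilon)\|_2^2 .
```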
Reducing the Requirement for Target Models. Despite the satisfying results in this empirical study, it does not follow that we can apply the S/T protocol in real-world applications and necessarily obtain higher performance. Up to this point, we have assumed that every single meta-training task has a target model. This assumption is too strong in two respects. Firstly, we usually do not have ready-made target models, and constructing target models is not trivial. Secondly, even with a method for constructing target models, building one for every single meta-training task would cost too much time. Existing research on meta-learning from target models often bypasses this dilemma by restricting the complexity of solvers and target models [28] or by building one global target model. In this paper, we study a more general methodology: reducing the number of required target models.

If we randomly choose a small subset of meta-training tasks and provide target models only for these tasks, how will the model performance change? To answer this question, we first randomly sample subsets of tasks that have target models and abandon target models for the other tasks. In this case, the meta-learning loss of tasks without target models degenerates to the S/Q loss. By varying the size of this subset, we plot the performance curves of MAML and ProtoNet in Figure 4. Then, we heuristically select the hardest tasks from all meta-training tasks and deploy target models only for these tasks. In this regression problem, a sinusoid curve is defined as a sin(bx − c), and larger a or smaller b induces a steeper curve. We simply consider steep curves as hard tasks and sort the hardness of all meta-training tasks according to a/b. The corresponding performance curves under this heuristic are also plotted in Figure 4. With this naive heuristic, we obtain an evident performance gain with only 500 (5%) target models. This finding inspires us to analyse the hardness of tasks in meta-learning and confirms the possibility of learning from a few target models.

Figure 4: Change of MSE loss over the number of meta-training tasks that have target models. By selecting hard tasks heuristically, we are able to obtain an evident performance gain with a small number of target models.

5 Application Case: Few-Shot Learning

Few-shot learning is a typical application of meta-learning. It aims at recognizing new categories with only a few labelled instances. In few-shot learning, we have two datasets that contain non-overlapping classes, D^tr and D^ts. D^tr is composed of seen classes, while D^ts contains unseen classes. We can sample N-way K-shot (i.e., N classes with K instances per class) meta-training tasks from D^tr to train the meta-model, and we expect the trained meta-model to also work well on D^ts.

5.1 Task Hardness

Following the idea of constructing target models for hard tasks, we first investigate which tasks are hard in few-shot learning. We consider the relationship between classes as a key factor that determines the hardness of a classification task. Assuming that there are C^tr classes in D^tr, we first compute a similarity matrix F ∈ R^{C^tr × C^tr} whose element F_uv equals the similarity between the u-th and the v-th class centers. In few-shot learning, pre-training the backbone network on D^tr has become common practice [26, 30], and we can compute these class centers based on the pre-trained model φ^pt as in Equ (5) and Equ (6).
$$c_u = \frac{1}{K_u} \sum_{(x_i, y_i) \in D^{tr},\; y_i = u} \varphi^{pt}(x_i), \quad u \in [C^{tr}], \qquad (5)$$

$$F_{uv} = \frac{\langle c_u, c_v \rangle}{\|c_u\|\,\|c_v\|}, \quad u, v \in [C^{tr}]. \qquad (6)$$

In Equ (5), K_u is the number of instances of class u in D^tr, and, with a slight abuse of notation, we use y_i = u to select instances belonging to the u-th class. With the similarity matrix F, we can take out the sub-similarity matrix of a task τ by slicing the rows and columns corresponding to the classes contained in τ. The hardness of task τ is defined as the sum of the entries of its sub-similarity matrix: the more similar the classes in τ are, the more difficult they are to tell apart. The hardness of every meta-training task can be evaluated with the similarity matrix F, and we compute F only once.
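A minimal sketch of this hardness computation (PyTorch-style; `feats` and `labels` are assumed to hold the pre-trained embeddings φ^pt(x_i) and labels over D^tr, and the function names are ours):

```python
import torch
import torch.nn.functional as F

def class_centers(feats, labels, num_classes):
    # Equ (5): c_u is the mean pre-trained embedding of class u over D^tr.
    return torch.stack([feats[labels == u].mean(dim=0) for u in range(num_classes)])

def similarity_matrix(centers):
    # Equ (6): F_uv is the cosine similarity between class centers u and v.
    c = F.normalize(centers, dim=1)
    return c @ c.t()

def task_hardness(sim, task_classes):
    # Hardness of task tau = sum of its sub-similarity matrix, obtained by
    # slicing the rows and columns of the classes contained in tau.
    return sim[task_classes][:, task_classes].sum().item()
```

Since `sim` is computed once from the pre-trained backbone, ranking all meta-training tasks by hardness costs only a slicing operation per task.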
5.2 Target Model Construction

As mentioned in the last part, pre-training the backbone network on seen classes is a widely used technique in few-shot learning. The pre-trained network φ^pt is optimized with the cross-entropy loss on the whole meta-training set and can classify all classes in D^tr. Since there are C^tr classes in D^tr, the output of φ^pt is a C^tr-dimensional vector. Given a specific N-way task τ, a naive approach to obtaining a target model is to take out the N corresponding dimensions of the pre-trained model's output. However, using a single pre-trained model to assist the meta-learning of all tasks is sub-optimal. We claim that fine-tuning the pre-trained model on the subset of D^tr that contains the classes in τ gives a better target model for τ.

Evaluation on an Auxiliary Dataset. To verify the reasonability of the heuristic task hardness metric and the effectiveness of the fine-tuning approach, we need an auxiliary dataset D^au. D^au contains the same classes as D^tr, so we can evaluate the accuracy of the constructed target models (trained on D^tr) on D^au. We conduct an experiment on miniImageNet [24] to check whether fine-tuned target models are better than pre-trained target models. Firstly, we pre-train a ResNet-12 with a linear layer on the meta-training split of miniImageNet. After that, we randomly sample 1000 5-way tasks from D^tr and fine-tune the pre-trained backbone to obtain 1000 target models. For each task τ, we take out all instances in D^au that belong to classes in τ to evaluate φ^pt and φ^ft_τ. Table 3 shows the average accuracy on the auxiliary dataset D^au. Fine-tuned target models achieve higher accuracy because they are task-specific, but the performance gain is marginal: the pre-trained model already works well enough on these seen classes. This means it is not cost-effective to fine-tune a target model for every single meta-training task.

Table 3: Average accuracy on the auxiliary dataset D^au. Fine-tuned target models outperform a single pre-trained target model on randomly sampled tasks.

| Target Model | pre-train | fine-tune |
|---|---|---|
| Accuracy on D^au | 98.24 | 99.37 |

In Figure 5, we divide these 1000 tasks into 10 bins according to their hardness and compute the average accuracy of φ^pt and φ^ft in each bin. Two conclusions can now be drawn. Firstly, both φ^pt and φ^ft achieve lower accuracy on harder tasks, which verifies the reasonability of our proposed hardness metric. Secondly, the performance gain of fine-tuned target models is most remarkable on hard tasks, which means fine-tuning target models for hard tasks can simultaneously save computing resources and improve upon the pre-trained target model.

Figure 5: Grouping of 1000 tasks according to their hardness. Both φ^pt and φ^ft achieve lower accuracy on harder tasks, verifying the reasonability of our proposed hardness metric. Fine-tuned target models obtain a remarkable performance gain on hard tasks.

We can now summarize our S/T protocol for few-shot learning. Firstly, we pre-train φ^pt on D^tr and sample meta-training tasks from the seen classes. Secondly, we sort the meta-training tasks according to their hardness and fine-tune the pre-trained network to obtain local target models for a small ratio of the hardest tasks. Denote by D^tr_1 the set of tasks that have target models and by D^tr_2 the set of tasks that do not. For tasks in D^tr_1, we train task-specific solvers on their support sets and then evaluate these solvers under the S/T protocol. For tasks in D^tr_2, we simply use S/Q to compute the query loss, as shown in Equ (7):

$$\min_{\theta_f} \sum_{(\mathcal{S}^{tr}_1, \mathcal{T}^{tr}_1) \sim D^{tr}_1} \sum_{(x_i, y_i) \in \mathcal{S}^{tr}_1} \big[(1-\lambda)\,\ell\big(f(\mathcal{S}^{tr}_1)(x_i), y_i\big) + \lambda\,\mathrm{KL}\big(\mathcal{T}^{tr}_1(x_i)\,\|\,f(\mathcal{S}^{tr}_1)(x_i)\big)\big] \;+\; \sum_{(\mathcal{S}^{tr}_2, \mathcal{Q}^{tr}_2) \sim D^{tr}_2} \sum_{(x_j, y_j) \in \mathcal{Q}^{tr}_2} \ell\big(f(\mathcal{S}^{tr}_2)(x_j), y_j\big). \quad (7)$$

Different from the S/Q protocol, the S/T protocol does not rely on randomly sampled query sets, and target models usually offer more information than instances. The distillation term plays the role of a regularizer, enforcing the solvers for hard tasks to be smooth (see the next subsection). Although the idea of the S/T protocol was proposed in 2016, it has not been widely used due to its computational intractability. In this paper, however, we propose an efficient method to construct target models and deploy target models only for a small ratio of hard tasks. This opens the door for future research on the S/T protocol and unearths the potential of existing meta-learning algorithms.
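A sketch of the Equ (7) objective for one meta-batch (PyTorch-style classification; `f` maps a support set to a differentiable solver, `hard_tasks` carry fine-tuned teachers, and all names are illustrative):

```python
import torch.nn.functional as F

def st_protocol_loss(f, hard_tasks, easy_tasks, lam):
    # Tasks in D^tr_1: S/T loss = (1 - lam) * CE on S + lam * KL(T || g) on S.
    loss = 0.0
    for S_x, S_y, target in hard_tasks:
        solver = f(S_x, S_y)
        logits = solver(S_x)
        kd = F.kl_div(F.log_softmax(logits, dim=1),
                      F.softmax(target(S_x), dim=1),   # teacher outputs T(x_i)
                      reduction="batchmean")
        loss = loss + (1 - lam) * F.cross_entropy(logits, S_y) + lam * kd
    # Tasks in D^tr_2: fall back to the ordinary S/Q query loss.
    for S_x, S_y, Q_x, Q_y in easy_tasks:
        loss = loss + F.cross_entropy(f(S_x, S_y)(Q_x), Q_y)
    return loss
```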
5.3 Empirical Study: Gaussian Classification

In this part, we test our proposed method on a synthetic classification dataset. The purposes of this empirical study are two-fold: (1) to check whether the S/T protocol with only a few target models can improve a classic meta-learning algorithm; (2) to study why distillation from target models helps.

Setting. We randomly generate 100 2-d Gaussian distributions: 64 classes for meta-training, 16 classes for meta-validation, and 20 classes for meta-testing. We sample 100 instances from each class to form the whole dataset. For each class, we sample its mean vector µ ∼ U^2[−10, 10] ⊂ R^2 and covariance matrix Σ = Σ̃^⊤Σ̃, where Σ̃ ∼ U^{2×2}[−2, 2] ⊂ R^{2×2} and U denotes the uniform distribution. We then sample 10000 5-way 10-shot tasks for both meta-training and meta-testing. After every 500 episodes, we sample 500 tasks for meta-validation.

Algorithms. We use a ProtoNet [21] trained under the S/Q protocol as our baseline. It meta-learns a shared embedding function φ : R^2 → R^100 across tasks and classifies an instance into the category of its nearest support class center. To be specific, let $c_n = \frac{1}{K} \sum_{(x_i, y_i) \in \mathcal{S},\; y_i = n} \varphi(x_i)$ be the support class center of the n-th class (with a slight abuse of notation, we use y_i = n to select instances belonging to the n-th class); then for an instance x, the model predicts its N-dimensional label ŷ as

$$\hat{y}_n = \frac{\exp\{\langle c_n, \varphi(x) \rangle\}}{\sum_{m=1}^{N} \exp\{\langle c_m, \varphi(x) \rangle\}}, \quad n \in [N].$$

As a comparison, we also train a ProtoNet under the S/T protocol, where the target model is constructed by fine-tuning the pre-trained global embedding network on specific tasks. To check whether the S/T protocol can work with only a few target models, we set the ratio of tasks that have target models to 5% and 10%. As presented in the last part, we sort all meta-training tasks according to their hardness and fine-tune the pre-trained backbone on the hardest ones. Refer to the supplementary material for more details.

Results and Discussions. Firstly, we report the meta-testing accuracy of the different models in Table 4. Methods under the S/T protocol outperform the vanilla ProtoNet by a large margin; even with only 5% target models, we obtain a remarkable accuracy improvement.

Table 4: Average accuracy on the meta-testing set. Models trained under the S/T protocol outperform those trained under the S/Q protocol even though there are only a few target models. φ^pt means directly using the pre-trained network to solve meta-testing tasks without a meta-training phase. The second and third rows represent biased sampling, where we only sample instances that have low likelihoods. When instances are biased, the superiority of the S/T protocol is more evident because target models make task-specific solvers more robust.

| Protocol | φ^pt | S/Q | S/T-5% | S/T-10% |
|---|---|---|---|---|
| ACC | 82.33 | 87.90 | 90.32 | 92.87 |
| ACC (< 0.3) | 77.41 | 81.25 | 87.66 | 90.14 |
| ACC (< 0.1) | 65.57 | 70.10 | 79.22 | 84.02 |

Then, we study why the S/T protocol helps ProtoNet learn better. In Figure 6, we visualize a 5-way 10-shot task and the decision regions of three models in the raw 2-d space. Figure 6a shows the Bayes-optimal classifier T, i.e., for an instance x,

$$p(\hat{y}=n \mid x) \;\propto\; \frac{1}{2\pi}\,\frac{1}{|\Sigma_n|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_n)^\top \Sigma_n^{-1} (x-\mu_n)\Big),$$

where µ_n and Σ_n are the mean vector and covariance matrix of class n. Because different classes have different covariance matrices, the decision boundary of the Bayes classifier is very steep. Figures 6b and 6c show ProtoNet trained under the S/Q protocol and the S/T protocol, respectively. In Figure 6c, the decision boundary is smooth and regular, unlike the previous two models. This result offers a natural interpretation of the S/T protocol's benefit: target models impose a regularization on task-specific solvers, making them more robust to noisy and biased instances. In fact, [32] draws a similar conclusion: knowledge distillation can be seen as a special form of label smoothing and can regularize model training.

Figure 6: Decision boundaries of three different models in the raw 2-d space. Different point colors represent different classes, and different background colors represent different classification regions. 5% of tasks have target models; a 5-way 10-shot task is visualized. (a) The Bayes-optimal model constructed with the parameters {µ_n}_{n=1}^N and {Σ_n}_{n=1}^N. Although it has the lowest misclassification error in expectation, it is not robust to noise since its decision boundary is very steep. (b) ProtoNet trained under the S/Q protocol; the decision boundary is still irregular. (c) ProtoNet trained under the S/T protocol; the decision boundary is very smooth due to the regularization effect of knowledge distillation. Models trained under the S/T protocol are more robust to noisy or biased instances.

To verify this property more clearly, we sample biased tasks containing only instances that have low likelihoods (< 0.3 or < 0.1) and test the different models on them. As the second and third rows of Table 4 show, the S/T protocol defends against biased sampling to the greatest extent because of the strong supervision offered by target models.
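For reference, a small NumPy sketch of this synthetic setup under the stated assumptions (the seed and helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class():
    # mu ~ U^2[-10, 10]; Sigma = S^T S with S ~ U^{2x2}[-2, 2], so Sigma is
    # positive semi-definite by construction.
    mu = rng.uniform(-10.0, 10.0, size=2)
    S = rng.uniform(-2.0, 2.0, size=(2, 2))
    return mu, S.T @ S

def bayes_posterior(x, mus, sigmas):
    # p(y = n | x) proportional to N(x; mu_n, Sigma_n) under a uniform class
    # prior; the (2*pi)^-1 factor cancels in the normalization.
    dens = []
    for mu, sig in zip(mus, sigmas):
        d = x - mu
        dens.append(np.exp(-0.5 * d @ np.linalg.solve(sig, d))
                    / np.sqrt(np.linalg.det(sig)))
    dens = np.asarray(dens)
    return dens / dens.sum()
```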
Table 5: Average test accuracy with 95% confidence intervals on meta-testing tasks of miniImageNet and tieredImageNet. All methods use ResNet-12 as the backbone network except the MAML row marked with †, which uses a 4-layer ConvNet backbone (shallower than ResNet-12). Cited values come from existing papers; "re-implement" rows are reproduced by us. Best results are in bold. MAML and ProtoNet trained under the S/T protocol outperform the models trained under the S/Q protocol even with only a few target models; in particular, ProtoNet trained under the S/T protocol achieves state-of-the-art performance in most cases.

| Method | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot |
|---|---|---|---|---|
| DeepEMD [33] | 65.91 ± 0.82 | 82.41 ± 0.56 | 71.16 ± 0.87 | 86.03 ± 0.58 |
| FEAT [30] | 66.78 ± 0.20 | 82.05 ± 0.14 | 70.80 ± 0.23 | 84.79 ± 0.16 |
| FRN [29] | 66.45 ± 0.19 | **82.83 ± 0.13** | 72.06 ± 0.22 | 86.89 ± 0.14 |
| MAML (S/Q) [3]† | 48.70 ± 1.84 | 63.11 ± 0.92 | - | - |
| MAML (S/Q, re-implement) | 58.84 ± 0.25 | 74.62 ± 0.38 | 63.02 ± 0.30 | 67.26 ± 0.32 |
| MAML (S/T-5%) | 59.14 ± 0.33 | 75.77 ± 0.29 | 64.52 ± 0.30 | 68.39 ± 0.34 |
| MAML (S/T-10%) | 60.06 ± 0.35 | 76.34 ± 0.42 | 65.23 ± 0.45 | 70.02 ± 0.33 |
| ProtoNet (S/Q) [21] | 60.37 ± 0.83 | 78.02 ± 0.57 | 65.65 ± 0.92 | 83.40 ± 0.65 |
| ProtoNet (S/Q, re-implement) | 65.30 ± 0.30 | 79.93 ± 0.39 | 70.34 ± 0.45 | 84.68 ± 0.55 |
| ProtoNet (S/T-5%) | 67.35 ± 0.49 | 81.67 ± 0.62 | 71.25 ± 0.37 | 85.80 ± 0.31 |
| ProtoNet (S/T-10%) | **68.03 ± 0.52** | 82.53 ± 0.47 | **72.41 ± 0.39** | **86.91 ± 0.47** |

Table 6: Ablation study. "random" means selecting tasks randomly rather than according to their hardness; "φ^pt" means using the pre-trained network as the target model for all tasks. Best results are in bold. Our proposed heuristic hardness metric and the fine-tuning strategy both improve model performance.

| Model | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot |
|---|---|---|---|---|
| MAML (S/Q) | 58.84 | 74.62 | 63.02 | 67.26 |
| MAML (S/T-10%-random) | 59.66 | 74.90 | 65.11 | 68.63 |
| MAML (S/T-10%-φ^pt) | 59.35 | 75.88 | 64.78 | 69.26 |
| MAML (S/T-10%-hardness-φ^ft) | 60.06 | 76.34 | 65.23 | 70.02 |
| ProtoNet (S/Q) | 65.30 | 79.93 | 70.34 | 84.68 |
| ProtoNet (S/T-10%-random) | 66.72 | 81.05 | 71.22 | 85.37 |
| ProtoNet (S/T-10%-φ^pt) | 67.47 | 81.70 | 71.55 | 86.04 |
| ProtoNet (S/T-10%-hardness-φ^ft) | **68.03** | **82.53** | **72.41** | **86.91** |

5.4 Empirical Study: Benchmark Evaluation

In this part, we evaluate our S/T protocol on two benchmark datasets, miniImageNet [24] and tieredImageNet [17] (refer to the supplementary material for dataset details; our code is available at https://github.com/njulus/ST). We try to answer four questions: (1) Can we achieve state-of-the-art performance with a classic meta-learning model trained under the S/T protocol? (2) How does each component influence the model's performance? (3) How does the hyper-parameter λ influence the model's performance? (4) How much time does the S/T protocol cost?

Algorithms. We implement two classic meta-learning algorithms, MAML and ProtoNet, under the S/T protocol. We use ResNet-12 as the backbone network, pre-trained on the meta-training set. For a fair comparison, Table 5 only includes other algorithms that also use a ResNet-12 backbone. More implementation details can be found in the supplementary material.

Competitive Results against SOTA. Table 5 shows that MAML and ProtoNet improve considerably when trained under the S/T protocol with only 5% or 10% target models. Note that the vanilla ProtoNet does not use the pre-training trick, so we re-implement it with pre-training. ProtoNet was proposed in 2017, yet retraining it under the S/T protocol with only a few target models yields state-of-the-art performance, which verifies the superiority of the S/T protocol. In fact, the S/T protocol is a generic training protocol that can be applied to any meta-learning algorithm, and we apply it to more meta-learning algorithms in the supplementary material to show the effectiveness of our method.

Table 7: Time consumption for fine-tuning target models on miniImageNet.

| Number of Target Models | Time Consumption (min) |
|---|---|
| 500 (5%) | 224.3 |
| 1000 (10%) | 420.6 |
| 2000 (20%) | 851.7 |

Figure 7: Performance change over λ on 5-way 1-shot and 5-way 5-shot tasks. Larger λ tends to benefit model accuracy; we set λ to 0.8, a relatively large value.
Ablation Study. We check the effectiveness of each component. Table 6 shows that our proposed hardness metric and fine-tuning strategy both help to improve performance. Randomly sampling 10% of the tasks and constructing target models for them already improves model performance, and the third and seventh rows of Table 6 verify that learning from target models is beneficial even when the target models are not optimal. With only 10% locally fine-tuned target models and our heuristic hardness metric, ProtoNet achieves nearly state-of-the-art performance.

Hyper-Parameter. We check the influence of the hyper-parameter λ in Equ (7). We sample 5-way tasks from miniImageNet and try different λ values. Figure 7 shows that larger λ tends to benefit model performance; we set λ to 0.8, a relatively large value, in most experiments.

Time Consumption. In the S/T protocol for few-shot learning, we need to construct target models by fine-tuning the globally pre-trained network, which costs extra training time. In this part, we answer the following question: how much time does the S/T protocol cost in few-shot learning? We range the ratio of tasks that have target models over {5%, 10%, 20%} and report the time consumption of fine-tuning target models on miniImageNet in Table 7. We run the experiment on an Nvidia GeForce RTX 2080ti GPU and an Intel(R) Xeon(R) Silver 4110 CPU. About 4 hours are needed to fine-tune target models for 5% of the meta-training tasks, and the time consumption for fine-tuning 2000 target models is still acceptable.

6 Conclusion

In this paper, we study the S/T meta-learning protocol, which evaluates a task-specific solver by comparing it to a target model. The S/T protocol offers a more informative supervision signal for meta-learning but is difficult to use in practice owing to its high computational cost. We find that by deploying target models only for the hardest tasks, we can improve existing meta-learning algorithms while maintaining efficiency. We propose a heuristic task hardness metric and a convenient target model construction method for few-shot learning. Experiments on synthetic datasets and benchmark datasets demonstrate the superiority of the S/T protocol and the effectiveness of our proposed method.

Acknowledgements

This work is supported by National Key R&D Program of China (2020AAA0109401), NSFC (41901270, 61773198, 62006112), NSF of Jiangsu Province (BK20190296, BK20200313), and CCF-Baidu Open Fund (NO.2021PP15002000).

References

[1] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning, pages 41–48, 2009.

[2] Wei-Lun Chao, Han-Jia Ye, De-Chuan Zhan, Mark Campbell, and Kilian Q. Weinberger. Revisiting meta-learning as supervised learning. CoRR, abs/2002.00573, 2020.

[3] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135, 2017.

[4] Micah Goldblum, Steven Reich, Liam Fowl, Renkun Ni, Valeriia Cherepanova, and Tom Goldstein. Unraveling meta-learning: Understanding feature representations for few-shot tasks. In Proceedings of the 37th International Conference on Machine Learning, pages 3607–3616, 2020.

[5] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.
[6] Abhishek Gupta, Russell Mendonca, Yu Xuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems 31, pages 5302–5311, 2018.

[7] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.

[8] Ekaterina Iakovleva, Jakob Verbeek, and Karteek Alahari. Meta-learning with shared amortized variational inference. In Proceedings of the 37th International Conference on Machine Learning, pages 4572–4582, 2020.

[9] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In 32nd International Conference on Machine Learning Workshop, volume 2, 2015.

[10] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the 32nd IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.

[11] Su Lu, Han-Jia Ye, and De-Chuan Zhan. Tailoring embedding function to heterogeneous few-shot tasks by global and local feature adaptors. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 8776–8783, 2021.

[12] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems 31, pages 721–731, 2018.

[13] Seong-Jin Park, Seungju Han, Ji-Won Baek, Insoo Kim, Juhwan Song, Hae Beom Lee, Jae-Joon Han, and Sung Ju Hwang. Meta variance transfer: Learning to augment from the others. In Proceedings of the 37th International Conference on Machine Learning, pages 7510–7520, 2020.

[14] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the 36th International Conference on Machine Learning, pages 5331–5340, 2019.

[15] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.

[16] Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. Few-shot learning with embedded class models and shot-free meta training. In Proceedings of the 17th International Conference on Computer Vision, pages 331–339, 2019.

[17] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations, 2018.

[18] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In Proceedings of the 7th International Conference on Learning Representations, 2019.

[19] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1842–1850, 2016.

[20] Christian Simon, Piotr Koniusz, Richard Nock, and Mehrtash Harandi. Adaptive subspaces for few-shot learning. In Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition, pages 4136–4145, 2020.

[21] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning.
In Advances in Neural Information Processing Systems 30, pages 4077–4087, 2017.

[22] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 2012.

[23] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.

[24] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29, pages 3630–3638, 2016.

[25] Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. Multimodal model-agnostic meta-learning via task-aware modulation. In Advances in Neural Information Processing Systems 32, pages 1–12, 2019.

[26] Yan Wang, Wei-Lun Chao, Kilian Q. Weinberger, and Laurens van der Maaten. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. CoRR, abs/1911.04623, 2019.

[27] Yu-Xiong Wang, Adrien Bardes, Ruslan Salakhutdinov, and Martial Hebert. Progressive knowledge distillation for generative modeling. 2019.

[28] Yu-Xiong Wang and Martial Hebert. Learning to learn: Model regression networks for easy small sample learning. In Proceedings of the 14th European Conference on Computer Vision, pages 616–634, 2016.

[29] Davis Wertheimer, Luming Tang, and Bharath Hariharan. Fine-grained few-shot classification with feature map reconstruction networks. CoRR, abs/2012.01506, 2020.

[30] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition, pages 8808–8817, 2020.

[31] Han-Jia Ye, Lu Ming, De-Chuan Zhan, and Wei-Lun Chao. Few-shot learning with a strong teacher. CoRR, abs/2107.00197, 2021.

[32] Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition, pages 3903–3911, 2020.

[33] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers. In Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition, pages 12203–12213, 2020.