# Meta-Learning with Self-Improving Momentum Target

Jihoon Tack¹, Jongjin Park¹, Hankook Lee¹, Jaeho Lee², Jinwoo Shin¹
¹Korea Advanced Institute of Science and Technology (KAIST)
²Pohang University of Science and Technology (POSTECH)
{jihoontack,jongjin.park,hankook.lee,jinwoos}@kaist.ac.kr, jaeho.lee@postech.ac.kr

Abstract

The idea of using a separately trained target model (or teacher) to improve the performance of a student model has become increasingly popular in various machine learning domains, and meta-learning is no exception; a recent discovery shows that utilizing task-wise target models can significantly boost the generalization performance. However, obtaining a target model for each task can be highly expensive, especially when the number of tasks for meta-learning is large. To tackle this issue, we propose a simple yet effective method, coined Self-improving Momentum Target (SiMT). SiMT generates the target model by adapting from the temporal ensemble of the meta-learner, i.e., the momentum network. This momentum network and its task-specific adaptations enjoy favorable generalization performance, enabling self-improvement of the meta-learner through knowledge distillation. Moreover, we found that perturbing the parameters of the meta-learner, e.g., via dropout, further stabilizes this self-improving process by preventing fast convergence of the distillation loss during meta-training. Our experimental results demonstrate that SiMT brings a significant performance gain when combined with a wide range of meta-learning methods under various applications, including few-shot regression, few-shot classification, and meta-reinforcement learning. Code is available at https://github.com/jihoontack/SiMT.

1 Introduction

Meta-learning [51] is the art of extracting and utilizing the knowledge from a distribution of tasks to better solve a relevant task. This problem is typically approached by training a meta-model that can transfer its knowledge to a task-specific solver, where the performance of the meta-model is evaluated on the basis of how well each solver performs on the corresponding task. To learn such a meta-model, one should be able to (a) train an appropriate solver for each task utilizing the knowledge transferred from the meta-model, and (b) accurately evaluate the performance of the solver. A standard way to do this is the so-called S/Q (support/query) protocol [55, 34]: for (a), use a set of support samples to train the solver; for (b), use another set of samples, called the query set, to evaluate the solver.¹

Recently, however, an alternative paradigm called the S/T (support/target) protocol has received much attention [58, 62, 32]. The approach assumes that the meta-learner has access to task-specific target models, i.e., an expert model for each given task, and uses these models to evaluate task-specific solvers by measuring the discrepancy of the solvers from the target models. Intriguingly, it has been observed that such a knowledge distillation procedure [43, 21] helps to improve the meta-generalization performance [62], in a similar way that the teacher-student framework helps to avoid overfitting in non-meta-learning contexts [30, 24].

¹We give an overview of the terminologies used in the paper to guide readers new to this field (see Appendix A).
Figure 1: An overview of the proposed Self-improving Momentum Target (SiMT): the momentum network efficiently generates the target model, and by distilling knowledge to the task-specific solver, it forms a self-improving process. S and Q denote the support and query datasets, respectively.

Despite this advantage, the S/T protocol is difficult to use in practice, as training target models for each task usually requires excessive computation, especially when the number of tasks is large. Prior works aim to alleviate this issue by generating target models in a compute-efficient manner. For instance, Lu et al. [32] consider the case where the learner has access to a model pre-trained on a global data domain that covers most tasks (to be meta-trained upon), and propose to generate task-wise target models by simply fine-tuning this model for each task. However, the method still requires fine-tuning on a large number of tasks and, more importantly, is hard to apply when no effective pre-trained model is available; e.g., a globally pre-trained model is usually not available in reinforcement learning, as collecting global data is a nontrivial task [9].

In this paper, we ask whether we can generate the task-specific target models by (somewhat ironically) using meta-learning. We draw inspiration from recent observations in the semi-/self-supervised learning literature [50, 16, 5] that the temporal ensemble of a model, i.e., the momentum network [27], can be an effective teacher of the original model. It turns out that a similar phenomenon occurs in the meta-learning scenario: one can construct a momentum network of the meta-model, whose task-specific adaptation is an effective target model from which task-specific knowledge can be distilled to train the original meta-model.

Contribution. We establish a novel framework, coined Meta-Learning with Self-improving Momentum Target (SiMT), which brings the benefit of the S/T protocol to the S/Q-like scenario where task-specific target models are not available (but query data is). The overview of SiMT is illustrated in Figure 1. In a nutshell, SiMT is comprised of two (iterative) steps:

Momentum target: We generate the target model by adapting from the momentum network, which shows better adaptation performance than the meta-model itself. Generating the target model thus becomes highly efficient; e.g., a single forward pass suffices to obtain the momentum target for ProtoNet [45].

Self-improving process: The meta-model improves through knowledge distillation from the momentum target, which in turn recursively improves the momentum network through temporal ensembling. Furthermore, we find that perturbing the parameters of the task-specific solver of the meta-model, e.g., via dropout [47], further stabilizes this self-improving process by preventing fast convergence of the distillation loss during meta-training.

We verify the effectiveness of SiMT under various applications of meta-learning, including few-shot regression, few-shot classification, and meta-reinforcement learning (meta-RL). Overall, our experimental results show that incorporating the proposed method can consistently and significantly improve the baseline meta-learning methods [10, 31, 36, 45].
In particular, our method improves the few-shot classification accuracy of Conv4 [55] trained with MAML [10] on mini-ImageNet [55] from 47.33% → 51.49% for 1-shot, and from 63.27% → 68.74% for 5-shot. Moreover, we show that our framework also yields notable improvements on few-shot regression and meta-RL tasks, which supports that the proposed method is indeed domain-agnostic.

2 Related work

Learning from target models. Learning from an expert model, i.e., the target model, has shown its effectiveness across various domains [30, 35, 65, 52]. As a follow-up, recent papers demonstrate that meta-learning can also benefit from it [58, 62]. However, training independent task-specific target models is highly expensive due to the large space of task distributions in meta-learning. To this end, recent work suggests pre-training a global encoder on the whole meta-training set and fine-tuning target models on each task [32]; however, this is limited to specific domains and still requires substantial computation, e.g., more than 6.5 GPU hours to pre-train only 10% of the target models, while ours requires 2 GPU hours for the entire meta-learning process (ProtoNet [45] with ResNet-12 [34]) on the same GPU. Another recent relevant work is bootstrapped meta-learning [11], which generates the target model from the meta-model by further updating the parameters of the task-specific solver for some number of steps with the query dataset. While the bootstrapped target models can be obtained efficiently, the approach is specialized to gradient-based meta-learning schemes, e.g., MAML [10]. In this paper, we suggest an efficient and more generic way to generate the target model during meta-training.

Learning with momentum networks. The idea of temporal ensembling, i.e., the momentum network, has become an essential component of recent semi-/self-supervised learning algorithms [3, 5]. For example, Mean Teacher [50] first showed that the momentum network improves the performance of semi-supervised image classification, and recent advanced approaches [2, 46] adopted this idea to achieve state-of-the-art performance. Also, in self-supervised learning methods that enforce invariance to data augmentation, momentum networks are widely utilized as a target network [19, 16] to prevent collapse by providing smoother changes in the representations. In meta-learning, a concurrent work [6] used stochastic weight averaging [23] (an approach similar to the momentum network) to learn a low-rank representation. In this paper, we empirically demonstrate that the momentum network shows better adaptation performance compared to the original meta-model, which motivates us to utilize it for generating the target model in a compute-efficient manner.

3 Problem setup and evaluation protocols

In this section, we formally describe the meta-learning setup under consideration, and the S/Q and S/T protocols studied in prior works.

Problem setup: Meta-learning. Let p(τ) be a distribution of tasks. The goal of meta-learning is to train a meta-model f_θ, parameterized by the meta-model parameter θ, which can transfer its knowledge to help train a solver for a new task. More formally, we consider some adaptation subroutine Adapt(·, ·) which uses both the information transferred from θ and the task-specific dataset (which we call the support set) S_τ to output a task-specific solver as φ_τ = Adapt(θ, S_τ).
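For concreteness, the following is a minimal PyTorch sketch of this interface, instantiated with the gradient-based adaptation discussed next (a few SGD steps on the support set); the function and variable names are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def adapt(meta_params, support_x, support_y, inner_lr=0.01, steps=5):
    """Adapt(theta, S): MAML-style inner loop for a toy linear regressor.

    Starts from the meta-parameters (w, b) and takes a few SGD steps on
    the support set, returning the task-specific solver phi.
    """
    phi = [p.clone().requires_grad_(True) for p in meta_params]
    for _ in range(steps):
        w, b = phi
        loss = F.mse_loss(support_x @ w + b, support_y)
        # create_graph=True keeps the inner-loop graph so the outer
        # (meta) objective can differentiate through the adaptation.
        grads = torch.autograd.grad(loss, phi, create_graph=True)
        phi = [p - inner_lr * g for p, g in zip(phi, grads)]
    return phi
```

The later sketches in this section reuse this `adapt` under the same toy setup.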
For example, the model-agnostic meta-learning algorithm (MAML; [10]) uses the adaptation subroutine of taking a fixed number of SGD steps on S_τ, starting from the initial parameter θ. In this paper, we aim to give a general meta-learning framework that can be used in conjunction with any adaptation subroutine, instead of designing a method specialized for a specific one.

The objective is to learn a good meta-model parameter θ from a set of tasks sampled from p(τ) (or sometimes the task distribution itself), such that the expected loss of the task-specific adaptations is small, i.e.,

min_θ E_{τ∼p(τ)} [ ℓ_τ(Adapt(θ, S_τ)) ],

where ℓ_τ(·) denotes the test loss on task τ. To train such a meta-model, we need a mechanism to evaluate and optimize θ (e.g., via gradient descent). For this purpose, existing approaches take one of two routes: the S/Q protocol or the S/T protocol.

S/Q protocol. The majority of existing meta-learning frameworks (e.g., [55, 34]) split the task-specific training data into two parts and use them for different purposes. One is the support set S_τ, which is used to perform the adaptation subroutine. The other is the query set Q_τ, which is used for evaluating the performance of the adapted parameter and computing the gradient with respect to θ. In other words, given the task datasets (S_τ1, Q_τ1), (S_τ2, Q_τ2), . . . , (S_τN, Q_τN),² the S/Q protocol solves

min_θ (1/N) Σ_{i=1}^{N} L(Adapt(θ, S_τi), Q_τi),    (1)

where L(φ, Q) denotes the empirical loss of a solver φ on the dataset Q.

²Here, while we assume a static batch of tasks for notational simplicity, the expression readily extends to the case of a stream of tasks drawn from p(τ).

S/T protocol. Another line of work considers the scenario where the meta-learner additionally has access to a target model for each training task [58, 32]. In this case, one can use a teacher-student framework to regularize the adapted solver to behave similarly to the target model (or, equivalently, to have low prediction discrepancy). Here, a typical practice is to not split each task dataset and to measure the discrepancy using the same support dataset that is used for the adaptation [32]. In other words, given the task datasets S_τ1, S_τ2, . . . , S_τN and the corresponding target models φ¹_target, φ²_target, . . . , φ^N_target, the S/T protocol updates the meta-model by solving

min_θ (1/N) Σ_{i=1}^{N} L_teach(Adapt(θ, S_τi), φ^i_target, S_τi),    (2)

where L_teach(φ, φ_target, S) denotes a discrepancy measure between the adapted model φ and the target model φ_target, measured using the dataset S.
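As a sketch of how the two protocols differ in code, the snippets below compute losses in the form of Eq. (1) and Eq. (2) for the toy linear-regression setup above, reusing the illustrative `adapt` from earlier; the per-task target models in `st_loss` are assumed to be given.

```python
def sq_loss(meta_params, tasks):
    """S/Q protocol (Eq. 1): average query-set loss of the adapted solvers.

    Each task is a (support_x, support_y, query_x, query_y) tuple.
    """
    total = 0.0
    for sx, sy, qx, qy in tasks:
        w, b = adapt(meta_params, sx, sy)   # task-specific solver
        total = total + F.mse_loss(qx @ w + b, qy)
    return total / len(tasks)

def st_loss(meta_params, tasks_with_targets):
    """S/T protocol (Eq. 2): discrepancy to a per-task target model,
    measured on the same support set used for the adaptation.

    Each task is a (support_x, support_y, target_fn) tuple, where
    target_fn is the (given) task-specific target model.
    """
    total = 0.0
    for sx, sy, target_fn in tasks_with_targets:
        w, b = adapt(meta_params, sx, sy)
        total = total + F.mse_loss(sx @ w + b, target_fn(sx))
    return total / len(tasks_with_targets)
```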
4 Meta-learning with self-improving momentum target

In this section, we develop a compute-efficient framework that brings the benefits of the S/T protocol to settings where we have access to neither task-specific target models nor a general pre-trained model, as in general S/Q-like setups. In a nutshell, our framework iteratively generates a meta-target model that generalizes well when adapted to the target tasks, by constructing a momentum network [50] of the meta-model itself. The meta-model is then trained using both the knowledge transferred from the momentum target and the knowledge freshly learned from the query sets. We first briefly describe our meta-model update protocol (Section 4.1), and then the core component, coined Self-improving Momentum Target (SiMT), which efficiently generates the target model for each task (Section 4.2).

4.1 Meta-model update with an S/Q-S/T hybrid loss

To update the meta-model, we use a hybrid of the S/Q protocol loss (1) and the S/T protocol loss (2). Formally, let (S_τ1, Q_τ1), (S_τ2, Q_τ2), . . . , (S_τN, Q_τN) be given task datasets with support-query splits, and let φ¹_target, φ²_target, . . . , φ^N_target be task-specific target models generated by our target generation procedure (explained in more detail in Section 4.2). We train the meta-model as

min_θ (1/N) Σ_{i=1}^{N} [ (1 − λ) L(Adapt(θ, S_τi), Q_τi) + λ L_teach(Adapt(θ, S_τi), φ^i_target, Q_τi) ],    (3)

where λ ∈ [0, 1) is the weight hyperparameter.

We note two things about Eq. (3). First, while we train using the target model, we also keep an S/Q loss term. This is because our method trains the meta-target model and the meta-model simultaneously from scratch, instead of requiring fully trained target models. Second, unlike in the S/T protocol, we evaluate the discrepancy L_teach using the query set Q_τi instead of the support set, to improve the generalization performance of the student model. In particular, the predictions of adapted models on query set samples are softer (i.e., less confident) than on support set samples, and such soft predictions are known to benefit the generalization performance of the student model in the knowledge distillation literature [64, 49].

4.2 SiMT: Self-improving momentum target

We now describe our proposed algorithm, SiMT (Algorithm 1), which generates the target model in a compute-efficient manner. In a nutshell, SiMT is comprised of two iterative steps: the momentum target and the self-improving process. To efficiently generate a target model, SiMT utilizes the temporal ensemble of the network, i.e., the momentum network, and then distills the knowledge of the generated target model into the task-specific solver of the meta-model to form a self-improving process.

Momentum target. For compute-efficient generation of target models, we utilize the momentum network θ_moment of the meta-model. Specifically, after every meta-model training iteration, we compute the exponential moving average of the meta-model parameter as

θ_moment ← η θ_moment + (1 − η) θ,    (4)

where η ∈ [0, 1) is the momentum coefficient. We find that θ_moment can adapt better than the meta-model itself and observe that its loss landscape has flatter minima (see Section 5.5), which can be a hint for understanding the generalization improvement [29, 12]. Based on this, we propose to generate the task-specific target model, i.e., the momentum target φ_moment, by adapting from the momentum network θ_moment. For a given support set, we generate the target model for each task as

φ^i_moment = Adapt(θ_moment, S_τi),  ∀i ∈ {1, 2, . . . , N}.    (5)

Algorithm 1 SiMT: Self-Improving Momentum Target
Require: Distribution over tasks p(τ), adaptation subroutine Adapt(·), momentum coefficient η, weight hyperparameter λ, dropout probability p, task batch size N, learning rate β.
1: Initialize θ using the standard initialization scheme.
2: Initialize the momentum network with the meta-model parameter, θ_moment ← θ.
3: while not done do
4:   Sample N tasks {τi}_{i=1}^{N} from p(τ)
5:   for i = 1 to N do
6:     Sample support set S_τi and query set Q_τi from τi
7:     φ^i_moment = Adapt(θ_moment, S_τi)  ▷ Generate a momentum target.
8:     φ^i = Adapt(θ, S_τi)  ▷ Adapt a task-specific solver.
9:     φ^i_drop = Dropout(φ^i, p)  ▷ Perturb the solver.
10:    L^i_total(θ) = (1 − λ) L(φ^i_drop, Q_τi) + λ L_teach(φ^i_drop, φ^i_moment, Q_τi)  ▷ Compute loss.
11:   end for
12:   θ ← θ − β ∇_θ Σ_i L^i_total(θ)  ▷ Train the meta-model.
13:   θ_moment ← η θ_moment + (1 − η) θ  ▷ Update the momentum network.
14: end while
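To make Algorithm 1 concrete, here is a minimal sketch of its per-task computation (lines 7-10) and the momentum update (Eq. 4 / line 13), continuing the toy regression setup and the illustrative `adapt` from Section 3; λ, η, and the dropout rate are illustrative values, and dropout is applied to the solver's outputs as a crude stand-in for perturbing the solver's parameters or activations.

```python
@torch.no_grad()
def ema_update(momentum_params, meta_params, eta=0.995):
    """Eq. (4): theta_mom <- eta * theta_mom + (1 - eta) * theta."""
    for pm, p in zip(momentum_params, meta_params):
        pm.mul_(eta).add_(p, alpha=1.0 - eta)

def simt_task_loss(meta_params, momentum_params, task, lam=0.5, p_drop=0.1):
    """Per-task hybrid loss, Algorithm 1 lines 7-10 (toy regression)."""
    sx, sy, qx, qy = task
    # Momentum target (line 7); detach() acts as the stop-gradient,
    # so no gradients flow back into the momentum network.
    wm, bm = adapt(momentum_params, sx, sy)
    target = (qx @ wm + bm).detach()
    # Task-specific solver (line 8) and its dropout perturbation (line 9).
    w, b = adapt(meta_params, sx, sy)
    pred = F.dropout(qx @ w + b, p=p_drop)
    # Hybrid loss (line 10): query loss plus distillation to the target.
    return (1 - lam) * F.mse_loss(pred, qy) + lam * F.mse_loss(pred, target)
```

The outer loop (lines 12-13) would then average `simt_task_loss` over the task batch, take a gradient step on θ, and call `ema_update`.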
We remark that generating momentum targets does not require an excessive amount of compute (see Section 5.5); e.g., ProtoNet [45] requires a single forward pass of the support set, and MAML [10] requires a few gradient steps without second-order gradient computation for the adaptation.

Self-improving process via knowledge distillation. After generating the momentum target, we utilize its knowledge to improve the generalization performance of the meta-model. To this end, we choose the knowledge distillation scheme [21], which is simple yet effective across various domains, including meta-learning [32]. Here, our key concept is that the momentum target self-improves during training due to the knowledge transfer: the knowledge distillation from the momentum target improves the meta-model itself, which recursively improves the momentum network through the temporal ensemble. Formally, for a given query set Q, we distill the knowledge of the momentum target φ_moment into the task-specific solver of the meta-model φ as

L_teach(φ, φ_moment, Q) := (1/|Q|) Σ_{(x,y)∈Q} ℓ_KD( f_{φ_moment}(x), f_φ(x) ),    (6)

where ℓ_KD is the distillation loss and |·| is the cardinality of the set. For regression tasks, we use the MSE loss, i.e., ℓ_KD(z1, z2) := ‖z1 − z2‖², and for classification tasks, we use the KL divergence with temperature scaling [17], i.e., ℓ_KD(z1, z2) := T² · KL( σ(z1/T) ‖ σ(z2/T) ), where T is the temperature hyperparameter, σ is the softmax function, and z1, z2 are logits of the classifier. We present the detailed distillation objective for reinforcement learning tasks in Appendix C. Also, note that optimizing the distillation loss only propagates gradients to the meta-model θ, not to the momentum network θ_moment, i.e., we apply the stop-gradient operator [5, 7].

Furthermore, we find that the distillation loss (6) sometimes converges too quickly during meta-training, which can stall the self-improving process. To prevent this, we suggest perturbing the parameter space of φ. Intuitively, injecting noise into the parameter space of φ forces an asymmetry with respect to the momentum target's predictions, preventing f_φ and f_{φ_moment} from reaching similar predictions too early. To this end, we choose the standard dropout regularization [47], due to its simplicity and generality across architectures, which has also shown effectiveness in distillation research [60]: φ_drop := Dropout(φ, p), where p is the probability of dropping activations. In the end, we use the perturbed task-specific solver φ_drop and the momentum target φ_moment in our meta-model update objective (3).
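For classification, the distillation term in Eq. (6) can be written, under the same toy-sketch assumptions as before, as the following PyTorch function; T = 4.0 is an illustrative temperature, not a value from the paper.

```python
import torch.nn.functional as F

def kd_loss(teacher_logits, student_logits, T=4.0):
    """l_KD(z1, z2) = T^2 * KL( softmax(z1 / T) || softmax(z2 / T) ),
    with z1 the (momentum-target) teacher logits and z2 the student logits."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) when input is
    # given in log-space, matching the direction above.
    return (T ** 2) * F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```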
Table 1: Few-shot regression results on the ShapeNet and Pascal datasets. We report the angular error for ShapeNet, and MSE for Pascal. SiMT uses the momentum network at meta-test time. Reported results are averaged over three trials (± denotes the standard deviation), and bold denotes the best result of each group.

| Method | ShapeNet 10-shot | ShapeNet 15-shot | Pascal 10-shot | Pascal 15-shot |
|---|---|---|---|---|
| MAML [10] | 29.555 ± 0.600 | 22.286 ± 3.369 | 2.612 ± 0.280 | 2.513 ± 0.250 |
| MAML [10] + SiMT | **18.913 ± 2.655** | **16.100 ± 1.318** | **1.462 ± 0.230** | **1.229 ± 0.074** |
| ANIL [36] | 39.915 ± 0.665 | 38.202 ± 1.388 | 6.600 ± 0.360 | 6.517 ± 0.420 |
| ANIL [36] + SiMT | **37.424 ± 0.951** | **29.478 ± 0.212** | **5.339 ± 0.321** | **5.007 ± 0.145** |
| MetaSGD [31] | 17.353 ± 1.110 | 15.768 ± 1.266 | 3.532 ± 0.381 | 2.833 ± 0.216 |
| MetaSGD [31] + SiMT | **16.121 ± 1.322** | **14.377 ± 0.358** | **2.300 ± 0.871** | **1.879 ± 0.134** |

Table 2: Few-shot in-domain adaptation accuracy (%) on 5-way mini- and tiered-ImageNet. SiMT uses the momentum network at meta-test time. Reported results are averaged over three trials (± denotes the standard deviation), and bold denotes the best result of each group.

| Model | Method | mini-ImageNet 1-shot | mini-ImageNet 5-shot | tiered-ImageNet 1-shot | tiered-ImageNet 5-shot |
|---|---|---|---|---|---|
| Conv4 [55] | MAML [10] | 47.33 ± 0.45 | 63.27 ± 0.14 | 50.19 ± 0.21 | 66.05 ± 0.19 |
| | MAML [10] + SiMT | **51.49 ± 0.18** | **68.74 ± 0.12** | **52.51 ± 0.21** | **69.58 ± 0.11** |
| | ANIL [36] | 47.71 ± 0.47 | 63.13 ± 0.43 | 49.57 ± 0.04 | 66.34 ± 0.28 |
| | ANIL [36] + SiMT | **50.81 ± 0.56** | **67.99 ± 0.19** | **51.66 ± 0.26** | **68.88 ± 0.08** |
| | MetaSGD [31] | 50.66 ± 0.18 | 65.55 ± 0.54 | 52.48 ± 1.22 | 71.06 ± 0.20 |
| | MetaSGD [31] + SiMT | **51.70 ± 0.80** | **69.13 ± 1.40** | **52.98 ± 0.07** | **71.46 ± 0.12** |
| | ProtoNet [45] | 47.97 ± 0.29 | 65.16 ± 0.67 | 51.90 ± 0.55 | 71.51 ± 0.25 |
| | ProtoNet [45] + SiMT | **51.25 ± 0.55** | **68.71 ± 0.35** | **53.25 ± 0.27** | **72.69 ± 0.27** |
| ResNet-12 [34] | MAML [10] | 52.66 ± 0.60 | 68.69 ± 0.33 | 57.32 ± 0.59 | 73.78 ± 0.27 |
| | MAML [10] + SiMT | **56.28 ± 0.63** | **72.01 ± 0.26** | **59.72 ± 0.22** | **74.40 ± 0.90** |
| | ANIL [36] | 51.80 ± 0.59 | 68.38 ± 0.20 | 57.52 ± 0.68 | 73.50 ± 0.35 |
| | ANIL [36] + SiMT | **54.44 ± 0.27** | **69.98 ± 0.66** | **58.18 ± 0.31** | **75.59 ± 0.50** |
| | MetaSGD [31] | 54.95 ± 0.11 | 70.65 ± 0.43 | 58.97 ± 0.89 | 76.37 ± 0.11 |
| | MetaSGD [31] + SiMT | **55.72 ± 0.96** | **74.01 ± 0.79** | **61.03 ± 0.05** | **78.04 ± 0.48** |
| | ProtoNet [45] | 52.84 ± 0.21 | 68.35 ± 0.29 | 61.16 ± 0.17 | 79.94 ± 0.20 |
| | ProtoNet [45] + SiMT | **55.84 ± 0.57** | **72.45 ± 0.32** | **62.01 ± 0.42** | **81.82 ± 0.12** |

5 Experiments

In this section, we experimentally validate the effectiveness of the proposed SiMT by measuring its performance on various meta-learning applications, including few-shot regression (Section 5.1), few-shot classification (Section 5.2), and meta-reinforcement learning (meta-RL; Section 5.3).

Common setup. Following prior works, we choose the checkpoints and the hyperparameters on the meta-validation set for the few-shot learning tasks [33, 56]. For RL, we choose them based on the best average return during training [10]. We find that the hyperparameters, e.g., the momentum coefficient η or the weight λ, are not sensitive across datasets and architectures, but can vary with the type of meta-learning scheme or task. We provide further details in Appendix D. Moreover, we report the adaptation performance of the momentum network for SiMT.

5.1 Few-shot regression

For regression tasks, we conduct experiments on the ShapeNet [13] and Pascal [63] datasets, where the goal is to predict the object pose of a gray-scale image relative to the canonical orientation. To this end, following the prior works [63, 13], we use the following empirical loss L to train the meta-model: the angular loss for ShapeNet,

Σ_{(x,y)∈Q} ( ‖cos(f_φ(x)) − cos(y)‖² + ‖sin(f_φ(x)) − sin(y)‖² ),

and the MSE loss for Pascal,

Σ_{(x,y)∈Q} ‖f_φ(x) − y‖².
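As a quick transcription of these two objectives (a sketch under the same toy assumptions as before; tensor shapes are illustrative):

```python
import torch

def angular_loss(pred_angle, true_angle):
    """ShapeNet pose loss: sum over the query set of
    |cos f(x) - cos y|^2 + |sin f(x) - sin y|^2."""
    return ((torch.cos(pred_angle) - torch.cos(true_angle)) ** 2
            + (torch.sin(pred_angle) - torch.sin(true_angle)) ** 2).sum()

def mse_pose_loss(pred, target):
    """Pascal pose loss: sum of squared errors over the query set."""
    return ((pred - target) ** 2).sum()
```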
For the backbone meta-learning schemes, we use gradient-based approaches, including MAML [10], ANIL [36], and MetaSGD [31]. For all methods, we train a 7-layer convolutional neural network [63] and, for SiMT, apply dropout regularization [47] before the max-pooling layer. Table 1 summarizes the results, showing that SiMT significantly improves all tested meta-learning schemes.

5.2 Few-shot classification

For few-shot classification tasks, we use the cross-entropy loss as the empirical loss term L to train the meta-model, i.e., Σ_{(x,y)∈Q} ℓ_ce(f_φ(x), y), where ℓ_ce is the cross-entropy loss. We train the meta-model on the mini-ImageNet [55] and tiered-ImageNet [38] datasets, following the prior works [32, 56]. Here, we consider the following gradient-based and metric-based meta-learning approaches as our backbone algorithms to show the wide applicability of our method: MAML, ANIL, MetaSGD, and ProtoNet [45]. We train each method on Conv4 [55] and ResNet-12 [34], and apply dropout before the max-pooling layer for SiMT. For the training details, we mainly follow the setups of each backbone algorithm's paper. See Appendix D.1 for more details.

In-domain adaptation. In this setup, we evaluate the adaptation performance on different classes of the same dataset used in meta-training. As shown in Table 2, incorporating SiMT into existing meta-learning methods consistently and significantly improves the in-domain adaptation performance. In particular, SiMT achieves higher accuracy gains on the mini-ImageNet dataset, e.g., the 5-shot performance improves from 63.27% → 68.74% on Conv4. We find that this is due to the overfitting of backbone algorithms on the mini-ImageNet dataset, to which SiMT is more robust. For instance, when training mini-ImageNet 5-shot classification on Conv4, MAML starts to overfit after the first 40% of the training process, while SiMT does not overfit during training.

Cross-domain adaptation. We also consider cross-domain adaptation scenarios. Here, we adapt the meta-model to datasets different from the meta-training one: we use the CUB [57] and Cars [26] datasets. Such tasks are known to be challenging, as there exists a large distribution shift between the training and testing domains [18]. Table 3 shows the results. Somewhat interestingly, SiMT also improves the cross-domain adaptation performance of the base meta-learning methods across the considered datasets. These results indicate that SiMT successfully learns the ability to generalize to unseen tasks, even for distributions that differ greatly from training.

Table 3: Few-shot cross-domain adaptation accuracy (%) on ResNet-12 trained under 5-way mini- and tiered-ImageNet. We consider CUB and Cars as cross-domain datasets. SiMT uses the momentum network at meta-test time. Reported results are averaged over three trials (± denotes the standard deviation), and bold denotes the best result of each group.

| Shot | Method | mini-ImageNet → CUB | mini-ImageNet → Cars | tiered-ImageNet → CUB | tiered-ImageNet → Cars |
|---|---|---|---|---|---|
| 1-shot | MAML [10] | 39.50 ± 0.91 | 32.87 ± 0.20 | 42.32 ± 0.69 | 36.62 ± 0.12 |
| | MAML [10] + SiMT | **42.32 ± 0.62** | **33.73 ± 0.63** | **44.33 ± 0.43** | **37.21 ± 0.35** |
| | ANIL [36] | 37.30 ± 0.89 | 31.28 ± 1.03 | 42.29 ± 0.33 | 36.27 ± 0.58 |
| | ANIL [36] + SiMT | **38.86 ± 0.98** | **32.34 ± 0.95** | **44.53 ± 1.21** | **36.92 ± 0.56** |
| | MetaSGD [31] | 41.98 ± 0.18 | **34.52 ± 0.56** | 46.48 ± 2.10 | 38.09 ± 1.21 |
| | MetaSGD [31] + SiMT | **43.50 ± 0.89** | 33.92 ± 0.30 | **46.62 ± 0.41** | **38.69 ± 0.26** |
| | ProtoNet [45] | 41.22 ± 0.81 | 32.79 ± 0.61 | 47.75 ± 0.56 | 37.59 ± 0.80 |
| | ProtoNet [45] + SiMT | **44.13 ± 0.30** | **34.53 ± 0.40** | **48.89 ± 0.65** | **38.07 ± 0.42** |
| 5-shot | MAML [10] | 56.17 ± 0.92 | 44.56 ± 0.79 | 65.00 ± 0.89 | 51.08 ± 0.28 |
| | MAML [10] + SiMT | **59.22 ± 0.39** | **46.59 ± 0.21** | **67.58 ± 0.61** | **51.88 ± 0.52** |
| | ANIL [36] | 53.42 ± 0.97 | 41.65 ± 0.67 | 62.48 ± 0.85 | 50.50 ± 1.18 |
| | ANIL [36] + SiMT | **56.03 ± 1.40** | **45.88 ± 0.82** | **66.30 ± 0.99** | **54.60 ± 0.91** |
| | MetaSGD [31] | 58.90 ± 1.30 | 47.44 ± 1.55 | 70.38 ± 0.27 | 56.28 ± 0.07 |
| | MetaSGD [31] + SiMT | **65.07 ± 1.89** | **49.86 ± 0.84** | **73.93 ± 0.42** | **57.97 ± 1.34** |
| | ProtoNet [45] | 57.87 ± 0.77 | 48.06 ± 1.10 | 74.35 ± 0.93 | 57.23 ± 0.25 |
| | ProtoNet [45] + SiMT | **63.85 ± 0.76** | **51.67 ± 0.29** | **75.97 ± 0.09** | **59.01 ± 0.50** |
Table 4: Comparison with bootstrapped targets (Bootstrap) [11] on few-shot in-domain adaptation tasks. We report the adaptation accuracy (%) of Conv4 trained under 5-way mini- and tiered-ImageNet. SiMT uses the momentum network at meta-test time. Reported results are averaged over three trials (± denotes the standard deviation), and bold indicates the best result of each group.

| Method | mini-ImageNet 1-shot | mini-ImageNet 5-shot | tiered-ImageNet 1-shot | tiered-ImageNet 5-shot |
|---|---|---|---|---|
| MAML [10] | 47.33 ± 0.45 | 63.27 ± 0.14 | 50.19 ± 0.21 | 66.05 ± 0.19 |
| MAML [10] + Bootstrap [11] | 48.68 ± 0.33 | 68.45 ± 0.40 | 49.34 ± 0.26 | 68.84 ± 0.37 |
| MAML [10] + SiMT | **51.49 ± 0.18** | **68.74 ± 0.12** | **52.51 ± 0.21** | **69.58 ± 0.11** |
| ANIL [36] | 47.71 ± 0.47 | 63.13 ± 0.43 | 49.57 ± 0.04 | 66.34 ± 0.28 |
| ANIL [36] + Bootstrap [11] | 47.74 ± 0.44 | 65.16 ± 0.04 | 48.85 ± 0.34 | 66.09 ± 0.07 |
| ANIL [36] + SiMT | **50.81 ± 0.56** | **67.99 ± 0.19** | **51.66 ± 0.26** | **68.88 ± 0.08** |

5.3 Reinforcement learning

The goal of meta-RL is to train an agent that quickly adapts its policy to maximize the expected return on unseen tasks using only a limited number of sample trajectories. Since the expected return is usually not differentiable, we use policy gradient methods to update the policy. Specifically, we use vanilla policy gradient [59] for the task-specific solver and trust-region policy optimization (TRPO; [39]) for the meta-model, following MAML [10]. The overall training objective of meta-RL, including the empirical loss L and the knowledge distillation loss L_teach, is given in Appendix C.

We evaluate SiMT on continuous control tasks based on OpenAI Gym [4] environments. In these experiments, we choose MAML as our backbone algorithm, and train a multi-layer perceptron policy network with two hidden layers of size 100, following the prior setup [10]. We find that the distillation loss is already quite effective even without dropout regularization, and applying dropout brings no further improvement. We conjecture that dropout on such a small network may not be effective, as it is designed to reduce the overfitting of large networks [47]. We provide more experimental details in Appendix D.1.

Figure 2: Meta-RL results for (a) the 2D Navigation and (b) the Half-cheetah goal direction tasks, showing average returns over 0-3 gradient steps. The solid line and shaded regions represent the truncated mean and standard deviation, respectively, across five runs.

2D Navigation. We first evaluate SiMT on a 2D navigation task, where a point agent moves to different goal positions chosen randomly within a 2D unit square. Figure 2 shows the adaptation performance of the learned models with up to three gradient steps. These results demonstrate that SiMT consistently improves the adaptation performance of MAML. Also, SiMT achieves faster performance improvements than vanilla MAML with additional gradient steps.

Locomotion. To further demonstrate the effectiveness of our method, we also study high-dimensional, complex locomotion tasks based on the MuJoCo [53] simulator. We choose a set of goal direction tasks with a planar cheetah ("Half-cheetah"), following previous works [10, 36]. In the goal direction tasks, the reward is the magnitude of the velocity in either the forward or backward direction, chosen randomly for each task.
Figure 2b shows that SiMT significantly improves the adaptation performance of MAML even with a single gradient step.

5.4 Comparison with other target models

In this section, we compare SiMT with other meta-learning schemes that utilize target models, namely bootstrapped [11] and task-wise pre-trained [32] target models.

Bootstrapped target model. We compare SiMT with the closely related bootstrapped meta-learning (Bootstrap) [11]. The key difference lies in how the target model φ_target is constructed: SiMT utilizes the momentum network, while Bootstrap generates the target by further updating the parameters of the task-specific solver. Namely, Bootstrap relies on gradient-based adaptation while SiMT does not. This allows SiMT to incorporate various non-gradient-based meta-learning approaches, e.g., ProtoNet [45], as shown in Table 2. Furthermore, SiMT shows not only wider applicability but also better performance than Bootstrap: as shown in Table 4, SiMT consistently outperforms Bootstrap in the few-shot learning experiments, which implies that the momentum target is more effective than the bootstrapped target.

Table 5: Comparison with pre-trained target models on few-shot in-domain adaptation tasks. We report the adaptation accuracy (%) of ResNet-12 trained on 5-way mini- and tiered-ImageNet, together with the 1-shot training cost (GPU hours). SiMT uses the learned momentum network at meta-test time. Reported results are averaged over three trials (± denotes the standard deviation), and bold indicates the best result of each group. Results for Lu et al. [32] are taken from the reference; the percentage after the dash denotes the proportion of tasks with pre-trained target models for meta-training [32].

| Method | 1-shot train cost (GPU hours) | mini-ImageNet 1-shot | mini-ImageNet 5-shot | tiered-ImageNet 1-shot | tiered-ImageNet 5-shot |
|---|---|---|---|---|---|
| MAML [10] | 1.31 | 58.84 ± 0.25 | 74.62 ± 0.38 | 63.02 ± 0.30 | 67.26 ± 0.32 |
| MAML [10] + Lu et al. [32] - 5% | 5.04 | 59.14 ± 0.33 | 75.77 ± 0.29 | 64.52 ± 0.30 | 68.39 ± 0.34 |
| MAML [10] + Lu et al. [32] - 10% | 8.32 | 60.06 ± 0.35 | 76.34 ± 0.42 | **65.23 ± 0.45** | 70.02 ± 0.33 |
| MAML [10] + SiMT | 1.64 | **62.05 ± 0.39** | **78.77 ± 0.45** | 63.91 ± 0.32 | **77.43 ± 0.47** |

Table 6: Ablation study on each component of SiMT. We report the few-shot in-domain adaptation accuracy (%) of Conv4 trained on mini-ImageNet. Here, we use the learned momentum network at meta-test time, except for the first row of the table. Reported results are averaged over three trials (± denotes the standard deviation), and bold denotes the best result.

| Momentum | Distillation | Dropout | 1-shot | 5-shot |
|---|---|---|---|---|
| - | - | - | 47.33 ± 0.45 | 63.27 ± 0.14 |
| ✓ | - | - | 48.98 ± 0.32 | 66.12 ± 0.21 |
| ✓ | ✓ | - | 49.23 ± 0.24 | 66.52 ± 0.15 |
| ✓ | - | ✓ | 49.25 ± 0.41 | 65.25 ± 0.15 |
| ✓ | ✓ | ✓ | **51.49 ± 0.18** | **68.74 ± 0.12** |

Pre-trained target model. We compare SiMT with Lu et al. [32], which utilizes task-wise pre-trained target models for meta-learning. To this end, we train SiMT on a ResNet-12 backbone pre-trained on the meta-training set, following [32]. As shown in Table 5, SiMT consistently improves over MAML and, more intriguingly, even performs better than Lu et al. [32]. We conjecture this is because SiMT can fully utilize the target model for all tasks thanks to its efficiency, whereas Lu et al. [32] must subsample the tasks given target models due to the computational burden: when generating target models, SiMT only requires an additional 0.3 GPU hours for all tasks, while Lu et al. [32] requires more than 3.7 GPU hours for 5% of the tasks.
5.5 Ablation study

Throughout this section, unless otherwise specified, we perform the experiments on 5-shot in-domain adaptation on mini-ImageNet with Conv4, with MAML as the backbone meta-learning scheme.

Component analysis. We analyze each component of our method on both 1-shot and 5-shot classification on mini-ImageNet: namely, the use of (a) the momentum network θ_moment, (b) the distillation loss L_teach (6), and (c) the dropout regularization Dropout(·), by comparing the resulting accuracies. The results in Table 6 show that each component is indeed important for the improvement. We find that a naïve combination of the distillation loss and the momentum network does not show significant improvements; but by additionally applying dropout, the distillation loss becomes more effective and further improves the performance. Note that this improvement does not come from the dropout alone, as using only dropout slightly degrades the performance in some cases.

Computational efficiency. Our method may seem compute-inefficient when incorporated into meta-learning methods (due to the momentum target generation); however, we show that it is not. Although SiMT increases the total training time of MAML by roughly 1.2 times, we observe that it reaches the best performance of MAML 3 times faster: in Figure 3a, we compare the accuracy under the same training wall-clock time as MAML.

Comparison of the momentum network and the meta-model. To understand how the momentum network improves the performance of the meta-model, we compare the adaptation performance of the momentum network and the meta-model during SiMT training. As shown in Figure 3b, we observe that the performance of the momentum network is consistently better than that of the meta-model, which implies that the proposed momentum target is a good target model for our self-improving mechanism.

Figure 3: Validation accuracy curves of 5-shot mini-ImageNet on Conv4: we compare the adaptation performance of (a) MAML and SiMT under the same training wall-clock time ("Computation efficiency"), and (b) the meta-model θ and the momentum network θ_moment of SiMT under the same number of training iterations ("Network choice for the adaptation"). The solid line and shaded regions represent the mean and standard deviation, respectively, across three runs.

Figure 4: Loss landscape visualization of Conv4 trained on 5-shot mini-ImageNet with MAML: (a) the meta-model and (b) the momentum network.

Loss landscape analysis. We visualize the loss landscapes of the momentum network θ_moment and the meta-model θ to give insights into the generalization improvement. To do this, we train MAML with a momentum network (without distillation and dropout) and visualize the loss by perturbing each parameter space [29] (see Appendix D.2 for details of the visualization method). As shown in Figure 4, the momentum network forms a flatter loss landscape than the meta-model; recent studies demonstrate that such flat landscapes are linked to better generalization across various domains [12].
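As a rough illustration of this kind of analysis (a simplified 1-D sketch in the spirit of Li et al. [29], not the authors' visualization code; `loss_fn` and the random `direction` are assumptions):

```python
import torch

def loss_along_direction(params, direction, loss_fn, alphas):
    """Probe L(theta + alpha * d) for a list of step sizes alphas,
    restoring the original parameters afterwards."""
    base = [p.detach().clone() for p in params]
    losses = []
    for a in alphas:
        for p, p0, d in zip(params, base, direction):
            p.data.copy_(p0 + a * d)
        losses.append(float(loss_fn()))
    for p, p0 in zip(params, base):  # restore the original weights
        p.data.copy_(p0)
    return losses
```

A flatter curve of `losses` around α = 0 corresponds to the flatter minima observed for the momentum network.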
6 Discussion and conclusion

In this paper, we propose a simple yet effective method, SiMT, for improving meta-learning. Our key idea is to efficiently generate target models using a momentum network and to utilize its knowledge to self-improve the meta-learner. Our experiments demonstrate that SiMT significantly improves the performance of meta-learning methods across various applications.

Limitations and future work. While SiMT is a compute-efficient way to use target models in meta-learning, it is still built on top of existing meta-model update techniques. Since existing meta-learning methods have limited scalability to large-scale scenarios [44], SiMT is no exception. Hence, improving the scalability of meta-learning schemes is an intriguing future research direction, and we believe incorporating SiMT into such scenarios is worthwhile.

Potential negative impacts. Meta-learning often requires large amounts of computation due to the numerous task adaptations during meta-training, raising environmental concerns, e.g., carbon emissions [41]. As SiMT is built upon meta-learning methods, practitioners may need to budget considerable computation for successful training. To address this issue, sparse adaptation schemes [42] or lightweight meta-learning methods [28] would be required for such applications.

Acknowledgements

We thank Younggyo Seo, Jaehyun Nam, and Minseon Kim for providing helpful feedback and suggestions on an earlier version of the manuscript. This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH), and No. 2022-0-00713, Meta-learning applicable to real-world problems).

References

[1] S. M. R. Arnold, P. Mahajan, D. Datta, I. Bunner, and K. S. Zarkias. learn2learn: A library for meta-learning research. arXiv preprint arXiv:2008.12284, 2020.
[2] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, 2019.
[3] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel. ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. In International Conference on Learning Representations, 2020.
[4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[5] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In IEEE International Conference on Computer Vision, 2021.
[6] K. Chen and C.-G. Lee. Meta-free few-shot learning via representation learning with weight averaging. In International Joint Conference on Neural Networks, 2022.
[7] X. Chen and K. He. Exploring simple Siamese representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[8] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 2016.
[9] G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learning. In International Conference on Machine Learning, 2019.
[10] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
[11] S. Flennerhag, Y. Schroecker, T. Zahavy, H. van Hasselt, D. Silver, and S. Singh. Bootstrapped meta-learning.
In International Conference on Learning Representations, 2022.
[12] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
[13] N. Gao, H. Ziesche, N. A. Vien, M. Volpp, and G. Neumann. What matters for meta-learning vision regression tasks? In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[14] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
[15] G. Ghiasi, T.-Y. Lin, and Q. V. Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, 2018.
[16] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020.
[17] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.
[18] Y. Guo, N. C. Codella, L. Karlinsky, J. V. Codella, J. R. Smith, K. Saenko, T. Rosing, and R. Feris. A broader study of cross-domain few-shot learning. In European Conference on Computer Vision, 2020.
[19] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[20] N. Hilliard, L. Phillips, S. Howland, A. Yankov, C. D. Corley, and N. O. Hodas. Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376, 2018.
[21] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[22] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[23] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence, 2018.
[24] Y. Jang, H. Lee, S. J. Hwang, and J. Shin. Learning what and where to transfer. In International Conference on Machine Learning, 2019.
[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[26] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
[27] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017.
[28] J. Lee, J. Tack, N. Lee, and J. Shin. Meta-learning sparse implicit neural representations. In Advances in Neural Information Processing Systems, 2021.
[29] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, 2018.
[30] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[31] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
[32] S. Lu, H.-J. Ye, L. Gan, and D.-C. Zhan. Towards enabling meta-learning from target models. In Advances in Neural Information Processing Systems, 2021.
[33] J. Oh, H. Yoo, C. Kim, and S.-Y. Yun. BOIL: Towards representation change for few-shot learning. In International Conference on Learning Representations, 2021.
[34] B. Oreshkin, P. Rodríguez López, and A. Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, 2018.
[35] W. Park, D. Kim, Y. Lu, and M. Cho. Relational knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[36] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2020.
[37] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
[38] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.
[39] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
[40] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016.
[41] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. Green AI. arXiv preprint arXiv:1907.10597, 2019.
[42] J. R. Schwarz and Y. W. Teh. Meta-learning sparse compression networks. Transactions on Machine Learning Research, 2022.
[43] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014.
[44] J. Shin, H. B. Lee, B. Gong, and S. J. Hwang. Large-scale meta-learning with continual trajectory shifting. In International Conference on Machine Learning, 2021.
[45] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
[46] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In Advances in Neural Information Processing Systems, 2020.
[47] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
[48] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[49] J. Tang, R. Shivanna, Z. Zhao, D. Lin, A. Singh, E. H. Chi, and S. Jain. Understanding and improving knowledge distillation. arXiv preprint arXiv:2002.03532, 2020.
[50] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 2017.
[51] S. Thrun and L. Pratt. Learning to Learn. Springer, 1998.
[52] Y. Tian, D. Krishnan, and P. Isola. Contrastive representation distillation. In International Conference on Learning Representations, 2020.
[53] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[54] H.-Y. Tseng, H.-Y. Lee, J.-B. Huang, and M.-H. Yang. Cross-domain few-shot classification via learned feature-wise transformation. In International Conference on Learning Representations, 2020.
[55] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
[56] J. Von Oswald, D. Zhao, S. Kobayashi, S. Schug, M. Caccia, N. Zucchet, and J. Sacramento. Learning where to learn: Gradient sparsity in meta and continual learning. In Advances in Neural Information Processing Systems, 2021.
[57] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[58] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision, 2016.
[59] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[60] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves ImageNet classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[61] H. Yao, L.-K. Huang, L. Zhang, Y. Wei, L. Tian, J. Zou, J. Huang, et al. Improving generalization in meta-learning via task augmentation. In International Conference on Machine Learning, 2021.
[62] H. Ye, L. Ming, D. Zhan, and W. Chao. Few-shot learning with a strong teacher. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[63] M. Yin, G. Tucker, M. Zhou, S. Levine, and C. Finn. Meta-learning without memorization. In International Conference on Learning Representations, 2020.
[64] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng. Revisiting knowledge distillation via label smoothing regularization. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[65] S. Yun, J. Park, K. Lee, and J. Shin. Regularizing class-wise predictions via self-knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.