# Set-based Meta-Interpolation for Few-Task Meta-Learning

Seanie Lee¹, Bruno Andreis¹, Kenji Kawaguchi², Juho Lee¹,³, Sung Ju Hwang¹

KAIST¹, National University of Singapore², AITRICS³

{lsnfamily02, andries}@kaist.ac.kr, kenji@comp.nus.edu.sg, {juholee, sjhwang82}@kaist.ac.kr

*Equal Contribution. Order of the authors was determined by a coin toss.*

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Meta-learning approaches enable machine learning systems to adapt to new tasks given few examples by leveraging knowledge from related tasks. However, a large number of meta-training tasks are still required for generalization to unseen tasks during meta-testing, which introduces a critical bottleneck for real-world problems that come with only few tasks, due to various reasons including the difficulty and cost of constructing tasks. Recently, several task augmentation methods have been proposed to tackle this issue using domain-specific knowledge to design augmentation techniques that densify the meta-training task distribution. However, such reliance on domain-specific knowledge renders these methods inapplicable to other domains. While Manifold Mixup based task augmentation methods are domain-agnostic, we empirically find them ineffective on non-image domains. To tackle these limitations, we propose a novel domain-agnostic task augmentation method, Meta-Interpolation, which utilizes expressive neural set functions to densify the meta-training task distribution using bilevel optimization. We empirically validate the efficacy of Meta-Interpolation on eight datasets spanning various domains such as image classification, molecule property prediction, text classification, and sound classification. Experimentally, we show that Meta-Interpolation consistently outperforms all the relevant baselines. Theoretically, we prove that task interpolation with the set function regularizes the meta-learner to improve generalization.

## 1 Introduction

The ability to learn a new task given only a few examples is crucial for artificial intelligence. Recently, meta-learning [39, 3] has emerged as a viable method to achieve this objective; it enables machine learning systems to quickly adapt to a new task by leveraging knowledge from other related tasks seen during meta-training. Although existing meta-learning methods can efficiently adapt to new tasks with few data samples, a large dataset of meta-training tasks is still required to learn meta-knowledge that can be transferred to unseen tasks. For many real-world applications, such extensive collections of meta-training tasks may be unavailable. Such scenarios give rise to the few-task meta-learning problem, where a meta-learner can easily memorize the meta-training tasks but fail to generalize well to unseen tasks.

The few-task meta-learning problem usually results from the difficulty of task generation and data collection. For instance, in the medical domain, it is infeasible to collect a large amount of data to construct extensive meta-training tasks due to privacy concerns. Moreover, for natural language processing, it is not straightforward to split a dataset into tasks, and hence entire datasets are treated as tasks [30].

Figure 1: Concept. Three-way one-shot classification problem.
(a) A new class is assigned to a pair of classes sampled without replacement from the pool of meta-training tasks. (b) The support sets are interpolated with a set function and paired with a query set. (c) Bilevel optimization of the set function and the meta-learner.

Several works have been proposed to tackle the few-task meta-learning problem using task augmentation techniques such as clustering a dataset into multiple tasks [30], leveraging strong image augmentations such as vertical flipping to construct new classes [32], and employing Manifold Mixup [44] to densify the meta-training task distribution [49, 50]. However, the majority of these techniques require domain-specific knowledge to design the task augmentations and hence cannot be applied to other domains. While Manifold Mixup based methods [49, 50] are domain-agnostic, we empirically find them ineffective for mitigating meta-overfitting in few-task meta-learning, especially in non-image domains such as chemistry and text, and they sometimes degrade generalization performance.

In this work, we focus solely on domain-agnostic task augmentation methods that can densify the meta-training task distribution to prevent meta-overfitting and improve generalization at meta-testing for few-task meta-learning. To tackle the limitations discussed above, we propose a novel domain-agnostic task augmentation method for metric based meta-learning models. Our method, Meta-Interpolation, utilizes expressive neural set functions to interpolate two tasks, and the set functions are trained with bilevel optimization so that a meta-learner trained on the interpolated tasks generalizes to tasks in the meta-validation set. As a consequence of end-to-end training, the learned augmentation strategy is tailored to each specific domain without the need for specialized domain knowledge.

For example, for K-way classification, we sample two tasks consisting of support and query sets and assign a new class k to each pair of classes {σ(k), σ′(k)} for k = 1, . . . , K, where σ, σ′ are permutations on {1, . . . , K}, as depicted in Figure 1a (a minimal code sketch of this class-assignment step is given at the end of this section). Hidden representations of the support sets with classes σ(k) and σ′(k) are then transformed into a single support set using a set function that maps a set of two vectors to a single vector. We refer to the output of the set function as the interpolated support set, which is used to compute class prototypes. As shown in Figure 1b, the interpolated support set is paired with a query set (query set 1 in Figure 1a), randomly selected from the two tasks, to obtain a new task. Lastly, we optimize the set function so that a meta-learner trained on the augmented tasks minimizes the loss on the meta-validation tasks, as illustrated in Figure 1c.

To verify the efficacy of our method, we empirically show that it significantly improves the performance of Prototypical Networks [40] on the few-task meta-learning problem across multiple domains. Our method outperforms the relevant baselines on eight few-task meta-learning benchmark datasets spanning image classification, chemical property prediction, text classification, and sound classification. Furthermore, our theoretical analysis shows that our task interpolation method with the set function regularizes the meta-learner and improves generalization performance.
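To make the class-assignment step of Figure 1a concrete, the following is a minimal Python sketch; `pair_classes`, its interface, and the use of 1-based class labels are illustrative choices of ours, not details taken from the authors' implementation:

```python
import random

def pair_classes(K, seed=0):
    """Sample two permutations sigma1, sigma2 of {1, ..., K} and assign the
    new class k to the pair {sigma1(k), sigma2(k)}, one class per task."""
    rng = random.Random(seed)
    sigma1 = rng.sample(range(1, K + 1), K)  # permutation for task T_t1
    sigma2 = rng.sample(range(1, K + 1), K)  # permutation for task T_t2
    return {k: (sigma1[k - 1], sigma2[k - 1]) for k in range(1, K + 1)}

print(pair_classes(3))  # e.g. {1: (3, 1), 2: (1, 3), 3: (2, 2)}
```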
Our contribution is threefold:

- We propose a novel domain-agnostic task augmentation method, Meta-Interpolation, which leverages expressive set functions to densify the meta-training task distribution for the few-task meta-learning problem.
- We theoretically analyze our model and show that it regularizes the meta-learner for better generalization.
- Through extensive experiments, we show that Meta-Interpolation significantly improves the performance of Prototypical Networks on the few-task meta-learning problem in various domains such as image, text, chemical molecule, and sound classification.

## 2 Related Work

**Meta-Learning** The two mainstream approaches to meta-learning are gradient based [10, 33, 14, 24, 12, 36, 37] and metric based meta-learning [45, 40, 42, 29, 26, 6, 38]. The former formulates meta-knowledge as meta-parameters, such as the initial model parameters, and performs bilevel optimization to estimate the meta-parameters so that a meta-learner can generalize to unseen tasks within a few gradient steps. The latter learns an embedding space where classification is performed by measuring the distance between a query and a set of class prototypes. In this work, we focus on metric based meta-learning with a small number of meta-training tasks, i.e., few-task meta-learning. We propose a novel task augmentation method that densifies the meta-training task distribution and mitigates the overfitting caused by the small number of meta-training tasks, for better generalization to unseen tasks.

**Task Augmentation for Few-Task Meta-Learning** Several methods have been proposed to augment the number of meta-training tasks to mitigate overfitting in the context of few-task meta-learning. Ni et al. [32] apply strong data augmentations such as vertical flips to images to create new classes. For text classification, Murty et al. [30] split meta-training tasks into latent reasoning categories by clustering data with a pretrained language model. However, these approaches require domain-specific knowledge to design the augmentations, and hence the resulting techniques are inapplicable to other domains where there is no well-defined data augmentation or pretrained model. In order to tackle this limitation, Manifold Mixup-based task augmentations have also been proposed. MetaMix [49] interpolates support and query sets with Manifold Mixup [44] to construct a new query set. MLTI [50] performs Manifold Mixup [44] on support and query sets from two tasks for task augmentation. Although these methods are domain-agnostic, we empirically find that they are not effective in some domains and can degrade generalization performance. In contrast, we propose to train an expressive neural set function that interpolates two tasks, using bilevel optimization, to find an optimal augmentation strategy tailored to each domain.

### Preliminaries

In meta-learning, we are given a finite set of tasks $\{\mathcal{T}_t\}_{t=1}^{T}$, which are i.i.d. samples from an unknown task distribution $p(\mathcal{T})$. Each task $\mathcal{T}_t$ consists of a support set $\mathcal{D}^s_t = \{(x^s_{t,i}, y^s_{t,i})\}_{i=1}^{N_s}$ and a query set $\mathcal{D}^q_t = \{(x^q_{t,i}, y^q_{t,i})\}_{i=1}^{N_q}$, where $x_{t,i}$ and $y_{t,i}$ denote a data point and its corresponding label, respectively. Given a predictive model $\hat{f}_{\theta,\lambda} := f^L_{\theta_L} \circ \cdots \circ f^{l+1}_{\theta_{l+1}} \circ \phi_\lambda \circ f^l_{\theta_l} \circ \cdots \circ f^1_{\theta_1}$ with $L$ layers, where $\lambda$ is a hyperparameter for the function $\phi_\lambda$, we want to estimate the parameter $\theta$ that minimizes the meta-training loss and generalizes to query sets $\mathcal{D}^q$ sampled from an unseen task $\mathcal{T}$, given the corresponding support set $\mathcal{D}^s$.
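As a rough PyTorch sketch of this composition (our own illustrative stand-in, under the assumption that $\phi_\lambda$ maps a set of shape `(batch, n, d)` to `(batch, d)` and is applied to singleton sets $\{h\}$ on ordinary forward passes, matching the singleton case $H^{\{h\}}_1$ analyzed in Section 4):

```python
import torch.nn as nn

class ComposedModel(nn.Module):
    """f^L ∘ ... ∘ f^{l+1} ∘ φ_λ ∘ f^l ∘ ... ∘ f^1 with φ_λ inserted at layer l."""

    def __init__(self, lower, set_fn, upper):
        super().__init__()
        self.lower = lower    # f^l ∘ ... ∘ f^1: layers below φ_λ
        self.set_fn = set_fn  # φ_λ (identity for plain ProtoNet)
        self.upper = upper    # f^L ∘ ... ∘ f^{l+1}: layers above φ_λ

    def forward(self, x):
        h = self.lower(x)                # hidden representation in R^d
        h = self.set_fn(h.unsqueeze(1))  # φ_λ applied to the singleton set {h}
        return self.upper(h)             # final embedding in R^D
```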
In this work, we primarily focus on metric based rather than gradient based meta-learning methods, due to their efficiency and their empirically higher performance on the tasks we consider.

**Problem Statement** In this work, we focus solely on few-task meta-learning, where the number of meta-training tasks drawn from the meta-training distribution is extremely small and the goal of a meta-learner is to learn, from such limited tasks, meta-knowledge that can be transferred to unseen tasks during meta-testing. The key challenges are preventing the meta-learner from overfitting to the meta-training tasks and generalizing to unseen tasks drawn from a meta-test set.

**Metric Based Meta-Learning** The goal of metric based meta-learning is to learn an embedding space induced by $\hat{f}_{\theta,\lambda}$, in which we perform classification by computing distances between data points and class prototypes. We adopt Prototypical Networks (ProtoNet) [40] for $\hat{f}_{\theta,\lambda}$, where $\phi_\lambda$ is the identity function. Specifically, for each task $\mathcal{T}_t$ with its corresponding support set $\mathcal{D}^s_t$ and query set $\mathcal{D}^q_t$, we compute the class prototypes $\{c_k\}_{k=1}^{K}$ as the average of the hidden representations of the support samples belonging to class $k$:

$$c_k = \frac{1}{N_k} \sum_{\substack{(x^s_{t,i},\, y^s_{t,i}) \in \mathcal{D}^s_t \\ y^s_{t,i} = k}} \hat{f}_{\theta,\lambda}(x^s_{t,i}) \in \mathbb{R}^D \tag{1}$$

where $N_k$ denotes the number of instances belonging to class $k$. Given a metric $d(\cdot, \cdot) : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$, we compute the probability of a query point $x^q_{t,i}$ being assigned to class $k$ by measuring the distance between the hidden representation $\hat{f}_{\theta,\lambda}(x^q_{t,i})$ and the class prototype $c_k$, followed by a softmax. With the class probability, we compute the cross-entropy loss for ProtoNet as follows:

$$\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}_t) := -\sum_{i,k} \mathbb{1}\{y^q_{t,i} = k\} \log \frac{\exp(-d(\hat{f}_{\theta,\lambda}(x^q_{t,i}), c_k))}{\sum_{k'} \exp(-d(\hat{f}_{\theta,\lambda}(x^q_{t,i}), c_{k'}))} \tag{2}$$

where $\mathbb{1}$ is the indicator function. At meta-test time, a test query is assigned a label based on the minimal distance to a class prototype, i.e., $y^q = \arg\min_k d(\hat{f}_{\theta,\lambda}(x^q), c_k)$. However, optimizing $\frac{1}{T}\sum_{t=1}^{T} \mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}_t)$ w.r.t. $\theta$ is prone to overfitting, since we are given only a small number of meta-training tasks. The meta-learner tends to memorize the meta-training tasks, which limits its generalization to new tasks at meta-test time [51, 35].

**Algorithm 1 Meta-training**

    Require: tasks {T^train_t}_{t=1}^T and {T^val_{t'}}_{t'=1}^{T'}, learning rates α, η ∈ R^+, update period S, batch size B.
    1:  Initialize parameters θ, λ
    2:  for i = 1, . . . , M do
    3:      L_tr ← 0
    4:      for j = 1, . . . , B do
    5:          Sample two tasks T_{t1} = {D^s_{t1}, D^q_{t1}} and T_{t2} = {D^s_{t2}, D^q_{t2}} from {T^train_t}_{t=1}^T.
    6:          D̂^s ← Interpolate(D^s_{t1}, D^s_{t2}, φ_λ) with Eq. (3)
    7:          T̂ ← {D̂^s, D^q_{t1}}
    8:          L_tr += (1/2B) · L_singleton(λ, θ; T_{t1})
    9:          L_tr += (1/2B) · L_mix(λ, θ; T̂)
    10:     end for
    11:     θ ← θ − α · ∂L_tr/∂θ
    12:     if mod(i, S) = 0 then
    13:         g ← HyperGrad(θ, λ, {T^val_{t'}}_{t'=1}^{T'}, α, ∂L_tr/∂θ)
    14:         λ ← λ − η · g
    15:     end if
    16: end for
    17: return θ, λ

**Algorithm 2 HyperGrad [27]**

    Require: model parameters θ, hyperparameters λ, validation tasks {T^val_{t'}}_{t'=1}^{T'}, learning rate α, gradient of the training loss ∂L_tr/∂θ, batch size B', number of Neumann-series iterations q ∈ N.
    1:  L_V ← 0
    2:  for i = 1, . . . , B' do
    3:      Sample a task T_{t'} from {T^val_{t'}}_{t'=1}^{T'}.
    4:      L_V += (1/B') · L_singleton(λ, θ; T_{t'})
    5:  end for
    6:  v1 ← ∂L_V/∂θ
    7:  p ← deepcopy(v1)
    8:  for j = 1, . . . , q do
    9:      v1 ← v1 − α · grad(∂L_tr/∂θ, θ, grad_outputs = v1)
    10:     p += v1
    11: end for
    12: v2 ← grad(∂L_tr/∂θ, λ, grad_outputs = α·p)
    13: return ∂L_V/∂λ − v2
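For concreteness, a minimal PyTorch sketch of Eqs. (1) and (2) follows; `prototypical_loss` and its tensor interface are our own illustrative names, assuming the embeddings $\hat{f}_{\theta,\lambda}(x)$ have already been computed and labels are re-indexed to $\{0, \ldots, K-1\}$:

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support_emb, support_y, query_emb, query_y, num_classes):
    # Eq. (1): prototype c_k is the mean support embedding of class k.
    prototypes = torch.stack(
        [support_emb[support_y == k].mean(dim=0) for k in range(num_classes)]
    )                                                  # shape (K, D)
    # Negative squared Euclidean distance -d(f(x^q), c_k) as logits.
    logits = -torch.cdist(query_emb, prototypes) ** 2  # shape (N_q, K)
    # Eq. (2): cross-entropy of the softmax over class distances.
    return F.cross_entropy(logits, query_y)
```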
### Meta-Interpolation for Task Augmentation

In order to tackle the meta-overfitting problem with a small number of tasks, we propose a novel data-driven, domain-agnostic task augmentation framework which enables a meta-learner trained on few tasks to generalize to unseen few-shot classification tasks. Several methods have been proposed to densify the meta-training tasks. However, they heavily depend on image augmentations [32] or need a pretrained language model for task augmentation [30]. Although Manifold Mixup based methods [49, 50] are domain-agnostic, we empirically find them ineffective in certain domains. Instead, we optimize expressive neural set functions to augment tasks so as to enhance the generalization of a meta-learner to unseen tasks. As a consequence of end-to-end training, the learned augmentation strategy is tailored to each domain.

Specifically, let $\phi_\lambda : \mathbb{R}^{n \times d} \to \mathbb{R}^d$ be a set function which maps a set of $d$-dimensional vectors with cardinality $n$ to a single $d$-dimensional vector. In all our experiments, we use a Set Transformer [23] for $\phi_\lambda$. Given a pair of tasks $\mathcal{T}_{t_1} = \{\mathcal{D}^s_{t_1}, \mathcal{D}^q_{t_1}\}$ and $\mathcal{T}_{t_2} = \{\mathcal{D}^s_{t_2}, \mathcal{D}^q_{t_2}\}$ with corresponding support and query sets for $K$-way classification, we construct new classes by choosing $K$ pairs of classes from the two tasks. We sample permutations $\sigma_{t_1}$ and $\sigma_{t_2}$ on $\{1, \ldots, K\}$ for tasks $\mathcal{T}_{t_1}$ and $\mathcal{T}_{t_2}$, respectively, and assign class $k$ to the pair $\{\sigma_{t_1}(k), \sigma_{t_2}(k)\}$ for $k = 1, \ldots, K$. For the newly assigned class $k$, we pair two instances from classes $\sigma_{t_1}(k)$ and $\sigma_{t_2}(k)$ and interpolate their hidden representations with the set function $\phi_\lambda$. The prototype for class $k$ is computed using the output of $\phi_\lambda$ as follows:

$$\begin{aligned}
S_k &:= \big\{(\{x^s_{t_1,i}, x^s_{t_2,j}\}, k) \,\big|\, (x^s_{t_1,i}, y^s_{t_1,i}) \in \mathcal{D}^s_{t_1},\ y^s_{t_1,i} = \sigma_{t_1}(k),\ (x^s_{t_2,j}, y^s_{t_2,j}) \in \mathcal{D}^s_{t_2},\ y^s_{t_2,j} = \sigma_{t_2}(k)\big\} \\
h^{s,l}_{t_1,i} &:= (f^l_{\theta_l} \circ \cdots \circ f^1_{\theta_1})(x^s_{t_1,i}), \qquad h^{s,l}_{t_2,j} := (f^l_{\theta_l} \circ \cdots \circ f^1_{\theta_1})(x^s_{t_2,j}) \in \mathbb{R}^d \\
\hat{c}_k &:= \frac{1}{|S_k|} \sum_{(\{x^s_{t_1,i},\, x^s_{t_2,j}\},\, k) \in S_k} \big(f^L_{\theta_L} \circ \cdots \circ f^{l+1}_{\theta_{l+1}}\big)\big(\phi_\lambda(\{h^{s,l}_{t_1,i}, h^{s,l}_{t_2,j}\})\big) \in \mathbb{R}^D \\
\hat{\mathcal{D}}^s &:= \{\hat{c}_1, \ldots, \hat{c}_K\}
\end{aligned} \tag{3}$$

where we define $\hat{\mathcal{D}}^s$ to be the set of all interpolated prototypes $\hat{c}_k$ for $k = 1, \ldots, K$. For queries, we do not perform any interpolation. Instead, we use $\mathcal{D}^q_{t_1}$ as the query set and compute the hidden representations $\hat{f}_{\theta,\lambda}(x^q_{t_1,i}) \in \mathbb{R}^D$. We then measure the distance between a query with $y^q_{t_1,i} = \sigma_{t_1}(k)$ and the interpolated prototype of class $k$ to compute the loss:

$$\mathcal{L}_{\text{mix}}(\lambda, \theta; \hat{\mathcal{T}}) := -\sum_{i,k} \mathbb{1}\{y^q_{t_1,i} = \sigma_{t_1}(k)\} \log \frac{\exp(-d(\hat{f}_{\theta,\lambda}(x^q_{t_1,i}), \hat{c}_k))}{\sum_{k'} \exp(-d(\hat{f}_{\theta,\lambda}(x^q_{t_1,i}), \hat{c}_{k'}))} \tag{4}$$

where $\hat{\mathcal{T}} = \{\hat{\mathcal{D}}^s, \mathcal{D}^q_{t_1}\}$. The intuition behind interpolating only the support sets is to construct harder tasks that a meta-learner cannot memorize. Alternatively, we could interpolate only the query sets; however, this is computationally more expensive, since query sets are usually larger than support sets. In Section 5, we empirically show that interpolating either the support or the query sets achieves higher training loss than interpolating both, which supports this intuition. Lastly, we also use the original task $\mathcal{T}_{t_1}$ to evaluate the loss $\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}_{t_1})$ in Eq. (2) by passing the corresponding support and query sets through $\hat{f}_{\theta,\lambda}$. This additional forward pass enriches the diversity of the augmented tasks and makes meta-training consistent with meta-testing, since we do not perform any task augmentation at the meta-testing stage.
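A minimal sketch of the support-set interpolation in Eq. (3) is given below, assuming `phi` maps a batch of two-element sets of shape `(pairs, 2, d)` to `(pairs, d)` (a Set Transformer in our experiments), and `lower`/`upper` split $\hat{f}_{\theta,\lambda}$ around layer $l$ as in the Preliminaries; the function name and interface are illustrative:

```python
import torch

def interpolated_prototypes(sup1_x, sup1_y, sup2_x, sup2_y,
                            sigma1, sigma2, lower, phi, upper, num_classes):
    protos = []
    for k in range(num_classes):
        h1 = lower(sup1_x[sup1_y == sigma1[k]])   # hidden states, shape (n1, d)
        h2 = lower(sup2_x[sup2_y == sigma2[k]])   # hidden states, shape (n2, d)
        # Enumerate S_k: all pairings of one instance from each task.
        pairs = torch.stack([torch.stack([a, b]) for a in h1 for b in h2])
        mixed = phi(pairs)                        # φ_λ({h1_i, h2_j}): (n1*n2, d)
        protos.append(upper(mixed).mean(dim=0))   # interpolated prototype ĉ_k
    return torch.stack(protos)                    # (K, D): the set D̂^s for Eq. (4)
```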
### Optimization

Since jointly optimizing $\theta$, the parameters of the ProtoNet, and $\lambda$, the parameters of the set function $\phi_\lambda$, with few tasks is prone to overfitting, we treat $\lambda$ as a hyperparameter and perform bilevel optimization with meta-training and meta-validation tasks as follows:

$$\lambda^* := \arg\min_{\lambda} \sum_{t'=1}^{T'} \mathcal{L}_{\text{singleton}}(\lambda, \theta^*(\lambda); \mathcal{T}^{\text{val}}_{t'}) \tag{5}$$

$$\theta^*(\lambda) := \arg\min_{\theta} \sum_{t=1}^{T} \mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}^{\text{train}}_t) + \mathcal{L}_{\text{mix}}(\lambda, \theta; \hat{\mathcal{T}}_t) \tag{6}$$

where $\mathcal{T}^{\text{train}}_t$, $\mathcal{T}^{\text{val}}_{t'}$, and $\hat{\mathcal{T}}_t$ denote a meta-training, meta-validation, and interpolated task, respectively. Since computing the exact gradient w.r.t. $\lambda$ is intractable due to the long inner optimization in Eq. (6), we leverage the implicit function theorem to approximate the gradient, following Lorraine et al. [27]. Moreover, we alternately update $\theta$ and $\lambda$ for computational efficiency, as described in Algorithms 1 and 2.
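Below is a hedged PyTorch sketch of the hypergradient approximation in Algorithm 2, following Lorraine et al. [27]: the inverse Hessian arising from the implicit function theorem is approximated with a truncated Neumann series of $q$ Hessian-vector products. `hypergrad` and its interface are our own; `loss_tr` must be built with a computation graph that reaches both $\theta$ (`params`) and $\lambda$ (`hparams`):

```python
import torch

def hypergrad(loss_val, loss_tr, params, hparams, alpha=0.1, q=5):
    dval_dtheta = torch.autograd.grad(loss_val, params, retain_graph=True)
    dval_dlambda = torch.autograd.grad(loss_val, hparams, retain_graph=True,
                                       allow_unused=True)
    # Gradient of the training loss, kept differentiable for second derivatives.
    dtr_dtheta = torch.autograd.grad(loss_tr, params, create_graph=True)

    v = [g.clone() for g in dval_dtheta]
    p = [g.clone() for g in v]
    for _ in range(q):  # p ≈ Σ_j (I − α ∇²_θ L_tr)^j applied to ∂L_V/∂θ
        hvp = torch.autograd.grad(dtr_dtheta, params, grad_outputs=v,
                                  retain_graph=True)
        v = [vi - alpha * hi for vi, hi in zip(v, hvp)]
        p = [pi + vi for pi, vi in zip(p, v)]

    # Mixed second derivative (∂²L_tr/∂λ∂θ) applied to α·p.
    v2 = torch.autograd.grad(dtr_dtheta, hparams,
                             grad_outputs=[alpha * pi for pi in p],
                             allow_unused=True)
    return [(a if a is not None else 0.0) - (b if b is not None else 0.0)
            for a, b in zip(dval_dlambda, v2)]
```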
## 4 Theoretical Analysis

In this section, we theoretically investigate the behavior of the Set Transformer and how it induces a distribution-dependent regularization, which is then shown to control the Rademacher complexity for better generalization. To analyze the behavior of the Set Transformer, we first define it concretely with the attention mechanism $A(Q, K, V) = \mathrm{softmax}(\sqrt{d}^{-1} QK^\top)V$. Given $h, h' \in \mathbb{R}^d$, define $H^{\{h, h'\}}_1 = [h, h']^\top \in \mathbb{R}^{2 \times d}$ and $H^{\{h\}}_1 = h^\top \in \mathbb{R}^{1 \times d}$. Then, for any $r \in \{\{h, h'\}, \{h\}\}$, the output of the Set Transformer $\phi_\lambda(r)$ is defined as follows:

$$\phi_\lambda(r) = A(Q_2, K^r_2, V^r_2) \in \mathbb{R}^d, \tag{7}$$

where $Q_2 = S W^Q_2 + b^Q_2$ with a learnable seed vector $S$, $Q^r_1 = H^r_1 W^Q_1 + \mathbf{1}_n b^Q_1$, $K^r_j = H^r_j W^K_j + \mathbf{1}_n b^K_j$, and $V^r_j = H^r_j W^V_j + \mathbf{1}_n b^V_j$ (for $j \in \{1, 2\}$), and $H^r_2 = A(Q^r_1, K^r_1, V^r_1) \in \mathbb{R}^{n \times d}$. $Q_j$, $K_j$, $V_j$ denote the query, key, and value of the attention mechanism for $j = 1, 2$, respectively. Here, $\mathbf{1}_n = [1, \ldots, 1]^\top \in \mathbb{R}^n$, $W^Q_j, W^K_j, W^V_j \in \mathbb{R}^{d \times d}$, $b^Q_j, b^K_j, b^V_j \in \mathbb{R}^{1 \times d}$, $Q^r_1, K^r_j, V^r_j \in \mathbb{R}^{n \times d}$, and $Q_2 \in \mathbb{R}^{1 \times d}$. Let $l \in \{1, \ldots, L\}$. Our analysis will show the importance of the following quantity of the Set Transformer in our method:

$$\alpha^{(t,t')}_{ij} = p^{(t,t',i,j)}_2 \big(1 - p^{(t,t',i,j)}_1\big) + \big(1 - p^{(t,t',i,j)}_2\big)\big(1 - \bar{p}^{(t,t',i,j)}_1\big), \tag{8}$$

where $p^{(t,t',i,j)}_1 = \mathrm{softmax}\big(\sqrt{d}^{-1} Q^{\{h_{t,i}, h_{t',j}\}}_1 (K^{\{h_{t,i}, h_{t',j}\}}_1)^\top\big)_{1,1}$, $\bar{p}^{(t,t',i,j)}_1 = \mathrm{softmax}\big(\sqrt{d}^{-1} Q^{\{h_{t,i}, h_{t',j}\}}_1 (K^{\{h_{t,i}, h_{t',j}\}}_1)^\top\big)_{2,1}$, and $p^{(t,t',i,j)}_2 = \mathrm{softmax}\big(\sqrt{d}^{-1} Q_2 (K^{\{h_{t,i}, h_{t',j}\}}_2)^\top\big)_{1,1}$, with $h_{t,i} = \varphi^l_\theta(x^s_{t,i})$ and $\varphi^l_\theta = f^l_{\theta_l} \circ \cdots \circ f^1_{\theta_1}$. For a matrix $A \in \mathbb{R}^{m \times n}$, $A_{i,j}$ denotes the entry in the $i$-th row and $j$-th column of $A$.

We now introduce the additional notation and problem setting needed to present our results. Define $W = W^V_1 W^V_2 \in \mathbb{R}^{d \times d}$, $b = b^V_1 W^V_2 + b^V_2 \in \mathbb{R}^{1 \times d}$, $\mathcal{L}_t(c) = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{\exp(-d(\hat{f}_{\theta,\lambda}(x^q_{t,i}),\, c_{y^q_{t,i}}))}{\sum_{k'} \exp(-d(\hat{f}_{\theta,\lambda}(x^q_{t,i}),\, c_{k'}))}$, and $I_{t,k} = \{i \in [N^{(t)}_s] : y^s_{t,i} = k\}$, where $N^{(t)}_s = |\mathcal{D}^s_t|$. We also define the empirical measure $\mu_{t,k} = \frac{1}{|I_{t,k}|}\sum_{i \in I_{t,k}} \delta_i$ over the indices $i \in [N^{(t)}_s]$, with Dirac measures $\delta_i$. Let $U[K]$ be the uniform distribution over $\{1, \ldots, K\}$. For any function $\phi$ and point $u$ in its domain, we define the $j$-th order tensor $\nabla^j \phi(u) \in \mathbb{R}^{d \times d \times \cdots \times d}$ by $\nabla^j \phi(u)_{i_1 i_2 \cdots i_j} = \frac{\partial^j \phi(u)}{\partial u_{i_1} \partial u_{i_2} \cdots \partial u_{i_j}}$. For example, $\nabla^1 \phi(u)$ and $\nabla^2 \phi(u)$ are the gradient and the Hessian of $\phi$ evaluated at $u$. For any $j$-th order tensor $\nabla^j \phi(u)$, we define its vectorization $\mathrm{vec}[\nabla^j \phi(u)] \in \mathbb{R}^{d^j}$. For a vector $a \in \mathbb{R}^d$, we define $a^{\otimes j} = a \otimes a \otimes \cdots \otimes a \in \mathbb{R}^{d^j}$, where $\otimes$ represents the Kronecker product. We assume that $\nabla^r g(W \varphi^l_{\theta_l}(x^s_{t,i}) + b) = 0$ for all $r \geq 2$, where $g := f^L_{\theta_L} \circ \cdots \circ f^{l+1}_{\theta_{l+1}}$. This assumption is satisfied, for example, if $g$ represents a deep neural network with ReLU activations. It is also satisfied in the simpler special case considered in the proposition below.

The following theorem shows that $\mathcal{L}_{\text{mix}}(\lambda, \theta; \hat{\mathcal{T}}_{t,t'})$ is approximately $\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}_t)$ plus regularization terms on the directional derivatives of $g$ at $W\varphi^l_{\theta_l}(x^s_{t,i}) + b$, in the direction of $W(\varphi^l_{\theta_l}(x^s_{t',j}) - \varphi^l_{\theta_l}(x^s_{t,i}))$:

**Theorem 1.** For any $J \in \mathbb{N}^+$, if $c \mapsto d(y, c)$ is $J$-times differentiable for all $y$, then the $J$-th order approximation of $\mathcal{L}_{\text{mix}}(\lambda, \theta; \hat{\mathcal{T}}_{t,t'})$ is given by $\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}_t) + \sum_{j=1}^{J} \frac{1}{j!} \mathrm{vec}[\nabla^j \mathcal{L}_t(c)]^\top \Delta^{\otimes j}$, where $\Delta = [\Delta_1^\top, \ldots, \Delta_K^\top]^\top$ and
$$\Delta_k = \mathbb{E}_{i \sim \mu_{t,k},\, j \sim \mu_{t',\sigma(k)}}\Big[\alpha^{(t,t')}_{ij}\, \nabla g\big(W \varphi^l_{\theta_l}(x^s_{t,i}) + b\big)\, W \big(\varphi^l_{\theta_l}(x^s_{t',j}) - \varphi^l_{\theta_l}(x^s_{t,i})\big)\Big].$$

To illustrate the effect of this data-dependent regularization, we now consider the following special case, which is also used by Yao et al. [50] for ProtoNet: $\mathcal{L}(\lambda, \theta; \mathcal{T}_t) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_i(\lambda, \theta; \mathcal{T}_t)$, where $\mathcal{L}_i(\lambda, \theta; \mathcal{T}_t) = \frac{1}{1 + \exp(\langle x^q_{t,i} - (c'_1 + c'_2)/2,\, \theta\rangle)}$, $c'_k := \frac{1}{N_{t,k}} \sum_{(x^s_{t,i},\, y^s_{t,i}) \in \mathcal{D}^s_t} \mathbb{1}\{y^s_{t,i} = k\}\, x^s_{t,i}$, and $\langle \cdot, \cdot \rangle$ denotes the dot product. Define $c = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{4} \cdot \frac{\psi(z_{t,i})(\psi(z_{t,i}) - 0.5)}{1 + \exp(z_{t,i})}$, where $\psi(z_{t,i}) = \frac{\exp(z_{t,i})}{1 + \exp(z_{t,i})}$ and $z_{t,i} = \langle x^q_{t,i} - (c'_1 + c'_2)/2,\, \theta \rangle$. Note that $c > 0$ if $\theta$ is no worse than a random guess, e.g., if $\mathcal{L}_i(\lambda, 0; \mathcal{T}_t) > \mathcal{L}_i(\lambda, \theta; \mathcal{T}_t)$ for all $i \in [n]$. We write $\|v\|^2_M = v^\top M v$ for any positive semi-definite matrix $M$. In this special case, we consider $\alpha$ to be balanced, i.e., $\mathbb{E}_{t',\sigma}\big[\sum_{k=1}^{2} \frac{1}{|I_{t',\sigma(k)}|} \sum_{j \in I_{t',\sigma(k)}} \frac{1}{|I_{t,k}|} \sum_{i \in I_{t,k}} \alpha^{(t,t')}_{ij} (\varphi^l_{\theta_l}(x^s_{t',j}) - \varphi^l_{\theta_l}(x^s_{t,i}))\big] = 0$ for all $t$. This is used to prevent the Set Transformer from overfitting to the training sets; in such simple special cases, the Set Transformer without any restriction is too expressive relative to the rest of the model (and may memorize the training sets without using the rest of the model). The following proposition shows that the additional regularization term simplifies to the form $c\|\theta\|^2_M$ in this special case:

**Proposition 1.** In the special case explained above, the second-order approximation of $\mathbb{E}_{t',\sigma}[\mathcal{L}_{\text{mix}}(\lambda, \theta; \hat{\mathcal{T}}_{t,t'})]$ is given by $\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}_t) + c\,\|\theta\|^2_{\mathbb{E}_{t',\sigma}[\delta_{t,t',\sigma} \delta^\top_{t,t',\sigma}]}$, where $\delta_{t,t',\sigma} = \mathbb{E}_{k \sim U[2]}\, \mathbb{E}_{i \sim \mu_{t,k},\, j \sim \mu_{t',\sigma(k)}}\big[\alpha^{(t,t')}_{ij}(x^s_{t',j} - x^s_{t,i})\big]$.

In the above regularization form, we have an implicit regularization effect on $\|\theta\|^2_\Sigma$, where $\Sigma = \mathbb{E}_{x,x'}[(x - x')(x - x')^\top]$. The following proposition shows that this implicit regularization can reduce the Rademacher complexity for better generalization:

**Proposition 2.** Let $\mathcal{F}_R = \{x \mapsto \theta^\top x : \|\theta\|^2_\Sigma \leq R\}$ with $\mathbb{E}_x[x] = 0$. Then $\mathcal{R}_n(\mathcal{F}_R) \leq \sqrt{\frac{R \cdot \mathrm{rank}(\Sigma)}{n}}$.

All proofs are presented in Appendix A.

## 5 Experiments

We now demonstrate the efficacy of our set-based task augmentation method on multiple few-task benchmark datasets and compare against the relevant baselines.

**Datasets** We perform classification on eight datasets to validate our method. (1), (2), & (3) Metabolism [17], NCI [31], and Tox21 [18]: these are binary classification datasets for predicting the properties of chemical molecules. For Metabolism, we use three subdatasets for meta-training, meta-validation, and meta-testing, respectively. For NCI, we use four subdatasets for meta-training, two for meta-validation, and the remaining three for meta-testing. For Tox21, we use six subdatasets
for meta-training, two for meta-validation, and four for meta-testing. (4) GLUE-SciTail [30]: this consists of four natural language inference datasets, where we predict whether a hypothesis sentence contradicts a premise sentence. We use MNLI [47] and QNLI [46] for meta-training, SNLI [5] and RTE [46] for meta-validation, and SciTail [20] for meta-testing. (5) ESC-50 [34]: this is an environmental sound recognition dataset. We make a 20/15/15 split out of the 50 base classes for meta-training/validation/testing and sample 5 classes from each split to construct a 5-way classification task. (6) Rainbow MNIST (RMNIST) [11]: this is a 10-way classification dataset. Following Yao et al. [50], we construct each task by applying compositions of image transformations to the images of the MNIST [9] dataset. (7) & (8) MiniImageNet-S [45] and CIFAR-100-FS [22]: these are 5-way classification datasets where we choose 12/16/20 classes out of the 100 base classes for meta-training/validation/testing, respectively, and sample 5 classes from each split to construct a task.

Note that Metabolism, Tox21, NCI, GLUE-SciTail, and ESC-50 are real-world few-task meta-learning datasets with a very small number of tasks. For MiniImageNet-S and CIFAR-100-FS, following Yao et al. [50], we artificially reduce the number of tasks from the original datasets for few-task meta-learning. RMNIST is synthetically generated by applying augmentations to MNIST.

Table 1: Average accuracy over 5 runs with 95% confidence intervals for few-shot classification on the non-image domains (Metabolism, Tox21, NCI, GLUE-SciTail, and ESC-50). ST stands for Set Transformer.

| Method | Metabolism (Chemical, 5-shot) | Tox21 (Chemical, 5-shot) | NCI (Chemical, 5-shot) | GLUE-SciTail (Text, 4-shot) | ESC-50 (Speech, 5-shot) |
|---|---|---|---|---|---|
| ProtoNet | 63.62 ± 0.56% | 64.07 ± 0.80% | 80.45 ± 0.48% | 72.59 ± 0.45% | 69.05 ± 1.48% |
| MetaReg | 66.22 ± 0.99% | 64.40 ± 0.65% | 80.94 ± 0.34% | 72.08 ± 1.33% | 74.95 ± 1.78% |
| MetaMix | 68.02 ± 1.57% | 65.23 ± 0.56% | 79.46 ± 0.38% | 72.12 ± 1.04% | 71.99 ± 1.41% |
| MLTI | 65.44 ± 1.14% | 64.16 ± 0.23% | 81.12 ± 0.70% | 71.65 ± 0.70% | 70.62 ± 1.96% |
| ProtoNet+ST | 66.26 ± 0.65% | 64.98 ± 1.25% | 81.20 ± 0.30% | 72.37 ± 0.56% | 71.54 ± 1.56% |
| Meta-Interpolation | 72.92 ± 1.89% | 67.54 ± 0.40% | 82.86 ± 0.26% | 73.64 ± 0.59% | 79.22 ± 0.84% |

**Implementation Details** For RMNIST, MiniImageNet-S, and CIFAR-100-FS, we use four convolutional blocks, with each block consisting of a convolution, ReLU, batch normalization [19], and max pooling. For Metabolism, Tox21, and NCI, we convert the chemical molecules into SMILES format and extract a 1024-bit fingerprint feature using RDKit [15], where each bit captures a fragment of the molecule; we use two blocks of affine transformation, batch normalization, and LeakyReLU, with an affine transformation for the last layer. For the GLUE-SciTail dataset, we stack 3 fully connected layers with ReLU on top of the pretrained language model ELECTRA [8]. For the ESC-50 dataset, we pass the raw audio signal to the pretrained VGGish [16] feature extractor to obtain an embedding vector, which we use as input to a classifier identical to the one used for Metabolism, Tox21, and NCI. For Meta-Interpolation, we use a Set Transformer [23] for the set function $\phi_\lambda$.
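As a sketch of the chemical featurization described above (assuming Morgan circular fingerprints; the paper only specifies a 1024-bit RDKit fingerprint whose bits capture molecular fragments):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def featurize(smiles):
    """Map a SMILES string to a 1024-bit fingerprint vector for the MLP."""
    mol = Chem.MolFromSmiles(smiles)
    # Morgan fingerprint is an assumption; the paper does not name the type.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    return np.array(fp, dtype=np.float32)

print(featurize("CCO").shape)  # (1024,) for ethanol
```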
**Baselines** We compare our method against the following domain-agnostic baselines:

1. ProtoNet [40]: the vanilla ProtoNet trained with Eq. (2), fixing $\phi_\lambda$ to be the identity function.
2. MetaReg [2]: ProtoNet with $\ell_2$ regularization, where element-wise coefficients are learned with bilevel optimization.
3. MetaMix [49]: ProtoNet trained with support sets and mixed query sets, where we interpolate one instance from the support sets and another from the original query sets with Manifold Mixup.
4. MLTI [50]: ProtoNet trained with Manifold Mixup based task augmentation. We sample two tasks and interpolate the two query sets and the two support sets with Manifold Mixup, respectively.
5. ProtoNet+ST: ProtoNet and a Set Transformer ($\phi_\lambda$) trained with bilevel optimization but without the task augmentation loss $\mathcal{L}_{\text{mix}}(\lambda, \theta; \hat{\mathcal{T}}_t)$ in Eq. (6).
6. Meta-Interpolation: our full model, which learns to interpolate support sets from two tasks using bilevel optimization and trains the ProtoNet with both the original and interpolated tasks.

**Results** As shown in Table 1, Meta-Interpolation consistently outperforms all the domain-agnostic task augmentation and regularization baselines on the non-image domains. Notably, it significantly improves performance on ESC-50, a challenging dataset that contains only 40 examples per class. In addition, Meta-Interpolation effectively tackles the Metabolism and GLUE-SciTail datasets, which come with an extremely small number of meta-training tasks: three and two, respectively. In contrast, the baselines do not achieve consistent improvements across all the domains and tasks considered. For example, MetaReg is effective on the sound domain (ESC-50) and on Metabolism, but does not work on the chemical (Tox21 and NCI) and text (GLUE-SciTail) domains. Similarly, MetaMix and MLTI achieve performance improvements on some datasets but degrade the test accuracy on others. This empirical evidence supports the hypothesis that the optimal task augmentation strategy varies across domains and justifies the motivation for Meta-Interpolation, which learns augmentation strategies tailored to each domain.

Table 2: Average accuracy over 5 runs with 95% confidence intervals for few-shot classification on the image domains Rainbow MNIST, MiniImageNet-S, and CIFAR-100-FS. ST stands for Set Transformer.

| Method | RMNIST (1-shot) | MiniImageNet-S (1-shot) | MiniImageNet-S (5-shot) | CIFAR-100-FS (1-shot) | CIFAR-100-FS (5-shot) |
|---|---|---|---|---|---|
| ProtoNet | 75.35 ± 1.43% | 39.14 ± 0.78% | 51.17 ± 0.57% | 38.05 ± 1.56% | 52.63 ± 0.74% |
| MetaReg | 76.40 ± 0.56% | 39.36 ± 0.45% | 50.94 ± 0.67% | 37.74 ± 0.70% | 52.73 ± 1.26% |
| MetaMix | 76.54 ± 0.72% | 38.25 ± 0.09% | 52.38 ± 0.52% | 36.13 ± 0.63% | 52.52 ± 0.89% |
| MLTI | 79.40 ± 0.75% | 39.69 ± 0.47% | 52.73 ± 0.51% | 38.81 ± 0.55% | 53.41 ± 0.83% |
| ProtoNet+ST | 77.38 ± 2.05% | 38.93 ± 1.03% | 48.92 ± 0.67% | 38.03 ± 0.85% | 50.72 ± 0.92% |
| Meta-Interpolation | 83.24 ± 1.39% | 40.28 ± 0.48% | 53.06 ± 0.33% | 41.48 ± 0.45% | 54.94 ± 0.80% |

Figure 2: (a)–(d) Meta-training and meta-validation loss on RMNIST and NCI for ProtoNet, MLTI, MetaMix, ProtoNet+ST, MetaReg, and Meta-Interpolation.

We provide additional experimental results on the image domains in Table 2. Again, Meta-Interpolation outperforms all the baselines. In contrast to the previous experiments, MetaReg hurts generalization performance on all the image datasets except RMNIST. Note that the Manifold Mixup-based augmentation methods, MetaMix and MLTI, only marginally improve generalization performance for 1-shot classification on MiniImageNet-S and CIFAR-100-FS, although they boost the accuracy in the 5-shot experiments. This suggests that different task augmentation strategies are required for 1-shot and 5-shot learning on the same dataset. Meta-Interpolation, on the other hand, learns task augmentation strategies tailored to each task and dataset and consistently improves the performance of the vanilla ProtoNet in all the experiments on the image datasets.
Moreover, we plot the meta-training and meta-validation loss on the RMNIST and NCI datasets in Figure 2. Meta-Interpolation obtains a higher training loss but a much lower validation loss than the other methods on both datasets. This implies that interpolating only support sets constructs harder tasks that a meta-learner cannot memorize, and regularizes the meta-learner for better generalization. ProtoNet overfits to the meta-training tasks on both datasets. MLTI mitigates the overfitting issue on RMNIST but is not effective on the NCI dataset, where it shows a high validation loss in Figure 2d. On the other hand, MetaMix, which constructs a new query set by interpolating a support and a query set with Manifold Mixup, generates overly difficult tasks, which causes underfitting on RMNIST, where the training loss is not properly minimized in Figure 2a. However, this augmentation strategy is effective for tackling meta-overfitting on NCI, where its validation loss is lower than that of ProtoNet. The loss curves of ProtoNet+ST support the claim that merely increasing the model size and using bilevel optimization cannot handle the few-task meta-learning problem: it shows a higher validation loss on both RMNIST and NCI, as presented in Figures 2b and 2d. Similarly, MetaReg, which learns coefficients for $\ell_2$ regularization, fails to prevent meta-overfitting on both datasets.

Lastly, we empirically show that the performance gains mostly come from the task augmentation with Meta-Interpolation, rather than from bilevel optimization or the extra parameters introduced by the set function. As shown in Tables 1 and 2, ProtoNet+ST, which is Meta-Interpolation trained without any task augmentation, significantly degrades the performance of ProtoNet on MiniImageNet-S and CIFAR-100-FS. On the other datasets, ProtoNet+ST obtains marginal improvements or largely underperforms the other baselines. Thus, the task augmentation strategy of interpolating two support sets with the set function $\phi_\lambda$ is indeed crucial for tackling the few-task meta-learning problem.

Table 3: Ablation study on the ESC-50 dataset.

| Model | Accuracy |
|---|---|
| Meta-Interpolation | 79.22 ± 0.96 |
| w/o Interpolation | 71.54 ± 1.56 |
| w/o Bilevel | 63.01 ± 2.06 |
| w/o $\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}^{\text{train}}_t)$ | 78.01 ± 1.56 |

Table 4: Performance of different set functions on the ESC-50 dataset.

| Set Function | Accuracy |
|---|---|
| ProtoNet | 69.05 ± 1.69 |
| Deep Sets | 74.26 ± 1.77 |
| Set Transformer | 79.22 ± 0.96 |

Table 5: Performance of different interpolation strategies on the ESC-50 dataset.

| Interpolation Strategy | Accuracy |
|---|---|
| Query+Support | 76.87 ± 0.94 |
| Query | 78.19 ± 0.84 |
| Support+Noise | 78.27 ± 1.24 |
| Support | 79.22 ± 0.96 |

**Ablation Study** We further perform ablation studies to verify the effectiveness of each component of Meta-Interpolation. In Table 3, we show experimental results on the ESC-50 dataset obtained by removing various components of our model. Firstly, we train our model without any task interpolation but keep the set function $\phi_\lambda$, denoted as w/o Interpolation. The model without task interpolation significantly underperforms the full task-augmentation model, Meta-Interpolation, which shows that the improvements come from task interpolation rather than the extra parameters introduced by the set encoding layer. Moreover, bilevel optimization is shown to be effective for estimating $\lambda$, the parameters of the set function.
Jointly training the ProtoNet and the set function without bilevel optimization, denoted as w/o Bilevel, largely degrades the test accuracy, by 15%. Lastly, we remove the loss $\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}^{\text{train}}_t)$ from the inner optimization in Eq. (6), denoted as w/o $\mathcal{L}_{\text{singleton}}(\lambda, \theta; \mathcal{T}^{\text{train}}_t)$. This hurts generalization performance, since it decreases the diversity of tasks and causes an inconsistency between meta-training and meta-testing, as we do not perform any interpolation of support sets at meta-test time.

We also explore an alternative set function, Deep Sets [52], on the ESC-50 dataset to show the general effectiveness of our method regardless of the set encoding scheme. In Table 4, Meta-Interpolation with Deep Sets improves the generalization performance of the Prototypical Network, and the model with the Set Transformer further boosts the performance as a consequence of the higher-order and pairwise interactions among set elements enabled by the attention mechanism.

Table 6: Comparison to interpolation with noise on RMNIST.

| Interpolation Strategy | Accuracy |
|---|---|
| Support+Noise | 69.60 ± 1.60 |
| Support | 75.35 ± 1.63 |

Lastly, we empirically validate our interpolation strategy of mixing only support sets. We compare our method to various interpolation strategies, including one that mixes a support set with zero-mean, unit-variance Gaussian noise. In Table 5, we empirically show that the interpolation strategy which mixes only support sets outperforms the other mixing strategies. Note that although interpolating a support set with Gaussian noise works well on the ESC-50 dataset, we find that it significantly degrades the performance of ProtoNet on RMNIST, from 75.35 ± 1.63 to 69.60 ± 1.60, as shown in Table 6, which justifies our approach of mixing two support sets.

Figure 3: Visualization of original and interpolated tasks from NCI ((a) and (b)) and ESC-50 ((c) and (d)); panels (b) and (d) show Meta-Interpolation.

**Visualization** In Figure 3, we present the t-SNE [43] visualizations of the original tasks and the tasks interpolated with MLTI and Meta-Interpolation, respectively. Following Yao et al. [50], we sample three original tasks from the NCI and ESC-50 datasets, where each task is a two-way five-shot and a five-way five-shot classification problem, respectively. The tasks are interpolated with MLTI or Meta-Interpolation to construct 300 additional tasks, and each task is represented as the set of all its class prototypes. To visualize the prototypes, we first perform Principal Component Analysis (PCA) [13] to reduce the dimension of each prototype. The first 50 principal components are then used to compute the t-SNE visualizations. As shown in Figures 3b and 3d, Meta-Interpolation successfully learns an expressive neural set function that densifies the task distribution. The task augmentations with MLTI, however, do not cover a wide embedding space, as shown in Figures 3a and 3c, since the mixup strategy can generate tasks only on the simplex defined by the given set of tasks.
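A minimal sketch of this visualization pipeline with scikit-learn, using random placeholder prototypes in place of the real embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

prototypes = np.random.randn(300 * 5, 64).astype(np.float32)  # placeholder data

# Reduce each prototype to its first 50 principal components, then run t-SNE.
components = PCA(n_components=50).fit_transform(prototypes)
embedding = TSNE(n_components=2, perplexity=30.0).fit_transform(components)
print(embedding.shape)  # (1500, 2) points to scatter-plot as in Figure 3
```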
Although we have shown promising results in various domains, our method requires extra computation for the bilevel optimization used to estimate $\lambda$, the parameters of the set function $\phi_\lambda$, which makes it challenging to apply our method to gradient based meta-learning methods such as MAML. Moreover, our interpolation is limited to classification problems, and it is not straightforward to apply it to regression tasks. Reducing the computational cost of bilevel optimization and extending our framework to regression will be important future work.

## 6 Conclusion

We proposed a novel domain-agnostic task augmentation method, Meta-Interpolation, to tackle the meta-overfitting problem in few-task meta-learning. Specifically, we leveraged expressive neural set functions to interpolate a given set of tasks and trained the interpolating function with bilevel optimization, so that the meta-learner trained with the augmented tasks generalizes to meta-validation tasks. Since the set function is optimized to minimize the loss on the validation tasks, it allows us to tailor the task augmentation strategy to each specific domain. We empirically validated the efficacy of our proposed method on various domains, including image classification, chemical property prediction, and text and sound classification, showing that Meta-Interpolation achieves consistent improvements across all domains. This is in stark contrast to the baselines, which improve generalization in certain domains but degrade performance in others. Furthermore, our theoretical analysis sheds light on how Meta-Interpolation regularizes the meta-learner and improves its generalization performance. Lastly, we discussed the limitations of our method.

## Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), the IITP grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub), the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (NRF-2021R1F1A1061655), the IITP grant funded by the Korea government (MSIT) (No. 2022-0-00713), Samsung Electronics (IO201214-08145-01), and a Google Research Grant. This work was also the result of a study on the HPC Support Project, supported by the Ministry of Science and ICT and NIPA.

## References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[2] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems, 31, 2018.

[3] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, page 969. IEEE, 1991.

[4] Charles Blair. The computational complexity of multi-level linear programs. Annals of Operations Research, 34, 1992.

[5] Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, 2015.

[6] Kaidi Cao, Maria Brbic, and Jure Leskovec. Concept learners for few-shot learning. In International Conference on Learning Representations, 2021.

[7] Richard Li-Yang Chen, Amy Cohn, and Ali Pinar. An implicit optimization approach for survivable network design. In 2011 IEEE Network Science Workshop, pages 180–187. IEEE, 2011.

[8] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2020.

[9] Li Deng.
The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

[11] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930. PMLR, 2019.

[12] Sebastian Flennerhag, Andrei A. Rusu, Razvan Pascanu, Francesco Visin, Hujun Yin, and Raia Hadsell. Meta-learning with warped gradient descent. In International Conference on Learning Representations, 2020.

[13] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901. doi: 10.1080/14786440109462720.

[14] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations, 2019.

[15] Greg Landrum. RDKit: Open-source cheminformatics software, 2018. https://github.com/rdkit/rdkit.

[16] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.

[17] Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics Data Commons: Machine learning datasets and tasks for therapeutics. arXiv preprint, 2021.

[18] Ruili Huang, Menghang Xia, Dac-Trung Nguyen, Tongan Zhao, Srilatha Sakamuru, Jinghua Zhao, Sampada A. Shahane, Anna Rossoshek, and Anton Simeonov. Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85, 2016.

[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.

[20] Tushar Khot, Ashish Sabharwal, and Peter Clark. SciTail: A textual entailment dataset from science question answering. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. https://www.cs.toronto.edu/~kriz/cifar.html.

[23] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019.

[24] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, pages 2927–2936. PMLR, 2018.
[25] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.

[26] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2019.

[27] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540–1552. PMLR, 2020.

[28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[29] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.

[30] Shikhar Murty, Tatsunori B. Hashimoto, and Christopher D. Manning. DReCa: A general task augmentation strategy for few-shot natural language inference. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1113–1125, 2021.

[31] NCI. NCI dataset, 2018. https://github.com/GRAND-Lab/graph_datasets.

[32] Renkun Ni, Micah Goldblum, Amr Sharaf, Kezhi Kong, and Tom Goldstein. Data augmentation for meta-learning. In International Conference on Machine Learning, pages 8152–8161. PMLR, 2021.

[33] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

[34] Karol J. Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pages 1015–1018. ACM Press, 2015. ISBN 978-1-4503-3459-4. doi: 10.1145/2733373.2806390. URL http://dl.acm.org/citation.cfm?doid=2733373.2806390.

[35] Janarthanan Rajendran, Alexander Irpan, and Eric Jang. Meta-learning requires meta-augmentation. Advances in Neural Information Processing Systems, 33:5705–5715, 2020.

[36] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.

[37] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.

[38] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.

[39] Jürgen Schmidhuber. Evolutionary principles in self-referential learning. (On learning how to learn: The meta-meta-... hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1(2), 1987.
[40] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30, 2017.

[41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[42] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.

[43] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

[44] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019.

[45] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29, 2016.

[46] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.

[47] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, 2018.

[48] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.

[49] Huaxiu Yao, Long-Kai Huang, Linjun Zhang, Ying Wei, Li Tian, James Zou, Junzhou Huang, et al. Improving generalization in meta-learning via task augmentation. In International Conference on Machine Learning, pages 11887–11897. PMLR, 2021.

[50] Huaxiu Yao, Linjun Zhang, and Chelsea Finn. Meta-learning with fewer tasks through task interpolation. In International Conference on Learning Representations, 2022.

[51] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-learning without memorization. In International Conference on Learning Representations, 2020.

[52] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.

## Checklist

1. For all authors...
   - (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   - (b) Did you describe the limitations of your work? [Yes] Please see the limitations in Section 6.
   - (c) Did you discuss any potential negative societal impacts of your work?
[N/A]
   - (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   - (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   - (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   - (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   - (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   - (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We run experiments 5 times with different random seeds and show the 95% confidence intervals in Tables 1 and 2.
   - (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   - (a) If your work uses existing assets, did you cite the creators? [Yes] We describe the datasets and cite the creators in Section 5.
   - (b) Did you mention the license of the assets? [N/A]
   - (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   - (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   - (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   - (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   - (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   - (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]