# Adversarial Task Up-sampling for Meta-learning

Yichen Wu¹,², Long-Kai Huang², Ying Wei¹
¹City University of Hong Kong, ²Tencent AI Lab
{wuyichen.am97, hlongkai}@gmail.com, yingwei@cityu.edu.hk

**Abstract.** The success of meta-learning on existing benchmarks is predicated on the assumption that the distribution of meta-training tasks covers that of meta-testing tasks. Frequent violation of this assumption in applications with either insufficient tasks or a very narrow meta-training task distribution leads to memorization or learner overfitting. Recent solutions have pursued augmentation of meta-training tasks, but generating tasks that are both correct and sufficiently imaginary remains an open question. In this paper, we seek an approach that up-samples meta-training tasks from the task representation via a task up-sampling network. The resulting approach, named Adversarial Task Up-sampling (ATU), further suffices to generate tasks that maximally contribute to the latest meta-learner by maximizing an adversarial loss. On few-shot sine regression and image classification datasets, we empirically validate the marked improvement of ATU over state-of-the-art task augmentation strategies in meta-testing performance, as well as the quality of the up-sampled tasks.

## 1 Introduction

The past few years have seen the burgeoning development of meta-learning, a.k.a. learning to learn, which draws upon the meta-knowledge learned from previous tasks (i.e., meta-training tasks) to expedite the learning of novel tasks (i.e., meta-testing tasks) with a few examples.
A sufficient number and diversity of meta-training tasks are pivotal for the generalization capability of the meta-knowledge, so that (1) they cover the true task distribution (i.e., environment [4]) from which meta-testing tasks are sampled, discouraging learner overfitting [23], and (2) the meta-knowledge empowers fast adaptation via the support set for each task, avoiding memorization overfitting [44]. Notwithstanding up to millions of meta-training tasks in benchmark datasets [24, 31], real-world applications such as drug discovery [40] and medical image diagnosis [14] usually have access to only thousands or hundreds of tasks, which puts the meta-knowledge at high risk of learner and memorization overfitting. While early attempts towards improving the generalization capability of the meta-knowledge revolve around regularization methods that limit the capacity of the meta-knowledge [11, 44], recent works on augmentation of meta-training tasks have shown a marked improvement [20, 40, 43]. The objective of task augmentation is to draw the empirical task distribution, formed by assembling Dirac delta functions located at each meta-training task, closer to the true task distribution. Consequently, achieving this objective requires a qualified task augmentation approach to simultaneously possess the following three properties: (1) *task-aware*: the augmented tasks comply with the true task distribution, rather than being erroneous tasks that lead the meta-knowledge astray (tasks A and B in Figure 1a); (2) *task-imaginary*: the augmented tasks cover a substantial portion of the true distribution, embracing a task diversity that task-awareness alone is inadequate to guarantee (tasks C, D, and E in Figure 1b); (3) *model-adaptive*: the augmented tasks are timely in improving the current meta-knowledge, which before augmentation struggles to generalize to them (task F in Figure 1c).

*Part of the work was done while the author interned at Tencent AI Lab.*
*Corresponding authors: Long-Kai Huang and Ying Wei.*

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: Pictorial illustration of the three characteristics possessed by a qualified task augmentation approach, i.e., being (a) task-aware, (b) task-imaginary, and (c) model-adaptive. (d) shows the true task distribution that task augmentation aims to approximate, and (e) presents an example erroneous task in violation of the true task distribution of sinusoidal functions $y = w\sin(x)$ ($w \in [0, 2]$).

Unfortunately, developing such a qualified task augmentation approach remains challenging. First, the task-aware methods sacrifice task diversity for task-awareness: they establish task-awareness by injecting the same random noise into the labels of the support and query sets [23], rotating both support and query images [20], or mixing up support and query examples within each task [40], all of which result in augmented tasks that lie within the immediate vicinity of sampled meta-training tasks, as shown in Figure 1a. Second, the task-imaginary method [43], which mixes up both the support examples of two distinct tasks and their query examples in the feature space, compromises on task-awareness: the resulting examples are even multi-modal in Figure 1e and constitute an erroneous task that fails to comply with the task distribution. Third, how to adaptively augment tasks that maximally improve the meta-knowledge, and thereby the performance on meta-testing tasks, remains unexplored. To this end, we propose the Adversarial Task Up-sampling (ATU) framework to augment tasks that are aware of the task distribution, imaginary, and adaptive to the current meta-knowledge.
Grounded on gradient-based meta-learning algorithms that are generally applicable to either regression or classification problems, ATU takes the initialization of the base learner as the meta-knowledge. Concretely, ATU consists of a task up-sampling network whose input is a task itself and whose outputs are augmented tasks. To ensure that the augmented tasks are imaginary and meanwhile faithful to the underlying task distribution, we train the up-sampling network to minimize the Earth Mover's Distance between the augmented tasks and the local task distribution characterized by a set of sampled tasks. Besides, we enforce the up-sampling network to produce challenging tasks that complement the current initialization, by maximizing the loss of the model adapted from the initialization on their query examples and minimizing the similarity between the gradient of the initialization with respect to their support examples and that of their query examples. In summary, our main contributions are three-fold: (1) we present the first task-level augmentation network that learns to generate tasks that simultaneously meet the qualifications of being task-aware, task-imaginary, and model-adaptive; (2) we provide a theoretical analysis to justify that the proposed ATU framework indeed promotes task-awareness; (3) we conduct comprehensive experiments covering both regression and classification problems and a total of five datasets, where the proposed ATU improves the generalization ability of gradient-based meta-learning algorithms by up to 3.81%.

## 2 Related Work

Table 1: Summary of existing task augmentation strategies.

| Method | Task-aware | Task-imaginary | Model-adaptive |
|---|---|---|---|
| MetaAug [23] | ✓ | | |
| MetaMix [40] | ✓ | | |
| Meta-Maxup [20] | | ✓ | |
| MLTI [43] | | ✓ | |
| ATU | ✓ | ✓ | ✓ |

As a paradigm that effectively adapts the meta-knowledge learned from past tasks to accelerate the learning of new ones, meta-learning has sparked considerable interest in many scenarios [38, 26, 36, 35], especially for few-shot learning.
It falls into four major strands based on what the meta-knowledge is, i.e., optimizer-based methods [3, 37], feed-forward methods [24, 10, 39], metric-based methods [27, 29, 33], and gradient-based methods [9, 15, 42, 12, 6], in which the inner optimizer, the mapping function from the support set to the task-specific model, the distance metric measuring the similarity between samples, and the parameter initialization, respectively, are formulated as the meta-knowledge that enables quick adaptation to a task within a small number of steps. Our method is primarily evaluated on gradient-based methods, which enjoy wide adoption and applicability in both classification and regression problems.

**Within-task Overfitting.** Few-shot learning puts meta-learning, especially gradient-based methods which require optimization of high-dimensional parameters within each task, at risk of within-task overfitting. Some works tackle the problem by reducing the number of parameters to adapt in the inner loop: only updating the head [22] or the feature extractor [21], learning data-dependent latent generative embeddings of parameters [25] or context parameters [48] to adapt, imposing gradient dropout [32], and generating stochastic input-dependent perturbations [13]. Another line of work alleviates the problem through data augmentation within each task. Ni et al. [20] apply standard augmentations such as Random Crop and CutMix to support samples. Sun et al. [28] and Zhang et al. [47] propose to generate more data within a class via a ball generator and generative adversarial networks, respectively. These techniques designed for within-task overfitting, however, have been shown to lend little support against meta-overfitting, which we focus on.

**Meta-overfitting.** Distinguished from traditional overfitting within a task, two types of meta-overfitting, memorization overfitting and learner overfitting, have been pinpointed in [44, 23].
Despite meta-regularization techniques [11, 44] that limit the capacity of the meta-learner, task augmentation strategies [23, 19, 20, 16, 43, 40] have emerged as more effective solutions to meta-overfitting. Table 1 presents a summary of these strategies, excluding the large-rotation strategy [16], which is part of Meta-Maxup [20], and DReCa [19], which applies to natural language inference tasks only. MetaAug [23] augments a task by adding the same random noise to the labels of both the support and query sets, and MetaMix [40] mixes support and query examples within a task. Such within-task augmentation guarantees the validity of the augmented tasks, i.e., being task-aware, though it barely alters the mapping from the support to the query set, i.e., it generates few imaginary tasks beyond the meta-training tasks. Meta-Maxup [20] and MLTI [43] approach this problem via cross-task mixup, unfavorably at the expense of erroneous tasks. Our work seeks a novel task augmentation framework capable of generating tasks that not only meet the task-awareness and task-imagination needs but also adapt to maximally benefit the up-to-the-minute meta-learner.

## 3 Preliminaries

### 3.1 Meta-Learning Problem and Gradient-Based Meta-Learning

A meta-learning model $f$ is trained and evaluated on episodes of few-shot learning tasks. Assume the task distribution is $p(\mathcal{T})$. A few-shot learning task $\mathcal{T}_i$, i.i.d. sampled from $p(\mathcal{T})$, consists of a support set $\mathcal{D}^s_i = (X^s_i, Y^s_i) = \{(x^s_{i,j}, y^s_{i,j})\}_{j=1}^{K_s}$ and a query set $\mathcal{D}^q_i = (X^q_i, Y^q_i) = \{(x^q_{i,j}, y^q_{i,j})\}_{j=1}^{K_q}$, where $X^s_i$ and $Y^s_i$ ($X^q_i$ and $Y^q_i$) are the collections of inputs and labels in the support (query) set, and $K_s$ ($K_q$) is the size of the support (query) set. The most representative gradient-based meta-learning algorithm is MAML [9]. MAML aims to learn an initialization parameter $\theta_0$ of the model $f$ that can be adapted to any new task after a few steps of gradient update.
Concretely, given a specific task $(\mathcal{D}^s_i, \mathcal{D}^q_i)$ and a parametric model $f_\theta$, MAML initializes the model parameter $\theta$ with $\theta_0$ and updates $\theta$ by performing gradient descent on the support set $\mathcal{D}^s_i$. It then optimizes the initialization parameter $\theta_0$ by minimizing the loss $\mathcal{L}$ estimated on the query set $\mathcal{D}^q_i$. The objective of MAML can be formulated as

$$\min_{\theta_0} \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \, \mathcal{L}(\phi_i, \mathcal{D}^q_i), \quad \text{s.t.} \;\; \phi_i = \theta_0 - \alpha \nabla_{\theta_0} \mathcal{L}(\theta_0, \mathcal{D}^s_i). \tag{1}$$

### 3.2 Earth Mover's Distance

To estimate the distance between two tasks, we use the Earth Mover's Distance (EMD). EMD, a.k.a. the Wasserstein metric, is a distance measure between two probability distributions or two sets of points, and is widely used in image retrieval and point cloud up-sampling works [46, 45]. Given two sets $S_1$ and $S_2$ of the same size, EMD calculates their distance as

$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \frac{1}{|S_1|} \sum_{s \in S_1} \| s - \phi(s) \|_2, \tag{2}$$

where $\phi$ is a bijection mapping $S_1$ to $S_2$. The value of EMD in (2) can be obtained by solving the linear programming problem w.r.t. $\phi$.

Figure 2: Illustration of the ATU algorithm with a task up-sampling network. The task up-sampling network consists of a set encoder $g_s(\cdot)$ which extracts a set feature of the input task patch, a coarse task generator $g_c(\cdot)$ which generates coarse tasks given the task patch, and a decoder $g_d(\cdot)$ which generates fine tasks from the coarse tasks based on random perturbations and the set feature.

## 4 Adversarial Task Up-sampling

In practice, the task distribution $p(\mathcal{T})$ is unknown, and we optimize the meta parameter $\theta_0$ with an empirical estimate of Eq. (1) over the meta-training tasks $\{\mathcal{T}_i\}_{i=1}^{N_T}$ as

$$\min_{\theta_0} \; \frac{1}{N_T} \sum_{i=1}^{N_T} \mathcal{L}(\phi_i, \mathcal{D}^q_i), \quad \text{s.t.} \;\; \phi_i = \theta_0 - \alpha \nabla_{\theta_0} \mathcal{L}(\theta_0, \mathcal{D}^s_i). \tag{3}$$

Given a finite set of meta-training tasks, the empirical task distribution may deviate from the true task distribution. A meta-model trained on such a finite set of tasks suffers from memorization or learner overfitting [44, 23], which hurts its generalization to new tasks. To alleviate this problem, we propose a new task up-sampling network to generate a sufficient number of diverse tasks such that the empirical task distribution formed by the original meta-training tasks and the augmented tasks together is closer to the true task distribution. To achieve this, the tasks generated by the task up-sampling network should match the true task distribution and cover a large fraction of it. However, since the true task distribution and its underlying manifold are unknown, we cannot provide the task up-sampling network with explicit information about them. Instead, we generate new tasks by performing Task Up-sampling (TU) from a set of training tasks that implicitly carries the latent task manifold information. The idea of Task Up-sampling is inspired by point cloud up-sampling methods [46, 45], which generate up-sampled points lying on the latent distribution (i.e., the shape) of a given local point patch. Similar to point cloud up-sampling algorithms, our augmentation network receives a task patch consisting of a set of tasks $\mathcal{T}_p = \{\mathcal{T}_i\}_{i=1}^{N_p}$, where $N_p$ is the set size, and generates up-sampled tasks $\mathcal{T}_{up} = \{(\hat{\mathcal{D}}^s, \hat{\mathcal{D}}^q)\}$ that are uniformly distributed over the same underlying task distribution as the task patch. Due to the high complexity of task generation, it is infeasible to directly generate up-sampled tasks without sacrificing their quality. Inspired by [46], we propose a two-stage generation strategy. In the first stage, we produce a sparse set of tasks, aiming to recover the global task distribution of the task patch. The tasks obtained in the first stage are called coarse tasks.
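As a concrete reference point, the bilevel objective of Eq. (3) can be sketched on a toy linear-regression meta-learner. This is our own illustration (finite-difference outer gradient, hypothetical data), not the paper's implementation:

```python
import numpy as np

def inner_adapt(theta0, Xs, ys, alpha=0.01):
    """One inner-loop step: phi = theta0 - alpha * grad L(theta0, D_s), as in Eq. (3)."""
    grad = Xs.T @ (Xs @ theta0 - ys) / len(ys)   # MSE gradient for a linear model
    return theta0 - alpha * grad

def outer_loss(theta, tasks, alpha=0.01):
    """Mean query loss after adaptation: the quantity minimized over theta0."""
    return np.mean([0.5 * np.mean((Xq @ inner_adapt(theta, Xs, ys, alpha) - yq) ** 2)
                    for Xs, ys, Xq, yq in tasks])

def meta_grad(theta, tasks, eps=1e-5):
    """Outer gradient by central finite differences (keeps the sketch tiny)."""
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        e = np.zeros_like(theta)
        e[k] = eps
        g[k] = (outer_loss(theta + e, tasks) - outer_loss(theta - e, tasks)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])                   # shared ground-truth mapping (toy)

def make_set():
    X = rng.standard_normal((10, 2))
    return X, X @ w_true

tasks = [(*make_set(), *make_set()) for _ in range(4)]   # (Xs, ys, Xq, yq) per task
theta0 = np.zeros(2)
for _ in range(200):                             # outer loop of Eq. (3)
    theta0 -= 0.1 * meta_grad(theta0, tasks)
```

Because every toy task here shares the same underlying mapping, the learned initialization drifts toward `w_true`, so a single inner step already fits a new task well.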
In the second stage, we generate multiple tasks for each coarse task, aiming to characterize the local task distribution around each coarse task. To guide the second generation stage, we use the patch feature of the input task set to provide global task information, together with multiple random noise vectors that provide directional perturbations for generating diverse tasks around each coarse task. The generation process is summarized in Fig. 2. Our proposed task up-sampling network consists of three components, namely a coarse task generator $g_c(\cdot)$, a set encoder $g_s(\cdot)$, and a decoder $g_d(\cdot)$. The coarse task generator is similar to a set autoencoder: it first encodes the information of the whole input task set and then decodes it to generate $r_c N_p$ coarse tasks $\mathcal{T}_c = \{\mathcal{T}^c_i\}$. The set encoder $g_s(\cdot)$ extracts the set information of the input task patch as a patch feature $h_s$ to provide global information for the second-stage generation. For each task $\mathcal{T}^c_i$ in $\mathcal{T}_c$, the decoder $g_d(\cdot)$ generates $r_d$ tasks located around $\mathcal{T}^c_i$ on the task manifold, taking as input $r_d$ random perturbations $\{z_i\}_{i=1}^{r_d}$, i.i.d. sampled from a uniform distribution, and the set feature $h_s$. In general, we use the same perturbations for each coarse task $\mathcal{T}^c_i$ in $\mathcal{T}_c$. Finally, we obtain the up-sampled task set $\mathcal{T}_{up}$ consisting of $r N_p$ tasks as $\mathcal{T}_{up} = g_d(\mathcal{T}_c, Z, h_s)$, where $r = r_c r_d$ is the up-sampling ratio. We denote the task augmentation network by $G_{\theta_g}(\mathcal{T}_p, Z)$, where $\theta_g$ denotes its trainable parameters. In each iteration of the training phase, we construct $r N_p$ tasks to form the ground-truth task set $\mathcal{T}_g$ (e.g., randomly selected from the meta-training task set). We then sample $N_p$ tasks from $\mathcal{T}_g$ to form the task patch $\mathcal{T}_p$ and randomly sample $r$ perturbation noise vectors to form the perturbation set $Z = \{z_i\}_{i=1}^{r}$. By feeding $\mathcal{T}_p$ and $Z$ to the task augmentation network, we obtain the up-sampled task set $\mathcal{T}_{up} = G_{\theta_g}(\mathcal{T}_p, Z)$.
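The shapes flowing through the three components can be sketched as follows. Random linear maps stand in for the learned networks, and all dimensions (`D`, `H`, the noise width) are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N_p, D = 4, 16          # tasks per patch, task-embedding dimension (illustrative)
r_c, r_d = 2, 3         # coarse / fine up-sampling factors, r = r_c * r_d
H = 8                   # patch-feature dimension (illustrative)

def g_s(T_p):
    """Set encoder: permutation-invariant patch feature h_s (mean-pool + linear)."""
    W = rng.standard_normal((D, H))
    return T_p.mean(axis=0) @ W                        # shape (H,)

def g_c(T_p):
    """Coarse generator: set-autoencoder-style map to r_c * N_p coarse tasks."""
    W = rng.standard_normal((D, r_c * N_p * D))
    return (T_p.mean(axis=0) @ W).reshape(r_c * N_p, D)

def g_d(T_c, Z, h_s):
    """Decoder: r_d perturbed tasks around each coarse task, conditioned on h_s."""
    W = rng.standard_normal((D + Z.shape[1] + H, D))
    out = [np.concatenate([t, z, h_s]) @ W for t in T_c for z in Z]
    return np.stack(out)                               # shape (r_c * r_d * N_p, D)

T_p = rng.standard_normal((N_p, D))                    # input task patch
Z = rng.uniform(-1, 1, size=(r_d, 4))                  # perturbations shared across coarse tasks
T_up = g_d(g_c(T_p), Z, g_s(T_p))
print(T_up.shape)  # (24, 16), i.e. (r_c * r_d * N_p, D)
```

The point of the sketch is the bookkeeping: a patch of $N_p$ task embeddings becomes $r N_p = r_c r_d N_p$ up-sampled embeddings, with the patch feature $h_s$ injected into every decoding step.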
To train the task augmentation network, we apply an EMD loss between the up-sampled task set $\mathcal{T}_{up}$ and the ground-truth task set $\mathcal{T}_g$ to encourage the generated task set to follow the same distribution as the true task distribution. However, this alone may still be insufficient to make the up-sampled tasks cover a significant fraction of the true task distribution. In that case, the up-sampled tasks provide limited additional information beyond the original meta-training tasks, and the meta-learner benefits little from them. To generate more informative tasks for the meta-learner, we want the generated tasks to be difficult for the current meta-model $\theta_0$. Following [41], we measure the difficulty of a task for $\theta_0$ by the loss estimated on the query set w.r.t. $\phi_i$, i.e., $\mathcal{L}(\phi_i, \hat{\mathcal{D}}^q_i)$, and the gradient similarity between the support and query sets w.r.t. $\theta_0$, i.e., $\langle \nabla_{\theta_0}\mathcal{L}(\theta_0, \hat{\mathcal{D}}^s_i), \nabla_{\theta_0}\mathcal{L}(\theta_0, \hat{\mathcal{D}}^q_i) \rangle$. A large loss and a small gradient similarity indicate a difficult task. Therefore, we maximize the following objective to generate informative tasks:

$$\mathcal{L}_{adv}(\theta_0, (\hat{\mathcal{D}}^s_i, \hat{\mathcal{D}}^q_i)) = \eta_1 \mathcal{L}(\phi_i, \hat{\mathcal{D}}^q_i) - \eta_2 \langle \nabla_{\theta_0}\mathcal{L}(\theta_0, \hat{\mathcal{D}}^s_i), \nabla_{\theta_0}\mathcal{L}(\theta_0, \hat{\mathcal{D}}^q_i) \rangle, \tag{4}$$

where $\eta_1$ and $\eta_2$ are two hyperparameters that control the strength of the two terms in $\mathcal{L}_{adv}$. We call this loss an adversarial loss because it aims to increase the difficulty of the up-sampled tasks for the meta-learner, while the meta-learner is trained to minimize the loss on the generated difficult tasks. We name the proposed algorithm Adversarial Task Up-sampling (ATU). Together with the EMD loss, we obtain the objective for training the task up-sampling network:

$$\mathcal{L}_{ATU}(\theta_g, \mathcal{T}_p) = d_{EMD}(\mathcal{T}_{up}, \mathcal{T}_g) - \frac{1}{r N_p} \sum_{\hat{\mathcal{T}}_i \in \mathcal{T}_{up}} \mathcal{L}_{adv}(\theta_0, (\hat{\mathcal{D}}^s_i, \hat{\mathcal{D}}^q_i)). \tag{5}$$

Note that the gradient of Eq. (5) is not backpropagated to the meta-model $\theta_0$; the meta-model is instead updated by minimizing the meta loss in Eq. (3) on the up-sampled tasks $\mathcal{T}_{up}$. We summarize the proposed ATU in Algorithm 1 in Appendix A.
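A minimal sketch of the training objective in Eq. (5), with the EMD of Eq. (2) solved exactly as a linear assignment problem and the adversarial term of Eq. (4) computed on a toy linear model. This is our illustration; all shapes and data are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(S1, S2):
    """Eq. (2): optimal bijective matching between two equal-size sets of
    task embeddings, solved exactly with the Hungarian algorithm."""
    cost = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def mse_and_grad(theta, X, y):
    """MSE loss of a linear model f(x) = x @ theta, plus its gradient."""
    err = X @ theta - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def adv_loss(theta0, Xs, ys, Xq, yq, alpha=0.01, eta1=1.0, eta2=1.0):
    """Eq. (4): a high post-adaptation query loss and a low support/query
    gradient similarity mark a task as difficult for the current theta0."""
    _, g_s = mse_and_grad(theta0, Xs, ys)
    phi = theta0 - alpha * g_s                 # one inner step, as in Eq. (1)
    L_q, _ = mse_and_grad(phi, Xq, yq)
    _, g_q = mse_and_grad(theta0, Xq, yq)
    return eta1 * L_q - eta2 * np.dot(g_s, g_q)

def atu_loss(theta0, up_embed, gt_embed, up_tasks):
    """Eq. (5): EMD to the ground-truth set minus the mean adversarial loss."""
    adv = np.mean([adv_loss(theta0, *t) for t in up_tasks])
    return emd(up_embed, gt_embed) - adv

rng = np.random.default_rng(0)
up_embed, gt_embed = rng.standard_normal((6, 8)), rng.standard_normal((6, 8))
Xs, Xq = rng.standard_normal((5, 3)), rng.standard_normal((5, 3))
task = (Xs, Xs @ np.ones(3), Xq, Xq @ np.ones(3))
print(atu_loss(np.zeros(3), up_embed, gt_embed, [task]))
```

In the paper the generator's parameters $\theta_g$ would be updated to minimize this quantity (hence maximizing $\mathcal{L}_{adv}$), while $\theta_0$ is held fixed inside Eq. (5).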
### 4.1 ATU on Regression and Classification Problems

Before introducing the details of regression and classification tasks, let us first revisit Eq. (2), where $S_1$ and $S_2$ can be understood as the set of up-sampled tasks $\mathcal{T}_{up}$ and the set of ground-truth tasks $\mathcal{T}_g$, respectively, and each point $s$ in either set represents the embedding of a task.

**Regression Tasks.** We consider a simple regression problem, sinusoidal regression, which is widely used to evaluate the effectiveness of meta-learning methods. Before feeding a task $\mathcal{T}_i = (\mathcal{D}^s_i, \mathcal{D}^q_i)$ to the task up-sampling network, we first need to represent the task as an embedding. In this paper, we combine all samples of the support set and the query set into the embedding of a sine regression task, i.e., $s = [x^s_1, y^s_1, x^s_2, y^s_2, \ldots, x^s_{K_s}, y^s_{K_s}, x^q_1, y^q_1, x^q_2, y^q_2, \ldots, x^q_{K_q}, y^q_{K_q}] \in \mathbb{R}^{2(K_s+K_q)}$, where we sort the support set and the query set such that $x^s_1 \le x^s_2 \le \ldots \le x^s_{K_s}$ and $x^q_1 \le x^q_2 \le \ldots \le x^q_{K_q}$. This sorting makes the task input, and thus the extracted feature of each task, invariant to permutations of the data in the support and query sets, which simplifies the design of the task up-sampling network. To generate the coarse tasks, we first use the set encoder $g_s(\cdot)$ to extract the set feature $h_s$ and directly generate the coarse tasks from $h_s$. We then generate the up-sampled tasks $\mathcal{T}_{up}$ with the decoder $g_d(\cdot)$ and perturbations $z$. Consequently, the dimension of the generated tasks $\mathcal{T}_{up}$ is $(r N_p, 2(K_s+K_q))$, and that of the ground-truth tasks $\mathcal{T}_g$ is also $(r N_p, 2(K_s+K_q))$, where $r$ is the up-sampling ratio and $N_p$ is the set size of a task patch. For the regression problem, we add an extra EMD loss between the support and query sets of each generated task $\hat{\mathcal{T}}_i \in \mathcal{T}_{up}$ to encourage the points in the generated support set and query set to follow the same sinusoidal distribution; the objective is shown in Eq. (6).
$$\mathcal{L}_{ATU}(\theta_g, \mathcal{T}_p) = d_{EMD}(\mathcal{T}_{up}, \mathcal{T}_g) + \eta_3 \frac{1}{r N_p} \sum_{\hat{\mathcal{T}}_i \in \mathcal{T}_{up}} d_{EMD}(\hat{\mathcal{D}}^s_i, \hat{\mathcal{D}}^q_i) - \frac{1}{r N_p} \sum_{\hat{\mathcal{T}}_i \in \mathcal{T}_{up}} \mathcal{L}_{adv}(\theta_0, (\hat{\mathcal{D}}^s_i, \hat{\mathcal{D}}^q_i)), \tag{6}$$

where $\hat{\mathcal{T}}_i = (\hat{\mathcal{D}}^s_i, \hat{\mathcal{D}}^q_i)$ is obtained by transforming each element of $\mathcal{T}_{up}$ back into a support set and a query set.

**Classification Tasks.** Unlike regression tasks, in classification tasks the labels of each class are randomly assigned under the mutually-exclusive setting [44]. For example, for an N-way $K_s$-shot classification problem with $K_q$ query samples per class, the label $y$ in episodic meta-training is a value randomly chosen from $\{0, 1, \ldots, N-1\}$. In light of the fact that the label $y$ is not semantically meaningful, we use only the images $x$ to represent the embedding of an N-way classification task. Concretely, we reshape each task into a task pool of $(K_s+K_q)$ N-way 1-shot tasks without query examples, each represented as $s = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{Nd}$, where $d$ is the dimension of each image and the inputs from the N classes are concatenated in a fixed order (based on classes). We treat the $(K_s+K_q)$ tasks as a task patch and feed them to the task up-sampling network. Since the task distribution of the image classification problem is extremely complex, it is impractical to generate the coarse tasks from a set feature. Instead, we use the original tasks as the coarse tasks and generate the up-sampled tasks via a more informative perturbation around the original tasks. To achieve this, we generate the perturbation by randomly sampling $K_M$ extra images from N different classes in the base set. Then, for an image $x_i$ in a class of a task in the coarse tasks $\mathcal{T}_c$, we subtract it from the $K_M$ images to obtain $K_M$ residual images and their corresponding set features (obtained from $g_s(\cdot)$), concatenate the set features with a noise vector, and use an attention network to obtain a residual-image feature $x^{res}_i$ for the image $x_i$ given the $K_M$ residual images.
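This residual-attention perturbation, anticipating the final addition $x^u_i = x_i + x^{res}_i$ described next, can be sketched as follows. A random linear map stands in for the learned scoring network, and all shapes are illustrative:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def augment_image(x, memory, noise, rng):
    """Perturb a (flattened) image x with an attended combination of K_M
    residual images (memory - x); the scorer is a stand-in random linear map."""
    residuals = memory - x                      # (K_M, d) residual images
    feats = np.tanh(residuals)                  # stand-in for g_s(.) set features
    W = rng.standard_normal(feats.shape[1] + len(noise))
    scores = np.concatenate([feats, np.tile(noise, (len(feats), 1))], axis=1) @ W
    attn = softmax(scores)                      # attention over the K_M residuals
    x_res = attn @ residuals                    # residual-image feature for x
    return x + x_res                            # x_u = x + x_res

rng = np.random.default_rng(0)
d, K_M = 32, 3                                  # image dim (flattened), memory size
x = rng.standard_normal(d)
memory = rng.standard_normal((K_M, d))          # K_M extra images from the base set
x_u = augment_image(x, memory, rng.uniform(-1, 1, 4), rng)
print(x_u.shape)  # (32,)
```

Note the sanity property of this construction: if the memory images coincide with `x`, all residuals vanish and the augmented image equals the original.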
Finally, we generate the image $x^u_i = x_i + x^{res}_i$ for the augmented task. We repeat this process for all images in a coarse task to obtain an augmented task, and apply the $r$ noise vectors to obtain $r$ up-sampled tasks. The dimensions of the generated tasks $\mathcal{T}_{up}$ and the ground-truth tasks $\mathcal{T}_g$ are both $(r(K_s + K_q), Nd)$, where $\mathcal{T}_g$ is obtained by mixing up with images in the memory bank. The whole training objective function is Eq. (5). More details of the network structures and training are given in Appendix B.

## 5 Theoretical Analysis

We introduce the formal definition of an up-sampled task that conforms to task-awareness, based on which we present the essential property of our proposed ATU framework in maximizing task-awareness, compared to previous task augmentation approaches.

**Definition 1 (Task-aware Up-sampling).** Suppose that we are given a set of $N_p$ tasks $\{X_i, Y_i\}_{i=1}^{N_p}$ from which we up-sample a new task $\mathcal{T}_{up}$. For the $i$-th task, the ground-truth parameter that maps the input $X_i$ to the output $Y_i$ is $\theta_i$, i.e., $Y_i = f_{\theta_i}(X_i)$. The up-sampled task $\mathcal{T}_{up} = \{X_u, Y_u\}$ is defined to be task-aware if and only if $\theta_u = g(\theta_1, \ldots, \theta_{N_u})$ and $Y_u = f_{\theta_u}(X_u)$, where $g$ is the up-sampling function and $\theta_u$ is the up-sampled parameter.

This definition states two prerequisites that a task-aware up-sampling has to meet: (1) the up-sampling is performed in the functional space, relating $N_u$ parameters via $g$; (2) the mapping between the input and the output of an up-sampled task satisfies $f_{\theta_u}$.

**Property 1 (Task-awareness Maximization).** Consider $N_u = 2$, $g(\theta_1, \theta_2) = (1-\lambda)\theta_1 + \lambda\theta_2$, $f_{\theta_1}(\cdot) = W_1 \cdot$, and $f_{\theta_2}(\cdot) = W_2 \cdot$. The proposed ATU algorithm, which pursues an up-sampled task $\mathcal{T}_{up} = \{X_u, Y_u\}$ by minimizing the EMD loss between $\mathcal{T}_1$ and $\mathcal{T}_2$, maximizes the task-awareness, i.e., minimizes the distance between $Y_u$ and $f_{\theta_u}(X_u)$.

*Proof.* According to the definition of EMD (Eq.
(2)), it solves $\phi^* = \arg\min_{\phi \in \Phi} \sum_j \| x_{1,j} - x_{2,\phi(j)} \|_2$, where $\Phi = \{\phi: \{1, \ldots, n_1\} \mapsto \{1, \ldots, n_2\}\}$ denotes the set of all possible bijective assignments, each of which gives a one-to-one correspondence between $\mathcal{T}_1$ and $\mathcal{T}_2$. Based on the optimal assignment $\phi^*$, the EMD is defined as $d_{EMD} = \frac{1}{\min\{n_1, n_2\}} \sum_j \| x_{1,j} - x_{2,\phi^*(j)} \|_2$. In light of the difficulty in mathematically formulating a possible up-sampled task $\tilde{\mathcal{T}}_u$ that lies in the local manifold of $\{\mathcal{T}_1, \mathcal{T}_2\}$, we reasonably assume a simplified characterization of an up-sampled task $\tilde{\mathcal{T}}_u$, namely $\tilde{y}_{u,j} = \alpha_{1,j}^T Y_1 + \alpha_{2,j}^T Y_2$, $\tilde{x}_{u,j} = \alpha_{1,j}^T X_1 + \alpha_{2,j}^T X_2, \; \forall j$, where each sample is a convex combination of samples from both $\mathcal{T}_1$ and $\mathcal{T}_2$. The combination coefficients satisfy $\alpha_{1,j}, \alpha_{2,j} \in \mathbb{R}^{(K_s+K_q) \times 1}$, $\sum_k^{K_s+K_q} \alpha_{1,jk} = 1$, $\sum_k \alpha_{2,jk} = 1$, and $\alpha_{1,jk}, \alpha_{2,jk} \ge 0, \; \forall k$. Different combination coefficients lead to a set of up-sampled task candidates $\{\tilde{\mathcal{T}}_u\}$. We evaluate the task-awareness of each candidate $\tilde{\mathcal{T}}_u$, i.e., the distance between $\tilde{Y}_u$ and $f_{\theta_u}(\tilde{X}_u)$, as $\| \tilde{Y}_u - f_{\theta_u}(\tilde{X}_u) \|_2 = \sum_j \| \tilde{y}_{u,j} - f_{\theta_u}(\tilde{x}_{u,j}) \|_2 = \sum_j \| (W_1 - W_2)[\lambda \alpha_{1,j}^T X_1 - (1-\lambda) \alpha_{2,j}^T X_2] \|_2 = \text{LHS}$. (See Appendix E.) Note that $\text{LHS} \le \sum_j \| W_1 - W_2 \|_2 \,(\lambda \| \tilde{x}_{u,j} - x_{2,\phi_2(j)} \|_2 + \| X_2 \|_2)$ and $\text{LHS} \le \sum_j \| W_1 - W_2 \|_2 \,((1-\lambda) \| x_{1,\phi_1(j)} - \tilde{x}_{u,j} \|_2 + \| X_1 \|_2)$. (See Appendix E.) Combining the two inequalities above, we have $\text{LHS} \le \sum_j \| W_1 - W_2 \|_2 \min\{\lambda \| \tilde{x}_{u,j} - x_{2,\phi_2(j)} \|_2 + \| X_2 \|_2, \; (1-\lambda) \| x_{1,\phi_1(j)} - \tilde{x}_{u,j} \|_2 + \| X_1 \|_2\}$. In practice, it is easy to normalize all the tasks in the feature space, which leads to $\| X_1 \|_2 = \| X_2 \|_2$. Therefore, by minimizing the EMD loss $d_{EMD} = \min\{\min_{\phi_2} \sum_j \| \tilde{x}_{u,j} - x_{2,\phi_2(j)} \|_2, \; \min_{\phi_1} \sum_j \| \tilde{x}_{u,j} - x_{1,\phi_1(j)} \|_2\}$, the proposed task up-sampling network identifies from the candidate set $\{\tilde{\mathcal{T}}_u\}$ the task $\mathcal{T}_u$ that has the minimal distance between $Y_u$ and $f_{\theta_u}(X_u)$; in other words, the task-awareness is maximized.

Previous task augmentation approaches directly mix up two tasks without minimizing the EMD loss, i.e., $y_{u,j} = (1-\lambda) y_{1,j} + \lambda y_{2,j}$, $x_{u,j} = (1-\lambda) x_{1,j} + \lambda x_{2,j}$.
In this case, the task-awareness is unwarranted, as we have illustrated in Section 1, provided that $\| Y_u - f_{\theta_u}(X_u) \|_2^2 = \sum_j \| (1-\lambda) y_{1,j} + \lambda y_{2,j} - [(1-\lambda) W_1 + \lambda W_2][(1-\lambda) x_{1,j} + \lambda x_{2,j}] \|_2^2 = \sum_j \lambda^2 (1-\lambda)^2 \| (W_1 - W_2)(x_{1,j} - x_{2,j}) \|_2^2$.

## 6 Experiments

To evaluate the effectiveness of ATU, we conduct extensive experiments to answer the following questions. **Q1:** How does ATU perform compared to state-of-the-art task-augmentation-based and regularization-based meta-learning methods? **Q2:** Can the proposed ATU consistently improve the performance of different meta-learning methods? **Q3:** What do the tasks up-sampled by ATU look like? **Q4:** What is the influence of increasing the number of meta-training tasks on the performance improvement of ATU?

**Benchmarks.** We compare ATU with state-of-the-art task augmentation strategies for meta-learning, including MetaAug [23], MetaMix [40], Meta-Maxup [20], and MLTI [43], and regularization methods, including Meta-Dropout [13], TAML [11], and Meta-Reg [44], for both regression and classification problems. We also consider a variant of ATU which removes the adversarial loss $\mathcal{L}_{adv}$ and trains the task augmentation network through the EMD loss only; we denote this variant by TU. To validate the consistent effect of ATU in improving different meta-learners, we apply ATU and TU to MAML [9], MetaSGD [15], and ANIL [22]. We also consider cross-domain settings where the meta-testing tasks come from different domains.

Table 2: MSE with 95% confidence intervals on sinusoidal regression.
| Model | 10-shot | 20-shot | 30-shot |
|---|---|---|---|
| *Meta-Learner: MAML* | | | |
| MAML [9] | 0.93 ± 0.18 | 0.65 ± 0.13 | 0.58 ± 0.12 |
| DropGrad [32] | 0.91 ± 0.17 | 0.62 ± 0.12 | 0.55 ± 0.13 |
| MetaAug [23] | 0.93 ± 0.18 | 0.65 ± 0.14 | 0.58 ± 0.12 |
| MetaMix [40] | 0.81 ± 0.17 | 0.58 ± 0.12 | 0.56 ± 0.11 |
| MLTI [43] | 0.92 ± 0.17 | 0.65 ± 0.13 | 0.62 ± 0.12 |
| TU | 0.84 ± 0.16 | 0.55 ± 0.12 | 0.47 ± 0.10 |
| ATU | **0.70 ± 0.14** | **0.47 ± 0.13** | **0.42 ± 0.11** |
| *Meta-Learner: MetaSGD* | | | |
| MetaSGD [15] | 0.70 ± 0.17 | 0.49 ± 0.11 | 0.42 ± 0.09 |
| MetaMix [40] | 0.60 ± 0.15 | 0.37 ± 0.09 | 0.37 ± 0.08 |
| MLTI [43] | 0.66 ± 0.16 | 0.51 ± 0.11 | 0.44 ± 0.10 |
| TU | 0.54 ± 0.11 | 0.36 ± 0.08 | 0.31 ± 0.08 |
| ATU | **0.49 ± 0.10** | **0.34 ± 0.08** | **0.29 ± 0.08** |

Table 3: MSE with 95% confidence intervals on cross-domain sinusoidal regression.

| Cross-domain | Frequency [0.4, 0.8] | Amplitude [5.0, 6.0] | Phase [−π, 0] |
|---|---|---|---|
| *Meta-Learner: MAML* | | | |
| MAML [9] | 1.78 ± 0.35 | 3.52 ± 0.35 | 3.12 ± 0.52 |
| MetaMix [40] | 1.67 ± 0.30 | 3.60 ± 0.28 | 3.14 ± 0.54 |
| MLTI [43] | 1.92 ± 0.42 | 3.56 ± 0.37 | 3.66 ± 0.63 |
| TU | 1.70 ± 0.34 | 3.22 ± 0.31 | 2.88 ± 0.48 |
| ATU | **1.58 ± 0.35** | **2.92 ± 0.29** | **2.58 ± 0.48** |
| *Meta-Learner: MetaSGD* | | | |
| MetaSGD [15] | 2.24 ± 0.46 | 2.42 ± 0.32 | 2.73 ± 0.56 |
| MetaMix [40] | 1.77 ± 0.35 | 2.50 ± 0.27 | 2.46 ± 0.48 |
| MLTI [43] | 1.80 ± 0.42 | 2.56 ± 0.28 | 2.54 ± 0.54 |
| TU | **1.64 ± 0.38** | 2.37 ± 0.25 | **2.04 ± 0.46** |
| ATU | 1.71 ± 0.40 | **2.19 ± 0.23** | 2.53 ± 0.62 |

### 6.1 Regression

**Experimental Setup.** Following [15], we construct the K-shot regression task by sampling from the target sine curve $y(x) = A \sin(\omega x + b)$, where the amplitude $A \in [0.1, 5.0]$, the frequency $\omega \in [0.8, 1.2]$, the phase $b \in [0, \pi]$, and $x$ is sampled from $[-5.0, 5.0]$. In the meta-training phase, each task contains K support and K target (K = 10) examples. We adopt mean squared error (MSE) as the loss function. For the base model $f_\theta$, we adopt a small neural network consisting of an input layer of size 1, two hidden layers of size 40 with ReLU, and an output layer of size 1.

Figure 3: The augmented regression tasks generated by different augmentation-based methods.

We use one
gradient update with a fixed step size α = 0.01 in the inner loop, and use Adam as the outer-loop optimizer, following [9, 15]. The meta-learner is trained on 240,000 tasks with a meta batch size of 4. In the meta-testing stage, we randomly sample 100 sine curves as meta-testing tasks, each containing K support samples and 100 query examples. The data points x in the query set are evenly distributed on [−5.0, 5.0]. The average MSE with 95% confidence intervals over these 100 sine curves with K = 10, 20, 30 is reported in Table 2. We also perform cross-domain experiments by sampling 100 sine curves whose frequencies, amplitudes, or phases differ from those of the tasks in the meta-training set, and report the results in Table 3. More settings of the up-sampling networks are listed in Appendix C.

**Performance.** The results in Table 2 and Table 3 show that ATU consistently outperforms the baseline methods MAML, MetaMix, and MLTI in the different K-shot (K ∈ {10, 20, 30}) settings and the various domain settings. These results validate that the tasks generated by ATU better approximate the true task distribution and provide more information to the meta-learner than MetaMix and MLTI, thus enabling better generalization of the model. We further verify the superiority of the proposed methods by visualizing the augmented tasks generated by the proposed methods and the baseline methods. The visualization results in Fig. 3 show that the points in the tasks generated by TU and ATU fit the sine curve well, while the points in the tasks generated by MLTI and MetaMix deviate from the sine curve. This indicates that the augmented tasks generated by TU and ATU match the true task distribution. It is noteworthy that the support set and query set generated by ATU differ significantly from those generated by TU, which indicates that the tasks generated by ATU are more difficult.
The greater difficulty of ATU's tasks, together with the result that ATU outperforms TU in most experiments, demonstrates the effectiveness of the adversarial loss in generating informative tasks that improve the generalization of the meta-learner. In Fig. 4, we also visualize the adaptation of meta-learners trained with different task augmentation methods on a 10-shot meta-test regression task. Compared to MAML trained on the original meta-training tasks, MAML trained on tasks generated by TU fits the ground-truth sinusoid after only one update, and ATU performs even better than TU. This again validates that the augmented tasks generated by TU and ATU are more informative for the meta-learner in learning meta-knowledge from the true task distribution.

Figure 4: Initialization (dotted) and one-step adaptation (solid) regression curves of MAML, TU, and ATU when K = 10.

6.2 Classification

Experimental setup. We follow MLTI to evaluate the performance of task augmentation algorithms for few-shot classification with a limited number of base classes in the meta-training set under non-label-sharing settings. We consider four datasets (number of base classes in parentheses): miniImagenet-S (12), ISIC [18] (4), Dermnet-S (30), and Tabular Murris [5] (57), covering classification tasks on general natural images, medical images, and gene data. Note that miniImagenet-S and Dermnet-S are constructed by limiting the base classes of miniImagenet [33] and Dermnet [1], respectively. We construct N-way K-shot tasks, setting N = 5 for miniImagenet-S, Dermnet-S, and Tabular Murris and N = 2 for ISIC due to its limited number of base classes, with K = 1 or K = 5. Recall that ATU relies on KM extra images to generate augmented tasks in image classification. To make the training process more efficient, we set KM to 3 and the up-sampling rate r to 2. More details of these datasets and the settings of the task up-sampling networks are listed in Appendix D.
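The N-way K-shot episode construction described in the classification setup can be sketched as follows. This is an illustrative sketch of standard episode sampling from a limited pool of base classes, with hypothetical names, not the paper's implementation.

```python
import random

def build_episode(class_to_examples, N, K, query_per_class, rng):
    """Build one N-way K-shot episode from a limited pool of base classes.

    `class_to_examples` maps a class id to a list of example ids.
    Classes are relabeled 0..N-1 within each episode.
    """
    classes = rng.sample(sorted(class_to_examples), N)
    support, query = [], []
    for label, cls in enumerate(classes):
        chosen = rng.sample(class_to_examples[cls], K + query_per_class)
        support += [(ex, label) for ex in chosen[:K]]
        query += [(ex, label) for ex in chosen[K:]]
    return support, query

rng = random.Random(0)
# e.g. a miniImagenet-S-style pool: 12 base classes, 20 examples each (toy sizes)
pool = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(12)}
support, query = build_episode(pool, N=5, K=1, query_per_class=15, rng=rng)
```

With only 12 base classes, the number of distinct 5-way combinations is small, which is exactly the limited-task regime that task augmentation targets.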
Table 4: Average accuracy under different settings of few-shot classification on various datasets.

| Model | miniImagenet-S (1/5-shot) | ISIC (1/5-shot) | DermNet-S (1/5-shot) | Tabular Murris (1/5-shot) |
|---|---|---|---|---|
| MAML [9] | 38.27% / 52.14% | 57.59% / 65.24% | 43.47% / 60.56% | 79.08% / 88.55% |
| Meta-Reg [44] | 38.35% / 51.74% | 58.57% / 68.45% | 45.01% / 60.92% | 79.18% / 89.08% |
| TAML [11] | 38.70% / 52.75% | 58.39% / 66.09% | 45.73% / 61.14% | 79.82% / 89.11% |
| Meta-Dropout [13] | 38.32% / 52.53% | 58.40% / 67.32% | 44.30% / 60.86% | 78.18% / 89.25% |
| MetaMix [40] | 39.43% / 54.14% | 60.34% / 69.47% | 46.81% / 63.52% | 81.06% / 89.75% |
| Meta-Maxup [20] | 39.28% / 53.02% | 58.68% / 69.16% | 46.10% / 62.64% | 79.56% / 88.88% |
| MLTI [43] | 41.58% / 55.22% | 61.79% / 70.69% | 48.03% / 64.55% | 81.73% / 91.08% |
| TU | 42.16% / 56.33% | 62.03% / 73.97% | 48.07% / 64.81% | 81.88% / 91.15% |
| ATU | 42.60% / 56.78% | 62.84% / 74.50% | 48.33% / 65.16% | 82.04% / 91.42% |

Performance. We report the performance on the four datasets in Table 4. On all four datasets, the proposed ATU consistently outperforms the baselines, including the augmentation-based methods (MetaMix, Meta-Maxup, and MLTI) and the regularization-based methods (Meta-Reg, TAML, and Meta-Dropout), while TU achieves the second-best performance in all experiments. We also observe that our method achieves a large improvement on the ISIC dataset, which consists of only 4 base classes, indicating its effectiveness in limited-task scenarios. We further evaluate how well the proposed ATU improves generalization for different backbone meta-learners by comparing MLTI and ATU under the 1-shot setting with the backbone meta-learners MetaSGD and ANIL. The results are presented in Table 5: ATU again consistently outperforms MLTI. All these results validate the superiority of the proposed ATU and TU over the existing baselines in generating informative tasks that improve the performance of different backbone meta-learners.
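Because the up-sampled tasks are consumed like ordinary meta-training tasks, plugging ATU into a backbone meta-learner amounts to interleaving each original task with its generated variants (here at the up-sampling rate r = 2 used in the classification experiments). The loop below is a minimal sketch of this interface; `CountingLearner` and `up_sample` are stand-ins, not the paper's API.

```python
class CountingLearner:
    """Stand-in backbone meta-learner that only counts meta-update steps."""
    def __init__(self):
        self.steps = 0

    def step(self, task):
        self.steps += 1

def meta_train(tasks, up_sample, learner, r=2):
    """Feed each original task plus r up-sampled variants to the learner.

    `up_sample` stands in for the trained task up-sampling network; the
    loop structure is our illustration of the interface, not the paper's code.
    """
    for task in tasks:
        for t in [task] + [up_sample(task) for _ in range(r)]:
            learner.step(t)

learner = CountingLearner()
meta_train(tasks=list(range(4)), up_sample=lambda t: t, learner=learner, r=2)
```

This separation is why the same generated tasks can benefit MAML, MetaSGD, and ANIL alike: the augmentation sits entirely on the task-sampling side of the loop.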
We also evaluate the performance of ATU in cross-domain adaptation settings. In Table 6, we present the results of applying the meta-model trained on miniImagenet-S to Dermnet-S, and vice versa. ATU improves the generalization performance of MAML (the backbone meta-learner in this experiment) by a large margin, indicating that ATU consistently improves the backbone meta-model's generalization ability even under challenging cross-domain settings.

Table 5: Comparison of compatibility with different backbone meta-learning algorithms on 1-shot classification.

| Method | mini-S | ISIC | Derm-S | TM |
|---|---|---|---|---|
| MetaSGD [15] | 37.88% | 58.79% | 42.07% | 81.55% |
| MetaSGD+MLTI | 39.58% | 61.57% | 45.49% | 83.31% |
| MetaSGD+ATU | 40.52% | 62.84% | 46.78% | 83.84% |
| ANIL [22] | 38.02% | 59.48% | 44.58% | 75.67% |
| ANIL+MLTI | 39.15% | 61.78% | 46.79% | 77.11% |
| ANIL+ATU | 39.27% | 62.12% | 47.03% | 77.23% |

Table 6: Cross-domain adaptation experiments between mini-S and Dermnet-S. A → B denotes that the backbone meta-model is meta-trained on A and meta-tested on B.

| Model | mini-S → Derm-S (1/5-shot) | Derm-S → mini-S (1/5-shot) |
|---|---|---|
| MAML [9] | 34.46% / 50.36% | 28.78% / 41.29% |
| MAML+ATU | 36.86% / 51.98% | 30.68% / 46.72% |
| MetaSGD [15] | 31.07% / 49.07% | 28.17% / 41.83% |
| MetaSGD+ATU | 37.75% / 54.60% | 30.78% / 44.01% |

Figure 5: The averaged accuracy of MAML and ATU on the miniImagenet dataset with different numbers of meta-training classes (12, 25, 38, 51, 64), under the 1-shot and 5-shot settings.

Figure 6: t-SNE visualization of original and up-sampled tasks under the 1-shot miniImagenet-S setting.

Effect of the number of meta-training tasks. We analyze how the performance improvement of ATU over MAML changes with the number of meta-training tasks in the 1-shot and 5-shot settings. The results presented in Fig.
5 show that ATU significantly improves the performance of MAML, by about 4.5%–5%, when the number of base classes is 12, while the improvement shrinks as the number of base classes increases in both the 1-shot and 5-shot settings. As the number of base classes grows, the number of training tasks increases rapidly and the empirical task distribution constructed from the meta-training tasks comes closer and closer to the true latent task distribution, so the extra information provided by tasks generated by ATU diminishes. However, even when all available base classes are used in meta-training (i.e., 64 meta-training classes), the proposed ATU still helps improve the performance of MAML.

Visualization of the generated tasks. We visualize the up-sampled tasks via t-SNE to evaluate their quality for MAML under the 1-shot miniImagenet-S setting. Concretely, we up-sample 100 tasks from 5 original tasks via ATU, using different perturbations for each task. To visualize the relationship between generated and original tasks with t-SNE, we represent each task by concatenating the vectors of its support and query sets. The results presented in Fig. 6 show that the up-sampled tasks stay near the original tasks, meaning they match the true task distribution; the generated tasks are thus task-aware. Besides, the augmented tasks are diverse and cover a substantial portion of the space around the original tasks, demonstrating the task-imaginary property. These two observations suggest that the proposed ATU is a qualified task augmentation algorithm.

Figure 7: Visualization of an up-sampled task generated by ATU when η3 = 0 in Eq. (6).

Effect of the extra EMD loss d_EMD(D̂_i^s, D̂_i^q) in regression tasks.
As presented in Section 4.1, we apply an extra EMD loss d_EMD(D̂_i^s, D̂_i^q) between the support and query sets of each generated task to encourage the points in the generated support set and query set to follow the same sine curve. In Fig. 3, we visualized tasks generated by the Task Up-sampling Network trained with this extra EMD loss. Here we provide an additional visualization of tasks generated by the Task Up-sampling Network trained without the extra EMD loss in Fig. 7. The support set and the query set do not lie on the same sinusoid, indicating that the generated tasks need this additional supervision to avoid being too difficult or not even valid tasks.

7 Conclusion and Limitation

In this paper, we propose the first task-level up-sampling network that learns to generate tasks that simultaneously meet the qualifications of being task-aware, task-imaginary, and model-adaptive. The proposed Adversarial Task Up-sampling (ATU) takes a set of tasks as input and learns to up-sample tasks that comply with the true task distribution while being informative enough to improve the generalization of the meta-learner. We theoretically justify that ATU promotes task-awareness and empirically verify that ATU improves the generalization of various backbone meta-learners for both regression and classification tasks on five datasets.

Limitations. Our theoretical results are obtained under some strong assumptions, but all the experiments and visualization outcomes validate our method's effectiveness in realistic settings.

8 Acknowledgement

This work is sponsored by the Tencent AI Lab Gift Fund (Project 9229073) and the CityU Strategic Interdisciplinary Research Grant (Project 7020064).

References

[1] Dermnet dataset, 2016. http://www.dermnet.com/.
[2] Fungi dataset, 2018. https://www.kaggle.com/c/fungi-challenge-fgvc-2018.
[3] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas.
Learning to learn by gradient descent by gradient descent. In NeurIPS, pages 3981–3989, 2016.
[4] Jonathan Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39, 1997.
[5] Kaidi Cao, Maria Brbic, and Jure Leskovec. Concept learners for few-shot learning. In ICLR, 2020.
[6] Can Chen, Xi Chen, Chen Ma, Zixuan Liu, and Xue Liu. Gradient-based bi-level optimization for deep learning: A survey. arXiv preprint arXiv:2207.11719, 2022.
[7] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2018.
[8] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
[10] David Ha, Andrew M Dai, and Quoc V Le. Hypernetworks. In ICLR, 2017.
[11] Muhammad Abdullah Jamal and Guo-Jun Qi. Task agnostic meta-learning for few-shot learning. In CVPR, pages 11719–11727, 2019.
[12] Hae Beom Lee, Hayeon Lee, Donghyun Na, Saehoon Kim, Minseop Park, Eunho Yang, and Sung Ju Hwang. Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. In ICLR, 2020.
[13] Hae Beom Lee, Taewook Nam, Eunho Yang, and Sung Ju Hwang. Meta dropout: Learning to perturb latent features for generalization. In ICLR, 2019.
[14] Xiaomeng Li, Lequan Yu, Yueming Jin, Chi-Wing Fu, Lei Xing, and Pheng-Ann Heng. Difficulty-aware meta-learning for rare disease diagnosis. In MICCAI, pages 357–366, 2020.
[15] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
[16] Jialin Liu, Fei Chao, and Chih-Min Lin. Task augmentation by rotating for meta-learning. arXiv preprint arXiv:2003.00804, 2020.
[17] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[18] Md Ashraful Alam Milton. Automated skin lesion classification using ensemble of deep neural networks in ISIC 2018: Skin lesion analysis towards melanoma detection challenge. arXiv preprint arXiv:1901.10802, 2019.
[19] Shikhar Murty, Tatsunori Hashimoto, and Christopher D Manning. DReCa: A general task augmentation strategy for few-shot natural language inference. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1113–1125, 2021.
[20] Renkun Ni, Micah Goldblum, Amr Sharaf, Kezhi Kong, and Tom Goldstein. Data augmentation for meta-learning. In ICML, pages 8152–8161, 2021.
[21] Jaehoon Oh, Hyungjun Yoo, Chang Hwan Kim, and Se-Young Yun. BOIL: Towards representation change for few-shot learning. In ICLR, 2020.
[22] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In ICLR, 2019.
[23] Janarthanan Rajendran, Alex Irpan, and Eric Jang. Meta-learning requires meta-augmentation. arXiv preprint arXiv:2007.05549, 2020.
[24] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[25] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2018.
[26] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-Weight-Net: Learning an explicit mapping for sample weighting. NeurIPS, 32, 2019.
[27] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4080–4090, 2017.
[28] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning.
In CVPR, pages 403–412, 2019.
[29] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199–1208, 2018.
[30] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In ECCV, pages 266–282. Springer, 2020.
[31] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, et al. Meta-Dataset: A dataset of datasets for learning to learn from few examples. In ICLR, 2019.
[32] Hung-Yu Tseng, Yi-Wen Chen, Yi-Hsuan Tsai, Sifei Liu, Yen-Yu Lin, and Ming-Hsuan Yang. Regularizing meta-learning via gradient dropout. In ACCV, 2020.
[33] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. NeurIPS, 29:3630–3638, 2016.
[34] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[35] Renzhen Wang, Kaiqin Hu, Yanwen Zhu, Jun Shu, Qian Zhao, and Deyu Meng. Meta feature modulator for long-tailed recognition. arXiv preprint arXiv:2008.03428, 2020.
[36] Renzhen Wang, Xixi Jia, Quanziang Wang, and Deyu Meng. Learning to adapt classifier for imbalanced semi-supervised learning. arXiv preprint arXiv:2207.13856, 2022.
[37] Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. In ICML, pages 3751–3760, 2017.
[38] Yichen Wu, Jun Shu, Qi Xie, Qian Zhao, and Deyu Meng. Learning to purify noisy labels via meta soft label corrector. In AAAI, pages 10388–10396, 2021.
[39] Jin Xu, Jean-Francois Ton, Hyunjik Kim, Adam Kosiorek, and Yee Whye Teh. MetaFun: Meta-learning with iterative functional updates.
In ICML, pages 10617–10627, 2020.
[40] Huaxiu Yao, Long-Kai Huang, Linjun Zhang, Ying Wei, Li Tian, James Zou, Junzhou Huang, et al. Improving generalization in meta-learning via task augmentation. In ICML, pages 11887–11897, 2021.
[41] Huaxiu Yao, Yu Wang, Ying Wei, Peilin Zhao, Mehrdad Mahdavi, Defu Lian, and Chelsea Finn. Meta-learning with an adaptive task scheduler. NeurIPS, 34, 2021.
[42] Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. Hierarchically structured meta-learning. In ICML, pages 7045–7054, 2019.
[43] Huaxiu Yao, Linjun Zhang, and Chelsea Finn. Meta-learning with fewer tasks through task interpolation. In ICLR, 2022.
[44] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-learning without memorization. In ICLR, 2020.
[45] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. PU-Net: Point cloud upsampling network. In CVPR, pages 2790–2799, 2018.
[46] Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert. PCN: Point completion network. In 2018 International Conference on 3D Vision (3DV), pages 728–737, 2018.
[47] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. MetaGAN: An adversarial approach to few-shot learning. NeurIPS, 2:8, 2018.
[48] Luisa Zintgraf, Kyriacos Shiarli, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In ICML, pages 7693–7702, 2019.
Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 7.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 5.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Section 5.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We exploit open-source datasets and will release the code.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 6.2 and Appendix C/D.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section 6.2.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 6.2.
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]