# Task Cooperation for Semi-Supervised Few-Shot Learning

Han-Jia Ye, Xin-Chun Li, De-Chuan Zhan
State Key Laboratory for Novel Software Technology, Nanjing University
{yehj, lixc, zhandc}@lamda.nju.edu.cn

De-Chuan Zhan is the corresponding author. This work is supported by NSFC (61773198, 6163000043, 61921006, 62006112) and the NSF of Jiangsu Province (BK20200313).

Training a model with limited data is an essential task for machine learning and visual recognition. Few-shot learning approaches meta-learn a task-level inductive bias from SEEN-class few-shot tasks, and the meta-model is expected to facilitate few-shot learning with UNSEEN classes. Inspired by the idea that unlabeled data can be utilized to smooth the model space in traditional semi-supervised learning, we propose TAsk COoperation (TACO), which takes advantage of unsupervised tasks to smooth the meta-model space. Specifically, we couple the labeled support set in a few-shot task with easily collected unlabeled instances; prediction agreement on these instances encodes the relationship between tasks. The learned smooth meta-model promotes the generalization ability on supervised UNSEEN few-shot tasks. State-of-the-art few-shot classification results on MiniImageNet and TieredImageNet verify the superiority of TACO in leveraging unlabeled data and task relationships in meta-learning.

## Introduction

Both instance collection and labeling costs influence the practical utility of a model in real-world applications, which requires a classifier to be trained with limited examples. For example, a robotic agent should be able to imitate behaviors from one single demonstration (Yu et al. 2018). One solution to the Few-Shot Learning (FSL) problem takes advantage of data from related classes. Towards training effective classifiers for few-shot tasks with UNSEEN classes (a.k.a. the meta-test phase), meta-learning mimics the few-shot task evaluations on the SEEN class set (a.k.a. the meta-train set) and extracts task-level inductive bias in the meta-training phase (Baxter 2000; Vilalta and Drissi 2002; Maurer, Pontil, and Romera-Paredes 2016). For example, the instance embedding function (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017), model initialization (Finn, Abbeel, and Levine 2017; Nichol, Achiam, and Schulman 2018), functional mapping (Qiao et al. 2018), and optimization strategies (Ravi and Larochelle 2017) all facilitate FSL.

Figure 1: Analogy between semi-supervised learning (top) and semi-supervised few-shot learning (bottom), where unlabeled instances (resp. unsupervised tasks) assist in shaping a smooth model (resp. meta-model) space. Limited labeled instances (resp. tasks) make revealing the characteristics of the data difficult (left). Unsupervised tasks facilitate a meta-model to generalize better and to construct close classifiers for similar tasks (right).

During meta-training, episodes of few-shot tasks, i.e., pairs of a few-shot support set and a same-distribution query set, are sampled from the SEEN class set to update the meta-model (as in Fig. 2(a)). Specifically, a task-specific classifier is derived from the meta-model based on the few-shot support set, and the classifier's performance is measured on the corresponding query set. The supervision of meta-learning comes from the labels in the query set, so we refer to such tasks as supervised ones.
Similarly, we introduce the unsupervised task as a task with a few-shot labeled support set and an unlabeled pool set. The pool set contains easily collected instances from any (even distractor) class, but it is difficult to use it to provide supervision in meta-training directly. In this paper, we propose the TAsk COoperation (TACO) approach for few-shot classification, which takes advantage of the task relationship by incorporating both supervised and unsupervised tasks during meta-training (we denote this setting as Semi-Supervised Few-Shot Learning (SS-FSL) in Fig. 2(c)).

As shown in Fig. 1 (bottom), directly learning the meta-model over supervised tasks could lead to a biased meta-model space, which constructs diverse classifiers for similar tasks. TACO makes the meta-model space smooth, so that similar support sets are mapped to close classifiers and the meta-model generalizes better. In detail, the similarity among few-shot tasks is measured by their prediction agreement over the unlabeled pool set, which corresponds to the notion in the semi-supervised learning paradigm that similar samples (resp. tasks) have similar labels (resp. classification behavior). It is notable that unlabeled data are only used during meta-training to measure the smoothness, and the meta-model acts in a fully supervised manner in the meta-test phase. Several relatedness measures between tasks and few-shot classifiers are proposed and investigated for TACO. The same meta-learning mechanism extends various supervised few-shot approaches like ProtoNet (Snell, Swersky, and Zemel 2017) and ProtoMAML (Triantafillou et al. 2020). TACO variants not only achieve superior performance in different semi-supervised configurations but also obtain higher accuracy on fully supervised benchmarks like MiniImageNet.

Figure 2: The difference between supervised few-shot learning (FSL), Few-Shot Semi-Supervised Learning (FS-SSL), and Semi-Supervised Few-Shot Learning (SS-FSL). In FSL (a), episodes of supervised tasks are sampled to train the meta-model (top), and the meta-test phase follows the same scenario (bottom). In FS-SSL, each SEEN few-shot task is paired with an additional same-distributed unlabeled set (U) to enhance its ability individually, and correspondingly, the meta-model only learns how to utilize unlabeled data within a specific semi-supervised task. SS-FSL emphasizes taking advantage of the unlabeled data to construct unsupervised tasks (a joint set with S and U) from a macro-perspective, obtaining a smooth meta-model.

In summary, from the standpoint of traditional semi-supervised learning, we utilize unlabeled data from a macro-perspective to form unsupervised few-shot tasks and encourage close tasks to behave similarly. We propose TACO to incorporate the relationship between tasks to obtain a smooth meta-model space, which still generalizes well even in the fully supervised meta-test phase (without unlabeled data). The experimental results on both SS-FSL and standard FSL verify the effectiveness of the TACO approach.

## Related Work

Training a model with limited examples is essential due to the instance collection and labeling costs (Li, Fergus, and Perona 2006; Lake et al. 2011; Lake, Salakhutdinov, and Tenenbaum 2015). To address the data scarcity problem, two important paradigms, semi-supervised learning and meta-learning, are usually considered.
Semi-Supervised Learning (SSL) discovers the latent structure of data via unlabeled instances (Bennett and Demiriz 1998; Chapelle, Schölkopf, and Zien 2010; Oliver et al. 2018). To ensure smoothness of predictions, prediction consistency (Sajjadi, Javanmardi, and Tasdizen 2016), low-entropy regions (Grandvalet and Bengio 2004), and data generation (Kingma et al. 2014) act as key principles.

Meta-learning deals with the FSL problem by extracting task-level inductive bias from SEEN classes, and then generalizes to UNSEEN-class few-shot tasks (Maurer, Pontil, and Romera-Paredes 2016; Chao et al. 2020). Examples include the embedding-based (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017; Lee et al. 2019), gradient-based (Finn, Abbeel, and Levine 2017; Nagabandi et al. 2019), and generative (Zhang et al. 2019) meta-learning methods.

Recent literature explores the usage of unlabeled data in FSL. Transductive few-shot learning assumes all test instances come simultaneously and are used as the unlabeled pool, so as to leverage the latent structure between training and test instances (Liu et al. 2019; Qiao et al. 2019). In Few-Shot Semi-Supervised Learning (FS-SSL), each task is equipped with an auxiliary set of unlabeled instances (even from distractor classes) in both meta-training and meta-test stages (Boney and Ilin 2017; Ayyad et al. 2019), and the meta-model learns to provide better classifier estimation based on the unlabeled data (Ren et al. 2018; Khodadadeh, Bölöni, and Shah 2019) (as in Fig. 2(b)).

Instead of formulating an SSL problem in each few-shot task, we focus on the Semi-Supervised Few-Shot Learning (SS-FSL) mechanism from a macro-perspective, where not only supervised tasks but also unsupervised tasks are incorporated during meta-training (as in Fig. 2(c)). Compared with FS-SSL, there are two main differences in our SS-FSL. First, the usage of unlabeled data is different. Rather than forming episodes of semi-supervised tasks as in FS-SSL, SS-FSL constructs unsupervised tasks to smooth the meta-model space from a macro-perspective: FS-SSL utilizes the unlabeled data to improve the ability of a specific few-shot task, while SS-FSL emphasizes improving the discriminative ability of the meta-model (e.g., embeddings) with the help of unsupervised tasks. Second, the meta-test strategies are different. Usually, FS-SSL needs the assistance of unlabeled data during meta-test, while SS-FSL can still generalize well even without unlabeled data owing to the smooth meta-model space.

## Meta-Learning for Few-Shot Learning

In this section, we introduce the few-shot classification problem and describe how to solve it with meta-learning.

### The Few-Shot Learning Problem

Few-Shot Learning (FSL) formalizes a classification task in the N-way K-shot form. The support set of a task $\mathcal{D}_S = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{NK}$ contains $N$ classes and $K$ labeled examples in each class, where the instance $\mathbf{x}_i \in \mathbb{R}^D$ and the one-hot coding label $\mathbf{y}_i \in \{0, 1\}^N$. The goal of FSL is to train an $N$-way classifier $h \in \mathcal{H}_N: \mathbb{R}^D \to \{0, 1\}^N$ based on the $NK$ examples, where $\mathcal{H}_N$ is the $N$-way classifier space. $h$ is prone to over-fit when $K$ is small (e.g., $K = 1$).

$\mathrm{sfx}(\mathbf{p})$ normalizes a vector $\mathbf{p} \in \mathbb{R}^N$ into a probability distribution with softmax, i.e., $\sum_{n=1}^{N} \mathrm{sfx}(\mathbf{p})_n = 1$ and $\mathrm{sfx}(\mathbf{p})_n \ge 0$ for all $n$. Denote $\mathrm{KL}_U(\mathbf{p} \,\|\, \mathbf{q}) = \sum_{n=1}^{N} \mathrm{sfx}(\mathbf{p})_n \log \frac{\mathrm{sfx}(\mathbf{p})_n}{\mathrm{sfx}(\mathbf{q})_n}$ as an operator which normalizes two $N$-dimensional vectors with softmax and then outputs their KL divergence.
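The two operators above can be written directly in code. Below is a minimal PyTorch sketch of $\mathrm{sfx}(\cdot)$ and $\mathrm{KL}_U(\cdot \,\|\, \cdot)$ exactly as defined; the function names are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def sfx(p):
    """sfx(p): softmax-normalize an N-dimensional score vector (or a batch
    of them) into a probability distribution."""
    return F.softmax(p, dim=-1)

def kl_u(p, q):
    """KL_U(p || q): softmax-normalize two unnormalized score vectors, then
    return sum_n sfx(p)_n * log(sfx(p)_n / sfx(q)_n), i.e., their KL divergence."""
    log_p = F.log_softmax(p, dim=-1)
    log_q = F.log_softmax(q, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)
```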
### Meta-Learning for Few-Shot Learning

Meta-learning learns a task-level mapping $f$ from the $N$-way $K$-shot support set $\mathcal{D}_S$ to its target classifier $h \in \mathcal{H}_N$ in a supervised way (Chao et al. 2020). To learn the meta-model $f$, episodes of tasks are sampled from a meta-train set with SEEN classes. In detail, each task contains an $N$-way $K$-shot support set $\mathcal{D}_S$ and a query set $\mathcal{D}_Q$ with same-distribution examples from the $N$ classes. The quality of a meta-generated classifier $f(\mathcal{D}_S)$ is measured by its classification ability on $\mathcal{D}_Q$. In summary, $f$ can be learned by:

$$\min_{f} \; \sum_{(\mathcal{D}_S, \mathcal{D}_Q)} \; \sum_{(\mathbf{x}_j^Q, \mathbf{y}_j^Q) \in \mathcal{D}_Q} \ell\left( f(\mathcal{D}_S)(\mathbf{x}_j^Q), \; \mathbf{y}_j^Q \right). \tag{1}$$

The summation over $(\mathcal{D}_S, \mathcal{D}_Q)$ in Eq. 1 denotes the enumeration of all sampled tasks from the SEEN class set. The loss $\ell(\cdot, \cdot)$ measures the quality of a meta-generated classifier $f(\mathcal{D}_S)$ via the discrepancy between the predicted label and the ground-truth of the query set, e.g., the cross-entropy. The lower the average loss when predicting instances in $\mathcal{D}_Q$, the closer the meta-generated classifier is to the target one. After optimizing Eq. 1, $f$ maps a training set to its target classifier even with only a few labeled examples. Since the meta-training mimics the few-shot evaluation, it is supposed to generalize to $N$-way $K$-shot tasks composed of UNSEEN classes (a.k.a. the meta-test phase).

The meta-model $f$ could be implemented in a non-parametric style. In other words, a query instance $\mathbf{x}_j^Q$ is classified based on a soft nearest neighbor rule:

$$\hat{\mathbf{y}}_j = f(\mathcal{D}_S)(\mathbf{x}_j^Q) = \sum_{(\mathbf{x}_i^S, \mathbf{y}_i^S) \in \mathcal{D}_S} \mathrm{sim}\left( \phi(\mathbf{x}_j^Q), \phi(\mathbf{x}_i^S) \right) \mathbf{y}_i^S. \tag{2}$$

$\phi: \mathbb{R}^D \to \mathbb{R}^d$ extracts features of the input examples and transforms them into a latent space with $d$ dimensions. $\mathrm{sim}(\phi(\mathbf{x}_j^Q), \phi(\mathbf{x}_i^S))$ measures the similarity between the query instance $\phi(\mathbf{x}_j^Q)$ and a support instance $\phi(\mathbf{x}_i^S)$. The Matching Network (Vinyals et al. 2016) uses the $\ell_2$-normalized cosine similarity in Eq. 2. After learning $\phi$ with Eq. 1, the embedding facilitates the construction of a nearest neighbor classifier. The Prototypical Network (Snell, Swersky, and Zemel 2017) implements Eq. 2 with the negative Euclidean distance. When $K > 1$, it averages the same-class instances together and uses the class centers (prototypes) for prediction. The embedding center of class $n$ can be defined as $\mathbf{c}_n = \frac{1}{K} \sum_{y_{i,n}=1} \phi(\mathbf{x}_i^S)$, and then $\hat{\mathbf{y}}_j = \sum_{n=1}^{N} \mathrm{sim}(\phi(\mathbf{x}_j^Q), \mathbf{c}_n)\, \mathbf{y}_n$.
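To make the non-parametric mapping of Eq. 2 concrete, here is a minimal PyTorch sketch of its Prototypical-Network instantiation (negative squared Euclidean distance to class prototypes). The function and variable names are ours, and `phi` is assumed to be any embedding backbone mapping images to $\mathbb{R}^d$.

```python
import torch

def protonet_logits(phi, support_x, support_y, query_x):
    """Prototype-based soft nearest-neighbor rule of Eq. 2.

    phi       : embedding network, images -> R^d
    support_x : [N*K, ...] support images
    support_y : [N*K] integer class ids in {0, ..., N-1}
    query_x   : [Q, ...] query (or pool) images
    Returns a [Q, N] matrix of unnormalized class scores.
    """
    z_s = phi(support_x)                                   # [N*K, d]
    z_q = phi(query_x)                                     # [Q, d]
    n_way = int(support_y.max().item()) + 1
    # c_n: mean embedding of the K support instances of class n
    protos = torch.stack([z_s[support_y == n].mean(dim=0)
                          for n in range(n_way)])          # [N, d]
    # sim(., .) = negative squared Euclidean distance to each prototype
    return -torch.cdist(z_q, protos) ** 2                  # [Q, N]
```

Feeding these scores to a softmax (the $\mathrm{sfx}$ operator above) gives the predicted distribution $\hat{\mathbf{y}}_j$ for each query instance.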
**Semi-Supervised Few-Shot Learning (SS-FSL).** Considering the practical utility of unlabeled data, SS-FSL handles the case where most of the SEEN class data are unlabeled. The meta-model is required to utilize both labeled and unlabeled data during meta-training, while only labeled few-shot support sets from UNSEEN classes are provided in meta-test.

## Task Cooperation for Few-Shot Learning

We focus on Semi-Supervised Few-Shot Learning (SS-FSL), using the unlabeled meta-train data to improve the generalization ability of the meta-model $f$. We first outline the main idea of TAsk COoperation (TACO) and then describe the concrete configurations. Discussions come last.

### TACO for Semi-Supervised Few-Shot Learning

Towards incorporating the easily collected and informative unlabeled data during meta-training, we propose the TAsk COoperation (TACO) framework, where related tasks cooperate with each other for a smooth meta-model $f$. Traditional semi-supervised learning assumes that a smooth function maps near inputs to similar outputs, which is an essential property for achieving discriminative and generalizable models (Friedman, Hastie, and Tibshirani 2001; Chapelle, Schölkopf, and Zien 2010; Berthelot et al. 2019).

For an $N$-way $K$-shot classification task, $f$ maps its support set $\mathcal{D}_S$ to its corresponding classifier $h_N = f(\mathcal{D}_S)$. TACO generalizes the smoothness notion from traditional supervised learning to the meta-model space. From a macro-perspective of meta-learning, we first make an analogy between the training instance in traditional supervised learning and the few-shot support set in meta-learning. We propose TACO to better capture the task relationship, which adds a smoothness constraint over the meta-learning objective in Eq. 1, so that two close tasks behave similarly:

$$\lambda \sum_{(\mathcal{D}_S, \hat{\mathcal{D}}_S)} \mathrm{DIS}\left( f(\mathcal{D}_S), \; f(\hat{\mathcal{D}}_S) \right). \tag{3}$$

We assume $\mathcal{D}_S$ and $\hat{\mathcal{D}}_S$ are two visually/semantically similar few-shot support sets sampled from the meta-train set, and $\mathrm{DIS}(\cdot, \cdot)$ measures the discrepancy between the two mapped models in $\mathcal{H}_N$. $\lambda > 0$ is a balance parameter. By minimizing Eq. 3 together with Eq. 1, the meta-model $f$ not only maps a task to its target classifier but also generates similar classifiers for close few-shot support sets (revealed by the small classifier-space distance between $f(\mathcal{D}_S)$ and $f(\hat{\mathcal{D}}_S)$), which corresponds well to the smoothness notion in traditional supervised/semi-supervised learning. Benefiting from TACO, the meta-model $f$ becomes smooth and more discriminative, generalizing better in the meta-test stage.

### Similarity Measures for TACO

Eq. 3 leaves the question of how to define the similarity between few-shot support sets and the discrepancy between classifiers. Here we provide detailed definitions.

**Similarity between tasks.** Given an $N$-way $K$-shot support set $\mathcal{D}_S$, we generate another similar few-shot support set $\hat{\mathcal{D}}_S$ based on two strategies (see the code sketch below). First, we consider two tasks similar if they share the same set of classes. Given the $N$ classes in $\mathcal{D}_S$, we sample another $K$ non-overlapping instances from the labeled meta-train set for each of the $N$ classes to construct $\hat{\mathcal{D}}_S$. Besides, we keep the order of classes in the two few-shot classification tasks the same, which maintains a correspondence between their classifiers. Second, we borrow the idea from standard semi-supervised learning (Sajjadi, Javanmardi, and Tasdizen 2016; Berthelot et al. 2019) to construct similar tasks based on perturbations. In this case, the instances in $\hat{\mathcal{D}}_S$ are the same as those in $\mathcal{D}_S$ except for additional (advanced) data augmentation operations (e.g., random crop). Data augmentation changes the raw image input of an instance to some extent while keeping its semantic meaning, so the transformed task is close to the original one. Two labeled tasks with similar support sets usually target similar classifiers, so it is meaningful to apply the task similarity objective to them.

**Remark 1** In addition to generating visually similar tasks, we can also obtain semantically similar tasks based on class relationships, e.g., the binary classification tasks tiger vs. dog and cat vs. dog are similar. Since such a class-wise similarity measure requires semantic attributes for the classes, we leave it for future study.
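The following is a minimal sketch of the first construction strategy (same classes, freshly drawn non-overlapping instances, class order preserved). The helper name and the `class_to_indices` data structure are our assumptions; the perturbation-based strategy would instead reuse the original instances under a stochastic augmentation transform in the data loader.

```python
import random

def sample_similar_support(class_to_indices, support_classes, support_indices,
                           k_shot, rng=random):
    """Build a second support set \\hat{D}_S similar to D_S: same N classes in
    the same order, but K fresh instances per class that do not overlap with
    the original support instances.

    class_to_indices : dict, class id -> list of labeled instance indices
    support_classes  : ordered list of the N class ids in D_S
    support_indices  : dict, class id -> the K indices used by D_S
    Returns a dict, class id -> K new instance indices (insertion order keeps
    the class correspondence between the two classifiers).
    """
    hat_support = {}
    for cls in support_classes:
        pool = [i for i in class_to_indices[cls] if i not in support_indices[cls]]
        hat_support[cls] = rng.sample(pool, k_shot)
    return hat_support
```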
**Similarity between Classifiers.** The output of the meta-model $f$ w.r.t. a support set, an $N$-way classifier $h_N$, is a function from instances to labels, and directly measuring the discrepancy between two classifiers in Eq. 3 requires function-space metrics. Since the characteristics of a classifier are revealed by its predictions $\{h_N(\mathbf{x})\}$ over all possible instances $\{\mathbf{x}\}$ sampled from the task distribution, we transform the similarity between two classifiers from the function space to the instance space: two similar classifiers have similar predictions over all instances.

We first empirically demonstrate that the prediction for an object from classes other than those in the few-shot support set (i.e., distractor classes) still reveals the properties of an embedding-based few-shot classifier. Consider a task discerning $N$ cat classes; the predictions of its classifier on a dog image decompose the dog's cat-level characteristics into these $N$ cat classes. With classifiers based on embeddings, the confidence for a new instance implies its similarity to those $N$ class centers in a joint embedding space. We verify this point with the Nearest Center Mean (NCM) classifier (Mensink et al. 2013) based on tasks from MiniImageNet (in Fig. 3). The agreements between normalized confidences demonstrate the similarity among tasks (and their target classifiers): visually/semantically similar few-shot support sets make similar predictions on instances.

Figure 3: Empirical observations of measuring similarity between classifiers based on their predictions over the instance space. Each row corresponds to a 5-way task trained with 3000 examples from MiniImageNet, where the task-specific classifier is based on Nearest Center Mean (NCM) over embeddings. The first two tasks have exactly the same categories, while their classifiers are learned with different initializations; the third task has different classes, but all classes belong to the same super-categories as the first two; the last task targets classes from different sets of super-classes. The construction of tasks encodes the task-level visual and semantic similarity. The similarity between the outputs of the meta-model, i.e., the classifiers, can be revealed by their prediction results over the same (even distractor-class) instances, as shown on the r.h.s. The histograms show the mean prediction results of 600 examples from the Jellyfish class. Visually/semantically similar tasks have closer NCM decisions than the dissimilar ones.

Based on the previous observation, we take a further step and make use of the unlabeled data during meta-training to measure the similarity between classifiers. Since the query set error $\sum_{(\mathbf{x}_j^Q, \mathbf{y}_j^Q) \in \mathcal{D}_Q} \ell\left( h_N(\mathbf{x}_j^Q), \mathbf{y}_j^Q \right)$ is used to update the meta-model $f$, it is notable that if we cannot get access to $\mathbf{y}_j^Q$, i.e., the query set labels, the meta-model $f$ cannot be updated. Thus, we make another analogy between supervised examples (i.e., instance and label pairs) in traditional supervised learning and tasks (i.e., support and query set pairs) in meta-learning. The instance and the label in supervised learning correspond to the few-shot support set and the query set in meta-learning, respectively. A few-shot support set with an unlabeled query set is analogous to an unlabeled instance in traditional supervised learning. We denote a supervised task as a couple of labeled support and query sets $(\mathcal{D}_S, \mathcal{D}_Q)$, and an unsupervised task as a combination of a few-shot labeled support set $\mathcal{D}_S$ and an unlabeled pool set $\mathcal{D}_{pool} = \{\mathbf{x}_k^P\}$ sampled from the unlabeled part of the meta-train set. Instances in $\mathcal{D}_{pool}$ may come from distractor classes w.r.t. the classes in $\mathcal{D}_S$.

During SS-FSL, for two similar unsupervised tasks $(\mathcal{D}_S, \mathcal{D}_{pool})$ and $(\hat{\mathcal{D}}_S, \mathcal{D}_{pool})$ sharing the same unlabeled pool set $\mathcal{D}_{pool}$, we measure the similarity between the outputs of the meta-model, i.e., the similarity between the two few-shot classifiers, based on the JS divergence of their predicted distributions on $\mathcal{D}_{pool}$:

$$\mathrm{DIS}\left(f(\mathcal{D}_S), f(\hat{\mathcal{D}}_S)\right) = \sum_{\mathbf{x}_k^P \in \mathcal{D}_{pool}} \mathrm{JSD}_T\left( f(\mathcal{D}_S)(\mathbf{x}_k^P) \,\|\, f(\hat{\mathcal{D}}_S)(\mathbf{x}_k^P) \right). \tag{4}$$

$f(\mathcal{D}_S)(\mathbf{x}_k^P)$ provides the affiliation confidence of an instance $\mathbf{x}_k^P$ towards the $N$ classes in $\mathcal{D}_S$. $\mathrm{JSD}_T(\mathbf{p} \,\|\, \mathbf{q}) = \frac{1}{2}\mathrm{KL}_U\left( \frac{\mathbf{p}}{T} \,\|\, \frac{\mathbf{q}}{T} \right)$ is the JS divergence over the unnormalized predictions; the smaller the value, the closer the two distributions. $T$ is a positive temperature to soften the predicted distribution (Hinton, Vinyals, and Dean 2015; Ye, Lu, and Zhan 2020). In our experiments, we stop the gradient of the second vector when computing the JS divergence.
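Below is a small PyTorch sketch of the prediction-matching term on a single batch of pool instances, following the description above (temperature softening, stop-gradient on the second classifier's predictions). The function names are ours, and the exact form of $\mathrm{JSD}_T$ is our reading of the softened, $\mathrm{KL}_U$-based definition; this is a sketch rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def jsd_t(p_logits, q_logits, temperature=2.0):
    """Temperature-softened divergence between the unnormalized predictions
    of two few-shot classifiers on the same pool instances (JSD_T in Eq. 4).

    p_logits, q_logits : [B, N] scores from f(D_S) and f(\\hat{D}_S).
    The second argument is detached, matching the stop-gradient on the second
    vector described in the text.
    """
    p = F.softmax(p_logits / temperature, dim=-1)                     # gradient flows
    log_p = F.log_softmax(p_logits / temperature, dim=-1)
    log_q = F.log_softmax(q_logits.detach() / temperature, dim=-1)    # stop-gradient
    return 0.5 * (p * (log_p - log_q)).sum(dim=-1).mean()
```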
By matching the predictions of two few-shot classifiers on the same set of instances without labels, the meta-model is required to map similar few-shot tasks to similarly-performing models. Therefore, the outputs of similar meta-model inputs are pulled together, which enforces the smoothness of the meta-model $f$. Eq. 4 can also be applied to two labeled tasks if we replace $\mathcal{D}_{pool}$ with the union of their query sets.

**Remark 2** Minimizing the discrepancy between similar tasks' predictions potentially produces a smooth instance embedding encoder $\phi$. However, matching the embeddings directly from the input perspective of the meta-model is not as flexible as Eq. 4; it is too strong a constraint and does not work well in our experiments.

**Objective.** TACO uses Eq. 4 as an auxiliary objective, i.e.,

$$\min_{f} \sum_{(\mathcal{D}_S, \mathcal{D}_Q, \mathcal{D}_{pool})} \left[ \sum_{(\mathbf{x}_j^Q, \mathbf{y}_j^Q) \in \mathcal{D}_Q} \ell\left( f(\mathcal{D}_S)(\mathbf{x}_j^Q), \mathbf{y}_j^Q \right) + \lambda \sum_{\mathbf{x}_k^P \in \mathcal{D}_{pool} \cup \mathcal{D}_Q} \mathrm{JSD}_T\left( f(\mathcal{D}_S)(\mathbf{x}_k^P) \,\|\, f(\hat{\mathcal{D}}_S)(\mathbf{x}_k^P) \right) \right]. \tag{5}$$

TACO can be instantiated with embedding-based methods like ProtoNet. Eq. 5 acts as an efficient way to maximize the similarity between both labeled and unsupervised tasks. During meta-training, we sample $\mathcal{D}_S$, $\mathcal{D}_Q$, and $\mathcal{D}_{pool}$ from the labeled and unlabeled meta-train sets, respectively (as in Fig. 4 and Alg. 1). To take full advantage of the examples, we combine the query set $\mathcal{D}_Q$ with $\mathcal{D}_{pool}$. The prediction matching not only utilizes the unlabeled meta-train data in a semi-supervised manner, but also promotes the co-supervision between similar tasks. During meta-test with $\mathcal{D}_S$ only, a discriminative classifier is generated from $f$ in a supervised manner without additional unlabeled instances.

Figure 4: Illustration of the usage of unsupervised tasks for SS-FSL in TACO. Similarities for tasks and classifiers are proposed to ensure the smoothness of the meta-model $f$.

Algorithm 1: The meta-training flow of TACO.
Require: SEEN class set $\mathcal{S}$
1: for all iteration = 1, ... do
2:   Sample an $N$-way $K$-shot $(\mathcal{D}_S, \mathcal{D}_Q)$ from $\mathcal{S}$
3:   Generate a similar task $\hat{\mathcal{D}}_S$ based on $\mathcal{D}_S$
4:   Sample $\mathcal{D}_{pool}$ from the unlabeled part of $\mathcal{S}$
5:   for all $(\mathbf{x}_j^Q, \mathbf{y}_j^Q) \in \mathcal{D}_Q$ do
6:     Get $f(\mathcal{D}_S)(\mathbf{x}_j^Q)$
7:   end for
8:   for all $\mathbf{x}_k^P \in \mathcal{D}_Q \cup \mathcal{D}_{pool}$ do
9:     Get $\mathrm{JSD}_T\big(f(\mathcal{D}_S)(\mathbf{x}_k^P) \,\|\, f(\hat{\mathcal{D}}_S)(\mathbf{x}_k^P)\big)$
10:  end for
11:  Compute the objective as in Eq. 5 and update $f$
12: end for
13: return Few-shot classifier mapping $f$

**Remark 3** TACO is general, based on the definitions of the similarity measurements over the inputs (few-shot support sets) and outputs (target classifiers) of the meta-model. Since the embedding-based classifiers project all instances into a common subspace, the cross-class similarity between in-task class instances and distractor class instances can be measured in an unsupervised way. In our experiments, we implement $f$ with ProtoNet (Snell, Swersky, and Zemel 2017) and ProtoMAML (Triantafillou et al. 2020) (in the supplementary).

By minimizing the discrepancy between similar classifiers, TACO updates the meta-model (i.e., the embedding) to pull similar few-shot tasks together, which gives rise to a smooth task-to-classifier meta-model (Saito et al. 2018). Thus, given a neighboring SEEN-class few-shot task w.r.t. an UNSEEN-class few-shot task, the discerning ability of a well-performed meta-model on those similar SEEN-class tasks generalizes to the UNSEEN tasks as well.
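For concreteness, the following sketch combines the pieces above into one meta-training step corresponding to Eq. 5 and Alg. 1. It reuses the `protonet_logits` and `jsd_t` sketches defined earlier; all names, and the plain optimizer-step update, are our assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def taco_training_step(phi, optimizer,
                       support_x, support_y, hat_support_x, hat_support_y,
                       query_x, query_y, pool_x,
                       lam=0.1, temperature=2.0):
    """One TACO meta-training step (Eq. 5): query-set cross-entropy for f(D_S)
    plus lambda times the prediction-matching term between f(D_S) and
    f(\\hat{D}_S) on the query set joined with the unlabeled pool.
    protonet_logits and jsd_t are the sketches defined earlier in this section.
    """
    # supervised meta-learning loss on the query set (Eq. 1)
    query_logits = protonet_logits(phi, support_x, support_y, query_x)
    ce_loss = F.cross_entropy(query_logits, query_y)

    # task-cooperation term on D_Q union D_pool (Eq. 4)
    unlabeled_x = torch.cat([query_x, pool_x], dim=0)
    p_logits = protonet_logits(phi, support_x, support_y, unlabeled_x)
    q_logits = protonet_logits(phi, hat_support_x, hat_support_y, unlabeled_x)
    match_loss = jsd_t(p_logits, q_logits, temperature)

    loss = ce_loss + lam * match_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```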
## Experiments

We investigate TACO on MiniImageNet as well as TieredImageNet. We describe the experimental setups, and then provide the few-shot classification performance together with visualization results. Implementation details and further qualitative and quantitative evaluations are in the supplementary.

### Experimental Setups

**Datasets.** MiniImageNet (Vinyals et al. 2016) and TieredImageNet (Ren et al. 2018) contain 100 classes and 608 classes, respectively. All images are resized to 3×84×84 following (Vinyals et al. 2016; Finn, Abbeel, and Levine 2017; Snell, Swersky, and Zemel 2017). We use the standard split of the two datasets following (Ravi and Larochelle 2017; Ren et al. 2018), where meta-train, meta-val, and meta-test have non-overlapping classes.

**Supervised Evaluation Protocols.** We evaluate the mean accuracy over 10,000 5-way 1-shot and 5-way 5-shot tasks (Vinyals et al. 2016; Ye et al. 2020), where the test set in a task has 15 examples from each of the 5 classes. In this supervised evaluation, all labeled examples in the meta-train set are utilized. We omit the 95% confidence intervals in the experiments; detailed values are in the supplementary.

**Semi-Supervised Data Generation and Evaluation Protocols.** We construct the semi-supervised meta-train set by removing part of its labels. Two different partitions are investigated (a sketch of both follows below). The first strategy splits all examples in the meta-train set across classes (SAC): we randomly select 30% of the classes in the meta-train set as the labeled part and use the instances of the remaining classes, without their labels, as the unlabeled set. Similarly, the second strategy randomly selects 30% of the instances, splitting across instances (SAI). In the SAI case, it is possible to sample non-distractor classes from the unlabeled pool, which reduces the classification difficulty w.r.t. SAC to some extent. Based on the observation in (Oliver et al. 2018), it is more realistic to set the number of images in the meta-val set smaller than the number of labeled instances in the meta-train set. Thus, instead of preserving the whole meta-val set, we adopt the same SAC or SAI split methods to reduce the size of the meta-val set, and only the selected labeled meta-val images are utilized to select the best model. The average performance over 3 random partitions is reported.
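A minimal sketch of the two label-removal partitions described above (SAC and SAI); the helper name, the `class_to_indices` structure, and the fixed 30% default ratio are our assumptions for illustration.

```python
import random

def semi_supervised_split(class_to_indices, ratio=0.3, mode="SAC", seed=0):
    """Remove labels from part of the meta-train set.

    mode == "SAC": keep `ratio` of the classes as labeled; every instance of
                   the remaining classes joins the unlabeled pool (all distractors).
    mode == "SAI": keep `ratio` of the instances of each class as labeled;
                   the rest become unlabeled (non-distractor classes possible).
    Returns (labeled, unlabeled): a dict class id -> labeled indices, and a
    flat list of unlabeled instance indices.
    """
    rng = random.Random(seed)
    labeled, unlabeled = {}, []
    if mode == "SAC":
        classes = sorted(class_to_indices)
        rng.shuffle(classes)
        n_labeled = int(round(ratio * len(classes)))
        for cls in classes[:n_labeled]:
            labeled[cls] = list(class_to_indices[cls])
        for cls in classes[n_labeled:]:
            unlabeled.extend(class_to_indices[cls])
    else:  # "SAI"
        for cls, indices in class_to_indices.items():
            indices = list(indices)
            rng.shuffle(indices)
            n_labeled = int(round(ratio * len(indices)))
            labeled[cls] = indices[:n_labeled]
            unlabeled.extend(indices[n_labeled:])
    return labeled, unlabeled
```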
**Comparison Methods.** We mainly compare TACO with four embedding-based approaches, namely MatchNet (Vinyals et al. 2016), ProtoNet (Snell, Swersky, and Zemel 2017), Semi-ProtoNet (Ren et al. 2018), and PRWN (Ayyad et al. 2019). Semi-ProtoNet is designed for few-shot semi-supervised learning, and we adapt it to our setting. Its improved version, with an MLP-based selector to detect helpful unlabeled instances, is denoted as Semi-ProtoNet (improved). PRWN obtains compact and well-separated class representations via prototypical random walks.

**Implementation Details.** We use a 4-layer ConvNet (Vinyals et al. 2016; Finn, Abbeel, and Levine 2017; Snell, Swersky, and Zemel 2017) as the backbone, which is initialized following (Rusu et al. 2018; Ye et al. 2020). TACO and all comparison methods are fine-tuned from the pre-trained embedding during meta-training. For semi-supervised FSL, we sample 75 unlabeled instances in each mini-batch. We also investigate the ResNet-12 backbone (Lee et al. 2019), which is more complex but has higher discriminative ability. We find that ResNet over-fits when applied to the SS-FSL setting with a limited amount of meta-train data, so we only test ResNet in the standard supervised few-shot classification tasks. The meta-model in TACO is implemented with ProtoNet. For constructing similar unsupervised tasks, we sample two support sets with the same classes and then apply random perturbations to them via the advanced augmentation reported in (Xie et al. 2020).

### Results of Semi-Supervised Few-Shot Learning

We first investigate TACO in the Semi-Supervised Few-Shot Learning (SS-FSL) case on the two benchmark datasets MiniImageNet and TieredImageNet. Results are recorded in Table 1 and Table 2. All compared methods are required to meta-learn few-shot-facilitated embeddings, whose quality is revealed by the average accuracy on the meta-test set. Two split strategies, i.e., splits across classes and across instances, are considered to verify the importance of the unlabeled meta-train set.

| Method | 1-shot (SAC) | 1-shot (SAI) | 5-shot (SAC) | 5-shot (SAI) |
|---|---|---|---|---|
| MatchNet | 42.33 | 44.73 | 55.93 | 59.03 |
| ProtoNet | 43.18 | 45.41 | 54.80 | 58.41 |
| Semi-ProtoNet | 42.41 | 44.22 | 58.70 | 61.54 |
| Semi-ProtoNet (improved) | 42.84 | 45.59 | 59.21 | 62.31 |
| PRWN | 42.72 | 44.65 | 58.90 | 61.34 |
| TACO | 43.97 | 46.56 | 61.13 | 62.85 |

Table 1: Mean SS-FSL accuracy over 10,000 tasks on MiniImageNet, with only 30% of the meta-train set labeled. SAC/SAI denote Split (of the meta-train set) Across Classes/Instances.

| Method | 1-shot (SAC) | 1-shot (SAI) | 5-shot (SAC) | 5-shot (SAI) |
|---|---|---|---|---|
| MatchNet | 49.72 | 52.12 | 64.62 | 67.17 |
| ProtoNet | 50.24 | 52.42 | 67.74 | 70.56 |
| Semi-ProtoNet | 49.29 | 51.31 | 68.52 | 70.88 |
| Semi-ProtoNet (improved) | 50.41 | 51.78 | 68.92 | 71.17 |
| PRWN | 50.28 | 51.35 | 67.96 | 70.23 |
| TACO | 51.82 | 54.66 | 68.77 | 71.83 |

Table 2: SS-FSL accuracy on TieredImageNet, with 30% of the meta-train set labeled.

Because there are more diverse examples (more classes) in the SAI case, all few-shot methods achieve better classification accuracy in this scenario compared with the SAC case. MatchNet and ProtoNet are meta-trained in a fully supervised manner, where only the labeled meta-train set is used. Semi-ProtoNet takes advantage of the unlabeled instances to help estimate the center of each class during meta-training. Since there are no unlabeled data in the meta-test phase, Semi-ProtoNet does not improve much over the vanilla ProtoNet in the 1-shot scenario. From the results, the unlabeled instances help a lot when there is more than one instance per class, and the distractor detector in the improved Semi-ProtoNet works especially well in this case. Table 1 and Table 2 provide consistent results, where the methods taking advantage of the unlabeled data in meta-training achieve (at least slightly) better results than their supervised counterparts. The supervised baselines are quite strong; the same phenomenon is also observed in classical deep semi-supervised learning (Oliver et al. 2018). TACO uses the unlabeled data by matching model predictions. With high-quality embeddings, TACO obtains the best performance in both cases over the two benchmarks, even though there are no unlabeled instances during meta-test. This verifies that TACO is able to meta-learn more discriminative meta-knowledge (instance embeddings) with the unlabeled data in meta-training.
### Ablation Studies

**Influence of parameters.** There are two main parameters in TACO: the balance weight λ on the distribution matching term, and the temperature T used to soften the predicted confidence. We show the influence of these two parameters in the SS-FSL scenario, using the same configurations as the previous experiments. From the results in Table 3, we find that the distribution matching term indeed improves the learned embedding over the supervised baseline (λ = 0). Because we match the prediction distributions of the two tasks mutually, the temperature used to scale the prediction outputs does not influence the performance much.

| T = 2 | SAC | SAI | λ = 0.1 | SAC | SAI |
|---|---|---|---|---|---|
| λ = 0 | 42.33 | 44.73 | T = 1 | 43.58 | 46.50 |
| λ = 0.1 | 43.97 | 46.56 | T = 2 | 43.97 | 46.56 |
| λ = 1 | 42.31 | 44.65 | T = 4 | 43.23 | 46.47 |

Table 3: Semi-supervised 1-shot classification accuracy on MiniImageNet with the ConvNet backbone, where only 30% of the meta-train set is labeled. The performance of TACO using different parameters is compared: the left half varies λ with T = 2 fixed, and the right half varies T with λ = 0.1 fixed.

**Influence of the Label Ratio Change.** We also test the TACO approach when the ratio of the labeled meta-train set varies over 10%, 30%, 50%, and 80%. The remaining part of the meta-train set is used as the unlabeled set for SS-FSL. The 5-way 1-shot classification accuracy with both SAC and SAI partitions is shown in Fig. 5. The two plots reveal the same trend: with more labeled instances in meta-training, all few-shot approaches achieve better performance. Among all results, TACO gets the best 5-way 1-shot classification results in all cases, especially when meta-learned with the SAI partition. Semi-ProtoNet cannot improve the quality of the embedding, especially when the amount of labeled data is very small (e.g., 10%). The results verify the robustness of TACO.

Figure 5: The change of 5-way 1-shot classification accuracy of ProtoNet, Semi-ProtoNet (improved), and TACO when the ratio of labeled examples in the meta-train set changes, with (a) the split across classes and (b) the split across instances.

### Supervised Few-Shot Classification

As mentioned before, and for fair comparisons, we investigate supervised few-shot classification by replacing $\mathcal{D}_{pool}$ in Eq. 4 with the union of the query sets of two supervised few-shot tasks. We find that TACO still works due to its explicit consideration of the task relationship. The smoothness of the meta-learned model improves the generalization ability of the learned embedding when classifying UNSEEN few-shot tasks. We compare TACO with TEAM (Qiao et al. 2019), TPN (Liu et al. 2019), TapNet (Yoon, Seo, and Moon 2019), MTL (Sun et al. 2019), MetaOpt (Lee et al. 2019), and CAN (Hou et al. 2019) on the MiniImageNet and TieredImageNet datasets with the ResNet backbone; the results are shown in Table 4. For MiniImageNet, we cite the published results of the compared methods, and we can see that TACO achieves better performance, which is also verified on TieredImageNet.

| Method | MiniImageNet 1-shot | MiniImageNet 5-shot | TieredImageNet 1-shot | TieredImageNet 5-shot |
|---|---|---|---|---|
| TapNet | 61.65 | 76.36 | 63.08 | 80.26 |
| MTL | 61.20 | 75.50 | 65.60 | 78.60 |
| MetaOpt | 62.64 | 78.63 | 65.99 | 81.56 |
| CAN | 63.85 | 79.44 | 69.89 | 84.23 |
| TACO (Ours) | 66.57 | 82.10 | 71.12 | 85.42 |
| TEAM | 60.07 | 75.9 | - | - |
| TPN | 59.46 | 75.65 | - | - |
| CAN | 67.19 | 80.64 | 73.21 | 84.93 |
| TACO (Ours) | 68.23 | 83.42 | 75.53 | 85.72 |

Table 4: Supervised 5-way few-shot classification accuracy on MiniImageNet and TieredImageNet using the ResNet-12 backbone. The last four rows are transductive FSL methods, which utilize the unlabeled data from the query set.
In addition to the fully supervised comparison, we also apply TACO in a transductive manner (the last four rows of Table 4), where the query set acts as the unlabeled pool. By taking advantage of the unlabeled data in each few-shot task, in the manner of Semi-ProtoNet (Ren et al. 2018), TACO promotes the FSL performance, especially in the 1-shot scenario. More details can be found in the supplementary.

## Conclusion

Instead of utilizing unlabeled data to help classification in each few-shot task, we focus on the Semi-Supervised Few-Shot Learning (SS-FSL) problem from a macro-perspective. For a pair of meta-training tasks, the proposed TAsk COoperation (TACO) approach leverages unsupervised tasks, i.e., couples of a labeled few-shot support set and an unlabeled query set with distractor classes, to minimize the disagreement of predictions between their few-shot classifiers. Thus, TACO obtains a smooth meta-model space where similar few-shot tasks have close classifiers, which leads to a more discriminative and generalizable meta-model. Finally, a supervised classifier can be effectively constructed when targeting UNSEEN-class few-shot tasks. TACO improves FSL performance on two benchmarks in both semi-supervised and supervised scenarios. Future work includes extending the TACO paradigm to a fully unsupervised scenario.

## References

Ayyad, A.; Navab, N.; Elhoseiny, M.; and Albarqouni, S. 2019. Semi-Supervised Few-Shot Learning with Prototypical Random Walks. CoRR abs/1903.02164v3.

Baxter, J. 2000. A Model of Inductive Bias Learning. JAIR 12: 149-198.

Bennett, K. P.; and Demiriz, A. 1998. Semi-Supervised Support Vector Machines. In NeurIPS, 368-374.

Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; and Raffel, C. A. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. In NeurIPS, 5049-5059.

Boney, R.; and Ilin, A. 2017. Semi-Supervised Few-Shot Learning with Prototypical Networks. CoRR abs/1711.10856.

Chao, W.-L.; Ye, H.-J.; Zhan, D.-C.; Campbell, M.; and Weinberger, K. Q. 2020. Revisiting Meta-Learning as Supervised Learning. CoRR abs/2002.00573.

Chapelle, O.; Schölkopf, B.; and Zien, A. 2010. Semi-Supervised Learning. The MIT Press.

Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 1126-1135.

Friedman, J.; Hastie, T.; and Tibshirani, R. 2001. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York.

Grandvalet, Y.; and Bengio, Y. 2004. Semi-supervised Learning by Entropy Minimization. In NeurIPS, 529-536.

Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531.

Hou, R.; Chang, H.; Ma, B.; Shan, S.; and Chen, X. 2019. Cross Attention Network for Few-shot Classification. In NeurIPS, 4005-4016.

Khodadadeh, S.; Bölöni, L.; and Shah, M. 2019. Unsupervised Meta-Learning for Few-Shot Image Classification. In NeurIPS, 10132-10142.

Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised Learning with Deep Generative Models. In NeurIPS, 3581-3589.

Lake, B. M.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. B. 2011. One Shot Learning of Simple Visual Concepts. In CogSci.

Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level Concept Learning through Probabilistic Program Induction. Science 350(6266): 1332-1338.

Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-Learning with Differentiable Convex Optimization. In CVPR, 10657-10665.

Li, F.-F.; Fergus, R.; and Perona, P. 2006. One-Shot Learning of Object Categories. TPAMI 28(4): 594-611.
Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S. J.; and Yang, Y. 2019. Learning to Propagate Labels: Transductive Propagation Network for Few-Shot Learning. In ICLR.

Maurer, A.; Pontil, M.; and Romera-Paredes, B. 2016. The Benefit of Multitask Representation Learning. JMLR 17: 81:1-81:32.

Mensink, T.; Verbeek, J. J.; Perronnin, F.; and Csurka, G. 2013. Distance-Based Image Classification: Generalizing to New Classes at Near-Zero Cost. TPAMI 35(11): 2624-2637.

Nagabandi, A.; Clavera, I.; Liu, S.; Fearing, R. S.; Abbeel, P.; Levine, S.; and Finn, C. 2019. Learning to Adapt: Meta-Learning for Model-Based Control. In ICLR.

Nichol, A.; Achiam, J.; and Schulman, J. 2018. On First-Order Meta-Learning Algorithms. CoRR abs/1803.02999.

Oliver, A.; Odena, A.; Raffel, C.; Cubuk, E. D.; and Goodfellow, I. J. 2018. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. In NeurIPS, 3239-3250.

Qiao, L.; Shi, Y.; Li, J.; Wang, Y.; Huang, T.; and Tian, Y. 2019. Transductive Episodic-Wise Adaptive Metric for Few-Shot Learning. In ICCV, 3603-3612.

Qiao, S.; Liu, C.; Shen, W.; and Yuille, A. L. 2018. Few-Shot Image Recognition by Predicting Parameters From Activations. In CVPR, 7229-7238.

Ravi, S.; and Larochelle, H. 2017. Optimization as a Model for Few-Shot Learning. In ICLR.

Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Tenenbaum, J. B.; Larochelle, H.; and Zemel, R. S. 2018. Meta-Learning for Semi-Supervised Few-Shot Classification. In ICLR.

Rusu, A. A.; Rao, D.; Sygnowski, J.; Vinyals, O.; Pascanu, R.; Osindero, S.; and Hadsell, R. 2018. Meta-Learning with Latent Embedding Optimization. In ICLR.

Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. In CVPR, 3723-3732.

Sajjadi, M.; Javanmardi, M.; and Tasdizen, T. 2016. Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning. In NeurIPS, 1163-1171.

Snell, J.; Swersky, K.; and Zemel, R. S. 2017. Prototypical Networks for Few-shot Learning. In NeurIPS, 4080-4090.

Sun, Q.; Liu, Y.; Chua, T.-S.; and Schiele, B. 2019. Meta-Transfer Learning for Few-Shot Learning. In CVPR, 403-412.

Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.; Manzagol, P.-A.; and Larochelle, H. 2020. Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples. In ICLR.

Vilalta, R.; and Drissi, Y. 2002. A Perspective View and Survey of Meta-Learning. Artificial Intelligence Review 18(2): 77-95.

Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching Networks for One Shot Learning. In NeurIPS, 3630-3638.

Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.; and Le, Q. V. 2020. Unsupervised Data Augmentation for Consistency Training. In NeurIPS.

Ye, H.-J.; Hu, H.; Zhan, D.-C.; and Sha, F. 2020. Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions. In CVPR, 8805-8814.

Ye, H.-J.; Lu, S.; and Zhan, D.-C. 2020. Distilling Cross-Task Knowledge via Relationship Matching. In CVPR, 12393-12402.

Yoon, S. W.; Seo, J.; and Moon, J. 2019. TapNet: Neural Network Augmented with Task-Adaptive Projection for Few-Shot Learning. In ICML, 7115-7123.

Yu, T.; Finn, C.; Dasari, S.; Xie, A.; Zhang, T.; Abbeel, P.; and Levine, S. 2018. One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning. In Robotics: Science and Systems.
Zhang, J.; Zhao, C.; Ni, B.; Xu, M.; and Yang, X. 2019. Variational Few-Shot Learning. In ICCV, 1685-1694.