Adaptive Cross-Modal Few-shot Learning

Chen Xing (College of Computer Science, Nankai University, Tianjin, China; Element AI, Montreal, Canada), Negar Rostamzadeh (Element AI, Montreal, Canada), Boris N. Oreshkin (Element AI, Montreal, Canada), Pedro O. Pinheiro (Element AI, Montreal, Canada)

Work done while the first author was interning at Element AI. Contact: xingchen1113@gmail.com

Metric-based meta-learning techniques have successfully been applied to few-shot classification problems. In this paper, we propose to leverage cross-modal information to enhance metric-based few-shot learning methods. Visual and semantic feature spaces have different structures by definition. For certain concepts, visual features might be richer and more discriminative than text ones, while for others the inverse might be true. Moreover, when the support from visual information is limited in image classification, semantic representations (learned from unsupervised text corpora) can provide strong prior knowledge and context to help learning. Based on these two intuitions, we propose a mechanism that can adaptively combine information from both modalities according to the new image categories to be learned. Through a series of experiments, we show that by adaptively combining the two modalities, our model outperforms current uni-modality few-shot learning methods and modality-alignment methods by a large margin on all benchmarks and few-shot scenarios tested. Experiments also show that our model can effectively adjust its focus on the two modalities. The improvement in performance is particularly large when the number of shots is very small.

1 Introduction

Deep learning methods have achieved major advances in areas such as speech, language and vision [25]. These systems, however, usually require a large amount of labeled data, which can be impractical or expensive to acquire. Limited labeled data lead to overfitting and generalization issues in classical deep learning approaches. On the other hand, existing evidence suggests that the human visual system is capable of operating effectively in the small-data regime: humans can learn new concepts from very few samples by leveraging prior knowledge and context [23, 30, 46]. The problem of learning new concepts from a small number of labeled data points is usually referred to as few-shot learning (FSL) [1, 6, 27, 22].

Most approaches addressing few-shot learning are based on the meta-learning paradigm [43, 3, 52, 13], a class of algorithms and models focusing on learning how to (quickly) learn new concepts. Meta-learning approaches work by learning a parameterized function that embeds a variety of learning tasks and can generalize to new ones. Recent progress in few-shot image classification has primarily been made in the context of unimodal learning. In contrast, employing data from another modality can help when data in the original modality is limited. For example, strong evidence supports the hypothesis that language helps toddlers recognize new visual objects [15, 45]. This suggests that semantic features from text can be a powerful source of information in the context of few-shot image classification.

Figure 1 (example images: ping-pong ball, egg, Komondor, mop): Concepts have different visual and semantic feature spaces. (Left) Some categories may have similar visual features and dissimilar semantic features. (Right) Others can possess the same semantic label but very distinct visual features.
Our method adaptively exploits both modalities to improve classification performance in the low-shot regime.

Exploiting an auxiliary modality (e.g., attributes or unlabeled text corpora) to help image classification when data from the visual modality is limited has mostly been driven by zero-shot learning (ZSL) [24, 36]. ZSL aims at recognizing categories whose instances have not been seen during training. In contrast to few-shot learning, there are no labeled samples from the original modality to help recognize new categories. Therefore, most approaches consist of aligning the two modalities during training. Through this modality alignment, the modalities are mapped together and forced to have the same semantic structure. This way, knowledge from the auxiliary modality is transferred to the visual side for new categories at test time [9].

However, visual and semantic feature spaces have heterogeneous structures by definition. For certain concepts, visual features might be richer and more discriminative than text ones, while for others the inverse might be true. Figure 1 illustrates this remark. Moreover, when the number of support images on the visual side is very small, the information provided by this modality tends to be noisy and local. On the contrary, semantic representations (learned from large unsupervised text corpora) can act as more general prior knowledge and context to help learning. Therefore, instead of aligning the two modalities (to transfer knowledge to the visual modality), for few-shot learning, where information is provided by both modalities at test time, it is better to treat them as two independent knowledge sources and adaptively exploit both according to the scenario.

Towards this end, we propose the Adaptive Modality Mixture Mechanism (AM3), an approach that adaptively and selectively combines information from two modalities, visual and semantic, for few-shot learning. AM3 is built on top of metric-based meta-learning approaches, which perform classification by comparing distances in a metric space learned from visual data. On top of that, our method also leverages text information to improve classification accuracy. AM3 performs classification in an adaptive convex combination of the two distinct representation spaces, adapted with respect to the image categories. With this mechanism, AM3 can leverage the benefits of both spaces and adjust its focus accordingly. For cases like Figure 1 (Left), AM3 focuses more on the semantic modality to obtain general context information, while for cases like Figure 1 (Right), AM3 focuses more on the visual modality to capture the rich local visual details needed to learn new concepts.

Our main contributions can be summarized as follows: (i) we propose the adaptive modality mixture mechanism (AM3) for cross-modal few-shot classification; AM3 adapts to few-shot learning better than modality-alignment methods by adaptively mixing the semantic structures of the two modalities. (ii) We show that our method achieves a considerable boost in performance over different metric-based meta-learning approaches. (iii) AM3 outperforms the current (single-modality and cross-modality) state of the art in few-shot classification by a considerable margin, on different datasets and different numbers of shots.
(iv) We perform quantitative investigations to verify that our model can effectively adjust its focus on the two modalities according to different scenarios.

2 Related Work

Few-shot learning. Meta-learning has a prominent history in machine learning [43, 3, 52]. Due to advances in representation learning methods [11] and the creation of new few-shot learning datasets [22, 53], many deep meta-learning approaches have been applied to address the few-shot learning problem. These methods can be roughly divided into two main types: metric-based and gradient-based approaches.

Metric-based approaches aim at learning representations that minimize intra-class distances while maximizing the distance between different classes. These approaches rely on an episodic training framework: the model is trained with sub-tasks (episodes) in which there are only a few training samples for each category. For example, matching networks [53] follow a simple nearest-neighbour framework. In each episode, they use an attention mechanism (over the encoded support) as a similarity measure for one-shot classification. In prototypical networks [47], a metric space is learned where embeddings of queries of one category are close to the centroid (or prototype) of supports of the same category, and far away from the centroids of other classes in the episode. Due to the simplicity and good performance of this approach, many methods extended this work. For instance, Ren et al. [39] propose a semi-supervised few-shot learning approach and show that leveraging unlabeled samples outperforms purely supervised prototypical networks. Wang et al. [54] propose to augment the support set by generating hallucinated examples. Task-dependent adaptive metric (TADAM) [35] relies on conditional batch normalization [5] to provide task adaptation (based on task representations encoded by visual features) and learn a task-dependent metric space.

Gradient-based meta-learning methods aim at training models that can generalize well to new tasks with only a few fine-tuning updates. Most of these methods are built on top of the model-agnostic meta-learning (MAML) framework [7]. Given the universality of MAML, many follow-up works were recently proposed to improve its performance on few-shot learning [33, 21]. Kim et al. [18] and Finn et al. [8] propose probabilistic extensions to MAML trained with variational approximation. Conditional class-aware meta-learning (CAML) [16] conditionally transforms embeddings based on a metric space that is trained with prototypical networks to capture inter-class dependencies. Latent embedding optimization (LEO) [41] aims to tackle MAML's problem of using only a few updates in a low-data regime to train models in a high-dimensional parameter space. The model employs a low-dimensional latent embedding space for the updates and then decodes the actual model parameters from the low-dimensional latent representations. This simple yet powerful approach achieves current state-of-the-art results on different few-shot classification benchmarks. Other meta-learning approaches for few-shot learning include using memory architectures to either store exemplar training samples [42] or to directly encode a fast adaptation algorithm [38]. Mishra et al. [32] use temporal convolution to achieve the same goal.

The current approaches mentioned above rely solely on visual features for few-shot classification.
Our contribution is orthogonal to current metric-based approaches and can be integrated into them to boost performance in few-shot image classification.

Zero-shot learning. Current ZSL methods rely mostly on visual-auxiliary modality alignment [9, 58]. In these methods, samples of the same class from the two modalities are mapped together so that the two modalities obtain the same semantic structure. There are three main families of modality-alignment methods: representation-space alignment, representation-distribution alignment, and data-synthesis alignment. Representation-space alignment methods either map the visual representation space to the semantic representation space [34, 48, 9], or map the semantic space to the visual space [59]. Distribution-alignment methods focus on making the alignment of the two modalities more robust and balanced for unseen data [44]. ReViSE [14] minimizes the maximum mean discrepancy (MMD) between the distributions of the two representation spaces to align them. CADA-VAE [44] uses two VAEs [19] to embed information for both modalities and aligns the distributions of the two latent spaces. Data-synthesis methods rely on generative models that generate images or image features as data augmentation [60, 57, 31, 54] for unseen classes, to train the mapping function for more robust alignment.

ZSL does not have access to any visual information when learning new concepts. Therefore, ZSL models have no choice but to align the two modalities; this way, during testing the image query can be directly compared to the auxiliary information for classification [59]. Few-shot learning, on the other hand, has access to a small number of support images in the original modality during testing. This makes alignment methods from ZSL unnecessary and too rigid for FSL. For few-shot learning, it would be better to preserve the distinct structures of both modalities and adaptively combine them for classification according to different scenarios. In Section 4 we show that by doing so, AM3 outperforms directly applying modality-alignment methods to few-shot learning by a large margin.

3 Method

In this section, we explain how AM3 adaptively leverages text data to improve few-shot image classification. We start with a brief explanation of episodic training for few-shot learning and a summary of prototypical networks, followed by the description of the proposed adaptive modality mixture mechanism.

3.1 Preliminaries

3.1.1 Episodic Training

Few-shot learning models are trained on a labeled dataset $\mathcal{D}_{train}$ and tested on $\mathcal{D}_{test}$. The class sets of $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ are disjoint. The test set has only a few labeled samples per category. Most successful approaches rely on an episodic training paradigm: the few-shot regime faced at test time is simulated by sampling small sub-tasks from the large labeled set $\mathcal{D}_{train}$ during training.

In general, models are trained on K-shot, N-way episodes. Each episode $e$ is created by first sampling N categories from the training set and then sampling two sets of images from these categories: (i) the support set $S_e = \{(s_i, y_i)\}_{i=1}^{N \times K}$ containing K examples for each of the N categories, and (ii) the query set $Q_e = \{(q_j, y_j)\}_{j=1}^{Q}$ containing different examples from the same N categories. Episodic training for few-shot classification is achieved by minimizing, for each episode, the loss of the prediction on samples in the query set, given the support set.
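Before defining this loss, the following is a minimal sketch (not the authors' code) of how such a K-shot, N-way episode could be sampled; the dictionary-based dataset layout and the function name are illustrative assumptions:

```python
import numpy as np

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, rng=np.random):
    """Sample one K-shot, N-way episode from a {class_label: [images]} dict.

    Returns a support set S_e with k_shot images per class and a query set
    Q_e with n_query different images per class, as described above.
    """
    classes = rng.choice(list(dataset.keys()), size=n_way, replace=False)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        order = rng.permutation(len(dataset[c]))
        images = [dataset[c][i] for i in order]
        support += [(img, episode_label) for img in images[:k_shot]]
        query += [(img, episode_label) for img in images[k_shot:k_shot + n_query]]
    return support, query
```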
The model is a parameterized function and the loss is the negative log-likelihood of the true class of each query sample:

$$\mathcal{L}(\theta) = \mathbb{E}_{(S_e, Q_e)} \Big[ -\sum_{t} \log p_\theta(y_t \mid q_t, S_e) \Big], \qquad (1)$$

where $(q_t, y_t) \in Q_e$ and $S_e$ are, respectively, the sampled query and support set of episode $e$, and $\theta$ are the parameters of the model.

3.1.2 Prototypical Networks

We build our model on top of metric-based meta-learning methods. We choose prototypical networks [47] for explaining our model due to their simplicity. We note, however, that the proposed method can potentially be applied to any metric-based approach.

Prototypical networks use the support set to compute a centroid (prototype) for each category (in the sampled episode), and query samples are classified based on the distance to each prototype. The model is a convolutional neural network [26] $f: \mathbb{R}^{n_v} \to \mathbb{R}^{n_p}$, parameterized by $\theta_f$, that learns an $n_p$-dimensional space where samples of the same category are close and those of different categories are far apart. For every episode $e$, each embedding prototype $p_c$ (of category $c$) is computed by averaging the embeddings of all support samples of class $c$:

$$p_c = \frac{1}{|S_e^c|} \sum_{(s_i, y_i) \in S_e^c} f(s_i), \qquad (2)$$

where $S_e^c \subset S_e$ is the subset of the support set belonging to class $c$. The model produces a distribution over the N categories of the episode based on a softmax [4] over the (negative) distances $d$ between the embedding of the query $q_t$ (from category $c$) and the embedded prototypes:

$$p(y = c \mid q_t, S_e, \theta) = \frac{\exp(-d(f(q_t), p_c))}{\sum_k \exp(-d(f(q_t), p_k))}. \qquad (3)$$

We consider $d$ to be the Euclidean distance. The model is trained by minimizing Equation 1, and the parameters are updated with stochastic gradient descent.

3.2 Adaptive Modality Mixture Mechanism

The information contained in semantic concepts can significantly differ from visual contents. For instance, "Siberian husky" and "wolf", or "Komondor" and "mop", might be difficult to discriminate with visual features, but might be easier to discriminate with language semantic features.

Figure 2: (Left) Adaptive modality mixture model. The final category prototype is a convex combination of the visual and the semantic feature representations. The mixing coefficient is conditioned on the semantic label embedding. (Right) Qualitative example of how AM3 works. Assume query sample $q$ has category $i$. (a) The closest visual prototype to the query sample $q$ is $p_j$. (b) The semantic prototypes. (c) The mixture mechanism modifies the positions of the prototypes, given the semantic embeddings. (d) After the update, the closest prototype to the query is now that of category $i$, correcting the classification.

In zero-shot learning, where no visual information is given at test time (that is, the support set is void), algorithms need to rely solely on an auxiliary (e.g., text) modality. On the other extreme, when the number of labeled image samples is large, neural network models tend to ignore the auxiliary modality, as they are able to generalize well with a large number of samples [20]. The few-shot learning scenario fits in between these two extremes. Thus, we hypothesize that both visual and semantic information can be useful for few-shot learning. Moreover, given that the visual and semantic spaces have different structures, it is desirable that the proposed model exploits both modalities adaptively, given different scenarios. For example, when it meets objects like ping-pong balls, which have many visually similar counterparts, or when the number of shots from the visual side is very small, it relies more on the text modality to distinguish them.
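For reference, here is a minimal PyTorch-style sketch of the prototypical-network episode loss (Equations 1-3) that AM3 builds on; the embedding network `f`, the tensor shapes, and the use of squared Euclidean distance are illustrative assumptions rather than the exact released implementation:

```python
import torch
import torch.nn.functional as F

def prototypical_loss(f, support, support_labels, query, query_labels, n_way):
    """Episode loss of Eqs. (1)-(3): prototypes are mean support embeddings;
    queries are scored by a softmax over negative (squared) Euclidean
    distances to the prototypes."""
    z_support = f(support)                      # (N*K, n_p)
    z_query = f(query)                          # (N*Q, n_p)

    # Eq. (2): prototype p_c is the mean embedding of class-c support samples.
    prototypes = torch.stack(
        [z_support[support_labels == c].mean(dim=0) for c in range(n_way)]
    )                                           # (N, n_p)

    # Eq. (3): distribution over classes from negative distances.
    dists = torch.cdist(z_query, prototypes) ** 2   # (N*Q, N)
    log_p = F.log_softmax(-dists, dim=1)

    # Eq. (1): negative log-likelihood of the true class of each query.
    return F.nll_loss(log_p, query_labels)
```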
In AM3, we augment metric-based FSL methods to incorporate language structure learned by a word-embedding model $W$ (pre-trained on large unsupervised text corpora), containing label embeddings of all categories in $\mathcal{D}_{train} \cup \mathcal{D}_{test}$. In our model, we modify the prototype representation of each category by taking into account its label embedding. More specifically, we model the new prototype representation as a convex combination of the two modalities. That is, for each category $c$, the new prototype is computed as:

$$p'_c = \lambda_c \, p_c + (1 - \lambda_c) \, w_c, \qquad (4)$$

where $\lambda_c$ is the adaptive mixture coefficient (conditioned on the category) and $w_c = g(e_c)$ is a transformed version of the label embedding for class $c$. The representation $e_c$ is the pre-trained word embedding of label $c$ from $W$. The transformation $g: \mathbb{R}^{n_w} \to \mathbb{R}^{n_p}$, parameterized by $\theta_g$, is important to guarantee that both modalities lie in the same $n_p$-dimensional space and can be combined.

The coefficient $\lambda_c$ is conditioned on the category and calculated as follows:

$$\lambda_c = \frac{1}{1 + \exp(-h(w_c))}, \qquad (5)$$

where $h$ is the adaptive mixing network, with parameters $\theta_h$. Figure 2 (Left) illustrates the proposed model. The mixing coefficient $\lambda_c$ can be conditioned on different variables; in Appendix F we show how performance changes when the mixing coefficient is conditioned on different variables.

The training procedure is similar to that of the original prototypical networks. However, the distances $d$ (used to calculate the distribution over classes for every image query) are between the query and the cross-modal prototype $p'_c$:

$$p_\theta(y = c \mid q_t, S_e, W) = \frac{\exp(-d(f(q_t), p'_c))}{\sum_k \exp(-d(f(q_t), p'_k))}, \qquad (6)$$

where $\theta = \{\theta_f, \theta_g, \theta_h\}$ is the set of parameters. Once again, the model is trained by minimizing Equation 1. Note that in this case the probability is also conditioned on the word embeddings $W$. Figure 2 (Right) illustrates an example of how the proposed method works. Algorithm 1, in the supplementary material, shows the pseudocode for calculating the episode loss.

We chose prototypical networks [47] for explaining our model due to their simplicity. We note, however, that AM3 can potentially be applied to any metric-based approach that calculates prototypical embeddings $p_c$ for categories. As shown in the next section, we apply AM3 to both ProtoNets and TADAM [35]. TADAM is a task-dependent metric-based few-shot learning method, which currently performs the best among all metric-based FSL methods.

4 Experiments

In this section we compare our model, AM3, with three different types of baselines: uni-modality few-shot learning methods, modality-alignment methods, and metric-based extensions of modality-alignment methods. We show that AM3 outperforms the state of the art of each family of baselines. We also verify the adaptiveness of AM3 through quantitative analysis.

4.1 Experimental Setup

We conduct our main experiments with two widely used few-shot learning datasets: miniImageNet [53] and tieredImageNet [39]. We also experiment on CUB-200 [55], a widely used zero-shot learning dataset. We evaluate on this dataset to provide a more direct comparison with modality-alignment methods, since most modality-alignment methods have no published results on few-shot datasets. We use GloVe [37] to extract the word embeddings for the category labels of the two image few-shot learning datasets. The embeddings are trained on large unsupervised text corpora. More details about the three datasets can be found in Appendix B.
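To make the role of these label embeddings concrete, below is a minimal PyTorch-style sketch of the adaptive mixture of Section 3.2 (Equations 4-6). It is a hedged illustration rather than the released implementation; the layer sizes, the two-layer form of $g$ and $h$, and the 300-dimensional GloVe input are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMixture(nn.Module):
    """Sketch of AM3's cross-modal prototypes (Eqs. 4-5)."""

    def __init__(self, word_dim=300, proto_dim=512, hidden=300):
        super().__init__()
        # g: maps label embeddings e_c in R^{n_w} into the visual space R^{n_p}.
        self.g = nn.Sequential(nn.Linear(word_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, proto_dim))
        # h: adaptive mixing network producing the logit of lambda_c.
        self.h = nn.Sequential(nn.Linear(proto_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, visual_protos, label_embeddings):
        w = self.g(label_embeddings)                  # (N, n_p)
        lam = torch.sigmoid(self.h(w))                # Eq. (5), shape (N, 1)
        return lam * visual_protos + (1.0 - lam) * w  # Eq. (4)

def am3_log_probs(query_embeddings, mixed_prototypes):
    """Eq. (6): softmax over negative squared distances to cross-modal prototypes."""
    dists = torch.cdist(query_embeddings, mixed_prototypes) ** 2
    return F.log_softmax(-dists, dim=1)
```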
Baselines. We compare AM3 with three families of methods. The first is uni-modality few-shot learning methods such as MAML [7], LEO [41], Prototypical Networks [47] and TADAM [35]; LEO achieves the current state of the art among uni-modality methods. The second is modality-alignment methods; among them, CADA-VAE [44] has the best published results on both zero- and few-shot learning. To better extend modality-alignment methods to the few-shot setting, we also apply the metric-based loss and the episodic training of ProtoNets on their visual side to build a visual representation space that better fits the few-shot scenario. This yields the third family of baselines: modality-alignment methods extended to metric-based FSL. Details of the baseline implementations can be found in Appendix C.

AM3 Implementation. We test AM3 with two backbone metric-based few-shot learning methods: ProtoNets and TADAM. In our experiments, we use the stronger ProtoNets implementation of [35], which we call ProtoNets++. Prior to AM3, TADAM achieves the current state of the art among all metric-based few-shot learning methods. For details on network architectures, training and evaluation procedures, see Appendix D. Source code is released at https://github.com/ElementAI/am3.

4.2 Results

Table 1 and Table 2 show classification accuracy on miniImageNet and tieredImageNet, respectively. We draw several conclusions from these experiments.

First, AM3 outperforms its backbone methods by a large margin in all cases tested. This indicates that, when properly employed, the text modality can very effectively boost performance in the metric-based few-shot learning framework.

Second, AM3 (with the TADAM backbone) achieves results superior to the current state of the art (both single-modality FSL and modality-alignment methods). The margin in performance is particularly remarkable in the 1-shot scenario. The margin of AM3 over uni-modality methods grows as the number of shots shrinks, indicating that the less visual content is available, the more important semantic information becomes for classification. Moreover, the margin of AM3 over modality-alignment methods also grows as the number of shots shrinks, indicating that the adaptiveness of AM3 is more effective when the visual modality provides less information. A more detailed analysis of the adaptiveness of AM3 is provided in Section 4.3.
| Model | 5-way 1-shot | 5-way 5-shot | 5-way 10-shot |
| --- | --- | --- | --- |
| Uni-modality few-shot learning baselines | | | |
| Matching Network [53] | 43.56 ± 0.84% | 55.31 ± 0.73% | - |
| Prototypical Network [47] | 49.42 ± 0.78% | 68.20 ± 0.66% | 74.30 ± 0.52% |
| Discriminative k-shot [2] | 56.30 ± 0.40% | 73.90 ± 0.30% | 78.50 ± 0.00% |
| Meta-Learner LSTM [38] | 43.44 ± 0.77% | 60.60 ± 0.71% | - |
| MAML [7] | 48.70 ± 1.84% | 63.11 ± 0.92% | - |
| ProtoNets w/ Soft k-Means [39] | 50.41 ± 0.31% | 69.88 ± 0.20% | - |
| SNAIL [32] | 55.71 ± 0.99% | 68.80 ± 0.92% | - |
| CAML [16] | 59.23 ± 0.99% | 72.35 ± 0.71% | - |
| LEO [41] | 61.76 ± 0.08% | 77.59 ± 0.12% | - |
| Modality-alignment baselines | | | |
| DeViSE [9] | 37.43 ± 0.42% | 59.82 ± 0.39% | 66.50 ± 0.28% |
| ReViSE [14] | 43.20 ± 0.87% | 66.53 ± 0.68% | 72.60 ± 0.66% |
| CBPL [29] | 58.50 ± 0.82% | 75.62 ± 0.61% | - |
| f-CLSWGAN [57] | 53.29 ± 0.82% | 72.58 ± 0.27% | 73.49 ± 0.29% |
| CADA-VAE [44] | 58.92 ± 1.36% | 73.46 ± 1.08% | 76.83 ± 0.98% |
| Modality-alignment baselines extended to the metric-based FSL framework | | | |
| DeViSE-FSL | 56.99 ± 1.33% | 72.63 ± 0.72% | 76.70 ± 0.53% |
| ReViSE-FSL | 57.23 ± 0.76% | 73.85 ± 0.63% | 77.21 ± 0.31% |
| f-CLSWGAN-FSL | 58.47 ± 0.71% | 72.23 ± 0.45% | 76.90 ± 0.38% |
| CADA-VAE-FSL | 61.59 ± 0.84% | 75.63 ± 0.52% | 79.57 ± 0.28% |
| AM3 and its backbones | | | |
| ProtoNets++ | 56.52 ± 0.45% | 74.28 ± 0.20% | 78.31 ± 0.44% |
| AM3-ProtoNets++ | 65.21 ± 0.30% | 75.20 ± 0.27% | 78.52 ± 0.28% |
| TADAM [35] | 58.56 ± 0.39% | 76.65 ± 0.38% | 80.83 ± 0.37% |
| AM3-TADAM | 65.30 ± 0.49% | 78.10 ± 0.36% | 81.57 ± 0.47% |

Table 1: Few-shot classification accuracy on the test split of miniImageNet. Results at the top use only visual features; modality-alignment baselines are shown in the middle, and our results (and their backbones) in the bottom part.

| Model | 5-way 1-shot | 5-way 5-shot |
| --- | --- | --- |
| Uni-modality few-shot learning baselines | | |
| MAML [7] | 51.67 ± 1.81% | 70.30 ± 0.08% |
| Proto. Nets with Soft k-Means [39] | 53.31 ± 0.89% | 72.69 ± 0.74% |
| Relation Net [50] | 54.48 ± 0.93% | 71.32 ± 0.78% |
| Transductive Prop. Nets [28] | 54.48 ± 0.93% | 71.32 ± 0.78% |
| LEO [41] | 66.33 ± 0.05% | 81.44 ± 0.09% |
| Modality-alignment baselines | | |
| DeViSE [9] | 49.05 ± 0.92% | 68.27 ± 0.73% |
| ReViSE [14] | 52.40 ± 0.46% | 69.92 ± 0.59% |
| CADA-VAE [44] | 58.92 ± 1.36% | 73.46 ± 1.08% |
| Modality-alignment baselines extended to the metric-based FSL framework | | |
| DeViSE-FSL | 61.78 ± 0.43% | 77.17 ± 0.81% |
| ReViSE-FSL | 62.77 ± 0.31% | 77.27 ± 0.42% |
| CADA-VAE-FSL | 63.16 ± 0.93% | 78.86 ± 0.31% |
| AM3 and its backbones | | |
| ProtoNets++ | 58.47 ± 0.64% | 78.41 ± 0.41% |
| AM3-ProtoNets++ | 67.23 ± 0.34% | 78.95 ± 0.22% |
| TADAM [35] | 62.13 ± 0.31% | 81.92 ± 0.30% |
| AM3-TADAM | 69.08 ± 0.47% | 82.58 ± 0.31% |

Table 2: Few-shot classification accuracy on the test split of tieredImageNet. Results at the top use only visual features; modality-alignment baselines are shown in the middle, and our results (and their backbones) in the bottom part. (Note: deeper net, evaluated in [28].)

Figure 3: (a) Comparison of AM3 and its corresponding backbone (ProtoNets++, TADAM) for different numbers of shots. (b) Average value of λ (over the whole validation set) for different numbers of shots, for both backbones (AM3-ProtoNets++, AM3-TADAM).

Finally, it is also worth noting that all modality-alignment baselines get a significant performance improvement when extended to the metric-based, episodic, few-shot learning framework. However, most modality-alignment methods (original and extended) still perform worse than the current state-of-the-art uni-modality few-shot learning method. This indicates that although modality alignment is effective for exploiting cross-modal information in ZSL, it does not fit the few-shot scenario very well. One possible reason is that when the two modalities are aligned, some information from both sides is lost because two distinct structures are forced to align.
We also conducted few-shot learning experiments on CUB-200, a popular ZSL dataset, to better compare with the published results of modality-alignment methods. All the conclusions discussed above hold true on CUB-200. Moreover, we also conduct ZSL and generalized FSL experiments to verify the importance of the proposed adaptive mechanism. Results on this dataset are shown in Appendix E.

4.3 Adaptiveness Analysis

We argue that the adaptive mechanism is the main reason for the performance boosts observed in the previous section. We design an experiment to quantitatively verify that the adaptive mechanism of AM3 can adjust its focus on the two modalities reasonably and effectively.

Figure 3(a) shows the accuracy of our model compared to the two backbones tested (ProtoNets++ and TADAM) on miniImageNet for 1- to 10-shot scenarios. It is clear from the plots that the gap between AM3 and the corresponding backbone shrinks as the number of shots increases. Figure 3(b) shows the mean and standard deviation (over the whole validation set) of the mixing coefficient λ for different numbers of shots and both backbones.

First, we observe that the mean of λ correlates with the number of shots. This means that AM3 weighs the text modality more (and the visual one less) as the number of shots (hence, the number of visual data points) decreases. This trend suggests that AM3 can automatically shift its focus to the text modality to help classification when information from the visual side is very limited. Second, we also observe that the variance of λ (shown in Figure 3(b)) correlates with the performance gap between AM3 and its backbone methods (shown in Figure 3(a)): as the variance of λ decreases with an increasing number of shots, the performance gap also shrinks. This indicates that the category-level adaptiveness of AM3 plays a very important role in the performance boost.

5 Conclusion

In this paper, we propose a method that can adaptively and effectively leverage cross-modal information for few-shot classification. The proposed method, AM3, boosts the performance of metric-based approaches by a large margin on different datasets and settings. Moreover, by leveraging unsupervised textual data, AM3 outperforms the state of the art on few-shot classification by a large margin. The textual semantic features are particularly helpful in the very low (visual) data regime (e.g., one-shot). We also conduct quantitative experiments to show that AM3 can reasonably and effectively adjust its focus on the two modalities.

References

[1] E. Bart and S. Ullman. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.
[2] Matthias Bauer, Mateo Rojas-Carulla, Jakub Bartłomiej Świątkowski, Bernhard Schölkopf, and Richard E. Turner. Discriminative k-shot learning using probabilistic models. In NIPS Bayesian Deep Learning Workshop, 2017.
[3] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Conference on Optimality in Biological and Artificial Networks, 1992.
[4] John Bridle. Probabilistic interpretation of feedforward classification network outputs with relationships to statistical pattern recognition. Neurocomputing: Algorithms, Architectures and Applications, 1990.
[5] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 2018.
[6] Michael Fink. Object classification from a single example utilizing class relevance metrics. In NIPS, 2005.
[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[8] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In NeurIPS, 2018.
[9] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In ICANN, 2001.
[14] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. Learning robust visual-semantic embeddings. In CVPR, 2017.
[15] R. Jackendoff. On beyond zebra: the relation of linguistic and visual information. Cognition, 1987.
[16] Xiang Jiang, Mohammad Havaei, Farshid Varno, Gabriel Chartrand, Nicolas Chapados, and Stan Matwin. Learning to learn with conditional class dependencies. In ICLR, 2019.
[17] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
[18] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In NeurIPS, 2018.
[19] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[21] Alexandre Lacoste, Thomas Boquet, Negar Rostamzadeh, Boris Oreshkin, Wonchang Chung, and David Krueger. Deep prior. NIPS workshop, 2017.
[22] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Annual Meeting of the Cognitive Science Society, 2011.
[23] Barbara Landau, Linda B. Smith, and Susan S. Jones. The importance of shape in early lexical learning. Cognitive Development, 1988.
[24] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zero-data learning of new tasks. In AAAI, 2008.
[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.
[26] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
[27] Fei-Fei Li, Rob Fergus, and Pietro Perona. One-shot learning of object categories. PAMI, 2006.
[28] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, and Yi Yang. Transductive propagation network for few-shot learning. In ICLR, 2019.
[29] Zhiwu Lu, Jiechao Guan, Aoxue Li, Tao Xiang, An Zhao, and Ji-Rong Wen. Zero and few shot learning with semantic feature synthesis and competitive learning. arXiv preprint arXiv:1810.08332, 2018.
[30] Ellen M. Markman. Categorization and Naming in Children: Problems of Induction. MIT Press, 1991.
[31] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A. Murthy. A generative model for zero shot learning using conditional variational autoencoders. In CVPR Workshops, 2018.
[32] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
[33] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv, 2018.
[34] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[35] Boris N. Oreshkin, Alexandre Lacoste, and Pau Rodriguez. TADAM: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, 2018.
[36] Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, and Tom M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
[37] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[38] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[39] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. In ICLR, 2018.
[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[41] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
[42] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
[43] Jürgen Schmidhuber. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-...-hook. Diploma thesis, Technische Universität München, Germany, 1987.
[44] Edgar Schönfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero- and few-shot learning via aligned variational autoencoders. In CVPR, 2019.
[45] Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 2005.
[46] Linda B. Smith and Lauren K. Slone. A developmental approach to machine learning? Frontiers in Psychology, 2017.
[47] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[48] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[49] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
[50] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[51] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[52] Sebastian Thrun. Lifelong learning algorithms. Kluwer Academic Publishers, 1998.
[53] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[54] Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In CVPR, 2018.
[55] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[56] Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. PAMI, 2018.
[57] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.
[58] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, 2017.
[59] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In CVPR, pages 2021-2030, 2017.
[60] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.