# Learning to Learn Variational Semantic Memory

Xiantong Zhen1,2, Yingjun Du1, Huan Xiong3,4, Qiang Qiu5, Cees G. M. Snoek1, Ling Shao2,4
1AIM Lab, University of Amsterdam, Netherlands
2Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
3Harbin Institute of Technology, Harbin, China
4Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
5Electrical and Computer Engineering, Purdue University, USA
{x.zhen,y.du,cgmsnoek}@uva.nl

Equal contribution. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

In this paper, we introduce variational semantic memory into meta-learning to acquire long-term knowledge for few-shot learning. The variational semantic memory accrues and stores semantic information for the probabilistic inference of class prototypes in a hierarchical Bayesian framework. The semantic memory is grown from scratch and gradually consolidated by absorbing information from the tasks it experiences. By doing so, it is able to accumulate long-term, general knowledge that enables it to learn new concepts of objects. We formulate memory recall as the variational inference of a latent memory variable from addressed contents, which offers a principled way to adapt the knowledge to individual tasks. Our variational semantic memory, as a new long-term memory module, confers principled recall and update mechanisms that enable semantic information to be efficiently accrued and adapted for few-shot learning. Experiments demonstrate that the probabilistic modelling of prototypes achieves a more informative representation of object classes compared to deterministic vectors. The consistent new state-of-the-art performance on four benchmarks shows the benefit of variational semantic memory in boosting few-shot recognition.

1 Introduction

Memory plays an essential role in human intelligence, especially for aiding learning and reasoning in the present. In machine intelligence, neural memory [22, 73, 23] has been shown to enhance neural networks by augmenting them with an external memory module. For instance, episodic memory storing past experiences helps reinforcement learning agents adapt more quickly and improves sample efficiency [7, 55, 24]. Memory is well suited to few-shot learning by meta-learning in that it offers an effective mechanism to extract inductive bias [21] by accumulating prior knowledge from a set of previously observed tasks. One of the primary issues when designing a memory module is deciding what information should be memorized, which usually depends on the problem to be solved. Though highly promising, it is non-trivial to learn to store useful information from previous experience, which should be as non-redundant as possible. Existing few-shot learning works with external memory typically store the information from the support set of the current task [41, 70, 57, 40, 30], focusing on learning the access mechanism, which is assumed to be shared across tasks. The memory used in these works is short-term with limited capacity [18, 39], in that long-term information is not well retained, despite its importance for efficiently learning new tasks.

Semantic memory, also known as conceptual knowledge [43, 67, 59], refers to general facts and common world knowledge gathered throughout our lives [52]. It enables humans to quickly learn new concepts by recalling the knowledge acquired in the past [59].
Compared to episodic memory, semantic memory has been less studied [67, 59], despite its pivotal role in remembering the past and imagining the future [29]. By its very nature, semantic memory can provide conceptual context to facilitate novel event construction [28] and support a variety of cognitive activities, e.g., object recognition [5]. We draw inspiration from this cognitive function of semantic memory and introduce it into meta-learning to learn to collect long-term semantic knowledge for few-shot learning.

In this paper, we propose an external memory module that accrues and stores long-term semantic information gained from past experiences, which we call variational semantic memory. The function of semantic memory closely matches that of prototypes [61, 1], which identify the semantics of objects in few-shot classification. The semantic knowledge accumulated in the memory helps build new object concepts represented by prototypes, which are typically obtained from only one or a few samples [61]. We apply our variational semantic memory module to the probabilistic inference of class prototypes modelled as distributions. The resulting probabilistic prototypes are more informative and therefore represent categories of objects better than deterministic vectors [61, 1]. We formulate memory recall as the variational inference of the latent memory, an intermediate stochastic variable. This offers a principled way to retrieve information from the external memory and incorporate it into the inference of class prototypes for each individual task. We cast the optimization as a hierarchical variational inference problem in the Bayesian framework; the parameters of the prototype inference are jointly optimized in conjunction with the memory recall and update. The semantic memory is gradually consolidated throughout the course of learning by updating its knowledge with new observations in each experienced task via an attention mechanism. Long-term semantic knowledge about seen object categories is acquired, maintained and enhanced during the learning process. This contrasts with existing works [57, 70] in which the memory stores data from the support set and therefore only considers the short term. In our memory, each entry stores the semantics of a distinct object category by summarizing feature representations of class samples. This reduces redundant information and saves storage overhead. More importantly, it avoids collapsing memory reading and writing into single memory slots [22, 74], which ensures that rich context information is provided for better construction of new concepts.

To summarize, we make three contributions: i) We propose variational semantic memory, a long-term memory module, which learns to acquire semantic information and enables new concepts of object categories to be quickly learned for few-shot learning. ii) We formulate memory recall as a variational inference problem by introducing a latent memory variable, which offers a principled way to retrieve relevant information that fits specific tasks. iii) We introduce variational semantic memory into the probabilistic inference of prototypes modelled as distributions rather than deterministic vectors, which provides more informative representations of class prototypes.

Few-shot classification is commonly learned by constructing T few-shot tasks from a large dataset and optimizing the model parameters on these tasks.
A task, also called an episode, is defined as an N-way K-shot classification problem [70, 50]. An episode is drawn from a dataset by randomly sampling a subset of classes, and the data points in an episode are partitioned into a support set S and a query set Q. We adopt episodic optimization [70], which trains the model iteratively by taking one episode update at a time. The update of the model parameters is defined by a variational learning objective based on an evidence lower bound (ELBO) [6]. Different from traditional machine learning tasks, meta-learning for few-shot classification trains the model on a meta-training set and evaluates it on a meta-test set, whose classes are not seen during meta-training.

In this work, we develop our method based on the prototypical network (ProtoNet) [61]. Specifically, the prototype $z_n$ of an object class $n$ is obtained by

$$z_n = \frac{1}{K}\sum_{k}\Phi(x_{n,k}),$$

where $\Phi(x_{n,k})$ is the feature embedding of the sample $x_{n,k}$, usually obtained by a convolutional neural network. For each query sample $x$, the distribution over classes is calculated by a softmax over distances to the prototypes of all classes in the embedding space:

$$p(y_n = 1 \,|\, x) = \frac{\exp\big(-d(\Phi(x), z_n)\big)}{\sum_{n'}\exp\big(-d(\Phi(x), z_{n'})\big)}, \qquad (1)$$

where $y$ denotes a random one-hot vector, $y_n$ indicates its $n$-th element, and $d(\cdot, \cdot)$ is a distance function. Due to its non-parametric nature, the ProtoNet enjoys high flexibility and efficiency, and has achieved great success in few-shot learning.

The ideal prototypical representation should be expressive and encompass enough intra-class variance, while remaining distinguishable between different classes. In the literature [61, 1], however, the prototypes are commonly modelled by a single or multiple deterministic vectors obtained by average pooling of only a few samples or by clustering. Hence, they are not sufficiently representative of object categories. Moreover, uncertainty is inevitable due to the scarcity of data, and should also be encoded into the prototypical representations. In this paper, we derive a probabilistic latent variable model by modelling prototypes as distributions, which are learned by variational inference.

2.1 Variational Prototype Inference

We introduce a probabilistic modelling of class prototypes, in which we treat the prototype $z$ of each class as a distribution. In the few-shot learning scenario, finding $z$ amounts to inferring the posterior $p(z|x, y)$, where $(x, y)$ denotes a sample from the query set Q. We derive a variational inference framework to infer $z$ by leveraging the support set S. Consider the conditional log-likelihood of a probabilistic latent variable model in which we incorporate the prototype $z$ as the latent variable:

$$\log \prod_{i=1}^{|Q|} p(y_i|x_i) = \log \prod_{i=1}^{|Q|} \int p(y_i|x_i, z)\, p(z|x_i)\, dz, \qquad (2)$$

where $p(z|x_i)$ is the conditional prior, which makes the prototype dependent on $x_i$. In general, it is intractable to solve for the posterior directly, so we resort to a variational distribution that approximates the true posterior by minimizing the KL divergence

$$D_{\mathrm{KL}}\big[q(z|S)\,\|\,p(z|x, y)\big], \qquad (3)$$

where $q(z|S)$ is the variational posterior that makes the prototype $z$ dependent on the support set S, leveraging the meta-learning setting for few-shot classification. By applying Bayes' rule, we obtain

$$\log \prod_{i=1}^{|Q|} p(y_i|x_i) \geq \sum_{i=1}^{|Q|} \Big[\, \mathbb{E}_{q(z|S)}\big[\log p(y_i|x_i, z)\big] - D_{\mathrm{KL}}\big(q(z|S)\,\|\,p(z|x_i)\big) \Big], \qquad (4)$$

which is the ELBO of the conditional log-likelihood in (2).
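For concreteness, the base prototypical classifier in Eq. (1), on which the variational treatment above builds, can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation and Euclidean distance for $d(\cdot,\cdot)$; the paper does not prescribe a framework, and all names and shapes are illustrative.

```python
# Minimal sketch of prototypical classification (Eq. 1), assuming PyTorch and
# Euclidean distance; tensor shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def prototypical_log_probs(support_feats, support_labels, query_feats, n_way):
    """support_feats: [N*K, D], support_labels: [N*K] with values in {0,...,n_way-1},
    query_feats: [Q, D]; returns log p(y_n = 1 | x) for every query sample."""
    # z_n = (1/K) * sum_k Phi(x_{n,k}): mean embedding of each class's support samples
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(n_way)
    ])                                            # [N, D]
    dists = torch.cdist(query_feats, prototypes)  # pairwise Euclidean distances, [Q, N]
    return F.log_softmax(-dists, dim=-1)          # softmax over negative distances

# usage (with a feature extractor Phi producing D-dimensional embeddings):
# log_probs = prototypical_log_probs(Phi(support_x), support_y, Phi(query_x), n_way=5)
```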
In practice, the variational posterior $q(z|S)$ is implemented by a neural network that takes the average feature representation of the samples in the support set S and returns the mean and variance of the prototype $z$. The ELBO in (4) can be directly adopted as the optimization objective for the variational inference of the prototype. While inheriting the flexibility of prototype-based few-shot learning [61, 1], our probabilistic inference enhances class expressiveness by exploring higher-order information, i.e., variance, beyond a single or multiple deterministic mean vectors of the samples in each class. More importantly, the probabilistic modelling provides a principled way to incorporate prior knowledge acquired from experienced tasks. In what follows, we introduce the external memory to augment the probabilistic latent variable model for enhanced variational inference of prototypes.

2.2 Variational Semantic Memory

We introduce the variational semantic memory to accumulate and store semantic information from previous tasks for the inference of prototypes in new tasks. The knowledge about objects in the memory is consolidated episodically by seeing more object instances, which enables conceptual representations of new objects to be quickly built up for novel categories in tasks to come.

To be more specific, we deploy an external memory unit M which stores a key-value pair in each row of the memory array, as in [22]. The keys are the average feature representations of images from the same class, and the values are the corresponding class labels. The semantics of object categories in the memory provide context for quickly learning the concepts of new object categories from only a few examples in the current task. In contrast to most existing external memory modules [57, 47, 8], our variational semantic memory stores semantic information by summarizing the samples of individual categories; it therefore requires relatively little storage overhead and enables more efficient retrieval of content from the memory.

Memory recall and inference. When working with neural memory modules, it is pivotal to recall relevant information from the external memory and adapt it to learning new tasks. Recalling a memory is not simply a read-out; the content from the memory must be processed in order to fit the data of a specific task [47, 22, 73]. We regard memory recall as a decoding process of the chosen content in the memory, which we accomplish via variational inference, instead of simply reading out the raw content from the external memory and directly incorporating it into specific tasks. To this end, we introduce an intermediate stochastic variable, referred to as the latent memory $m$. We cast the retrieval of memory into the inference of $m$ from the addressed memory M; the memory addressing is based on the similarity between the content in the memory and the support set of the current task. The latent memory $m$ is inferred to connect the accrued semantic knowledge stored in the long-term memory to the current task, and is seamlessly coupled with the prototype inference under a hierarchical Bayesian framework.

From a Bayesian perspective, the prototype posterior can be inferred by marginalizing over the latent memory variable $m$:

$$q(z|S) = \int q(z|m, S)\, p(m|S)\, dm, \qquad (5)$$

where $q(z|m, S)$ indicates that the prototype $z$ is now dependent on the support set S and the latent memory $m$.
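As described at the start of this subsection, the prototype posterior is amortized by a small inference network. A minimal sketch is given below, assuming PyTorch, a diagonal Gaussian parameterization and illustrative layer sizes; with memory, the latent memory m would enter as an additional conditioning input (e.g., concatenated to the averaged support features, which is an assumption rather than the authors' stated design).

```python
# Minimal sketch of the amortized posterior q(z|S): an inference network maps the
# mean support embedding of a class to the mean and log-variance of a diagonal
# Gaussian over the prototype, sampled with the reparameterization trick.
# PyTorch, layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class PrototypePosterior(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.mean_head = nn.Linear(hidden_dim, feat_dim)
        self.logvar_head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, support_feats):               # [K, D] embeddings of one class
        h = self.trunk(support_feats.mean(dim=0))   # average over the K shots
        mean, logvar = self.mean_head(h), self.logvar_head(h)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)  # reparameterized sample
        return z, mean, logvar                      # z is one Monte Carlo draw of the prototype
```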
To leverage the external memory M, we design a variational approximation $q(m|M, S)$ to the posterior over the latent memory $m$ by inferring it from M conditioned on S:

$$q(m|M, S) = \sum_{a=1}^{|M|} p(m|M_a, S)\, p(a|M, S). \qquad (6)$$

Here, $a$ is a categorical addressing variable, $M_a$ denotes the memory content at address $a$, and $|M|$ is the memory size, i.e., the number of memory entries. We thus establish a hierarchical Bayesian framework for the variational inference of prototypes:

$$q(z|M, S) = \sum_{a=1}^{|M|} p(a|M, S) \int q(z|S, m)\, p(m|M_a, S)\, dm, \qquad (7)$$

which is shown as a graphical model in Figure 1. We use the support set S and memory M to generate the categorical variable $a$ that addresses the external memory, and then fetch the content $M_a$ to infer the latent memory $m$, which is incorporated as a conditioning variable to assist S in the inference of the prototype $z$. This offers a principled way to incorporate semantic knowledge when building up the prototypes of novel object categories. It mimics the cognitive mechanism of the human brain when learning new concepts by associating them with related concepts learned in the past [29]. Moreover, it handles the ambiguity and uncertainty of memory recall more naturally than the common strategy of using a deterministic transformation [22, 73].

When $a$ is given, $m$ depends only on $M_a$ and no longer relies on S. Therefore, we have $p(m|M_a, S) = p(m|M_a)$ by safely dropping S, which gives rise to

$$q(m|M, S) = \sum_{a=1}^{|M|} p(m|M_a)\, p(a|M, S). \qquad (8)$$

Since the memory size is finite, bounded by the number of seen classes, we further approximate $q(m|M, S)$ empirically by

$$q(m|M, S) = \sum_{a=1}^{|M|} \lambda_a\, p(m|M_a), \qquad \lambda_a = \frac{\exp\big(g(M_a, S)\big)}{\sum_{i}\exp\big(g(M_i, S)\big)}, \qquad (9)$$

where $M_a$ is a memory slot storing the average feature representation of the samples of a category seen during learning, and $g(\cdot, \cdot)$ is a learnable similarity function, which, for efficiency, we implement as a dot product between $M_i$ and the average of the samples in S. The prototype inference can now be approximated by Monte Carlo sampling:

$$q(z|M, S) \approx \frac{1}{J}\sum_{j=1}^{J} q(z|m^{(j)}, S), \qquad m^{(j)} \sim \sum_{a=1}^{|M|}\lambda_a\, p(m|M_a), \qquad (10)$$

where $J$ is the number of Monte Carlo samples.

Figure 1: Graphical illustration of the proposed probabilistic prototype inference with variational semantic memory. M is the semantic memory module, $S_n^t$ denotes the samples of the $n$-th class in the support set of task $t$, $Q^t$ is the query set, $T$ is the number of tasks, and $N$ is the number of classes in each task.

Memory update and consolidation. The memory update is an important operation in the maintenance of the memory: it should effectively absorb new, useful information to enrich the memory content. We draw inspiration from the concept formation process in human cognition [29]: the concept of an object category is formed and grown by seeing a set of similar objects of the same category. The memory is built from scratch and gradually consolidated by being episodically updated with knowledge observed in a series of related tasks. We adopt an attention mechanism to refresh the content of the memory, taking into account the structural information of the data. To be more specific, the memory is empty at the beginning of learning. When a new task arrives, we directly append the mean feature representation of the data of a given category to the memory entries if this category has not been seen. Otherwise, for seen categories, we update the memory content with the newly observed data from the current task using self-attention [68], similar to the graph attention mechanism [69].
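Before turning to the update rule, the recall step in Eqs. (8)–(10) can be sketched as follows; this is a minimal illustration assuming PyTorch, with the mixture sampled by first drawing an address and then sampling from the corresponding Gaussian component. Names such as `MemoryPosterior` and `recall` are illustrative assumptions.

```python
# Minimal sketch of memory recall (Eqs. 8-10): soft addressing by dot-product
# similarity between the class-level memory keys and the mean support embedding,
# then Monte Carlo sampling of the latent memory m from the addressed mixture.
# PyTorch and all names/shapes are illustrative assumptions.
import torch
import torch.nn as nn

class MemoryPosterior(nn.Module):
    """Maps an addressed memory slot M_a to a diagonal Gaussian p(m | M_a)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mean_head = nn.Linear(feat_dim, feat_dim)
        self.logvar_head = nn.Linear(feat_dim, feat_dim)

    def forward(self, slot):
        return self.mean_head(slot), self.logvar_head(slot)

def recall(memory_keys, support_feats, posterior, num_samples=1):
    """memory_keys: [|M|, D] average class representations; support_feats: [K, D] for one class."""
    s_bar = support_feats.mean(dim=0)                 # mean support representation of the class
    lam = torch.softmax(memory_keys @ s_bar, dim=0)   # addressing weights lambda_a (Eq. 9)
    samples = []
    for _ in range(num_samples):                      # J Monte Carlo samples (Eq. 10)
        a = torch.multinomial(lam, 1).item()          # draw an address a ~ Cat(lambda)
        mean, logvar = posterior(memory_keys[a])      # parameters of p(m | M_a)
        samples.append(mean + torch.exp(0.5 * logvar) * torch.randn_like(mean))
    return torch.stack(samples)                       # [num_samples, D] latent memory draws

# Each draw m^(j) is then passed, together with the support set, to the prototype
# posterior q(z | m, S) of Section 2.1.
```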
The graph attention mechanism enables the structural information of the data to be better explored for the memory update. We first construct a graph with respect to the memory slot $M_c$ to be updated. The nodes of the graph are a set of feature representations $H_c = \{h^0_c, h^1_c, h^2_c, \ldots, h^{N_c}_c\}$, where $h^i_c \in \mathbb{R}^d$, $N_c = |S_c \cup Q_c|$, $h^0_c = M_c$, and $h^{i>0}_c = h_\phi(x^i_c)$. Here, $h_\phi(\cdot)$ is the convolutional neural network for feature extraction, and $x^i_c \in S_c \cup Q_c$ ranges over all samples, from both the support and query sets, of the $c$-th category in the current task. We use the nodes $H_c$ of the graph to generate a new representation $\tilde{M}_c$ of the memory slot, which better exploits the structural information of the data. To do so, we compute a similarity coefficient between $M_c$ and each node $h^i_c$ on the graph. We implement this with a single-layer feed-forward neural network parameterized by a weight vector $w \in \mathbb{R}^{2d}$, that is, $e^i_{M_c} = w^\top [M_c, h^i_c]$, with $[\cdot, \cdot]$ denoting concatenation. Here, $e^i_{M_c}$ indicates the importance of node $i$'s features to node $M_c$. In practice, we use the following normalized similarity coefficients [69]:

$$\beta^i_{M_c} = \mathrm{softmax}_i\big(e^i_{M_c}\big) = \frac{\exp\big(\mathrm{LeakyReLU}(w^\top [M_c, h^i_c])\big)}{\sum_{j=0}^{N_c}\exp\big(\mathrm{LeakyReLU}(w^\top [M_c, h^j_c])\big)}. \qquad (11)$$

We can now compute a linear combination of the feature representations of the nodes on the graph as the output representation of $M_c$:

$$\tilde{M}_c = \sigma\Big(\sum_{i=0}^{N_c}\beta^i_{M_c}\, h^i_c\Big), \qquad (12)$$

where $\sigma(\cdot)$ is a nonlinear activation function, e.g., the softmax. The graph attention operation can effectively find and assimilate the most useful information from the samples of the new task. We update the memory content with an attenuated weighted average,

$$M_c \leftarrow \alpha M_c + (1 - \alpha)\, \tilde{M}_c, \qquad (13)$$

where $\alpha \in (0, 1)$ is a hyperparameter. This operation allows useful information to be retained in the memory, while erasing less relevant or trivial information.

2.3 Objective

To train the model, we adopt stochastic gradient variational Bayes [31] and implement it using deep neural networks for end-to-end learning. By combining (4), (9) and (7), we obtain the following objective for the hierarchical variational inference:

$$\arg\min_{\{\phi, \theta, \psi, \varphi\}} \sum_{t}\sum_{n}\sum_{(x^t_i, y^t_i)\in Q^t_n} \Big[ -\frac{1}{L}\sum_{\ell=1}^{L}\log p\big(y^t_i \,|\, h_\phi(x^t_i), z^{(\ell)}_n\big) + \frac{1}{J}\sum_{j=1}^{J} D_{\mathrm{KL}}\big(q_\varphi(z_n | m^{(j)}, \bar{h}_{S^t_n}) \,\|\, p_\theta(z_n | h_\phi(x^t_i))\big) + D_{\mathrm{KL}}\Big(\sum_{a}\lambda_a\, p_\psi(m|M_a) \,\Big\|\, p_\psi(m|\bar{h}_{S^t_n})\Big) \Big], \qquad (14)$$

where $z^{(\ell)}_n \sim \frac{1}{J}\sum_{j=1}^{J} q_\varphi(z_n | m^{(j)}, S^t_n)$, $m^{(j)} \sim \sum_{a=1}^{|M|}\lambda_a\, p_\psi(m|M_a)$, $L$ and $J$ are the numbers of Monte Carlo samples, $\bar{h}_{S^t_n} = \frac{1}{|S^t_n|}\sum_{x\in S^t_n} h_\phi(x)$, and $n$ denotes the $n$-th class. To enable backpropagation, we adopt the reparameterization trick [31] for sampling $z$ and $m$. The third term in (14) essentially serves to constrain the inferred latent memory to ensure that it is relevant to the current task. Here, we share the parameters between the prior and the posterior over $m$, and we also amortize the inference of prototypes across classes [20], using the samples $S^t_n$ of each class to infer its prototype individually.

In practice, the log-likelihood term is implemented as a cross-entropy loss between predictions and ground-truth labels. The conditional probabilistic distributions are set to be diagonal Gaussians. We implement them using multi-layer perceptrons with the amortization technique and the reparameterization trick [31, 54], which take the conditioning variables as input and output the parameters of the Gaussians. In addition, we implement the model with the objective in (4) alone, which we refer to as the variational prototype network.

3 Related Work

Meta-Learning. Meta-learning, or learning to learn [60, 64], for few-shot learning [32, 50, 57, 15, 70, 61] addresses the fundamental challenge of generalizing across tasks with limited labelled data.
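As a concrete companion to the attention-based memory update in Eqs. (11)–(13) of Section 2.2, the following minimal sketch assumes PyTorch; the sigmoid output nonlinearity and all names are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the attention-based memory update (Eqs. 11-13): a learnable
# weight vector scores each node against the current memory slot, the attention
# weights combine the node features, and the slot is refreshed with an attenuated
# weighted average. PyTorch, the output nonlinearity and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryUpdater(nn.Module):
    def __init__(self, feat_dim=64, alpha=0.5):
        super().__init__()
        self.w = nn.Linear(2 * feat_dim, 1, bias=False)   # weight vector w in Eq. (11)
        self.alpha = alpha                                 # attenuation hyperparameter, Eq. (13)

    def forward(self, memory_slot, class_feats):
        """memory_slot: [D] current M_c; class_feats: [Nc, D] support+query embeddings of class c."""
        nodes = torch.cat([memory_slot.unsqueeze(0), class_feats], dim=0)   # h^0_c = M_c, h^{i>0}_c
        pairs = torch.cat([memory_slot.expand_as(nodes), nodes], dim=-1)    # [M_c, h^i_c] pairs
        scores = F.leaky_relu(self.w(pairs)).squeeze(-1)                    # e^i_{M_c}
        beta = torch.softmax(scores, dim=0)                                 # Eq. (11)
        new_slot = torch.sigmoid((beta.unsqueeze(-1) * nodes).sum(dim=0))   # \tilde{M}_c, Eq. (12)
        return self.alpha * memory_slot + (1 - self.alpha) * new_slot       # Eq. (13)
```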
Meta-learning approaches for few-shot learning differ in the way they acquire inductive biases and adopt them for individual tasks [21]. They can be roughly categorized into four groups. Those in the first group are based on distance metrics and generally learn a shared or adaptive embedding space in which query images can be accurately matched to support images for classification [70, 61, 58, 42, 78, 1, 9]. Those based on optimization try to learn an optimization algorithm that is shared across tasks and can be adapted to new tasks, enabling learning to be conducted efficiently and effectively [50, 2, 15, 16, 21, 66, 56, 77, 49]. The third group explicitly learns a base-learner that incorporates knowledge acquired by the meta-learner and effectively solves individual tasks [20, 4, 81]. In the fourth group, a memory mechanism is incorporated: usually, an external memory module is deployed to rapidly assimilate new data of unseen tasks, which is then used for quick adaptation or to make decisions [57, 40, 41, 38]. The methods from different groups are not necessarily exclusive, and they can be combined to improve performance [66]. In addition, meta-learning has also been explored for reinforcement learning [37, 7, 72, 12, 14, 55] and other tasks [3, 76].

Prototypes. The prototypical network is one of the most successful meta-learning models for few-shot learning [70, 48, 25, 17]. It learns to project samples into a metric space in which classification is conducted by computing the distance from query samples to class prototypes. Allen et al. [1] introduced an infinite mixture of prototypes that represents each category of objects by multiple clusters, where the number of clusters is inferred from data by non-parametric Bayesian methods [48, 25]. Recently, Triantafillou et al. [66] combined the complementary strengths of prototypical networks and MAML [15] by leveraging their respective effective inductive bias and flexible adaptation mechanism for few-shot learning. Our work improves the prototypical network by probabilistic modelling of prototypes, inheriting the effectiveness and flexibility of the ProtoNet while further enriching the expressiveness of prototypes with an external memory mechanism.

Memory. It has been shown that neural networks with memory, such as the long short-term memory (LSTM) model [26], are capable of meta-learning [27]. Recent works augment neural networks with an external memory module to improve their learning capability [57, 44, 73, 23, 30, 40, 41, 47, 74, 75].

Figure 2: Prototype distributions of our variational prototype network without (left) and with variational semantic memory (right), where different colors indicate different categories. With the memory, the prototypes become more distinctive and distant from each other, with less overlap.

Table 1: Benefit of the variational prototype network over the ProtoNet [61] in (%) on miniImageNet, tieredImageNet and CIFAR-FS.

| | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot |
|---|---|---|---|---|---|---|
| ProtoNet | 47.40 ± 0.60 | 65.41 ± 0.52 | 53.31 ± 0.89 | 72.69 ± 0.74 | 55.50 ± 0.70 | 72.01 ± 0.60 |
| Variational prototype network | 52.11 ± 1.70 | 66.13 ± 0.83 | 55.13 ± 1.88 | 73.71 ± 0.84 | 61.35 ± 1.60 | 75.72 ± 0.90 |

Table 2: Benefit of variational semantic memory in our variational prototype network over alternative memory modules [22, 73] in (%) on miniImageNet, tieredImageNet and CIFAR-FS.
| | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot |
|---|---|---|---|---|---|---|
| Variational prototype network | 52.11 ± 1.70 | 66.13 ± 0.83 | 55.13 ± 1.88 | 73.71 ± 0.84 | 61.35 ± 1.60 | 75.72 ± 0.90 |
| w/ Rote Memory | 53.15 ± 1.81 | 66.92 ± 0.78 | 55.98 ± 1.73 | 74.12 ± 0.88 | 62.71 ± 1.71 | 76.17 ± 0.81 |
| w/ Transformed Memory | 53.85 ± 1.71 | 67.23 ± 0.89 | 56.15 ± 1.70 | 74.33 ± 0.80 | 62.97 ± 1.88 | 76.97 ± 0.77 |
| w/ Variational semantic memory | 54.73 ± 1.60 | 68.01 ± 0.90 | 56.88 ± 1.71 | 74.65 ± 0.81 | 63.42 ± 1.90 | 77.93 ± 0.80 |

In few-shot learning, existing works with external memory mainly store the information contained in the support set of the current task [41, 70, 40], focusing on learning the access mechanism shared across tasks. In these works, the external memory is wiped from episode to episode [18, 39]. Hence, it fails to maintain long-term information, which has been shown to be crucial for efficiently learning new tasks [47, 18]. Memory has also been incorporated into generative models [8, 36, 74] and sequence modeling [34] by conditioning on the context information provided in the external memory. To store minimal amounts of data, Ramalho and Garnelo proposed a surprise-based memory module, which deploys a memory controller to select a minimal set of samples to write into the memory [47]. In contrast to [8], our variational semantic memory adopts deterministic soft addressing, which enables us to leverage the full context of the memory content by attending to multiple entries instead of a single one [8]. Our variational semantic memory is thus able to accrue long-term knowledge that provides rich context information for quickly learning novel tasks. Rather than directly using specific raw content or deploying a deterministic transformation [22, 73], we introduce the latent memory as an intermediate stochastic variable to be inferred from the addressed content in the memory. This enables the most relevant information to be retrieved from the memory and adapted to the data of specific tasks.

4 Experiments

Datasets and settings. We evaluate our model on four standard few-shot classification benchmarks: miniImageNet [71], tieredImageNet [53], CIFAR-FS [4] and Omniglot [33]. For fair comparison with previous works, we experiment with both shallow convolutional neural networks with the same architecture as in [20] and a deep ResNet-12 [42, 35, 38, 51] architecture for feature extraction. More implementation details, including optimization settings and network architectures, are given in the supplementary material. We also provide a state-of-the-art comparison on the Omniglot dataset and more results with other deep architectures, e.g., WRN-28-10 [79].

Benefit of variational prototype network. We compare against the ProtoNet [61] as our baseline model, in which the prototypes are obtained by averaging the feature representations of each class. These results are obtained with shallow networks. As shown in Table 1, the proposed variational prototype network consistently outperforms the ProtoNet.

Table 3: Advantage of the memory update with the attention mechanism.

| | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot |
|---|---|---|---|---|---|---|
| w/o Attention | 53.97 ± 1.80 | 67.13 ± 0.76 | 56.05 ± 1.73 | 74.27 ± 0.85 | 62.93 ± 1.76 | 76.79 ± 0.80 |
| w/ Attention | 54.73 ± 1.60 | 68.01 ± 0.90 | 56.88 ± 1.71 | 74.65 ± 0.81 | 63.42 ± 1.90 | 77.93 ± 0.80 |

Table 4: Comparison with other memory models.
| | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot |
|---|---|---|---|---|---|---|
| MANN [57] | 41.38 ± 1.70 | 61.73 ± 0.80 | 44.27 ± 1.69 | 67.15 ± 0.70 | 54.31 ± 1.91 | 67.98 ± 0.80 |
| KM [74] | 53.84 ± 1.70 | 67.35 ± 0.80 | 55.73 ± 1.65 | 73.36 ± 0.70 | 62.58 ± 1.80 | 77.11 ± 0.80 |
| Variational semantic memory | 54.73 ± 1.60 | 68.01 ± 0.90 | 56.88 ± 1.71 | 74.65 ± 0.81 | 63.42 ± 1.90 | 77.93 ± 0.80 |

The consistent improvements demonstrate the benefit brought by probabilistic modelling. The probabilistic prototypes provide more informative representations of classes, which are able to encompass large intra-class variations and therefore improve performance.

Benefit of variational semantic memory. We compare with two alternative methods of memory recall: rote memory and transformed memory [22, 73] (implementation details are provided in the supplementary material). As shown in Table 2, our variational semantic memory surpasses both alternatives on all three benchmarks. The advantage over rote memory indicates the benefit of introducing the intermediate latent memory variable; the advantage over transformed memory demonstrates the benefit of formulating memory recall as the variational inference of the latent memory, which is treated as a stochastic variable. To understand the empirical benefit, we visualize the distributions of prototypes obtained with and without variational semantic memory on miniImageNet in Figure 2. The variational semantic memory makes the prototypes of different classes more distinctive and distant from each other, with less overlap, which allows larger intra-class variations to be encompassed and results in improved performance.

Benefit of attentional memory update. We investigate the benefit of the attention mechanism for the memory update. Specifically, we replace the attention-based update with a mean-based one; that is, we use $\tilde{M}_c = \frac{1}{N_c}\sum_{i} h_\phi(x^i_c)$. The experimental results are reported in Table 3. The memory update with the attention mechanism performs consistently better than the mean-based update. This is because the attention mechanism allows the model to absorb more informative knowledge from the data of new tasks by exploring the structural information.

Figure 3: Effect of memory size.

Effect of memory size. We conduct this experiment on miniImageNet. From Figure 3, we can see that performance increases with memory size. This is reasonable, since a larger memory provides more context information for building better prototypes. Moreover, we observe that the memory module plays a more significant role in the 1-shot setting. In this case, the prototype inferred from only one example might be insufficiently representative of the object class; leveraging the context information provided by the memory compensates for the limited number of examples.

Comparison with other memory models. To demonstrate the effectiveness of our variational memory mechanism, we compare with two other representative memory models, i.e., the memory-augmented neural network (MANN) [57] and the Kanerva machine (KM) [74]. MANN adopts an architecture with augmented memory capacity similar to neural Turing machines [22], while the KM deploys Kanerva's sparse distributed memory mechanism and introduces learnable addresses and reparameterized latent variables. The KM was originally proposed for generative models, but we adapt its reading and writing mechanisms to our semantic memory in the meta-learning setting for few-shot classification. The results are shown in Table 4.
Our variational semantic memory consistently outperforms MANN and the KM on all three datasets.

Table 5: Comparison (%) on miniImageNet, tieredImageNet and CIFAR-FS using a shallow feature extractor.

| | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot |
|---|---|---|---|---|---|---|
| Matching Net [70] | 43.56 ± 0.84 | 55.31 ± 0.73 | - | - | - | - |
| MAML [15] | 48.70 ± 1.84 | 63.11 ± 0.92 | 51.67 ± 1.81 | 70.30 ± 1.75 | 58.90 ± 1.91 | 71.52 ± 1.10 |
| Relation Net [63] | 50.44 ± 0.82 | 65.32 ± 0.70 | 54.48 ± 0.93 | 65.32 ± 0.70 | 55.00 ± 1.01 | 69.30 ± 0.80 |
| SNAIL (32C) by [4] | 45.10 ± 0.85 | 55.20 ± 0.80 | - | - | - | - |
| GNN [19] | 50.31 ± 0.83 | 66.42 ± 0.90 | - | - | 61.90 ± 1.03 | 75.30 ± 0.91 |
| PLATIPUS [16] | 50.10 ± 1.90 | - | - | - | - | - |
| VERSA [20] | 53.31 ± 1.80 | 67.30 ± 0.91 | - | - | 62.51 ± 1.70 | 75.11 ± 0.91 |
| R2-D2 (64C) [4] | 49.50 ± 0.20 | 65.40 ± 0.20 | - | - | 62.30 ± 0.20 | 77.40 ± 0.20 |
| R2-D2 [11] | 51.70 ± 1.80 | 63.31 ± 0.91 | - | - | 60.20 ± 1.80 | 70.91 ± 0.91 |
| CAVIA [82] | 51.80 ± 0.70 | 65.61 ± 0.60 | - | - | - | - |
| iMAML [46] | 49.30 ± 1.90 | - | - | - | - | - |
| VSM (This paper) | 54.73 ± 1.60 | 68.01 ± 0.90 | 56.88 ± 1.71 | 74.65 ± 0.81 | 63.42 ± 1.90 | 77.93 ± 0.80 |

Table 6: Comparison (%) on miniImageNet and tieredImageNet using a deep feature extractor.

| | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot |
|---|---|---|---|---|
| SNAIL [38] | 55.71 ± 0.99 | 68.88 ± 0.92 | - | - |
| AdaResNet [41] | 56.88 ± 0.62 | 71.94 ± 0.57 | - | - |
| TADAM [42] | 58.50 ± 0.30 | 76.70 ± 0.30 | - | - |
| Shot-Free [51] | 59.04 (n/a) | 77.64 (n/a) | 63.52 (n/a) | 82.59 (n/a) |
| TEWAM [45] | 60.07 (n/a) | 75.90 (n/a) | - | - |
| MTL [62] | 61.20 ± 1.80 | 75.50 ± 0.80 | - | - |
| Variational FSL [80] | 61.23 ± 0.26 | 77.69 ± 0.17 | - | - |
| MetaOptNet [35] | 62.64 ± 0.61 | 78.63 ± 0.46 | 65.99 ± 0.72 | 81.56 ± 0.53 |
| Diversity w/ Cooperation [13] | 59.48 ± 0.65 | 75.62 ± 0.48 | - | - |
| Meta-Baseline [10] | 63.17 ± 0.23 | 79.26 ± 0.17 | - | - |
| Tian et al. [65] | 64.82 ± 0.60 | 82.14 ± 0.43 | 71.52 ± 0.69 | 86.03 ± 0.49 |
| VSM (This paper) | 65.72 ± 0.57 | 82.73 ± 0.51 | 72.01 ± 0.71 | 86.77 ± 0.44 |

State-of-the-art comparison. As shown in Tables 5 and 6, our variational semantic memory (VSM) sets a new state-of-the-art on all few-shot learning benchmarks. On miniImageNet, our model using either a shallow or deep network achieves high recognition accuracy, surpassing the second best method, i.e., VERSA [20], by a margin of 1.43% on 5-way 1-shot using a shallow network. On tieredImageNet, our model again outperforms previous methods using shallow networks, e.g., MAML [15] and Relation Net [63], and deep networks, e.g., [65]. On CIFAR-FS, our model delivers 63.42% in the 5-way 1-shot setting, surpassing the second best, R2-D2 [4], by 1.12%. The consistent state-of-the-art results on all benchmarks using either shallow or deep feature extraction networks validate the effectiveness of our model for few-shot learning.

5 Conclusion

In this paper, we introduce a new long-term memory module, named variational semantic memory, into meta-learning for few-shot learning. We apply it as an external memory for the probabilistic modelling of prototypes in a hierarchical Bayesian framework. The memory episodically learns to accrue and store semantic information by experiencing a set of related tasks, which provides semantic context that enables new object concepts to be quickly learned in individual tasks. Memory recall is formulated as the variational inference of a latent memory variable from the addressed content in the external memory. The memory is established from scratch and gradually consolidated by being updated with knowledge absorbed from the data of each task using an attention mechanism. Extensive experiments on four benchmarks demonstrate the effectiveness of variational semantic memory in learning to accumulate long-term knowledge.
Our model achieves new state-of-the-art performance on four benchmark datasets, consistently surpassing previous methods. More importantly, the findings in this work demonstrate the benefit of semantic knowledge accrued through long-term memory for effectively learning novel concepts of object categories, and therefore highlight the pivotal role of semantic memory in few-shot recognition.

Broader Impact

This work introduces the concept of semantic memory from cognitive science into the machine learning field. We use it to augment a probabilistic model for few-shot learning. The developed variational framework offers a principled way to achieve memory recall, which could also be applied to other learning scenarios, e.g., continual learning. The empirical findings indicate the potential role of neural semantic memory as a long-term memory module in enhancing machine learning models. Finally, this work does not raise any foreseeable ethical issues or societal consequences.

Acknowledgements

The authors would like to thank all anonymous reviewers for their constructive feedback and valuable suggestions. This work was partially supported by the National Natural Science Foundation of China (Grants 61976060 and 61871016).

References

[1] K. R. Allen, E. Shelhamer, H. Shin, and J. B. Tenenbaum. Infinite mixture prototypes for few-shot learning. In ICML, 2019.
[2] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.
[3] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. MetaReg: Towards domain generalization using meta-regularization. In NeurIPS, 2018.
[4] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. In ICLR, 2019.
[5] J. R. Binder and R. H. Desai. The neurobiology of semantic memory. Trends in Cognitive Sciences, 15(11):527-536, 2011.
[6] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
[7] C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z. Leibo, J. Rae, D. Wierstra, and D. Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.
[8] J. Bornschein, A. Mnih, D. Zoran, and D. J. Rezende. Variational memory addressing in generative models. In NeurIPS, 2017.
[9] T. Cao, M. Law, and S. Fidler. A theoretical analysis of the number of shots in few-shot learning. In ICLR, 2020.
[10] Y. Chen, X. Wang, Z. Liu, H. Xu, and T. Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.
[11] A. Devos, S. Chatel, and M. Grossglauser. Reproducing meta-learning with differentiable closed-form solvers. In ICLR Workshop, 2019.
[12] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[13] N. Dvornik, C. Schmid, and J. Mairal. Diversity with cooperation: Ensemble methods for few-shot classification. In ICCV, 2019.
[14] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola. Meta-Q-learning. arXiv preprint arXiv:1910.00125, 2020.
[15] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[16] C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. In NeurIPS, 2018.
[17] S. Fort. Gaussian prototypical networks for few-shot learning on Omniglot. arXiv preprint arXiv:1708.02735, 2017.
[18] M. Fortunato, M. Tan, R. Faulkner, S. Hansen, A. P. Badia, G. Buttimore, C. Deck, J. Z. Leibo, and C. Blundell. Generalization of reinforcement learners with working and episodic memory. In NeurIPS, 2019.
[19] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2018.
[20] J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. E. Turner. Meta-learning probabilistic inference for prediction. In ICLR, 2019.
[21] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. In ICLR, 2018.
[22] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
[23] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471-476, 2016.
[24] S. Hansen, A. Pritzel, P. Sprechmann, A. Barreto, and C. Blundell. Fast deep reinforcement learning using online adjustments from the past. In NeurIPS, 2018.
[25] N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker. Bayesian Nonparametrics, volume 28. Cambridge University Press, 2010.
[26] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[27] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In ICANN. Springer, 2001.
[28] M. Irish, D. R. Addis, J. R. Hodges, and O. Piguet. Considering the role of semantic memory in episodic future thinking: evidence from semantic dementia. Brain, 135(7):2178-2191, 2012.
[29] M. Irish and O. Piguet. The pivotal role of semantic memory in remembering the past and imagining the future. Frontiers in Behavioral Neuroscience, 7:27, 2013.
[30] L. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. In ICLR, 2016.
[31] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[32] G. Koch. Siamese neural networks for one-shot image recognition. In ICML Workshop, 2015.
[33] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[34] H. Le, T. Tran, T. Nguyen, and S. Venkatesh. Variational memory encoder-decoder. In NeurIPS, 2018.
[35] K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex optimization. In CVPR, 2019.
[36] C. Li, J. Zhu, and B. Zhang. Learning to generate with memory. In ICML, 2016.
[37] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
[38] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
[39] A. Miyake and P. Shah. Models of Working Memory: Mechanisms of Active Maintenance and Executive Control. Cambridge University Press, 1999.
[40] T. Munkhdalai and H. Yu. Meta networks. In ICML, 2017.
[41] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
[42] B. Oreshkin, P. R. López, and A. Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, 2018.
[43] K. Patterson, P. J. Nestor, and T. T. Rogers. Where do you know what you know? The representation of semantic knowledge in the human brain. Nature Reviews Neuroscience, 8(12):976-987, 2007.
[44] A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. Neural episodic control. In ICML, 2017.
[45] S. Qiao, C. Liu, W. Shen, and A. L. Yuille. Few-shot image recognition by predicting parameters from activations. In CVPR, 2018.
[46] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In NeurIPS, 2019.
[47] T. Ramalho and M. Garnelo. Adaptive posterior learning: few-shot learning with a surprise-based memory module. In ICLR, 2019.
[48] C. E. Rasmussen. The infinite Gaussian mixture model. In NeurIPS, 2000.
[49] S. Ravi and A. Beatson. Amortized Bayesian meta-learning. In ICLR, 2019.
[50] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[51] A. Ravichandran, R. Bhotika, and S. Soatto. Few-shot learning with embedded class models and shot-free meta training. In ICCV, 2019.
[52] D. Reisberg. "Semantic Memory" in The Oxford Handbook of Cognitive Psychology. Oxford University Press, 2013.
[53] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. In ICLR, 2018.
[54] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[55] S. Ritter, J. Wang, Z. Kurth-Nelson, S. Jayakumar, C. Blundell, R. Pascanu, and M. Botvinick. Been there, done that: Meta-learning with episodic recall. In ICML, 2018.
[56] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
[57] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
[58] V. G. Satorras and J. B. Estrach. Few-shot learning with graph neural networks. In ICLR, 2018.
[59] D. Saumier and H. Chertkow. Semantic memory. Current Neurology and Neuroscience Reports, 2(6):516-522, 2002.
[60] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
[61] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[62] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele. Meta-transfer learning for few-shot learning. In CVPR, 2019.
[63] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[64] S. Thrun and L. Pratt. Learning to Learn. Springer Science & Business Media, 2012.
[65] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola. Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
[66] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, et al. Meta-Dataset: A dataset of datasets for learning to learn from few examples. In ICLR, 2020.
[67] E. Tulving et al. Episodic and semantic memory. Organization of Memory, 1:381-403, 1972.
[68] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
[69] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In ICLR, 2018.
[70] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NeurIPS, 2016.
[71] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In NeurIPS, 2015.
[72] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
[73] J. Weston, S. Chopra, and A. Bordes. Memory networks. In ICLR, 2014.
[74] Y. Wu, G. Wayne, A. Graves, and T. Lillicrap. The Kanerva machine: A generative distributed memory. In ICLR, 2018.
[75] Y. Wu, G. Wayne, K. Gregor, and T. Lillicrap. Learning attractor dynamics for generative memory. In NeurIPS, 2018.
[76] T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun. MetaAnchor: Learning to detect objects with customized anchors. In NeurIPS, 2018.
[77] J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian model-agnostic meta-learning. In NeurIPS, 2018.
[78] S. W. Yoon, J. Seo, and J. Moon. TapNet: Neural network augmented with task-adaptive projection for few-shot learning. In ICML, 2019.
[79] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[80] J. Zhang, C. Zhao, B. Ni, M. Xu, and X. Yang. Variational few-shot learning. In ICCV, 2019.
[81] X. Zhen, H. Sun, Y. Du, J. Xu, Y. Yin, L. Shao, and C. Snoek. Learning to learn kernels with variational random features. In ICML, 2020.
[82] L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson. Fast context adaptation via meta-learning. In ICML, 2019.