# Meta-Neighborhoods

Siyuan Shan (siyuanshan@cs.unc.edu), Yang Li (yangli95@cs.unc.edu), Junier B. Oliva (joliva@cs.unc.edu)
Department of Computer Science, University of North Carolina at Chapel Hill

**Abstract.** Making an adaptive prediction based on one's input is an important ability for general artificial intelligence. In this work, we step forward in this direction and propose a semi-parametric method, Meta-Neighborhoods, where predictions are made adaptively to the neighborhood of the input. We show that Meta-Neighborhoods is a generalization of k-nearest-neighbors. Due to the simpler manifold structure around a local neighborhood, Meta-Neighborhoods represents the predictive distribution p(y | x) more accurately. To reduce memory and computation overhead, we propose induced neighborhoods that summarize the training data into a much smaller dictionary. A meta-learning based training mechanism is then exploited to jointly learn the induced neighborhoods and the model. Extensive studies demonstrate the superiority of our method.¹

¹ The code is available at https://github.com/lupalab/Meta-Neighborhoods

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## 1 Introduction

Figure 1: Top: traditional parametric models predict with a fixed model f_φ. Bottom: our per-instance adapted model makes a neighbor-based adjustment for each input x_i and predicts with an instance-specific model.

Discriminative machine learning models typically learn the predictive distribution p(y | x). There are two paradigms for building a model: parametric methods and non-parametric methods [12]. Parametric methods assume that a set of fixed parameters θ governs the predictive distribution, i.e., p(y | x; θ). The training process estimates θ and then discards the training data completely, as the learned parameters θ are solely responsible for subsequent predictions. This paradigm has proven effective; however, it places the entire burden on learning a complex predictive distribution over potentially large support. Non-parametric models differ in that the number of parameters scales with the data. They typically reuse the training data during the testing phase to make predictions. For instance, the well-known k-nearest-neighbor (KNN) estimator often achieves surprisingly good results by leveraging neighbors from the training data, which reduces the problem to a much simpler local manifold. Despite their flexibility, non-parametric methods must store the training data and traverse it during testing, which may impose significant memory and computation overhead for large training sets.

In this work, we combine the merits of both paradigms and propose a semi-parametric method called Meta-Neighborhoods. The main body of Meta-Neighborhoods is a parametric neural network, but we adapt its parameters to a local neighborhood in a non-parametric scheme. The prediction is made on the local manifold by the adapted model. Fig. 1 illustrates the difference between traditional parametric models and the proposed model. Inspired by the success of inducing-point methods in the sparse Gaussian process literature [34, 38] at alleviating storage burdens and reducing time complexity, we learn induced neighborhoods, which summarize the training data into a much smaller dictionary. The induced neighborhoods and the neural network parameters are learned jointly.
Our model is also closely related to locally linear embeddings [30], which reconstructs a non-linear manifold with locally linear approximations around each neighborhood. In our method, we adapt an initial model (not necessarily linear) to local neighborhoods. Since the local manifold is much simpler, we expect the adapted model to better capture the predictive distribution; overall, it learns a better discriminative model on the entire support. Adapting the initial model is challenging, since the local neighborhoods usually do not contain enough training instances to adapt the model independently, and the induced neighborhoods contain even fewer instances. Inspired by the few-shot and meta-learning literature [6], we propose a meta-learning based training mechanism, where we learn an initial model so that it adapts to a local neighborhood after only a few finetuning steps over the few instances inside the neighborhood. The prediction process of our model remains flexible by following a non-parametric scheme. An input x is first paired with its neighbors by querying the induced dictionary. The initial model is adapted to its neighborhood by finetuning for several steps on the neighbors. We then predict the target y using the adapted model. Our contributions are as follows:

- We combine parametric and non-parametric methods in a meta-learning framework.
- We propose Meta-Neighborhoods to jointly learn the induced neighborhoods and an adaptive initial model, which can adapt its parameters to a local neighborhood according to the input through both finetuning and a proposed instance-wise modulation scheme, iFiLM.
- Extensive empirical studies demonstrate the superiority of Meta-Neighborhoods for both regression and classification tasks.
- We empirically find that the induced neighbors are semantically meaningful: they represent informative boundary cases on both realistic and toy datasets, and they capture sub-category concepts even though such information is not given during training.

## 2 Problem Formulation

Given a training set D = {(x_i, y_i)}_{i=1}^N with N input-target pairs, we learn a discriminative model f_φ(x) and a dictionary M = {(k_j, v_j)}_{j=1}^S jointly from D. The learned dictionary stores the neighbors induced from the training set, where S is the number of induced neighbors. Just like the real training set D, the dictionary stores input-target pairs as key-value pairs (k_j, v_j), where both the keys and the values are learned end-to-end with the model parameters φ in a meta-learning framework. For classification tasks, v_j is a vector representing the class probabilities, while for regression tasks v_j is the regression target. In the following text, we use the terms "induced neighbors" and "learnable neighbors" interchangeably. We defer the exact training mechanism to Section 2.2.

### 2.1 Predict with Induced Neighborhoods

In this section, we assume access to the learned neighborhoods in M and the learned model f_φ. Different from the conventional parametric setting, where the learned model is employed directly to predict the target, we adapt the model according to the neighborhoods retrieved from M, and the adapted model is the one responsible for making predictions. Specifically, for a test input x_i, relevant entries in M are retrieved in a soft-attention manner by comparing x_i to the keys k_j via an attention function ω.

Figure 2: Model overview. The input is compared against the dictionary M through the attention function ω; the resulting weights ω_1, ..., ω_S weight the losses L(f_φ(k_j), v_j) used to fine-tune φ to φ_i, and the output network makes the prediction with the adapted parameters.
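To make the dictionary and its soft-attention query concrete, the following is a minimal PyTorch-style sketch. The class name `InducedDictionary`, the Gaussian initialization of the values, and the choice of negative Euclidean distance as the similarity metric are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InducedDictionary(nn.Module):
    """Sketch of the induced-neighbor dictionary M = {(k_j, v_j)}_{j=1}^S.

    Keys k_j live in the space queries are compared in (raw inputs, or
    features from mu_theta); values v_j hold per-class scores for
    classification or targets for regression. Both are trained end-to-end
    with the rest of the model.
    """

    def __init__(self, num_neighbors, key_dim, value_dim, temperature=1.0):
        super().__init__()
        # Gaussian initialization of the keys (cf. Section 2.3); initializing
        # the values the same way is an assumption made here.
        self.keys = nn.Parameter(torch.randn(num_neighbors, key_dim))
        self.values = nn.Parameter(torch.randn(num_neighbors, value_dim))
        self.temperature = temperature

    def attention(self, query):
        """Soft-attention weights omega(query, k_j) for a batch of queries.

        Uses negative Euclidean distance as the similarity (cosine similarity
        is the other option in the paper), normalized by a temperature-scaled
        softmax as detailed in Section 2.3.
        """
        dist = torch.cdist(query, self.keys)                 # (B, S)
        return F.softmax(-dist / self.temperature, dim=-1)   # rows sum to 1
```

Given a batch of queries, `attention` returns the weights ω(·, k_j) that are reused both for the neighborhood-weighted finetuning loss below and, later, for the iFiLM modulation.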
Algorithm 1 META-NEIGHBORHOODS: TRAINING PHASE

Require: ω: similarity metric, η: outer-loop learning rate
1: Initialize θ, φ, α, M = {(k_j, v_j)}_{j=1}^S
2: while not done do
3:   Sample a batch of training data {(x_i, y_i)}_{i=1}^B
4:   for all (x_i, y_i) in the current batch do
5:     Compute the feature vector z_i = µ_θ(x_i)
6:     Compute L_i^inner(φ) = Σ_{j=1}^S ω(z_i, k_j) L(f_φ(k_j), v_j)
7:     Finetune φ: φ_i = φ − α ∇_φ L_i^inner(φ)
8:   end for
9:   Compute L_meta(φ, θ, M, α) = (1/B) Σ_{i=1}^B L(f_{φ_i}(µ_θ(x_i)), y_i)
10:  Update the model parameters Θ = {θ, φ, M, α} by gradient descent: Θ ← Θ − η ∇_Θ L_meta
11: end while

The retrieved entries are then utilized to finetune f_φ following

$$\phi_i = \phi - \alpha \nabla_{\phi} \sum_{j=1}^{S} \omega(x_i, k_j)\, L(f_{\phi}(k_j), v_j), \quad (1)$$

where α is the finetuning step size. Note that we weight the loss terms of all dictionary entries by their similarities to the test input x_i. The intuition is that nearby neighbors play a more important role in placing x_i into the correct local manifold. Since the model f_φ is specially trained in a meta-learning scheme, it adapts to the local neighborhood after only a few finetuning steps.

To better understand our method, we draw connections to other well-known techniques. The above prediction process is similar to a one-step EM algorithm: the dictionary-querying step is analogous to the Expectation step, which determines the latent variables (in our case, the neighborhood assignment), and the finetuning step is analogous to the Maximization step, which maximizes the expected log-likelihood. We can also view this process from a Bayesian perspective, where the initial parameter φ is an empirical prior estimated from data, posteriors are derived from neighbors through the finetuning steps, and the predictive distribution under this posterior is used for the final prediction.

### 2.2 Joint Meta Learning

Above, we assumed access to given M and f_φ; in this section, we describe our meta-learning mechanism to train them jointly. The training strategy of Meta-Neighborhoods resembles MAML [6] in that both adopt a nested optimization procedure, which involves an inner-loop update and an outer-loop update in each iteration. Note that, in contrast to MAML, we are solving a general discriminative task rather than a few-shot task. Given a batch of training data {(x_i, y_i)}_{i=1}^B with batch size B, in the inner loop we finetune the initial parameter φ to φ_i in a similar fashion to (1). With φ individually finetuned for each training instance x_i using its corresponding neighborhoods, we then jointly train the model parameter φ, the dictionary M, and the inner-loop learning rate α in the outer loop using the following meta-objective:

$$L_{meta}(\phi, M, \alpha) = \frac{1}{B}\sum_{i=1}^{B} L\big(f_{\phi_i}(x_i), y_i\big) = \frac{1}{B}\sum_{i=1}^{B} L\big(f_{\phi - \alpha \nabla_{\phi} L_i^{inner}(\phi)}(x_i), y_i\big), \quad (2)$$

where $L_i^{inner}(\phi) = \sum_{j=1}^{S} \omega(x_i, k_j)\, L(f_{\phi}(k_j), v_j)$ as in (1). We set α to be a learnable scalar or diagonal matrix. L_meta encourages learning shared φ, M, and α that are widely applicable to data with the same distribution as the training data. An overview of our model is shown in Fig. 2. The parameter φ serves as initial weights that can be quickly adapted to a specific neighborhood. This meta-training scheme effectively tackles the overfitting problem caused by the limited number of finetuning instances, as it explicitly optimizes the generalization performance after finetuning. For high-dimensional inputs such as images, learning k_j in the input space could be prohibitive. Therefore, we employ a feature extractor µ_θ to extract the feature embedding z_i = µ_θ(x_i) for each x_i and learn k_j in the embedding space.
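Before formalizing the embedding-space variant, here is a sketch of one outer iteration of Algorithm 1. For brevity it assumes a single inner step, a plain linear head `W` in place of the cosine-similarity head f_φ described later, soft labels obtained by softmaxing the learnable values, and the `InducedDictionary` sketched above; all of these names (`feature_net`, `W`, `alpha`, `outer_opt`, `soft_ce`) are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F


def soft_ce(logits, target_probs):
    # Cross-entropy of logits against soft (learnable) targets v_j.
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(-1)


def meta_train_step(feature_net, W, dictionary, alpha, x, y, outer_opt):
    """One outer iteration of Algorithm 1 (illustrative sketch)."""
    z = feature_net(x)                               # z_i = mu_theta(x_i), (B, D)
    omega = dictionary.attention(z)                  # similarity weights, (B, S)
    meta_losses = []
    for i in range(z.size(0)):
        # Inner loss: similarity-weighted loss over all induced neighbors.
        logits_k = dictionary.keys @ W.t()           # f_phi(k_j) for all keys, (S, C)
        v = F.softmax(dictionary.values, dim=-1)     # neighbor soft labels (assumption)
        inner = (omega[i] * soft_ce(logits_k, v)).sum()
        # One finetuning step, kept differentiable so the outer update can
        # backpropagate through it (as in MAML).
        grad_W = torch.autograd.grad(inner, W, create_graph=True)[0]
        W_i = W - alpha * grad_W
        # Outer (meta) loss: the adapted parameters predict the real pair.
        meta_losses.append(F.cross_entropy((z[i] @ W_i.t()).unsqueeze(0),
                                           y[i].unsqueeze(0)))
    loss = torch.stack(meta_losses).mean()           # Eq. (2), averaged over the batch
    outer_opt.zero_grad()
    loss.backward()                                  # gradients for theta, phi, M, alpha
    outer_opt.step()
    return loss.item()
```

The outer optimizer is assumed to hold all of θ (the feature extractor), φ (here `W`), the dictionary parameters M, and α, matching line 10 of Algorithm 1.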
We accordingly modify (1) to

$$\phi_i = \phi - \alpha \nabla_{\phi} \sum_{j=1}^{S} \omega(\mu_{\theta}(x_i), k_j)\, L(f_{\phi}(k_j), v_j), \quad (3)$$

where the attention function ω is employed in the embedding space. The meta-objective is accordingly modified to $L_{meta}(\phi, \theta, M, \alpha) = \frac{1}{B}\sum_{i=1}^{B} L(f_{\phi_i}(\mu_{\theta}(x_i)), y_i)$. We train θ and the other learnable parameters jointly. Note that the model without a feature extractor can be viewed as the special case where µ_θ is an identity mapping. The pseudocode of our training algorithm is given in Algorithm 1.

It is also desirable to adjust µ_θ per instance. However, when µ_θ is a deep convolutional neural network, tuning the entire feature extractor µ_θ is computationally expensive. Inspired by FiLM [28], we propose instance-wise FiLM (iFiLM), which adjusts the batch normalization layers individually for each instance. Suppose a^l ∈ R^{B×C^l×W^l×H^l} is the output of the l-th batch normalization layer BN^l, where B is the batch size, C^l is the number of channels, and W^l and H^l are the feature map width and height. Along with each BN^l, we define a learnable dictionary M^l = {(k^l_j, γ^l_j, β^l_j)}_{j=1}^{S^l} of size S^l, where k^l_j are the keys used for querying, and γ^l_j, β^l_j ∈ R^{C^l} are the scale and shift parameters used for adaptation, respectively. When querying M^l, the outputs a^l are first aggregated across their spatial dimensions using global pooling, i.e., g^l = GlobalAvgPool(a^l) ∈ R^{B×C^l}. Then, the instance-wise adaptation parameters γ̂^l_i and β̂^l_i are computed as

$$\hat{\gamma}^l_i = \sum_{j=1}^{S^l} \omega(g^l_i, k^l_j)\, \gamma^l_j \in \mathbb{R}^{C^l}, \qquad \hat{\beta}^l_i = \sum_{j=1}^{S^l} \omega(g^l_i, k^l_j)\, \beta^l_j \in \mathbb{R}^{C^l}, \quad (4)$$

where ω is defined as in (1) and i ∈ {1, 2, ..., B}. Following FiLM [28], each individual activation a^l_i is then adapted with an affine transformation γ̂^l_i a^l_i + β̂^l_i. Note that the transformation is agnostic to spatial position, which helps to reduce the number of learnable parameters in the dictionaries M^l.

### 2.3 Other Details and Considerations

In this section, we discuss further implementation details. We also motivate our method from the perspective of k-nearest neighbors (KNN).

**Similarity Metrics.** To implement the attention function ω in (1), (3), and (4), we need a similarity metric to compare an input x_i with each key k_j. We try two types of metrics: cosine similarity and negative Euclidean distance. The similarities of x_i to all keys are normalized using a softmax function with a temperature parameter T [14], i.e.,

$$\omega(x_i, k_j) = \frac{\exp(\mathrm{sim}(x_i, k_j)/T)}{\sum_{s=1}^{S} \exp(\mathrm{sim}(x_i, k_s)/T)}, \quad (5)$$

where sim(·, ·) denotes the similarity metric.

**Initialization of the Dictionary.** Since we use a similarity-based attention function ω in (3), we would like to initialize the keys k_j to have a similar distribution to z_i = µ_θ(x_i); otherwise, k_j cannot receive a useful training signal in early training steps. To simplify the initialization, we follow [8] and remove the non-linear function (e.g., ReLU) at the end of µ_θ so that features extracted by µ_θ are approximately Gaussian distributed. With this modification, we can simply initialize k_j with a Gaussian distribution.

**Cosine-similarity Based Classification.** Since the model f_φ is finetuned using the learned dictionary in the inner loop, the quality of the dictionary has a significant impact on final performance. Typical neural network classifiers employ a dot product to compute the logits, where both the magnitude and the direction affect the result. Therefore, the model would need to learn both the magnitude and the direction of k_j. To alleviate this training difficulty, we eliminate the influence of magnitude by using a cosine similarity classifier, where only the direction of k_j can affect the computation of the logits.
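The two per-instance components introduced above also lend themselves to short sketches: the iFiLM modulation of Eq. (4) and a cosine-similarity classification head for f_φ. Class names, the use of cosine similarity inside the iFiLM query, and the fixed scale factor in the cosine head are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IFiLM(nn.Module):
    """Instance-wise FiLM applied after one batch normalization layer, Eq. (4)."""

    def __init__(self, num_channels, dict_size, temperature=1.0):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(dict_size, num_channels))
        self.gammas = nn.Parameter(torch.ones(dict_size, num_channels))
        self.betas = nn.Parameter(torch.zeros(dict_size, num_channels))
        self.temperature = temperature

    def forward(self, a):                          # a: BN output, (B, C, H, W)
        g = a.mean(dim=(2, 3))                     # global average pooling, (B, C)
        sim = F.cosine_similarity(g.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        w = F.softmax(sim / self.temperature, dim=-1)     # omega(g_i, k_j), (B, S)
        gamma = w @ self.gammas                    # per-instance scales, (B, C)
        beta = w @ self.betas                      # per-instance shifts, (B, C)
        return gamma[:, :, None, None] * a + beta[:, :, None, None]


class CosineClassifier(nn.Module):
    """Cosine-similarity head for f_phi: logits depend only on directions."""

    def __init__(self, feat_dim, num_classes, scale=10.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale                         # temperature-like scaling of cosines

    def forward(self, z):                          # z: features, (B, D)
        z = F.normalize(z, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        return self.scale * z @ p.t()              # (B, num_classes)
```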
Cosine similarity classifiers have been adopted to replace dot-product classifiers for few-shot learning [8, 2] and robust classification [40].

**Relationship to KNN.** Below, we show that Meta-Neighborhoods can be derived as a direct generalization of KNN under a multi-task learning framework. Consider a regression task where the regression target is a scalar. The standard view of KNN is as follows. First, aggregate the k nearest neighbors of a query x̃_i from the training set D as N(x̃_i) = {(x_j, y_j)}_{j=1}^k ⊂ D. Then, predict an average of the responses in the neighborhood: ŷ = (1/k) Σ_{j=1}^k y_j. Instead of simply averaging the responses in a neighborhood, we frame KNN as the solution to a multi-task learning problem with tasks corresponding to individual neighborhoods, as follows. Here, we take each query (test) point x̃_i as a single task, T_i. To find the optimal estimator on the neighborhood N(x̃_i) = {(x_j, y_j)}_{j=1}^k, we optimize the loss L_{T_i}(f_i) = (1/k) Σ_{j=1}^k L(f_i(x_j), y_j), where L is a supervised loss and f_i is the estimator to be optimized. For example, for MSE-based regression the loss for each task is L_{T_i}(f_i) = (1/k) Σ_{j=1}^k (f_i(x_j) − y_j)^2. If one takes f_i to be a constant function f_i(x_j) = C_i, then the loss is simply L_{T_i}(f_i) = (1/k) Σ_{j=1}^k (C_i − y_j)^2; setting its derivative with respect to C_i to zero yields the optimum f_i(x̃_i) = C_i = (1/k) Σ_{j=1}^k y_j, the same solution as traditional KNN. Similar observations hold for classification. Thus, given neighborhood assignments, one can view KNN as solving individual tasks in the special case of a constant estimator f_i(x_j) = C_i.

With this multi-task formulation of KNN, we can generalize KNN to derive our Meta-Neighborhoods method by considering a non-constant estimator f_i. For instance, one may take f_i to be a parametric output function f_{φ_i} (e.g., a linear model or a neural network) and finetune the parameter φ to φ_i for an input x_i according to the loss on the neighborhood N(x_i). Instead of fitting a single label on the neighborhood, a parametric approach attempts to fit a richer (e.g., linear) dependency between input features and labels within the neighborhood. In addition, the multi-task formulation gives rise to a way of constructing meta-learning episodes. Also, we learn both the neighborhoods and the function f_φ jointly in our Meta-Neighborhoods framework.

## 3 Related Work

**Memory-augmented Neural Networks.** Augmenting neural networks with memory was studied in the seminal Neural Turing Machines work [10], where a neural network can read and write an external memory to record and change its state. Recent works that utilize memory modules generally fall into two categories. One category modifies the memory modules according to hand-crafted rules. For instance, previous works tackling few-shot classification add a new slot to the memory when the label of a given example does not match the labels of its k nearest neighbors from the memory [1], or when the given example is misclassified [29]. [35] adopts a fixed-size memory that acts as a circular buffer for life-long learning. The other category uses a fully-differentiable memory module and trains it together with neural networks by gradient descent. This type of memory has been explored for knowledge-based reasoning [11], sequential prediction [36], and few-shot learning [17, 32]. Our work also utilizes a differentiable memory, but it is used to capture the local manifold and improve general discriminative learning performance.
**Meta-Learning.** Representative meta-learning algorithms can be roughly categorized into two classes: initialization based and metric-learning based. Initialization based methods, such as MAML [6], learn a good initialization for the model parameters so that several gradient steps using a limited number of labeled examples can adapt the model to make predictions for new tasks. To further improve flexibility, Meta-SGD [23] learns coordinate-wise inner learning rates, and curvature information is considered in [27] to transform the gradients in the inner optimization. Metric-learning based methods focus on using a distance metric in the feature space to compare query-set samples with labeled support-set samples; examples include cosine similarity [39] and Euclidean distance [33] to support examples, and a learned relation module in [37]. Our model is in a similar vein to initialization based methods: each test sample can be regarded as a new task, and we meta-learn a dictionary that adapts the initial model to a local neighborhood by finetuning over queried neighbors. A recent work, Meta AuXiliary Learning (MAXL) [24], also explores meta-learning techniques to improve classification performance: a label generator is meta-learned to generate auxiliary labels so that the auxiliary task, trained together with the primary classification task, improves the primary performance.

Figure 3: Evolution of learnable neighbors and classification results on the test data during training ((a) iteration 0, (b) iteration 800, (c) iteration 6400). The two classes are two spirals. Binary predictions for the test set are shown as blue and yellow points. Learnable neighbors are first randomly initialized in (a), then optimized in (b) and (c). The dotted black lines are the trajectories of learnable neighbors through the training iterations. A video of the optimization process is given in the supplemental materials.

## 4 Experiments

In this section, we conduct experiments for both classification and regression tasks. To demonstrate the benefits of making predictions in local neighborhoods, we compare to the vanilla model, which uses the same network architecture but without the learnable neighborhoods. We also compare to MAXL [24] for classification tasks.

### 4.1 Toy Example: Binary Classification of the Concentric Spiral Dataset

To investigate the behavior of the induced neighbors and how they assist the parametric model in making predictions, we first perform classification on a 2D toy dataset. In this binary classification task, points from two classes are placed in concentric spirals with a non-linear decision boundary. Although a linear classifier is incapable of capturing this decision boundary, we show that tuning a linear classifier with induced neighbors gives an overall non-linear classifier. In addition, our learned neighborhoods capture the critical manifold structure and concentrate at boundary cases; we also observe semantically relevant learned neighbors in higher dimensions (see Appendix A.8 and A.9).

In Fig. 3, we visualize the evolution of the learned neighbors and the decision boundary. The learnable neighbors (shown as green and red markers for the two classes, respectively) are first initialized with random keys and values. As we train the model and the neighbors, the learned neighbors are gradually driven to important manifold locations, as in Fig. 3 (b) and (c). After training, the linear classifier adapted to local neighborhoods can accurately classify test examples. We use 100 induced neighbors.
The labels of the neighbors (values in the dictionary) are fixed after random initialization for illustration purposes. The 2D locations of the neighbors (keys in the dictionary) are updated with the model. We use negative Euclidean distance as the similarity metric in (5) and set T to 0.1.

### 4.2 Image Classification

In this section, we evaluate on 9 datasets with different complexities and sizes: MNIST [21], MNIST-M [7], PACS [22], SVHN [9], CIFAR-10 [19], CIFAR-100 [19], CINIC-10 [3], Tiny-ImageNet [31], and ImageNet [4]. Dataset details and preprocessing methods are given in Appendix A.1. Our models are compared to two baselines: vanilla, a traditional parametric ConvNet with the same architecture as ours but without the learnable dictionary, and MAXL [24], where an auxiliary label generator is meta-learned to enhance the primary classification task. For MNIST, MNIST-M, SVHN, and PACS, a 4-layer ConvNet is selected as the feature extractor µ_θ. For the other datasets, three deep convolutional architectures, DenseNet40-BC [15], ResNet-29, and ResNet-56 [13], are used as µ_θ. f_φ is implemented as a cosine-similarity based classifier with one linear layer for both the vanilla and our models. Experiment details for our models and the baselines are provided in Appendix A.2.

Table 1: Classification accuracies of our model and the baselines. "MN" denotes Meta-Neighborhoods; the last three columns (vanilla+iFiLM, MN, MN+iFiLM) are ours. Results over three individual runs are reported and the best performance in each row is marked in bold. Given that backbones like ResNet-56 are strong, our consistent improvement is notable.

| Dataset | vanilla | MAXL | vanilla+iFiLM | MN | MN+iFiLM |
|---|---|---|---|---|---|
| **Backbone: 4-layer ConvNet** | | | | | |
| MNIST | 99.44 ± 0.03% | 99.60 ± 0.02% | 99.40 ± 0.02% | **99.62 ± 0.03%** | 99.58 ± 0.03% |
| SVHN | 93.02 ± 0.12% | 94.06 ± 0.10% | 93.95 ± 0.12% | 94.46 ± 0.09% | **94.92 ± 0.09%** |
| MNIST-M | 96.18 ± 0.05% | 96.85 ± 0.06% | 96.99 ± 0.07% | 96.55 ± 0.04% | **97.40 ± 0.05%** |
| PACS | 92.55 ± 0.08% | 94.85 ± 0.12% | 94.45 ± 0.09% | 95.19 ± 0.10% | **95.22 ± 0.09%** |
| **Backbone: DenseNet40-BC** | | | | | |
| CIFAR-10 | 94.53 ± 0.10% | 94.83 ± 0.09% | 94.87 ± 0.08% | 95.04 ± 0.11% | **95.22 ± 0.09%** |
| CIFAR-100 | 73.92 ± 0.12% | 75.64 ± 0.14% | 74.66 ± 0.13% | 76.32 ± 0.16% | **76.96 ± 0.14%** |
| CINIC-10 | 84.92 ± 0.07% | 85.42 ± 0.07% | 85.11 ± 0.08% | 85.73 ± 0.10% | **86.02 ± 0.07%** |
| Tiny-ImageNet | 49.28 ± 0.18% | 50.94 ± 0.16% | 50.86 ± 0.14% | 53.27 ± 0.18% | **54.36 ± 0.15%** |
| **Backbone: ResNet-29** | | | | | |
| CIFAR-10 | 95.06 ± 0.10% | 95.31 ± 0.09% | 95.17 ± 0.10% | 95.56 ± 0.09% | **95.58 ± 0.10%** |
| CIFAR-100 | 76.51 ± 0.15% | 77.94 ± 0.12% | 77.16 ± 0.14% | 78.84 ± 0.14% | **79.84 ± 0.11%** |
| CINIC-10 | 86.03 ± 0.08% | 86.34 ± 0.06% | 86.64 ± 0.06% | 86.86 ± 0.08% | **87.35 ± 0.09%** |
| Tiny-ImageNet | 54.82 ± 0.17% | 56.29 ± 0.14% | 55.59 ± 0.17% | 57.36 ± 0.15% | **57.94 ± 0.14%** |
| **Backbone: ResNet-56** | | | | | |
| CIFAR-10 | 95.73 ± 0.08% | 96.06 ± 0.07% | 96.08 ± 0.08% | 96.36 ± 0.07% | **96.40 ± 0.06%** |
| CIFAR-100 | 79.64 ± 0.13% | 80.36 ± 0.13% | 80.04 ± 0.12% | 80.58 ± 0.10% | **80.90 ± 0.12%** |
| CINIC-10 | 88.21 ± 0.07% | 88.30 ± 0.05% | 88.57 ± 0.07% | 88.61 ± 0.06% | **88.99 ± 0.07%** |
| Tiny-ImageNet | 57.92 ± 0.12% | 58.94 ± 0.16% | 58.31 ± 0.15% | 60.05 ± 0.12% | **60.78 ± 0.13%** |
| ImageNet | 48.41 ± 0.14% | 48.83 ± 0.16% | 52.03 ± 0.12% | 51.85 ± 0.12% | **54.23 ± 0.13%** |

**Results.** Table 1 compares the test accuracy to the baselines. Both Meta-Neighborhoods (MN) and iFiLM (vanilla+iFiLM) improve over vanilla. The best performance is achieved when combining MN and iFiLM (MN+iFiLM), which outperforms the vanilla and MAXL baselines across several network architectures and different datasets. This indicates that Meta-Neighborhoods and iFiLM are complementary and that it is beneficial to adjust both f_φ and µ_θ per instance. Our method is also effective on the PACS and MNIST-M datasets, which contain significant domain shifts.
Note that backbones like ResNet-56 are already powerful on these datasets and there is limited room for improvement over the vanilla model. For instance, employing ResNet-110 instead of ResNet-56 only gives 0.14% and 0.40% further improvement on CIFAR-10 and CIFAR-100, at the expense of doubling the number of parameters. Yet Meta-Neighborhoods still consistently achieves greater improvements over vanilla than the previous state-of-the-art meta-learning method MAXL [24]. Compared to vanilla models, Meta-Neighborhoods with the same backbone architecture contains extra trainable parameters stored in the dictionary M. However, as discussed in Appendix A.3, the performance boost in our paper originates from adjusting models using neighbors, rather than from a naive increase in the number of parameters.

Figure 4: Cosine similarities between features and their corresponding ground-truth class prototypes. Each blue point denotes a test sample. We expect most samples to lie above the red lines, meaning larger similarities after finetuning.

Since we implement f_φ as a cosine-similarity classification layer, φ can be regarded as the prototypes for each class. To verify that finetuning over neighborhoods helps with classification, we compare the cosine similarity between the extracted feature z_i and its corresponding ground-truth prototype φ[y_i] before and after finetuning, where y_i is the class label for z_i. From Fig. 4, we can see that the cosine similarities increase after finetuning for most test examples, which indicates better predictions after finetuning. We conduct ablation studies for S and T in Appendix A.4. Ablations for the number of inner-loop finetuning steps and different forms of α (scalar or diagonal) are provided in Appendix A.5.

Figure 5: t-SNE visualization of the learned neighbors and training data on (a) MNIST and (b) CIFAR-10. Learned neighbors are marked as "+" and real training data are marked as "o". The class information is represented by colors. Please zoom in to see the differences between "+" and "o".

Figure 6: Sub-category image retrieval quality of our model and the vanilla model. Correct retrievals have green outlines and wrong retrievals have red outlines.

Additional experiments with vanilla models trained with a dot-product output layer and SGD are provided in Appendix A.6. We also discuss the inference speed of our model in Appendix A.7.

**Analysis of Learned Neighbors.** In Fig. 5, we use t-SNE [26] to visualize the 2D embeddings of the learned neighbors (marked as "+") along with the training data (marked as "o") on CIFAR-10 and MNIST. The 2D embeddings of training data are computed on the features z_i = µ_θ(x_i), and embeddings of learned neighbors are computed on the keys k_j. Classes of the learned neighbors are inferred from the values v_j. The visualization shows that our model learns neighbors beyond the training set, as the learned neighbors do not completely overlap with the training data, and the learned neighbors represent "hard cases" around class boundaries that assist our model in making better predictions. It is also interesting to note that this follows the same trend as our toy example in Fig. 3. We further investigate whether the learned neighbors are semantically meaningful by retrieving their 5 nearest neighbors from the test set. As shown in Appendix A.8, the retrieved 5 nearest neighbors of each learned neighbor not only come from the same class but also represent a specific sub-category concept.
In Appendix A.9, we quantitatively show that our method has superior sub-category discovery performance compared to vanilla on CIFAR-100: our method achieves 63.3% accuracy on the 100 fine-grained classes while vanilla only achieves 59.28%. This indicates that our learned neighbors can preserve fine-grained details that are not explicitly given in the supervision signal. Qualitative results are shown in Fig. 6.

### 4.3 Regression

Table 2: Test MSE of our model, kNN, and the vanilla baseline on five datasets. n and d denote the dataset size and the data dimension, respectively.

| Dataset | n | d | kNN | vanilla | Meta-Neighborhoods |
|---|---|---|---|---|---|
| music | 515345 | 90 | 0.6812 ± 0.0062 | 0.6236 ± 0.0056 | 0.6088 ± 0.0050 |
| toms | 28179 | 96 | 0.0602 ± 0.0083 | 0.0594 ± 0.0080 | 0.0531 ± 0.0073 |
| cte | 53500 | 384 | 0.00134 ± 0.00023 | 0.00121 ± 0.00022 | 0.00109 ± 0.00015 |
| super | 21263 | 80 | 0.1126 ± 0.0061 | 0.1132 ± 0.0060 | 0.1077 ± 0.0068 |
| gom | 1059 | 116 | 0.5982 ± 0.0521 | 0.5949 ± 0.0515 | 0.5681 ± 0.0563 |

We use five publicly available datasets of various sizes from the UCI Machine Learning Repository [5]. For regression tasks, we found that learning neighbors in the input space yields better performance than learning neighbors in the feature space. As a result, the feature extractor µ_θ is implemented as an identity mapping. We compare our model to kNN and the vanilla baseline using the mean squared error (MSE). The vanilla baseline is a multilayer perceptron for regression. We searched for the best network configuration for the vanilla model on every dataset by varying the number of layers in {2, 3, 4, 5} and the number of neurons per layer in {32, 64, 128, 256}. For each dataset, our model uses the same network architecture as vanilla. Model details, training details, and hyperparameter settings are given in Appendix B. 5-fold cross-validation is used to report the results in Table 2. Our model has lower MSE than the vanilla model across the five datasets. The results of Meta-Neighborhoods and vanilla are statistically different based on a paired Student's t-test at a significance level of 0.05. We found that naively increasing the model complexity of the vanilla baseline cannot further improve its performance due to overfitting, but our method can, as it takes advantage of non-parametric neighbor information.

## 5 Conclusion

In this work, we introduced Meta-Neighborhoods, a novel meta-learning framework that adjusts predictions based on learnable neighbors. It is interesting to note that, in addition to directly generalizing KNN, Meta-Neighborhoods provides a learning paradigm that aligns more closely with human learning. Human learning jointly leverages previous examples both to shape the perceptual features we focus on and to pull relevant memories when faced with novel scenarios [20]. In much the same way, Meta-Neighborhoods uses feature-based models that are then adjusted by pulling memories from previous data. We show through extensive empirical studies that Meta-Neighborhoods improves the performance of already strong backbone networks like DenseNet and ResNet on several benchmark datasets. In addition to providing a greater gain in performance than previous state-of-the-art meta-learning methods like MAXL, Meta-Neighborhoods works for both regression and classification, and provides further interpretability.

## Broader Impact

Any general discriminative machine learning model runs the risk of making biased and offensive predictions reflective of its training data. Our work is no exception, as it aims at improving discriminative learning performance.
To reduce these negative influences to the minimum possible extent, we only use standard benchmarks in this work, such as CIFAR-10, Tiny-ImageNet, MNIST, and datasets from the UCI machine learning repository. Our work does raise some privacy concerns, as we learn a per-instance adjusted model. Potential applications of the proposed model include precision medicine, personalized recommendation systems, and personalized driver assistance systems. To keep user data safe, it is desirable to deploy our model only locally. The induced neighbors in our work, which are semantically meaningful, can also be regarded as fake synthetic data. Like DeepFakes, they may raise a set of challenging policy, technology, and legal issues. Legislation regarding synthetic data should take effect, and the research community needs to develop effective methods to detect such synthetic data.

## Acknowledgments and Disclosure of Funding

This work was supported in part by NIH 1R01AA02687901A1.

## References

[1] Qi Cai, Yingwei Pan, Ting Yao, Chenggang Yan, and Tao Mei. Memory matching networks for one-shot image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4080–4088, 2018.

[2] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkxLXnAcFQ.

[3] Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505, 2018.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[5] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.

[7] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[8] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.

[9] Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.

[10] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[11] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.

[12] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[16] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.

[17] Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129, 2017.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[20] Patricia K Kuhl, Feng-Ming Tsao, and Huei-Mei Liu. Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning. Proceedings of the National Academy of Sciences, 100(15):9096–9101, 2003.

[21] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[22] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.

[23] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

[24] Shikun Liu, Andrew Davison, and Edward Johns. Self-supervised generalisation with meta auxiliary learning. In Advances in Neural Information Processing Systems, pages 1677–1687, 2019.

[25] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

[26] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[27] Eunbyung Park and Junier B Oliva. Meta-curvature. In Advances in Neural Information Processing Systems 32, pages 3309–3319, 2019.

[28] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[29] Tiago Ramalho and Marta Garnelo. Adaptive posterior learning: few-shot learning with a surprise-based memory module. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByeSdsC9Km.

[30] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[32] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

[33] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.

[34] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.

[35] Pablo Sprechmann, Siddhant Jayakumar, Jack Rae, Alexander Pritzel, Adria Puigdomenech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, and Charles Blundell. Memory-based parameter adaptation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkfOvGbCW.

[36] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.

[37] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.

[38] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.

[39] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[40] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Robust classification with convolutional prototype learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3474–3482, 2018.