Published as a conference paper at ICLR 2022

FEW-SHOT LEARNING AS CLUSTER-INDUCED VORONOI DIAGRAMS: A GEOMETRIC APPROACH

Chunwei Ma1, Ziyun Huang2, Mingchen Gao1, Jinhui Xu1
1Department of Computer Science and Engineering, University at Buffalo
2Computer Science and Software Engineering, Penn State Erie
1{chunweim,mgao8,jinhui}@buffalo.edu  2{zxh201}@psu.edu

ABSTRACT

Few-shot learning (FSL) is the process of rapid generalization from abundant base samples to inadequate novel samples. Despite extensive research in recent years, FSL still cannot generate satisfactory solutions for a wide range of real-world applications. To confront this challenge, we study the FSL problem from a geometric point of view in this paper. One observation is that the widely embraced ProtoNet model is essentially a Voronoi Diagram (VD) in the feature space. We retrofit it by making use of a recent advance in computational geometry called the Cluster-induced Voronoi Diagram (CIVD). Starting from the simplest nearest-neighbor model, CIVD gradually incorporates cluster-to-point and then cluster-to-cluster relationships for space subdivision, which is used to improve the accuracy and robustness of multiple stages of FSL. Specifically, we use CIVD (1) to integrate parametric and nonparametric few-shot classifiers; (2) to combine feature representation and surrogate representation; and (3) to leverage feature-level, transformation-level, and geometry-level heterogeneities for a better ensemble. Our CIVD-based workflow enables us to achieve new state-of-the-art results on the mini-ImageNet, CUB, and tiered-ImageNet datasets, with 2%-5% improvements over the next best. To summarize, CIVD provides a mathematically elegant and geometrically interpretable framework that compensates for extreme data insufficiency, prevents overfitting, and allows for fast geometric ensemble of thousands of individual VDs. Together, these make FSL stronger.

1 INTRODUCTION

Recent years have witnessed tremendous success of deep learning in a number of data-intensive applications; one critical reason for this is the vast collection of hand-annotated high-quality data, such as the millions of natural images for visual object recognition (Deng et al., 2009). However, in many real-world applications, such large-scale data acquisition can be difficult and comes at a premium, as in rare disease diagnosis (Yoo et al., 2021) and drug discovery (Ma et al., 2021b; 2018). As a consequence, Few-shot Learning (FSL) has recently drawn growing interest (Wang et al., 2020).

Generally, few-shot learning algorithms can be categorized into two types, namely inductive and transductive, depending on whether estimating the distribution of query samples is allowed. A typical transductive FSL algorithm learns to propagate labels among a larger pool of query samples in a semi-supervised manner (Liu et al., 2019); notwithstanding its typically higher performance, in many real-world scenarios a query sample (e.g., a patient) arrives individually and is unique, for instance in personalized pharmacogenomics (Sharifi-Noghabi et al., 2020). In this paper, we therefore adhere to the inductive setting and make on-the-fly predictions for each newly seen sample.

Few-shot learning is challenging and substantially different from conventional deep learning, and has been tackled by many researchers from a wide variety of angles.

*All four authors are corresponding authors.
Despite the extensive research on the algorithmic aspects of FSL (see Sec. 2), two challenges still pose an obstacle to successful FSL: (1) how to sufficiently compensate for the data deficiency in FSL, and (2) how to make the most use of the base samples and the pre-trained model. For the first question, data augmentation has been a successful approach to expanding the size of the data, either by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) (Li et al., 2020b; Zhang et al., 2018) or by variational autoencoders (VAEs) (Kingma & Welling, 2014) (Zhang et al., 2019; Chen et al., 2019b). However, in either case, the authenticity of the augmented data or features is not guaranteed, and out-of-distribution hallucinated samples (Ma et al., 2019) may hinder the subsequent FSL. Recently, Liu et al. (2020b) and Ni et al. (2021) investigate support-level, query-level, task-level, and shot-level augmentation for meta-learning, but the diversity of FSL models has not been taken into consideration. For the second question, Yang et al. (2021) borrow the top-2 nearest base classes of each novel sample to calibrate its distribution and to generate more novel samples. However, when there is no proximal base class, this calibration may utterly alter the distribution. Another line of work (Sbai et al., 2020; Zhou et al., 2020) learns to select and design base classes for better discrimination on novel classes, which introduces an extra training burden. As a matter of fact, we still lack a method that makes full and effective use of the base classes and the pre-trained model.

In this paper, we study the FSL problem from a geometric point of view. In metric-based FSL, despite being surprisingly simple, nearest-neighbor-like approaches, e.g. ProtoNet (Snell et al., 2017) and SimpleShot (Wang et al., 2019), have achieved remarkable performance that is even better than many sophisticatedly designed methods. Geometrically, what a nearest-neighbor-based method does, under the hood, is to partition the feature space into a Voronoi Diagram (VD) induced by the feature centroids of the novel classes. Although highly efficient and simple, Voronoi Diagrams coarsely draw the decision boundary by linear bisectors separating two centers, and may lack the ability to subtly delineate the geometric structure that arises in FSL.

Table 1: The underlying geometric structures of various FSL methods.

| Method | Geometric Structure |
|---|---|
| ProtoNet (Snell et al., 2017) | Voronoi Diagram |
| S2M2_R (Mangla et al., 2020) | spherical VD |
| DC (Yang et al., 2021) | Power Diagram |
| DeepVoro-- (ours) | CIVD |
| DeepVoro/DeepVoro++ (ours) | CCVD |

To resolve this issue, we adopt a novel technique called the Cluster-induced Voronoi Diagram (CIVD) (Chen et al., 2013; 2017; Huang & Xu, 2020; Huang et al., 2021), a recent breakthrough in computational geometry. CIVD generalizes the VD from a point-to-point, distance-based diagram to a cluster-to-point, influence-based structure. It enables us to determine the dominating region (or Voronoi cell) not only of a point (e.g. a class prototype) but also of a cluster of points, and it guarantees a (1 + ε)-approximation with a nearly-linear-size diagram for a wide range of locally dominating influence functions. CIVD provides us a mathematically elegant framework to depict the feature space and to draw the decision boundary more precisely than the VD, without losing resistance to overfitting.
Accordingly, in this paper, we show how CIVD can be used to improve multiple stages of FSL, and we make the following contributions.

1. We first categorize different types of few-shot classifiers as different variants of the Voronoi Diagram: the nearest neighbor model as a Voronoi Diagram, the linear classifier as a Power Diagram, and the cosine classifier as a spherical Voronoi Diagram (Table 1). We then unify them via CIVD, which enjoys the advantages of multiple models, either parametric or nonparametric (denoted as DeepVoro--).

2. Going from cluster-to-point to cluster-to-cluster influence, we further propose the Cluster-to-cluster Voronoi Diagram (CCVD), a natural extension of CIVD. Based on CCVD, we present DeepVoro, which enables fast geometric ensemble of a large pool of thousands of configurations for FSL.

3. Instead of using base classes for distribution calibration and data augmentation (Yang et al., 2021), we propose a novel surrogate representation, the collection of similarities to base classes, and thus promote DeepVoro to DeepVoro++, which integrates feature-level, transformation-level, and geometry-level heterogeneities in FSL.

Figure 1: Schematic illustrations of the Voronoi Diagram (VD) and the surrogate representation on the MultiDigitMNIST dataset (Sun, 2019). The left and central panels show the VD of the base classes and of the novel classes (5-way 1-shot) in R², respectively. The colored squares stand for the 1-shot support samples. In the right panel, for each support sample, the surrogate representation (dotted line) exhibits a unique pattern which those of the query samples (colored lines) also follow. (See Appendix C for details.)

Extensive experiments show that, although a fixed feature extractor is used, without independently pre-trained or epoch-wise models, our method achieves new state-of-the-art results on all three benchmark datasets, including mini-ImageNet, CUB, and tiered-ImageNet, and improves by up to 2.18% on 5-shot classification, 2.53% on 1-shot classification, and up to 5.55% with different network architectures.

2 RELATED WORK

Few-Shot Learning. There are a number of different lines of research dedicated to FSL. (1) Metric-based methods employ a certain distance function (cosine distance (Mangla et al., 2020; Xu et al., 2021), Euclidean distance (Wang et al., 2019; Snell et al., 2017), or Earth Mover's Distance (Zhang et al., 2020a;b)) to bypass the optimization and avoid possible overfitting. (2) Optimization-based approaches (Finn et al., 2017) manage to learn a good model initialization that accelerates the optimization in the meta-testing stage. (3) Self-supervision-based methods (Zhang et al., 2021b; Mangla et al., 2020) incorporate supervision from the data itself to learn a more robust feature extractor. (4) Ensembling is another powerful technique that boosts performance by integrating multiple models (Ma et al., 2021a). For example, Dvornik et al. (2019) train several networks simultaneously and encourage robustness and cooperation among them. However, due to the high computational load of training deep models, this ensemble is restricted by the number of networks, which is typically fewer than 20. In Liu et al. (2020c), instead, the ensemble consists of models learned at each epoch, which may potentially limit the diversity of the ensemble members.

Geometric Understanding of Deep Learning.
The geometric structure of deep neural networks was first hinted at by Raghu et al. (2017), who reveal that piecewise linear activations subdivide the input space into convex polytopes. Balestriero et al. (2019) then point out that the exact structure is a Power Diagram (Aurenhammer, 1987), which was subsequently applied to recurrent neural networks (Wang et al., 2018) and generative models (Balestriero et al., 2020). The Power/Voronoi Diagram subdivision, however, is not necessarily the optimal model for describing the feature space. Recently, Chen et al. (2013; 2017) and Huang et al. (2021) use an influence function $F(C, z)$ to measure the joint influence of all objects in $C$ on a query $z$ and thereby build a Cluster-induced Voronoi Diagram (CIVD). In this paper, we utilize CIVD to magnify the expressivity of geometric modeling for FSL.

3 METHODOLOGY

3.1 PRELIMINARIES

Few-shot learning aims at discriminating between novel classes $\mathcal{C}_{novel}$ with the aid of a larger number of samples from base classes $\mathcal{C}_{base}$, where $\mathcal{C}_{novel} \cap \mathcal{C}_{base} = \emptyset$. The whole learning process usually follows the meta-learning scheme. Formally, given a dataset of base classes $D = \{(x_i, y_i)\}$, $x_i \in \mathcal{D}$, $y_i \in \mathcal{C}_{base}$, with $\mathcal{D}$ being an arbitrary domain, e.g. natural images, a deep neural network $z = \varphi(x)$, $z \in \mathbb{R}^n$, which maps from the image domain $\mathcal{D}$ to the feature domain $\mathbb{R}^n$, is trained using a standard gradient descent algorithm, after which $\varphi$ is fixed as a feature extractor. This process is referred to as the meta-training stage, which squeezes out the commonsense knowledge in $D$. For a fair evaluation of the learning performance on a few samples, the meta-testing stage is typically formulated as a series of $K$-way $N$-shot tasks (episodes) $\{\mathcal{T}\}$. Each such episode is further decomposed into a support set $S = \{(x_i, y_i)\}_{i=1}^{K \times N}$, $y_i \in \mathcal{C}_{\mathcal{T}}$, and a query set $Q = \{(x_i, y_i)\}_{i=1}^{K \times Q}$, $y_i \in \mathcal{C}_{\mathcal{T}}$, in which the episode classes $\mathcal{C}_{\mathcal{T}}$ are a randomly sampled subset of $\mathcal{C}_{novel}$ with cardinality $K$, and each class contains only $N$ and $Q$ random samples in the support set and query set, respectively. For few-shot classification, we introduce two widely used schemes as follows. For simplicity, all samples here are from $S$ and $Q$, with no data augmentation applied.

Nearest Neighbor Classifier (Nonparametric). In Snell et al. (2017); Wang et al. (2019), etc., a prototype $c_k$ is acquired by averaging over all support features of a class $k \in \mathcal{C}_{\mathcal{T}}$:

$$c_k = \frac{1}{N} \sum_{x \in S,\, y = k} \varphi(x) \qquad (1)$$

Then each query sample $x \in Q$ is classified by finding the nearest prototype: $\hat{y} = \arg\min_k d(z, c_k)$ with $d(z, c_k) = \|z - c_k\|_2^2$, in which we use the (squared) Euclidean distance as the distance metric $d$.

Linear Classifier (Parametric). Another scheme uses a linear classifier with the cross-entropy loss optimized on the support samples:

$$L(W, b) = \sum_{(x,y) \in S} -\log p(y \mid \varphi(x); W, b) = \sum_{(x,y) \in S} -\log \frac{\exp(W_y^T \varphi(x) + b_y)}{\sum_k \exp(W_k^T \varphi(x) + b_k)} \qquad (2)$$

in which $W_k$, $b_k$ are the linear weight and bias for class $k$, and the predicted class for a query $x \in Q$ is $\hat{y} = \arg\max_k p(y = k \mid z; W_k, b_k)$.
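For concreteness, the nonparametric scheme can be sketched in a few lines of NumPy; the helper names and the toy 2-D episode below are illustrative only and are not part of our released implementation.

```python
import numpy as np

def prototypes(feats, labels, K):
    # Eq. (1): c_k is the mean of the support features of class k
    return np.stack([feats[labels == k].mean(axis=0) for k in range(K)])

def vd_predict(z, protos):
    # assign z to the nearest prototype, i.e. the Voronoi cell containing z
    return int(((protos - z) ** 2).sum(axis=1).argmin())

# toy 5-way 1-shot episode in R^2
feats = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 2.]])
labels = np.arange(5)                    # one support sample per class
protos = prototypes(feats, labels, K=5)
print(vd_predict(np.array([0.9, 0.1]), protos))   # -> 1
```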
3.2 FEW-SHOT LEARNING AS CLUSTER-INDUCED VORONOI DIAGRAMS

In this section, we first introduce the basic concepts of Voronoi tessellations, and then show how parametric and nonparametric classifier heads can be unified by the VD.

Definition 3.1 (Power Diagram and Voronoi Diagram). Let $\Omega = \{\omega_1, ..., \omega_K\}$ be a partition of the space $\mathbb{R}^n$ and $C = \{c_1, ..., c_K\}$ be a set of centers such that $\bigcup_{r=1}^{K} \omega_r = \mathbb{R}^n$ and $\bigcap_{r=1}^{K} \omega_r = \emptyset$. Additionally, each center is associated with a weight $\nu_r \in \{\nu_1, ..., \nu_K\} \subset \mathbb{R}^{+}$. Then the set of pairs $\{(\omega_1, c_1, \nu_1), ..., (\omega_K, c_K, \nu_K)\}$ is a Power Diagram (PD), where each cell is obtained via $\omega_r = \{z \in \mathbb{R}^n : r^{*}(z) = r\}$, $r \in \{1, ..., K\}$, with

$$r^{*}(z) = \arg\min_{k \in \{1,...,K\}} d(z, c_k)^2 - \nu_k. \qquad (3)$$

If the weights are equal for all $k$, i.e. $\nu_k = \nu_{k'}$, $\forall k, k' \in \{1, ..., K\}$, then a PD collapses to a Voronoi Diagram (VD).

By definition, it is easy to see that the nearest neighbor classifier naturally partitions the space into $K$ cells with centers $\{c_1, ..., c_K\}$. Here we show that the linear classifier is also a VD under a mild condition.

Theorem 3.1 (Voronoi Diagram Reduction). The linear classifier parameterized by $W$, $b$ partitions the input space $\mathbb{R}^n$ into a Voronoi Diagram with centers $\{\tilde{c}_1, ..., \tilde{c}_K\}$ given by $\tilde{c}_k = \frac{1}{2}W_k$, if $b_k = -\frac{1}{4}\|W_k\|_2^2$, $k = 1, ..., K$.

Proof. See Appendix B for details.

3.2.1 FROM VORONOI DIAGRAM TO CLUSTER-INDUCED VORONOI DIAGRAM

Now that both the nearest neighbor and the linear classifier have been unified by the VD, a natural idea is to integrate them. The Cluster-induced Voronoi Diagram (CIVD) (Chen et al., 2017; Huang et al., 2021) is a generalization of the VD which allows multiple centers in a cell; it has been successfully used for clinical diagnosis from biomedical images (Wang et al., 2015), and it provides an ideal tool for the integration of parametric and nonparametric classifiers for FSL. Formally:

Definition 3.2 (Cluster-induced Voronoi Diagram (CIVD) (Chen et al., 2017; Huang et al., 2021)). Let $\Omega = \{\omega_1, ..., \omega_K\}$ be a partition of the space $\mathbb{R}^n$ and $C = \{C_1, ..., C_K\}$ be a set (possibly a multiset) of clusters. The set of pairs $\{(\omega_1, C_1), ..., (\omega_K, C_K)\}$ is a Cluster-induced Voronoi Diagram (CIVD) with respect to the influence function $F(C_k, z)$, where each cell is obtained via $\omega_r = \{z \in \mathbb{R}^n : r^{*}(z) = r\}$, $r \in \{1, ..., K\}$, with

$$r^{*}(z) = \arg\max_{k \in \{1,...,K\}} F(C_k, z). \qquad (4)$$

Here $C$ can be either a given set of clusters or even the whole power set of a given point set, and the influence function is defined as a function over the collection of distances from each member of a cluster $C_k$ to a query point $z$:

Definition 3.3 (Influence Function). The influence from $C_k$, $k \in \{1, ..., K\}$, to $z \notin C_k$ is $F(C_k, z) = F(\{d(c_k^{(i)}, z) \mid c_k^{(i)} \in C_k\}_{i=1}^{|C_k|})$. In this paper, $F$ is assumed to have the following form:

$$F(C_k, z) = -\operatorname{sign}(\alpha) \sum_{i=1}^{|C_k|} d(c_k^{(i)}, z)^{\alpha}. \qquad (5)$$

The sign function here makes sure that $F$ is monotonically decreasing with respect to the distance $d$. The hyperparameter $\alpha$ controls the magnitude of the influence; for example, $\alpha = -(n-1)$ for the gravitational force in $n$-dimensional space, and $\alpha = -2$ for the electric force.

Since the nearest neighbor centers $\{c_k\}_{k=1}^{K}$ and the centers introduced by the linear classifier $\{\tilde{c}_k\}_{k=1}^{K}$ are obtained from different schemes and could both be informative, we merge the corresponding centers of a novel class $k$ into a new cluster $C_k = \{c_k, \tilde{c}_k\}$, and use the resulting $C = \{C_1, ..., C_K\}$ to establish a CIVD. In this way, the final partition may enjoy the advantages of both parametric and nonparametric classifier heads. We name this approach DeepVoro--.
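The CIVD decision rule of Eqs. (4)-(5) is equally compact. The sketch below assumes Euclidean distance and the two-center clusters of DeepVoro--; the helper names are again illustrative.

```python
import numpy as np

def influence(cluster, z, alpha=1.0):
    # Eq. (5): F(C_k, z) = -sign(alpha) * sum_i d(c_k^(i), z)^alpha
    d = np.linalg.norm(cluster - z, axis=1)
    return -np.sign(alpha) * (d ** alpha).sum()

def civd_predict(clusters, z, alpha=1.0):
    # Eq. (4): the query falls into the cell of the most influential cluster;
    # for DeepVoro--, each cluster is C_k = {c_k (from VD), c~_k (from Voronoi-LR)}
    return int(np.argmax([influence(C, z, alpha) for C in clusters]))

# toy example: two classes, each represented by a two-center cluster in R^2
clusters = [np.array([[0., 0.], [0.2, 0.]]),
            np.array([[2., 0.], [2.2, 0.]])]
print(civd_predict(clusters, np.array([0.5, 0.])))   # -> 0
```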
3.3 FEW-SHOT CLASSIFICATION VIA SURROGATE REPRESENTATION

In the nearest neighbor classifier head, the distances from a query feature $z$ to each of the prototypes $\{c_k\}_{k=1}^{K}$ are the key discriminative criterion for classification. We rewrite $\{d(z, c_k)\}_{k=1}^{K}$ as a vector $d \in \mathbb{R}^K$ such that $d_k = d(z, c_k)$. These distances are acquired by measuring the distance between two points in high dimension: $z, c_k \in \mathbb{R}^n$. However, a notorious behavior of high dimensions is that the ratio between the distances to the nearest and farthest points of a point set approaches 1 (Aggarwal et al., 2001), making $\{d(z, c_k)\}_{k=1}^{K}$ less discriminative for classification, especially for the FSL problem with sample size $N \times K \ll n$. Hence, in this paper, we seek a surrogate representation.

In the human perception and learning system, similarity among familiar and unfamiliar objects plays a key role in object categorization and classification (Murray et al., 2002), and it has been experimentally verified by functional magnetic resonance imaging (fMRI) that a large region in the occipitotemporal cortex processes the shapes of both meaningful and unfamiliar objects (Op de Beeck et al., 2008). In our method, a connection is built between each unfamiliar novel class in $\mathcal{C}_{novel}$ and each related, well-perceived familiar class in $\mathcal{C}_{base}$. The first step is thus to identify the most relevant base classes for a specific task $\mathcal{T}$. Concretely:

Definition 3.4 (Surrogate Classes). In an episode $\mathcal{T}$, given the set of prototypes $\{c_k\}_{k=1}^{K}$ for the support set $S$ and the set of prototypes $\{c'_t\}_{t=1}^{|\mathcal{C}_{base}|}$ for the base set $D$, the surrogate classes for the episode classes $\mathcal{C}_{\mathcal{T}}$ are given by

$$\mathcal{C}_{surrogate}(\mathcal{T}) = \bigcup_{k=1}^{K} \operatorname{Top-}\!R_{\,t \in \{1,...,|\mathcal{C}_{base}|\}}\; d(c_k, c'_t) \qquad (6)$$

in which the Top-$R$ function returns the $R$ base-class indices with the smallest distances to $c_k$, and the center of a base class $t$ is given by $c'_t = \frac{1}{|\{(x,y)\,|\,x \in D,\, y=t\}|} \sum_{x \in D,\, y=t} \varphi(x)$. Here $R$ is a hyperparameter.

The rationale behind this selection, instead of simply using all base classes $\mathcal{C}_{base}$, is that the episode classes $\mathcal{C}_{\mathcal{T}}$ overlap with only a portion of the base classes (Zhang et al., 2021a), and discriminative similarities are likely to be overwhelmed by the background signal, especially when the number of base classes is large. After the surrogate classes are found, we re-index their feature centers as $\{c'_j\}_{j=1}^{R'}$, $R \le R' \le R \cdot K$. Then, both the support centers $\{c_k\}_{k=1}^{K}$ and the query feature $z$ are represented by the collection of similarities to these surrogate centers:

$$d'_k = (d(c_k, c'_1), ..., d(c_k, c'_{R'})),\; k = 1, ..., K, \qquad d' = (d(z, c'_1), ..., d(z, c'_{R'})) \qquad (7)$$

where $d'_k, d' \in \mathbb{R}^{R'}$ are the surrogate representations of novel class $k$ and of the query feature $z$, respectively. With the surrogate representation, the prediction is found via $\hat{y} = \arg\min_k d(d', d'_k) = \arg\min_k \|d' - d'_k\|_2^2$. This set of discriminative distances is rewritten as $d'' \in \mathbb{R}^K$ such that $d''_k = d(d', d'_k)$. An illustration of the surrogate representation is shown in Figure 1 on MultiDigitMNIST, a demonstrative dataset.

Integrating Feature Representation and Surrogate Representation. Until now, we have two discriminative systems, the feature-based $d \in \mathbb{R}^K$ and the surrogate-based $d'' \in \mathbb{R}^K$. A natural idea is to combine them into the following final criterion:

$$\bar{d} = \beta \frac{d}{\|d\|_1} + \gamma \frac{d''}{\|d''\|_1}, \qquad (8)$$

where $d$ and $d''$ are normalized by their Manhattan norms, $\|d\|_1 = \sum_{k=1}^{K} d_k$ and $\|d''\|_1 = \sum_{k=1}^{K} d''_k$, respectively, and $\beta$ and $\gamma$ are two hyperparameters adjusting the weights of the feature representation and the surrogate representation.
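A minimal sketch of Eqs. (6)-(8) is given below, assuming prototypes are stored as rows and squared Euclidean distances are used throughout; the single-function packaging is our own simplification.

```python
import numpy as np

def surrogate_predict(z, protos, base_protos, R=10, beta=1.0, gamma=1.0):
    # feature-based distances d in R^K
    d = ((protos - z) ** 2).sum(axis=1)
    # Eq. (6): union of the top-R nearest base prototypes of each novel prototype
    idx = np.unique(np.concatenate(
        [((base_protos - c) ** 2).sum(axis=1).argsort()[:R] for c in protos]))
    surr = base_protos[idx]                                  # R' surrogate centers
    # Eq. (7): surrogate representations of the K prototypes and of the query
    d_k = np.linalg.norm(protos[:, None] - surr[None], axis=2)   # shape (K, R')
    d_z = np.linalg.norm(surr - z, axis=1)                       # shape (R',)
    d2 = ((d_k - d_z) ** 2).sum(axis=1)          # surrogate distances d'' in R^K
    # Eq. (8): Manhattan-normalized combination; the smallest entry wins
    crit = beta * d / d.sum() + gamma * d2 / d2.sum()
    return int(crit.argmin())
```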
3.4 DEEPVORO: INTEGRATING MULTI-LEVEL HETEROGENEITY OF FSL

In this section we present DeepVoro, a fast geometric ensemble framework that unites our contributions to multiple stages of FSL, and show how it can be promoted to DeepVoro++ by incorporating the surrogate representation.

Compositional Feature Transformation. It is believed that FSL algorithms favor features with more Gaussian-like distributions, and thus various kinds of transformations have been used to improve the normality of the feature distribution, including the power transformation (Hu et al., 2021), Tukey's Ladder of Powers transformation (Yang et al., 2021), and L2 normalization (Wang et al., 2019). While these transformations are normally used independently, here we propose to compose several transformations sequentially in order to enlarge the expressivity of the transformation function and to increase the polymorphism of the FSL process. Specifically, for a feature vector $z$, three kinds of transformations are considered. (I) L2 normalization: by projection onto the unit sphere in $\mathbb{R}^n$, the feature is normalized as $f(z) = \frac{z}{\|z\|_2}$. (II) Linear transformation: since all features now lie on the unit sphere, we can then scale and shift them via a linear transformation $g_{w,b}(z) = wz + b$. (III) Tukey's Ladder of Powers transformation: finally, Tukey's Ladder of Powers transformation is applied to the feature:

$$h_{\lambda}(z) = \begin{cases} z^{\lambda} & \text{if } \lambda \neq 0 \\ \log(z) & \text{if } \lambda = 0 \end{cases}$$

By the composition of L2 normalization, the linear transformation, and Tukey's Ladder of Powers transformation, the transformation function becomes $(h_{\lambda} \circ g_{w,b} \circ f)(z)$, parameterized by $w, b, \lambda$.
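The composed transformation is a three-line function. The sketch below assumes nonnegative feature entries (e.g., post-ReLU features), as Tukey's transformation requires, and adds a small epsilon (our own choice) to guard against log(0).

```python
import numpy as np

def transform(z, w=1.0, b=0.0, lam=0.5, eps=1e-12):
    """(h_lam . g_{w,b} . f)(z) from Section 3.4."""
    z = z / (np.linalg.norm(z) + eps)   # f: L2 normalization onto the unit sphere
    z = w * z + b                       # g: scaling and shifting
    if lam == 0:                        # h: Tukey's Ladder of Powers (element-wise)
        return np.log(z + eps)
    return z ** lam
```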
Multi-level Heterogeneities in FSL. We are now ready to articulate the hierarchical heterogeneity existing at different stages of FSL. (I) Feature-level heterogeneity: data augmentation has been exhaustively explored for expanding the data size of FSL (Ni et al., 2021), including but not limited to rotation, flipping, cropping, erasing, solarization, color jitter, MixUp (Zhang et al., 2017), etc. The modification of an image $x$ changes the position of its feature $z$ in the feature space. We denote the set of all possible translations of an image as a set of functions $\{T\}$. (II) Transformation-level heterogeneity: after obtaining the feature $z$, a parameterized transformation is applied to it, and the resulting features can be quite different for different parameters (see Figure F.1). We denote the set of all possible transformations as $\{P_{w,b,\lambda}\}$. (III) Geometry-level heterogeneity: even with the feature given, the few-shot classification model can still be diverse: whether a VD- or PD-based model is used, whether the feature or the surrogate representation is adopted, and the setting of $R$ will also change the degree of locality. We denote the set of all possible models as $\{M\}$.

DeepVoro for Fast Geometric Ensemble of VDs. With the above three-level heterogeneity, the FSL process can be encapsulated as $(M \circ P_{w,b,\lambda} \circ \varphi \circ T)(x)$, and all possible configurations of a given episode $\mathcal{T}$ with a fixed $\varphi$ form the Cartesian product of the three sets: $\{T\} \times \{P_{w,b,\lambda}\} \times \{M\}$. Indeed, when a hold-out validation dataset is available, it can be used to find the optimal combination, but by virtue of ensemble learning, multiple models can still contribute positively to FSL (Dvornik et al., 2019). Since the cardinality of the resulting configuration set can be very large, the FSL model $M$, as well as the ensemble algorithm, is required to be highly efficient. The VD is a nonparametric model and no training is needed during the meta-testing stage, making it suitable for fast geometric ensemble. While CIVD models the cluster-to-point relationship via an influence function, here we further extend it so that cluster-to-cluster relationships can be considered. This motivates us to define the Cluster-to-cluster Voronoi Diagram (CCVD):

Definition 3.5 (Cluster-to-cluster Voronoi Diagram). Let $\Omega = \{\omega_1, ..., \omega_K\}$ be a partition of the space $\mathbb{R}^n$ and $C = \{C_1, ..., C_K\}$ be a set of totally ordered sets with the same cardinality $L$ (i.e. $|C_1| = |C_2| = ... = |C_K| = L$). The set of pairs $\{(\omega_1, C_1), ..., (\omega_K, C_K)\}$ is a Cluster-to-cluster Voronoi Diagram (CCVD) with respect to the influence function $F(C_k, C(z))$, where each cell is obtained via $\omega_r = \{z \in \mathbb{R}^n : r^{*}(z) = r\}$, $r \in \{1, ..., K\}$, with

$$r^{*}(z) = \arg\max_{k \in \{1,...,K\}} F(C_k, C(z)) \qquad (9)$$

where $C(z)$ is the cluster (also a totally ordered set with cardinality $L$) to which the query point $z$ belongs; that is, all points in this (query) cluster are assigned to the same cell. Similarly, the influence function is defined on the two totally ordered sets $C_k = \{c_k^{(i)}\}_{i=1}^{L}$ and $C(z) = \{z^{(i)}\}_{i=1}^{L}$:

$$F(C_k, C(z)) = -\operatorname{sign}(\alpha) \sum_{i=1}^{L} d(c_k^{(i)}, z^{(i)})^{\alpha}. \qquad (10)$$

With this definition, we are now able to streamline our aforementioned approaches into a single ensemble model. Suppose there are in total $L$ possible settings in our configuration pool $\{T\} \times \{P_{w,b,\lambda}\} \times \{M\}$. For all configurations $\{\rho_i\}_{i=1}^{L}$, we apply them to the support set $S$ to generate the $K$ totally ordered clusters $\{\{c_k^{(\rho_i)}\}_{i=1}^{L}\}_{k=1}^{K}$, where each center $c_k^{(\rho_i)}$ is derived through configuration $\rho_i$, and to a query sample $x$ to generate the query cluster $C(z) = \{z^{(\rho_1)}, ..., z^{(\rho_L)}\}$; we then plug these into Definition 3.5 to construct the final Voronoi Diagram. When only the feature representation is considered in the configuration pool, i.e. $\rho_i \in \{T\} \times \{P_{w,b,\lambda}\}$, our FSL process is named DeepVoro; when the surrogate representation is also incorporated, i.e. $\rho_i \in \{T\} \times \{P_{w,b,\lambda}\} \times \{M\}$, DeepVoro is promoted to DeepVoro++, which allows for higher geometric diversity. See Appendix A for a summary of the notations and acronyms.
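For reference, the CCVD decision rule of Eqs. (9)-(10) can be sketched as follows, with the $i$-th center of every class paired with the query embedding produced under the same configuration $\rho_i$; the array-shape convention is our own.

```python
import numpy as np

def ccvd_predict(centers, query_cluster, alpha=1.0):
    # centers:       (K, L, n) -- c_k^(rho_i) for each class k and configuration rho_i
    # query_cluster: (L, n)    -- z^(rho_i) under the same L configurations, same order
    d = np.linalg.norm(centers - query_cluster[None], axis=2)   # (K, L) paired distances
    F = -np.sign(alpha) * (d ** alpha).sum(axis=1)              # Eq. (10)
    return int(F.argmax())                                      # Eq. (9)
```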
4 EXPERIMENTS

Table 2: Summary of the datasets used in the paper.

| Dataset | Base classes | Novel classes | Image size | Images per class |
|---|---|---|---|---|
| MultiDigitMNIST | 64 | 20 | 64×64×1 | 1000 |
| mini-ImageNet | 64 | 20 | 84×84×3 | 600 |
| CUB | 100 | 50 | 84×84×3 | 44-60 |
| tiered-ImageNet | 351 | 160 | 84×84×3 | 732-1300 |

The main goals of our experiments are to: (1) validate the ability of CIVD to integrate parametric and nonparametric classifiers and confirm the necessity of the Voronoi reduction; (2) investigate how different levels of heterogeneity, individually or collaboratively, contribute to the overall result, and compare them with state-of-the-art methods; (3) reanalyze this ensemble when the surrogate representation comes into play, and see how it ameliorates the extreme shortage of support samples. See Table 2 for a summary and Appendix D for detailed descriptions of mini-ImageNet (Vinyals et al., 2016), CUB (Welinder et al., 2010), and tiered-ImageNet (Ren et al., 2018), the three datasets used in this paper.

DeepVoro--: Integrating Parametric and Nonparametric Methods via CIVD. To verify the proposed CIVD model for the integration of parametric/nonparametric FSL classifiers, we first run three standalone models: logistic regression with the Power Diagram or the Voronoi Diagram as the underlying geometric structure (Power-LR/Voronoi-LR), and the vanilla Voronoi Diagram (VD, i.e. the nearest neighbor model); we then integrate the VD with either Power-LR or Voronoi-LR (see Appendix E for details). Interestingly, VD with Power-LR never reaches the best result, suggesting that ordinary LR cannot be integrated with VD due to their intrinsically different geometric structures. After the proposed Voronoi reduction (Theorem 3.1), however, VD+Voronoi-LR is able to improve upon both models in most cases, suggesting that CIVD can indeed integrate parametric and nonparametric models for better FSL.

Table 3: The 5-way few-shot accuracy (in %), with 95% confidence intervals, of DeepVoro and DeepVoro++ compared against state-of-the-art results on three benchmark datasets. The results of DC and S2M2_R are reproduced from open-sourced implementations using the same random seed as DeepVoro.

| Method | mini-ImageNet 1-shot | mini-ImageNet 5-shot | CUB 1-shot | CUB 5-shot | tiered-ImageNet 1-shot | tiered-ImageNet 5-shot |
|---|---|---|---|---|---|---|
| MAML (Finn et al., 2017) | 54.69±0.89 | 66.62±0.83 | 71.29±0.95 | 80.33±0.70 | 51.67±1.81 | 70.30±0.08 |
| Meta-SGD (Li et al., 2017) | 50.47±1.87 | 64.03±0.94 | 53.34±0.97 | 67.59±0.82 | - | - |
| Meta Variance Transfer (Park et al., 2020) | - | - | 67.67±0.70 | 80.33±0.61 | - | - |
| MetaGAN (Zhang et al., 2018) | 52.71±0.64 | 68.63±0.67 | - | - | - | - |
| Delta-Encoder (Schwartz et al., 2018) | 59.9 | 69.7 | 69.8 | 82.6 | - | - |
| MatchingNet (Vinyals et al., 2016) | 64.03±0.20 | 76.32±0.16 | 73.49±0.89 | 84.45±0.58 | 68.50±0.92 | 80.60±0.71 |
| Prototypical Net (Snell et al., 2017) | 54.16±0.82 | 73.68±0.65 | 72.99±0.88 | 86.64±0.51 | 65.65±0.92 | 83.40±0.65 |
| Baseline++ (Chen et al., 2019a) | 57.53±0.10 | 72.99±0.43 | 70.40±0.81 | 82.92±0.78 | 60.98±0.21 | 75.93±0.17 |
| Variational Few-shot (Zhang et al., 2019) | 61.23±0.26 | 77.69±0.17 | - | - | - | - |
| TriNet (Chen et al., 2019b) | 58.12±1.37 | 76.92±0.69 | 69.61±0.46 | 84.10±0.35 | - | - |
| LEO (Rusu et al., 2018) | 61.76±0.08 | 77.59±0.12 | 68.22±0.22 | 78.27±0.16 | 66.33±0.05 | 81.44±0.09 |
| DCO (Lee et al., 2019) | 62.64±0.61 | 78.63±0.46 | - | - | 65.99±0.72 | 81.56±0.53 |
| Negative-Cosine (Liu et al., 2020a) | 63.85±0.81 | 81.57±0.56 | 72.66±0.85 | 89.40±0.43 | - | - |
| MTL (Wang et al., 2021) | 59.84±0.22 | 77.72±0.09 | - | - | 67.11±0.12 | 83.69±0.02 |
| ConstellationNet (Xu et al., 2021) | 64.89±0.23 | 79.95±0.17 | - | - | - | - |
| AFHN (Li et al., 2020b) | 62.38±0.72 | 78.16±0.56 | 70.53±1.01 | 83.95±0.63 | - | - |
| AM3+TRAML (Li et al., 2020a) | 67.10±0.52 | 79.54±0.60 | - | - | - | - |
| E3BM (Liu et al., 2020c) | 63.80±0.40 | 80.29±0.25 | - | - | 71.20±0.40 | 85.30±0.30 |
| SimpleShot (Wang et al., 2019) | 64.29±0.20 | 81.50±0.14 | - | - | 71.32±0.22 | 86.66±0.15 |
| R2-D2 (Liu et al., 2020b) | 65.95±0.45 | 81.96±0.32 | - | - | - | - |
| Robust-dist++ (Dvornik et al., 2019) | 63.73±0.62 | 81.19±0.43 | - | - | 70.44±0.32 | 85.43±0.21 |
| IEPT (Zhang et al., 2021b) | 67.05±0.44 | 82.90±0.30 | - | - | 72.24±0.50 | 86.73±0.34 |
| MELR (Fei et al., 2021) | 67.40±0.43 | 83.40±0.28 | 70.26±0.50 | 85.01±0.32 | 72.14±0.51 | 87.01±0.35 |
| S2M2_R (Mangla et al., 2020) | 64.65±0.45 | 83.20±0.30 | 80.14±0.45 | 90.99±0.23 | 68.12±0.52 | 86.71±0.34 |
| M-SVM+MM+ens+val (Ni et al., 2021) | 67.37±0.32 | 84.57±0.21 | - | - | - | - |
| DeepEMD (Zhang et al., 2020a) | 65.91±0.82 | 82.41±0.56 | 75.65±0.83 | 88.69±0.50 | 71.16±0.87 | 86.03±0.58 |
| DeepEMD-V2 (Zhang et al., 2020b) | 68.77±0.29 | 84.13±0.53 | 79.27±0.29 | 89.80±0.51 | 74.29±0.32 | 86.98±0.60 |
| DC (Yang et al., 2021) | 67.79±0.45 | 83.69±0.31 | 79.93±0.46 | 90.77±0.24 | 74.24±0.50 | 88.38±0.31 |
| PT+NCM (Hu et al., 2021) | 65.35±0.20 | 83.87±0.13 | 80.57±0.20 | 91.15±0.10 | 69.96±0.22 | 86.45±0.15 |
| DeepVoro | 69.48±0.45 | 86.75±0.28 | 82.99±0.43 | 92.62±0.22 | 74.98±0.48 | 89.40±0.29 |
| DeepVoro++ | 71.30±0.46 | 85.40±0.30 | 82.95±0.43 | 91.21±0.23 | 75.38±0.48 | 87.25±0.33 |

DeepVoro: Improving FSL by Hierarchical Heterogeneities. In this section, we consider only two levels of heterogeneity for the ensemble: feature-level and transformation-level. For the feature-level ensemble, we utilize three kinds of image augmentations: rotation, flipping, and central cropping, summing up to 64 distinct ways of data augmentation (Appendix F); a sketch of this enumeration is shown below.
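The enumeration is deterministic and tiny; the sketch below encodes each augmentation as a (rotation, flip, scale) tuple, with the actual image operations omitted.

```python
from itertools import product

rotations = [0, 90, 180, 270]                 # degrees
flips = [False, True]                         # horizontal flip or not
scales = [84 + B for B in range(0, 80, 10)]   # scale to (84+B)x(84+B), then center-crop

configs = list(product(rotations, flips, scales))
assert len(configs) == 64                     # |{T}| = 64
```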
For the transformation-level ensemble, we use the proposed compositional transformations with 8 different combinations of λ and b that encourage diverse feature transformations (Appendix F.1) without loss of accuracy (Figure 2). The size of the resulting configuration pool thus becomes 512, and DeepVoro's performance is shown in Table 3. Clearly, DeepVoro outperforms all previous methods, especially on 5-way 5-shot FSL. Specifically, DeepVoro is better than the next best by 2.18% (Ni et al., 2021) on mini-ImageNet, by 1.47% (Hu et al., 2021) on CUB, and by 1.02% (Yang et al., 2021) on tiered-ImageNet. Note that these are estimated improvements, because not all competing methods here are tested with the same random seed and number of episodes. More detailed results can be found in Appendix F. By virtue of CCVD, and using the simplest VD as the building block, DeepVoro is arguably able to yield consistently better results by the ensemble of a massive pool of independent VDs. DeepVoro also exhibits high resistance to outliers, as shown in Figure K.16.

DeepVoro++: Further Improvement of FSL via Surrogate Representation. In the surrogate representation, the number of neighbors R for each novel class and the weight β balancing the surrogate/feature representations are two hyperparameters. With a validation set available, a natural question is whether these hyperparameters can be found through optimization on the validation set, which requires good generalization of the hyperparameters across different novel classes. As Figure K.13 shows, the accuracy of the VD under varying hyperparameters agrees well between the testing and validation sets. With this in mind, we select 10 combinations of β and R, guided by the validation set, in conjunction with 2 different feature transformations and 64 different image augmentations, adding up to a large pool of 1280 configurations for the ensemble (denoted DeepVoro++). As shown in Table 3, DeepVoro++ achieves the best results for 1-shot FSL: 2.53% higher than Zhang et al. (2020b), 2.38% higher than Hu et al. (2021), and 1.09% higher than Zhang et al. (2020b), on the three datasets respectively, justifying the efficacy of our surrogate representation. See Appendix G for a more detailed analysis.

Table 4: DeepVoro ablation experiments with feature- (Feat.), transformation- (Trans.), and geometry- (Geo.) level heterogeneities on the mini-ImageNet 5-way few-shot dataset. L denotes the size of the configuration pool, i.e. the number of ensemble members. Tunable parameters are rotation etc. (Feat.), w, b, λ (Trans.), and β, γ, R (Geo.). † These rows show the VD accuracy without CCVD integration.

| Method | Geometric Structure | Feat. | Trans. | Geo. | L | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|---|---|---|---|
| DeepVoro-- | CIVD | - | - | - | - | 65.85±0.43 | 84.66±0.29 |
| DeepVoro | CCVD | - | - | - | 1† | 66.92±0.45 | 84.64±0.30 |
| | | - | 8 | - | 8 | 66.45±0.44 | 84.55±0.29 |
| | | 64 | - | - | 64 | 67.88±0.45 | 86.39±0.29 |
| | | 64 | 8 | - | 512 | 69.48±0.45 | 86.75±0.28 |
| DeepVoro++ | CCVD w/ surrogate representation | - | - | - | 1† | 68.68±0.46 | 84.28±0.31 |
| | | - | 2 | 10 | 20 | 68.38±0.46 | 83.27±0.31 |
| | | 64 | - | - | 64 | 70.95±0.46 | 84.77±0.30 |
| | | 64 | 2 | 10 | 1280 | 71.30±0.46 | 85.40±0.30 |

Figure 2: The 5-way few-shot accuracy of the VD with different λ and b on the mini-ImageNet and CUB datasets. (Panels: A, mini-ImageNet 5-way 5-shot; B, mini-ImageNet 5-way 1-shot; C, CUB 5-way 5-shot; D, CUB 5-way 1-shot; each compares validation and testing accuracy as a function of b with λ = 0.)

Ablation Experiments and Running Time.
Table 4 varies the level of heterogeneity (see Tables F.4 and G.5 for all datasets). The average accuracy of VDs without CCVD integration is marked by †, and is significantly lower than that of the fully-fledged ensemble. Table 5 presents the running time of DeepVoro(++), benchmarked on a 20-core Intel Core i7 CPU with NumPy (v1.20.3); its efficiency is comparable to that of DC/S2M2_R, even with >1000 ensemble members.

Table 5: Running time comparison.

| Method | #ensemble members | Time (min) |
|---|---|---|
| DC | - | 88.29 |
| S2M2_R | - | 33.89 |
| DeepVoro | 1 | 0.05 |
| DeepVoro | 512 | 25.67 |
| DeepVoro++ | 1 | 0.14 |
| DeepVoro++ | 1280 | 179.05 |

Experiments with Different Backbones, Meta-training Protocols, and Domains. Because the feature extraction backbone, the meta-training loss, and the degree of discrepancy between the source and target domains all affect the downstream FSL, we examine the robustness of DeepVoro/DeepVoro++ under a number of different circumstances; details are given in Appendices H, I, and J. Notably, DeepVoro/DeepVoro++ attains the best performance by up to 5.55%, and is therefore corroborated as a superior method for FSL, regardless of the backbone, training loss, or domain.

5 CONCLUSION

In this paper, our contribution is threefold. We first theoretically unify parametric and nonparametric few-shot classifiers into a general geometric framework (VD) and show an improved result by virtue of this integration (CIVD). By extending CIVD to CCVD, we present a fast geometric ensemble method (DeepVoro) that takes into consideration thousands of FSL configurations with high efficiency. To deal with the extreme data insufficiency in one-shot learning, we further propose a novel surrogate representation which, when incorporated into DeepVoro, promotes the performance of one-shot learning to a higher level (DeepVoro++). In future studies, we plan to extend our geometric approach to meta-learning-based FSL and lifelong FSL.

ACKNOWLEDGMENTS

This research was supported in part by NSF through grant IIS-1910492.

REPRODUCIBILITY STATEMENT

Our code, as well as the data splits, random seeds, hyperparameters, and scripts for reproducing the results in the paper, is available at https://github.com/horsepurve/DeepVoro.

REFERENCES

Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pp. 420-434. Springer, 2001.

Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78-96, 1987.

Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk. The geometry of deep networks: Power diagram subdivision. Advances in Neural Information Processing Systems, 32:15832-15841, 2019.

Randall Balestriero, Sebastien Paris, and Richard G Baraniuk. Max-affine spline insights into deep generative networks. In International Conference on Learning Representations, 2020.

Danny Z. Chen, Ziyun Huang, Yangwei Liu, and Jinhui Xu. On clustering induced Voronoi diagrams. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October 2013, Berkeley, CA, USA, pp. 390-399. IEEE Computer Society, 2013. doi: 10.1109/FOCS.2013.49.

Danny Z. Chen, Ziyun Huang, Yangwei Liu, and Jinhui Xu. On clustering induced Voronoi diagrams. SIAM J. Comput., 46(6):1679-1711, 2017. doi: 10.1137/15M1044874.

Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification.
In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=HkxLXnAcFQ.

Zitian Chen, Yanwei Fu, Yinda Zhang, Yu-Gang Jiang, Xiangyang Xue, and Leonid Sigal. Multi-level semantic feature augmentation for one-shot learning. IEEE Transactions on Image Processing, 28(9):4594-4605, 2019b.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Nikita Dvornik, Cordelia Schmid, and Julien Mairal. Diversity with cooperation: Ensemble methods for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3723-3731, 2019.

Nanyi Fei, Zhiwu Lu, Tao Xiang, and Songfang Huang. MELR: Meta-learning via modeling episode-level relationships for few-shot learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=D3PcGLdMx0.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126-1135. PMLR, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Yuqing Hu, Vincent Gripon, and Stéphane Pateux. Leveraging the feature distribution in transfer-based few-shot learning. Artificial Neural Networks and Machine Learning - ICANN 2021, pp. 487-499, 2021. ISSN 1611-3349. doi: 10.1007/978-3-030-86340-1_39. URL http://dx.doi.org/10.1007/978-3-030-86340-1_39.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Ziyun Huang and Jinhui Xu. An efficient sum query algorithm for distance-based locally dominating functions. Algorithmica, 82(9):2415-2431, 2020. doi: 10.1007/s00453-020-00691-w.

Ziyun Huang, Danny Z. Chen, and Jinhui Xu. Influence-based Voronoi diagrams of clusters. Comput. Geom., 96:101746, 2021. doi: 10.1016/j.comgeo.2021.101746.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.

Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657-10665, 2019.

Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, and Liwei Wang. Boosting few-shot learning with adaptive margin loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12576-12584, 2020a.

Kai Li, Yulun Zhang, Kunpeng Li, and Yun Fu. Adversarial feature hallucination networks for few-shot learning.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13470-13479, 2020b.

Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, and Han Hu. Negative margin matters: Understanding margin in few-shot classification. In European Conference on Computer Vision, pp. 438-455. Springer, 2020a.

Jialin Liu, Fei Chao, and Chih-Min Lin. Task augmentation by rotating for meta-learning. arXiv preprint arXiv:2003.00804, 2020b.

Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sungju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2019.

Yaoyao Liu, Bernt Schiele, and Qianru Sun. An ensemble of epoch-wise empirical Bayes for few-shot learning. In European Conference on Computer Vision, pp. 404-421. Springer, 2020c.

Chunwei Ma, Yan Ren, Jiarui Yang, Zhe Ren, Huanming Yang, and Siqi Liu. Improved peptide retention time prediction in liquid chromatography through deep learning. Analytical Chemistry, 90(18):10881-10888, 2018.

Chunwei Ma, Zhanghexuan Ji, and Mingchen Gao. Neural style transfer improves 3D cardiovascular MR image segmentation on inconsistent data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 128-136. Springer, 2019.

Chunwei Ma, Ziyun Huang, Jiayi Xian, Mingchen Gao, and Jinhui Xu. Improving uncertainty calibration of deep neural networks via truth discovery and geometric optimization. In Uncertainty in Artificial Intelligence, pp. 75-85. PMLR, 2021a.

Jianzhu Ma, Samson H Fong, Yunan Luo, Christopher J Bakkenist, John Paul Shen, Soufiane Mourragui, Lodewyk FA Wessels, Marc Hafner, Roded Sharan, Jian Peng, et al. Few-shot learning creates predictive models of drug response that translate from high-throughput screens to individual patients. Nature Cancer, 2(2):233-244, 2021b.

Puneet Mangla, Nupur Kumari, Abhishek Sinha, Mayank Singh, Balaji Krishnamurthy, and Vineeth N Balasubramanian. Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2218-2227, 2020.

Scott O Murray, Daniel Kersten, Bruno A Olshausen, Paul Schrater, and David L Woods. Shape perception reduces activity in human primary visual cortex. Proceedings of the National Academy of Sciences, 99(23):15164-15169, 2002.

Renkun Ni, Micah Goldblum, Amr Sharaf, Kezhi Kong, and Tom Goldstein. Data augmentation for meta-learning. In International Conference on Machine Learning (ICML), pp. 8152-8161. PMLR, 2021.

Hans P. Op de Beeck, Katrien Torfs, and Johan Wagemans. Perceived shape similarity among unfamiliar objects and the organization of the human object vision pathway. Journal of Neuroscience, 28(40):10111-10123, 2008. doi: 10.1523/JNEUROSCI.2511-08.2008.

Seong-Jin Park, Seungju Han, Ji-won Baek, Insoo Kim, Juhwan Song, Hae Beom Lee, Jae-Joon Han, and Sung Ju Hwang. Meta variance transfer: Learning to augment from the others. In International Conference on Machine Learning, pp. 7510-7520. PMLR, 2020.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning (ICML), pp. 2847-2854. PMLR, 2017.
Mengye Ren, Sachin Ravi, Eleni Triantafillou, Jake Snell, Kevin Swersky, Josh B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJcSzz-CZ.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2018.

Othman Sbai, Camille Couprie, and Mathieu Aubry. Impact of base dataset design on few-shot image classification. In European Conference on Computer Vision, pp. 597-613. Springer, 2020.

Eli Schwartz, Leonid Karlinsky, Joseph Shtok, Sivan Harary, Mattias Marder, Abhishek Kumar, Rogério Schmidt Feris, Raja Giryes, and Alexander M Bronstein. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In NeurIPS, 2018.

Hossein Sharifi-Noghabi, Shuman Peng, Olga Zolotareva, Colin C Collins, and Martin Ester. AITL: Adversarial Inductive Transfer Learning with input and output space adaptation for pharmacogenomics. Bioinformatics, 36:i380-i388, 07 2020.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems (NIPS), 30:4077-4087, 2017.

Shao-Hua Sun. Multi-digit MNIST for few-shot learning, 2019. URL https://github.com/shaohua0116/MultiDigitMNIST.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29:3630-3638, 2016.

Haoxiang Wang, Han Zhao, and Bo Li. Bridging multi-task learning and meta-learning: Towards efficient training and effective adaptation. In International Conference on Machine Learning. PMLR, 2021.

Jiazhuo Wang, John D. MacKenzie, Rageshree Ramachandran, and Danny Z. Chen. Neutrophils identification by deep learning and Voronoi diagram of clusters. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015, pp. 226-233, Cham, 2015.

Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens van der Maaten. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623, 2019.

Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv., 53(3), June 2020.

Zichao Wang, Randall Balestriero, and Richard Baraniuk. A max-affine spline perspective of recurrent neural networks. In International Conference on Learning Representations, 2018.

Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. California Institute of Technology, 2010.

Weijian Xu, Yifan Xu, Huaijin Wang, and Zhuowen Tu. Attentional constellation nets for few-shot learning. In International Conference on Learning Representations, 2021.

Shuo Yang, Lu Liu, and Min Xu. Free lunch for few-shot learning: Distribution calibration. In International Conference on Learning Representations, 2021.

Tae Keun Yoo, Joon Yul Choi, and Hong Kyu Kim.
Feasibility study to improve deep learning in OCT diagnosis of rare retinal diseases with few-shot classification. Medical & Biological Engineering & Computing, 59(2):401-415, 2021.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.

Baoquan Zhang, Xutao Li, Yunming Ye, Zhichao Huang, and Lisai Zhang. Prototype completion with primitive knowledge for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3754-3762, 2021a.

Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020a.

Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Differentiable earth mover's distance for few-shot learning, 2020b.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Jian Zhang, Chenglong Zhao, Bingbing Ni, Minghao Xu, and Xiaokang Yang. Variational few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1685-1694, 2019.

Manli Zhang, Jianhong Zhang, Zhiwu Lu, Tao Xiang, Mingyu Ding, and Songfang Huang. IEPT: Instance-level and episode-level pretext tasks for few-shot learning. In International Conference on Learning Representations, 2021b.

Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. MetaGAN: An adversarial approach to few-shot learning. NeurIPS, 2:8, 2018.

Linjun Zhou, Peng Cui, Xu Jia, Shiqiang Yang, and Qi Tian. Learning to select base classes for few-shot classification. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2020.

A NOTATIONS AND ACRONYMS

Table A.1: Notations and acronyms for the VD, PD, CIVD, and CCVD, the four geometric structures discussed in the paper.

| Geometric Structure | Acronym | Notation | Description |
|---|---|---|---|
| Voronoi Diagram | VD | c_k | center of Voronoi cell ω_k, k ∈ {1, .., K} |
| | | ω_k | dominating region of center c_k, k ∈ {1, .., K} |
| Power Diagram | PD | c_k | center of Power cell ω_k, k ∈ {1, .., K} |
| | | ν_k | weight of center c_k, k ∈ {1, .., K} |
| | | ω_k | dominating region of center c_k, k ∈ {1, .., K} |
| Cluster-induced Voronoi Diagram | CIVD | C_k | cluster serving as the center of CIVD cell ω_k, k ∈ {1, .., K} |
| | | ω_k | dominating region of cluster C_k |
| | | F | influence function F(C_k, z) from cluster C_k to query point z |
| | | α | magnitude of the influence |
| Cluster-to-cluster Voronoi Diagram | CCVD | C_k | cluster serving as the center of CCVD cell ω_k, k ∈ {1, .., K} |
| | | ω_k | dominating region of cluster C_k |
| | | C(z) | the cluster to which query point z belongs |
| | | F | influence function F(C_k, C(z)) from C_k to query cluster C(z) |
| | | α | magnitude of the influence |

Table A.2: Summary and comparison of geometric structures, centers, tunable parameters, and the numbers of tunable parameters (denoted by #) for DeepVoro--, DeepVoro, and DeepVoro++. Parameters for feature-level, transformation-level, and geometry-level heterogeneity are grouped in separate rows. See Sec. F for implementation details. Here the PD is reduced to a VD by Theorem 3.1. For every λ (or R), the b (or β) value with the highest validation accuracy is introduced into the configuration pool.
| Method | Geometric Structure | Centers | Tunable parameter (#) | Description |
|---|---|---|---|---|
| DeepVoro-- | CIVD | C_k = {c_k, c̃_k} | - | c_k from the VD; c̃_k from the PD |
| DeepVoro | CCVD | C_k = {c_k^(ρ_i)}_{i=1}^L, ρ_i ∈ {T}×{P_{w,b,λ}} | angle of rotation (4); flipping or not (2); scaling & cropping (8) | feature-level |
| | | | w = 1 (-); b (4); λ (2) | transformation-level: scale factor, shift factor, and exponent; #configurations L = 512 |
| DeepVoro++ | CCVD | C_k = {c_k^(ρ_i)}_{i=1}^L, ρ_i ∈ {T}×{P_{w,b,λ}}×{M} | angle of rotation (4); flipping or not (2); scaling & cropping (8) | feature-level |
| | | | w = 1 (-); b (1); λ (2) | transformation-level: scale factor, shift factor, and exponent |
| | | | R (10); γ = 1 (-); β (1) | geometry-level: number of top-R nearest base prototypes per novel prototype, weight for surrogate representation, and weight for feature representation; #configurations L = 1280 |

B POWER DIAGRAM SUBDIVISION AND VORONOI REDUCTION

B.1 PROOF OF THEOREM 3.1

Lemma B.1. The vertical projection onto the input space $\mathbb{R}^n$ of the lower envelope of the hyperplanes $\{\Pi_k(z) : W_k^T z + b_k\}_{k=1}^{K}$ defines the cells of a PD.

Theorem 3.1 (Voronoi Diagram Reduction). The linear classifier parameterized by $W$, $b$ partitions the input space $\mathbb{R}^n$ into a Voronoi Diagram with centers $\{\tilde{c}_1, ..., \tilde{c}_K\}$ given by $\tilde{c}_k = \frac{1}{2}W_k$, if $b_k = -\frac{1}{4}\|W_k\|_2^2$, $k = 1, ..., K$.

Proof. We first articulate Lemma B.1 and find the exact relationship between the hyperplane $\Pi_k(z)$ and the center of its associated cell in $\mathbb{R}^n$. By Definition 3.1, the cell for a point $z \in \mathbb{R}^n$ is found by comparing $d(z, c_k)^2 - \nu_k$ across different $k$, so we define the power function $p(z, S)$ expressing this value:

$$p(z, S) = (z - u)^2 - r^2 \qquad (11)$$

in which $S \subset \mathbb{R}^n$ is a sphere with center $u$ and radius $r$. In fact, the weight $\nu$ associated with a center in Definition 3.1 can be interpreted as the square of the radius, $r^2$. Next, let $U$ denote the paraboloid $y = z^2$, and let $\Pi(S)$ be the transform that maps a sphere $S$ with center $u$ and radius $r$ into the hyperplane

$$\Pi(S) : y = 2z \cdot u - u \cdot u + r^2. \qquad (12)$$

It can be proved that $\Pi$ is a bijective mapping between arbitrary spheres in $\mathbb{R}^n$ and nonvertical hyperplanes in $\mathbb{R}^{n+1}$ that intersect $U$ (Aurenhammer, 1987). Further, let $z'$ denote the vertical projection of $z$ onto $U$ and $z''$ its vertical projection onto $\Pi(S)$; then the power function can be written as

$$p(z, S) = d(z, z') - d(z, z''), \qquad (13)$$

which implies the following relationship between spheres in $\mathbb{R}^n$ and associated hyperplanes in $\mathbb{R}^{n+1}$ (Lemma 4 in Aurenhammer (1987)): let $S_1$ and $S_2$ be non-cocentric spheres in $\mathbb{R}^n$; then the bisector of their Power cells is the vertical projection of $\Pi(S_1) \cap \Pi(S_2)$ onto $\mathbb{R}^n$. Now we have a direct relationship between a sphere $S$ and the hyperplane $\Pi(S)$, and comparing equation (12) with the hyperplanes used in logistic regression, $\{\Pi_k(z) : W_k^T z + b_k\}_{k=1}^{K}$, gives us $u = \frac{1}{2}W_k$ and

$$r^2 = b_k + \frac{1}{4}\|W_k\|_2^2. \qquad (14)$$

Although there is no guarantee that $b_k + \frac{1}{4}\|W_k\|_2^2$ is always positive for an arbitrary logistic regression model, we can impose a constraint on $r^2$ to keep it zero during the optimization, which implies

$$b_k = -\frac{1}{4}\|W_k\|_2^2. \qquad (15)$$

In this way, the radii of all $K$ spheres become identical (all zero), so the PD collapses to a VD with centers $\tilde{c}_k = \frac{1}{2}W_k$. After the optimization of the logistic regression model, the centers $\{\frac{1}{2}W_k\}_{k=1}^{K}$ will be used for the CIVD integration. ∎
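As an empirical sanity check of Theorem 3.1 (and Eq. (15)), the following self-contained snippet verifies numerically that, under the bias constraint, the linear decision rule coincides with the VD rule with centers $\frac{1}{2}W_k$; the dimensions and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 64))                 # K = 5 classes, n = 64 features
b = -0.25 * (W ** 2).sum(axis=1)             # the constraint of Eq. (15)

for _ in range(1000):
    z = rng.normal(size=64)
    linear = np.argmax(W @ z + b)                          # argmax_k W_k^T z + b_k
    voronoi = np.argmin(((z - W / 2) ** 2).sum(axis=1))    # argmin_k ||z - W_k/2||^2
    assert linear == voronoi
```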
C DETAILS OF THE DEMONSTRATIVE EXAMPLE ON THE MULTIDIGITMNIST DATASET

The MultiDigitMNIST (Sun, 2019) dataset is created by concatenating two (or three) digits of different classes from MNIST for few-shot image classification. Here we use the DoubleMNIST dataset (i.e. two digits per image), consisting of 100 classes (00 to 99) with 1000 images of size 64×64×1 per class; the classes are further split into 64, 20, and 16 classes for training, testing, and validation, respectively. To better embed into the R² space, we pick a ten-class subset (00, 01, 12, 13, 04, 05, 06, 77, 08, and 09) as the base classes for meta-training, and another five-class subset (02, 49, 83, 17, and 36) for one episode. The feature extractor is a 4-layer convolutional network with an additional fully-connected layer for the 2-D embedding. In the left panel of Figure 1, the VD is obtained by setting the centroid of each base class as its Voronoi center. For each novel class, the Voronoi center is simply the 1-shot support sample (Figure 1, central panel). The surrogate representation is computed as the collection of distances from a support/query sample to each of the base classes, as shown in the right panel of Figure 1. Interestingly, the surrogate representations of a novel class, whether of the support sample (dotted line) or of a query sample (colored lines), generally follow a pattern that is alike within a class and distinct across classes, making them ideal surrogates for distinguishing between different novel classes. In this paper, we design a series of algorithms answering multiple questions regarding this surrogate representation: how to select the base classes for the computation of the surrogate representation, how to combine it with the feature representation, and how to integrate it into the overall ensemble workflow.

D MAIN DATASETS

For a fair and thorough comparison with previous works, three widely adopted benchmark datasets are used throughout this paper. (1) mini-ImageNet (Vinyals et al., 2016) is a shrunk subset of ILSVRC-12 (Russakovsky et al., 2015) and consists of 100 classes, of which 64 are for training, 20 for testing, and 16 for validation. Each class has 600 images of size 84×84×3. (2) CUB (Welinder et al., 2010) is another benchmark dataset for FSL, especially fine-grained FSL, including 200 species (classes) of birds. CUB is an unbalanced dataset with 58 images per class on average, also of size 84×84×3. We split all classes into 100 base classes, 50 novel classes, and 50 validation classes, following previous works (Chen et al., 2019a). (3) tiered-ImageNet (Ren et al., 2018) is another subset of ILSVRC-12 (Russakovsky et al., 2015) but has more images, 779,165 in total. All images are categorized into 351 base classes, 97 validation classes, and 160 novel classes. The number of images per class is not always the same, 1,281 on average. The image size is also 84×84×3.

E DEEPVORO--: INTEGRATING PARAMETRIC AND NONPARAMETRIC METHODS VIA CIVD

Table E.3: Cluster-induced Voronoi Diagram (CIVD) for the integration of the parametric logistic regression (LR) and the nonparametric nearest neighbor (i.e. Voronoi Diagram, VD) methods. The results of S2M2_R and DC are also included in this table but are excluded from the comparison. The best result is marked in bold.
D MAIN DATASETS

For a fair and thorough comparison with previous works, three widely adopted benchmark datasets are used throughout this paper. (1) mini-ImageNet (Vinyals et al., 2016) is a shrunk subset of ILSVRC-12 (Russakovsky et al., 2015), consisting of 100 classes, of which 64 are for training, 20 for testing, and 16 for validation. Each class has 600 images of size 84×84×3. (2) CUB (Welinder et al., 2010) is another benchmark dataset for FSL, especially fine-grained FSL, including 200 species (classes) of birds. CUB is an unbalanced dataset with 58 images per class on average, also of size 84×84×3. We split all classes into 100 base classes, 50 novel classes, and 50 validation classes, following previous works (Chen et al., 2019a). (3) tiered-ImageNet (Ren et al., 2018) is another subset of ILSVRC-12 (Russakovsky et al., 2015) but has more images, 779,165 in total. All images are categorized into 351 base classes, 97 validation classes, and 160 novel classes. The number of images per class is not always the same, 1,281 on average. The image size is also 84×84×3.

E DEEPVORO--: INTEGRATING PARAMETRIC AND NONPARAMETRIC METHODS VIA CIVD

Table E.3: Cluster-induced Voronoi Diagram (CIVD) for the integration of the parametric Logistic Regression (LR) and the nonparametric nearest neighbor (i.e. Voronoi Diagram, VD) methods. The results from S2M2 R and DC are also included in this table but excluded from the comparison. The best result is marked in bold.

Methods                mini-ImageNet              CUB                        tiered-ImageNet
                       1-shot       5-shot        1-shot       5-shot        1-shot       5-shot
S2M2 R                 64.65±0.45   83.20±0.30    80.14±0.45   90.99±0.23    68.12±0.52   86.71±0.34
DC                     67.79±0.45   83.69±0.31    79.93±0.46   90.77±0.24    74.24±0.50   88.38±0.31
Power-LR               65.45±0.44   84.47±0.29    79.66±0.44   91.62±0.22    73.57±0.48   89.07±0.29
Voronoi-LR             65.58±0.44   84.51±0.29    79.63±0.44   91.61±0.22    73.65±0.48   89.15±0.29
VD                     65.37±0.44   84.37±0.29    78.57±0.44   91.31±0.23    72.83±0.49   88.58±0.29
CIVD-based DeepVoro--:
VD + Power-LR          65.63±0.44   84.25±0.30    79.52±0.43   91.52±0.22    73.68±0.48   88.71±0.29
VD + Voronoi-LR        65.85±0.43   84.66±0.29    79.40±0.44   91.57±0.22    73.78±0.48   89.02±0.29

E.1 EXPERIMENTAL SETUP AND IMPLEMENTATION DETAILS

In this section, we first establish three few-shot classification models with different underlying geometric structures, two logistic regression (LR) models and one nearest-neighbor model: (1) the Power Diagram-based LR (Power-LR), (2) the Voronoi Diagram-based LR (Voronoi-LR), and (3) the Voronoi Diagram (VD). Then, the main purposes of our analysis are (1) to examine how the performance is affected by the proposed Voronoi reduction method of Sec. 3.2, and (2) to inspect whether VD can be integrated with the Power/Voronoi Diagram-based LRs.

The feature transformation used throughout this section is P_{w,b,λ} with w = 1.0, b = 0.0, λ = 0.5. Power-LR is trained directly on the transformed K-way N-shot support samples using the PyTorch library with an Adam optimizer, a batch size of 64, and a learning rate of 0.01. For Voronoi-LR, the vanilla LR is retrofitted as shown in Algorithm 1, in which the bias is given by Theorem 3.1 to ensure that the parameters induce a VD at each iteration. In our CIVD model of Definition 3.2, we use a cluster instead of a single prototype to stand for a novel class. Here this cluster contains two points, i.e. C_k = {c_k, c′_k}, in which c_k is obtained from the VD and c′_k is acquired from Power-LR or Voronoi-LR. The question we intend to answer is whether Power-LR or Voronoi-LR is the suitable model for this integration.

Algorithm 1: Voronoi Diagram-based Logistic Regression.
Data: support set S
Result: W
1:  Initialize W ← W(0)
2:  for epoch ← 1, ..., #epochs do
3:      b_k ← −¼||W_k||²₂, k = 1, ..., K       // apply Theorem 3.1
4:      Compute the loss L(W, b)                // forward propagation
5:      Update W                                // backward propagation

[Figure F.1 omitted: four t-SNE panels, (A) no transformation, (B) L2 normalization, (C) power transformation, (D) log transformation.]
Figure F.1: The t-SNE visualizations of (A) the original features, (B) L2 normalization, (C) Tukey's Ladder of Powers transformation with λ = 0.5, and (D) the compositional transformation with λ = 0, w = 1, b = 0.04, for 5 novel classes from the mini-ImageNet dataset.

E.2 RESULTS

The results are shown in Table E.3. Interestingly, when integrated with VD, Power-LR never attains the best result, suggesting that VD and LR are intrinsically different geometric models and cannot simply be integrated without additional effort. On the mini-ImageNet and tiered-ImageNet datasets, the best results are achieved by either Voronoi-LR or VD + Voronoi-LR, showing that CIVD coupled with the proposed Voronoi reduction can ideally integrate parametric and nonparametric few-shot models. Notably, on these two datasets, when Power-LR is reduced to Voronoi-LR, although the number of parameters is decreased (b is directly given by Theorem 3.1 and not involved in the optimization), the performance is always better; for example, it increases from 65.45% to 65.58% on 5-way 1-shot mini-ImageNet.
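For concreteness, the following PyTorch sketch shows one way to implement the Voronoi-LR training loop of Algorithm 1 under our reading of it; the bias is recomputed from Theorem 3.1 and kept out of the optimization, and the support features below are toy stand-ins:

```python
import torch
import torch.nn.functional as F

# A minimal PyTorch sketch of Algorithm 1 (Voronoi-LR), our reading of it:
# only W is an optimization variable, and the bias is recomputed from
# Theorem 3.1 at every iteration (and detached, since it is not optimized),
# so the classifier induces a Voronoi Diagram throughout training.
K, n = 5, 640                                   # ways, feature dimension
W = torch.randn(K, n, requires_grad=True)
optimizer = torch.optim.Adam([W], lr=0.01)

z = torch.randn(25, n)                          # toy transformed support features
y = torch.arange(K).repeat_interleave(5)        # 5 shots per class

for epoch in range(100):
    b = (-0.25 * (W ** 2).sum(dim=1)).detach()  # b_k from Theorem 3.1
    loss = F.cross_entropy(z @ W.t() + b, y)    # forward propagation
    optimizer.zero_grad()
    loss.backward()                             # backward propagation
    optimizer.step()                            # update W only

centers = 0.5 * W.detach()                      # Voronoi centers c_k = W_k / 2
```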
On the CUB dataset, the results of the different models are similar, probably because CUB is a fine-grained dataset and all classes are similar to each other (all birds).

F DEEPVORO: IMPROVING FSL VIA HIERARCHICAL HETEROGENEITIES

F.1 EXPERIMENTAL SETUP AND IMPLEMENTATION DETAILS

In this section we describe the feature-level and transformation-level heterogeneities that are used for the ensemble in order to improve FSL. See the next section for the geometry-level heterogeneity.

Feature-level heterogeneity. For the sake of reproducibility, we only employ deterministic data augmentation on the images, with no randomness involved. Specifically, three kinds of data augmentation techniques are used. (1) Rotation is an important augmentation method widely used in self-supervised learning (Mangla et al., 2020); rotating the original images by 0°, 90°, 180°, and 270° gives four ways of augmentation. (2) After rotation, we can flip the images horizontally, giving two additional choices for each rotation degree. (3) Central cropping after scaling can alter the resolution and focus area of the image; scaling the original images to (84+B)×(84+B), with B increasing from 0 to 70 in steps of 10, gives eight ways of augmentation. Finally, the different combinations of the three types result in 64 kinds of augmentation (i.e. |{T}| = 64).

Transformation-level heterogeneity. In our compositional transformation, the function (h_λ ∘ g_{w,b} ∘ f)(z) is parameterized by w, b, and λ. Since g is applied after the L2 normalization f, the vector entering g is always a unit vector, so we simply set w = 1. For the combinations of λ and b, we test different values with either λ = 0 or λ ≠ 0 on the held-out validation set (as shown in Figures 2 and K.12) and pick the top-8 combinations with the best performance on the validation set.

Ensemble schemes. Now, in our configuration pool {T} × {P_{w,b,λ}}, there are 512 possible configurations {ρ^(i)}_{i=1}^{512}. Each ρ is applied to both the testing and the validation sets. With this large pool of ensemble candidates, how and whether to select a subset {ρ^(i)}_{i=1}^{L′} ⊂ {ρ^(i)}_{i=1}^{512} is still a nontrivial problem. Here we explore three different schemes. (1) Full (vanilla) ensemble: all candidates in {ρ^(i)}_{i=1}^{512} are taken into consideration and plugged into Definition 3.5 to build the CIVD for the space partition. (2) Random ensemble: a randomly selected subset of size L′ < L is used for the ensemble. (3) Guided ensemble: we expect that the performance of {ρ^(i)}_{i=1}^{512} on the validation set can guide the selection of {ρ^(i)}_{i=1}^{L′} for the testing set, provided that the testing and validation sets correlate well. Specifically, we rank the configurations by their performance on the validation set and add them sequentially into {ρ^(i)}_{i=1}^{L′} until a maximum ensemble performance is reached on the validation set; this configuration set is then used for the final ensemble (a sketch of this selection is given at the end of this subsection).

Since VD is nonparametric and fast, we adopt VD as the building block and only use VD for each ρ in the remainder of the paper. The α value in the influence function (Definition 3.3) is set to 1 throughout the paper for simplicity of computation. For a fair comparison, we downloaded the trained models¹ used by Mangla et al. (2020) and Yang et al. (2021).

¹ Downloaded from https://github.com/nupurkmr9/S2M2_fewshot
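As an illustration of the guided ensemble scheme, the sketch below grows the configuration subset in the order of single-configuration validation accuracy and stops at the validation peak; `ensemble_val_acc` is a hypothetical evaluator, not part of the released code:

```python
import numpy as np

# A sketch of the guided ensemble scheme (our illustration). val_acc[i] is
# the validation accuracy of configuration rho_i alone; ensemble_val_acc is
# a hypothetical evaluator returning the validation accuracy of the ensemble
# built from a given subset of configurations.
def guided_select(val_acc: np.ndarray, ensemble_val_acc) -> np.ndarray:
    order = np.argsort(val_acc)[::-1]        # rank configurations, best first
    # validation accuracy of every prefix ensemble {rho_(1)}, {rho_(1),(2)}, ...
    prefix_acc = [ensemble_val_acc(order[:m + 1]) for m in range(len(order))]
    best_m = int(np.argmax(prefix_acc))      # ensemble size at the validation peak
    return order[:best_m + 1]                # subset reused on the testing set
```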
The performance of FSL algorithms is typically evaluated over a sequence of independent episodes, so the data split and the random seed used to select the novel classes and the support/query sets in each episode all lead to different results. To ensure the fairness of our evaluation, DC (Yang et al., 2021) and S2M2 R (Mangla et al., 2020) are reevaluated with the same data split and random seed as DeepVoro. The results are obtained by running 2,000 episodes, and the average accuracy as well as the 95% confidence interval is reported.

F.2 RESULTS

Table F.4: Ablation study of DeepVoro's performance with different levels of ensemble. The numbers of ensemble members are given in parentheses.

Methods                  Feature-level  Transformation-level  mini-ImageNet             CUB                       tiered-ImageNet
                                                              1-shot       5-shot       1-shot       5-shot       1-shot       5-shot
No Ensemble                                                   65.37±0.44   84.37±0.29   78.57±0.44   91.31±0.23   72.83±0.49   88.58±0.29
Vanilla Ensemble (8)                    ✓                     66.45±0.44   84.55±0.29   80.98±0.44   91.47±0.22   74.02±0.49   88.90±0.29
Vanilla Ensemble (64)    ✓                                    67.88±0.45   86.39±0.29   77.30±0.43   91.26±0.23   73.74±0.49   88.67±0.29
Vanilla Ensemble (512)   ✓              ✓                     69.23±0.45   86.70±0.28   79.90±0.43   91.70±0.22   74.51±0.48   89.11±0.29
Random Ensemble (512)    ✓              ✓                     69.30±0.45   86.74±0.28   80.40±0.43   91.94±0.22   74.64±0.48   89.15±0.29
Guided Ensemble (512)    ✓              ✓                     69.48±0.45   86.75±0.28   82.99±0.43   92.62±0.22   74.98±0.48   89.40±0.29

Our proposed compositional transformation enlarges the expressivity of the transformation function. When Tukey's ladder of powers transformation is used alone, as reported in Yang et al. (2021), the optimal λ is not 0; but if an additional linear transformation g is inserted between f and h, λ = 0 coupled with a proper b can give an even better result, as shown in Figures 2 and K.12 (a sketch of the transformation follows below). Importantly, Figure 2 shows that a combination of λ and b with good performance on the validation set also produces a satisfactory result on the testing set, suggesting that it is possible to optimize these hyperparameters on the validation set and generalize well to the testing set.

In terms of the polymorphism induced by the various transformations in the feature space, Figure F.1 exhibits the t-SNE visualizations of the original features and of the features after three different kinds of transformations, showing that the relative positions of the novel classes change considerably, especially after the compositional transformation (panel D). Besides commonly used data augmentation, this transformation thus provides another level of diversity that may benefit the subsequent ensemble.
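The compositional transformation itself is simple to write down; the sketch below follows the description (h_λ ∘ g_{w,b} ∘ f) given in this section, though the exact conventions of the released code may differ:

```python
import torch

# A sketch of the compositional transformation (h_lambda o g_{w,b} o f),
# following the description in this section. Features are assumed to be
# non-negative (e.g. post-ReLU), so the power and log transformations are
# well defined after g with b >= 0.
def transform(z: torch.Tensor, w: float = 1.0, b: float = 0.0,
              lam: float = 0.5) -> torch.Tensor:
    z = z / z.norm(dim=-1, keepdim=True)   # f: L2 normalization
    z = w * z + b                          # g_{w,b}: linear transformation
    if lam == 0.0:
        return torch.log(z)                # h_0: log transformation
    return torch.pow(z, lam)               # h_lambda: Tukey's ladder of powers

feats = torch.rand(4, 640) + 1e-6          # toy non-negative features
out = transform(feats, w=1.0, b=0.04, lam=0.0)   # the setting of Figure F.1 (D)
```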
On all three datasets, the guided ensemble scheme always achieves the best result for both single-shot and multi-shot cases, showing that the validation set can indeed be used for the guidance of subset selection and our method is robust cross classes in the same domain. When there is no such validation set available, the full ensemble and random ensemble schemes can also give comparable result. To inspect how performance changes with different number of ensemble members, we exhibit the distribution of accuracy at three ensemble levels for mini-Image Net in Figure F.2 and F.3 , for CUB in Figure F.4 and F.5, and for tiered-Image Net in Figure F.6 and F.7. Figure (b) in each of them also exhibits the correlation between the testing and validation sets for all 512 configurations. Clearly, better result is often reached when there are more configurations for the ensemble, validating the efficacy of our method for improving the performance and robustness for better FSL. Algorithm 2: VD with Surrogate Representation for Episode T . Data: Base classes D, Support Set S = {(xi, yi)}K N i=1 , yi CT , query sample x Result: d 1 D (Pw,b,λ φ T)(D) ; Extract and transform feature 2 S (Pw,b,λ φ T)(S); 3 z (Pw,b,λ φ T)(x); 4 for t 1, ..., |Cbase|; Compute prototypes of base classes 6 c t 1 |{(z ,y)|z D ,y=t}| P 8 for k 1, ..., K; Compute prototypes from support samples z S ,y=k z ; 11 dk d(z, ck) 13 Csurrogate ; 14 for k 1, ..., K; Find surrogate classes 16 Csurrogate Csurrogate S Top-R t {1,...,|Cbase|} d(ck, c t) 18 R |Csurrogate|; 19 d (d(z, c 1), ..., d(z, c R)) ; Compute surrogate representation for query sample 20 for k 1, ..., K; Compute surrogate representations for support samples 22 d k (d(ck, c 1), ..., d(ck, c R)); 23 d k d(d , d k) 25 d β d ||d||1 + γ d ||d ||1 ; Compute final criterion 26 return d Published as a conference paper at ICLR 2022 1 2 3 4 5 6 7 8 Number of Ensemble Members mini-Image Net transformation-level Ensemble Full Ensemble (a) Transformation-level Ensemble 82 83 84 85 86 87 Testing Set Accuracy Validation Set Accuracy Transformation-level Feature-level Dirichlet Tessellation Ensemble DC S2M2-R mini-Image Net 5-way 5-shot Ensemble Ensemble Deep Voro (b) Testing/Validation Sets Correlation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 Number of Ensemble Members mini-Image Net feature-level Ensemble Full Ensemble (c) Feature-level Ensemble on 5-way 5-shot mini-Image Net Dataset 0 100 200 300 400 500 Number of Ensemble Members mini-Image Net Dirichlet Tessellation Ensemble Random Ensemble Guided Ensemble Full Ensemble (d) Deep Voro on 5-way 5-shot mini-Image Net Dataset Figure F.2: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool. 
[Figure F.2 omitted: four panels on 5-way 5-shot mini-ImageNet, (a) transformation-level ensemble, (b) testing/validation set correlation (including DC, S2M2 R, and DeepVoro), (c) feature-level ensemble, (d) DeepVoro with full/random/guided ensemble; the x-axes are the number of ensemble members.]
Figure F.2: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.

[Figure F.3 omitted: the same four panels on 5-way 1-shot mini-ImageNet.]
Figure F.3: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.

[Figure F.4 omitted: the same four panels on 5-way 5-shot CUB.]
Figure F.4: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.

[Figure F.5 omitted: the same four panels on 5-way 1-shot CUB.]
Figure F.5: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.
[Figure F.6 omitted: the same four panels on 5-way 5-shot tiered-ImageNet.]
Figure F.6: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.

[Figure F.7 omitted: the same four panels on 5-way 1-shot tiered-ImageNet.]
Figure F.7: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.

[Figure G.8 omitted: accuracy of VD as the number of shots grows on mini-ImageNet; annotated points: 15 shots: 88.91%, 20: 89.75%, 40: 90.63%, 100: 91.18%, 200: 91.22%, 400: 91.55%.]
Figure G.8: The accuracy of VD with an increasing number of shots on the mini-ImageNet dataset.

G DEEPVORO++: FURTHER IMPROVEMENT OF FSL VIA SURROGATE REPRESENTATION

G.1 EXPERIMENTAL SETUP AND IMPLEMENTATION DETAILS

In this section, we introduce another layer of heterogeneity, namely the geometry-level heterogeneity, which exists in our surrogate representation. In Definition 3.4, increasing R enlarges the degree of locality when searching for the top-R surrogate classes. In equation (8), if we set γ = 1, then increasing β makes the model rely more on the feature representation and less on the surrogate representation. In order to balance R and β, we perform a grid search over combinations of R and β on the validation set, as shown in Figures K.13, K.14, and K.15. For each R, we select the β that gives the best result on the validation set and use this (R, β) pair on the testing set, resulting in 10 such pairs in total. There are therefore 10 models in the geometry-level heterogeneity, standing for different degrees of locality.
In conjunction with the feature-level (64 kinds of augmentation) and transformation-level (here only the top-2 best transformations are used) heterogeneities, there are now 1280 different configurations in our configuration pool to be used by the CCVD model. In total, this amounts to 512 + 1280 = 1792 configurations for a few-shot episode. Generating roughly 1,800 ensemble candidates is nearly intractable for parametric methods such as logistic regression or the cosine classifier, which may take, e.g., months for thousands of episodes. The VD model, however, is nonparametric and highly efficient, making it empirically feasible to collect all the combinations and integrate them via CCVD (see the sketch at the end of this section). The complete algorithm for the computation of the surrogate representation is shown in Algorithm 2.

G.2 RESULTS

The heatmaps for different (R, β) pairs on the testing/validation sets are shown in Figure K.13 for mini-ImageNet, Figure K.14 for CUB, and Figure K.15 for tiered-ImageNet. The testing and validation sets basically follow the same pattern: when R is small, i.e. only a small number of base classes are used as surrogates, a higher weight should be placed on the feature representation; and with β fixed, increasing R beyond a certain threshold can cause a drop in accuracy, probably because the meaningful similarities become overwhelmed by signals from the large volume of irrelevant base classes.

Table G.5: Ablation study of DeepVoro++'s performance with different levels of ensemble (all results are 5-way 1-shot). The numbers of ensemble members are given in parentheses.

Methods                   Feature-level  Transformation-level  Geometry-level  mini-ImageNet  CUB          tiered-ImageNet
No Ensemble                                                                    65.37±0.44     78.57±0.44   72.83±0.49
Vanilla Ensemble (20)                    ✓                     ✓               68.38±0.46     80.70±0.45   74.48±0.50
Vanilla Ensemble (64)     ✓                                                    70.95±0.46     81.04±0.44   74.87±0.49
Vanilla Ensemble (1280)   ✓              ✓                     ✓               71.24±0.46     81.18±0.44   74.75±0.49
Random Ensemble (1280)    ✓              ✓                     ✓               71.34±0.46     81.98±0.43   75.07±0.48
Guided Ensemble (1280)    ✓              ✓                     ✓               71.30±0.46     82.95±0.43   75.38±0.48

As shown in Tables 3 and G.5, DeepVoro++ further improves upon DeepVoro for 5-way 1-shot FSL, by 1.82% on mini-ImageNet and 0.4% on tiered-ImageNet, and is comparable with DeepVoro on the CUB dataset (82.95% vs. 82.99%). Notably, for 5-shot FSL, DeepVoro++ usually causes a drop in accuracy relative to DeepVoro. To inspect the underlying reason for this behavior, we apply VD to 5-way K-shot FSL with K increasing from 1 to 400 and report the average accuracy in Figure G.8. It can be observed that in extreme low-shot learning, i.e. K ∈ [1, 5], simply adding one shot makes a more prominent contribution to the accuracy, suggesting that the centers obtained from 5-shot samples are already much better than those obtained from a single sample; there is thus no necessity to resort to the surrogate representation for multi-shot FSL, and we only adopt DeepVoro for 5-shot episodes in the remainder of this paper.

The ablation study of DeepVoro++ with different levels of ensemble is shown in Table G.5 and Figures G.9, G.10, and G.11. All three layers of heterogeneity contribute collaboratively to the final result, and the fully-fledged DeepVoro++ establishes new state-of-the-art performance on all three datasets for 1-shot FSL.
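As a sketch of how such a large pool can be integrated cheaply with VD building blocks and α = 1, the following shows a simplified reading of the CCVD aggregation (our illustration, not the exact form of Definition 3.5): every configuration contributes the distances from its transformed query to its K class centers, and the influence summed over all L configurations decides the label.

```python
import numpy as np

# A simplified reading of the CCVD-style ensemble with VD building blocks
# and alpha = 1 (our illustration): sum the negative query-to-center
# distances over all configurations and take the most influential class.
def ccvd_predict(center_sets: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """center_sets: (L, K, dim) class centers per configuration;
    queries: (L, Q, dim) query features per configuration."""
    d = np.linalg.norm(
        queries[:, :, None, :] - center_sets[:, None, :, :], axis=-1
    )                                   # (L, Q, K) distances
    influence = -d.sum(axis=0)          # summed influence over L, alpha = 1
    return influence.argmax(axis=-1)    # (Q,) predicted classes
```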
[Figure G.9 omitted: four panels on 5-way 1-shot mini-ImageNet with the surrogate representation, (a) transformation/geometry-level ensemble, (b) testing/validation set correlation, (c) feature-level ensemble, (d) DeepVoro++ with full/random/guided ensemble; the x-axes are the number of ensemble members.]
Figure G.9: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.

[Figure G.10 omitted: the same four panels on 5-way 1-shot CUB.]
Figure G.10: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.
[Figure G.11 omitted: the same four panels on 5-way 1-shot tiered-ImageNet.]
Figure G.11: Three levels of ensemble and the correlation between testing and validation sets with different configurations in the configuration pool.

H EXPERIMENTS WITH DIFFERENT BACKBONES

H.1 IMPLEMENTATION DETAILS

In order to test the robustness of DeepVoro/DeepVoro++ across deep learning architectures, we downloaded the trained models² used by Wang et al. (2019). We evaluated DC, S2M2 R, DeepVoro, and DeepVoro++ using the same random seed. The results are obtained by running 500 episodes, and the average accuracy as well as the 95% confidence interval is reported.

² Downloaded from https://github.com/mileyan/simple_shot

H.2 EXPERIMENTAL RESULTS

Table H.6: Comparison of FSL algorithms with different network architectures. WRN-28-10 was trained with the rotation loss and the MixUp loss (Mangla et al., 2020); WRN-28-10* was trained with the ordinary softmax loss.

Methods      WRN-28-10                  WRN-28-10*
             1-shot       5-shot        1-shot       5-shot
DC           67.79±0.45   83.69±0.31    62.09±0.95   78.47±0.67
S2M2 R       64.65±0.45   83.20±0.30    61.11±0.92   79.83±0.64
DeepVoro     69.48±0.45   86.75±0.28    62.26±0.94   82.02±0.63
DeepVoro++   71.30±0.46   --            65.01±0.98   --

             DenseNet-121               ResNet-34
             1-shot       5-shot        1-shot       5-shot
DC           62.68±0.96   79.96±0.60    59.10±0.90   74.95±0.67
S2M2 R       60.33±0.92   80.33±0.62    58.92±0.92   77.99±0.64
DeepVoro     60.66±0.91   82.25±0.59    61.61±0.92   81.81±0.60
DeepVoro++   65.18±0.95   --            64.65±0.96   --

             ResNet-18                  ResNet-10
             1-shot       5-shot        1-shot       5-shot
DC           60.20±0.96   75.59±0.69    59.01±0.92   74.27±0.69
S2M2 R       59.57±0.93   78.69±0.69    57.59±0.92   77.10±0.67
DeepVoro     61.50±0.93   81.58±0.64    58.34±0.93   79.05±0.63
DeepVoro++   64.79±0.97   --            61.75±0.95   --

             MobileNet                  Conv-4
             1-shot       5-shot        1-shot       5-shot
DC           59.41±0.91   76.07±0.66    49.32±0.87   62.89±0.71
S2M2 R       58.36±0.93   76.75±0.68    45.19±0.87   64.56±0.74
DeepVoro     60.91±0.93   80.14±0.65    48.47±0.86   65.86±0.73
DeepVoro++   63.37±0.95   --            52.15±0.98   --

On wide residual networks (WRN-28-10) (Zagoruyko & Komodakis, 2016), residual networks (ResNet-10/18/34) (He et al., 2016), densely connected networks (DenseNet-121) (Huang et al., 2017), and MobileNet (Howard et al., 2017), DeepVoro/DeepVoro++ shows a consistent improvement upon DC and S2M2 R. Excluding DeepVoro/DeepVoro++, no single method is always better for both 5-shot and 1-shot FSL: generally, DC excels at 1-shot while S2M2 R favors 5-shot. Following Table 3, we do not apply DeepVoro++ to 5-shot FSL, since DeepVoro usually outperforms DeepVoro++ when more shots are available.
I EXPERIMENTS WITH DIFFERENT TRAINING PROCEDURES

I.1 EXPERIMENTAL SETUP AND IMPLEMENTATION DETAILS

Our geometric space-partition model is built on top of a pretrained feature extractor, and the quality of the feature extractor significantly affects the downstream FSL (Mangla et al., 2020). Here we use another two feature extractors trained with different schemes. (1) Manifold Mixup training employs an additional Mixup loss that interpolates the data and the labels simultaneously and can help a deep neural network generalize better. (2) Rotation loss is widely used, especially in self-supervised learning, where the network learns to predict the degree by which an image is rotated. We downloaded the corresponding pretrained models used by Mangla et al. (2020) and Yang et al. (2021) and evaluate the four methods over 500 episodes.

I.2 RESULTS

Table I.7: Comparison of performance with different meta-training procedures.

Methods      Self-supervision w/ Rotation Loss    Manifold Mixup
             1-shot        5-shot                 1-shot        5-shot
DC           66.43±0.86    82.61±0.62             62.61±0.90    78.62±0.68
S2M2 R       58.33±0.96    79.26±0.66             48.11±0.96    72.74±0.74
DeepVoro     68.80±0.86    85.70±0.58             65.00±0.93    83.19±0.65
DeepVoro++   69.23±0.89    --                     65.25±0.93    --

As shown in Table I.7, DeepVoro/DeepVoro++ achieves the best results under both the rotation loss and the Mixup loss. Interestingly, there is a substantial gap between the two training schemes when they are used out-of-the-box for downstream FSL (Δaccuracy = 10.22% for 1-shot and 6.52% for 5-shot), but after DeepVoro/DeepVoro++ this gap narrows (Δaccuracy = 3.98% for 1-shot and 2.51% for 5-shot), suggesting the strength of DeepVoro/DeepVoro++ in making the most of the pretrained models.

J CROSS-DOMAIN FEW-SHOT LEARNING

J.1 EXPERIMENTAL SETUP AND IMPLEMENTATION DETAILS

Cross-domain FSL, in which the base classes and the novel classes come from essentially distinct domains, is more challenging than standard FSL. To examine the ability of our method for cross-domain FSL, we apply the feature extractor trained on mini-ImageNet (CUB) to the few-shot data of CUB (mini-ImageNet) for coarse-to-fine (fine-to-coarse) domain shifting.

J.2 RESULTS

Table J.8: Comparison of performance on cross-domain FSL.

Methods      CUB → mini-ImageNet        mini-ImageNet → CUB
             1-shot       5-shot        1-shot       5-shot
DC           46.25±0.93   62.99±0.81    54.64±0.87   72.83±0.71
S2M2 R       41.15±0.84   58.09±0.79    49.01±0.88   69.99±0.71
DeepVoro     46.15±0.90   64.60±0.80    49.03±0.87   72.30±0.74
DeepVoro++   47.83±0.97   --            54.88±0.92   --

Overall, DeepVoro/DeepVoro++ is more stable than the other two methods under a shifting domain, especially for fine-to-coarse FSL (CUB to mini-ImageNet), with improvements of 6.68% for 1-shot and 6.51% for 5-shot over S2M2 R, and it is comparable with DC for coarse-to-fine FSL (mini-ImageNet to CUB).

K ADDITIONAL FIGURES

[Figure K.12 omitted: two panels plotting validation and testing accuracy against b (with λ = 0) on 5-way 5-shot and 5-way 1-shot tiered-ImageNet.]
Figure K.12: The 5-way few-shot accuracy of VD with different λ and b values on the tiered-ImageNet dataset.
[Figure K.13 omitted: heatmaps of 5-way 1-shot accuracy over a grid of β (0.5 to 7.0) and R (1 to 10) on the mini-ImageNet testing (left) and validation (right) sets.]
Figure K.13: The 5-way 1-shot accuracy with different β and R values on the mini-ImageNet testing (left) and validation (right) datasets.

[Figure K.14 omitted: heatmaps of 5-way 1-shot accuracy over a grid of β (0.5 to 7.0) and R (1 to 10) on the CUB testing (left) and validation (right) sets.]
Figure K.14: The 5-way 1-shot accuracy with different β and R values on the CUB testing (left) and validation (right) datasets.
[Figure K.15 omitted: heatmaps of 5-way 1-shot accuracy over a grid of β (0.5 to 7.0) and R (1 to 10) on the tiered-ImageNet testing (left) and validation (right) sets.]
Figure K.15: The 5-way 1-shot accuracy with different β and R values on the tiered-ImageNet testing (left) and validation (right) datasets.

[Figure K.16 omitted: episode accuracy as a function of the geometric variance of the support set, for S2M2 R, DC, and DeepVoro on 5-way 5-shot mini-ImageNet.]
Figure K.16: Outlier Analysis.

In order to investigate the resistance of the various methods to outliers, we define the Geometric Variance (GV) as a reflection of the possibility that a support set contains an outlier, owing to the difficulty of inferring an out-of-distribution sample from merely 1 or 5 samples. Formally, for a support set S = {(z_i, y_i)}_{i=1}^{K·N}, its Geometric Variance is defined as

    GV(S) = (1/K) Σ_{k=1}^{K} [1/C(N,2)] Σ_{1≤i<j≤N} ||z_i^(k) − z_j^(k)||₂,

where z_i^(k) denotes the i-th support sample of class k and C(N,2) is the number of pairs, measuring the average point-to-point distance within each class of the support set. The larger GV is, the higher the probability that S contains an outlier. For each of 2,000 episodes of 5-way 5-shot mini-ImageNet, GV is computed along with the episode accuracy. As shown in Figure K.16, a very high GV causes a significant decrease in episode accuracy, but our method DeepVoro is more resistant to the presence of outliers.
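A short sketch of the GV computation (ours, with toy inputs):

```python
import numpy as np
from itertools import combinations

# A sketch of the GV computation defined above: the mean pairwise distance
# within each class of the support set, averaged over the K classes. The
# random support features below are toy stand-ins.
def geometric_variance(support: np.ndarray) -> float:
    """support: (K, N, dim) support features, N >= 2 shots per class."""
    K = support.shape[0]
    per_class = [
        np.mean([np.linalg.norm(zi - zj)
                 for zi, zj in combinations(support[k], 2)])
        for k in range(K)
    ]
    return float(np.mean(per_class))

gv = geometric_variance(np.random.default_rng(0).normal(size=(5, 5, 64)))
print(round(gv, 3))
```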