# Zero Shot Learning with the Isoperimetric Loss

Shay Deutsch,¹ Andrea Bertozzi,¹ Stefano Soatto²
¹Department of Mathematics, ²Department of Computer Science
University of California, Los Angeles
{shaydeu, bertozzi}@math.ucla.edu, soatto@cs.ucla.edu

**Abstract.** We introduce the isoperimetric loss as a regularization criterion for learning the map from a visual representation to a semantic embedding, to be used to transfer knowledge to unknown classes in a zero-shot learning setting. We use a pre-trained deep neural network model as a visual representation of image data, a Word2Vec embedding of class labels, and linear maps between the visual and semantic embedding spaces. However, the spaces themselves are not linear, and we postulate the sample embedding to be populated by noisy samples near otherwise smooth manifolds. We exploit the graph structure defined by the sample points to regularize the estimates of the manifolds by inferring the graph connectivity, using a generalization of the isoperimetric inequalities from Riemannian geometry to graphs. Surprisingly, this regularization alone, paired with the simplest baseline model, outperforms the state of the art among fully automated methods on zero-shot learning benchmarks such as AwA and CUB. This improvement is achieved solely by learning the structure of the underlying spaces by imposing regularity.

## Introduction

**Motivating example.** A pottopod is a pot with limbs. Not even a single example image of a pottopod is needed to find one in Fig. 1. However, one has surely seen plenty of examples of animals with limbs, as well as pots. In zero-shot learning (ZSL) one aims to exploit models trained with supervision, together with maps to some kind of attribute or semantic space, to recognize objects as belonging to classes for which no examples have ever been seen. The ingredients of a ZSL method are illustrated in Fig. 2. Seen samples $X_s$ and their corresponding labels $Y_s$ are used to train a model $\phi$, typically a deep neural network (DNN), in a supervised fashion, to yield a vector in a high-dimensional (visual embedding) space $Z$. At the same time, a function $s$ maps semantic attributes such as *has legs*, *is short*, or simply a word embedding of the ground-truth (seen and unseen) labels of interest $Y = \{Y_s, Y_u\}$, to some metric space. The name of the game in ZSL is to learn a map $\psi$, possibly along with other components of the diagram, from the visual embedding space $Z$ to the semantic space $S$, that can serve to transfer knowledge from the seen labels (reflected in $Z$) to the unseen ones. Alternatively, one can try to cluster samples with unseen labels in the visual embedding space $Z$, and then associate clusters to unseen labels.

Figure 1: Find the Pottopod (P. Bruegel, 1562; courtesy of P. Perona).

**Focus on regularization.** Transfer of knowledge hinges on some kind of regularity of the maps involved in ZSL. In practice, the visual embedding space $Z$ and the semantic embedding $S$ are only known through discrete samples, and the maps are learned restricted to these finite samples. One crucial theme in ZSL is, interpreting sample embeddings as noisy points on otherwise differentiable manifolds $Z$ and $S$, to attempt to regularize the spaces $Z$, $S$, and/or the map between them.
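The graph structure over the discrete samples can be made concrete with a nearest-neighbor affinity graph. Below is a minimal numpy sketch of this generic construction (our illustration, not the paper's code; the choice of `k` and the Gaussian bandwidth `sigma` are assumed hyperparameters, and the paper's own affinities, described later, are inner products between embeddings):

```python
import numpy as np

def knn_affinity_graph(Z: np.ndarray, k: int = 10, sigma: float = 1.0):
    """Symmetric k-nearest-neighbor affinity matrix over the rows of Z,
    treated as noisy samples near an otherwise smooth manifold."""
    n = Z.shape[0]
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # squared distances
    np.fill_diagonal(D2, np.inf)                     # exclude self-loops
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[:k]                 # k nearest neighbors of i
        W[i, nbrs] = np.exp(-D2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                        # symmetrize
```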
**Key contribution.** Of all the various components of a ZSL method, we choose the simplest possible (Romera-Paredes and Torr 2015), except for the regularization of the semantic map. There, we introduce a sophisticated model, based on an extension of the isoperimetric inequalities from Riemannian geometry to discrete structures (graphs). We treat sample visual embeddings as vertices in a graph, with affinities as edges, and interpret the visual-to-semantic map $\psi$ as a linear function on the graph. We then introduce the isoperimetric loss (IPL) to enforce regularity on the domain $Z$ based on the flow of the function defined on it through the boundary of sets of a given volume. The resulting regularized graph is informed by both the visual and semantic maps. We use it to perform clustering and map clusters to labels. Therefore, we take a very simple visual-to-semantic embedding function, namely a linear transformation, and indirectly regularize it by regularizing its domain and range spaces.

Figure 2: Illustration of the components of a ZSL algorithm: For images in the set with seen labels $X_s$, the labels can be estimated by maximum a-posteriori over labels in the seen set $Y_s$, $\arg\max_{y \in Y_s} \phi(x)_y$, on the visual representation $\phi(X_s)$. For the unseen labels, there is no direct connection to the data because they are not seen during training. Inference is then indirect: A visual representation is inferred, and from there a semantic representation, which is compared to the semantic representation of unseen labels, minimizing some distance in the semantic space over all possible unseen labels, $\arg\min_{y \in Y_u} d(s(y), \psi \circ \phi(x))$.

We expected our regularization to improve the baseline (Romera-Paredes and Torr 2015) on which our ZSL model is based. We did not expect it to surpass the (far more sophisticated) state of the art on the two most common benchmark datasets used in ZSL, namely AwA (Lampert, Nickisch, and Harmeling 2009) and CUB (Welinder et al. 2010). Yet, the experiments indicate so. In some cases, it even outperformed methods that use human annotation for the unseen labels.

At heart, we solve a topology estimation problem. We determine the connectivity between nodes of the visual embedding graph, which defines a topology in that space informed by the semantic representation of seen attributes. Much of the literature in this area focuses on what kind of graph signal (embedding, or descriptor) to attribute to the nodes, whereas the connectivity of the graph is decided a priori. We focus on the complementary problem, which is to determine the graph connectivity and learn the graph weights. Unlike other approaches, the connectivity in our method is informed both by the values of the visual descriptors at the vertices and the values of the semantic descriptors in the range space. Our framework allows us to use an automated semantic representation to perform ZSL, resulting in a framework which is entirely free of human annotation.

## Preliminaries

We first describe a general formalization of ZSL that helps place our contribution in context. Every ZSL method includes a supervised component, which results in a visual embedding; a collection of unseen labels or attributes; a map from these attributes to a vector (semantic) space; and a map from the visual to the semantic space. It is important to understand the assumptions underlying the transfer of information from seen to unseen attributes, which translate into regularity assumptions on the visual-to-semantic map.
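To make the two inference paths of Fig. 2 concrete, here is a minimal Python sketch (hypothetical helper names; `V`, `S_unseen`, and the label lists are assumed inputs, not the paper's code):

```python
import numpy as np

def predict_seen(phi_x: np.ndarray, seen_labels: list):
    # Direct path of Fig. 2: phi_x holds class scores (log-posteriors)
    # over the seen labels; pick the MAP estimate.
    return seen_labels[int(np.argmax(phi_x))]

def predict_unseen(z: np.ndarray, V: np.ndarray,
                   S_unseen: np.ndarray, unseen_labels: list):
    # Indirect path of Fig. 2: map the visual embedding z into the
    # semantic space with the linear map V, then return the unseen
    # label whose semantic embedding (a row of S_unseen) is nearest.
    sigma = V @ z
    dists = np.linalg.norm(S_unseen - sigma, axis=1)
    return unseen_labels[int(np.argmin(dists))]
```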
**Supervised component.** In standard supervised classification, a dataset $D_s$ is given where both the input data $x$ and the output labels $y_s$ are seen:

$$D_s = \{x_i, y_i^s\}_{i=1}^N \quad (1)$$

where the set of seen labels, for instance 1 = cat and 2 = dog, is indicated by $Y_s$, with cardinality $|Y_s| = n_s$. The data belong to $X$, for instance the set of natural images. The goal of supervised learning is to infer the parameters $w$ of a function $\phi_w : X \to \mathbb{R}_+^K$ that approximates the (log-)posterior probability over $Y_s$,

$$\phi_w(x)_j \propto \log P(y_s = j \mid x), \quad (2)$$

where the subscript $j$ denotes the $j$-th component of the vector $\phi_w(x)$. At test time, given an unseen image $x$, one can infer the unknown label $\hat{y}$ associated with it as the maximum a-posteriori estimate

$$\hat{y}(x) = \arg\max_{y \in Y_s} \phi_w(x)_y. \quad (3)$$

We indicate with $Z$ the (latent, or representation) space where the data $X$ are mapped,

$$z_i = \phi_w(x_i) \in Z. \quad (4)$$

**Visual embedding.** Although $z_i$ can be interpreted as log-probabilities, one can simply consider them as elements of a vector space of dimension at least $K$, $Z \subseteq \mathbb{R}^K$, called the visual embedding. It is also customary to use intermediate layers of a deep network, rather than the last one that is used for classification, as a visual embedding, so in general $K \neq n_s$. A general classifier, rather than a linear one, can then be used to determine the class, based on the latent representation $z$. We want the formalism to be flexible, so we do not constrain the dimension of the embedding to be the same as the number of seen classes, and we consider $z = \phi_w(x)$ to be any (pre-trained) visual feature.

**Unseen labels.** In ZSL there is a second set of unseen labels¹ $Y_u$, disjoint from the first: $Y_u \cap Y_s = \emptyset$. We call $Y = Y_u \cup Y_s$. At training time we do not have any sample images with labels in $Y_u$. However, we do at test time.

**Zero-shot learning.** The goal of ZSL is to classify test images as belonging to the unseen classes, that is, to learn a map from $X$ to $Y_u$. Absent any assumption on how the unseen labels are related to the seen labels, ZSL makes little sense.

**Assumptions.** In ZSL one assumes there is a semantic metric vector space $S \subseteq \mathbb{R}^M$, to which all labels, seen and unseen, can be mapped via a function $s : Y \to S$. If the metric is Euclidean, a distance between two labels $y_i, y_j$ can be induced via $d_s(y_i, y_j) = \|s(y_i) - s(y_j)\|$. Otherwise, any other distance $d(s(y_i), s(y_j))$ on $S$ can be used to find the label associated with an element $\sigma \in S$ of the semantic space (embedding), for instance using a nearest-neighbor rule

$$\hat{y}(\sigma) = \arg\min_{y \in Y} d(s(y), \sigma). \quad (5)$$

Note that the minimum could be any label, seen or unseen. This is just a metric representation of the set of labels, independent of the ZSL problem. The second assumption is that there exists a map $\psi : Z \to S$ from the visual to the semantic embedding, which can be learned to map embeddings of seen classes $z$ to semantic vectors $\sigma = \psi(z)$ in such a way that they land close to the semantic embedding $s(y_s)$ of the seen labels:

$$\min \sum_{i=1}^N d\left(s(y_i^s), \psi \circ \phi_w(x_i^s)\right). \quad (6)$$

One could learn both the visual embedding $\phi_w$ and the visual-to-semantic map $\psi$ simultaneously, or fix the former and just learn the latter. In some cases, even the latter is fixed.

¹ A misnomer, since one knows ahead of time what these labels are, for instance 3 = sailboat, 4 = car. However, one is not given images with those labels during training. So, while the labels are seen, image samples with those labels are not seen during training.
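When $\psi$ is linear, $\psi(z) = Vz$, and $d$ is Euclidean, eq. (6) reduces to a least-squares problem with a closed-form solution. A minimal sketch (our illustration; the ridge term `lam` is an added assumption for numerical stability, and the baseline of (Romera-Paredes and Torr 2015) uses a related but not identical closed form):

```python
import numpy as np

def fit_linear_semantic_map(Z: np.ndarray, S: np.ndarray, lam: float = 1e-3):
    """Solve min_V sum_i ||s(y_i) - V z_i||^2 + lam * ||V||_F^2.

    Z: (N, K) visual embeddings of seen samples.
    S: (N, M) semantic embeddings s(y_i) of their labels.
    Returns V with shape (M, K), so that psi(z) = V @ z.
    """
    K = Z.shape[1]
    # Normal equations: (Z^T Z + lam I) V^T = Z^T S.
    Vt = np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ S)
    return Vt.T
```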
**Validation.** The merit of any ZSL approach is usually evaluated empirically, since the assumptions cannot be validated absent samples in the unseen label classes or knowledge of the transfer between the seen and unseen tasks. Once training has terminated and we have embeddings $\phi_{\hat{w}}$ and $\psi$, given test data $x_u$, we can compare the imputed labels obtained via

$$\hat{y}(x_u) = \arg\min_{y \in Y_u} d\left(\psi \circ \phi_{\hat{w}}(x_u), s(y)\right) \quad (7)$$

with the labels $y_u$ in the validation set. The construction of the loss function (6) is illustrated in Fig. 2.

**Baseline.** ZSL methods differ by whether they learn both the visual and semantic embeddings, only one, or none, and by the method and criterion used for learning. Since the unseen labels are never seen during training, the transfer of information from seen to unseen labels hinges on the regularity of the learned maps. For this reason, much of the recent work in ZSL explores different regularization methods for the learned maps. The simplest case, which requires no regularization, is to assume that all the maps are linear: $\phi(x) = Fx$ and $\psi(z) = Vz$ for suitable matrices $F, V$ (Romera-Paredes and Torr 2015). The results are not state-of-the-art (see the results below), but we nevertheless adopt this baseline and focus on regularizing, rather than the map $\psi$ directly, the spaces $Z$ and $S$, which is our key contribution.

## Related Work

There are many variants of zero-shot learning (ZSL). The general formalism developed above, along with the diagram in Fig. 2, helps in understanding the various approaches in relation to one another. The problem of zero-shot learning dates back to the early days of visual recognition, when the desire emerged to transfer knowledge from painfully learned models to more general classes (Li, Fergus, and Perona 2006; Bart and Ullman 2005). Early modern methods consistent with our approach include (Lampert, Nickisch, and Harmeling 2009; Norouzi et al. 2013), which can be described by a choice of fixed visual embedding $\phi$, fixed semantic embedding $s$, and an imposed structure of the visual-to-semantic map (eq. (2) of (Norouzi et al. 2013) in our notation),

$$\psi(z) = \sum_{i=1}^{T} z_i\, s(\hat{y}(z_i)),$$

where the sum is truncated at the $T$ largest elements of $z$, so nothing is learned. A particularly simple approach, which we adopt as baseline, is (Romera-Paredes and Torr 2015), who assume that all the maps of interest are linear. In particular, they postulate $\psi(z) = Vz$. Although the map is linear, the domain and range where it is defined are not linear spaces (although their embedding space is). We adopt this choice and focus on smoothing the domain and range of the map.

Roughly speaking, zero-shot learning methods can be classified into two main categories: inductive and transductive. In the inductive setting (Zhang and Saligrama 2016; Akata et al. 2013; Lampert and Hannes Nickisch 2014; Akata et al. 2015), which has dominated zero-shot learning, the unseen classes are introduced one by one, and a decision about each unseen instance is made instantly once it is introduced. In the transductive setting (Li et al. 2017; Kodirov et al. 2015; Gopalan, Li, and Chellappa 2014; Changpinyo et al. 2016; Deutsch et al. 2017; Song et al. 2018), typically all unseen instances are processed simultaneously by constructing a graph on which one exploits the underlying manifold structure, for example using the graph-based label propagation approach of (Gopalan, Li, and Chellappa 2014).

The problem of learning the graph Laplacian or the graph weights directly from given input data has recently been addressed in a number of works (Kalofolias 2016; Dong et al. 2016; Egilmez, Pavez, and Ortega 2017a; Pavez and Ortega 2016; Lake and Tenenbaum 2010). In the most general case, both the graph connectivity and the graph weights are unknown, in which case a common way to enforce smoothness is to use a regularizer that controls some level of sparsity. Perhaps the most widely used criterion is that the energy of the graph Laplacian, computed from the graph signal at the vertices, be small.
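That Laplacian-energy criterion can be written as $\mathrm{tr}(F^\top L F)$ with $L = D - W$, which equals $\frac{1}{2}\sum_{ij} w_{ij}\,\|f_i - f_j\|^2$. A minimal sketch of this standard quantity, for context (it is the criterion our IPL departs from, not our loss):

```python
import numpy as np

def laplacian_energy(W: np.ndarray, F: np.ndarray) -> float:
    """Dirichlet energy tr(F^T L F) of a graph signal F (one row per
    vertex) with combinatorial Laplacian L = D - W.

    Equals 0.5 * sum_ij w_ij * ||f_i - f_j||^2, so it is small exactly
    when the signal varies little across heavily weighted edges.
    """
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(F.T @ L @ F))
```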
Our approach is inspired by the isoperimetric problem, a classic problem in geometry. In Euclidean spaces, the study of isoperimetric inequalities provides exact measures for general domains, while on Riemannian manifolds it provides some qualitative understanding of the geometry. Isoperimetric inequalities on manifolds were extended to graphs (Chung 1997; Chung, Grigor'yan, and Yau 1999), where the analysis shares some of the similarities and intuition of the continuous setting.

## Description of our approach

We select a fixed visual embedding $\phi$ consisting of a ResNet101 architecture trained on ImageNet using all classes, to map images $x$ onto a 2048-dimensional embedding $z = \phi(x)$. We assume $y_i^s \in Y_s$, a set of $n_s = 40$ to $150$ classes depending on the dataset: In AwA (Lampert, Nickisch, and Harmeling 2009) there are 50 classes, of which we consider 40 as seen and sequester $n_u = 10$ as unseen. In CUB (Welinder et al. 2010) there are 200 classes, of which we consider 150 as seen and the rest unseen. We exploit a fixed semantic map $s$ from text attributes, namely labels, onto a vector space $S = \mathbb{R}^M$ with dimension $M = 100$ (AwA) to $300$ (CUB), using Word2Vec (Mikolov et al. 2013). The map $\psi : Z \to S$ is assumed linear, $\psi(z) = Vz$, where $V$ is an $M \times 2048$ matrix, learned as in (Romera-Paredes and Torr 2015) using the seen dataset $D_s$. To facilitate comparison with some algorithms, we also use a VGG-verydeep-19 on CUB rather than ResNet101.

At test time, given data $x_i^u$ with unseen labels, we compute the visual representations $z_i^u = \phi(x_i^u) \in \mathbb{R}^K$ and then the semantic embeddings $s_i^u = V z_i^u$ for all $i = 1, \ldots, N_u$. We construct a graph $G = (Z, W)$ with vertices² $z_i^u \in Z$ and edges $\{w_{ij}\} = W$ that measure the affinities between embeddings, $w_{ij} = \langle z_i^u, z_j^u \rangle$. $G$ is a discrete representation of the smooth manifold $\phi(X) \subseteq Z$. The function $\psi$, restricted to $G$, yields $s_i^u$, with range $\psi(Z) \subseteq S$, which we also assume to be a smooth manifold. In practice, because of the finite sampling and the nuisance variability in the descriptors, both the domain and the range of $\psi$ are far from smooth.

**Key idea.** Rather than smoothing the map $\psi : G \to S$, we assume it is linear in the embedding space, and instead smooth both its domain and range. We seek a non-parametric deformation, represented by changes in the connectivity matrix $W$ of the underlying graph, that minimizes the isoperimetric loss (IPL). This is a form of regularization which we introduce to the field of ZSL. The IPL measures the flow through a closed neighborhood relative to the area of its boundary. For two-dimensional (2-D) surfaces in 3-D, it is minimized when the neighborhood is a sphere. The IPL extends this intuition to higher dimensions.

**Application to ZSL.** The result of our regularization is a new graph $G'$, informed by the domain, range, and map of the function $\psi$. We perform spectral clustering on $G'$ to obtain a set of $n_u = |Y_u|$ clusters $\{c_1, \ldots, c_{n_u}\}$. Each of these clusters can then be associated with a label in the unseen set $Y_u$. We do not need to know the association explicitly to evaluate our method.
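A minimal sketch of this clustering step (using scikit-learn's spectral clustering as a stand-in; `W_reg`, the IPL-regularized affinity matrix, is a hypothetical input, since the regularization itself is described in the next section):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_unseen(W_reg: np.ndarray, n_unseen: int) -> np.ndarray:
    """Partition the vertices of the regularized graph G' into
    n_unseen = |Y_u| clusters c_1, ..., c_nu via spectral clustering."""
    sc = SpectralClustering(n_clusters=n_unseen,
                            affinity="precomputed",  # W_reg is the graph
                            assign_labels="kmeans",
                            random_state=0)
    return sc.fit_predict(W_reg)                     # cluster index per sample
```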
However, one could align the clusters to the semantic representation of the unknown labels if so desired.³

² We abuse the notation to indicate with $Z$ the visual embedding space, the range of the function $\phi(X)$, which we assume to be a differentiable manifold, and the vertices of a discrete graph sampled from $Z$.

³ For instance, by finding a transformation $U$ that aligns, for each cluster $c_j$, the points $z_i^u \in c_j$ with the semantic embeddings $s_i^u$. If so desired, one could also add to the regularization procedure a term to align the clusters to the semantic representations of the unseen labels.

**Results.** In general, there is no right regularizer, so we validate our approach empirically on the two most common datasets for ZSL, namely AwA and CUB. Compared to the current best methods that do not use any manual annotation, Zero-IPL reduces errors by 3.06% on AwA1 (increasing precision from 73.7% to 76.03%) and by 6.91% on CUB (increasing precision from 36.9% to 39.45%). Next, we describe the specific contribution, which is the smoothing of the graph representation of $\psi$, in detail.

## Regularization

In this section we describe in more detail our graph smoothing based on the isoperimetric loss (IPL). Our baseline gives us a graph $G$ with weights $w_{ij}$ that we want to modify. We can think of these weights as noisy, and seek a way to regularize them by also exploiting the function $\psi$ defined on $G$ that yields the semantic embeddings. Our regularization criterion is to achieve some level of compactness of bounded subsets: For a collection of subsets of the vertices with fixed size (corresponding to the volume of a subset), we want to find the subsets with the smallest-size boundary. Why this might be a good criterion rests on the classical differential geometry of Riemannian manifolds, where, in the most basic case, the most compact manifold that encloses a fixed area with minimum-size boundary is a circle. However, tools and concepts from classical differential geometry do not translate easily to graphs. Thus, we seek a technique that uses a key invariant, the isoperimetric dimension. It is transferred to the discrete setting, and we introduce the IPL as a way to control smoothness on the graph. Our criterion, quantified by the isoperimetric gap, generalizes this bias towards compactness to more general sets.

### Isoperimetric loss

Let $B_r(\xi)$ be the ball around $\xi \in Z$ of radius $r$, that is, the set of nodes within a graph distance $d_G$ of less than $r$ from $\xi$. Let $i \sim j$, $d_G(j, \xi)$
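A boundary-to-volume ratio of such balls is the basic isoperimetric quantity. The sketch below computes it on a weighted graph; this is our reading of the construction, not necessarily the paper's exact loss, and treating affinities as edge lengths for the graph distance is an added assumption:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isoperimetric_ratio(W: np.ndarray, xi: int, r: float) -> float:
    """Boundary-to-volume ratio of the ball B_r(xi) on a weighted graph.

    B_r(xi): vertices within graph distance d_G < r of vertex xi
             (edge weights are used as lengths here -- an assumption).
    Volume:  total edge weight incident to vertices in the ball.
    Boundary: total weight of edges leaving the ball.
    """
    d = shortest_path(W, indices=[xi])[0]    # d_G(xi, .) for every vertex
    ball = d < r                             # indicator of B_r(xi)
    vol = W[ball].sum()                      # sum of degrees inside the ball
    boundary = W[np.ix_(ball, ~ball)].sum()  # edge weight crossing out
    return boundary / max(vol, 1e-12)
```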