# Zero-shot Metric Learning

Xinyi Xu, Huanhuan Cao, Yanhua Yang, Erkun Yang and Cheng Deng
School of Electronic Engineering, Xidian University, Xi'an 710071, China
xyxu.xd@gmail.com, hhcao@stu.xidian.edu.cn, yanhyang@xidian.edu.cn, {erkunyang, chdeng.xd}@gmail.com

## Abstract

In this work, we tackle the zero-shot metric learning problem and propose a novel method, abbreviated ZSML, whose purpose is to learn a distance metric that measures the similarity of unseen categories (and even unseen datasets). ZSML achieves strong transferability by capturing multi-nonlinear yet continuous relations among data. It is motivated by two facts: 1) relations can essentially be described from various perspectives; and 2) traditional binary supervision is insufficient to represent continuous visual similarity. Specifically, we first reformulate a collection of specific-shaped convolutional kernels to combine data pairs and generate multiple relation vectors. Furthermore, we design a new cross-update regression loss to discover continuous similarity. Extensive experiments on four benchmark datasets, covering both intra-dataset and inter-dataset transfer, demonstrate that ZSML achieves state-of-the-art performance.

## 1 Introduction

Metric learning aims to find appropriate similarity measurements between data points, its core intuition being to preserve the distances between data points in an embedding space. The topic is of great practical importance due to its wide applications in many related areas, such as face recognition [Guillaumin et al., 2009], clustering [Davis et al., 2007; Xing et al., 2003], and retrieval [Zhou et al., 2004].

Euclidean distance is one of the most common similarity metrics, since it requires neither prior information nor a training process. However, it may yield unsatisfactory results, as it treats all feature dimensions equally and independently and thus fails to capture the idiosyncrasies of the data. In contrast, the parametric Mahalanobis distance, which can model the differing importance of dimensions, has been adopted in many works. Some representative Mahalanobis approaches [Hoi et al., 2006; Xing et al., 2003] project the data linearly and minimize the Euclidean distance between positive pairs while maximizing it between negative pairs. Alternatively, one may also directly optimize the Mahalanobis metric for nearest neighbor classification; representative works include, but are not limited to, Neighborhood Component Analysis (NCA) [Roweis et al., 2004], Large Margin Nearest Neighbor (LMNN) [Weinberger and Saul, 2009], and Nearest Class Mean (NCM) [Mensink et al., 2013]. Prior information plays a pivotal role in the success of these metric learning schemes; unsatisfactory results can therefore be produced when such priors are not available.

In this paper, we are committed to a more challenging task: zero-shot metric learning, whose ambition is to learn an effective metric for unseen categories and datasets. It requires the learned metric to measure similarity without access to the target data. Powerful transferability can be obtained by capturing the multi-nonlinear and continuous relations among data, which is consistent with their innate character. Particularly, we first reformulate a set of specific-shaped convolutional kernels to discover various kinds of relations.
It is well known that convolutional neural networks (CNNs) have great power in feature embedding [LeCun et al., 1998; Donahue et al., 2013; Toshev and Szegedy, 2014], while in this paper a CNN is employed to reveal the correlations among data. Then, we design a cross-update regression loss, which relaxes the binary supervision imposed on positive pairs (PPs) and negative pairs (NPs) to extend generalization capability. Specifically, we initialize a coarse continuous label as weak supervision for the predicted similarity, and update the coarse label and the predicted similarity alternately until convergence. By doing so, we can learn the order of similarity and improve transferability. To better demonstrate the superiority of ZSML, we present multi-level transfer tasks, covering transfer to unseen categories within one dataset (intra-dataset ZSML) and to unseen datasets (inter-dataset ZSML).

In a nutshell, the main contributions of our work can be summarized as follows:

- Departing from the traditional single, linear relation representation, we reformulate a family of specific-shaped convolutional kernels that can capture the multi-nonlinear relations among data points.
- We devise a cross-update regression loss for learning continuous similarity to improve generalization capability, which is verified in our empirical study.
- Extensive transfer experiments demonstrate that our model can better measure the similarity of unseen categories and unseen datasets compared with peer methods.

Figure 1: The multi-nonlinear regression metric learning framework of our proposed method. ZSML employs a relation function R and a similarity function S to project data from the feature space into a scalar similarity space, where the degree of similarity of two examples is measured. Finally, a regression from the scalars in the similarity space to the continuous labels guides the training process. The continuous labels lie in (0, α) for NPs and (β, 1) for PPs.

## 2 Related Work

The Mahalanobis metric is among the most commonly used linear metrics, and a majority of metric learning methods have been developed on its basis. For instance, Davis et al. proposed an information-theoretic metric learning (ITML) method [Davis et al., 2007], which essentially minimized the differential relative entropy between two multivariate Gaussians. LMNN [Weinberger and Saul, 2009] enforced an SVM-style large margin within triplets of data and depicted the relative relations among three individual examples. Recently, GMML [Zadeh et al., 2016] revisited the task of learning a Euclidean metric from weakly supervised data, where pairs of similar and dissimilar points are given, building on geometric intuition. Furthermore, Ye et al. proposed a unified multi-metric learning approach (UM2L) to combine both spatial connections and rich semantic factors [Ye et al., 2016]. Xiong et al. [Xiong et al., 2012] proposed a single adaptive metric with a position-dependent structure, which additionally incorporated the feature mean vector, besides the feature difference vector, to encode the distance. Thereafter, Huang et al. [Huang et al., 2016] proposed to encode the two linear structures of data pairs and then map each feature pair to a similarity space.
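
All of these linear approaches build on the Mahalanobis form $d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M (x_i - x_j)}$ with a learned positive semi-definite $M = A^\top A$. The following is a minimal Python/NumPy sketch of this linear baseline (our own illustration, not code from any of the cited works), showing that it reduces to a Euclidean distance after the projection $A$:

```python
import numpy as np

def mahalanobis_distance(x_i, x_j, A):
    """Mahalanobis distance with M = A^T A.

    Equivalent to the Euclidean distance between the linear
    projections A @ x_i and A @ x_j, which is the quantity most
    linear metric learning methods optimize over pairs.
    """
    diff = x_i - x_j
    M = A.T @ A  # positive semi-definite by construction
    return float(np.sqrt(diff @ M @ diff))

# Toy usage: a (here random) projection A of shape (p, m).
rng = np.random.default_rng(0)
m, p = 8, 4
A = rng.normal(size=(p, m))
x_i, x_j = rng.normal(size=m), rng.normal(size=m)

d_m = mahalanobis_distance(x_i, x_j, A)
d_proj = np.linalg.norm(A @ x_i - A @ x_j)  # identical value
assert np.isclose(d_m, d_proj)
```
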
Zero-shot learning aims to learn a task without training samples of that task [Huang et al., 2015; Lampert et al., 2014]. Usually, this involves transferring knowledge either through model parameters or through shared features. Numerous models focus on descriptive attributes to represent object classes [Lampert et al., 2014; Farhadi et al., 2010]. Other models exploit the hierarchical semantics of the data [Griffin and Perona, 2008; Marszalek and Schmid, 2007]: a general-to-specific order is imposed on the sample space, either based on an existing hierarchy [Marszalek and Schmid, 2007] or learned from visual features [Griffin and Perona, 2008], and scalability is achieved by associating a classifier with each hierarchy node. In this paper, we focus on an analogous yet different issue: a zero-shot metric. The main purpose of our zero-shot metric learning is to measure the similarity between instances that have never been seen before.

## 3 Proposed Approach

Figure 1 shows the framework of our proposed ZSML. The relation-mining function R is first employed to project data pairs from the feature space into the relation space, in which each kind of relation is encoded as a vector. We then employ a similarity function S to map the relation vectors into a scalar similarity space, where each scalar implies how similar two data points are. Finally, a regression loss guides the whole optimization procedure under the supervision of a continuous label.

### 3.1 Preliminaries

Let $\mathcal{X} = \{x_i \mid i = 1, 2, \ldots, n\}$ be the training set, where $x_i$ is an $m$-dimensional vector. $\mathcal{P}$ is a set of $N$ data pairs built up randomly within $\mathcal{X}$. Given a random data pair $(x_i, x_j)$, $(r^{ij}_1, r^{ij}_2, \ldots, r^{ij}_k)$ denotes the corresponding $k$ relation vectors produced by a relation function $R$, and $s_{ij}$ is the predicted similarity generated by a similarity function $S$. To achieve continuous supervision, we encode the available binary label information $y^b \in \mathbb{R}^N$ into a continuous form $y^c \in \mathbb{R}^N$. For PPs (NPs), the binary label is $y^b = 1$ ($y^b = 0$), while the continuous label satisfies $y^c \in (\beta, 1)$ ($y^c \in (0, \alpha)$), where $\alpha$ and $\beta$ are two boundaries for NPs and PPs, respectively.

### 3.2 Multi-Nonlinear Relations Mining

Traditional Mahalanobis metric learning algorithms usually employ a linear projection $A$ to map the original data points to $Ax$ and compute a simple Euclidean distance $\|Ax_i - Ax_j\|^2$ to indicate the degree of similarity of a data pair. However, this fails to describe the inherently complex relations among data. We tackle the problem by adopting a family of specific-shaped convolutional kernels to project data pairs from the feature space into a relation space. As illustrated in Figure 2, four different convolutional kernels, all of width 2, slide vertically over the $m \times 2$ feature matrix and thus produce four different relation vectors (distinguished by color). By doing this, we can unearth multiple relations among the data.

Figure 2: The multi-nonlinear relations mined by specific-shaped convolutional kernels, whose widths are all 2. The purpose is to combine a data pair and generate multiple relation vectors. The figure shows an example with k = 4.
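
Before formalizing this step, here is a minimal PyTorch sketch of the relation-mining idea (our own illustration, not the authors' code): a data pair is stacked into an $m \times 2$ matrix and $k$ width-2 kernels slide along the feature dimension, each yielding one relation vector. For simplicity all $k$ kernels here share one height, whereas the paper's kernels have varying specific shapes; the height and $k = 4$ are assumptions made for concreteness.

```python
import torch
import torch.nn as nn

class RelationMining(nn.Module):
    """Mine k relation vectors from a data pair with width-2 kernels.

    The pair (x_i, x_j) is stacked into an m x 2 matrix; each of the
    k kernels spans both columns (width 2) and can therefore only
    slide vertically, producing one relation vector per kernel.
    """
    def __init__(self, k: int = 4, height: int = 3):
        super().__init__()
        # kernel_size=(height, 2): width 2 covers both examples.
        self.conv = nn.Conv2d(in_channels=1, out_channels=k,
                              kernel_size=(height, 2))
        self.relu = nn.ReLU()

    def forward(self, x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
        # x_i, x_j: (batch, m) feature vectors.
        pair = torch.stack([x_i, x_j], dim=-1)  # (batch, m, 2)
        pair = pair.unsqueeze(1)                # (batch, 1, m, 2)
        r = self.relu(self.conv(pair))          # (batch, k, m-height+1, 1)
        return r.squeeze(-1)                    # k relation vectors

# Toy usage with m = 16 dimensional features and k = 4 kernels.
x_i, x_j = torch.randn(8, 16), torch.randn(8, 16)
relations = RelationMining(k=4)(x_i, x_j)
print(relations.shape)  # torch.Size([8, 4, 14])
```
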
Taking a data pair $(x_i, x_j)$ as input, the $k$ relation vectors are computed by

$$(r^{ij}_1, r^{ij}_2, \ldots, r^{ij}_k) = R(x_i, x_j; W^C_1, b^C_1), \tag{1}$$

where $(r^{ij}_1, r^{ij}_2, \ldots, r^{ij}_k)$ are the $k$ relation vectors and $R$ is a relation function implemented by a convolutional layer (Conv) followed by a rectified linear unit (ReLU), with parameters $W^C_1$ and $b^C_1$.

We employ a similarity function $S$ to project the above relation vectors into a one-dimensional similarity space, in which a larger value indicates greater similarity. $S$ is implemented by three fully connected (FC) layers, the last of which contains a single neuron. The predicted similarity $s_{ij}$ of the data pair $(x_i, x_j)$ is computed by

$$s_{ij} = S(r^{ij}_1, r^{ij}_2, \ldots, r^{ij}_k; W^I_1, b^I_1; W^I_2, b^I_2; W^I_3, b^I_3), \tag{2}$$

where $\{W^I_1, b^I_1; W^I_2, b^I_2; W^I_3, b^I_3\}$ are the parameters of the three fully connected layers. The full set of projection parameters is $V = \{W^C, b^C, W^I, b^I\}$. Notably, the neural network here serves as a tool for uncovering correlations, which is quite different from traditional feature extraction.

### 3.3 Regression Loss

Conventional metric learning algorithms usually adopt a binary label as supervision, which is prone to over-fitting because it tries to push the similarity of each pair of individual points separately [Huang et al., 2016]. As depicted in Figure 3, binary labels assign data points within the same class the same degree of similarity and neglect the intra-class data manifold. To better preserve the original data similarities, we propose a regression loss to learn continuous similarity, which enables data points from the same category to reside on a manifold while maintaining a distance between data points from different categories. The regression loss is implemented in two steps: 1) generate an $N$-dimensional vector according to the binary labels $y^b$, which serves as an initialization of the continuous labels $y^c$; 2) employ a cross-update strategy to alternately optimize the predicted similarities $s$ and $y^c$ by forcing their consistency.

Figure 3: Comparison between binary supervision and continuous supervision. The blue-edged pairs are negative pairs (NPs) while the red-edged ones are positive pairs (PPs). Binary supervision only separates NPs from PPs and ignores the order of similarity, while continuous supervision can reveal continuous visual similarity.

The binary nature of $y^b$ is an obstacle to guiding continuous similarity learning. We therefore encode $y^b$ into a continuous form $y^c$ that is confined to a certain range. As shown in Figure 4, the normalized Euclidean distances are mapped into $(\beta, 1)$ for PPs and $(0, \alpha)$ for NPs. Concretely, we adopt the following mapping functions:

$$y^c_{ij} =
\begin{cases}
-\alpha d^2_{ij} + \alpha, & \text{if } y^b_{ij} = 0, \\
(\beta - 1)\, d^2_{ij} + 1, & \text{if } y^b_{ij} = 1,
\end{cases} \tag{3}$$

where $d_{ij}$ is the normalized Euclidean distance, and $\alpha$ and $\beta$ are the boundaries of the continuous labels.

Another issue is that the initialized continuous labels are coarse and need to be finely tuned. ZSML achieves this by adopting a cross-update strategy that optimizes the predicted similarities $s$ and the continuous labels $y^c$ alternately. We design our objective function to consist of two parts, with the intuition of making $s$ and $y^c$ close. The loss on the similarity, $\mathcal{L}_s$, is

$$\mathcal{L}_s(s, V) = \frac{1}{N} \sum_{ij} \left( s_{ij} - y^c_{ij} \right)^2, \quad \text{s.t. } \beta$$
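
For concreteness, here is a minimal PyTorch sketch of the continuous-label initialization in Eq. (3) and the regression loss $\mathcal{L}_s$ (our own illustration; the boundary values $\alpha = 0.3$ and $\beta = 0.7$ are assumed defaults, not values reported in the paper, and the alternating refresh of $y^c$ in the cross-update strategy is omitted):

```python
import torch

def continuous_labels(d: torch.Tensor, y_b: torch.Tensor,
                      alpha: float = 0.3, beta: float = 0.7) -> torch.Tensor:
    """Initialize continuous labels y^c from binary labels via Eq. (3).

    d   : normalized Euclidean distances of the pairs, in [0, 1].
    y_b : binary labels (1 for PPs, 0 for NPs).
    NPs are mapped into (0, alpha), PPs into (beta, 1).
    """
    y_np = -alpha * d ** 2 + alpha       # y^b = 0: range (0, alpha)
    y_pp = (beta - 1.0) * d ** 2 + 1.0   # y^b = 1: range (beta, 1)
    return torch.where(y_b.bool(), y_pp, y_np)

def regression_loss(s: torch.Tensor, y_c: torch.Tensor) -> torch.Tensor:
    """Mean squared regression between predicted similarities s and
    the current continuous labels y^c (the L_s term above)."""
    return ((s - y_c) ** 2).mean()

# Toy usage: distances and binary labels for a batch of six pairs.
d = torch.rand(6)                        # normalized distances in [0, 1]
y_b = torch.tensor([1, 0, 1, 0, 1, 0])
y_c = continuous_labels(d, y_b)
s = torch.rand(6, requires_grad=True)    # stand-in for S's predictions
loss = regression_loss(s, y_c)
loss.backward()
```

A cross-update schedule would alternate gradient steps on this loss (updating the network parameters behind $s$) with refreshes of $y^c$ toward the current predictions, enforcing the consistency between the two described above.
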