# Deep Metric Learning with Graph Consistency

Binghui Chen,1 Pengyu Li,1 Zhaoyi Yan,1,2 Biao Wang,1 Lei Zhang1,3
1 Artificial Intelligence Center, DAMO Academy, Alibaba Group; 2 Harbin Institute of Technology; 3 The Hong Kong Polytechnic University
chenbinghui@bupt.edu.cn, lipengyu007@gmail.com, yanzhaoyi@outlook.com, wangbiao225@foxmail.com, cslzhang@comp.polyu.edu.hk

## Abstract

Deep Metric Learning (DML) has become increasingly attractive and is widely applied in many computer vision tasks, in which a discriminative embedding is required such that image features belonging to the same class are gathered together and those belonging to different classes are pushed apart. Most existing works learn this discriminative embedding by devising powerful pair-based loss functions or hard-sample mining strategies. In this paper, however, we start from another perspective and propose the deep Consistent Graph Metric Learning (CGML) framework to enhance the discrimination of the learned embedding. It is mainly achieved by rethinking the conventional distance constraints as a graph regularization and then introducing a Graph Consistency regularization term, which optimizes the feature distribution from a global graph perspective. Inspired by the characteristic of our defined Discriminative Graph, which views DML from another novel perspective, the Graph Consistency regularization term encourages sub-graphs randomly sampled from the training set to be consistent. We show that CGML indeed serves as an efficient technique for learning a discriminative embedding and is applicable to various popular metric objectives, e.g. Triplet, N-Pair and Binomial losses. This paper empirically and experimentally demonstrates the effectiveness of our graph regularization idea, achieving competitive results on the popular CUB, CARS, Stanford Online Products and In-Shop datasets.

## Introduction

Figure 1: Graph Consistency. The circles indicate data points; green/red/blue colors represent three classes, respectively. In the left two sub-figures, the highlighted colors mark the currently sampled data. One can observe that, since the overall feature representations are not discriminative enough, i.e. with large intra-class variations and relatively small inter-class margins, the randomly sampled graphs differ from each other. By regularizing them to be consistent, in other words aligning them to be the same, the intra-class distance can be decreased and the inter-class margin enlarged to some extent, resulting in a more discriminative representation space than before.

In the end-to-end feature learning framework of deep convolutional neural networks, where the network is a powerful non-linear mapping function that can, to some extent, be arbitrarily shaped by loss functions, Deep Metric Learning (DML) focuses on the design of discriminative objective loss functions, so as to constrain the learned embedding to be more discriminative. Owing to the powerful representation ability of the learned embedding, DML has been widely explored and applied in many computer vision tasks, such as image retrieval (Gordo et al. 2017; Noh et al. 2017), face recognition (Schroff, Kalenichenko, and Philbin 2015; Wen et al. 2016), person re-identification (Hermans, Beyer, and Leibe 2017; Chen et al. 2017), zero-shot learning
(Oh Song et al. 2016), visual tracking (Leal-Taixé, Canton-Ferrer, and Schindler 2016; Tao, Gavves, and Smeulders 2016) and cross-modal retrieval (Deng et al. 2018).

Deep Metric Learning (DML) is generally achieved by learning feature representations for the input images such that instances from the same class are mapped to a small vicinity in the low-dimensional representation space while samples from different classes are placed relatively far apart. The representations are learned under an end-to-end optimization framework whose objective function uses loss terms to impose the desired intra-class and inter-class distance constraints in the feature space. Thus, in order to obtain discriminative feature representations, most DML works are dedicated to mining as many expressive instance pairs as possible. For example, many works focus on exploring tuple-based loss functions, such as the contrastive loss (Sun et al. 2014), binomial deviance loss (Yi et al. 2014), triplet loss (Schroff, Kalenichenko, and Philbin 2015) and quadruplet loss (Chen et al. 2017). However, in these tuple-based methods, training instances are grouped into pairs, triplets or quadruplets, resulting in a quadratic or cubic growth of training tuples, most of which are highly redundant and uninformative. This gives rise to two key problems for tuple-based approaches: (1) the pairs actually constructed are finite and local, so they cannot exploit the global and informative data structure, and thus the optimized image representations will not be discriminative enough; and (2) the optimization of the feature representation is dominated by the margin constraints, so once the sampled pairs satisfy the margin constraints, the losses become zero and the parameter updates stop; the actual global feature distribution may therefore still not be discriminative, leading to inferior performance. To learn compact and separable features, some researchers resort to hard-sample mining, e.g. (Wu et al. 2017; Harwood et al. 2017; Schroff, Kalenichenko, and Philbin 2015); in practice, however, model training is usually very sensitive to the sampling strategy and the sampled pairs, resulting in bad local minima and large variations in performance. Moreover, some researchers propose to use global instance relations for discriminative embedding learning, such as Lifted (Oh Song et al. 2016), N-Pair (Sohn 2016) and MS (Wang et al. 2019a), yet due to the Maximum-Domination problem1 behind the Softmax formulation, the global constraints from these methods are not strong enough. To this end, proposing a more discriminative and efficient deep metric objective function remains important.

1 The formulation pays attention mainly to the maximum-similarity input, while the remaining inputs may be largely ignored; the strength of the global constraint is therefore weakened.
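To make problem (2) above concrete, the following is a minimal PyTorch-style sketch of the standard triplet loss; the function and tensor names are illustrative, not taken from the paper. Once a triplet satisfies the margin constraint, its hinge term and its gradient are exactly zero, so that triplet no longer shapes the embedding even if the global feature distribution is still far from discriminative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss with a hinge on the margin constraint.

    anchor, positive, negative: (B, d) embedding tensors (illustrative names).
    """
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # squared distances to positives
    d_an = (anchor - negative).pow(2).sum(dim=1)  # squared distances to negatives
    # hinge: zero loss (and zero gradient) once d_an >= d_ap + margin
    return F.relu(d_ap - d_an + margin).mean()

# toy check: a triplet that already satisfies the margin contributes nothing
a = torch.zeros(1, 4)
p = torch.zeros(1, 4)
n = torch.full((1, 4), 10.0)
print(triplet_loss(a, p, n))  # tensor(0.)
```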
Considering the aforementioned problems, in this paper we propose the deep Consistent Graph Metric Learning (CGML) framework, a novel loss constraint that further enhances discriminative learning by regularizing randomly sampled graphs to be consistent during each training iteration. This is mainly achieved by introducing a Graph Consistency (GC) regularization term that is plug-and-play and can be generally applied to many existing deep metric learning methods. Specifically, at each iteration we randomly select m classes with n instances per class, twice, then regard the instances as nodes and construct two graphs according to the instance-to-instance distances. Constraining these two randomly sampled graphs to be consistent enforces the property of a discriminative representation distribution, namely compact intra-class distributions and separable inter-class distances. As illustrated in Fig. 1, at the beginning the data representations are not discriminative and the sampled graphs exhibit large diversity; after performing the consistency regularization on these graphs, large inter-class distances and small intra-class variations can be achieved, yielding a discriminative feature space. To support our method, we provide mathematical proofs. Moreover, considering the numerical problem in actual training, we further introduce an upper-bounded GC term to ensure the learning of a discriminative embedding.

The main contributions of this work can be summarized as follows:
- We propose the deep Consistent Graph Metric Learning (CGML) framework, a novel graph-based view for learning discriminative feature representations, which is plug-and-play and can be applied to many existing deep metric methods.
- CGML is achieved by introducing the Graph Consistency (GC) term, which matches the property of our defined Discriminative Graph and is supported by rigorous mathematical proofs. To ensure the optimization of the GC term, an upper bound of GC is considered.
- Extensive experiments have been performed on several popular DML datasets, including CARS (Krause et al. 2013), CUB, Stanford Online Products (Oh Song et al. 2016) and In-Shop (Liu et al. 2016), achieving competitive results.

## Related Work

Graph Learning: Graph-based approaches have attracted increasing attention in the computer vision community and have been shown to be an efficient way of modeling relations. Constructing a graph over image spatial positions and then propagating mass via random walks has been widely used for object saliency detection (Harel, Koch, and Perona 2007). The Graph Convolution Network (GCN) (Kipf and Welling 2016) was proposed for semi-supervised classification and has been adopted to capture relations between objects in video recognition tasks (Wang and Gupta 2018). IRG (Liu et al. 2019) employs graph relations for knowledge distillation, and graph knowledge is also used for visual query answering (Xiong et al. 2019). Different from these works, we aim at encouraging the discrimination of the learned deep embedding by regularizing randomly constructed sub-graphs over data points to be consistent with each other, which is the defining property of our Discriminative Graph and of a discriminative feature distribution.

Deep Metric Learning: DML intends to pull instances from the same class closer while pushing instances from different classes farther apart. The commonly used Contrastive loss (Sun et al. 2014) and Triplet loss (Schroff, Kalenichenko, and Philbin 2015) have been widely explored and applied. Additionally, there are other deep metric learning methods:
Smart-mining (Harwood et al. 2017) combines a local triplet loss and a global loss to supervise deep metric learning with hard-example mining. Sampling Matters (Wu et al. 2017) proposes a distance-weighted sampling strategy. The Angular loss (Wang et al. 2017) optimizes a triangle-based angular function. Proxy-NCA (Movshovitz-Attias et al. 2017) explains why the popular classification loss works from a proxy-agent view, and its implementation is very similar to Softmax. The N-Pair loss (Sohn 2016) proposes to use N-pair tuples for training a discriminative embedding, and ALMN (Chen and Deng 2019a) proposes an adaptive large-margin N-pair loss that generates geometric virtual negative points instead of employing hard-sample mining to learn a more discriminative embedding. SNR (Yuan et al. 2019) applies the idea of signal-to-noise ratio to the deep metric objective and obtains a robust feature embedding. HDC (Yuan, Yang, and Zhang 2017) employs cascaded models and selects hard samples from different levels and models, BIER (Opitz et al. 2017, 2018) adopts online gradient-boosting methods, and DeML (Chen and Deng 2019b) employs an ensemble of metrics learned from hybrid attention proposals; these methods improve performance by resorting to the ensemble idea. Different from the above methods, which are based on instance-pair construction, sample mining or metric ensembles, we target the informative graph structure behind the data distribution for learning a discriminative embedding. It is a novel view for introducing global constraints.

## Proposed Approach

In this section, we first give the problem background of less-discriminative embedding learning in Section 3.1, and then introduce our defined Discriminative Graph and its corresponding property; inspired by this property we derive our Graph Consistency (GC) regularization in Section 3.2. To further ensure the optimization of graph consistency, we consider an upper-bounded GC term in Section 3.3. Finally, we present the deep Consistent Graph Metric Learning (CGML) framework in Section 3.4.

### Problem Background

Most DML works are designed to optimize the relative distances between positive pairs and negative pairs so that margin constraints are satisfied, e.g. the Contrastive loss (Sun et al. 2014), Triplet loss (Schroff, Kalenichenko, and Philbin 2015) and Quadruplet loss (Chen et al. 2017). However, satisfying these distance margin constraints between instance pairs is not equivalent to learning discriminative feature representations. Taking the Triplet loss as a toy example, as illustrated in Fig. 2, one can observe that after the margin constraint is satisfied, the constructed pairs produce zero losses and contribute little to the update of the feature embedding and model parameters. As a result, the data points in the feature space stop moving towards more discriminative locations. Therefore, in order to obtain compact intra-class distributions and separable inter-class distances, we recast the problem of distance constraints as a graph regularization, which takes into account the informative graph structure behind the data points in the feature space.

### Graph Consistency Regularization

In this paper, we rethink discriminative distance optimization from another novel perspective, i.e. graph optimization. The ultimate target thus becomes learning towards a Discriminative Graph. We first give the definition of the Discriminative Graph below:

Definition 1.
Given large-scale data representations $X = [x_1, \dots, x_N]$, where $x_i \in \mathbb{R}^d$ and $N$ is the number of data points drawn uniformly from $C$ classes. For the $c$-th class, its largest intra-class Euclidean distance is $\alpha_c$ and distance margin constraint: ap+m
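To make the graph-consistency mechanism described above concrete (at each iteration, two sub-graphs with the same class layout are randomly sampled and regularized to be consistent), here is a minimal sketch. It is not the paper's exact GC term or its upper bound; a plain mean-squared discrepancy between the two Euclidean distance graphs is assumed, and all names (pairwise_distances, graph_consistency, lambda_gc) are hypothetical.

```python
import torch

def pairwise_distances(x):
    """(n, d) embeddings -> (n, n) Euclidean distance matrix (edge weights of a sub-graph)."""
    sq = (x * x).sum(dim=1, keepdim=True)
    d2 = (sq + sq.t() - 2.0 * x @ x.t()).clamp(min=0.0)
    return (d2 + 1e-12).sqrt()  # small epsilon keeps sqrt differentiable at zero

def graph_consistency(emb_g1, emb_g2):
    """Discrepancy between two sub-graphs built from two independently sampled
    mini-batches that share the same class layout (m classes, n instances each,
    arranged in the same order). If the embedding is discriminative, the two
    distance graphs should be nearly identical, so their gap is penalized.
    """
    g1 = pairwise_distances(emb_g1)
    g2 = pairwise_distances(emb_g2)
    return (g1 - g2).pow(2).mean()

# usage sketch: add the GC term to any base metric loss (Triplet, N-Pair, Binomial, ...)
# emb_g1, emb_g2 = model(batch_g1), model(batch_g2)   # two random samplings per iteration
# loss = base_metric_loss(emb_g1, labels) + lambda_gc * graph_consistency(emb_g1, emb_g2)
```

In this reading, because row i of both sampled batches holds an instance of the same class, a perfectly discriminative embedding makes the two edge-weight matrices coincide, which is the consistency the GC term asks for.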