# Data-Adaptive Metric Learning with Scale Alignment

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Shuo Chen, Chen Gong, Jian Yang, Ying Tai, Le Hui, Jun Li

PCA Lab, Key Lab of Intelligent Perception and System for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Youtu Lab, Tencent

{shuochen, chen.gong, csjyang, le.hui}@njust.edu.cn, yingtai@tencent.com, junl.mldl@gmail.com

## Abstract

The central problem for most existing metric learning methods is to find a suitable projection matrix applied to the differences of all pairs of data points. However, a single unified projection matrix can hardly characterize all data similarities accurately, because practical data are usually very complicated, and adopting one global projection matrix may ignore important local patterns hidden in the dataset. To address this issue, this paper proposes a novel method dubbed Data-Adaptive Metric Learning (DAML), which constructs a data-adaptive projection matrix for each data pair by selectively combining a set of learned candidate matrices. As a result, every data pair obtains a specific projection matrix, enabling the proposed DAML to flexibly fit the training data and produce discriminative projection results. The DAML model is formulated as an optimization problem which jointly learns the candidate projection matrices and their sparse combination for every data pair. Nevertheless, over-fitting may occur due to the large number of parameters to be learned. To tackle this issue, we adopt the Total Variation (TV) regularizer to align the scales of the data embeddings produced by all candidate projection matrices, so that the metrics generated by these learned candidates are generally comparable. Furthermore, we extend the basic linear DAML model to a kernelized version (denoted KDAML) to handle non-linear cases, and the Iterative Shrinkage-Thresholding Algorithm (ISTA) is employed to solve the optimization model. Intensive experimental results on various applications including retrieval, classification, and verification clearly demonstrate the superiority of our algorithm over other state-of-the-art metric learning methodologies.

## Introduction

Metric learning aims to learn a distance function for data pairs to faithfully measure their similarities. It has played an important role in many pattern recognition applications, such as face verification (Liu et al. 2018), person re-identification (Si et al. 2018), and image retrieval (Zhan et al. 2009; Liu, Tsang, and Müller 2017).

Figure 1: The comparison of traditional metric learning and our proposed model. (a) The traditional method learns a single global projection matrix $\hat{P}$ to distinguish the similarity of $(x, x')$. (b) Our proposed DAML jointly learns multiple projection matrices $P_{::0}, P_{::1}, \dots, P_{::c}$ and their weight vector $w \in \mathbb{R}^c$ for each data pair $(x, x')$.

The well-studied metric learning models are usually global, which means that they directly learn a single semi-positive definite (SPD) matrix $\hat{M} = \hat{P}^\top \hat{P}$ to decide a Mahalanobis distance function $D_{\hat{P}}(x, x') = (x - x')^\top \hat{M} (x - x')$ for all data pairs $(x, x')$ (see Fig. 1(a)), where $\hat{P}$ can be understood as a projection matrix. (For simplicity, the square notation on $D_{\hat{P}}(x, x')$ is omitted; this does not influence the final output.)
The detailed implementations can be a linear projection (Harandi, Salzmann, and Hartley 2017) or non-linear deep neural networks (DNNs) (Oh Song et al. 2016). The primitive linear works utilized the supervised information (e.g., must-link and cannot-link constraints) to control the learned distance during their training phases, such as Distance Metric Learning for Clustering (Xing et al. 2003), Large Margin Nearest Neighbor (LMNN) (Weinberger, Blitzer, and Saul 2006), and Information-Theoretic Metric Learning (ITML) (Davis et al. 2007). To enhance the fitting performance and effectively discover the structure of more complicated data, some recent non-linear methods, including Projection Metric Learning on Grassmann Manifold (Huang et al. 2015) and Geometric Mean Metric Learning (GMML) (Zadeh, Hosseini, and Sra 2016), learn the matrices $\hat{M}$ and $\hat{P}$ on a manifold instead of the original linear space. Moreover, by adaptively enriching the training data, Adversarial Metric Learning (Chen et al. 2018; Duan et al. 2018) showed further improvement on the linear metric by discriminating the confusing yet critical data pairs produced by a generator. There are some other methods that replace the linear projection $\hat{P}x$ by the non-linear form $\hat{P}\,\mathcal{W}(x)$ so that the model representation ability can be boosted, in which the mapping $\mathcal{W}(\cdot)$ usually denotes a deep neural network. For example, the Convolutional Neural Network (CNN) is adopted by Siamese-Net (Zagoruyko and Komodakis 2015), while the Multi-Layer Perceptron (MLP) is employed by Discriminative Deep Metric Learning (DDML) (Hu, Lu, and Tan 2014). However, the above global metric learning methods are not flexible enough for handling complex or heterogeneous data, because they all use the same projection operator for all data pairs, which might be inappropriate for characterizing local data properties. As a result, some important patterns carried by the training data are ignored and the learned metric can be inaccurate.

To improve the flexibility of metric learning for fitting complex data pairs, various local models have been proposed from different viewpoints. For example, LMNN was extended to a local version by learning a specific metric for each class based on a certain classification criterion (Weinberger and Saul 2008). Afterwards, the Instance Specific Distance was proposed to further enhance the metric flexibility on each of the training examples (Zhan et al. 2009). Recently, in Parametric Local Metric Learning (PLML) (Wang, Kalousis, and Woznica 2012), the authors proposed a weighted Mahalanobis distance in which multiple metric values are linearly combined. In a similar way, the traditional methods LMNN and GMML have also been extended to local forms by introducing weighted distances (Bohné et al. 2014; Su, King, and Lyu 2017).
In contrast to the combination of multiple metrics in PLML, a Gaussian mixture based model (Luo and Huang 2018) partitioned the metric $\hat{M}$ into multiple blocks and proposed a localized norm to improve the model flexibility. However, these improvements on the metric $\hat{M}$ are not guaranteed to consistently render reasonable projections for discriminating similar pairs from dissimilar ones. Therefore, there are also some works aiming at directly refining the projection operator $\hat{P}$. For instance, Gated Siamese-Net (Varior, Haloi, and Wang 2016) employed a gating function to selectively emphasize the local pattern of the projected data. Similarly, an attention mechanism was utilized in the image matching model of (Si et al. 2018), by which feature-pair alignment can be performed on the projection results.

Although the existing metric learning models have achieved promising results to some extent, most of them cannot adaptively find a suitable projection strategy for different data pairs, as they are not data-adaptive and thus their learning flexibility is rather limited. To this end, we propose a novel metric learning model that generalizes the single projection matrix to multiple ones, and establishes a selective mechanism to adaptively utilize them according to the local property of data points (see Fig. 1(b)). In other words, our method jointly learns multiple candidate matrices and their sparse combinations for different training pairs. On one hand, every pair of examples is associated with a definite projection matrix which is constructed by wisely selecting and combining a small fraction of the candidate matrices. On the other hand, the candidate matrices are also automatically learned to minimize the training loss on each training pair. Considering that such a data-adaptive projection may bring about over-fitting, we further introduce the concept of metric scale and employ the Total Variation (TV) regularizer to enforce that the embedded data produced by different candidate projection matrices are generally aligned to the same scale. Consequently, the solution space for learning the candidate projection matrices shrinks and the over-fitting problem caused by scale variations can be effectively alleviated. Thanks to the sparse selection of projection matrices and the operation of scale alignment, every data pair is able to acquire a suitable projection that yields a discriminative representation for further distance calculation. Therefore, our proposed method is termed Data-Adaptive Metric Learning (DAML). The main contributions of this paper are summarized below:

- We propose a novel metric learning framework dubbed DAML, which is able to learn adaptive projections for different data pairs to enhance the discriminability and flexibility of the learned metric.
- A kernelized version is devised to enable the DAML model to successfully handle non-linear cases, and an efficient optimization algorithm, which is guaranteed to converge, is designed to solve the proposed model.
- DAML is empirically validated on various typical datasets and the results suggest that DAML outperforms other state-of-the-art metric learning methodologies.

## The Proposed DAML Model

In this section, we first introduce some necessary notations. After that, we establish the basic DAML model in the linear space, and then extend it to a kernelized form to handle non-linear cases. Finally, we derive the Iterative Shrinkage-Thresholding Algorithm (ISTA) to solve the proposed optimization problem.
Throughout this paper, we write matrices as bold uppercase characters and vectors as bold lowercase characters. Tensors are written as Euler script uppercase characters. Let $y = (y_1, y_2, \dots, y_n)^\top$ be the label vector of the training data pairs $X = \{(x_1, x'_1), (x_2, x'_2), \dots, (x_n, x'_n)\}$ with $x_i, x'_i \in \mathbb{R}^d$, where $y_i = 1$ if $x_i$ and $x'_i$ are similar, and $y_i = 0$ otherwise. Here $d$ is the data dimensionality and $n$ is the total number of data pairs. Given a third-order tensor $\mathcal{P}$, the $k$-th slice of $\mathcal{P}$ is written as $P_{::k}$. The notations $\|\cdot\|_F$, $\|\cdot\|_1$, and $\|\cdot\|_2$ denote the Frobenius norm, $\ell_1$-norm, and $\ell_2$-norm, respectively. The Total Variation (TV) norm $\|a\|_{tv}$ for a vector $a \in \mathbb{R}^d$ is defined as $\sum_{i=1}^{d-1} |a_i - a_{i+1}|$. The signum function $\mathrm{sign}(z) = 1$ if $z \ge 0$, and $\mathrm{sign}(z) = -1$ otherwise.

Figure 2: Illustration of scale alignment. (a) $e_i$'s projection on $P_{::1}$. (b) $e_i$'s projection on $P_{::2}$. (c) Scale alignment. The lengths of the projected $e_i$ ($i = 1, 2, \dots, d$) by different projection matrices are summed up to the same value. Here $d$ is simply set to 3 for visualization.

### Linear DAML Model

Given the matrix $\hat{M} \in \mathbb{R}^{d \times d}$ in the traditional Mahalanobis distance, we know that $\hat{M}$ can be decomposed as $\hat{M} = \hat{P}^\top \hat{P}$, and thus the squared Mahalanobis distance $(x_i - x'_i)^\top \hat{M} (x_i - x'_i)$ between $x_i$ and $x'_i$ is equivalent to the Euclidean distance after their projections by $\hat{P}$, i.e.,

$$D_{\hat{P}}(x_i, x'_i) = \|\hat{P} x_i - \hat{P} x'_i\|_2^2, \tag{1}$$

where $\hat{P}$ is the projection matrix of size $r \times d$, and $r$ is the dimensionality of the projection results. Since only one global projection matrix $\hat{P}$ is adopted, the traditional methods are not sufficiently flexible for learning the similarities of all data pairs. To address this limitation, we build a data-adaptive metric learning scheme which automatically generates a suitable projection matrix for each of the $n$ data pairs $(x_i, x'_i)$ ($i = 1, 2, \dots, n$). As such, the local property of the dataset can be exploited and an improved metric can be finally learned. To be specific, we propose the following distance regarding the tensor $\mathcal{P}$, namely

$$D_{\mathcal{P}}(x_i, x'_i) = \|P(w_i) x_i - P(w_i) x'_i\|_2^2, \tag{2}$$

where $\mathcal{P}$ is a third-order tensor stacked in depth by a primitive projection matrix $P_{::0}$ and $c$ candidate matrices $P_{::1}, P_{::2}, \dots, P_{::c}$. For the $i$-th data pair, its specific projection matrix has the form

$$P(w_i) = P_{::0} + \lambda \sum_{k=1}^{c} w_{ik} P_{::k}, \tag{3}$$

where $w_i = (w_{i1}, w_{i2}, \dots, w_{ic})^\top$ and $w_{ik}$ is the weight of the candidate projection matrix $P_{::k}$ for constructing $P(w_i)$. The parameter $\lambda$ is tuned by the user, and $\lambda = 0$ degenerates our model to traditional metric learning. From Eq. (3), we see that the data-adaptive projection matrix $P(w_i)$ is decided by the sum of a primitive projection matrix $P_{::0}$ and a linear combination of the candidate projection matrices. Therefore, we have to jointly learn the tensor $\mathcal{P} \in \mathbb{R}^{r \times d \times (c+1)}$ shared by all data pairs as well as the weight vector $w_i$ for the $i$-th training data pair $(x_i, x'_i)$. By taking all $n$ training pairs into consideration and collecting all $w_i$ into a matrix $W = (w_1, w_2, \dots, w_n)$, the basic empirical loss for our DAML model is formed as

$$\mathcal{L}(\mathcal{P}, W) = \frac{1}{n} \sum_{i=1}^{n} l(D_{\mathcal{P}}(x_i, x'_i), y_i), \tag{4}$$

in which the function $l(D_{\mathcal{P}}(x_i, x'_i), y_i)$ evaluates the inconsistency between the label $y_i$ and the model prediction $D_{\mathcal{P}}(x_i, x'_i)$ for the data pair $(x_i, x'_i)$.
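To make the construction in Eqs. (2)–(4) concrete, the following NumPy sketch computes the data-adaptive projection matrix and the resulting pairwise distance. The function names, tensor layout, and the generic `loss_fn` argument are illustrative assumptions of ours rather than the authors' implementation.

```python
import numpy as np

def adaptive_projection(P, w, lam):
    """Eq. (3): P(w) = P_::0 + lam * sum_k w_k * P_::k.

    P   : array of shape (r, d, c+1); slice P[:, :, 0] is the primitive matrix.
    w   : array of shape (c,), selection weights for one data pair.
    lam : the user-tuned trade-off parameter lambda.
    """
    return P[:, :, 0] + lam * np.tensordot(P[:, :, 1:], w, axes=([2], [0]))

def adaptive_distance(P, w, lam, x, x_prime):
    """Eq. (2): squared Euclidean distance after the data-adaptive projection."""
    Pw = adaptive_projection(P, w, lam)
    diff = Pw @ (x - x_prime)
    return float(diff @ diff)

def empirical_loss(P, W, lam, pairs, labels, loss_fn):
    """Eq. (4): average pairwise loss over the n training pairs.

    W       : array of shape (c, n), one weight column per pair.
    pairs   : list of (x_i, x'_i) tuples.
    loss_fn : e.g. a squared hinge loss l(D_i, y_i).
    """
    n = len(pairs)
    return sum(loss_fn(adaptive_distance(P, W[:, i], lam, x, xp), y)
               for i, ((x, xp), y) in enumerate(zip(pairs, labels))) / n
```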
Note that, in practice, not all candidate projection matrices are needed to generate the data-adaptive projection matrix $P(w_i)$ for the $i$-th data pair, so we use the regularizer $\|W\|_1$ to encourage the algorithm to sparsely select a small subset of the candidate matrices $P_{::0}, P_{::1}, \dots, P_{::c}$ to suitably construct $P(w_i)$. The good news brought by the combination of candidate matrices is that the fitting ability of our model is enhanced. Nevertheless, the bad news is that if we merely minimize the loss function in Eq. (4) equipped with the sparse regularizer $\|W\|_1$ to learn our metric, over-fitting may occur due to the large number of entries in $\{P_{::0}, P_{::1}, \dots, P_{::c}\}$ to be learned. Therefore, we should find a way to constrain the final solution to a suitable hypothesis space. Ideally, the linear combination of the $c + 1$ candidate matrices in $\{P_{::0}, P_{::1}, \dots, P_{::c}\}$ can produce the exact label for every specific data pair when we minimize $l(D_{\mathcal{P}}(x_i, x'_i), y_i)$, which is undesirable and cannot yield a reasonable general metric. This is because the candidate matrices in $\{P_{::0}, P_{::1}, \dots, P_{::c}\}$ can produce results with arbitrary scales. To address the over-fitting problem caused by scale variations, we introduce the notion of metric scale $s(P_{::k})$ for $P_{::k}$ ($k = 0, 1, \dots, c$) and require all scales yielded by $\{P_{::0}, P_{::1}, \dots, P_{::c}\}$ to be as close as possible. To this end, we devise a TV regularizer on the scale vector $s(\mathcal{P}) = (s(P_{::0}), s(P_{::1}), \dots, s(P_{::c}))^\top \in \mathbb{R}^{c+1}$, such that the difference between any two of $\{s(P_{::0}), s(P_{::1}), \dots, s(P_{::c})\}$ is minimized. Since all projection matrices then adopt comparable scales to measure the distance between pairs of data points, the over-fitting caused by scale variations of the projection matrices can be effectively alleviated. Specifically, $s(P_{::k})$ is defined as the sum of the squared Mahalanobis distances between the projections of the orthonormal bases (i.e., $P_{::k} e_i$) and the origin, i.e.,

$$s(P_{::k}) = \sum_{i=1}^{d} D_{P_{::k}}(e_i, 0) = \sum_{i=1}^{d} \|P_{::k} e_i - 0\|_2^2, \tag{5}$$

where $e_i$ is the $i$-th orthonormal basis vector in $\mathbb{R}^d$. As shown in Fig. 2, the lengths of the projected $e_i$ ($i = 1, 2, \dots, d$) by different projection matrices are summed up to the same value so that the generated scales are perfectly aligned. Since Eq. (5) can be simplified to $s(P_{::k}) = \|P_{::k}\|_F^2$, the TV regularizer $\|s(\mathcal{P})\|_{tv}$ can be easily handled in the following optimization. By combining the above empirical loss in Eq. (4), the sparse regularizer $\|W\|_1$, and the TV regularizer on $s(\mathcal{P})$, our DAML model is formally formulated as

$$\min_{\mathcal{P},\, W}\ \mathcal{L}(\mathcal{P}, W) + \alpha \|s(\mathcal{P})\|_{tv} + \beta \|W\|_1, \tag{6}$$

in which the TV regularizer $\|s(\mathcal{P})\|_{tv}$ facilitates the scale alignment and the $\ell_1$-norm regularizer $\|W\|_1$ performs the sparse selection of candidate projection matrices for data-adaptive projections.

In the test stage, given a new test data pair $(z, z')$ from the $d$-dimensional example space, we need to decide the weights for selecting the candidate projection matrices, as the optimal weight matrix $W$ (denoted $W^*$) is learned only for the training data. Inspired by the regression mechanism in Low-Rank Representation (LRR) (Liu et al. 2013), here we employ linear regression to predict the weight vector $w_z \in \mathbb{R}^c$ for a new data pair $(z, z')$, i.e.,

$$w_z = Q(z + z'), \tag{7}$$

where $Q \in \mathbb{R}^{c \times d}$ is learned by linear regression, i.e., by minimizing $\|Q\bar{X} - W^*\|_F^2$ with $\bar{X} = (x_1 + x'_1, x_2 + x'_2, \dots, x_n + x'_n)$. Based on the predicted weight vector $w_z$, the distance between $z$ and $z'$ equals

$$D_{\mathcal{P}^*}(z, z') = \|P^*(w_z) z - P^*(w_z) z'\|_2^2, \tag{8}$$

in which $\mathcal{P}^*$ is learned by solving Eq. (6).
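As a quick illustration of the scale-alignment regularizer in Eqs. (5)–(6) and the test-stage weight regression in Eq. (7), here is a minimal NumPy sketch under our own naming assumptions; the least-squares solve stands in for whichever regression solver the authors actually used.

```python
import numpy as np

def scale_vector(P):
    """Metric scales s(P_::k) = ||P_::k||_F^2 for k = 0,...,c (simplified Eq. (5))."""
    # P has shape (r, d, c+1); sum squared entries over the first two axes.
    return np.sum(P ** 2, axis=(0, 1))

def tv_scale_penalty(P):
    """TV regularizer ||s(P)||_tv = sum_k |s(P_::k) - s(P_::(k+1))| used in Eq. (6)."""
    s = scale_vector(P)
    return float(np.sum(np.abs(np.diff(s))))

def fit_weight_regressor(X_bar, W_star):
    """Least-squares Q minimizing ||Q X_bar - W*||_F^2 (the regression behind Eq. (7)).

    X_bar  : (d, n) matrix whose i-th column is x_i + x'_i.
    W_star : (c, n) matrix of learned selection weights.
    """
    # Solve X_bar^T Q^T = W*^T in the least-squares sense.
    Q_T, *_ = np.linalg.lstsq(X_bar.T, W_star.T, rcond=None)
    return Q_T.T  # shape (c, d)

def predict_weights(Q, z, z_prime):
    """Eq. (7): w_z = Q (z + z') for a new test pair."""
    return Q @ (z + z_prime)
```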
### Kernelized DAML Model

In this part, we show that the linear DAML model proposed above can be easily extended to a kernelized form (denoted KDAML) to handle non-linear cases. A symmetric similarity function $\kappa$ is a kernel (Bishop 2006) if there exists a (possibly implicit) mapping function $\phi(\cdot): \mathcal{X} \to \mathcal{H}$ from the instance space $\mathcal{X}$ to a Hilbert space $\mathcal{H}$ such that $\kappa$ can be written as an inner product in $\mathcal{H}$, i.e.,

$$\kappa(x, x') = \phi(x)^\top \phi(x'), \tag{9}$$

where $x$ and $x'$ are examples from the instance space $\mathcal{X}$. To perform the kernel extension, we replace the examples $x_i$ and $x'_i$ with their feature mapping results $\phi(x_i)$ and $\phi(x'_i)$, and thus the kernelized distance $D^{\kappa}_{\mathcal{P}}(x_i, x'_i)$, which follows Eq. (2), is written as

$$D^{\kappa}_{\mathcal{P}}(x_i, x'_i) = \|P(w_i)\phi(x_i) - P(w_i)\phi(x'_i)\|_2^2, \tag{10}$$

in which the mapping results $\phi(x_i), \phi(x'_i) \in \mathbb{R}^h$ and $h$ is the dimensionality of the Hilbert space $\mathcal{H}$. Notice that in the above kernelized distance, the size of the candidate matrices $P_{::k}$ ($k = 0, 1, \dots, c$) is increased to $r \times h$ rather than the original $r \times d$. Therefore, it is unrealistic to directly learn the parameters of $P_{::k}$ within $\mathbb{R}^{r \times h}$, because the dimensionality $h$ of the Hilbert space is usually assumed to be very high or even infinite (Bishop 2006; Liu and Tsang 2017), which means that the large-scale matrix $P_{::k}$ cannot be computed within limited time. Therefore, according to (Weinberger and Tesauro 2007), we express the candidate matrix $P_{::k}$ in the following form regarding the mapping results $\phi(x_i)$ ($i = 1, 2, \dots, n$):

$$P_{::k} = R_{::k} \bar{\phi}^\top, \tag{11}$$

where $R_{::k} \in \mathbb{R}^{r \times n}$ and $\bar{\phi} = (\phi(x_1), \phi(x_2), \dots, \phi(x_n)) \in \mathbb{R}^{h \times n}$. After that, the problem is transformed into learning $R_{::0}, R_{::1}, \dots, R_{::c}$, whose sizes are independent of $h$, and thus the mathematical operations in the original high-dimensional Hilbert space are avoided. By further denoting the $n$-dimensional vectors

$$k_i = (\kappa(x_i, x_1), \kappa(x_i, x_2), \dots, \kappa(x_i, x_n))^\top, \quad k'_i = (\kappa(x'_i, x_1), \kappa(x'_i, x_2), \dots, \kappa(x'_i, x_n))^\top, \tag{12}$$

we have

$$
\begin{aligned}
D^{\kappa}_{\mathcal{R}}(x_i, x'_i) &= (\phi(x_i) - \phi(x'_i))^\top \Big(R_{::0}\bar{\phi}^\top + \lambda \sum_{k=1}^{c} w_{ik} R_{::k}\bar{\phi}^\top\Big)^{\!\top} \Big(R_{::0}\bar{\phi}^\top + \lambda \sum_{k=1}^{c} w_{ik} R_{::k}\bar{\phi}^\top\Big) (\phi(x_i) - \phi(x'_i)) \\
&= (\phi(x_i) - \phi(x'_i))^\top \bar{\phi}\, R(w_i)^\top R(w_i)\, \bar{\phi}^\top (\phi(x_i) - \phi(x'_i)) \\
&= (k_i - k'_i)^\top R(w_i)^\top R(w_i) (k_i - k'_i),
\end{aligned} \tag{13}
$$

in which the tensor $\mathcal{R} \in \mathbb{R}^{r \times n \times (c+1)}$ is stacked by the candidate matrices $R_{::0}, R_{::1}, \dots, R_{::c}$, and the matrix $R(w_i) \in \mathbb{R}^{r \times n}$ corresponds to the data-adaptive projection matrix $P(w_i)$ in Eq. (2). Compared with the linear DAML model, the additional step required by the kernelized DAML is that the vectors $k_i$ and $k'_i$ in Eqs. (12) and (13) should be pre-computed for the $i$-th pair. As long as the kernel $\kappa(\cdot, \cdot)$ is specified, we can finally obtain the kernelized distance via Eq. (13). In this paper, we adopt the Gaussian kernel function as $\kappa(\cdot, \cdot)$ due to its popularity and computational simplicity. Since the distance formulation of KDAML (i.e., Eq. (13)) shares an equivalent mathematical expression with that of linear DAML (i.e., Eq. (2)), it can be directly solved via the same optimization algorithm as the linear DAML. The optimization process is detailed in the next section.
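The kernelized distance in Eqs. (12)–(13) touches the data only through the kernel vectors $k_i$ and $k'_i$; the sketch below spells this out for a Gaussian kernel. The bandwidth `sigma`, the function names, and the tensor layout are our own illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2)); the bandwidth choice is ours."""
    diff = a - b
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def kernel_vector(x, X_train, sigma=1.0):
    """Eq. (12): k_x = (kappa(x, x_1), ..., kappa(x, x_n))^T."""
    return np.array([gaussian_kernel(x, xj, sigma) for xj in X_train])

def kernelized_distance(R, w, lam, x, x_prime, X_train, sigma=1.0):
    """Eq. (13): D = (k - k')^T R(w)^T R(w) (k - k'), with R(w) = R_::0 + lam * sum_k w_k R_::k.

    R : array of shape (r, n, c+1) stacking the candidate matrices R_::k.
    """
    Rw = R[:, :, 0] + lam * np.tensordot(R[:, :, 1:], w, axes=([2], [0]))
    k_diff = kernel_vector(x, X_train, sigma) - kernel_vector(x_prime, X_train, sigma)
    proj = Rw @ k_diff
    return float(proj @ proj)
```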
### Optimization

For the algorithm implementation, the loss function $l(D_i, y_i)$ in Eq. (4) can be the squared loss, the squared hinge loss, or other popular formulations that are continuous and have a derivative $l'_{D_i}(D_i, y_i)$ with respect to $D_i$ (here the notation $D_{\mathcal{P}}(x_i, x'_i)$ is abbreviated as $D_i$ for convenience). We then employ the Iterative Shrinkage-Thresholding Algorithm (ISTA) (Rolfs et al. 2012) to solve our problem in Eq. (6). The general ISTA solves a continuous optimization problem of the form

$$\min_{\theta}\ f(\theta) + g(\theta), \tag{14}$$

where $\theta$ is the optimization variable, $f(\theta)$ is differentiable, and $g(\theta)$ is usually non-smooth. The solution to Eq. (14) can be found by iteratively optimizing the following function, namely

$$\Gamma_L(\theta, \theta^{(t)}) = f(\theta^{(t)}) + (\theta - \theta^{(t)})^\top \nabla f(\theta^{(t)}) + \frac{L}{2}\|\theta - \theta^{(t)}\|_2^2 + g(\theta), \tag{15}$$

where $L$ is a Lipschitz constant, and $\nabla f(\theta^{(t)})$ is the gradient of $f$ at $\theta^{(t)}$. Eq. (15) admits a unique minimizer, which is

$$\Pi_L(\theta^{(t)}) = \arg\min_{\theta}\ g(\theta) + \frac{L}{2}\Big\|\theta - \Big(\theta^{(t)} - \frac{1}{L}\nabla f(\theta^{(t)})\Big)\Big\|_2^2, \tag{16}$$

where $L$ is usually manually tuned to satisfy

$$f(\Pi_L(\theta^{(t)})) + g(\Pi_L(\theta^{(t)})) \le \Gamma_L(\Pi_L(\theta^{(t)}), \theta^{(t)}). \tag{17}$$

To solve our model, we let $\theta = (\mathcal{P}, W)$, so $f(\theta)$ and $g(\theta)$ in our model are

$$f(\mathcal{P}, W) = \mathcal{L}(\mathcal{P}, W) + \alpha \|s(\mathcal{P})\|_{tv}, \tag{18}$$

and

$$g(W) = \beta \|W\|_1. \tag{19}$$

Then, in each iteration, we have to minimize the following function

$$J(\mathcal{P}, W) = \beta\|W\|_1 + \frac{L}{2}\Big\|W - \Big(W^{(t)} - \frac{1}{L}\nabla_W \mathcal{L}(\mathcal{P}^{(t)}, W^{(t)})\Big)\Big\|_F^2 + \frac{L}{2}\Big\|\mathcal{P} - \Big(\mathcal{P}^{(t)} - \frac{1}{L}\big(\nabla_{\mathcal{P}} \mathcal{L}(\mathcal{P}^{(t)}, W^{(t)}) + \alpha \nabla_{\mathcal{P}}\|s(\mathcal{P})\|_{tv}\big)\Big)\Big\|_F^2, \tag{20}$$

in which

$$\nabla_W \mathcal{L}(\mathcal{P}^{(t)}, W^{(t)}) = \frac{1}{n}\sum_{i=1}^{n} l'_{D_i}(D_i, y_i)\, \nabla_W D_i, \qquad \nabla_{\mathcal{P}} \mathcal{L}(\mathcal{P}^{(t)}, W^{(t)}) = \frac{1}{n}\sum_{i=1}^{n} l'_{D_i}(D_i, y_i)\, \nabla_{\mathcal{P}} D_i. \tag{21}$$

To minimize the above Eq. (20), we provide the computation results of $\nabla_{w_i} D_i$ and $\nabla_{P_{::k}} D_i$, respectively. By using the chain rule of derivatives, we can easily obtain

$$\nabla_{w_i} D_i = 2\,(A_{1:i}, A_{2:i}, \dots, A_{c:i})^\top \tilde{w}_i, \qquad \nabla_{P_{::k}} D_i = 2\sum_{j=0}^{c} \tilde{w}_{ik}\tilde{w}_{ij}\, P_{::j}\,\tilde{x}_i \tilde{x}_i^\top, \tag{22}$$

where $\tilde{w}_i = (1, \lambda w_{i1}, \lambda w_{i2}, \dots, \lambda w_{ic})^\top$ and $\tilde{x}_i = x_i - x'_i$, and the element $A_{kji}$ of the tensor $\mathcal{A} \in \mathbb{R}^{(c+1)\times(c+1)\times n}$ equals $x_i^\top P_{::k}^\top P_{::j} x_i + x_i'^\top P_{::k}^\top P_{::j} x'_i - 2 x_i^\top P_{::k}^\top P_{::j} x'_i$. The gradient of the TV regularizer is $\nabla_{P_{::k}}\|s(\mathcal{P})\|_{tv} = 2\,\mathrm{sign}(\|P_{::k}\|_F^2 - \|P_{::k+1}\|_F^2)\,P_{::k} + 2\,\mathrm{sign}(\|P_{::k}\|_F^2 - \|P_{::k-1}\|_F^2)\,P_{::k}$. Based on the above results, the minimizer of Eq. (20) (i.e., the iteration rule) can be summarized as

$$W^{(t+1)} = \mathcal{T}_{\frac{\beta}{L}}\Big(W^{(t)} - \frac{1}{L}\nabla_W \mathcal{L}(\mathcal{P}^{(t)}, W^{(t)})\Big), \qquad \mathcal{P}^{(t+1)} = \mathcal{P}^{(t)} - \frac{1}{L}\Big(\nabla_{\mathcal{P}} \mathcal{L}(\mathcal{P}^{(t)}, W^{(t)}) + \alpha \nabla_{\mathcal{P}}\|s(\mathcal{P})\|_{tv}\Big), \tag{23}$$

in which the soft-thresholding operator $\mathcal{T}_{\mu}(v)$ (Cai, Candès, and Shen 2010) is defined as

$$\mathcal{T}_{\mu}(v) = \begin{cases} v - \mu, & \text{if } v > \mu, \\ v + \mu, & \text{if } v < -\mu, \\ 0, & \text{otherwise.} \end{cases} \tag{24}$$

We summarize the training phase of DAML in Algorithm 1, where the update in Eq. (23) is denoted as

$$(\mathcal{P}^{(t+1)}, W^{(t+1)}) = \Pi_L(\mathcal{P}^{(t)}, W^{(t)}). \tag{25}$$

Based on the output of Algorithm 1, the testing steps for a new data pair are described in Algorithm 2. Since KDAML has an equivalent mathematical expression with linear DAML, the training procedure (Algorithm 1) and testing steps (Algorithm 2) are directly applicable to KDAML.

Algorithm 1: Solving Eq. (6) via ISTA.
- Input: Training data pairs $X = \{(x_i, x'_i) \mid 1 \le i \le n\}$; labels $y \in \{0, 1\}^n$; parameters $\alpha$, $\beta$, $\lambda$.
- Initialize: $t = 1$; $L^{(0)} = 1$; $\eta > 1$; $W^{(0)} = 0$; $\mathcal{P}^{(0)} = 0$.
- Repeat:
  1. Find the smallest nonnegative integer $i_t$ such that, with $\hat{L} = \eta^{i_t} L^{(t-1)}$, $f(\Pi_{\hat{L}}(\mathcal{P}^{(t)}, W^{(t)})) + g(\Pi_{\hat{L}}(\mathcal{P}^{(t)}, W^{(t)})) \le \Gamma_{\hat{L}}(\Pi_{\hat{L}}(\mathcal{P}^{(t)}, W^{(t)}), (\mathcal{P}^{(t)}, W^{(t)}))$.
  2. Set $L^{(t)} = \hat{L}$ and use Eq. (23) to update $(\mathcal{P}^{(t+1)}, W^{(t+1)}) = \Pi_{L^{(t)}}(\mathcal{P}^{(t)}, W^{(t)})$.
  3. Update $t = t + 1$.
- Until convergence.
- Output: The converged $\mathcal{P}^*$ and $W^*$.

Algorithm 2: Distance computation for a new test data pair.
- Input: Test data pair $(z, z')$; the learned projection tensor $\mathcal{P}^*$ and selection weights $W^*$.
- Initialize: Regression matrix $Q = W^* \bar{X}^\top (\bar{X}\bar{X}^\top)^{-1}$.
- Procedure:
  1. Predict the weights $w_z = Q(z + z')$.
  2. Compute the distance $D_{\mathcal{P}^*}(z, z') = \|P^*(w_z) z - P^*(w_z) z'\|_2^2$.
- Output: The predicted distance $D_{\mathcal{P}^*}(z, z')$.

Finally, we explain the convergence of Algorithm 1. Although the traditional ISTA is designed for convex optimization, extended convergence proofs for non-convex problems have been provided in prior works such as (Cui 2018). Therefore, the adopted optimization process is theoretically guaranteed to converge to a stationary point.
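For concreteness, here is a minimal sketch of the proximal update in Eqs. (23)–(24); the gradient arguments are assumed to be computed elsewhere (e.g., via Eqs. (21)–(22)), and the backtracking search over $L$ from Algorithm 1 is not shown.

```python
import numpy as np

def soft_threshold(V, mu):
    """Eq. (24): elementwise soft-thresholding operator T_mu."""
    return np.sign(V) * np.maximum(np.abs(V) - mu, 0.0)

def ista_step(P, W, grad_P_loss, grad_W_loss, grad_P_tv, alpha, beta, L):
    """One update of Eq. (23), given the gradients evaluated at (P, W).

    grad_W_loss : gradient of the empirical loss w.r.t. W, shape (c, n).
    grad_P_loss : gradient of the empirical loss w.r.t. the tensor P, shape (r, d, c+1).
    grad_P_tv   : gradient of the TV scale regularizer w.r.t. P, same shape as P.
    L           : current Lipschitz estimate from the backtracking search.
    """
    W_next = soft_threshold(W - grad_W_loss / L, beta / L)
    P_next = P - (grad_P_loss + alpha * grad_P_tv) / L
    return P_next, W_next
```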
## Experiments

In this section, intensive empirical investigations are conducted to validate the effectiveness of our proposed method. In detail, we compare the performance of the DAML and KDAML models with: 1) the classical linear metric learning methods ITML (Davis et al. 2007) and LMNN (Weinberger, Blitzer, and Saul 2006); 2) the DNN based metric learning method DDML (Hu, Lu, and Tan 2014); and 3) the state-of-the-art metric learning methods GMDRML (Luo and Huang 2018), LGMML (Su, King, and Lyu 2017), and AML (Chen et al. 2018). All methods are evaluated on retrieval, classification, and verification tasks. For the compared methods, we follow the authors' suggestions to choose the optimal parameters. For our tuning parameters, $c$ is fixed to 10, while $\alpha$, $\beta$, and $\lambda$ are all tuned by searching the grid $\{0, 0.2, 0.4, \dots, 2\}$ to obtain the best performances. We follow ITML and use the squared hinge loss as $l(D_{\mathcal{P}}(x_i, x'_i), y_i)$ in Eq. (6) for our objective function. The Gaussian kernel function (Bishop 2006) is employed for implementing KDAML.

### Experiments on Retrieval

Retrieval is one of the most typical applications of metric learning, which aims to search for the instances most similar to a query instance (Zhan et al. 2016). In our experiments, we use the PubFig face image dataset (Nair and Hinton 2010) and the Outdoor Scene Recognition (OSR) dataset (Parikh and Grauman 2011) to evaluate the capabilities of all compared methods.

In the first experiment, we use the cropped PubFig face image dataset, which includes 771 images from 8 individuals. We follow the experimental setting in (Luo and Huang 2018) and use 512-dimensional Dense Scale Invariant Feature Transform (DSIFT) features (Cheung and Hamarneh 2009) to represent each image. This experiment is run 5 times, where 30 images per person are randomly selected each time as the training data. Fig. 3(a) shows the retrieval results of two queries, and the average classification error rates of the top 5 retrieval results are presented in Tab. 1. It can be seen that our proposed DAML and KDAML achieve the lowest error rates, i.e., 12.95% and 12.89%, respectively. Moreover, the DNN based method DDML and the local method LGMML also yield good performances, but, as revealed in Tab. 1, they are still slightly worse than our DAML and KDAML models.

Figure 3: Results of 5 nearest neighbors on the PubFig and OSR datasets for image retrieval. (a) Results of 5 nearest neighbors when we query an image on the PubFig dataset. (b) Results of 5 nearest neighbors when we query an image on the OSR dataset. For each queried image, the first row shows the results of LGMML, and the second row shows the results of our method. A green box means that the retrieval result is correct, and a red box denotes that the retrieval result is incorrect.

Table 1: The classification error rates (%) of the top 5 retrieval results predicted by all methods on the PubFig and OSR datasets. The best result on each dataset is in bold. The t-test column indicates whether DAML and KDAML are significantly better than the best baseline method.

| Datasets | ITML | LMNN | GMDRML | DDML | LGMML | AML | DAML | KDAML | t-test |
|---|---|---|---|---|---|---|---|---|---|
| PubFig | 13.89 ± 0.12 | 13.65 ± 0.05 | 13.22 ± 0.34 | 13.23 ± 0.07 | 13.15 ± 0.14 | 13.89 ± 0.11 | 12.95 ± 0.09 | **12.89 ± 0.11** | |
| OSR | 24.59 ± 0.11 | 24.68 ± 0.13 | 23.51 ± 0.24 | 23.61 ± 0.17 | 24.39 ± 0.14 | 24.32 ± 0.21 | 23.41 ± 0.12 | **23.18 ± 0.22** | |

The second experiment is performed on the OSR dataset. It includes 2688 images from 8 scene categories, which are described by high-level attribute features (Huo, Nie, and Huang 2016).
We also use 30 images from each category as training data, and the remaining images are used as test data. We repeat this procedure 5 times and use the average error rate to evaluate all methods in Tab. 1. It can be found that our methods DAML and KDAML still obtain the best results, 23.41% and 23.18%, among all comparators. Fig. 3(b) shows the retrieval results of two queries, and it is clear that our method learns a better distance metric than the traditional Mahalanobis distance. Notably, we find that it is quite difficult to distinguish mountain and tall-building images when they contain a blue-sky background. Therefore, LGMML fails to consistently return the correct results (see the first row), while our method can still render satisfactory results. We also perform a t-test (significance level 0.05) to validate the superiority of our method over the best baseline.

### Experiments on Classification

To evaluate the performances of the compared methods on the classification task, we follow the existing work (Ye et al. 2017) and adopt the $k$-NN classifier ($k = 5$) based on the learned metrics to investigate the classification error rates of the various methods. The datasets are from the well-known UCI repository (Asuncion and Newman 2007) and include MNIST, Autompg, Sonar, Australia, Balance, Isolet, and Letters. We compare all methods over 20 random trials. In each trial, 80% of the examples are randomly selected as training examples, and the rest are used for testing. Following the recommendation in (Zadeh, Hosseini, and Sra 2016), the training pairs are generated by randomly picking $1000\,c(c-1)$ pairs among the training examples. The average classification error rates of the compared methods are shown in Tab. 2, and we find that DAML and KDAML obtain the best results. It can be noted that KDAML is generally better than DAML, as the kernel mapping improves the non-linear ability of our model. Among the compared methods, the main difference between ITML and our DAML is the form of their projection matrices: ITML employs a single fixed projection matrix, while our DAML utilizes multiple candidate matrices to generate data-adaptive projection matrices. From the experimental results in Tab. 2, we can clearly observe that DAML obtains significantly better performance than ITML, and therefore the proposed data-adaptive projection matrix is indeed useful for enhancing the accuracy of the learned metric.

Table 2: The classification error rates (%) of all methods on the MNIST, Autompg, Sonar, Australia, Balance, Isolet, and Letters datasets. The best result on each dataset is in bold. The t-test column indicates whether DAML and KDAML are significantly better than the best baseline method.

| Datasets | ITML | LMNN | GMDRML | DDML | LGMML | AML | DAML | KDAML | t-test |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | 14.31 ± 2.32 | 17.46 ± 5.32 | 12.21 ± 0.82 | 11.56 ± 0.07 | 12.19 ± 0.14 | 11.52 ± 0.27 | 11.34 ± 0.12 | **11.12 ± 0.21** | |
| Autompg | 26.62 ± 3.21 | 25.92 ± 3.32 | 24.32 ± 2.71 | 23.95 ± 1.52 | 23.56 ± 1.41 | 25.31 ± 3.22 | 23.95 ± 5.21 | **21.45 ± 1.21** | |
| Sonar | 17.02 ± 3.52 | 16.04 ± 5.31 | 17.12 ± 5.64 | 15.31 ± 2.56 | 21.35 ± 4.01 | 16.53 ± 3.51 | 15.64 ± 1.64 | **15.55 ± 3.65** | |
| Australia | 17.52 ± 2.13 | 15.51 ± 2.53 | 14.12 ± 2.04 | 15.12 ± 5.23 | 18.35 ± 1.84 | **12.95 ± 2.27** | 13.17 ± 1.97 | 13.11 ± 2.01 | |
| Balance | 9.31 ± 2.21 | 9.93 ± 1.62 | 6.56 ± 2.14 | 8.12 ± 1.97 | 7.36 ± 2.14 | 7.98 ± 2.11 | 6.21 ± 0.12 | **6.15 ± 0.17** | |
| IsoLet | 9.23 ± 1.12 | 3.23 ± 1.23 | 3.12 ± 0.34 | **2.68 ± 0.71** | 3.91 ± 1.14 | 2.91 ± 2.17 | 2.98 ± 1.15 | 2.79 ± 1.32 | |
| Letters | 6.24 ± 0.23 | 4.21 ± 2.05 | 3.54 ± 1.22 | 5.14 ± 1.04 | 4.56 ± 2.45 | 3.22 ± 1.23 | 3.11 ± 1.05 | **3.01 ± 1.23** | |
Figure 4: ROC curves (True Positive Rate vs. False Positive Rate) of different methods on (a) PubFig, (b) LFW, and (c) MVS datasets. AUC values are presented in the legends: (a) PubFig: ITML 0.67897, LMNN 0.72725, GMDRML 0.7532, DDML 0.86222, LGMML 0.87537, AML 0.90116, DAML 0.90746, KDAML 0.9242; (b) LFW: ITML 0.61852, LMNN 0.63748, GMDRML 0.65762, DDML 0.86406, LGMML 0.84448, AML 0.80242, DAML 0.87824, KDAML 0.8911; (c) MVS: ITML 0.56865, LMNN 0.60049, GMDRML 0.6191, DDML 0.65785, LGMML 0.7022, AML 0.71717, DAML 0.75373, KDAML 0.76998.

### Experiments on Verification

We use two face datasets and one image matching dataset to evaluate the capabilities of all compared methods on image verification. For the PubFig face dataset (as described before), the first 80% of the data pairs are selected for training and the rest are used for testing. Similar experiments are performed on the LFW face dataset (Huo, Nie, and Huang 2016), which includes 13233 unconstrained face images of 5749 individuals. The image matching dataset MVS (Brown, Hua, and Winder 2011) consists of over $3 \times 10^4$ gray-scale images sampled from 3D reconstructions of the Statue of Liberty (LY), Notre Dame (ND), and Half Dome in Yosemite (YO). Following the settings in (Zagoruyko and Komodakis 2015), LY and ND are put together to form a training set with over $10^5$ image patch pairs, and $10^4$ patch pairs in YO are used for testing. The adopted features are extracted by DSIFT (Cheung and Hamarneh 2009) for the face datasets (i.e., PubFig and LFW) and by Siamese-CNN (Zagoruyko and Komodakis 2015) for the image patch dataset (i.e., MVS). We plot the Receiver Operating Characteristic (ROC) curves by changing the thresholds of the different distance metrics. Then the Area Under Curve (AUC) values are calculated to quantitatively evaluate the performances of all comparators. From the ROC curves and AUC values in Fig. 4, it is clear that DAML and KDAML consistently outperform the other methods.

## Conclusion

In this paper, we propose a metric learning framework named Data-Adaptive Metric Learning (DAML), which generalizes the single global projection matrix of the traditional Mahalanobis distance to a local, data-adaptive form. The proposed projection matrix is a combination of a series of weighted candidate matrices for a specific data pair, in which the $\ell_1$-norm is employed to sparsely select a small subset of candidates to form a suitable combination. Meanwhile, a TV regularizer is utilized to align the scales produced by the candidates so that the over-fitting caused by arbitrary scale variations can be avoided. Furthermore, we show that such a linear data-adaptive metric can be easily kernelized to handle non-linear cases. The experimental results on various tasks show that the proposed DAML is able to flexibly discover local data properties and acquire a more reliable and precise metric than state-of-the-art metric learning methods. Since the proposed DAML framework is general in nature, it is promising to apply DAML to more manifold and DNN based metric learning models in future work.

## Acknowledgments

The authors would like to thank the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported by the National Science Fund (NSF) of China under Grant Nos.
U1713208, 61472187 and 61602246, the Program for Changjiang Scholars, the NSF of Jiangsu Province (No. BK20171430), the Fundamental Research Funds for the Central Universities (No. 30918011319), the Summit of the Six Top Talents Program (No. DZXX-027), the CAST Lift Program for Young Talents, and the Outstanding PhD of NJUST Program (No. AE88902).

## References

Asuncion, A., and Newman, D. 2007. UCI machine learning repository.

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.

Bohné, J.; Ying, Y.; Gentric, S.; and Pontil, M. 2014. Large margin local metric learning. In ECCV, 679–694. Springer.

Brown, M.; Hua, G.; and Winder, S. 2011. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1):43–57.

Cai, J.-F.; Candès, E. J.; and Shen, Z. 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4):1956–1982.

Chen, S.; Gong, C.; Yang, J.; Li, X.; Wei, Y.; and Li, J. 2018. Adversarial metric learning. In IJCAI, 123–131.

Cheung, W., and Hamarneh, G. 2009. n-SIFT: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing 18(9):2012–2021.

Cui, A. 2018. Iterative thresholding algorithm based on nonconvex method for modified lp-norm regularization minimization. arXiv preprint arXiv:1804.09385.

Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In ICML.

Duan, Y.; Zheng, W.; Lin, X.; Lu, J.; and Zhou, J. 2018. Deep adversarial metric learning. In CVPR, 2780–2789.

Harandi, M.; Salzmann, M.; and Hartley, R. 2017. Joint dimensionality reduction and metric learning: A geometric take. In ICML, 1943–1950.

Hu, J.; Lu, J.; and Tan, Y.-P. 2014. Discriminative deep metric learning for face verification in the wild. In CVPR.

Huang, Z.; Wang, R.; Shan, S.; and Chen, X. 2015. Projection metric learning on Grassmann manifold with application to video based face recognition. In CVPR, 140–149.

Huo, Z.; Nie, F.; and Huang, H. 2016. Robust and effective metric learning using capped trace norm. In KDD.

Liu, W., and Tsang, I. W. 2017. Making decision trees feasible in ultrahigh feature and label dimensions. The Journal of Machine Learning Research.

Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Liu, W.; Xu, D.; Tsang, I.; and Zhang, W. 2018. Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Liu, W.; Tsang, I. W.; and Müller, K.-R. 2017. An easy-to-hard learning paradigm for multiple classes and multiple labels. The Journal of Machine Learning Research.

Luo, L., and Huang, H. 2018. Matrix variate Gaussian mixture distribution steered robust metric learning. In AAAI.

Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML, 807–814.

Oh Song, H.; Xiang, Y.; Jegelka, S.; and Savarese, S. 2016. Deep metric learning via lifted structured feature embedding. In CVPR, 4004–4012.

Parikh, D., and Grauman, K. 2011. Relative attributes. In ICCV.

Rolfs, B.; Rajaratnam, B.; Guillot, D.; Wong, I.; and Maleki, A. 2012. Iterative thresholding algorithm for sparse inverse covariance estimation. In NIPS, 1574–1582.

Si, J.; Zhang, H.; Li, C.-G.; Kuen, J.; Kong, X.; Kot, A. C.; and Wang, G. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 132–141.
Su, Y.; King, I.; and Lyu, M. 2017. Learning to rank using localized geometric mean metrics. In SIGIR, 233–241.

Varior, R. R.; Haloi, M.; and Wang, G. 2016. Gated Siamese convolutional neural network architecture for human re-identification. In ECCV, 791–808. Springer.

Wang, J.; Kalousis, A.; and Woznica, A. 2012. Parametric local metric learning for nearest neighbor classification. In NIPS, 1601–1609.

Weinberger, K. Q., and Saul, L. K. 2008. Fast solvers and efficient implementations for distance metric learning. In ICML, 1160–1167. ACM.

Weinberger, K. Q., and Tesauro, G. 2007. Metric learning for kernel regression. In Artificial Intelligence and Statistics.

Weinberger, K. Q.; Blitzer, J.; and Saul, L. K. 2006. Distance metric learning for large margin nearest neighbor classification. In NIPS, 1473–1480.

Xing, E. P.; Jordan, M. I.; Russell, S. J.; and Ng, A. Y. 2003. Distance metric learning with application to clustering with side-information. In NIPS, 521–528.

Ye, H.-J.; Zhan, D.-C.; Si, X.-M.; and Jiang, Y. 2017. Learning Mahalanobis distance metric: Considering instance disturbance helps. In IJCAI, 866–872.

Zadeh, P.; Hosseini, R.; and Sra, S. 2016. Geometric mean metric learning. In ICML, 2464–2471.

Zagoruyko, S., and Komodakis, N. 2015. Learning to compare image patches via convolutional neural networks. In ICCV, 4353–4361.

Zhan, D.-C.; Li, M.; Li, Y.; and Zhou, Z. 2009. Learning instance specific distances using metric propagation. In ICML.

Zhan, D.-C.; Hu, P.; Chu, Z.; and Zhou, Z.-H. 2016. Learning expected hitting time distance. In AAAI, 2309–2314.