# Curvilinear Distance Metric Learning

Shuo Chen, Lei Luo, Jian Yang, Chen Gong, Jun Li, Heng Huang

Distance Metric Learning aims to learn an appropriate metric that faithfully measures the distance between two data points. Traditional metric learning methods usually calculate the pairwise distance with fixed distance functions (e.g., Euclidean distance) in the projected feature spaces. However, they fail to learn the underlying geometries of the sample space, and thus cannot exactly predict the intrinsic distances between data points. To address this issue, we first reveal that the traditional linear distance metric is equivalent to the cumulative arc length between the data pair's nearest points on the learned straight measurer lines. After that, by extending such straight lines to general curved forms, we propose a Curvilinear Distance Metric Learning (CDML) method, which adaptively learns the nonlinear geometries of the training data. By virtue of the Weierstrass theorem, the proposed CDML is equivalently parameterized with a 3-order tensor, and an optimization algorithm is designed to learn the tensor parameter. Theoretical analysis is derived to guarantee the effectiveness and soundness of CDML. Extensive experiments on synthetic and real-world datasets validate the superiority of our method over state-of-the-art metric learning models.

1 Introduction

The goal of a Distance Metric Learning (DML) algorithm is to learn a distance function for data pairs that measures their similarities. The learned distance metric successfully reflects the relationships within data points and significantly improves the performance of many subsequent learning tasks, such as classification [22], clustering [23], retrieval [24], and verification [14]. It has recently become an active research topic in the machine learning community [30, 29]. The well-studied DML methods are usually linear, namely Mahalanobis distance metric based models [23]. Under the supervision of pairwise similarities, they intend to learn a Semi-Positive Definite (SPD) matrix $M = PP^\top \in \mathbb{R}^{d \times d}$ that decides the squared parametric distance $\mathrm{Dist}^2_P(x, \hat{x}) = (x - \hat{x})^\top M (x - \hat{x})$ between data points $x$ and $\hat{x}$ in the $d$-dimensional space. It is notable that such a linear Mahalanobis distance is equivalent to the Euclidean distance in the $m$-dimensional feature space projected by $P \in \mathbb{R}^{d \times m}$. To learn the parameter $M$, intensive efforts have been devoted to designing various loss functions and constraints in optimization models. The early works Large Margin Nearest Neighbor (LMNN) [27] and Information-Theoretic Metric Learning (ITML) [7] utilized the must-link and cannot-link information to constrain the ranges of the learned distances.

S. Chen, J. Yang, and C. Gong are with the PCA Lab, Key Lab of Intelligent Perception and Systems for High Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (E-mail: {shuochen, csjyang, chen.gong}@njust.edu.cn). L. Luo and H. Huang are with the Electrical and Computer Engineering, University of Pittsburgh, and also with JD Finance America Corporation, USA (E-mail: lel94@pitt.edu, henghuanghh@gmail.com). J. Li is with the Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA (E-mail: junli@mit.edu). L. Luo and J. Yang are corresponding authors.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Figure 1: Conceptual illustrations of the Euclidean metric, the linear (Mahalanobis) metric, and our proposed curvilinear metric in three-dimensional space ((a) space of the Euclidean metric; (b) space of the linear metric; (c) space of the curvilinear metric (ours)). For a pair of data points (i.e., $x$ and $\hat{x}$) from the spatial surface, metrics find the nearest (calibration) points (i.e., $T(x)$ and $T(\hat{x})$) on each learned measurer line, and then use the arc length between these nearest points as the measured distance. With curved measurer lines, our method can measure the intrinsic curvilinear distance more exactly.

Instead of fixing the distance ranges in the objective, Geometric Mean Metric Learning (GMML) [31] proposed a geometric loss to jointly minimize the intra-class distances and maximize the inter-class distances as much as possible. Considering that the above methods using a single matrix $M$ are not flexible enough for complex data, the traditional Mahalanobis distance has been extended to a combination of multiple linear metrics [30]. Recently, the strategies of adversarial training [8] and collaborative training [20] were introduced in Adversarial Metric Learning (AML) [4] and Bilevel Distance Metric Learning (BDML) [29], respectively, which showed further improvements over the linear metric.

To enhance the flexibility of DML for fitting data pairs from nonlinear sample spaces, early works transferred the original data points to a high-dimensional kernel space by using traditional kernel methods [26]. Recently, the projection matrix $P$ of linear DML was extended to a nonlinear feature mapping $W(\cdot)$, implemented by typical Deep Neural Networks (DNNs) such as the Convolutional Neural Network (CNN) [32] and the Multi-Layer Perceptron (MLP) [10]. To further exploit the fitting capability of DNNs and characterize relative distances, the traditional pairwise loss was extended to multi-example forms, such as the N-pair loss [22] and the Angular loss [25]. It is worth pointing out that the above kernelized metrics and DNN based metrics are still calculated with the fixed Euclidean distance in the extracted feature space, which ignores the geometric structure of the sample space. To this end, some recent works proposed to learn the projection matrix $P$ on differential manifolds (e.g., the SPD manifold [33] and the Grassmann manifold [13]) to improve the representation capability on some specific nonlinear data structures. However, the geometries of the manifolds they use are usually pre-specified and cannot be learned to adapt to various nonlinear data, which remarkably hinders the generality of manifold based DML approaches.

Although the existing DML models above have achieved promising results to some extent, most of them fail to learn the spatial geometric structure of the sample space, and thus the obtained metrics cannot reflect the intrinsic curvilinear distances between data points. To address this challenging problem, in this paper we first present a new interpretation revealing that the traditional linear distance metric is equivalent to the cumulative arc length between the data pair's nearest points on the straight measurer lines (see Fig. 1(a) and (b)).
Such straight measurer lines can successfully learn the directions of real distances, but they are not capable of adapting to the curvilinear distance geometries of many nonlinear datasets. Therefore, we propose the Curvilinear Distance Metric Learning (CDML) model, which extends the straight measurer lines to general smooth curved lines (see Fig. 1(c)). Thanks to the generalized form of such curvilinear measurers, the geometries of the training data can be adaptively learned, so that nonlinear pairwise distances can be reasonably measured. We theoretically analyze the effectiveness of CDML by showing its fitting capability and generalization bound. Furthermore, we prove that the proposed curvilinear distance satisfies the topological definitions of the (pseudo-)metric, which demonstrates the geometric soundness of such a distance metric. The main contributions of this paper are summarized as: (I) We provide a new intuitive interpretation of traditional linear metric learning by explicitly formulating the measurer lines and the measurement process; (II) We propose a generalized metric learning model dubbed CDML that discovers the curvilinear distance hidden in nonlinear data, and a corresponding optimization algorithm is designed to solve the proposed model with a convergence guarantee; (III) A complete theoretical guarantee is established, analyzing the fitting capability, generalization bound, and topological property of CDML, thereby ensuring the model's effectiveness and soundness; (IV) CDML is experimentally validated to outperform state-of-the-art metric learning models on both synthetic and real-world datasets.

Notations. Throughout this paper, we write matrices, vectors, and 3-order tensors as bold uppercase characters, bold lowercase characters, and bold calligraphic uppercase characters, respectively. For a 3-order tensor $\mathcal{M}$, the notations $\mathcal{M}_{i::}$, $\mathcal{M}_{:i:}$, and $\mathcal{M}_{::i}$ denote the horizontal, lateral, and frontal slices, respectively; the tube fiber, row fiber, and column fiber are denoted as $\mathcal{M}_{ij:}$, $\mathcal{M}_{i:j}$, and $\mathcal{M}_{:ij}$. Let $y = (y_1, y_2, \ldots, y_N)$ be the label vector of the training data pairs $\mathcal{X} = \{(x_j, \hat{x}_j) \mid j = 1, 2, \ldots, N\}$ with $x_j, \hat{x}_j \in \mathbb{R}^d$, where $y_j = 1$ if $x_j$ and $\hat{x}_j$ are similar and $y_j = 0$ otherwise. Here $d$ is the data dimensionality, and $N$ is the total number of data pairs. The operators $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the vector $\ell_2$-norm and the matrix (tensor) Frobenius norm, respectively. The notation $\mathbb{N}_n = \{1, 2, \ldots, n\}$ for any $n \in \mathbb{N}$.

2 Curvilinear Distance Metric Learning

In this section, we first present a new geometric interpretation of traditional linear metric learning models. Then the Curvilinear Distance Metric Learning (CDML) model is formulated based on this interpretation. An optimization algorithm is designed to solve CDML with a convergence guarantee.

2.1 A Geometric Interpretation for Linear Metric

It is well known that the linear distance metric (i.e., the squared Mahalanobis distance) [30, 29] between two given data points $x, \hat{x} \in \mathbb{R}^d$ is defined as

$$\mathrm{Dist}^2_P(x, \hat{x}) = (x - \hat{x})^\top M (x - \hat{x}) = (x - \hat{x})^\top P P^\top (x - \hat{x}), \quad (1)$$

where the matrix $M = PP^\top \in \mathbb{R}^{d \times d}$ is assumed to be SPD. In previous works, the above linear distance metric is usually interpreted as the Euclidean distance in the projection space, where the projection matrix $P \in \mathbb{R}^{d \times m}$ plays the role of feature extraction [28]. Here $d$ and $m$ are the dimensionalities of the original sample space and the extracted feature space, respectively.
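As a quick sanity check of this equivalence, the following NumPy sketch (our illustration rather than code from the paper) compares Eq. (1) with the squared Euclidean distance between the projected features $P^\top x$ and $P^\top \hat{x}$ on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3
P = rng.standard_normal((d, m))              # projection matrix; M = P P^T (PSD here since m < d)
x, x_hat = rng.standard_normal(d), rng.standard_normal(d)

# Eq. (1): squared Mahalanobis distance with M = P P^T ...
dist_mahalanobis = (x - x_hat) @ P @ P.T @ (x - x_hat)
# ... equals the squared Euclidean distance between the projected features P^T x and P^T x_hat.
dist_projected = np.sum((P.T @ x - P.T @ x_hat) ** 2)
print(np.isclose(dist_mahalanobis, dist_projected))   # True
```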
Although such an interpretation offers a friendly way toward model extensions, it does not make clear why the linear distance metric fails to characterize the curvilinear distances on nonlinear data. We now present a new understanding of the linear distance metric based on its measurement process, which offers a clear way to handle nonlinear geometric data. We denote $p_i \in \mathbb{R}^d$ as the $i$-th column of $P$. By using the rule of inner products, Eq. (1) equals the following cumulative form¹

$$\sum_{i=1}^{m} \|p_i\|_2^2 \, \|x - \hat{x}\|_2^2 \cos^2\langle p_i, x - \hat{x}\rangle = \sum_{i=1}^{m} \|p_i\|_2^2 \, \| p_i T_i(x) - p_i T_i(\hat{x}) \|_2^2, \quad (2)$$

where $T_i(x)$ and $T_i(\hat{x})$ are the projection points of $x$ and $\hat{x}$, which satisfy $(p_i T_i(x) - x)^\top p_i = 0$ and $(p_i T_i(\hat{x}) - \hat{x})^\top p_i = 0$, respectively.

¹For calculation details, $\mathrm{Dist}^2_P(x, \hat{x}) = (x - \hat{x})^\top P P^\top (x - \hat{x}) = \|P^\top (x - \hat{x})\|_2^2 = \|(p_1^\top (x - \hat{x}), p_2^\top (x - \hat{x}), \ldots, p_m^\top (x - \hat{x}))\|_2^2 = \sum_{i=1}^{m} (p_i^\top (x - \hat{x}))^2 = \sum_{i=1}^{m} \|p_i\|_2^2 \|x - \hat{x}\|_2^2 \cos^2\langle p_i, x - \hat{x}\rangle$.

After that, the $\ell_2$-norm distance $\|p_i T_i(x) - p_i T_i(\hat{x})\|_2$ is equivalently converted to the arc length from $T_i(x)$ to $T_i(\hat{x})$ on the straight line $z = p_i t$ ($t \in \mathbb{R}$), and thus the squared Mahalanobis distance is rewritten as

$$\mathrm{Dist}^2_P(x, \hat{x}) = \sum_{i=1}^{m} \|p_i\|_2^2 \left( \int_{T_i(x)}^{T_i(\hat{x})} \|p_i\|_2 \, dt \right)^2, \quad (3)$$

where the integral value is the arc length of the straight measurer line $z = p_i t$ from $T_i(x)$ to $T_i(\hat{x})$. Here the weight $\|p_i\|_2^2$ is regarded as the scale of the measurer line, which equals the squared unit arc length from 0 to 1. By using the convexity of $g(t) = \|p_i t - x\|_2^2$, the orthogonality condition $(p_i T_i(x) - x)^\top p_i = 0$ is equivalent to finding the nearest point $T_i(x)$ on the measurer line, namely

$$T_i(x) = \arg\min_{t \in \mathbb{R}} \|p_i t - x\|_2^2. \quad (4)$$

Based on the results of Eq. (3) and Eq. (4), we can clearly observe that the Mahalanobis distance of the data pair $\{x, \hat{x}\}$ is intrinsically computed as the cumulative arc length between $\{x, \hat{x}\}$'s nearest points on the learned measurer lines $z = p_i t$, as shown in Fig. 1. This reveals that linear metrics merely learn the rough directions of the real distances, yet cannot capture the complex data geometry.
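The measurement process of Eqs. (2)-(4) can be reproduced numerically. The sketch below (our illustration, not the authors' code) computes the nearest points $T_i(x) = p_i^\top x / \|p_i\|_2^2$ in closed form and accumulates the weighted squared arc lengths; the result coincides with the Mahalanobis distance of Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3
P = rng.standard_normal((d, m))                       # columns p_i define the measurer lines z = p_i * t
x, x_hat = rng.standard_normal(d), rng.standard_normal(d)

dist_eq1 = (x - x_hat) @ P @ P.T @ (x - x_hat)        # Eq. (1)

dist_eq3 = 0.0                                        # Eq. (3): weighted squared arc lengths
for i in range(m):
    p = P[:, i]
    # Nearest points of Eq. (4): orthogonal projections onto the line z = p * t.
    t_x, t_xhat = (p @ x) / (p @ p), (p @ x_hat) / (p @ p)
    # Arc length on the straight measurer line from T_i(x) to T_i(x_hat).
    arc = np.linalg.norm(p) * abs(t_x - t_xhat)
    # The scale ||p_i||_2^2 is the squared unit arc length from 0 to 1.
    dist_eq3 += (p @ p) * arc ** 2

print(np.isclose(dist_eq1, dist_eq3))                 # True: the two views coincide
```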
2.2 Model Establishment

As mentioned before, traditional metrics learn the direction $p_i$ of the measurer line $z = p_i t$ in the $d$-dimensional sample space. However, such a straight line is far from adapting to complex nonlinear data in the real world. We thus use a general vector-valued function $\theta_i: \mathbb{R} \to \mathbb{R}^d$ to extend the straight line $z = p_i t$ ($t \in \mathbb{R}$) to the smooth curved line $\tilde{z} = \theta_i(t)$ ($t \in \mathbb{R}$). Specifically, it can be written as

$$\tilde{z} = (\tilde{z}_1, \tilde{z}_2, \ldots, \tilde{z}_d) = (\theta_{i1}(t), \theta_{i2}(t), \ldots, \theta_{id}(t)) = \theta_i(t), \quad (5)$$

where the smooth function $\theta_{ik}(t)$ is the $k$-th element of the vector-valued function $\theta_i(t)$. It should be noticed that such a curved line is also assumed to be zero-crossing, i.e., $\theta_i(0) = 0$, which is consistent with the linear distance metric. Then the nearest point $T_i(x)$ defined in Eq. (4) can be easily extended to the nearest point set $\mathcal{N}_{\theta_i}(x)$, and we naturally have

$$\mathcal{N}_{\theta_i}(x) = \arg\min_{t \in \mathbb{R}} \|\theta_i(t) - x\|_2^2. \quad (6)$$

Nevertheless, the point set $\mathcal{N}_{\theta_i}(x)$ might contain more than one element, so we simply use the smallest element of $\mathcal{N}_{\theta_i}(x)$, as described in Definition 1.

Definition 1. For a data point $x \in \mathbb{R}^d$ and a curved line $\theta_i$, we define the calibration point $T_{\theta_i}(x)$ as $T_{\theta_i}(x) = \arg\min_{t \in \mathcal{N}_{\theta_i}(x)} t$, where $\mathcal{N}_{\theta_i}(x)$ includes all nearest points of $x$ on the curved line $\theta_i(t)$.

According to the interpretation offered in Section 2.1, the curvilinear distance is consistently regarded as the cumulative arc length (see Fig. 1(c)). Here we follow the common formula of arc length in calculus [21], which is given in Definition 2.

Definition 2. The arc length from $T_1$ to $T_2$ on the curved line $\theta_i$ is defined as

$$\mathrm{Length}_{\theta_i}(T_1, T_2) = \int_{\min(T_1, T_2)}^{\max(T_1, T_2)} \|\theta_i'(t)\|_2 \, dt, \quad (7)$$

where the derivative vector $\theta_i'(t) = (\theta_{i1}'(t), \theta_{i2}'(t), \ldots, \theta_{id}'(t))$.

Based on the above definitions of the arc length and the calibration point, the traditional linear distance metric (i.e., Eq. (3)) is easily extended to a general curvilinear form. Specifically, the squared curvilinear distance between data points $x$ and $\hat{x}$ is calculated by

$$\mathrm{Dist}^2_\Theta(x, \hat{x}) = \sum_{i=1}^{m} s_{\theta_i} \, \mathrm{Length}^2_{\theta_i}\big(T_{\theta_i}(x), T_{\theta_i}(\hat{x})\big), \quad (8)$$

where $\Theta = (\theta_1, \theta_2, \ldots, \theta_m)$ is the learning parameter of the curvilinear distance metric, and $m$ is the number of measurer lines. Here the scale value $s_{\theta_i} = \mathrm{Length}^2_{\theta_i}(0, 1)$. When we use the empirical risk loss to learn $\Theta$, the objective of Curvilinear Distance Metric Learning (CDML) is formulated as

$$\min_{\Theta \in \mathcal{F}^m} \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}\big(\mathrm{Dist}^2_\Theta(x_j, \hat{x}_j); y_j\big) + \lambda \mathcal{R}(\Theta), \quad (9)$$

where $\mathcal{F}^m = \{(\theta_1, \theta_2, \ldots, \theta_m) \mid \theta_{ik}(0) = 0 \text{ and } \theta_{ik}(t) \text{ is smooth for } i \in \mathbb{N}_m \text{ and } k \in \mathbb{N}_d\}$, and the regularization parameter $\lambda > 0$ is tuned by users. In the above objective, the loss function $\mathcal{L}$ evaluates the inconsistency between the predicted distances and their similarity labels, and the regularizer $\mathcal{R}$ is used to reduce over-fitting.

Implementation of $\Theta$. As the learning parameter $\Theta$ of CDML (i.e., Eq. (9)) is in an abstract form and cannot be directly solved, we have to give a concrete form of each curved line $\theta_i(t)$ for learning tasks. Here we employ a polynomial function to approximate $\theta_i(t)$, due to the guarantee of infinite approximation described in Theorem 1. It is notable that $\theta_i(t)$ can also be infinitely approximated in other ways, such as Fourier series, deep neural networks, and piecewise linear functions [16]. Without loss of generality, we employ the polynomial function in this paper.

Theorem 1 (Weierstrass Approximation [21]). Assume that the vector-valued function $\theta_i(t) \in \mathbb{R}^d$ ($i = 1, 2, \ldots, m$) is continuous and zero-crossing on $[a, b]$. Then for any $\epsilon > 0$ and $t \in [a, b]$, there exists a $c$-order polynomial vector-valued function

$$\mathcal{M}_{i::}(t) = \Big(\sum_{k=1}^{c} \mathcal{M}_{i1k} t^k, \; \sum_{k=1}^{c} \mathcal{M}_{i2k} t^k, \; \ldots, \; \sum_{k=1}^{c} \mathcal{M}_{idk} t^k\Big), \quad (10)$$

such that $\sum_{i=1}^{m} \|\theta_i(t) - \mathcal{M}_{i::}(t)\|_2 < \epsilon$, where the 3-order tensor $\mathcal{M} = [\mathcal{M}_{ijk}] \in \mathbb{R}^{m \times d \times c}$.

We let $\theta_i(t) := \mathcal{M}_{i::}(t)$ for $i = 1, 2, \ldots, m$, and then the abstract parameter $\Theta$ in Eq. (9) can be materialized by the tensor $\mathcal{M}$ with infinite approximation in the following optimization objective

$$\min_{\mathcal{M} \in \mathbb{R}^{m \times d \times c}} \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}\big(\mathrm{Dist}^2_{\mathcal{M}}(x_j, \hat{x}_j); y_j\big) + \lambda \mathcal{R}(\mathcal{M}). \quad (11)$$

We thus successfully convert the abstract optimization problem of Eq. (9) to the above concrete form Eq. (11) w.r.t. the tensor $\mathcal{M} \in \mathbb{R}^{m \times d \times c}$, which can be easily solved by existing algorithms [11].

Algorithm 1 Solving Eq. (11) via Stochastic Gradient Descent.
Input: training data pairs $\mathcal{X} = \{(x_j, \hat{x}_j) \mid j \in \mathbb{N}_N\}$; labels $y \in \{0, 1\}^N$; batch size $h$; learning rate $\rho$; regularization parameter $\lambda$; tensor size parameters $c$ and $m$.
Initialize: $k = 1$; $\mathcal{M}^{(1)} = 0$.
Repeat:
1). Uniformly randomly pick $h$ data pairs $\{(x_{b_j}, \hat{x}_{b_j}) \mid j \in \mathbb{N}_h\}$ from $\mathcal{X}$.
2). Compute the calibration points $T_{\mathcal{M}_{i::}}(x_{b_j})$ and $T_{\mathcal{M}_{i::}}(\hat{x}_{b_j})$ for $i = 1, 2, \ldots, m$ by solving the real roots of $f_i'(t) = 0$ in Eq. (13).
3). Update the learning parameter $\mathcal{M}$ by
$$\mathcal{M}^{(k+1)} := \mathcal{M}^{(k)} - \rho\Big(\tfrac{1}{h}\sum_{j=1}^{h} \mathcal{L}_j' \, \nabla_{\mathcal{M}^{(k)}} \mathrm{Dist}^2_{\mathcal{M}^{(k)}}(x_{b_j}, \hat{x}_{b_j}) + \lambda \nabla_{\mathcal{M}^{(k)}} \mathcal{R}(\mathcal{M}^{(k)})\Big). \quad (12)$$
4). Update $k := k + 1$.
Until convergence.
Output: the converged $\mathcal{M}^*$.
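To make the parameterized distance of Eqs. (8), (10), and (13) concrete, here is a minimal NumPy sketch (our illustration, not released code) that evaluates $\mathrm{Dist}^2_{\mathcal{M}}(x, \hat{x})$ for a given tensor $\mathcal{M} \in \mathbb{R}^{m \times d \times c}$: the calibration point is taken as the smallest global minimizer among the real roots of $f_i'(t) = 0$ (Definition 1), and arc lengths are approximated by the Riemann sum of Eq. (14) below. The helper names and numerical tolerances are our own choices.

```python
import numpy as np

def calibration_point(M_i, x):
    """Smallest global minimizer of f_i(t) = ||theta_i(t) - x||^2 over the real roots of
    f_i'(t) = 0, where theta_i(t) = sum_k M_i[:, k-1] * t^k (Definition 1 with Eq. (13))."""
    d, c = M_i.shape
    coeffs = np.zeros(2 * c + 1)                   # coefficients of f_i(t), highest degree first
    for j in range(1, c + 1):
        for k in range(1, c + 1):
            coeffs[2 * c - (j + k)] += M_i[:, j - 1] @ M_i[:, k - 1]
        coeffs[2 * c - j] -= 2.0 * (M_i[:, j - 1] @ x)
    coeffs[2 * c] = x @ x
    roots = np.roots(np.polyder(coeffs))           # stationary points of f_i
    candidates = np.real(roots[np.abs(np.imag(roots)) < 1e-8])
    values = np.polyval(coeffs, candidates)
    return float(candidates[np.isclose(values, values.min())].min())   # smallest minimizer

def arc_length(M_i, t1, t2, L=1000):
    """Riemann-sum approximation of Length(T1, T2) in Eqs. (7)/(14)."""
    lo, hi = min(t1, t2), max(t1, t2)
    dt = (hi - lo) / L
    dM = M_i * np.arange(1, M_i.shape[1] + 1)      # theta_i'(t) has columns k * M_i[:, k-1]
    length = 0.0
    for l in range(L + 1):
        t = lo + l * dt
        powers = np.array([t ** k for k in range(M_i.shape[1])])   # (1, t, ..., t^{c-1})
        length += np.linalg.norm(dM @ powers) * dt
    return length

def curvilinear_dist_sq(M, x, x_hat):
    """Squared curvilinear distance of Eq. (8) with theta_i parameterized by the slice M[i]."""
    total = 0.0
    for M_i in M:                                  # M has shape (m, d, c)
        s_i = arc_length(M_i, 0.0, 1.0) ** 2       # scale s_i = Length^2(0, 1)
        t_x, t_xhat = calibration_point(M_i, x), calibration_point(M_i, x_hat)
        total += s_i * arc_length(M_i, t_x, t_xhat) ** 2
    return total
```

As a sanity check, setting $c = 1$ reduces every measurer curve to a straight line, and the sketch then recovers (up to the arc-length discretization) the Mahalanobis distance of Section 2.1.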
2.3 Optimization Algorithm

Since the pair number $N$ is usually large in Eq. (11), we use the Stochastic Gradient Descent (SGD) method to solve it. As we know, the central operation in SGD is the gradient computation of the objective function. Here we only need to provide the gradient of $\mathrm{Dist}^2_{\mathcal{M}}(x, \hat{x})$, which mainly depends on the calibration points $T_{\mathcal{M}_{i::}}(x)$, $T_{\mathcal{M}_{i::}}(\hat{x})$ and the arc length $\mathrm{Length}_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))$. According to Definition 1, the calibration point $T_{\mathcal{M}_{i::}}(x)$ can be directly obtained from the nearest point set $\mathcal{N}_{\mathcal{M}_{i::}}(x) = \{t^* \mid f_i(t^*) \le f_i(\hat{t}), \text{ and } t^*, \hat{t} \in \Gamma_i\}$, where the polynomial function is

$$f_i(t) = \|\mathcal{M}_{i::}(t) - x\|_2^2 = \sum_{j, k = 1}^{c} (\mathcal{M}_{i:j}^\top \mathcal{M}_{i:k}) \, t^{j + k} - 2 \sum_{k=1}^{c} \mathcal{M}_{i:k}^\top x \, t^k + x^\top x, \quad (13)$$

and $\Gamma_i$ is the set of real roots of the polynomial equation $f_i'(t) = 0$. Here the real roots of $f_i'(t) = 0$ can be efficiently found by simple numerical algorithms [17], whose computational complexity depends on neither the number of training data pairs $N$ nor the feature dimensionality $d$. By using the definition of the integral, the arc length is equivalently converted to

$$\mathrm{Length}_{\mathcal{M}_{i::}}(T_1, T_2) = \int_{\min(T_1, T_2)}^{\max(T_1, T_2)} \|\mathcal{M}_{i::}'(t)\|_2 \, dt = \lim_{L \to \infty} \sum_{l=0}^{L} \|\mathcal{M}_{i::}'(\min(T_1, T_2) + l \Delta t)\|_2 \, \Delta t, \quad (14)$$

where $\Delta t = |T_1 - T_2| / L$. In practice, we fix $L$ to a large number (e.g., $L = 10^3$) and thus obtain a good approximation $G_{\mathcal{M}_{i::}}(T_1, T_2) := \big(\sum_{l=0}^{L} \|\mathcal{M}_{i::}'(\min(T_1, T_2) + l \Delta t)\|_2 \, \Delta t\big)^2$ of the squared arc length value $\mathrm{Length}^2_{\mathcal{M}_{i::}}(T_1, T_2)$. According to the chain rule, the gradient of $\mathrm{Dist}^2_{\mathcal{M}}$ w.r.t. $\mathcal{M}$ is obtained as²

$$\nabla_{\mathcal{M}_{i::}} \mathrm{Dist}^2_{\mathcal{M}}(x, \hat{x}) = \nabla_{\mathcal{M}_{i::}} G_{\mathcal{M}_{i::}}(0, 1) \, G_{\mathcal{M}_{i::}}\big(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x})\big) + G_{\mathcal{M}_{i::}}(0, 1) \, \widetilde{\nabla}_{\mathcal{M}_{i::}} G_{\mathcal{M}_{i::}}\big(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x})\big). \quad (15)$$

Note that the gradient of $G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))$ does not necessarily exist, because the intermediate variable $T_{\mathcal{M}_{i::}}(x)$ is not differentiable w.r.t. $\mathcal{M}_{i::}$. Therefore, we use the smoothed gradient³ instead of the original gradient of $G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))$. The optimization algorithm for solving Eq. (11) is summarized in Algorithm 1.

Convergence. Notice that the main difference between Algorithm 1 and traditional SGD is that we utilize a smoothed gradient instead of the original gradient. Previous works have proved that the smoothed gradient still ensures that the SGD algorithm converges to a stationary point [12, 2].

²$\nabla_{\mathcal{M}_{i::}} G_{\mathcal{M}_{i::}}(0, 1) = 2 \sqrt{G_{\mathcal{M}_{i::}}(0, 1)} \sum_{l=0}^{L} \frac{\mathcal{M}_{i::}'(l/L) \, \big((l/L)^1, (l/L)^2, \ldots, (l/L)^c\big)}{\|\mathcal{M}_{i::}'(l/L)\|_2}$.

³$\widetilde{\nabla}_{\mathcal{M}_{i::}} G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x})) = \frac{\Delta_{\mathcal{M}_{i::}}}{\mu \|\Delta_{\mathcal{M}_{i::}}\|_2^2} \big(G_{\mathcal{M}_{i::} + \mu \Delta_{\mathcal{M}_{i::}}}(T_{\mathcal{M}_{i::} + \mu \Delta_{\mathcal{M}_{i::}}}(x), T_{\mathcal{M}_{i::} + \mu \Delta_{\mathcal{M}_{i::}}}(\hat{x})) - G(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))\big)$, where the perturbation $\Delta_{\mathcal{M}_{i::}} \in \mathbb{R}^{d \times c}$ is generated from the standard normal distribution, and $\mu > 0$ is a given small number used to smooth the gradient.
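Footnote 3 can be read as a one-sample zeroth-order gradient estimator. Below is a minimal sketch of it (reusing calibration_point and arc_length from the sketch in Section 2.2; the function names and the default $\mu$ are our assumptions): the squared arc-length term $G$ is re-evaluated at randomly perturbed parameters, so no differentiability of the calibration points w.r.t. $\mathcal{M}_{i::}$ is required.

```python
import numpy as np

def G(M_i, x, x_hat):
    """Squared (approximate) arc length between the calibration points of x and x_hat on theta_i."""
    t_x, t_xhat = calibration_point(M_i, x), calibration_point(M_i, x_hat)
    return arc_length(M_i, t_x, t_xhat) ** 2

def smoothed_grad_G(M_i, x, x_hat, mu=1e-4, rng=np.random.default_rng(0)):
    """One-sample smoothed gradient of footnote 3: Delta / (mu * ||Delta||_2^2) times the change
    of G when M_i is perturbed to M_i + mu * Delta, with Delta drawn from a standard normal."""
    delta = rng.standard_normal(M_i.shape)                    # Gaussian perturbation direction
    diff = G(M_i + mu * delta, x, x_hat) - G(M_i, x, x_hat)   # finite difference along delta
    return delta / (mu * np.sum(delta ** 2)) * diff
```

In Algorithm 1, this estimate stands in for the second (non-differentiable) term of Eq. (15), which is the only change with respect to standard SGD.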
3 Theoretical Analysis

In this section, we provide theoretical results on the fitting capability, generalization bound, and topological property of CDML. All proofs of the theorems are given in the supplementary materials.

3.1 Fitting Capability

We first reveal that the curvilinear distance learned by Eq. (11) is capable of distinguishing the similarities of all training data pairs. We assume that the similar and dissimilar pair sets $\mathcal{X}_{\mathrm{Similar}}$ and $\mathcal{X}_{\mathrm{Dissimilar}}$ form a partition of the whole training set $\mathcal{X}$, where the similarity label $y_{(x, \hat{x})} = 1$ if $(x, \hat{x}) \in \mathcal{X}_{\mathrm{Similar}}$, and $y_{(x, \hat{x})} = 0$ if $(x, \hat{x}) \in \mathcal{X}_{\mathrm{Dissimilar}}$. Then the conclusion is described as follows.

Theorem 2. For any given $\mathrm{margin} > 0$, there exist $m, c \in \mathbb{N}$ and $\widetilde{\mathcal{M}} \in \mathbb{R}^{m \times d \times c}$ such that

$$\mathrm{Dist}_{\widetilde{\mathcal{M}}}(\beta, \hat{\beta}) - \mathrm{Dist}_{\widetilde{\mathcal{M}}}(\alpha, \hat{\alpha}) > \mathrm{margin}, \quad (16)$$

where $(\alpha, \hat{\alpha}) \in \mathcal{X}_{\mathrm{Similar}}$ and $(\beta, \hat{\beta}) \in \mathcal{X}_{\mathrm{Dissimilar}}$.

From the above result, we know that a well-learned curvilinear distance correctly predicts the similarities of data pairs in the training set $\mathcal{X}$, which ensures that the inter-class distances are always greater than the intra-class distances. In most metric learning models, the loss functions are designed to reward larger inter-class distances and smaller intra-class distances. It means that the distance $\mathrm{Dist}_{\widetilde{\mathcal{M}}}(\cdot, \cdot)$ in Eq. (16) can be successfully achieved by minimizing the loss functions. Therefore, the fitting capability of the curvilinear distance can be reliably guaranteed by the parameter tensor $\mathcal{M}$.

3.2 Generalization Bound

We further analyze the effectiveness of CDML by offering the generalization bound of the solution to Eq. (11). Such a bound evaluates the bias between the generalization error $\varepsilon(\mathcal{M}) := \mathbb{E}_{(x, \hat{x}) \sim \mathcal{D}}\big(\mathcal{L}(\mathrm{Dist}^2_{\mathcal{M}}(x, \hat{x}); y_{(x, \hat{x})})\big)$ and the empirical error $\varepsilon_{\mathcal{X}}(\mathcal{M}) := \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}(\mathrm{Dist}^2_{\mathcal{M}}(x_j, \hat{x}_j); y_j)$, where $\mathcal{D}$ is the real data distribution and $\mathbb{E}(\cdot)$ denotes the expectation. We simply use the squared tensor Frobenius norm [18] for regularization and have the following conclusion.

Theorem 3. Assume that $\mathcal{R}(\mathcal{M}) = \|\mathcal{M}\|_F^2 = \sum_{i, j, k} (\mathcal{M}_{ijk})^2$ and $\mathcal{M}^* \in \mathbb{R}^{m \times d \times c}$ is the solution to Eq. (11). Then, for any $0 < \delta < 1$, with probability $1 - \delta$,

$$\varepsilon(\mathcal{M}^*) - \varepsilon_{\mathcal{X}}(\mathcal{M}^*) \le C_{\mathcal{X}} \sqrt{2 \ln(1/\delta) / N} + B_\lambda R_N(\mathcal{L}), \quad (17)$$

where $B_\lambda \to 0$ as $\lambda \to +\infty$.⁴ Here $R_N(\mathcal{L})$ is the Rademacher complexity of the loss function $\mathcal{L}$ related to the space $\mathbb{R}^{m \times d \times c}$ for $N$ training pairs, and $C_{\mathcal{X}} = \max_{k \in \mathbb{N}_N} |\mathcal{L}(\mathrm{Dist}^2_{\mathcal{M}^*}(x_k, \hat{x}_k); y_k)|$.

In Eq. (17), the first term of the upper bound vanishes as the number of training data pairs $N$ increases. We can also find that the second term converges to 0 as $\lambda$ increases, which means the regularizer $\mathcal{R}(\mathcal{M})$ effectively improves the generalization ability of CDML.

3.3 Topological Property

In general topology, a metric⁵ is defined as a distance function satisfying the non-negativity, symmetry, triangle, and coincidence properties [23, 9]. As an extended metric, the pseudo-metric merely has the first three properties, as revealed in [19]. Here we prove that the curvilinear distance defined in Eq. (8) satisfies these topological definitions, and thus its geometric soundness is guaranteed.

Theorem 4. For the curvilinear distance $\mathrm{Dist}_\Theta(x, \hat{x})$ and its corresponding parameter $\Theta$, we denote $\Theta'(\tau) = (\theta_1'(\tau_1), \theta_2'(\tau_2), \ldots, \theta_m'(\tau_m)) \in \mathbb{R}^{d \times m}$ and have that: 1) $\mathrm{Dist}_\Theta(x, \hat{x})$ is a pseudo-metric for any $\Theta \in \mathcal{F}^m$; 2) $\mathrm{Dist}_\Theta(x, \hat{x})$ is a metric if $\Theta'(\tau)$ has full row rank for any $\tau = (\tau_1, \tau_2, \ldots, \tau_m) \in \mathbb{R}^m$.

⁴Here $B_\lambda = 2 \mathbb{E}_{\mathcal{X}, \mathcal{Z}}[\sup_{\mathcal{M} \in \mathcal{F}(\lambda)} \varepsilon_{\mathcal{Z}}(\mathcal{M}) - \varepsilon_{\mathcal{X}}(\mathcal{M})] / \mathbb{E}_{\mathcal{X}, \mathcal{Z}}[\sup_{\mathcal{M} \in \mathbb{R}^{m \times d \times c}} \varepsilon_{\mathcal{Z}}(\mathcal{M}) - \varepsilon_{\mathcal{X}}(\mathcal{M})]$, and $\mathcal{F}(\lambda)$ is a shrinking hypothesis space induced by the regularizer $\mathcal{R}(\mathcal{M})$.

⁵The distance function $\mathrm{Dist}(\cdot, \cdot)$ is a metric if and only if it satisfies the following four conditions for any $\alpha_1, \alpha_2, \alpha_3 \in \mathbb{R}^d$: (I) Non-negativity: $\mathrm{Dist}(\alpha_1, \alpha_2) \ge 0$; (II) Symmetry: $\mathrm{Dist}(\alpha_1, \alpha_2) = \mathrm{Dist}(\alpha_2, \alpha_1)$; (III) Triangle inequality: $\mathrm{Dist}(\alpha_1, \alpha_2) + \mathrm{Dist}(\alpha_2, \alpha_3) \ge \mathrm{Dist}(\alpha_1, \alpha_3)$; (IV) Coincidence: $\mathrm{Dist}(\alpha_1, \alpha_2) = 0 \Leftrightarrow \alpha_1 = \alpha_2$.

Figure 2: Visualizations of the measurer lines learned by CDML in two-dimensional space. The grayscale denotes the distance from the origin to the current point of the learned measurer line.

Table 1: Classification error rates (%, mean ± std) of all methods on the synthetic datasets Intersected Lines, Concentric Circles, and Nested Moons.
| Datasets | LMNN [27] | ITML [7] | Npair [22] | Angular [25] | ODML [28] | CERML [14] | CDML |
|---|---|---|---|---|---|---|---|
| Intersected Lines | 14.33 ± 1.21 | 17.46 ± 2.11 | 8.52 ± 0.99 | 8.10 ± 3.24 | 10.52 ± 2.17 | 6.21 ± 1.92 | 5.12 ± 1.13 |
| Concentric Circles | 16.62 ± 2.14 | 15.92 ± 3.12 | 9.13 ± 1.51 | 8.98 ± 1.89 | 11.31 ± 2.23 | 10.32 ± 2.16 | 6.95 ± 1.41 |
| Nested Moons | 17.02 ± 2.23 | 12.04 ± 2.14 | 9.22 ± 1.89 | 10.12 ± 2.09 | 15.12 ± 1.98 | 11.12 ± 2.41 | 8.64 ± 2.45 |

Notice that the above Theorem 4 reaches the same conclusion as the traditional linear distance metric when $\theta_i(t)$ is specialized to $p_i t$, and thus this result is a generalization of the corresponding property of the linear model [23]. In fact, most real-world data indeed possess the non-negativity, symmetry, triangle, and coincidence properties. Hence this theorem clearly tells us that the basic geometric characteristics of real-world data can be well preserved by the curvilinear distance metric.

4 Experimental Results

In this section, we show experimental results on both synthetic and real-world benchmark datasets to validate the effectiveness of CDML. We first visualize the learning results of CDML on synthetic datasets. After that, we compare the classification and verification accuracy of CDML with two classical metric learning methods (LMNN [27] and ITML [7]) and four state-of-the-art methods (Npair loss [22], Angular loss [25], ODML [28], and CERML [14]). Here LMNN, ITML, and ODML are linear and the others are nonlinear. In our experiments, the parameters $\lambda$ and $c$ are fixed to 1.2 and 10, respectively. The SGD parameters $h$ and $\rho$ are fixed to $10^3$ and $10^{-3}$, respectively. We follow ITML and use the squared hinge loss and the squared Frobenius norm as $\mathcal{L}$ and $\mathcal{R}$ in Eq. (11), respectively.

4.1 Visualizations on Synthetic Datasets

To demonstrate the model's effectiveness on nonlinear data, we first visualize the learning results of CDML on the nonlinear synthetic datasets Intersected Lines, Concentric Circles, and Nested Moons [5]. Each dataset contains more than 300 data points across 2 classes in two-dimensional space. On each dataset, 60% of the data is randomly selected for training, and the rest is used for testing. The measurer line count $m$ is fixed to 1 to clearly visualize the learned results. As illustrated in Fig. 2, the learned measurer lines are plotted as gray lines, whose gray-level denotes the arc-length distance from the origin to the current point. According to the definition in Eq. (6), we can clearly observe that the nearest points of data points from the two classes are distributed apart on the two sides of the measurer lines (i.e., the low gray-level and high gray-level sides). Therefore, such measurement results correctly predict large values for inter-class distances and small values for intra-class distances. Furthermore, the test error rates of all compared methods are shown in Table 1, and we find that DDML, PML, and CDML obtained superior results over the other methods due to their non-linearity. Meanwhile, CDML achieves relatively lower error rates than the deep neural network based model (DDML) and the manifold based model (PML) on the three datasets, which validates the superiority of our method.

4.2 Comparisons on Classification Datasets

To evaluate the performance of all compared methods on the classification task, we adopt the $k$-NN classifier ($k = 5$) based on the learned metrics to investigate the classification error rates of the various methods. The datasets are from the well-known UCI machine learning repository [1], including MNIST, Autompg, Sonar, Australia, Hayes-r, Glass, Segment, Balance, Isolet, and Letters.
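For reference, a minimal sketch of this evaluation protocol is given below (our illustration, assuming integer class labels stored in NumPy arrays; it is not the authors' evaluation code). Any learned squared distance, e.g. curvilinear_dist_sq from Section 2.2 with a trained tensor, can be plugged in as dist_sq_fn.

```python
import numpy as np

def knn_error_rate(dist_sq_fn, X_train, y_train, X_test, y_test, k=5):
    """Error rate of a k-NN classifier (k = 5 in Section 4.2) that ranks the training points
    by a learned pairwise squared distance dist_sq_fn(x, x_hat)."""
    errors = 0
    for x, y in zip(X_test, y_test):
        dists = np.array([dist_sq_fn(x, x_tr) for x_tr in X_train])
        neighbors = y_train[np.argsort(dists)[:k]]   # labels of the k nearest training points
        y_pred = np.bincount(neighbors).argmax()     # majority vote among the neighbors
        errors += int(y_pred != y)
    return errors / len(X_test)

# Hypothetical usage with an already-trained tensor M_learned of shape (m, d, c):
# error = knn_error_rate(lambda a, b: curvilinear_dist_sq(M_learned, a, b),
#                        X_train, y_train, X_test, y_test)
```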
Table 2: Classification error rates (%, mean ± std) of all methods on real-world datasets. The last row lists the Win/Tie/Loss counts of CDML against the other methods under a t-test at significance level 0.05.

| Datasets | LMNN [27] | ITML [7] | Npair [22] | Angular [25] | ODML [28] | CERML [14] | CDML |
|---|---|---|---|---|---|---|---|
| MNIST | 17.46 ± 5.32 | 14.32 ± 7.32 | 11.56 ± 1.07 | 12.26 ± 5.12 | 12.12 ± 5.22 | 13.36 ± 2.32 | 8.12 ± 3.64 |
| Autompg | 25.92 ± 3.32 | 26.62 ± 3.21 | 21.95 ± 1.52 | 19.02 ± 3.01 | 25.32 ± 5.32 | 26.36 ± 3.02 | 15.32 ± 6.11 |
| Sonar | 16.04 ± 5.31 | 18.02 ± 3.52 | 15.31 ± 2.56 | 16.86 ± 1.21 | 17.95 ± 6.78 | 19.21 ± 6.33 | 15.40 ± 3.64 |
| Australia | 15.51 ± 2.53 | 17.52 ± 2.13 | 15.12 ± 5.11 | 15.54 ± 1.23 | 16.23 ± 4.12 | 18.26 ± 6.22 | 12.22 ± 2.54 |
| Hayes-r | 30.46 ± 7.32 | 34.24 ± 6.32 | 24.36 ± 2.17 | 23.12 ± 1.37 | 29.76 ± 1.07 | 30.12 ± 5.32 | 25.15 ± 5.23 |
| Glass | 30.12 ± 2.32 | 29.11 ± 3.28 | 22.32 ± 4.72 | 23.02 ± 1.22 | 28.26 ± 1.22 | 29.11 ± 0.12 | 22.12 ± 4.64 |
| Segment | 2.73 ± 0.82 | 5.16 ± 2.22 | 8.77 ± 0.32 | 4.11 ± 1.22 | 3.76 ± 1.34 | 5.36 ± 3.12 | 1.23 ± 0.32 |
| Balance | 9.93 ± 1.62 | 9.31 ± 2.21 | 8.12 ± 1.97 | 7.12 ± 2.22 | 8.63 ± 2.22 | 9.45 ± 5.45 | 5.01 ± 2.64 |
| Isolet | 3.23 ± 1.32 | 9.23 ± 2.32 | 5.43 ± 2.12 | 5.49 ± 1.12 | 2.68 ± 0.72 | 7.26 ± 2.32 | 3.12 ± 1.64 |
| Letters | 4.21 ± 2.05 | 6.24 ± 0.32 | 5.14 ± 1.04 | 4.67 ± 1.82 | 4.88 ± 0.82 | 5.32 ± 2.22 | 2.09 ± 0.64 |
| W/T/L | 8/2/0 | 9/1/0 | 5/5/0 | 6/4/0 | 8/2/0 | 8/2/0 | |

We compare all methods over 20 random trials. In each trial, 80% of the examples are randomly selected as training examples, and the rest are used for testing. The training pairs are generated by randomly picking $1000K(K-1)$ pairs among the training examples [31], where $K$ is the number of classes. Here the measurer line count $m$ is fixed to the feature dimensionality (i.e., $d$). The average classification error rates of all compared methods are shown in Table 2. We also perform the t-test (significance level 0.05) to validate the superiority of our method over all baseline methods on each dataset. From the experimental results, we can observe that CDML obtains significant improvements over the linear metric learning models, which demonstrates the usefulness of our proposed curvilinear generalization. Furthermore, the statistical records of average error rates and t-test results reliably validate the superiority of our method over the other baseline methods.

4.3 Comparisons on Verification Datasets

We use two face datasets and one image matching dataset to evaluate the capabilities of all compared methods on image verification. The PubFig face dataset includes $2 \times 10^4$ image pairs belonging to 140 people [15], in which the first 80% of the data pairs are selected for training and the rest are used for testing. Similar experiments are performed on the LFW face dataset, which includes 13,233 unconstrained face images [15]. The MVS dataset [3] consists of over $3 \times 10^4$ image patches sampled from 3D objects, in which $10^5$ pairs are selected to form the training set and $10^4$ pairs are used for testing. The adopted features are extracted by DSIFT [6] for the face datasets and by Siamese-CNN [32] for the image patch dataset.

Figure 3: ROC curves of all methods on the three datasets, with AUC values presented in the legends ((a) PubFig: LMNN 0.68429, ITML 0.73196, Npair 0.88111, Angular 0.90928, ODML 0.86952, CERML 0.75923, CDML 0.92908; (b): LMNN 0.6564, ITML 0.63667, Npair 0.87167, Angular 0.82805, ODML 0.889, CERML 0.67924, CDML 0.90741; (c): LMNN 0.57987, ITML 0.61335, Npair 0.73184, Angular 0.75125, ODML 0.71639, CERML 0.63211, CDML 0.78559).
We plot the Receiver Operating Characteristic (ROC) curve by varying the threshold of each distance metric. Then the Area Under the Curve (AUC) values are calculated to quantitatively evaluate the performance of all comparators. From the ROC curves and AUC values in Fig. 3, it is clear that DDML and CDML consistently outperform the other methods. In comparison, CDML obtains better results than the best baseline method DDML on all three datasets.

5 Conclusion

In this paper, we introduced a new insight into the mechanism of metric learning models, where the measured distance is naturally interpreted as the arc length between nearest points on the learned straight measurer lines. We extended such straight measurer lines to general curved lines to further learn the intrinsic geometries of the training data. To the best of our knowledge, this is the first metric learning work that adaptively constructs the geometric relations between data points. We provided theoretical analyses to show that the proposed framework applies well to general nonlinear data. Visualizations on toy data indicate that the learned measurer lines capture the underlying rules, thus making the learning algorithm acquire a more reliable and precise metric than the state-of-the-art methods.

Acknowledgment

S.C., J.Y., and C.G. were supported by the National Science Fund (NSF) of China under Grants U1713208, 61602246, and 61973162, the Program for Changjiang Scholars, 111 Program AH92005, the Fundamental Research Funds for the Central Universities (No. 30918011319), the NSF of Jiangsu Province (No. BK20171430), the Young Elite Scientists Sponsorship Program by Jiangsu Province, and the Young Elite Scientists Sponsorship Program by CAST (No. 2018QNRC001). L.L. and H.H. were supported by U.S. NSF-IIS 1836945, NSF-IIS 1836938, NSF-DBI 1836866, NSF-IIS 1845666, NSF-IIS 1852606, NSF-IIS 1838627, and NSF-IIS 1837956.

References

[1] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
[2] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In NeurIPS, 2018.
[3] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 2011.
[4] Shuo Chen, Chen Gong, Jian Yang, Xiang Li, Yang Wei, and Jun Li. Adversarial metric learning. In IJCAI, 2018.
[5] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.
[6] W Cheung and G Hamarneh. n-SIFT: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing, 18(9), 2009.
[7] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, 2007.
[8] Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, and Jie Zhou. Deep adversarial metric learning. In CVPR, 2018.
[9] Søren Hauberg, Oren Freifeld, and Michael J Black. A geometric take on metric learning. In NeurIPS, 2012.
[10] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Discriminative deep metric learning for face verification in the wild. In CVPR, 2014.
[11] Feihu Huang, Songcan Chen, and Heng Huang. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In ICML, 2019.
[12] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. In AAAI, 2019.
[13] Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Projection metric learning on Grassmann manifold with application to video based face recognition. In CVPR, 2015.
[14] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Luc Van Gool, and Xilin Chen. Cross Euclidean-to-Riemannian metric learning with application to face recognition from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[15] Zhouyuan Huo, Feiping Nie, and Heng Huang. Robust and effective metric learning using capped trace norm: Metric learning via capped trace norm. In KDD. ACM, 2016.
[16] Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 2005.
[17] John M McNamee and Victor Pan. Numerical Methods for Roots of Polynomials. Newnes, 2013.
[18] Carl D Meyer. Matrix Analysis and Applied Linear Algebra, volume 71. SIAM, 2000.
[19] Benjamin Paaßen, Claudio Gallicchio, Alessio Micheli, and Barbara Hammer. Tree edit distance learning via adaptive symbol embeddings. In ICML, 2018.
[20] Joel B Predd, Sanjeev R Kulkarni, and H Vincent Poor. A collaborative training algorithm for distributed learning. IEEE Transactions on Information Theory, 2009.
[21] Walter Rudin et al. Principles of Mathematical Analysis. McGraw-Hill, New York, 1964.
[22] Kihyuk Sohn. Improved deep metric learning with multi-class N-pair loss objective. In NeurIPS, 2016.
[23] Juan Luis Suárez, Salvador García, and Francisco Herrera. A tutorial on distance metric learning: Mathematical foundations, algorithms and software. arXiv:1812.05944, 2018.
[24] Fang Wang, Le Kang, and Yi Li. Sketch-based 3D shape retrieval using convolutional neural networks. In CVPR, 2015.
[25] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In ICCV, 2017.
[26] Jun Wang, Huyen T Do, Adam Woznica, and Alexandros Kalousis. Metric learning with multiple kernels. In NeurIPS, 2011.
[27] Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In NeurIPS, 2006.
[28] Pengtao Xie, Wei Wu, Yichen Zhu, and Eric P Xing. Orthogonality-promoting distance metric learning: convex relaxation and theoretical analysis. In ICML, 2018.
[29] Jie Xu, Lei Luo, Cheng Deng, and Heng Huang. Bilevel distance metric learning for robust image recognition. In NeurIPS, 2018.
[30] Han-Jia Ye, De-Chuan Zhan, Xue-Min Si, Yuan Jiang, and Zhi-Hua Zhou. What makes objects similar: a unified multi-metric learning approach. In NeurIPS, 2016.
[31] Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In ICML, 2016.
[32] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In ICCV, 2015.
[33] Pengfei Zhu, Hao Cheng, Qinghua Hu, Qilong Wang, and Changqing Zhang. Towards generalized and efficient metric learning on Riemannian manifold. In IJCAI, 2018.