EMD Metric Learning

Zizhao Zhang, Yubo Zhang, Xibin Zhao*, Yue Gao*
Key Laboratory for Information System Security, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology
School of Software, Tsinghua University, China.
{zz-zh14,zhangyb17}@mails.tsinghua.edu.cn, {zxb,gaoyue}@tsinghua.edu.cn
* indicates corresponding authors

Abstract

The Earth Mover's Distance (EMD), which measures many-to-many distances, has shown its superiority and been widely applied in computer vision tasks such as object recognition, hyperspectral image classification, and gesture recognition. However, little effort has been devoted to optimizing the EMD metric for better matching performance. To tackle this issue, we propose an EMD metric learning algorithm. The objective is to learn a discriminative distance metric for EMD ground distance matrix generation that better measures the similarity between compared subjects. More specifically, given a group of labeled data from different categories, we first select a subset of training data and then optimize the metric for ground distance matrix generation. Both the EMD metric and the EMD flow-network are alternately optimized until a steady EMD value is achieved. This method generates a discriminative ground distance matrix that further improves the EMD distance measurement. We then apply our EMD metric learning method to two tasks, i.e., multi-view object classification and document classification. The experimental results show better performance of the proposed method compared with the traditional EMD method and state-of-the-art methods. The proposed EMD metric learning method can also be used in other applications.
Introduction

Variable-size descriptions of distributions, such as Gaussian Mixture Models (GMMs) (Li, Wang, and Zhang 2013), have been widely used to represent multidimensional distributions in a compact way. These descriptors can be regarded as signatures, i.e., sets of the main clusters of a distribution. Each cluster is represented by a single point (the cluster center) in the underlying space, together with a weight that denotes the size or importance of that cluster. This is a commonly used data description method, and accordingly there are many methods to measure the distance between two signatures, such as KL divergence (Goldberger, Gordon, and Greenspan 2003), Jensen-Shannon divergence (Endres and Schindelin 2003), and maximum mean discrepancy (Borgwardt et al. 2006).

Figure 1: The illustration of the Earth Mover's Distance, where fea_i is the feature for the center of P_i and w_i is the corresponding weight of P_i. d_{ij} is the cost of shipping a unit of supply from P_i to Q_j, and f_{ij} is the corresponding flow.

Among these methods, the Earth Mover's Distance (EMD) (Rubner, Tomasi, and Guibas 2000) is a general and flexible way to measure the dissimilarity between two signatures; it is a special case of the Wasserstein distances (Villani 2009) in its continuous form. EMD has shown superior performance on the many-to-many matching problem and has been used in many computer vision tasks, such as object recognition (Zhang et al. 2007), hyperspectral image classification (Sun et al. 2015), and gesture recognition (Wang and Chan 2014).

The original EMD is based on a solution to the classical transportation problem, as shown in Figure 1, which can be formalized as a linear programming problem: given a set of suppliers P = {P_1, P_2, ..., P_n}, a set of consumers Q = {Q_1, Q_2, ..., Q_m}, and a ground distance matrix D = {d_{ij}} whose element d_{ij} defines the cost of shipping a unit of supply from P_i to Q_j, the aim is to find an optimal set F of flows f_{ij}, i.e., the amount of supply shipped from the i-th supplier P_i to the j-th consumer Q_j, that minimizes the overall cost \sum_{i=1}^{n} \sum_{j=1}^{m} d_{ij} f_{ij}, subject to the following constraints:

f_{ij} \ge 0 \quad (1 \le i \le n, \; 1 \le j \le m),
\sum_{i=1}^{n} f_{ij} \le w_{Q_j} \quad (1 \le j \le m),
\sum_{j=1}^{m} f_{ij} \le w_{P_i} \quad (1 \le i \le n),   (1)

where w_{P_i} is the supply of the i-th supplier and w_{Q_j} is the capacity of the j-th consumer.

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Figure 2: The illustration of ground distance metric optimization for EMD distance measurement in (Wang and Guibas 2012).

Once the transportation problem is solved, the EMD distance is defined as the cost normalized by the total flow:

EMD(P, Q) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} d_{ij} f_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij}}.   (2)

Compared with other distance measures, EMD is able to account for cross-bin information and can be applied to signatures of different sizes. Moreover, EMD generates a flow-network representing how the mass in the clusters is transported between two signatures and thus calculates an optimal distance between them. As the EMD definition shows, the ground distance matrix between two signatures plays an important role in the EMD flow optimization procedure. A well-generated ground distance matrix leads to a better flow, and thus an even better EMD distance measurement. In a simple case, if the cluster centers of all the signatures are the same (i.e., position-fixed), a signature degenerates into a histogram descriptor. In this scenario, the optimization task for a better EMD distance measurement is to directly learn an optimal ground distance matrix, which defines the distance between each pair of cluster centers, as shown in Figure 2.
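Concretely, the transportation problem of Eqs. (1)-(2) can be solved with any off-the-shelf linear programming solver. The following is a minimal sketch (not the authors' implementation) using NumPy and SciPy; the function name `emd` and its input conventions are our own:

```python
import numpy as np
from scipy.optimize import linprog

def emd(D, w_p, w_q):
    """Solve the transportation LP of Eqs. (1)-(2).

    D:   (n, m) ground distance matrix
    w_p: (n,) supplies, w_q: (m,) capacities
    Returns the normalized EMD value and the optimal flow matrix."""
    n, m = D.shape
    c = D.ravel()                      # flows f_ij flattened row-major
    A_ub = np.zeros((n + m, n * m))
    for i in range(n):                 # sum_j f_ij <= w_p[i]
        A_ub[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                 # sum_i f_ij <= w_q[j]
        A_ub[n + j, j::m] = 1.0
    b_ub = np.concatenate([w_p, w_q])
    # The optimal total flow equals min(sum w_p, sum w_q).
    A_eq = np.ones((1, n * m))
    b_eq = [min(w_p.sum(), w_q.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None))
    flow = res.x.reshape(n, m)
    return (D * flow).sum() / flow.sum(), flow
```

Dedicated implementations such as the POT (Python Optimal Transport) library are far more efficient in practice; the LP above is only meant to make Eqs. (1)-(2) concrete.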
To this end, a supervised EMD distance learning method (Wang and Guibas 2012) was introduced to learn an optimal ground distance matrix among the representing histograms. In particular, a triplet constraint was used, and the trained ground distance matrix better reflects the cross-bin relationships, producing more accurate EMD distances and better performance in the application of face verification. However, this method only works when all subjects are represented by fixed-size histograms. In practice, it cannot be guaranteed that every subject can be described by a fixed-size histogram, which limits the applicability of the ground distance matrix learning method. Moreover, the learned ground distance matrix in (Wang and Guibas 2012) may contain negative elements, which could lead to negative EMD distances. We note that, in practice, it is more common that each signature is described by a different type of data, so a shared ground distance matrix does not exist in these scenarios. Under such circumstances, it is not feasible to optimize the ground distance matrix for EMD in many applications. Different from optimizing the ground distance matrix directly, it is possible to learn an optimal EMD distance metric, i.e., the metric used to generate the ground distance matrix in EMD, which has not been investigated yet, to the best of our knowledge. To tackle this issue, we propose an EMD metric learning algorithm that works in a more general setting. In this method, the objective is to optimize the distance metric for ground distance matrix generation so that it is more discriminative across data categories.
We first select a group of data pairs of homogeneous neighbors and inhomogeneous neighbors from the labeled training dataset, and then learn the metric that minimizes the EMD between signatures from the same category and maximizes the EMD between signatures from different categories, encoding the triplet constraints via a hinge-loss function. Instead of gradient descent, we use an intrinsic algorithm to optimize the ground distance metric so as to retain its positive definiteness. We note that once the EMD metric is updated, the ground distance matrix and the corresponding flow for EMD change accordingly. Under such circumstances, we propose to optimize the EMD metric and the flow-networks of EMD alternately until convergence. The learned EMD metric is used to calculate a new ground distance matrix and update the flow-networks for the EMD between each pair of compared signatures. Then, with the new flow-networks, the EMD metric can be optimized again. This process repeats until convergence. The merit of our proposed method is twofold. First, the EMD metric can update the ground distance matrix for each pair of signatures separately, which does not require all signatures to have the same histogram description. Second, the ground distance matrix is guaranteed to be non-negative. We have applied our EMD metric learning method to two tasks, i.e., object classification and document classification, and experiments are conducted on two public benchmarks: the National Taiwan University (NTU) 3D model dataset (Chen et al. 2003) and the Twitter Sentiment Corpus dataset (Sanders 2011). We have also compared the proposed method with state-of-the-art methods and the traditional EMD method, and the experimental results show better performance of the proposed method. The main contributions of this work can be summarized as follows: 1.
We propose an EMD metric learning algorithm targeting a more general setting, which can dynamically optimize the ground distance matrix and thus lead to a better EMD distance measurement. 2. We have applied the EMD metric learning method to two tasks, i.e., multi-view object classification and text classification, and evaluated its performance. The rest of this paper is organized as follows. We first introduce related work on EMD applications. We then present our proposed EMD metric learning method and its applications. Experiments and discussions follow, and we finally conclude the paper.

Figure 3: The framework of our proposed EMD metric learning algorithm.

Related work

In this section, we briefly review existing work on EMD and its applications, such as gesture recognition, hyperspectral image classification, document classification, and face recognition. EMD was first introduced in (Rubner, Tomasi, and Guibas 2000), where it was used as a metric for image retrieval. More specifically, each image implies a distribution of points in the three-dimensional color space. Based on clustering results, the data distribution was transformed into a signature description. In (Carley and Tomasi 2015), EMD was employed for gesture recognition. For each hand gesture, a silhouette signature containing segment information was extracted; optional and additional ground distances were defined between segments as a convex function of the flows, and the signature dissimilarity was then calculated using EMD. In the task of high-resolution remote sensing image classification (Zhang et al. 2013), each image was translated into a set of visual codes. By sampling and clustering the visual codes, a codebook was generated and arranged via a strategy named local nearest neighbor codebook arrangement. For each image, the occurrence count of each code forms a histogram representation.
Then, the comparison between each image pair can be transformed into an EMD distance measurement, which was used for hyperspectral image classification. In (Li, Wang, and Zhang 2013), Li et al. proposed an EMD methodology using Gaussian Mixture Models for image matching. In this method, each image was modeled by a GMM, and image matching was conducted by comparing the corresponding GMMs via sparse representation-based EMD. Moreover, two kinds of ground distances between GMMs were defined and learned based on the theory of information geometry. In the task of document classification (Kusner et al. 2015), each non-stop word in a document was first embedded into a feature vector using the word2vec network. Then, EMD was employed to measure the dissimilarity between two documents by calculating the minimal cost for the embedded words of one document to travel to the embedded words of another. In (Wang and Guibas 2012), EMD was used to measure the distance between two face representations. In this work, the ground distance matrix was optimized using a supervised EMD learning method with triplet constraints. In the face verification experiments, face descriptors were extracted based on reference identities pre-selected from the test faces. Although this work showed better performance with the updated ground distance matrix, it requires that all data be represented by fixed-size histograms, which is not always feasible in practice.

EMD Metric Learning

In this section, we introduce the proposed EMD metric learning algorithm in detail. Figure 3 illustrates the general framework of our proposed method. Given a set of signatures with labels, our objective is to learn an optimal metric for ground distance matrix generation that is more discriminative for classification. First, we select a group of data pairs from homogeneous neighbors and inhomogeneous neighbors, which are used as the training data for metric learning.
Then, we conduct metric learning to minimize the EMD between signatures from the same category and maximize the EMD between signatures from different categories. As the learned EMD metric leads to new flow-networks and thus new EMD distances, we propose an alternating optimization algorithm to jointly update the EMD metric and the flow-networks until convergence. The final distance metric is used for EMD distance measurement. In this section, we first introduce the formulation of the EMD metric learning task, and then describe the training data selection method and the solution to the optimization task.

The Formulation of EMD Metric Learning

Given N signatures O = {O_1, O_2, ..., O_N} with O_i = {X_i, w_i}, i = 1, ..., N, let X_i = (x_i^1, x_i^2, ..., x_i^{n_i}) denote the n_i feature vectors of the i-th signature and w_i = (w_i^1, w_i^2, ..., w_i^{n_i}) denote the corresponding weight vector. y = {y_1, y_2, ..., y_N} is the set of labels with respect to O. To measure the distance between two signatures O_i and O_j by EMD, the ground distance matrix must first be generated. Here we use the Mahalanobis distance as the ground distance measurement, and the squared distance between the s-th feature vector of O_i and the t-th feature vector of O_j can be written as

d(x_i^s, x_j^t) = (x_i^s - x_j^t)^T A (x_i^s - x_j^t),   (3)

where A is a global linear transformation of the underlying space, i.e., the ground distance metric, which can be initialized as I. Given a set of training data with labels, the task is to learn an optimal distance metric A for ground distance matrix generation that is more discriminative for the data categories. More specifically, the objective is to minimize the EMD distances between signatures from the same category and maximize the EMD distances between signatures from different categories.
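Under the Mahalanobis ground distance of Eq. (3), the ground distance matrix for a signature pair is obtained by evaluating d for every pair of cluster centers. A small sketch of such a helper (our own naming; feature vectors are assumed to be stored as matrix rows) might look like:

```python
import numpy as np

def ground_distance_matrix(X_i, X_j, A):
    """Pairwise squared Mahalanobis ground distances of Eq. (3).

    X_i: (n_i, d) cluster centers of one signature
    X_j: (n_j, d) cluster centers of the other
    A:   (d, d) symmetric positive-definite ground distance metric"""
    # diff[s, t] = x_i^s - x_j^t, shape (n_i, n_j, d)
    diff = X_i[:, None, :] - X_j[None, :, :]
    # D[s, t] = diff[s, t]^T A diff[s, t]
    return np.einsum('stk,kl,stl->st', diff, A, diff)
```

With A = I this reduces to the ordinary squared Euclidean ground distance, matching the initialization described above.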
Given a set of signature triplets T = {(a, b, c) | y_a = y_b, y_a \ne y_c}, we usually need to guarantee that the distance between signatures from the same category is smaller than that between signatures from different categories. Here y_a = y_b indicates that a and b have the same label, and y_a \ne y_c means that a and c have different labels. The triplet constraint for the EMD metric can be written as

EMD_A(O_a, O_b) < EMD_A(O_a, O_c), \quad \forall (a, b, c) \in T.   (4)

By encoding the triplet constraints via a hinge-loss function, the supervised EMD metric learning can be formulated as

\min_A f(A) = \sum_{(a,b,c) \in T} [\mu + EMD_A(O_a, O_b) - EMD_A(O_a, O_c)]_+,   (5)

where \mu is a non-negative real number that specifies the lower bound of the margin and fixes the scale of the matrix A. Besides the triplet constraint, in order to preserve the topological structure of the data, we also need to guarantee that homogeneous neighbors from the same category stay close. Jointly considering the topological regularizer and the triplet constraint, the formulation for EMD metric learning can be rewritten as

\min_A f(A) = (1 - \lambda) \sum_{(a,b) \in S} EMD_A(O_a, O_b) + \lambda \sum_{(a,b,c) \in T} [\mu + EMD_A(O_a, O_b) - EMD_A(O_a, O_c)]_+,   (6)

where the set S contains all pairs of homogeneous neighbors from the same category and 0 \le \lambda \le 1 is a trade-off parameter that balances the triplet loss term and the topological regularizer.

Training Data Selection

We note that the number of all possible triplets in the training dataset could be very large. In that case, the number of relative distance constraints grows cubically with the size of the training set, leading to very high computational cost. It is therefore important to select a small set of training data for metric learning. We select k_g nearest neighbors with the same label (target neighbors) and k_i nearest neighbors with different labels (imposters) for each signature. The overall number of triplet constraints is thus reduced to k_i k_g N.
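The objective in Eq. (6) is cheap to evaluate once the EMD distances under the current metric are known. A hypothetical sketch (the name `emd_metric_loss` and the convention that the EMD values are precomputed in a dictionary are ours):

```python
def emd_metric_loss(emd, S, T, lam=0.5, mu=0.1):
    """Objective of Eq. (6): topological regularizer over pairs S plus
    hinge-encoded triplet constraints over T.

    emd: dict mapping an index pair (a, b) to the current EMD_A distance
    S:   iterable of homogeneous-neighbor pairs (a, b)
    T:   iterable of triplets (a, b, c) with y_a == y_b, y_a != y_c"""
    reg = sum(emd[(a, b)] for (a, b) in S)
    hinge = sum(max(0.0, mu + emd[(a, b)] - emd[(a, c)])
                for (a, b, c) in T)
    return (1 - lam) * reg + lam * hinge
```

A triplet contributes nothing once the inhomogeneous distance exceeds the homogeneous one by at least the margin \mu, which is exactly the behavior the gradient derivation below relies on.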
However, to find these target neighbors and imposters we would need the EMD distances between all pairs of signatures. Considering that the time complexity of solving the standard EMD problem is O(n^3 log n), this is very expensive for training data selection. To circumvent this issue, we use a cheap lower bound of the standard EMD to approximately select the nearest neighbors without computing the exact EMD distances (Kusner et al. 2015), which is much more efficient than the traditional EMD. Precisely, given two signatures P = {P_1, P_2, ..., P_n} and Q = {Q_1, Q_2, ..., Q_m}, let D = {d_{ij}} and F = {f_{ij}} denote the ground distance matrix and the EMD flow-network, respectively. The relaxed EMD (REMD) is defined as

REMD(P, Q) =
  \frac{\sum_{i=1}^{n} d_{i j^*(i)} w_{P_i}}{\sum_{i=1}^{n} w_{P_i}}, \quad if \sum_{i=1}^{n} w_{P_i} \le \sum_{j=1}^{m} w_{Q_j},
  \frac{\sum_{j=1}^{m} d_{i^*(j) j} w_{Q_j}}{\sum_{j=1}^{m} w_{Q_j}}, \quad if \sum_{i=1}^{n} w_{P_i} > \sum_{j=1}^{m} w_{Q_j},   (7)

where j^*(i) = \arg\min_j d_{ij} and i^*(j) = \arg\min_i d_{ij}. The intuition behind Eq. (7) is that if the total weight of P is less than that of Q, then for each cluster in P we move all its mass to the closest cluster in Q, and vice versa. It is straightforward to show that REMD(P, Q) lower-bounds EMD(P, Q). Assume that \sum_{i=1}^{n} w_{P_i} \le \sum_{j=1}^{m} w_{Q_j}; according to the definition of EMD, it follows that \sum_{j=1}^{m} f_{ij} = w_{P_i}. Hence

\frac{\sum_{i=1}^{n} \sum_{j=1}^{m} d_{ij} f_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij}} \ge \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} d_{i j^*(i)} f_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij}} = \frac{\sum_{i=1}^{n} d_{i j^*(i)} \sum_{j=1}^{m} f_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij}} = \frac{\sum_{i=1}^{n} d_{i j^*(i)} w_{P_i}}{\sum_{i=1}^{n} w_{P_i}} = REMD(P, Q).

The conclusion also holds in the other case. To calculate the bound, we only need the pairwise distances between clusters and a nearest neighbor search, which has a time complexity of O(n^2) and speeds up the data selection process considerably. With the pairwise REMD distances among all training data, we can select a subset of triplets from the original training dataset for metric learning.

Optimization

With the selected training data, we optimize the objective function in Eq. (6) to learn the EMD metric A.
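The REMD bound of Eq. (7) needs only a row-wise (or column-wise) minimum of the ground distance matrix, which is what makes the O(n^2) prefiltering possible. A minimal sketch under our own naming:

```python
import numpy as np

def remd(D, w_p, w_q):
    """Relaxed EMD lower bound of Eq. (7): the lighter side ships all of
    each cluster's mass to its single nearest cluster on the other side.

    D: (n, m) ground distance matrix; w_p: (n,) weights; w_q: (m,) weights."""
    if w_p.sum() <= w_q.sum():
        # for each supplier i, ship w_p[i] to the closest consumer j*(i)
        return (D.min(axis=1) * w_p).sum() / w_p.sum()
    # otherwise, each consumer j pulls w_q[j] from the closest supplier i*(j)
    return (D.min(axis=0) * w_q).sum() / w_q.sum()
```

Because REMD lower-bounds EMD, a candidate whose REMD to the query already exceeds the exact EMD of the current k-th nearest neighbor can be pruned without ever solving the transportation problem.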
We note that once the EMD metric A is updated, the ground distance matrix and the corresponding flow-networks F for EMD change accordingly, and the EMD distances must be re-calculated with respect to the new flow-networks. Under such circumstances, the optimization should target both the EMD metric A and the associated flow-networks F. By substituting Eq. (2) into Eq. (6), the objective function is further written as

\min_{A, F} f(A, F) = (1 - \lambda) \sum_{(a,b) \in S} EMD_A(O_a, O_b) + \lambda \sum_{(a,b,c) \in T} [\mu + EMD_A(O_a, O_b) - EMD_A(O_a, O_c)]_+,   (8)

which, with the flows normalized so that the total flow between each pair of signatures is one, expands to

\min_{A, F} f(A, F) = (1 - \lambda) \sum_{(a,b) \in S} \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} F_{a,b}(i, j) d(x_a^i, x_b^j) + \lambda \sum_{(a,b,c) \in T} \Big[\mu + \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} F_{a,b}(i, j) d(x_a^i, x_b^j) - \sum_{i=1}^{n_a} \sum_{j=1}^{n_c} F_{a,c}(i, j) d(x_a^i, x_c^j)\Big]_+.   (9)

When F is fixed, the optimization task reduces to a Mahalanobis distance metric learning problem. On the other hand, when A is fixed, the problem splits into |S| + 2|T| independent traditional EMD sub-problems, which can be solved by the Hungarian algorithm. The formulation is therefore biconvex with respect to A and F. To solve this optimization task, we propose to optimize the EMD metric A and the flow-networks F alternately until convergence. Given the flow-networks F for the selected training data, the learning task can be rewritten in matrix form as

\min_A f(A) = (1 - \lambda) \sum_{(a,b) \in S} tr(G_{a,b}^T A) + \lambda \sum_{(a,b,c) \in T} [\mu + tr(G_{a,b}^T A) - tr(G_{a,c}^T A)]_+,   (10)

where

G_{a,b} = G_{a,b}^1 - G_{a,b}^0,
G_{a,b}^1 = X_a \, diag(F_{a,b} e) X_a^T + X_b \, diag(e^T F_{a,b}) X_b^T,
G_{a,b}^0 = X_a F_{a,b} X_b^T + X_b F_{a,b}^T X_a^T.   (11)

The gradient of Eq. (10) with respect to A is calculated by

\nabla f(A) = (1 - \lambda) \sum_{(a,b) \in S} G_{a,b} + \lambda \sum_{(a,b,c) \in \tilde{T}} (G_{a,b} - G_{a,c}),
\tilde{T} = T \setminus \{(a, b, c) \mid tr(G_{a,c}^T A) - tr(G_{a,b}^T A) > \mu\},   (12)

i.e., triplets whose hinge loss is already zero do not contribute to the gradient. To retain the positive definiteness of A, a commonly used optimization technique is the projected gradient method, which applies gradient descent followed by a projection onto the positive-definite cone (Han et al. 2017).
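For fixed flows, the gradient in Eq. (12) is a weighted sum of per-pair matrices G_{a,b}, each of which collects flow-weighted outer products of the cluster centers. The sketch below (our own helper names; feature vectors stored as matrix rows, so the roles of X and X^T are transposed relative to the column-vector notation above) computes one G_{a,b} and a baseline projected-gradient step; the paper itself replaces this projection with the intrinsic geodesic update of Eq. (15):

```python
import numpy as np

def pair_gradient(X_a, X_b, F):
    """Per-pair gradient matrix G_{a,b} = d EMD_A(O_a, O_b) / dA for a
    fixed flow F of shape (n_a, n_b); X_a is (n_a, d), X_b is (n_b, d)."""
    row_mass = F.sum(axis=1)          # F e
    col_mass = F.sum(axis=0)          # e^T F
    G1 = X_a.T @ np.diag(row_mass) @ X_a + X_b.T @ np.diag(col_mass) @ X_b
    G0 = X_a.T @ F @ X_b + X_b.T @ F.T @ X_a
    return G1 - G0

def projected_gradient_step(A, grad, alpha):
    """Baseline alternative: one gradient-descent step followed by a
    projection onto the positive-semidefinite cone (clip negative eigenvalues)."""
    A_new = A - alpha * 0.5 * (grad + grad.T)   # symmetrize the gradient
    w, V = np.linalg.eigh(A_new)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T
```

As a sanity check, for a single unit flow between x and y, tr(G^T A) equals the squared Mahalanobis distance (x - y)^T A (x - y), so the trace form in Eq. (10) reproduces the original objective.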
However, in recent years, several intrinsic iterative methods for optimization on matrix manifolds have achieved better performance and convergence (Arsigny et al. 2007). Methods of this kind preserve the manifold structure of the positive-definite matrix group, i.e., the variable A remains on the corresponding manifold at each iteration. Therefore, we use the intrinsic algorithm (Ying et al. 2017) to optimize A. The set of symmetric positive-definite matrices is a smooth Riemannian manifold, denoted by M. The tangent space of M at a point P is the vector space of derivations at P, denoted by T_P M. Given two points on the manifold, the locally distance-minimizing curve connecting them is termed the geodesic. The geodesic on the manifold going through P with tangent vector Z is given by

\Gamma_{(P,Z)}(\alpha) = P^{1/2} \exp(\alpha P^{-1/2} Z P^{-1/2}) P^{1/2}.   (13)

Therefore, we update A_t to A_{t+1} on the positive-definite matrix group by computing

A_{t+1} = A_t^{1/2} \exp(-\alpha A_t^{-1/2} \nabla f(A_t) A_t^{-1/2}) A_t^{1/2},   (14)

where \alpha is the optimal step size at each iteration. Note that the tangent vector in Eq. (13) should be symmetric, but the gradient of the objective function in Eq. (12) is not, which would further make A_{t+1} non-symmetric. To circumvent this issue, we apply a symmetrization operator to the gradient. That is, we have

A_{t+1} = A_t^{1/2} \exp(-\alpha A_t^{-1/2} Sym[\nabla f(A_t)] A_t^{-1/2}) A_t^{1/2},   (15)

Algorithm 1 EMD Metric Learning
Input: Training dataset O, maximal number of outer iterations k_out, and parameters k_g, k_i, \lambda, and \mu
Output: EMD metric A
1: Initialize A = I.
2: Calculate the REMD between each pair of signatures.
3: Construct the sets S and T by selecting k_g target neighbors and k_i imposters for each training instance.
4: Initialize the learning rate \alpha and the maximal number of inner iterations k_in.
5: for each s \in [1, k_out] do
6:   Fix A, solve for the EMD flow-networks F.
7:   Fix F.
8:   for each t \in [1, k_in] do
9:     Compute the gradient \nabla f(A_t) by Eq. (12).
10:    Update A by Eq. (15).
11: return A

where Sym[\nabla f(A_t)] = (\nabla f(A_t) + \nabla f(A_t)^T) / 2. We can then update the EMD metric A with this gradient scheme. With the new A, we further calculate the flow-networks F for all the selected training data. In this way, we alternately update A and the flows F until convergence. The resulting EMD metric A can then be used for EMD distance measurement. The overall workflow of the proposed EMD metric learning method is shown in Algorithm 1.

Applications of EMD Metric Learning

In this section, we apply our proposed EMD metric learning method to two tasks, i.e., multi-view object classification and document classification.

Multi-View Object Classification. In this application, each object is represented by a group of views taken from cameras at different angles. For each view of the object, a Convolutional Neural Network (CNN) feature (Nie et al. 2017) is extracted for description due to its superior performance in object retrieval and recognition. It is noteworthy that our framework places no specific requirements on feature extraction: other features can also be used, and the CNN feature is just an example. After feature extraction, considering that multiple views may contain redundancy, we conduct view clustering following the settings in (Gao et al. 2012) to generate view clusters. For each object, a group of representative views is selected from these clusters, associated with weights based on cluster size. In this way, each object is represented by a signature of representative views, and EMD can be used to measure the distance between objects for classification. Our EMD metric learning method can be employed to optimize the EMD distance between objects.

Document Classification. In this application, each document is represented by a bag-of-words (BOW) feature and can be modeled as a signature (Kusner et al. 2015). To describe the document, the word2vec (Mikolov et al.
2013) model, a three-layer neural network, is used to generate a feature representation for each non-stop word. Then, the normalized bag-of-words (nBOW) vector is extracted from each document, reflecting the frequency (weight) of each non-stop word in the document. In this way, we construct a signature description for each document, and the EMD metric learning method can be used to learn an optimal EMD distance for document classification.

Experiments

Testing Datasets and Experimental Settings

To validate the proposed EMD metric learning method, we conducted experiments on two datasets, i.e., the National Taiwan University 3D model dataset (NTU) (Chen et al. 2003) and the Twitter Sentiment Corpus dataset (TWITTER) (Sanders 2011). The NTU dataset contains 401 objects from 16 categories: bomb, bottle, car, chair, cup, door, guitar, gun, map, plane, starship, stick, sword, table, tank, and truck. Each object has 60 views, and a 4096-d CNN feature is extracted for each view. To reduce the computational cost of the metric learning process, we further apply PCA to reduce the feature dimension to 300. The TWITTER dataset contains 2176 tweets labeled with three types of sentiment: positive, negative, and neutral. Each document is represented by a 300-d feature, and the average number of unique words per document is 9.9. We randomly select 20%, 30%, 40%, and 50% of the data per category as labeled training data and use the rest for testing. The dataset splitting process is repeated 10 times, and the average classification accuracy is used for evaluation. We empirically set the parameter \mu in Eq. (6) to 0.1 on both datasets. The parameter \lambda is set to 0.5 on the NTU dataset and 0.2 on the TWITTER dataset. In the experiments, the following methods are used for comparison: 1. Manifold Discriminant Analysis (MDA) (Wang and Chen 2009).
In MDA, each image set is modeled as a manifold, and an embedding space is learned to maximize the manifold margin between compared manifolds. 2. Covariance Discriminative Learning (CDL) (Wang et al. 2012). In CDL, the natural second-order statistic is employed to model the image set, and the distance is measured using linear discriminant analysis. 3. Covariance Discriminative Learning with Partial Least Squares (CDL-PLS) (Wang et al. 2012). CDL-PLS improves on CDL by conducting partial least squares through means of latent variables. 4. Log-Euclidean Metric Learning (LEML) (Huang et al. 2015b). In LEML, a Mahalanobis metric is learned to map the matrix logarithms from the original space to a discriminative tangent space. 5. Projection Metric Learning (PML) (Huang et al. 2015a). In PML, a Fisher LDA-like framework is introduced to learn a Mahalanobis-like matrix directly on the Grassmann manifold, mapping the original Grassmann manifold to a lower-dimensional, more discriminative one. 6. EMD, i.e., the traditional EMD method. 7. EMD metric learning (EMD-M), i.e., the proposed method.

Figure 4: Experimental comparison on the NTU dataset.

Figure 5: Experimental comparison on the TWITTER dataset.

We note that on the TWITTER dataset, MDA and LEML cannot be performed because the number of words per document is too small and there may exist duplicate words within a document, which causes errors in the training process. Therefore, we compare all 7 methods on the NTU dataset but only 5 methods on the TWITTER dataset.

Experimental results

Experimental results on the two datasets are shown in Figure 4 and Figure 5. From these results, we can make the following observations: 1. Compared with the state-of-the-art methods, the proposed EMD-M method achieves higher classification performance on both datasets.
For example, EMD-M achieves classification accuracies of 78.54%, 82.26%, 84.42%, and 85.25% when 20%, 30%, 40%, and 50% of the data are used for training on the NTU dataset. On the TWITTER dataset, EMD-M achieves gains of 5.29%, 3.98%, and 2.74% over PML, CDL, and CDL-LDA, respectively, when 50% of the data are used for training. Similar results are observed in the other experiments. 2. Compared with the traditional EMD method, the proposed EMD-M achieves better performance. For example, EMD-M obtains gains of 5.48%, 6.83%, 8.32%, and 8.5% over EMD on the TWITTER dataset when 20%, 30%, 40%, and 50% of the data are used for training. The better performance of our proposed method can be attributed to the following reasons. Although EMD is effective for many-to-many matching, its robustness in data comparison is limited; we observe that EMD works better on the NTU dataset but worse on the TWITTER dataset. To overcome this limitation, our proposed method learns an optimal EMD metric and thus a better ground distance matrix for EMD distance calculation. In this way, the EMD-M method becomes more discriminative through the training data and achieves much better performance than EMD.

On convergence

Figure 6: The variation of the objective function with respect to the iterations of our learning method: (a) the NTU dataset; (b) the TWITTER dataset.

In our proposed EMD-M method, the EMD metric A and the flow-networks F are alternately updated, so the convergence speed matters for the learning process. We further investigate how the objective function value changes over the iterations, as shown in Figure 6. The objective function decreases very quickly and converges after about 20 iterations. These results demonstrate that our proposed method reaches its optimal performance efficiently.
On parameters

In our framework, there are two types of key parameters: the number of selected training data and \lambda in Eq. (6). Here we investigate the influence of these parameters on the performance. For training data selection, the parameters k_g and k_i set the numbers of target neighbors and imposters selected for each signature, and thus control the size of the training data. Usually, the more training data and corresponding constraints we use, the better the resulting metric performs. Here we fix k_g at 3 and vary k_i in the range [2, 16] on the NTU dataset and [1, 8] on the TWITTER dataset, with 30% of the data used for training. Experimental results are shown in Figure 7. The performance is steady once k_i is large enough, such as above 10 on the NTU dataset. An interesting observation is that the TWITTER dataset needs fewer imposters than the NTU dataset to reach stable performance, suggesting that the more categories there are, the more imposters we need to select for each training instance. The other important parameter is \lambda, which controls the weights of the loss term and the regularizer. We vary \lambda from 0.3 to 0.7 on the NTU dataset and from 0 to 0.4 on the TWITTER dataset.

Figure 7: The performance evaluation by varying the sampling rates with 30% of the data used for training: (a) the NTU dataset; (b) the TWITTER dataset.

Figure 8: The performance evaluation by varying the parameter \lambda with 30% of the data used for training: (a) the NTU dataset; (b) the TWITTER dataset.

The experimental results are shown in Figure 8. The proposed method achieves steady performance when \lambda is around 0.5 and 0.2 on the two datasets, respectively, and shows relatively poor performance when \lambda is overly small or large, which is consistent with the role of \lambda in the presented model.
Overly small values of λ eliminate the influence of local correlations, while overly large values overemphasize the topological structure of the data.

Conclusions
In this paper, we propose an EMD metric learning algorithm. The proposed method learns an optimal distance metric and thereby a better ground distance matrix for EMD distance calculation between two signatures. In the learning stage, we alternately update the EMD metric and the flow-network for the EMD distance. We have applied the method to the tasks of multi-view object classification and document classification, conducting experiments on two datasets; the comparison with state-of-the-art methods demonstrates the effectiveness of the proposed method. We also compare our proposed EMD-M method with the traditional EMD method, and experimental results show gains of about 5% for EMD-M. The framework can also be further applied to many other tasks, such as face verification and object recognition.

Acknowledgments
This work was supported in part by the National Key R&D Program of China under Grant 2017YFC011300, in part by the National Natural Science Funds of China under Grant 61671267 and Grant 61527812, in part by the National Science and Technology Major Project (No. 2016ZX01038101), the MIIT IT funds ("Research and application of TCN key technologies") of China, and the National Key Technology R&D Program (No. 2015BAG14B01-02).

References
Arsigny, V.; Fillard, P.; Pennec, X.; and Ayache, N. 2007. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications 29(1):328–347.
Borgwardt, K. M.; Gretton, A.; Rasch, M. J.; Kriegel, H.-P.; Schölkopf, B.; and Smola, A. J. 2006. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14):e49–e57.
Carley, C., and Tomasi, C. 2015. Single-frame indexing for 3d hand pose estimation.
In Proceedings of IEEE International Conference on Computer Vision Workshops, 101–109.
Chen, D.-Y.; Tian, X.-P.; Shen, Y.-T.; and Ouhyoung, M. 2003. On visual similarity based 3d model retrieval. In Computer Graphics Forum, volume 22, 223–232. Wiley Online Library.
Endres, D. M., and Schindelin, J. E. 2003. A new metric for probability distributions. IEEE Transactions on Information Theory 49(7):1858–1860.
Gao, Y.; Tang, J.; Hong, R.; Yan, S.; Dai, Q.; Zhang, N.; and Chua, T.-S. 2012. Camera constraint-free view-based 3d object retrieval. IEEE Transactions on Image Processing 21(4):2269–2281.
Goldberger, J.; Gordon, S.; and Greenspan, H. 2003. An efficient image similarity measure based on approximations of kl-divergence between two gaussian mixtures. In Proceedings of IEEE International Conference on Computer Vision, 487. IEEE.
Han, J.; Cheng, G.; Li, Z.; and Zhang, D. 2017. A unified metric learning-based framework for co-saliency detection. IEEE Transactions on Circuits and Systems for Video Technology.
Huang, Z.; Wang, R.; Shan, S.; and Chen, X. 2015a. Projection metric learning on grassmann manifold with application to video based face recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 140–149.
Huang, Z.; Wang, R.; Shan, S.; Li, X.; and Chen, X. 2015b. Log-euclidean metric learning on symmetric positive definite manifold with application to image set classification. In Proceedings of International Conference on Machine Learning, 720–729.
Kusner, M.; Sun, Y.; Kolkin, N.; and Weinberger, K. 2015. From word embeddings to document distances. In Proceedings of International Conference on Machine Learning, 957–966.
Li, P.; Wang, Q.; and Zhang, L. 2013. A novel earth mover's distance methodology for image matching with gaussian mixture models. In Proceedings of IEEE International Conference on Computer Vision, 1689–1696.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013.
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Nie, W.; Cao, Q.; Liu, A.; and Su, Y. 2017. Convolutional deep learning for 3d object retrieval. Multimedia Systems 23(3):325–332.
Rubner, Y.; Tomasi, C.; and Guibas, L. J. 2000. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2):99–121.
Sanders, N. J. 2011. Sanders-twitter sentiment corpus.
Sun, W.; Li, W.; Li, J.; and Lai, Y. M. 2015. Band selection using sparse nonnegative matrix factorization with the thresholded earth's mover distance for hyperspectral imagery classification. Earth Science Informatics 8(4):907–918.
Villani, C. 2009. The Wasserstein distances. Berlin, Heidelberg: Springer Berlin Heidelberg. 93–111.
Wang, C., and Chan, S. 2014. A new hand gesture recognition algorithm based on joint color-depth superpixel earth mover's distance. In Proceedings of the 4th International Workshop on Cognitive Information Processing, 1–6. IEEE.
Wang, R., and Chen, X. 2009. Manifold discriminant analysis. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 429–436. IEEE.
Wang, F., and Guibas, L. J. 2012. Supervised earth mover's distance learning and its computer vision applications. In Proceedings of European Conference on Computer Vision, 442–455. Springer.
Wang, R.; Guo, H.; Davis, L. S.; and Dai, Q. 2012. Covariance discriminative learning: A natural and efficient approach to image set classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2496–2503. IEEE.
Ying, S.; Wen, Z.; Shi, J.; Peng, Y.; Peng, J.; and Qiao, H. 2017. Manifold preserving: An intrinsic approach for semisupervised distance metric learning. IEEE Transactions on Neural Networks and Learning Systems.
Zhang, J.; Marszalek, M.; Lazebnik, S.; and Schmid, C. 2007. Local features and kernels for classification of texture and object categories: A comprehensive study.
International Journal of Computer Vision 73(2):213–238.
Zhang, Y.; Sun, X.; Wang, H.; and Fu, K. 2013. High-resolution remote-sensing image classification via an approximate earth mover's distance-based bag-of-features model. IEEE Geoscience and Remote Sensing Letters 10(5):1055–1059.