FISH-MML: Fisher-HSIC Multi-View Metric Learning

Changqing Zhang1, Yeqing Liu1, Yue Liu1, Qinghua Hu1, Xinwang Liu2 and Pengfei Zhu1
1School of Computer Science and Technology, Tianjin University, Tianjin, China
2School of Computer, National University of Defense Technology, Changsha, China
{zhangchangqing, yeqing, liuyue76, huqinghua, zhupengfei}@tju.edu.cn, 1022xinwang.liu@gmail.com

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Abstract

This work presents a simple yet effective model for multi-view metric learning, which aims to improve the classification of data with multiple views, e.g., multiple modalities or multiple types of features. The intrinsic correlation, i.e., that different views describe the same set of instances, makes it both possible and necessary to jointly learn multiple metrics for the different views. Accordingly, we propose a multi-view metric learning method based on Fisher discriminant analysis (FDA) and the Hilbert-Schmidt Independence Criterion (HSIC), termed Fisher-HSIC Multi-View Metric Learning (FISH-MML). In our approach, class separability is enforced in the spirit of FDA within each single view, while the consistence among different views is enhanced based on HSIC. Accordingly, both intra-view class separability and inter-view correlation are addressed in a unified framework. The learned metrics can improve multi-view classification, and experimental results on real-world datasets demonstrate the effectiveness of the proposed method.

1 Introduction

With the rapid development of information acquisition techniques, data are usually represented with different modalities or different types of features. In computer vision, images are often depicted with different types of descriptors based on color or texture cues; RGB-D images are multi-modal data consisting of depth and color information.
For social network analysis (SNA), different relationships usually characterize the same set of users, and different types of attributes or textual information are often associated with those users, e.g., user-generated content or demographic details. Recently, intensive attention has been paid to developing classification and clustering models for data with multiple views, and their effectiveness has been empirically proven in diverse applications [Kumar et al., 2011; Wang et al., 2016; Gong, 2017; Zhao et al., 2017; Cao et al., 2015; Zhang et al., 2015; 2017].

Corresponding author: Qinghua Hu.

Figure 1: Illustration of Fisher-HSIC Multi-View Metric Learning. The proposed model enforces separability within each view using label information, and simultaneously respects consistence across different views.

As recognized by recent research [Sindhwani and Rosenberg, 2008; Dhillon et al., 2011; Liu et al., 2016], the main challenge in exploiting multi-view data lies in how to effectively explore the underlying correlation among different views. This is nontrivial: although different views indeed depict the same set of instances, the feature space of one view can be completely different from that of another, a situation known as heterogeneous features. For example, Euclidean distance is typically employed as the distance measure for HOG [Dalal and Triggs, 2005], while spatial matching kernels are widely used for local descriptors such as SIFT [Lowe, 1999]. Moreover, the features of one view may be high-dimensional while those of another are not, and the features of one view may be much noisier than those of another. These challenges increase the difficulty of integrating different views.
The representative and straightforward strategy is feature integration, i.e., directly concatenating all views into high-dimensional vectors or performing dimensionality reduction jointly (e.g., with Canonical Correlation Analysis (CCA)). However, direct concatenation of feature vectors or simple linear combination of the outputs of different views cannot guarantee promising performance, and CCA-based methods [Blaschko and Lampert, 2008; Chaudhuri et al., 2009] only explore linear correlations and thus neglect the complex correlations found in real applications.

Metric learning learns a distance function that reflects the relationships between data points consistently with their semantic labels, which can benefit subsequent tasks, e.g., classification or clustering. Generally, metric learning approaches seek a Mahalanobis distance using paired samples or other side information encoding relationships among data. Compared with the Euclidean distance in the original feature space, the learned distance metric can better reveal the relationships between data points. Due to their effectiveness, numerous metric learning methods [Weinberger and Saul, 2009; Davis et al., 2007; Guillaumin et al., 2009] have been proposed and widely applied in real-world applications.

In this paper, we propose to learn distance metrics for data with multiple views, i.e., to jointly learn multiple metrics that exploit complementary information across views. Towards this goal, we propose the Fisher-HSIC Multi-View Metric Learning (FISH-MML) algorithm. On the one hand, we introduce Fisher discriminant analysis (FDA) to search for the optimal projections (corresponding to Mahalanobis distance metrics) that maximize class separability and preserve expressiveness within each view, which alleviates the difficulty of classification compared with the original features (corresponding to Euclidean distance metrics).
On the other hand, to exploit the complementarity of different views, our model maximizes the dependence among views with the Hilbert-Schmidt Independence Criterion (HSIC), which encourages the between-data relationships (under the learned metrics) of different views to be consistent in kernel space. The proposed approach is efficiently optimized using the Alternating Direction Minimization (ADM) strategy, and extensive experiments validate its effectiveness. The highlights of this paper are summarized as follows: (1) We propose a novel multi-view metric learning method that is simple yet rather effective. (2) Our model simultaneously enhances class separability within each view and explores complex correlations across multiple views in a unified framework. (3) With FDA, class separability within each view is enhanced, while with HSIC, our method effectively explores correlations among different views. (4) Based on the Alternating Direction Method (ADM), our objective is efficiently optimized with guaranteed convergence. (5) Experiments on real-world multi-view datasets validate the effectiveness of our method for classification.

2 Related Work

There have been quite a few distance metric learning approaches. The early work in [Xing et al., 2003] learns a distance metric with side information indicating whether two data samples are similar or dissimilar, formulated as a convex optimization problem. The work in [Davis et al., 2007] introduces information theory and formulates the task as minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function. LMNN (Large Margin Nearest Neighbors) [Weinberger and Saul, 2009] learns a metric for the k-Nearest Neighbor (kNN) classifier that enforces the k nearest neighbors of a sample to belong to its class while separating examples from different classes by a large margin.
There are also methods that learn distance metrics under sparsity [Ying et al., 2009] or low-rank [Ding et al., 2015] assumptions.

Due to the ubiquity of data with multiple modalities or descriptors, the literature on multi-view learning spans a very broad range. In the metric learning domain, HMML (Heterogeneous Multi-Metric Learning) [Zhang et al., 2011] jointly learns a set of heterogeneous metrics for multi-sensor data fusion, generalizing LMNN [Weinberger and Saul, 2009] from single-view to multi-view learning. The work in [Xie and Xing, 2013] proposes a general framework for multi-modal distance metric learning based on the multi-wing harmonium model, in which different modalities are embedded into a shared latent space. A large margin multi-metric learning (LM3L) [Hu et al., 2014] method has also been proposed for face and kinship verification. Recently, deep model-based metric learning methods [Hu et al., 2017; Lu et al., 2015] have been proposed. The sharable and individual multi-view deep metric learning (MvDML) approach [Hu et al., 2017] jointly learns multiple distance metrics for multi-view data by seeking an individual distance metric for each view and a common representation for different views in a unified latent subspace.

3 Background

Throughout this paper, boldface uppercase, boldface lowercase, and normal italic letters denote matrices, vectors, and scalars, respectively. We denote a feature matrix as $\mathbf{X} \in \mathbb{R}^{d \times n}$, where $d$ and $n$ are the dimensionality of the feature space and the number of samples, respectively, and $\mathbf{x}_i$ is the feature vector of the $i$-th sample. For data represented by $V$ different views, we use $\mathcal{X} = \{\mathbf{X}^{(v)} \in \mathbb{R}^{d_v \times n}\}_{v=1}^{V}$ to denote the set of feature matrices of the multiple views, with $d_v$ being the dimensionality of the feature space of the $v$-th view. As in traditional metric learning algorithms, our model focuses on learning Mahalanobis distances.
In contrast to single-view approaches, our model jointly learns the Mahalanobis distances of multiple views. Generally, the Mahalanobis distance is defined as follows:

Definition 3.1 (Mahalanobis distance). The Mahalanobis distance between two samples $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as
$$d^2_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i - \mathbf{x}_j\|^2_{\mathbf{M}} = (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j),$$
where the Mahalanobis matrix $\mathbf{M}$ is constrained to be symmetric positive-definite to ensure validity.

4 Fisher-HSIC Multi-View Metric Learning

We first present a general framework for multi-view metric learning (MML) that jointly learns multiple metrics for multiple views:
$$\max_{\{\mathbf{M}^{(v)}\}_{v=1}^{V}} \underbrace{S(\{\mathbf{M}^{(v)}\}_{v=1}^{V})}_{\text{class separability}} + \lambda\, \underbrace{C(\{\mathbf{M}^{(v)}\}_{v=1}^{V})}_{\text{view consistence}}, \quad (1)$$
where $\mathbf{M}^{(v)}$ is the distance metric of the $v$-th view and $\lambda > 0$ is a tradeoff hyperparameter. This objective searches for the metrics that simultaneously maximize class separability (via $S(\cdot)$) and penalize the disagreement between different views (via $C(\cdot)$).

Separability and Expressiveness

To ensure class separability within each view, FDA is introduced into our model. It is based on the between-class and total scatter matrices:
$$\mathbf{S}_b = \sum_{j=1}^{g} n_j (\boldsymbol{\mu}_j - \boldsymbol{\mu})(\boldsymbol{\mu}_j - \boldsymbol{\mu})^T, \quad \mathbf{S}_t = \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T, \quad (2)$$
with $\boldsymbol{\mu}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} \mathbf{x}^j_i$ and $\boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$, where $\mathbf{x}^j_i$ denotes the feature vector of the $i$-th sample in class $C_j$, $\boldsymbol{\mu}_j$ and $\boldsymbol{\mu}$ are the sample means of class $C_j$ and of the whole data set, respectively, and $g$ and $n_j$ are the number of classes and the number of samples in class $C_j$, respectively.

On the one hand, we note that the Euclidean distance involved in Eq. (2) could be improved with the Mahalanobis distance, i.e., $(\mathbf{x}_i - \boldsymbol{\mu})^T(\mathbf{x}_i - \boldsymbol{\mu}) \rightarrow (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{M} (\mathbf{x}_i - \boldsymbol{\mu})$. On the other hand, when $\mathbf{M}$ is a symmetric positive-definite matrix, $d_{\mathbf{M}}$ is a valid metric.
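Definition 3.1 can be checked numerically. The following NumPy sketch (function and variable names are ours, not from the paper) computes the squared Mahalanobis distance and verifies that it reduces to the squared Euclidean distance when $\mathbf{M} = \mathbf{I}$:

```python
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Squared Mahalanobis distance d_M^2(xi, xj) = (xi - xj)^T M (xi - xj)."""
    diff = xi - xj
    return float(diff @ M @ diff)

# Toy example: M must be symmetric positive (semi-)definite.
rng = np.random.default_rng(0)
P = rng.standard_normal((2, 3))    # a projection, anticipating M = P^T P
M = P.T @ P                        # symmetric PSD by construction
xi, xj = rng.standard_normal(3), rng.standard_normal(3)

d2 = mahalanobis_sq(xi, xj, M)
assert d2 >= 0                     # positive semi-definiteness guarantees non-negativity
# With M = I the Mahalanobis distance reduces to the Euclidean distance.
assert np.isclose(mahalanobis_sq(xi, xj, np.eye(3)), np.sum((xi - xj) ** 2))
```

The PSD construction `M = P.T @ P` is exactly the decomposition the paper uses next to turn metric learning into learning a projection.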
Specifically, any symmetric positive semi-definite matrix $\mathbf{M} \in \mathbb{S}^d_+$ can be decomposed as $\mathbf{M} = \mathbf{P}^T\mathbf{P}$, where $\mathbf{P} \in \mathbb{R}^{k \times d}$ and $k \geq \mathrm{rank}(\mathbf{M})$. Accordingly, the distance function can be rewritten as
$$d^2_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{P}^T\mathbf{P} (\mathbf{x}_i - \mathbf{x}_j) = \|\mathbf{P}(\mathbf{x}_i - \mathbf{x}_j)\|^2_2,$$
which corresponds to the Euclidean distance between projected feature vectors. Therefore, with respect to the new space induced by $\mathbf{P}^{(v)}$, the between-class and total scatter matrices of the $v$-th view, i.e., $\mathbf{S}^{(v)}_b$ and $\mathbf{S}^{(v)}_t$, become
$$\mathbf{S}^{(v)}_b = \sum_{j=1}^{g} n_j (\boldsymbol{\mu}^{(v)}_j - \boldsymbol{\mu}^{(v)})(\boldsymbol{\mu}^{(v)}_j - \boldsymbol{\mu}^{(v)})^T, \quad \mathbf{S}^{(v)}_t = \sum_{i=1}^{n} (\mathbf{z}^{(v)}_i - \boldsymbol{\mu}^{(v)})(\mathbf{z}^{(v)}_i - \boldsymbol{\mu}^{(v)})^T, \quad (3)$$
with $\boldsymbol{\mu}^{(v)}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} \mathbf{z}^{j(v)}_i$ and $\boldsymbol{\mu}^{(v)} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{z}^{(v)}_i$, where $\mathbf{z}^{(v)} = \mathbf{P}^{(v)}\mathbf{x}^{(v)}$ is the projected feature vector corresponding to $\mathbf{x}^{(v)}$, and $\boldsymbol{\mu}^{(v)}_j$ and $\boldsymbol{\mu}^{(v)}$ are the sample means of the $v$-th view for class $C_j$ and for the whole data set, respectively.

We then maximize the class separability within each view, parameterized by the projections $\{\mathbf{P}^{(v)}\}_{v=1}^{V}$, for the most discriminative capability. In the spirit of FDA, we optimize the following unconstrained objective:
$$\max_{\{\mathbf{P}^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V} \mathrm{Tr}(\mathbf{S}^{(v)}_b; \mathbf{P}^{(v)}) - \gamma\, \mathrm{Tr}(\mathbf{S}^{(v)}_t; \mathbf{P}^{(v)}), \quad (4)$$
where $\mathrm{Tr}(\cdot)$ is the matrix trace operator, $\gamma$ is a tunable parameter balancing the two terms, and $\mathrm{Tr}(\mathbf{S}^{(v)}_b; \mathbf{P}^{(v)})$ denotes the trace of $\mathbf{S}^{(v)}_b$ conditioned on $\mathbf{P}^{(v)}$. Note that we do not constrain $\gamma$ in Eq. (4) to be positive; positive and negative values have different meanings. With $\gamma > 0$, the objective maximizes the inter-class scatter while simultaneously minimizing the total scatter, so the intra-class scatter is automatically minimized. However, as recognized by [Cheng et al., 2011], this may be sensitive to spurious features in the high-dimensional case. This inspires us to set $\gamma < 0$ to pursue expressiveness: in this manner, the objective jointly maximizes not only discriminativeness but also expressiveness.
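The projected scatter matrices of Eq. (3) satisfy $\mathbf{S}^{(v)}_b = \mathbf{P}^{(v)}\mathbf{S}_b\mathbf{P}^{(v)T}$ and $\mathbf{S}^{(v)}_t = \mathbf{P}^{(v)}\mathbf{S}_t\mathbf{P}^{(v)T}$, so the traces in Eq. (4) can be evaluated directly from the input-space scatters. A minimal NumPy sketch for one view (our own helper names, not the paper's code):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class Sb and total St scatter of the columns of X (d x n), labels y."""
    mu = X.mean(axis=1, keepdims=True)
    St = (X - mu) @ (X - mu).T
    Sb = np.zeros_like(St)
    for c in np.unique(y):
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        Sb += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
    return Sb, St

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 20))
y = np.repeat([0, 1], 10)
Sb, St = scatter_matrices(X, y)
P = np.linalg.qr(rng.standard_normal((4, 2)))[0].T   # row-orthonormal 2 x 4 projection

# Computing scatters of the projected data equals projecting the scatters.
Sb_z, St_z = scatter_matrices(P @ X, y)
assert np.allclose(Sb_z, P @ Sb @ P.T)
assert np.allclose(St_z, P @ St @ P.T)

# The FDA-style objective of Eq. (4) for a single view with gamma = 0.5.
obj = np.trace(P @ Sb @ P.T) - 0.5 * np.trace(P @ St @ P.T)
```

This identity is what lets the later objective be rewritten purely as a trace of $\mathbf{P}^{(v)}(\cdot)\mathbf{P}^{(v)T}$.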
Recall the objective function of PCA (Principal Component Analysis),
$$\max_{\mathbf{v}^T\mathbf{v}=1} \sum_{i=1}^{n} (\mathbf{v}^T\mathbf{x}_i - \mathbf{v}^T\boldsymbol{\mu})^2 = \mathbf{v}^T\mathbf{S}\mathbf{v}, \quad \text{with } \mathbf{S} = \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T, \quad (5)$$
where we take the first component as an example for simplicity; the constraint $\mathbf{v}^T\mathbf{v} = 1$ indicates that we are interested only in the direction, not its magnitude. Analogously to PCA, which maximizes the above objective, we maximize $\mathrm{Tr}(\mathbf{S}^{(v)}_t; \mathbf{P}^{(v)})$ with respect to $\mathbf{P}^{(v)}$ to account for expressiveness.

Consistence across Multiple Views

The objective above seeks metrics that jointly maximize discriminativeness and expressiveness. Since our model is devoted to data with multiple views, where complementarity is critical, we now explore complementary information across views using the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005]. For determining complex associations between two signals, HSIC has been justified both theoretically [Gretton et al., 2005] and empirically [Xiao and Guo, 2015; Song et al., 2007] as a proper measure of (in)dependence when associated with a universal kernel. Letting the observations $\mathbf{Z}^{(v)}$ and $\mathbf{Z}^{(w)}$ (corresponding to two different views) contain $n$ data points $\{(\mathbf{z}^{(v)}_i, \mathbf{z}^{(w)}_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$ jointly drawn from a probability distribution $P_{\mathbf{z}^{(v)}\mathbf{z}^{(w)}}$, the consistence between the two views is measured by the dependence between $\mathbf{z}^{(v)}$ and $\mathbf{z}^{(w)}$. The dependence measured by HSIC is computed as the norm of the cross-covariance operator over the domain $\mathcal{X} \times \mathcal{Y}$ in Hilbert space. A large HSIC value indicates strong dependence with respect to the chosen kernels. HSIC is defined as
$$\mathrm{HSIC}(P_{\mathbf{z}^{(v)}\mathbf{z}^{(w)}}, \mathcal{F}, \mathcal{G}) := \|\mathcal{C}_{\mathbf{z}^{(v)}\mathbf{z}^{(w)}}\|^2_{HS},$$
where $\|A\|_{HS} = \sqrt{\sum_{i,j} a^2_{ij}}$, and $\mathcal{F}$ and $\mathcal{G}$ are reproducing kernel Hilbert spaces (RKHS) on $\mathcal{X}$ and $\mathcal{Y}$, respectively.
The cross-covariance operator gives the covariance of the two random variables and is defined as
$$\mathcal{C}_{\mathbf{z}^{(v)}\mathbf{z}^{(w)}} = \mathbb{E}_{\mathbf{z}^{(v)}\mathbf{z}^{(w)}}\big[(\phi(\mathbf{z}^{(v)}) - \boldsymbol{\mu}_{\mathbf{z}^{(v)}}) \otimes (\varphi(\mathbf{z}^{(w)}) - \boldsymbol{\mu}_{\mathbf{z}^{(w)}})\big],$$
where $\boldsymbol{\mu}_{\mathbf{z}^{(v)}} = \mathbb{E}(\phi(\mathbf{z}^{(v)}))$, $\boldsymbol{\mu}_{\mathbf{z}^{(w)}} = \mathbb{E}(\varphi(\mathbf{z}^{(w)}))$, and $\otimes$ is the tensor product. $\phi(\mathbf{z}^{(v)})$ and $\varphi(\mathbf{z}^{(w)})$ map $\mathbf{z}^{(v)} \in \mathcal{X}$ and $\mathbf{z}^{(w)} \in \mathcal{Y}$ into the kernel spaces $\mathcal{F}$ and $\mathcal{G}$ associated with the kernel functions $k_v(\mathbf{z}^{(v)}_i, \mathbf{z}^{(v)}_j) = \langle \phi(\mathbf{z}^{(v)}_i), \phi(\mathbf{z}^{(v)}_j) \rangle$ and $k_w(\mathbf{z}^{(w)}_i, \mathbf{z}^{(w)}_j) = \langle \varphi(\mathbf{z}^{(w)}_i), \varphi(\mathbf{z}^{(w)}_j) \rangle$. Accordingly, the empirical HSIC is defined as
$$\mathrm{HSIC}(\mathbf{Z}^{(v)}, \mathbf{Z}^{(w)}) = (n-1)^{-2}\,\mathrm{tr}(\mathbf{K}_v \mathbf{H} \mathbf{K}_w \mathbf{H}), \quad (6)$$
where $\mathbf{K}_v$ and $\mathbf{K}_w$ are the Gram matrices with $k_{v,ij} = k_v(\mathbf{z}^{(v)}_i, \mathbf{z}^{(v)}_j)$ and $k_{w,ij} = k_w(\mathbf{z}^{(w)}_i, \mathbf{z}^{(w)}_j)$, and $\mathbf{H}$ with $h_{ij} = \delta_{ij} - 1/n$ centers the Gram matrices to have zero mean in the feature space. In our implementation, we use the inner-product kernel, i.e., $\mathbf{K}^{(v)} = \mathbf{Z}^{(v)T}\mathbf{Z}^{(v)} = \mathbf{X}^{(v)T}\mathbf{P}^{(v)T}\mathbf{P}^{(v)}\mathbf{X}^{(v)}$, and promising performance is achieved. Note that maximizing $\mathrm{HSIC}(\mathbf{Z}^{(v)}, \mathbf{Z}^{(w)})$ enhances the dependence between $\mathbf{K}^{(v)}$ and $\mathbf{K}^{(w)}$, which penalizes the disagreement between the kernel matrices of different views parameterized by the projections $\mathbf{P}^{(v)}$ and $\mathbf{P}^{(w)}$.

Objective Function

For multi-view metric learning, we jointly enhance intra-view separability, expressiveness, and inter-view correlation with respect to the learned metrics in a unified objective:
$$\max_{\{\mathbf{P}^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V} \mathrm{tr}(\mathbf{S}^{(v)}_b; \mathbf{P}^{(v)}) + \lambda_1 \sum_{v=1}^{V} \mathrm{tr}(\mathbf{S}^{(v)}_t; \mathbf{P}^{(v)}) + \lambda_2 \sum_{v \neq w} \mathrm{HSIC}(\mathbf{P}^{(v)}\mathbf{X}^{(v)}, \mathbf{P}^{(w)}\mathbf{X}^{(w)}) \quad \text{s.t. } \mathbf{P}^{(v)}\mathbf{P}^{(v)T} = \mathbf{I}, \; v = 1, \ldots, V, \quad (7)$$
where $\lambda_1 > 0$ and $\lambda_2 > 0$ are hyperparameters encoding the belief degrees for expressiveness and inter-view consistence, respectively. We impose orthogonality constraints on $\mathbf{P}^{(v)}$ (i.e., $\mathbf{P}^{(v)}\mathbf{P}^{(v)T} = \mathbf{I}$) for the following reasons: first, this addresses the scale issue, since without the constraint the entries of $\mathbf{P}^{(v)}$ could grow arbitrarily large to maximize the objective; second, it is consistent with the requirement of expressiveness in PCA (see Eq. (5)); last but not least, it is convenient for optimization, as discussed later.
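Assuming the linear (inner-product) kernel the paper uses, the empirical HSIC of Eq. (6) and the view-wise optimization of Eq. (7) can be sketched in NumPy as follows. This is a minimal sketch under our own naming and simplifications (e.g., a trivial orthonormal initialization), not the authors' implementation:

```python
import numpy as np

def empirical_hsic(Zv, Zw):
    """Empirical HSIC of Eq. (6) with linear kernels; each Z is d x n."""
    n = Zv.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix, h_ij = delta_ij - 1/n
    Kv, Kw = Zv.T @ Zv, Zw.T @ Zw              # Gram matrices under the linear kernel
    return np.trace(Kv @ H @ Kw @ H) / (n - 1) ** 2

def fish_mml(Xs, y, k, lam1=1.0, lam2=1.0, n_iter=10):
    """Alternating updates: each row-orthonormal P^(v) is refreshed from the
    top-k eigenvectors of D^(v) = A^(v) + lam1 * B^(v) + lam2 * C^(v)."""
    V, n = len(Xs), Xs[0].shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    Ps = [np.eye(k, X.shape[0]) for X in Xs]   # trivial row-orthonormal initialization

    def scatter(X):                            # A: between-class, B: total scatter
        mu = X.mean(axis=1, keepdims=True)
        B = (X - mu) @ (X - mu).T
        A = np.zeros_like(B)
        for c in np.unique(y):
            Xc = X[:, y == c]
            mu_c = Xc.mean(axis=1, keepdims=True)
            A += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
        return A, B

    for _ in range(n_iter):
        for v in range(V):
            A, B = scatter(Xs[v])
            C = np.zeros_like(A)
            for w in range(V):
                if w != v:                     # cross-view term X^(v) H K^(w) H X^(v)T
                    Zw = Ps[w] @ Xs[w]
                    C += Xs[v] @ H @ (Zw.T @ Zw) @ H @ Xs[v].T
            D = A + lam1 * B + lam2 * C
            _, eigvecs = np.linalg.eigh(D)     # eigenvalues in ascending order
            Ps[v] = eigvecs[:, -k:].T          # top-k eigenvectors maximize the trace
    Ms = [P.T @ P for P in Ps]                 # learned Mahalanobis matrices M^(v)
    return Ps, Ms

# Toy usage: two random views of 40 samples with two classes.
rng = np.random.default_rng(0)
Xs = [rng.standard_normal((6, 40)), rng.standard_normal((8, 40))]
y = np.repeat([0, 1], 20)
Ps, Ms = fish_mml(Xs, y, k=3, n_iter=3)
```

Each inner update is a symmetric eigenvalue problem, which is what makes the per-view step efficient; the returned $\mathbf{M}^{(v)} = \mathbf{P}^{(v)T}\mathbf{P}^{(v)}$ are symmetric positive semi-definite by construction.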
Our objective function can be rewritten as
$$\max_{\{\mathbf{P}^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V} \mathrm{tr}\big(\mathbf{P}^{(v)}(\mathbf{A}^{(v)} + \lambda_1\mathbf{B}^{(v)} + \lambda_2\mathbf{C}^{(v)})\mathbf{P}^{(v)T}\big) = \max_{\{\mathbf{P}^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V} \mathrm{tr}\big(\mathbf{P}^{(v)}\mathbf{D}^{(v)}\mathbf{P}^{(v)T}\big), \quad (8)$$
with
$$\mathbf{A}^{(v)} = \sum_{j=1}^{g} n_j \Big(\tfrac{1}{n_j}\textstyle\sum_{i=1}^{n_j}\mathbf{x}^{j(v)}_i - \tfrac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(v)}_i\Big)\Big(\tfrac{1}{n_j}\textstyle\sum_{i=1}^{n_j}\mathbf{x}^{j(v)}_i - \tfrac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(v)}_i\Big)^T,$$
$$\mathbf{B}^{(v)} = \sum_{i=1}^{n}\Big(\mathbf{x}^{(v)}_i - \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathbf{x}^{(v)}_i\Big)\Big(\mathbf{x}^{(v)}_i - \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathbf{x}^{(v)}_i\Big)^T,$$
$$\mathbf{C}^{(v)} = \sum_{w=1; w \neq v}^{V} \mathbf{X}^{(v)}\mathbf{H}\mathbf{K}^{(w)}\mathbf{H}\mathbf{X}^{(v)T}, \quad \mathbf{D}^{(v)} = \mathbf{A}^{(v)} + \lambda_1\mathbf{B}^{(v)} + \lambda_2\mathbf{C}^{(v)}.$$
Discriminativeness, expressiveness, and consistence are accounted for by $\mathbf{A}^{(v)}$, $\mathbf{B}^{(v)}$, and $\mathbf{C}^{(v)}$, respectively. Given the condition $\mathbf{P}^{(v)}\mathbf{P}^{(v)T} = \mathbf{I}$ and with the variables of the other views fixed, updating $\mathbf{P}^{(v)}$ is an eigenvalue decomposition problem that can be solved efficiently. Once the $\mathbf{P}^{(v)}$s are learned, we obtain $\mathbf{M}^{(v)} = \mathbf{P}^{(v)T}\mathbf{P}^{(v)}$. The multi-view data $\mathcal{X}_i = \{\mathbf{x}^{(1)}_i, \ldots, \mathbf{x}^{(V)}_i\}$ can then be projected by the $\mathbf{P}^{(v)}$s and transformed into $\hat{\mathcal{X}}_i = \{\mathbf{P}^{(1)}\mathbf{x}^{(1)}_i, \ldots, \mathbf{P}^{(V)}\mathbf{x}^{(V)}_i\}$. After concatenating all projected feature vectors, existing classification methods (e.g., kNN) can be employed.

To summarize, our approach has the following merits: (1) it is simple yet effective for multi-view metric learning; (2) it jointly learns multiple metrics by simultaneously enforcing separability and expressiveness and exploring complex correlations among different views; (3) both intra-view relationships of data points and inter-view correlations of different views are addressed seamlessly in a unified framework; (4) it is solved efficiently with the alternating direction method (ADM), and since the value of the objective is non-decreasing over iterations, the algorithm is guaranteed to converge.

5 Experiments

We conduct experiments on four real-world datasets and compare FISH-MML with existing state-of-the-art methods in terms of diverse evaluation measures.

5.1 Setting

The datasets employed are as follows. handwritten1 contains 2000 images of the 10 digit classes 0 to 9. There are 6 types of descriptors extracted: Pix (view1), Fou (view2), Fac (view3), ZER (view4), KAR (view5) and MOR (view6). Caltech101-72 contains a subset of images from Caltech101.
There are 7 categories selected with 1474 images: faces, motorbikes, dollar-bill, garfield, snoopy, stop-sign, and windsor-chair. 6 types of features are used: Gabor (view1), WM (view2), CENTRIST (view3), HOG (view4), GIST (view5) and LBP (view6).

1 https://archive.ics.uci.edu/ml/datasets/Multiple+Features
2 http://www.vision.caltech.edu/Image_Datasets/Caltech101/

| Dataset | Metric | kNN | ITML | LMNN | LDML | GMML | Ours |
|---|---|---|---|---|---|---|---|
| handwritten | Accuracy | .972±.008 | .969±.013 | .928±.012 | .976±.005 | .977±.006 | .979±.006 |
| | F1-score | .945±.016 | .939±.026 | .864±.022 | .953±.011 | .955±.011 | .959±.011 |
| | Precision | .944±.017 | .939±.026 | .863±.023 | .953±.011 | .955±.011 | .958±.013 |
| | Recall | .946±.015 | .938±.026 | .865±.020 | .953±.010 | .955±.011 | .959±.010 |
| Caltech101-7 | Accuracy | .879±.014 | .901±.014 | .838±.024 | .879±.014 | .879±.014 | .977±.007 |
| | F1-score | .863±.024 | .900±.023 | .881±.019 | .863±.024 | .867±.024 | .982±.008 |
| | Precision | .815±.029 | .841±.030 | .862±.033 | .815±.029 | .821±.027 | .967±.015 |
| | Recall | .961±.027 | .969±.020 | .903±.034 | .955±.026 | .964±.025 | .997±.001 |
| MSRA | Accuracy | .719±.069 | .795±.068 | .752±.097 | .793±.079 | .721±.068 | .874±.042 |
| | F1-score | .583±.093 | .660±.103 | .605±.143 | .656±.124 | .583±.093 | .766±.078 |
| | Precision | .572±.099 | .657±.099 | .604±.138 | .660±.126 | .574±.099 | .779±.081 |
| | Recall | .596±.096 | .671±.090 | .608±.152 | .654±.128 | .596±.096 | .755±.085 |
| football | Accuracy | .727±.069 | .847±.047 | .608±.076 | .812±.051 | .749±.055 | .824±.038 |
| | F1-score | .498±.124 | .743±.091 | .474±.115 | .698±.085 | .556±.107 | .680±.116 |
| | Precision | .428±.151 | .718±.109 | .455±.123 | .677±.105 | .497±.132 | .645±.138 |
| | Recall | .635±.057 | .776±.090 | .502±.117 | .725±.081 | .651±.057 | .729±.108 |

Table 1: Comparison to metric learning methods with the best single view.
| Dataset | Metric | kNN | ITML | LMNN | LDML | GMML | HMML | EMGMML | Ours |
|---|---|---|---|---|---|---|---|---|---|
| handwritten | Accuracy | .941±.015 | .948±.013 | .922±.020 | .944±.012 | .939±.011 | .927±.013 | .839±.012 | .979±.006 |
| | F1-score | .886±.027 | .901±.022 | .854±.037 | .892±.020 | .884±.018 | .861±.023 | .770±.017 | .959±.011 |
| | Precision | .884±.028 | .901±.022 | .853±.038 | .892±.019 | .884±.018 | .860±.023 | .760±.019 | .958±.013 |
| | Recall | .888±.026 | .901±.023 | .854±.036 | .892±.021 | .885±.019 | .861±.023 | .780±.016 | .959±.010 |
| Caltech101-7 | Accuracy | .882±.012 | .915±.016 | .830±.061 | .882±.012 | .881±.013 | .921±.013 | .919±.007 | .977±.007 |
| | F1-score | .862±.021 | .921±.018 | .810±.099 | .862±.062 | .861±.023 | .913±.016 | .903±.010 | .982±.008 |
| | Precision | .787±.034 | .873±.028 | .778±.113 | .787±.034 | .786±.033 | .861±.025 | .830±.015 | .967±.015 |
| | Recall | .954±.024 | .974±.016 | .847±.084 | .954±.024 | .952±.030 | .970±.016 | .991±.004 | .997±.001 |
| MSRA | Accuracy | .700±.092 | .769±.057 | .767±.098 | .700±.092 | .702±.095 | .798±.044 | .802±.030 | .874±.042 |
| | F1-score | .555±.125 | .616±.089 | .627±.140 | .555±.125 | .556±.140 | .641±.071 | .651±.032 | .766±.078 |
| | Precision | .540±.125 | .595±.086 | .622±.135 | .540±.125 | .543±.148 | .620±.062 | .628±.036 | .779±.081 |
| | Recall | .575±.137 | .641±.103 | .632±.148 | .575±.137 | .574±.142 | .667±.096 | .675±.033 | .755±.085 |
| football | Accuracy | .631±.077 | .580±.078 | .416±.053 | .627±.078 | .651±.070 | .522±.093 | .702±.088 | .824±.038 |
| | F1-score | .432±.106 | .413±.071 | .230±.075 | .426±.105 | .469±.095 | .305±.078 | .349±.174 | .680±.116 |
| | Precision | .351±.103 | .377±.071 | .209±.070 | .345±.101 | .402±.087 | .261±.077 | .287±.179 | .645±.138 |
| | Recall | .576±.119 | .460±.080 | .264±.096 | .571±.125 | .570±.121 | .383±.107 | .483±.126 | .729±.108 |

Table 2: Comparison to metric learning methods with multiple views.

Figure 2: Visualization of features with t-SNE on (a) handwritten, (b) Caltech101, (c) MSRA and (d) football. The top row corresponds to direct concatenation of the original feature vectors of multiple views (i.e., [x(1); ...; x(V)]), while the bottom row is the visualization result of our approach (i.e., [P(1)x(1); ...; P(V)x(V)]).
MSRA [Liu et al., 2010] contains 210 images labeled with 7 classes: tree, building, airplane, cow, face, car, and bicycle. 6 types of features are extracted: CENT (view1), CMT (view2), GIST (view3), HOG (view4), LBP (view5), and SIFT (view6). football3 consists of 248 English Premier League football players on Twitter, labeled with 20 communities. There are 6 views describing relationships between two users: follows (view1), followed by (view2), mentions (view3), mentioned by (view4), retweets (view5) and retweeted by (view6).

We compared our method with the following baselines.

kNN. kNN based on Euclidean distance, applied to each single view of features and to feature concatenation.

ITML (Information-Theoretic Metric Learning) [Davis et al., 2007]. The method characterizes the metric as a Mahalanobis distance by minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function.

LMNN (Large Margin Nearest Neighbors) [Weinberger and Saul, 2009]. The method learns a Mahalanobis distance metric to improve kNN classification.

LDML (Logistic Discriminant Metric Learning) [Guillaumin et al., 2009]. This method employs logistic discriminant analysis to learn a metric such that positive pairs have smaller distances than negative pairs.

HMML (Heterogeneous Multi-Metric Learning) [Zhang et al., 2011]. The method jointly learns multiple optimal homogeneous/heterogeneous metrics to fuse data collected from multiple sensors for classification, generalizing the LMNN framework.

GMML (Geometric Mean Metric Learning) [Zadeh et al., 2016]. The method is built on geometric intuition and learns a symmetric positive definite matrix by formulating a smooth, strictly convex optimization problem.
EMGMML (Efficient Multi-modal Geometric Mean Metric Learning) [Liang et al., 2017]. The method learns a set of optimal homogeneous/heterogeneous metrics, generalizing the GMML framework.

Each dataset is randomly partitioned into 80% for training and 20% for testing; 20% of the training samples are then randomly selected as a validation set for parameter tuning. We select the values of λ1 and λ2 from {0.001, 0.01, 0.1, 1, 10, 100, 1000}. Uniformly, we set the number of nearest neighbors to 5 for all methods on each dataset. To account for the randomness of the data partition, we run each experiment 10 times and report the average performance with standard deviation.

3 http://mlg.ucd.ie/aggregation/index.html

5.2 Results

Since the objective is non-decreasing over iterations, the algorithm is guaranteed to converge. We conduct convergence experiments on the four datasets, shown in Fig. 3.

Figure 3: Convergence experiment: the value of the objective over 10 iterations on (a) handwritten, (b) Caltech101-7, (c) MSRA and (d) football.

As shown in Table 1, we first compared our method with existing metric learning methods using the best single view. FISH-MML achieves the best performance on 3 out of 4 datasets in terms of all evaluation metrics. As a strong competitor, ITML performs best on football; however, on handwritten, Caltech101-7 and MSRA, ITML does not perform very well. As shown in Table 2, we also compared our method with multi-view metric learning approaches. For traditional single-view metric learning methods, we concatenate the feature vectors of all views as input. Two of the comparisons, HMML and EMGMML, are specially designed for multi-view data.
Our method outperforms all competitors on all four datasets, which further demonstrates the advantage of FISH-MML in exploring complementarity across multiple views. Fig. 2 intuitively demonstrates the advantage of our approach using t-distributed stochastic neighbor embedding (t-SNE) [Maaten and Hinton, 2008]: the clusters in terms of ground-truth labels obtained with our model are more compact and separable than those obtained by directly combining the different views.

6 Conclusions

This paper has proposed a metric learning model for multi-view data that jointly learns multiple metrics for multiple views. Our proposal has the advantage of simultaneously exploring intra-view relationships and inter-view correlations in a unified framework. Specifically, we introduce Fisher discriminant analysis to enhance separability and expressiveness, and utilize the Hilbert-Schmidt Independence Criterion to ensure consistence across different views. Our method is relatively simple to implement and easy to optimize, with guaranteed convergence to a local optimum. Experiments on benchmark datasets have verified the advantages of our approach over state-of-the-art metric learning methods.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61602337, 61732011, 61702358).

References

[Blaschko and Lampert, 2008] Matthew B Blaschko and Christoph H Lampert. Correlational spectral clustering. In CVPR, pages 1-8, 2008.
[Cao et al., 2015] X. Cao, C. Zhang, H. Fu, Si Liu, and Hua Zhang. Diversity-induced multi-view subspace clustering. In CVPR, pages 586-594, 2015.
[Chaudhuri et al., 2009] Kamalika Chaudhuri, Sham M Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, pages 129-136, 2009.
[Cheng et al., 2011] Qiang Cheng, Hongbo Zhou, and Jie Cheng.
The Fisher-Markov selector: fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data. IEEE T-PAMI, 33(6):1217-1233, 2011.
[Dalal and Triggs, 2005] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886-893, 2005.
[Davis et al., 2007] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, pages 209-216, 2007.
[Dhillon et al., 2011] Paramveer Dhillon, Dean P Foster, and Lyle H Ungar. Multi-view learning of word embeddings via CCA. In NIPS, pages 199-207, 2011.
[Ding et al., 2015] Zhengming Ding, Sungjoo Suh, Jae-Joon Han, Changkyu Choi, and Yun Fu. Discriminative low-rank metric learning for face recognition. In FG, volume 1, pages 1-6, 2015.
[Gong, 2017] Chen Gong. Exploring commonality and individuality for multi-modal curriculum learning. In AAAI, pages 1926-1933, 2017.
[Gretton et al., 2005] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, volume 16, pages 63-78. Springer, 2005.
[Guillaumin et al., 2009] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Is that you? Metric learning approaches for face identification. In ICCV, pages 498-505, 2009.
[Hu et al., 2014] Junlin Hu, Jiwen Lu, Junsong Yuan, and Yap-Peng Tan. Large margin multi-metric learning for face and kinship verification in the wild. In ACCV, pages 252-267. Springer, 2014.
[Hu et al., 2017] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Sharable and individual multi-view metric learning. IEEE T-PAMI, 2017.
[Kumar et al., 2011] Abhishek Kumar, Piyush Rai, and Hal Daume. Co-regularized multi-view spectral clustering. In NIPS, pages 1413-1421, 2011.
[Liang et al., 2017] Jianqing Liang, Qinghua Hu, Pengfei Zhu, and Wenwu Wang. Efficient multi-modal geometric mean metric learning. Pattern Recognition, 2017.
[Liu et al., 2010] Tie Liu, Zejian Yuan, Jian Sun, Jingdong Wang, Nanning Zheng, Xiaoou Tang, and Heung-Yeung Shum. Learning to detect a salient object. IEEE T-PAMI, 33(2):353-367, 2010.
[Liu et al., 2016] Xinwang Liu, Yong Dou, Jianping Yin, Lei Wang, and En Zhu. Multiple kernel k-means clustering with matrix-induced regularization. In AAAI, pages 1888-1894, 2016.
[Lowe, 1999] David G Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150-1157, 1999.
[Lu et al., 2015] Jiwen Lu, Gang Wang, Weihong Deng, Pierre Moulin, and Jie Zhou. Multi-manifold deep metric learning for image set classification. In CVPR, pages 1137-1145, 2015.
[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
[Sindhwani and Rosenberg, 2008] Vikas Sindhwani and David S Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In ICML, pages 976-983, 2008.
[Song et al., 2007] Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo. Supervised feature selection via dependence estimation. In ICML, pages 823-830, 2007.
[Wang et al., 2016] Shuyang Wang, Zhengming Ding, and Yun Fu. Coupled marginalized auto-encoders for cross-domain multi-view learning. In IJCAI, pages 2125-2131, 2016.
[Weinberger and Saul, 2009] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207-244, 2009.
[Xiao and Guo, 2015] Min Xiao and Yuhong Guo. Feature space independent semi-supervised domain adaptation via kernel matching. IEEE T-PAMI, 37(1):54-66, 2015.
[Xie and Xing, 2013] Pengtao Xie and Eric P Xing. Multi-modal distance metric learning. 2013.
[Xing et al., 2003] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In NIPS, pages 521-528, 2003.
[Ying et al., 2009] Yiming Ying, Kaizhu Huang, and Colin Campbell. Sparse metric learning via smooth optimization. In NIPS, pages 2214-2222, 2009.
[Zadeh et al., 2016] Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In ICML, pages 2464-2471, 2016.
[Zhang et al., 2011] Haichao Zhang, Thomas S Huang, Nasser M Nasrabadi, and Yanning Zhang. Heterogeneous multi-metric learning for multi-sensor fusion. In Information Fusion (FUSION), pages 1-8, 2011.
[Zhang et al., 2015] C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao. Low-rank tensor constrained multiview subspace clustering. In ICCV, pages 1582-1590, 2015.
[Zhang et al., 2017] C. Zhang, Q. Hu, H. Fu, P. Zhu, and X. Cao. Latent multi-view subspace clustering. In CVPR, pages 4333-4341, 2017.
[Zhao et al., 2017] Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix factorization. In AAAI, pages 2921-2927, 2017.