# Online Heterogeneous Transfer Metric Learning

Yong Luo¹, Tongliang Liu², Yonggang Wen¹, Dacheng Tao²
¹ School of Computer Science and Engineering, Nanyang Technological University, Singapore
² UBTECH Sydney AI Centre, SIT, FEIT, University of Sydney, Australia
yluo180@gmail.com, tongliang.liu@sydney.edu.au, ygwen@ntu.edu.sg, dacheng.tao@sydney.edu.au

**Abstract**

Distance metric learning (DML) has been demonstrated to be successful and essential in diverse applications. Transfer metric learning (TML) can help DML in a target domain with limited label information by utilizing information from related source domains. Heterogeneous TML (HTML), where the feature representations vary from the source to the target domain, is general and challenging. However, current HTML approaches are usually conducted in a batch manner and cannot handle sequential data. This motivates the proposed online HTML (OHTML) method. In particular, the distance metric in the source domain is pre-trained using some existing DML algorithm. To enable knowledge transfer, we assume there are large amounts of unlabeled corresponding data that have representations in both the source and target domains. By enforcing the distances (between these unlabeled samples) in the target domain to agree with those in the source domain under the manifold regularization theme, we learn an improved target metric. We formulate the problem in the online setting so that the optimization is efficient and the model can be adapted to newly arriving data. Experiments in diverse applications demonstrate both the effectiveness and efficiency of the proposed method.

## 1 Introduction

Distance metric learning (DML) aims to learn a proper distance function to reveal the underlying data relationship [Xing et al., 2002]. It has been demonstrated to be successful in diverse applications, such as nearest-neighbor classification [Weinberger et al., 2005], clustering [Xing et al., 2002], content-based image retrieval [Jain et al., 2008], and face verification [Chopra et al., 2005]. To learn an appropriate distance metric, we often need large amounts of label information, such as class labels, pairwise (similar/dissimilar) constraints, or triplet (relative comparison) constraints. However, in real-world applications, the provided information is usually scarce due to the high labeling cost, and DML is likely to fail in this scenario. Transfer metric learning (TML) [Zha et al., 2009] is able to alleviate this issue by transferring information or knowledge from related source domains [Shao et al., 2016; Liu et al., 2017; Wang et al., 2017], where a better distance metric is available.

Directly applying the source metric to the target domain is infeasible when samples in the source and target domains lie in different feature spaces or exhibit a semantic gap. This challenging heterogeneous TML (HTML) setting is common in real-world applications. For example, we can use a large corpus of labeled English documents to help classify Spanish documents; the dimensions of the English and Spanish document representations differ because different vocabularies are used. It is also advantageous to use existing expensive (high-performing but computationally intensive) features (such as deep CNN features [Chatfield et al., 2014]) to help learn a better metric for cheap features (such as LBP [Ojala et al., 2002]). We may further use semantic tags to guide the metric learning of visual features.
Heterogeneous transfer learning (HTL) [Luo et al., 2017a; 2017b; Li et al., 2017] is able to handle heterogeneous features [Xie et al., 2016; 2017], and some HTL approaches learn feature transformations to map the source and target samples into a common subspace [Wang and Mahadevan, 2011]. The learned transformation in the target domain can then be used to derive an improved target distance metric. However, most of these approaches conduct the learning process on the entire training set. Hence, they are not applicable in the online setting, where training samples arrive sequentially (one by one). In addition, it is wasteful to retrain the model from scratch whenever new (labeled) training samples become available.

To tackle this issue, we develop a novel online HTML (OHTML) method, which updates the target metric using the source knowledge and only one labeled target sample at each step. In particular, we first learn the distance metric in the source domain using some existing DML algorithm. The source metric learning can be performed offline, and only the obtained metric is needed in the target metric learning. Alternatively, if the source feature (such as a deep CNN feature) is much more expressive than the target feature, we can directly employ the simple Euclidean metric (or some other pre-defined metric) in the source domain. To build a connection between the source and target domains, we assume there are abundant unlabeled training samples that have representations in both the source and target domains. For each pair of such unlabeled samples, if they are close to each other in the source domain, their distance in the target domain should also be small. By formulating this as a manifold regularization term [Belkin et al., 2006], and simultaneously minimizing the empirical loss w.r.t. the current labeled training sample, we learn an improved distance metric for the target domain. Besides, the LogDet divergence [Davis et al., 2007; Jain et al., 2008] is enforced to minimize the difference between the newly obtained target metric and the previous one. This ensures that the updated metric parameter automatically satisfies the positive semi-definite (PSD) constraint. We do not require a costly PSD projection, and thus the updating algorithm is quite efficient.

A recent work, metric imitation (MI) [Dai et al., 2015], aims to utilize an expensive source feature to learn an improved metric for a cheap target feature. Its formulation is also based on manifold learning, but it discards the valuable label information, and the target metric (parameterized by a linear transformation) is learned in a batch manner. Eigenvalue decomposition is involved in the optimization, and thus its training complexity is high. The proposed OHTML is more advantageous than MI in that: 1) the target metric can be updated dynamically and adapted to patterns in newly arriving data; 2) the optimization is much more efficient.

We compare the proposed method with representative online DML algorithms [Jain et al., 2008; Jin et al., 2009] and competitive HTML approaches [Wang and Mahadevan, 2011; Dai et al., 2015] in various applications, including object categorization, scene clustering, face verification, and image retrieval. The results demonstrate both the effectiveness and efficiency of our method.
## 2 Online Heterogeneous Transfer Metric Learning

**Problem setting:** we are given a source and a target domain with heterogeneous feature representations. The training set with side information for the target domain is $\mathcal{D}_M^L = \{(\mathbf{x}_{Mk}^1, \mathbf{x}_{Mk}^2, y_{Mk})\}_{k=1}^{N_M}$, where $\mathbf{x}_{Mk}^1, \mathbf{x}_{Mk}^2 \in \mathbb{R}^{d_M}$, and $y_{Mk} = \pm 1$ indicates whether $\mathbf{x}_{Mk}^1$ and $\mathbf{x}_{Mk}^2$ are similar or dissimilar to each other. In the target domain, the number of sample pairs $N_M$ is small and the utilized feature is cheap. Hence the distance metric obtained by applying existing DML algorithms may perform poorly. To improve the target metric, we assume there is a relevant source domain with training set $\mathcal{D}_S^L = \{(\mathbf{x}_{Sk}^1, \mathbf{x}_{Sk}^2, y_{Sk})\}_{k=1}^{N_S}$. In the source domain, either the samples with side information are abundant (i.e., $N_S$ is large), or the feature is more expressive or interpretable than that in the target domain. Therefore, a better distance metric can be obtained there. To allow the target metric learning to exploit the source-domain knowledge, we assume there are large amounts of unlabeled data that have representations in both the source and target domains, i.e., $\mathcal{D}^U = \{(\mathbf{x}_{Sn}^U, \mathbf{x}_{Mn}^U)\}_{n=1}^{N^U}$. Such data are usually easy to collect in practice.

### 2.1 Problem Formulation

Our ultimate goal is to learn an appropriate Mahalanobis distance metric for the target domain by transferring knowledge from the source domain. The Mahalanobis distance is defined as

$$d_A(\mathbf{x}_k^1, \mathbf{x}_k^2) = (\mathbf{x}_k^1 - \mathbf{x}_k^2)^T A (\mathbf{x}_k^1 - \mathbf{x}_k^2), \quad (1)$$

where $A$, the metric (the parameter of the distance function), is a positive semi-definite (PSD) matrix. To facilitate knowledge transfer [Du et al., 2013; Shao et al., 2014], we learn the distance metric in the source domain beforehand using some existing DML algorithm, such as LMNN [Weinberger et al., 2005] or ITML [Davis et al., 2007]. This can be conducted offline and does not affect the computational complexity of target metric learning. In the target domain, the labeled training pairs are provided sequentially, i.e., only one labeled pair is available at each step. Given the pre-trained source metric $A_S$, we have the following general formulation for updating the target metric $A_M$:

$$A_M^{k+1} = \arg\min_{A_M \succeq 0} F(A_M) = \Psi(A_M) + \gamma_A \mathrm{Div}(A_M, A_M^k) + \gamma_I R_I(d_{A_M}, d_{A_S}; \mathcal{D}^U), \quad (2)$$

where $\Psi(A_M) = V(A_M; \mathbf{x}_{Mk}^1, \mathbf{x}_{Mk}^2, y_{Mk})$ is the empirical loss w.r.t. $A_M$ on the current training pair. We choose $V(A_M; \mathbf{x}_{Mk}^1, \mathbf{x}_{Mk}^2, y_{Mk}) = g(y_{Mk}[1 - d_{A_M}(\mathbf{x}_{Mk}^1, \mathbf{x}_{Mk}^2)])$, where $g(z) = \max\{0, b - z\}$ is the hinge loss, and we set $b = 0$ in this paper. For notational simplicity, we write $\delta_k = \mathbf{x}_k^1 - \mathbf{x}_k^2$, so that $d_A(\mathbf{x}_k^1, \mathbf{x}_k^2) = \delta_k^T A \delta_k$.

The regularization term $\mathrm{Div}(A_M, A_M^k)$ measures the difference between the new and previously obtained metric parameters. In this paper, we choose $\mathrm{Div}(\cdot, \cdot)$ to be the LogDet divergence [Davis et al., 2007], which is desirable in DML since it is scale-invariant and makes $A_M$ automatically satisfy the PSD constraint $A_M \succeq 0$. For any pair of unlabeled samples $(\mathbf{x}_i^U, \mathbf{x}_j^U)$ in $\mathcal{D}^U$, we calculate their distances $d_{A_M}(\mathbf{x}_{Mi}^U, \mathbf{x}_{Mj}^U)$ and $d_{A_S}(\mathbf{x}_{Si}^U, \mathbf{x}_{Sj}^U)$ in the target and source domains respectively. Since the two distances correspond to the same unlabeled sample pair, they should agree with each other ($d_{A_M}$ should be small if $d_{A_S}$ is small). This intuition is formulated in the regularization term $R_I$, which enables the source metric to guide the metric learning in the target domain. Different types of regularization can be employed for $R_I$, such as the absolute difference between the distances if the source and target features have been properly normalized.
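To make these building blocks concrete, the following is a minimal NumPy sketch of the squared Mahalanobis distance of Eq. (1) and the pairwise hinge loss chosen for $\Psi$; the function names are ours for illustration and are not from the paper.

```python
import numpy as np

def mahalanobis_dist(x1, x2, A):
    """Squared Mahalanobis distance d_A(x1, x2) = (x1 - x2)^T A (x1 - x2), Eq. (1)."""
    delta = x1 - x2
    return float(delta @ A @ delta)

def pair_hinge_loss(A, x1, x2, y, b=0.0):
    """Empirical loss V = g(y * [1 - d_A(x1, x2)]) with hinge g(z) = max(0, b - z).

    With b = 0 (as in the paper), a similar pair (y = +1) is penalized when
    d_A > 1, while a dissimilar pair (y = -1) is penalized when d_A < 1.
    """
    z = y * (1.0 - mahalanobis_dist(x1, x2, A))
    return max(0.0, b - z)
```

The unit threshold in $y_{Mk}[1 - d_{A_M}(\cdot,\cdot)]$ thus plays the role of a margin: similar pairs should fall within distance 1 and dissimilar pairs outside it.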
In this paper, we design a manifold-based regularizer, since it takes the geometry of the data distribution into consideration. Because the matrix $A_M$ is positive semi-definite, we can decompose it as $A_M = U_M U_M^T$. Hence the distance $d_{A_M}(\mathbf{x}_{Mi}, \mathbf{x}_{Mj}) = (\mathbf{x}_{Mi} - \mathbf{x}_{Mj})^T U_M U_M^T (\mathbf{x}_{Mi} - \mathbf{x}_{Mj}) = \|U_M^T \mathbf{x}_{Mi} - U_M^T \mathbf{x}_{Mj}\|_2^2$. We then define a regularizer so that the transformation $U_M$ is smooth over the source manifold, i.e.,

$$R_I(\cdot) = \frac{1}{(N^U)^2} \sum_{i,j=1}^{N^U} W_{Sij} \left\|U_M^T \mathbf{x}_{Mi}^U - U_M^T \mathbf{x}_{Mj}^U\right\|_2^2 = \frac{1}{(N^U)^2} \mathrm{tr}\left(X_M^U L_S (X_M^U)^T A_M\right), \quad (3)$$

where the source manifold is approximated by the data adjacency graph $W_S$ calculated based on the distances $d_{A_S}$ in the source domain, $L_S$ is the graph Laplacian given by $L_S = D_S - W_S$, $W_S$ is constructed using the $k$-nearest-neighbor ($k$NN) method, and $D_S$ is a diagonal matrix with elements $D_{Sii} = \sum_{j=1}^{N^U} W_{Sij}$. In this paper, we choose $W_{Sij}$ to be a binary weight, i.e., $W_{Sij} = 1$ if the $j$-th unlabeled sample is a neighbor of the $i$-th sample, and $0$ otherwise. By substituting (3) into (2), we obtain the following specific optimization problem:

$$\begin{aligned} \arg\min_{A_M \succeq 0} F(A_M) = \; & \xi_{Mk} + \gamma_A \mathrm{tr}\left((A_M^k)^{-1} A_M\right) - \gamma_A \log\det(A_M) + \frac{\gamma_I}{(N^U)^2} \mathrm{tr}\left(X_M^U L_S (X_M^U)^T A_M\right), \\ \text{s.t.} \; & y_{Mk}\left[\delta_{Mk}^T A_M \delta_{Mk} - 1\right] \le \xi_{Mk}, \quad \xi_{Mk} \ge 0, \end{aligned} \quad (4)$$

where we initialize $A_M^0$ as an identity matrix.

### 2.2 Solution

The solution of the optimization problem (4) is given by the following theorem.

**Theorem 1.** The optimal solution of problem (4) is given by

$$A_M^{k+1} = \begin{cases} B_{Mk}, & \tau_M \le 0; \\ B_{Mk} - \dfrac{(s_{Mk} - 1) B_{Mk} \delta_{Mk} \delta_{Mk}^T B_{Mk}}{s_{Mk}^2}, & 0 < \tau_M < 1; \\ B_{Mk} - \dfrac{y_{Mk} B_{Mk} \delta_{Mk} \delta_{Mk}^T B_{Mk}}{\gamma_A + y_{Mk} s_{Mk}}, & \tau_M \ge 1, \end{cases} \quad (5)$$

where $\tau_M = \frac{\gamma_A}{y_{Mk}}\left(1 - \frac{1}{s_{Mk}}\right)$, $B_{Mk} = \left((A_M^k)^{-1} + \frac{\gamma_I}{\gamma_A} H_S\right)^{-1}$ with $H_S = \frac{1}{(N^U)^2} X_M^U L_S (X_M^U)^T$, and $s_{Mk} = \delta_{Mk}^T B_{Mk} \delta_{Mk}$.

*Proof.* For notational simplicity, we omit the subscript $M$ in the following derivation. By introducing the Lagrange multipliers $\tau \ge 0$ and $\lambda \ge 0$, we obtain the following Lagrangian of (4):

$$L(A, \xi_k, \lambda, \tau) = \xi_k + \gamma_A \mathrm{tr}\left((A^k)^{-1} A\right) - \gamma_A \log\det(A) + \gamma_I \mathrm{tr}(H_S A) - \lambda \xi_k + \tau\left(y_k[\delta_k^T A \delta_k - 1] - \xi_k\right), \quad (6)$$

where $H_S = \frac{1}{(N^U)^2} X^U L_S (X^U)^T$. By taking the derivative of $L$ with respect to $A$ and setting it to zero, we have

$$A^{-1} = B_k^{-1} + \frac{\tau}{\gamma_A} Z_k, \quad (7)$$

where $B_k = \left((A^k)^{-1} + \frac{\gamma_I}{\gamma_A} H_S\right)^{-1}$ and $Z_k = y_k \delta_k \delta_k^T$. By using the Sherman-Morrison inverse formula, i.e., $(B + uv^T)^{-1} = B^{-1} - \frac{B^{-1} u v^T B^{-1}}{1 + v^T B^{-1} u}$, we obtain

$$A = B_k - \frac{\tau y_k B_k \delta_k \delta_k^T B_k}{\gamma_A + \tau y_k s_k}, \quad (8)$$

where $s_k = \delta_k^T B_k \delta_k$ and $\gamma_A + \tau y_k s_k \ne 0$. By setting $\frac{\partial L(A, \xi_k, \lambda, \tau)}{\partial \xi_k} = 0$, we have

$$1 - \lambda - \tau = 0. \quad (9)$$

Since $\lambda \ge 0$, we have $\tau \le 1$. By substituting (8) and (9) into (6), we obtain a sub-problem $L(\tau)$ w.r.t. $\tau$. By setting $\frac{\partial L(\tau)}{\partial \tau} = 0$ and considering $B_k = \left((A^k)^{-1} + \frac{\gamma_I}{\gamma_A} H_S\right)^{-1}$, we have

$$\tau = \frac{\gamma_A}{y_k}\left(1 - \frac{1}{s_k}\right). \quad (10)$$

Considering that $0 \le \tau \le 1$, we obtain

$$\tau = \mathrm{median}\left\{\frac{\gamma_A}{y_k}\left(1 - \frac{1}{s_k}\right), 0, 1\right\}. \quad (11)$$

By substituting (11) into (8), we have

$$A^{k+1} = \begin{cases} B_k, & \frac{\gamma_A}{y_k}\left(1 - \frac{1}{s_k}\right) \le 0; \\ B_k - \dfrac{(s_k - 1) B_k \delta_k \delta_k^T B_k}{s_k^2}, & 0 < \frac{\gamma_A}{y_k}\left(1 - \frac{1}{s_k}\right) < 1; \\ B_k - \dfrac{y_k B_k \delta_k \delta_k^T B_k}{\gamma_A + y_k s_k}, & \frac{\gamma_A}{y_k}\left(1 - \frac{1}{s_k}\right) \ge 1. \end{cases} \quad (12)$$

This completes the proof. □

**Algorithm 1** The proposed online heterogeneous transfer metric learning (OHTML) algorithm.

**Input:** labeled training pairs in the source and target domains, i.e., $\{(\mathbf{x}_{Sk}^1, \mathbf{x}_{Sk}^2, y_{Sk})\}$ and $\{(\mathbf{x}_{Mk}^1, \mathbf{x}_{Mk}^2, y_{Mk})\}$; unlabeled corresponding data in both domains, i.e., $\{(\mathbf{x}_{Sn}^U, \mathbf{x}_{Mn}^U)\}$.
**Pre-calculation:** learn $A_S$ using the labeled data in the source domain, and calculate $H_S$ using the unlabeled corresponding data based on the learned $A_S$.
**Hyper-parameters:** $\gamma_A$ and $\gamma_I$.
**Output:** $A_M^{k+1}$.

1: Initialize $A_M^0 = I$.
2: **for** $k = 0, 1, 2, \ldots$ **do**
3: &nbsp;&nbsp;Receive a labeled training pair $(\mathbf{x}_{Mk}^1, \mathbf{x}_{Mk}^2, y_{Mk})$.
4: &nbsp;&nbsp;Calculate the empirical loss $\Psi(A_M)$ based on $A_M^k$.
5: &nbsp;&nbsp;**If** the loss is greater than zero:
6: &nbsp;&nbsp;&nbsp;&nbsp;Pre-compute $B_{Mk}$ and $s_{Mk}$;
7: &nbsp;&nbsp;&nbsp;&nbsp;Update $A_M^{k+1}$ using (5).
8: &nbsp;&nbsp;**Else** $A_M^{k+1} \leftarrow A_M^k$.
9: **end for**
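The pre-calculation of $H_S$ in Algorithm 1 can be sketched as follows in Python. This is our own illustrative implementation, not the authors' code: it assumes column-wise data matrices and a symmetrized binary $k$NN graph, and uses SciPy's Mahalanobis distance for $d_{A_S}$; passing an identity matrix for `A_S` would correspond to the OHTML(EU) variant.

```python
import numpy as np
from scipy.spatial.distance import cdist

def source_regularizer_matrix(XU_M, XU_S, A_S, k=5):
    """Pre-compute H_S = (1/(N^U)^2) * X^U_M L_S (X^U_M)^T from Eq. (3).

    XU_M: d_M x N^U target features of the unlabeled correspondences.
    XU_S: d_S x N^U source features of the same samples.
    A_S:  pre-trained source metric (identity recovers OHTML(EU)).
    """
    N = XU_S.shape[1]
    # Pairwise squared source distances d_{A_S}; cdist's 'mahalanobis' metric
    # takes the matrix of the quadratic form as VI, so squaring its output
    # yields delta^T A_S delta.
    D = cdist(XU_S.T, XU_S.T, metric='mahalanobis', VI=A_S) ** 2
    # Binary kNN adjacency graph W_S, skipping the self-match at index 0
    # and symmetrizing so the Laplacian below is PSD.
    W = np.zeros((N, N))
    for i in range(N):
        W[i, np.argsort(D[i])[1:k + 1]] = 1.0
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L_S = D_S - W_S
    return (XU_M @ L @ XU_M.T) / N ** 2     # H_S, a d_M x d_M PSD matrix
```

Since $\mathcal{D}^U$ is fixed, this computation happens once, offline, so the per-step cost of the online updates does not depend on $N^U$.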
In practice, we can calculate $B_{Mk}$ using the Sherman-Morrison-Woodbury inverse formula, i.e., $(A + UV)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}$, and obtain

$$B_{Mk} = A_M^k - \frac{\gamma_I}{\gamma_A} A_M^k H_S \left(I + \frac{\gamma_I}{\gamma_A} A_M^k H_S\right)^{-1} A_M^k. \quad (13)$$

We summarize the main procedure of the proposed OHTML algorithm in Algorithm 1. The following theorem guarantees that the matrix $A_M^{k+1}$ calculated using Algorithm 1 is positive definite.

**Theorem 2.** The resulting matrix $A_M^{k+1}$ in (5) is positive definite.

*Proof.* For notational simplicity, we omit the subscript $M$ in the following derivation. We use $S_+^d$ and $S_{++}^d$ to denote the sets of positive semi-definite and positive definite matrices respectively. The matrix $H_S \in S_+^d$, since the graph Laplacian matrix $L_S \in S_+^d$. Because the inverse of a positive definite matrix is also positive definite, we conclude that $B_k \in S_{++}^d$ and $s_k = \delta_k^T B_k \delta_k > 0$ for any non-zero vector $\delta_k$.

If $\tau \le 0$, it is obvious that $x^T A^{k+1} x = x^T B_k x > 0$ for any non-zero $x$. This indicates that $A^{k+1} \in S_{++}^d$.

If $0 < \tau < 1$, then

$$x^T A^{k+1} x = x^T B_k x - \frac{(s_k - 1)\, x^T B_k \delta_k \delta_k^T B_k x}{s_k^2} = \frac{(x^T B_k x)(\delta_k^T B_k \delta_k) - (x^T B_k \delta_k)^2}{s_k} + \frac{x^T B_k \delta_k \delta_k^T B_k x}{s_k^2}. \quad (14)$$

It is easy to verify that $(x^T B_k x)(\delta_k^T B_k \delta_k) - (x^T B_k \delta_k)^2 \ge 0$ according to the Cauchy-Schwarz inequality. Therefore, $x^T A^{k+1} x > 0$ for any non-zero vector $x$, and $A^{k+1} \in S_{++}^d$.

If $\tau \ge 1$, when $y_k = 1$, for any non-zero vector $x \in \mathbb{R}^d$ we have

$$x^T A^{k+1} x = x^T B_k x - \frac{x^T B_k \delta_k \delta_k^T B_k x}{\gamma_A + s_k} = \frac{\gamma_A (x^T B_k x) + (x^T B_k x)(\delta_k^T B_k \delta_k) - (x^T B_k \delta_k)^2}{\gamma_A + s_k} > 0. \quad (15)$$

Hence $A^{k+1} \in S_{++}^d$. When $y_k = -1$, we have $\gamma_A > s_k$, since $\gamma_A\left(\frac{1}{s_k} - 1\right) \ge 1$. Therefore, for any non-zero vector $x \in \mathbb{R}^d$, we also have

$$x^T A^{k+1} x = x^T B_k x + \frac{x^T B_k \delta_k \delta_k^T B_k x}{\gamma_A - s_k} = x^T B_k x + \frac{(x^T B_k \delta_k)^2}{\gamma_A - s_k} > 0. \quad (16)$$

This completes the proof. □

The complexity of the proposed algorithm mainly depends on the calculation of the matrix $B_{Mk}$, which involves multiplication and inversion of $d_M \times d_M$ matrices. Therefore, the complexity of the proposed algorithm is $O(N_M d_M^c)$, where $N_M$ and $d_M$ are the number of labeled training pairs and the feature dimension in the target domain respectively, and the constant $c \le 3$ is determined by the utilized multiplication and inversion algorithms. It should be noted that the matrix $H_S$ can be pre-computed, and hence the complexity is independent of the source domain as well as of the number of unlabeled correspondences. Therefore, the proposed method is quite efficient as long as $d_M$ is not very large.
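Combining Eqs. (5), (11), and (13), one step of Algorithm 1 might look as follows; this is a minimal sketch under our own naming conventions, not the authors' released implementation.

```python
import numpy as np

def ohtml_update(A_k, x1, x2, y, H_S, gamma_A, gamma_I):
    """One OHTML step (Algorithm 1): update the target metric on one labeled pair."""
    delta = (x1 - x2).reshape(-1, 1)
    d = float(delta.T @ A_k @ delta)
    if max(0.0, y * (d - 1.0)) == 0.0:       # hinge loss with b = 0: no violation
        return A_k
    r = gamma_I / gamma_A
    # B_k = ((A_k)^{-1} + r * H_S)^{-1}, computed via Eq. (13) so that A_k is
    # never explicitly inverted.
    M = A_k @ H_S
    B = A_k - r * M @ np.linalg.solve(np.eye(A_k.shape[0]) + r * M, A_k)
    Bd = B @ delta
    s = float(delta.T @ Bd)                  # s_k = delta^T B_k delta > 0
    tau = min(max((gamma_A / y) * (1.0 - 1.0 / s), 0.0), 1.0)   # Eq. (11)
    if tau == 0.0:
        return B                                                # Eq. (5), first case
    if tau < 1.0:
        return B - (s - 1.0) / s**2 * (Bd @ Bd.T)               # Eq. (5), middle case
    return B - y / (gamma_A + y * s) * (Bd @ Bd.T)              # Eq. (5), last case
```

Starting from $A_M^0 = I$ and calling this once per incoming pair reproduces the loop of Algorithm 1; each step performs a fixed number of $d_M \times d_M$ multiplications and one linear solve, consistent with the stated $O(N_M d_M^c)$ complexity.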
## 3 Experiments

In this section, we evaluate the proposed OHTML algorithm in four different applications: object categorization, scene clustering, face verification, and image retrieval. In the first three applications, we investigate how much a source domain with a powerful but computationally expensive feature can help DML in a target domain with a cheap feature. In the last application, we utilize the interpretable text feature to help DML with the visual feature, which is often hard to interpret.

### 3.1 Experimental Setup

Specifically, we compare with the following methods:

- **EU**: calculates the distance between samples in the target domain using the simple Euclidean metric; this serves as the baseline.
- **LEGO** [Jain et al., 2008]: a competitive online DML algorithm based on LogDet regularization.
- **RDML** [Jin et al., 2009]: an efficient online DML algorithm that is robust for high-dimensional data.
- **DAMA** [Wang and Mahadevan, 2011]: a competitive heterogeneous transfer learning (HTL) approach based on manifold alignment. It utilizes class labels to align the heterogeneous domains.
- **MI** [Dai et al., 2015]: a recently proposed metric imitation approach that utilizes the expensive feature to help learn a good metric for the cheap feature. Large amounts of unlabeled correspondences are used for knowledge transfer via manifold approximation.
- **OHTML**: the proposed online HTML method. OHTML(EU) means that we set $A_S$ as an identity matrix in term (3), i.e., we do not learn the source metric and directly employ the Euclidean metric in the source domain.

For the single-domain DML algorithms (LEGO and RDML), only the limited labeled training pairs in the target domain are utilized, and no additional information from the source domain is leveraged. For DAMA and MI, a linear transformation $U_M$ is learned for the target domain, and we derive the metric parameter as $A_M = U_M U_M^T$. The candidate set for choosing the trade-off hyper-parameters is $\{10^i \mid i = -5, \ldots, 4\}$ if unspecified in the original papers. The hyper-parameters $\gamma_A$ and $\gamma_I$ are tuned on the sets $\{10^i \mid i = -5, \ldots, 4\}$ and $\{10^i \mid i = -2, \ldots, 7\}$ respectively. Hyper-parameter determination is still an open issue in HTL due to the limited labeled data; to this end, the best performance over the candidate sets is reported for all compared methods. For all methods, kernel principal component analysis (KPCA) is adopted to explore some nonlinearity in the data and to reduce the dimensionality. We run the experiments ten times, randomly choosing the labeled set each time. The algorithms are implemented in Matlab, and the experiments are conducted on a 3.4 GHz Intel Xeon E5-2687W (8-core) machine.

### 3.2 Object Categorization

This set of experiments is conducted on the popular Caltech-101 [Fei-Fei et al., 2004] dataset, which contains 8,677 images belonging to 101 object categories. The expensive deep CNN feature [Chatfield et al., 2014] and the cheap PHOG feature [Bosch et al., 2007] are adopted as the representations in the source and target domains respectively. The features are provided by [Dai et al., 2015]; the original feature dimensions are 4096 and 40 respectively, and the resulting dimensions after KPCA are 512 and 40. Half of the dataset is used for training and the remainder for testing. In both the source and target domains, we randomly select {10, 15, 20, 25, 30} labeled training samples for each category. The labeled training pairs are constructed according to the strategy in [Weinberger et al., 2005], and $k$NN is adopted as the classifier.

**Performance w.r.t. Different Numbers of Labeled Samples.** The classification accuracies and training costs w.r.t. a varied number of labeled training samples are shown in Fig. 1.

*Figure 1: Classification accuracy (left) and training time cost in CPU seconds (right) w.r.t. the number of labeled training samples per category (10 to 30) on the Caltech dataset, comparing EU, LEGO, RDML, DAMA, MI, OHTML(EU), and OHTML.*

We do not show the curve of OHTML(EU) in the time-cost figure, since the source metric $A_S$ is pre-calculated and the training costs of OHTML and OHTML(EU) are the same. From the results, we observe that: 1) the accuracies of all methods improve with an increasing number of labeled samples.
The single-domain DML algorithms (LEGO and RDML) are only slightly better than the EU baseline, while the HTL approaches (DAMA, MI and OHTML) outperform them significantly. This demonstrates that knowledge of the source domain can help the target metric learning and be successfully transferred to the target domain by the different HTL approaches; 2) MI is better than DAMA when the number of labeled data is small (such as 10), but DAMA is better when more labeled data are provided. This is because MI only utilizes the unlabeled corresponding data to facilitate knowledge transfer, while DAMA relies on labeled data; 3) the proposed OHTML outperforms MI consistently, since we can leverage the label information in the target domain. OHTML(EU) is slightly worse than OHTML, since the source metric is not learned in OHTML(EU). The accuracy decrease is not significant, since the utilized source feature is much more expressive than the target feature, and the estimated data adjacency graph $W_S$ in the source domain can guide the target metric learning even without learning the source metric; 4) OHTML is also superior to DAMA given limited labeled data, since we use the unlabeled correspondences to build the domain connection; 5) the training time of DAMA increases sharply when more labeled instances are given, while that of the proposed OHTML is much steadier; our costs are only slightly higher than those of the single-domain DML algorithms, and much lower than those of MI and DAMA. This demonstrates the efficiency of our method.

**Performance w.r.t. Different Numbers of Unlabeled Correspondences.** A certain percentage of the training data is used as the unlabeled corresponding data. The classification accuracies and training costs w.r.t. a varied percentage are shown in Fig. 2.

*Figure 2: Classification accuracy and training cost (CPU seconds) w.r.t. the number of unlabeled corresponding samples (0.2, 0.4, and 1.0 of the training data) in both domains on the Caltech dataset.*

We only report the performance of MI and OHTML, since the other approaches do not use such data. It can be seen from the results that: 1) the accuracies of both MI and OHTML increase when more unlabeled correspondences are provided, and the proposed method outperforms MI consistently; 2) the training cost of MI increases dramatically while that of OHTML is much steadier.

| Methods | Purity | CPU time (s) |
|---|---|---|
| EU | 0.368 ± 0.000 | NA |
| LEGO | 0.373 ± 0.014 | 0.184 ± 0.036 |
| RDML | 0.376 ± 0.010 | 0.068 ± 0.022 |
| DAMA | 0.486 ± 0.042 | 0.709 ± 0.013 |
| MI | 0.560 ± 0.000 | 6.198 ± 0.043 |
| OHTML(EU) | 0.563 ± 0.011 | 0.433 ± 0.086 |
| OHTML | 0.576 ± 0.022 | (same as OHTML(EU)) |

Table 1: Clustering purity and training cost on the Scene-15 dataset. The number of labeled samples for each category is 10.

| Methods | Accuracy | CPU time (s) |
|---|---|---|
| EU | 0.831 ± 0.009 | NA |
| LEGO | 0.843 ± 0.012 | 3.283 ± 0.224 |
| RDML | 0.853 ± 0.008 | 1.533 ± 0.216 |
| DAMA | 0.906 ± 0.023 | 409.702 ± 17.555 |
| MI | 0.885 ± 0.014 | 945.203 ± 1.780 |
| OHTML(EU) | 0.918 ± 0.011 | 106.375 ± 0.874 |
| OHTML | 0.924 ± 0.010 | (same as OHTML(EU)) |

Table 2: Verification accuracy and training cost on the LFW dataset.

### 3.3 Scene Clustering

We conduct scene clustering on the Scene-15 [Lazebnik et al., 2006] dataset, which consists of 4,585 images from 15 natural scene categories. CNN [Chatfield et al., 2014] and LBP [Ojala et al., 2002] features are used as the source and target representations respectively. The resulting dimensions after KPCA are 512 and 50.
We randomly split the dataset into equal-sized training and test sets. The number of labeled samples is 10 for each category. Following [Dai et al., 2015], spectral clustering is applied to group the data, and the evaluation criterion is clustering purity. We report the performance in Table 1. We can see from the results that the HTL approaches are much better than the single-domain DML algorithms, which are only comparable to the EU baseline. MI and OHTML outperform DAMA significantly, and the computational cost of DAMA is low here because the total number of labeled samples (15 × 10) is quite small in this set of experiments. The standard deviation of MI is zero since: 1) it is an unsupervised approach, so the learned target metric does not depend on the varied labeled sets; 2) the labeled set is not utilized for testing in clustering.

### 3.4 Face Verification

In this subsection, we employ the well-known Labeled Faces in the Wild (LFW) [Huang et al., 2007] dataset, which contains 13,233 face images of 5,749 individuals. The source and target features are CNN and LBP [Chen et al., 2013] respectively. The dimension of LBP is reduced to 400, as suggested by [Chen et al., 2013]. We conduct experiments under the unrestricted protocol, since DAMA needs the class label information, and no outside data are utilized. We adopt the standard 10-fold split of the dataset [Huang et al., 2007], and each fold is used for testing in turn. The performance is reported in Table 2. From the results we can see that: 1) although the single-domain DML algorithms are very efficient, their accuracy improvements over the EU baseline are quite limited; 2) DAMA is superior to MI, since the provided label information in both domains is enough for it to achieve satisfactory performance. By making use of both the unlabeled corresponding data and the label information in the target domain, our method obtains the best accuracy and is more efficient than MI and DAMA.

| Methods | MAP | CPU time (s) |
|---|---|---|
| EU | 0.265 ± 0.000 | NA |
| LEGO | 0.274 ± 0.007 | 0.673 ± 0.030 |
| RDML | 0.266 ± 0.001 | 0.375 ± 0.029 |
| DAMA | 0.268 ± 0.002 | 0.627 ± 0.020 |
| MI | 0.291 ± 0.000 | 4.176 ± 0.067 |
| OHTML(EU) | 0.292 ± 0.002 | 3.314 ± 0.402 |
| OHTML | 0.296 ± 0.003 | (same as OHTML(EU)) |

Table 3: Retrieval MAP and training cost on the Corel5K dataset. The number of labeled instances for each concept is 10.

### 3.5 Image Retrieval

We further apply the different methods to image retrieval, using the Corel5K [Duygulu et al., 2002] dataset for evaluation. The dataset contains 5,000 images belonging to 50 concepts (100 images per concept). The semantic tags are used as the source feature, and a bag-of-visual-words (BoVW) representation based on local SIFT [Lowe, 2004] descriptors is adopted as the target feature. The dimensions of both the tag and BoVW features are reduced to 100, and half of the data are used for training. Following [Dai et al., 2015], mean average precision (MAP) is adopted as the evaluation criterion, and the results are shown in Table 3. In this application, DAMA is only comparable to the EU baseline and the single-domain DML algorithms. This may be because there is a semantic gap between the textual and visual features, which makes it harder to build connections between the source and target domains using the limited label information. MI and the proposed OHTML are much better, and our method outperforms MI in terms of both MAP score and training cost.

## 4 Conclusion

This paper presents a general online model for heterogeneous transfer metric learning (HTML). The model is based on directly matching the pairwise distances between the source and target domains.
By formulating this under the manifold regularization theme, we obtain an efficient online HTML (OHTML) algorithm. Both the effectiveness and efficiency of the proposed method are verified in diverse applications. We mainly conclude from the results that: 1) it is advantageous to utilize an expensive or interpretable feature to help learn a relatively good metric for a cheap or hard-to-interpret feature; 2) the developed online model can significantly accelerate HTML with little accuracy sacrifice in most cases, especially when the labeled data are limited in the target domain.

## Acknowledgments

This work is supported by Singapore NRF2015ENCGDCR01001-003, administrated via IMDA, NRF2015ENCGBICRD001-012, administrated via BCA, and Australian Research Council Projects FL-170100117, DP-140102164, and LP-150100671.

## References

[Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7(11):2399–2434, 2006.

[Bosch et al., 2007] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Image classification using random forests and ferns. In ICCV, pages 1–8, 2007.

[Chatfield et al., 2014] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.

[Chen et al., 2013] Dong Chen, Xudong Cao, Fang Wen, and Jian Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, pages 3025–3032, 2013.

[Chopra et al., 2005] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, pages 539–546, 2005.

[Dai et al., 2015] Dengxin Dai, Till Kroeger, Radu Timofte, and Luc Van Gool. Metric imitation by manifold transfer for efficient vision applications. In CVPR, pages 3527–3536, 2015.

[Davis et al., 2007] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.

[Du et al., 2013] Bo Du, Liangpei Zhang, Dacheng Tao, and Dengyi Zhang. Unsupervised transfer learning for target detection from hyperspectral images. Neurocomputing, 120:72–82, 2013.

[Duygulu et al., 2002] Pinar Duygulu, Kobus Barnard, Joao F. G. de Freitas, and David A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, pages 97–112, 2002.

[Fei-Fei et al., 2004] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop on Generative-Model Based Vision, 2004.

[Huang et al., 2007] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[Jain et al., 2008] Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. In NIPS, pages 761–768, 2008.

[Jin et al., 2009] Rong Jin, Shijun Wang, and Yang Zhou. Regularized distance metric learning: Theory and algorithm. In NIPS, pages 862–870, 2009.
[Lazebnik et al., 2006] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.

[Li et al., 2017] Xue Li, Liangpei Zhang, Bo Du, Lefei Zhang, and Qian Shi. Iterative reweighting heterogeneous transfer learning framework for supervised remote sensing image classification. IEEE JSTARS, 10(5):2022–2035, 2017.

[Liu et al., 2017] Tongliang Liu, Qiang Yang, and Dacheng Tao. Understanding how feature structure transfers in transfer learning. In IJCAI, pages 2365–2371, 2017.

[Lowe, 2004] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

[Luo et al., 2017a] Yong Luo, Dacheng Tao, and Yonggang Wen. Exploiting high-order information in heterogeneous multi-task feature learning. In IJCAI, pages 2443–2449, 2017.

[Luo et al., 2017b] Yong Luo, Yonggang Wen, Tongliang Liu, and Dacheng Tao. General heterogeneous transfer distance metric learning via knowledge fragments transfer. In IJCAI, pages 2450–2456, 2017.

[Ojala et al., 2002] Timo Ojala, Matti Pietikainen, and Topi Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE TPAMI, 24(7):971–987, 2002.

[Shao et al., 2014] Ming Shao, Dmitry Kit, and Yun Fu. Generalized transfer subspace learning through low-rank constraint. IJCV, 109(1-2):74–93, 2014.

[Shao et al., 2016] Ming Shao, Zhengming Ding, Handong Zhao, and Yun Fu. Spectral bisection tree guided deep adaptive exemplar autoencoder for unsupervised domain adaptation. In AAAI, pages 2023–2029, 2016.

[Wang and Mahadevan, 2011] Chang Wang and Sridhar Mahadevan. Heterogeneous domain adaptation using manifold alignment. In IJCAI, pages 1541–1546, 2011.

[Wang et al., 2017] Zengmao Wang, Bo Du, Lefei Zhang, Liangpei Zhang, Ruimin Hu, and Dacheng Tao. On gleaning knowledge from multiple domains for active learning. In IJCAI, pages 3013–3019, 2017.

[Weinberger et al., 2005] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, pages 1473–1480, 2005.

[Xie et al., 2016] Liping Xie, Dacheng Tao, and Haikun Wei. Multi-view exclusive unsupervised dimension reduction for video-based facial expression recognition. In IJCAI, pages 2217–2223, 2016.

[Xie et al., 2017] Liping Xie, Dacheng Tao, and Haikun Wei. Joint structured sparsity regularized multiview dimension reduction for video-based facial expression recognition. ACM TIST, 8(2):28:1–28:21, 2017.

[Xing et al., 2002] Eric P. Xing, Michael I. Jordan, Stuart Russell, and Andrew Ng. Distance metric learning with application to clustering with side-information. In NIPS, pages 505–512, 2002.

[Zha et al., 2009] Zheng-Jun Zha, Tao Mei, Meng Wang, Zengfu Wang, and Xian-Sheng Hua. Robust distance metric learning with auxiliary knowledge. In IJCAI, pages 1327–1332, 2009.