# Latent Semantic Aware Multi-View Multi-Label Classification

Changqing Zhang,1 Ziwei Yu,1 Qinghua Hu,1 Pengfei Zhu,1 Xinwang Liu,2 Xiaobo Wang3
1 School of Computer Science and Technology, Tianjin University, Tianjin, China, 300350
2 School of Computer, National University of Defense Technology, Changsha, China, 410073
3 Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 100190
{zhangchangqing, yuziwei, huqinghua, zhupengfei}@tju.edu.cn

## Abstract

In real-world applications, data are often associated with multiple labels and represented with multiple views. Most existing multi-label learning methods do not sufficiently exploit the complementary information among multiple views, which leads to unsatisfying performance. To address this issue, we propose a novel approach for multi-view multi-label learning based on matrix factorization that exploits the complementarity among different views. Specifically, under the assumption that there exists a common representation across different views, the uncovered latent patterns are enforced to be aligned across different views in kernel spaces. In this way, the latent semantic patterns underlying the data can be well uncovered, which enhances the reasonability of the common representation of multiple views. As a result, we obtain a consensus multi-view representation that encodes the complementarity and consistency of different views in the latent semantic space. We provide a theoretical guarantee of strict convexity for our method by properly setting parameters. Empirical evidence shows the clear advantages of our method over state-of-the-art ones.

## Introduction

Multi-label classification, which assigns one example to multiple classes, is of significant interest due to its ubiquity in real-world applications. For example, in computer vision, an image may simultaneously contain more than one type of object; in web page categorization, a news web page may cover different topics, such as sports, business and entertainment. For this problem, multi-label learning approaches (Boutell et al. 2004; Tsoumakas and Katakis 2006; Zhang and Zhou 2007; Gong et al. 2016) have been proposed over the past decade, such as the early representative methods binary relevance (BR) (Tsoumakas and Katakis 2006) and label powerset (LP) (Boutell et al. 2004). By directly transforming the multi-label learning task into multiple binary classification tasks, BR neglects the correlation among labels. LP regards each subset of multiple labels as a distinct class of a single-label classification problem. Although it takes label correlation into consideration, this model cannot mine complex label correlations and cannot be applied to tasks with large label sets. Multi-label k-nearest neighbour (ML-kNN) (Zhang and Zhou 2007) is one of the classic and effective multi-label methods: it builds a Bayesian model by using the k-nearest neighbour method to obtain the prior and likelihood, and then uses the maximum posterior to assign labels to test examples. Some more recent methods focus on other issues in multi-label classification, e.g., label noise (Yu et al. 2014; Yang, Jiang, and Zhou 2013). Although diverse methods have been proposed in the literature, the following limitations remain.
On one hand, most existing multi-label learning methods consider only single-view data; however, an individual view cannot characterize different labels comprehensively, since different views encode different properties of the data, which implies the practical necessity of multi-view learning (Xu, Tao, and Xu 2013; Liu et al. 2017; Cao et al. 2015; Zhang et al. 2015). On the other hand, learning with plenty of unlabeled data has shown its power in many real applications. However, most existing multi-label classification models are fully supervised and thus unable to exploit unlabeled samples. Although a few semi-supervised multi-label learning methods (Liu, Jin, and Yang 2006; Wang, Tu, and Tsotsos 2013) have been developed, these models are not specifically targeted at multi-view semi-supervised multi-label learning. The most recent and related method in (Liu et al. 2015) also utilizes matrix factorization and a common representation. However, it has the following two limitations: first, the common representation among multiple views is learned without constraining the bases of different views, which weakens the reasonability of the common representation; second, common representation learning and multi-label learning (label completion) are performed in two separate steps, so the prediction performance cannot be well guaranteed.

In this paper, we propose a new multi-view multi-label learning approach termed Latent Semantic Aware Multi-view Multi-label Learning (LSA-MML). As shown in Figure 1, given input data with multiple views, our method simultaneously seeks a predictive common representation of multiple views and the corresponding projection model between the common representation and the labels. The bases of the $V$ different views, $\{\mathbf{B}^{(v)}\}_{v=1}^{V}$, can be considered as latent semantic components. With the common representation $\mathbf{P}$, the $j$th bases of different views, i.e., $\{\mathbf{b}^{(v)}_{j}\}_{v=1}^{V}$, encode the same latent semantic; therefore, these bases should be consistent across different views. We align the bases of different views with the Hilbert-Schmidt Independence Criterion (HSIC) in kernel space, which addresses the comparability of different views, and thus a consensus coefficient matrix (common representation) $\mathbf{P}$ for the different views is induced. To solve our problem, we provide a theoretical analysis of convexity and instructions for parameter setting that guarantee strict convexity. Extensive empirical results on benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods.

Figure 1: Method overview. The common representation $\mathbf{P}$ is learned by jointly exploring the complementarity of multiple views and the scarce labeled samples (solid circles and squares). The latent semantic basis matrices (i.e., the $\mathbf{B}^{(v)}$s) of different views are aligned in kernel spaces, which guarantees the reasonability of the consensus representation $\mathbf{P}$.

## Related Work

Over the last decade, multi-label classification has received intensive attention (Boutell et al. 2004; Zhang and Zhou 2007; Yang, Jiang, and Zhou 2013). Generally, existing multi-label methods can be roughly categorized into three lines.
The first-order strategy deals with multi-label learning in a label-by-label manner, i.e., dividing the multi-label problem into multiple binary classification tasks or variants thereof (Zhang and Zhou 2007; Clare and King 2001). Second-order methods introduce pairwise relations between labels for multi-label classification, such as the ranking between relevant and irrelevant labels (Elisseeff and Weston 2002; Fürnkranz et al. 2008; Ghamrawi and McCallum 2005). CLR (Fürnkranz et al. 2008) first transforms the multi-label learning problem into a label ranking problem by introducing pairwise comparison, and then constructs binary classifiers to solve the multi-label ranking problem. Rank-SVM (Elisseeff and Weston 2002) conducts multi-label classification by adopting the ranking loss as the cost function in an SVM. The high-order strategy builds more complex relations among labels for multi-label learning (Read et al. 2011; Tsoumakas and Vlahavas 2007). Representatives include the chain-based method (Read et al. 2011), which transforms the multi-label data into a chain of binary classifiers, and the label-set-based methods (Boutell et al. 2004; Tsoumakas and Vlahavas 2007), which divide the entire label set into multiple overlapping subsets and train one classifier for each subset.

Due to the ubiquity of data with multiple views, multi-view learning has become an active research field and has shown its effectiveness in a wide range of applications (Xu, Tao, and Xu 2013). Recently, a few multi-view multi-label classification methods (Luo et al. 2013; Liu et al. 2015) were proposed to exploit the complementarity of different types of features for improved classification performance. The method in (Luo et al. 2013) introduces multi-view vector-valued manifold regularization to integrate multi-view features. The method in (Liu et al. 2015) seeks a common low-dimensional representation under the matrix factorization framework and then conducts classification based on matrix completion. Both of these recent methods (Luo et al. 2013; Liu et al. 2015) perform classification in a transductive semi-supervised manner.

## LSA-MML: Our Classification Model

Suppose there are $L$ labeled data points $\{\mathbf{x}_l, \mathbf{y}_l\}_{l=1}^{L}$ and $U$ unlabeled data points $\{\mathbf{x}_u\}_{u=L+1}^{N}$, where $N = L + U$. These instance-label pairs are stacked in two matrices, i.e., $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)$ and $\mathbf{Y} = (\mathbf{Y}_l, \mathbf{Y}_u) = (\mathbf{y}_1, \dots, \mathbf{y}_N)$. Since we employ the transductive learning manner to simultaneously exploit unlabeled samples, our objective function takes the following general form

$$\min \; \mathcal{M}(\mathbf{X}; \hat{\mathbf{Y}}) + \lambda \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}), \tag{1}$$

where $\hat{\mathbf{Y}}$ is the completed label matrix to be predicted, which is learned from the data $\mathbf{X}$ and the few known labels in $\mathbf{Y}_l$. Specifically, to obtain the completed label matrix $\hat{\mathbf{Y}}$, we aim to uncover the structure underlying the data $\mathbf{X}$ themselves with the first term $\mathcal{M}(\mathbf{X}; \hat{\mathbf{Y}})$, guided by the labeled data $\mathbf{Y}_l$ through the second term $\mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}})$. Considering data with multiple views, we generalize the above formulation as

$$\min \; \mathcal{M}(\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(V)}; \hat{\mathbf{Y}}) + \lambda \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}). \tag{2}$$

Under the assumption that different views share a latent common representation, i.e., $\mathbf{X}^{(v)} = \mathbf{B}^{(v)}\mathbf{P}$, we have

$$\min \; \mathcal{M}(\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(V)}; \mathbf{P}) + \lambda_1 \Omega(\mathbf{B}^{(1)}, \dots, \mathbf{B}^{(V)}) + \lambda_2 \mathcal{P}(\mathbf{P}, \hat{\mathbf{Y}}) + \lambda_3 \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}), \tag{3}$$

where $\mathbf{P} \in \mathbb{R}^{K \times N}$ is the consensus multi-view representation that encodes the complementary information from different views, and $\mathbf{B}^{(v)} \in \mathbb{R}^{D_v \times K}$ is the basis matrix corresponding to the $v$th view. Accordingly, the first term seeks a comprehensive multi-view representation, and the second term guarantees the reasonability of using a common representation for different views, since it aligns the bases of different views in the latent semantic space. The last term delivers the label information on the estimated label matrix, based on which the third term ensures the predictive property of the common multi-view representation. Therefore, the completed label matrix $\hat{\mathbf{Y}}$ benefits from both the multi-view data (through $\mathbf{P}$) and the supervised label information ($\mathbf{Y}_l$).
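To make the shared-representation assumption $\mathbf{X}^{(v)} = \mathbf{B}^{(v)}\mathbf{P}$ concrete, the following NumPy sketch generates synthetic two-view data from a single latent matrix and evaluates the weighted multi-view reconstruction term that appears in Eq. (4) below. All sizes, the noise level, and the function names are illustrative placeholders of ours, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 200, 10            # samples and latent dimension (hypothetical sizes)
dims = [50, 80]           # per-view feature dimensions D_v (hypothetical)

# Shared latent representation P (K x N) and view-specific bases B^(v) (D_v x K)
P_true = rng.standard_normal((K, N))
B_true = [rng.standard_normal((d, K)) for d in dims]

# Each view is generated as X^(v) = B^(v) P, here with a little additive noise
X_views = [B @ P_true + 0.01 * rng.standard_normal((B.shape[0], N)) for B in B_true]

def reconstruction_loss(X_views, B_views, P, alphas, r=2):
    """Weighted multi-view reconstruction term, sum_v alpha_v^r ||X^(v) - B^(v) P||_F^2."""
    return sum(a**r * np.linalg.norm(X - B @ P, 'fro')**2
               for a, X, B in zip(alphas, X_views, B_views))

# With the true latent factors the loss is close to zero (only the noise remains).
print(reconstruction_loss(X_views, B_true, P_true, alphas=[0.5, 0.5]))
```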
Specifically, using the common multi-view representation under the matrix factorization framework, we have

$$\mathcal{M}(\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(V)}; \mathbf{P}) = \sum_{v=1}^{V} \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2, \tag{4}$$

where $\|\cdot\|_F$ is the well-known Frobenius norm of a matrix. The latent bases should be consistent across different views. To this end, we penalize the independence of the bases between different views with

$$\Omega(\mathbf{B}^{(1)}, \dots, \mathbf{B}^{(V)}) = -\sum_{v \neq w} \mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}), \tag{5}$$

where the aim of the regularizer $\mathrm{IND}(\cdot, \cdot)$ is to enhance the dependence between the bases of different views. Since these bases lie in different feature spaces, we introduce HSIC to enforce consistency across different views in kernel spaces. Specifically, we define $\mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) = \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)})$ in our method.

Hilbert-Schmidt Independence Criterion (Gretton et al. 2005). We give a brief description of HSIC as follows. Define a mapping $\phi(\mathbf{x})$ from $\mathbf{x} \in \mathcal{X}$ to a kernel space $\mathcal{F}$ such that the inner product between vectors in that space is given by a kernel function $k_1(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$. Let $\mathcal{G}$ be a second kernel space on $\mathcal{Y}$ with kernel function $k_2(\mathbf{y}_i, \mathbf{y}_j) = \langle \varphi(\mathbf{y}_i), \varphi(\mathbf{y}_j) \rangle$. The empirical version of HSIC is defined as follows:

Definition 1. Consider a series of $N$ independent observations drawn from $p_{xy}$, $\mathcal{Z} := \{(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N)\} \subseteq \mathcal{X} \times \mathcal{Y}$. An estimator of HSIC, written as $\mathrm{HSIC}(\mathcal{Z}, \mathcal{F}, \mathcal{G})$, is given by

$$\mathrm{HSIC}(\mathcal{Z}, \mathcal{F}, \mathcal{G}) = (N-1)^{-2} \, \mathrm{tr}(\mathbf{K}_1 \mathbf{C} \mathbf{K}_2 \mathbf{C}), \tag{6}$$

where $\mathrm{tr}(\cdot)$ is the trace of a square matrix, $\mathbf{K}_1$ and $\mathbf{K}_2$ are the Gram matrices with $k_{1,ij} = k_1(\mathbf{x}_i, \mathbf{x}_j)$ and $k_{2,ij} = k_2(\mathbf{y}_i, \mathbf{y}_j)$, and $c_{ij} = \delta_{ij} - 1/N$ centers the Gram matrices to have zero mean in the feature space.

Since HSIC measures the dependence of two variables well (Quadrianto, Song, and Smola 2009), we employ it to maximize the dependency between the bases of two views. Note that, according to Eq. (6), the HSIC term in our method can be considered as a penalization of the disagreement of different views in terms of the similarity graphs of the bases, as shown in Fig. 1. In practice, to ensure the predictive property with respect to the labels, the third and fourth terms are integrated into the following formula

$$\Delta = \lambda_2 \mathcal{P}(\mathbf{P}, \hat{\mathbf{Y}}) + \lambda_3 \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}) = \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2, \tag{7}$$

where $\hat{\mathbf{Y}} = \mathbf{W}\mathbf{P}$ and $\mathbf{Y} = [\mathbf{Y}_l, \mathbf{Y}_u]$. $\mathbf{W}$ is the prediction model and $\mathbf{S}$ is the filtering matrix used to select the labeled samples, with $S_{ii} = 1$ if the $i$th sample is labeled and $0$ otherwise. This ensures that the multi-view consensus representation is predictive with respect to the known labels. Accordingly, the final form of our objective function is

$$\min_{\mathbf{B}^{(v)}, \mathbf{P}, \mathbf{W}, \alpha_v} \; \sum_{v=1}^{V} \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 + \beta \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2 - \gamma \sum_{v \neq w} \mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) \quad \text{s.t.} \; \sum_{v=1}^{V} \alpha_v = 1, \; \|\mathbf{b}^{(v)}_{\cdot j}\|^2 \leq 1, \tag{8}$$

where $\alpha_v > 0$ automatically weights the different views and $r > 1$ avoids the trivial solution that considers only one view, adjusting the complementarity of the multiple views (Wang et al. 2007). $\beta > 0$ and $\gamma > 0$ are tradeoff factors. $\mathbf{B}^{(v)}$ is constrained because, without the constraint, $\mathbf{P}$ could be pushed arbitrarily close to zero simply by rescaling to $\mathbf{P}/s$ and $s\mathbf{B}^{(v)}$ ($s > 0$) while preserving the same loss.
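To illustrate the alignment term of Eq. (8), here is a small sketch of the empirical HSIC estimator of Eq. (6) applied to two basis matrices, which is how $\mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)})$ is defined above. Treating the $K$ basis columns as the observations and using plain inner-product (linear) kernels is our assumption for illustration; the paper only specifies that the bases are compared through their Gram matrices in kernel spaces.

```python
import numpy as np

def empirical_hsic(B_v, B_w):
    """Empirical HSIC of Eq. (6): (n-1)^{-2} tr(K1 C K2 C).

    Assumption: the K basis columns of each view play the role of the n
    observations, and linear kernels define the Gram matrices.
    """
    K = B_v.shape[1]                       # number of latent bases (observations)
    K1 = B_v.T @ B_v                       # Gram matrix of view v's bases
    K2 = B_w.T @ B_w                       # Gram matrix of view w's bases
    C = np.eye(K) - np.ones((K, K)) / K    # centering matrix, c_ij = delta_ij - 1/K
    return np.trace(K1 @ C @ K2 @ C) / (K - 1) ** 2

def alignment_regularizer(B_views):
    """Omega of Eq. (5): negative summed HSIC over all ordered pairs of views."""
    V = len(B_views)
    return -sum(empirical_hsic(B_views[v], B_views[w])
                for v in range(V) for w in range(V) if w != v)
```

Maximizing the pairwise HSIC (i.e., minimizing this regularizer) pushes the similarity graphs of the bases of different views toward agreement, which is the intuition given after Eq. (6).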
To summarize, our model has the following merits: 1) it focuses on seeking a comprehensive common representation of multiple views by enforcing the latent semantic bases of different views to be consistent; 2) it can be considered a bi-directional factorization, in which the multi-view common representation $\mathbf{P}$ bridges the factorizations of the multi-view input and of the label matrix, where the label matrix can be regarded as a description of the data in terms of explicit semantic labels; 3) the label correlations are implicitly encoded by the common representation, based on the uncovered latent semantic bases and the relations among them.

## Optimization

### Alternating Optimization Algorithm

We adopt the alternating minimization strategy to solve the optimization problem, which comprises four subproblems solved as follows.

Update P with fixed $\{\alpha_v\}_{v=1}^{V}$, $\mathbf{W}$ and $\{\mathbf{B}^{(v)}\}_{v=1}^{V}$. We minimize the following objective function

$$\sum_{v=1}^{V} \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 + \beta \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2. \tag{9}$$

Taking the derivative with respect to $\mathbf{P}$ and setting it to zero, we obtain

$$\sum_{v=1}^{V} \alpha_v^r \left( -(\mathbf{B}^{(v)})^T \mathbf{X}^{(v)} + (\mathbf{B}^{(v)})^T \mathbf{B}^{(v)} \mathbf{P} \right) + \beta \left( \mathbf{W}^T \mathbf{W} \mathbf{P} \mathbf{S}\mathbf{S}^T - \mathbf{W}^T \mathbf{Y} \mathbf{S}\mathbf{S}^T \right) = 0. \tag{10}$$

Thanks to the diagonal property of $\mathbf{S}$, we solve the problem by separating the labeled and unlabeled parts. For the labeled part, we have

$$\sum_{v=1}^{V} \alpha_v^r \left( -(\mathbf{B}^{(v)}_l)^T \mathbf{X}^{(v)}_l + (\mathbf{B}^{(v)}_l)^T \mathbf{B}^{(v)}_l \mathbf{P}_l \right) + \beta \left( \mathbf{W}^T_l \mathbf{W}_l \mathbf{P}_l - \mathbf{W}^T_l \mathbf{Y}_l \right) = 0, \tag{11}$$

where the subscripts $l$ and $u$ indicate variables corresponding to labeled and unlabeled data, respectively. Accordingly, we can update $\mathbf{P}_l$ with the following rule

$$\mathbf{P}_l = \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_l)^T \mathbf{B}^{(v)}_l + \beta \mathbf{W}^T_l \mathbf{W}_l \right)^{-1} \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_l)^T \mathbf{X}^{(v)}_l + \beta \mathbf{W}^T_l \mathbf{Y}_l \right). \tag{12}$$

For the unsupervised part, we have

$$\sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{B}^{(v)}_u \mathbf{P}_u = \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{X}^{(v)}_u. \tag{13}$$

Accordingly, we can update $\mathbf{P}_u$ by the following rule

$$\mathbf{P}_u = \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{B}^{(v)}_u \right)^{-1} \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{X}^{(v)}_u \right). \tag{14}$$

After obtaining $\mathbf{P}_l$ and $\mathbf{P}_u$, the common representation corresponding to all data, i.e., $\mathbf{P}$, is obtained as $\mathbf{P} = [\mathbf{P}_l, \mathbf{P}_u]$.

Update B(v) with fixed $\alpha_v$, $\mathbf{W}$ and $\mathbf{P}$. We minimize the following objective function

$$\mathcal{L}(\mathbf{B}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) \quad \text{s.t.} \; \|\mathbf{b}^{(v)}_{\cdot j}\|^2 \leq 1. \tag{15}$$

We optimize the $\mathbf{B}^{(v)}$-subproblem by following the work in (Gu et al. 2014), which introduces an auxiliary variable $\mathbf{S}^{(v)}$. We then have the following objective

$$\mathcal{L}(\mathbf{B}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) \quad \text{s.t.} \; \mathbf{B}^{(v)} = \mathbf{S}^{(v)}, \; \|\mathbf{s}^{(v)}_{\cdot j}\|^2 \leq 1. \tag{16}$$

We optimize (16) with the alternating direction method of multipliers (ADMM). Removing the equality constraint, it becomes

$$\mathcal{L}(\mathbf{B}^{(v)}, \mathbf{S}^{(v)}, \mathbf{T}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) + \mu \|\mathbf{B}^{(v)} - \mathbf{S}^{(v)}_r + \mathbf{T}^{(v)}_r\|_F^2 \quad \text{s.t.} \; \|\mathbf{s}^{(v)}_{\cdot j}\|^2 \leq 1, \tag{17}$$

where $\mu > 0$ is the penalty hyperparameter. The optimal solution of (17) can be obtained with

$$\begin{aligned}
\mathbf{B}^{(v)}_{r+1} &= \arg\min_{\mathbf{B}^{(v)}} \; \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) + \mu \|\mathbf{B}^{(v)} - \mathbf{S}^{(v)}_r + \mathbf{T}^{(v)}_r\|_F^2, \\
\mathbf{S}^{(v)}_{r+1} &= \arg\min_{\mathbf{S}^{(v)}} \; \mu \|\mathbf{B}^{(v)}_{r+1} - \mathbf{S}^{(v)} + \mathbf{T}^{(v)}_r\|_F^2, \quad \text{s.t.} \; \|\mathbf{s}^{(v)}_{\cdot j}\|^2 \leq 1, \\
\mathbf{T}^{(v)}_{r+1} &= \mathbf{T}^{(v)}_r + \mathbf{B}^{(v)}_{r+1} - \mathbf{S}^{(v)}_{r+1}, \quad \text{update } \mu \text{ if appropriate},
\end{aligned} \tag{18}$$

where $\mathbf{s}^{(v)}_{\cdot j}$ indicates the $j$th column of $\mathbf{S}^{(v)}$. Note that Theorem 1 (in the Algorithm Analysis subsection) guarantees that the subproblem $\mathcal{L}(\mathbf{B}^{(v)})$ is convex, and thus the optimal solution can be obtained.

Update W with fixed $\alpha_v$, $\mathbf{P}$ and $\mathbf{B}^{(v)}$. We minimize the following objective function

$$\mathcal{L}(\mathbf{W}) = \beta \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2. \tag{19}$$

Accordingly, we obtain the following updating rule

$$\mathbf{W} = \mathbf{Y}\mathbf{S}\mathbf{S}^T\mathbf{P}^T (\mathbf{P}\mathbf{S}\mathbf{S}^T\mathbf{P}^T)^{-1}. \tag{20}$$
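The closed-form updates above translate directly into a few lines of linear algebra. The NumPy sketch below follows the $\mathbf{P}_l$/$\mathbf{P}_u$ rules of Eqs. (12) and (14) and the $\mathbf{W}$ rule of Eq. (20). For simplicity it uses the full $\mathbf{B}^{(v)}$ and $\mathbf{W}$ in place of the $l$/$u$-subscripted blocks written in the derivation and assumes the labeled columns of $\mathbf{Y}$ come first; function and variable names are ours, not from the paper.

```python
import numpy as np

def update_P(X_views, B_views, W, Y_l, alphas, labeled_idx, unlabeled_idx, beta, r=2):
    """Closed-form update of P, split into labeled and unlabeled columns."""
    # Shared K x K term: sum_v alpha_v^r (B^(v))^T B^(v)
    A = sum(a**r * B.T @ B for a, B in zip(alphas, B_views))

    # Labeled part, Eq. (12): (A + beta W^T W) P_l = sum_v alpha_v^r B^T X_l + beta W^T Y_l
    rhs_l = sum(a**r * B.T @ X[:, labeled_idx]
                for a, B, X in zip(alphas, B_views, X_views)) + beta * W.T @ Y_l
    P_l = np.linalg.solve(A + beta * W.T @ W, rhs_l)

    # Unlabeled part, Eq. (14): A P_u = sum_v alpha_v^r B^T X_u
    rhs_u = sum(a**r * B.T @ X[:, unlabeled_idx]
                for a, B, X in zip(alphas, B_views, X_views))
    P_u = np.linalg.solve(A, rhs_u)

    return np.hstack([P_l, P_u])

def update_W(P, Y, S):
    """Closed-form update of W, Eq. (20): W = Y S S^T P^T (P S S^T P^T)^{-1}."""
    SSt = S @ S.T
    return Y @ SSt @ P.T @ np.linalg.inv(P @ SSt @ P.T)
```

Using `np.linalg.solve` for the $\mathbf{P}$ updates avoids forming the explicit inverses in Eqs. (12) and (14), which is the standard numerically preferable way to apply such closed-form rules.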
Update α with fixed $\mathbf{B}^{(v)}$ and $\mathbf{P}$. We employ a Lagrange multiplier $\lambda$ to take the constraint into consideration, obtaining the following Lagrange function

$$\sum_{v=1}^{V} \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \lambda \left( \sum_{v=1}^{V} \alpha_v - 1 \right). \tag{21}$$

Setting the derivatives of Eq. (21) with respect to $\alpha_v$ and $\lambda$ to zero, we obtain the following updating rule

$$\alpha_v = \frac{\left( 1 / \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 \right)^{1/(r-1)}}{\sum_{v=1}^{V} \left( 1 / \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 \right)^{1/(r-1)}}. \tag{22}$$

According to the above updating rules, we alternately update these variables until the convergence condition (i.e., the difference of the objective function value between two consecutive iterations is smaller than $10^{-6}$) is reached.

### Algorithm Analysis

Convexity analysis. Note that, due to the HSIC term and its negative sign, the $\mathbf{B}^{(v)}$-subproblem is in general not convex. This raises the question: is the following function convex?

$$\mathcal{L}(\mathbf{B}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) + \mu \|\mathbf{B}^{(v)} - \mathbf{S}^{(v)}_r + \mathbf{T}^{(v)}_r\|_F^2. \tag{23}$$

The optimal solution can be obtained if the function $\mathcal{L}(\mathbf{B}^{(v)})$ is strictly convex, which is also a prerequisite for the convergence of the overall optimization. Therefore, we provide a guarantee of the convexity of $\mathcal{L}(\mathbf{B}^{(v)})$ under a proper parameter setting as follows:

Theorem 1. The subproblem $\mathcal{L}(\mathbf{B}^{(v)})$ is convex given the parameter setting $\mu \geq 4D(V-1)\gamma$, where $V$ is the number of views and $D = \max_{1 \leq v \leq V} D_v$.
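For completeness, here is a minimal sketch of the view-weight update of Eq. (22) together with the outer alternating loop and the $10^{-6}$ stopping rule stated above. The $\mathbf{B}^{(v)}$ ADMM step of Eq. (18) is only indicated in a comment, and the helper names (`update_P`, `update_W`) are the hypothetical ones introduced in the earlier sketches, not an official implementation.

```python
import numpy as np

def update_alpha(X_views, B_views, P, r=2):
    """View-weight update of Eq. (22): alpha_v ∝ (1 / ||X^(v) - B^(v) P||_F^2)^{1/(r-1)}."""
    losses = np.array([np.linalg.norm(X - B @ P, 'fro')**2
                       for X, B in zip(X_views, B_views)])
    weights = (1.0 / losses) ** (1.0 / (r - 1))
    return weights / weights.sum()

# Outer alternating loop (sketch only), stopping when the objective change is below 1e-6:
#
# while abs(obj_prev - obj) > 1e-6:
#     P      = update_P(X_views, B_views, W, Y_l, alphas, lab_idx, unlab_idx, beta, r)
#     # B_views = ADMM subproblem of Eq. (18), with mu chosen so that
#     #           mu >= 4 * D * (V - 1) * gamma to satisfy Theorem 1 (not sketched here)
#     W      = update_W(P, Y, S)
#     alphas = update_alpha(X_views, B_views, P, r)
#     obj_prev, obj = obj, objective(X_views, B_views, P, W, alphas)
```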