# Latent Semantic Aware Multi-View Multi-Label Classification

Changqing Zhang,1 Ziwei Yu,1 Qinghua Hu,1 Pengfei Zhu,1 Xinwang Liu,2 Xiaobo Wang3
1 School of Computer Science and Technology, Tianjin University, Tianjin, China, 300350
2 School of Computer, National University of Defense Technology, Changsha, China, 410073
3 Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 100190
{zhangchangqing, yuziwei, huqinghua, zhupengfei}@tju.edu.cn

## Abstract

In real-world applications, data are often associated with multiple labels and represented with multiple views. Most existing multi-label learning methods do not sufficiently exploit the complementary information among multiple views, which leads to unsatisfying performance. To address this issue, we propose a novel approach for multi-view multi-label learning based on matrix factorization that exploits the complementarity among different views. Specifically, under the assumption that there exists a common representation across different views, the uncovered latent patterns are enforced to be aligned across different views in kernel spaces. In this way, the latent semantic patterns underlying the data can be well uncovered, which enhances the reasonability of the common representation of multiple views. As a result, we obtain a consensus multi-view representation that encodes the complementarity and consistency of different views in the latent semantic space. We provide a theoretical guarantee of strict convexity for our method by properly setting parameters. Empirical evidence shows the clear advantages of our method over state-of-the-art ones.

## Introduction

Multi-label classification, which assigns one example to multiple classes, is of significant interest due to its ubiquity in real-world applications. For example, in computer vision, an image may simultaneously contain more than one type of object; in web page categorization, a news web page may cover different topics, such as sports, business and entertainment. For this problem, multi-label learning approaches (Boutell et al. 2004; Tsoumakas and Katakis 2006; Zhang and Zhou 2007; Gong et al. 2016) have been proposed over the past decade, such as the early representative methods binary relevance (BR) (Tsoumakas and Katakis 2006) and label powerset (LP) (Boutell et al. 2004). By directly transforming the multi-label learning task into multiple binary classification tasks, BR neglects the correlation among labels. LP regards each subset of multiple labels as a distinct class of a single-label classification problem. Although it takes label correlation into consideration, this model cannot mine complex label correlations and cannot be applied to tasks with large label sets. Multi-label k-nearest neighbour (ML-kNN) (Zhang and Zhou 2007) is one of the classic and effective multi-label methods: it builds a Bayesian model by using the k-nearest neighbour method to obtain the prior and likelihood, and then uses the maximum posterior to assign labels to test examples. Some more recent methods focus on other issues in multi-label classification, e.g., label noise (Yu et al. 2014; Yang, Jiang, and Zhou 2013). Although diverse methods have been proposed in the literature, the following limitations remain.
On one hand, most existing multi-label learning methods consider only single-view data; however, an individual view cannot characterize different labels comprehensively, since different views encode different properties of the data, which implies the practical necessity of multi-view learning (Xu, Tao, and Xu 2013; Liu et al. 2017; Cao et al. 2015; Zhang et al. 2015). On the other hand, learning with plenty of unlabeled data has shown its power in many real applications. However, most existing multi-label classification models are fully supervised and thus unable to exploit unlabeled samples. Although a few semi-supervised multi-label learning methods (Liu, Jin, and Yang 2006; Wang, Tu, and Tsotsos 2013) have been developed, these models are not specifically targeted at multi-view semi-supervised multi-label learning. The most recent and related method in (Liu et al. 2015) also utilizes matrix factorization and a common representation. However, it has the following two limitations: first, the common representation among multiple views is learned without constraining the bases of different views, which weakens the reasonability of the common representation; second, common representation learning and multi-label learning (label completion) are performed in two separate steps, so the prediction performance cannot be well guaranteed.

In this paper, we propose a new multi-view multi-label learning approach termed Latent Semantic Aware Multi-view Multi-label Learning (LSA-MML). As shown in Figure 1, given input data with multiple views, our method simultaneously seeks a predictive common representation of multiple views and the corresponding projection model between the common representation and the labels. The bases of the $V$ different views, $\{\mathbf{B}^{(v)}\}_{v=1}^{V}$, can be considered as latent semantic components. With the common representation $\mathbf{P}$, the $j$th bases of different views, i.e., $\{\mathbf{b}^{(v)}_{j}\}_{v=1}^{V}$, encode the same latent semantic; therefore, these bases should be consistent across different views. We align the bases of different views with the Hilbert-Schmidt Independence Criterion (HSIC) in kernel space, which addresses the comparability of different views, and thus a consensus coefficient matrix (common representation) $\mathbf{P}$ for the different views is induced. To solve our problem, we provide a theoretical analysis of convexity and instructions for parameter setting that guarantee strict convexity. Extensive empirical results on benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods.

Figure 1: Method overview. The common representation $\mathbf{P}$ is learned by jointly exploring the complementarity of multiple views and the scarce labeled samples (solid circles and squares). The latent semantic basis matrices (i.e., the $\mathbf{B}^{(v)}$s) of different views are aligned in kernel spaces, which guarantees the reasonability of the consensus representation $\mathbf{P}$.

## Related Work

Over the last decade, multi-label classification has received intensive attention (Boutell et al. 2004; Zhang and Zhou 2007; Yang, Jiang, and Zhou 2013). Generally, existing multi-label methods can be roughly categorized into three lines.
The first-order strategy deals with multi-label learning in a label-by-label manner, i.e., dividing the multi-label problem into multiple binary classification tasks or variants thereof (Zhang and Zhou 2007; Clare and King 2001). Second-order methods introduce pairwise relations between labels for multi-label classification, such as the ranking between relevant and irrelevant labels (Elisseeff and Weston 2002; Fürnkranz et al. 2008; Ghamrawi and McCallum 2005). CLR (Fürnkranz et al. 2008) first transforms the multi-label learning problem into a label ranking problem by introducing pairwise comparison, and then constructs binary classifiers to solve the multi-label ranking problem. Rank-SVM (Elisseeff and Weston 2002) conducts multi-label classification by adopting the ranking loss as the cost function in an SVM. The high-order strategy builds more complex relations among labels for multi-label learning (Read et al. 2011; Tsoumakas and Vlahavas 2007). Representatives include the chain-based method (Read et al. 2011), which transforms the multi-label data into a chain of binary classifiers, and the label-set-based methods (Boutell et al. 2004; Tsoumakas and Vlahavas 2007), which divide the entire label set into multiple overlapping subsets and train one classifier for each subset.

Due to the ubiquity of data with multiple views, multi-view learning has become an active research field and has shown its effectiveness in a wide range of applications (Xu, Tao, and Xu 2013). Recently, a few multi-view multi-label classification methods (Luo et al. 2013; Liu et al. 2015) were proposed to exploit the complementarity of different types of features for improved classification performance. The method in (Luo et al. 2013) introduces multi-view vector-valued manifold regularization to integrate multi-view features. The method in (Liu et al. 2015) seeks a common low-dimensional representation under the matrix factorization framework and then conducts classification based on matrix completion. Both of these recent methods (Luo et al. 2013; Liu et al. 2015) perform classification in a transductive semi-supervised manner.

## LSA-MML: Our Classification Model

Suppose there are $L$ labeled data points $\{\mathbf{x}_l, \mathbf{y}_l\}_{l=1}^{L}$ and $U$ unlabeled data points $\{\mathbf{x}_u\}_{u=L+1}^{N}$, where $N = L + U$. These instance-label pairs are stacked in two matrices, i.e., $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)$ and $\mathbf{Y} = (\mathbf{Y}_l, \mathbf{Y}_u) = (\mathbf{y}_1, \dots, \mathbf{y}_N)$. Since we employ the transductive learning manner to simultaneously exploit unlabeled samples, our objective function takes the following general form

$$\min \; \mathcal{M}(\mathbf{X}; \hat{\mathbf{Y}}) + \lambda \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}), \tag{1}$$

where $\hat{\mathbf{Y}}$ is the completed label matrix to be predicted, which is learned from the data $\mathbf{X}$ and the few known labels in $\mathbf{Y}_l$. Specifically, to obtain the completed label matrix $\hat{\mathbf{Y}}$, we aim to uncover the structure underlying the data $\mathbf{X}$ themselves with the first term $\mathcal{M}(\mathbf{X}; \hat{\mathbf{Y}})$, guided by the labeled data $\mathbf{Y}_l$ through the second term $\mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}})$. Considering data with multiple views, we generalize the above formulation as

$$\min \; \mathcal{M}(\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(V)}; \hat{\mathbf{Y}}) + \lambda \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}). \tag{2}$$

Under the assumption that different views share a latent common representation, i.e., $\mathbf{X}^{(v)} = \mathbf{B}^{(v)}\mathbf{P}$, we have

$$\min \; \mathcal{M}(\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(V)}; \mathbf{P}) + \lambda_1 \Omega(\mathbf{B}^{(1)}, \dots, \mathbf{B}^{(V)}) + \lambda_2 \mathcal{P}(\mathbf{P}, \hat{\mathbf{Y}}) + \lambda_3 \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}), \tag{3}$$

where $\mathbf{P} \in \mathbb{R}^{K \times N}$ is the consensus multi-view representation that encodes the complementary information from different views, and $\mathbf{B}^{(v)} \in \mathbb{R}^{D_v \times K}$ is the basis matrix corresponding to the $v$th view. Accordingly, the first term seeks a comprehensive multi-view representation, and the second term guarantees the reasonability of using a common representation for different views, since it aligns the bases of different views in the latent semantic space. The last term delivers the label information on the estimated label matrix, based on which the third term ensures the predictive property of the common multi-view representation. Therefore, the completed label matrix $\hat{\mathbf{Y}}$ benefits from both the multi-view data (through $\mathbf{P}$) and the supervised label information ($\mathbf{Y}_l$).
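To make the shared-representation assumption $\mathbf{X}^{(v)} = \mathbf{B}^{(v)}\mathbf{P}$ concrete, the following NumPy sketch generates synthetic two-view data from a single latent matrix and evaluates the weighted multi-view reconstruction term that appears in Eq. (4) below. All sizes, the noise level, and the function names are illustrative placeholders of ours, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 200, 10            # samples and latent dimension (hypothetical sizes)
dims = [50, 80]           # per-view feature dimensions D_v (hypothetical)

# Shared latent representation P (K x N) and view-specific bases B^(v) (D_v x K)
P_true = rng.standard_normal((K, N))
B_true = [rng.standard_normal((d, K)) for d in dims]

# Each view is generated as X^(v) = B^(v) P, here with a little additive noise
X_views = [B @ P_true + 0.01 * rng.standard_normal((B.shape[0], N)) for B in B_true]

def reconstruction_loss(X_views, B_views, P, alphas, r=2):
    """Weighted multi-view reconstruction term, sum_v alpha_v^r ||X^(v) - B^(v) P||_F^2."""
    return sum(a**r * np.linalg.norm(X - B @ P, 'fro')**2
               for a, X, B in zip(alphas, X_views, B_views))

# With the true latent factors the loss is close to zero (only the noise remains).
print(reconstruction_loss(X_views, B_true, P_true, alphas=[0.5, 0.5]))
```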
Specifically, using the common multi-view representation under the matrix factorization framework, we have

$$\mathcal{M}(\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(V)}; \mathbf{P}) = \sum_{v=1}^{V} \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2, \tag{4}$$

where $\|\cdot\|_F$ is the well-known Frobenius norm of a matrix. The latent bases should be consistent across different views. To this end, we penalize the independence of the bases between different views with

$$\Omega(\mathbf{B}^{(1)}, \dots, \mathbf{B}^{(V)}) = -\sum_{v \neq w} \mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}), \tag{5}$$

where the aim of the regularizer $\mathrm{IND}(\cdot, \cdot)$ is to enhance the dependence between the bases of different views. Since these bases lie in different feature spaces, we introduce HSIC to enforce consistency across different views in kernel spaces. Specifically, we define $\mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) = \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)})$ in our method.

Hilbert-Schmidt Independence Criterion (Gretton et al. 2005). We give a brief description of HSIC as follows. Define a mapping $\phi(\mathbf{x})$ from $\mathbf{x} \in \mathcal{X}$ to a kernel space $\mathcal{F}$ such that the inner product between vectors in that space is given by a kernel function $k_1(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$. Let $\mathcal{G}$ be a second kernel space on $\mathcal{Y}$ with kernel function $k_2(\mathbf{y}_i, \mathbf{y}_j) = \langle \varphi(\mathbf{y}_i), \varphi(\mathbf{y}_j) \rangle$. The empirical version of HSIC is defined as follows:

Definition 1. Consider a series of $N$ independent observations drawn from $p_{xy}$, $\mathcal{Z} := \{(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N)\} \subseteq \mathcal{X} \times \mathcal{Y}$. An estimator of HSIC, written as $\mathrm{HSIC}(\mathcal{Z}, \mathcal{F}, \mathcal{G})$, is given by

$$\mathrm{HSIC}(\mathcal{Z}, \mathcal{F}, \mathcal{G}) = (N-1)^{-2} \, \mathrm{tr}(\mathbf{K}_1 \mathbf{C} \mathbf{K}_2 \mathbf{C}), \tag{6}$$

where $\mathrm{tr}(\cdot)$ is the trace of a square matrix, $\mathbf{K}_1$ and $\mathbf{K}_2$ are the Gram matrices with $k_{1,ij} = k_1(\mathbf{x}_i, \mathbf{x}_j)$ and $k_{2,ij} = k_2(\mathbf{y}_i, \mathbf{y}_j)$, and $c_{ij} = \delta_{ij} - 1/N$ centers the Gram matrices to have zero mean in the feature space.

Since HSIC measures the dependence of two variables well (Quadrianto, Song, and Smola 2009), we employ it to maximize the dependency between the bases of two views. Note that, according to Eq. (6), the HSIC term in our method can be considered as a penalization of the disagreement of different views in terms of the similarity graphs of the bases, as shown in Fig. 1. In practice, to ensure the predictive property with respect to the labels, the third and fourth terms are integrated into the following formula

$$\Delta = \lambda_2 \mathcal{P}(\mathbf{P}, \hat{\mathbf{Y}}) + \lambda_3 \mathcal{S}(\mathbf{Y}_l, \hat{\mathbf{Y}}) = \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2, \tag{7}$$

where $\hat{\mathbf{Y}} = \mathbf{W}\mathbf{P}$ and $\mathbf{Y} = [\mathbf{Y}_l, \mathbf{Y}_u]$. $\mathbf{W}$ is the prediction model and $\mathbf{S}$ is the filtering matrix used to select the labeled samples, with $S_{ii} = 1$ if the $i$th sample is labeled and $0$ otherwise. This ensures that the multi-view consensus representation is predictive with respect to the known labels. Accordingly, the final form of our objective function is

$$\min_{\mathbf{B}^{(v)}, \mathbf{P}, \mathbf{W}, \alpha_v} \; \sum_{v=1}^{V} \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 + \beta \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2 - \gamma \sum_{v \neq w} \mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) \quad \text{s.t.} \; \sum_{v=1}^{V} \alpha_v = 1, \; \|\mathbf{b}^{(v)}_{\cdot j}\|^2 \leq 1, \tag{8}$$

where $\alpha_v > 0$ automatically weights the different views and $r > 1$ avoids the trivial solution that considers only one view, adjusting the complementarity of the multiple views (Wang et al. 2007). $\beta > 0$ and $\gamma > 0$ are tradeoff factors. $\mathbf{B}^{(v)}$ is constrained because, without the constraint, $\mathbf{P}$ could be pushed arbitrarily close to zero simply by rescaling to $\mathbf{P}/s$ and $s\mathbf{B}^{(v)}$ ($s > 0$) while preserving the same loss.
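To illustrate the alignment term of Eq. (8), here is a small sketch of the empirical HSIC estimator of Eq. (6) applied to two basis matrices, which is how $\mathrm{IND}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)})$ is defined above. Treating the $K$ basis columns as the observations and using plain inner-product (linear) kernels is our assumption for illustration; the paper only specifies that the bases are compared through their Gram matrices in kernel spaces.

```python
import numpy as np

def empirical_hsic(B_v, B_w):
    """Empirical HSIC of Eq. (6): (n-1)^{-2} tr(K1 C K2 C).

    Assumption: the K basis columns of each view play the role of the n
    observations, and linear kernels define the Gram matrices.
    """
    K = B_v.shape[1]                       # number of latent bases (observations)
    K1 = B_v.T @ B_v                       # Gram matrix of view v's bases
    K2 = B_w.T @ B_w                       # Gram matrix of view w's bases
    C = np.eye(K) - np.ones((K, K)) / K    # centering matrix, c_ij = delta_ij - 1/K
    return np.trace(K1 @ C @ K2 @ C) / (K - 1) ** 2

def alignment_regularizer(B_views):
    """Omega of Eq. (5): negative summed HSIC over all ordered pairs of views."""
    V = len(B_views)
    return -sum(empirical_hsic(B_views[v], B_views[w])
                for v in range(V) for w in range(V) if w != v)
```

Maximizing the pairwise HSIC (i.e., minimizing this regularizer) pushes the similarity graphs of the bases of different views toward agreement, which is the intuition given after Eq. (6).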
To summarize, our model has the following merits: 1) it focuses on seeking a comprehensive common representation of multiple views by enforcing the latent semantic bases of different views to be consistent; 2) it can be considered a bi-directional factorization, in which the multi-view common representation $\mathbf{P}$ bridges the factorizations of the multi-view input and of the label matrix, where the label matrix can be regarded as a description of the data in terms of explicit semantic labels; 3) the label correlations are implicitly encoded by the common representation, based on the uncovered latent semantic bases and the relations among them.

## Optimization

### Alternating Optimization Algorithm

We adopt the alternating minimization strategy to solve the optimization problem, which comprises four subproblems solved as follows.

Update P with fixed $\{\alpha_v\}_{v=1}^{V}$, $\mathbf{W}$ and $\{\mathbf{B}^{(v)}\}_{v=1}^{V}$. We minimize the following objective function

$$\sum_{v=1}^{V} \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 + \beta \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2. \tag{9}$$

Taking the derivative with respect to $\mathbf{P}$ and setting it to zero, we obtain

$$\sum_{v=1}^{V} \alpha_v^r \left( -(\mathbf{B}^{(v)})^T \mathbf{X}^{(v)} + (\mathbf{B}^{(v)})^T \mathbf{B}^{(v)} \mathbf{P} \right) + \beta \left( \mathbf{W}^T \mathbf{W} \mathbf{P} \mathbf{S}\mathbf{S}^T - \mathbf{W}^T \mathbf{Y} \mathbf{S}\mathbf{S}^T \right) = 0. \tag{10}$$

Thanks to the diagonal property of $\mathbf{S}$, we solve the problem by separating the labeled and unlabeled parts. For the labeled part, we have

$$\sum_{v=1}^{V} \alpha_v^r \left( -(\mathbf{B}^{(v)}_l)^T \mathbf{X}^{(v)}_l + (\mathbf{B}^{(v)}_l)^T \mathbf{B}^{(v)}_l \mathbf{P}_l \right) + \beta \left( \mathbf{W}^T_l \mathbf{W}_l \mathbf{P}_l - \mathbf{W}^T_l \mathbf{Y}_l \right) = 0, \tag{11}$$

where the subscripts $l$ and $u$ indicate variables corresponding to labeled and unlabeled data, respectively. Accordingly, we can update $\mathbf{P}_l$ with the following rule

$$\mathbf{P}_l = \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_l)^T \mathbf{B}^{(v)}_l + \beta \mathbf{W}^T_l \mathbf{W}_l \right)^{-1} \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_l)^T \mathbf{X}^{(v)}_l + \beta \mathbf{W}^T_l \mathbf{Y}_l \right). \tag{12}$$

For the unsupervised part, we have

$$\sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{B}^{(v)}_u \mathbf{P}_u = \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{X}^{(v)}_u. \tag{13}$$

Accordingly, we can update $\mathbf{P}_u$ by the following rule

$$\mathbf{P}_u = \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{B}^{(v)}_u \right)^{-1} \left( \sum_{v=1}^{V} \alpha_v^r (\mathbf{B}^{(v)}_u)^T \mathbf{X}^{(v)}_u \right). \tag{14}$$

After obtaining $\mathbf{P}_l$ and $\mathbf{P}_u$, the common representation corresponding to all data, i.e., $\mathbf{P}$, is obtained as $\mathbf{P} = [\mathbf{P}_l, \mathbf{P}_u]$.

Update B(v) with fixed $\alpha_v$, $\mathbf{W}$ and $\mathbf{P}$. We minimize the following objective function

$$\mathcal{L}(\mathbf{B}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) \quad \text{s.t.} \; \|\mathbf{b}^{(v)}_{\cdot j}\|^2 \leq 1. \tag{15}$$

We optimize the $\mathbf{B}^{(v)}$-subproblem by following the work in (Gu et al. 2014), which introduces an auxiliary variable $\mathbf{S}^{(v)}$. We then have the following objective

$$\mathcal{L}(\mathbf{B}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) \quad \text{s.t.} \; \mathbf{B}^{(v)} = \mathbf{S}^{(v)}, \; \|\mathbf{s}^{(v)}_{\cdot j}\|^2 \leq 1. \tag{16}$$

We optimize (16) with the alternating direction method of multipliers (ADMM). Removing the equality constraint, it becomes

$$\mathcal{L}(\mathbf{B}^{(v)}, \mathbf{S}^{(v)}, \mathbf{T}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) + \mu \|\mathbf{B}^{(v)} - \mathbf{S}^{(v)}_r + \mathbf{T}^{(v)}_r\|_F^2 \quad \text{s.t.} \; \|\mathbf{s}^{(v)}_{\cdot j}\|^2 \leq 1, \tag{17}$$

where $\mu > 0$ is the penalty hyperparameter. The optimal solution of (17) can be obtained with

$$\begin{aligned}
\mathbf{B}^{(v)}_{r+1} &= \arg\min_{\mathbf{B}^{(v)}} \; \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) + \mu \|\mathbf{B}^{(v)} - \mathbf{S}^{(v)}_r + \mathbf{T}^{(v)}_r\|_F^2, \\
\mathbf{S}^{(v)}_{r+1} &= \arg\min_{\mathbf{S}^{(v)}} \; \mu \|\mathbf{B}^{(v)}_{r+1} - \mathbf{S}^{(v)} + \mathbf{T}^{(v)}_r\|_F^2, \quad \text{s.t.} \; \|\mathbf{s}^{(v)}_{\cdot j}\|^2 \leq 1, \\
\mathbf{T}^{(v)}_{r+1} &= \mathbf{T}^{(v)}_r + \mathbf{B}^{(v)}_{r+1} - \mathbf{S}^{(v)}_{r+1}, \quad \text{update } \mu \text{ if appropriate},
\end{aligned} \tag{18}$$

where $\mathbf{s}^{(v)}_{\cdot j}$ indicates the $j$th column of $\mathbf{S}^{(v)}$. Note that Theorem 1 (in the Algorithm Analysis subsection) guarantees that the subproblem $\mathcal{L}(\mathbf{B}^{(v)})$ is convex, and thus the optimal solution can be obtained.

Update W with fixed $\alpha_v$, $\mathbf{P}$ and $\mathbf{B}^{(v)}$. We minimize the following objective function

$$\mathcal{L}(\mathbf{W}) = \beta \|(\mathbf{W}\mathbf{P} - \mathbf{Y})\mathbf{S}\|_F^2. \tag{19}$$

Accordingly, we obtain the following updating rule

$$\mathbf{W} = \mathbf{Y}\mathbf{S}\mathbf{S}^T\mathbf{P}^T (\mathbf{P}\mathbf{S}\mathbf{S}^T\mathbf{P}^T)^{-1}. \tag{20}$$
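The closed-form updates above translate directly into a few lines of linear algebra. The NumPy sketch below follows the $\mathbf{P}_l$/$\mathbf{P}_u$ rules of Eqs. (12) and (14) and the $\mathbf{W}$ rule of Eq. (20). For simplicity it uses the full $\mathbf{B}^{(v)}$ and $\mathbf{W}$ in place of the $l$/$u$-subscripted blocks written in the derivation and assumes the labeled columns of $\mathbf{Y}$ come first; function and variable names are ours, not from the paper.

```python
import numpy as np

def update_P(X_views, B_views, W, Y_l, alphas, labeled_idx, unlabeled_idx, beta, r=2):
    """Closed-form update of P, split into labeled and unlabeled columns."""
    # Shared K x K term: sum_v alpha_v^r (B^(v))^T B^(v)
    A = sum(a**r * B.T @ B for a, B in zip(alphas, B_views))

    # Labeled part, Eq. (12): (A + beta W^T W) P_l = sum_v alpha_v^r B^T X_l + beta W^T Y_l
    rhs_l = sum(a**r * B.T @ X[:, labeled_idx]
                for a, B, X in zip(alphas, B_views, X_views)) + beta * W.T @ Y_l
    P_l = np.linalg.solve(A + beta * W.T @ W, rhs_l)

    # Unlabeled part, Eq. (14): A P_u = sum_v alpha_v^r B^T X_u
    rhs_u = sum(a**r * B.T @ X[:, unlabeled_idx]
                for a, B, X in zip(alphas, B_views, X_views))
    P_u = np.linalg.solve(A, rhs_u)

    return np.hstack([P_l, P_u])

def update_W(P, Y, S):
    """Closed-form update of W, Eq. (20): W = Y S S^T P^T (P S S^T P^T)^{-1}."""
    SSt = S @ S.T
    return Y @ SSt @ P.T @ np.linalg.inv(P @ SSt @ P.T)
```

Using `np.linalg.solve` for the $\mathbf{P}$ updates avoids forming the explicit inverses in Eqs. (12) and (14), which is the standard numerically preferable way to apply such closed-form rules.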
Update α with fixed $\mathbf{B}^{(v)}$ and $\mathbf{P}$. We employ a Lagrange multiplier $\lambda$ to take the constraint into consideration, obtaining the following Lagrange function

$$\sum_{v=1}^{V} \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \lambda \left( \sum_{v=1}^{V} \alpha_v - 1 \right). \tag{21}$$

Setting the derivatives of Eq. (21) with respect to $\alpha_v$ and $\lambda$ to zero, we obtain the following updating rule

$$\alpha_v = \frac{\left( 1 / \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 \right)^{1/(r-1)}}{\sum_{v=1}^{V} \left( 1 / \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 \right)^{1/(r-1)}}. \tag{22}$$

According to the above updating rules, we alternately update these variables until the convergence condition (i.e., the difference of the objective function value between two consecutive iterations is smaller than $10^{-6}$) is reached.

### Algorithm Analysis

Convexity analysis. Note that, due to the HSIC term and its negative sign, the $\mathbf{B}^{(v)}$-subproblem is in general not convex. This raises the question: is the following function convex?

$$\mathcal{L}(\mathbf{B}^{(v)}) = \alpha_v^r \|\mathbf{X}^{(v)} - \mathbf{B}^{(v)}\mathbf{P}\|_F^2 - \gamma \sum_{w \neq v} \mathrm{HSIC}(\mathbf{B}^{(v)}, \mathbf{B}^{(w)}) + \mu \|\mathbf{B}^{(v)} - \mathbf{S}^{(v)}_r + \mathbf{T}^{(v)}_r\|_F^2. \tag{23}$$

The optimal solution can be obtained if the function $\mathcal{L}(\mathbf{B}^{(v)})$ is strictly convex, which is also a prerequisite for the convergence of the overall optimization. Therefore, we provide a guarantee of the convexity of $\mathcal{L}(\mathbf{B}^{(v)})$ under a proper parameter setting as follows:

Theorem 1. The subproblem $\mathcal{L}(\mathbf{B}^{(v)})$ is convex given the parameter setting $\mu \geq 4D(V-1)\gamma$, where $V$ is the number of views and $D = \max_{1 \leq v \leq V} D_v$.
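For completeness, here is a minimal sketch of the view-weight update of Eq. (22) together with the outer alternating loop and the $10^{-6}$ stopping rule stated above. The $\mathbf{B}^{(v)}$ ADMM step of Eq. (18) is only indicated in a comment, and the helper names (`update_P`, `update_W`) are the hypothetical ones introduced in the earlier sketches, not an official implementation.

```python
import numpy as np

def update_alpha(X_views, B_views, P, r=2):
    """View-weight update of Eq. (22): alpha_v ∝ (1 / ||X^(v) - B^(v) P||_F^2)^{1/(r-1)}."""
    losses = np.array([np.linalg.norm(X - B @ P, 'fro')**2
                       for X, B in zip(X_views, B_views)])
    weights = (1.0 / losses) ** (1.0 / (r - 1))
    return weights / weights.sum()

# Outer alternating loop (sketch only), stopping when the objective change is below 1e-6:
#
# while abs(obj_prev - obj) > 1e-6:
#     P      = update_P(X_views, B_views, W, Y_l, alphas, lab_idx, unlab_idx, beta, r)
#     # B_views = ADMM subproblem of Eq. (18), with mu chosen so that
#     #           mu >= 4 * D * (V - 1) * gamma to satisfy Theorem 1 (not sketched here)
#     W      = update_W(P, Y, S)
#     alphas = update_alpha(X_views, B_views, P, r)
#     obj_prev, obj = obj, objective(X_views, B_views, P, W, alphas)
```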