
Incomplete Multi-View Weak-Label Learning

Qiaoyu Tan1, Guoxian Yu1,*, Carlotta Domeniconi2, Jun Wang1 and Zili Zhang1,3

1College of Computer and Information Science, Southwest University, Chongqing 400715, China 2Department of Computer Science, George Mason University, Fairfax 22030, USA 3School of Information Technology, Deakin University, Geelong, VIC 3220, Australia {tqy1995119, gxyu, kingjun, zhangzl}@swu.edu.cn, carlotta@cs.gmu.edu

Learning from multi-view multi-label data has wide applications. Two main challenges characterize this learning task: incomplete views and missing (weak) labels. In the incomplete-view setting, a view may not include all data objects. In the weak-label setting, only a subset of the relevant labels is provided for the training objects, while the other labels are missing. Both incomplete views and weak labels can lead to significant performance degradation. In this paper, we propose a novel model (iMVWL) to jointly address the two challenges. iMVWL simultaneously learns a shared subspace from incomplete views with weak labels, local label correlations, and a predictor in this subspace. The learned subspace can capture not only cross-view relationships but also the weak-label information of training samples. We further develop an alternating optimization solution for our model; this solution avoids suboptimal results and reinforces the reciprocal effects of the learned components, and thus further improves the performance. Extensive experimental results on real-world datasets validate the effectiveness of our model against other competitive algorithms.

1 Introduction

In many real-world applications, a sample may have several heterogeneous representations, each giving a different view of the data, and may also have multiple labels. For example, a web image can be tagged with multiple topics given as labels, such as cattle, grass, and tree. At the same time, the image can be described using heterogeneous features, such as texture descriptors, shape descriptors, color descriptors, and surrounding text. Multi-view multi-label learning, as a natural formulation for this type of data, has attracted a lot of attention in machine learning and in many application domains [Liu et al., 2015; Luo et al., 2015]. Although many multi-view multi-label learning methods have been proposed in recent years, a main challenge remains: the lack of fully labeled training samples. In practice, it is rather difficult to collect all the relevant labels of a sample, and only a subset may be available. One such example is image annotation: an annotator may only afford to annotate an image with some labels, especially when the number of relevant labels is large. Learning from partially labeled samples is termed the weak-label learning problem [Sun et al., 2010; Bucak et al., 2011; Kong et al., 2014]. Several weak-label learning methods have been proposed in single-view [Yu et al., 2014; Cabral et al., 2015] and multi-view scenarios [Zhang et al., 2013].

However, almost none of the aforementioned methods accounts for another important challenge: incomplete data. Namely, some samples may be missing their representation in one view. This can happen in practice for a variety of reasons, e.g., a temporary sensor failure or a human error. It has been observed that incomplete data are likely to degrade multi-view learning performance [Xu et al., 2015a]. The more challenging case is when missing labels and incomplete data co-exist in a multi-view multi-label learning problem. To the best of our knowledge, a few studies handle incomplete data [Xu et al., 2015a] or missing labels [Zhang et al., 2013] in multi-view learning, but no previous work takes both issues into account simultaneously.

To bridge this gap, we propose a novel unified model, called incomplete Multi-View Weak-Label Learning (iMVWL), to jointly handle incomplete views and missing labels. The basic strategy of iMVWL is to seek, in a unified learning framework, a shared subspace across heterogeneous incomplete views and a robust weak-label classifier in this subspace, so that label correlations and discriminative information can be learned. In summary, our main contributions are as follows: (i) The proposed iMVWL jointly addresses incomplete views and missing labels. It simultaneously learns a shared subspace from incomplete views with weak labels, label correlations, and a predictor in this subspace. (ii) We develop a solution that iteratively optimizes our model and avoids suboptimal results. (iii) Experiments on five widely used datasets and comparisons with a number of competitive methods [Yuan et al., 2012; Zhang et al., 2013; Xu et al., 2015a; Liu et al., 2015] demonstrate the superiority of the proposed work.

Guoxian Yu is the corresponding author.

2 Related Work

This work is related to two branches of studies, weak-label learning and multi-view learning. Weak-label learning was pioneered by [Sun et al., 2010]. Many weak-label learning algorithms have subsequently been proposed. To name a few,

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

weak-label learning algorithms under a supervised setting [Bucak et al., 2011; Kong et al., 2014], under a semi-supervised setting [Zhao and Guo, 2015; Wu et al., 2015], and under a multi-instance multi-label framework [Yang et al., 2013]. Multi-view learning deals with data represented in different views and has attracted increasing interest in recent years. Previous approaches have considered multiple views in conjunction with semi-supervised learning [Xu et al., 2015a; Nie et al., 2017], with multi-label learning [Zhang et al., 2013; Liu et al., 2015], or with active learning [Wang and Zhou, 2010]. Others estimate a latent subspace by assuming that samples (in different views) corresponding to the same object are close to each other when mapped into the latent subspace [Zhang et al., 2013; Liu et al., 2015; Xu et al., 2017].

Almost all previous weak-label learning studies focus on a single-view setting. Likewise, almost all existing multi-view learning studies assume completeness of each view (i.e., each sample appears in all views). The only exceptions are LabelMe [Zhang et al., 2013] and MVL-IV [Xu et al., 2015a]. LabelMe is a multi-view weak-label learning method, but it assumes complete views of each training sample. As discussed above, this assumption is often violated in practice. MVL-IV is a recently proposed multi-view learning solution that considers incomplete views. It integrates multiple incomplete views by assuming that the views are generated from a common subspace, so that the learned subspace may capture cross-view relationships. Nevertheless, MVL-IV is an unsupervised subspace learning approach, which may lack discriminative ability because it does not use label information [Xu et al., 2017]. In addition, MVL-IV assumes that the available labels of training samples are complete, ignoring the widely witnessed weak-label scenarios. Moreover, MVL-IV decouples subspace learning from the follow-up classification task, which may result in suboptimal models due to the lack of mutual adaptation between the two steps.

To address these challenges, this paper proposes a novel unified framework (iMVWL) to jointly handle incomplete views and weak labels. iMVWL simultaneously learns the shared subspace from incomplete views, a predictor in this subspace, and the local label structure. iMVWL achieves not only a discriminative shared subspace from incomplete views, but also a robust weak-label classifier that can dynamically capture local label correlations. To the best of our knowledge, no previous work jointly handles the challenges of both incomplete views and weak labels.

3 Proposed Approach

Suppose $\mathcal{X} = \{X^v\}_{v=1}^{n_v}$ represents a dataset with $n$ samples and $n_v$ views, where $X^v = [x_1^v, x_2^v, \ldots, x_n^v] \in \mathbb{R}^{n \times d_v}$ indicates the full feature space in view $v$. $Y = [y_1, y_2, \ldots, y_n]^T \in \{-1, 1\}^{n \times c}$ is the corresponding weak-label matrix, where $y_i \in \{-1, 1\}^c$ is the label vector of $x_i$ and $c$ is the number of distinct labels. $y_{ic'} = 1$ ($c' = 1, \ldots, c$) means the $c'$-th label is relevant, while $y_{ic'} = -1$ does not provide any information. In the incomplete multi-view setting, a sample may appear in some views, but not all. That is, the data matrix $X^v$ may have a number of missing rows. An easy fix to this problem is to remove any sample missing in at least one view. However, this approach significantly reduces the number of samples that can be used for training. Our goal is to predict the labels of unlabeled samples based on the multiple incomplete feature spaces $\mathcal{X}$ and the weak-label space spanned by $Y$.

3.1 Problem Formulation

With multi-view multi-label data, two challenging problems are how to generate a shared discriminative subspace across views and how to train an efficient and robust multi-label classifier in that subspace for label prediction. Several subspace learning algorithms have been proposed to seek the shared subspace across views [Zhao et al., 2017], such as multi-view subspace learning methods based on a low-rank constraint [Liu et al., 2015], matrix factorization [Žitnik and Zupan, 2015], and nonnegative matrix factorization (NMF) [Wang and Zhang, 2013]. Among them, NMF has been successfully applied in text mining, image annotation, bioinformatics, recommender systems, and other domains [Wang and Zhang, 2013], since most data matrices are naturally nonnegative or can easily be transformed into nonnegative ones. The major difference between NMF and other matrix factorization methods, such as SVD (singular value decomposition), is the nonnegativity constraints, which help to obtain a part-based representation and enhance the interpretability of the learned subspace. In this paper, we also focus on nonnegative data matrices, and adapt NMF to learn a discriminative low-rank representation from incomplete views by using weak-label information. Given a multi-view dataset $\mathcal{X}$, the standard NMF can be adapted to find a shared subspace $V$ as follows:

$$\min_{\{U_v\}, V} \sum_{v=1}^{n_v} \|X^v - V U_v^T\|_F^2 \quad \text{s.t. } U_v \ge 0,\ V \ge 0 \tag{1}$$

where $U_v \in \mathbb{R}^{d_v \times k}$, $V \in \mathbb{R}^{n \times k}$, $k$ is the desired low-rank size, $\|\cdot\|_F$ represents the Frobenius norm, and $U_v \ge 0$ and $V \ge 0$ are the nonnegativity constraints on the matrices. The learned subspace $V$ in Eq. (1) can capture the cross-view relationships since it enables the integration of complementary information across multiple views [Xu et al., 2015a]. In many applications, however, Eq. (1) may be unreliable due to the presence of incomplete views. One remedy is to fill the missing samples with average feature values; nonetheless, this approach may introduce errors, especially when the number of missing samples is large, and is hence not suitable for the incomplete-view setting. Besides, the above unsupervised subspace learning process lacks discriminative ability because it ignores label information. To address these drawbacks, we formulate subspace learning from incomplete views as a supervised approach, which considers label information and complementary information across incomplete views as follows:

$$\min_{\{U_v\}, V, W} \sum_{v=1}^{n_v} \|O^v \odot (X^v - V U_v^T)\|_F^2 + \alpha \|VW - Y\|_F^2 \tag{2}$$

where $\odot$ is the Hadamard product (element-wise product). $O^v \in \mathbb{R}^{n \times d_v}$ is an indicator matrix that marks the observed entries: $O^v_{i,j} = 1$ if $(i, j)$ is an observed entry in $X^v$, and $O^v_{i,j} = 0$ otherwise. $Y \in \mathbb{R}^{n \times c}$ denotes the available label matrix of the $n$ samples. $W \in \mathbb{R}^{k \times c}$ is the coefficient matrix, which maps the shared feature subspace into a semantic label space. Eq. (2) achieves two goals. On one hand, the learned subspace can capture the cross-view relationships, as discussed before. On the other hand, the second term uses the label information $Y$ to induce the shared subspace towards a semantic label space, which not only helps to obtain a discriminative subspace but may also alleviate the widespread semantic gap [Datta et al., 2008] between the input heterogeneous feature spaces and the semantic label space, since $V$ can be viewed as a bridge between them.

Eq. (2), however, ignores another important issue in multi-view weak-label learning, i.e., the presence of missing labels. In practice, $Y$ is often incomplete and contains many missing entries. As such, we need to avoid the influence of the missing labels in $Y$ and improve the robustness of Eq. (2). Considering that label correlation is very important in the weak-label setting and can usually further improve the performance [Dong et al., 2018], we leverage label correlations among weak labels to estimate the predicted likelihood scores and extend Eq. (2) as follows:
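As a minimal sketch of how the indicator matrices act in Eq. (2), the NumPy snippet below evaluates the objective on toy data. The function name `eq2_objective`, the toy sizes, and the placeholder value written into the missing row are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def eq2_objective(X_views, O_views, U_views, V, W, Y, alpha):
    """Masked reconstruction error over all views plus the label-fitting term."""
    recon = 0.0
    for X, O, U in zip(X_views, O_views, U_views):
        # O zeroes the residual on unobserved entries, so a sample that is
        # missing from a view contributes nothing to that view's loss.
        recon += np.sum((O * (X - V @ U.T)) ** 2)
    return recon + alpha * np.sum((V @ W - Y) ** 2)

rng = np.random.default_rng(0)
n, k, c, dims = 6, 2, 3, [4, 5]            # toy sizes (assumption)
V = rng.random((n, k))
U_views = [rng.random((d, k)) for d in dims]
W = rng.random((k, c))
X_views = [V @ U.T for U in U_views]        # noiseless views
Y = V @ W                                   # noiseless label scores

O_views = [np.ones((n, d)) for d in dims]
O_views[0][1, :] = 0.0                      # sample 1 is missing from view 0
X_views[0][1, :] = 123.0                    # arbitrary placeholder for the missing row

masked_value = eq2_objective(X_views, O_views, U_views, V, W, Y, alpha=0.01)
all_ones = [np.ones((n, d)) for d in dims]
unmasked_value = eq2_objective(X_views, all_ones, U_views, V, W, Y, alpha=0.01)
print(masked_value, unmasked_value)
```

With the mask in place the placeholder row is ignored and the loss vanishes on this noiseless toy data, whereas treating the placeholder as real data inflates the loss.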

$$\min_{\{U_v\}, V, W, S} \sum_{v=1}^{n_v} \|O^v \odot (X^v - V U_v^T)\|_F^2 + \alpha \|M \odot (VWS - Y)\|_F^2 \tag{3}$$

where $M \in \mathbb{R}^{n \times c}$ is an indicator matrix for missing labels: $M_{i,j} = 1$ if $(i, j)$ is an observed entry in $Y$, and $M_{i,j} = 0$ otherwise. $S \in \mathbb{R}^{c \times c}$ denotes the label correlation matrix, and $\alpha > 0$ is the trade-off parameter. By incorporating the label correlation matrix $S$, Eq. (3) can not only estimate the predicted likelihood scores, but also enhance the discriminative ability of the learned subspace by using label correlations among weak labels. However, since the observed relevant label sets of samples are incomplete, we cannot directly compute the label correlation matrix $S$ from the prior knowledge $Y$; we need to learn it. In addition, we need to account for the fact that label correlations are naturally local [Huang and Zhou, 2012], and can manifest as direct or indirect dependencies [Wu et al., 2015]. As such, it is reasonable to assume that label correlations are locally structured, that is, there exists a subset of labels that are closely related to each other through complex correlations and are independent of the rest. This local structure typically implies a low-rank structure of $S$, which is common in real-world applications [Xu et al., 2015b; Xu et al., 2016]. To capture local label correlations, we add a low-rank constraint on $S$, and make Eq. (3) more suitable for weak-label problems as follows:

$$\min_{\{U_v\}, V, W, S} \sum_{v=1}^{n_v} \|O^v \odot (X^v - V U_v^T)\|_F^2 + \alpha \|M \odot (VWS - Y)\|_F^2 + \beta\, \mathrm{rank}(S) \tag{4}$$

where $\beta$ is the trade-off parameter, which balances the relative importance of the low-rank constraint on $S$. By adding the rank term, our model can capture local label correlations among weak labels, which makes it more suitable for real-world applications. It is worth noting that the work in [Xu et al., 2015b] also makes a low-rank assumption among labels, but uses it differently. In particular, the authors of [Xu et al., 2015b] multiply the low-rank correlation matrix with $Y$ to replenish missing labels. In contrast, we apply the low-rank correlation matrix to the predicted likelihood label vectors. Since the estimated label correlation values may not be very reliable in practice, our method is less affected by them. Furthermore, since $V$ is the low-rank feature representation learned across incomplete views, Eq. (4) is robust to outliers and background noise that may affect the feature space. The rank minimization problem is NP-hard. We therefore relax the rank term using the nuclear norm $\|\cdot\|_*$ [Candès and Recht, 2009], and reformulate Eq. (4) as follows:

$$\min_{\{U_v\}, V, W, S} \sum_{v=1}^{n_v} \|O^v \odot (X^v - V U_v^T)\|_F^2 + \alpha \|M \odot (VWS - Y)\|_F^2 + \beta \|S\|_* \tag{5}$$

Eq. (5) considers cross-view relationships and the local (low-rank) label structure. In addition, it absorbs label information to induce the shared subspace and enhance its discriminative power. Another advantage of our model is that it jointly learns a shared subspace from incomplete views with weak labels, the local label structure, and the predictor in this subspace. This unified model reinforces their reciprocal effects and thus further improves the performance.
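To make the relaxation concrete, the following sketch compares the rank and the nuclear norm of a small, locally structured correlation matrix. The matrix `S_local` (two independent groups of fully correlated labels) is a hypothetical example chosen for illustration.

```python
import numpy as np

# Hypothetical correlation matrix: labels {0, 1} and labels {2, 3, 4} form two
# groups that are correlated internally and independent of each other.
S_local = np.zeros((5, 5))
S_local[:2, :2] = 1.0
S_local[2:, 2:] = 1.0

singular_values = np.linalg.svd(S_local, compute_uv=False)
rank_S = np.linalg.matrix_rank(S_local)            # number of nonzero singular values
nuclear_S = np.linalg.norm(S_local, ord="nuc")     # sum of singular values

print(np.round(singular_values, 6), rank_S, nuclear_S)
```

Each label group contributes one nonzero singular value, so the rank equals the number of groups, while the nuclear norm sums the same singular values and is convex in $S$, which is what makes it a tractable surrogate for the rank term.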

4 Optimization

The minimization problem in Eq. (5) is defined with respect to $\{U_v\}_{v=1}^{n_v}$, $V$, $W$ and $S$. Since a closed-form solution cannot be computed, we develop an alternating optimization method that updates one variable at a time while keeping the others fixed.

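(I). Keep $\{U_v\}$, $V$ and $S$ fixed, update $W$

When $\{U_v\}$, $V$ and $S$ are fixed, taking the derivative of Eq. (5) w.r.t. $W$ gives (the common factor $\alpha$ is omitted, since it does not change the stationary point)

$$\frac{\partial \mathcal{J}_1(W)}{\partial W} = 2V^T \big(M \odot (VWS)\big) S^T - 2V^T (M \odot Y) S^T \tag{6}$$

We can derive the following fixed-point updating rule for $W$, where the product $\odot$ and the division are element-wise:

$$W = W \odot \frac{V^T (M \odot Y) S^T}{V^T \big(M \odot (VWS)\big) S^T} \tag{7}$$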
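The fixed-point rule in Eq. (7) can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' implementation: the toy data use a nonnegative 0/1 label encoding and nonnegative factors so that the multiplicative rule keeps $W$ nonnegative, and the small constant `eps` guarding the division is an addition of this sketch.

```python
import numpy as np

def masked_label_loss(V, W, S, Y, M):
    """Squared error of the predictions V W S on the observed label entries."""
    return np.sum((M * (V @ W @ S - Y)) ** 2)

def update_W(V, W, S, Y, M, eps=1e-12):
    """One multiplicative update of W following Eq. (7)."""
    numer = V.T @ (M * Y) @ S.T
    denom = V.T @ (M * (V @ W @ S)) @ S.T
    return W * numer / (denom + eps)   # eps avoids division by zero (assumption)

rng = np.random.default_rng(0)
n, k, c = 30, 4, 5                                   # toy sizes (assumption)
V = rng.random((n, k))
W = rng.random((k, c))
S = np.eye(c) + 0.1 * rng.random((c, c))             # nonnegative correlation matrix
Y = (rng.random((n, c)) < 0.3).astype(float)         # 0/1 labels (assumption)
M = (rng.random((n, c)) < 0.7).astype(float)         # 1 = observed label entry

loss_before = masked_label_loss(V, W, S, Y, M)
W_new = update_W(V, W, S, Y, M)
loss_after = masked_label_loss(V, W_new, S, Y, M)
print(loss_before, loss_after)
```

Because every factor in the ratio is nonnegative here, the update rescales each entry of $W$ without changing its sign, and the masked label loss does not increase.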

(II). Keep $\{U_v\}$, $V$ and $W$ fixed, update $S$

When $\{U_v\}$, $V$ and $W$ are fixed, optimizing Eq. (5) with respect to $S$ is equivalent to minimizing

$$\mathcal{J}_2(S) = \alpha \|M \odot (VWS - Y)\|_F^2 + \beta \|S\|_* \tag{8}$$

Eq. (8) can be viewed as a matrix completion problem [Candès and Recht, 2009], and many algorithms have been proposed to solve this problem in recent decades. Here we adopt an efficient speedup algorithm, Maxide [Xu et al., 2013], to solve it. Maxide only needs to estimate a $c \times c$ matrix.
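The paper solves Eq. (8) with Maxide, which is not reproduced here. As a stand-in that shows the structure of the subproblem, the sketch below applies plain proximal gradient descent with singular value thresholding, a standard technique for nuclear-norm-regularized least squares. All function names, the step-size choice, and the toy data are assumptions of this sketch.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def j2(V, W, S, Y, M, alpha, beta):
    """Objective of Eq. (8)."""
    fit = np.sum((M * (V @ W @ S - Y)) ** 2)
    return alpha * fit + beta * np.linalg.norm(S, ord="nuc")

def prox_grad_step_S(V, W, S, Y, M, alpha, beta):
    """One proximal gradient step on S (a substitute for Maxide)."""
    P = V @ W
    # Upper bound on the Lipschitz constant of the gradient of the smooth term.
    lip = max(2.0 * alpha * np.linalg.norm(P, 2) ** 2, 1e-12)
    grad = 2.0 * alpha * P.T @ (M * (P @ S - Y))
    return svt(S - grad / lip, beta / lip)

rng = np.random.default_rng(0)
n, k, c, alpha, beta = 40, 4, 6, 1.0, 0.1            # toy settings (assumption)
V, W = rng.random((n, k)), rng.random((k, c))
Y = (rng.random((n, c)) < 0.3).astype(float)
M = (rng.random((n, c)) < 0.7).astype(float)

S = np.eye(c)
history = [j2(V, W, S, Y, M, alpha, beta)]
for _ in range(50):
    S = prox_grad_step_S(V, W, S, Y, M, alpha, beta)
    history.append(j2(V, W, S, Y, M, alpha, beta))
print(history[0], history[-1])
```

With a step size of one over the Lipschitz bound, each step cannot increase the objective; a dedicated solver such as Maxide would be expected to reach a comparable solution in fewer iterations.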

(III). Keep $\{U_v\}$, $W$ and $S$ fixed, update $V$

When keeping $\{U_v\}$, $W$ and $S$ fixed, we obtain the following equation for $V$ by taking the derivative of Eq. (5) w.r.t. $V$:

$$\frac{\partial \mathcal{J}_3(V)}{\partial V} = 2\sum_{v=1}^{n_v} \big(O^v \odot (V U_v^T)\big) U_v + 2\alpha \big(M \odot (VWS)\big) S^T W^T - 2\sum_{v=1}^{n_v} (O^v \odot X^v) U_v - 2\alpha (M \odot Y) S^T W^T$$

$$\text{s.t. } U_v \ge 0,\ V \ge 0,\ v = 1, 2, \ldots, n_v \tag{9}$$

Using the Karush-Kuhn-Tucker (KKT) condition [Boyd and Vandenberghe, 2004], we can derive the following updating rule:

$$V_{i,j} \leftarrow V_{i,j} \frac{\Big(\sum_{v=1}^{n_v} (O^v \odot X^v) U_v + \alpha (M \odot Y) S^T W^T\Big)_{i,j}}{\Big(\sum_{v=1}^{n_v} \big(O^v \odot (V U_v^T)\big) U_v + \alpha \big(M \odot (VWS)\big) S^T W^T\Big)_{i,j}} \tag{10}$$
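A sketch of the update in Eq. (10) follows, again under the assumption that all factors and the label matrix are nonnegative. The helper names, the `eps` guard in the denominator, and the toy data (two views with randomly missing samples) are illustrative choices, not part of the paper.

```python
import numpy as np

def objective_wrt_V(X_views, O_views, U_views, V, W, S, Y, M, alpha):
    """Terms of Eq. (5) that depend on V."""
    recon = sum(np.sum((O * (X - V @ U.T)) ** 2)
                for X, O, U in zip(X_views, O_views, U_views))
    return recon + alpha * np.sum((M * (V @ W @ S - Y)) ** 2)

def update_V(X_views, O_views, U_views, V, W, S, Y, M, alpha, eps=1e-12):
    """One multiplicative update of V following Eq. (10)."""
    numer = alpha * ((M * Y) @ S.T @ W.T)
    denom = alpha * ((M * (V @ W @ S)) @ S.T @ W.T)
    for X, O, U in zip(X_views, O_views, U_views):
        numer = numer + (O * X) @ U
        denom = denom + (O * (V @ U.T)) @ U
    return V * numer / (denom + eps)   # eps avoids division by zero (assumption)

rng = np.random.default_rng(1)
n, k, c, dims, alpha = 20, 3, 4, [6, 5], 0.01        # toy settings (assumption)
X_views = [rng.random((n, d)) for d in dims]
# A whole row of O is zero when the sample is missing from that view.
O_views = [np.repeat(rng.random((n, 1)) < 0.7, d, axis=1).astype(float) for d in dims]
U_views = [rng.random((d, k)) for d in dims]
V, W, S = rng.random((n, k)), rng.random((k, c)), np.eye(c)
Y = (rng.random((n, c)) < 0.3).astype(float)
M = (rng.random((n, c)) < 0.7).astype(float)

values = [objective_wrt_V(X_views, O_views, U_views, V, W, S, Y, M, alpha)]
for _ in range(20):
    V = update_V(X_views, O_views, U_views, V, W, S, Y, M, alpha)
    values.append(objective_wrt_V(X_views, O_views, U_views, V, W, S, Y, M, alpha))
print(values[0], values[-1])
```

The numerator collects the negative part of the gradient in Eq. (9) and the denominator the positive part, so entries of $V$ grow where the data terms dominate and shrink otherwise.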

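(IV). Keep $V$, $W$ and $S$ fixed, update $U_v$

With $V$, $W$ and $S$ fixed, the computation of $U_v$ is independent of $U_{v'}$, $v' \neq v$. Thus, for each view $v$, we obtain the following equation for $U_v$ by taking the derivative of Eq. (5) w.r.t. $U_v$:

$$\frac{\partial \mathcal{J}_4(U_v)}{\partial U_v} = 2\big(O^v \odot (V U_v^T)\big)^T V - 2(O^v \odot X^v)^T V \tag{11}$$

Using the KKT condition, we can derive the following updating rule:

$$(U_v)_{i,j} \leftarrow (U_v)_{i,j} \frac{\big((O^v \odot X^v)^T V\big)_{i,j}}{\Big(\big(O^v \odot (V U_v^T)\big)^T V\Big)_{i,j}} \tag{12}$$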
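The per-view rule in Eq. (12) can be sketched in the same way. The helper names, the `eps` guard, and the toy data are assumptions of this illustration; only one view is shown because the views are updated independently.

```python
import numpy as np

def view_recon_loss(X, O, U, V):
    """Masked reconstruction error of a single view."""
    return np.sum((O * (X - V @ U.T)) ** 2)

def update_U(X, O, U, V, eps=1e-12):
    """One multiplicative update of U_v following Eq. (12)."""
    numer = (O * X).T @ V
    denom = (O * (V @ U.T)).T @ V
    return U * numer / (denom + eps)   # eps avoids division by zero (assumption)

rng = np.random.default_rng(2)
n, d, k = 25, 8, 3                                    # toy sizes (assumption)
X = rng.random((n, d))
O = np.repeat(rng.random((n, 1)) < 0.7, d, axis=1).astype(float)
U, V = rng.random((d, k)), rng.random((n, k))

losses = [view_recon_loss(X, O, U, V)]
for _ in range(20):
    U = update_U(X, O, U, V)
    losses.append(view_recon_loss(X, O, U, V))
print(losses[0], losses[-1])
```

Samples that are missing from the view have all-zero rows in `O`, so they influence neither the numerator nor the denominator of the update.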

4.1 Complexity Analysis

The time complexity of iMVWL is dominated by matrix multiplication. In each iteration, the time complexities of solving $W$ and $S$ in Eq. (7) and Eq. (8) are $O(nck)$ and $O(rc \ln c \ln n)$, respectively, where $r$ is the rank of $S$; the time complexities of updating $V$ in Eq. (10) and $U_v$ in Eq. (12) are less than $O(n_v(nkd_{max} + 2nk^2 + nck))$ and $O(n_v(nkd_{max} + d_{max}k^2))$, respectively, where $d_{max}$ represents the largest dimensionality of the views. Since $n \gg k$ and $n \gg c$, the overall time complexity of iMVWL is $O(t n_v n k d_{max})$, where $t$ is the number of iterations needed to reach convergence. In practice, $t$ does not exceed 60. In our study, some of the views have sparse feature matrices; as such, the actual time cost of the above operations can be further reduced.

5 Experiments

5.1 Experimental Setup

The five multi-view datasets used in the experiments (Corel5k, Pascal07, ESPGame, IAPRTC-12, and Mirflicker) are summarized in Table 1. These datasets [1] are obtained from [Guillaumin et al., 2010], and each is represented by six feature views: HUE, SIFT, GIST, HSV, RGB, and LAB. For each dataset, we randomly sample 70% of the data for training, and use the remaining 30% for testing (unlabeled data). Moreover, to create weak-label scenarios, we follow the protocol given in [Xu et al., 2013]: for each label, we remove its assignment for ω% randomly sampled positive and negative training samples (the label becomes missing for these samples); to create incomplete-view scenarios, we randomly remove ε% of the samples from each view, while ensuring that each sample appears in at least one view. For each dataset, $d_{min}$ represents the minimum dimensionality of the different views.

| Dataset | n | n_v | c | #avg |
|---|---|---|---|---|
| Corel5k | 4999 | 6 | 260 | 3.396 |
| Pascal07 | 9963 | 6 | 20 | 1.465 |
| ESPGame | 20770 | 6 | 268 | 4.686 |
| IAPRTC-12 | 19627 | 6 | 291 | 5.719 |
| Mirflicker | 25000 | 6 | 38 | 4.716 |

Table 1: Statistics of the five multi-view datasets: n is the number of samples; n_v is the number of views; c is the number of distinct labels; and #avg is the average number of labels per sample.

[1] Available at http://lear.inrialpes.fr/people/guillaumin/data.php

Methods: We compare iMVWL against four state-of-the-art methods: LabelMe [Zhang et al., 2013], MVL-IV [Xu et al., 2015a], lrMMC [Liu et al., 2015], and iMSF [Yuan et al., 2012]. The first two methods have been introduced in the Related Work section. lrMMC is a matrix-completion-based multi-view learning method, but it assumes complete views of each training sample and does not explicitly consider missing labels and label correlations. iMSF was initially proposed for single-label classification with multiple incomplete sources; we extend it to multi-label classification by training multiple classifiers (one for each label). These methods cannot directly handle incomplete multi-view weak-label settings. For the experimental comparisons, we adapt LabelMe and lrMMC by filling missing features with average values, and we set the missing labels of MVL-IV and iMSF as negative labels. In addition, we introduce iMVWL-Sp, iMVWL-X, and iMVWL-Nc to investigate the contribution of learning a discriminative shared subspace, separately handling multiple feature views, and capturing local label correlations, respectively. iMVWL-Sp excludes label information during the subspace learning process. iMVWL-X concatenates the multi-view features into a single vector. iMVWL-Nc excludes label correlations.

Five-fold cross-validation on the training set is used to select the optimal parameter values for each competitive method. Optimal parameters for the competitive methods are selected as suggested in the corresponding papers. For our method, we select the parameters α and β from $\{10^i \mid i = -5, \ldots, 0\}$. Experimental results show that iMVWL yields relatively stable performance with α around $10^{-2}$ and β around $10^{-2}$, and therefore we use these values. All the experiments are repeated ten times, and both the average and the standard deviation are reported. The source code of iMVWL is publicly available at http://mlda.swu.edu.cn/codes.php?name=iMVWL.
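The masking protocol from the experimental setup can be sketched as follows. The paper does not specify how the "at least one view" constraint is enforced or whether ω% is drawn over all training samples or per class, so the repair step and the uniform sampling over all samples below are assumptions of this sketch, as are the function names.

```python
import numpy as np

def make_view_presence(n, n_views, eps_frac, rng):
    """present[i, v] is True if sample i is observed in view v."""
    n_remove = int(round(eps_frac * n))
    present = np.ones((n, n_views), dtype=bool)
    for v in range(n_views):
        present[rng.choice(n, size=n_remove, replace=False), v] = False
    # Repair step (assumption): give one random view back to any sample that
    # was removed from every view.
    for i in np.flatnonzero(~present.any(axis=1)):
        present[i, rng.integers(n_views)] = True
    return present

def indicator_matrix(present_column, d_v):
    """Expand a per-sample presence vector into the n x d_v indicator O^v."""
    return np.repeat(present_column[:, None], d_v, axis=1).astype(float)

def make_label_mask(n, c, omega_frac, rng):
    """M[i, j] = 0 if the assignment of label j is hidden for sample i."""
    n_remove = int(round(omega_frac * n))
    M = np.ones((n, c))
    for j in range(c):
        M[rng.choice(n, size=n_remove, replace=False), j] = 0.0
    return M

rng = np.random.default_rng(0)
n, n_views, c = 200, 6, 20                           # toy sizes (assumption)
present = make_view_presence(n, n_views, eps_frac=0.5, rng=rng)
O_first = indicator_matrix(present[:, 0], d_v=10)
M = make_label_mask(n, c, omega_frac=0.5, rng=rng)
print(present.mean(axis=0), M.mean())
```

Note that the repair step slightly lowers the effective removal ratio of a view whenever a sample would otherwise disappear from all views.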
Evaluation: Four widely used multi-label evaluation metrics are adopted for performance comparisons: Ranking Loss (RL), Average Precision (AP), Hamming Loss (HL), and adapted AUC. Formal definitions of the first three metrics can be found in [Zhang and Zhou, 2014]; the adapted AUC is suggested in [Bucak et al., 2011]. To maintain consistency with the other evaluation metrics, we report 1-RL instead of RL in our experiments (and, likewise, 1-HL in the result tables). Thus, as for the other metrics, the higher the value, the better the performance.
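As an illustration of the 1-RL score, the sketch below computes the ranking loss as the average fraction of (relevant, irrelevant) label pairs that are ordered incorrectly, counting ties as errors, which is the usual definition. Skipping samples that have no relevant or no irrelevant label is an assumption of this sketch, and the toy scores are made up for the example.

```python
import numpy as np

def ranking_loss(Y_true, scores):
    """Average fraction of mis-ordered (relevant, irrelevant) label pairs."""
    per_sample = []
    for y, f in zip(Y_true, scores):
        relevant, irrelevant = f[y == 1], f[y != 1]
        if relevant.size == 0 or irrelevant.size == 0:
            continue  # the loss is undefined for this sample; skipped (assumption)
        # A pair is mis-ordered when the relevant label does not score strictly higher.
        per_sample.append(np.mean(relevant[:, None] <= irrelevant[None, :]))
    return float(np.mean(per_sample))

Y_true = np.array([[1, 0, 0],
                   [0, 1, 1]])
scores = np.array([[0.9, 0.1, 0.2],    # relevant label ranked first: no error
                   [0.5, 0.4, 0.8]])   # one of the two pairs is mis-ordered
rl = ranking_loss(Y_true, scores)
print(rl, 1.0 - rl)  # 0.25 0.75
```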

5.2 Results On All Datasets

Tables 2 and 3 give the results of all methods on the five datasets across the four evaluation metrics. In the tables, significance markers indicate whether iMVWL is statistically superior/inferior to the corresponding method (using a pairwise t-test at the 95% significance level). It can be seen that iMVWL outperforms the other methods in most cases. MVL-IV, iMSF, and iMVWL are all designed


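| Dataset | Metric | lrMMC | MVL-IV | LabelMe | iMSF | iMVWL |
|---|---|---|---|---|---|---|
| Corel5k | 1-HL | 0.954 ± 0.000 | 0.954 ± 0.000 | 0.946 ± 0.000 | 0.943 ± 0.000 | 0.956 ± 0.000 |
| Corel5k | 1-RL | 0.762 ± 0.002 | 0.756 ± 0.001 | 0.638 ± 0.003 | 0.709 ± 0.005 | 0.822 ± 0.001 |
| Corel5k | AP | 0.240 ± 0.002 | 0.240 ± 0.001 | 0.204 ± 0.002 | 0.189 ± 0.002 | 0.313 ± 0.002 |
| Corel5k | AUC | 0.763 ± 0.002 | 0.762 ± 0.001 | 0.715 ± 0.001 | 0.663 ± 0.005 | 0.824 ± 0.001 |
| Pascal07 | 1-HL | 0.882 ± 0.000 | 0.883 ± 0.000 | 0.837 ± 0.000 | 0.836 ± 0.000 | 0.886 ± 0.000 |
| Pascal07 | 1-RL | 0.698 ± 0.003 | 0.702 ± 0.001 | 0.643 ± 0.004 | 0.568 ± 0.000 | 0.749 ± 0.002 |
| Pascal07 | AP | 0.425 ± 0.003 | 0.433 ± 0.002 | 0.358 ± 0.003 | 0.325 ± 0.000 | 0.455 ± 0.001 |
| Pascal07 | AUC | 0.728 ± 0.002 | 0.730 ± 0.001 | 0.686 ± 0.005 | 0.620 ± 0.001 | 0.784 ± 0.001 |
| ESPGame | 1-HL | 0.970 ± 0.000 | 0.970 ± 0.000 | 0.967 ± 0.000 | 0.964 ± 0.000 | 0.971 ± 0.000 |
| ESPGame | 1-RL | 0.777 ± 0.001 | 0.778 ± 0.000 | 0.683 ± 0.002 | 0.722 ± 0.002 | 0.803 ± 0.001 |
| ESPGame | AP | 0.188 ± 0.000 | 0.189 ± 0.000 | 0.132 ± 0.000 | 0.108 ± 0.000 | 0.236 ± 0.001 |
| ESPGame | AUC | 0.783 ± 0.001 | 0.784 ± 0.000 | 0.734 ± 0.001 | 0.674 ± 0.003 | 0.808 ± 0.001 |
| IAPRTC-12 | 1-HL | 0.967 ± 0.000 | 0.967 ± 0.000 | 0.963 ± 0.000 | 0.960 ± 0.000 | 0.969 ± 0.000 |
| IAPRTC-12 | 1-RL | 0.801 ± 0.000 | 0.799 ± 0.001 | 0.725 ± 0.001 | 0.631 ± 0.000 | 0.830 ± 0.001 |
| IAPRTC-12 | AP | 0.197 ± 0.000 | 0.198 ± 0.000 | 0.141 ± 0.000 | 0.101 ± 0.000 | 0.234 ± 0.002 |
| IAPRTC-12 | AUC | 0.805 ± 0.000 | 0.804 ± 0.001 | 0.746 ± 0.001 | 0.665 ± 0.001 | 0.832 ± 0.001 |
| Mirflicker | 1-HL | 0.839 ± 0.000 | 0.839 ± 0.000 | 0.778 ± 0.000 | 0.775 ± 0.000 | 0.844 ± 0.001 |
| Mirflicker | 1-RL | 0.802 ± 0.001 | 0.808 ± 0.001 | 0.771 ± 0.001 | 0.641 ± 0.001 | 0.817 ± 0.001 |
| Mirflicker | AP | 0.441 ± 0.001 | 0.449 ± 0.001 | 0.375 ± 0.000 | 0.323 ± 0.000 | 0.497 ± 0.003 |
| Mirflicker | AUC | 0.806 ± 0.001 | 0.807 ± 0.000 | 0.761 ± 0.000 | 0.715 ± 0.001 | 0.816 ± 0.001 |

Table 2: Results (mean ± standard deviation) on all datasets with ω% = 50%, ε% = 50%, and k = 0.5dmin. The dataset labels of the row groups follow the dataset order of Table 1.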

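| Dataset | Metric | iMVWL-Sp | iMVWL-Nc | iMVWL-X | iMVWL |
|---|---|---|---|---|---|
| Corel5k | 1-HL | 0.955 ± 0.000 | 0.955 ± 0.000 | 0.955 ± 0.000 | 0.956 ± 0.000 |
| Corel5k | 1-RL | 0.790 ± 0.001 | 0.798 ± 0.002 | 0.808 ± 0.000 | 0.822 ± 0.001 |
| Corel5k | AP | 0.285 ± 0.003 | 0.272 ± 0.003 | 0.299 ± 0.000 | 0.313 ± 0.002 |
| Corel5k | AUC | 0.791 ± 0.001 | 0.798 ± 0.002 | 0.811 ± 0.001 | 0.824 ± 0.001 |
| Pascal07 | 1-HL | 0.883 ± 0.000 | 0.884 ± 0.000 | 0.884 ± 0.000 | 0.886 ± 0.000 |
| Pascal07 | 1-RL | 0.721 ± 0.001 | 0.728 ± 0.002 | 0.726 ± 0.003 | 0.749 ± 0.002 |
| Pascal07 | AP | 0.436 ± 0.001 | 0.440 ± 0.002 | 0.446 ± 0.001 | 0.455 ± 0.001 |
| Pascal07 | AUC | 0.750 ± 0.001 | 0.745 ± 0.003 | 0.759 ± 0.001 | 0.784 ± 0.001 |
| ESPGame | 1-HL | 0.971 ± 0.000 | 0.970 ± 0.000 | 0.970 ± 0.000 | 0.971 ± 0.000 |
| ESPGame | 1-RL | 0.790 ± 0.001 | 0.780 ± 0.001 | 0.787 ± 0.001 | 0.803 ± 0.001 |
| ESPGame | AP | 0.213 ± 0.002 | 0.199 ± 0.001 | 0.198 ± 0.001 | 0.236 ± 0.001 |
| ESPGame | AUC | 0.795 ± 0.001 | 0.785 ± 0.001 | 0.791 ± 0.000 | 0.808 ± 0.001 |
| IAPRTC-12 | 1-HL | 0.968 ± 0.000 | 0.968 ± 0.000 | 0.967 ± 0.000 | 0.969 ± 0.000 |
| IAPRTC-12 | 1-RL | 0.810 ± 0.001 | 0.804 ± 0.002 | 0.797 ± 0.003 | 0.830 ± 0.001 |
| IAPRTC-12 | AP | 0.213 ± 0.001 | 0.206 ± 0.001 | 0.202 ± 0.003 | 0.234 ± 0.002 |
| IAPRTC-12 | AUC | 0.813 ± 0.001 | 0.804 ± 0.002 | 0.800 ± 0.002 | 0.832 ± 0.001 |
| Mirflicker | 1-HL | 0.843 ± 0.000 | 0.839 ± 0.000 | 0.841 ± 0.001 | 0.844 ± 0.001 |
| Mirflicker | 1-RL | 0.817 ± 0.001 | 0.813 ± 0.001 | 0.806 ± 0.002 | 0.817 ± 0.001 |
| Mirflicker | AP | 0.486 ± 0.002 | 0.486 ± 0.002 | 0.480 ± 0.004 | 0.497 ± 0.003 |
| Mirflicker | AUC | 0.816 ± 0.001 | 0.807 ± 0.001 | 0.805 ± 0.003 | 0.816 ± 0.001 |

Table 3: Results (mean ± standard deviation) of variants of iMVWL on all datasets with ω% = 50%, ε% = 50%, and k = 0.5dmin. The dataset labels of the row groups follow the dataset order of Table 1.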

for incomplete multi-view data, but iMVWL almost always outperforms the other two across the four evaluation metrics. The main reason is that MVL-IV and iMSF assume that the available labels are complete and ignore the widely witnessed weak-label scenarios. Both lrMMC and LabelMe are multi-view learning methods based on subspace learning, and they can handle weak labels. However, LabelMe is outperformed by lrMMC across the five datasets. A possible reason is that lrMMC treats the multi-view weak-label learning task as a matrix completion problem, which is more robust to missing values. Nevertheless, lrMMC is outperformed by iMVWL in many cases. This is mainly because lrMMC assumes the completeness of multiple views. As discussed in the Introduction, this assumption is often violated in practice.

In Table 3, iMVWL-Sp is a degenerate case of iMVWL, obtained by excluding label information during subspace learning, thereby isolating the subspace learning process from the subsequent classification task. iMVWL almost always performs better than iMVWL-Sp on these datasets. This is mainly because the subspace learned by iMVWL-Sp may lack the ability to discriminate between different labels. In addition, when the objectives are treated separately, the subspace that is optimal for reconstruction may not be optimal for the subsequent prediction. These results corroborate our motivation to jointly optimize the two objectives. iMVWL-Nc is obtained from iMVWL by excluding label correlations, and is almost always outperformed by iMVWL. This fact demonstrates the effectiveness of the proposed method in capturing local label correlations. iMVWL-X is another variant of iMVWL; it concatenates all feature view vectors into a single vector, and follows the same process as iMVWL for prediction. It is outperformed by iMVWL in almost every case. These results justify the rationale of handling multiple feature views separately. An interesting observation is that iMVWL-Sp performs better than (or comparably to) the other methods in most cases. This is mainly because iMVWL-Sp addresses both the incomplete multi-view and the weak-label problems, while the other methods address only one of the two. The performance margin achieved by iMVWL and iMVWL-Sp further justifies our motivation to jointly handle incomplete multi-view data and weak labels.

5.3 Handling Weak-Labels

We conducted additional experiments on Corel5k to investigate the performance of iMVWL and the other methods when handling missing labels. We set the dimensionality of the shared subspace to 20%, 50%, and 80% of dmin, and vary ω% from 0% to 50% with a step size of 10%. Since iMSF is not a subspace learning method, its performance is the same for all dimensionalities. Since the results on all evaluation metrics are similar, we report only the AUC results in Figure 1 due to space limitations. We can see that the performance of all the methods decreases as ω% increases, and that iMVWL outperforms the competing methods in all the settings. Moreover, regardless of the dimensionality of the learned subspace, iMVWL performs consistently better than the other methods under different ratios of missing labels.

Figure 1: AUC values of the compared methods (LabelMe, iMSF, lrMMC, MVL-IV, and iMVWL) on the Corel5k dataset with different missing-label proportions ω%. The dimensionality (k) of the shared subspace is set to 0.2dmin (Left), 0.5dmin (Middle), and 0.8dmin (Right).

5.4 Handling Incomplete Multi-View Data

We also performed experiments to investigate the impact of different percentages (ε%) of incomplete views on the performance of the various methods. Similarly to the previous experimental protocol, we set the dimensionality of the learned subspace to 20%, 50%, and 80% of dmin, and then increase the percentage of incomplete views ε% from 0% to 50% with a step size of 10%. Due to space limitations, we report only the results for ε% equal to 0%, 30%, and 50% in Figure 2. The performance trends for ε% equal to 10%, 20%, and 40% are similar to those reported in Figure 2. It can be seen that the performance of all the methods decreases as ε% increases, and that iMVWL gives the best performance in all the cases. In addition, as the dimensionality (k) of the shared subspace increases, the performance of all the methods shows an increasing trend, but iMVWL still performs consistently better than the competitors under the different percentages of incomplete views.

| Dataset | lrMMC | MVL-IV | LabelMe | iMSF | iMVWL |
|---|---|---|---|---|---|
| Corel5k | 50.94 | 156.77 | 424.6 | 5531.75 | 52.47 |
| Pascal07 | 184.04 | 299.06 | 928.69 | 2019.98 | 44.60 |
| ESPGame | 410.64 | 4449.91 | 2314.93 | 6887.37 | 1107.85 |
| IAPRTC-12 | 405.07 | 6965.33 | 1900.96 | 49540.29 | 1268.97 |
| Mirflicker | 341.03 | 38404.08 | 3098.72 | 729.84 | 111.24 |
| Total | 1392.89 | 50281.76 | 8668.69 | 64778.68 | 2585.13 |

Table 4: Runtime comparison (in seconds).

5.5 Parameter Analysis

In this section, we test the sensitivity of iMVWL w.r.t. α and β. The tested range for α and β is $\{10^i \mid i = -5, \ldots, 0\}$. For brevity, we only report the 1-RL and AUC results on Corel5k in Figure 3; similar results were obtained for the other datasets. From the figure, we can see that iMVWL achieves relatively stable and good performance when $\alpha \approx 10^{-2}$ and $\beta \approx 10^{-2}$. We also observe that when $\alpha = 10^{-5}$ or $\beta = 10^{-5}$, iMVWL has reduced 1-RL or AUC. This result confirms the contribution of weak-label information and local label correlations to the performance of iMVWL. When α or β is close to one, the AUC and 1-RL decrease sharply. This is because large values of α (or β) overweight the effect of weak-label information in subspace learning (or of local label correlations), while underweighting the shared subspace V, which encodes cross-view relationships.

5.6 Runtime And Convergence Analysis

We also study the runtime cost of the competing methods on the five datasets, and report the costs in Table 4. The experiments are conducted on CentOS 6.9 with an Intel(R) Xeon E5-2678, 64GB RAM, and MATLAB 2013a. We can see that iMVWL runs much faster than the other compared methods in

Figure 2: Results of the compared methods (LabelMe, iMSF, lrMMC, MVL-IV, and iMVWL) on the Corel5k dataset with different incomplete-view percentages ε% (0%, 30%, and 50%). The dimensionality (k) of the shared subspace is set to 0.2dmin (Left), 0.5dmin (Middle), and 0.8dmin (Right).

4 2 0 4 2 0 0.76

4 2 0 4 2 0 0.76

Figure 3: Parameter analysis w.r.t. α and β on Core15k.

Objective Function Value

Objective Function Value

Figure 4: Convergence trend analysis.

most cases. The only exception is lr MMV on ESPGame and IAPRTC-12 datasets. This is mainly because lr MMC needs to estimate a target matrix of size n k just once, while i MVWL has to estimate the label correlation matrix of size c c in each iteration. As a result, when c is large, i MVWL costs more. These results corroborate the efﬁciency of the proposed method. The convergence trends on the other datasets are similar. Figure 4 shows the convergence curve of i MVWL on Core15k and Pascal07 datasets. As we can see, on both datasets, i MVWL tends to converge after 60 iterations. The convergence trends on the other datasets are similar.
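The convergence behaviour described here (an objective value that decreases over iterations and then flattens) can be illustrated with a generic alternating-minimization loop. The sketch below is not the iMVWL update rule: the stopping criterion (relative change of the objective below `tol`), the function names, and the scalar toy objective f(u, v) = (uv - 6)^2 + λ(u^2 + v^2) are all assumptions chosen only to show how such a curve is typically produced and monitored.

```python
def alternate_until_converged(step, state, tol=1e-6, max_iter=200):
    """Call `step` repeatedly until the relative objective change is below tol.

    `step` takes the current state and returns (new_state, objective_value).
    Returns the final state and the list of objective values per iteration.
    """
    history = []
    for _ in range(max_iter):
        state, obj = step(state)
        history.append(obj)
        if len(history) > 1:
            prev = history[-2]
            if abs(prev - obj) <= tol * max(abs(prev), 1e-12):
                break
    return state, history

LAM = 1.0  # regularization weight of the toy objective (assumed value)

def toy_step(state):
    """One sweep of exact block minimization on f(u, v) = (u*v - 6)^2 + LAM*(u^2 + v^2)."""
    u, v = state
    u = 6.0 * v / (v * v + LAM)  # minimizer over u with v fixed
    v = 6.0 * u / (u * u + LAM)  # minimizer over v with the new u fixed
    obj = (u * v - 6.0) ** 2 + LAM * (u * u + v * v)
    return (u, v), obj

(u, v), history = alternate_until_converged(toy_step, (1.0, 1.0))
```

Because each block update solves its subproblem exactly, the values in `history` never increase, which is the same qualitative shape as the curves in Figure 4.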

6 Conclusion

In this paper, we propose a novel model called iMVWL to learn from data with incomplete views and missing labels. iMVWL learns a discriminative shared subspace from incomplete views with weak labels. At the same time, it learns a robust weak-label classifier in the subspace, together with the local label structure. An alternating optimization procedure is developed to optimize this model; it not only avoids suboptimal solutions, but also reinforces the reciprocal effects of the shared subspace and of the classifier, which further improves the performance. The experimental results show that iMVWL outperforms other competitive methods. How to further improve the efficiency of iMVWL is an interesting direction for future work.

Acknowledgments

This work is supported by NSFC (61741217, 61402378 and 61732019) and the Open Research Project of Hubei Key Laboratory of Intelligent Geo-Information Processing (KLIGIP2017A05).

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

References

[Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[Bucak et al., 2011] Serhat Selcuk Bucak, Rong Jin, and Anil K Jain. Multi-label learning with incomplete class assignments. In CVPR, pages 2801-2808, 2011.

[Cabral et al., 2015] Ricardo Cabral, Fernando De la Torre, Joao Paulo Costeira, and Alexandre Bernardino. Matrix completion for weakly-supervised multi-label image classification. TPAMI, 37(1):121-135, 2015.

[Candès and Recht, 2009] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[Datta et al., 2008] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2):5, 2008.

[Dong et al., 2018] Haochen Dong, Yufeng Li, and Zhihua Zhou. Learning from semi-supervised weak-label data. In AAAI, in press, 2018.

[Guillaumin et al., 2010] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Multimodal semi-supervised learning for image classification. In CVPR, pages 902-909, 2010.

[Huang and Zhou, 2012] Sheng-Jun Huang and Zhi-Hua Zhou. Multi-label learning by exploiting label correlations locally. In AAAI, pages 949-955, 2012.

[Kong et al., 2014] Xiangnan Kong, Zhaoming Wu, Lijia Li, Ruofei Zhang, Hang Wu, and Wei Fan. Large-scale multilabel learning with incomplete label assignments. In SDM, pages 920-928, 2014.

[Liu et al., 2015] Meng Liu, Yong Luo, Dacheng Tao, Chao Xu, and Yonggang Wen. Low-rank multi-view learning in matrix completion for multi-label image classification. In AAAI, pages 2778-2784, 2015.

[Luo et al., 2015] Yong Luo, Tongliang Liu, Dacheng Tao, and Chao Xu. Multiview matrix completion for multilabel image classification. TIP, 24(8):2355-2368, 2015.

[Nie et al., 2017] Feiping Nie, Guohao Cai, and Xuelong Li. Multi-view clustering and semi-supervised classification with adaptive neighbours. In AAAI, pages 2408-2414, 2017.

[Sun et al., 2010] Yuyin Sun, Yin Zhang, and Zhihua Zhou. Multi-label learning with weak label. In AAAI, pages 1862-1868, 2010.

[Wang and Zhang, 2013] Yuxiong Wang and Yujin Zhang. Nonnegative matrix factorization: a comprehensive review. TKDE, 25(6):1336-1353, 2013.

[Wang and Zhou, 2010] Wei Wang and Zhihua Zhou. Multiview active learning in the non-realizable case. In NIPS, pages 2388-2396, 2010.

[Wu et al., 2015] Bao-Yuan Wu, Siwei Lyu, and Bernard Ghanem. Ml-mg: Multi-label learning with missing labels using a mixed graph. In ICCV, pages 4157-4165, 2015.

[Xu et al., 2013] Miao Xu, Rong Jin, and Zhihua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In NIPS, pages 2301-2309, 2013.

[Xu et al., 2015a] Chang Xu, Dacheng Tao, and Chao Xu. Multi-view learning with incomplete views. TIP, 24(12):5812, 2015.

[Xu et al., 2015b] Linli Xu, Zhen Wang, Zefan Shen, Yubo Wang, and Enhong Chen. Learning low-rank label correlations for multi-label classification with missing labels. In ICDM, pages 1067-1072, 2015.

[Xu et al., 2016] Chang Xu, Dacheng Tao, and Chao Xu. Robust extreme multi-label learning. In KDD, pages 1275-1284, 2016.

[Xu et al., 2017] Jinglin Xu, Junwei Han, and Feiping Nie. Multi-view feature learning with discriminative regularization. In IJCAI, pages 3161-3167, 2017.

[Yang et al., 2013] Shujun Yang, Yuan Jiang, and Zhihua Zhou. Multi-instance multi-label learning with weak label. In IJCAI, pages 1862-1868, 2013.

[Yu et al., 2014] Hsiangfu Yu, Prateek Jain, Purushottam Kar, and Inderjit S Dhillon. Large-scale multi-label learning with missing labels. In ICML, pages 593-601, 2014.

[Yuan et al., 2012] Lei Yuan, Yalin Wang, Paul M Thompson, Vaibhav A Narayan, and Jieping Ye. Multi-source learning for joint analysis of incomplete multi-modality neuroimaging data. In KDD, pages 1149-1157, 2012.

[Zhang and Zhou, 2014] Minling Zhang and Zhihua Zhou. A review on multi-label learning algorithms. TKDE, 26(8):1819-1837, 2014.

[Zhang et al., 2013] Wei Zhang, Ke Zhang, Pan Gu, and Xiangyang Xue. Multi-view embedding learning for incompletely labeled data. In IJCAI, pages 1910-1916, 2013.

[Zhao and Guo, 2015] Feipeng Zhao and Yuhong Guo. Semisupervised multi-label learning with incomplete labels. In IJCAI, pages 4062-4068, 2015.

[Zhao et al., 2017] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: recent progress and new challenges. Information Fusion, 38:43-54, 2017.

[Žitnik and Zupan, 2015] Marinka Žitnik and Blaž Zupan. Data fusion by matrix factorization. TPAMI, 37(1):41-53, 2015.
