Multi-Label Co-Training

Yuying Xing1, Guoxian Yu1,*, Carlotta Domeniconi2, Jun Wang1 and Zili Zhang1,3
1College of Computer and Information Science, Southwest University, Chongqing 400715, China
2Department of Computer Science, George Mason University, Fairfax 22030, USA
3School of Information Technology, Deakin University, Geelong, VIC 3220, Australia
{yyxing4148, gxyu, kingjun, zhangzl}@swu.edu.cn, carlotta@cs.gmu.edu
*Guoxian Yu is the corresponding author.

Abstract

Multi-label learning aims at assigning a set of appropriate labels to multi-label samples. Although it has been successfully applied in various domains in recent years, most multi-label learning methods require sufficient labeled training samples, because of the large number of possible label sets. Co-training, as an important branch of semi-supervised learning, can leverage unlabeled samples, along with scarce labeled ones, and can potentially help with the large labeled-data requirement. However, combining multi-label learning with co-training is a difficult challenge, which raises two distinct issues: (i) how to solve the widely-witnessed class-imbalance problem in multi-label learning; and (ii) how to select samples with confidence, and communicate their predicted labels among classifiers for model refinement. To address these issues, we introduce an approach called Multi-Label Co-Training (MLCT). MLCT leverages information concerning the co-occurrence of pairwise labels to address the class-imbalance challenge; it introduces a predictive reliability measure to select samples, and applies label-wise filtering to confidently communicate labels of selected samples among co-training classifiers. MLCT performs favorably against related competitive multi-label learning methods on benchmark datasets, and it is also robust to its input parameters.

1 Introduction

In multi-label learning, each sample is associated with several related class labels [Zhang and Zhou, 2014; Gibaja and Ventura, 2015]. Let $X \in \mathbb{R}^{n \times d}$ be the data matrix storing $n$ $d$-dimensional samples, and $Y \in \mathbb{R}^{n \times q}$ be the $q$-dimensional label space for the samples. Given a training dataset $D = \{(x_i, y_i) \mid 1 \le i \le n\}$, the task of multi-label learning is to learn a predictive function $f(x) \in \mathbb{R}^q$ that maps the input feature space of the samples onto the label space. Most multi-label learning methods train the predictor using only labeled samples [Zhang and Zhou, 2007; Bucak et al., 2011; Huang and Zhou, 2012; Yu et al., 2014]. Given the exponential size of the power set of the labels, a very large number of labeled training samples is generally required. In practice, collecting sufficient labeled samples in this scenario is very expensive, and often impractical. On the other hand, with the rapid advancement of data collection and storage techniques, it has become feasible to collect a large number of unlabeled samples. Inspired by single-label semi-supervised learning [Zhu, 2008], efforts have been made to leverage both labeled and unlabeled samples for multi-label learning, with promising results [Qian and Davidson, 2010; Yu et al., 2012; Kong et al., 2013; Wu et al., 2015; Zhang and Yeung, 2009; Zhang et al., 2015b]. The nature of this work, though, is typically transductive; that is, the canonical approach is to construct a graph that captures the connections between labeled and unlabeled instances, and then to predict the labels of the unlabeled instances embodied in the graph.
As such, this approach cannot generalize to unseen samples. It is obviously desirable to empower the learner with an inductive ability that enables label prediction for new instances unseen during training.

A few attempts have been made to achieve inductive semi-supervised multi-label learning [Guo and Schuurmans, 2012; Wu and Zhang, 2013; Gönen, 2014; Tan et al., 2017; Zhan and Zhang, 2017]. In [Guo and Schuurmans, 2012], the authors utilized unlabeled and labeled instances to learn a subspace representation, while simultaneously training a supervised large-margin multi-label classifier on the labeled instances, which could be directly applied to unseen instances. In [Wu and Zhang, 2013], the authors took advantage of label correlations in labeled instances and of maximum-margin regularization over unlabeled instances to optimize a collection of linear predictors for inductive multi-label classification. In [Gönen, 2014], the author proposed a Bayesian semi-supervised multi-label learning (BSSML) approach that combines linear dimensionality reduction with linear binary classification under a low-density assumption. In [Tan et al., 2017], the authors introduced SMILE, which first estimates the missing labels of labeled samples and uses a graph to embody both labeled and unlabeled samples; it then trains a graph-regularized semi-supervised linear classifier to further recover the missing labels of labeled samples, and to directly predict labels of unseen new samples. In [Zhan and Zhang, 2017], the authors introduced an inductive semi-supervised multi-label learning approach based on co-training, called COINs. Specifically, to enable single-view co-training, COINs first optimizes two disjoint feature views from the whole feature space by maximizing the diversity between two classifiers independently trained on the two views [Chen et al., 2011], and then iteratively communicates the pairwise ranking predictions of either classifier on unlabeled instances for model refinement. COINs communicates only the single predicted most relevant label and the single predicted most irrelevant label between the two classifiers. However, multi-label instances are often simultaneously associated with several relevant and irrelevant labels, not just a single one. For this reason, communicating only the two most relevant and irrelevant labels may mislead the refinement process, and may not achieve a pronounced performance improvement. Furthermore, COINs only handles two views.

To fully accomplish inductive multi-label classification and to leverage labeled and unlabeled instances with multiple feature views, we advocate the integration of multi-label learning with the well-established co-training paradigm [Blum and Mitchell, 1998; Zhou and Li, 2005]. Co-training has a natural inductive classification ability. It mutually communicates the labels predicted with most confidence among classifiers, which are independently trained on the respective views of the data, thus augmenting the labeled training sets. The classifiers are then independently retrained on the respective augmented training sets; the communication and update iterate till convergence. Nevertheless, integrating multi-label learning with co-training is a difficult challenge, and two distinct issues should be addressed: (i) How to solve the widely-witnessed class-imbalance problem in multi-label learning.
For multi-label datasets, the number of samples relevant to a label is generally much smaller than the number of samples irrelevant to that label. Furthermore, the number of relevant samples for different labels can vary significantly [Zhang et al., 2015a; Sun and Lee, 2017]. The class-imbalance problem can be exaggerated when communicating labels among learners during the iterative process of co-training. (ii) How to select samples and communicate their predicted labels with confidence among multiple co-training classifiers. Unlike traditional co-training, the to-be-communicated samples can be associated with several labels, not just one.

To address the above two issues in multi-label co-training, we propose a co-training based multi-label classification method called MLCT. MLCT first independently trains predictors on the different views and makes predictions on unlabeled samples. Then, it uses the co-occurrence information of labels to adjust the predicted likelihoods and to deal with the class-imbalance problem. Next, it summarizes the adjusted likelihoods across views and measures the predictive confidence of samples based on the summarized likelihoods. After this, it selects the samples with the highest confidence, applies label-wise filtering on the summarized likelihoods of the selected samples, and then communicates the filtered labels among learners during the iterative co-training process. MLCT repeats the above iterative process till convergence, and makes the final prediction on unseen samples by combining the predictions of the classifiers via a majority vote. An extensive comparative study shows that MLCT performs favorably against the recently proposed COINs [Zhan and Zhang, 2017] and other representative multi-label learning methods (including ML-KNN [Zhang and Zhou, 2007], MLRGL [Bucak et al., 2011], MLLOC [Huang and Zhou, 2012], BSSML [Gönen, 2014] and SMILE [Tan et al., 2017]).

2 The MLCT Approach

The original co-training approach was applied to samples with multiple feature views [Blum and Mitchell, 1998], under the assumption that each feature view provides sufficient and independent information to produce a classifier with good generalization capability. In this paper, we mainly focus on mining multi-label samples naturally represented by multiple views. MLCT can also work on feature views generated by a particular view-splitting technique [Chen et al., 2011; Du et al., 2011].

Let $\mathcal{X} = \{X^v\}_{v=1}^{m}$ be the $m$ view representations of $n$ samples, where each view $X^v \in \mathbb{R}^{n \times d_v}$. $x^v_j \in \mathbb{R}^{1 \times d_v}$ is the $d_v$-dimensional feature vector of the $j$-th sample in the $v$-th view, and $y_j \in \{-1, +1\}^q$ is the $q$-dimensional label vector of the $j$-th sample, where $y_{j,c} = +1$ ($-1$) indicates that the $c$-th ($1 \le c \le q$) label is relevant (irrelevant) for the sample. Without loss of generality, we assume that the first $l$ samples are labeled and the remaining $u = n - l$ ($l \ll u$) samples are unlabeled: $L = \{(x_j, y_j)\}_{j=1}^{l}$ and $U = \{x_j\}_{j=l+1}^{n}$. The goal of MLCT is to perform multi-label co-training on $\mathcal{X}$ and $\{y_j\}_{j=1}^{n}$, and to make accurate predictions on unseen new samples. To accomplish this goal, MLCT first uses label correlations to address the widely witnessed class-imbalance problem in multi-label learning, and to adjust the predicted label confidence values of samples. Next, it introduces a confidence measure to select samples, performs label-wise filtering on the predicted label confidence values of the selected samples, and communicates their labels among classifiers. The following subsections elaborate on these two steps.
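To make the notation concrete, the following toy snippet (in Python with NumPy; all names and sizes are illustrative and not from the paper) sets up data in the shape MLCT assumes: a list of per-view feature matrices, a label matrix over $\{-1, +1\}$, and a labeled/unlabeled split. The sketches later in this section reuse this layout.

```python
import numpy as np

rng = np.random.default_rng(0)

n, q = 100, 5                 # number of samples and of labels
d = [8, 12, 6]                # per-view feature dimensions, so m = 3 views
m = len(d)

# X[v] plays the role of X^v in R^{n x d_v}.
X = [rng.standard_normal((n, dv)) for dv in d]

# Y plays the role of {y_j}: an n x q matrix with entries in {-1, +1}.
Y = np.where(rng.random((n, q)) < 0.2, 1, -1)

# The first l samples are labeled (L); the remaining u = n - l are unlabeled (U).
l = 10
L_idx, U_idx = np.arange(l), np.arange(l, n)
```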
2.1 Addressing Class-imbalance via Label Correlation

In multi-label learning, labels have much fewer relevant samples than irrelevant ones, and the number of relevant samples varies significantly across labels [Zhang et al., 2015a]. This class-imbalance issue can become even more pronounced during the iterative label communication of co-training. Inspired by the class-imbalance solutions for multi-label learning proposed in [Zhang et al., 2015a; Sun and Lee, 2017], MLCT makes use of label correlation to address class-imbalance in multi-label co-training. Two labels $c_1$ and $c_2$ are considered positively correlated if they often co-occur as sample labels, and their correlation can be empirically estimated as follows:

$$C(c_2, c_1) = \frac{\sum_{j=1}^{|L|} [\![ y_{j,c_1} = 1 ]\!] \, [\![ y_{j,c_2} = 1 ]\!]}{\sum_{j=1}^{|L|} [\![ y_{j,c_1} = 1 ]\!]} \quad (1)$$

where $[\![ \cdot ]\!]$ equals 1 if and only if the condition holds, and 0 otherwise. $C(c_2, c_1)$ represents the probability that a sample is labeled with $c_2$, given that it is already labeled with $c_1$. Although $c_1$ and $c_2$ might co-exist for some samples, the number of relevant samples for $c_1$ might be far smaller than that for $c_2$, or vice versa. As such, we compute $C(c_2, c_1)$ and $C(c_1, c_2)$ separately ($C(c_2, c_1) \neq C(c_1, c_2)$ in general) to account for the imbalance phenomenon.

Suppose $f^v_j = [f^v_{j,1}, \ldots, f^v_{j,q}] \in \mathbb{R}^q$ holds the likelihoods of $x^v_j$ with respect to the $q$ labels in the $v$-th view, initially predicted by a base multi-label classifier (e.g., ML-KNN [Zhang and Zhou, 2007]) trained on the $l$ labeled samples. To address the class-imbalance problem, MLCT makes use of label correlation to adjust the predictive confidence values of the labels as follows:

$$p^v_{j,c} = \frac{1}{1 + e^{-(w^+_{j,c} - w^-_{j,c})}} \quad (2)$$

$p^v_{j,c}$ is the updated likelihood of the $c$-th label for $x^v_j$. It follows the form of a logistic function to enforce that the adjusted value lies within the range $(0, 1)$. $w^-_{j,c}$ and $w^+_{j,c}$ are computed as follows:

$$w^-_{j,c} = 1 - w^+_{j,c} \quad (3)$$

$$w^+_{j,c} = \frac{f^v_{j,c} + \sum_{c_2=1, c_2 \neq c}^{q} f^v_{j,c_2} \, C(c, c_2)}{\sum_{c_2=1}^{q} \delta(C(c, c_2)) - 1} \quad (4)$$

where $w^+_{j,c}$ ($w^-_{j,c}$) reflects the confidence that $c$ is a relevant (irrelevant) label for $x^v_j$. $\delta(a)$ is an indicator function: it equals 1 when $a > 0$, and 0 otherwise. $\sum_{c_2=1}^{q} \delta(C(c, c_2)) - 1$ counts the number of labels positively correlated with the $c$-th label ($c$ itself excluded). $\sum_{c_2=1, c_2 \neq c}^{q} f^v_{j,c_2} C(c, c_2)$ indicates how much information the other labels, which are correlated with the $c$-th label, contribute to the relevance between $c$ and $x^v_j$. It is evident that the larger the margin $w^+_{j,c} - w^-_{j,c}$, the more confident the prediction; the margin indirectly reflects the relevance of the $c$-th label to $x^v_j$. Eq. (2) thus tackles the class-imbalance problem via label co-occurrence: it considers not only the case in which $x^v_j$ contains the $c$-th label, but also the reverse case (i.e., the $c$-th label does not belong to the label set of $x^v_j$). With the correlations between labels as extra information, the impact of the class-imbalance issue can be effectively reduced.
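The adjustment in Eqs. (1)-(4) vectorizes in a few lines. Below is a minimal NumPy sketch (ours, not the authors' released code; function names are illustrative). The guards against labels with no relevant samples or no positively correlated labels are our own additions, since Eqs. (1) and (4) are undefined in those corner cases.

```python
import numpy as np

def label_correlation(Y_lab):
    """Eq. (1): returns C with C[c2, c1] = P(label c2 | label c1), estimated
    on the labeled set. Y_lab is an l x q matrix with entries in {-1, +1}."""
    pos = (Y_lab == 1).astype(float)                 # l x q relevance indicators
    co = pos.T @ pos                                 # co[c1, c2] = #samples with both labels
    n_c = pos.sum(axis=0)                            # #relevant samples per label
    return (co / np.maximum(n_c, 1.0)[:, None]).T    # guard: labels with no relevant samples

def adjust_likelihoods(F, C):
    """Eqs. (2)-(4): F is an n x q matrix of initial likelihoods f^v_{j,c};
    returns the adjusted likelihoods p^v_{j,c} in (0, 1)."""
    Cc = C.copy()                                    # Cc[c, c2] = C(c, c2)
    np.fill_diagonal(Cc, 0.0)                        # exclude c2 == c from the sums
    num = F + F @ Cc.T                               # numerator of Eq. (4)
    den = np.maximum((Cc > 0.0).sum(axis=1), 1.0)    # #labels correlated with c (guarded)
    w_pos = num / den                                # Eq. (4)
    w_neg = 1.0 - w_pos                              # Eq. (3)
    return 1.0 / (1.0 + np.exp(w_neg - w_pos))       # Eq. (2): logistic of the margin
```

Note that the denominator of Eq. (4) is realized here by zeroing the diagonal of the correlation matrix before counting, so the $c$-th label itself is never counted.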
2.2 Communicating Label Information

It is crucial to communicate samples and labels among classifiers with confidence during the iterative process of co-training; a good communication strategy helps to obtain a pronounced and stable performance [Du et al., 2011]. Traditional co-training methods directly select the samples with the most confident predictions for communication and model refinement [Blum and Mitchell, 1998; Zhou and Li, 2005; Levatić et al., 2017]. However, for co-training with multi-label samples, communicating labels with confidence is more challenging, since a sample may be annotated with several correlated labels. As an inductive multi-label co-training algorithm, COINs [Zhan and Zhang, 2017] deals with this challenge by communicating the single positive and single negative label of an unlabeled sample predicted with the largest confidence. As such, the refined model may be misled by the two communicated labels, which can degrade performance.

A sample should share the same relevant labels across its different views. To select the samples and labels to be propagated with confidence, MLCT first summarizes the prediction reliability based on the other $m - 1$ views (the $v$-th view excluded), and measures the prediction reliability of the $j$-th sample with respect to the $c$-th label as follows:

$$h^v_{j,c} = \frac{1}{m-1} \sum_{v'=1, v' \neq v}^{m} \left| p^{v'}_{j,c} - (1 - p^{v'}_{j,c}) \right| \quad (5)$$

where $|p^{v'}_{j,c} - (1 - p^{v'}_{j,c})|$ is the prediction reliability of the $c$-th label of $x^{v'}_j$. It is straightforward to see that a larger $h^v_{j,c}$ value indicates that the $m - 1$ classifiers are more in agreement on whether $c$ should, or should not, be a relevant label for the $j$-th sample. By extending the estimate in Eq. (5) to the $q$ labels, MLCT measures the overall prediction reliability of the $j$-th sample as follows:

$$r^v_j = \sum_{c=1}^{q} h^v_{j,c} \quad (6)$$

where larger $r^v_j$ values imply more consistent predictions across the classifiers, which makes the $j$-th sample a good candidate for communication. As such, MLCT selects the $u_b$ ($u_b \ll u$) samples with the largest $r^v_j$ values as the candidate sample set $B^v$ to be communicated for classifier refinement. Next, to identify the confident labels of the selected samples during co-training, MLCT defines two threshold values for each label on each view as follows:

$$\theta^v_+(c) = \frac{\sum_{x^v_j \in B^v} f^v_{j,c} \, [\![ f^v_{j,c} \ge 0.5 ]\!]}{\sum_{x^v_j \in B^v} [\![ f^v_{j,c} \ge 0.5 ]\!]}, \qquad \theta^v_-(c) = \frac{\sum_{x^v_j \in B^v} f^v_{j,c} \, [\![ f^v_{j,c} < 0.5 ]\!]}{\sum_{x^v_j \in B^v} [\![ f^v_{j,c} < 0.5 ]\!]} \quad (7)$$

$\theta^v_+(c)$ is the average predicted likelihood of the $c$-th label on the $v$-th view, estimated over the plausibly relevant samples; similarly, $\theta^v_-(c)$ is the average predicted likelihood of the $c$-th label on the $v$-th view, estimated over the plausibly irrelevant samples. Since the sample distribution of each label is different, the two threshold values are computed separately for each label. MLCT then uses these thresholds to convert $f^v_{j,c}$ into a binary label as follows:

$$b^v_{j,c} = \begin{cases} +1, & \text{if } f^v_{j,c} > \theta^v_+(c) \\ -1, & \text{if } f^v_{j,c} < \theta^v_-(c) \\ 0, & \text{otherwise} \end{cases} \quad (8)$$

MLCT then uses $\{b^v_{j,c}\}_{v=1}^{m}$, and the feature information of the respective samples in each view, to augment the labeled training set and to refine the classifier of the corresponding view. We do not apply sample-wise filtering here because different samples have different numbers of relevant labels, and an appropriate filter threshold is difficult to pursue [Quevedo et al., 2012].
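The reliability measure of Eqs. (5)-(6) and the label-wise filter of Eqs. (7)-(8) also translate directly into NumPy. The sketch below (again ours, not the authors' code) assumes the adjusted likelihoods of all $m \ge 2$ views are stacked into a single array; the fallback thresholds for labels with no plausibly relevant (or irrelevant) sample in $B^v$ are our own guard, since Eq. (7) is undefined there.

```python
import numpy as np

def reliability(P):
    """Eqs. (5)-(6). P is an m x n x q array of adjusted likelihoods p^v_{j,c}
    (m >= 2). Returns an m x n matrix r, where r[v, j] is the overall
    prediction reliability of sample j from the standpoint of view v."""
    m = P.shape[0]
    agree = np.abs(2.0 * P - 1.0)                     # |p - (1 - p)| per view/sample/label
    h = (agree.sum(axis=0)[None] - agree) / (m - 1)   # leave view v out: Eq. (5)
    return h.sum(axis=2)                              # sum over the q labels: Eq. (6)

def labelwise_filter(F_sel):
    """Eqs. (7)-(8). F_sel is a ub x q matrix of likelihoods f^v_{j,c} for the
    selected samples B^v. Returns b in {-1, 0, +1}; zeros mark labels withheld
    from communication."""
    hi = F_sel >= 0.5                                 # plausibly relevant side
    lo = ~hi                                          # plausibly irrelevant side
    cnt_hi, cnt_lo = hi.sum(axis=0), lo.sum(axis=0)
    # Per-label average likelihood on each side (Eq. 7); when a side is empty,
    # fall back to a threshold that communicates nothing for that label.
    theta_pos = np.where(cnt_hi > 0,
                         (F_sel * hi).sum(axis=0) / np.maximum(cnt_hi, 1), 1.0)
    theta_neg = np.where(cnt_lo > 0,
                         (F_sel * lo).sum(axis=0) / np.maximum(cnt_lo, 1), 0.0)
    b = np.zeros(F_sel.shape, dtype=int)              # Eq. (8)
    b[F_sel > theta_pos] = 1
    b[F_sel < theta_neg] = -1
    return b
```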
The pseudo-code of MLCT is summarized in Algorithm 1. MLCT first computes label correlations based on the available labels (step 2). Then, it randomly selects $u_a$ unlabeled samples and puts them into the buffer pool $B$ (step 3), and independently predicts the label likelihoods of these samples in each view (steps 4-7). Next, it summarizes the adjusted likelihoods and the overall predictive reliability of these samples in each view, chooses the $u_b$ samples with the largest reliability, applies label-wise filtering on the summarized predicted likelihoods, and forms the communication label sets (steps 8-12). After this, it appends the communication label sets obtained from the respective views to the labeled set, and removes the corresponding samples from the unlabeled set (step 13). After the maximum number of iterations is reached, MLCT returns the iteratively refined classifier of each view, and makes ensemble predictions for new samples via a majority vote of the classifiers.

Algorithm 1 MLCT pseudo-code
Input: $L$: labeled sample set in $m$ views; $U$: unlabeled sample set in $m$ views; $B$: buffer of unlabeled samples for co-training; $t$: maximum number of co-training iterations; $u_a$, $u_b$: buffer size and number of communicated samples.
Output: $H^v$: the prediction model on the $v$-th ($1 \le v \le m$) view.
1:  For iter = 1 : t
2:    Estimate the label correlations $C(c_2, c_1)$ ($1 \le c_1, c_2 \le q$) via Eq. (1);
3:    Randomly pick $u_a$ samples from $U$ and put them into $B$;
4:    For v = 1 : m
5:      Update classifier $H^v$ based on $L$ and make predictions on $B$;
6:      Adjust the initially predicted likelihoods $f^v_{j,c}$ ($x^v_j \in B$) via Eq. (2);
7:    End For
8:    For v = 1 : m
9:      Calculate $r^v_j$ ($1 \le j \le u_a$) via Eq. (6), then select the $u_b$ samples of $B$ with the largest $r^v_j$ to form the set $B^v$;
10:     Compute $\theta^v_+(c)$ and $\theta^v_-(c)$ ($1 \le c \le q$) via Eq. (7);
11:     Apply label-wise filtering on $f^v_{j,c}$ ($x^v_j \in B^v$) via Eq. (8), and form the communication label set $\Delta^v = \{b^v_{j,c}\}$;
12:   End For
13:   Communicate $\Delta = \{\Delta^v\}_{v=1}^{m}$ to $\{H^v\}_{v=1}^{m}$; augment the labeled training set $L = L \cup \Delta$ and reduce the unlabeled training set $U = U \setminus \Delta$.
14: End For
15: Return $\{H^v\}_{v=1}^{m}$.
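Putting the pieces together, here is a compact end-to-end sketch of the loop in Algorithm 1, reusing label_correlation, adjust_likelihoods, reliability, and labelwise_filter from the sketches above. It is a simplified reading, not the authors' implementation: it keeps a single shared labeled pool rather than per-view training sets, uses a plain k-nearest-neighbour vote (knn_likelihood, our stand-in for ML-KNN), and resolves conflicting communicated labels by letting the last view win.

```python
import numpy as np

def knn_likelihood(X_tr, Y_tr, X_te, k=10):
    """Stand-in base learner: per-label fraction of the k nearest labeled
    neighbours that are relevant (a crude proxy for ML-KNN)."""
    d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]
    return (Y_tr[nn] == 1).mean(axis=1)              # n_te x q likelihoods in [0, 1]

def mlct(X, Y, L_idx, U_idx, t=30, ua=50, ub=5, seed=0):
    """Structural sketch of Algorithm 1. X: list of m view matrices;
    Y: n x q in {-1, +1} (rows indexed by U_idx are filled in during training)."""
    rng = np.random.default_rng(seed)
    m, L, U = len(X), list(L_idx), list(U_idx)
    Y = Y.copy()
    for _ in range(t):                               # step 1
        if len(U) < ua:
            break
        C = label_correlation(Y[L])                  # step 2, Eq. (1)
        B = rng.choice(U, size=ua, replace=False)    # step 3: buffer pool
        F = np.stack([knn_likelihood(X[v][L], Y[L], X[v][B])
                      for v in range(m)])            # steps 4-5
        P = np.stack([adjust_likelihoods(F[v], C)
                      for v in range(m)])            # step 6, Eqs. (2)-(4)
        r = reliability(P)                           # step 9, Eqs. (5)-(6)
        moved = set()
        for v in range(m):
            sel = np.argsort(-r[v])[:ub]             # ub most reliable samples
            b = labelwise_filter(F[v][sel])          # steps 10-11, Eqs. (7)-(8)
            for i, j in enumerate(B[sel]):           # step 13: communicate labels
                mask = b[i] != 0
                Y[j, mask] = b[i][mask]
                moved.add(int(j))
        L += sorted(moved)                           # augment L, shrink U
        U = [j for j in U if j not in moved]
    # Soft-vote ensemble over the m refined view classifiers.
    return lambda X_new: np.mean([knn_likelihood(X[v][L], Y[L], X_new[v])
                                  for v in range(m)], axis=0)
```

With the toy data from the earlier sketch, `predict = mlct(X, Y, L_idx, U_idx)` returns a function that maps a list of per-view matrices for new samples to averaged label likelihoods; the paper's final prediction is instead a majority vote over the refined classifiers.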
3 Experiments

3.1 Experimental Setup

We assess the effectiveness of MLCT on four publicly accessible multi-label datasets from different domains, with different numbers of views and of samples [Gibaja et al., 2016; Guillaumin et al., 2010]. These datasets are described in Table 1. We also compute the average imbalance ratio (ImR) over all labels of each dataset. ImR reflects the degree of class-imbalance; following [Sun and Lee, 2017], the imbalance ratio of the $c$-th label is defined as

$$\mathrm{ImR}(c) = \frac{\max\left( \sum_{j=1}^{n} [\![ y_{j,c} = 1 ]\!],\; \sum_{j=1}^{n} [\![ y_{j,c} = -1 ]\!] \right)}{\min\left( \sum_{j=1}^{n} [\![ y_{j,c} = 1 ]\!],\; \sum_{j=1}^{n} [\![ y_{j,c} = -1 ]\!] \right)} \quad (9)$$

and ImR is its average over the $q$ labels.

Data set   n     q    m   Avg    Min(ImR)   Max(ImR)   ImR
Emotions   593   6    2   1.87   1.24       3.00       2.32
Yeast      2417  14   2   4.23   1.32       70.08      8.95
Core15k    4999  260  6   1.47   3.46       2498.50    327.39
Pascal     9963  20   6   3.40   1.45       50.62      19.67

Table 1: Statistics of the datasets used for the experiments. n, q, and m are the number of examples, labels, and views, respectively. Avg is the average number of labels per sample; ImR is the average imbalance ratio over all labels of a dataset; Max(ImR) and Min(ImR) are the largest and smallest imbalance ratios among the q labels. The larger the differences among Max(ImR), Min(ImR), and ImR, the more imbalanced the dataset.

From the statistics in Table 1, we can see that the labels of the last three datasets are quite imbalanced.

We compare the performance of MLCT against six representative and related multi-label learning algorithms: ML-KNN [Zhang and Zhou, 2007], MLRGL [Bucak et al., 2011], MLLOC [Huang and Zhou, 2012], BSSML [Gönen, 2014], SMILE [Tan et al., 2017] and COINs [Zhan and Zhang, 2017]. To enable experimental comparisons with multi-label learning methods that operate on a single view, we concatenate the feature vectors of the different views of each sample into a single vector, and use the latter as the input of these methods. COINs performs feature-view splitting and classifier refinement during its iterative process, and optimizes two views from the concatenated feature vectors. MLCT directly refines m ML-KNN classifiers on the naturally split views during the iterative process.

We use five widely-used multi-label evaluation metrics: Hamming Loss (Hamm Loss), Average AUC (area under the receiver operating characteristic curve; Avg AUC), Ranking Loss (Rank Loss), One Error (One Error), and Average Precision (Avg Prec). Due to space limitations, we omit the formal definitions of these metrics; they can be found in [Zhang and Zhou, 2014; Gibaja and Ventura, 2015]. Hamm Loss requires transforming the predicted probabilistic label vector of a testing sample into a binary vector. Following the setting of ML-KNN, a label is considered relevant to the sample if its predicted probability is above 0.5, and irrelevant otherwise. The smaller the values of Hamm Loss, Rank Loss, and One Error, the better the performance. To be consistent with the other evaluation metrics, we therefore report 1-Hamm Loss, 1-Rank Loss, and 1-One Error; for these measures, larger values indicate better performance.

3.2 Experimental Results and Analysis

To compute the performance of MLCT, we randomly partition the samples of each dataset into a training set (70%) and a testing set (30%). Within the training set, we again randomly select 10% of the samples as the initial labeled data (L) and treat the remaining samples as unlabeled data (U) for co-training. We independently repeat the above partition 10 times, and report the average results and standard deviations. For the co-training based methods, the maximum number of iterations (t) is fixed to 30, the number of samples (ua) in the buffer pool B is fixed to ⌈u/t⌉, and the number of samples (ub) shared during the co-training process is fixed to ⌈5% × ua⌉. The input parameters of the competing methods are specified (or optimized) as indicated by the authors in their code or papers. Table 2 shows the results.

             Metric        BSSML         SMILE         MLLOC         MLRGL         MLKNN         COINs         MLCT
Emotions     1-Hamm Loss   0.699±0.020   0.640±0.004   0.694±0.007   0.594±0.003   0.660±0.023   0.650±0.014   0.691±0.007
             1-Rank Loss   0.345±0.036   0.614±0.008   0.712±0.016   0.577±0.015   0.547±0.029   0.588±0.022   0.617±0.029
             Avg Prec      0.477±0.017   0.601±0.004   0.548±0.025   0.562±0.015   0.559±0.021   0.582±0.012   0.608±0.027
             1-One Error   0.272±0.028   0.466±0.015   0.430±0.018   0.412±0.049   0.412±0.057   0.451±0.023   0.456±0.048
             Avg AUC       0.602±0.029   0.647±0.025   0.674±0.022   0.542±0.011   0.525±0.045   0.557±0.043   0.552±0.009
Yeast        1-Hamm Loss   0.724±0.005   0.732±0.008   0.711±0.003   0.722±0.008   0.776±0.007   0.768±0.004   0.772±0.003
             1-Rank Loss   0.499±0.024   0.755±0.011   0.700±0.013   0.787±0.004   0.797±0.006   0.789±0.005   0.791±0.005
             Avg Prec      0.505±0.017   0.686±0.015   0.504±0.009   0.699±0.007   0.715±0.011   0.715±0.006   0.715±0.007
             1-One Error   0.728±0.018   0.696±0.027   0.814±0.084   0.748±0.011   0.731±0.014   0.714±0.010   0.749±0.017
             Avg AUC       0.550±0.053   0.594±0.010   0.604±0.009   0.625±0.003   0.581±0.018   0.615±0.009   0.562±0.011
Pascal07     1-Hamm Loss   0.836±0.025   0.888±0.000   0.926±0.000   0.882±0.000   0.927±0.000   0.826±0.013   0.927±0.000
             1-Rank Loss   0.661±0.034   0.752±0.007   0.775±0.008   0.695±0.008   0.721±0.006   0.701±0.008   0.725±0.006
             Avg Prec      0.205±0.017   0.450±0.006   0.227±0.005   0.424±0.005   0.439±0.005   0.412±0.003   0.453±0.005
             1-One Error   0.356±0.037   0.406±0.007   0.250±0.010   0.405±0.008   0.399±0.008   0.362±0.013   0.410±0.007
             Avg AUC       0.553±0.056   0.664±0.007   0.807±0.009   0.535±0.005   0.577±0.008   0.587±0.005   0.589±0.009
Core15k      1-Hamm Loss   0.986±0.000   0.956±0.000   0.974±0.000   0.948±0.002   0.987±0.000   0.943±0.000   0.987±0.000
             1-Rank Loss   0.606±0.017   0.788±0.007   0.827±0.005   0.744±0.005   0.821±0.003   0.794±0.000   0.822±0.005
             Avg Prec      0.122±0.011   0.288±0.008   0.248±0.010   0.162±0.021   0.255±0.006   0.203±0.000   0.256±0.006
             1-One Error   0.138±0.015   0.284±0.015   0.356±0.038   0.054±0.084   0.271±0.021   0.318±0.000   0.275±0.012
             Avg AUC       0.641±0.029   0.658±0.010   0.815±0.006   0.594±0.004   0.567±0.005   0.574±0.000   0.564±0.005

Table 2: Results on the different datasets (mean ± standard deviation over 10 random partitions). Statistical superiority/inferiority of MLCT with respect to the other methods is assessed via pairwise t-test at the 95% significance level.

From Table 2, we can see that MLCT generally outperforms the competing methods on the different datasets and across the adopted metrics. We used the signed rank test [Demšar, 2006] to check for statistical significance between MLCT and the other methods (except SMILE and MLLOC); all the p-values are smaller than 0.05. Both MLCT and COINs are multi-label co-training methods, and MLCT frequently outperforms the latter. This is because COINs communicates only the single positive and single negative label predicted with the highest confidence during co-training; given the multi-label nature of the samples, the two shared labels may mislead the classifier update and thus degrade the performance. In practice, we also randomly divided the concatenated view of the Pascal dataset into two views, and then applied MLCT on these two views. Again, MLCT shows a better performance than COINs. MLCT also runs much faster than COINs: their average runtimes on the first two datasets are 155.180 and 708.225 seconds, respectively. MLCT further outperforms two supervised multi-label solutions (ML-KNN and MLRGL), which use different techniques for multi-label data classification; the performance margin shows the advantage of using unlabeled data for training. MLLOC explores label correlations locally and achieves performance comparable to MLCT, which employs label correlation globally. BSSML, SMILE, and MLCT all use unlabeled data for training; BSSML loses to MLCT, and SMILE obtains performance comparable to MLCT. This fact suggests that co-training is an alternative and effective paradigm for exploiting unlabeled data in semi-supervised multi-label learning.
SMILE has a higher Avg AUC than MLCT because the predicted likelihood vectors of SMILE are less sparse than those of MLCT. SMILE uses label correlation to replenish the missing labels of instances before training its linear classifier, whereas MLCT directly uses the available label information for prediction.

Component Analysis

To further analyze the effect of the individual components of MLCT, we introduce four variants of MLCT: (i) MLCT(nC) does not adjust the initially predicted likelihoods; in other words, it does not explicitly tackle the class-imbalance problem, and directly uses the initially predicted likelihoods during the whole iterative process. (ii) MLCT(nS) first adjusts the initially predicted likelihoods, but randomly selects the $u_b$ samples; it then follows the same process as MLCT. (iii) MLCT(nF) first adjusts the initially predicted likelihoods and selects samples based on the summarized reliabilities $r^v_j$; it then communicates the labels with $f^v_{j,c} \ge 0.5$ (as done by ML-KNN), without applying label-wise filtering on the summarized likelihoods. (iv) MLCT(nCSF) does not adjust the initially predicted likelihoods, randomly selects the $u_b$ samples, and then communicates the labels with $f^v_{j,c} \ge 0.5$.

Figure 1 gives the results obtained with MLCT and its variants.

[Figure 1: two bar-chart panels, (a) 1-Rank Loss and (b) Avg AUC, comparing MLCT(nCSF), MLCT(nS), MLCT(nF), MLCT(nC), and MLCT on the Emotions, Yeast, Pascal, and Core15k datasets.]
Figure 1: 1-Rank Loss and Avg AUC of MLCT and its variants on different datasets.

Overall, MLCT outperforms its variants, and MLCT(nCSF) usually has the lowest performance. MLCT(nF), MLCT(nS), and MLCT(nC) outperform MLCT(nCSF). MLCT(nC) is always better than MLCT(nCSF), and worse than MLCT on the various datasets, especially on Core15k, which is the most imbalanced dataset. This observation indirectly reflects the effectiveness of addressing class-imbalance during the co-training process. Moreover, we can see that MLCT(nS), which randomly selects the same number of samples for communication, is always outperformed by MLCT. This fact shows the effectiveness of MLCT in selecting samples with high confidence. We used a signed rank test to verify the statistical significance of the differences between MLCT and its variants, and all p-values are smaller than 0.02. These results support the fact that MLCT is effective in addressing class-imbalance, and is capable of communicating labels with confidence for multi-label co-training.

Parameter Sensitivity Analysis

As with other co-training methods [Blum and Mitchell, 1998; Zhou and Li, 2005; Zhan and Zhang, 2017], two input parameters (t and ub) may affect the performance of MLCT. We conduct additional experiments to study the sensitivity of MLCT with respect to these parameters. Due to space limitations, we report only the average results with respect to 1-Hamm Loss; the results for the other metrics exhibit similar patterns and lead to the same conclusions. Figure 2(a) shows the results for MLCT under different values of t, with ua and ub set to ⌈u/t⌉ and ⌈5% × ua⌉, respectively. The 1-Hamm Loss of MLCT increases with t and reaches a plateau after 15 iterations. Initially, MLCT has a lower performance than ML-KNN; that is because ML-KNN is trained on the integrated view, whereas MLCT works on the separate views.
To further study the generalization ability of MLCT, we also investigate the performance of MLCT with MLRGL [Bucak et al., 2011] and BPMLL [Zhang and Zhou, 2006] as base classifiers (instead of ML-KNN), and report these results in Figure 2(a) as well. Again, MLCT achieves a performance that is superior to that of the adopted base classifiers. Figure 2(b) exhibits the 1-Hamm Loss of MLCT for different numbers of selected samples (ub) for communication, with ua fixed to 50, 100, 300, and 500, respectively. The results are averaged over three datasets; Emotions is excluded from this experiment, since its small number of samples prevents using the same settings of ua as for the other datasets. MLCT holds a stable performance as the ratio ub/ua increases from 1% to 10%, and then shows a decreasing trend for ub/ua > 10%. This indicates that a reasonable number of samples can easily be selected to achieve an effective co-training for MLCT.
[Figure 2: three panels of 1-Hamm Loss curves: (a) sensitivity to t, comparing MLCT with ML-KNN, MLRGL, and BPMLL as base classifiers against the base classifiers themselves; (b) sensitivity to the ratio ub/ua; (c) sensitivity to the offsets of θ^v_+(c) and θ^v_-(c).]
Figure 2: 1-Hamm Loss of MLCT under different input values of t (maximum number of iterations), ub (number of selected samples for communication), and different offsets of θ^v_+(c) and θ^v_-(c) (label-wise thresholds).

MLCT also applies the label-wise thresholds $\theta^v_+(c)$ and $\theta^v_-(c)$ to filter positive and negative labels. Figure 2(c) shows the 1-Hamm Loss of MLCT under different offsets of these two thresholds; in particular, +0.2 (-0.2) means increasing (decreasing) the respective threshold by 0.2. From the figure, we can observe that directly using $\theta^v_+(c)$ and $\theta^v_-(c)$ (both with offset 0) gives a better performance than the other settings. In practice, we also tested a threshold fixed to 0.5, and the obtained performance is lower than that of MLCT. These results demonstrate the importance of label-wise filtering. Irrespective of the offset for $\theta^v_+(c)$, either decreasing or increasing $\theta^v_-(c)$ degrades the performance: decreasing $\theta^v_-(c)$ results in a more stringent rule for negative samples, whereas increasing $\theta^v_-(c)$ causes positive samples to be detected as negative ones. On the other hand, when the offset for $\theta^v_-(c)$ is 0, MLCT shows a stable performance across different offsets of $\theta^v_+(c)$. This is because the number of relevant samples for a label $c$ is generally much smaller than the number of irrelevant samples for that label; as such, MLCT can identify the relevant samples of the $c$-th label even with a moderately decreased or increased $\theta^v_+(c)$. In practice, we also investigated the margin $\theta^v_+(c) - \theta^v_-(c)$ and found that it is generally larger than 0.5 across all labels. This investigation shows that label-wise filtering is important for multi-label co-training and that the adaptive threshold values are effective. In summary, from these results we can conclude that MLCT is robust to its key input parameters.

4 Conclusions

In this paper, we study the multi-label co-training problem, an interesting but seldom studied learning paradigm. We introduce a solution that addresses the class-imbalance issue and communicates confident labels of multi-label samples during the co-training process. Experimental results show that the proposed solution works better than related methods. Several avenues remain to be explored, including how to accurately estimate label correlation from limited labeled data, and how to filter relevant labels more reliably during multi-label co-training. The code of MLCT is available at: http://mlda.swu.edu.cn/codes.php?name=MLCT.

Acknowledgments

This work is supported by NSFC (61741217, 61402378 and 61732019), and by the Open Research Project of the Hubei Key Laboratory of Intelligent Geo-Information Processing (KLIGIP2017A05).

References

[Blum and Mitchell, 1998] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.
[Bucak et al., 2011] Serhat Selcuk Bucak, Rong Jin, and Anil K. Jain. Multi-label learning with incomplete class assignments. In CVPR, pages 2801–2808, 2011.
[Chen et al., 2011] Minmin Chen, Yixin Chen, and Kilian Q. Weinberger. Automatic feature decomposition for single view co-training. In ICML, pages 953–960, 2011.
[Demšar, 2006] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(1):1–30, 2006.
[Du et al., 2011] Jun Du, Charles X. Ling, and Zhihua Zhou. When does cotraining work in real data? TKDE, 23(5):788–799, 2011.
[Gibaja and Ventura, 2015] Eva Gibaja and Sebastián Ventura. A tutorial on multilabel learning. ACM Computing Surveys, 47(3):52, 2015.
[Gibaja et al., 2016] Eva L. Gibaja, Jose M. Moyano, and Sebastián Ventura. An ensemble-based approach for multi-view multi-label classification. Progress in Artificial Intelligence, 5(4):251–259, 2016.
[Gönen, 2014] Mehmet Gönen. Coupled dimensionality reduction and classification for supervised and semi-supervised multilabel learning. Pattern Recognition Letters, 38:132–141, 2014.
[Guillaumin et al., 2010] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Multimodal semi-supervised learning for image classification. In CVPR, pages 902–909, 2010.
[Guo and Schuurmans, 2012] Yuhong Guo and Dale Schuurmans. Semi-supervised multi-label classification: a simultaneous large-margin, subspace learning approach. In ECML/PKDD, pages 355–370, 2012.
[Huang and Zhou, 2012] Shengjun Huang and Zhihua Zhou. Multi-label learning by exploiting label correlations locally. In AAAI, pages 949–955, 2012.
[Kong et al., 2013] Xiangnan Kong, Michael K. Ng, and Zhihua Zhou. Transductive multilabel learning via label set propagation. TKDE, 25(3):704–719, 2013.
[Levatić et al., 2017] Jurica Levatić, Michelangelo Ceci, Dragi Kocev, and Sašo Džeroski. Self-training for multi-target regression with tree ensembles. Knowledge-Based Systems, 123(C):41–60, 2017.
[Qian and Davidson, 2010] Buyue Qian and Ian Davidson. Semi-supervised dimension reduction for multi-label classification. In AAAI, pages 569–574, 2010.
[Quevedo et al., 2012] José Ramón Quevedo, Oscar Luaces, and Antonio Bahamonde. Multilabel classifiers with a probabilistic thresholding strategy. Pattern Recognition, 45(2):876–883, 2012.
[Sun and Lee, 2017] Kaiwei Sun and Chong Ho Lee. Addressing class-imbalance in multi-label learning via two-stage multi-label hypernetwork. Neurocomputing, 266:375–389, 2017.
[Tan et al., 2017] Qiaoyu Tan, Yanming Yu, Guoxian Yu, and Jun Wang. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, 260:192–202, 2017.
[Wu and Zhang, 2013] Le Wu and Minling Zhang. Multi-label classification with unlabeled data: An inductive approach. In ACML, pages 197–212, 2013.
[Wu et al., 2015] Baoyuan Wu, Siwei Lyu, and Bernard Ghanem. ML-MG: multi-label learning with missing labels using a mixed graph. In ICCV, pages 4157–4165, 2015.
[Yu et al., 2012] Guoxian Yu, Carlotta Domeniconi, Huzefa Rangwala, Guoji Zhang, and Zhiwen Yu. Transductive multi-label ensemble classification for protein function prediction. In KDD, pages 1077–1085, 2012.
[Yu et al., 2014] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In ICML, pages 593–601, 2014.
[Zhan and Zhang, 2017] Wang Zhan and Minling Zhang. Inductive semi-supervised multi-label learning with co-training. In KDD, pages 1305–1314, 2017.
[Zhang and Yeung, 2009] Yu Zhang and Dit-Yan Yeung. Semi-supervised multi-task regression. In ECML/PKDD, pages 617–631, 2009.
[Zhang and Zhou, 2006] Minling Zhang and Zhihua Zhou. Multilabel neural networks with applications to functional genomics and text categorization. TKDE, 18(10):1338–1351, 2006.
[Zhang and Zhou, 2007] Minling Zhang and Zhihua Zhou.
ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
[Zhang and Zhou, 2014] Minling Zhang and Zhihua Zhou. A review on multi-label learning algorithms. TKDE, 26(8):1819–1837, 2014.
[Zhang et al., 2015a] Minling Zhang, Yukun Li, and Xuying Liu. Towards class-imbalance aware multi-label learning. In IJCAI, pages 4041–4047, 2015.
[Zhang et al., 2015b] Xiang Zhang, Naiyang Guan, Zhigang Luo, and Xuejun Yang. Constrained projective non-negative matrix factorization for semi-supervised multi-label learning. In ICMLA, pages 588–593, 2015.
[Zhou and Li, 2005] Zhihua Zhou and Ming Li. Tri-training: Exploiting unlabeled data using three classifiers. TKDE, 17(11):1529–1541, 2005.
[Zhu, 2008] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin-Madison, 2008.