# towards_classimbalance_aware_multilabel_learning__62e86921.pdf

Towards Class-Imbalance Aware Multi-Label Learning

Min-Ling Zhang Yu-Kun Li Xu-Ying Liu 1School of Computer Science and Engineering, Southeast University, Nanjing 210096, China 2Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China {zhangml, liyk, liuxy}@seu.edu.cn

In multi-label learning, each object is represented by a single instance while associated with a set of class labels. Due to the huge (exponential) number of possible label sets for prediction, existing approaches mainly focus on how to exploit label correlations to facilitate the learning process. Nevertheless, an intrinsic characteristic of learning from multi-label data, i.e. the widely-existing classimbalance among labels, has not been well investigated. Generally, the number of positive training instances w.r.t. each class label is far less than its negative counterparts, which may lead to performance degradation for most multi-label learning techniques. In this paper, a new multi-label learning approach named Cross-Coupling Aggregation (COCOA) is proposed, which aims at leveraging the exploitation of label correlations as well as the exploration of class-imbalance. Brieﬂy, to induce the predictive model on each class label, one binary-class imbalance learner corresponding to the current label and several multi-class imbalance learners coupling with other labels are aggregated for prediction. Extensive experiments clearly validate the effectiveness of the proposed approach, especially in terms of imbalance-speciﬁc evaluation metrics such as F-measure and area under the ROC curve.

1 Introduction

Under the multi-label learning setting, each example is represented by a single instance (feature vector) while associated with multiple class labels simultaneously [Tsoumakas et al., 2010; Zhang and Zhou, 2014]. Formally speaking, let X = Rd denote the input space of d-dimensional feature vectors and Y = {y1, y2, . . . , yq} denote the output space of q class labels. Given the multi-label training set D = {(xi, Yi) | 1 i N}, where xi X is a d-dimensional feature vector and Yi Y is the set of labels associated with xi, the task is to learn a multi-label classiﬁer h : X 2Y from D which maps from the space of feature vectors to the space of label sets.

The key challenge to learn from multi-label data lies in the huge number of possible label sets for prediction, which is exponential to the size of label space (i.e. 2q). To take on this challenge, existing approaches mainly focus on exploiting correlations among class labels to facilitate the learning process [Tsoumakas et al., 2010; Zhang and Zhou, 2014]. Based on the order of correlations being considered, existing approaches can be roughly grouped into three categories, i.e. ﬁrst-order approaches which assume independence among class labels, second-order approaches which consider correlations between a pair of class labels, and high-order approaches which consider correlations among all the class labels or subsets of class labels. On the other hand, an inherent property of learning from multi-label data, i.e. the class-imbalance among labels, has not been fully taken into consideration by most existing approaches. For each class label yj Y, let D+ j = {(xi, +1) | yj Yi, 1 i N} and D j = {(xi, 1) | yj / Yi, 1 i N} denote the positive and negative training examples w.r.t. yj. Generally, the corresponding imbalance ratio Im Rj = max(|D+ j |, |D j |)/ min(|D+ j |, |D j |) would be high.1 For instance, among the thirteen benchmark multilabel data sets used in this paper (Table 2), the average imbalance ratio across the label space (i.e. 1 q Pq j=1 Im Rj) ranges from 2.1 to 17.9 (with nine of them greater than 5.0), and the maximum imbalance ratio across the label space (i.e. max1 j q Im Rj) ranges from 3.0 to 50.0 (with eleven of them greater than 10.0). Class-imbalance has long been regarded as one fundamental threat to compromise the performance of standard machine learning algorithms, which would also lead to performance degradation for most multi-label learning approaches [He and Garcia, 2009; Zhang and Zhou, 2014]. Therefore, a favorable practice towards designing multi-label learning algorithm should cautiously leverage the exploitation of label correlations as well as the exploration of class-imbalance. In this paper, a novel class-imbalance aware algorithm named COCOA, i.e. Cr Oss-COupling Aggregation, is proposed to learning from multi-label data. For each class label, COCOA builds one binary-class imbalance learner corresponding to the current label and also several multi-class imbalance

1In most cases, |D+ j | < |D j | holds.

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

learners coupling with other labels. After that, the ﬁnal prediction on each class label is obtained by aggregating the outputs yielded by the binary learner and the multi-class learners. Comparative studies across thirteen publicly-available multi-label data sets show that COCOA achieves highly competitive performance, especially in terms of appropriate evaluation metrics under class-imbalance scenario. The rest of this paper is organized as follows. Section 2 presents technical details of the proposed COCOA approach. Section 3 discusses existing works related to COCOA. Section 4 reports the experimental results of comparative studies. Finally, Section 5 concludes.

2 The COCOA Approach

As shown in Section 1, the task of multi-label learning is to induce a multi-label classiﬁer h : X 2Y from the training set D = {(xi, Yi) | 1 i N}. This is equivalent to learn q real-valued functions fj : X R (1 j q), each accompanied by a thresholding function tj : X R. For any example x X, fj(x) returns the conﬁdence of associating x with class label yj, and the predicted label set is determined according to:

h(x) = {yj | fj(x) > tj(x), 1 j q} (1)

Let Dj denote the binary training set derived from D for the j-th class label yj:

Dj = {(xi, φ(Yi, yj)) | 1 i N} (2)

where φ(Yi, yj) =

( +1, if yj Yi

1, otherwise

Therefore, the derived binary training set consists of a subset of positive training examples (D+ j ) and a subset of negative training examples (D j ), i.e. Dj = D+ j S D j . To deal with the issue of having skewed distribution between D+ j and D j , one straightforward solution is to apply some binaryclass imbalance learner B on Dj to induce a binary classiﬁer gj, i.e. gj B(Dj). Let gj(+1 | x) denote the predictive conﬁdence that x should be regarded as a positive example for yj, the real-valued function fj( ) can then be instantiated as fj(x) = gj(+1 | x). In addition, the thresholding function tj( ) can be simply set to some constant function such as tj(x) = 0. In the above class-imbalance handling strategy, the predictive model fj( ) for each class label yj is built in an independent manner. To incorporate label correlations into the learning process, we choose to randomly couple another class label yk (k = j) with yj as follows. Given the label pair (yj, yk), a multi-class training set Djk can be derived from D:

Djk = {(xi, ψ(Yi, yj, yk)) | 1 i N} (3)

where ψ(Yi, yj, yk) =

0, if yj / Yi and yk / Yi

+1, if yj / Yi and yk Yi

+2, if yj Yi and yk / Yi

+3, if yj Yi and yk Yi

Table 1: The pseudo-code of COCOA.

Inputs: D: the multi-label training set {(xi, Yi) | 1 i N} (xi X, Yi Y, X = Rd, Y = {y1, y2, . . . , yq}) B: the binary-class imbalance learner M: the multi-class imbalance learner K: the number of coupling class labels x: the test example (x X) Outputs: Y : the predicted label set for x Process: 1: for j = 1 to q do 2: Form the binary training set Dj according to Eq.(2); 3: gj B(Dj); 4: Draw a random subset IK Y \ {yj} containing K class labels; 5: for yk IK do 6: Form the tri-class training set Dtri jk according to Eq.(4); 7: gjk M(Dtri jk ); 8: end for 9: Set the real-valued function fj( ) according to Eq.(5); 10: Set the constant thresholding function tj( ), with the constant aj being determined according to Eq.(6); 11: end for 12: Return Y = h(x) according to Eq.(1);

Here, the class label ψ(Yi, yj, yk) for the derived four-class learning problem is determined by the joint assignment of yj and yk w.r.t. Yi. Note that although label correlations can be exploited by making use of Djk in the learning process, the issue of classimbalance may be ampliﬁed by jointly considering yj and yk. Without loss of generality, suppose that positive examples D+ j (D+ k ) correspond to the minority class in the binary training set Dj (Dk). Accordingly, the ﬁrst class (ψ(Yi, yk, yk) = 0) and the fourth class (ψ(Yi, yk, yk) = +3) in Djk would contain the largest and the smallest number of examples. Compared to the original imbalance ratios Im Rj and Im Rk in binary training sets Dj and Dk, the imbalance ratio between the largest class and the smallest class would roughly turn into Im Rj Im Rk in four-class training set Djk. To deal with this potential problem, COCOA transforms the four-class data set Djk into a tri-class data set Dtri jk by merging the third class and the fourth class (both with positive assignment for yj):

Dtri jk = { xi, ψtri(Yi, yj, yk) | 1 i N} (4)

where ψtri(Yi, yj, yk) =

0, if yj / Yi and yk / Yi

+1, if yj / Yi and yk Yi

+2, if yj Yi

Here, for the newly-merged class (ψtri(Yi, yj, yk) = +2), its imbalance ratios w.r.t. the ﬁrst class (ψtri(Yi, yj, yk) = 0) and the second class (ψtri(Yi, yj, yk) = +1) would roughly be Im Rj Im Rk

1+Im Rk and Im Rj 1+Im Rk , which is much smaller than the worst-case imbalance ratio Im Rj Im Rk in the four-class training set.

Table 2: Characteristics of the benchmark multi-label data sets.

Data set |S| dim(S) L(S) F(S) LCard(S) LDen(S) DL(S) PDL(S) Imbalance Ratio min max avg CAL500 502 68 124 numeric 25.058 0.202 502 1.000 1.040 24.390 3.846 Emotions 593 72 6 numeric 1.869 0.311 27 0.046 1.247 3.003 2.146 Medical 978 144 14 numeric 1.075 0.077 42 0.043 2.674 43.478 11.236 Enron 1702 50 24 nominal 3.113 0.130 547 0.321 1.000 43.478 5.348 Scene 2407 294 6 numeric 1.074 0.179 15 0.006 3.521 5.618 4.566 Yeast 2417 103 13 numeric 4.233 0.325 189 0.078 1.328 12.500 2.778 Slashdot 3782 53 14 nominal 1.134 0.081 118 0.031 5.464 35.714 10.989 Corel5k 5000 499 44 nominal 2.214 0.050 1037 0.207 3.460 50.000 17.857 Rcv1 (subset 1) 6000 472 42 numeric 2.458 0.059 574 0.096 3.344 50.000 15.152 Rcv1 (subset 2) 6000 472 39 numeric 2.170 0.056 489 0.082 3.215 47.619 15.873 Eurlex-sm 19348 250 27 numeric 1.492 0.055 497 0.026 3.509 47.619 16.393 Tmc2007 28596 500 15 nominal 2.100 0.140 637 0.022 1.447 34.483 5.848 Mediamill 43907 120 29 numeric 4.010 0.138 3540 0.079 1.748 45.455 7.092

By applying some multi-class imbalance learner M on Dtri jk , one multi-class classiﬁer gjk can be induced, i.e. gjk M(Dtri jk ). Correspondingly, let gjk(+2 | x) denote the predictive conﬁdence that x should have positive assignment w.r.t. yj, regardless of x having positive or negative assignment w.r.t. yk. For each class label yj, COCOA draws a random subset of K class labels IK Y \ {yj} for pairwise coupling. The real-valued function fj( ) is then instantiated by aggregating the predictive conﬁdences of one binary-class imbalance learner and K multi-class imbalance learners:

fj(x) = gj(+1 | x) + X

yk IK gjk(+2 | x) (5)

For the thresholding function tj( ), COCOA chooses to set it as a constant function tj(x) = aj. By accompanying the constant aj with fj, any example x is predicted to be positive for yj if fj(x) > aj, and negative otherwise. Speciﬁcally, the goodness of aj is evaluated based on certain metric which measures how well fj can classify examples in Dj by using aj as the bipartition threshold. In this paper, we employ the F-measure metric (i.e. harmonic mean of precision and recall) which is popular for evaluating the performance of binary classiﬁer, especially for the case of skewed class distribution. Let F(fj, a, Dj) denote the F-measure value achieved by applying {fj, a} over the binary training set Dj, the thresholding constant aj is determined by maximizing the corresponding F-measure:

aj = arg maxa R F(fj, a, Dj) (6)

Table 1 summarizes the complete procedure of the proposed COCOA approach. For each class label yj Y, one binary-class imbalance learner (Steps 2-3) and K coupling multi-class imbalance learners (Steps 4-8) are induced by manipulating the multi-label training set D. After that, the predictive model for yj is produced by aggregating the predictive conﬁdences of the induced binary and multi-class classiﬁers (Steps 9-10). Finally, the predicted label set for the test example is obtained by querying the predictive models of all class labels (Step 12). It is worth noting that although pairwise coupling (Step 6) only considers second-order correlations among labels, the

overall label correlations exploited by COCOA are actually high-order as controlled by the parameter K. Here, COCOA fulﬁlls high-order label correlations by imposing pairwise coupling for K times instead of combining all K coupling labels simultaneously, as the latter strategy may lead to severe class-imbalance due to the combinatorial effects.

3 Related Work In this section, we brieﬂy discuss existing works related to COCOA. More comprehensive reviews on multi-label learning can be found in survey literatures such as [Tsoumakas et al., 2010; Zhang and Zhou, 2014; Gibaja and Ventura, 2015]. As mentioned in Section 2, one intuitive solution towards class-imbalance multi-label learning is to ﬁrstly decompose the multi-label learning problem into q independent binary learning problems, one per class label (a.k.a. binary relevance). And then, for each decomposed binary learning problem, the skewness between the positive and negative training examples can be dealt with via popular binary imbalance learning techniques such as random or synthetic undersampling/oversampling [Spyromitros-Xiouﬁs et al., 2011; Tahir et al., 2012]. However, useful information regarding label correlations will be ignored in this decomposition process. Different from binary decomposition, one can also transform the multi-label learning problem into a multi-class problem by treating any distinct label combinations appearing in the training set as a new class (a.k.a. label powerset). After that, the skewness among the transformed classes can be dealt with via off-the-shelf multi-class imbalance learning techniques [Wang and Yao, 2012; Liu et al., 2013]. Although label correlations can be exploited in this transformation process, the number of transformed classes (upper-bounded by min(N, 2q)) may be too large for any multi-class learner to work well. Besides applying class-imbalance learning techniques to the transformed binary or multi-class problems, one can also make the multi-label learning algorithm be aware of the class-imbalance issue via parameter tuning. For COCOA, the thresholding constant is calibrated by maximizing imbalance-speciﬁc metric such as F-measure based on

Table 3: Performance of each comparing algorithm (mean std. deviation) in terms of macro-averaging F-measure (MACRO-F). In addition, / indicates whether COCOA is statistically superior/inferior to the comparing algorithm on each data set (pairwise t-test at 1% signiﬁcance level).

Algorithm Data Set CAL500 Emotions Medical Enron Scene Yeast Slashdot COCOA 0.207 0.009 0.662 0.013 0.690 0.015 0.324 0.009 0.732 0.013 0.457 0.015 0.327 0.009 USAM 0.217 0.006 0.591 0.016 0.670 0.012 0.266 0.011 0.624 0.008 0.432 0.010 0.259 0.010 USAM-EN 0.246 0.004 0.590 0.018 0.665 0.025 0.274 0.010 0.620 0.011 0.437 0.012 0.296 0.007 SMOTE 0.237 0.006 0.584 0.020 0.672 0.022 0.266 0.006 0.619 0.007 0.430 0.006 0.326 0.005 SMOTE-EN 0.239 0.004 0.582 0.017 0.672 0.022 0.275 0.004 0.624 0.007 0.431 0.005 0.315 0.007 RML 0.209 0.008 0.645 0.016 0.666 0.018 0.309 0.010 0.684 0.013 0.471 0.014 0.311 0.009 ML-KNN 0.074 0.002 0.592 0.026 0.474 0.031 0.174 0.011 0.715 0.011 0.380 0.008 0.198 0.014 CLR 0.081 0.007 0.595 0.017 0.650 0.012 0.229 0.006 0.631 0.013 0.413 0.010 0.233 0.007 ECC 0.102 0.004 0.642 0.014 0.647 0.021 0.241 0.006 0.716 0.009 0.394 0.008 0.250 0.007 RAKEL 0.193 0.003 0.613 0.018 0.576 0.014 0.256 0.006 0.686 0.008 0.420 0.005 0.248 0.006 Data Set win/tie/loss

Algorithm Corel5k Rcv1 Rcv1 Eurlex-sm Tmc2007 Mediamill counts for (subset 1) (subset 2) COCOA COCOA 0.195 0.004 0.363 0.008 0.337 0.009 0.703 0.007 0.669 0.002 0.459 0.004 / USAM 0.141 0.004 0.318 0.005 0.306 0.005 0.562 0.007 0.607 0.002 0.337 0.003 12/0/1 USAM-EN 0.150 0.002 0.317 0.005 0.303 0.005 0.563 0.004 0.608 0.002 0.337 0.003 11/1/1 SMOTE 0.125 0.003 0.314 0.006 0.305 0.004 0.552 0.003 0.566 0.003 0.338 0.001 10/2/1 SMOTE-EN 0.126 0.002 0.313 0.004 0.304 0.004 0.553 0.003 0.567 0.003 0.341 0.001 10/2/1 RML 0.215 0.009 0.387 0.020 0.363 0.029 0.059 0.003 0.568 0.039 0.268 0.019 6/4/3 ML-KNN 0.028 0.004 0.122 0.008 0.103 0.008 0.525 0.012 0.479 0.008 0.245 0.004 12/1/0 CLR 0.049 0.004 0.227 0.007 0.226 0.006 0.599 0.006 0.623 0.003 0.268 0.004 13/0/0 ECC 0.064 0.004 0.216 0.007 0.199 0.004 0.619 0.009 0.642 0.003 0.277 0.002 12/1/0 RAKEL 0.084 0.005 0.272 0.007 0.263 0.005 0.632 0.008 0.643 0.004 0.378 0.002 13/0/0

the training set, which could also be calibrated based on some held-out validation set [Fan and Lin, 2007] or be optimized with an extra learning procedure [Quevedo et al., 2012; Pillai et al., 2013]. Furthermore, instead of only tuning the thresholding parameter, another sophisticated choice is to design multi-label learning algorithms directly optimizing the macro-averaging F-measure [Dembczy nski et al., 2013; Petterson and Caetano, 2010]. In view of the randomness in pairwise coupling, COCOA makes use of ensemble learning to aggregate the predictions of K randomly-generated imbalance learners. There have been multi-label learning methods which also utilize ensemble learning to deal with their inherent random factors, such as ensembling chaining classiﬁer with random order [Read et al., 2011] or ensembling multi-class learner derived from random k-labelsets [Tsoumakas et al., 2011a]. Furthermore, ensemble learning can be employed as a meta-strategy to improve generalization with homogeneous [Shi et al., 2011] or heterogeneous [Tahir et al., 2010] component multi-label learners.

4 Experiments

4.1 Experimental Setup

Data Sets To serve as a solid basis for performance evaluation, a total of thirteen benchmark multi-label data sets have been collected for experimental studies. For each multi-label data set S, we use |S|, L(S), dim(S) and F(S) to represent its number of

examples, number of class labels, number of features and feature type respectively. In addition, several multi-label statistics are further used to characterize properties of S, whose deﬁnitions can be found in [Read et al., 2011] while not detailed here due to page limit. Let Im Rj denote the imbalance ratio on the j-th class label (1 j q), the level of class-imbalance on S can be characterized by the average imbalance ratio ( 1

q Pq j=1 Im Rj), minimum imbalance ratio (min1 j q Im Rj) and maximum imbalance ratio (max1 j q Im Rj) across the label space. As a common practice in class-imbalance studies [He and Garcia, 2009], extreme imbalance is not considered in this paper. Speciﬁcally, any class label with rare appearance (less than 20 positive examples) or with overly-high imbalance ratio (Im Rj 50) is excluded from the label space. Table 2 summarizes characteristics of the experimental data sets, which are roughly ordered according to |S|. As shown in Table 2, the thirteen data sets exhibit diversiﬁed properties from different aspects. These data sets cover a broad range of scenarios, including music (CAL500, Emotions), image (Scene, Corel5k), video (Mediamill), biology (Yeast), and text (the others). Here, dimensionality reduction is performed on text data sets by retaining features with high document frequency.

Comparing Algorithms In this paper, COCOA is compared against two series of algorithms. As discussed in Section 3, the ﬁrst series include several approaches which are capable of dealing with the classimbalance issue in multi-label data:

Table 4: Performance of each comparing algorithm (mean std. deviation) in terms of macro-averaging AUC (MACRO-AUC). In addition, / indicates whether COCOA is statistically superior/inferior to the comparing algorithm on each data set (pairwise t-test at 1% signiﬁcance level).

Algorithm Data Set CAL500 Emotions Medical Enron Scene Yeast Slashdot COCOA 0.557 0.005 0.843 0.010 0.958 0.006 0.731 0.006 0.943 0.003 0.710 0.006 0.736 0.005 USAM 0.514 0.005 0.708 0.019 0.855 0.012 0.606 0.010 0.790 0.009 0.578 0.006 0.617 0.004 USAM-EN 0.513 0.004 0.708 0.015 0.860 0.024 0.600 0.004 0.788 0.009 0.583 0.006 0.618 0.004 SMOTE 0.513 0.005 0.703 0.019 0.874 0.019 0.617 0.007 0.776 0.008 0.579 0.006 0.688 0.008 SMOTE-EN 0.513 0.004 0.704 0.013 0.874 0.019 0.617 0.007 0.777 0.011 0.581 0.007 0.686 0.008 RML ML-KNN 0.516 0.007 0.806 0.015 0.909 0.008 0.663 0.006 0.926 0.005 0.679 0.004 0.676 0.006 CLR 0.561 0.004 0.796 0.010 0.948 0.008 0.709 0.007 0.894 0.005 0.650 0.004 0.698 0.009 ECC 0.549 0.007 0.841 0.009 0.925 0.009 0.723 0.006 0.938 0.003 0.689 0.006 0.706 0.009 RAKEL 0.528 0.005 0.797 0.015 0.828 0.006 0.640 0.003 0.892 0.004 0.640 0.004 0.612 0.002 Data Set win/tie/loss

Algorithm Corel5k Rcv1 Rcv1 Eurlex-sm Tmc2007 Mediamill counts for (subset 1) (subset 2) COCOA COCOA 0.719 0.004 0.889 0.003 0.882 0.002 0.957 0.002 0.930 0.001 0.843 0.001 / USAM 0.572 0.003 0.674 0.010 0.672 0.009 0.788 0.009 0.801 0.003 0.655 0.004 13/0/0/ USAM-EN 0.574 0.002 0.676 0.010 0.671 0.010 0.789 0.006 0.800 0.003 0.654 0.006 13/0/0/ SMOTE 0.597 0.004 0.625 0.009 0.620 0.008 0.795 0.005 0.793 0.003 0.669 0.002 13/0/0/ SMOTE-EN 0.596 0.004 0.626 0.006 0.620 0.009 0.795 0.004 0.793 0.003 0.670 0.002 13/0/0/ RML ML-KNN 0.590 0.005 0.718 0.009 0.710 0.009 0.887 0.004 0.849 0.003 0.767 0.001 13/0/0/ CLR 0.740 0.002 0.891 0.003 0.882 0.002 0.944 0.001 0.906 0.001 0.805 0.001 8/3/2 ECC 0.697 0.006 0.864 0.002 0.855 0.003 0.945 0.002 0.921 0.001 0.826 0.001 12/1/0 RAKEL 0.552 0.002 0.728 0.003 0.721 0.003 0.872 0.005 0.859 0.002 0.737 0.001 13/0/0/

* MACRO-AUC not applicable to RML, which does not yield real-valued outputs on each class label [Petterson and Caetano, 2010].

Undersampling (USAM): The multi-label learning problem is decomposed into q binary learning problems, and the majority class in each binary problem is randomly undersampled to form the new binary training set.

SMOTE: The multi-label learning problem is decomposed into q binary learning problems, and the minority class in each binary problem is oversampled via the SMOTE method [Chawla et al., 2002] to form the new binary training set. Considering that COCOA utilizes ensemble learning in its learning process, an ensemble version of USAM and SMOTE are also employed for comparison (named as USAM-EN and SMOTE-EN).

RML: Besides integrating binary decomposition with undersampling/oversampling, another way to handle class-imbalance is to design learning system which can directly optimize imbalance-speciﬁc metric. Here, the RML approach [Petterson and Caetano, 2010] is employed as another comparing algorithm, which maximizes macro-averaging F-measure on multi-label data via convex relaxation.

In addition to the above algorithms, the second series include several well-established multi-label learning algorithms [Zhang and Zhou, 2014], including ﬁrst-order approach MLKNN [Zhang and Zhou, 2007], second-order approach CLR [F urnkranz et al., 2008], and high-order approaches ECC [Read et al., 2011] and RAKEL [Tsoumakas et al., 2011a].

In this paper, all the comparing algorithms are instantiated as follows: 1) For USAM, SMOTE and their ensemble versions, decision tree is used as the base learner due to its popularity in class-imbalance studies [He and Garcia, 2009]. Speciﬁcally, these algorithms adopt implementations provided by the widely-used Weka platform with J48 decision tree (C4.5 implementation in Weka) as their base learner [Hall et al., 2009]; 2) For RML, the original implementation provided in the literature is used; 3) For the second series of algorithms (ML-KNN, CLR, ECC and RAKEL), we adopt their canonical implementations provided by the MULAN multi-label learning library (upon Weka platform) with suggested parameter conﬁgurations [Tsoumakas et al., 2011b]; 4) For COCOA, both the binary-class and multi-class imbalance learners (B and M) are implemented in Weka using J48 decision tree with undersampling [Hall et al., 2009]. Furthermore, the number of coupling class labels is set as K = min(q 1, 10). For fair comparison, the ensemble size for USAM-EN and SMOTE-EN is also set to be 10.

4.2 Experimental Results

Under class-imbalance scenarios, F-measure and Area Under the ROC Curve (AUC) are the mostly-used evaluation metrics which can provide more insights on the classiﬁcation performance than conventional metrics such as accuracy [He and Garcia, 2009]. In this paper, the multi-label classiﬁcation performance is accordingly evaluated by macroaveraging the metric values across all class labels [Zhang and

(a) Enron (b) Eurlex-sm

Figure 1: Performance gain between COCOA and the comparing algorithm (PGk) changes as the level of imbalance ratio (Ik) increases. On either data set, the performance of each algorithm is evaluated based on F-measure.

Zhou, 2014]. For either macro-averaging metric, the higher the metric value the better the performance. Tables 3 and 4 give the detailed experimental results in terms of each evaluation metric respectively. Each data set is randomly split for training and testing, where 50% examples are chosen to form the training set and the remaining ones form the test set. The random train/test splits are repeated for ten times and the mean metric value as well as the standard deviation are recorded. Furthermore, to show whether COCOA performs signiﬁcantly better/worse than the comparing algorithm, pairwise t-test at 1% signiﬁcance level is conducted. Accordingly, a win/loss is counted and a marker / is shown in the table whenever COCOA achieves signiﬁcantly superior/inferior performance on one data set. Otherwise, a tie is counted and no marker is given. The overall win/tie/loss counts across all data sets are summarized at the last column of each table. In terms of MACRO-F (Table 3), COCOA signiﬁcantly outperforms the comparing algorithms in 46.2% (RML), 76.9% (SMOTE, SMOTE-EN), 84.6% (USAM-EN), 92.3% (USAM, ML-KNN, ECC) and 100% (CLR, RAKEL) cases, and hasn t been outperformed by algorithms in the second series. These results indicate that COCOA is capable of achieving good balance between predictive exactness (precision) and completeness (recall) in handling class-imbalance multi-label learning. In terms of MACRO-AUC (Table 4), COCOA significantly outperforms the comparing algorithms in 61.5% (CLR), 92.3% (ECC) and 100% (USAM, USAM-EN, SMOTE, SMOTE-EN, ML-KNN, RAKEL) cases, while has only been outperformed by CLR twice. These results indicate that the real-valued functions fj( ) (1 j q) learned by COCOA is capable of yielding reasonable predictive conﬁdence, and better classiﬁcation performance can be further expected if it is combined with more sophisticated thresholding strategy other than the constant function (Table 1, Step 10). To further investigate how COCOA works under different levels of imbalance ratios, we roughly group the imbalance

ratio Im Rj into ﬁve intervals Ik (1 k 5) in ascending orders, i.e. I1 = [1, 5], I2 = [5, 10], I3 = [10, 15], I4 = [15, 25] and I5 = [25, 50]. Given one multi-label data set, let Ak denote the average performance of COCOA over class labels whose imbalance ratios fall into Ik, and Bk denote the average performance of another comparing algorithm over class labels in the same interval. Accordingly, the percentage of performance gain, i.e. PGk = [(Ak Bk)/Bk] 100%, is computed to reﬂect the relative performance between COCOA and the comparing algorithm within the given interval. Figure 1 illustrates how PGk changes as the imbalance level Ik moves from I1 to I5. Due to page limit, the performance is evaluated by choosing the F-measure metric 1 |Ik| P Im Rj Ik Fj and two data sets Enron and Eurlex-sm are considered. For brevity, the relative performance against four comparing algorithms (USAM-EN, SMOTE-EN, ECC and RAKEL) has been depicted. Similar trends can be observed for other evaluation metrics and comparing algorithms. As shown in Figure 1, COCOA maintains good relative performance against the comparing algorithms across different imbalance levels, where the curves hardly drop below the baseline (PGk = 0). Furthermore, it is interesting that the performance advantage of COCOA becomes more pronounced when the level of imbalance ratio is high (for I4 = [15, 25] and I5 = [25, 50]). These results indicate that COCOA can provide robust and preferable solutions in diverse class-imbalance scenarios.

5 Conclusion

In this paper, the class-imbalance issue in learning from multi-label data is studied. Accordingly, a novel classimbalance multi-label learning algorithm named COCOA is proposed, which works by leveraging the exploitation of label correlations and the exploration of class-imbalance. Speciﬁcally, one binary-class imbalance learner and several coupling

multi-class imbalance learners are combined to yield the predictive model. Extensive experiments across thirteen benchmark data sets show that COCOA performs favorably against the comparing algorithms, especially in terms of imbalancespeciﬁc metrics such as MACRO-F and MACRO-AUC.

Acknowledgments

The authors wish to thank the anonymous reviewers for their insightful comments and suggestions. This work was supported by the National Science Foundation of China (61175049, 61222309, 61473087), the Natural Science Foundation of Jiangsu Province (BK20141340), and the MOE Program for New Century Excellent Talents in University (NCET-13-0130).

[Chawla et al., 2002] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artiﬁcial Intelligence Research, 16:321 357, 2002.

[Dembczy nski et al., 2013] K. Dembczy nski, A. Jachnik, W. Kotłowski, W. Waegeman, and E. H ullermeier. Optimizing the F-measure in multi-label classiﬁcation: Plug-in rule approach versus structured loss minimization. In Proceedings of the 30th International Conference on Machine Learning, pages 1130 1138, Atlanta, GA, 2013.

[Fan and Lin, 2007] R.-E. Fan and C.-J. Lin. A study on threshold selection for multi-label classiﬁcation. Technical report, Department of Computer Science & Information Engineering, National Taiwan University, 2007.

[F urnkranz et al., 2008] J. F urnkranz, E. H ullermeier, E. Loza Menc ıa, and K. Brinker. Multilabel classiﬁcation via calibrated label ranking. Machine Learning, 73(2):133 153, 2008.

[Gibaja and Ventura, 2015] E. Gibaja and S. Ventura. A tutorial on multilabel learning. ACM Computing Surveys, 47(3):Article 52, 2015.

[Hall et al., 2009] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10 18, 2009.

[He and Garcia, 2009] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263 1284, 2009.

[Liu et al., 2013] X.-Y. Liu, Q.-Q. Li, and Z.-H. Zhou. Learning imbalanced multi-class data with optimal dichotomy weights. In Proceedings of the 13th IEEE International Conference on Data Mining, pages 478 487, Dallas, TX, 2013.

[Petterson and Caetano, 2010] J. Petterson and T. Caetano. Reverse multi-label learning. In Advances in Neural Information Processing Systems 23, pages 1912 1920. MIT Press, Cambridge, MA, 2010.

[Pillai et al., 2013] I. Pillai, G. Fumera, and F. Roli. Threshold optimisation for multi-label classiﬁers. Pattern Recognition, 46(7):2055 2065, 2013. [Quevedo et al., 2012] J. R. Quevedo, O. Luaces, and A. Bahamonde. Multilabel classiﬁers with a probabilistic thresholding strategy. Pattern Recognition, 45(2):876 883, 2012. [Read et al., 2011] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classiﬁer chains for multi-label classiﬁcation. Machine Learning, 85(3):333 359, 2011. [Shi et al., 2011] C. Shi, X. Kong, P. S. Yu, and B. Wang. Multi-label ensemble learning. In Lecture Notes in Artiﬁcial Intelligence 6913, pages 223 239. Springer, Berlin, 2011. [Spyromitros-Xiouﬁs et al., 2011] E. Spyromitros-Xiouﬁs, M. Spiliopoulou, G. Tsoumakas, and I. Vlahavas. Dealing with concept drift and class imbalance in multi-label stream classiﬁcation. In Proceedings of the 22nd International Joint Conference on Artiﬁcial Intelligence, pages 1583 1588, Barcelona, Spain, 2011. [Tahir et al., 2010] M. A. Tahir, J. Kittler, K. Mikolajczyk, and F. Yan. Improving multilabel classiﬁcation performance by using ensemble of multi-label classiﬁers. In Lecture Notes in Computer Science 5997, pages 11 21. Springer, Berlin, 2010. [Tahir et al., 2012] M. A. Tahir, J. Kittler, and F. Yan. Inverse random under sampling for class imbalance problem and its application to multi-label classiﬁcation. Pattern Recognition, 45(10):3738 3750, 2012. [Tsoumakas et al., 2010] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667 686. Springer, Berlin, 2010. [Tsoumakas et al., 2011a] G. Tsoumakas, I. Katakis, and I. Vlahavas. Random k-labelsets for multi-label classiﬁcation. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079 1089, 2011. [Tsoumakas et al., 2011b] G. Tsoumakas, E. Spyromitros Xiouﬁs, J. Vilcek, and I. Vlahavas. MULAN: A java library for multi-label learning. Journal of Machine Learning Research, 12(Jul):2411 2414, 2011. [Wang and Yao, 2012] S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 42(4):1119 1130, 2012. [Zhang and Zhou, 2007] M.-L. Zhang and Z.-H. Zhou. MLk NN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038 2048, 2007. [Zhang and Zhou, 2014] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819 1837, 2014.