# Cost-Sensitive Feature Selection via F-Measure Optimization Reduction

Meng Liu, Chang Xu, Yong Luo, Chao Xu, Yonggang Wen, Dacheng Tao

Key Laboratory of Machine Perception (MOE), Cooperative Medianet Innovation Center, School of Electronics Engineering and Computer Science, PKU, Beijing 100871, China
Centre for Artificial Intelligence, UTS, Sydney, NSW 2007, Australia
School of Computer Science and Engineering, NTU, 639798, Singapore
mengliu@pku.edu.cn, chang.xu@uts.edu.au, yluo180@gmail.com, xuchao@cis.pku.edu.cn, ygwen@ntu.edu.sg, dacheng.tao@uts.edu.au

## Abstract

Feature selection aims to select a small subset of high-dimensional features that leads to better learning performance, lower computational complexity, and better model readability. The class imbalance problem has been neglected by traditional feature selection methods; consequently, the selected features are biased towards the majority classes. Because of the superiority of F-measure over accuracy on imbalanced data, we propose to use F-measure as the performance measure for feature selection algorithms. As a pseudo-linear function, F-measure can be optimized by minimizing a total cost. In this paper, we present a novel cost-sensitive feature selection (CSFS) method that optimizes F-measure instead of accuracy to take the class imbalance issue into account. Features are selected according to the classifier with the optimal F-measure after solving a series of cost-sensitive feature selection sub-problems. The features selected by our method thus fully represent the characteristics of not only the majority classes but also the minority classes. Extensive experimental results on synthetic, multi-class, and multi-label datasets validate the efficiency and significance of our feature selection method.

## Introduction

Feature selection is one of the most popular dimensionality reduction techniques. It is the process of choosing a subset of relevant features from high-dimensional data according to a certain performance measure (Tang, Alelyani, and Liu 2014). Feature selection can further benefit machine learning tasks such as classification and clustering by speeding up the learning process, improving model generalization capability, and alleviating the effect of the curse of dimensionality (Nie et al. 2010). A considerable amount of research has been done during the last decade (Villela, de Castro Leite, and Neto 2015; Wang, Tang, and Liu 2015; Luo et al. 2016; Qian and Zhai 2013; Zhao et al. 2010; Xu, Tao, and Xu 2015), and the resulting methods can be divided into three groups: filter methods, wrapper methods, and embedded methods. Filter methods, such as ReliefF (Kononenko 1994), mRMR (Peng, Long, and Ding 2005), F-statistic (Liu and Motoda 2012), and Information Gain (Raileanu and Stoffel 2004), choose features relying only on the characteristics of the data. Wrapper methods use predefined classifiers as a black box to evaluate the selected features. Support vector machine recursive feature elimination (SVM-RFE) (Guyon et al. 2002) and correlation-based feature selection (CFS) (Hall and Smith 1999) are representative wrapper methods. Embedded methods embed the feature selection process into classifier training. Regularized regression-based feature selection methods (Nie et al. 2010; Han and Kim 2015) are typical embedded methods.
The methods mentioned above have been demonstrated to be effective in most situations. However, almost all of them neglect the influence of the class imbalance issue. They are designed under the implicit assumption that the data distribution or sampling is balanced, i.e., that the sample sizes of different classes are about the same. The class imbalance issue is quite common in real-world datasets, and it negatively impacts traditional feature selection methods, since they are inclined to choose features that characterize the majority classes rather than those that describe the minority classes. Neglecting class imbalance makes it harder for subsequent machine learning tasks to obtain good results, since the selected features are already biased towards the majority classes. Feature selection methods that depend on classifiers also suffer from the class imbalance problem (Nie et al. 2010; Han and Kim 2015). Taking regularized regression-based feature selection as an example, these regularization models aim to minimize the fitting errors of objective functions in which the misclassification costs of different classes are treated equally (Tang, Alelyani, and Liu 2014). The feature subset is therefore chosen to achieve the highest classification accuracy, which is not an appropriate performance measure under the imbalanced setting. Consequently, these types of methods can be referred to as cost-blind feature selection methods.

High and balanced values of precision and recall result in high F-measure performance (Parambath, Usunier, and Grandvalet 2014). F-measure is therefore a more suitable measure than accuracy in the imbalanced-classes scenario (Pillai, Fumera, and Roli 2012; Dembczynski et al. 2011). Besides the F-measure for binary classification, its variants for multi-class and multi-label classification have received much attention recently (Dembczynski et al. 2013; 2011; Ye et al. 2012; Pillai, Fumera, and Roli 2012). There is a great number of studies on optimizing these F-measures, which can be categorized into two paradigms: decision-theoretic approaches (DTA) (Lewis 1995) and empirical utility maximization (EUM) approaches. DTA approaches first estimate a probability model, which is then used to compute the optimal predictions. EUM approaches (Jansche 2005; Tsochantaridis et al. 2005) follow the structured risk minimization principle to minimize the objective function. Directly optimizing F-measure is difficult since it is non-convex, so different approximation methods are used in practice, such as maximizing a convex lower bound of F-measure for support vector machines (Tsochantaridis et al. 2005), and maximizing the expected F-measure of a probabilistic classifier using a logistic regression model (Jansche 2005). A simple yet effective method is to threshold the scores obtained by classifiers so as to maximize the F-measure empirically (Parambath, Usunier, and Grandvalet 2014; Yang 2001). Recent developments (Ye et al. 2012; Koyejo et al. 2014; Parambath, Usunier, and Grandvalet 2014; Narasimhan, Vaish, and Agarwal 2014) exploit the pseudo-linear property of F-measures by formulating them as functions of the per-class false negative/false positive rates. Through the reduction to cost-sensitive classification, the optimization of F-measures can be accomplished by solving a series of cost-sensitive classification sub-problems.
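As a concrete illustration of the empirical thresholding strategy just mentioned, the following Python sketch (our own illustration; the scores and labels are synthetic) sweeps candidate thresholds over classifier scores and keeps the one with the best empirical F-beta.

```python
import numpy as np

def f_beta_at_threshold(scores, y_true, threshold, beta=1.0):
    """Compute F-beta for binary labels in {-1, +1} at a given score threshold."""
    y_pred = np.where(scores >= threshold, 1, -1)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom > 0 else 0.0

def tune_threshold(scores, y_true, beta=1.0):
    """Pick the score threshold that empirically maximizes F-beta."""
    candidates = np.unique(scores)
    f_values = [f_beta_at_threshold(scores, y_true, t, beta) for t in candidates]
    best = int(np.argmax(f_values))
    return candidates[best], f_values[best]

# Toy usage: random scores for an imbalanced label vector (~25% positives).
rng = np.random.default_rng(0)
y = np.where(rng.random(200) < 0.25, 1, -1)
s = rng.normal(size=200) + 0.8 * (y == 1)   # scores mildly correlated with y
print(tune_threshold(s, y))
```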
By employing F-measure as the performance measure for the selected features, we present an effective cost-sensitive feature selection (CSFS) method to handle feature selection in the imbalanced-data setting. Different from existing embedded feature selection approaches (Nie et al. 2010; Han and Kim 2015), which focus on optimizing accuracy, we encourage the feature selection solution to achieve the best F-measure. Motivated by the developments (Parambath, Usunier, and Grandvalet 2014; Ye et al. 2012) showing that the F-measure optimization problem can be decomposed into a series of cost-sensitive classification problems, we modify the classifiers of regularized regression-based feature selection methods to be cost-sensitive. After solving a series of cost-sensitive feature selection problems, features are selected according to the optimal classifier with the largest F-measure. The class imbalance is therefore taken into consideration, and the selected features fully represent both the majority classes and the minority classes. Experimental results on synthetic, multi-class, and multi-label datasets confirm the efficiency of our method.

## F-Measure Optimization Reduction

We first give a brief introduction to the notation used in this paper. Matrices are written as bold uppercase letters and vectors as bold lowercase letters. Given a matrix $\mathbf{W} = [w_{ij}]$, we denote $\mathbf{w}^i$ as its $i$-th row and $\mathbf{w}_j$ as its $j$-th column. For $p > 0$, the $\ell_p$-norm of a vector $\mathbf{b} \in \mathbb{R}^n$ is defined as $\|\mathbf{b}\|_p = \left(\sum_{i=1}^{n} |b_i|^p\right)^{1/p}$. The $\ell_{p,q}$-norm of a matrix $\mathbf{W} \in \mathbb{R}^{n \times m}$ is defined as $\|\mathbf{W}\|_{p,q} = \left(\sum_{i=1}^{n} \|\mathbf{w}^i\|_p^q\right)^{1/q}$, where $p > 0$ and $q > 0$. The symbol $\odot$ denotes element-wise multiplication.

For a given binary classifier there are four possible outcomes: true positives $tp$, false positives $fp$, false negatives $fn$, and true negatives $tn$. They are represented as a confusion matrix in Figure 1(a); Figure 1(b) shows the associated cost matrix.

Figure 1: Confusion matrix and associated cost matrix of binary classification. (a) Confusion matrix: rows are predicted positive/negative, columns are actual positive/negative, with entries $tp$, $fp$, $fn$, $tn$. (b) Cost matrix: correct predictions cost $0$, a false positive costs $r$, and a false negative costs $1 + \beta^2 - r$.

F-measure can be defined in terms of the marginal probabilities of the classes and the per-class false negative/false positive probabilities. The marginal probability of label $k$ is denoted by $P_k$, and the per-class false negative and false positive probabilities of a classifier $h$ are denoted by $FN_k(h)$ and $FP_k(h)$, respectively (Parambath, Usunier, and Grandvalet 2014). These probabilities of a classifier $h$ can be summarized by its error profile $e(h)$:

$$e(h) = (FN_1(h), FP_1(h), \ldots, FN_L(h), FP_L(h)), \tag{1}$$

where $L$ is the number of labels, $e_{2k-1}$ of $e(h) \in \mathbb{R}^{2L}$ is the false negative probability of class $k$, and $e_{2k}$ is its false positive probability. In binary classification we have $FN_2 = FP_1$. Thus, for any $\beta > 0$, F-measure can be written as a function of the error profile $e$:

$$F_\beta(e) = \frac{(1 + \beta^2)(P_1 - e_1)}{(1 + \beta^2)P_1 - e_1 + e_2}. \tag{2}$$

There are several different definitions of F-measure in multi-class and multi-label classification. Specifically, we can transform the multi-class or multi-label classification problem into multiple binary classification problems, and the average over the $F_\beta$-measures of these binary problems is the macro-F-measure. According to (Parambath, Usunier, and Grandvalet 2014), the micro-F-measure $mlF_\beta$ for multi-label classification is defined as:

$$mlF_\beta(e) = \frac{(1 + \beta^2)\sum_{k=1}^{L}(P_k - e_{2k-1})}{\sum_{k=1}^{L}\left((1 + \beta^2)P_k + e_{2k} - e_{2k-1}\right)}. \tag{3}$$
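To make the error-profile formulation concrete, here is a small Python sketch (our own illustration, not code from the paper) that evaluates Eq. (2) and Eq. (3) given the class marginals and an error profile.

```python
import numpy as np

def binary_f_beta(e, P1, beta=1.0):
    """Eq. (2): F-beta from the error profile e = (FN1, FP1) and marginal P1."""
    e1, e2 = e
    return (1 + beta**2) * (P1 - e1) / ((1 + beta**2) * P1 - e1 + e2)

def multilabel_micro_f_beta(e, P, beta=1.0):
    """Eq. (3): micro-F-beta from e = (FN_1, FP_1, ..., FN_L, FP_L) and marginals P."""
    e = np.asarray(e, dtype=float)
    P = np.asarray(P, dtype=float)
    fn = e[0::2]   # e_{2k-1}: per-class false negative probabilities
    fp = e[1::2]   # e_{2k}:   per-class false positive probabilities
    num = (1 + beta**2) * np.sum(P - fn)
    den = np.sum((1 + beta**2) * P + fp - fn)
    return num / den

# Example: a binary problem with 20% positives, 5% FN rate, 10% FP rate.
print(binary_f_beta((0.05, 0.10), P1=0.20))
print(multilabel_micro_f_beta([0.05, 0.10, 0.02, 0.08], P=[0.20, 0.15]))
```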
Multi-class classification differs from multi-label classification in that only a single class can be predicted for each example. According to (Kim, Wang, and Yasunori 2013), one definition of the multi-class micro-F-measure, denoted $mcF_\beta$, can be written as:

$$mcF_\beta(e) = \frac{(1 + \beta^2)\left(1 - P_1 - \sum_{k=2}^{L} e_{2k-1}\right)}{(1 + \beta^2)(1 - P_1) - \sum_{k=2}^{L} e_{2k-1} + e_1}. \tag{4}$$

The fractional-linear F-measures presented in Eqs. (2)-(4) are pseudo-linear functions of $e$. The important property of pseudo-linear functions is that their level sets, as functions of the false negative and false positive rates, are linear.

Figure 2: Illustration of the F1-measure surface with level sets and two total cost hyperplanes. Different cost vectors $a(r_i)$ generate different costs for the misclassification errors; the error profile space contains all possible values of $e$. Higher values of F-measure entail lower values of the total cost.

Based on this observation, a recent work (Parambath, Usunier, and Grandvalet 2014) proposed to maximize F-measure by reducing it to cost-sensitive classification, and proved that the optimal classifier for a cost-sensitive classification problem with label-dependent costs is also an optimal classifier for F-measure. This method can be separated into three steps. First, the F-measure interval is discretized into a set of evenly spaced values $\{r_i\}$. F-measure is not invariant under label switching (Nie et al. 2010), i.e., if the positive label is changed to negative, a different F-measure is obtained; the interval is therefore discretized within the range $[0, 1 + \beta^2]$ rather than $[0, 1]$ in practice. Second, for each given F-measure value $r_i$, a cost function $a: \mathbb{R}_+ \to \mathbb{R}_+^{2L}$ generates a cost vector $a(r_i)$ and assigns costs to the elements of the error profile $e$: specifically, $1 + \beta^2 - r_i$ for a false negative and $r_i$ for a false positive in binary classification. These costs are shown as the cost matrix in Figure 1(b) (Parambath, Usunier, and Grandvalet 2014). The goal of optimization thus becomes minimizing the total cost $\langle a(r_i), e(h) \rangle$, the inner product of the cost vector and the error profile (Parambath, Usunier, and Grandvalet 2014). Finally, a cost-sensitive classifier is learned for each $a(r_i)$ to minimize the total cost $\langle a(r_i), e(h) \rangle$, and the one with the largest F-measure on the validation set is selected as the optimal classifier. Figure 2 shows that the higher the F-measure value, the lower the total cost. This indicates that maximizing F-measure can be achieved by minimizing the corresponding total cost.

## Cost-Sensitive Feature Selection

When the data sampling of different classes is imbalanced, it is difficult to discover a satisfactory feature selection solution that fully represents the properties of all classes. To deal with this problem, we propose to optimize F-measure instead of accuracy in the feature selection task. Motivated by the reduction of F-measure optimization to cost-sensitive classification (Parambath, Usunier, and Grandvalet 2014), we modify the classifiers used in traditional feature selection methods to be cost-sensitive by adding properly generated costs, guided by the theory above. Features are then selected according to the classifier with the optimal F-measure performance. This leads to a novel cost-sensitive feature selection (CSFS) method. Figure 3 presents a systematic illustration of our method.
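The three-step reduction can be sketched in a few lines of Python. The helper names `train_cost_sensitive` and `f_measure` below are placeholders for any cost-sensitive learner and F-measure routine; they are assumptions for illustration, not APIs from the paper.

```python
import numpy as np

def reduce_f_measure_to_costs(train_cost_sensitive, f_measure, X, y, X_val, y_val,
                              beta=1.0, num_values=20):
    """Sketch of the reduction: discretize [0, 1 + beta^2], train one cost-sensitive
    classifier per value r, and keep the one with the best validation F-measure.

    `train_cost_sensitive(X, y, sample_costs)` returns a fitted model with .predict;
    `f_measure(y_true, y_pred, beta)` returns a scalar. Both are caller-supplied.
    """
    best_model, best_f = None, -np.inf
    for r in np.linspace(0.0, 1.0 + beta**2, num_values):
        # Label-dependent costs: (1 + beta^2 - r) for positives (false negatives),
        # r for negatives (false positives), as in the binary cost matrix.
        costs = np.where(y == 1, 1.0 + beta**2 - r, r)
        model = train_cost_sensitive(X, y, costs)
        f_val = f_measure(y_val, model.predict(X_val), beta)
        if f_val > best_f:
            best_model, best_f = model, f_val
    return best_model, best_f
```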
### Problem Formulation

Given training data, let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ denote the feature matrix with $n$ samples of dimension $d$. The corresponding label matrix is $\mathbf{Y} = [\mathbf{y}_1; \ldots; \mathbf{y}_n] \in \{-1, 1\}^{n \times m}$, where $\mathbf{y}_i$ is the row vector of labels for the $i$-th example and $m$ is the number of class labels. The general formulation of regularized regression-based feature selection methods (Nie et al. 2010; Han and Kim 2015), which aim to obtain a projection matrix $\mathbf{W} \in \mathbb{R}^{d \times m}$, can be summarized as:

$$\min_{\mathbf{W}} \; \mathcal{L}(\mathbf{X}^T\mathbf{W} - \mathbf{Y}) + \lambda R(\mathbf{W}), \tag{5}$$

where $\mathcal{L}(\cdot)$ is a norm-based loss function on the prediction residual, $R(\cdot)$ is a regularizer that introduces sparsity to make $\mathbf{W}$ suitable for feature selection, and $\lambda$ is a trade-off parameter. For simplicity, the bias has been absorbed into $\mathbf{W}$ by appending a constant value 1 to the feature vector of each example. Such methods have been widely used in both multi-class and multi-label learning tasks (Nie et al. 2010; Kong and Ding 2014; Han and Kim 2015; Xu, Tao, and Xu 2016). However, they are designed to maximize classification accuracy, which is unsuitable for highly imbalanced class distributions (Parambath, Usunier, and Grandvalet 2014), since equal costs are assigned to different classes.

To solve the class imbalance problem, we present a new feature selection method that optimizes F-measure by making the classifiers of regularized regression-based feature selection cost-sensitive. Without loss of generality, we start with cost-sensitive feature selection under a binary-class setting, where the label vector is $[y_1; y_2; \ldots; y_n] \in \{-1, 1\}^{n \times 1}$. As mentioned previously, the cost for the positive class is $1 + \beta^2 - r$ and the cost for the negative class is $r$. Thus we obtain a cost vector $\mathbf{c} = [c_1, \ldots, c_n]^T \in \mathbb{R}^n$, where $c_i = 1 + \beta^2 - r$ if $y_i = 1$, and $c_i = r$ if $y_i = -1$. The total cost over all samples can be formulated as:

$$\min_{\mathbf{w}} \; \sum_{i=1}^{n} \mathcal{L}\big((\mathbf{x}_i^T\mathbf{w} - y_i)\, c_i\big) + \lambda R(\mathbf{w}), \tag{6}$$

where $\mathbf{w} \in \mathbb{R}^{d \times 1}$ is the projection vector. In the multi-class and multi-label scenarios, the cost vector $\mathbf{c}_i \in \mathbb{R}^n$ for the $i$-th class is obtained according to the per-class false negative/false positive costs generated by the corresponding cost function $a(r)$. Denoting the cost matrix as $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_m] \in \mathbb{R}^{n \times m}$, we obtain the following formulation:

$$\min_{\mathbf{W}} \; \sum_{i=1}^{n} \mathcal{L}\big((\mathbf{x}_i^T\mathbf{W} - \mathbf{y}_i) \odot \mathbf{c}^i\big) + \lambda R(\mathbf{W}), \tag{7}$$

where $\mathbf{c}^i$ is the $i$-th row of $\mathbf{C}$, corresponding to the $i$-th example.

Figure 3: System diagram of the proposed cost-sensitive feature selection (CSFS) model in the binary classification case. The model has four stages. (1) Discretize the F-measure interval to obtain a set of evenly spaced values $\{r_1, \ldots, r_n\}$. (2) For a given $r_i$, the cost function $a(r_i)$ generates cost $1 + \beta^2 - r_i$ for false negatives and $r_i$ for false positives, yielding a series of cost-sensitive classifiers. (3) Select the optimal classifier with the largest F-measure on the validation set. (4) Select the top-ranking features according to the projection matrix $\mathbf{W}$ of the optimal classifier by sorting $\|\mathbf{w}^i\|$ ($1 \le i \le d$) in descending order.
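As an illustration of how the cost matrix $\mathbf{C}$ in Eq. (7) could be assembled, the sketch below assigns each entry the false-negative cost $1 + \beta^2 - r$ or the false-positive cost $r$ depending on the label; using the same $r$ for every class is our simplifying assumption for this sketch.

```python
import numpy as np

def build_cost_matrix(Y, r, beta=1.0):
    """Assemble an n x m cost matrix C for a given F-measure value r.

    Y is the n x m label matrix with entries in {-1, +1}. For each class, examples
    carrying the positive label receive cost (1 + beta^2 - r) (the false-negative
    cost) and the remaining examples receive cost r (the false-positive cost).
    """
    Y = np.asarray(Y)
    return np.where(Y == 1, 1.0 + beta**2 - r, r).astype(float)

# Toy usage: 5 examples, 2 classes, imbalanced labels.
Y = np.array([[ 1, -1],
              [-1, -1],
              [-1,  1],
              [-1, -1],
              [ 1, -1]])
print(build_cost_matrix(Y, r=0.8))
```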
Due to its rotational invariance and robustness to outliers (Nie et al. 2010), we adopt an $\ell_2$-norm based loss as the specific form of $\mathcal{L}(\cdot)$, and the optimization problem becomes:

$$\min_{\mathbf{W}} \; \sum_{i=1}^{n} \big\|(\mathbf{x}_i^T\mathbf{W} - \mathbf{y}_i) \odot \mathbf{c}^i\big\|_2 + \lambda R(\mathbf{W}). \tag{8}$$

By further noting that

$$\sum_{i=1}^{n} \big\|(\mathbf{x}_i^T\mathbf{W} - \mathbf{y}_i) \odot \mathbf{c}^i\big\|_2 = \big\|(\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1}, \tag{9}$$

and taking the commonly used $\ell_{2,1}$-norm as the regularizer (Nie et al. 2010), we obtain the compact form of the cost-sensitive feature selection (CSFS) optimization problem:

$$\min_{\mathbf{W}} \; \big\|(\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1} + \lambda \|\mathbf{W}\|_{2,1}. \tag{10}$$

As shown in Figure 3, we obtain a series of cost-sensitive feature selection problems with a different cost matrix $\mathbf{C}$ for each F-measure value $r$. After obtaining the optimal $\mathbf{W}$, features can be selected by sorting $\|\mathbf{w}^i\|$ ($1 \le i \le d$) in descending order. If $\mathbf{w}^i$ shrinks to zero, the $i$-th feature is less important and will not be selected.

### Optimization

For a given F-measure value $r$, the corresponding cost matrix $\mathbf{C}$ is fixed and thus $\mathbf{W}$ is the only variable in Eq. (10). Taking the derivative of the objective function with respect to $\mathbf{w}_k$ ($1 \le k \le m$) and setting it to zero, we obtain¹:

$$\mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k\mathbf{X}^T\mathbf{w}_k - \mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k\mathbf{y}_k + \lambda\mathbf{D}\mathbf{w}_k = \mathbf{0}, \tag{11}$$

where $\mathbf{U}_k = \mathrm{diag}(\mathbf{c}_k)$ is a diagonal matrix, $\mathbf{D}$ is a diagonal matrix with $i$-th diagonal element $d_{ii} = \frac{1}{2\|\mathbf{w}^i\|_2}$, and $\mathbf{G}$ is a diagonal matrix with $i$-th diagonal element $g_{ii} = \frac{1}{2\|((\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C})^i\|_2}$. Each $\mathbf{w}_k$ can thus be solved in closed form:

$$\mathbf{w}_k = \big(\lambda\mathbf{D} + \mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k\mathbf{X}^T\big)^{-1}(\mathbf{X}\mathbf{U}_k\mathbf{G}\mathbf{U}_k)\mathbf{y}_k. \tag{12}$$

Since the solution of $\mathbf{W}$ depends on $\mathbf{D}$ and $\mathbf{G}$, we develop an iterative algorithm to obtain the ideal $\mathbf{D}$ and $\mathbf{G}$. The whole optimization procedure is described in Algorithm 1. In each iteration, $\mathbf{D}$ and $\mathbf{G}$ are calculated with the current $\mathbf{W}$, and then each column vector $\mathbf{w}_k$ of $\mathbf{W}$ is updated based on the newly computed $\mathbf{D}$ and $\mathbf{G}$. The iteration is repeated until the convergence criterion is reached. The convergence of Algorithm 1 is guaranteed by the following theorem:

Theorem 1. Algorithm 1 monotonically decreases the objective value of Eq. (10) in each iteration, that is,

$$\big\|(\mathbf{X}^T\mathbf{W}_{t+1} - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1} + \lambda\|\mathbf{W}_{t+1}\|_{2,1} \le \big\|(\mathbf{X}^T\mathbf{W}_{t} - \mathbf{Y}) \odot \mathbf{C}\big\|_{2,1} + \lambda\|\mathbf{W}_{t}\|_{2,1}. \tag{13}$$

Due to the limited space, the proof of Theorem 1 is not presented here. In a nutshell, according to (Nie et al. 2010), the objective value of Eq. (10) monotonically decreases in each iteration.

¹ When $\|\mathbf{w}^i\|_2 = 0$, Eq. (10) is not differentiable. This problem can be solved by introducing a small perturbation to regularize $d_{ii}$ as $\frac{1}{2\sqrt{\|\mathbf{w}^i\|_2^2 + \zeta}}$. Similarly, the $i$-th diagonal element of $\mathbf{G}$ can be regularized as $g_{ii} = \frac{1}{2\sqrt{\|((\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C})^i\|_2^2 + \zeta}}$. It can be verified that the derived algorithm then minimizes $\sum_{i=1}^{n} \sqrt{\|((\mathbf{X}^T\mathbf{W} - \mathbf{Y}) \odot \mathbf{C})^i\|_2^2 + \zeta} + \lambda \sum_{i=1}^{d} \sqrt{\|\mathbf{w}^i\|_2^2 + \zeta}$, which reduces to Eq. (10) as $\zeta \to 0$.

Algorithm 1: An iterative algorithm to solve the optimization problem in Eq. (10).

Input: $\mathbf{X} \in \mathbb{R}^{d \times n}$, $\mathbf{Y} \in \mathbb{R}^{n \times m}$, and $\mathbf{C} \in \mathbb{R}^{n \times m}$.
Output: $\mathbf{W} \in \mathbb{R}^{d \times m}$.
1: Initialize $\mathbf{W}_0$ as a random matrix, $t = 0$.
2: while not converged do
3:   Update the diagonal matrix $\mathbf{D}_{t+1}$ with $i$-th diagonal element $\frac{1}{2\|\mathbf{w}^i_t\|_2}$.
4:   Update the diagonal matrix $\mathbf{G}_{t+1}$ with $i$-th diagonal element $\frac{1}{2\|((\mathbf{X}^T\mathbf{W}_t - \mathbf{Y}) \odot \mathbf{C})^i\|_2}$.
5:   for $k = 1$ to $m$ do
6:     $\mathbf{U}_k = \mathrm{diag}(\mathbf{c}_k)$.
7:     $(\mathbf{w}_{t+1})_k = \big(\lambda\mathbf{D}_{t+1} + \mathbf{X}\mathbf{U}_k\mathbf{G}_{t+1}\mathbf{U}_k\mathbf{X}^T\big)^{-1}(\mathbf{X}\mathbf{U}_k\mathbf{G}_{t+1}\mathbf{U}_k)\mathbf{y}_k$.
8:   end for
9:   $t = t + 1$.
10: end while
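Below is a minimal NumPy sketch of the iterative solver in Algorithm 1, written for illustration with the footnote-1 smoothing; the random initialization and the relative-change stopping rule are our own choices.

```python
import numpy as np

def csfs_solve(X, Y, C, lam=1.0, max_iter=50, tol=1e-5, zeta=1e-8):
    """Sketch of Algorithm 1: minimize ||(X^T W - Y) * C||_{2,1} + lam * ||W||_{2,1}.

    X: d x n features, Y: n x m labels in {-1, +1}, C: n x m cost matrix.
    zeta is the small perturbation from footnote 1.
    """
    d, n = X.shape
    m = Y.shape[1]
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d, m))
    prev_obj = np.inf
    for _ in range(max_iter):
        R = (X.T @ W - Y) * C                                   # cost-weighted residual (n x m)
        D = 1.0 / (2.0 * np.sqrt(np.sum(W**2, axis=1) + zeta))  # d_ii, one per row of W
        G = 1.0 / (2.0 * np.sqrt(np.sum(R**2, axis=1) + zeta))  # g_ii, one per example
        for k in range(m):
            u = C[:, k]                           # diagonal of U_k
            s = u * G * u                         # diagonal of U_k G U_k
            A = lam * np.diag(D) + (X * s) @ X.T  # lam*D + X U_k G U_k X^T
            b = (X * s) @ Y[:, k]                 # X U_k G U_k y_k
            W[:, k] = np.linalg.solve(A, b)       # closed-form update, Eq. (12)
        obj = (np.sum(np.sqrt(np.sum(((X.T @ W - Y) * C)**2, axis=1)))
               + lam * np.sum(np.sqrt(np.sum(W**2, axis=1))))
        if abs(prev_obj - obj) < tol * max(1.0, abs(prev_obj)):
            break
        prev_obj = obj
    return W

def rank_features(W):
    """Rank features by the row norms of W, largest first (stage 4 in Figure 3)."""
    return np.argsort(-np.linalg.norm(W, axis=1))
```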
### Complexity Analysis

In Algorithm 1, steps 3 and 4 compute diagonal elements, which is computationally trivial, so the complexity mainly depends on the matrix multiplication and inversion in step 7. By using sparse matrix multiplication and avoiding dense intermediate matrices, the complexity of updating each $(\mathbf{w}_{t+1})_k$ is $O(d^2(n + d))$. Thus the complexity of the proposed algorithm is $O(Ttmd^2(n + d))$, where $t$ is the number of iterations and $T$ is the number of discretized F-measure values. Empirically, the convergence of Algorithm 1 is rapid and $t$ is usually less than 50; moreover, $T$ is usually less than 20. The proposed algorithm is therefore quite efficient.

## Experiments

Extensive experiments are conducted on synthetic, multi-class, and multi-label datasets. For multi-class classification, we use two datasets: the handwritten digit dataset USPS and the face image dataset Yale B (both from http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html). For multi-label classification, we use the MSRCv2 (http://research.microsoft.com/en-us/projects/objectclassrecognition/) and TRECVID2005 (http://www-nlpir.nist.gov/projects/tv2005/) datasets. Following previous works (Kong and Ding 2014; Kong et al. 2012), 384-dimensional color moment features are extracted on MSRC and 512-dimensional GIST features on TRECVID. For each dataset, we randomly select 1/3 of the training samples for validation to tune the hyperparameters. For datasets that do not have a separate test set, the data is first split to keep 1/4 for testing. A summary of the multi-class and multi-label datasets is shown in Table 1.

Table 1: Information of the multi-class and multi-label datasets.

| Type        | Dataset | Classes | Samples | Features |
|-------------|---------|---------|---------|----------|
| Multi-class | USPS    | 10      | 9258    | 256      |
| Multi-class | Yale B  | 38      | 2414    | 1024     |
| Multi-label | MSRC    | 23      | 591     | 384      |
| Multi-label | TRECVID | 39      | 3721    | 512      |

During training, the parameter $\lambda$ in our method is tuned over the range $\{10^{-6}, 10^{-5}, \ldots, 10^{6}\}$, and the number of selected features is set as $\{20, 30, \ldots, 120\}$. To fairly compare the different feature selection methods, classification experiments are conducted on all datasets using 5-fold cross-validation with a linear-kernel SVM and parameter $C = 1$. We repeat the experiments 10 times with random seeds for generating the validation sets. Both the mean and standard deviation of the accuracy and F1-measures are reported.

### Synthetic Data

To demonstrate the advantage of cost-sensitive feature selection over traditional cost-blind feature selection, a toy experiment is performed to show the influence of the costs on the selected features. We construct a two-dimensional binary-class synthetic dataset based on two different uniform distributions, as shown in Figure 4. The ratio of the majority class to the minority class is 3:1. In this experiment, the majority class is treated as the positive class and the minority class as the negative class. For a given linear classifier, each coefficient of its projection vector $\mathbf{w}$ corresponds to one feature weight (e.g., $w_1$ for $x_1$), and the features with larger coefficients are selected. The projection vector $\mathbf{w}$ varies with the costs assigned to the two classes. In Figure 4(a), the cost of the majority class is larger than the cost of the minority class when $r < 1$. In Figure 4(b), the costs of both classes are the same when $r = 1$; in this case, cost-sensitive feature selection degenerates to cost-blind feature selection. When $r > 1$, as shown in Figure 4(c), the cost of the majority class is smaller than the cost of the minority class. It is worth noting that in this last case the weight of feature $x_1$ is larger than the weight of feature $x_2$, unlike in the first two settings. Therefore, different features will be selected by different cost-sensitive feature selection problems.

Figure 4: Illustration of how costs influence the feature weights on the two-dimensional synthetic dataset. (a) $r = 0.8$ (majority cost 1.2, minority cost 0.8): $|w_1| : |w_2| = 1 : 8.75$. (b) $r = 1$ (both costs 1.0): $|w_1| : |w_2| = 1 : 1.12$. (c) $r = 1.2$ (majority cost 0.8, minority cost 1.2): $|w_1| : |w_2| = 3.93 : 1$.
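The flavor of this toy experiment can be reproduced with the sketch below. The uniform boxes, the 3:1 sample sizes, and the cost-weighted ridge regression used as a stand-in for the CSFS objective in Eq. (10) are all our own assumptions for illustration; the printed ratios are not the paper's reported values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced 2D toy data (3:1 majority:minority), drawn from two uniform boxes.
n_maj, n_min = 300, 100
X_maj = rng.uniform(low=[0.0, 0.0], high=[4.0, 1.0], size=(n_maj, 2))
X_min = rng.uniform(low=[0.0, 1.0], high=[1.0, 4.0], size=(n_min, 2))
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.ones(n_maj), -np.ones(n_min)])   # majority = positive class

def cost_weighted_fit(X, y, costs, lam=1e-2):
    """Cost-weighted ridge regression as a simple stand-in for the CSFS objective."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])        # absorb the bias term
    A = (Xb * costs[:, None]).T @ Xb + lam * np.eye(Xb.shape[1])
    b = (Xb * costs[:, None]).T @ y
    return np.linalg.solve(A, b)[:2]                     # return the two feature weights

beta = 1.0
for r in (0.8, 1.0, 1.2):
    costs = np.where(y == 1, 1.0 + beta**2 - r, r)       # false-negative vs. false-positive cost
    w = cost_weighted_fit(X, y, costs)
    print(f"r = {r}: |w1| : |w2| = {abs(w[0]):.3f} : {abs(w[1]):.3f}")
```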
### Multi-Class Datasets

On the multi-class datasets, CSFS is compared with several popular and representative multi-class feature selection methods: ReliefF (Kononenko 1994), Information Gain (IG) (Raileanu and Stoffel 2004), mRMR (Peng, Long, and Ding 2005), F-statistic (Liu and Motoda 2012), and RFS (Nie et al. 2010). The multi-class classification results in terms of micro-F1-measure and accuracy are shown in Figure 5. Table 2 reports the results of the different feature selection methods at their best dimensions. We observe that: (1) the proposed CSFS is consistently superior to the other multi-class feature selection methods in terms of micro-F1-measure on both the USPS and Yale B datasets; (2) in terms of accuracy, CSFS outperforms the other methods on most of the feature subsets.

Figure 5: Multi-class classification results using SVM in terms of multi-class micro-F1-measure and accuracy on USPS and Yale B, for 20 to 120 selected features, comparing ReliefF, IG, mRMR, F-statistic, RFS, and CSFS.

Table 2: Multi-class micro-F1-measure (% ± std) and accuracy (% ± std) of multi-class feature selection methods.

| Method      | Micro-F1 (USPS) | Micro-F1 (Yale B) | Accuracy (USPS) | Accuracy (Yale B) |
|-------------|-----------------|-------------------|-----------------|-------------------|
| ReliefF     | 87.25 ± 0.71    | 39.09 ± 6.55      | 98.13 ± 1.39    | 94.22 ± 0.34      |
| IG          | 83.51 ± 1.06    | 39.13 ± 1.47      | 97.35 ± 2.96    | 93.44 ± 1.22      |
| mRMR        | 88.30 ± 0.87    | 49.80 ± 9.44      | 98.23 ± 0.76    | 95.00 ± 0.85      |
| F-statistic | 88.36 ± 0.84    | 42.64 ± 1.13      | 98.20 ± 1.69    | 92.78 ± 2.70      |
| RFS         | 89.54 ± 0.62    | 48.68 ± 8.54      | 98.50 ± 0.95    | 95.56 ± 1.51      |
| CSFS        | 91.56 ± 0.56    | 53.83 ± 1.63      | 98.50 ± 0.56    | 96.72 ± 0.41      |

### Multi-Label Datasets

On each multi-label dataset, CSFS is compared with five competitive multi-label feature selection methods: multi-label ReliefF (MLReliefF) (Kong et al. 2012), multi-label F-statistic (MLF-statistic) (Kong et al. 2012), information-theoretic feature ranking (ITFR) (Lee and Kim 2015), non-convex feature selection (NCFS) (Kong and Ding 2014), and RFS (Nie et al. 2010). In particular, RFS can be extended to the multi-label feature selection task (Kong and Ding 2014). Figure 6 shows the classification results in terms of multi-label micro-F1-measure and accuracy on the MSRC and TRECVID datasets. Table 3 reports the results of each feature selection method at its best performing dimension.

Figure 6: Multi-label classification results using SVM in terms of multi-label micro-F1-measure and accuracy on MSRC and TRECVID, for 20 to 120 selected features, comparing MLReliefF, MLF-statistic, ITFR, NCFS, RFS, and CSFS.

Table 3: Multi-label micro-F1-measure (% ± std) and accuracy (% ± std) of multi-label feature selection methods.
| Method        | Micro-F1 (MSRC) | Micro-F1 (TRECVID) | Accuracy (MSRC) | Accuracy (TRECVID) |
|---------------|-----------------|--------------------|-----------------|--------------------|
| MLReliefF     | 63.96 ± 0.49    | 45.25 ± 0.71       | 63.57 ± 1.41    | 62.46 ± 0.88       |
| MLF-statistic | 59.99 ± 1.87    | 43.51 ± 0.36       | 62.15 ± 1.50    | 59.55 ± 4.27       |
| ITFR          | 63.24 ± 1.07    | 47.23 ± 0.59       | 65.70 ± 0.32    | 60.46 ± 3.75       |
| NCFS          | 67.50 ± 1.96    | 49.23 ± 0.80       | 69.32 ± 1.64    | 64.04 ± 3.57       |
| RFS           | 68.29 ± 0.93    | 49.54 ± 0.62       | 70.75 ± 1.02    | 63.53 ± 0.77       |
| CSFS          | 70.88 ± 0.77    | 51.56 ± 0.56       | 72.32 ± 0.46    | 65.42 ± 0.43       |

From the results, we can observe that: (1) the methods using joint sparse regularization, such as CSFS, NCFS, and RFS, perform better than the feature selection methods that only use the statistical information of the original features. This is because the projection matrices of these methods are determined jointly during the optimization procedure, so the corresponding features are selected in a way that avoids high correlation (Han and Kim 2015); (2) our method outperforms these methods significantly under the F-measure criterion without an obvious decrease in accuracy. In particular, our method outperforms the other methods by a relative improvement of 3%-10% in terms of micro-F1-measure.

## Conclusion

In this paper, we proposed a cost-sensitive feature selection (CSFS) method that optimizes F-measure instead of accuracy to tackle the class imbalance problem. Because they neglect the class imbalance issue, traditional feature selection methods such as regularized regression-based methods usually select the feature subset that maximizes classification accuracy, so the selected features are biased towards the majority classes. Under the imbalanced-classes setting, F-measure is a more suitable performance measure than accuracy. Motivated by the reduction of F-measure optimization to cost-sensitive classification, we modified the classifiers of regularized regression-based feature selection to be cost-sensitive by generating and assigning different costs to each class. Features are selected according to the classifier with the optimal F-measure, so the selected features fully represent all classes. Extensive experiments on synthetic, multi-class, and multi-label datasets demonstrate the effectiveness of our method.

## Acknowledgements

This research is partially supported by grants from NSFC 61375026 and 2015BAF15B00, and ARC FT-130101457, DP-140102164 and LE-140100061.

## References

Dembczynski, K. J.; Waegeman, W.; Cheng, W.; and Hüllermeier, E. 2011. An exact algorithm for F-measure maximization. In NIPS, 1404–1412.
Dembczynski, K. J.; Jachnik, A.; Kotlowski, W.; Waegeman, W.; and Hüllermeier, E. 2013. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In ICML, 1130–1138.
Guyon, I.; Weston, J.; Barnhill, S.; and Vapnik, V. 2002. Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3):389–422.
Hall, M. A., and Smith, L. A. 1999. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In FLAIRS, 235–239.
Han, D., and Kim, J. 2015. Unsupervised simultaneous orthogonal basis clustering feature selection. In CVPR, 5016–5023.
Jansche, M. 2005. Maximum expected F-measure training of logistic regression models. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 692–699.
Kim, J.; Wang, Y.; and Yasunori, Y. 2013. The GENIA event extraction shared task, 2013 edition: Overview. In Proceedings of the BioNLP Shared Task 2013 Workshop, 8–15.
Kong, D., and Ding, C. 2014. Non-convex feature learning via ℓp,∞ operator. In AAAI, 1918–1924.
Kong, D.; Ding, C.; Huang, H.; and Zhao, H. 2012. Multi-label ReliefF and F-statistic feature selections for image annotation. In CVPR, 2352–2359.
Kononenko, I. 1994. Estimating attributes: Analysis and extensions of RELIEF. In ECML, 171–182.
Koyejo, O. O.; Natarajan, N.; Ravikumar, P. K.; and Dhillon, I. S. 2014. Consistent binary classification with generalized performance metrics. In NIPS, 2744–2752.
Lee, J., and Kim, D. 2015. Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognition 48(9):2761–2771.
Lewis, D. D. 1995. Evaluating and optimizing autonomous text classification systems. In SIGIR, 246–254.
Liu, H., and Motoda, H. 2012. Feature Selection for Knowledge Discovery and Data Mining, volume 454. Springer Science & Business Media.
Luo, Y.; Wen, Y.; Tao, D.; Gui, J.; and Xu, C. 2016. Large margin multi-modal multi-task feature extraction for image classification. IEEE Transactions on Image Processing 25(1):414–427.
Narasimhan, H.; Vaish, R.; and Agarwal, S. 2014. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NIPS, 1493–1501.
Nie, F.; Huang, H.; Cai, X.; and Ding, C. H. 2010. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In NIPS, 1813–1821.
Parambath, S. P.; Usunier, N.; and Grandvalet, Y. 2014. Optimizing F-measures by cost-sensitive classification. In NIPS, 2123–2131.
Peng, H.; Long, F.; and Ding, C. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. TPAMI 27(8):1226–1238.
Pillai, I.; Fumera, G.; and Roli, F. 2012. F-measure optimisation in multi-label classifiers. In ICPR, 2424–2427.
Qian, M., and Zhai, C. 2013. Robust unsupervised feature selection. In IJCAI, 1621–1627.
Raileanu, L. E., and Stoffel, K. 2004. Theoretical comparison between the Gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence 41(1):77–93.
Tang, J.; Alelyani, S.; and Liu, H. 2014. Feature selection for classification: A review. Data Classification: Algorithms and Applications, 37.
Tsochantaridis, I.; Joachims, T.; Hofmann, T.; and Altun, Y. 2005. Large margin methods for structured and interdependent output variables. JMLR 6:1453–1484.
Villela, S. M.; de Castro Leite, S.; and Neto, R. F. 2015. Feature selection from microarray data via an ordered search with projected margin. In IJCAI, 3874–3881.
Wang, S.; Tang, J.; and Liu, H. 2015. Embedded unsupervised feature selection. In AAAI, 470–476.
Xu, C.; Tao, D.; and Xu, C. 2015. Large-margin multi-label causal feature learning. In AAAI, 1924–1930.
Xu, C.; Tao, D.; and Xu, C. 2016. Robust extreme multi-label learning. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 1275–1284.
Yang, Y. 2001. A study of thresholding strategies for text categorization. In SIGIR, 137–145.
Ye, N.; Chai, K. M. A.; Lee, W. S.; and Chieu, H. L. 2012. Optimizing F-measures: A tale of two approaches. In ICML.
Zhao, Z.; Wang, L.; Liu, H.; et al. 2010. Efficient spectral feature selection with minimum redundancy. In AAAI, 673–678.