# Worst-Case Discriminative Feature Selection

Shuangli Liao¹, Quanxue Gao¹, Feiping Nie², Yang Liu¹ and Xiangdong Zhang¹

¹State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an, 710071, China
²School of Computer Science, OPTIMAL, Northwestern Polytechnical University, Xi'an, 710072, China

15829069358@163.com, qxgao@xidian.edu.cn, {feipingnie,liuyangxidian}@gmail.com, xdchen@mail.xidian.edu.cn

Contact Authors: Q. Gao (qxgao@xidian.edu.cn) and F. Nie (feipingnie@gmail.com)

## Abstract

Feature selection plays a critical role in data mining, driven by the ever-increasing feature dimensionality of target problems. In this paper, we propose a new criterion for discriminative feature selection, worst-case discriminative feature selection (WDFS). Unlike Fisher Score and other methods based on discriminative criteria that consider the overall (or average) separation of the data, WDFS adopts a new perspective, called the worst-case view, which is arguably more suitable for classification applications. Specifically, WDFS directly maximizes the ratio of the minimum between-class variance over all class pairs to the maximum within-class variance, and thus it duly considers the separation of all classes. For the solution, we take a greedy strategy that finds one feature at a time, which is very easy to implement and effective. Moreover, we utilize the correlation between features to help reduce redundancy, and extend WDFS to uncorrelated WDFS (UWDFS). To evaluate the effectiveness of the proposed algorithm, we conduct classification experiments on many real data sets. In the experiments, we use the original features and the score vectors of features over all class pairs, respectively, to calculate the correlation coefficients, and analyze the experimental results for both choices. Experimental results demonstrate the effectiveness of WDFS and UWDFS.

## 1 Introduction

Nowadays, the dimensionality of the data involved in many real-world applications has increased explosively. The presence of many irrelevant and redundant features [Ben-Bassat, 1982] tends to make a learning model overfit, degrading its performance. Dimensionality reduction is one of the most popular techniques to help address this problem. Feature selection is a widely employed technique for reducing dimensionality among practitioners. It aims to select a small number of significant and useful original features without any transformation. Thus, feature selection maintains the original representation of the variables, which leads to better readability and interpretability. Many efforts have been devoted to research in feature selection during the past few years [Romero and Sopena, 2008] [Pena and Nilsson, 2010]. According to how the learning algorithm is integrated into the process of evaluating and selecting features, feature selection methods can be roughly categorized as wrapper methods [Kohavi and John, 1997] [Wang et al., 2008], embedded methods [Nie et al., 2010] [Cai et al., 2013], and filter methods. Wrapper and embedded methods are tightly coupled with a specific classifier; they usually perform better than filter methods on that particular classifier, but not on other classifiers. Different from wrapper and embedded methods, filter-type methods, which evaluate features based on a certain criterion, are completely independent of any classifier.
Most existing filter-type feature selection methods conduct the selection process in the manner of feature ranking [Robnik et al., 2003] [Raileanu and Stoffel, 2004] or feature subset evaluation [Nie et al., 2008]. Methods based on feature ranking compute the relevance of a feature with respect to the class label distribution of the data. In this paper, we propose a new filter-type feature selection method in the manner of feature ranking.

For the classification problem, feature selection aims to select a subset of highly discriminative features. In other words, it selects features that are capable of discriminating samples that belong to different classes. The Fisher discriminant criterion [Fisher, 1936] is probably the most widely used one. One of the most representative algorithms is Fisher Score [He et al., 2005], which optimizes the so-called Fisher criterion by maximizing the ratio of between-class scatter to within-class scatter at the individual feature level. On this basis, the Laplacian Score [He et al., 2005] also considers a feature's ability to preserve the local manifold structure. Trace Ratio [Nie et al., 2008] was then proposed to improve the computational efficiency of Fisher Score; it selects a feature subset based on the corresponding subset-level score, which is calculated in a trace-ratio form. Moreover, there are many feature selection methods that combine the popular transformation-based dimensionality reduction method, linear discriminant analysis (LDA), with sparsity regularization [Masaeli et al., 2010]. For example, Discriminative Feature Selection (DFS) [Tao et al., 2016] imposes row sparsity on the transformation matrix of LDA through ℓ2,1-norm regularization to achieve feature selection. Zhang et al. proposed a self-weighted supervised discriminative feature selection (SSDFS) method [Zhang et al., 2018], which constrains the transformation matrix to be orthogonal and introduces ℓ2,1-norm regularization to select features.

However, according to the definition of between-class scatter, the aforementioned methods actually maximize the average of all pairwise distances between classes [Bian and Tao, 2011] [Su et al., 2018]. This may cause the so-called class separation problem [Loog et al., 2001]. Specifically, these methods tend to pay close attention to classes with larger distances but ignore those with smaller distances, resulting in the overlap of neighboring classes in the lower-dimensional space. An example illustrating the class separation problem is shown in Figure 1, where class 1 and class 2 are located close to each other while class 3 is far away from them, and all classes have the same covariance. We can directly see that Feature 2 easily separates the three classes, while Feature 1 would result in a complete confusion of class 1 and class 2. However, according to the multi-class Fisher criterion [Rao, 1948], the average of pairwise distances between classes is maximized by Feature 1.

*Figure 1: An illustration of the class separation problem.*

Because many feature selection methods based on discriminative criteria consider only the overall (or average) separation of the data, they cannot guarantee the separation (as good as possible) of every class pair.
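To make the class separation problem concrete, the following minimal sketch (not taken from the paper; all class means, sample sizes, and variances are illustrative choices) builds a Figure-1-like layout with three Gaussian classes and compares, per feature, the average-view Fisher ratio with a worst-case ratio of the kind advocated below.

```python
# Illustrative sketch of the class separation problem (synthetic data, not the
# paper's experiments). Feature 1 separates class 3 well but confuses classes
# 1 and 2; Feature 2 keeps every pair of classes apart.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # samples per class (hypothetical)
means = np.array([[0.0, 0.0],    # class 1: (feature 1, feature 2)
                  [1.0, 6.0],    # class 2
                  [20.0, 3.0]])  # class 3
X = np.vstack([rng.normal(m, 1.0, size=(n, 2)) for m in means])
y = np.repeat([0, 1, 2], n)

def fisher_score(X, y):
    """Average view: between-class scatter / within-class scatter, per feature."""
    mu = X.mean(axis=0)
    sb = sum(np.sum(y == c) * (X[y == c].mean(axis=0) - mu) ** 2 for c in np.unique(y))
    sw = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in np.unique(y))
    return sb / sw

def worst_case_score(X, y):
    """Worst-case view: smallest pairwise between-class gap / largest within-class scatter."""
    classes = np.unique(y)
    mus = np.array([X[y == c].mean(axis=0) for c in classes])
    sw = np.array([((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in classes])
    gaps = np.array([(mus[i] - mus[j]) ** 2
                     for i in range(len(classes)) for j in range(i + 1, len(classes))])
    return gaps.min(axis=0) / sw.max(axis=0)

print("Fisher score per feature:    ", fisher_score(X, y))      # favors Feature 1
print("Worst-case score per feature:", worst_case_score(X, y))  # favors Feature 2
```

On this toy layout, the Fisher ratio prefers Feature 1 even though it collapses classes 1 and 2, while the worst-case ratio prefers Feature 2, which keeps every class pair apart.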
Thus, we propose a new criterion for discriminative feature selection, namely worst-case discriminative feature selection (WDFS), which adopts a worst-case view [Yu and Yeung, 2010] rather than the average view. Specifically, WDFS directly maximizes the ratio of the minimum between-class variance over all class pairs to the maximum within-class variance, and thus it duly considers the separation of all classes. For the solution, we take a greedy strategy that finds one feature at a time, which is very easy to implement. Moreover, in order to reduce the redundancy of the selected feature subset, we extend WDFS to uncorrelated WDFS (UWDFS) with the help of the correlation between features. Finally, in the experiments, we use the original features and the score vectors of features over all class pairs, respectively, to calculate the correlation coefficients, and analyze the experimental results for both choices. Our contributions are summarized as follows:

1. Our method WDFS duly considers the separation of all classes; it adopts a new view, called the worst-case view, that differs from the conventional average view.
2. Considering the feature evaluation mechanism of the WDFS model, we propose a method to help reduce redundancy, which only needs to calculate the correlation coefficients between features or between feature score vectors.
3. Although we take a greedy strategy that finds one feature at a time, it is very easy to implement and has low computational complexity.

## 2 Related Works

### 2.1 Fisher Score

Given a set of $N$ data points $\{x_k \in \mathbb{R}^{d \times 1} \mid k = 1, \ldots, N\}$ sampled from $C$ classes, the between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ are defined as

$$S_b = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T \tag{1}$$

$$S_w = \sum_{i=1}^{C} \sum_{p=1}^{n_i} (x_p^i - \mu_i)(x_p^i - \mu_i)^T \tag{2}$$

where $n_i$ denotes the number of samples in the $i$-th class, $\mu_i$ denotes the mean of the $i$-th class, $\mu$ denotes the overall mean, and $x_p^i$ denotes the $p$-th sample in the $i$-th class.

To specify the selected features in the procedure of feature selection, we equip the conventional transformation matrix with an explainable structure, namely the selective matrix. Next, we go into details. First, concatenate the dataset $\{x_1, x_2, \ldots, x_N\}$ into a matrix $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}$. Then let $f_r^T \in \mathbb{R}^{1 \times N}$ denote the $r$-th row of $X$, i.e. the $r$-th feature of the dataset, and let $F = X^T = [f_1, f_2, \ldots, f_d] \in \mathbb{R}^{N \times d}$. Formally, for a feature subset $F_I \subseteq F$, we define the corresponding selective matrix as

$$W_I = \left[w_{I(1)}, w_{I(2)}, \ldots, w_{I(m)}\right] \in \{0, 1\}^{d \times m} \tag{3}$$

where $w_{I(k)} \in \mathbb{R}^{d \times 1}$ $(k = 1, 2, \ldots, m \le d)$ is a column vector whose components are all 0 except for a 1 in the $I(k)$-th position. Note that $W_I$ is a column-full-rank transformation matrix. With the selective matrix $W_I$, the process of feature selection can be expressed as

$$F_I = F W_I. \tag{4}$$

The model of Fisher Score can then be formulated as

$$\max_{W_I \in \mathbb{R}^{d \times m}} \sum_{k=1}^{m} \frac{w_{I(k)}^T S_b\, w_{I(k)}}{w_{I(k)}^T S_w\, w_{I(k)}}. \tag{5}$$

It should be noted that each feature is evaluated individually. It is easy to see that the model in Eq. (5) considers the overall (or average) separation of the data, and thus it cannot guarantee the separation (as good as possible) of every class pair.
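The following short sketch (synthetic data and an arbitrarily chosen index set; variable names follow the paper, everything else is an assumption) shows how Eqs. (1)-(5) fit together: it builds $S_b$ and $S_w$, turns an index set $I$ into the one-hot selective matrix $W_I$ of Eq. (3), applies Eq. (4), and evaluates the per-feature Fisher ratios of Eq. (5).

```python
# Sketch of Eqs. (1)-(5) on synthetic data; the selected index set I is hypothetical.
import numpy as np

rng = np.random.default_rng(1)
d, C, n_per_class = 5, 3, 50
X = np.hstack([rng.normal(c, 1.0, size=(d, n_per_class)) for c in range(C)])  # X in R^{d x N}
y = np.repeat(np.arange(C), n_per_class)

mu = X.mean(axis=1, keepdims=True)
S_b = np.zeros((d, d))
S_w = np.zeros((d, d))
for c in range(C):
    Xc = X[:, y == c]
    mu_c = Xc.mean(axis=1, keepdims=True)
    S_b += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T   # Eq. (1)
    S_w += (Xc - mu_c) @ (Xc - mu_c).T                 # Eq. (2)

def selective_matrix(I, d):
    """Eq. (3): W_I in {0,1}^{d x m}, one one-hot column per selected feature index."""
    W = np.zeros((d, len(I)))
    W[I, np.arange(len(I))] = 1.0
    return W

I = [0, 2, 4]              # a hypothetical selected subset of m = 3 features
W_I = selective_matrix(I, d)
F = X.T                    # F = X^T in R^{N x d}
F_I = F @ W_I              # Eq. (4): F restricted to the selected columns
print("F_I shape:", F_I.shape)

# Eq. (5): each selected feature is evaluated individually by its Fisher ratio.
ratios = [(W_I[:, k] @ S_b @ W_I[:, k]) / (W_I[:, k] @ S_w @ W_I[:, k])
          for k in range(W_I.shape[1])]
print("Fisher ratios of the selected features:", ratios)
```

Because $w_{I(k)}$ is one-hot, each ratio in Eq. (5) simply picks the $I(k)$-th diagonal entries of $S_b$ and $S_w$, i.e. the between-class and within-class variances of that single feature.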
### 2.2 Redundancy

Redundancy (feature-feature) analysis has been another important part of supervised filter models, and it has many similarities to relevance (feature-class) analysis in terms of measurement. Yu and Liu [Yu and Liu, 2004] further explored feature redundancy and, more restrictively, defined a relevant feature as any feature that is neither irrelevant nor redundant to the target concept. The independence or complementarity of a feature can be defined based on the correlation coefficient, mutual information, or any other criterion characterizing feature redundancy. Many models use the Euclidean distance, the Pearson correlation [Battiti, 1994], or information measures for redundancy analysis. Some feature selection methods seek to remove redundant features while others do not. A classical criterion for feature selection based on relevance and redundancy analysis is Max-Relevance and Min-Redundancy (mRMR) [Bian and Tao, 2011], which maximizes the correlation between features and the categorical variable while minimizing the correlation between features, with both measured by mutual information.

## 3 Approach

In order to guarantee the separation (as good as possible) of every class pair, we propose a new criterion for discriminative feature selection, worst-case discriminative feature selection (WDFS), in Subsection 3.1. Unlike Fisher Score and other methods based on discriminative criteria that consider the overall (or average) separation of the data, WDFS adopts a worst-case view that duly considers the separation of all classes. Specifically, WDFS directly maximizes the ratio of the minimum between-class variance over all class pairs to the maximum within-class variance over all classes. Apart from the analysis of relevance (feature-class), the analysis of redundancy (feature-feature) is another crucial part of supervised filter models. This work simply uses the correlation coefficient to help reduce redundancy among features, and WDFS is extended to uncorrelated WDFS (UWDFS) in Subsection 3.2.

### 3.1 Worst-Case Discriminative Feature Selection

In Eq. (1), since $\mu = \frac{1}{N}\sum_{i=1}^{C} n_i \mu_i$, Eq. (1) is equal to the following equation by a simple derivation:

$$S_b = \frac{1}{2N} \sum_{i=1}^{C} \sum_{j=1}^{C} n_i n_j (\mu_i - \mu_j)(\mu_i - \mu_j)^T. \tag{6}$$

Next, we define the between-class matrix $S_b^{i,j}$ for the class pair $(i, j)$ and the within-class matrix $S_w^i$ for the $i$-th class as

$$S_b^{i,j} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T, \quad 1 \le i < j \le C, \tag{7}$$

$$S_w^{i} = \sum_{p=1}^{n_i} (x_p^i - \mu_i)(x_p^i - \mu_i)^T, \quad 1 \le i \le C, \tag{8}$$

where $\mu_i$ denotes the mean of the $i$-th class and $x_p^i$ denotes the $p$-th sample in the $i$-th class. Similarly, based on the Fisher discriminative criterion, we formulate the model of worst-case discriminative feature selection as

$$\max_{W_I \in \mathbb{R}^{d \times m}} \frac{\min\limits_{1 \le i < j \le C} \sum_{k=1}^{m} w_{I(k)}^T S_b^{i,j}\, w_{I(k)}}{\max\limits_{1 \le i \le C} \sum_{k=1}^{m} w_{I(k)}^T S_w^{i}\, w_{I(k)}}. \tag{9}$$

**Algorithm 1** WDFS

**Input:** Training dataset $\{(x_i, y_i) \mid i = 1, 2, \ldots, N\}$, where $x_i \in \mathbb{R}^{d \times 1}$ and $y_i \in \{1, 2, \ldots, C\}$; the number $m$ of features to select.
**Initialize:** Concatenate the dataset $\{x_1, x_2, \ldots, x_N\}$ into a matrix $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}$, and let $F = X^T = [f_1, f_2, \ldots, f_d]$.
**Output:** Selective matrix $W_I \in \mathbb{R}^{d \times m}$.
1: Calculate the between-class matrix $S_b^{i,j}$ for every class pair $(i, j)$, $1 \le i < j \le C$, by Eq. (7), and the within-class matrix $S_w^i$ for every class by Eq. (8).
2: For each feature, calculate the score $\mathrm{score}(f_r)$, $r = 1, 2, \ldots, d$, by Eq. (10).
3: Sort the scores of all features.
4: Select the $m$ feature indexes corresponding to the $m$ largest feature scores and construct the corresponding selective matrix $W_I$.
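Since Eq. (10) does not appear in this excerpt, the sketch below assumes, following the description above, that each feature $f_r$ is scored by the ratio of its smallest pairwise between-class variance to its largest within-class variance; the correlation-based filter at the end is a simplified reading of the UWDFS idea, not the paper's exact procedure.

```python
# Sketch of Algorithm 1 on synthetic data. The per-feature score implements the
# worst-case ratio described in the text (assumed form of Eq. (10)); the optional
# redundancy filter is a simplified, hypothetical UWDFS-style variant.
import numpy as np
from itertools import combinations

def wdfs_scores(X, y):
    """X in R^{d x N}, labels y in {0,...,C-1}; returns one worst-case score per feature."""
    classes = np.unique(y)
    mus = np.array([X[:, y == c].mean(axis=1) for c in classes])   # C x d class means
    sw = np.array([((X[:, y == c].T - mus[i]) ** 2).sum(axis=0)    # per-class scatter per feature (Eq. (8) diagonal)
                   for i, c in enumerate(classes)])
    sb = np.array([(mus[i] - mus[j]) ** 2                          # per-pair gap per feature (Eq. (7) diagonal)
                   for i, j in combinations(range(len(classes)), 2)])
    return sb.min(axis=0) / sw.max(axis=0)

def select_features(X, y, m, redundancy_threshold=None):
    """Rank features by their worst-case score (steps 2-4 of Algorithm 1); optionally
    skip a feature whose correlation with an already selected one is too high."""
    order = np.argsort(wdfs_scores(X, y))[::-1]
    if redundancy_threshold is None:
        return order[:m]
    corr = np.abs(np.corrcoef(X))          # feature-feature correlation coefficients
    selected = []
    for r in order:
        if all(corr[r, s] < redundancy_threshold for s in selected):
            selected.append(r)
        if len(selected) == m:
            break
    return np.array(selected)

# Tiny usage example on random data (shapes only, not a real benchmark).
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 300))             # d = 20 features, N = 300 samples
y = np.repeat(np.arange(3), 100)
print(select_features(X, y, m=5))
print(select_features(X, y, m=5, redundancy_threshold=0.9))
```

The selected indices correspond to the columns of the selective matrix $W_I$ in Eq. (3); building $W_I$ from them and forming $F_I = F W_I$ recovers the selected feature subset.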