# Discriminative Feature Grouping

Lei Han1 and Yu Zhang1,2

1Department of Computer Science, Hong Kong Baptist University, Hong Kong
2The Institute of Research and Continuing Education, Hong Kong Baptist University (Shenzhen)

Both authors contributed equally.

## Abstract

Feature grouping has been demonstrated to be promising in learning with high-dimensional data. It helps reduce the variance in estimation and improves the stability of feature selection. One major limitation of existing feature grouping approaches is that similar but different feature groups are often mis-fused, leading to impaired performance. In this paper, we propose a Discriminative Feature Grouping (DFG) method to discover feature groups with enhanced discrimination. Different from existing methods, DFG adopts a novel regularizer for the feature coefficients to trade off between fusing and discriminating feature groups. The proposed regularizer consists of an ℓ1 norm to enforce feature sparsity and a pairwise ℓ∞ norm to encourage the absolute differences among any three feature coefficients to be similar. To achieve a better asymptotic property, we generalize the proposed regularizer to an adaptive one, in which the feature coefficients are weighted based on the solution of some estimator with root-n consistency. For optimization, we employ the alternating direction method of multipliers to solve the proposed methods efficiently. Experimental results on synthetic and real-world datasets demonstrate that the proposed methods perform well compared with state-of-the-art feature grouping methods.

## Introduction

Learning with high-dimensional data is challenging, especially when the size of the data is not very large. Sparse modeling, which selects only a relevant subset of the features, has thus received increasing attention. Lasso (Tibshirani 1996) is one of the most popular sparse modeling methods and has been well studied in the literature. However, in the presence of highly correlated features, Lasso tends to select only one or a few of those features, leading to unstable estimation and impaired performance. To address this issue, the group lasso (Yuan and Lin 2006) was proposed to select groups of features by using the ℓ1/ℓ2 regularizer. As extensions of the group lasso, several methods have been proposed to learn from overlapping groups (Zhao, Rocha, and Yu 2009; Jacob, Obozinski, and Vert 2009; Yuan, Liu, and Ye 2011). Other extensions of the group lasso, e.g., (Kim and Xing 2010; Jenatton et al. 2010), aim to learn from given tree-structured information among features. However, those methods require the feature groups to be given as a priori information. That is, they can utilize the given feature groups to obtain solutions with group sparsity, but they lack the ability to learn the feature groups. Feature grouping techniques, which find groups of highly correlated features automatically from data, have thus been proposed to address this issue. These techniques help gain additional insights for understanding and interpreting data, e.g., finding co-regulated genes in microarray analysis (Dettling and Bühlmann 2004). Feature grouping techniques assume that features with identical coefficients form a feature group.
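Since the ℓ1/ℓ2 regularizer of the group lasso is referred to repeatedly below, a minimal sketch may help fix ideas: the penalty sums the ℓ2 norms of the coefficients within each pre-specified group, so whole groups are selected or discarded jointly. This is an illustrative simplification; the per-group weights used in (Yuan and Lin 2006) are omitted, and the function name is ours.

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """l1/l2 group lasso penalty: lam * sum_g ||beta_g||_2 (illustrative).

    `groups` is a list of index arrays, one per non-overlapping
    feature group; per-group weights are omitted for simplicity.
    """
    beta = np.asarray(beta, dtype=float)
    return lam * sum(np.linalg.norm(beta[g]) for g in groups)
```

Because the ℓ2 norm of a group is non-differentiable only when the entire group is zero, the penalty induces sparsity at the group level rather than for individual features.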
The elastic net (Zou and Hastie 2005) is a representative feature grouping approach, which combines the ℓ1 and ℓ2 norms to encourage highly correlated features to have identical coefficients. The fused Lasso family, including the fused Lasso (Tibshirani et al. 2005), the graph-based fused Lasso (Kim and Xing 2009), and the generalized fused Lasso (GFLasso) (Friedman et al. 2007), uses fused regularizers to directly force the coefficients of each pair of features to be close based on the ℓ1 norm. Recently, the OSCAR method (Bondell and Reich 2008), which combines an ℓ1 norm and a pairwise ℓ∞ norm on each pair of features, has shown good performance in learning feature groups. Moreover, several extensions of OSCAR have been proposed (Shen and Huang 2010; Yang et al. 2012; Jang et al. 2013) to further reduce the estimation bias. However, when there exist similar but still different feature groups, we find empirically that all the existing feature grouping methods tend to fuse those groups together as one group, leading to impaired learning performance. Figure 1(a) shows an example, where G1 and G2 are similar but different feature groups that are easily mis-fused by existing feature grouping methods. In many real-world applications with high-dimensional data, e.g., microarray analysis, feature groups with similar but different coefficients appear frequently. For example, by applying the method in (Jacob, Obozinski, and Vert 2009) to the breast cancer data, the averaged coefficients of the 637 given feature groups, which correspond to biological gene pathways, are shown in Figure 1(b); we can observe many feature groups with similar but different (averaged) coefficients. This problem is also found in other real-world applications.

Figure 1: (a) The mis-fusion problem; (b) A study of the averaged feature coefficients of the groups in microarray data.

To solve the aforementioned problem in existing feature grouping methods, we propose a Discriminative Feature Grouping (DFG) method to not only discover feature groups but also discriminate similar feature groups. The DFG method adopts a novel regularizer on the feature coefficients to trade off between fusing and discriminating feature groups. The proposed regularizer consists of an ℓ1 norm to enforce feature sparsity and a pairwise ℓ∞ norm to encourage |βi − βj| and |βi − βk| to be identical for any three feature coefficients βi, βj and βk. As analyzed, the pairwise ℓ∞ regularizer is capable of both grouping features and discriminating similar feature groups. Moreover, to achieve a better asymptotic property, we propose an adaptive version of DFG, the ADFG method, in which the feature coefficients are weighted based on the solution of some estimator with root-n consistency. For optimization, we employ the alternating direction method of multipliers (ADMM) (Boyd et al. 2011) to solve the proposed methods efficiently. For analysis, we study the asymptotic properties of the DFG and ADFG models, showing that the feature groups obtained by the ADFG method can recover the ground truth with high probability. Experimental results on synthetic and real-world datasets demonstrate that the proposed methods are competitive with existing feature grouping techniques.
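To make the proposed penalty concrete, the sketch below evaluates an ℓ1 term plus a pairwise ℓ∞ term of the form max(|βi − βj|, |βi − βk|) over coefficient triples, following the description above: each max term is smallest when the two absolute differences agree, which is the effect DFG exploits to keep similar groups separated. This is a simplified illustration under our own indexing convention; the precise index set, weighting, and the adaptive weights of ADFG are defined later in the paper.

```python
import numpy as np
from itertools import combinations

def dfg_style_penalty(beta, lam1, lam2):
    """Illustrative DFG-style regularizer (simplified sketch).

    Combines an l1 term for sparsity with a pairwise l-infinity term
    max(|b_i - b_j|, |b_i - b_k|) over coefficient triples, which
    encourages the two absolute differences to become identical.
    Brute-force O(d^3) enumeration; for exposition only.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = lam1 * np.sum(np.abs(beta))
    pairwise_inf = 0.0
    d = len(beta)
    for i in range(d):
        others = [t for t in range(d) if t != i]
        for j, k in combinations(others, 2):
            pairwise_inf += max(abs(beta[i] - beta[j]),
                                abs(beta[i] - beta[k]))
    return l1 + lam2 * pairwise_inf
```

Minimizing each max term pushes pairs of absolute differences toward a common value, so coefficients either fuse into the same group or settle at clearly separated levels rather than drifting into near-duplicate groups.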
Notations: Let X ∈ ℝ^{n×d} be the predictor (data) matrix and y ∈ ℝ^n be the vector of responses or labels, where n is the number of samples and d is the number of features. For any vector x, ‖x‖_p denotes its ℓp norm, and |A| denotes the cardinality of a set A.

## Background

In this section, we briefly review some existing feature grouping techniques. As a representative of the fused Lasso family, the GFLasso method solves the following optimization problem:

$$\min_{\beta}\; L(\beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \sum_{i<j} |\beta_i - \beta_j|,$$

where L(β) denotes the loss function and λ1, λ2 are regularization parameters.
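As an illustration, the following sketch evaluates this GFLasso objective with a squared loss standing in for the generic L(β); the loss choice and helper names are our assumptions, not fixed by the paper.

```python
import numpy as np

def gflasso_objective(X, y, beta, lam1, lam2):
    """GFLasso objective with a least-squares loss (illustrative).

    L(beta) is taken to be the squared loss here; the formulation
    leaves the loss generic. The fused term sums |b_i - b_j| over
    all pairs i < j.
    """
    beta = np.asarray(beta, dtype=float)
    loss = 0.5 * np.sum((y - X @ beta) ** 2)            # L(beta)
    sparsity = lam1 * np.sum(np.abs(beta))              # l1 sparsity term
    diffs = beta[:, None] - beta[None, :]               # all pairwise differences
    fusion = lam2 * np.sum(np.abs(np.triu(diffs, 1)))   # sum over i < j
    return loss + sparsity + fusion
```

The fused term shrinks |βi − βj| toward zero for every pair of features, which is why GFLasso fuses coefficients into groups but, as discussed above, can also mis-fuse similar yet distinct groups.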