# Discriminative Feature Grouping

Lei Han1 and Yu Zhang1,2

1Department of Computer Science, Hong Kong Baptist University, Hong Kong
2The Institute of Research and Continuing Education, Hong Kong Baptist University (Shenzhen)

Both authors contributed equally.

## Abstract

Feature grouping has been demonstrated to be promising in learning with high-dimensional data. It helps reduce the variance in estimation and improves the stability of feature selection. One major limitation of existing feature grouping approaches is that similar but different feature groups are often mis-fused, leading to impaired performance. In this paper, we propose a Discriminative Feature Grouping (DFG) method to discover feature groups with enhanced discrimination. Different from existing methods, DFG adopts a novel regularizer for the feature coefficients to trade off between fusing and discriminating feature groups. The proposed regularizer consists of an ℓ1 norm to enforce feature sparsity and a pairwise ℓ∞ norm to encourage the absolute differences among any three feature coefficients to be similar. To achieve a better asymptotic property, we generalize the proposed regularizer to an adaptive one, in which the feature coefficients are weighted based on the solution of some estimator with root-n consistency. For optimization, we employ the alternating direction method of multipliers to solve the proposed methods efficiently. Experimental results on synthetic and real-world datasets demonstrate that the proposed methods perform well compared with state-of-the-art feature grouping methods.

## Introduction

Learning with high-dimensional data is challenging, especially when the size of the data is not very large. Sparse modeling, which selects only a relevant subset of the features, has thus received increasing attention. Lasso (Tibshirani 1996) is one of the most popular sparse modeling methods and has been well studied in the literature. However, in the presence of highly correlated features, Lasso tends to select only one or a few of those features, leading to unstable estimation and impaired performance. To address this issue, the group lasso (Yuan and Lin 2006) was proposed to select groups of features by using the ℓ1/ℓ2 regularizer. As extensions of the group lasso, several methods have been proposed to learn from overlapping groups (Zhao, Rocha, and Yu 2009; Jacob, Obozinski, and Vert 2009; Yuan, Liu, and Ye 2011). Other extensions of the group lasso, e.g., (Kim and Xing 2010; Jenatton et al. 2010), aim to learn from given tree-structured information among features. However, those methods require the feature groups to be given as a priori information. That is, they can utilize the given feature groups to obtain solutions with group sparsity, but they lack the ability to learn the feature groups. Feature grouping techniques, which find groups of highly correlated features automatically from data, have thus been proposed to address this issue. These techniques help gain additional insights for understanding and interpreting data, e.g., finding co-regulated genes in microarray analysis (Dettling and Bühlmann 2004). Feature grouping techniques assume that features with identical coefficients form a feature group.
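Since the ℓ1/ℓ2 regularizer of the group lasso is referred to repeatedly below, a minimal sketch may help fix ideas: the penalty sums the ℓ2 norms of the coefficients within each pre-specified group, so whole groups are selected or discarded jointly. This is an illustrative simplification; the per-group weights used in (Yuan and Lin 2006) are omitted, and the function name is ours.

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """l1/l2 group lasso penalty: lam * sum_g ||beta_g||_2 (illustrative).

    `groups` is a list of index arrays, one per non-overlapping
    feature group; per-group weights are omitted for simplicity.
    """
    beta = np.asarray(beta, dtype=float)
    return lam * sum(np.linalg.norm(beta[g]) for g in groups)
```

Because the ℓ2 norm of a group is non-differentiable only when the entire group is zero, the penalty induces sparsity at the group level rather than for individual features.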
The elastic net (Zou and Hastie 2005) is a representative feature grouping approach, which combines the ℓ1 and ℓ2 norms to encourage highly correlated features to have identical coefficients. The fused Lasso family, including the fused Lasso (Tibshirani et al. 2005), the graph-based fused Lasso (Kim and Xing 2009), and the generalized fused Lasso (GFLasso) (Friedman et al. 2007), uses fused regularizers to directly force the coefficients of each pair of features to be close based on the ℓ1 norm. Recently, the OSCAR method (Bondell and Reich 2008), which combines an ℓ1 norm and a pairwise ℓ∞ norm on each pair of features, has shown good performance in learning feature groups. Moreover, several extensions of OSCAR have been proposed (Shen and Huang 2010; Yang et al. 2012; Jang et al. 2013) to further reduce the estimation bias. However, when there exist similar but still different feature groups, we find empirically that all the existing feature grouping methods tend to fuse those groups together as one group, leading to impaired learning performance. Figure 1(a) shows an example, where G1 and G2 are similar but different feature groups that are easily mis-fused by existing feature grouping methods. In many real-world applications with high-dimensional data, e.g., microarray analysis, feature groups with similar but different coefficients appear frequently. For example, by applying the method in (Jacob, Obozinski, and Vert 2009) to the breast cancer data, the averaged coefficients of the 637 given feature groups, which correspond to biological gene pathways, are shown in Figure 1(b); we can observe many feature groups with similar but different (averaged) coefficients. This problem is also found in other real-world applications.

Figure 1: (a) The mis-fusion problem; (b) A study of the averaged feature coefficients of the groups in microarray data.

To solve the aforementioned problem in existing feature grouping methods, we propose a Discriminative Feature Grouping (DFG) method to not only discover feature groups but also discriminate similar feature groups. The DFG method adopts a novel regularizer on the feature coefficients to trade off between fusing and discriminating feature groups. The proposed regularizer consists of an ℓ1 norm to enforce feature sparsity and a pairwise ℓ∞ norm to encourage |βi − βj| and |βi − βk| to be identical for any three feature coefficients βi, βj and βk. As analyzed, the pairwise ℓ∞ regularizer is capable of both grouping features and discriminating similar feature groups. Moreover, to achieve a better asymptotic property, we propose an adaptive version of DFG, the ADFG method, in which the feature coefficients are weighted based on the solution of some estimator with root-n consistency. For optimization, we employ the alternating direction method of multipliers (ADMM) (Boyd et al. 2011) to solve the proposed methods efficiently. For analysis, we study the asymptotic properties of the DFG and ADFG models, showing that the feature groups obtained by the ADFG method can recover the ground truth with high probability. Experimental results on synthetic and real-world datasets demonstrate that the proposed methods are competitive with existing feature grouping techniques.
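To make the proposed penalty concrete, the sketch below evaluates an ℓ1 term plus a pairwise ℓ∞ term of the form max(|βi − βj|, |βi − βk|) over coefficient triples, following the description above: each max term is smallest when the two absolute differences agree, which is the effect DFG exploits to keep similar groups separated. This is a simplified illustration under our own indexing convention; the precise index set, weighting, and the adaptive weights of ADFG are defined later in the paper.

```python
import numpy as np
from itertools import combinations

def dfg_style_penalty(beta, lam1, lam2):
    """Illustrative DFG-style regularizer (simplified sketch).

    Combines an l1 term for sparsity with a pairwise l-infinity term
    max(|b_i - b_j|, |b_i - b_k|) over coefficient triples, which
    encourages the two absolute differences to become identical.
    Brute-force O(d^3) enumeration; for exposition only.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = lam1 * np.sum(np.abs(beta))
    pairwise_inf = 0.0
    d = len(beta)
    for i in range(d):
        others = [t for t in range(d) if t != i]
        for j, k in combinations(others, 2):
            pairwise_inf += max(abs(beta[i] - beta[j]),
                                abs(beta[i] - beta[k]))
    return l1 + lam2 * pairwise_inf
```

Minimizing each max term pushes pairs of absolute differences toward a common value, so coefficients either fuse into the same group or settle at clearly separated levels rather than drifting into near-duplicate groups.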
Notations: Let X ∈ ℝ^{n×d} be the predictor (data) matrix and y ∈ ℝ^n be the vector of responses or labels, where n is the number of samples and d is the number of features. For any vector x, ‖x‖_p denotes its ℓp norm, and |A| denotes the cardinality of a set A.

## Background

In this section, we briefly review some existing feature grouping techniques. As a representative of the fused Lasso family, the GFLasso method solves the following optimization problem:

$$\min_{\beta}\; L(\beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \sum_{i<j} |\beta_i - \beta_j|,$$

where L(β) denotes the loss function and λ1, λ2 are regularization parameters.
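As an illustration, the following sketch evaluates this GFLasso objective with a squared loss standing in for the generic L(β); the loss choice and helper names are our assumptions, not fixed by the paper.

```python
import numpy as np

def gflasso_objective(X, y, beta, lam1, lam2):
    """GFLasso objective with a least-squares loss (illustrative).

    L(beta) is taken to be the squared loss here; the formulation
    leaves the loss generic. The fused term sums |b_i - b_j| over
    all pairs i < j.
    """
    beta = np.asarray(beta, dtype=float)
    loss = 0.5 * np.sum((y - X @ beta) ** 2)            # L(beta)
    sparsity = lam1 * np.sum(np.abs(beta))              # l1 sparsity term
    diffs = beta[:, None] - beta[None, :]               # all pairwise differences
    fusion = lam2 * np.sum(np.abs(np.triu(diffs, 1)))   # sum over i < j
    return loss + sparsity + fusion
```

The fused term shrinks |βi − βj| toward zero for every pair of features, which is why GFLasso fuses coefficients into groups but, as discussed above, can also mis-fuse similar yet distinct groups.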