# Non-Convex Feature Learning via ℓp,∞ Operator

Deguang Kong and Chris Ding
Department of Computer Science & Engineering, University of Texas, Arlington, 500 UTA Blvd, Arlington, TX 76010
doogkong@gmail.com; chqding@uta.edu

We present a feature selection method based on a sparse regularization problem with a composite regularizer built from an ℓp norm and an ℓ∞ norm. We use the proximal gradient method to solve the resulting ℓp,∞ operator problem, and design a simple but efficient algorithm for the associated proximal objective, which couples a vector ℓ2 norm with an ℓ∞ norm. The proposed method offers insight into solving sparsity-favoring norms, and extensive experiments characterize the effect of varying p and compare against other approaches on real-world multi-class and multi-label datasets.

## Introduction

Feature selection is the process of selecting a subset of relevant features from the original features for robust classification, clustering, and other learning tasks, and it plays an important role in machine learning. A large number of feature selection approaches have been developed in the literature; in general, they fall into two categories: (C1) filter methods (Langley 1994) and (C2) wrapper methods (Kohavi and John 1997).

In recent years, sparsity regularization has been widely investigated and applied to multi-task learning and feature selection, where ℓ1,∞ variable selection/projection has been proposed and studied in (Masaeli, Fung, and Dy 2010), (Turlach, Venables, and Wright 2005), (Tropp et al. 2006), (Quattoni, Collins, and Darrell 2008), (Schmidt et al. 2008), and (Liu, Palatucci, and Zhang 2009). One general conclusion (Masaeli, Fung, and Dy 2010) is that ℓ1,∞ regularization usually performs significantly better for classification tasks than either independent ℓ1 or independent ℓ2 regularization; in other words, ℓ1,∞ optimization is well behaved.

Regularization with a non-convex ℓp penalty has also gained increasing interest (e.g., the smoothly clipped absolute deviation method (Fan and Li 2001) and bridge regularization (Hastie, Tibshirani, and Friedman 2001)). For the ℓp norm with p smaller than 1, the penalty is not differentiable as soon as its argument vanishes. To deal with this issue, (Huang, Horowitz, and Ma 2008) considered a parameterized differentiable approximation of the ℓp penalty and used gradient descent to solve it, while (Chartrand and Staneva 2008) and (Kong and Ding 2013) used an iteratively re-weighted least squares algorithm. These observations motivate us to combine the ℓ∞ norm with the ℓp penalty.

In this paper, we consider the more general ℓp,∞ operator with p < 1. This model is a generalization of the ℓ1,∞ group lasso with enforced sparsity of the solution. The attractive property of the ℓp,∞ operator is that, at different p (although it is not convex for p < 1), it better approximates the true number of non-zero features/variables, which is the desired goal in feature/variable selection tasks. The algorithms designed for the ℓp penalty (e.g., (Chartrand and Staneva 2008), (Kong, Zhang, and Ding 2013)) cannot simply be modified to handle the ℓ∞ part, whose max structure couples all coordinates and is harder to deal with than the ℓ1 norm, which is continuous (although its derivative is not).
The challenge of our problem is to solve the projection for the ℓp,∞ operator. Another contribution of this paper is a very simple and efficient algorithm, derived with rigorous analysis, for the ℓp,∞ operator. We give the structure of the optimal solution of the ℓp,∞ projection, which, to our knowledge, has not been emphasized before. Our algorithm also differs clearly from the methods used for ℓ1,∞ computation, such as the blockwise coordinate descent method of (Liu, Palatucci, and Zhang 2009), the double coordinate descent method of (Zhang 2011), and the interior-point method of (Turlach, Venables, and Wright 2005).

We note that (Vogt and Roth 2012) studied the convex group lasso with p ≥ 1 for coupling multiple regression tasks. They studied a loss function f(B) over the coefficients B = (β1, ..., βJ) with the penalty $\sum_{j=1}^{J}\|\beta_j\|_p$ for J tasks. The p in their model plays a very different role from ours. Their paper also does not appear to address feature selection: they set the parameter k so that the sparsity of B is a fixed value (flat sparsity, different from structured sparsity) and report the prediction error. In our feature selection of Eq.(2), we set λ so that entire rows of W become zero (structured sparsity).

To summarize, the main contributions of this paper are threefold. (1) We propose the ℓp,∞ operator (i.e., a composite regularization of the ℓp norm and the ℓ∞ norm) for feature selection. (2) We present an efficient algorithm, with rigorous analysis, for the ℓp,∞ proximal operator (i.e., a simple objective coupling a vector ℓ2 norm and an ℓ∞ norm). (3) Experimental results suggest that our algorithm is well behaved for smaller p values on real-world multi-class and multi-label classification tasks.

## Feature Selection via the ℓp,∞ Operator

### Notations

Let $X \in \Re^{d\times n} = (x_1, x_2, \dots, x_n)$ denote the matrix of d-dimensional input data over n samples, and $Y \in \Re^{c\times n} = (y_1, y_2, \dots, y_n)$ the matrix of output data for c outputs. In a general multi-class learning task, we use the linear model $y_k = W^T x_k + \epsilon_k$ for sample k ($1 \le k \le n$), where $W \in \Re^{d\times c}$ holds the regression coefficients for all c classes and $\epsilon_k$ is Gaussian noise.

### Proposed feature selection using ℓp,∞

In this paper, we consider the following multi-class sparse regression problem with ℓp,∞ penalization:
$$\min_{W\in\Re^{d\times c}} J(W) = f(W) + \varphi_\lambda(W), \qquad (1)$$
where $f(W)$ is a smooth or non-smooth convex loss function (e.g., the least-square loss) and
$$\varphi_\lambda(W) = \lambda\,(\|W\|_{p,\infty})^p = \lambda \sum_{j=1}^{d}\Big(\max_{1\le k\le c}|W_{jk}|\Big)^p \qquad (2)$$
is the ℓp,∞ penalty with regularization parameter $\lambda \ge 0$. The ℓp,∞-norm of W is defined as
$$\|W\|_{p,\infty} = \Big(\sum_{j=1}^{d}\big(\max_{1\le k\le c}|W_{jk}|\big)^p\Big)^{1/p}.$$
In $(\|W\|_{p,\infty})^p$ of Eq.(2) the p-th root is dropped, so it is no longer a norm. The penalty of Eq.(2) is a more general case of the ℓ1,∞ penalty (Boyd and Vandenberghe 2004): when p = 1 it is exactly the ℓ1,∞-norm penalty. The ℓp,∞ operator relaxes the pseudo-norm that counts the number of non-zero rows of W (the p = 1 case gives its convex relaxation); when p = 0 it counts the number of non-zero rows exactly, if we assume $0^0 = 0$. Usually p is set within $0 \le p \le 1$.

### Motivation of our method

Let $f(W_j) = \max_{1\le k\le c}|W_{jk}|$. The behavior of $\sum_j f(W_j)^p$ is then determined by the behavior of $|x|^p$, where $|x| = f(W_j)$ is a scalar. Fig. 1 shows the different behaviors of $|x|^p$ as p takes different values. As $p \to 0$, $|x|^p \to 1$ for every $x \ne 0$. Thus, when p is small, $\sum_j f(W_j)^p \approx \sum_j \delta(f(W_j))$, where $\delta$ is a 0-1 indicator function with $\delta(a) = 1$ if $a \ne 0$ and $\delta(a) = 0$ otherwise. This is exactly the desired goal for feature/variable selection, because the number of non-zero rows can be estimated more accurately by $\sum_j f(W_j)^p$ when p is small.

Figure 1: The behaviors of $y = |x|^p$ for $p = \{1, 0.75, 0.5, 0.2, 0\}$.
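To make the row-sparsity effect of the penalty concrete, the short sketch below (our own illustration, not code from the paper; the function name `lp_inf_penalty` is ours) evaluates $\varphi_\lambda(W)$ of Eq.(2) on a small coefficient matrix for several values of p. As p shrinks, the value approaches λ times the number of non-zero rows.

```python
import numpy as np

def lp_inf_penalty(W, lam, p):
    """Composite penalty of Eq.(2): lam * sum_j (max_k |W_jk|)^p."""
    row_max = np.max(np.abs(W), axis=1)          # f(W_j) = max_k |W_jk|
    nonzero = row_max > 0
    return lam * np.sum(row_max[nonzero] ** p)   # skip zero rows, i.e. adopt 0^0 = 0

# Toy coefficient matrix: rows 0 and 2 are active, row 1 is exactly zero.
W = np.array([[0.9, -0.1],
              [0.0,  0.0],
              [0.05, 0.02]])

for p in [1.0, 0.5, 0.25, 0.0]:
    print(p, lp_inf_penalty(W, lam=1.0, p=p))
# As p -> 0 the penalty tends to the number of non-zero rows (here 2).
```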
Note $W = (w_1, w_2, \dots, w_d)^T$, where $w_j \in \Re^{c\times 1}$ is the regression coefficient of feature j across all classes. The mixed ℓp/ℓ∞ penalty in Eq.(2) drives all the coefficients of each dimension (i.e., the row $w_j$) to zero jointly, or leaves them non-zero, which is precisely the variable selection behavior we want.

### Overview of computational algorithm

The proximal gradient method (a.k.a. FISTA) (Nesterov 2004; Beck and Teboulle 2009b) is widely used to solve problems of the form $\min_W f(W) + \varphi_\lambda(W)$, and we adopt it for Eq.(1) because of its fast convergence. The key idea is to solve, at each iteration t,
$$W_{t+1} = \arg\min_U\; f(W_t) + \langle \nabla f(W_t),\, U - W_t\rangle + \frac{L_t}{2}\|U - W_t\|_F^2 + \varphi_\lambda(U) \qquad (3)$$
$$= \arg\min_U\; \frac{1}{2}\|U - A\|_F^2 + \varphi_\rho(U), \qquad (4)$$
where $A = W_t - \frac{1}{L_t}\nabla f(W_t)$, $\rho = \lambda/L_t$, and $L_t$ is a step-size parameter chosen at each iteration by a search strategy. The problem is thus transformed into the minimization of Eq.(4). The major computational effort of each iteration is the gradient $\nabla f(W_t)$, which costs $O(dnc)$ for a generic dense W, where d is the feature dimension, n the number of data points, and c the number of classes. With an appropriate choice of the step size $L_t$, the accelerated proximal gradient method (Beck and Teboulle 2009a) achieves an $\epsilon$-optimal solution in $O(1/\sqrt{\epsilon})$ iterations. The key step is then to solve the associated proximal operator:
$$\min_U\; \frac{1}{2}\|U - A\|_F^2 + \varphi_\rho(U). \qquad (5)$$
One main contribution of this paper is the computation of the proximal operator in Eq.(5).

### Main contribution: an effective algorithm for the associated proximal operator computation

In this section, we present an efficient algorithm to solve the proximal operator in Eq.(5). Much previous work (Yuan, Liu, and Ye 2011; Beck and Teboulle 2009b) has shown that efficient computation of the proximal operator is the key to many sparse learning problems. First, note that
$$\frac{1}{2}\|W - A\|_F^2 + \rho(\|W\|_{p,\infty})^p = \sum_{i=1}^{d}\Big[\frac{1}{2}\|w_i - a_i\|^2 + \rho(\|w_i\|_\infty)^p\Big], \qquad (6)$$
where $w_i, a_i$ are the i-th rows of W and A, and $\|w_i\|_\infty = \max_j |W_{ij}|$ is the infinity norm of the vector $w_i$. The problem of Eq.(5) therefore decomposes into d independent sub-problems, one optimizing each $w_i$. Each sub-problem has the following general form:
$$u^* = \arg\min_u J_1(u) = \arg\min_u\; \frac{1}{2}\|u - a\|^2 + \rho(\|u\|_\infty)^p = \arg\min_u\; \frac{1}{2}\|u - a\|^2 + \rho\Big(\max_{1\le i\le d}|u_i|\Big)^p, \qquad (7)$$
where $u = [u_1, u_2, \dots, u_d]$ is the row vector to be optimized (with a slight abuse of notation, we write the generic vector length as d). We call this optimization problem the ℓp,∞ proximal projection problem. Fortunately, a very efficient algorithm exists for Eq.(7).

### Simplify the problem

First, we present Lemmas 1 and 2 to simplify this optimization problem.

**Lemma 1.** (A) The optimal solution $u^*$ of Eq.(7) satisfies $\mathrm{sign}(u_i^*) = \mathrm{sign}(a_i)$. (B) Let $v^*$ be the optimal solution of
$$J_2 = \min_{v_i \ge 0}\; \frac{1}{2}\|v - b\|^2 + \rho\Big(\max_{1\le i\le d} v_i\Big)^p,$$
where $b_i = |a_i|$. Then the optimal solution of Eq.(7) is given by $u_i^* = \mathrm{sign}(a_i)\, v_i^*$.

Proof. (A) By contradiction. Suppose the optimal solution $u^*$ has an index $i_0$ with $\mathrm{sign}(u_{i_0}^*) \ne \mathrm{sign}(a_{i_0})$. Construct another solution $u'$ with $u'_i = u_i^*$ for $i \ne i_0$ and $u'_{i_0} = -u_{i_0}^*$. Since $\max_i |u'_i| = \max_i |u_i^*|$, we have
$$J_1(u') - J_1(u^*) = \frac{1}{2}\sum_i (u'_i - a_i)^2 - \frac{1}{2}\sum_i (u_i^* - a_i)^2 = 2\,u_{i_0}^* a_{i_0} \le 0.$$
Thus $J_1(u') \le J_1(u^*)$: flipping the mismatched sign never increases the objective, so we can always take an optimal solution with $\mathrm{sign}(u_i^*) = \mathrm{sign}(a_i)$.

(B) For the optimal solution $u^*$,
$$\frac{1}{2}\sum_i (u_i^* - a_i)^2 + \rho\big(\max_{1\le i\le d} |u_i^*|\big)^p = \frac{1}{2}\sum_i \big(\mathrm{sign}(u_i^*)|u_i^*| - \mathrm{sign}(a_i)|a_i|\big)^2 + \rho\big(\max_{1\le i\le d} |u_i^*|\big)^p \qquad (8)$$
$$= \frac{1}{2}\sum_i \big(|u_i^*| - |a_i|\big)^2 + \rho\big(\max_{1\le i\le d} |u_i^*|\big)^p. \qquad (9)$$
The step from Eq.(8) to Eq.(9) holds because $\mathrm{sign}(u_i^*) = \mathrm{sign}(a_i)$, as proved in Lemma 1(A). Letting $b_i = |a_i|$ and $v_i = |u_i^*|$, Eq.(9) recovers the objective $J_2$. □

Remark. In passing from Eq.(7) to $J_2$, the absolute-value operation $|\cdot|$ is removed; clearly, optimizing $J_2$ is much easier than optimizing Eq.(7) directly.
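Before turning to the solution of $J_2$, the following sketch makes the outer loop of Eq.(3)-(4) concrete. It is our own illustration under stated assumptions, not the authors' implementation: we use the least-square loss $f(W) = \frac{1}{2}\|Y - W^T X\|_F^2$ (as in the experiments), a fixed step size $1/L$ with L the Lipschitz constant of the gradient rather than the per-iteration search strategy described above, and no momentum term. The names `prox_gradient_lp_inf` and `prox_lp_inf` are ours; `prox_lp_inf` denotes the row-wise proximal operator of Eq.(5), developed in the rest of this section.

```python
import numpy as np

def prox_gradient_lp_inf(X, Y, lam, p, prox_lp_inf, n_iter=200):
    """Proximal gradient loop for Eq.(1) with least-square loss f(W) = 0.5*||Y - W^T X||_F^2.

    X: d x n data, Y: c x n targets, W: d x c coefficients.
    prox_lp_inf(A, rho, p) must solve Eq.(5), i.e. return
    argmin_U 0.5*||U - A||_F^2 + rho*(||U||_{p,inf})^p.
    """
    d, c = X.shape[0], Y.shape[0]
    L = np.linalg.norm(X @ X.T, 2)       # Lipschitz constant of the gradient
    W = np.zeros((d, c))
    for _ in range(n_iter):
        grad = X @ (X.T @ W - Y.T)       # gradient of the least-square loss
        A = W - grad / L                 # A = W_t - (1/L_t) * grad, as in Eq.(4)
        W = prox_lp_inf(A, lam / L, p)   # proximal step with rho = lam / L_t
    return W
```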
**Lemma 2.** If the elements of b are sorted in descending order, i.e., $b_1 \ge b_2 \ge \dots \ge b_d$, the optimal solution $v^*$ of $J_2$ satisfies $v_1^* \ge v_2^* \ge \dots \ge v_d^*$.

Lemma 2 is straightforward to verify. For convenience in computing $\max_i v_i$, we first sort b in descending order. Next we show how to solve $J_2$.

**Optimal solution at p = 0.** When p = 0, the minimal value of $J_2$ is $\min\big(\rho,\ \frac{1}{2}\|b\|^2\big)$, attained at $v^* = b$ or $v^* = 0$, respectively. (To our knowledge, there is no well-defined value for $0^0$; throughout we assume $0^0 = 0$.) Next we discuss the optimal solution when $0 < p \le 1$.

**The structure of the optimal solution when 0 < p ≤ 1.** The key observation is that, in the optimal solution of $J_2$, more than one element may attain the maximum value at the same time, even when the elements of b are distinct. For example, with $b = (5, 4, 3, 2, 1)$, $\rho = 1.5$, and $p = 1$, the optimal solution is $v^* = (3.75, 3.75, 3, 2, 1)$ rather than $v^* = (3.5, 4, 3, 2, 1)$. We therefore introduce the maximum set
$$S = \{\, i \mid v_i^* = v_m \,\},$$
where $v_m$ is the maximum value of the optimal solution $v^*$; as noted, several elements may reach this maximum simultaneously. The next problem is to determine S. It is natural to split the elements of v into two parts according to whether or not they fall into S. Assuming S is known, Theorem 3 gives the values of the optimal solution after this separation; Lemma 4 then characterizes S and hence its cardinality.

**Theorem 3.** Assume the maximum set S is known. The optimal solution of $J_2$ is
$$v_i^* = \begin{cases} b_i & \text{if } i \notin S, \\ v_m & \text{if } i \in S, \end{cases}$$
where $v_m$ is a constant that can be obtained by Newton's method.

Proof. According to the definition of S, the objective $J_2$ can be written as
$$J_3 = \frac{1}{2}\sum_{i\in S}(v_m - b_i)^2 + \frac{1}{2}\sum_{i\notin S}(v_i - b_i)^2 + \rho\, v_m^p. \qquad (10)$$
Clearly, the optimal solution splits into the parts $i \in S$ and $i \notin S$, which are independent in $J_3$. First consider $i \notin S$: minimizing $\frac{1}{2}\sum_{i\notin S}(v_i - b_i)^2$ gives $v_i^* = b_i$. Next consider $i \in S$. Note that
$$\arg\min_{v_m} J_3 = \arg\min_{v_m}\; \frac{1}{2}\sum_{i\in S}(v_m - b_i)^2 + \rho\, v_m^p,$$
so
$$J_3'(v_m) = \sum_{i\in S}(v_m - b_i) + \rho\, p\, v_m^{p-1}, \qquad J_3''(v_m) = |S| + \rho\, p\,(p-1)\, v_m^{p-2}.$$
At p = 1, $v_m$ has a closed-form solution, obtained from $J_3'(v_m) = 0$:
$$v_m = \frac{\sum_{i\in S} b_i - \rho}{|S|}. \qquad (11)$$
When $0 < p < 1$, we use the standard Newton iteration
$$v_m^{t+1} = v_m^{t} - \frac{J_3'(v_m^{t})}{J_3''(v_m^{t})}.$$
Starting from an initial guess $v_m^{0}$, in our experiments we obtain the optimal $v_m$ to machine precision after about 20 iterations. □

**Lemma 4.** The optimal solution $v^*$ of Eq.(10) is characterized by
$$\begin{cases} v_m > b_i & \text{if } i \notin S, \\ v_m \le b_i & \text{if } i \in S. \end{cases}$$

Proof. By contradiction. (A) We first prove the case $i \notin S$. Suppose $v_m \le b_i$ for some $i \notin S$. From Theorem 3 we have $v_i^* = b_i$ for $i \notin S$, and therefore $v_m \le b_i = v_i^*$ with $i \notin S$. This contradicts the definition of S, where $v_m = \max_{1\le i\le d} v_i^*$ is attained exactly on S. Hence the assumption does not hold, and $v_m > b_i$ for $i \notin S$.

(B) Now consider $i \in S$. Suppose $v_m > b_i$ for the elements $i \in S$. Then there exists $\varepsilon = v_m - \max_{i\in S} b_i > 0$. Thus, for $i \in S$,
$$v_m - b_i - \varepsilon = (v_m - b_i) - \big(v_m - \max_{i\in S} b_i\big) = \max_{i\in S} b_i - b_i \ge 0,$$
and therefore $2v_m - 2b_i - \varepsilon = (v_m - b_i) + (v_m - b_i - \varepsilon) > 0$. Construct another solution $v'$ with $v'_i = v_i^*$ if $i \notin S$ and $v'_i = v_m - \varepsilon$ if $i \in S$. Then
$$J_2(v') - J_2(v^*) = \frac{1}{2}\Big[\sum_{i\in S}(v_m - \varepsilon - b_i)^2 - \sum_{i\in S}(v_m - b_i)^2\Big] + \rho\Big[\big(\max_{1\le i\le d} v'_i\big)^p - \big(\max_{1\le i\le d} v_i^*\big)^p\Big]$$
$$= -\frac{1}{2}\sum_{i\in S}\varepsilon\,(2v_m - 2b_i - \varepsilon) + \rho\big[(v_m - \varepsilon)^p - v_m^p\big].$$
The first term is strictly negative because $\varepsilon > 0$ and $2v_m - 2b_i - \varepsilon > 0$; the second term is non-positive because $v_m \ge v_m - \varepsilon \ge 0$ and $f(x) = x^p$ is monotonically increasing for $x > 0$ (see Fig. 1). Thus $J_2(v') - J_2(v^*) < 0$, which implies that $v'$ is a strictly better solution and contradicts the optimality of $v^*$. Hence the assumption does not hold, and $v_m \le b_i$ for $i \in S$. □
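The scalar problem in Theorem 3 is easy to implement. The sketch below is our own illustration of the Newton update described above (the helper name `solve_vm` and the positivity clamp are ours, not part of the paper): it solves for $v_m$ given the entries of b that belong to the maximum set S, using the closed form of Eq.(11) at p = 1.

```python
import numpy as np

def solve_vm(b_S, rho, p, v0=1.0, n_iter=20):
    """Solve min_vm 0.5*sum_{i in S}(vm - b_i)^2 + rho*vm^p (Theorem 3).

    b_S: the entries of b belonging to the maximum set S.
    """
    if p == 1.0:                                  # closed form of Eq.(11)
        return (np.sum(b_S) - rho) / len(b_S)
    vm = v0                                       # initial guess
    for _ in range(n_iter):                       # standard Newton iterations
        g = np.sum(vm - b_S) + rho * p * vm ** (p - 1)        # J3'(vm)
        h = len(b_S) + rho * p * (p - 1) * vm ** (p - 2)      # J3''(vm)
        vm = max(vm - g / h, 1e-12)               # clamp to keep vm > 0 (our safeguard)
    return vm
```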
### Complete algorithm

Above we have discussed the structure of the optimal solution when $0 < p \le 1$. We now complete the whole algorithm and give the optimal solution $v^*$ of Eq.(7). Based on Lemma 4, a linear search determines the maximum set S by checking the boundary condition $v_m > b_k$; the search costs O(d), proportional to the dimension of b. The detailed procedure is summarized in Algorithm 1.

Algorithm 1: Proximal operator solution for Eq.(7) when $0 < p \le 1$
Input: b (with $b_i = |a_i|$), ρ, p. Output: $v^*$.
1. Sort b in descending order so that $b_1 \ge b_2 \ge \dots \ge b_d$, recording the mapping between the original and sorted order.
2. Set $k \leftarrow d$, let $S = \{1, 2, \dots, k\}$, and solve for $v_m$ according to Theorem 3.
3. while $v_m > b_k$ and $k > 1$ do
4.   $k \leftarrow k - 1$
5.   let $S = \{1, 2, \dots, k\}$ and solve for $v_m$ according to Theorem 3
6. end while
7. $v^* \leftarrow b$; set $v_i^* \leftarrow v_m$ for $1 \le i \le k$.
8. Map $v^*$ back to the original order according to the recorded mapping.

The optimal $u^*$ of Eq.(7) is then recovered as $u_i^* = \mathrm{sign}(a_i)\, v_i^*$ (Lemma 1). To further improve efficiency, a binary search over k can determine S with cost O(log d), since b is already arranged in descending order, $b_1 \ge b_2 \ge \dots \ge b_d$. Another interesting observation is that $v_m \le b_1$, according to Lemma 4. These properties help to obtain a better understanding of the structure of the optimal solution.
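Putting the pieces together, the sketch below is a minimal illustration (again our own, not the authors' code) of Algorithm 1 combined with the sign handling of Lemma 1, i.e., the full ℓp,∞ proximal operator for $0 < p \le 1$. It assumes the hypothetical `solve_vm` helper sketched after Theorem 3 is in scope; the function names are ours.

```python
import numpy as np

def prox_lp_inf_row(a, rho, p):
    """Proximal operator of Eq.(7) for a single row a (Algorithm 1 plus Lemma 1)."""
    b = np.abs(a)                           # Lemma 1: work with b_i = |a_i|
    order = np.argsort(-b)                  # step 1: sort descending, remember the order
    b_sorted = b[order]
    d = len(b_sorted)

    k = d                                   # step 2: start with S = {1, ..., d}
    vm = solve_vm(b_sorted[:k], rho, p)
    while vm > b_sorted[k - 1] and k > 1:   # steps 3-6: shrink S while vm > b_k
        k -= 1
        vm = solve_vm(b_sorted[:k], rho, p)

    v_sorted = b_sorted.copy()              # step 7: keep b_i outside S, set S entries to vm
    v_sorted[:k] = vm
    v = np.empty_like(v_sorted)             # step 8: undo the sorting permutation
    v[order] = v_sorted
    return np.sign(a) * v                   # Lemma 1: restore the signs of a

def prox_lp_inf(A, rho, p):
    """Row-wise proximal operator of Eq.(5) for a full matrix A."""
    return np.vstack([prox_lp_inf_row(A[i], rho, p) for i in range(A.shape[0])])
```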
## Experiments

We perform extensive experiments on both single-label multi-class and multi-label datasets to validate the effectiveness of the proposed ℓp,∞ algorithm. For the multi-class experiments we use four widely used biology datasets, ALLAML¹, GLIOMAML², LUNGML³, and CARML⁴, all of which have high-dimensional features (more than 3,000) and very few samples (fewer than 250). For the multi-label experiments we use three widely used datasets: Barcelona⁵, MSRCv2⁶, and TRECVID2005⁷. We extract 384-dimensional color-moment features on Barcelona and MSRCv2, and 512-dimensional GIST features (Oliva and Torralba 2001) on TRECVID (e.g., (Chang et al. 2014)). A summary of both the multi-class and multi-label datasets is given in Table 1.

¹ http://www.sciencemag.org/content/277/5324/393.full
² http://cancerres.aacrjournals.org/content/63/7/1602.long
³ http://www.ncbi.nlm.nih.gov/pubmed/11707567
⁴ http://www.ncbi.nlm.nih.gov/pubmed/11606367
⁵ http://mlg.ucd.ie/content/view/61
⁶ http://research.microsoft.com/en-us/projects/objectclassrecognition/
⁷ http://www-nlpir.nist.gov/projects/tv2005/

Figure 2: Classification accuracy using the selected features with kNN and SVM classifiers on three datasets (panels: ALLAML-kNN, GLIOMAML-kNN, LUNGML-kNN, ALLAML-SVM, GLIOMAML-SVM, LUNGML-SVM). p = 0.25, p = 0.5, and p = 1 are our method at different p; the four other methods are ℓ2,1 feature selection of Eq.(12), F-statistic, ReliefF, and mRmR.

Figure 3: Multi-label feature selection results (macro-F1 score versus number of selected features) using an SVM classifier on the three multi-label datasets. p = 0.25, p = 0.5, and p = 1 are our method at different p; the four other methods are ℓ2,1 feature selection of Eq.(12), F-statistic, ReliefF, and mRmR.

### Numerical experiments with two initializations

It is worth noting that for p < 1, solving Eq.(2) is not a convex optimization problem, so we are interested in the influence of different initializations. We used two: (a) initialize W by ridge regression, i.e., replace the ℓp,∞ term in Eq.(2) with a simple squared Frobenius norm, which gives a closed-form solution; (b) start from the p = 1 global solution and use it as the initialization for p = 0.75, then p = 0.5, and so on. The results shown for p < 1 are the best of these two initializations.

The results obtained from the ℓp,∞ model are used for feature selection: we adjust the parameter λ in Eq.(1) and select the non-zero rows of the resulting W as the selected features. After selecting the top r < d features, we obtain the new data $X \in \Re^{r\times n}$, composed of only the top r features across all data samples. We use the standard least-square loss for the error function $f(W)$ of Eq.(1); the same least-square loss is also used for the multi-label experiments. More formally, we define the residue $R = \|Y - B^T X\|_F$, where $Y \in \Re^{c\times n}$ contains the class labels and B is obtained by minimizing $\min_B \|Y - B^T X\|_F^2$ on the selected features X. The smaller the residue R, the better the selected features obtained from X.
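As a concrete illustration of this evaluation metric, the small sketch below (our own, with our own assumptions about the data layout: `X_sel` is the r x n matrix of selected features and `Y` is the c x n label matrix) computes the residue R with an ordinary least-squares fit.

```python
import numpy as np

def residue(X_sel, Y):
    """Residue R = ||Y - B^T X_sel||_F with B = argmin_B ||Y - B^T X_sel||_F^2.

    X_sel: r x n matrix of the selected features; Y: c x n label matrix.
    """
    # Least-squares fit of B (r x c): X_sel^T B ~= Y^T.
    B, *_ = np.linalg.lstsq(X_sel.T, Y.T, rcond=None)
    return np.linalg.norm(Y - B.T @ X_sel, ord='fro')
```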
Table 1: Multi-class dataset and multi-label dataset descriptions.

Single-label multi-class datasets:

| dataset | #data | #dimension | #class |
| --- | --- | --- | --- |
| ALLAML | 72 | 7129 | 2 |
| GLIOMAML | 50 | 4434 | 4 |
| LUNGML | 203 | 3312 | 5 |
| CARML | 174 | 9182 | 11 |

Multi-label datasets:

| dataset | #data | #dimension | #class |
| --- | --- | --- | --- |
| TRECVID | 3721 | 512 | 39 |
| MSRCv2 | 591 | 384 | 23 |
| Barcelona | 139 | 384 | 4 |

Table 2: Residue R using the top 20 selected features on the 4 multi-class datasets.

| Dataset | Initialization | p = 0.1 | p = 0.25 | p = 0.5 | p = 0.75 | p = 1 | ℓ2,1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ALLAML | p = 1 | 6.4217 | 6.3647 | 6.4314 | 6.3892 | 6.4487 | 6.4257 |
| ALLAML | ridge regression | 6.4141 | 6.3841 | 6.4054 | 6.3672 | - | - |
| GLIOMAML | p = 1 | 4.5208 | 4.7877 | 4.7935 | 4.7832 | 4.9421 | 4.8324 |
| GLIOMAML | ridge regression | 4.5145 | 4.7182 | 4.7877 | 4.7968 | - | - |
| LUNGML | p = 1 | 7.6889 | 7.5701 | 7.7916 | 8.3251 | 8.3572 | 8.3612 |
| LUNGML | ridge regression | 7.6762 | 7.5534 | 7.8216 | 8.2851 | - | - |
| CARML | p = 1 | 8.3947 | 8.3837 | 8.3687 | 8.8135 | 8.9578 | 8.9321 |
| CARML | ridge regression | 8.3747 | 8.3921 | 8.3723 | 8.7335 | - | - |

Table 2 shows the residues R obtained using the top 20 features under the two initializations, i.e., (a) ridge regression and (b) the p = 1 solution, on the four multi-class datasets. We make several important observations. (1) Generally, smaller p values give much smaller residues, which indicates better feature selection results, although the trend is not perfectly consistent. We believe that if we could find the true global solution at p < 1, the residues would be more consistent; however, because the problem is non-convex at p < 1, we cannot guarantee a global minimum, and the solution is not unique and depends on the initialization. (2) Ridge-regression initialization gives slightly better results than p = 1 initialization. (3) At p < 1, the selected features differ slightly across initializations (usually by a few features, depending on how many features are selected), again because the global optimum is out of reach. (4) We also compare the proposed ℓp,∞ ($p \le 1$) method against the ℓ2,1 feature selection method (Liu, Ji, and Ye 2009). Specifically, following (Liu, Ji, and Ye 2009) and (Kong, Ding, and Huang 2011), we use the ℓ2,1 regularization
$$\min_{W\in\Re^{d\times c}} J(W) = f(W) + \lambda\|W\|_{2,1}, \qquad (12)$$
where $f(W)$ is a smooth or non-smooth convex loss function (e.g., the least-square loss), $\|W\|_{2,1} = \sum_{j=1}^{d}\sqrt{\sum_{k=1}^{c} W_{jk}^2}$, and $\lambda \ge 0$ is the regularization parameter. The residues obtained from ℓp,∞ feature selection are generally better than those from the ℓ2,1 method of Eq.(12).

### Application for multi-class feature selection tasks

To further validate the feature selection results, we use two other widely used classifiers, SVM and kNN, to compute the classification accuracy after ℓp,∞ feature selection. We perform 5-fold cross-validation with both classifiers on three datasets: ALLAML, GLIOMAML, and LUNGML. We compare the proposed ℓp,∞ feature selection at different p against several popular feature selection methods: F-statistic (Liu and Motoda 1998), ReliefF (Kononenko 1994), (Kong et al. 2012), and mRmR (Ding and Peng 2003). Fig. 2 shows the classification accuracy of the different feature selection methods on the three datasets; in our method, p is set to 0.25, 0.5, and 1, and the reported results are the average accuracy over 5 runs. From the results in Fig. 2 we observe that: (1) ℓp,∞ feature selection produces better classification accuracy than F-statistic, ReliefF, and mRmR; (2) the classification accuracy is slightly better when p is small (say p = 0.25 or 0.5) for both kNN and SVM classifiers; (3) the accuracy obtained with ℓp,∞ feature selection at smaller p values is also generally better than that of the ℓ2,1 method, which again verifies the effectiveness of our method.
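To clarify the evaluation protocol, the sketch below (our own, assuming scikit-learn, a samples-by-features data matrix, and a W already fitted by the ℓp,∞ model; the function name is hypothetical) scores the top-r selected features with 5-fold cross-validation using either classifier.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate_selected_features(X, y, W, r, classifier="svm"):
    """5-fold CV accuracy using the top-r features ranked by the row-wise max of |W|.

    X: n x d data, y: length-n labels, W: d x c coefficients from Eq.(1).
    """
    scores = np.max(np.abs(W), axis=1)           # row score f(W_j) = max_k |W_jk|
    top = np.argsort(-scores)[:r]                # indices of the top-r features
    clf = SVC(kernel="linear", C=1) if classifier == "svm" else KNeighborsClassifier()
    return cross_val_score(clf, X[:, top], y, cv=5).mean()
```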
### Application for multi-label feature selection tasks

ℓp,∞ feature selection extends naturally to multi-label feature selection tasks. In multi-label classification problems, a data point can be attributed to multiple classes simultaneously. For the other multi-label feature selection baselines, we extend F-statistic (Liu and Motoda 1998), ReliefF (Kononenko 1994), and mRmR (Ding and Peng 2003) to multi-label classification using binary relevance (Tsoumakas, Katakis, and Vlahavas 2010). We adopt the macro-F1 score defined in (Yang 1999) to evaluate multi-label classification performance; the higher the macro-F1 score, the better the classification accuracy. We use a standard SVM classifier (linear kernel, C = 1) to validate the feature selection results and report the macro-F1 score from 5 rounds of 5-fold cross-validation in Fig. 3. Fig. 3 indicates that the ℓp,∞ method performs much better than the other feature selection methods (e.g., ReliefF, mRmR); moreover, when p is small, the feature selection results are generally better than at p = 1.

## Conclusion

In this paper, we propose the ℓp,∞ operator for feature selection. An efficient algorithm with rigorous analysis is presented to solve the ℓp,∞ regularization problem. Extensive experiments on multi-class and multi-label datasets show that our algorithm is well behaved at smaller p values.

Acknowledgement. This research is partially supported by NSF-CCF-0917274 and NSF-DMS-0915228 grants.

## References

Beck, A., and Teboulle, M. 2009a. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183-202.

Beck, A., and Teboulle, M. 2009b. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183-202.

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Chang, X.; Shen, H.; Wang, S.; Liu, J.; and Li, X. 2014. Semi-supervised feature analysis for multimedia annotation by mining label correlation. In Advances in Knowledge Discovery and Data Mining, 74-85.

Chartrand, R., and Staneva, V. 2008. Restricted isometry properties and nonconvex compressive sensing. Inverse Problems 24(035020):1-14.

Ding, C., and Peng, H. 2003. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol., 523-529.

Fan, J., and Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456):1348-1359.

Hastie, T.; Tibshirani, R.; and Friedman, J. H. 2001. The Elements of Statistical Learning. Springer.

Huang, J.; Horowitz, J.; and Ma, S. 2008. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Annals of Statistics 36(2):587-613.

Kohavi, R., and John, G. H. 1997. Wrappers for feature subset selection. Artificial Intelligence 97(1):273-324.

Kong, D., and Ding, C. H. Q. 2013. Efficient algorithms for selecting features with arbitrary group constraints via group lasso. In ICDM, 379-388.

Kong, D.; Ding, C. H. Q.; Huang, H.; and Zhao, H. 2012. Multi-label ReliefF and F-statistic feature selections for image annotation. In CVPR, 2352-2359.

Kong, D.; Ding, C. H. Q.; and Huang, H. 2011. Robust nonnegative matrix factorization using L21-norm. In CIKM, 673-682.

Kong, D.; Zhang, M.; and Ding, C. H. Q. 2013. Minimal shrinkage for noisy data recovery using Schatten-p norm objective. In ECML/PKDD (2), 177-193.

Kononenko, I. 1994. Estimating attributes: Analysis and extensions of RELIEF. 171-182. Springer-Verlag.

Langley, P. 1994. Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance, 140-144. AAAI Press.
Liu, H., and Motoda, H. 1998. Feature Selection for Knowledge Discovery and Data Mining. Springer.

Liu, J.; Ji, S.; and Ye, J. 2009. Multi-task feature learning via efficient l2,1-norm minimization. In UAI, 339-348.

Liu, H.; Palatucci, M.; and Zhang, J. 2009. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In ICML.

Masaeli, M.; Fung, G.; and Dy, J. G. 2010. From transformation-based dimensionality reduction to feature selection. In ICML.

Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.

Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3):145-175.

Quattoni, A.; Collins, M.; and Darrell, T. 2008. Transfer learning for image classification with sparse prototype representations. In CVPR.

Schmidt, M.; Murphy, K.; Fung, G.; and Rosales, R. 2008. Structure learning in random fields for heart motion abnormality detection. In CVPR.

Tropp, J. A.; Gilbert, A. C.; and Strauss, M. J. 2006. Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Processing, 589-602.

Tsoumakas, G.; Katakis, I.; and Vlahavas, I. 2010. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook.

Turlach, B. A.; Venables, W. N.; and Wright, S. J. 2005. Simultaneous variable selection. Technometrics, 349-363.

Vogt, J. E., and Roth, V. 2012. A complete analysis of the l1,p group-lasso.

Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval 1:67-88.

Yuan, L.; Liu, J.; and Ye, J. 2011. Efficient methods for overlapping group lasso. In NIPS, 352-360.

Zhang, Y. 2011. A Probabilistic Framework for Learning Task Relationships in Multi-Task Learning. Ph.D. thesis, The Hong Kong University of Science and Technology.