# Embedded Unsupervised Feature Selection

Suhang Wang, Jiliang Tang, Huan Liu
School of Computing, Informatics, and Decision Systems Engineering
Arizona State University, USA
{suhang.wang, jiliang.tang, huan.liu}@asu.edu

Abstract

Sparse learning has been proven to be a powerful technique in supervised feature selection, which allows feature selection to be embedded in the classification (or regression) problem. In recent years, increasing attention has been paid to applying sparse learning in unsupervised feature selection. Due to the lack of label information, the vast majority of these algorithms generate cluster labels via clustering algorithms and then formulate unsupervised feature selection as sparse learning based supervised feature selection with these generated cluster labels. In this paper, we propose a novel unsupervised feature selection algorithm, EUFS, which directly embeds feature selection into a clustering algorithm via sparse learning, without this transformation. The Alternating Direction Method of Multipliers is used to address the optimization problem of EUFS. Experimental results on various benchmark datasets demonstrate the effectiveness of the proposed framework EUFS.

Introduction

In many real-world applications such as data mining and machine learning, one is often faced with high-dimensional data (Jain and Zongker 1997; Guyon and Elisseeff 2003). Data with high dimensionality not only significantly increases the time and memory requirements of algorithms, but also degrades the performance of many algorithms due to the curse of dimensionality and the existence of irrelevant, redundant and noisy dimensions (Liu and Motoda 2007). Feature selection, which reduces the dimensionality by selecting a subset of the most relevant features, has been proven to be an effective and efficient way to handle high-dimensional data (John et al. 1994; Liu and Motoda 2007). In terms of label availability, feature selection methods can be broadly classified into supervised and unsupervised methods. The availability of class labels allows supervised feature selection algorithms (Duda et al. 2001; Nie et al. 2008; Zhao et al. 2010; Tang et al. 2014) to effectively select discriminative features that distinguish samples from different classes. Sparse learning has been proven to be a powerful technique in supervised feature selection (Nie et al. 2010; Gu and Han 2011; Tang and Liu 2012a), which enables feature selection to be embedded in the classification (or regression) problem. Since most data is unlabeled and labeling data is very expensive, unsupervised feature selection has attracted more and more attention in recent years (Wolf and Shashua 2005; He et al. 2005; Boutsidis et al. 2009; Yang et al. 2011; Qian and Zhai 2013; Alelyani et al. 2013). Without label information to define feature relevance, a number of alternative criteria have been proposed for unsupervised feature selection. One commonly used criterion is to select features that can preserve the data similarity or manifold structure constructed from the whole feature space (He et al. 2005; Zhao and Liu 2007). In recent years, applying sparse learning in unsupervised feature selection has attracted increasing attention.
These methods usually generate cluster labels via clustering algorithms and then transform unsupervised feature selection into sparse learning based supervised feature selection with these generated cluster labels; examples include Multi-Cluster Feature Selection (MCFS) (Cai et al. 2010), Nonnegative Discriminative Feature Selection (NDFS) (Li et al. 2012), and Robust Unsupervised Feature Selection (RUFS) (Qian and Zhai 2013). In this paper, we propose a novel unsupervised feature selection algorithm, i.e., Embedded Unsupervised Feature Selection (EUFS). Unlike existing unsupervised feature selection methods such as MCFS, NDFS or RUFS, which transform unsupervised feature selection into sparse learning based supervised feature selection with cluster labels generated by clustering algorithms, we directly embed feature selection into a clustering algorithm via sparse learning without the transformation (see Figure 1). This work theoretically extends the current state of the art in unsupervised feature selection, algorithmically expands the capability of unsupervised feature selection, and empirically demonstrates the efficacy of the new algorithm. The major contributions of this paper are summarized next:

- Providing a way to directly embed unsupervised feature selection into a clustering algorithm via sparse learning, instead of transforming it into sparse learning based supervised feature selection with cluster labels;
- Proposing an embedded feature selection framework, EUFS, which selects features in unsupervised scenarios with sparse learning; and
- Conducting experiments on various datasets to demonstrate the effectiveness of the proposed framework EUFS.

Figure 1: Differences between existing sparse learning based unsupervised feature selection methods (a) and the proposed embedded unsupervised feature selection (b).

The rest of this paper is organized as follows. In Section 2, we give details about the embedded unsupervised feature selection framework EUFS. In Section 3, we introduce a method to solve the optimization problem of the proposed framework. In Section 4, we show empirical evaluation with discussion. In Section 5, we present the conclusion with future work.

Embedded Unsupervised Feature Selection

Throughout this paper, matrices are written as boldface capital letters and vectors are denoted as boldface lowercase letters. For an arbitrary matrix $\mathbf{M} \in \mathbb{R}^{m \times n}$, $M_{ij}$ denotes the $(i,j)$-th entry of $\mathbf{M}$, while $\mathbf{m}^i$ and $\mathbf{m}_j$ denote the $i$-th row and the $j$-th column of $\mathbf{M}$, respectively. $\|\mathbf{M}\|_F$ is the Frobenius norm of $\mathbf{M}$ and $\mathrm{Tr}(\mathbf{M})$ is the trace of $\mathbf{M}$ if $\mathbf{M}$ is square. $\langle \mathbf{A}, \mathbf{B} \rangle$ equals $\mathrm{Tr}(\mathbf{A}^T\mathbf{B})$, which is the standard inner product between two matrices. $\mathbf{I}$ is the identity matrix and $\mathbf{1}$ is a vector whose elements are all 1. The $\ell_{2,1}$-norm is defined as $\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{m} \|\mathbf{m}^i\|_2 = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} M_{ij}^2}$.

Let $\mathbf{X} \in \mathbb{R}^{N \times d}$ be the data matrix with each row $\mathbf{x}_i \in \mathbb{R}^{1 \times d}$ being a data instance. We use $\mathcal{F} = \{f_1, \ldots, f_d\}$ to denote the $d$ features, and $\mathbf{f}_1, \ldots, \mathbf{f}_d$ are the corresponding feature vectors (columns of $\mathbf{X}$). Assume that each feature has been normalized, i.e., $\|\mathbf{f}_j\|_2 = 1$ for $j = 1, \ldots, d$. Suppose that we want to cluster $\mathbf{X}$ into $k$ clusters $(\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k)$ under the matrix factorization framework as:

$$
\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}^T\|_F^2 \quad \text{s.t. } \mathbf{U} \in \{0,1\}^{N \times k},\ \mathbf{U}\mathbf{1} = \mathbf{1} \qquad (1)
$$

where $\mathbf{U} \in \mathbb{R}^{N \times k}$ is the cluster indicator matrix and $\mathbf{V} \in \mathbb{R}^{d \times k}$ is the latent feature matrix.
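To make the notation concrete, the following is a small NumPy sketch (illustrative only, not taken from the authors' released implementation) of the $\ell_{2,1}$-norm and of evaluating the objective in Eq.(1) for a fixed hard cluster indicator; the toy data and cluster assignments are made up for the example.

```python
import numpy as np

def l21_norm(M):
    # ||M||_{2,1}: sum of the l2 norms of the rows of M
    return np.linalg.norm(M, axis=1).sum()

# Toy data: N = 4 instances, d = 3 features, k = 2 clusters (values are made up)
X = np.array([[1.0, 0.0, 0.2],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.8],
              [0.1, 0.9, 1.0]])
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # hard indicator: U 1 = 1

# For a fixed U, the least-squares latent features are V^T = (U^T U)^{-1} U^T X
V = (np.linalg.pinv(U) @ X).T                        # V is d x k: one row per feature
objective = np.linalg.norm(X - U @ V.T, "fro") ** 2  # objective of Eq. (1)
print(objective, l21_norm(V))
```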
The problem in Eq.(1) is difficult to solve due to the constraint on $\mathbf{U}$. Following the common relaxation for the label indicator matrix (Von Luxburg 2007; Tang and Liu 2012b), the constraint on $\mathbf{U}$ is relaxed to nonnegative orthogonality, i.e., $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, $\mathbf{U} \geq 0$. After the relaxation, Eq.(1) can be rewritten as:

$$
\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}^T\|_F^2 \quad \text{s.t. } \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{U} \geq 0 \qquad (2)
$$

Another significance of the orthogonality constraint on $\mathbf{U}$ is that it allows us to perform feature selection via $\mathbf{V}$, which can be stated by the following theorem:

Theorem 1. Let $\mathbf{X} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_d]$ with $\|\mathbf{f}_i\|_2 = 1$ for $i = 1, \ldots, d$. We use $\mathbf{U}\mathbf{V}^T$ to reconstruct $\mathbf{X}$, i.e., $\hat{\mathbf{X}} = \mathbf{U}\mathbf{V}^T$. If $\mathbf{U}$ is orthogonal, then we can perform feature selection via $\mathbf{V}$.

Proof. Since $\hat{\mathbf{X}} = \mathbf{U}\mathbf{V}^T$, we have $\hat{\mathbf{f}}_i = \mathbf{U}(\mathbf{v}^i)^T$, where $\mathbf{v}^i$ is the $i$-th row of $\mathbf{V}$. Then

$$
\|\hat{\mathbf{f}}_i\|_2 = \|\mathbf{U}(\mathbf{v}^i)^T\|_2 = \left(\mathbf{v}^i \mathbf{U}^T\mathbf{U} (\mathbf{v}^i)^T\right)^{1/2} = \|\mathbf{v}^i\|_2 \qquad (3)
$$

Consider the case where $\|\mathbf{v}^i\|_2$ is close to 0, which indicates that the reconstructed feature representation $\|\hat{\mathbf{f}}_i\|_2$ is close to 0. Since $\|\mathbf{f}_i\|_2 = 1$, this means $\mathbf{f}_i$ is not well reconstructed by $\hat{\mathbf{f}}_i$, which suggests that the corresponding feature may not be representative and we should exclude such features to obtain a better reconstruction. One way to do this is to add a selection matrix $\mathrm{diag}(\mathbf{p})$ to $\mathbf{X}$ and $\mathbf{V}$ as

$$
\|\mathbf{X}\,\mathrm{diag}(\mathbf{p}) - \mathbf{U}(\mathrm{diag}(\mathbf{p})\mathbf{V})^T\|_F^2 \qquad (4)
$$

where $\mathbf{p} \in \{0,1\}^d$ with $p_i = 1$ if the $i$-th feature is selected and $p_i = 0$ otherwise, which completes the proof.

With Theorem 1, if we want to select $m$ features for the clustering algorithm in Eq.(2), we can rewrite it as:

$$
\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X}\,\mathrm{diag}(\mathbf{p}) - \mathbf{U}(\mathrm{diag}(\mathbf{p})\mathbf{V})^T\|_F^2 \quad \text{s.t. } \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{U} \geq 0,\ \mathbf{p} \in \{0,1\}^d,\ \mathbf{p}^T\mathbf{1} = m \qquad (5)
$$

The constraint on $\mathbf{p}$ makes Eq.(5) a mixed integer program (Boyd and Vandenberghe 2004), which is difficult to solve. We relax the problem in the following way. First, the following theorem suggests that we can ignore the selection matrix on $\mathbf{X}$ as

$$
\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}(\mathrm{diag}(\mathbf{p})\mathbf{V})^T\|_F^2 \quad \text{s.t. } \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{U} \geq 0,\ \mathbf{p} \in \{0,1\}^d,\ \mathbf{p}^T\mathbf{1} = m \qquad (6)
$$

Theorem 2. The optimization problems in Eq.(5) and Eq.(6) are equivalent.

Proof. One way to prove Theorem 2 is to show that the objective functions in Eq.(5) and Eq.(6) differ only by a constant. For Eq.(5), we have

$$
\|\mathbf{X}\,\mathrm{diag}(\mathbf{p}) - \mathbf{U}(\mathrm{diag}(\mathbf{p})\mathbf{V})^T\|_F^2 = \sum_{i=1}^{d} \|p_i\mathbf{f}_i - p_i\mathbf{U}(\mathbf{v}^i)^T\|_2^2 = \sum_{i:p_i=1} \|\mathbf{f}_i - \mathbf{U}(\mathbf{v}^i)^T\|_2^2 \qquad (7)
$$

And for Eq.(6), we have

$$
\|\mathbf{X} - \mathbf{U}(\mathrm{diag}(\mathbf{p})\mathbf{V})^T\|_F^2 = \sum_{i=1}^{d} \|\mathbf{f}_i - p_i\mathbf{U}(\mathbf{v}^i)^T\|_2^2 = \sum_{i:p_i=1} \|\mathbf{f}_i - \mathbf{U}(\mathbf{v}^i)^T\|_2^2 + (d - m) \qquad (8)
$$

since $\|\mathbf{f}_i\|_2 = 1$ for each of the $d - m$ unselected features. The two objectives thus differ only by a constant, which completes the proof.

We observe that $\mathrm{diag}(\mathbf{p})$ and $\mathbf{V}$ appear in Eq.(6) only in the form $\mathrm{diag}(\mathbf{p})\mathbf{V}$. Since $\mathbf{p}$ is a binary vector and $d - m$ rows of $\mathrm{diag}(\mathbf{p})$ are all zeros, $\mathrm{diag}(\mathbf{p})\mathbf{V}$ is a matrix in which the elements of many rows are all zeros. This motivates us to absorb $\mathrm{diag}(\mathbf{p})$ into $\mathbf{V}$, i.e., $\mathbf{V} \leftarrow \mathrm{diag}(\mathbf{p})\mathbf{V}$, and add an $\ell_{2,1}$-norm on $\mathbf{V}$ to achieve feature selection as

$$
\arg\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}^T\|_F^2 + \alpha\|\mathbf{V}\|_{2,1} \quad \text{s.t. } \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{U} \geq 0 \qquad (9)
$$

Since we force some rows of $\mathbf{V}$ to be close to 0, $\mathbf{U}$ and $\mathbf{V}$ may reconstruct some data instances poorly. Reconstruction errors from these instances can easily dominate the objective function because the errors are squared. To make the model robust to these instances, we adopt robust analysis, i.e., we replace the loss function by the $\ell_{2,1}$-norm, as follows:

$$
\arg\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}^T\|_{2,1} + \alpha\|\mathbf{V}\|_{2,1} \quad \text{s.t. } \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{U} \geq 0 \qquad (10)
$$

To take advantage of the information from the attribute-value part, i.e., $\mathbf{X}$, similar data instances should have similar labels. Following spectral analysis (Von Luxburg 2007), we further add the following term to force similar instances to have similar labels:

$$
\min\ \mathrm{Tr}(\mathbf{U}^T\mathbf{L}\mathbf{U}) \qquad (11)
$$

where $\mathbf{L} = \mathbf{D} - \mathbf{S}$ is the Laplacian matrix and $\mathbf{D}$ is a diagonal matrix with its elements defined as $D_{ii} = \sum_{j=1}^{N} S_{ij}$. Here $\mathbf{S} \in \mathbb{R}^{N \times N}$ denotes the similarity matrix based on $\mathbf{X}$, which is obtained through the RBF kernel as

$$
S_{ij} = e^{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{\sigma^2}} \qquad (12)
$$
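As a concrete illustration of the graph regularization term, the sketch below (illustrative, not the authors' released code) builds an RBF similarity matrix, the degree matrix, and the Laplacian, and evaluates $\mathrm{Tr}(\mathbf{U}^T\mathbf{L}\mathbf{U})$; the bandwidth value is an assumption for the example, and a dense similarity matrix is used for simplicity even though the experiments later fix a neighborhood size of 5.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    # S_ij = exp(-||x_i - x_j||^2 / sigma^2); sigma is an illustrative bandwidth
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / sigma ** 2)

def graph_laplacian(S):
    # L = D - S, with D diagonal and D_ii = sum_j S_ij
    return np.diag(S.sum(axis=1)) - S

def graph_regularizer(U, L):
    # Tr(U^T L U): small when similar instances receive similar (soft) cluster labels
    return np.trace(U.T @ L @ U)
```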
Putting Eq.(10) and Eq.(11) together, the proposed framework EUFS solves the following optimization problem:

$$
\arg\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}^T\|_{2,1} + \alpha\|\mathbf{V}\|_{2,1} + \beta\,\mathrm{Tr}(\mathbf{U}^T\mathbf{L}\mathbf{U}) \quad \text{s.t. } \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{U} \geq 0 \qquad (13)
$$

Optimization Algorithm

The objective function in Eq.(13) is not jointly convex in $\mathbf{U}$ and $\mathbf{V}$, but it is convex in each variable when the other is fixed. Following (Huang et al. 2014), we use the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011) to optimize the objective function. By introducing two auxiliary variables $\mathbf{E} = \mathbf{X} - \mathbf{U}\mathbf{V}^T$ and $\mathbf{Z} = \mathbf{U}$, we can convert Eq.(13) into the following equivalent problem:

$$
\arg\min_{\mathbf{U},\mathbf{V},\mathbf{E},\mathbf{Z}} \|\mathbf{E}\|_{2,1} + \alpha\|\mathbf{V}\|_{2,1} + \beta\,\mathrm{Tr}(\mathbf{Z}^T\mathbf{L}\mathbf{U}) \quad \text{s.t. } \mathbf{E} = \mathbf{X} - \mathbf{U}\mathbf{V}^T,\ \mathbf{Z} = \mathbf{U},\ \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{Z} \geq 0 \qquad (14)
$$

which can be solved via the following ADMM problem:

$$
\min_{\mathbf{U},\mathbf{V},\mathbf{E},\mathbf{Z},\mathbf{Y}_1,\mathbf{Y}_2,\mu} \|\mathbf{E}\|_{2,1} + \alpha\|\mathbf{V}\|_{2,1} + \beta\,\mathrm{Tr}(\mathbf{Z}^T\mathbf{L}\mathbf{U}) + \langle\mathbf{Y}_1, \mathbf{Z} - \mathbf{U}\rangle + \langle\mathbf{Y}_2, \mathbf{X} - \mathbf{U}\mathbf{V}^T - \mathbf{E}\rangle + \frac{\mu}{2}\left(\|\mathbf{Z} - \mathbf{U}\|_F^2 + \|\mathbf{X} - \mathbf{U}\mathbf{V}^T - \mathbf{E}\|_F^2\right) \quad \text{s.t. } \mathbf{U}^T\mathbf{U} = \mathbf{I},\ \mathbf{Z} \geq 0 \qquad (15)
$$

where $\mathbf{Y}_1$ and $\mathbf{Y}_2$ are two Lagrangian multipliers and $\mu$ is a scalar that controls the penalty for the violation of the equality constraints $\mathbf{E} = \mathbf{X} - \mathbf{U}\mathbf{V}^T$ and $\mathbf{Z} = \mathbf{U}$.

Update E

To update $\mathbf{E}$, we fix the other variables and remove the terms that are irrelevant to $\mathbf{E}$. Then Eq.(15) becomes

$$
\min_{\mathbf{E}} \frac{1}{2}\left\|\mathbf{E} - \left(\mathbf{X} - \mathbf{U}\mathbf{V}^T + \frac{1}{\mu}\mathbf{Y}_2\right)\right\|_F^2 + \frac{1}{\mu}\|\mathbf{E}\|_{2,1} \qquad (16)
$$

This problem has a closed-form solution by the following lemma (Liu et al. 2009):

Lemma 3. Let $\mathbf{Q} = [\mathbf{q}^1; \mathbf{q}^2; \ldots; \mathbf{q}^m]$ be a given matrix and $\lambda$ a positive scalar. If the optimal solution of

$$
\min_{\mathbf{W}} \frac{1}{2}\|\mathbf{W} - \mathbf{Q}\|_F^2 + \lambda\|\mathbf{W}\|_{2,1} \qquad (17)
$$

is $\mathbf{W}^*$, then the $i$-th row of $\mathbf{W}^*$ is

$$
\mathbf{w}^{*i} = \begin{cases} \left(1 - \frac{\lambda}{\|\mathbf{q}^i\|_2}\right)\mathbf{q}^i, & \text{if } \|\mathbf{q}^i\|_2 > \lambda \\ \mathbf{0}, & \text{otherwise} \end{cases} \qquad (18)
$$

Apparently, if we let $\mathbf{Q} = \mathbf{X} - \mathbf{U}\mathbf{V}^T + \frac{1}{\mu}\mathbf{Y}_2$, then using Lemma 3, $\mathbf{E}$ can be updated row-wise as

$$
\mathbf{e}^i = \begin{cases} \left(1 - \frac{1}{\mu\|\mathbf{q}^i\|_2}\right)\mathbf{q}^i, & \text{if } \|\mathbf{q}^i\|_2 > \frac{1}{\mu} \\ \mathbf{0}, & \text{otherwise} \end{cases} \qquad (19)
$$

Update V

To update $\mathbf{V}$, we fix the other variables and remove the terms that are irrelevant to $\mathbf{V}$; then Eq.(15) becomes

$$
\min_{\mathbf{V}} \frac{\mu}{2}\left\|\mathbf{X} - \mathbf{U}\mathbf{V}^T - \mathbf{E} + \frac{1}{\mu}\mathbf{Y}_2\right\|_F^2 + \alpha\|\mathbf{V}\|_{2,1} \qquad (20)
$$

Using the fact that $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, we can reformulate Eq.(20) as

$$
\min_{\mathbf{V}} \frac{1}{2}\left\|\mathbf{V} - \left(\mathbf{X} - \mathbf{E} + \frac{1}{\mu}\mathbf{Y}_2\right)^T\mathbf{U}\right\|_F^2 + \frac{\alpha}{\mu}\|\mathbf{V}\|_{2,1} \qquad (21)
$$

Again, the above problem has a closed-form solution according to Lemma 3. Let $\mathbf{K} = (\mathbf{X} - \mathbf{E} + \frac{1}{\mu}\mathbf{Y}_2)^T\mathbf{U}$; then

$$
\mathbf{v}^i = \begin{cases} \left(1 - \frac{\alpha}{\mu\|\mathbf{k}^i\|_2}\right)\mathbf{k}^i, & \text{if } \|\mathbf{k}^i\|_2 > \frac{\alpha}{\mu} \\ \mathbf{0}, & \text{otherwise} \end{cases} \qquad (22)
$$

Update Z

Similarly, to update $\mathbf{Z}$, we fix $\mathbf{U}$, $\mathbf{V}$, $\mathbf{E}$, $\mathbf{Y}_1$, $\mathbf{Y}_2$, $\mu$ and remove the terms irrelevant to $\mathbf{Z}$; then Eq.(15) becomes

$$
\min_{\mathbf{Z} \geq 0} \frac{\mu}{2}\|\mathbf{Z} - \mathbf{U}\|_F^2 + \beta\,\mathrm{Tr}(\mathbf{Z}^T\mathbf{L}\mathbf{U}) + \langle\mathbf{Y}_1, \mathbf{Z} - \mathbf{U}\rangle \qquad (23)
$$

We can rewrite Eq.(23) by absorbing the second and third terms into the quadratic term and obtain the compact form

$$
\min_{\mathbf{Z} \geq 0} \|\mathbf{Z} - \mathbf{T}\|_F^2 \qquad (24)
$$

where $\mathbf{T}$ is defined as

$$
\mathbf{T} = \mathbf{U} - \frac{1}{\mu}\left(\mathbf{Y}_1 + \beta\mathbf{L}\mathbf{U}\right) \qquad (25)
$$

Eq.(24) can be further decomposed into element-wise optimization problems as

$$
\min_{Z_{ij} \geq 0} (Z_{ij} - T_{ij})^2 \qquad (26)
$$

Clearly, the optimal solution of the above problem is

$$
Z_{ij} = \max(T_{ij}, 0) \qquad (27)
$$

Update U

Optimizing Eq.(15) with respect to $\mathbf{U}$ yields

$$
\min_{\mathbf{U}^T\mathbf{U} = \mathbf{I}} \langle\mathbf{Y}_1, \mathbf{Z} - \mathbf{U}\rangle + \langle\mathbf{Y}_2, \mathbf{X} - \mathbf{U}\mathbf{V}^T - \mathbf{E}\rangle + \frac{\mu}{2}\left(\|\mathbf{Z} - \mathbf{U}\|_F^2 + \|\mathbf{X} - \mathbf{U}\mathbf{V}^T - \mathbf{E}\|_F^2\right) + \beta\,\mathrm{Tr}(\mathbf{Z}^T\mathbf{L}\mathbf{U}) \qquad (28)
$$

By expanding Eq.(28) and dropping the terms that are independent of $\mathbf{U}$, we arrive at

$$
\min_{\mathbf{U}^T\mathbf{U} = \mathbf{I}} \frac{\mu}{2}\|\mathbf{U}\|_F^2 - \mu\langle\mathbf{N}, \mathbf{U}\rangle \qquad (29)
$$

where $\mathbf{N}$ is defined as

$$
\mathbf{N} = \frac{1}{\mu}\mathbf{Y}_1 + \mathbf{Z} - \frac{\beta}{\mu}\mathbf{L}\mathbf{Z} + \left(\mathbf{X} - \mathbf{E} + \frac{1}{\mu}\mathbf{Y}_2\right)\mathbf{V} \qquad (30)
$$

Since $\|\mathbf{U}\|_F^2 = \mathrm{Tr}(\mathbf{U}^T\mathbf{U}) = k$ is constant under the orthogonality constraint, we can further write the above problem in the more compact form

$$
\min_{\mathbf{U}^T\mathbf{U} = \mathbf{I}} \|\mathbf{U} - \mathbf{N}\|_F^2 \qquad (31)
$$

We have now converted the subproblem of updating $\mathbf{U}$ into the classical Orthogonal Procrustes problem (Schönemann 1966), which can be solved using the following lemma (Huang et al. 2014):

Lemma 4. Given the objective in Eq.(31), the optimal $\mathbf{U}$ is given by

$$
\mathbf{U} = \mathbf{P}\mathbf{Q}^T \qquad (32)
$$

where $\mathbf{P}$ and $\mathbf{Q}$ are the left and right singular vectors of the economic singular value decomposition (SVD) of $\mathbf{N}$.
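The two closed-form building blocks above are straightforward to implement. The sketch below (illustrative, not the authors' released implementation) gives the row-wise shrinkage operator of Lemma 3, which yields the E and V updates in Eq.(19) and Eq.(22), and the orthogonal Procrustes step of Lemma 4 used for U; the small epsilon guarding against division by zero is our addition.

```python
import numpy as np

def row_shrink(Q, lam, eps=1e-12):
    # Lemma 3: argmin_W 0.5*||W - Q||_F^2 + lam*||W||_{2,1}
    # scales each row q^i by max(0, 1 - lam/||q^i||), so rows with small norm become zero
    norms = np.linalg.norm(Q, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - lam / np.maximum(norms, eps)) * Q

def procrustes(N):
    # Lemma 4: argmin_{U^T U = I} ||U - N||_F^2 is U = P Q^T,
    # where P, Q are the singular vectors of the economic SVD of N
    P, _, Qt = np.linalg.svd(N, full_matrices=False)
    return P @ Qt

# Usage inside one ADMM sweep (X, U, V, E, Y2, mu, alpha defined elsewhere):
#   E = row_shrink(X - U @ V.T + Y2 / mu, 1.0 / mu)        # Eq. (19)
#   V = row_shrink((X - E + Y2 / mu).T @ U, alpha / mu)    # Eq. (22)
```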
Update Y1, Y2 and µ

After updating the variables, we need to update the ADMM parameters. According to (Boyd et al. 2011), they are updated as follows:

$$
\mathbf{Y}_1 = \mathbf{Y}_1 + \mu(\mathbf{Z} - \mathbf{U}) \qquad (33)
$$
$$
\mathbf{Y}_2 = \mathbf{Y}_2 + \mu(\mathbf{X} - \mathbf{U}\mathbf{V}^T - \mathbf{E}) \qquad (34)
$$
$$
\mu = \min(\rho\mu, \mu_{\max}) \qquad (35)
$$

Here, $\rho > 1$ is a parameter that controls the convergence speed and $\mu_{\max}$ is a large number that prevents $\mu$ from becoming too large. With these updating rules, the EUFS algorithm is summarized in Algorithm 1.

Algorithm 1 Embedded Unsupervised Feature Selection
Input: $\mathbf{X} \in \mathbb{R}^{N \times d}$, $\alpha$, $\beta$, $n$, latent dimension $k$
Output: $n$ selected features for the dataset
1: Initialize $\mu = 10^{-3}$, $\rho = 1.1$, $\mu_{\max} = 10^{10}$, $\mathbf{U} = \mathbf{0}$, $\mathbf{V} = \mathbf{0}$ (or initialize using k-means)
2: repeat
3: Calculate $\mathbf{Q} = \mathbf{X} - \mathbf{U}\mathbf{V}^T + \frac{1}{\mu}\mathbf{Y}_2$
4: Update $\mathbf{E}$ by Eq.(19)
5: Calculate $\mathbf{K} = (\mathbf{X} - \mathbf{E} + \frac{1}{\mu}\mathbf{Y}_2)^T\mathbf{U}$
6: Update $\mathbf{V}$ by Eq.(22)
7: Calculate $\mathbf{T}$ using Eq.(25)
8: Update $\mathbf{Z}$ using Eq.(27)
9: Calculate $\mathbf{N}$ according to Eq.(30)
10: Update $\mathbf{U}$ by Lemma 4
11: Update $\mathbf{Y}_1$, $\mathbf{Y}_2$, $\mu$ by Eqs.(33)-(35)
12: until convergence
13: Sort the features of $\mathbf{X}$ according to $\|\mathbf{v}^i\|_2$ in descending order and select the top-$n$ ranked ones

Parameter Initialization

One way to initialize $\mathbf{U}$ and $\mathbf{V}$ is to simply set them to 0. As the algorithm runs, the objective function gradually converges to the optimal value. To accelerate convergence, following the common way of initializing NMF, we can use k-means to initialize $\mathbf{U}$ and $\mathbf{V}$. Specifically, we apply k-means to cluster the rows of $\mathbf{X}$ and obtain the soft cluster indicator $\mathbf{U}$; $\mathbf{V}$ is simply set to $\mathbf{X}^T\mathbf{U}$. $\mu$ is typically initialized in the range of $10^{-6}$ to $10^{-3}$, depending on the dataset, and is updated in each iteration. $\mu_{\max}$ is set to a large value such as $10^{10}$ to give $\mu$ freedom to increase while preventing it from becoming too large. $\rho$ is empirically set to 1.1 in our algorithm. The larger $\rho$ is, the faster $\mu$ grows and the more we penalize violations of the equality constraints, which makes the algorithm converge faster; however, a large $\rho$ may sacrifice some precision in the final objective value.

Convergence Analysis

The convergence of our algorithm depends on the convergence of ADMM. A detailed convergence proof of ADMM can be found in (Goldstein et al. 2012; Boyd et al. 2011). The convergence criterion can be set as $\frac{|J_{t+1} - J_t|}{J_t} < \epsilon$, where $J_t$ is the objective function value of Eq.(14) at iteration $t$ and $\epsilon$ is a tolerance value. In practice, we can also bound the number of iterations by setting a maximum iteration count. Our experiments find that the algorithm converges within 110 iterations for all the datasets we used.

Time Complexity Analysis

The computational cost for $\mathbf{E}$ depends on the computation of $\mathbf{Q} = \mathbf{X} - \mathbf{U}\mathbf{V}^T + \frac{1}{\mu}\mathbf{Y}_2$ and the update of $\mathbf{E}$. Since $\mathbf{U}$ is sparse, i.e., each row of $\mathbf{U}$ has only one nonzero element, these costs are $O(Nd)$ and $O(Nd)$, respectively. Similarly, the cost for $\mathbf{V}$ involves the computation of $\mathbf{K} = (\mathbf{X} - \mathbf{E} + \frac{1}{\mu}\mathbf{Y}_2)^T\mathbf{U}$ and the update of $\mathbf{V}$, which is again $O(Nd)$ due to the sparsity of $\mathbf{U}$. The main cost for $\mathbf{Z}$ is the computation of $\mathbf{T} = \mathbf{U} - \frac{1}{\mu}(\mathbf{Y}_1 + \beta\mathbf{L}\mathbf{U})$, which is $O(k^2)$ due to the sparsity of both $\mathbf{U}$ and $\mathbf{L}$. The main cost for $\mathbf{U}$ involves the computation of $\mathbf{N}$ and its SVD, which are $O(Ndk)$ and $O(Nk^2)$, respectively. The cost for $\mathbf{Y}_1$ and $\mathbf{Y}_2$ is $O(Nd)$ each. Therefore, the overall time complexity is $O(Ndk + Nk^2)$. Since $d \gg k$, the final cost is $O(Ndk)$ per iteration.
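To show how the update rules fit together, here is a compact, illustrative NumPy sketch of Algorithm 1, reusing the row_shrink, procrustes, and graph_laplacian helpers sketched earlier. It is not the authors' released implementation; the zero initialization of the multipliers and the relative-change stopping rule are assumptions consistent with the text above.

```python
import numpy as np

def eufs(X, L, k, alpha, beta, n_select, mu=1e-3, rho=1.1,
         mu_max=1e10, max_iter=110, tol=1e-4):
    """Illustrative sketch of Algorithm 1; returns indices of the selected features."""
    N, d = X.shape
    U = np.zeros((N, k)); V = np.zeros((d, k))
    E = np.zeros((N, d)); Z = np.zeros((N, k))
    Y1 = np.zeros((N, k)); Y2 = np.zeros((N, d))   # multipliers assumed zero-initialized
    prev_obj = None
    for _ in range(max_iter):
        E = row_shrink(X - U @ V.T + Y2 / mu, 1.0 / mu)              # Eq. (19)
        V = row_shrink((X - E + Y2 / mu).T @ U, alpha / mu)          # Eq. (22)
        T = U - (Y1 + beta * (L @ U)) / mu                           # Eq. (25)
        Z = np.maximum(T, 0.0)                                       # Eq. (27)
        Nmat = Y1 / mu + Z - beta * (L @ Z) / mu + (X - E + Y2 / mu) @ V  # Eq. (30)
        U = procrustes(Nmat)                                         # Lemma 4
        Y1 = Y1 + mu * (Z - U)                                       # Eq. (33)
        Y2 = Y2 + mu * (X - U @ V.T - E)                             # Eq. (34)
        mu = min(rho * mu, mu_max)                                   # Eq. (35)
        # Relative change of the Eq. (13) objective as a simple stopping rule
        obj = (np.linalg.norm(X - U @ V.T, axis=1).sum()
               + alpha * np.linalg.norm(V, axis=1).sum()
               + beta * np.trace(U.T @ L @ U))
        if prev_obj is not None and abs(prev_obj - obj) / max(abs(prev_obj), 1e-12) < tol:
            break
        prev_obj = obj
    # Step 13: rank features by the l2 norms of the rows of V and keep the top n_select
    return np.argsort(-np.linalg.norm(V, axis=1))[:n_select]
```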
Experimental Analysis

In this section, we conduct experiments to evaluate the effectiveness of EUFS.[1] After introducing the datasets and experimental settings, we compare EUFS with state-of-the-art unsupervised feature selection methods. Further experiments are conducted to investigate the effects of important parameters on EUFS.

The experiments are conducted on 6 publicly available benchmark datasets, including one Mass Spectrometry (MS) dataset, ALLAML (Fodor 1997); two microarray datasets, i.e., Prostate Cancer gene expression (Prostate-GE)[2] (Singh et al. 2002) and TOX-171; two face image datasets, i.e., PIX10P and PIE10P[3]; and one object image dataset, COIL20[4] (Nene et al. 1996). The statistics of the datasets used in the experiments are summarized in Table 1.

[1] The implementation of EUFS can be found at http://www.public.asu.edu/~swang187/
[2] ALLAML and Prostate-GE are publicly available from https://sites.google.com/site/feipingnie/file
[3] TOX-171, PIX10P and PIE10P are publicly available from http://featureselection.asu.edu/datasets.php
[4] COIL20 is publicly available from http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html

Table 1: Statistics of the datasets

| Dataset | # of Samples | # of Features | # of Classes |
|---|---|---|---|
| ALLAML | 72 | 7192 | 2 |
| COIL20 | 1440 | 1024 | 20 |
| PIE10P | 210 | 1024 | 10 |
| TOX-171 | 171 | 5748 | 4 |
| PIX10P | 100 | 10000 | 10 |
| Prostate-GE | 102 | 5996 | 2 |

Experimental Settings

Following the common way of evaluating unsupervised feature selection algorithms, we assess EUFS in terms of clustering performance (Zhao and Liu 2007; Li et al. 2012). We compare EUFS with the following representative unsupervised feature selection algorithms:

- All Features: all original features are used.
- LS: Laplacian Score (He et al. 2005), which evaluates the importance of a feature through its power of locality preservation.
- MCFS: Multi-Cluster Feature Selection (Cai et al. 2010), which selects features using spectral regression with $\ell_1$-norm regularization.
- NDFS: Nonnegative Discriminative Feature Selection (Li et al. 2012), which selects features via a joint framework of nonnegative spectral analysis and $\ell_{2,1}$-regularized regression.
- RUFS: Robust Unsupervised Feature Selection (Qian and Zhai 2013), which jointly performs robust label learning via local learning regularized robust orthogonal nonnegative matrix factorization and robust feature learning via joint $\ell_{2,1}$-norm minimization.

Two widely used evaluation metrics, accuracy (ACC) and normalized mutual information (NMI), are employed to evaluate the quality of clusters; the larger ACC and NMI are, the better the performance. Several parameters need to be set. Following (Qian and Zhai 2013), for LS, MCFS, NDFS, RUFS and EUFS, we fix the neighborhood size to 5 for all the datasets. To fairly compare the different unsupervised feature selection methods, we tune the parameters for all methods by a grid search over $\{10^{-6}, 10^{-4}, \ldots, 10^{4}, 10^{6}\}$. For EUFS, we set the latent dimension to the number of clusters. Since how to determine the optimal number of selected features is still an open problem (Tang and Liu 2012a), we vary the number of selected features over $\{50, 100, 150, \ldots, 300\}$ for all datasets. The best clustering results from the optimal parameters are reported for all the algorithms. In the evaluation, we use K-means to cluster samples based on the selected features. Since K-means depends on initialization, following previous work, we repeat the experiments 20 times and report the average results with standard deviation.
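As an illustration of this evaluation protocol, the sketch below clusters the data restricted to the selected features with K-means and computes ACC (via the usual best one-to-one matching between clusters and classes) and NMI. It assumes scikit-learn and SciPy are available and that the ground-truth labels are 0-based integers; it is not tied to the exact scripts used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # ACC: best one-to-one matching between cluster labels and class labels
    y_true = np.asarray(y_true); y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize the number of matched samples
    return cost[rows, cols].sum() / len(y_true)

def evaluate_selection(X, y, selected, k, runs=20, seed=0):
    # Cluster on the selected features only and average ACC/NMI over several K-means runs
    accs, nmis = [], []
    for r in range(runs):
        pred = KMeans(n_clusters=k, n_init=10,
                      random_state=seed + r).fit_predict(X[:, selected])
        accs.append(clustering_accuracy(y, pred))
        nmis.append(normalized_mutual_info_score(y, pred))
    return np.mean(accs), np.std(accs), np.mean(nmis), np.std(nmis)
```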
Experimental Results

The experimental results of the different methods on the datasets are summarized in Table 2 and Table 3.

Table 2: Clustering results (ACC% ± std) of different feature selection algorithms on different datasets. The best results are highlighted in bold. The number in parentheses is the number of features at which the performance is achieved.

| Dataset | All Features | Laplacian Score | MCFS | NDFS | RUFS | EUFS |
|---|---|---|---|---|---|---|
| ALLAML | 67.3 ± 6.72 | 73.2 ± 5.52 (150) | 68.4 ± 10.4 (100) | 69.4 ± 0.00 (100) | 72.2 ± 0.00 (150) | **73.6 ± 0.00 (100)** |
| COIL20 | 53.6 ± 3.83 | 55.2 ± 2.84 (250) | 59.7 ± 4.03 (250) | 60.1 ± 4.26 (300) | 62.7 ± 3.51 (150) | **63.4 ± 5.47 (100)** |
| PIE10P | 30.8 ± 2.29 | 36.0 ± 2.95 (100) | 44.3 ± 3.20 (50) | 40.5 ± 4.51 (100) | 42.6 ± 4.61 (50) | **46.4 ± 2.69 (50)** |
| TOX-171 | 41.5 ± 3.88 | 47.5 ± 3.33 (200) | 42.5 ± 5.15 (100) | 46.1 ± 2.55 (100) | 47.8 ± 3.78 (300) | **49.5 ± 2.57 (100)** |
| PIX10P | 74.3 ± 12.1 | 76.6 ± 8.10 (150) | 75.9 ± 8.59 (200) | 76.7 ± 8.52 (200) | 73.2 ± 9.40 (300) | **76.8 ± 5.88 (150)** |
| Prostate-GE | 58.1 ± 0.44 | 57.5 ± 0.49 (300) | 57.3 ± 0.50 (300) | 58.3 ± 0.50 (100) | 59.8 ± 0.00 (50) | **60.4 ± 0.80 (100)** |

Table 3: Clustering results (NMI% ± std) of different feature selection algorithms on different datasets. The best results are highlighted in bold. The number in parentheses is the number of features at which the performance is achieved.

| Dataset | All Features | Laplacian Score | MCFS | NDFS | RUFS | EUFS |
|---|---|---|---|---|---|---|
| ALLAML | 8.55 ± 5.62 | 15.0 ± 1.34 (100) | 11.7 ± 12.2 (50) | 7.20 ± 0.30 (300) | 12.0 ± 0.00 (150) | **15.1 ± 0.00 (100)** |
| COIL20 | 70.6 ± 1.95 | 70.3 ± 1.73 (300) | 72.4 ± 1.90 (150) | 72.1 ± 1.75 (300) | 73.1 ± 1.69 (150) | **77.2 ± 2.75 (100)** |
| PIE10P | 32.2 ± 3.47 | 38.5 ± 1.44 (50) | **54.3 ± 3.39 (50)** | 46.0 ± 3.14 (100) | 49.6 ± 5.15 (50) | 49.8 ± 3.10 (150) |
| TOX-171 | 17.8 ± 5.20 | **30.5 ± 2.70 (150)** | 17.7 ± 6.88 (100) | 22.3 ± 2.41 (300) | 28.8 ± 2.71 (300) | 26.0 ± 2.41 (100) |
| PIX10P | 82.8 ± 6.48 | 84.3 ± 4.63 (150) | 85.0 ± 4.95 (200) | 84.8 ± 4.76 (200) | 81.1 ± 6.23 (300) | **85.1 ± 4.30 (50)** |
| Prostate-GE | 1.95 ± 0.27 | 1.59 ± 0.21 (300) | 1.53 ± 0.21 (300) | 2.02 ± 0.25 (100) | 2.86 ± 0.00 (50) | **3.36 ± 0.48 (100)** |

We make the following observations:

- Feature selection is necessary and effective. The selected subset of features not only reduces the computational cost but also improves the clustering performance.
- Robust analysis is also important for unsupervised feature selection; it helps select more relevant features and improves the performance.
- EUFS tends to achieve its best performance with relatively few selected features, usually 50 or 100.
- Most of the time, the proposed framework EUFS outperforms the baseline methods, which demonstrates the effectiveness of the proposed algorithm. There are three major reasons. First, we directly embed feature selection in the clustering process using sparse learning, and the norm of a latent feature (a row of V) reflects the quality of the reconstruction and thus the importance of the original feature. Second, the graph regularization helps learn better cluster indicators that fit the existing manifold structure, which leads to a better latent feature matrix. Finally, the robust analysis ensures that poorly reconstructed instances have less effect on feature selection.

We also perform parameter analysis for some important parameters of EUFS. Due to the space limit, we only report the results on COIL20 in Figure 2. The experimental results show that our method is not very sensitive to α and β. However, the performance is relatively sensitive to the number of selected features, which is a common problem for many unsupervised feature selection methods.

Figure 2: ACC and NMI of EUFS with different α, β and numbers of selected features on COIL20: (a) ACC (β = 0.1); (b) NMI (β = 0.1); (c) ACC (α = 0.1); (d) NMI (α = 0.1).

Conclusion

We propose a new unsupervised feature selection approach, EUFS, which directly embeds feature selection into a clustering algorithm via sparse learning.
It eliminates the need for transforming unsupervised feature selection into sparse learning based supervised feature selection with pseudo labels. Nonnegative orthogonality is applied to the cluster indicator to make the problem tractable and to ensure that feature selection on the latent features has effects similar to feature selection on the original features. The $\ell_{2,1}$-norm is applied to the cost function to reduce the effect of the noise introduced by the reconstruction of X, and to V to perform feature selection. Experimental results on 6 different real-world datasets validate the unique contributions of EUFS. Future work is to investigate whether EUFS can be extended to dimensionality reduction algorithms.

Acknowledgments

This material is based upon work supported by, or in part by, the National Science Foundation (NSF) under grant number IIS-1217466.

References

Salem Alelyani, Jiliang Tang, and Huan Liu. Feature selection for clustering: A review. In Data Clustering: Algorithms and Applications, pages 29-60. CRC Press, 2013.
Christos Boutsidis, Petros Drineas, and Michael W. Mahoney. Unsupervised feature selection for the k-means clustering problem. In Advances in Neural Information Processing Systems, pages 153-161, 2009.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 333-342. ACM, 2010.
Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification, 2nd Edition. New York, 2001.
Stephen P. A. Fodor. DNA sequencing: Massively parallel genomics. 277(5324):393-395, 1997.
Tom Goldstein, Brendan O'Donoghue, and Simon Setzer. Fast alternating direction optimization methods. CAM Report, pages 12-35, 2012.
Quanquan Gu and Jiawei Han. Towards feature selection in network. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 1175-1184. ACM, 2011.
Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157-1182, 2003.
Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In Advances in Neural Information Processing Systems, pages 507-514, 2005.
Jin Huang, Feiping Nie, Heng Huang, and Chris Ding. Robust manifold nonnegative matrix factorization. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):11, 2014.
Anil Jain and Douglas Zongker. Feature selection: Evaluation, application, and small sample performance. 1997.
George H. John, Ron Kohavi, Karl Pfleger, et al. Irrelevant features and the subset selection problem. In ICML, volume 94, pages 121-129, 1994.
Zechao Li, Yi Yang, Jing Liu, Xiaofang Zhou, and Hanqing Lu. Unsupervised feature selection using nonnegative spectral analysis. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, 2012.
Huan Liu and Hiroshi Motoda. Computational Methods of Feature Selection. CRC Press, 2007.
Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 339-348. AUAI Press, 2009.
Sameer A. Nene, Shree K. Nayar, Hiroshi Murase, et al. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96, 1996.
Feiping Nie, Shiming Xiang, Yangqing Jia, Changshui Zhang, and Shuicheng Yan. Trace ratio criterion for feature selection. In AAAI, volume 2, pages 671-676, 2008.
Feiping Nie, Heng Huang, Xiao Cai, and Chris H. Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In Advances in Neural Information Processing Systems, pages 1813-1821, 2010.
Mingjie Qian and Chengxiang Zhai. Robust unsupervised feature selection. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1621-1627. AAAI Press, 2013.
Peter H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1-10, 1966.
Dinesh Singh, Phillip G. Febbo, Kenneth Ross, Donald G. Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A. Renshaw, Anthony V. D'Amico, Jerome P. Richie, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203-209, 2002.
Jiliang Tang and Huan Liu. Feature selection with linked data in social media. In Proceedings of the Twelfth SIAM International Conference on Data Mining, Anaheim, California, USA, pages 118-128, 2012.
Jiliang Tang and Huan Liu. Unsupervised feature selection for linked social media data. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 904-912. ACM, 2012.
Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review. In Data Classification: Algorithms and Applications. 2014.
Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
Lior Wolf and Amnon Shashua. Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach. The Journal of Machine Learning Research, 6:1855-1887, 2005.
Yi Yang, Heng Tao Shen, Zhigang Ma, Zi Huang, and Xiaofang Zhou. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI Proceedings of the International Joint Conference on Artificial Intelligence, volume 22, page 1589, 2011.
Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, pages 1151-1157. ACM, 2007.
Zheng Zhao, Lei Wang, and Huan Liu. Efficient spectral feature selection with minimum redundancy. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, Georgia, USA, 2010.