# Unsupervised Feature Selection by Pareto Optimization

*The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)*

Chao Feng,¹ Chao Qian,¹ Ke Tang²
¹Anhui Province Key Lab of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
²Shenzhen Key Lab of Computational Intelligence, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
chaofeng@mail.ustc.edu.cn, chaoqian@ustc.edu.cn, tangk3@sustc.edu.cn

*This work was supported by the National Key Research and Development Program of China (2017YFC0804002), the NSFC (61603367, 61672478) and the YESS (2016QNRC001).*

## Abstract

Dimensionality reduction is often employed to deal with data that have a huge number of features, and its methods can generally be divided into two categories: feature transformation and feature selection. Due to its interpretability, its efficiency during inference and the abundance of unlabeled data, unsupervised feature selection has attracted much attention. In this paper, we consider its natural formulation, column subset selection (CSS), which is to minimize the reconstruction error of a data matrix by selecting a subset of features. We propose an anytime randomized iterative approach, POCSS, which minimizes the reconstruction error and the number of selected features simultaneously. Its approximation guarantee is well bounded. Empirical results exhibit the superior performance of POCSS over state-of-the-art algorithms.

## Introduction

In machine learning and data mining applications, we often encounter input data with very high dimensionality, which brings certain challenges. Many feature transformation techniques have been proposed for dimensionality reduction, e.g., principal component analysis (Jolliffe 2011), singular value decomposition (SVD) (Golub and Reinsch 1971) and autoencoders (Hinton and Salakhutdinov 2006), to name a few. By combining existing features, these methods transform the data into a low-dimensional subspace that captures the cardinal information of the original data. However, the new feature representation is difficult to interpret, and projecting the input features into the reduced space often requires matrix multiplication, which reduces the inference efficiency. When keeping the semantic meaning of the features is important, feature selection, which selects a subset of features instead of transforming them, is more appealing. Since unlabeled data is often very abundant, much attention has been drawn to the unsupervised case.

Unsupervised feature selection can be explored from different perspectives. In this paper, we focus on a natural formulation, column subset selection (CSS), which is to minimize the reconstruction error of a data matrix based on the subset of selected features. Formally, given a matrix $A \in \mathbb{R}^{m \times n}$ (where $m$ is the number of instances and $n$ is the number of features) and a positive integer $k$, the goal is to select at most $k$ columns of $A$ (forming a matrix $S$) minimizing $f(S) = \|A - SS^+A\|_F^2$, where $S^+$ is the Moore-Penrose inverse of $S$. That is, the goal is to find at most $k$ columns of $A$ that capture as much of $A$ as possible w.r.t. the Frobenius norm. Note that $SS^+$ denotes the projection matrix onto the span of the set $S$ of columns.
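For concreteness, the reconstruction error above can be computed directly with a few lines of linear algebra. The following is a minimal sketch (ours, not the authors' code); the helper name `css_error` and the random test matrix are illustrative only.

```python
import numpy as np

def css_error(A, cols):
    """Reconstruction error f(S) = ||A - S S^+ A||_F^2,
    where S consists of the columns of A indexed by `cols`."""
    S = A[:, cols]                      # selected columns
    P = S @ np.linalg.pinv(S)           # projection onto span(S)
    return np.linalg.norm(A - P @ A, 'fro') ** 2

# illustrative usage on a small random matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
print(css_error(A, [0, 3, 7]))          # error of a 3-column selection
print(css_error(A, list(range(20))))    # selecting all columns gives ~0
```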
This problem is known to be UG-hard (Çivril 2014), and its NP-completeness has been proved only recently (Shitov 2017). Many algorithms with bounded approximation guarantees have been proposed, as briefly reviewed in the following.

### Related Work

The CSS problem was first studied within the numerical linear algebra community. Many algorithms based on rank-revealing QR (RRQR) factorization (Chan and Hansen 1992) were proposed, e.g., (Gu and Eisenstat 1996; Pan and Tang 1999; Hoog and Mattheij 2007). It has been proved that any algorithm for computing an RRQR factorization of a matrix $A$ can provide solutions of the CSS problem with approximation guarantees. Let $A_k$ denote the best rank-$k$ approximation to $A$ obtained via SVD, and let $S_{opt}$ denote an optimal solution of the CSS problem. Note that the approximation guarantee obtained by this kind of algorithm is w.r.t. $\|A - A_k\|_F$, which is obviously a lower bound on the truly optimal function value, i.e., $\|A - S_{opt}S_{opt}^+A\|_F$.

Several randomized sampling algorithms were also proposed for solving the CSS problem, including subspace sampling (Drineas, Mahoney, and Muthukrishnan 2008), volume sampling (Guruswami and Sinop 2012) and leverage sampling (Cohen et al. 2015). Their idea is to randomly select columns with probability proportional to some statistic. The theoretical results on these algorithms mainly show a trade-off between the number of columns chosen, the approximation bound w.r.t. $\|A - A_k\|_F$ and the success probability of the algorithm.

In (Boutsidis, Mahoney, and Drineas 2009), the authors proposed a two-stage algorithm combining the above two techniques. In the randomized stage, the algorithm selects $O(k\log k)$ columns according to a probability distribution that depends on the top $k$ right singular vectors of $A$. In the deterministic stage, it returns exactly $k$ columns from the selected $O(k\log k)$ columns by applying the deterministic column selection algorithm based on RRQR factorization. This two-stage algorithm achieves an approximation ratio of $O(k\log k)$ w.r.t. $\|A - A_k\|_F$ with high probability. In (Civril and Magdon-Ismail 2012), a deterministic algorithm based on a sparse approximation of the SVD of $A$ was proposed. It first computes the top $k$ left singular vectors of $A$, and then selects columns from $A$ to approximate the space spanned by these vectors, scaled by the singular values. It has been shown that this algorithm can achieve an approximation ratio of $(1+\epsilon)$ w.r.t. $\|A - A_k\|_F$ after selecting $O(k\log k/\epsilon^2)$ columns.

The standard greedy algorithm was later used to solve the CSS problem (Farahat, Ghodsi, and Kamel 2011). It iteratively selects the one column whose inclusion reduces the objective $f(S)$ the most, as sketched below. To investigate its theoretical performance, Bhaskara et al. (2016) studied the equivalent problem that maximizes $\|SS^+A\|_F^2$, and proved an approximation ratio of $(1-\epsilon)$ w.r.t. the optimal function value $\|S_{opt}S_{opt}^+A\|_F^2$ after selecting $O(k/\epsilon)$ columns. Recently, Ordozgoiti et al. (2016) proposed a simple local search algorithm, which starts from $k$ randomly selected columns and iteratively replaces one selected column with the best among the $n-k$ unselected columns, until no improvement can be yielded. This algorithm has been shown empirically to outperform other state-of-the-art methods, and an approximation bound w.r.t. $\|S_{opt}S_{opt}^+A\|_F^2$ has also been provided (Ordozgoiti, Canaval, and Mozo 2018). All the above algorithms are approximation algorithms.
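For reference, the greedy strategy mentioned above admits a very short implementation; the following sketch (ours, not the code of Farahat et al.) recomputes the objective from scratch at every step and is only meant to illustrate the selection rule.

```python
import numpy as np

def css_error(A, cols):
    S = A[:, cols]
    return np.linalg.norm(A - S @ np.linalg.pinv(S) @ A, 'fro') ** 2

def greedy_css(A, k):
    """Greedily pick k columns, each time adding the column that
    reduces ||A - S S^+ A||_F^2 the most."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(A.shape[1]) if j not in selected]
        errors = [css_error(A, selected + [j]) for j in remaining]
        selected.append(remaining[int(np.argmin(errors))])
    return selected

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 15))
print(greedy_css(A, 3))   # indices of the 3 greedily chosen columns
```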
In (Arai, Maung, and Schweitzer 2015), an approach using the A* heuristic search algorithm is guaranteed to find the optimum. However, due to the NP-hardness of the problem, it can effectively select only a small number of columns from small data sets. A similar approach using the weighted A* algorithm has thus been proposed (Arai et al. 2016), which runs faster, but without the optimality guarantee.

### Our Contribution

In this paper, we propose a new method based on Pareto optimization (Qian, Yu, and Zhou 2015) for the CSS problem, briefly called POCSS. The idea of POCSS is to first reformulate the original CSS problem as a bi-objective minimization problem that minimizes the given objective $f(S)$ and the number of columns of $S$ simultaneously, then employ a randomized iterative procedure to solve it, and finally select the best solution with at most $k$ columns from the produced set of solutions. In each iteration, POCSS needs to evaluate the objective value of a newly generated solution, which is time-consuming. We derive the recursive relation between the objective values when adding one column to a matrix or deleting one column from it, which makes the algorithm computationally feasible.

Theoretically, we prove that POCSS using polynomial time can find a solution $\hat{S}$ with at most $k$ columns such that $\|\hat{S}\hat{S}^+A\|_F^2 \geq (1 - e^{-\gamma}) \cdot \|S_{opt}S_{opt}^+A\|_F^2$, where $\gamma$ is the submodularity ratio (Das and Kempe 2011) characterizing how close the function $\|SS^+A\|_F^2$ is to submodular. Furthermore, we prove that using slightly more than $k$ columns, i.e., $16k/(\epsilon\sigma)$, POCSS can achieve an approximation ratio of $(1-\epsilon)$ w.r.t. $\|S_{opt}S_{opt}^+A\|_F^2$, where $\epsilon > 0$ is a constant and $\sigma$ denotes the smallest squared singular value of the normalized $S_{opt}$, i.e., $S_{opt}$ with each column scaled to a unit vector. The experimental results on 10 real-world data sets clearly show the superiority of POCSS over the state-of-the-art algorithms, including Two-stage (Boutsidis, Mahoney, and Drineas 2009), Aprx SVD (Civril and Magdon-Ismail 2012), Greedy (Farahat, Ghodsi, and Kamel 2011), Iter FS (Ordozgoiti, Canaval, and Mozo 2016) and WA* (Arai et al. 2016).

The rest of the paper first introduces the studied problem, and then presents the proposed method, its theoretical analysis and its empirical study. Finally, we conclude the paper.

## Unsupervised Feature Selection

We first give some notation that will be used in the paper:

- $[n]$: the set $\{1, 2, \dots, n\}$;
- $\mathbf{0}$: the all-zeros vector;
- $\|\cdot\|_2$: the $\ell_2$-norm of a vector;
- $0_{m,n}$: the zero matrix of size $m \times n$;
- $I_n$: the identity matrix of size $n$;
- $C_{ij}$: the entry in the $i$-th row and $j$-th column of matrix $C$;
- $C_{i:}$: the $i$-th row of matrix $C$;
- $C_{:i}$: the $i$-th column of matrix $C$;
- $C^T$: the transpose of matrix $C$;
- $C^+$: the Moore-Penrose inverse of matrix $C$;
- $\mathrm{tr}(\cdot)$: the trace of a square matrix;
- $\|\cdot\|_F$: the Frobenius norm of a matrix;
- $|\cdot|$: the number of columns of a matrix.

Unsupervised feature selection can be naturally characterized by the CSS problem in Definition 1. It is to select at most $k$ columns from all the $n$ columns of a matrix $A$ to best approximate $A$. The goodness of approximation is measured by the sum of squared errors between the original matrix $A$ and the approximation $SS^+A$ based on the selected columns $S$. Note that $SS^+$ denotes the projection matrix onto the space spanned by the columns of $S$.

**Definition 1 (Column Subset Selection (CSS)).** Given a matrix $A \in \mathbb{R}^{m \times n}$ and a positive integer $k < n$, find a submatrix $S$ of $A$ with at most $k$ columns that minimizes $\|A - SS^+A\|_F^2$, that is,

$$\arg\min_{S:\ \text{a submatrix of } A} \|A - SS^+A\|_F^2 \quad \text{s.t.} \quad |S| \leq k.$$
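To make Definition 1 concrete, the sketch below (illustrative only, and feasible just for very small $n$ and $k$) enumerates all column subsets of size at most $k$ and returns the exact optimum, i.e., the quantity that the approximation guarantees above are stated against.

```python
import numpy as np
from itertools import combinations

def css_error(A, cols):
    S = A[:, list(cols)]
    return np.linalg.norm(A - S @ np.linalg.pinv(S) @ A, 'fro') ** 2

def exact_css(A, k):
    """Exhaustive search over all subsets of at most k columns (tiny n only)."""
    n = A.shape[1]
    best_cols, best_err = (), np.linalg.norm(A, 'fro') ** 2  # empty selection
    for r in range(1, k + 1):
        for cols in combinations(range(n), r):
            err = css_error(A, cols)
            if err < best_err:
                best_cols, best_err = cols, err
    return best_cols, best_err

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 8))
print(exact_css(A, 2))   # optimal 2-column selection and its error
```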
For ease of theoretical treatment, this minimization problem is often equivalently transformed into a maximization problem:

$$\arg\max_{S:\ \text{a submatrix of } A} \|SS^+A\|_F^2 \quad \text{s.t.} \quad |S| \leq k.$$

The equivalence can be verified by

$$\begin{aligned}
\|A - SS^+A\|_F^2 &= \mathrm{tr}\big((A - SS^+A)^T(A - SS^+A)\big) = \mathrm{tr}(A^TA - A^TSS^+A)\\
&= \mathrm{tr}(A^TA) - \mathrm{tr}(A^TSS^+A) = \mathrm{tr}(A^TA) - \mathrm{tr}\big(A^T(SS^+)^TSS^+A\big)\\
&= \mathrm{tr}(A^TA) - \|SS^+A\|_F^2,
\end{aligned}$$

where the second and the fourth equalities are derived from the properties of $S^+$, i.e., $(SS^+)^T = SS^+$ and $S^+SS^+ = S^+$. Note that $\mathrm{tr}(A^TA)$ is a constant. In our theoretical analysis, we will consider maximization.

## The Proposed Approach

In this section, we propose a Pareto optimization based approach for the CSS problem, POCSS. Note that Pareto optimization is a general framework that uses bi-objective optimization as an intermediate step to solve single-objective optimization problems. It has been successfully applied to the general subset selection problem (Friedrich and Neumann 2015; Qian, Yu, and Zhou 2015; Qian et al. 2017a; 2017b) as well as to the problems of multiset selection (Qian et al. 2018b), $k$-subset selection (Qian et al. 2018a) and sequence selection (Qian, Feng, and Tang 2018).

We use a binary vector $s \in \{0,1\}^n$ to denote a submatrix $S$ of $A$. Each element of $s$ corresponds to one column of $A$: for $1 \leq i \leq n$, $s_i = 1$ means that the $i$-th column of $A$ is included in $S$, and $s_i = 0$ means that it is not selected. Note that the columns of $S$ appear in the same order as in $A$. For notational simplicity, we will not distinguish $s \in \{0,1\}^n$ from its corresponding submatrix $S$.

POCSS reformulates the original problem, i.e., Definition 1, as a bi-objective minimization problem

$$\arg\min_{s \in \{0,1\}^n} \big(f(s),\ |s|\big), \quad \text{where } f(s) = \|A - ss^+A\|_F^2.$$

That is, POCSS minimizes the original objective function and the number of selected columns simultaneously. In the bi-objective setting, both objective values have to be considered when comparing two solutions $s$ and $s'$: $s$ weakly dominates $s'$ (i.e., $s$ is at least as good as $s'$, denoted as $s \succeq s'$) if $f(s) \leq f(s')$ and $|s| \leq |s'|$; $s$ dominates $s'$ (i.e., $s$ is strictly better, denoted as $s \succ s'$) if $s \succeq s'$ and either $f(s) < f(s')$ or $|s| < |s'|$. If neither solution weakly dominates the other, they are incomparable.

The procedure of POCSS is described in Algorithm 1. In the optimization process, we use the archive $P$ to maintain the non-dominated solutions produced so far, and $f(P) = \{f(s) \mid s \in P\}$ is the set of objective values of the solutions in $P$. The algorithm starts from the all-zeros solution $\mathbf{0}$ representing the empty matrix (line 1), and then iteratively tries to improve the quality of the solutions in $P$ (lines 4-15). In each iteration, a solution $s$ randomly selected from the current $P$ is used to generate a new solution $y$ by flipping each bit of $s$ with probability $1/n$ (lines 5-6); $y$ is then evaluated (line 7); if $y$ is not dominated by any previously archived solution (line 8), it is added into $P$, and meanwhile those solutions weakly dominated by $y$ are removed (lines 9-11). Note that the domination-based comparison keeps $P$ a set of pairwise incomparable solutions. POCSS repeats this for $T$ iterations. The value of $T$ influences the quality of the produced solution; their relationship will be made clear in the theoretical analysis, and we will use the theoretically derived value of $T$ in the experiments. After running $T$ iterations, the best solution (i.e., the one with the smallest $f$ value) satisfying the size constraint in $P$ is selected as the final solution (line 16).

**Algorithm 1: POCSS**

Input: matrix $A \in \mathbb{R}^{m \times n}$, cardinality constraint $k \in [n]$
Parameter: the number $T$ of iterations
Output: a matrix $s$ comprised of at most $k$ columns of $A$
Process:
1. Let $s = \mathbf{0}$.
2. Let $P = \{s\}$, $f(P) = \{f(s)\}$, $P^+ = \{s^+\}$.
3. Let $t = 0$.
4. **while** $t < T$ **do**
5. Select $s$ from $P$ uniformly at random.
6. $y :=$ flip each bit of $s$ with probability $1/n$.
7. $(f(y), y^+) = \text{Evaluate}(s, f(s), s^+, y)$.
8. **if** $\nexists z \in P$ such that $z \succ y$ **then**
9. $Q = \{z \mid z \in P,\ y \succeq z\}$.
10. $P = (P \setminus Q) \cup \{y\}$.
11. $f(P) = (f(P) \setminus \{f(z) \mid z \in Q\}) \cup \{f(y)\}$.
12. $P^+ = (P^+ \setminus \{z^+ \mid z \in Q\}) \cup \{y^+\}$.
13. **end if**
14. $t = t + 1$.
15. **end while**
16. **return** $\arg\min_{s \in P, |s| \leq k} f(s)$
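The following is a simplified Python sketch of this Pareto optimization loop (illustrative, not the authors' implementation): it evaluates the objective directly via `np.linalg.pinv` rather than using the accelerated Evaluate subprocedure described next, and the helper names are ours.

```python
import numpy as np

def f(A, s):
    """Reconstruction error of the columns selected by binary vector s."""
    cols = np.flatnonzero(s)
    if cols.size == 0:
        return np.linalg.norm(A, 'fro') ** 2
    S = A[:, cols]
    return np.linalg.norm(A - S @ np.linalg.pinv(S) @ A, 'fro') ** 2

def pocss(A, k, T, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # archive of (solution, objective) pairs, initialized with the empty selection
    P = [(np.zeros(n, dtype=int), f(A, np.zeros(n, dtype=int)))]
    for _ in range(T):
        s, _ = P[rng.integers(len(P))]
        flip = rng.random(n) < 1.0 / n          # flip each bit with prob. 1/n
        y = np.where(flip, 1 - s, s)
        fy, sy = f(A, y), int(y.sum())
        # keep y only if no archived solution dominates it
        if not any(fz <= fy and z.sum() <= sy and (fz < fy or z.sum() < sy)
                   for z, fz in P):
            P = [(z, fz) for z, fz in P
                 if not (fy <= fz and sy <= z.sum())] + [(y, fy)]
    feasible = [(z, fz) for z, fz in P if z.sum() <= k]
    return min(feasible, key=lambda pair: pair[1])[0]

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 12))
s = pocss(A, k=3, T=500)
print(np.flatnonzero(s), f(A, s))
```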
### Acceleration

Note that in each iteration of POCSS, we need to evaluate the objective value of a newly generated solution $y$, i.e., to compute $f(y) = \|A - yy^+A\|_F^2$, which is very time-consuming. For a matrix $S \in \mathbb{R}^{m \times k}$, computing $S^+$ takes $O(k^2m)$ time; computing $SS^+A$ then requires $2kmn$ time; finally, computing $\|\cdot\|_F^2$ takes $mn$ time. Thus, computing $f(S)$ directly costs $O(kmn)$ time, which is expensive. To accelerate this evaluation, we build the recursive relation between the objective values when deleting one column from a matrix or inserting one column into it, as shown in Lemmas 1-2. Thus, $f(y)$ can be computed from $f(s)$ by recursively deleting the columns $\{A_{:j} \mid s_j = 1 \wedge y_j = 0\}$ and then inserting the columns $\{A_{:j} \mid s_j = 0 \wedge y_j = 1\}$. From Lemmas 1-2, we can see that the recursive relation on $f$ relies on the Moore-Penrose inverse of the current matrix. Thus, we also need to update the Moore-Penrose inverse accordingly, as shown in Lemmas 3-4. During the run of POCSS, we use a set $P^+$ to maintain the Moore-Penrose inverse for each solution in $P$, i.e., $P^+ = \{s^+ \mid s \in P\}$; $P^+$ is updated accordingly in line 12 of Algorithm 1. Note that for the empty matrix $s = \mathbf{0}$, $s^+$ is also an empty matrix and $f(s) = \|A\|_F^2$. In the following lemmas, we assume that $S$ is a full column rank matrix, i.e., $\mathrm{rank}(S) = |S|$. Their proofs are provided in the Appendix.

**Lemma 1.** Let $\hat{S}$ be the matrix generated by deleting the $i$-th column of $S$. Let $\rho = ((S^+)_{i:})^T$ and $\beta = A^T\rho$. Then,
$$f(\hat{S}) = f(S) + \|\rho\|_2^{-2}\,\beta^T\beta.$$

**Lemma 2.** Let $\hat{S}$ be the matrix generated by inserting the $j$-th column of $A$ into $S$ as the $i$-th column. Let $E_{:j} = A_{:j} - SS^+A_{:j}$ (i.e., the residual column vector of $A_{:j}$ w.r.t. $S$), where $E = A - SS^+A$, and let $\delta = A^TE_{:j}$. Then,
$$f(\hat{S}) = f(S) - \delta^T\delta/\delta_j,$$
where $\delta_j$ denotes the $j$-th entry of the vector $\delta$.

**Lemma 3.** Let $\hat{S}$ be the matrix generated by deleting the $i$-th column of $S$, and let $\rho = ((S^+)_{i:})^T$. Then, $\hat{S}^+$ is the matrix obtained by deleting the $i$-th row of $(S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T)$.

**Lemma 4.** Let $\hat{S}$ be the matrix generated by inserting the $j$-th column of $A$ into $S$ as the $i$-th column, and let $E_{:j} = A_{:j} - SS^+A_{:j}$. Then, $\hat{S}^+$ is the matrix obtained by inserting $((E_{:j})^TE_{:j})^{-1}(E_{:j})^T$ into $\big(S^+ - ((E_{:j})^TE_{:j})^{-1}S^+A_{:j}(E_{:j})^T\big)$ as the $i$-th row.

**Algorithm 2: Evaluation Subprocedure**

Input: $S$, $f(S)$, $S^+$ and $Y$
Output: $f(Y)$ and $Y^+$
Process:
1. $f_{cur} = f(S)$, $S^+_{cur} = S^+$.
2. $X_{delete} = \{A_{:j} \mid A_{:j} \in S,\ A_{:j} \notin Y\}$.
3. $X_{insert} = \{A_{:j} \mid A_{:j} \in Y,\ A_{:j} \notin S\}$.
4. **for** each column $A_{:j}$ in $X_{delete}$ **do**
5. $\rho = ((S^+_{cur})_{i:})^T$ (where $A_{:j}$ is the $i$-th column of $S$).
6. $\beta = A^T\rho$.
7. $S \leftarrow$ delete the $i$-th column $A_{:j}$ from $S$.
8. $f_{cur} \leftarrow f_{cur} + \|\rho\|_2^{-2}\,\beta^T\beta$.
9. $S^+_{cur} \leftarrow$ delete the $i$-th row of $(S^+_{cur} - \|\rho\|_2^{-2}S^+_{cur}\rho\rho^T)$.
10. **end for**
11. **for** each column $A_{:j}$ in $X_{insert}$ **do**
12. $E_{:j} = A_{:j} - SS^+_{cur}A_{:j}$ and $\delta = A^TE_{:j}$.
13. $S \leftarrow$ insert $A_{:j}$ into $S$ as the $i$-th column.
14. $f_{cur} \leftarrow f_{cur} - \delta^T\delta/\delta_j$.
15. $S^+_{cur} \leftarrow$ insert $((E_{:j})^TE_{:j})^{-1}(E_{:j})^T$ into $\big(S^+_{cur} - ((E_{:j})^TE_{:j})^{-1}S^+_{cur}A_{:j}(E_{:j})^T\big)$ as the $i$-th row.
16. **end for**
17. $f(Y) = f_{cur}$, $Y^+ = S^+_{cur}$.
18. **return** $f(Y)$ and $Y^+$
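As a sanity check on the column-insertion updates (Lemmas 2 and 4), the following sketch (ours, illustrative only) compares the recursive formulas against direct recomputation on a random matrix; it appends the new column at the end (i.e., $i = |S|+1$) and assumes $S$ has full column rank.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((60, 10))
cols = [0, 2, 5]            # currently selected columns
j = 7                       # column to insert (appended as the last column)

S = A[:, cols]
S_pinv = np.linalg.pinv(S)
f_S = np.linalg.norm(A - S @ S_pinv @ A, 'fro') ** 2

# Lemma 2: recursive update of the objective value
E_j = A[:, j] - S @ (S_pinv @ A[:, j])      # residual of A_{:j} w.r.t. S
delta = A.T @ E_j
f_new_rec = f_S - delta @ delta / delta[j]

# Lemma 4: recursive update of the Moore-Penrose inverse
w = 1.0 / (E_j @ E_j)
S_new_pinv_rec = np.vstack([S_pinv - w * np.outer(S_pinv @ A[:, j], E_j),
                            w * E_j[None, :]])

# direct recomputation for comparison
S_new = A[:, cols + [j]]
f_new_dir = np.linalg.norm(A - S_new @ np.linalg.pinv(S_new) @ A, 'fro') ** 2

print(np.isclose(f_new_rec, f_new_dir))                      # True
print(np.allclose(S_new_pinv_rec, np.linalg.pinv(S_new)))    # True
```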
Algorithm 2 describes the procedure of computing $f(Y)$ and $Y^+$ of a new matrix $Y$ based on the known $f(S)$ and $S^+$ of an old matrix $S$. To generate $Y$ from $S$, $X_{delete}$ and $X_{insert}$ record the columns that need to be deleted and inserted, respectively. According to Lemmas 1 and 3, lines 4-10 of Algorithm 2 update the current objective value $f_{cur}$ and the current Moore-Penrose inverse $S^+_{cur}$ by deleting the columns in $X_{delete}$. According to Lemmas 2 and 4, lines 11-16 update $f_{cur}$ and $S^+_{cur}$ by inserting the columns in $X_{insert}$. After line 16, the current matrix $S$ is exactly the matrix $Y$. The algorithm returns $f_{cur}$ and $S^+_{cur}$ as $f(Y)$ and $Y^+$.

We then show how the evaluation is accelerated. For deleting one column (lines 5-9), the time is dominated by computing $\beta$ and $S^+_{cur}\rho\rho^T$, which take $mn$ and $2km$ time, respectively. For inserting one column (lines 12-15), the time is dominated by computing $SS^+_{cur}A_{:j}$, $\delta$ and $S^+_{cur}A_{:j}(E_{:j})^T$, which take $2km$, $mn$ and $2km$ time, respectively. Thus, both operations take time in the order of $O(mn)$. Note that line 6 of Algorithm 1 flips one bit in expectation, i.e., it deletes or inserts one column in expectation. This implies that the evaluation by Algorithm 2 costs $O(mn)$ time in expectation, which is much smaller than the $O(kmn)$ time of computing $f(Y)$ directly.

## Theoretical Analysis

In this section, we investigate the theoretical performance of POCSS. For the CSS problem, we consider the equivalent maximization formulation:

$$\arg\max_{S:\ \text{a submatrix of } A} g(S) = \|SS^+A\|_F^2 \quad \text{s.t.} \quad |S| \leq k.$$

Let $S_{opt}$ denote an optimal submatrix. The objective function $g(S)$ can be seen as a set function by treating $S$ as a set of columns. It can then be verified that $g(S)$ is monotone and non-submodular. The submodularity ratio

$$\gamma_{U,l}(g) = \min_{L \subseteq U,\ S:\,|S| \leq l,\ S \cap L = \emptyset} \frac{\sum_{v \in S}\big(g(L \cup \{v\}) - g(L)\big)}{g(L \cup S) - g(L)}$$

(Das and Kempe 2011) can be used to measure how close the function $g$ is to submodular. Note that $\forall U \subseteq A,\ l \geq 1:\ 0 \leq \gamma_{U,l}(g) \leq 1$.

In (Qian et al. 2016), it has been proved that the Pareto optimization method using at most $2ek^2n$ expected iterations achieves an approximation ratio of $(1 - e^{-\gamma})$ for maximizing a general monotone set function. Using a similar proof, we can prove the approximation bound of POCSS for maximizing the function $g$ as in Theorem 1, where $\mathbb{E}[T]$ denotes the expected number of iterations. The only difference is the size of the archive $P$ during optimization. By the procedure of POCSS, the solutions maintained in $P$ must be incomparable; thus, each value of one objective can correspond to at most one solution in $P$. Since $|S| \in \{0, 1, \dots, n\}$, we have $|P| \leq n + 1$. Note that for the proof in (Qian et al. 2016), $|P| \leq 2k$. Thus, the upper bound on $\mathbb{E}[T]$ changes from $2ek^2n$ to $ekn(n+1)$ accordingly.

**Theorem 1.** For the CSS problem, POCSS with $\mathbb{E}[T] \leq ekn(n+1)$ finds a submatrix $S$ of $A$ with $|S| \leq k$ and
$$g(S) \geq (1 - e^{-\gamma_{\min}}) \cdot g(S_{opt}),$$
where $\gamma_{\min} = \min_{U:|U|=k-1}\gamma_{U,k}(g)$.
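The submodularity ratio above is expensive to compute exactly, but for a tiny instance it can be enumerated by brute force. The sketch below (ours, with hypothetical helper names) computes $\gamma_{U,k}(g)$ for all $U$ with $|U| = k-1$ and hence the quantity $\gamma_{\min}$ appearing in Theorem 1.

```python
import numpy as np
from itertools import combinations

def g(A, cols):
    """g(S) = ||S S^+ A||_F^2 for the columns indexed by `cols`."""
    if not cols:
        return 0.0
    S = A[:, list(cols)]
    return np.linalg.norm(S @ np.linalg.pinv(S) @ A, 'fro') ** 2

def gamma_U_l(A, U, l):
    """Brute-force submodularity ratio gamma_{U,l}(g) (tiny n only)."""
    n = A.shape[1]
    ratios = []
    for r_L in range(len(U) + 1):
        for L in combinations(U, r_L):
            rest = [v for v in range(n) if v not in L]
            for r_S in range(1, l + 1):
                for S in combinations(rest, r_S):
                    denom = g(A, set(L) | set(S)) - g(A, L)
                    if denom > 1e-12:   # skip vanishing denominators
                        num = sum(g(A, set(L) | {v}) - g(A, L) for v in S)
                        ratios.append(num / denom)
    return min(ratios)

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 6))
k = 2
gamma_min = min(gamma_U_l(A, U, k) for U in combinations(range(6), k - 1))
print(gamma_min)   # the gamma_min appearing in Theorem 1 for this instance
```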
Then, we further analyze the approximation guarantee of POCSS when allowing more than $k$ columns. Our analysis needs Lemma 5, i.e., Lemma 1 of (Bhaskara et al. 2016), which shows that for any submatrix $T$, there always exists one column from a better submatrix whose insertion into $T$ brings an improvement proportional to their current distance. Let

$$\sigma_{\min}(S) = \inf_{\|\alpha\|_2 = 1} \frac{\|S_{norm}\alpha\|_2^2}{\|\alpha\|_2^2},$$

where $S_{norm}$ denotes the matrix obtained by normalizing each column of $S$ to a unit vector. That is, $\sigma_{\min}(S)$ is the smallest squared singular value of $S_{norm}$. It can be verified that $\sigma_{\min}(S) \leq 1$, because

$$\sigma_{\min}(S) = \inf_{\|\alpha\|_2=1}\Big\|\sum\nolimits_{i=1}^{|S|}\alpha_i(S_{norm})_{:i}\Big\|_2^2 \leq \big\|(S_{norm})_{:1}\big\|_2^2 = 1,$$

where the inequality follows by taking $\alpha$ to be the first standard basis vector.

**Lemma 5.** (Bhaskara et al. 2016) Let $S$, $T$ be two submatrices of $A$ with $g(S) \geq g(T)$. There exists one column of $S$ whose insertion into $T$ produces a matrix $T'$ such that
$$g(T') - g(T) \geq \frac{\sigma_{\min}(S)\,\big(g(S) - g(T)\big)^2}{4\,|S|\,g(S)}.$$

By Lemma 5, we prove that taking slightly more than $k$ columns, POCSS obtains a constant approximation ratio of $(1-\epsilon)$. The proof is inspired by the analysis of the greedy algorithm, i.e., Theorem 1 in (Bhaskara et al. 2016).

**Theorem 2.** Let $\epsilon$ be any positive constant. For the CSS problem, POCSS with $\mathbb{E}[T] \leq \frac{16e}{\epsilon\,\sigma_{\min}(S_{opt})}kn(n+1)$ finds a submatrix $S$ of $A$ with $|S| \leq \frac{16k}{\epsilon\,\sigma_{\min}(S_{opt})}$ and $g(S) \geq (1-\epsilon) \cdot g(S_{opt})$.

**Proof.** We use $\sigma$ to denote $\sigma_{\min}(S_{opt})$ for convenience. Let $J_{\max}$ denote the maximum integer value of $j$ such that the archive $P$ contains a solution $s$ with $|s| \leq \frac{8k}{\sigma}\sum_{i=0}^{j-1}2^i$ and $g(s) \geq g(s_{opt}) - g(s_{opt})/2^j$, where the vector $s_{opt}$ corresponds to the optimal submatrix $S_{opt}$. That is,
$$J_{\max} = \max\Big\{j \in \mathbb{N}^0 \ \Big|\ \exists s \in P:\ |s| \leq \frac{8k}{\sigma}\sum\nolimits_{i=0}^{j-1}2^i \ \wedge\ g(s) \geq g(s_{opt}) - g(s_{opt})/2^j\Big\}.$$
Let $N$ be the minimum integer value of $i$ such that $1/2^i \leq \epsilon$, i.e., $1/2^{N-1} > \epsilon$ and $1/2^N \leq \epsilon$. We then only need to analyze the expected number of iterations until $J_{\max} = N$, since $J_{\max} = N$ implies that there exists a solution $s$ in $P$ satisfying
$$|s| \leq \frac{8k}{\sigma}\sum\nolimits_{i=0}^{N-1}2^i = \frac{8k}{\sigma}(2^N - 1) \leq \frac{16k}{\epsilon\sigma} \quad \text{and} \quad g(s) \geq g(s_{opt}) - g(s_{opt})/2^N \geq (1-\epsilon)\cdot g(s_{opt}).$$

The initial value of $J_{\max}$ is 0, since POCSS starts from the empty matrix $\mathbf{0}$, which has $|\mathbf{0}| = 0$ and $g(\mathbf{0}) = 0$. Assume that currently $J_{\max} = j < N$. Let $s$ be a corresponding solution with value $j$, i.e., $|s| \leq \frac{8k}{\sigma}\sum_{i=0}^{j-1}2^i$ and $g(s) \geq g(s_{opt}) - g(s_{opt})/2^j$. It is easy to see that $J_{\max}$ cannot decrease, because deleting $s$ from $P$ (lines 9 and 10 of Algorithm 1) implies that the newly generated solution $y \succeq s$, which must satisfy $|y| \leq |s|$ and $g(y) \geq g(s)$.

Then, we show that $J_{\max}$ can increase by at least 1 after at most $\frac{8ekn(n+1)}{\sigma}2^j$ iterations in expectation. By Lemma 5, $g(s) < g(s_{opt})$ and $|s_{opt}| \leq k$, we know that flipping one specific 0 bit of $s$ (i.e., inserting one specific column) can generate a new solution $y$ satisfying
$$g(y) - g(s) \geq \frac{\sigma\,(g(s_{opt}) - g(s))^2}{4k\,g(s_{opt})} > \frac{\sigma\,(g(s_{opt})/2^{j+1})^2}{4k\,g(s_{opt})}.$$
The last inequality is derived from $g(s) < g(s_{opt}) - g(s_{opt})/2^{j+1}$, i.e., $g(s_{opt}) - g(s) > g(s_{opt})/2^{j+1}$; otherwise, we would have $J_{\max} \geq j+1$, contradicting the assumption $J_{\max} = j$. This process happens in one iteration with probability at least $\frac{1}{|P|}\cdot\frac{1}{n}\big(1-\frac{1}{n}\big)^{n-1} \geq \frac{1}{en(n+1)}$, where $1/|P|$ is a lower bound on the probability of selecting $s$ in line 5 of Algorithm 1, $\frac{1}{n}(1-\frac{1}{n})^{n-1}$ is the probability of flipping a specific bit of $s$ while keeping the other bits unchanged in line 6, and $|P| \leq n+1$. Thus, after at most $en(n+1)$ iterations in expectation, there must exist a solution $s_1$ in $P$ satisfying
$$|s_1| \leq |s| + 1 \leq \frac{8k}{\sigma}\sum\nolimits_{i=0}^{j-1}2^i + 1 \quad \text{and} \quad g(s_1) \geq g(s) + \frac{\sigma\,(g(s_{opt})/2^{j+1})^2}{4k\,g(s_{opt})} \geq g(s_{opt}) - \frac{g(s_{opt})}{2^j} + \frac{\sigma\,(g(s_{opt})/2^{j+1})^2}{4k\,g(s_{opt})}.$$
We then consider the same process on $s_1$, which makes the archive $P$ contain a solution $s_2$ with
$$|s_2| \leq |s_1| + 1 \leq \frac{8k}{\sigma}\sum\nolimits_{i=0}^{j-1}2^i + 2 \quad \text{and} \quad g(s_2) \geq g(s_1) + \frac{\sigma\,(g(s_{opt})/2^{j+1})^2}{4k\,g(s_{opt})} \geq g(s_{opt}) - \frac{g(s_{opt})}{2^j} + \frac{2\sigma\,(g(s_{opt})/2^{j+1})^2}{4k\,g(s_{opt})}.$$
By repeating this process $l = \frac{8k}{\sigma}2^j$ times (we pessimistically assume that $J_{\max}$ does not increase, so $g(s_i) < g(s_{opt}) - g(s_{opt})/2^{j+1}$ holds for any $i < l$), $P$ will contain a solution $s_l$ satisfying
$$|s_l| \leq \frac{8k}{\sigma}\sum\nolimits_{i=0}^{j-1}2^i + l = \frac{8k}{\sigma}\sum\nolimits_{i=0}^{j}2^i \quad \text{and} \quad g(s_l) \geq g(s_{opt}) - \frac{g(s_{opt})}{2^j} + \frac{l\sigma\,(g(s_{opt})/2^{j+1})^2}{4k\,g(s_{opt})} = g(s_{opt}) - g(s_{opt})/2^{j+1}.$$
This implies that $J_{\max} \geq j+1$. Thus, after at most $l \cdot en(n+1) = \frac{8ekn(n+1)}{\sigma}2^j$ iterations in expectation, $J_{\max}$ increases by at least 1. Therefore, the expected number of iterations for increasing $J_{\max}$ from 0 to $N$ is at most $\sum_{j=0}^{N-1}\frac{8ekn(n+1)}{\sigma}2^j \leq \frac{16e}{\epsilon\sigma}kn(n+1)$, which is exactly the bound on $\mathbb{E}[T]$ stated for achieving an approximation ratio of $(1-\epsilon)$. $\Box$
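The quantity $\sigma_{\min}(\cdot)$ used in Lemma 5 and Theorem 2 is easy to compute numerically; a minimal sketch (ours, with illustrative names) follows.

```python
import numpy as np

def sigma_min(S):
    """Smallest squared singular value of S after normalizing each column
    of S to a unit vector, i.e., the quantity sigma_min(S) in Theorem 2."""
    S_norm = S / np.linalg.norm(S, axis=0, keepdims=True)
    return np.linalg.svd(S_norm, compute_uv=False)[-1] ** 2

rng = np.random.default_rng(6)
S = rng.standard_normal((100, 5))
print(sigma_min(S))   # always <= 1; close to 1 for near-orthogonal columns
print(sigma_min(np.hstack([S, S[:, :1] + 1e-3])))  # nearly duplicated column -> tiny value
```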
## Experiments

In this section, we empirically evaluate the effectiveness of POCSS on 10 real-world data sets.¹ Five state-of-the-art algorithms for the CSS problem are compared: Two-stage (Boutsidis, Mahoney, and Drineas 2009), Aprx SVD (Civril and Magdon-Ismail 2012), Greedy (Farahat, Ghodsi, and Kamel 2011; Bhaskara et al. 2016), Iter FS (Ordozgoiti, Canaval, and Mozo 2016) and WA* (Arai et al. 2016). A detailed introduction to these algorithms can be found in the related work. For Two-stage, $2k$ columns are selected in the randomized stage. For WA*, we implement its best version WA*-b and set the parameter $\epsilon = 0.5$. For POCSS, the number of iterations is $ekn(n+1)$ as suggested by Theorem 1; to improve its efficiency, submatrices with at least $2k$ columns are excluded during the run, and the number of iterations is thus set to $2ek^2n$.

To evaluate an output submatrix $S$, we measure the ratio of its reconstruction error $f(S)$ to the smallest rank-$k$ approximation error obtained by SVD, i.e.,

$$\text{error ratio} = \|A - SS^+A\|_F^2 \,/\, \|A - A_k\|_F^2.$$

The error ratio is obviously always larger than 1, and the smaller it is, the better. For POCSS, Two-stage and Iter FS, which are randomized algorithms, we repeat each run 10 times independently and report the average results.

¹ The data sets are downloaded from http://archive.ics.uci.edu/ml/ and http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. All feature vectors are normalized to unit vectors.

Table 1: The error ratio (the smaller the better) of the compared methods on 10 data sets for $k = 50$. The mean ± std is reported for the randomized algorithms. On each data set, the smallest values are bolded.

| data set (#inst, #feat) | Two-stage | Aprx SVD | Greedy | WA* | Iter FS | POCSS |
|---|---|---|---|---|---|---|
| sonar (208, 60) | 2.656 | 2.860 | 2.852 | 2.785 | **2.524 ± 0.000** | **2.524 ± 0.000** |
| mediamill (30993, 120) | 2.179 | 2.075 | 2.045 | 2.105 | 1.952 ± 0.027 | **1.945 ± 0.022** |
| Com Cri (2215, 147) | 1.265 | 1.197 | 1.195 | 1.203 | 1.197 ± 0.004 | **1.194 ± 0.027** |
| musk (7074, 168) | 1.868 | 1.775 | 1.774 | 1.740 | 1.669 ± 0.013 | **1.656 ± 0.012** |
| dna (2000, 180) | 1.334 | 1.325 | 1.320 | 1.323 | 1.313 ± 0.002 | **1.311 ± 0.000** |
| Arrhythmia (452, 279) | 1.734 | 1.560 | 1.551 | 1.555 | 1.535 ± 0.006 | **1.520 ± 0.018** |
| scene (1211, 294) | 1.585 | 1.553 | 1.548 | 1.544 | 1.533 ± 0.002 | **1.530 ± 0.003** |
| Rl CTs (53500, 386) | 1.535 | 1.432 | 1.433 | 1.446 | 1.452 ± 0.051 | **1.420 ± 0.003** |
| madelon (2000, 500) | 1.123 | 1.057 | **1.056** | **1.056** | **1.056 ± 0.000** | **1.056 ± 0.000** |
| s EMG (1800, 2500) | 1.259 | 1.226 | 1.224 | 1.224 | 1.218 ± 0.001 | **1.216 ± 0.001** |
| average rank | 5.7 | 4.55 | 3.3 | 3.8 | 2.45 | 1.2 |

The results for $k = 50$ are shown in Table 1. POCSS achieves the smallest error ratio on all the data sets. Note that we do not report the standard deviations of Two-stage, because they are always 0, as observed in (Ordozgoiti, Canaval, and Mozo 2016). We compute the rank of each method on each data set as in (Demšar 2006), and the ranks are averaged in the last row of Table 1.

Figure 1: The comparison for cardinality constraints $k \in \{50, 60, \dots, 100\}$ (error ratio: the smaller the better). Panels (a) mediamill, (b) musk, (c) dna and (d) scene each plot the error ratio against $k$ for Two-stage, Aprx SVD, Greedy, WA*, Iter FS and POCSS.
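The error-ratio metric used in Table 1 and Figure 1 compares a selection against the best rank-$k$ SVD approximation; a minimal sketch of its computation (ours, not the evaluation code used in the paper) is given below.

```python
import numpy as np

def error_ratio(A, cols, k):
    """||A - S S^+ A||_F^2 / ||A - A_k||_F^2, where A_k is the best
    rank-k approximation of A obtained via SVD."""
    S = A[:, cols]
    css_err = np.linalg.norm(A - S @ np.linalg.pinv(S) @ A, 'fro') ** 2
    # ||A - A_k||_F^2 equals the sum of squared singular values beyond the k-th
    sv = np.linalg.svd(A, compute_uv=False)
    svd_err = float(np.sum(sv[k:] ** 2))
    return css_err / svd_err

rng = np.random.default_rng(7)
A = rng.standard_normal((200, 40))
A /= np.linalg.norm(A, axis=0, keepdims=True)   # unit-norm feature vectors, as in the experiments
print(error_ratio(A, list(range(5)), k=5))      # >= 1 by definition
```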
We can observe the order POCSS < Iter FS < Greedy < WA* < Aprx SVD < Two-stage, which is consistent with Iter FS < Greedy < Two-stage in (Ordozgoiti, Canaval, and Mozo 2016) as well as WA* < Two-stage in (Arai et al. 2016). Note that although Iter FS is overall the second best, its performance is not very stable; for example, on the Rl CTs data set, Iter FS is worse than Aprx SVD, Greedy and WA*.

Figure 1 examines the influence of the cardinality constraint $k$. We can observe that POCSS is consistently better than the other methods.

Considering the running time, measured by the number of objective function evaluations, POCSS is set to use the theoretical upper bound $2ek^2n$, i.e., the worst-case time. We also empirically examine how efficient POCSS is in practice. For the two data sets Rl CTs and s EMG with $k = 50$, we plot the curve of the error ratio over the running time for POCSS, and select Greedy and Iter FS as baselines.

Figure 2: Performance vs. running time of POCSS. Panels (a) Rl CTs and (b) s EMG each plot the error ratio against the running time (in units of $kn$) for Greedy, Iter FS and POCSS.

We can observe from Figure 2 that the time POCSS needs to obtain a better performance is much less than the worst-case time $2ek^2n \approx 271kn$, implying that POCSS can be efficient in practice.

## Conclusion

In this paper, we study the CSS problem for unsupervised feature selection, and propose the new algorithm POCSS based on Pareto optimization. POCSS employs a randomized iterative procedure to minimize the reconstruction error and the number of selected features simultaneously. We prove that POCSS achieves a good approximation guarantee, and empirical results also show its excellent performance.

## Appendix

To prove Lemma 1, which gives the recursive relation on $f(S)$ when deleting one column from $S$, we first analyze the recursive relation on $E = A - SS^+A$.

**Lemma 6.** Let $\hat{S}$ be the matrix generated by deleting the $i$-th column of $S$, and let $\rho = ((S^+)_{i:})^T$. Then,
$$\hat{E} = A - \hat{S}\hat{S}^+A = E + \|\rho\|_2^{-2}\,SS^+\rho\rho^TA.$$

**Proof.** According to Lemma 3, $\hat{S}^+$ is the matrix generated by deleting the $i$-th row of $(S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T)$. Thus,
$$\hat{S}\hat{S}^+ = S(S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T) - S_{:i}(S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T)_{i:} = S(S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T),$$
where the last equality is by $(S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T)_{i:} = \rho^T - \|\rho\|_2^{-2}\rho^T\rho\rho^T = \rho^T - \rho^T = 0$. Then, we have
$$\hat{E} = A - \hat{S}\hat{S}^+A = A - (SS^+ - \|\rho\|_2^{-2}SS^+\rho\rho^T)A = E + \|\rho\|_2^{-2}\,SS^+\rho\rho^TA. \qquad \Box$$

Based on Lemma 6 and $f(S) = \|E\|_F^2$, we can update the objective $f$ when deleting one column from a matrix.

**Proof of Lemma 1.**
$$f(\hat{S}) = \|\hat{E}\|_F^2 = \mathrm{tr}(\hat{E}^T\hat{E}) = \mathrm{tr}\big((E^T + \|\rho\|_2^{-2}A^T\rho\rho^TSS^+)(E + \|\rho\|_2^{-2}SS^+\rho\rho^TA)\big) = \mathrm{tr}(E^TE) + 2\|\rho\|_2^{-2}\,\mathrm{tr}(A^T\rho\rho^TSS^+E) + \|\rho\|_2^{-4}\,\mathrm{tr}(A^T\rho\rho^TSS^+SS^+\rho\rho^TA),$$
where the third equality is by Lemma 6 and $(SS^+)^T = SS^+$. Since $S$ is assumed to have full column rank, $S^+S = I_{|S|}$. Then, we have
$$A^T\rho\rho^TSS^+E = A^T\rho\rho^TSS^+(A - SS^+A) = A^T\rho\rho^TSS^+A - A^T\rho\rho^TSS^+A = 0_{n,n}.$$
Since $\rho^TSS^+ = (S^+)_{i:}SS^+ = (S^+S)_{i:}S^+ = (I_{|S|})_{i:}S^+ = (S^+)_{i:} = \rho^T$, we have
$$A^T\rho\rho^TSS^+SS^+\rho\rho^TA = A^T\rho\rho^TSS^+\rho\rho^TA = A^T\rho\rho^T\rho\rho^TA = \|\rho\|_2^2\,A^T\rho\rho^TA = \|\rho\|_2^2\,\beta\beta^T.$$
Applying the above two equations to $f(\hat{S})$, we get
$$f(\hat{S}) = \mathrm{tr}(E^TE) + \|\rho\|_2^{-2}\,\mathrm{tr}(\beta\beta^T) = f(S) + \|\rho\|_2^{-2}\,\beta^T\beta. \qquad \Box$$
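Analogously to the earlier insertion check, the following sketch (ours, illustrative only) verifies the column-deletion updates of Lemmas 1 and 3 numerically on a random full-column-rank matrix.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((60, 10))
cols = [1, 4, 6, 9]          # currently selected columns
i = 2                        # delete the i-th selected column (column 6 of A)

S = A[:, cols]
S_pinv = np.linalg.pinv(S)
f_S = np.linalg.norm(A - S @ S_pinv @ A, 'fro') ** 2

rho = S_pinv[i, :]                       # rho = ((S^+)_{i:})^T
beta = A.T @ rho

# Lemma 1: recursive update of the objective value after deleting column i
f_new_rec = f_S + (beta @ beta) / (rho @ rho)

# Lemma 3: recursive update of the Moore-Penrose inverse
M = S_pinv - np.outer(S_pinv @ rho, rho) / (rho @ rho)
S_new_pinv_rec = np.delete(M, i, axis=0)

# direct recomputation for comparison
S_new = np.delete(S, i, axis=1)
f_new_dir = np.linalg.norm(A - S_new @ np.linalg.pinv(S_new) @ A, 'fro') ** 2

print(np.isclose(f_new_rec, f_new_dir))                      # True
print(np.allclose(S_new_pinv_rec, np.linalg.pinv(S_new)))    # True
```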
To prove Lemma 2, which gives the recursive relation on $f(S)$ when inserting one column into $S$, we use the recursive relation on $E^TE$, which has already been derived in Corollary 3 of (Farahat, Ghodsi, and Kamel 2011).

**Proof of Lemma 2.** Let $\hat{E} = A - \hat{S}\hat{S}^+A$. According to Corollary 3 in (Farahat, Ghodsi, and Kamel 2011), we get
$$\hat{E}^T\hat{E} = E^TE - E^TE_{:j}\,(E^TE_{:j})^T / \big((E^T)_{j:}E_{:j}\big).$$
We then show that $E^TE_{:j}$ is just $\delta$. By $(SS^+)^T = SS^+$ and $S^+S = I_{|S|}$, we get
$$\delta = A^TE_{:j} = (E^T + A^TSS^+)E_{:j} = E^TE_{:j} + A^TSS^+E_{:j} = E^TE_{:j} + A^TSS^+(A_{:j} - SS^+A_{:j}) = E^TE_{:j} + A^TSS^+A_{:j} - A^TSS^+SS^+A_{:j} = E^TE_{:j}.$$
It is also clear that $(E^T)_{j:}E_{:j} = \delta_j$. Thus, we have
$$f(\hat{S}) = \|\hat{E}\|_F^2 = \mathrm{tr}(\hat{E}^T\hat{E}) = \mathrm{tr}(E^TE) - \mathrm{tr}(\delta\delta^T)/\delta_j = f(S) - \delta^T\delta/\delta_j. \qquad \Box$$

Let $\tilde{S}$ and $\hat{S}$ denote the matrices obtained by zeroing out and by deleting the $i$-th column of $S$, respectively. Lemma 7, i.e., Proposition 1 in (Ordozgoiti, Canaval, and Mozo 2016), gives the Moore-Penrose inverse of $\tilde{S}$. By further investigating the relation between the Moore-Penrose inverses of $\tilde{S}$ and $\hat{S}$, we can prove Lemma 3, which gives the recursive relation on $S^+$ when deleting one column from a matrix.

**Lemma 7.** (Ordozgoiti, Canaval, and Mozo 2016) Let $\tilde{S}$ be the matrix resulting from zeroing out the $i$-th column of $S$, i.e., the $i$-th column of $\tilde{S}$ consists of all zeros, and let $\rho = ((S^+)_{i:})^T$. Then, $\tilde{S}^+ = S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T$.

**Proof of Lemma 3.** Since the columns of $\tilde{S}$ and $\hat{S}$ span the same space, their projection matrices are the same, that is, $\tilde{S}\tilde{S}^+ = \hat{S}\hat{S}^+$. Let $B$ denote the matrix obtained by deleting the $i$-th row of $\tilde{S}^+$. Then, $\tilde{S}\tilde{S}^+ = \hat{S}B$, since the $i$-th column of $\tilde{S}$ contains all zeros. Thus, $\hat{S}B = \hat{S}\hat{S}^+$, implying that $B = \hat{S}^+\hat{S}B = \hat{S}^+\hat{S}\hat{S}^+ = \hat{S}^+$. According to Lemma 7, $\hat{S}^+$ is therefore the matrix obtained by deleting the $i$-th row of $(S^+ - \|\rho\|_2^{-2}S^+\rho\rho^T)$. Thus, the lemma holds. $\Box$

Lemma 4 gives the recursive relation on $S^+$ when inserting one column into a matrix. Our proof idea is to first analyze the Moore-Penrose inverse of the resulting matrix when the inserted column is placed last, and then analyze the effect on the Moore-Penrose inverse of switching two columns of a matrix.

**Proof of Lemma 4.** We first insert the $j$-th column of $A$ into $S$ as its last column, and compute the Moore-Penrose inverse of the resulting matrix $[S\ A_{:j}]$. For notational simplicity, let $v = A_{:j}$ and $B = S^TS$. Then,
$$\begin{aligned}
[S\ v]^+ &= \big([S\ v]^T[S\ v]\big)^{-1}[S\ v]^T = \begin{pmatrix} B & S^Tv \\ v^TS & v^Tv \end{pmatrix}^{-1}\begin{pmatrix} S^T \\ v^T \end{pmatrix}\\
&= \begin{pmatrix} B^{-1} + \tau B^{-1}S^Tvv^TSB^{-1} & -\tau B^{-1}S^Tv \\ -\tau v^TSB^{-1} & \tau \end{pmatrix}\begin{pmatrix} S^T \\ v^T \end{pmatrix}\\
&= \begin{pmatrix} B^{-1}S^T + \tau B^{-1}S^Tvv^TSB^{-1}S^T - \tau B^{-1}S^Tvv^T \\ -\tau v^TSB^{-1}S^T + \tau v^T \end{pmatrix} = \begin{pmatrix} S^+ + \tau S^+vv^TSS^+ - \tau S^+vv^T \\ -\tau v^TSS^+ + \tau v^T \end{pmatrix},
\end{aligned}$$
where the first equality is by the definition of the Moore-Penrose inverse of a full column rank matrix, the third equality is by block matrix inversion with
$$\tau = \big(v^Tv - v^TSB^{-1}S^Tv\big)^{-1} = \big((v - SS^+v)^T(v - SS^+v)\big)^{-1} = \big((E_{:j})^TE_{:j}\big)^{-1},$$
and the fourth equality is by block matrix multiplication. Since $v^TSS^+ = (SS^+v)^T = v^T - (E_{:j})^T$, we have
$$[S\ v]^+ = \begin{pmatrix} S^+ - ((E_{:j})^TE_{:j})^{-1}S^+v(E_{:j})^T \\ ((E_{:j})^TE_{:j})^{-1}(E_{:j})^T \end{pmatrix}. \qquad (1)$$
Then, we show that if $\tilde{C}$ is the matrix generated by switching the $i$-th and $j$-th columns of a matrix $C$, then $\tilde{C}^+$ is the matrix obtained by swapping the $i$-th and $j$-th rows of $C^+$. Let $T$ denote the elementary matrix obtained by swapping the $i$-th and $j$-th rows of the identity matrix $I_{|C|}$. Then, $\tilde{C} = CT$. According to Corollary 2.3 in (Cline and Greville 1970), we get
$$\tilde{C}^+ = (CT)^+ = (C^+CT)^+(CTT^+)^+ = (I_{|C|}T)^+(CI_{|C|})^+ = T^+C^+ = TC^+.$$
Since $TC^+$ is exactly $C^+$ with its $i$-th and $j$-th rows swapped, the claim holds. Note that the matrix $\hat{S}$ we are to analyze is generated by inserting the $j$-th column of $A$ into $S$ as the $i$-th column. Thus, by combining Eq. (1) (where $v = A_{:j}$) with the above claim, the lemma holds. $\Box$

## References

Arai, H.; Maung, C.; Xu, K.; and Schweitzer, H. 2016. Unsupervised feature selection by heuristic search with provable bounds on suboptimality. In AAAI, 666-672.

Arai, H.; Maung, C.; and Schweitzer, H. 2015. Optimal column subset selection by A-star search. In AAAI, 1079-1085.

Bhaskara, A.; Rostamizadeh, A.; Altschuler, J.; Zadimoghaddam, M.; Fu, T.; and Mirrokni, V. 2016. Greedy column subset selection: New bounds and distributed algorithms. In ICML, 2539-2548.
Boutsidis, C.; Mahoney, M. W.; and Drineas, P. 2009. An improved approximation algorithm for the column subset selection problem. In SODA, 968-977.

Çivril, A. 2014. Column subset selection problem is UG-hard. Journal of Computer and System Sciences 80(4):849-859.

Chan, T. F., and Hansen, P. C. 1992. Some applications of the rank revealing QR factorization. SIAM Journal on Scientific and Statistical Computing 13(3):727-741.

Civril, A., and Magdon-Ismail, M. 2012. Column subset selection via sparse approximation of SVD. Theoretical Computer Science 421:1-14.

Cline, R. E., and Greville, T. N. E. 1970. An extension of the generalized inverse of a matrix. SIAM Journal on Applied Mathematics 19(4):682-688.

Cohen, M. B.; Elder, S.; Musco, C.; Musco, C.; and Persu, M. 2015. Dimensionality reduction for k-means clustering and low rank approximation. In STOC, 163-172.

Das, A., and Kempe, D. 2011. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, 1057-1064.

Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1-30.

Drineas, P.; Mahoney, M.; and Muthukrishnan, S. 2008. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications 30(2):844-881.

Farahat, A. K.; Ghodsi, A.; and Kamel, M. S. 2011. An efficient greedy method for unsupervised feature selection. In ICDM, 161-170.

Friedrich, T., and Neumann, F. 2015. Maximizing submodular functions under matroid constraints by evolutionary algorithms. Evolutionary Computation 23(4):543-558.

Golub, G. H., and Reinsch, C. 1971. Singular value decomposition and least squares solutions. In Linear Algebra, 134-151.

Gu, M., and Eisenstat, S. C. 1996. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing 17(4):848-869.

Guruswami, V., and Sinop, A. K. 2012. Optimal column-based low-rank matrix reconstruction. In SODA, 1207-1214.

Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504-507.

Hoog, F. D., and Mattheij, R. 2007. Subset selection for matrices. Linear Algebra and its Applications 422(2):349-359.

Jolliffe, I. 2011. Principal component analysis. In International Encyclopedia of Statistical Science, 1094-1096.

Ordozgoiti, B.; Canaval, S. G.; and Mozo, A. 2016. A fast iterative algorithm for improved unsupervised feature selection. In ICDM, 390-399.

Ordozgoiti, B.; Canaval, S. G.; and Mozo, A. 2018. Iterative column subset selection. Knowledge and Information Systems 54(1):65-94.

Pan, C.-T., and Tang, P. T. P. 1999. Bounds on singular values revealed by QR factorizations. BIT Numerical Mathematics 39(4):740-756.

Qian, C.; Shi, J.-C.; Yu, Y.; Tang, K.; and Zhou, Z.-H. 2016. Parallel Pareto optimization for subset selection. In IJCAI, 1939-1945.

Qian, C.; Shi, J.-C.; Yu, Y.; Tang, K.; and Zhou, Z.-H. 2017a. Subset selection under noise. In NIPS, 3562-3572.

Qian, C.; Shi, J.-C.; Yu, Y.; and Tang, K. 2017b. On subset selection with general cost constraints. In IJCAI, 2613-2619.

Qian, C.; Shi, J.-C.; Tang, K.; and Zhou, Z.-H. 2018a. Constrained monotone k-submodular function maximization using multi-objective evolutionary algorithms with theoretical guarantee. IEEE Transactions on Evolutionary Computation 22(4):595-608.

Qian, C.; Zhang, Y.; Tang, K.; and Yao, X. 2018b. On multiset selection with size constraints. In AAAI, 1395-1402.
Qian, C.; Feng, C.; and Tang, K. 2018. Sequence selection by Pareto optimization. In IJCAI, 1485-1491.

Qian, C.; Yu, Y.; and Zhou, Z.-H. 2015. Subset selection by Pareto optimization. In NIPS, 1774-1782.

Shitov, Y. 2017. Column subset selection is NP-complete. arXiv:1701.02764.