# Sparse Embedded k-Means Clustering

Weiwei Liu, Xiaobo Shen, Ivor W. Tsang

School of Computer Science and Engineering, The University of New South Wales
School of Computer Science and Engineering, Nanyang Technological University
Centre for Artificial Intelligence, University of Technology Sydney
{liuweiwei863,njust.shenxiaobo}@gmail.com, ivor.tsang@uts.edu.au

The first two authors make equal contributions.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

The k-means clustering algorithm is a ubiquitous tool in data mining and machine learning that shows promising performance. However, its high computational cost has hindered its applications in broad domains. Researchers have successfully addressed these obstacles with dimensionality reduction methods. Recently, [1] develop a state-of-the-art random projection (RP) method for faster k-means clustering. Their method delivers many improvements over other dimensionality reduction methods. For example, compared to the advanced singular value decomposition based feature extraction approach, [1] reduce the running time by a factor of min{n, d}ε² log(d)/k for a data matrix X ∈ R^{n×d} with n data points and d features, while losing only a factor of one in approximation accuracy. Unfortunately, they still require O(ndk/(ε² log(d))) time for matrix multiplication, and this cost is prohibitive for large values of n and d. To break this bottleneck, we carefully build a sparse embedded k-means clustering algorithm which requires only O(nnz(X)) time (nnz(X) denotes the number of non-zeros in X) for fast matrix multiplication. Moreover, our proposed algorithm improves on [1]'s results for approximation accuracy by a factor of one. Our empirical studies corroborate our theoretical findings, and demonstrate that our approach is able to significantly accelerate k-means clustering, while achieving satisfactory clustering performance.

1 Introduction

Due to its simplicity and flexibility, the k-means clustering algorithm [2, 3, 4] has been identified as one of the top 10 data mining algorithms. It has shown promising results in various real-world applications, such as bioinformatics, image processing, text mining and image analysis. Recently, the dimensionality and scale of data has continued to grow in many applications, such as biology, finance, computer vision and the web [5, 6, 7, 8, 9], which poses a serious challenge in designing efficient and accurate algorithmic solutions for k-means clustering. To address these obstacles, one prevalent technique is dimensionality reduction, which embeds the original features into a low dimensional space before performing k-means clustering. Dimensionality reduction encompasses two kinds of approaches: 1) feature selection (FS), which embeds the data into a low dimensional space by selecting the actual dimensions of the data; and 2) feature extraction (FE), which finds an embedding by constructing new artificial features that are, for example, linear combinations of the original features. Laplacian scores [10] and Fisher scores [11] are two basic feature selection methods; however, they lack a provable guarantee. [12] first propose a provable singular value decomposition (SVD) feature selection method. It uses the SVD to find O(k log(k/ε)/ε²) actual features such that, with constant probability, the clustering structure is preserved within a factor of 2 + ε.
Table 1: Dimension reduction methods for k-means clustering. The third column corresponds to the number of selected or extracted features, the fourth column to the time complexity of each dimension reduction method, and the last column to the approximation accuracy. N/A denotes not available. nnz(X) denotes the number of non-zeros in X. ε and δ represent the gap to optimality and the confidence level, respectively. Sparse embedding is abbreviated to SE.

| Method | Description | Dimensions | Time | Accuracy |
| --- | --- | --- | --- | --- |
| [13] | SVD-FE | k | O(nd min{n, d}) | 2 |
| Folklore | RP-FE | O(log(n)/ε²) | O(nd log(n)/(ε² log(d))) | 1 + ε |
| [12] | SVD-FS | O(k log(k/ε)/ε²) | O(nd min{n, d}) | 2 + ε |
| [14] | SVD-FE | O(k/ε²) | O(nd min{n, d}) | 1 + ε |
| [1] | RP-FE | O(k/ε²) | O(ndk/(ε² log(d))) | 2 + ε |
| [15] | RP-FE | O(log(n)/n) | O(nd log(d) + d log(n)) | N/A |
| This paper | SE-FE | O(max((k + log(1/δ))/ε², 6/(ε²δ))) | O(nnz(X)) | 1 + ε |

[13] propose a popular feature extraction approach, where k artificial features are constructed using the SVD such that the clustering structure is preserved within a factor of two. Recently, Corollary 4.5 in [14]'s study improves [13]'s results by showing that O(k/ε²) dimensions are sufficient for preserving 1 + ε accuracy. Because the SVD is computationally expensive, we focus on another important feature extraction method that randomly projects the data into a low dimensional space. [1] develop a state-of-the-art random projection (RP) method, which is based on random sign matrices. Compared to SVD-based feature extraction approaches [14], [1] reduce the running time by a factor of min{n, d}ε² log(d)/k (refer to Section 2.1 for the notation), while losing only a factor of one in approximation accuracy. They also improve the results of the folklore RP method by a factor of log(n)/k in terms of the number of embedded dimensions and the running time, while losing a factor of one in approximation accuracy. Compared to SVD-based feature selection methods, [1] reduce the embedded dimension by a factor of log(k/ε) and the running time by a factor of min{n, d}ε² log(d)/k, respectively. Unfortunately, they still require O(ndk/(ε² log(d))) time for matrix multiplication, and this cost is prohibitive for large values of n and d.

This paper carefully constructs a sparse matrix for the RP method that requires only O(nnz(X)) time for fast matrix multiplication. Our algorithm is significantly faster than other dimensionality reduction methods, especially when nnz(X) << nd. Theoretically, we show a provable guarantee for our algorithm. Given an embedded dimension d̂ = O(max((k + log(1/δ))/ε², 6/(ε²δ))), with probability at least 1 − O(δ), our algorithm preserves the clustering structure within a factor of 1 + ε, improving on the results of [12] and [1] by a factor of one for approximation accuracy. These results are summarized in Table 1. Experiments on four real-world data sets show that our algorithm outperforms other dimension reduction methods. The results verify our theoretical analysis.

We organize this paper as follows. Section 2 introduces the concept of ε-approximation k-means clustering and our proposed sparse embedded k-means clustering algorithm. Section 3 analyzes the provable guarantee for our algorithm, and experimental results are presented in Section 4. The last section provides our conclusions.

2 Sparse Embedded k-Means Clustering

2.1 ε-Approximation k-Means Clustering

We are given X ∈ R^{n×d} with n data points and d features. We denote the transpose of a vector/matrix by the superscript T and logarithms to base 2 by log. Let r = rank(X).
By using the singular value decomposition (SVD), we have X = UΣV^T, where Σ ∈ R^{r×r} is a positive diagonal matrix containing the singular values of X in decreasing order (σ_1 ≥ σ_2 ≥ ... ≥ σ_r), and U ∈ R^{n×r} and V ∈ R^{d×r} contain the orthogonal left and right singular vectors of X. Let U_k and V_k represent U and V with all but their first k columns zeroed out, respectively, and let Σ_k be Σ with all but its largest k singular values zeroed out. Assuming k ≤ r, [16] have already shown that X_k = U_k Σ_k V_k^T is the optimal rank-k approximation to X for any unitarily invariant norm, including the Frobenius and spectral norms. The pseudoinverse of X is given by X^+ = VΣ^{-1}U^T. Let X_{r\k} = X − X_k. I_n denotes the n × n identity matrix. Let ||X||_F be the Frobenius norm of matrix X. For concision, ||A||_2 represents the spectral norm of A if A is a matrix and the Euclidean norm of A if A is a vector. Let nnz(X) denote the number of non-zeros in X.

The task of k-means clustering is to partition n data points in d dimensions into k clusters. Let μ_i be the centroid of the vectors in cluster i and c(x_i) be the cluster that x_i is assigned to. Assume D ∈ R^{n×k} is the indicator matrix which has exactly one non-zero element per row, denoting cluster membership: the i-th data point belongs to the j-th cluster if and only if D_{ij} = 1/√z_j, where z_j denotes the number of data points in cluster j. Note that D^T D = I_k and the i-th row of DD^T X is the centroid of x_i's assigned cluster. Thus, we have Σ_{i=1}^n ||x_i − μ_{c(x_i)}||_2^2 = ||X − DD^T X||_F^2. We formally define the k-means clustering task as follows, which is also studied in [12] and [1].

Definition 1 (k-Means Clustering). Given X ∈ R^{n×d} and a positive integer k denoting the number of clusters, let 𝒟 be the set of all n × k indicator matrices D. The task of k-means clustering is to solve

min_{D ∈ 𝒟} ||X − DD^T X||_F^2    (1)

To accelerate the optimization of problem (1), we aim to find an ε-approximate solution for problem (1) by optimizing D (either exactly or approximately) over an embedded matrix X̂ ∈ R^{n×d̂} with d̂ < d. To measure the quality of the approximation, we first define the ε-approximation embedded matrix:

Definition 2 (ε-Approximation Embedded Matrix). Given 0 ≤ ε < 1 and a non-negative constant τ, X̂ ∈ R^{n×d̂} with d̂ < d is an ε-approximation embedded matrix for X if, for all D ∈ 𝒟,

(1 − ε)||X − DD^T X||_F^2 ≤ ||X̂ − DD^T X̂||_F^2 + τ ≤ (1 + ε)||X − DD^T X||_F^2    (2)

We show that an ε-approximation embedded matrix is sufficient for approximately optimizing problem (1):

Lemma 1 (ε-Approximation k-Means Clustering). Given X ∈ R^{n×d}, let 𝒟 be the set of all n × k indicator matrices D and D* = argmin_{D ∈ 𝒟} ||X − DD^T X||_F^2. Given X̂ ∈ R^{n×d̂} with d̂ < d, let D̂* = argmin_{D ∈ 𝒟} ||X̂ − DD^T X̂||_F^2. If X̂ is an ε′-approximation embedded matrix for X and ε = 2ε′/(1 − ε′), then for any γ ≥ 1, if D̂ ∈ 𝒟 satisfies ||X̂ − D̂D̂^T X̂||_F^2 ≤ γ||X̂ − D̂*D̂*^T X̂||_F^2, we have

||X − D̂D̂^T X||_F^2 ≤ (1 + ε)γ||X − D*D*^T X||_F^2

Proof. By definition, we have ||X̂ − D̂*D̂*^T X̂||_F^2 ≤ ||X̂ − D*D*^T X̂||_F^2, and thus

||X̂ − D̂D̂^T X̂||_F^2 ≤ γ||X̂ − D*D*^T X̂||_F^2    (3)

Since X̂ is an ε′-approximation embedded matrix for X, we have

||X̂ − D*D*^T X̂||_F^2 ≤ (1 + ε′)||X − D*D*^T X||_F^2 − τ  and  ||X̂ − D̂D̂^T X̂||_F^2 ≥ (1 − ε′)||X − D̂D̂^T X||_F^2 − τ    (4)

Combining Eq. (3) and Eq. (4), we obtain:

(1 − ε′)||X − D̂D̂^T X||_F^2 − τ ≤ ||X̂ − D̂D̂^T X̂||_F^2 ≤ γ||X̂ − D*D*^T X̂||_F^2 ≤ (1 + ε′)γ||X − D*D*^T X||_F^2 − τγ    (5)

Since τ ≥ 0 and γ ≥ 1, Eq. (5) implies that

||X − D̂D̂^T X||_F^2 ≤ (1 + ε′)/(1 − ε′) · γ||X − D*D*^T X||_F^2 ≤ (1 + ε)γ||X − D*D*^T X||_F^2    (6)

Remark. Lemma 1 implies that if D̂ is an optimal solution for X̂, then it also preserves the ε-approximation for X. The parameter γ allows D̂ to be only approximately globally optimal for X̂.
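To make the indicator-matrix formulation above concrete, the following NumPy sketch (our illustration, not code from the paper) builds D from a synthetic label vector and checks that ||X − DD^T X||_F^2 coincides with the usual sum of squared distances to the assigned centroids.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 4
X = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)            # synthetic cluster assignments c(x_i)

# Indicator matrix D: D[i, j] = 1/sqrt(z_j) iff x_i belongs to cluster j.
z = np.bincount(labels, minlength=k)           # cluster sizes z_j
D = np.zeros((n, k))
D[np.arange(n), labels] = 1.0 / np.sqrt(z[labels])

assert np.allclose(D.T @ D, np.eye(k))         # D^T D = I_k

# Matrix form of the k-means cost: ||X - D D^T X||_F^2.
cost_matrix_form = np.linalg.norm(X - D @ (D.T @ X), 'fro') ** 2

# Classical form: sum of squared distances to the assigned centroids.
centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
cost_classic = np.sum((X - centroids[labels]) ** 2)

print(cost_matrix_form, cost_classic)          # the two values agree
```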
2.2 Sparse Embedding

[1] construct a random embedded matrix for fast k-means clustering by post-multiplying X with a d × d̂ random matrix whose entries are ±1/√d̂ with equal probability. However, this method requires O(ndk/(ε² log(d))) time for matrix multiplication, and this cost is prohibitive for large values of n and d. To break this bottleneck, Algorithm 1 presents our sparse embedded k-means clustering, which requires only O(nnz(X)) time for the matrix multiplication. Our algorithm is significantly faster than other dimensionality reduction methods, especially when nnz(X) << nd. For an index i taking values in the set {1, ..., n}, we write i ∈ [n].

Algorithm 1 Sparse Embedded k-Means Clustering
Input: X ∈ R^{n×d}. Number of clusters k.
Output: ε-approximate solution for problem (1).
1: Set d̂ = O(max((k + log(1/δ))/ε², 6/(ε²δ))).
2: Build a random map h so that for any i ∈ [d], h(i) = j for j ∈ [d̂] with probability 1/d̂.
3: Construct the matrix Φ ∈ {0, 1}^{d×d̂} with Φ_{i,h(i)} = 1, and all remaining entries 0.
4: Construct the matrix Q ∈ R^{d×d} as a random diagonal matrix whose entries are i.i.d. Rademacher variables.
5: Compute the product X̂ = XQΦ and run exact or approximate k-means algorithms on X̂.

Next, we state our main theorem to show that XQΦ is an ε-approximation embedded matrix for X:

Theorem 1. Let Φ and Q be constructed as in Algorithm 1 and R = QΦ ∈ R^{d×d̂}. Given d̂ = O(max((k + log(1/δ))/ε², 6/(ε²δ))), then for any X ∈ R^{n×d}, with probability at least 1 − O(δ), XR is an ε-approximation embedded matrix for X.
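The embedding in Algorithm 1 is cheap to implement because QΦ has exactly one non-zero per row. The following NumPy/SciPy sketch is our illustration of the construction (the function name sparse_embed and the toy sizes are ours, not the paper's); when X is stored in a sparse format, the product involves each non-zero of X once, matching the O(nnz(X)) cost.

```python
import numpy as np
import scipy.sparse as sp

def sparse_embed(X, d_hat, seed=0):
    """Sketch of the embedding X_hat = X Q Phi from Algorithm 1.

    X is an (n, d) dense array or scipy.sparse matrix; d_hat is the
    embedded dimension chosen in step 1 of the algorithm.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    h = rng.integers(0, d_hat, size=d)           # step 2: random map h(i)
    q = rng.choice([-1.0, 1.0], size=d)          # step 4: Rademacher diagonal of Q
    # Steps 3-4 combined: Q Phi has one non-zero per row, q[i] at column h(i),
    # so it is stored as a d x d_hat sparse matrix with only d non-zeros.
    QPhi = sp.csr_matrix((q, (np.arange(d), h)), shape=(d, d_hat))
    # Step 5: the product involves each non-zero of X once, i.e. O(nnz(X)) work.
    return X @ QPhi

# Toy usage: embed a sparse data matrix, then hand X_hat to any k-means solver.
X = sp.random(1000, 5000, density=0.01, format='csr', random_state=0)
X_hat = sparse_embed(X, d_hat=200)
print(X_hat.shape)                               # (1000, 200)
```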
3 Provable Guarantee

Let Z = I_n − DD^T and let tr denote the trace. Eq. (2) can then be formulated as (1 − ε)tr(ZXX^T Z) ≤ tr(ZX̂X̂^T Z) + τ ≤ (1 + ε)tr(ZXX^T Z), so we try to approximate XX^T with X̂X̂^T. To prove our main theorem, we write X̂ = XR, and our goal is to show that tr(ZXX^T Z) can be approximated by tr(ZXRR^T X^T Z). Lemma 2 provides conditions on the error matrix E = X̂X̂^T − XX^T that are sufficient to guarantee that X̂ is an ε-approximation embedded matrix for X. For any two symmetric matrices A, B ∈ R^{n×n}, A ⪯ B indicates that B − A is positive semidefinite. Let λ_i(A) denote the i-th largest eigenvalue of A in absolute value. ⟨·, ·⟩ represents the inner product, and 0_{n×d} denotes an n × d zero matrix with all its entries being zero.

Lemma 2. Let C = XX^T and Ĉ = X̂X̂^T. Suppose we can write Ĉ = C + E_1 + E_2 + E_3 + E_4, where:
(i) E_1 is symmetric and −ε_1 C ⪯ E_1 ⪯ ε_1 C;
(ii) E_2 is symmetric, Σ_{i=1}^k |λ_i(E_2)| ≤ ε_2 ||X_{r\k}||_F^2, and tr(E_2) ≤ ε′_2 ||X_{r\k}||_F^2;
(iii) the columns of E_3 fall in the column span of C and tr(E_3^T C^+ E_3) ≤ ε_3^2 ||X_{r\k}||_F^2;
(iv) the rows of E_4 fall in the row span of C and tr(E_4 C^+ E_4^T) ≤ ε_4^2 ||X_{r\k}||_F^2.
If ε_1 + ε_2 + ε′_2 + ε_3 + ε_4 = ε, then X̂ is an ε-approximation embedded matrix for X. Specifically, we have (1 − ε)tr(ZCZ) ≤ tr(ZĈZ) − min{0, tr(E_2)} ≤ (1 + ε)tr(ZCZ).

The proof can be found in [17]. Next, we show that XR is an ε-approximation embedded matrix for X. We first present the following theorem:

Theorem 2. Assume r > 2k and let V_{2k} ∈ R^{d×r} represent V with all but its first 2k columns zeroed out. We define M_1 = V_{2k}^T, M_2 = (√k/||X_{r\k}||_F)(X − XV_{2k}V_{2k}^T), and M ∈ R^{(n+r)×d} as containing M_1 as its first r rows and M_2 as its lower n rows. We construct R = QΦ ∈ R^{d×d̂} as in Algorithm 1. Given d̂ = O(max((k + log(1/δ))/ε², 6/(ε²δ))), then for any X ∈ R^{n×d}, with probability at least 1 − O(δ), we have
(i) ||MRR^T M^T − MM^T||_2 < ε;
(ii) | ||M_2 R||_F^2 − ||M_2||_F^2 | ≤ εk.

Proof. To prove the first result, one can easily check that M_1 M_2^T = 0_{r×n}; thus MM^T is a block diagonal matrix with upper left block equal to M_1 M_1^T and lower right block equal to M_2 M_2^T. The spectral norm of M_1 M_1^T is 1. Moreover,

||M_2 M_2^T||_2 = ||M_2||_2^2 = k||X − XV_{2k}V_{2k}^T||_2^2 / ||X_{r\k}||_F^2 = k||X_{r\2k}||_2^2 / ||X_{r\k}||_F^2

As ||X_{r\k}||_F^2 ≥ k||X_{r\2k}||_2^2, we derive ||M_2 M_2^T||_2 ≤ 1. Since MM^T is a block diagonal matrix, we have ||M||_2^2 = ||MM^T||_2 = max{||M_1 M_1^T||_2, ||M_2 M_2^T||_2} = 1. Furthermore, tr(M_1 M_1^T) = 2k and tr(M_2 M_2^T) = k||X_{r\2k}||_F^2 / ||X_{r\k}||_F^2. As ||X_{r\k}||_F^2 ≥ ||X_{r\2k}||_F^2, we derive tr(M_2 M_2^T) ≤ k. Then we have ||M||_F^2 = tr(MM^T) = tr(M_1 M_1^T) + tr(M_2 M_2^T) ≤ 3k. Applying Theorem 6 from [18], we obtain that, given d̂ = O((k + log(1/δ))/ε²), with probability at least 1 − δ, ||MRR^T M^T − MM^T||_2 < ε. The proof of the second result can be found in the Supplementary Materials.

Based on Theorem 2, we show that X̂ = XR satisfies the conditions of Lemma 2.

Lemma 3. Assume r > 2k and construct M and R as in Theorem 2. Given d̂ = O(max((k + log(1/δ))/ε², 6/(ε²δ))), then for any X ∈ R^{n×d}, with probability at least 1 − O(δ), X̂ = XR satisfies the conditions of Lemma 2.

Proof. We construct H_1 ∈ R^{n×(n+r)} as H_1 = [XV_{2k}, 0_{n×n}], so that H_1 M = XV_{2k}V_{2k}^T. We also set H_2 ∈ R^{n×(n+r)} as H_2 = [0_{n×r}, (||X_{r\k}||_F/√k) I_n], so that H_2 M = (||X_{r\k}||_F/√k) M_2 = X − XV_{2k}V_{2k}^T = X_{r\2k} and X = H_1 M + H_2 M. We then obtain the following decomposition:

E = X̂X̂^T − XX^T = XRR^T X^T − XX^T = E_1 + E_2 + E_3 + E_4    (7)

where E_1 = H_1 MRR^T M^T H_1^T − H_1 MM^T H_1^T, E_2 = H_2 MRR^T M^T H_2^T − H_2 MM^T H_2^T, E_3 = H_1 MRR^T M^T H_2^T − H_1 MM^T H_2^T and E_4 = H_2 MRR^T M^T H_1^T − H_2 MM^T H_1^T. We bound E_1, E_2, E_3 and E_4 separately, showing that each corresponds to one of the error terms described in Lemma 2.

Bounding E_1.

E_1 = H_1 MRR^T M^T H_1^T − H_1 MM^T H_1^T = XV_{2k}V_{2k}^T RR^T V_{2k}V_{2k}^T X^T − XV_{2k}V_{2k}^T V_{2k}V_{2k}^T X^T    (8)

E_1 is symmetric. By Theorem 2, we know that with probability at least 1 − δ, ||MRR^T M^T − MM^T||_2 < ε holds. Then we get −εI_{n+r} ⪯ MRR^T M^T − MM^T ⪯ εI_{n+r}, and we derive the following:

−εH_1 H_1^T ⪯ E_1 ⪯ εH_1 H_1^T    (9)

For any vector v, v^T XV_{2k}V_{2k}^T V_{2k}V_{2k}^T X^T v = ||V_{2k}V_{2k}^T X^T v||_2^2 ≤ ||V_{2k}V_{2k}^T||_2^2 ||X^T v||_2^2 = ||X^T v||_2^2 = v^T XX^T v, so H_1 MM^T H_1^T = XV_{2k}V_{2k}^T V_{2k}V_{2k}^T X^T ⪯ XX^T. Since H_1 MM^T H_1^T = XV_{2k}V_{2k}^T V_{2k}V_{2k}^T X^T = XV_{2k}V_{2k}^T X^T = H_1 H_1^T, we have

H_1 H_1^T = H_1 MM^T H_1^T ⪯ XX^T = C    (10)

Combining Eqs. (9) and (10), we arrive at, with probability at least 1 − δ,

−εC ⪯ E_1 ⪯ εC    (11)

satisfying the first condition of Lemma 2.

Bounding E_2.

E_2 = H_2 MRR^T M^T H_2^T − H_2 MM^T H_2^T = (X − XV_{2k}V_{2k}^T)RR^T(X − XV_{2k}V_{2k}^T)^T − (X − XV_{2k}V_{2k}^T)(X − XV_{2k}V_{2k}^T)^T    (12)

E_2 is symmetric. Note that H_2 just selects M_2 from M and scales it by ||X_{r\k}||_F/√k. Using Theorem 2, we know that with probability at least 1 − δ,

tr(E_2) = (||X_{r\k}||_F^2 / k) tr(M_2 RR^T M_2^T − M_2 M_2^T) ≤ ε||X_{r\k}||_F^2    (13)

Applying Theorem 6.2 from [19] and rescaling ε, we obtain that with probability at least 1 − δ,

||E_2||_F = ||X_{r\2k} RR^T X_{r\2k}^T − X_{r\2k} X_{r\2k}^T||_F ≤ (ε/√k)||X_{r\2k}||_F^2    (14)

Combining Eq. (14), the Cauchy-Schwarz inequality and ||X_{r\2k}||_F^2 ≤ ||X_{r\k}||_F^2, we get that with probability at least 1 − δ,

Σ_{i=1}^k |λ_i(E_2)| ≤ √k ||E_2||_F ≤ ε||X_{r\k}||_F^2    (15)

Eqs. (13) and (15) satisfy the second condition of Lemma 2.

Bounding E_3.

E_3 = H_1 MRR^T M^T H_2^T − H_1 MM^T H_2^T = XV_{2k}V_{2k}^T RR^T(X − XV_{2k}V_{2k}^T)^T − XV_{2k}V_{2k}^T(X − XV_{2k}V_{2k}^T)^T    (16)

The columns of E_3 lie in the column span of H_1 M = XV_{2k}V_{2k}^T, and hence in the column span of C. Note that ||V_{2k}||_F^2 = tr(V_{2k}^T V_{2k}) = 2k. As V_{2k}^T V = V_{2k}^T V_{2k}, we have V_{2k}^T X_{r\2k}^T = V_{2k}^T(VΣU^T − V_{2k}Σ_{2k}U_{2k}^T) = Σ_{2k}U_{2k}^T − Σ_{2k}U_{2k}^T = 0_{r×n}. Applying Theorem 6.2 from [19] again and rescaling ε, we obtain that with probability at least 1 − δ,

tr(E_3^T C^+ E_3) = ||Σ^{-1}U^T(H_1 MRR^T M^T H_2^T − H_1 MM^T H_2^T)||_F^2 = ||V_{2k}^T RR^T X_{r\2k}^T − 0_{r×n}||_F^2 ≤ ε^2||X_{r\k}||_F^2    (17)

Thus, Eq. (17) satisfies the third condition of Lemma 2.

Bounding E_4.

E_4 = H_2 MRR^T M^T H_1^T − H_2 MM^T H_1^T = (X − XV_{2k}V_{2k}^T)RR^T V_{2k}V_{2k}^T X^T − (X − XV_{2k}V_{2k}^T)V_{2k}V_{2k}^T X^T    (18)

Since E_4 = E_3^T, we immediately have that with probability at least 1 − δ,

tr(E_4 C^+ E_4^T) ≤ ε^2||X_{r\k}||_F^2    (19)

Lastly, Eqs. (11), (13), (15), (17) and (19) ensure that, for any X ∈ R^{n×d}, X̂ = XR satisfies the conditions of Lemma 2 and is an ε-approximation embedded matrix for X with probability at least 1 − O(δ).
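As a quick numerical sanity check of the guarantee above (ours, not an experiment from the paper), one can compare the clustering cost computed on X and on X̂ = XQΦ for a fixed indicator matrix D; by Theorem 1 the two should be close once d̂ is moderately large. The snippet below uses a dense QΦ for clarity; the sparse O(nnz(X)) construction is sketched after Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, d_hat = 500, 2000, 5, 400
X = rng.standard_normal((n, d))

# A fixed cluster indicator matrix D (any member of the indicator set works here).
labels = rng.integers(0, k, size=n)
z = np.bincount(labels, minlength=k)
D = np.zeros((n, k))
D[np.arange(n), labels] = 1.0 / np.sqrt(z[labels])

# Embedding X_hat = X Q Phi, written densely for clarity.
h = rng.integers(0, d_hat, size=d)
q = rng.choice([-1.0, 1.0], size=d)
QPhi = np.zeros((d, d_hat))
QPhi[np.arange(d), h] = q
X_hat = X @ QPhi

cost = np.linalg.norm(X - D @ (D.T @ X), 'fro') ** 2
cost_hat = np.linalg.norm(X_hat - D @ (D.T @ X_hat), 'fro') ** 2

# The ratio concentrates around 1 as d_hat grows (cf. Theorem 1, up to tau).
print(cost_hat / cost)
```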
4 Experiment

4.1 Data Sets and Baselines

We denote our proposed sparse embedded k-means clustering algorithm as SE for short. This section evaluates the performance of the proposed method on four real-world data sets: COIL20, SECTOR, RCV1 and ILSVRC2012. The COIL20 [20] and ILSVRC2012 [21] data sets are collected from their project websites (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php and http://www.image-net.org/challenges/LSVRC/2012/), and the other data sets are collected from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The statistics of these data sets are presented in the Supplementary Materials.

[Figure 1: Clustering accuracy of various methods on the COIL20, SECTOR, RCV1 and ILSVRC2012 data sets. Each panel plots clustering accuracy (in %) against the number of embedded dimensions (0-1000) for k-means, SVD, LLE, LS, RP, PD and SE.]

[Figure 2: Dimension reduction time of various methods on the COIL20, SECTOR, RCV1 and ILSVRC2012 data sets. Each panel plots preprocessing time (in seconds) against the number of embedded dimensions.]

[Figure 3: Clustering time of various methods on the COIL20, SECTOR, RCV1 and ILSVRC2012 data sets. Each panel plots clustering time (in seconds) against the number of embedded dimensions.]

We compare SE with several other dimensionality reduction techniques:

SVD: The singular value decomposition or principal component analysis dimensionality reduction approach.

LLE: The locally linear embedding (LLE) algorithm proposed by [22]. We use the code from http://www.cs.nyu.edu/~roweis/lle/ with default parameters.

LS: [10] develop the Laplacian score (LS) feature selection method. We use the code from www.cad.zju.edu.cn/home/dengcai/Data/data.html with default parameters.

PD: [15] propose an advanced compression scheme for accelerating k-means clustering. We use the code from https://github.com/stephenbeckr/SparsifiedKMeans with default parameters.

RP: The state-of-the-art random projection method proposed by [1].

After dimensionality reduction, we run all methods on a standard k-means clustering package (from www.cad.zju.edu.cn/home/dengcai/Data/data.html) with default parameters. We also compare all these methods against the standard k-means algorithm on the full dimensional data sets. To measure the quality of all methods, we report clustering accuracy based on the labelled information of the input data. Finally, we report the running times (in seconds) of both the dimensionality reduction procedure and the k-means clustering for all baselines.
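The paper reports clustering accuracy against the ground-truth labels but does not spell out how cluster indices are matched to class labels; a common convention, sketched below purely as our assumption, maps clusters to classes with the Hungarian algorithm before computing accuracy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between predicted cluster ids and true class labels."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency table: how many points of class c land in cluster j.
    cont = np.zeros((len(clusters), len(classes)), dtype=int)
    for j, cl in enumerate(clusters):
        for c, cls in enumerate(classes):
            cont[j, c] = np.sum((y_pred == cl) & (y_true == cls))
    # Hungarian algorithm: cluster-to-class mapping that maximizes matched points.
    rows, cols = linear_sum_assignment(-cont)
    return cont[rows, cols].sum() / len(y_true)

# Toy usage: the same partition with permuted cluster ids gives accuracy 1.0.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred))
```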
4.2 Results

The experimental results of the various methods on all data sets are shown in Figures 1, 2 and 3. The Y axes of Figures 2 and 3 show the dimension reduction time and clustering time on a log scale. We could not obtain the results of SVD, LLE and LS within three days on the RCV1 and ILSVRC2012 data sets; thus, these results are not reported. From Figures 1, 2 and 3, we can see that:

- As the number of embedded dimensions increases, the clustering accuracy and running times of all dimensionality reduction methods increase, which is consistent with the empirical results in [1].
- Our proposed dimensionality reduction method has superior performance compared to the RP method and the other baselines in terms of accuracy, which verifies our theoretical results. LLE and LS generally underperform on the COIL20 and SECTOR data sets.
- SVD and LLE are the two slowest methods among the baselines in terms of dimensionality reduction time. The dimension reduction time of the RP method increases significantly with the number of embedded dimensions, while our method attains a stable and lowest dimensionality reduction time; it is up to several hundred times faster than the RP method and the other baselines. These results also support our complexity analysis.
- All dimensionality reduction methods are significantly faster than the standard k-means algorithm with full dimensions.

Finally, we conclude that our proposed method is able to significantly accelerate k-means clustering, while achieving satisfactory clustering performance.

5 Conclusion

The k-means clustering algorithm is a ubiquitous tool in data mining and machine learning with numerous applications. The increasing dimensionality and scale of data has posed a considerable challenge in designing efficient and accurate k-means clustering algorithms. Researchers have successfully addressed these obstacles with dimensionality reduction methods, which embed the original features into a low dimensional space and then perform k-means clustering on the embedded dimensions. SVD is one of the most popular dimensionality reduction methods; however, it is computationally expensive. Recently, [1] develop a state-of-the-art RP method for faster k-means clustering. Their method delivers many improvements over other dimensionality reduction methods. For example, compared to an advanced SVD-based feature extraction approach [14], [1] reduce the running time by a factor of min{n, d}ε² log(d)/k, while losing only a factor of one in approximation accuracy. They also improve the result of the folklore RP method by a factor of log(n)/k in terms of the number of embedded dimensions and the running time, while losing a factor of one in approximation accuracy. Unfortunately, their method still requires O(ndk/(ε² log(d))) time for matrix multiplication, and this cost is prohibitive for large values of n and d. To break this bottleneck, we carefully construct a sparse matrix for the RP method that requires only O(nnz(X)) time for fast matrix multiplication. Our algorithm is significantly faster than other dimensionality reduction methods, especially when nnz(X) << nd. Furthermore, we improve the results of [12] and [1] by a factor of one for approximation accuracy.
Our empirical studies demonstrate that our proposed algorithm outperforms other dimension reduction methods, which corroborates our theoretical findings.

Acknowledgments

We would like to thank the area chairs and reviewers for their valuable comments and constructive suggestions on our paper. This project is supported by the ARC Future Fellowship FT130100746, ARC grants LP150100671, DP170101628, DP150102728 and DP150103071, and NSFC grants 61232006 and 61672235.

References

[1] Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Trans. Information Theory, 61(2):1045–1062, 2015.
[2] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28(1):100–108, 1979.
[3] Xiao-Bo Shen, Weiwei Liu, Ivor W. Tsang, Fumin Shen, and Quan-Sen Sun. Compressed k-means for large-scale clustering. In AAAI, pages 2527–2533, 2017.
[4] Xinwang Liu, Miaomiao Li, Lei Wang, Yong Dou, Jianping Yin, and En Zhu. Multiple kernel k-means with incomplete kernels. In AAAI, pages 2259–2265, 2017.
[5] Tom M. Mitchell, Rebecca A. Hutchinson, Radu Stefan Niculescu, Francisco Pereira, Xuerui Wang, Marcel Adam Just, and Sharlene D. Newman. Learning to decode cognitive states from brain images. Machine Learning, 57(1-2):145–175, 2004.
[6] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature selection: Beyond the linear model. Journal of Machine Learning Research, 10:2013–2038, 2009.
[7] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob J. Verbeek. Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
[8] Yiteng Zhai, Yew-Soon Ong, and Ivor W. Tsang. The emerging "big dimensionality". IEEE Computational Intelligence Magazine, 9(3):14–26, 2014.
[9] Weiwei Liu and Ivor W. Tsang. Making decision trees feasible in ultrahigh feature and label dimensions. Journal of Machine Learning Research, 18(81):1–36, 2017.
[10] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In NIPS, pages 507–514, 2005.
[11] Donald H. Foley and John W. Sammon Jr. An optimal set of discriminant vectors. IEEE Trans. Computers, 24(3):281–289, 1975.
[12] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. Unsupervised feature selection for the k-means clustering problem. In NIPS, pages 153–161, 2009.
[13] Petros Drineas, Alan M. Frieze, Ravi Kannan, Santosh Vempala, and V. Vinay. Clustering in large graphs and matrices. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 291–299, 1999.
[14] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434–1453, 2013.
[15] Farhad Pourkamali Anaraki and Stephen Becker. Preconditioned data sparsification for big data with applications to PCA and k-means. IEEE Trans. Information Theory, 63(5):2954–2974, 2017.
[16] Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11:50–59, 1960.
[17] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 163–172, 2015.
[18] Michael B. Cohen, Jelani Nelson, and David P. Woodruff.
Optimal approximate matrix product in terms of stable rank. In 43rd International Colloquium on Automata, Languages, and Programming, pages 11:1–11:14, 2016.
[19] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1):4:1–4:23, 2014.
[20] Rong Wang, Feiping Nie, Xiaojun Yang, Feifei Gao, and Minli Yao. Robust 2DPCA with non-greedy l1-norm maximization for image analysis. IEEE Trans. Cybernetics, 45(5):1108–1112, 2015.
[21] Weiwei Liu, Ivor W. Tsang, and Klaus-Robert Müller. An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18(94):1–38, 2017.
[22] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.