# Scalable Sequential Spectral Clustering

Yeqing Li1, Junzhou Huang1, Wei Liu2
1University of Texas at Arlington, Texas, USA
2Didi Research, Beijing, China
{yeqing.li@mavs.uta.edu, jzhuang@uta.edu, weiliu@didichuxing.com}

(Corresponding author: Junzhou Huang, jzhuang@uta.edu. This work was partially supported by NSF IIS-1423056, CMMI-1434401, CNS-1405985.)

Abstract

In the past decades, Spectral Clustering (SC) has become one of the most effective clustering approaches. Although it has been widely used, one significant drawback of SC is its expensive computational cost. Many efforts have been devoted to accelerating SC algorithms, and promising results have been achieved. However, most of the existing algorithms rely on the assumption that data can be stored in the computer memory. When data cannot fit in memory, these algorithms suffer severe performance degradation. To overcome this issue, we propose a novel sequential SC algorithm for tackling large-scale clustering with limited computational resources, e.g., memory. We begin by investigating an effective way of approximating the graph affinity matrix via a bipartite graph. We then choose a graph construction and optimization strategy that avoids random access to the data. These efforts lead to an efficient SC algorithm whose memory usage is independent of the number of input data points. Extensive experiments carried out on large datasets demonstrate that the proposed sequential SC algorithm is up to a thousand times faster than the state-of-the-art.

Introduction

Clustering is one of the most important problems in computer vision and pattern recognition. In the past decades, spectral clustering methods, e.g., (Shi and Malik 2000; Von Luxburg 2007; Ng et al. 2002; Zelnik-Manor and Perona 2004; Yu and Shi 2003; Chang et al. 2015), which are based on eigendecompositions of graph Laplacians of data, have shown superior performance to traditional methods like K-Means. The effectiveness of spectral clustering is mainly due to its capability of capturing nonlinear cluster structures, while many other clustering methods rely on Euclidean geometry and explicit or implicit assumptions of convex cluster shapes (Yan, Huang, and Jordan 2009). However, conventional spectral clustering algorithms are not applicable to large-scale data because of their quadratic computational complexity. This is mainly due to the expensive cost of similarity graph construction and the eigendecomposition computation. Given n data points, each represented by a feature vector of d dimensions, the time complexity of computing a pairwise similarity matrix is $O(n^2 d)$ and that of the worst-case eigendecomposition of a full graph Laplacian is $O(kn^2)$, where k is the number of classes. Given the rapid growth in the amount of user-generated data, such a quadratic complexity makes SC inapplicable to the big data encountered in real-world problems.

Many approaches have been proposed to reduce the computational cost of spectral clustering. They can be roughly divided into two classes. The algorithms of the first class focus on reducing the computational cost of eigendecomposition. Given a similarity matrix of n data points, these algorithms accelerate the eigendecomposition by faster approximation approaches (Fowlkes et al. 2004; Chen et al.
2006; Liu et al. 2007; Khoa and Chawla 2012; Liu et al. 2013b; Zhang and Kwok 2010). Although promising results have been achieved, the construction of the similarity matrix still takes quadratic running time, which limits their usage. The algorithms of the second class try to reduce the computational cost of graph construction and eigendecomposition simultaneously by approximating the original graph (Shinnou and Sasaki 2008; Sakai and Imiya 2009; Yan, Huang, and Jordan 2009; Chen and Cai 2011; Li et al. 2015). In general, algorithms in the second class are more practical than those in the first class since they reduce the cost of all stages of spectral clustering. However, these algorithms are all in-memory algorithms, which assume that all data and intermediate results can be accessed in main memory. This assumption significantly limits the scale of data that they can handle.

It is a challenging research problem to efficiently handle data beyond the memory capacity. In this context, we can split the algorithm running time into two parts: 1) time for processing data in memory; 2) time for transferring data from/to disk. Most previous large-scale spectral clustering algorithms assume that the transfer time is negligible, which makes them impractical for big data. Other researchers try to reduce the second part by developing algorithms for distributed systems (Chen et al. 2011; Miao et al. 2008). However, not only is writing programs on a distributed system a challenging task, but deploying and tuning a distributed program also require specialized skills.

This paper aims to develop large-scale spectral clustering algorithms for ordinary users with only limited computing resources. We consider the situation where data cannot be completely stored in memory. Therefore, it is inevitable to swap data back and forth between memory and disk. The key idea of our approach is to divide the entire data set into small blocks and sequentially process each block, which keeps the memory footprint small and the computational cost low. Inspired by recent progress on large-scale graph construction for semi-supervised learning (Liu, He, and Chang 2010; Liu, Wang, and Chang 2012), we propose a fast and scalable algorithm for spectral clustering named Sequential Spectral Clustering (SeqSC). Specifically, we approximate the original similarity graph by a bipartite graph and use two key components to drastically reduce the computation time and the data transfer time. The first component is a novel sequential K-Means algorithm framework, while the second component is a sequential eigendecomposition approach. The proposed SeqSC runs in close-to-linear time with respect to the number of data points without losing clustering accuracy, and it requires scanning the raw data on disk for only a few passes. To the best of our knowledge, this is the first exploration of large-scale spectral clustering under limited resources. Extensive experiments are conducted in a real environment without enough memory for massive data, where the proposed algorithm is up to several orders of magnitude faster than the state-of-the-art approaches.

Background and Notations

In this section, we briefly introduce the notations and the spectral clustering algorithm. Suppose $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ denotes the data matrix, where n is the number of data points and d is the dimension of the features. Each data point $x_i \in \mathbb{R}^d$ belongs to one of K classes $C = \{c_1, \ldots, c_K\}$. Given the whole dataset X, each data point is represented as a vertex of a graph, and each edge represents the affinity relation of one pair of vertices. In practice, the k-NN graph is usually used. Specifically, $x_i$ and $x_j$ are connected if at least one of them is among the k nearest neighbors of the other in the given metric (usually Euclidean distance). The weight of the edge between $x_i$ and $x_j$ is defined as

$$w_{ij} = \begin{cases} \exp\!\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right), & \text{if } x_i \text{ and } x_j \text{ are connected,} \\ 0, & \text{otherwise,} \end{cases}$$

where $\sigma$ is the bandwidth parameter. Note that while we use the Gaussian kernel in our example, this approach is also applicable to other types of kernels. Thus, $W = \{w_{ij}\} \in \mathbb{R}^{n \times n}$ is the adjacency matrix of the graph, and the graph is symmetric and undirected. Let $D \in \mathbb{R}^{n \times n}$ be the degree matrix, whose i-th diagonal element is $d_{ii} = \sum_{j=1}^{n} w_{ij}$. The normalized graph Laplacian matrix L is then defined as

$$L = I - D^{-1/2} W D^{-1/2}. \qquad (1)$$

The objective function of normalized spectral clustering (Ng et al. 2002) is defined as

$$\min_{G^T G = I} \operatorname{Tr}(G^T L G), \qquad (2)$$

where $G \in \mathbb{R}^{n \times K}$ is the class indicator matrix of all data. One solution of G in Eq. (2) is given by the K smallest eigenvectors of L. The whole process takes at least quadratic time in n, which is impractical for large-scale problems.
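To make this baseline concrete, the following is a minimal NumPy sketch of the procedure defined by Eqs. (1)-(2). It is only an illustration, not the MATLAB implementation used in the experiments: it assumes a precomputed dense affinity matrix W, uses scikit-learn's K-Means for the final step, and row-normalizes the spectral embedding as in Ng et al. (2002).

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Baseline normalized spectral clustering, Eqs. (1)-(2).

    W : (n, n) symmetric affinity matrix; k : number of clusters.
    Building W costs O(n^2 d) and the dense eigendecomposition below is
    cubic in n, which is exactly why this baseline does not scale.
    """
    d = W.sum(axis=1)                                   # degrees d_ii
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}   (Eq. 1)
    L = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # The k smallest eigenvectors of L solve min_{G^T G = I} Tr(G^T L G)   (Eq. 2)
    _, vecs = np.linalg.eigh(L)                         # ascending eigenvalues
    G = vecs[:, :k]
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(G)
```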
Methodology

In order to address the scalability problem of SC, we propose the Sequential SC (SeqSC) approach by sequentializing the two key steps of SC: graph construction and eigendecomposition.

Graph Construction

As discussed above, the construction of the graph W takes quadratic space and time because of the computation of pairwise distances between data points. This process is easy to sequentialize. Specifically, we can keep only one data sample $x_i$ in memory, load all the other data from disk sequentially, and compute the distances from $x_i$ to all the other data points. This is equivalent to sequentially computing each row of W. However, there are still two problems with this approach. First, its time complexity is still quadratic. Second, it requires scanning the data multiple times (usually n divided by some constant), which may take even longer than the computation itself for a large data set.

To reduce the time complexity, one must eliminate the need to compute pairwise distances. A common way to achieve this goal is to find a low-rank approximation of W by using a bipartite graph. The intuition of this approach is that, instead of computing the direct distance between two points $x_i$ and $x_j$, we construct local stations $u_k$ within their neighborhoods and approximate the distance between $x_i$ and $x_j$ by $\operatorname{dist}(x_i, x_j) \approx \operatorname{dist}(x_i, u_k) + \operatorname{dist}(u_k, x_j)$. This approximation is the distance of traveling from one point to another through a nearby transit point $u_k$. The number of stations should be much smaller than the data size so that the number of distances to compute is much less than $n^2$. Therefore, instead of computing W, we employ the anchor graph approach (Liu, He, and Chang 2010). Here, a small set of anchor points $U = [u_1, \ldots, u_m]^T \in \mathbb{R}^{m \times d}$ is used to capture the manifold structure. These anchor points are usually obtained by running a lightweight clustering algorithm such as K-Means on the raw data. With the generated anchor points, a k-NN graph is constructed between the raw data and the anchor points, and connections are only allowed between raw data points and anchor points. This constraint results in a bipartite graph between the raw data X and the anchor points U. The weight of each edge is defined as

$$Z_{ij} = \frac{K(x_i, u_j)}{\sum_{k \in \Phi_i} K(x_i, u_k)}, \quad \forall j \in \Phi_i, \qquad (3)$$

where $K(\cdot)$ is a given kernel function and $\Phi_i \subset \{1, \ldots, m\}$ denotes the indexes of the p nearest neighbors of $x_i$ in U. The affinity matrix becomes

$$W = \begin{bmatrix} 0 & Z \\ Z^T & 0 \end{bmatrix} \in \mathbb{R}^{(n+m) \times (n+m)},$$

and the degree matrix becomes

$$D = \begin{bmatrix} D_r & 0 \\ 0 & D_c \end{bmatrix} \in \mathbb{R}^{(n+m) \times (n+m)},$$

where $D_r$ is a diagonal matrix whose diagonal elements are the row sums of Z, and $D_c$ is a diagonal matrix whose diagonal elements are the column sums of Z. Since Z is by definition row-normalized, we have $D_r = I_n$, where $I_n$ is the n-by-n identity matrix. Let $\hat{Z} = Z D_c^{-1/2}$. Thus, we have transformed an unstructured ordinary graph into a bipartite graph.
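As an illustration, a NumPy sketch of this construction (Eq. (3) with a Gaussian kernel, followed by the normalization $\hat{Z} = Z D_c^{-1/2}$) is given below. The anchors U are assumed to be available, e.g., from the SeqKM procedure introduced next; the per-point loop marks where an on-disk implementation would stream data blocks instead of indexing an in-memory array, and the bandwidth sigma and neighborhood size p are illustrative defaults, not values prescribed by the paper.

```python
import numpy as np

def build_anchor_graph(X, U, p=5, sigma=1.0):
    """Bipartite affinities Z (Eq. 3) and the normalized matrix Z_hat.

    X : (n, d) data, U : (m, d) anchor points; each point is connected to
    its p nearest anchors with row-normalized Gaussian kernel weights.
    """
    n, m = X.shape[0], U.shape[0]
    Z = np.zeros((n, m))
    for i in range(n):                            # stream one point (or block) at a time
        d2 = ((U - X[i]) ** 2).sum(axis=1)        # distances to the m anchors only
        nn = np.argsort(d2)[:p]                   # Phi_i: indexes of the p nearest anchors
        w = np.exp(-d2[nn] / (2.0 * sigma ** 2))  # kernel values K(x_i, u_j)
        Z[i, nn] = w / w.sum()                    # row normalization of Eq. (3)
    D_c = Z.sum(axis=0)                           # column sums -> diagonal of D_c
    Z_hat = Z / np.sqrt(D_c + 1e-12)              # Z_hat = Z D_c^{-1/2}
    return Z, Z_hat
```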
Although the time complexity of the distance computation is reduced, this approach still needs to go through the data set multiple times when using K-Means to generate the anchor points. This is still prohibitive for processing a large dataset: when the data set is too large to hold entirely in memory, the transfer time soon occupies most of the total running time.

The Proposed Sequential K-Means

Here, we propose a new approach to construct the graph, which preserves the low computational time and has a low memory footprint. The core of this approach is the sequential K-Means algorithm. The proposed Sequential K-Means algorithm (SeqKM) is given in Alg. 1. The basic idea of SeqKM is that we first randomly select a small subset of the data and run the K-Means++ algorithm on it, and then sequentially process the whole data set using stochastic gradient descent (SGD) (Bottou and Bengio 1995). K-Means++ (Arthur and Vassilvitskii 2007) guarantees that the initial centers have relatively good quality, while SGD further adjusts the centers by looking at the whole data set. The computational cost of SeqKM is $O(msd + nd)$, where m is the number of clusters (centers) and s is the sample size, which is much lower than that of ordinary K-Means. More importantly, it requires scanning the data at most twice (once for random sampling and once for SGD optimization). In fact, with proper data storage, sub-sampling can be implemented without scanning the whole dataset, in which case only one pass is needed.

Algorithm 1 Sequential K-Means (SeqKM)
1: Input: number of clusters k, data set $X \in \mathbb{R}^{n \times d}$, sample size s
2: Output: cluster labels of each data point and the k centers C
3: v = 0 // per-center count
4: M = randomly select s samples from X
5: C = run K-Means++ on M and get k centers
6: for i = 1 to n do
7:   $j = \arg\min_{j \in \{1, \ldots, k\}} \|x_i - C_j\|^2$
8:   v[j] = v[j] + 1
9:   $\eta = 1 / v[j]$
10:  $C_j = (1 - \eta) C_j + \eta x_i$
11: end for

SeqKM finds a good balance between quality and speed in the graph construction step, and is of independent interest. With SeqKM, the graph construction can be easily implemented as a sequential on-disk algorithm, which eliminates the potential bottleneck of high transfer time.
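A NumPy transcription of Alg. 1 could look as follows; scikit-learn's K-Means++ seeding is used as a stand-in for line 5, and in a true out-of-core setting the final loop would iterate over blocks read from disk rather than over an in-memory array.

```python
import numpy as np
from sklearn.cluster import KMeans

def seq_kmeans(X, k, s, rng=None):
    """Sequential K-Means (Alg. 1): K-Means++ on a random sample of size s,
    then a single SGD pass over the whole data set."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[0]
    # Lines 4-5: K-Means++ initialization on s randomly sampled points
    M = X[rng.choice(n, size=min(s, n), replace=False)]
    C = KMeans(n_clusters=k, init='k-means++', n_init=1).fit(M).cluster_centers_
    v = np.zeros(k)                                     # per-center counts
    labels = np.empty(n, dtype=int)
    # Lines 6-11: one stochastic-gradient pass over all points
    for i, x in enumerate(X):
        j = int(np.argmin(((C - x) ** 2).sum(axis=1)))  # nearest center
        labels[i] = j
        v[j] += 1
        eta = 1.0 / v[j]                                # decaying step size
        C[j] = (1.0 - eta) * C[j] + eta * x             # online center update
    return C, labels
```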
The Proposed Sequential Spectral Clustering

After constructing the normalized bipartite graph matrix $\hat{Z} \in \mathbb{R}^{n \times m}$, we need to compute the left and right singular vectors of $\hat{Z}$, as shown in Eq. (4), in the clustering step. The solution of SC is given by the smallest eigenvectors G of L, which is equivalent to computing the largest singular vectors of $\hat{Z}$, that is,

$$\operatorname{svd}(\hat{Z}) = A \Sigma B^T, \qquad (4)$$

where $\operatorname{svd}(\cdot)$ denotes the Singular Value Decomposition (SVD), $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_m) \in \mathbb{R}^{m \times m}$ is a diagonal matrix with the singular values of $\hat{Z}$ on its diagonal ($\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_m \geq 0$), $A \in \mathbb{R}^{n \times m}$ contains the left singular vectors, $B \in \mathbb{R}^{m \times m}$ contains the right singular vectors, and $G = [A^T, B^T]^T \in \mathbb{R}^{(n+m) \times m}$. Finally, we obtain the cluster label of each data point by running the K-Means algorithm on G.

The above approach is appealing since it reduces the computational cost of spectral clustering on n data points to close to a linear function of n. However, when the amount of data exceeds the capacity of memory, the performance of this approach suffers from severe degradation because data must be swapped back and forth between memory and disk. Fortunately, $\hat{Z}$ is a tall thin matrix, and we can exploit this fact to implement a sequential SVD algorithm. Note that the right singular vectors B are also the eigenvectors of the matrix $\hat{Z}^T \hat{Z} \in \mathbb{R}^{m \times m}$, which is a simple inner product of the matrix $\hat{Z}$ with itself and can be computed sequentially as $\hat{Z}^T \hat{Z} = \sum_{i=1}^{n} \hat{z}_i^T \hat{z}_i$. After obtaining B, A can be computed as $A = \hat{Z} B \Sigma^{-1}$. Hence, we can compute A sequentially via

$$a_i = \hat{z}_i B \Sigma^{-1}, \qquad (5)$$

where $\hat{z}_i$ is the i-th row of $\hat{Z}$ and $a_i$ is the i-th row of A.

Algorithm 2 Sequential Singular Value Decomposition (SSVD)
1: Input: data matrix $Z \in \mathbb{R}^{n \times m}$, number of singular values/vectors k
2: Output: singular vectors $A \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{m \times k}$, and singular values $\Sigma \in \mathbb{R}^{k \times k}$
3: S = 0
4: for i = 1 to n do
5:   $S = S + z_i^T z_i$ // sequentially compute $Z^T Z$
6: end for
7: $[B, \Sigma] = \operatorname{eig}(S, k)$ // eig() is eigendecomposition
8: $\Sigma = \Sigma^{1/2}$
9: $R = B \Sigma^{-1}$
10: for i = 1 to n do
11:   $A_i = z_i R$
12: end for

We summarize the Sequential Singular Value Decomposition (SSVD) algorithm in Alg. 2. The basic idea is that we compute the Gram matrix $Z^T Z$ as a sum over the data (i.e., $Z^T Z = \sum_i z_i^T z_i$), obtain B, and then sequentially compute each row of A by projection. In our context, we always have $m \ll n$. Therefore, $Z^T Z$ and B are both small matrices that are time- and space-efficient to compute and store.

With SeqKM and SSVD, we propose our Sequential Spectral Clustering (SeqSC) algorithm in Alg. 3. In SeqSC, the graph construction step and the normalization step are also performed sequentially. At each time step, we only need one small piece (e.g., one row) of data and a small amount of intermediate information, so the memory usage can be kept at a very low level even when the data set is large. In fact, the memory usage is independent of the size of the data set.

Algorithm 3 Sequential Spectral Clustering (SeqSC)
1: Input: data matrix $X \in \mathbb{R}^{n \times d}$, cluster number k, anchor point number m
2: Output: cluster labels of each data point, all anchor points U, and cluster labels of all anchor points
3: Generate m anchor points U using SeqKM (Alg. 1)
4: D = 0
5: for i = 1 to n do
6:   Compute $z_i$ from $X_i$ and U using Eq. (3)
7:   $D = D + z_i$
8: end for
9: D = diag(D) // convert D to a diagonal matrix
10: for i = 1 to n do
11:   $\hat{z}_i = z_i D^{-1/2}$ // compute $\hat{Z}$ sequentially
12: end for
13: Compute $[A, \Sigma, B] = \operatorname{SSVD}(\hat{Z}, k)$ // Alg. 2
14: Apply SeqKM (Alg. 1) on A to get the cluster labels
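The sketch below mirrors Alg. 2 and, in the trailing comments, shows how Alg. 3 chains the pieces together; only the m-by-m accumulator S and one row of $\hat{Z}$ need to be resident at a time. The helper names refer to the illustrative sketches above, not to a released implementation.

```python
import numpy as np

def seq_svd(Z_hat, k):
    """Sequential SVD (Alg. 2): accumulate the m x m Gram matrix row by row,
    eigendecompose it to get B and Sigma, then recover A = Z_hat B Sigma^{-1}."""
    n, m = Z_hat.shape
    S = np.zeros((m, m))
    for z in Z_hat:                         # rows can be streamed from disk
        S += np.outer(z, z)                 # S = sum_i z_i^T z_i = Z_hat^T Z_hat
    evals, evecs = np.linalg.eigh(S)        # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:k]       # keep the k largest
    B = evecs[:, idx]                       # right singular vectors
    Sigma = np.sqrt(np.maximum(evals[idx], 1e-12))   # singular values of Z_hat
    A = Z_hat @ (B / Sigma)                 # row i: a_i = z_hat_i B Sigma^{-1}   (Eq. 5)
    return A, Sigma, B

# SeqSC (Alg. 3), end to end, in terms of the earlier sketches:
#   U, _ = seq_kmeans(X, k=m_anchors, s=sample_size)          # anchors via SeqKM
#   _, Z_hat = build_anchor_graph(X, U)                        # Eq. (3) + normalization
#   A, _, _ = seq_svd(Z_hat, k=num_clusters)                   # spectral embedding
#   _, labels = seq_kmeans(A, k=num_clusters, s=sample_size)   # cluster the rows of A
```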
Computational Complexity and Memory Use. The proposed SeqSC consists of three steps: 1) generating the anchor points U by SeqKM clustering; 2) constructing the bipartite graph Z; 3) running spectral clustering. Generating the anchor points takes $O(ksd + nmd)$ time. The construction of the bipartite graph takes $O(nmd)$ time. In the third step, the singular value decomposition takes $O(nm^2)$ time. The total running time is therefore $O(nm^2)$. (In fact, this complexity could be further reduced by sub-selection techniques (Li et al. 2014; Li, Chen, and Huang 2014); we leave this improvement to future work.) For memory usage, the first call of SeqKM in Algorithm 3 takes $O((s + m)d) \approx O(sd)$ space, while the second call takes $O((s + k)m) \approx O(sm)$. The graph construction and graph normalization take $O(m^2 + md)$ space, while the SSVD step takes $O(mk)$ space. Therefore, the overall space complexity is $O((m + d)m)$. The time complexity of the proposed algorithm is thus much lower than the quadratic time of the original spectral clustering and is comparable to other SC algorithms based on bipartite graphs, e.g., (Chen and Cai 2011; Liu et al. 2013a). Furthermore, the space complexity of the existing SC algorithms is at least $O(n \max(m, d))$. For very large n, this causes performance problems because of disk swapping. In contrast, the space complexity of the proposed method is independent of n, which avoids the swapping problem. These advantages are validated in the experimental section.

Experiments

In this section, we evaluate the proposed Sequential Spectral Clustering (SeqSC) approach on three benchmark datasets. All our experiments are conducted on a desktop computer with a 3.0 GHz Intel Pentium 4 CPU and 1 GB RAM, running MATLAB 7.14 (32-bit).

Datasets and Experimental Environment

We conduct experiments on three data sets: MNIST, CovType, and MNIST8m. They all have ground truth for measuring clustering quality.

MNIST (http://yann.lecun.com/exdb/mnist/). This dataset consists of 70,000 images of handwritten digits from 0 to 9. We use the 784-dimensional raw pixel values to represent each image.

CovType. This dataset consists of 581,012 samples for predicting the forest cover type from cartographic variables. Each sample belongs to one of seven types (classes).

MNIST8m (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html). This data set consists of 8,100,000 images of handwritten digits from 0 to 9 (Loosli, Canu, and Bottou 2007). The features are the same as for the MNIST dataset: 784-dimensional raw pixel values. We randomly select a subset of 500,000 samples.

Experimental Protocols

We conduct three parts of experiments to evaluate the effectiveness and efficiency of the proposed algorithms. We compare SeqSC to several state-of-the-art SC algorithms. For all compared algorithms, we download the code from the authors' websites and follow the parameter settings in their papers. All the code is written in MATLAB. We use the code from (Chen et al. 2011) to sequentially generate the k-NN graph (k = 50) so that no out-of-memory exception is triggered during the graph construction phase. For the parameters of our algorithm, we fix m = 200 and construct a 5-NN anchor graph. For the first part, we compare the following algorithms:

SC is spectral clustering on the normalized Laplacian (Ng et al. 2002). We use the implementation available online (Chen et al. 2011) at http://alumni.cs.ucsb.edu/~wychen/sc.html.

KASP is proposed in (Yan, Huang, and Jordan 2009) and is an approximate spectral clustering algorithm based on K-Means. We implement this algorithm in MATLAB for fair comparison.

Nyström (Fowlkes et al. 2004) uses the Nyström transform to approximate the eigenfunction problem. We use the implementation available online (Chen et al. 2011) at http://alumni.cs.ucsb.edu/~wychen/sc.html.

LSC (Chen and Cai 2011) is a landmark-based spectral clustering algorithm. We download the MATLAB code from the authors' website (http://www.cad.zju.edu.cn/home/dengcai/Data/ReproduceExp.html).

ESCG (Liu et al. 2013a) uses a shortest-path algorithm to generate the bipartite graph.
We download the MATLAB code from the authors' website (http://jialu.cs.illinois.edu/).

SeqSC is the proposed algorithm described in Alg. 3.

The second part of the experiments aims to evaluate the performance of the proposed SeqKM algorithm as a component of SeqSC. That is, we replace SeqKM with other alternatives and evaluate the effect on clustering accuracy and running time. 1) SeqSC(RS) is the variation of the proposed SeqSC that uses random sampling to generate anchor points instead of SeqKM. 2) SeqSC(KK) is the variation of the proposed SeqSC that uses only the result of K-Means++ on the sub-sampled data; that is, we only use traditional K-Means on the sub-sampled data to compute the cluster centers and assign each data point to the closest center (we call this SubKM below). 3) SeqSC(SK) is the variation of the proposed SeqSC that uses SeqKM to generate anchor points and SubKM to compute the cluster labels. 4) SeqSC(KS) is the variation of the proposed SeqSC that uses SubKM to generate anchor points and SeqKM to compute the cluster labels. 5) SeqSC(SS) is the full version of the proposed SeqSC algorithm.

For data access, we split the data into blocks so that we can study the cost of transferring data. For all experiments, we fix each block to contain 5,000 data points. Note that the number 5,000 is not critical as long as each block is larger than the hard disk transfer block size and smaller than the memory capacity, which is a wide range. For all batch algorithms, i.e., all of the compared algorithms, we load all of the testing data into memory before the algorithm is run, while for SeqSC and all its variations, we sequentially load each block when we need to compute on it. We measure clustering quality by the accuracy (Chen and Cai 2011), which compares the clustering labels generated by the algorithms to the labels given by the data set. We also report the running time of each algorithm in wall-clock time.

Clustering Quality

The first part of our experiments compares our algorithm with other state-of-the-art algorithms. We conduct this part of the experiments on two datasets: MNIST and CovType. The results are shown in Figure 1 to Figure 3b. Since the running times of different algorithms may vary greatly, they are shown in log scale.

[Figure 1: Clustering accuracy comparison on the MNIST dataset (accuracy vs. #blocks).]
[Figure 2: Running time comparison on the MNIST dataset (running time vs. #blocks).]

Figure 1 and Figure 2 show the results of all the algorithms on the MNIST dataset in terms of accuracy and running time, respectively. In Figure 1, we can observe that LSC and SC are roughly the best among all the baseline algorithms. These results are consistent with previous works (Chen and Cai 2011; Liu et al. 2013a). The proposed SeqSC consistently outperforms all the competitors at various data sizes, from 1 block (n = 5,000) to 14 blocks (n = 70,000). The comparison of running times is shown in Fig. 2. We can observe that the proposed method is the fastest of all. SC and ESCG take almost the same amount of time; this is due to the fact that most of their running time is spent constructing the k-NN graph.
Since the resulting graph is very sparse, the time for the eigendecomposition in SC is relatively short (Liu et al. 2013a). The next fastest are KASP and LSC, whose running times grow quickly as the data size grows. The running time of the Nyström algorithm is very close to that of the proposed SeqSC; these two are among the fastest algorithms of all. However, as the data size grows, the running time of Nyström increases more rapidly. Note that although the running time of the Nyström algorithm is comparable to that of the proposed algorithm, its accuracy is not, as already shown in Figure 1. Although Nyström reduces the running time by using a random subset of the data points to approximate the large-scale singular value decomposition, the sampled points are not effective in preserving the neighborhood information of the data set. This information loss results in the performance degradation of the Nyström algorithm. All these results demonstrate the benefit of the proposed method.

[Figure 3: Clustering accuracy (Acc) and running time. (a) Comparison with Nyström on CovType (time vs. #blocks); (b) comparison with LSC and KASP on CovType (time vs. #blocks); (c)-(d) comparison of different K-Means approaches on MNIST8m (accuracy and time vs. #blocks); (e)-(f) comparison of different numbers of anchor points on CovType (accuracy and time vs. #blocks).]

Figures 3a and 3b show the results on the CovType dataset, which is much larger than the MNIST dataset. We omit the results of SC and ESCG because they are too slow to be run on a dataset of this scale. In these figures, we can observe that the proposed SeqSC still outperforms the baseline algorithms. LSC and KASP only produce results for part of the data due to out-of-memory errors. Although the Nyström algorithm can produce results for all the data, its accuracy is lower and its running time grows faster than those of the proposed algorithm (Figures 3a and 3b). These results again validate the performance and scalability of the proposed algorithm.

Influence of SeqKM

We further evaluate the performance of different variations of the proposed algorithm on the MNIST8m dataset. The results are shown in Figure 3c, from which we can make several observations. First, the quality of the generated anchor points is important: the algorithms based on K-Means-generated anchor points are much better than those based on randomly generated anchor points. Second, SeqKM also outperforms SubKM in the stage of generating cluster labels. This is because SeqKM finds a good balance between clustering quality and speed: it finds a good initialization from local data and uses one pass over the whole data set to adjust the final result, which avoids being trapped in a poor local solution. Figure 3d shows the running time comparison of the different variations of SeqSC.
This result shows that, under our sequential clustering framework, the running time of every version of SeqSC grows linearly with respect to the amount of data. This again validates the scalability of the proposed algorithms.

Parameter Selection

The third part of our experiments evaluates the performance of the proposed algorithm under different parameters. The most important parameter is the number of anchor points m, which controls the quality of the bipartite graph. We run SeqSC on the CovType dataset with m varying from 200 to 800. We can see from Figure 3e that the accuracy increases as m increases. Note that although the running time also increases with m, it remains linear with respect to n. The optimal balance between running time and accuracy depends on the computational resources at hand. Even so, promising results are achieved by the proposed algorithm with a moderate choice of parameters.

Conclusion

In this paper, we proposed a sequential spectral clustering algorithm, named SeqSC, for coping with large-scale clustering. The SeqSC algorithm builds on two key components: the proposed Sequential K-Means (SeqKM) and Sequential Singular Value Decomposition (SSVD) algorithms. The benefits of the proposed SeqSC are three-fold: (1) the proposed SeqKM algorithm extends the previous online K-Means algorithm with a better initialization, which can discover high-quality cluster centers in one pass; (2) the proposed SeqKM algorithm enables the graph construction to be executed in a sequential manner; (3) by employing SSVD, clustering can be performed sequentially and the entire time complexity is nearly linear in the number of input data points. These benefits make our SeqSC algorithm accurate and scalable enough to tackle massive datasets. The experimental results on three public datasets demonstrate that the proposed SeqSC algorithm outperforms the state-of-the-art in large-scale clustering with up to a thousand times acceleration.

References

Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027-1035. Society for Industrial and Applied Mathematics.

Bottou, L., and Bengio, Y. 1995. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems 7. Citeseer.

Chang, X.; Nie, F.; Ma, Z.; Yang, Y.; and Zhou, X. 2015. A convex formulation for spectral shrunk clustering. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.

Chen, X., and Cai, D. 2011. Large scale spectral clustering with landmark-based representation. In AAAI.

Chen, B.; Gao, B.; Liu, T.-Y.; Chen, Y.-F.; and Ma, W.-Y. 2006. Fast spectral clustering of data using sequential matrix compression. In Machine Learning: ECML 2006. Springer. 590-597.

Chen, W.-Y.; Song, Y.; Bai, H.; Lin, C.-J.; and Chang, E. Y. 2011. Parallel spectral clustering in distributed systems. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33(3):568-586.

Fowlkes, C.; Belongie, S.; Chung, F.; and Malik, J. 2004. Spectral grouping using the Nyström method. Pattern Analysis and Machine Intelligence, IEEE Transactions on 26(2):214-225.

Khoa, N. L. D., and Chawla, S. 2012. Large scale spectral clustering using resistance distance and Spielman-Teng solvers. In Discovery Science, 7-21. Springer.

Li, Y.; Chen, C.; Liu, W.; and Huang, J. 2014. Sub-selective quantization for large-scale image search. In Proc.
AAAI Conference on Artificial Intelligence (AAAI).

Li, Y.; Nie, F.; Huang, H.; and Huang, J. 2015. Large-scale multi-view spectral clustering via bipartite graph. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.

Li, Y.; Chen, C.; and Huang, J. 2014. Transformation-invariant collaborative sub-representation. In Pattern Recognition (ICPR), 2014 22nd International Conference on, 3738-3743. IEEE.

Liu, T.-Y.; Yang, H.-Y.; Zheng, X.; Qin, T.; and Ma, W.-Y. 2007. Fast large-scale spectral clustering by sequential shrinkage optimization. In Proceedings of the 29th European Conference on IR Research, 319-330. Springer-Verlag.

Liu, J.; Wang, C.; Danilevsky, M.; and Han, J. 2013a. Large-scale spectral clustering on graphs. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 1486-1492. AAAI Press.

Liu, J.; Wang, C.; Gao, J.; and Han, J. 2013b. Multi-view clustering via joint nonnegative matrix factorization. In Proc. of SDM, volume 13, 252-260. SIAM.

Liu, W.; He, J.; and Chang, S.-F. 2010. Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 679-686.

Liu, W.; Wang, J.; and Chang, S.-F. 2012. Robust and scalable graph-based semisupervised learning. Proceedings of the IEEE 100(9):2624-2638.

Loosli, G.; Canu, S.; and Bottou, L. 2007. Training invariant support vector machines using selective sampling. In Bottou, L.; Chapelle, O.; DeCoste, D.; and Weston, J., eds., Large Scale Kernel Machines. Cambridge, MA: MIT Press. 301-320.

Miao, G.; Song, Y.; Zhang, D.; and Bai, H. 2008. Parallel spectral clustering algorithm for large-scale community data mining. In The 17th WWW Workshop on Social Web Search and Mining (SWSM).

Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2002. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 2:849-856.

Sakai, T., and Imiya, A. 2009. Fast spectral clustering with random projection and sampling. In Machine Learning and Data Mining in Pattern Recognition. Springer. 372-384.

Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22(8):888-905.

Shinnou, H., and Sasaki, M. 2008. Spectral clustering for a large data set by reducing the similarity matrix size. In LREC.

Von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4):395-416.

Yan, D.; Huang, L.; and Jordan, M. I. 2009. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 907-916. ACM.

Yu, S. X., and Shi, J. 2003. Multiclass spectral clustering. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, 313-319. IEEE.

Zelnik-Manor, L., and Perona, P. 2004. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems, 1601-1608.

Zhang, K., and Kwok, J. T. 2010. Clustered Nyström method for large scale manifold learning and dimension reduction. Neural Networks, IEEE Transactions on 21(10):1576-1587.