# Spectral Clustering via the Power Method - Provably

Christos Boutsidis (BOUTSIDIS@YAHOO-INC.COM), Yahoo, 229 West 43rd Street, New York, NY, USA.
Alex Gittens (GITTENS@ICSI.BERKELEY.EDU), International Computer Science Institute, Berkeley, CA, USA.
Prabhanjan Kambadur (PKAMBADUR@BLOOMBERG.NET), Bloomberg L.P., 731 Lexington Avenue, New York, 10022, USA.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

Spectral clustering is one of the most important algorithms in data mining and machine intelligence; however, its computational complexity limits its application to truly large scale data analysis. The computational bottleneck in spectral clustering is computing a few of the top eigenvectors of the (normalized) Laplacian matrix corresponding to the graph representing the data to be clustered. One way to speed up the computation of these eigenvectors is to use the power method from the numerical linear algebra literature. Although the power method has been empirically used to speed up spectral clustering, the theory behind this approach, to the best of our knowledge, remains unexplored. This paper provides the first such rigorous theoretical justification, arguing that a small number of power iterations suffices to obtain near-optimal partitionings using the approximate eigenvectors. Specifically, we prove that solving the k-means clustering problem on the approximate eigenvectors obtained via the power method gives an additive-error approximation to solving the k-means problem on the optimal eigenvectors.

1. Introduction

Consider clustering the points in Figure 1. The data in this space are non-separable and there is no apparent clustering metric which can be used to recover this clustering structure. In particular, the two clusters have the same centers (centroids); hence, distance-based clustering methods such as k-means (Ostrovsky et al., 2006) will fail.

Figure 1. 2-D data amenable to spectral clustering.

Motivated by such shortcomings of traditional clustering approaches, researchers have produced a body of more flexible and data-adaptive clustering approaches, now known under the umbrella of spectral clustering. The crux of these approaches is to model the points to be clustered as vertices of a graph, where weights on edges connecting the vertices are assigned according to some similarity measure between the points. Next, a new, hopefully separable, representation of the points is formed by using the eigenvectors of the (normalized) Laplacian matrix associated with this similarity graph. This new, typically low-dimensional, representation is often called the spectral embedding of the points. We refer the reader to (Fiedler, 1973; Von Luxburg, 2007; Shi & Malik, 2000b) for the foundations of spectral clustering and to (Belkin & Niyogi, 2001; Ng et al., 2002; Liu & Zhang, 2004; Zelnik-Manor & Perona, 2004; Smyth & White, 2005) for applications in data mining and machine learning. We explain spectral clustering and the baseline algorithm in detail in Section 2.1.
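For concreteness, the following is a minimal MATLAB sketch of the similarity-graph construction and spectral embedding described above, using the single-bandwidth heat kernel defined later in Section 2.1 (the experiments in Section 5 use a self-tuning bandwidth instead, and the top eigenvectors of the normalized affinity used here are the bottom eigenvectors of the normalized Laplacian, as noted in Section 2.1). The function name and interface are our own illustration, not the authors' implementation.

```matlab
% Minimal sketch of the spectral embedding described above (illustrative, not the authors' code).
% X is n-by-d (one point per row), sigma is the heat-kernel bandwidth, k is the number of clusters.
function Yk = spectral_embedding(X, sigma, k)
    n  = size(X, 1);
    D2 = pdist2(X, X).^2;            % squared pairwise Euclidean distances
    W  = exp(-D2 / sigma);           % similarity matrix (heat kernel)
    W(1:n+1:end) = 0;                % W_ii = 0
    d  = sum(W, 2);                  % node degrees
    Dh = diag(1 ./ sqrt(d));         % D^{-1/2}
    Wt = Dh * W * Dh;                % normalized affinity W~ = D^{-1/2} W D^{-1/2}
    [Yk, ~, ~] = svds(Wt, k);        % top-k singular vectors = spectral embedding
end
```

The rows of the returned matrix are then clustered with k-means, as laid out in Section 2.1.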
The computational bottleneck in spectral clustering is the computation of the eigenvectors of the Laplacian matrix. Motivated by the need for faster algorithms to compute these eigenvectors, several techniques have been developed in order to speed up this computation (Spielman & Teng, 2009; Yan et al., 2009; Fowlkes et al., 2004; Pavan & Pelillo, 2005; Bezdek et al., 2006; Wang et al., 2009; Nyström, 1930; Baker, 1977). Perhaps the most popular of the above-mentioned techniques is the power method (Lin & Cohen, 2010). The convergence of the power method is theoretically well understood when it comes to measuring the principal angle between the space spanned by the true and the approximate eigenvectors (see Theorem 8.2.4 in (Golub & Van Loan, 2012)). We refer readers to (Woodruff, 2014) for a rigorous theoretical analysis of the use of the power method for the low-rank matrix approximation problem. However, these results do not imply that the approximate eigenvectors of the power method are useful for spectral clustering.

Contributions. In this paper, we argue that the eigenvectors computed via the power method are useful for spectral clustering, and that the loss in clustering accuracy is small. We prove that solving the k-means problem on the approximate eigenvectors obtained via the power method gives an additive-error approximation to solving the k-means problem on the optimal eigenvectors (see Lemma 5 and Theorem 6).

2. Background

2.1. Spectral Clustering

We first review one mathematical formulation of spectral clustering. Let $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ be $n$ points in $d$ dimensions. The goal of clustering is to partition these points into $k$ disjoint sets, for some given $k$. To this end, define a weighted undirected graph $G(V, E)$ with $|V| = n$ nodes and $|E|$ edges: each node in $G$ corresponds to an $x_i$; the weight of each edge encodes the similarity between its endpoints. Let $W \in \mathbb{R}^{n \times n}$ be the similarity matrix, with $W_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma}$ for $i \neq j$ and $W_{ii} = 0$, that gives the similarity between $x_i$ and $x_j$. Here, $\sigma$ is a tuning parameter. Given this setup, spectral clustering for $k = 2$ corresponds to the following graph partitioning problem:

Definition 1 (The Spectral Clustering Problem for $k = 2$ (Shi & Malik, 2000b)). Let $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and $k = 2$ be given. Construct graph $G(V, E)$ as described in the text above. Find subgraphs $A$ and $B$ of $G$ that minimize
$$\mathrm{Ncut}(A, B) = \mathrm{cut}(A, B)\left(\frac{1}{\mathrm{assoc}(A, V)} + \frac{1}{\mathrm{assoc}(B, V)}\right),$$
where $\mathrm{cut}(A, B) = \sum_{x_i \in A,\, x_j \in B} W_{ij}$, $\mathrm{assoc}(A, V) = \sum_{x_i \in A,\, x_j \in V} W_{ij}$, and $\mathrm{assoc}(B, V) = \sum_{x_i \in B,\, x_j \in V} W_{ij}$.

This definition generalizes to any $k > 2$ in a straightforward manner (we omit the details). Minimizing $\mathrm{Ncut}(A, B)$ in a weighted undirected graph is an NP-complete problem (see the appendix in (Shi & Malik, 2000b) for a proof). Motivated by this hardness result, Shi and Malik (Shi & Malik, 2000b) suggested a relaxation to this problem that is tractable in polynomial time using the Singular Value Decomposition (SVD). First, (Shi & Malik, 2000b) shows that for any $G$, $A$, $B$, and partition vector $y \in \mathbb{R}^n$ with $+1$ in the entries corresponding to $A$ and $-1$ in the entries corresponding to $B$, the following identity holds:
$$4\,\mathrm{Ncut}(A, B) = \frac{y^T(D - W)y}{y^T D y}.$$
Here, $D \in \mathbb{R}^{n \times n}$ is the diagonal matrix of node degrees, $D_{ii} = \sum_j W_{ij}$. Hence, the spectral clustering problem in Definition 1 can be restated as finding such an optimum partition vector $y$, which, as we mentioned above, is an intractable problem.
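To make Definition 1 concrete, here is a small illustrative MATLAB helper (not from the paper) that evaluates Ncut(A, B) for a given two-way split directly from the cut and association weights defined above; the function name and the logical-mask interface are assumptions of the sketch.

```matlab
% Illustrative helper (not from the paper): Ncut of a two-way split.
% W is the n-by-n similarity matrix; inA is a logical n-vector marking the vertices of A.
function val = ncut(W, inA)
    inB    = ~inA;
    cutAB  = sum(sum(W(inA, inB)));         % cut(A,B): total weight crossing the split
    assocA = sum(sum(W(inA, :)));           % assoc(A,V): total weight incident to A
    assocB = sum(sum(W(inB, :)));           % assoc(B,V): total weight incident to B
    val = cutAB * (1/assocA + 1/assocB);    % Ncut(A,B) as in Definition 1
end
```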
The real relaxation for spectral clustering asks for a real-valued vector $y \in \mathbb{R}^n$:

Definition 2 (The real relaxation for the spectral clustering problem for $k = 2$ (Shi & Malik, 2000b)). Given graph $G$ with $n$ nodes, adjacency matrix $W$, and degrees matrix $D$, find $y \in \mathbb{R}^n$ such that
$$y = \mathop{\mathrm{argmin}}_{y \in \mathbb{R}^n,\; y^T D 1_n = 0} \frac{y^T(D - W)y}{y^T D y}.$$

Once such a $y$ is found, one can partition the graph into two subgraphs by looking at the signs of the elements in $y$. When $k > 2$, one can compute $k$ eigenvectors and then apply k-means clustering on the rows of a matrix, denoted as $Y$, containing those eigenvectors in its columns. Motivated by these observations, Ng et al. (Ng et al., 2002) (see also (Weiss, 1999)) suggested the following algorithm for spectral clustering (inputs to the algorithm are the points $x_1, \ldots, x_n \in \mathbb{R}^d$ and the number of clusters $k$):

1. Construct the similarity matrix $W \in \mathbb{R}^{n \times n}$ as $W_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma}$ (for $i \neq j$); $W_{ii} = 0$ and $\sigma$ is given.
2. Construct $D \in \mathbb{R}^{n \times n}$ as the diagonal matrix of the degrees of the nodes: $D_{ii} = \sum_j W_{ij}$.
3. Construct $\tilde{W} = D^{-1/2} W D^{-1/2} \in \mathbb{R}^{n \times n}$.
4. Find the largest $k$ eigenvectors of $\tilde{W}$ and assign them as columns to a matrix $Y \in \mathbb{R}^{n \times k}$.
5. Apply k-means clustering on the rows of $Y$, and use this clustering to cluster the original points accordingly.

(Footnotes: (1) Precisely, Ng et al. suggested an additional normalization step on $Y$ before applying k-means, i.e., normalizing $Y$ to unit row norms, but we ignore this step for simplicity. (2) Here, $L = D - W$ is the Laplacian matrix of $G$ and $\tilde{L} = I_n - \tilde{W}$ is the so-called normalized Laplacian matrix. (3) The top $k$ eigenvectors of $D^{-1/2} W D^{-1/2}$ correspond to the bottom $k$ eigenvectors of $I_n - D^{-1/2} W D^{-1/2}$.)

This algorithm serves as our baseline for an "exact spectral clustering algorithm". One way to speed up this baseline algorithm is to use the power method (Lin & Cohen, 2010) in Step 4 to quickly approximate the eigenvectors in $Y$; that is,

Power method: Initialize $S \in \mathbb{R}^{n \times k}$ with i.i.d. random Gaussian variables. Let $\tilde{Y} \in \mathbb{R}^{n \times k}$ contain the left singular vectors of the matrix
$$B = (\tilde{W}\tilde{W}^T)^p \tilde{W} S = \tilde{W}^{2p+1} S,$$
for some integer $p \geq 0$. Now, use $\tilde{Y}$ instead of $Y$ in Step 5 above.

The use of the power method to speed up eigenvector computation is not new. The power method is a classical technique in the numerical linear algebra literature (see Section 8.2.4 in (Golub & Van Loan, 2012)). Existing theoretical analysis provides sharp bounds for the error $\|YY^T - \tilde{Y}\tilde{Y}^T\|_2$ (see Theorem 8.2.4 in (Golub & Van Loan, 2012); this theorem assumes that $S$ has orthonormal columns and that it is not perpendicular to $Y$). (Halko et al., 2011; Woodruff, 2014) also used the power method with random Gaussian initialization and applied it to the low-rank matrix approximation problem. The approximation bounds proved in those papers are for the term $\|\tilde{W} - \tilde{Y}\tilde{Y}^T\tilde{W}\|_2$. To the best of our knowledge, none of these results indicates that the approximate eigenvectors are useful for spectral clustering purposes.

2.2. Connection to k-means

The previous algorithm indicates that spectral clustering turns out to be a k-means clustering problem on the rows of $Y$, the matrix containing the bottom eigenvectors of the normalized Laplacian matrix. The main result of our paper is to prove that solving the k-means problem on $\tilde{Y}$ and using this to cluster the rows in $Y$ gives a clustering which is as good as the clustering obtained by solving the k-means problem on $Y$. To this end, we need some background on k-means clustering; we present a linear algebraic view below.

For $i = 1 : n$, let $y_i \in \mathbb{R}^k$ denote the $i$-th row of $Y$ viewed as a column vector. Hence,
$$Y = \begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{pmatrix} \in \mathbb{R}^{n \times k}.$$

(Footnote: (4) The power method step above can be implemented in $O(n^2 k p + k^2 n)$ time, as we need $O(n^2 k p)$ time to implement all the matrix-matrix multiplications (applied right-to-left) and another $O(k^2 n)$ time to find $\tilde{Y}$. As we discuss below, selecting $p \in O(\ln(kn))$ and assuming that the multiplicative spectral gap between the $k$-th and the $(k+1)$-th eigenvalue of $\tilde{W}$ is large suffices to get very accurate clusterings. This leads to an $O(kn^2 \ln(kn))$ runtime for this step.)
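The following hedged MATLAB sketch spells out the power-method step just described for a given normalized affinity matrix $\tilde{W}$: it forms $B = \tilde{W}^{2p+1}S$ for a Gaussian $S$ and takes $\tilde{Y}$ to be the left singular vectors of $B$ (here via an economy-size SVD of $B$ rather than the B'*B trick used in the experiments of Section 5). The optional diagnostic at the end computes the projection distance that Lemma 5 below bounds; variable and function names are ours, not the authors'.

```matlab
% Illustrative sketch of the power-method step (not the authors' code).
% Wt is the n-by-n normalized affinity W~, k the number of clusters, p the number of power iterations.
function [Ytil, projErr] = power_method_embedding(Wt, k, p)
    n = size(Wt, 1);
    S = randn(n, k);                 % i.i.d. Gaussian initialization
    B = Wt * S;
    for t = 1:p
        B = Wt * (Wt * B);           % after the loop, B = (W~ W~^T)^p W~ S = W~^(2p+1) S
    end
    [Ytil, ~, ~] = svd(B, 'econ');   % left singular vectors of B, i.e., Y~ (n-by-k)
    % Optional diagnostic (defeats the speedup; for illustration of Lemma 5 only):
    [Y, ~, ~] = svds(Wt, k);         % exact top-k singular vectors
    projErr = norm(Y*Y' - Ytil*Ytil', 'fro');
end
```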
Let $k$ be the number of clusters. One can define a partition of the rows of $Y$ by a cluster indicator matrix $X \in \mathbb{R}^{n \times k}$. Each column $j = 1, \ldots, k$ of $X$ represents a cluster. Each row $i = 1, \ldots, n$ indicates the cluster membership of $y_i$. So, $X_{ij} = 1/\sqrt{s_j}$ if and only if the data point $y_i$ is in the $j$-th cluster ($s_j = \|X^{(j)}\|_0$, where $X^{(j)}$ is the $j$-th column of $X$ and $\|X^{(j)}\|_0$ denotes the number of non-zero elements of $X^{(j)}$). We formally define the k-means problem as follows:

Definition 3 (The k-means clustering problem). Given $Y \in \mathbb{R}^{n \times k}$ (representing $n$ data points as rows, described with respect to $k$ features as columns) and a positive integer $k$ denoting the number of clusters, find the indicator matrix $X_{opt} \in \mathbb{R}^{n \times k}$ which satisfies
$$X_{opt} = \mathop{\mathrm{argmin}}_{X \in \mathcal{X}} \|Y - XX^T Y\|_F^2.$$
Here, $\mathcal{X}$ denotes the set of all $n \times k$ indicator matrices $X$. Also, we will denote $\|Y - X_{opt}X_{opt}^T Y\|_F^2 := F_{opt}$.

This definition is equivalent to the more traditional definition involving sums of squared distances of points from cluster centers (Boutsidis & Magdon-Ismail, 2013; Cohen et al., 2014) (we omit the details). Next, we formalize the notion of a "k-means approximation algorithm".

Definition 4 (k-means approximation algorithm). An algorithm is called a $\gamma$-approximation for the k-means clustering problem ($\gamma \geq 1$) if it takes as inputs the dataset $Y \in \mathbb{R}^{n \times k}$ and the number of clusters $k$, and returns an indicator matrix $X_\gamma \in \mathbb{R}^{n \times k}$ such that, with probability at least $1 - \delta_\gamma$,
$$\|Y - X_\gamma X_\gamma^T Y\|_F^2 \leq \gamma \min_{X \in \mathcal{X}} \|Y - XX^T Y\|_F^2 = \gamma F_{opt}.$$

An example of such an approximation algorithm is in (Kumar et al., 2004), with $\gamma = 1 + \varepsilon$ ($0 < \varepsilon < 1$) and $\delta_\gamma$ a constant in $(0, 1)$; the corresponding running time is $O(nk \cdot 2^{(k/\varepsilon)^{O(1)}})$. A trivial algorithm with $\gamma = 1$, but with running time exponential in $n$, is to try all possible k-clusterings of the rows of $Y$ and keep the best.

3. Main result

Next, we argue that applying a k-means approximation algorithm on $\tilde{Y}$ and on $Y$ gives approximately the same clustering results for a sufficiently large number of power iterations $p$. Hence, the eigenvectors found by the power method do not sacrifice the accuracy of the exact spectral clustering algorithm. This is formally shown in Theorem 6. However, the key notion of approximation between the exact and the approximate eigenvectors is captured in Lemma 5.

Lemma 5 (See Section 4.1 for the proof). For any $\varepsilon, \delta > 0$, let
$$p \geq \frac{1}{2} \cdot \frac{\ln(4 n \varepsilon^{-1} \delta^{-1})}{\ln(\gamma_k)},$$
where $\gamma_k = \sigma_k(\tilde{W})/\sigma_{k+1}(\tilde{W})$ is the multiplicative eigen-gap between the $k$-th and the $(k+1)$-th singular value of $\tilde{W}$. Then, with probability at least $1 - e^{-2n} - 2.35\delta$:
$$\|YY^T - \tilde{Y}\tilde{Y}^T\|_F^2 \leq \varepsilon^2.$$

In words, for a sufficiently large value of $p$, the orthogonal projection operators onto $\mathrm{span}(Y)$ and $\mathrm{span}(\tilde{Y})$ differ, in Frobenius norm, by an arbitrarily small $\varepsilon$. Here and throughout the paper we reserve $\sigma_1(\tilde{W})$ to denote the largest singular value of $\tilde{W}$: $\sigma_1(\tilde{W}) \geq \sigma_2(\tilde{W}) \geq \ldots \geq \sigma_n(\tilde{W}) \geq 0$; and similarly for $\tilde{L}$: $\lambda_1(\tilde{L}) \geq \lambda_2(\tilde{L}) \geq \ldots \geq \lambda_n(\tilde{L}) \geq 0$. Next, we present our main theorem (see Section 4.2 for the proof).

Theorem 6. Construct $\tilde{Y}$ via the power method with
$$p \geq \frac{1}{2} \cdot \frac{\ln(4 n \varepsilon^{-1} \delta^{-1})}{\ln(\gamma_k)}, \qquad \gamma_k = \frac{\sigma_k(\tilde{W})}{\sigma_{k+1}(\tilde{W})} = \frac{1 - \lambda_{n-k+1}(\tilde{L})}{1 - \lambda_{n-k}(\tilde{L})}.$$
Here, $\tilde{L} = I_n - \tilde{W}$ is the normalized Laplacian matrix. Consider running on the rows of $\tilde{Y}$ a $\gamma$-approximation k-means algorithm with failure probability $\delta_\gamma$. Let the outcome be a clustering indicator matrix $\tilde{X}_\gamma \in \mathbb{R}^{n \times k}$. Also, let $X_{opt}$ be the optimal clustering indicator matrix for $Y$. Then, with probability at least $1 - e^{-2n} - 2.35\delta - \delta_\gamma$,
$$\|Y - \tilde{X}_\gamma\tilde{X}_\gamma^T Y\|_F^2 \leq (1 + 4\varepsilon)\,\gamma\,\|Y - X_{opt}X_{opt}^T Y\|_F^2 + 4\varepsilon^2.$$
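As a reading aid for Theorem 6, the hedged sketch below builds the cluster indicator matrix $X$ of Definition 3 from a vector of k-means labels and compares the two objectives appearing in the theorem. MATLAB's kmeans with several replicates stands in both for the $\gamma$-approximation algorithm run on $\tilde{Y}$ and for the (intractable) optimum $X_{opt}$ on $Y$; this is an illustration of the quantities involved, not a procedure from the paper.

```matlab
% Illustrative comparison of the two k-means objectives in Theorem 6 (not from the paper).
% Y and Ytil are the n-by-k exact and approximate spectral embeddings.
function [costApprox, costExact] = compare_objectives(Y, Ytil, k)
    labTil = kmeans(Ytil, k, 'Replicates', 10);   % stand-in for the gamma-approximation on Y~
    labY   = kmeans(Y,    k, 'Replicates', 10);   % stand-in for X_opt on Y
    costApprox = kmeans_cost(Y, labTil, k);       % ||Y - X~_gamma X~_gamma^T Y||_F^2
    costExact  = kmeans_cost(Y, labY,   k);       % approximates ||Y - X_opt X_opt^T Y||_F^2
end

function c = kmeans_cost(Y, labels, k)
    n = size(Y, 1);
    X = zeros(n, k);                              % indicator matrix of Definition 3
    for j = 1:k
        idx = (labels == j);
        X(idx, j) = 1 / sqrt(sum(idx));           % X_ij = 1/sqrt(s_j) for points in cluster j
    end
    c = norm(Y - X * (X' * Y), 'fro')^2;
end
```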
3.1. Discussion

Several remarks are necessary regarding our main theorem. First of all, notice that the notion of approximation in clustering quality is with respect to the objective value of k-means; indeed, this is often the case in approximation algorithms for k-means clustering (Cohen et al., 2014). We acknowledge that it would have been better to obtain results on how well $\tilde{X}_\gamma$ approximates $X_{opt}$ directly, for example, via bounding the error $\|\tilde{X}_\gamma - X_{opt}\|_F^2$; however, such results are notoriously difficult to obtain since this is a combinatorial objective.

Next, let us take a more careful look at the effect of the parameter $\gamma_k$. It suffices to discuss this effect for two cases: 1) $\gamma_k = 1$; and 2) $\gamma_k > 1$. To this end, we need a relation between the eigenvalues of $\tilde{W}$ and of $\tilde{L} = I_n - D^{-1/2} W D^{-1/2}$ (see the proof of the theorem for an explanation): for all $i = 1, 2, \ldots, n$, the relation is $\sigma_i(\tilde{W}) = 1 - \lambda_{n-i+1}(\tilde{L})$.

3.1.1. $\gamma_k = 1$

First, we argue that the case $\gamma_k = 1$ is not interesting from a spectral clustering perspective and hence, we can safely assume that it will not occur in practical scenarios. $\gamma_k$ is the multiplicative eigen-gap between the $k$-th and the $(k+1)$-th eigenvalue of $\tilde{W}$:
$$\gamma_k = \frac{\sigma_k(\tilde{W})}{\sigma_{k+1}(\tilde{W})} = \frac{1 - \lambda_{n-k+1}(\tilde{L})}{1 - \lambda_{n-k}(\tilde{L})}.$$
The folklore belief in spectral clustering (see the end of Section 4.3 in (Lee et al., 2012)) says that a good $k$ to select in order to cluster the data via spectral clustering is one for which $\lambda_{n-k+1}(\tilde{L}) \ll \lambda_{n-k}(\tilde{L})$. In this case, the spectral gap is sufficiently large and $\gamma_k$ will not be close to one; hence, a small number of power iterations $p$ will be enough to obtain accurate results. To summarize, if $\gamma_k = 1$, it does not make sense to perform spectral clustering with $k$ clusters, and the user should look for a $k' > k$ such that the gap in the spectrum, $\gamma_{k'}$, is not approaching one and is at the very least strictly larger than one.

This gap assumption is not surprising from a linear algebraic perspective either. It is well known that in order for the power iteration to succeed in finding the eigenvectors of any symmetric matrix, the multiplicative eigen-gap in the spectrum should be sufficiently large, as otherwise the power method will not be able to distinguish the $k$-th eigenvector from the $(k+1)$-th eigenvector. For a more detailed discussion of this we refer the reader to Theorem 8.2.4 in (Golub & Van Loan, 2012).

3.1.2. $\gamma_k > 1$

We remark that the graph we construct to pursue spectral clustering is a complete graph, hence it is connected. A basic fact in spectral graph theory (see, for example, (Lee et al., 2012)) says that the smallest eigenvalue of the Laplacian matrix $L$ of the graph equals zero ($\lambda_n(L) = 0$), and that the multiplicity of the eigenvalue zero equals the number of connected components of the graph; in particular, zero is a simple eigenvalue if and only if the graph is connected. When the graph has $k$ disconnected components, we have $\lambda_n(L) = \lambda_{n-1}(L) = \ldots = \lambda_{n-k+1}(L) = 0$, and, correspondingly, $\lambda_n(\tilde{L}) = \lambda_{n-1}(\tilde{L}) = \ldots = \lambda_{n-k+1}(\tilde{L}) = 0$.

What happens, however, when the $k$ smallest eigenvalues of the normalized Laplacian are not zero but close to zero? Cheeger's inequality and its extensions in (Lee et al., 2012) indicate that as those eigenvalues approach zero, the graph approaches a graph with $k$ disconnected components; hence clustering such graphs should be "easy", that is, the number of power iterations $p$ should be small. We formally argue about this statement below.
First, we state a version of Cheeger's inequality that explains the situation for the $k = 2$ case (we omit the details of the higher-order extensions in (Lee et al., 2012)). The facts below can be found in Section 1.2 of (Bandeira et al., 2013). Recall also the graph partitioning problem in Definition 1. Let the minimum value of $\mathrm{Ncut}(A, B)$ be obtained from a partition into two sets $A_{opt}$ and $B_{opt}$; then
$$\frac{1}{2}\lambda_{n-1}(\tilde{L}) \leq \mathrm{Ncut}(A_{opt}, B_{opt}) \leq 2\sqrt{\lambda_{n-1}(\tilde{L})}.$$
Hence, if $\lambda_{n-1}(\tilde{L}) \approx 0$, then $\mathrm{Ncut}(A_{opt}, B_{opt}) \approx 0$, which makes the clustering problem "easy", hence amenable to a small number of power iterations.

Now, we formally derive the relation of $p$ as a function of $\lambda_{n-k+1}(\tilde{L})$. Towards this end, fix all values $(n, \varepsilon, \delta, \lambda_{n-k}(\tilde{L}))$ to constants, e.g., $n = 10^3$, $\varepsilon = 10^{-3}$, $\delta = 10^{-2}$, and $\lambda_{n-k}(\tilde{L}) = 1/2$. Also, let $x = \lambda_{n-k+1}(\tilde{L})$. Then,
$$p := f(x) = \frac{1}{2} \cdot \frac{\ln(4 \cdot 10^9)}{\ln(2 - 2x)}.$$
Simple calculus arguments show that as $x$ increases over the range $0 \leq x < 1/2$ (this range for $x$ is required to make sure that $\gamma_k > 1$), $f(x)$ also increases, which confirms the expectation that for an eigenvalue $x := \lambda_{n-k+1}(\tilde{L})$ approaching 0, the number of power iterations needed decreases as well. We plot $f(x)$ in Figure 2. The number of power iterations is always small since the dependence of $p$ on $\lambda_{n-k+1}(\tilde{L})$ is logarithmic.

Figure 2. Number of power iterations $p$ vs. the eigenvalue $x := \lambda_{n-k+1}(\tilde{L})$ of the normalized Laplacian.

3.1.3. Dependence on $\varepsilon$

Finally, we remark that the approximation bound in the theorem indicates that the loss in clustering accuracy can be made arbitrarily small, since the dependence of the number of power iterations on $\varepsilon^{-1}$ is logarithmic. In particular, when $\varepsilon \leq \|Y - X_{opt}X_{opt}^T Y\|_F^2$, we have a relative error bound:
$$\|Y - \tilde{X}_\gamma\tilde{X}_\gamma^T Y\|_F^2 \leq \big((1 + 4\varepsilon)\gamma + 4\varepsilon\big)\,\|Y - X_{opt}X_{opt}^T Y\|_F^2 \leq (1 + 8\varepsilon)\,\gamma\,\|Y - X_{opt}X_{opt}^T Y\|_F^2,$$
where the last relation uses $1 \leq \gamma$. One might wonder whether, to achieve this relative error performance, one needs $\varepsilon \to 0$, in which case it would appear that $p$ must be very large. However, this is not true, since the event $\varepsilon \to 0$ is necessary only when $\|Y - X_{opt}X_{opt}^T Y\|_F^2 \to 0$, which happens when $Y \to X_{opt}$. In this case, from the discussion in the previous section, we have $\lambda_{n-k+1}(\tilde{L}) \to 0$, because $Y$ is close to an indicator matrix if and only if the graph is close to having $k$ disconnected components, which itself happens if and only if $\lambda_{n-k+1}(\tilde{L}) \to 0$. It is now easy to calculate that, when this happens, $p$ remains small because, from standard calculus arguments, we can derive
$$p = \frac{1}{2} \cdot \frac{\ln(4 n \varepsilon^{-1}\delta^{-1})}{\ln\!\left(\dfrac{1 - x}{1 - \lambda_{n-k}(\tilde{L})}\right)}.$$

4.1. Proof of Lemma 5

4.1.1. Preliminaries

We first introduce the notation that we use throughout the paper. $A, B, \ldots$ are matrices; $a, b, \ldots$ are column vectors. $I_n$ is the $n \times n$ identity matrix; $0_{m \times n}$ is the $m \times n$ matrix of zeros; $1_n$ is the $n \times 1$ vector of ones. The Frobenius and the spectral matrix norms are $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$ and $\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$, respectively. The thin (compact) SVD of $A \in \mathbb{R}^{m \times n}$ of rank $\rho$ is
$$A = \underbrace{\begin{pmatrix} U_k & U_{\rho-k} \end{pmatrix}}_{U_A \in \mathbb{R}^{m \times \rho}} \underbrace{\begin{pmatrix} \Sigma_k & 0 \\ 0 & \Sigma_{\rho-k} \end{pmatrix}}_{\Sigma_A \in \mathbb{R}^{\rho \times \rho}} \underbrace{\begin{pmatrix} V_k^T \\ V_{\rho-k}^T \end{pmatrix}}_{V_A^T \in \mathbb{R}^{\rho \times n}},$$
with singular values $\sigma_1(A) \geq \ldots \geq \sigma_k(A) \geq \sigma_{k+1}(A) \geq \ldots \geq \sigma_\rho(A) > 0$. The matrices $U_k \in \mathbb{R}^{m \times k}$ and $U_{\rho-k} \in \mathbb{R}^{m \times (\rho-k)}$ contain the left singular vectors of $A$; similarly, the matrices $V_k \in \mathbb{R}^{n \times k}$ and $V_{\rho-k} \in \mathbb{R}^{n \times (\rho-k)}$ contain the right singular vectors. $\Sigma_k \in \mathbb{R}^{k \times k}$ and $\Sigma_{\rho-k} \in \mathbb{R}^{(\rho-k) \times (\rho-k)}$ contain the singular values of $A$. Also, $A^\dagger = V_A \Sigma_A^{-1} U_A^T$ denotes the pseudo-inverse of $A$. For a symmetric positive semidefinite (SPSD) matrix $A = BB^T$, $\lambda_i(A) = \sigma_i^2(B)$ denotes the $i$-th eigenvalue of $A$.
4.1.2. Intermediate Lemmas

To prove Lemma 5, we need the following simple fact.

Lemma 7. For any $Y \in \mathbb{R}^{n \times k}$ and $\tilde{Y} \in \mathbb{R}^{n \times k}$ with $Y^T Y = \tilde{Y}^T \tilde{Y} = I_k$:
$$\|YY^T - \tilde{Y}\tilde{Y}^T\|_F^2 \leq 2k\,\|YY^T - \tilde{Y}\tilde{Y}^T\|_2^2.$$

Proof. This is because $\|X\|_F^2 \leq \mathrm{rank}(X)\,\|X\|_2^2$ for any $X$, and the fact that $\mathrm{rank}(YY^T - \tilde{Y}\tilde{Y}^T) \leq 2k$. To justify the latter, notice that $YY^T - \tilde{Y}\tilde{Y}^T$ can be written as the product of two matrices each with rank at most $2k$:
$$YY^T - \tilde{Y}\tilde{Y}^T = \begin{pmatrix} Y & \tilde{Y} \end{pmatrix} \begin{pmatrix} Y^T \\ -\tilde{Y}^T \end{pmatrix}.$$

We also need the following result, which appeared as Lemma 7 in (Boutsidis & Magdon-Ismail, 2014).

Lemma 8. For any matrix $A \in \mathbb{R}^{m \times n}$ with rank at least $k$, let $p \geq 0$ be an integer and draw $S \in \mathbb{R}^{n \times k}$, a matrix of i.i.d. standard Gaussian random variables. Fix $\delta \in (0, 1)$ and $\varepsilon \in (0, 1)$. Let $\Omega_1 = (AA^T)^p A S$ and $\Omega_2 = A_k$. If
$$p \geq \frac{\ln(4 n \varepsilon^{-1}\delta^{-1})}{\ln\big(\sigma_k(A)/\sigma_{k+1}(A)\big)},$$
then with probability at least $1 - e^{-2n} - 2.35\delta$,
$$\|\Omega_1\Omega_1^\dagger - \Omega_2\Omega_2^\dagger\|_2^2 \leq \varepsilon^2.$$

4.1.3. Concluding the Proof of Lemma 5

Applying Lemma 8 with $A = \tilde{W}$ and $p$ chosen as above (with the ratio $\sigma_k(\tilde{W})/\sigma_{k+1}(\tilde{W}) = \gamma_k$ in the denominator) gives $\|YY^T - \tilde{Y}\tilde{Y}^T\|_2^2 \leq \varepsilon^2$. Rescaling $\varepsilon \leftarrow \varepsilon/\sqrt{2k}$, i.e., replacing $\varepsilon$ by $\varepsilon/\sqrt{2k}$ inside the logarithm in the choice of $p$, gives $\|YY^T - \tilde{Y}\tilde{Y}^T\|_2^2 \leq \varepsilon^2/(2k)$. Combining this bound with Lemma 7, we obtain
$$\|YY^T - \tilde{Y}\tilde{Y}^T\|_F^2 \leq 2k\,\|YY^T - \tilde{Y}\tilde{Y}^T\|_2^2 \leq 2k \cdot \frac{\varepsilon^2}{2k} = \varepsilon^2,$$
as advertised.

4.2. Proof of Theorem 6

Let $YY^T = \tilde{Y}\tilde{Y}^T + E$, where $E$ is an $n \times n$ matrix with $\|E\|_F \leq \varepsilon$ (this follows after taking square roots on both sides of the inequality of Lemma 5). Next, we manipulate the term $\|Y - \tilde{X}_\gamma\tilde{X}_\gamma^T Y\|_F$ as follows:
$$
\begin{aligned}
\|Y - \tilde{X}_\gamma\tilde{X}_\gamma^T Y\|_F
&= \|YY^T - \tilde{X}_\gamma\tilde{X}_\gamma^T YY^T\|_F \\
&= \|\tilde{Y}\tilde{Y}^T + E - \tilde{X}_\gamma\tilde{X}_\gamma^T(\tilde{Y}\tilde{Y}^T + E)\|_F \\
&\leq \|\tilde{Y}\tilde{Y}^T - \tilde{X}_\gamma\tilde{X}_\gamma^T \tilde{Y}\tilde{Y}^T\|_F + \|(I_n - \tilde{X}_\gamma\tilde{X}_\gamma^T)E\|_F \\
&\leq \|\tilde{Y}\tilde{Y}^T - \tilde{X}_\gamma\tilde{X}_\gamma^T \tilde{Y}\tilde{Y}^T\|_F + \|E\|_F \\
&= \|\tilde{Y} - \tilde{X}_\gamma\tilde{X}_\gamma^T \tilde{Y}\|_F + \|E\|_F \\
&\leq \sqrt{\gamma}\,\min_{X \in \mathcal{X}} \|\tilde{Y} - XX^T\tilde{Y}\|_F + \|E\|_F \\
&\leq \sqrt{\gamma}\,\|\tilde{Y} - X_{opt}X_{opt}^T\tilde{Y}\|_F + \|E\|_F \\
&= \sqrt{\gamma}\,\|\tilde{Y}\tilde{Y}^T - X_{opt}X_{opt}^T\tilde{Y}\tilde{Y}^T\|_F + \|E\|_F \\
&= \sqrt{\gamma}\,\|YY^T - E - X_{opt}X_{opt}^T(YY^T - E)\|_F + \|E\|_F \\
&\leq \sqrt{\gamma}\,\|YY^T - X_{opt}X_{opt}^T YY^T\|_F + 2\sqrt{\gamma}\,\|E\|_F \\
&= \sqrt{\gamma}\,\|Y - X_{opt}X_{opt}^T Y\|_F + 2\sqrt{\gamma}\,\|E\|_F \\
&= \sqrt{\gamma}\,\sqrt{F_{opt}} + 2\sqrt{\gamma}\,\|E\|_F \\
&\leq \sqrt{\gamma}\,\sqrt{F_{opt}} + 2\sqrt{\gamma}\,\varepsilon.
\end{aligned}
$$
In the above, we used the triangle inequality for the Frobenius norm; the fact that $(I_n - \tilde{X}_\gamma\tilde{X}_\gamma^T)$ and $(I_n - X_{opt}X_{opt}^T)$ are projection matrices (a matrix $P$ is called a projection matrix if it is square and $P^2 = P$), combined with the fact that for any projection matrix $P$ and any matrix $Z$: $\|PZ\|_F \leq \|Z\|_F$; the fact that for any matrix $Q$ with orthonormal columns and any matrix $X$: $\|XQ^T\|_F = \|X\|_F$; the fact that $1 \leq \gamma$; and the definition of a $\gamma$-approximation k-means algorithm. Overall, we have proved
$$\|Y - \tilde{X}_\gamma\tilde{X}_\gamma^T Y\|_F \leq \sqrt{\gamma}\,\sqrt{F_{opt}} + 2\sqrt{\gamma}\,\varepsilon.$$
Taking squares on both sides of the previous inequality gives
$$\|Y - \tilde{X}_\gamma\tilde{X}_\gamma^T Y\|_F^2 \leq \gamma F_{opt} + 4\varepsilon\gamma\sqrt{F_{opt}} + 4\gamma\varepsilon^2.$$
Finally, using $\sqrt{F_{opt}} \leq F_{opt}$ shows the claim in the theorem. This bound fails with probability at most $e^{-2n} + 2.35\delta + \delta_\gamma$, which follows by taking a union bound over the failure probabilities of Lemma 5 and of the $\gamma$-approximation k-means algorithm.

Connection to the normalized Laplacian. Towards this end, we use a relation between the eigenvalues of $\tilde{W}$ and the eigenvalues of $\tilde{L} = I_n - D^{-1/2}WD^{-1/2}$. Recall that the Laplacian matrix of the graph is $L = D - W$ and the normalized Laplacian matrix is $\tilde{L} = I_n - \tilde{W}$. From the last relation, it is easy to see that every eigenvalue of $\tilde{L}$ equals one minus some eigenvalue of $\tilde{W}$; the ordering of the eigenvalues, however, is reversed. Specifically, for $i = 1, 2, \ldots, n$, the relation is $\lambda_i(\tilde{W}) = 1 - \lambda_{n-i+1}(\tilde{L})$, where the orderings are $\lambda_1(\tilde{W}) \geq \lambda_2(\tilde{W}) \geq \ldots \geq \lambda_n(\tilde{W})$ and $\lambda_1(\tilde{L}) \geq \lambda_2(\tilde{L}) \geq \ldots \geq \lambda_n(\tilde{L})$. From this relation,
$$\gamma_k = \frac{\sigma_k(\tilde{W})}{\sigma_{k+1}(\tilde{W})} = \frac{1 - \lambda_{n-k+1}(\tilde{L})}{1 - \lambda_{n-k}(\tilde{L})}.$$

5. Experiments

To conduct our experiments, we developed high-quality MATLAB versions of the spectral clustering algorithms. In the remainder of this section, we refer to the clustering algorithm in (Shi & Malik, 2000a) as the "exact algorithm".
We refer to the modified version that uses the power method as the "approximate algorithm". To measure clustering quality, we used normalized mutual information (Manning et al., 2008):
$$\mathrm{NMI}(\Omega; C) = \frac{I(\Omega; C)}{\tfrac{1}{2}\,(H(\Omega) + H(C))}.$$
Here, $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set of discovered clusters and $C = \{c_1, c_2, \ldots, c_k\}$ is the set of true class labels. Also,
$$I(\Omega; C) = \sum_k \sum_j P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k)P(c_j)}$$
is the mutual information between $\Omega$ and $C$, and
$$H(\Omega) = -\sum_k P(\omega_k) \log P(\omega_k) \quad \text{and} \quad H(C) = -\sum_k P(c_k) \log P(c_k)$$
are the entropies of $\Omega$ and $C$, where $P$ is the probability estimated using maximum likelihood. NMI is a value in $[0, 1]$, where values closer to 1 represent better clusterings.

The exact algorithm uses MATLAB's svds function to compute the top-k singular vectors. The approximate algorithm exploits the tall-thin structure of $B$ and computes $\tilde{Y}$ using MATLAB's svd function, via `[U,S,V] = svd(B'*B); tildeY = B*V*S^(-.5);`. The approximate algorithm uses MATLAB's normrnd function to generate the random Gaussian matrix $S$. We used MATLAB's kmeans function with the options 'EmptyAction', 'singleton', 'MaxIter', 100, 'Replicates', 10. All our experiments were run using MATLAB 8.1.0.604 (R2013a) on a 1.4 GHz Intel Core i5 dual-core processor running OS X 10.9.5 with 8GB 1600 MHz DDR3 RAM. Finally, all reported running times are for computing $Y$ for the exact algorithm and $\tilde{Y}$ for the approximate algorithm (given $W$).

5.1. Spectral Clustering Accuracy

We ran our experiments on four multi-class datasets from the libSVMTools webpage (Table 1).

Table 1. The libSVM multi-class classification datasets (Chang & Lin, 2011) used for our spectral clustering experiments.

| Name | n | d | #nnz | #classes |
| --- | --- | --- | --- | --- |
| SatImage | 4435 | 36 | 158048 | 6 |
| Segment | 2310 | 19 | 41469 | 7 |
| Vehicle | 846 | 18 | 14927 | 4 |
| Vowel | 528 | 10 | 5280 | 11 |

To compute $W$, we use the heat kernel $W_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma_{ij}}$, where $x_i \in \mathbb{R}^d$ and $x_j \in \mathbb{R}^d$ are the data points and $\sigma_{ij}$ is a tuning parameter; $\sigma_{ij}$ is determined using the self-tuning method described in (Zelnik-Manor & Perona, 2004). That is, for each data point $i$, $\sigma_i$ is computed as the Euclidean distance to the $\ell$-th furthest neighbor of $i$; then $\sigma_{ij}$ is set to $\sigma_i\sigma_j$ for every $(i, j)$. In our experiments, we report the results for $\ell = 7$.

To determine the quality of the clustering obtained by the approximate algorithm, and to determine the effect of $p$ (the number of power iterations) on the clustering quality and running time, we varied $p$ from 0 to 10. The results are summarized in Table 2.

Table 2. Spectral clustering results for the exact and approximate algorithms on the datasets from Table 1. For the approximate algorithm, we report two sets of numbers: (1) the NMI achieved at p = 2 along with the time, and (2) the best NMI achieved while taking less time than the exact algorithm, together with the corresponding p. We see that the approximate algorithm performs at least as well as the exact algorithm for all but the Segment dataset.

| Name | Exact: NMI | Exact: time (secs) | Approx. (p=2): NMI | Approx. (p=2): time (secs) | Best under exact time: NMI | Best under exact time: time (secs) | Best under exact time: p |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SatImage | 0.5905 | 1.270 | 0.5713 | 0.310 | 0.6007 | 0.690 | 6 |
| Segment | 0.7007 | 1.185 | 0.2240 | 0.126 | 0.5305 | 0.530 | 10 |
| Vehicle | 0.1655 | 0.024 | 0.2191 | 0.009 | 0.2449 | 0.022 | 6 |
| Vowel | 0.4304 | 0.016 | 0.3829 | 0.003 | 0.4307 | 0.005 | 3 |

In columns 2 and 3 of Table 2, we see the results for the exact algorithm, which serve as our baseline for the quality and performance of the approximate algorithm. In columns 4 and 5, we see the NMI with 2 power iterations (p = 2) along with the running time. Immediately, we see that even at p = 2, the Vehicle dataset achieves better accuracy than the exact algorithm while resulting in a 2.5x speedup.
The only outlier is the Segment dataset, which achieves poor NMI at p = 2. In columns 6-8, we report the best NMI that was achieved by the approximate algorithm while staying under the time taken by the exact algorithm; we also report the value of p and the time taken (in seconds) for this result. We see that even when constrained to run in less time than the exact algorithm, the approximate algorithm bests the NMI achieved by the exact algorithm. For example, at p = 6, the approximate algorithm reports NMI = 0.2449 for Vehicle, as opposed to NMI = 0.1655 achieved by the exact algorithm. We can see that in many cases, 2 subspace iterations suffice to get good quality clustering (the Segment dataset is the exception).

Figure 3. Increase in running time (to compute $\tilde{Y}$) normalized by the running time when p = 0. The baseline times (at p = 0) are SatImage (0.0648 secs), Segment (0.0261 secs), Vehicle (0.0027 secs), and Vowel (0.0012 secs).

Figure 3 depicts the relation between p and the running time (to compute $\tilde{Y}$) of the approximate algorithm. All the times are normalized by the time taken when p = 0 to enable reporting numbers for all datasets on the same scale. As expected, as p increases, we see a linear increase.

Acknowledgment

We would like to thank Yiannis Koutis and James Lee for useful discussions.

References

Baker, C. T. H. The numerical treatment of integral equations, volume 13. Clarendon Press, 1977.

Bandeira, Afonso S., Singer, Amit, and Spielman, Daniel A. A Cheeger inequality for the graph connection Laplacian. SIAM Journal on Matrix Analysis and Applications, 34(4):1611-1630, 2013.

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2001.

Bezdek, J. C., Hathaway, R. J., Huband, J. M., Leckie, C., and Kotagiri, R. Approximate clustering in very large relational data. International Journal of Intelligent Systems, 21(8):817-841, 2006.

Boutsidis, Christos and Magdon-Ismail, Malik. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, 59(9):6099-6110, 2013.

Boutsidis, Christos and Magdon-Ismail, Malik. Faster SVD-truncated regularized least-squares. In ISIT, 2014.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cohen, Michael B., Elder, Sam, Musco, Cameron, Musco, Christopher, and Persu, Madalina. Dimensionality reduction for k-means clustering and low rank approximation. arXiv preprint arXiv:1410.6801, 2014.

Fiedler, Miroslav. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298-305, 1973.

Fowlkes, C., Belongie, S., Chung, F., and Malik, J. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214-225, 2004.

Golub, Gene H. and Van Loan, Charles F. Matrix Computations, volume 3. JHU Press, 2012.

Halko, Nathan, Martinsson, Per-Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Kumar, Amit, Sabharwal, Yogish, and Sen, Sandeep. A simple linear time (1+ε)-approximation algorithm for geometric k-means clustering in any dimensions. In FOCS, 2004.

Lee, James R., Oveis Gharan, Shayan, and Trevisan, Luca. Multi-way spectral partitioning and higher-order Cheeger inequalities. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pp. 1117-1130. ACM, 2012.
Lin, Frank and Cohen, William W. Power iteration clustering. In ICML, 2010.

Liu, Rong and Zhang, Hao. Segmentation of 3D meshes through spectral clustering. In Computer Graphics and Applications, 2004 (PG 2004), Proceedings of the 12th Pacific Conference on, pp. 298-305. IEEE, 2004.

Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.

Ng, Andrew Y., Jordan, Michael I., Weiss, Yair, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.

Nyström, E. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185-204, 1930.

Ostrovsky, Rafail, Rabani, Yuval, Schulman, Leonard J., and Swamy, Chaitanya. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS, 2006.

Pavan, M. and Pelillo, M. Efficient out-of-sample extension of dominant-set clusters. In NIPS, 2005.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000a.

Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000b.

Smyth, S. and White, S. A spectral clustering approach to finding communities in graphs. In SDM, 2005.

Spielman, Daniel and Teng, Shang-Hua. Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. arXiv preprint, http://arxiv.org/pdf/cs/0607105v4.pdf, 2009.

Von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

Wang, L., Leckie, C., Ramamohanarao, K., and Bezdek, J. Approximate spectral clustering. In Advances in Knowledge Discovery and Data Mining, pp. 134-146, 2009.

Weiss, Yair. Segmentation using eigenvectors: a unifying view. In Computer Vision, 1999, Proceedings of the Seventh IEEE International Conference on, volume 2, pp. 975-982. IEEE, 1999.

Woodruff, David P. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

Yan, D., Huang, L., and Jordan, M. I. Fast approximate spectral clustering. In KDD, 2009.

Zelnik-Manor, Lihi and Perona, Pietro. Self-tuning spectral clustering. In NIPS, 2004.