# coreset_spectral_clustering__bbc0bdd1.pdf

Published as a conference paper at ICLR 2025

CORESET SPECTRAL CLUSTERING

Ben Jourdan1, Gregory Schwartzman2, Peter Macgregor3, and He Sun1

1University of Edinburgh, UK 2Japan Advanced Institute of Science and Technology (JAIST), Japan 3University of St Andrews, UK ben.jourdan@ed.ac.uk, greg@jaist.ac.jp, prm4@st-andrews.ac.uk, h.sun@ed.ac.uk

Coresets have become an invaluable tool for solving k-means and kernel k-means clustering problems on large datasets with small numbers of clusters. On the other hand, spectral clustering works well on sparse graphs and has recently been extended to scale efficiently to large numbers of clusters. We exploit the connection between kernel k-means and the normalised cut problem to combine the benefits of both. Our main result is a coreset spectral clustering algorithm for graphs that clusters a coreset graph to infer a good labelling of the original graph. We prove that an α-approximation for the normalised cut problem on the coreset graph is an O(α)- approximation on the original. We also improve the running time of the state-of-the-art coreset algorithm for kernel k-means on sparse kernels, from O(nk) to O(n min{k, davg}), where davg is the average number of non-zero entries in each row of the n n kernel matrix. Our experiments confirm our coreset algorithm is asymptotically faster on large real-world graphs with many clusters, and show that our clustering algorithm overcomes the main challenge faced by coreset kernel k-means on sparse kernels which is getting stuck in local optima.

1 INTRODUCTION

Kernel k-means and spectral clustering are two popular algorithms which are capable of learning non-linear decision boundaries between clusters. For this reason, they have been applied to many practical problems in machine learning, including in the fields of medical research and network science (Gönen & Margolin, 2014; Kuo et al., 2014; White & Smyth, 2005).

Given a data set X, both kernel k-means and spectral clustering make use of a kernel similarity function K : X X R 0, which is often represented as a matrix of size n n, where n = |X|. The spectral clustering algorithm considers the kernel matrix to be the adjacency matrix of a similarity graph and clusters the nodes of the graph in order to minimise the normalised cut objective function, which we define in Section 3 (Von Luxburg, 2007). Kernel k-means exploits the fact that the kernel function implicitly defines an embedding of the data points into a Hilbert space, represented by ϕ : X H, such that for all x, y X,

ϕ(x), ϕ(y) = K(x, y). (1)

The kernel k-means problem is to minimise the k-means objective in this Hilbert space and is usually solved using a generalisation of Lloyds algorithm (Dhillon et al., 2004), using the kernel K to compute inner products. Remarkably, the normalised cut and kernel k-means objectives are equivalent up to a constant factor (Dhillon et al., 2004).

Despite this equivalence, spectral clustering and kernel k-means have been largely studied separately, and new techniques have been developed with one or the other in mind. For spectral clustering, one of the most promising techniques is to construct a sparse similarity graph in order to achieve a speedup. This can be achieved through constructing an approximate k-nearest neighbour graph using fast approximate nearest neighbour algorithms (Alshammari et al., 2021; Harwood & Drummond, 2016; Malkov & Yashunin, 2018), or by constructing a clustering-preserving sparsifier of the complete kernel similarity graph (Macgregor & Sun, 2023; Sun & Zanetti, 2019). By operating on a sparse similarity graph, the time and memory cost of spectral clustering is reduced from Ω n2 to O(n log(n)).

Published as a conference paper at ICLR 2025

To speed up kernel k-means, Jiang et al. (2024) applied a recent coreset result for Euclidean spaces (Braverman et al., 2021) to kernel spaces. Given a dataset X and kernel function K, they showed that an ε-coreset (Definition 4) of size O(k2ε 4) can be constructed in time O(nk)1. The cost of running an iteration of Lloyd s algorithm on the coreset is then independent of n, for small k.

In this paper, we show that the benefits of the techniques developed for both spectral clustering and kernel k-means can be combined in order to create a faster and more accurate clustering algorithm. In particular, we show that, by exploiting the sparsity of the kernels usually considered for spectral clustering, it is possible to design a faster algorithm for constructing a coreset. Then, we apply spectral clustering directly to the coreset graph with only a small number of nodes and use the result to cluster the original graph. Empirically, we find that on sparse graphs our coreset spectral clustering algorithm has a significantly faster running time than the classical spectral clustering algorithm, and achieves a higher accuracy than coreset kernel k-means.

1.1 OUR RESULTS

Faster k-means++ and coreset construction. In Section 4, we exploit the fact that inputs to kernel k-means are often sparse to devise a faster algorithm for k-means++ initialisation in kernel space. Specifically, when the input kernel matrix is sparse, we improve the running time from O(nk) to O(n min{k, davg}) where davg is the average number of non-zero entries in each row of the kernel matrix. Speeding up k-means++ has received a lot of attention for Euclidean spaces (Bachem et al., 2016; Cohen-Addad et al., 2020). This is the first result to provide speed up in sparse kernel spaces. Combining this with Jiang et al. (2024), we present the first coreset construction for kernel spaces that makes use of kernel sparsity, and reduce the coreset construction time from O(nk) to O(n min{k, davg}) while maintaining the same theoretical guarantees. For large graphs with many clusters, it is critical to break the linear dependence on the number of clusters for practical use.

Coreset spectral clustering. In Section 5, we introduce the coreset spectral clustering algorithm that explicitly combines spectral clustering with coresets by exploiting the equivalence between the normalised cut and weighted kernel k-means problems. Previous work has used this equivalence to convert spectral clustering problems to kernel k-means problems and then solve them using coreset kernel k-means (Jiang et al., 2024). We propose a new method which solves the coreset problem directly with spectral clustering and then transfers the solution back to the original graph. This sidesteps some of the less desirable properties that running kernel k-means would entail, such as its susceptibility to local minima when using indefinite kernels (Dhillon et al., 2007). In Theorem 1, we prove that an α-approximation of the normalised cut problem on the coreset graph gives an O(α)-approximation of the normalised cut problem on the original graph.

Experiments. In Section 6, we preform three experiments to test our coreset construction and coreset spectral clustering algorithms. The first confirms the asymptotic improvement in running time of our coreset construction algorithm for kernel k-means over the previous method of Jiang et al. (2024) on large real-world graphs with up to 65 million nodes and thousands of clusters. The second experiment compares our coreset spectral clustering algorithm against coreset kernel k-means (Jiang et al., 2024) and the sklearn (Pedregosa et al., 2011) implementation of spectral clustering on several real-world graph datasets. This shows that for a small number of clusters, the coreset methods are much faster than spectral clustering and our coreset spectral clustering algorithm finds significantly better solutions than coreset kernel k-means. The third experiment compares our coreset spectral clustering algorithm against coreset kernel k-means on a synthetic graph dataset where we vary the number of clusters to be linear in the number of nodes. Using the spectral clustering method proposed by Macgregor (2023) to cluster the coreset graphs, this experiment shows that our method can scale to hundreds of clusters while coreset kernel k-means is rendered ineffective by local optima after only tens of clusters.

2 RELATED WORK

The techniques for speeding up kernel k-means and spectral clustering can be broadly categorised into the methods that sparsify the relations between input data and the methods based on coresets.

1We use O( ) to suppress polylogarithmic factors.

Published as a conference paper at ICLR 2025

For kernel k-means (Dhillon et al., 2004), low rank approximations of the kernel matrix have been proposed (Musco & Musco, 2017; Wang et al., 2019) as well as low dimensional approximations of ϕ (Chitta et al., 2012). On the other hand, coresets for Euclidean spaces (Braverman et al., 2021; Har-Peled & Mazumdar, 2004; Huang & Vishnoi, 2020) have recently been extended to kernel k-means (Jiang et al., 2024). Coreset methods sample a weighted subset of the input so that the objective of every feasible solution is preserved.

For spectral clustering (Von Luxburg, 2007), spectral sparsifiers (Spielman & Teng, 2011) and more recently cluster preserving sparsifiers (Sun & Zanetti, 2019) can sparsify dense graphs while retaining cluster structure. While effective in theory, spectral sparsifiers require complicated Laplacian solvers to run efficiently, making them difficult to implement in practice. Cluster preserving sparsifiers are practical to implement but their performance is sensitive to the choice of hyperparameters. Peng et al. (2017) proposed a nearly-linear time algorithm for clustering an arbitrary number of clusters which is similar in spirit to a coreset approach. They sample Θ(k log(k)) nodes leveraging the property that in the spectral embedding nodes in the same cluster are nearby and have approximately the same norm. From this, they efficiently extract a set of k points from which the rest of the data can be labelled. However, this approach is impractical as they make use of Laplacian solvers to approximate heat kernel distances.

As well as sparsification and coresets, speedup can also be achieved via improvements to the optimisation algorithms themselves. The triangle inequality can be used to reduce the number of inner products calculated by the kernel k-means algorithm (Dhillon et al., 2004) and recently a mini-batch algorithm has been proposed (Jourdan & Schwartzman, 2024). Macgregor (2023) developed a simple spectral clustering algorithm that forgoes the expensive computation of k eigenvectors in place of O(log(k)) independent calls to the power method. We refer to this method as fast spectral clustering and will use it in our experiments to cluster coreset graphs.

3 PRELIMINARIES

Let X be a set of n objects, and K : X X R 0 be a function measuring the pairwise similarity of data points in X. Let ϕ : X H be the function implicitly defined by K that maps data points in X to the unique Hilbert space such that ϕ(x), ϕ(y) = K(x, y) for all x, y X. The function K is usually represented as a matrix: if Φ = [ϕ(x1), . . . , ϕ(xn)], then K = ΦT Φ Rn n with Kij = ϕ(xi), ϕ(xj) . Let (ϕ(x), ϕ(y)) ϕ(x) ϕ(y) 2 denote the squared distance in feature space for all x, y X, and (ϕ(x), C) = minc C (ϕ(x), ϕ(c)) denote the smallest squared distance from x X to a set C X. We refer to the diagonal elements of K as self similarities and say x is a neighbour of (or incident to) y, written x y, iff ϕ(x), ϕ(y) = 0. If x X and S X, we say x is incident to S iff x is incident to at least one element of S.

Given a graph G = (V, E), the conductance of a set S of vertices is defined as ΦG(S) |E(S, V \ S)| /vol(S) where |E(S, V \ S)| is the number of edges crossing the cut between S and V \ S and vol(S) is the total weight of edges incident to S. We define a k-partition of X to be a collection of sets Π = {πj}k j=1 such that each element of X appears in exactly one member of Π.

We make extensive use of the following concepts in our analysis. Definition 1 (centroids). Given a k-partition Π = {πj}k j=1 of a set X, a map ϕ : X H for some Hilbert space H, and a weight function X R+, define the set of centroids of Π as cϕ w(Π) {cϕ w(πj)}k j=1 where cϕ w(πj) = P

x πj w(x)ϕ(x) / P

x πj w(x) .

Definition 2 (kernel k-means objective). Given a weighted dataset X with weights w : X R+ and feature map ϕ : X H satisfying equation 1 for some kernel function K, the weighted kernel k-means objective with respect to an arbitrary set of points C H is

COSTϕ w(X, C) = X

x X w(x) (ϕ(x), C). (2)

The kernel k-means objective with respect to an arbitrary k-partiton Π = {πj}k j=1 of X is

COSTϕ w(X, cϕ w(Π)) =

x πj w(x) (ϕ(x), cϕ w(πj)),

Published as a conference paper at ICLR 2025

Definition 3 (Normalised cut objective). Given a graph G = (V, E), the normalised cut problem is to minimise the average conductance over all k-partitions of the vertices: min Π={π1,...,πk} NC(G, Π),

where NC(G, Π) = 1

k Pk j=1 ΦG(πj).

Definition 4 (ε-coresets). For 0 < ε < 1, an ε-corset for kernel k-means on a weighted dataset X with weights w : X R+ is a reweighted subset S X such that for the Hilbert space H and map ϕ : X H satisfying equation 1, we have

COSTϕ w (S, C) (1 ε) COSTϕ w(X, C), C H with |C| = k,

where w : S R+ gives the weight for each element in the coreset.

4 FAST CORESET CONSTRUCTION

Given a sparse kernel matrix K with O(m) nonzero entries and weight function w, we give an algorithm to construct an ε-coreset in O(m) time. This improves on the algorithm given by Jiang et al. (2024) which has running time O(nk). The running time of Jiang et al. (2024) is dominated by the running time of D2-sampling, Algorithm 1, which is just the k-means++ initialisation algorithm (Arthur & Vassilvitskii, 2007) in kernel space. For completeness, we provide the full description of the ε-coreset algorithm of Jiang et al. (2024) in Algorithm 4 in Appendix C.1.

In Algorithm 2, we use the fact the kernel matrix is sparse to design an efficient data structure, built around a sampling tree (Wong & Easton, 1980), to perform D2-sampling in nearly-linear time in n and independent of k. To avoid unnecessary updates to the sampling tree, our algorithm has to sample one additional data point prior to performing D2-sampling. Specifically, before uniformly sampling the first point, we will select the point in C with the smallest self affinity. Let x arg minx X ϕ(x), ϕ(x) and c ϕ(x ), ϕ(x ) . We show that adding this extra point does not affect the approximation guarantee, and consequently we get a faster ε-coreset algorithm for sparse kernels using the same analysis as Jiang et al. (2024).

Algorithm 1 D2-Sampling(X) (Jiang et al., 2024)

1: Input: X 2: Draw x X uniformly at random, and initialise C {φ(x)} 3: for i = 1, . . . , k 1 do 4: Draw one sample x X, using probabilities w X(x) (φ(x),C)

COSTϕ w X (X,C) 5: Let C C {φ(x)} 6: end for 7: return C

4.1 OUR DATA STRUCTURE

The key observation is that as points are added to C, the distances from the data points to C can t change for too many points due to the sparsity of the kernel and our choice of x . We formalise this intuition in what follows. Let C X be a set of points such that x is in C, and consider the squared distance from an arbitrary point x to C in feature space. We see that

(x, C) = min c C

ϕ(x), ϕ(x) + ϕ(c), ϕ(c) 2 ϕ(x), ϕ(c)

ϕ(x), ϕ(x) + ϕ(c), ϕ(c) x c ϕ(x), ϕ(x) + ϕ(c), ϕ(c) 2 ϕ(x), ϕ(c) x c

( (x, x ) x c for all c C min c C (x, c) otherwise, (3)

where the last transition makes use of the fact that x C and

(x, x ) = ϕ(x), ϕ(x) + ϕ(x ), ϕ(x ) ϕ(x), ϕ(x) + ϕ(c), ϕ(c) = (x, c)

Published as a conference paper at ICLR 2025

for all x X and c C such that x c. From this we can see how adding a point to C changes the distance between some x X and C:

(x, C {y}) = (x, C) x y min( (x, C), (x, y)) x y.

Assuming we know the values of (x, C) for all x X, each time we add a point y to C, for each neighbour x of y, it suffices to check whether (x, y) is less than (x, C). Let A X be the set returned by Algorithm 2. Then the total number of checks is P

x A deg(x) where deg(x) = |{y : x y}| is the number of nonzero entries in the row of K corresponding to x. In the worst case, we have to check the entries in the rows of the k + 1 data points with the highest degree. Letting davg = 2|E|

n be the average degree, it holds that

davg = 2 |E|

x X deg(x) 1

x A deg(x),

and therefore P

x A deg(x) n degavg and the maximum number of checks we need to perform is O(min(davg, k) n).

Sampling data points according to contribution. We define the contribution of a data point x with respect to C as f(x, C) w X(x) (ϕ(x), C), and the contribution of a set of data points S X with respect to C as f(S, C) P

x S f(x, C). Normalised by COSTϕ w X(X, C), the contribution of each data point follows the D2-sampling distribution and f(S,C)

f(X,C) is the probability of sampling a point from a set S X.

Suppose we have nested sets S1 S2 S3 Sm X. For any x S1, we have that

Pr [x is sampled] = f(x, C)

f(X, C) = f(Sm, C)

f(X, C) f(Sm 1, C)

f(Sm, C) . . . f(x, C)

f(S1, C). (4)

Due to this decomposition, we can use a sampling tree T to efficiently sample proportional to contributions and update contributions as points are added to C.

The root node of T corresponds to the entire set X and stores the total contribution f(X, C). Sibling nodes represent a partition of the set their parent corresponds to and store their respective contributions. Leaf nodes store the data points they represent, their contributions, and weights.

To sample a data point proportional to contribution, we start at the root node of T and recursively sample a child node with probability equal to the child s contribution divided by the parent s contribution until we reach a leaf, corresponding to the sampled data point. From equation 4, this is equivalent to sampling points porportional to contribution, and therefore the distribution required by D2-sampling.

Updating contributions efficiently. We can update the contributions stored in T efficiently each time a point y is added to C. Suppose we find a neighbouring data point x of y such that (x, y) is less than the stored value of (x, C). Then we set the stored value of (x, C) to be (x, y) and subtract the difference in contribution from every internal node along the path from the leaf node to the root. If T is balanced, then the cost to repair T after each check is O(log(n)). The full algorithm is given in Algorithm 2. Since x can be found in time O(n), T is constructed in time O(n), and the cost of sampling k points and repairing T is O(min(davg, k) n log(n)), the overall running time of Algorithm 2 is O(min(davg, k) n).

Correctness. We start with the following useful definition: Definition 5 ((α, β)-approximation for weighted kernel k-means). Let OPTk(X, ϕ) denote the optimal objective of equation 2. We call a subset S H an (α, β)-approximate solution for kernel k-means if |S| = αk and COSTϕ w(X, S) βOPTk(X, ϕ).

D2-sampling returns an (O(1), O(log(k)))-approximate solution for k-means with high probability (Jiang et al., 2024); adding the extra point x has a negligible effect. The proof follows that of Lemma 3.1 in Arthur & Vassilvitskii (2007) with the only difference being the equality of Lemma 3.1 becomes an inequality. We summarise the result in the following lemma:

Published as a conference paper at ICLR 2025

Lemma 4.1. Given a kernel matrix K corresponding to dataset X, Algorithm 2 returns an (O(1), O(log k))-approximation for kernel k-means with high probability and running time O(min(davg, k) n).

By the analysis of Jiang et al. (2024), this immediately implies the following kernel coreset result:

Lemma 4.2. Given a kernel matrix K corresponding to dataset X, Algorithm 4, using Algorithm 2 for D2-sampling, returns an ε-coreset with high probability with size O(k2ε 4) and running time O(min(davg, k) n).

Algorithm 2 FASTD2-SAMPLING

1: Input: Dataset X with |X| = n, positive definite matrix K Rn n 0 , W Rn +, k N 2: T CONSTRUCTT(X, K, W) Algorithm 6 3: x arg minx X ϕ(x), ϕ(x) 4: draw x X uniformly at random 5: C {ϕ(x), ϕ(x )} 6: REPAIR(x ,T) Algorithm 7 7: for i = 1 to k do 8: x SAMPLEPOINT(T) Algorithm 8 9: C C {ϕ(x)} 10: REPAIR(x,T) 11: end for 12: return C

5 CORESET SPECTRAL CLUSTERING

In this section we present our Coreset Spectral Clustering (CSC) algorithm. An intuitive illustration is given in Figure 1. Given an input graph, we first extract the corresponding weighted kernel k-means problem via the equivalence to the normalised cut problem, and then construct an ε-coreset. Following this, again via the equivalence, we solve the corresponding normalised cut problem on the coreset graph to get the coreset partition. Finally, we label the rest of the data by considering kernel distances to the implied centers induced by the coreset graph partition. Our main contribution is summarised in Theorem 1, which states that coreset graphs preserve normalised cuts from the original graph.

Input graph Coreset graph Coreset partition Π

Equivalent kernel k-means problem

Reweighted coreset V Implied centers

Construct coreset

Spectral clustering

Kernel space

Graph space

Figure 1: Sketch of the Coreset Spectral Clustering Algorithm.

Published as a conference paper at ICLR 2025

Equivalence between normalised cut and kernel k-means. First recall that the normalised cut problem and the kernel k-means problem can both be written as the following trace optimisation problems up to a constant (Dhillon et al., 2004):

Normalised Cut

min Tr(D 1A) Tr(ZT D 1

s.t. X {0, 1}n k, X1k = 1n,

Z = D 1 2 X(X T DX) 1

Weighted Kernel k-means

min Tr(WK) Tr(Y T W 1 2 KW 1 2 Y )

s.t. X {0, 1}n k, X1k = 1n,

Y = W 1 2 X(X T WX) 1

In the above optimisation problems (5), A and D are the adjacency and degree matrices corresponding to some graph G, K is a kernel matrix such that Kij = ϕ(xi), ϕ(xj) , and W is a diagonal matrix with Wii = w(xi). Crucially, these problems are equivalent. Indeed substituting K = D 1AD 1 and W = D shows that any normalised cut problem can be written as a kernel k-means problem; substituting D = W and A = WKW shows the reverse. The only wrinkle with translating one to the other is that K must be positive semi-definite. If A is not positive semi-definite, then we can add a multiple of D 1 to D 1AD 1 to make it so while preserving the optimal partitioning (Dhillon et al., 2007).

Coreset graphs. The first and third transitions in Figure 1 are due to equivalence (5). The second transition is simply via constructing a coreset, and the fourth is by executing spectral clustering. The crux of our proof is the fifth transition. Specifically, it is not clear how to translate the partition of the coreset (in graph space) to a solution for the entire input graph. The main difficulty with using coresets for the normalised cut problem is that there is no notion of cluster centers on graphs. To overcome this, we go back to kernel space (the fifth transition) and label the entire input there. For the rest of the section we prove that this approach preserves the approximation of spectral clustering on the coreset for the entire graph.

To help with this, we will use the following lemma, whose proof is deferred to Appendix B.

Lemma 5.1 (Adapted from (Kanungo et al., 2002)). Let S be a finite set of points in a Hilbert space H, and w : H R+ be the weight of each point. Let c(S) = P

x S w(x)x / P

x S w(x) be the centroid of S. Then, it holds for any z H that X

x S w(x) x z 2 = X

h w(x) x c(S) 2i + |S| c(S) z 2 X

Intuitively, the above lemma allows us to draw a connection between the centroids of the coreset partition (in kernel space) and the centroids of the full partition, where every node is assigned to its closest coreset center. We will also make use of the following definitions which allow us to use k-partitions instead of set membership matrices in (5).

1. Let G = (V, E) be a graph on n vertices with adjacency matrix A and degree matrix D, and let Π be a k-partition of V . Then define NCA,D(Π) Tr(D 1A) Tr(ZT D 1/2AD 1/2Z) to be the objective of the trace minimisation version of the normalised cut problem in (5), where Z = D1/2X(X T DX) 1/2 and X {0, 1}n k is the unique set membership matrix on V corresponding to Π.

2. For any k-partition of X, Π, further define COSTK,W (Π) = Tr(WK) Tr(Y T W 1/2KW 1/2Y ) to be the objective of the trace minimisation version of the weighted kernel k-means problem in (5), where Y = W 1/2X(X T WX) 1/2 and X {0, 1}n k is the unique set membership matrix on X corresponding to Π.

Translating the objective of a k-partition with respect to the normalised cut problem to the objective of a set of centres with respect to the kernel k-means objective is straightforward. Dhillon et al. (2004) show that for any k-partition Π = {πj}k j=1, we have COSTKG,WG(Π) = COSTϕ w(V, cϕ w(Π)), where KG = D 1AD 1, WG = DG, w G = diag(WG) and ϕ : V H is a map such that

Published as a conference paper at ICLR 2025

ϕ(x), ϕ(y) = KG(x, y) for all x, y V . For completeness we include a simpler derivation in Lemma B.1. This implies that NCA,D(Π) = COSTKG,WG(Π) = COSTϕ w(V, cϕ w G(Π)).

The other direction is more difficult; given an arbitrary set of k centers S = {sj}k j=1 H, we would like to construct a k-partition Π = {π j}k j=1 of V such that

NCA,D(Π ) = COSTKG,WG(Π ) = COSTϕ w(V, cϕ w G(Π )) = COSTϕ w(V, S).

In general it may not be possible to choose Π such that COSTϕ w(V, cϕ w G(Π )) = COSTϕ w(V, S). However, the inequality COSTϕ w(V, cϕ w G(Π )) COSTϕ w(V, S) does hold by choosing Π such that x π j iff (ϕ(x), S) = (ϕ(x), sj), breaking ties arbitrarily. This is a consequence of Lemma 5.1 which tells us that moving the center of each π to their centroids can only reduce the weighted kernel k-means objective. This will be sufficient to prove our main result.

Given a graph G with adjacency matrix A and degree matrix D, let OPTNCA,D(k) denote the optimal value of the normalised cut problem (5) on G. We can now state our main result: Theorem 1. Given a graph G = (V, E) and an α-approximation algorithm for the normalised cut problem with k clusters (5), SPECTRALCLUSTERING, Algorithm 3 returns a k-partition Π of V such that

NCAG,DG(Π ) α 1 + ε

1 ε OPTNCAG,DG(k).

The running time of Algorithm 3 is the sum of the running time of the ε-coreset algorithm, SPECTRALCLUSTERING, and labelling V 2.

Algorithm 3 CORESET SPECTRAL CLUSTERING

1: Input: Graph G = (V, E), k with adjacency matrix AG and degree matrix DG 2: KG, WG D 1 G AGD 1 G , DG 3: V , WH An ε-coreset for kernel k-means on (V, KG, WG) 4: AH WHK(V )WH K(V ) is the principle submatrix of K with respect to V

5: Π SPECTRALCLUSTERING(AH, k) k-partition {πj}k j=1 6: Π partition assigning each x V to the closest coreset centroid in cϕ w H(Π). 7: return Π

To prove Theorem 1, we show that the objective of any partition of V is preserved on V (Lemma B.2) and the objective of any optimal partition of V is preserved on V (Lemma B.3). We defer their statements and proofs to Appendix B.

Proof of Theorem 1. Let Σ = {σj}k j=1 be an optimal k-partition of V such that NCAG,DG(Σ) = OPTNCAG,DG(k) and let Σ = {σ j}k j=1 be the k-partition of V such that x σ j iff (ϕ(x), cϕ w G(Σ)) = (ϕ(x), cϕ w G(σj)), breaking ties arbitrarily . Then we have

NCAG,DG(Π ) 1 1 εNCAH,DH(Π) α 1 εNCAH,DH(Σ ) α1 + ε

1 εNCAG,DG(Σ),

where the first inequality follows from Lemma B.2, the second holds because NCAH,DH(Π) αOPTNCAH,DH(k) αNCAH,DH(Σ ), and the third follows from Lemma B.3.

6 EXPERIMENTS

We perform three experiments to compare our coreset algorithm against the naive method, and secondly, to compare our Coreset Spectral Clustering algorithm to coreset kernel k-means and the sklearn implementation of spectral clustering. Experiments were performed on a dual Intel Xeon E52690 system with 384GB of RAM. Our implementation was written in Rust using Faer (El Kazdadi)3.

2This can be done efficiently, exploiting sparsity, by first computing the norms of each cϕ w H (πj). 3A user friendly python wrapper is available here: github.com/Ben Jourdan/coreset-sc.

Published as a conference paper at ICLR 2025

6.1 CORESET RUNNING TIME COMPARISON

We compare the running time of our improved coreset construction with that of Jiang et al. (2024). In particular, we compare the running time of Algorithm 4 using Algorithm 2 (our method) for D2-sampling against Algorithm 4 using Algorithm 1 (Jiang et al., 2024) for D2-sampling. We test these algorithms on the following three large graph datasets from the Suite Sparse Matrix Collection (Davis & Hu, 2011).

Friendster: A snapshot of the social media network with 65M nodes, 1.8B edges, and 5000 labeled overlapping communities.

Live Journal: A snapshot of the blogging network with 4M nodes, 34M edges, and 5000 labeled overlapping communities.

wiki-topcats: A snapshot of the Wikipedia hyperlink graph with 2M nodes, 28M edges, and 17K overlapping page categories.

We preprocess each graph to be undirected and construct K = D 1AD 1 and W = D from the corresponding adjacency and degree matrices. We measure the time it takes each method to construct a 100K point coreset for the weighted kernel k-means problem on K and W while varying the number of clusters passed to the D2-sampling routines. Following Jiang et al. (2024), we only do one iteration of importance sampling in Algorithm 4. Figure 2 confirms our method is significantly faster than the previous method. This matches our expectations based on the theoretical results since each graph is very sparse.

Figure 2: Running time comparison of coreset construction using either Algorithm 1 (Jiang et al., 2024) or Algorithm 2 for D2-sampling. Shaded regions denote 1 standard deviation over 10 runs.

6.2 CLUSTERING REAL-WORLD DATASETS

We evaluate the clustering performance of our Coreset Spectral Clustering (CSC) algorithm using our faster coreset construction, as the size of the coreset varies. We test the effect of using both the sklearn spectral clustering algorithm and the faster method of Macgregor (2023) to cluster the coreset graphs. We compare against the sklearn implementation of Spectral Clustering (SC) and coreset kernel k-means using the naive coreset construction (Jiang et al., 2024). We measure each algorithm s running time and Adjusted Rand Index (ARI) (Rand, 1971) on nearest neighbour graphs of the following datasets:

MNIST: A labelled collection of 70,000 28x28 pixel images of handwritten digits (Lecun et al., 1998).

Pen Digits: 10,992 labeled instances of 2D pen movements covering the digits 0-9, each represented with 16 numerical features. (Anguita et al., 2013).

HAR: A collection of 10,299 labeled smartphone sensor readings capturing six human movements, each represented as 561 numerical features (Anguita et al., 2013).

Published as a conference paper at ICLR 2025

Letter: A labelled collection of 20,000 instances of handwritten alphabetic english characters, represented as 16 numerical features (Frey & Slate, 1991).

Figure 3 shows the results for the HAR dataset, and demonstrates that the coreset methods run much faster than SC and both variants of CSC perform as well as SC using a coreset less than 5% the size of the original dataset, in a fraction of a second. While increasing coreset size does help coreset kernel k-means, it struggles to escape local optima, reducing performance. The results for the other datasets are in Appendix A.

Figure 3: Running time, ARI, and Normalised cut of each algorithm on a 200-nearest neighbour graph of the HAR dataset. Shaded regions denote 1 standard deviation over 20 runs.

6.3 CLUSTERING SYNTHETIC GRAPHS WITH MANY CLUSTERS

Finally, we push CSC to the limit by tasking it to cluster the stochastic block model (Abbe, 2018) when the number of clusters is proportional to the number of nodes. We sample graphs where the number of nodes in each cluster is 1000, the probability of an edge between two nodes in the same cluster is 0.5 and the probability of an edge between two nodes from different clusters is 0.001/k. We report the ARI of the coreset graph nodes instead of the full graph because the running time is otherwise dominated by labelling. We only consider the variant of CSC using the method of Macgregor (2023) as computing hundreds of eigenvectors becomes prohibitively expensive. The coreset size is set to be 1% the size of the input graph, implying that each cluster has an average of 10 nodes in the coreset graph. Figure 4 shows a gradual decline in performance of fast CSC as k increases, probably due to fluctuations in the number of nodes the coreset samples per cluster. Nevertheless, it still achieves an ARI of 0.5 in less than 3 seconds on a graph with 250K nodes and 250 clusters. On the other hand, local optima render coreset kernel k-means useless after 50 clusters.

Figure 4: Running time and ARI of each algorithm on the stochastic block model with k clusters of size 1000, p = 1/2, q = 0.001/k with a coreset size of 1%. Shaded regions denote 1 standard deviation over 20 runs.

Published as a conference paper at ICLR 2025

7 ACKNOWLEDGMENTS

The second author was supported by the following research grants: KAKENHI 21K17703, JST ASPIRE JPMJAP2302 and JST CRONOS Japan Grant Number JPMJCS24K2. The last author was supported by an EPSRC Early Career Fellowship (EP/T00729X/1).

Emmanuel Abbe. Community detection and stochastic block models: recent developments. Journal of Machine Learning Research, 18(177):1 86, 2018.

Mashaan Alshammari, John Stavrakakis, and Masahiro Takatsuka. Refining a k-nearest neighbor graph for a computationally efficient spectral clustering. Pattern Recognition, 114, 2021.

D. Anguita, Alessandro Ghio, L. Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In The European Symposium on Artificial Neural Networks, 2013.

David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 07), pp. 1027 1035, 2007.

Olivier Bachem, Mario Lucic, S Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In the AAAI conference on artificial intelligence, volume 30, 2016.

Vladimir Braverman, Shaofeng H-C Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA 21), pp. 2679 2696, 2021.

Radha Chitta, Rong Jin, and Anil K Jain. Efficient kernel clustering using random fourier features. In 2012 IEEE 12th International Conference on Data Mining, pp. 161 170, 2012.

Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. Fast and accurate k-means++ via rejection sampling. In 33rd Advances in Neural Information Processing Systems (Neur IPS 20), 2020.

Timothy A. Davis and Yifan Hu. The university of florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1), 2011.

Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 04), pp. 551 556, 2004.

Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944 1957, 2007.

Sarah El Kazdadi. faer-rs. URL https://github.com/sarah-quinones/faer-rs. MIT License.

Peter W Frey and David J Slate. Letter recognition using holland-style adaptive classifiers. Machine learning, 6: 161 182, 1991.

Mehmet Gönen and Adam A Margolin. Localized data fusion for kernel k-means clustering with application to cancer biology. In 27th Advances in Neural Information Processing Systems (Neur IPS 14), 2014.

Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In 36th Annual ACM Symposium on Theory of Computing (STOC 04), pp. 291 300, 2004.

Ben Harwood and Tom Drummond. Fanng: Fast approximate nearest neighbour graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 16), 2016.

Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in euclidean spaces: importance sampling is nearly optimal. In 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC 20), pp. 1416 1429, 2020.

Shaofeng H.-C. Jiang, Robert Krauthgamer, Jianing Lou, and Yubo Zhang. Coresets for kernel clustering. Machine Learning, 113(8):5891 5906, 2024.

Ben Jourdan and Gregory Schwartzman. Mini-batch kernel k-means, 2024. URL https://arxiv.org/ abs/2410.05902.

Published as a conference paper at ICLR 2025

Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. In 18th Annual Symposium on Computational Geometry (So CG 02), pp. 10 18, 2002.

Chia-Tung Kuo, Peter B. Walker, Owen Carmichael, and Ian Davidson. Spectral clustering for medical imaging. In 2014 IEEE International Conference on Data Mining, pp. 887 892, 2014.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

Peter Macgregor. Fast and simple spectral clustering in theory and practice. In 36th Advances in Neural Information Processing Systems (Neur IPS 23), 2023.

Peter Macgregor and He Sun. Fast approximation of similarity graphs with kernel density estimation. In 36th Advances in Neural Information Processing Systems (Neur IPS 23), 2023.

Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824 836, 2018.

Cameron Musco and Christopher Musco. Recursive sampling for the nystrom method. 30th Advances in Neural Information Processing Systems (Neur IPS 17), 2017.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011.

Richard Peng, He Sun, and Luca Zanetti. Partitioning well-clustered graphs: Spectral clustering works! SIAM Journal on Computing, 46(2):710 743, 2017.

William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846 850, 1971.

Daniel A Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4): 981 1025, 2011.

He Sun and Luca Zanetti. Distributed graph clustering and sparsification. ACM Transactions on Parallel Computing, 6(3), 2019.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395 416, 2007.

Shusen Wang, Alex Gittens, and Michael W Mahoney. Scalable kernel k-means clustering with nystrom approximation: Relative-error bounds. Journal of Machine Learning Research, 20(12):1 49, 2019.

Scott White and Padhraic Smyth. A spectral clustering approach to finding communities in graphs. In 2005 SIAM International Conference on Data Mining (SDM), pp. 274 285, 2005.

Chak-Kuen Wong and Malcolm C. Easton. An efficient method for weighted sampling without replacement. SIAM Journal on Computing, 9(1):111 113, 1980.

Published as a conference paper at ICLR 2025

A FURTHER EXPERIMENTS ON REAL-WORLD DATASETS

In this section, we report the experimental results omitted from Section 6.2.

Figure 5: Running time, ARI, and Normalised cut of each algorithm on a 250-nearest neighbour graph of the Pen Digits dataset as coreset size varies.

Figure 6: Running time, ARI, and Normalised cut of each algorithm on a 300-nearest neighbour graph of the Letter dataset as coreset size varies.

Figure 7: Running time, ARI, and Normalised cut of each algorithm on a 500-nearest neighbour graph of the MNIST dataset as coreset size varies. Sklearn spectral clustering was not included because it took too long.

Published as a conference paper at ICLR 2025

B SUPPLEMENTARY LEMMAS AND PROOFS

In this section we provide the proofs omitted from Section 5.

Proof of Lemma 5.1.

u S w(u) u z 2

u S w(u) u z, u z

u S w(u) D (u c(S) + c(S) z , (u c(S) + c(S) z E

u S w(u) u c(S) 2 + 2 u c(S), c(S) z + c(S) z 2

u S w(u) u c(S) 2 + 2

u S w(u) u c(S) +

+ |S| c(S) z 2 X

x S w(x) x c(S) 2 + |S| c(S) z 2 X

where last step follows from the fact that c(S) is the weighted centroid of S, so P

u S w(u)(u c(S)) = 0.

Lemma B.1 (Kernel k-means objective equivalence). Given a set of n objects X, w : X R+, K : X X R 0 and ϕ : X H satisfying equation 1, for any k-partition Π = {πj}k j=1 of X, it holds that

COSTK,W (Π) = COSTϕ w(X, {mj}k j=1), (6)

P a πj w(a)ϕ(a)

P b πj w(b) and W = Diag(w).

Proof. Let X {0, 1}n k be the unique set membership matrix corresponding to Π; that is, X1k = 1n and X(i, j) = 1 iff xi πj. Therefore XT WX = Diag(s1, . . . , sk) where si = P

a πi w(v). Expanding the right hand side of equation 6, we have

COSTϕ w(X, {mj}k j=1)

a πj w(a) ϕ(a) mj 2

a πj w(a) ϕ(a), ϕ(a) 2 ϕ(a), mj + mj, mj (7)

Notice that Pk j=1 P

a πj w(a) ϕ(a), ϕ(a) = Pk j=1 P

a πj w(a)Ka,a = Tr(WK) since Π is a partition of X. Expanding the other terms of equation 7 for the jth cluster, we find they are multiples of the same quantity: X

a πj w(a) ϕ(a), mj = X

w(a)w(b) ϕ(a), ϕ(b) P

c πj w(c) ,

a πj w(a) mj, mj = X

c πj w(b)w(c) ϕ(b), ϕ(c) P

d πj w(d) 2 = X

w(a)w(b) ϕ(a), ϕ(b) P

c πj w(c) ,

Finally, we show that these quantities for each cluster are the diagonal entries of the matrix XT WKWX(XT WX) 1:

[XT WKWX(XT WX) 1]jj = [XT WKWX]jj[(XT WX) 1]jj

b Xaj Xaj[WKW]ab 1 P

b πj [WKW]ab 1 P

w(a)w(b) ϕ(a), ϕ(b) P

c πj w(c) .

Published as a conference paper at ICLR 2025

Thus COSTϕ w(X, {mj}) = Tr(WK) Tr(XT WKWX(XT WX) 1) = COSTK,W (Π).

Lemma B.2. For any k-partition Π = {πj}k j=1 of V , we have that

NCAG,DG(Π ) 1 1 ε NCAH,DH (Π),

where Π = {π j}k j=1 is the k-partition of V such that x π j iff (ϕ(x), cϕ w H (Π)) = (ϕ(x), cϕ w H (πj)), breaking ties arbitrarily.

Proof of Lemma B.2. We have NCAH,DH (Π) = COSTKH,WH (Π) = COSTϕ w H (V , cϕ w H (Π)). Since V and w H constitute an ε-coreset, we have that COSTϕ w G(V, cϕ w H (Π)) 1 1 ε COSTϕ w H (V , cϕ w H (Π)). Then by the definition of Π and Lemma 5.1, COSTϕ w G(V, cϕ w G(Π )) COSTϕ w G(V, cϕ w H (Π)). Since NCAG,DG(Π ) =

COSTKG,DG(Π ) = COSTϕ w G(V, cϕ w G(Π )), the claim follows.

Lemma B.3. Let Σ = {σj}k j=1 be an optimal k-partition of V such that NCAG,DG(Σ) = OPTNCAG,DG(k). Then

NCAH,DH (Σ ) (1 + ε)NCAG,DG(Σ)

where Σ = {σ j}k j=1 is the k-partition of V such that x σ j iff (ϕ(x), cϕ w G(Σ)) = (ϕ(x), cϕ w G(σj)), breaking ties arbitrarily.

Proof of Lemma B.3. We have that NCAG,DG(Σ) = COSTϕ w G(V, cϕ w G(Σ)). Since V and w H constitute an ε-coreset, we have that COSTϕ w H (V , cϕ w G(Σ)) (1 + ε)COSTϕ w G(V, cϕ w G(Σ)). From the definition of Σ

and Lemma 5.1, we have NCAH,AG(Σ ) = COSTϕ w H (V , cϕ w H (Σ )) COSTϕ w H (V , cϕ w G(Σ)). The claim follows.

C ALGORITHMS

In this section we provide the coreset kernel k-means algorithm given by Jiang et al. (2024) along with our subroutines for Algorithm 2.

C.1 CORESET KERNEL k-MEANS ALGORITHMS (JIANG ET AL., 2024)

Algorithm 4 Constructing an ε-coreset for kernel k-means on dataset X with kernel K (Jiang et al., 2024)

1: Input: X0 X, i 0 2: repeat 3: i i + 1 and εi ε/(log(i) X0 )1/4 log(i)( ) is the ith iterated logarithm. 4: Xi IMPORTANCE-SAMPLING(Xi 1, εi) Algorithm 5 5: until Xi 0 does not decrease compared to Xi 1 0 6: return Xi

Algorithm 5 Importance-Sampling(X, ε) (Jiang et al., 2024)

1: Input: X, ε 2: Let C D2-Sampling(X) Algorithm 1 or Algorithm 2 (our faster algorithm) 3: for x X do 4: σx w X(x) (x,C )

w X(C) 5: end for 6: for x X do 7: px σx P

y X σy 8: end for 9: Draw N O k2 log2 k X 0

ε2 i.i.d. samples from X, using probabilities (px)x X

10: Let D be the sampled set; for each x D let w D(x) w X(x)

px N 11: return weighted set D

Published as a conference paper at ICLR 2025

C.2 FAST D2-SAMPLING SUBROUTINES

Algorithm 6 CONSTRUCTT

1: Input: X s.t. |X| = n, K Sn ++, W Rn + 2: current_level [ ] 3: for x X do 4: leaf a leaf node corresponding to x with attribute ϕ(x), ϕ(x) + c

5: current_level.push(leaf) 6: end for 7: while current_level.len()> 1 do 8: Next Level [ ]

9: for i = 1 to 2 current_level.len()

2 1 do 10: left_child current_level[i] 11: right_child current_level[i + 1] 12: internal an internal node with left_child and right_child as their respective children and attribute contribution equal to the sum of the contribution of its children. 13: next_level.push(internal) 14: end for 15: if current_level.len() is odd then 16: next_level.push(current_level[current_level.len()]) 17: end if 18: current_level next_level 19: end while 20: return current_level[1]

Algorithm 7 REPAIR(x,T)

1: Let L be the leaf node corresponding to x 2: delta_difference L. 3: L. 0 4: for Internal node I in the path from L to the root of T do 5: I.contribution I.contribution delta_difference w(x) 6: end for 7: for y in N(x) do 8: Let L be the leaf node corresponding to y 9: if (x, y) < L . then 10: delta_difference L . (x, y) 11: L. (x, y) 12: for Internal node I in the path from L to the root of T do 13: I.contribution I.contribution delta_difference w(y) 14: end for 15: end if 16: end for

Algorithm 8 SAMPLEPOINT

1: Input: T 2: Let I be the root of T. 3: while I is an internal node do 4: Let C1 and C2 be the children of I. 5: child Select C1 or C2 with probabilities proportional to C1.contribution and C2.contribution If I only has one child, sample it with probability 1. 6: I child 7: end while 8: return the data point associated with I.