# quantum_inspired_d2sampling_with_applications__8c670480.pdf

Published as a conference paper at ICLR 2025

QUANTUM (INSPIRED) D2-SAMPLING WITH APPLICATIONS

Poojan Chetan Shah and Ragesh Jaiswal Department of Computer Science and Engineering Indian Institute of Technology Delhi {cs1221594, rjaiswal}@cse.iitd.ac.in

D2-sampling is a fundamental component of sampling-based clustering algorithms such as k-means++. Given a dataset V Rd with N points and a center set C Rd, D2-sampling refers to picking a point from V where the sampling probability of a point is proportional to its squared distance from the nearest center in C. The popular k-means++ algorithm is simply a k-round D2-sampling process, which runs in O(Nkd) time and gives O(log k)-approximation in expectation for the k-means problem. In this work, we give a quantum algorithm for (approximate) D2-sampling in the QRAM model that results in a quantum implementation of k-means++ with a running time O(ζ2k2). Here ζ is the aspect ratio (i.e., largest to smallest interpoint distance) and O hides polylogarithmic factors in N, d, k. It can be shown through a robust approximation analysis of k-means++ that the quantum version preserves its O(log k) approximation guarantee. Further, we show that our quantum algorithm for D2-sampling can be dequantized using the sample-query access model of Tang (2023). This results in a fast quantum-inspired classical implementation of k-means++, which we call QI-k-means++, with a running time O(Nd) + O(ζ2k2d), where the O(Nd) term is for setting up the sample-query access data structure. Experimental investigations show promising results for QI-kmeans++ on large datasets with bounded aspect ratio. Finally, we use our quantum D2-sampling with the known D2-sampling-based classical approximation scheme to obtain the first quantum approximation scheme for the k-means problem with polylogarithmic running time dependence on N.

1 INTRODUCTION

Data clustering and the k-means problem, in particular, have many applications in data processing. The k-means problem is defined as: given a set of N points V = {v1, ..., v N} Rd, and a positive integer k, find a set C Rd of k centers such that the cost function,

v V min c C D2(v, c),

is minimised. Here, D(v, c) v c is the Euclidean distance between points v and c. Partitioning the points based on the closest center in the center set C gives a natural clustering of the data points. Due to its applications in data processing, a lot of work has been done in designing algorithms from theoretical and practical standpoints. The k-means problem is known to be NP-hard, so it is unlikely to have a polynomial time algorithm. Much research has been done on designing polynomial time approximation algorithms for the k-means problem. However, the algorithm used in practice to solve k-means instances is a heuristic, popularly known as the k-means algorithm (not to be confused with the k-means problem). This heuristic, also known as Lloyd s iterations Lloyd (1982), iteratively improves the solution in several rounds. The heuristic starts with an arbitrarily chosen set of k centers. In every iteration, it (i) partitions the points based on the nearest center and (ii) updates the center set to the centroids of the k partitions. In the classical computational model, it is easy to see that every Lloyd iteration costs O(Nkd) time. This hill-climbing approach may get stuck at a local minimum or take a huge amount of time to converge, and hence, it does not give provable guarantees

Published as a conference paper at ICLR 2025

on the quality of the final solution or the running time. In practice, Lloyd s iterations are usually preceded by the k-means++ algorithm Arthur & Vassilvitskii (2007), a fast sampling-based approach for picking the initial k centers that also gives an approximation guarantee. So, Lloyd s iterations, preceded by the k-means++ algorithm, give the best of both worlds, theory and practice. Hence, it is unsurprising that a lot of work has been done on these two algorithms. The work ranges from efficiency improvements in specific settings to implementations in distributed and parallel models. With the quantum computing revolution imminent, it is natural to talk about quantum versions of these algorithms and quantum algorithms for the k-means problem in general.

Early work on the k-means problem within the quantum setting involved efficiency gains from quantizing Lloyd s iterations. In particular, Aïmeur et al. (2013) gave an O( N3/2

k ) time algorithm for executing a single Lloyd s iteration for the Metric k-median clustering problem that is similar to the k-means problem. This was using the quantum minimum finding algorithm of Durr & Hoyer (1999). Using quantum distance estimation techniques assuming quantum data access, Lloyd et al. (2013) gave an O(k N log d) time algorithm for the execution of a single Lloyd s iteration for the k-means problem. More recently, Kerenidis et al. (2019) gave an approximate quantization of the k-means++ method and Lloyd s iteration assuming QRAM data structure Kerenidis & Prakash (2017) access to the data. Interestingly, the running time has only a polylogarithmic dependence on the size N of the dataset. The algorithm uses quantum linear algebra procedures, and hence, there is dependence on certain parameters that appear in such procedures, such as the condition number κ(V ). Since Lloyd s iterations do not give an approximation guarantee, its quantum version is also a heuristic without a provable approximation guarantee. A quantum implementation of k-means++ was also given in Kerenidis et al. (2019). Note that k-means++ gives an approximation guarantee of O(log k) in expectation for the k-means problem. However, this approximation guarantee does not immediately extend to the quantum implementation of k-means++ since the quantum procedure introduces D2sampling errors, and one must show that the approximation analysis is robust against those errors. This was indeed missing from Kerenidis et al. (2019). D2-sampling is a fundamental component of k-means++. Given the dataset V and a center set C, D2-sampling refers to picking a point from V such that the sampling probability is proportional to its squared distance from the nearest center in C. Starting with C = {} and D2-sampling and updating C in k rounds is precisely the k-means++ algorithm. In a recent line of work Bhattacharya et al. (2020a); Grunau et al. (2023), a robust version of k-means++ has been analyzed where the D2-sampling probability has a multiplicative (1 ε) error. This is called noisy k-means++, and it is now known Grunau et al. (2023) that this algorithm gives O(log k) approximation. This shows that k-means++ is indeed robust against small relative D2-sampling errors. This further means that the quantum implementation of k-means++ in Kerenidis et al. (2019) gives an approximation guarantee of O(log k).

In this work, we design quantum and quantum-inspired classical algorithms for D2-sampling-based clustering algorithms. We start with k-means++. Quantizing1 k-means++ reduces to quantizing the D2-sampling step. Most of the ideas for quantizing D2-sampling such as coherent amplitude estimation, median estimation, distance estimation, minimum finding, etc., have already been developed in previous works on quantum unsupervised learning (see Wiebe et al. (2015) for examples of several such tools). In particular, Kerenidis et al. (2019) used these tools to give a quantum implementation of D2-sampling. We provide a similar quantum algorithm and then combine this with the known O(log k) approximation guarantee for noisy k-means++ to state the following result.

Theorem 1. There is a quantum implementation of k-means++ that runs in time O(ζ2k2) and gives an O(log k) factor approximate solution for the k-means problem with a probability of at least 0.99. Here, O hides log2 (Nd) and log2 (kd) terms.2

Dependence on aspect ratio ζ: Note that the running time depends on the aspect ratio ζ, which is defined to be the ratio of the maximum to the minimum interpoint distance. For the remaining discussion, it is important to understand that the k-means problem remains NP-hard to approximate

1Quantizing is a term that is used to refer to designing a quantum version of an algorithm with a significant efficiency advantage over the corresponding classical algorithm. 2The output of k-means++ is a subset of data points, which can be stated as a subset of indices. If the description of centers is the expected output, the running time will also include a factor of d.

Published as a conference paper at ICLR 2025

(beyond a fixed constant) even when restricted to instances where ζ upper bounded by a small constant. See Awasthi et al. (2015) for such an hardness reduction.3

The above result is another addition to the growing area of Quantum Machine Learning (QML) where the goal is to design quantum algorithms for machine learning problems with large speed-ups over their classical counterparts. There was a flurry of activity in QML where significant speedups were shown for problems such as recommendation systems Kerenidis & Prakash (2017). However, it was not clear whether such speedups could entirely be attributed to the quantum data access model (QRAM) upon which these works were built. The works of Tang (2019; 2022; 2023) gave much clarity on this aspect. These works showed that similar speedups (up to some polynomial factors) are achievable in the classical model using a classical counterpart of QRAM, now popularly known as the sample-query (SQ) access data structure. In other words, most QML algorithms could be dequantized. Interestingly, by setting up the sample-query access data structure, the dequantized algorithms give rise to quantum-inspired classical algorithms. The interesting property of such a quantum-inspired algorithm is that apart from the linear time spent on setting up the sample-query access data structure, which can be thought of as preprocessing time, the algorithm is extremely fast (otherwise, it would not have given the quantum speedup). Our second contribution is to dequantize D2-sampling using Tang s SQ model. This turns out to be simple since the sample-query access model primarily involves sampling vectors based on squared ℓ2 norms, which is how D2-sampling also works. Based on this, we give a quantum-inspired classical implementation of k-means++. The following theorem formally states our main result on this.

Theorem 2. There is a classical implementation of k-means++ (which we call QI-k-means++) that runs in time O(Nd)+O(ζ2k2d log k log Nd). With probability at least 0.99, QI-k-means++ outputs k centers that give O(log k)-approximation in expectation for the k-means problem.

Note that the O(Nd) term in the running time is for setting up the SQ data structure, which can be thought of as a preprocessing operation. This setting of the data structure can easily be parallelized on a multi-core processor. Once the data structure is available, the running time is sublinear in N. The linear dependence on the dimension d can also be changed to sublinear at the cost of worsening the dependence on ζ. This is through implementing a small relative error D2-sampling procedure, which gives an implementation of the noisy k-means++ , which we know also gives O(log k)-approximation. Since the aspect ratio ζ can be efficiently computed in advance Shamos & Hoey (1975); Dietzfelbinger et al. (1997), we can appropriately choose between the algorithms in Theorem 2 and Theorem 3. We state this result formally below.

Theorem 3. There is a classical implementation of noisy k-means++ (which we call QI-noisyk-means++) that runs in time O(Nd) + O(ζ6k2 log k log Nd).With probability at least 0.99, QInoisy-k-means++ outputs k centers that give O(log k)-approximation in expectation for the k-means problem.

Finally, we design a quantum approximation scheme for the k-means problem with polylogarithmic dependence on the size N of the data set. We do this by quantizing the highly parallel, D2-samplingbased approximation scheme of Bhattacharya et al. (2020b). An approximation scheme is an algorithm that, in addition to the dataset and k, takes an error parameter ε > 0 as input and outputs a solution with a cost within (1 + ε) factor of the optimal. The k-means problem is NP-hard to efficiently approximate beyond an approximation factor of 1.07 Cohen-Addad & C.S. (2019). The tradeoff in obtaining this fine-grained approximation of (1 + ε) is that the running time of our algorithm has an exponential dependence on k and error parameter ε. In the classical setting, such algorithms are categorized as Fixed Parameter Approximation Schemes (fpt-AS). Such (1 + ε)-approximation algorithms can have exponential running time dependence on the parameter (e.g., the number of clusters k in our setting). The practical motivation for studying Fixed-Parameter Tractability (FPT) for computationally hard problems is that when the parameter is small (e.g., number of clusters k 5), the running time is not prohibitively large. We state our main result as the following theorem, which we will prove in the remainder of the paper.

3The hardness of approximation argument in Awasthi et al. (2015) goes through the vertex cover problem. For a given graph G = (V, E), the k-means instance includes |E| points in a |V |-dimensional space. The point corresponding to an edge (i, j) has 1 in coordinates i, j and 0 everywhere else. It is clear that ζ =

2 in this case.

Published as a conference paper at ICLR 2025

Theorem 4. Let 0 < ε < 1/2 be the error parameter. There is a quantum algorithm that, when given QRAM data structure access to a dataset V RN d, runs in time O 2 O( k

ε )dζO(1) and outputs

a k center set C Rk d such that with high probability Φ(V, C) (1 + ε) OPT. Here, ζ is the aspect ratio, i.e., the ratio of the maximum to the minimum distance between two given points in V .4

We end on the note that the above quantum approximation scheme can also be dequantized in the sample-query-access model of Tang (2023).

1.1 COMPARISION WITH PREVIOUS WORK

There have been several fast classical implementations of k-means++. These implementations are fast under certain assumptions on the data set or achieved through trading efficiency with an approximation guarantee or both. Bachem et al. (2016b) gave a Monte Carlo Markov Chain (MCMC) based implementation of k-means++. However, this algorithm, called K-MC2, preserves the O(log k) approximation guarantee of k-means++ only when the dataset satisfies some strong properties that are NP-hard to check. The MCMC-based algorithm was improved by the same authors in Bachem et al. (2016a), but the approximation guarantee of this algorithm had an additive term (in addition to the multiplicative O(log k) factor), which could possibly be large. More recently, a fast implementation based on the multi-tree embedding of the data was given by Cohen-Addad et al. (2020)

Algorithm Time Complexity Approx. Guarantee Key Parameters

QI-k-means++ (This Work) O(Nd) + O(ζ2k2d log k log Nd) O(log k)Φk OP T ζ is the aspect ratio

Bachem et al. (2016b)

O(γk2d log kβ) , O(k3d log2 N log kβ) O(log k)Φk OP T γ = 4 maxx V D2(x,µ)

Φk OP T µ is the dataset mean.

Bachem et al. (2016a)

O(Nd) + O 1

ε , O(Nd) + O k3d log k O(log k)Φk OP T + εΦ1 OP T ε (0, 1) controls runtime - solution quality tradeoff

Rejection Sampling Cohen-Addad et al. (2020)

O(N(d + log N) log ζd) +

O kc2d3 log ζ(N log ζ)O(1/c2) O(c6 log k)Φk OP T ζ is the aspect ratio. c > 1 controls runtime - solution quality tradeoff

Table 1: Comparison of fast implementations of k-means++. Φk OP T represents the optimal k-means cost of the dataset. represents the runtime under the assumptions described in Bachem et al. (2016b).

Let us now compare our algorithm with K-MC2 , AFK-MC2 and Rejection Sampling .

Pre-processing. QI-k-means++ takes O(Nd) to setup the SQ data structures. AFK-MC2 and Rejection Sampling also require similar pre-processing cost for computing the initial distribution of the markov chain and for performing the multi-tree embedding respectively. An advantage of the SQ data structure is that it is very efficient to update, i.e., add or delete a point which costs O(log N) time. This is useful in scenarios where new data points get added periodically, and one needs to recluster the data. Runtime. K-MC2 , AFK-MC2 have runtime which is sublinear in N under certain assumptions. Their runtimes depend on the parameters γ and ε respectively, which can be thought of as being analogous to ζ. It is important to note that computing an estimate of γ involves solving the k-means problem itself, while ζ is efficiently computable. Even under assumptions, QI-k-means++ has a better dependence on k (k2 vs k3). We see that QI-k-means++ and Rejection Sampling have a tradeoff between N (log N vs N O(1/c2)) and ζ (ζ2 vs (log ζ)1+O(1/c2)). Moreover, Rejection Sampling has better dependence on k (k2 vs k) but worse dependence on d (d vs d3).

So, in cases where the aspect ratio is not large (e.g., binarized data such as MNIST Deng (2012)) and k is reasonably small, our implementation should be expected to run fast and give good solutions.

4The O notation hides logarithmic factors in N. In the exponent, it hides logarithmic factors in k and 1/ε.

Published as a conference paper at ICLR 2025

Note that unlike Rejection Sampling and AFK-MC2 , our results have no tradeoff between efficiency and approximation ratio. Even though our work is mainly theoretical in nature, we conduct some experimental investigations to see the running time advantage of QI-k-means++ over k-means++ (the classical implementation).

1.2 RELATED WORK

We have already discussed past research works on quantum versions of the k-means algorithm (i.e., Lloyd s iterations). This includes Aïmeur et al. (2013), Lloyd et al. (2013), and Kerenidis et al. (2019). All these have been built using various quantum tools and techniques developed for various problems in quantum unsupervised learning, such as coherent amplitude and median estimation, distance estimation, minimum finding, etc. See Wiebe et al. (2015) for examples of several such tools. We have also discussed previous works on fast implementations of the k-means++ algorithm. Other directions on quantum k-means includes adiabatic algorithms (e.g., Lloyd et al. (2013)) and algorithms using the QAOA framework (e.g., Otterbach et al. (2017); Farhi et al. (2014)). However, these are without provable guarantees. A recent work of Doriguello et al. Doriguello et al. (2023) improves the results of Kerenidis et al. (2019) on Lloyd s iterations and obtains a quantum algorithm with better running time dependence and removes the dependence on the data matrix dependent parameters such as the condition number. They also dequantize their algorithm in the sample-query access model of Tang Tang (2023). However, since Lloyd s iterations do not give any approximation guarantee, their algorithms also do not have a provable approximation guarantee. In another recent work Kumar et al. (2023), the quantum algorithm of Kerenidis et al. (2019) is used in a supervised setting to construct decision trees.

2 QUANTUM (INSPIRED) D2-SAMPLING

In this section, we discuss the key ideas in the design of our quantum algorithm for D2-sampling and its dequantization in the sample-query (SQ) access model of Tang Tang (2023). Let us first look at the main ideas of our quantum algorithm in the following subsection.

2.1 QUANTUM D2-SAMPLING

We will work under the assumption that the minimum distance between two data points is 1, which can be achieved using scaling. This makes the aspect ratio ζ simply the maximum distance between two data points. We will use i for an index into the rows of the data matrix V RN d, and j for an index into the rows of the center matrix C Rm d. We would ideally like to design a quantum algorithm that performs the transformation:

|i |j |0 |i |j |D(vi, cj) (1)

The known quantum tools such as swap test followed by coherent amplitude estimation, and median estimation allow one to obtain a state that is an approximation of the ideal state that is shown on the right in (1). Moreover, this can be done in time that has only polylogarithmic dependence on the input size. The details are provided in the Appendix. For the current high-level discussion, we will assume that the ideal state can be prepared. We would like to perform sampling from the D2 distribution over V with respect to the center set C. This means that we would like to sample an index i [N] with probability:

Pr[i] = minj [m] D(vi, cj)2 P

i [N] minj [m] D(vi , cj)2 (2)

So, if we can prepare the state:

i [N] |i min j [m] D(vi, cj) ,

then we can obtain a good estimate of the denominator of the right expression in (2), which is basically the k-means cost with respect to the current center set C. This follows from the fact that measuring the second register of the above state gives the distance of a random data point to its closest center in C. Repeatedly preparing the state and measuring helps obtain a close estimate of

Published as a conference paper at ICLR 2025

the k-means cost. Now that the k-means cost (i.e., the denominator of (2)) is known, using a few more standard quantum transformations (controlled rotations), we can pull out the state of the second register as part of the amplitude of an ancilla qubit. In other words, an approximate version of the following quantum state can be prepared: 1

i [N] |i βi |0 + p

1 |βi|2 |1 ,

where βi = 1

Z minj [m] D(vi, cj) and Z is some appropriate normalization term. So, by measuring the above state and doing rejection sampling (ignoring the measurement when the ancilla qubit is 1), we obtain a D2-sample. Some of the quantum states shown above are ideal quantum states that we have used to convey the main ideas of the quantum algorithm. The real states will be close approximations of these states, and we need to carefully account for all the errors introduced. All the details are given in the Appendix. Finally, note that this also gives a quantum implementation of the k-means++ algorithm since the algorithm simply performs D2-sampling in k iterations while updating the center set.

In the following subsection, we discuss how to dequantize this quantum algorithm in the samplequery access model of Tang Tang (2023). This gives a quantum-inspired classical algorithm for D2-sampling.

2.2 QUANTUM INSPIRED D2-SAMPLING

Ewin Tang s thesis Tang (2023) nicely categorizes quantum machine learning (QML) algorithms into ones where the quantum advantage (i.e., polylogarithmic running time) is solely because of the quantum access to the data and the ones where it is not (e.g., the HHL algorithm Harrow et al. (2009)). Quantum access means that a (normalized) data vector v := (v(1), ..., v(d))T can be loaded onto the quantum workspace in O(1) time as |v = v(1) |1 + ... + v(d) |d . One implication of having the quantum state |v is that i can be sampled with probability v(i)2 P

j v(j)2 by simply making a measurement in the standard basis. One of the key insights in Tang s Thesis Tang (2023) is that QML algorithms that benefit solely from the quantum data access can be dequantized if we work with an appropriate classical counterpart of the quantum data access that allows sampling access of the kind described above (in addition to the classical access to the data). This is called sample-query access, or SQ access in short. Dequantization means that a classical algorithm can simulate the steps of the quantum algorithm using SQ access with similar running time dependence. The nice property of the SQ access data structures is that it can be constructed classically in linear time. Note that linear time means that we lose the quantum advantage. However, if we keep this aside and count the SQ access data structure construction as preprocessing, the remaining computation has a similar advantage as that of the quantum algorithm. This is a level playing field with the QML algorithm that allows fair comparison, specifically given that setting up quantum access (called QRAM) also takes linear time. Such classical algorithms obtained by dequantization are called quantum inspired algorithms. Our quantum algorithms based on D2-sampling fall into the category of QML algorithms that can be dequantized in Tang s SQ access model. This section discusses the quantum-inspired classical algorithm we obtain by dequantizing our quantum algorithm using the SQ access data structure. Naturally, we start the discussion by looking at the definition of the SQ access data structure. Definition 1 (Query access, Definition 1.1 in Tang (2023)). For a vector v Cn, we have query access to v, denoted by Q( v), if for all i [n], we can query for v(i). We use q( v) to denote the time (cost) of such a query.

Definition 2 (SQ-access to a vector, Definition 1.2 in Tang (2023)). For a vector v Cn, we have sampling and query access to v, denoted by SQ( v), if we can:

1. query for entries in v as in Q( v). The time cost is denoted by q( v).

2. obtain independent samples i [n] where the probability of sampling i is v(i)2

v 2 . This distribution is denoted by D v. The time cost is denoted by s( v).

3. query for v . The time cost is denoted by n( v).

Let sq( v) max {q( v), s( v), n( v)} denote the time cost of SQ-access.

Published as a conference paper at ICLR 2025

Figure 1: A tree data structure to enable sample-query access to an example vector of dimension n = 4. Index i can be sampled with probability | vi|2 P

j | vj|2 in O(log n) time by traversing down the

A simple tree-based data structure (see Figure 1) supports sample-query access for vectors and matrices. The details can be found in Tang (2023). Here, we give a summary of the running time of various operations. Lemma 1 (Remark 4.12 in Tang (2023)). There is a data structure for storing a vector v Rn supporting the following operations:

1. Reading and updating an entry of v in O(log n) time. So, q( v) = O(log n).

2. Finding v 2 in O(1) time. So, n( v) = O(1).

3. Sampling from D v in O(log n) time. So, s( v) = O(log n)

It follows from the above that sq( v) = O(log n). Moreover, the time required to construct the data structure is O(n).

Let us outline the key ideas in the quantum-inspired classical algorithm in the SQ access model defined above. Let v1, ..., v N denote the data vectors, i.e., vi = V (i, .). Suppose we have set up SQ access for the vectors v1, ..., v N, ( v1 2, ..., v N 2)T (together denoted by SQ(V )), and m centers c1, ..., cm (denoted by SQ(c1), ..., SQ(cm)). Our goal is to set up SQ access for the vector w defined as:

w := min j [m] D(v1, cj), ..., min j [m] D(v N, cj) T (3)

The reason is that once we have SQ access, SQ(w), a sampling query returns an index with distribution Dw, which is precisely the D2-sampling distribution. We enable SQ access for w in a sequence of steps:

(1) SQ(V ), SQ(c1), ..., SQ(cm) (2) SQ(u1), ..., SQ(um) (3)SQ(w).

Here uj is the vector defined as:

i [N], uj(i) = D(vi, cj) = vi cj , (4)

the vector of distances of N data points from the jth center. The above steps are a gross simplification of the real steps to understand the high-level ideas and draw a parallel to the corresponding quantum steps. In actual implementation, we may not be able to enable sample-query access for w but something known as oversampling and query access, which is a generalization SQ access. Much of the technical effort is spent designing these oversampling query accesses. These details are given in the Appendix.

Let us draw a parallel to the quantum algorithm to see that the above SQ-based algorithm is indeed quantum-inspired. The relevant quantum states involved in the D2-sampling algorithm are given below:

(1) i, |vi , j, |cj , 1

i |i |vi (2) 1

i |i |D(vi, c1) ... |D(vi, cm)

i |i min j [m] D(vi, cj) (4) 1

i [N] |i βi |0 + p

Published as a conference paper at ICLR 2025

The correspondence between quantum states and SQ accesses is apparent. So, every step in the quantum algorithm can be simulated using oversampling query access, which leads to a quantuminspired classical algorithm since the SQ access framework is completely classical. Let us now discuss some key elements of the quantum-inspired classical implementation of k-means++ that results from the above quantum-inspired classical D2-sampling.

2.3 QI-k-MEANS++

If we use the quantum-inspired classical D2-sampling iteratively in k rounds to pick k centers, the resulting algorithm is a quantum-inspired classical implementation of k-means++, which we call the QI-k-means++ algorithm. This can be seen as a fast implementation of k-means++. We give a high-level outline of the algorithm and discuss some of its important properties.

Algorithm 1 QI-k-means++(V, k)

1: Setup sample-query access for dataset V 2: C random data point in V 3: for i = 2 to k do 4: Use sample-query access for w (which uses sample-query access for u1, ..., ui 1, which in turn uses sample-query access for V and c1, ..., ci 1) to D2-sample a center c. The vectors w and u1, ..., uk are as defined in (3) and (4). 5: C C {c}; 6: end for 7: return C

Setting up the sample-query access for the dataset V RN d costs O(Nd) time. This can be thought of as the preprocessing time. Setting up the sample-query access is highly parallelizable, as can be seen from the tree-based data structure in Figure 1. On a shared memory, multi-processor system with M processing units, the task can be done in O(Nd/M) time. After the preprocessing in step (1), the cost of the remaining D2-sampling steps is O(ζ2k2d). Here, again, it is possible to parallelize on a multi-processor system since much of the time in D2-sampling is spent on waiting for rejection sampling to succeed. The rejection sampling can be performed independently on multiple processors until one of the processing units succeeds. This cuts down the time to O(ζ2k2d/M). We implement the QI-k-means++ algorithm to compare the k-means cost and time with the classical implementation of k-means++ that has a running time O(Nkd).

3 A QUANTUM APPROXIMATION SCHEME

We convert the D2-sampling-based approximation scheme of Bhattacharya et al. (2020b) to a Quantum version. The approximation scheme is simple and highly parallel, which can be described in a few lines.

Algorithm 2 Approximation Scheme

Input: Dataset V , integer k > 0, and error ε > 0 Output: A center set C with Φ(V, C ) (1 + ε)OPT - (Constant approximation) Find a center set C that is a constant factor approximate solution. An (α, β) pseudo-approximate solution, for constants α, β, also works.

- (D2-sampling) Pick a set T of poly( k

ε ) points independently from the dataset using D2-sampling with respect to the center set C. - (All subsets) Out of all k-tuples (S1, ..., Sk) of (multi)subsets of T {copies of points in C}, each Si of size O( 1

ε), return (µ(S1), ..., µ(Sk)) that gives the least k-means cost. Here, µ(Si) denotes the centroid of points in Si.

We will discuss the quantization of the above three steps of the approximation scheme of Bhattacharya et al. (2020b), thus obtaining a quantum approximation scheme. 5

5Steps (2) and (3) in the algorithm are within a loop for probability amplification. For simplicity, this loop is skipped in this high-level description.

Published as a conference paper at ICLR 2025

(Constant approximation) The first step requires finding a constant factor approximate solution for the k-means problem. Even though several constant factor approximation algorithms are known, we need one with a quantum counterpart that runs in time that is polylogarithmic in the input size N. One such algorithm is the k-means++ seeding algorithm Arthur & Vassilvitskii (2007). Kerenidis et al. (2019) give an approximate quantum version of D2-sampling. The approximation guarantee of the k-means++ algorithm is O(log k) instead of the constant approximation required in the approximation scheme of Bhattacharya et al. (2020b). It is known from the work of Aggarwal et al. (2009) that if the D2-sampling in k-means++ is continued for 2k steps instead of stopping after sampling k centers, then we obtain a center set of size 2k that is a (2, O(1))-pseudo approximate solution. This means that this 2k-size center set has a k-means cost that is some constant times the optimal. Such a pseudo-approximate solution is sufficient for the approximation scheme of Bhattacharya et al. (2020b) to work. We show that the pseudo-approximation guarantee of Aggarwal et al. (2009) also holds when using the approximate quantum version of the D2-sampling procedure.

(D2-sampling) The second step of Bhattacharya et al. (2020b) involves D2-sampling, which we already discussed how to quantize. This is no different than the D2-sampling involved in the kmeans++ algorithm of the previous step. The sampling in this step is simpler since the center set C, with respect to which the D2-sampling is performed, does not change (as is the case with the k-means++ algorithm.)

(All subsets) Since the number of points sampled in the previous step is poly( k

ε ), we need to

consider a list of k

ε ) tuples of subsets, each giving a k-center set (a tuple (S1, ..., Sk) defines (µ(S1), ..., µ(Sk))). We need to compute the k-means cost for every k center set in the list and then pick the one with the least cost. We give quantization of the above steps. 6

Note that the quantization of the classical steps of Bhattacharya et al. (2020b) will incur precision errors. So, we first need to ensure that the approximation guarantee of Bhattacharya et al. (2020b) is robust against small errors in distance estimates, D2-sampling probabilities, and k-means cost estimates. We must carefully account for errors and ensure that the quantum algorithm retains the (1 + ε) approximation guarantee of the robust version of Bhattacharya et al. (2020b). We give the detailed analysis in the Appendix.

4 EXPERIMENTS

We provide an experimental investigation of the QI-k-means++ algorithm. We implemented our code in C++ for both k-means++ and QI-k-means++ and the results are averaged over 5 runs. No preprocessing/dimensionality reductions were done on any of the datasets used. The experiments were performed on an AMD Ryzen 9 5900HX 4.68 GHz 8 Core processor (parallelization on multicore was not used). We must set up the sample and query data structure for the dataset only once to be able to perform seeding for any value of k, with each run taking time only polylogarithmic in N. Specifically, if we wanted to perform seeding for q different values of k, say {k1, ..., kq}, then the total runtime for QI-k-means++ would be O(Nd) + O(ζ2d P

j [q] k2 j) as compared to O(Nd P

j [q] kj) for k-means++. This is one reason to consider the O(Nd) term as pre-processing.

We study the runtime (including setup time for the sample and query data structure) of QI-k-means++ on two extreme types of datasets to demonstrate its possible use cases. The first is binarized MNIST Salakhutdinov & Murray (2008) (70,000 data points, each being a 28 28 pixel image . Each pixel has a value of either 0 or 1, as opposed to the original MNIST, which had values from 0 to 255) and the second is IRIS Fisher (1988) (150 data points with four feature values). The plots in Figures 2 and 3 show the cumulative runtime (i.e., the total time for setting up and calculating cluster centers for all values of k up to a certain value).

Advantageous regime: For a dataset with a large number of points and small aspect ratio (for example, binary MNIST), we find that for one iteration of seeding, our algorithm performs slower but quickly catches up to be significantly faster when we require seeding to be done for multiple

6Note that when picking the center set with the least cost, we can get quadratic improvement in the search

for the best k-center set using quantum search. Given that the search space is of size k

ε ), this results only in a constant factor improvement in the exponent. So, for simplicity, we leave out the quantum search from the discussion.

Published as a conference paper at ICLR 2025

values of k (which is generally the case in practice since in the unsupervised setting, the optimal value of k is not known beforehand). This is because most of the time is spent in setting up the data structure, after which the sampling becomes significantly faster. For example, see the left plot for the MNIST dataset. Notice that the cumulative runtime for QI-k-means++ is almost constant, while for k-means++, it does not scale well due to the multiplicative dependence on N.

Disadvantageous regime: For a dataset with a small number of points and a large aspect ratio, we find that the runtime is dominated by the sampling term and due to the quadratic dependence on k and ζ, QI-k-means++ may not scale well. For example, see the right plot for the IRIS dataset.

We give additional experiments on larger datasets and a comparative analysis with AFKMC2 Bachem et al. (2016a) in Section F of the Appendix.

Figure 2: Cumulative runtime plot for MNIST

Figure 3: Cumulative runtime plot for IRIS

k 2 3 4 5 6 7 8 9 10 QI-k-means++ 8.43 8.13 7.81 7.43 7.29 7.09 6.79 6.78 6.66 k-means++ 8.35 8.10 7.74 7.38 7.44 7.17 7.12 6.91 6.83

Table 2: Clustering cost for binarized MNIST (costs are scaled down by a factor of 106)

k 2 3 4 5 6 7 8 9 10 QI-k-means++ 538.82 253.41 112.53 96.04 83.72 64.92 58.62 56.25 49.92 k-means++ 773.26 227.50 115.33 93.45 83.21 60.14 57.98 53.99 50.35

Table 3: Clustering cost for IRIS (costs are rounded to 2 decimal places)

5 CONCLUSION AND FUTURE WORK

In this work, we used D2-sampling framework to design quantum algorithms for the k-means problem. We also gave their classical counterparts obtained through the dequantization framework of Tang (2023). These algorithms have logarithmic dependence on the number of points in the dataset N, after appropriate linear time pre-processing for the classical ones. Our algorithms depend on the parameter ζ, which is the aspect ratio of the dataset. Our algorithm QI-k-means++ is an addition to the collection of fast implementations of the k-means++ seeding. Interestingly, the performance of all of these algorithms depend on certain aspect ratio like quantities. Even for a simple dataset where the clusters are located very far apart from each other, the parameter ζ (for QI-k-means++ and Rejection Sampling ) and the ratio of optimal 1-means cost to the optimal k-means cost (for K-MC2

and AFK-MC2 ) can become quite large. We think that improving the dependence on these quantities and understanding the tradeoff between them and the data size are interesting open problems.

6 ACKNOWLEDGEMENTS

This work was partially supported by the CSE Research Acceleration Fund of IIT Delhi.

Published as a conference paper at ICLR 2025

Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th International Workshop and 13th International Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX 09 / RANDOM 09, pp. 15 28, Berlin, Heidelberg, 2009. Springer-Verlag. ISBN 9783642036842. doi: 10.1007/978-3-642-03685-9_2. URL https://doi.org/10.1007/ 978-3-642-03685-9_2.

Esma Aïmeur, Gilles Brassard, and Sébastien Gambs. Quantum speed-up for unsupervised learning. Machine Learning, 90:261 287, 2013.

David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 07, pp. 1027 1035, USA, 2007. Society for Industrial and Applied Mathematics. ISBN 9780898716245.

Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The Hardness of Approximation of Euclidean k-Means. In Lars Arge and János Pach (eds.), 31st International Symposium on Computational Geometry (So CG 2015), volume 34 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 754 767, Dagstuhl, Germany, 2015. Schloss Dagstuhl Leibniz Zentrum fuer Informatik. ISBN 978-3-939897-83-5. doi: http://dx.doi.org/10.4230/LIPIcs.SOCG. 2015.754. URL http://drops.dagstuhl.de/opus/volltexte/2015/5117.

Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause. Fast and provably good seedings for k-means. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016a. URL https://proceedings.neurips.cc/paper_files/paper/2016/ file/d67d8ab4f4c10bf22aa353e27879133c-Paper.pdf.

Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), Feb. 2016b. doi: 10.1609/aaai.v30i1.10259. URL https://ojs.aaai.org/index.php/ AAAI/article/view/10259.

Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, Greedy and Not so Greedy k-Means++. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders (eds.), 28th Annual European Symposium on Algorithms (ESA 2020), volume 173 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 18:1 18:21, Dagstuhl, Germany, 2020a. Schloss Dagstuhl Leibniz-Zentrum für Informatik. ISBN 978-3-95977-162-7. doi: 10.4230/LIPIcs. ESA.2020.18. URL https://drops-dev.dagstuhl.de/entities/document/10. 4230/LIPIcs.ESA.2020.18.

Anup Bhattacharya, Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. On Sampling Based Algorithms for k-Means. In Nitin Saxena and Sunil Simon (eds.), 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020), volume 182 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 13:1 13:17, Dagstuhl, Germany, 2020b. Schloss Dagstuhl Leibniz-Zentrum für Informatik. ISBN 978-3-95977-1740. doi: 10.4230/LIPIcs.FSTTCS.2020.13. URL https://drops.dagstuhl.de/opus/ volltexte/2020/13254.

Rich Caruana, Thorsten Joachims, and Lars Backstrom. Kdd-cup 2004: results and analysis. SIGKDD Explor. Newsl., 6(2):95 108, dec 2004. ISSN 1931-0145. doi: 10.1145/1046456.1046470. URL https://doi.org/10.1145/1046456.1046470.

Vincent Cohen-Addad and Karthik C.S. Inapproximability of clustering in lp metrics. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 519 539, 2019. doi: 10.1109/FOCS.2019.00040.

Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. Fast and accurate k-means++ via rejection sampling. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 16235 16245. Curran Associates, Inc.,

Published as a conference paper at ICLR 2025

2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/ file/babcff88f8be8c4795bd6f0f8cccca61-Paper.pdf.

Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142, 2012.

Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19 51, October 1997. ISSN 0196-6774. doi: 10.1006/jagm.1997.0873. URL https://doi.org/10.1006/jagm. 1997.0873.

João F. Doriguello, Alessandro Luongo, and Ewin Tang. Do you know what q-means?, 2023.

Christoph Durr and Peter Hoyer. A quantum algorithm for finding the minimum, 1999.

Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm, 2014.

R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.

Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. FPT Approximation for Constrained Metric k-Median/Means. In Yixin Cao and Marcin Pilipczuk (eds.), 15th International Symposium on Parameterized and Exact Computation (IPEC 2020), volume 180 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 14:1 14:19, Dagstuhl, Germany, 2020. Schloss Dagstuhl Leibniz-Zentrum für Informatik. ISBN 978-3-95977-172-6. doi: 10.4230/LIPIcs.IPEC.2020.14. URL https://drops.dagstuhl.de/opus/volltexte/2020/13317.

Christoph Grunau, Ahmet Alper Özüdo gru, and Václav Rozhoˇn. Noisy k-Means++ Revisited. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman (eds.), 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 55:1 55:7, Dagstuhl, Germany, 2023. Schloss Dagstuhl Leibniz-Zentrum für Informatik. ISBN 978-3-95977-295-2. doi: 10.4230/LIPIcs.ESA.2023.55. URL https://drops-dev.dagstuhl.de/entities/ document/10.4230/LIPIcs.ESA.2023.55.

Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Phys. Rev. Lett., 103:150502, Oct 2009. doi: 10.1103/Phys Rev Lett.103.150502. URL https://link.aps.org/doi/10.1103/Phys Rev Lett.103.150502.

Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In Proceedings of the tenth annual symposium on Computational geometry, SCG 94, pp. 332 339, New York, NY, USA, 1994. ACM. ISBN 0-89791-648-4. doi: 10.1145/177424.178042. URL http://doi.acm. org/10.1145/177424.178042.

Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple D2-sampling based PTAS for k-means and other clustering problems. Algorithmica, 70(1):22 46, 2014. ISSN 0178-4617. doi: 10.1007/ s00453-013-9833-9. URL http://dx.doi.org/10.1007/s00453-013-9833-9.

Iordanis Kerenidis and Anupam Prakash. Quantum Recommendation Systems. In Christos H. Papadimitriou (ed.), 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 49:1 49:21, Dagstuhl, Germany, 2017. Schloss Dagstuhl Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977029-3. doi: 10.4230/LIPIcs.ITCS.2017.49. URL http://drops.dagstuhl.de/opus/ volltexte/2017/8154.

Iordanis Kerenidis, Jonas Landman, Alessandro Luongo, and Anupam Prakash. qmeans: A quantum algorithm for unsupervised machine learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/ file/16026d60ff9b54410b3435b403afd226-Paper.pdf.

Published as a conference paper at ICLR 2025

Niraj Kumar, Romina Yalovetzky, Changhao Li, Pierre Minssen, and Marco Pistoia. Des-q: a quantum algorithm to construct and efficiently retrain decision trees for regression and binary classification, 2023.

S. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2): 129 137, 1982.

Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning, 2013.

Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press, 2010.

J. S. Otterbach, R. Manenti, N. Alidoust, A. Bestwick, M. Block, B. Bloom, S. Caldwell, N. Didier, E. Schuyler Fried, S. Hong, P. Karalekas, C. B. Osborn, A. Papageorge, E. C. Peterson, G. Prawiroatmodjo, N. Rubin, Colm A. Ryan, D. Scarabelli, M. Scheer, E. A. Sete, P. Sivarajah, Robert S. Smith, A. Staley, N. Tezak, W. J. Zeng, A. Hudson, Blake R. Johnson, M. Reagor, M. P. da Silva, and C. Rigetti. Unsupervised machine learning on a hybrid quantum computer, 2017.

Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pp. 872 879. ACM, 2008.

Michael Ian Shamos and Dan Hoey. Closest-point problems. In 16th Annual Symposium on Foundations of Computer Science (sfcs 1975), pp. 151 162, 1975. doi: 10.1109/SFCS.1975.8.

E. Tang. Dequantizing algorithms to understand quantum advantage in machine learning. Nat Rev Phys, 4:692 693, 2022. URL https://doi.org/10.1038/s42254-022-00511-w.

Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pp. 217?228, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367059. doi: 10.1145/3313276.3316310. URL https://doi.org/10.1145/3313276.3316310.

Ewin Tang. Quantum machine learning without any quantum. Ph D thesis, University of Washington, 2023.

Daniel Whiteson. SUSY. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C54606.

Nathan Wiebe, Ashish Kapoor, and Krysta M. Svore. Quantum algorithms for nearest-neighbor methods for supervised and unsupervised learning. Quantum Info. Comput., 15(3 4):316 356, mar 2015. ISSN 1533-7146.

A QUANTUM PRELIMINARIES

In this section , we briefly describe the core notions of quantum computing and refer to Nielsen & Chuang (2010) for a more formal and complete overview of quantum computing.

A qubit (quantum bit) is represented by a unit vector |ψ = α0 |0 + α1 |1 where α0, α1 C are

complex numbers such that |α0|2 + |α1|2 = 1 and |0 = 1 0

and |1 = 0 1

represent the standard

basis of the vector space C2(C) i.e, the vector space whose elements are pairs of complex numbers with the underlying field being C.

Similar to a single qubit, an n-qubit system is represented by a unit vector in (C2(C)) n i.e, as |ψ = P2n 1 j=0 αj |j where the amplitudes are complex numbers satisfying P2n 1 j=0 |αj|2 = 1.

A quantum algorithm performs computation on qubits by applying quantum gates which are repre-

sented as unitary matrices. For example, the hadamard gate H = 1

can perform the

mapping |0 7 1

2(|0 + |1 ) and |1 7 1

Published as a conference paper at ICLR 2025

A quantum state can also be measured in the standard basis. For example, when |ψ = α0 |0 +α1 |1 is measured, we obtain 0 with probability |α0|2 and 1 with probability |α1|2. Similarly, when |ψ = P2n 1 j=0 αj |j is measured , we observe the outcome j = j0j1 . . . jn 1, the binary representation of j with probability |αj|2.

In the context of QML, the input data is assumed to be provided in the QRAM data structure. We refer to Kerenidis & Prakash (2017) for the specific details of QRAM implementation. Any vector v Rd can be encoded as a quantum state using log d qubits as |v = 1 v Pd 1 j=0 vj |j .

B QUANTUM D2-SAMPLING (PROOF OF THEOREM 1)

We will work under the assumption that the minimum distance between two data points is 1, which can be achieved using scaling. This makes the aspect ratio ζ simply the maximum distance between two data points. We will use i for an index into the rows of the data matrix V RN d, and j for an index into the rows of the center matrix C Rk d. We would ideally like to design a quantum algorithm that performs the transformation:

|i |j |0 |i |j |D(vi, cj)

Let us call the state on the right |Ψideal . This is an ideal quantum state for us since |Ψideal helps to perform D2-sampling and to find the k-means cost of clustering, which are the main components of the approximation scheme of Bhattacharya et al. (2020b) that we intend to use. One caveat is that we will only be able to perform the following transformation (instead of the abovementioned transformation) |i |j |0 |i |j |ψi,j , where |ψi,j is an approximation for |D(vi, cj) in a sense that we will make precise below. We will use |Ψreal to denote the state |i |j |ψi,j . This state is prepared using tools such as swap test followed by coherent amplitude estimation, and median estimation. Since these tools and techniques are known from previous works Wiebe et al. (2015); Lloyd et al. (2013); Kerenidis et al. (2019), we summarise the discussion (see Section 4.1 and 4.2 in Kerenidis et al. (2019)) in the following lemma. Lemma 2 (Kerenidis et al. (2019) and Wiebe et al. (2015)). Assume for a data matrix V RN d and a center set matrix C Rt d that the following unitaries: (i) |i |0 |i |vi , (ii) |j |0 |j |cj can be performed in time T and the norms of the vectors are known. For any > 0, there is a quantum algorithm that in time O T log 1

ε computes:

|i |j |0 |i |j |ψi,j ,

where |ψi,j satisfies the following two conditions for every i [N] and j [t]:

|ψi,j 0 ℓ D(vi, cj) E

(ii) For every i, j, D(vi, cj) (1 ε) D(vi, cj).

In the subsequent discussions, we will use T as the time to access the QRAM data structure Kerenidis & Prakash (2017), i.e., for the transitions |i |0 |i |vi and |j |0 |j |cj as given in the above lemma. This is known to be T = O(log2 (Nd)). Moreover, the time to update each entry in this data structure is also T = O(log2 (Nd)). This is the logarithmic factor that is hidden in the O notation. In the following subsections, we discuss the utilities of |Ψreal for the various components of the approximation scheme of Bhattacharya et al. (2020b). During these discussions, it will be easier to see the utility first with the ideal state |Ψideal before the real state |Ψreal that can actually be prepared. We will see how |Ψreal is sufficient within a reasonable error bound.

B.1 FINDING DISTANCE TO CLOSEST CENTER

Let us see how we can estimate the distance of any point to its closest center in a center set C with t k centers. We can use the transformation |i |j |0 |i |j |D(vi, cj) to prepare the following state for any i: |i |D(vi, c1) |D(vi, c2) ... |D(vi, ct)

Published as a conference paper at ICLR 2025

We can then iteratively compare and swap pairs of registers to prepare the state |i minj [t] D(vi, cj) . If we apply the same procedure to |i |ψi,1 ... |ψi,t , then with probability at least (1 2 )t, the

resulting state will be |i minj [t] D(vi, cj) E . So, the contents of the second register will be an

estimate of the distance of the ith point to its closest center in the center set C. This further means that the following state can be prepared with probability at least (1 2 )Nt:7

i=1 |i min j [t] D(vi, cj) .

This quantum state can be used to find the approximate clustering cost of the center set C, which we discuss in the following subsection. However, before we do that, let us summarise the main ideas of this subsection in the following lemma.

Lemma 3. There is a quantum algorithm that, with probability at least (1 2 )Nt, prepares the

quantum state |ΨC in time O T t log 1

B.2 D2-SAMPLING

D2-sampling from the point set V with respect to a center set C Rt d with t centers, samples vi with probability proportional to minj [t] D2(vi, cj). Let us see if we can use our state |ΨC =

N PN i=1 |i minj [t] D(vi, cj) E is useful to perform this sampling. If we can pull out the value of

the second register as the amplitude, then the measurement will give us close to D2-sampling. This is possible since we have an estimate of the clustering cost from the previous subsection. We can use controlled rotations on an ancilla qubit to prepare the state:

i=1 |i βi |0 + p

1 |βi|2 |1 ,

where βi = minj [t] D(vi,cj) 2ζ . Note that βi 1 since ζ is the aspect ratio, and by scaling, we assume that mini,j D(vi, vj) = 1.8 Note that this suggests we must know the aspect ratio (or an upper bound on the ratio), which should be considered an input parameter. This is a standard assumption in parameterised quantum machine learning algorithms such as ours (for example, in the HHL algorithm, a bound on the condition number of the matrix is used as a parameter). The probability of

measurement of (i, 0) is minj [t] D(vi,cj)2

4N ζ2 1 8Nζ2 . Since we do rejection sampling, ignoring (., 1) s

that are sampled with probability 1 1 8ζ2 , we end up sampling with a distribution where the

conditional probability of sampling i is minj [t] D(vi,cj)2

Φ(V,C) (1 ε)4 minj [t] D(vi,cj)2

Φ(V,C) . Moreover,

the rejection sampling gives us a usable sample with high probability in m = O(ζ2 ln 10N) rounds of sampling. This means that points get sampled with a probability close to the actual D2-sampling probability. As mentioned earlier, this is sufficient for the approximation guarantees of Bhattacharya et al. (2020b) to hold. We summarise the observations of this section in the next lemma.

Lemma 4. Let 0 < δ 1 and let m = ζ2 ln 10N. Given a dataset V RN d and a center set C Rt d in the QRAM model, there is a quantum algorithm that runs in time O T tm log 1

with probability at least (1 1/5N) (1 2 )Ntm outputs a sample with D2 distribution such that the distance function D is δ-close to D.

Proof. The proof follows from Lemma 3 and the preceding discussion.

7The state prepared is actually 1

N PN i=1 |i α minj [t] D(vi, cj) E + β |G with |α|2 (1 2 )Nk. However, instead of working with this state, subsequent discussions become much simpler if we assume that |ΨC is prepared with probability |α|2. 8In case scaling only ensures mini,j D(vi, vj) 1 (e.g., vectors with integer coordinates), we can use a bound on maxi,j D(vi, vj) as the parameter instead of the aspect ratio.

Published as a conference paper at ICLR 2025

The above lemma says that for = O( ε2 Ntζ2 ), we obtain the required D2-sample with probability (1 1/N). We can now give proof of Theorem 1, assembling the quantum tools for D2-sampling in this section. Note that we obtain samples from D2 distribution such that D is close to D. Using our quantum sampling algorithm to pick k centers in k rounds is not the classical k-means++ algorithm that requires that the samples be picked with D2 distribution and not something close. The resulting algorithm, however, has been analysed in the literature. If the sampling distribution is within (1 ε) of the D2 distribution, then the algorithm is called the noisy-k-means++ algorithm Bhattacharya et al. (2020a); Grunau et al. (2023). We will use the following result in Grunau et al. (2023). Theorem 5 (Theorem 1.1 in Grunau et al. (2023)). The noisy-k-means++ algorithm is O(log k) approximate, in expectation.

Theorem 1, restated below, follows from the above theorem and Lemma 4. Note that the running time of the quantum implementation is P

t O(ζ2t) = O(ζ2k2). Theorem (Restatement of Theorem 1). There is a quantum implementation of k-means++ that runs in time O(ζ2k2) and gives an O(log k) factor approximate solution for the k-means problem with a probability of at least 0.99. Here, O hides log2 (Nd) and log2 (kd) terms.9

C QUANTUM INSPIRED D2-SAMPLING

Ewin Tang s thesis Tang (2023) nicely categorizes quantum machine learning (QML) algorithms into ones where the quantum advantage (i.e., polylogarithmic running time) is solely because of the quantum access to the data and the ones where it is not (e.g., the HHL algorithm Harrow et al. (2009)). Quantum access means that a (normalized) data vector v := ( v(1), ..., v(d))T can be loaded onto the quantum workspace in O(1) time as |v = v(1) |1 + ... + v(d) |d . In this section, we will use the overhead arrow notation for vectors since we would want to distinguish between vectors and their approximations using different overhead symbols. One implication of having the quantum state |v is that i can be sampled with probability v(i)2 P

j v(j)2 by simply making a measurement in the standard basis. One of the key insights in Tang s Thesis Tang (2023) is that QML algorithms that benefit solely from the quantum data access can be dequantized if we work with an appropriate classical counterpart of the quantum data access that allows sampling access of the kind described above (in addition to the classical access to the data). This is called sample-query access, or SQ access in short. Dequantization means that a classical algorithm can simulate the steps of the quantum algorithm using SQ access with similar running time dependence. The nice property of the SQ access data structures is that it can be constructed classically in linear time. Note that linear time means that we lose the quantum advantage. However, if we keep this aside and count the SQ access data structure construction as preprocessing, the remaining computation has a similar advantage as that of the quantum algorithm. This is a level playing field with the QML algorithm that allows fair comparison, specifically given that setting up quantum access (called QRAM) also takes linear time. Such classical algorithms obtained by dequantization are called quantum inspired algorithms. Our quantum algorithms based on D2-sampling fall into the category of QML algorithms that can be dequantized in Tang s SQ access model. This section discusses the quantum-inspired classical algorithm we obtain by dequantizing our quantum algorithm using the SQ access data structure. Naturally, we start the discussion by looking at the definition of the SQ access data structure. Definition 3 (Query access, Definition 1.1 in Tang (2023)). For a vector v Cn, we have Q( v), query access to v, if for all i [n], we can query for v(i). The time (cost) of such a query is denoted by q( v). Definition 4 (SQ-access to a vector, Definition 1.2 in Tang (2023)). For a vector v Cn, we have SQ( v), sampling and query access to v, if we can:

1. query for entries in v as in Q( v). The time cost is denoted by q( v).

2. obtain independent samples i [n] where the probability of sampling i is v(i)2

v 2 . This distribution is denoted by D v. The time cost is denoted by s( v).

9The output of k-means++ is a subset of data points, which can be stated as a subset of indices. If the description of centers is the expected output, the running time will also include a factor of d.

Published as a conference paper at ICLR 2025

3. query for v . The time cost is denoted by n( v).

Let sq( v) max {q( v), s( v), n( v)} denote the time cost of SQ-access.

In quantum computation an n-dimensional state vector |v0 evolves using unitary transformations |v1 = U |v0 , where U is a 2n 2n unitary matrix. To be able to simulate such operations within the SQ-access model, we must make sure that SQ-access is available for intermediate states of the quantum operation. This means that we must (i) define an appropriate SQ-access notion for matrices and (ii) show closure properties for SQ-access. Fortunately, this has been shown in Tang s Thesis Tang (2023). A more flexible version of SQ-access is needed to show closure properties. This is called oversampling SQ-access. SQ-access for matrices and oversampling SQ-access is defined below. Definition 5 (SQ-access to a matrix). For a matrix A Cm n, we have SQ(A) if we have SQ(A(i, .)) for all i [m] (i.e., SQ-access to the row vectors), and SQ( a), where a is the vector of row norms of A (i.e., a = ( A(1, .) , ..., A(m, .) )T ). The complexity of operations on a matrix is defined as: q(A) := max {q(A(i, .)), q( a)}, n(A) := n( a), s(A) := max {s(A(i, .)), s( a)}, and sq(A) := max {q(A), n(A), s(A)} Definition 6 (Oversampling and query access: Definition 4.4 in Tang (2023)). For v Cn and ϕ 1, we have SQϕ( v), ϕ-oversampling and query access to v (ϕ-OSQ access, in short), if we have Q( v) and SQ( v) for v Cn a vector satisfying (i) v 2 = ϕ v 2 and (ii) | v(i)|2 | v(i)|2 for all i [n]. Denote sϕ( v) := s( v), qϕ( v) := q( v), nϕ( v) := n( v), and sqϕ( v) := max {sϕ( v), qϕ( v), q( v), nϕ( v)}.

Note that D v for v can be seen as an oversampling distribution of D v by a factor of ϕ. This is because the sampling probability of index i is given by

D v(i) = | v(i)|2

v 2 = | v(i)|2

ϕ v 2 | v(i)|2

ϕ v 2 = D v(i)

ϕ-OSQ access also allows one to sample from D v and find v by paying a small time penalty (as a function of ϕ and allowed error). This is captured in the following lemma from Tang (2023). Lemma 5 (Lemma 4.5 in Tang (2023)). Let 0 < δ 1 and 0 < ε 1. Suppose we are given SQϕ( v). Then we can sample from D v with proability at least (1 δ) in time O ϕ sqϕ( v) log 1

also estimate v to within (1 ε) factor with probability at least (1 δ) in O ϕ ε2 sqϕ( v) log 1

We can also define oversampling query access to a matrix. Definition 7 (Oversampling and query access for a matrix). For a matrix A CN d and ϕ 1, we have SQϕ(A), ϕ-oversampling and query access to A (ϕ-OSQ access, in short), if we have

Q(A) and SQ( A) for A CN d satisfying A 2

F = ϕ A 2 F and | A(i, j)|2 |A(i, j)|2 for all

(i, j) [N] [d]. The ϕ-oversampling complexity is given by sqϕ(A) = max {q(A), sq( A)}.

There is a simple tree-based data structure that supports sample-query access for vectors and matrices. The details can be found in Tang (2023). Here, we give a summary of the running time of various operations. Lemma 6 (Remark 4.12 in Tang (2023)). There is a data structure for storing a vector v Rn with supporting the following operations:

1. Reading and updating an entry of v in O(log n) time. So, q( v) = O(log n).

2. Finding v 2 in O(1) time. So, n( v) = O(1).

3. Sampling from D v in O(log n) time. So, s( v) = O(log n)

It follows from the above that sq( v) = O(log n). Moreover, the time to construct the data structure is O(n).

A similar lemma holds for storing matrices.

Published as a conference paper at ICLR 2025

Lemma 7 (Remark 4.12 in Tang (2023)). For any matrix A Rm n, let A denote the vector of row norms of A, i.e., the ith entry of A, denoted by A(i), is A(i, .) . There is a data structure for storing a matrix A Rm n with supporting the following operations:

1. Reading and updating an entry of A in O(log mn) time.

2. Finding A(i) in O(log m) time.

3. Finding A 2 F in O(1) time.

4. Sampling from D A and DA(i,.) in O(log mn) time.

The time to construct the data structure is O(mn).

Let us now see how D2-sampling can be done using the above SQ-access data structures. Let V RN d be the data matrix with the row vectors v1, ..., v N as data points. Let C Rm d be the center set matrix with the centers being the i row vectors c1, ..., cm. Consider D2-sampling a point from V with respect to the center set C. The probability of sampling the ith point is given by:

Pr[i] = mint [m] vi ct 2 P

s [N] mint [m] vs ct 2

Consider vectors w1, ..., wm RN defined as:

wj := ( v1 cj , v2 cj , ..., v N cj )T .

The first step towards D2-sampling using the SQ-access data structure for matrices V and C is to check whether SQ-access for vectors w1, ..., wm can be made available. In the next lemma, we show that ϕ-oversampling SQ-access for an appropriately chosen value of ϕ is possible. Lemma 8. Let α-oversampling SQ access to the data matrix V RN d be available as SQα(V ) and let β oversampling SQ access be available for a vector c Rd as SQβ( c). The we have γ oversampling SQ access, SQγ( w), for vector w := ( V (1, .) c , ..., V (N, .) c )T with γ =

2 w 2 α V 2 + Nβ c 2 . Moreover, the complexity is related as sqγ( w) d (sqα(V )+sqβ( c)).

Proof. SQα(V ) means that Q(V ) and SQ( V ) are available for a matrix V such that (i) V 2 =

α V 2 and (ii) | V (i, j)| |V (i, j)| for all (i, j) [N] [d]. Similarly, SQβ( c) means that Q( c) and SQ( c) are available for a vector c such that (i) c 2 = β c 2 and (ii) | c(t)| | c(t)| for all t [d]. To show SQγ( w), we first need to show Q( w). This is simple since the jth coordinate of w can be found using all the coordinates of V (j, .) and c which are available since we have Q(V ) and Q( c). So, q( w) = d (q(V ) + q( c)). We now need to show SQ( w) for an appropriately chosen w. We consider:

2 V (j, .) 2 + c 2 .

Before we see why SQ access to w is possible, let us first observe that SQ( w) would give a γoversampling SQ access to w. Towards this, we note that

j | w(j)|2 = 2 X

V (j, .) 2 + c 2 = 2

w 2 α V 2 + Nβ c 2 w 2 = γ w 2.

Moreover, we have:

| w(j)| = V (j, .) c V (j, .) + c V (j, .) + c

2 V (j, .) 2 + c 2 = | w(j)|.

Finally, we must show SQ( w). Coordinates of w can be found using V (j, .) s and c which

are available through oversampling access to V and c. w 2 is available through V 2 and c 2.

Published as a conference paper at ICLR 2025

Finally, D w can be sampled by first selecting V with probability V 2

V 2+N c 2 or c with probability

N c 2 V 2+N c 2 and then sampling from D v or U[N] (uniform distribution over [N]), respectively. Given

this, the probability of sampling i works out to be:

V 2 + N c 2

V 2 + N c 2 V 2 + N c 2 1

w 2 = D w(i).

From the above discussion, it follows that sqγ( w) d (sqα(V ) + sqβ( c)).

The previous lemma gives a way to obtain oversampling query access to distances from a single center. In D2-sampling, we have a set C of m centers, and we sample a point based on the squared distance from the nearest center. To enable this operation in the SQ framework, we need sample and query access to a vector where the ith coordinate is the distance of ith data point to the nearest center in C. In the following lemma, we show how to do this. The proof closely follows Lemma 4.6 of Tang (2023). Lemma 9. Let u1, ..., um (R 0)N and ϕ1, ..., ϕm R 1. Let

w := min i [m] ui(1), min i [m] ui(2), ..., min i [m] ui(N) T .

Suppose we have SQϕi( ui) for every i [m]. Then there is a ϕ-oversampling query access,

SQϕ( u) for ϕ =

i [m] ϕi ui 2

m w 2 . The complexity is give by: q( w) = P

i [m] q( ui), qϕ( w) = P

i [m] qϕi( ui), nϕ( w) = P

i [m] nϕi( ui). The sampling complexity is sϕ( w) = maxi [m] sϕi( ui), after a single-time pre-processing cost of P

i [m] nϕi( ui).

Proof. We need to show Q( w) and SQ( w) for an appropriate w. We can respond to each coordinate of w by querying ui s and taking the minimum. This gives q( w) = P

i [m] q( ui). Let u1, ..., um be the vectors corresponding to the oversampling access SQϕ1( u1), ..., SQϕm( um). For oversampling query access, consider:

j [m] | uj(i)|2.

Each coordinate of w can be found by querying ui s and computing the above expression. So, we have Q( w) and qϕ( w) = P

i [m] qϕi( ui). The norm of w is given by:

j [m] | uj(i)|2 = 1

j [m] uj 2 = 1

j [m] ϕj uj 2 =

j [m] ϕj uj 2

w 2 = ϕ w 2.

So, we can query the norm of w by querying the norms of uj s. Hence, nϕ( w) = P

j [m] nϕj( uj). Now, we need to show that w(i) upper bounds w(i). We have:

j [m] | uj(i)|2

j [m] min j [m] | uj (i)|2 = min j [m] uj (i) min j [m] uj(i) = w(i).

Now we describe how to sample from D w. We first query the norms of uj. We pick a j [m]

with probability uj 2 P

j [m] uj 2 , and then sample an index i [N] from the distribution D uj. The

probability of sampling an index i is given by:

uj 2 P j [m] uj 2 | uj(i)|2

j [m] | uj(i)|2

1 m P j [m] uj 2 = | w(i)|2

This is the correct probability for sampling from D w. Note that from the description, the sampling complexity is sϕ( w) = maxi [m] sϕi( ui), after a single-time pre-processing cost of P

i [m] nϕi( ui).

Published as a conference paper at ICLR 2025

We now have all the basic tools to perform D2-sampling in the SQ access model and evaluate its time complexity. In the remaining discussion, we will assume that the origin is centered at one of the data points so that the maximum norm of a data vector is upper bounded by the maximum interpoint distance.

Theorem 6. Let 0 < δ 1. Let V RN d be the dataset and let c1, ..., cm Rd be m arbitrary centers from the dataset and let SQ(V ) and SQ( c1), ..., SQ( cm) be available. Let ζ denote the aspect ratio of the dataset (i.e., the ratio of the maximum to minimum interpoint distance). Then there is a classical algorithm that with probability at least (1 δ) outputs a sample from the D2-distribution (i.e., D2-sample) of V with respect to centers { c1, ..., cm} which runs in time O(ζ2md(log 1

δ ) log Nd).

Proof. Let dnorm be the maximum norm of a data point. Let dmin and dmax be the minimum and maximum distance between two data points, respectively. Then, we have ζ = dmax

dmin . Let us define vectors u1, ..., um where uj(i) := V (i, .) cj . Using Lemma 8, we can enable SQϕj( uj) for every j [m], for ϕj = 2 uj 2 ( V 2 +N cj 2). Lemma 9 gives a way to define SQϕ( w), where

w is defined as w(i) := minj [m] | uj(i)| and ϕ =

j [m] ϕj uj 2

m w 2 . Note that for the w defined, D w is

precisely the D2 distribution over V with respect to centers c1, ..., cm and from Lemma 5, we know how to use SQϕ( w) to sample from D w with probability at least (1 δ) in time O(ϕ sqϕ( w) log 1

δ ). For calculating the complexity, we first get a bound on ϕ:

P j [m] ϕj uj 2

j [m] 2 uj 2 ( V 2 + N cj 2) uj 2

m w 2 = 2 V 2

j [m] cj 2.

The ith coordinate of w is the least distance of V (i, .) from a center. This gives us w 2 (N m) d2 min. Also, V 2 Nd2 norm and cj 2 d2 norm. Putting these bounds in the above

equation (and that m N/2), we get ϕ 8 d2 norm d2 min 8ζ2. From Lemmas 6 and 7, we have sq(V ) = O(log Nd), sq( cj) = O(log d). From Lemma 8, we have sqϕj( uj) = d (sq(V ) + sq( cj)) = O(d log Nd) for every j [m]. From Lemma 9, we have sqϕ( w) = P

j [m] sqϕj( uj) = O(md log Nd). So, from Lemma 5, the complexity of D2-sampling a point is O(ϕ sqϕ( w) log 1

δ ) = O(ζ2md(log 1

δ ) log Nd).

The k-means++ algorithm is simply D2-sampling k centers iteratively while updating the list of centers. We summarise the discussion of this section by giving the complexity of the k-means++ algorithm and the proof of Theorem 2. We restate the theorem for ease of reading.

Theorem (Restatement of Theorem 2). There is a classical implementation of k-means++ (which we call QI-k-means++) that runs in time O(Nd) + O(ζ2k2d). Moreover, with probability at least 0.99, QI-k-means++ outputs k centers that give O(log k)-approximation in expectation for the k-means problem.

Proof. The Nd term in the running time expression is for setting up SQ(V ), the sample-query access to the data matrix V . This may also be regarded as the preprocessing step. This is followed by a D2-sampling k centers iteratively while updating the set of centers by including the new centers sampled. From the previous theorem, we know that when we have m centers, the cost of obtaining a D2-sample is O(ζ2md(log 1

δ ) log Nd) with probability at least (1 δ). To obtain an overall probability of success of 0.99, we need to have a success probability of at least 1 O(1/k) in each iteration. This makes the overall D2-sampling cost O(ζ2k2d log k log Nd). So, the overall time complexity of QI-k-means++ is O (Nd) + O(ζ2k2d). Since with probability at least 0.99, we obtain k perfectly D2-sampled center, the approximation guarantee remains the same as that of k-means++, which is O(log k).10

10What happens with the remaining 0.01 probability? D2-sampling in the SQ model succeeding with probability at least (1 δ) means that with probability at least (1 δ) a center is sampled with D2 distribution, and with the remaining probability, no point is sampled.

Published as a conference paper at ICLR 2025

C.1 QUANTUM INSPIRED APPROXIMATE D2-SAMPLING

Sampling from a distribution that is ε-close to the D2 distribution is sufficient to obtain O(log k) approximation. This follows from the analysis of the noisy-k-means++ algorithm Grunau et al. (2023). This relaxation (without any approximation penalty beyond constant factors) allows us to shave off the factor of d from the time complexity of the quantum-inspired algorithm. However, as we will see in the analysis of this section, it is at the cost of worsening the dependence on the aspect ratio. So, the algorithm in this section should be used for high-dimensional cases where the aspect ratio is small. The factor of d in the sampling component of QI-k-means++ appears because for enabling the oversampling and query access for the vector w := ( V (1, .) c , ..., V (N, .) c )T ( c is a center), we need to access the d coordinates of V (i, .) and c to be able to answer the query access to the ith

index of w, which is V (i, .) c . Note that V (i, .) c = q

V (i, .) 2 + c 2 2 V (i, .), c ,

where V (i, .), c denotes the dot product. The terms V (i, .) 2 and c 2 are available in O(1) time. The dot product term is costly and adds a factor of d. So, the key idea in eliminating the factor of d is to replace the dot product term with an appropriate approximation that requires a much smaller access time. This uses the standard sampling trick - randomly sample a subset of indices, compute the dot product restricted to these indices of the subset, and scale (see Lemma 5.17 in Tang (2023)). The number of indices we need to sample to ensure that the resulting approximation is within (1 ε) of V (i, .) c for every i with probability at least (1 δ) is O log (1/δ)

ε2 c 2 V (i,.) 2

V (i,.) c 2 =

O ζ4 log 1/δ

ε2 . The remaining steps are the same as those in the exact version, except that d is

replaced with O ζ4 log 1/δ

ε2 . This results in the overall running time O(Nd) + O ζ6k2 .

D A QUANTUM APPROXIMATION SCHEME (PROOF OF THEOREM 4)

We start the discussion with the D2-sampling method. In particular, we would like to check the robustness of the approximation guarantee provided by the D2-sampling method against errors in estimating the distances between points. We will show that the D2-sampling method gives a constant pseudo-approximation even under sampling errors.

D.1 PSEUDOAPPROXIMATION USING D2-SAMPLING

Let the matrix V RN d denote the dataset, where row i contains the ith data point vi Rd. Let the matrix C Rt d denote any t-center set, where row i contains the ith center ci Rd out of the t centers. Sampling a data point using the D2 distribution w.r.t. (short for with respect to) a center set C means that the datapoint vi gets sampled with probability proportional to the squared distance to its nearest center in the center set C. This is also known as D2 sampling w.r.t. center set C. More formally, data points are sampled using the distribution D2(v1,C) P

j D2(vj,C), ..., D2(v N,C) P

j D2(vj,C) ,

where D2(vj, C) minc C D2(vj, c). For the special case C = , D2 sampling is the same as uniform sampling. The k-means++ seeding algorithm starts with an empty center set C and, over k iterations, adds a center to C in every iteration by D2 sampling w.r.t. the current center set C. It is known from the result of Arthur & Vassilvitskii (2007) that this k-means++ algorithm above gives an O(log k) approximation in expectation. It is also known from the result of Aggarwal et al. (2009) that if 2k centers are sampled, instead of k (i.e., the for-loop runs from 1 to 2k), the cost with respect to these 2k centers is at most some constant times the optimal k-means cost. Such an algorithm is called a pseudo approximation algorithm. Such a pseudo approximation algorithm is sufficient for the approximation scheme of Bhattacharya et al. (2020b). So, we will quantize the following constant factor pseudo-approximation algorithm.

In the quantum simulation of the above sampling procedure, there will be small errors in the sampling probabilities in each iteration. We need to ensure that the constant approximation guarantee of the above procedure is robust against small errors in the sampling probabilities owing to errors in distance estimation. We will work with a relative error of (1 δ) for small δ. Following is a crucial lemma from Arthur & Vassilvitskii (2007) that we will need to show the pseudo approximation property of Algorithm 1.

Published as a conference paper at ICLR 2025

Algorithm 3 A pseudo-approximation algorithm based on D2-sampling.

Input: (V, k) C {} for i = 1 to 2k do

Pick c using D2-sampling w.r.t. center set C C := C {c} end for return C

Lemma 10 (Lemma 3.2 in Arthur & Vassilvitskii (2007)). Let A be an arbitrary optimal cluster, and let C be an arbitrary set of centers. Let c be a center chosen from A with D2-sampling with respect to C. Then E[cost(A, C {c})] 8 OPT(A).

The above lemma is used as a black box in the analysis of Algorithm 1 in Aggarwal et al. (2009). The following version of the lemma holds for distance estimates with a relative error of (1 δ) and gives a constant factor approximation guarantee. Since Lemma 10 is used as a black box in the analysis of Algorithm 1, replacing this lemma with Lemma 11 also gives a constant factor approximation to the k-means objective. We will use the following notion of the closeness of two distance functions.

Definition 8. A distance function D1 is said to be δ-close to distance function D2, denoted by D1 δ D2, if for every pair of points x, y Rd, D1(x, y) (1 δ) D2(x, y).11

Lemma 11. Let 0 < δ 1/2. Let A be an arbitrary optimal cluster and C be an arbitrary set of centers. Let c be a center chosen from A with D2-sampling with respect to C, where D δ D. Then E[cost(A, C {c})] 72 OPT(A).

Proof. Let D(a) denote the distance of the point a from the nearest center in C and let D(a) denote the estimated distance. We have D(a) D(a) (1 δ). The following expression gives the expectation:

a A D2(a) X

a A min D2(a ), D2(a , a0)

Note that for all a0, a A, D(a0) D(a) + D(a, a0). This gives D(a0) 1+δ 1 δ D(a) +

(1 + δ) D(a0, a), which further gives D2(a0) 2 1+δ 1 δ 2 D2(a) + 2(1 + δ)2 D2(a0, a) and

D2(a0) 2 |A| 1+δ 1 δ 2 P

a A D2(a) + 2 |A|(1 + δ)2 P

a A D2(a0, a). We use this to obtain the

following upper bound on the expectation E[cost(A, C {c})]:

a A D2(a) X

a A min D2(a ), D2(a , a0)

2 |A| 1+δ 1 δ 2 P a A D2(a)

a A D2(a) X

a A min D2(a ), D2(a , a0) +

2 |A|(1 + δ)2 P

a A D2(a0, a)

a A D2(a) X

a A min D2(a ), D2(a , a0)

2 D2(a , a0) + X

2 D(a0, a)2

a A D2(a0, a)

11We use the notation that for positive reals P, Q, P (1 δ) Q if (1 δ) Q P (1 + δ) Q.

Published as a conference paper at ICLR 2025

This completes the proof of the lemma.

We will use this lemma in the approximation scheme of Bhattacharya et al. (2020b). However, this lemma may be of independent interest as this gives a quantum pseudo approximation algorithm with a constant factor approximation that runs in time that is polylogarithmic in the data size and linear in k and d. We will discuss this quantum algorithm in the next section.

D.2 APPROXIMATION SCHEME OF BHATTACHARYA ET AL. (2020B)

A high-level description of the approximation scheme of Bhattacharya et al. (2020b) was given in the introduction. A more detailed pseudocode is given in Algorithm 4. In addition to the input

Algorithm 4 Algorithm of Bhattacharya et al. (2020b). For our Quantum application, we will replaces D with δ-close D.

1: Input: (V, k, ε, Cinit), where V is the dataset, k > 0 is the number of clusters, ε > 0 is the error parameter, and Cinit is a k center set that gives constant (pseudo)approximation. 2: Output: A list L of k center sets such that for at least one C L, Φ(V, C) (1 + ε) OPT. 3: Constants: ρ = O( k

ε4 ); τ = O( 1

ε) 4: L ; count 1 5: repeat 6: Sample a multi-set M of ρk points from V using D2-sampling wrt center set Cinit 7: M M {τk copies of each element in Cinit} 8: for all disjoint subsets S1, ..., Sk of M such that i, |Si| = τ do 9: L L (µ(S1), ..., µ(Sk)) 10: end for 11: count++ 12: until count < 2k

13: return L

instance (V, k) and error parameter ε, the algorithm is also given a constant approximate solution Cinit, which is used for D2-sampling. A pseudoapproximate solution Cinit is sufficient for the analysis in Bhattacharya et al. (2020b). The discussion from the previous subsection gives a robust algorithm that outputs a pseudoapproximate solution even under errors in distance estimates. So, the input requirement of Algorithm 4 can be met. Now, the main ingredient being D2-sampling, we need to ensure that errors in distance estimate do not seriously impact the approximation analysis of Algorithm 4. We state the main theorem of Bhattacharya et al. (2020b) before giving the analogous statement for the modified algorithm where D is replaced with D that is δ-close to D.

Theorem 7 (Theorem 1 in Bhattacharya et al. (2020b)). Let 0 < ε 1/2 be the error parameter, V RN d be the dataset, k be a positive integer, and let Cinit be a constant (pseudo)approximate solution for dataset V . Let L be the list returned by Algorithm 4 on input (V, k, ε, Cinit) using the Euclidean distance function D. Then with probability at least 3/4, L contains a center set C such

that Φ(V, C) (1 + ε) OPT. Moreover, |L| = O 2 O( k

ε ) and the running time of the algorithm

is O(Nd|L|).

We give the analogous theorem with access to the Euclidean distance function D replaced with a function D that is δ-close to D.

Theorem 8. Let 0 < ε 1

2 be the error parameter, 0 < δ < 1/2 be the closeness parameter, V RN d be the dataset, k be a positive integer, and let Cinit be a constant (pseudo)approximate solution for dataset V . Let L be the list returned by Algorithm 4 on input (V, k, ε, Cinit) using the distance function D that is δ-close to the Euclidean distance function D. Then with probability at least 3/4, L contains a center set C such that Φ(V, C) (1 + ε) OPT. Moreover, |L| = 2 O( k

ε ) and the running time of the algorithm is O(Nd|L|).

Published as a conference paper at ICLR 2025

The proof of the above theorem closely follows the proof of Theorem 7 of Bhattacharya et al. (2020b). This is similar to the proof of Theorem 11 that we saw earlier, closely following the proof of Lemma 10. The minor changes are related to approximate distance estimates using D instead of real estimates using D. The statement of Theorem 8 is not surprising in this light. Instead of repeating the entire proof of Bhattacharya et al. (2020b), we point out the one change in their argument caused by using D instead of D as the distance function. The analysis of Bhattacharya et al. (2020b) works by partitioning the points in any optimal cluster Xj into those that are close to Cinit and those that are far. For the far points, it is shown that when doing D2-sampling, a far point will be sampled with probability at least γ times the uniform sampling probability (see Lemma 21 in Goyal et al. (2020), which is a full version of Bhattacharya et al. (2020b)). It then argues that a reasonable size set of D2-sampled points will contain a uniform sub-sample. A combination of the uniform sub-sample along with copies of points in Cinit gives a good center for this optimal cluster Xj. Replacing D with D decrease the value of γ by a multiplicative factor of (1 δ)2

(1+δ)2 (1 δ)4. This means that the number of points sampled should increase by a factor of O( 1 (1 δ)4 ). This means that the list size

increases to O 2 O( k ε(1 δ) ) . Note that when δ 1

2, the list size and running time retain the same

form as that in Bhattacharya et al. (2020b) (i.e., |L| = 2 O( k

ε ) and time O(Nd|L|)). A detailed proof of Theorem 8 is provided in the next section of the Appendix.

D.3 THE QUANTUM ALGORITHM

The main idea of the quantum version of the approximation scheme of Bhattacharya et al. (2020b) is to D2-sample a set S of poly(k/ε) points with respect to a center set Cinit that gives constant approximation to the k-means objective. We then try out all various subsets of possibilities for the k center set out of S and Cinit and then find the one with the least cost. Cinit is itself found quantumly by iteratively D2-sampling 2k points. So, most of the quantum ideas developed for the quantum version of k-means++ can be reused in the quantum approximation scheme. An additional effort is in to calculate the k-means cost of a list of L = 2 O( k

ε ) k-center sets and pick the one with the least cost. Here, instead of adding the relevant steps only, we add all the steps of the quantum algorithm for better readability.

We will work under the assumption that the minimum distance between two data points is 1, which can be achieved using scaling. This makes the aspect ratio ζ simply the maximum distance between two data points. We will use i for an index into the rows of the data matrix V RN d, and j for an index into the rows of the center matrix C R2k d. We would ideally like to design a quantum algorithm that performs the transformation:

|i |j |0 |i |j |D(vi, cj)

Let us call the state on the right |Ψideal . This is an ideal quantum state for us since |Ψideal helps to perform D2-sampling and to find the k-means cost of clustering, which are the main components of the approximation scheme of Bhattacharya et al. (2020b) that we intend to use. One caveat is that we will only be able to perform the following transformation (instead of the abovementioned transformation) |i |j |0 |i |j |ψi,j ,

where |ψi,j is an approximation for |D(vi, cj) in a sense that we will make precise below. We will use |Ψreal to denote the state |i |j |ψi,j . This state is prepared using tools such as swap test followed by coherent amplitude estimation, and median estimation. Since these tools and techniques are known from previous works Wiebe et al. (2015); Lloyd et al. (2013); Kerenidis et al. (2019), we summarise the discussion (see Section 4.1 and 4.2 in Kerenidis et al. (2019)) in the following lemma.

Lemma 12 (Kerenidis et al. (2019) and Wiebe et al. (2015)). Assume for a data matrix V RN d and a center set matrix C Rt d that the following unitaries: (i) |i |0 |i |vi , (ii) |j |0 |j |cj can be performed in time T and the norms of the vectors are known. For any > 0, there is a quantum algorithm that in time O T log 1

ε computes:

|i |j |0 |i |j |ψi,j ,

where |ψi,j satisfies the following two conditions for every i [N] and j [t]:

Published as a conference paper at ICLR 2025

|ψi,j 0 ℓ D(vi, cj) E

(ii) For every i, j, D(vi, cj) (1 ε) D(vi, cj).

In the subsequent discussions, we will use T as the time to access the QRAM data structure Kerenidis & Prakash (2017), i.e., for the transitions |i |0 |i |vi and |j |0 |j |cj as given in the above lemma. This is known to be T = O(log2 (Nd)). Moreover, the time to update each entry in this data structure is also T = O(log2 (Nd)). This is the logarithmic factor that is hidden in the O notation. In the following subsections, we discuss the utilities of |Ψreal for the various components of the approximation scheme of Bhattacharya et al. (2020b). During these discussions, it will be easier to see the utility first with the ideal state |Ψideal before the real state |Ψreal that can actually be prepared. We will see how |Ψreal is sufficient within a reasonable error bound.

D.4 FINDING DISTANCE TO CLOSEST CENTER

Let us see how we can estimate the distance of any point to its closest center in a center set C := Cinit with t := 2k centers. We can use the transformation |i |j |0 |i |j |D(vi, cj) to prepare the following state for any i: |i |D(vi, c1) |D(vi, c2) ... |D(vi, ct) We can then iteratively compare and swap pairs of registers to prepare the state |i minj [t] D(vi, cj) . If we apply the same procedure to |i |ψi,1 ... |ψi,t , then with probability at least (1 2 )t, the

resulting state will be |i minj [t] D(vi, cj) E . So, the contents of the second register will be an

estimate of the distance of the ith point to its closest center in the center set C. This further means that the following state can be prepared with probability at least (1 2 )Nt:12

i=1 |i min j [t] D(vi, cj) .

This quantum state can be used to find the approximate clustering cost of the center set C, which we discuss in the following subsection. However, before we do that, let us summarise the main ideas of this subsection in the following lemma. Lemma 13. There is a quantum algorithm that, with probability at least (1 2 )Nt, prepares the

quantum state |ΨC in time O T t log 1

D.5 COMPUTING COST OF CLUSTERING

Suppose we want to compute the k-means cost, Φ(V, C) PN i=1 minj [t] D2(vi, cj), of the clustering given by a t center set C. We can prepare m copies of the state |ΨC and then estimate the cost of clustering by measuring m copies of this quantum state and summing the squares of the second registers. If m is sufficiently large, we obtain a close estimate of Φ(V, C). To show this formally, we will use the following Hoeffding tail inequality. Theorem 9 (Hoeffding bound). Let X1, ..., Xm be independent, bounded random variables such that Xi [a, b]. Let Sm = X1 + ... + Xm . Then for any θ > 0, we have:

Pr[|Sm E[Sm]| θ] 2 e

Let X1, ..., Xm denote the square of the measured value of the second register in |ΨC . These are random values in the range [1, χ2], where χ = maxi,j D(vi, cj) (1 ε) maxi,j D(vi, vj). Note that by scaling, we assume that mini,j D(vi, vj) = 1. So, χ can be upper bounded in terms of the aspect ratio ζ as χ 2ζ.13 First, we note the expectation of these random variables equals

12The state prepared is actually 1

N PN i=1 |i α minj [t] D(vi, cj) E + β |G with |α|2 (1 2 )Nt. However, instead of working with this state, subsequent discussions become much simpler if we assume that |ΨC is prepared with probability |α|2. 13In cases scaling only ensures mini,j D(vi, vj) 1 (e.g., vectors with integer coordinates), we can use maxi,j D(vi, vj) as the parameter, instead of aspect ratio.

Published as a conference paper at ICLR 2025

N , where Φ(V, C) PN i=1 minj [t] D(vi, cj)2 (1 ε)2 Φ(V, C). We define the variable Sm = X1 + X2 + ... + Xm and apply the Hoeffding bound on these bounded random variables to get a concentration result that can then be used.

Lemma 14. Let αm = Sm N

m and L > 0. If m = O ζ4 ln (10L)

ε2 , then we have:

Pr[αm (1 ε) Φ(V, C)] 1 1

Proof. We know that E[Sm] = m

N Φ(V, C) From the Hoeffding tail inequality, we get the following:

Pr[|Sm E[Sm]| ε E[Sm]] 2 e

2 e ln (10L) 1

This implies that:

Pr[|αm Φ(V, C)| ε Φ(V, C)] 1

This completes the proof of the lemma.

So, conditioned on having m copies of the state |ΨC , we get an estimate of the clustering cost within a relative error of (1 ε)3 with probability at least (1 1 5L). Removing the conditioning, we get the same with probability at least (1 2 )Nkm (1 1 5L). We want to use the above cost estimation technique to calculate the cost for a list of center sets C1, ..., CL, and then pick the center set from the list with the least cost. We must apply the union bound appropriately to do this with high probability. We summarise these results in the following lemma. Let us first set some of the parameters with values that we will use to implement the approximation scheme of Bhattacharya et al. (2020b).

L denotes the size of the list of k-center sets we will iterate over to find the one with the least cost. This quantity is bounded as L = k

m is the number of copies of the state |ΨC made to estimate the cost of the center set C. This, as given is Lemma 14 is m = ζ4 ln (10L)

ε2 , where ζ = (1 + ε) maxi,j D(vi, cj).

Lemma 15. Let L = k

ε ), m = ζ4 ln (10L)

ε2 , and = O 1 Nkm L . Given a point set V and a list of center sets C1, ..., CL in the QRAM model, there is a quantum algorithm that runs in time O 2 O( k

ε )Tζ4 and outputs an index l such that Φ(V, Cl) (1 + ε)3 minj L Φ(V, Cj) with

probbaility at least 3

Proof. The algorithm estimates the cost of C1, ..., CL using m copies each of |ΨC1 , ..., |ΨCL and

picks the index with the minimum value in time O T km L log 1

ε . Plugging the values of L, m, and we get the running time stated in the lemma.

Let us bound the error probability of this procedure. By Lemma 3, the probability that we do not have the correct m copies each of |ΨC1 , ..., |ΨCL is bounded by 1 (1 2 )Nkm L. Conditioned on having |ΨC1 , ..., |ΨCL , the probability that there exists an index j [L], where the estimate is off by more than a (1 ε)3 factor is upper bounded by 1

5 by the union bound. So, the probability that the algorithm will find an index l such that Φ(V, Cl) > (1 + ε)3 minj [L] Φ(V, Cj) is upper bounded by 1 (1 2 )Nkm L + 1

5. This probability is at most 2

5 since = O( 1 Nkm L). This completes the proof of the lemma.

D.6 D2-SAMPLING

D2-sampling from the point set V with respect to a center set C Rt d with t centers, samples vi with probability proportional to minj [t] D2(vi, cj). Let us see if we can use our state |ΨC =

Published as a conference paper at ICLR 2025

N PN i=1 |i minj [t] D(vi, cj) E is useful to perform this sampling. If we can pull out the value of

the second register as the amplitude, then the measurement will give us close to D2-sampling. This is possible since we have an estimate of the clustering cost from the previous subsection. We can use controlled rotations on an ancilla qubit to prepare the state:

i=1 |i βi |0 + p

1 |βi|2 |1 ,

where βi = minj [t] D(vi,cj)

4 Φ(V,C)/N . So, the probability of measurement of (i, 0) is minj [t] D(vi,cj)2

4ζ2 Φ(V,C) . Since

we do rejection sampling, ignoring (., 1) s that are sampled with probability (1 1 4ζ2 ), we

end up sampling with a distribution where the probability of sampling i is minj [t] D(vi,cj)2

(1 ε)4 minj [t] D(vi,cj)2

Φ(V,C) . Moreover, we get a usable sample with high probability in at most O(ζ2 log 10N) rounds of sampling. This means that points get sampled with a probability close to the actual D2-sampling probability. As we have mentioned earlier, this is sufficient for the approximation guarantees of Bhattacharya et al. (2020b) to hold. We summarise the observations of this section in the next lemma.

Lemma 16. Given a dataset V RN d and a center set C Rt d in the QRAM model, there is a quantum algorithm that runs in time O T tm S log 1

δ and with probability at least (1 2 )Ntm S

outputs S independent samples with D2 distribution such that the distance function D is δ-close to D.

Proof. The proof follows from Lemma 13 and the preceding discussion.

The above lemma says that for = O( 1 Nkm S ), we obtain the required samples with high probability. We can now give proof of Theorem 4, assembling the quantum tools of this section. We restate the theorem below for ease of reading.

Theorem (Restatement of Theorem 4). Let 0 < ε < 1/2 be the error parameter. There is a quantum algorithm that, when given QRAM data structure access to a dataset V RN d, runs in time O 2 O( k

ε )dζO(1) and outputs a k center set C Rk d such that with high probability

Φ(V, C) (1 + ε) OPT. Here, ζ is the aspect ratio, i.e., the ratio of the maximum to the minimum distance between two given points in V .14

Proof of Theorem 4. The first requirement for executing the algorithm of Bhattacharya et al. (2020b) is a constant pseudo approximation algorithm, using which we obtain the initial center set Cinit. By Lemma 11, we know that 2k points sampled using D2-sampling give such a center set. From Lemma 16, this can be done quantumly in time O( k2ζO(1)

ε2 ). The algorithm of Bhattacharya et al. (2020b) has an outer repeat loop for probability amplification. Within the outer loop, poly( k

ε ) points are D2-sampled with respect to the center set Cinit (line 6). This can again be done quantumly using Lemma 16 in time O(ζO(1)(k/ε)O(1)). We can then classically process the point set M (see line 7 in Algorithm 4) and create the QRAM data structure for the list C1, ..., CL of k-center sets that correspond to all possible disjoint subsets of M (see line 8 in Algorithm 4). This takes time O(Lkd), where L = k

ε ). Theorem 8 shows that at least one center set in the list gives (1 + ε)- approximation. We use this fact in conjunction with the result of Lemma 15 to get that the underlying quantum algorithm runs in time O(2 O( k

ε )dζO(1)) and with high probability outputs a center set C such that Φ(V, C ) (1 + ε)4 OPT.15

14The O notation hides logarithmic factors in N. The O in the exponent hides logarithmic factors in k and 1/ε. 15We needed (1 + ε), but got (1 + ε)4 instead. However, this can be handled with ε = ε/5.

Published as a conference paper at ICLR 2025

E A ROBUST APPROXIMATION SCHEME FOR k-MEANS (PROOF OF THEOREM 8)

This section gives the proof of Theorem 8. We restate the theorem and the algorithm for readability. We state the algorithm of Bhattacharya et al. Bhattacharya et al. (2020b) replacing D2-sampling with D2-sampling.

Algorithm 5 Algorithm of Bhattacharya et al. (2020b) with D replaced with δ-close D

1: Input: (V, k, ε, Cinit), where V is the dataset, k > 0 is the number of clusters, ε > 0 is the error parameter, and Cinit is a k center set that gives constant (pseudo)approximation. 2: Output: A list L of k center sets such that for at least one C L, Φ(V, C ) (1 + ε) OPT. 3: Constants: ρ = O( k (1 δ)4ε4 ); τ = O( 1

4: L ; count 1 5: repeat 6: Sample a multi-set M of ρk points from V using D2-sampling wrt center set Cinit 7: M M {τk copies of each element in Cinit} 8: for all disjoint subsets S1, ..., Sk of M such that i, |Si| = τ do 9: L L (µ(S1), ..., µ(Sk)) 10: end for 11: count++ 12: until count < 2k

13: return L

Theorem (Restatement of Theorem 8). Let 0 < ε 1

2 be the error parameter, 0 < δ < 1/2 be the closeness parameter, V RN d be the dataset, k be a positive integer, and let Cinit be a constant (pseudo)approximate solution for dataset V . Let L be the list returned by Algorithm 5 on input (V, k, ε, Cinit) using the distance function D that is δ-close to the Euclidean distance function D. Then with probability at least 3/4, L contains a center set C such that Φ(V, C) (1 + ε) OPT. Moreover, |L| = 2 O( k

ε ) and the running time of the algorithm is O(Nd|L|).

Note that what this theorem essentially says is that if the error in the distance estimates is bounded within a multiplicative factor of (1 δ) with δ 1/2, then the guarantees of the algorithm of Bhattacharya et al. Bhattacharya et al. (2020b) do not change except for some constant factors (that get absorbed in the O of the exponent).

We need to define a few quantities that will be used to prove the above theorem. Let {X1, ..., Xk} be the optimal clusters. It is well-known that the center that optimizes the 1-means cost for any pointset is the centroid of the pointset. We denote the centroid of any pointset Y Rd with µ(Y )

y Y y |Y | . This follows from the following well-known fact:

Fact 1. For any Y Rd and any c Rd, we have P

y Y y c 2 = P

y Y y µ(Y ) 2 + |Y | c µ(Y ) 2.

We have used the norm notation instead of D(., .) for the Euclidean distance function. This is to easily distinguish between the usage of D and D. We define the optimal 1-means cost for pointset Xi as (Xi) P

x Xi x µ(Xi) 2. We denote the optimal k-means cost using OPT Pk i=1 (Xi). We will use the following sampling lemma from Inaba et al. Inaba et al. (1994), which says that the centroid of a small set of uniformly sampled points from the dataset Y is a good center with respect to the 1-means cost for dataset Y. Lemma 17 (Inaba et al. (1994)). Let S be a set of points obtained by independently sampling M points with replacement uniformly at random from a point set Y Rd. Then for any δ > 0,

Pr Φ(µ(S), Y ) 1 + 1 δM

(Y ) (1 δ).

We will also use the following approximate triangle inequality that holds for squared distances.

Fact 2 (Approximate triangle inequality). For any x, y, z Rd, we have x z 2 2 ( x y 2 + y z 2).

Published as a conference paper at ICLR 2025

The proof of Theorem 8 follows from the following theorem, which we will prove in the remaining section. Theorem 10. Let 0 < ε, δ 1/2. Let L denote the list returned by Algorithm 5 on input (V, k, ε, Cinit). Then with probability at least 3/4, L contains a center set C such that

Φ (C, {X1, ..., Xk}) 1 + ε

i=1 (Xi) + ε

Moreover, |L| = 2 O( k

ε ) and the running time of the algorithm is O(nd|L|).

Let Cinit be an (α, β)-approximate solution to the k-means problem on the dataset V . That is:

Φ(V, Cinit) α OPT and |Cinit| β k.

Let us now analyze the algorithm. Note that the outer repetition (lines 5-12) operation is to amplify the probability of the list containing a good k center set. We will show that the probability of finding a good k center set in one iteration is at least (3/4)k, and the theorem will follow from standard probability analysis. So, we shall focus on only one iteration of the outer loop. Consider the multi-set M obtained on line (7) of Algorithm 5. We will show that with probability at least (3/4)k, there are disjoint multi-subsets T1, ..., Tk each of size τ such that for every i = 1, ..., k:

Φ(µ(Ti), Xi) 1 + ε

2k OPT. (5)

Since we try all possible subsets of M in line (8), we will get the desired result. More precisely, we will argue in the following manner: consider the multi-set C = {τk copies of each element in Cinit}. We can interpret C as a union of multi-sets C 1, C 2, ..., C k, where C i = {τ copies of each element in Cinit}. Also, since M consists of ρk independently sampled points, we can interpret M as a union of multi-sets M 1, M 2, ..., M k where M 1 is the first ρ points sampled, M 2 is the second ρ points and so on. For all i = 1, ..., k, let Mi = C i (M i Xi).16 We will show that for every i {1, ..., k}, with probability at least (3/4), Mi contains a subset Ti of size τ that satisfies Eqn. (5). Note that Ti s being disjoint follows from the definition of Mi. It will be sufficient to prove the following lemma. Lemma 18. Consider the sets M1, ..., Mk as defined above. For any i {1, ..., k},

Pr h Ti Mi s.t. |Ti| = τ and Φ(µ(Ti), Xi) 1 + ε

We prove the above lemma in the remaining discussion. We do a case analysis that is based on whether Φ(Cinit,Xi)

Φ(Cinit,V ) is large or small for a particular i {1, ..., k}.

Φ(Cinit, Xi) ε 6αk Φ(Cinit, V ) : Here, we will show that there is a subset Ti C i Mi that satisfies Eqn. (5).

Φ(Cinit, Xi) > ε 6αk Φ(Cinit, V ) : Here we will show that Mi contains a subset Ti such that Φ(µ(Ti), Xi) 1 + ε

2 (Xi) and hence Ti also satisfies Eqn. (5).

We will discuss these two cases next.

Case-I: Φ(Cinit, Xi) ε 6αk Φ(Cinit, V ) First, observe that:

Φ(Cinit, Xi) ε

6k OPT (since Cinit is an α-approximate solution) (6)

For any point x V , let c(x) denote the center in the set C that is closest to x. That is, c(x) = arg minc Cinit c x . Given this definition, note that: X

x Xi x c(x) 2 = Φ(Cinit, Xi) (7)

We define the multi-set X i = {c(x) : x Xi}. Let m and m denote the centroids of the point sets Xi and X i, respectively. So, we have (Xi) = Φ(m, Xi) and (X i) = Φ(m , X i). We will show that (Xi) (X i). First, we bound the distance between m and m .

16M i Xi in this case, denotes those points in the multi-set M i that belongs to Xi.

Published as a conference paper at ICLR 2025

Lemma 19. m m 2 Φ(Cinit,Xi)

Proof. We have:

x Xi(x c(x)) 2

x Xj (x c(x)) 2

|Xj| = Φ(Cinit, Xj)

where the second to last inequality follows from Cauchy-Schwartz.

Lemma 20. (X i) 2 Φ(Cinit, Xi) + 2 (Xi).

Proof. We have:

x Xi c(x) m 2 X

x Xi c(x) m 2

(F act 2) 2 X

x Xi ( c(x) x 2 + x m 2) = 2 Φ(Cinit, Xi) + 2 (Xi)

This completes the proof of the lemma.

We now show that a good center for X i will also be a good center for Xi. Lemma 21. Let m be a point such that Φ(m , X i) (1 + ε

8) (X i). Then Φ(m , Xi) (1 + ε

2) (Xi) + ε 2k OPT.

Proof. We have:

Φ(m , Xi) (F act 1) = X

x Xi x m 2 + |Xi| m m 2

(F act 2) (Xi) + 2|Xi| ( m m 2 + m m 2)

(Lemma 19) (Xi) + 2 Φ(Cinit, Xi) + 2|Xi| m m 2

(F act 1) (Xi) + 2 Φ(Cinit, Xi) + 2(Φ(m , X i) (X i))

(Xi) + 2 Φ(C, Xi) + ε

(Lemma 20) (Xi) + 2 Φ(Cinit, Xi) + ε

2 (Φ(Cinit, Xi) + (Xi))

(Eqn. 6) 1 + ε

This completes the proof of the lemma.

We know from Lemma 17 that there exists a (multi) subset of X i of size τ such that the mean of these points satisfies the condition of the lemma above. Since C i contains at least τ copies of every element of Cinit, there is guaranteed to be a subset Ti C i that satisfies Eqn. (5). So, for any index i {1, ..., k} such that Φ(Cinit,Xi)

Φ(Cinit,V ) ε 6αk, Mi has a good subset Ti with probability 1.

Case-II: Φ(Cinit, Xi) > ε 6αk Φ(Cinit, V ) If we can show that a D2-sampled set with respect to center set Cinit has a subset S that may be considered a uniform sample from Xi, then we can use Lemma 17 to argue that Mi has a subset Ti such that µ(Ti) is a good center for Xi. Note that since Φ(Cinit,Xj)

Φ(Cinit,V ) > ε 6αk, and D is δ-close to D, we can argue that if we D2-sample poly( k

ε ) elements, then we will get a good representation from Xi. However, some of the points from Xi may be very close to one of the centers in Cinit and hence will have a very small chance of being D2-sampled. In such a case, no subset S of a D2-sampled set will behave like a uniform sample from Xi. So, we need to argue more carefully taking into consideration the fact that there may be points in Xi for which the chance of being D2-sampled may be very small. Here is the high-level argument that we will build:

Published as a conference paper at ICLR 2025

Consider the set X i which is same as Xi except that points in Xi that are very close to Cinit have been collapsed to their closest center in Cinit. Argue that a good center for the set X i is a good center for Xi. Show that a convex combination of copies of centers in Cinit (i.e., C i) and D2-sampled points from Xi gives a good center for the set X i.

The closeness of point in Xi to points in Cinit is quantified using radius R that is defined by the equation:

R2 defn. = ε2

41 Φ(Cinit, Xi)

Let Xnear i be the points in Xi that are within a distance of R from a point in set Cinit and Xfar i = Xi \ Xnear i . That is, Xnear i = {x Xi : minc Cinit x c R} and Xfar i = Xi \ Xnear i . Using these, we define the multi-set X i as:

X i = Xfar i {c(x) : x Xnear i }

Note that |Xi| = |X i|. Let m = µ(Xi), m = µ(X i). Let n = |Xi| and n = |Xnear i |. We first show a lower bound on (Xi) in terms of R. Lemma 22. (Xi) 16 n

Proof. Let c = arg minc Cinit m c . We do a case analysis:

1. Case 1: m c 5

ε R Consider any point p Xnear i . From triangle inequality, we have:

p m c(p) m c(p) p 5

This gives: (Xi) P

p Xnear i p m 2 16 n

2. Case 2: m c < 5

ε R In this case, we have:

(Xi) = Φ(c, Xi) n m c 2 Φ(Cinit, Xi) n m c 2 41n

This completes the proof of the lemma.

We now bound the distance between m and m in terms of R. Lemma 23. m m 2 n

Proof. Since |Xi| = |X i| and the only difference between Xi and X i are the points corresponding to Xnear i , we have:

m m 2 = 1 (n)2

p Xnear i (p c(p))

p Xnear i p c(p) 2 n2

The second inequality above follows from the Cauchy-Schwarz inequality.

We now show that (Xi) and (X i) are close. Lemma 24. (X i) 4 n R2 + 2 (Xi).

Proof. The lemma follows from the following sequence of inequalities:

p Xnear i c(p) m 2 + X

Published as a conference paper at ICLR 2025

(F act 2) X

p Xnear i 2 c(p) p 2 + p m 2 + X

2 n R2 + 2 Φ(m , Xi)

(F act 1) 2 n R2 + 2 (Φ(m, Xi) + n m m 2)

(Lemma 23) 4 n R2 + 2 (Xi)

This completes the proof of the lemma.

We now argue that any center that is good for X i is also good for Xi.

Lemma 25. Let m be such that Φ(m , X i) 1 + ε

16 (X i). Then Φ(m , Xi) 1 + ε

Proof. The lemma follows from the following inequalities:

Φ(m , Xi) = X

(F act 1) = X

p Xi m p 2 + n m m 2

(F act 2) (Xi) + 2n m m 2 + m m 2

(Lemma 23) (Xi) + 2 n R2 + 2n m m 2

(F act 1) (Xi) + 2 n R2 + 2 (Φ(m , X i) (X i))

(Lemma hypothesis) (Xi) + 2 n R2 + ε

(Lemma 24) (Xi) + 2 n R2 + ε

(Lemma 22) 1 + ε

This completes the proof of the lemma.

Given the above lemma, all we need to argue is that our algorithm indeed considers a center m such that Φ(m , X i) (1 + ε/16) (X i). For this we would need about Ω( 1

ε) uniform samples from X i. However, our algorithm can only sample using D2-sampling w.r.t. Cinit. For ease of notation, let c(Xnear i ) denote the multi-set {c(p) : p Xnear i }. Recall that X i consists of Xfar i and c(Xnear i ). The first observation we make is that the probability of sampling an element from Xfar i is reasonably large (proportional to ε

k). Using this fact, we show how to sample from X i (almost uniformly). Finally, we show how to convert this almost uniform sampling to uniform sampling (at the cost of increasing the size of sample).

Lemma 26. Let x be a sample from D2-sampling w.r.t. Cinit. Then, Pr[x Xfar i ] ε 8αk. Further,

for any point p Xfar i , Pr[x = p] γ |Xi|, where γ denotes (1 δ)4ε3

Proof. Since points are D2-sampled with D being δ-close to D, we note that P

p Xnear i Pr[x =

(1 δ)2 R2 Φ(Cinit,V ) |Xi| (1+δ)2

41 Φ(Cinit,Xi)

Φ(Cinit,V ) . Therefore, the fact that we are in case II implies that:

Pr[x Xfar i ] Pr[x Xi] Pr[x Xnear i ] Φ(Cinit, Xi)

Φ(Cinit, V ) (1 + δ)2

41 Φ(Cinit, Xi)

Φ(Cinit, V ) ε 8αk .

Published as a conference paper at ICLR 2025

Also, if x Xfar i , then Φ(Cinit, {x}) R2 = ε2

41 Φ(Cinit,Xi)

|Xi| . Therefore the probability that a

point p Xfar i gets D2-sampled is,

Pr[x = p] (1 δ)2

(1 + δ)2 Φ(Cinit, {x})

Φ(Cinit, V ) (1 δ)2

(1 + δ)2 ε 6αk R2

Φ(Cinit, Xi)

(1 + δ)2 ε 6αk ε2

(1 + δ)2 ε3

246αk 1 |Xi| (1 δ)4ε3

246αk 1 |Xi|.

This completes the proof of the lemma.

Let O1, . . . Oρ be ρ points sampled independently using D2-sampling w.r.t. Cinit. We construct a new set of random variables Y1, . . . , Yρ. Each variable Yu will depend on Ou only and will take values either in X i or will be . These variables are defined as follows: if Ou / Xfar i , we set Yu to . Otherwise, we assign Yu to one of the following random variables with equal probability: (i) Ou or (ii) a random element of the multi-set c(Xnear i ). The following observation follows from Lemma 26.

Corollary 1. For a fixed index u, and an element x X i, Pr[Yu = x] γ

|X i|, where γ = γ/2.

Proof. If x Xfar i , then we know from Lemma 26 that Ou is x with probability at least γ |X i| (note that X i and Xi have the same cardinality). Conditioned on this event, Yu will be equal to Ou with probability 1/2. Now suppose x c(Xnear i ). Lemma 26 implies that Ou is an element of Xfar i with probability at least ε 8αk. Conditioned on this event, Yu will be equal to x with probability at least 1

2 1 |c(Xnear i )|. Therefore, the probability that Ou is equal to x is at least

ε 8αk 1 2|c(Xnear i )| ε 16αk|X i| γ

Corollary 1 shows that we can obtain samples from X i which are nearly uniform (up to a constant factor). To convert this to a set of uniform samples, we use the idea of Jaiswal et al. (2014). For an element x X i, let γx be such that γx |X i| denotes the probability that the random variable Yu is equal to x (note that this is independent of u). Corollary 1 implies that γx γ . We define a new set of independent random variables Z1, . . . , Zρ. The random variable Zu will depend on Yu only. If Yu is , Zu is also . If Yu is equal to x X i, then Zu takes the value x with probability γ

γx , and with the remaining probability. We can now prove the key lemma.

Lemma 27. Let ρ be 256

γ ε, and m denote the mean of the non-null samples from Z1, . . . , Zρ. Then, with a probability at least (3/4), Φ(m , X i) (1 + ε

Proof. Note that a random variable Zu is equal to a specific element of X i with probability equal to γ

|X i|. Therefore, it takes value with probability 1 γ . Now consider a different set of iid random variables Z u, 1 u ρ as follows: each Zu tosses a coin with a probability of Heads being γ . If we get Tails, it gets value otherwise, it is equal to a random element of X i. It is easy to check that the joint distribution of the random variables Z u is identical to that of the random variables Zu. Thus, it suffices to prove the statement of the lemma for the random variables Z u.

Now we condition on the coin tosses of the random variables Z u. Let n be the number of random variables which are not . (n is a deterministic quantity because we have conditioned on the coin tosses). Let m be the mean of such nonvariables among Z 1, . . . , Z ρ. If n happens to be larger than 128

ε , Lemma 17 implies that with probability at least (7/8), Φ(m , X i) (1 + ε

Finally, observe that the expected number of nonrandom variables is γ ρ 256

ε . Therefore, with probability at least 7

8 (using Chernoff-Hoeffding), the number of nonelements will be at least 128

Published as a conference paper at ICLR 2025

Let C(ρ) denote the multi-set obtained by taking ρ copies of each of the centers in Cinit. Now observe that all the nonelements among Y1, . . . , Yρ are elements of {O1, . . . , Oρ} C(ρ), and so the same must hold for Z1, . . . , Zρ. Moreover, since we only need a uniform subset of size τ = 128

ε , C i suffices instead of C(ρ). This implies that in steps 8-9 of the algorithm, we would have tried adding the point m as described in Lemma 27. This means that Mi contains a subset Ti such that Φ(µ(Ti), Xi) (1 + ε

2) (Xi) with probability at least 3/4. This concludes the proof of Theorem 10.

F ADDITIONAL EXPERIMENTS

In this section, we give additional experiments and some of the details of the experiments in Section 4.

F.1 LARGER DATASETS AND AFKMC2

We have added experiments on datasets significantly larger than the ones experimented on in Section 4 to further elaborate on the dependence of ζ, k, d and N on QI-k-means++. The conditions under which the experiments were performed are the same as before. We also compare with the AFKMC2 algorithm from Bachem et al. (2016a). In our implementation, we used the Markov chain length to be m = 200 as suggested by Bachem et al. (2016a) for all datasets.

Dataset N d QI-k-means++ Setup Time (s)

AFKMC2 Setup Time (s)

KDD Caruana et al. (2004) 145, 751 74 1.1 0.2 SUSY Whiteson (2014) 5, 000, 000 18 11.2 2.5

Table 4: Details of the datasets experimented on and corresponding pre-processing times

We first compare the runtime for sampling (without taking into account the pre-processing step) for a range of values of k. The speedups are given in Table 5. For a better comparison, we also provide plots for the cumulative runtime (done for 9 values of k {2, .., 10} ), which also includes the pre-processing cost.

Dataset Algorithm k = 5 k =10 k =50 k =100 k =200 KDD QI-k-means++ 1.9 10 4 7.6 10 4 1.9 10 2 7.8 10 2 3.2 10 1

KDD k-means++ 4.6 10 1 9.3 10 1 4.7 9.6 19.4 KDD AFKMC2 3.2 10 3 1.2 10 2 2.7 10 1 1.1 4.3 SUSY QI-kmeans++ 1.8 10 4 5.6 10 4 1.6 10 2 8.7 10 2 4.3 10 1

SUSY k-means++ 5.0 9.7 48 95 195 SUSY AFKMC2 2.4 10 3 6.5 10 3 8.8 10 2 3.2 10 1 1.2

Table 5: Runtime dependence (in seconds) for performing a single clustering on k

Figure 4: Cumulative runtime plot for KDD

Figure 5: Cumulative runtime plot for SUSY

Published as a conference paper at ICLR 2025

Large Datasets : For large datasets, the total runtime is dominated by the pre-processing cost. After the pre-processing, sampling the cluster centers becomes extremely fast and is several orders of magnitude faster than k-means++. This can also be seen by the cumulative runtime plots in Figures 4 and 5 - the plot for QI-k-means++ is almost constant. This shows that QI-k-means++ scales well for large datasets since the gain from logarithmic dependence on N outweighs the poor dependence on ζ, k, and d. As for the dependence on k, we can see from Table 5 that the quadratic dependence is immediately visible, but the runtime is still significantly less than that of k-means++ even for intermediate values of k. Hence, such datasets lie in the advantageous regime of QI-k-means++. This contrasts small datasets with larger aspect ratios, as elucidated in Section 4.

Comparison with AFKMC2 : The AFKMC2 algorithm also consists of a pre-processing step which takes O(Nd) time, and a main sampling step, which takes O(mk2d) number of samples from the initial probability distribution, where m is the markov chain length. We note two major tradeoffs between AFKMC2 and QI-k-means++. The first is that the sampling step itself is empirically faster for QI-k-means++, while the pre-processing step is more costly for QI-k-means++ since it requires the setup of sample and query access data structures. But the SQ data structure allows the modification/addition of data points in O(log Nd) time without repeating the pre-processing step. In contrast, AFKMC2 s pre-processing step calculates the initial probability distribution for the Markov chain, which requires calculating the distance of each point from an initial randomly chosen point. Hence, updating or adding a data point will require repeating the pre-processing step.

F.2 ADDITIONAL DETAILS

We report the variance of the clustering costs mentioned in Section 4 (recall that the experiments were conducted over five runs). The seeds were chosen randomly for each run.

k 2 3 4 5 6 7 8 9 10 QI-k-means++ 1.89 3.19 0.97 1.36 1.44 1.24 0.17 0.20 0.47 k-means++ 1.93 3.29 1.90 1.44 1.57 1.09 0.73 0.32 0.52

Table 6: Variance of Clustering cost for binarized MNIST (values are scaled down by a factor of 1011)

k 2 3 4 5 6 7 8 9 10 QI-k-means++ 246523 48805 335 830 236 99 72 55 51 k-means++ 423718 27544 420 561 765 54 93 35 47

Table 7: Variance of Clustering cost for IRIS (values are rounded to 2 decimal places)

k 2 3 4 5 6 7 8 9 10 QI-k-means++ 3.77 4.05 1.44 5.96 1.10 2.37 1.10 0.57 1.07 k-means++ 1.91 2.00 2.24 3.92 2.28 0.38 1.34 1.51 1.06

Table 8: Variance of Clustering cost for DIGITS (values are scaled down by a factor of 1010)