Published as a conference paper at ICLR 2023

MINI-BATCH k-MEANS TERMINATES WITHIN O(d/ϵ) ITERATIONS

Gregory Schwartzman
Japan Advanced Institute of Science and Technology (JAIST)
greg@jaist.ac.jp

ABSTRACT

We answer the question: "Does local progress (on batches) imply global progress (on the entire dataset) for mini-batch k-means?". Specifically, we consider mini-batch k-means which terminates only when the improvement in the quality of the clustering on the sampled batch is below some threshold. Although at first glance it appears that this algorithm might execute forever, we answer the above question in the affirmative and show that if the batch is of size $\Omega((d/\epsilon)^2)$, it must terminate within $O(d/\epsilon)$ iterations with high probability, where $d$ is the dimension of the input and $\epsilon$ is a threshold parameter for termination. This is true regardless of how the centers are initialized. When the algorithm is initialized with the k-means++ initialization scheme, it achieves an approximation ratio of $O(\log k)$ (the same as the full-batch version). Finally, we show the applicability of our results to the mini-batch k-means algorithm implemented in the scikit-learn (sklearn) Python library.

1 INTRODUCTION

The mini-batch k-means algorithm (Sculley, 2010) is one of the most popular clustering algorithms used in practice (Pedregosa et al., 2011). However, due to its stochastic nature, it appears that if we do not explicitly bound the number of iterations of the algorithm, then it might never terminate. We show that, when the batch size is sufficiently large, using only an "early-stopping" condition, which terminates the algorithm when the local progress observed on a batch is below some threshold, we can guarantee a bound on the number of iterations that the algorithm performs which is independent of the input size.

Problem statement. We consider the following optimization problem. We are given an input (dataset) $X = \{x_i\}_{i=1}^{n} \subseteq [0,1]^d$ of size $n$ of $d$-dimensional real vectors and a parameter $k$. Note that the assumption that $X \subseteq [0,1]^d$ is standard in the literature (Arthur et al., 2011), and is meant to simplify notation (otherwise we would have to introduce a new parameter for the diameter of $X$). Our goal is to find a set $C$ of $k$ centers (vectors in $[0,1]^d$) such that the following goal function is minimized:

$$\frac{1}{n}\sum_{x \in X}\ \min_{c \in C}\ \|c - x\|^2$$

Usually, the $1/n$ factor does not appear, as it does not affect the optimization goal; however, in our case it will be useful to define it as such.

Lloyd's algorithm. The most popular method to solve the above problem is Lloyd's algorithm, often referred to as the k-means algorithm (Lloyd, 1982). It works by randomly initializing a set of $k$ centers and performing the following two steps: (1) assign every point in $X$ to the center closest to it; (2) update every center to be the mean of the points assigned to it. The algorithm terminates when no point is reassigned to a new center. This algorithm is extremely fast in practice but has a worst-case exponential running time (Arthur & Vassilvitskii, 2006; Vattani, 2011).
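To make the notation concrete, the following is a minimal NumPy sketch (an illustration only, not an implementation from this paper) of the goal function above and of a single Lloyd iteration; `X` is an $n \times d$ array with rows in $[0,1]^d$ and `C` is a $k \times d$ array of centers.

```python
import numpy as np

def goal(X, C):
    """(1/n) * sum over x in X of min over c in C of ||c - x||^2."""
    # Pairwise squared distances between every point and every center, shape (n, k).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def lloyd_step(X, C):
    """One Lloyd iteration: assign each point to its nearest center,
    then move every center to the mean of the points assigned to it."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    C_next = C.copy()
    for j in range(len(C)):
        members = X[labels == j]
        if len(members) > 0:        # centers with no assigned points stay put
            C_next[j] = members.mean(axis=0)
    return C_next

rng = np.random.default_rng(0)
X = rng.random((1000, 5))           # 1000 points in [0, 1]^5
C = rng.random((3, 5))              # k = 3 initial centers
print(goal(X, C), goal(X, lloyd_step(X, C)))  # the second value is never larger
```

A full Lloyd run simply repeats `lloyd_step` over the entire dataset until the assignment stops changing, which is exactly the per-iteration cost that mini-batch k-means avoids.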
Mini-batch k-means. To update the centers, Lloyd's algorithm must go over the entire input at every iteration. This can be computationally expensive when the input data is extremely large. To tackle this, the mini-batch k-means method was introduced by Sculley (2010). It is similar to Lloyd's algorithm, except that steps (1) and (2) are performed on a batch of $b$ elements sampled uniformly at random with repetitions, and in step (2) the centers are updated slightly differently. Specifically, every center is updated to be the weighted average of its current value and the mean of the points (in the batch) assigned to it. The parameter by which we weigh these values is called the learning rate, and its value differs between centers and iterations. In the original paper by Sculley, there is no stopping condition similar to that of Lloyd's algorithm; instead, the algorithm is simply executed for $t$ iterations, where $t$ is an input parameter. In practice (for example in sklearn (Pedregosa et al., 2011)), together with an upper bound on the number of iterations to perform, there are several "early stopping" conditions. We may terminate the algorithm when the change in the locations of the centers is sufficiently small or when the goal function for several consecutive batches does not improve.

We note that in both theory (Tang & Monteleoni, 2017; Sculley, 2010) and practice (Pedregosa et al., 2011) the learning rate goes to 0 over time. That is, over time the movement of centers becomes smaller and smaller, which guarantees termination for most reasonable early-stopping conditions at the limit. Our results are the first to show extremely fast termination guarantees for mini-batch k-means with early-stopping conditions. Surprisingly, we need not require the learning rate to go to 0.
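As an illustration of the update rule just described, here is a hedged NumPy sketch of a single mini-batch step (an illustration, not Sculley's original code). It uses a count-based learning rate of the kind used in practice, where each center's rate shrinks as more points are assigned to it; the exact rate used by sklearn is discussed in Section 5.

```python
import numpy as np

def minibatch_step(X, C, counts, b, rng):
    """One mini-batch k-means step: sample a batch, assign it to the nearest
    centers, and move each center towards the batch mean of its cluster."""
    B = X[rng.integers(0, len(X), size=b)]             # batch sampled uniformly with repetitions
    d2 = ((B[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    for j in range(len(C)):
        Bj = B[labels == j]
        if len(Bj) == 0:
            continue                                    # empty cluster: the center stays put
        counts[j] += len(Bj)
        lr = len(Bj) / counts[j]                        # count-based learning rate, goes to 0 over time
        C[j] = (1 - lr) * C[j] + lr * Bj.mean(axis=0)   # weighted average of old center and batch mean
    return C, counts
```

With a rate of this form the movement of the centers shrinks over time. The analysis in this paper instead uses a learning rate that does not go to 0 (Section 4), which is exactly what makes the termination question non-trivial.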
Related work. Mini-batch k-means was first introduced by Sculley (2010) as a natural generalization of online k-means (Bottou & Bengio, 1994), where the batch is of size 1. We are aware of only a single paper that analyzes the convergence rate of mini-batch k-means (Tang & Monteleoni, 2017). It is claimed in (Tang & Monteleoni, 2017) that under mild assumptions the algorithm has an $O(1/t)$ convergence rate. That is, after $t$ iterations the current value of the goal function is within an additive $O(1/t)$ factor of the value of the goal function at some local optimum of Lloyd's algorithm. However, their asymptotic notation subsumes factors that depend on the size of the input. Taking this into account, we get a convergence rate of $\Omega(n^2/t)$, which implies, at best, a quadratic bound on the execution time of the algorithm. This is due to setting the learning rate at iteration $t$ to $O(1/(n^2 + t))$. Our results do not guarantee convergence to any local minimum; however, they guarantee an exponentially faster runtime bound.

Our results. We analyze the mini-batch k-means algorithm described above (Sculley, 2010), where the algorithm terminates only when the improvement in the quality of the clustering for the sampled batch is less than some threshold parameter $\epsilon$. That is, we terminate if for some batch the difference in the quality of the clustering before the update and after the update is less than $\epsilon$. Our stopping condition is slightly different than what is used in practice: in sklearn, termination is determined based on the changes in cluster centers. In Section 5 we prove that this condition also fits within our framework.

Our main goal is to answer the following theoretical question: "Does local progress (on batches) imply global progress (on the entire dataset) for mini-batch k-means, even when the learning rate does not go to 0?". Intuitively, it is clear that the answer depends on the batch size used by the algorithm. If the batch is the entire dataset the claim is trivial and results in a termination guarantee of $O(d/\epsilon)$ iterations [1]. We show that when the batch size exceeds a certain threshold, local progress indeed implies global progress, and we achieve the same asymptotic bound on the number of iterations as when the batch is the entire dataset.

We present several results:

- We start with a warm-up in Section 3, showing that when $b = \tilde{\Omega}(kd^3\epsilon^{-2})$ we can guarantee termination within $O(d/\epsilon)$ iterations [2] w.h.p (with high probability) [3]. We require the additional assumption that every real number in the system can be represented using $O(1)$ bits (e.g., 64-bit floats). The above bound holds regardless of how cluster centers are initialized or updated. That is, this bound holds for any center update rule, and not only for the "standard" center update rule described above. Our proof uses elementary tools and is presented to set the stage for our main result.

- In Section 4 we show that, using the standard update rule, we can achieve the same termination time with a much smaller batch size. Specifically, a batch size of $\Omega((d/\epsilon)^2 \log(nkd/\epsilon)) = \tilde{\Omega}((d/\epsilon)^2)$ is sufficient to guarantee termination within $O(d/\epsilon)$ iterations. This holds regardless of how centers are initialized and does not require any assumption on the number of bits required to represent real numbers. Our proof makes use of the fact that the standard update rule adds additional stability to the stochastic process when the learning rate is sufficiently small (but need not go to 0).

- Finally, in Section 5, we show that our main result also holds for the early-stopping condition used in sklearn (with our learning rate). However, this results in a larger batch size and slower termination. Specifically, if $b = \Omega((d/\epsilon)^3 k)$ we terminate within $O((d/\epsilon)^{1.5}\sqrt{k})$ iterations w.h.p.

[1] This holds because the maximum value of the goal function is $d$ (Lemma 2.1).
[2] Throughout this paper the tilde notation hides logarithmic factors in $n, k, d, \epsilon$.
[3] This is usually taken to be $1 - 1/n^p$ for some constant $p \ge 1$. For our case, it holds that $p = 1$; however, this can be amplified arbitrarily by increasing the batch size by a multiplicative constant factor.

Note that for the batch size to be reasonable, we must require that $b \le n$, which implies that $(d/\epsilon)^2 \log(nkd/\epsilon) = O(n)$. Thus, our results only hold for a certain range of values for $k, d, \epsilon$. This is reasonable, as in practice it is often the case that $\epsilon = O(1)$, $d \ll n$ and the dependence on the rest of the parameters is logarithmic.

Solution quality. Applying the k-means++ initialization scheme to our results, we achieve the same approximation ratio, $O(\log k)$ in expectation, as the full-batch algorithm. The approximation guarantee of k-means++ is achieved already in the initialization phase (Theorem 3.1 in Arthur & Vassilvitskii (2007)), and the execution of Lloyd's algorithm following initialization can only improve the solution. We show that w.h.p the global goal function is decreasing throughout our execution, which implies that the approximation guarantee remains the same.

2 PRELIMINARIES

Throughout this paper we work with ordered tuples rather than sets, denoted as $Y = (y_i)_{i \in [\ell]}$, where $[\ell] = \{1, \dots, \ell\}$. To reference the $i$-th element we either write $y_i$ or $Y[i]$. It will be useful to use set notation for tuples, such as $x \in Y \iff \exists i \in [\ell],\ x = y_i$ and $Y \subseteq Z \iff \forall i \in [\ell],\ y_i \in Z$. When summing we often write $\sum_{x \in Y} g(x)$, which is equivalent to $\sum_{i=1}^{\ell} g(Y[i])$. We borrow the following notation from (Kanungo et al., 2004).
For every $x, y \in \mathbb{R}^d$ let $\Delta(x, y) = \|x - y\|^2$. For every finite tuple $S \subseteq \mathbb{R}^d$ and a vector $x \in \mathbb{R}^d$ let $\Delta(S, x) = \sum_{y \in S} \Delta(y, x)$.

k-means. We are given an input $X = (x_i)_{i=1}^{n} \subseteq [0,1]^d$ and a parameter $k$. Our goal is to find a tuple $C \subseteq \mathbb{R}^d$ of $k$ centers such that the following goal function is minimized:

$$\frac{1}{n}\sum_{x \in X}\ \min_{c \in C}\ \Delta(x, c)$$

Let us define for every $x \in X$ the function $f_x : \mathbb{R}^{k \times d} \to \mathbb{R}$ where $f_x(C) = \min_{c \in C} \Delta(x, c)$. We can treat $\mathbb{R}^{k \times d}$ as the set of $k$-tuples of $d$-dimensional vectors. We also define the following function for every tuple $A = (a_i)_{i=1}^{\ell} \subseteq X$:

$$f_A(C) = \frac{1}{\ell}\sum_{i=1}^{\ell} f_{a_i}(C)$$

Note that $f_X$ is our original goal function. We state the following useful lemma:

Lemma 2.1. For any tuple of $k$ centers $C \subseteq [0,1]^d$ it holds that $f_X(C) \le d$.

Proof. Because $X, C \subseteq [0,1]^d$ it holds that $\forall x \in X,\ f_x(C) \le \max_{c \in C} \Delta(x, c) \le d$. Therefore $f_X(C) = \frac{1}{n}\sum_{x \in X} f_x(C) \le d$.

We state the following well-known theorems:

Theorem 2.2 (Hoeffding (1963)). Let $Y_1, \dots, Y_m$ be independent random variables such that $\forall 1 \le i \le m,\ E[Y_i] = \mu$ and $Y_i \in [a_{\min}, a_{\max}]$. Then

$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m} Y_i - \mu\right| \ge \delta\right] \le 2e^{-2m\delta^2/(a_{\max} - a_{\min})^2}$$

Theorem 2.3 (Jensen (1906)). Let $\phi$ be a convex function, $y_1, \dots, y_n$ numbers in its domain and weights $a_1, \dots, a_n \in \mathbb{R}_+$. It holds that:

$$\phi\left(\frac{\sum_{i=1}^{n} a_i y_i}{\sum_{i=1}^{n} a_i}\right) \le \frac{\sum_{i=1}^{n} a_i \phi(y_i)}{\sum_{i=1}^{n} a_i}$$

3 WARM-UP: A SIMPLE BOUND

Let us first show a simple convergence guarantee which makes no assumptions about how the centers are updated. This will set the stage for our main result in Section 4, where we consider the standard update rule used in mini-batch k-means (Sculley, 2010; Pedregosa et al., 2011).

Algorithm. We analyze a generic variant of the mini-batch k-means algorithm, presented in Algorithm 1. Note that it captures a very broad class of algorithms (including the widely used algorithm of Sculley (2010)). The only assumptions we make are:

1. The centers remain within $[0,1]^d$ (the convex hull bounding $X$).
2. Batches are sampled uniformly at random from $X$ with repetitions.
3. The algorithm terminates when updating the centers does not significantly improve the quality of the solution for the sampled batch.

Items (1) and (2) are standard both in theory and practice (Sculley, 2010; Pedregosa et al., 2011; Tang & Monteleoni, 2017). Item (3) is usually referred to as an "early-stopping" condition. Early-stopping conditions are widely used in practice (for example in sklearn (Pedregosa et al., 2011)), together with a bound on the number of iterations. However, our early-stopping condition is slightly different than the one used in practice. We discuss this difference in Section 5. At first glance, guaranteeing termination for any possible way of updating the centers might seem strange. However, if the update procedure is degenerate, it will make no progress, at which point the algorithm terminates.

Algorithm 1: Generic mini-batch k-means
1  $C_1 \subseteq [0,1]^d$ is an initial tuple of $k$ centers
2  for $i = 1$ to $\infty$ do
3      Sample $b$ elements, $B_i = (y_1, \dots, y_b)$, uniformly at random from $X$ (with repetitions)
4      Update $C_{i+1}$ (such that $C_{i+1} \subseteq [0,1]^d$)
5      if $f_{B_i}(C_i) - f_{B_i}(C_{i+1}) < \epsilon$ then return $C_{i+1}$
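The following Python sketch is an illustrative rendering of Algorithm 1 (not a reference implementation). The center update is left as an arbitrary callable, matching the generic setting, and `f` computes $f_A(C)$ as defined in Section 2.

```python
import numpy as np

def f(A, C):
    """f_A(C): average over A of the squared distance to the nearest center in C."""
    d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def generic_minibatch_kmeans(X, C, b, eps, update, rng):
    """Algorithm 1: loop until the improvement on the sampled batch drops below eps.
    `update(B, C)` may be any rule returning new centers inside [0, 1]^d."""
    while True:
        B = X[rng.integers(0, len(X), size=b)]   # uniform sampling with repetitions
        C_next = update(B, C)
        if f(B, C) - f(B, C_next) < eps:         # early-stopping condition (line 5)
            return C_next
        C = C_next
```

Note that the loop has no explicit iteration cap, which is exactly why the termination bounds below are needed.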
Termination guarantees for Algorithm 1. To bound the number of iterations of such a generic algorithm we require the following assumption: every real number in our system can be represented using $q = O(1)$ bits. This implies that every set of $k$ centers can be represented using $qkd$ bits, which means that the total number of possible solutions is bounded by $2^{qkd}$. This will allow us to show that when the batch is sufficiently large, the sampled batch acts as a sparsifier for the entire dataset. Specifically, it means that for any tuple of $k$ centers, $C$, it holds that $|f_{B_i}(C) - f_X(C)| < \epsilon/4$. This implies that, for a sufficiently large batch size, simply sampling a single batch and executing Lloyd's algorithm on the batch would be sufficient, and executing mini-batch k-means is unnecessary. Nevertheless, this serves as a good starting point to showcase our general approach and to highlight the challenges we overcome in Section 4 in order to reduce the required batch size without compromising the running time.

Let us assume that the algorithm executes for at least $t$ iterations. That is, the termination condition does not hold for the first $t$ iterations. Our goal is to upper bound $t$.

Parameter range. Let us first define the range of parameter values for which the results of this section hold. Recall that $n$ is the size of the input, $k$ is the number of centers, $d$ is the dimension, and $\epsilon$ is the termination threshold. For the rest of this section assume that $b = \Omega((d/\epsilon)^2(kd + \log(nt)))$. Later we show that $t = O(d/\epsilon)$, which will imply that $b = \tilde{\Omega}(kd^3\epsilon^{-2})$ is sufficient for our termination guarantees to hold.

We state the following useful lemma, which guarantees that $f_B(C)$ is not too far from $f_X(C)$ when the batch size is sufficiently large and $C$ is fixed (i.e., independent of the choice of $B_i$).

Lemma 3.1. Let $B$ be a tuple of $b$ elements chosen uniformly at random from $X$ with repetitions. For any fixed tuple of $k$ centers, $C \subseteq [0,1]^d$, it holds that $\Pr[|f_B(C) - f_X(C)| \ge \delta] \le 2e^{-2b\delta^2/d^2}$.

Proof. Let us write $B = (y_1, \dots, y_b)$, where $y_i$ is a random element selected uniformly at random from $X$ with repetitions. For every such $y_i$ define the random variable $Z_i = f_{y_i}(C)$. These new random variables are i.i.d. for any fixed $C$. It also holds that $\forall i \in [b],\ E[Z_i] = \frac{1}{n}\sum_{x \in X} f_x(C) = f_X(C)$ and that $f_B(C) = \frac{1}{b}\sum_{x \in B} f_x(C) = \frac{1}{b}\sum_{i=1}^{b} Z_i$. Applying a Hoeffding bound (Theorem 2.2) with parameters $m = b$, $\mu = f_X(C)$, $a_{\max} - a_{\min} \le d$ we get that $\Pr[|f_B(C) - f_X(C)| \ge \delta] \le 2e^{-2b\delta^2/d^2}$.

Using the above we can show that every $B_i$ is a sparsifier for $X$.

Lemma 3.2. It holds w.h.p that for every $i \in [t]$ and for every set of $k$ centers $C \subseteq [0,1]^d$, $|f_{B_i}(C) - f_X(C)| < \epsilon/4$.

Proof. Using Lemma 3.1, setting $\delta = \epsilon/4$ and using the fact that $b = \Omega((d/\epsilon)^2(kd + \log(nt)))$, we get: $\Pr[|f_B(C) - f_X(C)| \ge \delta] \le 2e^{-2b\delta^2/d^2} = 2^{-\Theta(b\delta^2/d^2)} = 2^{-\Omega(kd + \log(nt))}$. Taking a union bound over all $t$ iterations and all $2^{qkd}$ configurations of centers, we get that the failure probability is bounded by $2^{-\Omega(kd + \log(nt))} \cdot 2^{qkd} \cdot t = O(1/n)$, for an appropriate constant in the asymptotic notation for $b$.

The lemma below guarantees global progress for the algorithm.

Lemma 3.3. It holds w.h.p that $\forall i \in [t],\ f_X(C_i) - f_X(C_{i+1}) \ge \epsilon/2$.

Proof. Let us write (the notation $\pm x$ means that we add and subtract $x$):

$$f_X(C_i) - f_X(C_{i+1}) = f_X(C_i) \pm f_{B_i}(C_i) \pm f_{B_i}(C_{i+1}) - f_X(C_{i+1}) \ge \epsilon/2$$

Due to Lemma 3.2 it holds w.h.p that $f_X(C_i) - f_{B_i}(C_i) > -\epsilon/4$ and $f_{B_i}(C_{i+1}) - f_X(C_{i+1}) > -\epsilon/4$. Finally, due to the termination condition it holds that $f_{B_i}(C_i) - f_{B_i}(C_{i+1}) \ge \epsilon$. This completes the proof.

As $f_X$ is upper bounded by $d$ (Lemma 2.1), we must terminate within $O(d/\epsilon)$ iterations w.h.p when $b = \Omega(kd^3\epsilon^{-2}\log(nd/\epsilon))$. We state our main theorem for this section.

Theorem 3.4. For $b = \tilde{\Omega}(kd^3\epsilon^{-2})$, Algorithm 1 terminates within $O(d/\epsilon)$ iterations w.h.p.
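As a rough back-of-the-envelope illustration of Lemma 3.1 (our own calculation, not from the original analysis), the concentration bound can be inverted to see what batch size pushes the failure probability for a single fixed $C$ below a target $p$; the union bound in Lemma 3.2 then replaces $\log(2/p)$ with $\Theta(kd + \log(nt))$.

```python
import math

def batch_size_for_fixed_centers(d, delta, p):
    """Smallest b with 2*exp(-2*b*delta^2/d^2) <= p (Lemma 3.1, single fixed C)."""
    return math.ceil(d**2 * math.log(2 / p) / (2 * delta**2))

# e.g. d = 10, deviation delta = eps/4 with eps = 0.1, failure probability 1e-6
print(batch_size_for_fixed_centers(d=10, delta=0.025, p=1e-6))
```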
Towards a smaller batch size. Note that the batch size used in this section is about a $kd$ factor larger than what we require in Section 4. This factor is required for the union bound over all possible sets of $k$ centers in Lemma 3.2. However, when actually applying Lemma 3.2, we only apply it in iteration $i$ to the two tuples $C = C_i$ and $C = C_{i+1}$, with $B = B_i$. A more direct approach would be to apply Lemma 3.1 only for $C_i, C_{i+1}$, which would get rid of the extra $kd$ factor. This works when $C = C_i$, as $B_i$ is sampled after $C_i$ is determined, but fails for $C = C_{i+1}$ because $C_{i+1}$ may depend on $B_i$. In the following section, we show how to use the fact that the learning rate is sufficiently small in order to overcome this challenge.

4 MAIN RESULTS

In this section, we show that we can get a much better dependence on the batch size when using the standard center update rule. Specifically, we show that a batch of size $\tilde{\Omega}((d/\epsilon)^2)$ is sufficient to guarantee termination within $O(d/\epsilon)$ iterations. We also do not require any assumption about the number of bits required to represent a real number.

Section preliminaries. Let us define for any finite tuple $S \subseteq \mathbb{R}^d$ the center of mass of the tuple as $cm(S) = \frac{1}{|S|}\sum_{x \in S} x$. Any tuple $S \subseteq \mathbb{R}^d$ together with a tuple of cluster centers $C = (C_\ell)_{\ell \in [k]}$ implies a partition $(S_\ell)_{\ell \in [k]}$ of the points in $S$. Specifically, every $S_\ell$ contains the points in $S$ closest to $C_\ell$, and every point in $S$ belongs to a single $S_\ell$ (ties are broken arbitrarily). We state the following useful observation:

Observation 4.1. Fix some $A \subseteq X$. Let $C$ be a tuple of $k$ centers, let $S = (S_\ell)_{\ell \in [k]}$ be the partition of $A$ induced by $C$ and let $S' = (S'_\ell)_{\ell \in [k]}$ be any other partition of $A$. It holds that $\sum_{j=1}^{k} \Delta(S_j, C_j) \le \sum_{j=1}^{k} \Delta(S'_j, C_j)$.

Let $C^j_i$ denote the location of the $j$-th center at the beginning of the $i$-th iteration. Let $(B^\ell_i)_{\ell \in [k]}$ be the partition of $B_i$ induced by $C_i$ and let $(X^\ell_i)_{\ell \in [k]}$ be the partition of $X$ induced by $C_i$. We analyze Algorithm 1 when the centers are updated as follows:

$$C^j_{i+1} = (1 - \alpha^j_i)C^j_i + \alpha^j_i \cdot cm(B^j_i),$$

where $\alpha^j_i$ is the learning rate. Note that $B^j_i$ may be empty, in which case $cm(B^j_i)$ is undefined; however, the learning rate is chosen such that $\alpha^j_i = 0$ in this case (so $C^j_{i+1} = C^j_i$). Note that the learning rate may take on different values for different centers, and may change between iterations. In the standard mini-batch k-means algorithm (Sculley, 2010; Pedregosa et al., 2011) the learning rate goes to 0 over time. This guarantees termination for most reasonable stopping conditions. As before, we assume that the algorithm executes for at least $t$ iterations and upper bound $t$. We show that the learning rate need not go to 0 to guarantee termination when the batch size is sufficiently large. Specifically, we set $\alpha^j_i = \sqrt{b^j_i/b}$, where $b^j_i = |B^j_i|$, and we require that $b = \Omega((d/\epsilon)^2 \log(ndtk))$.
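Concretely, one step of the update rule analyzed in this section might look as follows (an illustrative NumPy sketch with our own naming; the empty-cluster convention $\alpha^j_i = 0$ is the one noted above). It can be plugged directly into the generic loop sketched after Algorithm 1.

```python
import numpy as np

def analyzed_update(B, C):
    """Center update of Section 4: learning rate alpha_j = sqrt(b_j / b), where b_j is
    the number of batch points assigned to center j (alpha_j = 0 if b_j = 0)."""
    d2 = ((B[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    C_next = C.copy()
    for j in range(len(C)):
        Bj = B[labels == j]
        if len(Bj) == 0:
            continue                            # alpha = 0: this center does not move
        alpha = np.sqrt(len(Bj) / len(B))       # sqrt(b_j / b); does not go to 0 over time
        C_next[j] = (1 - alpha) * C[j] + alpha * Bj.mean(axis=0)
    return C_next
```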
Proof outline. In our proof, we use the fact that a sufficiently small learning rate enhances the stability of the algorithm, which in turn allows us to use a much smaller batch size compared to Section 3. Let us define the auxiliary value $C'^j_{i+1} = (1 - \alpha^j_i)C^j_i + \alpha^j_i \cdot cm(X^j_i)$. This is the $j$-th center at step $i+1$ if we were to use the entire dataset for the update, rather than just a batch. Note that this is only used in the analysis and not in the algorithm. Recall that in the previous section we required a large batch size because we could not apply Lemma 3.1 with $B = B_i$ and $C = C_{i+1}$, as $C_{i+1}$ may depend on $B_i$. To overcome this challenge we use $C'_{i+1}$ instead of $C_{i+1}$. Note that $C'_{i+1}$ depends only on $C_i$ and $X$ and is independent of $B_i$ (i.e., we can fix its value before sampling $B_i$).

We show that for our choice of learning rate $C_{i+1}$ and $C'_{i+1}$ are sufficiently close, which implies that $f_X(C_{i+1}), f_X(C'_{i+1})$ and $f_{B_i}(C_{i+1}), f_{B_i}(C'_{i+1})$ are also sufficiently close. This allows us to use a proof similar to that of Lemma 3.3, where $C'_{i+1}$ acts as a proxy for $C_{i+1}$. We formalize this intuition in what follows. First, we state the following useful lemmas:

Lemma 4.2 (Kanungo et al. (2004)). For any finite set $S \subseteq \mathbb{R}^d$ and any $C \in \mathbb{R}^d$ it holds that $\Delta(S, C) = \Delta(S, cm(S)) + |S| \cdot \Delta(C, cm(S))$.

Lemma 4.3. For any $S \subseteq X$ and $C, C' \in [0,1]^d$, it holds that $|\Delta(S, C') - \Delta(S, C)| \le 2\sqrt{d}\,|S|\,\|C' - C\|$.

Proof. Using Lemma 4.2 we get that $\Delta(S, C) = \Delta(S, cm(S)) + |S|\Delta(cm(S), C)$ and that $\Delta(S, C') = \Delta(S, cm(S)) + |S|\Delta(cm(S), C')$. Thus, it holds that $|\Delta(S, C') - \Delta(S, C)| = |S| \cdot |\Delta(cm(S), C') - \Delta(cm(S), C)|$. Observe that for two vectors $x, y \in \mathbb{R}^d$ it holds that $\Delta(x, y) = (x - y) \cdot (x - y)$. Let us switch to vector notation and bound $|\Delta(cm(S), C') - \Delta(cm(S), C)|$:

$$
\begin{aligned}
|\Delta(cm(S), C') - \Delta(cm(S), C)| &= |(cm(S) - C') \cdot (cm(S) - C') - (cm(S) - C) \cdot (cm(S) - C)| \\
&= |-2\,cm(S) \cdot C' + C' \cdot C' + 2\,cm(S) \cdot C - C \cdot C| \\
&= |2\,cm(S) \cdot (C - C') + (C' - C) \cdot (C' + C)| \\
&= |(C - C') \cdot (2\,cm(S) - (C + C'))| \\
&\le \|C - C'\| \cdot \|2\,cm(S) - (C + C')\| \le 2\sqrt{d}\,\|C - C'\|
\end{aligned}
$$

where in the second-to-last transition we used the Cauchy-Schwarz inequality, and in the last we used the fact that every coordinate of $2\,cm(S) - (C + C')$ lies in $[-2, 2]$.

First, we show that due to our choice of learning rate, $C^j_{i+1}$ and $C'^j_{i+1}$ are sufficiently close.

Lemma 4.4. It holds w.h.p that $\forall i \in [t], j \in [k],\ \|C^j_{i+1} - C'^j_{i+1}\| \le \frac{\epsilon}{10\sqrt{d}}$.

Proof. Note that $C^j_{i+1} - C'^j_{i+1} = \alpha^j_i(cm(B^j_i) - cm(X^j_i))$. Let us fix some iteration $i$ and center $j$. To simplify notation, let us denote $X' = X^j_i$, $B' = B^j_i$, $b' = b^j_i$, $\alpha = \alpha^j_i$. Although $b'$ is a random variable, in what follows we treat it as a fixed value (essentially conditioning on its value). As what follows holds for all values of $b'$, it also holds without conditioning due to the law of total probability. For the rest of the proof we assume $b' > 0$ (if $b' = 0$ the claim holds trivially).

Let us denote by $\{Y_\ell\}_{\ell=1}^{b'}$ the sampled points in $B'$. Note that a randomly sampled element from $X$ is in $B'$ if and only if it is in $X'$. As batch elements are sampled uniformly at random with repetitions from $X$, conditioning on the fact that an element is in $B'$ means that it is distributed uniformly over $X'$. Thus, it holds that $\forall \ell,\ E[Y_\ell] = \frac{1}{|X'|}\sum_{x \in X'} x = cm(X')$ and $E[cm(B')] = \frac{1}{b'}\sum_{\ell=1}^{b'} E[Y_\ell] = cm(X')$.

Our goal is to bound $\Pr[\|cm(B') - cm(X')\| \ge \frac{\epsilon}{10\alpha\sqrt{d}}]$. We note that it is sufficient to bound the deviation of every coordinate by $\frac{\epsilon}{10\alpha d}$, as that will guarantee that:

$$\|cm(B') - cm(X')\| = \sqrt{\sum_{\ell=1}^{d}\left(cm(B')[\ell] - cm(X')[\ell]\right)^2} \le \sqrt{\sum_{\ell=1}^{d}\left(\frac{\epsilon}{10\alpha d}\right)^2} = \frac{\epsilon}{10\alpha\sqrt{d}}$$

We note that for a single coordinate, $\ell$, we can apply a Hoeffding bound with parameters $m = b'$, $\mu = cm(X')[\ell]$, $a_{\max} - a_{\min} \le 1$ and get that:

$$\Pr\left[|cm(B')[\ell] - cm(X')[\ell]| \ge \frac{\epsilon}{10\alpha d}\right] \le 2e^{-2b'\epsilon^2/(100\alpha^2 d^2)}$$

Taking a union bound over the coordinates we get that

$$\Pr\left[\|cm(B') - cm(X')\| \ge \frac{\epsilon}{10\alpha\sqrt{d}}\right] \le \Pr\left[\exists \ell,\ |cm(B')[\ell] - cm(X')[\ell]| \ge \frac{\epsilon}{10\alpha d}\right] \le 2d \cdot e^{-2b'\epsilon^2/(100\alpha^2 d^2)}$$

Using the fact that $\alpha = \sqrt{b'/b}$ (so that $b'/\alpha^2 = b$) together with the fact that $b = \Omega((d/\epsilon)^2\log(ntkd))$ (for an appropriate constant), we get that the above is $O(1/(ntk))$. Finally, taking a union bound over all $t$ iterations and all $k$ centers per iteration completes the proof.

Let us now use the above lemma to bound the change in the goal function when the cluster centers are close.

Lemma 4.5. Fix some $A \subseteq X$. It holds w.h.p that $\forall i \in [t],\ |f_A(C'_{i+1}) - f_A(C_{i+1})| \le \epsilon/5$.

Proof. Let $S = (S_\ell)_{\ell \in [k]}$ and $S' = (S'_\ell)_{\ell \in [k]}$ be the partitions induced by $C_{i+1}$ and $C'_{i+1}$ on $A$, respectively.
Let us expand the expression:

$$
\begin{aligned}
f_A(C'_{i+1}) - f_A(C_{i+1}) &= \frac{1}{|A|}\left(\sum_{j=1}^{k}\Delta(S'_j, C'^j_{i+1}) - \sum_{j=1}^{k}\Delta(S_j, C^j_{i+1})\right) \\
&\le \frac{1}{|A|}\sum_{j=1}^{k}\left(\Delta(S_j, C'^j_{i+1}) - \Delta(S_j, C^j_{i+1})\right) \\
&\le \frac{1}{|A|}\sum_{j=1}^{k} 2\sqrt{d}\,|S_j|\,\|C'^j_{i+1} - C^j_{i+1}\| \\
&\le \frac{1}{|A|}\sum_{j=1}^{k} 2\sqrt{d}\,|S_j| \cdot \frac{\epsilon}{10\sqrt{d}} \le \epsilon/5
\end{aligned}
$$

where the first inequality is due to Observation 4.1, the second is due to Lemma 4.3, and in the last we use Lemma 4.4 together with the fact that $\sum_{j=1}^{k}|S_j| = |A|$. Using the same argument we also get that $f_A(C_{i+1}) - f_A(C'_{i+1}) \le \epsilon/5$, which completes the proof.

From here our proof is somewhat similar to that of Section 3. Let us state the following useful lemma.

Lemma 4.6. It holds w.h.p that for every $i \in [t]$:

$$f_X(C'_{i+1}) - f_X(C_{i+1}) \ge -\epsilon/5 \quad (1)$$
$$f_{B_i}(C_{i+1}) - f_{B_i}(C'_{i+1}) \ge -\epsilon/5 \quad (2)$$
$$f_X(C_i) - f_{B_i}(C_i) \ge -\epsilon/5 \quad (3)$$
$$f_{B_i}(C'_{i+1}) - f_X(C'_{i+1}) \ge -\epsilon/5 \quad (4)$$

Proof. The first two inequalities follow from Lemma 4.5. The last two are due to Lemma 3.1, setting $\delta = \epsilon/5$ and $B = B_i$:

$$\Pr[|f_{B_i}(C) - f_X(C)| \ge \delta] \le 2e^{-2b\delta^2/d^2} = e^{-\Theta(b\epsilon^2/d^2)} = e^{-\Omega(\log(nt))} = O(1/(nt))$$

where the last equality is due to the fact that $b = \Omega((d/\epsilon)^2\log(nt))$ (for an appropriate constant). The above holds for either $C = C_i$ or $C = C'_{i+1}$, as both are independent of $B_i$. Taking a union bound over all $t$ iterations we get the desired result.

Putting everything together. We wish to lower bound $f_X(C_i) - f_X(C_{i+1})$. We write the following:

$$
\begin{aligned}
f_X(C_i) - f_X(C_{i+1}) &= f_X(C_i) \pm f_{B_i}(C_i) - f_X(C_{i+1}) \\
&\ge f_{B_i}(C_i) - f_X(C_{i+1}) - \epsilon/5 \\
&= f_{B_i}(C_i) \pm f_{B_i}(C_{i+1}) - f_X(C_{i+1}) - \epsilon/5 \\
&\ge f_{B_i}(C_{i+1}) - f_X(C_{i+1}) + 4\epsilon/5 \\
&= f_{B_i}(C_{i+1}) \pm f_{B_i}(C'_{i+1}) \pm f_X(C'_{i+1}) - f_X(C_{i+1}) + 4\epsilon/5 \ge \epsilon/5
\end{aligned}
$$

where the first inequality is due to inequality (3) in Lemma 4.6 ($f_X(C_i) - f_{B_i}(C_i) \ge -\epsilon/5$), the second is due to the stopping condition of the algorithm ($f_{B_i}(C_i) - f_{B_i}(C_{i+1}) \ge \epsilon$), and the last is due to the remaining inequalities in Lemma 4.6. The above holds w.h.p over all of the iterations of the algorithm. As in Section 3, we conclude that w.h.p it holds that $t = O(d/\epsilon)$, which implies that $b = \Omega((d/\epsilon)^2\log(knd/\epsilon))$ is sufficient. We state our main theorem.

Theorem 4.7. For $b = \tilde{\Omega}((d/\epsilon)^2)$ and $\alpha^j_i = \sqrt{b^j_i/b}$, Algorithm 1 with the center update $C^j_{i+1} = (1 - \alpha^j_i)C^j_i + \alpha^j_i \cdot cm(B^j_i)$ terminates within $O(d/\epsilon)$ iterations w.h.p.

5 APPLICATION TO SKLEARN

In this section, we show the relevance of our results to the mini-batch k-means implementation in sklearn. The main differences in sklearn are the learning rate and the stopping condition. The termination condition [4] depends on the movement of the centers in the iteration, rather than on the value of $f_{B_i}$. Specifically, we continue as long as $\sum_{j \in [k]} \Delta(C^j_{i+1}, C^j_i) \ge \epsilon$ for some tolerance parameter $\epsilon$. The learning rate is set as $\alpha^j_i = b^j_i / \sum_{\ell=1}^{i} b^j_\ell$. Roughly speaking, this implies that $\alpha^j_i \to 0$ over time, which guarantees termination of the algorithm in the limit. However, for our convergence guarantee we only require $\alpha^j_i = \sqrt{b^j_i/b}$, which need not go to 0 over time. We show that with our learning rate and the termination condition of sklearn, the proof from Section 4 still implies termination, although at a slower rate and with a larger batch size.

[4] The exact parameters of this algorithm were extracted directly from the code (the relevant function is _mini_batch_convergence): https://github.com/scikit-learn/scikit-learn/blob/baf828ca1/sklearn/cluster/_kmeans.py#L1502.

Specifically, we terminate within $O((d/\epsilon)^{1.5}\sqrt{k})$ iterations w.h.p if the batch size is $\Omega(k(d/\epsilon)^3)$. Note that this result is not subsumed by the result in Section 3 because the stopping condition is different.
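For concreteness, the movement-based check can be written as follows (a simplified sketch of the condition as stated above, not the actual sklearn code); Lemma 5.2 below relates this movement to the improvement in $f_{B_i}$.

```python
import numpy as np

def center_movement(C_old, C_new):
    """sum over j of Delta(C_new[j], C_old[j]): total squared movement of the centers."""
    return ((C_new - C_old) ** 2).sum()

def movement_based_stop(C_old, C_new, tol):
    """Movement-based early stopping: halt once the centers barely move."""
    return center_movement(C_old, C_new) < tol
```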
Below we show that as long as the termination condition in sklearn does not hold (i.e., $\sum_{j \in [k]} \Delta(C^j_{i+1}, C^j_i) \ge \epsilon$), our stopping condition also does not hold for an appropriate parameter ($f_{B_i}(C_i) - f_{B_i}(C_{i+1}) \ge \epsilon'$, where $\epsilon' = \epsilon^{1.5}/\sqrt{kd}$). We state the following useful lemma:

Lemma 5.1. Let $x, y \in \mathbb{R}^d$ and $\alpha \in [0,1]$. It holds that $\Delta(x, (1-\alpha)x + \alpha y) = \alpha^2\Delta(x, y)$.

Proof. $\Delta(x, (1-\alpha)x + \alpha y) = \|x - (1-\alpha)x - \alpha y\|^2 = \|\alpha x - \alpha y\|^2 = \alpha^2\Delta(x, y)$.

Below is our main lemma for this section:

Lemma 5.2. If it holds that $\sum_{j \in [k]} \Delta(C^j_{i+1}, C^j_i) \ge \epsilon$, then $f_{B_i}(C_i) - f_{B_i}(C_{i+1}) \ge \epsilon^{1.5}/\sqrt{kd}$.

Proof. Recall that $C^j_{i+1} = (1 - \alpha^j_i)C^j_i + \alpha^j_i \cdot cm(B^j_i)$ for $\alpha^j_i = \sqrt{b^j_i/b}$. Thus, we get:

$$\sum_{j \in [k]} \Delta(C^j_i, C^j_{i+1}) = \sum_{j \in [k]} (\alpha^j_i)^2\,\Delta(C^j_i, cm(B^j_i)) = \sum_{j \in [k]} \frac{b^j_i}{b}\,\Delta(C^j_i, cm(B^j_i)) \quad (5)$$

where in the first transition we used Lemma 5.1. Let us fix some $j \in [k]$. We can write the following:

$$
\begin{aligned}
\Delta(B^j_i, C^j_i) - \Delta(B^j_i, C^j_{i+1}) &= \Delta(B^j_i, cm(B^j_i)) + b^j_i\,\Delta(C^j_i, cm(B^j_i)) - \Delta(B^j_i, cm(B^j_i)) - b^j_i\,\Delta(C^j_{i+1}, cm(B^j_i)) \\
&= b^j_i\left(\Delta(C^j_i, cm(B^j_i)) - \Delta(C^j_{i+1}, cm(B^j_i))\right) \\
&= b^j_i\left(\Delta(C^j_i, cm(B^j_i)) - \Delta((1-\alpha^j_i)C^j_i + \alpha^j_i\,cm(B^j_i),\ cm(B^j_i))\right) \\
&= b^j_i\left(\Delta(C^j_i, cm(B^j_i)) - (1-\alpha^j_i)^2\,\Delta(C^j_i, cm(B^j_i))\right) \\
&= \left(2\alpha^j_i - (\alpha^j_i)^2\right)b^j_i\,\Delta(C^j_i, cm(B^j_i)) \ \ge\ \alpha^j_i\,b^j_i\,\Delta(C^j_i, cm(B^j_i))
\end{aligned}
$$

where in the first transition we apply Lemma 4.2, in the fourth we use the fact that $\Delta(C^j_{i+1}, cm(B^j_i)) = (1-\alpha^j_i)^2\,\Delta(C^j_i, cm(B^j_i))$, and in the last we use the fact that for $\alpha^j_i \in [0,1]$, $2\alpha^j_i - (\alpha^j_i)^2 \ge \alpha^j_i$. Let us bound $f_{B_i}(C_i) - f_{B_i}(C_{i+1})$:

$$f_{B_i}(C_i) - f_{B_i}(C_{i+1}) \ge \frac{1}{b}\sum_{j=1}^{k}\left(\Delta(B^j_i, C^j_i) - \Delta(B^j_i, C^j_{i+1})\right) \ge \frac{1}{b}\sum_{j=1}^{k}\alpha^j_i\,b^j_i\,\Delta(C^j_i, cm(B^j_i)) = \sum_{j=1}^{k}\left(\frac{b^j_i}{b}\right)^{1.5}\Delta(C^j_i, cm(B^j_i))$$

where the first inequality is due to Observation 4.1, the second is due to the fact that $\forall j \in [k],\ \Delta(B^j_i, C^j_i) - \Delta(B^j_i, C^j_{i+1}) \ge \alpha^j_i\,b^j_i\,\Delta(C^j_i, cm(B^j_i))$, and in the last equality we simply plug in $\alpha^j_i = \sqrt{b^j_i/b}$. We complete the proof by applying Jensen's inequality (Theorem 2.3) with parameters $\phi(x) = x^{1.5}$, $y_j = b^j_i/b$ and $a_j = \Delta(C^j_i, cm(B^j_i))$, combined with inequality (5):

$$
\begin{aligned}
\sum_{j=1}^{k}\left(\frac{b^j_i}{b}\right)^{1.5}\Delta(C^j_i, cm(B^j_i)) &\ge \left(\sum_{j=1}^{k}\Delta(C^j_i, cm(B^j_i))\right)\left(\frac{\sum_{j=1}^{k}\frac{b^j_i}{b}\Delta(C^j_i, cm(B^j_i))}{\sum_{j=1}^{k}\Delta(C^j_i, cm(B^j_i))}\right)^{1.5} \\
&= \frac{\left(\sum_{j=1}^{k}\frac{b^j_i}{b}\Delta(C^j_i, cm(B^j_i))\right)^{1.5}}{\sqrt{\sum_{j=1}^{k}\Delta(C^j_i, cm(B^j_i))}} \ \ge\ \frac{\epsilon^{1.5}}{\sqrt{\sum_{j=1}^{k}\Delta(C^j_i, cm(B^j_i))}} \ \ge\ \frac{\epsilon^{1.5}}{\sqrt{kd}}
\end{aligned}
$$

where the last inequality uses the fact that $\Delta(C^j_i, cm(B^j_i)) \le d$ for points in $[0,1]^d$.

Finally, plugging $\epsilon' = \epsilon^{1.5}/\sqrt{kd}$ into our bounds, we conclude that if $b = \Omega(\epsilon^{-3}d^3k)$ then the number of iterations is bounded by $O((d/\epsilon)^{1.5}\sqrt{k})$.

ACKNOWLEDGMENTS

The author would like to thank Ami Paz, Uri Meir and Giovanni Viglietta for reading preliminary versions of this work. This work was supported by JSPS KAKENHI Grant Numbers JP21H05850, JP21K17703, JP21KK0204.

REFERENCES

David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG, pp. 144-153. ACM, 2006.

David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pp. 1027-1035. SIAM, 2007.

David Arthur, Bodo Manthey, and Heiko Röglin. Smoothed analysis of the k-means method. J. ACM, 58(5):19:1-19:31, 2011.

Léon Bottou and Yoshua Bengio. Convergence properties of the k-means algorithms. In NIPS, pp. 585-592. MIT Press, 1994.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Johan Ludwig William Valdemar Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30(1):175-193, 1906.

Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89-112, 2004.

Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129-136, 1982.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

D. Sculley. Web-scale k-means clustering. In WWW, pp. 1177-1178. ACM, 2010.

Cheng Tang and Claire Monteleoni. Convergence rate of stochastic k-means. In AISTATS, volume 54 of Proceedings of Machine Learning Research, pp. 1495-1503. PMLR, 2017.

Andrea Vattani. k-means requires exponentially many iterations even in the plane. Discret. Comput. Geom., 45(4):596-616, 2011.