# Label consistency in overfitted generalized k-means

Linfan Zhang, Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095. linfanz@g.ucla.edu

Arash A. Amini, Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095. aaamini@ucla.edu

We provide theoretical guarantees for label consistency in generalized k-means problems, with an emphasis on the overfitted case where the number of clusters used by the algorithm is more than the ground truth. We provide conditions under which the estimated labels are close to a refinement of the true cluster labels. We consider both exact and approximate recovery of the labels. Our results hold for any constant-factor approximation to the k-means problem. The results are also model-free and only based on bounds on the maximum or average distance of the data points to the true cluster centers. These centers themselves are loosely defined and can be taken to be any set of points for which the aforementioned distances can be controlled. We show the usefulness of the results with applications to some manifold clustering problems.

## 1 Introduction

Consider the problem of clustering data points sampled according to some probability measure $\mu$ from a normed space $X$ with norm $\|\cdot\|_X$. In the ideal setting, generalized k-means clustering minimizes the population cost function
$$Q(\xi; \mu) := \Big( \int \min_{1 \le \ell \le L} \|x - \xi_\ell\|_X^p \, d\mu(x) \Big)^{1/p} \tag{1}$$
where $\xi = (\xi_1, \dots, \xi_L) \in X^L$ is a set of $L$ vectors in $X$, for some fixed integer $L$. In practical data analysis, we are given a sample $\{x_1, \dots, x_n\}$ drawn from $\mu$ and solve an empirical version of (1), namely,
$$\widehat Q(\xi) = Q(\xi; P_n) := \Big( \frac1n \sum_{i=1}^n \min_{1 \le \ell \le L} \|x_i - \xi_\ell\|_X^p \Big)^{1/p}. \tag{2}$$
Here, $P_n := \frac1n \sum_{i=1}^n \delta_{x_i}$ is the empirical measure associated with the sample and $\delta_x$ is the point-mass measure at $x$. The minimizer of $\widehat Q(\cdot)$ over $X^L$ is denoted $\hat\xi = (\hat\xi_1, \dots, \hat\xi_L)$, and each point $x_i$ is assigned a cluster label $\hat z_i := \operatorname{argmin}_\ell \|x_i - \hat\xi_\ell\|_X$. Meanwhile, we assume that each data point $x_i$ also has a true cluster label $z_i \in [K] := \{1, \dots, K\}$ which is determined solely by an unknown data-generating process. These true labels are not necessarily related to the optimal solutions of (1) or (2). To distinguish the two, we refer to the clustering induced by $(z_i)$ as the true clustering, while a clustering that minimizes the generalized k-means cost function (2), i.e., the clustering induced by $(\hat z_i)$, is referred to as an optimal k-means clustering.

In this paper, we consider the label consistency problem, that is, how close the labels $(\hat z_i)$ estimated by k-means clustering are to the true labels $(z_i)$. Note that we allow the number of k-means clusters $L$ to be different from the true number of clusters $K$.

In the above formulation, the case where $p = 2$, $X = \mathbb{R}^d$ and $\|\cdot\|_X$ is the Euclidean norm leads to the classical and widely used k-means problem. Much of the theoretical analysis of k-means has been performed in this case. Early work has focused on how close the optimization problems based on the empirical and ideal cost functions (2) and (1) are to each other, where the closeness is measured in terms of the recovered centers (i.e., $\hat\xi$ and $\xi$) or the optimal value of the objective function. Such consistency results are proved, for the global minimizers of (2), in the early work of [22, 29] and also in [30, 19] from the vector quantization perspective.
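As a concrete companion to the notation above, here is a minimal numpy sketch of the empirical objective (2) and the induced label assignment in the Euclidean case $X = \mathbb{R}^d$; the function names and arguments are illustrative and not part of the paper.

```python
import numpy as np

def empirical_cost(X, centers, p=2):
    """Q_hat(xi) of (2): ((1/n) * sum_i min_l ||x_i - xi_l||^p)^(1/p)."""
    # n x L matrix of Euclidean distances from each point to each center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.mean(np.min(dists, axis=1) ** p) ** (1.0 / p)

def assign_labels(X, centers):
    """z_hat_i = argmin_l ||x_i - xi_l||, the labels induced by the centers."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```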
These classical results do not directly apply to the performance of k-means in practice, mainly because solving (2) is NP-hard and approximation methods are usually applied to deal with it. Also, considerations of the label consistency problem are absent from this line of work, since no true clustering, external to the k-means problem, is assumed to exist. More recently, there has been more interest in the consistency of practical k-means algorithms [14, 21] as well as the label consistency problem. Lu and Zhou [21] obtain sharp bounds on the label consistency of Lloyd's algorithm [20] under a sub-Gaussian mixture model. Semidefinite programming (SDP) relaxation is another popular technique for deriving polynomial-time approximations to the k-means problem [28]. Its label consistency has been studied when the data are generated from the stochastic ball model [4, 10], sub-Gaussian mixtures [25, 8, 9], the Stochastic Block Model (SBM) [9] and general models [18]. Convex clustering is another relaxation method whose label consistency has been discussed in [34, 27, 11, 31]. The literature on community detection in the SBM, a network clustering problem, is also mainly focused on label consistency and inspires our work here; see [1, 33] for a review of those results. For label consistency in kernel spectral clustering, see [2].

In this paper, we study the label consistency of approximate solutions of the generalized k-means problem (2) when $L \ge K$. Our focus will be on the overfitted case where $L > K$. This is often relevant in practice, since the data-generating process may have a natural number of clusters $K$ that is unknown a priori. An example is the sub-Gaussian mixture with $K$ components. More interesting examples are given in Section 3. All the aforementioned works on label consistency exclusively consider the correctly-fitted case $L = K$. We show that when the data-generating process admits a set of centers that satisfy certain separation conditions, the estimated labels with $L \ge K$ clusters are close to a refinement of the true labels. These bounds reduce to the label consistency criteria for $L = K$, but have no counterpart in the literature for $L > K$.

Overfitting in k-means is considered in [32, 23], where it is shown to improve the approximation factor (see Assumption 1(b)) of certain polynomial-time k-means algorithms. Analysis of the approximation factor is concerned with how close one can get to the optimal value of the k-means objective function. In contrast, we are concerned with the label recovery problem and not directly with how well the objective function is approximated. Our work is also aligned with the recent trend of beyond-worst-case analysis of NP-hard problems [6], where the performance of algorithms is studied assuming that there is some meaningful structure in the data (e.g., true clusters). We refer to Remark 1 for a more detailed comparison with this literature.

Our results are algorithm-free in the sense that they apply to any algorithm that achieves a constant-factor approximation to the optimal objective. They are also model-free in the sense that we do not make any explicit assumption on the data-generating process. This is important in practice, since many common data models, such as sub-Gaussian mixtures, are often too simplified to capture real clustering problems. We provide examples of more complicated data models in Section 3 and show how our general results can provide new insights for these models.
Since k-means clustering often appears as a building block in many sophisticated clustering algorithms, we believe our results will be of broad interest in understanding the performance of clustering algorithms in overfitted settings.

Notation. $Q(\xi; \mu)$ only depends on the set of values among the coordinates of $\xi$. Although we view $\xi$ as a vector (for which the order of elements matters), with some abuse of notation, we view $Q(\cdot; \mu)$ as a set function (mapping $2^X$ to $\mathbb R$) that is only sensitive to the set of values represented by $\xi$. This justifies using the same symbol for the function irrespective of the number of coordinates of $\xi$, i.e., the number of clusters. The reason to keep $\xi$ as an (ordered) vector is to make the cluster labels well-defined. For simplicity, let $\|\cdot\| = \|\cdot\|_X$. For the case where $X \subset \mathbb R^d$, one often takes $\|\cdot\|$ to be the Euclidean norm, but our results are valid for any norm on $\mathbb R^d$, and more broadly any normed space $X$.

## 2 Main Results

We first state assumptions about the k-means clustering algorithm.

**Assumption 1.** Consider an algorithm for the generalized k-means problem (2), referred to as ALG(p) hereafter, and let $\hat\xi^{(L)} \in X^L$ and $\hat\xi^{(K)} \in X^K$ be its estimated centers when applied with $L$ and $K$ clusters, respectively. Let $L \ge K$. Assume that ALG(p) has the following properties, for all input sequences $(x_i)$:

(a) Efficiency: The Voronoi cell of every estimated center $\hat\xi^{(L)}_\ell$ contains at least one of the $(x_i)$.

(b) $\kappa$-approximation: $\widehat Q(\hat\xi^{(K)}) \le \kappa \min_{\xi \in X^K} \widehat Q(\xi)$, and similarly with $K$ replaced by $L$.

Efficiency can be achieved by substituting centers whose Voronoi cells have an empty intersection with $\{x_i\}$ by those having the opposite property. For $\kappa$-approximation, the factor $\kappa$ can depend on the number of clusters $K$ (or $L$). For example, the k-means++ algorithm has $\kappa = O(\log K)$, with high probability over the initialization [3]. However, there are also constant-factor approximation algorithms for k-means where $\kappa = O(1)$, independent of $K$ (or $L$) [24, 13, 15]. For example, with local search, k-means++ can achieve a constant-factor approximation [16]. In addition, $\kappa$-approximation is not required for all inputs; that is, we are not concerned with the worst-case approximation factor. The $\kappa$ in Assumption 1(b) is the approximation factor of the algorithm on the specific data under consideration. It is enough for an algorithm to achieve a good approximation only on the data of interest. For some of the results, Assumption 1(b) can be replaced with the following modified version:

(b′) $\kappa$-approximation only for $K$ clusters, plus a monotonicity assumption: $\widehat Q(\hat\xi^{(L)}) \le \widehat Q(\hat\xi^{(K)})$.

Monotonicity is also a reasonable requirement and obviously holds for the exact k-means solutions. Next, we extend the definition of the misclassification rate to the overfitted case.

**Definition 1.** The (generalized) misclassification rate between two label vectors $z \in [K]^n$ and $\hat z \in [L]^n$, with $K \le L$, is
$$\mathrm{Miss}(z, \hat z) = \min_\omega \frac1n \sum_{i=1}^n \mathbf 1\{z_i \ne \omega(\hat z_i)\},$$
where the minimization ranges over all surjective maps $\omega : [L] \to [K]$.

When $L = K$, a surjective map $\omega$ is necessarily a bijection and the above becomes the usual definition of the misclassification rate when the number of clusters is correctly identified. In this case, $\mathrm{Miss}(z, \hat z) = 0$ means that there is a one-to-one correspondence between the estimated and true clusters. The generalized definition above allows us to extend this notion of exact recovery to the case $L > K$. In particular, $\mathrm{Miss}(z, \hat z) = 0$ when $L > K$ if and only if $\hat z$ is a refinement of $z$. To see this, note that $\mathrm{Miss}(z, \hat z) = 0$ implies the existence of a map $\omega : [L] \to [K]$ such that $\omega(\hat z_i) = z_i$ for all $i$. This in turn is equivalent to: $\hat z_i = \hat z_j \implies z_i = z_j$, which is the refinement relation for the associated clusters. In general, $\mathrm{Miss}(z, \hat z)$ is small if $\hat z$ is close to a refinement of $z$.
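For small $K$ and $L$, Definition 1 can be evaluated directly by brute force over all surjective maps $\omega : [L] \to [K]$. The sketch below (our own helper, using 0-indexed labels) is meant as a readable reference implementation rather than an efficient one.

```python
import itertools
import numpy as np

def miss(z, z_hat, K, L):
    """Generalized misclassification rate of Definition 1 (brute force)."""
    z = np.asarray(z)          # true labels in {0, ..., K-1}
    z_hat = np.asarray(z_hat)  # estimated labels in {0, ..., L-1}
    best = 1.0
    for omega in itertools.product(range(K), repeat=L):
        if len(set(omega)) < K:      # skip non-surjective maps
            continue
        best = min(best, np.mean(np.array(omega)[z_hat] != z))
    return best

# Example: z_hat refines z (two estimated clusters per true cluster), so Miss = 0.
z     = [0, 0, 0, 1, 1, 1]
z_hat = [0, 0, 1, 2, 3, 3]
print(miss(z, z_hat, K=2, L=4))  # 0.0
```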
We also use the (optimal) matching distances between elements of two vectors viewed as sets.

**Definition 2.** For $\xi \in X^L$ and $\xi' \in X^K$, define the $\ell_\infty$ and $\ell_p$ optimal matching distances as
$$d_\infty(\xi, \xi') = \min_\sigma \max_{1 \le k \le K} \|\xi_{\sigma(k)} - \xi'_k\|, \qquad d_p(\xi, \xi') = \min_\sigma \Big( \sum_{k=1}^K \|\xi_{\sigma(k)} - \xi'_k\|^p \Big)^{1/p},$$
where $\sigma : [K] \to [L]$ ranges over all injective maps.

For $K = L$, $d_\infty$ is an upper bound on the Hausdorff distance between the two sets. Obviously, we have $d_\infty \le d_p$ for any $p \ge 1$.

### 2.1 Distance to true centers

Let $z = (z_i)_{i=1}^n \in [K]^n$ be a given set of true labels for the data points $(x_i)_{i=1}^n$. In addition, our results are stated in terms of a set of vectors $\xi^* = (\xi^*_k)_{k=1}^K$ which we refer to as the true centers. Throughout, $\xi^*$ will be only vaguely specified. The only requirement on $\xi^*$ is that, together with the observed data points $(x_i)$ and the true labels $(z_i)$, they satisfy the deviation bounds in each theorem, e.g., $\max_{1 \le i \le n} \|x_i - \xi^*_{z_i}\| \le \eta$ in Theorem 1, etc. Let $\pi_k = \frac1n \sum_{i=1}^n \mathbf 1\{z_i = k\}$ be the proportion of observed data points in true cluster $k$ and let $\pi_{\min} = \min_k \pi_k$. We let $\hat\xi$ be a solution of the k-means algorithm with $L \ge K$ centers and let $\hat z_i \in \operatorname{argmin}_\ell \|x_i - \hat\xi_\ell\|$ be the corresponding estimated labels.

Our first result provides guarantees for exact label recovery, in the extended sense of recovering a refinement of the true partition when $L > K$ and the exact partition when $L = K$.

**Theorem 1** (Exact recovery). Consider a vector of (true) centers $\xi^* \in X^K$ and labels $(z_i)_{i=1}^n \in [K]^n$. Pick $\eta, \delta > 0$ such that
$$\max_{1 \le i \le n} \|x_i - \xi^*_{z_i}\| \le \eta, \quad \text{and} \quad \min_{(k,k'):\, k \ne k'} \|\xi^*_k - \xi^*_{k'}\| \ge \delta. \tag{3}$$
Consider an algorithm ALG(p) for problem (2), satisfying Assumption 1, and let $(\hat z_i)_{i=1}^n \in [L]^n$ and $\hat\xi \in X^L$ be the estimated labels and centers of ALG(p) applied with $L \ge K$ clusters. Then,
$$\frac{\delta}{\eta} > \frac{2(1+\kappa)}{\pi_{\min}^{1/p}} + 4 \;\Longrightarrow\; \mathrm{Miss}(z, \hat z) = 0, \quad d_p(\hat\xi, \xi^*) \le \frac{(1+\kappa)\eta}{\pi_{\min}^{1/p}}. \tag{4}$$

When $L = K$, the assertion $\mathrm{Miss} = 0$ means that there is a permutation $\sigma$ on $[K]$ such that $\sigma(\hat z_i) = z_i$ for all $i$, that is, we have exact recovery of the labels $(z_i)$ in the classical sense. When $L > K$, Theorem 1 guarantees the exact recovery of a refinement of the true labels $(z_i)$.

**Example 1** (Stochastic Ball Model). Assume that data are generated from the stochastic ball model considered in [26], where $x_i = \xi^*_{z_i} + r_i$ with $r_i$ sampled independently from a distribution supported on the unit ball in $\mathbb R^d$. Here, $\{\xi^*_k\}_{k=1}^K \subset \mathbb R^d$ are a fixed set of centers. Clearly, we can take $\eta = 1$ in Theorem 1. Then, any $\kappa$-approximate k-means algorithm achieves exact recovery when $\delta > 2 + 2(1+\kappa)/\sqrt{\pi_{\min}}$ for $L = K$. In the overfitted case, when $\delta > 4 + 2(1+\kappa)/\sqrt{\pi_{\min}}$, the estimated label vector is an exact refinement of the true labels $(z_i)$.

In the above example, although it is intuitively clear that for a sufficiently large $\delta$, the solution of the k-means problem should achieve exact label recovery (in the extended sense), Theorem 1 allows us to provide a provable guarantee for any constant-factor approximation, with an explicit bound on the separation parameter $\delta$.
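The following is a quick simulation in the spirit of Example 1. The dimensions, the separation value, and the use of scikit-learn's KMeans (a k-means++ based solver) as a practical stand-in for a $\kappa$-approximate ALG(2) are our own choices; the refinement check matches the extended notion of exact recovery from Definition 1.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, K, L = 600, 2, 3, 6
delta = 12.0                                  # center separation (eta = 1 here)
centers = delta * np.eye(K, d)                # K well-separated centers in R^d
z = rng.integers(0, K, size=n)                # true labels
u = rng.normal(size=(n, d))
r = rng.uniform(0, 1, size=n) ** (1.0 / d)    # radii for uniform draws from the unit ball
X = centers[z] + (r / np.linalg.norm(u, axis=1))[:, None] * u

z_hat = KMeans(n_clusters=L, n_init=10, random_state=0).fit_predict(X)

# Perfect refinement: every estimated cluster sits inside a single true cluster.
is_refinement = all(len(set(z[z_hat == ell])) <= 1 for ell in range(L))
print("perfect refinement:", is_refinement)
```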
We now turn to approximate recovery, where the misclassification rate is small.

**Theorem 2** (Approximate recovery). Consider a vector of (true) centers $\xi^* \in X^K$ and labels $(z_i)_{i=1}^n \in [K]^n$. Pick $\varepsilon, \delta > 0$ such that $\big(\frac1n \sum_{i=1}^n \|x_i - \xi^*_{z_i}\|^p\big)^{1/p} \le \varepsilon$, and the separation condition in (3) holds. Consider an algorithm ALG(p) for problem (2), satisfying Assumption 1, and let $(\hat z_i)_{i=1}^n \in [L]^n$ and $\hat\xi \in X^L$ be the estimated labels and centers of ALG(p) applied with $L \ge K$ clusters. Then, for any $c > 2$,
$$\frac{\delta}{\varepsilon} > \frac{(1+\kappa)c}{\pi_{\min}^{1/p}} \;\Longrightarrow\; \mathrm{Miss}(z, \hat z) < K(1+\kappa)^p c^p \Big(\frac{\varepsilon}{\delta}\Big)^p, \quad d_p(\hat\xi, \xi^*) \le \frac{(1+\kappa)\varepsilon}{\pi_{\min}^{1/p}}. \tag{5}$$

The key difference between Theorems 1 and 2 is the bounds assumed on the deviations $D_i := \|x_i - \xi^*_{z_i}\|$, $i \in [n]$. Theorem 1 assumes a bound on the maximum distance to the true centers, $\max_i D_i$, while Theorem 2 assumes a bound on an average distance, $(\frac1n \sum_i D_i^p)^{1/p}$, leading to a more relaxed condition.

Theorem 2 provides an upper bound on the misclassification rate when a certain separation condition is satisfied. To simplify, consider the case $K = \kappa = p = 2$ and take $c = 2.1$. Then, Theorem 2 implies the following: For every $\beta > 0$, there exists a constant $c_1(\beta, \pi_{\min}) > 0$ such that if
$$\delta/\varepsilon \ge c_1(\beta, \pi_{\min}), \tag{6}$$
then any 2-factor k-means algorithm will have $\mathrm{Miss} \le \beta$ relative to the target labels. The next proposition shows that condition (6) is sharp up to constants.

**Proposition 1.** There exists a family of datasets $\{(x_i, z_i)\}_{i=1}^n$, with $K = 2$ balanced true clusters (i.e., $\pi_{\min} = 1/2$) and parameterized by the true center separation $\delta$ and $\varepsilon = (\frac1n \sum_{i=1}^n \|x_i - \xi^*_{z_i}\|^2)^{1/2}$, with the following property: Given any constant $\beta \in (0, 1/2)$, there exists a constant $c_2(\beta) > 0$ such that if $\delta/\varepsilon < c_2(\beta)$, then any 2-factor k-means approximation algorithm with $L = 2$ clusters has misclassification rate satisfying $\mathrm{Miss}(z, \hat z) \ge \frac12 - \beta$. Moreover, any 2-factor k-means approximation algorithm with $L = 4$ clusters will recover a perfect refinement of the original clusters in the above setting.

The proof of Proposition 1 can be found in the Supplementary Material. This proposition shows that if the separation condition (6) is reversed, one can force the performance of any k-means algorithm to be arbitrarily close to that of random guessing. The true centers in Proposition 1 are the natural centers implied by the k-means cost function for the true labels, that is, $\xi^*_k = \frac{1}{n_k}\sum_{i:\, z_i = k} x_i$ for $k = 1, 2$, where $n_k$ is the size of true cluster $k$. One can take $c_1(\beta, \pi_{\min}) = 6.3 \max(1/\pi_{\min}, 2/\beta)^{1/2}$ and $c_2(\beta) = \sin(\tan^{-1}(\sqrt{\beta/45}))$ for the constants in (6) and Proposition 1.

**Remark 1.** The separation condition (6) is related to the distribution stability introduced in [5]. Roughly speaking, distribution stability plus the following property implies our condition:

(D1) For every pair of distinct clusters $C_k$ and $C_\ell$ with centers $\xi^*_k$ and $\xi^*_\ell$, there is a point $x \in C_\ell$ such that $\|x - \xi^*_k\| \le \|\xi^*_\ell - \xi^*_k\|$. That is, every cluster $C_\ell$ has points which are closer than $\xi^*_\ell$ to the centers of other clusters.

This property is quite mild, and one expects it to hold with high probability if the distribution of the points has positive density w.r.t. the (full-dimensional) Lebesgue measure in a ball around the center. The above seems to suggest that distribution stability is slightly weaker than our condition (6). However, in the presence of (D1), we can also significantly relax distribution stability to arrive at our condition; the details are provided in the Supplementary Material. In this sense, these two notions are closely related but not directly comparable, i.e., neither is weaker than the other in general.
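To make the quantitative content of Theorem 2 and the separation condition (6) easy to experiment with, here is a small helper that evaluates the threshold and the misclassification bound in (5). The function and its default arguments are ours; it simply restates the displayed formulas.

```python
def theorem2_bound(delta, eps, pi_min, kappa=1.0, K=2, p=2, c=2.1):
    """Return the Miss bound from (5), or None if the separation condition fails."""
    threshold = (1 + kappa) * c / pi_min ** (1.0 / p)
    if delta / eps <= threshold:
        return None  # separation condition in (5) not satisfied; no guarantee
    return K * (1 + kappa) ** p * c ** p * (eps / delta) ** p

# Example: K = 2 balanced clusters, kappa = 2, p = 2, c = 2.1 as in the text.
print(theorem2_bound(delta=30.0, eps=1.0, pi_min=0.5, kappa=2.0, K=2))
```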
**Example 2** (Sub-Gaussian mixtures). Let us assume that the data is generated from a $K$-component sub-Gaussian mixture model $x_i = \xi^*_{z_i} + d^{-1/2} w_i$, where $w_i = (w_{i1}, \dots, w_{id}) \in \mathbb R^d$ is a zero-mean sub-Gaussian noise vector with sub-Gaussian parameter $\sigma_i$, and $z_i \in [K]$ is the latent label of the $i$th observation. This is an extension of the sub-Gaussian mixture model considered in [7]. Determining whether $(\xi^*_k)_{k=1}^K$ is actually the solution of the population problem (1) is, itself, challenging and the answer may depend on the exact distribution of $\{w_i\}$. Nevertheless, our results allow us to treat $(\xi^*_k)$ as the true centers. Below we sketch how Theorem 2 applies in this case. The details of the arguments, including the exact definition of a sub-Gaussian vector, are provided in the Supplementary Material.

Let $\sigma_{\max} = \max_i \sigma_i$ and set $\alpha_i^2 := \mathbb E\|d^{-1/2} w_i\|_2^2$ and $\bar\alpha_n^2 := \frac1n \sum_{i=1}^n \alpha_i^2$. Assume that there is a numerical constant $C > 0$ such that $\bar\alpha_n^2 \le C\sigma_{\max}^2$. Then, one can show that
$$\mathbb P\Big( \frac1n \sum_{i=1}^n \|x_i - \xi^*_{z_i}\|^2 > 2\bar\alpha_n^2 \Big) \le e^{-c_1 n \bar\alpha_n^4/\sigma_{\max}^4} =: p_n$$
for some numerical constant $c_1 > 0$. Taking $\varepsilon^2 = 2\bar\alpha_n^2$ and $p = 2$ in Theorem 2, we have that with probability at least $1 - p_n$,
$$\frac{\delta^2}{2\bar\alpha_n^2} > \frac{(1+\kappa)^2 c^2}{\pi_{\min}} \;\Longrightarrow\; \mathrm{Miss}(z, \hat z) \le 2K(1+\kappa)^2 c^2 \Big(\frac{\bar\alpha_n}{\delta}\Big)^2,$$
where $\delta$ is as in (3) and $c > 2$. In a general problem, $\bar\alpha_n$, $\sigma_{\max}$ and $\delta$ all can vary with $n$. In order to have label consistency for an ALG(2) algorithm, it is enough to have $\bar\alpha_n/\delta = o(1)$ and $n\bar\alpha_n^4/\sigma_{\max}^4 \to \infty$. The consistency here is based on the extended Definition 1 and includes the overfitted case in which a refinement of the true labels is consistently recovered.

We note that the model in this example includes a very general Gaussian mixture model as a special case, namely the case $w_i \sim N(0, \Sigma_i)$ where the covariance matrices $\Sigma_i \in \mathbb R^{d \times d}$ are allowed to vary with each data point. In this case, one can take $\sigma_{\max} = \max_{1 \le i \le n} \|\Sigma_i\|_{\mathrm{op}}$, where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm, and $\bar\alpha_n^2 := \frac1n \sum_{i=1}^n \mathrm{tr}(\Sigma_i)/d$.

### 2.2 Distance to fake centers

We now extend Theorem 2 to allow for fake centers $\{\tilde\xi_\ell\}_{\ell=1}^L$ and the corresponding labels $\{\tilde z_i\}$. These can be constructed to reduce the required distance to the data points $(x_i)$.

**Theorem 3** (Approximate recovery, II). For a fixed $L \ge K$, consider a vector of constructed centers $\tilde\xi \in X^L$, constructed labels $\tilde z = (\tilde z_i)_{i=1}^n \in [L]^n$ and true labels $z = (z_i)_{i=1}^n \in [K]^n$. Assume that $\tilde z$ is a refinement of $z$, i.e., there is $\tilde\omega : [L] \to [K]$ such that $\tilde\omega(\tilde z_i) = z_i$ for all $i \in [n]$. Pick $\varepsilon, \delta > 0$ such that
$$\Big( \frac1n \sum_{i=1}^n \|x_i - \tilde\xi_{\tilde z_i}\|^p \Big)^{1/p} \le \varepsilon, \qquad \min_{\ell_1 \ne \ell_2,\; \tilde\omega(\ell_1) \ne \tilde\omega(\ell_2)} \|\tilde\xi_{\ell_1} - \tilde\xi_{\ell_2}\| \ge \delta. \tag{7}$$
Consider an algorithm ALG(p) for problem (2), satisfying Assumption 1, and let $(\hat z_i)_{i=1}^n \in [L]^n$ be the estimated label vector of ALG(p) applied with $L$ clusters. Then, for any $c > 2$,
$$\frac{\delta}{\varepsilon} > \frac{(1+\kappa)c}{\pi_{\min}^{1/p}} \;\Longrightarrow\; \mathrm{Miss}(z, \hat z) < K(1+\kappa)^p c^p \Big(\frac{\varepsilon}{\delta}\Big)^p. \tag{8}$$

The advantage of Theorem 3 is that when the desired number of clusters $L$ increases, the bound on the misclassification rate can go down: In some applications, by carefully constructing the fake centers $\tilde\xi$, we can make $\varepsilon$ smaller as $L$ increases, while roughly maintaining the minimum separation among fake centers associated with different true clusters. If successful, this implies that a refinement of the true clustering can be achieved even when it is hard to recover the true clustering itself. In the following section, we show how this strategy can be applied to some manifold clustering problems.

## 3 The case for overfitting

We now illustrate applications of Theorem 3 in settings where it is hard to recover the true clusters, based on the ideal $K$, but it is possible to obtain accurate refinements by overfitting. The idea is to consider clusters that look like submanifolds of $\mathbb R^d$.
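As a toy illustration of the fake-center strategy behind Theorem 3 (anticipating the line-circle example of Section 3.3), the sketch below places $L$ equispaced constructed centers on a circle of radius $R$: the average deviation $\varepsilon$ in (7) shrinks roughly like $R/L$ as $L$ grows, while constructed centers belonging to a well-separated second cluster would stay at distance of order $\delta$. All names and values here are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
R, n = 5.0, 2000
theta = rng.uniform(0, 2 * np.pi, size=n)
X = R * np.column_stack([np.cos(theta), np.sin(theta)])   # one circular cluster

for L in (2, 4, 8, 16, 32):
    phi = 2 * np.pi * np.arange(L) / L
    fake = R * np.column_stack([np.cos(phi), np.sin(phi)])   # L fake centers on the circle
    dists = np.linalg.norm(X[:, None, :] - fake[None, :, :], axis=2)
    eps = np.sqrt(np.mean(np.min(dists, axis=1) ** 2))       # p = 2 version of eps in (7)
    print(L, round(eps, 3))
```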
### 3.1 Mixture of curves

We say that a random variable $x$ has a $(\rho, \sigma)$ sub-Gaussian curve distribution if $x = \gamma(t)$, where $t \in \mathbb R$ has a sub-Gaussian distribution with parameter $\sigma$ and $\gamma : \mathbb R \to \mathbb R^d$ is a locally $\rho$-Lipschitz map, i.e., $\|\gamma(t) - \gamma(s)\| \le \rho|t - s|$ for all $t, s \in \mathbb R$ such that $|t - s| \le 1$.

**Proposition 2.** Assume that $(x_i)_{i=1}^n$ are independent draws from a $K$-component mixture of $(\rho, \sigma)$ sub-Gaussian curve distributions. That is, $x_i = \gamma_{z_i}(t_i)$ where $z_i \in [K]$, $t_i \sim Q_{z_i}$ independently across $i$, each $Q_k$ is a sub-Gaussian distribution on $\mathbb R$ with parameter $\sigma$, and each $\gamma_k$ is locally $\rho$-Lipschitz. Let $C_k$ be the support of the distribution of $\gamma_k(t)$ where $t \sim Q_k$. Assume that
$$\min_{x \in C_k,\, y \in C_{k'}} \|x - y\| \ge \delta > 0, \quad \text{for all } k \ne k'.$$
Then, there exists a constant $C = C(K, \delta, \rho, \sigma, \kappa)$ such that any ALG(2) satisfying Assumption 1 applied with $L_n \ge C\sqrt{n \log n}$ clusters recovers a perfect refinement of $z$ with probability $\ge 1 - n^{-1}$.

The significance of this result is that one recovers a perfect refinement with the number of partitions $L_n = o(n)$. It is trivial to obtain a perfect refinement with $L_n = n$, but not so with $L_n/n \to 0$. This is especially the case since one can achieve quite complex cluster configurations with the model in Proposition 2, for some of which applying k-means with $K$ clusters will have a misclassification rate bounded away from zero. Section 3.3 provides some such examples where the true cluster centers coincide, causing any k-means algorithm applied with the true $K$ to incur a substantial error. See also the Supplementary Material for a discussion of whether $L_n = O(\sqrt{n \log n})$ can be improved.

Various extensions of Proposition 2 are possible. We have the following extension to the noisy setting.

**Corollary 1.** Assume that the data is given by $y_i = x_i + \frac{1}{\sqrt d} w_i$ for $i \in [n]$, where $(x_i)$ are as given in Proposition 2 and $w_i$ are sub-Gaussian noise vectors as in Example 2. Then, under the same assumptions as in Proposition 2, ALG(2) applied with $L_n \ge C\sqrt{n \log n}$ clusters achieves a misclassification rate $\lesssim K(\bar\alpha_n/\delta)^2 + \frac1n$ with probability $\ge 1 - p_n - n^{-1}$, where $\bar\alpha_n$ and $p_n$ are defined in Example 2.

Corollary 1 shows that one can achieve consistent clustering (in the generalized sense) with $L_n = o(n)$ clusters, assuming that the noise-to-signal ratio $\bar\alpha_n/\delta \to 0$ and $n\bar\alpha_n^4/\sigma_{\max}^4 \to \infty$; the same conditions needed in the sub-Gaussian mixture example. Again, this result is significant since even in the noiseless case ($\bar\alpha_n = 0$), consistent recovery is not possible with $L = K$ for some mixture-of-curves models.

[Figure 1: Line-circle model. (a) Scatter plot for the noisy version (L = 4, δ = 3); the colors show the L = 4 estimated clusters by k-means. (b) and (c) show the (generalized) misclassification rate versus δ, the radius of the circle, in the noisy and noiseless versions of the model, for L = 2, 4, 10.]
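A small simulation in the spirit of Proposition 2, using the line-circle configuration described in Section 3.3: both clusters have the same mean, so k-means with $L = K = 2$ cannot separate them, while an overfitted fit with $L_n \approx \sqrt{n \log n}$ clusters typically returns a perfect refinement. The constant in $L_n$, the seed, and the use of scikit-learn's KMeans are our choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n, delta = 3000, 3.0
z = rng.integers(0, 2, size=n)                       # true cluster labels
theta = rng.uniform(0, 2 * np.pi, size=n)
circle = delta * np.column_stack([np.cos(theta), np.zeros(n), np.sin(theta)])
line = np.column_stack([np.zeros(n), rng.normal(size=n), np.zeros(n)])
X = np.where((z == 0)[:, None], circle, line)        # circle in xz-plane, Gaussian on y-axis

for L in (2, int(np.sqrt(n * np.log(n)))):
    z_hat = KMeans(n_clusters=L, n_init=10, random_state=0).fit_predict(X)
    refinement = all(len(set(z[z_hat == ell])) <= 1 for ell in range(L))
    print(f"L = {L}: perfect refinement = {refinement}")
```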
### 3.2 Mixture of higher-order submanifolds

A version of Proposition 2 can be stated for a higher-dimensional version of the mixture-of-curves model, if we consider generalized k-means problems with $p > 2$. We say that a random variable $x$ has a $(\rho, \sigma, r)$ sub-Gaussian manifold distribution if $x = \gamma(t)$, where $t \in \mathbb R^r$ has a sub-Gaussian distribution with parameter $\sigma$ and $\gamma : \mathbb R^r \to \mathbb R^d$ is a locally $\rho$-Lipschitz map, i.e., $\|\gamma(t) - \gamma(s)\| \le \rho\|t - s\|$ for all $t, s \in \mathbb R^r$ such that $\|t - s\| \le 1$.

**Proposition 3.** Assume that $(x_i)_{i=1}^n$ are independent draws from a $K$-component mixture of sub-Gaussian manifold distributions, with parameters $(\rho, \sigma, r_k)$ for $k \in [K]$, and let $r = \max_{k \in [K]} r_k$. Let $C_k$ be the support of the distribution of the $k$th component. Assume that
$$\min_{x \in C_k,\, y \in C_{k'}} \|x - y\| \ge \delta > 0, \quad \text{for all } k \ne k'.$$
Then, there exists a constant $C = C(K, \delta, \rho, \sigma, r, \kappa)$ such that any ALG(p) satisfying Assumption 1, applied with $L_n \ge C(n^{1/p}\log n)^r$ clusters, recovers a perfect refinement of $z$ with probability $\ge 1 - n^{-1}$. In particular, for $p > r$, we have perfect refinement recovery with $L_n = o(n)$ clusters, with high probability.

It is also possible to extend the results to more general distributions on submanifolds via a notion of stochastic covering numbers. For a random vector $x$ with distribution $\mu_C$ on a submanifold $C \subset \mathbb R^d$, let $N_{\mu_C}(\varepsilon)$ be the smallest integer for which there is a high-probability $\varepsilon$-cover of $x$, that is, a finite subset $N \subset C$ of that cardinality such that $\mathbb P(\min_{y \in N} \|x - y\| \le \varepsilon) \ge 1 - n^{-2}$. We state a generalization of Proposition 3 to this setting in the Supplementary Material.

### 3.3 Numerical experiments

We first consider the (noiseless) line-circle model in $\mathbb R^3$, an example of a mixture of curves. This model has two clusters: (1) the uniform distribution on the circumference of a circle in the xz-plane, centered at the origin, and (2) the standard Gaussian distribution on the y-axis. The minimum separation $\delta$ between the two clusters is the radius of the circle. We also consider the noisy version of this model where we add $N(0, \sigma^2 I_3)$ noise. We sample data points with equal probability from the two clusters. It is nearly impossible for k-means to correctly label these two clusters when $L = 2$, since the centers of the two clusters coincide.

Figure 1 shows the scatter plot of the data simulated from the noisy line-circle model, with noise level $\sigma = 0.1$, $n = 3000$ and $\delta = 3$. Here, the noise level is set low for better illustration. Different colors are used to label data points based on the output of k-means clustering with $L = 4$, and this demonstrates that each estimated cluster is a subset of a true cluster. The result aligns with Theorem 3. Although the true centers coincide (with the origin) when $L = 2$, by increasing $L$, we can create fake centers on the line and the circle with separation close to $\delta$ and thus get a small misclassification rate. The other two panels in Figure 1 show the average misclassification rate over 32 repetitions versus $\delta$, for both the noiseless and noisy ($\sigma = 1$) line-circle model. Both show that the misclassification rate is negatively associated with $\delta$ and $L$ when $L > 2$. Similar results are shown for the circle-torus model in Figure 3. Details of this model are discussed in the Supplementary Material.

[Figure 3: Circle-torus model. (a) Scatter plot for the noiseless version; colors separate the two true clusters. (b) and (c) show the (generalized) misclassification rate versus δ, the radius of the circle, in the noisy and noiseless versions of the model, for L = 2, 4, 10.]

Figure 2 shows the results for a line-Gaussian mixture model: $x_i = \xi^*_{z_i} + \Sigma_{z_i}^{1/2} w_i \in \mathbb R^2$, where $\xi^*_1 = (0, \delta)$ and $\xi^*_2 = (0, 0)$, $w_i \sim N(0, I_2)$, $\Sigma_1 = I_2$ and $\Sigma_2 = \mathrm{diag}(\sigma^2, 0)$. Here, we have set $\sigma = 5$ and sampled $n = 3000$ data points with equal probability from the two clusters. We also consider its noisy version by setting all the zero elements in $\Sigma_2$ to 0.7, which makes the model a general Gaussian mixture. Figure 2 shows the average misclassification rate over 32 repetitions for different $L$. The results are consistent with Theorem 3, showing that as $\delta$ increases, the misclassification rate decreases.

[Figure 2: Line-Gaussian model. The (generalized) misclassification rate versus δ, the distance of the Gaussian center to the line, in the (a) noisy and (b) noiseless versions of the model, for L = 2, 4, 10.]
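A rough, scaled-down reproduction of the noiseless line-Gaussian experiment just described (a single run per setting rather than 32 repetitions, a small grid of $\delta$ values, and a brute-force evaluation of Definition 1 are our own simplifications):

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def miss(z, z_hat, K, L):
    # Generalized misclassification rate of Definition 1 (brute force over
    # surjective maps omega: [L] -> [K]; fine for small K and L).
    best = 1.0
    for omega in itertools.product(range(K), repeat=L):
        if len(set(omega)) < K:
            continue
        best = min(best, np.mean(np.array(omega)[z_hat] != z))
    return best

rng = np.random.default_rng(3)
n, sigma = 3000, 5.0
z = rng.integers(0, 2, size=n)
for delta in (2.0, 4.0, 6.0):
    # cluster 1: isotropic Gaussian at (0, delta); cluster 2: Gaussian on the x-axis
    w = rng.normal(size=(n, 2))
    X = np.where((z == 0)[:, None],
                 np.column_stack([w[:, 0], delta + w[:, 1]]),
                 np.column_stack([sigma * w[:, 0], np.zeros(n)]))
    rates = [miss(z, KMeans(n_clusters=L, n_init=10, random_state=0).fit_predict(X), 2, L)
             for L in (2, 4, 10)]
    print(delta, [round(r, 3) for r in rates])
```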
## 4 Proofs

Let us first recall a fact from functional analysis. Consider the space of functions $f : [n] \to X$ and let us equip $[n]$ with the uniform probability measure $\nu_n$. Then, from the theory of Lebesgue–Bochner spaces, $\|f\|_p := (\int \|f(\omega)\|_X^p \, d\nu_n(\omega))^{1/p}$ defines a proper norm on this function space, turning it into a Banach space $L^p(\nu_n; X)$. In particular, the triangle inequality holds for this norm. Note that $\|f\|_p = (\frac1n \sum_{i=1}^n \|f(i)\|_X^p)^{1/p}$. We will frequently invoke the triangle inequality in $L^p(\nu_n; X)$.

Let $\mu^* := \sum_k \pi_k \delta_{\xi^*_k} = \frac1n \sum_{i=1}^n \delta_{\xi^*_{z_i}}$ be the empirical measure associated with $\{\xi^*_{z_i}\}$. Recalling definition (1) of the population cost function, we have, for any $\xi \in X^L$,
$$Q(\xi; \mu^*)^p = \sum_{k=1}^K \pi_k \min_{1 \le \ell \le L} \|\xi^*_k - \xi_\ell\|^p = \frac1n \sum_{i=1}^n \min_{1 \le \ell \le L} \|\xi^*_{z_i} - \xi_\ell\|^p. \tag{9}$$

We start with three lemmas that are proved in the Supplementary Material:

**Lemma 1.** Let ALG(p) be a k-means algorithm satisfying Assumption 1(b′) and let $\hat\xi$ be its output for $L$ clusters. Furthermore, assume $(\frac1n \sum_{i=1}^n \|x_i - \xi^*_{z_i}\|^p)^{1/p} \le \varepsilon$. Then $Q(\hat\xi; \mu^*) \le (1+\kappa)\varepsilon$.

**Lemma 2** (Curvature). For every $\xi \in X^L$ and $\xi^* \in X^K$, with $L \ge K$,
$$Q(\xi; \mu^*) \ge \pi_{\min}^{1/p} \Big( d_p(\xi, \xi^*) \wedge \frac{\delta}{2} \Big).$$

**Lemma 3.** Assume that $\max_{1 \le i \le n} \|x_i - \xi^*_{z_i}\| \le \eta$ and $d_\infty(\hat\xi, \xi^*) \le \gamma$. When $L = K$, if $\delta > 2\gamma + 2\eta$, there exists a bijective function $\omega : [K] \to [K]$ satisfying $\omega(\hat z_i) = z_i$ for all $i \in [n]$. When $L > K$, if $\delta > 2\gamma + 4\eta$, there exists a surjective function $\omega : [L] \to [K]$ satisfying $\omega(\hat z_i) = z_i$ for all $i \in [n]$.

*Proof of Theorem 1.* As $(\frac1n \sum_{i=1}^n \|x_i - \xi^*_{z_i}\|^p)^{1/p} \le \max_{1 \le i \le n} \|x_i - \xi^*_{z_i}\| \le \eta$, combining Lemmas 1 and 2, we have
$$d_p(\hat\xi, \xi^*) \wedge \frac{\delta}{2} \le \frac{(1+\kappa)\eta}{\pi_{\min}^{1/p}}.$$
By the condition on $\delta$ in (4), we have $\delta/2 > (1+\kappa)\eta/\pi_{\min}^{1/p}$. Then, $d_\infty(\hat\xi, \xi^*) \le d_p(\hat\xi, \xi^*) \le \gamma := (1+\kappa)\eta/\pi_{\min}^{1/p}$, which also makes the assumption in Lemma 3 that $\delta > 2\gamma + 4\eta$ valid. Finally, the result follows from Lemma 3.

*Proof of Theorem 2.* The argument is similar to one that has appeared in recent literature [17, 12, 33]. From the proof of Lemma 1 (in the Supplementary Material), we have
$$Q(\hat\xi; \mu^*) \le \Big( \frac1n \sum_{i=1}^n \|\xi^*_{z_i} - \hat\xi_{\hat z_i}\|^p \Big)^{1/p} \le (1+\kappa)\varepsilon.$$
By Lemma 2,
$$d_p(\hat\xi, \xi^*) \wedge \frac{\delta}{2} \le \frac{(1+\kappa)\varepsilon}{\pi_{\min}^{1/p}}.$$
By the separation assumption in (5) and since $c > 2$, $\delta/2 > (1+\kappa)\varepsilon/\pi_{\min}^{1/p}$. Hence $d_p(\hat\xi, \xi^*) \le (1+\kappa)\varepsilon/\pi_{\min}^{1/p}$.

Let $C_k = \{i : z_i = k\}$, $|C_k| = n_k$, and set $T_k := \{i \in C_k : \|\xi^*_{z_i} - \hat\xi_{\hat z_i}\| \le \delta/c\}$. Letting $S_k = C_k \setminus T_k$, we obtain
$$|S_k| \frac{\delta^p}{c^p} < \sum_{i \in S_k} \|\xi^*_{z_i} - \hat\xi_{\hat z_i}\|^p \le \sum_{i=1}^n \|\xi^*_{z_i} - \hat\xi_{\hat z_i}\|^p \le n(1+\kappa)^p \varepsilon^p.$$
Therefore, $|S_k| < n(1+\kappa)^p c^p \varepsilon^p/\delta^p \le n\pi_{\min} \le n_k$, where the second inequality is by the assumption $\delta > (1+\kappa)c\varepsilon/\pi_{\min}^{1/p}$. Hence, $T_k$ is not empty. Furthermore, we argue that if $i \in T_k$ and $j \in T_\ell$ for $k \ne \ell$, i.e., $z_i \ne z_j$, then $\hat z_i \ne \hat z_j$. Assume otherwise, that is, $\hat z_i = \hat z_j$. Then
$$\|\xi^*_k - \xi^*_\ell\| \le \|\xi^*_k - \hat\xi_{\hat z_i}\| + \|\xi^*_\ell - \hat\xi_{\hat z_j}\| \le 2\delta/c < \delta,$$
causing a contradiction.

Let $\mathcal L_k := \{\hat z_i : i \in T_k\}$ and $\mathcal L = \bigcup_{k=1}^K \mathcal L_k$. Define a function $\omega : \mathcal L \to [K]$ by setting $\omega(\ell) = k$ for all $\ell \in \mathcal L_k$ and $k \in [K]$. By the property of $\{T_k\}$ shown above, $\mathcal L_k$, $k \in [K]$, are disjoint and nonempty sets. This implies that $\omega$ is well-defined and surjective. Extend $\omega$ to a surjective $\omega : [L] \to [K]$ by arbitrarily defining it on $[L] \setminus \mathcal L$. Note that $i \in T_k$ implies $\hat z_i \in \mathcal L_k$ and hence $\omega(\hat z_i) = k = z_i$, so that every misclassified index belongs to $\bigcup_k S_k$. It follows that
$$\frac1n \sum_{i=1}^n \mathbf 1\{z_i \ne \omega(\hat z_i)\} \le \frac1n \sum_{k=1}^K |S_k| < \frac{K(1+\kappa)^p c^p \varepsilon^p}{\delta^p}.$$
The result follows.

*Proof of Theorem 3.* By assumption, $\kappa$-approximation holds for both $K$ and $L$ clusters. Then, $\widehat Q(\hat\xi) \le \kappa \widehat Q^{(L)}_{\min}$, where $\widehat Q^{(L)}_{\min} := \min_{\xi \in X^L} \widehat Q(\xi)$.
Since $\widehat Q^{(L)}_{\min} \le (\frac1n \sum_{i=1}^n \|x_i - \tilde\xi_{\tilde z_i}\|^p)^{1/p} \le \varepsilon$, by the triangle inequality in $L^p(\nu_n; X)$,
$$\Big( \frac1n \sum_{i=1}^n \|\tilde\xi_{\tilde z_i} - \hat\xi_{\hat z_i}\|^p \Big)^{1/p} \le \Big( \frac1n \sum_{i=1}^n \|x_i - \tilde\xi_{\tilde z_i}\|^p \Big)^{1/p} + \Big( \frac1n \sum_{i=1}^n \|x_i - \hat\xi_{\hat z_i}\|^p \Big)^{1/p} \le (1+\kappa)\varepsilon.$$
Let $T_k := \{i \in C_k : \|\tilde\xi_{\tilde z_i} - \hat\xi_{\hat z_i}\| \le \delta/c\}$ and $S_k = C_k \setminus T_k$. Then,
$$|S_k| \frac{\delta^p}{c^p} < \sum_{i \in S_k} \|\tilde\xi_{\tilde z_i} - \hat\xi_{\hat z_i}\|^p \le \sum_{i=1}^n \|\tilde\xi_{\tilde z_i} - \hat\xi_{\hat z_i}\|^p \le n(1+\kappa)^p \varepsilon^p.$$
Therefore, $|S_k| < n(1+\kappa)^p c^p \varepsilon^p/\delta^p \le n\pi_{\min} \le n_k$, where the second inequality is by the assumption $\delta > (1+\kappa)c\varepsilon/\pi_{\min}^{1/p}$. Hence $T_k$ is not empty. Next we argue that if $i \in T_k$, $j \in T_\ell$ for $k \ne \ell$, i.e., $z_i \ne z_j$, then $\hat z_i \ne \hat z_j$. Assume otherwise, that is, $\hat z_i = \hat z_j$. Since $\tilde z$ is a refinement of $z$, $z_i \ne z_j$ implies $\tilde z_i \ne \tilde z_j$ and $\tilde\omega(\tilde z_i) \ne \tilde\omega(\tilde z_j)$. By the triangle inequality,
$$\|\tilde\xi_{\tilde z_i} - \tilde\xi_{\tilde z_j}\| \le \|\tilde\xi_{\tilde z_i} - \hat\xi_{\hat z_i}\| + \|\tilde\xi_{\tilde z_j} - \hat\xi_{\hat z_j}\| \le 2\delta/c < \delta,$$
causing a contradiction. The rest of the proof follows that of Theorem 2.

*Proof of Proposition 2.* Let $m_k$ be the mean of $Q_k$. Then, $\mathbb P(|t_i - m_{z_i}| > t) \le 2e^{-t^2/2\sigma^2}$. Let $M = \sqrt{6\sigma^2 \log n}$. By a union bound, with probability $\ge 1 - 2n^{-2}$ we have $|t_i - m_{z_i}| \le M$ for all $i \in [n]$. We can cover the set $[-M, M] \subset \mathbb R$ with $L' = M/\varepsilon$ one-dimensional balls of radius $\varepsilon$. (Without loss of generality, we assume that $L'$ is an integer for simplicity.) Let $T = \{\tau_1, \dots, \tau_{L'}\}$ be the centers of one such cover and note that $m_k + T$ is an $\varepsilon$-cover of $m_k + [-M, M]$. Let $\pi_k : \mathbb R \to (m_k + T)$ be the projection from $\mathbb R$ onto $m_k + T$. Then,
$$\|\gamma_{z_i}(t_i) - \gamma_{z_i}(\pi_{z_i}(t_i))\| \le \rho|t_i - \pi_{z_i}(t_i)| \le \rho\varepsilon,$$
assuming that $\varepsilon \le 1/\rho$. Let $z'_i := \operatorname{argmin}_{\ell' \in [L']} |t_i - (m_{z_i} + \tau_{\ell'})|$, so that $\pi_{z_i}(t_i) = m_{z_i} + \tau_{z'_i}$. Then let $L_n = KL'$, fix a bijection $\phi : [L_n] \to [K] \times [L']$ and define the labels $\tilde z_i = \phi^{-1}(z_i, z'_i)$. Also consider the map $\omega_0 : [K] \times [L'] \to [K]$ given by $\omega_0(k, \ell') = k$, and set $\tilde\omega := \omega_0 \circ \phi$, which is a surjective map from $[L_n]$ to $[K]$ satisfying $\tilde\omega(\tilde z_i) = z_i$. For $\ell \in [L_n]$ with $\phi(\ell) = (k, \ell')$, define $\tilde\xi_\ell = \gamma_k(m_k + \tau_{\ell'})$. Then, we have $\tilde\xi_{\tilde z_i} = \gamma_{z_i}(m_{z_i} + \tau_{z'_i}) = \gamma_{z_i}(\pi_{z_i}(t_i))$, hence the above argument gives $\|\gamma_{z_i}(t_i) - \tilde\xi_{\tilde z_i}\| \le \rho\varepsilon$. It is also clear that the separation condition (7) is satisfied, since by construction if $\tilde\omega(\ell_1) \ne \tilde\omega(\ell_2)$ with $\phi(\ell_1) = (k_1, \ell'_1)$ and $\phi(\ell_2) = (k_2, \ell'_2)$, then $k_1 \ne k_2$, hence $\tilde\xi_{\ell_1}$ and $\tilde\xi_{\ell_2}$ lie on different manifolds ($C_{k_1}$ and $C_{k_2}$), which are at least $\delta$ apart.

It follows that conclusion (8) of Theorem 3 holds for $p = 2$ and, say, $c = 3$, but with $\varepsilon$ replaced with $\rho\varepsilon$. Take $\varepsilon = (c_1\sqrt n)^{-1}$ for a constant $c_1$ to be determined. Let $c_2 = 3\rho(1+\kappa)/\delta$. As long as $n\pi_{\min} > (c_2/c_1)^2$, the separation condition in (8) is satisfied and we have $\mathrm{Miss}(z, \hat z) \le K(c_2/c_1)^2/n$. Hence, as long as $c_1 > \sqrt K c_2$, we will have $\mathrm{Miss}(z, \hat z) < 1/n$, which implies $\mathrm{Miss}(z, \hat z) = 0$. We also need to satisfy $\varepsilon < 1/\rho$, that is, $c_1 > \rho/\sqrt n$. Taking $c_1 = \sqrt K c_2 + \rho$ satisfies all the required constraints on $c_1$. The required number of clusters is $L_n = KL' = KM/\varepsilon \le 3K\sigma c_1 \sqrt{n \log n}$, which proves the result with $C = 3K\sigma c_1$. Note that since $c_2/c_1 < 1$ and $n\pi_{\min} \ge 1$, the condition $n\pi_{\min} > (c_2/c_1)^2$ is automatically satisfied. The proof is complete.

## Acknowledgments and Disclosure of Funding

This work was supported by the NSF CAREER grant DMS-1945667. We thank the referees for their helpful comments that led to the improvement of the paper.

## References

[1] Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531, 2017.
[2] Arash A. Amini and Zahra S. Razaee. Concentration of kernel matrices with application to kernel spectral clustering. The Annals of Statistics, 49(1):531–556, 2021.
[3] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Technical report, Stanford, 2006.
[4] Pranjal Awasthi, Afonso S. Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, and Rachel Ward. Relax, no need to round: Integrality of clustering formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 191–200, 2015.
[5] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Stability yields a PTAS for k-median and k-means clustering. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 309–318. IEEE, 2010.
[6] Vincent Cohen-Addad and Chris Schwiegelshohn. On the local structure of stable clustering instances. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 49–60. IEEE, 2017.
[7] Noureddine El Karoui. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191–3216, 2010.
[8] Yingjie Fei and Yudong Chen. Hidden integrality of SDP relaxations for sub-Gaussian mixture models. In Conference On Learning Theory, pages 1931–1965. PMLR, 2018.
[9] Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the relaxed k-means. Mathematical Statistics and Learning, 1(3):317–374, 2019.
[10] Takayuki Iguchi, Dustin G. Mixon, Jesse Peterson, and Soledad Villar. Probably certifiably correct k-means clustering. Mathematical Programming, 165(2):605–642, 2017.
[11] Tao Jiang, Stephen Vavasis, and Chen Wen Zhai. Recovery of a mixture of Gaussians by sum-of-norms clustering. Journal of Machine Learning Research, 21(225):1–16, 2020.
[12] Jiashun Jin. Fast community detection by SCORE. Annals of Statistics, 43(1):57–89, 2015.
[13] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89–112, 2004.
[14] Mieczysław A. Kłopotek. On the consistency of k-means++ algorithm. Fundamenta Informaticae, 172(4):361–377, 2020.
[15] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454–462. IEEE, 2004.
[16] Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pages 3662–3671. PMLR, 2019.
[17] Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1):215–237, 2015.
[18] Xiaodong Li, Yang Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. When do birds of a feather flock together? k-means, proximity, and conic programming. Mathematical Programming, 179(1):295–341, 2020.
[19] Tamás Linder. Learning-theoretic methods in vector quantization. In Principles of Nonparametric Learning, pages 163–210. Springer, 2002.
[20] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[21] Yu Lu and Harrison H. Zhou. Statistical and computational guarantees of Lloyd's algorithm and its variants. arXiv preprint arXiv:1612.02099, 2016.
[22] James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
[23] Konstantin Makarychev, Aravind Reddy, and Liren Shan. Improved guarantees for k-means++ and k-means++ parallel. Advances in Neural Information Processing Systems, 33, 2020.
[24] Jiří Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61–84, 2000.
[25] Dustin G. Mixon, Soledad Villar, and Rachel Ward. Clustering subgaussian mixtures with k-means. In 2016 IEEE Information Theory Workshop (ITW), pages 211–215. IEEE, 2016.
[26] Abhinav Nellore and Rachel Ward. Recovery guarantees for exemplar-based clustering. Information and Computation, 245:165–180, 2015.
[27] Ashkan Panahi, Devdatt Dubhashi, Fredrik D. Johansson, and Chiranjib Bhattacharyya. Clustering by sum of norms: Stochastic incremental algorithm, convergence and cluster recovery. In International Conference on Machine Learning, pages 2769–2777. PMLR, 2017.
[28] Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007.
[29] David Pollard. Strong consistency of k-means clustering. The Annals of Statistics, pages 135–140, 1981.
[30] David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–205, 1982.
[31] Defeng Sun, Kim-Chuan Toh, and Yancheng Yuan. Convex clustering: model, theoretical guarantee and efficient algorithm. Journal of Machine Learning Research, 22(9):1–32, 2021.
[32] Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. Advances in Neural Information Processing Systems, 29:604–612, 2016.
[33] Zhixin Zhou and Arash A. Amini. Analysis of spectral clustering algorithms for community detection: the general bipartite setting. The Journal of Machine Learning Research, 20(1):1774–1820, 2019.
[34] Changbo Zhu, Huan Xu, Chenlei Leng, and Shuicheng Yan. Convex optimization procedure for clustering: Theoretical revisit. Advances in Neural Information Processing Systems, 27:1619–1627, 2014.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 1.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work is theoretical and does not have direct societal impacts. Considering the page limit, we focused on presenting theoretical results.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] We did not do data training.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Error bars are included in the plots.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Computing details are in the supplemental material.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See the supplemental material.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]