# Clustering with Bregman Divergences: an Asymptotic Analysis

Chaoyue Liu, Mikhail Belkin
Department of Computer Science & Engineering, The Ohio State University
liu.2656@osu.edu, mbelkin@cse.ohio-state.edu

Clustering, in particular k-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of k-means clustering which has already been widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of clusters k is large. We establish quantization rates and describe the limiting distribution of the centers as $k \to \infty$, extending well-known results for k-means clustering.

1 Introduction

Clustering and the closely related problem of vector quantization are fundamental problems in machine learning and data mining. The aim is to partition similar points into "clusters" in order to organize or compress the data. In many clustering methods these clusters are represented by their centers or centroids. The set of these centers is often called the "codebook" in the vector quantization literature. In this setting the goal of clustering is to find an optimal codebook, i.e., a set of centers which minimizes a clustering loss function, also known as the quantization error. There is a vast literature on clustering and vector quantization; see, e.g., [8, 10, 12].

One particularly important type of clustering, and, arguably, of data analysis methods of any type, is k-means clustering [16], which aims to minimize a loss function based on the squared Euclidean distance. This is typically performed using Lloyd's algorithm [15], an iterative optimization technique. Lloyd's algorithm is simple, easy to implement, and guaranteed to converge in a finite number of steps. There is an extensive literature on various aspects and properties of k-means clustering, including applications and theoretical analysis [2, 13, 23].

An important type of analysis is the asymptotic analysis, which studies the setting where the number of centers is large. This situation ($n \gg k \gg 0$) arises in many applications related to data compression, as well as in algorithms such as soft k-means features used in computer vision and other applications, where the number of centers k is quite large but significantly smaller than the number of data points n. This situation also arises in k-means feature-based methods, which have seen significant success in computer vision, e.g., [6]. The quantization loss for k-means clustering in the setting $k \to \infty$ is well understood (see [5, 9, 20]). A less well-known fact, shown in [9, 18], is that the discrete set of centers also converges to a measure closely related to the underlying probability distribution. This fact can be used to reinterpret k-means feature based methods in terms of a density dependent kernel [21].

More recently, it has been realized that the properties of the squared Euclidean distance which make Lloyd's algorithm for k-means clustering so simple and efficient are shared by a class of similarity measures based on Bregman divergences. In an influential paper [3] the authors introduced clustering based on Bregman divergences, which generalized k-means clustering to that setting and produced a corresponding generalized version of Lloyd's algorithm.
That work has led to a new line of research on clustering, including results on multitask Bregman clustering [24], agglomerative Bregman clustering [22] and many others. There has also been some theoretical analysis of Bregman clustering, including [7], which proves the existence of an optimal quantizer and gives convergence results and bounds for the quantization loss in the limit of the data size $n \to \infty$ for fixed k.

In this paper we set out to investigate asymptotic properties of Bregman clustering as the number of centers increases. We provide explicit asymptotic rates for the quantization error of Bregman clustering, as well as the continuous measure which is the limit of the center distribution. Our results generalize the well-known results for k-means clustering. We believe that these results will be useful for a better understanding of Bregman divergence based clustering algorithms and for algorithm design.

2 Preliminaries and Existing Work

2.1 k-means clustering and its asymptotic analysis

k-means clustering is one of the most popular and well studied clustering problems in data analysis. Suppose we are given a dataset $D = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$, containing n observations of an $\mathbb{R}^d$-valued random variable X. k-means clustering aims to find a set of points (centroids) $\alpha = \{a_j\}_{j=1}^k \subset \mathbb{R}^d$, with $|\alpha| = k$ fixed in advance, that minimizes the squared Euclidean loss function
$$\sum_j \min_{a \in \alpha} \|x_j - a\|_2^2. \qquad (1)$$
Finding the global minimum of the loss function is an NP-hard problem [1, 17]. However, Lloyd's algorithm [15] is a simple and elegant method to obtain a locally optimal clustering of the data, corresponding to a local minimum of the loss function. A key reason for the practical utility of Lloyd's k-means algorithm is the following property of the squared Euclidean distance: the arithmetic mean of a set of points is the unique minimizer of the loss for a single center,
$$\frac{1}{n}\sum_{i=1}^n x_i = \arg\min_{s \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \|x_i - s\|_2^2. \qquad (2)$$
It turns out that this property holds in far greater generality, as we will discuss below.
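For concreteness, the update rule that Eq.(2) justifies can be written down in a few lines. The following Python sketch of Lloyd's algorithm for the loss in Eq.(1) is ours and purely illustrative; the function names, the random initialization, and the fixed iteration count are our own choices, not taken from the paper.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm for the squared-Euclidean loss, Eq.(1)."""
    rng = np.random.default_rng(seed)
    # Initialize with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: by Eq.(2), the arithmetic mean of each cluster
        # minimizes the within-cluster squared-Euclidean loss.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

if __name__ == "__main__":
    X = np.random.default_rng(1).uniform(size=(5000, 2))
    centers, _ = lloyd_kmeans(X, k=16)
    print(centers[:3])
```

Note that the squared Euclidean distance enters in an essential way only through the update step; this is exactly the step that Bregman divergences preserve, as discussed in Section 2.2.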
Asymptotic analysis of Euclidean quantization: In an asymptotic quantization problem, we focus on the limiting case $k \to \infty$, where the size of the dataset satisfies $n \gg k$. In this paper we will assume $n = \infty$, i.e., that the probability distribution with density P is given. This setting is in line with the analysis in [9]. Correspondingly, the measure $\mathbf{P}$ induced by the density is defined as follows: for a set $A \subset \mathbb{R}^d$, $\mathbf{P}(A) = \int_A P \, d\lambda^d$, where $\lambda^d$ is the Lebesgue measure on $\mathbb{R}^d$. Equivalently, $P = d\mathbf{P}/d\lambda^d$.

The classical asymptotic results for Euclidean quantization are stated in a more general setting, for an arbitrary power of the distance in Eq.(1): Euclidean quantization of order r ($1 \le r < \infty$), with loss
$$\mathbb{E}_{\mathbf{P}}\left[\min_{a \in \alpha} \|X - a\|_2^r\right]. \qquad (3)$$
Note that Lloyd's algorithm is only applicable to the standard case r = 2. The output of the k-means algorithm includes the locations of the centroids, which in turn determine the partition and the corresponding loss. For large k we are interested in: (1) the asymptotic quantization error, and (2) the distribution of the centroids.

Asymptotic quantization error. The asymptotic quantization error for k-means clustering has been analyzed in detail in [5, 14, 20]. S. Graf and H. Luschgy [9] show that as $k \to \infty$, the r-th quantization error decreases at a rate of $k^{-r/d}$. Furthermore, the coefficient of the $k^{-r/d}$ term has the form
$$Q_r(\mathbf{P}) = Q_r([0,1]^d)\, \|P\|_{d/(d+r)}, \qquad (4)$$
where $Q_r([0,1]^d)$, a constant independent of P, is geometrically interpreted as the asymptotic Euclidean quantization error of the uniform distribution on the d-dimensional unit cube $[0,1]^d$. Here $\|\cdot\|_{d/(d+r)}$ is the $L^{d/(d+r)}$ functional
$$\|f\|_{d/(d+r)} = \left(\int f^{d/(d+r)} \, d\lambda^d\right)^{(d+r)/d}.$$

Locational distribution of centroids. A less well-known fact is that the optimal centroid configuration of k-means converges to a limit distribution closely related to the underlying density [9, 18]. Specifically, given a discrete set of centroids $\alpha_k = \{a_j\}_{j=1}^k$, construct the corresponding discrete measure
$$P_k = \frac{1}{k}\sum_{j=1}^k \delta_{a_j}, \qquad (5)$$
where $\delta_a$ is the Dirac measure centered at a. For an open set $A \subset \mathbb{R}^d$, $P_k(A)$ is the ratio of the number $k_A$ of centroids located within A to the total number of centroids k, namely $P_k(A) = k_A / k$. We say that a continuous measure $\bar{\mathbf{P}}$ is the limit distribution of the centroids if $\{P_k\}$ converges weakly to $\bar{\mathbf{P}}$, i.e.,
$$\forall A \subset \mathbb{R}^d, \quad \lim_{k \to \infty} P_k(A) = \bar{\mathbf{P}}(A). \qquad (6)$$
S. Graf and H. Luschgy [9] gave an explicit expression for this continuous limit distribution of centroids:
$$\mathbf{P}_r = P_r\, \lambda^d, \qquad P_r = N\, P^{d/(d+r)}, \qquad (7)$$
where $\lambda^d$ is the Lebesgue measure on $\mathbb{R}^d$, P is the density of the probability distribution, and N is the normalization constant ensuring that $P_r$ integrates to 1.

2.2 Bregman divergences and Bregman clustering

In this section we briefly review the basics of Bregman divergences and the Bregman clustering algorithm. Bregman divergences, first proposed in 1967 by L. M. Bregman [4], measure the dissimilarity between two points in a space. The formal definition is as follows:

Definition 1 (Bregman Divergence). Let $\phi$ be a strictly convex function on a convex set $\Omega \subset \mathbb{R}^d$, differentiable on the relative interior of $\Omega$. The Bregman divergence $D_\phi : \Omega \times \Omega \to \mathbb{R}$ with respect to $\phi$ is defined as
$$D_\phi(p, q) = \phi(p) - \phi(q) - \langle p - q, \nabla\phi(q)\rangle, \qquad (8)$$
where $\langle \cdot, \cdot \rangle$ is the inner product in $\mathbb{R}^d$, and $\Omega$ is the domain of the Bregman divergence.

Note that Bregman divergences are not necessarily true metrics. In general, they satisfy the basic properties of non-negativity and identity of indiscernibles, but may not respect the triangle inequality or symmetry.

Examples: Some popular examples of Bregman divergences include:
- Squared Euclidean distance: $D_{EU}(p, q) = \|p - q\|_2^2$, with $\phi_{EU}(z) = \|z\|^2$;
- Mahalanobis distance: $D_{MH}(p, q) = (p - q)^T A (p - q)$, $A \in \mathbb{R}^{d \times d}$, with $\phi_{MH}(z) = z^T A z$;
- Kullback-Leibler divergence: $KL(p \| q) = \sum_i p_i \ln \frac{p_i}{q_i} - \sum_i (p_i - q_i)$, with $\phi_{KL}(z) = \sum_i z_i \ln z_i - z_i$, $z_i > 0$;
- Itakura-Saito divergence: $D_{IS}(p \| q) = \sum_i \left(\frac{p_i}{q_i} - \ln\frac{p_i}{q_i} - 1\right)$, with $\phi_{IS}(z) = -\sum_i \ln z_i$;
- Norm-like divergence: $D_{NL}(p \| q) = \sum_i \left(p_i^\alpha + (\alpha - 1) q_i^\alpha - \alpha p_i q_i^{\alpha - 1}\right)$, with $\phi_{NL}(z) = \sum_i z_i^\alpha$, $z_i > 0$, $\alpha \ge 2$. (9)

Domains of these Bregman divergences: $\Omega_{EU} = \Omega_{MH} = \mathbb{R}^d$, and $\Omega_{KL} = \Omega_{IS} = \Omega_{NL} = \mathbb{R}^d_+$.

Alternative expression: the quadratic form. Suppose that $\phi \in C^2(\Omega)$, which holds for the most commonly used Bregman divergences. Note that $\phi(q) + \langle p - q, \nabla\phi(q)\rangle$ is simply the first-order Taylor expansion of $\phi$ at q. Thus, a Bregman divergence is nothing but the difference between a function and its linear approximation. By the Lagrange form of the remainder, there exists $\xi$ with $\xi_i \in [\min(p_i, q_i), \max(p_i, q_i)]$ (i.e., $\xi$ lies in the smallest d-dimensional axis-parallel cube containing p and q) such that
$$D_\phi(p, q) = \frac{1}{2}(p - q)^T \nabla^2\phi(\xi)(p - q), \qquad (10)$$
where $\nabla^2\phi(\xi)$ denotes the Hessian matrix of $\phi$ at $\xi$. This form is more compact and will be convenient for further analysis, at the expense of introducing an unknown point $\xi$. We will use this form in later discussions.
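As an aside (not part of the paper), Definition 1 translates directly into code: given $\phi$ and $\nabla\phi$, Eq.(8) is essentially a one-liner. The following Python sketch, with our own illustrative names and example values, reproduces three of the divergences listed in Eq.(9).

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Generic Bregman divergence D_phi(p, q) of Eq.(8)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return phi(p) - phi(q) - np.dot(p - q, grad_phi(q))

# Convex functions phi from Eq.(9) and their gradients; domains as in the text.
phi_eu, grad_eu = lambda z: np.sum(z ** 2),            lambda z: 2 * z       # squared Euclidean
phi_kl, grad_kl = lambda z: np.sum(z * np.log(z) - z), lambda z: np.log(z)   # negative Shannon entropy, z > 0
phi_is, grad_is = lambda z: -np.sum(np.log(z)),        lambda z: -1.0 / z    # Burg entropy, z > 0

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
print(bregman(phi_eu, grad_eu, p, q))   # equals ||p - q||_2^2
print(bregman(phi_kl, grad_kl, p, q))   # equals KL(p || q) of Eq.(9)
print(bregman(phi_is, grad_is, p, q))   # equals the Itakura-Saito divergence of Eq.(9)
```

The asymmetry of Bregman divergences is easy to observe by swapping p and q in the Kullback-Leibler or Itakura-Saito case.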
The mean as the minimizer. As shown in A. Banerjee et al. [3], the property in Eq.(2) still holds if the squared Euclidean distance is replaced by a general Bregman divergence:
$$\frac{1}{n}\sum_{i=1}^n x_i = \arg\min_{s \in \Omega} \frac{1}{n}\sum_{i=1}^n D_\phi(x_i, s). \qquad (11)$$
This allows Lloyd's method to be generalized to arbitrary Bregman clustering problems, where the new loss function is defined as
$$\sum_i \min_{a \in \alpha} D_\phi(x_i, a). \qquad (12)$$
This modified version of k-means, called the Bregman hard clustering algorithm (see Algorithm 1 in [3]), also results in a locally optimal quantization.
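To make the generalized Lloyd iteration concrete, here is a short Python sketch in the spirit of Algorithm 1 of [3]. It is our own illustrative rendering, not the authors' code: the helper names, the random initialization, and the fixed iteration count are assumptions. The assignment step evaluates Eq.(8), and the update step uses the plain arithmetic mean, as justified by Eq.(11).

```python
import numpy as np

def bregman_hard_clustering(X, k, phi, grad_phi, n_iter=100, seed=0):
    """Illustrative sketch of Bregman hard clustering (in the spirit of Algorithm 1 in [3]).

    phi(z) -> scalar and grad_phi(z) -> vector define D_phi via Eq.(8).
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    phi_x = np.array([phi(x) for x in X])     # phi(x_i), reused every iteration
    for _ in range(n_iter):
        # Assignment step: D_phi(x, a) = phi(x) - phi(a) - <x - a, grad phi(a)>.
        D = np.empty((len(X), k))
        for j, a in enumerate(centroids):
            D[:, j] = phi_x - phi(a) - (X - a) @ grad_phi(a)
        labels = D.argmin(axis=1)
        # Update step: centroid = arithmetic mean of its cluster, by Eq.(11).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

if __name__ == "__main__":
    # Example: Kullback-Leibler clustering on positive data.
    X = np.random.default_rng(1).uniform(0.05, 1.0, size=(5000, 2))
    phi_kl  = lambda z: np.sum(z * np.log(z) - z)
    grad_kl = lambda z: np.log(z)
    centers, _ = bregman_hard_clustering(X, k=25, phi=phi_kl, grad_phi=grad_kl)
    print(centers[:3])
```

For $\phi(z) = \|z\|^2$ this reduces to the standard Lloyd's algorithm sketched in Section 2.1.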
3 Asymptotic Analysis of Bregman Quantization

We do not distinguish between the terms Bregman quantization and Bregman clustering. In this section, we analyze the asymptotics of Bregman quantization, allowing a power of the Bregman divergence in the loss function. We derive expressions for the quantization error and for the limiting distribution of the centers. We start with the following:

Definition 2 (k-th quantization error for P of order r). Suppose a random variable X takes values in $\Omega \subset \mathbb{R}^d$ with density P, where $\Omega$ is the d-dimensional domain of the Bregman divergence $D_\phi$. The k-th quantization error for P of order r ($1/2 \le r < \infty$) associated with $D_\phi$ is defined as
$$V_{k,r,\phi}(P) = \inf_{\alpha \subset \mathbb{R}^d,\, |\alpha| = k} \mathbb{E}_P\left[\min_{a \in \alpha} D_\phi^r(X, a)\right], \qquad (13)$$
where $\alpha \subset \mathbb{R}^d$ is a set of cluster representatives, corresponding to a certain partition, or quantization, of $\mathbb{R}^d$ or of the support of P, and $\mathbb{E}_P[\cdot]$ denotes expectation with respect to P.

Remark: (a) A set $\alpha$ attaining the infimum is called a k-optimal set of centers for P of order r with respect to $D_\phi^r(X, a)$. (b) In this setting, Bregman quantization of order r corresponds to Euclidean quantization of order 2r, because of Eq.(10).

3.1 Asymptotic Bregman quantization error

We are interested in the asymptotic case $k \to \infty$. First note that the quantization error asymptotically approaches zero, since every point x in the support of the distribution is arbitrarily close (with respect to the Bregman divergence) to some centroid when k is large enough.

Intuition on the convergence rate. We start by providing an informal intuition for the convergence rate. Assume P has a compact support with finite volume. Suppose each cluster is a (Bregman) Voronoi cell with typical size $\epsilon$. Since the total volume of the support does not change, the volume of one cell should be inversely proportional to the number of clusters, $\epsilon^d \sim 1/k$. On the other hand, because of Eq.(10), the Bregman divergence between two points in one cell is of the order of the square of the cell size, $D_\phi(X, a) \sim \epsilon^2$. That implies, asymptotically,
$$V_{k,r,\phi}(P) \sim k^{-2r/d}. \qquad (14)$$

We will now focus on making this intuition precise and on deriving an expression for the coefficient of the leading term $k^{-2r/d}$ in the quantization error. For now we keep the assumption that P has compact support, and remove it later on. We only describe the method and state the important results below; detailed proofs can be found in the Appendix. We first mention a few useful facts.

Lemma 1. In the limit $k \to \infty$, each interior point x in the support of P is assigned to an arbitrarily close centroid in the optimal Bregman quantization setting.

Lemma 2. If the support of P is convex, $\phi$ is strictly convex on the support, and $\nabla^2\phi$ is uniformly continuous on the support, then (a) $\lim_{k \to \infty} k^{2r/d}\, V_{k,r,\phi}(P)$ exists in $(0, \infty)$, denoted $Q_{r,\phi}(P)$; and (b)
$$Q_{r,\phi}(P) = \lim_{k \to \infty} k^{2r/d} \inf_{\alpha:\, |\alpha| = k} \mathbb{E}_P\left[\min_{a \in \alpha}\left(\tfrac{1}{2}(X - a)^T \nabla^2\phi(a)(X - a)\right)^r\right]. \qquad (15)$$

Remark: 1. Since $Q_{r,\phi}(P)$ is finite, part (a) of Lemma 2 confirms our intuition about the convergence rate, Eq.(14). 2. In Eq.(15), it does not matter whether $\nabla^2\phi$ is evaluated at a, at X, or at any point between X and a, as long as $\nabla^2\phi$ is finite at that point.

Coefficient of the Bregman quantization error. We now evaluate the coefficient $Q_{r,\phi}(P)$ of the quantization error, based on Eq.(15). What makes this analysis challenging is that, unlike in Euclidean quantization, a general Bregman divergence does not satisfy translation invariance and scaling properties. For example, Lemma 3.2 in [9] does not hold for general Bregman divergences. We take the following approach. First, dice the support of P into infinitesimal cubes $\{A_l\}$ with edges parallel to the axes, where l indexes the cells. In each cell, we approximate the Hessian by a constant matrix $\nabla^2\phi(z_l)$, where $z_l$ is a fixed point located in the cell. Evaluating the Bregman quantization error within each cell then reduces to a Euclidean-type quantization problem, for which the existing result Eq.(4) applies. Summing appropriately over the cubes gives the total quantization error.

We start from Eq.(15) and introduce the following notation: $s_l = \mathbf{P}(A_l)$ and $P(\cdot|A_l)$ is the conditional density on $A_l$; $\alpha_l = \alpha \cap A_l$ is the set of centroids located in $A_l$, $k_l = |\alpha_l|$ its size, and $v_l = k_l / k$ the corresponding ratio. Following the above intuition and noting that $P = \sum_l \mathbf{P}(A_l)\, P(\cdot|A_l)$, $Q_{r,\phi}(P)$ is approximated by
$$Q_{r,\phi}(P, \{v_l\}) \approx \sum_l s_l\, v_l^{-2r/d}\, Q_{r,Mh,l}(P(\cdot|A_l)), \qquad (16)$$
$$Q_{r,Mh,l}(P(\cdot|A_l)) = \lim_{k_l \to \infty} k_l^{2r/d} \inf_{\alpha_l:\, |\alpha_l| = k_l} \mathbb{E}_{P(\cdot|A_l)}\left[\min_{a \in \alpha_l}\left(\tfrac{1}{2}(X - a)^T \nabla^2\phi(z_l)(X - a)\right)^r\right], \qquad (17)$$
where $Q_{r,Mh,l}(P(\cdot|A_l))$ is the coefficient of the asymptotic Mahalanobis quantization error with Mahalanobis matrix $\nabla^2\phi(z_l)$, evaluated on $A_l$ with density $P(\cdot|A_l)$. It can be shown that the approximation error in $Q_{r,\phi}(P)$ converges to zero in the limit $k \to \infty$, followed by the cell size going to zero.

In each cell $A_l$, $P(\cdot|A_l)$ is further approximated by the uniform density $U(A_l) = 1/V_l$ (with $V_l$ the volume of $A_l$), and the constant Hessian $\nabla^2\phi(z_l)$ is absorbed by a coordinate transformation. Then $Q_{r,Mh,l}(U(A_l))$ reduces to a squared Euclidean quantization error. By applying Eq.(4), we show that
$$Q_{r,Mh,l}(U(A_l)) = \frac{1}{2^r}\, Q_{2r}([0,1]^d)\, \delta^{2r}\, [\det \nabla^2\phi(z_l)]^{r/d}, \qquad (18)$$
where $\delta$ is the edge length of the cube and $Q_{2r}([0,1]^d)$ is again the constant in Eq.(4). Combining Eq.(17) and Eq.(18), $Q_{r,\phi}(P)$ is approximated by
$$Q_{r,\phi}(P, \{v_l\}) \approx \frac{1}{2^r}\, Q_{2r}([0,1]^d)\, \delta^{2r} \sum_l s_l\, v_l^{-2r/d}\, [\det \nabla^2\phi(z_l)]^{r/d}. \qquad (19)$$

The proportion $v_l$ of centroids within $A_l$ is still to be determined. The following lemma provides the optimal configuration of $\{v_l\}$, i.e., the one minimizing $Q_{r,\phi}(P, \{v_l\})$:

Lemma 3. Let $B = \{(v_1, \dots, v_L) \in (0, \infty)^L : \sum_{l=1}^L v_l = 1\}$, and define
$$v_l^* = \frac{s_l^{d/(d+2r)}\, [\det \nabla^2\phi(z_l)]^{r/(d+2r)}}{\sum_{l'} s_{l'}^{d/(d+2r)}\, [\det \nabla^2\phi(z_{l'})]^{r/(d+2r)}}. \qquad (20)$$
Then for the function
$$F(v_1, \dots, v_L) = \sum_{l=1}^L s_l\, v_l^{-2r/d}\, [\det \nabla^2\phi(z_l)]^{r/d}, \qquad (21)$$
we have
$$F(v_1^*, \dots, v_L^*) = \min_{(v_1, \dots, v_L) \in B} F(v_1, \dots, v_L) = \left(\sum_l s_l^{d/(d+2r)}\, [\det \nabla^2\phi(z_l)]^{r/(d+2r)}\right)^{(d+2r)/d}. \qquad (22)$$

Lemma 3 finds the optimal configuration of $\{v_l\}$ in Eq.(19). Recalling that the quantization error is defined as an infimum over all possible configurations, we obtain our main result:

Theorem 1. Suppose $\mathbb{E}\|X\|^{2r+\epsilon} < \infty$ for some $\epsilon > 0$ and $\nabla^2\phi$ is uniformly continuous on the support of P. Then
$$Q_{r,\phi}(P) = \frac{1}{2^r}\, Q_{2r}([0,1]^d)\, \left\|(\det \nabla^2\phi)^{r/d}\, P\right\|_{d/(d+2r)}. \qquad (23)$$

Remark: In the Euclidean quantization case, where $\phi(z) = \|z\|^2$, Eq.(23) reduces to Eq.(4) (of order 2r), noting that $\nabla^2\phi = 2I$. Although Bregman quantization is more general than Euclidean quantization, the result is quite similar to Eq.(4), differing only by a $\det \nabla^2\phi$-related term.
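The allocation of Lemma 3, which underlies Theorem 1, is easy to sanity-check numerically. The sketch below is ours and illustrative only: the cell masses and Hessian determinants are made-up values. It minimizes F of Eq.(21) over the simplex with a generic solver and compares the result with the closed form of Eq.(20) and the minimum value of Eq.(22).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, r, L = 2, 1.0, 6
s = rng.uniform(0.5, 2.0, L); s /= s.sum()     # cell masses s_l = P(A_l)
det = rng.uniform(0.5, 3.0, L)                 # values of det(Hessian) per cell

def F(v):
    """Objective of Eq.(21)."""
    return np.sum(s * v ** (-2 * r / d) * det ** (r / d))

# Closed-form optimum of Eq.(20).
w = s ** (d / (d + 2 * r)) * det ** (r / (d + 2 * r))
v_star = w / w.sum()

# Direct constrained minimization over the simplex.
cons = [{"type": "eq", "fun": lambda v: v.sum() - 1.0}]
res = minimize(F, x0=np.full(L, 1.0 / L), bounds=[(1e-9, 1.0)] * L,
               constraints=cons, method="SLSQP")

print(np.round(res.x, 4))                        # numerical minimizer
print(np.round(v_star, 4))                       # v* of Eq.(20): should agree
print(F(v_star), w.sum() ** ((d + 2 * r) / d))   # both equal the value in Eq.(22)
```

The same Lagrange-multiplier calculation is what produces the $d/(d+2r)$ exponents that reappear in the limiting density of Section 3.2.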
3.2 The Limit Distribution of Centroids

Similar to Euclidean clustering, Bregman clustering also outputs k discrete cluster centroids, which can be interpreted as a discrete measure. Below we show that in the limit this discrete measure coincides with a continuous measure defined in terms of the probability density P. Define $P_{r,\phi}$ to be the integrand function in Eq.(23) (with a normalization factor N),
$$P_{r,\phi} = N\, (\det \nabla^2\phi)^{r/(d+2r)}\, P^{d/(d+2r)}. \qquad (24)$$
The following theorem claims that $P_{r,\phi}$ is exactly the continuous distribution we are looking for:

Theorem 2. Suppose $\mathbf{P}$ is absolutely continuous with respect to the Lebesgue measure $\lambda^d$. Let $\alpha_k$ be an asymptotically k-optimal set of centers for P of order r based on $D_\phi$. Define the measure $\mathbf{P}_{r,\phi} := P_{r,\phi}\, \lambda^d$. Then
$$\frac{1}{k}\sum_{a \in \alpha_k} \delta_a \to \mathbf{P}_{r,\phi} \quad \text{(weakly)}. \qquad (25)$$

Remark: As before, $\mathbf{P}_{r,\phi}$ is the measure while $P_{r,\phi}$ is the corresponding density function. The proof of the theorem can be found in the Appendix.

Example 1: Clustering with the squared Euclidean distance (Graf and Luschgy [9]). The squared Euclidean distance is an instance of a Bregman divergence, with $\phi(z) = \sum_i z_i^2$. Graf and Luschgy proved that the asymptotic centroid distribution satisfies
$$P_{r,EU}(z) \propto P^{d/(d+2r)}(z). \qquad (26)$$

Example 2: Clustering with the Mahalanobis distance. The Mahalanobis distance is another instance of a Bregman divergence, with $\phi(z) = z^T A z$, $A \in \mathbb{R}^{d \times d}$. The Hessian matrix $\nabla^2\phi = 2A$ is constant, so the asymptotic centroid distribution is the same as that of the squared Euclidean distance:
$$P_{r,Mh}(z) \propto P^{d/(d+2r)}(z). \qquad (27)$$

Example 3: Clustering with the Kullback-Leibler divergence. The convex function used to define the Kullback-Leibler divergence is the negative Shannon entropy on the domain $\Omega \subset \mathbb{R}^d_+$,
$$\phi_{KL}(z) = \sum_i z_i \ln z_i - z_i, \qquad (28)$$
with component index i. The Hessian matrix is
$$\nabla^2\phi_{KL}(z) = \mathrm{diag}\left(\frac{1}{z_1}, \dots, \frac{1}{z_d}\right). \qquad (29)$$
According to Eq.(24), the centroid density distribution function is
$$P_{r,KL}(z) \propto \left(\prod_i z_i\right)^{-r/(d+2r)} P^{d/(d+2r)}(z). \qquad (30)$$

Example 4: Clustering with the Itakura-Saito divergence. The Itakura-Saito divergence uses the Burg entropy as $\phi$,
$$\phi_{IS}(z) = -\sum_i \ln z_i, \quad z \in \mathbb{R}^d_+, \qquad (31)$$
with component index i. The Hessian matrix is
$$\nabla^2\phi_{IS}(z) = \mathrm{diag}\left(\frac{1}{z_1^2}, \dots, \frac{1}{z_d^2}\right). \qquad (32)$$
According to Eq.(24), the centroid density distribution function is
$$P_{r,IS}(z) \propto \left(\prod_i z_i\right)^{-2r/(d+2r)} P^{d/(d+2r)}(z). \qquad (33)$$

Example 5: Clustering with the norm-like divergence. Here the convex function is $\phi(z) = \sum_i z_i^\alpha$, $z \in \mathbb{R}^d_+$, with power $\alpha \ge 2$. A simple calculation shows that the divergence reduces to the squared Euclidean distance when $\alpha = 2$. However, the divergence is no longer Euclidean-like as soon as $\alpha > 2$:
$$D_{NL}(p \| q) = \sum_i \left(p_i^\alpha + (\alpha - 1) q_i^\alpha - \alpha p_i q_i^{\alpha - 1}\right). \qquad (34)$$
With some calculation, we have
$$P_{r,NL}(z) \propto \left(\prod_i z_i\right)^{(\alpha - 2) r/(d+2r)} P^{d/(d+2r)}(z). \qquad (35)$$

Remark: It is easy to see that Kullback-Leibler and Itakura-Saito quantization tend to move centroids closer to the axes, while norm-like quantization with $\alpha > 2$ does the opposite, moving centroids away from the axes.
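The example densities above follow mechanically from Eq.(24) once $\det \nabla^2\phi$ is written down. The short sketch below is ours (the grid and the choice $\alpha = 3$ are illustrative); it evaluates the unnormalized predictions in one dimension for a uniform P and recovers the exponents $z^{-1/3}$ and $z^{1/3}$ that appear in the experiments of Section 4.

```python
import numpy as np

def predicted_density(P_vals, hess_det_vals, d, r):
    """Unnormalized limiting centroid density of Eq.(24):
    proportional to (det Hessian)^(r/(d+2r)) * P^(d/(d+2r))."""
    return hess_det_vals ** (r / (d + 2 * r)) * P_vals ** (d / (d + 2 * r))

z = np.linspace(0.01, 1.0, 100)      # avoid z = 0 for the KL case
P = np.ones_like(z)                  # uniform density on [0, 1]
d, r = 1, 1

rho_eu = predicted_density(P, 2.0 * np.ones_like(z), d, r)   # phi'' = 2 for phi(z) = z^2
rho_kl = predicted_density(P, 1.0 / z, d, r)                 # phi'' = 1/z for phi_KL
rho_nl = predicted_density(P, 6.0 * z, d, r)                 # phi'' = alpha(alpha-1) z^(alpha-2), alpha = 3

print(np.allclose(rho_eu, rho_eu[0]))                        # constant, i.e., uniform
print(np.allclose(rho_kl, z ** (-1.0 / 3.0)))                # matches z^(-1/3)
print(np.allclose(rho_nl / rho_nl[-1], z ** (1.0 / 3.0)))    # matches z^(1/3) up to a constant
```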
4 Experiments

Figure 1 (panels: Squared Euclidean, Kullback-Leibler, Norm-like with $\alpha = 3$; horizontal axes span [0, 1]): the first row shows the predicted distribution functions of the centroids given by Eqs.(36)-(38); the second row shows experimental histograms of centroid locations obtained by applying the corresponding Bregman hard clustering algorithms.

In this section, we verify our results, especially the centroid location distribution Eq.(24), using the Bregman hard clustering algorithm. Recall that our results are obtained in a limiting regime, where we first take the dataset size $n \to \infty$ and then the number of clusters $k \to \infty$. However, real datasets are finite, and it is not practical to apply Bregman clustering algorithms in the asymptotic regime. Instead, we simply sample data points from a given distribution, with the dataset large enough compared to k to avoid early stopping of Bregman clustering. In addition, we only verify the case r = 1 here, since the Bregman hard clustering algorithm, which utilizes Lloyd's method, cannot address Bregman quantization problems with $r \ne 1$.

Case 1 (1-dimensional): Suppose the density P is uniform over [0, 1]. We set the number of clusters to k = 81, and apply different versions of the Bregman hard clustering algorithm to this sample: standard k-means, Kullback-Leibler clustering, and norm-like clustering. According to Eq.(27), Eq.(30), and Eq.(35), the theoretical predictions for the centroid location distribution functions in this case are
$$P_{1,EU}(z) = 1, \quad z \in [0, 1]; \qquad (36)$$
$$P_{1,KL}(z) \propto z^{-1/3}, \quad z \in (0, 1]; \qquad (37)$$
$$P_{1,NL}(z) \propto z^{1/3}, \quad z \in [0, 1]; \qquad (38)$$
and P(z) = 0 elsewhere. Figure 1 shows, in the first row, the theoretical predictions of the distribution of centroids and, in the second row, experimental histograms of centroid locations for the different Bregman quantization problems.

Case 2 (2-dimensional): The density is $P = U([0, 1]^2)$. We set k = 81 and apply the same three Bregman clustering algorithms as in Case 1. The theoretical predictions of the distribution of centroids for this case, given by Eq.(27), Eq.(30), and Eq.(35), are as follows (also shown in Figure 2):
$$P_{1,EU}(z) = 1, \quad z = (z_1, z_2) \in [0, 1]^2; \qquad (39)$$
$$P_{1,KL}(z) \propto (z_1 z_2)^{-1/4}, \quad z = (z_1, z_2) \in (0, 1]^2; \qquad (40)$$
$$P_{1,NL}(z) \propto (z_1 z_2)^{1/4}, \quad z = (z_1, z_2) \in [0, 1]^2; \qquad (41)$$
and P(z) = 0 elsewhere.

Figure 2 (panels: Squared Euclidean, Kullback-Leibler, Norm-like with $\alpha = 3$): Experimental results and theoretical predictions of the centroid distribution for Case 2. In each of the 3-d plots, the function is plotted over the square $[0, 1]^2$, with the leftmost corner corresponding to the point (0, 0) and the rightmost corner to the point (1, 1).

The first row of Figure 2 shows a visualization of the centroid locations generated by the experiments. For comparison, the second row presents 3-d plots of the theoretical predictions of the centroid distribution. It is easy to see that squared Euclidean quantization results in a uniform distribution of centroids, that Kullback-Leibler quantization tends to attract centroids towards the axes, and that norm-like quantization tends to repel centroids away from the axes.
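As a rough reproduction of Case 1 for the Kullback-Leibler divergence, the following self-contained Python sketch is our own, not the authors' experimental code; the sample size, iteration count, and binning are arbitrary choices, and the data are drawn on $(10^{-3}, 1]$ to stay inside the KL domain. It runs Bregman hard clustering with k = 81 on approximately uniform 1-D data and compares the centroid histogram with the prediction of Eq.(37).

```python
import numpy as np

def kl_hard_clustering_1d(X, k, n_iter=100, seed=0):
    """Bregman hard clustering for 1-D positive data with the KL divergence of Eq.(9)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # D_KL(x, a) = x log(x / a) - (x - a).
        D = X[:, None] * np.log(X[:, None] / centroids[None, :]) - (X[:, None] - centroids[None, :])
        labels = D.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean()     # mean update, Eq.(11)
    return centroids

rng = np.random.default_rng(1)
X = rng.uniform(1e-3, 1.0, size=20_000)           # (approximately) uniform on (0, 1]
centers = kl_hard_clustering_1d(X, k=81)

# Compare the empirical centroid density with P_{1,KL}(z) = (2/3) z^(-1/3), Eq.(37) normalized on [0, 1].
hist, edges = np.histogram(centers, bins=9, range=(0.0, 1.0), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
pred = (2.0 / 3.0) * mid ** (-1.0 / 3.0)
for m, h, p in zip(mid, hist, pred):
    print(f"bin center {m:.2f}: empirical {h:.2f}, predicted {p:.2f}")
```

Swapping in the norm-like divergence (the mean update step is unchanged for any Bregman divergence) reproduces the $z^{1/3}$ prediction of Eq.(38) in the same way.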
5 Conclusion

In this paper, we analyzed the asymptotic Bregman quantization problem for general Bregman divergences. We obtained explicit expressions both for the leading order of the asymptotic quantization error and for the limiting location distribution of the centroids, which extend the classical results for k-means quantization. We showed how our results apply to commonly used Bregman divergences, and gave some experimental verification. We hope these results will provide guidance and insight for further theoretical analysis of Bregman clustering, such as Bregman soft clustering and other related methods [3, 11], as well as for practical algorithm design and applications. Our results can also lead to a better understanding of existing seeding strategies for Bregman clustering [19] and to new seeding methods.

Acknowledgements

We thank the National Science Foundation for financial support and Brian Kulis for discussions.

References

[1] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
[2] K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. IPPS/SPDP Workshop on High Performance Data Mining, 1998.
[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705-1749, 2005.
[4] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.
[5] J. A. Bucklew and G. L. Wise. Multidimensional asymptotic quantization theory with r-th power distortion measures. IEEE Transactions on Information Theory, 28(2):239-247, 1982.
[6] A. Coates and A. Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade, pages 561-580. Springer, 2012.
[7] A. Fischer. Quantization and clustering with Bregman divergences. Journal of Multivariate Analysis, 101(9):2207-2221, 2010.
[8] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression, volume 159. Springer Science & Business Media, 2012.
[9] S. Graf and H. Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2000.
[10] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264-323, 1999.
[11] K. Jiang, B. Kulis, and M. I. Jordan. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Advances in Neural Information Processing Systems, pages 3158-3166, 2012.
[12] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.
[13] K. Krishna and M. N. Murty. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(3):433-439, 1999.
[14] T. Linder. On asymptotically optimal companding quantization. Problems of Control and Information Theory, 20(6):475-484, 1991.
[15] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
[16] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, Oakland, CA, USA, 1967.
[17] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. In WALCOM: Algorithms and Computation, pages 274-285. Springer, 2009.
[18] D. E. McClure. Nonlinear segmented function approximation and analysis of line patterns. Quarterly of Applied Mathematics, 33(1):1-37, 1975.
[19] R. Nock, P. Luosto, and J. Kivinen. Mixed Bregman clustering with approximation guarantees. In Joint ECML and KDD, pages 154-169. Springer, 2008.
[20] P. Panter and W. Dite. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44-48, 1951.
[21] Q. Que and M. Belkin. Back to the future: Radial basis function networks revisited. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1375-1383, 2016.
[22] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. arXiv preprint arXiv:1206.6446, 2012.
[23] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge. In ICML, volume 1, pages 577-584, 2001.
[24] J. Zhang and C. Zhang. Multitask Bregman clustering. Neurocomputing, 74(10):1720-1734, 2011.