# On Convergence of Epanechnikov Mean Shift

Kejun Huang, University of Minnesota, Minneapolis, MN 55414, huang663@umn.edu
Xiao Fu, Oregon State University, Corvallis, OR 97331, xiao.fu@oregonstate.edu
Nicholas D. Sidiropoulos, University of Virginia, Charlottesville, VA 22904, nikos@virginia.edu

## Abstract

Epanechnikov Mean Shift is a simple yet empirically very effective algorithm for clustering. It localizes the centroids of data clusters by estimating modes of the probability distribution that generates the data points, using the optimal Epanechnikov kernel density estimator. However, since the procedure involves non-smooth kernel density functions, the convergence behavior of Epanechnikov Mean Shift lacks theoretical support as of this writing: most existing analyses are based on smooth functions and thus cannot be applied to Epanechnikov Mean Shift. In this work, we first show that the original Epanechnikov Mean Shift may indeed terminate at a non-critical point, due to its non-smooth nature. Based on our analysis, we propose a simple remedy. The modified Epanechnikov Mean Shift is guaranteed to terminate at a local maximum of the estimated density, which corresponds to a cluster centroid, within a finite number of iterations. We also propose a way to avoid running the Mean Shift iterates from every data point, while maintaining good clustering accuracy under non-overlapping spherical Gaussian mixture models. This further enables Epanechnikov Mean Shift to handle very large and high-dimensional data sets. Experiments show surprisingly good performance compared to Lloyd's K-means algorithm and the EM algorithm.

## Introduction

Clustering is a fundamental problem in artificial intelligence and statistics (Jain, Murty, and Flynn 1999). The simplest form is arguably K-means clustering, in which a set of data points $\{x_m\}_{m=1}^M \subset \mathbb{R}^d$ is given, and the objective is to separate them into $K$ clusters such that the sum of the cluster variances is minimized. It has been shown that K-means clustering is NP-hard in general (Aloise et al. 2009; Dasgupta and Freund 2009), even though Lloyd's algorithm usually gives reasonably good approximate solutions (Lloyd 1982) when $d$ is small. In fact, it has been used as a standard sub-routine for more complicated clustering tasks such as spectral clustering (Ng, Jordan, and Weiss 2001) and subspace clustering (Elhamifar and Vidal 2013), despite the fact that it is not guaranteed to find the globally optimal solution.

Several attempts have been made to quantify cases under which we can provably cluster the data under a probabilistic generative model. Based on the Gaussian mixture model (GMM), instead of applying Lloyd's algorithm or Expectation-Maximization (Dempster, Laird, and Rubin 1977), this line of work devises sophisticated and somewhat conceptual methods that guarantee correct estimation of the GMM parameters under additional conditions. The first of this genre, to the best of our knowledge, is the work in (Dasgupta 1999), which shows that if the Gaussian components are almost disjoint, then the deflation-type method in (Dasgupta 1999) correctly clusters the data with high probability. A more practical approach, a two-round variant of the EM algorithm, was later proposed in (Dasgupta and Schulman 2000) with the same performance guarantee.
A number of follow-up works try to further improve the bound on how close the Gaussian components can be (Arora and Kannan 2001; Vempala and Wang 2004; Brubaker and Vempala 2008; Moitra and Valiant 2010; Belkin and Sinha 2010), but most of them focus on theoretical guarantees rather than practical implementations.

On the other hand, there is a simple and effective clustering method called Mean Shift (Fukunaga and Hostetler 1975; Cheng 1995; Comaniciu and Meer 2002) that has been popular in the field of computer vision. Its two main versions are Epanechnikov Mean Shift and Gaussian Mean Shift; the details are explained in the next section. Compared to K-means and the Gaussian mixture model, fewer theoretical results are available for Mean Shift. Compared to Gaussian Mean Shift, even less analysis exists for Epanechnikov Mean Shift, partly due to its non-smooth nature, despite its simplicity and effectiveness.

The main contribution of this paper is the establishment of convergence for Epanechnikov Mean Shift. There have been many convergence studies on different variants of and approximations to the original Epanechnikov Mean Shift. However, to the best of our knowledge, there is no analysis that directly addresses the original Epanechnikov Mean Shift, partially because the non-smoothness of the kernel employed in the method poses a hard analytical problem. Nevertheless, analyzing the convergence behavior of the original Epanechnikov Mean Shift is of great interest, since it is based on the optimal kernel in density estimation. In this work, we provide a detailed functional analysis and rigorous proof for Epanechnikov Mean Shift. We first show that the method can indeed get stuck at some undesired points, but that with a very simple modification it is guaranteed to reach a local optimum of the data density function, corresponding to a cluster centroid, within a finite number of iterations. This is the first analytical result that backs the convergence behavior of Epanechnikov Mean Shift, and the proposed modification is of practical significance. Inspired by the idea of deflation in (Dasgupta 1999) and follow-up works, we also propose a deflation variant of the Mean Shift algorithm that is guaranteed to correctly cluster the data one group at a time under some additional conditions. This strategy saves a huge amount of computation compared to the original version and thus suits large-scale clustering problems.

[Figure 1: 100 Monte-Carlo simulations on synthetic data. Left: clustering error; right: run time. Methods compared: Mean Shift Deflation, K-means (MATLAB), EM-GMM (MATLAB), Dasgupta & Schulman 2000, Mean Shift (Original).]

## Illustrative Example

Before we delve into the convergence analysis of Epanechnikov Mean Shift, we give a simple illustrative example to showcase its effectiveness in clustering, which explains why this particular method interests us. Specifically, we test the performance of the proposed deflation-based Epanechnikov Mean Shift against some classic clustering methods, including Lloyd's K-means algorithm, Expectation-Maximization (EM) for Gaussian mixture models (GMM), the two-round variant of EM by (Dasgupta and Schulman 2000), and the original Epanechnikov Mean Shift. The experiment is conducted in MATLAB, using the built-in implementations of K-means clustering and EM for GMM.
Notice that these are well-engineered K-means / EM implementations, with smart initialization suggested by K-means++ (Arthur and Vassilvitskii 2007), and/or parallel-implementation and multiple-restart enhancements that have been shown to work well in practice. The two-round variant of EM (Dasgupta and Schulman 2000) is mathematically proven to work well with high probability when the clusters are non-overlapping.

The experiments are conducted on a synthetic dataset $\{x_m\}_{m=1}^M \subset \mathbb{R}^d$ generated as follows. For $d = 100$, we prescribe $K = 30$ clusters (Gaussian components). For cluster $k$, we first randomly generate its centroid $\mu_k \sim \mathcal{N}(0, 4I)$, and then generate $M_k = 50k$ i.i.d. data points from $\mathcal{N}(\mu_k, I)$. We then hide the cluster labels and feed the data into the various algorithms. For the two versions of the Mean Shift algorithm, there is no need to specify the number of clusters $K$ beforehand: the only parameter (the kernel bandwidth) is tuned by leave-one-out cross-validation, and the algorithm automatically detects the number of clusters in the data set. For the other methods, the correct number of clusters $K$ is given, which means they already use more information than the Mean Shift-based methods. The procedure is repeated 30 times. In each Monte-Carlo trial, the obtained cluster labels are aligned with the true labels using the Hungarian algorithm (Kuhn 1955). The clustering error is then calculated as the ratio of wrongly labeled data points to the total number of data points.

The clustering error and runtime for each Monte-Carlo trial are shown in Figure 1. The first observation is that plain vanilla K-means and EM do not cluster the data very accurately, even though a good clustering structure exists by construction. The clustering error is greatly reduced by the two-round variant of EM (Dasgupta and Schulman 2000), at the cost of significantly more computation time. However, since the different Gaussian components are only marginally separated (and the cluster sizes are somewhat unbalanced), the performance is not as good as (Dasgupta and Schulman 2000) might lead us to expect. The Mean Shift-based methods, on the other hand, give a surprising zero clustering error in all cases; considering that the correct number of clusters $K$ is not given to these methods, the results look even more impressive. In terms of computation time, the original Mean Shift takes a similar amount of time as (Dasgupta and Schulman 2000), whereas the proposed Mean Shift deflation takes, remarkably, the least amount of time, even compared to the simple K-means. In this paper, we study the convergence properties of the original Epanechnikov Mean Shift and its deflation variant, and explain the reasons behind its effectiveness.

## Background: KDE and Mean Shift

The intuition behind Mean Shift for clustering is as follows. Suppose we have the probability density function (PDF) $p(z)$ of the dataset $\{x_m\}_{m=1}^M \subset \mathbb{R}^d$. If the PDF $p(z)$ has $K$ modes, then we expect $K$ clusters in this dataset. Furthermore, if we run an optimization algorithm, e.g., gradient ascent, initialized at a data point $x_m$, and it converges to the $k$-th mode, then we declare that $x_m$ belongs to the $k$-th cluster.

### Kernel Density Estimation

In practice, we do not have access to the PDF $p(z)$, but only the set of data points $\{x_m\}_{m=1}^M$. To implement the aforementioned intuition, one needs to first estimate the PDF $p(z)$; this is called density estimation (Scott 2015).
The most popular approach to density estimation is the so-called kernel density estimator (KDE). For a given kernel function $K(z)$ that satisfies $K(z) \geq 0$ and $\int K(z)\,dz = 1$, the corresponding KDE is simply

$$\hat{p}(z) = \frac{1}{M}\sum_{m=1}^{M} K(z - x_m).$$

Two popular choices of kernel are the Gaussian kernel

$$K_G(z; w) = c\,\exp\!\left(-\frac{\|z\|^2}{2w^2}\right), \quad (1)$$

and the Epanechnikov kernel

$$K_E(z; w) = c\,\max\!\left(w^2 - \|z\|^2,\, 0\right), \quad (2)$$

where $c$ in (1) and (2) are normalizing constants ensuring that each kernel integrates to one. Each of them (and every other kernel) is parameterized by a scalar $w$, called the bandwidth, which controls the variance of the kernel. It has been shown that the Epanechnikov kernel asymptotically minimizes the mean squared error (MSE)

$$\int \big(\hat{p}(z) - p(z)\big)^2\,dz \quad (3)$$

among all possible kernel functions (Epanechnikov 1969). Somewhat surprisingly, this optimal kernel is a highly non-smooth function.

The bandwidth $w$ plays an important role in how well the KDE approximates the true density. In practice, one can adopt the leave-one-out cross-validation approach to determine this parameter, as we do in this work. The MSE between the estimated density and the unknown true density in (3) can be separated into three terms:

$$\int \big(\hat{p}(z) - p(z)\big)^2\,dz = \int p^2(z)\,dz + \int \hat{p}^2(z)\,dz - 2\,\mathbb{E}\{\hat{p}(z)\}.$$

The first term is unknown but constant; the second term can be computed directly; and the third term is estimated via leave-one-out cross-validation. This quantity is evaluated over a set of candidate values of $w$, and the one that gives the minimum value is selected as the bandwidth for the KDE.

Based on the KDE $\hat{p}(z)$, the Mean Shift algorithm tries to find modes of $\hat{p}(z)$ via the following (weighted-average) iterates (Fukunaga and Hostetler 1975; Cheng 1995; Comaniciu and Meer 2002), initialized at each $x_m$:

$$z \leftarrow \frac{\sum_{m=1}^{M} g\big(\|z - x_m\|^2\big)\,x_m}{\sum_{m=1}^{M} g\big(\|z - x_m\|^2\big)},$$

where $g(\cdot)$ is called the profile of the kernel function $K(\cdot)$. Putting details aside, the profile of the Gaussian kernel is $\exp(-\|z - x_m\|^2/2w^2)$, and that of the Epanechnikov kernel is the indicator function $\mathbf{1}(\|z - x_m\|^2 < w^2)$.

Existing analyses of the convergence properties of Mean Shift are mostly based on smooth optimization, and argue that the update always moves in a gradient ascent direction. Borrowing the convergence results for gradient-based methods, it is then claimed that the Mean Shift iterates converge to a local maximum of $\hat{p}(z)$. In hindsight, we make the following comments:

1. The optimal Epanechnikov kernel is non-smooth, so the existing convergence claims cannot establish convergence to a local optimum, which asymptotically approaches a mode of $p(z)$. In fact, we will show that plain vanilla Epanechnikov Mean Shift may indeed get stuck at a non-critical point, and we will provide a simple remedy to fix it.

2. For smooth kernels like the Gaussian kernel, it is indeed easy to show that the algorithm converges to a stationary point. However, not all stationary points are local optima: there may exist saddle points, and there is in general no simple way to check whether a stationary point is a local optimum or not.

3. An interesting observation is that Mean Shift with smooth kernels usually converges more slowly than with the Epanechnikov kernel. The convergence rate of Gaussian Mean Shift can be as slow as sub-linear (Carreira-Perpiñán 2007), whereas Epanechnikov Mean Shift terminates in a finite number of steps (Comaniciu and Meer 2002), although a rigorous proof of this claim has been missing.

The remainder of the paper tries to bridge the gap between the good empirical performance of Epanechnikov Mean Shift and the lack of rigorous theoretical analysis.
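To make the weighted-average update and the Epanechnikov (indicator) profile above concrete, here is a minimal NumPy sketch of the plain Epanechnikov Mean Shift iterate. The function names and toy data are ours, chosen for illustration only; this is a sketch of the update just described, not the authors' MATLAB implementation.

```python
import numpy as np

def epanechnikov_mean_shift_step(z, X, w):
    """One Mean Shift iterate with the Epanechnikov profile:
    the average of all data points within distance w of z.
    z : (d,) current point, X : (M, d) data, w : bandwidth."""
    # indicator profile g(||z - x_m||^2) = 1(||z - x_m||^2 < w^2)
    in_window = np.sum((X - z) ** 2, axis=1) < w ** 2
    if not np.any(in_window):
        return z  # empty window: leave z unchanged
    return X[in_window].mean(axis=0)

def plain_epanechnikov_mean_shift(x0, X, w, max_iter=1000):
    """Run the plain (unmodified) iterates from x0 until the point stops moving."""
    z = np.array(x0, dtype=float)
    for _ in range(max_iter):
        z_new = epanechnikov_mean_shift_step(z, X, w)
        # the iterates stop moving exactly; the analysis below shows this can
        # also happen at a non-smooth, non-critical point of the plain version
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z

# Tiny usage example: two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(8.0, 1.0, (50, 2))])
mode = plain_epanechnikov_mean_shift(X[0], X, w=3.0)
```

This is the plain version analyzed next; the modification proposed in Algorithm 2 additionally checks for data points lying exactly on the window boundary before declaring convergence.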
We show that, with a simple modification, Epanechnikov Mean Shift terminates at a local maximum of $\hat{p}(z)$ within a finite number of iterations. Even though the objective function is non-convex and non-smooth, the convergence result is surprisingly strong: the algorithm is guaranteed to terminate at a local optimum, never at a saddle point, in a finite number of iterations, and it reaches that point exactly rather than to within some numerical tolerance.

For completeness, Epanechnikov Mean Shift is written out in Algorithm 1. We shall call the iterative procedure between lines 3 and 6 the "Epanechnikov Mean Shift iterates", and the entire algorithm, which initializes the iterates at every data point $x_m$, "Epanechnikov Mean Shift".

**Algorithm 1** Epanechnikov Mean Shift
Require: $\{x_m\}_{m=1}^M$, $w^2$
1: for $m = 1, \ldots, M$ do
2:   initialize $z_m \leftarrow x_m$
3:   repeat (Epanechnikov Mean Shift iterates)
4:     $\mathcal{I}(z_m) \leftarrow \{i \in [M] : \|x_i - z_m\|^2 < w^2\}$
5:     $z_m \leftarrow \frac{1}{|\mathcal{I}(z_m)|}\sum_{i \in \mathcal{I}(z_m)} x_i$
6:   until convergence (cf. Alg. 2)
7: end for
8: find the $K$ distinct vectors in $\{z_m\}_{m=1}^M$, denoted $\{\mu_k\}_{k=1}^K$
9: assign $x_m$ to cluster $k$ if $z_m = \mu_k$

## Function Analysis

The Epanechnikov Mean Shift iterates try to find modes (i.e., local maxima) of the KDE $\hat{p}(z) = \frac{1}{M}\sum_{m=1}^{M} K_E(z - x_m; w)$ with the Epanechnikov kernel. Following conventions in the field of optimization, we define functions $\varphi$ and $f$ by flipping the sign of $K_E$ and $\hat{p}$ and omitting constants and scalings, which do not affect the optimization task:

$$\varphi(z) = \min(\|z\|^2, w^2), \quad (4)$$

$$f(z) = \sum_{m=1}^{M} \varphi(x_m - z). \quad (5)$$

Obviously, modes of $\hat{p}(z)$ correspond to local minima of $f(z)$. We start by analyzing the basic properties of the loss function (5).

**Lemma 1.** The function $f(z)$ is smooth almost everywhere.

*Proof.* Notice that $f(z)$ is a summation of component functions $\varphi(x_m - z)$, so $f(z)$ is smooth at $z$ if and only if, for all $m = 1, \ldots, M$, $\varphi(x_m - z)$ is smooth at $z$. According to the definition of $\varphi$ in (4), $\varphi(x_m - z)$ is non-smooth iff $\|x_m - z\|^2 = w^2$, which forms a set of Lebesgue measure zero in $\mathbb{R}^d$. Because $\{x_m\}_{m=1}^M$ is a finite set, the union $\{z : \|x_1 - z\|^2 = w^2\} \cup \cdots \cup \{z : \|x_M - z\|^2 = w^2\}$, which is the set of non-smooth points of $f(z)$, also has Lebesgue measure zero. In other words, the function $f(z)$ is smooth almost everywhere.

**Lemma 2.** At every smooth point $z$ of $f(z)$, define $\mathcal{I}(z) = \{i : \|x_i - z\|^2 < w^2\}$; then we have

$$\nabla f(z) = \sum_{i \in \mathcal{I}(z)} 2(z - x_i), \quad (6)$$

$$\nabla^2 f(z) = 2|\mathcal{I}(z)|\,I. \quad (7)$$

Therefore, $f(z)$ is locally convex at every smooth point, and locally strongly convex if $\mathcal{I}(z)$ is not empty.

*Proof.* If $z$ is a smooth point, there does not exist an $x_j$ such that $\|x_j - z\|^2 = w^2$. The expressions for the gradient and Hessian are elementary. For a small hyper-ball containing only smooth points, the index set $\mathcal{I}(z)$ remains the same in this convex region, and therefore so does the Hessian. Since $\nabla^2 f(z) \succeq 0$, the function $f(z)$ is locally convex. Furthermore, if $\mathcal{I}(z) \neq \emptyset$, then $\nabla^2 f(z) \succeq 2I$, in which case $f(z)$ is locally strongly convex.

We now switch our focus to the non-smooth points of $f(z)$. To study their properties, we use the concept of the directional derivative, defined as (Bertsekas 1999)

$$f'(z; \delta) \triangleq \lim_{\alpha \downarrow 0} \frac{f(z + \alpha\delta) - f(z)}{\alpha} \quad (8)$$

for a particular direction $\delta$, if the limit exists. Definition (8) clearly shows that $\delta$ is a descent direction if $f'(z; \delta) < 0$. The directional derivative obeys the sum rule

$$f'(z; \delta) = \sum_{m=1}^{M} \varphi'(x_m - z; \delta). \quad (9)$$

Furthermore, if $f$ is smooth at a point $z$, then the directional derivative is simply $f'(z; \delta) = \nabla f(z)^\top \delta$. For a non-smooth function, we can define a stationary point as follows (Razaviyayn, Hong, and Luo 2013):

**Definition 1.** The point $z$ is a stationary point of $f(\cdot)$ if $f'(z; \delta) \geq 0$ for all $\delta$.
Notice that if $z$ is a smooth point of $f$, Definition 1 reduces to $\nabla f(z)^\top \delta \geq 0$ for all $\delta$, which implies $\nabla f(z) = 0$, the usual definition of a stationary point for smooth functions.

**Lemma 3.** The directional derivative of $f$ at $z$ in direction $\delta$ is

$$f'(z; \delta) = \sum_{i \in \mathcal{I}(z)} 2(z - x_i)^\top \delta + \sum_{j \in \mathcal{J}(z;\delta)} 2(z - x_j)^\top \delta, \quad (10)$$

where

$$\mathcal{I}(z) = \{i : \|x_i - z\|^2 < w^2\}, \qquad \mathcal{J}(z; \delta) = \{j : \|x_j - z\|^2 = w^2,\ (z - x_j)^\top \delta < 0\}.$$

*Proof.* At a smooth point of $\varphi(z)$, it is easy to see that $\varphi'(z; \delta) = 2z^\top\delta$ if $\|z\|^2 < w^2$, and $\varphi'(z; \delta) = 0$ if $\|z\|^2 > w^2$. At a non-smooth point, where $\|z\|^2 = w^2$, we find the directional derivative by resorting to the definition (8): $\varphi'(z; \delta)$ equals $2z^\top\delta$ or $0$ depending on whether $\|z + \alpha\delta\|^2$ is less than $w^2$ or not as $\alpha$ goes to zero. Since $\|z + \alpha\delta\|^2 = \|z\|^2 + 2\alpha z^\top\delta + \alpha^2\|\delta\|^2 = w^2 + 2\alpha z^\top\delta + \alpha^2\|\delta\|^2$, we see that, for sufficiently small $\alpha > 0$, $\|z + \alpha\delta\|^2 < w^2$ iff $z^\top\delta < 0$. Therefore,

$$\varphi'(z; \delta) = \begin{cases} 2z^\top\delta, & \|z\|^2 < w^2, \text{ or } \|z\|^2 = w^2 \text{ and } z^\top\delta < 0, \\ 0, & \|z\|^2 > w^2, \text{ or } \|z\|^2 = w^2 \text{ and } z^\top\delta \geq 0. \end{cases} \quad (11)$$

Now using the sum rule for the directional derivative, we conclude that $f'(z; \delta)$ is as given in (10).

From the expression of $f'(z; \delta)$, we can show the following interesting claims.

**Proposition 1.** If $z$ is a non-smooth point of $f(z)$, then $z$ cannot be a stationary point.

*Proof.* From the expression of $\varphi'(z; \delta)$ in (11), we see that the second term in (10) is always non-positive. To prove that a non-smooth point cannot be a stationary point, we consider the following two cases:

- If $\sum_{i \in \mathcal{I}(z)} 2(z - x_i) \neq 0$, then there exists a $\delta$ such that $\sum_{i \in \mathcal{I}(z)} 2(z - x_i)^\top \delta < 0$, e.g., $\delta = -\sum_{i \in \mathcal{I}(z)} (z - x_i)$, and thus $f'(z; \delta) < 0$ since the second term in (10) cannot be positive.

- If $\sum_{i \in \mathcal{I}(z)} 2(z - x_i) = 0$, then, since $z$ is a non-smooth point, there exists a $\delta$ such that $\mathcal{J}(z;\delta) \neq \emptyset$, for example $\delta = -(z - x_j)$ for some $j$ such that $\|z - x_j\|^2 = w^2$; then the second term in (10) is strictly negative while the first term is zero, and therefore $f'(z; \delta) < 0$.

To sum up, there always exists a $\delta$ such that $f'(z; \delta) < 0$ when $z$ is a non-smooth point of $f$, so such a point cannot be a stationary point.

**Proposition 2.** A point $z^\star$ is a local minimum of $f(z)$ iff:
1. there does not exist an $x_j$ such that $\|x_j - z^\star\|^2 = w^2$;
2. the set $\mathcal{I}(z^\star) = \{i : \|x_i - z^\star\|^2 < w^2\}$ is not empty, and $z^\star = \frac{1}{|\mathcal{I}(z^\star)|}\sum_{i \in \mathcal{I}(z^\star)} x_i$.

*Proof.* A local minimum is first of all a stationary point, which, according to Proposition 1, cannot be a non-smooth point of $f$; therefore there does not exist an $x_j$ such that $\|x_j - z^\star\|^2 = w^2$. At a smooth point, the gradient and Hessian are given in (6) and (7). Stationarity implies that $\nabla f(z^\star) = 0$, and therefore $z^\star = \frac{1}{|\mathcal{I}(z^\star)|}\sum_{i \in \mathcal{I}(z^\star)} x_i$. If $\mathcal{I}(z^\star)$ is not empty, then $\nabla^2 f(z^\star) \succ 0$, and $z^\star$ is a local minimum; otherwise, $f(z^\star) = Mw^2 = \max f(z)$, which is a global maximum.

## Convergence of Epanechnikov Mean Shift

Now we study the convergence of the Epanechnikov Mean Shift iterates.

**Lemma 4.** The Epanechnikov Mean Shift iterates successively minimize a local upper bound of the function $f(z)$; therefore the value of $f(z)$ is monotonically non-increasing.

*Proof.* At a particular point $\tilde{z}$, define $\hat{f}(z \mid \tilde{z})$ as follows:

$$\hat{f}(z \mid \tilde{z}) = \sum_{i \in \mathcal{I}(\tilde{z})} \|z - x_i\|^2 + \big(M - |\mathcal{I}(\tilde{z})|\big)\,w^2.$$

It is easy to see that

1. $\hat{f}(z \mid \tilde{z}) \geq f(z)$ for all $z$ and $\tilde{z}$,
2. $\hat{f}(\tilde{z} \mid \tilde{z}) = f(\tilde{z})$, and
3. $\frac{1}{|\mathcal{I}(\tilde{z})|}\sum_{i \in \mathcal{I}(\tilde{z})} x_i = \arg\min_z \hat{f}(z \mid \tilde{z})$.

This means that, at iteration $t$, Algorithm 2 updates $z$ as $z^{(t)} = \arg\min_z \hat{f}(z \mid z^{(t-1)})$.
Therefore,

$$f(z^{(t-1)}) = \hat{f}(z^{(t-1)} \mid z^{(t-1)}) \geq \hat{f}(z^{(t)} \mid z^{(t-1)}) \geq f(z^{(t)}),$$

which means the values of $f(z)$ obtained by Algorithm 2 form a monotonically non-increasing sequence.

Lemma 4 gives the Epanechnikov Mean Shift iterates in Algorithm 1 a majorization-minimization interpretation, which guarantees convergence to a stationary point in many cases. Unfortunately, none of the existing convergence results applies here: they either require both $f(z)$ and $\hat{f}(z \mid \tilde{z})$ to be smooth, or, more generally, $f'(\tilde{z}; \delta) = \hat{f}'(\tilde{z} \mid \tilde{z}; \delta)$ for all $\tilde{z}$ and $\delta$ (Razaviyayn, Hong, and Luo 2013). Indeed, it is possible that $z^{(t-1)} = z^{(t)} = \arg\min_z \hat{f}(z \mid z^{(t-1)})$, in which case the algorithm converges, but there exists an $x_j$ such that $\|x_j - z^{(t)}\|^2 = w^2$, meaning $z^{(t)}$ is a non-smooth point and thus cannot be a stationary point, according to Proposition 1.

Fortunately, we find that this issue can be fixed with negligible extra computation: if such a case happens, we only need to sample one $x_j$ such that $\|x_j - z^{(t)}\|^2 = w^2$, and then re-update $z^{(t)}$ as the average of $x_j$ together with all the points in $\mathcal{I}(z^{(t-1)})$. This can instead be interpreted as minimizing the following, slightly different, upper bound:

$$z^{(t)} = \arg\min_z \check{f}(z \mid z^{(t-1)}), \qquad \check{f}(z \mid z^{(t-1)}) \triangleq \sum_{i \in \mathcal{I}(z^{(t-1)}) \cup \{j\}} \|z - x_i\|^2 + \big(M - |\mathcal{I}(z^{(t-1)})| - 1\big)\,w^2.$$

This elaborated procedure is fleshed out in Algorithm 2. We show that Algorithm 2 provably finds a local optimum in a finite number of iterations.

**Algorithm 2** Epanechnikov Mean Shift iterates, Redux
Require: $\{x_m\}_{m=1}^M$, $n \in [M]$
1: initialize $z^{(0)} \leftarrow x_n$, $t \leftarrow 1$
2: loop
3:   $\mathcal{I}(z^{(t-1)}) \leftarrow \{i : \|x_i - z^{(t-1)}\|^2 < w^2\}$
4:   $z^{(t)} \leftarrow \frac{1}{|\mathcal{I}(z^{(t-1)})|}\sum_{i \in \mathcal{I}(z^{(t-1)})} x_i$
5:   if $z^{(t)} = z^{(t-1)}$ then
6:     $\mathcal{J}(z^{(t)}) \leftarrow \{j : \|x_j - z^{(t)}\|^2 = w^2\}$
7:     if $\mathcal{J}(z^{(t)}) = \emptyset$ then
8:       return $z^{(t)}$
9:     else
10:      sample $j$ from $\mathcal{J}(z^{(t)})$
11:      $z^{(t)} \leftarrow \frac{1}{|\mathcal{I}(z^{(t-1)})| + 1}\Big(\sum_{i \in \mathcal{I}(z^{(t-1)})} x_i + x_j\Big)$
12:    end if
13:  end if
14:  $t \leftarrow t + 1$
15: end loop

**Lemma 5.** The value of $f(z)$ obtained by Algorithm 2 is strictly decreasing, unless $z^{(t)} = z^{(t-1)}$.

*Proof.* If $z^{(t-1)} \neq \frac{1}{|\mathcal{I}(z^{(t-1)})|}\sum_{i \in \mathcal{I}(z^{(t-1)})} x_i$, then we update $z^{(t)} = \frac{1}{|\mathcal{I}(z^{(t-1)})|}\sum_{i \in \mathcal{I}(z^{(t-1)})} x_i$, and

$$\begin{aligned}
f(z^{(t-1)}) - f(z^{(t)}) &\geq \hat{f}(z^{(t-1)} \mid z^{(t-1)}) - \hat{f}(z^{(t)} \mid z^{(t-1)}) \\
&= \sum_{i \in \mathcal{I}(z^{(t-1)})} \|z^{(t-1)} - x_i\|^2 - \sum_{i \in \mathcal{I}(z^{(t-1)})} \|z^{(t)} - x_i\|^2 \\
&= \sum_{i \in \mathcal{I}(z^{(t-1)})} \Big( \|z^{(t-1)}\|^2 - 2x_i^\top z^{(t-1)} - \|z^{(t)}\|^2 + 2x_i^\top z^{(t)} \Big) \\
&= |\mathcal{I}(z^{(t-1)})| \Big( \|z^{(t-1)}\|^2 - 2 z^{(t)\top} z^{(t-1)} - \|z^{(t)}\|^2 + 2\|z^{(t)}\|^2 \Big) \\
&= |\mathcal{I}(z^{(t-1)})|\, \|z^{(t-1)} - z^{(t)}\|^2 > 0.
\end{aligned}$$

Similarly, if $z^{(t-1)} = \frac{1}{|\mathcal{I}(z^{(t-1)})|}\sum_{i \in \mathcal{I}(z^{(t-1)})} x_i$ and a boundary point $x_j$ with $\|x_j - z^{(t-1)}\|^2 = w^2$ is sampled, we have

$$f(z^{(t-1)}) - f(z^{(t)}) \geq \check{f}(z^{(t-1)} \mid z^{(t-1)}) - \check{f}(z^{(t)} \mid z^{(t-1)}) = \big(|\mathcal{I}(z^{(t-1)})| + 1\big)\, \|z^{(t-1)} - z^{(t)}\|^2,$$

and since

$$z^{(t-1)} - z^{(t)} = \frac{1}{|\mathcal{I}(z^{(t-1)})|}\sum_{i \in \mathcal{I}(z^{(t-1)})} x_i - \frac{1}{|\mathcal{I}(z^{(t-1)})| + 1}\Big(\sum_{i \in \mathcal{I}(z^{(t-1)})} x_i + x_j\Big) = \frac{z^{(t-1)} - x_j}{|\mathcal{I}(z^{(t-1)})| + 1},$$

we obtain

$$f(z^{(t-1)}) - f(z^{(t)}) \geq \frac{\|z^{(t-1)} - x_j\|^2}{|\mathcal{I}(z^{(t-1)})| + 1} = \frac{w^2}{|\mathcal{I}(z^{(t-1)})| + 1} > 0.$$

**Theorem 1.** Algorithm 2 terminates at a local optimum of (5) in a finite number of iterations.

*Proof.* We first prove that Algorithm 2 terminates at a local optimum of (5). The loss function $f(z)$ is bounded from below and, under Algorithm 2, monotonically non-increasing (cf. Lemma 4), so it converges to a certain value. Lemma 5 further shows that $f(z)$ strictly decreases unless $z^{(t)} = z^{(t-1)}$, in which case $z^{(t)}$ cannot be a non-smooth point, as we showed in the proof of Lemma 5. Notice that throughout the iterations $\mathcal{I}(z^{(t)})$ cannot be empty, because otherwise $f(z)$ would take its maximum value, whereas we start with $z^{(0)} = x_n$, so $f(z^{(0)}) < \max f(z)$. Invoking Proposition 2, we conclude that Algorithm 2 terminates at a local optimum.

Now suppose Algorithm 2 terminates in $T$ iterations; we show that $T$ can be upper-bounded by a finite number that only depends on the data set $\{x_m\}$ and the bandwidth $w$ that we choose. From the proof of Lemma 5, we get the (very loose) inequality

$$f(z^{(t-1)}) - f(z^{(t)}) \geq \|z^{(t-1)} - z^{(t)}\|^2. \quad (12)$$

Summing both sides over $t = 1, \ldots, T$, we have

$$f(z^{(0)}) - f(z^{(T)}) \geq \sum_{t=1}^{T} \|z^{(t-1)} - z^{(t)}\|^2. \quad (13)$$

We have shown that each term on the right-hand side is positive unless the algorithm has terminated. If we can further find a quantity $\lambda > 0$ such that

$$\|z^{(t-1)} - z^{(t)}\|^2 \geq \lambda > 0, \quad t = 1, \ldots, T, \quad (14)$$

then we can easily conclude that

$$T \leq \frac{f(z^{(0)}) - f(z^{(T)})}{\lambda}, \quad (15)$$

which is a finite number.
To find such a $\lambda$, we note that each $z^{(t)}$ is the average of a non-empty subset of points from the data set $\{x_m\}$, namely all the points that can be enclosed in a Euclidean ball of radius $w$ (except for $z^{(0)}$, which is simply one data point). Define

$$\mathcal{S} = \{\mathcal{I}(z) \cup \mathcal{J}(z; \delta) : z, \delta \in \mathbb{R}^d\},$$

where $\mathcal{I}(z)$ and $\mathcal{J}(z; \delta)$ are as defined in Lemma 3. We know that $|\mathcal{S}| < 2^M$, since its elements are subsets of $\{x_m\}$, and there can be at most $2^M - 1$ non-empty subsets of $\{x_m\}$. Then we define

$$\lambda = \min_{\substack{\mathcal{K}, \mathcal{L} \in \mathcal{S} \\ \mathcal{K} \neq \mathcal{L}}} \left\| \frac{1}{|\mathcal{K}|}\sum_{k \in \mathcal{K}} x_k - \frac{1}{|\mathcal{L}|}\sum_{\ell \in \mathcal{L}} x_\ell \right\|^2, \quad (16)$$

which exists since $\mathcal{S}$ is a finite set, and it satisfies (14). This $\lambda$ is also strictly positive: if

$$\frac{1}{|\mathcal{K}|}\sum_{k \in \mathcal{K}} x_k = \frac{1}{|\mathcal{L}|}\sum_{\ell \in \mathcal{L}} x_\ell, \quad (17)$$

then the distance between this common point and every point indexed by either $\mathcal{K}$ or $\mathcal{L}$ is no greater than $w$. According to the construction of $\mathcal{S}$, (17) then implies $\mathcal{K} = \mathcal{L}$, contradicting $\mathcal{K} \neq \mathcal{L}$ in (16). To sum up, we have shown that there exists $\lambda > 0$ satisfying (14); thus (15) holds, meaning Algorithm 2 terminates in a finite number of iterations.

We remark that (15) is a gross over-estimate of the number of iterations. We omitted a rather large scaling factor on the right-hand side of (12) for the sake of simplicity in (13); the point is to show that $T$ can indeed be bounded by a finite number. In practice, our observation is that Epanechnikov Mean Shift terminates in a very small number of iterations, usually fewer than 10.

As we can see, even though we are trying to optimize a non-convex and non-smooth function, we obtain a very strong convergence result: Epanechnikov Mean Shift reaches (not merely approaches) a local optimum in a finite number of iterations. From the proof one can see that the non-smoothness of the Epanechnikov kernel actually helps in obtaining such a nice convergence property. This has a very different flavor from the smooth-optimization-based analyses. For example, the work in (Carreira-Perpiñán 2007) uses the smooth Gaussian kernel, and the analysis ends up with asymptotic convergence and a sub-linear rate in the worst case.

## Mean Shift Deflation for Non-overlapping Spherical Gaussian Mixtures

The main computational bottleneck of Mean Shift is obviously the fact that the Mean Shift iterates (Algorithm 2) are run from every data point, which is potentially a large set. In this section we provide a heuristic to avoid this bottleneck, under the probabilistic generative model that each cluster is generated from a spherical Gaussian with covariance $\sigma^2 I$, and the clusters are non-overlapping.

Suppose the set of data points $\{x_m\}_{m=1}^M \subset \mathbb{R}^d$ comes from a mixture of Gaussian distributions $\mathcal{N}(\mu_k, \sigma^2 I)$, $k = 1, \ldots, K$, with different means $\{\mu_k\}_{k=1}^K \subset \mathbb{R}^d$ but the same covariance matrix $\sigma^2 I$. For a specific data point $x_m$ coming from $\mathcal{N}(\mu_k, \sigma^2 I)$, the random variable $Y = \|x_m - \mu_k\|^2/\sigma^2$ follows the $\chi^2$-distribution with $d$ degrees of freedom, denoted $\chi^2(d)$. An interesting property of the $\chi^2$-distribution is that the probability that $Y > \gamma d$ becomes almost negligible for large $d$ and $\gamma > 1$. As an example, Figure 2 shows the probability that $Y > 2d$ as $d$ increases.

[Figure 2: Probability that $Y > 2d$, for $Y \sim \chi^2(d)$, as a function of $d$ (from 0 to 100).]

This can be seen from the Chernoff bound on the tail probability of the $\chi^2$-distribution (Dasgupta and Gupta 2003): for $\gamma > 1$, we have

$$\Pr(Y > \gamma d) \leq \big(\gamma e^{1-\gamma}\big)^{d/2}.$$

It can easily be shown that $\gamma e^{1-\gamma} < 1$ when $\gamma > 1$, implying that $\Pr(Y > \gamma d)$ goes to zero at least exponentially as $d$ increases. This observation inspires us to use $\sqrt{2d}\,\sigma$ as the bandwidth, since a Euclidean ball with radius $\sqrt{2d}\,\sigma$ encloses almost all points coming from $\mathcal{N}(\mu_k, \sigma^2 I)$ if it is centered at $\mu_k$.
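As a quick numerical sanity check of this tail behaviour, and hence of the bandwidth choice $w = \sqrt{2d}\,\sigma$, one can compare the exact $\chi^2(d)$ tail probability with the Chernoff bound above. The following short Python sketch uses SciPy's survival function; the variable names and the chosen values of $d$ are ours, for illustration only.

```python
import numpy as np
from scipy.stats import chi2

gamma = 2.0  # the bandwidth choice w^2 = 2*d*sigma^2 corresponds to gamma = 2
for d in (10, 50, 100):
    exact = chi2.sf(gamma * d, df=d)                      # Pr(Y > gamma*d), Y ~ chi^2(d)
    chernoff = (gamma * np.exp(1.0 - gamma)) ** (d / 2)   # (gamma * e^(1 - gamma))^(d/2)
    print(f"d={d:3d}  Pr(Y > {gamma:.0f}d) = {exact:.2e}   Chernoff bound = {chernoff:.2e}")
```

For $d = 100$, the dimension used in the illustrative example, both quantities are already below $10^{-6}$, which is why a ball of radius $\sqrt{2d}\,\sigma$ around $\mu_k$ captures essentially all of cluster $k$.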
Furthermore, if $\min_{k \neq \ell} \|\mu_k - \mu_\ell\| > 2\sqrt{d}\,\sigma$, the Gaussian components are non-overlapping, and thus the ball centered at $\mu_k$ will only contain points coming from $\mathcal{N}(\mu_k, \sigma^2 I)$.

Therefore, once we find one local optimum of $\hat{p}(z)$, which is presumably close to one of the $\mu_k$'s, we can safely group all data points that are within radius $\sqrt{2d}\,\sigma$ of it and declare them a cluster. This idea leads to the Epanechnikov Mean Shift deflation shown in Algorithm 3.

**Algorithm 3** Epanechnikov Mean Shift deflation
Require: $\{x_m\}_{m=1}^M$, $w^2 = 2d\sigma^2$
1: $\mathcal{M} \leftarrow \{1, 2, \ldots, M\}$, $k \leftarrow 1$
2: while $\mathcal{M} \neq \emptyset$ do
3:   sample $m$ from $\mathcal{M}$
4:   run Algorithm 2 initialized at $x_m$; denote the output $\mu_k$
5:   declare $\{x_i\}_{i \in \mathcal{I}(\mu_k)}$ as cluster $k$
6:   $\mathcal{M} \leftarrow \mathcal{M} \setminus \mathcal{I}(\mu_k)$
7:   $k \leftarrow k + 1$
8: end while

As shown in Figure 1, this simple procedure obtains extremely good clustering performance while reducing the computational cost to below even that of Lloyd's K-means algorithm. Notice that the bandwidth $w$ is still estimated via leave-one-out cross-validation; under a strong generative model such as this one, however, we observe that the estimated bandwidth is very close to $\sqrt{2d}\,\sigma$.

## Conclusion

We studied the Epanechnikov Mean Shift algorithm, which is observed to work well but has lacked a theoretical analysis of its performance. After an in-depth study of the estimated density $\hat{p}(z)$, with particular focus on its non-smoothness, we fixed an issue that could potentially affect its convergence, and showed that the modified Epanechnikov Mean Shift iterates terminate at a local optimum within a finite number of iterations. A deflation-based variant of Epanechnikov Mean Shift is also proposed to avoid initializing the iterates at every data point, which reduces the computation considerably while maintaining good clustering performance under non-overlapping spherical Gaussian mixture assumptions.

## Acknowledgments

This work was supported in part by the National Science Foundation (NSF) under grants IIS-1247632, IIS-1447788, and ECCS-1608961.

## References

Aloise, D.; Deshpande, A.; Hansen, P.; and Popat, P. 2009. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75(2):245–248.

Arora, S., and Kannan, R. 2001. Learning mixtures of arbitrary Gaussians. In ACM Symposium on Theory of Computing (STOC), 634–644.

Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035. Society for Industrial and Applied Mathematics.

Belkin, M., and Sinha, K. 2010. Polynomial learning of distribution families. In IEEE Symposium on Foundations of Computer Science (FOCS), 103–112.

Bertsekas, D. P. 1999. Nonlinear Programming. Athena Scientific.

Brubaker, S. C., and Vempala, S. S. 2008. Isotropic PCA and affine-invariant clustering. In IEEE Symposium on Foundations of Computer Science (FOCS), 551–560.

Carreira-Perpiñán, M. A. 2007. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5):767–776.

Cheng, Y. 1995. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(8):790–799.

Comaniciu, D., and Meer, P. 2002. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5):603–619.

Dasgupta, S., and Freund, Y. 2009. Random projection trees for vector quantization. IEEE Transactions on Information Theory 55(7):3229–3242.

Dasgupta, S., and Gupta, A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22(1):60–65.
Dasgupta, S., and Schulman, L. J. 2000. A two-round variant of EM for Gaussian mixtures. In Uncertainty in Artificial Intelligence, 152–159.

Dasgupta, S. 1999. Learning mixtures of Gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS), 634–644.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1–38.

Elhamifar, E., and Vidal, R. 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11):2765–2781.

Epanechnikov, V. A. 1969. Non-parametric estimation of a multivariate probability density. Theory of Probability and Its Applications 14(1):153–158.

Fukunaga, K., and Hostetler, L. D. 1975. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21(1):32–40.

Jain, A. K.; Murty, M. N.; and Flynn, P. J. 1999. Data clustering: A review. ACM Computing Surveys (CSUR) 31(3):264–323.

Kuhn, H. W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2):83–97.

Lloyd, S. P. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137.

Moitra, A., and Valiant, G. 2010. Settling the polynomial learnability of mixtures of Gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS), 93–102.

Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2001. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 849–856.

Razaviyayn, M.; Hong, M.; and Luo, Z.-Q. 2013. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization 23(2):1126–1153.

Scott, D. W. 2015. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.

Vempala, S., and Wang, G. 2004. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences 68(4):841–860.