# On Robustness of Kernel Clustering

Bowei Yan, Department of Statistics and Data Sciences, University of Texas at Austin
Purnamrita Sarkar, Department of Statistics and Data Sciences, University of Texas at Austin

Abstract

Clustering is an important unsupervised learning problem in machine learning and statistics. Among many existing algorithms, kernel k-means has drawn much research attention due to its ability to find non-linear cluster boundaries and its inherent simplicity. There are two main approaches to kernel k-means: SVD of the kernel matrix, and convex relaxations. Despite the attention kernel clustering has received from both theoretical and applied quarters, not much is known about the robustness of these methods. In this paper we first introduce a semidefinite programming relaxation for the kernel clustering problem, and then prove that, under a suitable model specification, both the K-SVD and SDP approaches are consistent in the limit, albeit SDP is strongly consistent, i.e., achieves exact recovery, whereas K-SVD is weakly consistent, i.e., the fraction of misclassified nodes vanishes. The error bounds also suggest that SDP is more resilient towards outliers, which we demonstrate with experiments.

1 Introduction

Clustering is an important problem which is prevalent in a variety of real world applications. One of the first and most widely applied clustering algorithms is k-means, which was named by James MacQueen [14] but was proposed even earlier by Hugo Steinhaus [21]. Despite being half a century old, k-means has been widely used and analyzed under various settings. One major drawback of k-means is its inability to separate clusters that are not linearly separable. This can be alleviated by mapping the data to a high dimensional feature space and clustering in that feature space [19, 9, 12]; such methods are generally called kernel-based methods. For instance, the widely used spectral clustering [20, 16] computes the top eigenvectors of a kernel matrix of affinities, followed by k-means on these top r eigenvectors. The consistency of spectral clustering is analyzed in [22], and [9] shows that spectral clustering is essentially equivalent to a weighted version of kernel k-means.

Performance guarantees for clustering are often studied under distributional assumptions; usually a mixture model with well-separated centers suffices to show consistency. [5] uses a Gaussian mixture model and proposes a variant of the EM algorithm that provably recovers the center of each Gaussian when the minimum distance between clusters is greater than some multiple of the square root of the dimension. [2] works with a projection-based algorithm and shows that the separation needs to be greater than the operator norm and the Frobenius norm of the difference between the data matrix and its corresponding center matrix, up to a constant.

Another popular technique is based on semidefinite relaxations. For example, [18] proposes an SDP relaxation for k-means-type clustering. In a very recent work, [15] shows the effectiveness of an SDP relaxation for k-means clustering of sub-gaussian mixtures, provided the minimum distance between centers is greater than the variance of the sub-gaussian times the square of the number of clusters r. On a related note, SDP relaxations have been shown to be consistent for community detection in networks [1, 3].
In particular, [3] considers both inlier nodes (generated from the underlying clustering model, to be specific a blockmodel) and outlier nodes. The authors show that SDP is weakly consistent in terms of clustering the inlier nodes as long as the number of outliers m is a vanishing fraction of the number of nodes.

In contrast, among the numerous works on clustering, not much focus has been given to the robustness of different kernel k-means algorithms in the presence of arbitrary outliers. [24] illustrates the robustness of Gaussian-kernel-based clustering, but no explicit upper bound is given. [8] detects influential points in kernel PCA by looking at an influence function. In the data mining community, clustering is often used to detect outliers, with frequently heuristic but effective procedures [17, 10]. On the other hand, kernel-based methods have been shown to be robust for many machine learning tasks. For supervised learning, [23] shows the robustness of SVM by introducing an outlier indicator and relaxing the problem to an SDP. [6, 7, 4] develop robustness results for kernel regression. For unsupervised learning, [13] proposes a robust kernel density estimator.

In this paper we ask the question: how robust are SVD-type algorithms and SDP relaxations when outliers are present? In the process we also present results which compare these two methods. To be specific, we show that without outliers K-SVD is weakly consistent, i.e., the fraction of misclassified nodes vanishes with high probability, whereas SDP is strongly consistent, i.e., the number of misclassified nodes vanishes with high probability. We also prove that both methods are robust to arbitrary outliers as long as the number of outliers grows at a slower rate than the number of nodes. Surprisingly, our results also indicate that SDP relaxations are more resilient to outliers than K-SVD methods.

The paper is organized as follows. In Section 2 we set up the problem and the data generating model. We present the main results in Section 3. The proof sketches and more technical details are given in Section 4. Numerical experiments in Section 5 illustrate and support our theoretical analysis. Additional analysis is included in the extended version of this paper (https://arxiv.org/abs/1606.01869).

2 Problem Setup

We denote by $Y = [Y_1, \dots, Y_n]^T$ the $n \times p$ data matrix. Among the $n$ observations, $m$ outliers are distributed arbitrarily, and the $n - m$ inliers form $r$ equal-sized clusters, denoted by $C_1, \dots, C_r$. Let us denote the index set of inliers by $I$ and the index set of outliers by $O$, with $I \cup O = [n]$. Also denote $R = \{(i, j) : i \in O \text{ or } j \in O\}$. The problem is to recover the true and unknown data partition given by a membership matrix $Z \in \{0, 1\}^{n \times r}$, where $Z_{ik} = 1$ if $i$ belongs to the $k$-th cluster and 0 otherwise. For convenience we assume the outliers are also arbitrarily but equally assigned to the $r$ clusters, so that each extended cluster, denoted by $C_i$, $i \in [r]$, has exactly $n/r$ points. A ground-truth clustering matrix $X_0 \in \mathbb{R}^{n \times n}$ is obtained as $X_0 = ZZ^T$. It can be seen that $X_0(i, j) = 1$ if $i, j \in I$ and belong to the same cluster, and 0 otherwise.

For the inliers, we assume the following mixture distribution model. Conditioned on $Z_{ia} = 1$,

$$Y_i = \mu_a + \frac{W_i}{\sqrt{p}}, \qquad E[W_i] = 0, \qquad \mathrm{Cov}[W_i] = \sigma_a^2 I_p, \qquad (1)$$

where the $W_i$ are independent sub-gaussian random vectors. We treat $Y$ as a low dimensional signal hidden in high dimensional noise. More concretely, $\mu_a$ is sparse and $\|\mu_a\|_0$ does not depend on $n$ or $p$; both $n \to \infty$ and $p \to \infty$. The $W_i$'s for $i \in [n]$ are independent.
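As a concrete, purely illustrative instance of this model, the sketch below samples such a data set. The helper name `generate_data`, the one-sparse centers, the choice of Gaussian noise as the sub-gaussian distribution, and the two outlier mechanisms mirror the synthetic setup of Section 5 but are otherwise our own assumptions, not part of the paper's specification.

```python
import numpy as np

def generate_data(n=1000, p=500, r=5, m=50, d=1.0, sigma=1.0, seed=0):
    """Sample n points in R^p: n - m inliers from r equal-sized clusters with
    sparse centers mu_a = d * e_a, plus m arbitrarily placed outliers."""
    assert (n - m) % r == 0, "inliers are split into r equal-sized clusters"
    rng = np.random.default_rng(seed)
    centers = np.zeros((r, p))
    centers[np.arange(r), np.arange(r)] = d        # ||mu_a||_0 = 1, independent of n and p
    n_in = n - m
    labels = np.repeat(np.arange(r), n_in // r)    # true inlier memberships
    # Inliers: Y_i = mu_a + W_i / sqrt(p) with Cov[W_i] = sigma^2 I_p
    # (Gaussian noise is one convenient sub-gaussian choice).
    W = rng.normal(scale=sigma, size=(n_in, p))
    Y_in = centers[labels] + W / np.sqrt(p)
    # Outliers: half from a wide Gaussian (100x the signal variance), half uniform scatter.
    Y_out1 = rng.normal(scale=10.0 * sigma, size=(m // 2, p)) / np.sqrt(p)
    Y_out2 = rng.uniform(-d, d, size=(m - m // 2, p))
    return np.vstack([Y_in, Y_out1, Y_out2]), labels   # labels covers the n - m inliers only
```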
For simplicity, we assume the noise is isotropic and the covariance only depends on the cluster. The sub-gaussian assumption is non-parametric and includes most of the commonly used distributions, such as Gaussian and bounded distributions. We include some background material on sub-gaussian random variables in Appendix A. This general setting for inliers is common and is also motivated by many practical problems where the data lies on a low dimensional manifold but is obscured by high-dimensional noise [11].

We use a kernel matrix based on Euclidean distances between covariates. Our analysis can be extended to inner product kernels as well. From now onwards, we will assume that the function generating the kernel is bounded and Lipschitz.

Assumption 1. For $n$ observations $Y_1, \dots, Y_n$, the kernel matrix (sometimes also called the Gram matrix) $K$ is induced by $K(i, j) = f(\|Y_i - Y_j\|_2^2)$, where $f$ satisfies $|f(x)| \le 1$ for all $x$, and there exists $C_0 > 0$ such that $\sup_{x, y} |f(x) - f(y)| \le C_0 |x - y|$.

A widely used example satisfying the above condition is the Gaussian kernel. For simplicity, we will without loss of generality assume $K(x, y) = f(\|x - y\|^2) = \exp(-\eta \|x - y\|^2)$.

For the asymptotic analysis, we use the following standard notation for rates of convergence. $T(n)$ is $O(f(n))$ iff for some constants $c$ and $n_0$, $T(n) \le c f(n)$ for all $n \ge n_0$; $T(n)$ is $\Omega(f(n))$ if for some constants $c$ and $n_0$, $T(n) \ge c f(n)$ for all $n \ge n_0$; $T(n)$ is $\Theta(f(n))$ if $T(n)$ is both $O(f(n))$ and $\Omega(f(n))$; $T(n)$ is $o(f(n))$ if $T(n)$ is $O(f(n))$ but not $\Omega(f(n))$. $T(n)$ is $o_P(f(n))$ (or $O_P(f(n))$) if it is $o(f(n))$ (or $O(f(n))$) with high probability.

Several matrix norms are used in this manuscript. For $M \in \mathbb{R}^{n \times n}$, the $\ell_1$ and $\ell_\infty$ norms are defined as for vectors: $\|M\|_1 := \sum_{ij} |M_{ij}|$ and $\|M\|_\infty := \max_{i,j} |M_{ij}|$. For two matrices $M, Q \in \mathbb{R}^{m \times n}$, their inner product is $\langle M, Q \rangle = \mathrm{trace}(M^T Q)$. The operator norm $\|M\|$ is simply the largest singular value of $M$, which equals the largest eigenvalue for a symmetric matrix. Throughout the manuscript, we use $1_n$ to denote the all-ones $n \times 1$ vector and $E_n$, $E_{n,k}$ to denote the all-ones matrices of size $n \times n$ and $n \times k$. The subscript is dropped when it is clear from context.

2.1 Two kernel clustering algorithms

Kernel clustering algorithms can be broadly divided into two categories: one is based on semidefinite relaxations of the k-means objective function, and the other is eigen-decomposition based, like kernel PCA, spectral clustering, etc. In this section we describe these two settings.

SDP relaxation for kernel clustering. It is well known [9] that kernel k-means can be carried out by maximizing $\mathrm{trace}(Z^T K Z)$, where $Z$ is the $n \times r$ matrix of cluster memberships. However, due to the non-convexity of the constraints, the problem is NP-hard, and many convex relaxations have been proposed in the literature. In this paper, we propose the following semidefinite programming relaxation; the same relaxation has been used for stochastic block models [1].

$$\max_X \ \mathrm{trace}(KX) \qquad \text{(SDP-1)}$$
$$\text{s.t.} \quad X \succeq 0, \quad X \ge 0, \quad X 1 = \tfrac{n}{r} 1, \quad \mathrm{diag}(X) = 1$$

The clustering procedure is listed in Algorithm 1.

Algorithm 1 SDP relaxation for kernel clustering
Require: Observations $Y_1, \dots, Y_n$, kernel function $f$.
1: Compute the kernel matrix $K$ with $K(i, j) = f(\|Y_i - Y_j\|_2^2)$;
2: Solve SDP-1 and let $\hat{X}$ be the optimal solution;
3: Do k-means on the $r$ leading eigenvectors $U$ of $\hat{X}$.
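For illustration, here is a minimal sketch of Algorithm 1 under the Gaussian-kernel assumption. It uses a generic off-the-shelf convex solver (cvxpy with SCS) in place of the ADMM solver used in our experiments, so it should be read as a small-$n$ prototype rather than the reference implementation; `gaussian_kernel` and `sdp_kernel_cluster` are illustrative names.

```python
import numpy as np
import cvxpy as cp
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def gaussian_kernel(Y, eta=1.0):
    """K(i, j) = exp(-eta * ||Y_i - Y_j||_2^2)."""
    return np.exp(-eta * squareform(pdist(Y, metric="sqeuclidean")))

def sdp_kernel_cluster(Y, r, eta=1.0):
    """Algorithm 1: solve SDP-1, then k-means on the r leading eigenvectors of X_hat."""
    n = Y.shape[0]
    K = gaussian_kernel(Y, eta)
    X = cp.Variable((n, n), PSD=True)              # X is positive semidefinite
    constraints = [X >= 0,                         # entrywise non-negative
                   cp.diag(X) == 1,                # diag(X) = 1
                   X @ np.ones(n) == n / r]        # X 1 = (n/r) 1
    cp.Problem(cp.Maximize(cp.trace(K @ X)), constraints).solve(solver=cp.SCS)
    _, vecs = np.linalg.eigh(X.value)              # eigenvalues in ascending order
    U = vecs[:, -r:]                               # r leading eigenvectors of X_hat
    return KMeans(n_clusters=r, n_init=10).fit_predict(U)
```

For the problem sizes used in Section 5, a first-order method such as the ADMM solver of [1] is the practical choice; the generic solver above is only meant to make the relaxation concrete for small $n$.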
Kernel singular value decomposition. Kernel singular value decomposition (K-SVD) is a spectral clustering approach: one first computes the SVD of the kernel matrix, and then runs k-means on the first $r$ singular vectors. Different variants include K-PCA, which uses the singular vectors of the centered kernel matrix, and spectral clustering, which uses the singular vectors of the normalized kernel matrix. The detailed algorithm is shown in Algorithm 2.

Algorithm 2 K-SVD (K-PCA, spectral clustering)
Require: Observations $Y_1, \dots, Y_n$, kernel function $f$.
1: Compute the kernel matrix $K$ with $K(i, j) = f(\|Y_i - Y_j\|_2^2)$;
2: if K-PCA then
3: $K = K - K11^T/n - 11^TK/n + 11^TK11^T/n^2$;
4: else if spectral clustering then
5: $K = D^{-1/2} K D^{-1/2}$ where $D = \mathrm{diag}(K 1_n)$;
6: end if
7: Do k-means on the $r$ leading singular vectors $V$ of $K$.
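A compact sketch of Algorithm 2 and its two variants is given below (again an illustrative prototype; the kernel matrix is assumed precomputed, e.g., by the `gaussian_kernel` helper sketched above, and `ksvd_cluster` is an illustrative name).

```python
import numpy as np
from sklearn.cluster import KMeans

def ksvd_cluster(K, r, variant="ksvd"):
    """Algorithm 2: k-means on the r leading singular vectors of the
    (optionally centered or normalized) kernel matrix."""
    n = K.shape[0]
    if variant == "kpca":
        # Double centering: K - K 11^T/n - 11^T K/n + 11^T K 11^T/n^2 = J K J.
        J = np.eye(n) - np.ones((n, n)) / n
        K = J @ K @ J
    elif variant == "spectral":
        # Normalization: D^{-1/2} K D^{-1/2} with D = diag(K 1_n).
        d = K.sum(axis=1)
        K = K / np.sqrt(np.outer(d, d))
    U, _, _ = np.linalg.svd(K)                     # left singular vectors, descending order
    return KMeans(n_clusters=r, n_init=10).fit_predict(U[:, :r])
```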
3 Main results

In this section we summarize our main results. In this paper we analyze the SDP relaxation of kernel k-means and K-SVD type methods. Our main contribution is two-fold. First, we show that the SDP relaxation produces strongly consistent results, i.e., the number of misclustered nodes goes to zero with high probability when there are no outliers, which means exact recovery without rounding. On the other hand, K-SVD is weakly consistent, i.e., the fraction of misclassified nodes goes to zero when there are no outliers. In the presence of outliers, we see an interesting dichotomy in the behaviors of these two methods. Both can be proven to be weakly consistent in terms of misclassification error. However, SDP is more resilient to the effect of outliers than K-SVD if the number of clusters grows or if the separation between the cluster means decays.

Our analysis is organized as follows. First we present a result on the concentration of kernel matrices around their population counterpart. The population kernel matrix for inliers is blockwise constant with $r$ blocks (except the diagonal, which is one). Next we prove that as $n$ increases, the optimum $\hat{X}$ of SDP-1 converges strongly to $X_0$ when there are no outliers, and weakly if the number of outliers grows slowly with $n$. Then we show that the mis-clustering error of the clustering returned by Algorithm 1 goes to zero with probability tending to one as $n \to \infty$ when there are no outliers. Finally, when the number of outliers grows slowly with $n$, the fraction of mis-clustered nodes from Algorithms 1 and 2 converges to zero.

We start with the concentration of kernel matrices around their population counterpart. We show that under our data model (1), the empirical kernel matrix with the Gaussian kernel restricted to the inliers concentrates around a "population" matrix $\bar{K}_f$, and the $\ell_\infty$ norm of $K^f_{II} - \bar{K}^f_{II}$ goes to zero at the rate $O_P(\sqrt{\log p / p})$.

Theorem 1. Let $d_{k\ell} = \|\mu_k - \mu_\ell\|$. For $i \in C_k$, $j \in C_\ell$, define
$$\bar{K}_f(i, j) = \begin{cases} f(d_{k\ell}^2 + \sigma_k^2 + \sigma_\ell^2) & \text{if } i \neq j, \\ f(0) & \text{if } i = j. \end{cases}$$
Then there exists a constant $\rho > 0$ such that
$$P\left( \|K^f_{II} - \bar{K}^f_{II}\|_\infty \ge c \sqrt{\tfrac{\log p}{p}} \right) \le n^2 p^{-\rho c^2}.$$

Remark 1. Setting $c = \sqrt{3 \log n / (\rho \log p)}$, there exists a constant $\rho > 0$ such that $\|K^f_{II} - \bar{K}^f_{II}\|_\infty \le \sqrt{3 \log n / (\rho p)}$ with probability at least $1 - 1/n$. The error probability goes to zero for a suitably chosen constant as long as $p$ grows faster than $\log n$.

While our analysis is inspired by [11], there are two main differences. First, we have a mixture model where the population kernel is blockwise constant. Second, we obtain $\sqrt{\log p / p}$ rates of convergence by carefully bounding the tail probabilities. In order to attain this we further assume that the noise is sub-gaussian and isotropic.

From now on we drop the subscript $f$ and refer to the empirical and population kernel matrices as $K$ and $\bar{K}$. By definition, $\bar{K}$ is blockwise constant with $r$ unique rows (except the diagonal elements, which are ones). An important property of $\bar{K}$ is that $\lambda_r - \lambda_{r+1}$ (where $\lambda_i$ is the $i$th largest eigenvalue of $\bar{K}$) is $\Omega(n \lambda_{\min}(B)/r)$, where $B$ is the $r \times r$ Gaussian kernel matrix generated by the centers.

Lemma 1. If the scale parameter of the Gaussian kernel is non-zero and no two clusters share the same center, then, letting $B$ be the $r \times r$ matrix with $B_{k\ell} = f(\|\mu_k - \mu_\ell\|^2)$,
$$\lambda_r(\bar{K}) - \lambda_{r+1}(\bar{K}) \ \ge\ \frac{n}{r}\, \lambda_{\min}(B)\, \min_k f(\sigma_k^2)^2 \ -\ 2 \max_k \left(1 - f(2\sigma_k^2)\right) \ =\ \Omega\!\left(\frac{n \lambda_{\min}(B)}{r}\right).$$

Now we present our result on the consistency of SDP-1. To this end, we upper bound $\|\hat{X} - X_0\|_1$, where $\hat{X}$ is the optimum returned by SDP-1 and $X_0$ is the true clustering matrix. We first present a lemma which is crucial to the proof of the theorem. Before presenting it, we define
$$\gamma_{k\ell} := f(2\sigma_k^2) - f(d_{k\ell}^2 + \sigma_k^2 + \sigma_\ell^2), \qquad \gamma_{\min} := \min_{\ell \neq k} \gamma_{k\ell}. \qquad (2)$$
The first quantity $\gamma_{k\ell}$ measures the separation between clusters $k$ and $\ell$; the second quantity is the smallest such separation. We will assume that $\gamma_{\min}$ is positive. This is very similar to asymptotic network analysis, where strong assortativity is often assumed. Our results show that the consistency of clustering deteriorates as $\gamma_{\min}$ decreases.

Lemma 2. Let $\hat{X}$ be the solution to (SDP-1). Then
$$\|X_0 - \hat{X}\|_1 \le \frac{2}{\gamma_{\min}} \langle K - \bar{K}, \hat{X} - X_0 \rangle.$$

Combining the above with the concentration of $K$ from Theorem 1, we have the following result.

Theorem 2. If $d_{k\ell}^2 > |\sigma_k^2 - \sigma_\ell^2|$ for all $k \neq \ell$ and $\gamma_{\min} = \Omega(\sqrt{\log p / p})$, then for some absolute constant $c > 0$, with high probability,
$$\|X_0 - \hat{X}\|_1 \le \max\left\{ o_P(1),\ \frac{c\, mn}{r \gamma_{\min}} \right\}.$$

Remark 2. When there is no outlier in the data, i.e., $m = 0$, $\hat{X} = X_0$ with high probability and SDP-1 is strongly consistent without rounding. When $m > 0$, the right-hand side of the inequality is dominated by the $mn/r$ term. Note that $\|X_0\|_1 = n^2/r$; therefore, after suitable normalization, the error rate goes to zero at rate $O(m/(n\gamma_{\min}))$ as $n \to \infty$.

Now we present the mis-clustering error rates of Algorithms 1 and 2. Although $\hat{X}$ is strongly consistent in the absence of outliers, in practice one often wants the labeling in addition to the clustering matrix. Therefore it is usually necessary to carry out the final eigen-decomposition step in Algorithm 1. Since $X_0$ is the clustering matrix, its principal eigenvectors are blockwise constant. In order to show a small mis-clustering error, one needs to show that the eigenvectors of $\hat{X}$ converge (modulo a rotation) to those of $X_0$. This is achieved by a careful application of the Davis-Kahan theorem, a detailed discussion of which we defer to the analysis in Section 4. The Davis-Kahan theorem bounds the deviation of the $r$ principal eigenvectors $\hat{U}$ of a Hermitian matrix $\hat{M}$ from the $r$ principal eigenvectors $U$ of $M$ as $\|\hat{U} - UO\|_F \le 2^{3/2} \|M - \hat{M}\|_F / (\lambda_r - \lambda_{r+1})$ [25], where $\lambda_r$ is the $r$th largest eigenvalue of $M$ and $O$ is the optimal rotation matrix. For a complete statement of the theorem see Appendix F. Applying this result to $X_0$ and $\bar{K}$ provides two different upper bounds on the distance between leading eigenvectors. We will see in Theorem 3 that the eigengaps for the two algorithms differ, which results in different upper bounds on the number of misclustered nodes. Since the Davis-Kahan bounds are tight up to a constant [25], despite being upper bounds, this indicates that Algorithm 1 is less sensitive to the separation between cluster means than Algorithm 2.
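To make the use of this bound concrete, the following sketch evaluates both sides for a pair of symmetric matrices, computing the optimal rotation $O$ by orthogonal Procrustes alignment of the two eigenvector bases. This is an illustrative check, not part of the proofs; `davis_kahan_check` is an assumed helper name.

```python
import numpy as np

def top_r_eigvecs(M, r):
    _, vecs = np.linalg.eigh(M)                    # eigenvalues in ascending order
    return vecs[:, -r:]                            # eigenvectors of the r largest eigenvalues

def davis_kahan_check(M, M_hat, r):
    """Return (||U_hat - U O||_F, 2^{3/2} ||M - M_hat||_F / (lambda_r - lambda_{r+1}))."""
    U, U_hat = top_r_eigvecs(M, r), top_r_eigvecs(M_hat, r)
    # Optimal rotation O = argmin over orthogonal O of ||U_hat - U O||_F (Procrustes).
    P, _, Qt = np.linalg.svd(U.T @ U_hat)
    O = P @ Qt
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]     # eigenvalues of M, descending
    bound = 2 ** 1.5 * np.linalg.norm(M - M_hat) / (lam[r - 1] - lam[r])
    return np.linalg.norm(U_hat - U @ O), bound
```

For the SDP analysis one would call `davis_kahan_check(X0, X_hat, r)`, where the relevant eigengap of $X_0$ is $n/r$; for K-SVD the population matrix is $\bar{K}$ and the eigengap is the one bounded in Lemma 1.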
Once the eigenvector deviation is established, we present explicit bounds on the mis-clustering error of both methods in the following theorem. K-means assigns each row of $\hat{U}$ (the input eigenvectors of $K$ or $\hat{X}$) to one of $r$ clusters. Define $c_1, \dots, c_n \in \mathbb{R}^r$ such that $c_i$ is the centroid corresponding to the $i$th row of $\hat{U}$. Similarly, for the population eigenvectors $U$ (the top $r$ eigenvectors of $\bar{K}$ or $X_0$), we define the population centroids as $(Z\nu)_i$ for some $\nu \in \mathbb{R}^{r \times r}$. Recall that we construct $Z$ so that the outliers are equally and arbitrarily divided amongst the $r$ clusters. We show that when the empirical centroid of a node is close to its population centroid up to a rotation, then the node is correctly clustered. We give a general definition of a superset of the misclustered nodes, applicable both to K-SVD and SDP:
$$\mathcal{M} = \left\{ i : \|c_i - Z_i \nu O\| \ge \frac{1}{\sqrt{2n/r}} \right\}. \qquad (4)$$

Theorem 3. Let $\mathcal{M}_{sdp}$ and $\mathcal{M}_{ksvd}$ be defined as in Eq. (4), where the $c_i$'s are generated by Algorithms 1 and 2 respectively. Let $\lambda_r$ be the $r$th largest eigenvalue of $\bar{K}$. We have:
$$|\mathcal{M}_{sdp}| \le \max\left\{ o_P(1),\ O_P\!\left( \frac{m}{\gamma_{\min}} \right) \right\}, \qquad |\mathcal{M}_{ksvd}| \le O_P\!\left( \max\left\{ \frac{mn^2}{r(\lambda_r - \lambda_{r+1})^2},\ \frac{n^3 \log p}{rp(\lambda_r - \lambda_{r+1})^2} \right\} \right).$$

Remark 3. Getting a bound on $\lambda_r$ in terms of $\gamma_{\min}$ for general blockwise constant matrices is difficult. But as shown in Lemma 1, the eigengap is $\Omega(n\lambda_{\min}(B)/r)$. Plugging this back in, we have
$$|\mathcal{M}_{ksvd}| \le \max\left\{ O_P\!\left( \frac{mr}{\lambda_{\min}(B)^2} \right),\ O_P\!\left( \frac{nr \log p}{p\, \lambda_{\min}(B)^2} \right) \right\}.$$

In some simple cases one can get explicit bounds on $\lambda_r$, and we have the following.

Corollary 1. Consider the special case when all clusters share the same variance $\sigma^2$ and the $d_{k\ell}$ are identical for all pairs of clusters. The number of misclustered nodes of K-SVD is then upper bounded by
$$|\mathcal{M}_{ksvd}| \le \max\left\{ O_P\!\left( \frac{mr}{\gamma_{\min}^2} \right),\ O_P\!\left( \frac{nr \log p}{p\, \gamma_{\min}^2} \right) \right\}.$$
Corollary 1 is proved in Appendix H.

Remark 4. This situation can arise if the center of cluster $a$ is of the form $c\, e_a$, where $e_a$ is the binary vector with $e_a(i) = 1_{a=i}$. In this case the algorithm is weakly consistent (the fraction of misclassified nodes vanishes) when $\gamma_{\min} = \Omega\left( \max\left\{ \sqrt{r \log p / p},\ \sqrt{mr/n} \right\} \right)$. Compared to $|\mathcal{M}_{sdp}|$, $|\mathcal{M}_{ksvd}|$ has an additional factor of $r/\gamma_{\min}$. For the same $m$ and $n$, K-SVD therefore has a worse upper bound on the error and is more sensitive to $\gamma_{\min}$, which depends both on the data distribution and on the scale parameter of the kernel. The proposed SDP can be seen as a denoising procedure which enlarges the separation; it succeeds as long as the denoising is faithful, which requires much weaker assumptions.

4 Proof of the main results

In this section we sketch the proofs of the main theorems. The full proofs are deferred to the supplementary material.

4.1 Proof of Theorem 1

In Theorem 1, we show that if the data distribution is sub-gaussian, the $\ell_\infty$ norm of $K - \bar{K}$ restricted to the inlier nodes concentrates at rate $O_P(\sqrt{\log p/p})$.

Proof sketch. By the Lipschitz condition, it suffices to show that $\|Y_i - Y_j\|_2^2$ concentrates around $d_{k\ell}^2 + \sigma_k^2 + \sigma_\ell^2$. To do this, we decompose
$$\|Y_i - Y_j\|_2^2 = \|\mu_k - \mu_\ell\|_2^2 + \frac{2 (W_i - W_j)^T (\mu_k - \mu_\ell)}{\sqrt{p}} + \frac{\|W_i - W_j\|_2^2}{p}.$$
It now suffices to show that the third term concentrates around $\sigma_k^2 + \sigma_\ell^2$ and the second term concentrates around 0. Since $W_i - W_j$ is sub-gaussian, its square is sub-exponential. Using a sub-gaussian tail bound for the second term and a Bernstein-type inequality for sub-exponential random variables for the third, we prove the result.

With the elementwise bound, the Frobenius norm of the matrix difference costs only one extra factor of $n$.

Corollary 2. With probability at least $1 - n^2 p^{-\rho c^2}$, $\|K_{II} - \bar{K}_{II}\|_F \le c\, n \sqrt{\frac{\log p}{p}}$.
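As a quick empirical illustration of this concentration (reusing the hypothetical `generate_data` and `gaussian_kernel` helpers sketched earlier, with $m = 0$ so that every node is an inlier), one can compare the measured $\ell_\infty$ deviation against the $\sqrt{\log p / p}$ rate; Theorem 1 controls the former by a constant multiple of the latter, so the two numbers should be of the same order.

```python
import numpy as np

n, p, r, sigma, eta = 600, 2000, 3, 1.0, 1.0
Y, labels = generate_data(n=n, p=p, r=r, m=0, d=1.0, sigma=sigma)   # inliers only
K = gaussian_kernel(Y, eta=eta)

# Population kernel: f(d_kl^2 + sigma_k^2 + sigma_l^2) off the diagonal, f(0) = 1 on it.
mu = np.zeros((r, p))
mu[np.arange(r), np.arange(r)] = 1.0          # the same sparse centers used by generate_data
D2c = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances between centers
K_bar = np.exp(-eta * (D2c[labels][:, labels] + 2.0 * sigma ** 2))
np.fill_diagonal(K_bar, 1.0)

print("||K_II - K_bar_II||_inf :", np.abs(K - K_bar).max())
print("sqrt(log p / p)         :", np.sqrt(np.log(p) / p))
```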
4.2 Proof of Theorem 2

Lemma 2 is proved in Appendix D, where we make use of the optimality condition and the constraints in SDP-1. Equipped with Lemma 2, we are ready to prove Theorem 2.

Proof sketch. In the outlier-free ideal scenario, Lemma 2 along with the duality of the $\ell_1$ and $\ell_\infty$ norms gives
$$\|\hat{X} - X_0\|_1 \le \frac{2 \|K - \bar{K}\|_\infty \|\hat{X} - X_0\|_1}{\gamma_{\min}}.$$
Then by Theorem 1 we get the strong consistency result. When outliers are present, we have to derive a slightly different upper bound. The main idea is to divide the matrices into two parts, one corresponding to the rows and columns of the inliers, and the other corresponding to those of the outliers. The concentration result (Theorem 1) on $K$, together with the facts that the kernel function and $X_0, \hat{X}$ are bounded by 1 and that the rows of $\hat{X}$ sum to $n/r$ because of the constraint in SDP-1, then completes the proof. The full proof is deferred to Appendix E.

4.3 Proof of Theorem 3

Although Theorem 2 provides insight into how close the recovered matrix $\hat{X}$ is to the ground truth, it remains unclear how the final clustering result behaves. In this section, we bound the number of misclassified points by bounding the distance between the eigenvectors of $\hat{X}$ and $X_0$. We start by presenting a lemma that bounds the error of the k-means step. K-means is a non-convex procedure and is usually hard to analyze directly. However, when the centroids are well-separated, it is possible to give sufficient conditions for a node to be correctly clustered. When the set of misclustered nodes is defined as in Eq. (4), the cardinality of $\mathcal{M}$ is directly upper bounded by the distance between eigenvectors. Explicitly, we have the following lemma. Here $\hat{U}$ denotes the top $r$ eigenvectors of $K$ for K-SVD and of $\hat{X}$ for SDP; $U$ denotes the top $r$ eigenvectors of $\bar{K}$ for K-SVD and of $X_0$ for SDP; and $O$ denotes the corresponding rotation that aligns the empirical eigenvectors with their population counterparts.

Lemma 3. With $\mathcal{M}$ defined as in Eq. (4), $|\mathcal{M}| \le \frac{8n}{r} \|\hat{U} - UO\|_F^2$.

Lemma 3 is proved in Appendix G.

Analysis of $|\mathcal{M}_{sdp}|$: To bound the deviation in eigenvectors, note that the $r$th eigenvalue of $X_0$ is $n/r$ and the $(r+1)$th is 0. Let $U \in \mathbb{R}^{n \times r}$ contain the top $r$ eigenvectors of $X_0$ and $\hat{U}$ those of $\hat{X}$. By the Davis-Kahan theorem, there exists a rotation $O$ such that
$$\|\hat{U} - UO\|_F \le \frac{2^{3/2} \|\hat{X} - X_0\|_F}{n/r}.$$
Applying Lemma 3,
$$|\mathcal{M}_{sdp}| \le \frac{8n}{r} \left( \frac{2^{3/2}\, r\, \|\hat{X} - X_0\|_F}{n} \right)^2 = \frac{64\, r\, \|\hat{X} - X_0\|_F^2}{n}.$$
Since the entries of $\hat{X} - X_0$ lie in $[-1, 1]$, we have $\|\hat{X} - X_0\|_F^2 \le \|\hat{X} - X_0\|_1$, and combining this with Theorem 2 yields the bound on $|\mathcal{M}_{sdp}|$ in Theorem 3.

Analysis of $|\mathcal{M}_{ksvd}|$: In the outlier-present kernel scenario, by Corollary 2,
$$\|K - \bar{K}\|_F \le \|K_{II} - \bar{K}_{II}\|_F + \|K_{R} - \bar{K}_{R}\|_F = O_P\!\left(n \sqrt{\log p / p}\right) + O_P\!\left(\sqrt{mn}\right).$$
Again by the Davis-Kahan theorem and the eigengap between $\lambda_r$ and $\lambda_{r+1}$ of $\bar{K}$ from Lemma 1, letting $U$ contain the top $r$ eigenvectors of $\bar{K}$ and $\hat{U}$ its empirical counterpart, there exists a rotation $O$ such that
$$\|\hat{U} - UO\|_F \le \frac{2^{3/2} \|K - \bar{K}\|_F}{\lambda_r - \lambda_{r+1}} \le \frac{C \max\{\sqrt{mn},\ n\sqrt{\log p / p}\}}{\lambda_r - \lambda_{r+1}}.$$
Now we apply Lemma 3 and get the upper bound on the number of misclustered nodes for K-SVD:
$$|\mathcal{M}_{ksvd}| \le \frac{8n}{r} \left( \frac{2^{3/2} C \max\{\sqrt{mn},\ n\sqrt{\log p / p}\}}{\lambda_r(\bar{K}) - \lambda_{r+1}(\bar{K})} \right)^2 = O_P\!\left( \max\left\{ \frac{mn^2}{r(\lambda_r - \lambda_{r+1})^2},\ \frac{n^3 \log p}{rp(\lambda_r - \lambda_{r+1})^2} \right\} \right).$$

5 Experiments

In this section we present numerical results. For the implementation of the proposed SDP, we use the Alternating Direction Method of Multipliers (ADMM) solver used in [1]. In each synthetic experiment, we generate $n - m$ inliers from $r$ equal-sized clusters. The centers of the clusters are sparse and hidden in $p$-dimensional noise. To each generated data set we add $m$ outlier observations. To capture the arbitrary nature of the outliers, we generate half of the outliers from a random Gaussian with large variance (100 times the signal variance) and the other half from a uniform distribution that scatters across all clusters.

Figure 1: Performance vs. parameters. (a) Inlier accuracy vs. number of clusters ($n = p = 1500$, $m = 10$, $d^2 = 0.125$, $\sigma = 1$); (b) Inlier accuracy vs. number of outliers ($n = 1000$, $r = 5$, $d^2 = 0.02$, $\sigma = 1$, $p = 500$); (c) Inlier accuracy vs. separation ($n = 1000$, $r = 5$, $m = 50$, $\sigma = 1$, $p = 1000$).

We compare Algorithm 1 with: 1) k-means via Lloyd's algorithm; 2) kernel SVD; and 3) kernel PCA [19]. The evaluation metric is the accuracy on inliers, i.e., the number of correctly clustered inlier nodes divided by the total number of inliers. To avoid the label identification problem, we compare all permutations of the predicted labels to the ground truth labels and record the best accuracy.
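A small sketch of this evaluation metric is given below (an illustrative helper; the exhaustive search over label permutations is feasible here because $r$ is small in these experiments).

```python
import numpy as np
from itertools import permutations

def inlier_accuracy(pred, truth, inlier_mask):
    """Best accuracy over all permutations of the predicted labels, computed on inliers only.
    pred and truth are length-n label vectors; inlier_mask marks the n - m inliers."""
    pred = np.asarray(pred)[inlier_mask]
    truth = np.asarray(truth)[inlier_mask]
    labels = np.unique(np.concatenate([pred, truth]))
    best = 0.0
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))
        remapped = np.array([relabel[c] for c in pred])
        best = max(best, float(np.mean(remapped == truth)))
    return best
```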
Each set of parameters is run with 10 replicates, and the mean accuracy and standard deviation (shown as error bars) are reported. For all k-means steps used in the experiments we do 10 restarts and keep the solution with the smallest k-means loss. For each experiment, we change only one parameter and fix all the others.

Figure 1 shows how the performance of the different clustering algorithms changes as (a) the number of clusters, (b) the number of outliers, and (c) the minimum distance between clusters increase. The values of all parameters used are specified in the caption of the figure. Panel (a) shows the inlier accuracy of the various methods as the number of clusters increases. With $r$ growing, the performance of all methods deteriorates except for the SDP. We also examine the $\ell_1$ norm of $X_0 - \hat{X}$, which remains stable as the number of clusters increases. Panel (b) shows the trend with respect to the number of outliers. The accuracy of SDP on the inliers is almost unaffected by the number of outliers, while the other methods suffer for large $m$. Panel (c) compares the performance as the minimum distance between cluster centers changes. Both SDP and K-SVD are consistent as the distance increases. Compared with K-SVD, SDP achieves consistency faster and varies less across random runs, which matches the analysis given in Section 3.

6 Conclusion

In this paper, we investigate the consistency and robustness of two kernel-based clustering algorithms. We propose a semidefinite programming relaxation which is shown to be strongly consistent without outliers and weakly consistent in the presence of arbitrary outliers. We also show that K-SVD is weakly consistent, in that the misclustering rate goes to zero as the number of observations grows, provided the outliers form a small fraction of the inliers. Comparing the two methods, we conclude that although both are robust to outliers, the proposed SDP is less sensitive to the minimum separation between clusters. The experimental results also support the theoretical analysis.

References

[1] A. A. Amini and E. Levina. On semidefinite relaxations for the block model. arXiv preprint arXiv:1406.5647, 2014.
[2] P. Awasthi and O. Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 37-49. Springer, 2012.
[3] T. T. Cai, X. Li, et al. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. The Annals of Statistics, 43(3):1027-1059, 2015.
[4] A. Christmann and I. Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, pages 799-819, 2007.
[5] S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. The Journal of Machine Learning Research, 8:203-226, 2007.
[6] K. De Brabanter, K. Pelckmans, J. De Brabanter, M. Debruyne, J. A. Suykens, M. Hubert, and B. De Moor. Robustness of kernel based regression: a comparison of iterative weighting schemes. In Artificial Neural Networks, ICANN 2009, pages 100-110. Springer, 2009.
[7] M. Debruyne, M. Hubert, and J. A. Suykens. Model selection in kernel based regression using the influence function. Journal of Machine Learning Research, 9(10), 2008.
[8] M. Debruyne, M. Hubert, and J. Van Horebeek. Detecting influential observations in kernel PCA. Computational Statistics & Data Analysis, 54(12):3007-3019, 2010.
[9] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551-556. ACM, 2004.
[10] L. Duan, L. Xu, Y. Liu, and J. Lee. Cluster-based outlier detection. Annals of Operations Research, 168(1):151-168, 2009.
[11] N. El Karoui et al. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191-3216, 2010.
[12] D.-W. Kim, K. Y. Lee, D. Lee, and K. H. Lee. Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognition, 38(4):607-611, 2005.
[13] J. Kim and C. D. Scott. Robust kernel density estimation. The Journal of Machine Learning Research, 13(1):2529-2565, 2012.
[14] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. Oakland, CA, USA, 1967.
[15] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming. arXiv preprint arXiv:1602.06612, 2016.
[16] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.
[17] R. Pamula, J. K. Deka, and S. Nandi. An outlier detection method based on clustering. In 2011 Second International Conference on EAIT, pages 253-256. IEEE, 2011.
[18] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186-205, 2007.
[19] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[20] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[21] H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., 1:801-804, 1956.
[22] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555-586, 2008.
[23] L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation. In AAAI, volume 6, pages 536-542, 2006.
[24] M.-S. Yang and K.-L. Wu. A similarity-based robust clustering method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4):434-448, 2004.
[25] Y. Yu, T. Wang, and R. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315-323, 2015.