# Adaptive Clustering through Semidefinite Programming

Martin Royer
Laboratoire de Mathématiques d'Orsay, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 91405 Orsay, France
martin.royer@math.u-psud.fr

We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition of the sample $X_1, \dots, X_n$. We perform exact clustering with high probability using a convex semidefinite estimator that can be interpreted as a corrected, relaxed version of K-means. The estimator is analyzed through a non-asymptotic framework and shown to be optimal or near-optimal in recovering the partition. Furthermore, its performance is shown to be adaptive to the problem's effective dimension, as well as to $K$, the unknown number of groups in this partition. We illustrate the method's performance in comparison to other classical clustering algorithms with numerical experiments on simulated high-dimensional data.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

## 1 Introduction

Clustering, a form of unsupervised learning, is the classical problem of assembling $n$ observations $X_1, \dots, X_n$ from a $p$-dimensional space into $K$ groups. Applied fields are in need of robust clustering techniques, such as computational biology with genome classification, data mining, or image segmentation in computer vision. But the clustering problem has proven notoriously hard when the embedding dimension is large compared to the number of observations (see for instance the recent discussions in [2, 21]).

A famous early approach to clustering is to solve for the geometrical estimator K-means [19, 13, 14]. The intuition behind its objective is that groups are to be determined in a way that minimizes the total intra-group variance. It can be interpreted as an attempt to "best" represent the observations by $K$ points, a form of vector quantization. Although the method performs well when observations are homoscedastic, K-means is an NP-hard, ad hoc method. Clustering with probabilistic frameworks is usually based on maximum likelihood approaches paired with a variant of the EM algorithm for model estimation, see for instance the works of Fraley & Raftery [11] and Dasgupta & Schulman [9]. These methods are widespread and popular, but they tend to be very sensitive to initialization and model misspecification.

Several recent developments establish a link between clustering and semidefinite programming. Peng & Wei [17] show that the K-means objective can be relaxed into a convex, semidefinite program, leading Mixon et al. [16] to use this relaxation under a subgaussian mixture model to estimate the cluster centers. Yan and Sarkar [24] use a similar semidefinite program in the context of covariate clustering, when the network has both nodes and covariates. Chrétien et al. [8] use a slightly different form of semidefinite program to recover the adjacency matrix of the cluster graph with high probability. Lastly, in the different context of variable clustering, Bunea et al. [6] present a semidefinite program with a correction step and produce non-asymptotic exact recovery results.

In this work, we build upon the work and context of [6], and transpose and adapt their ideas to point clustering: we introduce a semidefinite estimator for point clustering inspired by the findings of [17], with a correction component originally presented in [6].
We show that it produces a very strong contender for clustering recovery in terms of speed, adaptivity and robustness to model perturbations. In order to do so, we introduce a flexible probabilistic model inducing an optimal partition of the data that we aim to recover. Using the same structure of proof in a different context, we establish elements of stochastic control (see for instance Lemma A.1 on the concentration of random subgaussian Gram matrices in the supplementary material) to derive conditions for exact clustering recovery with high probability, and show optimal performance, including in high dimensions, improving on [16], as well as adaptivity to the effective dimension of the problem. We also show that our results continue to hold without knowledge of the number of groups, given a single positive tuning parameter. Lastly, we provide evidence of our method's efficiency, and further insight, from simulated data.

Notation. Throughout this work we use the convention $0/0 := 0$ and $[n] = \{1, \dots, n\}$. We write $a_n \lesssim b_n$ to mean that $a_n$ is smaller than $b_n$ up to an absolute constant factor. Let $S^{d-1}$ denote the unit sphere in $\mathbb{R}^d$. For $q \in \mathbb{N} \cup \{+\infty\}$ and $\nu \in \mathbb{R}^d$, $|\nu|_q$ is the $l_q$-norm, and for $M \in \mathbb{R}^{d \times d}$, $|M|_q$, $|M|_F$ and $|M|_{op}$ are respectively the entry-wise $l_q$-norm, the Frobenius norm associated with the scalar product $\langle \cdot, \cdot \rangle$, and the operator norm. $|D|_V$ is the variation semi-norm of a diagonal matrix $D$, the difference between its maximum and minimum elements. Let $A \succeq B$ mean that $A - B$ is symmetric, positive semidefinite.

## 2 Probabilistic modeling of point clustering

Consider $X_1, \dots, X_n$ and let $\nu_a = \mathbb{E}[X_a]$. The variable $X_a$ can be decomposed into
$$X_a = \nu_a + E_a, \quad a = 1, \dots, n, \qquad (1)$$
with $E_a$ centered stochastic variables in $\mathbb{R}^p$.

Definition 1. For $K > 1$, $\mu = (\mu_1, \dots, \mu_K) \in (\mathbb{R}^p)^K$, $\delta \geq 0$ and $G = \{G_1, \dots, G_K\}$ a partition of $[n]$, we say $X_1, \dots, X_n$ are $(G, \mu, \delta)$-clustered if
$$\forall k \in [K], \ \forall a \in G_k, \quad |\nu_a - \mu_k|_2 \leq \delta. \qquad (2)$$
We then call $\Delta(\mu) := \min_{k \neq l} |\mu_k - \mu_l|_2$ the separation between the group means, and
$$\rho(G, \mu, \delta) := \Delta(\mu)/\delta \qquad (3)$$
the discriminating capacity of $(G, \mu, \delta)$.

Proposition 1. If $X_1, \dots, X_n$ are $(G, \mu, \delta)$-clustered with $|G| = K$ and $\rho(G, \mu, \delta) > 4$, then $G$ is the unique maximizer of $\rho(G, \mu, \delta)$ over partitions of size $K$.

So $G$ is the partition maximizing the discriminating capacity over partitions of size $K$. Therefore in this work, we will assume that there is a $K > 1$ such that $X_1, \dots, X_n$ is $(G, \mu, \delta)$-clustered with $|G| = K$ and $\rho(G, \mu, \delta) > 4$. By Proposition 1, $G$ is then identifiable. It is the partition we aim to recover.

We also assume that $X_1, \dots, X_n$ are independent observations with subgaussian behavior. Instead of the classical isotropic definition of a subgaussian random vector (see for example [20]), we use a more flexible definition that can account for anisotropy.

Definition 2. Let $Y$ be a random vector in $\mathbb{R}^d$; $Y$ has a subgaussian distribution if there exists $\Sigma \in \mathbb{R}^{d \times d}$ such that
$$\forall x \in \mathbb{R}^d, \quad \mathbb{E}\left[e^{x^T(Y - \mathbb{E}Y)}\right] \leq e^{x^T \Sigma x / 2}. \qquad (4)$$
We then call $\Sigma$ a variance-bounding matrix of the random vector $Y$, and write in shorthand $Y \sim \mathrm{subg}(\Sigma)$. Note that $Y \sim \mathrm{subg}(\Sigma)$ implies $\mathrm{Cov}(Y) \preceq \Sigma$ in the semidefinite sense of the inequality.

To sum up our modeling assumptions in this work:

Hypothesis 1. Let $X_1, \dots, X_n$ be independent, subgaussian, $(G, \mu, \delta)$-clustered with $\rho(G, \mu, \delta) > 4$.

Remark that the model of Hypothesis 1 can be connected to another popular probabilistic model: if we further ask that $X_1, \dots, X_n$ be identically distributed within a group (and hence $\delta = 0$), the model becomes a realization of a mixture model.

## 3 Exact partition recovery with high probability

Let $G = \{G_1, \dots, G_K\}$ and let $m := \min_{k \in [K]} |G_k|$ denote the minimum cluster size. $G$ can be represented by its characteristic matrix $B^* \in \mathbb{R}^{n \times n}$ defined as
$$\forall (k, l) \in [K]^2, \ \forall (a, b) \in G_k \times G_l, \quad B^*_{ab} := \begin{cases} 1/|G_k| & \text{if } k = l \\ 0 & \text{otherwise.} \end{cases}$$
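For concreteness, the characteristic matrix of a partition is easy to build from group labels. The following minimal sketch (the helper name is ours, not taken from the released pecok code) constructs $B^*$ and checks the properties that define the feasible sets introduced next: symmetry, trace $K$, rows summing to one, and idempotence.

```python
import numpy as np

def characteristic_matrix(labels):
    """Build the characteristic matrix B* of a partition given as integer labels.

    B*[a, b] = 1/|G_k| if a and b belong to the same group G_k, 0 otherwise.
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    B = np.zeros((n, n))
    for k in np.unique(labels):
        members = np.where(labels == k)[0]
        B[np.ix_(members, members)] = 1.0 / len(members)
    return B

# Toy partition of n = 6 points into K = 2 groups.
labels = [0, 0, 0, 1, 1, 1]
B = characteristic_matrix(labels)
assert np.isclose(np.trace(B), 2)        # tr(B*) = K
assert np.allclose(B.sum(axis=1), 1.0)   # B* 1_n = 1_n
assert np.allclose(B @ B, B)             # B* is a projection: (B*)^2 = B*
```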
In what follows, we will demonstrate the recovery of $G$ through the recovery of its characteristic matrix $B^*$. We introduce the sets of square matrices
$$\mathcal{C}^{\{0,1\}}_K := \{B \in \mathbb{R}^{n \times n}_+ : B^T = B, \ \mathrm{tr}(B) = K, \ B 1_n = 1_n, \ B^2 = B\} \qquad (5)$$
$$\mathcal{C}_K := \{B \in \mathbb{R}^{n \times n}_+ : B^T = B, \ \mathrm{tr}(B) = K, \ B 1_n = 1_n, \ B \succeq 0\} \qquad (6)$$
$$\mathcal{C} := \bigcup_{K \in \mathbb{N}} \mathcal{C}_K. \qquad (7)$$
We have $\mathcal{C}^{\{0,1\}}_K \subset \mathcal{C}_K \subset \mathcal{C}$, and $\mathcal{C}_K$ is convex. Notice that $B^* \in \mathcal{C}^{\{0,1\}}_K$. A result by Peng & Wei (2007) [17] shows that the K-means estimator can be expressed as
$$\arg\max_{B \in \mathcal{C}^{\{0,1\}}_K} \langle \widehat{\Lambda}, B \rangle \qquad (8)$$
for $\widehat{\Lambda} := (\langle X_a, X_b \rangle)_{(a,b) \in [n]^2} \in \mathbb{R}^{n \times n}$, the observed Gram matrix. Therefore a natural relaxation is to consider the following estimator:
$$\widehat{B} := \arg\max_{B \in \mathcal{C}_K} \langle \widehat{\Lambda}, B \rangle. \qquad (9)$$
Notice that $\mathbb{E}[\widehat{\Lambda}] = \Lambda + \Gamma$ for $\Lambda := (\langle \nu_a, \nu_b \rangle)_{(a,b) \in [n]^2} \in \mathbb{R}^{n \times n}$, and $\Gamma := \mathbb{E}\big[(\langle E_a, E_b \rangle)_{(a,b) \in [n]^2}\big] = \mathrm{diag}\big((\mathrm{tr}(\mathrm{Var}(E_a)))_{1 \leq a \leq n}\big) \in \mathbb{R}^{n \times n}$. The following two results demonstrate that $\Lambda$ is the signal structure that leads the optimizations of (8) and (9) to recover $B^*$, whereas $\Gamma$ is a bias term that can hurt the process of recovery.

Proposition 2. There exists an absolute constant $c_0 > 1$ such that if $\rho^2(G, \mu, \delta) > c_0(6 + n/m)$ and $m \Delta^2(\mu) > 8 |\Gamma|_V$, then we have
$$\arg\max_{B \in \mathcal{C}^{\{0,1\}}_K} \langle \Lambda + \Gamma, B \rangle = B^* = \arg\max_{B \in \mathcal{C}_K} \langle \Lambda + \Gamma, B \rangle. \qquad (10)$$

This proposition shows that the $\widehat{B}$ estimator, as well as the K-means estimator, would recover the partition $G$ on the population Gram matrix if the variation semi-norm of $\Gamma$ were sufficiently small compared to the cluster separation. Notice that to recover the partition on the population version, we require the discriminating capacity to grow as fast as $1 + (n/m)^{1/2}$, instead of simply exceeding a constant as in Hypothesis 1. The following proposition demonstrates that if the condition on the variation semi-norm of $\Gamma$ is not met, $G$ may not even be recovered on the population version.

Proposition 3. There exist $G$, $\mu$, $\delta$ and $\Gamma$ such that $\rho^2(G, \mu, \delta) = +\infty$ but $m \Delta^2(\mu) < 2 |\Gamma|_V$, and
$$B^* \notin \arg\max_{B \in \mathcal{C}^{\{0,1\}}_K} \langle \Lambda + \Gamma, B \rangle \quad \text{and} \quad B^* \notin \arg\max_{B \in \mathcal{C}_K} \langle \Lambda + \Gamma, B \rangle. \qquad (11)$$

So Proposition 3 shows that even if the population clusters are perfectly discriminated, there is a configuration of the noise variances that makes it impossible to recover the right clustering by K-means. This shows that K-means may fail when the homoscedasticity assumption on the random variables is violated, and that it is important to correct for $\Gamma = \mathrm{diag}\big((\mathrm{tr}(\mathrm{Var}(E_a)))_{1 \leq a \leq n}\big)$. Suppose we produce such an estimator $\widehat{\Gamma}_{corr}$. Then subtracting $\widehat{\Gamma}_{corr}$ from $\widehat{\Lambda}$ can be interpreted as a correcting term, i.e. a way to de-bias $\widehat{\Lambda}$ as an estimator of $\Lambda$. Hence the previous results demonstrate the interest of studying the following semidefinite estimator of the projection matrix $B^*$:
$$\widehat{B}_{corr} := \arg\max_{B \in \mathcal{C}_K} \langle \widehat{\Lambda} - \widehat{\Gamma}_{corr}, B \rangle. \qquad (12)$$

In order to demonstrate the recovery of $B^*$ by this estimator, we introduce different quantitative measures of the "spread" of our stochastic variables that affect the quality of the recovery. By Hypothesis 1 there exist $\Sigma_1, \dots, \Sigma_n$ such that $\forall a \in [n]$, $X_a \sim \mathrm{subg}(\Sigma_a)$. Let
$$\sigma^2 := \max_{a \in [n]} |\Sigma_a|_{op}, \qquad V^2 := \max_{a \in [n]} |\Sigma_a|_F, \qquad \gamma^2 := \max_{a \in [n]} \mathrm{tr}(\Sigma_a). \qquad (13)$$

We now produce $\widehat{\Gamma}_{corr}$. Since there is no relation between the variances of the points in our model, there is very little hope of estimating $\mathrm{Var}(E_a)$. As for our quantity of interest $\mathrm{tr}(\mathrm{Var}(E_a))$, a form of volume, a rough estimation is challenging but possible. The estimator from [6] can be adapted to our context. For $(a, b) \in [n]^2$, let
$$V(a, b) := \max_{(c,d) \in ([n] \setminus \{a,b\})^2} \left| \left\langle X_a - X_b, \frac{X_c - X_d}{|X_c - X_d|_2} \right\rangle \right|,$$
$\widehat{b}_1(a) := \arg\min_{b \in [n] \setminus \{a\}} V(a, b)$ and $\widehat{b}_2(a) := \arg\min_{b \in [n] \setminus \{a, \widehat{b}_1(a)\}} V(a, b)$. Then let
$$\widehat{\Gamma}_{corr} := \mathrm{diag}\Big(\big(\langle X_a - X_{\widehat{b}_1(a)}, X_a - X_{\widehat{b}_2(a)} \rangle\big)_{a \in [n]}\Big). \qquad (14)$$
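Equation (14) translates directly into code. The sketch below is a naive implementation written from the displayed formulas (the function name is ours, not the released package's); it is meant to clarify the construction rather than to be efficient, since the nested maximization makes it roughly $O(n^4 p)$.

```python
import numpy as np

def gamma_correction(X):
    """Naive implementation of the diagonal correction (14).

    X: (n, p) array of observations. Returns a length-n array whose entries are
    <X_a - X_{b1(a)}, X_a - X_{b2(a)}>, where b1(a), b2(a) minimize V(a, .).
    """
    n = X.shape[0]
    diag = np.zeros(n)
    for a in range(n):
        V = np.full(n, np.inf)
        for b in range(n):
            if b == a:
                continue
            others = [i for i in range(n) if i != a and i != b]
            # V(a, b) = max over c, d outside {a, b} of
            # |<X_a - X_b, (X_c - X_d) / |X_c - X_d|_2>|
            best = 0.0
            for c in others:
                for d in others:
                    if c == d:
                        continue
                    u = X[c] - X[d]
                    norm = np.linalg.norm(u)
                    if norm == 0.0:
                        continue  # convention 0/0 := 0
                    best = max(best, abs(float(np.dot(X[a] - X[b], u / norm))))
            V[b] = best
        b1 = int(np.argmin(V))   # b1(a): minimizer of V(a, .)
        V[b1] = np.inf
        b2 = int(np.argmin(V))   # b2(a): second minimizer, excluding b1(a)
        diag[a] = float(np.dot(X[a] - X[b1], X[a] - X[b2]))
    return diag

# The full correction matrix is then Gamma_corr = np.diag(gamma_correction(X)).
```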
Proposition 4. Assume that $m > 2$. For absolute constants $c_6, c_7 > 0$, with probability larger than $1 - c_6/n$ we have
$$|\widehat{\Gamma}_{corr} - \Gamma|_\infty \leq c_7 \left[ \sigma^2 \log n + (\delta + \sigma \sqrt{\log n}) \gamma + \delta^2 \right]. \qquad (15)$$

So apart from the radius $\delta$ terms, which come from generous model assumptions, a proxy for $\Gamma$ is produced at a $\sigma^2 \log n$ rate that we could not expect to improve on. Nevertheless, this control on $\Gamma$ is key to attaining the optimal rates below. It is general and completely independent of the structure of $G$, as there is no relation between $G$ and $\Gamma$.

We are now ready to introduce this paper's main result: a condition on the separation between the cluster means sufficient for ensuring recovery of $B^*$ with high probability.

Theorem 1. Assume that $m > 2$. For absolute constants $c_1, c_2 > 0$, if
$$m \Delta^2(\mu) \geq c_2 \left[ \sigma^2 (n + m \log n) + V^2 \sqrt{n + m \log n} + \gamma (\sigma \sqrt{\log n} + \delta) + \delta^2 (\sqrt{n} + m) \right], \qquad (16)$$
then with probability larger than $1 - c_1/n$ we have $\widehat{B}_{corr} = B^*$, and therefore $\widehat{G}_{corr} = G$.

We call the right-hand side of (16) the separating rate. Notice that we can read two kinds of requirements coming from the separating rate: requirements on the radius $\delta$, and requirements on $\sigma^2$, $V^2$, $\gamma$ dependent on the distributions of the observations. It appears as if $\delta + \sigma \sqrt{\log n}$ can be interpreted as a geometrical width of our problem. If we ask that $\delta$ be of the same order as $\sigma \sqrt{\log n}$, a maximum gaussian deviation for $n$ variables, then all conditions on $\delta$ from (16) can be removed. Thus for convenience of the following discussion we will now assume $\delta \lesssim \sigma \sqrt{\log n}$.

How optimal is the result of Theorem 1? Notice that our result is adapted to anisotropy in the noise, but to discuss optimality it is easier to look at the isotropic scenario: $V^2 = \sqrt{p}\, \sigma^2$ and $\gamma^2 = p \sigma^2$. Therefore $\Delta^2(\mu)/\sigma^2$ represents a signal-to-noise ratio. For simplicity let us also assume that all groups have equal size, that is $|G_1| = \dots = |G_K| = m$, so that $n = mK$ and the sufficient condition (16) becomes
$$\frac{\Delta^2(\mu)}{\sigma^2} \gtrsim K + \log n + \sqrt{\frac{(K + \log n)\, p K}{n}}. \qquad (17)$$

Optimality. To discuss optimality, we distinguish between low- and high-dimensional setups. In the low-dimensional setup $p \lesssim n + m \log n$, we obtain the following condition:
$$\frac{\Delta^2(\mu)}{\sigma^2} \gtrsim K + \log n. \qquad (18)$$
Discriminating with high probability between $n$ observations from two gaussians in dimension 1 would require a separating rate of at least $\sigma^2 \log n$. This implies that when $K \lesssim \log n$, our result is minimax. Otherwise, to our knowledge the best clustering result on approximating mixture centers is from [16], under the condition that $\Delta^2(\mu)/\sigma^2 \gtrsim K^2$. Furthermore, the $K \gtrsim \log n$ regime is known in the stochastic-block-model community as a hard regime where a gap is surmised to exist between the minimal information-theoretic rate and the minimal achievable computational rate (see for example [7]).

In the high-dimensional setup $p \gtrsim n + m \log n$, condition (17) becomes:
$$\frac{\Delta^2(\mu)}{\sigma^2} \gtrsim \sqrt{\frac{(K + \log n)\, p K}{n}}. \qquad (19)$$
There are few information-theoretic bounds for high-dimensional clustering. Recently, Banks, Moore, Vershynin, Verzelen and Xu (2017) [3] proved a lower bound for Gaussian mixture clustering detection: namely, they require a separation of order $\sqrt{K (\log K)\, p / n}$. When $K \lesssim \log n$, our condition is only different in that it replaces $\log(K)$ by $\log(n)$, a price to pay for going from detecting the clusters to exactly recovering the clusters. Otherwise, when $K$ grows faster than $\log n$, there might exist a gap between the minimal possible rate and the achievable one, as discussed previously.
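Before turning to the effective dimension, note that for moderate $n$ the relaxed program (12) can be handed to an off-the-shelf conic solver. The sketch below uses cvxpy with the SCS solver, which is our illustration choice and not the paper's implementation (Section 5 describes an ADMM solver instead); the function name is hypothetical.

```python
import cvxpy as cp
import numpy as np

def corrected_sdp(X, gamma_diag, K):
    """Sketch of the corrected relaxation (12): maximize <Lambda_hat - Gamma_corr, B>
    over the convex set C_K (nonnegative, symmetric PSD, trace K, B 1_n = 1_n)."""
    gram = X @ X.T                          # observed Gram matrix Lambda_hat
    target = gram - np.diag(gamma_diag)     # de-biased criterion Lambda_hat - Gamma_corr
    n = gram.shape[0]
    B = cp.Variable((n, n), PSD=True)       # enforces symmetry and positive semidefiniteness
    constraints = [
        B >= 0,                             # entry-wise nonnegativity
        cp.trace(B) == K,                   # tr(B) = K
        B @ np.ones(n) == np.ones(n),       # B 1_n = 1_n
    ]
    cp.Problem(cp.Maximize(cp.trace(target @ B)), constraints).solve(solver=cp.SCS)
    return B.value
```

A clustering can then be read off the (approximately block-constant) returned matrix, for instance by grouping nearly identical rows.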
Adaptation to the effective dimension. We can analyze condition (16) further by introducing an effective dimension $r^*$, measuring the largest volume repartition of our variance-bounding matrices $\Sigma_1, \dots, \Sigma_n$. We will show that our estimator adapts to this effective dimension. Let
$$r^* := \frac{\max_{a \in [n]} \mathrm{tr}(\Sigma_a)}{\max_{a \in [n]} |\Sigma_a|_{op}} = \frac{\gamma^2}{\sigma^2}. \qquad (20)$$
$r^*$ can also be interpreted as a form of global effective rank of the matrices $\Sigma_a$. Indeed, define $R_e(\Sigma) := \mathrm{tr}(\Sigma)/|\Sigma|_{op}$; then we have $r^* \leq \max_{a \in [n]} R_e(\Sigma_a) \leq \max_{a \in [n]} \mathrm{rank}(\Sigma_a) \leq p$. Now using $V^2 \leq \sqrt{r^*}\, \sigma^2$ and $\gamma = \sqrt{r^*}\, \sigma$, condition (16) can be written as
$$\frac{\Delta^2(\mu)}{\sigma^2} \gtrsim K + \log n + \sqrt{\frac{(K + \log n)\, r^* K}{n}}. \qquad (21)$$
By comparing this equation to (17), notice that $r^*$ is in the place of $p$, indeed playing the role of an effective dimension for the problem. This shows that our estimator adapts to this effective dimension, without the use of any dimension reduction step. In consequence, equation (21) distinguishes between an actual high-dimensional setup $r^* \gtrsim n + m \log n$ and a "low"-dimensional setup $r^* \lesssim n + m \log n$, under which, regardless of the actual value of $p$, our estimator recovers under the near-minimax condition of (18).

This informs on the effect of the correcting term $\widehat{\Gamma}_{corr}$ in the theorem above when $r^* \gtrsim n + m \log n$. The un-corrected version of the semidefinite program (9) has a leading separating rate of $\gamma^2/m = \sigma^2 r^*/m$, whereas with the $\widehat{\Gamma}_{corr}$ correction, (21) has a leading separating factor smaller than $\sigma^2 \sqrt{(K + \log n)\, r^*/m} = \sigma^2 \sqrt{(n + m \log n)\, r^*}/m$. This proves that in a high-dimensional setup, our correction improves the separating rate by at least a factor $\sqrt{r^*/(n + m \log n)}$.

## 4 Adaptation to the unknown number of groups K

It is rarely the case that $K$ is known, but we can proceed without it. We produce an estimator adaptive to the number of groups $K$: for $\widehat{\kappa} \in \mathbb{R}_+$, we now study the following adaptive estimator:
$$\widetilde{B}_{corr} := \arg\max_{B \in \mathcal{C}} \langle \widehat{\Lambda} - \widehat{\Gamma}_{corr}, B \rangle - \widehat{\kappa}\, \mathrm{tr}(B). \qquad (22)$$

Theorem 2. Suppose that $m > 2$ and that (16) is satisfied. For absolute constants $c_3, c_4, c_5 > 0$, suppose that the following condition on $\widehat{\kappa}$ is satisfied:
$$c_4 \left[ V^2 \sqrt{n} + \sigma^2 n + \gamma (\sigma \sqrt{\log n} + \delta) + \delta^2 \sqrt{n} \right] < c_5\, \widehat{\kappa} < m \Delta^2(\mu); \qquad (23)$$
then we have $\widetilde{B}_{corr} = B^*$ with probability larger than $1 - c_3/n$.

Notice that condition (23) essentially requires $\widehat{\kappa}$ to sit between $m \Delta^2(\mu)$ and some components of the right-hand side of (16). So under (23), the results from the previous section apply to the adaptive estimator $\widetilde{B}_{corr}$ as well, and this shows that it is not necessary to know $K$ in order to perform well in recovering $G$. Finding an optimized, data-driven parameter $\widehat{\kappa}$ using some form of cross-validation is outside the scope of this paper.

## 5 Numerical experiments

We illustrate our method on simulated Gaussian data in two challenging, high-dimensional setup experiments for comparing clustering estimators. Our sample of $n = 100$ points is drawn from $K = 5$ identically-sized, perfectly discriminated, non-isovolumic clusters of Gaussians: that is, we have $\forall k \in [K], \forall a \in G_k$, $E_a \sim \mathcal{N}(0, \Sigma_k)$, with $|G_1| = \dots = |G_K| = 20$. The distributions are chosen to be isotropic, and the ratio between the lowest and the highest standard deviation is 1 to 10. We draw points of an $\mathbb{R}^p$ space in two different scenarios. In (S1), for a given space dimension $p = 500$ and a fixed isotropic noise level, we report the algorithms' performances as the signal-to-noise ratio $\Delta^2(\mu)/\sigma^2$ is increased from 1 to 15. In (S2) we impose a fixed signal-to-noise ratio and observe the algorithms' decay in performance as the space dimension $p$ is increased from $10^2$ to $10^5$ (logarithmic scale). All reported points of the simulated space represent a hundred simulations, and indicate a median value with asymmetric standard deviations in the form of error bars.
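One run of such an experiment can be generated along the following lines. This is an illustrative sketch only: the paper does not specify the exact cluster means used, so the placement of the centers, the helper name and the default values below are our own choices, constrained to match the description above (equal cluster sizes, isotropic noise, standard deviations ranging from 1 to 10, prescribed signal-to-noise ratio).

```python
import numpy as np

def simulate(n=100, K=5, p=500, snr=8.0, seed=0):
    """Draw n points from K equal-sized isotropic Gaussian clusters in R^p.

    Standard deviations range from 1 to 10 across clusters (non-isovolumic noise),
    and the means are placed so that Delta^2(mu)/sigma^2 equals `snr`, with
    sigma^2 the largest operator norm of the covariances.
    """
    rng = np.random.default_rng(seed)
    m = n // K
    stds = np.linspace(1.0, 10.0, K)           # heteroscedastic, isotropic clusters
    sigma2 = stds.max() ** 2                   # sigma^2 = max_a |Sigma_a|_op
    # Means on scaled canonical directions: |mu_k - mu_l|_2^2 = 2 * scale^2 for k != l.
    scale = np.sqrt(snr * sigma2 / 2.0)
    mus = scale * np.eye(K, p)                 # K cluster centers in R^p
    labels = np.repeat(np.arange(K), m)
    X = np.vstack([mus[k] + stds[k] * rng.standard_normal((m, p)) for k in range(K)])
    return X, labels
```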
Solving for the estimator $\widehat{B}_{corr}$ is a hard problem as $n$ grows. For this task we implemented an ADMM solver following the work of Boyd et al. [4], with multiple stopping criteria including a fixed number of iterations $T = 1000$. The complexity of the optimization is then roughly $O(T n^3)$. For reference, we compare the recovering capacities of $\widehat{G}_{corr}$, labeled pecok in Figure 1, with other classical clustering algorithms. We chose three different but standard clustering procedures: Lloyd's K-means algorithm [13] with a thousand K-means++ initializations of [1] (although in scenario (S2) the algorithm is too slow to converge as $p$ grows, so we do not report it), Ward's method for hierarchical clustering [23], and the low-rank clustering algorithm applied to the Gram matrix, a spectral method appearing in McSherry [15]. Lastly we include the CORD algorithm from Bunea et al. [5]. We measure the performances of the estimators by computing the adjusted mutual information (see for instance [22]) between the truth and its estimate.

In the two experiments, the results of $\widehat{G}_{corr}$ are markedly better than those of the other methods. Scenario (S1) shows it can achieve exact recovery with a smaller signal-to-noise ratio than its competitors, whereas scenario (S2) shows its performance starts to decay much later than the other methods as the space dimension is increased exponentially.

Table 1 summarizes the simulations in a different light: for different parameter values on each line, we count the number of experiments (out of a hundred) that had an adjusted mutual information score equal to 0.9 or higher. This accounts for exact recoveries, or approximate recoveries that reasonably reflected the underlying truth. In this table it is also evident that $\widehat{G}_{corr}$ performs uniformly better, be it for exact or approximate recovery: it manages to recover the underlying truth much sooner in terms of signal-to-noise ratio, and for a given signal-to-noise ratio it represents the truth better as the embedding dimension increases. Lastly, Table 1 provides the median computing time in seconds for each method over the entire experiment. $\widehat{G}_{corr}$ comes with important computation times because $\widehat{\Gamma}_{corr}$ is very costly to compute. Our method is computationally intensive, but it is of polynomial order. The solving of semidefinite programs is a rapidly developing field of operational research, and even though we used the classical ADMM method of [4], which proved effective, this instance of the program could certainly have seen a more profitable implementation in the hands of a domain expert. All of the compared methods have a very hard time reaching high sample sizes $n$ in the high-dimensional context. The Python 3 implementation of the method used is available in open access at martinroyer/pecok [18].

[Figure 1: two panels. Left, Scenario (S1): adjusted mutual information adj_mi(G, Ĝ) versus signal-to-noise ratio Δ²(µ)/σ² for kmeans++, pecok, lowrank-spectral, hierarchical and cord. Right, Scenario (S2): adj_mi(G, Ĝ) versus space dimension p for pecok, lowrank-spectral, hierarchical and cord.]

Figure 1: Performance comparison for clustering estimators and $\widehat{G}_{corr}$, labeled pecok4 in reference to [6]. The adjusted mutual information equals 1 when the clusterings are identical, 0 when they are independent.
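The adjusted mutual information used as the score in Figure 1 and Table 1 is available in scikit-learn; a minimal scoring loop for two of the baselines might look as follows (the `simulate` helper is the illustrative sketch from Section 5, and the released pecok package can be scored the same way).

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_mutual_info_score

X, labels = simulate(n=100, K=5, p=500, snr=8.0)   # illustrative data from the sketch above

predictions = {
    # Lloyd's algorithm with k-means++ restarts (the paper uses a thousand restarts).
    "kmeans++": KMeans(n_clusters=5, n_init=100).fit_predict(X),
    # Ward's hierarchical clustering.
    "hierarchical": AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X),
}
for name, pred in predictions.items():
    print(name, adjusted_mutual_info_score(labels, pred))   # 1.0 means exact recovery
```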
| | hierarchical | kmeans++ | lowrank-spectral | pecok4 | cord |
|---|---|---|---|---|---|
| (S1) 90% SNR=4.75 | 0 | 0 | 0 | 51 | 0 |
| (S1) 90% SNR=6 | 0 | 0 | 0 | 100 | 0 |
| (S1) 90% SNR=7.25 | 18 | 0 | 12 | 100 | 26 |
| (S1) 90% SNR=8.5 | 100 | 0 | 100 | 100 | 76 |
| (S1) med time (s) | 0.01 | 2.76 | 0.23 | 1.84 (+18.92)¹ | 0.76 |
| (S2) 90% dim=10² | 100 | / | 100 | 100 | 94 |
| (S2) 90% dim=10³ | 0 | / | 0 | 100 | 31 |
| (S2) 90% dim=5·10³ | 0 | / | 0 | 100 | 0 |
| (S2) 90% dim=10⁴ | 0 | / | 0 | 49 | 0 |
| (S2) med time (s) | 0.14 | / | 0.19 | 1.94 (+68.12)¹ | 0.68 |

Table 1: Approximate recovery results for experiments (S1) and (S2): number of experiments, out of a hundred, that had a score above 90%, and computing times over the experiments.

¹The median time in parentheses is the time to compute $\widehat{\Gamma}_{corr}$, as opposed to the main time for performing the SDP. Indeed $\widehat{\Gamma}_{corr}$ is very time-consuming: its cost is roughly $O(n^4 p)$. It must be noted that much faster alternatives, such as the one presented in [6], perform equally well (there is no significant difference in performance) for the recovery of $G$, but this is outside the scope of this paper.

## 6 Conclusion

In this paper we analyzed a new semidefinite positive algorithm for point clustering within the context of a flexible probabilistic model, and exhibited the key quantities that guarantee non-asymptotic exact recovery. It involves an essential bias-removing correction that significantly improves the recovery rate in the high-dimensional setup. Hence we showed the estimator to be near-minimax and adapted to an effective dimension of the problem. We also demonstrated that our estimator can be optimally adapted to a data-driven choice of $K$, with a single tuning parameter. Lastly we illustrated on high-dimensional experiments that our approach is empirically stronger than other classical clustering methods. As for the $\widehat{\Gamma}_{corr}$ correction step of the algorithm, it can be interpreted as an independent, denoising step for the Gram matrix, and we recommend using such a procedure wherever the probabilistic framework we developed seems appropriate.

In practice, it is generally more realistic to look at approximate clustering results, but in this work we chose the point of view of exact clustering for investigating the theoretical properties of our estimator. Our experimental results provide evidence that this choice is not restrictive, i.e. that our findings translate very well to approximate recovery. We expect our results to hold with similar speeds for approximate clustering, up to some logarithmic terms. One could think of adapting works on community detection by Guédon and Vershynin [12] based on Grothendieck's inequality, or work by Fei and Chen [10] from the stochastic-block-model community on similar semidefinite programs. In fact, referring to a detection bound by Banks, Moore, Vershynin, Verzelen and Xu (2017) [3], our only margin for improvement on the separation speed is to transform the logarithmic factor $\log n$ into $\log K$ when the number of clusters $K$ is of order $O(\log n)$; otherwise the problem is rather open.

As for the robustness of this procedure, a few aspects are to be considered. The algorithm we studied solves a convexified objective, therefore its performance is empirically more stable than that of a non-convex objective, especially in the high-dimensional context. In this work we also benefit from a permissive probabilistic framework that allows for multiple deviations from the classical gaussian cluster model, and this comes at no price in terms of the performance of our estimator.
Points from a same cluster are allowed to have significantly different means or fluctuations, and the results for exact recovery with high probability are unchanged, near-minimax and adaptive. Likewise, on simulated data the estimator proves the most efficient in exact as well as approximate recovery.

Acknowledgements

This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the "Investissement d'Avenir" program, through the "IDI 2015" project funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02. It is also supported by the CNRS PICS funding HighClust. We thank Christophe Giraud for a shrewd, unwavering thesis direction.

References

[1] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027-1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[2] M. Azizyan, A. Singh, and L. Wasserman. Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS '13, pages 2139-2147, USA, 2013. Curran Associates Inc.

[3] J. Banks, C. Moore, N. Verzelen, R. Vershynin, and J. Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. arXiv e-prints arXiv:1607.05222, July 2016.

[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1-122, January 2011.

[5] F. Bunea, C. Giraud, and X. Luo. Minimax optimal variable clustering in G-models via CORD. arXiv preprint arXiv:1508.01939, 2015.

[6] F. Bunea, C. Giraud, M. Royer, and N. Verzelen. PECOK: a convex optimization approach to variable clustering. arXiv e-prints arXiv:1606.05100, June 2016.

[7] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. Journal of Machine Learning Research, 17(27):1-57, 2016.

[8] S. Chrétien, C. Dombry, and A. Faivre. A semi-definite programming approach to low dimensional embedding for unsupervised clustering. CoRR, abs/1606.09190, 2016.

[9] S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. J. Mach. Learn. Res., 8:203-226, May 2007.

[10] Y. Fei and Y. Chen. Exponential error rates of SDP for block models: Beyond Grothendieck's inequality. arXiv e-prints arXiv:1705.08391, May 2017.

[11] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611-631, 2002.

[12] O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. CoRR, abs/1411.4686, 2014.

[13] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theor., 28(2):129-137, September 1982.

[14] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281-297, Berkeley, Calif., 1967. University of California Press.

[15] F. McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, FOCS '01, pages 529-, Washington, DC, USA, 2001. IEEE Computer Society.
[16] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures with k-means. In 2016 IEEE Information Theory Workshop (ITW), pages 211-215, September 2016.

[17] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM J. on Optimization, 18(1):186-205, February 2007.

[18] M. Royer. ADMM implementation of PECOK. https://github.com/martinroyer/pecok, October 2017.

[19] H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., 1:801-804, 1956.

[20] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Chapter 5 of: Compressed Sensing, Theory and Applications. Cambridge University Press, 2012.

[21] N. Verzelen and E. Arias-Castro. Detection and feature selection in sparse mixture models. arXiv e-prints arXiv:1405.1478, May 2014.

[22] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res., 11:2837-2854, December 2010.

[23] J. H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236-244, 1963.

[24] B. Yan and P. Sarkar. Convex relaxation for community detection with covariates. arXiv e-prints arXiv:1607.02675, July 2016.