# Data-Driven Clustering via Parameterized Lloyd's Families

Maria-Florina Balcan (ninamf@cs.cmu.edu), Travis Dick (tdick@cs.cmu.edu), and Colin White (crwhite@cs.cmu.edu)
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

*32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.*

**Abstract.** Clustering points in metric spaces is a long-studied area of research. It has seen a multitude of work both theoretically, in understanding the approximation guarantees achievable for objective functions such as k-median and k-means, and experimentally, in finding the fastest algorithms and seeding procedures for Lloyd's algorithm. The performance of a given clustering algorithm depends on the specific application at hand, and this may not be known up front. For example, a typical instance may vary depending on the application, and different clustering heuristics perform differently depending on the instance. In this paper, we define an infinite family of algorithms generalizing Lloyd's algorithm, with one parameter controlling the initialization procedure and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal clustering algorithm from the class. We show the best parameters vary significantly across datasets such as MNIST, CIFAR, and mixtures of Gaussians. Our learned algorithms never perform worse than k-means++, and on some datasets we see significant improvements.

## 1 Introduction

Clustering is a fundamental problem in machine learning with applications in many areas including text analysis, transportation networks, and social networks. The high-level goal of clustering is to divide a dataset into natural subgroups. For example, in text analysis we may want to divide documents by topic, and in social networks we might want to find communities. A common approach to clustering is to set up an objective function and then approximately find the optimal solution according to that objective. There has been a wealth of both theoretical and empirical research on clustering using this approach (Gonzalez [1985], Charikar et al. [1999], Arya et al. [2004], Arthur and Vassilvitskii [2007], Kaufman and Rousseeuw [2009], Ostrovsky et al. [2012], Byrka et al. [2015], Ahmadian et al. [2017]).

The most popular method in practice for clustering is local search, where we start with k centers and iteratively make incremental improvements until a local optimum is reached. For example, Lloyd's method (sometimes called k-means) (Lloyd [1982]) and k-medoids (Friedman et al. [2001], Cohen et al. [2016]) are two popular local search algorithms. There are multiple decisions an algorithm designer must make when using a local search algorithm. First, the algorithm designer must decide how to seed local search, i.e., how the algorithm chooses the k initial centers.
There is a large body of work on seeding algorithms, since the initial choice of centers can have a large effect on both the quality of the output clustering and the time it takes the algorithm to converge (Higgs et al. [1997], Pena et al. [1999], Arai and Barakbah [2007]). The best seeding method often depends on the specific application at hand. For example, a typical problem instance in one setting may have significantly different properties from that in another, causing some seeding methods to perform better than others. Second, the algorithm designer must decide on an objective function for the local search phase (k-means, k-median, etc.). For some applications, there is an obvious choice. For instance, if the application is Wi-Fi hotspot location, then the explicit goal is to minimize the k-center objective function. For many other applications, such as clustering communities in a social network, the goal is to find clusters which are close to an unknown target clustering, and we may use an objective function for local search in the hope that approximately minimizing the chosen objective will produce clusterings which are close to the target clustering (in terms of the number of misclassified points). As before, the best objective function for local search may depend on the specific application.

In this paper, we show positive theoretical and empirical results for learning the best initialization and local search procedures over a large family of algorithms. We take a transfer learning approach in which we assume there is an unknown distribution over problem instances corresponding to our application, and the goal is to use experience from the early instances to perform well on the later instances. For example, if our application is clustering facilities in a city, we would look at a sample of cities with existing optimally placed facilities, use this information to find the empirically best seeding/local search pair from an infinite family, and then use this pair to cluster facilities in new cities.

**(α, β)-Lloyds++.** We define an infinite family of algorithms generalizing Lloyd's method, with two parameters α and β. Our algorithms have two phases: a seeding phase that finds $k$ initial centers (parameterized by α), and a local search phase which uses Lloyd's method to converge to a local optimum (parameterized by β). In the seeding phase, each point $v$ is sampled with probability proportional to $d_{\min}(v, C)^{\alpha}$, where $C$ is the set of centers chosen so far and $d_{\min}(v, C) = \min_{c \in C} d(v, c)$. Then Lloyd's method is used to converge to a local minimum of the $\ell_\beta$ objective. By ranging $\alpha \in [0, \infty) \cup \{\infty\}$ and $\beta \in [1, \infty) \cup \{\infty\}$, we define our infinite family of algorithms, which we call (α, β)-Lloyds++. Setting $\alpha = \beta = 2$ corresponds to the k-means++ algorithm (Arthur and Vassilvitskii [2007]). The seeding phase is a spectrum between random seeding ($\alpha = 0$) and farthest-first traversal ($\alpha = \infty$; Gonzalez [1985], Dasgupta and Long [2005]), and the Lloyd's step is able to optimize over common objectives including k-median ($\beta = 1$), k-means ($\beta = 2$), and k-center ($\beta = \infty$). We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal clustering algorithm from our family.
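As a concrete illustration of the seeding phase, here is a minimal Python sketch of $d^{\alpha}$-sampling for finite α (the function name, array-based representation, and uniform choice of the first center are our own; the paper defines the procedure formally in Algorithm 1 of Section 4, and the α = ∞ case would be handled separately as farthest-first traversal).

```python
import numpy as np

def d_alpha_seeding(X, k, alpha, rng=None):
    """Choose k initial centers by d^alpha-sampling.

    X: (n, d) array of points; k: number of centers; alpha: seeding exponent.
    Each new center is drawn with probability proportional to the point's
    minimum distance to the centers chosen so far, raised to the power alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    centers = [int(rng.integers(n))]          # first center uniformly at random
    d_min = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(1, k):
        weights = d_min ** alpha
        total = weights.sum()
        probs = weights / total if total > 0 else np.full(n, 1.0 / n)
        c = int(rng.choice(n, p=probs))
        centers.append(c)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[c], axis=1))
    return centers
```

Setting `alpha = 2` recovers the seeding step of k-means++, while `alpha = 0` reduces to uniformly random seeding.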
**Theoretical analysis.** In Section 4, we prove that $O\!\left(\frac{1}{\epsilon^2}\min(T, k)\log n\right)$ samples are sufficient to guarantee that the empirically optimal parameters $(\hat{\alpha}, \hat{\beta})$ have expected cost at most ε higher than that of the optimal parameters $(\alpha^*, \beta^*)$ over the distribution, with high probability over the random sample, where $n$ is the size of the clustering instances and $T$ is the maximum number of Lloyd's iterations. The key challenge is that for any clustering instance, the cost of the output clustering is not even a continuous function of α or β, since a slight tweak in the parameters may lead to a completely different run of the algorithm. We overcome this obstacle by showing a strong bound on the expected number of discontinuities of the cost function, which requires delicate reasoning about the structure of the decision points in the execution of the algorithm; in other words, for a given clustering instance, we must reason about the total number of outcomes the algorithm can produce over the full range of parameters. This allows us to use Rademacher complexity, a distribution-specific technique for achieving uniform convergence. Next, we complement our sample complexity result with a computational efficiency result. Specifically, we give a novel meta-algorithm which efficiently finds a near-optimal value $\hat{\alpha}$ with high probability. The high-level idea of our algorithm is to run depth-first search over the execution tree of the algorithm, where a node in the tree represents a state of the algorithm and an edge represents a decision point. A key step in our meta-algorithm is to iteratively solve for the decision points of the algorithm, which is itself nontrivial since the equations governing the decision points have no closed-form solution. We show the equations have a certain structure which allows us to binary search through the range of parameters to find the decision points.

**Experiments.** We give a thorough experimental analysis of our family of algorithms by evaluating their performance on a number of different real-world and synthetic datasets, including MNIST, CIFAR-10, CNAE-9, and mixtures of Gaussians. In each case, we create clustering instances by choosing subsets of the labels. For example, we look at an instance of MNIST with digits {0, 1, 2, 3, 4}, and also an instance with digits {5, 6, 7, 8, 9}. We show that the optimal parameters transfer from one instance to the other. Across datasets, no single parameter setting is nearly optimal for all of them, and for some datasets, the best algorithm from the (α, β)-Lloyds++ family performs significantly better than known algorithms such as k-means++ and farthest-first traversal.

## 2 Related Work

**Lloyd's method for clustering.** The iterative local search method for clustering, known as Lloyd's algorithm and sometimes simply called k-means, is one of the most popular algorithms for k-means clustering (Lloyd [1982]), and improvements are still being found (Max [1960], MacQueen et al. [1967], Dempster et al. [1977], Pelleg and Moore [1999], Kanungo et al. [2002], Kaufman and Rousseeuw [2009]). Many different initialization approaches have been proposed (Higgs et al. [1997], Pena et al. [1999], Arai and Barakbah [2007]). When $d^2$-sampling is used to find the initial $k$ centers, the algorithm is known as k-means++, and its approximation guarantee is provably $O(\log k)$ (Arthur and Vassilvitskii [2007]).

**Learning to learn.** A recent paper by Balcan et al.
[2017] shows positive results for learning linkage-based algorithms with pruning over a distribution over clustering instances, although no empirical study is provided there. There are several related models for learning the best representation and for transfer learning in clustering. Ashtiani and Ben-David [2015] show how to learn an instance-specific embedding for a clustering instance such that k-means performs well over the embedding. There has also been work on related questions for transfer learning on unlabeled data and unsupervised tasks (Raina et al. [2007], Yang et al. [2009], Jiang and Chung [2012]). To our knowledge, there is no prior work on learning the best clustering objective for a specific distribution over problem instances, given labeled clustering instances for training.

## 3 Preliminaries

**Clustering.** A clustering instance $\mathcal{V} = (V, d, k)$ consists of a point set $V$ of size $n$, a distance metric $d$ (such as the Euclidean distance in $\mathbb{R}^d$), and a desired number of clusters $1 \le k \le n$. A clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ is a $k$-partitioning of $V$. Often in practice, clustering is carried out by approximately minimizing an objective function (which maps each clustering to a nonnegative value). Common objective functions such as k-median and k-means come from the $\ell_p$ family, where each cluster $C_i$ is assigned a center $c_i$ and

$$\mathrm{cost}(\mathcal{C}) = \left(\sum_{i=1}^{k}\sum_{v \in C_i} d(v, c_i)^p\right)^{1/p}$$

(k-median and k-means correspond to $p = 1$ and $p = 2$, respectively). There are two distinct goals for clustering depending on the application. For some applications, such as computing facility locations, the algorithm designer's only goal is to find the best centers, and the actual partition $\{C_1, \ldots, C_k\}$ is not needed. For many other applications, such as clustering documents by subject, clustering proteins by function, or discovering underlying communities in a social network, there exists an unknown target clustering $\mathcal{C}^* = \{C_1^*, \ldots, C_k^*\}$, and the goal is to output a clustering $\mathcal{C}$ which is close to $\mathcal{C}^*$. Formally, we say $\mathcal{C}$ and $\mathcal{C}^*$ are $\epsilon$-close if there exists a permutation $\sigma$ such that $\sum_{i=1}^{k} |C_i \setminus C^*_{\sigma(i)}| \le \epsilon n$. For these applications, the algorithm designer chooses an objective function while hoping that minimizing it will lead to a clustering that is close to the target clustering. In this paper, we focus on the cost function set to the distance to the target clustering; however, our analysis holds for an abstract cost function $\mathrm{cost}$, which can be set to an objective function or any other well-defined measure of cost.

**Algorithm configuration.** In this work, we assume that there exists an unknown, application-specific distribution $\mathcal{D}$ over a set of clustering instances such that $|V| \le n$ for each instance $\mathcal{V}$. We suppose there is a cost function that measures the quality of a clustering of each instance. As discussed in the previous paragraph, we can set the cost function to be the expected Hamming distance of the returned clustering to the target clustering, the cost of an $\ell_p$ objective, or any other function. The learner's goal is to find the parameters α and β that approximately minimize the expected cost with respect to the distribution $\mathcal{D}$. Our main technical results bound the intrinsic complexity of the class of (α, β)-Lloyds++ clustering algorithms, which leads to generalization guarantees through standard Rademacher complexity arguments (Bartlett and Mendelson [2002], Koltchinskii [2001]). This implies that the empirically optimal parameters are also nearly optimal in expectation.
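To make the two notions of cost above concrete, the following is a small sketch of our own (not code from the paper) that computes the $\ell_p$ objective of a clustering given its centers, and the distance of a clustering to a target clustering minimized over permutations of the cluster labels, using the Hungarian algorithm from SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lp_cost(X, labels, centers, p):
    """l_p objective: (sum over points of d(v, c_i)^p)^(1/p),
    where c_i is the center of the cluster containing point v."""
    dists = np.linalg.norm(X - centers[labels], axis=1)
    return float((dists ** p).sum() ** (1.0 / p))

def distance_to_target(labels, target, k):
    """Fraction of misclassified points, minimized over permutations sigma:
    (1/n) * min_sigma sum_i |C_i \\ C*_sigma(i)|."""
    n = len(labels)
    # overlap[i, j] = number of points placed in cluster i with target label j
    overlap = np.zeros((k, k), dtype=int)
    for c, t in zip(labels, target):
        overlap[c, t] += 1
    # sum_i |C_i \ C*_sigma(i)| = n - sum_i overlap[i, sigma(i)], so minimizing
    # it over sigma is a maximum-weight assignment problem on the overlaps.
    row, col = linear_sum_assignment(-overlap)
    return (n - overlap[row, col].sum()) / n
```

A clustering is ε-close to the target exactly when `distance_to_target` returns a value of at most ε.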
## 4 (α, β)-Lloyds++

In this section, we define an infinite family of algorithms generalizing Lloyd's algorithm, with one parameter controlling the initialization procedure and another parameter controlling the local search procedure. Our main results bound the intrinsic complexity of this family of algorithms (Theorems 4 and 5) and lead to sample complexity results guaranteeing that the empirically optimal parameters over a sample are close to the optimal parameters over the unknown distribution. We measure optimality in terms of agreement with the target clustering. We also show theoretically that no parameters are optimal over all clustering applications (Theorem 2). Finally, we give an efficient algorithm for learning the best initialization parameter (Theorem 7).

Our family of algorithms is parameterized by choices of $\alpha \in [0, \infty) \cup \{\infty\}$ and $\beta \in [1, \infty) \cup \{\infty\}$. Each choice of (α, β) corresponds to one local search algorithm. A summary of the algorithm is as follows (see Algorithm 1). The algorithm has two phases. The goal of the first phase is to output $k$ initial centers. Each center is iteratively chosen by picking a point with probability proportional to its minimum distance to all centers picked so far, raised to the power α. The second phase is an iterative two-step procedure similar to Lloyd's method, where the first step is to create a Voronoi partitioning of the points induced by the current set of centers, and then a new set of centers is chosen by computing the $\ell_\beta$ mean of each Voronoi tile.

**Algorithm 1: (α, β)-Lloyds++ clustering.**
Input: instance $\mathcal{V} = (V, d, k)$, parameters α and β.

Phase 1: choosing initial centers with $d^\alpha$-sampling.
1. Initialize $C = \emptyset$ and draw a vector $Z = (z_1, \ldots, z_k)$ from $[0, 1]^k$ uniformly at random.
2. For each $t = 1, \ldots, k$:
   (a) Partition $[0, 1]$ into $n$ intervals, where there is an interval $I_{v_i}$ for each $v_i$ with size equal to the probability of choosing $v_i$ during $d^\alpha$-sampling in round $t$ (see Figure 1).
   (b) Let $c_t$ denote the point such that $z_t \in I_{c_t}$, and add $c_t$ to $C$.

Phase 2: Lloyd's algorithm.
3. Set $C' = \emptyset$. Let $\{C_1, \ldots, C_k\}$ denote the Voronoi tiling of $V$ induced by the centers $C$.
4. For each $1 \le i \le k$, compute $\mathrm{argmin}_{x \in V} \sum_{v \in C_i} d(x, v)^\beta$ and add it to $C'$.
5. If $C' \neq C$, set $C = C'$ and go to step 3.

Output: centers $C$ and the clustering induced by $C$.

Our goal is to find parameters which return clusterings close to the ground truth in expectation. Setting $\alpha = \beta = 2$ corresponds to the k-means++ algorithm. The seeding phase is a spectrum between random seeding ($\alpha = 0$) and farthest-first traversal ($\alpha = \infty$), and the Lloyd's phase can optimize common clustering objectives including k-median ($\beta = 1$), k-means ($\beta = 2$), and k-center ($\beta = \infty$).
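Putting the two phases together, the following is a minimal Python sketch of (α, β)-Lloyds++ for finite α and β, reusing the `d_alpha_seeding` function sketched in the introduction. As in Algorithm 1, the new center of each Voronoi tile is chosen from among the points of $V$ themselves; the helper names, the `max_iter` cutoff, and the handling of empty tiles are our own additions.

```python
import numpy as np

def lloyds_alpha_beta(X, k, alpha, beta, max_iter=100, rng=None):
    """(alpha, beta)-Lloyds++ following Algorithm 1 (finite alpha and beta).

    Phase 1: d^alpha-sampling seeding (see d_alpha_seeding above).
    Phase 2: alternate Voronoi assignment and l_beta center updates until the
    set of centers stops changing (or max_iter is reached).
    """
    centers = d_alpha_seeding(X, k, alpha, rng=rng)
    for _ in range(max_iter):
        # Voronoi assignment: each point joins its nearest current center.
        dists = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # l_beta center update: for each tile, pick the point of V minimizing
        # the sum of beta-th powers of distances to the tile's points.
        new_centers = []
        for i in range(k):
            tile = np.where(labels == i)[0]
            if len(tile) == 0:                 # empty tile: keep the old center
                new_centers.append(centers[i])
                continue
            cost = (np.linalg.norm(X[:, None, :] - X[tile][None, :, :], axis=2)
                    ** beta).sum(axis=1)
            new_centers.append(int(cost.argmin()))
        if new_centers == centers:
            break
        centers = new_centers
    # Final assignment induced by the converged centers.
    labels = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2).argmin(axis=1)
    return centers, labels
```

With `alpha = beta = 2` this behaves like a medoid-constrained variant of k-means++ followed by Lloyd's iterations, since Algorithm 1 restricts every center to be a point of $V$.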
We start with two structural results about the family of (α, β)-Lloyds++ clustering algorithms. The first shows that for sufficiently large α, phase 1 of Algorithm 1 is equivalent to farthest-first traversal. This means that it is sufficient to consider α parameters in a bounded range. Farthest-first traversal (Gonzalez [1985]) starts by choosing a random center and then iteratively chooses the point farthest from all centers chosen so far, until there are $k$ centers. We assume that ties are broken uniformly at random.

**Lemma 1.** Given a clustering instance $\mathcal{V}$ and $\delta > 0$, if $\alpha > \frac{\log(nk/\delta)}{\log s}$, then $d^\alpha$-sampling gives the same output as farthest-first traversal with probability $> 1 - \delta$. Here, $s$ denotes the minimum ratio $d_1/d_2$ between two distances $d_1 > d_2$ in the point set.

For some datasets, $\frac{1}{\log s}$ might be very large. In Section 5, we empirically observe that for all datasets we tried, (α, β)-Lloyds++ behaves the same as farthest-first traversal for $\alpha > 20$. Also, in the full version of this paper, we show that if the dataset satisfies a stability assumption called separability (Kobren et al. [2017], Pruitt et al. [2011]), then (α, β)-Lloyds++ outputs the same clustering as farthest-first traversal with high probability when $\alpha > \log n$.

Next, to motivate learning the best parameters, we show that for any pair of parameters $(\alpha^*, \beta^*)$, there exists a clustering instance such that $(\alpha^*, \beta^*)$-Lloyds++ outperforms all other values of α and β. This implies that $d^\beta$-sampling is not always the best choice of seeding for the $\ell_\beta$ objective. Let $\mathrm{clus}_{\alpha,\beta}(\mathcal{V})$ denote the expected cost of the clustering output by (α, β)-Lloyds++, with respect to the target clustering. Formally, $\mathrm{clus}_{\alpha,\beta}(\mathcal{V}) = \mathbb{E}_{Z \sim [0,1]^k}\big[\mathrm{clus}_{\alpha,\beta}(\mathcal{V}, Z)\big]$, where $\mathrm{clus}_{\alpha,\beta}(\mathcal{V}, Z)$ is the cost of the clustering output by (α, β)-Lloyds++ with randomness $Z \in [0, 1]^k$ (see step 1 of Algorithm 1).

**Theorem 2.** For any $\alpha^* \in [0, \infty) \cup \{\infty\}$ and $\beta^* \in [1, \infty) \cup \{\infty\}$, there exists a clustering instance $\mathcal{V}$ whose target clustering is the optimal $\ell_{\beta^*}$ clustering, such that $\mathrm{clus}_{\alpha^*,\beta^*}(\mathcal{V}) < \mathrm{clus}_{\alpha,\beta}(\mathcal{V})$ for all $(\alpha, \beta) \neq (\alpha^*, \beta^*)$.

**Sample efficiency.** Now we give sample complexity bounds for learning the best algorithm from the class of (α, β)-Lloyds++ algorithms. We analyze the phases of Algorithm 1 separately. For the first phase, our main structural result shows that for a given clustering instance and value of β, with high probability over the randomness in Algorithm 1, the number of discontinuities of the cost function $\mathrm{clus}_{\alpha,\beta}(\mathcal{V}, Z)$ as we vary $\alpha \in [0, \alpha_h]$ is $O(nk(\log n)\alpha_h)$. Our analysis crucially harnesses the randomness in the algorithm to achieve this bound. For instance, if we used a combinatorial approach as in prior algorithm configuration work, we would only achieve a bound of $n^{O(k)}$, which is the total number of sets of $k$ centers. For completeness, we give a combinatorial proof of $O(n^{k+3})$ discontinuities in the full version of this paper.

To show the $O(nk(\log n)\alpha_h)$ upper bound, we start by giving a few definitions of concepts used in the proof. Assume we start to run Algorithm 1 not with a specific setting of α, but rather with a range $[\alpha_\ell, \alpha_h]$, for some instance $\mathcal{V}$ and randomness $Z$. In some round $t$, if Algorithm 1 would choose a center $c_t$ for every setting of $\alpha \in [\alpha_\ell, \alpha_h]$, then we continue normally. However, if the algorithm would choose a different center depending on the specific value of α used from the interval $[\alpha_\ell, \alpha_h]$, then we fork the algorithm, making one copy for each possible center. In particular, we partition $[\alpha_\ell, \alpha_h]$ into a finite number of sub-intervals such that the next center is constant on each interval. The boundaries between these intervals are "breakpoints", since as α crosses these values, the next center chosen by the algorithm changes. Our goal is to bound the total number of breakpoints over all $k$ rounds in phase 1 of Algorithm 1, which bounds the number of discontinuities of the cost of the output clustering as a function of α over $[\alpha_\ell, \alpha_h]$.

A crucial step in the above approach is determining when to fork and where the breakpoints are located. Recall that in round $t$ of Algorithm 1, each data point $v_i$ has an interval in $[0, 1]$ of size $\frac{d_i^\alpha}{D_n(\alpha)}$, where $d_i$ is the minimum distance from $v_i$ to the current set of centers, and $D_j(\alpha) = d_1^\alpha + \cdots + d_j^\alpha$. Furthermore, the interval is located between $\frac{D_{i-1}(\alpha)}{D_n(\alpha)}$ and $\frac{D_i(\alpha)}{D_n(\alpha)}$ (see Figure 1). Without loss of generality, we assume $d_1 \ge \cdots \ge d_n$.
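To make the interval picture concrete, the following small sketch (our own illustration, using 0-indexed positions) computes, for a fixed $z_t$ and distances sorted in decreasing order, the index of the point that $d^\alpha$-sampling would choose at a given α; sweeping α reveals the breakpoints at which the chosen center changes.

```python
import numpy as np

def sampled_index(d_sorted, z_t, alpha):
    """0-indexed position i such that z_t falls in the interval I_{v_i} of the
    d^alpha-sampling partition of [0, 1]; d_sorted is in decreasing order."""
    weights = (d_sorted / d_sorted[0]) ** alpha    # rescale to avoid overflow
    cdf = np.cumsum(weights) / weights.sum()       # values D_i(alpha) / D_n(alpha)
    return int(np.searchsorted(cdf, z_t))

# Sweeping alpha shows the chosen index shifting toward the farthest point:
d = np.array([5.0, 3.0, 2.0, 1.0])
print({a: sampled_index(d, 0.7, a) for a in (0.0, 1.0, 2.0, 4.0, 8.0)})
# {0.0: 2, 1.0: 1, 2.0: 1, 4.0: 0, 8.0: 0}
```

The values of α at which this index changes are exactly the breakpoints discussed above.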
We prove the following nice structure about these intervals.

**Lemma 3.** Assume that $v_1, \ldots, v_n$ are sorted in decreasing order of distance from a set $C$ of centers. Then for each $i = 1, \ldots, n$, the function $\alpha \mapsto \frac{D_i(\alpha)}{D_n(\alpha)}$ is monotone increasing and continuous on $[0, \infty)$. Furthermore, for all $1 \le i \le j \le n$ and $\alpha \in [0, \infty)$, we have $\frac{D_i(\alpha)}{D_n(\alpha)} \le \frac{D_j(\alpha)}{D_n(\alpha)}$.

This lemma guarantees two crucial properties. First, for every (ordered) set $C$ of $t \le k$ centers chosen by phase 1 of Algorithm 1 up to round $t$, there is a single interval (as opposed to a more complicated set) of α-parameters that gives rise to $C$. Second, for an interval $[\alpha_\ell, \alpha_h]$, the set of possible next centers is exactly $v_{i_\ell}, v_{i_\ell+1}, \ldots, v_{i_h}$, where $i_\ell$ and $i_h$ are the centers sampled when α is $\alpha_\ell$ and $\alpha_h$, respectively (see Figure 1). Now we are ready to prove our main structural result. Formally, we define $\mathrm{seed}_\alpha(\mathcal{V}, Z)$ as the centers output by phase 1 of Algorithm 1 on instance $\mathcal{V}$ with randomness $Z$.

Figure 1: The algorithm chooses $v_3$ as a center (left). In the interval $[\alpha_\ell, \alpha_{\ell+1}]$, the algorithm may choose $v_4$, $v_3$, $v_2$, or $v_1$ as a center, based on the value of α (right).

**Theorem 4.** Given a clustering instance $\mathcal{V}$, the expected number of discontinuities of $\mathrm{seed}_\alpha(\mathcal{V}, Z)$ as a function of α over $[0, \alpha_h]$ is $O(nk(\log n)\alpha_h)$. Here, the expectation is over the uniformly random draw of $Z \in [0, 1]^k$.

*Proof sketch.* Consider round $t$ of a run of Algorithm 1. Suppose at the beginning of round $t$, there are $L$ possible states of the algorithm, i.e., $L$ sets of α such that within a set, the choice of the first $t - 1$ centers is fixed. By Lemma 3, we can write these sets as $[\alpha_0, \alpha_1], \ldots, [\alpha_{L-1}, \alpha_L]$, where $0 = \alpha_0 < \cdots < \alpha_L = \alpha_h$. Given one interval $[\alpha_\ell, \alpha_{\ell+1}]$, we claim the expected number of new breakpoints $\#I_{t,\ell}$ introduced by choosing a center in round $t$ is bounded by $4n \log n\,(\alpha_{\ell+1} - \alpha_\ell)$. Note that $\#I_{t,\ell} + 1$ is the number of possible choices for the next center in round $t$ using α in $[\alpha_\ell, \alpha_{\ell+1}]$. The claim gives an upper bound on the expected number of new breakpoints, where the expectation is only over $z_t$ (the uniformly random draw from $[0, 1]$ used by Algorithm 1 in round $t$), and the bound holds for any given configuration of $d_1 \ge \cdots \ge d_n$. Assuming the claim, we can finish off the proof by using linearity of expectation as follows. Let $\#I$ denote the total number of discontinuities of $\mathrm{seed}_\alpha(\mathcal{V}, Z)$. Then

$$\mathbb{E}_{Z \sim [0,1]^k}[\#I] \le \mathbb{E}_{Z \sim [0,1]^k}\left[\sum_{t=1}^{k}\sum_{\ell} \#I_{t,\ell}\right] = \sum_{t=1}^{k}\sum_{\ell} \mathbb{E}_{Z \sim [0,1]^k}[\#I_{t,\ell}] \le 4nk(\log n)\,\alpha_h.$$

Now we will prove the claim. Given $z_t \in [0, 1]$, let $x$ and $y$ denote the minimum indices such that $\frac{D_x(\alpha_\ell)}{D_n(\alpha_\ell)} > z_t$ and $\frac{D_y(\alpha_{\ell+1})}{D_n(\alpha_{\ell+1})} > z_t$, respectively. Then by Lemma 3, the number of breakpoints is exactly $x - y$. Therefore, our goal is to compute $\mathbb{E}_{z_t \sim [0,1]}[x - y]$. One method is to sum up the expected number of breakpoints for each interval $I_v$ by bounding the maximum possible number of breakpoints given that $z_t$ lands in $I_v$. However, this sometimes leads to a bound that is too coarse. For example, if $\alpha_{\ell+1} - \alpha_\ell = \epsilon \approx 0$, then for each bucket $I_{v_j}$ the maximum number of breakpoints is 1, but we want to show the expected number of breakpoints is proportional to ε. To tighten up this analysis, we will show that for each bucket, the probability (over $z_t$) of achieving the maximum number of breakpoints is low. Assuming that $z_t$ lands in a bucket $I_{v_j}$, we further break into cases as follows. Let $i$ denote the minimum index such that $\frac{D_i(\alpha_{\ell+1})}{D_n(\alpha_{\ell+1})} > \frac{D_j(\alpha_\ell)}{D_n(\alpha_\ell)}$. Note that $i$ is a function of $j$, $\alpha_\ell$, and $\alpha_{\ell+1}$, but it is independent of $z_t$.
If $z_t$ is less than $\frac{D_i(\alpha_{\ell+1})}{D_n(\alpha_{\ell+1})}$, then we have the maximum number of breakpoints possible, since the algorithm chooses center $v_{i-1}$ when $\alpha = \alpha_{\ell+1}$ and it chooses center $v_j$ when $\alpha = \alpha_\ell$. The number of breakpoints is therefore $j - i + 1$, by Lemma 3. We denote this event by $E_{t,j}$; i.e., $E_{t,j}$ is the event that in round $t$, $z_t$ lands in $I_{v_j}$ and is less than $\frac{D_i(\alpha_{\ell+1})}{D_n(\alpha_{\ell+1})}$. If $z_t$ is instead greater than $\frac{D_i(\alpha_{\ell+1})}{D_n(\alpha_{\ell+1})}$, then the algorithm chooses center $v_i$ when $\alpha = \alpha_{\ell+1}$, so the number of breakpoints is $j - i$. We denote this event by $E'_{t,j}$. Note that $E_{t,j}$ and $E'_{t,j}$ are disjoint and $E_{t,j} \cup E'_{t,j}$ is the event that $z_t \in I_{v_j}$. Within an interval $I_{v_j}$, the expected number of breakpoints is

$$\mathbb{P}(E_{t,j})(j - i + 1) + \mathbb{P}(E'_{t,j})(j - i) = \mathbb{P}(E_{t,j} \cup E'_{t,j})(j - i) + \mathbb{P}(E_{t,j}).$$

We will show that $j - i$ and $\mathbb{P}(E_{t,j})$ are both proportional to $(\log n)(\alpha_{\ell+1} - \alpha_\ell)$, which finishes off the claim. First we upper bound $\mathbb{P}(E_{t,j})$. Recall that this is the probability that $z_t$ lies between $\frac{D_j(\alpha_\ell)}{D_n(\alpha_\ell)}$ and $\frac{D_i(\alpha_{\ell+1})}{D_n(\alpha_{\ell+1})}$, which we can show is at most $(4\log n)(\alpha_{\ell+1} - \alpha_\ell)$ by upper bounding the derivative $\frac{\partial}{\partial \alpha}\frac{D_j(\alpha)}{D_n(\alpha)}$. Now we upper bound $j - i$. Recall that $j - i$ represents the number of intervals between $\frac{D_i(\alpha_\ell)}{D_n(\alpha_\ell)}$ and $\frac{D_j(\alpha_\ell)}{D_n(\alpha_\ell)}$, and the smallest interval in this range has size $\frac{d_j^{\alpha_\ell}}{D_n(\alpha_\ell)}$. Therefore, we can bound $j - i$ by dividing $\frac{D_j(\alpha_\ell)}{D_n(\alpha_\ell)} - \frac{D_i(\alpha_\ell)}{D_n(\alpha_\ell)}$ by $\frac{d_j^{\alpha_\ell}}{D_n(\alpha_\ell)}$. We can again use the derivative of $\frac{D_i(\alpha)}{D_n(\alpha)}$ to show that this value is proportional to $4\log n\,(\alpha_{\ell+1} - \alpha_\ell)$. This concludes the proof.

Now we analyze phase 2 of Algorithm 1. Since phase 2 does not involve randomness, we use combinatorial techniques. We define $\mathrm{lloyds}_\beta(\mathcal{V}, C, T)$ as the cost of the clustering output by phase 2 of Algorithm 1 on instance $\mathcal{V}$ with initial centers $C$ and a maximum of $T$ iterations.

**Theorem 5.** Given $T \in \mathbb{N}$, a clustering instance $\mathcal{V}$, and a fixed set $C$ of initial centers, the number of discontinuities of $\mathrm{lloyds}_\beta(\mathcal{V}, C, T)$ as a function of β on instance $\mathcal{V}$ is $O(\min(n^{3T}, n^{k+3}))$.

*Proof sketch.* Given $\mathcal{V}$ and a set of initial centers $C$, we bound the number of discontinuities introduced in the Lloyd's step of Algorithm 1. First, we give a bound of $n^{k+3}$ which holds for any value of $T$. Recall that Lloyd's algorithm is a two-step procedure, and note that the Voronoi partitioning step is independent of β. Let $\{C_1, \ldots, C_k\}$ denote the Voronoi partition of $V$ induced by $C$. Given one of these clusters $C_i$, the next center is computed as $\min_{c \in C_i} \sum_{v \in C_i} d(c, v)^\beta$. Given any $c_1, c_2 \in C_i$, the decision for whether $c_1$ is a better center than $c_2$ is governed by the inequality $\sum_{v \in C_i} d(c_1, v)^\beta < \sum_{v \in C_i} d(c_2, v)^\beta$. By a consequence of Rolle's theorem, such an equation has at most $2n + 1$ roots. The equation depends on the set $C$ of centers and on the two points $c_1$ and $c_2$; therefore, there are at most $n^k \binom{n}{2}$ equations, each with at most $2n + 1$ roots. We conclude that there are $O(n^{k+3})$ total intervals of β on which the outcome of Lloyd's method is fixed. Next, we give a different analysis which bounds the number of discontinuities by $n^{3T}$, where $T$ is the maximum number of Lloyd's iterations. By the same analysis as in the previous paragraph, if we only consider one round, then the total number of equations which govern the output of a Lloyd's iteration is $\binom{n}{2}$, since the set of centers $C$ is fixed. These equations have at most $2n + 1$ roots, so the total number of intervals in one round is $O(n^3)$. Therefore, over $T$ rounds, the number of intervals is $O(n^{3T})$.
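To illustrate the kind of decision boundary analyzed in Theorem 5, the following small sketch (our own, not from the paper) locates values of β at which the preferred $\ell_\beta$ center of a cluster switches between two candidate points, by scanning for sign changes of the difference of the two power sums and refining each with SciPy's `brentq` root finder.

```python
import numpy as np
from scipy.optimize import brentq

def beta_breakpoints(d1, d2, beta_lo=1.0, beta_hi=10.0, grid=1000):
    """Approximate values of beta in [beta_lo, beta_hi] where
    sum_v d(c1, v)^beta - sum_v d(c2, v)^beta changes sign, i.e., where the
    preference between two candidate centers c1 and c2 flips.

    d1, d2: arrays of distances from c1 and c2 to the cluster's points.
    """
    f = lambda b: float((d1 ** b).sum() - (d2 ** b).sum())
    # Coarse grid scan; very closely spaced sign changes may be missed.
    betas = np.linspace(beta_lo, beta_hi, grid)
    vals = np.array([f(b) for b in betas])
    roots = []
    for lo, hi, vlo, vhi in zip(betas[:-1], betas[1:], vals[:-1], vals[1:]):
        if vlo == 0.0:
            roots.append(float(lo))
        elif vlo * vhi < 0:
            roots.append(brentq(f, lo, hi))
    return roots
```

Theorem 5's Rolle's-theorem argument bounds how many such sign changes can occur, which in turn bounds the number of β-intervals on which a Lloyd's iteration behaves identically.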
By combining Theorem 4 with Theorem 5, and using standard learning theory results, we can bound the sample complexity needed to learn near-optimal parameters α and β for an unknown distribution $\mathcal{D}$ over clustering instances. Recall that $\mathrm{clus}_{\alpha,\beta}(\mathcal{V})$ denotes the expected cost of the clustering output by (α, β)-Lloyds++ with respect to the target clustering, and let $H$ denote the maximum value of $\mathrm{clus}_{\alpha,\beta}(\mathcal{V})$.

**Theorem 6.** Given $\alpha_h$ and a sample of size $m = O\!\left(\left(\frac{H}{\epsilon}\right)^2\left(\min(T, k)\log n + \log\frac{1}{\delta}\right)\right)$ drawn from $(\mathcal{D} \times [0, 1]^k)^m$, with probability at least $1 - \delta$ over the choice of the sample, for all $\alpha \in [0, \alpha_h]$ and $\beta \in [1, \infty) \cup \{\infty\}$,
$$\left|\frac{1}{m}\sum_{i=1}^{m} \mathrm{clus}_{\alpha,\beta}\big(\mathcal{V}^{(i)}, Z^{(i)}\big) - \mathbb{E}_{\mathcal{V} \sim \mathcal{D}}\big[\mathrm{clus}_{\alpha,\beta}(\mathcal{V})\big]\right| < \epsilon.$$

Note that a corollary of Theorem 6 and Lemma 1 is a uniform convergence bound for all $\alpha \in [0, \infty) \cup \{\infty\}$; however, the algorithm designer may decide to set $\alpha_h < \infty$.

**Computational efficiency.** In this section, we present an algorithm for tuning α whose running time scales with the true number of discontinuities over the sample. Combined with Theorem 4, this gives a bound on the expected running time of tuning α.

**Algorithm 2: Dynamic algorithm configuration.**
Input: instance $\mathcal{V} = (V, d, k)$, randomness $Z$, $\alpha_h$, $\epsilon > 0$.
1. Initialize $Q$ to be an empty queue, then push the root node $(\emptyset, [0, \alpha_h])$ onto $Q$.
2. While $Q$ is non-empty:
   (a) Pop a node $(C, A)$ from $Q$, with centers $C$ and alpha interval $A$.
   (b) For each point $u_i$ that can be chosen as the next center, compute $A_i = \{\alpha \in A : u_i \text{ is the sampled center}\}$ up to error ε and set $C_i = C \cup \{u_i\}$.
   (c) For each $i$, if $|C_i| < k$, push $(C_i, A_i)$ onto $Q$. Otherwise, output $(C_i, A_i)$.

The high-level idea of our algorithm is to directly enumerate the set of centers that can possibly be output by $d^\alpha$-sampling for a given clustering instance $\mathcal{V}$ and pre-sampled randomness $Z$. We know from the previous section how to count the number of new breakpoints at any given state of the algorithm; however, efficiently solving for the breakpoints poses a new challenge. From the previous section, we know the breakpoints in α occur when $\frac{D_i(\alpha)}{D_n(\alpha)} = z_t$. This is an exponential equation with $n$ terms, and there is no closed-form solution for α. Although an arbitrary equation of this form may have up to $n$ solutions, our key observation is that if $d_1 \ge \cdots \ge d_n$, then $\alpha \mapsto \frac{D_i(\alpha)}{D_n(\alpha)}$ is monotone (by Lemma 3); therefore, it suffices to binary search over α to find the unique solution to this equation. We cannot find the exact value of the breakpoint by binary search (and even if there were a closed-form solution for the breakpoint, it might not be rational); however, we can find the value to within additive error ε for any $\epsilon > 0$. We show that the expected cost function is $(Hnk\log n)$-Lipschitz in α; therefore, it suffices to run $O\!\left(\log\frac{Hnk}{\epsilon}\right)$ rounds of binary search to find a solution whose expected cost is within ε of the optimal cost. This motivates Algorithm 2.
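The following is a minimal sketch (our own, with hypothetical helper names) of this binary search: it locates a value of α at which $\frac{D_i(\alpha)}{D_n(\alpha)} = z_t$ for distances sorted in decreasing order, relying only on the monotonicity of the ratio in α (Lemma 3).

```python
import numpy as np

def cdf_ratio(d_sorted, i, alpha):
    """D_i(alpha) / D_n(alpha) for distances sorted in decreasing order."""
    powers = (d_sorted / d_sorted[0]) ** alpha    # rescale to avoid overflow
    return powers[:i].sum() / powers.sum()

def find_breakpoint(d_sorted, i, z_t, alpha_lo, alpha_hi, tol=1e-6):
    """Binary search for alpha in [alpha_lo, alpha_hi] with
    D_i(alpha) / D_n(alpha) = z_t, up to additive error tol in alpha.

    Assumes the ratio is monotone in alpha (Lemma 3) and that it crosses
    z_t somewhere inside the given interval.
    """
    increasing = cdf_ratio(d_sorted, i, alpha_hi) >= cdf_ratio(d_sorted, i, alpha_lo)
    while alpha_hi - alpha_lo > tol:
        mid = 0.5 * (alpha_lo + alpha_hi)
        # Move the endpoint so that the crossing stays inside [alpha_lo, alpha_hi].
        if (cdf_ratio(d_sorted, i, mid) < z_t) == increasing:
            alpha_lo = mid
        else:
            alpha_hi = mid
    return 0.5 * (alpha_lo + alpha_hi)
```

Each evaluation of the ratio costs $O(n)$ time, so the cost of locating one breakpoint grows only logarithmically with the desired precision, matching the running-time analysis below.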
**Theorem 7.** Given parameters $\alpha_h > 0$, $\epsilon > 0$, $\beta \ge 1$, and a sample $S$ of size $m = O\!\left(\left(\frac{H}{\epsilon}\right)^2\left(\min(T, k)\log n + \log\frac{1}{\delta} + \log \alpha_h\right)\right)$ drawn from $(\mathcal{D} \times [0, 1]^k)^m$, run Algorithm 2 on each sample and collect all breakpoints (i.e., boundaries of the intervals $A_i$). With probability at least $1 - \delta$, the breakpoint $\hat{\alpha}$ with lowest empirical cost satisfies $\left|\mathrm{clus}_{\hat{\alpha},\beta}(S) - \min_{0 \le \alpha \le \alpha_h}\mathrm{clus}_{\alpha,\beta}(S)\right| < \epsilon$. The total running time to find the best breakpoint is $O\!\left(mn^2k^2\alpha_h\log\frac{nH}{\epsilon}\right)$.

*Proof sketch.* First we argue that one of the breakpoints output by Algorithm 2 on the sample is approximately optimal. If we solved for the breakpoints exactly, then every value of $\alpha \in [0, \alpha_h]$ would produce exactly the same clusterings on the sample as one of the breakpoints, so some breakpoint must be empirically optimal. By Theorem 6, this breakpoint is also approximately optimal in expectation. Since the algorithm calculates the breakpoints only up to additive error ε, we are guaranteed that every true breakpoint is within distance ε of one output by the algorithm. In the full version of the paper, we show that the expected cost function is $(Hn^2k\log n)$-Lipschitz. Therefore, the best approximate breakpoint has approximately optimal expected cost.

Now we analyze the runtime of Algorithm 2. Let $(C, A)$ be any node in the algorithm, with centers $C$ and alpha interval $A = [\alpha_\ell, \alpha_h]$. Sorting the points in $V$ according to their distance to $C$ has complexity $O(n\log n)$. Finding the points sampled by $d^\alpha$-sampling with α set to $\alpha_\ell$ and $\alpha_h$ costs $O(n)$ time. Finally, computing the alpha interval $A_i$ for each child node of $(C, A)$ costs $O\!\left(n\log\frac{nH}{\epsilon}\right)$ time, since we need to perform $\log\frac{nkH\log n}{\epsilon}$ iterations of binary search on $\alpha \mapsto \frac{D_i(\alpha)}{D_n(\alpha)}$, and each evaluation of the function costs $O(n)$ time. We charge this $O\!\left(n\log\frac{nH}{\epsilon}\right)$ time to the corresponding child node. If we let $\#I$ denote the total number of α-intervals for $\mathcal{V}$, then each layer of the execution tree has at most $\#I$ nodes, and the depth is $k$, giving a total running time of $O\!\left(\#I \cdot kn\log\frac{nH}{\epsilon}\right)$. With Theorem 4, this gives an expected runtime of $O\!\left(mn^2k^2\alpha_h(\log n)\log\frac{nH}{\epsilon}\right)$.

Since we showed that $d^\alpha$-sampling is Lipschitz as a function of α, it is also possible to find the best α parameter with sub-optimality at most ε by picking the best point from a discretization of $[0, \alpha_h]$ with step size $s = \epsilon/(Hn^2k\log n)$. The running time of this algorithm is $O(n^3k^2H\log n/\epsilon)$, which is significantly slower than the efficient algorithm presented in this section. Intuitively, Algorithm 2 is able to binary search to find each breakpoint in time $O\!\left(\log\frac{nH}{\epsilon}\right)$, whereas a discretization-based algorithm must check all values of α uniformly, so its runtime increases by a multiplicative factor of $O\!\left(\frac{nH}{\epsilon}\right)$.

## 5 Experiments

In this section, we empirically evaluate the effect of the α parameter on clustering cost for real-world and synthetic clustering domains. In the full version of this paper, we provide full details as well as additional experiments exploring the effect of β and the number of possible clusterings as we vary α.

**Experiment setup.** Our experiments evaluate the (α, β)-Lloyds++ family of algorithms on distributions over clustering instances derived from multi-class classification datasets. For each classification dataset, we sample a clustering instance by choosing a random subset of $k$ labels and sampling $N$ examples belonging to each of the $k$ chosen labels. The clustering instance then consists of the $kN$ points, and the target clustering is given by the ground-truth labels. This sampling distribution covers many related clustering tasks (i.e., clustering different subsets of the same labels). We always measure distance between points using the $\ell_2$ distance and set $\beta = 2$. We measure clustering cost in terms of the majority cost, which is the fraction of points whose label disagrees with the majority label in their cluster. The majority cost takes values in $[0, 1]$ and is zero if and only if the output clustering matches the target clustering perfectly. We generate $m = 50{,}000$ samples from each distribution and divide them into equal-sized training and test sets. We then use Algorithm 2 to evaluate the average majority cost for all values of α on the training and test sets. Figure 2 shows the average majority cost for all values of α on both sets.
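The majority cost used in these experiments can be computed directly from the cluster assignments and the ground-truth labels; the following is a small sketch of our own (not code from the paper).

```python
import numpy as np

def majority_cost(labels, ground_truth):
    """Fraction of points whose ground-truth label disagrees with the majority
    ground-truth label of their assigned cluster (labels are nonnegative ints)."""
    labels = np.asarray(labels)
    ground_truth = np.asarray(ground_truth)
    agree = 0
    for c in np.unique(labels):
        members = ground_truth[labels == c]
        agree += np.bincount(members).max()   # size of the majority label group
    return 1.0 - agree / len(labels)
```

A majority cost of 0 means every cluster is pure with respect to the ground-truth labels, matching the description above.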
We ran experiments on datasets including MNIST, CIFAR-10, CNAE-9, and a synthetic Gaussian Grid dataset. For MNIST and CIFAR-10 we set $k = 5$ and $N = 100$, while for CNAE-9 and the Gaussian Grid we set $k = 4$ and $N = 120$. For MNIST, we used more samples ($m = 250{,}000$). We find that the optimal value of α varies significantly between tasks, showing that tuning α on a per-task basis can lead to improved performance. Moreover, we find strong agreement in the average cost of each value of α across the independent training and testing samples of clustering instances, as predicted by our sample complexity results.

Figure 2: Majority cost for (α, β)-Lloyds++ as a function of α for $\beta = 2$ (each panel plots α against majority cost; the panels include (b) CIFAR-10 and (d) Gaussian Grid).

## 6 Conclusion

We define an infinite family of algorithms generalizing Lloyd's method, with one parameter controlling the initialization procedure and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We provide a sample-efficient and computationally efficient algorithm to learn a near-optimal parameter over an unknown distribution of clustering instances, by developing techniques to bound the expected number of discontinuities in the cost as a function of the parameter. We give a thorough empirical analysis, showing that the optimal parameter values transfer to related clustering instances. We show that the optimal parameters vary among different types of datasets, and that the optimal parameters often yield significantly lower error than existing algorithms such as k-means++ and farthest-first traversal.

## 7 Acknowledgments

This work was supported in part by NSF grants CCF-1535967 and IIS-1618714, an Amazon Research Award, a Microsoft Research Faculty Fellowship, a National Defense Science & Engineering Graduate (NDSEG) fellowship, and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

## References

- Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2017.
- Kohei Arai and Ali Ridho Barakbah. Hierarchical k-means: an algorithm for centroids initialization for k-means. Reports of the Faculty of Science and Engineering, 36(1):25-31, 2007.
- David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 1027-1035, 2007.
- Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544-562, 2004.
- Hassan Ashtiani and Shai Ben-David. Representation learning for clustering: a statistical framework. In Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 82-91, 2015.
- Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Proceedings of the Annual Conference on Learning Theory (COLT), pages 213-274, 2017.
- Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results.
  Journal of Machine Learning Research, 3(Nov):463-482, 2002.
- Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median, and positive correlation in budgeted optimization. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 737-756, 2015.
- Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 1-10, 1999.
- Michael B. Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 9-21, 2016.
- Sanjoy Dasgupta and Philip M. Long. Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4):555-569, 2005.
- Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, pages 1-38, 1977.
- Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.
- Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985.
- Richard E. Higgs, Kerry G. Bemis, Ian A. Watson, and James H. Wikel. Experimental designs for selecting molecules from large chemical databases. Journal of Chemical Information and Computer Sciences, 37(5):861-870, 1997.
- Wenhao Jiang and Fu-lai Chung. Transfer spectral clustering. In Proceedings of the Annual Conference on Knowledge Discovery and Data Mining (KDD), pages 789-803, 2012.
- Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881-892, 2002.
- Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.
- Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. An online hierarchical algorithm for extreme clustering. In Proceedings of the Annual Conference on Knowledge Discovery and Data Mining (KDD), 2017.
- Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902-1914, 2001.
- Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
- James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. Oakland, CA, USA, 1967.
- Joel Max. Quantizing for minimum distortion. IRE Transactions on Information Theory, 6(1):7-12, 1960.
- Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):28, 2012.
- Dan Pelleg and Andrew Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the Annual Conference on Knowledge Discovery and Data Mining (KDD), pages 277-281, 1999.
- José M. Pena, Jose Antonio Lozano, and Pedro Larranaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027-1040, 1999.
- Kim D. Pruitt, Tatiana Tatusova, Garth R. Brown, and Donna R. Maglott. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Research, 40(D1):D130-D135, 2011.
- Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the International Conference on Machine Learning (ICML), pages 759-766, 2007.
- Qiang Yang, Yuqiang Chen, Gui-Rong Xue, Wenyuan Dai, and Yong Yu. Heterogeneous transfer learning for image clustering via the social web. In Proceedings of the Conference on Natural Language Processing, pages 1-9, 2009.