# Consistent k-Clustering

Silvio Lattanzi (Google, Zurich, Switzerland) and Sergei Vassilvitskii (Google, New York, NY, USA)

The study of online algorithms and competitive analysis provides a solid foundation for studying the quality of irrevocable decision making when the data arrives in an online manner. While in some scenarios the decisions are indeed irrevocable, there are many practical situations where changing a previous decision is not impossible, but simply expensive. In this work we formalize this notion and introduce the consistent k-clustering problem. With points arriving online, the goal is to maintain a constant approximate solution, while minimizing the number of reclusterings necessary. We prove a lower bound, showing that $\Omega(k \log n)$ changes are necessary in the worst case for a wide range of objective functions. On the positive side, we give an algorithm that needs only $O(k^2 \log^4 n)$ changes to maintain a constant competitive solution, an exponential improvement on the naive solution of reclustering at every time step. Finally, we show experimentally that our approach performs much better than the theoretical bound, with the number of changes growing approximately as $O(\log n)$.

1. Introduction

Competitive analysis of online algorithms has been an area of spirited research with beautiful results over the past two decades. At its heart, this area is about decision making under uncertainty about the future: the input is revealed in an online manner, and at every point in time the algorithm must make an irrevocable choice. A standard example is that of caching algorithms: at every time step the algorithm must make a choice about which elements to keep in the cache, and which elements to evict (Fiat et al., 1991). The generalization of caching to metric spaces is encapsulated in the k-server problem, which has been the subject of intense study (Bansal et al., 2015; Manasse et al., 1990).

The key metric in online algorithms is the competitive ratio. It measures the quality of the solution obtained by an online algorithm versus an offline optimum, which has the luxury of seeing the whole input before making any decisions. In situations where the competitive ratio is relatively small, for example the list update problem (Sleator & Tarjan, 1985), this is a great measure by which we can compare different algorithms. However, in some scenarios strong lower bounds on the competitive ratio imply that any algorithm that makes irrevocable choices will necessarily perform poorly when compared to an offline optimum.

Online clustering is one such example. In this setting points $x_1, x_2, \ldots$ arrive one at a time, and must be instantly given one of $k$ cluster labels. As is typical, the goal is to have the highest quality clustering (under some pre-specified objective function, like k-CENTER or k-MEDIAN) at every point in time. As Liberty et al. (2016) showed, not only do online clustering algorithms have an unbounded competitive ratio, but one must use bi-criteria approximations to have any hope of a constant approximate solution.

Another approach to evade strong lower bounds is to make additional assumptions about the input to the problem. For example, one may assume that the input comes in a random (or partially random) order.
This assumption has been a fruitful avenue when studying online problems in different contexts, such as the classic secretary problem (Ferguson, 1989; Kesselheim et al., 2015; Kleinberg, 2005) or matching (Karp et al., 1990; Mahdian & Yan, 2011). Another alternative is to assume some additional structure on the distribution that the points are coming from (Feldman et al., 2009). A big downside of both of these assumptions is that they are hard to test and validate in practice, which is why we take a different approach in this work.

1.1. Consistency

While the irrevocability of past choices makes sense from a theoretical standpoint, for some practical problems this requirement is unrealistically draconian. For example, consider a load balancer, which, when faced with requests arriving online, assigns them to different machines. Better cache performance dictates that similar requests should be assigned to the same machine, thus the load balancer is essentially performing online clustering. However, fundamentally, nothing is preventing the load balancer from reassigning some of the past jobs to other machines. In this situation, a re-clustering, that is, a reassignment of jobs to machines to increase performance, is not an impossible operation.

Another common example of a costly, but not prohibitive, recomputation comes from standard applications of unsupervised clustering: feature engineering for large scale machine learned systems. In this setting a feature vector $x$ is augmented with the id of the cluster it falls in, $x'$, and the full vector $(x, x')$ is given as input to the learner. This is mainly done to introduce expressiveness and non-linearity to simple systems. In this situation, changing the clustering would entail changing the set of features passed to the learner, and retraining the whole system; thus one certainly does not want to do it at every time step, but it can be done if the gains are worthwhile.

From a theoretical perspective, the ability to correct past mistakes opens the door to much better solutions. In particular for clustering problems, it avoids the lower bounds introduced by Liberty et al. (2016). As we will show, the option to recluster dramatically improves the quality of the solution, even if it is taken rarely. More formally, we will introduce a parameter $\beta$ which controls the number of times the solution changes. Setting $\beta = 0$ is equivalent to online algorithms, whereas a large value of $\beta$ is equivalent to recomputing the answer from scratch at every time step.

1.2. Our Contributions

In this paper we focus on exploring the trade-off between the approximation ratio of clustering algorithms and the number of times we must recompute the results. We begin by formally defining the notion of $(\alpha, \beta)$-consistent clustering in Section 3. Then we prove a lower bound, showing that any constant competitive algorithm must change its cluster centers at least $\Omega(k \log n)$ times (Section 3.1). We then show that a known algorithm by Charikar et al. (2004) achieves this bound for the k-CENTER problem, and we develop a new algorithm for other clustering objectives and show that it requires at most $O(k^2 \log^4 n)$ reclusterings, an exponential improvement over the naive solution (Section 5). Finally, we show that the proposed algorithms perform well on real world datasets (Section 7).

1.3. Related Work

There are two avenues of related work that we build on in this paper. The first is clustering algorithms, particularly the online clustering variants.
In their seminal work Charikar et al. (2004) gave algorithms for the k-CENTER problem. The case of k-MEDIAN and k-MEANS proved more complex. For the former, Meyerson (2001) gave an $O(\log n)$ competitive ratio for the closely related online facility location problem. This result was further improved by Fotakis (2008) and Anagnostopoulos et al. (2004). The latter was recently studied by Liberty et al. (2016), who gave bicriteria approximations and showed that these are necessary in an online setting. For the soft partition version of the k-clustering problem, an Expectation Maximization algorithm was suggested by Liang & Klein (2009).

The second, closely related area is that of streaming algorithms. The literature on clustering in the streaming model is very rich; we highlight the most relevant results. The first paper to study a clustering problem in this setting is by Charikar et al. (2004), studying the k-CENTER problem. Guha et al. (2000) give the first single pass constant approximation algorithm for the k-MEDIAN variant. Their result was subsequently improved by Charikar et al. (2003). Finally, the best algorithm for the closely related variant of facility location is due to Czumaj et al. (2013), who gave a $(1+\epsilon)$-approximation for the problem.

2. Preliminaries

Let $X$ be a set of $n$ points, and $d : X \times X \to \mathbb{R}$ a distance function. We assume that $d$ is symmetric and that $(X, d)$ forms a metric space, that is, $d(x, x) = 0$ for any $x \in X$; $d(x, y) = d(y, x) \geq 0$ for any $x, y \in X$; and, for any $x, y, z \in X$, $d(x, y) \leq d(x, z) + d(z, y)$. Finally, by scaling $d$, let $\min_{x \neq y} d(x, y) = 1$ and denote by $\Delta$ the maximum pairwise distance, $\max_{x, y} d(x, y)$. We will assume that $\Delta$ is bounded by a polynomial in $n$, therefore $\log \Delta = O(\log n)$.

Consider a set of $k$ points $c = \{c_1, c_2, \ldots, c_k\} \subseteq X$, which we will refer to as centers. For each $c_i$, let $C_i \subseteq X$ be the set of points in $X$ closer to $c_i$ than to any other center in $c$.¹ Formally, $C_i = \{x \in X \mid d(x, c_i) \leq \min_{c' \in c} d(x, c')\}$.

¹For clarity of the exposition we will assume that all of the pairwise distances are unique. The results still hold when ties are broken lexicographically.

Given a $p > 0$, in the rest of the paper we refer to the cost of a point $x$ with respect to a set of centers $c$ as $\mathrm{cost}_p(x, c) = \min_{c_i \in c} d(x, c_i)^p$, and to the cost of a cluster $C_i$ as $\mathrm{cost}_p(X, C_i) = \sum_{x \in C_i} d(x, c_i)^p$.

Now we are ready to define our problem. For any $p > 0$ we can define the cost of a clustering of the points $X$ with respect to the centers $c \subseteq X$ as

$$\mathrm{cost}_p(X, c) = \sum_{x \in X} \mathrm{cost}_p(x, c) = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)^p.$$

The k-clustering family of problems asks to find the set of centers $c$ that minimizes $\mathrm{cost}_p$ for a specific $p$. When $p = 1$, $\mathrm{cost}_1(X, c)$ is precisely the k-MEDIAN clustering objective. Setting $p = 2$ is equivalent to the k-MEDOIDS problem.² Finally, with $p = \infty$, we recover the k-CENTER problem, which asks to minimize the maximum distance of any point to its nearest cluster center.

²In Euclidean space, if the centers do not need to be part of the input, setting $p = 2$ recovers the k-MEANS problem.

Observe that although $d(\cdot, \cdot)$ satisfies the triangle inequality, when raised to the $p$-th power we need to relax the condition. In particular we have that for any $x, y, z \in X$: $d(x, y)^p \leq 2^{p-1}\left(d(x, z)^p + d(z, y)^p\right)$.

When $p$ is clear from the context, we will refer to $\mathrm{cost}_p(X, c)$ as the cost of the clustering and denote it $\mathrm{cost}(X, c)$. We will use $\mathrm{opt}_p(X)$ to denote the optimum cost for the metric space $(X, d)$, and $c^* = \{c^*_1, c^*_2, \ldots, c^*_k\}$ to denote the optimal solution. The k-clustering problem is NP-hard to solve exactly, thus we consider approximate solutions.
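To make the cost definitions above concrete, the following is a minimal sketch of $\mathrm{cost}_p$ for points represented as numeric vectors; the use of NumPy and of the Euclidean metric is an illustrative assumption, since the paper only requires an arbitrary metric $d$.

```python
import numpy as np

def cost_p(points, centers, p):
    """cost_p(X, c): every point pays d(x, c_i)^p to its closest center.

    points: (n, dim) array; centers: (k, dim) array; Euclidean d assumed
    here purely for illustration.
    """
    # Distance from every point to every center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Each point is charged the p-th power of its distance to the nearest center.
    return float(np.sum(np.min(dists, axis=1) ** p))

# p = 1 is the k-MEDIAN objective, p = 2 the k-MEDOIDS objective;
# replacing the sum by a max recovers the k-CENTER objective.
```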
We say that a clustering generated from a set of centers $c$ is $\alpha$-approximate if $\mathrm{cost}_p(X, c) \leq \alpha \cdot \mathrm{opt}_p(X)$. The best known approximation factors are 2 for the k-CENTER problem (Gonzalez, 1985), $1 + \sqrt{3} + \epsilon$ for the k-MEDIAN problem (Li & Svensson, 2016), and $9 + \epsilon$ for the k-MEDOIDS problem (Kanungo et al., 2004).

3. Consistency

As noted in the introduction, in many online clustering applications the choices made by the online algorithm are not irrevocable, but simply expensive to change. Moreover, by allowing a small number of full recomputations, we can circumvent the stringent lower bounds on the competitive ratio for online clustering. To this end, our goal in this work is to better understand the trade-off between the approximation ratio of online clustering algorithms and the number of times the representative centers change.

We focus on a dynamic setting where the points arrive sequentially. Let $x_t$ denote the point that arrives at time $t$, and denote by $X_t$ the set of points that has arrived from the beginning. Thus $X_0 = \emptyset$, and $X_{i+1} = X_i \cup \{x_{i+1}\} = \{x_1, x_2, \ldots, x_{i+1}\}$.

For any two sets of centers $c, c'$ let $|c - c'|$ denote the number of elements present in $c$, but not in $c'$: $|c - c'| = |c \setminus (c \cap c')|$. Observe that when $c$ and $c'$ have the same cardinality, $|c - c'| = |c' - c|$.

Definition 3.1. Given a sequence of sets of centers $c_0, c_1, \ldots, c_t$ and a positive monotone non-decreasing function $\beta : \mathbb{Z} \to \mathbb{R}$, we say that the sequence is $\beta$-consistent if for all $T$, $\sum_{t=1}^{T} |c_t - c_{t-1}| \leq \beta(T)$.

In other words, a sequence is $\beta$-consistent if at time $T$ at most $\beta(T)$ centers have changed between successive sets.

Definition 3.2. Given a sequence of points $x_1, x_2, \ldots, x_T$, and a parameter $p$, a sequence of centers $c_1, c_2, \ldots, c_T$ is $(\alpha, \beta)$-consistent if:

(i) Approximation. At every time $t$, the centers $c_t$ form an $\alpha$-approximate solution to the optimum solution at that time: $\mathrm{cost}_p(X_t, c_t) \leq \alpha \cdot \mathrm{opt}_p(X_t)$ for all $t \leq T$.

(ii) Consistency. The sets of centers form a $\beta$-consistent sequence.

3.1. A lower bound

Before we look for $(\alpha, \beta)$-consistent algorithms it is useful to understand what values are possible. We show that it is impossible to get a constant approximation and achieve consistency of $o(\log n)$ for any of the k-clustering problems. Later, in Section 6, we will give a non-constructive result that shows that there is always a sequence of clusterings that is simultaneously constant-approximate and $O(k \log^2 n)$-consistent.³

³Note that we assume throughout the paper that the maximum distance between any two points, $\Delta$, is polynomial in $n$. Alternatively, we can restate the lower bound in this section as $\Omega(k \log \Delta)$ and the upper bound in Section 6 as $O(k \log^2 \Delta)$.

Lemma 3.3. There exists a sequence of points such that for any constant $\alpha > 0$, any algorithm that returns an $\alpha$-approximate solution while processing $n$ points must be $\Omega(k \log n)$-consistent.

Proof. For ease of exposition, assume that $p = 1$, and consider points lying in $(k-1)$-dimensional Euclidean space, $\mathbb{R}^{k-1}$. We begin by adding a point $x_0$ at the origin, and points $x_1, \ldots, x_{k-1}$ at positions $e_1, e_2, \ldots, e_{k-1}$, where $e_j$ is the standard basis vector that is 1 in the $j$-th dimension, and 0 everywhere else. We then proceed in phases, where in phase $i$, $1 \leq i < \log n$, we add points at positions $\gamma^i e_j$ for each $j \in [1, k-1]$, for some $\gamma > 0$ that we will set later. In phase $\log n$ we add the remaining $n - (k-1)\log n - 1$ points at arbitrary positions within the convex hull of the already added points. Let $P_i$ be the set of points at the end of phase $i$.

Consider any algorithm that returns an $\alpha$-approximate solution on $P_i$. Let $p_1, p_2, \ldots, p_{k-1}$ be the points added to the input during phase $i$, $p_j = \gamma^i e_j$.
Then $P_i = P_{i-1} \cup \{p_1, \ldots, p_{k-1}\}$. One feasible solution chooses as centers the points added in phase $i$ as well as the origin, $C = \{p_1, p_2, \ldots, p_{k-1}, 0\}$. For every point in $P_{i-1}$ the origin is closer than any of the other centers, therefore the total cost is

$$\mathrm{opt}(P_i) \leq \mathrm{cost}(P_i, C) = (k-1)\sum_{z=1}^{i-1} \gamma^z \leq (k-1)\,\frac{\gamma^i - 1}{\gamma - 1}.$$

On the other hand, consider a set of centers $c'$ that does not include some $p_j = \gamma^i e_j$. The closest point to $p_j$ is at $\gamma^{i-1} e_j$, which is at distance $\gamma^{i-1}(\gamma - 1)$ away. Therefore, $\mathrm{cost}(P_i, c') \geq \mathrm{cost}(\{p_j\}, c') = \gamma^{i-1}(\gamma - 1)$. If $\gamma \geq 2 + k\alpha$ then we can bound the approximation ratio as

$$\frac{\mathrm{cost}(P_i, c')}{\mathrm{opt}(P_i)} \geq \frac{\gamma^{i-1}(\gamma - 1)^2}{(k-1)(\gamma^i - 1)} \geq \frac{\gamma^{i-1}(\gamma - 1)^2}{(k-1)\gamma^i} = \frac{(\gamma - 1)^2}{(k-1)\gamma} > \alpha,$$

so $c'$ cannot be an $\alpha$-approximate solution. Therefore at the end of each phase $1 \leq i < \log n$, any $\alpha$-approximate set of centers must include all points added in phase $i$. Thus any sequence of sets of centers must be $\Omega(k \log n)$-consistent. Note that considering any $p > 1$ only makes any omission of a point $p_j$ even more costly, as compared to the optimum solution.

4. Warm up: k-CENTER Clustering

To gain some intuition about consistent clustering, we begin with the k-CENTER objective. Given a dataset $X$, the goal is to identify $k$ centers $c = \{c_1, \ldots, c_k\}$ that minimize $\max_{x \in X} \min_{c_i \in c} d(x, c_i)$. This problem is known to be NP-hard, but a simple 2-approximation algorithm exists in the batch setting (Gonzalez, 1985). In the streaming setting, when points arrive one at a time, the DOUBLING algorithm by Charikar et al. (2004) was the first algorithm discovered for this problem. The algorithm maintains an 8-approximation. Furthermore, it works in $O(\log \Delta) = O(\log n)$ phases and the total consistency cost of each phase is $k$; thus we get the following lemma.

Lemma 4.1. The DOUBLING algorithm for the k-CENTER problem is $(8, O(k \log n))$-consistent.

5. Main Algorithm

In this section we present our main result, an algorithm that achieves a polylogarithmic consistency factor. More precisely, we show that for every constant $p \geq 1$, it is possible to design an algorithm for the consistent k-clustering problem under $\mathrm{cost}_p$ that is constant approximate and $O(k^2 \log^4 n)$-consistent. In the remainder of the section we first present the main ideas behind our algorithm, then prove some useful technical lemmas, and finally present the full algorithm.

5.1. Main ideas

Before delving into the details, we highlight the three main building blocks of our algorithm.

The first is the Meyerson sketch for online facility location (Meyerson, 2001). This sketch has already been used by Charikar et al. (2003) to solve the k-median problem on data streams. We show that the main ingredients of the sketch continue to work under $\mathrm{cost}_p$ objectives, and use it to reduce the number of points under consideration from $n$ to $k\,\mathrm{poly}\log n$. One caveat of this sketch is that to use it we need access to a good lower bound on the cost of the optimal solution at any point in time. We obtain it by running the $\Theta(p)$-approximation algorithm described by Gupta & Tangwongsan (2008) on all available points. In this way, at any point in time we have a good approximation of the optimum solution. We then divide the progress of our algorithm into $\log n$ phases based on this lower bound, and in each phase we use a different sketch.
Finally, while the Meyerson sketch maintains $O(k \log^2 n)$ possible centers, to compute the k-clustering we have to reduce these points to exactly $k$ final centers. We first show that this is possible and then we prove that we do not need to recluster frequently. In fact, we will do it only when either a new point is added to the Meyerson sketch ($O(k \log^2 n)$ times) or when the number of points assigned to one of these elements of the Meyerson sketch doubles ($O(k \log n)$ events per sketch). Putting all of these ingredients together, we show that the number of times we need to fully recluster is at most $O(k \log^3 n)$ per phase, so that we have $O(k^2 \log^4 n)$ cluster changes in total.

5.2. The Meyerson sketch

We present the Meyerson sketch and prove some useful properties. We assume we have access to a lower bound $L$ on the cost of the optimal solution, such that $L^p \geq \beta\,\mathrm{opt}_p$, for some constant $0 \leq \beta \leq 1$. (We will remove this assumption later.) The algorithm then works in phases, such that at any time in phase $j$ we have $L^p \in [2^{j-1}, 2^j)$. So in each phase $j$ we can use the same lower bound $L^p_j = 2^{j-1}$ and still have $L^p_j \geq \frac{\beta}{2}\,\mathrm{opt}_p$. In each phase $j$ we create $2 \log n$ Meyerson sketches as described in Algorithm 1. Then we combine them into a single sketch as described in Algorithm 2.

Algorithm 1 Single Meyerson sketch
1: Input: A sequence of points $x_0, x_1, x_2, \ldots, x_n$; a finite $p$.
2: Output: A set $S$ that is a constant bi-criteria approximate solution for the k-clustering problem.
3: $S \leftarrow \emptyset$
4: Let $X$ be the set of points and let $L$ be such that $L^p \geq \gamma\,\mathrm{opt}_p(X)$, for some constant $\gamma > 0$
5: for $x \in X$ do
6:   if $S = \emptyset$ then
7:     $S \leftarrow \{x\}$
8:   else
9:     Let $\delta = d(x, S)^p$
10:    With probability $\min\left(\frac{\delta\,k\,(1+\log n)}{L^p},\, 1\right)$ add $x$ to $S$
11: Return $S$

For simplicity we first analyze the properties of a single Meyerson sketch. In particular, we give a bound on both the number of points selected by a single sketch and the quality of the approximation. The lemma generalizes the results in (Charikar et al., 2003; Meyerson, 2001) to all finite $p$ and follows the general structure of their proof, so it is deferred to the extended version of the paper.

Lemma 5.1. For a constant $\gamma \in (0, 1)$, with probability at least $\frac{1}{2}$ the set $S$ computed by Algorithm 1 has: (i) size at most $4k(1+\log n)\left(\frac{2^{2p+1}}{\gamma^p} + 1\right)$; (ii) $\mathrm{cost}_p(S)$ bounded by a constant factor times $\mathrm{opt}_p(X)$.
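As an illustration, here is a minimal runnable rendering of Algorithm 1; the explicit `dist`, `n`, and `L` arguments are assumptions made to keep the sketch self-contained.

```python
import math
import random

def single_meyerson_sketch(points, k, L, n, dist, p):
    """Sketch of Algorithm 1: the first point always enters S; every later
    point x enters with probability min(d(x, S)^p * k * (1 + log n) / L^p, 1).

    L is a lower bound on the optimum cost, n an upper bound on the number
    of points, dist the metric d.
    """
    S = []
    for x in points:
        if not S:
            S.append(x)
            continue
        delta = min(dist(x, s) for s in S) ** p   # delta = d(x, S)^p
        prob = min(delta * k * (1 + math.log(n)) / (L ** p), 1.0)
        if random.random() < prob:
            S.append(x)
    return S

# Example: points as tuples in the plane with Euclidean distance.
# euclid = lambda a, b: math.dist(a, b)
# S = single_meyerson_sketch([(0, 0), (1, 0), (5, 5)], k=2, L=1.0, n=3, dist=euclid, p=1)
```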
From Lemma 5.1 we know that with constant probability a single Meyerson sketch has size $O(k \log n)$ and contains a set of points that gives a good solution to our problem. Thus, if we construct $2 \log n$ single Meyerson sketches in parallel, at least one of them gives a constant approximation to the optimum at every point in time with probability at least $1 - O(n^{-1})$. This observation inspired the design of Algorithm 2, whose properties are formalized next.

Lemma 5.2. For a constant $\gamma \in (0, 1)$, with probability $1 - O(n^{-1})$ the set $M = \bigcup_{i=1}^{2\log n} M_i$ computed by Algorithm 2 has size at most $O(k \log^2 n)$ and $\mathrm{cost}_p(M) \leq 64\,\mathrm{opt}_p(X)$.

Proof. As mentioned above, Lemma 5.1 implies that if we construct $2 \log n$ single Meyerson sketches in parallel, with probability $1 - O(n^{-1})$ at least one of them gives a constant approximation to the optimum at every point in time. Furthermore, in total it contains only $4k(1+\log n)\left(\frac{2^{2p+1}}{\gamma^p} + 1\right)$ points. Now, in Algorithm 2 we are almost building $2 \log n$ Meyerson sketches; the only difference is that we stop adding points to a single sketch when it becomes too large. This modification does not change the probability that there exists at least one single sketch that gives a constant approximation to the optimum at every point in time and has at most $4k(1+\log n)\left(\frac{2^{2p+1}}{\gamma^p} + 1\right)$ points. Thus, with probability $1 - O(n^{-1})$, at least one of the sketches analyzed in Lemma 5.1 gives a constant approximation to the optimum at every point in time. Merging the other sketches into this sketch does not affect this property. Furthermore, the number of points in each sketch is explicitly bounded by $4k(1+\log n)\left(\frac{2^{2p+1}}{\gamma^p} + 1\right)$, so the total number of points in $M$ is bounded by $8k\log n(1+\log n)\left(\frac{2^{2p+1}}{\gamma^p} + 1\right)$.

Note that in some cases we do not need to recompute all the sketches from scratch, but only to update them; to this end we define a faster update procedure, described in Algorithm 3.

Algorithm 2 Compute Meyerson($X_t$, $\phi$)
1: Input: A sequence of points $X_t$, a lower bound $\phi$ on the optimum.
2: Output: $2 \log n$ independent Meyerson sketches $M_1, \ldots, M_{2\log n}$.
3: $L^p \leftarrow \phi / \gamma$
4: for $i \in [2\log n]$ do:  {initialize all Meyerson sketches}
5:   $M_i \leftarrow \{x_0\}$
6: for $x \in X_t$ do:
7:   for $i \in [2\log n]$ do:  {if $M_i$ is not too large, analyze $x$}
8:     if $|M_i| \leq 4k(1+\log n)\left(\frac{2^{2p+1}}{\gamma^p} + 1\right)$ then:
9:       Let $\delta = d(x, M_i)^p$
10:      $\hat{p} \leftarrow \min\left(\frac{\delta\,k\,(1+\log n)}{L^p},\, 1\right)$
11:      Add $x$ to $M_i$ with probability $\hat{p}$
12: Return $M_1, \ldots, M_{2\log n}$

Algorithm 3 Update Meyerson($M_1, \ldots, M_s$, $x_t$, $\phi$)
1: Input: A point $x_t$, a lower bound $\phi$ on the optimum, and $s$ independent Meyerson sketches $M_1, \ldots, M_s$.
2: Output: $s$ independent Meyerson sketches $M_1, \ldots, M_s$.
3: $L^p \leftarrow \phi / \gamma$
4: for $i \in [s]$ do:  {if $M_i$ is not too large, analyze $x_t$}
5:   if $|M_i| \leq 4k(1+\log n)\left(\frac{2^{2p+1}}{\gamma^p} + 1\right)$ then:
6:     Let $\delta = d(x_t, M_i)^p$
7:     $\hat{p} \leftarrow \min\left(\frac{\delta\,k\,(1+\log n)}{L^p},\, 1\right)$
8:     Add $x_t$ to $M_i$ with probability $\hat{p}$
9: Return $M_1, \ldots, M_s$

In the rest of the paper we refer to a single Meyerson sketch as $M_i$ and to their union as $M$.

5.3. From Meyerson to k clusters

Our next step is to show that the Meyerson sketch contains a subset of $k$ centers that gives an approximately optimal solution. We follow the approach of Guha et al. (2000) and show that by weighting the points in the Meyerson sketch with the number of original data points assigned to them, and then running a weighted k-clustering algorithm to recluster them into $k$ clusters, we can achieve a constant approximate solution.

Before formalizing this observation we introduce some additional notation. In the remainder of the section we denote the weight of a point $x$ in the Meyerson sketch by $w(x)$, the cost of the centers used in the Meyerson sketch by $\mathrm{cost}_M$, and the cost of the aforementioned weighted clustering instance by $\mathrm{cost}_L$. Finally, we refer to the optimal set of centers for the weighted k-clustering instance as $\hat{c}$.

We begin with two technical lemmas.

Lemma 5.3. For any constant $p \geq 1$, $\mathrm{cost}_p(X, \hat{c}) \leq 2^{p-1}\left(\mathrm{cost}_M + \mathrm{cost}_L\right)$.

Lemma 5.4. For any constant $p \geq 1$, $\mathrm{cost}_L \leq 2^{2p-1}\left(\mathrm{cost}_M + \mathrm{opt}_p\right)$.

Combining Lemmas 5.3 and 5.4, the following corollary follows.

Corollary 5.5. For any constant $p \geq 1$, $\mathrm{cost}_p(X, \hat{c}) \leq 2^{3p-1}\left(\mathrm{cost}_M + \mathrm{opt}_p\right)$.

We defer the proofs of Lemma 5.3 and Lemma 5.4 to the extended version of the paper. Those proofs are similar in spirit to those in (Bateni et al., 2014; Guha et al., 2000), but are generalized here for all $p$.

Thanks to Corollary 5.5 we know that the Meyerson sketch $M$ contains a good approximation for our problem. In the next subsection we show how to use this to obtain a solution for the consistency problem.
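Both the reclustering step just described and the main algorithm below repeatedly invoke a weighted k-clustering subroutine. As an illustration, here is a simplified single-swap local search in the spirit of the Gupta & Tangwongsan (2008) algorithm mentioned above; this is a sketch under the assumption that points are hashable (e.g. tuples), not their exact procedure.

```python
def weighted_cost(points, weights, centers, dist, p):
    """Weighted cost_p: each sketch point x contributes w(x) * d(x, centers)^p."""
    return sum(w * min(dist(x, c) for c in centers) ** p
               for x, w in zip(points, weights))

def local_search_weighted_k_clustering(points, weights, k, dist, p, max_rounds=100):
    """Repeatedly swap one center for one non-center while the weighted cost improves."""
    centers = list(points[:k])                      # arbitrary initial solution
    best = weighted_cost(points, weights, centers, dist, p)
    for _ in range(max_rounds):
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in centers:
                    continue
                trial = centers[:i] + [candidate] + centers[i + 1:]
                trial_cost = weighted_cost(points, weights, trial, dist, p)
                if trial_cost < best:
                    centers, best, improved = trial, trial_cost, True
        if not improved:
            break
    return centers
```

Running such a subroutine on the weighted instance $(M, w)$ rather than on all of $X_t$ is what keeps each reclustering cheap, since $|M| = O(k \log^2 n)$.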
Before doing this we define two algorithms that allow us to construct a weighted clustering instance starting from a Meyerson sketch (Algorithm 4) and to update the weights of a weighted instance (Algorithm 5).

Algorithm 4 Create Weighted Instance($M_1, \ldots, M_s$, $\phi$, $X_t$)
1: Input: A sequence of points $X_t$, a lower bound $\phi$ on the optimum, and $s$ independent Meyerson sketches $M_1, \ldots, M_s$.
2: Output: A weighted k-clustering instance $(M, w)$.
3: Let $M = \bigcup_i M_i$
4: Assign each point in $X_t$ to the closest point in $M$
5: Let $w(y)$ be the number of points assigned to $y \in M$
6: Return $(M, w)$

Algorithm 5 Update Weights($M$, $w$, $x$)
1: Input: A point $x$, the current weights $w$, and the Meyerson sketch $M$.
2: Output: A weighted k-clustering instance $(M, w)$.
3: Assign $x$ to the closest point in $M$
4: Let $m_x$ be the closest point to $x$ in $M$
5: $w(m_x) \leftarrow w(m_x) + 1$
6: Return $(M, w)$

5.4. The algorithm

We are now ready to formally state and prove the correctness of our main algorithm, which we present in Algorithm 6. The input of our algorithm is a sequence of points $x_0, x_1, x_2, \ldots, x_n$. Recall that we denote the prefix up to time $t$ by $X_t$, and the cost of the solution using centers $c$ by $\mathrm{cost}_p(X_t, c)$. Finally, we assume access to a $\gamma$-approximation algorithm $A$ for the weighted k-clustering problem for any constant $p$-norm (we can use, for example, the local search algorithm described by Gupta & Tangwongsan (2008)).

Algorithm 6 Consistent k-clustering algorithm
1: Input: A sequence of points $x_0, x_1, x_2, \ldots, x_n$.
2: Output: A sequence of centers $c_0, c_1, c_2, \ldots, c_n$.
3: Select the first $k$ points as centers: $c_0 \leftarrow \{x_0, x_1, x_2, \ldots, x_k\}$
4: $t \leftarrow 0$
5: while $\mathrm{cost}_p(c_0, X_t) = 0$ do:
6:   $c_t \leftarrow c_0$; Output $c_t$; $t \leftarrow t + 1$
7: $\phi \leftarrow 0$ {initialize lower bound to the optimum}; $M_1, \ldots, M_{2\log n} \leftarrow$ Compute Meyerson($X_0$, $\phi$)
8: $c \leftarrow \emptyset$; $s \leftarrow 2 \log n$
9: while $t \leq n$ do:
10:   Run $A$ on $X_t$ to get an approximate solution $c$
11:   if $\mathrm{cost}_p(X_t, c) \geq 2\phi$ then:  {new lower bound for $\phi$}
12:     $\phi \leftarrow \mathrm{cost}_p(X_t, c)$
13:     $M_1, \ldots, M_s \leftarrow$ Compute Meyerson($X_t$, $\phi$)
14:     $(M, w) \leftarrow$ Create Weighted Instance($M_1, \ldots, M_s$, $\phi$, $X_t$)
15:     Solve $(M, w)$ using algorithm $A$
16:     Let $c_t$ be the set of centers computed by $A$
17:   else:  {update Meyerson and recluster if needed}
18:     $M_1, \ldots, M_s \leftarrow$ Update Meyerson($M_1, \ldots, M_s$, $x_t$, $\phi$)
19:     Let $M = \bigcup_i M_i$
20:     if $x_t \in M$ then:  {$x_t$ was added to the Meyerson sketch}
21:       $(M, w) \leftarrow$ Create Weighted Instance($M_1, \ldots, M_s$, $\phi$, $X_t$)
22:       Solve $(M, w)$ using algorithm $A$
23:       Let $c_t$ be the set of centers computed by $A$
24:     else:
25:       $(M, w) \leftarrow$ Update Weights($M$, $w$, $x_t$)
26:       Let $m_t$ be the closest point to $x_t$ in $M$
27:       if $w(m_t)$ is a power of 2 then:  {weight of a point doubled}
28:         Solve $(M, w)$ using algorithm $A$
29:         Let $c_t$ be the set of computed centers
30:       else:
31:         $c_t \leftarrow c_{t-1}$
32:   Output $c_t$; $t \leftarrow t + 1$

We can now state our main theorem.

Theorem 5.6. For any constant $p \geq 1$, with probability $1 - O(n^{-1})$, Algorithm 6 returns a sequence of centers $c_0, c_1, c_2, \ldots, c_n$ such that at any point in time $t$, $\mathrm{cost}_p(X_t, c_t) \leq \alpha^p\,\mathrm{opt}_p(X_t)$ for a constant $\alpha$, and the total inconsistency factor of the solution is $O(k^2 \log^4 n)$.
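Algorithm 6 relies on two small pieces of bookkeeping, shown here as a minimal illustrative sketch: the "weight is a power of 2" test behind its last reclustering rule, and the consistency measure of Definition 3.1 that the theorem bounds.

```python
def is_power_of_two(m):
    """True when the integer weight m has just reached a power of two."""
    return m > 0 and (m & (m - 1)) == 0

def consistency_cost(center_history):
    """Sum over t of |c_t - c_{t-1}|: the beta-consistency measure of Definition 3.1.

    center_history is a list of center sets (e.g. sets of point ids).
    """
    return sum(len(set(cur) - set(prev))
               for prev, cur in zip(center_history, center_history[1:]))

# Example: no change at the second step, one center swapped at the third.
# consistency_cost([{1, 2}, {1, 2}, {1, 3}]) == 1
```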
Proof. We start by bounding the inconsistency factor, $\sum_{i=1}^{n-1} |c_{i+1} - c_i|$. During the execution of Algorithm 6 the set of centers changes if and only if one of the three following conditions is met: (i) the cost of the clustering on $X_t$ computed by $A$ increases by a factor of 2; (ii) we add a new point to a Meyerson sketch; (iii) a new point is assigned to a point $m_t$ of the Meyerson sketch, and the weight of $m_t$ is a power of 2 after this addition. Note that every time we change the centers, we fully recluster, and so increase the consistency factor by $k$ in the worst case. Therefore, to prove the theorem we need to show that these conditions are met at most $O(k \log^4 n)$ times in total.

From our assumptions we know that the spread of the point set is polynomial in $n$, which implies the same bound on the cost of the optimum solution. Therefore, the cost of the solution computed by $A$ doubles at most $O(\log n)$ times. For the same reason we update the lower bound $\phi$ at most $O(\log n)$ times during the execution of our algorithm. This in turn implies that we rebuild the Meyerson sketches from scratch at most $O(\log n)$ times. Given that we run $O(\log n)$ Meyerson sketches in parallel, during the execution of the algorithm we use at most $O(\log^2 n)$ Meyerson sketches. Furthermore, each Meyerson sketch has at most $O(k \log n)$ centers, thus in total we can add at most $O(k \log^3 n)$ points under condition (ii).

Finally, note that while a Meyerson sketch is fixed, the weight of every point in the sketch can only grow. In addition, the weight is always bounded by $n$, and therefore can double at most $\log n$ times per sketch point, resulting in $O(k \log^2 n)$ changes under a fixed Meyerson sketch. Therefore condition (iii) holds at most $O(k \log^4 n)$ times. So overall the three conditions are met at most $O(k \log^4 n)$ times, thus the algorithm is $O(k^2 \log^4 n)$-consistent.

To finish our proof we need to show that at any point in time our algorithm returns, with probability $1 - O(n^{-1})$, a constant approximation to the optimum. Note that by Corollary 5.5 we know that for any constant $p \geq 1$ the cost of a solution computed on the Meyerson sketch can be bounded by $\mathrm{cost}_p(X, \hat{c}) \leq 2^{3p-1}\left(\mathrm{cost}_M + \mathrm{opt}_p\right)$. From Lemma 5.2 we know that the Meyerson sketch guarantees, with probability $1 - O(n^{-1})$, that $\mathrm{cost}_M \leq 16\,\mathrm{opt}_p(X)$. So the cost of the optimal set of centers in the Meyerson sketch at any point in time is at most $O(\alpha^p\,\mathrm{opt}_p)$ w.h.p. for a constant $\alpha$.⁴ While we cannot compute the optimal set of centers in the Meyerson sketch, we can find an $O(p)$ approximation for every constant $p$ by relying on the local search algorithm of Gupta & Tangwongsan (2008). Therefore, every time we recompute the centers using $A$ we are sure that we obtain a constant approximation.

⁴We do not make an attempt to optimize the constant factors. As we show in the experimental section, in practice the algorithm gives a very good approximation.

Finally, it remains to show that when none of the three conditions is met, and we simply add a point to the solution without recomputing the centers, we retain an approximately optimal solution. By Lemma 5.3 we know that for any constant $p \geq 1$, $\mathrm{cost}_p(X, \hat{c}) \leq 2^{p-1}\left(\mathrm{cost}_M + \mathrm{cost}_L\right)$. Moreover, we can always bound the cost of the Meyerson sketch by $16\,\mathrm{opt}_p(X)$. It remains to bound $\mathrm{cost}_L$. Note that the number of points assigned to any point in $M$ did not double since the previous reclustering. Therefore, in the weighted reclustering formulation the weight of every point increased by a factor of less than 2, and so $\mathrm{cost}_L$ at this point is bounded by at most twice the $\mathrm{cost}_L$ computed when we last reclustered. Therefore, $\mathrm{cost}_p(X, \hat{c}) \leq 2^{4p-1}\left(\mathrm{cost}_M + \mathrm{opt}_p\right)$ and the current solution remains approximately optimal.

6. Optimizing Consistency

How many times do we need to change the centers to obtain a good k-clustering at any point in time? In Section 5 we presented an algorithm that is $O(k^2 \log^4 n)$-consistent, while in Subsection 3.1 we showed that at least $\Omega(k \log n)$ changes are needed (assuming that $\Delta$ is polynomial in $n$).
We give an existential result: we show that for any input sequence there exists a sequence of solutions that is constant approximate and $O(k \log^2 n)$-consistent. In the interest of space we defer the proof to the extended version of the paper.

Theorem 6.1. For any sequence $x_0, x_1, \ldots, x_n$ there exists a sequence of solutions $c_0, c_1, \ldots, c_n$ such that for all $i$, $\mathrm{cost}_p(X_i, c_i) \leq \alpha\,\mathrm{opt}_p(X_i)$ for some constant $\alpha$, and $\sum_i |c_{i+1} - c_i| = O(k \log^2 n)$.

7. Experiments

We demonstrate the efficacy of our algorithm by tracking both the quality of the solution and the number of reclusterings needed to maintain it on a number of diverse datasets. As we will show, the theoretical guarantees that we prove in the previous sections provide a loose bound on the number of reclusterings; in practice the number of times we recompute the solution grows logarithmically with time.

Data. We evaluate our algorithm on three datasets from the UCI Repository (Lichman, 2013) that vary in size and dimensionality. (i) SKINTYPE has 245,057 points in 4 dimensions. (ii) SHUTTLE has 58,000 points in 9 dimensions. (iii) COVERTYPE has 581,012 points in 54 dimensions. For each of the datasets we try values of $k$ in $\{10, 50, 100\}$, and observe that the qualitative results are consistent across datasets and values of $k$.

Algorithm Modifications. In the development of the algorithm we made a number of decisions to obtain high probability results. Key among them was to run $O(\log n)$ copies of the Meyerson sketch, since each sketch succeeds only with constant probability. We eschew this choice in the implementation, and maintain just a single sketch, at the cost of incurring a worse solution quality.

Metrics and results. The goal of this work is to give algorithms that maintain a good clustering, but only recluster judiciously, when necessary. To that end, we focus on two main metrics: the number of reclusterings and the solution quality.

Reclustering. We plot the number of reclusterings as a function of time for the three different datasets in Figure 1. Note that the x-axis is on log scale, and thus a straight line represents a number of reclusterings that grows logarithmically with time.

Figure 1. The number of reclusterings as a function of $t$ for three different datasets, and various values of $k$, plotted on a log scale. Observe that (i) the x-axis is on log scale, showing that after an initial warm up phase the algorithm does a logarithmic number of reclusterings, and (ii) the rate of reclustering is slightly higher for higher values of $k$.

Figure 2. The approximation ratio as a function of $t$ for three different datasets, and various values of $k$. Note the stair step pattern: when the algorithm chooses not to recluster, the approximation ratio slowly degrades; a reclustering step brings it down close to 1.

Qualitatively we make two observations across all datasets and values of $k$. First, the rate of reclustering (defined as the fraction of time steps at which the algorithm recomputes the solution) is approximately $\log n / n$, which tends to 0 as the dataset size grows. Further, the rate is higher for higher values of $k$, a fact also suggested by our theoretical analysis. Unlike the SHUTTLE and COVERTYPE datasets, the SKINTYPE dataset exhibits a change in behavior, where initially the reclustering rate is relatively high, but then it sharply drops after about $O(2k)$ steps. This is explained by the fact that the order of the points in this dataset is not random. Therefore, initially the algorithm reclusters at a high rate; once all parts of the input space have been explored, the rate of reclustering slows. When we run the algorithm on a randomly permuted instance of SKINTYPE, this phase transition in behavior disappears.
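As a sketch of how the two metrics above can be tracked, the snippet below counts reclusterings and compares the online cost against an offline k-means++ baseline. Using scikit-learn's KMeans with a single baseline run and the $p = 2$ objective is an assumption made for illustration; the paper compares against the best of ten k-means++ runs.

```python
import numpy as np
from sklearn.cluster import KMeans

def evaluate_step(points_so_far, online_centers, k, reclustered, counters):
    """Track the two experiment metrics at one time step.

    points_so_far: (t, dim) array of points seen so far (t >= k assumed);
    online_centers: (k, dim) array maintained by the consistent algorithm;
    counters: dict accumulating the number of reclusterings.
    Returns the approximation ratio of the online solution at this step.
    """
    if reclustered:
        counters["reclusterings"] += 1
    # Online cost under the p = 2 objective (squared distance to nearest center).
    dists = np.linalg.norm(points_so_far[:, None, :] - online_centers[None, :, :], axis=2)
    online_cost = float(np.sum(np.min(dists, axis=1) ** 2))
    # Offline k-means++ baseline; its inertia_ is the same squared-distance cost.
    baseline = KMeans(n_clusters=k, init="k-means++", n_init=1).fit(points_so_far)
    return online_cost / baseline.inertia_
```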
Approximation Ratio. We plot the approximation ratio of the solution (as compared to the best obtained by ten runs of k-means++ (Arthur & Vassilvitskii, 2007)) in Figure 2. For the SKINTYPE and COVERTYPE datasets, the approximation ratio stays relatively low, largely bounded by 4, after an initial period. A more careful examination of the plots shows exactly the times when the consistent algorithm allows the solution to degrade, and when it decides to recompute the solution from scratch. The latter are indicated by sharp drops in the approximation ratio, whereas the former are the relatively flat patterns. It is interesting to note that the additional points sometimes worsen the approximation (as indicated by the lines sloping upwards), but sometimes actually improve it. This is due to the fact that decisions made by the online algorithm balance optimality at that point in time with the potential location of points arriving in the future. The latter is most apparent in the $k = 100$ experiment on the SHUTTLE dataset. All of the datasets sometimes exhibit large fluctuations in the approximation ratio. This is an artifact of using a single Meyerson sketch, which fails to capture the structure of the points with small, but constant, probability.

8. Conclusions and Future Work

We introduced the notion of consistent clustering: a variant of online clustering which balances the need for maintaining an approximately optimal solution with the cost of reclustering. We proved an $\Omega(k \log n)$ lower bound, and gave algorithms for all k-clustering variants that come close to achieving this bound. The notion of quantifying the worst case number of changes necessary to maintain a constant approximate solution in an online setting is interesting to study in contexts other than k-clustering. For example, one can consider online graph problems, such as online matching and online densest subgraph, or other types of clustering problems, such as hierarchical or correlation clustering.

References

Anagnostopoulos, Aris, Bent, Russell, Upfal, Eli, and Hentenryck, Pascal Van. A simple and deterministic competitive algorithm for online facility location. Inf. Comput., 194(2):175–202, 2004.

Arthur, David and Vassilvitskii, Sergei. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 1027–1035, 2007. ISBN 978-0-898716-24-5.

Bansal, Nikhil, Buchbinder, Niv, Madry, Aleksander, and Naor, Joseph (Seffi). A polylogarithmic-competitive algorithm for the k-server problem. J. ACM, 62(5):40:1–40:49, November 2015. ISSN 0004-5411.

Bateni, Mohammad Hossein, Bhaskara, Aditya, Lattanzi, Silvio, and Mirrokni, Vahab S. Distributed balanced clustering via mapping coresets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2591–2599, 2014.

Charikar, Moses, O'Callaghan, Liadan, and Panigrahy, Rina. Better streaming algorithms for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, June 9-11, 2003, San Diego, CA, USA, pp. 30–39, 2003.

Charikar, Moses, Chekuri, Chandra, Feder, Tomás, and Motwani, Rajeev.
Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417–1440, 2004.

Czumaj, Artur, Lammersen, Christiane, Monemizadeh, Morteza, and Sohler, Christian. (1+ε)-approximation for facility location in data streams. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6-8, 2013, pp. 1710–1728, 2013.

Feldman, Jon, Mehta, Aranyak, Mirrokni, Vahab S., and Muthukrishnan, S. Online stochastic matching: Beating 1-1/e. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pp. 117–126, 2009.

Ferguson, Thomas S. Who solved the secretary problem? Statist. Sci., 4(3):282–289, 1989.

Fiat, Amos, Karp, Richard, Luby, Michael, McGeoch, Lyle, Sleator, Daniel, and Young, Neal E. Competitive paging algorithms. Journal of Algorithms, 12(4):685–699, 1991. doi: 10.1016/0196-6774(91)90041-V.

Fotakis, Dimitris. On the competitive ratio for online facility location. Algorithmica, 50(1):1–57, 2008.

Gonzalez, Teofilo F. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293–306, 1985.

Guha, Sudipto, Mishra, Nina, Motwani, Rajeev, and O'Callaghan, Liadan. Clustering data streams. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pp. 359–366, 2000.

Gupta, Anupam and Tangwongsan, Kanat. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.

Kanungo, Tapas, Mount, David M., Netanyahu, Nathan S., Piatko, Christine D., Silverman, Ruth, and Wu, Angela Y. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.

Karp, Richard M., Vazirani, Umesh V., and Vazirani, Vijay V. An optimal algorithm for on-line bipartite matching. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, May 13-17, 1990, Baltimore, Maryland, USA, pp. 352–358, 1990.

Kesselheim, Thomas, Kleinberg, Robert D., and Niazadeh, Rad. Secretary problems with non-uniform arrival order. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pp. 879–888, 2015.

Kleinberg, Robert D. A multiple-choice secretary algorithm with applications to online auctions. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2005, Vancouver, British Columbia, Canada, January 23-25, 2005, pp. 630–631, 2005.

Li, Shi and Svensson, Ola. Approximating k-median via pseudo-approximation. SIAM J. Comput., 45(2):530–547, 2016.

Liang, Percy and Klein, Dan. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pp. 611–619, 2009. ISBN 978-1-932432-41-1.

Liberty, Edo, Sriharsha, Ram, and Sviridenko, Maxim. An algorithm for online k-means clustering. In Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments, ALENEX 2016, Arlington, Virginia, USA, January 10, 2016, pp. 81–89, 2016.

Lichman, M. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml.

Mahdian, Mohammad and Yan, Qiqi. Online bipartite matching with random arrivals: An approach based on strongly factor-revealing LPs. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, pp. 597–606, 2011.
Manasse, Mark S., McGeoch, Lyle A., and Sleator, Daniel Dominic. Competitive algorithms for server problems. J. Algorithms, 11(2):208–230, 1990.

Meyerson, Adam. Online facility location. In 42nd Annual Symposium on Foundations of Computer Science, FOCS 2001, 14-17 October 2001, Las Vegas, Nevada, USA, pp. 426–431, 2001.

Sleator, Daniel Dominic and Tarjan, Robert Endre. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2):202–208, 1985.