Clustering High Dimensional Dynamic Data Streams

Vladimir Braverman 1, Gereon Frahling 2, Harry Lang 1, Christian Sohler 3, Lin F. Yang 1

1 Johns Hopkins University, USA; 2 Linguee GmbH; 3 TU Dortmund. Correspondence to: Lin F. Yang, Christian Sohler.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We present data streaming algorithms for the k-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots, \Delta\}^d$. Our algorithms use $k\epsilon^{-2}\,\mathrm{poly}(d\log\Delta)$ space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of $k$ centers the cost of the coreset $(1+\epsilon)$-approximates the cost of the streamed point set. We also provide algorithms that guarantee only positive weights in the coreset, at the price of additional logarithmic factors in the space and time complexities. This positively-weighted coreset can be fed to any efficient offline k-median algorithm to compute a $(1+\epsilon)$-approximation for the k-median problem. All previous algorithms for computing a $(1+\epsilon)$-approximation for the k-median problem over dynamic data streams required space and time exponential in $d$. Our algorithms can be generalized to metric spaces of bounded doubling dimension.

1 Introduction

The analysis of very large data sets remains a major challenge, particularly when we would like to obtain information from data sets that arrive in the form of a data stream, for example streams of updates to a database system, internet traffic, or measurements of scientific experiments in astro- or particle physics (e.g. (Liu et al., 2015)). In such scenarios it is difficult, and sometimes even impossible, to store the data. Therefore, we need algorithms that process the data sequentially and maintain a summary of the data using space much smaller than the size of the stream. Such algorithms are often called streaming algorithms (for an introduction to streaming algorithms, see (Muthukrishnan, 2005)).

One fundamental technique in data analysis is clustering. The idea is to group data into clusters such that data inside the same cluster is similar and data in different clusters is different. Center-based clustering algorithms also provide for each cluster a cluster center, which may act as a representative of the cluster. Data is often represented as vectors in $\mathbb{R}^d$, and similarity between data points is often measured by the Euclidean distance. Clustering has many applications ranging from data compression to unsupervised learning.

In this paper we are interested in clustering problems over dynamic data streams, i.e. data streams that consist of updates, for example, to a database. Our stream consists of insert and delete operations of points from $\{1, \ldots, \Delta\}^d$. We assume that the stream is consistent, i.e. there are no deletions of points that are not in the point set and no insertions of points that are already in the point set. We consider the k-median clustering problem, which for a given set of points $P \subseteq \mathbb{R}^d$ asks to compute a set $C$ of $k$ points that minimizes the sum of distances of the input points to their nearest points in $C$.

1.1 Our Results

We develop the first $(1+\epsilon)$-approximation algorithm for the k-median clustering problem in dynamic data streams that uses space polynomial in the dimension of the data.
To the best of our knowledge, all previous algorithms required space exponential in the dimension. Formally, our main theorem states:

Theorem 1.1 (Main Theorem). Fix $\epsilon \in (0, 1/2)$ and positive integers $k$ and $\Delta$. Algorithm 1 makes a single pass over the dynamic streaming point set $P \subseteq [\Delta]^d$ and outputs a weighted set $S$ such that, with probability at least 0.99, $S$ is an $\epsilon$-coreset for k-median of size $O(kd^4L^4/\epsilon^2)$, where $L = \log\Delta$. The algorithm uses $O(kd^7L^7/\epsilon^2)$ bits in the worst case, processes each update in time $O(dL^2)$, and outputs the coreset in time $\mathrm{poly}(d, k, L, 1/\epsilon)$ after one pass over the stream.

The theorem is restated in Theorem 3.6 and the proof is presented in Section 3.3. The coreset we construct may contain negatively weighted points, so naïve offline algorithms do not apply directly to finding k-clustering solutions on the coreset. We therefore also provide an alternative approach that outputs a coreset with only non-negative weights. The new algorithm is slightly more complicated, and its space complexity and coreset size are slightly worse than those of the construction with negative weights, but they remain polynomial in $d$, $1/\epsilon$ and $\log\Delta$, and optimal in $k$ up to a $\mathrm{polylog}\,k$ factor.

Theorem 1.2 (Alternative Results). Fix $\epsilon \in (0, 1/2)$ and positive integers $k$ and $\Delta$. Algorithm 6 makes a single pass over the streaming point set $P \subseteq [\Delta]^d$ and outputs a weighted set $S$ with a non-negative weight for each point, such that with probability at least 0.99, $S$ is an $\epsilon$-coreset for k-median of size $O(kd^4L^4/\epsilon^2)$. The algorithm uses $O(kd^8L^8/\epsilon^2)$ bits in the worst case. For each update of the input, the algorithm needs $\mathrm{poly}(d, 1/\epsilon, L, \log k)$ time to process, and it outputs the coreset in time $\mathrm{poly}(d, k, L, 1/\epsilon)$ after one pass over the stream.

The theorem is restated as Theorem 4.3 in Section 4 and the proof is presented there. Both approaches can easily be extended to maintain a coreset over a general metric space.

1.2 Our Techniques

At a high level, both algorithms can be viewed as a combination of ideas introduced by Frahling and Sohler (Frahling & Sohler, 2005) with the coreset construction by Chen (Chen, 2009). To explain our high-level idea, we first summarize Chen's construction (Chen, 2009). It first obtains an $(\alpha, \beta)$-bi-criterion solution, i.e. a set of at most $\alpha k$ centers such that the k-median cost to these $\alpha k$ centers is at most $\beta\,\mathrm{OPT}$, where $\mathrm{OPT}$ is the optimal cost of a k-median solution. Around each of the $\alpha k$ points, the construction builds logarithmically many concentric ring regions and samples points from these rings. Inside each ring, the distance from a point to its center is upper and lower bounded, so the contribution of the ring's points to the optimal cost is lower bounded by the number of points times the inner diameter of the ring. More precisely, the construction requires a partition of the original data points into $O(\alpha k)$ sets satisfying the following property: the partition $P_1, P_2, \ldots, P_{k'}$ satisfies $\sum_i |P_i|\,\mathrm{diam}(P_i) \le \beta\,\mathrm{OPT}$. A set of points is then sampled from each part to estimate the cost of an arbitrary k-set from $[\Delta]^d$ up to an additive error of $O(\epsilon\,|P_i|\,\mathrm{diam}(P_i)/\beta)$. Combining the samples over all parts gives an additive error of at most $\epsilon\,\mathrm{OPT}$, and therefore an $\epsilon$-coreset.

The first difficulty in generalizing Chen's construction to dynamic streams is that it depends on first computing approximate centers, which at first glance seems to require two passes. Surprisingly (since we would like the space to be polynomial in $d$), we can resolve this difficulty using a grid-based construction.
The grid structure can be viewed as a $2^d$-ary tree of cells. The root level of the tree is a single cell containing the entire set of points. Going down a level through the tree, each parent cell is split evenly into $2^d$ subcells. In total there are $\log_2\Delta$ grid levels, and each cell of the finest level contains at most a single point.

Without using any information from a pre-computed $(\alpha, \beta)$-bi-criterion solution to the k-median problem, as (Chen, 2009) does, our first idea (similar to an idea used in (Indyk, 2004)) is to use a randomly shifted grid: we shift each coordinate of the root-level cell by a random value $r$, chosen uniformly from $\{1, 2, \ldots, \Delta\}$, and define the tree by splitting cells into $2^d$ subcells recursively. We show that with high probability, in each level at most $O(k)$ cells are close to (or contain) a center of an optimal k-median solution. Each of the remaining cells cannot contain too many points, since otherwise it would contribute too much to the cost of the optimal solution (each point in such a cell is far away from every optimal center). We call the cells of a level that contain too many points heavy cells. The immediate non-heavy children of the heavy cells (i.e. the cells that are not heavy but have heavy parents) form a partition of the entire point set. Let $C_1, C_2, \ldots, C_{k'}$ be these cells; it follows immediately that $\sum_i |C_i|\,\mathrm{diam}(C_i) \le \beta\,\mathrm{OPT}$ for some $\beta = O(d^{3/2})$. If we can identify the heavy cells (e.g. using heavy hitter algorithms) and sample points from their immediate non-heavy children in a dynamic stream, we obtain a construction similar to Chen's (Chen, 2009).

Our second idea allows us to significantly reduce the space requirements and also simplifies the sampling. For each point $p$, the cells containing it form a path in the grid tree, and we write each point as a telescope sum over the cell centers on this path (recall that the grids are nested and $c_0$ is the center of the root cell). For example, let $c_0, c_1, \ldots, c_L$ be the cell centers along the path, where $c_L = p$, and let $0$ denote the zero vector. Then
$$p = (c_L - c_{L-1}) + (c_{L-1} - c_{L-2}) + \cdots + (c_0 - 0) + 0.$$
In this way, we can represent the distance from a point to a set of points via the distances of cell centers to that set. For example, let $Z \subseteq [\Delta]^d$ be a set of points and let $d(p, Z)$ be the distance from $p$ to the closest point in $Z$. Then
$$d(p, Z) = \big(d(c_L, Z) - d(c_{L-1}, Z)\big) + \big(d(c_{L-1}, Z) - d(c_{L-2}, Z)\big) + \cdots + d(0, Z).$$
Thus we can decompose the cost of a set $Z$ into $L+2$ levels: the cost at level $l \in [0, L]$ is $\sum_p \big(d(c^l_p, Z) - d(c^{l-1}_p, Z)\big)$, where $c^l_p$ is the center of the cell containing $p$ at level $l$, and the cost at the $(-1)$-st level is $|P|\,d(0, Z)$, where $P$ is the entire point set. Since $|d(c^l_p, Z) - d(c^{l-1}_p, Z)|$ is bounded by the cell diameter of the level, we can sample points from the non-heavy cells of an entire level and guarantee that the cost of that level is well approximated. Notice that (a) we do not need to sample $O(k)$ points from every part of the partition, which saves a factor of $k$ in space, and (b) we do not need to sample the actual points but only an estimate of the number of points in each cell, which makes the sampling much easier (there is no need to store the sampled points).
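To make the telescope-sum bookkeeping concrete, here is a small self-contained sketch (our own illustration, not code from the paper; all names are ours) that builds the chain of nested cell centers for one point and checks that the per-level differences sum back to $d(p, Z)$ exactly:

```python
import numpy as np

def cell_center(p, level, shift, delta):
    """Center of the level-`level` grid cell containing p.
    Level-i cells have side length delta / 2**i; `shift` is the random
    translation of the grid shared by all levels."""
    side = delta / (2 ** level)
    idx = np.floor((p - shift) / side)           # cell index along each axis
    return shift + (idx + 0.5) * side            # geometric center of the cell

def dist(q, Z):
    """Distance from q to the nearest center in Z."""
    return min(np.linalg.norm(q - z) for z in Z)

rng = np.random.default_rng(0)
d, delta = 3, 16
L = int(np.log2(delta))
shift = rng.uniform(0.0, 1.0, size=d)
p = rng.integers(1, delta + 1, size=d).astype(float)
Z = [rng.integers(1, delta + 1, size=d).astype(float) for _ in range(2)]

# c_{-1} = 0, then c_0 .. c_{L-1} are cell centers, and c_L = p itself.
centers = [np.zeros(d)] + [cell_center(p, i, shift, delta) for i in range(L)] + [p]
terms = [dist(centers[0], Z)] + [dist(centers[i], Z) - dist(centers[i - 1], Z)
                                 for i in range(1, len(centers))]
assert abs(sum(terms) - dist(p, Z)) < 1e-9       # telescoping cancels exactly
# Each difference term is bounded by the diameter of the coarser cell,
# which is the property the per-level sampling exploits.
```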
In the above construction we obtain a coreset, but the weights can be negative due to the telescope sum, and it is not easy to find an offline k-median algorithm that computes a solution from a negatively-weighted coreset. To remove the negative weights, we need to adjust the weights of cells. But the cells with a small number of points (compared to the heavy cells) are problematic: the sampling-based empirical estimates of the number of points in them have too much error to be adjusted. In our second construction, we remove all the negative weights. The major difference is that we introduce a cut-off in the telescope sum. For example,
$$d(p, Z) = \big(d(c^L_p, Z) - d(c^{l(p)}_p, Z)\big) + \big(d(c^{l(p)}_p, Z) - d(c^{l(p)-1}_p, Z)\big) + \cdots + d(0, Z),$$
where $l(p)$ is a cutoff level of point $p$ such that the cell containing $p$ at level $l(p)$ is heavy but no longer heavy at level $l(p)+1$. We then sample each point $p$ with a probability defined according to $l(p)$. In other words, we only sample points from heavy cells and not from non-heavy ones. Since a heavy cell contains enough points, the sampling-based estimate of its number of points is accurate enough, which allows us to adjust all weights to be positive.

Finally, to handle insertions and deletions, we use an $F_2$ heavy hitter algorithm to identify the heavy cells. We use pseudo-random hash functions (e.g. Nisan's construction (Nisan, 1992; Indyk, 2000b) or k-wise independent hash functions) to do the sampling, and we use a K-Set data structure (Ganguly, 2005) to store the sampled points in the dynamic stream.

1.3 Related Work

There is a rich history of geometric problems studied in the streaming model. Excellent examples include approximating the diameter of a point set (Feigenbaum et al., 2005; Indyk, 2003), approximately maintaining the convex hull (Cormode & Muthukrishnan, 2003; Hershberger & Suri, 2004), the min-volume bounding box (Chan, 2004; Chan & Sadjad, 2006), and maintaining ϵ-nets and ϵ-approximations of a data stream (Bagchi et al., 2007). Clustering is another interesting and popular geometric problem studied in the streaming model. There has been a long line of work on clustering data streams for the k-median and k-means problems based on coresets (Har-Peled & Mazumdar, 2004; Har-Peled & Kushal, 2005; Chen, 2009; Feldman et al., 2007; 2013; Feldman & Langberg, 2011). Additionally, (Charikar et al., 1997; Guha et al., 2000; Meyerson, 2001) studied the problem in more general metric spaces. The currently best known algorithm for the k-median problem in this setting is an $O(1)$-approximation using $O(k\,\mathrm{polylog}\,n)$ space (Charikar et al., 2003). However, none of the above methods works for dynamic streams.

The works most relevant to ours are those by Indyk (Indyk, 2004), Indyk & Price (Indyk & Price, 2011), and Frahling & Sohler (Frahling & Sohler, 2005). Indyk (Indyk, 2004) introduced the model of dynamic geometric data streams. He studied algorithms for (the weight of) minimum weighted matching, minimum bichromatic matching, minimum spanning tree, and k-median clustering, and gave an exhaustive-search $(1+\epsilon)$-approximation algorithm for k-median as well as an $(\alpha, \beta)$-bi-criterion approximation algorithm. Indyk & Price (Indyk & Price, 2011) studied the problem of sparse recovery under the Earth Mover Distance (EMD), and showed a novel connection between the EMD sparse recovery problem and k-median clustering on a two-dimensional grid.
The work most closely related to ours is that of Frahling & Sohler (Frahling & Sohler, 2005), who developed streaming $(1+\epsilon)$-approximation algorithms for k-median as well as other problems over dynamic geometric data streams. All previous constructions for higher-dimensional grids require space exponential in the dimension $d$.

2 Preliminaries

For integers $a \le b$, we denote $[a] := \{1, 2, \ldots, a\}$ and $[a, b] := \{a, a+1, \ldots, b\}$ for integer intervals. We consider a point set $P$ from the Euclidean space $\{1, \ldots, \Delta\}^d$. Without loss of generality, we always assume $\Delta$ is of the form $2^L$ for some integer $L$, since otherwise we can pad the space at the cost of at most a factor of 2. Our streaming algorithm processes insertions and deletions of points from this space. We study the k-median problem, which is to minimize $\mathrm{cost}(P, Z) = \sum_{p \in P} d(p, Z)$ over all sets $Z$ of $k$ centers from $\mathbb{R}^d$, where $d(p, q)$ denotes the Euclidean distance between $p$ and $q$, and $d(p, Z)$ for a set of points $Z$ denotes the distance of $p$ to the closest point in $Z$. The following definition is from (Har-Peled & Mazumdar, 2004).

Definition 2.1. Let $P \subseteq [\Delta]^d$ be a point set. A small weighted set $S$ is called an $\epsilon$-coreset for the k-median problem if for every set of $k$ centers $Z \subseteq [\Delta]^d$ we have¹
$$(1-\epsilon)\,\mathrm{cost}(P, Z) \le \mathrm{cost}(S, Z) \le (1+\epsilon)\,\mathrm{cost}(P, Z),$$
where $\mathrm{cost}(S, Z) := \sum_{s \in S} \mathrm{wt}(s)\,d(s, Z)$ and $\mathrm{wt}(s)$ is the weight of point $s \in S$.

¹For simplicity of the presentation, we define the coreset for all sets of $k$ centers $Z \subseteq [\Delta]^d$, but it can be generalized to all sets of $k$ centers $Z \subseteq \mathbb{R}^d$ with an additional $\mathrm{polylog}(1/\epsilon)$ factor in the space. We discuss this point further in Section 6.

Throughout the paper, we assume the parameters $\epsilon, \rho, \delta \in (0, \tfrac{1}{2})$ unless otherwise specified. For our algorithms and constructions we define a nested grid with $L$ levels, in the following manner.

Definition of grids. Let $v = (v_1, \ldots, v_d)$ be a vector chosen uniformly at random from $[0, 1]^d$. Partition the space $\{1, \ldots, \Delta\}^d$ into a regular Cartesian grid $G_0$ with side length $\Delta$, translated so that a vertex of this grid falls on $v$. Each cell of this grid can be expressed as $[v_1 + n_1\Delta, v_1 + (n_1+1)\Delta) \times \cdots \times [v_d + n_d\Delta, v_d + (n_d+1)\Delta)$ for some $(n_1, \ldots, n_d) \in \mathbb{Z}^d$. For $i \ge 1$, define the regular grid $G_i$ as the grid with side length $\Delta/2^i$, aligned such that each cell of $G_{i-1}$ contains $2^d$ cells of $G_i$. The finest grid is $G_L$ where $L = \log_2\Delta$; the cells of this grid therefore have side length at most 1 and thus contain at most a single input point. Each grid forms a partition of the point set $P$. There is a $2^d$-ary tree such that each vertex at depth $i$ corresponds to a cell in $G_i$, and this vertex has $2^d$ children, which are the cells of $G_{i+1}$ that it contains. For convenience, we define $G_{-1}$ as the level containing a single cell $C_{-1}$ that covers the entire dataset. For each cell $C$, we also treat it as a subset of the input points (i.e. $C \subseteq P$) when there is no confusion. We denote by $Z^* \subseteq [\Delta]^d$ the optimal solution for k-median and by $\mathrm{OPT}$ the optimal cost of $Z^*$. The proof of the following lemma is deferred to Section A.

Lemma 2.2. Fix a set $Z \subseteq [\Delta]^d$. Then with probability at least $1-\rho$, for every level $i \in [0, L]$, the number of cells that satisfy $d(C, Z) \le \Delta/(2^{i+1}d)$ is at most $e|Z|(L+1)/\rho$.
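The nested, randomly shifted grid needs no explicit tree: a single random translation vector determines every level. Below is a minimal sketch (ours; the function names are hypothetical) mapping a point to a hashable cell index at each level, which is the representation a streaming algorithm can feed to sketches:

```python
import numpy as np

def grid_cell_ids(p, shift, L):
    """Return, for levels 0..L, a hashable id of the cell of G_i containing p.
    G_i has side length Delta / 2**i; all levels share one random shift,
    so the cells are nested: the level-i index is the level-(i+1) index
    integer-divided by 2 in each coordinate."""
    delta = 2 ** L
    ids = []
    for i in range(L + 1):
        side = delta / 2 ** i
        idx = tuple(int(v) for v in np.floor((p - shift) / side))
        ids.append((i, idx))
    return ids

rng = np.random.default_rng(1)
d, L = 2, 4                       # Delta = 16
shift = rng.uniform(0, 1, size=d)
p = np.array([5.0, 11.0])
for level, idx in grid_cell_ids(p, shift, L):
    print(level, idx)             # indices refine as the level grows
```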
2.1 Outline

In Section 3, we introduce the coreset with negative weights. In Section 4, we introduce a modified construction with all positive weights. Section 6 contains the final remarks.

3 Generally Weighted Coreset

In this section, we present our generally weighted coreset construction. In Section 3.1, we introduce the telescope-sum representation of a point $p$ and the coreset framework. In Section 3.2, we illustrate our coreset framework with an offline construction. In Section 3.3, we present a one-pass streaming algorithm that implements our coreset framework.

3.1 The Telescope Sum and Coreset Framework

Our first technical idea is to write each point as a telescope sum. We may interpret this sum as replacing a single point by a set of points in the following way: each term $(p' - q')$ of the sum can be viewed as a pair of points $p'$ and $q'$, where $p'$ has weight $1$ and $q'$ has weight $-1$. The purpose of this construction is that the contribution of each term is bounded, which we exploit later when we introduce and analyze our sampling procedure.

We now define the telescope sum, which relates to our nested grids. For each $C \in G_i$, denote by $c(C)$ (or simply $c$) its center. For each point $p \in P$, define $C(p, i)$ as the cell containing $p$ in $G_i$, and let $c^i_p$ be the center of $C(p, i)$. Then we can write
$$p = c^{-1}_p + \sum_{i=0}^{L} \big(c^i_p - c^{i-1}_p\big),$$
where we set $c^{-1}_p = 0$ (we also call this the cell center of the $(-1)$-st level for convenience). The purpose of this becomes clear when we consider the distance of $p$ to an arbitrary set of $k$ centers $Z \subseteq [\Delta]^d$: we can write the cost of a single point $p$ as
$$d(p, Z) = d(c^{-1}_p, Z) + \sum_{i=0}^{L} \big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big).$$
Note that $c^L_p = p$ since the cells of $G_L$ contain a single point. Thus the cost of the entire set, $\mathrm{cost}(P, Z)$, can be written as
$$\mathrm{cost}(P, Z) = \sum_{i=0}^{L}\sum_{p \in P} \big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big) + \sum_{p \in P} d(c^{-1}_p, Z). \quad (1)$$

As one can see, we have transformed the cost defined using the original set of points into a cost defined using cell centers. To estimate the cost, it remains to estimate each of the terms $\sum_{p \in P} \big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big)$ for $i \in [0, L]$ and $\sum_{p \in P} d(c^{-1}_p, Z)$; in other words, to assign weights to the centers of the grid cells. For $i \in [0, L]$ and a cell $C \in G_i$, denote by $C_P$ the parent cell of $C$ in grid $G_{i-1}$. We can rewrite the cost term as follows:
$$\mathrm{cost}(G_i, Z) := \sum_{p \in P}\big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big) = \sum_{C \in G_i}\sum_{p \in C}\big(d(c(C), Z) - d(c(C_P), Z)\big) = \sum_{C \in G_i}|C|\,\big(d(c(C), Z) - d(c(C_P), Z)\big) = \sum_{C \in G_i}|C|\,d(c(C), Z) - \sum_{C \in G_{i-1}}\Big(\sum_{C' \in G_i:\, C' \subseteq C}|C'|\Big)\,d(c(C), Z). \quad (2)$$
For $i = -1$, we set $\mathrm{cost}(G_{-1}, Z) = |P|\,d(0, Z)$. This leads to the following coreset construction framework.

Generally Weighted Construction. The coreset $S$ in the construction is composed of a weighted subset of centers of grid cells. The procedure of the construction is to assign some (integer) value to each cell center: we maintain an integer-valued function $\widehat{|\cdot|}$ on cells (using a small amount of space), and $\widehat{|C|}$ is called the value of the cell $C$. Let $c$ be the center of a cell $C \in G_i$ with $i < L$; the weight for $c$ is
$$\mathrm{wt}(c) = \widehat{|C|} - \sum_{C' \in G_{i+1}:\, C' \subseteq C} \widehat{|C'|}. \quad (3)$$
For the $L$-th grid $G_L$, the weight for the center of each cell $C$ is just $\widehat{|C|}$. Note that some cells may receive negative weights. As a naïve example, set $\widehat{|C|} := |C|$, the exact number of points in cell $C$. Then every cell in every level except $G_L$ has weight 0; in other words, we have stored the entire point set as the coreset. As we will show, if we allow $\widehat{|C|}$ to approximate $|C|$ up to an additive error, we can compress the number of non-zero weighted centers to a much smaller number.
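The weight assignment of Equation (3) is mechanical once the cell values are known. The sketch below (ours, not the paper's implementation) computes the weights from a dictionary of per-level values and illustrates the naïve exact-count example, where every weight above the bottom level vanishes:

```python
from collections import defaultdict

def assign_weights(values):
    """values: dict mapping (level, cell_id) -> estimated point count.
    cell_id at level i+1 refines level i by floor-dividing each coordinate
    by 2 (nested grids). Returns the nonzero weights per cell center,
    following Equation (3)."""
    weights = {}
    child_sum = defaultdict(int)
    max_level = max(lvl for lvl, _ in values)
    for (lvl, cid), v in values.items():          # accumulate children values
        if lvl > 0:
            parent = (lvl - 1, tuple(c // 2 for c in cid))
            child_sum[parent] += v
    for (lvl, cid), v in values.items():
        w = v if lvl == max_level else v - child_sum[(lvl, cid)]
        if w != 0:
            weights[(lvl, cid)] = w
    return weights

# With exact counts, only bottom-level cells keep nonzero weight:
vals = {(0, (0,)): 3, (1, (0,)): 2, (1, (1,)): 1,
        (2, (0,)): 1, (2, (1,)): 1, (2, (2,)): 1}
print(assign_weights(vals))   # {(2, (0,)): 1, (2, (1,)): 1, (2, (2,)): 1}
```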
Definition 3.1. Given a grid structure and a real-valued function $\widehat{|\cdot|}$ on the set of cells, we define a function $\widehat{\mathrm{cost}}$ over grid levels and center sets $Z \subseteq [\Delta]^d$ as follows: for $i \in [0, L]$,
$$\widehat{\mathrm{cost}}(G_i, Z) := \sum_{C \in G_i}\widehat{|C|}\,d(c(C), Z) - \sum_{C \in G_{i-1}}\Big(\sum_{C' \in G_i:\, C' \subseteq C}\widehat{|C'|}\Big)\,d(c(C), Z), \quad (4)$$
and $\widehat{\mathrm{cost}}(G_{-1}, Z) = \widehat{|C_{-1}|}\,d(0, Z)$, where $C_{-1}$ is the cell in $G_{-1}$ containing the entire set of points.

Lemma 3.2. Fix an integer-valued function $\widehat{|\cdot|}$ on the set of cells and a parameter $0 < \epsilon < \tfrac{1}{2}$. Let $S$ be the set of all cell centers with weights assigned by Equation (3). If $\widehat{|C_{-1}|} = |P|$ (recall that $C_{-1}$ is the single cell containing the entire dataset) and for every $Z \subseteq [\Delta]^d$ with $|Z| \le k$ and every $i \in [0, L]$,
$$\big|\mathrm{cost}(G_i, Z) - \widehat{\mathrm{cost}}(G_i, Z)\big| \le \frac{\epsilon\,\mathrm{OPT}}{L+1},$$
then $S$ is an $\epsilon$-coreset for k-median.

Proof. Given an arbitrary set of centers $Z \subseteq [\Delta]^d$,
$$\mathrm{cost}(S, Z) = \sum_{s \in S}\mathrm{wt}(s)\,d(s, Z) = \sum_{i \in [0, L]}\Big[\sum_{C \in G_i}\widehat{|C|}\,d(c(C), Z) - \sum_{C \in G_{i-1}}\Big(\sum_{C' \in G_i:\, C' \subseteq C}\widehat{|C'|}\Big)\,d(c(C), Z)\Big] + |P|\,d(0, Z) = \sum_{i \in [0, L]}\widehat{\mathrm{cost}}(G_i, Z) + |P|\,d(0, Z).$$
Since $\mathrm{cost}(P, Z) = \sum_{i \in [0, L]}\mathrm{cost}(G_i, Z) + |P|\,d(0, Z)$ by Equation (1), the per-level error bounds give $|\mathrm{cost}(S, Z) - \mathrm{cost}(P, Z)| \le \epsilon\,\mathrm{OPT}$. □

3.2 An Offline Construction

In this section, we assume we have a $(10, 10)$-bi-criterion approximation to k-median: let $Z' = \{z'_1, z'_2, \ldots, z'_{10k}\}$ be its centers and let $o$ be a cost estimate satisfying $\mathrm{OPT} \le o \le 10\,\mathrm{OPT}$. Such a solution can be obtained using (Indyk, 2000a). We now show how to construct the coreset based on the framework described in the last section.

An Offline Construction. At level $G_{-1}$, we sample every point with probability $\pi_{-1} = 1$ (i.e. count the number of points exactly) and set $\widehat{|C_{-1}|} := |P|$. For each level $i \in [0, L]$, we pick the set of all cells $C$ satisfying $d(C, Z') \le W/(2d)$, where $W$ is the side length of cells at level $i$; denote this set of cells by $\mathcal{C}_{Z'}$. We count the number of points in each of these cells exactly and set $\widehat{|C|} := |C|$. For the points in the remaining cells, for each $i \in [0, L]$, we sample points uniformly and independently with probability
$$\pi_i = \min\Big\{\frac{200(L+1)^2 d^2 \Delta}{2^i \epsilon^2 o}\,\ln\frac{2(L+1)\Delta^{kd}}{\rho},\ 1\Big\}.$$
Denote by $S_i$ the set of sampled points at level $i$. For each $C \notin \mathcal{C}_{Z'}$, set $\widehat{|C|} := |S_i \cap C|/\pi_i$. Then, from the bottom level to the top level, we assign weights to the cell centers of each of the cells and their parent cells with non-zero $\widehat{|C|}$ using (3). Denote by $S$ the coreset, which contains the set of cell centers of non-zero weight.
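For intuition, the offline construction can be prototyped in a few lines. The sketch below is our own illustration under simplifying assumptions: it tests cell proximity to $Z'$ by a center-distance proxy rather than the exact cell-to-set distance, and it drops the union-bound term from $\pi_i$, so the constants are not the paper's.

```python
import numpy as np

def offline_values(P, Zprime, o, levels, shift, delta, eps, sample_const=200.0):
    """Per-level cell values hat|C| for the offline construction.
    Cells near a bi-criterion center are counted exactly; points elsewhere
    are sampled i.i.d. with probability pi_i, each sample contributing
    1/pi_i to its cell's value."""
    rng = np.random.default_rng(7)
    d = P.shape[1]
    values = [dict() for _ in range(levels + 1)]
    for i in range(levels + 1):
        side = delta / 2 ** i
        # crude level-i rate; the paper's rate also carries the
        # ln((L+1) Delta^{kd} / rho) union-bound term, omitted in this toy
        pi = min(1.0, sample_const * d * d * delta / (2 ** i * eps ** 2 * o))
        for p in P:
            cid = tuple(int(v) for v in np.floor((p - shift) / side))
            center = shift + (np.array(cid) + 0.5) * side
            near = min(np.linalg.norm(center - z, ord=np.inf) for z in Zprime)
            if near <= side / 2 + side / (2 * d):     # "close" cell: exact count
                values[i][cid] = values[i].get(cid, 0.0) + 1.0
            elif rng.random() < pi:                   # far cell: sampled count
                values[i][cid] = values[i].get(cid, 0.0) + 1.0 / pi
    return values

# toy usage with hypothetical bi-criterion centers and cost estimate o
rng = np.random.default_rng(0)
P = rng.integers(1, 17, size=(200, 2)).astype(float)
Zp = [np.array([4.0, 4.0]), np.array([12.0, 12.0])]
vals = offline_values(P, Zp, o=800.0, levels=4,
                      shift=rng.uniform(0, 1, 2), delta=16, eps=0.5)
```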
Theorem 3.3. Fix $\epsilon, \rho \in (0, 1/2)$. Then with probability at least $1 - 8\rho$, the offline construction $S$ is an $\epsilon$-coreset for k-median, and $|S| = O\big(d^4kL^4\log(1/\rho)/\epsilon^2 + kL^3/\rho\big)$.

Proof of Theorem 3.3. Since $\widehat{|C_{-1}|} = |P|$ by definition, it suffices to show that with probability at least $1 - 4\rho$, for every $i \in [0, L]$ and every k-set $Z \subseteq [\Delta]^d$, $\big|\widehat{\mathrm{cost}}(G_i, Z) - \mathrm{cost}(G_i, Z)\big| \le \epsilon\,\mathrm{OPT}/(L+1)$; it then follows from Lemma 3.2 that $S$ is an $\epsilon$-coreset. Let $S_i$ be the sampled points of level $i$. Fix a k-set $Z \subseteq [\Delta]^d$. For each $i \in [0, L]$, by Equation (2) we have
$$\widehat{\mathrm{cost}}(G_i, Z) = \sum_{C \in \mathcal{C}_{Z'}}\widehat{|C|}\,\big(d(c(C), Z) - d(c(C_P), Z)\big) + \sum_{p \in S_i}\big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big)/\pi_i.$$
Note that $\mathbb{E}\big(\widehat{\mathrm{cost}}(G_i, Z)\big) = \mathrm{cost}(G_i, Z)$. The first term contributes 0 to the difference $\widehat{\mathrm{cost}}(G_i, Z) - \mathrm{cost}(G_i, Z)$ since each such $\widehat{|C|}$ is exact. It remains to bound the error contribution from the second term; denote it $A_2 = \sum_{p \in S_i}\big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big)/\pi_i$. Recall that $Z'$ is the set of centers of the bi-criterion solution and $\mathcal{C}_{Z'}$ is the set of cells at distance at most $W/(2d)$ from $Z'$, where $W$ is the side length of a cell. Let $\mathcal{A}$ be the event that $|\mathcal{C}_{Z'}| \le e|Z'|(L+1)^2/\rho = O(kL^2/\rho)$. By Lemma 2.2, $\mathcal{A}$ happens with probability at least $1-\rho$. Conditioned on $\mathcal{A}$, for each point $p \in C \notin \mathcal{C}_{Z'}$ we have $d(p, Z') \ge \mathrm{diam}(C)/(2d^{3/2})$. Therefore,
$$\sum_{p \in C:\, C \notin \mathcal{C}_{Z'}} \mathrm{diam}(C) \le 2d^{3/2}\sum_{p \in C:\, C \notin \mathcal{C}_{Z'}} d(p, Z') \le 20d^{3/2}\,\mathrm{OPT}.$$
By Lemma 3.4, with probability at least $1 - \frac{\rho}{(L+1)\Delta^{kd}}$, $|A_2 - \mathbb{E}(A_2)| \le \frac{\epsilon\,\mathrm{OPT}}{L+1}$. Since there are at most $\Delta^{kd}$ different k-sets from $[\Delta]^d$, for a fixed $i \in [0, L]$, with probability at least $1 - \frac{\rho}{L+1}$, the bound $\big|\widehat{\mathrm{cost}}(G_i, Z) - \mathrm{cost}(G_i, Z)\big| \le \epsilon\,\mathrm{OPT}/(L+1)$ holds for all k-sets $Z \subseteq [\Delta]^d$ simultaneously. By the union bound, with probability at least $1 - 4\rho$, $S$ is the desired coreset.

It remains to bound the size of $S$. Conditioned on $\mathcal{A}$, $|\mathcal{C}_{Z'}| = O(kL^2/\rho)$. For each level $i$, since each point in a cell $C \notin \mathcal{C}_{Z'}$ contributes at least $\Delta/(2^{i+1}d)$ to the bi-criterion cost, there are at most $O(2^i\,\mathrm{OPT}\,d/\Delta)$ points in cells outside $\mathcal{C}_{Z'}$. By a Chernoff bound, with probability at least $1 - \rho/(L+1)$, the number of points sampled from cells $C \notin \mathcal{C}_{Z'}$ at level $i$ is upper bounded by $O(d^4kL^3\log(1/\rho)/\epsilon^2)$. Thus over all levels, with probability at least $1-\rho$, the number of sampled points is upper bounded by $O(d^4kL^4\log(1/\rho)/\epsilon^2)$, which also upper bounds the number of cells occupied by sampled points. Now we bound the number of non-zero weighted centers: in the coreset construction, if a cell center has non-zero weight, then either the cell itself or one of its children has non-zero assigned value $\widehat{|C|}$, so the number of non-zero weighted centers is at most twice the number of non-zero valued cells. Thus $|S| = O\big(d^4kL^4\log(1/\rho)/\epsilon^2 + kL^3/\rho\big)$. □

Lemma 3.4. Fix $\epsilon, \rho \in (0, 1/2)$. Suppose a set of cells $\mathcal{C}$ from grid $G_i$ satisfies $\sum_{C \in \mathcal{C}}|C|\,\mathrm{diam}(C) \le \beta\,\mathrm{OPT}$ for some $\beta \ge 2\epsilon/(3(L+1))$. Let $S$ be a set of independent samples from the point set $\bigcup_{C \in \mathcal{C}} C$, each point sampled with probability
$$\pi_i \ge \min\Big\{\frac{3a(L+1)^2 d\beta\Delta}{2^i \epsilon^2 o}\,\ln\frac{2\Delta^{kd}(L+1)}{\rho},\ 1\Big\},$$
where $0 < o \le a\,\mathrm{OPT}$ for some $a > 0$. Then for a fixed set $Z \subseteq [\Delta]^d$, with probability at least $1 - \rho/\big((L+1)\Delta^{kd}\big)$,
$$\Big|\sum_{p \in S}\big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big)/\pi_i - \sum_{p \in \bigcup\mathcal{C}}\big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big)\Big| \le \frac{\epsilon\,\mathrm{OPT}}{L+1}.$$
The proof is a straightforward application of Bernstein's inequality and is presented in Section A.

3.3 The Streaming Algorithm

For the streaming algorithm, the first challenge is that we know neither the actual value of $\mathrm{OPT}$ nor an $(\alpha, \beta)$-bi-criterion solution. To handle this, we will show that we do not need an actual set of centers of an approximate solution: a conceptual optimal solution suffices. We guess logarithmically many values for $\mathrm{OPT}$ to do the sampling and re-run the algorithm in parallel for each guess of $\mathrm{OPT}$.

The second challenge is that we cannot guarantee the sum $\sum_{C \in G_i}|C|\,\mathrm{diam}(C)$ to be upper bounded by $\beta\,\mathrm{OPT}$, as required in Lemma 3.4. We show that we can split the set of cells into two parts. The first part satisfies $\sum_C |C|\,\mathrm{diam}(C) \le \beta\,\mathrm{OPT}$ for some parameter $\beta$. The second part satisfies $|C|\,\mathrm{diam}(C) \ge a\,\mathrm{OPT}/k$ for some constant $a$. For the first part, we use a sampling procedure similar to the offline case. The challenge here is that too many points may be sampled when the algorithm is midway through the stream, and these points may be deleted later in the stream. To handle this case, we use a data structure called the K-Set structure with parameter $k$ (Ganguly, 2005). We insert into the K-Set (with deletions) a multiset of points $M \subseteq [N]$. The data structure processes each stream operation in $O(\log(k/\delta))$ time. At any point in time, it supports an operation RETRSET that, with probability at least $1-\delta$, either returns the set of items of $M$ or returns Fail. Further, if the number of distinct items in $M$ is at most $k$, then RETRSET returns $M$ with probability at least $1-\delta$. The space used by the K-Set data structure is $O\big(k(\log|M| + \log N)\log(k/\delta)\big)$ bits. The K-Set structure also returns the frequency of each stored point upon the RETRSET operation.
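For intuition about the interface (not the internals) of the K-Set structure, here is a toy stand-in; the real structure of (Ganguly, 2005) achieves the same contract in small space with a randomized sketch, whereas this toy simply stores everything exactly:

```python
class ToyKSet:
    """Toy stand-in mimicking the K-Set interface of (Ganguly, 2005):
    insert/delete items; retrset() returns {item: frequency} when at most
    K distinct items are active, and 'Fail' otherwise. The real K-Set may
    also fail with probability delta even below the threshold, and uses
    space logarithmic in the stream, unlike this dictionary version."""
    def __init__(self, K):
        self.K = K
        self.freq = {}

    def update(self, op, item):
        """op is +1 for Insert, -1 for Delete."""
        f = self.freq.get(item, 0) + op
        if f == 0:
            self.freq.pop(item, None)
        else:
            self.freq[item] = f

    def retrset(self):
        return "Fail" if len(self.freq) > self.K else dict(self.freq)

ks = ToyKSet(K=2)
ks.update(+1, "cell_a"); ks.update(+1, "cell_b"); ks.update(-1, "cell_a")
print(ks.retrset())   # -> {'cell_b': 1}
```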
For the second part, we call these cells heavy. We first upper bound the number of heavy cells by $\alpha k$ for some $\alpha > 1$. We use a heavy hitter algorithm HEAVY-HITTER to retrieve an approximation of the number of points in these cells. The guarantee is given in the following theorem. In an insertion-deletion stream, it may happen that although the stream has arbitrarily large length, at any moment a much smaller number of elements are active (that is, inserted and not yet deleted). We define the size of a stream to be the maximum number of active elements at any point of the stream.

Theorem 3.5 ((Larsen et al., 2016), Theorem 2). Fix $\epsilon, \delta \in (0, 1/2)$. Given a stream (of insertions and deletions) of size $m$ consisting of items from universe $[n]$, there exists an algorithm HEAVY-HITTER$(n, k, \epsilon, \delta)$ that makes a single pass over the stream and outputs a set of pairs $H$. With probability at least $1-\delta$, the following hold: (1) for each $(i, \hat f_i) \in H$, $f_i^2 \ge \sum_{j=1}^n f_j^2/k - \epsilon^2\sum_{j=k+1}^n f_j^2$; (2) if $f_i^2 \ge \sum_{j=1}^n f_j^2/k + \epsilon^2\sum_{j=k+1}^n f_j^2$ for some $i \in [n]$, then $(i, \hat f_i) \in H$; (3) for each $(i, \hat f_i) \in H$, $|\hat f_i - f_i| \le \epsilon\sqrt{\sum_{j=k+1}^n f_j^2}$. The algorithm uses $O\big((k + \tfrac{1}{\epsilon^2})\log\tfrac{1}{\delta}\log m\big)$ bits of space, $O(\log n)$ update time, and $O(k + 1/\epsilon^2)\,\mathrm{polylog}(n)$ query time.

Thus, using HEAVY-HITTER, we are guaranteed that the error in the number of points in heavy cells is upper bounded by $\epsilon$ times the number of points in the non-heavy cells. The first heavy hitter algorithm achieving an $\ell_2$ guarantee is by (Charikar et al., 2002), with the same space and update time as the algorithm above; however, its query time is slow, requiring $O(n\log n)$ time to output the set of heavy hitters.

Lastly, we use fully independent random hash functions to sample the points, and we derandomize the hash functions with Nisan's pseudorandom generator following the method of (Indyk, 2000b). Our main theorem for this section is as follows. The formal proof of this theorem is postponed to Section B.

Theorem 3.6 (Main Theorem). Fix $\epsilon, \rho \in (0, 1/2)$ and positive integers $k$ and $\Delta$. Algorithm 1 makes a single pass over the streaming point set $P \subseteq [\Delta]^d$ and outputs a weighted set $S$ such that, with probability at least $1-\rho$, $S$ is an $\epsilon$-coreset for k-median of size $O\big(\frac{d^4L^4k}{\rho\,\epsilon^2}\big)$, where $L = \log\Delta$. The algorithm uses $O\big(\frac{kd^7L^7}{\epsilon^2}\big)\cdot\mathrm{poly}(1/\rho)$ bits in the worst case, processes each update in time $O\big(dL^2\log\frac{dkL}{\rho\epsilon}\big)$, and outputs the coreset in time $\mathrm{poly}(d, k, L, 1/\epsilon)$ after one pass over the stream.
Algorithm 1 CoreSet$(S, k, \rho, \epsilon)$: construct an $\epsilon$-coreset for a dynamic stream $S$.

Initialization:
  Initialize a grid structure; $O \leftarrow \{1, 2, 4, \ldots, \Delta^{d+1}\}$; $L \leftarrow \log\Delta$;
  $\pi_i(o) \leftarrow \min\big\{\frac{3(L+1)^2 d^2 \Delta}{2^i \epsilon^2 o}\ln\frac{2\Delta^{kd}(L+1)}{\rho},\ 1\big\}$;
  $K \leftarrow \frac{(2+e)(L+1)k}{\rho} + \frac{24d^4(L+1)^3k}{\rho\epsilon^2} + \frac{8(2+e)^2kd^3(L+1)^3}{\rho^2}$; $m \leftarrow 0$;
  for each $o \in O$ and $i \in [0, L]$: construct a fully independent hash function $h_{o,i} : [\Delta]^d \to \{0, 1\}$ with $\Pr[h_{o,i}(q) = 1] = \pi_i(o)$;
  initialize K-Set instances $\mathrm{KS}_{o,i}$ with error probability $\rho/(L+1)$ and size parameter $K$;
  initialize HEAVY-HITTER$(\Delta^d, (e+2)(L+1)k/\rho, \epsilon', \rho/(L+1))$ instances $\mathrm{HH}_0, \mathrm{HH}_1, \ldots, \mathrm{HH}_L$, one per level.

Update(S): for each update $(op, q) \in S$:  /* $op \in \{\text{Insert}, \text{Delete}\}$ */
  $m \leftarrow m \pm 1$;  /* Insert: $+1$, Delete: $-1$ */
  for each $i \in [0, L]$:
    $c^i_q \leftarrow$ the center of the cell containing $q$ at level $i$;
    $\mathrm{HH}_i.\mathrm{update}(op, c^i_q)$;
    for each $o \in O$: if $h_{o,i}(q) = 1$: $\mathrm{KS}_{o,i}.\mathrm{update}(op, c^i_q)$.

Query:
  let $o^*$ be the smallest $o$ such that no instance of $\mathrm{KS}_{o,0}, \mathrm{KS}_{o,1}, \ldots, \mathrm{KS}_{o,L}$ returns Fail;
  $R \leftarrow \{\}$;
  for $i = -1$ to $L$:
    for each cell center $c$ in level $i$:
      let $C$ be the cell containing $c$;
      if $i \ne -1$: $f \leftarrow$ GetFreq$(c, \mathrm{HH}_i, \mathrm{KS}_{o^*,i}, \pi_i(o^*))$; else: $f \leftarrow m$;
      if $i < L$:
        $g \leftarrow \sum_{C' \in G_{i+1}:\, C' \subseteq C}$ GetFreq$(c(C'), \mathrm{HH}_{i+1}, \mathrm{KS}_{o^*,i+1}, \pi_{i+1}(o^*))$;
        assign weight $f - g$ to $c$; if $f - g \ne 0$: $R \leftarrow R \cup \{c\}$;
      else:
        assign weight $f$ to $c$; if $f \ne 0$: $R \leftarrow R \cup \{c\}$;
  return $R$.

Algorithm 2 GetFreq$(e, \mathrm{HH}, \mathrm{KS}, \pi_i)$: retrieve the correct frequency of cell center $e$, given the instances of HEAVY-HITTER and K-Set.
  $f_S(e) \leftarrow$ the frequency of $e$ returned by HH;
  $f_K(e) \leftarrow$ the frequency of $e$ returned by KS;
  $k' \leftarrow (e+2)(L+1)k/\rho$;
  $F \leftarrow$ the set of top-$k'$ heavy hitters returned by HEAVY-HITTER;
  if $e \in F$: return $f_S(e)$; else: return $f_K(e)/\pi_i$.

4 Positively Weighted Coreset

In this section, we introduce a modification of our previous coreset construction which leads to a coreset with all positively weighted points. The full algorithm and proofs are postponed to Section C; we present the main steps in this section. The high-level idea is as follows. The estimate of the number of points in a cell is accurate only when the cell truly contains a large number of points. However, the construction of the previous section samples from every cell of every level, even though some of the cells contain a single point. For those cells, we cannot adjust the weights from negative to positive, since doing so would introduce a large error. In this section, we introduce an ending level for each point. In other words, the number of points of a cell is estimated by sampling only if it contains many points. Thus, the estimates will be accurate enough to allow us to rectify the weights to be all positive.

4.1 Reformulation of the Telescope Sum

Definition 4.1. A heavy cell identification scheme $H$ is a map $H : G \to \{\text{heavy}, \text{non-heavy}\}$ such that $H(C_{-1}) = \text{heavy}$ and, for each cell $C \in G_i$ with $i \in [0, L]$:
1. if $|C| \ge \frac{2^i \rho\,\mathrm{OPT}}{\Delta d\,k(L+1)}$ then $H(C) = \text{heavy}$;
2. if $H(C) = \text{non-heavy}$, then $H(C') = \text{non-heavy}$ for every subcell $C'$ of $C$;
3. for every cell $C$ in level $L$, $H(C) = \text{non-heavy}$;
4. for each $i \in [0, L]$, $|\{C \in G_i : H(C) = \text{heavy}\}| \le \frac{\lambda_1 kL}{\rho}$, where $\lambda_1 \ge 10$ is a positive universal constant.

The output for a cell not specified by the above conditions can be arbitrary. We call a cell heavy if it is identified as heavy by $H$. Note that a heavy cell does not necessarily contain a large number of points, but the total number of these cells is always bounded. In the sequel, heavy cells are defined by an arbitrary fixed identification scheme unless otherwise specified.

Definition 4.2. Fix a heavy cell identification scheme $H$. For level $i \in [-1, L]$, let $C(p, i) \in G_i$ be the cell in $G_i$ containing $p$. The ending level $l(p)$ of a point $p \in P$ is the largest level $i$ such that $H(C(p, i)) = \text{heavy}$ and $H(C(p, i+1)) = \text{non-heavy}$.

Note that the ending level is uniquely defined once a heavy cell identification scheme is fixed. We now rewrite the telescope sum for $p$ as
$$p = \sum_{i=0}^{l(p)}\big(c^i_p - c^{i-1}_p\big) + \big(c^L_p - c^{l(p)}_p\big),$$
where $c^{-1}_p = 0$ and $c^L_p = p$. For an arbitrary set of $k$ centers $Z \subseteq [\Delta]^d$, we write
$$d(p, Z) = \sum_{i=0}^{l(p)}\big(d(c^i_p, Z) - d(c^{i-1}_p, Z)\big) + \big(d(c^L_p, Z) - d(c^{l(p)}_p, Z)\big) + d(0, Z).$$
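The truncated telescope sum can be checked mechanically. The following sketch (ours; the stand-in centers are arbitrary points rather than true grid centers, which does not affect the algebraic identity) verifies that for any ending level the truncated decomposition still sums to $d(p, Z)$:

```python
import numpy as np

def truncated_terms(p, ending, centers, Z):
    """Telescope terms for one point with ending level `ending`.
    centers[i] holds c^{i-1}_p, i.e. centers[0] is the zero vector
    (level -1) and centers[-1] is p itself (level L)."""
    def dist(q):
        return min(np.linalg.norm(q - z) for z in Z)
    terms = [dist(centers[0])]                        # d(0, Z)
    for i in range(1, ending + 2):                    # levels 0 .. l(p)
        terms.append(dist(centers[i]) - dist(centers[i - 1]))
    # jump straight from level l(p) to c^L_p = p
    terms.append(dist(centers[-1]) - dist(centers[ending + 1]))
    return terms

rng = np.random.default_rng(3)
d, L = 2, 4
p = rng.uniform(1, 16, size=d)
Z = [rng.uniform(1, 16, size=d) for _ in range(2)]
centers = [np.zeros(d)] + [p + rng.uniform(-2 ** (L - i), 2 ** (L - i), size=d)
                           for i in range(L)] + [p]   # stand-in nested centers
for lp in range(-1, L):                               # any valid ending level
    assert abs(sum(truncated_terms(p, lp, centers, Z)) -
               min(np.linalg.norm(p - z) for z in Z)) < 1e-9
```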
4.2 The New Construction (with arbitrary weights)

For the heavy cells, we use HEAVY-HITTER algorithms to obtain accurate estimates of the number of points in these cells, thereby providing a heavy cell identification scheme. For the non-heavy cells, we only need to sample points from the bottom level $G_L$, but with a different probability for points with different ending levels. We now describe the new construction. It has essentially the same guarantee as the simpler construction from the previous section; the benefit is that (as shown in the next subsection) it can be modified to output only positive weights. In the following paragraph, the estimates $\widehat{|C|}$ are given as a black box; in Proposition C.9 we specify the conditions these estimates must satisfy.

Non-Negatively Weighted Construction. Fix an arbitrary heavy cell identification scheme $H$. Let $P_l$ be the set of all points with ending level $l(p) = l$. For each heavy cell $C$, let $\widehat{|C|}$ be an estimate of the number of points $|C|$; we also call $\widehat{|C|}$ the value of cell $C$. For each non-heavy cell $C'$, let $\widehat{|C'|} = 0$. Let $S$ be a set of samples of $P$ constructed as follows: $S = S_{-1} \cup S_0 \cup S_1 \cup \cdots \cup S_L$, where $S_l$ is a set of i.i.d. samples from $P_l$, each point sampled with probability $\pi_l$. Here $\pi_l$ for $l \in [-1, L]$ is redefined as
$$\pi_l = \min\Big\{\frac{\lambda_3 d^2 L^2 \Delta}{2^l \epsilon^2 o}\log\frac{2L\Delta^{dk}}{\rho} + \frac{\lambda_4 d^2 k L^3 \Delta}{2^l \epsilon^2 \rho\,o}\log\frac{30kL^2}{\rho},\ 1\Big\},$$
where $\lambda_3 > 0$ and $\lambda_4 > 0$ are universal constants. Our coreset is composed of all the sampled points in $S$ and the cell centers of heavy cells. Each sampled point $p$ is assigned weight $1/\pi_{l(p)}$, and each cell center $c$ of a heavy cell $C \in G_i$ receives weight
$$\mathrm{wt}(c) = \widehat{|C|} - \sum_{C' \in G_{i+1}:\, C' \subseteq C,\ C' \text{ is heavy}} \widehat{|C'|} - \frac{|S_i \cap C|}{\pi_i}.$$
For each non-heavy cell $C$ except those in the bottom level, $\mathrm{wt}(c(C)) = 0$. The weight of each point from $S$ is the value of the corresponding cell in the bottom level.

4.3 Ensuring Non-Negative Weights

We now provide a procedure that rectifies all the weights of the coreset constructed in the last subsection. The idea is similar to the method used in (Indyk & Price, 2011). The procedure is shown in Algorithm 3; after it runs, no negative weights remain in the coreset output.

Algorithm 3 RectifyWeights$\big(\widehat{|C_1|}, \widehat{|C_2|}, \ldots, \widehat{|C_{k'}|}, S\big)$: given the estimated number of points in each cell and the weighted sampled points, output a weighted coreset with non-negative weights.
  for $i = -1$ to $L$:
    for each heavy cell $C$ with center in $G_i$:
      if $\mathrm{wt}(C) < 0$: decrease the values of the heavy children cells in $G_{i+1}$ and of the sampled points in $S_i$ arbitrarily by a total of $|\mathrm{wt}(C)|$, such that for each child cell $C' \in G_{i+1}$ the value $\widehat{|C'|}$ remains non-negative, and for each sampled point $p \in S_i$ the weight remains non-negative;
  return the rectified coreset.

Theorem 4.3. Fix $\epsilon, \rho \in (0, 1/2)$ and positive integers $k$ and $\Delta$. Algorithm 6 makes a single pass over the streaming point set $P \subseteq [\Delta]^d$ and outputs a weighted set $S$ with non-negative weights for each point, such that with probability at least 0.99, $S$ is an $\epsilon$-coreset for k-median of size $O(kd^4L^4/\epsilon^2)\cdot\mathrm{poly}(1/\rho)$, where $L = \log\Delta$. The algorithm uses $\frac{kd^8L^8}{\epsilon^2}\cdot\mathrm{poly}\big(\tfrac{1}{\rho},\ \log\tfrac{dkL}{\rho\epsilon}\big)$ bits in the worst case. For each update of the input, the algorithm needs $\mathrm{poly}(d, 1/\epsilon, L, \log k)$ time to process, and it outputs the coreset in time $\mathrm{poly}(d, k, L, 1/\epsilon, 1/\rho, \log k)$ after one pass over the stream.
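As an illustration of the rectification step, here is a toy sketch (ours, not the paper's code) on a small tree of cell values; it pushes each negative parent weight down into the children, as Algorithm 3 does, assuming (as the analysis guarantees) that the children hold enough value to absorb the deficit:

```python
def rectify(values, children, L):
    """values: {(level, cell): value} over heavy cells; children maps a
    heavy cell to its heavy children one level down. Pushes negative
    weights downward top-down as in Algorithm 3, preserving the total."""
    weights = {}
    for lvl in range(L):                        # top-down over levels
        for cell in [c for c in values if c[0] == lvl]:
            kids = children.get(cell, [])
            w = values[cell] - sum(values[k] for k in kids)
            if w < 0:
                deficit = -w
                for k in kids:                  # decrease children arbitrarily
                    take = min(values[k], deficit)
                    values[k] -= take
                    deficit -= take
                w = 0                           # parent weight is now zero
            weights[cell] = w
    return weights, values

vals = {(0, "r"): 4, (1, "a"): 3, (1, "b"): 2}  # parent value < children sum
w, v = rectify(vals, {(0, "r"): [(1, "a"), (1, "b")]}, L=1)
print(w, v)   # root weight clipped to 0; children reduced by a total of 1
```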
5 Experiments

We illustrate our construction using an offline construction on Gaussian mixture data in $\mathbb{R}^2$. As shown in Figure 2 in Section D, we randomly generated 65536 points in $\mathbb{R}^2$ and rounded them to a grid of size $\Delta = 512$. Our coreset uses $\log_2\Delta + 2 = 11$ levels of grids. The storage in each level is very sparse: as shown in Figure 1(a), only 90 points are stored in total. We compared the 1-median costs estimated using the coreset and using the full dataset; the resulting difference is very small, as illustrated in Figure 1(b). Using only 90 points, the global maximum error was under 10%.

Figure 1. (a) The layer structure of the coreset; cells with more weight are shaded darker. (b) The relative error of a 1-median cost function.

6 Concluding Remarks

We develop algorithms that make a single pass over the dynamic stream and output, with high probability, a coreset for the original k-median problem. Both the space complexity and the size of the coreset depend polynomially on $d$, whereas the only previously known bounds are exponential in $d$. We constructed our coreset for the possible solutions in the discrete space $[\Delta]^d$, but it is easy to modify the coreset to be a coreset in the continuous space $[0, \Delta]^d$ (note that we still require the input dataset to come from a discrete space). This is done by modifying the sampling probability $\pi_i$ in the algorithm, i.e. replacing the factor $\ln(\Omega(\Delta^{kd}L/\rho))$ with $\ln(\Omega((\Delta/\epsilon)^{kd}L/\rho))$. Then any k-set from $[0, \Delta]^d$ can be rounded to the closest k-set in $[\Delta/\epsilon]^d$, and the cost only differs by a $(1 \pm \epsilon)$ factor while the space bound changes only by a $\mathrm{polylog}(1/\epsilon)$ factor. Lastly, we remark that the coreset scheme can easily be modified for other metric spaces, e.g. the $\ell_p$ metrics; the space bound depends on the doubling dimension of the metric. As shown in our experiments, a 2D implementation using our framework is very efficient, and we believe that a high-dimensional implementation will be efficient as well. We leave the full implementation as a future project.

Acknowledgments

V. Braverman is supported by NSF Grants IIS-1447639, EAGER CCF-1650041, and CAREER CCF-1652257. H. Lang is supported by the Franco-American Fulbright Commission; he thanks INRIA (l'Institut national de recherche en informatique et en automatique) for hosting him during the writing of this paper. C. Sohler is supported by DFG within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project A2. L. Yang is supported by NSF Grant IIS-1447639.

References

Bagchi, Amitabha, Chaudhary, Amitabh, Eppstein, David, and Goodrich, Michael T. Deterministic sampling and range counting in geometric data streams. ACM Transactions on Algorithms (TALG), 3(2):16, 2007.

Chan, Timothy M. Faster core-set constructions and data stream algorithms in fixed dimensions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 152–159. ACM, 2004.

Chan, Timothy M and Sadjad, Bashir S. Geometric optimization problems over sliding windows. International Journal of Computational Geometry & Applications, 16(02n03):145–157, 2006.

Charikar, Moses, Chekuri, Chandra, Feder, Tomás, and Motwani, Rajeev. Incremental clustering and dynamic information retrieval. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pp. 626–635. ACM, 1997.

Charikar, Moses, Chen, Kevin, and Farach-Colton, Martin. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pp. 693–703. Springer, 2002.

Charikar, Moses, O'Callaghan, Liadan, and Panigrahy, Rina. Better streaming algorithms for clustering problems. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pp. 30–39. ACM, 2003.

Chen, Ke. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009. doi: 10.1137/070699007.

Cormode, Graham and Muthukrishnan, S. Radial histograms for spatial streams. DIMACS Technical Report, 11, 2003.

Feigenbaum, Joan, Kannan, Sampath, and Zhang, Jian.
Computing diameter in the streaming and sliding-window models. Algorithmica, 41(1):25–41, 2005.

Feldman, Dan and Langberg, Michael. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, June 6–8, 2011, pp. 569–578, 2011. doi: 10.1145/1993636.1993712.

Feldman, Dan, Monemizadeh, Morteza, and Sohler, Christian. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry, Gyeongju, South Korea, June 6–8, 2007, pp. 11–18, 2007. doi: 10.1145/1247069.1247072.

Feldman, Dan, Schmidt, Melanie, and Sohler, Christian. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6–8, 2013, pp. 1434–1453, 2013. doi: 10.1137/1.9781611973105.103.

Frahling, Gereon and Sohler, Christian. Coresets in dynamic geometric data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pp. 209–217. ACM, 2005.

Ganguly, Sumit. Counting distinct items over update streams. In International Symposium on Algorithms and Computation, pp. 505–514. Springer, 2005.

Guha, Sudipto, Mishra, Nina, Motwani, Rajeev, and O'Callaghan, Liadan. Clustering data streams. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 359–366. IEEE, 2000.

Har-Peled, Sariel and Kushal, Akash. Smaller coresets for k-median and k-means clustering. In Proceedings of the 21st ACM Symposium on Computational Geometry, Pisa, Italy, June 6–8, 2005, pp. 126–134, 2005. doi: 10.1145/1064092.1064114.

Har-Peled, Sariel and Mazumdar, Soham. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13–16, 2004, pp. 291–300, 2004. doi: 10.1145/1007352.1007400.

Hershberger, John and Suri, Subhash. Adaptive sampling for geometric problems over data streams. In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 252–262. ACM, 2004.

Indyk, Piotr. High-dimensional computational geometry. PhD thesis, Stanford University, 2000a.

Indyk, Piotr. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 189–197. IEEE, 2000b.

Indyk, Piotr. Better algorithms for high-dimensional proximity problems via asymmetric embeddings. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 539–545. Society for Industrial and Applied Mathematics, 2003.

Indyk, Piotr. Algorithms for dynamic geometric problems over data streams. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 373–380. ACM, 2004.

Indyk, Piotr and Price, Eric. K-median clustering, model-based compressive sensing, and sparse recovery for earth mover distance. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 627–636. ACM, 2011.

Larsen, Kasper Green, Nelson, Jelani, Nguyễn, Huy L., and Thorup, Mikkel.
Heavy hitters via cluster-preserving clustering. arXiv preprint arXiv:1604.01357, 2016.

Liu, Zaoxing, Ivkin, Nikita, Yang, Lin, Neyrinck, Mark, Lemson, Gerard, Szalay, Alexander, Braverman, Vladimir, Budavari, Tamas, Burns, Randal, and Wang, Xin. Streaming algorithms for halo finders. In e-Science (e-Science), 2015 IEEE 11th International Conference on, pp. 342–351. IEEE, 2015.

Meyerson, Adam. Online facility location. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pp. 426–431. IEEE, 2001.

Muthukrishnan, Shanmugavelayutham. Data streams: Algorithms and applications. Now Publishers Inc, 2005.

Nisan, Noam. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.