# Near-Optimal k-Clustering in the Sliding Window Model

David P. Woodruff (CMU, dwoodruf@cs.cmu.edu), Peilin Zhong (Google Research, peilinz@google.com), Samson Zhou (Texas A&M University, samsonzhou@gmail.com)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract. Clustering is an important technique for identifying structural information in large-scale data analysis, where the underlying dataset may be too large to store. In many applications, recent data can provide more accurate information and thus older data past a certain time is expired. The sliding window model captures these desired properties and thus there has been substantial interest in clustering in the sliding window model. In this paper, we give the first algorithm that achieves a near-optimal $(1+\varepsilon)$-approximation to $(k, z)$-clustering in the sliding window model, where $z$ is the exponent of the distance function in the cost. Our algorithm uses $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ words of space when the points are from $[\Delta]^d$, thus significantly improving on works by Braverman et al. (SODA 2016), Borassi et al. (NeurIPS 2021), and Epasto et al. (SODA 2022). Along the way, we develop a data structure for clustering called an online coreset, which outputs a coreset not only for the end of a stream, but also for all prefixes of the stream. Our online coreset samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ points from the stream. We then show that any online coreset requires $\Omega\left(\frac{k}{\varepsilon^2}\log n\right)$ samples, which shows a separation from the problem of constructing an offline coreset, i.e., constructing online coresets is strictly harder. Our results also extend to general metrics on $[\Delta]^d$ and are near-optimal in light of an $\Omega\left(\frac{k}{\varepsilon^{2+z}}\right)$ lower bound for the size of an offline coreset.

1 Introduction

Clustering is a fundamental procedure frequently used to help extract important structural information from large datasets. Informally, the goal of clustering is to partition the data into $k$ clusters so that the elements within each cluster have similar properties. Classic formulations of clustering include the k-median and k-means problems, which have been studied since the 1950s [60, 50]. More generally, for a set $X$ of $n$ points in $\mathbb{R}^d$, along with a metric $\mathrm{dist}$, a cluster parameter $k > 0$, and an exponent $z > 0$ that is a positive integer, the clustering objective can be defined by
$$\min_{C\subset\mathbb{R}^d,\,|C|=k}\ \sum_{i=1}^{n}\ \min_{c\in C}\ \mathrm{dist}(x_i, c)^z.$$
When $\mathrm{dist}$ is the Euclidean distance, the problem is known as $(k, z)$-clustering and, more specifically, k-median clustering and k-means clustering when $z$ is additionally set to 1 and 2, respectively. As modern datasets have significantly increased in size, attention has shifted to large-scale computational models, such as the streaming model of computation, that do not require multiple passes over the data. In the (insertion-only) streaming model, the points $x_1, \ldots, x_n$ of $X$ arrive sequentially, and the goal is to output an optimal or near-optimal clustering of $X$ while using space sublinear in $n$, ideally $k\,\mathrm{polylog}(n, d)$ space, since outputting the cluster centers uses $k$ words of space, where each word of space is assumed to be able to store an entire input point in $\mathbb{R}^d$. There exist slight variants of the insertion-only streaming model, and a long line of active research has been conducted on clustering in these models [42, 21, 44, 43, 23, 17, 37, 39, 1, 11, 59, 46, 14, 27, 10, 25, 61, 29].
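To make the objective concrete, the following minimal Python snippet evaluates the $(k, z)$-clustering cost of a fixed candidate set of centers; setting $z = 1$ gives the k-median cost and $z = 2$ the k-means cost. The data and the choice of $k = 3$ centers are purely illustrative and not taken from the paper.

```python
import numpy as np

def kz_cost(X, C, z):
    """(k, z)-clustering cost: sum over points of (distance to nearest center)^z."""
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # n x k distance matrix
    return float((dists.min(axis=1) ** z).sum())

rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(100, 2)).astype(float)  # 100 points on a small grid [10]^2
C = X[rng.choice(len(X), size=3, replace=False)]      # an arbitrary set of k = 3 centers
print(kz_cost(X, C, z=1))  # k-median objective for these centers
print(kz_cost(X, C, z=2))  # k-means objective for these centers
```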
The sliding window model. Unfortunately, an important shortcoming of the streaming model is that it ignores the time at which a specific data point arrives and thus it is unable to prioritize recent data over older data. Consequently, the streaming model cannot capture applications in which recent data is more accurate and therefore considered more important than data that arrived prior to a certain time, e.g., Census data or financial markets. Indeed, it has been shown that for a number of applications, the streaming model has inferior performance [4, 52, 57, 62] compared to the sliding window model [33], where only the most recent $W$ updates in the stream comprise the underlying dataset. Here, $W > 0$ is a parameter that designates the window size of the active data, so that all updates before the $W$ most recent updates are considered expired, and the goal is to aggregate statistics about the active data using space sublinear in $W$. In the setting of clustering, where the data stream is $x_1, \ldots, x_n \in \mathbb{R}^d$, the active data set is $X = \{x_{n-W+1}, \ldots, x_n\}$ for $n \geq W$ and $X = \{x_1, \ldots, x_n\}$ otherwise. Thus the sliding window model is a generalization of the streaming model, depending on the choice of $W$, and is especially relevant for time-sensitive settings, such as data summarization [22, 34], event detection in social media [56], and network monitoring [32, 31, 30]. The sliding window model is especially relevant for applications in which computation must be restricted to data that arrived after a certain time. Data privacy laws such as the General Data Protection Regulation (GDPR) mandate that companies cannot retain specific user data beyond a certain duration. For example, the Facebook data policy [36] states that user search histories are retained for 6 months, the Apple differential privacy overview [3] states that collected user information is retained for 3 months, and the Google data retention policy states that browser information may be stored for up to 9 months [41]. These retention policies can be modeled by the sliding window model with the corresponding setting of the window parameter $W$, and thus the sliding window model has been subsequently studied in a wide range of applications [48, 49, 18, 19, 12, 13, 9, 20, 63, 2, 47, 7].

Clustering in the sliding window model. Because the clustering objective is not well-suited to popular frameworks such as the exponential histogram or the smooth histogram, there has been significant interest in clustering in the sliding window model. We now describe the landscape of clustering algorithms in the sliding window model; these results are summarized in Table 1. In 2003, [5] first gave a $2^{O(1/\varepsilon)}$-approximation algorithm for k-median clustering in the sliding window model using $O\left(\frac{k}{\varepsilon^4} W^{2\varepsilon}\log^2 W\right)$ words of space, where $\varepsilon \in \left(0, \frac{1}{2}\right)$ is an input parameter. Subsequently, [15] gave an $O(1)$-approximate bicriteria algorithm using $2k$ centers and $k^2\,\mathrm{polylog}(W)$ space for the k-median problem in the sliding window model. The question of whether there exists a $\mathrm{poly}(k\log W)$ space algorithm for k-clustering on sliding windows remained open until [16] gave constant-factor approximation sliding window algorithms for k-median and k-means using $O(k^3\log^6 W)$ space and [28] gave constant-factor approximation algorithms for k-center clustering using $O(k\log\Delta)$ space, where $\Delta$ is the aspect ratio, i.e., the ratio of the largest to smallest distances between any pair of points.
Afterwards, [8] gave a $C$-approximation algorithm for some constant $C > 2^{14}$, though it should be noted that their main contribution was the first constant-factor approximation algorithm for k-clustering using space linear in $k$, i.e., $k\,\mathrm{polylog}(W, \Delta)$ space, and thus they did not attempt to optimize the constant $C$. Recently, [35] gave the first $(1+\varepsilon)$-approximation algorithm for $(k, z)$-clustering using $\frac{kd + d^{C}}{\varepsilon^3}\,\mathrm{polylog}\left(W, \Delta, \frac{1}{\varepsilon}\right)$ words of space, for some constant $C \geq 7$. Using known dimensionality reduction techniques, i.e., [51], the algorithm's dependence on $d^{C}$ can be removed in exchange for a $\frac{1}{\varepsilon^{14}}\,\mathrm{polylog}\left(W, \frac{1}{\varepsilon}\right)$ overhead. However, neither the $d^{C}$ dependency nor the $\frac{1}{\varepsilon^{14}}\,\mathrm{polylog}\left(W, \frac{1}{\varepsilon}\right)$ trade-off is desirable for realistic settings of $d$ and $\varepsilon$ in applications of k-clustering on sliding windows. In particular, recent results have achieved efficient summarizations, i.e., coresets, for k-median and k-means clustering in the offline setting using $O\left(\frac{k}{\varepsilon^4}\log n\right)$ words of space [27, 25] when the input is from $[\Delta]^d$, and it is known that this is near-optimal, i.e., $\Omega\left(\frac{k}{\varepsilon^{2+z}}\log n\right)$ samples are necessary to form coresets for $(k, z)$-clustering [45] in that setting. Thus a natural question to ask is whether such near-optimal space bounds can be achieved in the sliding window model.

1.1 Our Contributions

In this paper, we answer the question in the affirmative. That is, we give near-optimal space algorithms for k-median and k-means clustering in the sliding window model. In fact, we give more general algorithms for $(k, z)$-clustering in the sliding window model that nearly match the space used by the offline coreset constructions of [27, 25, 26]:

Theorem 1.1. There exists an algorithm that samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ points and, with high probability, outputs a $(1+\varepsilon)$-approximation to $(k, z)$-clustering for the Euclidean distance on $[\Delta]^d$ in the sliding window model.

In particular, our bounds in Theorem 1.1 achieve $\frac{k}{\varepsilon^4}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ words of space for k-median clustering and k-means clustering, i.e., $z = 1$ and $z = 2$, respectively, matching the lower bounds of [25, 45] up to polylogarithmic factors.

| Reference | Accuracy | Space | Setting |
|---|---|---|---|
| [5] | $2^{O(1/\varepsilon)}$ | $O\left(\frac{k}{\varepsilon^4} W^{2\varepsilon}\log^2 W\right)$ | k-median, $\varepsilon \in \left(0, \frac{1}{2}\right)$ |
| [16] | $C > 2$ | $O(k^3\log^6 W)$ | k-median and k-means |
| [34] | $C > 2^{14}$ | $k\,\mathrm{polylog}(W, \Delta)$ | $(k, z)$-clustering |
| [35] | $1+\varepsilon$ | $\frac{kd + d^{C_z}}{\varepsilon^3}\,\mathrm{polylog}\left(W, \Delta, \frac{1}{\varepsilon}\right)$, $C_z \geq 7$ | $(k, z)$-clustering |
| Our work | $1+\varepsilon$ | $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ | $(k, z)$-clustering |

Table 1: Summary of $(k, z)$-clustering results in the sliding window model for input points in $[\Delta]^d$ on a window of size $W$.

Moreover, our algorithm actually produces a coreset, i.e., a data structure that approximately answers the clustering cost of the underlying dataset with respect to any set of $k$ centers, not just the optimal $k$ centers.

Theorem 1.2. There exists an algorithm that samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ points and, with high probability, outputs a $(1+\varepsilon)$-coreset for $(k, z)$-clustering in the sliding window model for general metrics on $[\Delta]^d$.

We emphasize that the guarantees of Theorem 1.2 are for general metrics on $[\Delta]^d$, such as $L_p$ metrics. Note that in light of the properties of coresets, the guarantee of Theorem 1.1 follows from taking a coreset for $(k, z)$-clustering on Euclidean distances and then using an offline algorithm for $(k, z)$-clustering for post-processing after the data stream, i.e., see Theorem 2.4. Along the way, we provide a construction of a $(1+\varepsilon)$-online coreset for $(k, z)$-clustering for general metrics on $[\Delta]^d$.
An online coreset for $(k, z)$-clustering is a data structure on a data stream that will not only approximately answer the clustering cost of the underlying dataset with respect to any set of $k$ centers, but also approximately answer the clustering cost of any prefix of the data stream with respect to any set of $k$ centers.

Theorem 1.3. There exists an algorithm that samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ points and, with high probability, outputs a $(1+\varepsilon)$-online coreset for $(k, z)$-clustering.

We remark that Theorem 1.3 further has the attractive property that once a point is sampled into the online coreset at some point in the stream, the point irrevocably remains in the online coreset. That is, the online coreset essentially satisfies two different definitions of online: 1) the data structure is a coreset for any prefix of the stream and 2) points sampled into the data structure will never be deleted from the data structure. We further remark that due to leveraging the coreset construction of [27, 25, 26], we can similarly trade a factor of $\frac{1}{\varepsilon^z}$ for a $\mathrm{poly}(k)$ in the guarantees of Theorem 1.1, Theorem 1.2, and Theorem 1.3.

By contrast, the lower bound of [25] states that any offline coreset construction for k-means clustering requires $\Omega\left(\frac{k}{\varepsilon^2}\right)$ points. This lower bound was later strengthened to $\Omega\left(\frac{k}{\varepsilon^{2+z}}\right)$ points by [45], for which matching upper bounds are given by [27, 25]. Thus our online coreset constructions are near-optimal in the $k$ and $\frac{1}{\varepsilon}$ dependencies for $z > 1$ and nearly match the best known offline constructions for $z = 1$. It is thus a natural question to ask whether our polylogarithmic overheads in Theorem 1.3 are necessary for a $(1+\varepsilon)$-online coreset. We show that, in fact, a logarithmic overhead is indeed necessary to maintain a $(1+\varepsilon)$-online coreset.

Theorem 1.4. Let $\varepsilon \in (0, 1)$. For sufficiently large $n$, $d$, and $\Delta$, there exists a set $X \subset [\Delta]^d$ of $n$ points $x_1, \ldots, x_n$ such that any $(1+\varepsilon)$-online coreset for k-means clustering on $X$ requires $\Omega\left(\frac{k}{\varepsilon^2}\log n\right)$ points.

We emphasize that combined with existing offline coreset constructions [25, 26], Theorem 1.4 shows a separation between the problems of constructing offline coresets and online coresets. That is, the problem of maintaining a data structure that recovers coresets for all prefixes of the stream is provably harder than maintaining a coreset for an offline set of points.

1.2 Technical Overview

In this section, we give a high-level overview of our techniques. We also describe the limitations of many natural approaches.

Shortcomings of histograms and sensitivity sampling. A first attempt at clustering in the sliding window model might be to adapt the popular exponential histogram [33] and smooth histogram techniques [18]. These frameworks convert streaming algorithms to sliding window algorithms in the case that the objective function is smooth, which informally means that once a suffix of a data stream becomes a good approximation of the overall data stream, it always remains a good approximation, regardless of the values of new elements that arrive in the stream. Unfortunately, [16] showed that the k-clustering objective function is not smooth and thus these histogram-based frameworks cannot work. Nevertheless, they gave the first constant-factor approximation by showing that the k-clustering objective function is almost-smooth using a generalized triangle inequality, which inherently loses constant factors and thus will not suffice for our goal of achieving a $(1+\varepsilon)$-approximation.
Another approach might be to adapt the popular sensitivity sampling framework of coreset construction [37, 39, 10, 29]. The sensitivity sampling framework assigns a value to each point, called the sensitivity, which intuitively quantifies the importance of that point, and then samples each point with probability proportional to its sensitivity. [9] observed that sliding window algorithms can be achieved from online sensitivity sampling, where the importance of each point is measured against the prefix of the stream, and then running the process in reverse at each time, so that more emphasis is placed on the suffix of the sliding window. At a high level, this is the intuition taken by [34, 35], which leverage data structures that prioritize more recent elements of the data stream. However, it is not known how to achieve optimal bounds simply using sensitivity sampling, and indeed the optimal coreset constructions use slightly more nuanced sampling schemes [27, 25]. Sliding window algorithms from online coresets. Instead, we recall an observation by [9], who noted that deterministic constructions for online coresets for linear algebraic problems can be utilized to obtain sliding window algorithms for the corresponding linear algebraic problems. We first extend this observation to randomized constructions for online coresets for k-clustering problem. The intuition is quite simple. Given an (1 + ε)-online coreset algorithm for a k-clustering problem on a data stream of length n from Rd that stores S(n, d, k, ε, δ) weights points and succeeds with probability 1 δ, we store the S(n, d, k, ε , δ ) most recent points in the stream, where ε = O ε log n and δ = δ poly(n). We then feed the S(n, d, k, ε , δ ) points to the online coreset construction in reverse order of their arrival. Since the online coreset preserves all costs for all prefixes of its input, then the resulting data structure will preserve all costs for all suffixes of the data stream. To extend this guarantee to the entire stream, including the sliding window, we can then use a standard merge-andreduce framework. It thus remains to devise a (1 + ε)-online coreset construction for k-clustering with near-optimal sampling complexity. Online coreset construction. To that end, our options are quite limited, as to the best of our knowledge, the only offline coreset constructions using O k ε4 log n words of space when the input is from [ ]d are due to [27, 25]. Fortunately, although the analyses of correctness for these sampling schemes are quite involved, the constructions themselves are quite accessible. For example, [27] first uses an (α, β)-approximation, i.e., a clustering that achieves α-approximation to the optimal cost but uses βk centers, to partition the underlying dataset X into disjoint concentric rings around each of the βk centers. These rings are then gathered into groups and it is shown that by independently sampling a fixed number of points with replacement from each of the groups suffices to achieve a (1 + ε)-coreset. Their analysis argues that the contribution of each of the groups toward the overall k-clustering cost is preserved through an expectation and variance bounding argument, and then taking a sophisticated union bound over a net over the set of possible centers. Thus their argument still holds when each point of the dataset is independently sampled by the data structure with probability proportional to the probability it would have been sampled by the group. 
Moreover, independently sampling each point with a higher probability can only decrease the variance, so that correctness is retained, though we must also upper bound the number of sampled points. Crucially, independently sampling each point can be implemented in the online setting and the probability of correctness can be boosted to union bound over all times in the stream, which facilitates the construction of our (1 + ε)-online coreset, given an (α, β)-approximation. Consistent (α, β)-approximation. It seemingly remains to find (α, β)-approximations for kclustering at all times in the stream. A natural approach would be to use an algorithm that achieves a (α, β)-approximation at a certain time in the stream with constant probability, e.g., [59], boost the probability of success to 1 1 poly(n), and the union bound to argue correctness over all times in the stream. However, a subtle pitfall here is that the rings and groups in the offline coreset construction of [27] are with respect to a specific (α, β)-approximation. Hence their analysis would no longer hold if a point xt was assigned to cluster i1 at time t when the sampling process occurs but then assigned to cluster i2 at the end of the stream. Therefore, we require a consistent (α, β)-approximation, so that once the algorithm assigns point xt to cluster i, then the point xt will always remain in cluster i even if a newer and closer center is subsequently opened later in the stream. To that end, we invoke a result of [34] that analyzes the popular Meyerson online facility location algorithm, along with a standard guess-and-double approach for estimating the input parameter to the Meyerson subroutine. Lower bound. The intuition for our lower bound that any (1+ε)-online coreset for (k, z)-clustering requires Ω k ε2 is somewhat straightforward and in a black-box manner. We first observe that [25] showed the existence of a set X of Ω k ε2 unit vectors in Rd such that any coreset with o k samples provably cannot accurately estimate the (k, z)-clustering cost for a set C of k unit vectors. Since an online (1 + ε)-coreset must answer queries on all prefixes of the stream, we embed Ω(log n) instances of X. We first increase the dimension by a log n factor so that each of these instances can have disjoint support. We then give each of the instances increasingly exponential weight to force the data structure to sample Ω k ε2 points for each instance. Specifically, we insert τ i copies of the i-th instance of X, where τ > 1 is some constant. Because the weight of the i-th instance is substantially greater than the sum of the weights of all previous instances, then any (1 + ε)-online coreset must essentially be a (1 + ε)-offline coreset for the i-th instance, thus requiring Ω k ε2 points for the i-th instance. This reasoning extends to all Ω(log n) instances, thus showing that any online (1 + ε)-coreset requires Ω k ε2 log n points. 2 Algorithm In this section, we describe our sliding window algorithm for k-clustering. We first overview the construction of an online (1 + ε) coreset for (k, z)-clustering under general discrete metrics. We then describe how our online coreset construction for (k, z)-clustering on general discrete metric spaces can be used to achieve near-optimal space algorithms for (k, z)-clustering in the sliding window model. Online (1 + ε)-coreset. We first recall the following properties from the Meyerson sketch, which we formally introduce in Appendix A. Theorem 2.1. [8] Given an input stream x1, . . . 
, $x_n \in \mathbb{R}^d$ defining a set $X \subseteq [\Delta]^d$, there exists an online algorithm MULTMEYERSON that, with probability at least $1 - \frac{1}{\mathrm{poly}(n)}$: (1) on the arrival of each point $x_i$, assigns $x_i$ to a center in $C$ through a mapping $\pi : X \to C$, where $C$ contains at most $O\left(2^{2z} k \log n \log\Delta\right)$ centers; (2) $\sum_{x_i \in X} \|x_i - \pi(x_i)\|_2^z \leq 2^{2z+7}\,\mathrm{Cost}_{|S|\leq k}(X, S)$; (3) MULTMEYERSON uses $O\left(2^z k \log^3(nd\Delta)\right)$ words of space.

We also use the following notation, adapted from [27] to the online setting. Let $A$ be an $(\alpha, \beta)$-approximation for a k-means clustering on an input set $X \subseteq [\Delta]^d$ and let $C_1, \ldots, C_{\beta k}$ be the clusters of $X$ induced by $A$. Suppose the points of $X$ arrive in a data stream $S$. For a fixed $\varepsilon > 0$, define the following notions of rings and groups:

- The average cost of cluster $C_i$ is denoted by $\kappa_{C_i} := \frac{\mathrm{Cost}(C_i, A)}{|C_i|}$.
- For any $i, j$, the ring $R_{i,j}$ is the set of points $x \in C_i$ such that $2^j \kappa_{C_i} \leq \mathrm{Cost}(x, A) < 2^{j+1}\kappa_{C_i}$. For any $j$, $R_j = \bigcup_i R_{i,j}$.
- The inner ring $R_I(C_i) = \bigcup_{j \leq 2z\log\frac{\varepsilon}{z}} R_{i,j}$ is the set of points of $C_i$ with cost at most $\left(\frac{\varepsilon}{z}\right)^{2z}\kappa_{C_i}$. More generally, for a solution $S$, let $R^S_I$ denote the union of the inner rings induced by $S$.
- The outer ring $R_O(C_i) = \bigcup_{j \geq 2z\log\frac{z}{\varepsilon}} R_{i,j}$ is the set of points of $C_i$ with cost at least $\left(\frac{z}{\varepsilon}\right)^{2z}\kappa_{C_i}$. More generally, for a solution $S$, let $R^S_O$ denote the union of the outer rings induced by $S$.
- The main ring $R_M(C_i)$ is the set of points of $C_i$ that are not in the inner or outer rings, i.e., $R_M(C_i) = C_i \setminus (R_I(C_i) \cup R_O(C_i))$.
- For any $j$, the group $G_{j,b}$ consists of the $(2^{b-1}+1)$-th to $(2^b)$-th points of each ring $R_{i,j}$ that arrive in $S$.
- For any $j$, we use $G_{j,\min}$ to denote the union of the groups with the smallest costs, i.e., $G_{j,\min} = \left\{x \mid \exists i,\ x \in R_{i,j},\ \mathrm{Cost}(R_{i,j}, A) < \frac{2\varepsilon}{z}\,\mathrm{Cost}(R_j, A)\right\}$.
- The outer groups $G^O_b$ partition the outer rings $R^A_O$ so that $G^O_b = \left\{x \mid \exists i,\ x \in C_i,\ \frac{\varepsilon}{z}\cdot\frac{\mathrm{Cost}(R^A_O, A)}{\beta k}\cdot 2^b \leq \mathrm{Cost}(R_O(C_i), A) < \frac{\varepsilon}{z}\cdot\frac{\mathrm{Cost}(R^A_O, A)}{\beta k}\cdot 2^{b+1}\right\}$. We define $G^O_{\min} = \bigcup_{b \leq 0} G^O_b$ and $G^O_{\max} = \bigcup_{b \geq z\log\frac{4z}{\varepsilon}} G^O_b$.

Algorithm 1 RINGSAMPLE
Input: Points $x_1, \ldots, x_n \in [\Delta]^d$
Output: A set $W$ of weighted points and timestamps
1: Initiate an instance of the $(\alpha, \beta)$-bicriteria algorithm MULTMEYERSON
2: $\gamma \leftarrow \frac{C\,\max(\alpha^2, \alpha^z)\,\beta}{\min(\varepsilon^2, \varepsilon^z)}\,\log^2\frac{1}{\varepsilon}\left(k\log|C| + \log\log\frac{1}{\varepsilon} + \log n\right)\log^2\frac{1}{\varepsilon}$
3: $W \leftarrow \emptyset$
4: for each point $x_t$, $t \in [n]$ do
5: Let $c_i$ be the center assigned to $x_t$ by MULTMEYERSON
6: Let $j \in \mathbb{Z}$ be such that $2^j \leq \|x_t - c_i\|_2^z < 2^{j+1}$
7: Let $b \in \mathbb{Z}$ be such that the number of points in $R_{i,j}$ is between $2^{b-1}+1$ and $2^b$
8: Let $r_t$ be the number of points in $G_{j,b}$ at time $t$
9: $p_x \leftarrow \min\left(\frac{4}{r_t}\,\gamma\log n,\ 1\right)$
10: With probability $p_x$, add $x_t$ to $W$ with timestamp $t$ and weight $\frac{1}{p_x}$
11: return $W$

We then adapt the offline coreset construction of [27] to an online setting at the cost of logarithmic overheads, which suffice for our purpose. The algorithm (Algorithm 1) has the following guarantees:

Lemma 2.2. Let $C$ be an $A$-approximate centroid set for a fixed group $G$. There exists an algorithm RINGSAMPLE that samples
$$O\left(\frac{\max(\alpha^2, \alpha^z)\,\beta}{\min(\varepsilon^2, \varepsilon^z)}\,\log^2\frac{1}{\varepsilon}\left(k\log|C| + \log\log\frac{1}{\varepsilon} + \log n\right)\log^2 n\,\log^2\Delta\,\log^2\frac{1}{\varepsilon}\right)$$
points and, with high probability, outputs a $(1+\varepsilon)$-online coreset for the k-means clustering problem.

Informally, an approximate centroid set is a set of possible points such that taking the centers from this set generates an approximately accurate solution (see Appendix B for a formal definition). To bound $\log|C|$, we construct and apply a terminal embedding to project each point to a lower dimension and then appeal to known bounds for approximate centroid sets in low-dimensional Euclidean space, thereby giving our online coreset algorithm with the guarantees of Theorem 1.3.
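To make the sampling mechanics of Algorithm 1 concrete, below is a minimal Python sketch of the ring-and-group sampling step. It is only a schematic: the consistent $(\alpha, \beta)$-bicriteria assignment of MULTMEYERSON is replaced by a naive rule that opens a new center whenever a point is far from all currently open centers (centers are never moved, which preserves consistency), $\gamma$ is treated as a user-chosen constant, and the class and parameter names are illustrative rather than taken from the paper's implementation.

```python
import math
import random

class RingSampleSketch:
    """Toy version of Algorithm 1 (RINGSAMPLE): sample each arriving point with
    probability min(4*gamma*log(n)/r_t, 1), where r_t is the current size of the
    group G_{j,b} the point falls into under a consistent center assignment."""

    def __init__(self, n, z=2, gamma=0.5, open_radius=30.0):
        self.n, self.z, self.gamma = n, z, gamma
        self.open_radius = open_radius  # crude stand-in for MULTMEYERSON's opening rule
        self.centers = []               # centers are never moved or closed once opened
        self.ring_sizes = {}            # (center index i, ring index j) -> count
        self.group_sizes = {}           # (ring index j, group index b)  -> count
        self.coreset = []               # list of (point, timestamp, weight)
        self.t = 0

    def _assign(self, x):
        # Consistent assignment: nearest already-open center, or open a new one.
        if self.centers:
            i, c = min(enumerate(self.centers), key=lambda ic: math.dist(x, ic[1]))
            if math.dist(x, c) <= self.open_radius:
                return i, c
        self.centers.append(x)
        return len(self.centers) - 1, x

    def insert(self, x):
        self.t += 1
        i, c = self._assign(x)
        cost = max(math.dist(x, c) ** self.z, 1e-12)
        j = math.floor(math.log2(cost))        # ring index: 2^j <= cost < 2^(j+1)
        r_ij = self.ring_sizes.get((i, j), 0) + 1
        self.ring_sizes[(i, j)] = r_ij
        # Group index b: this point is among the (2^(b-1)+1)-th to 2^b-th points of R_{i,j}
        # (the very first point of a ring is lumped into b = 0 in this toy version).
        b = max(0, math.ceil(math.log2(r_ij)))
        r_t = self.group_sizes.get((j, b), 0) + 1
        self.group_sizes[(j, b)] = r_t
        p = min(4.0 * self.gamma * math.log(self.n) / r_t, 1.0)
        if random.random() < p:
            # Once sampled, a point stays in the coreset forever (the online property).
            self.coreset.append((x, self.t, 1.0 / p))

random.seed(0)
stream = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(2000)]
sketch = RingSampleSketch(n=len(stream))
for x in stream:
    sketch.insert(x)
print(len(sketch.coreset), "weighted points retained out of", len(stream))
```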
Sliding window model. We first recall a standard approach for using offline coreset constructions for insertion-only streaming algorithms. Suppose there exists a randomized algorithm that produces an online coreset using $S(n, \varepsilon, \delta)$ points for an input stream of length $n$, accuracy $\varepsilon$, and failure probability $\delta$, where for the ease of discussion, we omit additional dependencies. A standard approach for using coresets on insertion-only streams is the merge-and-reduce approach, which partitions the stream into blocks of size $S\left(n, \frac{\varepsilon}{2\log n}, \frac{\delta}{\mathrm{poly}(n)}\right)$ and builds a coreset for each block. Each coreset is then viewed as a leaf of a binary tree with height at most $\log n$, since the binary tree has at most $n$ leaves. Then at each level of the binary tree, for each node in the level, a coreset of size $S\left(n, \frac{\varepsilon}{2\log n}, \frac{\delta}{\mathrm{poly}(n)}\right)$ is built from the coresets representing the two children of the node. Due to the mergeability property of coresets, the coreset at the root of the tree will be a coreset for the entire stream with accuracy $\left(1 + \frac{\varepsilon}{2\log n}\right)^{\log n} \leq (1+\varepsilon)$ and failure probability $\delta$. This approach fails for sliding window algorithms because the elements at the beginning of the data stream can expire, so coresets corresponding to earlier blocks of the stream may no longer be accurate, which would result in the coreset at the root of the tree also no longer being accurate.

On the other hand, suppose we partition the stream into blocks consisting of $S\left(n, \frac{\varepsilon}{2\log n}, \frac{\delta}{\mathrm{poly}(n)}\right)$ elements as before, but instead of creating an offline coreset for each block, we create an online coreset for the elements in reverse. That is, since the elements in each block are explicitly stored, we can create offline an artificial stream consisting of the elements of the block in reverse and then give the artificial stream as input to the online coreset construction. Note that if we also first consider the latter coreset when merging two coresets, then this effectively reverses the stream. Moreover, by the correctness of the online coreset, our data structure provides correctness over any prefix of the reversed stream, or equivalently, any suffix of the stream, and specifically, correctness over the sliding window. We thus further adapt the merge-and-reduce framework to show that randomized online coresets for problems in clustering can also be used to achieve randomized algorithms for the corresponding problems in the sliding window model. We formalize this approach in Algorithm 2.

Algorithm 2 Merge-and-reduce framework for randomized algorithms in the sliding window model, using randomized constructions of online coresets
Input: A clustering function $f$, a set of points $x_1, \ldots, x_n \in \mathbb{R}^d$, accuracy parameter $\varepsilon > 0$, failure probability $\delta \in (0, 1)$, and window size $W > 0$
Output: An approximation of $f$ on the $W$ most recent points
1: Let CORESET$(X, n, d, k, \varepsilon, \delta)$ be an online coreset construction with $S(n, d, k, \varepsilon, \delta)$ points on a set $X \subset \mathbb{R}^d$
2: $m \leftarrow O\left(S\left(n, d, k, \frac{\varepsilon}{\log n}, \frac{\delta}{n^2}\right)\right)$
3: Initialize blocks $B_0, B_1, \ldots, B_{\log n}$
4: for each point $x_t$ with $t \in [n]$ do
5: if $B_0$ does not contain $m$ points then
6: Prepend $x_t$ to $B_0$, i.e., $B_0 \leftarrow \{x_t\} \cup B_0$
7: else
8: Let $i$ be the smallest index such that $B_i = \emptyset$
9: $B_i \leftarrow$ CORESET$\left(Y, n, d, k, \frac{\varepsilon}{\log n}, \frac{\delta}{n^2}\right)$ for $Y = B_0 \cup \ldots \cup B_{i-1}$ ▷ $Y$ is an ordered set of weighted points
10: for $j = 0$ to $j = i-1$ do
11: $B_j \leftarrow \emptyset$
12: $B_0 \leftarrow \{x_t\}$
13: return the ordered set $B_{\log n} \cup \ldots \cup B_0$

Theorem 2.3. Let $x_1, \ldots, x_n$ be a stream of points in $[\Delta]^d$, let $\varepsilon > 0$, and let $X = \{x_{n-W+1}, \ldots, x_n\}$ be the $W$ most recent points. Suppose there exists a randomized algorithm that, with probability at least $1-\delta$, outputs an online coreset for a k-clustering problem with $S(n, d, k, \varepsilon, \delta)$ points. Then there exists a randomized algorithm that, with probability at least $1-\delta$, outputs a coreset for the k-clustering problem in the sliding window model with $O\left(S\left(n, d, k, \frac{\varepsilon}{\log n}, \frac{\delta}{n^2}\right)\log n\right)$ points.

By Theorem 1.3 and Theorem 2.3, we have:

Theorem 2.4. There exists an algorithm that samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ points and, with high probability, outputs a $(1+\varepsilon)$-coreset for $(k, z)$-clustering in the sliding window model.

Using an offline algorithm for $(k, z)$-clustering for post-processing after the data stream, we have Theorem 1.1.
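The reverse-feeding step described above is mechanical, and the following Python sketch isolates it. The `KeepAllOnlineCoreset` placeholder simply retains every point so that the example runs; a real instantiation would be the online coreset of Theorem 1.3. Only the interface and the reversal logic are the point here, and all names are illustrative.

```python
class KeepAllOnlineCoreset:
    """Placeholder online coreset: retains every point it is fed (never deletes)."""
    def __init__(self):
        self.points = []

    def insert(self, x):
        self.points.append(x)

    def prefix(self, t):
        # A coreset that is valid for the first t points fed to this structure.
        return self.points[:t]


def coreset_for_suffixes(block, make_online_coreset=KeepAllOnlineCoreset):
    """Feed a stored block to an online coreset in reverse arrival order.
    Since the online coreset is valid for every prefix of its input, the result
    is valid for every suffix (i.e., the most recent points) of the block."""
    oc = make_online_coreset()
    for x in reversed(block):
        oc.insert(x)
    return oc


block = list(range(10))        # stand-ins for points x_1, ..., x_10 of one block
oc = coreset_for_suffixes(block)
# A coreset for the 3 most recent points of the block is a prefix of length 3
# of the reversed stream that was fed to the online coreset.
print(oc.prefix(3))            # -> [9, 8, 7], i.e., {x_8, x_9, x_10}
```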
3 Experimental Evaluations

In this section, we conduct simple empirical demonstrations as proof-of-concepts to illustrate the benefits of our algorithm. Our empirical evaluations were conducted using Python 3.10 on a 64-bit operating system on an AMD Ryzen 7 5700U CPU, with 8 GB RAM and 8 cores with base clock 1.80 GHz. The general approach of our experiments is to produce a data stream $S$ that defines a dataset $X$, whose generation we describe below, as well as in Appendix F. We then compare the performance of a simplified version of our algorithm with various state-of-the-art baselines.

Baselines. Our first baseline (denoted off for offline) is the simple Lloyd's algorithm on the entire dataset $X$, with multiple iterations using the k-means++ initialization. This is a standard approach for finding a good approximation to the optimal clustering cost, because finding the true optimal centers requires exponential time. Because this offline Lloyd's algorithm has access to the entire dataset, the expected behavior is that this algorithm will have the best objective, i.e., the smallest clustering cost. However, we emphasize that this algorithm requires storing the entire dataset $X$ in memory and thus its input size is significantly larger than that of the sublinear space algorithms. To compare with the offline Lloyd's algorithm, we run a number of sublinear space algorithms. These algorithms generally perform some sort of processing on the data stream to create a coreset $C$. We normalize the space requirement of these algorithms by permitting each algorithm to store $m$ points across specific ranges of $m$. We then run Lloyd's algorithm on the coreset $C$, with the same number of iterations using the k-means++ initialization. Our first sublinear space algorithm is uniform sampling on the dataset $X$. That is, we form $C$ by uniformly sampling $m$ points from $X$, before running Lloyd's algorithm. We use uni to denote this algorithm, whose first step is based on uniform sampling. Our second sublinear space algorithm is the importance sampling approach used by histogram-based algorithms, e.g., [15, 11, 8]. These algorithms perform importance sampling, i.e., they sample points into the coreset $C$ with probability proportional to their distances from existing samples and delete points once the clustering cost of $C$ is much higher than the clustering cost of the dataset $X$. We use hist(ogram) to denote this algorithm, which is based on the histogram frameworks for sliding windows. Our final algorithm is a simplified version of our algorithm. As with the histogram-based algorithm, we perform importance sampling on the stream $S$ to create the coreset $C$ of size $m$; thus we do not implement the ring and group sampling subroutines of our full algorithm.
However, the crucial difference compared to the histogram-based approach is that we forcefully discard points of C that have expired. We use imp to denote this algorithm whose first step is based on importance sampling. Dataset. We first describe the methodology and experimental setup of our empirical evaluation on a real-world dataset with an amount of synthetic noise before detailing the experimental results. The first component of our dataset consists of the points of the SKIN (Skin Segmentation) dataset X from the publicly available UCI repository [6], which was also used in the experiments of [8]. The dataset X consists of 245, 057 points with four features, where each point refers to a separate image, such that the first three features are constructed over BGR space, and the fourth feature is the label for whether or not the image refers to a skin sample. We subsequently pre-process each dataset to have zero mean and unit standard deviation in each dimension. We then form our dataset X by augmenting X with 201 points in four-dimensional space, where 100 of these points were drawn from a spherical Gaussian with unit standard deviation in each direction and centered at ( 10, 10, 0, 0) and 100 of these points were drawn from a spherical Gaussian with unit standard deviation in each direction and centered at (10, 10, 0, 0). The final point of X was drawn from a spherical Gaussian with unit standard deviation centered at (500, 500, 0, 0). Thus our dataset X has dimensions n = 245, 258 and d = 4. We then create the data stream S by prepending two additional points drawn from spherical Gaussians with standard deviation 2.75 centered at ( 10, 10, 0, 0) and ( 10, 10, 0, 0) respectively, so that the stream has length 245, 260. We set the window length to be 245, 258 in accordance with the true data set, so that the first two points of the stream will be expired by the data stream. Experimental setup. For each of the instances of Lloyd s algorithm, either on the entire dataset X or the sampled coreset C, we use 10 iterations using the k-means++ initialization. While the offline Lloyd s algorithm stores the entire dataset X of 245, 258 points in memory, we only allow each of the sublinear-space algorithms to store a fixed m points. We compare the algorithms across m {5, 10, 15, 20, 25, 30} and k {2, 3, 4, 5, 6, 7, 8, 9, 10}. Note that in the original dataset, each of the points has a label for either skin or non-skin, which would be reasonable for k = 2. However, due to the artificial structure possibly induced by the synthetic noise, it also makes sense to other values of k. In particular, preliminary experiments from uniform sampling by the elbow method indicated that k = 3 would be a reasonable setting. Thus we fix k = 3 while varying m {5, 10, 15, 20, 25, 30} and we arbitrarily fix m = 25 while varying k {2, 3, 4, 5, 6, 7, 8, 9, 10}. Experimental results. For each choice of m and k, we ran each algorithm 30 times and tracked the resulting clustering cost. Our algorithm demonstrated superior performance than the other sublinearspace algorithms across all values of m {5, 10, 15, 20, 25, 30} and k {2, 3, 4, 5, 6, 7, 8, 9, 10}, and was even quite competitive with the offline Lloyd s algorithm, even though our algorithm only used memory size m 30, while the offline algorithm used memory 245, 258. Uniform sampling performed well for k = 2, which in some case captures the structure imposed on the data through the skin vs. 
non-skin label, but for larger $k$, the optimal solutions start placing centers to handle the synthetic noise, which may not be sampled by uniform sampling. Thus uniform sampling performed relatively poorly, albeit quite stably, for larger $k$. In contrast, the histogram-based algorithm performed poorly for small $k$ across all our ranges of $m$, due to sampling the extra points in $S \setminus X$, so that the resulting Lloyd's algorithm on $C$ moved the centers far away from the optimal centers of $X$. On the other hand, the histogram-based algorithm performed well for larger $k$, likely due to the additional centers that could be afforded to handle the points in $S \setminus X$. We plot our results in Figure 1 and defer additional experiments to Appendix F.

Fig. 1: Comparison of average clustering costs made by uniform sampling, the histogram-based algorithm, and our coreset-based algorithm across various settings of the space allocated to the algorithm, given a synthetic dataset: (a) comparisons for varying $k$; (b) comparisons for varying $m$. For comparison, we also include the offline k-means++ algorithm as a baseline, though it is inefficient because it stores the entire dataset.

4 Conclusion

In this paper, we give an algorithm that outputs a $(1+\varepsilon)$-approximation to $(k, z)$-clustering in the sliding window model, while using $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\,\frac{n\Delta}{\varepsilon}$ words of space when the points are from $[\Delta]^d$. Our algorithm not only improves on a line of work [5, 16, 34, 8, 35], but also nearly matches the space used by the offline coreset constructions of [27], which is known to be near-optimal in light of an $\Omega\left(\frac{k}{\varepsilon^{2+z}}\right)$ lower bound for the size of an offline coreset [45]. We also give a lower bound showing that a logarithmic overhead in the number of points is needed to maintain a $(1+\varepsilon)$-online coreset compared to a $(1+\varepsilon)$-coreset. That is, in Theorem 1.4 we gave a set $X \subset [\Delta]^d$ of $n$ points $x_1, \ldots, x_n$ such that any $(1+\varepsilon)$-online coreset for k-means clustering on $X$ requires $\Omega\left(\frac{k}{\varepsilon^2}\log n\right)$ points. However, this does not rule out whether the $\log n$ overhead is necessary for $(k, z)$-clustering in the sliding window model, since a sliding window algorithm does not necessarily need to maintain an online coreset. We leave this question as a possible direction for future work.

Acknowledgments

David P. Woodruff and Samson Zhou were partially supported by a Simons Investigator Award and by the National Science Foundation under Grant No. CCF-1815840. This work was done in part while Samson Zhou was at Carnegie Mellon University, UC Berkeley, and Rice University.

References

[1] Marcel R. Ackermann, Marcus Märtens, Christoph Raupach, Kamil Swierkot, Christiane Lammersen, and Christian Sohler. Streamkm++: A clustering algorithm for data streams. ACM J. Exp. Algorithmics, 17(1), 2012. 2
[2] Miklós Ajtai, Vladimir Braverman, T. S. Jayram, Sandeep Silwal, Alec Sun, David P. Woodruff, and Samson Zhou. The white-box adversarial data stream model. In PODS '22: International Conference on Management of Data, 2022, pages 15-27, 2022. 2
[3] Apple. https://images.apple.com/privacy/docs/Differential_Privacy_Overview.pdf. 2
[4] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 1-16, 2002. 2
[5] Brian Babcock, Mayur Datar, Rajeev Motwani, and Liadan O'Callaghan. Maintaining variance and k-medians over data stream windows.
In Proceedings of the Twenty-Second ACM SIGACTSIGMOD-SIGART Symposium on Principles of Database Systems, pages 234 243, 2003. 2, 3, 9 [6] Rajen Bhatt and Abhinav Dhall. Skin segmentation dataset. UCI Learning Repository. 9 [7] Jeremiah Blocki, Seunghoon Lee, Tamalika Mukherjee, and Samson Zhou. Differentially private l2-heavy hitters in the sliding window model. In The Eleventh International Conference on Learning Representations, ICLR, 2023. 2 [8] Michele Borassi, Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Sliding window algorithms for k-clustering problems. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, Neur IPS, 2020. 2, 5, 8, 9, 16 [9] Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P. Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 517 528, 2020. 2, 4, 21 [10] Vladimir Braverman, Dan Feldman, Harry Lang, Adiel Statman, and Samson Zhou. Efficient coreset constructions via sensitivity sampling. In Asian Conference on Machine Learning, ACML, pages 948 963, 2021. 2, 4 [11] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 576 585, 2017. 2, 8 [12] Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. How to catch l2-heavy-hitters on sliding windows. Theor. Comput. Sci., 554:82 94, 2014. 2 [13] Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, and Samson Zhou. Nearly optimal distinct elements and heavy hitters on sliding windows. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pages 7:1 7:22, 2018. 2 [14] Vladimir Braverman, Avinatan Hassidim, Yossi Matias, Mariano Schain, Sandeep Silwal, and Samson Zhou. Adversarial robustness of streaming algorithms through importance sampling. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, Neur IPS, pages 3544 3557, 2021. 2 [15] Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering on sliding windows in polylogarithmic space. In 35th IARCS Annual Conference on Foundation of Software Technology and Theoretical Computer Science, FSTTCS, pages 350 364, 2015. 2, 8 [16] Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering problems on sliding windows. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1374 1390, 2016. 2, 3, 4, 9 [17] Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky, Alan Roytman, Michael Shindler, and Brian Tagiku. Streaming k-means on well-clusterable data. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 26 40, 2011. 2 [18] Vladimir Braverman and Rafail Ostrovsky. Smooth histograms for sliding windows. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), Proceedings, pages 283 293, 2007. 2, 4 [19] Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling from sliding windows. J. Comput. Syst. Sci., 78(1):260 272, 2012. 2 [20] Vladimir Braverman, Viska Wei, and Samson Zhou. Symmetric norm estimation and regression on sliding windows. 
In Computing and Combinatorics - 27th International Conference, COCOON, Proceedings, pages 528 539, 2021. 2 [21] Moses Charikar, Liadan O Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pages 30 39, 2003. 2 [22] Jiecao Chen, Huy L. Nguyen, and Qin Zhang. Submodular maximization over sliding windows. Co RR, abs/1611.00129, 2016. 2 [23] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 39(3):923 947, 2009. 2 [24] Vincent Cohen-Addad, 2022. Private communication. 17 [25] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1038 1051, 2022. 2, 3, 4, 5, 22 [26] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In Neur IPS, 2022. 3, 4 [27] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC: 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 169 182, 2021. 2, 3, 4, 5, 6, 9, 16, 17, 19, 24, 25, 26 [28] Vincent Cohen-Addad, Chris Schwiegelshohn, and Christian Sohler. Diameter and k-center in sliding windows. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP, pages 19:1 19:12, 2016. 2 [29] Vincent Cohen-Addad, David P. Woodruff, and Samson Zhou. Streaming euclidean k-median and k-means with o(log n) space. Co RR, abs/2310.02882, 2023. 2, 4 [30] Graham Cormode. The continuous distributed monitoring model. SIGMOD Rec., 42(1):5 14, 2013. 2 [31] Graham Cormode and Minos N. Garofalakis. Streaming in a connected world: querying and tracking distributed data streams. In EDBT 2008, 11th International Conference on Extending Database Technology, Proceedings, page 745, 2008. 2 [32] Graham Cormode and S. Muthukrishnan. What s new: finding significant differences in network data streams. IEEE/ACM Trans. Netw., 13(6):1219 1232, 2005. 2 [33] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794 1813, 2002. 2, 4 [34] Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Submodular optimization over sliding windows. In Proceedings of the 26th International Conference on World Wide Web, WWW, pages 421 430, 2017. 2, 3, 4, 5, 9 [35] Alessandro Epasto, Mohammad Mahdian, Vahab S. Mirrokni, and Peilin Zhong. Improved sliding window algorithms for clustering and coverage via bucketing-based sketches. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 3005 3042, 2022. 2, 3, 4, 9 [36] Facebook. https://www.facebook.com/policy.php. 2 [37] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC, pages 569 578, 2011. 2, 4 [38] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constantsize coresets for k-means, pca, and projective clustering. SIAM J. Comput., 49(3):601 657, 2020. 19 [39] Dan Feldman and Leonard J. Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1343 1354, 2012. 
2, 4 [40] Zhili Feng, Praneeth Kacham, and David P. Woodruff. Strong coresets for subspace approximation and k-median in nearly linear time. Co RR, abs/1912.12003, 2019. 19 [41] Google. https://policies.google.com/technologies/retention. 2 [42] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O Callaghan. Clustering data streams. In 41st Annual Symposium on Foundations of Computer Science, FOCS, pages 359 366, 2000. 2 [43] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discret. Comput. Geom., 37(1):3 19, 2007. 2 [44] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 291 300, 2004. 2 [45] Lingxiao Huang, Jian Li, and Xuan Wu. Towards optimal coreset construction for (k, z)- clustering: Breaking the quadratic dependency on k. Co RR, abs/2211.11923, 2022. 2, 3, 9 [46] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in euclidean spaces: importance sampling is nearly optimal. In Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 1416 1429, 2020. 2, 19 [47] Rajesh Jayaram, David P. Woodruff, and Samson Zhou. Truly perfect samplers for data streams and sliding windows. In PODS 22: International Conference on Management of Data, pages 29 40, 2022. 2 [48] Lap-Kei Lee and H. F. Ting. Maintaining significant stream statistics over sliding windows. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 724 732, 2006. 2 [49] Lap-Kei Lee and H. F. Ting. A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In Proceedings of the Twenty-Fifth ACM SIGACTSIGMOD-SIGART Symposium on Principles of Database Systems, pages 290 297, 2006. 2 [50] J Mac Queen. Classification and analysis of multivariate observations. In 5th Berkeley Symp. Math. Statist. Probability, pages 281 297, 1967. 1 [51] Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnsonlindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 1027 1038, 2019. 2 [52] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. Proc. VLDB Endow., 5(12):1699, 2012. 2 [53] Jirí Matousek. On approximate geometric k-clustering. Discret. Comput. Geom., 24(1):61 84, 2000. 17 [54] Adam Meyerson. Online facility location. In 42nd Annual Symposium on Foundations of Computer Science, FOCS, pages 426 431. IEEE Computer Society, 2001. 15 [55] Shyam Narayanan and Jelani Nelson. Optimal terminal dimensionality reduction in euclidean space. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 1064 1069, 2019. 20 [56] Miles Osborne, Sean Moran, Richard Mc Creadie, Alexander von Lünen, Martin D. Sykora, Amparo Elizabeth Cano, Neil Ireson, Craig Macdonald, Iadh Ounis, Yulan He, Tom Jackson, Fabio Ciravegna, and Ann O Brien. Real-time detection, tracking, and monitoring of automatically discovered events in social media. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, pages 37 42, 2014. 2 [57] Odysseas Papapetrou, Minos N. Garofalakis, and Antonios Deligiannakis. Sketching distributed sliding-window data streams. VLDB J., 24(3):345 368, 2015. 2 [58] Christian Sohler and David P. Woodruff. 
Strong coresets for k-median and subspace approximation: Goodbye dimension. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 802 813, 2018. 19 [59] Zhao Song, Lin F. Yang, and Peilin Zhong. Sensitivity sampling over dynamic geometric data streams with applications to k-clustering. Co RR, abs/1802.00459, 2018. 2, 5 [60] Hugo Steinhaus et al. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci, 1(804):801, 1956. 1 [61] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In International Conference on Artificial Intelligence and Statistics, AISTATS, pages 5391 5415, 2022. 2 [62] Zhewei Wei, Xuancheng Liu, Feifei Li, Shuo Shang, Xiaoyong Du, and Ji-Rong Wen. Matrix sketching over sliding windows. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference, pages 1465 1480. ACM, 2016. 2 [63] David P. Woodruff and Samson Zhou. Tight bounds for adversarially robust streams and sliding windows via difference estimators. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 1183 1196, 2021. 2 A Preliminaries For a positive integer n, we use the notation [n] to denote the set {1, . . . , n}. Similarly, we use [ ]d to denote {1, . . . , }d. We use poly(n) to denote a fixed polynomial in n with degree determined as necessary by setting the appropriate constants in corresponding variables. Similarly, we use polylog(n) to denote poly(log n). We suppress polylogarithmic dependencies by writing O (f( )) = O (f( )) polylog f( ). For (k, z)-clustering on a set X = {x1, . . . , xn} Rd using a set C of k centers and a distance function dist( , ), we define the notation Cost(X, C) = Pn i=1 minc C dist(xi, c)z. We also define the notation Cost|S| k(X, S) := min S:|S| k Cost(X, S), so that Cost|S| k is the cost of an optimal (k, z)-clustering. Definition A.1 ((α, β)-approximation). We say a set of centers C provides an (α, β)-approximation to the optimal k-means clustering on a set X if |C| βk and Cost(X, C) αOPT. Definition A.2 (Coreset). A coreset for (k, z)-clustering on an approximation parameter ε > 0 and a set X of points x1, . . . , xn Rd with distance function dist is a subset S of weighted points of X with weight function w such that for any set C of k points, we have i=1 dist(xi, C)z X q S w(q) dist(q, S)z (1 + ε) i=1 dist(xi, C)z. Definition A.3 (Online Coreset). An online coreset for (k, z)-clustering on an approximation parameter ε > 0 and a set X of points x1, . . . , xn Rd with distance function dist is a subset S of weighted points of X with weight function w such that for any set C of k points and for any t [n], we have i=1 dist(xi, C)z X q St w(q) dist(q, St)z (1 + ε) i=1 dist(xi, C)z, where St = S {X1, . . . , Xt}, i.e., the subset of S that has arrived at time t. Theorem A.4 (Bernstein s inequality). Let X1, . . . , Xn be independent random variables such that E[X2 i ] < and Xi 0 for all i [n]. Let X = P i Xi and γ > 0. Then Pr [X E[X] γ] exp γ2 If Xi E[Xi] for all i, then for σ2 i = E[X2 i ] E[Xi]2, Pr [X E[X] + γ] exp γ2 i σ2 i + 2γ /3 Meyerson sketch. We briefly review the Meyerson sketch [54] and the relevant properties that we need from the Meyerson sketch. The Meyerson sketch provides an (α, β)-approximation to (k, z)- clustering on a data stream of points x1, . . . , xn [ ]d with α = 2z+7 and β = O 22z log n log . 
Moreover, for our purposes, it provides the crucial property that on the arrival of each point xi, the algorithm irrevocably assigns xi to one of the βk centers. Specifically, the clustering cost at the end of the stream is computed with respect to the center that xi is assigned at time i, which may not be the closest center to xi because the closer center can be opened at a later time. For the ease of discussion, we describe the Meyerson sketch for z = 1; the intuition generalizes naturally to other values of z. The Meyerson sketch performs via a guess-and-double approach, where it first attempts to guess the cost of the optimal clustering cost. Using the guess of the cost, it then turns each point into a center with probability proportional to the distance of that point from the existing centers. This subroutine is illustrated in Algorithm 3. If too many centers have been opened, then the Meyerson sketch determines that the guess for the optimal clustering cost must have been too low and increases the guess. The overall algorithm is given in Algorithm 4. We require the following properties from the Meyerson sketch. Algorithm 3 High probability MEYERSON(X, ] OPT, α, δ, , z, k) sketch Input: Points X := x1, . . . , xn Rd with aspect ratio , estimate ] OPT 0 such that αOPT ] OPT OPT for some α (0, 1), failure probability δ (0, 1) Output: A coreset for k-clustering on X 1: γ 2 log 1 δ 2: for i [γ] do 3: for t [n] do 4: if t = 1 then 5: Mi x1, Cµi 0, wi(x1) = 1 6: else 7: if |Mi| 4k(1 + log ) 2z+3 αz + 1 then 8: With probability min k(1+log ) dist(xt,Mi)z ] OPT , 1 , add xt to Mi with weight 1, i.e., wi(xt) = 1 9: Otherwise, let z = argminy Mi dist(xt, y), increment the weight of z, i.e., wi(z) wi(z) + 1, and increase Cµi Cµi dist(xt, z)p 10: Let j = argmini:|Mi| 4k(1+log ) 2z+3 αz +1 Cµi be the index of the minimal cost sketch with at most 4k(1 + log ) 2z+3 αz + 1 samples Return FAIL if such j does not exist 11: return i [γ]Mi, wj, and Cµj Algorithm 4 High probability MULTMEYERSON sketch Input: Points X := x1, . . . , xn Rd with aspect ratio , estimate ] OPT 0 such that αOPT ] OPT OPT for some α (0, 1), failure probability δ (0, 1) Output: A coreset for k-means clustering on X if ] OPT upper bounds the cost of the optimal clustering 1: γ log nd( z) 2: for i [γ] do 3: Run MEYERSON X, 2i, α = 1 2, δ, , z, k in parallel 4: Let j be the minimal index in [γ] such that MEYERSON with input 2j has size smaller than 8k log 1 δ (1 + log ) 22z+3 + 1 and cost smaller than 2z+6+j 5: return the output for MEYERSON X, 2j, α = 1 2, δ, , z, k Theorem 2.1. [8] Given an input stream x1, . . . , xn Rd defining a set X [ ]d, there exists an online algorithm MULTMEYERSON that with probability at least 1 1 poly(n): (1) on the arrival of each point xi, assigns xi to a center in C through a mapping π : X C, where C contains at most O 22zk log n log centers x X xi π(xi) z 2 2z+7 Cost|S| k(X, S) (3) MULTMEYERSON uses O 2zk log3(nd ) words of space B Online (1 + ε)-Coreset In this section, we describe how to construct an online (1 + ε) coreset for (k, z)-clustering under general discrete metrics. We first describe the offline coreset construction of [27] and then argue that the construction can be adapted to an online setting at the cost of logarithmic overheads, which suffice for our purpose. Let A be an (α, β)-approximation for a (k, z)-clustering on an input set X [ ]d and let C1, . . . , Cβk be the clusters of X induced by A. Suppose the points of X arrive in a data stream S. 
For a fixed ε > 0, [27] define the following notions of rings and groups: The average cost of cluster Ci is denoted by κCi := Cost(Ci,A) For any i, j, the ring Ri,j is the set of points x Ci such that 2jκCi Cost(x, A) < 2j+1κCi. For any j, Rj = Ri,j. The inner ring RI(Ci) = j 2z log ε z Ri,j is the set of points of Ci with cost at most ε z 2z κCi. More generally for a solution S, let RS I denote the union of the inner rings induced by S. The outer ring RO(Ci) = j 2 log z ε Ri,j is the set of points of Ci with cost at least z ε 2z κCi. More generally for a solution S, let RS O denote the union of the outer rings induced by S. The main ring RM(Ci) is the set of points of Ci that are not in the inner or outer rings, i.e., RM(Ci) = Ci \ (RI(Ci) RO(Ci). For any j, the group Gj,b consists of the (2b 1 + 1)-th to (2b)-th points of each ring Ri,j that arrive in S. For any j, we use Gj,min to denote the union of the groups with the smallest costs, i.e., Gj,min = x| i, x Ri,j, Cost(Ri,j, A) < 2 ε z Cost(Rj, A) The outer groups GO b partition the outer rings RA O so that GO b = x| i, x Ci, ε z Cost(RA O, A) βk 2b Cost(RO(Ci), A) < ε z Cost(RA O, A) βk 2b+1 . We define GO min = b 0GO b and GO max = b z log 4z We remark that unlike the definition of [27], Gj,min is a subset of the groups Gj,b with b 1, but we shall nevertheless show that our sampling procedure preserves the cost contributed by each group. We also require the following slight variation of the definition of A-approximate centroid set from [53] due to [27]. Definition B.1 (A-approximate centroid set). Let X Rd be a set of points, let k, z be two positive integers, and let ε > 0 be an accuracy parameter. Given a set A of centers, we say a set C is an A-approximate centroid set for (k, z)-clustering on X if for every set of k centers S Rd, there exists e S Rd of k points such that for all x X with Cost(x, S) 8z ε z Cost(x, A) or Cost(x, e S) 8z ε z Cost(x, A), |Cost(x, S) Cost(x, e S)| ε z log(z/ε)(Cost(x, S) Cost(x, A). The following statement is implied by the proof of Theorem 1 in [27]. Theorem B.2. [27, 24] Let z > 0 be a constant. Let x G for a group induced by an (α, β)- bicriteria assignment A. For each cluster Ci with i [βk], let Di = Ci G. Let C be an A-approximate centroid set for G and let γ = C max(α2, αz)β min(ε2, εz) log2 1 k log |C| + log log 1 ε + log n log2 1 for some sufficiently large constant C > 0. Let ζx = Cost(Di, A) |Di| Cost(G, A) γ log n, ηx = Cost(x, A) Cost(G, A) γ log n. Suppose each point x X is sampled and reweighted independently into a set Ω0 with probability px, where px min(ζx + ηx, 1). Let Ω1 = Ω0 \ (RI(Ci) (Ci j Gj,min) (RO(Ci) GO min). Suppose Ω2 is the set of centers in A, where each center ci with i [βk] has weight wi, where wi is a (1 + ε)-approximation to |RI(Ci)| + |Ci j Gj,min| + |RO(Ci) GO min|. Then (Ω1 \ Ω2) Ω2 is (1 + ε)-coreset for the (k, z)-clustering problem with probability 1 1 poly(n). We first show that the sampling probabilities for each point in the stream by RINGSAMPLE in Algorithm 1 satisfies the conditions of Theorem B.2. Lemma B.3. Let x G for a group induced by an (α, β)-bicriteria assignment A at a time t, with t [n]. For each cluster Ci with i [βk], let Di = Ci G. Let C be an A-approximate centroid set for G and let γ = C max(α2, αz)β min(ε2, εz) log2 1 k log |C| + log log 1 ε + log n log2 1 for some sufficiently large constant C > 0 Let ζx = Cost(Di, A) |Di| Cost(G, A) γ log n, ηx = Cost(x, A) Cost(G, A) γ log n. 
We also require the following slight variation of the definition of an $\mathcal{A}$-approximate centroid set from [53], due to [27].

Definition B.1 ($\mathcal{A}$-approximate centroid set). Let $X \subseteq \mathbb{R}^d$ be a set of points, let $k, z$ be two positive integers, and let $\varepsilon > 0$ be an accuracy parameter. Given a set $\mathcal{A}$ of centers, we say a set $\mathcal{C}$ is an $\mathcal{A}$-approximate centroid set for $(k, z)$-clustering on $X$ if for every set of $k$ centers $S \subset \mathbb{R}^d$, there exists a set $\widetilde{S} \subset \mathcal{C}$ of $k$ points such that for all $x \in X$ with $\mathrm{Cost}(x, S) \le \left(\frac{8z}{\varepsilon}\right)^z \mathrm{Cost}(x, \mathcal{A})$ or $\mathrm{Cost}(x, \widetilde{S}) \le \left(\frac{8z}{\varepsilon}\right)^z \mathrm{Cost}(x, \mathcal{A})$,
$$|\mathrm{Cost}(x, S) - \mathrm{Cost}(x, \widetilde{S})| \le \frac{\varepsilon}{z\log(z/\varepsilon)}\left(\mathrm{Cost}(x, S) + \mathrm{Cost}(x, \mathcal{A})\right).$$

The following statement is implied by the proof of Theorem 1 in [27].

Theorem B.2. [27, 24] Let $z > 0$ be a constant. Let $x \in G$ for a group $G$ induced by an $(\alpha, \beta)$-bicriteria assignment $\mathcal{A}$. For each cluster $C_i$ with $i \in [\beta k]$, let $D_i = C_i \cap G$. Let $\mathcal{C}$ be an $\mathcal{A}$-approximate centroid set for $G$ and let
$$\gamma = \frac{C\max(\alpha^2, \alpha^z)\,\beta}{\min(\varepsilon^2, \varepsilon^z)}\cdot\log^2\frac{1}{\varepsilon}\cdot\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)$$
for some sufficiently large constant $C > 0$. Let
$$\zeta_x = \frac{\mathrm{Cost}(D_i, \mathcal{A})}{|D_i|\,\mathrm{Cost}(G, \mathcal{A})}\cdot\gamma\log n, \qquad \eta_x = \frac{\mathrm{Cost}(x, \mathcal{A})}{\mathrm{Cost}(G, \mathcal{A})}\cdot\gamma\log n.$$
Suppose each point $x \in X$ is sampled and reweighted independently into a set $\Omega_0$ with probability $p_x$, where $p_x \ge \min(\zeta_x + \eta_x, 1)$. Let $\Omega_1 = \Omega_0 \setminus \bigcup_{i \in [\beta k]}\left(R_I(C_i) \cup \left(C_i \cap \bigcup_j G_{j,\min}\right) \cup \left(R_O(C_i) \cap G^O_{\min}\right)\right)$. Suppose $\Omega_2$ is the set of centers in $\mathcal{A}$, where each center $c_i$ with $i \in [\beta k]$ has weight $w_i$, where $w_i$ is a $(1 + \varepsilon)$-approximation to $|R_I(C_i)| + |C_i \cap \bigcup_j G_{j,\min}| + |R_O(C_i) \cap G^O_{\min}|$. Then $(\Omega_1 \setminus \Omega_2) \cup \Omega_2$ is a $(1 + \varepsilon)$-coreset for the $(k, z)$-clustering problem with probability $1 - \frac{1}{\mathrm{poly}(n)}$.

We first show that the sampling probabilities used by RINGSAMPLE in Algorithm 1 for each point in the stream satisfy the conditions of Theorem B.2.

Lemma B.3. Let $x \in G$ for a group $G$ induced by an $(\alpha, \beta)$-bicriteria assignment $\mathcal{A}$ at a time $t$, with $t \in [n]$. For each cluster $C_i$ with $i \in [\beta k]$, let $D_i = C_i \cap G$. Let $\mathcal{C}$ be an $\mathcal{A}$-approximate centroid set for $G$ and let
$$\gamma = \frac{C\max(\alpha^2, \alpha^z)\,\beta}{\min(\varepsilon^2, \varepsilon^z)}\cdot\log^2\frac{1}{\varepsilon}\cdot\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)$$
for some sufficiently large constant $C > 0$. Let
$$\zeta_x = \frac{\mathrm{Cost}(D_i, \mathcal{A})}{|D_i|\,\mathrm{Cost}(G, \mathcal{A})}\cdot\gamma\log n, \qquad \eta_x = \frac{\mathrm{Cost}(x, \mathcal{A})}{\mathrm{Cost}(G, \mathcal{A})}\cdot\gamma\log n.$$
Then the probability $p_x$ with which RINGSAMPLE (Algorithm 1) samples each point $x$ satisfies $p_x \ge \min(\zeta_x + \eta_x, 1)$.

Proof. Suppose that $x \in R_{i,j}$ and $x \in G_{j,b}$ at time $t$, for some $i \in [\beta k]$, in an assignment by $\mathcal{A}$ from MULTMEYERSON. Let $u$ be the time that $x$ arrived in the stream. By the properties of the Meyerson sketch, i.e., MULTMEYERSON in Theorem 2.1, $x$ is irrevocably assigned to a cluster $C_i$ with $i \in [\beta k]$ at time $u$. Hence, $x$ must also be assigned to ring $R_{i,j}$ at time $u$. Moreover, since the stream is insertion-only, the number of points in all rings $R_{i,j}$ for a fixed $j$ across all $i \in [\beta k]$ is monotonically non-decreasing. Thus $x$ must also be assigned to group $G_{j,b}$ at time $u$. Let $p_x$ be the sampling probability of $x$ used by RINGSAMPLE in Algorithm 1 at time $u$, so that $p_x = \min\left(\frac{4}{r_u}\cdot\gamma\log n, 1\right)$, where $r_u$ is the number of points in $G_{j,b}$ at time $u$. Let $G^{(u)}_{j,b}$ be the subset of $G_{j,b}$ that has arrived by time $u$ and let $G^{(t)}_{j,b}$ be the subset of $G_{j,b}$ that has arrived by time $t$. Let $c_i$ be the center assigned to point $x$, so that $\mathrm{Cost}(x, c_i) = \mathrm{Cost}(x, \mathcal{A})$, and let $C^{(u)}_i$ be the points assigned to $c_i$ at time $u$. Similarly, let $D^{(u)}_i = C^{(u)}_i \cap G^{(u)}_{j,b}$. By the definition of $R_{i,j}$ and $G_{j,b}$,
$$\|x - c_i\|_2^z \le \frac{2\,\mathrm{Cost}(G^{(u)}_{j,b}, \mathcal{A})}{r_u}.$$
Since both the cost of the group $G_{j,b}$ and the number of points in $D_i$ are monotonically non-decreasing over time, at time $t$ we have
$$\frac{\zeta_x}{\gamma\log n} = \frac{\mathrm{Cost}(D_i, \mathcal{A})}{|D_i|\,\mathrm{Cost}(G_{j,b}, \mathcal{A})} \le \frac{2|D_i|\cdot\|x - c_i\|_2^z}{|D_i|\,\mathrm{Cost}(G^{(t)}_{j,b}, \mathcal{A})} \le \frac{2\|x - c_i\|_2^z}{\mathrm{Cost}(G^{(u)}_{j,b}, \mathcal{A})} \le \frac{4}{r_u}.$$
Similarly, due to the monotonicity of the cost of the group $G_{j,b}$ over time,
$$\frac{\eta_x}{\gamma\log n} = \frac{\|x - c_i\|_2^z}{\mathrm{Cost}(G^{(t)}_{j,b}, \mathcal{A})} \le \frac{\|x - c_i\|_2^z}{\mathrm{Cost}(G^{(u)}_{j,b}, \mathcal{A})} \le \frac{2}{r_u}.$$
Thus for a sufficiently large constant $C$ in the definition of $\gamma$ in RINGSAMPLE, we have $p_x \ge \min(\zeta_x + \eta_x, 1)$, since $p_x = \min\left(\frac{4}{r_u}\cdot\gamma\log n, 1\right)$.

We next justify the space complexity of Algorithm 1, i.e., we show that with high probability an upper bound on the number of samples can be determined.

Lemma B.4. RINGSAMPLE (Algorithm 1) samples
$$O\!\left(\frac{\max(\alpha^2, \alpha^z)\,\beta}{\min(\varepsilon^2, \varepsilon^z)}\cdot\log^2\frac{1}{\varepsilon}\cdot\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)\cdot\log^2 n\,\log^2\Delta\right)$$
points with high probability.

Proof. Recall that by definition, the groups $G_{j,b}$ partition the points $X = x_1, \ldots, x_n \in [\Delta]^d$. For a fixed $j$ and $b$, let $Y_i$ be an indicator random variable for whether the $i$-th point of $G_{j,b}$ is sampled by RINGSAMPLE. Then we have $\mathbb{E}[Y_i] \le \frac{4}{i}\,\gamma\log n$ and similarly $\mathbb{E}[Y_i^2] \le \frac{4}{i}\,\gamma\log n$. By Bernstein's inequality, Theorem A.4, we have that
$$\Pr\left[\sum_i Y_i \ge 80\,\gamma\log^2 n\right] \le \frac{1}{\mathrm{poly}(n)},$$
and more generally, $\sum_i Y_i = O(\gamma\log^2 n)$ with high probability. Thus by a union bound over all $j$ and $b$, the number of sampled points is at most
$$O\!\left(\gamma\log^2 n\,\log^2\Delta\right) = O\!\left(\frac{1}{\min(\varepsilon^2, \varepsilon^z)}\cdot\log^2\frac{1}{\varepsilon}\cdot\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)\cdot\log^2 n\,\log^2\Delta\right)$$
for $\gamma = \frac{C\max(\alpha^2, \alpha^z)\,\beta}{\min(\varepsilon^2, \varepsilon^z)}\log^2\frac{1}{\varepsilon}\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)$.

Moreover, note that for all $t \in [n]$, we can explicitly track both $|G^{(t)}_{j,b}|$ and $\mathrm{Cost}(G^{(t)}_{j,b}, \mathcal{A})$ as the stream is updated, because once the bicriteria algorithm assigns a point to a center in $\mathcal{A}$, the assignment remains the same for the rest of the stream. Thus, we have the following:

Lemma B.5. For each $j$ and $b$, there exists an algorithm that maintains both $|G^{(t)}_{j,b}|$ and $\mathrm{Cost}(G^{(t)}_{j,b}, \mathcal{A})$ for all $t \in [n]$ using $O(\log(nd\Delta))$ space.
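A minimal sketch (Python) of the bookkeeping behind Lemma B.5: since assignments are irrevocable, a pair of running counters per pair $(j, b)$ suffices. The class name is illustrative, and the bicriteria assignment of each arriving point is assumed to be supplied by the caller.

```python
from collections import defaultdict

class GroupStats:
    """Running |G_{j,b}| and Cost(G_{j,b}, A) per group, under irrevocable
    assignments (a point's ring/group never changes once assigned)."""

    def __init__(self):
        self.size = defaultdict(int)    # (j, b) -> number of points seen so far
        self.cost = defaultdict(float)  # (j, b) -> total cost under A so far

    def add_point(self, j, b, cost_to_center):
        # called once per stream update, after the bicriteria assignment
        self.size[(j, b)] += 1
        self.cost[(j, b)] += cost_to_center
```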
Putting things together, we give the full guarantees of RINGSAMPLE in Algorithm 1.

Lemma 2.2. Let $\mathcal{C}$ be an $\mathcal{A}$-approximate centroid set for a fixed group $G$. There exists an algorithm RINGSAMPLE that samples
$$O\!\left(\frac{\max(\alpha^2, \alpha^z)\,\beta}{\min(\varepsilon^2, \varepsilon^z)}\cdot\log^2\frac{1}{\varepsilon}\cdot\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)\cdot\log^2 n\,\log^2\Delta\right)$$
points and, with high probability, outputs a $(1 + \varepsilon)$-online coreset for the $k$-means clustering problem.

Proof. Consider RINGSAMPLE. Before claiming that the algorithm gives a $(1 + \varepsilon)$-online coreset, we first consider a fixed time $t \in [n]$. Correctness at time $t$ then follows from applying Theorem B.2, given Lemma B.3 and Lemma B.5. We then observe that once a center is formed by RINGSAMPLE, i.e., once a point is sampled, it irrevocably remains a center in the data structure. Therefore, conditioned on correctness at time $t$, the data structure will always correctly give a $(1 + \varepsilon)$-coreset for the prefix of $t$ points in the stream at any later time $t'$ in the stream, $t' \in [n]$ with $t' > t$. It thus suffices to argue correctness over all $t \in [n]$, which requires a simple union bound. The space complexity follows from Lemma B.4 and Lemma B.5.

To apply Lemma 2.2, we need to upper bound the term $\log|\mathcal{C}|$. To that end, we first require the following definition of doubling dimension.

Definition B.6 (Doubling dimension). The doubling dimension of a metric space $X$ with metric $d$ is the smallest integer $\ell$ such that for any $x \in X$, it is possible to cover the ball of radius $2r$ around $x$ with $2^\ell$ balls of radius $r$.

Observe that general discrete metric spaces with $n$ points have doubling dimension $O(\log n)$, since all points can be covered by $2^{\log n}$ balls. We then recall the following result that upper bounds the size $\log|\mathcal{C}|$ for metric spaces with doubling dimension $d$.

Lemma B.7. [27] Given a subset $X$ from a metric space with doubling dimension $d$, $\varepsilon > 0$, and an $\alpha$-approximate solution $\mathcal{A}$ with at most $k\,\mathrm{polylog}(n)$ centers, there exists an $\mathcal{A}$-approximate centroid set for $X$ of size $|X|\cdot\left(\frac{\alpha}{\varepsilon}\right)^{O(d)}$.

It is known that Euclidean space has doubling dimension $\Theta(d)$, which would give a $d$ dependency in our coreset size. However, [38] showed that the $d$ dependency can be replaced with $\frac{k}{\varepsilon^2}$, which was subsequently improved by a line of works, e.g., [58, 40, 46], ultimately down to a dependency of $\frac{1}{\varepsilon^2}\log\frac{k}{\varepsilon}$, using the following notion of terminal embeddings.

Fig. 2: Merge-and-reduce framework on a stream of length $n$. The coresets at level 1 are the entire blocks. The coresets at level $i$ for $i > 1$ are each $\left(1 + O\!\left(\frac{\varepsilon}{2\log n}\right)\right)$-coresets of the coresets at their children nodes in level $i - 1$.

Definition B.8 (Terminal embedding). Let $\varepsilon \in (0, 1)$ and let $X \subset \mathbb{R}^d$ be a set of $n$ points. Then a mapping $f : \mathbb{R}^d \to \mathbb{R}^m$ is a terminal embedding if for all $x \in X$ and all $y \in \mathbb{R}^d$,
$$(1 - \varepsilon)\,\|x - y\|_2 \le \|f(x) - f(y)\|_2 \le (1 + \varepsilon)\,\|x - y\|_2.$$

[55] gave a construction of a terminal embedding with $m = O\!\left(\frac{1}{\varepsilon^2}\log n\right)$ that can be applied in linear space through exhaustive search when polynomial runtime is not required. Thus Lemma 2.2 now gives the following:

Theorem 1.3. There exists an algorithm that samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ points and, with high probability, outputs a $(1 + \varepsilon)$-online coreset for $(k, z)$-clustering.

For the purpose of clarity, we emphasize that the algorithm does not use sublinear space, even though its sample complexity is sublinear. Namely, for each stream update, we construct and apply a terminal embedding to project each point to a lower dimension. We then compute the appropriate sampling probability of the projected point, but sample the original point with the computed sampling probability.
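As an illustration of this last point, the following hedged sketch (Python) shows the pattern of deciding in the projected space while storing the original point. Both `embed` and `sampling_prob` are stand-ins assumed to be supplied by the caller; they are not the terminal embedding of [55] or the probability computation of RINGSAMPLE themselves.

```python
import random

def sample_with_projection(stream, embed, sampling_prob):
    """Sketch: compute sampling probabilities on projected points, but
    retain the original (unprojected) points in the coreset.

    `embed` maps a point in R^d to R^m (a terminal-embedding-like map) and
    `sampling_prob` returns the probability with which the projected point
    should be kept; both are assumptions of this sketch."""
    coreset = []  # pairs (original point, weight)
    for x in stream:
        y = embed(x)              # project to the low-dimensional space
        p = sampling_prob(y)      # probability computed on the projection
        if p > 0 and random.random() < p:
            coreset.append((x, 1.0 / p))  # but store the *original* point
    return coreset
```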
C Sliding Window Model

In this section, we describe how our online coreset construction for $(k, z)$-clustering on general discrete metric spaces can be used to achieve near-optimal space algorithms for $(k, z)$-clustering in the sliding window model. We first recall a standard approach for using offline coreset constructions to obtain insertion-only streaming algorithms. Suppose there exists a randomized algorithm that produces an online coreset using $S(n, \varepsilon, \delta)$ points for an input stream of length $n$, accuracy $\varepsilon$, and failure probability $\delta$, where for ease of discussion we omit additional dependencies, such as on the dimension $d$, the clustering constraint $k$, the parameter $z$, or additional parameters of whatever problem the coreset construction may approximate. A standard approach for using coresets on insertion-only streams is the merge-and-reduce approach, which partitions the stream into blocks of size $S\!\left(n, \frac{\varepsilon}{2\log n}, \frac{\delta}{\mathrm{poly}(n)}\right)$ and builds a coreset for each block. Each such coreset is then viewed as a leaf of a binary tree with height at most $\log n$, since the binary tree has at most $n$ leaves. Then at each level of the binary tree, for each node in the level, a coreset of size $S\!\left(n, \frac{\varepsilon}{2\log n}, \frac{\delta}{\mathrm{poly}(n)}\right)$ is built from the coresets representing the two children of the node. Due to the mergeability property of coresets, the coreset at the root of the tree is a coreset for the entire stream with accuracy $\left(1 + \frac{\varepsilon}{2\log n}\right)^{\log n} \le (1 + \varepsilon)$ and failure probability $\delta$. We give an illustration of this approach in Figure 2.

This approach fails for sliding window algorithms because the elements at the beginning of the data stream can expire, so the coresets corresponding to earlier blocks of the stream may no longer be accurate, which would result in the coreset at the root of the tree also no longer being accurate. On the other hand, suppose we partition the stream into blocks consisting of $S\!\left(n, \frac{\varepsilon}{2\log n}, \frac{\delta}{\mathrm{poly}(n)}\right)$ elements as before, but instead of creating an offline coreset for each block, we create an online coreset for the elements in reverse. That is, since the elements in each block are explicitly stored, we can create offline an artificial stream consisting of the elements of the block in reverse and then give the artificial stream as input to the online coreset construction. Note that if we also always consider the later coreset first when merging two coresets, then this effectively reverses the stream. Moreover, by the correctness of the online coreset, our data structure provides correctness over any prefix of the reversed stream, or equivalently, any suffix of the stream, and in particular correctness over the sliding window.

Indeed, [9] showed that deterministic online coresets for problems in randomized numerical linear algebra can be used to achieve deterministic algorithms for the corresponding problems in the sliding window model. We thus further adapt the merge-and-reduce framework to show that randomized online coresets for clustering problems can also be used to achieve randomized algorithms for the corresponding problems in the sliding window model. We formalize this approach in Algorithm 2, duplicated below:
Algorithm 5 Merge-and-reduce framework for randomized algorithms in the sliding window model, using randomized constructions of online coresets
Input: A clustering function $f$, a set of points $x_1, \ldots, x_n \in \mathbb{R}^d$, accuracy parameter $\varepsilon > 0$, failure probability $\delta \in (0, 1)$, and window size $W > 0$
Output: An approximation of $f$ on the $W$ most recent points
1: Let CORESET$(X, n, d, k, \varepsilon, \delta)$ be an online coreset construction with $S(n, d, k, \varepsilon, \delta)$ points on a set $X \subseteq \mathbb{R}^d$
2: $m \leftarrow O\!\left(S\!\left(n, d, k, \frac{\varepsilon}{\log n}, \delta\right)\right)$
3: Initialize blocks $B_0, B_1, \ldots, B_{\log n} \leftarrow \emptyset$
4: for each point $x_t$ with $t \in [n]$ do
5:   if $B_0$ does not contain $m$ points then
6:     Prepend $x_t$ to $B_0$, i.e., $B_0 \leftarrow \{x_t\} \cup B_0$
7:   else
8:     Let $i$ be the smallest index such that $B_i = \emptyset$
9:     $B_i \leftarrow$ CORESET$\!\left(Y, n, d, k, \frac{\varepsilon}{\log n}, \frac{\delta}{n^2}\right)$ for $Y = B_0 \cup \ldots \cup B_{i-1}$   ▷ $Y$ is an ordered set of weighted points
10:    for $j = 0$ to $j = i - 1$ do
11:      $B_j \leftarrow \emptyset$
12:    $B_0 \leftarrow \{x_t\}$
13: return the ordered set $B_{\log n} \cup \ldots \cup B_0$

Theorem 2.3. Let $x_1, \ldots, x_n$ be a stream of points in $[\Delta]^d$, $\varepsilon > 0$, and let $X = \{x_{n-W+1}, \ldots, x_n\}$ be the $W$ most recent points. Suppose there exists a randomized algorithm that, with probability at least $1 - \delta$, outputs an online coreset for a $k$-clustering problem with $S(n, d, k, \varepsilon, \delta)$ points. Then there exists a randomized algorithm that, with probability at least $1 - \delta$, outputs a coreset for the $k$-clustering problem in the sliding window model with $O\!\left(S\!\left(n, d, k, \frac{\varepsilon}{\log n}, \frac{\delta}{n^2}\right)\cdot\log n\right)$ points.

Proof. Consider Algorithm 2. Let CORESET$(X, n, d, k, \varepsilon, \delta)$ be a randomized algorithm that, with probability at least $1 - \delta$, computes an online coreset for a $k$-clustering problem $f$ with $S(n, d, k, \varepsilon, \delta)$ points. We first claim that each $B_i$ is a $\left(1 + \frac{\varepsilon}{\log n}\right)^i$ online coreset for $2^{i-1}m$ points. To that end, observe that $B_i$ can only be non-empty if at some time, $B_0$ contains $m$ points and $B_1, \ldots, B_{i-1}$ are all non-empty. By the correctness of the subroutine CORESET, $B_i$ is a $\left(1 + \frac{\varepsilon}{\log n}\right)$ online coreset for the points in $B_0 \cup \ldots \cup B_{i-1}$ at some point during the stream. Hence by induction, $B_i$ is a $\left(1 + \frac{\varepsilon}{\log n}\right)\cdot\left(1 + \frac{\varepsilon}{\log n}\right)^{i-1} = \left(1 + \frac{\varepsilon}{\log n}\right)^i$ coreset for $m + \sum_{j=1}^{i-1} 2^{j-1}m = 2^{i-1}m$ points.

Now, because Algorithm 2 inserts the newest points at the beginning of $B_0$, the stream is fed in reverse to the merge-and-reduce procedure. Thus, for any $W \in [2^{i-1}m, 2^i m)$, $B_0 \cup \ldots \cup B_i$ provides an online coreset for $k$-clustering on the $W$ most recent points in the stream.

To analyze the probability of failure, we remark that there are at most $n$ points in the stream. For each point, there are at most $n$ coresets constructed by the subroutine CORESET (in fact, the number of coreset constructions is upper bounded by $O(\log n)$). Since each subroutine is called with failure probability $\frac{\delta}{n^2}$, by a union bound the total failure probability is at most $\delta$.

To analyze the space complexity, note that there are at most $O(\log n)$ coreset constructions $B_0, \ldots, B_{\log n}$ maintained by the algorithm. Each coreset construction samples $S\!\left(n, d, k, \frac{\varepsilon}{\log n}, \frac{\delta}{n^2}\right)$ points. Hence, the total number of sampled points is $O\!\left(S\!\left(n, d, k, \frac{\varepsilon}{\log n}, \frac{\delta}{n^2}\right)\cdot\log n\right)$.

By Theorem 1.3 and Theorem 2.3, we have:

Theorem 2.4. There exists an algorithm that samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ points and with high probability outputs a $(1 + \varepsilon)$-coreset for $(k, z)$-clustering in the sliding window model.

Using an offline algorithm for $(k, z)$-clustering for post-processing after the data stream, we have:

Theorem 1.1. There exists an algorithm that samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ points and with high probability outputs a $(1 + \varepsilon)$-approximation to $(k, z)$-clustering for the Euclidean distance on $[\Delta]^d$ in the sliding window model.
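As a rough illustration of Algorithm 2 (duplicated above as Algorithm 5), the following sketch (Python) shows the block structure of the reversed merge-and-reduce. Here `coreset_fn` is a stand-in for the online coreset construction CORESET and the fixed block capacity `m` and level count are assumptions of the sketch.

```python
def sliding_window_merge_and_reduce(stream, m, coreset_fn, num_levels):
    """Sketch of the reversed merge-and-reduce framework.

    Newest points are prepended to block B0; when B0 is full, blocks
    B0..B_{i-1} are merged (newest first) and reduced by `coreset_fn`
    into the first empty block B_i.  The union B_L, ..., B_0 then covers
    any suffix (i.e., any window) of the stream.  Assumes num_levels is
    large enough that an empty block always exists."""
    blocks = [[] for _ in range(num_levels + 1)]  # B0, B1, ..., B_L
    for x in stream:
        if len(blocks[0]) < m:
            blocks[0].insert(0, x)  # prepend: the stream is fed in reverse
            continue
        i = next(j for j in range(1, num_levels + 1) if not blocks[j])
        merged = [p for b in blocks[:i] for p in b]  # B0, then B1, ...
        blocks[i] = coreset_fn(merged)
        for j in range(i):
            blocks[j] = []
        blocks[0] = [x]
    return [p for b in reversed(blocks) for p in b]  # B_L, ..., B_0
```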
D Lower Bounds

In this section, we show that any $(1 + \varepsilon)$-online coreset for $(k, z)$-clustering requires $\Omega\!\left(\frac{k}{\varepsilon^2}\log n\right)$ points. The intuition is somewhat straightforward and black-box. [25] showed the existence of a set $X$ of $\Omega\!\left(\frac{k}{\varepsilon^2}\right)$ unit vectors such that no sublinear-space data structure can accurately determine $\mathrm{Cost}(C, X)$ for every set of $k$ unit vectors $C$. They thus showed that any offline $(1 + \varepsilon)$-coreset construction for $(k, z)$-clustering requires $\Omega\!\left(\frac{k}{\varepsilon^2}\right)$ points. Because an online $(1 + \varepsilon)$-coreset must answer queries on all prefixes of the stream, our goal is to essentially embed $\Omega(\log n)$ copies of the hard instance of [25] into the stream, which then requires $\Omega\!\left(\frac{k}{\varepsilon^2}\log n\right)$ points. To force the data structure to sample $\Omega\!\left(\frac{k}{\varepsilon^2}\right)$ points for each of the hard instances, we give the instances exponentially increasing weight. That is, we give the points in the $i$-th instance weight $\tau^i$ for some constant $\tau > 1$, by inserting $\tau^i$ copies of each of its points. Because the weight of the $i$-th instance is substantially greater than the sum of the weights of the previous instances, any $(1 + \varepsilon)$-online coreset must essentially contain a $(1 + \varepsilon)$-coreset for the $i$-th instance, thus requiring $\Omega\!\left(\frac{k}{\varepsilon^2}\right)$ points for the $i$-th instance. This reasoning extends to all of the $\Omega(\log n)$ instances, thereby giving a lower bound of $\Omega\!\left(\frac{k}{\varepsilon^2}\log n\right)$ points. We first recall the following offline coreset lower bound by [25].

Theorem D.1. [25] For $d' = \Theta\!\left(\frac{k}{\varepsilon^2}\right)$, let $X = e_1, \ldots, e_{d'} \in \mathbb{R}^{2d'}$ be the set of elementary vectors. Let $z$ be a constant and let $a_1, \ldots, a_m \in \mathbb{R}^{2d'}$ with corresponding weights $w_1, \ldots, w_m \in \mathbb{R}$ be a weighted set $P$ of points. Then there exists a set of $k$ unit vectors $C = c_1, \ldots, c_k \in \mathbb{R}^{2d'}$ such that for $m = o\!\left(\frac{k}{\varepsilon^2}\right)$:
(1) $\mathrm{Cost}(C, X) = \sum_{i=1}^{d'} \min_{j \in [k]} \|e_i - c_j\|_2^z \ge 2^{z/2} d' - 2^{z/2}\max(1, z/2)$,
(2) $\mathrm{Cost}(C, P) = \sum_{i=1}^{m} w_i \min_{j \in [k]} \|a_i - c_j\|_2^z < (1 - \varepsilon)\left(2^{z/2} d' - 2^{z/2}\max(1, z/2)\right)$.

We remark that the first property is due to Lemma 31 of [25] and the second property is due to Lemma 33 and Lemma 34 of [25].

Let $d' = \Theta\!\left(\frac{k}{\varepsilon^2}\right)$ be the dimension of the hard instance in Theorem D.1, let $\gamma = \Theta\!\left(\log\frac{n}{d'}\right)$, and set $d = \gamma d'$, so that we can partition the space $\mathbb{R}^{2d}$ into $\gamma$ groups of $2d'$ coordinates. We define a stream by creating $\gamma$ weighted instances of the hard instance defined in Theorem D.1. Each of the $\gamma$ hard instances will be embedded into a separate partition of $2d'$ coordinates of $\mathbb{R}^{2d}$. Namely, the first instance consists of the vectors $e_1, \ldots, e_{d'}$ being inserted into the stream. By Theorem D.1, any $(1 + \varepsilon)$-coreset must contain $\Omega\!\left(\frac{k}{\varepsilon^2}\right)$ points. The next instance consists of the vectors $e_{1+2d'}, \ldots, e_{3d'}$, each being inserted $\tau = 100$ times into the stream. That is, after the vector $e_{d'}$ arrives in the stream from the first hard instance, $\tau$ copies of $e_{1+2d'}$ arrive in the stream, followed by $\tau$ copies of $e_{2+2d'}$, and so forth. Due to the weights of these vectors, any $(1 + \varepsilon)$-coreset must essentially contain a $(1 + \varepsilon)$-coreset for the second hard instance and thus contain $\Omega\!\left(\frac{k}{\varepsilon^2}\right)$ points with support in the second group of $2d'$ coordinates. More generally, for each $i \in [\gamma]$, the stream inserts $\tau^{i-1}$ copies of $e_{1+2(i-1)d'}$, followed by $\tau^{i-1}$ copies of $e_{2+2(i-1)d'}$, and so on. The main intuition is that due to the weights of the $i$-th group of $d'$ elementary vectors, a $(1 + \varepsilon)$-online coreset must contain a $(1 + \varepsilon)$-coreset for the $i$-th hard instance. Moreover, since the $(1 + \varepsilon)$-online coreset must be a coreset for every prefix of the stream, it needs to be a $(1 + \varepsilon)$-coreset for each of the hard instances. Hence, the online coreset must contain $\gamma\cdot\Omega\!\left(\frac{k}{\varepsilon^2}\right) = \Omega\!\left(\frac{k}{\varepsilon^2}\log n\right)$ points.
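To make the construction concrete, the following is a small sketch (Python) of the stream described above. The parameters `d_prime`, `gamma`, and `tau` correspond to $d'$, $\gamma$, and $\tau$, and points are represented simply by the index of the elementary vector they equal; this representation is an assumption of the sketch.

```python
def hard_stream(d_prime, gamma, tau=100):
    """Sketch of the lower-bound stream: gamma weighted copies of the hard
    instance of [25], where the i-th instance lives on its own block of
    2*d_prime coordinates and each of its points is inserted tau**(i-1)
    times (i.e., carries weight tau**(i-1))."""
    stream = []
    for i in range(1, gamma + 1):
        offset = 2 * (i - 1) * d_prime
        for b in range(1, d_prime + 1):
            stream.extend([offset + b] * tau ** (i - 1))  # tau^(i-1) copies of e_{b+offset}
    return stream
```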
Lemma D.2. Let $\tau = 100$. For each integer $i > 0$, let $S_i$ be the stream that consists of $\tau^{i-1}$ consecutive copies of $e_{1+2(i-1)d'}$, followed by $\tau^{i-1}$ copies of $e_{2+2(i-1)d'}$, and so on. Let $S$ be the stream that consists of $S_1 \circ S_2 \circ \ldots$. Then for each $i$, any $(1 + \varepsilon)$-online coreset after the arrival of $S_i$ must consist of $i\cdot\Omega\!\left(\frac{k}{\varepsilon^2}\right)$ points.

Proof. We prove the claim by induction on $i$. The base case of $i = 1$ follows from Theorem D.1. Now suppose the claim holds for a fixed $i - 1$. Let $X_i$ be the set of points that have arrived after $S_i$, i.e., $X_i = S_1 \cup \ldots \cup S_i$. Let $C_{i-1}$ be any $(1 + \varepsilon)$-online coreset for $S$ after the arrival of $S_{i-1}$. Let $P_i$ be the set of weighted points sampled during stream $S_i$, so that $C_i = C_{i-1} \cup P_i$. Since each point in $S_i$ has weight $\tau^i$, by scaling the first property of Theorem D.1, there exists a set of $k$ unit vectors $U_i = c_1, \ldots, c_k \in \mathbb{R}^{2d}$ such that
$$\mathrm{Cost}(U_i, X_i) = \sum_{a=1}^{i}\sum_{b=1}^{d'} \tau^a \min_{j \in [k]}\|e_{b+2(a-1)d'} - c_j\|_2^z \ge \sum_{b=1}^{d'} \tau^i \min_{j \in [k]}\|e_{b+2(i-1)d'} - c_j\|_2^z \ge \tau^i\left(2^{z/2}d' - 2^{z/2}\max(1, z/2)\right). \qquad (1)$$
In particular, the unit vectors $U_i = c_1, \ldots, c_k$ have support entirely in the $i$-th group of $2d'$ coordinates in $\mathbb{R}^{2d}$. By the same argument, there exists a set $U_{i-1}$ with the same properties in the $(i-1)$-th group of $2d'$ coordinates in $\mathbb{R}^{2d}$. By the correctness of the online coreset, we have
$$\mathrm{Cost}(U_{i-1}, C_{i-1}) \le (1 + \varepsilon)\,\mathrm{Cost}(U_{i-1}, X_{i-1}) = (1 + \varepsilon)\sum_{a=1}^{i-1}\mathrm{Cost}(U_{i-1}, S_a).$$
Since $U_{i-1}$ consists of unit vectors and each substream $S_a$ consists of unit vectors, we have $\mathrm{Cost}(U_{i-1}, S_a) \le 2d'\tau^a$. Thus for $\varepsilon \in (0, 1)$,
$$\mathrm{Cost}(U_{i-1}, C_{i-1}) \le 2\sum_{a=1}^{i-1}\left(2d'\tau^a\right) \le 8d'\tau^{i-1} < d'\tau^i,$$
since $\tau = 100$. On the other hand, since $U_{i-1}$ has support entirely in the $(i-1)$-th group of $2d'$ coordinates and $S_i$ has support entirely in the $i$-th group of $2d'$ coordinates in $\mathbb{R}^{2d}$,
$$\mathrm{Cost}(U_{i-1}, X_i) \ge \mathrm{Cost}(U_{i-1}, S_i) \ge 2d'\tau^i.$$
Thus for $C_i$ to be a $(1 + \varepsilon)$-online coreset for $\varepsilon \in (0, 1)$, $C_i$ must sample additional points from $X_i$ on top of $C_{i-1}$. Hence, $P_i \neq \emptyset$. In particular, let $P_i$ consist of vectors $y_1, \ldots, y_m$ with weights $w_1, \ldots, w_m$. Since $P_i \neq \emptyset$, we have $\mathrm{Cost}(U_i, C_i) = \mathrm{Cost}(U_i, C_{i-1} \cup P_i) \ge \mathrm{Cost}(U_i, P_i)$. If $|P_i| = o\!\left(\frac{k}{\varepsilon^2}\right)$, then by the second property of Theorem D.1, we have
$$\mathrm{Cost}(U_i, P_i) = \sum_{b=1}^{m} w_b \min_{j \in [k]}\|y_b - c_j\|_2^z < \tau^i(1 - \varepsilon)\left(2^{z/2}d' - 2^{z/2}\max(1, z/2)\right),$$
which together with Equation (1) contradicts the fact that $C_i$ is a $(1 + \varepsilon)$-online coreset for $X_i$. Therefore, we have $|P_i| = \Omega\!\left(\frac{k}{\varepsilon^2}\right)$. Moreover, since $P_i$ has disjoint support from $C_{i-1}$, by induction,
$$|C_i| = |C_{i-1} \cup P_i| = |C_{i-1}| + |P_i| = i\cdot\Omega\!\left(\frac{k}{\varepsilon^2}\right).$$

Theorem 1.4. Let $\varepsilon \in (0, 1)$. For sufficiently large $n$, $d$, and $\Delta$, there exists a set $X \subset [\Delta]^d$ of $n$ points $x_1, \ldots, x_n$ such that any $(1 + \varepsilon)$-online coreset for $k$-means clustering on $X$ requires $\Omega\!\left(\frac{k}{\varepsilon^2}\log n\right)$ points.

Proof. Let $\gamma = \Theta\!\left(\log\frac{n}{d'}\right)$. For each $i \in [\gamma]$, construct the stream $S_i$ as in the statement of Lemma D.2. Observe that $|S_i| = d'\tau^i$ for $\tau = 100$, and so under the setting of the parameter $\gamma$ with the appropriate constant, the total length of the stream $S = S_1 \circ \ldots \circ S_\gamma$ is precisely $n$. Moreover, by Lemma D.2, any $(1 + \varepsilon)$-online coreset must store $\gamma\cdot\Omega\!\left(\frac{k}{\varepsilon^2}\right) = \Omega\!\left(\frac{k}{\varepsilon^2}\log n\right)$ points for $n = \mathrm{poly}(d)$.

E On the Proof of Theorem B.2

We remark that Theorem 1 of [27] is stated for sampling a fixed number of points with replacement from each group, rather than sampling each point independently without replacement. By contrast, Theorem B.2 is stated for sampling each point independently without replacement. In this section, we briefly outline the proof of Theorem 1 of [27] and how the analysis translates to the statement of Theorem B.2. At a high level, the coreset construction of [27] first collects the rings of an approximate solution $\mathcal{A}$ of $k$ points into groups, using a similar approach to that described in Appendix B with $\beta = 1$.
The algorithm then computes a coreset for each group, first using a procedure GROUPSAMPLE and then using a procedure SENSITIVITYSAMPLE for some of the points not considered by the first procedure. We briefly describe both procedures, as well as how to adapt them to the setting where each point is sampled independently and without replacement.

E.1 Adaptation of Group Sampling

The GROUPSAMPLE procedure of [27] samples a fixed number $\Lambda_1$ of points from each group $G$ with probability proportional to the contribution of the corresponding cluster of each point to the group. That is, given the clusters $\widetilde{C}_1, \ldots, \widetilde{C}_k$ induced by $\mathcal{A}$ on $G$, GROUPSAMPLE performs $\Lambda_1$ rounds of sampling. Each round samples a single point, where a point $p \in \widetilde{C}_i$ is sampled with probability proportional to $\frac{\mathrm{Cost}(\widetilde{C}_i, \mathcal{A})}{|\widetilde{C}_i|\,\mathrm{Cost}(G, \mathcal{A})}$ and rescaled appropriately. GROUPSAMPLE then offers the following guarantees:

Lemma E.1 (Lemma 2 of [27]). Let $(X, \mathrm{dist})$ be a metric space, $k, z$ be positive integers, $G$ be a group of clients, and $\mathcal{A}$ be an $\alpha$-approximate solution to $(k, z)$-clustering on $G$ such that:
- for every cluster $\widetilde{C}$ induced by $\mathcal{A}$ on $G$, all points of $\widetilde{C}$ contribute the same cost in $\mathcal{A}$ up to a factor of 2;
- for all clusters $\widetilde{C}$ induced by $\mathcal{A}$ on $G$, we have $\frac{\mathrm{Cost}(G, \mathcal{A})}{2k} \le \mathrm{Cost}(\widetilde{C}, \mathcal{A})$.
Let $\mathcal{C}$ be an $\mathcal{A}$-approximate centroid set for $(k, z)$-clustering on $G$. Then there exists a procedure GROUPSAMPLE that constructs a set $\Omega$ of size
$$\frac{\max(\alpha^2, \alpha^z)\,\log^2\frac{1}{\varepsilon}\,2^{O(z\log z)}}{\min(\varepsilon^2, \varepsilon^z)}\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)$$
such that with high probability, it simultaneously holds for all sets $S$ of $k$ centers that
$$|\mathrm{Cost}(G, S) - \mathrm{Cost}(\Omega, S)| \le O(\varepsilon)\left(\mathrm{Cost}(G, S) + \mathrm{Cost}(G, \mathcal{A})\right).$$
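For intuition, the following is a small sketch (Python) of a GROUPSAMPLE-style sampling step on a single group: $\Lambda_1$ rounds, each drawing a point with probability proportional to the average cost of its cluster within the group. The inputs (the cluster assignment and per-cluster costs under $\mathcal{A}$) are assumed to be given, and the reweighting shown is one standard inverse-probability choice for the "rescaled appropriately" step, not necessarily the exact rescaling of [27].

```python
import random

def group_sample(points, cluster_of, cost_A, num_rounds):
    """Sketch of GROUPSAMPLE-style sampling on a group G.

    points     : list of points in G
    cluster_of : cluster_of[i] = cluster index of points[i] under A
    cost_A     : cost_A[c] = Cost(C_c intersected with G, A)
    num_rounds : Lambda_1, the number of sampling rounds
    """
    sizes = {}
    for i in range(len(points)):
        sizes[cluster_of[i]] = sizes.get(cluster_of[i], 0) + 1
    total = sum(cost_A[c] for c in sizes)
    # per-point weight: average cost of its cluster within the group
    weights = [cost_A[cluster_of[i]] / sizes[cluster_of[i]] for i in range(len(points))]
    sample = []
    for _ in range(num_rounds):
        i = random.choices(range(len(points)), weights=weights, k=1)[0]
        p_i = weights[i] / total                       # per-round sampling probability
        sample.append((points[i], 1.0 / (num_rounds * p_i)))  # inverse-probability weight
    return sample
```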
We outline the high-level approach of the proof of Lemma E.1 and how it can be adjusted for an $(\alpha, \beta)$-approximate solution $\mathcal{A}$, as well as for a process that samples each point independently without replacement, rather than using $\Lambda_1$ rounds as in GROUPSAMPLE. The proof of Lemma E.1 involves further partitioning the points of $G$ into three subsets, based on the cost induced by the point. Namely, given a set $S$ of $k$ centers, a point $p$ in group $G$ is categorized as tiny, interesting, or huge, depending on $\mathrm{Cost}(p, S)$ (though the interesting and huge points actually have a small overlap to allow slack in the proof). [27] applies standard Chernoff bounds to show that the number of sampled points is well-concentrated around its expectation and then applies Bernstein's inequality to show that the clustering costs of the tiny points and the interesting points are well-concentrated around their expectations. In particular, they show that the expected number of sampled points from each cluster $\widetilde{C}_i$ is
$$\Lambda_1\cdot\frac{\mathrm{Cost}(\widetilde{C}_i, \mathcal{A})}{\mathrm{Cost}(G, \mathcal{A})} \ge \frac{\Lambda_1}{2k},$$
due to the assumption that for all clusters $\widetilde{C}$ induced by $\mathcal{A}$ on $G$, we have $\frac{\mathrm{Cost}(G, \mathcal{A})}{2k} \le \mathrm{Cost}(\widetilde{C}, \mathcal{A})$. We first remark that if $\mathcal{A}$ is an $(\alpha, \beta)$-approximate solution rather than an $\alpha$-approximate solution, i.e., if $\mathcal{A}$ has $\beta k$ centers rather than $k$ centers, then the definition of the rings and groups would instead insist that for all clusters $\widetilde{C}$ induced by $\mathcal{A}$ on $G$, we have $\frac{\mathrm{Cost}(G, \mathcal{A})}{2\beta k} \le \mathrm{Cost}(\widetilde{C}, \mathcal{A})$. Then by oversampling $\Lambda_1$ by a factor of $\beta$, i.e., by sampling $\beta\Lambda_1$ points, we ensure that the expected number of sampled points from each cluster $\widetilde{C}_i$ is
$$\beta\Lambda_1\cdot\frac{\mathrm{Cost}(\widetilde{C}_i, \mathcal{A})}{\mathrm{Cost}(G, \mathcal{A})} \ge \frac{\beta\Lambda_1}{2\beta k} = \frac{\Lambda_1}{2k}.$$
It then remains to argue the correctness of sampling each point independently without replacement rather than sampling a fixed number $\beta\Lambda_1$ of points, which simply holds by adjusting the applications of the Chernoff bounds and Bernstein's inequality so that there is a separate random variable for each point in the input rather than for each of the $\Lambda_1$ rounds.

E.2 Adaptation of Sensitivity Sampling

The SENSITIVITYSAMPLE procedure of [27] samples a fixed number $\Lambda_2$ of points from each group $G$ with probability proportional to the contribution of the point. Specifically, SENSITIVITYSAMPLE performs $\Lambda_2$ rounds of sampling, where each round samples a point $p$ in the group $G$ with probability proportional to $\frac{\mathrm{Cost}(p, \mathcal{A})}{\mathrm{Cost}(G, \mathcal{A})}$ and rescales the sampled point appropriately. SENSITIVITYSAMPLE then offers the following guarantee:

Lemma E.2 (Lemma 3 of [27]). Let $(X, \mathrm{dist})$ be a metric space, $k, z$ be positive integers, and $\mathcal{A}$ be an $\alpha$-approximate solution to $(k, z)$-clustering on $G$. Let $\mathcal{C}$ be an $\mathcal{A}$-approximate centroid set for $(k, z)$-clustering on $G$. Let $G$ be either a group $G^O_b$ or $G^O_{\max}$. Then there exists a procedure SENSITIVITYSAMPLE that constructs a set $\Omega$ of size
$$\frac{2^{O(z\log z)}\,\alpha^2\,\log^2\frac{1}{\varepsilon}}{\min(\varepsilon^2, \varepsilon^z)}\left(k\log|\mathcal{C}| + \log\log\frac{1}{\varepsilon} + \log n\right)$$
such that with high probability, it simultaneously holds for all sets $S$ of $k$ centers that
$$|\mathrm{Cost}(G, S) - \mathrm{Cost}(\Omega, S)| \le O\!\left(\varepsilon\,\alpha^z\log z\right)\left(\mathrm{Cost}(G, S) + \mathrm{Cost}(G, \mathcal{A})\right).$$

We outline the high-level approach of the proof of Lemma E.2 and how it can be adjusted for an $(\alpha, \beta)$-approximate solution $\mathcal{A}$, as well as for a process that samples each point independently without replacement, rather than using $\Lambda_2$ rounds as in SENSITIVITYSAMPLE. The proof of Lemma E.2 partitions the points of $G$ into two categories, based on the cost induced by the point. Given a set $S$ of $k$ centers, the close points are the points $p$ in $G$ that have $\mathrm{Cost}(p, S) \le 4^z\,\mathrm{Cost}(p, \mathcal{A})$. The far points are the remaining points in $G$, i.e., the points $p$ in $G$ with $\mathrm{Cost}(p, S) > 4^z\,\mathrm{Cost}(p, \mathcal{A})$. [27] applies Bernstein's inequality to show that the clustering cost of the close points is well-concentrated around its expectation. We can again adjust the application of Bernstein's inequality so that there is a separate random variable for each point in the input rather than for each of the $\Lambda_2$ samples. To handle the far points, [27] again uses Bernstein's inequality to show that with high probability, the clustering cost of these points with respect to $S$ can be replaced with the distance to the closest center $c \in \mathcal{A}$ plus the distance from $c$ to the closest center in $S$. Conditioned on this event, the latter distance can then be charged to the remaining points of the cluster from the original dataset, i.e., the remaining points of the cluster not necessarily restricted to group $G$, which are significantly more numerous and already pay a similar value under $S$. In particular, Bernstein's inequality utilizes the fact that the second moment of the estimated cost of a cluster $C$ is at most $\frac{2k}{\Lambda_2^2}\left(\mathrm{Cost}(C \cap G, \mathcal{A})\right)^2$ for $\beta = 1$. Thus for general $\beta$, we recover the same guarantee by oversampling $\Lambda_2$ by a factor of $\beta$, i.e., sampling $\beta\Lambda_2$ points ensures that the second moment is at most $\frac{2k}{\Lambda_2^2}\,\mathrm{Cost}^2(C \cap G, \mathcal{A})$.
It then remains to argue the correctness of sampling each point independently without replacement rather than sampling a fixed number $\beta\Lambda_2$ of points, which again holds by adjusting the application of Bernstein's inequality so that there is a separate random variable for each point in the input rather than for each of the $\Lambda_2$ rounds.

F Additional Experiments on Synthetic Data

We first describe the methodology and experimental setup of our empirical evaluation on a synthetic dataset before detailing the experimental results. To emphasize the benefits of our algorithm against worst-case input, we generate a synthetic dataset that fully captures the failure cases of the previous baselines.

Dataset. We generated our dataset X consisting of 200,001 points in two-dimensional space, so that 100,000 points were drawn from a spherical Gaussian with standard deviation 2.75 centered at (−10, 10) and 100,000 points were drawn from a spherical Gaussian with standard deviation 2.75 centered at (10, 10). The final point of X was drawn from a spherical Gaussian with standard deviation 2.75 centered at (100000, 100000). Thus, by construction of our synthetic dataset for k = 3, the optimal centers should be close to (−10, 10), (10, 10), and (100000, 100000). We then create the data stream S by prepending two additional points drawn from spherical Gaussians with standard deviation 2.75 centered at (−100000, 100000) and (−100000, −100000), respectively. We set the window length to be 200,001 in accordance with the true dataset, so that the first two points of the stream of length 200,003 are expired by the data stream.

Experimental setup. For each instance of Lloyd's algorithm, whether on the entire dataset X or on the sampled coreset C, we use 3 iterations with the k-means++ initialization. In this case, the offline Lloyd's algorithm requires storing the entire dataset X in memory and thus its input size is 200,001 points. By comparison, we normalize the space requirement of the sublinear-space algorithms by permitting each algorithm to store m ∈ {3, 4, 5, 6, 7, 8, 9, 10, 11, 12} points. Note that since k = 3, it would not be reasonable for C to have fewer than 3 points. We then run Lloyd's algorithm on the coreset C, with 3 iterations using the k-means++ initialization. By construction of our dataset, we generally expect the uniform sampling algorithm uni to be stable across the various values of m but to perform somewhat poorly, as it will sample points from the large clusters but miss the point generated from the Gaussian centered at (100000, 100000). Since in our construction the stream S contains only two more points than the dataset X, the histogram-based algorithm hist will not delete any points. Thus, the resulting coreset C generated by hist is somewhat likely to contain the points generated from the Gaussians centered at (−100000, 100000) and (−100000, −100000), and it can perform poorly on the synthetic dataset in these cases. Finally, since we allow the last point of the stream to be the single point of X far from the two large clusters, the importance-sampling-based algorithm imp will sample the last point with high probability once any points of C have been expired. Hence, by the construction of our stream, we expect imp to perform well.
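For reproducibility, the following is a minimal sketch (Python with numpy, which is an assumption of this sketch) of how a synthetic dataset and stream of this form can be generated; the random seed and the ordering of points within the window are not specified by the description above, and the center coordinates simply follow that description.

```python
import numpy as np

def make_synthetic_stream(seed=0):
    """Sketch of the synthetic dataset X (200,001 points) and the stream S
    (200,003 points): two large Gaussian clusters, one far-away point, and
    two points prepended to S that expire once the window fills."""
    rng = np.random.default_rng(seed)
    std = 2.75
    cluster1 = rng.normal((-10, 10), std, size=(100_000, 2))
    cluster2 = rng.normal((10, 10), std, size=(100_000, 2))
    far_point = rng.normal((100_000, 100_000), std, size=(1, 2))
    X = np.vstack([cluster1, cluster2, far_point])        # active window contents
    expired = rng.normal([(-100_000, 100_000),
                          (-100_000, -100_000)], std)      # two prepended points
    S = np.vstack([expired, X])                            # stream of length 200,003
    return S, X
```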
Fig. 3: Comparison of the average clustering costs achieved by uniform sampling, the histogram-based algorithm, and our coreset-based algorithm across various settings of the space allocated to the algorithm, given a synthetic dataset. (a) Comparisons for varying k. (b) Comparisons for varying m. For comparison, we also include the offline k-means++ algorithm as a baseline, though it is inefficient because it stores the entire dataset. Ranges are not plotted because they would not be visible.

Experimental results. For each choice of m and k, we ran each algorithm 50 times and tracked the resulting clustering cost. As expected from our construction, our algorithm performed significantly better than the other sublinear-space algorithms. In fact, even though our algorithm was only permitted memory size m ∈ {3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, it was quite competitive with the offline Lloyd's algorithm, which used memory size 200,001, i.e., the entire dataset. For k ≥ 3, uniform sampling performed relatively poorly but quite stably: although it never managed to sample the point generated from the Gaussian centered at (100000, 100000), the two other Gaussian distributions were sufficiently close that any sampled point would serve as a relatively good center for the points generated from either distribution. Similarly, for fixed k = 3 in Figure 3b, the importance sampling approach used by the histogram-based algorithm performed the worst, by multiple orders of magnitude. We expect this is because the histogram-based algorithm did not delete the points in S \ X from C, and thus running Lloyd's algorithm on C moved the centers far away from the centers of the Gaussian distributions that induced X. A more carefully tuned histogram-based algorithm would have searched for the parameters that govern when to delete points from S \ X, which would have reduced the algorithm to our main algorithm. We plot our results in Figure 3.