# FAIR CLUSTERING IN THE SLIDING WINDOW MODEL

Published as a conference paper at ICLR 2025

Vincent Cohen-Addad, Google, cohenaddad@google.com
Shaofeng H.-C. Jiang, School of Computer Science, Peking University, shaofeng.jiang@pku.edu.cn
Qiaoyuan Yang, School of Computer Science, Peking University, qiaoyuanyang@stu.pku.edu.cn
Yubo Zhang, School of Computer Science, Peking University, zhangyubo18@stu.pku.edu.cn
Samson Zhou, Texas A&M University, samsonzhou@gmail.com

ABSTRACT

We study streaming algorithms for proportionally fair clustering, a notion originally suggested by Chierichetti et al. (2017), in the sliding window model. We show that although there exist efficient streaming algorithms in the insertion-only model, surprisingly no algorithm can achieve a finite approximation ratio without violating the fairness constraint in the sliding window model. Hence, fair clustering is a rare case of separation between the insertion-only streaming model and the sliding window model. On the other hand, we show that if the fairness constraint is relaxed by a multiplicative $(1+\varepsilon)$ factor, there exists a $(1+\varepsilon)$-approximate sliding window algorithm that uses $\mathrm{poly}(k\varepsilon^{-1}\log n)$ space. This achieves essentially the best parameters (up to the degree of the polynomial) given the aforementioned lower bound. We also conduct a number of empirical evaluations on real datasets to complement our theoretical results.

1 INTRODUCTION

Clustering is a fundamental technique used to identify meaningful patterns and structures within data. Typically, the objective of clustering is to group similar data points together into distinct clusters. Due to extensive research conducted over the years, clustering has become a well-established and deeply understood field. In the standard notion of clustering, the input is a set $P$ of $n$ points in a space equipped with a metric $\mathrm{dist}$, a cluster parameter $k>0$, and a positive integer exponent $z>0$, and the objective is to minimize the quantity $\min_{C:|C|=k}\sum_{p\in P}\min_{c\in C}\mathrm{dist}(p,c)^z$. The problem is called $(k,z)$-clustering when the metric space is $\mathbb{R}^d$ and $\mathrm{dist}$ is the Euclidean distance; in particular, the problem is called $k$-median for $z=1$ and $k$-means for $z=2$. Note that the center set $C$ implicitly partitions $P$ by grouping together the points $p\in P$ closest to each center $c\in C$.

Fair clustering. When applied to user data, traditional clustering methods may produce biased outputs, leading to discriminatory outcomes that can perpetuate social inequalities in downstream applications (Dwork et al., 2012; Chhabra et al., 2021). For example, clustering algorithms used in hiring or lending tasks may inadvertently discriminate against certain groups based on factors such as race, gender, or socioeconomic status. This is particularly problematic in domains where clustering is used to make decisions with significant consequences, such as healthcare, education, and finance. Hence, there has been a large line of active work focusing on various notions of fairness for clustering; we defer an overview of this work to Section 1.2. In $(\alpha,\beta)$-fair clustering, each item $p\in P$ is associated with a subpopulation (often also called a group) $j\in[\ell]$ among $\ell$ possible subpopulations, and there are additional constraints $0\le\alpha_j\le\beta_j\le1$ for each $j\in[\ell]$. The goal is then to minimize the clustering objective over all possible clusterings in which each cluster contains at least an $\alpha_j$ fraction and no more than a $\beta_j$ fraction of items from group $j$. Thus, these assignment constraints are designed to capture fairness by ensuring that no subpopulation is under-represented or over-represented in each cluster.
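As a concrete illustration of the constraint, the following minimal Python sketch checks $(\alpha,\beta)$-fairness of a given clustering; the toy group labels and clusters are hypothetical inputs, not taken from the paper.

```python
from collections import Counter

def is_fair(clusters, groups, alpha, beta):
    """Check (alpha, beta)-fairness: in every cluster, the fraction of
    points from group j must lie in [alpha[j], beta[j]] for all j."""
    ell = len(alpha)
    for cluster in clusters:
        size = len(cluster)
        counts = Counter(j for p in cluster for j in groups[p])
        for j in range(ell):
            frac = counts[j] / size
            if not (alpha[j] <= frac <= beta[j]):
                return False
    return True

# Toy example: two clusters, two groups; each cluster must be 30-70% of each group.
groups = {0: {0}, 1: {1}, 2: {0}, 3: {1}}
print(is_fair([[0, 1], [2, 3]], groups, alpha=[0.3, 0.3], beta=[0.7, 0.7]))  # True
```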
The sliding window model. The sliding window model (Datar et al., 2002), which focuses on analyzing recent data points rather than the entire dataset, offers a unique perspective on this challenge. In many real-world applications, recent data is more relevant and informative than historical data (Babcock et al., 2002; Manku & Motwani, 2012; Papapetrou et al., 2015; Wei et al., 2016). For example, in social media analysis, understanding current trends and discussions requires focusing on the most recent posts. In these scenarios, the sliding window model can provide a more dynamic and responsive approach to clustering. The sliding window model is particularly relevant for applications in which computation must be restricted to data received after a certain time. Laws regarding data privacy, including the sweeping General Data Protection Regulation (GDPR), explicitly require that companies must not retain certain user information beyond a specified time. Consequently, Facebook retains user search data for 6 months (Facebook), Apple retains user information for 3 months (Apple), and Google stores browser information for up to 9 months (Google). The sliding window model captures these retention policies with the appropriate setting of the window parameter $W$ and has thus been considered across a wide range of applications (Lee & Ting, 2006a;b; Braverman & Ostrovsky, 2007; Chen et al., 2016; Datar & Motwani, 2016; Epasto et al., 2017; Braverman et al., 2018; 2021; Woodruff & Zhou, 2021; Jayaram et al., 2022; Blocki et al., 2023). However, the sliding window model also introduces new challenges for fairness. As data streams in continuously, the composition of the window can change rapidly, potentially leading to biased or discriminatory outcomes. Therefore, it is essential to develop clustering algorithms that are not only accurate but also fair in the context of the sliding window model. Formally, in the sliding window model points arrive sequentially, and the dataset $P$ is implicitly defined as the last $n$ points of the stream. The goal is to perform fair clustering using space sublinear in the size $n$ of the dataset.

1.1 OUR RESULTS

In this work, we study fair clustering in the sliding window model. In particular, our goal is to achieve a $(1+\varepsilon)$-multiplicative approximation to the optimal fair clustering for the dataset implicitly defined by the data stream in the sliding window model, while using a minimal amount of space (and then minimizing the update time as a secondary priority). Surprisingly, there is no existing work on fair clustering in the sliding window model. On the other hand, recent work has achieved a $(1+\varepsilon)$-approximation for the standard formulation of $(k,z)$-clustering in the sliding window model using $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ words of space (Woodruff et al., 2023), which was shown to be space-optimal up to factors polylogarithmic in the input size $n$ and the aspect ratio $\Delta$ (Cohen-Addad et al., 2022; 2023; Huang et al., 2024). In particular, we can assume without loss of generality that each coordinate of each point can be represented in $\log\Delta$ bits of space, so that after normalization, all points lie within the grid $[\Delta]^d=\{1,2,\ldots,\Delta\}^d$. Similarly, results by Braverman et al. (2022) can be adapted, using the well-known merge-and-reduce framework, to achieve a $(1+\varepsilon)$-approximation streaming algorithm for fair clustering using $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ words of space, which is also space-optimal up to polylogarithmic factors (Cohen-Addad et al., 2022; Huang et al., 2024).
Therefore, we ask: Can we achieve similar space-optimal results for fair clustering in the sliding window model?

Surprisingly, our first result is a strong impossibility result:

Theorem 1.1. Any algorithm for fair $(k,z)$-clustering in the sliding window model that, with probability at least $2/3$, achieves any multiplicative approximation or additive $\Delta/2-1$ error must use $\Omega(n)$ space.

The proof of Theorem 1.1 utilizes tools from communication complexity and is fairly standard. Specifically, we show that fair clustering in the sliding window model can solve a version of Augmented Index, where one player has a binary string $x\in\{0,1\}^{2n}$ of length $2n$ and weight $n$, while the second player has an index $i$ and must guess the value of $x_i$, given the values of $x_1,\ldots,x_{i-1}$ as well as a message passed from the first player. The main intuition is that the two players can copy $x_1,\ldots,x_{i-1}$ to the end of the stream, as well as a guess $x_i=0$, so that the sliding window contains precisely the elements $x_{i+1},\ldots,x_{2n},x_1,x_2,\ldots,x_{i-1},0$, as well as $n$ additional points at the origin and $n$ additional points at the point $1$ on the real line, which the players subsequently insert. Now if the guess $x_i=0$ is correct, this window contains precisely $2n$ points that are zero and $2n$ points that are one. Otherwise, it contains $2n-1$ points that are one and $2n+1$ points that are zero. Hence, by setting the fairness constraint to require half of the points at each center to be from the last $2n$ points and half from the first $2n$ points, the fair clustering objective is zero if and only if the guess $x_i=0$ is correct. Otherwise, the objective is nonzero, which allows any finite multiplicative approximation to fair clustering to distinguish between these two cases. The lower bound then follows from known communication lower bounds for the Augmented Index problem. We defer the formal proof to Appendix A.

We now discuss the main implications of Theorem 1.1. Whereas our goal was to achieve a $(1+\varepsilon)$-approximation for fair clustering in $\mathrm{poly}(k,\frac{1}{\varepsilon})$ space, Theorem 1.1 states that any multiplicative approximation requires linear space, even an arbitrarily large multiplicative approximation, e.g., polynomial or even exponential in $n$. Due to the aforementioned results, Theorem 1.1 shows several separations. First, it shows a separation between fair clustering and the standard clustering formulation in the sliding window model, due to the sliding window algorithm for $(k,z)$-clustering that achieves a $(1+\varepsilon)$-approximation using $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ words of space (Woodruff et al., 2023). Second, it shows a separation between the insertion-only streaming model and the sliding window model for fair clustering, due to the streaming algorithm for fair $(k,z)$-clustering that achieves a $(1+\varepsilon)$-approximation using $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ words of space (Braverman et al., 2022). Theorem 1.1 therefore carries an important informal message: fairness and recency are not compatible in sublinear space! The natural follow-up question is: what can actually be achieved toward some notion of fair clustering in the sliding window model?
To that end, we show that there exist sublinear-space algorithms for fair clustering in the sliding window model if we permit a small slackness on the fairness constraints:

Theorem 1.2. There is a sliding-window streaming algorithm that returns a center set $C$ after every insertion operation, such that with high probability, there exists a $((1-\varepsilon)\alpha,(1+\varepsilon)\beta)$-fair $k$-clustering whose cost with center set $C$ is at most $1+\varepsilon$ times the optimal $(\alpha,\beta)$-fair clustering on the sliding window. The algorithm uses space $L\cdot\mathrm{poly}(\varepsilon^{-1}kd\log\Delta)$, where $L\le2^\ell$ is the number of distinct combinations of groups that any point may belong to.

We remark that the guarantees of our algorithm are slightly stronger than those stated in Theorem 1.2. In particular, not only does our algorithm produce a $(1+\varepsilon)$-approximation to the optimal fair clustering, but it also produces a $(1+\varepsilon)$-coreset, i.e., a dataset that can be used to obtain a $(1+\varepsilon)$-approximation to the cost of any query consisting of $k$ centers under the fairness constraints. To that end, we further remark that the space used by Theorem 1.2 matches the best known bounds for offline coreset constructions up to polylogarithmic factors (Braverman et al., 2022).

Experimental evaluations. Although the $(1+\varepsilon)$ ratio achieved by Theorem 1.2 is powerful, it is mostly of theoretical interest, since the running time of an exact downstream solver must be exponential due to APX-hardness. However, our algorithm actually maintains a coreset of the sliding window whose construction is very efficient. Hence, we plug in a practically efficient downstream approximation algorithm by Backurs et al. (2019) to achieve better empirical performance. We validate the performance of our algorithm against three baselines on five real datasets. Our algorithm offers the overall best tradeoff among running time, space, and accuracy compared with all baselines.

1.2 RELATED WORK

In this section, we describe a number of related works, both in the context of fair clustering and sliding window algorithms for clustering.

Fairness. Fairness in algorithmic design has recently received significant attention (Kleinberg et al., 2017; Huang et al., 2019; Chen et al., 2024; Song et al., 2024). Chierichetti et al. (2017) initiated the study of fairness in clustering, specifically in the context of disparate impact, which demands that any protected subpopulation must receive similar representation in the decision-making process, in particular by the output of an algorithm.
Although Chierichetti et al. (2017) initially considered two classes of populations, subsequent works generalized the setting to multiple subpopulations (Rösner & Schmidt, 2018), as well as to different clustering objectives (Ahmadian et al., 2019; Bandyapadhyay et al., 2019; Bercea et al., 2019; Bera et al., 2019b; Kleindessner et al., 2019b; Ahmadian et al., 2020b;a; Esmaeili et al., 2020; 2021; Böhm et al., 2021; Schwartz & Zats, 2022; Ahmadian & Negahbani, 2023). A similar notion studies clustering where each cluster requires a certain number of representatives from each subpopulation, rather than a certain fraction (Anegg et al., 2022; Jia et al., 2022). Other proposed notions of fairness include (1) social fairness, where the clustering cost is considered across subpopulations and then minimized (Jones et al., 2020; Ghadiri et al., 2021; Makarychev & Vakilian, 2021), (2) individual fairness, where each point should have a center within a reasonably close distance that is a function of the overall dataset (Jung et al., 2020; Mahabadi & Vakilian, 2020; Negahbani & Chakrabarty, 2021; Vakilian & Yalçıner, 2022), (3) representative fairness, where the number of centers of each class is constrained (Kleindessner et al., 2019a; Angelidakis et al., 2022), and (4) a number of other notions (Micha & Shah, 2020; Abbasi et al., 2021; Hotegni et al., 2023; Gupta et al., 2023). However, as we focus on disparate impact in this work, these notions of fairness are somewhat orthogonal to the main subject of our study.

Clustering in data streams and the sliding window model. While histogram-based approaches (Datar et al., 2002; Braverman & Ostrovsky, 2007) provide a general framework for sliding window algorithms, the clustering objective is not smooth and is thus not suitable for these frameworks. Hence, there has been an active line of work studying $(k,z)$-clustering in the sliding window model. Babcock et al. (2003) first introduced an algorithm for $k$-median clustering in the sliding window model that uses $O\!\left(\frac{k}{\varepsilon^4}m^{2\varepsilon}\log^2 m\right)$ words of space and gives a $2^{O(1/\varepsilon)}$-multiplicative approximation, where $m$ is the size of the window and $\varepsilon\in(0,\frac12)$ is an input parameter. Braverman et al. (2015) subsequently gave a bicriteria algorithm for the $k$-median problem in the sliding window model that achieves an $O(1)$-multiplicative approximation using $k^2\,\mathrm{polylog}(m)$ space, but at the cost of $2k$ centers. Braverman et al. (2016) achieved the first $\mathrm{poly}(k\log m)$-space algorithm for $k$-median and $k$-means on sliding windows, achieving an $O(1)$-approximation using $O(k^3\log^6 m)$ space. Afterwards, Borassi et al. (2020) achieved a linear dependency on $k$, giving an $O(1)$-approximation for $k$-clustering using $k\,\mathrm{polylog}(m,\Delta)$ space. Epasto et al. (2022) introduced the first $(1+\varepsilon)$-approximation algorithm for $(k,z)$-clustering, using $\frac{kd+d^C}{\varepsilon^3}\,\mathrm{polylog}(m,\Delta,\frac{1}{\varepsilon})$ words of space for some constant $C\ge7$, which was subsequently improved by Woodruff et al. (2023) to $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\mathrm{polylog}\frac{m}{\varepsilon}$ words of space. It should be noted that all of these results consider the standard formulation of $(k,z)$-clustering in the sliding window model; that is, no previous work considered fair clustering in the sliding window model. The most relevant works are the offline $(1+\varepsilon)$-coreset constructions by Braverman et al. (2022) for fair clustering, which sample $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\mathrm{polylog}\frac{n}{\varepsilon}$ points and can be adapted to the insertion-only streaming model at the cost of multiplicative factors polylogarithmic in $n$, using the standard merge-and-reduce technique. As this coreset size matches the bound for $(k,z)$-clustering in the sliding window model, one might hope to achieve similar bounds for fair clustering in the sliding window model. Unfortunately, our results show that this is not the case.

2 PRELIMINARIES

Notations. We assume the input comes from $[\Delta]^d$ for some integer $\Delta\ge1$, and let $\mathcal{C}\subseteq\mathbb{R}^d$ be the candidate center set. In our analysis, many statements also hold for general $\mathbb{R}^d$, but we may still state them for $[\Delta]^d$ for consistency of presentation. Throughout, when we talk about a point set in $\mathbb{R}^d$, we assume each point is associated with a unique time step.
Importantly, we use this time step to identify a point, which means that two identical points in $\mathbb{R}^d$ with different time steps are considered different points (and set operations, such as intersection, are performed with respect to the time steps). For a general function $f:U\to V$, define $f(Y):=\sum_{y\in Y}f(y)$ for $Y\subseteq U$ in the domain $U$.

Definition 2.1 ($(\alpha,\beta)$-fair clustering). Given $P\subseteq\mathbb{R}^d$, $\ell$ groups $P_1,\ldots,P_\ell\subseteq P$ and $\alpha,\beta\in[0,1]^\ell$, a $k$-clustering $(C_1,\ldots,C_k)$, i.e., a $k$-partition of $P$, is called $(\alpha,\beta)$-fair if $\frac{|P_j\cap C_i|}{|C_i|}\in[\alpha_j,\beta_j]$ for every $i\in[k]$, $j\in[\ell]$. $(\alpha,\beta)$-fair $(k,z)$-CLUSTERING asks to find an $(\alpha,\beta)$-fair clustering $(C_1,\ldots,C_k)$ and a center set $C=\{c_1,\ldots,c_k\}\subseteq\mathbb{R}^d$ of $k$ points such that $\sum_{i\in[k]}\sum_{x\in C_i}(\mathrm{dist}(x,c_i))^z$ is minimized.

Notice that there are at most $2^\ell$ possible combinations of groups that a point $p\in P$ may belong to. In practice, this $2^\ell$ upper bound is rarely attained. Hence, it is often useful to consider the actual number of distinct combinations of groups that a point $p\in P$ may belong to, denoted by $L$ throughout. Namely, $L:=|\{W\subseteq[\ell]:\exists p\in P\text{ such that }\forall j\in W,\ p\in P_j\}|$.

Streaming model. The window size $m$ and the fairness constraints $\alpha,\beta\in\mathbb{R}^\ell$ are given in advance. The input point set $P\subseteq[\Delta]^d$ and the groups $P_1,\ldots,P_\ell\subseteq P$ of $(\alpha,\beta)$-fair clustering are presented as a stream of insertion operations, each consisting of a point $p\in P$ and the IDs of the groups that $p$ belongs to. The algorithm must return a center set $C\subseteq\mathbb{R}^d$ after each insertion, which should be an approximate solution for the dataset defined by the last $m$ operations.

A weighted point set $S\subseteq[\Delta]^d$ is a point set $S$ equipped with a weight function $w_S:[\Delta]^d\to\mathbb{R}_+$ such that $\mathrm{supp}(w_S)=S$. Notice that for a weighted point set $S$ and a (weighted or unweighted) point set $T$, the set operation $S\cap T$ is performed only on the point sets, and the weights must be defined separately. When the context is clear, we drop the subscript $S$ in the weight function $w_S$.

We use a generic notion of assignment-preserving coresets, introduced in Braverman et al. (2022), which we rephrase as follows. We note that an assignment-preserving coreset is not immediately a coreset that preserves the fair clustering objective, but it is possible to obtain one by taking a union of several assignment-preserving coresets on a partition of the input dataset (the detailed argument can be found in Section 4, in the proof of Theorem 1.2).

Definition 2.2 (Assignment constraints). Consider a weighted point set $P\subseteq[\Delta]^d$ and a center set $C\subseteq\mathcal{C}$. An assignment constraint with respect to $P$ and $C$ is a function $\Gamma:C\to\mathbb{R}_+$ such that $\sum_{p\in P}w_P(p)=\sum_{c\in C}\Gamma(c)$. An assignment function with respect to $P$ and $C$ is a function $\sigma:P\times C\to\mathbb{R}_+$ such that for every $p\in P$, $\sum_{c\in C}\sigma(p,c)=w_P(p)$. We say $\sigma$ is consistent with $\Gamma$, denoted $\sigma\vdash\Gamma$, if for every $c\in C$, $\sum_{p\in P}\sigma(p,c)=\Gamma(c)$.

Definition 2.3 ($(k,z)$-clustering with assignment constraints). Given a weighted point set $P\subseteq[\Delta]^d$, a center set $C\subseteq\mathcal{C}$ and an assignment constraint $\Gamma$, for every assignment function $\sigma$ with respect to $P,C$ that is consistent with $\Gamma$, the cost of $\sigma$ is defined as $\mathrm{cost}_\sigma(P,C,\Gamma)=\sum_{p\in P}\sum_{c\in C}\sigma(p,c)\,\mathrm{dist}(p,c)^z$. The cost function of $(k,z)$-CLUSTERING with assignment constraint is $\mathrm{cost}(P,C,\Gamma)=\min_{\sigma:P\times C\to\mathbb{R}_+,\,\sigma\vdash\Gamma}\mathrm{cost}_\sigma(P,C,\Gamma)$.

As in Braverman et al. (2022); Huang et al. (2023), our coreset is defined with respect to assignment constraints, and we aim for coresets that preserve all assignment constraints simultaneously.
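Concretely, the cost in Definition 2.3 is a small transportation linear program over the variables $\sigma(p,c)$: row sums must equal the weights $w_P(p)$ and column sums must equal $\Gamma(c)$. A minimal Python sketch using scipy follows; the point coordinates, weights, and constraint values are toy assumptions, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def constrained_cost(P, w, C, Gamma, z=2):
    """cost(P, C, Gamma) per Definition 2.3: cheapest sigma >= 0 with
    row sums w(p) and column sums Gamma(c)."""
    n, k = len(P), len(C)
    # flattened objective: sigma[p, c] weighted by dist(p, c)^z
    cost = np.array([[np.linalg.norm(p - c) ** z for c in C] for p in P]).ravel()
    A_eq, b_eq = [], []
    for i in range(n):                 # sum_c sigma(p_i, c) = w(p_i)
        row = np.zeros(n * k); row[i * k:(i + 1) * k] = 1
        A_eq.append(row); b_eq.append(w[i])
    for j in range(k):                 # sum_p sigma(p, c_j) = Gamma(c_j)
        row = np.zeros(n * k); row[j::k] = 1
        A_eq.append(row); b_eq.append(Gamma[j])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.fun

P = np.array([[0.0], [1.0], [2.0]])
C = np.array([[0.0], [2.0]])
print(constrained_cost(P, w=[1, 1, 1], C=C, Gamma=[2, 1], z=1))  # 1.0
```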
Definition 2.4 (Assignment-preserving coreset). Given a weighted point set $P\subseteq[\Delta]^d$ and $0<\varepsilon<1$, an assignment-preserving $\varepsilon$-coreset for $(k,z)$-CLUSTERING of $P$ is a weighted point set $S\subseteq P$ such that $w_S(S)=w_P(P)$ and, for every center set $C\subseteq\mathcal{C}$ and assignment constraint $\Gamma$, $\mathrm{cost}(S,C,\Gamma)\in(1\pm\varepsilon)\,\mathrm{cost}(P,C,\Gamma)$.

3 ONLINE ASSIGNMENT-PRESERVING CORESETS

As mentioned, our framework is based on that of Woodruff et al. (2023), which reduces the sliding window algorithm to constructing an online coreset. In their definition, given a point set $P=\{p_1,p_2,\ldots,p_n\}$ listed in increasing order of time steps, an online $\varepsilon$-coreset $S$ for $(k,z)$-CLUSTERING is a weighted subset such that every prefix $S\cap\{p_1,p_2,\ldots,p_i\}$ is an $\varepsilon$-coreset for $(k,z)$-CLUSTERING of the prefix $\{p_1,p_2,\ldots,p_i\}$. However, this definition of online coreset does not easily generalize to our setting, and the situation is even trickier due to our lower bound in Theorem 1.1. Hence, we use a slightly relaxed notion, defined in Definition 3.1. Roughly, the prefix property of our online assignment-preserving coresets can tolerate a $(1\pm\varepsilon)$ relative error in the weights. This relaxed guarantee leads to an efficient coreset bound, stated in Lemma 3.2, which is the main claim of this section.

Definition 3.1 (Online assignment-preserving coreset). Given a weighted point set $P=\{p_1,p_2,\ldots,p_n\}$ listed in increasing order of time steps and $0<\varepsilon<1$, a weighted point set $S$ is an online assignment-preserving $\varepsilon$-coreset of $P$ for $(k,z)$-CLUSTERING if $S\subseteq P$ and for every $t\in[n]$ there exists some weighted point set $S'_t$ such that the following holds. Let $P_t:=\{p_1,\ldots,p_t\}$ with $w_{P_t}:p\mapsto w_P(p)$; then $S'_t$ is an assignment-preserving $\varepsilon$-coreset for the weighted set $P_t$. Let $S_t:=S\cap P_t$ with $w_{S_t}:p\mapsto w_S(p)$; then $S_t=S'_t$ and, for every $p\in S_t$, $w_{S_t}(p)\in(1\pm\varepsilon)\,w_{S'_t}(p)$.

Lemma 3.2. There exists an algorithm that takes as input a weighted set $P\subseteq[\Delta]^d$ (with unique time steps) of $n$ points, $0<\varepsilon<1$, $z\ge1$ and an integer $k\ge1$, and computes a weighted set $S$ with $|S|=\mathrm{poly}(2^z\varepsilon^{-1}kd\log(w(P)\Delta\varepsilon^{-1}))$, such that $S$ is an online assignment-preserving $\varepsilon$-coreset for $(k,z)$-CLUSTERING with probability $0.9$. The algorithm uses $\mathrm{poly}(2^z\varepsilon^{-1}kd\log(w(P)\Delta\varepsilon^{-1}))$ space.

Algorithm for Lemma 3.2. The algorithm for our online assignment-preserving coreset is listed in Algorithm 1. This construction is based on a similar sampling-based framework as in Woodruff et al. (2023), which is also widely used in the coreset literature in general (Chen, 2009; Cohen-Addad et al., 2021; Cohen-Addad & Li, 2019; Braverman et al., 2022).

Algorithm 1 Online assignment-preserving coreset construction
procedure ONLINECORESET(P, ε, δ):
    compute an (α, β)-approximate solution Q = {q_1, ..., q_ℓ} ⊆ P for (k, z)-clustering of P using, e.g., the Meyerson sketch, where α = O(2^z), β = O(2^z poly(log Δ))
    S ← ∅, ε′ ← ε/(2^z α), δ′ ← δ/(ℓ(⌈lg(Δ√d)⌉ + 1))
    for i = 1, 2, ..., ℓ:
        P_i ← {p ∈ P : NN_Q(p) = q_i}    ▷ NN_Q(p) is the nearest neighbor of p in the set Q
        for j = 0, 1, 2, ..., ⌈lg(Δ√d)⌉:
            P_{i,j} ← P_i ∩ ring(q_i, 2^{j−1}, 2^j)
            S_{i,j} ← RINGCORESET(P_{i,j}, ε′, δ′)
    return S ← ⋃_{1≤i≤ℓ, 0≤j≤⌈lg(Δ√d)⌉} S_{i,j}

Algorithm 2 Coreset for a single ring
procedure RINGCORESET(P, ε, δ):
    list P in increasing order of time steps as p_1, p_2, ..., p_n
    δ′ ← δ/(k/ε)^{O(kd)}, T ← poly(2^z d k ε^{−1} log(w(P) Δ ε^{−1} δ^{−1})), S ← ∅, sum_0 ← 0
    for i = 1, 2, ..., n:
        sum_i ← sum_{i−1} + w_P(p_i)
        prob_i ← min(T · w_P(p_i)/sum_i, 1)
        add p_i to S with probability prob_i and weight w_S(p_i) ← prob_i^{−1} · w_P(p_i)
    return S
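To make the sampling rule of Algorithm 2 concrete, here is a minimal Python sketch; the threshold T is left as a free parameter rather than the $\mathrm{poly}(\cdot)$ bound above.

```python
import random

def ring_coreset(points, weights, T):
    """Algorithm 2 (sketch): sample the i-th point with probability
    min(T * w(p_i) / prefix_weight_i, 1), reweighting kept points so the
    sample is unbiased. Points must arrive in increasing time-step order."""
    S = []            # list of (point, coreset weight)
    prefix = 0.0
    for p, w in zip(points, weights):
        prefix += w
        prob = min(T * w / prefix, 1.0)
        if random.random() < prob:
            S.append((p, w / prob))
    return S
```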
In this framework, one starts by computing a bicriteria approximation $\hat C:=\{\hat c_1,\ldots,\hat c_t\}$ for (unconstrained) $(k,z)$-CLUSTERING. We say $\hat C$ is an $(\alpha,\beta)$ bicriteria approximate solution for (unconstrained) $(k,z)$-CLUSTERING if $\mathrm{cost}(P,\hat C)=\sum_{p\in P}\mathrm{dist}(p,\hat C)^z\le\alpha\cdot\mathrm{OPT}$ and $|\hat C|\le\beta k$. For our purposes, we use the Meyerson sketch proposed by Borassi et al. (2020), which outputs an $(O(2^z),O(2^z\log\eta^{-1}\log\Delta))$-approximate center set $\hat C\subseteq P$ with probability at least $1-\eta$. Next, take the natural clustering defined by $\hat C$ (according to the nearest-neighbor rule) and decompose each cluster into rings centered at $\hat c_i$. The coreset is constructed by simply drawing $\mathrm{poly}(\varepsilon^{-1}kd)$ uniform samples from each ring, re-weighting, and taking the union. However, it is nontrivial to draw exactly uniform samples while maintaining the online property, especially in the streaming setting. Hence, Woodruff et al. (2023) instead sample the $i$-th element of a ring with probability $1/i$ (and the sampling becomes more involved when run on a weighted dataset, which we need). Our algorithm also uses this step as a key subroutine, which we call RINGCORESET. The RINGCORESET algorithm is presented in Algorithm 2, and its guarantee is summarized in the following Lemma 3.3.
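For concreteness, the following Python sketch shows the outer loop of Algorithm 1, assuming the bicriteria centers Q are already given (e.g., by a Meyerson sketch) and reusing ring_coreset from the sketch above; the Euclidean dist helper and the threshold T are simplifications for illustration.

```python
import math
from collections import defaultdict

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance (Python 3.8+)

def online_coreset(points, weights, Q, T):
    """Algorithm 1 (sketch): bucket each point by its nearest bicriteria
    center q_i and by the ring index j with dist(p, q_i) in (2^{j-1}, 2^j],
    then run the ring sampler inside every bucket and take the union."""
    buckets = defaultdict(lambda: ([], []))       # (i, j) -> (points, weights)
    for p, w in zip(points, weights):
        i = min(range(len(Q)), key=lambda t: dist(p, Q[t]))
        j = max(0, math.ceil(math.log2(max(dist(p, Q[i]), 1.0))))
        buckets[(i, j)][0].append(p)
        buckets[(i, j)][1].append(w)
    S = []
    for (i, j), (ps, ws) in buckets.items():
        S.extend(ring_coreset(ps, ws, T))
    return S
```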
Lemma 3.3. Consider $q\in[\Delta]^d$, $r>0$, $\varepsilon,\delta\in(0,1)$, $k,z\ge1$, and a weighted set $P=(p_1,p_2,\ldots,p_n)\subseteq[\Delta]^d\cap\mathrm{ring}(q,r/2,r)$ listed in increasing order of time steps. Let $N:=w(P)$ and let $S$ be the output of RINGCORESET$(P,\varepsilon,\delta)$ (Algorithm 2). Then $\mathbb{E}[|S|]=\mathrm{poly}(2^zk\varepsilon^{-1}\log(N\Delta\varepsilon^{-1}\delta^{-1}))$, and with probability at least $1-\delta$ there exists a weighted set $S'$ such that (i) $S'=S$; (ii) for every $p\in S$, $w_{S'}(p)\in(1\pm\varepsilon)\,w_S(p)$; and (iii) for every center set $C\subseteq\mathbb{R}^d$ with $|C|\le k$ and every assignment constraint $\Gamma$ with respect to $P,C$,
$$|\mathrm{cost}(S',C,\Gamma)-\mathrm{cost}(P,C,\Gamma)|\le\tfrac{\varepsilon}{4}\big(\mathrm{cost}(P,C,\Gamma)+\mathrm{cost}(S',C,\Gamma)+Nr^z\big).$$

The proof of Lemma 3.3 is postponed to Appendix C; we first finish the proof of Lemma 3.2, assuming Lemma 3.3 holds. In the following discussion, for every $P\subseteq[\Delta]^d$, center set $C\subseteq\mathbb{R}^d$ with $|C|=k$ and assignment constraint $\Gamma$ with respect to $P,C$, we say $\sigma$ is the optimal assignment function of $\mathrm{cost}(P,C,\Gamma)$ if $\sigma=\arg\min_{\sigma:P\times C\to\mathbb{R}_{\ge0},\,\sigma\vdash\Gamma}\sum_{p\in P}\sum_{c\in C}\sigma(p,c)\,\mathrm{dist}(p,c)^z$.

Proof of Lemma 3.2. By the definition of Algorithms 1 and 2, we have ONLINECORESET$(P)\cap\{p_1,p_2,\ldots,p_i\}=$ ONLINECORESET$(\{p_1,p_2,\ldots,p_i\})$ when the randomness is fixed. Hence it suffices to show that for every $P$ there exists some assignment-preserving coreset $S'$ of $P$ such that $S'=S$ and, for every $p\in S'$, $w_{S'}(p)\in(1\pm\varepsilon)\,w_S(p)$. Let $\alpha,\ell,\varepsilon',S_{i,j},P_{i,j}$ follow their definitions in Algorithm 1. Applying a union bound to the result of Lemma 3.3, with probability at least $1-\delta$, for every $1\le i\le\ell$ and $0\le j\le\lceil\lg(\Delta\sqrt d)\rceil$, there exists a weighted point set $S'_{i,j}$ satisfying $S'_{i,j}=S_{i,j}$ and $w_{S'_{i,j}}(p)\in(1\pm\varepsilon')\,w_{S_{i,j}}(p)$ for every $p\in S'_{i,j}$, such that for every center set $C$ and assignment constraint $\Gamma_{i,j}$ with respect to $P_{i,j},C$ (writing $r_j:=2^j$, so that $\mathrm{dist}(p,q_i)>r_j/2$ for every $p\in P_{i,j}$),
$$|\mathrm{cost}(S'_{i,j},C,\Gamma_{i,j})-\mathrm{cost}(P_{i,j},C,\Gamma_{i,j})|\le\tfrac{\varepsilon'}{4}\,w(P_{i,j})\,r_j^z+\tfrac{\varepsilon'}{4}\big(\mathrm{cost}(S'_{i,j},C,\Gamma_{i,j})+\mathrm{cost}(P_{i,j},C,\Gamma_{i,j})\big)$$
$$\le\tfrac{\varepsilon'2^z}{4}\sum_{p\in P_{i,j}}w_P(p)\,\mathrm{dist}(p,q_i)^z+\tfrac{\varepsilon'}{4}\big(\mathrm{cost}(S'_{i,j},C,\Gamma_{i,j})+\mathrm{cost}(P_{i,j},C,\Gamma_{i,j})\big).$$

Let the weighted set $S'$ be the union of all $S'_{i,j}$. Let $\sigma^*$ be the optimal assignment for $\mathrm{cost}(P,C,\Gamma)$, and fix the assignment constraint for every $P_{i,j}$ as $\Gamma_{i,j}(c):=\sum_{p\in P_{i,j}}\sigma^*(p,c)$. Define $N_R:=\lceil\lg(\Delta\sqrt d)\rceil+1$ as the number of rings. We have $\mathrm{cost}(P,C,\Gamma)=\mathrm{cost}_{\sigma^*}(P,C,\Gamma)=\sum_{i=1}^{\ell}\sum_{j=1}^{N_R}\mathrm{cost}(P_{i,j},C,\Gamma_{i,j})$ and $\mathrm{cost}(S',C,\Gamma)\le\sum_{i=1}^{\ell}\sum_{j=1}^{N_R}\mathrm{cost}(S'_{i,j},C,\Gamma_{i,j})$. Summing up, we have
$$\mathrm{cost}(S',C,\Gamma)-\mathrm{cost}(P,C,\Gamma)\le\sum_{i=1}^{\ell}\sum_{j=1}^{N_R}\big(\mathrm{cost}(S'_{i,j},C,\Gamma_{i,j})-\mathrm{cost}(P_{i,j},C,\Gamma_{i,j})\big)$$
$$\le\tfrac{\varepsilon'2^z}{4}\sum_{i=1}^{\ell}\sum_{p\in P_i}w_P(p)\,\mathrm{dist}(p,q_i)^z+\tfrac{\varepsilon'}{4}\big(\mathrm{cost}(S',C,\Gamma)+\mathrm{cost}(P,C,\Gamma)\big)$$
$$\le\tfrac{\varepsilon}{4\alpha}\,\mathrm{cost}_Q+\tfrac{\varepsilon}{4}\,\mathrm{cost}(P,C,\Gamma)+\tfrac{\varepsilon}{4}\,\mathrm{cost}(S',C,\Gamma)\le\varepsilon\,\mathrm{cost}(P,C,\Gamma),$$
where $\mathrm{cost}_Q$ denotes the cost of the solution $Q$, and the last steps use $\varepsilon'=\frac{\varepsilon}{2^z\alpha}$ together with $\mathrm{cost}_Q\le\alpha\cdot\mathrm{OPT}\le\alpha\cdot\mathrm{cost}(P,C,\Gamma)$. By a symmetric argument, we have $\mathrm{cost}(P,C,\Gamma)-\mathrm{cost}(S',C,\Gamma)\le\varepsilon\,\mathrm{cost}(P,C,\Gamma)$.

The expected size of the coreset can be bounded by
$$\sum_{i=1}^{\ell}\sum_{j=0}^{N_R-1}|S_{i,j}|\le\beta k\,\big(\lceil\lg(\Delta\sqrt d)\rceil+1\big)\cdot\mathrm{poly}(2^zdk\varepsilon^{-1}\log(w(P)\Delta\varepsilon^{-1}\delta^{-1}))=\mathrm{poly}(2^zdk\varepsilon^{-1}\log(w(P)\Delta\varepsilon^{-1}\delta^{-1})).$$

In summary, running ONLINECORESET$(P,\varepsilon,0.01)$ returns an online assignment-preserving $\varepsilon$-coreset $S$ such that $|S|=\mathrm{poly}(2^zdk\varepsilon^{-1}\log(w(P)\Delta\varepsilon^{-1}))$ with probability $0.9$.

Table 1: Specifications of datasets

| dataset  | size  | d  | attribute | window size | coreset size (benchmark) | coreset size (ours) |
|----------|-------|----|-----------|-------------|--------------------------|---------------------|
| Adult    | 50k   | 6  | gender    | 500         | 231                      | 150                 |
| Bank     | 45k   | 10 | marital   | 500         | 234                      | 150                 |
| Diabetes | 100k  | 8  | gender    | 1000        | 485                      | 300                 |
| Athlete  | 200k  | 3  | gender    | 2000        | 1058                     | 750                 |
| Census   | 2500k | 13 | gender    | 5000        | 1988                     | 1500                |

4 STREAMING ALGORITHMS

Our sliding window algorithm uses a framework introduced by Braverman et al. (2020); Woodruff et al. (2023). This framework reduces clustering on sliding windows to an online coreset problem via a standard merge-and-reduce method; the detailed algorithm is listed in Algorithm 3. The algorithm maintains a coreset for the sliding window, and its guarantee is summarized in Theorem 4.1. The key idea is that every window can be viewed as a suffix of the input stream at some time step $t$; if we build an online coreset for all elements of time steps $1,\ldots,t$ in the reverse order of time steps, then a prefix of the online coreset (which is precisely the query the online coreset supports) gives a coreset for the sliding window.

Algorithm 3 Sliding window coreset algorithm based on online assignment-preserving coresets
procedure MERGEANDREDUCE(P):
    t ← 0, ℓ ← ⌈lg m⌉; initialize ℓ+1 point sets B_0, B_1, ..., B_ℓ ← ∅
    while the input stream is not empty:
        t ← t + 1
        p ← the next element of the input stream; time(p) ← −t    ▷ online coresets are defined on the reverse stream
        let j be the smallest index with B_j = ∅ (set j ← ℓ if no such index exists)
        B_j ← ONLINECORESET({p} ∪ B_0 ∪ ... ∪ B_{j−1}, ε/(2 lg m), δ/(4m²))
        clear the blocks B_i, i.e., B_i ← ∅, for all i ∈ {0, 1, ..., j−1}
        the coreset of the current window (i.e., the (t−m+1)-th, (t−m+2)-th, ..., t-th elements of the input stream) is {p ∈ B_0 ∪ B_1 ∪ ... ∪ B_ℓ : time(p) ∈ [−t, −t+m)}

Theorem 4.1. There exists an algorithm that takes as input $P\subseteq[\Delta]^d$ presented as a point stream, $0<\varepsilon<1$, $z\ge1$, an integer $k\ge1$ and a window size $m\ge1$, and computes a weighted subset $S$ of the sliding window after each point insertion, such that there exists an $\varepsilon$-coreset $S'$ of the sliding window for $(k,z)$-CLUSTERING with assignment constraints with $S'=S$ and, for every $p\in S$, $w_S(p)\in(1\pm\varepsilon)\,w_{S'}(p)$. The algorithm uses space $\mathrm{poly}(\varepsilon^{-1}kd\log(m\Delta))$ and succeeds with high probability.

Proof. Consider Algorithm 3. Observe that for each $i\in\{0,1,\ldots,\ell\}$, $B_i$ is an assignment-preserving online $((1+\frac{\varepsilon}{2\lg m})^i-1)$-coreset for a sequence of $2^i$ consecutive points, and the sequences underlying the $B_i$ ($i\in\{0,1,\ldots,\ell\}$) are disjoint. Every time we set $B_i\leftarrow$ ONLINECORESET$(\{p\}\cup B_0\cup\cdots\cup B_{i-1},\frac{\varepsilon}{2\lg m},\frac{\delta}{4m^2})$, the sets $B_0,B_1,\ldots,B_{i-1}$ must be non-empty. By induction, $|\{p\}\cup B_0\cup\cdots\cup B_{i-1}|=1+\sum_{j=0}^{i-1}2^j=2^i$ and the distortion of $B_i$ is at most $(1+\frac{\varepsilon}{2\lg m})^i$. Suppose the input stream is $\{p_i\}_{i\ge1}$. For every time $t$, after the newly arrived point $p_t$ has been inserted, $B_0\cup B_1\cup\cdots\cup B_\ell$ is an assignment-preserving online $\varepsilon$-coreset of $\{p_t,p_{t-1},\ldots,p_{t-2^\ell+1}\}$ (sorted by time steps). By the definition of online coresets, $\{p\in B_0\cup B_1\cup\cdots\cup B_\ell:\mathrm{time}(p)\in[-t,-t+m)\}$ is an assignment-preserving $\varepsilon$-coreset of $\{p_t,p_{t-1},\ldots,p_{t-m+1}\}$, which is the current window. Since at most $2^\ell$ merge-and-reduce operations are involved in $B_0\cup\cdots\cup B_{j-1}$, the failure probability at each time $t$ is at most $\delta$. This finishes the proof of Theorem 4.1.
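A minimal Python sketch of the merge-and-reduce driver of Algorithm 3 follows; online_coreset here is an assumed black box implementing Lemma 3.2 on (point, weight, time) triples, and capping j at ℓ when all buckets are full is a simplification. We assume m ≥ 2.

```python
import math

def merge_and_reduce(stream, m, online_coreset, eps):
    """Algorithm 3 (sketch): buckets B_0..B_ell hold online coresets built
    on the reverse stream (time(p) = -t); the window coreset is read off
    via the prefix property of the online coreset."""
    ell = math.ceil(math.log2(m))
    B = [None] * (ell + 1)
    for t, p in enumerate(stream, start=1):
        items = [(p, 1.0, -t)]                  # attach reversed time step
        j = next((i for i, b in enumerate(B) if b is None), ell)
        for i in range(j):                      # merge buckets B_0..B_{j-1}
            items += B[i] or []
            B[i] = None
        B[j] = online_coreset(items, eps / (2 * math.log2(m)))
        # points with time in [-t, -t + m) form the coreset of the window
        yield [(q, w) for b in B if b for (q, w, tm) in b if -t <= tm < -t + m]
```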
Proof of Theorem 1.2. Our algorithm employs the following standard steps to obtain fair clustering from assignment-preserving coresets. Huang et al. (2019); Braverman et al. (2022) show that if the dataset $P$ is partitioned according to the combination of groups that each point belongs to, and one takes the union of assignment-preserving $\varepsilon$-coresets on each part, denoted by $S$, then the value of the optimal $(\alpha,\beta)$-fair clustering on $S$ is a $(1+\varepsilon)$-approximation to that of $P$.

Our algorithm. Our algorithm follows similar steps. We apply Theorem 4.1 to each part of the partition of $P$, then take the union to obtain a set $S'$. This $S'$ has size at most $L$ times the coreset size in Theorem 4.1, and this dominates the space complexity. Finally, in the last step we find a $((1-\varepsilon)\alpha,(1+\varepsilon)\beta)$-fair clustering on $S'$ (which has a relaxed fairness constraint). To see why this clustering has value at most $(1+\varepsilon)$ times the optimal $(\alpha,\beta)$-fair clustering on $P$, consider the optimal $(\alpha,\beta)$-fair solution on $S$ (the set defined in the previous paragraph); this solution is $(1+\varepsilon)$-approximate to that on $P$. Now, by the guarantee of Theorem 4.1, this same solution (with the same objective value) is a feasible $((1-\varepsilon)\alpha,(1+\varepsilon)\beta)$-fair clustering with respect to $S'$. Since what our algorithm finds is no worse than this solution on $S'$, we conclude that the clustering has cost at most $(1+\varepsilon)$ times the optimal solution. This finishes the proof.
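The partition step in the proof above is simple to realize in code; the sketch below uses a frozenset of group IDs as the combination key, and coreset_of stands in for the sliding-window coreset of Theorem 4.1 as an assumed black box.

```python
from collections import defaultdict

def fair_coreset(points, point_groups, coreset_of):
    """Partition the dataset by the exact combination of groups each point
    belongs to, build an assignment-preserving coreset per part, and
    return the union (the construction in the proof of Theorem 1.2)."""
    parts = defaultdict(list)
    for p, gs in zip(points, point_groups):
        parts[frozenset(gs)].append(p)        # at most L distinct keys
    union = []
    for combo, part in parts.items():
        for q, w in coreset_of(part):
            union.append((q, w, combo))       # coreset points keep their groups
    return union
```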
5 EXPERIMENTS

We implement our coresets and evaluate their performance for solving fair k-MEDIAN in the sliding window model. The evaluation requires a downstream approximation algorithm for fair k-MEDIAN to be run on the coreset. We use Fairtree (Backurs et al., 2019) in our experiments, which achieves competitive performance in practice.

Baselines. We employ three baselines. Two of the baselines are coreset-based, called "Benchmark" and "Uniform"; they construct the coreset in an alternative way to ours but still apply Fairtree as the downstream approximation algorithm. Specifically, Benchmark constructs a considerably larger coreset from the sliding window (see Table 1 for the coreset sizes), and it serves as a benchmark for accuracy; Uniform uses uniform sampling to build a coreset of the same size as ours, as a natural heuristic. The last baseline is a previous sliding-window algorithm designed for clustering without fairness constraints (Borassi et al., 2020), which we call "Borassi".

Datasets. We evaluate the algorithms on five real datasets: Adult (Becker & Kohavi, 1996), Bank (Moro et al., 2014), Diabetes (Kahn), Athlete (Barshan & Altun, 2010), and Census (Meek et al., 2001), which have also been used in various previous studies on (fair) clustering (Bera et al., 2019a; Chierichetti et al., 2017; Schmidt et al., 2018; Huang et al., 2019). For each dataset, we extract numerical features to construct a vector in $\mathbb{R}^d$ for each record, and we select a binary sensitive attribute. The window size for each dataset, along with its other parameters, is listed in Table 1.

Experiment setup. We choose $k=10$ in all experiments. When implementing our coreset, we directly specify a target coreset size instead of using the worst-case bound established in the previous sections. Due to variations in dataset sizes and the corresponding window sizes, we assign a different coreset size to each dataset: we set this target size to 150 for both Adult and Bank, 300 for Diabetes, 750 for Athlete, and 1500 for Census. All experiments are run on a MacBook Air 15.3 with an Apple M3 chip (8 cores, 2.22 GHz), 16GB RAM, and macOS 14.4.1 (23E224).

Experiment results. We run all algorithms on the five datasets and depict the cost curves over the time steps in Figure 1.

Fig. 1: Fair k-MEDIAN cost curves for all datasets.

These curves show that our algorithm obtains a cost comparable to Benchmark and Uniform, while using less space than Benchmark and achieving lower variance than Uniform (as expected, since the variance of uniform sampling can be unbounded even without fairness constraints). Finally, compared with Borassi, our algorithm performs better in accuracy, and as can be seen from Figure 2, we also achieve a better running time on larger datasets.

Fig. 2: Total running time for our algorithm and the Borassi baseline over all sliding windows.

ACKNOWLEDGEMENTS

Research of Shaofeng Jiang was supported in part by a national key R&D program of China No. 2021YFA1000900 and a startup fund from Peking University. Samson Zhou is supported in part by NSF CCF-2335411. The work was conducted in part while Samson Zhou was visiting the Simons Institute for the Theory of Computing as part of the Sublinear Algorithms program. We thank the anonymous reviewers for insightful comments.

REFERENCES

Mohsen Abbasi, Aditya Bhaskara, and Suresh Venkatasubramanian. Fair clustering via equitable group representations. In FAccT '21: 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 504-514, 2021.

Sara Ahmadian and Maryam Negahbani. Improved approximation for fair correlation clustering. In International Conference on Artificial Intelligence and Statistics, pp. 9499-9516, 2023.

Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Clustering without over-representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 267-275, 2019.

Sara Ahmadian, Alessandro Epasto, Marina Knittel, Ravi Kumar, Mohammad Mahdian, Benjamin Moseley, Philip Pham, Sergei Vassilvitskii, and Yuyan Wang. Fair hierarchical clustering. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020a.

Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Fair correlation clustering. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 4195-4205, 2020b.

Georg Anegg, Haris Angelidakis, Adam Kurpisz, and Rico Zenklusen. A technique for obtaining true approximations for k-center with covering constraints. Math. Program., 192(1):3-27, 2022.

Haris Angelidakis, Adam Kurpisz, Leon Sering, and Rico Zenklusen. Fair and fast k-center clustering for data summarization. In International Conference on Machine Learning, ICML, pp. 669-702, 2022.

Apple. https://images.apple.com/privacy/docs/Differential_Privacy_Overview.pdf.
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 1-16, 2002.

Brian Babcock, Mayur Datar, Rajeev Motwani, and Liadan O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 234-243, 2003.

Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In International Conference on Machine Learning, pp. 405-413. PMLR, 2019.

Sayan Bandyapadhyay, Tanmay Inamdar, Shreyas Pai, and Kasturi R. Varadarajan. A constant approximation for colorful k-center. In 27th Annual European Symposium on Algorithms, ESA, pp. 12:1-12:14, 2019.

Billur Barshan and Kerem Altun. Daily and Sports Activities. UCI Machine Learning Repository, 2010. DOI: https://doi.org/10.24432/C5C59F.

Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.

Suman K. Bera, Deeparnab Chakrabarty, and Maryam Negahbani. Fair algorithms for clustering. arXiv preprint arXiv:1901.02393, 2019a.

Suman Kalyan Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. Fair algorithms for clustering. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pp. 4955-4966, 2019b.

Ioana Oriana Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R. Schmidt, and Melanie Schmidt. On the cost of essentially fair clusterings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, volume 145 of LIPIcs, pp. 18:1-18:22. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.

Jeremiah Blocki, Seunghoon Lee, Tamalika Mukherjee, and Samson Zhou. Differentially private L2-heavy hitters in the sliding window model. In The Eleventh International Conference on Learning Representations, ICLR, 2023.

Matteo Böhm, Adriano Fazzone, Stefano Leonardi, Cristina Menghini, and Chris Schwiegelshohn. Algorithms for fair k-clustering with multiple protected attributes. Oper. Res. Lett., 49(5):787-789, 2021.

Michele Borassi, Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Sliding window algorithms for k-clustering problems. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, NeurIPS, 2020.

Vladimir Braverman and Rafail Ostrovsky. Smooth histograms for sliding windows. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), Proceedings, pp. 283-293, 2007.

Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering on sliding windows in polylogarithmic space. In 35th IARCS Annual Conference on Foundation of Software Technology and Theoretical Computer Science, FSTTCS, pp. 350-364, 2015.

Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering problems on sliding windows. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 1374-1390, 2016.

Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, and Samson Zhou.
Nearly optimal distinct elements and heavy hitters on sliding windows. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pp. 7:1-7:22, 2018.

Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P. Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS, pp. 517-528, 2020.

Vladimir Braverman, Viska Wei, and Samson Zhou. Symmetric norm estimation and regression on sliding windows. In Computing and Combinatorics - 27th International Conference, COCOON, Proceedings, pp. 528-539, 2021.

Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, and Xuan Wu. The power of uniform sampling for coresets. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS, pp. 462-473, 2022.

Jiecao Chen, Huy L. Nguyen, and Qin Zhang. Submodular maximization over sliding windows. CoRR, abs/1611.00129, 2016.

Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM J. Comput., 39(3):923-947, 2009.

Wenjing Chen, Shuo Xing, Samson Zhou, and Victoria G. Crawford. Fair submodular cover. CoRR, abs/2407.04804, 2024.

Anshuman Chhabra, Karina Masalkovaite, and Prasant Mohapatra. An overview of fairness in clustering. IEEE Access, 9:130698-130720, 2021.

Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pp. 5029-5037, 2017.

Fan R. K. Chung and Lincoln Lu. Concentration inequalities and martingale inequalities: A survey. Internet Math., 3(1):79-127, 2006. doi: 10.1080/15427951.2006.10129115. URL https://doi.org/10.1080/15427951.2006.10129115.

Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In ICALP, volume 132 of LIPIcs, pp. 41:1-41:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. Full version: https://arxiv.org/abs/2208.14129.

Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC, pp. 169-182. ACM, 2021.

Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In STOC '22: 54th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1038-1051, 2022.

Vincent Cohen-Addad, David P. Woodruff, and Samson Zhou. Streaming Euclidean k-median and k-means with o(log n) space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pp. 883-908, 2023.

Mayur Datar and Rajeev Motwani. The sliding-window computation model and results, 2016.

Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794-1813, 2002.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science, pp. 214-226, 2012.

Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Submodular optimization over sliding windows. In Proceedings of the 26th International Conference on World Wide Web, WWW, pp.
421-430, 2017.

Alessandro Epasto, Mohammad Mahdian, Vahab S. Mirrokni, and Peilin Zhong. Improved sliding window algorithms for clustering and coverage via bucketing-based sketches. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 3005-3042, 2022.

Seyed A. Esmaeili, Brian Brubach, Leonidas Tsepenekas, and John Dickerson. Probabilistic fair clustering. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.

Seyed A. Esmaeili, Brian Brubach, Aravind Srinivasan, and John Dickerson. Fair clustering under a bounded cost. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS, pp. 14345-14357, 2021.

Facebook. https://www.facebook.com/policy.php.

Mehrdad Ghadiri, Samira Samadi, and Santosh S. Vempala. Socially fair k-means clustering. In FAccT '21: 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 438-448, 2021.

Google. https://policies.google.com/technologies/retention.

Shivam Gupta, Ganesh Ghalme, Narayanan C. Krishnan, and Shweta Jain. Efficient algorithms for fair clustering with a new notion of fairness. Data Min. Knowl. Discov., 37(5):1959-1997, 2023.

Sèdjro Salomon Hotegni, Sepideh Mahabadi, and Ali Vakilian. Approximation algorithms for fair range clustering. In International Conference on Machine Learning, ICML, pp. 13270-13284, 2023.

Lingxiao Huang, Shaofeng H.-C. Jiang, and Nisheeth K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pp. 7587-7598, 2019.

Lingxiao Huang, Shaofeng H.-C. Jiang, and Jianing Lou. The power of uniform sampling for k-median. In International Conference on Machine Learning, ICML, pp. 13933-13956, 2023.

Lingxiao Huang, Jian Li, and Xuan Wu. On optimal coreset construction for Euclidean (k, z)-clustering. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC, pp. 1594-1604, 2024.

Rajesh Jayaram, David P. Woodruff, and Samson Zhou. Truly perfect samplers for data streams and sliding windows. In PODS '22: International Conference on Management of Data, pp. 29-40, 2022.

Xinrui Jia, Kshiteej Sheth, and Ola Svensson. Fair colorful k-center clustering. Math. Program., 192(1):339-360, 2022.

Matthew Jones, Huy L. Nguyen, and Thy Dinh Nguyen. Fair k-centers via maximum matching. In Proceedings of the 37th International Conference on Machine Learning, ICML, pp. 4940-4949, 2020.

Christopher Jung, Sampath Kannan, and Neil Lutz. Service in your neighborhood: Fairness in center location. In Aaron Roth (ed.), 1st Symposium on Foundations of Responsible Computing, FORC 2020, pp. 5:1-5:15, 2020.

Michael Kahn. Diabetes. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5T59G.

Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference, ITCS, pp. 43:1-43:23, 2017.

Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. Fair k-center clustering for data summarization. In Proceedings of the 36th International Conference on Machine Learning, ICML, pp. 3448-3457, 2019a.

Matthäus Kleindessner, Samira Samadi, Pranjal Awasthi, and Jamie Morgenstern. Guarantees for spectral clustering with fairness constraints. In Proceedings of the 36th International Conference on Machine Learning, ICML, pp.
3458-3467, 2019b.

Ilan Kremer, Noam Nisan, and Dana Ron. On randomized one-round communication complexity. Comput. Complex., 8(1):21-49, 1999.

Lap-Kei Lee and H. F. Ting. Maintaining significant stream statistics over sliding windows. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 724-732, 2006a.

Lap-Kei Lee and H. F. Ting. A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 290-297, 2006b.

Sepideh Mahabadi and Ali Vakilian. Individual fairness for k-clustering. In Proceedings of the 37th International Conference on Machine Learning, ICML, pp. 6586-6596, 2020.

Yury Makarychev and Ali Vakilian. Approximation algorithms for socially fair clustering. In Conference on Learning Theory, COLT, pp. 3246-3264, 2021.

Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. Proc. VLDB Endow., 5(12):1699, 2012.

Chris Meek, Bo Thiesson, and David Heckerman. US Census Data (1990). UCI Machine Learning Repository, 2001. DOI: https://doi.org/10.24432/C5VP42.

Adam Meyerson. Online facility location. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp. 426-431. IEEE, 2001.

Evi Micha and Nisarg Shah. Proportionally fair clustering revisited. In 47th International Colloquium on Automata, Languages, and Programming, ICALP, pp. 85:1-85:16, 2020.

S. Moro, P. Rita, and P. Cortez. Bank Marketing. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C5K306.

Maryam Negahbani and Deeparnab Chakrabarty. Better algorithms for individually fair k-clustering. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS, pp. 13340-13351, 2021.

Odysseas Papapetrou, Minos N. Garofalakis, and Antonios Deligiannakis. Sketching distributed sliding-window data streams. VLDB J., 24(3):345-368, 2015.

Clemens Rösner and Melanie Schmidt. Privacy preserving clustering with constraints. In 45th International Colloquium on Automata, Languages, and Programming, ICALP, pp. 96:1-96:14, 2018.

Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means clustering. arXiv preprint arXiv:1812.10854, 2018.

Roy Schwartz and Roded Zats. Fair correlation clustering in general graphs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pp. 37:1-37:19, 2022.

Zhao Song, Ali Vakilian, David P. Woodruff, and Samson Zhou. On socially fair low-rank approximation and column subset selection. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS, 2024.

Ali Vakilian and Mustafa Yalçıner. Improved approximation algorithms for individually fair clustering. In International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 8758-8779, 2022.

Zhewei Wei, Xuancheng Liu, Feifei Li, Shuo Shang, Xiaoyong Du, and Ji-Rong Wen. Matrix sketching over sliding windows. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference, pp. 1465-1480. ACM, 2016.

David P. Woodruff and Samson Zhou. Tight bounds for adversarially robust streams and sliding windows via difference estimators.
In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS, pp. 1183-1196, 2021.

David P. Woodruff, Peilin Zhong, and Samson Zhou. Near-optimal k-clustering in the sliding window model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems, NeurIPS, 2023.

A LOWER BOUNDS

In this section, we show that fair clustering cannot be performed in the sliding window model in sublinear space, up to any finite multiplicative error. We first define the one-way two-party augmented indexing communication problem. Formally, in the $\mathsf{AugInd}_n$ problem, Alice receives a vector $v\in\{0,1\}^n$ and Bob receives an index $i\in[n]$, as well as the values $v_1,\ldots,v_{i-1}$. The goal is for Alice to send some message $\Pi$ to Bob, so that Bob may compute $v_i$ with probability at least $2/3$. We recall the following communication complexity lower bound for the augmented indexing problem.

Theorem A.1 (Kremer et al., 1999). Any protocol that solves $\mathsf{AugInd}_n$ with probability at least $2/3$ requires $\Omega(n)$ bits of communication.

We now show that the augmented indexing communication problem is hard even when it is promised that half of the entries of Alice's input vector are nonzero. Formally, in the $\mathsf{AugInd}^n_{2n}$ problem, Alice receives a vector $v\in\{0,1\}^{2n}$ with exactly $n$ nonzero entries, and Bob receives an index $i\in[2n]$, as well as the values $v_1,\ldots,v_{i-1}$. The goal is again for Alice to send some message $\Pi$ to Bob, so that Bob may compute $v_i$ with probability at least $2/3$.

Corollary A.2. Any protocol that solves $\mathsf{AugInd}^n_{2n}$ with probability at least $2/3$ requires $\Omega(n)$ bits of communication.

Proof. Suppose there exists a protocol $\Pi$ that solves $\mathsf{AugInd}^n_{2n}$ with probability at least $2/3$. Given an instance of $\mathsf{AugInd}_n$, let $s$ be the number of nonzero entries in Alice's input vector $v\in\{0,1\}^n$. Then Alice can create a vector $u\in\{0,1\}^{2n}$ with exactly $n$ nonzeros by appending to $v$ a total of $n-s$ coordinates that are $1$, followed by $s$ coordinates that are $0$. Since the number of nonzero entries of $v$ is $s$ and Alice appends $n-s$ nonzero entries afterwards, it follows that $u$ has exactly $n$ nonzero coordinates. Alice and Bob can then run the protocol $\Pi$ for $\mathsf{AugInd}^n_{2n}$ on $u$ to find $u_i$, where $i\in[n]$ is the input to $\mathsf{AugInd}_n$. By construction, we have $u_i=v_i$, and thus the protocol solves $\mathsf{AugInd}_n$ with probability at least $2/3$, due to the correctness of $\Pi$. Hence, $\Pi$ requires $\Omega(n)$ bits of communication by Theorem A.1.

We now prove our lower bound for fair clustering in the sliding window model.

Theorem 1.1. Any algorithm for fair $(k,z)$-clustering in the sliding window model that, with probability at least $2/3$, achieves any multiplicative approximation or additive $\Delta/2-1$ error must use $\Omega(n)$ space.

Proof. Let $v\in\{0,1\}^{2n}$ with exactly $n$ nonzero entries and $i\in[2n]$ be an input to $\mathsf{AugInd}^n_{2n}$. Alice creates an instance of fair $(k,z)$-clustering on $\mathbb{R}$, with $k=2$. There are two groups $C_1,C_2$. The fairness constraint is that every cluster should contain at least a $0.5$ fraction of points from $C_2$. For each $j\in[2n]$, Alice adds a point $x_j$ at $\Delta\cdot v_j$; all points of the sequence $x$ (including those points defined later) belong to $C_1$. There are exactly $n$ points at the origin and exactly $n$ points at $\Delta$. Alice then creates a stream $S$ by inserting the points $x_1,\ldots,x_{2n}$ in order. Let $\Pi$ be any algorithm for fair $(k,z)$-clustering in the sliding window model with window size $4n$.
Alice then runs $\Pi$ on $S$ and passes the state of the algorithm to Bob, who knows $v_1,\ldots,v_{i-1}$ and thus also $x_1,\ldots,x_{i-1}$. Bob sets $x_{2n+j}=x_j$ for $j\in[i-1]$ and inserts these points into the stream. Then Bob sets $x_{2n+i}=0$ and inserts that point into the stream. Finally, Bob inserts $n$ points located at the origin that belong to $C_2$, and another $n$ points located at $\Delta$ that belong to $C_2$. By construction, the active points in the window are precisely the $2n$ points from $C_1$, i.e., $x_{i+1},\ldots,x_{2n+i}$, and the $2n$ points from $C_2$. Since all points are located in $\{0,\Delta\}$, the optimal center set is $\{0,\Delta\}$. The number of points from $C_1$ located at the origin equals $|\{j\in[i+1,2n+i]:x_j=0\}|=n+\mathbb{I}(x_i=1)$. If $x_i=0$, then both locations $0$ and $\Delta$ contain $n$ points from $C_1$ and $n$ points from $C_2$; the optimal solution assigns every point $p$ to the center at the same location as $p$, incurring zero cost. If $x_i=1$, then location $0$ contains $n+1$ points from $C_1$ and $n$ points from $C_2$. In the optimal solution, either one of the points from $C_1$ located at $0$ is assigned to the center $\Delta$, or one of the points from $C_2$ located at $\Delta$ is assigned to the center $0$. In either case, the cost of the optimal solution is at least $\Delta$. Thus, if $\Pi$ achieves any multiplicative approximation or additive error $\Delta/2-1$ with probability at least $2/3$, then $\Pi$ can distinguish between these cases, thereby solving $\mathsf{AugInd}^n_{2n}$. Hence, by Corollary A.2, $\Pi$ uses $\Omega(n)$ bits of space.

We can slightly modify the proof of Theorem 1.1 to obtain a lower bound for an arbitrary additive violation $\delta$.

Theorem A.3. Any algorithm that, with probability at least $2/3$, achieves any multiplicative approximation or additive $\Delta/2-1$ error, even using a $\delta$ additive violation of the fairness constraints, must use $\Omega(n/\delta)$ space.

Proof. Using the construction in the proof of Theorem 1.1, we can obtain a point sequence $A$ of length $\frac{n}{\delta+1}$, such that any algorithm for fair $(k,z)$-clustering in the sliding window model that achieves any multiplicative approximation or additive $\Delta/2-1$ error on instance $A$ with probability at least $2/3$ must use $\Omega(n/\delta)$ space. Next we construct a sequence $B$ by repeating each point of $A$ for $\delta+1$ times, i.e., $B=(A_1,A_1,\ldots,A_1,A_2,\ldots,A_2,A_3,\ldots,A_{n/(\delta+1)})$. In the proof of Theorem 1.1, we reduce $\mathsf{AugInd}^n_{2n}$ to fair clustering in the sliding window model by running the fair clustering algorithm on $A$ to distinguish whether the window $(x_{i+1},x_{i+2},\ldots,x_{2n+i-1},0)$ has $n$ zeros or $n+1$ zeros ($x$ is defined in that proof). Now suppose we have a sliding window fair clustering algorithm $\mathcal{A}$ that outputs a solution with at most $\delta$ additive violation and at most $\Delta/2-1$ additive error (relative to the optimal solution with no violation). Similarly to the proof of Theorem 1.1, we can also distinguish whether the window $(x_{i+1},x_{i+2},\ldots,x_{2n+i-1},0)$ has $n$ zeros or $n+1$ zeros by running $\mathcal{A}$ on $B$, which completes the reduction from $\mathsf{AugInd}^{n/(\delta+1)}_{2n/(\delta+1)}$ to fair clustering in the sliding window model with additive violation. This implies that $\mathcal{A}$ must use $\Omega(n/\delta)$ space.
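To make the reduction tangible, the following toy simulation builds the window contents from the proof of Theorem 1.1 and checks that the two locations are balanced exactly when the guess $x_i=0$ is correct; no clustering algorithm is invoked, and the example bits are hypothetical.

```python
def window_counts(x, i):
    """Window contents after Bob's insertions, for Alice's bits x (|x| = 2n,
    exactly n ones) and Bob's 1-based index i. Returns the counts of C1
    points at location 0 and at location Delta; Bob additionally inserts
    n C2 points at each location, so balance (n, n) means zero cost."""
    n2 = len(x)                        # n2 = 2n
    seq = x[i:] + x[:i - 1] + [0]      # x_{i+1..2n}, x_1..x_{i-1}, guess 0
    c1_zero = sum(1 for b in seq if b == 0)
    return c1_zero, n2 - c1_zero       # (n, n) iff x_i == 0

x = [1, 0, 0, 1]                       # n = 2
print(window_counts(x, i=2))           # x_2 = 0 -> (2, 2): zero cost
print(window_counts(x, i=1))           # x_1 = 1 -> (3, 1): cost at least Delta
```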
B MEYERSON SKETCH

We provide a brief overview of the Meyerson sketch (Meyerson, 2001) and its key properties relevant to our work. The Meyerson sketch achieves a bicriteria $(C_1, C_2)$-approximation for $(k, z)$-clustering on a data stream consisting of points $x_1, \ldots, x_n \in [\Delta]^d$, where $C_1 = 2^{z+7}$ and $C_2 = O(2^{2z} \log n \log \Delta)$; i.e., it provides a $C_1$-approximation while using at most $C_2 \cdot k$ centers. An important feature of the Meyerson sketch that we shall utilize is that upon the arrival of each point $x_i$, the algorithm permanently assigns $x_i$ to one of the $C_2 \cdot k$ centers, even if a center closer to $x_i$ is subsequently opened. Moreover, the clustering cost at the end of the stream is determined by the center to which $x_i$ was assigned at time $i$.

To simplify the discussion, we describe the Meyerson sketch for the case $z = 1$, noting that the intuition extends naturally to other values of $z$. The Meyerson sketch operates using a guess-and-double strategy, beginning with an estimate of the optimal clustering cost. Based on this estimated cost, it randomly converts each point $x_i$ into a center, with probability proportional to the distance of $x_i$ from the existing centers at time $i$. If the algorithm opens too many centers, it concludes that the estimate of the optimal clustering cost was too low and doubles the guess.
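The following minimal sketch (ours, for intuition only; the exact probability rule, the center budget `max_centers`, and the restriction to one-dimensional points are simplifying assumptions) illustrates a single guess of the Meyerson sketch for $z = 1$.

```python
import math
import random

def meyerson_guess(stream, k, opt_guess, max_centers):
    """One guess of a simplified Meyerson sketch for k-median (z = 1).
    Each arriving point opens a new center with probability roughly
    proportional to its distance from the current centers; otherwise it
    is permanently assigned to its nearest center at arrival time."""
    n = len(stream)
    centers, cost = [], 0.0
    for x in stream:
        d = min((abs(x - c) for c in centers), default=float("inf"))
        p = min(1.0, d * k * (1 + math.log(n + 1)) / opt_guess)
        if random.random() < p:
            centers.append(x)       # x becomes a center
        else:
            cost += d               # permanent assignment, never revised
        if len(centers) > max_centers:
            return None             # guess too low; caller doubles it
    return centers, cost

# A guess-and-double driver would try opt_guess = 1, 2, 4, ... until success.
```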
C PROOF OF LEMMA 3.3

Lemma 3.3. Consider $q \in [\Delta]^d$, $r > 0$, $\varepsilon, \delta \in (0,1)$, $k, z \ge 1$, and a weighted set $P = (p_1, p_2, \ldots, p_n) \subseteq [\Delta]^d \cap \mathrm{ring}(q, r/2, r)$, listed in increasing order of time steps. Let $N := w(P)$ and let $S$ be the output of RINGCORESET$(P, \varepsilon, \delta)$ (in Algorithm 2). Then $\mathbb{E}[|S|] = \mathrm{poly}(2^z k \varepsilon^{-1} \log(N \Delta \varepsilon^{-1} \delta^{-1}))$ and, with probability at least $1 - \delta$, there exists a weighted set $S'$ such that:
• $S' = S$ as point sets;
• for every $p \in S$, $w_{S'}(p) \ge (1 - \varepsilon)\, w_S(p)$;
• for every center set $C \subseteq \mathbb{R}^d$ with $|C| \le k$ and every assignment constraint $\Gamma$ with respect to $P, C$,
$$|\mathrm{cost}(S', C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le \frac{\varepsilon}{4}\left(\mathrm{cost}(P, C, \Gamma) + \mathrm{cost}(S', C, \Gamma) + N r^z\right).$$

To prove Lemma 3.3, we start by showing Lemma C.1 below, which gives the online coreset error guarantee for a single center set $C$ and assignment constraint $\Gamma$ only, and only when all centers are close to $q$. The condition of Lemma C.1 is then supplied by Lemma C.3, which reduces the infinitely many pairs $(C, \Gamma)$ with unbounded $\mathrm{diam}(C)$ to finitely many pairs with $\mathrm{diam}(C) = O((z/\varepsilon)\,\mathrm{diam}(P))$. We conclude Lemma 3.3 by combining Lemma C.1 and Lemma C.3. Define $R_{\mathrm{far}} := 10zr/\varepsilon$ and $C_{\mathrm{far}} := \{c \in C : d(q, c) > R_{\mathrm{far}}\}$; Lemma C.1 assumes that the center set $C$ lies in $B(q, R_{\mathrm{far}})$.

Lemma C.1. For every $r > 0$, $\varepsilon, \delta \in (0,1)$, $d, k, z \ge 1$, $q \in [\Delta]^d$, weighted set $P \subseteq [\Delta]^d \cap \mathrm{ring}(q, r/2, r)$, center set $C \subseteq B(q, R_{\mathrm{far}})$ with $|C| \le k$, and assignment constraint $\Gamma$ with respect to $P, C$, let $N = w(P)$ and let $S$ be the output of RINGCORESET$(P)$. Then $\mathbb{E}[|S|] = \mathrm{poly}(2^z d k \varepsilon^{-1} \log(N \Delta \varepsilon^{-1} \delta^{-1}))$ and, with probability at least $1 - \delta$ (where $\delta$ follows its definition in Algorithm 2), there exists a weighted set $S'$ with $|\mathrm{cost}(S', C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le \varepsilon N r^z$, such that $S' = S$ and, for every $p \in S$, $w_{S'}(p) \ge (1 - \varepsilon)\, w_S(p)$.

Pick an arbitrary $\frac{\varepsilon r}{12z}$-net of the ball $B(q, R_{\mathrm{far}})$, denoted by $\mathbb{C}_q$. In Lemma C.3, we reduce arbitrary center sets to subsets of $\mathbb{C}_q$. For the assignment constraints, we discretize every value $\Gamma(c)$ into $H := \{i \cdot \frac{N}{t} : i = 0, 1, \ldots, t\}$, where $t := \lceil 10\, \varepsilon^{-1} k\, 2^z (10z/\varepsilon + 1)^z \rceil$. Define
$$\mathcal{F} = \left\{(C, \Gamma) : C \subseteq \mathbb{C}_q,\ |C| \le k,\ \forall c \in C,\ \Gamma(c) \in H,\ \Gamma(C) = w(P)\right\}.$$
We have $|\mathcal{F}| \le O(1)^k \cdot k^{O(k)} \cdot (z/\varepsilon)^{O(kzd)}$.

We use the following generalized triangle inequality in the proof of Lemma C.3.

Lemma C.2. (Braverman et al., 2022) For every $d, z \ge 1$, $t \in (0, 1]$, and $a, b, c \in \mathbb{R}^d$, the following inequality holds:
$$d(a, b)^z \le (1 + t)^{z-1}\, d(a, c)^z + (1 + t^{-1})^{z-1}\, d(b, c)^z.$$

Lemma C.3. For every weighted set $P \subseteq \mathrm{ring}(q, r/2, r)$, $k$-center set $C \subseteq \mathbb{R}^d$, and assignment constraint $\Gamma$ with respect to $P, C$, let $N := w(P)$. There exists a $k$-point center set and assignment constraint $(C', \Gamma') \in \mathcal{F}$ such that
$$\Big|\mathrm{cost}(P, C, \Gamma) - \mathrm{cost}(P, C', \Gamma') - \sum_{c \in C_{\mathrm{far}}} \Gamma(c)\, d(c, q)^z\Big| \le \frac{\varepsilon}{4}\,\mathrm{cost}(P, C, \Gamma) + \frac{\varepsilon}{4}\, N r^z.$$

Proof. First, we move every center $c$ outside $B(q, R_{\mathrm{far}})$ (i.e., $c \in C_{\mathrm{far}}$) to $q$, and move every center $c$ inside $B(q, R_{\mathrm{far}})$ (i.e., $c \in C \setminus C_{\mathrm{far}}$) to its nearest neighbor in $\mathbb{C}_q$. Write $\mathbb{C}_q(c)$ for the destination of center $c$, and let $\widetilde{C} := \mathbb{C}_q(C) = \{\mathbb{C}_q(c) : c \in C\}$. Let $\widetilde{\Gamma}$ be the assignment constraint after the movements, i.e., $\widetilde{\Gamma}(c) = \sum_{c' \in C : \mathbb{C}_q(c') = c} \Gamma(c')$. For every assignment function $\sigma$ w.r.t. $P, C$ with $\sigma \vDash \Gamma$ and every assignment function $\widetilde{\sigma}$ w.r.t. $P, \widetilde{C}$ with $\widetilde{\sigma} \vDash \widetilde{\Gamma}$, we call $(\sigma, \widetilde{\sigma})$ corresponding if $\widetilde{\sigma}(x, q) = \sum_{c' \in C_{\mathrm{far}}} \sigma(x, c')$ and $\widetilde{\sigma}(x, \mathbb{C}_q(c)) = \sigma(x, c)$ for every $x \in P$, $c \in C \setminus C_{\mathrm{far}}$. To bound the difference between $\mathrm{cost}(P, C, \Gamma)$ and $\mathrm{cost}(P, \widetilde{C}, \widetilde{\Gamma})$, we bound the cost difference of every corresponding pair $(\sigma, \widetilde{\sigma})$:
$$\Big|\sum_{x \in P}\sum_{c \in C} \sigma(x, c)\, d(x, c)^z - \sum_{x \in P}\sum_{c \in \widetilde{C}} \widetilde{\sigma}(x, c)\, d(x, c)^z\Big| \le \sum_{x \in P}\sum_{c \in C} \sigma(x, c)\,\big|d(x, c)^z - d(x, \mathbb{C}_q(c))^z\big|.$$

Define $\varepsilon' := (1 + \varepsilon/6)^{1/(z-1)} - 1 = \Theta(\varepsilon/z)$. For every center $c \in C_{\mathrm{far}}$ and point $x \in P$, $c$ is moved to $q$, and $d(x, q) \le r \le \frac{\varepsilon}{10z}\, d(c, q)$. By Lemma C.2,
$$\sigma(x, c)\big(d(x, c)^z - d(x, q)^z\big) \le \sigma(x, c)\Big((1 + \varepsilon')^{z-1} d(c, q)^z + (1 + \varepsilon'^{-1})^{z-1} d(x, q)^z\Big) \le (1 + \varepsilon')^{z-1}\sigma(x, c)\Big(d(c, q)^z + \varepsilon'^{-z+1}\big(\tfrac{\varepsilon}{10z}\big)^z d(c, q)^z\Big) \le (1 + \varepsilon/6)\big(1 + \tfrac{\varepsilon}{10z}\big)\,\sigma(x, c)\, d(c, q)^z \le (1 + \varepsilon/5)\,\sigma(x, c)\, d(c, q)^z.$$
From the opposite direction, $\sigma(x, c)(d(x, c)^z - d(x, q)^z)$ is lower bounded by
$$\sigma(x, c)\big(d(x, c)^z - d(x, q)^z\big) \ge \sigma(x, c)\big((d(c, q) - d(x, q))^z - d(x, q)^z\big) \ge \sigma(x, c)\Big(\big(d(c, q) - \tfrac{\varepsilon}{10z} d(c, q)\big)^z - \big(\tfrac{\varepsilon}{10z} d(c, q)\big)^z\Big) \ge \Big(\big(1 - \tfrac{\varepsilon}{10z}\big)^z - \big(\tfrac{\varepsilon}{10z}\big)^z\Big)\,\sigma(x, c)\, d(c, q)^z \ge (1 - \varepsilon/5)\,\sigma(x, c)\, d(c, q)^z.$$

For every center $c \in C \setminus C_{\mathrm{far}}$ and point $x \in P$, we have $d(c, \mathbb{C}_q(c)) \le \frac{\varepsilon r}{12z}$. When $d(x, \mathbb{C}_q(c)) > d(x, c)$, moving $c$ to $\bar{c} = \mathbb{C}_q(c)$ increases the cost by
$$\sigma(x, c)\big(d(x, \bar{c})^z - d(x, c)^z\big) \le \sigma(x, c)\Big(\big((1 + \varepsilon')^{z-1} - 1\big) d(x, c)^z + (1 + \varepsilon'^{-1})^{z-1} d(c, \bar{c})^z\Big) \le \sigma(x, c)\Big(\tfrac{\varepsilon}{6}\, d(x, c)^z + (1 + \varepsilon/6)\,\varepsilon'^{-z+1}\big(\tfrac{\varepsilon r}{12z}\big)^z\Big) \le \sigma(x, c)\Big(\tfrac{\varepsilon}{6}\, d(x, c)^z + \tfrac{\varepsilon}{10}\, r^z\Big).$$
When $d(x, c) > d(x, \bar{c})$, the same argument yields
$$\sigma(x, c)\big(d(x, c)^z - d(x, \bar{c})^z\big) \le \sigma(x, c)\Big(\big((1 + \varepsilon')^{z-1} - 1\big) d(x, \bar{c})^z + (1 + \varepsilon'^{-1})^{z-1} d(c, \bar{c})^z\Big) \le \sigma(x, c)\Big(\tfrac{\varepsilon}{6}\, d(x, c)^z + \tfrac{\varepsilon}{10}\, r^z\Big).$$

In summary, summing over all $x \in P$ and $c \in C$, and using $\sum_{x \in P} \sigma(x, c) = \Gamma(c)$ together with $d(x, c)^z \ge (1 - \tfrac{\varepsilon}{10z})^z\, d(c, q)^z$ for $c \in C_{\mathrm{far}}$, we obtain
$$\Big|\sum_{x \in P}\sum_{c \in C} \sigma(x, c)\big(d(x, c)^z - d(x, \mathbb{C}_q(c))^z\big) - \sum_{c \in C_{\mathrm{far}}} \Gamma(c)\, d(c, q)^z\Big| \le \frac{\varepsilon}{4}\,\mathrm{cost}_\sigma(P, C, \Gamma) + \frac{\varepsilon}{10}\, N r^z.$$

Since this bound holds for every corresponding pair $(\sigma, \widetilde{\sigma})$: letting $\widetilde{\sigma}$ be the optimal assignment function for $\mathrm{cost}(P, \widetilde{C}, \widetilde{\Gamma})$ and $\sigma$ some corresponding assignment function of $\widetilde{\sigma}$, we have
$$\mathrm{cost}(P, C, \Gamma) \le \mathrm{cost}_\sigma(P, C, \Gamma) \le \mathrm{cost}(P, \widetilde{C}, \widetilde{\Gamma}) + \sum_{c \in C_{\mathrm{far}}} \Gamma(c)\, d(c, q)^z + \frac{\varepsilon}{4}\,\mathrm{cost}_\sigma(P, C, \Gamma) + \frac{\varepsilon}{10}\, N r^z. \quad (1)$$
On the opposite side, letting $\sigma$ be the optimal assignment function for $\mathrm{cost}(P, C, \Gamma)$ and $\widetilde{\sigma}$ some corresponding assignment function of $\sigma$, we have
$$\mathrm{cost}(P, \widetilde{C}, \widetilde{\Gamma}) \le \mathrm{cost}_{\widetilde{\sigma}}(P, \widetilde{C}, \widetilde{\Gamma}) \le \mathrm{cost}(P, C, \Gamma) - \sum_{c \in C_{\mathrm{far}}} \Gamma(c)\, d(c, q)^z + \frac{\varepsilon}{4}\,\mathrm{cost}(P, C, \Gamma) + \frac{\varepsilon}{10}\, N r^z. \quad (2)$$
Combining (1) and (2), we have
$$\Big|\mathrm{cost}(P, C, \Gamma) - \mathrm{cost}(P, \widetilde{C}, \widetilde{\Gamma}) - \sum_{c \in C_{\mathrm{far}}} \Gamma(c)\, d(c, q)^z\Big| \le \frac{\varepsilon}{4}\,\mathrm{cost}(P, C, \Gamma) + \frac{\varepsilon}{10}\, N r^z. \quad (3)$$

Next, we pick $\Gamma' : \widetilde{C} \to H$ such that $\Gamma'(\widetilde{C}) = N$ and $|\widetilde{\Gamma}(c) - \Gamma'(c)| < N/t$ for every $c \in \widetilde{C}$, which always exists. On the other hand, there also exists an assignment function $\sigma' \vDash \Gamma'$ such that $\sum_{x \in P} |\widetilde{\sigma}(x, c) - \sigma'(x, c)| < N/t$ for every $c \in \widetilde{C}$. The cost difference between $\widetilde{\sigma}$ and $\sigma'$ is bounded by
$$|\mathrm{cost}(P, \widetilde{C}, \widetilde{\Gamma}) - \mathrm{cost}(P, \widetilde{C}, \Gamma')| \le \sum_{x \in P}\sum_{c \in \widetilde{C}} |\widetilde{\sigma}(x, c) - \sigma'(x, c)|\, d(x, c)^z \le \sum_{x \in P}\sum_{c \in \widetilde{C}} |\widetilde{\sigma}(x, c) - \sigma'(x, c)|\,\big(\mathrm{diam}(\mathbb{C}_q) + 2r\big)^z \le k \cdot \frac{N}{t}\cdot\big(2r\,(10z/\varepsilon + 1)\big)^z \le \frac{\varepsilon}{10}\, N r^z. \quad (4)$$

Now we can give an error bound between $\mathrm{cost}(P, C, \Gamma)$ and $\mathrm{cost}(P, \widetilde{C}, \Gamma')$ by summing (3) and (4):
$$\Big|\mathrm{cost}(P, C, \Gamma) - \mathrm{cost}(P, \widetilde{C}, \Gamma') - \sum_{c \in C_{\mathrm{far}}} \Gamma(c)\, d(c, q)^z\Big| \le \frac{\varepsilon}{4}\,\mathrm{cost}(P, C, \Gamma) + \frac{\varepsilon}{10}\, N r^z + \frac{\varepsilon}{10}\, N r^z \le \frac{\varepsilon}{4}\,\mathrm{cost}(P, C, \Gamma) + \frac{\varepsilon}{4}\, N r^z.$$
We conclude the proof by picking $C' = \widetilde{C}$ and the above $\Gamma'$.
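The discretization behind $\mathcal{F}$ is mechanical; the sketch below (our illustration; `net_points`, `q`, `R_far`, and `t` stand for the quantities defined above, and the final correction that makes the rounded constraint sum to exactly $N$ is omitted) shows how a pair $(C, \Gamma)$ is snapped to $(C', \Gamma')$.

```python
import math

def dist(a, b):
    return math.dist(a, b)   # Euclidean distance (Python 3.8+)

def snap_centers(C, q, R_far, net_points):
    """Move far centers to q; snap near centers to the nearest net point."""
    def nearest(c):
        return min(net_points, key=lambda u: dist(u, c))
    return [q if dist(c, q) > R_far else nearest(c) for c in C]

def round_constraint(Gamma, N, t):
    """Round each Gamma(c) to the grid H = {0, N/t, ..., N}; the proof
    additionally adjusts the values so that Gamma'(C') = N exactly."""
    step = N / t
    return {c: round(g / step) * step for c, g in Gamma.items()}
```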
Proof of Lemma 3.3. By a union bound, with probability at least $1 - \delta$, the statement of Lemma C.1 holds simultaneously for every $(C, \Gamma) \in \mathcal{F}$. For every pair $(C, \Gamma)$, let $(C', \Gamma') \in \mathcal{F}$ be the center set and assignment constraint given by Lemma C.3 with respect to $(C, \Gamma)$. We have
$$|\mathrm{cost}(S, C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le |\mathrm{cost}(S, C', \Gamma') - \mathrm{cost}(P, C', \Gamma')| + \big|\big(\mathrm{cost}(S, C, \Gamma) - \mathrm{cost}(S, C', \Gamma')\big) - \big(\mathrm{cost}(P, C, \Gamma) - \mathrm{cost}(P, C', \Gamma')\big)\big| \le |\mathrm{cost}(S, C', \Gamma') - \mathrm{cost}(P, C', \Gamma')| + \frac{\varepsilon}{4}\big(\mathrm{cost}(S, C, \Gamma) + \mathrm{cost}(P, C, \Gamma)\big) \le \frac{\varepsilon}{4}\big(N r^z + \mathrm{cost}(S, C, \Gamma) + \mathrm{cost}(P, C, \Gamma)\big),$$
where the second inequality applies Lemma C.3 to both $S$ and $P$ (the far-center terms $\sum_{c \in C_{\mathrm{far}}} \Gamma(c) d(c, q)^z$ cancel), and the third applies Lemma C.1 to $(C', \Gamma') \in \mathcal{F}$. Since the size of $S$ is independent of $C$ and $\Gamma$, the upper bound on $\mathbb{E}[|S|]$ is the same as in Lemma C.1. This finishes the proof of Lemma 3.3.

C.1 PROOF OF LEMMA C.1

Lemma C.1. For every $r > 0$, $\varepsilon, \delta \in (0,1)$, $d, k, z \ge 1$, $q \in [\Delta]^d$, weighted set $P \subseteq [\Delta]^d \cap \mathrm{ring}(q, r/2, r)$, center set $C \subseteq B(q, R_{\mathrm{far}})$ with $|C| \le k$, and assignment constraint $\Gamma$ with respect to $P, C$, let $N = w(P)$ and let $S$ be the output of RINGCORESET$(P)$. Then $\mathbb{E}[|S|] = \mathrm{poly}(2^z d k \varepsilon^{-1} \log(N \Delta \varepsilon^{-1} \delta^{-1}))$ and, with probability at least $1 - \delta$ (where $\delta$ follows its definition in Algorithm 2), there exists a weighted set $S'$ with $|\mathrm{cost}(S', C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le \varepsilon N r^z$, such that $S' = S$ and, for every $p \in S$, $w_{S'}(p) \ge (1 - \varepsilon)\, w_S(p)$.

Some parts of the proof of this lemma are inspired by Cohen-Addad & Li (2019); compared to that work, however, our proof needs to handle additional difficulties, such as weighted input points and non-uniform sampling probabilities.

As $w(S)$ may not equal $w(P)$, $\mathrm{cost}(S, C, \Gamma)$ is not well-defined, since $\Gamma$ is not an assignment constraint with respect to $S$. Before discussing the cost of $(S, C, \Gamma)$, we therefore generalize the definition of the cost function to the case $w(S) \neq \Gamma(C)$, writing $\overline{\mathrm{cost}}$ for the generalized cost. We then prove Lemma C.1 by showing that both $|\mathrm{cost}(P, C, \Gamma) - \overline{\mathrm{cost}}(S, C, \Gamma)|$ and $|\mathrm{cost}(S', C, \Gamma) - \overline{\mathrm{cost}}(S, C, \Gamma)|$ are small.

Definition C.4. For a weighted point set $P \subseteq [\Delta]^d$ and a center set $C \subseteq \mathbb{R}^d$, a partial assignment function with respect to $P$ and $C$ is a function $\sigma : P \times C \to \mathbb{R}_{\ge 0}$. For a partial assignment function $\sigma$ with respect to $P, C$ and an assignment constraint $\Gamma$ with respect to $P', C$, where $P'$ is some arbitrary weighted point set ($P'$ is only used through $\Gamma(C) = w(P')$), we say $\sigma$ is partially consistent with $\Gamma$, denoted by $\sigma \vDash \Gamma$, if the following properties hold:
• For every $p \in P$, $\sum_{c \in C} \sigma(p, c) \le w_P(p)$.
• For every $c \in C$, $\sum_{p \in P} \sigma(p, c) \le \Gamma(c)$.
• If $w(P) \ge \Gamma(C)$, then $\sigma$ satisfies that, for every $c \in C$, $\sum_{p \in P} \sigma(p, c) = \Gamma(c)$.
• If $w(P) \le \Gamma(C)$, then $\sigma$ satisfies that, for every $p \in P$, $\sum_{c \in C} \sigma(p, c) = w_P(p)$.

A partial assignment function can be viewed as a maximum flow of the following flow network: for every $p \in P$, there is an edge from the source node to $p$ with capacity $w_P(p)$; for every $c \in C$, there is an edge from $c$ to the sink node with capacity $\Gamma(c)$; and for every $p \in P$, $c \in C$, there is an edge from $p$ to $c$ with infinite capacity. Now we can define the cost of $S$ with respect to the constraint $\Gamma$:
$$\overline{\mathrm{cost}}(S, C, \Gamma) = \min_{\sigma : S \times C \to \mathbb{R}_{\ge 0},\ \sigma \vDash \Gamma}\ \sum_{p \in S}\sum_{c \in C} \sigma(p, c)\,\mathrm{dist}(p, c)^z.$$

Lemma C.5. With probability at least $1 - \delta/2$, $|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le \varepsilon N r^z / 2$.

Proof. The proof is postponed to Appendix C.2.

Lemma C.6. With probability at least $1 - \delta/2$, there exists a weighted point set $S'$ with $|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(S', C, \Gamma)| \le \varepsilon N r^z / 2$, such that $S' = S$ and, for every $p \in S$, $w_{S'}(p) \ge (1 - \varepsilon)\, w_S(p)$.

Proof. The proof is postponed to Appendix C.3.

Lemma C.7. Let $T$ and $\mathrm{prob}$ follow their definitions in Algorithm 2. Then $\sum_{i=1}^n \mathrm{prob}_i \le T \ln N + 1$.

Proof. Notice that $\mathrm{sum}_1 = w_P(p_1)$ and, for every $i \ge 2$,
$$\frac{w_P(p_i)}{\mathrm{sum}_i} \le \ln\left(\frac{1}{1 - w_P(p_i)/\mathrm{sum}_i}\right) = \ln\left(\frac{\mathrm{sum}_i}{\mathrm{sum}_{i-1}}\right),$$
so that $\sum_{i=2}^n \frac{w_P(p_i)}{\mathrm{sum}_i} \le \ln\left(\frac{\mathrm{sum}_n}{w_P(p_1)}\right) \le \ln N$. So we have
$$\sum_{i=1}^n \mathrm{prob}_i = 1 + \sum_{i=2}^n \min\left(\frac{T\, w_P(p_i)}{\mathrm{sum}_i},\ 1\right) \le T \ln N + 1.$$
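Since a partial assignment is exactly a feasible flow in the bipartite network above, $\overline{\mathrm{cost}}(S, C, \Gamma)$ can be computed as a small transportation LP. The sketch below is our own illustration (the function name and signature are ours, and we use a generic LP solver rather than a dedicated min-cost flow algorithm).

```python
import numpy as np
from scipy.optimize import linprog

def partial_cost(points, w, centers, Gamma, z=2):
    """Compute the partial-assignment cost of Definition C.4:
    minimize sum sigma(p,c) * dist(p,c)^z subject to
    row sums <= w(p), column sums <= Gamma(c),
    total shipped = min(w(S), Gamma(C))."""
    P, C = len(points), len(centers)
    d = np.array([[np.linalg.norm(p - c) ** z for c in centers]
                  for p in points]).ravel()
    A_row = np.zeros((P, P * C))            # each point ships <= w(p)
    for i in range(P):
        A_row[i, i * C:(i + 1) * C] = 1
    A_col = np.zeros((C, P * C))            # each center receives <= Gamma(c)
    for j in range(C):
        A_col[j, j::C] = 1
    A_eq = np.ones((1, P * C))              # force a maximum flow
    b_eq = [min(w.sum(), Gamma.sum())]
    res = linprog(d, A_ub=np.vstack([A_row, A_col]),
                  b_ub=np.concatenate([w, Gamma]), A_eq=A_eq, b_eq=b_eq)
    return res.fun

pts = np.array([[0.0, 0.0], [1.0, 0.0]])
ctrs = np.array([[0.0, 0.0], [1.0, 0.0]])
print(partial_cost(pts, np.array([2.0, 1.0]), ctrs, np.array([1.0, 1.0])))
```

Note that the equality cases in Definition C.4 follow automatically here: forcing the total shipped weight to $\min(w(S), \Gamma(C))$ saturates whichever side of the network is smaller.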
Proof of Lemma C.1. By applying a union bound to the results of Lemma C.5 and Lemma C.6, we can show that, with probability at least $1 - \delta$, there exists a weighted point set $S'$ with $|\mathrm{cost}(P, C, \Gamma) - \mathrm{cost}(S', C, \Gamma)| \le \varepsilon N r^z$, such that $S' = S$ and, for every $p \in S$, $w_{S'}(p) \ge (1 - \varepsilon)\, w_S(p)$. By Lemma C.7, $\mathbb{E}[|S|] = \sum_{i=1}^n \mathrm{prob}_i \le T \ln N + 1$.

C.2 PROOF OF LEMMA C.5

To prove $|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le \varepsilon N r^z / 2$, we show that, given an optimal assignment function $\sigma$ for $\mathrm{cost}(P, C, \Gamma)$, we can always construct a partial assignment function with cost at most $\mathrm{cost}(P, C, \Gamma) + \varepsilon N r^z / 2$, and vice versa. In the following discussion, let $n, T, p_1, \ldots, p_n, \mathrm{sum}_1, \ldots, \mathrm{sum}_n, \mathrm{prob}_1, \ldots, \mathrm{prob}_n$ follow their definitions in Algorithm 2, and let $S$ be the output of RINGCORESET$(P, \varepsilon, \delta)$. For every $1 \le i \le n$, let the indicator variable $X_i$ denote $\mathbb{I}(p_i \in S)$. We first prove a concentration inequality used in the following proofs.

Lemma C.8. For every $0 < \varepsilon, \delta < 1$, $W > 0$, and $\alpha_1, \alpha_2, \ldots, \alpha_n$ such that $0 \le \alpha_i \le W\, w_P(p_i)$ for all $1 \le i \le n$, as long as $T = \Omega(\varepsilon^{-2} \log(\delta^{-1}\varepsilon^{-1}) (\log\log(NW))^2)$, we have, with probability at least $1 - \delta$,
$$\Big|\sum_{i=1}^n \alpha_i\, \mathrm{prob}_i^{-1} X_i - \sum_{i=1}^n \alpha_i\Big| \le \varepsilon N W.$$

Proof. Here we assume all points $p_i$ satisfy $\mathrm{prob}_i < 1$; otherwise we always have $\alpha_i \mathrm{prob}_i^{-1} X_i = \alpha_i$, and we can eliminate these points from $P$ without changing $\sum_{i=1}^n \alpha_i \mathrm{prob}_i^{-1} X_i - \sum_{i=1}^n \alpha_i$. Hence we always have $\mathrm{prob}_i = T\, w_P(p_i)/\mathrm{sum}_i$. As a result, for every $1 \le i \le n$,
$$\alpha_i\, \mathrm{prob}_i^{-1} \le W\, w_P(p_i) \cdot \frac{\mathrm{sum}_i}{T\, w_P(p_i)} \le \frac{NW}{T}.$$
Let $\gamma = 1 + \frac{\varepsilon}{8 \ln N}$. We divide $P$ into $L = O(\varepsilon^{-1} \log N \log(\varepsilon^{-1} n N W / T))$ groups $G_{j_{\min}}, G_{j_{\min}+1}, \ldots, G_{j_{\max}} \subseteq [n]$, where $j_{\min} = \lfloor \log_\gamma(\varepsilon/n) \rfloor$ and $j_{\max} = \lceil \log_\gamma(NW/T) \rceil$. Group $G_j$ contains the indices of all points $p_i$ such that $\alpha_i \mathrm{prob}_i^{-1} \in (\gamma^{j-1}, \gamma^j]$; in particular, $G_{j_{\min}}$ contains the indices of all points $p_i$ with $\alpha_i \mathrm{prob}_i^{-1} \le \gamma^{j_{\min}}$. For every $j_{\min} < j \le j_{\max}$, we have
$$\sum_{i \in G_j} \alpha_i\, \mathrm{prob}_i^{-1} X_i \le \gamma^{j-1}\sum_{i \in G_j} X_i + (\gamma - 1)\gamma^{j-1}\sum_{i \in G_j}(X_i + \mathrm{prob}_i). \quad (5)$$
For $j = j_{\min}$, we have $\sum_{i \in G_{j_{\min}}} \alpha_i\, \mathrm{prob}_i^{-1} \le n\gamma^{j_{\min}} \le \varepsilon$.

For the second term of (5), it is sufficient to bound $\sum_{i \in G_j} \mathrm{prob}_i$, since by a Chernoff bound we have
$$\Pr\Big[\sum_{i=1}^n X_i > \max\Big(2\sum_{i=1}^n \mathrm{prob}_i,\ 4\ln(2/\delta)\Big)\Big] \le \delta/2.$$
By Lemma C.7, with probability at least $1 - \delta/2$, we have
$$\sum_{j=j_{\min}+1}^{j_{\max}} (\gamma - 1)\gamma^{j-1}\sum_{i \in G_j}(\mathrm{prob}_i + X_i) \le (\gamma - 1)\gamma^{j_{\max}-1}\sum_{i=1}^n(\mathrm{prob}_i + X_i) \le (\gamma - 1)\gamma^{j_{\max}-1}\big(2T\ln N + 4\ln(1/\delta) + 2\big) \le \frac{\varepsilon N W}{4}. \quad (6)$$

Next, consider the first term of (5). Let $\mu_j = \sum_{i \in G_j} \mathrm{prob}_i$ and $A_j = \sum_{i \in G_j} \alpha_i$, and notice that
$$\mu_j = \sum_{i \in G_j} \mathrm{prob}_i \le \sum_{i \in G_j} \alpha_i\, \gamma^{-(j-1)} = \frac{A_j}{\gamma^{j-1}}.$$
By a Chernoff bound, with $t_j = \max(\varepsilon A_j,\ 3\varepsilon^{-1}\gamma^{j+1}\ln(2L/\delta))$, we have $\Pr[\gamma^{j-1}\sum_{i \in G_j} X_i > A_j + t_j] \le \delta/(2L)$. By a union bound, with probability at least $1 - \delta/2$, we have
$$\sum_{j=j_{\min}}^{j_{\max}} \gamma^{j-1}\sum_{i \in G_j} X_i \le \sum_{j=j_{\min}}^{j_{\max}}\Big(A_j + 3\varepsilon^{-1}\gamma^{j+1}\ln(2L/\delta)\Big) \le \sum_{i=1}^n \alpha_i + 3\varepsilon^{-2}\gamma^{j_{\max}+1}\ln(2L/\delta) \le \sum_{i=1}^n \alpha_i + 3\varepsilon^{-2}\gamma^2\ln(2L/\delta)\,\max_{i \in [n]} \alpha_i \mathrm{prob}_i^{-1} \le \sum_{i=1}^n \alpha_i + \frac{\varepsilon N W}{4}, \quad (7)$$
where the last step uses $\max_i \alpha_i \mathrm{prob}_i^{-1} \le NW/T$ and the choice of $T$. Combining (6) and (7) by a union bound (the matching lower-tail bound follows symmetrically), we have, with probability at least $1 - \delta$,
$$\Big|\sum_{i=1}^n \alpha_i\, \mathrm{prob}_i^{-1} X_i - \sum_{i=1}^n \alpha_i\Big| \le \varepsilon N W.$$

Define $R := (20z/\varepsilon + 2)\, r \ge \mathrm{diam}(C) + \mathrm{diam}(P)$, so that $\mathrm{dist}(p, c) \le R$ for every pair $c \in C$, $p \in P$. Let $\varepsilon_0 := (\varepsilon/22z)^{z+1} \le (20z/\varepsilon + 2)^{-z}\, \varepsilon$. In the following proofs, we show that the errors are bounded by $\varepsilon_0 N R^z$, which implies $\mathrm{err} \le \varepsilon_0 N R^z \le \varepsilon N r^z$.
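Lemma C.8 is a concentration bound for the inverse-probability-weighted estimator underlying the coreset: $\sum_i \alpha_i \mathrm{prob}_i^{-1} X_i$ is unbiased for $\sum_i \alpha_i$ because $\mathbb{E}[X_i] = \mathrm{prob}_i$. The toy simulation below (ours; the numbers are arbitrary) illustrates that unbiasedness.

```python
import numpy as np

rng = np.random.default_rng(0)

def ips_estimate(alpha, prob):
    """One draw of the estimator sum_i alpha_i * X_i / prob_i,
    where X_i = 1 with probability prob_i and 0 otherwise."""
    X = rng.random(len(prob)) < prob
    return float(np.sum(alpha[X] / prob[X]))

alpha = np.array([0.5, 2.0, 1.0, 3.0])
prob = np.array([0.9, 0.4, 1.0, 0.7])
draws = [ips_estimate(alpha, prob) for _ in range(100_000)]
print(np.mean(draws), alpha.sum())   # the two values should nearly agree
```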
Lemma C.9. $\Pr\big[\overline{\mathrm{cost}}(S, C, \Gamma) \le \mathrm{cost}(P, C, \Gamma) + \varepsilon N r^z/2\big] \ge 1 - \delta/4$.

Proof. To show that $\overline{\mathrm{cost}}(S, C, \Gamma) \le \mathrm{cost}(P, C, \Gamma) + \varepsilon N r^z/2$, we show that, given an optimal assignment function $\sigma : P \times C \to \mathbb{R}_{\ge 0}$ for $\mathrm{cost}(P, C, \Gamma)$, there always exists a partial assignment function $\sigma' : S \times C \to \mathbb{R}_{\ge 0}$ for $\overline{\mathrm{cost}}(S, C, \Gamma)$ such that $\sum_{p \in S}\sum_{c \in C} \sigma'(p, c)\,\mathrm{dist}(p, c)^z \le \sum_{p, c} \sigma(p, c)\,\mathrm{dist}(p, c)^z + \varepsilon N r^z/2$. A natural way to construct $\sigma'$ is to assign $\frac{\sigma(p, c)}{w_P(p)}\, w_S(p)$ units of weight from $p$ to $c$; denote this assignment by $\tau$, and let its cost be
$$\mathrm{cost}_\tau = \sum_{p \in S}\sum_{c \in C} \frac{w_S(p)}{w_P(p)}\,\sigma(p, c)\, d(p, c)^z.$$
However, $\tau$ may not be consistent with $\Gamma$: since $S$ is a random point set, some centers may receive more than $\Gamma(c)$ units of weight (denote this center set by $C^+$), and some may receive fewer than $\Gamma(c)$ units (denote this center set by $C^-$). We transform $\tau$ into a partial assignment consistent with $\Gamma$ by collecting the surplus weight from $C^+$ and sending it to $C^-$, so that the resulting $\sigma'$ is partially consistent with $\Gamma$. Re-routing $w$ units of weight originating at point $p$ from $c$ to $c'$ costs at most $w(\mathrm{dist}(p, c')^z - \mathrm{dist}(p, c)^z) \le w R^z$. Hence this step costs at most
$$R^z \sum_{c \in C}\Big|\sum_{i=1}^n \sigma(p_i, c)\,\mathrm{prob}_i^{-1} X_i - \sum_{i=1}^n \sigma(p_i, c)\Big|,$$
where we used $w_S(p_i)/w_P(p_i) = \mathrm{prob}_i^{-1} X_i$. Taking $\varepsilon = \frac{\varepsilon_0}{8k}$, $\delta = \frac{\delta}{8k}$, $W = 1$, and $\alpha_i = \sigma(p_i, c)$ in Lemma C.8, we have, for every $c \in C$, with probability at least $1 - \delta/(8k)$,
$$\Big|\sum_{i=1}^n \sigma(p_i, c)\,\mathrm{prob}_i^{-1} X_i - \sum_{i=1}^n \sigma(p_i, c)\Big| \le \frac{\varepsilon_0}{8k}\, N.$$
By taking a union bound over every $c \in C$, with probability at least $1 - \delta/8$, we have
$$|\mathrm{cost}_\tau - \overline{\mathrm{cost}}(S, C, \Gamma)| \le R^z \sum_{c \in C}\Big|\sum_{i=1}^n \sigma(p_i, c)\,\mathrm{prob}_i^{-1} X_i - \sum_{i=1}^n \sigma(p_i, c)\Big| \le \frac{\varepsilon_0}{8}\, N R^z.$$

Next, we show that the cost of $\tau$ is close to the true cost of $P$:
$$|\mathrm{cost}_\tau - \mathrm{cost}(P, C, \Gamma)| \le \sum_{c \in C}\Big|\sum_{i=1}^n \mathrm{dist}(p_i, c)^z\Big(\sigma(p_i, c)\,\frac{w_S(p_i)}{w_P(p_i)} - \sigma(p_i, c)\Big)\Big|.$$
Taking $\varepsilon = \frac{\varepsilon_0}{8k}$, $\delta = \frac{\delta}{8k}$, $W = R^z$, and $\alpha_i = \mathrm{dist}(p_i, c)^z\,\sigma(p_i, c)$ in Lemma C.8, we have, for every $c \in C$,
$$\Pr\Big[\Big|\sum_{i=1}^n \mathrm{dist}(p_i, c)^z\Big(\sigma(p_i, c)\,\frac{w_S(p_i)}{w_P(p_i)} - \sigma(p_i, c)\Big)\Big| > \frac{\varepsilon_0}{8k}\big(\mathrm{cost}(P, C, \Gamma) + N R^z\big)\Big] \le \frac{\delta}{8k}.$$
By taking a union bound over every $c \in C$, with probability at least $1 - \delta/8$, we have
$$|\mathrm{cost}_\tau - \mathrm{cost}(P, C, \Gamma)| \le \frac{\varepsilon_0}{8}\big(\mathrm{cost}(P, C, \Gamma) + N R^z\big).$$
In summary, with probability at least $1 - \delta/4$,
$$\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(P, C, \Gamma) \le \big(\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}_\tau\big) + \big(\mathrm{cost}_\tau - \mathrm{cost}(P, C, \Gamma)\big) \le \frac{\varepsilon_0}{8} N R^z + \frac{\varepsilon_0}{8}\big(\mathrm{cost}(P, C, \Gamma) + N R^z\big) \le \frac{\varepsilon}{2}\, N r^z,$$
where the last step uses $\mathrm{cost}(P, C, \Gamma) \le N R^z$ and $\varepsilon_0 R^z \le \varepsilon r^z$.

Since $S$ is a random set and contains many fewer points than $P$, constructing an assignment function with respect to $P, C$ from a partial assignment function for $\overline{\mathrm{cost}}(S, C, \Gamma)$ is difficult. Hence we first compare $\mathrm{cost}(P, C, \Gamma)$ to the expectation of $\overline{\mathrm{cost}}(S, C, \Gamma)$, and then show a concentration bound for $\overline{\mathrm{cost}}(S, C, \Gamma)$.

Lemma C.10. $\mathrm{cost}(P, C, \Gamma) \le \mathbb{E}[\overline{\mathrm{cost}}(S, C, \Gamma)] + \varepsilon N r^z / 4$.

Proof. Let $\sigma : P \times C \to \mathbb{R}_{\ge 0}$ be an assignment function with respect to $P, C$, and let $\sigma'$ be the random partial assignment function that is optimal for $\overline{\mathrm{cost}}(S, C, \Gamma)$. Since $\sigma' \vDash \Gamma$, we have, for every $c \in C$, $\sum_{p} \mathbb{E}[\sigma'(p, c)] \le \Gamma(c)$ and, for every $p \in P$, $\sum_{c} \mathbb{E}[\sigma'(p, c)] \le w_P(p)$. We can construct $\sigma$ by initializing $\sigma(p, c) := \mathbb{E}[\sigma'(p, c)]$ and then arbitrarily assigning the remaining weight of the data points to the remaining capacities of the centers. This assigning step costs at most
$$R^z\Big(N - \sum_{p, c} \mathbb{E}[\sigma'(p, c)]\Big) = R^z\,\mathbb{E}\big[N - \min\{w(S), N\}\big] \quad (\sigma' \vDash \Gamma)$$
$$= R^z\,\mathbb{E}\big[\max\{N - w(S), 0\}\big] = R^z\,\mathbb{E}\Big[\max\Big\{\sum_{i=1}^n w_P(p_i) - \sum_{i=1}^n w_P(p_i)\,\mathrm{prob}_i^{-1} X_i,\ 0\Big\}\Big].$$
Taking $\varepsilon = \frac{\varepsilon_0}{8}$, $\delta = \frac{\varepsilon_0}{8}$, $W = 1$, and $\alpha_i = w_P(p_i)$ in Lemma C.8, we have
$$\Pr\Big[\Big|\sum_{i=1}^n w_P(p_i)\,\mathrm{prob}_i^{-1} X_i - \sum_{i=1}^n w_P(p_i)\Big| > \frac{\varepsilon_0}{8} N\Big] \le \frac{\varepsilon_0}{8},$$
which implies
$$\mathbb{E}\big[\max\{N - w(S), 0\}\big] \le \frac{\varepsilon_0}{8} N + \frac{\varepsilon_0}{8}\,N = \frac{\varepsilon_0}{4} N,$$
so the assigning step costs at most $\frac{\varepsilon_0}{4} N R^z \le \varepsilon N r^z / 4$.

We use a concentration inequality based on martingales (Chung & Lu, 2006) to show the concentration bound for $\overline{\mathrm{cost}}(S, C, \Gamma)$. For convenience, we write $x_{l \ldots r}$ for $(x_l, x_{l+1}, \ldots, x_r)$ for every vector $x$ and indices $1 \le l \le r$; this notation is also used for function arguments, e.g., $f(x_{1 \ldots i})$ represents $f(x_1, x_2, \ldots, x_i)$.
Lemma C.11. (Chung & Lu, 2006) Let $V = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_n$ ($\mathcal{X}_1, \ldots, \mathcal{X}_n \subseteq \mathbb{R}$) be a set of $n$-dimensional real vectors. Suppose there is a function $f : V \to \mathbb{R}$ and $n$ distributions $D_1, D_2, \ldots, D_n$ over $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n$ respectively, and let $D$ be the joint (product) distribution over $V$. Define $f^{(i)}(z_1, z_2, \ldots, z_i) = \mathbb{E}[f(x) \mid x_1 = z_1, \ldots, x_i = z_i]$. If $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ satisfy, for every $1 \le i \le n$ and $z_1 \in \mathcal{X}_1, \ldots, z_{i-1} \in \mathcal{X}_{i-1}$,
$$\mathrm{Var}_{x \sim D_i}\big(f^{(i)}(z_1, z_2, \ldots, z_{i-1}, x)\big) \le \sigma_i^2,$$
and $M$ satisfies that, for every $1 \le i \le n$, $z \in V$, and $z_i' \in \mathcal{X}_i$ such that $\Pr_{x \sim D}[x = z] > 0$ and $\Pr_{x \sim D}[x_1 = z_1, \ldots, x_i = z_i', \ldots, x_n = z_n] > 0$, we have
$$|f(z_1, z_2, \ldots, z_n) - f(z_{1 \ldots i-1}, z_i', z_{i+1 \ldots n})| \le M,$$
then for every $t > 0$,
$$\Pr_{x \sim D}\big[|f(x) - \mathbb{E}_{x \sim D}[f(x)]| > t\big] \le 2\exp\left(-\frac{t^2}{2\big(\sum_{i=1}^n \sigma_i^2 + M t / 3\big)}\right).$$

Lemma C.12. $\Pr\big[|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathbb{E}[\overline{\mathrm{cost}}(S, C, \Gamma)]| \le \varepsilon N r^z / 4\big] \ge 1 - \delta/8$.

Proof. Define the function $f : \mathbb{R}_{\ge 0}^n \to \mathbb{R}_{\ge 0}$ as $f(x_1, x_2, \ldots, x_n) := \overline{\mathrm{cost}}(Q, C, \Gamma)$, where $Q := \{p_i : i \in [n], x_i > 0\}$ and $w_Q(p_i) := x_i$ (we write $Q$ rather than $R$ to avoid a clash with the radius $R$). Let the random vector $x = (x_1, x_2, \ldots, x_n)$ be the vector representation of $S$, i.e.,
$$x_i = \begin{cases} w_S(p_i) & p_i \in S \\ 0 & p_i \notin S, \end{cases}$$
and let $D$ be the distribution of the weights $x$, where $S$ is the output of RINGCORESET$(P)$. By the definition of $x$, the coordinates $x_1, x_2, \ldots, x_n$ are independent random variables. If $\mathrm{prob}_i = 1$, then $x_i$ always equals $w_P(p_i)$; otherwise, $x_i$ has the two-point distribution
$$x_i = \begin{cases} \mathrm{sum}_i / T & \text{w.p. } \mathrm{prob}_i \\ 0 & \text{w.p. } 1 - \mathrm{prob}_i. \end{cases}$$
We assume $\mathrm{prob}_i < 1$ in the following discussion, since the conclusions obviously hold when $\mathrm{prob}_i = 1$.

Suppose every coordinate of $x_1, x_2, \ldots, x_n$ is fixed except $x_i$. In $f(x_1, x_2, \ldots, x_n)$, increasing $x_i$ by $1$ corresponds to increasing $w_S(p_i)$ by $1$, which changes $\overline{\mathrm{cost}}(S, C, \Gamma)$ by at most $R^z$ (recall that every partial assignment function can be viewed as a maximum flow of a certain flow network). Hence the difference between $f(x_1, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_n)$ and $f(x_1, \ldots, x_{i-1}, \frac{\mathrm{sum}_i}{T}, x_{i+1}, \ldots, x_n)$ is at most $R^z\,\frac{\mathrm{sum}_i}{T}$.

Define $f^{(i)}(z_1, z_2, \ldots, z_i) = \mathbb{E}[f(x) \mid x_1 = z_1, \ldots, x_i = z_i]$. To give an upper bound on $\mathrm{Var}(f^{(i)}(z_{1 \ldots i-1}, x_i))$ (for $z \in \mathbb{R}^n$ with $\Pr_{x \sim D}[x = z] > 0$), we first consider the difference between $F_0 := f^{(i)}(z_{1 \ldots i-1}, 0)$ and $F_1 := f^{(i)}(z_{1 \ldots i-1}, \mathrm{sum}_i / T)$. By the argument above, we have
$$|F_0 - F_1| \le \sum_{z_{i+1}, \ldots, z_n} \Pr[x_{i+1 \ldots n} = z_{i+1 \ldots n}]\,\Big|f(z_{1 \ldots i-1}, 0, z_{i+1 \ldots n}) - f\Big(z_{1 \ldots i-1}, \tfrac{\mathrm{sum}_i}{T}, z_{i+1 \ldots n}\Big)\Big| \le R^z\,\frac{\mathrm{sum}_i}{T}.$$
Then we have
$$\mathrm{Var}\big(f^{(i)}(z_{1 \ldots i-1}, x_i)\big) = \mathrm{prob}_i F_1^2 + (1 - \mathrm{prob}_i) F_0^2 - \big(\mathrm{prob}_i F_1 + (1 - \mathrm{prob}_i) F_0\big)^2 = \mathrm{prob}_i (1 - \mathrm{prob}_i)(F_0 - F_1)^2 \le \mathrm{prob}_i\Big(R^z\,\frac{\mathrm{sum}_i}{T}\Big)^2 \le \frac{R^{2z} N}{T}\, w_P(p_i).$$
Taking $M = R^z N / T$ and $\sigma_i^2 = \frac{R^{2z} N}{T} w_P(p_i)$ in Lemma C.11 (so that $\sum_i \sigma_i^2 = R^{2z} N^2 / T$), we have
$$\Pr\Big[\big|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathbb{E}[\overline{\mathrm{cost}}(S, C, \Gamma)]\big| > \frac{\varepsilon_0 N R^z}{4}\Big] \le 2\exp\left(-\frac{\varepsilon_0^2 N^2 R^{2z}}{32\big(\sum_{i=1}^n R^{2z} N w_P(p_i)/T + \varepsilon_0 M N R^z / 3\big)}\right) \le 2\exp\left(-\frac{\varepsilon_0^2 T}{64}\right) \le \frac{\delta}{8},$$
where the last step follows from the choice of $T$.

Lemma C.5. With probability at least $1 - \delta/2$, $|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le \varepsilon N r^z / 2$.

Proof. By applying a union bound to the results of Lemma C.9, Lemma C.10, and Lemma C.12, we get
$$\Pr\big[|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(P, C, \Gamma)| \le \varepsilon N r^z / 2\big] \ge 1 - \delta/2.$$
This concludes the proof.
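The two-point weight distribution used in the proof of Lemma C.12 is easy to simulate. The sketch below (our illustration; the prefix sums `sums` play the role of $\mathrm{sum}_i$, and the sampling rule mirrors the description above) draws one weight vector $x$ and checks that the total weight $w(S)$ is close to $N = w(P)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_weight_vector(w, T):
    """x_i = sum_i / T with probability prob_i = min(T*w_i/sum_i, 1),
    x_i = w_i if prob_i = 1, and x_i = 0 otherwise; so E[x_i] = w_i."""
    sums = np.cumsum(w)
    prob = np.minimum(T * w / sums, 1.0)
    keep = rng.random(len(w)) < prob
    x = np.where(prob < 1.0, sums / T, w)
    return np.where(keep, x, 0.0), prob

w = np.ones(1000)
x, prob = sample_weight_vector(w, T=50)
print(x.sum(), w.sum())   # w(S) concentrates around N
```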
C.3 PROOF OF LEMMA C.6

Lemma C.6. With probability at least $1 - \delta/2$, there exists a weighted point set $S'$ with $|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(S', C, \Gamma)| \le \varepsilon N r^z / 2$, such that $S' = S$ and, for every $p \in S$, $w_{S'}(p) \ge (1 - \varepsilon)\, w_S(p)$.

Proof. Taking $\varepsilon = \frac{\varepsilon_0}{4}$, $\delta = \delta/2$, $W = 1$, and $\alpha_i = w_P(p_i)$ in Lemma C.8, we have, with probability at least $1 - \delta/2$,
$$\Big|\sum_{p_i \in S} w_P(p_i)\,\mathrm{prob}_i^{-1} X_i - \sum_{p_i \in P} w_P(p_i)\Big| \le \frac{\varepsilon_0}{4}\, N.$$
Let $\alpha = \frac{w(S)}{N}$. Conditioning on $|\alpha - 1| \le \frac{\varepsilon_0}{4}$, we show that $|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(S', C, \Gamma)| \le \varepsilon_0 N R^z / 2$ always holds, where $S'$ contains all elements of $S$ and its weights are defined by $w_{S'}(x) = \alpha^{-1} w_S(x)$ for every $x \in S$. Note that $w(S') = N$, and $w_{S'}(p) \ge (1 - \varepsilon)\, w_S(p)$ since $\alpha^{-1} \ge (1 + \varepsilon_0/4)^{-1} \ge 1 - \varepsilon$. For every partial assignment function $\sigma$ with respect to $S, C$ that is partially consistent with $\Gamma$, and every assignment function $\sigma'$ with respect to $S', C$ that is consistent with $\Gamma$, we have $\sum_{x, c} \sigma'(x, c) = N$ and $\sum_{x, c} \sigma(x, c) = \min(w(S), N) = \min(\alpha, 1)\, N$. In the following construction, we use $\min(\alpha, 1)$ to normalize the weight of $\sigma$.

First, we prove that $\overline{\mathrm{cost}}(S, C, \Gamma) \le \mathrm{cost}(S', C, \Gamma) + \varepsilon N r^z / 2$. Let $\sigma'$ be the optimal assignment function for $\mathrm{cost}(S', C, \Gamma)$. We define the partial assignment function $\sigma$ as $\sigma(p, c) := \min(\alpha, 1)\,\sigma'(p, c)$ for every $p \in S$, $c \in C$. One can verify that $\sigma$ is partially consistent with $\Gamma$. Hence
$$\overline{\mathrm{cost}}(S, C, \Gamma) \le \sum_{p \in S}\sum_{c \in C} \sigma(p, c)\,\mathrm{dist}(p, c)^z = \min(\alpha, 1)\,\mathrm{cost}(S', C, \Gamma) \le \mathrm{cost}(S', C, \Gamma).$$

Next, we prove that $\mathrm{cost}(S', C, \Gamma) \le \overline{\mathrm{cost}}(S, C, \Gamma) + \varepsilon N r^z / 2$. Let $\sigma$ be the optimal partial assignment function for $\overline{\mathrm{cost}}(S, C, \Gamma)$. We initialize the assignment function $\sigma'$ as $\sigma'(p, c) := \max(\alpha^{-1}, 1)\,\sigma(p, c)$. However, $\sigma'$ may not be consistent with $\Gamma$; to obtain an assignment function consistent with $\Gamma$, we apply the following modifications to $\sigma'$. Let $S^+ = \{x \in S' : \sum_{c \in C} \sigma'(x, c) > w_{S'}(x)\}$ and $S^- = \{x \in S' : \sum_{c \in C} \sigma'(x, c) < w_{S'}(x)\}$. For every center $c \in C$, we re-assign weight from $\sigma'(x, c)$ ($x \in S^+$) to $\sigma'(x', c)$ ($x' \in S^-$), repeating this process until $\sum_{c \in C} \sigma'(x, c) = w_{S'}(x)$ for every $x \in S'$. This step costs at most $R^z \sum_{x \in S'} \big|w_{S'}(x) - \max(\alpha^{-1}, 1)\sum_{c \in C} \sigma(x, c)\big|$. Let $C^+ = \{c \in C : \sum_{x} \sigma'(x, c) > \Gamma(c)\}$ and $C^- = \{c \in C : \sum_{x} \sigma'(x, c) < \Gamma(c)\}$; notice that the previous step does not affect $\sum_{x} \sigma'(x, c)$. For every point $x \in S'$, we re-assign weight from $\sigma'(x, c)$ ($c \in C^+$) to $\sigma'(x, c')$ ($c' \in C^-$), repeating this process until $\sum_{x} \sigma'(x, c) = \Gamma(c)$ for every $c \in C$. This step costs at most $R^z \sum_{c \in C} \big|\Gamma(c) - \max(\alpha^{-1}, 1)\sum_{x \in S'} \sigma(x, c)\big|$. After these modifications, the cost of $\sigma'$ is bounded by
$$\mathrm{cost}_{\sigma'}(S', C, \Gamma) \le \overline{\mathrm{cost}}(S, C, \Gamma) + R^z \sum_{x \in S'}\Big|w_{S'}(x) - \max(\alpha^{-1}, 1)\sum_{c \in C} \sigma(x, c)\Big| + R^z \sum_{c \in C}\Big|\Gamma(c) - \max(\alpha^{-1}, 1)\sum_{x \in S'} \sigma(x, c)\Big|.$$

If $w(S) \ge N$ (i.e., $\alpha^{-1} \le 1$), then $\sum_{x \in S} \sigma(x, c) = \Gamma(c)$ holds for every center $c \in C$, so the second re-assignment step has zero cost, and
$$\mathrm{cost}_{\sigma'}(S', C, \Gamma) - \overline{\mathrm{cost}}(S, C, \Gamma) \le R^z \sum_{x \in S'}\Big|w_{S'}(x) - \sum_{c \in C} \sigma(x, c)\Big| \le R^z \sum_{x \in S'}\Big(\big(w_S(x) - w_{S'}(x)\big) + \Big(w_S(x) - \sum_{c \in C} \sigma(x, c)\Big)\Big) = 2R^z\big(w(S) - N\big) \le \frac{\varepsilon_0 N R^z}{2}.$$
If $w(S) < N$ (i.e., $\alpha^{-1} > 1$), then $\sum_{c \in C} \sigma(x, c) = w_S(x)$ holds for every $x \in S$, so the first re-assignment step has zero cost, and
$$\mathrm{cost}_{\sigma'}(S', C, \Gamma) - \overline{\mathrm{cost}}(S, C, \Gamma) \le R^z \sum_{c \in C}\Big|\Gamma(c) - \alpha^{-1}\sum_{x \in S} \sigma(x, c)\Big| \le R^z \sum_{c \in C}\Big((\alpha^{-1} - 1)\Gamma(c) + \alpha^{-1}\Big(\Gamma(c) - \sum_{x \in S} \sigma(x, c)\Big)\Big) = 2R^z(\alpha^{-1} - 1)\, N \le \frac{\varepsilon_0 N R^z}{2}.$$
Summing up the two directions, we get $|\overline{\mathrm{cost}}(S, C, \Gamma) - \mathrm{cost}(S', C, \Gamma)| \le \varepsilon_0 N R^z / 2 \le \varepsilon N r^z / 2$, since $\varepsilon_0 R^z \le \varepsilon r^z$. This concludes the proof.
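The two re-assignment passes in the proof above admit a simple water-filling implementation. The sketch below (ours; the array `sigma` is the $|S'| \times |C|$ assignment matrix and `row_target` plays the role of $w_{S'}$) performs the first pass; a symmetric pass over the rows then fixes the center loads to $\Gamma(c)$.

```python
import numpy as np

def rebalance_rows(sigma, row_target):
    """Move surplus mass from rows shipping too much to rows shipping
    too little, one column at a time; column sums (center loads) are
    unchanged, mirroring the first pass in the proof of Lemma C.6."""
    sigma = sigma.copy()
    excess = sigma.sum(axis=1) - row_target
    for c in range(sigma.shape[1]):
        for p in np.where(excess > 0)[0]:
            move = min(sigma[p, c], excess[p])   # mass p can shed here
            for q in np.where(excess < 0)[0]:
                if move <= 0:
                    break
                amt = min(move, -excess[q])
                sigma[p, c] -= amt
                sigma[q, c] += amt
                excess[p] -= amt
                excess[q] += amt
                move -= amt
    return sigma
```

Each unit of re-routed mass changes its cost by at most $R^z$, which is exactly how the $R^z \sum |\cdot|$ terms in the proof arise.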