# Average Sensitivity of Euclidean k-Clustering

Yuichi Yoshida, National Institute of Informatics and JST, PRESTO (yyoshida@nii.ac.jp)
Shinji Ito, NEC (i-shinji@nec.com)

Abstract

Given a set of n points in R^d, the goal of Euclidean (k, ℓ)-clustering is to find k centers that minimize the sum of the ℓ-th powers of the Euclidean distance of each point to the closest center. In practical situations, the clustering result must be stable against points missing from the input data so that we can make trustworthy and consistent decisions. To address this issue, we consider the average sensitivity of Euclidean (k, ℓ)-clustering, which measures the stability of the output, in total variation distance, against deleting a random point from the input data. We first show that a popular algorithm k-MEANS++ and its variant called D^ℓ-SAMPLING have low average sensitivity. Next, we show that any approximation algorithm for Euclidean (k, ℓ)-clustering can be transformed into an algorithm with low average sensitivity while almost preserving the approximation guarantee. As byproducts of our results, we provide several algorithms for consistent (k, ℓ)-clustering and dynamic (k, ℓ)-clustering in the random-order model, where the input points are randomly permuted and given in an online manner. The goal of the consistent setting is to maintain a good solution while minimizing the number of changes to the solution during the process, and that of the dynamic setting is to maintain a good solution while minimizing the (amortized) update time.

1 Introduction

Euclidean (k, ℓ)-clustering is a popular problem used to compute a clustering of points in Euclidean space, where k is a positive integer and ℓ ≥ 1. In this problem, given a set of points X = {x_1, . . . , x_n} in R^d, we are to compute a set Z = {z_1, . . . , z_k} of k points in R^d that minimizes
$$\mathrm{cost}^\ell_Z(X) := \sum_{i=1}^{n} \mathrm{cost}^\ell_Z(x_i), \qquad \text{where } \mathrm{cost}^\ell_Z(x_i) := \min_{j=1,\ldots,k} \|x_i - z_j\|_2^\ell$$
is the ℓ-th power of the Euclidean distance between x_i and the closest point in Z. Let opt^ℓ(X) = min_Z cost^ℓ_Z(X), where the minimum is over all sets Z of k points in R^d (not necessarily drawn from X). This problem is called k-means clustering when ℓ = 2 [18] and k-medians clustering when ℓ = 1. Although providing an exact solution to Euclidean (k, ℓ)-clustering is NP-hard [3, 7] (even when ℓ = 2), polynomial-time approximation schemes (PTASes) do exist [13]. In practice, an algorithm called D^ℓ-SAMPLING is often used, which computes an O(2^ℓ log k)-approximate solution in polynomial time [4]. When ℓ = 2, this algorithm is often called k-MEANS++.
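To fix ideas, here is a small illustrative snippet (ours, not from the paper) that evaluates the objective cost^ℓ_Z(X) for points given as Python coordinate tuples:

```python
import math

def clustering_cost(X, Z, ell):
    """cost^ell_Z(X): sum over points of the ell-th power of the
    Euclidean distance to the nearest center in Z."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return sum(min(dist(x, z) for z in Z) ** ell for x in X)

# ell = 2 gives the k-means objective, ell = 1 the k-medians objective.
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
Z = [(0.5, 0.0), (10.0, 10.0)]
print(clustering_cost(X, Z, ell=2))   # 0.25 + 0.25 + 0.0 = 0.5
```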
In practical situations, it is reasonable to assume that the input contains some noise. If the clustering results are altered by the slightest change in the input, the interpretability and reliability of the clustering results are compromised, and hence the clustering result should be stable. Stability is also important when making marketing strategies or production plans based on the clustering result, because changing decisions can be very costly. In this work, we focus on a situation in which some points are missing from the input, and we consider average sensitivity to measure the stability of clustering algorithms, which is defined below. For a set of points X = {x_1, . . . , x_n} and i ∈ [n], let X^(i) denote the set {x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n}, where [n] := {1, 2, . . . , n}. Then, the average sensitivity of a (randomized) algorithm A on X (with respect to the total variation distance) [23] is defined as
$$\beta(A, X) := \frac{1}{n}\sum_{i=1}^{n} d_{\mathrm{TV}}\big(A(X), A(X^{(i)})\big), \qquad (1)$$
where d_TV(A(X), A(X^(i))) is the total variation distance between the distributions of A(X) and A(X^(i)). Average sensitivity has been studied for various problems such as maximum matching [25], spectral clustering [21], dynamic programming [15, 16], and network centralities [20].

It is crucial to take the average in (1) instead of the maximum, which would measure the worst-case sensitivity against deleting points. To see this, let us consider the case k = 2 and a set of points X consisting of two clusters, each having (n − 1)/2 points, and one distant point. Then any good solution must contain the distant point and a middle point between the two clusters. However, for the set of points X′ obtained from X by deleting the distant point, any good solution must contain a point from each cluster. Hence, the worst-case sensitivity is Ω(1). By contrast, the average sensitivity can be O(1/n) because the contribution of deleting the distant point to the average sensitivity is O(1/n).
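The quantity in (1) can be probed empirically when the algorithm's output is a discrete object (for D^ℓ-SAMPLING, the set of chosen input points). The following Monte Carlo sketch is purely illustrative and not from the paper: the plug-in estimate of the total variation distance from finite samples is crude, and the toy uniform_seeding routine and all names are ours.

```python
import random
from collections import Counter

def empirical_tv(samples_p, samples_q):
    """Crude plug-in estimate of total variation distance between two
    sampled distributions over hashable outputs."""
    p, q = Counter(samples_p), Counter(samples_q)
    n_p, n_q = len(samples_p), len(samples_q)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[z] / n_p - q[z] / n_q) for z in support)

def estimate_average_sensitivity(algorithm, X, runs=2000, seed=0):
    """Estimate beta(A, X) = (1/n) * sum_i d_TV(A(X), A(X^(i))) by sampling.
    `algorithm` is assumed to return a hashable output (e.g., a frozenset
    of chosen input points) so empirical distributions can be compared."""
    rng = random.Random(seed)
    base = [algorithm(X, rng) for _ in range(runs)]
    total = 0.0
    for i in range(len(X)):
        X_minus_i = X[:i] + X[i + 1:]
        dropped = [algorithm(X_minus_i, rng) for _ in range(runs)]
        total += empirical_tv(base, dropped)
    return total / len(X)

# Toy "algorithm": pick k = 2 input points uniformly at random as centers.
def uniform_seeding(X, rng, k=2):
    return frozenset(rng.sample(X, k))

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(estimate_average_sensitivity(uniform_seeding, X, runs=500))
```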
1.1 Our contributions

Algorithms with low average sensitivity. We first show that D^ℓ-SAMPLING has average sensitivity O(k/n) (Section 2). This matches the lower bound of Ω(k/n) on the average sensitivity of any algorithm with a finite approximation ratio (Appendix A). We then show that any approximation algorithm can be transformed into an algorithm with low average sensitivity while almost preserving the approximation ratio. Specifically, given an α-approximation algorithm A for (k, ℓ)-clustering and ϵ > 0, we can transform it into a (1 + ϵ)α-approximation algorithm A′ with average sensitivity poly(d, k, 2^ℓ, 1/ϵ)/n (Section 3). For example, if we plug in a PTAS for (k, ℓ)-clustering [13] as A, we obtain a PTAS with average sensitivity poly(d, k, 2^ℓ, 1/ϵ)/n as A′. In our transformation, we first construct a coreset [2, 11] of the input, that is, a subset of the input that preserves the objective function, and then apply the approximation algorithm A. (We note that the term "sensitivity" is also used as the name of a well-studied notion in the coreset literature, which measures the importance of a point in the input (see Section 3.1); it should not be confused with average sensitivity, which measures the change in the solution caused by deleting points from the input.) We observe that the coreset can be constructed with low average sensitivity, and hence any approximation algorithm applied to it has low average sensitivity. This approach is new to the design of algorithms with low average sensitivity.

Consistent clustering. In consistent Euclidean (k, ℓ)-clustering, n points in R^d arrive online, and the goal is to maintain a good solution while minimizing the inconsistency, defined as the total number of points changed in the solution, i.e., Σ_{t ∈ [n−1]} |Z_t \ Z_{t+1}|, where Z_t is the output of the algorithm after the t-th point has arrived. We consider the random-order model, in which the input points are randomly permuted. We observe that an average sensitivity analysis can be used to obtain consistent algorithms. Specifically, we show that, given an input X, an index i ∈ [n], and the output Z of an algorithm A for Euclidean (k, ℓ)-clustering for X, we can compute an output Z′ so that the probability that Z and Z′ disagree is at most the total variation distance between A(X) and A(X^(i)) in expectation over Z. Using this connection, we show that a variant of D^ℓ-SAMPLING is an O(2^ℓ log k)-approximation algorithm with inconsistency O(k² log n). Also, we show that the abovementioned PTAS with low average sensitivity can be used to obtain a (1 + ϵ)-approximation algorithm with inconsistency poly(d, k, 2^ℓ, 1/ϵ) log n.

Lattanzi and Vassilvitskii [17] considered consistent algorithms for metric (k, ℓ)-clustering, for which the underlying metric can be arbitrary but the output points must be selected from the input points. They showed a 2^{O(ℓ)}-approximation algorithm with inconsistency O(k² log⁴ n), where n is the number of points. Subsequently, Fichtenberger et al. [8] gave an algorithm for metric (k, 1)-clustering with approximation ratio O(1) and inconsistency O(k · poly(log(nΔ))), where Δ is the aspect ratio of the input, that is, the ratio between the largest and smallest distances. We note that these results are incomparable to ours because in Euclidean (k, ℓ)-clustering we can output points that do not appear in the input, which may significantly reduce the objective value.

Dynamic clustering. In dynamic (or incremental) Euclidean (k, ℓ)-clustering, points arrive online, and the goal is to maintain a good solution while minimizing the (amortized) time required to update the solution. For this problem, a few heuristics have been proposed [1], but to the best of our knowledge, there is no known dynamic algorithm with a theoretical guarantee on its update time. As with consistent clustering, we consider the random-order model. We show that the consistent version of D^ℓ-SAMPLING can be implemented so that the amortized update time is O(dk + (k + log n)k log n). Also, we show that the abovementioned consistent PTAS can be implemented so that the amortized update time is poly(d, k, ℓ, 1/ϵ) + 2^{poly(d, k, 2^ℓ, 1/ϵ)} log² n. We note that several dynamic algorithms that maintain coresets for metric k-clustering have been proposed [10, 12]. However, it is not clear whether these algorithms can be used to solve dynamic Euclidean (k, ℓ)-clustering.

1.2 Related work

Differentially private k-means/medians clustering. Differential privacy is a popular notion used to measure privacy risks. Although we do not define it formally here, we note that if an algorithm is β-differentially private, then its average sensitivity is at most β (see, e.g., [23]). In [9], it is shown that any polynomial-time α-approximation algorithm for k-means/medians clustering can be converted into a β-differentially private algorithm that, given a set X of points in a d-dimensional unit ball, outputs a solution Z such that
$$\mathrm{cost}^\ell_Z(X) \le (1 + \epsilon)\alpha \cdot \mathrm{opt}^\ell(X) + \tilde{O}_\alpha\big(\sqrt{d} + k^{O_\alpha(1)}\big),$$
where O_α(·) means that α is regarded as a constant and Õ(·) suppresses polylogarithmic dependence on n, d, k, and β. We note that the additive error is inevitable when obtaining β-differentially private algorithms for k-means/medians clustering [22], whereas we do not need it to bound average sensitivity.

Stability. Let D be a distribution over points in R^d, and suppose that we want to find a cluster structure in D using an algorithm A. In many cases, we do not know the parameter k that best describes D. A popular method for selecting k is based on the following stability argument: we select the k that minimizes the instability of A,
$$\mathrm{Instab}(k, n) := \mathbb{E}_{X, X' \sim D^n}\, d\big(A(X, k), A(X', k)\big),$$
where A(X, k) is the output of A on a set of points X and the number of clusters k, and d(·, ·) is a distance between two clusterings, which can be defined in various ways.
The main focus of the literature on instability is characterizing D such that Instab(k, n) → 0 as n → ∞ when A is the algorithm that outputs an optimal solution, because we can then claim that the chosen k captures the cluster structure of D well. Refer to [24] for a survey on this topic. A notable difference of our work is that we focus on designing an algorithm with low average sensitivity for any input set X, rather than characterizing the X for which a certain algorithm has low average sensitivity.

2 Average sensitivity of D^ℓ-sampling

In this section, we analyze the average sensitivity of a popular algorithm for Euclidean (k, ℓ)-clustering called D^ℓ-sampling, which has approximation ratio O(2^ℓ log k) [4]. This algorithm starts with a uniformly sampled point of the input X and then iteratively adds points to the solution Z: at each step, a point x ∈ X is sampled with probability proportional to cost^ℓ_Z(x) (see Algorithm 1 for details). Throughout, we use the symbol n to denote the number of points in the input set X. Our goal is to show the following:

Theorem 2.1. The average sensitivity of D^ℓ-SAMPLING(X, k, ℓ) is O(k/n).

Algorithm 1: D^ℓ-sampling for Euclidean (k, ℓ)-clustering
1 Procedure D^ℓ-SAMPLING(X, k, ℓ)
2   Sample x ∈ X uniformly at random and set Z = {x};
3   for i = 2, . . . , k do
4     Let D_{X,Z} be the probability distribution over X such that x ∈ X is sampled with probability p_{X,Z}(x) := cost^ℓ_Z(x) / cost^ℓ_Z(X);
5     Sample x ∈ X from D_{X,Z} and add it to Z;
6   return Z.

We start by analyzing the average sensitivity of the process of sampling from D_{X,Z}:

Lemma 2.2. The average sensitivity of the process of sampling from D_{X,Z} is O(1/n).

Proof. For notational simplicity, we use p(x) and p^(i)(x) to denote p_{X,Z}(x) and p_{X^(i),Z}(x), respectively (recall that X^(i) = {x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n}). Note that p(x) = cost^ℓ_Z(x)/cost^ℓ_Z(X). We start by analyzing the change in the sampling probability p(x) caused by deleting a point from X. For any i ∈ [n] and x ∈ X^(i), we have
$$p^{(i)}(x) - p(x) = \frac{\mathrm{cost}^\ell_Z(x)}{\mathrm{cost}^\ell_Z(X^{(i)})} - \frac{\mathrm{cost}^\ell_Z(x)}{\mathrm{cost}^\ell_Z(X)} = \frac{\mathrm{cost}^\ell_Z(x_i)\,\mathrm{cost}^\ell_Z(x)}{\mathrm{cost}^\ell_Z(X^{(i)})\,\mathrm{cost}^\ell_Z(X)},$$
where the last equality follows from cost^ℓ_Z(X) − cost^ℓ_Z(X^(i)) = cost^ℓ_Z(x_i). Then, we have
$$\sum_{i=1}^{n}\sum_{x \in X^{(i)}} \big(p^{(i)}(x) - p(x)\big) = \sum_{i=1}^{n} \frac{\mathrm{cost}^\ell_Z(x_i)\,\mathrm{cost}^\ell_Z(X^{(i)})}{\mathrm{cost}^\ell_Z(X^{(i)})\,\mathrm{cost}^\ell_Z(X)} = \sum_{i=1}^{n} \frac{\mathrm{cost}^\ell_Z(x_i)}{\mathrm{cost}^\ell_Z(X)} = 1.$$
Noting that x_i does not exist in X^(i), so that d_TV(D_{X,Z}, D_{X^(i),Z}) = Σ_{x ∈ X^(i)} (p^(i)(x) − p(x)), the average sensitivity of the process of sampling from D_{X,Z} is
$$\frac{1}{n}\sum_{i=1}^{n} d_{\mathrm{TV}}\big(D_{X,Z}, D_{X^{(i)},Z}\big) = \frac{1}{n}\sum_{i=1}^{n}\sum_{x \in X^{(i)}} \big(p^{(i)}(x) - p(x)\big) = \frac{1}{n} = O\!\left(\frac{1}{n}\right).$$

We next analyze the average sensitivity of D^ℓ-SAMPLING and prove Theorem 2.1.

Proof of Theorem 2.1. Let z_1, . . . , z_k ∈ X be the sequence of points added to Z. Note that they are random variables. We use z^(i)_j to denote z_j for the input X^(i). First, we have
$$\sum_{i=1}^{n} d_{\mathrm{TV}}\big(z_1, z_1^{(i)}\big) = 1. \qquad (2)$$
For j ∈ [k] and q_1, . . . , q_{k−1} ∈ X, we denote by {z_j | z_1 = q_1, . . . , z_{j−1} = q_{j−1}} the random variable z_j conditioned on z_1 = q_1, . . . , z_{j−1} = q_{j−1}. We define z^(i)_1, . . . , z^(i)_k and {z^(i)_j | z^(i)_1 = q_1, . . . , z^(i)_{j−1} = q_{j−1}} similarly using X^(i). Then by Lemma 2.2, we have
$$\sum_{i=1}^{n} d_{\mathrm{TV}}\Big(\{z_j \mid z_1 = q_1, \ldots, z_{j-1} = q_{j-1}\},\; \{z_j^{(i)} \mid z_1^{(i)} = q_1, \ldots, z_{j-1}^{(i)} = q_{j-1}\}\Big) = O(1) \qquad (3)$$
for every j ∈ {2, . . . , k} and q_1, . . . , q_{k−1} ∈ X. Given (2) and (3), by induction, we obtain
$$\sum_{i=1}^{n} d_{\mathrm{TV}}\Big(\{z_1, \ldots, z_k\},\; \{z_1^{(i)}, \ldots, z_k^{(i)}\}\Big) = O(k),$$
which implies that the average sensitivity of D^ℓ-SAMPLING(X, k, ℓ) is O(k/n).
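Algorithm 1 translates almost line by line into code. The sketch below (function and variable names are ours) is a minimal reading of D^ℓ-SAMPLING; for ℓ = 2 it is exactly the k-means++ seeding rule, and it makes no attempt at the logarithmic-time sampling structures discussed in Section 5.

```python
import random

def dl_sampling(X, k, ell, rng=random):
    """A direct reading of Algorithm 1 (D^ell-sampling): the first center is
    uniform, and each subsequent center is drawn with probability
    proportional to cost^ell_Z(x), the ell-th power of the distance to the
    nearest center chosen so far."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def cost(x, Z):
        return min(dist2(x, z) for z in Z) ** (ell / 2.0)

    Z = [rng.choice(X)]
    for _ in range(1, k):
        weights = [cost(x, Z) for x in X]
        total = sum(weights)
        if total == 0:  # all remaining points coincide with a chosen center
            Z.append(rng.choice(X))
            continue
        Z.append(rng.choices(X, weights=weights, k=1)[0])
    return Z
```

For instance, dl_sampling(X, k=3, ell=2) performs k-means++ seeding with three centers on a list of coordinate tuples.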
3 General transformation

In this section, we show that any approximation algorithm for Euclidean (k, ℓ)-clustering can be converted into an algorithm with low average sensitivity. Specifically, we show the following:

Theorem 3.1. Let A be an α-approximation algorithm for Euclidean (k, ℓ)-clustering with time complexity T(n, d, k, ℓ), where n is the number of points and d is the dimension of the Euclidean space. Then for any ϵ, δ > 0, there exists an algorithm for Euclidean (k, ℓ)-clustering such that it outputs a (1 + ϵ)α-approximation with probability at least 1 − δ, its average sensitivity is Õ(poly(d, k, 2^ℓ, 1/ϵ, log(1/δ)))/n, where Õ(·) suppresses a polylogarithmic factor in k and 1/ϵ, and it runs in time poly(n, d, k) + T(poly(d, k, 1/ϵ, log(1/δ)), d, k, ℓ).

For example, by plugging a PTAS [13] into Theorem 3.1 as A, we obtain a PTAS with the same average sensitivity as in the theorem.

Our transformation constructs a coreset for the input, which is a small subset that approximately preserves the objective function, and then applies the given algorithm. We first describe known results about coresets in Section 3.1, and then provide an algorithm that constructs coresets with low average sensitivity in Section 3.2. Finally, we prove Theorem 3.1 in Section 3.3.

3.1 Preliminaries

A weighted set is a pair of a set X and a weight function w : X → R_+. We say that (X, w) is a weighted subset of Y if X ⊆ Y. For a weighted set (X, w) and a set of points Z, we define cost^ℓ_Z((X, w)) as Σ_{x ∈ X} w(x) cost^ℓ_Z(x). Note that cost^ℓ_Z((X, w)) = cost^ℓ_Z(X) when w ≡ 1. A coreset for Euclidean (k, ℓ)-clustering is defined as follows:

Definition 3.2 (Coreset for Euclidean (k, ℓ)-clustering [2, 11]). Let X be a set of points in R^d. For ϵ > 0, we say that a weighted subset (C, w) of X is an ϵ-coreset of X if for any set Z ⊆ X of k points, we have
$$(1 - \epsilon)\,\mathrm{cost}^\ell_Z(X) < \mathrm{cost}^\ell_Z((C, w)) < (1 + \epsilon)\,\mathrm{cost}^\ell_Z(X).$$

Importance sampling is a popular approach to obtain a small coreset. Given a set of points X in R^d, an importance function s : X → R_+, and a positive integer m > 0, we sample a point x ∈ X with probability proportional to s(x) and set its weight based on the reciprocal of the sampling probability, and we repeat this process m times to construct a coreset. For technical reasons, we slightly perturb the weights of the points. See Algorithm 2 for details.

Algorithm 2: Coreset construction by importance sampling
1 Procedure CORESET(X, k, ℓ, s, m)
2   Let C be an empty set;
3   for m times do
4     Sample x ∈ X with probability p(x) := s(x) / Σ_{x′ ∈ X} s(x′);
5     Add x to C;
6     Sample p̃ from [p(x), (1 + ϵ/2)p(x)] uniformly at random;
7     if w(x) is undefined then w(x) ← 1/(m p̃);
8     else w(x) ← w(x) + 1/(m p̃);
9   return (C, w).

A popular choice for the importance is the sensitivity with respect to the input. In the context of Euclidean (k, ℓ)-clustering, for a set X of points in R^d, the sensitivity of x ∈ X is defined to be
$$\sigma_X(x) := \max_{Z \subseteq X : |Z| = k} \frac{\mathrm{cost}^\ell_Z(x)}{\mathrm{cost}^\ell_Z(X)}.$$
Indeed, any upper bound on the sensitivity can be used to construct a coreset:

Theorem 3.3 (Theorem 5.5 in [5], restated; see also [13]). Let X be a set of points in R^d and let s : X → R_+ be an arbitrary upper bound on σ_X. Let (C, w) = CORESET(X, k, ℓ, s, m) with
$$m \ge \frac{cS}{\epsilon^2}\left(dk \log S + \log\frac{1}{\delta}\right),$$
where c > 0 is some universal constant and S = Σ_{x ∈ X} s(x). Then, with probability at least 1 − δ, (C, w) is an ϵ-coreset of X.

We note that the slight modification to the assigned weights preserves the objective value to within a factor of 1/(1 + ϵ/2), and hence the original analysis of Theorem 3.3 goes through.

The reason that we consider upper bounds on the sensitivity in Theorem 3.3 is that no algorithm for exactly computing the sensitivity is known. The following lemma shows that any approximate solution can be used to obtain such an upper bound.

Lemma 3.4 ([13]). Let X be a set of points in R^d and let Z ⊆ X be an α-approximate solution to Euclidean (k, ℓ)-clustering on X. For x ∈ X, let Z_x = {y ∈ X : ∃ z ∈ Z s.t. ‖x − z‖_2^ℓ = cost^ℓ_Z(x) and ‖y − z‖_2^ℓ = cost^ℓ_Z(y)} be the set of all points in the same cluster as x. Then, the sensitivity σ_X(x) is bounded from above by
$$s(x) := 2^{2\ell+2}\alpha^2 \left(\frac{\mathrm{cost}^\ell_Z(x)}{\mathrm{cost}^\ell_Z(X)} + \frac{1}{|Z_x|}\right).$$
Furthermore, we have S = Σ_{x ∈ X} s(x) ≤ 2^{2ℓ+3} α² k.
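Combining Lemma 3.4 with Algorithm 2 gives the following sketch (ours). It assumes an α-approximate center set Z ⊆ X is already available, e.g., from D^ℓ-SAMPLING, computes the sensitivity upper bounds s(x), and then performs the importance sampling of Algorithm 2. The per-draw weight 1/(m·p̃) and the perturbation interval follow the reading of Algorithm 2 given above, so treat this as an interpretation rather than the authors' exact implementation.

```python
import math
import random

def coreset(X, Z, ell, alpha, m, eps=0.1, rng=random):
    """Importance-sampling coreset in the spirit of Algorithm 2.
    X: list of points (tuples); Z: alpha-approximate centers chosen from X;
    m: number of samples.  Returns a list of (point, weight) pairs."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cost(x):
        return min(dist(x, z) for z in Z) ** ell

    costs = [cost(x) for x in X]
    total_cost = sum(costs) or 1.0  # guard against the degenerate all-zero case

    # Cluster membership: assign each point to its nearest center in Z.
    nearest = [min(range(len(Z)), key=lambda j: dist(x, Z[j])) for x in X]
    cluster_size = [nearest.count(j) for j in range(len(Z))]

    # Sensitivity upper bounds s(x) from Lemma 3.4.
    s = [
        (2 ** (2 * ell + 2)) * alpha ** 2 *
        (costs[i] / total_cost + 1.0 / cluster_size[nearest[i]])
        for i in range(len(X))
    ]
    S = sum(s)

    # Importance sampling with slightly perturbed weights (Algorithm 2).
    weights = {}
    for _ in range(m):
        i = rng.choices(range(len(X)), weights=s, k=1)[0]
        p = s[i] / S
        p_tilde = rng.uniform(p, (1 + eps / 2) * p)
        weights[i] = weights.get(i, 0.0) + 1.0 / (m * p_tilde)
    return [(X[i], w) for i, w in weights.items()]
```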
3.2 Coreset construction with low average sensitivity

In this section, we consider constructing coresets with low average sensitivity. Specifically, we show the following:

Lemma 3.5. For any ϵ, δ > 0, there exists a polynomial-time algorithm that, given a set X of points in R^d, a positive integer k, and ℓ ≥ 1, outputs a weighted subset (C, w) such that (i) (C, w) is an ϵ-coreset of X with probability at least 1 − δ, and (ii) we have |C| ≤ m, where m = Õ(2^{2ℓ}k) · poly(d, k, 1/ϵ, log(1/δ)). The average sensitivity of the algorithm is at most O(m/n).

Our algorithm is as follows. We first compute a set Z ⊆ X by calling D^ℓ-SAMPLING(X, k, ℓ). Then, we compute the upper bounds on the sensitivities of the points in X as in Lemma 3.4, which we denote by s_{X,Z}(x) for x ∈ X. Finally, we apply Theorem 3.3 using s_{X,Z} with the smallest m satisfying the condition in the statement, denoted m_Z. Note that s_{X,Z}(x) and m_Z are random variables because the output Z of D^ℓ-SAMPLING is a random variable. It is clear that this algorithm outputs an O(ϵ)-coreset with probability at least 1 − δ. The proof of Lemma 3.5 is deferred to Appendix B.

3.3 Proof of Theorem 3.1

Our algorithm is as follows: we first compute an ϵ-coreset (C, w) of the input X using Lemma 3.5, and then apply the α-approximation algorithm A given in the statement to the coreset. The analysis of the approximation ratio and the average sensitivity is straightforward. We discuss the time complexity of the algorithm above. Suppose that the algorithm A runs in T(n, d, k, ℓ) time. Then, we can compute the coreset (C, w) in poly(n, d, k) time. To run the algorithm A, we need T(|C|, d, k, ℓ) = T(poly(d, k, 1/ϵ, log(1/δ)), d, k, ℓ) time. Hence, in total, we need poly(n, d, k) + T(poly(d, k, 1/ϵ, log(1/δ)), d, k, ℓ) time.

4 Consistent clustering in the random-order model

In this section, we show that algorithms for Euclidean (k, ℓ)-clustering with low average sensitivity can be used to design algorithms for consistent Euclidean (k, ℓ)-clustering in the random-order model. In Section 4.1, we show that an algorithm with low average sensitivity satisfying a certain condition can be transformed into a consistent algorithm. Then, we see in Section 4.2 and Appendix C that the algorithms discussed in Sections 2 and 3, respectively, satisfy the condition and can be transformed into consistent algorithms.

4.1 Low average sensitivity to consistency

We start with the following definition.

Definition 4.1. Let X = {x_1, . . . , x_n} be a set of points in R^d. For i ∈ [n], let Z and Z^(i) be two distributions over subsets of X and X^(i), respectively, of size k. Suppose that Σ_{i=1}^n d_TV(Z^(i), Z)/n ≤ β holds. Then, we say that a probability transportation for Z of average sensitivity β is computable if the following holds:

1. For each i ∈ [n], there is a distribution D^(i) over pairs of sets such that its marginal distributions on the first and second coordinates are equal to Z^(i) and Z, respectively.
2. Σ_{i=1}^n Pr_{(Z^(i), Z) ∼ D^(i)}[Z^(i) ≠ Z]/n ≤ β.
3. For each i ∈ [n], there is a (randomized) algorithm that takes a set Z^(i) and returns a set Z, such that if the input Z^(i) is distributed as the distribution Z^(i), then the distribution of the pair (Z^(i), Z) is equal to D^(i). In particular, Z is distributed as Z.
The following lemma gives a general transformation from algorithms with low average sensitivity, for which a probability transportation is computable, to consistent algorithms:

Lemma 4.2. Let A be a (randomized) algorithm for Euclidean (k, ℓ)-clustering with approximation ratio α and average sensitivity at most β(n) for inputs of size n. Suppose that a probability transportation for A(X) of average sensitivity β(n) is computable for any set X of n points. Then, there is an algorithm for consistent Euclidean (k, ℓ)-clustering in the random-order model such that the expected approximation ratio is α at each step, where the expectation is taken over the randomness of the algorithm, and the expected inconsistency is at most k Σ_{i=1}^n β(i), where the expectation is taken over both the randomness of the algorithm and the arrival order of the input.

Proof. Let X = {x_1, . . . , x_n} be a set of points in R^d, and let σ be a random permutation of [n]. The points are given in the order x_σ(1), . . . , x_σ(n). We first compute Z_σ,1 by running A on x_σ(1). For t ≥ 2, we compute Z_σ,t from Z_σ,t−1 using the probability transportation for A({x_σ(1), . . . , x_σ(t)}) given in the statement. Let D_σ,t be the joint distribution given by the probability transportation, whose marginals on the first and second coordinates are the distributions of Z_σ,t−1 and Z_σ,t, respectively. Our output sequence is Z_σ,1, . . . , Z_σ,n. We observe that the expected approximation ratio is α because the distribution of Z_σ,t is the same as the output distribution of A({x_σ(1), . . . , x_σ(t)}). The expected number of times that the output changes is bounded as
$$\mathbb{E}\left[\sum_{t=2}^{n} \mathbf{1}\{Z_{\sigma,t-1} \neq Z_{\sigma,t}\}\right] = \sum_{t=2}^{n} \mathbb{E}_\sigma \Pr_{(Z, Z') \sim D_{\sigma,t}}[Z \neq Z'] = \sum_{t=2}^{n} \frac{1}{t}\sum_{i=1}^{t} \mathbb{E}\left[\Pr_{(Z, Z') \sim D_{\sigma,t}}[Z \neq Z'] \;\middle|\; \sigma(t) = i\right] \le \sum_{t=2}^{n} \beta(t),$$
where the last inequality holds because, conditioned on the set {x_σ(1), . . . , x_σ(t)}, the point arriving at time t is uniformly distributed among its t elements, so the conditional disagreement probabilities average to at most β(t) by the definition of a probability transportation. As we have E[|Z_σ,t−1 \ Z_σ,t|] ≤ k Pr[Z_σ,t−1 ≠ Z_σ,t], the claim holds.

4.2 Consistent D^ℓ-sampling

In this section, we propose a consistent algorithm that is a slight modification of D^ℓ-SAMPLING, and show the following:

Theorem 4.3. There exists a polynomial-time O(2^ℓ log k)-approximation algorithm for consistent Euclidean (k, ℓ)-clustering in the random-order model with inconsistency O(k² log n).

For a probability distribution Z, we say that we have sampling access to Z if we can sample a value from Z. For z in the support of Z, let Z(z) denote the probability that z is sampled from Z. Then, we say that we have query access to Z if we can obtain the value of Z(z) for a specified z. We use the following lemma.

Lemma 4.4 ([14]). Let Z, Z′ be probability distributions, and suppose that we have sampling and query access to Z and Z′. Then, there is an algorithm called LAZYSAMPLING that, given z sampled from Z, outputs z′ such that
- the distribution of z′ is equal to Z′,
- Pr_{z,z′}[z ≠ z′] ≤ d_TV(Z, Z′),
- the expected number of samples drawn from Z and Z′ is O(1), and
- the expected number of queries to Z and Z′ is O(1).

The following lemma provides the probability transportation for D^ℓ-SAMPLING.

Lemma 4.5. A probability transportation for D^ℓ-SAMPLING(X, k, ℓ) with average sensitivity O(k/n) is computable.

Proof. Fix i ∈ [n]. We provide an algorithm that takes a set Z^(i) = {z^(i)_1, . . . , z^(i)_k} of k points distributed as D^ℓ-SAMPLING(X^(i), k, ℓ) and outputs a set Z = {z_1, . . . , z_k} of k points such that Z is distributed as D^ℓ-SAMPLING(X, k, ℓ) and the probability that Z^(i) ≠ Z is at most d_TV(D^ℓ-SAMPLING(X^(i), k, ℓ), D^ℓ-SAMPLING(X, k, ℓ)). We first set z_1 = LAZYSAMPLING(z^(i)_1, U_{X^(i)}, U_X), where U_{X^(i)} and U_X are the uniform distributions over X^(i) and X, respectively. If z_1 ≠ z^(i)_1, then we compute z_2, . . . , z_k by following the process of D^ℓ-SAMPLING(X, k, ℓ) conditioned on having chosen z_1 as the first point. If z_1 = z^(i)_1, then we compute z_2 = LAZYSAMPLING(z^(i)_2, D_{X^(i),{z_1}}, D_{X,{z_1}}) (see Algorithm 1 for the definition of the D^ℓ-sampling distribution D_{X,Z}). We repeat this process until we compute z_k. We can observe that the distribution of Z is equal to that of D^ℓ-SAMPLING(X, k, ℓ) and that the probability that Z^(i) ≠ Z is at most the total variation distance between D^ℓ-SAMPLING(X^(i), k, ℓ) and D^ℓ-SAMPLING(X, k, ℓ). Hence, the claim holds.

Proof of Theorem 4.3. Consider the algorithm obtained by combining Lemmas 4.2 and 4.5. It is clearly an O(2^ℓ log k)-approximation algorithm, and its inconsistency is k Σ_{t=1}^n O(k/t) = O(k² log n).
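The coupling step in the proof of Lemma 4.5 can be made concrete with a maximal-coupling routine. The sketch below is ours and is only an illustrative stand-in for the LAZYSAMPLING routine of [14]: it enumerates the support explicitly (as Python dicts), whereas LAZYSAMPLING needs only sample and query access; the function name and interface are assumptions for the sake of the example.

```python
import random

def maximal_coupling_step(z, P, Q, rng=random):
    """Given z drawn from P, return a value distributed according to Q such
    that it differs from z with probability at most d_TV(P, Q).
    P and Q are dicts mapping outcomes to probabilities."""
    p, q = P.get(z, 0.0), Q.get(z, 0.0)
    if p > 0 and rng.random() < min(p, q) / p:
        return z  # the coupling keeps the previous draw here
    # Otherwise, resample from the part of Q that exceeds P.
    residual = {y: max(Q.get(y, 0.0) - P.get(y, 0.0), 0.0) for y in Q}
    total = sum(residual.values())
    if total == 0.0:  # P and Q coincide; nothing to correct
        return z
    outcomes = list(residual)
    weights = [residual[y] for y in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

# In the setting of Lemma 4.5, P would be the D^l-sampling distribution for
# X^(i) (or the uniform distribution over X^(i)) and Q its counterpart for X,
# applied coordinate by coordinate as long as the draws keep agreeing.
```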
5 Dynamic clustering in the random-order model

In this section, we consider algorithms for dynamic Euclidean (k, ℓ)-clustering in the random-order model. Our algorithms are obtained by carefully implementing the consistent algorithms discussed in Section 4 so that the amortized update time is small. We provide efficient implementations of the algorithms discussed in Section 4.2 in Section 5.1. We also show, in Appendix D, that the consistent transformation discussed in Appendix C can be implemented efficiently.

5.1 Dynamic D^ℓ-sampling

In this section, we show the following:

Theorem 5.1. There exists an O(2^ℓ log k)-approximation algorithm for dynamic Euclidean (k, ℓ)-clustering in the random-order model with amortized update time O(dk + (k + log n)k log n).

Proof. Let x_1, . . . , x_n ∈ R^d be the input points (which arrive in this order). Throughout the process, we maintain points z_1, . . . , z_k and sequences S_1, . . . , S_k of real values. When x_t (t ∈ [n]) arrives, for each j ∈ [k], we update z_j so that its distribution is equal to that of the j-th point selected by D^ℓ-SAMPLING(X_t, k, ℓ), where X_t = {x_1, . . . , x_t}. Also, we update S_j to be the sequence cost^ℓ_{Z_{j−1}}(x_1), . . . , cost^ℓ_{Z_{j−1}}(x_t), where Z_j = {z_1, . . . , z_j}. When j = 1, we regard cost^ℓ_{Z_0}(x_i) = 1 for every i ∈ [t].

For each j ∈ [k], we maintain a binary tree on S_j so that we can append a value and compute the sum of the values in an arbitrary (consecutive) subsequence of S_j in O(log n) time. We can support these queries using, say, a segment tree. Note that by querying consecutive sums on S_j, we can compute p_{X_t, Z_{j−1}}(x) (see Algorithm 1 for the definition), which is required to simulate D^ℓ-SAMPLING, in O(log n) time. Also, we can draw a sample from the distribution D_{X_t, Z_{j−1}} in O(log² n) time using the consecutive-sum queries (e.g., draw a number uniformly from [0, cost^ℓ_{Z_{j−1}}(X_t)] and then perform binary search to find the position that the number belongs to). Note that for j = 1, the distribution Z_{t,1} is just the uniform distribution over X_t.

We initialize z_1, . . . , z_k and S_1, . . . , S_k using the first point x_1. That is, z_1 = · · · = z_k = x_1, which is the set of k points selected by D^ℓ-SAMPLING({x_1}, k, ℓ). Then for each j ∈ [k], we initialize S_j to be the sequence consisting of the single value cost^ℓ_{Z_{j−1}}(x_1)/cost^ℓ_{Z_{j−1}}(X_1) = 1.

Below, we explain how we update z_1, . . . , z_k and S_1, . . . , S_k when x_t for t > 1 arrives. When updating z_j using LAZYSAMPLING as in Section 4.2, we need sampling and query access to Z_{t−1,j} and Z_{t,j}, the distributions of the j-th point selected by D^ℓ-SAMPLING for the inputs X_{t−1} and X_t, respectively, conditioned on the current z_1, . . . , z_{j−1}. We can provide sampling and query access to the former by using the current binary tree on S_j, and to the latter by using the binary tree obtained by appending the value for x_t (we can access both by keeping the old tree at this step). If z_j is updated to a different point, we rebuild the binary trees on S_j, . . . , S_k from scratch.
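As a concrete stand-in for the binary trees above, the sketch below (ours) uses a Fenwick (binary indexed) tree in place of the segment tree mentioned in the proof; naming it this way is a substitution, not the paper's choice. It supports appending a cost value, prefix sums in O(log n), and drawing an index with probability proportional to its value by binary search over prefix sums in O(log² n), matching the operations needed to simulate D_{X_t, Z_{j−1}}.

```python
import random

class FenwickTree:
    """Append-only Fenwick (binary indexed) tree over nonnegative weights."""
    def __init__(self):
        self.tree = [0.0]  # 1-indexed; tree[0] unused
        self.n = 0

    def append(self, value):
        """Append a new weight as position n+1 in O(log n)."""
        self.n += 1
        i = self.n
        # The new node covers positions (i - lowbit(i), i]; assemble its sum
        # from the already-stored nodes immediately to its left.
        s, k = value, 1
        while k < (i & -i):
            s += self.tree[i - k]
            k <<= 1
        self.tree.append(s)

    def prefix_sum(self, i):
        """Sum of weights at positions 1..i in O(log n)."""
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def sample(self, rng=random):
        """Return an index with probability proportional to its weight,
        via binary search over prefix sums (O(log^2 n))."""
        target = rng.random() * self.prefix_sum(self.n)
        lo, hi = 1, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self.prefix_sum(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

A top-down descent through the tree would reduce sampling to O(log n), but the O(log² n) binary search already matches the bound used in the analysis.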
Now we analyze the amortized update time. At each step t, we need O(dk) time to compute cost^ℓ_{Z_1}(x_t), . . . , cost^ℓ_{Z_k}(x_t), O(k log t) time to append the corresponding values to the binary trees on S_1, . . . , S_k, and O(k log² n) time to update z_1, . . . , z_k. When some z_j is replaced, we rebuild the binary trees on S_j, . . . , S_k, which takes O(kt log t) time. Recalling that the probability of replacing some of the z_j's at the t-th step is O(k/t), we need
$$\sum_{t=1}^{n}\big(O(dk) + O(k\log t)\big) + O(k\log^2 n)\cdot n + \sum_{t=1}^{n} O(kt\log t)\cdot O\!\left(\frac{k}{t}\right) = O\big(dkn + (k + \log n)kn\log n\big)$$
time in total. Hence the amortized update time is O(dk + (k + log n)k log n).

6 Conclusion

We have shown that the average sensitivity of D^ℓ-SAMPLING is small and that any approximation algorithm can be transformed into one with low average sensitivity. We then showed that these algorithms can be used to obtain consistent and dynamic algorithms in the random-order model. We believe consistent and dynamic algorithms for other problems can also be obtained via average sensitivity analysis.

It is an interesting open question whether we can remove the dependency on d from the sensitivity bound in Theorem 3.1. A natural idea is to apply dimension reduction, e.g., [19]. However, it is not clear whether we can recover a solution for the original space from one for the reduced space with a small average sensitivity. Another natural idea is to use coresets of size independent of d, e.g., [6, 13]. A challenge here is that their constructions are more complicated than the importance sampling used in this work, and their average sensitivity might be high.

Acknowledgments

Y.Y. is supported by JST, PRESTO Grant Number JPMJPR192B.

References

[1] B. Aaron, D. E. Tamir, N. D. Rishe, and A. Kandel. Dynamic incremental k-means clustering. In Proceedings of the International Conference on Computational Science and Computational Intelligence (CSCI), volume 1, pages 308–313, 2014.
[2] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.
[3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.
[4] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035, 2007.
[5] V. Braverman, D. Feldman, and H. Lang. New frameworks for offline and streaming coreset constructions. CoRR, abs/1612.00889, 2016.
[6] V. Cohen-Addad, D. Saulpic, and C. Schwiegelshohn. A new coreset framework for clustering. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 169–182, 2021.
[7] S. Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California, 2008.
[8] H. Fichtenberger, S. Lattanzi, A. Norouzi-Fard, and O. Svensson. Consistent k-clustering for general metrics. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2660–2678, 2021.
[9] B. Ghazi, R. Kumar, and P. Manurangsi. Differentially private clustering: Tight approximation ratios. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pages 4040–4054, 2020.
[10] G. Goranci, M. Henzinger, D. Leniowski, C. Schulz, and A. Svozil. Fully dynamic k-center clustering in low dimensional metrics. In 2021 Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX), pages 143–153, 2021.
[11] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pages 291–300, 2004.
[12] M. Henzinger and S. Kale. Fully-dynamic coresets. In 28th Annual European Symposium on Algorithms (ESA), volume 173, pages 57:1–57:21, 2020.
[13] L. Huang and N. K. Vishnoi. Coresets for clustering in Euclidean spaces: Importance sampling is nearly optimal. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1416–1429, 2020.
[14] P. E. Jacob, J. O'Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020.
[15] S. Kumabe and Y. Yoshida. Average sensitivity of dynamic programming. In Proceedings of the 33rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1925–1961, 2022.
[16] S. Kumabe and Y. Yoshida. Average sensitivity of the knapsack problem. In 30th Annual European Symposium on Algorithms (ESA), volume 244, pages 75:1–75:14, 2022.
[17] S. Lattanzi and S. Vassilvitskii. Consistent k-clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1975–1984, 2017.
[18] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[19] K. Makarychev, Y. Makarychev, and I. Razenshteyn. Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. SIAM Journal on Computing, pages STOC19-269–STOC19-297, 2022.
[20] S. Murai and Y. Yoshida. Sensitivity analysis of centralities on unweighted networks. In Proceedings of the 2019 World Wide Web Conference (WWW), pages 1332–1342, 2019.
[21] P. Peng and Y. Yoshida. Average sensitivity of spectral clustering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1132–1140, 2020.
[22] U. Stemmer and H. Kaplan. Differentially private k-means with constant multiplicative error. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS), pages 5436–5446, 2018.
[23] N. Varma and Y. Yoshida. Average sensitivity of graph algorithms. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 684–703, 2021.
[24] U. von Luxburg. Clustering stability: An overview. Now Publishers Inc, 2010.
[25] Y. Yoshida and S. Zhou. Sensitivity analysis of the maximum matching problem. In Innovations in Theoretical Computer Science (ITCS), pages 58:1–58:20, 2021.