# Replicable Clustering

Hossein Esfandiari (Google Research, esfandiari@google.com), Amin Karbasi (Yale University and Google Research, amin.karbasi@yale.edu), Vahab Mirrokni (Google Research, mirrokni@google.com), Grigoris Velegkas (Yale University, grigoris.velegkas@yale.edu), Felix Zhou (Yale University, felix.zhou@yale.edu). Authors are listed alphabetically.

We design replicable algorithms in the context of statistical clustering under the recently introduced notion of replicability from Impagliazzo et al. [2022]. According to this definition, a clustering algorithm is replicable if, with high probability, its output induces the exact same partition of the sample space after two executions on different inputs drawn from the same distribution, when its internal randomness is shared across the executions. We propose such algorithms for the statistical k-medians, statistical k-means, and statistical k-centers problems by utilizing approximation routines for their combinatorial counterparts in a black-box manner. In particular, we demonstrate a replicable O(1)-approximation algorithm for statistical Euclidean k-medians (k-means) with O(poly(k, d) · k^{log log k}) sample complexity. We also describe an O(1)-approximation algorithm with an additional O(1)-additive error for statistical Euclidean k-centers, albeit with O(poly(k) exp(d)) sample complexity. In addition, we provide experiments on synthetic distributions in 2D using the k-means++ implementation from sklearn as a black box that validate our theoretical results (code: https://anonymous.4open.science/r/replicable_clustering_experiments-E380).

1 Introduction

The unprecedented increase in the amount of data that is available to researchers across many different scientific areas has led to the study and development of automated data analysis methods. One fundamental category of such methods is unsupervised learning, which aims to identify some inherent structure in unlabeled data. Perhaps the most well-studied way to do that is by grouping together data that share similar characteristics. As a result, clustering algorithms have become one of the central objects of study in unsupervised learning. Despite a very long line of work studying such algorithms, e.g., Jain and Dubes [1988], Hart et al. [2000], Anderberg [2014], there is no agreed-upon definition that quantifies the quality of a clustering solution. Kleinberg [2002] showed that there is an inherent reason why this is the case: it is impossible to design a clustering function that satisfies three natural properties, namely scale-invariance, richness of solutions, and consistency. This means that the algorithm designer needs to balance several conflicting desiderata. As a result, the radically different approaches that scientists use depending on their application domain can be sensitive to several factors such as their random initialization, the measure of similarity of the data, the presence of noise in the measurements, and the existence of outliers in the dataset. All these issues give rise to algorithms whose results are not replicable, i.e., when we execute them on two different samples of the same population, they output solutions that vary significantly. This begs the following question: since the goal of clustering algorithms is to reveal properties of the underlying population, how can we trust and utilize their results when they fail to pass this simple test?
Replicability is imperative in making sure that scientific findings are both valid and reliable. Researchers have an obligation to provide coherent results and conclusions across multiple repetitions of the same experiment. Shockingly, a 2016 survey that appeared in Nature [Baker, 2016] revealed that 70% of researchers tried, but were unable to, replicate the findings of another researcher and more than 50% of them believe there is a significant crisis in replicability. Unsurprisingly, similar worries have been echoed in the subfields of machine learning and data science [Pineau et al., 2019, 2021]. In this work, we initiate the study of replicability in clustering, which is one of the canonical problems of unsupervised learning. 1.1 Related Works Statistical Clustering. The most relevant previous results for (non-replicable) statistical clustering was established by Ben-David [2007], who designed O(1)-approximation algorithms for statistical k-medians (k-means) with O(k) sample complexity. However, their algorithm picks centers from within the samples and is therefore non-replicable. Combinatorial Clustering. The flavor of clustering most studied in the approximation algorithms literature is the setting where we have a uniform distribution over finite points and the algorithm has explicit access to the entire distribution. Our algorithms rely on having black-box access to a combinatorial clustering oracle. See Byrka et al. [2017], Ahmadian et al. [2019] for the current best polynomial-time approximation algorithms for combinatorial k-medians (k-means) in general metrics with approximation ratio 2.675 (9). Also, see Cohen-Addad et al. [2022] for a 2.406 (5.912) approximation algorithm for the combinatorial Euclidean k-medians (k-means). Clustering Stability. Stability in clustering has been studied both from a practical and a theoretical point of view [Ben-Hur et al., 2001, Lange et al., 2004, Von Luxburg and Ben-David, 2005, Ben-David et al., 2006, Rakhlin and Caponnetto, 2006, Ben-David et al., 2007, Von Luxburg et al., 2010]. In most applications, it is up to the algorithm designer to decide upon the value of k, i.e., the number of different clusters. Thus, it was proposed that a necessary condition it should satisfy is that it leads to solutions that are not very far apart under resampling of the input data [Ben-Hur et al., 2001, Lange et al., 2004]. However, it was shown that this notion of stability for center-based clustering is heavily based on symmetries within the data which may be unrelated to clustering parameters [Ben-David et al., 2006]. Our results differ from this line of work in that we require the output across two separate samples to be exactly the same with high probability, when the randomness is shared. Moreover, our work reaffirms Ben-David et al. [2006] in that their notion of stability can be perfectly attained no matter the choice of k. Other notions of stability related to our work include robust hierarchical clustering [Balcan et al., 2014], robust online clustering [Lattanzi et al., 2021], average sensitivity [Yoshida and Ito, 2022], and differentially private (DP) clustering [Cohen et al., 2021, Ghazi et al., 2020]. The definition of replicability we use is statistical and relies on an underlying data distribution while (DP) provides a worst-case combinatorial guarantee for two runs of the algorithm on neighboring datasets. Bun et al. [2023], Kalavasis et al. [2023] provide connections between DP and replicability for statistical learning problems. 
However, these transformations are not computationally efficient. It would be interesting to come up with computationally efficient reductions between replicable and DP clustering algorithms.

Coresets for Clustering. A long line of work has focused on developing strong coresets for various flavors of centroid-based clustering problems. See Sohler and Woodruff [2018] for an overview of this rich line of work. The most relevant for our results include coresets for dynamic geometric streams through hierarchical grids [Frahling and Sohler, 2005] and sampling-based techniques [Ben-David, 2007, Feldman and Langberg, 2011, Bachem et al., 2018].

Dimensionality Reduction. Dimensionality reduction for clustering has been a popular area of study as it reduces both the time and space complexity of existing algorithms. The line of work on data-oblivious dimensionality reduction for k-means clustering was initiated by Boutsidis et al. [2010]. The goal is to approximately preserve the cost of all clustering solutions after passing the data through a dimensionality reduction map. This result was later improved and generalized to (k, p)-clustering [Cohen et al., 2015, Becchetti et al., 2019], culminating in the work of Makarychev et al. [2019], whose bound on the target dimension is sharp up to a factor of log(1/ε). While Charikar and Waingarten [2022] overcome this factor, their result only preserves the cost of the optimal solution.

Replicability in ML. Our results extend the recently initiated line of work on designing provably replicable learning algorithms under the definition that was introduced by Impagliazzo et al. [2022]. Later, Esfandiari et al. [2022] considered a natural adaptation of this definition to the setting of bandits and designed replicable algorithms that have small regret. A slightly different notion of replicability in optimization was studied in Ahn et al. [2022], where it is required that an optimization algorithm that uses noisy operations during its execution, e.g., noisy gradient evaluations, outputs solutions that are close when executed twice. Subsequently, Bun et al. [2023], Kalavasis et al. [2023] established strong connections between replicability and other notions of algorithmic stability. Recently, Dixon et al. [2023], Chase et al. [2023] proposed a weaker notion of replicability where the algorithm is not required to output the same solution across two executions, but its output needs to fall into a small list of solutions.

2 Setting & Notation

Let X ⊆ R^d be the instance space endowed with a metric κ : X × X → R_+, and let P be a distribution on X which generates the i.i.d. samples that the learner observes. For F ⊆ R^d and x ∈ X, we overload the notation and write F(x) := argmin_{f ∈ F} κ(x, f) for the closest point to x in F, as well as κ(x, F) := κ(x, F(x)) for the shortest distance from x to a point in F. We assume that X is a subset of B_d, the d-dimensional κ-ball of diameter 1 centered about the origin (our results can be generalized to κ-balls of diameter L with an arbitrary center through translation and scaling). We also assume that κ is induced by some norm on R^d that is sign-invariant (invariant to changing the sign of a coordinate) and normalized (the canonical basis has unit length). Under these assumptions, the unit ball of κ is a subset of [−1, 1]^d. Our setting captures a large family of norms, including the ℓ_p-norms, Top-ℓ norms (the sum of the ℓ largest coordinates in absolute value), and ordered norms (non-negative linear combinations of Top-ℓ norms) [Chakrabarty and Swamy, 2019].
Our results hold for more general classes of norms, but for the sake of simplicity we abide by these assumptions. We define ∆ := sup{κ(x, y) : x, y ∈ [0, 1)^d} to be the κ-diameter of the unit hypercube. Note that 1 ≤ ∆ ≤ d by assumption. Moreover, ∆L is the κ-diameter of a hypercube with side length L. For example, if κ = ℓ_2 is the Euclidean norm, then ∆ = √d.

2.1 Clustering Methods and Generalizations

We now introduce the clustering objectives that we study in this work, which all fall in the category of minimizing a cost function cost : F → R_+, where F := {F ⊆ B_d : |F| = k}. We write cost(F) to denote the objective in the statistical setting and ĉost(F) for the combinatorial setting in order to distinguish the two.

Problem 2.1 (Statistical (k, p)-Clustering). Given i.i.d. samples from a distribution P on X ⊆ B_d, minimize cost(F) := E_{x∼P}[κ(x, F)^p].

In other words, we need to partition the points into k clusters so that the expected distance of a point to the center of its cluster, measured by κ(·, ·)^p, is minimized. This is closely related to the well-studied combinatorial variant of the (k, p)-clustering problem.

Problem 2.2 ((k, p)-Clustering). Given some points x_1, . . . , x_n ∈ X, minimize ĉost(F) := (1/n) Σ_{i=1}^n κ(x_i, F)^p.

We note that k-medians and k-means (respectively, statistical k-medians and statistical k-means) are the special cases of Problem 2.2 (respectively, Problem 2.1) with p = 1 and p = 2. We also consider a slight variant of the combinatorial problem, i.e., Problem 2.2, where we allow different points x_i to participate with different weights w_i in the objective. We refer to this problem as the weighted (k, p)-clustering problem. We now shift our attention to the k-centers problem.

Problem 2.3 (Statistical k-Centers). Given i.i.d. samples from a distribution P on X ⊆ B_d, minimize cost(F) := max_{x∈X} κ(x, F).

Notice that the ℓ_∞ norm is the limit of the ℓ_p norm as p tends to infinity; hence k-centers is, in some sense, the limit of (k, p)-clustering as p tends to infinity. Also, notice that this problem differs from k-means and k-medians in the sense that it has a min-max flavor, whereas the other two are concerned with minimizing some expected values. Due to this difference, we need to treat k-centers separately from the other two problems, and we need to make some assumptions in order to be able to solve it from samples (cf. Assumption F.1, Assumption F.2). We elaborate more on that later. Let us also recall the combinatorial version of k-centers.

Problem 2.4 (k-Centers). Given some points x_1, . . . , x_n ∈ X, minimize ĉost(F) := max_{i∈[n]} κ(x_i, F).

We remark that clustering has mainly been studied from the combinatorial point of view, where the distribution is the uniform distribution over some finite points and we are provided the entire distribution. The statistical clustering setting generalizes this to arbitrary distributions with only sample access. We emphasize that although we only have access to samples, our output should be a good solution for the entire distribution and not just the observed data. We write F_OPT to denote an optimal solution for the entire distribution and OPT := cost(F_OPT). Similarly, we write F̂_OPT to denote an optimal sample solution and ÔPT := ĉost(F̂_OPT). Suppose we solve Problem 2.4 given a sample of size n from Problem 2.3. Then ÔPT ≤ OPT since we are optimizing over a subset of the points. Recall that a β-approximate solution F is one which has cost(F) ≤ β · OPT.
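To make the objectives concrete, the following is a minimal NumPy sketch of the empirical (k, p)-clustering cost of Problem 2.2 in the Euclidean case (κ = ℓ_2), together with the weighted variant used later on coresets. The function names and parameter defaults are illustrative, not part of the paper's algorithms.

```python
import numpy as np

def kp_cost(points, centers, p=2):
    """Empirical (k, p)-clustering cost: (1/n) * sum_i min_{f in F} ||x_i - f||_2^p."""
    # Pairwise Euclidean distances between the n points and the k centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return float(np.mean(dists.min(axis=1) ** p))

def weighted_kp_cost(points, weights, centers, p=2):
    """Weighted variant of Problem 2.2, e.g. for clustering a finite weighted coreset."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    w = np.asarray(weights) / np.sum(weights)
    return float(w @ (dists.min(axis=1) ** p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-0.5, 0.5, size=(1000, 2))   # points in the unit cube
    F = rng.uniform(-0.5, 0.5, size=(3, 2))      # k = 3 candidate centers
    print(kp_cost(X, F, p=1))                     # k-medians objective
    print(kp_cost(X, F, p=2))                     # k-means objective
```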
Note that the notion of a β-approximate solution above is with respect to the statistical version of our problems. An algorithm that outputs β-approximate solutions is known as a β-approximation algorithm. We also say that F is a (β, B)-approximate solution if cost(F) ≤ β · OPT + B.

2.1.1 Parameters p and κ

Here we clarify the difference between p and κ, which are two separate entities in the cost function E_x[κ(x, F(x))^p]. We denote by κ the distance metric used to measure the similarity between points. The most commonly studied and applied option is the Euclidean distance, for which our algorithms are the most sample-efficient. On the other hand, p is the exponent to which we raise the distances when computing the cost of a clustering. A smaller choice of p puts less emphasis on points that are far away from centers, and p = 1 seeks to control the average distance to the nearest center. A large choice of p puts emphasis on points that are further away from centers, and as p tends to infinity, the objective is biased towards solutions minimizing the maximum distance to the nearest center. Thus we can think of k-centers as (k, p)-clustering when p = ∞. As a concrete example, when κ is the Euclidean distance and p = 5, the cost function becomes E_x[‖x − F(x)‖_2^5].

2.2 Replicability

Throughout this work, we study replicability as an algorithmic property using the definition of Impagliazzo et al. [2022]. (Originally this property was called reproducibility [Impagliazzo et al., 2022], but it was later pointed out that the correct term is replicability [Ahn et al., 2022].)

Definition 2.5 (Replicable Algorithm; [Impagliazzo et al., 2022]). Let ρ ∈ (0, 1). A randomized algorithm A is ρ-replicable if, for two sequences X̄, Ȳ of n i.i.d. samples generated from some distribution P^n and a random binary string r drawn from the distribution R of internal random strings,
P_{X̄, Ȳ ∼ P^n, r ∼ R}{A(X̄; r) = A(Ȳ; r)} ≥ 1 − ρ.

In the above definition, we treat A as a randomized mapping to solutions of the clustering problem. Thus, even when X̄ is fixed, A(X̄) should be thought of as a random variable, whereas A(X̄; r) is the realization of this variable given the (fixed) X̄, r. We should think of r as the shared randomness between the two executions. In practice, it can be implemented as a shared random seed. We underline that sharing the randomness across executions is crucial for the development of our algorithms. We also note that by doing so we couple the two random variables A(X̄), A(Ȳ), whose realizations depend on r ∼ R. Thus, if their realizations are equal with high probability under this coupling, it means that the distributions of A(X̄), A(Ȳ) are statistically close. This connection is discussed further in Kalavasis et al. [2023]. In the context of a clustering algorithm A, we interpret the output A(X̄; r) as a clustering function f : X → [k] which partitions the support of P. The definition of ρ-replicability demands that f is the same with probability at least 1 − ρ across two executions. We note that in the case of centroid-based clustering such as k-medians and k-means, the induced partition is a function of the centers, and thus it is sufficient to output the exact same centers with probability 1 − ρ across two executions. However, we also allow for algorithms that create partitions implicitly without computing their centers explicitly. Our goal is to develop replicable clustering algorithms for k-medians, k-means, and k-centers, which necessitates that the centers we choose are arbitrary points within R^d and not only points among the samples.
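Definition 2.5 can be probed empirically: run the algorithm on two independent samples while reusing the same internal random seed, and check how often the outputs coincide. The sketch below is an illustrative Monte Carlo estimator of the failure probability ρ; `cluster_algo` and `draw_sample` are hypothetical placeholders for a candidate clustering algorithm and a sampler from P, and equality of outputs is checked numerically rather than exactly.

```python
import numpy as np

def estimate_replicability_failure(cluster_algo, draw_sample, n, trials=100, seed=0):
    """Monte Carlo estimate of P[A(X; r) != A(Y; r)] when r is shared across executions.

    cluster_algo(sample, rng) should return a canonical output (e.g. sorted centers);
    draw_sample(n, rng) should return n i.i.d. points from the data distribution P.
    """
    master = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        data_rng = np.random.default_rng(master.integers(2**32))  # fresh data randomness
        r = int(master.integers(2**32))                            # shared internal randomness
        X = draw_sample(n, data_rng)
        Y = draw_sample(n, data_rng)
        out_x = cluster_algo(X, np.random.default_rng(r))          # same seed r ...
        out_y = cluster_algo(Y, np.random.default_rng(r))          # ... reused on the second run
        if not np.allclose(out_x, out_y):
            failures += 1
    return failures / trials
```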
We underline that, as in the case of differential privacy, it is trivial to design algorithms that satisfy the replicability property, e.g., we can let A be the constant mapping. The catch is that these algorithms do not achieve any utility. In this work, we are interested in designing replicable clustering algorithms whose utility is competitive with their non-replicable counterparts.

3 Main Results

In this section, we informally state our results for replicable statistical k-medians ((k, 1)-clustering), k-means ((k, 2)-clustering), and k-centers under general distances. Unfortunately, generality comes at the cost of an exponential dependency on the dimension d. We also state our results for replicable statistical k-medians and k-means specifically under the Euclidean distance, which have a polynomial dependency on d. Two key ingredients are the uniform convergence of (k, p)-clustering costs (cf. Theorem C.9) as well as a data-oblivious dimensionality reduction technique for (k, p)-clustering in the distributional setting (cf. Theorem E.10). These results may be of independent interest. We emphasize that Euclidean k-medians and k-means are the most studied and applied flavors of clustering; thus the sample complexity for the general case and the restriction to p = 1, 2 do not diminish the applicability of our approach. The main bottleneck in reducing the sample complexity for general norms is the lack of a data-oblivious dimensionality reduction scheme. This bottleneck is not unique to replicability, and such a scheme for general norms would be immediately useful for many distance-based problems including clustering. It may be possible to extend our results to general (k, p)-clustering beyond p = 1, 2. The main challenge is to develop an approximate triangle inequality for p-th powers of norms. Again, this limitation is not due to replicability but rather to the technique of hierarchical grids. It is a limitation shared by Frahling and Sohler [2005]. Before stating our results, we reiterate that the support of our domain X is a subset of the unit-diameter κ-ball B_d. In particular, we have that OPT ≤ 1.

Theorem 3.1 (Informal). Let ε, ρ ∈ (0, 1). Given black-box access to a β-approximation oracle for weighted k-medians, respectively weighted k-means (cf. Problem 2.2), there is a ρ-replicable algorithm for statistical k-medians, respectively k-means (cf. Problem 2.1), such that with probability at least 0.99, it outputs a (1 + ε)β-approximation. Moreover, the algorithm has sample complexity polynomial in k, 1/ε, 1/ρ, and 1/OPT, with an exponential dependence on d (see Theorem 4.2 and Theorem 4.3 for the precise bounds).

When we are working in Euclidean space, we can get improved results for these problems.

Theorem 3.2 (Informal). Let ρ ∈ (0, 1). Suppose we are provided with black-box access to a β-approximation oracle for weighted Euclidean k-medians (k-means). Then there is a ρ-replicable algorithm that partitions the input space such that, with probability at least 0.99, the cost of the partition is at most O(β · OPT). Moreover, its sample complexity is polynomial in d, 1/ρ, and 1/OPT, with a (k/ρ)^{O(log log(k/ρ))} dependence on k and ρ (see Theorem 5.1 for the precise bound).

We underline that in this setting we compute an implicit solution to Problem 2.1, since we do not output k centers. Instead, we output a function f that takes as input a point x ∈ X and outputs the label of the cluster it belongs to in polynomial time. The replicability guarantee states that, with probability 1 − ρ, the function will be the same across two executions.
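For experimentation, the black-box oracle in these statements can be instantiated with any off-the-shelf weighted solver. The sketch below wraps sklearn's KMeans, which accepts per-point sample weights, as a stand-in for a β-approximation oracle for weighted Euclidean k-means; note that k-means++ only guarantees an O(log k)-approximation in expectation, so this is an illustrative substitute rather than one of the constant-factor algorithms cited in the next paragraph.

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_kmeans_oracle(points, weights, k, seed=0):
    """Stand-in for a black-box weighted k-means approximation oracle.

    points:  (m, d) array, e.g. the support of a finite coreset
    weights: (m,) array of nonnegative probability masses
    Returns k centers in R^d.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(points, sample_weight=weights)
    return km.cluster_centers_

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    support = rng.uniform(-0.5, 0.5, size=(50, 2))   # small finite support
    mass = rng.dirichlet(np.ones(50))                # probability mass per support point
    print(weighted_kmeans_oracle(support, mass, k=3))
```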
The combinatorial k-medians (k-means) problem, where the centers are restricted to be points of the input, is a well-studied problem from the perspective of polynomial-time constant-factor approximation algorithms. See Byrka et al. [2017], Ahmadian et al. [2019] for the current best polynomial-time approximation algorithms for combinatorial k-medians (k-means) in general metrics with approximation ratio 2.675 (9). Also, see Cohen-Addad et al. [2022] for a 2.406 (5.912) approximation algorithm for the combinatorial Euclidean k-medians (k-means). As we alluded to before, in order to solve k-centers from samples, we need to make an additional assumption. Essentially, Assumption F.2 states that there is a (β, B)-approximate solution F such that, with some constant probability, e.g., 0.99, when we draw n samples from P we will observe at least one sample from each cluster of F.

Theorem 3.3 (Informal). Let c ∈ (0, 1). Given black-box access to a (β̂, B̂)-approximation oracle for k-centers (cf. Problem 2.4) and under Assumption F.2, there is a ρ-replicable algorithm for statistical k-centers (cf. Problem 2.3) that outputs an (O(β + β̂), O(B + B̂ + (β + β̂ + 1)c))-approximate solution with probability at least 0.99. Moreover, its sample complexity is polynomial in n and k and depends on the dimension through a (1/c)^{O(d)} factor (see Theorem F.8 for the precise bound).

Recall that there is a simple greedy 2-approximation for the sample k-center problem whose approximation ratio cannot be improved unless P = NP [Hochbaum and Shmoys, 1985]. We defer the discussion around k-centers to Appendix F. In particular, see Theorem F.8 in Appendix F for the formal statement of Theorem 3.3.

4 Overview of (k, p)-Clustering

In this section, we present our approach to the (k, p)-clustering problem. First, we replicably approximate the distribution with a finite set of points by extending the approach of Frahling and Sohler [2005] to the distributional setting. Then, we solve the combinatorial (k, p)-clustering problem on this coreset using an approximation oracle in a black-box manner. In the following subsections, we give a more detailed overview of each step of our approach. For the full proofs and technical details, we kindly refer the reader to Appendices D.4–D.6. In summary:

1) Replicably build a variant of a quad tree [Finkel and Bentley, 1974] (cf. Section 4.2).
2) Replicably produce a weighted coreset using the quad tree (cf. Section 4.1).
3) Apply the optimization oracle for the combinatorial problem on the coreset.

For general norms, this approach leads to an exponential dependence on d. However, we are able to handle the case of Euclidean distances by extending existing dimensionality reduction techniques for sample Euclidean (k, p)-clustering [Makarychev et al., 2019] to the distributional case (cf. Section 5). Thus, for the widely used Euclidean norm, our algorithm has poly(d) sample complexity.

4.1 Coresets

Definition 4.1 ((Strong) Coresets). For a distribution P with support X ⊆ B_d and ε ∈ (0, 1), a (strong) ε-coreset for X is a distribution P′ on X′ ⊆ B_d which satisfies
(1 − ε) E_{x∼P}[κ(x, F)^p] ≤ E_{x′∼P′}[κ(x′, F)^p] ≤ (1 + ε) E_{x∼P}[κ(x, F)^p]
for every set of centers F ⊆ B_d, |F| = k.

Essentially, coresets help us approximate the true cost on the distribution P by considering another distribution P′ whose support X′ can be arbitrarily smaller than the support of P.
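As a simplified preview of the construction described in Section 4.2, the sketch below consolidates a sample in [−1/2, 1/2]^d onto the centers of "heavy" hierarchical grid cells: at each level, cells whose empirical mass clears a randomly shifted cutoff are refined, and each point is mapped to the center of the deepest heavy cell containing it. The fixed depth and the single threshold scale `gamma` are hypothetical simplifications (the actual algorithm scales its threshold with the level and with an estimate Λ of OPT); the randomized cutoff, drawn from the shared seed, is what keeps the selected cells stable across two executions.

```python
import numpy as np

def replicable_grid_coreset(points, max_depth=4, gamma=0.02, seed=0):
    """Map each point in [-1/2, 1/2]^d to the center of the deepest 'heavy' grid cell containing it.

    A level-i cell (side 2^{-i}) is heavy if its empirical mass clears a cutoff drawn uniformly
    from [gamma, 2 * gamma] using the shared seed; the random shift makes the selection robust
    to the O(1/sqrt(n)) fluctuations of empirical masses across two samples from the same P.
    Returns the distinct representatives and their empirical masses (a weighted coreset).
    """
    rng = np.random.default_rng(seed)                      # shared internal randomness
    n, _ = points.shape
    reps = np.zeros_like(points)                           # level-0 representative: center of the root cell
    for i in range(1, max_depth + 1):
        side = 2.0 ** (-i)
        cutoff = rng.uniform(gamma, 2 * gamma)             # randomized mass threshold for level i
        idx = np.floor((points + 0.5) / side).astype(int)  # integer cell index of each point
        cells, inv, counts = np.unique(idx, axis=0, return_inverse=True, return_counts=True)
        inv = inv.ravel()                                  # flatten for compatibility across NumPy versions
        heavy = (counts / n) >= cutoff
        centers = (cells + 0.5) * side - 0.5               # geometric center of each occupied cell
        mask = heavy[inv]                                  # points lying in a heavy level-i cell
        reps[mask] = centers[inv[mask]]                    # refine their representative
    support, masses = np.unique(reps, axis=0, return_counts=True)
    return support, masses / n
```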
Inspired by Frahling and Sohler [2005], the idea is to replicably consolidate our distribution P through some mapping R : X → X whose image R(X) has small (finite) cardinality, so that for any set of centers F,
(1 − ε) E_x[κ(x, F)^p] ≤ E_x[κ(R(x), F)^p] ≤ (1 + ε) E_x[κ(x, F)^p],
where ε ∈ (0, 1) is some error parameter. In other words, (P_R, R(X)) is an ε-coreset. Note that given the function R, we can replicably estimate the probability mass at each point in R(X) and then apply a weighted (k, p)-clustering algorithm.

4.2 Replicable Quad Tree

We now explain how to replicably obtain the mapping R : X → X by building upon the work of Frahling and Sohler [2005]. The pseudocode of the approach is provided in Algorithm 4.1. While Frahling and Sohler [2005] present their algorithm using hierarchical grids, we take an alternative presentation using the quad tree [Finkel and Bentley, 1974], which could be of independent interest.

First, we recall the construction of a standard quad tree in dimension d. Suppose we have a set of n points in [−1/2, 1/2]^d. The quad tree is a tree whose nodes represent hypercubes containing points and can be built recursively as follows. The root represents the cell [−1/2, 1/2]^d and contains all points. If a node contains more than one point, we split the cell it represents into 2^d disjoint, equally sized cells. For each non-empty cell, we add it as a child of the current node and recurse into it. The recursion stops when a node contains only one point. In the distributional setting, the stopping criterion is either when the diameter of a node is less than some length or when the node contains less than some mass, where both quantities are a function of the depth of the node. See Algorithm 4.1.

A quad tree implicitly defines a function R : X → X as follows. Given a point x ∈ X and the root node of our tree, while the current node has a child, go to the child containing x if such a child exists; otherwise, go to any child. At a leaf node, output the center of the cell the leaf represents. Intuitively, the quad tree consolidates regions of the sample space into single points. The construction can be made replicable since the decision to continue the recursion or not is the only statistical operation, and it is essentially a heavy-hitters operation which can be performed in a replicable fashion.

Let G_i denote the union of all 2^{id} possible cells at the i-th level. We write P_i to denote the distribution discretized to G_i. In other words, P_i = P|_{σ(G_i)} is the restriction of P to the smallest σ-algebra containing G_i. Moreover, we write Λ to denote a replicable estimate of OPT with relative error ε, say OPT/(β(1 + ε)) ≤ Λ ≤ (1 + ε) OPT for some absolute constant β ≥ 1. We demonstrate how to obtain such a replicable estimate in Appendix D.5.

Algorithm 4.1 Replicable Quad Tree
1: rQuadTree(distribution P, accuracy ε, exponent p, replicability ρ, confidence δ):
2: Initialize the node on the first level: Z[0] ← {[−1/2, 1/2]^d}.
3: for depth i ← 1; Z[i − 1] ≠ ∅ AND (2^{−i+1})^p > εΛ/5; i ← i + 1 do
4:   {P_i is the discretized distribution over the 2^{id} cells on the i-th layer.}
5:   {γ is a parameter to be determined later.}
6:   {t is an upper bound on the number of layers.}
7:   Compute H_i ← rHeavyHitters(P_i, v = γΛ2^{pi}, v/2)
8:   for node Z ∈ Z[i − 1] do
9:     for heavy-hitter cells H ∈ H_i such that H ⊆ Z do
10:      children(Z) ← children(Z) ∪ {H}
11:      Z[i] ← Z[i] ∪ {H}
12:    end for
13:  end for
14: end for
15: Output the root node.

Our technique differs from that of Frahling and Sohler [2005] in at least three ways.
Firstly, they performed their analysis for finite, uniform distributions with access to the entire distribution, while our results hold assuming only sample access to a general bounded distribution5. Secondly, Frahling and Sohler [2005] bound the number of layers as a function of the cardinality of the support. For us, this necessitates the extra termination condition when the side lengths of our grids fall below a fraction of Λ as our distribution may have infinite support. Finally, Frahling and Sohler [2005] estimate OPT by enumerating powers of 2. This suffices for their setting since their distributions are discrete and bounded. However, we require a more nuanced approach (cf. Appendix D.5) as we do not have a lower bound for OPT. We tackle this by showing uniform convergence of the clustering solution costs, which we establish via metric entropy and Rademacher complexity (cf. Appendix C). 4.3 Putting it Together Once we have produced the function R implicitly through a quad tree, there is still the matter of extracting a replicable solution from a finite distribution. We can accomplish this by replicably estimating the probability mass at each point of R(X) and solving an instance of the weighted sample k-medians (k-means). This leads to the following results. For details, see Appendix D.4 - D.6. Theorem 4.2 (Theorem 3.1; Formal). Let ε, ρ (0, 1) and δ (0, ρ/3). Given black-box access to a β-approximation oracle for weighted k-medians (cf. Problem 2.2), there is a ρ-replicable algorithm for statistical k-medians (cf. Problem 2.1) such that, with probability at least 1 δ, it outputs a (1 + ε)β-approximation. Moreover, it has sample complexity ε12ρ6 OPT12 + k3218d 3d+3 ρ2ε3d+5 OPT3 Theorem 4.3 (Theorem 3.1; Formal). Given black-box access to a β-approximation oracle for weighted k-means (cf. Problem 2.2), there is a ρ-replicable algorithm for statistical k-means (cf. Problem 2.1) such that, with probability at least 1 δ, it replicably outputs a (1 +ε)β-approximation. Moreover, it has sample complexity ε12ρ6 OPT12 + k3239d 3d+6 ρ2ε6d+8 OPT3 5 The Euclidean Metric, Dimensionality Reduction, and (k, p)-Clustering In this section, we focus on the Euclidean metric κ(y, z) := y z 2 and show how the sample complexity for Euclidean (k, p)-clustering can be made polynomial in the ambient dimension d. Note that results for dimensionality reduction exist for the combinatorial Euclidean (k, p)-clustering problem [Charikar and Waingarten, 2022, Makarychev et al., 2019]. However, these do not extend trivially to the distributional case. We remark that we require our dimensionality reduction maps to be data-oblivious since we are constrained by replicability requirements. The cornerstone result in dimensionality reduction for Euclidean distance is the Johnson-Lindenstrauss lemma (cf. Theorem E.1), which states that there is a distribution over linear maps πd,m from Rd Rm that approximately preserves the norm of any x Rd with constant probability for some target dimension m. In Makarychev et al. [2019], the authors prove Theorem E.4, which roughly states that it suffices to take m = O(p4/ε2 log(1/δ)) in order to preserve (k, p)-clustering costs when the domain is finite. Firstly, we extend Theorem E.4 to the distributional setting by implicitly approximating the distribution with a weighted ε-net. Then, we implicitly map the ε-net onto the low-dimensional space and solve the clustering problem there. 
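Because the projection is data-oblivious, both executions can use the same map simply by drawing it from the shared seed. The sketch below builds a Gaussian Johnson-Lindenstrauss projection and the induced clustering function f (project the query point, then assign it to the nearest center living in the low-dimensional space), in the spirit of the construction described in the next paragraph and in Appendix E.4. The concrete target dimension m used here is a placeholder; the theory suggests m = O(p^4 ε^{-2} log(1/δ)) in the finite setting.

```python
import numpy as np

def shared_jl_projection(d, m, seed=0):
    """Gaussian JL map pi: R^d -> R^m; identical across executions that share the seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))

def make_clustering_function(pi, low_dim_centers):
    """Clustering function f: X -> [k] induced by centers found in the projected space."""
    def f(x):
        z = pi @ np.asarray(x)  # project the query point with the shared map
        return int(np.argmin(np.linalg.norm(low_dim_centers - z, axis=1)))
    return f

if __name__ == "__main__":
    d, m, k = 50, 10, 3
    pi = shared_jl_projection(d, m, seed=42)     # same seed => same map in both executions
    rng = np.random.default_rng(7)
    centers = rng.normal(size=(k, m))            # e.g. the output of the low-dimensional solver
    f = make_clustering_function(pi, centers)
    print(f(rng.normal(size=d)))                 # cluster label of a fresh point
```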
An important complication we need to overcome is that this mapping preserves the costs that correspond to partitions6 of the data and not arbitrary solutions. Because of that, it is not clear how we can lift the solution from the low-dimensional space to the original space. Thus, instead of outputting k points that correspond to the centers of the clusters, our algorithm outputs a clustering function f : X [k], which takes as input a point x X and returns the label of the cluster it belongs to. The replicability guarantees of the algorithm state that, with probability 1 ρ, it will output the same function across two executions. In Appendix E.4, we describe this function. We emphasize that for each x X, the running time of f is polynomial in k, d, p, 1/ε, log(1/δ). Essentially, this function maps x onto the low-dimensional 5Our results also generalize to unbounded distributions with sufficiently small tails 6Roughly speaking, a solution corresponds to a partition when the center of each cluster is its center of mass. space using the same projection map π as our algorithm and then finds the nearest center of π(x) in the low-dimensional space. For full details, we refer the reader to Appendix E.3. We are now ready to state the result formally. Theorem 5.1 (Theorem 3.2; Formal). Let ε, ρ (0, 1) and δ (0, ρ/3). Given a β-approximation oracle for weighted Euclidean k-medians (k-means), there is a ρ-replicable algorithm that outputs a clustering function such that with probability at least 1 δ, the cost of the partition is at most (1 + ε)β OPT. Moreover, the algorithm has sample complexity poly kd ρ OPT where m = O 1 6 Running Time for (k, p)-Clustering All of our algorithms terminate in O(poly(n)) time where n denotes the sample complexity. See Table A.1 for more detailed time complexity of each stage of our approach. Moreover, we make O(log(β/(ε OPT))) calls to the β-approximation oracle for the combinatorial (k, p)-clustering problem within our OPT estimation subroutine and then one more call to the oracle in order to output a solution on the coreset. 7 Replicable k-Centers Due to space limitations, we briefly sketch our approach for the k-centers problem and kindly refer the reader to Appendix F. As explained before, the assumptions we make in this setting state that there exists some good solution, so that when we draw n i.i.d. samples from P we observe at least one sample from each cluster, with constant probability. We first take a fixed grid of side c in order to cover the unit-diameter ball. Then, we sample sufficiently many points from P. Subsequently, we round all the points of the sample to the centers of the cells of the grid that they fall into and estimate the probability mass of every cell. In order to ensure replicability, we take a random threshold from a predefined interval and discard the points from all the cells whose mass falls below the threshold. Finally, we call the approximation oracle using the points that remain. Unlike the (k, p)-clustering problem (cf. Problem 2.1), to the best of our knowledge, there does not exist any dimensionality reduction techniques that apply to the k-centers problem. The main result is formally stated in Theorem F.8. 8 Experiments We now provide a practical demonstration7 of the replicability of our approach on synthetic data in 2D. In Figure 8.1, we leverage the sklearn [Pedregosa et al., 2011] implementation of the popular k-means++ algorithm for k = 3 and compare the output across two executions on the two moons distribution. 
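The comparison protocol can be summarized by the following sketch, which uses sklearn's make_moons and KMeans; the noise level, sample size, and the `replicable_coreset` argument (any preprocessing routine that consumes only the shared seed, such as the grid-based sketch given after Definition 4.1) are illustrative stand-ins rather than the exact settings of our experiments, which are available in the linked repository.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

def run_once(data_seed, shared_seed, k=3, n=2000, replicable_coreset=None):
    """One execution: a fresh sample, but internal randomness fixed by shared_seed."""
    X, _ = make_moons(n_samples=n, noise=0.1, random_state=data_seed)
    if replicable_coreset is None:
        km = KMeans(n_clusters=k, random_state=shared_seed).fit(X)  # vanilla k-means++
    else:
        support, weights = replicable_coreset(X, seed=shared_seed)  # replicable preprocessing
        km = KMeans(n_clusters=k, random_state=shared_seed).fit(support, sample_weight=weights)
    return np.array(sorted(km.cluster_centers_.tolist()))           # canonical ordering of centers

if __name__ == "__main__":
    # Two executions on independent samples with the same internal seed.
    c1 = run_once(data_seed=1, shared_seed=123)
    c2 = run_once(data_seed=2, shared_seed=123)
    # Without preprocessing the centers typically differ, since k-means++ seeding depends on the data.
    print("identical centers:", np.allclose(c1, c2, atol=1e-6))
```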
In the first experiment, we do not perform any preprocessing and run k-means++ as is, resulting in different centers across the two executions. In the second experiment, we compute a replicable coreset for the two moons distribution before running k-means++ on the coreset. This leads to the same centers being output across the executions. Note that the computation of the coreset is performed independently for each execution, albeit with a shared random seed for the internal randomness. See also Figure 8.2 for the results of a similar experiment on a mixture of truncated Gaussian distributions. (Code: https://anonymous.4open.science/r/replicable_clustering_experiments-E380)

Figure 8.1: The results of running vanilla vs. replicable k-means++ on the two moons distribution for k = 3.

Figure 8.2: The results of running vanilla vs. replicable k-means++ on a mixture of truncated Gaussian distributions for k = 3.

9 Conclusion & Future Work

In this work, we designed replicable algorithms with strong performance guarantees using black-box access to approximation oracles for their combinatorial counterparts. There are many follow-up research directions that this work can lead to. For instance, our coreset algorithm adapts that of Frahling and Sohler [2005] by viewing their algorithm as a series of heavy-hitter estimations that can be made replicable. It may be possible to interpret more recent approaches for coreset estimation as a series of statistical operations to be made replicable, in order to obtain replicable algorithms in the statistical (k, p)-clustering setting [Hu et al., 2018]. It would also be interesting to examine the sensitivity of (replicable) clustering algorithms to the choice of parameters, such as the choice of the exponent in the cost function or the measure of similarity of the data. Another relevant direction is to explore sample complexity lower bounds for statistical clustering, where little is known even in the non-replicable setting.

Acknowledgments and Disclosure of Funding

Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N00014-19-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Grigoris Velegkas is supported by TILOS, the Onassis Foundation, and the Bodossaki Foundation. Felix Zhou is supported by TILOS.

References

Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49(4):FOCS17-97, 2019.
Kwangjun Ahn, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, and Gil I Shamir. Reproducibility in optimization: Theoretical framework and limits. arXiv preprint arXiv:2202.04598, 2022.
Michael R Anderberg. Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks, volume 19. Academic Press, 2014.
Olivier Bachem, Mario Lucic, and Andreas Krause. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1119–1127, 2018.
Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 2016.
Maria-Florina Balcan, Yingyu Liang, and Pramod Gupta. Robust hierarchical clustering. The Journal of Machine Learning Research, 15(1):3831–3871, 2014.
Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn.
Oblivious dimension reduction for k-means: beyond subspaces and the johnson-lindenstrauss lemma. In Proceedings of the 51st annual ACM SIGACT symposium on theory of computing, pages 1039 1050, 2019. Shai Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2):243 257, 2007. Shai Ben-David, Ulrike Von Luxburg, and D avid P al. A sober look at clustering stability. In International conference on computational learning theory, pages 5 19. Springer, 2006. Shai Ben-David, D avid P al, and Hans Ulrich Simon. Stability of k-means clustering. In International conference on computational learning theory, pages 20 34. Springer, 2007. Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon. A stability based method for discovering structure in clustered data. In Biocomputing 2002, pages 6 17. World Scientific, 2001. Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. Random projections for k-means clustering. Advances in neural information processing systems, 23, 2010. Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Jessica Sorrell, and Satchit Sivakumar. Stability is stable: Connections between replicability, privacy, and adaptive generalization. ar Xiv preprint ar Xiv:2303.12921, 2023. Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median and positive correlation in budgeted optimization. ACM Transactions on Algorithms (TALG), 13(2):1 31, 2017. Bernd Carl and Irmtraud Stephani. Entropy, compactness and the approximation of operators. Cambridge University Press, 1990. Deeparnab Chakrabarty and Chaitanya Swamy. Approximation algorithms for minimum norm and ordered optimization problems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 126 137, 2019. Moses Charikar and Erik Waingarten. The johnson-lindenstrauss lemma for clustering and subspace approximation: From coresets to dimension reduction. ar Xiv preprint ar Xiv:2205.00371, 2022. Zachary Chase, Shay Moran, and Amir Yehudayoff. Replicability and stability in learning. ar Xiv preprint ar Xiv:2304.03757, 2023. Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, and Eliad Tsfadia. Differentially-private clustering of easy instances. In International Conference on Machine Learning, pages 2049 2059. PMLR, 2021. Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163 172, 2015. Vincent Cohen-Addad, Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets. ar Xiv preprint ar Xiv:2204.04828, 2022. Ronald A De Vore and George G Lorentz. Constructive approximation, volume 303. Springer Science & Business Media, 1993. Peter Dixon, A Pavan, Jason Vander Woude, and NV Vinodchandran. List and certificate complexities in replicable learning. ar Xiv preprint ar Xiv:2304.02240, 2023. Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Reproducible bandits. ar Xiv preprint ar Xiv:2210.01898, 2022. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569 578, 2011. 
Raphael A Finkel and Jon Louis Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 4:1 9, 1974. Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 209 217, 2005. Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. Differentially private clustering: Tight approximation ratios. Advances in Neural Information Processing Systems, 33:4040 4054, 2020. Peter E Hart, David G Stork, and Richard O Duda. Pattern classification. Wiley Hoboken, 2000. Dorit S Hochbaum and David B Shmoys. A best possible heuristic for the k-center problem. Mathematics of operations research, 10(2):180 184, 1985. Wei Hu, Zhao Song, Lin F Yang, and Peilin Zhong. Nearly optimal dynamic k-means clustering for high-dimensional data. ar Xiv preprint ar Xiv:1802.00459, 2018. Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. ar Xiv preprint ar Xiv:2201.08430, 2022. Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988. Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1 6, 2018. William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26: 189 206, 1984. Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. Statistical indistinguishability of learning algorithms. ar Xiv preprint ar Xiv:2305.14311, 2023. Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983 1006, 1998. Jon Kleinberg. An impossibility theorem for clustering. Advances in neural information processing systems, 15, 2002. Tilman Lange, Volker Roth, Mikio L Braun, and Joachim M Buhmann. Stability-based validation of clustering solutions. Neural computation, 16(6):1299 1323, 2004. Silvio Lattanzi, Benjamin Moseley, Sergei Vassilvitskii, Yuyan Wang, and Rudy Zhou. Robust online correlation clustering. Advances in Neural Information Processing Systems, 34:4688 4698, 2021. Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991. Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Performance of johnsonlindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1027 1038, 2019. Vitali D Milman and Gideon Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaced, volume 1200. Springer Berlin, 1986. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Joelle Pineau, Koustuv Sinha, Genevieve Fried, Rosemary Nan Ke, and Hugo Larochelle. Iclr reproducibility challenge 2019. Re Science C, 5(2):5, 2019. Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi ere, Alina Beygelzimer, Florence d Alch e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research: a report from the neurips 2019 reproducibility program. Journal of Machine Learning Research, 22, 2021. Gilles Pisier. The volume of convex bodies and Banach space geometry, volume 94. Cambridge University Press, 1999. 
Alexander Rakhlin and Andrea Caponnetto. Stability of k-means clustering. Advances in neural information processing systems, 19, 2006. Christian Sohler and David P Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 802 813. IEEE, 2018. AW van der Vaart and Jon A Wellner. Weak convergence and empirical processes with applications to statistics. Journal of the Royal Statistical Society-Series A Statistics in Society, 160(3):596 608, 1997. Ulrike Von Luxburg and Shai Ben-David. Towards a statistical theory of clustering. In Pascal workshop on statistics and optimization of clustering, pages 20 26. London, UK, 2005. Ulrike Von Luxburg et al. Clustering stability: an overview. Foundations and Trends in Machine Learning, 2(3):235 274, 2010. Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. Yuichi Yoshida and Shinji Ito. Average sensitivity of euclidean k-clustering. In Advances in Neural Information Processing Systems, 2022. We present the running times required for our sub-routines in order to ensure each stage is ρ-replicable and succeeds with probability at least 0.99. The overall output is a constant ratio approximation for Euclidean k-medians (k-means). Table A.1: Running Time Overview for Euclidean k-Medians (k-Means) ALGORITHM RUNNING TIME OPT ESTIMATION O(poly(kd/OPT ρ)) DIMENSIONALITY REDUCTION O(poly(d/OPT)(k/ρ)O(log log(k/ρ))) CORESET O(poly(d/OPT)(k/ρ)O(log log(k/ρ))) PROBABILITY MASS ESTIMATION O(poly(d/OPT)(k/ρ)O(log log(k/ρ))) B Useful Facts Proposition B.1 (Bretagnolle-Huber-Carol Inequality; [Vaart and Wellner, 1997]). Suppose the random vector (Z(1), . . . , Z(N)) is multinomially distributed with parameters (p(1), . . . , p(N)) and n. Let bp(j) := 1 n Z(j). Then j=1 |bp(j) p(j)| 2ε 2N exp 2ε2n . In particular, for any ε, ρ (0, 1), sampling points from a finite distribution implies that PN j=1|bp(j) p(j)| < 2ε with probability at least 1 ρ. Remark B.2 (Weighted k-Means/Medians). We remark that if we have access to a β-approximation oracle for unweighted k-means/medians, we can implement a weighted one by considering multiple copies of the points. In particular, in our applications, we get weights that are polynomials in the parameters of concern so this will not affect the stated runtime guarantees. C Uniform Convergence of (k, p)-Clustering Costs We would like to estimate OPT for statistical k-medians and statistical k-means by solving the combinatorial problem on a sufficiently large sample size. However, while the convergence of cost for a particular set of centers is guaranteed by standard arguments, e.g. Chernoff bounds, we require the stronger statement that convergence holds simultaneously for all possible choices of centers. Similar results can be found [Ben-David, 2007], with the limitation that centers are either chosen from the sample points, or chosen as centers of mass of the clusters they induce. While this suffices to achieve constant ratio approximations, we would like to choose our centers anywhere in Bd. One reason for doing so is for replicability as we have no control over the location of samples from two independent executions so we need to output centers that are not overly reliant on its specific input. Another is for dimensionality reduction, which we will see later. 
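Appendix C shows that the empirical clustering cost converges to the population cost uniformly over all center sets. As a crude numerical illustration of what Theorem C.9 asserts, the sketch below measures the largest deviation between the sample cost and an approximate population cost over a random collection of center sets, for increasing sample sizes. The population cost is itself approximated by a large hold-out sample and the supremum is only probed on finitely many center sets, so this is a sanity check under illustrative parameters, not a verification of the theorem.

```python
import numpy as np

def kp_cost(points, centers, p=2):
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return float(np.mean(dists.min(axis=1) ** p))

def max_deviation(n, k=3, d=2, p=2, num_center_sets=200, seed=0):
    """Largest |sample cost - (approximate) population cost| over a random net of center sets."""
    rng = np.random.default_rng(seed)
    sample = rng.uniform(-0.5, 0.5, size=(n, d))
    reference = rng.uniform(-0.5, 0.5, size=(200_000, d))  # proxy for the population P
    worst = 0.0
    for _ in range(num_center_sets):
        F = rng.uniform(-0.5, 0.5, size=(k, d))             # a candidate set of k centers
        worst = max(worst, abs(kp_cost(sample, F, p) - kp_cost(reference, F, p)))
    return worst

if __name__ == "__main__":
    for n in [100, 1_000, 10_000]:
        print(n, max_deviation(n))   # the deviation should shrink as the sample grows
```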
C.1 History of Uniform Laws of Large Numbers Uniform laws of large numbers generalize convergence results for a finite number of random variables to possibly uncountable classes of random variables. The Rademacher complexity and related Gaussian complexity are canonical techniques in developing these laws and also have a lengthy history in the study of Banach spaces using probabilistic methods [Pisier, 1999, Milman and Schechtman, 1986, Ledoux and Talagrand, 1991]. Metric entropy, along with related notions of expressivity of various function classes, can be used to control the Rademacher complexity and are also central objects of study in the field of approximation theory [De Vore and Lorentz, 1993, Carl and Stephani, 1990]. C.2 Rademacher Complexity Let F denote a class of functions from Rd R. For any fixed collection of n points xn 1 := (x1, . . . , xn), we write f(xn 1) := (f(x1), . . . , f(xn)) F(xn 1) := {f(xn 1) : f F}, i.e, the restriction of F onto the sample. Let s { 1, 1}n be a Rademacher random vector. That is, si takes on values 1, 1 with equal probability independently of other components. Recall that the Rademacher complexity of F is given by Rn(F) := EX,s 1 n s, f(Xn 1 ) Theorem C.1 ([Wainwright, 2019]). For any b-uniformly bounded class of functions F, integer n 1, and error ε 0, i=1 f(Xi) E[f(X)] with probability at least 1 exp nε2 In particular, as long as Rn(F) = o(1), we have uniform convergence of the sample mean. C.3 Metric Entropy We write B(x, r; µ) to denote the ball of radius r about a point x with respect to a metric µ. Recall that an ε-cover of a set T with respect to a metric µ is a subset θ1, . . . , θN T such that i [N] B(θi, ε; µ). The covering number N(ε; T, µ) is the cardinality of the smallest ε-cover. Proposition C.2 ([Wainwright, 2019]). Fix ε (0, 1). Let B(R; ) denote the d-dimensional ball of radius R with respect to . Then N(ε, B(R; ), ) 1 + 2R We say a collection of zero-mean random variables {Yθ : θ T} is a sub-Gaussian process with respect to a metric µ on T if E exp [λ(Yθ Yθ )] exp λ2µ2(θ, θ ) for all θ, θ T and λ R. Proposition C.3. The canonical Rademacher process is a zero-mean sub-Gaussian process with respect to the Euclidean norm on T Rn. Proof. Recall that a Rademacher variable is sub-Gaussian with parameter 1. Moreover, the sum of sub-Gaussian variables is sub-Gaussian with parameter equal to the Euclidean norm of the parameters. It follows that Yθ Yθ = s, θ θ is sub-Gaussian with parameter θ θ 2. The result follows. Theorem C.4 (One-Step Discretization; [Wainwright, 2019]). Let Yθ, θ T be a zero-mean sub Gaussian process with respect to the metric µ. Fix any ε [0, D] where D = diam(T; µ) such that N(ε; T, µ) 10. Then E sup θ T Yθ sup θ,θ T Yθ Yθ sup θ,θ T :µ(θ,θ ) ε Yθ Yθ log N(ε; T, µ). In general, it may be possible to improve the bound in Theorem C.4 using more sophisticated techniques such as Dudley s entropy integral bound [Wainwright, 2019]. However, the simple inequality from Theorem C.4 suffices for our setting. Corollary C.5. Fix any ε [0, D] where D = diam(T; µ) such that N(ε; T, µ) 10. The canonical Rademacher process Yθ = s, θ for θ T Rn satisfies 2ε n + 4D q log N(ε; T, 2). Proof. By Proposition C.3, Yθ is a zero-mean sub-Gaussian process and we can apply Theorem C.4 to conclude that sup θ θ 2 ε Yθ Y θ sup v 2 ε v, s 2E [ s 2 v 2] Corollary C.5 gives us a way to control the Rademacher complexity of a function class whose co-domain is well-behaved. We make this notion rigorous in the next section. 
C.4 Uniform Convergence of (k, p)-Clustering Cost For a fixed set of centers F, let κp F : Rd R be given by κp F (x) := κ(x, F)p. Define F := {F Bd : |F| = k}. We take our function class to be F := {κp F : F F}. Let µ : F F R be given by µ(A, B) := max a A,b B κ(a, b). Proposition C.6. µ is a metric on F. Proof. It is clear that µ is symmetric and positive definite. We need only show that the triangle inequality holds. Fix A, B, C F and suppose a A, c C are such that κ(a, c) = µ(A, C). Then µ(A, C) = κ(a, c) κ(a, b) + κ(b, c) b B µ(A, B) + µ(B, C) as desired. Proposition C.7. F is p-Lipschitz parameterized with respect to the metric µ. Proof. We have |κ1 F (x) κ1 F (x)| = |κ(x, F(x)) κ(x, F (x))| |κ(x, F (x)) + κ(F (x), F(x)) κ(x, F (x))| = κ(F (x), F(x)) Now, the function g(x) : [0, 1] R given by x 7 xp is p-Lipschitz by the mean value theorem: |g(x) g(y)| sup ξ [0,1] g (ξ)|x y| p|x y|. The result follows by the fact that the composition of Lipschitz functions is Lipschitz with a constant equal to the product of constants from the composed functions. Theorem C.8. For any ε [0, D = n] such that N(ε; F(xn 1), 2) 10, Rn(F) 2ε n + 4 kd log 3 np ε n = o(1). Proof. Remark that κp F (xn 1) is np-Lipschitz with respect to the metric µ. Indeed, i=1 |κp F (xi) κp F (xi)|2 np2µ(F, F )2. Now, D := diam(F(xn 1); 2) = n. We can apply Corollary C.5 with T = F(xn 1) to see that n E sup θ Yθ log N(ε; F(xn 1), 2) n . Now, since κp F (xn 1) is np-Lipschitz, a ε np-cover for F yields an ε-cover of F(xn 1). Hence N(ε; F(xn 1), 2) N ε np; F, µ . Note that the cross product of k ε-covers of B is an ε-cover of F. Hence N ε np; F, µ N ε np; Bd, κ k kd . Proposition C.2 Substituting this bound on the covering number of F(xn 1) concludes the proof. Note that the Lipschitz property was crucial to ensure that the exponent in the covering number does not contain n. Theorem C.9. Fix ε, δ (0, 1). Then with i.i.d. samples from P, d cost(F) cost(F) ε with probability at least 1 δ for any set of centers F Bd. Proof. Choose ε := 3 D = n. By Theorem C.8, kd log 3 np kd n. log n n Fix some ε1 (0, 1). We have Similarly, we have n 16kd log p By taking the maximum of the three lower bounds, we conclude that ε4 1 log p = Rn(F) 3ε1. Fix δ (0, 1). Observe that F is 1-uniformly bounded. Thus by Theorem C.1, we require δ , 256k2d2 ε4 1 log p 256k2d2 samples in order to guarantee that i=1 κp F (Xi) Eκp F (X) 2Rn(F) + ε1 7ε1 with probability at least 1 δ. Choosing ε1 = ε 7 concludes the proof. D (k, p)-Clustering In this section, we provide the full proofs for Theorem 4.2 and Theorem 4.3, which state the guarantees for the replicable k-medians and k-means algorithms, respectively. First, we describe a useful subroutine for replicable heavy hitters estimation in Appendix D.2. This subroutine is crucial to the replicable coreset algorithm (cf. Algorithm D.3), which we analyze in Appendix D.3. Once we have a coreset, it remains to solve the statistical (k, p)-clustering problem on a finite distribution. We describe a replicable algorithm for this in Appendix D.4. Our coreset algorithm assumes the knowledge of some constant ratio estimate of OPT. In Appendix D.5, we show how to output such an estimate replicably. Finally, we summarize our findings in Appendix D.6. D.1 Warm-Up: Replicable SQ Oracle and ε-Covers In order to give some intuition to the reader, we first show how we can use a subroutine that was developed in Impagliazzo et al. [2022] in order to derive some results in the setting we are studying. 
We first need to define the statistical query model that was introduced in Kearns [1998] Definition D.1 (Statistical Query Oracle; [Kearns, 1998]). Let D be a distribution over the domain X and ϕ : X n R be a statistical query with true value v := lim n ϕ(X1, . . . , Xn) R. Here Xi i.i.d. D and the convergence is understood in probability or distribution. Let ε, δ (0, 1)2. A statistical query (SQ) oracle outputs a value v such that |v v | ε with probability at least 1 δ. The simplest example of a statistical query is the sample mean ϕ(X1, . . . , Xn) = 1 The idea is that by using a sufficiently large number of samples, an SQ oracle returns an estimate of the expected value of a statistical query whose range is bounded. Impagliazzo et al. [2022] provide a replicable implementation of an SQ oracle with a mild blow-up in the sample complexity which we state below. Theorem D.2 (Replicable SQ Oracle; [Impagliazzo et al., 2022]). Let ε, ρ (0, 1) and δ (0, ρ/3). Suppose ϕ is a statistical query with co-domain [0, 1]. There is a ρ-replicable SQ oracle to estimate its true value with tolerance ε and failure rate δ. Moreover, the oracle has sample complexity O 1 ε2ρ2 log 1 The interpretation of the previous theorem is that we can replicably estimate statistical queries whose range is bounded. We now explain how we can use this result for the statistical k-means problem under the Euclidean metric, but it is not hard to extend the approach to more general (k, p)-clustering problems. Consider a fixed set of centers F. Then, we can replicably estimate the cost of this solution using Theorem D.2 within an additive accuracy ε and confidence δ using O ε 2ρ 2 log(1/δ) samples. Thus, a natural approach is to consider an ε-cover of the unit-diameter ball and then exhaustively search among solutions whose centers coincide with elements of the ε-cover. This is outlined in Algorithm D.1. We are now ready to state our results. Algorithm D.1 Replicable k-Means with ε-Cover r KMeans Cover: distribution P, error ε, replicability ρ, confidence δ G ε/3-cover over the d-dimensional unit-diameter ball F {F G : |F| = k} {We restrict the solutions to the ε-cover.} Output argmin F F r SQ-Oracle(cost(F), ε/3, ρ/|F|, δ/|F|) Lemma D.3. For any ε, δ, ρ (0, 1), δ < ρ/3, Algorithm D.1 is ρ-replicable and outputs a solution F whose cost is at most OPT +ε with probability 1 δ. Moreover, it has sample complexity Proof (Lemma D.3). We first argue about the replicability of Algorithm D.1. Since we make |F| calls to the replicable SQ subroutine with parameter ρ/|F|, the overall replicability of the algorithm follows by taking a union bound. Let us now focus on the correctness of the algorithm. Let F be the optimal solution. Consider the solution that we get when we move the centers of F to the closest point of G and let us denote it by ˆF . Notice that the cost of ˆF is at most OPT +ε/3. Furthermore, by a union bound, all the calls to the SQ oracle will return an estimate that is within an additive ε/3-error of the true cost. This happens with probability at least 1 δ and we condition on this event for the rest of the proof. Thus the estimated cost of the solution ˆF will be at most OPT +2ε/3. Let e F be the solution that we output. Its estimated cost is at most that of ˆF and so its true cost will be at most OPT +ε. This concludes the proof of correctness. Lastly, we argue about the sample complexity of the algorithm. By Proposition C.2, |F| O (9/ε)kd . 
By plugging this value into the sample complexity from Theorem D.2, we get ρ2ε2 log (9/ε)kd Although Algorithm D.1 provides some basic guarantees, there are several caveats with this approach. Namely, we can only get additive approximations and the dependence of the sample and time complexity on both k, d is exponential. In the following sections, we will explain how we can overcome these issues. As we alluded to before, our approach combines ideas from coresets estimation through hierarchical grids [Frahling and Sohler, 2005], uniform convergence through metric entropy [Wainwright, 2019], and dimensionality reduction techniques [Makarychev et al., 2019]. D.2 Replicable Heavy-Hitters In this section, we present a replicable heavy hitters algorithm which is inspired by Impagliazzo et al. [2022] and has an improved sample complexity by a factor of O(1/(v ε)). We believe that this result could be of independent interest since the replicable heavy hitters algorithm has many applications as a subroutine in more complicated algorithms [Impagliazzo et al., 2022]. Intuitively, this algorithm consists of two phases. In the first phase, we estimate all the candidate heavy hitters and reduce the size of the domain. In the second phase, we estimate the mass of these candidates. We present a new analysis of the second phase using Proposition B.1. Algorithm D.2 Replicable Heavy Hitters r Heavy Hitters:distribution P, target v, error ε, replicability ρ, confidence δ if |X| < ln 2 δ(v ε) v ε then ˆ X X else ˆ X ln 2 δ(v ε) v ε samples P S 648 ln 2/δ+648(| ˆ X|+1) ln 2 ρ2ε2 samples P Choose v [v 2/3ε, v 1/3ε] uniformly randomly Output all x ˆ X such that ˆPS(x) v {empirical distribution} Theorem D.4. Fix v, ρ (0, 1) and ε (0, v), δ (0, ρ/3). Then Algorithm D.2 is ρ-replicable and returns a list L of elements x such that with probability at least 1 δ: (a) If P(x) v ε, then x / L. (b) If P(x) v, then x L. Moreover, it has sample complexity O min |X|, 1 (v ε) 1 ρ2ε2 log 1 Proof. First, we wish to capture all v ε heavy hitters in ˆ X. If X is sufficiently small, this is easy. Otherwise, we fail to observe each (v ε) heavy hitter with probability at most (1 v + ε)| ˆ X|. By a union bound over all (v ε) 1 possible heavy hitters, we fail to capture all v ε heavy hitters with probability at most (1 v + ε)| ˆ X| v ε 1 v ε exp (v ε) ln 2 δ(v ε) v ε Moving forward, we condition on this step succeeding. Next, consider ˆp1, ˆp2, the mass estimates over the course of two runs supported on the candidate sets ˆ X1, ˆ X2. both follow (possibly different) multinomial distributions of dimension at most | ˆ X| + 1 with unknown mean parameters p1(x) for x ˆ X1 {y1}, p2(x) for x ˆ X2 {y2}, respectively, and |S|. Here y1, y2 are dummy elements for observations beyond ˆ X1, ˆ X2. Suppose we draw ln 2/δ+(| ˆ X|+1) ln 2 2ε 2 samples over each of two runs to yield estimates ˆp1, ˆp2. By Proposition B.1, X x ˆ X1 {y1} |ˆp1(x) p1(x)| < 2ε x ˆ X2 {y2} |ˆp2(x) p2(x)| < 2ε each with probability at least 1 δ/2. Moving forward, we condition on this step succeeding. Now, consider choosing v [v 2ε/3, v ε/3] uniformly at random. Correctness: By choosing 2ε < ε/3, any element x with true mass at least v will have empirical mass strictly more than v ε/3 v . Similarly, any element x with true mass at most v ε will have empirical mass strictly less than v 2ε/3 v . Thus we satisfy the correctness guarantees. Note that we satisfy this guarantee with probability at least 1 δ. 
Replicability: v lands in between some ˆp1(x), ˆp2(x) for x ˆ X1 ˆ X2 with total probability at most P x ˆ X1 ˆ X2|ˆp1(x) ˆp2(x)| x ˆ X1 ˆ X2|ˆp1(x) p1(x)| x ˆ X1 ˆ X2|ˆp2(x) p2(x)| In addition, we wish for v to avoid landing below any ˆp1(x), ˆp2(x) where x / ˆ X1 ˆ X2. This happens with probability at most P x ˆ X1 {y1}\ ˆ X2|ˆp1(x) p1(x)| x ˆ X2 {y2}\ ˆ X1|ˆp2(x) p2(x)| The sum of the two expressions above is at most 4ε ε/3. Thus we set The probability of failing to be replicable is then at most 1 ρ/3 2δ 1 ρ. Sample Complexity: Recall that we set If |X| is sufficiently small, the sample complexity becomes δ + (|X| + 1) ln 2 2ε 2 = 648 ln 2 δ + 648 (|X| + 1) ln 2 Otherwise, the sample complexity is ln 2 δ(v ε) v ε + 648 ln 2 δ ρ2ε2 + 648 ln 2 δ(v ε) + 1 ln 2 ρ2ε2(v ε) . D.3 Replicable Coreset We now prove that Algorithm 4.1 replicably yields an ε-coreset. In Appendix D.3.1, we state Algorithm D.3, an equivalent description of Algorithm 4.1. Then, we show that it is indeed replicable in Appendix D.3.2. Following this, we prove that the cost of any set of centers F Bd is preserved up to an ε-multiplicative error in Appendix D.3.3. In Appendix D.3.4, we show that the outputted coreset is of modest size. Last but not least, we summarize our findings in Appendix D.3.5. D.3.1 Pseudocode Before presenting the analysis, we state an equivalent algorithm to Algorithm 4.1 which uses precise notation that is more suitable for analysis. Starting with the grid that consists of a single cell [ 1/2, 1/2]d at layer 0, we recursively subdivide each cell into 2d smaller cells of length 2 i at layer i. Each subdivided cell is a child of the larger cell, the containing cell is the parent of a child cell, and cells sharing the same parent are siblings. Let Gi denote the union of cells at the i-th level. We write Pi to denote the discretized distribution to Gi. In other words, Pi = P σ(Gi) is the restriction of P to the smallest σ-algebra containing Gi. We say a cell on the i-th level is heavy if the replicable heavy hitters algorithm (cf. Algorithm D.2) returns it with input distribution Pi, heavy hitter threshold v = γ Λ 2 pi , and error v/2 for some parameter γ to be determined later. A cell that is not heavy is light. The recursion terminates either when there are no more heavy cells on the previous layer, or the current grid length is a sufficiently small fraction of OPT. In the second case, we say the last layer consists of special cells. If no child cell of a heavy parent is heavy, we mark the parent. Otherwise, we recurse on its heavy children so that one of its descendants will eventually be marked. The recursion must stop, either when there are no more heavy children or when the side length is no more than a fraction of Λ. Note that the light cells with heavy parents and special cells partition [ 1/2, 1/2]d. Then, we build R in reverse as follows. For a marked heavy cell Z, we set R(x) to be an arbitrary (but replicable) point in Z, e.g. its center, for all x Z. This covers all light cells with no heavy siblings as well as special cells. Otherwise, a cell Z is light with heavy siblings, and its parent is heavy but unmarked. Thus, the parent has some marked heavy descendent Z , and we choose R(x) = R(x ) for all x Z and x Z . D.3.2 Replicability Proposition D.5. 
Algorithm D.3 terminates after at most p log 10β p Algorithm D.3 Replicable Coreset; Algorithm 4.1 Equivalent 1: r Coreset(distribution P, accuracy ε, exponent p, replicability ρ, confidence δ): 2: Init H[i] for all i 3: Init L[i] for all i 4: Init S[i] for all i 5: 6: H(0) n [ 1/2, 1/2]do 7: for i 1; H[i 1] = ; i i + 1 do 8: if (2 i+1 )p εΛ/5 then 9: S[i] children(H[i 1]) 10: else 11: H[i] children(H[i 1]) r Heavy Hitters Pi, v = γ Λ 2 pi , v/2, ρ/t, δ/t {t is an upper bound on the number of layers.} 12: L[i] children(H[i 1]) \ H[i] 13: end if 14: end for 15: 16: for j i 1; j 0; j j 1 do 17: for Z H[j] do 18: if children(Z) H[j + 1] = then 19: R (Z) r Choice(Z) 20: else 21: R (Z) R (r Choice(children(Z) H[j + 1])) 22: end if 23: end for 24: for Z L[j + 1] do 25: R(Z) R (parent(Z)) 26: end for 27: for Z S[j + 1] do 28: R(Z) R (parent(Z)) 29: end for 30: end for 31: 32: Output R Proof. We have (2 i+1 )p εΛ 2p( i+1) εΛ p( i + 1) log εΛ But 1/Λ β(1+ε)/OPT, concluding the proof. Corollary D.6. Algorithm D.3 is ρ-replicable and succeeds with probability at least 1 δ. Moreover, it has sample complexity ρ2γ3 OPT3 log 1 Proof. The only non-trivial random components of Algorithm D.3 is the heavy hitter estimation. By Theorem D.4, the heavy hitters subroutine call on the i-th layer is ρ/t-replicable and succeeds with probability at least 1 δ/t. Also, it has sample complexity O t2 23( pi+1) ρ2γ3 Λ3 log 1 Thus all in all, Algorithm D.3 is ρ-replicable and succeeds with probability at least 1 δ. It also has sample complexity ρ2γ3 Λ3 log 1 Since 1/Λ β(1+ε)/OPT, we conclude the proof. D.3.3 Expected Shift Proposition D.7. For any x Z L[i] S[i] outputted by Algorithm D.3, κ(x, R(x)) (2 i+1 ). Proof. x is represented by some point within its parent. Proposition D.8. The output S[t] of Algorithm D.3 satisfies Z S[t] κ(x, R(x))pd P(x) εΛ Proof. By computation, Z S[t] κ(x, R(x))pd P(x) Z S[t] (2 i+1 )pd P(x) Proposition D.7 5 d P(x) definition of t Let F be an arbitrary set of k centers. Partition L[i] into Lnear[i] := Z L[i] : min x Z κ(x, F)p 5 Lfar[i] := Z L[i] : min x Z κ(x, F)p > 5 ε(2 i+1 )p . Proposition D.9. The output Lfar of Algorithm D.3 satisfies Lfar[i] κ(x, R(x))pd P(x) ε Proof. Any x Z Lfar[i] contributes a cost of at least 5 ε(2 i+1 )p. Hence Lfar[i] κ(x, R(x))pd P(x) Lfar[i] (2 i+1 )pd P(x) 5 ε(2 i+1 )pd P(x) Lfar[i] κ(x, F)pd P(x) Proposition D.10. The output Lnear of Algorithm D.3 satisfies |Lnear[i]| k M where Proof. The furthest point x in a cell in Lnear[i] can have a distance of at most 5 ε(2 i+1 ) 1 + 10 to the nearest center. Thus the points belonging to a particular center live in an ℓ ball of that radius. Not including the cell containing the center, we can walk past at most 1 2 i 1 + 10 (2 i ) = 1 + 10 cells if we walk in the directions of the canonical basis of Rd. It follows that there can be at most 1 + 2 1 + 10 cells in Lnear[i] close to each center of F, and thus there are at most k M cells in total. Proposition D.11. By choosing γ := ε 5tk M(2 )p , The output Lnear of Algorithm D.3 satisfies Lnear[i] κ(x, R(x))p εΛ Proof. Each light cell has measure at most γ Λ 2 pi . By computation, Lnear[i] κ(x, R(x))pd P(x) Lnear[i] (2 i+1 )pd P(x) i=1 k M γ Λ 2 pi 2 pi(2 )p = t k M γ Λ (2 )p Corollary D.12. Set γ := ε 5tk M(2 )p . Then for any set F of k-centers, the output R of Algorithm D.3 satisfies Z X κ(x, R(x))pd P(x) ε cost(F). Proof. X is contained in the union of all L[i] s and S[t]. 
Hence Z X κ(x, R(x))pd P(x) 2εΛ 2ε(1 + ε) OPT D.3.4 Coreset Size To determine the size of our coreset, we bound the number of marked heavy cells, as all our representative points are in marked heavy cells. Fix FOPT to be a set of optimal centers for the (k, p)-clustering problem. Write M[i] to denote the set of marked heavy cells on the i-th layer and partition M[i] as Mclose[i] := Z M[i] : min x Z κ(x, F) 2 pi+1 Mdist[i] := Z M[i] : min x Z κ(x, F) > 2 pi+1 . Proposition D.13. The marked heavy cells Mdist outputted by Algorithm D.3 satisfy 0 i t Mdist[i] Proof. Each cell in Mdist[i] has mass at least γ Λ 2 pi+1 and each x Z Mdist[i] is of distance at least 2 pi+1 to its closest center. Thus each cell contributes at least γ Λ 2 pi+1 2 pi+1 = γ Λ to the objective. If follows that there are at most γ β (1 + ε) 2β such cells. Proposition D.14. The marked heavy cells Mclose outputted by Algorithm D.3 satisfy |Mclose[i]| k(7 )d. Proof. The furthest point x in a cell Mclose[i] to its nearest center is 2 pi+1 + 2 i . In other words, not including the cell containing the center, we can walk past at most 2i(1 p)+1 + 2 + cells by walking along the direction of the canonical basis of Rd. Thus there are at most k [1 + 2 (2 + )]d k [7 ]d such cells on the i-th layer. Proposition D.15. The image of the map R outputted by Algorithm D.3 satisfies N := |R(X)| 2β γ + tk(7 )d. In particular, for γ := ε 5tk M(2 )p , N := |R(X)| 3β γ = 3β 5tk M(2 )p Proof. The image is contained in the marked heavy cells. D.3.5 k-Medians & k-Means We now derive results for the case of p = 1. Theorem D.16. Algorithm D.3 is ρ-replicable and outputs an ε-coreset of size εd+1 log ε OPT for statistical k-medians with probability at least 1 δ. Moreover, it has sample complexity O k3215d 3d+3 ρ2ε3d+3 OPT3 log 1 Proof (Theorem D.16). By Corollary D.12, Algorithm D.3 outputs an ε-coreset. By Corollary D.6, Algorithm D.3 is ρ-replicable with sample complexity ρ2γ3 OPT3 log 1 k 25 /ε d (2 ) ε log ε OPT = O k25d d+1 εd+1 log ε OPT By Proposition D.15, the coreset has size at most this value. Substituting the value of γ above and remarking that t is a polylogarithmic factor of the other terms, we thus conclude that the final sample complexity is O k3215d 3d+3 ε3d+3 1 ρ2 OPT3 log 1 = O k3215d 3d+3 ρ2ε3d+3 OPT3 1 δ Now consider the case of p = 2. We can perform our analysis with the help of a standard inequality. Proposition D.17. For any set of centers F, Exκ(R(x), F)2 Exκ(x, F)2 + 2 p Ex [κ(x, F)2] Ex [κ(x, R(x))2] + Exκ(x, R(x))2 Exκ(x, F)2 Exκ(R(x), F)2 + 2 p Ex [κ(R(x), F)2] Ex [κ(x, R(x))2] + Exκ(x, R(x))2. Thus if Exκ(x, R(x))2 ε2 26 Exκ(x, F)2, Exκ(R(x), F)2 Exκ(x, F)2 + 2 1 23 εEx κ(x, F)2 + 1 26 ε2Ex κ(x, F)2 Exκ(x, F)2 Exκ(R(x), F)2 + 2 (1 + ε) Ex [κ(x, F)2] 1 26 ε2Ex [κ(x, F)2] + 1 26 ε2Exκ(x, F)2 Exκ(R(x), F)2 + 1 2εEx κ(x, F)2 + 1 26 ε2Exκ(x, F)2 Exκ(R(x), F)2 + εExκ(x, F)2. Proof. By H older s inequality, Exκ(R(x), F)2 Ex [κ(x, F) + κ(x, R(x))]2 = Exκ(x, F)2 + 2Ex[κ(x, F)κ(x, R(x))] + Exκ(x, R(x))2 Exκ(x, F)2 + 2 p Ex [κ(x, F)2] Ex [κ(x, R(x))2] + Exκ(x, R(x))2. The other case is identical. Theorem D.18. Algorithm D.3 is ρ-replicable and outputs an ε-coreset for statistical k-means of size O k211d d+2 ε2d+2 log 2 with probability at least 1 δ. Moreover, it has sample complexity O k3233d 3d+6 ρ2ε6d+6 OPT3 log 1 Proof (Theorem D.18). Similar to k-medians, Algorithm D.3 outputs an ε-coreset with probability at least 1 δ, is ρ-replicable, and has sample complexity ρ2γ3 OPT3 log 1 However, we need to take ε := ε2/26. 
We have 3 γ = O tk 25+6 /ε2 d (2 )2 = O k211d d+2 ε2d+2 log 2 By Proposition D.15, the coreset has size at most this value. t is a polylogarithmic factor of the rest of the parameters, thus we thus conclude that the final sample complexity is O k3233d 3d+6 ε6d+6 1 ρ2 OPT3 = O k3233d 3d+6 ρ2ε6d+6 OPT3 log 1 It is not clear how to compare our guarantees with that of Frahling and Sohler [2005] as we attempt to minimize the number of samples required from a distribution assuming only sample access. On the other hand, Frahling and Sohler [2005] attempted to minimize the running time of their algorithm assuming access to the entire distribution. D.4 Replicable Multinomial Parameter Estimation Now that we have reduced the problem to a finite distribution (coreset), we can think of it as a weighted instance of (k, p)-clustering where the weights are unknown but can be estimated using data. Proposition D.19. Enumerate the coreset R(X) = r(1), . . . , r(N) and let w(i) be the probability mass at r(i). If we have estimates ˆw(i) satisfying | ˆw(i) w(i)| < ε i=1 w(i)κ(r(i), F)p i=1 ˆw(i)κ(r(i), F)p ε. for all centers F. Proof. Indeed, i=1 w(i)κ(r(i), F)p i=1 ˆw(i)κ(r(i), F)p i=1 |w(i) ˆw(i)|κ(r(i), F)p i=1 |w(i) ˆw(i)| 1 Let us formulate this as a replicable multinomial parameter estimation problem. Consider a multinomial distribution Z = (Z(1), . . . , Z(N)) of dimension N with parameters p(1), . . . , p(N) and n. Note that this can be cast as a more general statistical query problem which we illustrate below. We allow simultaneous estimation of more general statistical queries g1, . . . , g N : (X n, P) R compared to Impagliazzo et al. [2022], assuming they can be estimated with high accuracy and confidence from data, say each query gj(x1, . . . , xn) concentrates about its true value Gj R with high probability. Algorithm D.4 Replicable Rounding 1: r Rounding(statistical queries g1, . . . g N, distribution P, number of samples n, error ε, replicability ρ, confidence δ): 2: Sample x1, . . . , xn P 3: α 2ε ρ+1 2δ 4: for j 1, . . . , N: do 5: α(j) uniform random sample [0, α] 6: Split R into intervals I(j) = {[α(j) + zα, α(j) + (z + 1)α) : z Z} 7: Output the midpoint ˆGj of the interval within I(i) that gj(x1, . . . , xn) falls into 8: end for Theorem D.20 (Replicable Rounding). Suppose we have a finite class of statistical queries g1, . . . , g N and sampling n independent points from P ensures that j=1 |gj(x1, . . . , xn) Gj| ε := ε(ρ 2δ) with probability at least 1 δ. Then Algorithm D.4 is ρ-replicable and outputs estimates ˆGj such that | ˆGj Gj| ε with probability at least 1 δ for every j [N]. Moreover, it requires at most n samples. Proof. Outputting the midpoint of the interval can offset each gj(x1, . . . , xn) by at most α | ˆGj Gj| ε(ρ 2δ) ρ + 1 2δ + ε ρ + 1 2δ = ε. Consider two executions of our algorithm yielding estimates ˆG(1), ˆG(2). The probability that any of the estimates fail to satisfy ℓ1-tolerance ε is at most 2δ. The two executions output different estimates only if some random offset splits some ˆG(1) j , ˆG(2) j . Conditioning on the success to satisfy ℓ1-tolerance, this occurs with probability at most | ˆG(1) j ˆG(2) j | α | ˆG(1) j Gj| + | ˆG(2) j Gj| α Accounting for the probability of failure to satisfy the ℓ1-tolerance of 2δ, our algorithm is ρ-replicable. We note that Theorem D.20 can be thought of as a generalization of the SQ oracle (cf. Theorem D.2) from Impagliazzo et al. [2022]. 
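To make the rounding step concrete, here is a minimal Python sketch of the idea behind Algorithm D.4 (the function name, the value of α, and the numerical estimates are ours and purely illustrative; this is not the paper's implementation). Each statistic is snapped to the midpoint of an interval of width α whose offset comes from the shared internal randomness, so two executions whose empirical estimates land in the same interval return exactly the same value.

```python
import numpy as np

def replicable_rounding(stats, alpha, rng):
    """Round each statistic to the midpoint of a randomly offset interval of width alpha.

    The offsets are drawn from `rng`; sharing the seed of `rng` across executions
    is what makes the outputs coincide whenever the estimates are close.
    """
    stats = np.asarray(stats, dtype=float)
    offsets = rng.uniform(0.0, alpha, size=stats.shape)  # one shared random offset per query
    idx = np.floor((stats - offsets) / alpha)            # index of the containing interval
    return offsets + (idx + 0.5) * alpha                 # midpoint of that interval

# Two runs on different samples but with the same internal seed.
rng1, rng2 = np.random.default_rng(3), np.random.default_rng(3)
est_run1 = np.array([0.501, 0.247])  # illustrative empirical estimates from run 1
est_run2 = np.array([0.498, 0.252])  # illustrative empirical estimates from run 2
print(replicable_rounding(est_run1, alpha=0.05, rng=rng1))
print(replicable_rounding(est_run2, alpha=0.05, rng=rng2))  # identical unless an offset splits a pair
```

The probability that an offset splits a pair is proportional to the distance between the two estimates divided by α, which is exactly the quantity controlled in the proof of Theorem D.20.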
Indeed, simply using Theorem D.2 yields an extra factor of N in the sample complexity as we must take a union bound. Corollary D.21. Let ε, ρ (0, 1) and δ (0, ρ/3). There is a ρ-replicable algorithm that outputs parameter estimates p for a multinomial distribution of dimension N such that (a) | p(i) p(i)| ε for every i [N] with probability at least 1 δ. (b) p(i) 0 for all i [N]. i p(i) = 1. Moreover, the algorithm has sample complexity O ln 1/δ + N Proof. Define ε := ε(ρ 2δ) By Proposition B.1, sampling n = 2 ln 1/δ + 2N ln 2 8 ln 1/δ + 8N ln 2 ε2(ρ 2δ)2 . points implies that PN i=1|bp(i) p(i)| < ε with probability at least 1 δ. Thus running Algorithm D.4 with functions gj(x1, . . . , x N) := 1 i=1 1{xi = j} yields p(i) s such that | p(i) p(i)| ε for each i [N]. If there are any negative p(i) s, we can only improve the approximation by setting them to 0. If the sum of estimates is not equal to 1, we can normalize by taking p(i) p(i) 1 This introduces an additional error of ε/N for each ˆp(i) and a maximum error of 2ε. Choosing ε1 := ε/2 concludes the proof. D.5 Replicable OPT Estimation for (k, p)-Clustering Our coreset algorithm assumes the knowledge of a constant ratio estimation of OPT. If this is not provided to us, we show in this section that it is possible to replicably compute such an estimate using a two-step approach. First, we are able to produce replicable estimates with additive error ε simply by approximately solving the sample (k, p)-clustering problem on a sufficiently large sample thanks to uniform convergence (cf. Theorem C.9). Then, we repeat this process with ε ε/2 until ε is but a small fraction of the outputted estimate. Theorem D.22 (OPT Estimation with Additive Error). Let β 1 be an absolute constant and ε, ρ (0, 1), δ (0, ρ/3). Suppose we are provided with an algorithm A that, given an instance of sample (k, p)-clustering, outputs an estimate Ξ of the value [ OPT such that β [ OPT, [ OPT . Then there is a ρ-replicable algorithm which produces an estimate Λ of OPT such that β ε, OPT +ε with probability at least 1 δ. Moreover, the algorithm has sample complexity We note that the algorithm A can be taken to be any β-approximation algorithm with some postprocessing, where we divide the value of the cost by β. Alternatively, we can take A to be some convex relaxation of our problem with constant integral gap. Proof. Let X1, . . . , XN P be i.i.d. random variables. The output Ξ = A(X1, . . . , XN) [0, 1] is thus a bounded random variable. Suppose we repeat this experiment n times to produce estimates Ξ1, . . . , Ξn. By an Hoeffding bound, i=1 (Ξi E[Ξ]) > ε 2 exp 2nε2 . Thus with n 1 2ε2 ln 2 δ trials, the average estimate Ξ := 1 n Pn i=1 Ξi satisfies Ξ E[Ξ] < ε with probability at least 1 δ. Now, by Theorem C.9, choosing i.i.d. samples from P ensures that [ OPT OPT ε for all n trial with probability at least 1 δ. Thus conditioning on success, we have β ε, OPT +ε for every trial i [n]. But then the average Ξ also falls into this interval as well. We have shown that there is an algorithm that outputs an estimate Ξ [E[Ξ] ε, E[Ξ] + ε] OPT β ε, OPT +ε with probability at least 1 δ. Moreover, it has sample complexity We can now apply the replicable rounding algorithm (cf. Algorithm D.4) to achieve the desired outcome. Indeed, by Theorem D.20, the output Λ after the rounding procedure is ρ-reproducible and offsets the average Ξ by at most ε. Hence we have β 2ε, OPT +2ε. with probability at least 1 δ. 
Finally, the algorithm has sample complexity ε6ρ6 log2 p ερδ Choosing ε = ε/2 completes the proof. Theorem D.23 (OPT Estimation with Relative Error). Fix ε, ρ (0, 1) and δ (0, ρ/3). There is a ρ-replicable algorithm such that with probability at least 1 δ, it outputs an estimate Λ of OPT where Λ 1 β(1 + ε) OPT, (1 + ε) OPT . Moreover, it has sample complexity ε12ρ6 OPT12 log p Proof. We run the estimation algorithm with additive error with for i = 1, 2, 3, . . . until we obtain an estimate Λi such that εi εΛi/2. Remark that = i log ε OPT i = log β ε OPT + 2. Thus the algorithm certainly terminates. Upon termination, we output some Λ such that Λ [OPT/β εΛ/2, OPT +εΛ/2]. It follows that (1 + ε) OPT Λ 1 β(1 + ε) OPT . The probability of not being replicable in each iteration is 2 iρ. Hence the total probability is at most i=1 2 iρ = ρ and similarly for δ. Finally, there are O log β ε OPT iterations and the sample complexity of each iteration is bounded above by the sample complexity of the final iteration where 2 i = Ω ε OPT Hence the total sample complexity is log β ε OPT k2d2 ε OPT β ρ 6 log p = O k2d2β12 ε12ρ6 OPT12 log p D.6 Putting it Together We are now ready to prove the results on statistical k-medians and statistical k-means, which we restate below for convenience. Recall that β is an absolute constant and hence disappears under the big-O notation. Theorem 4.2 (Theorem 3.1; Formal). Let ε, ρ (0, 1) and δ (0, ρ/3). Given black-box access to a β-approximation oracle for weighted k-medians (cf. Problem 2.2), there is a ρ-replicable algorithm for statistical k-medians (cf. Problem 2.1) such that, with probability at least 1 δ, it outputs a (1 + ε)β-approximation. Moreover, it has sample complexity ε12ρ6 OPT12 + k3218d 3d+3 ρ2ε3d+5 OPT3 Proof (Theorem 4.2). First, we compute Λ, an estimate of OPT with relative error ε, confidence δ/3, and replicability parameter ρ/3. Then, we produce an ε/2-coreset for k-medians of cardinality N with confidence and replicability parameters δ/3, ρ/3. Finally, we estimate the mass at each point of the coreset with confidence and replicability parameters δ/3, ρ/3 to an accuracy of εΛ 4N ε OPT By Proposition D.19, running the approximation oracle on the coreset with the estimated weights yields a β(1 + ε)-approximation. By Theorem D.23, computing Λ requires ε12ρ6 OPT12 log 1 By Theorem D.16, the coreset construction yields a coreset of cardinality N = O k25d d+1 εd+1 log ε OPT 2d = O k26d d+1 εd+1 log ε OPT and incurs a sample cost of O k3215d 3d+3 ρ2ε3d+3 OPT3 23d 1 = O k3218d 3d+3 ρ2ε3d+3 OPT3 1 δ Finally, Corollary D.21 states that the probability mass estimation incurs a sample cost of ε2ρ2 Λ2 log 1 ε2ρ2 OPT2 log 1 = O k3218d 3d+3 ε3d+3 1 ε2ρ2 OPT2 log 1 = O k3218d 3d+3 ρ2ε3d+5 OPT2 log 1 Theorem 4.3 (Theorem 3.1; Formal). Given black-box access to a β-approximation oracle for weighted k-means (cf. Problem 2.2), there is a ρ-replicable algorithm for statistical k-means (cf. Problem 2.1) such that, with probability at least 1 δ, it replicably outputs a (1 +ε)β-approximation. Moreover, it has sample complexity ε12ρ6 OPT12 + k3239d 3d+6 ρ2ε6d+8 OPT3 Proof (Theorem 4.3). Similar to k-median, we first compute Λ, an estimate of OPT with relative error ε, confidence δ/3, and replicability parameter ρ/3. Then, we produce an ε/2-coreset for k-medians of cardinality N with confidence and replicability parameters δ/3, ρ/3. 
Finally, we estimate the mass at each point of the coreset with confidence and replicability parameters δ/3, ρ/3 to an accuracy of εΛ 4N ε OPT By Proposition D.19, running the approximation oracle on the coreset with the estimated weights yields a β(1 + ε)-approximation. By Theorem D.23, computing Λ requires ε12ρ6 OPT12 log 1 By Theorem D.18, the coreset construction yields a coreset of cardinality N = O k211d d+2 ε2d+2 log ε OPT 22d = O k213d d+2 ε2d+2 log ε OPT and incurs a sample cost of O k3233d 3d+6 ρ2ε6d+6 OPT3 26d = O k3239d 3d+6 ρ2ε6d+6 OPT3 1 δ Finally, Corollary D.21 states that the probability mass estimation incurs a sample cost of ε2ρ2 Λ2 1 δ ε2ρ2 OPT2 1 δ = O k3239d 3d+6 ε6d+6 1 ε2ρ2 OPT2 1 δ = O k3239d 3d+6 ρ2ε6d+8 OPT2 1 δ E The Euclidean Metric, Dimensionality Reduction, and (k, p)-Clustering We now build towards a proof for Theorem 5.1, which states the formal guarantees of the replicable k-medians and k-means algorithms under the Euclidean distance. Let us begin by recalling the Johnson-Lindenstrauss lemma. Theorem E.1 ([Johnson, 1984]). There exists a family of random linear maps πd,m from Rd Rm such that the following hold. (i) For every d 1, ε, δ (0, 1/2), x Rd, and m = O (1/ε2 log 1/δ), 1 1 + ε πx 2 x 2 (1 + ε) πx 2 with probability at least 1 δ. (ii) πd,m is sub-Gaussian-tailed, thus for every unit vector x Rd and ε > 0, with probability at most exp Ω(ε2m) . Furthermore, we can take πd,m to be the set of random orthogonal projections Rd Rm scaled by a factor of p We say that π πd,m is a standard random dimension-reduction map and write Pπ to denote the projected distribution on Rm, i.e., Pπ( ) = P(π 1( )). As mentioned in Section 5, it is not entirely clear how to translate between sets of centers for highdimensional data and solutions sets for the compressed data. For this reason, our goal is to produce a clustering function f : X [k] which computes the label for any given point in polynomial time. This also requires us to consider another notion of cost which is translated more easily. Definition E.2 (Partition Cost). Define the cost of a k-partition C = (C(1), . . . , C(k)) of X as cost(C) := min u(j) Rd:j [k] Ex j=1 1{x C(j)} κ(x, u(j))p i.e., we consider the expected cost of picking the optimal centers w.r.t. the underlying data-generating distribution. Similarly, we define the following sample partition cost. Definition E.3 (Sample Partition Cost). We define the sample cost of a k-partition C of x1, . . . , xn as d cost(C, w) = min u(j) Rd:j [k] j=1 1{xi C(j)}κ(xi, u(j))p where w 0, P i wi = 1. i.e., we consider the cost that is induced by the distribution specified by w on the samples. We write d cost(C) to denote the cost with the uniform distribution on the samples. Remark that if C is induced by F, then OPT cost(C) cost(F) and similarly for the sample partition cost as well. Theorem E.4 ([Makarychev et al., 2019]). Let x1, . . . , xn Rd be an instance of the Euclidean (k, p)-clustering problem. Fix ε (0, 1/4), δ (0, 1) and suppose π : Rd Rm is a standard random dimension-reduction map with Then with probability at least 1 δ, 1 1 + ε d cost(C) d cost(π(C)) (1 + ε)d cost(C) for every k-partition C of x1, . . . , xn. In essence, this theorem tells us that if we let the centers of the clusters be the centers of the partition, then the (sample) cost is preserved in the lower dimensional space. 
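As a quick sanity check of the statements in Theorem E.1 and Theorem E.4, the following Python sketch (the helper names and the toy data are ours; it is an illustration under the stated scaling, not the paper's code) builds a random map of the form sqrt(d/m)·P for a random orthogonal projection P and compares the k-means (p = 2) cost of one fixed partition before and after projecting.

```python
import numpy as np

def random_projection(d, m, rng):
    # A map of the kind in Theorem E.1: project onto a uniformly random
    # m-dimensional subspace and rescale by sqrt(d / m).
    Q, _ = np.linalg.qr(rng.standard_normal((d, m)))  # d x m with orthonormal columns
    return np.sqrt(d / m) * Q.T                       # maps R^d to R^m

def kmeans_partition_cost(X, labels):
    # Sample partition cost (Definition E.3, uniform weights, p = 2): each part is
    # charged to its optimal center, which for p = 2 is its mean.
    total = 0.0
    for j in np.unique(labels):
        part = X[labels == j]
        total += np.sum(np.linalg.norm(part - part.mean(axis=0), axis=1) ** 2)
    return total / len(X)

rng = np.random.default_rng(0)
d, m, k, n = 100, 20, 5, 2000
true_centers = rng.standard_normal((k, d))
labels = rng.integers(0, k, size=n)
X = true_centers[labels] + 0.1 * rng.standard_normal((n, d))
pi = random_projection(d, m, rng)
print(kmeans_partition_cost(X, labels), kmeans_partition_cost(X @ pi.T, labels))
```

For a fixed labeling the two printed values are close, in line with Theorem E.4; the goal of the next subsections is to carry this guarantee over to the distributional setting.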
To the best of our knowledge, this is the strongest result for dimensionality reduction in (sample) (k, p)-clustering which preserves more than just the costs of optimal centers. The rest of Appendix E serves to generalize Theorem E.4 for preserving partition costs in the distributional setting. First, Appendix E.1 addresses the issue that a random orthogonal projection can increase the diameter of our data in the worst case by scaling the original data by an appropriate factor. Next, Appendix E.2 slightly generalizes Theorem E.4 to the weighted sample clustering setting. Finally, Appendix E.3 extends Theorem E.4 to the distributional setting, culminating with Theorem E.10, which may be of independent interest beyond replicability. Recall that Pπ is the push-forward measure on π(X) induced by π. We write OPTπ to denote the cost of an optimal solution to the statistical (k, p)-clustering problem on (π(X), Pπ) and [ OPTπ for the (k, p)-clustering problem on π(x1), . . . , π(xn). E.1 Scaling We may assume that a standard dimension-reduction map π satisfies π = p d/m P for some orthogonal projection P. Hence π has operator norm at most In order to apply our existing analysis, we first scale down the samples in the input space by a factor of d so that the projected space still lives in a κ-ball of diameter 1. This does not affect our approximation guarantees as they are all multiplicative and hence scale-invariant. However, the analysis we perform is on the scaled distribution with optimal cost OPT = d p/2 OPT. This extra term must be accounted for. We proceed assuming now that X 1 d Bd and π(X) Bm. E.2 Preserving Partition Costs Let d cost(C, w), d cost(π(C), w) be the weighted cost of the partition and projected partition, respectively. Recall this means that the points participate with weights w in the objective. Essentially, the next result shows that since the projection guarantees do not depend on the number of points we are projecting, we can simulate the weights by considering multiple copies of the points. Proposition E.5. Let x1, . . . , xn Rd be an instance of the Euclidean (k, p)-clustering problem and w1, . . . , wn R+ be weights which sum to 1. Fix ε (0, 1/4), δ (0, 1), and suppose π : Rd Rm is a standard random dimension-reduction map with Then with probability at least 1 δ, 1 1 + ε d cost(C, w) d cost(π(C), w) (1 + ε)d cost(C, w), for every k-partition C of x1, . . . , xn. Proof (Proposition E.5). First suppose that wi = ai/bi for ai, bi Z+ and consider w i := lcm(b1, . . . , bn)wi Z+ obtained from wi by multiplying by the least common multiple of the denominators. Let y1, . . . , yn be the multiset obtained from x1, . . . , xn by taking w i multiples of xi. The cost of any partition of x1, . . . , xn with weights q1, . . . , qn is equal to the cost of the induced partition of y1, . . . , yn with uniform weights. Thus we can apply Theorem E.4 on y1, . . . , yn to conclude the proof. Now, for general wi R+, we remark by the density of the rationals in R that there are some q1, . . . , qn Q+ which sum to 1 and satisfy n X i=1 |qi wi| < ε min([ OPT, [ OPTπ) 2 . Then for any k-partition C and u(1), . . . , u(k) Bd, j=1 1{xi C(j)}κ(xi, u(j))p j=1 1{xi C(j)}κ(xi, u(j))p i=1 |qi wi| j=1 1{xi C(j)}κ(xi, u(j))p i=1 |qi wi| In particular 1 1 + ε cost(C, w) cost(C, q) (1 + ε) cost(C, w). Similarly, for any v(1), . . . , v(k) Bm, j=1 1{xi C(j)}κ(π(xi), v(j))p j=1 1{xi C(j)}κ(π(xi), v(j))p which implies that 1 1 + ε cost(π(C), w) cost(π(C), q) (1 + ε) cost(π(C), w). 
Finally, we conclude that cost(C, w) (1 + ε) cost(C, q) (1 + ε)2 cost(π(C), q) (1 + ε)3 cost(π(C), w) cost(π(C), w) (1 + ε)3 cost(C, w). Choosing ε to be a constant fraction of ε concludes the proof. Corollary E.6. Let x1, . . . , xn Rd be an instance of the weighted Euclidean (k, p)-clustering problem with non-negative weights w1, . . . , wn R+ which sum to 1. Fix ε (0, 1/4), δ (0, 1) and suppose π : Rd Rm is a standard random dimension-reduction map with Then with probability at least 1 δ, 1 1 + ε [ OPT [ OPTπ (1 + ε)[ OPT. Proof (Corollary E.6). The optimal clustering cost coincides with the cost of the partition induced by the optimal centers. E.3 Dimensionality Reduction Let G = {g(1), . . . , g(k)} be a β-approximate solution to the statistical Euclidean (k, p)-clustering problem on (π(X), Pπ). We would like to argue that the partition of X induced by G is a (1 + ε)βapproximate partition with probability at least 1 δ. To do so, we first approximate X with a finite set X in a replicable fashion while preserving partition costs. This allows us to apply Proposition E.5. Recall we write Λ to denote a replicable estimate of OPT such that Λ [OPT/β(1+ε), (1 + ε) OPT] (cf. Theorem D.23). Consider a grid of length εΛ/(4p ). for a point x X, we assign it to the center of the cell that it belongs to, say x. Let X := { x : x X} denote this finite set and P the distribution induced by the discretization. Remark that κ(x, x) εΛ/(4p) for every x X. Note that a partition of X induces a partition of X and vice versa. We proceed without making any distinction between the two. Proposition E.7. For any point u Bd, |κ(x, u)p κ( x, u)p| p|κ(x, u) κ( x, u)|. Proof. By the mean value theorem, the function g : [0, 1] R given by z 7 zp is p-Lipschitz: |g(y) g(z)| sup ξ [0,1] g (ξ)|y z| We write g cost(C) to denote the cost of the partition C of X. Proposition E.8. For any partition C of X 1 1 + ε cost(C) g cost(C) (1 + ε) cost(C). Proof. We have Ex j=1 1{x C(j)} κ(x, u(j))p j=1 1{x C(j)} κ( x, u(j))p j=1 1{x C(j)} |κ(x, u(j))p κ( x, u(j))p| j=1 1{x C(j)} p|κ(x, u(j)) κ( x, u(j))| Corollary E.9. Fix ε (0, 1/4), δ (0, 1) and suppose π : Rd Rm is a standard random dimension-reduction map with We have 1 (1 + ε) OPT ] OPTπ (1 + ε) OPT , with probability at least 1 δ, where the probability is taken w.r.t. the choice of the random map. Proof. The optimal clustering cost coincides with the cost of the partition induced by the optimal centers. An application of Proposition E.8 yields the desired result. Theorem E.10. Fix ε (0, 1/4), δ (0, 1) and suppose π πd,m is a standard random dimensionreduction map with Let G = {g(1), . . . , g(k)} be a β-approximate solution to the statistical Euclidean (k, p)-clustering problem on (π( X), Pπ). Then the partition of X induced by G is a (1 + ε)β-approximate partition with probability at least 1 δ. That is, for C = {C(1), . . . , C(k)} given by C(j) := n x X : g(j) = argming G κ(π( x), g(j))po , we have cost(C) (1 + ε)β OPT . Proof. We have cost(C) (1 + ε)g cost(C) Proposition E.8 (1 + ε)2 g cost(π(C)) Proposition E.5 (1 + ε)2 g cost(G) G induces π(C) (1 + ε)2β ] OPTπ G is β-approximate (1 + ε)3β OPT . Corollary E.9 Choosing ε to be a constant fraction of ε concludes the proof. E.4 Summary of Dimensionality Reduction Before proving Theorem 5.1, let us summarize the stages of our pipeline. 1) Scale the input data by 1/ d. 2) Estimate OPT of the scaled data within a multiplicative error. 3) Discretize the scaled data with a fine grid. 
4) Project the scaled data with a random map π ∼ π_{d,m}. 5) Compute a coreset on the projected data. 6) Estimate the probability mass at each point of the coreset. 7) Call the β-approximation oracle to produce k centers g(1), . . . , g(k). 8) Output the clustering function induced by the centers.
The clustering function proceeds as follows. Given a point x ∈ X, 1) Scale x by 1/√d. 2) Discretize the scaled point with the same grid used in the algorithm. 3) Project the discretized point using the same π used in the algorithm. 4) Output the label j ∈ [k] of the closest center g(j) in the lower-dimensional space.
We are now ready to prove Theorem 5.1, which we restate below for convenience.
Theorem 5.1 (Theorem 3.2; Formal). Let ε, ρ ∈ (0, 1) and δ ∈ (0, ρ/3). Given a β-approximation oracle for weighted Euclidean k-medians (k-means), there is a ρ-replicable algorithm that outputs a clustering function such that, with probability at least 1 − δ, the cost of the induced partition is at most (1 + ε)β OPT. Moreover, the algorithm has sample complexity poly(k^m, d, 1/ρ, 1/(ε OPT)), where m = O(ε^{-2} log(k/(εδ))) is the target dimension of the random projection.
Proof (Theorem 5.1). We randomly draw some π ∼ π_{d,m} with m as above. Then, we discretize a sufficiently large sample after scaling and push it through π. We then run the corresponding algorithm (cf. Theorem 4.2, Theorem 4.3) to produce some centers G for the compressed data. This allows us to partition X according to the closest member of G in the lower dimension. An application of Theorem E.10 yields the performance guarantees of this algorithm. The sample complexity guarantee from Theorem 4.2 (Theorem 4.3) is stated for the scaled distribution, whose optimal cost satisfies OPT′_π = d^{−p/2} OPT_π. Hence the sample complexity resolves to poly(k^m, d, 1/ρ, 1/(ε OPT)). We conclude the proof by remarking that for the Euclidean metric the scaling factor of Appendix E.1 is √d and that O(log n)^{O(log n)} = n^{O(log log n)}.
F k-Centers
In this section, we explain our approach to the statistical k-centers problem (cf. Problem 2.3). As we explained before, since this problem has a different flavor compared to (k, p)-clustering, we introduce some extra assumptions in order to be able to solve it from samples. To the best of our knowledge, this is the first work that considers a statistical model for the k-centers problem.
F.1 Assumptions
First, we describe an assumption that is necessary to obtain good clustering solutions even if we do not insist on using replicable algorithms.
Assumption F.1 (Clusterable). For some absolute constants β, B > 0, an i.i.d. sample S = x1, . . . , xn of size n from P has the following property: with probability at least q, there exists some (β, B)-approximate solution F to Problem 2.3 such that, for every f ∈ F, there is some xf ∈ S with F(xf) = f. Here q ∈ [0, 1].
Assumption F.1 is necessary to obtain good clustering solutions even in the non-replicable case. The distribution P must be sufficiently well-behaved so that solving the sample k-centers problem translates to a good solution in the population case. Essentially, this assumption states that, with high probability, the sample contains a point from every cluster of some (β, B)-approximate solution. In order to design replicable algorithms, we need a stronger assumption.
Assumption F.2 (Replicable). For some absolute constants β, B > 0, there exists a (β, B)-approximate solution F* to Problem 2.3 with the following property: with probability at least q, an i.i.d. sample S = x1, . . . , xn of size n from P contains, for every f ∈ F*, some xf with F*(xf) = f. Here q ∈ [0, 1].
Note that Assumption F.1 guarantees that, with high probability, every sample will help us identify some good solution. However, these good solutions could be vastly different when we observe different samples, so this assumption is not sufficient to design a replicable algorithm with high utility. Assumption F.2 is necessary to attain replicable solutions. Indeed, it essentially states that there is a fixed solution for which we observe samples from every cluster with high probability. If this does not hold and we observe points from clusters of solutions that vary significantly, we cannot hope to replicably recover a fixed approximate solution. Note that Assumption F.2 is stronger than Assumption F.1, since we have flipped the order of the existential quantifiers. One way to satisfy this assumption is to require that there is some (β, B)-approximate solution such that the total mass assigned to every one of its clusters is at least 1/n for some n ∈ ℕ. Then, by observing Õ(n) points, we receive at least one point from every cluster with high probability. This is made formal in Proposition F.3.
Proposition F.3 ([Janson, 2018]). Let δ ∈ (0, 1). Under Assumption F.2, a sample of O(nm log(1/δ)/q) points contains at least m points from each cluster of F* with probability at least 1 − δ.
Proof (Proposition F.3). Let Yi be a geometric random variable with success rate q, denoting the number of batches of n samples we draw until a point from each cluster of F* is observed. Then Y := Σ_{i=1}^m Yi upper bounds the number of batches until we obtain m points from each cluster. As shown in Janson [2018], P{Y ≥ λ E[Y]} ≤ exp(1 − λ). The result follows by picking λ = 1 + log(1/δ).
F.2 High-Level Overview of the Approach
We are now ready to provide a high-level overview of our approach to derive Theorem 3.3. We first take a fixed grid of side c in order to cover the unit ball. Then, we sample sufficiently many points from P. Subsequently, we round all points of the sample to the centers of the cells of the grid that they fall into. Using these points, we empirically estimate the density of the distribution on every cell of the grid. Next, in order to ensure that our solution is replicable, we take a random threshold from a predefined interval and discard the points of all cells whose density falls below the threshold. Finally, we call the approximation oracle that we have access to on the points that survive the previous step. We remark that, unlike for the (k, p)-clustering problem (cf. Problem 2.1), to the best of our knowledge no dimensionality reduction technique is known to apply to the k-centers problem. In the following subsections we explain our approach in more detail.
F.3 Oracle on Sample
As we explained before, we require black-box access to an oracle O for the combinatorial k-centers problem on a sample, which outputs (β̂, B̂)-approximate solutions F̂ for Problem 2.4. Thus, we need to show that, with high probability, the output of this oracle is a good approximation to Problem 2.3. This is shown in Proposition F.4.
Proposition F.4. Suppose there is a (β, B)-approximate solution F* for Problem 2.3 and that we are provided with a sample of size O(n log(1/δ)/q) containing an observation from each cluster of F*, with probability at least 1 − δ. Then, the output F̂ of the oracle O on the sample satisfies κ(x, F̂) ≤ (2β + β̂) OPT + 2B + B̂ for all x ∈ X, with probability at least 1 − δ.
In the previous result, (β, B) and (β̂, B̂) are the approximation parameters that we inherit from the fact that we receive samples from P and from solving the combinatorial problem approximately using O, respectively.
Proof (Proposition F.4). Let F* denote the (β, B)-approximate solution from Assumption F.2. We condition on the event that a point from each of its clusters appears in the sample, which occurs with probability at least 1 − δ. Fix x ∈ X, which is not necessarily observed in the sample. We pay a cost of β OPT + B to travel to the nearest center of F*. Next, we pay another β OPT + B to travel to an observed sample point assigned to this center. Finally, we pay a cost of d̂cost(F̂) ≤ β̂ ÔPT + B̂ ≤ β̂ OPT + B̂ to arrive at the closest center in F̂.
F.4 Oracle with Grid
The issue with directly using the output of O on a given sample is that the result will not be replicable, since we have no control over its behavior. As with the (k, p)-clustering problem, the main tool at our disposal to stabilize the output of O is a grid. Unlike the other problems, we cannot use the hierarchical grid approach here. The reason is that, due to the min-max nature of k-centers, we need to cover the whole space and not just the regions where the distribution is dense. Recall that so far we have established that the oracle O outputs (2β + β̂, 2B + 2B̂)-approximate solutions for Problem 2.3, given that Assumption F.2 holds. Algorithm F.1 discretizes the points of the sample using a grid with side-length c before applying the oracle. Proposition F.5 shows that, by doing so, we pay an extra additive term of order O(c√d).
Algorithm F.1 Oracle with Grid
1: OracleWithGrid(oracle O, sample x1, . . . , xn, grid length c):
2: for i ← 1, . . . , n do
3:   Decompose xi = zi·c + x′i for some zi ∈ ℤ^d, x′i ∈ [0, c)^d
4:   x̃i ← zi·c + (c/2, c/2, . . . , c/2)  {x̃i is the center of the cell that contains xi.}
5: end for
6: Return O(x̃1, . . . , x̃n)
Proposition F.5. Suppose there is a (β, B)-approximate solution F* for Problem 2.3 and that we are provided with a sample of size n containing an observation from each cluster of F*. Let G̃ denote the output of Algorithm F.1 using a (β̂, B̂)-approximation oracle O for Problem 2.4. Then, for any x ∈ X, κ(x, G̃) ≤ (2β + β̂) OPT + 2B + B̂ + (4β + 2β̂ + 1)c√d.
Proof (Proposition F.5). For the sake of analysis, imagine we used the grid to discretize all points of X, X → X̃, and write ÕPT for the cost of an optimal solution to Problem 2.3 on X̃. Then F* yields a solution with cost at most β ÕPT + B + c√d, and we observe a point from each of its clusters in our discretized sample. Proposition F.4 shows that the oracle returns a solution G̃ with cost at most (2β + β̂) ÕPT + (2B + c√d) + B̂ on the discretized points. Moreover, writing F_OPT for an optimal set of centers on X and F̃_OPT for its discretization,
ÕPT ≤ sup_{x̃ ∈ X̃} κ(x̃, F̃_OPT) ≤ sup_{x ∈ X} [2κ(x, x̃) + κ(x, F_OPT)] ≤ 2c√d + OPT.
The conclusion follows by combining the above inequalities.
F.5 Replicable Active Cells
At first glance, we might be tempted to work towards replicability by arguing that, with high probability, every grid cell is non-empty in one execution of the algorithm if and only if it is non-empty in another execution. However, since we have no control over P, it is not clear how one can prove such a statement. Our approach is to take more samples and ignore cells which do not have a sufficient number of points, using a random threshold. The idea is similar to the heavy-hitters algorithm (cf. Algorithm D.2) originally conceived by Impagliazzo et al.
[2022] and is presented in Algorithm F.2. Then, we move all the points of active cells to the center of their cell. Notice that in the k-centers objective, unlike the (k, p)-clustering objective, we do not care about how much mass is placed in a cell, only whether it is positive or not. We first derive a bound on the number of cells that the distribution puts mass on. It is not hard to see that, since we are working in a bounded domain, the grid contains at most (1/c)^d cells in total. Thus there are at most M ≤ (1/c)^d cells that have non-zero mass. Proposition F.6 shows how we can obtain such a bound in the unbounded domain setting.
Proposition F.6. Let F* be a (β, B)-approximate solution for Problem 2.3. Let M denote the maximum number of cells which intersect some cluster of F* and P̃ the discretized distribution that is supported on the centers of the cells of the grid of side c. Then, the support of P̃ has size at most kM, where M ≤ ((c + 2c√d + 2(β OPT + B))/c)^d.
Proof (Proposition F.6). Fix a center f and consider the cell Z containing it. The farthest point in the farthest cell of the cluster of f is at distance at most β OPT + B + c√d, and thus the cells we are concerned with sit in an ℓ∞ ball of that radius. Not including Z, we walk past at most (β OPT + B + c√d)/c cells along each direction of the canonical basis of R^d, hence there are at most (1 + 2√d + (2/c)(β OPT + B))^d such cells.
Notice that the previous discussion and Proposition F.5 reveal a natural trade-off in the choice of the parameter c: by reducing c, we decrease the additive error of our approximation algorithm, but we increase the number of samples at a rate of (1/c)^d. Since we have bounded the number of cells from which we can observe a point, the next step is to use a sufficiently large number of samples so that we can replicably estimate whether a cell is active or not. This is demonstrated in Algorithm F.2.
Algorithm F.2 Replicable Active Cells
1: rActiveCells(grid length c):
2: m ← O(λknM²/ρ²)
3: N ← λnmM
4: Sample x1, . . . , xN ∼ P
5: Lazily initialize a counter Z = 0 for every cell of the grid with length c
6: for i ← 1, . . . , N do
7:   Gi ← cell that sample xi falls into
8:   Increment Z(Gi) ← Z(Gi) + 1
9: end for
10: Choose v ∈ [0, m/N] uniformly at random
11: Output all cells G such that Z(G)/N ≥ v
Proposition F.7. Suppose Assumption F.2 holds and let F* be the (β, B)-approximate solution. Let δ, ρ ∈ (0, 1). Then Algorithm F.2 is ρ-replicable and returns cells of the grid such that at least one sample from each cluster of F* falls into these cells, with probability at least 1 − δ. Moreover, it has sample complexity O(λ²n²kM³ log(1/δ)/ρ²).
Proof (Proposition F.7). Enumerate the kM cells within the support of P̃ and let the counter Z(j) denote the number of samples in the j-th cell. Then Z = (Z(1), . . . , Z(kM)) follows a multinomial distribution with parameters (p(1), . . . , p(kM)) and N, where p is unknown to us. Nevertheless, by Proposition B.1, for any ε ∈ (0, 1), sampling N ≥ (ln(5/δ) + kM ln 2)/ε² points implies that Σ_{j=1}^{kM} |p̂(j) − p(j)| < 2ε with probability at least 1 − δ/5. From the definition of N, this is equivalent to m ≥ (ln(5/δ) + kM ln 2)/(λnMε²). By Proposition F.3, we observe at least mM points from each cluster of F* with probability at least 1 − δ. Let p̂1(j), p̂2(j) be the empirical means of Z(j) across two different executions of Algorithm F.2. Then, by the triangle inequality, Σ_{j=1}^{kM} |p̂1(j) − p̂2(j)| ≤ 4ε. Since we observe at least mM points from each cluster of F*, and each cluster intersects at most M cells, there is at least one cell from each cluster with at least m observations.
Thus outputting the cells j with p̂(j) ≥ v, for any v ∈ [0, m/N], means we always output at least one cell from each cluster. Now, the outputs of two executions are not identical only if the threshold v splits some pair p̂1(j), p̂2(j). This occurs with probability at most 4ε/(m/N) = 4Nε/m. To bound this value by ρ/5, set ε = ρ/(20λnM). Plugging this back yields m ≥ (400λnM ln(5/δ) + 400λnkM² ln 2)/ρ². All in all, Algorithm F.2 outputs the desired result with probability at least 1 − δ and is ρ-replicable given O(λ²n²kM³ log(1/δ)/ρ²) samples from P.
F.6 Putting it Together
We now have all the ingredients in place to prove Theorem 3.3. The proof follows directly by combining Proposition F.5 and Proposition F.7. Essentially, we replicably estimate the active cells of our grid, move all the points of each active cell to its center, and feed the resulting points to the approximation oracle. For completeness, we also state the formal version of Theorem 3.3.
Theorem F.8 (Theorem 3.3; Formal). Let δ, ρ ∈ (0, 1). Suppose Assumption F.2 holds and let F* be the (β, B)-approximate solution. Then, given black-box access to a (β̂, B̂)-approximation oracle for Problem 2.4, Algorithm F.2 is ρ-replicable and returns a solution F̂ such that κ(x, F̂) ≤ (2β + β̂) OPT + 2B + B̂ + (4β + 2β̂ + 1)c√d for all x ∈ X, with probability at least 1 − δ. Moreover, the algorithm requires at most O(λ²n²k(1/c)^{3d} log(1/δ)/ρ²) samples from P.
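The randomized-threshold rule at the heart of Algorithm F.2 (and of the heavy-hitters routine, Algorithm D.2) is easy to sketch in code. The Python fragment below is an illustration only: the function name and the parameter m_over_N, which stands in for the interval [0, m/N] of Algorithm F.2, are ours, and the constants are not set according to Proposition F.7.

```python
import numpy as np

def replicable_active_cells(samples, c, m_over_N, rng):
    """Snap samples to a grid of side c, estimate the mass of every cell, and keep
    the cells whose empirical mass clears a random threshold shared across runs."""
    cells = np.floor(samples / c).astype(int)                  # integer cell indices
    ids, counts = np.unique(cells, axis=0, return_counts=True)
    mass = counts / len(samples)
    v = rng.uniform(0.0, m_over_N)                             # threshold from shared randomness
    active = ids[mass >= v]
    return (active + 0.5) * c                                  # represent each active cell by its center

# Two executions on fresh samples, sharing only the seed of the threshold randomness.
rng1, rng2 = np.random.default_rng(7), np.random.default_rng(7)
sample = lambda: np.random.default_rng().normal(0.0, 0.1, size=(50_000, 2)).clip(-0.5, 0.5)
A1 = replicable_active_cells(sample(), c=0.1, m_over_N=0.002, rng=rng1)
A2 = replicable_active_cells(sample(), c=0.1, m_over_N=0.002, rng=rng2)
```

Because the cell masses concentrate while the threshold is drawn from the shared randomness, the two executions disagree only when the threshold lands between the two empirical masses of some cell, which is exactly the event bounded in the proof of Proposition F.7.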