# Differentially-Private Clustering of Easy Instances

Edith Cohen^{1,2}, Haim Kaplan^{1,2}, Yishay Mansour^{1,2}, Uri Stemmer^{1,3}, Eliad Tsfadia^{1,2}

Abstract. Clustering is a fundamental problem in data analysis. In differentially private clustering, the goal is to identify k cluster centers without disclosing information on individual data points. Despite significant research progress, the problem had so far resisted practical solutions. In this work we aim at providing simple, implementable differentially private clustering algorithms that provide utility when the data is "easy," e.g., when there exists a significant separation between the clusters. We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results. We are able to get improved sample-complexity bounds in some cases of Gaussian mixtures and k-means. We complement our theoretical analysis with an empirical evaluation on synthetic data.

1. Introduction

Differential privacy (Dwork et al., 2006b) is a mathematical definition of privacy that aims to enable statistical analyses of databases while providing strong guarantees that individual-level information does not leak. Privacy is achieved in differentially private algorithms through randomization and the introduction of noise to obscure the effect of each individual, and thus differentially private algorithms can be less accurate than their non-private analogues. In most cases, this loss in accuracy is studied theoretically, using asymptotic tools. As a result, there is currently a significant gap between what is known to be possible theoretically and what can be done in practice with differential privacy. In this work we take an important step towards bridging this gap in the context of clustering-related tasks.
[*Equal contribution. 1 Google Research. 2 Blavatnik School of Computer Science, Tel Aviv University. 3 Ben-Gurion University. Correspondence to: Eliad Tsfadia. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).]

The construction of differentially private clustering algorithms has attracted a lot of attention over the last decade, and many different algorithms have been suggested.[1] However, to the best of our knowledge, none of these algorithms have been implemented: they are not particularly simple and suffer from large hidden constants that translate to a significant loss in utility compared to non-private implementations.

Question 1.1. How hard is it to cluster privately with a practical implementation?

We take an important step in this direction using the following approach. Instead of directly tackling standard clustering tasks, such as k-means clustering, we begin by identifying a very simple clustering problem that still seems to capture many of the challenges of practical implementations (we remark that this problem is completely trivial without privacy requirements). We then design effective (private) algorithms for this simple problem. Finally, we reduce standard clustering tasks to this simple problem, thereby obtaining private algorithms for other tasks. In more detail, we introduce the following problem, called the k-tuple clustering problem.

Definition 1.2 (informal, revised in Definition 3.6). An instance of the k-tuple clustering problem is a collection of k-tuples. Assuming that the input tuples can be partitioned into k "obvious" clusters, each consisting of one point of each tuple, the goal is to report k cluster centers that correctly partition the input tuples into clusters. If this assumption on the input structure does not hold, then the outcome is not restricted.

Remark 1.3.

1. By "obvious" clusters we mean clusters which are far away from each other.

2.
The input tuples are unordered. This means, e.g., that the correct clustering might place the first point of one tuple with the fifth point of another tuple.

3. Of course, we want to solve this problem while guaranteeing differential privacy. Intuitively, this means that the outcome of our algorithm should not be significantly affected when arbitrarily modifying one of the input tuples.

[1] Blum et al., 2005; Nissim et al., 2007; Feldman et al., 2009; McSherry, 2009; Gupta et al., 2010; Mohan et al., 2012; Wang et al., 2015; Nock et al., 2016; Su et al., 2016; Nissim et al., 2016; Feldman et al., 2017; Balcan et al., 2017; Nissim & Stemmer, 2018; Huang & Liu, 2018; Kaplan & Stemmer, 2018; Stemmer, 2020; Shechner et al., 2020; Ghazi et al., 2020; Nguyen, 2020.

Observe that without the privacy requirement this task is trivial: we can just take one arbitrary input tuple (x_1, ..., x_k) and report it. With the privacy requirement, this task turns out to be non-trivial. It is not that this problem cannot be solved with differential privacy; it can. It is not even that the problem requires large amounts of data asymptotically; it does not. However, it turns out that designing an implementation with a practical privacy-utility tradeoff, one that is effective on finite datasets (of reasonable size), is quite challenging.

1.1. Our algorithms for the k-tuple problem

We present two (differentially private) algorithms for the k-tuple clustering problem, which we call PrivatekAverages and PrivatekNoisyCenters. Both algorithms first privately test whether the input is indeed partitioned into k obvious clusters, and quit otherwise. They differ in the way they compute the centers in case this test passes. Algorithm PrivatekAverages privately averages each identified cluster. Algorithm PrivatekNoisyCenters, on the other hand, does not operate by averaging clusters.
Instead, it selects one of the input k-tuples, and then adds a (relatively small) Gaussian noise to every point in this tuple. We prove that this is private if indeed there are k obvious clusters in the input. We evaluate these two algorithms empirically, and show that, while algorithm PrivatekAverages is better "in theory," algorithm PrivatekNoisyCenters is much more practical for some interesting regimes of parameters.

We now give a simplified overview of the ideas behind our algorithms. For concreteness, we focus here on PrivatekAverages. Recall that in the k-tuple clustering problem, we are only required to produce a good output assuming the data is "nice" in the sense that the input tuples can be clustered into k far clusters such that every cluster contains exactly one point from every tuple. However, with differential privacy we are forced to produce good outputs even when this niceness assumption does not hold. This happens because if the input data is almost nice (in the sense that modifying a small number of tuples makes it nice), then differential privacy states that the outcome of the computation should be close to what it is when the input data is nice. So, the definition of differential privacy forces us to cope with almost-nice datasets. Therefore, the niceness test that we start with has to be a bit clever and "soft," and succeed with some probability also for data which is almost nice. Then, in order to achieve good performance, we have to utilize the assumption that the data is almost nice when we compute the private centers. To compute these centers, Algorithm PrivatekAverages determines (non-privately) a clustering of the input tuples, and then averages (with noise) each of the clusters. The conceptual challenge here is to show that even though the clustering of the data is done non-privately, it is stable enough that the outcome of this algorithm still preserves privacy.

1.2. Applications

The significance of algorithms PrivatekAverages and PrivatekNoisyCenters is that many clustering-related tasks can be privately solved by a reduction to the k-tuple clustering problem. In this work we explore two important use cases: (1) privately approximating the k-means under a stability assumption, and (2) privately learning the parameters of a mixture of well-separated Gaussians.

k-Means Clustering. In k-means clustering, we are given a database P of n input points in R^d, and the goal is to identify a set C of k centers in R^d that minimizes the sum of squared distances from each input point to its nearest center. This problem is NP-hard to solve exactly, and even NP-hard to approximate to within a multiplicative factor smaller than 1.0013 (Lee et al., 2017). The current (non-private) state-of-the-art algorithm achieves a multiplicative error of 6.357 (Ahmadian et al., 2019).

One avenue that has been very fruitful in obtaining more accurate algorithms (non-privately) is to look beyond worst-case analysis (Ostrovsky et al., 2012; Awasthi et al., 2010; 2012; Balcan et al., 2009; Bilu & Linial, 2012; Kumar & Kannan, 2010). In more detail, instead of constructing algorithms which are guaranteed to produce an approximate clustering for any instance, works in this vein give stronger accuracy guarantees by focusing only on instances that adhere to certain nice properties (sometimes called stability assumptions or separation conditions). The above-mentioned works showed that such nice inputs can be clustered much better than what is possible in the worst case (i.e., without assumptions on the data). Given the success of non-private stability-based clustering, it is not surprising that such stability assumptions were also utilized in the privacy literature, specifically by Nissim et al. (2007); Wang et al. (2015); Huang & Liu (2018); Shechner et al. (2020).
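The k-means objective defined above is compact enough to state in code; the following sketch (the function name is ours, not from the paper) computes the cost of a candidate set of centers:

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center
    (the k-means objective)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # For each point, squared distance to every center; keep the minimum.
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())
```

For example, `kmeans_cost([[0, 0], [2, 0]], [[0, 0]])` evaluates to 4.0: the first point costs 0 and the second costs 2^2.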
While several interesting concepts arise from these four works, none of their algorithms have been implemented, their algorithms are relatively complex, and their practicability on finite datasets is not clear. We show that the problem of stability-based clustering (with privacy) can be reduced to the k-tuple clustering problem. Instantiating this reduction with our algorithms for the k-tuple clustering problem, we obtain a simple and practical algorithm for clustering nice k-means instances privately.

Learning Mixtures of Gaussians. Consider the task of privately learning the parameters of an unknown mixture of Gaussians given i.i.d. samples from it. By now, there are various private algorithms that learn the parameters of a single Gaussian (Karwa & Vadhan, 2018; Kamath et al., 2019a; Cai et al., 2019; Bun & Steinke, 2019; Kamath et al., 2020; Biswas et al., 2020). Recently, Kamath et al. (2019b) presented a private algorithm for learning mixtures of well-separated (and bounded) Gaussians. We remark, however, that besides the result of Biswas et al. (2020), which is a practical algorithm for learning a single Gaussian, all the other results are primarily theoretical. By a reduction to the k-tuple clustering problem, we present a simple algorithm that privately learns the parameters of a separated (and bounded) mixture of k Gaussians. From a practical perspective, compared with the construction of Kamath et al. (2019b), our algorithm is simple and implementable. From a theoretical perspective, our algorithm offers reduced sample complexity, a weaker separation assumption, and modularity.

Our results for stability-based clustering and for learning mixtures of Gaussians, as well as an extended discussion on related work, are given in the supplementary material.

2. Preliminaries

2.1. Notation

In this work, a k-tuple X = {x_1, ..., x_k} is an unordered set of k vectors x_i ∈ R^d.
For x ∈ R^d, we denote by ‖x‖ the ℓ2 norm of x. For c ∈ R^d and r > 0, we denote B(c, r) := {x ∈ R^d : ‖x − c‖ ≤ r}. For a multiset P over R^d, we denote by Avg(P) := (1/|P|) Σ_{x∈P} x the average of all points in P. Throughout this work, a database D is a multiset. For two multisets D = {x_1, ..., x_n} and D′ = {x′_1, ..., x′_m}, we let D ∪ D′ be the multiset {x_1, ..., x_n, x′_1, ..., x′_m}. For a multiset D = {x_1, ..., x_n} and a set S, we let D ∩ S be the multiset {x_i}_{i∈I}, where I = {i ∈ [n] : x_i ∈ S}. All logarithms considered here are natural logarithms (i.e., in base e).

2.2. Indistinguishability and Differential Privacy

Definition 2.1 (Neighboring databases). Let D = {x_1, ..., x_n} and D′ = {x′_1, ..., x′_n} be two databases over a domain X. We say that D and D′ are neighboring if there is exactly one index i ∈ [n] with x_i ≠ x′_i.

Definition 2.2 ((ε, δ)-indistinguishable). Two random variables X, X′ over a domain X are called (ε, δ)-indistinguishable if for any event T ⊆ X it holds that Pr[X ∈ T] ≤ e^ε · Pr[X′ ∈ T] + δ. If δ = 0, we say that X and X′ are ε-indistinguishable.

Definition 2.3 ((ε, δ)-differential privacy (Dwork et al., 2006b)). An algorithm A is called (ε, δ)-differentially private if for any two neighboring databases D, D′ it holds that A(D) and A(D′) are (ε, δ)-indistinguishable. If δ = 0 (i.e., pure privacy), we say that A is ε-differentially private.

Lemma 2.4 ((Bun & Steinke, 2016)). Two random variables X, X′ over a domain X are (ε, δ)-indistinguishable iff there exist events E, E′ ⊆ X with Pr[X ∈ E], Pr[X′ ∈ E′] ≥ 1 − δ such that X|_E and X′|_{E′} are ε-indistinguishable.

2.2.1. THE LAPLACE MECHANISM

Definition 2.5 (Sensitivity). We say that a function f : U^n → R has sensitivity λ if for all neighboring databases S, S′ it holds that |f(S) − f(S′)| ≤ λ.

Theorem 2.6 (The Laplace Mechanism (Dwork et al., 2006b)). Let ε > 0, and assume f : U^n → R has sensitivity λ. Then the mechanism that on input S ∈ U^n outputs f(S) + Lap(λ/ε) is ε-differentially private.

2.2.2. THE GAUSSIAN MECHANISM

Definition 2.7 (ℓ2-sensitivity). We say that a function f : U^n → R^d has ℓ2-sensitivity λ if for all neighboring databases S, S′ it holds that ‖f(S) − f(S′)‖ ≤ λ.

Theorem 2.8 (The Gaussian Mechanism (Dwork et al., 2006a)). Let ε, δ ∈ (0, 1), and assume f : U^n → R^d has ℓ2-sensitivity λ. Let σ ≥ (λ/ε) √(2 log(1.25/δ)). Then the mechanism that on input S ∈ U^n outputs f(S) + (N(0, σ^2))^d is (ε, δ)-differentially private.

2.2.3. ESTIMATING THE AVERAGE OF POINTS

The Gaussian mechanism (Theorem 2.8) allows for privately estimating the average of points in B(0, Λ) ⊆ R^d within ℓ2 error of ≈ Λ√d/(εn). In some cases, we can relax the dependency on Λ, for example using the following proposition.

Proposition 2.9 (Estimating the Average of Bounded Points in R^d). Let ε ∈ (0, 1), d, Λ > 0, and let r_min ∈ [0, Λ]. There exists an efficient (ε, δ)-differentially private algorithm that takes an n-size database S of points inside the ball B(0, Λ) in R^d and satisfies the following utility guarantee. Let r > 0 be the minimal radius of a d-dimensional ball that contains all points in S. Then with probability 1 − β, the algorithm outputs â ∈ R^d such that

    ‖â − Avg(S)‖ ≤ O( (max{r, r_min} · √d / (εn)) · log(d/δ) · log(Λ/(r_min β)) ),

where the O hides a √(log(d/β)) factor. The algorithm runs in time O(dn) (ignoring logarithmic factors).

Proposition 2.9 can be seen as a simplified variant of the private average algorithm of (Nissim et al., 2016). The main difference is that (Nissim et al., 2016) first use the Johnson-Lindenstrauss (JL) transform (Johnson & Lindenstrauss, 1984) to randomly embed the input points in R^{d′} for d′ ≈ log n, and then estimate the average of the points along each axis of R^{d′}. As a result, they manage to save a factor of √d over Proposition 2.9 (at the cost of paying a factor of log(n) instead). However, to simplify the construction and the implementation, we chose to omit the JL transform step, and we directly estimate the average along each axis of R^d.
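Both mechanisms are direct to implement. The sketch below (function names and the explicit `rng` parameter are ours) follows Theorem 2.6 and Theorem 2.8, with the standard choice σ = (λ/ε)√(2 log(1.25/δ)):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng=None):
    """Theorem 2.6: release f(S) + Lap(sensitivity / eps)."""
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps)

def gaussian_mechanism(vec, l2_sensitivity, eps, delta, rng=None):
    """Theorem 2.8: release f(S) + (N(0, sigma^2))^d, with
    sigma = (l2_sensitivity / eps) * sqrt(2 * log(1.25 / delta))."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = (l2_sensitivity / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    vec = np.asarray(vec, dtype=float)
    return vec + rng.normal(loc=0.0, scale=sigma, size=vec.shape)
```

For instance, a counting query has sensitivity 1, so `laplace_mechanism(count, 1.0, eps)` releases it with ε-differential privacy.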
For completeness, we present the full details of Proposition 2.9 in the supplementary material.

2.2.4. SUB-SAMPLING

Lemma 2.10 ((Beimel et al., 2010; Kasiviswanathan et al., 2011)). Let A be an (ε′, δ′)-differentially private algorithm operating on databases of size m. Fix ε ≤ 1, and denote n = (m/ε)(3 + exp(ε′)). Construct an algorithm B that on an input database D = (z_i)_{i=1}^n uniformly at random selects a subset I ⊆ [n] of size m, and executes A on the multiset D_I = (z_i)_{i∈I}. Then B is (ε, δ)-differentially private, where δ = (n/(4m)) · δ′.

3. k-Tuples Clustering

We first introduce a new property of a collection of (unordered) k-tuples {x_1, ..., x_k} ∈ (R^d)^k, which we call being partitioned by Δ-far balls.

Definition 3.1 (Δ-far balls). A set of k balls B = {B_i = B(c_i, r_i)}_{i=1}^k over R^d is called Δ-far balls if for every i ≠ j ∈ [k] it holds that ‖c_i − c_j‖ ≥ Δ · max{r_i, r_j} (i.e., the balls are relatively far from each other).

Definition 3.2 (partitioned by Δ-far balls). A tuple X ∈ (R^d)^k is partitioned by a given set of k Δ-far balls B = {B_1, ..., B_k} if for every i ∈ [k] it holds that |X ∩ B_i| = 1. A multiset T of k-tuples in (R^d)^k is partitioned by B if all X ∈ T are partitioned by B. We say that T is partitioned by Δ-far balls if such a set B of k Δ-far balls exists.

In some cases we want to use a notion of an "almost partitioned" database of k-tuples T. This is defined below using the additional parameter ℓ.

Definition 3.3 (ℓ-nearly partitioned by Δ-far balls). A multiset T of k-tuples in (R^d)^k is ℓ-nearly partitioned by a given set of Δ-far balls B = {B_1, ..., B_k} if there are at most ℓ tuples in T that are not partitioned by B. We say that T is ℓ-nearly partitioned by Δ-far balls if such a set of Δ-far balls B = {B_1, ..., B_k} exists.

For a database of k-tuples T ∈ ((R^d)^k)^n, we let Points(T) be the collection of all the points in all the k-tuples in T.

Definition 3.4 (The points in a collection of k-tuples). For T = {{x_{1,j}}_{j=1}^k, ...
, {x_{n,j}}_{j=1}^k} ∈ ((R^d)^k)^n, we define Points(T) = {x_{i,j}}_{i∈[n], j∈[k]} ∈ (R^d)^{kn}.

We next define Partition(T) of a database T of k-tuples which is partitioned by Δ-far balls for Δ > 2.

Definition 3.5 (Partition(T)). Given a multiset T of k-tuples which is partitioned by Δ-far balls for Δ > 2, we define the partition of T, which we denote by Partition(T) = {P_1, ..., P_k}, by fixing an (arbitrary) k-tuple X = {x_1, ..., x_k} ∈ T and setting P_i = {x ∈ Points(T) : i = argmin_{j∈[k]} ‖x − x_j‖}.

In the supplementary material we prove that Partition(T) is uniquely defined. We now define the k-tuple clustering problem.

Definition 3.6 (k-tuple clustering). The input to the problem is a database T ∈ ((R^d)^k)^n and a parameter Δ > 2. The goal is to output a k-tuple Y = {y_1, ..., y_k} ∈ (R^d)^k such that the following holds: if T is partitioned by Δ-far balls, then for every i ∈ [k] there exists a cluster in Partition(T) (call it P_i) such that P_i = {x ∈ Points(T) : i = argmin_{j∈[k]} ‖x − y_j‖}.

Namely, in the k-tuple clustering problem the goal is to output a k-tuple Y that partitions T correctly. We remark that for applications we are also interested in the quality of the solution, namely, how small the distance between y_i and P_i is compared to the distances to the other clusters in Partition(T). We also remark that without privacy the problem is trivial, since any k-tuple X ∈ T is a good solution by definition.

4. Our Algorithms

In this section we present two (ε, δ)-differentially private algorithms for the k-tuple clustering problem: PrivatekAverages and PrivatekNoisyCenters. Algorithm PrivatekAverages attempts to solve the problem by determining the clusters in Partition(T) and then privately estimating the average of each cluster using the algorithm from Proposition 2.9. Algorithm PrivatekNoisyCenters, on the other hand, does not operate by averaging clusters.
Instead, it first selects one of the input tuples X ∈ T (in a special way), and then adds a (relatively small) Gaussian noise to this tuple.[2]

Both algorithms share the same first step, which is to call Algorithm PrivateTestPartition (Figure 2), which privately decides whether T is ℓ-nearly partitioned by Δ-far balls or not (for small ℓ). If so, the algorithm determines (non-privately) a set of Δ-far balls B = {B_1, ..., B_k} that ℓ-nearly partitions T. All of the missing proofs are given in the supplementary material.

[2] We remind that all the tuples in this work are unordered, and indeed the privacy analysis of our algorithms relies on this (i.e., the domain of outputs that we consider is all the unordered k-tuples, and (ε, δ)-indistinguishability holds for each subset of this domain).

4.1. Algorithm PrivateTestPartition

Algorithm PrivateTestPartition is described in Figure 2. Its main component is Algorithm PrivateTestCloseTuples (Figure 1), which takes as input two multisets of k-tuples T_1 and T_2 and privately checks whether the tuples in T_1 are close to the tuples in T_2. In the supplementary material, we prove that the value of Status in PrivateTestCloseTuples preserves ε_1-DP w.r.t. T_1, and preserves ε_2-DP w.r.t. T_2.

Algorithm PrivateTestCloseTuples

Input: Multisets T_1 ∈ ((R^d)^k)^m and T_2 ∈ ((R^d)^k)^n, a privacy parameter ε_1 ∈ (0, 1] for T_1, a privacy parameter ε_2 ∈ (0, 1] for T_2, a confidence parameter β ∈ (0, 1], and a separation parameter Δ > 6.

1. For each X = {x_1, ..., x_k} ∈ T_1:
   (a) B^X = {B^X_i = B(x_i, r^X_i)}_{i=1}^k, where r^X_i = (1/Δ) min_{j≠i} ‖x_i − x_j‖.
   (b) ℓ_X = |{Y ∈ T_2 : Y is not partitioned by B^X}|.
   (c) ℓ̂_X = ℓ_X + Lap(m/ε_2).
   (d) If ℓ̂_X ≤ (m/ε_2) log(m/β), set pass_X = 1; otherwise set pass_X = 0.
2. Compute s = Σ_{X∈T_1} pass_X and ŝ = s + Lap(1/ε_1).
3. If ŝ < m − (1/ε_1) log(1/β), set Status = "Failure". Otherwise, set Status = "Success".
4. If Status = "Success" and pass_X = 1 for at least one X ∈ T_1, let X be the first tuple in T_1 with pass_X = 1 and set B = B^X.
Otherwise, set B to be a set of k empty balls.

5. Output (Status, B).

Figure 1: Algorithm PrivateTestCloseTuples for privately checking whether Δ-far balls around each k-tuple in T_1 partition the tuples in T_2.

In the following, we specify our choices of m and ε_1 (functions of n, ε, δ, β) that are used by PrivateTestPartition.

Definition 4.1. Let m = m(n, ε, δ, β) be the smallest integer that satisfies m > (1/ε_1)(2 log(1/δ) + log(1/β)), where ε_1 = log(εn/(2m) − 3).

The dependence between m and ε_1 in Algorithm PrivateTestPartition is due to the choice of T_1 as an m-size random sample of T. A smaller m allows for a larger value of ε_1 for the same overall privacy, by a sub-sampling argument (e.g., Lemma 2.10). We note that for n ≥ 1/ε and β, δ ≥ 1/poly(n), we have ε_1 = Θ(log n), which yields that m = O(1). For smaller values of δ, we obtain that m = O(log(1/δ)/log n).

Algorithm PrivateTestPartition

Input: A multiset T ∈ ((R^d)^k)^n, privacy parameters ε, δ ∈ (0, 1], a confidence parameter β ∈ (0, 1], and a separation parameter Δ > 6.

1. Let m and ε_1 be the values from Definition 4.1 w.r.t. n, ε, δ, β, and let ε_2 = ε/2.
2. Let T_1 be a uniform sample of m k-tuples from T (without replacement), and let T_2 = T.
3. Output (Status, B) = PrivateTestCloseTuples(T_1, T_2, ε_1, ε_2, β, Δ).

Figure 2: Algorithm PrivateTestPartition for privately checking whether T is partitioned by Δ-far balls.

4.1.1. PROPERTIES OF PrivateTestPartition

Claim 4.2 (Correctness). Assume that T is partitioned by (2Δ + 2)-far balls. Then with probability 1 − β, when executing PrivateTestPartition on input T, ε, δ, β, Δ, it outputs ("Success", B), where B is a set of Δ-far balls that partitions T.

Claim 4.3 (Status is private). Let T and T′ be two neighboring databases, and consider two independent executions PrivateTestPartition(T) and PrivateTestPartition(T′) (with the same parameters ε, δ, β, Δ). Let Status and Status′ be the status outcomes of the two executions (respectively). Then Status and Status′ are ε-indistinguishable.
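The core of PrivateTestCloseTuples is the noisy count of "violating" tuples in Steps 1(b)-(c). A minimal sketch of that count (helper names are ours; the thresholding step and the sub-sampling of T_1 are omitted):

```python
import numpy as np

def is_partitioned(tup, balls):
    """Definition 3.2: the tuple has exactly one point inside every ball.
    `balls` is a list of (center, radius) pairs."""
    return all(
        sum(np.linalg.norm(np.asarray(x, float) - np.asarray(c, float)) <= r
            for x in tup) == 1
        for (c, r) in balls
    )

def noisy_violation_count(tuples, balls, scale, rng=None):
    """Steps 1(b)-(c): count the tuples not partitioned by the candidate
    balls, and mask the count with Laplace noise of the given scale
    (m / eps_2 in the algorithm)."""
    rng = np.random.default_rng() if rng is None else rng
    ell = sum(not is_partitioned(t, balls) for t in tuples)
    return ell + rng.laplace(0.0, scale)
```

The noisy count is what the algorithm compares against its threshold; the raw count ℓ_X itself is never released.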
The following claim states that when the tests succeed, then w.h.p. T is ℓ-nearly partitioned by B, for the value of ℓ defined below.

Definition 4.4. Let ℓ = ℓ(n, ε, δ, β) = (2m/ε) · log(m/(βδ)), where m = m(n, ε, δ, β) is the value from Definition 4.1. We note that ℓ = O(log^2(1/δ)/(ε log n)). When β, δ ≥ 1/poly(n), we have that ℓ = O((1/ε) log n).

Claim 4.5 (On success, B almost partitions T). Let T ∈ ((R^d)^k)^n and δ > 0. Consider a random execution of PrivateTestPartition(T, ε, δ, β), and let (Status, B) be the outcome of the execution. Let S be the event that Status = "Success", and let E ⊆ S be the event that T is ℓ-nearly partitioned by B, where ℓ = ℓ(n, ε, δ, β) is the value from Definition 4.4. Then the following holds: if Pr[S] ≥ δ, then Pr[E | S] ≥ 1 − δ.

Recall that Algorithm PrivateTestPartition has two outputs: a bit Status and a set of balls B. As we stated in Claim 4.3, the bit Status preserves privacy. The set of balls B, however, does not. Still, in the following sections we use Algorithm PrivateTestPartition as a subroutine in our two main algorithms PrivatekAverages and PrivatekNoisyCenters. To argue about the privacy properties of these algorithms, we rely on the following key property of Algorithm PrivateTestPartition.

Claim 4.6. Let A′ be an algorithm that gets as input a multiset T ∈ ((R^d)^k)^n and a set of balls B = {B_1, ..., B_k}, and let ℓ = ℓ(n, ε/2, δ/4, β/2) be the value from Definition 4.4. Assume that A′ has the property that for any neighboring multisets T, T′ and any sets of Δ-far balls B, B′ that ℓ-nearly partition T and T′ (respectively), it holds that A′(T, B) and A′(T′, B′) are (ε′, δ/4)-indistinguishable. Let A be the algorithm that on input T does the following: (1) compute (Status, B) = PrivateTestPartition(T, ε/2, δ/4, β/2, Δ), and (2) if Status = "Failure", output ⊥ and abort; otherwise output A′(T, B). Then A is (ε/2 + ε′, δ)-differentially private.

Proof sketch.
Consider two executions over two neighboring databases. If the probability of success is smaller than δ in the two executions, then the outputs are (0, δ)-indistinguishable by Lemma 2.4. Otherwise, Claim 4.5 tells us that when the tests succeed, then w.h.p. the sets of Δ-far balls ℓ-nearly partition the databases in both executions. Hence, the proof follows from the assumption on A′.

Remark 4.7. Note that PrivateTestPartition runs in time O(mdk^2·n) = Õ(dk^2·n), since for each iteration X ∈ T_1 in PrivateTestCloseTuples, Step 1a takes O(dk^2) time, and Step 1b takes O(dk^2·n) time.

4.2. Algorithm PrivatekAverages

In this section we describe and state the properties of Algorithm PrivatekAverages (Figure 3). The properties are given in the following theorems.

Theorem 4.8 (Utility of PrivatekAverages). Let d, k, Λ > 0, r_min ∈ [0, Λ], ε, δ, β ∈ (0, 1], and let T ∈ (B(0, Λ)^k)^n ⊆ ((R^d)^k)^n. Assume that T is partitioned by Δ-far balls for Δ = 16, let {P_1, ..., P_k} = Partition(T) (according to Definition 3.5), and let r_i be the radius of the ball that contains P_i. Then w.p. 1 − β, algorithm PrivatekAverages on inputs T, r_min, ε, δ, β outputs k

Algorithm PrivatekAverages

Input: A multiset T ∈ (B(0, Λ)^k)^n ⊆ ((R^d)^k)^n, privacy parameters ε, δ ∈ (0, 1], a confidence parameter β ∈ (0, 1], and a lower bound on the radii r_min ∈ [0, Λ].

1. Compute (Status, B = {B_1, ..., B_k}) = PrivateTestPartition(T, ε/2, δ/4, β/2, Δ = 7).
2. If Status = "Failure", output ⊥ and abort.
3. Let c_1, ..., c_k be the centers of B_1, ..., B_k (respectively), and let Q_i = {x ∈ Points(T) : i = argmin_j ‖x − c_j‖}.
4. Let ℓ = ℓ(n, ε/2, δ/4, β/2) be the value from Definition 4.4.
5. For i = 1 to k: compute a noisy average â_i of Q_i using the algorithm from Proposition 2.9 with parameters Λ, r_min, β̂ = β/(2k), ε̂ = ε/(4k(ℓ+1)), δ̂ = δ/(8k · exp(ε/2) · (ℓ+1)).
6. Output Â = {â_1, ..., â_k}.

Figure 3: Algorithm PrivatekAverages for privately estimating the k averages.

points Â = {â_1, ...
, â_k} such that for any i ∈ [k], there exists a cluster (call it P_i) with

    ‖â_i − a_i‖ ≤ O( (max{r_i, r_min} · √d · kℓ / (εn)) · log(ℓ/δ) · log(Λ/(r_min β)) ),

where a_i = Avg(P_i) and ℓ = ℓ(n, ε/2, δ/4, β/2) is the value from Definition 4.4.

Theorem 4.9 (Privacy of PrivatekAverages). Let d, k, Λ > 0, r_min ∈ [0, Λ], ε, δ, β ∈ (0, 1]. Then for any integer n ≥ 2ℓ(n, ε/2, δ/4, β/2) + 2 (where ℓ is the function from Definition 4.4), algorithm PrivatekAverages(·, ε, δ, β, r_min) is (ε, δ)-differentially private for databases T ∈ (B(0, Λ)^k)^n ⊆ ((R^d)^k)^n.

Proof sketch. Consider two independent executions over neighboring databases T and T′, and let B and B′ be the balls from Step 1, respectively. By Claim 4.6, if we treat Steps 3 to 6 as algorithm A′ of the claim, then it is enough to prove indistinguishability only in the case that T and T′ are ℓ-nearly partitioned by B and B′, respectively. In this case, since T and T′ are neighboring and n ≥ 2ℓ + 2, there exists at least one k-tuple that is partitioned by both B and B′, yielding that each ball B_i in B intersects a single ball in B′ (call it B′_i w.l.o.g.). Since B and B′ are sets of Δ-far balls for large enough Δ (Δ = 7, as defined in the algorithm, is sufficient), each region B_i ∪ B′_i is far away from each B_j ∪ B′_j for j ≠ i. This yields that the partitions into {Q_1, ..., Q_k} in Step 3 of the two executions agree on all the points that belong to B_i ∪ B′_i for some i. Therefore, the partitions disagree on at most k(ℓ + 1) points, which are the points of the (at most) ℓ + 1 tuples that are not partitioned by B or B′. Hence, the privacy now follows from Proposition 2.9 with group privacy of size k(ℓ + 1).

Remark 4.10 (Run time of PrivatekAverages). Step 1 of PrivatekAverages takes O(dk^2·n) time (see Remark 4.7). By Proposition 2.9, the k executions of Step 5 take time Σ_{i=1}^k O(d·|Q_i|) = O(dkn). Overall, the running time of PrivatekAverages is O(dk^2·n).

4.2.1.
REDUCING THE DEPENDENCY ON d

Algorithm PrivatekAverages estimates the average of each cluster P_i of radius r_i up to an additive error of O(max{r_i, r_min} · √d) (ignoring poly(k, 1/ε) and polylog(n, 1/δ, 1/β, Λ, 1/r_min) factors). Yet, we can reduce this √d dependence by replacing in Step 5 the average algorithm of Proposition 2.9 with the average algorithm of (Nissim et al., 2016) (see Section 2.2.3).

4.3. Algorithm PrivatekNoisyCenters

In this section we describe and state the properties of Algorithm PrivatekNoisyCenters (Figure 4). The properties of PrivatekNoisyCenters are given in the following theorems.

Theorem 4.11 (Utility of PrivatekNoisyCenters). Let d, k > 0 and ε, β, δ ∈ (0, 1] with δ < β, let T ∈ ((R^d)^k)^n, and assume that T is partitioned by (2Δ + 2)-far balls for Δ = Ω((k/ε) √(log(k/δ))). Then when executing PrivatekNoisyCenters(T, ε, δ, β, Δ), with probability 1 − β, the output Ĉ = {ĉ_1, ..., ĉ_k} satisfies, for every i and j ≠ i, that ‖ĉ_i − c_i‖ < ‖ĉ_i − c_j‖.

We remark that the k factor in the Δ of Theorem 4.11 comes from applying basic composition over the k noisy centers in Ĉ. This, however, can be reduced to an O(√k) factor by applying advanced composition (Dwork et al., 2010).

Theorem 4.12 (Privacy of PrivatekNoisyCenters). Let d, k > 0, ε, β ∈ (0, 1], δ ∈ (0, 1/2], Δ > 6. Then for any integer n ≥ 2ℓ(n, ε/2, δ/4, β/2) + 2 (where ℓ is the function from Definition 4.4), PrivatekNoisyCenters(·, ε, δ, β, Δ) is (ε + δ/4, δ)-differentially private for databases T ∈ ((R^d)^k)^n.

Proof sketch. Consider two independent executions over

Algorithm PrivatekNoisyCenters

Input: A multiset T ∈ ((R^d)^k)^n, privacy parameters ε ∈ (0, 1] and δ ∈ (0, 1/2], a confidence parameter β ∈ (0, 1], and a separation parameter Δ ≥ 6.

1. Compute (Status, B = {B_1, ..., B_k}) = PrivateTestPartition(T, ε/2, δ/4, β/2, Δ).
2. If Status = "Failure", output ⊥ and abort.
3. Let c_1, ..., c_k be the centers of B_1, ..., B_k (respectively).
4. For i = 1 to k:
   (a) γ_i = (4/Δ^2) · (Lap(4k/ε) + (4k/ε) log(4k/δ) + 1).
   (b) λ_i = ((1 + γ_i)/Δ) · min_{j≠i} ‖c_i − c_j‖.
   (c) ĉ_i = c_i + (N(0, σ_i^2))^d, where σ_i = (4kλ_i/ε) √(2 log(10k/δ)).

5. Output Ĉ = {ĉ_1, ..., ĉ_k}.

Figure 4: Algorithm PrivatekNoisyCenters for privately finding the k centers.

neighboring databases T and T′, and let B and B′ be the balls from Step 1, respectively. Similarly to the proof sketch of Theorem 4.9 (privacy of PrivatekAverages), it is enough to prove indistinguishability only in the case that T and T′ are ℓ-nearly partitioned by B and B′, where in this case each ball B_i in B intersects a single ball in B′ (call it B′_i w.l.o.g.). Since B and B′ are Δ-far balls, this yields that the centers of B and B′ are relatively close, i.e., ‖c_i − c′_i‖ is bounded by (approximately) (2/Δ) min_{j≠i} ‖c_i − c_j‖ ≈ (2/Δ) min_{j≠i} ‖c′_i − c′_j‖. Therefore, we deduce by the properties of the Gaussian mechanism that the outputs are indistinguishable.

Remark 4.13 (Run time of PrivatekNoisyCenters). Step 1 of PrivatekNoisyCenters takes O(dk^2·n) time (see Remark 4.7). The for-loop in Step 4 takes only O(dk^2) time. Overall, the running time of PrivatekNoisyCenters is O(dk^2·n).

5. Empirical Results

We implemented our two main algorithms for k-tuple clustering, PrivatekAverages and PrivatekNoisyCenters, in Python. We compared the two algorithms in terms of the sample complexity that is needed to privately separate the samples from a given mixture of Gaussians: namely, how many k-tuples we need to sample such that, when executing PrivatekAverages or PrivatekNoisyCenters, the resulting k-tuple Y = {y_1, ..., y_k} satisfies the following requirement: for every i ∈ [k], there exists a point in Y (call it y_i) such that for every sample x that was drawn from the i-th Gaussian, it holds that i = argmin_{j∈[k]} ‖x − y_j‖.

[Figure 5: The case d = 1 and k = 2, for various R.]

We perform three tests, where in each test we considered a uniform mixture of k standard spherical Gaussians around the means {R·e_i, −R·e_i}_{i=1}^{k/2}, where e_i is the i-th standard basis vector.
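The mixtures used in these tests take only a few lines of NumPy to generate. The sketch below (names ours) assumes k is even and d ≥ k/2, matching the experimental setup:

```python
import numpy as np

def sample_gaussian_mixture(n, k, d, R, rng=None):
    """Draw n samples from the uniform mixture of k standard spherical
    Gaussians with means {R*e_i, -R*e_i} for i = 1, ..., k/2."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.zeros((k, d))
    for i in range(k // 2):
        means[2 * i, i] = R       # mean  R * e_i
        means[2 * i + 1, i] = -R  # mean -R * e_i
    labels = rng.integers(0, k, size=n)                # uniform component choice
    samples = means[labels] + rng.normal(size=(n, d))  # unit-variance noise
    return samples, labels
```

With the separations R used below, a non-private k-means++ run on such samples recovers the components easily; the experiments measure how much extra data the private algorithms need.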
In all the tests, we generated each k-tuple by running algorithm k-means++ (Arthur & Vassilvitskii, 2007) over enough samples. In Test1 (Figure 5) we examined the sample complexity in the case d = 1, k = 2, for R ∈ {2^5, 2^6, . . . , 2^9}. In Test2 (Figure 6) we examined the case d = 4, R = 512·√k, for k ∈ {2, 4, 6, 8}. In Test3 (Figure 7) we examined the case k = 2, R = 256·√d, for d ∈ {4, 8, 12, 16}. In all the experiments we used privacy parameters ε = 1 and δ = e^{−28}, and used β = 0.05. In all the tests of PrivatekNoisyCenters, we chose Δ = (10/ε)·k·√(log(k/δ)·log(k/β)); the number of k-tuples that we generated was exactly 3781 (the minimal value that is required for privacy), but the number of samples per k-tuple varied from test to test. In the tests of PrivatekAverages, we chose Λ = 2^10·√(kd) and rmin = 0.1; we generated each k-tuple using 15·k samples, but the number of k-tuples varied from test to test.3 All the experiments were run on a MacBook Pro laptop with a 4-core Intel i7 CPU at 2.8GHz and 16GB of RAM.

3By using Ω(kd) samples for creating each k-tuple, in Test3 (Figure 7) we could avoid the dependency of R on √d (see the supplementary material for more details). However, since we only used O(k) samples for each k-tuple when testing PrivatekAverages, we could not avoid this dependency.

Figure 6: The case d = 4 and R = 512·√k, for varying k.

The graphs show the main bottleneck of Algorithm PrivatekAverages in practice. It requires only O_{ε,δ}(kd) tuples (or O_{ε,δ}(k√d) for large values of d) in order to succeed, but the hidden constant is ≈500,000 for our choice of ε and δ, and this does not improve even when the assumed separation R is very large. The cause of this large constant is the group privacy of size O(kℓ) that we apply in Step 5, where recall that ℓ = O((log²(1/δ)/ε) · log n) (Definition 4.4). While in theory this ℓ is relatively small, with our choice of parameters we get ℓ ≈ 1000.
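The resulting budget for the internal average algorithm can be spelled out as follows. This is a back-of-the-envelope reconstruction: the factor 4 reflects our assumed budget split and is not stated explicitly above, but it is consistent with ℓ ≈ 1000 and the ε/4000k figure below:

```latex
\hat{\varepsilon} \;\approx\; \frac{\varepsilon}{4\,k\,\ell} \;\approx\; \frac{\varepsilon}{4000\,k} \qquad (\ell \approx 1000).
```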
This means that we need to execute the private average algorithm with ε̂ ≈ ε/(4000k). Internally, this ε̂ is shared between other private algorithms, and in particular with an Interior Point algorithm that is one of the internal components of the average algorithm from Proposition 2.9. This algorithm is implemented using the exponential mechanism (McSherry & Talwar, 2007), which simply outputs random noise when the number of points is too small.

We remark that prior work on differentially-private clustering, including in easy settings, is primarily theoretical. In particular, we are not aware of implemented methods that we could use as a baseline.4 As a sanity check, we did consider the following naive baseline: For every sample point, add Gaussian noise to make it private. The resulting noisy samples are then just samples from a new Gaussian mixture, so one can run an off-the-shelf non-private method to learn the parameters of this mixture. We tested this naive method on the simple case d = 1 and k = 2, where we generated samples from a mixture of standard Gaussians that are separated by R = 512. By the Gaussian mechanism, the noise magnitude that we need to add to each point for guaranteeing (ε, δ)-differential privacy is Ω(Λ·√(log(1/δ))/ε) for some Λ > R, meaning that the resulting mixture consists of very close Gaussians. We applied GaussianMixture from the package sklearn.mixture to learn this mixture, but it failed even when we used 100M samples, as this method is not intended for learning such close Gaussians. We remark that there are other non-private methods that are designed to learn any mixture of Gaussians (even very weakly separated ones) using enough samples (e.g., (Suresh et al., 2014)).

4We remark that in different settings, such as node-, edge- or weight-differential privacy, there exist some available implementations (e.g., (Pinot et al., 2018)).

Figure 7: The case k = 2, R = 256·√d, for varying d.
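The naive baseline just described can be sketched as follows. Again this is a simplified reconstruction: the placement of the two means at ±R/2, the sample size, and the sensitivity bound 2Λ (for data assumed clipped to [−Λ, Λ]) are our illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# d = 1, k = 2: standard Gaussians whose means are R apart.
R, Lam = 512.0, 1024.0         # Lam is the assumed a-priori bound Lambda > R
eps, delta = 1.0, np.exp(-28)

n = 100_000
signs = rng.integers(0, 2, size=n) * 2 - 1      # +/-1 with equal probability
x = signs * (R / 2) + rng.standard_normal(n)    # means at +/- R/2

# Gaussian mechanism, per point: replacing one point can change its value
# by up to 2*Lam, so (eps, delta)-DP calls for noise of scale
sigma = 2 * Lam * np.sqrt(2 * np.log(1.25 / delta)) / eps
noisy = x + sigma * rng.standard_normal(n)

# sigma >> R: the two noisy components overlap almost completely, which is
# why an off-the-shelf non-private learner fails on the privatized samples.
gm = GaussianMixture(n_components=2, random_state=0).fit(noisy.reshape(-1, 1))
```

Note that sigma here is more than an order of magnitude larger than the separation R, so the privatized mixture is essentially a single wide Gaussian.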
The sample complexity and running time of such general-purpose methods, however, are much worse than ours even asymptotically (e.g., the running time of (Suresh et al., 2014) is exponential in k), and moreover, we are not aware of any implementation we could use.5

5Asymptotically, (Suresh et al., 2014) requires at least Ω(dk^9) samples and runs in time Ω(n²d + d²(k^7 log d)^{k²}). For the setting of learning a mixture of k well-separated Gaussians, the approach of first adding noise to each point and then applying a non-private method such as (Suresh et al., 2014) results in much worse parameters than our result, which only requires O(dk) samples and runs in time O(dk²n).

6. Conclusion

We developed an approach to bridge the gap between the theory and practice of differentially private clustering methods. In the future, we hope to further optimize the constants in the k-tuple clustering algorithms, making the approach practical for instances with lower separation. Tangentially, the inherent limitations of private versus non-private clustering suggest exploring different rigorous notions of privacy in the context of clustering.

Acknowledgements

Edith Cohen is supported by Israel Science Foundation grant no. 1595-19. Haim Kaplan is supported by Israel Science Foundation grant no. 1595-19, and the Blavatnik Family Foundation. Yishay Mansour has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grant number 993/17), and by the Yandex Initiative for Machine Learning at Tel Aviv University. Uri Stemmer is partially supported by the Israel Science Foundation (grant 1871/19) and by the Cyber Security Research Center at Ben-Gurion University of the Negev.

References

Ahmadian, S., Norouzi-Fard, A., Svensson, O., and Ward, J. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, (0):FOCS17-97, 2019.

Arthur, D.
and Vassilvitskii, S. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035, 2007.

Awasthi, P., Blum, A., and Sheffet, O. Stability yields a PTAS for k-median and k-means clustering. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 309–318. IEEE, 2010.

Awasthi, P., Blum, A., and Sheffet, O. Center-based clustering under perturbation stability. Information Processing Letters, 112(1-2):49–54, 2012.

Balcan, M.-F., Blum, A., and Gupta, A. Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077. SIAM, 2009.

Balcan, M.-F., Dick, T., Liang, Y., Mou, W., and Zhang, H. Differentially private clustering in high-dimensional Euclidean spaces. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 322–331, 2017.

Beimel, A., Kasiviswanathan, S. P., and Nissim, K. Bounds on the sample complexity for private learning and private data release. In TCC, volume 5978 of LNCS, pp. 437–454. Springer, 2010.

Bilu, Y. and Linial, N. Are stable instances easy? Combinatorics, Probability and Computing, 21(5):643–660, 2012.

Biswas, S., Dong, Y., Kamath, G., and Ullman, J. R. CoinPress: Practical private mean and covariance estimation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.

Blum, A., Dwork, C., McSherry, F., and Nissim, K. Practical privacy: The SuLQ framework. In Li, C. (ed.), PODS, pp. 128–138. ACM, 2005.

Bun, M. and Steinke, T. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography - 14th International Conference, TCC 2016, volume 9985, pp. 635–658, 2016.

Bun, M. and Steinke, T. Average-case averages: Private algorithms for smooth sensitivity and mean estimation.
In Advances in Neural Information Processing Systems, pp. 181–191, 2019.

Cai, T. T., Wang, Y., and Zhang, L. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.

Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via distributed noise generation. In Vaudenay, S. (ed.), EUROCRYPT, volume 4004 of Lecture Notes in Computer Science, pp. 486–503. Springer, 2006a.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC, pp. 265–284. Springer, 2006b.

Dwork, C., Rothblum, G. N., and Vadhan, S. P. Boosting and differential privacy. In FOCS, pp. 51–60. IEEE Computer Society, 2010.

Feldman, D., Fiat, A., Kaplan, H., and Nissim, K. Private coresets. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, pp. 361–370, 2009.

Feldman, D., Xiang, C., Zhu, R., and Rus, D. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN '17, pp. 3–15. ACM, 2017.

Ghazi, B., Kumar, R., and Manurangsi, P. Differentially private clustering: Tight approximation ratios. In NeurIPS, 2020.

Gupta, A., Ligett, K., McSherry, F., Roth, A., and Talwar, K. Differentially private combinatorial optimization. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '10, pp. 1106–1125, 2010.

Huang, Z. and Liu, J. Optimal differentially private algorithms for k-means clustering. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 395–408, 2018.

Johnson, W. and Lindenstrauss, J. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
Kamath, G., Li, J., Singhal, V., and Ullman, J. Privately learning high-dimensional distributions. In Beygelzimer, A. and Hsu, D. (eds.), Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pp. 1853–1902. PMLR, 2019a.

Kamath, G., Sheffet, O., Singhal, V., and Ullman, J. Differentially private algorithms for learning mixtures of separated Gaussians. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pp. 168–180, 2019b.

Kamath, G., Singhal, V., and Ullman, J. Private mean estimation of heavy-tailed distributions. arXiv preprint arXiv:2002.09464, 2020.

Kaplan, H. and Stemmer, U. Differentially private k-means with constant multiplicative error. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pp. 5436–5446, 2018.

Karwa, V. and Vadhan, S. Finite sample differentially private confidence intervals. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), 2018.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. D. What can we learn privately? SIAM J. Comput., 40(3):793–826, 2011.

Kumar, A. and Kannan, R. Clustering with spectral norm and the k-means algorithm. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 299–308. IEEE, 2010.

Lee, E., Schmidt, M., and Wright, J. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40–43, 2017.

McSherry, F. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 19–30, 2009.

McSherry, F. and Talwar, K. Mechanism design via differential privacy. In FOCS, pp. 94–103.
IEEE Computer Society, 2007.

Mohan, P., Thakurta, A., Shi, E., Song, D., and Culler, D. GUPT: Privacy preserving data analysis made easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pp. 349–360, 2012.

Nguyen, H. L. A note on differentially private clustering with large additive error. CoRR, abs/2009.13317, 2020.

Nissim, K. and Stemmer, U. Clustering algorithms for the centralized and local models. In Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pp. 619–653, 2018.

Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In STOC, pp. 75–84. ACM, 2007.

Nissim, K., Stemmer, U., and Vadhan, S. P. Locating a small cluster privately. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pp. 413–427, 2016.

Nock, R., Canyasse, R., Boreli, R., and Nielsen, F. k-variates++: more pluses in the k-means++. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, pp. 145–154, 2016.

Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1–28:22, 2012.

Pinot, R., Morvan, A., Yger, F., Gouy-Pailler, C., and Atif, J. Graph-based clustering under differential privacy. In Globerson, A. and Silva, R. (eds.), Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, pp. 329–338, 2018.

Shechner, M., Sheffet, O., and Stemmer, U. Private k-means clustering with stability assumptions. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, volume 108 of Proceedings of Machine Learning Research, pp. 2518–2528, 2020.

Stemmer, U. Locally private k-means clustering. In SODA. SIAM, 2020.
Su, D., Cao, J., Li, N., Bertino, E., and Jin, H. Differentially private k-means clustering. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY '16, pp. 26–37, 2016.

Suresh, A. T., Orlitsky, A., Acharya, J., and Jafarpour, A. Near-optimal-sample estimators for spherical Gaussian mixtures. Advances in Neural Information Processing Systems, 27:1395–1403, 2014.

Wang, Y., Wang, Y.-X., and Singh, A. Differentially private subspace clustering. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS '15, pp. 1000–1008, 2015.