# Locally Private k-Means in One Round

Alisa Chang, Badih Ghazi, Ravi Kumar, Pasin Manurangsi
Google Research, Mountain View, CA. Correspondence to: Alisa Chang, Badih Ghazi, Ravi Kumar, Pasin Manurangsi.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. We provide an approximation algorithm for k-means clustering in the one-round (aka non-interactive) local model of differential privacy (DP). This algorithm achieves an approximation ratio arbitrarily close to that of the best non-private approximation algorithm, improving upon previously known algorithms that only guarantee large (constant) approximation ratios. Furthermore, this is the first constant-factor approximation algorithm for k-means that requires only one round of communication in the local DP model, positively resolving an open question of Stemmer (2020). Our algorithmic framework is quite flexible; we demonstrate this by showing that it also yields a similar near-optimal approximation algorithm in the (one-round) shuffle DP model.

1. Introduction

The vast amounts of data collected for use by modern machine learning algorithms, along with increased public awareness of privacy risks, have stimulated intense research into privacy-preserving methods. Recently, differential privacy (DP) (Dwork et al., 2006b;a) has emerged as a popular definition, due to its mathematical rigor and strong guarantees. This has translated into several practical deployments within government agencies (Abowd, 2018) and industry (Erlingsson et al., 2014; Shankland, 2014; Greenberg, 2016; Apple Differential Privacy Team, 2017; Ding et al., 2017), and integration into widely-used libraries such as TensorFlow (Radebaugh and Erlingsson, 2019) and PyTorch (Testuggine and Mironov, 2020).

Among unsupervised machine learning tasks, clustering in general and k-means clustering in particular are among the most basic and well-studied, from both theoretical and practical points of view. The research literature on k-means is extensive, with numerous applications not only in machine learning but also well beyond computer science. Computationally, finding optimal k-means cluster centers is NP-hard (e.g., Aloise et al., 2009), and the best known polynomial-time algorithm achieves an approximation ratio of 6.358 (Ahmadian et al., 2020).

A natural question is to study clustering and k-means with DP guarantees. There is a growing literature on this topic (Blum et al., 2005; Nissim et al., 2007; Feldman et al., 2009; Gupta et al., 2010; Mohan et al., 2012; Wang et al., 2015; Nissim et al., 2016; Nock et al., 2016; Su et al., 2016; Feldman et al., 2017; Balcan et al., 2017; Nissim and Stemmer, 2018; Huang and Liu, 2018; Stemmer and Kaplan, 2018; Stemmer, 2020; Ghazi et al., 2020b; Jones et al., 2021; Chaturvedi et al., 2021). In contrast to non-private k-means, additive errors are necessary in this case, and a recent line of work has obtained increasingly tighter approximation ratios, culminating with the work of Ghazi et al. (2020b), which comes arbitrarily close to the smallest possible non-private approximation ratio (currently equal to 6.358, as mentioned above). However, this last result only holds in the central model of DP, where a curator is entrusted with the raw data and is required to output private cluster centers.
The lack of trust in a central curator has led to extensive study of distributed models of DP, the most prominent of which is the local setting, where the sequence of messages sent by each user is required to be DP. The state-of-the-art local DP k-means algorithm is due to Stemmer (2020); it achieves a constant (at least 100, by our estimate) approximation ratio in a constant (at least five) number of rounds of interaction. In fact, all (non-trivial) known local DP k-means algorithms are interactive, meaning that the analyzer proceeds in multiple stages, each of which uses the (private) outcomes of the previous stages. This begs the question, left open by Stemmer (2020): is there a non-interactive local DP approximation algorithm for k-means, and can it attain the non-private approximation ratio?

1.1. Our Contributions

We resolve both open questions above by showing that there is a DP algorithm in the non-interactive local model with an approximation ratio that is arbitrarily close to that of any non-private algorithm, as stated below (see Footnote 1).

Theorem 1. Suppose that there is a (not necessarily private) polynomial-time approximation algorithm for k-means with approximation ratio $w$. For any $\alpha > 0$, there is a one-round $\epsilon$-DP $\big(w(1+\alpha),\, k^{O_\alpha(1)} \cdot \sqrt{nd} \cdot \mathrm{polylog}(n)/\epsilon\big)$-approximation algorithm for k-means in the local DP model. The algorithm runs in time $\mathrm{poly}(n, d, k)$.

Here and throughout, we assume that the users and the analyzer have access to shared randomness. We further discuss the question of obtaining similar results using only private randomness in Section 6. Furthermore, we assume throughout that $0 < \epsilon \le O(1)$ and will not state this assumption explicitly in the formal statements.

The dependency on $n$ in the additive error above is also essentially tight, as Stemmer (2020) proved a lower bound of $\Omega(\sqrt{n})$ on the additive error. Our bound also improves upon his protocol, where the additive error grows with $n^{1/2+a}$ for an arbitrarily small positive constant $a$.

In addition to the local model, our approach can be easily adapted to other distributed privacy models. To demonstrate this, we give an algorithm in the shuffle model (Bittau et al., 2017; Erlingsson et al., 2019; Cheu et al., 2019) (see Footnote 2) with a similar approximation ratio but with a smaller additive error of $O\big(k^{O_\alpha(1)} \cdot \sqrt{d} \cdot \mathrm{polylog}(nd/\delta)/\epsilon\big)$.

Theorem 2. Suppose that there is a (not necessarily private) polynomial-time approximation algorithm for k-means with approximation ratio $w$. For any $\alpha > 0$, there is a one-round $(\epsilon, \delta)$-DP $\big(w(1+\alpha),\, k^{O_\alpha(1)} \cdot \sqrt{d} \cdot \mathrm{polylog}(nd/\delta)/\epsilon\big)$-approximation algorithm for k-means in the shuffle DP model. The algorithm runs in time $\mathrm{poly}(n, d, k, \log(1/\delta))$.

To the best of our knowledge, this is the first k-means algorithm in the one-round shuffle model with (non-trivial) theoretical guarantees. Furthermore, we remark that the above guarantees essentially match those of the central DP algorithm of Ghazi et al. (2020b), while our algorithm works even in the more restrictive shuffle DP model.

The main new ingredient in our framework is the use of a hierarchical data structure called a net tree (Har-Peled and Mendel, 2006) to construct a private coreset (see Footnote 3) of the input points. The decoder can then simply run any (non-private) approximation algorithm on this private coreset. The net tree (Har-Peled and Mendel, 2006) is built as follows.

(Footnote 1: An algorithm achieves a $(w, t)$-approximation if its output cost is no more than $w$ times the optimal cost plus $t$, i.e., $w$ is the approximation ratio and $t$ is the additive error; see Section 2.1.)
(Footnote 2: The shuffle model of DP is a middle ground between the central and local models. Please see Appendix C for more details.)

(Footnote 3: See Section 2.1 for the definition of coresets.)

Roughly speaking, each input point can be assigned to its representative leaf of a net tree; the deeper the tree is, the closer this representative is to the input point. To achieve privacy, noise needs to be added to the number of points assigned to each leaf. This creates a natural tension when building such a tree: if the tree has few leaves, the representatives are far from the actual input points, whereas if the tree has too many leaves, it suffers a larger error introduced by the noise. Our main technical contribution is an algorithm for constructing a net tree that balances these two errors, ensuring that neither affects the k-means objective much. We stress that this algorithm works in the non-interactive local model, as each user can encode all possible representatives of their input point without requiring any interaction.

Organization. In Section 2, we provide background on k-means and a description of the tools we need for our algorithm. In Section 3, we describe and analyze net trees, which are sufficient to solve private k-means in low dimensions. In Section 4, we use dimensionality reduction to extend our results to the high-dimensional case. In Section 5, we present an empirical evaluation of our algorithm. Section 6 contains the concluding remarks. All the missing proofs and the results for the shuffle DP model are in the Supplementary Material.

2. Preliminaries

We use $\|x\|$ to denote the Euclidean norm of a vector $x$. We use $B_d(x, r)$ to denote the closed radius-$r$ ball around $x$, i.e., $B_d(x, r) = \{y \in \mathbb{R}^d \mid \|x - y\| \le r\}$. Let $B_d = B_d(0, 1)$, the unit ball. For every pair $T, T' \subseteq \mathbb{R}^d$ of sets, we use $d(T, T')$ as a shorthand for $\min_{x \in T, x' \in T'} \|x - x'\|$. For every $m \in \mathbb{N}$, we use $[m]$ as a shorthand for $\{1, \ldots, m\}$. For real numbers $a_1, \ldots, a_t$ with $t \ge m$, we use $\mathrm{bottom}_m(a_1, \ldots, a_t)$ to denote the sum of the smallest $m$ numbers among $a_1, \ldots, a_t$.

We will be working with a generalization of multisets, where each element can have a weight associated with it.

Definition 3 (Weighted Point Set). A weighted point set $S$ on a domain $D$ is a finite set of tuples $(x, w_S(x))$ where $x \in D$ and $w_S(x) \in \mathbb{R}_{\ge 0}$ denotes its weight; each $x$ should not be repeated in $S$. It is useful to think of $S$ as the function $w_S : D \to \mathbb{R}_{\ge 0}$; this extends naturally (as a measure) to any subset $T \subseteq D$, for which we define $w_S(T) = \sum_{x \in T} w_S(x)$. We say that $w_S(D)$ is the total weight of $S$ and we abbreviate it as $|S|$. The average of $S$ is defined as $\mu(S) := \sum_{x \in B_d} w_S(x) \cdot x / |S|$.

Throughout the paper, we alternatively view a multiset $X$ as a weighted point set, where $w_X(x)$ denotes the (possibly fractional) number of times $x$ appears in $X$. For an algorithm $A$, we use $t(A)$ to denote its running time.

2.1. k-means and coresets

Let $X \subseteq B_d$ be an input multiset and let $n = |X|$.

Definition 4 (k-means). The cost of a set $C$ of centers with respect to an input multiset $X \subseteq B_d$ is defined as $\mathrm{cost}_X(C) := \sum_{x \in X} (\min_{c \in C} \|x - c\|)^2$. The k-means problem asks, given $X$ and $k \in \mathbb{N}$, to find $C$ of size $k$ that minimizes $\mathrm{cost}_X(C)$; the minimum cost is denoted $\mathrm{OPT}^k_X$. For a weighted point set $S$ on $B_d$, we define $\mathrm{cost}_S(C) := \sum_{x \in B_d} w_S(x) \cdot (\min_{c \in C} \|x - c\|)^2$, and let $\mathrm{OPT}^k_S := \min_{|C| = k} \mathrm{cost}_S(C)$.
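To make the bookkeeping above concrete, the following minimal sketch (hypothetical helper names, not from the paper) represents a weighted point set as a dictionary and evaluates the weighted k-means cost, the average $\mu(S)$, and $\mathrm{bottom}_m$.

```python
import numpy as np

def kmeans_cost(S, C):
    """cost_S(C) = sum_x w_S(x) * min_{c in C} ||x - c||^2 for a weighted point set S."""
    centers = np.asarray(C)
    total = 0.0
    for x, w in S.items():  # S maps points (tuples) to nonnegative weights
        x = np.asarray(x)
        total += w * np.min(np.sum((centers - x) ** 2, axis=1))
    return total

def average(S):
    """mu(S) = (sum_x w_S(x) * x) / |S|, where |S| is the total weight."""
    total_weight = sum(S.values())
    return sum(w * np.asarray(x) for x, w in S.items()) / total_weight

def bottom_m(values, m):
    """Sum of the m smallest numbers among `values`."""
    return sum(sorted(values)[:m])

# Example: three weighted points in the unit ball of R^2 and two candidate centers.
S = {(0.1, 0.0): 2.0, (0.5, 0.5): 1.0, (-0.3, 0.2): 1.5}
print(kmeans_cost(S, [(0.0, 0.0), (0.5, 0.5)]), average(S), bottom_m([3, 1, 2], 2))
```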
A set $C$ of size $k$ is a $(w, t)$-approximate solution for input $S$ if $\mathrm{cost}_S(C) \le w \cdot \mathrm{OPT}^k_S + t$.

It is also useful to define the cost with respect to partitions, i.e., the mapping of points to the centers.

Definition 5 (Partition Cost). For a given partition $\varphi : S \to [k]$, where $S$ is a weighted point set, and an ordered set $C = (c_1, \ldots, c_k)$ of $k$ candidate centers, we define the cost of $\varphi$ with respect to $C$ as $\mathrm{cost}_S(\varphi, C) := \sum_{x \in B_d} w_S(x) \cdot \|x - c_{\varphi(x)}\|^2$. Furthermore, we define the cost of $\varphi$ as $\mathrm{cost}_S(\varphi) := \min_{C \in (\mathbb{R}^d)^k} \mathrm{cost}_S(\varphi, C)$.

Note that a minimizer for the term $\min_{C \in (\mathbb{R}^d)^k} \mathrm{cost}_S(\varphi, C)$ above is $c_i = \mu(S_i)$, where $S_i$ denotes the weighted point set corresponding to part $i$, i.e., $w_{S_i}(x) = w_S(x)$ if $\varphi(x) = i$, and $0$ otherwise.

Coresets. In our algorithm, we will produce a weighted point set that is a good approximation to the original input set. To quantify how good the approximation is, it is convenient to use the well-studied notion of coresets (see, e.g., Har-Peled and Mazumdar, 2004), which we recall below.

Definition 6 (Coreset). A weighted point set $S'$ is a $(k, \gamma, t)$-coreset of a weighted point set $S$ if for every set $C \subseteq B_d$ of $k$ centers, it holds that
$$(1 - \gamma) \cdot \mathrm{cost}_S(C) - t \;\le\; \mathrm{cost}_{S'}(C) \;\le\; (1 + \gamma) \cdot \mathrm{cost}_S(C) + t.$$
When $k$ is clear from context, we refer to such an $S'$ as just a $(\gamma, t)$-coreset of $S$.

2.2. (Generalized) Monge's Optimal Transport

Another tool that will facilitate our proof is (a generalization of) optimal transport (Monge, 1781). Roughly speaking, in Monge's optimal transport problem, we are given two measures on a metric space and we would like to find a map that transports the first measure to the second measure while minimizing the cost, which is often defined in terms of the total mass moved times some function of the distance. This problem can be ill-posed, as such a mapping may not exist, e.g., when the total masses are different. (We will often encounter this issue, as we will be adding noise for differential privacy, resulting in unequal masses.) To deal with this, we use a slightly generalized version of optimal transport where unmatched mass is allowed but penalized by the L1 difference (see Footnote 4), defined next.

Definition 7 (Generalized $L_2^2$ Transport Cost). Let $S, S'$ be weighted point sets on $B_d$. Their generalized ($L_2^2$) Monge's transport cost of a mapping $\pi : B_d \to B_d$ is defined as
$$\mathrm{mt}(\pi, S, S') := \sum_{y \in B_d} w_S(y) \cdot \|\pi(y) - y\|^2 + \sum_{x \in B_d} \big| w_S(\pi^{-1}(x)) - w_{S'}(x) \big|. \quad (1)$$
Finally, we define the optimal generalized ($L_2^2$) Monge's transport cost from $S$ to $S'$ as
$$\mathrm{mt}(S, S') = \min_{\pi : B_d \to B_d} \mathrm{mt}(\pi, S, S'). \quad (2)$$
We remark that the minimizer always exists because our weighted point sets $S, S'$ have finite supports.

A crucial property of optimal transport we will use is that if the optimal transport cost between $S$ and $S'$ is small relative to the optimal k-means objective, then $S'$ is a good coreset for $S$. This is encapsulated in the following lemma.

Lemma 8. For any $\gamma \in (0, 1)$ and any weighted point sets $S, S'$ over $B_d$, if $\mathrm{mt}(S, S') \le \frac{\gamma}{8(1 + 2/\gamma)} \cdot \mathrm{OPT}^k_S + t$ for some $t \ge 0$, then $S'$ is a $(\gamma, 4(1 + 2/\gamma)t)$-coreset of $S$.

We also use a slightly more refined bound, stated below. It is worth noting that the first inequality below is very explicit, as it also specifies the mapping for $S$; we will indeed need this additional property in our analysis when applying dimensionality reduction.

Lemma 9. For any $\gamma \in (0, 1]$, weighted point sets $S, S'$ over $B_d$, $C \in (B_d)^k$, $\varphi : B_d \to [k]$, and $\pi : B_d \to B_d$,
$$\mathrm{cost}_S(\varphi \circ \pi, C) \le (1 + \gamma) \cdot \mathrm{cost}_{S'}(\varphi, C) + 4(1 + 1/\gamma) \cdot \mathrm{mt}(\pi, S, S'),$$
and
$$\mathrm{cost}_{S'}(C) \le (1 + \gamma) \cdot \mathrm{cost}_S(C) + 4(1 + 1/\gamma) \cdot \mathrm{mt}(\pi, S, S').$$

The proofs of both lemmas are deferred to Appendix A.1.
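As a sanity check on Definition 7, here is a small sketch (hypothetical helper, not from the paper) that evaluates $\mathrm{mt}(\pi, S, S')$ for an explicitly given mapping $\pi$ between the finite supports of two weighted point sets.

```python
import numpy as np

def generalized_transport_cost(S, S_prime, pi):
    """mt(pi, S, S') from Definition 7:
    sum_y w_S(y) * ||pi(y) - y||^2  +  sum_x |w_S(pi^{-1}(x)) - w_{S'}(x)|.
    S, S_prime: dicts mapping points (tuples) to weights; pi: dict mapping the
    support points of S to points in B_d."""
    move_cost = sum(
        w * np.sum((np.asarray(pi[y]) - np.asarray(y)) ** 2) for y, w in S.items()
    )
    # Mass of S transported onto each image point x = pi(y).
    pushed = {}
    for y, w in S.items():
        pushed[pi[y]] = pushed.get(pi[y], 0.0) + w
    # L1 penalty over all points appearing in either the pushed mass or S'.
    support = set(pushed) | set(S_prime)
    l1_penalty = sum(abs(pushed.get(x, 0.0) - S_prime.get(x, 0.0)) for x in support)
    return move_cost + l1_penalty

# Tiny example: move all of S onto the single point (0, 0).
S = {(0.2, 0.0): 1.0, (0.0, 0.1): 2.0}
S_prime = {(0.0, 0.0): 2.5}
pi = {(0.2, 0.0): (0.0, 0.0), (0.0, 0.1): (0.0, 0.0)}
print(generalized_transport_cost(S, S_prime, pi))  # 1*0.04 + 2*0.01 + |3.0 - 2.5|
```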
2.3. Efficiently Decodable Nets

Let $L \subseteq B_d$ be a finite set. Its covering radius, denoted $\Delta(L)$, is defined as $\max_{x \in B_d} \min_{y \in L} \|x - y\|$. Its packing radius, denoted $\gamma(L)$, is defined as the largest $\gamma$ such that the open balls of radius $\gamma$ around the points of $L$ are disjoint. We say $L$ is a $(\Delta, \gamma)$-net if $\Delta(L) \le \Delta$ and $\gamma(L) \ge \gamma$.

Using known results on lattices (Rogers, 1959; Micciancio, 2004; Micciancio and Voulgaris, 2013), Ghazi et al. (2020b) show that there exist nets with $\gamma = \Theta(\Delta)$ that are efficiently decodable, i.e., given any point we can find the points of $L$ close to it in $\exp(O(d))$ time.

(Footnote 4: Several works have defined similar but slightly different notions; see, e.g., Benamou, 2003; Piccoli and Rossi, 2014.)

Lemma 10 ((Micciancio, 2004; Ghazi et al., 2020b)). For any given $\Delta > 0$, there exists a $(\Delta, \Delta/3)$-net $L$ such that, for any given point $x \in B_d$ and any $r \ge \Delta$, we can find all points in $B_d(x, r) \cap L$ in time $(1 + r/\Delta)^{O(d)}$.

2.4. (Generalized) Histogram and Vector Summation

Finally, we need a few fundamental protocols as building blocks. We first introduce a general aggregation problem.

Definition 11 (Generalized Bucketized Vector Summation). In the generalized bucketized vector summation problem, each user holds a set $Y_i \subseteq Y$ of $T$ buckets and a vector $v_i \in B_d$. The goal is to determine, for a given $y \in Y$, the vector sum of bucket $y$, which is $v_y := \sum_{i \in [n] : y \in Y_i} v_i$. An approximate generalized vector sum oracle $\tilde v$ is said to be $\zeta$-accurate at $y$ if we have $\|\tilde v_y - v_y\| < \zeta$.

The setting in this problem, where each user holds a $d$-dimensional vector and contributes to $T$ buckets, generalizes many well-studied problems, including (i) the histogram problem, where $T = d = 1$ and the (scalar) sum of a bucket is its frequency; (ii) the generalized histogram problem, where $d = 1$ but $T$ can be more than one; and (iii) the bucketized vector summation problem, where $T = 1$ but $d$ can be more than one.

2.5. Privacy Models

We first recall the formal definition of differential privacy (DP). For $\epsilon > 0$ and $\delta \in [0, 1]$, a randomized algorithm $A$ is $(\epsilon, \delta)$-DP if for every pair $X, X'$ of inputs that differ on one point and for every subset $\mathcal{S}$ of the algorithm's possible outputs, it holds that $\Pr[A(X) \in \mathcal{S}] \le e^\epsilon \cdot \Pr[A(X') \in \mathcal{S}] + \delta$. When $\delta = 0$, we simply say the algorithm is $\epsilon$-DP.

Let $n$ be the number of users, let $X = \{x_1, \ldots, x_n\}$, and let the input $x_i$ be held by the $i$th user. An algorithm in the local DP model consists of (i) an encoder, whose input is the data held by one user and whose output is a sequence of messages, and (ii) a decoder, whose input is the concatenation of the messages from all the encoders and whose output is the output of the algorithm. A pair (Enc, Dec) is $(\epsilon, \delta)$-DP in the local model if for any input $X = (x_1, \ldots, x_n)$, the algorithm $A(X) := (\mathrm{Enc}(x_1), \ldots, \mathrm{Enc}(x_n))$ is $(\epsilon, \delta)$-DP.

For any generalized bucketized vector summation problem, we say that the pair (Enc, Dec) is a $(\zeta, \beta)$-accurate $(\epsilon, \delta)$-DP algorithm (in a privacy model) if:
- $\mathrm{Enc}_{(\epsilon,\delta)}$ is an $(\epsilon, \delta)$-DP algorithm that takes in the input and produces a randomized output,
- $\mathrm{Dec}_{(\epsilon,\delta)}$ takes in the randomized output and a target bucket $y$, and produces an estimated vector sum for $y$,
- for each $y \in Y$, the above oracle is $\zeta$-accurate at $y$ with probability at least $1 - \beta$.

For the local DP model, the following generalized histogram guarantee is a simple consequence of the result of Bassily et al. (2020):

Theorem 12 ((Bassily et al., 2020)). There is an $\big(O(\sqrt{n T^3 \log(|Y|/\beta)}/\epsilon), \beta\big)$-accurate $\epsilon$-DP algorithm for generalized histogram in the local model. The encoder and the decoder run in time $\mathrm{poly}(nT/\epsilon, \log |Y|)$.
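To make the $(\zeta, \beta)$-accuracy interface concrete, here is a toy one-round local protocol for the plain histogram case ($T = d = 1$) based on k-ary randomized response with an unbiased decoder. This is only an illustration of the encoder/decoder interface; it is not the protocol of Bassily et al. (2020) behind Theorem 12, and its error grows much faster with the bucket universe $|Y|$.

```python
import numpy as np

def rr_encode(bucket, buckets, eps, rng):
    """One-round eps-DP local randomizer: k-ary randomized response."""
    k = len(buckets)
    p_keep = np.exp(eps) / (np.exp(eps) + k - 1)  # report true bucket w.p. p_keep
    if rng.random() < p_keep:
        return bucket
    others = [b for b in buckets if b != bucket]
    return others[rng.integers(len(others))]

def rr_decode(reports, buckets, eps):
    """Unbiased frequency estimates from the noisy reports."""
    k, n = len(buckets), len(reports)
    p_keep = np.exp(eps) / (np.exp(eps) + k - 1)
    p_other = (1 - p_keep) / (k - 1)
    counts = {b: sum(r == b for r in reports) for b in buckets}
    # E[count_b] = n*p_other + f_b*(p_keep - p_other); invert this relation.
    return {b: (counts[b] - n * p_other) / (p_keep - p_other) for b in buckets}

rng = np.random.default_rng(0)
buckets = list(range(10))
data = [0] * 500 + [1] * 300 + [2] * 200
reports = [rr_encode(x, buckets, eps=1.0, rng=rng) for x in data]
print(rr_decode(reports, buckets, eps=1.0))  # roughly 500, 300, 200 on buckets 0-2
```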
Using the technique of Bassily et al. (2020) together with that of Duchi et al. (2013), we can obtain the following guarantee for generalized bucketized vector summation:

Lemma 13. There is an $\big(O(\sqrt{n d T^3 \log(|Y|/\beta)}/\epsilon), \beta\big)$-accurate $\epsilon$-DP algorithm for generalized bucketized vector summation in the local model. The encoder and the decoder run in time $\mathrm{poly}(ndT/\epsilon, \log |Y|)$.

The derivations of the above two bounds are explained in more detail in Appendix B.

3. Net Trees

In this section, we describe net trees (Har-Peled and Mendel, 2006), which are data structures that allow us to easily construct coresets of the inputs when the dimension is small. We remark that, although the main structure of the net trees we use is similar to that of Har-Peled and Mendel (2006), there are several differences, the main one being the construction algorithm, which in our case has to work via (noisy) oracles, leading to considerable challenges.

3.1. Description and Notation

Let $L_1, \ldots, L_T \subseteq B_d$ be a family of efficiently decodable nets, where $L_i$ has covering radius (see Footnote 5) $\Delta_i := 1/2^i$ and packing radius $\gamma_i := \gamma/2^i$. Furthermore, let $L_0 = \{0\}$ and $\Delta_0 = \gamma_0 = 1$. For convenience, we assume that $L_0, L_1, \ldots, L_T$ are disjoint; this is w.l.o.g. as we can always shift each net slightly so that their elements are distinct. For $i \in \{0, \ldots, T\}$, let $\pi_i : B_d \to L_i$ denote the map from any point to its closest point in $L_i$ (ties broken arbitrarily).

(Footnote 5: The $1/2^i$ term here can be replaced by $\lambda^i$ for any $\lambda \in (0, 1)$; we use $1/2^i$ to avoid introducing yet another parameter.)

Complete Net Tree. Given a family of nets $L_1, \ldots, L_T$, the complete net tree is defined as a tree with $T + 1$ levels. For $i \in \{0, \ldots, T\}$, the nodes in level $i$ are exactly the elements of $L_i$. Furthermore, for $i \in [T]$, the parent of a node $z \in L_i$ is $\pi_{i-1}(z) \in L_{i-1}$. We use $\mathrm{children}(z)$ to denote the set of all children of $z$ in the complete net tree.

Net Tree. A net tree $\mathcal{T}$ is a subtree of the complete net tree rooted at $0$, in which each node $z$ is either a leaf or has all of $\mathrm{children}(z)$ present in $\mathcal{T}$.

We will also use the following additional notation. For each $i \in \{0, \ldots, T\}$, let $\mathcal{T}_i$ be the set of all nodes at level $i$ of the tree $\mathcal{T}$. Moreover, we use $\mathrm{leaves}(\mathcal{T})$ to denote the set of all leaves of $\mathcal{T}$ and $\mathrm{leaves}(\mathcal{T}_i)$ to denote the set of all leaves at level $i$. For a node $z \in \mathcal{T}$, we use $\mathrm{level}(z) \in \{0, \ldots, T\}$ to denote its level and $\mathrm{children}(z)$ to denote its children.

Representatives. Given a point $x \in B_d$, its potential representatives are the $T + 1$ nodes $\pi_T(x), \pi_{T-1}(\pi_T(x)), \ldots, \pi_0(\cdots(\pi_T(x))\cdots)$ in the complete net tree. The representative of $x$ in a net tree $\mathcal{T}$, denoted by $\rho_{\mathcal{T}}(x)$, is the unique leaf of $\mathcal{T}$ that is a potential representative of $x$. Note that $\rho_{\mathcal{T}} : B_d \to \mathcal{T}$ induces a partition of the points in $B_d$ based on the leaves representing them. For a weighted point set $S$, its frequency at a leaf $z$ of $\mathcal{T}$, denoted by $f_z$, is defined as the total weight of the points of $S$ whose representative is $z$, i.e., $f_z = w_S(\rho_{\mathcal{T}}^{-1}(z))$.

Representative Point Set. Let $\tilde f$ be a frequency oracle on the domain $L_1 \cup \cdots \cup L_T$. The above partitioning scheme yields a simple way to construct a weighted point set from a net tree $\mathcal{T}$. Specifically, the representative point set of a tree $\mathcal{T}$ (and frequency oracle $\tilde f$), denoted by $S_{\mathcal{T}}$, is the weighted point set where each leaf $z \in \mathrm{leaves}(\mathcal{T})$ receives a weight of $\tilde f_z$. We stress that $S_{\mathcal{T}}$ depends on the frequency oracle; however, we drop it from the notation for readability.
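The following minimal sketch (hypothetical helpers, not from the paper) illustrates the level maps $\pi_i$, parents in the complete net tree, and the chain of potential representatives. Hand-picked point lists stand in for the efficiently decodable lattice nets of Lemma 10.

```python
import numpy as np

def closest(point, net):
    """pi_i(point): the closest net point, returned as a hashable tuple (ties arbitrary)."""
    net = np.asarray(net)
    return tuple(net[np.argmin(np.sum((net - np.asarray(point)) ** 2, axis=1))])

def parent(z, nets, level):
    """Parent of node z at level `level` of the complete net tree: pi_{level-1}(z)."""
    return closest(z, nets[level - 1])

def potential_representatives(x, nets):
    """The T+1 potential representatives of x: pi_T(x), pi_{T-1}(pi_T(x)), ..., down to level 0."""
    T = len(nets) - 1
    reps = [closest(x, nets[T])]
    for i in range(T - 1, -1, -1):
        reps.append(closest(reps[-1], nets[i]))
    return list(reversed(reps))  # index i holds the level-i potential representative

# Toy 1-D example with hand-picked nets; nets[0] plays the role of L_0 = {0}.
nets = [[(0.0,)], [(-0.5,), (0.5,)], [(-0.75,), (-0.25,), (0.25,), (0.75,)]]
print(potential_representatives((0.3,), nets))  # [(0.0,), (0.5,), (0.25,)]
```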
3.2. Basic Properties of Net Trees

Before we proceed, we list a few important properties of net trees that both illustrate their usefulness and will guide our algorithms. The first property is that the potential representative of a point $x$ at level $i$ cannot be too far from $x$:

Lemma 14 (Distance Property). For any $x \in B_d$ and $i \in \{0, \ldots, T\}$, we have $\|x - \pi_i(\cdots(\pi_T(x))\cdots)\| \le 2^{1-i}$.

Proof. Using the triangle inequality, we can bound $\|x - \pi_i(\cdots(\pi_T(x))\cdots)\|$ above by
$$\|\pi_T(x) - x\| + \sum_{j=i}^{T-1} \big\|\pi_j(\cdots(\pi_T(x))\cdots) - \pi_{j+1}(\cdots(\pi_T(x))\cdots)\big\|.$$
Since the covering radius of $L_j$ is $2^{-j}$, this latter quantity is at most $2^{-T} + \sum_{j=i}^{T-1} 2^{-j} \le 2^{1-i}$, as desired.

Second, we show that the number of children is small.

Lemma 15 (Branching Factor). For any $z \in L_0 \cup \cdots \cup L_{T-1}$, we have $|\mathrm{children}(z)| \le B := (1 + 2/\gamma)^d$.

Proof. Let $i = \mathrm{level}(z)$. Since each $z' \in \mathrm{children}(z)$ has $z$ as its closest point in $L_i$, we have $\|z' - z\| \le 2^{-i}$. Furthermore, since $\mathrm{children}(z) \subseteq L_{i+1}$, the children form a $(\gamma \cdot 2^{-i-1})$-packing. As a result, a standard volume argument implies that $|\mathrm{children}(z)| \le (1 + 2/\gamma)^d$.

We note that when applying dimensionality reduction in the next section, we will have (see Footnote 6) $d = O_\alpha(\log k)$, meaning $B = k^{O_\alpha(1)}$.

The last property is an upper bound on the optimal transport cost from a given weighted point set to the representative point set created via a net tree $\mathcal{T}$. Recall from Lemma 8 that this implies a certain coreset guarantee for the constructed representative point set, a fact we will repeatedly use in the subsequent steps of the proof. The bound on the transport cost is stated in its general form below.

Lemma 16. For a weighted point set $S$ and a net tree $\mathcal{T}$, let $f_z$ denote the frequency of $S$ at a leaf $z \in \mathrm{leaves}(\mathcal{T})$, i.e., $f_z = w_S(\rho_{\mathcal{T}}^{-1}(z))$. Let $S_{\mathcal{T}}$ denote the representative point set constructed from $\mathcal{T}$ and the frequency oracle $\tilde f$. Then,
$$\mathrm{mt}(\rho_{\mathcal{T}}, S, S_{\mathcal{T}}) \le \sum_{z \in \mathrm{leaves}(\mathcal{T})} \Big( f_z \cdot \big(2^{1 - \mathrm{level}(z)}\big)^2 + |f_z - \tilde f_z| \Big).$$

Proof. Recall by definition that $\mathrm{mt}(\rho_{\mathcal{T}}, S, S_{\mathcal{T}})$ is equal to $\sum_{z \in \mathrm{leaves}(\mathcal{T})} |w_S(\rho_{\mathcal{T}}^{-1}(z)) - w_{S_{\mathcal{T}}}(z)| + \sum_{y \in B_d} w_S(y) \cdot \|\rho_{\mathcal{T}}(y) - y\|^2$. By construction of $S_{\mathcal{T}}$, the first term is equal to $\sum_{z \in \mathrm{leaves}(\mathcal{T})} |f_z - \tilde f_z|$. Using Lemma 14, the second term is bounded above by
$$\sum_{z \in \mathrm{leaves}(\mathcal{T})} \sum_{y \in \rho_{\mathcal{T}}^{-1}(z)} w_S(y) \cdot \big(2^{1 - \mathrm{level}(z)}\big)^2 = \sum_{z \in \mathrm{leaves}(\mathcal{T})} f_z \cdot \big(2^{1 - \mathrm{level}(z)}\big)^2.$$
Combining the two bounds completes the proof.

3.3. Building the Net Tree

Although we have defined net trees and shown several of their main properties, we have not yet addressed how a net tree should be constructed from a given (approximate) frequency oracle (on $L_0 \cup \cdots \cup L_T$). Lemma 16 captures the tension arising when constructing the tree: if we decide to include too many nodes in the tree, then all of these nodes contribute to the additive error, i.e., the second term in the bound. On the other hand, if we include too few nodes, then many leaves will be at a small level, resulting in a large first term. In this section, we give a tree-building algorithm that balances these two errors, and prove its guarantees. We assume that each approximate frequency $\tilde f_z$ is nonnegative (see Footnote 7).

(Footnote 6: Since we will pick $\gamma > 0$ to be a constant, we hide its dependency in asymptotic notation throughout this section.)

(Footnote 7: This is w.l.o.g. as we can always take $\max(\tilde f_z, 0)$ instead.)

The tree construction (Algorithm 1) itself is simple; we build the tree in a top-down manner, starting with small levels and moving on to higher levels. At each level, we compute a threshold on the number of nodes to expand. We then expand only the nodes with the largest approximate frequencies. The algorithm for computing this threshold (COMPUTETHRESHOLD) and its properties are stated next.
Algorithm 1 Building the Net Tree.
Oracle Access: Frequency oracle $\tilde f$ on $L_0 \cup \cdots \cup L_T$
Parameters: $k, a, \Gamma \in \mathbb{N}$
1: procedure BUILDTREE$^{\tilde f}_{k, a, \Gamma}$
2:   $\mathcal{T} \leftarrow$ root node $z = 0$ at level 0
3:   for $i = 0, \ldots, T - 1$ do
4:     $z^1_i, \ldots, z^{m_i}_i \leftarrow$ level-$i$ nodes sorted in non-decreasing order of $\tilde f_z$
5:     $\tau_i \leftarrow$ COMPUTETHRESHOLD$^{\tilde f}_{k, a, \Gamma}(z^1_i, \ldots, z^{m_i}_i)$
6:     for $j = 0, \ldots, \tau_i - 1$ do
7:       Add children$(z^{m_i - j}_i)$ to $\mathcal{T}$
8:   return $\mathcal{T}$

Computing the Threshold. Before we describe the threshold computation algorithm, we provide some intuition. Recall Lemma 16, which gives an upper bound on the optimal transport cost, and Lemma 8, which relates this quantity to the quality of the coreset. At a high level, these two lemmas allow us to stop branching as soon as the bound in Lemma 16 becomes much smaller than $\mathrm{OPT}^k_S$. Of course, the glaring issue in doing so is that we do not know $\mathrm{OPT}^k_S$! It turns out, however, that we can give a lower bound on $\mathrm{OPT}^k_S$ based on the tree constructed so far. To state this lower bound formally, we first state a general lower bound on the k-means cost, without any relation to the tree.

Lemma 17. Let $a, b, k \in \mathbb{N}$ and $r \in \mathbb{R}_{\ge 0}$. Let $S$ be a weighted point set, and let $T_1, \ldots, T_{ka+b} \subseteq \mathbb{R}^d$ be any $ka + b$ disjoint sets such that for any point $c \in \mathbb{R}^d$ it holds that $|\{i \in [ka+b] \mid B_d(c, r) \cap T_i \ne \emptyset\}| \le a$. Then, $\mathrm{OPT}^k_S \ge r^2 \cdot \mathrm{bottom}_b(w_S(T_1), \ldots, w_S(T_{ka+b}))$.

Proof. Consider any set $C$ of $k$ candidate centers. From the assumption, there must be at least $b$ subsets $T_i$ with $d(C, T_i) \ge r$; each such subset contributes at least $r^2 \cdot w_S(T_i)$ to the k-means objective. As a result, the total k-means objective is at least $r^2 \cdot \mathrm{bottom}_b(w_S(T_1), \ldots, w_S(T_{ka+b}))$.

This lets us prove a lower bound on the k-means objective for net trees:

Corollary 18. For any $\eta > 0$, let $r = \eta \cdot 2^{-i}$ and $a = \lceil (1 + (2 + \eta)/\gamma)^d \rceil$. Let $b \in \mathbb{N}$. Suppose that there exist $ka + b$ level-$i$ nodes $\check z_1, \ldots, \check z_{ka+b}$ in a net tree $\mathcal{T}$. Furthermore, let $S$ be any multiset and let $f$ denote its frequencies. Then, we have
$$\mathrm{OPT}^k_{S \cap \rho_{\mathcal{T}}^{-1}(\{\check z_1, \ldots, \check z_{ka+b}\})} \ge r^2 \cdot \mathrm{bottom}_b(f_{\check z_1}, \ldots, f_{\check z_{ka+b}}).$$

Proof. Consider any center $c \in \mathbb{R}^d$. Recall from Lemma 14 that $\rho_{\mathcal{T}}^{-1}(\check z_j) \subseteq B_d(\check z_j, 2^{1-i})$. In other words, if $B_d(c, r) \cap (S \cap \rho_{\mathcal{T}}^{-1}(\check z_j)) \ne \emptyset$, we must have
$$\|c - \check z_j\| \le r + 2^{1-i} = (2 + \eta) \cdot 2^{-i}, \quad (3)$$
implying that $B_d(\check z_j, \gamma \cdot 2^{-i}) \subseteq B_d(c, (2 + \eta) \cdot 2^{-i} + \gamma \cdot 2^{-i})$. Furthermore, since $\check z_1, \ldots, \check z_{ka+b} \in L_i$ form a $(\gamma \cdot 2^{-i})$-packing, the balls $B_d(\check z_j, \gamma \cdot 2^{-i})$ are disjoint. This means that any point $c$ satisfies (3) for at most $(1 + (2 + \eta)/\gamma)^d \le a$ many $j$'s. Applying Lemma 17 completes the proof.

Thus, we can use $r^2 \cdot \mathrm{bottom}_b(f_{\check z_1}, \ldots, f_{\check z_{ka+b}})$ as a running lower bound on $\mathrm{OPT}^k_S$. Still, we have to be careful, as the additive error introduced will add up over all the levels of the tree; this can be an issue since we will select the number of levels to be as large as $\Theta(\log n)$. To overcome this, we make sure that the additive error introduced at each level is charged only to the optimum of the weighted point set corresponding to the leaves at that level. This ensures that there is no double counting of the error. Below we formalize our cutoff threshold computation algorithm and prove its main property, which will be used later to provide the guarantees on the tree construction.

Algorithm 2 Computing the Threshold.
Oracle Access: Frequency oracle $\tilde f$ on $L_0 \cup \cdots \cup L_T$
Parameters: $k, a, \Gamma \in \mathbb{N}$
Inputs: Nodes $z_1, \ldots, z_m$ from the same level of a net tree
1: procedure COMPUTETHRESHOLD$^{\tilde f}_{k, a, \Gamma}(z_1, \ldots, z_m)$
2:   for $j \in [\min\{\Gamma, \lfloor m/(ka) \rfloor\}]$ do
3:     if $\sum_{i=1}^{m - (j-1)ka} \tilde f_{z_i} \le 2 \sum_{i=1}^{m - jka} \tilde f_{z_i}$ then
4:       return $(j - 1) \cdot ka$
5:   return $\min\{m, \Gamma \cdot ka\}$
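The following sketch mirrors Algorithms 1 and 2 under simplifying assumptions (the frequency oracle and the children map are passed in as callables, net points are hashable tuples, and the root $L_0 = \{0\}$ is hard-coded); the helper names are ours, not the paper's.

```python
def compute_threshold(freqs, k, a, Gamma):
    """Algorithm 2 on already-sorted (non-decreasing) approximate frequencies."""
    m = len(freqs)
    prefix = [0.0]
    for f in freqs:
        prefix.append(prefix[-1] + f)  # prefix[t] = sum of the t smallest frequencies
    for j in range(1, min(Gamma, m // (k * a)) + 1):
        if prefix[m - (j - 1) * k * a] <= 2 * prefix[m - j * k * a]:
            return (j - 1) * k * a
    return min(m, Gamma * k * a)

def build_tree(freq, children, T, k, a, Gamma):
    """Algorithm 1: at each level, expand only the highest-frequency nodes."""
    root = (0.0,)                      # stands in for L_0 = {0}
    tree = {root: []}                  # node -> list of children kept in the tree
    level_nodes = [root]
    for _ in range(T):                 # levels 0, ..., T-1
        level_nodes.sort(key=freq)     # non-decreasing order of approximate frequency
        tau = compute_threshold([freq(z) for z in level_nodes], k, a, Gamma)
        next_level = []
        for z in level_nodes[len(level_nodes) - tau:]:   # the tau largest
            kids = children(z)
            tree[z] = list(kids)
            for c in kids:
                tree.setdefault(c, [])
            next_level.extend(kids)
        level_nodes = next_level
    return tree  # leaves are exactly the nodes with an empty child list
```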
Lemma 19. Suppose that $\tilde f$ is $\zeta$-accurate at each of $z_1, \ldots, z_m$ and that $\tilde f_{z_1} \le \cdots \le \tilde f_{z_m}$. Then, COMPUTETHRESHOLD$^{\tilde f}_{k, a, \Gamma}(z_1, \ldots, z_m)$ outputs $\tau$ satisfying
$$\tilde f_{z_1} + \cdots + \tilde f_{z_{m - \tau}} \le 2\big(\tilde f_{z_1} + \cdots + \tilde f_{z_{m - \tau - ka}}\big) + 2^{-\Gamma}(m\zeta + n).$$

Proof. If the algorithm returns on line 4, then the inequality trivially holds due to the check on line 3 just before. Furthermore, the inequality also trivially holds if $\tau = m$, as the left-hand side is simply zero. As a result, we may assume that the algorithm returns $\tau = \Gamma ka < m$. This means the condition on line 3 fails for every $j \in [\Gamma]$. Chaining these failed checks, we can conclude that
$$\tilde f_{z_1} + \cdots + \tilde f_{z_m} > 2\big(\tilde f_{z_1} + \cdots + \tilde f_{z_{m - ka}}\big) > \cdots > 2^{\Gamma}\big(\tilde f_{z_1} + \cdots + \tilde f_{z_{m - \Gamma ka}}\big).$$
Finally, since the frequency oracle is $\zeta$-accurate at each of $z_1, \ldots, z_m$, we have
$$\tilde f_{z_1} + \cdots + \tilde f_{z_m} \le m\zeta + f_{z_1} + \cdots + f_{z_m} \le m\zeta + n,$$
where the latter inequality follows since each input point is mapped to at most one node at level $i$. Combining the above two inequalities yields the desired bound.

We remark that $\tilde f_{z_1} + \cdots + \tilde f_{z_{m - \tau}}$ is the approximate total frequency of the points that will be mapped to leaves at this level, which in turn gives the upper bound on the transport cost (in Lemma 16); whereas $\tilde f_{z_1} + \cdots + \tilde f_{z_{m - \tau - ka}}$ governs the lower bound on the optimum k-means objective in Corollary 18. Intuitively, Lemma 19 thus allows us to charge the transport cost at each level to the lower bound on the optimum of the leaves at that level, as desired. These arguments are formalized in the next subsection.

Putting it Together. The main property of a net tree $\mathcal{T}$ output by our tree construction algorithm is that its representative point set is a good coreset of the underlying input. This, together with some additional properties, is stated below.

Theorem 20. Let $\alpha \in (0, 1)$. Suppose that the frequency oracle $\tilde f$ is $\zeta$-accurate on every element queried by the algorithm. Let $\mathcal{T}$ be the tree output by Algorithm 1 where $\Gamma = \lceil \log n \rceil$, $T = \lceil 0.5 \log n \rceil$, $\eta = 8/\alpha$, and $a$ is as in Corollary 18. Let $N_{\mathcal{T}} = 2^{O_\alpha(d)} \cdot k \cdot O(\log^2 n)$. Then,
- The number of nodes in $\mathcal{T}$ is at most $N_{\mathcal{T}}$. Furthermore, this holds regardless of the frequency oracle's accuracy.
- $\mathrm{mt}(\rho_{\mathcal{T}}, S, S_{\mathcal{T}}) \le \frac{\alpha}{8(1 + 2/\alpha)} \cdot \mathrm{OPT}^k_S + O(N_{\mathcal{T}} \cdot \zeta)$.
- $S_{\mathcal{T}}$ is an $(\alpha, O_\alpha(N_{\mathcal{T}} \cdot \zeta))$-coreset of $S$.
Moreover, the tree construction algorithm runs in time $\mathrm{poly}(N_{\mathcal{T}})$ multiplied by the time needed to query $\tilde f$.

4. From Low to High Dimension

The net tree-based algorithm can already be applied to give an approximation algorithm for k-means, albeit with an additive error that scales as $2^{O_\alpha(d)} \cdot k \cdot (\log^2 n) \cdot \zeta$. The exponential dependency on $d$ is undesirable. In this section, we show how to eliminate this dependency via random projections, following the approach of Ghazi et al. (2020b) for private clustering in the central DP model. Specifically, the breakthrough work of Makarychev et al. (2019) allows one to randomly project to $d' = O_\alpha(\log k)$ dimensions while maintaining the objective for any given partition.

Theorem 21 ((Makarychev et al., 2019)). For every $0 < \beta, \alpha < 1$ and $k \in \mathbb{N}$, there exists $d' = O_\alpha(\log(k/\beta))$ such that the following holds. Let $P$ be a random $d'$-dimensional subspace of $\mathbb{R}^d$ and let $\Pi_P$ denote the projection from $\mathbb{R}^d$ to $P$. With probability $1 - \beta$, we have the following for all partitions $\varphi : B_{d'} \to [k]$:
$$\frac{1}{1 + \alpha} \le \frac{d \cdot \mathrm{cost}_{\Pi_P(S)}(\varphi)}{d' \cdot \mathrm{cost}_S(\varphi \circ \Pi_P)} \le 1 + \alpha.$$

Our encoding algorithm first projects $x_i$ to $\hat x_i$ in a given subspace $P$ and appropriately rescales it to $x'_i$. It then computes all potential representatives of $x'_i$ (corresponding to the complete net tree of the nets $L_1, \ldots, L_T$), and encodes these representatives with the generalized histogram and the generalized bucketized vector summation encoders, the latter with the input vector $x_i$. This is presented more formally below in Algorithm 3. (Note that we treat $x'_i$ as a vector in $B_{d'}$ directly; this is w.l.o.g. as we can rotate so that the basis of $P$ becomes the first $d'$ standard basis vectors.)
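Before the pseudocode, here is a small sketch of the projection-and-rescale step. The rescaling factor $\sqrt{d/d'}$ is our assumption, suggested by the $d/d'$ normalization in Theorem 21 but not spelled out in this excerpt; the shared randomness between users and analyzer is modeled by a fixed seed, and all helper names are hypothetical.

```python
import numpy as np

def random_subspace_basis(d, d_prime, seed):
    """Orthonormal basis of a random d'-dimensional subspace of R^d.
    The seed plays the role of the shared randomness."""
    rng = np.random.default_rng(seed)
    gauss = rng.standard_normal((d, d_prime))
    q, _ = np.linalg.qr(gauss)     # columns of q: orthonormal basis of the subspace
    return q

def project_and_rescale(x, basis):
    """Project x onto the subspace (coordinates in the subspace basis) and rescale
    by sqrt(d/d'); points whose rescaled norm would exceed 1 are mapped to 0,
    mirroring the clipping step of the encoder."""
    d, d_prime = basis.shape
    coords = basis.T @ x                      # hat{x}, expressed as a vector in B_{d'}
    scaled = np.sqrt(d / d_prime) * coords    # x'
    return scaled if np.linalg.norm(scaled) <= 1.0 else np.zeros(d_prime)

basis = random_subspace_basis(d=100, d_prime=10, seed=42)
x = np.random.default_rng(1).standard_normal(100)
x = x / (2 * np.linalg.norm(x))               # a point well inside the unit ball
print(np.linalg.norm(project_and_rescale(x, basis)))
```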
Algorithm 3 Encoding Algorithm for k-means.
Input: Point $x_i \in B_d$ of user $i$.
Parameters: Privacy parameters $\epsilon, \delta$, nets $L_1, \ldots, L_T$, $d'$-dimensional subspace $P$, and $\alpha > 0$.
Subroutines: Encoders Enc$^{\mathrm{hist}}$, Enc$^{\mathrm{vec}}$ for generalized histogram and bucketized vector summation.
1: procedure KMEANSENCODER$_{\epsilon, \delta, \alpha, P, L_1, \ldots, L_T}(x_i)$
2:   $\hat x_i \leftarrow \Pi_P(x_i)$
3:   if $\|\hat x_i\| \le \sqrt{d'/d}$ then
4:     $x'_i \leftarrow \sqrt{d/d'} \cdot \hat x_i$
5:   else
6:     $x'_i \leftarrow 0$
7:   $y^T_i \leftarrow$ closest point to $x'_i$ in $L_T$
8:   for $j = T - 1, \ldots, 1$ do
9:     $y^j_i \leftarrow$ closest point to $y^{j+1}_i$ in $L_j$
10:  $\tilde h_i \leftarrow$ Enc$^{\mathrm{hist}}_{(\epsilon/2, \delta/2)}(\{y^1_i, \ldots, y^T_i\})$
11:  $\tilde v_i \leftarrow$ Enc$^{\mathrm{vec}}_{(\epsilon/2, \delta/2)}(\{y^1_i, \ldots, y^T_i\}, x_i)$
12:  return $(\tilde h_i, \tilde v_i)$

To decode, we first use the encoded histograms to build a frequency oracle, from which we construct a net tree $\mathcal{T}$ using the algorithm in Section 3. We then run any approximation algorithm $A$ for k-means on the representative point set of $\mathcal{T}$. The output of $A$ induces a partition of the leaves of $\mathcal{T}$ according to which centers are closest. We then use the vector summation oracle on these parts to determine the $k$ centers in the original (high-dimensional) space. Pseudocode is given below as Algorithm 4. We stress here that the approximation algorithm $A$ need not be private.

A generic guarantee of our algorithm is stated next. As we explain below, plugging in known histogram/vector summation algorithms immediately yields our main results.

Algorithm 4 Decoding Algorithm for k-means.
Input: Encoded inputs $(\tilde h_1, \tilde v_1), \ldots, (\tilde h_n, \tilde v_n)$.
Parameters: Privacy parameters $\epsilon, \delta$, approximation algorithm $A$ for k-means.
Subroutines: Decoders Dec$^{\mathrm{hist}}$, Dec$^{\mathrm{vec}}$ for generalized histogram and bucketized vector summation.
1: procedure KMEANSDECODER$_{\epsilon, \delta, A}(\tilde h_1, \ldots, \tilde h_n, \tilde v_1, \ldots, \tilde v_n)$
2:   $\tilde f \leftarrow$ frequency oracle from Dec$^{\mathrm{hist}}_{(\epsilon/2, \delta/2)}(\tilde h_1, \ldots, \tilde h_n)$
3:   $\tilde v \leftarrow$ vector sum oracle from Dec$^{\mathrm{vec}}_{(\epsilon/2, \delta/2)}(\tilde v_1, \ldots, \tilde v_n)$
4:   $\mathcal{T} \leftarrow$ BUILDTREE$^{\tilde f}_{k, a, \Gamma}$
5:   $\{c'_1, \ldots, c'_k\} \leftarrow A(S_{\mathcal{T}})$
6:   $\varphi^* \leftarrow$ mapping leaves$(\mathcal{T}) \to [k]$ where $\varphi^*(z) = j$ iff $c'_j$ is closest to $z$ (ties broken arbitrarily)
7:   for $j = 1, \ldots, k$ do
8:     $v_j \leftarrow 0$
9:     $n_j \leftarrow 0$
10:    for $z \in (\varphi^*)^{-1}(j)$ do
11:      $v_j \leftarrow v_j + \tilde v_z$
12:      $n_j \leftarrow n_j + \tilde f_z$
13:    $c_j \leftarrow v_j / \max\{1, n_j\}$
14:    if $\|c_j\| \le 1$ then
15:      keep $c_j$
16:    else
17:      $c_j \leftarrow c_j / \|c_j\|$
18:  return $\{c_1, \ldots, c_k\}$

Theorem 22. KMEANSENCODER$_{\epsilon, \delta}$ is $(\epsilon, \delta)$-DP. Furthermore, suppose that the following hold:
- $A$ is a $w$-approximation algorithm for k-means.
- $d'$ is as in Theorem 21 with $\beta \leftarrow 0.1\beta$ and $\alpha \leftarrow 0.1\alpha$, and additionally $0.01 \log(n/\beta) \le d' \le d$.
- $P$ is a random $d'$-dimensional subspace of $\mathbb{R}^d$.
- The parameters of BUILDTREE are as in Theorem 20 with $\alpha \leftarrow 0.1\alpha$, and $N_{\mathcal{T}} = 2^{O_\alpha(d')} \cdot k \cdot O(\log^2 n)$ is an upper bound on the number of nodes of $\mathcal{T}$.
- (Enc$^{\mathrm{hist}}_{(\epsilon/2, \delta/2)}$, Dec$^{\mathrm{hist}}_{(\epsilon/2, \delta/2)}$) is $(\zeta, 0.1\beta/N_{\mathcal{T}})$-accurate for generalized histogram.
- (Enc$^{\mathrm{vec}}_{(\epsilon/2, \delta/2)}$, Dec$^{\mathrm{vec}}_{(\epsilon/2, \delta/2)}$) is $(\zeta, 0.1\beta/N_{\mathcal{T}})$-accurate for generalized bucketized vector summation.
Then, with probability $1 - \beta$, KMEANSDECODER outputs a $\big(w(1 + \alpha),\, k^{O_\alpha(1)} \cdot (\log^2 n) \cdot (\log(n/\beta) + \zeta)\big)$-approximate solution for k-means. Moreover, the encoder runs in time $\mathrm{poly}(ndk^{O_\alpha(1)}, t(\mathrm{Enc}^{\mathrm{hist}}), t(\mathrm{Enc}^{\mathrm{vec}}))$, and the decoder runs in time $\mathrm{poly}(ndk^{O_\alpha(1)}, t(A), t(\mathrm{Dec}^{\mathrm{hist}}), t(\mathrm{Dec}^{\mathrm{vec}}))$.

Theorem 22 allows us to easily derive approximation algorithms for k-means in different distributed models of DP, by simply plugging in the known generalized histogram/vector summation guarantees.
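The center-recovery loop of Algorithm 4 (lines 6-18) can be sketched as follows, assuming the frequency and vector-sum oracles are given as dictionaries keyed by leaf; helper names are ours.

```python
import numpy as np

def recover_centers(leaves, low_dim_centers, freq, vec_sum, d):
    """Assign each leaf to its closest low-dimensional center, then average the
    (noisy) original-space vector sums of each part, clipping back to the unit ball."""
    low_dim_centers = [np.asarray(c) for c in low_dim_centers]
    centers = []
    for j, c_prime in enumerate(low_dim_centers):
        v_j = np.zeros(d)
        n_j = 0.0
        for z in leaves:
            # phi*(z) = j iff c'_j is the closest low-dimensional center to leaf z.
            dists = [np.linalg.norm(np.asarray(z) - c) for c in low_dim_centers]
            if int(np.argmin(dists)) == j:
                v_j += vec_sum[z]          # noisy sum of the original points mapped to z
                n_j += freq[z]             # noisy count of the points mapped to z
        c_j = v_j / max(1.0, n_j)
        if np.linalg.norm(c_j) > 1.0:      # project back into the unit ball
            c_j = c_j / np.linalg.norm(c_j)
        centers.append(c_j)
    return centers
```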
4.1. Approximation Algorithm for Local DP

Next we consider the local DP model and prove Theorem 1.

Proof of Theorem 1. Let $\beta = 0.1$. From Theorem 12, there is a $(\zeta, 0.1\beta/N_{\mathcal{T}})$-accurate $0.5\epsilon$-DP algorithm for generalized histogram with $\zeta = O\big(\sqrt{n T^3 \log(N_{\mathcal{T}} |L_1 \cup \cdots \cup L_T| / \beta)}/\epsilon\big)$. Since we set $T = O(\log n)$ (in Theorem 20), $N_{\mathcal{T}} = k^{O_\alpha(1)} \cdot \mathrm{polylog}(n)$ by our choice of parameters, and $|L_1 \cup \cdots \cup L_T| \le \exp(O(T d'))$ by a volume argument, we get $\zeta = O(\sqrt{n} \cdot \mathrm{polylog}(n)/\epsilon)$.

Similarly, from Lemma 13, there is a $(\zeta, 0.1\beta/N_{\mathcal{T}})$-accurate $0.5\epsilon$-DP algorithm for generalized bucketized vector summation with $\zeta = O\big(\sqrt{n d T^3 \log(d N_{\mathcal{T}} |L_1 \cup \cdots \cup L_T| / \beta)}/\epsilon\big)$, which as before yields $\zeta = O(\sqrt{nd} \cdot \mathrm{polylog}(n)/\epsilon)$.

Plugging this into Theorem 22, we indeed arrive at a one-round $\epsilon$-local DP $\big(w(1 + \alpha),\, k^{O_\alpha(1)} \cdot \sqrt{nd} \cdot \mathrm{polylog}(n)/\epsilon\big)$-approximation algorithm for k-means (with failure probability 0.1). It is easy to verify that the encoder and the decoder run in time $\mathrm{poly}(n, d, k^{O_\alpha(1)})$.

5. Experiments

Implementation Details. We modify our algorithm in several places to make it more practical. First, instead of using nets, we use locality-sensitive hashing (LSH). Specifically, given LSH functions $g_1, \ldots, g_T$, the level-$i$ representation of $x$ is now $z_i = (g_1(x), \ldots, g_i(x))$. In this sense, our tree bears a strong resemblance to the so-called LSH forests (Bawa et al., 2005; Andoni et al., 2017). As for the specific family of hashes, we use SimHash (Charikar, 2002), in which a random vector $v_i$ is picked and $g_i(x)$ is the sign of $\langle v_i, x \rangle$. Since LSH can already be viewed as a dimensionality reduction method, we skip the random projection at the beginning of the algorithm. Consequently, we also directly compute the approximate centers of all the nodes in the tree and then use a non-private algorithm (in particular, k-means++ (Arthur and Vassilvitskii, 2007)) to compute the $k$ centers on this privatized dataset. Details on the choice of parameters can be found in Appendix D.

Dataset Generation. We use mixtures of Gaussians, which are generated as follows. For a separation parameter $r$ and a number $k$ of clusters, we first pick $k$ centers uniformly at random from the sphere of radius slightly less than one (i.e., $1 - \Theta(1/r)$). Then, for each center, we create $n/k$ points by adding to the center a Gaussian-distributed vector whose expected norm is $1/r$. Finally, we project any point that is outside of the unit ball back into it. Note that we run our algorithm on this dataset using the same value of $k$. Although these datasets are relatively easy, we stress that we are not aware of any prior experiments or implementations of non-interactive local DP algorithms (see Footnotes 8 and 9). Furthermore, we point out that even the experiments in the central model of Balcan et al. (2017), which use almost the same data generation process (including separation parameters) as ours, suggest that such datasets are already challenging for the much more relaxed central DP model.

Figure 1. Normalized k-means objective of the output clusters for varying n, ε, or k, with d = 100 and r = 100. Each set of parameters is run 10 times; the average and the standard deviation of the normalized k-means objectives are reported.

Results Summary. Figure 1 presents the normalized k-means objective (i.e., the k-means objective divided by $n$) as $n$, $\epsilon$, or $k$ varies. Note that the trivial clustering that simply uses the origin as the center has a normalized objective roughly equal to one. Our clustering algorithm significantly improves upon this when $n$ is sufficiently large. More importantly, as either $n$ or $\epsilon$ increases, the normalized objective decreases, as predicted by theory. Finally, when the number of clusters $k$ becomes larger, our algorithm suffers larger errors, once again agreeing with theory.

(Footnote 8: Very recently, Xia et al. (2020) have reported experimental results for k-means in the local DP model. However, their algorithm, which is based on Lloyd's iteration, requires interaction.)

(Footnote 9: We remark that we also tried some naive baseline algorithms, such as noising each point using Gaussian/Laplace noise and running k-means++ on this noisy dataset. However, they do not produce any meaningful clustering even for $n$ as large as $10^6$.)
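The SimHash-based level keys used in the experiments can be sketched as follows; the number of hash levels $T$ and the handling of approximate node centers are left unspecified here, and the helper names are ours.

```python
import numpy as np

def simhash_functions(d, T, seed):
    """T SimHash functions g_1, ..., g_T: g_i(x) = sign(<v_i, x>) for a random v_i."""
    rng = np.random.default_rng(seed)
    vs = rng.standard_normal((T, d))
    return lambda x, i: int(np.sign(vs[i] @ x) >= 0)

def level_key(x, g, i):
    """Level-i representation of x: the tuple (g_1(x), ..., g_i(x))."""
    return tuple(g(x, j) for j in range(i))

g = simhash_functions(d=100, T=8, seed=0)
x = np.random.default_rng(3).standard_normal(100)
print([level_key(x, g, i) for i in range(1, 4)])  # nested keys: level i extends level i-1
```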
6. Conclusions and Open Questions

We give private approximation algorithms for k-means clustering whose ratios are essentially the same as those of non-private algorithms, in both the (one-round) local DP and the (one-round) shuffle DP models. An interesting open question is to extend this result to other clustering objectives, such as k-median. While net trees can also be applied to k-median with little change, the techniques we use to handle high dimensions do not directly carry over. This is due to the fact that, in k-means, it is simple to find the center of a cluster in one round, by computing the average of all its points. However, finding a cluster center in k-median (to the best of our knowledge) requires solving a linear program, and it seems challenging to do so non-interactively.

Even for the k-means problem itself, there are still several interesting questions. First, as mentioned in the Introduction, our algorithm relies crucially on public randomness in the dimensionality reduction step, where every user has to project onto the same random subspace $P$. Can k-means be approximated almost optimally via a one-round local DP algorithm that uses only private randomness? Another direction is to tighten the additive error. In the central model, this has recently been investigated by Chaturvedi et al. (2021) and Jones et al. (2021), who achieved essentially tight additive errors in terms of $k$ (and, in certain regimes, $d$ and $n$). It is a natural question to determine such a tight dependency in the (non-interactive) local model as well.

References

J. M. Abowd. The US Census Bureau adopts differential privacy. In KDD, pages 2867-2867, 2018.

A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In APPROX, volume 5687, pages 15-28, 2009.

S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM J. Comput., 49(4), 2020.

D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-249, 2009.

A. Andoni, I. P. Razenshteyn, and N. S. Nosatzki. LSH forest: Practical algorithms made theoretical. In SODA, pages 67-78, 2017.

Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 2017.

D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In SODA, pages 1027-1035, 2007.

M. Balcan, T. Dick, Y. Liang, W. Mou, and H. Zhang. Differentially private clustering in high-dimensional Euclidean spaces. In ICML, pages 322-331, 2017.

V. Balcer and A. Cheu. Separating local & shuffled differential privacy via histograms. In ITC, pages 1:1-1:14, 2020.

V. Balcer, A. Cheu, M. Joseph, and J. Mao. Connecting robust shuffle privacy and pan-privacy. In SODA, pages 2384-2403, 2021.

B. Balle, J. Bell, A. Gascón, and K. Nissim. The privacy blanket of the shuffle model. In CRYPTO, pages 638-667, 2019.

B. Balle, J. Bell, A. Gascón, and K. Nissim. Private summation in the multi-message shuffle model. In CCS, pages 657-676, 2020.
R. Bassily, K. Nissim, U. Stemmer, and A. Thakurta. Practical locally private heavy hitters. JMLR, 21:16:1-16:42, 2020.

M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In WWW, pages 651-660, 2005.

J.-D. Benamou. Numerical resolution of an unbalanced mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 37(5):851-868, 2003.

A. Bittau, U. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnes, and B. Seefeld. Prochlo: Strong privacy for analytics in the crowd. In SOSP, pages 441-459, 2017.

A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In PODS, pages 128-138, 2005.

C. Canonne, G. Kamath, and T. Steinke. The discrete Gaussian for differential privacy. In NeurIPS, 2020.

M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380-388, 2002.

A. Chaturvedi, H. Nguyen, and E. Xu. Differentially private k-means clustering via exponential mechanism and max cover. In AAAI, 2021.

L. Chen, B. Ghazi, R. Kumar, and P. Manurangsi. On distributed differential privacy and counting distinct elements. In ITCS, 2021.

A. Cheu, A. D. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via shuffling. In EUROCRYPT, pages 375-403, 2019.

B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In NIPS, pages 3571-3580, 2017.

J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In FOCS, pages 429-438, 2013.

C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486-503, 2006a.

C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284, 2006b.

U. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS, pages 1054-1067, 2014.

U. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In SODA, pages 2468-2479, 2019.

D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. In STOC, pages 361-370, 2009.

D. Feldman, C. Xiang, R. Zhu, and D. Rus. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In IPSN, pages 3-16, 2017.

B. Ghazi, N. Golowich, R. Kumar, P. Manurangsi, R. Pagh, and A. Velingker. Pure differentially private summation from anonymous messages. In ITC, 2020a.

B. Ghazi, R. Kumar, and P. Manurangsi. Differentially private clustering: Tight approximation ratios. In NeurIPS, 2020b.

B. Ghazi, R. Kumar, P. Manurangsi, and R. Pagh. Private counting from anonymous messages: Near-optimal accuracy with vanishing communication overhead. In ICML, pages 3505-3514, 2020c.

B. Ghazi, P. Manurangsi, R. Pagh, and A. Velingker. Private aggregation from fewer anonymous messages. In EUROCRYPT, pages 798-827, 2020d.

B. Ghazi, N. Golowich, R. Kumar, R. Pagh, and A. Velingker. On the power of multiple anonymous messages. In EUROCRYPT, 2021a.

B. Ghazi, R. Kumar, P. Manurangsi, R. Pagh, and A. Sinha. Differentially private aggregation in the shuffle model: Almost central accuracy in almost a single message. In ICML, 2021b.
A. Greenberg. Apple's differential privacy is about collecting your data, but not your data. Wired, June 13, 2016.

A. Gupta, K. Ligett, F. McSherry, A. Roth, and K. Talwar. Differentially private combinatorial optimization. In SODA, pages 1106-1125, 2010.

S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In STOC, pages 291-300, 2004.

S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their applications. SICOMP, 35(5):1148-1184, 2006.

Z. Huang and J. Liu. Optimal differentially private algorithms for k-means clustering. In PODS, pages 395-408, 2018.

Y. Ishai, E. Kushilevitz, R. Ostrovsky, and A. Sahai. Cryptography from anonymity. In FOCS, pages 239-248, 2006.

C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. CoRR, abs/1902.03736, 2019.

M. Jones, H. L. Nguyen, and T. Nguyen. Differentially private clustering via maximum coverage. In AAAI, 2021.

P. Kairouz, Z. Liu, and T. Steinke. The distributed discrete Gaussian mechanism for federated learning with secure aggregation. In ICML, 2021.

K. Makarychev, Y. Makarychev, and I. P. Razenshteyn. Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In STOC, pages 1027-1038, 2019.

D. Micciancio. Almost perfect lattices, the covering radius problem, and applications to Ajtai's connection factor. SICOMP, 34(1):118-169, 2004.

D. Micciancio and P. Voulgaris. A deterministic single exponential time algorithm for most lattice problems based on Voronoi cell computations. SICOMP, 42(3):1364-1391, 2013.

P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler. GUPT: privacy preserving data analysis made easy. In SIGMOD, pages 349-360, 2012.

G. Monge. Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale, 1781.

K. Nissim and U. Stemmer. Clustering algorithms for the centralized and local models. In ALT, pages 619-653, 2018.

K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75-84, 2007.

K. Nissim, U. Stemmer, and S. P. Vadhan. Locating a small cluster privately. In PODS, pages 413-427, 2016.

R. Nock, R. Canyasse, R. Boreli, and F. Nielsen. k-variates++: more pluses in the k-means++. In ICML, pages 145-154, 2016.

B. Piccoli and F. Rossi. Generalized Wasserstein distance and its application to transport equations with source. Archive for Rational Mechanics and Analysis, 211(1):335-358, 2014.

C. Radebaugh and U. Erlingsson. Introducing TensorFlow Privacy: Learning with Differential Privacy for Training Data, March 2019. blog.tensorflow.org.

C. A. Rogers. Lattice coverings of space. Mathematika, 6(1):33-39, 1959.

S. Shankland. How Google tricks itself to protect Chrome user privacy. CNET, October 2014.

U. Stemmer. Locally private k-means clustering. In SODA, pages 548-559, 2020.

U. Stemmer and H. Kaplan. Differentially private k-means with constant multiplicative error. In NeurIPS, pages 5436-5446, 2018.

D. Su, J. Cao, N. Li, E. Bertino, and H. Jin. Differentially private k-means clustering. In CODASPY, pages 26-37, 2016.

D. Testuggine and I. Mironov. PyTorch Differential Privacy Series Part 1: DP-SGD Algorithm Explained, August 2020. medium.com.

Y. Wang, Y.-X. Wang, and A. Singh. Differentially private subspace clustering. In NIPS, pages 1000-1008, 2015.
C. Xia, J. Hua, W. Tong, and S. Zhong. Distributed k-means clustering guaranteeing local differential privacy. Computers & Security, 90:101699, 2020.