# Differentially Private Clustering via Maximum Coverage

Matthew Jones, Huy L. Nguyen, Thy D. Nguyen
Northeastern University
{jones.m, hu.nguyen, nguyen.thy2}@northeastern.edu

Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper studies the problem of clustering in metric spaces while preserving the privacy of individual data. Specifically, we examine differentially private variants of the k-medians and Euclidean k-means problems. We present polynomial-time algorithms with constant multiplicative error and lower additive error than the previous state of the art for each problem. Additionally, our algorithms use a clustering algorithm without differential privacy as a black box. This allows practitioners to control the trade-off between runtime and approximation factor by choosing a suitable clustering algorithm to use.

1 Introduction

Clustering is an important routine in many machine learning tasks, such as image segmentation (Yang et al. 2017; Patel, Van Nguyen, and Vidal 2013), collaborative filtering (Mc Sherry and Mironov 2009; Schafer et al. 2007), and time series analysis (Mueen and Keogh 2012). Thus, improving the performance of private clustering has great potential for directly improving other private machine learning tasks. Ideally, we would like to achieve differential privacy, a privacy definition with strong theoretical support which requires that the algorithm be insensitive to small changes in the dataset (Dwork, Roth et al. 2014). As practical evidence of the significance of differentially private clustering, differential privacy has been adopted by several large tech corporations such as Google, Apple, and Microsoft (Álvarez Marañón 2020) and used in several clustering applications, including clustering network data (Ni et al. 2018), clustering for facial recognition (Chamikara et al. 2020), and developing intelligent electrical services through user data clustering (Xiong et al. 2018). Hence, the motivation to study clustering in the differential privacy framework is both strongly technical and practical.

We begin with closely related works and our results, first for k-medians and then for k-means.

The k-medians problem has been studied extensively in the literature. Kariv and Hakimi (1979) proved that k-medians is NP-hard. There is a long line of work on approximation algorithms for k-medians without differential privacy (Chrobak, Kenyon, and Young 2006; Arya et al. 2004; Charikar et al. 2002; Jain and Vazirani 2001; Jain et al. 2003; Byrka et al. 2017). The state of the art is a (2.675 + ϵ)-approximation by Byrka et al. (2017). For the differentially private k-medians problem, Gupta et al. (2010) proposed a polynomial-time (ϵp, 0)-differentially private algorithm with cost at most 6·OPT + O(Δ·k^2·log^2 n/ϵp). Kaplan and Stemmer (2018) state a variant of the algorithm for (ϵp, δp)-differential privacy with cost O(OPT + Δ·k^1.5·log n·log(1/δp)/ϵp) and failure probability β, where Δ = max_{(u,v)∈V^2} d(u, v) is the diameter of the metric space. We denote by OPT the optimal objective cost of the non-private clustering problem in question, and n and k denote the number of points available for choice as cluster centers and the number of clusters, respectively.
In contrast with the non-private setting, any result for differentially private clustering requires additive error in addition to a multiplicative factor on the optimal clustering cost. In particular, for the k-medians problem, Theorem 4.5 in Gupta et al. (2010) shows that any private algorithm must incur additive error at least Ω(Δ·k·ln(n/k)/ϵp). In fact, their proof shows that there exists a family of k-medians instances whose optimal objective cost is 0 but on which every ϵp-differentially private algorithm must incur cost Ω(Δ·k·ln(n/k)/ϵp). This lower bound extends almost as is to (ϵp, δp)-DP with δp = 1/poly(n). From a practical point of view, the additive error is the main driver of the performance of these algorithms, as can be seen from the experimental results in Balcan et al. (2017). In Section 7 of Balcan et al. (2017), the experiments show that increasing the number of centers significantly reduces the clustering cost for the non-private k-means algorithm, whereas the clustering cost stays relatively stable for the private counterpart. This behavior is observed on both synthetic and real datasets, and both for their algorithm and for Su LQ k-means. These results are strong evidence that the additive error is both a limiting factor in practice and an important key to improving private clustering algorithms.

Motivated by the insights from Gupta et al. (2010) and Balcan et al. (2017), our main contribution is a new (ϵp, δp)-differentially private k-medians algorithm that runs in polynomial time and has a constant multiplicative factor and improved additive error: O((Δ·k/ϵp)·log n·log(e/δp)). Note that our additive error is linear in k and almost matches the lower bound of Gupta et al. (2010) up to lower-order terms. In comparison, the best previous upper bounds, due to Gupta et al. (2010) and Kaplan and Stemmer (2018), have a dependence of k^1.5. By using the non-private algorithm of Byrka et al. (2017) as a subroutine, our multiplicative factor is 6.35 + ϵ, which is slightly worse than the factor 6 of Gupta et al. (2010).

The k-means problem has also been studied extensively, with a long line of works on privacy-preserving approximation algorithms (Blum et al. 2005; Nissim, Raskhodnikova, and Smith 2007; Feldman et al. 2009, 2017; Balcan et al. 2017; Kaplan and Stemmer 2018). As an extension of our techniques, we also provide an (ϵp, δp)-differentially private algorithm for the Euclidean k-means problem with constant multiplicative error and better additive cost compared to previous work (see Table 1). Our algorithm has cost O(OPT + k·log n + d^0.51·(log log n)^2.53·(k·log k)^1.01), omitting the dependence on ϵp and δp. Again, in this setting our additive error is almost linear in k, as opposed to k^1.5 in previous work (Kaplan and Stemmer 2018).

In addition to the improved performance guarantees, our k-medians and Euclidean k-means algorithms use a non-private clustering algorithm as a black box. This allows practitioners to control the trade-off between the approximation guarantee and the runtime of the clustering algorithm, and even to use heuristics with good empirical performance. This is especially important since most applications use Lloyd's algorithm for k-means, which has superior practical performance compared with other methods despite its O(log k) approximation factor. For example, in the experiments of Balcan et al.
(2017), they found that adding several Lloyd's iterations to their algorithms improves the experimental results, supporting the importance of such heuristics for practical performance.

Our technique is a novel approach based on Maximum Coverage, in contrast to previous approaches in differentially private clustering such as center swapping in Gupta et al. (2010); Balcan et al. (2017) and locality sensitive hashing in Nissim and Stemmer (2018); Balcan et al. (2017). The algorithms for k-medians and Euclidean k-means include two main steps. In the first step, we iterate through distance thresholds from small to large and at each threshold apply a differentially private Maximum Coverage algorithm to select centers that cover almost as many points as are covered by the optimal solution at those thresholds. For the second step, we create a new dataset based on those centers and apply a non-private clustering algorithm to it: we move each demand point to its nearest center and then apply the Laplace mechanism (Dwork et al. 2006b) to report the number of points at each center. This ensures that privacy is preserved for the new dataset, and we incur no additional privacy cost when we apply the non-private clustering algorithm in the final step.

2 Related Works

Table 1 summarizes the performance of previous works in comparison with our algorithms. In the table, only the work by Gupta et al. (2010) is for (ϵp, 0)-differential privacy, while the others are for (ϵp, δp)-differential privacy. As mentioned above, Gupta et al. (2010) gave the first private k-medians algorithm with a constant multiplicative approximation and additive error polynomial in the number of centers and logarithmic in the number of points. The algorithm uses the local search approach of Arya et al. (2004). Our algorithm, on the other hand, can be used with any non-private k-medians algorithm.

For the Euclidean k-means problem, Balcan et al. (2017) propose the strategy of first identifying a set of potential centers with low k-means cost, then applying the techniques of Gupta et al. (2010) to find the final centers among the potential centers. However, their potential centers are only guaranteed to contain a solution with multiplicative approximation O(log^3 n). This result was improved by Kaplan and Stemmer (2018), who construct a set of potential centers containing a solution with constant multiplicative approximation.

Another approach to the k-means problem is via the 1-cluster problem. Given a set of input points in R^d and t ≤ n, the goal is to find a center that covers at least t points with the smallest radius. The work of Feldman et al. (2017) shows that the k-means problem can be solved by running an algorithm for the 1-cluster problem multiple times to find several balls that cover most of the data points, with O(k log n) multiplicative error. Nissim and Stemmer (2018) proposed an improved algorithm for the 1-cluster problem, resulting in a differentially private k-means algorithm with O(k) multiplicative error.

3 Preliminaries

Differential Privacy

Differential privacy is a privacy framework for computations run against sensitive input data sets. Its requirement, informally, is that the computation behaves similarly on two input datasets that are nearly identical. Formally,

Definition 1. (Dwork et al.
2006a) A randomized algorithm M has δp-approximate ϵp-differential privacy, or (ϵp, δp)-differential privacy, if for any two input sets A and B whose symmetric difference has a single element, and for any set of outcomes S ⊆ Range(M),

Pr[M(A) ∈ S] ≤ exp(ϵp)·Pr[M(B) ∈ S] + δp.

If δp = 0, we say that M is ϵp-differentially private.

An algorithm with (ϵp, 0)-differential privacy ensures that the output M(A) is (almost) equally likely to be observed on neighboring datasets, whereas in (ϵp, δp)-differential privacy the value δp dictates the probability that ϵp-privacy fails to hold (Dwork et al. 2006a). In this way, (ϵp, δp)-differential privacy is a relaxation of ϵp-differential privacy. We use an error parameter ϵ for utility, and we use ϵp and δp to denote parameters for differential privacy (or ϵs and δs for Algorithm 1, to differentiate the privacy parameters of different algorithms).

| Reference | Objective | Multiplicative Error | Additive Error |
|---|---|---|---|
| (Gupta et al. 2010) | k-medians | O(1) | O(k log n)^2 |
| (Kaplan and Stemmer 2018) | k-medians | O(1) | O(k log n)^1.5 |
| Ours | k-medians | O(1) | O(k log n) |
| (Feldman et al. 2017) | Euclidean k-means | O(k log n) | O(k·√d·log(nd)·9^log*(\|X\|…)) |
| (Balcan et al. 2017) | Euclidean k-means | O(log^3 n) | O(k^2 + d·log^5 n) |
| (Kaplan and Stemmer 2018) | Euclidean k-means | O(1) | O((k·log(n log k))^1.5 + d^0.51·(log log n)^2.53·(k log k)^1.01) |
| Ours | Euclidean k-means | O(1) | O(k·log n + d^0.51·(log log n)^2.53·(k log k)^1.01) |

Table 1: Comparison of our clustering algorithms with prior works, omitting dependence on ϵp, δp.

One basic construction for differentially private algorithms is the Laplace mechanism.

Definition 2. (L1 sensitivity) A function f : N^|X| → R^k has L1 sensitivity Δf if ||f(A) − f(A′)||_1 ≤ Δf for all A, A′ whose symmetric difference has a single element.

Theorem 3. (Laplace mechanism, Dwork et al. 2006b) Let the function f : N^|X| → R^k have L1 sensitivity Δf and let ϵp > 0. The mechanism M that on input A outputs f(A) with independent Lap(Δf/ϵp) noise added to each coordinate is (ϵp, 0)-differentially private, where Lap(Δf/ϵp) denotes a random variable following the Laplace distribution with scale parameter b = Δf/ϵp.

Another tool for the construction of differentially private algorithms which we use in this work is the exponential mechanism. This construction is parameterized by a query function q(A, r) mapping a pair of input data set A and candidate result r to a real value. With q and a privacy value ϵp, the mechanism selects an output which favors higher score values using

Pr[E^ϵp_q(A) = r] ∝ exp(ϵp·q(A, r)).  (1)

Theorem 4. (Mc Sherry and Talwar 2007) The exponential mechanism, when used to select an output r ∈ R, gives (2ϵp·Δq)-differential privacy, where Δq is the sensitivity of q, and, letting R* be the subset of R achieving q(A, r) = max_r q(A, r), ensures that

Pr[ q(A, E^ϵp_q(A)) < max_r q(A, r) − (ln(|R|/|R*|) + t)/ϵp ] ≤ e^(−t).

Maximum Coverage

Our differentially private k-medians algorithm solves the Maximum Coverage problem as a subproblem. The Maximum Coverage problem is defined as follows: given a universe U of items, a family S of subsets of U, and a parameter z, the goal is to select z sets in S that cover the maximum number of elements in U. Formally, we are looking to find

arg max_{C⊆S, |C|=z} |∪_{S∈C} S|.

Our approach for solving private Maximum Coverage is based on the Unweighted Set Cover algorithm in Gupta et al. (2010).

Algorithm 1: Maximum Coverage
Input: Set system (U, S), a private set R ⊆ U to cover, ϵs, δs, m
  i ← 1, Ri = R, Si = S
  ϵ′ ← ϵs/(2 ln(e/δs))
  for i = 1, 2, . . . , m do
    Pick a set S from Si with probability proportional to exp(ϵ′·|S ∩ Ri|)
    Output set S
    Ri+1 ← Ri \ S, Si+1 ← Si \ {S}
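A minimal Python sketch of this selection loop is given below; the identifier names and the dictionary-based set representation are our choices for illustration, not the authors' implementation.

```python
import math
import random

def private_max_coverage(universe, sets, m, eps_s, delta_s, rng=random.Random(0)):
    """Sketch of Algorithm 1: pick m sets, each via the exponential mechanism
    whose score is the number of still-uncovered private points the set covers.

    universe: the private set R of items to cover
    sets:     dict mapping a set id to a frozenset of items
    """
    eps_prime = eps_s / (2.0 * math.log(math.e / delta_s))  # per-step privacy parameter
    remaining = set(universe)                               # R_i: still-uncovered items
    available = dict(sets)                                  # S_i: sets not yet chosen
    chosen = []
    for _ in range(m):
        if not available:
            break
        ids = list(available)
        scores = [len(available[s] & remaining) for s in ids]
        top = max(scores)
        # Pr[pick s] proportional to exp(eps' * |s ∩ R_i|); shift by the max score for stability
        weights = [math.exp(eps_prime * (sc - top)) for sc in scores]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining -= available[pick]
        del available[pick]
    return chosen
```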
To preserve privacy, Algorithm 1 chooses sets using the exponential mechanism, with probability related to the improvement in coverage caused by choosing the set. Assume that there exists a selection of z sets that covers U. A classic fact for the maximum coverage problem is that if we build the family C by always selecting the set in S \ C that covers the largest number of uncovered elements in U, then after z iterations, |∪_{c∈C} c| ≥ (1 − 1/e)·|U|. Here we show an observation that will be useful for our algorithm later. The proof is in the appendix.

Lemma 5. For ϵ > 0, if we always select a set that covers at least half as many uncovered elements as the set that covers the most uncovered elements, then after 2z·ln(1/ϵ) iterations, |∪_{c∈C} c| ≥ (1 − ϵ)·|U|.

4 Private k-medians

Given a set of points V, a metric d : V × V → R, a private set of demand points D ⊆ V, and a value k where k < |V| = n, recall that the objective of the k-medians problem is to select a set of points (centers) F ⊆ V, |F| = k, to minimize cost(F) = Σ_{v∈D} d(v, F), where d(v, F) = min_{f∈F} d(v, f). Also, recall that we use ϵ as the approximation parameter for the maximum coverage problem in Lemma 5, and we use ϵp and δp as privacy parameters. Let Br(v) be the ball of radius r centered at v, i.e., the set of all points in the metric space within distance r from v.

Our approach is based on the Maximum Coverage problem. One way to compute the clustering cost is to compute, for every distance threshold t, the number of points within distance t from the centers and integrate the counts from 0 to the maximum distance. Thus, if for every threshold t the number of points farther than t from our solution's centers is not much more than the number of points farther than t from the optimal centers, then our cost is not much larger than the optimal cost. Accordingly, our algorithm goes through distance thresholds from small to large and tries to cover as many points as possible using fresh centers every time. For each threshold, we aim to cover (1 − ϵ) times the number of points the optimal solution can cover. The result is that our clustering cost is not much larger than the optimal cost, albeit using more centers. Since we use exponentially growing thresholds, we only use a multiplicative factor of O(log n) extra centers in the final set C, which results in a small error due to privacy noise. The full algorithm is described in Algorithm 2 (below, U denotes the set of demand points not yet covered).

Algorithm 2: The k-medians algorithm
Input: a set of points V, a private set of demand points D ⊆ V, a metric d, ϵ, ϵp, δp
1 U = D, C = ∅, r = 1 + log_{1+ϵ} n
2 for i from 1 to r do
3   Set ti = Δ·(1 + ϵ)^(i−1)/n
4   Run Algorithm 1 on the set system (U, S) with S = {Bti(v) ∩ U : v ∈ V}, private set to cover U, ϵs = ϵp/2, δs = δp, for m = 2k·ln(1/ϵ) iterations to get Ci
5   Ui = ∪_{v∈Ci} Bti(v) ∩ U
6   U = U \ Ui
7   C = C ∪ Ci
8 Assign each point in D to its closest point c ∈ C
9 Let nc be the total number of points assigned to c, for each c ∈ C
10 For each c ∈ C, set n′c = nc + Yc where Yc ∼ Lap(2/ϵp)
11 Run a k-medians algorithm to select and return k centers from V, with demand points at each c ∈ C with multiplicity n′c
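The following Python sketch mirrors Algorithm 2, reusing the `private_max_coverage` sketch above; the helper names, the rounding of the noisy counts, and the `kmedians_blackbox` interface are our assumptions for illustration, not part of the paper.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Lap(0, scale): an exponential-distributed magnitude with a random sign."""
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / scale)

def private_k_medians(V, D, d, k, eps, eps_p, delta_p, diameter,
                      max_coverage, kmedians_blackbox, rng=random.Random(0)):
    """Sketch of Algorithm 2.

    V: candidate centers, D: private demand points, d: metric,
    max_coverage: the private_max_coverage sketch above,
    kmedians_blackbox(points, weights, k): any non-private weighted k-medians solver.
    """
    n = len(V)
    r = 1 + math.ceil(math.log(n) / math.log(1.0 + eps))   # r = 1 + log_{1+eps} n thresholds
    m = math.ceil(2 * k * math.log(1.0 / eps))              # iterations per threshold
    uncovered = set(range(len(D)))                          # indices of uncovered demand points
    C = []                                                  # centers selected across thresholds
    for i in range(1, r + 1):
        t = min(diameter, diameter * (1.0 + eps) ** (i - 1) / n)   # threshold t_i
        balls = {vi: frozenset(j for j in uncovered if d(D[j], V[vi]) <= t)
                 for vi in range(len(V))}
        for vi in max_coverage(uncovered, balls, m, eps_p / 2.0, delta_p, rng=rng):
            uncovered -= balls[vi]
            C.append(V[vi])
    # Snap every demand point to its nearest selected center, then noise the counts.
    counts = [0] * len(C)
    for j in range(len(D)):
        counts[min(range(len(C)), key=lambda ci: d(D[j], C[ci]))] += 1
    noisy = [cnt + laplace_noise(2.0 / eps_p, rng) for cnt in counts]
    weights = [max(0, round(w)) for w in noisy]             # usable multiplicities (our choice)
    return kmedians_blackbox(C, weights, k)
```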
The algorithm begins with the discretization of distance thresholds. The goal is as in the discussion above: we apply Algorithm 1 to select a set of points that covers a large set of demand points across different distance thresholds. Note that the objective cost of any set of centers under this discretization scheme is not too far from its actual cost, as we will show in Lemma 8. Thus, the set of centers C that we find across the different thresholds ti should also have cost similar to OPT.

Our final step is to obtain the final set of centers with privacy. To preserve privacy for this step, we create a new dataset similar to the original one with some privacy noise. In this new dataset, every demand point is shifted to the closest center from C, and we apply the Laplace mechanism to the number of demand points assigned to each point in C. At this point, we can use a k-medians algorithm on this dataset to output the final k centers. Although the objective in the new problem changes because of the shifting and the Laplace mechanism, the set of available choices for centers V is the same. Thus, the cost of the centers returned in the final step is at most the cost under this new objective plus the total shifting distance. In the following sections, we first formally analyze the privacy and then the utility of this algorithm.

Privacy Analysis

We first show that this algorithm is (ϵp, δp)-differentially private. To do so, we show that the entirety of the for loop is (ϵp/2, δp)-differentially private, and then take advantage of composition and apply Theorem 3 at line 10 to obtain the final result. Note that the analysis of the loop very closely follows the proof of privacy for Unweighted Set Cover in Gupta et al. (2010); their algorithm selects sets in a particular order to form a cover, while our algorithm selects candidate centers with increasing distance thresholds, where a center is assumed to cover all demand points within its distance threshold. The significant difference between the two proofs is that our algorithm could select the same center twice with different distance thresholds, while a set will never be chosen twice in the set cover algorithm of Gupta et al. (2010). This re-selection of the same center does not affect the differential privacy of the algorithm, because the privacy analysis hinges on which demand points have been covered, not on which centers have been selected. As a result, we save an additional log n factor on privacy, which removes a log n term from the additive error in Lemma 9 and carries through to the additive error in the utility. We include the proof in the appendix and omit it here, due to its close similarity to Gupta et al. (2010).

Lemma 6. The for loop in Algorithm 2 preserves (ϵp/2, δp)-differential privacy.

The function affected by the Laplace mechanism in line 10 returns the vector of counts nc. For sets A, A′ as in Theorem 3, f(A) and f(A′) differ by exactly one in a single coordinate of this vector, and therefore ||f(A) − f(A′)||_1 = 1, so the function has L1 sensitivity 1. Thus, line 10 is (ϵp/2, 0)-differentially private by Theorem 3. By composition, this fact and Lemma 6 yield the following lemma:

Lemma 7. Algorithm 2 is (ϵp, δp)-differentially private.

Utility Analysis

We define t0 = 0, t1 = Δ/n, t2 = (1 + ϵ)Δ/n, . . . , tr = Δ as shorthand for the thresholds. Also, let oi be the number of points at distance in the range [ti−1, ti) from their center in the optimal solution (which we denote by OPT), and let ai be the number of points at distance in the range [ti−1, ti) from their closest point in C after the for loop in Algorithm 2. To bound the performance of our solution, we first show in Lemma 8 that discretizing the distance thresholds at the ti's, instead of integrating from 0 to Δ, introduces negligible error to the cost of the optimal solution.
Next, for each distance threshold, Lemma 9 uses the approximation guarantee of maximum coverage to show that we efficiently cover demand points using not many more centers than OPT. Crucially, Lemma 10 shows that by covering almost as well as OPT at every distance threshold, our solution has cost not much more than that of OPT. We begin by bounding the discretized cost of OPT.

Lemma 8. Σ_{i=1}^r oi·ti ≤ (1 + ϵ)·OPT + Δ.

Proof. For all u ∈ D and sets of centers F, define d′(u, F) as the minimum distance threshold ti which is larger than d(u, F). If d(u, OPT) > Δ/n, then d′(u, OPT) ≤ (1 + ϵ)·d(u, OPT), since the thresholds ti scale geometrically with factor 1 + ϵ. If d(u, OPT) ≤ Δ/n, then d′(u, OPT) = Δ/n ≤ d(u, OPT) + Δ/n. Summing over u ∈ D yields the bound.

Before we begin using this bound, we first prove the effectiveness of Algorithm 1 in the context of Algorithm 2, to show that we are effectively covering demand points. For each distance threshold ti, the following lemma shows that the algorithm covers almost as many points as OPT.

Lemma 9. Consider iteration i of the for loop and let Mi be the maximum coverage achievable by k centers with radius ti over the points remaining in U. With high probability, at line 5:

|Ui| ≥ (1 − ϵ)·Mi − 24k·ln n·ln(e/δp)/ϵp.

Proof. The sets in the family S in line 4 are in one-to-one correspondence with the points in V, and the set sv ∈ S corresponding to v ∈ V is exactly the set of points which are not within the distance threshold of an existing center in C but are within distance ti of v. Therefore, the items covered by the centers in Ci are all within distance ti of their closest centers, so the increase in the coverage of C is at least the size of the coverage of Ci.

Our analysis is similar to Gupta et al. (2010). The main difference is that instead of covering all points that OPT can cover, we aim to cover a (1 − ϵ) fraction within an additive error; hence we run 2k·ln(1/ϵ) iterations instead of 2k·ln n. Let |Ri| be the number of remaining elements yet to be covered, and define Li = max_{S∈Si} |S ∩ Ri|, the largest number of uncovered elements covered by any set in Si. By Theorem 4, the exponential mechanism, when selecting a set, ensures that with probability at most 1/n^2 we select a center with coverage less than Li − 3·ln n/ϵ′. When Li > 6·ln n/ϵ′, we are therefore guaranteed to choose a center that covers at least Li/2 points. Based on Lemma 5, as long as Li > 6·ln n/ϵ′ we always take such a greedy-like option in each iteration and are guaranteed coverage at least (1 − ϵ)·Mi with probability at least 1 − 1/n. When Li ≤ 6·ln n/ϵ′, although we are not guaranteed to take the greedy action, there are at most 6k·ln n/ϵ′ points yet to be covered by the algorithm compared to OPT at radius ti. Thus, the algorithm loses at most 6k·ln n/ϵ′ = 24k·ln n·ln(e/δp)/ϵp points.
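For reference, the constant in Lemma 9 follows from Algorithm 1's internal privacy parameter; the arithmetic below is ours, spelled out under the parameters used at line 4 of Algorithm 2.

```latex
% Algorithm 1 is called with \epsilon_s = \epsilon_p/2 and \delta_s = \delta_p, so its
% per-step exponential-mechanism parameter is
\epsilon' \;=\; \frac{\epsilon_s}{2\ln(e/\delta_s)} \;=\; \frac{\epsilon_p}{4\ln(e/\delta_p)},
% and the at most 6k\ln n/\epsilon' points lost in the final phase amount to
\frac{6k\ln n}{\epsilon'} \;=\; \frac{24\,k\ln n\,\ln(e/\delta_p)}{\epsilon_p}.
```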
The next lemma relates the cost of our solution to that of OPT, given that we cover almost as well as OPT at every distance threshold. Specifically, we bound the discretized cost of our coverage in terms of the discretized cost of the coverage of OPT and the additive error resulting from Algorithm 1.

Lemma 10. With high probability,

Σ_{i=1}^r ai·ti ≤ ((1 − ϵ)/(1 − ϵ − ϵ^2))·Σ_{i=1}^r oi·ti + 24Δ·k·ln n·ln(e/δp)/(ϵp·(1 − ϵ − ϵ^2)).

Proof. Let Oi = Σ_{j=i}^r oj, Ai = Σ_{j=i}^r aj, and E = 24k·ln n·ln(e/δp)/ϵp. Given a threshold ti, we know that the centers in OPT cover n − Oi+1 points within distance ti. At threshold ti, Algorithm 2 has already covered n − Ai points, so there is a solution covering an additional (n − Oi+1) − (n − Ai) = Ai − Oi+1 points. By the coverage guarantee of Lemma 9, we cover ai ≥ (1 − ϵ)·(Ai − Oi+1) − E new points in this iteration. By substituting Ai = ai + Ai+1, we have ai ≥ ((1 − ϵ)/ϵ)·(Ai+1 − Oi+1) − E/ϵ. Rearranging the last inequality to isolate Ai+1, notice that

Σ_{i=1}^r Ai·(ti − ti−1) ≤ Σ_{i=1}^r (ϵ·ai−1/(1 − ϵ) + Oi + E/(1 − ϵ))·(ti − ti−1) = Σ_{i=1}^r ϵ·ai−1·(ti − ti−1)/(1 − ϵ) + Σ_{i=1}^r oi·ti + E·Δ/(1 − ϵ).

The last equality holds because of telescoping sums. Also, notice that

Σ_{i=1}^r ϵ·ai−1·(ti − ti−1)/(1 − ϵ) = Σ_{i=0}^{r−1} ϵ·ai·(ti+1 − ti)/(1 − ϵ) ≤ Σ_{i=1}^{r−1} ϵ^2·ai·ti/(1 − ϵ).

The last inequality holds because for all 1 ≤ i ≤ r − 1, ti+1 ≤ (1 + ϵ)·ti by definition. We are also able to drop the term i = 0 from the last two sums because a0 = 0. Since Σ_{i=1}^r Ai·(ti − ti−1) = Σ_{i=1}^r ai·ti by the same telescoping argument, we get

Σ_{i=1}^r ai·ti ≤ (ϵ^2/(1 − ϵ))·Σ_{i=1}^r ai·ti + Σ_{i=1}^r oi·ti + E·Δ/(1 − ϵ),

which implies that

Σ_{i=1}^r ai·ti ≤ ((1 − ϵ)/(1 − ϵ − ϵ^2))·Σ_{i=1}^r oi·ti + E·Δ/(1 − ϵ − ϵ^2).

Combining the results of Lemmas 8 and 10, we see that

Σ_{i=1}^r ai·ti ≤ ((1 − ϵ)/(1 − ϵ − ϵ^2))·((1 + ϵ)·OPT + Δ) + E·Δ/(1 − ϵ − ϵ^2),

which gives us a bound on the cost of snapping points in D to points in C. This bound appears in our final cost, which we analyze in the next lemma.

Lemma 11. Consider the k-medians problem in the last line of Algorithm 2, where demand points in D are shifted to points in C and Laplace noise is applied to the counts. With high probability, the optimal objective cost of this new k-medians problem is at most

OPT + Σ_{i=1}^r ai·ti + (4Δ/ϵp)·(2k·ln(1/ϵ)·ln n/ln(1 + ϵ) + 2k·ln(1/ϵ)).

Proof. After assigning every point in D to the closest point in C, we run a k-medians algorithm on a multiset defined by C, where each element c ∈ C has multiplicity n′c as in line 10 of Algorithm 2. Recall that the absolute value of a random variable following a Laplace distribution with parameter b follows an exponential distribution with parameter 1/b. Also note that the sum of exponential variables is less than twice its expectation with high probability, with the probability of failure decreasing exponentially in the number of summands. For the sake of completeness, we include the proof of this fact in the appendix. It is also significant that, since we call Algorithm 1 a total of ⌈log_{1+ϵ} n + 1⌉ = ⌈ln n/ln(1 + ϵ) + 1⌉ times with m = 2k·ln(1/ϵ), we select at most |C| ≤ 2k·ln n·ln(1/ϵ)/ln(1 + ϵ) + 2k·ln(1/ϵ) centers before calling the black-box k-medians algorithm.

With all this preliminary information, we prove the claim. The last term in the bound is due to the Laplace mechanism, where noise is applied to the count of each center c ∈ C. Each of the centers in C has Laplace noise with parameter 2/ϵp applied to its count. Therefore, with high probability at most (4/ϵp)·(2k·ln(1/ϵ)·ln n/ln(1 + ϵ) + 2k·ln(1/ϵ)) spurious demand points are added by line 10 of the algorithm, and each of these points is at distance at most Δ from its closest center, which yields the last term in the bound.

The first two terms come from the fact that we shift points in line 8 of Algorithm 2 and from the triangle inequality. For each of the original demand points v ∈ D, let the cluster center in OPT closest to v be OPTv, and let the point in C closest to v be denoted cv. We see that d(cv, OPTv) ≤ d(v, cv) + d(v, OPTv) by the triangle inequality. Since the OPT solution of the original k-medians problem is a candidate solution for the new k-medians problem, the objective cost of using OPT upper bounds the optimal cost of the new problem. Summing over all v ∈ D, we see that the objective cost of the shifted demand points is bounded as

Σ_{v∈D} d(cv, OPTv) ≤ Σ_{i=1}^r ai·ti + OPT,

since Σ_i ai·ti is an upper bound on the cost of shifting the demand points to centers in C, and the sum of d(v, OPTv) is exactly OPT. Thus, the first two terms come from the cost of shifting the real demand points, and the last term comes from the Laplace noise.
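To make the origin of the noise term explicit, the accounting can be compressed as follows; the grouping of constants is ours, assembled from the bounds stated in the proof.

```latex
% Number of selected centers and expected total noise magnitude:
|C| \;\le\; \tfrac{2k\ln n\,\ln(1/\epsilon)}{\ln(1+\epsilon)} + 2k\ln(1/\epsilon),
\qquad
\mathbb{E}\Big[\textstyle\sum_{c\in C}|Y_c|\Big] \;=\; \tfrac{2}{\epsilon_p}\,|C|.
% W.h.p. the sum is at most twice its expectation, and each spurious point costs at most \Delta:
\Delta\sum_{c\in C}|Y_c|
\;\le\; \tfrac{4\Delta}{\epsilon_p}\,|C|
\;\le\; \tfrac{4\Delta}{\epsilon_p}\Big(\tfrac{2k\ln n\,\ln(1/\epsilon)}{\ln(1+\epsilon)} + 2k\ln(1/\epsilon)\Big).
```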
Note that there may be a better solution to this new k-medians problem than the original k-medians OPT, but this is consistent with the bound, since OPT is only used as an upper bound.

Using a non-private approximation algorithm for the k-medians problem with approximation factor M in the last step of Algorithm 2, the entire cost gains a multiplicative factor M. Therefore, we can summarize the utility of Algorithm 2 in the following lemma and an even simpler corollary:

Lemma 12. With high probability, Algorithm 2 preserves (ϵp, δp)-differential privacy and solves the k-medians problem with cost

O(M·(1 + ϵ))·OPT + O((M·Δ·k/ϵp)·ln n·ln(e/δp)),

where the black-box k-medians algorithm used in the last step of Algorithm 2 has approximation factor M and ϵ is a small, positive constant.

Proof. Combining the results of Lemmas 8, 10, and 11, we see that with high probability the optimal cost of the k-medians problem at the final line of Algorithm 2 is at most

((2 − ϵ − 2ϵ^2)/(1 − ϵ − ϵ^2))·OPT + (4Δ/ϵp)·(2k·ln(1/ϵ)·ln n/ln(1 + ϵ) + 2k·ln(1/ϵ)) + 24Δ·k·ln n·ln(e/δp)/(ϵp·(1 − ϵ − ϵ^2)) + ((1 − ϵ)/(1 − ϵ − ϵ^2))·Δ.

For simplicity, we use big-O notation going forward, so this is the same as

(2 + O(ϵ))·OPT + O(Δ·k/ϵp)·(ln n·ln(1/ϵ)/ln(1 + ϵ) + ln n·ln(e/δp)).

Since the k-medians algorithm used at the last line has approximation factor M, the objective cost of the shifted k-medians problem is M times that quantity. To obtain the objective cost for the original problem, note that for any demand point v ∈ D, the distance from v to a center is at most d(cv, v) plus the distance from cv to that center, by the triangle inequality. Therefore, the new objective cost only adds the shifting cost on top of the modified k-medians problem's objective cost, which only slightly affects the factor in front of OPT:

(2M + 1 + (M + 1)·O(ϵ))·OPT + O(M·Δ·k/ϵp)·(ln n·ln(1/ϵ)/ln(1 + ϵ) + ln n·ln(e/δp)),

which simplifies to the claim.

Corollary 13. By using a constant-approximation non-private algorithm for k-medians, there is an (ϵp, δp)-differentially private algorithm for the k-medians problem that, with high probability, outputs a solution with objective cost

O(OPT + (Δ·k/ϵp)·ln n·ln(e/δp)).

Note that our algorithm allows any k-medians algorithm to be used in the last step. One can choose a preferred trade-off between runtime and performance to select a suitable algorithm. This is in contrast to the approach in Gupta et al. (2010), where the algorithm builds on the k-medians algorithm of Arya et al. (2004).

5 Application to Euclidean k-means

In the Euclidean k-means problem, instead of selecting centers from a discrete set of points, V is defined to be all of R^d. We wish to select a set of points (centers) F ⊆ R^d, |F| = k, to minimize cost(F) = Σ_{v∈D} d(v, F)^2. In this section, we apply our result to improve the additive error of the approach in Kaplan and Stemmer (2018). The strategy in Kaplan and Stemmer (2018) is to first identify a polynomial-size set of candidate centers that contains a subset of k candidate centers with low k-means cost. Then, the algorithm uses a private discrete k-means algorithm to select the final k centers with low cost from the set of candidate centers.

More concretely, the algorithm in Kaplan and Stemmer (2018) is guaranteed to output an (ϵp, δp)-private set of candidate centers Y of size at most ϵpn·log(k/β) such that, with probability at least 1 − β, there exists a subset of k centers with constant multiplicative error and additive error O(T^(1/(1−a−b))·w^(1/(1−a−b))·k^(1/(1−a−b)))·Δ^2, where a, b are small constant parameters of the Locality Sensitive Hashing algorithm used in Kaplan and Stemmer (2018), T = Θ(log log n), and w = O((√d/ϵp)·log log n·log k·log log log n). Note that a, b can be chosen to be arbitrarily small at the cost of making the multiplicative approximation factor a larger constant. The work of Kaplan and Stemmer (2018) focuses on the regime where a, b are small and 1/(1 − a − b) ≤ 1.01. The resulting additive error for identifying candidate centers is Oϵp,δp(k^1.01·d^0.51).

The performance bottleneck of the Euclidean k-means algorithm is the step that selects the final k centers from a candidate set. We can apply our algorithm to the potential centers returned by Kaplan and Stemmer (2018) to improve the algorithm's performance. Note that our algorithm can be applied to the k-means objective by passing the appropriate distance function. Although the square of the Euclidean distance is not a metric, we can still apply the same algorithm and a similar analysis to obtain a bound for the k-means objective. The only algorithmic difference is in the last step, where instead of running a k-medians algorithm we run a k-means algorithm to get the final centers.
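To make the black-box reuse concrete, the following hypothetical snippet plugs a non-private weighted k-means solver (here scikit-learn's Lloyd's implementation) into the k-medians sketch from Section 4, with the squared Euclidean distance in place of the metric; the candidate set `Y`, the parameter values, and the helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def sq_dist(u, v):
    """Squared Euclidean distance, used in place of the metric for the k-means objective."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(diff @ diff)

def kmeans_blackbox(points, weights, k):
    """Non-private weighted k-means (Lloyd's) used as the final black-box step."""
    X = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X, sample_weight=w).cluster_centers_

# Hypothetical end-to-end call: Y is a candidate-center set (e.g., from the Kaplan and
# Stemmer construction), D the private data, and diameter**2 plays the role of Δ for the
# squared objective.
# centers = private_k_medians(V=Y, D=D, d=sq_dist, k=10, eps=0.1, eps_p=1.0,
#                             delta_p=1e-6, diameter=diameter ** 2,
#                             max_coverage=private_max_coverage,
#                             kmedians_blackbox=kmeans_blackbox)
```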
Rather than replicate the entire analysis section, we only review the proofs which are directly affected by the change in the distance function. Furthermore, we extend the proof to all distance functions d^p for any natural number p, where d^2 is the distance function in the Euclidean k-means objective. Notably, the privacy analysis is independent of the distance function and is therefore unaffected. In fact, the only steps in the proofs of Section 4 which involve the distance function are Lemmas 8, 11, and 12. For the distance function d^p, Lemma 8 is amended as follows:

Lemma 14. Σ_{i=1}^r oi·ti ≤ (1 + ϵ)^p·OPT + Δ^p.

Proof. In the case where (d(u, OPT))^p > Δ^p/n, we now have (d′(u, OPT))^p ≤ (1 + ϵ)^p·(d(u, OPT))^p. Otherwise, the proof is functionally identical to that of Lemma 8.

For Lemmas 11 and 12, we directly address the main issue with the distance function, namely that the triangle inequality does not hold when p > 1. However, we can use the following lemma, which we prove in the appendix:

Lemma 15. In any metric space and for p ≥ 1, (d(a, b))^p ≤ 2^(p−1)·((d(a, c))^p + (d(b, c))^p).

Using this lemma, we see that wherever the triangle inequality would be applied, we gain an additional factor of 2^(p−1). This affects the leading constant of the approximation factor and additive error, but does not affect the asymptotic cost.
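For completeness, the standard convexity argument behind Lemma 15 (our phrasing; the paper defers the full proof to the appendix):

```latex
% Triangle inequality followed by convexity of x^p for p >= 1 (power-mean inequality):
d(a,b)^p \;\le\; \big(d(a,c) + d(c,b)\big)^p
\;=\; 2^p\Big(\tfrac{d(a,c) + d(c,b)}{2}\Big)^{p}
\;\le\; 2^{p-1}\big(d(a,c)^p + d(c,b)^p\big).
```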
Therefore, the only changes to Algorithm 2 necessary to make it a functional k-means algorithm are to use the proper input distance function and to run a black-box k-means algorithm as the last step rather than a black-box k-medians algorithm. Then, if we are trying to minimize the objective function using the distance function d^p and we use a black-box algorithm for this objective function at the last step of Algorithm 2, the proofs in Section 4, using Lemma 14 instead of Lemma 8, yield the following lemma and corollary:

Lemma 16. Given a problem equivalent to k-means but with distance function d^p, and a non-private algorithm for that problem with approximation factor M, there exists an (ϵp, δp)-differentially private algorithm for that problem which, with high probability, has objective cost at most

O(2^(p−1)·M·(1 + ϵ)^p)·OPT + O((2^(p−1)·Δ^p·k/ϵp)·(ln n + ln |Y|)·ln(e/δp)).

Corollary 17. There is an (ϵp, δp)-differentially private algorithm for the Euclidean k-means problem that, with probability at least 1 − β, returns a solution with a constant multiplicative factor and an additive error of

O(Δ^2·(T·w·k)^(1/(1−a−b)) + (Δ^2·k/ϵp)·(ln n + ln(ϵpn·log(k/β)))·ln(e/δp)).

Note that our algorithm results in a better additive term compared to applying Gupta et al. (2010) to the potential centers. Specifically, the second additive term is almost linear in k instead of k^1.5, making the entire additive error nearly linear in k.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant Nos. 1909314 and 1750716. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Ethics Statement

Clustering has many applications in machine learning, such as image segmentation (Yang et al. 2017; Patel, Van Nguyen, and Vidal 2013), collaborative filtering (Mc Sherry and Mironov 2009; Schafer et al. 2007), and time series analysis (Mueen and Keogh 2012). Privacy is a major concern when input data contains sensitive information. Differential privacy (Dwork et al. 2006b) has become a rigorous framework for ensuring privacy in algorithms. Thus, differentially private algorithms for clustering ensure a robust privacy guarantee for each individual in the input. Our improved utility guarantee will perhaps encourage the adoption of privacy-preserving algorithms as a replacement for their non-private counterparts. Furthermore, our approach allows usage with another clustering algorithm as a black box. We believe this further improves the applicability of private clustering algorithms, making it easier to incorporate the privacy guarantee into existing clustering frameworks. The limitations of the work are that the privacy guarantee requires the range of the data to be bounded, and that the utility guarantee has additive error that is only meaningful when the dataset has a large enough number of participants. When applying the algorithm, the curator has to ensure that these assumptions hold to protect the privacy of the participants.

References

Arya, V.; Garg, N.; Khandekar, R.; Meyerson, A.; Munagala, K.; and Pandit, V. 2004. Local search heuristics for k-median and facility location problems. SIAM Journal on computing 33(3): 544 562. Balcan, M.-F.; Dick, T.; Liang, Y.; Mou, W.; and Zhang, H. 2017. Differentially private clustering in high-dimensional euclidean spaces. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 322 331. JMLR. org. Blum, A.; Dwork, C.; Mc Sherry, F.; and Nissim, K. 2005. Practical privacy: the Su LQ framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 128 138. Byrka, J.; Pensyl, T. W.; Rybicki, B.; Srinivasan, A.; and Trinh, K. 2017. An Improved Approximation for k-Median and Positive Correlation in Budgeted Optimization. ACM Trans. Algorithms 13(2): 23:1 23:31. doi:10.1145/2981561. URL https://doi.org/10.1145/2981561. Chamikara, M.; Bertok, P.; Khalil, I.; Liu, D.; and Camtepe, S. 2020.
Privacy Preserving Face Recognition Utilizing Differential Privacy. Computers & Security 97: 101951. ISSN 0167-4048. doi:https://doi.org/10.1016/j.cose.2020. 101951. URL http://www.sciencedirect.com/science/article/ pii/S0167404820302273. Charikar, M.; Guha, S.; Tardos, E.; and Shmoys, D. B. 2002. A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences 65(1): 129 149. Chrobak, M.; Kenyon, C.; and Young, N. 2006. The reverse greedy algorithm for the metric k-median problem. Information Processing Letters 97(2): 68 72. Dwork, C.; Kenthapadi, K.; Mc Sherry, F.; Mironov, I.; and Naor, M. 2006a. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 486 503. Springer. Dwork, C.; Mc Sherry, F.; Nissim, K.; and Smith, A. 2006b. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 265 284. ISBN 3540327312. doi:10.1007/11681878 14. Dwork, C.; Roth, A.; et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3 4): 211 407. Feldman, D.; Fiat, A.; Kaplan, H.; and Nissim, K. 2009. Private coresets. In Proceedings of the forty-first annual ACM symposium on Theory of computing, 361 370. Feldman, D.; Xiang, C.; Zhu, R.; and Rus, D. 2017. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In 2017 16th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), 3 16. IEEE. Gupta, A.; Ligett, K.; Mc Sherry, F.; Roth, A.; and Talwar, K. 2010. Differentially private combinatorial optimization. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, 1106 1125. SIAM. Jain, K.; Mahdian, M.; Markakis, E.; Saberi, A.; and Vazirani, V. V. 2003. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM (JACM) 50(6): 795 824. Jain, K.; and Vazirani, V. V. 2001. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM (JACM) 48(2): 274 296. Kaplan, H.; and Stemmer, U. 2018. Differentially private k-means with constant multiplicative error. In Advances in Neural Information Processing Systems, 5431 5441. Kariv, O.; and Hakimi, S. L. 1979. An algorithmic approach to network location problems. I: The p-centers. SIAM Journal on Applied Mathematics 37(3): 513 538. Mc Sherry, F.; and Mironov, I. 2009. Differentially private recommender systems: Building privacy into the netflix prize contenders. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 627 636. Mc Sherry, F.; and Talwar, K. 2007. Mechanism Design via Differential Privacy. In Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE. URL https://www.microsoft.com/en-us/research/ publication/mechanism-design-via-differential-privacy/. Mueen, J. Z. A.; and Keogh, E. 2012. Clustering time series using unsupervised-shapelets. In International Conference on Data Mining (ICDM). Ni, L.; Li, C.; Wang, X.; Jiang, H.; and Yu, J. 2018. DPMCDBSCAN: Differential Privacy Preserving Multi-Core DBSCAN Clustering for Network User Data. IEEE Access 6: 21053 21063. 
doi:10.1109/ACCESS.2018.2824798. Nissim, K.; Raskhodnikova, S.; and Smith, A. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, 75 84. Nissim, K.; and Stemmer, U. 2018. Clustering Algorithms for the Centralized and Local Models. In Algorithmic Learning Theory, 619 653. Patel, V. M.; Van Nguyen, H.; and Vidal, R. 2013. Latent space sparse subspace clustering. In Proceedings of the IEEE international conference on computer vision, 225 232. Schafer, J. B.; Frankowski, D.; Herlocker, J.; and Sen, S. 2007. Collaborative filtering recommender systems. In The adaptive web, 291 324. Springer. Xiong, J.; Ren, J.; Chen, L.; Yao, Z.; Lin, M.; Wu, D.; and Niu, B. 2018. Enhancing privacy and availability for data clustering in intelligent electrical service of IoT. IEEE Internet of Things Journal 6(2): 1530 1540. Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3861 3870. JMLR. org. Álvarez Marañón, G. 2020. What Differential Privacy Is and Why Google and Apple Are Using It with Your Data. Think Big. URL business.blogthinkbig.com/differential-privacy-google-apple-using-it-with-your-data/.