# differentially_private_set_union__24164a2d.pdf

Differentially Private Set Union

Sivakanth Gopi 1 Pankaj Gulhane 1 Janardhan Kulkarni 1 Judy Hanwen Shen 1 2 Milad Shokouhi 1

Sergey Yekhanin 1

We study the basic operation of set union in the global model of differential privacy. In this problem, we are given a universe U of items, possibly of inﬁnite size, and a database D of users. Each user i contributes a subset Wi U of items. We want an (ε,δ)-differentially private Algorithm which outputs a subset S i Wi such that the size of S is as large as possible. The problem arises in countless real world applications, and is particularly ubiquitous in natural language processing (NLP) applications. For example, discovering words, sentences, n-grams etc., from private text data belonging to users is an instance of the set union problem. In this paper we design new algorithms for this problem that signiﬁcantly outperform the best known algorithms.

1. Introduction

Natural language models for applications such as suggested replies for e-mails and dialog systems rely on the discovery of n-grams and sentences (Hu et al., 2014; Kannan et al., 2016; Chen et al., 2019; Deb et al., 2019). Words and phrases used for training come from individuals, who may be left vulnerable if personal information is revealed. For example, a model could generate a sentence which reveals personal information of the users on which it was trained (Carlini et al., 2019). Therefore, algorithms that allow the public release of the words, n-grams, and sentences obtained from user text while preserving privacy are desirable. Additional applications of this problem include the release of search queries and keys in SQL queries (Korolova et al., 2009; Wilson et al., 2020). While other privacy deﬁnitions are common in practice, guaranteeing differential privacy,

1Microsoft 2Work done as part of the Microsoft AI Residency Program. Correspondence to: Sivakanth Gopi <sigopi@microsoft.com>, Janardhan Kulkarni <jakul@microsoft.com>, Judy Hanwen Shen <hashe@microsoft.com>.

Proceedings of the 37 th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

introduced in the seminal work of Dwork et al. (2006), ensures users the strongest preservation of privacy. In this paper we consider user level privacy.

Deﬁnition 1.1 (Differential Privacy (Dwork et al., 2006)). A randomized algorithm A is (ε,δ)-differentially private if for any two neighboring databases D and D , which differ in exactly the data pertaining to a single user, and for all sets S of possible outputs:

Pr[A(D) S] eε Pr[A(D ) S] + δ.

An algorithm satisfying differential privacy (DP) guarantees that its output does not change by much if a single user is either added or removed from the dataset. Moreover, the guarantee holds regardless of how the output of the algorithm is used downstream. Therefore, items (e.g. n-grams) produced using a DP algorithm can be used in other applications without any privacy concerns. Since its introduction a decade ago (Dwork et al., 2006), differential privacy has become the de facto notion of privacy in statistical analysis and machine learning, with a vast body of research work (see Dwork et al. (2014) and Vadhan (2017) for surveys) and growing acceptance in industry. Differential privacy is deployed in many industries, including Apple (Apple, 2017), Google (Erlingsson et al., 2014; Bittau et al., 2017), Microsoft (Ding et al., 2017), Mozilla (Avent et al., 2017), and the US Census Bureau (Abowd, 2016; Kuo et al., 2018).

The vocabulary extraction and n-gram discovery problems mentioned above, as well as many commonly studied problems (Korolova et al., 2009; Wilson et al., 2020) can be abstracted as a set union which leads to the following problem.

Problem 1.1 (Differentially Private Set Union (DPSU)). Let U be some universe of items, possibly of unbounded size. Suppose we are given a database D of users where each user i has a subset Wi U. We want an (ε,δ)-differentially private Algorithm A which outputs a subset S i Wi such that the size of S is as large as possible.

As the universe of items can be unbounded, as in our motivating examples, it is not clear how to apply the exponential mechanism (Mc Sherry & Talwar, 2007) to DPSU. Intuitively, when the universe is unbounded, an algorithm which

Differentially Private Set Union

outputs items outside the true union doesn t lead to any gains in privacy. And in many applications, it is essential that we output a subset of the true union. Therefore, in the deﬁnition of DPSU, we impose the condition that the output is a subset of the true union. It is not hard to show that there are no non-trivial (ε, 0)-DP DPSU algorithms, so we in this paper we will only study (ε, δ)-DP algorithms for δ > 0.

Existing algorithms 1 for this problem (Korolova et al., 2009; Wilson et al., 2020) collect a bounded number of items from each user, build a histogram of these items, and disclose the items whose noisy counts fall above a certain threshold. In these algorithms, the contribution of each user is always independent from the identity of items held by other users, resulting in a wasteful aggregation process, where some items counts could be far above the threshold. Since the goal is to release as large a set as possible rather than to release accurate counts of each item, there could be more efﬁcient ways to allocate the weight to users items. We

Figure 1. Size of the set output by our proposed algorithms POLICY LAPLACE and POLICY GAUSSIAN compared to natural generalizations of previously known algorithms for various values of privacy parameter ε and δ = exp( 10). The error bars are negligible.

deviate from the previous methods by allowing users to contribute their items in a dependent fashion, guided by an update policy. In our algorithms, proving privacy is more delicate as some update policies can result in histograms with unbounded sensitivity. We prove a meta-theorem to show that update policies with certain contractive properties would result in differentially private algorithms. The main contributions of the paper are:

Guided by our meta-theorems, we introduce two new algorithms called POLICY LAPLACE and POLICY GAUSSIAN for the DPSU problem. Both of them run in linear time and only require a single pass over the

1They don t study the DPSU problem as deﬁned in this paper. Their goal is to output approximate counts of as many items as possible in i Wi.

users data.

Using a Reddit dataset, we demonstrate that our algorithms signiﬁcantly improve the size of DP set union even when compared to natural generalizations of the existing mechanisms for this problem (see Figure 1).

Our algorithms are being productized in industry to make a basic subroutine in an NLP application differentially private.

1.1. Base Line Algorithms

To understand the DPSU problem better, let us start with the simplest case we can solve by known techniques. Deﬁne 0 = maxi |Wi|. Suppose 0 = 1. This special case can be solved using the algorithms in (Korolova et al., 2009; Wilson et al., 2020). Their algorithm works as follows: Construct a histogram on i Wi (the set of items in a database D) where the count of each item is the number of sets it belongs to. Then add Laplace noise or Gaussian noise to the counts of each item. Finally, release only those items whose noisy histogram counts are above a certain threshold ρ. It is not hard to prove that if the threshold is set sufﬁciently high, then the algorithm is (ε, δ)-DP.

In many applications, however, 0 is much greater than 1. For example, in the n-gram discovery problem, each user holds a large set of n-grams not just 1. A straight-forward extension of the histogram algorithm for 0 > 1 is to upper bound the ℓ1-sensitivity by 0 (and ℓ2-sensitivity by 0), and then add some appropriate amount of Laplace noise (or Gaussian noise) based on sensitivity. The threshold ρ has to be set based on 0. The Laplace noise based algorithm was also the approach considered in (Korolova et al., 2009; Wilson et al., 2020). This approach has the following drawback. Suppose a signiﬁcant fraction of users have sets of size smaller than 0. Then constructing a histogram based on counts of the items results in wastage of sensitivity budget. A user i with |Wi| < 0 can increment the count of items in Wi by any vector v RWi as long as one can ensure that ℓ1 sensitivity is bounded by 0 (or ℓ2 sensitivity is bounded by 0 if adding Gaussian noise). Consider the following natural generalization of Laplace and Gaussian mechanisms to create a weighted histogram of elements. A weighted histogram over a domain U is any map H : U R. For an item u U, H(u) is called the weight of u. In the rest of the paper, the term histogram should be interpreted as weighted histogram. Each user i updates the weight of each item u Wi using the rule: H[u] := H[u] + ( 0/|Wi|)1/p for p = 1, 2. It is not hard to see that ℓp-sensitivity of this weighted histogram is still 1/p 0 . Adding Laplace noise (for p = 1) or Gaussian noise (for p = 2) to each item of the weighted histogram, and releasing only those items above an appropriately calibrated

Differentially Private Set Union

threshold will lead to differentially private output. We call these algorithms as WEIGHTED LAPLACE and WEIGHTED GAUSSIAN, they will be used as benchmarks to compare against our new algorithms.

1.2. Our Techniques

The WEIGHTED LAPLACE and WEIGHTED GAUSSIAN mechanisms described above can be thought of trying to solve the following variant of a Knapsack problem. Here each item u U is a bin and we gain a proﬁt of 1 if the total weight of the item in the weighted histogram constructed is more than the threshold. Each user can increment the weight of elements u Wi using an update policy φ which is deﬁned as follows.

Deﬁnition 1.2 (Update policy). An update policy is a map φ : RU 2U RU such that supp(φ(H, W) H) W, i.e., φ can only update the weights of items in W. And the ith user updates H to φ(H, Wi). Since Wi is typically understood from context, we will write φ(H) instead of φ(H, Wi) for simplicity.

In this framework, the main technical challenge is the following:

How to design update policies such that the sensitivity of the resulting weighted histogram is small while maximizing the number of bins that are full?

Note that bounding sensitivity requires that φ(H, W) H ℓp C for some constant C i.e. each user has an ℓp-budget of C and can increase the weights of items in their set by an ℓp-distance of at most C. By scaling, WLOG we can assume that C = 1. Note that having a larger value of 0 should help in ﬁlling more bins as users have more choice in how they can use their budget to increment the weight of items.

In this paper, we consider algorithms which iteratively construct the weighted histogram. That is, in our algorithms, we consider users in a random order, and each user updates the weighted histogram using the update policy φ. Algorithm 1 is a meta-algorithm for DP set union, and all our subsequent algorithms follow this framework.

If the update policy is such that it increments the weights of items independent of other users (as done in WEIGHTED LAPLACE and WEIGHTED GAUSSIAN), then it is not hard to see that sensitivity of H can be bounded by 1; that is, by the budget of each user. However, if some item is already way above the threshold ρ, then it doesn t make much sense to waste the limited budget on that item. So the users can choose a clever update policy to distribute their budget among the Wi items based on the current weights.

Note that if a policy is such that updates of a user depend on other users, it can be quite tricky to bound the sensitiv-

Algorithm 1 High level meta algorithm for DP Set Union

Input: D: Database of n users where each user i has some subset Wi U ρ: threshold Noise: Noise distribution (Lap(0, λ) or N(0, σ2)) Output: S: A subset of i Wi Build weighted histogram H supported over i Wi using Algorithm 2. S = (empty set) for u i Wi do

ˆH[u] H[u] + Noise if ˆH[u] > ρ then

S S {u} end if end for Output S

Algorithm 2 High level meta algorithm for building weighted histogram using a given update policy

Input: D: Database of n users where each user i has some subset Wi U 0: maximum contribution parameter hash: A random hash function which maps user ids into some large domain without collisions φ: Update policy for a user to update the weights of items in their set Output: H: A weighted histogram in R i Wi H = {} (empty histogram) Sort users into User1, User2, . . . , Usern by sorting the hash values of their user ids under the hash function hash for i = 1 to n do

Wi set of Useri if |Wi| > 0 then

W i Randomly choose 0 items from Wi else

W i Wi end if Update H[u] for each u W i using update policy φ end for Output H

ity of the resulting weighted histogram. To illustrate this, consider for example the greedy update policy. Each user i can use his budget of 1 to ﬁll the bins that is closest to the threshold among the bins u Wi. If an item already reached the threshold, the user can spend his remaining budget incrementing the weight of next bin that is closest to the threshold and so on. Note that from our Knapsack problem analogy this seems to be a good way to maximize the number of bins ﬁlled. However such a greedy policy can have very large sensitivity (see supplementary material for an example), and hence won t lead to any reasonable DP algorithm. So, the main contribution of the paper is

Differentially Private Set Union

in showing policies which help maximize the number of items that are ﬁlled while keeping the sensitivity low. In particular, we deﬁne a general class of ℓp-contractive update policies and show that they produce weighted histograms with bounded ℓp-sensitivity.

Deﬁnition 1.3 (ℓp-contractive update policy). We say that an update policy φ is ℓp-contractive if there exists a subset I (called the invariant subset for φ) of pairs of weighted histograms which are at an ℓp distance of at most 1, i.e.,

I n (H1, H2) : H1 H2 ℓp 1 o

such that the following conditions hold.

1. (Invariance) (H1, H2) I (φ(H1), φ(H2)) I.2

2. (φ(H), H) I for all H.

Property (2) of Deﬁnition 1.3 requires that the update policy can change the histogram by an ℓp distance of at most 1 (budget of a user).

Theorem 1.1 (Contractivity implies bounded sensitivity). Suppose φ is an update policy which is ℓp-contractive over some invariant subset I. Then the histogram output by Algorithm 2 has ℓp-sensitivity bounded by 1.

We prove Theorem 1.1 in Section 2. Once we have bounded ℓp-sensitivity, we can get a DP Set Union algorithm with some additional technical work.

Theorem 1.2. (Informal: Bounded sensitivity implies DP) For p {1, 2}, if the ℓp-sensitivity of the weighted histogram output by Algorithm 2 is bounded, then Algorithm 1 for DP Set Union can be made (ε, δ)-differentially private by appropriately choosing the noise distribution (Noise) and threshold (ρ).

The main contribution of the paper is two new algorithms guided by Theorem 1.1. The ﬁrst algorithm, which we call POLICY LAPLACE, uses update policy that is ℓ1-contractive. The second algorithm, which we call POLICY GAUSSIAN, uses update policy that is ℓ2-contractive. Finally we show that our algorithms signiﬁcantly outperform the weighted policies.

At a very high-level, the role of contractivity in our algorithms is indeed similar to its role in the recent elegant work of Feldman et al. (2018). They show that if an iterative algorithm is contractive in each step, then adding Gaussian noise in each iteration will lead to strong privacy ampliﬁcation. In particular, users who make updates early on will

2Note that property (1) is a slightly weaker requirement than the usual notion of ℓp-contractivity which requires φ(H1) φ(H2) ℓp H1 H2 ℓp for all H1, H2. Instead we require contraction only for (H1, H2) I.

enjoy much better privacy guarantees. However their framework is not applicable in our setting, because their algorithm requires adding noise to the count of every item in every iteration; this will lead to unbounded growth of counts and items which belong to only a single user can also get output which violates privacy.

2. Contractivity implies DP algorithms

In this section, we show that if an update policy satisﬁes contractive property as in Deﬁnition 1.3, we can use it to develop a DP algorithm for DPSU. First we show that contractivity of update policy implies bounded sensitivity (Theorem 1.1), which in turn implies a DP Set Union algorithm by Theorem 1.2

We will ﬁrst deﬁne sensitivity and update policy formally. Let D denote the collection of all databases. We say that D, D are neighboring databases, denoted by D D , if they differ in exactly one user.

Deﬁnition 2.1. For p 0, the ℓp-sensitivity of f : D Rk is deﬁned as sup D D f(D) f(D ) ℓp where the supremum is over all neighboring databases D, D .

Proof of Theorem 1.1. Let φ be an ℓp-contractive update policy with invariant subset I. Consider two neighboring databases D1 and D2 where D1 has one extra user compared to D2. Let H1 and H2 denote the histograms built by Algorithm 1 using the update policy φ when the databases are D1 and D2 respectively.

Say the extra user in D1 has position t in the global ordering given by the hash function. Let Ht 1 1 and Ht 1 2 be the histograms after the ﬁrst t 1 (according to the global order given by the hash function hash) users data is added to the histogram. Therefore Ht 1 1 = Ht 1 2 . And the new user updates Ht 1 1 to Ht 1. By property (2) in Deﬁnition 1.3 of ℓp-contractive policy, (φ(Ht 1 1 ), Ht 1 1 ) I. Since φ(Ht 1 1 ) = Ht 1, we have (Ht 1, Ht 1 1 ) = (Ht 1, Ht 1 2 ) I. The remaining users are now added to Ht 1, Ht 1 2 in the same order. Note that we are using the fact that the users are sorted according some hash function and they contribute in that order (this is also needed to claim that Ht 1 1 = Ht 1 2 ). Therefore, by property (1) in Deﬁnition 1.3 of ℓp-contractive policy, we get (H1, H2) I. Since I only contains pairs with ℓp-distance at most 1, we have H1 H2 ℓp 1. This proves that the histogram built by Algorithm 2 using φ has ℓp-sensitivity of at most 1.

Above theorem implies that once we have a ℓp contractive update policy, we can appeal to Theorem 1.2 to design a DP algorithm for DPSU.

Differentially Private Set Union

3. Policy Laplace Algorithm

In this section we will present an ℓ1-contractive update policy called ℓ1-DESCENT (Algorithm 3) and use it to obtain a DP Set Union algorithm called POLICY LAPLACE (Algorithm 4).

3.1. ℓ1-DESCENT update policy

The policy is described in Algorithm 3. We will set some cutoff Γ above the threshold ρ to use in the update policy. Once the weight of an item (H[u]) crosses the cutoff, we don t want to increase it further. In this policy, each user starts with a budget of 1. The user uniformly increases H[u] for each u W i s.t. H[u] < Γ. Once some item s weight reaches Γ, the user stops increasing that item and keeps increasing the rest of the items until the budget of 1 is expended. To implement this efﬁciently, the 0 items from each user are sorted based on distance to the cutoff. Beginning with the item whose weight is closest to the cutoff Γ (but still below the cutoff), say item u, we will add Γ H[u] (gap to cutoff for item u) to each of the items below the cutoff. This repeats until the user s budget of 1 has been expended.

This policy can also be interpreted as gradient descent to minimize the ℓ1-distance between the current weighted histogram and the point (Γ, Γ, . . . , Γ), hence the name ℓ1DESCENT. Since the gradient vector is 1 in coordinates where the weight is below cutoff Γ and 0 in coordinates where the weight is Γ, the ℓ1-DESCENT policy is moving in the direction of the gradient until it has moved a total ℓ1-distance of at most 1. 3.2. POLICY LAPLACE

The POLICY LAPLACE algorithm (Algorithm 4) for DPSU uses the framework of the meta algorithm in Algorithm 1 using the update policy in Algorithm 3. Since the added noise is Lap(0, λ), which is centered at 0, we want to set the cutoff Γ in the update policy to be sufﬁciently above the threshold ρ. Thus we pick Γ = ρLap + α λ for some α > 0. From our experiments, choosing α [2, 6] works best empirically. The parameters λ, ρLap are set so as to achieve (ε, δ)-DP as shown in Theorem 3.1.

3.3. Privacy analysis of POLICY LAPLACE

In this section we will prove that the POLICY LAPLACE algorithm (Algorithm 4) is (ε, δ)-DP. By Theorem 1.2 and Theorem 1.1, it is enough to show that ℓ1-DESCENT policy (Algorithm 3) is ℓ1-contractive. For two histograms G1, G2, we write G1 G2 if G1[u] G2[u] for each every item u. G1 G2 is deﬁned similarly.

Lemma 3.1. Let I = {(G1, G2) : G1 G2, G1 G2 ℓ1 1}. Then ℓ1-DESCENT update policy is ℓ1-contractive over the invariant subset I.

Algorithm 3 ℓ1-DESCENT update policy

Input: H: Current histogram W: A subset of U of size at most 0 Γ: cutoff parameter Output: H: Updated histogram // Build cost dictionary G G = {} // Empty dictionary for u W do

if H[u] < Γ then

// Gap to cutoff for items below cutoff Γ G[u] Γ H[u] end if end for budget 1 // Each user gets a total budget of 1 K |G| // Number of items still under cutoff // Sort in increasing order of the gap Γ H[u] G sort(G) // Let u1, u2, . . . , u|G| be the sorted order for j = 1 to |G| do

// Cost of increasing weights of remaining K items by G[uj] cost = G[uj] K if cost budget then

for ℓ= j to |G| do

H[uℓ] H[uℓ] + G[uj] // Gap to cutoff is reduced by G[uj] G[uℓ] G[uℓ] G[uj] end for budget budget - cost // uj has reached cutoff, so decrease K by 1 K K 1 else

for ℓ= j to |G| do

// Update item weights by as much as remaining budget allows H[uℓ] H[uℓ] + budget

K break end for end if end for

Proof. Let φ denote the ℓ1-DESCENT update policy.

We will ﬁrst show property (2) of Deﬁnition 1.3. Let G be any weighted histogram and let G = φ(G). Clearly G G as the new user will never decrease the weight of any item. Moreover, the total change to the histogram is at most 1 in ℓ1-distance. Therefore G G ℓ1 1. Therefore (G , G) I.

We will now prove property (1) of Deﬁnition 1.3. Let (G1, G2) I, i.e., G1 G2 and G1 G2 ℓ1 1. Let G 1 = φ(G1), G 2 = φ(G2). A new user can increase G1 and G2 by at most 1 in ℓ1 distance. Let Γ be the cutoff

Differentially Private Set Union

Algorithm 4 POLICY LAPLACE algorithm for DPSU

Input: D: Database of n users where each user has some subset W U 0: maximum contribution parameter (ε, δ): privacy parameters α: parameter for setting cutoff Output: S: A subset of i Wi λ 1/ε // Noise parameter in Lap(0, λ) // Threshold parameter

ρLap max1 t 0 1 t + 1

ε log 1 2(1 (1 δ)1/t)

Γ ρLap+α λ // Cutoff parameter for update policy Run Algorithm 1 with Noise Lap(0, λ) and the ℓ1-

DESCENT update policy in Algorithm 3 to output S.

parameter in Algorithm 3. Let S be the set of 0 items with the new user, therefore only the items in S will change in G 1, G 2. WLOG, we can assume that the user changes both G1 and G2 by exactly total ℓ1 distance of 1. Otherwise, in at least one of them all the items in S should reach the cutoff Γ. If this happens with G1, then clearly Γ = G 1[u] G 2[u] for all u S. But it is easy to see that if this happens with G2, then it should also happen with G1 in which case G 1[u] = G 2[u] = Γ for u S.

Imagine that at time t = 0, the user starts pushing mass continuously at a rate of 1 to both G1, G2 until the entire mass of 1 is sent, which happens at time t = 1. The mass ﬂow is equally split among all the items which haven t yet crossed cutoff. Let Gt 1 and Gt 2 be the histograms at time t. Therefore G0 i = Gi and G1 i = G i. We claim that Gt 1 Gt 2 implies that d Gt 1[u] dt d Gt 2[u] dt for all u S s.t. Gt 1[u] < Γ. This is because the ﬂow is split equally among items which didn t cross the cutoff, and there can only be more items in Gt 2 which didn t cross the the cutoff when compared to Gt 1. And at time t = 0, we have G0 1 G0 2. Therefore, we have Gt 1 Gt 2 for all t [0, 1] and so G 1 G 2.

We will now prove ℓ1-contraction. Let Ci = Gi G i ℓ1. By the discussion above, C1 C2 (either total mass ﬂow is equal to 1 for both or all items in S will reach cutoff Γ in G1 before this happens in G2).

G 1 G 2 ℓ1 = X

u S G 1[u] X

u S G 2[u] (Since G 1 G 2)

u S G1[u] X

u S G2[u] + C1 C2

u S G1[u] X

u S G2[u] (Since C1 C2)

= G1 G2 ℓ1 (Since G1 G2)

Therefore (G 1, G 2) I which proves property (2) of Deﬁnition 1.3.

We now state a formal theorem which proves (ε, δ) DP of POLICY LAPLACE algorithm3.

Theorem 3.1. The POLICY LAPLACE algorithm (Algorithm 4) is (ε, δ)-DP when

ρLap max 1 t 0 1

1 2 1 (1 δ)1/t

4. Policy Gaussian Algorithm

In this section we will present an ℓ2-contractive update policy called ℓ2-DESCENT (Algorithm 5) and use it to obtain a DP Set Union algorithm called POLICY GAUSSIAN (Algorithm 4).

4.1. ℓ2-DESCENT update policy

Similar to the Laplace update policy, we will set some cutoff Γ above the threshold ρ and once an item s count (H[u]) crosses the cutoff, we don t want to increase it further. In this policy, each user starts with a budget of 1. But now, the total change a user can make to the histogram can be at most 1 when measured in ℓ2-norm (whereas in Laplace update policy we used ℓ1-norm to measure change). In other words, sum of the squares of the changes that the user makes is at most 1. Since we want to get as close to the cutoff (Γ) as possible, the user moves the counts vector (restricted to the set W of 0 items the user has) in the direction of the point (Γ, Γ, . . . , Γ) by an ℓ2-distance of at most 1. This update policy is presented in Algorithm 5.

This policy can also be interpreted as gradient descent to minimize the ℓ2-distance between the current weighted histogram and the point (Γ, Γ, . . . , Γ), hence the name ℓ2DESCENT. Since the gradient vector is in the direction of the line joining the current point and (Γ, Γ, . . . , Γ), the ℓ2DESCENT policy is moving the current histogram towards (Γ, Γ, . . . , Γ) by an ℓ2-distance of at most 1.

4.2. POLICY GAUSSIAN

The POLICY GAUSSIAN algorithm (Algorithm 6) for DPSU uses the framework of the meta algorithm in Algorithm 1 using the Gaussian update policy (Algorithm 5). Since the added noise is N(0, σ2) which is centered at 0, we want to set the cutoff Γ in the update policy to be sufﬁciently above (but not too high above) the threshold ρGauss. Thus we pick Γ = ρGauss + α σ for some α > 0. From our experiments, choosing α [2, 6] empirically yields these best results.

3The proof for this theorem can be found in the supplementary material.

Differentially Private Set Union

Algorithm 5 ℓ2-DESCENT update policy

Input: H: Current histogram W: A subset of U of size at most 0 Γ: cutoff parameter Output: H: Updated histogram G = {} // Empty dictionary for u W do

// G is the vector joining H|W to (Γ, Γ, . . . , Γ) G[u] Γ H[u] end for // ℓ2-distance between H|W and (Γ, Γ, . . . , Γ) Z P

u W G[u]2 1/2

// If Z 1, then the user moves H|W to (Γ, Γ, . . . , Γ). Else, move H|W in the direction of (Γ, Γ, . . . , Γ) by an ℓ2-distance of at most 1 if Z < 1 then

H[u] Γ end for else

H[u] H[u] + G[u]

Z end for end if

The parameters σ, ρGauss are set so as to achieve (ε, δ)-DP as shown in Theorem 4.1. Φ( ) is the cumulative density function of standard Gaussian distribution and Φ 1( ) is its inverse.

Algorithm 6 POLICY GAUSSIAN algorithm for DPSU

Input: D: Database of n users where each user has some subset W U 0: maximum contribution parameter (ε, δ): privacy parameters α: parameter for setting cutoff Output: S: A subset of i Wi // Standard deviation in Gaussian noise σ min σ : Φ 1

2σ εσ eεΦ 1

// Threshold parameter ρGauss max1 t 0 1

t + σΦ 1 1 δ

Γ ρLap+α σ // Cutoff parameter for update policy Run Algorithm 1 with Noise N(0, σ2) and the ℓ2DESCENT update policy in Algorithm 5 to output S.

To ﬁnd min σ : Φ 1

2σ εσ eεΦ 1

2 , one can use binary search because Φ 1

2σ εσ eεΦ 1

2σ εσ is a decreasing function of σ. An efﬁcient and robust implementation of this binary search can be found in (Balle & Wang, 2018).

4.3. Privacy analysis of POLICY GAUSSIAN

In this section we will prove that the POLICY GAUSSIAN algorithm (Algorithm 6) is (ε, δ)-DP. By Theorem 1.2 and Theorem 1.1, it is enough to show ℓ2-contractivity of ℓ2-

DESCENT update policy. We will need a simple plane geometry lemma for this. A proof can be found in the supplementary material.

Lemma 4.1. Let A, B, C denote the vertices of a triangle in the Euclidean plane. If |AB| > 1, let B be the point on the side AB which is at a distance of 1 from B and if |AB| 1, deﬁne B = A. C is deﬁned similarly. Then |B C | |BC|.

Figure 2. Geometric illustration of Lemma 4.1 when |AB|, |AC| > 1. The lemma implies that |B C | |BC|.

Lemma 4.2. Let I = {(G1, G2) : G1 G2 ℓ2 1}. Then the ℓ2-DESCENT update policy is ℓ2-contractive over the invariant set I.

Proof. Let φ denote the ℓ2-DESCENT policy. Property (2) of Deﬁnition 1.3 follows easily because each new user can only change a weighted histogram by an ℓ2-distance of at most 1.

We will now prove Property (1) of Deﬁnition 1.3. Let (G1, G2) I, i.e., G1 G2 ℓ2 1. Let G 1 = φ(G1) and G 2 = φ(G2). A new user can increase G1 and G2 by at most 1 in ℓ2 distance. Let Γ be the cutoff parameter in Algorithm 5. Let S be the set of 0 items with the new user, therefore only the items in S will change in G 1, G 2. Therefore we can just assume that G1, G2 are supported on S for the sake of the analysis. Algorithm 6 moves Gi towards P = (Γ, Γ, . . . , Γ) by an ℓ2-distance of 1 (or to P if the distance to P is already lower than 1). We can restrict ourselves to the plane containing G1, G2, P (G 1, G 2 will also lie on the same plane). Now by Lemma 4.1, G 1 G 2 ℓ2 G1 G2 ℓ2.

We now state a formal theorem which proves (ε, δ) DP of POLICY GAUSSIAN algorithm4.

4The proof for this theorem can be found in the supplementary material.

Differentially Private Set Union

Theorem 4.1. The POLICY GAUSSIAN algorithm (Algorithm 6) is (ε, δ)-DP if σ, ρGauss are chosen s.t.

2σ εσ eεΦ 1

ρGauss max 1 t 0

t + σΦ 1 1 δ

5. Experiments

While the algorithms we described generalize to many domains that involve the release of set union, our experiments will use a natural language dataset. In the context of n-gram release, D is a database of users where each user is associated with 1 or more Reddit posts and Wi is the set of unique n-grams used by each user. The goal is to output as large a subset of n-grams i Wi as possible while providing (ε, δ)-differential privacy to each user. In our experiments we consider n = 1 (unigrams)5.

5.1. Dataset

Our dataset is collected from the subreddit r/Ask Reddit. We take a sample of 15,000 posts from each month between January 2017 and December 2018. We ﬁlter out duplicate entries, removed posts, and deleted authors. For text preprocessing, we remove URLs and symbols, lowercase all words, and tokenize using nltk.word tokenize. After preprocessing, we again ﬁlter out empty posts to arrive at a dataset of 373,983 posts from 223,388 users.

Similar to other natural language datasets, this corpus follows Zipf s law across users. The frequency of unigrams across users is inversely proportional to its rank of the unigram. The distribution of how many unigrams each user uses also follows a long tail distribution. While the top 10 users contribute 850-2000 unique unigrams, most users (93.1%) contribute less than 100 unique unigrams.

5.2. Results

For the problem of outputting the large possible set of unigrams, Table 1 and Figure 3, summarize the performance of DP set union algorithms for different values of 0. The privacy parameters are ε = 3 and δ = exp( 10). For each algorithm, we average the results of 5 different shufﬂes of user ordering and also include the standard deviation in Table 1. We compare our algorithms with baseline algorithms: COUNT LAPLACE, COUNT GAUSSIAN, WEIGHTED LAPLACE, and WEIGHTED GAUSSIAN discussed previously6. Our conclusions are as follows:

5Our code is made available here https://github.com/ heyyjudes/differentially-private-set-union. 6Additional experiments with different α and ε values are included in the supplementary materials

Figure 3. Count of unigrams released by set union algorithms averaged across 5 shufﬂes of user order. Privacy parameters are ε = 3 and δ = exp( 10). The cutoff Γ is calculated using α = 5.

Our new algorithms POLICY LAPLACE and POLICY GAUSSIAN output a DP set union that is 2-4 times larger than output of weighted/count based algorithms. This holds for all values of ε 1.

To put the size of released set in context, we compare our new algorithms against the number of unigrams belonging to at least k users (See Table 2). For POLICY LAPLACE with 0 = 100, the size of the output set covers almost all unigrams (94.8%) when k = 20 and surpasses the size of the output set when k 25. POLICY GAUSSIAN with 0 = 100 covers almost all unigrams (91.8%) when k = 15 and surpasses the size of the output set when k 18. In other words, our algorithms (with ε = 3 and δ = exp( 10)) perform better than k-anonymity based algorithms for values of k around 20.

5.2.1. SELECTING HYPERPARAMETERS WHILE MAINTAINING PRIVACY

As can be seen from Table 1 the 0 resulting in the largest output set varies by algorithm. Since most users in our dataset possess less than 300 unique unigrams, it is not surprising that the largest output set can be achieved with 0 < 300. However, running our algorithms for different values of 0 and selecting the best output will result in a higher value of ε. There are several ways to ﬁnd the best value of 0 (or any other tunable parameter): 1) using prior knowledge of the data 2) running the algorithms on a small sample of the data to ﬁnd the best parameters, and discarding that sample. 3) ﬁnally, one could also run all the algorithms in parallel and choose the best performing one. Here we will have to account for the loss in privacy budget; see (Liu & Talwar, 2019) for example.

Differentially Private Set Union

Table 1. Count of unigrams released by various set union algorithms. Results are averaged across 5 shufﬂes of user order. The best results for each algorithm are in bold. The privacy parameters are ε = 3 and δ = exp( 10). α = 5 is chosen for the cutoff parameter Γ.

0 1 10 50 100 200 300

COUNT LAPLACE 4484 32 3666 7 2199 8 1502 14 882 4 647 4 COUNT GAUSSIAN 3179 15 6616 18 6998 23 6470 12 5492 14 4813 14 WEIGHTED LAPLACE 4479 26 4309 15 4012 10 3875 9 3726 17 3648 12 WEIGHTED GAUSSIAN 3194 11 6591 18 8570 14 8904 24 8996 30 8936 12 POLICY LAPLACE 4482 21 12840 28 15268 10 14739 23 14173 25 13870 23 POLICY GAUSSIAN 3169 13 11010 15 16181 33 16954 58 17113 16 17022 57

Table 2. This table shows the total number of unigrams that at least k users possess (|Sk|) and the percentage coverage of this total by POLICY LAPLACE (|SP L| = 14739) and POLICY GAUSSIAN (|SP G| = 16954) for 0 = 100.

k |Sk| % COVERAGE POLICY LAPLACE % COVERAGE POLICY GAUSSIAN

5 34699 24.5% 48.9% 10 23471 62.8% 72.2% 15 18461 79.8% 91.8% 18 16612 88.7% 102.1% 20 15550 94.8% 109.0% 25 13638 108.1% 124.3%

6. Conclusion

We initiated the study of differentially private set union problem, which has many real-world applications. We designed better algorithms for the problem using the notion of contractive update policy as a guiding principle to preserve privacy. From a set of empirical experiments on a Reddit natural language dataset, we demonstrate that our policy algorithms release a larger set-union than previous algorithms.

One immediate question is to give theoretical guarantees on the size of set union produced by our algorithms. A more interesting and signiﬁcantly challenging question is to design instance optimal algorithms for the problem. Given the ubiquitous nature of this problem, we believe that it is a worthwhile direction to explore.

Abowd, J. M. The challenge of scientiﬁc reproducibility and privacy protection for statistical agencies. Technical report, Census Scientiﬁc Advisory Committee, 2016.

Apple, D. P. T. Learning with privacy at scale. Technical report, Apple, 2017.

Avent, B., Korolova, A., Zeber, D., Hovden, T., and Livshits, B. Blender: enabling local search with a hybrid differential privacy model. In Proc. of the 26th USENIX Security Symposium, pp. 747 764, 2017.

Balle, B. and Wang, Y.-X. Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pp. 403 412, 2018.

Bittau, A., Erlingsson, U., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., Rudominer, M., Kode, U., Tinnes, J., and Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP 17, pp. 441 459, 2017.

Carlini, N., Liu, C., Erlingsson, U., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th {USENIX} Security Symposium ({USENIX} Security 19), pp. 267 284, 2019.

Chen, M. X., Lee, B. N., Bansal, G., Cao, Y., Zhang, S., Lu, J., Tsay, J., Wang, Y., Dai, A. M., Chen, Z., and et al. Gmail smart compose: Real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2287 2295, 2019.

Deb, B., Bailey, P., and Shokouhi, M. Diversifying reply suggestions using a matching-conditional variational autoencoder. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

Differentially Private Set Union

Ding, B., Kulkarni, J., and Yekhanin, S. Collecting telemetry data privately. In Advances in Neural Information Processing Systems, pp. 3574 3583, 2017.

Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265 284. Springer, 2006.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends R in Theoretical Computer Science, 9(3 4):211 407, 2014.

Erlingsson, U., Pihur, V., and Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pp. 1054 1067. ACM, 2014.

Feldman, V., Mironov, I., Talwar, K., and Thakurta, A. Privacy ampliﬁcation by iteration. In Thorup, M. (ed.), 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pp. 521 532. IEEE Computer Society, 2018.

Hu, B., Lu, Z., Li, H., and Chen, Q. Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pp. 2042 2050, 2014.

Kannan, A., Kurach, K., Ravi, S., Kaufmann, T., Tomkins, A., Miklos, B., Corrado, G., Lukacs, L., Ganea, M., Young, P., et al. Smart reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 955 964, 2016.

Korolova, A., Kenthapadi, K., Mishra, N., and Ntoulas, A. Releasing search queries and clicks privately. In Proceedings of the 18th international conference on World wide web, pp. 171 180, 2009.

Kuo, Y.-H., Chiu, C.-C., Kifer, D., Hay, M., and Machanavajjhala, A. Differentially private hierarchical group size estimation. ar Xiv preprint ar Xiv:1804.00370, 2018.

Liu, J. and Talwar, K. Private selection from private candidates. In Charikar, M. and Cohen, E. (eds.), Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pp. 298 309. ACM, 2019. doi: 10.1145/3313276.3316377. URL https://doi. org/10.1145/3313276.3316377.

Mc Sherry, F. and Talwar, K. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007),

October 20-23, 2007, Providence, RI, USA, Proceedings, pp. 94 103. IEEE Computer Society, 2007. doi: 10.1109/FOCS.2007.41. URL https://doi.org/ 10.1109/FOCS.2007.41.

Vadhan, S. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pp. 347 450. Springer, 2017.

Wilson, R., Zhang, C. Y., Lam, W., Desfontaines, D., Simmons-Marengo, D., and Gipson, B. Differentially private SQL with bounded user contribution. 2020.