Differentially private anonymized histograms

Ananda Theertha Suresh
Google Research, New York
theertha@google.com

For a dataset of label-count pairs, an anonymized histogram is the multiset of counts. Anonymized histograms appear in various potentially sensitive contexts such as password-frequency lists, degree distributions in social networks, and estimation of symmetric properties of discrete distributions. Motivated by these applications, we propose the first differentially private mechanism to release anonymized histograms that achieves a near-optimal privacy-utility trade-off both in terms of the number of items and the privacy parameter. Further, if the underlying histogram is given in a compact format, the proposed algorithm runs in time sub-linear in the number of items. For anonymized histograms generated from unknown discrete distributions, we show that the released histogram can be directly used for estimating symmetric properties of the underlying distribution.

1 Introduction

Given a set of labels X, a dataset D is a collection of labels and counts, D := {(x, n_x) : x ∈ X}. An anonymized histogram of such a dataset is the unordered multiset of all non-zero counts, without any label information:

h(D) := {n_x : x ∈ X and n_x > 0}.

For example, if X = {a, b, c, d} and D = {(a, 8), (b, 0), (c, 8), (d, 3)}, then h(D) = {3, 8, 8}. Note that h(D) is a multiset and not a set: duplicates are allowed. Anonymized histograms do not contain any information about the labels, including the cardinality of X. Furthermore, we only consider histograms with positive counts; the results can be extended to histograms that include zero counts.

A histogram can also be represented succinctly using prevalences. For a histogram h, the prevalence φ_r(h) is the number of elements in the histogram with count r,

φ_r(h) := |{x ∈ X : n_x = r}|.

In the above example, φ_3(h) = 1, φ_8(h) = 2, and φ_r(h) = 0 for r ∉ {3, 8}. Anonymized histograms are also referred to as histograms of histograms [1], histogram order statistics [2], profiles [3], unattributed histograms [4], fingerprints [5], and frequency lists [6].

Anonymized histograms appear in several potentially sensitive contexts ranging from password frequency lists to social networks. Before we proceed to the problem formulation and results, we first provide an overview of the various contexts in which anonymized histograms have been studied under differential privacy, and the motivation in each case.

Password frequency lists: Anonymized password histograms are useful to security researchers who wish to understand the underlying password distribution in order to estimate security risks or evaluate various password defenses [7, 6]. For example, if n_(i) is the count of the ith most frequent password, then λ_β = Σ_{i=1}^{β} n_(i) is the number of accounts that an adversary could compromise with β guesses per user. Hence, if the server changes its k-strikes policy from 3 to 5, the frequency distribution can be used to evaluate the security implications of this change. We refer readers to [6, 8] for more uses of password frequency lists. Despite their usefulness, organizations may be wary of publishing these lists due to privacy concerns. This concern is justified, as it is not unreasonable to expect that an adversary has some side information from attacks against other organizations. Motivated by this, [6] studied the problem of releasing anonymized password histograms.
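To make these definitions concrete, here is a minimal sketch (ours, not from the paper; the function names are illustrative) that builds the anonymized histogram h(D), its prevalence form {(r, φ_r)}, and the guessing curve λ_β for a toy dataset.

```python
from collections import Counter

def anonymized_histogram(dataset):
    """Multiset of positive counts h(D), with all label information dropped."""
    return sorted((count for _, count in dataset if count > 0), reverse=True)

def prevalences(histogram):
    """Prevalence representation {r: phi_r} with phi_r > 0."""
    return dict(Counter(histogram))

def lambda_beta(histogram, beta):
    """Accounts compromised with beta guesses per user: sum of the beta largest counts."""
    return sum(sorted(histogram, reverse=True)[:beta])

D = [("a", 8), ("b", 0), ("c", 8), ("d", 3)]
h = anonymized_histogram(D)      # [8, 8, 3]
print(prevalences(h))            # {8: 2, 3: 1}
print(lambda_beta(h, 2))         # 16
```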
Degree distributions in social networks: The degree distribution is one of the most widely studied properties of a graph, as it influences the graph's structure. It can also be used to estimate linear queries on the degree distribution, such as the number of k-stars. However, some graphs may have unique degree distributions, and releasing exact degree distributions is then no safer than naive anonymization, which can leave social-network participants vulnerable to a variety of attacks [9, 10, 11]. Thus releasing them exactly can be revealing. Hence, [12, 4, 13, 14, 15, 16, 17] considered the problem of releasing degree distributions of graphs with differential privacy. Degree distributions are anonymized histograms over the graph node degrees.

Estimating symmetric properties of discrete distributions: Let k := |X|. A discrete distribution p is a mapping from the domain X to [0, 1] such that Σ_x p_x = 1. Given a discrete distribution p over k symbols, a symmetric property is a property that depends only on the multiset of probabilities [18, 19], e.g., entropy (Σ_{x∈X} p_x log(1/p_x)). Other symmetric properties include support size, Rényi entropy, distance to uniformity, and support coverage. Given independent samples from an unknown p, the goal of property estimation is to estimate the value of the symmetric property of interest for p. Estimating symmetric properties of unknown distributions has received wide attention in the recent past, e.g., [5, 18, 20, 21, 22, 19, 23, 24], and has applications in various fields from neuroscience [2] to genetics [25]. Recently, [26] proposed algorithms to estimate support size, support coverage, and entropy with differential privacy. Optimal estimators for symmetric properties depend only on the anonymized histogram of the samples [1, 19]. Hence, releasing anonymized histograms with differential privacy simultaneously yields differentially private plug-in estimators for all symmetric properties.

2 Differential privacy

2.1 Definitions

Before we outline our results, we first define the privacy and utility aspects of anonymized histograms. Privacy has been studied extensively in statistics and computer science; see [27, 28, 29, 30] and references therein. Perhaps the most studied form of privacy is differential privacy (DP) [31, 32], where the objective is to ensure that an adversary cannot infer whether a particular user is present in the dataset or not. We study the problem of releasing anonymized histograms through the lens of global-DP. We begin by defining the notion of DP.

Formally, given a set of datasets H, a notion of neighboring datasets N_H ⊆ H × H, and a query function f : H → Y for some domain Y, a mechanism M : Y → O is said to be ε-DP if for any two neighboring datasets (h1, h2) ∈ N_H and all S ⊆ O,

Pr(M(f(h1)) ∈ S) ≤ e^ε · Pr(M(f(h2)) ∈ S).   (1)

Broadly speaking, ε-DP ensures that given the output, an attacker cannot differentiate between any two neighboring datasets. ε-DP is also called pure-DP and provides stricter guarantees than approximate (ε, δ)-DP, in which equation (1) only needs to hold up to an additive slack of δ. Since its introduction, DP has been studied extensively in applications ranging from dataset release to training machine learning models [33]. It has also been adopted by industry [34]. There are two models of DP. The first is server (also called global or output) DP, where a centralized entity has access to the entire dataset and answers queries in a DP manner.
The second model is local DP, where ε-DP is guaranteed separately for each individual user's data [35, 36, 37, 38, 39]. We study the problem of releasing anonymized histograms under global-DP. Here H is the set of anonymized histograms, f is the identity mapping, and O = H.

2.2 Distance measure

For DP, a general notion of neighbors is as follows: two datasets are neighbors if and only if one can be obtained from the other by adding or removing a user [30]. Since anonymized histograms do not contain explicit user information, we need a few definitions to apply this notion. We first define a notion of distance between label-count datasets. A natural notion of distance between datasets D1 and D2 over X is the ℓ1 distance,

ℓ1(D1, D2) := Σ_{x∈X} |n_x(D1) − n_x(D2)|,

where n_x(D) is the count of x in dataset D. Since anonymized histograms do not contain any information about labels, we define the distance between two histograms h1, h2 as

ℓ1(h1, h2) := min_{D1, D2 : h(D1)=h1, h(D2)=h2} ℓ1(D1, D2).   (2)

The following simple lemma characterizes this distance in terms of counts.

Lemma 1 (Appendix B). For an anonymized histogram h = {n_x}, let n_(i)(h) be the ith highest count (with n_(i)(h) = 0 for i larger than the number of counts in h). For any two anonymized histograms h1, h2,

ℓ1(h1, h2) = Σ_{i≥1} |n_(i)(h1) − n_(i)(h2)|.

This distance is also referred to as the sorted ℓ1 distance or earth mover's distance. With the above definition of distance, we can define neighbors as follows.

Definition 1. Two anonymized histograms h and h′ are neighbors if and only if ℓ1(h, h′) = 1.

This definition of neighboring histograms is the same as the definition of neighbors in previous works on anonymized histograms [4, 6].

3 Previous and new results

3.1 Anonymized histogram estimation

Similar to previous works [6], we measure the utility of the algorithm in terms of the number of items in the anonymized histogram, n := Σ_{n_x∈h} n_x = Σ_{r≥1} φ_r(h)·r.

Previous results: The problem of releasing anonymized histograms was first studied by [12, 4] in the context of degree distributions of graphs. They showed that adding Laplace noise to each count, followed by a post-processing isotonic regression step, results in a histogram H whose expected sorted-ℓ2² error, E[ℓ2²(h, H)] = E[Σ_i (n_(i)(h) − n_(i)(H))²], scales as log³ n / ε² times the number of distinct counts. Their algorithm runs in time O(n). The problem was also considered in the context of password frequency lists by [6]. They observed that an exponential mechanism over integer partitions yields an ε-DP algorithm. Based on this, for ε = Ω(1/√n), they proposed a dynamic-programming relaxation of the exponential mechanism that runs in time O(n^{3/2}) and returns a histogram H such that ℓ1(h, H) = O(√n), with probability 1 − δ. However, the relaxed mechanism is only (ε, δ)-DP. The best information-theoretic lower bound for the ℓ1 utility of any ε-DP mechanism is due to [40], who showed that for ε = Ω(1/n), any ε-DP mechanism has an expected ℓ1 error of Ω(√n/√ε) on some dataset.

New results: Following [6], we study the problem in the ℓ1 metric. We propose a new DP mechanism, PRIVHIST, that satisfies the following.

Theorem 1. Given a histogram in the prevalence form h = {(r, φ_r) : φ_r > 0}, PRIVHIST returns a histogram H and a sum count N such that the pair (H, N) is ε-DP. Furthermore, if ε > 1, then E[ℓ1(h, H)] = O(√n · e^{−cε}) and E[|N − n|] ≤ e^{−cε} for some constant c > 0, with an expected run time of O(√n). If Ω(1/n) ≤ ε ≤ 1, then E[ℓ1(h, H)] = O(√(n/ε) · log(2/ε)) and E[|N − n|] = O(1/ε), with an expected run time sub-linear in n.

Together with the lower bound of [40], this settles the optimal privacy-utility trade-off for ε ∈ [Ω(1/n), 1] up to a multiplicative factor of O(log(2/ε)).
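As a concrete reference for the sorted-ℓ1 metric used in Lemma 1 and in the utility guarantees above, the following sketch (ours, not part of PRIVHIST) computes ℓ1(h1, h2) between two histograms given as lists of counts.

```python
from itertools import zip_longest

def sorted_l1_distance(h1, h2):
    """Sorted-l1 (earth mover's) distance of Lemma 1:
    sum_i |n_(i)(h1) - n_(i)(h2)|, padding the shorter count list with zeros."""
    a = sorted(h1, reverse=True)
    b = sorted(h2, reverse=True)
    return sum(abs(x - y) for x, y in zip_longest(a, b, fillvalue=0))

# Neighboring histograms (Definition 1) are at distance exactly 1.
print(sorted_l1_distance([3, 8, 8], [3, 7, 8]))      # 1
print(sorted_l1_distance([3, 8, 8], [1, 3, 8, 8]))   # 1
```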
We also show that PRIVHIST is near-optimal for ε > 1 by proving the following lower bound.

Theorem 2 (Appendix E). For a given n, let H = {h : n ≤ Σ_r r·φ_r(h) ≤ n + 1}. For any ε-DP mechanism M, there exists a histogram h ∈ H such that E[ℓ1(h, M(h))] ≥ Ω(√n · e^{−2ε}).

Theorems 1 and 2, together with [40], show that the proposed mechanism has near-optimal utility for all ε = Ω(1/n). One could infer the number of items in the dataset from Σ_r r·φ_r(H); however, this estimate is very noisy. Hence, we also return the sum of counts N, which is useful for applications in symmetric property estimation for distributions. Apart from the near-optimal privacy-utility trade-off, PRIVHIST has several other useful properties.

Time complexity: By the Hardy-Ramanujan integer partition theorem [41], the number of anonymized histograms with n items is e^{Θ(√n)}. Hence, they can be represented succinctly using O(√n) space. Recall that any anonymized histogram can be written as {(r, φ_r) : φ_r > 0}, where φ_r is the number of symbols with count r. Let t be the number of distinct counts and let r_1, r_2, ..., r_t be the distinct counts with non-zero prevalences. Then r_i ≥ i, so n ≥ Σ_{i=1}^{t} r_i ≥ t(t+1)/2, and hence there are at most t ≤ √(2n) non-zero prevalences and h can be represented as {(r, φ_r) : φ_r > 0} using O(√n) count-prevalence pairs. Histograms are often stored in this format for space efficiency, e.g., the password frequency lists in [42]. PRIVHIST takes advantage of this succinct representation: given such a representation, it runs in time O(√n), as opposed to the O(n) running time of [12] and the O(n^{3/2}) running time of [6]. This is highly advantageous for large datasets such as password frequency lists with n = 70M data points [6].

Pure vs. approximate differential privacy: The only previously known algorithm with ℓ1 utility of O(√n) is that of [6], and it runs in time O(n^{3/2}). Moreover, their algorithm is (ε, δ)-approximate DP, which is strictly weaker than PRIVHIST, whose output is ε-DP. For applications in social networks it is desirable to have group privacy for large groups [32]. For groups of size k, (ε, δ)-approximate DP scales as (kε, e^{kε}δ)-DP, which can be prohibitive for large values of k. Hence ε-DP is preferable.

Applications to symmetric property estimation: We show that the output of PRIVHIST can be used directly to obtain near-optimal sample-complexity algorithms for estimating symmetric properties of discrete distributions.

3.2 Symmetric property estimation of discrete distributions

For a symmetric property f and an estimator ˆf that uses n samples, let E(ˆf, n) be an upper bound on the worst-case expected error over all distributions p with support size at most k,

E(ˆf, n) := max_{p∈Δ_k} E[|f(p) − ˆf(X^n)|].

Let the sample complexity n(ˆf, α) denote the minimum number of samples such that E(ˆf, n) ≤ α,

n(ˆf, α) := min{n : E(ˆf, n) ≤ α}.

Given samples X^n := X_1, X_2, ..., X_n, let h(X^n) denote the corresponding anonymized histogram. For a symmetric property f, linear estimators of the form Σ_r f(r, n)·φ_r(h(X^n)) are known to be sample-optimal for symmetric properties such as entropy [21], support size [18, 20], support coverage [22], and Rényi entropy [43, 44], where the f(r, n) are distribution-independent coefficients that depend on the property f. Recently, [26] showed that for any given property such as entropy or support size, one can construct DP estimators by adding Laplace noise to the non-private estimator. They further showed that this approach is information-theoretically near-optimal.
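To illustrate the linear-estimator form Σ_r f(r, n)·φ_r discussed above, here is a minimal sketch (ours). The coefficients shown are the naive empirical plug-in for entropy, not the sample-optimal coefficients of [21]; any other coefficient function f(r, n) can be substituted.

```python
import math
from collections import Counter

def sample_prevalences(samples):
    """phi_r = number of symbols appearing exactly r times in the sample."""
    symbol_counts = Counter(samples)          # n_x for each observed symbol x
    return Counter(symbol_counts.values())    # maps r -> phi_r

def linear_estimator(phi, n, f):
    """Generic linear estimator sum_r f(r, n) * phi_r over the anonymized histogram."""
    return sum(f(r, n) * phi_r for r, phi_r in phi.items())

def empirical_entropy_coeff(r, n):
    """Naive plug-in coefficients for entropy: f(r, n) = (r/n) log(n/r)."""
    return (r / n) * math.log(n / r)

samples = ["a", "b", "a", "c", "a", "b"]
phi = sample_prevalences(samples)   # {3: 1, 2: 1, 1: 1}
print(linear_estimator(phi, len(samples), empirical_entropy_coeff))   # ~1.011 nats
```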
Instead of computing a DP estimate of one given property, the output of PRIVHIST can be used directly to estimate any symmetric property. By the post-processing lemma [32], since the output of PRIVHIST is DP, any such estimate is also DP. For an estimator ˆf, let L^n_ˆf be the Lipschitz constant of its coefficients,

L^n_ˆf := max(f(1, n), max_{r≥1} |f(r, n) − f(r + 1, n)|).

If, instead of h(X^n), only a DP histogram H and the DP sum of counts N are available, then ˆf can be modified to

ˆf_dp := Σ_r f(r, N)·φ_r(H),

which is differentially private. Using Theorem 1, we show the following.

Corollary 1 (Appendix F). Let ˆf satisfy L^n_ˆf ≤ n^{β−1} for some β ≤ 0.5, and suppose the error bound E satisfies |E(ˆf, n) − E(ˆf, n + 1)| ≤ n^{β−1}. Let f_max = max_{p∈Δ_k} f(p) and let n(ˆf, α) be the sample complexity of the estimator ˆf. Then n(ˆf_dp, 2α) ≤ max(n(ˆf, α), T), where the additional term T depends on f_max, α, and β but not on the support size k; for ε > 1 it scales with e^{−cε} for some constant c > 0, and for Ω(1/n) ≤ ε ≤ 1 it grows polynomially in 1/ε. Further, by the post-processing lemma, ˆf_dp is also ε-DP.

For entropy (Σ_x p_x log(1/p_x)), normalized support size ((Σ_x 1{p_x ≥ 1/k})/k), and normalized support coverage, there exist sample-optimal linear estimators with β < 0.1 that satisfy |E(ˆf, n) − E(ˆf, n + 1)| ≤ E(ˆf, n)·n^{β−1} [19, 26]. Hence the sample complexity of the proposed algorithm increases by at most a polynomial in 1/α and 1/ε. Furthermore, the increase depends on the maximum value of the property over the distributions of interest and does not explicitly depend on the support size. This result is slightly worse than the property-specific results of [26] in terms of the dependence on α and ε; in particular, for entropy estimation with β = 0.1, the main term in our privacy cost has a slightly worse dependence on α and ε than the corresponding bound of [26]. However, our results are more general in that H can be used with any linear estimator. For example, our algorithm yields DP algorithms for estimating distance to uniformity, which has not been studied before. Furthermore, PRIVHIST can also be combined with the maximum-likelihood estimators of [3, 22, 45] and the linear-programming estimators of [5]; however, we do not provide theoretical guarantees for these combined algorithms.

4 PRIVHIST

In the algorithm description and analysis, we write x̄ for the vector with coordinates x_r, and let φ_{r+}(h) := Σ_{s≥r} φ_s(h) denote the cumulative prevalences. Since anonymized histograms are multisets, we can define the sum of two anonymized histograms as follows: for two histograms h1, h2, the sum h = h1 + h2 is given by φ_r(h) = φ_r(h1) + φ_r(h2) for all r. Furthermore, since there is a one-to-one mapping between histograms in count form h = {n_(i)} and in prevalence form h = {(r, φ_r) : φ_r > 0}, we use the two interchangeably. For ease of analysis, we also use the notion of an improper histogram, in which the φ_r can be negative or non-integral. Finally, for a histogram h^a indexed by a superscript a, we write φ^a := φ(h^a) for ease of notation.

4.1 Approach

Instead of describing the technicalities of the algorithm directly, we first motivate it with a few incorrect or high-error algorithms. Before we proceed, recall that histograms can be written either in terms of prevalences φ_r or in terms of sorted counts n_(i).

An incorrect algorithm: A naive approach would be to add noise only to the non-zero prevalences or counts. However, this is not differentially private. For example, consider two neighboring histograms in prevalence format, h = {φ_1 = 2} and h′ = {φ_1 = 1, φ_2 = 1}.
The output distributions for these two inputs are very different: the output for h never produces a non-zero φ_2, whereas the output for h′ produces a non-zero φ_2 with high probability. Similar counterexamples can be shown for sorted counts.

A high-error algorithm: Instead of adding noise only to the non-zero counts or prevalences, one can add noise to all counts or all prevalences. It can be shown that adding noise to all counts (including those that appear zero times) yields an ℓ1 error of O(n/ε), whereas adding noise to all prevalences can yield an ℓ1 error of O(n²/ε) if we naively use the utility bound in terms of prevalences, equation (3) below. We note that [12] showed that a post-processing step after adding noise to the sorted counts improves the ℓ2 utility; a naive application of the Cauchy-Schwarz inequality yields an ℓ1 error of roughly n^{3/4}/ε for that algorithm. While it might be possible to improve the dependence on n by a tighter analysis, it is not clear whether the dependence on ε can be improved.

The algorithm is given as PRIVHIST. After some computation, it calls one of two sub-routines, PRIVHIST-LOWPRIVACY or PRIVHIST-HIGHPRIVACY, depending on the value of ε. PRIVHIST has two main new ideas: (i) splitting the counts around √n and using prevalences in one regime and counts in the other, and (ii) a smoothing technique used to zero out most of the prevalence vector. Idea (i) is crucial for the computational complexity of the algorithm, and (ii) is crucial for improving the ε-dependence from 1/ε to 1/√ε in the high privacy regime (ε ≤ 1). There are further, more subtle differences, such as using cumulative prevalences instead of the prevalences themselves; we highlight them in the next sections. We now overview the algorithm for the low and high privacy regimes separately.

4.2 Low privacy regime

We first consider the problem in the low privacy regime, ε > 1. We make a few observations.

Geometric mechanism vs. Laplace mechanism: To obtain a DP output of integer data, one can add either Laplace noise or geometric noise [46] (a sketch of the geometric mechanism appears at the end of this overview). For ε-DP, the expected ℓ1 noise added by the Laplace mechanism is 1/ε, which is strictly larger than that of the geometric mechanism, 2e^{−ε}/(1 − e^{−2ε}) (see Appendix A). For ε > 1, we use the geometric mechanism to obtain the optimal trade-off in terms of ε.

Prevalences vs. counts: If we add noise to each coordinate of a d-dimensional vector, the total amount of noise in ℓ1 norm scales linearly in d; hence it is better to add noise to a low-dimensional vector. In the worst case, both the prevalence vector and the count vector can be n-dimensional. Hence, we propose to use prevalences for small counts r ≤ √n and counts for r > √n. This ensures that the dimensionality of the vectors to which we add noise is at most O(√n).

Cumulative prevalences vs. prevalences: The ℓ1 error can be bounded in terms of prevalences as follows (see Appendix B for a proof):

ℓ1(h1, h2) ≤ Σ_{r≥1} |φ_{r+}(h1) − φ_{r+}(h2)| ≤ Σ_{r≥1} r·|φ_r(h1) − φ_r(h2)|.   (3)

If we add noise directly to the prevalences, the ℓ1 error can be very high, as the noise is multiplied by the corresponding count r in (3). The bound in terms of cumulative prevalences φ_{r+} is much tighter. Hence, for small values of r, we use cumulative prevalences instead of the prevalences themselves.

The above observations suggest an algorithm for the low privacy regime. However, there are a few technical difficulties; for example, if we split the counts at a threshold naively, the result is not differentially private. We now describe each of the steps of the low-privacy regime.
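Before walking through the steps, here is a minimal sketch (ours, not the paper's implementation) of the two-sided geometric mechanism referenced above; it adds integer noise with P(Z = z) proportional to e^{−ε|z|} to a sensitivity-1 integer quantity.

```python
import numpy as np

def two_sided_geometric(alpha, rng=None):
    """Sample Z with P(Z = z) proportional to alpha^|z| (alpha = exp(-epsilon)),
    as the difference of two independent geometric random variables."""
    rng = rng or np.random.default_rng()
    return int(rng.geometric(1 - alpha) - rng.geometric(1 - alpha))

def dp_integer_release(value, epsilon, rng=None):
    """epsilon-DP release of a sensitivity-1 integer via the geometric mechanism."""
    return value + two_sided_geometric(np.exp(-epsilon), rng)

print(dp_integer_release(1000, epsilon=2.0))   # e.g. 1000 plus or minus a small integer
```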
(1) Find √n: To divide the histogram into two smaller histograms, we need to know n, which may not be available. Hence, we allot an ε_1 privacy budget to compute a DP value N of n.

(2) Sensitivity-preserving histogram split: If we divide the histogram into two parts based on counts naively and analyze the privacy costs of the small and large parts separately, the sensitivity can be much higher for certain neighboring histograms. For example, consider two neighboring histograms h1 = {φ_T = 1, φ_{n−T} = 1} and h2 = {φ_{T+1} = 1, φ_{n−T−1} = 1}. If we divide h1 into two parts at threshold T, say h^s_1 = {φ_T = 1} and h^ℓ_1 = {φ_{n−T} = 1}, and likewise h^s_2 = {} and h^ℓ_2 = {φ_{T+1} = 1, φ_{n−T−1} = 1}, then ℓ1(h^ℓ_1, h^ℓ_2) = T + 2. Thus, the ℓ1 distance between the separated histograms can be much larger than ℓ1(h1, h2), and we would need to add a lot of noise. Therefore, we perturb φ_T and φ_{T+1} using geometric noise. This ensures DP in instances where the neighboring histograms differ at φ_T and φ_{T+1}, and it does not change the privacy analysis for other types of histograms. However, adding noise may make the histogram improper, as φ_T can become negative. To this end, we add M fake counts at T and T + 1 to ensure that the histogram is proper with high probability; we remove them later in step (L4). We refer readers to Appendix C.2 for the details of this step.

(3, 4) Add noise: Let H^{bs} (small counts) and H^{bℓ} (large counts) be the split histograms. We add noise to the cumulative prevalences in H^{bs} and to the counts in H^{bℓ}, as described in the algorithm overview.

(L1, L2) Post-processing: The noisy versions of φ_{r+} may not satisfy the properties of a valid histogram, i.e., φ_{r+} ≥ φ_{(r+1)+}. To overcome this, we run isotonic regression over the noisy φ_{r+} subject to the monotonicity constraints: given the noisy cumulative prevalences φ_{r+}, find the φ^{mon}_{r+} that minimize Σ_{r≤T} (φ_{r+} − φ^{mon}_{r+})², subject to φ^{mon}_{r+} ≥ φ^{mon}_{(r+1)+} for all r ≤ T. One-dimensional isotonic regression can be solved in time linear in the number of inputs using the pool adjacent violators algorithm (PAVA) or its variants [47, 48] (see the sketch at the end of this subsection); hence the time complexity of this step is O(T) = O(√n). We then round the prevalences to the nearest non-negative integers. We post-process the large counts similarly and remove the fake counts introduced in step (2). Since we use the succinct representation of histograms, with prevalences for counts up to O(√n) and counts otherwise, the expected run time of the algorithm is O(√n) for ε > 1.
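A minimal sketch (ours) of the pool-adjacent-violators step used in the post-processing (L1): it computes the closest non-increasing sequence to the noisy cumulative prevalences in linear time.

```python
def pava_nondecreasing(y):
    """Pool adjacent violators: least-squares non-decreasing fit to y, in O(len(y)) time."""
    blocks = []  # each block is (sum of values, number of values); fitted value = sum / count
    for v in y:
        total, count = float(v), 1
        # Merge with previous blocks while their mean exceeds the current one (a violation).
        while blocks and blocks[-1][0] / blocks[-1][1] > total / count:
            prev_total, prev_count = blocks.pop()
            total, count = total + prev_total, count + prev_count
        blocks.append((total, count))
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

def isotonic_nonincreasing(y):
    """Non-increasing fit, as needed for the noisy cumulative prevalences phi_{r+} in (L1)."""
    return pava_nondecreasing(y[::-1])[::-1]

noisy_cumulative = [5.2, 6.1, 3.0, 3.4, 1.0]
print(isotonic_nonincreasing(noisy_cumulative))   # [5.65, 5.65, 3.2, 3.2, 1.0]
```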
4.3 High privacy regime

For the high privacy regime, ε ≤ 1, all known previous algorithms achieve an error that scales as 1/ε. To reduce the dependence from 1/ε to 1/√ε, we use a smoothing technique that reduces the sensitivity and hence the amount of added noise.

Smoothing method: Recall that the amount of noise added to a vector depends on its dimensionality. Since the prevalence vector has length n, the amount of ℓ1 noise would be O(n/ε). To improve on this, we first smooth the input prevalence vector so that it is non-zero only at a few values of r, and we show that the smoothing also reduces the sensitivity of the cumulative prevalences and hence the amount of noise added. While smoothing is the core idea, two questions remain: how to select the locations of the non-zero values, and how to smooth so as to reduce the sensitivity. We now describe these technical details.

(H1) Approximate high prevalences: Recall that N is obtained by adding geometric noise to n. In the rare case that this geometric noise is very negative, there can be counts much larger than 2N, which can affect the smoothing step. To overcome this, we move all counts above 2N to 2N. Since this changes the histogram only with low probability, it does not affect the ℓ1 error.

(H2) Compute boundaries: We find a set of boundaries S and smooth the counts to elements of S. Ideally, we would like a boundary close to every count. For small values of r, we ensure this by including all such counts in S, so there is no smoothing. For intermediate values of r, we use boundaries that are uniform in log-count space. However, using log-uniform boundaries for all values of r results in an additional log n factor; to overcome this, for large r we use the noisy large counts from step (4) to place the boundaries, ensuring that there is a boundary close to every large count.

(H3) Smooth prevalences: The main ingredient in proving better utility in the high privacy regime is the smoothing technique, which we now describe with an example (a code sketch appears at the end of this subsection). Suppose that all histograms have non-zero prevalences only between counts s and s + t, and consider two neighboring histograms h1 and h2 as follows: φ¹_r = 1 and φ¹_i = 0 for all i ∈ [s, s + t] \ {r}; similarly, φ²_{r+1} = 1 and φ²_i = 0 for all i ∈ [s, s + t] \ {r + 1}. If we release the prevalences or cumulative prevalences directly, we add ℓ1 noise of O(1/ε) to each prevalence in [s, s + t], so the ℓ1 norm of the noise is O(t/ε). We propose to reduce this noise by smoothing the prevalences. For r ∈ [s, s + t], we divide φ_r between φ_s and φ_{s+t}: we assign an (s + t − r)/t fraction of φ_r to φ_s and the remainder to φ_{s+t}. After this transformation, the first histogram becomes h^{t1} with φ^{t1}_s = (s + t − r)/t, φ^{t1}_{s+t} = (r − s)/t, and all other prevalences zero. Similarly, the second histogram becomes h^{t2} with φ^{t2}_s = (s + t − r − 1)/t, φ^{t2}_{s+t} = (r + 1 − s)/t, and all other prevalences zero. Thus, after smoothing, the prevalences differ only at the two locations s and s + t, and they differ by at most 1/t. The total amount of noise needed for a DP release of these two prevalences is therefore O(1/(tε)). However, smoothing also incurs a loss, which can be shown to be O(t). Hence the total error is O(1/(tε) + t); choosing t = 1/√ε yields a total error of O(1/√ε). The above analysis is for a toy example; extending it to general histograms requires additional work. In particular, we need to find the smoothing boundaries that give the best utility. As described above, we choose boundaries based on a logarithmic partition of the counts and on private values of the large counts. The utility analysis with this choice of boundaries is in Appendix D.2.

(H4) Add small noise: Since the prevalences are smoothed, we add only a small amount of noise to the corresponding cumulative prevalences: to φ_{s_i+}, we add Lap(1/((s_i − s_{i−1})ε_3)) noise to obtain ε_3-DP.

(H5) Post-processing: Finally, we post-process the prevalences as in (L1) to impose monotonicity and ensure that the resulting prevalences are non-negative integers. Since we use the succinct histogram representation, ensure that the size of S is small, and use only counts larger than O(√(nε)) to place boundaries, the expected run time is sub-linear in n.
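To make the smoothing step (H3) concrete, here is a minimal sketch (ours, with illustrative names) that redistributes each prevalence to the two nearest boundaries in S in proportion to its distance from them, exactly as in the toy example above; it assumes every count lies between the smallest and largest boundary.

```python
import bisect

def smooth_prevalences(phi, boundaries):
    """Smoothing step (H3): move the mass of each prevalence phi_r to the two nearest
    boundaries s <= r <= s', splitting it in proportion to r's distance from each.
    Assumes min(boundaries) <= r <= max(boundaries) for every count r in phi."""
    boundaries = sorted(boundaries)
    smoothed = {s: 0.0 for s in boundaries}
    for r, phi_r in phi.items():
        j = bisect.bisect_left(boundaries, r)
        if boundaries[j] == r:                    # r is itself a boundary: no smoothing
            smoothed[r] += phi_r
            continue
        lo, hi = boundaries[j - 1], boundaries[j]
        upper_fraction = (r - lo) / (hi - lo)     # fraction of phi_r sent to the upper boundary
        smoothed[lo] += phi_r * (1 - upper_fraction)
        smoothed[hi] += phi_r * upper_fraction
    return smoothed

# Toy example from the text: one unit of prevalence at r = 7 between boundaries 5 and 10
# is split as (s + t - r)/t = 0.6 to phi_5 and the remaining 0.4 to phi_10.
print(smooth_prevalences({7: 1}, boundaries=[5, 10]))   # {5: 0.6, 10: 0.4}
```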
Privacy budget allocation: We allocate a privacy budget of ε_1 to estimate n, ε_2 to the rest of PRIVHIST, and ε_3 to PRIVHIST-HIGHPRIVACY. We set ε_1 = ε_2 = ε_3 = ε/3 in our algorithms. There is no particular reason for ε_1, ε_2, and ε_3 to be equal; we chose these values for simplicity and readability. For example, since ε_1 is only used to estimate n, the analysis of the algorithm shows that ε_2 and ε_3 affect the utility more than ε_1. Hence, we can set ε_2 = ε_3 = (1 − o(1))ε/2 and ε_1 = o(ε) to obtain better practical results. Furthermore, in the low privacy regime the algorithm only uses a privacy budget of ε_1 + ε_2, so we can set ε_1 = o(ε), ε_2 = (1 − o(1))ε, and ε_3 = 0.

5 Acknowledgments

The authors thank Jayadev Acharya and Alex Kulesza for helpful comments and discussions.

Algorithm PRIVHIST
Input: anonymized histogram h in prevalence form, i.e., {(r, φ_r) : φ_r > 0}; privacy cost ε.
Parameters: ε_1 = ε_2 = ε_3 = ε/3.
Output: DP anonymized histogram H and N (an estimate of n).
1. DP value of the total sum: N = max(Σ_{n_x∈h} n_x + Z_a, 0), where Z_a ~ Geo(e^{−ε_1}). If N = 0, output the empty histogram and N; otherwise continue.
2. Split h: Let T = ⌈√(N·min(ε, 1))⌉ and M = max(⌈2 log N/ε_2⌉, 1).
(a) H^a: φ^a_T = φ_T + M, φ^a_{T+1} = φ_{T+1} + M, and φ^a_r = φ_r for all r ∉ {T, T + 1}.
(b) H^b: φ^b_{T+1} = φ^a_{T+1} + Z_b, φ^b_T = φ^a_T − Z_b, and φ^b_r = φ^a_r for all r ∉ {T, T + 1}, where Z_b ~ Geo(e^{−ε_2}).
(c) Divide H^b into two histograms H^{bs} (counts at most T) and H^{bℓ} (counts larger than T), clipping negative prevalences at zero.
3. DP value of H^{bs}: Let Z^{cs}_r ~ Geo(e^{−ε_2}) i.i.d. and let H^{cs} be given by the noisy cumulative prevalences φ^{cs}_{r+} = φ^{bs}_{r+} + Z^{cs}_r for r ≤ T.
4. DP value of H^{bℓ}: Let Z^{cℓ}_i ~ Geo(e^{−ε_2}) i.i.d. and let H^{cℓ} be given by the noisy counts N^{cℓ}_i = N_(i) + Z^{cℓ}_i for N_(i) ∈ H^{bℓ}.
5. If ε > 1, output PRIVHIST-LOWPRIVACY(H^{cs}, H^{cℓ}, T, M) and N.
6. If ε ≤ 1, output PRIVHIST-HIGHPRIVACY(h, H^{cℓ}, T, N, ε_3) and N.

Algorithm PRIVHIST-LOWPRIVACY
Input: low-count histogram H^{cs}, high-count histogram H^{cℓ}, T, M.
Output: a histogram H.
L1. Post-processing of H^{cs}:
(a) Find φ^{mon} that minimizes Σ_{r≤T} (φ^{mon}_{r+} − φ_{r+}(H^{cs}))² subject to φ^{mon}_{r+} ≥ φ^{mon}_{(r+1)+} for all r.
(b) H^{ds}: for all r, φ^{ds}_{r+} = round(max(φ^{mon}_{r+}, 0)).
L2. Post-processing of H^{cℓ}: compute H^{dℓ} = {max(N_i(H^{cℓ}), T) for all i}.
L3. Let H^d = H^{ds} + H^{dℓ}.
L4. Compute H^e by removing the M elements closest to T + 1 from H^d and then removing the M elements closest to T; output H^e.

Algorithm PRIVHIST-HIGHPRIVACY
Input: non-private histogram h, high-count histogram H^{cℓ}, T, N, ε_3.
Output: a histogram H.
H1. Approximate higher prevalences: for r < 2N, φ^u_r = φ_r(h), and φ^u_{2N} = φ_{2N+}(h).
H2. Compute boundaries: let the set S be defined as follows:
(a) Set a count threshold T′ depending on N and ε_3.
(b) S = {1, 2, ..., T} ∪ {⌊T(1 + q)^i⌋ : i ≤ log_{1+q}(T′/T)} ∪ {N_x : N_x ∈ H^{cℓ}, N_x ≥ T′} ∪ {2N}, for a granularity parameter q.
H3. Smooth prevalences: let s_i denote the ith smallest element of S. Obtain H^v by distributing each φ^u_r with r between two consecutive boundaries to those boundaries in proportion to r's distance from them (as in the example of Section 4.3), and setting φ^v_j = 0 for j ∉ S.
H4. DP value of H^v: for each s_i ∈ S, φ^w_{s_i+} = φ^v_{s_i+} + Z_{s_i}, where Z_{s_i} ~ Lap(1/(ε_3(s_i − s_{i−1}))).
H5. Find H^x that minimizes Σ_{s_i∈S} (φ^x_{s_i+} − φ^w_{s_i+})²(s_i − s_{i−1})² subject to the monotonicity constraints on φ^x_{s_i+}.
H6. Return H^y given by φ^y_{r+} = round(max(φ^x_{r+}, 0)) for all r.

References

[1] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 259–269. IEEE, 2000.
[2] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
[3] Alon Orlitsky, Narayana P Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 426–435. AUAI Press, 2004.
[4] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021–1032, 2010.
[5] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 685–694. ACM, 2011.
[6] Jeremiah Blocki, Anupam Datta, and Joseph Bonneau. Differentially private password frequency lists. In NDSS, volume 16, page 153, 2016.
[7] Joseph Bonneau. The science of guessing: analyzing an anonymized corpus of 70 million passwords. In Security and Privacy (SP), 2012 IEEE Symposium on, pages 538–552. IEEE, 2012.
[8] Jeremiah Blocki, Benjamin Harsha, and Samson Zhou. On the economics of offline password cracking. In 2018 IEEE Symposium on Security and Privacy (SP), pages 853–871. IEEE, 2018.
[9] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web, pages 181–190. ACM, 2007.
[10] Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis. Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment, 1(1):102–114, 2008.
[11] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In 2009 30th IEEE Symposium on Security and Privacy, pages 173–187. IEEE, 2009.
[12] Michael Hay, Chao Li, Gerome Miklau, and David Jensen. Accurate estimation of the degree distribution of private networks. In Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on, pages 169–178. IEEE, 2009.
[13] Vishesh Karwa and Aleksandra B Slavković. Differentially private graphical degree sequences and synthetic graphs. In International Conference on Privacy in Statistical Databases, pages 273–285. Springer, 2012.
[14] Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Analyzing graphs with node differential privacy. In Theory of Cryptography, pages 457–476. Springer, 2013.
[15] Sofya Raskhodnikova and Adam Smith. Efficient Lipschitz extensions for high-dimensional graph statistics and node private degree distributions. In FOCS, 2016.
[16] Jeremiah Blocki. Differentially private integer partitions and their applications. 2016.
[17] Wei-Yen Day, Ninghui Li, and Min Lyu. Publishing graph degree distribution with node differential privacy. In Proceedings of the 2016 International Conference on Management of Data, pages 123–138. ACM, 2016.
[18] Gregory Valiant and Paul Valiant. The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 403–412. IEEE, 2011.
[19] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11–21, 2017.
[20] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.
[21] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.
[22] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.
[23] Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pages 8834–8843, 2018.
[24] Yi Hao and Alon Orlitsky. Data amplification: Instance-optimal property estimation. arXiv preprint arXiv:1903.01432, 2019.
[25] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7:13293, 2016.
[26] Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. Inspectre: Privately estimating the unseen. In International Conference on Machine Learning, pages 30–39, 2018.
[27] Tore Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift, 15(429-444):2–1, 1977.
[28] Nabil R Adam and John C Worthmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys (CSUR), 21(4):515–556, 1989.
[29] Dakshi Agrawal and Charu C Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 247–255. ACM, 2001.
[30] Cynthia Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.
[31] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[32] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[33] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
[34] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.
[35] Stanley L Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
[36] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
[37] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 429–438. IEEE, 2013.
[38] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems, pages 2879–2887, 2014.
[39] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In AISTATS, 2019.
[40] Francesco Aldà and Hans Ulrich Simon. A lower bound on the release of differentially private integer partitions. Information Processing Letters, 129:1–4, 2018.
[41] Godfrey H Hardy and Srinivasa Ramanujan. Asymptotic formulæ in combinatory analysis. Proceedings of the London Mathematical Society, 2(1):75–115, 1918.
[42] Joseph Bonneau. Yahoo password frequency corpus, Dec 2015.
[43] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1855–1869. SIAM, 2014.
[44] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory, 63(1):38–56, 2016.
[45] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. arXiv preprint arXiv:1906.03794, 2019.
[46] Arpita Ghosh, Tim Roughgarden, and Mukund Sundararajan. Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing, 41(6):1673–1693, 2012.
[47] Richard E Barlow, David J Bartholomew, J M Bremner, and H Daniel Brunk. Statistical inference under order restrictions: The theory and application of isotonic regression. Technical report, Wiley, New York, 1972.
[48] Patrick Mair, Kurt Hornik, and Jan de Leeuw. Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32(5):1–24, 2009.