INSPECTRE: Privately Estimating the Unseen

Jayadev Acharya * 1, Gautam Kamath * 2, Ziteng Sun * 1, Huanyu Zhang * 1

Abstract

We develop differentially private methods for estimating various distributional properties. Given a sample from a discrete distribution p, some functional f, and accuracy and privacy parameters α and ε, the goal is to estimate f(p) up to accuracy α, while maintaining ε-differential privacy of the sample. We prove almost-tight bounds on the sample size required for this problem for several functionals of interest, including support size, support coverage, and entropy. We show that the cost of privacy is negligible in a variety of settings, both theoretically and experimentally. Our methods are based on a sensitivity analysis of several state-of-the-art methods for estimating these properties with sublinear sample complexities.

1. Introduction

How can we infer a distribution given a sample from it? If data is in abundance, the solution may be simple: the empirical distribution will approximate the true distribution. However, challenges arise when data is scarce in comparison to the size of the domain, and especially when we wish to quantify rare events. This is frequently the case: for example, it has recently been observed that there are several very rare genetic mutations which occur in humans, and we wish to know how many such mutations exist (Keinan & Clark, 2012; Tennessen et al., 2012; Nelson et al., 2012). Many of these mutations have only been seen once, and we can infer that there are many which have not been seen at all.
Over the last decade, a large body of work has focused on developing theoretically sound and effective tools for such settings (see (Orlitsky et al., 2016) and references therein), including the problem of estimating the frequency distribution of rare genetic variations (Zou et al., 2016). However, in many settings where one wishes to perform statistical inference, data may contain sensitive information about individuals. For example, in medical studies, the data may contain individuals' health records, including whether they carry some disease which bears a social stigma. Alternatively, one can consider a map application which suggests routes based on aggregate positions of individuals, which contains delicate information including users' residence data. In these settings, it is critical that our methods protect sensitive information contained in the dataset. This does not preclude our overall goals of statistical analysis, as we are trying to infer properties of the population p, and not the samples which are drawn from said population. That said, without careful experimental design, published statistical findings may be prone to leaking sensitive information about the sample. As a notable example, it was recently shown that one can determine the identity of some individuals who participated in genome-wide association studies (Homer et al., 2008). This realization has motivated a surge of interest in developing data sharing techniques with an explicit focus on maintaining privacy of the data (Johnson & Shmatikov, 2013; Uhler et al., 2013; Yu et al., 2014; Simmons et al., 2016).

*Equal contribution. 1 ECE, Cornell University, Ithaca, New York, USA. 2 EECS & CSAIL, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. Correspondence to: Jayadev Acharya, Gautam Kamath, Ziteng Sun, Huanyu Zhang. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
Privacy-preserving computation has enjoyed significant study in a number of fields, including statistics and almost every branch of computer science, including cryptography, machine learning, algorithms, and database theory; see, e.g., (Dalenius, 1977; Adam & Worthmann, 1989; Agrawal & Aggarwal, 2001; Dinur & Nissim, 2003; Dwork, 2008; Dwork & Roth, 2014) and references therein. Perhaps the most celebrated notion of privacy, proposed by theoretical computer scientists, is differential privacy (Dwork et al., 2006). Informally, an algorithm is differentially private if its outputs on neighboring datasets (differing in a single element) are statistically close (for a more precise definition, see Section 2). Differential privacy has become the standard for theoretically-sound data privacy, leading to its adoption by several large technology companies, including Google and Apple (Erlingsson et al., 2014; Differential Privacy Team, Apple, 2017).

Our focus in this paper is to develop tools for privately performing several distribution property estimation tasks. In particular, we study the tradeoff between statistical accuracy, privacy, and error rate in the sample size. Our model is that we are given sample access to some unknown discrete distribution p, over a domain of size k, which is possibly unknown in some tasks. We wish to estimate the following properties:

Support Coverage: If we take m samples from the distribution, what is the expected number of unique elements we expect to see?

Support Size: How many elements of the support have non-zero probability?

Entropy: What is the Shannon entropy of the distribution?

For more formal statements of these problems, see Section 2.1. We require that our output is α-accurate, satisfies (ε, 0)-differential privacy, and is correct with probability 1 − β. The goal is to give an algorithm with minimal sample complexity n, while simultaneously being computationally efficient.
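For concreteness, all three quantities above can be computed exactly when p is known. The following sketch is our own illustration (not from the paper), using a made-up four-symbol distribution, and evaluates each definition directly:

```python
import math

# Hypothetical toy distribution over 4 symbols (not from the paper).
p = [0.5, 0.25, 0.125, 0.125]

def support_size(p):
    """S(p): number of symbols with non-zero probability."""
    return sum(1 for px in p if px > 0)

def support_coverage(p, m):
    """S_m(p): expected number of distinct symbols in m i.i.d. draws."""
    return sum(1 - (1 - px) ** m for px in p if px > 0)

def entropy(p):
    """H(p): Shannon entropy, in nats."""
    return sum(px * math.log(1 / px) for px in p if px > 0)

print(support_size(p))         # 4
print(support_coverage(p, 10)) # ≈ 3.42
print(entropy(p))              # ≈ 1.2130 nats (1.75 bits)
```

The estimation problems studied in the paper are exactly these computations, but performed from a finite sample of p rather than from p itself.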
Theoretical Results. Our main results show that privacy can be achieved for all these problems at a very low cost. For example, if one wishes to privately estimate entropy, this incurs an additional additive cost in the sample complexity which is very close to linear in 1/αε. We draw attention to two features of this bound. First, it is independent of k. All the problems we consider have complexity Θ(k/log k), so in the primary regime of study where k ≫ 1/αε, this small additive cost is dwarfed by the inherent sample complexity of the non-private problem. Second, the bound is almost linear in 1/αε. We note that privately performing even the most basic statistical task, estimating the bias of a coin, incurs this linear dependence. Surprisingly, we show that much more sophisticated inference tasks can be privatized at almost no cost. In particular, these properties imply that the additive cost of privacy is o(1) in the most studied regime, where the support size is large. In general this is not the case: for many other problems, including distribution estimation and hypothesis testing, the additional cost of privacy depends significantly on the support size or dimension (Diakonikolas et al., 2015; Cai et al., 2017; Acharya et al., 2017c; Aliakbarpour et al., 2017). We also provide lower bounds, showing that our upper bounds are almost tight. A more formal statement of our results appears in Section 3.

Experimental Results. We demonstrate the efficacy of our method with experimental evaluations. As a baseline, we compare with the non-private algorithms of (Orlitsky et al., 2016) and (Wu & Yang, 2018). Overall, we find that our algorithms' performance is nearly identical, showing that, in many cases, privacy comes (essentially) for free. We begin with an evaluation on synthetic data. Then, inspired by (Valiant & Valiant, 2013; Orlitsky et al., 2016), we analyze a text corpus consisting of the words of Hamlet, in order to estimate the number of unique words which occur.
Finally, we investigate name frequencies in the US census data. This setting has been previously considered by (Orlitsky et al., 2016), but we emphasize that this is an application where private statistical analysis is critical, as evidenced by the efforts of the US Census Bureau to incorporate differential privacy into the 2020 US census (Dajani et al., 2017).

Techniques. Our approach works by choosing statistics for these tasks which possess bounded sensitivity, which is well-known to imply privacy under the Laplace or Gaussian mechanism. We note that bounded sensitivity of statistics is not something that can always be taken for granted. Indeed, for many fundamental tasks, optimal algorithms for the non-private setting may be highly sensitive, thus necessitating crucial modifications to obtain differential privacy (Acharya et al., 2015; Cai et al., 2017). Thus, careful choice and design of statistics must be a priority when performing inference with privacy considerations. To this end, we leverage recent results of (Acharya et al., 2017a), which studies estimators for non-private versions of the problems we consider. The main technical work in their paper exploits bounded sensitivity to show sharp cutoff-style concentration bounds for certain estimators, which operate using the principle of best-polynomial approximation. They use these results to show that a single algorithm, the Profile Maximum Likelihood (PML), can estimate all these properties simultaneously. We, on the other hand, consider the sensitivity of these estimators for purposes of privacy; the same property is thus utilized by both works for very different purposes, a connection which may be of independent interest. We note that bounded sensitivity of a statistic may be exploited for purposes other than privacy.
For instance, by McDiarmid's inequality, any such statistic also enjoys very sharp concentration of measure, implying that one can boost the success probability of the test at an additive cost which is logarithmic in the inverse of the failure probability. One may naturally conjecture that, if a statistical task is based on a primitive which concentrates in this sense, then it may also be privatized at a low cost. However, this is not true: estimating a discrete distribution in ℓ1 distance is such a task, but the cost of privatization depends significantly on the support size (Diakonikolas et al., 2015). One can observe that, algorithmically, our method is quite simple: compute the non-private statistic, and add a relatively small amount of Laplace noise. The non-private statistics have recently been demonstrated to be practical (Orlitsky et al., 2016; Wu & Yang, 2018), and the additional cost of the Laplace mechanism is minimal. This is in contrast to several differentially private algorithms which incur significant overhead in the quest for privacy. Our algorithms attain almost-optimal rates (optimal up to constant factors for most parameter regimes of interest), while simultaneously operating effectively in practice, as demonstrated in our experimental results.

Related Work. Over the last decade, there has been a flurry of work by the computer science and information theory communities on the problems we study in this paper, including Shannon and Rényi entropy estimation (Paninski, 2003; Valiant & Valiant, 2017; Jiao et al., 2017; Acharya et al., 2017b; Obremski & Skorski, 2017; Wu & Yang, 2018), and support coverage and support size estimation (Orlitsky et al., 2016; Wu & Yang, 2018). A recent paper studies the general problem of estimating functionals of a discrete distribution from samples in terms of the smoothness of the functional (Fukuchi & Sakuma, 2017).
These have culminated in a nearly-complete understanding of the sample complexity of these properties, with optimal sample complexities (up to constant factors) for most parameter regimes. Recently, there has been significant interest in performing statistical tasks under differential privacy constraints. Perhaps most relevant to this work are (Cai et al., 2017; Acharya et al., 2017c; Aliakbarpour et al., 2017), which study the sample complexity of differentially privately performing classical distribution testing problems, including identity and closeness testing. Other works investigating private hypothesis testing include (Wang et al., 2015a; Gaboardi et al., 2016; Kifer & Rogers, 2017; Kakizaki et al., 2017; Rogers, 2017; Gaboardi & Rogers, 2017), which focus less on characterizing the finite-sample guarantees of such tests, and more on understanding their asymptotic properties and applications to computing p-values. There has also been study of private distribution learning (Diakonikolas et al., 2015; Duchi et al., 2017; Karwa & Vadhan, 2018; Acharya et al., 2018; Kamath et al., 2018), in which we wish to estimate parameters of the distribution, rather than just a particular property of interest. A number of other problems have been studied with privacy requirements, including clustering (Wang et al., 2015b; Balcan et al., 2017), principal component analysis (Chaudhuri et al., 2013; Kapralov & Talwar, 2013; Hardt & Price, 2014), ordinary least squares (Sheffet, 2017), and much more.

2. Preliminaries

We will start with some definitions. Let Δ def= {(p(1), p(2), . . .) : p(i) ≥ 0, Σ_i p(i) = 1} be the set of discrete distributions over a countable support, and let Δ_k be the set of distributions in Δ with at most k non-zero probability values. A property f(p) is a mapping from Δ to ℝ. We now describe the classical distribution property estimation problem, and then state the problem under differential privacy.

Property Estimation.
Given α, β, f, and independent samples X_1^n from an unknown distribution p, design an estimator f̂ : X_1^n → ℝ such that with probability at least 1 − β, |f̂(X_1^n) − f(p)| < α. The sample complexity of f̂,

  C_f̂(f, α, β) def= min{ n : Pr( |f̂(X_1^n) − f(p)| > α ) < β },

is the smallest number of samples needed to estimate f to accuracy α with error probability β. We study the problem for β = 1/3; by the median trick, we can boost the success probability to 1 − β at the cost of a multiplicative factor of log(1/β) more samples. Therefore, focusing on β = 1/3, we define C_f̂(f, α) def= C_f̂(f, α, 1/3). The sample complexity of estimating a property f(p) is the minimum sample complexity over all estimators: C(f, α) = min_f̂ C_f̂(f, α).

An estimator f̂ is ε-differentially private (DP) (Dwork et al., 2006) if for any X_1^n and Y_1^n with d_ham(X_1^n, Y_1^n) ≤ 1, Pr( f̂(X_1^n) ∈ S ) ≤ e^ε · Pr( f̂(Y_1^n) ∈ S ) for all measurable S.

Private Property Estimation. Given α, ε, β, f, and independent samples X_1^n from an unknown distribution p, design an ε-differentially private estimator f̂ : X_1^n → ℝ such that with probability at least 1 − β, |f̂(X_1^n) − f(p)| < α. Similar to the non-private setting, the sample complexity of the ε-differentially private estimation problem is C(f, α, ε) = min_{f̂ : f̂ is ε-DP} C_f̂(f, α, 1/3), the smallest number of samples n for which there exists an ε-DP estimator with accuracy α and error probability at most 1/3.

In their original paper, (Dwork et al., 2006) provide a scheme for differential privacy, known as the Laplace mechanism. This method adds Laplace noise to a non-private scheme in order to make it private. We first define the sensitivity of an estimator, and then state their result in our setting.

Definition 1. The sensitivity of an estimator f̂ : [k]^n → ℝ is Δ_{n,f̂} def= max_{d_ham(X_1^n, Y_1^n) ≤ 1} |f̂(X_1^n) − f̂(Y_1^n)|. Let D_f̂(α, ε) def= min{ n : Δ_{n,f̂} ≤ αε }.

Lemma 1. C(f, α, ε) = O( min_f̂ { C_f̂(f, α/2) + D_f̂(α/4, ε) } ).

Proof.
(Dwork et al., 2006) showed that for a function with sensitivity Δ_{n,f̂}, adding Laplace noise X ~ Lap(Δ_{n,f̂}/ε) makes the output ε-differentially private. By the definition of D_f̂(α/4, ε), the Laplace noise we add has parameter at most α/4. Recall that the probability density function of Lap(b) is (1/2b)·e^{−|x|/b}; hence we have Pr( |X| > α/2 ) < 1/e². By the union bound, we get an additive error larger than α/2 + α/2 = α with probability at most 1/3 + 1/e² < 0.5. Hence, with the median trick, we can boost the error probability to 1/3, at the cost of a constant factor in the number of samples.

To prove sample complexity lower bounds for differentially private estimators, we observe that the estimator can be used to test between two distributions with distinct property values; hence estimation is at least as hard as testing. For lower bounds on differentially private testing, (Acharya et al., 2017c) give the following argument based on coupling:

Lemma 2. Suppose there is a coupling between distributions p and q over 𝒳^n such that E[ d_ham(X_1^n, Y_1^n) ] ≤ D. Then any ε-differentially private algorithm that distinguishes between p and q with error probability at most 1/3 must satisfy D = Ω(1/ε).

2.1. Problems of Interest

Support Size. The support size of a distribution p is S(p) = |{x : p(x) > 0}|, the number of symbols with non-zero probability. However, estimating S(p) from samples can be hard due to the presence of symbols with negligible, yet non-zero, probabilities. To circumvent this issue, (Raskhodnikova et al., 2009) proposed to study the problem when the smallest probability is bounded. Let Δ_{≥1/k} def= {p : p(x) ∈ {0} ∪ [1/k, 1]} be the set of all distributions in which every non-zero probability has value at least 1/k. For p ∈ Δ_{≥1/k}, our goal is to estimate S(p) up to ±αk with the least number of samples from p.

Support Coverage.
For a distribution p and an integer m, let S_m(p) = Σ_x (1 − (1 − p(x))^m) be the expected number of distinct symbols that appear in m independent samples from p. The objective is to find the least number of samples n needed to estimate S_m(p) to an additive ±αm. Support coverage arises in many ecological and biological studies (Colwell et al., 2012), where one wishes to quantify the number of new elements (gene mutations, species, words, etc.) that can be expected to be seen in the future. Good and Toulmin (Good & Toulmin, 1956) proposed an estimator which, for any constant α, requires only m/2 samples to estimate S_m(p).

Entropy. The Shannon entropy of a distribution p is H(p) = Σ_x p(x) log(1/p(x)). H(p) is a central object in information theory (Cover & Thomas, 2006), and also arises in many fields such as machine learning (Nowozin, 2012) and neuroscience (Berry et al., 1997; Nemenman et al., 2004). Estimating H(p) is hard with any finite number of samples due to the possibility of infinite support. To circumvent this, a natural approach is to consider distributions in Δ_k, the set of all discrete distributions over at most k symbols; the goal is then to estimate the entropy of a distribution in Δ_k to an additive ±α.

3. Statement of Results

Our theoretical results for estimating support coverage, support size, and entropy are given below. Algorithms for these problems and proofs of these statements are provided in Section 4. Our experimental results are described and discussed in Section 5.

Theorem 1. The sample complexity of support coverage estimation C(S_m, α, ε) is
  O( m log(1/α)/log m + m log(1/α)/log(2 + εm) ),  when m ≥ 1/αε;
  O( 1/α² + 1/αε ),  when 1/α ≤ m ≤ 1/αε;
  O( m² + m/ε ),  when m ≤ 1/α.
Furthermore, C(S_m, α, ε) = Ω( m log(1/α)/log m + 1/αε ).

Theorem 2. The sample complexity of support size estimation C(S, α, ε) is
  O( k log²(1/α)/log k + k log²(1/α)/log(2 + εk) ),  when k ≥ 1/αε;
  O( k log(1/α) + 1/αε ),  when 1/α ≤ k ≤ 1/αε;
  O( k log k + k/ε ),
when k ≤ 1/α. Furthermore,
  C(S, α, ε) = Ω( k log²(1/α)/log k + 1/αε ),  when k ≥ 1/α;
  C(S, α, ε) = Ω( k log k + k/ε ),  when k ≤ 1/α.

Theorem 3. Let λ > 0 be any small fixed constant; for instance, λ can be chosen to be any constant between 0.01 and 1. We have the following upper bounds on the sample complexity of entropy estimation C(H, α, ε):
  O( k/α + log²(min{k, n})/α² + 1/αε ),  and
  O( k/(λ²α log k) + log²(min{k, n})/α² + (1/αε)^{1+λ} ).
Furthermore, C(H, α, ε) = Ω( k/(α log k) + log²(min{k, n})/α² + 1/αε ).

We provide some discussion of our results. At a high level, we wish to emphasize the following two points:

1. Our upper bounds show that the cost of privacy in these settings is often negligible compared to the sample complexity of the non-private statistical task, especially when we are dealing with distributions over a large support. Furthermore, our upper bounds are almost tight in all parameters.

2. The algorithmic complexity introduced by the requirement of privacy is minimal, consisting only of a single step which noises the output of an estimator. In other words, our methods are realizable in practice, and we demonstrate their effectiveness on several synthetic and real-data examples.

Before we continue, we emphasize that, in Theorems 1 and 2, we consider the sublinear regime (when m ≥ 1/αε or k ≥ 1/αε, respectively) to be of primary interest, both technically and in terms of the parameter regimes of greatest practical relevance. We include results for other regimes mostly for completeness. First, we examine our results on support coverage and support size estimation in the sublinear regime, when m ≥ 1/αε (focusing on support coverage for simplicity; support size is similar). In this regime, if ε = Ω(m^γ/m) for any constant γ > 0, then our upper bound is within a constant factor of the optimal sample complexity without privacy constraints. In other words, for most meaningful values of ε, privacy comes for free.
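Concretely, the single noising step described in point 2 above is the classical Laplace mechanism. The following is our own minimal sketch (not the authors' code); `sensitivity_bound` must be a valid upper bound on the estimator's sensitivity for the privacy guarantee to hold:

```python
import random

def privatize(estimate, sensitivity_bound, epsilon, rng=None):
    """Laplace mechanism: adding Lap(sensitivity_bound / epsilon) noise to a
    statistic whose sensitivity over neighboring datasets is at most
    sensitivity_bound yields an epsilon-differentially private output."""
    rng = rng or random.Random()
    b = sensitivity_bound / epsilon
    # Laplace(0, b) is the difference of two i.i.d. Exponential(mean b) draws.
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return estimate + noise
```

For example, the batched support coverage estimate of Section 4.1.1 has sensitivity at most 2m/n, so `privatize(est, 2 * m / n, eps)` suffices there.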
In the non-sublinear regime for these problems, we provide upper and lower bounds which match in a number of cases. We note that in this regime the cost of privacy may not be a lower-order term; however, this regime only arises when one requires very high accuracy or a very stringent privacy guarantee, which we consider to be of somewhat lesser interest. Next, we turn our attention to entropy estimation. We note that the second upper bound in Theorem 3 has a parameter λ that governs a tradeoff between the first and third terms of the sample complexity. This parameter determines the degree of the polynomial used for entropy estimation. As the degree becomes smaller (corresponding to a large λ), the accuracy of the polynomial estimator decreases; at the same time, low-degree polynomials have small sensitivity, allowing us to privatize the outcome. In terms of our theoretical results, one can think of λ = 0.01. With this parameter setting, it can be observed that our upper bounds are almost tight. For example, one can see that the upper and lower bounds match up to either logarithmic factors (when looking at the first upper bound), or a very small polynomial factor in 1/αε (when looking at the second upper bound). For our experimental results, we empirically determined an effective value for the parameter λ on a single synthetic instance. We then show that this choice of parameter generalizes, giving highly-accurate private estimation in other instances, on both synthetic and real-world data.

4. Algorithms and Analysis

We now prove our results for support coverage estimation (Theorem 1); support size and entropy estimation appear in the supplementary material. We first describe and analyze our algorithms, and then describe and analyze a lower bound construction, showing that our upper bounds are almost tight. All our algorithms fall into the following simple framework:

1. Compute a non-private estimate of the property;

2.
Privatize this estimate by adding Laplace noise, where the noise parameter is determined through analysis of the estimator, and potentially through computation of the estimator's sensitivity.

4.1. Support Coverage Estimation

4.1.1. UPPER BOUND FOR SUPPORT COVERAGE ESTIMATION

We split the analysis into two regimes. First, we focus on the case where m ≤ 1/αε, and prove the upper bound O(1/α² + 1/αε). Note that the problem is identical for any α < 1/m, since this corresponds to estimating the support coverage exactly, and in this case the above bound simplifies to O(m² + m/ε). The algorithm in this case is simple: since n = Ω(m), we group the dataset into n/m batches of size m. Let Y_j be the number of unique symbols observed in batch j. Our estimator is Ŝ_m(X_1^n) = (m/n) Σ_{j=1}^{n/m} Y_j. Observe that E[Y_j] = S_m(p), and that Var[Y_j] ≤ m. The latter can be seen by observing that Y_j is the sum of m negatively correlated indicator random variables, each indicating whether the corresponding sample in the batch is the first occurrence of its symbol. This gives that Ŝ_m(X_1^n) is an unbiased estimator of S_m(p), with variance O(m²/n). By Chebyshev's inequality, since we want an estimate which is accurate up to ±αm, this gives us C_Ŝm(S_m, α/2) = O(1/α²). Furthermore, the sensitivity of Ŝ_m(X_1^n) is at most 2m/n. By Lemma 1, there is a private algorithm for support coverage estimation as long as the sensitivity satisfies Δ_{n,Ŝm} ≤ αεm. With the above bound on sensitivity, this holds with n = O(1/αε), giving the desired upper bound.

Now, we turn our attention to the case where m ≥ 1/αε, and prove the upper bound O( m log(1/α)/log m + m log(1/α)/log(2 + εm) ). Let ϕ_i be the number of symbols that appear i times in X_1^n.
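Both ingredients introduced so far are simple to compute. The following is our own sketch (not the authors' implementation) of the batched estimator for the n = Ω(m) regime and of the fingerprint ϕ_i:

```python
from collections import Counter

def batched_support_coverage(samples, m):
    """Unbiased estimate of S_m(p): split the n samples into n // m disjoint
    batches of size m and average the number of distinct symbols per batch.
    The sensitivity of this statistic is at most 2m/n."""
    n = len(samples)
    num_batches = n // m
    assert num_batches >= 1, "need n >= m samples"
    distinct = [len(set(samples[j * m:(j + 1) * m])) for j in range(num_batches)]
    return sum(distinct) / num_batches

def fingerprint(samples):
    """phi_i = number of symbols appearing exactly i times in the sample."""
    counts = Counter(samples)
    return dict(Counter(counts.values()))
```

Averaging over batches is what keeps the sensitivity low: one changed sample can affect the distinct count of only a single batch, by at most one in each direction.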
We will use the following non-private support coverage estimator from (Orlitsky et al., 2016):

  Ŝ_m(X_1^n) = Σ_{i≥1} ϕ_i ( 1 − (−t)^i Pr(Z ≥ i) ),

where Z is a Poisson random variable with mean r (a parameter to be instantiated later), and t = (m − n)/n. Our private estimator of support coverage is derived by adding Laplace noise to this non-private estimator with the appropriate noise parameter; thus the performance of our private estimator is analyzed by bounding the sensitivity and the bias of the non-private estimator, according to Lemma 1. The sensitivity and bias of this estimator are bounded in the following lemmas.

Lemma 3. Suppose m > 2n. Then the magnitude of the coefficient of ϕ_i in Ŝ_m(X_1^n) is at most 1 + e^{r(t−1)}.

Proof. By the definition of Z, we know Pr(Z ≥ i) = Σ_{k≥i} e^{−r} r^k/k!, hence we have

  |1 − (−t)^i Pr(Z ≥ i)| ≤ 1 + t^i Σ_{k≥i} e^{−r} r^k/k! ≤ 1 + e^{−r} Σ_{k≥i} (rt)^k/k! ≤ 1 + e^{−r} Σ_{k≥0} (rt)^k/k! = 1 + e^{r(t−1)}.

The bias of the estimator is bounded in Lemma 4 of (Acharya et al., 2017a):

Lemma 4. If m > 2n, then | E[ Ŝ_m(X_1^n) ] − S_m(p) | ≤ 2 + 2e^{r(t−1)} + min(m, S(p))·e^{−r}.

Using these results, and letting r = log(1/α), (Orlitsky et al., 2016) showed that there is a constant C such that with n = C·(m/log m)·log(1/α) samples, with probability at least 0.9,

  | Ŝ_m(X_1^n)/m − S_m(p)/m | ≤ α.

Our upper bound in Theorem 1 is derived from the following analysis of the sensitivity of Ŝ_m(X_1^n)/m. If we change one sample in X_1^n, at most two of the ϕ_j change. Hence, by Lemma 3, the sensitivity of the normalized estimator satisfies

  Δ_{n, Ŝm/m} ≤ (2/m)·( 1 + e^{r(t−1)} ).

By Lemma 1, there is a private algorithm for support coverage estimation as long as Δ_{n, Ŝm/m} ≤ αε, which, by the inequality above, holds if 2(1 + exp(r(t−1))) ≤ αεm. Let r = log(3/α), and note that t − 1 = (m − 2n)/n. Suppose αεm > 2; then the condition above reduces to r(t − 1) ≤ log(αεm/2 − 1), which is equivalent to

  n ≥ m log(3/α) / ( log(αεm/2 − 1) + 2 log(3/α) ).
Thus, for αεm > 2, the condition above reduces to the requirement that n = Ω( m log(1/α)/log(2 + εm) ).

4.1.2. LOWER BOUND FOR SUPPORT COVERAGE ESTIMATION

We now prove the lower bound described in Theorem 1. Note that the first term in the lower bound is the sample complexity of non-private support coverage estimation, shown in (Orlitsky et al., 2016). Therefore, we turn our attention to proving the last term in the sample complexity. Consider the following two distributions. u1 is uniform over [m(1 + α)]. u2 is distributed over the m + 1 elements [m] ∪ {⋆}, where u2[i] = 1/(m(1 + α)) for i ∈ [m] and u2[⋆] = α/(1 + α); moreover, ⋆ ∉ [m(1 + α)]. Then

  S_m(u1) = m(1 + α)·( 1 − (1 − 1/(m(1+α)))^m ),  and
  S_m(u2) = m·( 1 − (1 − 1/(m(1+α)))^m ) + 1 − (1 − α/(1+α))^m.

Therefore,

  S_m(u1) − S_m(u2) = mα·( 1 − (1 − 1/(m(1+α)))^m ) − ( 1 − (1 − α/(1+α))^m ) = Ω(αm),

so their support coverages differ by Ω(αm). Moreover, their total variation distance is α/(1 + α). The following lemma is folklore, based on the coupling interpretation of total variation distance and the fact that total variation distance is subadditive for product measures.

Lemma 5. For any two distributions p and q, there is a coupling between n i.i.d. samples from p and n i.i.d. samples from q with expected Hamming distance d_TV(p, q)·n.

Using Lemma 5 and d_TV(u1, u2) = α/(1 + α), we have:

Lemma 6. For u1 and u2 as defined above, there is a coupling between u1^n and u2^n with expected Hamming distance α/(1+α)·n.

Moreover, an α-accurate estimator of support coverage can be used to distinguish u1 from u2, so any private such estimator must privately distinguish them. Thus, by Lemmas 2 and 6, we must have α/(1+α)·n = Ω(1/ε), i.e., n = Ω(1/αε).

5. Experiments

We evaluated our methods for entropy estimation and support coverage on both synthetic and real data. Overall, we found that privacy is quite cheap: private estimators achieve accuracy comparable or near-indistinguishable to that of non-private estimators in many settings.
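The shape of these experiments is simple to reproduce. The following self-contained sketch is our own toy setup (not the paper's code): it compares the RMSE of a distinct-count statistic, with and without Laplace noise, on synthetic uniform data. Changing one sample changes the distinct count by at most 1, so Lap(1/ε) noise suffices for ε-DP here:

```python
import math
import random

rng = random.Random(0)

def laplace_noise(b):
    # Laplace(0, b) as a difference of two i.i.d. Exponential(mean b) draws.
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def rmse(estimates, truth):
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

k, n, eps, trials = 100, 50, 1.0, 500
# True support coverage S_n(p) for p uniform over k symbols.
truth = k * (1.0 - (1.0 - 1.0 / k) ** n)

non_private, private = [], []
for _ in range(trials):
    sample = [rng.randrange(k) for _ in range(n)]
    est = len(set(sample))  # distinct symbols seen; sensitivity at most 1
    non_private.append(est)
    private.append(est + laplace_noise(1.0 / eps))

print(rmse(non_private, truth), rmse(private, truth))
```

With these (hypothetical) settings the two RMSEs are typically within a couple of units of each other, mirroring the "privacy is cheap" behavior reported in this section.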
Our results on entropy estimation and support coverage appear in Sections 5.1 and 5.2, respectively. Code for our implementation is available at https://github.com/HuanyuZhang/INSPECTRE.

5.1. Entropy

We compare the performance of our entropy estimator with a number of alternatives, both private and non-private. Non-private algorithms considered include the plug-in estimator (plug-in), the Miller-Madow estimator (MM) (Miller, 1955), and the sample-optimal polynomial approximation estimator (poly) of (Wu & Yang, 2016). We analyze the privatized versions of plug-in and poly in the supplementary material. The implementation of the latter is based on code from the authors of (Wu & Yang, 2016). We compare performance on different distributions, including uniform, a distribution with two steps, Zipf(1/2), a distribution with Dirichlet-1 prior, and a distribution with Dirichlet-1/2 prior, over varying support sizes. While plug-in and MM are parameter free, poly (and its private counterpart) must choose the degree L of the polynomial to use, which manifests in the parameter λ in the statement of Theorem 3. (Wu & Yang, 2016) suggest the value L = 1.6 log k in their experiments. However, since we add further noise, we choose a single L as follows: (i) run privatized poly for different L values and distributions for k = 2000, ε = 1; (ii) choose the value of L that performs well across different distributions (see Figure 1). We choose L = 1.2 log k from this, and use it for all other experiments. To evaluate the sensitivity of poly, we computed the estimator's value at all possible input values, computed the sensitivity Δ = max_{d_ham(X_1^n, Y_1^n) ≤ 1} |poly(X_1^n) − poly(Y_1^n)|, and added noise distributed as Lap(Δ/ε). The RMSE of the various estimators for k = 1000 and ε = 1, for various distributions, is illustrated in Figure 2. The RMSE is averaged over 100 iterations in the plots.
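The plug-in and Miller-Madow baselines are a few lines each. The following is our own sketch (in nats rather than bits); the privatized variants follow the same Laplace recipe, with the sensitivity Δ computed as described above:

```python
import math
from collections import Counter

def plug_in_entropy(samples):
    """Empirical (plug-in) entropy of the sample, in nats."""
    n = len(samples)
    return sum((c / n) * math.log(n / c) for c in Counter(samples).values())

def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow bias correction (K - 1) / (2n),
    where K is the number of distinct symbols observed."""
    n = len(samples)
    return plug_in_entropy(samples) + (len(set(samples)) - 1) / (2 * n)
```

The bias correction matters precisely in the undersampled regime this paper targets, where many symbols are seen zero or one time and the plug-in estimate is badly biased downward.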
We observe that the performance of our private-poly is near-indistinguishable from that of the non-private poly, particularly as the number of samples increases. It also performs significantly better than all the other alternatives, including the non-private Miller-Madow and plug-in estimators. The cost of privacy is minimal for several other settings of k and ε as well; additional experiments appear in the supplementary material.

5.2. Support Coverage

We investigate the cost of privacy for the problem of support coverage. We provide a comparison between the Smoothed Good-Toulmin estimator (SGT) of (Orlitsky et al., 2016) and our algorithm, which is a privatized version of their statistic (see Section 4.1.1). Our implementation is based on code provided by the authors of (Orlitsky et al., 2016). As shown in our theoretical results, the sensitivity of SGT is at most 2(1 + e^{r(t−1)}), necessitating the addition of Laplace noise with parameter 2(1 + e^{r(t−1)})/ε. Note that while the theory suggests we select the parameter r = log(1/α), α is unknown. We instead set r = (1/2t)·log_e( n(t+1)²/(t−1) ), as previously done in (Orlitsky et al., 2016). (See https://github.com/Albuso0/entropy for the entropy estimation code of (Wu & Yang, 2016).)

5.2.1. EVALUATION ON SYNTHETIC DATA

In our synthetic experiments, we consider different distributions over different support sizes k. We generate n = k/2 samples, and then estimate the support coverage at m = nt; for larger t, estimation is harder. Some results of our evaluation on synthetic data are displayed in Figure 3. We compare the performance of SGT and privatized versions of SGT with parameters ε = 1, 2, and 10. For this instance, we fixed the domain size k = 20000. We ran the methods described above with n = k/2 samples, and estimated the support coverage at m = nt, for t ranging from 1 to 10. The performance of the estimators is measured in terms of RMSE over 1000 iterations.
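Given n, m, and ε, the noise calibration for privatized SGT is a one-liner. The following is our own sketch under the choices stated above (r as in (Orlitsky et al., 2016), and the sensitivity bound 2(1 + e^{r(t−1)}) from Section 4.1.1):

```python
import math

def sgt_laplace_scale(n, m, epsilon):
    """Scale b of the Laplace noise added to the SGT support-coverage
    estimate: b = sensitivity / epsilon, using the sensitivity bound
    2 * (1 + e^{r(t - 1)}) with t = (m - n) / n."""
    t = (m - n) / n
    assert t > 1, "requires m > 2n, as in the sensitivity analysis"
    r = (1.0 / (2.0 * t)) * math.log(n * (t + 1) ** 2 / (t - 1))
    sensitivity = 2.0 * (1.0 + math.exp(r * (t - 1)))
    return sensitivity / epsilon
```

Doubling ε halves the noise scale; the privatized estimate is then SGT(X) + Lap(b), matching the framework of Section 4.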
We observe that, in this setting, the cost of privacy is relatively small for reasonable values of ε. This is as predicted by our theoretical results: unless ε is extremely small (less than 1/k), the non-private sample complexity dominates the privacy requirement. However, we found that for smaller support sizes (as shown in the supplementary material), the cost of privacy can be significant. We provide an intuitive explanation for why no private estimator can perform well on such instances. To minimize the number of parameters, we instead argue about the related problem of support-size estimation. Suppose we are trying to distinguish between distributions which are uniform over supports of size 100 and 200. We note that, if we draw n = 50 samples, the profile of the samples (i.e., the histogram of the histogram) will be very similar for the two distributions. In particular, by modifying only a few samples (say, five or six), one could convert one profile into the other. In other words, these two profiles are almost-neighboring datasets, yet they correspond to very different support sizes. This puts the two goals of privacy and accuracy at odds with each other, resulting in a degradation in accuracy.

5.2.2. EVALUATION ON CENSUS DATA AND HAMLET

We conclude with experiments for support coverage on two real-world datasets, the 2000 US Census data and the text of Shakespeare's play Hamlet, inspired by investigations in (Orlitsky et al., 2016) and (Valiant & Valiant, 2017). Our investigation of the US Census data is also motivated by the fact that this is a setting where privacy is of practical importance, as evidenced by the proposed adoption of differential privacy in the 2020 US Census (Dajani et al., 2017). The Census dataset contains a list of last names that appear at least 100 times. Since the dataset is so oversampled, even a small fraction of the data is likely to contain almost all the names.
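The profile (the histogram of the histogram) invoked in the distinguishing argument of Section 5.2.1 is easy to compute. The following minimal sketch, with a helper name of our choosing, shows how two samples over disjoint supports can nevertheless share the same profile.

```python
from collections import Counter

def profile(samples):
    """Profile of a sample: maps j to the number of distinct symbols
    that appear exactly j times in the sample."""
    multiplicities = Counter(samples)          # symbol -> count
    return Counter(multiplicities.values())    # count  -> number of symbols

# Samples from disjoint supports with identical profiles:
s1 = ["a", "a", "b", "b", "c", "d"]
s2 = ["w", "w", "x", "x", "y", "z"]
assert profile(s1) == profile(s2)  # each: two symbols twice, two symbols once
```

Estimators for symmetric properties such as entropy, support size, and support coverage depend on the sample only through this profile, which is why near-matching profiles force any private estimator to treat the two cases similarly.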
As such, we make the task non-trivial by subsampling m_total = 86,080 individuals from the data, obtaining 20,412 distinct last names. We then sample n of the m_total individuals without replacement and attempt to estimate the total number of last names. Figure 4 displays the RMSE over 100 iterations of this process. We observe that even with an exceptionally stringent privacy budget of ε = 0.5, the performance is almost indistinguishable from that of the non-private SGT estimator.

Figure 1. RMSE comparison between private polynomial approximation estimators for entropy with various values of the degree L; k = 2000, ε = 1. The degree L represents a bias-variance tradeoff: a larger degree decreases the bias but increases the sensitivity, necessitating the addition of Laplace noise with a larger variance.

Figure 2. Comparison of various estimators for entropy estimation; k = 1000, ε = 1.

Figure 3. Comparison between the private support coverage estimator and the non-private SGT when k = 20000.

Figure 4. Comparison between our private support coverage estimator and the SGT on Census data.

The Hamlet dataset has m_total = 31,999 words, of which 4,804 are distinct. Since the distribution is not as oversampled as the Census data, we do not need to subsample the data. Besides this difference, the experimental setup is identical to that of the Census dataset. Once again, as we can see in Figure 5, we get near-indistinguishable performance between the non-private and private estimators, even for very small values of ε. Our experimental results demonstrate that privacy is realizable in practice, with particularly accurate performance on real-world datasets.

Figure 5. Comparison between our private support coverage estimator and the SGT on Hamlet.

Acknowledgements

JA, ZS, and HZ are supported by NSF CCF-1657471 and a Cornell University startup grant.
GK is supported by ONR N00014-12-1-0999, NSF CCF-1617730, CCF-1650733, and CCF-1741137. Part of this work was done while the author was an intern at Microsoft Research, New England.

References

Acharya, J., Daskalakis, C., and Kamath, G. Optimal testing for properties of distributions. In NIPS '15, pp. 3577–3598, 2015.

Acharya, J., Das, H., Orlitsky, A., and Suresh, A. T. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In ICML '17, 2017a.

Acharya, J., Orlitsky, A., Suresh, A. T., and Tyagi, H. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory, 63(1):38–56, 2017b.

Acharya, J., Sun, Z., and Zhang, H. Differentially private testing of identity and closeness of discrete distributions. arXiv preprint arXiv:1707.05128, 2017c.

Acharya, J., Sun, Z., and Zhang, H. Communication efficient, sample optimal, linear time locally private discrete distribution estimation. arXiv preprint arXiv:1802.04705, 2018.

Adam, N. R. and Worthmann, J. C. Security-control methods for statistical databases: A comparative study. ACM Computing Surveys (CSUR), 21(4):515–556, 1989.

Agrawal, D. and Aggarwal, C. C. On the design and quantification of privacy preserving data mining algorithms. In PODS '01, 2001.

Aliakbarpour, M., Diakonikolas, I., and Rubinfeld, R. Differentially private identity and closeness testing of discrete distributions. arXiv preprint arXiv:1707.05497, 2017.

Balcan, M.-F., Dick, T., Liang, Y., Mou, W., and Zhang, H. Differentially private clustering in high-dimensional Euclidean spaces. In ICML '17, 2017.

Berry, M. J., Warland, D. K., and Meister, M. The structure and precision of retinal spike trains. Proceedings of the National Academy of Sciences, 94(10):5411–5416, 1997.

Cai, B., Daskalakis, C., and Kamath, G. Priv'it: Private and sample efficient identity testing. In ICML '17, pp. 635–644, 2017.

Chaudhuri, K., Sarwate, A. D., and Sinha, K.
A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research, 14(Sep):2905–2943, 2013.

Colwell, R. K. et al. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1):3–21, 2012.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley, 2006.

Dajani, A. N. et al. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.

Dalenius, T. Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 15:429–444, 1977.

Diakonikolas, I., Hardt, M., and Schmidt, L. Differentially private learning of structured discrete distributions. In NIPS '15, 2015.

Differential Privacy Team, Apple. Learning with privacy at scale, December 2017.

Dinur, I. and Nissim, K. Revealing information while preserving privacy. In PODS '03, 2003.

Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Minimax optimal procedures for locally private estimation. JASA, 2017.

Dwork, C. Differential privacy: A survey of results. In TAMC '08, 2008.

Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. Now Publishers, Inc., 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC '06, 2006.

Erlingsson, Ú., Pihur, V., and Korolova, A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS '14, 2014.

Fukuchi, K. and Sakuma, J. Minimax optimal estimators for additive scalar functionals of discrete distributions. In ISIT '17, 2017.

Gaboardi, M. and Rogers, R. Local private hypothesis testing: Chi-square tests. arXiv preprint arXiv:1709.07155, 2017.

Gaboardi, M., Lim, H., Rogers, R. M., and Vadhan, S. P. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. In ICML '16, 2016.

Good, I.
and Toulmin, G. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43(1-2):45–63, 1956.

Hardt, M. and Price, E. The noisy power method: A meta algorithm with applications. In NIPS '14, 2014.

Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 4(8):1–9, 2008.

Jiao, J., Venkat, K., Han, Y., and Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2017.

Johnson, A. and Shmatikov, V. Privacy-preserving data exploration in genome-wide association studies. In KDD '13, 2013.

Kakizaki, K., Sakuma, J., and Fukuchi, K. Differentially private chi-squared test by unit circle mechanism. In ICML '17, 2017.

Kamath, G., Li, J., Singhal, V., and Ullman, J. Privately learning high-dimensional distributions. arXiv preprint arXiv:1805.00216, 2018.

Kapralov, M. and Talwar, K. On differentially private low rank approximation. In SODA '13, 2013.

Karwa, V. and Vadhan, S. Finite sample differentially private confidence intervals. In ITCS '18, 2018.

Keinan, A. and Clark, A. G. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science, 336(6082):740–743, 2012.

Kifer, D. and Rogers, R. M. A new class of private chi-square tests. In AISTATS '17, 2017.

Miller, G. A. Note on the bias of information estimates. Information Theory in Psychology: Problems and Methods, 2:95–100, 1955.

Nelson, M. R. et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science, 337(6090):100–104, 2012.

Nemenman, I., Bialek, W., and de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111:1–056111:6, 2004.

Nowozin, S.
Improved information gain estimates for decision tree induction. In ICML '12, 2012.

Obremski, M. and Skorski, M. Rényi entropy estimation revisited. In APPROX '17, 2017.

Orlitsky, A., Suresh, A. T., and Wu, Y. Optimal prediction of the number of unseen species. PNAS, 113(47):13283–13288, 2016.

Paninski, L. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

Raskhodnikova, S., Ron, D., Shpilka, A., and Smith, A. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing, 39(3):813–842, 2009.

Rogers, R. M. Leveraging Privacy in Data Analysis. PhD thesis, University of Pennsylvania, May 2017.

Sheffet, O. Differentially private ordinary least squares. In ICML '17, 2017.

Simmons, S., Sahinalp, C., and Berger, B. Enabling privacy-preserving GWASs in heterogeneous human populations. Cell Systems, 3(1):54–61, 2016.

Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337(6090):64–69, 2012.

Uhler, C., Slavković, A., and Fienberg, S. E. Privacy-preserving data sharing for genome-wide association studies. The Journal of Privacy and Confidentiality, 5(1):137–166, 2013.

Valiant, G. and Valiant, P. Estimating the unseen: Improved estimators for entropy and other properties. In NIPS '13, 2013.

Valiant, G. and Valiant, P. Estimating the unseen: Improved estimators for entropy and other properties. Journal of the ACM, 64(6):37:1–37:41, 2017.

Wang, Y., Lee, J., and Kifer, D. Revisiting differentially private hypothesis tests for categorical data. arXiv preprint arXiv:1511.03376, 2015a.

Wang, Y., Wang, Y.-X., and Singh, A. Differentially private subspace clustering. In NIPS '15, 2015b.

Wu, Y. and Yang, P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

Wu, Y. and Yang, P.
Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 2018.

Yu, F. et al. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics, 50:133–141, 2014.

Zou, J. et al. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7, 2016.