# Priv'IT: Private and Sample Efficient Identity Testing

Bryan Cai, Constantinos Daskalakis, Gautam Kamath (Massachusetts Institute of Technology; equal contribution)

Abstract. We develop differentially private hypothesis testing methods for the small-sample regime. Given a sample D from a categorical distribution p over some domain Σ, an explicitly described distribution q over Σ, a privacy parameter ε, an accuracy parameter α, and requirements βI and βII on the type I and type II errors of our test, the goal is to distinguish between p = q and $d_{\mathrm{TV}}(p, q) \ge \alpha$. We provide theoretical bounds on the sample size |D| so that our method both satisfies (ε, 0)-differential privacy and guarantees type I and type II errors of at most βI and βII. We show that differential privacy may come for free in some regimes of parameters, and we always beat the sample complexity resulting from running the χ²-test with noisy counts, or standard approaches such as repetition for endowing non-private χ²-style statistics with differential privacy guarantees. We experimentally compare the sample complexity of our method to that of recently proposed methods for private hypothesis testing (Gaboardi et al., 2016; Kifer & Rogers, 2017).

1. Introduction

Hypothesis testing is the age-old problem of deciding whether observations from an unknown phenomenon p conform to a model q. Often p can be viewed as a distribution over some alphabet Σ, and the goal is to determine, using samples from p, whether it is equal to some model distribution q or not. This type of test is the lifeblood of the scientific method and has received tremendous study in statistics since its very beginnings. Naturally, the focus has been on minimizing the number of observations from the unknown distribution p that are needed to determine, with confidence, whether p = q or p ≠ q.

In several fields of research and application, however, samples may contain sensitive information about individuals; consider, for example, individuals participating in a clinical study of a disease that carries social stigma. It may thus be crucial to guarantee that operating on the samples needed to test a statistical hypothesis protects sensitive information about the samples. This is not at odds with the goal of hypothesis testing itself, since the latter is about verifying a property of the population p from which the samples are drawn, and not of the samples themselves. Without care, however, sensitive information about the sample might actually be divulged by statistical processing that is improperly designed. As recently exhibited, for example, it may be possible to determine whether individuals participated in a study from data that would typically be published in genome-wide association studies (Homer et al., 2008). Motivated in part by this realization, there has been increased recent interest in developing data sharing techniques which are private (Johnson & Shmatikov, 2013; Uhler et al., 2013; Yu et al., 2014; Simmons et al., 2016).
Protecting privacy when computing on data has been extensively studied in several fields ranging from statistics to diverse branches of computer science, including algorithms, cryptography, database theory, and machine learning; see, e.g., (Dalenius, 1977; Adam & Worthmann, 1989; Agrawal & Aggarwal, 2001; Dinur & Nissim, 2003; Dwork, 2008; Dwork & Roth, 2014) and their references. A notion of privacy proposed by theoretical computer scientists which has found a lot of traction is that of differential privacy (Dwork et al., 2006). Roughly speaking, it requires that the output of an algorithm on two neighboring datasets D and D′ that differ in the value of one element be statistically close. For a formal definition see Section 2.

Our goal in this paper is to develop tools for privately performing statistical hypothesis testing. In particular, we are interested in studying the tradeoffs between statistical accuracy, power, significance, and privacy in the sample size. To be precise, given samples from a categorical distribution p over some domain Σ, an explicitly described distribution q over Σ, a privacy parameter ε, an accuracy parameter α, and requirements βI and βII on the type I and type II errors of our test, the goal is to distinguish between p = q and $d_{\mathrm{TV}}(p, q) \ge \alpha$. We want the output of our test to be (ε, 0)-differentially private, and the probabilities that we make a type I or type II error to be at most βI and βII, respectively. Treating these as hard constraints, we want to minimize the number of samples that we draw from p. Notice that the correctness constraint on our test pertains to whether we draw the right conclusion about how p compares to q, while the privacy constraint pertains to whether we respect the privacy of the samples that we draw from p. The pertinent question is how much the privacy constraint increases the number of samples that are needed to guarantee correctness.

Our main result is that privacy may come for free in certain regimes of parameters, and has a mild cost in all regimes of parameters. To be precise, without privacy constraints, it is well known that identity testing can be performed from $O\!\left(\frac{\sqrt{n}}{\alpha^2}\log\frac{1}{\beta}\right)$ samples, where n is the size of Σ and β = min{βI, βII}, and that this is tight (Batu et al., 2001; Paninski, 2008; Valiant & Valiant, 2014; Acharya et al., 2015). Our main theoretical result is that, with privacy constraints, the number of samples that are needed is
$$O\!\left(\max\left\{\frac{\sqrt{n}}{\alpha^2},\ \frac{\sqrt{n}}{\alpha^{3/2}\varepsilon},\ \frac{n^{1/3}}{\alpha^{5/3}\varepsilon^{2/3}}\right\}\cdot\log\frac{1}{\beta}\right). \quad (1)$$
Our statistical test is provided in Section 5, where the above upper bound on the number of samples that it requires is proven as Theorem 3. Notice that privacy comes for free when the privacy requirement ε is $\Omega(\sqrt{\alpha})$, for example when ε = 10% and the required statistical accuracy is α = 3%. The precise constants sitting in the O(·) notation of Eq. (1) are given in the proof of Theorem 3.

We experimentally verify the sample efficiency of our tests by comparing them to recently proposed private statistical tests (Gaboardi et al., 2016; Kifer & Rogers, 2017), discussed in more detail shortly. Fixing differential privacy and type I/type II error constraints, we compare how many samples are required by our method and by theirs to distinguish between hypotheses that are α = 0.1 apart in total variation distance. We find that different algorithms are more efficient depending on the regime and the properties desired by the analyst. Our experiments and further discussion of the tradeoffs are presented in Section 6.
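To make the tradeoff in Eq. (1) concrete, the following minimal sketch (not from the paper; constants and logarithmic factors in n are suppressed, and the helper name is ours) evaluates the three terms of the bound for given n, α, ε, and β. The non-private term dominates, and privacy is essentially free, whenever ε is at least on the order of √α.

```python
import math

def privit_sample_bound(n, alpha, eps, beta):
    """Evaluate the three terms of Eq. (1), up to constant factors."""
    nonprivate = math.sqrt(n) / alpha**2                   # non-private identity testing
    priv_a = math.sqrt(n) / (alpha**1.5 * eps)             # first privacy term
    priv_b = n**(1 / 3) / (alpha**(5 / 3) * eps**(2 / 3))  # second privacy term
    return max(nonprivate, priv_a, priv_b) * math.log(1 / beta)

# Example: with eps = 0.3 and alpha = 0.05, eps exceeds sqrt(alpha) ~ 0.22, so the
# non-private term dominates and the privacy constraint is asymptotically free.
print(privit_sample_bound(n=10_000, alpha=0.05, eps=0.3, beta=1 / 3))
```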
Approach. A standard approach to turn an algorithm differentially private is to use repetition. As already mentioned above, absent differential privacy constraints, statistical tests have been provided that use an optimal $m = O\!\left(\frac{\sqrt{n}}{\alpha^2}\log\frac{1}{\beta}\right)$ number of samples. A trivial way to get (ε, 0)-differential privacy using such a non-private test is to create O(1/ε) datasets, each comprising m samples from p, and run the non-private test on one of these datasets, chosen randomly. It is clear that changing the value of a single element in the combined dataset may only affect the output of the test with probability at most ε. Thus the output is (ε, 0)-differentially private; see Section 3 for a proof. The issue with this approach is that the total number of samples that it draws is $m/\varepsilon = O\!\left(\frac{\sqrt{n}}{\varepsilon\alpha^2}\log\frac{1}{\beta}\right)$, which is higher than our target. See Corollary 1.

A different approach towards private hypothesis testing is to look deeper into the non-private tests and try to privatize them. The most sample-efficient tests are variations of the classical χ²-test. They compute the number of times, Ni, that element i ∈ Σ appears in the sample and aggregate those counts using a statistic that equals, or is close to, the χ²-divergence between the empirical distribution defined by these counts and the hypothesis distribution q. They accept q if the statistic is low and reject q if it is high, using some threshold. A reasonable approach to privatize such a test is to add noise, e.g. Laplace(1/ε) noise, to each count Ni before running the test. It is well known that adding Laplace(1/ε) noise to a set of counts makes them differentially private; see Theorem 1. However, it also increases the variance of the statistic. This has been noticed empirically in recent work of (Gaboardi et al., 2016) for the χ²-test. We show in Section 4.1 that the variance of the optimal χ²-style test statistic increases significantly if we add Laplace noise to the counts, thus increasing the sample complexity from $O(\sqrt{n})$ to $\Omega(n^{3/4})$. So this route, too, seems problematic.

A last approach towards designing differentially private tests is to exploit the distance between the null and the alternative hypotheses. A correct test should accept the null with probability close to 1, and reject an alternative that is α-far from the null with probability close to 1, but there are no requirements for correctness when the alternative is very close to the null. We could thus try to interpolate smoothly between datasets that we expect to see when sampling the null and datasets that we expect to see when sampling an alternative that is far from the null. Rather than outputting accept or reject by merely thresholding our statistic, we would like to tune the probability that we output reject based on the value of our statistic, and make it so that the reject probability is ε-Lipschitz as a function of the dataset. Moreover, the probability should be close to 0 on datasets that we expect to see under the null and close to 1 on datasets that we expect to see under an alternative that is α-far. As we show in Section 4.2, χ²-style statistics have high sensitivity, requiring $\omega(\sqrt{n})$ samples to be made appropriately Lipschitz.

While both the approach of adding noise to the counts and that of turning the output of the test Lipschitz fail in isolation, our test actually goes through by intricately combining these two approaches. It has two steps:
1. A filtering step, whose goal is to reject when p is blatantly far from q. This step is performed by comparing the counts Ni with their expectations under q, after having added Laplace(1/ε) noise to these counts. If the noisy counts deviate from their expectations, taking into account the extra variance introduced by the noise, then we can safely reject. Moreover, because noise was added, this step is differentially private.

2. If the filtering step fails to reject, we perform a statistical step. This step computes the χ²-style statistic from (Acharya et al., 2015), without adding noise to the counts. The crucial observation is that if the filtering step does not reject, then the statistic is actually ε-Lipschitz with respect to the counts, and thus the value of the statistic is still differentially private. We use the value of the statistic to determine the bias of a coin that outputs reject.

Details of our test are given in Section 5.

Related Work. Identity testing is one of the most classical problems in statistics, where it is traditionally called hypothesis or goodness-of-fit testing; see (Pearson, 1900; Fisher, 1935; Rao & Scott, 1981; Agresti, 2012) for some classical and contemporary references. In this field, the focus is often on asymptotic analysis, where the number of samples goes to infinity, and we wish to get a grasp on the asymptotic distributions of the test statistics and the error exponents (Agresti, 2012; Tan et al., 2010). In the past twenty years, this problem has enjoyed significant interest in the theoretical computer science community (see, e.g., (Batu et al., 2001; Paninski, 2008; Levi et al., 2013; Valiant & Valiant, 2014; Acharya et al., 2015; Canonne et al., 2016; Diakonikolas & Kane, 2016; Daskalakis et al., 2016), and (Canonne, 2015) for a survey), where the focus has instead been on the finite sample regime, rather than asymptotics. Specifically, the goal is to minimize the number of samples required, while still remaining computationally tractable. A number of recent works (Wang et al., 2015; Gaboardi et al., 2016; Kifer & Rogers, 2017) (and a simultaneous work, focused on independence testing (Kakizaki et al., 2017)) investigate differential privacy with the former set of goals. In particular, their algorithms focus on fixing a desired significance (type I error) and privacy requirement, and study the asymptotic distribution of the test statistics. On the other hand, we are the first work to apply differential privacy to the latter line of inquiry, where our goal is to minimize the number of samples required to ensure the desired significance, power, and privacy. As a point of comparison between these two worlds, we provide an empirical evaluation of our method versus their methods.

The problem of distribution estimation (rather than testing) has also recently been studied under the lens of differential privacy (Diakonikolas et al., 2015). This is another classical statistics problem which has recently piqued the interest of the theoretical computer science community. We note that the techniques required for this setting are quite different from ours, as we must deal with issues that arise from very sparsely sampled data.

2. Preliminaries

In this paper, we will focus on discrete probability distributions over [n]. For a distribution p, we will use the notation pi to denote the mass p places on symbol i.

Definition 1. The total variation distance between p and q is defined as $d_{\mathrm{TV}}(p, q) = \frac{1}{2}\sum_{i\in[n]} |p_i - q_i|$.
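The two distances that drive everything below can be computed directly; the following minimal helpers (our illustration, not from the paper) also build the α-far instance used in Section 4.1, which has χ²-divergence 4α² from uniform.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance of Definition 1: half the L1 distance."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def chi_squared_divergence(p, q):
    """chi^2(p, q) = sum_i (p_i - q_i)^2 / q_i, the quantity chi^2-style statistics estimate."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return ((p - q) ** 2 / q).sum()

# The alpha-far distribution from Section 4.1: (1 + 2*alpha)/n on even symbols,
# (1 - 2*alpha)/n on odd symbols (n even). It is exactly alpha-far from uniform
# in total variation, with chi-squared divergence 4*alpha^2.
n, alpha = 1000, 0.1
p = np.where(np.arange(n) % 2 == 0, (1 + 2 * alpha) / n, (1 - 2 * alpha) / n)
q = np.full(n, 1.0 / n)
print(tv_distance(p, q), chi_squared_divergence(p, q))  # 0.1, 0.04
```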
Definition 2. A randomized algorithm M with domain $\mathbb{N}^n$ is (ε, δ)-differentially private if for all $S \subseteq \mathrm{Range}(M)$ and for all pairs of inputs D, D′ such that $\|D - D'\|_1 \le 1$:
$$\Pr[M(D) \in S] \le e^{\varepsilon}\Pr[M(D') \in S] + \delta.$$
If δ = 0, the guarantee is called pure differential privacy.

In the context of distribution testing, this neighboring-dataset definition corresponds to two datasets where one dataset is generated from the other by removing one sample. Up to a factor of 2, this is equivalent to the alternative definition where one dataset is generated from the other by arbitrarily changing one sample.

Definition 3. An algorithm for the (α, βI, βII)-identity testing problem with respect to a (known) distribution q takes m samples from an (unknown) distribution p and has the following guarantees: if p = q, then with probability at least 1 − βI it outputs p = q; if $d_{\mathrm{TV}}(p, q) \ge \alpha$, then with probability at least 1 − βII it outputs p ≠ q.

In particular, βI and βII are the type I and type II errors of the test. Parameter α is the radius of distinguishing accuracy. Notice that, when p satisfies neither of the cases above, the algorithm's output may be arbitrary. We note that if an algorithm is to satisfy both these definitions, the latter condition (the correctness property) need only be satisfied when p falls into one of the two cases, while the former condition (the privacy property) must be satisfied for all realizations of the samples from p (and in particular, for p which do not fall into the two cases above).

We recall the classical Laplace mechanism, which states that applying independent Laplace noise to a set of counts is differentially private.

Theorem 1 (Theorem 3.6 of (Dwork & Roth, 2014)). Given a set of counts N1, . . . , Nn, the noised counts (N1 + Y1, . . . , Nn + Yn) are (ε, 0)-differentially private when the Yi's are i.i.d. random variables drawn from Laplace(1/ε).

Finally, we recall the definition of zero-concentrated differential privacy from (Bun & Steinke, 2016) and its relationship to differential privacy.

Definition 4. A randomized algorithm M with domain $\mathbb{N}^n$ is ρ-zero-concentrated differentially private (ρ-zCDP) if for all pairs of inputs D, D′ such that $\|D - D'\|_1 \le 1$ and all $\alpha \in (1, \infty)$:
$$D_\alpha(M(D)\,\|\,M(D')) \le \rho\alpha,$$
where $D_\alpha$ is the α-Rényi divergence between the distributions of M(D) and M(D′).

Proposition 1 (Propositions 1.3 and 1.4 of (Bun & Steinke, 2016)). If a mechanism M1 satisfies (ε, 0)-differential privacy, then M1 satisfies $\frac{\varepsilon^2}{2}$-zCDP. If a mechanism M2 satisfies ρ-zCDP, then M2 satisfies $(\rho + 2\sqrt{\rho\log(1/\delta)}, \delta)$-differential privacy for any δ > 0.
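As an illustration of Theorem 1, here is a minimal sketch (not the paper's code) that privatizes a vector of counts; under the neighboring relation of Definition 2, removing one sample changes the counts by 1 in ℓ1 norm, which is why a noise scale of 1/ε suffices.

```python
import numpy as np

def laplace_noised_counts(counts, eps, rng=None):
    """Laplace mechanism (Theorem 1): add i.i.d. Laplace(1/eps) noise to each count.

    The count vector has l1-sensitivity 1 (removing one sample changes one count by 1),
    so releasing the noised counts is (eps, 0)-differentially private.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(loc=0.0, scale=1.0 / eps, size=counts.shape)

# Example: privatize the histogram of 5000 draws from a distribution over [10].
rng = np.random.default_rng(0)
samples = rng.integers(0, 10, size=5000)
counts = np.bincount(samples, minlength=10)
print(laplace_noised_counts(counts, eps=0.1, rng=rng))
```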
3. A Simple Upper Bound

In this section, we provide an $O\!\left(\frac{\sqrt{n}}{\alpha^2\varepsilon}\right)$ upper bound for the differentially private identity testing problem. More generally, we show that if an algorithm requires a dataset of size m for a decision problem, then it can be made (ε, 0)-differentially private at a multiplicative cost of 1/ε in the sample size. This is a folklore result, but we include and prove it here for completeness.

Theorem 2. Suppose there exists an algorithm for a decision problem P which succeeds with probability at least 1 − β and requires a dataset of size m. Then there exists an (ε, 0)-differentially private algorithm for P which succeeds with probability at least $\frac{4}{5}(1 - \beta) + \frac{1}{10}$ and requires a dataset of size O(m/ε).

Proof. First, with probability 1/5, we flip a coin and output yes or no with equal probability. This guarantees that we have probability at least 1/10 of either outcome, which will allow us to satisfy the multiplicative guarantee of differential privacy. We then draw 10/ε datasets of size m, and solve the decision problem (non-privately) for each of them. Finally, we select a random one of these computations and output its outcome. Correctness follows, since we randomly choose the right answer with probability 1/10, or, with probability 4/5, we solve the problem correctly with probability 1 − β. As for privacy, we note that, if we remove a single element of the dataset, we may only change the outcome of one of these computations. Since we pick a random computation, this is selected with probability ε/10, and thus the probability of any outcome is additively shifted by at most ε/10. Since we know the minimum probability of any output is 1/10, this gives the desired multiplicative guarantee required for (ε, 0)-differential privacy.

We obtain the following corollary by noting that the tester of (Acharya et al., 2015) (among others) requires $O(\sqrt{n}/\alpha^2)$ samples for identity testing.

Corollary 1. There exists an (ε, 0)-differentially private testing algorithm for the (α, βI, βII)-identity testing problem for any distribution q which requires $O\!\left(\frac{\sqrt{n}}{\varepsilon\alpha^2}\log(1/\beta)\right)$ samples, where β = min(βI, βII).
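The repetition wrapper of Theorem 2 is easy to state in code; here is a minimal sketch (our illustration, not the authors' implementation), parameterized by an arbitrary non-private decision procedure.

```python
import math
import random

def private_by_repetition(nonprivate_test, draw_dataset, m, eps, rng=None):
    """Theorem 2 wrapper: make a decision procedure (eps, 0)-differentially private.

    `nonprivate_test(dataset) -> bool` is the non-private algorithm needing m samples;
    `draw_dataset(k)` returns a fresh dataset of k samples from p.
    """
    rng = rng or random.Random()
    # With probability 1/5, ignore the data and answer uniformly at random,
    # so that every outcome has probability at least 1/10.
    if rng.random() < 1 / 5:
        return rng.random() < 1 / 2
    # Otherwise, run the non-private test on 10/eps independent datasets of size m
    # and report the answer of a randomly chosen run: a single sample can affect
    # only one run, which is selected with probability eps/10.
    runs = math.ceil(10 / eps)
    answers = [nonprivate_test(draw_dataset(m)) for _ in range(runs)]
    return rng.choice(answers)
```

As noted above, this wrapper consumes m·⌈10/ε⌉ samples in total, which gives the bound of Corollary 1 when instantiated with the tester of (Acharya et al., 2015).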
4. Roadblocks to Differentially Private Testing

In this section, we describe roadblocks which prevent two natural approaches to differentially private testing from working. In Section 4.1, we show that if one simply adds Laplace noise to the empirical counts of a dataset (i.e., runs the Laplace mechanism of Theorem 1) and then attempts to run an optimal identity tester, the variance of the statistic increases dramatically, and thus results in a much larger sample complexity, even for the case of uniformity testing. The intuition behind this phenomenon is as follows. When performing uniformity testing in the small sample regime (when the number of samples m is on the order of the square root of the domain size n), we will see (1 − o(1))n elements 0 times, $O(\sqrt{n})$ elements 1 time, and O(1) elements 2 times. If we add Laplace(10) noise to guarantee (0.1, 0)-differential privacy, this obliterates the signal provided by these collision statistics, and thus many more samples are required before the signal prevails.

In Section 4.2, we demonstrate that χ² statistics have high sensitivity, and thus are not naturally differentially private. In other words, if we consider a χ² statistic Z on two datasets D and D′ which differ in one record, |Z(D) − Z(D′)| may be quite large. This implies that methods such as rescaling this statistic and interpreting it as a probability, or applying noise to the statistic, will not be differentially private until we have taken a large number of samples.

4.1. A Laplaced χ²-statistic has large variance

Proposition 2. Applying the Laplace mechanism to a dataset before applying the identity tester of (Acharya et al., 2015) results in a significant increase in the variance, even when considering the case of uniformity. More precisely, consider the statistic
$$Z' = \sum_{i\in[n]} \frac{(N_i + Y_i - m/n)^2 - (N_i + Y_i)}{m/n},$$
where Ni is the number of occurrences of symbol i in the dataset D (which is of size Poisson(m)) and Yi ∼ Laplace(1/ε). Then: if p is uniform, $E[Z'] = \frac{2n^2}{\varepsilon^2 m}$ and $\mathrm{Var}[Z'] \ge \frac{20n^3}{\varepsilon^4 m^2}$; if p is a particular distribution which is α-far in total variation distance from uniform, then $E[Z'] = 4m\alpha^2 + \frac{2n^2}{\varepsilon^2 m}$.

The variance of the statistic can be compared to that of the unnoised statistic, which is upper bounded by $O(m^2\alpha^4)$. We can see that the noised statistic has larger variance until $m = \Omega(n^{3/4})$.

Proof. First, we compute the mean of Z′. Note that since |D| ∼ Poisson(m), the Ni's will be independently distributed as Poisson(mpi) (see, e.g., (Acharya et al., 2015) for additional discussion). We have
$$E[Z'] = E\left[\sum_i \frac{(N_i - m/n)^2 - N_i}{m/n}\right] + E\left[\sum_i \frac{Y_i^2 + 2Y_i(N_i - m/n) - Y_i}{m/n}\right] = m\,\chi^2(p, q) + \sum_i \frac{2/\varepsilon^2}{m/n} = m\,\chi^2(p, q) + \frac{2n^2}{\varepsilon^2 m}.$$
In other words, the mean is a rescaling of the χ² distance between p and q, shifted by some constant amount. When p = q, the χ²-distance between p and q is 0, and the expectation is just the second term. Focus on the case where n is even, and consider p such that pi = (1 + 2α)/n if i is even, and (1 − 2α)/n otherwise. This is α-far from uniform in total variation distance. Furthermore, by direct calculation, χ²(p, q) = 4α², and thus the expectation of Z′ in this case is $4m\alpha^2 + \frac{2n^2}{\varepsilon^2 m}$.

Next, we examine the variance of Z′. Let λi = mpi and λ′i = mqi = m/n. By a similar computation as before, we have that
$$\mathrm{Var}[Z'] = \sum_i \frac{1}{\lambda_i'^2}\left(2\lambda_i^2 + 4\lambda_i(\lambda_i - \lambda'_i)^2 + \frac{1}{\varepsilon^2}\left(8\lambda_i + 2(2\lambda_i - 2\lambda'_i - 1)^2\right) + \frac{20}{\varepsilon^4}\right).$$
Since all four summands of this expression are nonnegative, we have that
$$\mathrm{Var}[Z'] \ge \frac{20}{\varepsilon^4}\sum_i \frac{1}{\lambda_i'^2} = \frac{20n^3}{\varepsilon^4 m^2}.$$
If we wish to use Chebyshev's inequality to separate these two cases, we require that Var[Z′] is at most the square of the mean separation. In other words, we require that $\frac{20n^3}{\varepsilon^4 m^2} \le m^2\alpha^4$, or that $m = \Omega\!\left(\frac{n^{3/4}}{\varepsilon\alpha}\right)$.

4.2. A χ²-statistic has high sensitivity

Consider the primary statistic which we use in Algorithm 1:
$$Z(D) = \frac{1}{m\alpha^2}\sum_i \frac{(N_i - mq_i)^2 - N_i}{mq_i}.$$
As shown in Section 5, E[Z] = 0 if p = q and E[Z] ≥ 1 if $d_{\mathrm{TV}}(p, q) \ge \alpha$, and the variance of Z is such that these two cases can be separated with constant probability. A natural approach is to truncate this statistic to the range [0, 1], interpret it as a probability, and output the result of Bernoulli(Z): if p = q, the result is likely to be 0, and if $d_{\mathrm{TV}}(p, q) \ge \alpha$, the result is likely to be 1. One might hope that this statistic is naturally private. More specifically, we would like the statistic Z to have low sensitivity, so that it does not change much if we remove a single individual. Unfortunately, this is not the case. We consider datasets D, D′, where D′ is identical to D, but with one fewer occurrence of symbol i. It can be shown that the difference in Z is
$$|Z(D) - Z(D')| = \frac{2|N_i - mq_i - 1|}{m^2\alpha^2 q_i}.$$
Letting q be the uniform distribution and requiring that this is at most ε (for the sake of privacy), we have a constraint which is roughly of the form $\frac{2N_i n}{m^2\alpha^2} \le \varepsilon$, or that $m = \Omega\!\left(\frac{\sqrt{N_i n}}{\varepsilon^{0.5}\alpha}\right)$. In particular, if $N_i = n^c$ for any c > 0, this does not achieve the desired $O(\sqrt{n})$ sample complexity. One may observe that, if Ni is this large, looking at symbol i alone is sufficient to conclude p is not uniform, even if the count Ni had Laplace noise added. Indeed, our main algorithm of Section 5 works in part due to our formalization and quantification of this intuition.

5. Priv'IT: A Differentially Private Identity Tester

In this section, we sketch the proof of our main testing upper bound:

Theorem 3. There exists an (ε, 0)-differentially private testing algorithm for the (α, βI, βII)-identity testing problem for any distribution q which requires
$$m = O\!\left(\max\left\{\frac{\sqrt{n}}{\alpha^2},\ \frac{\sqrt{n}}{\alpha^{3/2}\varepsilon},\ \frac{n^{1/3}}{\alpha^{5/3}\varepsilon^{2/3}}\right\}\cdot\log\frac{1}{\beta}\right)$$
samples, where β = min(βI, βII).

The full details of the proof are provided in the supplementary materials. The pseudocode for this algorithm is provided in Algorithm 1. We fix the constants c1 = 1/4 and c2 = 3/40. For a high-level overview of our algorithm's approach, we refer the reader to the Approach paragraph in Section 1.
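Before presenting Algorithm 1, a quick numerical check (our illustration, not from the paper) of the sensitivity computation in Section 4.2: removing one occurrence of a heavy symbol shifts the statistic by exactly 2|Ni − mqi − 1|/(m²α²qi), which is easily larger than a small ε at modest sample sizes.

```python
import numpy as np

def chi2_statistic(counts, q, m, alpha):
    """Z(D) from Section 4.2: (1 / (m alpha^2)) * sum_i ((N_i - m q_i)^2 - N_i) / (m q_i)."""
    counts = np.asarray(counts, dtype=float)
    return (((counts - m * q) ** 2 - counts) / (m * q)).sum() / (m * alpha ** 2)

n, m, alpha = 1000, 2000, 0.1
q = np.full(n, 1.0 / n)
rng = np.random.default_rng(1)
counts = rng.multinomial(m, q).astype(float)

j = int(np.argmax(counts))      # remove one occurrence of the most frequent symbol
neighbor = counts.copy()
neighbor[j] -= 1

empirical = abs(chi2_statistic(counts, q, m, alpha) - chi2_statistic(neighbor, q, m, alpha))
predicted = 2 * abs(counts[j] - m * q[j] - 1) / (m ** 2 * alpha ** 2 * q[j])
print(empirical, predicted)     # the two agree; both grow linearly with N_j
```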
Algorithm 1 Priv'IT: A differentially private identity tester
1: Input: ε; an explicit distribution q; sample access to a distribution p
2: Define $A \leftarrow \{i : q_i \ge c_1\alpha/n\}$, $\bar{A} \leftarrow [n] \setminus A$
3: Sample $Y_i \sim \mathrm{Laplace}(2/(c_2\varepsilon))$ for all $i \in A$
4: if there exists $i \in A$ such that $|Y_i| \ge \frac{2}{c_2\varepsilon}\log\left(\frac{1}{1 - (1 - c_2)^{1/|A|}}\right)$ then
5:   return either p = q or p ≠ q with equal probability
6: end if
7: Draw a multiset S of Poisson(m) samples from p
8: Let Ni be the number of occurrences of the ith domain element in S
9: for $i \in A$ do
10:   if $|N_i + Y_i - mq_i| \ge \frac{2}{c_2\varepsilon}\log\left(\frac{1}{1 - (1 - c_2)^{1/|A|}}\right) + \max\left\{4\sqrt{mq_i\log n}, \log n\right\}$ then
11:     return p ≠ q
12:   end if
13: end for
14: $Z \leftarrow \frac{2}{m\alpha^2}\sum_{i\in A} \frac{(N_i - mq_i)^2 - N_i}{mq_i}$
15: Let T be the closest value to Z which is contained in the interval [0, 1]
16: Sample b ∼ Bernoulli(T)
17: if b = 1 then
18:   return p ≠ q
19: else
20:   return p = q
21: end if

Proof of Theorem 3 (sketch): We focus on the case where β = 1/3; the general case follows at the cost of a multiplicative log(1/β) in the sample complexity from a standard amplification argument. We will require the following tail bounds on Ni and Yi.

Claim 1. $|Y_i| \le \frac{2}{c_2\varepsilon}\log\left(\frac{1}{1 - (1 - c_2)^{1/|A|}}\right)$ simultaneously for all $i \in A$ with probability exactly 1 − c2.

Claim 2. $|N_i - mp_i| \le \max\left\{4\sqrt{mp_i\log n}, \log n\right\}$ simultaneously for all $i \in A$ with probability at least $1 - 2/n^{0.84}$.

Correctness. Correctness can be shown in a similar way to (Acharya et al., 2015): in short, if $m = \Omega(\sqrt{n}/\alpha^2)$, then the expectations are separated in the two cases, and the variance is bounded. A careful combination of the previous claims and Chebyshev's inequality guarantees correctness.

Privacy. We will prove (0, c2ε/2)-differential privacy, which in our setting will imply (ε, 0)-differential privacy (due to Claim 1). We first consider the possibility of rejecting in line 11. Noising our counts by the random variables Yi ensures that this step is (0, c2ε/4)-differentially private. Consider the difference in value of Z for two neighboring datasets D and D′, differing in one occurrence of symbol i:
$$Z(D) - Z(D') = \frac{2(N_i - mq_i - 1)}{m^2\alpha^2 q_i}.$$
Conditioning on the event that we did not return in line 11, we can show that
$$|N_i - mq_i| \le \frac{4\log(n/c_2)}{c_2\varepsilon} + \max\left\{4\sqrt{mq_i\log n}, \log n\right\}.$$
This implies that
$$|Z(D) - Z(D')| \le \frac{2}{m^2\alpha^2 q_i}\left(\frac{6\log(n/c_2)}{c_2\varepsilon} + 4\sqrt{mq_i\log n}\right).$$
Enforcing that each of these terms is at most c2ε/8 gives the conditions
$$m = \Omega\!\left(\frac{\sqrt{n\log(n/c_2)}}{\alpha^{3/2}\varepsilon}\right) \quad\text{and}\quad m \ge \left(\frac{64}{c_2\sqrt{c_1}}\right)^{2/3}\frac{(n\log n)^{1/3}}{\alpha^{5/3}\varepsilon^{2/3}}.$$
Since both terms are at most c2ε/8, this step is (0, c2ε/4)-differentially private. By composition of differential privacy, this gives the desired overall (0, c2ε/2)-differential privacy, and thus ε-pure differential privacy.
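For concreteness, here is a condensed Python sketch of Algorithm 1 (our reading of the pseudocode, not the authors' reference implementation; the sampler `samples_from_p`, the harness values, and the seed are assumptions for illustration). Constants c1, c2 are as fixed in the text.

```python
import numpy as np

C1, C2 = 1 / 4, 3 / 40  # constants fixed in the text

def privit(q, samples_from_p, m, alpha, eps, rng=None):
    """Condensed sketch of Algorithm 1. Returns True for "p = q", False for "p != q".

    `samples_from_p(k)` should return k i.i.d. draws from p over {0, ..., n-1};
    `m` is the nominal sample size (the algorithm draws Poisson(m) samples).
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q, dtype=float)
    n = len(q)

    A = np.where(q >= C1 * alpha / n)[0]                  # "heavy" domain elements (line 2)
    Y = rng.laplace(scale=2 / (C2 * eps), size=len(A))    # line 3

    # Lines 4-6: bail out with a random answer if any noise variable is atypically large.
    noise_bound = (2 / (C2 * eps)) * np.log(1 / (1 - (1 - C2) ** (1 / len(A))))
    if np.any(np.abs(Y) >= noise_bound):
        return bool(rng.integers(2))

    # Lines 7-8: draw Poisson(m) samples and form the counts.
    S = samples_from_p(rng.poisson(m))
    N = np.bincount(S, minlength=n).astype(float)

    # Lines 9-13: filtering step on the noised counts.
    slack = np.maximum(4 * np.sqrt(m * q[A] * np.log(n)), np.log(n))
    if np.any(np.abs(N[A] + Y - m * q[A]) >= noise_bound + slack):
        return False

    # Lines 14-21: chi^2-style statistic, truncated to [0, 1] and used as a coin bias.
    Z = (2 / (m * alpha ** 2)) * np.sum(((N[A] - m * q[A]) ** 2 - N[A]) / (m * q[A]))
    T = min(max(Z, 0.0), 1.0)
    return not rng.binomial(1, T)   # reject (p != q) with probability T

# Example usage on uniform q (hypothetical harness values):
n, m, alpha, eps = 1000, 20000, 0.1, 0.1
q = np.full(n, 1.0 / n)
rng = np.random.default_rng(0)
draw_from_q = lambda k: rng.integers(0, n, size=k)
print(privit(q, draw_from_q, m, alpha, eps, rng))  # accepts p = q most of the time here
```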
6. Experiments

We performed an empirical evaluation of our algorithm, Priv'IT, on synthetic datasets. All experiments were performed on a laptop computer with a 2.6 GHz Intel Core i7-6700HQ CPU and 8 GB of RAM. Significant discussion is required to provide a full comparison with prior work in this area, since performance of the algorithms varies depending on the regime. We compared our algorithm with two recent algorithms for differentially private hypothesis testing:
1. The Monte Carlo Goodness of Fit test with Laplace noise from (Gaboardi et al., 2016), MCGOF;
2. The projected Goodness of Fit test from (Kifer & Rogers, 2017), zCDP-GOF.

We note that we implemented a modified version of Priv'IT, which differs from Algorithm 1 in lines 14 to 21. In particular, we instead consider the statistic
$$Z = \sum_{i\in A} \frac{(N_i - mq_i)^2 - N_i}{mq_i}.$$
We add Laplace noise to Z, with scale parameter Θ(Δ/ε), where Δ is the sensitivity of Z, which guarantees (ε/2, 0)-differential privacy. Then, similarly to the other algorithms, we choose a threshold for this noised statistic such that we have the desired type I error. This algorithm can be analyzed to provide identical theoretical guarantees as Algorithm 1, but with the practical advantage that there are fewer parameters to tune.

To begin our experimental evaluation, we started with uniformity testing. Our experimental setup was as follows. The algorithms were provided q as the uniform distribution over [n]. The algorithms were also provided with samples from some distribution p. This (unknown) p was q for the case p = q, or a distribution which we call the Paninski construction for the case $d_{\mathrm{TV}}(p, q) \ge \alpha$. The Paninski construction is a distribution where half the elements of the support have mass (1 + α)/n and half have mass (1 − α)/n. We use this name for the construction as (Paninski, 2008) showed that this example is one of the hardest to distinguish from uniform: one requires $\Omega(\sqrt{n}/\alpha^2)$ samples to (non-privately) distinguish a random permutation of this construction from the uniform distribution. We fixed parameters ε = 0.1 and α = 0.1. In addition, recall that Proposition 1 implies that pure differential privacy (the privacy guaranteed by Priv'IT) is stronger than zCDP (the privacy guaranteed by zCDP-GOF). In particular, our guarantee of ε-pure differential privacy implies ε²/2-zCDP. As a result, we ran zCDP-GOF with a privacy parameter of 0.005-zCDP, which is equivalent to the amount of zCDP our algorithm provides.

Our experiments were conducted on a number of different support sizes n, ranging from 10 to 10600. For each n, we ran the testing algorithms with increasing sample sizes m in order to discover the minimum sample size at which the type I and type II errors were both empirically below 1/3. To determine these empirical error rates, we ran all algorithms 1000 times for each n and m, and recorded the fraction of the time each algorithm was correct. As the other algorithms take a parameter βI as a target type I error, we input 1/3 as this parameter.

The results of our first test are provided in Figure 1. The x-axis indicates the support size, and the y-axis indicates the minimum number of samples required. We plot three lines, which demonstrate the empirical number of samples required to obtain 1/3 type I and type II error for the different algorithms.

Figure 1. The sample complexities of Priv'IT, MCGOF, and zCDP-GOF for uniformity testing.

We can see that in this case, zCDP-GOF is the most statistically efficient, followed by MCGOF and Priv'IT. To explain this difference in statistical efficiency, we note that the theoretical guarantees of Priv'IT imply that it performs well even when data is sparsely sampled. More precisely, one of the benefits of our tester is that it can reduce the variance induced by elements whose expected number of occurrences is less than 1. Since none of these testers reach this regime (i.e., even zCDP-GOF at n = 10000 expects to see each element 10 times), we do not reap the benefits of Priv'IT. Ideally, we would run these algorithms on the uniform distribution at sufficiently large support sizes. However, since this is prohibitively expensive to do with thousands of repetitions (for any of these methods), we instead demonstrate the advantages of our tester on a different distribution.
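The experimental loop described above is straightforward to reproduce in outline; the sketch below (our illustration, with a hypothetical `tester` callable standing in for any of the three algorithms) builds the Paninski construction and estimates empirical error rates by repetition.

```python
import numpy as np

def paninski_construction(n, alpha):
    """Half the symbols get (1 + alpha)/n mass, half get (1 - alpha)/n (n even)."""
    p = np.full(n, 1.0 / n)
    p[::2] += alpha / n
    p[1::2] -= alpha / n
    return p

def empirical_error(tester, p, q, m, trials=1000, seed=0):
    """Fraction of trials on which `tester(sample, q) -> bool` (True = "p = q") answers wrongly."""
    rng = np.random.default_rng(seed)
    truth_is_null = bool(np.allclose(p, q))
    wrong = 0
    for _ in range(trials):
        sample = rng.choice(len(p), size=m, p=p)
        wrong += (tester(sample, q) != truth_is_null)
    return wrong / trials

# Hypothetical usage: for each support size n, sweep m upward and report the smallest m with
# empirical_error(tester, q, q, m) <= 1/3 and empirical_error(tester, paninski_construction(n, 0.1), q, m) <= 1/3.
```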
Our second test is conducted with q being a 2-histogram (a k-histogram is a distribution where the domain can be partitioned into k intervals such that the distribution is uniform over each interval), where all but a vanishing fraction of the probability mass is concentrated on a small, constant fraction of the support; in particular, n/200 support elements contained 1 − 10/n probability mass, though similar trends hold with modifications of these parameters. This serves as our proxy for a very large support, since now we will have elements which have a sub-constant expected number of occurrences. The algorithms are provided with samples from a distribution p, which is either q or a similar Paninski construction as before, where the total variation distance from q is placed on the support elements containing non-negligible mass. We ran the test on support sizes n ranging from 10 to 6800. All other parameters are the same as in the previous test.

Figure 2. The sample complexities of Priv'IT and zCDP-GOF for identity testing on a 2-histogram.

The results of our second test are provided in Figure 2. In this case, we compare Priv'IT and zCDP-GOF, and note that our test is slightly better for all support sizes n, though the difference can be pronounced or diminished depending on the construction of the distribution q. We found that MCGOF was incredibly inefficient on this construction: even for n = 400 it required 130000 samples, which is a factor of 10 worse than zCDP-GOF on a support of size n = 6800. To explain this phenomenon, we can inspect the contribution of a single domain element i to their statistic:
$$\frac{(N_i + Y_i - mq_i)^2}{mq_i}.$$
In the case where $mq_i \ll 1$ and p = q, this is approximately equal to $\frac{Y_i^2}{mq_i}$. The standard deviation of this term will be of the order $\frac{1}{mq_i\varepsilon^2}$, which can be made arbitrarily large as $mq_i \to 0$. While zCDP-GOF may naively seem susceptible to this same pitfall, their projection method appears to elegantly avoid it.

As a final test, we note that zCDP-GOF guarantees zCDP, while Priv'IT guarantees (vanilla) differential privacy. In our previous tests, our guarantee was ε-differential privacy, while theirs was $\frac{\varepsilon^2}{2}$-zCDP: by Proposition 1, our guarantees imply theirs. In the third test, we revisit uniformity testing, but when their guarantees imply ours. More specifically, again with ε = 0.1, we ran zCDP-GOF with the guarantee of $\frac{\varepsilon^2}{2}$-zCDP and Priv'IT with the guarantee of $\left(\frac{\varepsilon^2}{2} + \varepsilon\sqrt{2\log(1/\delta)}, \delta\right)$-differential privacy for various δ > 0. We note that δ is often thought in theory to be cryptographically small (such as $2^{-100}$), but we compare with a wide range of δ, both large and small: δ = 1/e^t for t ∈ {1, 2, 4, 8, 16}. This test was conducted on support sizes n ranging from 10 to 6000.

Figure 3. The sample complexities of Priv'IT and zCDP-GOF for uniformity testing, with approximate differential privacy.

The results of our third test are provided in Figure 3. We found that, for all δ tested, Priv'IT required fewer samples than zCDP-GOF. This is unsurprising for δ very large and small, since the differential privacy guarantees become very easy to satisfy, but we found it to be true even for moderate values of δ. This implies that if an analyst is satisfied with approximate differential privacy, she might be better off using Priv'IT, rather than an algorithm which guarantees zCDP.
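The parameter matching in these comparisons follows Proposition 1; here is a minimal sketch (our illustration) of the two conversions used to line up the guarantees in the first and third tests.

```python
import math

def pure_dp_to_zcdp(eps):
    """Proposition 1: (eps, 0)-differential privacy implies (eps^2 / 2)-zCDP."""
    return eps ** 2 / 2

def zcdp_to_approx_dp(rho, delta):
    """Proposition 1: rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta)), delta

eps = 0.1
rho = pure_dp_to_zcdp(eps)            # 0.005-zCDP, the parameter given to zCDP-GOF
for t in (1, 2, 4, 8, 16):            # the deltas used in the third test: delta = 1/e^t
    delta = math.exp(-t)
    print(t, zcdp_to_approx_dp(rho, delta))
```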
While the main focus of our evaluation was statistical in nature, we note that Priv'IT was more efficient in runtime than our implementation of MCGOF, and more efficient in memory usage than our implementation of zCDP-GOF. The former point was observed by noting that, in the same amount of time, Priv'IT was able to reach a trial corresponding to a support size of 20000, while MCGOF was only able to reach 10000. The latter point was observed by noting that zCDP-GOF ran out of memory at a support size of 11800. This is likely because zCDP-GOF requires matrix computations on a matrix of size O(n²). It is plausible that all of these implementations could be made more time and memory efficient, but we found our implementations to be sufficient for the sake of our comparison.

Acknowledgments

The authors would like to thank Jon Ullman for helpful discussions in the early stages of this work. The authors were supported by NSF CCF-1551875, CCF-1617730, CCF-1650733, and ONR N00014-12-1-0999.

References

Acharya, Jayadev, Daskalakis, Constantinos, and Kamath, Gautam. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems 28, NIPS '15, pp. 3577–3598. Curran Associates, Inc., 2015.

Adam, Nabil R. and Worthmann, John C. Security-control methods for statistical databases: A comparative study. ACM Computing Surveys (CSUR), 21(4):515–556, 1989.

Agrawal, Dakshi and Aggarwal, Charu C. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '01, pp. 247–255, New York, NY, USA, 2001. ACM.

Agresti, Alan. Categorical Data Analysis. Wiley, 2012.

Batu, Tugkan, Fischer, Eldar, Fortnow, Lance, Kumar, Ravi, Rubinfeld, Ronitt, and White, Patrick. Testing random variables for independence and identity. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, FOCS '01, pp. 442–451, Washington, DC, USA, 2001. IEEE Computer Society.

Bun, Mark and Steinke, Thomas. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Proceedings of the 14th Conference on Theory of Cryptography, TCC '16-B, pp. 635–658, Berlin, Heidelberg, 2016. Springer.

Canonne, Clément L. A survey on distribution testing: Your data is big. But is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22(63), 2015.

Canonne, Clément L. A short note on Poisson tail bounds. http://www.cs.columbia.edu/~ccanonne/files/misc/2017-poissonconcentration.pdf, 2017.

Canonne, Clément L., Diakonikolas, Ilias, Gouleakis, Themis, and Rubinfeld, Ronitt. Testing shape restrictions of discrete distributions. In Proceedings of the 33rd Symposium on Theoretical Aspects of Computer Science, STACS '16, pp. 25:1–25:14, 2016.

Dalenius, Tore. Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 15:429–444, 1977.

Daskalakis, Constantinos, Dikkala, Nishanth, and Kamath, Gautam. Testing Ising models. arXiv preprint arXiv:1612.03147, 2016.

Diakonikolas, Ilias and Kane, Daniel M. A new approach for testing properties of discrete distributions. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS '16, pp. 685–694, Washington, DC, USA, 2016. IEEE Computer Society.
Diakonikolas, Ilias, Hardt, Moritz, and Schmidt, Ludwig. Differentially private learning of structured discrete distributions. In Advances in Neural Information Processing Systems 28, NIPS '15, pp. 2566–2574. Curran Associates, Inc., 2015.

Dinur, Irit and Nissim, Kobbi. Revealing information while preserving privacy. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '03, pp. 202–210, New York, NY, USA, 2003. ACM.

Dwork, Cynthia. Differential privacy: A survey of results. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC '08, pp. 1–19, Berlin, Heidelberg, 2008. Springer.

Dwork, Cynthia and Roth, Aaron. The Algorithmic Foundations of Differential Privacy. Now Publishers, Inc., 2014.

Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pp. 265–284, Berlin, Heidelberg, 2006. Springer.

Fisher, Ronald A. The Design of Experiments. Macmillan, 1935.

Gaboardi, Marco, Lim, Hyun-Woo, Rogers, Ryan M., and Vadhan, Salil P. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16, pp. 1395–1403. JMLR, Inc., 2016.

Homer, Nils, Szelinger, Szabolcs, Redman, Margot, Duggan, David, Tembe, Waibhav, Muehling, Jill, Pearson, John V., Stephan, Dietrich A., Nelson, Stanley F., and Craig, David W. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 4(8):1–9, 2008.

Johnson, Aaron and Shmatikov, Vitaly. Privacy-preserving data exploration in genome-wide association studies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pp. 1079–1087, New York, NY, USA, 2013. ACM.

Kakizaki, Kazuya, Sakuma, Jun, and Fukuchi, Kazuto. Differentially private chi-squared test by unit circle mechanism. In Proceedings of the 34th International Conference on Machine Learning, ICML '17. JMLR, Inc., 2017.

Kifer, Daniel and Rogers, Ryan M. A new class of private chi-square tests. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS '17, pp. 991–1000. JMLR, Inc., 2017.

Klar, Bernhard. Bounds on tail probabilities of discrete distributions. Probability in the Engineering and Informational Sciences, 14(02):161–171, 2000.

Levi, Reut, Ron, Dana, and Rubinfeld, Ronitt. Testing properties of collections of distributions. Theory of Computing, 9(8):295–347, 2013.

Paninski, Liam. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.

Pearson, Karl. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302):157–175, 1900.

Pollard, David. A few good inequalities. http://www.stat.yale.edu/~pollard/Books/Mini/Basic.pdf, 2015.

Rao, Jon N.K. and Scott, Alastair J. The analysis of categorical data from complex sample surveys: Chi-squared tests for goodness of fit and independence in two-way tables. Journal of the American Statistical Association, 76(374):221–230, 1981.
Simmons, Sean, Sahinalp, Cenk, and Berger, Bonnie. Enabling privacy-preserving GWAS in heterogeneous human populations. Cell Systems, 3(1):54–61, 2016.

Tan, Vincent Y.F., Anandkumar, Animashree, and Willsky, Alan S. Error exponents for composite hypothesis testing of Markov forest distributions. In Proceedings of the 2010 IEEE International Symposium on Information Theory, ISIT '10, pp. 1613–1617, Washington, DC, USA, 2010. IEEE Computer Society.

Uhler, Caroline, Slavković, Aleksandra, and Fienberg, Stephen E. Privacy-preserving data sharing for genome-wide association studies. The Journal of Privacy and Confidentiality, 5(1):137–166, 2013.

Valiant, Gregory and Valiant, Paul. An automatic inequality prover and instance optimal identity testing. In Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science, FOCS '14, pp. 51–60, Washington, DC, USA, 2014. IEEE Computer Society.

Wang, Yue, Lee, Jaewoo, and Kifer, Daniel. Differentially private hypothesis testing, revisited. arXiv preprint arXiv:1511.03376, 2015.

Yu, Fei, Fienberg, Stephen E., Slavković, Aleksandra B., and Uhler, Caroline. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics, 50:133–141, 2014.