# Locally Private Hypothesis Testing

Or Sheffet¹

We initiate the study of differentially private hypothesis testing in the local model, under both the standard (symmetric) randomized-response mechanism (Warner, 1965; Kasiviswanathan et al., 2008) and the newer (non-symmetric) mechanisms (Bassily & Smith, 2015; Bassily et al., 2017). First, we study the general framework of mapping each user's type into a signal and show that the problem of finding the maximum-likelihood distribution over the signals is feasible. Then we discuss the randomized-response mechanism and show that, in essence, it maps the null and alternative hypotheses onto new sets, an affine translation of the original sets. We then give sample complexity bounds for identity and independence testing under randomized response. We then move to the newer non-symmetric mechanisms and show that there too the problem of finding the maximum-likelihood distribution is feasible. Under the mechanism of Bassily et al. (2017) we give identity and independence testers with better sample complexity than the testers in the symmetric case, and we also propose a χ²-based identity tester which we investigate empirically.

1. Introduction

Differential privacy is a mathematically rigorous notion of privacy that has become the de-facto gold standard of privacy-preserving data analysis. Informally, ϵ-differential privacy bounds the effect of a single datapoint on any result of the computation by ϵ. In recent years the subject of private hypothesis testing has been receiving increasing attention (see Related Work below). However, by and large, the focus of private hypothesis testing is on the centralized model (or the curated model), where a single trusted entity holds the sensitive details of n users and runs the private hypothesis tester on the actual data.

¹Dept. of Computing Science, University of Alberta. Correspondence to: Or Sheffet.
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

In contrast, the subject of this work is private hypothesis testing in the local model (or the distributed model), where an ϵ-differentially private mechanism is applied independently to each datum. This model, which alleviates trust (each user can run the mechanism independently on her own and release the noisy signal from the mechanism), has gained much popularity in recent years, especially since it was adopted by Google's RAPPOR (Erlingsson et al., 2014) and Apple (Apple, 2017). And yet, despite its popularity, and the fact that recent works (Bassily & Smith, 2015; Bassily et al., 2017) have shown the space of possible locally-private mechanisms to be richer than what was originally thought, little is known about private hypothesis testing in the local model.

1.1. Background: Local Differential Privacy

We view the local differentially private model as a signaling scheme. Each datum / user has a type x taken from a predefined and publicly known set of possible types X whose size is T = |X|. The differentially private mechanism is merely a randomized function M : [n] × X → S, mapping each possible type x of the i-th datum to some set of possible signals S, which we assume to be ϵ-differentially private: for any index i, any pair of types x, x′ ∈ X and any signal s ∈ S it holds that Pr[M(i, x) = s] ≤ e^ϵ · Pr[M(i, x′) = s].¹ In our most general results (Theorems 1 and 9), we ignore the fact that M is ϵ-differentially private, and just refer to any signaling scheme that transforms one domain (namely, X) into another (S). For example, a surveyor might unify rarely occurring types under the category of "other", or perhaps users report their types over noisy channels, etc. We differentiate between two types of signaling schemes: the symmetric (or index-oblivious) variety, and the non-symmetric (index-aware) type.
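The local-DP constraint above is easy to check mechanically for any finite signaling scheme. Below is a minimal sketch; the matrix encoding (rows indexed by signals, columns by types) and the helper name are ours, not the paper's:

```python
import math

def is_eps_dp(M, eps, tol=1e-12):
    """Check the local-DP constraint for one user's signaling scheme.

    M[s][x] = Pr[signal s | type x]. The scheme is eps-DP iff for every
    signal s and every pair of types x, x' we have
    M[s][x] <= e^eps * M[s][x'].
    """
    bound = math.exp(eps)
    for row in M:                       # one row per possible signal s
        if max(row) > bound * min(row) + tol:
            return False
    return True
```

For instance, a two-type scheme reporting the true type with probability 3/4 satisfies ln(3)-DP but not ln(2)-DP.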
A local signaling mechanism is called symmetric if it is independent of the index of the datum, namely, if for any i ≠ j we have that M(i, x) = M(j, x) =: M(x). A classic example of such a mechanism is randomized response, which actually dates back to before differential privacy was defined (Warner, 1965) and was first put to use in differential privacy in (Kasiviswanathan et al., 2008). Each user / datum draws her own signal from the set S = X, skewing the probability ever-so-slightly in favor of the original type. I.e., if the user's type is x, then the reported signal is

x, w.p. e^ϵ / (T − 1 + e^ϵ);
x′, for any other x′ ∈ X, w.p. 1 / (T − 1 + e^ϵ).

The utility of the above-mentioned symmetric mechanism scales polynomially with T (or rather, with |S|), which motivated the question of designing local mechanisms with error scaling logarithmically in T. This question was recently answered in the affirmative by the works of Bassily and Smith (2015) and Bassily et al. (2017), whose mechanisms are not symmetric. In fact, both of them work by presenting each user i with a mapping f_i : X → S (the mapping itself is chosen randomly, but it is public, so we treat it as a given), and the user then runs the standard randomized-response mechanism on the signals, using f_i(x) as the more-likely signal.

¹For simplicity, we assume S, the set of possible signals, is discrete. Note that this doesn't exclude mechanisms such as adding Gaussian/Gamma noise to a point in R^d — such mechanisms require X to be some bounded subset of R^d and use the bound to set the noise appropriately. Therefore, the standard approach of discretizing X and projecting the noisy point to the closest point in the grid yields a finite set of signals S.
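The randomized-response rule above can be sketched in a few lines; the helper names are ours, and `types` plays the role of X:

```python
import math
import random

def rr_probabilities(x, types, eps):
    # Probability vector of the reported signal for a user of true type x:
    # e^eps/(T-1+e^eps) on x itself, 1/(T-1+e^eps) on every other type.
    T = len(types)
    e = math.exp(eps)
    denom = T - 1 + e
    return [e / denom if t == x else 1.0 / denom for t in types]

def randomized_response(x, types, eps, rng=random):
    # Draw one signal according to the distribution above.
    return rng.choices(types, weights=rr_probabilities(x, types, eps), k=1)[0]
```

Note that the ratio between the largest and smallest reporting probabilities is exactly e^ϵ, which is why the mechanism is ϵ-differentially private.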
(In fact, in both schemes S = {−1, 1}: in (Bassily & Smith, 2015) f_i is merely the j-th coordinate of a hashing of the types, where j and the hashing function are publicly known, and in (Bassily et al., 2017) f_i maps a u.a.r.-chosen subset of X to 1 and its complement to −1.²) So given f_i, the user then tosses her own private random coins to determine which signal she broadcasts. Therefore, each user's mechanism can be summarized in an (|S| × |X|)-matrix, where M_i(s, x) is the probability that a user of type x sends the signal s. For example, using the mechanism of (Bassily et al., 2017), each user whose type maps to 1 sends signal 1 with probability e^ϵ/(1+e^ϵ) and signal −1 with probability 1/(1+e^ϵ). Namely, M_i(f_i(x), x) = e^ϵ/(1+e^ϵ) and M_i(−f_i(x), x) = 1/(1+e^ϵ), where f_i is the mapping X → {−1, 1} set for user i.

²In both works, much effort is put into first reducing T to the most frequent n types, and then running the counting algorithm. Regardless, the end-counts / collection of users' signals are the ones we care about for the sake of hypothesis testing.

1.2. Our Contribution and Organization

This work initiates (to the best of our knowledge) the theory of differentially private hypothesis testing in the local model. First we survey related work and preliminaries. Then, in Section 3, we examine the symmetric case and show that any mechanism (not necessarily a differentially private one) yields a distribution on the signals for which finding a maximum-likelihood hypothesis is feasible, assuming the set of possible hypotheses is convex. Then, focusing on the classic randomized-response mechanism, we show that the problem of maximizing the likelihood of the observed signals is strongly convex and thus simpler than the original problem. More importantly, in essence
we give a characterization of hypothesis testing under randomized response: the symmetric locally-private mechanism translates the original null hypothesis H0 (and the alternative H1) by a known affine translation into a different set ϕ(H0) (resp. ϕ(H1)). Hence, hypothesis testing under randomized response boils down to discerning between two different (and considerably closer in total-variation distance) sets, but in the exact same model as in standard hypothesis testing, as all signals were drawn from the same hypothesis in ϕ(H0). As an immediate corollary we give bounds on identity testing (Corollary 5) and independence testing (Theorem 6) under randomized response. (The latter requires some manipulations and is far less straightforward than the former.) The sample complexity (under certain simplifying assumptions) of both problems is proportional to T^2.5.

In Section 4 we move to the non-symmetric local model. Again, we start with a general result showing that in this case too, finding a hypothesis that maximizes the likelihood of the observed signals is feasible when the hypothesis-set is convex. We then focus on the mechanism of Bassily et al. (2017) and show that it also makes the problem of finding a maximum-likelihood hypothesis strongly convex. We then give a simple identity tester under this scheme whose sample complexity is proportional to T², and is thus more efficient than any tester under standard randomized response. Similarly, we also give an independence tester with a similar sample complexity. In Section 4.2 we empirically investigate alternative identity testing and independence testing based on Pearson's χ²-test in this non-symmetric scheme, and identify a couple of open problems in this regime.

1.3.
Related Work

Several works have looked at the intersection of differential privacy and statistics (Dwork & Lei, 2009; Smith, 2011; Chaudhuri & Hsu, 2012; Duchi et al., 2013a; Dwork et al., 2015), mostly focusing on robust statistics; but only a handful of works rigorously study the significance and power of hypothesis testing under differential privacy. Vu and Slavkovic (2009) looked at the sample size for privately testing the bias of a coin. Johnson and Shmatikov (2013), Uhler et al. (2013) and Yu et al. (2014) focused on the Pearson χ²-test (the simplest goodness-of-fit test), showing that the noise added by differential privacy vanishes asymptotically as the number of datapoints goes to infinity, and propose a private χ²-based test which they study empirically. Wang et al. (2015) and Gaboardi et al. (2016), who noticed the issues with both of these approaches, revised the statistical tests themselves to also incorporate the noise added by the private computation. Cai et al. (2017) give a private identity tester based on a noisy χ²-test over large bins, Sheffet (2017) studies private Ordinary Least Squares using the JL transform, and Karwa and Vadhan (2018) give matching upper- and lower-bounds on the confidence intervals for the mean of a population. All of these works, however, deal with the centralized model of differential privacy. Perhaps closest to our work are the works of Duchi et al. (2013a; 2013b), who give matching upper- and lower-bounds on robust estimators in the local model. And while their lower bounds do inform as to the sample complexity's dependency on ϵ^−2, they do not ascertain the sample complexity's dependency on the size of the domain (T) that we get in Section 3.
Moreover, these works disregard independence testing (and in fact (Duchi et al., 2013b) focus on mean estimation, so they apply randomized response to each feature independently, generating a product distribution even when the input isn't sampled from a product distribution). And so, to the best of our knowledge, no work has focused on hypothesis testing in the local model, let alone in the (relatively new) non-symmetric local model. Lastly, developed concurrently to our work, Gaboardi and Rogers (2018) study the asymptotic power of a variety of chi-squared-based hypothesis tests in the local model.

2. Preliminaries, Notation and Background

Notation. We use lower-case letters to denote scalars, bold characters to denote vectors and CAPITAL letters to denote matrices. So 1 denotes the number, 1 denotes the all-1 vector, and 1_{X×X} denotes the all-1 matrix over a domain X. We use e_x to denote the standard basis vector with a single 1 in the coordinate corresponding to x. To denote the x-coordinate of a vector v we use v(x), and to denote the (x, x′)-coordinate of a matrix M we use M(x, x′). For a given vector v, we use diag(v) to denote the matrix whose diagonal entries are the coordinates of v. For any natural n, we use [n] to denote the set {1, 2, ..., n}.

Distances and norms. Unless specified otherwise, ∥v∥ refers to the L2-norm of v, whereas ∥v∥_1 refers to the L1-norm. We also denote ∥v∥_{2/3} = (Σ_i |v_i|^{2/3})^{3/2}. For a matrix, ∥M∥_1 denotes (as usual) the maximum absolute column sum. We identify a distribution p over a domain X with a T-dimensional vector with non-negative entries that sum to 1. This defines the total variation distance between two distributions: d_TV(p, q) = (1/2)·∥p − q∥_1. (On occasion, we will apply d_TV to vectors that aren't distributions, but rather nearby estimations; in those cases we use the same definition: half of the L1-norm.) It is known that the TV-distance is a metric over distributions.
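The two quantities above can be computed directly; a small illustrative sketch (the function names are ours):

```python
def total_variation(p, q):
    # d_TV(p, q) = (1/2) * ||p - q||_1
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def norm_two_thirds(v):
    # ||v||_{2/3} = (sum_i |v_i|^(2/3))^(3/2); the quasi-norm appearing in
    # the Valiant-Valiant identity-testing bound quoted later.
    return sum(abs(x) ** (2.0 / 3.0) for x in v) ** 1.5
```

For example, the uniform distribution over 4 elements has 2/3-norm (4 · 4^(−2/3))^(3/2) = 2.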
We also use the χ²-divergence to measure the difference between two distributions: d_{χ²}(p, q) = Σ_x (p(x) − q(x))² / q(x). The χ²-divergence is not symmetric and can be infinite; however, it is non-negative and is zero only when p = q. We refer the reader to (Sason & Verdú, 2016) for more properties of the total-variation distance and the χ²-divergence.

Differential Privacy. An algorithm A is called ϵ-differentially private if for any two datasets D and D′ that differ only on the details of a single user and any set of outputs O, we have that Pr[A(D) ∈ O] ≤ e^ϵ · Pr[A(D′) ∈ O]. The unacquainted reader is referred to the Dwork-Roth monograph (Dwork & Roth, 2014) as an introduction to the rapidly-growing field of differential privacy.

Hypothesis testing. The problem of hypothesis testing is to test whether a given set of samples was drawn from a distribution satisfying the null hypothesis or the alternative hypothesis. Thus, the null hypothesis is merely a set of possible distributions H0, and the alternative is a disjoint set H1. Hypothesis testing boils down to estimating a test statistic θ whose distribution has been estimated under the null hypothesis. We can thus reject the null hypothesis if the value of θ is highly unlikely, or accept the null hypothesis otherwise. We call an algorithm a tester if the acceptance (in the completeness case) or rejection (in the soundness case) happens with probability ≥ 2/3. Standard amplification techniques (return the median of independent tests) reduce the error probability from 1/3 to any β > 0 at the expense of increasing the sample complexity by a factor of O(log(1/β)); hence we focus on achieving a constant error probability. One of the most prevalent and basic tests is identity testing, where the null hypothesis is composed of a single distribution H0 = {p} and our goal is to accept if the samples are drawn from p and reject if they were drawn from any other α-far (in d_TV) distribution.
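The χ²-divergence defined above likewise admits a one-line implementation (an illustrative sketch; the helper name is ours):

```python
def chi_squared_divergence(p, q):
    # d_{chi^2}(p, q) = sum_x (p(x) - q(x))^2 / q(x).
    # Not symmetric in (p, q); blows up when some q(x) = 0 while p(x) > 0.
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
```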
Another extremely common tester is for independence, when X is composed of several features (i.e., X = X_1 × X_2 × ... × X_d) and the null hypothesis is composed of all product distributions H0 = {p_1 ⊗ ... ⊗ p_d}, where each p_j is a distribution on the j-th feature X_j.

Miscellaneous. We use M ⪰ 0 to denote that M is a positive semi-definite (PSD) matrix, and M ⪰ N to denote that (M − N) ⪰ 0. We use M† to denote M's pseudo-inverse. We emphasize that we made no effort to minimize constants in our proofs, and only strived to obtain asymptotic bounds (O(·), Ω(·)).

3. Symmetric Signaling Scheme

Recall, in the symmetric signaling scheme, each user's type is mapped through a random function M into a set of signals S. This mapping is index-oblivious: each user of type x ∈ X sends the signal s with the same probability Pr[M(x) = s]. We denote by G the (|S| × |X|)-matrix whose entries are Pr[M(x) = s], and its s-th row by g_s. Note that all entries of G are non-negative and that for each x we have ∥G e_x∥_1 = 1. By garbling each datum i.i.d., we observe the new dataset (y_1, y_2, ..., y_n) ∈ S^n.

Theorem 1. For any convex set H of hypotheses, the problem of finding the max-likelihood p ∈ H generating the observed signals (y_1, ..., y_n) is poly-time solvable.

Proof. Since G(s, x) describes the probability that a user of type x sends the signal s, any distribution p ∈ H over the types in X yields a distribution on S where Pr[user sends s] = Σ_{x∈X} G(s, x)·p(x) = g_s^T p. Therefore, given signals (y_1, ..., y_n) summarized as a signals-histogram (n_s)_{s∈S}, the likelihood of these signals is given by L(p; y_1, ..., y_n) = Π_i g_{y_i}^T p = Π_{s∈S} (g_s^T p)^{n_s} = exp(Σ_s n_s · log(g_s^T p)). Thus, the gradient of the (normalized) negative log-loss function is ∇f = −(1/n) Σ_{s∈S} (n_s / (g_s^T p)) · g_s, and its Hessian is given by the matrix (1/n) Σ_{s∈S} (n_s / (g_s^T p)²) · g_s g_s^T. Clearly, as a non-negative sum of rank-1 matrices, the Hessian is a PSD matrix, so our loss function is convex.
Known poly-time algorithms for minimizing a convex function over a convex set (e.g. (Zinkevich, 2003)) conclude the proof.

Unfortunately, in general the solution to this problem has no closed form (to the best of our knowledge). However, we can find a closed-form solution under the assumption that G isn't just any linear transformation but rather one that induces a probability distribution over S, the assumption that |S| ≤ |X| (all applications we are aware of use fewer signals than user-types), and one extra condition.

Corollary 2. Let q̃ be the |S|-dimensional vector given by q̃(s) = n_s/n. Given that |S| ≤ |X|, that G is a full-rank matrix whose columns each sum to 1, and assuming that (G†q̃ + ker(G)) ∩ H ≠ ∅, then any vector in H of the form p* + u, where p* = G†q̃ and u ∈ ker(G), is a hypothesis that maximizes the likelihood of the given signals (y_1, ..., y_n).

Proof deferred to the supplementary material, Section B.

3.1. Hypothesis Testing under Randomized Response

We now aim to check the effect of a particular G, the one given by the randomized-response mechanism. In this case S = X, and we denote by G the matrix whose entries are

G(x, x′) = ρ + γ, if x = x′; ρ, otherwise,

where ρ := 1/(T − 1 + e^ϵ) and γ := (e^ϵ − 1)/(T − 1 + e^ϵ). We get that G = ρ·1_{X×X} + γI (where 1_{X×X} is the all-1 matrix). In particular, all vectors g_s = g_x, which correspond to the rows of G, are of the form g_x = ρ·1 + γ·e_x. It follows that for any probability distribution p ∈ H we have that Pr[seeing signal x] = g_x^T p = ρ + γ·p(x). We have therefore translated any p ∈ H (over X) to a hypothesis q over S (and in this case S = X), using the affine transformation ϕ(p) = ρ·1 + γ·p = Tρ·u_X + γ·p, where u_X denotes the uniform distribution over X. (Indeed, γ = 1 − Tρ, an identity we will often apply.) At the risk of overburdening notation, we use ϕ to denote the same transformation over scalars, vectors and even sets (applying ϕ to each vector in the set). Since ϕ is injective, we have therefore discovered the following theorem.

Theorem 3.
Under the classic randomized-response mechanism, testing for any hypothesis H0 (or comparing H0 against the alternative H1) of the original distribution translates into testing for the hypothesis ϕ(H0) (or ϕ(H0) against ϕ(H1)) for generating the signals y_1, ..., y_n.

Theorem 3 seems very natural and simple, and yet (to the best of our knowledge) it was never put into words. Moreover, it is simple to see that under standard randomized response, our log-loss function is in fact strongly convex, and therefore finding p becomes drastically more efficient (see, for example, (Hazan et al., 2006)).

Claim 4. Given signals y_1, ..., y_n generated using standard randomized response with parameter ϵ < 1, our log-loss function is Θ(ϵ² · min_x{n_x} / n)-strongly convex.

Note that in expectation n_x ≥ ρn, hence with overwhelming probability we have min_x n_x ≥ n/(2T). The proof is fairly straightforward and is deferred to the supplementary material, Section B.

A variety of corollaries follow from Theorem 3. In particular, a variety of matching sample-complexity upper- and lower-bounds translate automatically into the realm of making such hypothesis tests over the outcomes of the randomized-response mechanism. We focus here on two of the most prevalent tests: identity testing and independence testing.

Identity Testing. Perhaps the simplest of all hypothesis tests is to test whether a given sample was generated according to a given distribution or not. Namely, the null hypothesis is a single hypothesis H0 = {p}, and the alternative is H1 = {q : d_TV(p, q) ≥ α} for a given parameter α. The seminal work of Valiant and Valiant (2014) discerns that (roughly) Θ(∥p∥_{2/3}/α²) samples are sufficient and necessary for correctly rejecting or accepting the null hypothesis w.p. 2/3.³ Here, the problem of identity testing under standard randomized response reduces to the problem of hypothesis testing between ϕ(H0) = {ρ·1 + γ·p : p ∈ H0} and ϕ(H1) = {ϕ(q) : q satisfying d_TV(p, q) ≥ α}.
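The affine translation ϕ(p) = ρ·1 + γ·p of Theorem 3, with ρ = 1/(T − 1 + e^ϵ) and γ = (e^ϵ − 1)/(T − 1 + e^ϵ), is easy to apply and to invert explicitly; a minimal sketch (the helper names are ours):

```python
import math

def _rho_gamma(T, eps):
    e = math.exp(eps)
    return 1.0 / (T - 1 + e), (e - 1.0) / (T - 1 + e)

def phi(p, eps):
    # phi(p) = rho * 1 + gamma * p: the signal distribution induced by
    # running randomized response on samples from p.
    rho, gamma = _rho_gamma(len(p), eps)
    return [rho + gamma * pi for pi in p]

def phi_inverse(q, eps):
    # Since phi is injective (gamma > 0), it can be undone exactly:
    # p = (q - rho * 1) / gamma.
    rho, gamma = _rho_gamma(len(q), eps)
    return [(qi - rho) / gamma for qi in q]
```

Note that ϕ maps distributions to distributions (the coordinates of ϕ(p) still sum to 1) and shrinks total-variation distances by exactly the factor γ, which is why the translated hypotheses are "considerably closer" than the originals.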
Corollary 5. In order to do identity testing under standard randomized response with confidence and power 2/3, it is necessary and sufficient that we get Θ(T^2.5 / (ϵ²α²)) samples.

The proof uses the results of (Valiant & Valiant, 2014) as a black box and is mainly composed of calculations, so it is deferred to the supplementary material, Section B.

Independence Testing. Another prevalent hypothesis test, over a domain X where each type is composed of multiple features, is independence testing. Denoting X = X_1 × X_2 × ... × X_d as a domain with d possible features (hence T = |X| = Π_j |X_j| =: Π_j T_j), our goal is to discern whether an observed sample is drawn from a product distribution or from a distribution α-far from any product distribution. In particular, the null hypothesis in this case is a complex one: H0 = {p̄ = p_1 ⊗ p_2 ⊗ ... ⊗ p_d} and the alternative is H1 = {q : min_{p̄∈H0} d_TV(q, p̄) ≥ α}. To the best of our knowledge, the (current) tester with smallest sample complexity is that of Acharya et al. (2015), which requires O(√(Π_j T_j)/α²) i.i.d. samples. We now consider the problem of testing for independence under standard randomized response. Our goal is to prove the following theorem.

Theorem 6. There exists an algorithm that takes n = Ω((d²·(max_j{T_j})² + √T) · T²/(α²ϵ²)) signals generated by applying standard randomized response (with ϵ < 1) on n samples drawn from a distribution p, and with probability 2/3 accepts if p ∈ H0, or rejects if p ∈ H1. Moreover, no algorithm can achieve such a guarantee using n = o(T^{5/2}/(α²ϵ²)) signals.

Note there are at least two types per feature, so d ≤ log₂(T). Should all T_j's be equal we have (T_j)² = T^{2/d}, making T^2.5/(α²ϵ²) the leading term in the above bound.

Proof. Theorem 3 implies we are comparing ϕ(H0) = {ρ·1_X + γ·(p_1 ⊗ ... ⊗ p_d)} to ϕ(H1) = {ρ·1_X + γ·q : q ∈ H1}.

³For the sake of brevity, we ignore pathological examples where by removing α probability mass from p we obtain a vector of significantly smaller ∥·∥_{2/3}-norm.
Note that ϕ(H0) is not a subset of the product distributions over X but rather a convex combination (with publicly known weights) of the uniform distribution and H0; so we cannot run the independence tester of Acharya et al. on the signals as a black box. Luckily, it holds that ϕ(H1) is far from all distributions in ϕ(H0): for each q ∈ H1 and p̄ ∈ H0 we have d_TV(ϕ(q), ϕ(p̄)) = γ·d_TV(q, p̄) ≥ γα. And so we leverage the main result of Acharya et al. ((2015), Theorem 2): we first find a distribution ϕ(z̄) = ρ·1 + γ·z̄ ∈ ϕ(H0) such that if the signals were generated by some ϕ(p̄) = ρ·1_X + γ·p̄ ∈ ϕ(H0) then d_{χ²}(ϕ(z̄), ϕ(p̄)) ≤ γ²α²/500, and then test whether indeed the signals are likely to be generated by a distribution close to ϕ(z̄), using Acharya et al.'s algorithm. We now give our procedure for finding the product distribution z̄. Per feature j, given the j-th feature of the signals y_1^j, ..., y_n^j, where each x^j ∈ X_j appears n_{x^j} times, our procedure for finding z̄^j is as follows.

0. (Preprocessing:) Denote τ = α/(10d·T_j). We call any type x^j where n_{x^j} ≤ n·(1/T_j + γτ) "small", and otherwise we say type x^j is "large". Ignore all small types, and learn z̄^j only over large types. (For brevity, we refer to n as the number of signals on large types and to T_j as the number of large types.)

1. Set the distribution z^j as the add-1 estimator of Kamath et al. (2015) for the signals: z^j(x^j) = (1 + n_{x^j}) / (n + T_j).

2. Compute z̄^j = (1/γ)·(z^j − ((1 − γ)/T_j)·1_{X_j}).

Once z̄^j is found for each feature j, set z̄ = z̄^1 ⊗ ... ⊗ z̄^d and run the test of Acharya et al. (2015) (Theorem 2) with ϕ(z̄), looking only at the large types from each feature, setting the distance parameter to αγ/2 and the confidence to 1/9, to decide whether to accept or reject.

In order to successfully apply Acharya et al.'s test, a few conditions need to hold. First, the provided distribution ϕ(z̄) should be close to ϕ(H0). This, however, holds trivially, as z̄ is a product distribution. Secondly, we need ϕ(z̄) and ϕ(p̄) to be close in χ²-divergence, as we argue next.

Lemma 7.
Suppose that n, the number of signals, is at least Ω((d²/(α²γ²))·max_j{T_j}). Then the above procedure creates distributions z̄^j such that the product distribution z̄ = z̄^1 ⊗ z̄^2 ⊗ ... ⊗ z̄^d satisfies the following property. If the signals y_1, ..., y_n were generated by ϕ(p̄) for some product distribution p̄ = p_1 ⊗ ... ⊗ p_d, then w.p. ≥ 8/9 we have that d_{χ²}(ϕ(z̄), ϕ(p̄)) ≤ γ²α²/1000.

We defer the proof of Lemma 7 to Section B in the supplementary material. Next, either completeness or soundness must happen: either the signals were taken from randomized response on a product distribution, or they were generated by a distribution γα/2-far from ϕ(H0). If no type of any feature was deemed "small", this condition clearly holds; but we need to argue it continues to hold even when we run our tester on a strict subset of X composed only of the large types in each feature. Completeness is straightforward: since we remove types feature by feature, the types now come from a product distribution p̄_large = p¹_large ⊗ ... ⊗ p^d_large, where each p^j_large is the restriction of p^j to the large types of feature j. Soundness, however, is more intricate. We partition X into two subsets: AllLarge = {(x¹, x², ..., x^d) ∈ X : ∀j, x^j is large} and Rest = X \ AllLarge; and break q into q = η·q_Rest + (1 − η)·q_AllLarge, with η = Pr_q[Rest]. Claim 8 (proof deferred to the supplementary material) argues that η < α/2. Therefore, d_TV(q, q_AllLarge) ≤ α/2, implying that d_TV(ϕ(q_AllLarge), ϕ(H0)) > αγ − αγ/2 = αγ/2.

Claim 8. Assume the underlying distribution of the samples is q and that the number of signals is at least n = Ω((d²·(max_j T_j)²/(α²γ²))·log(d·max_j T_j)). Then w.p. ≥ 8/9 our preprocessing step marks certain types in each feature as "small" such that the probability (under q) of sampling a type (x¹, x², ..., x^d) such that ∃j, x^j is small is at most α/2.

So, given that both Lemma 7 and Claim 8 hold, we can use the test of Acharya et al., which requires a sample of size n = Ω(√T/(αγ)²).
Recall that ϵ < 1, so γ = Θ(ϵ/T), and we get that the sample size required for this last test is n = Ω(T^2.5/(α²ϵ²)). Moreover, for this last part, the lower bound of Acharya et al. (2015) still holds (for the same reason it holds in the identity-testing case): the lower bound is derived from the counterexample of testing whether the signals were generated from the uniform distribution (which clearly lies in ϕ(H0)) or from any distribution in a collection of perturbations which all belong to ϕ(H1) (see (Paninski, 2008) for more details). Each such distribution is γα-far from ϕ(H0), and so any tester for this particular construction requires √T/(αγ)²-many samples. This proves both the upper- and lower-bounds of Theorem 6.

4. Non-Symmetric Signaling Schemes

Let us recall the non-symmetric signaling schemes of (Bassily & Smith, 2015; Bassily et al., 2017). Each user, with true type x ∈ X, is assigned her own mapping f_i : X → S (the mapping is broadcast and publicly known). This sets her inherent signal to f_i(x), and then she runs standard (symmetric) randomized response on the signals, making the probability of sending her true signal f_i(x) e^ϵ-times greater than that of any other signal s ≠ f_i(x). In fact, let us allow an even broader view. Each user is given a mapping f_i : X → S, and denoting (as before) T = |X| and S = |S|, we identify this mapping with an (S × T)-matrix G_i. The column g_i^x = G_i·e_x is the probability distribution that a user of type x is going to use to pick which signal she broadcasts. (And so the guarantee of differential privacy is that for any signal s ∈ S and any two types x ≠ x′ we have that g_i^x(s) ≤ e^ϵ·g_i^{x′}(s).) Therefore, all entries in G_i are non-negative, and each column of G_i sums to 1, for every i. Similarly to the symmetric case, we first exhibit the feasibility of finding a maximum-likelihood hypothesis given the signals from the non-symmetric scheme.
Since we view which signal in S was sent, our likelihood mainly depends on the row vectors g_i^s. We prove the following theorem, whose proof is deferred to Section C in the supplementary material.

Theorem 9. For any convex set H, the problem of finding the max-likelihood p ∈ H generating the observed non-symmetric signals (y_1, ..., y_n) is poly-time solvable.

4.1. Hypothesis Testing under Non-Symmetric Locally-Private Mechanisms

Let us recap the differentially private scheme of Bassily et al. (2017). In this scheme, the mechanism uses solely two signals, S = {−1, 1} (so S = 2). For every i the mechanism sets G_i by picking u.a.r., for each x ∈ X, which of the two signals in S is more likely; the chosen signal gets a probability mass of e^ϵ/(1+e^ϵ), and the other gets probability mass 1/(1+e^ϵ). We denote by η the constant such that 1/2 + η = e^ϵ/(1+e^ϵ) and 1/2 − η = 1/(1+e^ϵ); namely, η = (e^ϵ − 1)/(2(e^ϵ + 1)) = Θ(ϵ) when ϵ < 1. Thus, for every s ∈ {−1, 1} the row vector g_i^s is chosen such that each coordinate is picked i.i.d. and uniformly from {1/2 − η, 1/2 + η}. (Obviously, there is dependence between g_i^1 and g_i^{−1}, as g_i^1 + g_i^{−1} = 1, but the distribution of g_i^1 is identical to that of g_i^{−1}.)

First we argue that for any distribution p, if n is sufficiently large, then w.h.p. over the generation of the G_i's and over the signals we view from each user, finding p̂ which maximizes the likelihood of the observed signals yields a good approximation to p. To that end, it suffices to argue that the function we optimize is Lipschitz and strongly convex.

Lemma 10. Fix δ > 0 and assume that the number of signals we observe is n = Ω(T³·log(1/δ)). Then w.p. ≥ 1 − δ it holds that the function f(p) we optimize (as given in Equation (1)) is 3√T-Lipschitz and (η²/2)-strongly convex over the subspace {x : x^T·1 = 0} (all vectors orthogonal to the all-1 vector).

The proof of Lemma 10, which is (in part) hairy due to the dependency between the matrix G_i and the signal y_i, is deferred to Section C in the supplementary material.
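To illustrate the optimization problem behind Theorems 1 and 9, here is a hedged sketch of maximizing the likelihood by projected gradient descent over the full simplex (taking H to be all distributions). The row encoding, helper names, step size, and iteration count are our own choices, not the paper's; `rows` stacks the observed row vectors g_i^{y_i}:

```python
import numpy as np

def neg_log_likelihood(p, rows):
    # f(p) = -(1/n) * sum_i log(<g_i^{y_i}, p>): the normalized negative
    # log-likelihood of the observed signals under hypothesis p.
    vals = np.clip(rows @ p, 1e-12, None)   # guard against log(0)
    return -np.mean(np.log(vals))

def gradient(p, rows):
    # grad f(p) = -(1/n) * sum_i g_i^{y_i} / <g_i^{y_i}, p>
    vals = np.clip(rows @ p, 1e-12, None)
    return -(rows / vals[:, None]).mean(axis=0)

def project_simplex(v):
    # Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def mle(rows, steps=3000, lr=0.03):
    # Projected gradient descent starting from the uniform distribution.
    p = np.full(rows.shape[1], 1.0 / rows.shape[1])
    for _ in range(steps):
        p = project_simplex(p - lr * gradient(p, rows))
    return p
```

Since the objective is convex (Theorems 1 and 9), any such first-order scheme converges; the paper's poly-time claim relies on generic convex-minimization results such as (Zinkevich, 2003).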
Identity Testing. Designing an identity test based solely on the maximum-likelihood estimate is feasible, due to results like Cesa-Bianchi et al. (2002) which allow us to compare the risk of the result p* of an online gradient descent algorithm to that of the original distribution p̄ which generated the signals. Through some manipulations one can (eventually) infer that |f(p*) − f(p̄)| = O(√T/√n). However, since strong convexity refers to the squared L2-norm of p* − p̄, the resulting bound is ∥p* − p̄∥_1² ≤ T·∥p* − p̄∥_2² = O(T^1.5/(η²√n)), which leads to a sample complexity bound proportional to T³/(αη)⁴. This bound is worse than the bounds in Section 3. We therefore design a different, simple identity tester in the local non-symmetric scheme, based on the estimator given in (Bassily et al., 2017). The tester itself, which takes as input a given distribution p, a distance parameter α > 0 and the n signals, is quite simple.

1. Given the n matrices G_1, ..., G_n and the n observed signals y_1, ..., y_n, compute the estimator θ = (1/n)·Σ_i (1/η)·(g_i^{y_i} − (1/2)·1).

2. If d_TV((1/(2η))·θ, p) ≤ α/2 then accept, else reject.

Theorem 11. Assume ϵ < 1. If we observe n = Ω((T/(αϵ))²) signals generated by a distribution q, then w.p. ≥ 2/3 over the matrices G_i we generate and the signals we observe, it holds that d_TV((1/(2η))·θ, q) ≤ α/2.

The correctness of the tester now follows from checking the two cases, where either p = q or d_TV(p, q) > α.

Proof. In the first part of the proof we assume the types of the n users were already drawn and are now fixed. We denote by x_i the type of user i. We denote the frequency vector f = (n_x/n)_{x∈X}, generated by counting the number of users of type x and normalizing by n. Given f, we examine the estimator θ. For each user i we have that (1/η)·(g_i^{y_i} − (1/2)·1) ∈ {−1, 1}^T. Because x_i, the type of user i, is fixed, then for each coordinate x′ ≠ x_i the signal y_i is independent of the x′-column in G_i (y_i depends solely on the entries in the x_i-column).
We thus have that $g_i^{y_i}(x')$ is distributed uniformly over $\{\frac{1}{2} - \eta, \frac{1}{2} + \eta\}$ and so $E[\frac{1}{\eta}(g_i^{y_i}(x') - \frac{1}{2})] = 0$. In contrast, $\Pr[\frac{1}{\eta}(g_i^{y_i}(x_i) - \frac{1}{2}) = 1] = \sum_{s \in \{-1,1\}} \Pr[\frac{1}{\eta}(g_i^{s}(x_i) - \frac{1}{2}) = 1 \text{ and } y_i = s] = 2 \cdot \frac{1}{2}(\frac{1}{2} + \eta) = \frac{1}{2} + \eta$. Therefore, $E[\frac{1}{\eta}(g_i^{y_i}(x_i) - \frac{1}{2})] = (\frac{1}{2} + \eta) - (\frac{1}{2} - \eta) = 2\eta$. It follows that $E[\frac{1}{\eta}(g_i^{y_i} - \frac{1}{2}\mathbf{1})] = 2\eta\, e_{x_i}$ and so $E[\theta] = 2\eta f$. Next we examine the variance of $\theta$, and argue the following (proof deferred to the supplementary material).

Proposition 12. $E[(\theta - 2\eta f)(\theta - 2\eta f)^\top] \preceq \frac{1}{n} I$.

As a result, the expected squared $L_2$-difference is $E[\|\theta - 2\eta f\|^2] = E[\mathrm{trace}((\theta - 2\eta f)(\theta - 2\eta f)^\top)] = \mathrm{trace}(E[(\theta - 2\eta f)(\theta - 2\eta f)^\top]) \leq \frac{T}{n}$. Chebyshev's inequality therefore assures us that $\Pr[\|\theta - 2\eta f\| \geq \sqrt{\frac{6T}{n}}] \leq \frac{T/n}{6T/n} = \frac{1}{6}$.

So far we have assumed $f$ is fixed, and only looked at the event that the coin-tosses of the mechanism yielded an estimator far from its expected value. We now turn to bounding the distance between $f$ and its expected value $q$ (the distribution that generated the types). Indeed, it is easy to see that the expected value of $f = \frac{1}{n}\sum_i e_{x_i}$ is $E[f] = q$. Moreover, it isn't hard (and has been computed many times before, e.g. Agresti (2003)) to see that $E[(f - q)(f - q)^\top] = \frac{1}{n}(\mathrm{diag}(q) - qq^\top)$. Thus $E[\|f - q\|^2] = \mathrm{trace}(\frac{1}{n}(\mathrm{diag}(q) - qq^\top)) = \frac{1}{n}(1 - \|q\|^2)$. Therefore, applying Chebyshev again, we get that w.p. at most $1/6$ over the choice of types by $q$ we have $\|f - q\| > \sqrt{\frac{6}{n}}$.

Combining both results, we get that w.p. $\geq 2/3$ we have $\|\frac{1}{2\eta}\theta - q\|_1 \leq \|\frac{1}{2\eta}\theta - f\|_1 + \|f - q\|_1 \leq \sqrt{\frac{6T^2}{4\eta^2 n}} + \sqrt{\frac{6T}{n}} \leq \alpha$ when $n = \Omega(\frac{T^2}{\eta^2\alpha^2})$. Recall that $\eta = \Theta(\epsilon)$ and that $d_{TV}(x, y) = \frac{1}{2}\|x - y\|_1$, and the bound of $\frac{\alpha}{2}$ is proven.

Independence Testing. Similarly to the identity tester, we propose an analogous tester for independence. Recall that in this case X is composed of d features, hence $X = X_1 \times X_2 \times \cdots \times X_d$, with our notation of $T_j = |X_j|$ for each $j$. Our tester should accept when the underlying distribution over the types is some product distribution $p$, and should reject when the underlying distribution over the types is $\alpha$-far from any product distribution.
The tester, whose input is the $n$ signals and a distance parameter $\alpha > 0$, is as follows.

1. Given the $n$ matrices $G_1, \ldots, G_n$ and the $n$ observed signals $y_1, \ldots, y_n$, compute the estimator $\theta = \frac{1}{n}\sum_i \frac{1}{\eta}(g_i^{y_i} - \frac{1}{2}\mathbf{1})$.

2. For each feature $j$ compute $\bar\theta_j$, the $j$th marginal of $\frac{1}{2\eta}\theta$ (namely, for each $x_j \in X_j$ sum over all types whose $j$th feature is $x_j$). Denote $\bar\theta = \bar\theta_1 \otimes \cdots \otimes \bar\theta_d$.

3. If $d_{TV}(\frac{1}{2\eta}\theta, \bar\theta) \leq \frac{\alpha}{2}$ then accept, else reject.

Theorem 13. Assume $\epsilon < 1$. Given $n = \Omega(\frac{T}{\alpha^2\epsilon^2}(T + d^2\sum_j T_j))$ i.i.d.-drawn signals from the non-symmetric locally-private mechanism under a dataset whose types were drawn i.i.d. from some distribution $q$, then w.p. $\geq 2/3$ over the matrices $G_i$ we generate and the types in the dataset we have the following guarantee. If $q$ is a product distribution, then $d_{TV}(\frac{1}{2\eta}\theta, \bar\theta) \leq \frac{\alpha}{2}$; and if $q$ is $\alpha$-far from any product distribution, then $d_{TV}(\frac{1}{2\eta}\theta, \bar\theta) > \frac{\alpha}{2}$. (Proof deferred to the supplementary material, Section C.)

Open Problems. (1) Is there a tester with a better sample complexity? The experiment in Section 4.2 leads us to conjecture that there exists a tester with sample complexity of $T^{1.5}/(\eta\alpha)^2$. There could exist better testers, of smaller sample complexity, which leads to the second question. (2) Can one derive lower bounds for identity/independence testing in this model, where each sample has its own distribution, related to the original distribution over types? In Section D in the supplementary material we give more details as to possible avenues for tackling both problems, relating them to the problem of learning a mixture-model of product distributions.

4.2. Experiment: Proposed χ2-Based Testers

Following the derivations in the proof of Theorem 11, we can see that $\mathrm{Var}(\theta) = \frac{1}{n}(I - 4\eta^2\,\mathrm{diag}(f)^2)$. As before, we assume $\epsilon$ is a small constant, and as a result the variance in $2\eta f$ (which is approximately $\frac{4\eta^2}{n}\mathrm{diag}(p)$) is significantly smaller than the variance of $\theta$. This allows us to use the hand-wavy approximation $f \approx p$, and argue that $\mathrm{Var}(\theta) \approx \frac{1}{n}(I - 4\eta^2\,\mathrm{diag}(p)^2) \stackrel{\mathrm{def}}{=} \frac{1}{n}M$.
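A quick Monte-Carlo check of this variance formula (our own sanity check, not from the paper; we deliberately use an exaggerated $\epsilon$ so the $4\eta^2 p(x)^2$ correction stands out from the sampling noise, even though the paper's regime is $\epsilon < 1$):

```python
import numpy as np

# Empirically estimate Var(theta) per coordinate, for types drawn i.i.d.
# from p, and compare to (1/n)(1 - 4*eta^2*p(x)^2).
rng = np.random.default_rng(4)
T, n, eps, reps = 2, 50, 2.0, 20000
eta = (np.exp(eps) - 1) / (2 * (np.exp(eps) + 1))
p = np.full(T, 1.0 / T)

types = rng.integers(0, T, size=(reps, n))
g_plus = 0.5 + eta * rng.choice([-1.0, 1.0], size=(reps, n, T))
own = np.take_along_axis(g_plus, types[..., None], axis=2)[..., 0]
ys = np.where(rng.random((reps, n)) < own, 1, -1)
g_y = np.where(ys[..., None] == 1, g_plus, 1.0 - g_plus)
thetas = ((g_y - 0.5) / eta).mean(axis=1)   # reps independent copies of theta

empirical = thetas.var(axis=0, ddof=1)          # per-coordinate variance
predicted = (1.0 - 4 * eta ** 2 * p ** 2) / n   # (1/n) * diagonal of M
```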
The Central Limit Theorem thus gives that $\sqrt{n}\, M^{-1/2}(\theta - 2\eta p) \xrightarrow{n\to\infty} N(0, I)$. Therefore, it stands to reason that the squared norm of the LHS is distributed like a χ2-distribution; namely,
$P(\theta) \stackrel{\mathrm{def}}{=} n \sum_x \frac{(\theta(x) - 2\eta\, p(x))^2}{1 - 4\eta^2 p(x)^2} \xrightarrow{n\to\infty} \chi^2_T.$
Our experiment is aimed at determining whether $P(\theta)$ can serve as a test statistic, and at assessing its sample complexity.

Setting and Default Values. We set a true ground distribution $p$ on $T$ possible types. We then pick a distribution $q$ which is $\alpha$-far from $p$ using the counter-example of Paninski (2008): we pair the types and randomly move $\frac{2\alpha}{T}$ probability mass between each pair of matched types. We then generate $n$ samples according to $q$, and apply the non-symmetric $\epsilon$-differentially private mechanism of (Bassily et al., 2017). Finally, we aggregate the suitable vectors to obtain our estimator $\theta$ and compute $P(\theta)$. If we decide to accept/reject, we do so based on comparing $P$ to the $\frac{2}{3}$-quantile of the $\chi^2_T$-distribution, so that in the limit we reject only w.p. $1/3$ under the null-hypothesis. We repeat this entire process $t$ times. We have set the default values $T = 10$, $p = u_T$ (uniform on $[T]$), $\alpha = 0.2$, $n = 1000$, $\epsilon = 0.25$ (and therefore $\eta = \frac{1}{2}\cdot\frac{e^\epsilon - 1}{e^\epsilon + 1}$), and $t = 10000$.

Experiment 1: Convergence to the χ2-distribution in the null case. First we ask ourselves whether our approximation $P(\theta) \xrightarrow{n\to\infty} \chi^2_T$ is correct when $p$ is indeed the distribution generating the signals. To that end, we set $\alpha = 0$ (so the types are distributed according to $p$) and plot the $t$ empirical values of $P$ we obtained in our experiment, varying both the sample size $n \in \{10, 100, 1000, 10000\}$ and the domain size $T \in \{10, 25, 50, 100\}$. The results are consistent: $P$ is distributed like a $\chi^2_T$-distribution. Indeed, the mean of the $t$ sample points is $\approx T$ (the mean of a $\chi^2_T$-distribution). The results themselves appear in Figure 2 in the supplementary material, Section D.

Experiment 2: Divergence from the χ2-distribution in the alternate case.
Secondly, we asked whether $P$ can serve as a good way to differentiate between the null hypothesis (the distribution over the types is derived from $p$) and the alternative hypothesis (the distribution over the types is $\alpha$-far from $p$). We therefore ran our experiment while varying $\alpha$ (between $0.25$ and $0.05$) and increasing $n$. Again, the results show that the distribution does shift towards higher values as $n$ increases. The results are given in Figure 3 in the supplementary material, Section D.

Experiment 3: Sample Complexity. Next, we set out to find the required sample complexity for rejection. We fix the $\alpha$-far distribution from $p$, and first do a binary search to hone in on an interval $[n_L, n_U]$ where the empirical rejection probability is between $30\%$ and $35\%$; then we equipartition this interval and return the $n$ for which the empirical rejection probability is closest to $33\%$. We repeat this experiment multiple times, each time varying just one of the 3 most important parameters, $T$, $\alpha$ and $\epsilon$, while maintaining the other two at their default values: $T \in \{5, 10, 15, \ldots, 100\}$, $\alpha \in \{0.05, 0.1, 0.15, \ldots, 0.5\}$, $\epsilon \in \{0.05, 0.1, 0.15, \ldots, 0.5\}$. The results are shown in Figure 1, where next to each curve we plot the curve of our conjecture in a dotted line.⁴ We conjecture that $n \propto T^{c_T} \alpha^{-c_\alpha} \epsilon^{-c_\epsilon}$. And so, for any parameter $\xi \in \{T, \alpha, \epsilon\}$, if we compare two experiments $i, j$ that differ only in the value of this parameter and resulted in two empirical estimations $N_i, N_j$ of the sample complexity, we get that $c_\xi \approx \frac{\log(N_i/N_j)}{\log(\xi_i/\xi_j)}$. For each $\xi \in \{T, \alpha, \epsilon\}$ we take the median over all pairs $i$ and $j$, and we get the empirical estimations $c_\epsilon = 1.900793$, $c_\alpha = 1.930947$ and $c_T = 1.486957$. This leads us to the conjecture that the actual sample complexity according to this test is $T^{1.5}/(\eta\alpha)^2$.

Open Problem. Perhaps even more interesting is the experiment we wish we could have run: a χ2-based independence test.
Assuming the distribution of the types is a product distribution $p = p_1 \otimes \cdots \otimes p_d$, the proof of Theorem 13 shows that for each feature $j$ we have $\mathrm{Var}(\bar\theta_j - p_j) \preceq \frac{1}{4\eta^2 n}\cdot\frac{T}{T_j}\, I_{X_j}$. Thus $\frac{4\eta^2 n\, T_j}{T}\|\bar\theta_j - p_j\|^2 \xrightarrow{n\to\infty} \chi^2_{T_j}$. However, the $d$ estimators $\bar\theta_j$ are not independent, so it is not true that $\sum_j \frac{4\eta^2 n\, T_j}{T}\|\bar\theta_j - p_j\|^2 \xrightarrow{n\to\infty} \chi^2_{\sum_j T_j}$. Moreover, even if the estimators of the marginals were independent,⁵ we are still unable to determine the asymptotic distribution of $\|\bar\theta - p\|^2$ (only a bound, scaled by $O(\max_j T_j)$, using Proposition 17 in the supplementary material), let alone the asymptotic distribution of $\|\frac{1}{2\eta}\theta - \bar\theta\|^2$. Nonetheless, we did empirically measure the quantity
$Q(\theta) \stackrel{\mathrm{def}}{=} n \sum_x \frac{(\frac{1}{2\eta}\theta(x) - \bar\theta(x))^2}{\bar\theta(x)}$
under the null ($\alpha = 0$) and the alternative ($\alpha = 0.25$) hypotheses, with $n = 25{,}000$ samples in each experiment. The results (given in Figure 4 in the supplementary material) show that the distribution of $Q$, albeit not resembling a χ2-distribution, is different under the null- and the alternative-hypothesis, so we suspect that there is merit to using this quantity as a tester. We thus leave the design of a χ2-based statistic for independence in this model as an open problem.

Figure 1: Empirical sample-complexity to have the tester reject w.p. 2/3 under the alternative hypothesis.

⁴ We plot the dependency on α and ϵ on the same plot, as both took the same empirical values.

⁵ E.g. by assigning each example $i$ to one of the $d$ estimators, costing only a $d \leq \log(T)$ factor in sample complexity.

Acknowledgments

This work was supported by the Natural Sciences and Engineering Research Council of Canada, Grant #2017-06701. The author is also an unpaid collaborator on NSF grant 1565387. The author thanks the anonymous reviewers for many helpful suggestions and ideas, as well as Marco Gaboardi and Ryan Rogers for helpful discussions illustrating the similarities and differences between our two papers.

References

Acharya, J., Daskalakis, C., and Kamath, G. Optimal testing for properties of distributions. In NIPS, pp.
3591-3599, 2015.

Agresti, A. Categorical Data Analysis. Wiley Series in Probability and Statistics. 2003.

Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 1(8), 2017. Available at http://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html.

Bassily, R. and Smith, A. D. Local, private, efficient protocols for succinct histograms. In STOC, pp. 127-135, 2015.

Bassily, R., Nissim, K., Stemmer, U., and Thakurta, A. G. Practical locally private heavy hitters. In NIPS, pp. 2285-2293, 2017.

Cai, B., Daskalakis, C., and Kamath, G. Priv'it: Private and sample efficient identity testing. In ICML, pp. 635-644, 2017.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. In NIPS, pp. 359-366, 2002.

Chaudhuri, K. and Hsu, D. Convergence rates for differentially private statistical estimation. In ICML, 2012.

Duchi, J., Jordan, M., and Wainwright, M. Local privacy and statistical minimax rates. In FOCS, 2013a.

Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy and minimax bounds: Sharp rates for probability estimation. In NIPS, pp. 1529-1537, 2013b.

Dwork, C. and Lei, J. Differential privacy and robust statistics. In STOC, 2009.

Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, NOW Publishers, 2014.

Dwork, C., Su, W., and Zhang, L. Private false discovery rate control. CoRR, abs/1511.03803, 2015.

Erlingsson, U., Pihur, V., and Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In CCS, 2014.

Gaboardi, M. and Rogers, R. M. Local private hypothesis testing: Chi-square tests. In ICML (to appear), 2018.

Gaboardi, M., Lim, H. W., Rogers, R., and Vadhan, S. P. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. In ICML, pp. 2111-2120, 2016.

Hazan, E., Kalai, A., Kale, S., and Agarwal, A.
Logarithmic regret algorithms for online convex optimization. In COLT, pp. 499-513, 2006.

Johnson, A. and Shmatikov, V. Privacy-preserving data exploration in genome-wide association studies. In KDD, pp. 1079-1087, 2013.

Kamath, S., Orlitsky, A., Pichapati, D., and Suresh, A. T. On learning distributions from their samples. In COLT, pp. 1066-1100, 2015.

Karwa, V. and Vadhan, S. Finite sample differentially private confidence intervals, 2018.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? In FOCS, 2008.

Paninski, L. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Information Theory, 54(10):4750-4755, 2008.

Reiss, R. Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer Series in Statistics. 1989.

Sason, I. and Verdú, S. f-divergence inequalities. IEEE Trans. Information Theory, 62(11):5973-6006, 2016.

Sheffet, O. Differentially private ordinary least squares. In ICML, 2017.

Smith, A. Privacy-preserving statistical estimation with optimal convergence rates. In STOC, 2011.

Uhler, C., Slavkovic, A. B., and Fienberg, S. E. Privacy-preserving data sharing for genome-wide association studies. Journal of Privacy and Confidentiality, 5(1), 2013.

Valiant, G. and Valiant, P. An automatic inequality prover and instance optimal identity testing. In FOCS, pp. 51-60, 2014.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. 2010. URL http://arxiv.org/abs/1011.3027.

Vu, D. and Slavkovic, A. Differential privacy for clinical trial data: Preliminary evaluations. In ICDM, pp. 138-143, 2009.

Wang, Y., Lee, J., and Kifer, D. Differentially private hypothesis testing, revisited. CoRR, abs/1511.03376, 2015.

Warner, S. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309), March 1965.
Yu, F., Fienberg, S., Slavkovic, A., and Uhler, C. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics, 50:133-141, 2014.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pp. 928-936, 2003.