# Private Query Release Assisted by Public Data

Raef Bassily¹, Albert Cheu², Shay Moran³, Aleksandar Nikolov⁴, Jonathan Ullman², Zhiwei Steven Wu⁵

¹Department of Computer Science & Engineering, The Ohio State University. ²Khoury College of Computer Sciences, Northeastern University. ³Google AI Princeton. ⁴Department of Computer Science, University of Toronto. ⁵Department of Computer Science and Engineering, University of Minnesota. Correspondence to: Raef Bassily.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $H$ of statistical queries with error no more than $\alpha$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in terms of the private and public sample complexities. Our upper and lower bounds on the private sample complexity have matching dependence on the dual VC-dimension of $H$. For a large category of query classes, our bounds on the public sample complexity have matching dependence on $\alpha$.

## 1. Introduction

The ability to answer statistical queries on a sensitive data set in a privacy-preserving way is one of the most fundamental primitives in private data analysis. In particular, this task has been at the center of the literature on differential privacy since its emergence (Dinur & Nissim, 2003; Dwork et al., 2006) and is also central to the upcoming US Census 2020 release (Dajani et al., 2017).

In its basic form, the problem of differentially private query release can be described as follows. Given a class $H$ of queries¹ $h : X \to \{-1, 1\}$ defined over some domain $X$, and a data set $x$ of i.i.d. samples from some unknown distribution $D$ over $X$, the goal is to construct an $(\varepsilon, \delta)$-differentially private algorithm that, given $H$ and $x$, outputs a mapping $G : H \to [-1, 1]$ such that for every $h \in H$, $G(h)$ gives an accurate estimate of the true mean $\mathbb{E}_{x \sim D}[h(x)]$, up to some error $\alpha$.

¹ In this work, we focus on classes of binary functions (known in the differential privacy literature as counting queries).

A central question in this problem is to characterize the private sample complexity, which is the least number of private samples required to perform this task up to error $\alpha$. This question has been extensively studied in the literature on differential privacy (Dinur & Nissim, 2003; Hardt & Rothblum, 2010; Muthukrishnan & Nikolov, 2012; Blum et al., 2013; Bun et al., 2018; Steinke & Ullman, 2015). For general query classes, it was shown that the optimal bound on the private sample complexity in terms of $|X|$, $|H|$, and the privacy parameters is attained by the Private Multiplicative Weights (PMW) algorithm due to Hardt and Rothblum (Hardt & Rothblum, 2010). This optimality was established by the lower bound of Bun et al. (Bun et al., 2018). That result implied the impossibility of differentially private query release for certain classes of infinite size. Moreover, subsequent results (Bun et al., 2015; Alon et al., 2019b) implied that this impossibility holds even for a class as simple as one-dimensional thresholds over $\mathbb{R}$. On the other hand, without any privacy constraints, the query release problem is equivalent to attaining uniform convergence over $H$, and hence the sample complexity is $d/\alpha^2$, where $d$ is the VC-dimension of $H$.
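To make the query-release task concrete, here is a minimal, non-private Python sketch of the empirical baseline: answer every query with its sample mean and measure the worst-case error from the definition above. The threshold class and all names in this snippet are illustrative, not taken from the paper.

```python
import numpy as np

# Non-private baseline for the query-release task: answer every query in
# H with its empirical mean, and measure the worst-case error Err_{D,H}.

def empirical_release(sample, H):
    """Return G : h -> empirical estimate of E_{x~D}[h(x)]."""
    answers = {h: float(np.mean([h(x) for x in sample])) for h in H}
    return lambda h: answers[h]

def worst_case_error(G, H, population):
    """Err_{D,H}(G) = sup_h |G(h) - E_{x~D}[h(x)]|, with D approximated
    here by a large fresh sample."""
    return max(abs(G(h) - np.mean([h(x) for x in population])) for h in H)

# Illustrative example: one-dimensional thresholds h_t(x) = +1 iff x <= t.
H = [(lambda x, t=t: 1.0 if x <= t else -1.0) for t in np.linspace(0, 1, 20)]
rng = np.random.default_rng(0)
G = empirical_release(rng.uniform(0, 1, 500), H)
print(worst_case_error(G, H, rng.uniform(0, 1, 100_000)))  # roughly 1/sqrt(500) in scale
```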
In practice, it is often feasible to collect some amount of public data that poses no privacy concerns. For example, in the language of consumer privacy, there is a considerable amount of data collected from so-called opt-in users, who voluntarily offer or sell their data to companies or organizations. Such data is deemed by its original owner to pose no threat to personal privacy. There are also a variety of other sources of public data that can be harnessed.

Motivated by the above observation, and by the limitations of the standard model of differentially private query release, in this work we study a relaxed setting of this problem, which we call Public-data-Assisted Private (PAP) query release. In this setting, the query-release algorithm has access to two types of samples from the unknown distribution: private samples that contain personal and sensitive information (as in the standard setting), and public samples. The goal is to design algorithms that exploit as little public data as possible to achieve non-trivial savings in sample complexity over standard DP query-release algorithms, while still providing strong privacy guarantees for the private dataset.

### 1.1. Our results

In this work we study the private and public sample complexities of PAP query-release algorithms, and give upper and lower bounds on both.

1. Upper bounds: We give a construction of a PAP algorithm that solves the query release problem for any query class $H$ using only $d/\alpha$ public samples and $\sqrt{p}\,d^{3/2}/\alpha^2$ private samples, where $d$ is the VC-dimension of $H$ and $p$ is the dual VC-dimension of $H$. Recall that $d/\alpha^2$ samples are necessary even without privacy constraints, so our upper bound on the public sample complexity shows a nearly quadratic saving.

2. Lower bound on private sample complexity: We show that there is a query class $H$ with VC-dimension $\log(p)$ and dual VC-dimension $p$ such that any PAP algorithm either requires $\Omega(1/\alpha^2)$ public samples or requires $\Omega(\sqrt{p}/\alpha)$ total samples. Thus the $\sqrt{p}$ dependence above is unavoidable. For this class, $O(\log(p)/\alpha^2)$ public samples are enough to solve the problem with no private samples, and $O(\log(p)/\alpha^2 + \sqrt{p}/\alpha)$ private samples are enough to solve the problem with no public samples. Thus, public samples do not help unless there are nearly enough public samples to solve the problem without private samples.

3. Lower bound on public sample complexity: We show that if the class $H$ has infinite Littlestone dimension,² then any PAP query-release algorithm for $H$ must have public sample complexity $\Omega(1/\alpha)$. The simplest example of such a class is the class of one-dimensional thresholds over $\mathbb{R}$. This class has VC-dimension 1, and therefore demonstrates that the upper bound above is nearly tight.

² The Littlestone dimension is a combinatorial parameter that arises in online learning (Littlestone, 1988; Ben-David et al., 2009).

### 1.2. Techniques

Upper bounds: The first key step in our construction for the upper bound is to use the public data to construct a finite class $\widetilde{H}$ that forms a good approximation of the original class $H$. Such an approximation is captured via the notion of an $\alpha$-cover (Definition 1). The number of public examples that suffices to construct such an approximation is about $d/\alpha$ (Alon et al., 2019a).
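Concretely, this cover-construction step keeps one hypothesis per labeling pattern that $H$ realizes on the public sample. A minimal Python sketch follows (names are illustrative; the formal version appears as the first stage of Algorithm 3 in Section 3):

```python
# Sketch of the cover construction from public data: bucket hypotheses
# by their label pattern on the public sample and keep one per bucket.
# By Sauer's lemma the result has at most (em/d)^d members for m public
# points and VC-dimension d.

def build_cover(H, public_sample):
    """Return one representative hypothesis per labeling pattern on the
    public sample; H is any finite iterable of {-1,+1}-valued functions
    (the analysis allows infinite H, which code cannot enumerate)."""
    cover = {}
    for h in H:
        pattern = tuple(h(w) for w in public_sample)
        cover.setdefault(pattern, h)   # keep an arbitrary representative
    return list(cover.values())
```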
Given this finite class $\widetilde{H}$, we then reduce the original domain $X$ to a finite set $X_{\widetilde{H}}$ of representative domain points, defined via an equivalence relation induced by $\widetilde{H}$ over $X$ (Definition 2). Using Sauer's Lemma, we can show that the size of such a representative domain is at most $(e|\widetilde{H}|/p)^{p}$, where $p$ is the dual VC-dimension of $H$. At this point, we can reduce our problem to DP query release for the finite query class $\widetilde{H}$ and the finite domain $X_{\widetilde{H}}$, which we solve via the PMW algorithm (Hardt & Rothblum, 2010; Dwork & Roth, 2014).

Lower bound on private sample complexity: The proof of the lower bound is based on the robust tracing attack of Dwork et al. (Dwork et al., 2015). That work proves privacy lower bounds for the class of decision stumps over the domain $\{-1, 1\}^p$, which contains the queries $q_i(x) = x_i$. Specifically, they show that for any algorithm that takes at most $s \approx \sqrt{p}/\alpha$ samples and releases the class of decision stumps with accuracy $\alpha$, there is some attacker that can detect the presence of at least $t \approx 1/\alpha^2$ of the samples. Therefore, if the number of public samples is at most $t - 1$, the attacker can detect the presence of one of the private samples, which means the algorithm cannot be differentially private with respect to the private samples.

Lower bound on public sample complexity: This lower bound is derived in two steps. First, we show that PAP query release for a class $H$ implies PAP learning (studied in (Beimel et al., 2013; Alon et al., 2019a)³) of the same class with the same amount of public data. This step follows from a straightforward generalization of an analogous result by (Beimel et al., 2013) in the standard DP model with no public data. Second, we invoke the lower bound of (Alon et al., 2019a) on the public sample complexity of PAP learning.

³ These works use the term "semi-private" instead of PAP.

### 1.3. Other related work

To the best of our knowledge, our work is the first to formally study differentially private query release assisted by public data. A few works have considered the analog of our problem in the supervised-learning setting. In particular, the notion of differentially private PAC learning assisted by public data was introduced by Beimel et al. in (Beimel et al., 2013), where it was called semi-private learning. They gave a construction of a learning algorithm in this setting, and derived upper bounds on the private and public sample complexities. The paper (Alon et al., 2019a) revisited this problem and gave nearly optimal bounds on the private and public sample complexities in the agnostic PAC model. The work of (Alon et al., 2019a) emphasizes the notion of $\alpha$-covers as a useful tool in the construction of such learning algorithms. Our PAP algorithm demonstrates the usefulness of this notion in the query-release setting.

In the learning setting, several other works have considered similar relaxations, e.g., (Chaudhuri & Hsu, 2011; Beimel et al., 2013), who studied the notion of "label-private" learning, where only the labels in the training set are considered private.
Another line of work considers settings in which the learning task can be reduced to privately answering classification queries (Hamm et al., 2016; Papernot et al., 2017; 2018; Bassily et al., 2018; Nandi & Bassily, 2019), where the goal is to construct a differentially private classification algorithm that predicts the labels of a sequence of public feature vectors given a private training set as input.

## 2. Preliminaries

We use $X$ to denote an arbitrary data universe, $D$ to denote a distribution over $X$, and $H \subseteq \{-1, 1\}^X$ to denote a binary hypothesis class.

### 2.1. Tools from learning theory

The VC-dimension of a binary hypothesis class $H \subseteq \{-1, 1\}^X$ is denoted by $\mathrm{VC}(H)$. We will use the following notion of coverings:

Definition 1 ($\alpha$-cover for a hypothesis class). A family of hypotheses $\widetilde{H}$ is said to form an $\alpha$-cover for a hypothesis class $H \subseteq \{-1, 1\}^X$ with respect to a distribution $D$ over $X$ if for every $h \in H$ there is $\tilde h \in \widetilde{H}$ such that $\Pr_{x \sim D}\bigl[\tilde h(x) \neq h(x)\bigr] \leq \alpha$.

Definition 2 (Representative domain (a.k.a. the dual class)). Let $\widetilde{H} \subseteq \{-1, 1\}^X$ be a hypothesis class. Define an equivalence relation on $X$ by $x \sim x'$ if and only if $h(x) = h(x')$ for all $h \in \widetilde{H}$. The representative domain induced by $\widetilde{H}$ on $X$, denoted by $X_{\widetilde{H}}$, is a complete set of distinct representatives from $X$ for this equivalence relation.

For example, let $\widetilde{H}$ be a class of $M$ binary thresholds over $\mathbb{R}$ given by $h_{t_j}(x) = +1$ iff $x \leq t_j$ for $j \in [M]$, where $-\infty < t_1 < t_2 < \dots < t_M < \infty$. Then a representative domain in this case is a set of $M + 1$ distinct elements, one from each of the intervals $(-\infty, t_1], (t_1, t_2], \dots, (t_M, \infty)$. More generally, if $\widetilde{H}$ is a class of halfspaces, then a representative domain contains exactly one point in each cell of the hyperplane arrangement induced by $\widetilde{H}$ (see Figure 1). Note that when $\widetilde{H}$ is finite, any representative domain for $\widetilde{H}$ has size at most $2^{|\widetilde{H}|}$, since the equivalence class of each $x \in X$ is determined by the binary vector $(h(x))_{h \in \widetilde{H}}$. Moreover, one can also make the following simple claim, which is a direct consequence of the Sauer–Shelah Lemma (Sauer, 1972) together with the fact that a representative domain $X_{\widetilde{H}}$ is in one-to-one correspondence with the dual class of $\widetilde{H}$. We use $\mathrm{VC}^*(H)$ to denote the dual VC-dimension of a hypothesis class $H$; that is, $\mathrm{VC}^*(H)$ is the VC-dimension of the dual class of $H$.

Figure 1. A representative domain for a finite class of halfspaces.

Claim 3. Let $\widetilde{H}$ be a finite class of binary functions defined over a domain $X$. Then the size of a representative domain $X_{\widetilde{H}}$ satisfies
$$|X_{\widetilde{H}}| \;\leq\; \left(\frac{e\,|\widetilde{H}|}{\mathrm{VC}^*(\widetilde{H})}\right)^{\mathrm{VC}^*(\widetilde{H})}.$$

The following useful fact gives a worst-case upper bound on the dual VC-dimension in terms of the VC-dimension.

Fact 4 ((Assouad, 1983)). Let $H$ be a binary hypothesis class. The dual VC-dimension of $H$ satisfies $\mathrm{VC}^*(H) < 2^{\mathrm{VC}(H)+1}$.

Notation: In Section 3, we will use the following notation. Let $\widetilde{H}$ be a hypothesis class defined over a domain $X$. For any $x \in X$, we define $X_{\widetilde{H}}(x)$ as the representative $s \in X_{\widetilde{H}}$ such that $x \sim s$, where $\sim$ is the equivalence relation from Definition 2. Note that by definition this $s \in X_{\widetilde{H}}$ is unique. Moreover, for any $n$ and any $x = (x_1, \dots, x_n) \in X^n$, we define $X_{\widetilde{H}}(x) \triangleq \bigl(X_{\widetilde{H}}(x_1), \dots, X_{\widetilde{H}}(x_n)\bigr)$.

Definition 5 (Query Release). Given a distribution $D$ over $X$ and a binary hypothesis class $H$, a query release data structure $G \in [-1, 1]^H$ (equivalently, $G : H \to [-1, 1]$) estimates the expected label $\mathbb{E}_{x \sim D}[h(x)]$ for all $h \in H$. The worst-case error is defined as
$$\mathrm{Err}_{D,H}(G) \;\triangleq\; \sup_{h \in H} \Bigl| G(h) - \mathbb{E}_{x \sim D}[h(x)] \Bigr|.$$
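For a finite class, the equivalence relation of Definition 2 and the map $X_{\widetilde{H}}(\cdot)$ from the Notation paragraph can be computed by keying each point on its dual behavior vector $(h(x))_{h \in \widetilde{H}}$. A minimal sketch with illustrative names follows; note that the analysis applies the map to all of $X$, while code can only materialize representatives for points it actually sees.

```python
# Minimal sketch of Definition 2: two points are equivalent iff every
# h in H_tilde labels them identically, so we keep one representative
# per dual behavior vector (h(x))_{h in H_tilde}. Claim 3 bounds the
# number of distinct vectors by (e|H_tilde| / VC*(H_tilde))^{VC*(H_tilde)}.

def representative_domain(H_tilde, points):
    """Return (representatives, to_rep), where to_rep realizes the map
    X_{H_tilde}(.) on the given points."""
    reps = {}
    def to_rep(x):
        key = tuple(h(x) for h in H_tilde)
        return reps.setdefault(key, x)   # first point seen in each equivalence class
    for x in points:
        to_rep(x)
    return list(reps.values()), to_rep
```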
### 2.2. Tools from Differential Privacy

Two datasets $x, x' \in X^n$ are neighboring if they differ in one element.

Definition 6 (Differential Privacy). A randomized algorithm $\mathcal{A} : X^n \to \mathcal{Z}$ is $(\varepsilon, \delta)$-differentially private if for all neighboring $x, x'$ and all $Z \subseteq \mathcal{Z}$,
$$\Pr[\mathcal{A}(x) \in Z] \;\leq\; e^{\varepsilon}\,\Pr[\mathcal{A}(x') \in Z] + \delta.$$

Private Multiplicative Weights (PMW): In our construction in Section 3, we will use, as a black box, a well-studied algorithm in differential privacy known as Private Multiplicative Weights (Hardt & Rothblum, 2010). We will use a special case of the offline version of the PMW algorithm: the input query class $\widetilde{H}$ is finite, and PMW runs over all the queries in the input class (in any order) to perform its updates, finally outputting a query release data structure $\widetilde{G} \in [-1, 1]^{\widetilde{H}}$. When the input private data set $s$ is drawn i.i.d. from some unknown distribution $\widetilde{D}$, the accuracy goal is to obtain a data structure $\widetilde{G}$ such that $\mathrm{Err}_{\widetilde{D},\widetilde{H}}(\widetilde{G})$ is small. The inputs and output of the PMW algorithm are described in Algorithm 1.

Algorithm 1: An outline of the Private Multiplicative Weights algorithm (PMW)
Input: Private data set $s \in \widetilde{X}^n$ (where $\widetilde{X}$ is a finite domain); a finite query (hypothesis) class $\widetilde{H} \subseteq \{-1,1\}^{\widetilde{X}}$; accuracy parameters $\alpha, \beta$; privacy parameters $\varepsilon, \delta$.
Output: A data structure $\widetilde{G} : \widetilde{H} \to [-1, 1]$.

The following lemma is an immediate corollary of the accuracy guarantee of the PMW algorithm (Hardt & Rothblum, 2010; Dwork & Roth, 2014). In particular, it follows from combining the empirical accuracy of PMW with a standard uniform-convergence argument.

Lemma 7 (Corollary of Theorem 4.15 in (Dwork & Roth, 2014)). For any $0 < \varepsilon, \delta < 1$, the PMW algorithm is $(\varepsilon, \delta)$-differentially private. Let $\widetilde{D}$ be any distribution over $\widetilde{X}$. For any $0 < \alpha, \beta < 1$, given an input data set $s \sim \widetilde{D}^n$ such that
$$n \;\geq\; \max\!\left\{\; \frac{200\,\sqrt{\log|\widetilde{X}|\,\log(2/\delta)}}{\varepsilon\,\alpha^{2}},\;\; \frac{128\bigl(\log|\widetilde{X}|\,\log|\widetilde{H}| + \log(2/\beta)\bigr)}{\alpha^{2}} \;\right\},$$
then, with probability at least $1 - \beta$, PMW outputs a data structure $\widetilde{G}$ satisfying $\mathrm{Err}_{\widetilde{D},\widetilde{H}}(\widetilde{G}) \leq \alpha$.

### 2.3. Our Model: Differentially private query release assisted by public data

In this paper, we study an extension of the problem of differentially private query release (Dwork & Roth, 2014) in which the input data comes in two types: private and public. More precisely, the problem can be defined as follows. Let $D$ be any distribution over a data domain $X$, and let $H \subseteq \{-1, 1\}^X$ be a set of queries (here, we assume that $H$ is a binary hypothesis class). We consider a family of algorithms whose generic description (namely, inputs and outputs) is given by Algorithm 2.

Algorithm 2: An outline of a generic Public-data-Assisted Private (PAP) algorithm for query release
Input: Private data set $x \in X^n$; public data set $w \in X^m$; a query (hypothesis) class $H \subseteq \{-1,1\}^X$; accuracy and confidence parameters $\alpha, \beta$; privacy parameters $\varepsilon, \delta$.
Output: A data structure $G : H \to [-1, 1]$.

Given the query class $H$, a private data set $x \sim D^n$ (i.e., a data set whose elements belong to $n$ private users), and a public data set $w \in X^m$ (i.e., a data set whose elements belong to $m$ users with no privacy constraint), the algorithm outputs a query release data structure $G : H \to [-1, 1]$. Such an algorithm is required to be $(\varepsilon, \delta)$-differentially private, but only with respect to the private data set. We call such an algorithm a Public-data-Assisted Private (PAP) query-release algorithm. The accuracy/utility of the algorithm is determined by the worst-case estimation error incurred by its output data structure $G$ on any query (hypothesis) $h \in H$.
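Before formalizing PAP algorithms, it may help to make the PMW black box from Section 2.2 concrete. Below is a deliberately simplified Python sketch over a finite domain: it adds Laplace noise to each true answer and reweights a synthetic distribution whenever the noisy answer and the synthetic answer disagree by more than $\alpha$. All parameter choices here are illustrative; the real PMW algorithm (Hardt & Rothblum, 2010) budgets privacy only across the rounds in which an update occurs, so this sketch should not be read as carrying the $(\varepsilon, \delta)$ guarantee of Lemma 7.

```python
import numpy as np

def mw_release(s, H_tilde, domain, alpha, eps, rng):
    """Simplified multiplicative-weights query release (NOT the real PMW:
    the noise scale and privacy accounting here are illustrative only).
    s: private data points; H_tilde: finite class of {-1,+1}-valued
    queries; domain: finite list of domain elements."""
    n = len(s)
    dist = np.ones(len(domain)) / len(domain)        # synthetic distribution over the domain
    answers = {}
    for h in H_tilde:
        hv = np.array([h(z) for z in domain], dtype=float)
        true_mean = sum(h(x) for x in s) / n
        noisy = true_mean + rng.laplace(scale=1.0 / (eps * n))  # illustrative noise scale
        synth = float(dist @ hv)                     # answer under the synthetic distribution
        if abs(noisy - synth) > alpha:               # update round: move dist toward the data
            sign = 1.0 if noisy > synth else -1.0
            dist = dist * np.exp(sign * (alpha / 2.0) * hv)  # multiplicative-weights step
            dist /= dist.sum()
            synth = float(dist @ hv)
        answers[h] = synth
    return lambda h: answers[h]
```

The actual PMW mechanism pays privacy only for the rounds in which an update occurs, and a potential argument bounds the number of such rounds by $O(\log|\widetilde{X}|/\alpha^2)$; this is the source of the $\log|\widetilde{X}|$ dependence in Lemma 7.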
Definition 8 ($(\alpha, \beta, \varepsilon, \delta)$-PAP query-release algorithm). Let $H \subseteq \{-1, 1\}^X$ be a query class. Let $\mathcal{A} : X^n \times X^m \to [-1, 1]^H$ be a randomized algorithm in the family outlined in Algorithm 2. We say that $\mathcal{A}$ is an $(\alpha, \beta, \varepsilon, \delta)$-Public-data-Assisted Private (PAP) query-release algorithm for $H$ with private sample size $n$ and public sample size $m$ if the following conditions hold:

1. For every distribution $D$ over $X$, given data sets $x \sim D^n$ and $w \sim D^m$ as inputs to $\mathcal{A}$, with probability at least $1 - \beta$ (over the choice of $x, w$, and the random coins of $\mathcal{A}$), $\mathcal{A}$ outputs a function (data structure) $\mathcal{A}(x, w) = G \in [-1, 1]^H$ such that $\mathrm{Err}_{D,H}(G) \leq \alpha$.

2. For all public data $w \in X^m$, $\mathcal{A}(\cdot, w)$ is $(\varepsilon, \delta)$-differentially private.

Remark 9. In our description in Algorithm 2, the algorithm is required to output a data structure $G : H \to [-1, 1]$, and not necessarily a synthetic data set $v \in X^{n'}$ for some number $n'$, as in what is referred to as private proper sanitizers in (Beimel et al., 2013). In that special case, the output data set can obviously be used to define a data structure $G'$; namely, for any $h \in H$, $G'(h) \triangleq \frac{1}{n'} \sum_{i \in [n']} h(v_i)$. Moreover, in the general case, ignoring computational complexity, the output data structure can also be used to construct a data set, as pointed out in Remark 2.18 of (Beimel et al., 2013). In particular, given a data structure $G$ whose error is at most $\alpha$, it suffices to find a data set $v \in X^{n'}$, where $n' > \mathrm{VC}(H)/\alpha^2$, such that $\bigl| \frac{1}{n'} \sum_{i \in [n']} h(v_i) - G(h) \bigr| \leq 2\alpha$ for all $h \in H$; the accuracy requirement then follows by the triangle inequality. We also know that such a data set must exist: by a standard uniform-convergence argument, a data set $s \sim D^{n'}$ will, with non-zero probability, satisfy $\bigl| \frac{1}{n'} \sum_{i \in [n']} h(s_i) - \mathbb{E}_{x \sim D}[h(x)] \bigr| \leq \alpha$ for all $h \in H$, and hence, by the triangle inequality, $\bigl| \frac{1}{n'} \sum_{i \in [n']} h(s_i) - G(h) \bigr| \leq 2\alpha$ for all $h \in H$.

## 3. A PAP Query-Release Algorithm for Classes of Finite VC-Dimension

We now describe a construction of a public-data-assisted private query-release algorithm that works for any class with finite VC-dimension. Our construction is given by Algorithm 3. The key idea is to use the public data to create a finite $\alpha$-cover $\widetilde{H}$ for the input query class $H$ (see Definition 1), and then run the PMW algorithm on the finite cover and the representative domain $X_{\widetilde{H}}$ given by the dual of $\widetilde{H}$ (see Definition 2).

Theorem 10 (Upper Bound). $\mathcal{A}_{\mathrm{PAP}}$ (Algorithm 3) is an $(\alpha, \beta, \varepsilon, \delta)$ public-data-assisted private query-release algorithm for $H$ whose private and public sample complexities satisfy
$$n = O\!\left( \frac{\sqrt{p}\,\bigl(d \log(1/\alpha) + \log(1/\beta)\bigr)^{3/2} \sqrt{\log(1/\delta)}}{\varepsilon\,\alpha^{2}} \right), \qquad m = O\!\left( \frac{d \log(1/\alpha) + \log(1/\beta)}{\alpha} \right),$$
where $d = \mathrm{VC}(H)$ and $p = \mathrm{VC}^*(H)$.

Remark: Using Fact 4, we can further bound the private sample complexity for general query classes as
$$n = O\!\left( \frac{\bigl(d \log(1/\alpha) + \log(1/\beta)\bigr)^{3/2} \sqrt{2^{d} \log(1/\delta)}}{\varepsilon\,\alpha^{2}} \right).$$

In the proof of Theorem 10, we use the following lemma from (Alon et al., 2019a).

Lemma 11 (Lemma 3.3 in (Alon et al., 2019a)). Let $w \sim D^m$, where $m = O\!\left( \frac{d \log(1/\alpha) + \log(1/\beta)}{\alpha} \right)$. Then, with probability at least $1 - \beta/2$, the family $\widetilde{H}$ constructed in Step 3 of Algorithm 3 is an $\alpha/4$-cover for $H$ with respect to $D$. In particular, for every $h \in H$, we have $\Pr_{x \sim D}\bigl[\tilde h(x) \neq h(x)\bigr] \leq \alpha/4$, where $\tilde h = \mathrm{Project}_{\widetilde{H},w}(h)$ (see Algorithm 3 for the definition of $\mathrm{Project}_{\widetilde{H},w}$).
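In code, the projection in Lemma 11 is a pattern match against the cover. A minimal sketch follows (names are illustrative; the formal definition of $\mathrm{Project}_{\widetilde{H},w}$ appears inside Algorithm 3 below):

```python
def project(h, H_tilde, public_sample):
    """Project_{H_tilde, w}(h): the unique member of the cover whose
    labels on the public sample coincide with those of h. Uniqueness
    holds by construction of the cover (one member per pattern)."""
    target = tuple(h(w) for w in public_sample)
    for h_tilde in H_tilde:
        if tuple(h_tilde(w) for w in public_sample) == target:
            return h_tilde
    raise ValueError("h's pattern on the public sample is not in the cover")
```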
Algorithm 3: $\mathcal{A}_{\mathrm{PAP}}$, a public-data-assisted private query-release algorithm
Input: Private data set $x = (x_1, \dots, x_n) \in X^n$; public data set $w = (w_1, \dots, w_m) \in X^m$; a query class $H \subseteq \{-1,1\}^X$; accuracy and confidence parameters $\alpha, \beta$; privacy parameters $\varepsilon, \delta$.
Output: A data structure $G : H \to [-1, 1]$.

/* Use the public data to construct an $\alpha$-cover for $H$ */
1. Let $T = \{\hat w_1, \dots, \hat w_{\hat m}\}$ be the set of points in $X$ appearing at least once in $w$.
2. Let $\Pi_H(T) = \{(h(\hat w_1), \dots, h(\hat w_{\hat m})) : h \in H\}$.
3. Initialize $\widetilde{H} = \emptyset$. For each $y = (y_1, \dots, y_{\hat m}) \in \Pi_H(T)$, add to $\widetilde{H}$ one arbitrary $h \in H$ that satisfies $h(\hat w_j) = y_j$ for every $j = 1, \dots, \hat m$.
4. Let $X_{\widetilde{H}}$ be a representative domain induced by $\widetilde{H}$ on $X$ (as in Definition 2).
/* Replace each point in the private data set $x$ with its representative in $X_{\widetilde{H}}$ */
5. $s \leftarrow X_{\widetilde{H}}(x)$.
/* Run the PMW algorithm over the data set of representatives $s \in X_{\widetilde{H}}^n$ and $\widetilde{H}$ */
6. $\widetilde{G} \leftarrow \mathrm{PMW}\bigl(s, \widetilde{H}, \alpha/2, \beta/2, \varepsilon, \delta\bigr)$.
7. Return $G = G(w, \widetilde{H}, \widetilde{G})$, constructed as follows.

/* The function $G = G(w, \widetilde{H}, \widetilde{G}) : H \to [-1, 1]$ */
Input: A query (hypothesis) $h \in H$. Output: An estimate $r \in [-1, 1]$.
$\tilde h \leftarrow \mathrm{Project}_{\widetilde{H},w}(h)$, where $\mathrm{Project}_{\widetilde{H},w}(h)$ denotes the unique $\tilde h \in \widetilde{H}$ such that $\bigl(\tilde h(w_1), \dots, \tilde h(w_m)\bigr) = \bigl(h(w_1), \dots, h(w_m)\bigr)$.
$r \leftarrow \widetilde{G}(\tilde h)$. Return $r$.

Proof of Theorem 10. First, note that for any realization of the public data set $w$, $\mathcal{A}_{\mathrm{PAP}}$ is $(\varepsilon, \delta)$-differentially private with respect to the private data set. Indeed, the private data set $x$ is only used to construct $s = X_{\widetilde{H}}(x)$, which is the input data set to the PMW algorithm; the output of PMW is then used to construct the output data structure $G$. Moreover, for any pair of neighboring data sets $x, x'$, the pair $X_{\widetilde{H}}(x), X_{\widetilde{H}}(x')$ cannot differ in more than one element. Hence, $(\varepsilon, \delta)$-differential privacy of our construction follows from $(\varepsilon, \delta)$-differential privacy of the PMW algorithm together with the fact that differential privacy is closed under post-processing.

Next, we prove the accuracy guarantee of our construction. By Lemma 11, with probability at least $1 - \beta/2$ we have
$$\sup_{h \in H} \Bigl| \mathbb{E}_{x \sim D}[h(x)] - \mathbb{E}_{x \sim D}[\tilde h(x)] \Bigr| \;\leq\; \alpha/2, \qquad \text{where } \tilde h = \mathrm{Project}_{\widetilde{H},w}(h).$$
Hence, it suffices to show that with probability at least $1 - \beta/2$, $\mathrm{Err}_{D,\widetilde{H}}(\widetilde{G}) \leq \alpha/2$ (recall that $\widetilde{G}$ is the output data structure of the PMW algorithm). Note that by Sauer's lemma, $|\widetilde{H}| \leq \left(\frac{em}{d}\right)^{d}$, where $d = \mathrm{VC}(H)$. From the setting of $m$ in the theorem statement, we hence have
$$\log|\widetilde{H}| = O\!\left( d \log(1/\alpha) + d \log\!\left(1 + \frac{\log(1/\beta)}{d}\right) \right) = O\bigl( d \log(1/\alpha) + \log(1/\beta) \bigr).$$
Moreover, by Claim 3, we have
$$\log|X_{\widetilde{H}}| = O\Bigl( \mathrm{VC}^*(\widetilde{H}) \bigl( d \log(1/\alpha) + \log(1/\beta) \bigr) \Bigr) \;\leq\; O\Bigl( p \bigl( d \log(1/\alpha) + \log(1/\beta) \bigr) \Bigr),$$
where $p = \mathrm{VC}^*(H)$. Thus, given the setting of $n$ in the theorem statement, Lemma 7 implies that with probability at least $1 - \beta/2$, our instantiation of the PMW algorithm yields $\widetilde{G}$ satisfying
$$\alpha/2 \;\geq\; \sup_{\tilde h \in \widetilde{H}} \Bigl| \widetilde{G}(\tilde h) - \mathbb{E}_{x \sim D}\bigl[\tilde h\bigl(X_{\widetilde{H}}(x)\bigr)\bigr] \Bigr| \;=\; \sup_{\tilde h \in \widetilde{H}} \Bigl| \widetilde{G}(\tilde h) - \mathbb{E}_{x \sim D}\bigl[\tilde h(x)\bigr] \Bigr| \;=\; \mathrm{Err}_{D,\widetilde{H}}(\widetilde{G}),$$
which completes the proof.
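For concreteness, the sketches introduced earlier compose into a toy end-to-end version of Algorithm 3. This shows the data flow only: deriving the representative domain from the private points, as done below for brevity, would not be private (a faithful version must enumerate representatives from the class over all of $X$), and `mw_release` is only a stand-in for the real PMW, so none of Theorem 10's guarantees should be read into it.

```python
import numpy as np

def a_pap(private_x, public_w, H, alpha, eps, rng):
    """Toy end-to-end version of Algorithm 3, composing the earlier
    sketches (build_cover, representative_domain, mw_release, project).
    CAVEAT: computing the representative domain from the private points,
    as done here for brevity, is NOT private."""
    H_tilde = build_cover(H, public_w)                        # step 1-3: alpha-cover from public data
    reps, to_rep = representative_domain(H_tilde, private_x)  # step 4: X_{H~} (see caveat above)
    s = [to_rep(x) for x in private_x]                        # step 5: map private points to reps
    G_tilde = mw_release(s, H_tilde, reps, alpha / 2, eps, rng)  # step 6: PMW stand-in
    return lambda h: G_tilde(project(h, H_tilde, public_w))   # step 7: answer via projection
```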
## 4. A Lower Bound for Releasing Decision Stumps

In this section we give an example of a hypothesis class, decision stumps on $\{-1,1\}^p$, for which additional public data does not help for private query release. This concept class can be released using either $O(\log(p)/\alpha^2 + \sqrt{p}/\alpha)$ private samples and no public samples, or $O(\log(p)/\alpha^2)$ public samples and no private samples; but we show that every PAP query-release algorithm requires either $\Omega(\sqrt{p}/\alpha)$ private samples or $\Omega(1/\alpha^2)$ public samples. That is, making some samples public does not reduce the overall sample complexity until the number of public samples is nearly enough to solve the problem on its own.

The class of decision stumps on $\{-1,1\}^p$ has dual VC-dimension $p$ but VC-dimension just $\log p$, so this lower bound implies that the polynomial dependence on the dual VC-dimension in Theorem 10 cannot be improved: there are classes with dual VC-dimension $p$ that require either $\Omega(\sqrt{p}/\alpha)$ private samples or $\Omega(1/\alpha^2)$ public samples.

Definition 12 (Binary Decision Stumps). For any $p \in \mathbb{N}$, let $S_p$ be the hypothesis class of hypotheses $h : \{-1,1\}^p \to \{-1,1\}$ consisting of all hypotheses of the form $h_i(x) = x_i$ for $i \in [p]$.

Theorem 13 (Lower Bound for Releasing Decision Stumps). Fix any $p \in \mathbb{N}$ and $\alpha > 0$. Suppose $\mathcal{A}$ is a PAP algorithm that takes $n$ private samples and $m$ public samples, satisfies $(1, 1/4n)$-differential privacy, and is $(\alpha, \alpha)$-accurate for the class of decision stumps $S_p$. Then either
$$n = \Omega\!\left(\frac{\sqrt{p}}{\alpha}\right) \quad \text{or} \quad m = \Omega\!\left(\frac{1}{\alpha^2}\right).$$

Thus, if $m = o(1/\alpha^2)$, then the number of private samples must scale proportionally to $\sqrt{p}$, as in our upper bound in Theorem 10.

The main ingredient in the proof is a result of Dwork et al. (Dwork et al., 2015). Informally, this theorem says that for any algorithm that releases accurate answers to the class of decision stumps using too small a dataset, there is an attacker who can identify a large number of that algorithm's samples.

Theorem 14 (Special Case of Theorem 17 of (Dwork et al., 2015)). For every $p \in \mathbb{N}$ and $\alpha > 0$, there exist a number $r = \Omega(\sqrt{p}/\alpha)$ and a number $t = \Omega(1/\alpha^2)$ such that the following holds: For every query-release algorithm $\mathcal{A}$ with total sample size $s \in [t, r + t]$ that is $(\alpha, \alpha)$-accurate for the class $S_p$ of decision stumps on $\{-1,1\}^p$, there exists a distribution $\widetilde{D}$ over $\{-1,1\}^p$ and an attacker $T$ that takes as input the vector of answers $q \in [-1,1]^p$ and an example $y \in \{-1,1\}^p$ and outputs either IN or OUT, such that
$$\Pr_{\substack{z_1,\dots,z_s,\,y \sim \widetilde{D} \\ q \sim \mathcal{A}(z)}}\bigl[ T(q, y) = \mathrm{OUT} \bigr] \;\geq\; 1 - \frac{1}{(r+t)^2},$$
and
$$\Pr_{\substack{z_1,\dots,z_s \sim \widetilde{D} \\ q \sim \mathcal{A}(z)}}\left[ \bigl|\{ i \in [s] : T(q, z_i) = \mathrm{IN} \}\bigr| \geq \frac{t}{2} \right] \;\geq\; 1 - \frac{1}{(r+t)^2}.$$
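To fix intuition for the objects in Theorem 14: releasing $S_p$ is just releasing the vector of coordinate means, and the attacker sees that vector $q$ together with a candidate row $y$. The toy score below conveys the correlation idea behind tracing attacks; it is only an illustration and is not the attacker $T$ of (Dwork et al., 2015), whose construction and analysis are considerably more delicate.

```python
import numpy as np

def release_stumps(data):
    """Exact answers to S_p: q_i = empirical mean of coordinate i over a
    {-1,+1}-valued dataset of shape (n, p)."""
    return data.mean(axis=0)

def toy_tracing_score(q, y, mu):
    """Correlation-style membership score (illustrative only): rows that
    were averaged into q tend to align with q - mu, where mu holds the
    population coordinate means, more strongly than fresh rows do. One
    would report IN when the score clears a suitably chosen threshold."""
    return float(np.dot(q - mu, y))
```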
Proof of Theorem 13. Fix $p$ and $\alpha > 0$, and let $r$ and $t$ be the values specified in Theorem 14. Suppose that $\mathcal{A}$ is a PAP algorithm that is $(\alpha, \alpha)$-accurate for the class $S_p$ with $n$ private samples and $m$ public samples. We will show that either $n > r$ or $m \geq t/2$. First, note that the accuracy condition on $\mathcal{A}$ implies that we must have $n + m > t$, by the standard lower bound on the sample complexity of query release even without any privacy constraints. Thus, to prove the theorem statement, it suffices to show that if
$$m \leq \frac{t}{2} - 1, \qquad t < n + m \leq r + t,$$
and $\mathcal{A}$ is $(\alpha, \alpha)$-accurate, then $\mathcal{A}$ cannot satisfy $(1, 1/4n)$-differential privacy with respect to its private samples. Indeed, this would imply that either $m \geq t/2$ or $n > t/2 + r > r$.

Let $\widetilde{D}$ be the distribution promised by Theorem 14. Let $x = (x_1, \dots, x_n) \sim \widetilde{D}^n$ be a set of $n$ private samples, let $w = (w_1, \dots, w_m) \sim \widetilde{D}^m$ be a set of public samples, and let $z = (x_1, \dots, x_n, w_1, \dots, w_m)$ be the combined set of samples. Let $q \sim \mathcal{A}(z)$. By Theorem 14,
$$\Pr\bigl[ |\{ i \in [n+m] : T(q, z_i) = \mathrm{IN} \}| \geq m + 1 \bigr] \;\geq\; \Pr\bigl[ |\{ i \in [n+m] : T(q, z_i) = \mathrm{IN} \}| \geq t/2 \bigr] \;\geq\; 1 - \frac{1}{(r+t)^2} \;\geq\; 1 - \frac{1}{(n+m)^2},$$
where the first inequality follows from the assumption that $m \leq t/2 - 1$, and the last inequality follows from the assumption that $n + m \leq r + t$. That is, with high probability, the attacker identifies at least $m + 1$ samples in the dataset.

Let $\mathbb{1}(T(q, z_i) = \mathrm{IN})$ be the indicator of the event $T(q, z_i) = \mathrm{IN}$. Therefore, we have
$$\sum_{i=1}^{n} \Pr[T(q, x_i) = \mathrm{IN}] + \sum_{i=1}^{m} \Pr[T(q, w_i) = \mathrm{IN}] \;=\; \mathbb{E}\left[ \sum_{i=1}^{n} \mathbb{1}(T(q, x_i) = \mathrm{IN}) + \sum_{i=1}^{m} \mathbb{1}(T(q, w_i) = \mathrm{IN}) \right] \;\geq\; (m+1)\,\Pr\bigl[ |\{ i \in [n+m] : T(q, z_i) = \mathrm{IN} \}| \geq m+1 \bigr] \;\geq\; (m+1)\left(1 - \frac{1}{(n+m)^2}\right),$$
where the second step follows from Markov's inequality. Since
$$\sum_{i=1}^{m} \Pr[T(q, w_i) = \mathrm{IN}] \;\leq\; m,$$
we can conclude that
$$\sum_{i=1}^{n} \Pr[T(q, x_i) = \mathrm{IN}] \;\geq\; (m+1)\left(1 - \frac{1}{(n+m)^2}\right) - m \;=\; 1 - \frac{m+1}{(n+m)^2} \;\geq\; 1 - \frac{1}{n+m} \;\geq\; 1 - \frac{1}{n} \;\geq\; \frac{1}{2}.$$
Therefore, there must exist a private sample $i^*$ such that
$$\Pr_{\substack{x, w \sim \widetilde{D} \\ q \sim \mathcal{A}(z)}}\bigl[T(q, x_{i^*}) = \mathrm{IN}\bigr] \;\geq\; \frac{1}{2n}.$$

Now, consider the dataset $z_{-i^*}$ in which we replace $x_{i^*}$ in $z$ with an independent sample $y \sim \widetilde{D}$, while the rest of the samples in $z_{-i^*}$ are the same as in $z$. In this experiment $x_{i^*}$ is an independent sample from $\widetilde{D}$, so Theorem 14 states that
$$\Pr_{\substack{x, w, y \sim \widetilde{D} \\ q \sim \mathcal{A}(z_{-i^*})}}\bigl[T(q, x_{i^*}) = \mathrm{IN}\bigr] \;\leq\; \frac{1}{(n+m)^2} \;\leq\; \frac{1}{n^2}.$$

However, note that the joint distribution $(z, z_{-i^*})$ is a distribution over pairs of datasets that differ in at most one private sample. Thus, we have shown that $\mathcal{A}$ cannot satisfy $(\varepsilon, 1/4n)$-differential privacy for its private samples unless
$$\frac{1}{2n} \;\leq\; e^{\varepsilon} \cdot \frac{1}{n^2} + \frac{1}{4n} \quad\Longleftrightarrow\quad \varepsilon \;\geq\; \ln(n/4).$$
The upshot is that $\mathcal{A}$ cannot be $(1, 1/4n)$-differentially private for $n \geq 11$.

## 5. A Lower Bound on Public Sample Complexity

The goal of this section is to show a general lower bound on the public sample complexity of PAP query release. Our lower bound holds for classes with infinite Littlestone dimension. The Littlestone dimension is a combinatorial parameter of hypothesis classes that characterizes mistake and regret bounds in online learning (Littlestone, 1988; Ben-David et al., 2009). There are many examples of classes that have finite VC-dimension but infinite Littlestone dimension. The simplest example is the class of threshold functions over $\mathbb{R}$, which has VC-dimension 1 but infinite Littlestone dimension. In (Alon et al., 2019b), it was shown that if a class has infinite Littlestone dimension, then it is not privately learnable. Our lower bound is formally stated below.

Theorem 15 (Lower bound on public sample complexity). Let $H \subseteq \{-1,1\}^X$ be any query class with infinite Littlestone dimension. Any PAP query-release algorithm for $H$ must have public sample complexity $m = \Omega(1/\alpha)$, where $\alpha$ is the desired accuracy.

We stress that the above lower bound on the public sample complexity holds regardless of the number of private samples, which can be arbitrarily large. In light of our upper bound in Section 3, our lower bound on the public sample complexity exhibits a tight dependence on the accuracy parameter $\alpha$. That is, one cannot hope to attain public sample complexity that is $o(1/\alpha)$.

In the proof of the above theorem, we will refer to the following notion of private PAC learning with access to public data, which was defined in (Alon et al., 2019a). For completeness, we restate this definition here.

Definition 16 ($(\alpha, \beta, \varepsilon, \delta)$-PAP Learner). Let $H \subseteq \{-1,1\}^X$ be a hypothesis class. A randomized algorithm $\mathcal{A}$ is an $(\alpha, \beta, \varepsilon, \delta)$-PAP learner for $H$ with private sample size $n$ and public sample size $m$ if the following conditions hold:

1. For every distribution $D$ over $Z = X \times \{-1,1\}$, given data sets $x \sim D^n$ and $w \sim D^m$ as inputs to $\mathcal{A}$, with probability at least $1 - \beta$ (over the choice of $x, w$, and the random coins of $\mathcal{A}$), $\mathcal{A}$ outputs a hypothesis $\mathcal{A}(x, w) = \hat h \in \{-1,1\}^X$ satisfying
$$\mathrm{err}_D(\hat h) \;\leq\; \inf_{h \in H} \mathrm{err}_D(h) + \alpha, \qquad \text{where, for any } h \in \{-1,1\}^X,\ \mathrm{err}_D(h) \triangleq \Pr_{(x,y) \sim D}\bigl[h(x) \neq y\bigr].$$

2. For all public data $w \in Z^m$, $\mathcal{A}(\cdot, w)$ is $(\varepsilon, \delta)$-differentially private.

We say that $\mathcal{A}$ is a proper PAP learner if $\mathcal{A}(x, w) \in H$ with probability 1.
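To preview the first step of the proof: a query-release data structure can be turned into a learner because, for labels $y \in \{-1,1\}$, $\mathrm{err}_D(h) = (1 - \mathbb{E}[h(x)\,y])/2$. The schematic sketch below conveys this conversion; the actual reduction behind Lemma 17, from (Beimel et al., 2013), differs in its details (in particular, it goes through synthetic data), so this is illustrative only.

```python
# Schematic: if G answers the "agreement" queries g_h(x, y) = h(x) * y
# over Z = X x {-1,+1} to within alpha, then err_D(h) = (1 - E[g_h]) / 2
# is known to within alpha/2 for every h, so the maximizer of G is an
# O(alpha)-accurate agnostic hypothesis. Names here are illustrative.

def learner_from_release(G, H):
    """Pick the hypothesis whose released agreement estimate is largest."""
    return max(H, key=lambda h: G(h))
```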
Proof of Theorem 15. We prove the theorem in two simple steps: we show that PAP query release implies PAP learning, and then invoke a lower bound on PAP learning of classes with infinite Littlestone dimension. Both steps are formalized in the lemmas below.

Lemma 17 (General version of Theorem 5.5 in (Beimel et al., 2013)). Let $H \subseteq \{-1,1\}^X$ be any class of binary functions. If there exists an $(\alpha, \beta, \varepsilon, \delta)$-PAP query-release algorithm for $H$ with private sample complexity $n$ and public sample complexity $m$, then there exists an $(O(\alpha), O(\beta), O(\varepsilon), O(\delta))$-PAP learner for $H$ with private sample complexity $n' = O\bigl(n \log(1/\alpha\beta)/\alpha^2\beta\bigr)$ and public sample complexity $m$.

Lemma 18 (Theorem 4.1 in (Alon et al., 2019a)). Let $H$ be any class with infinite Littlestone dimension (e.g., the class of thresholds over $\mathbb{R}$). Then any PAP learner for $H$ must have public sample complexity $m = \Omega(1/\alpha)$, where $\alpha$ is the excess error.

Given these two lemmas, the proof is straightforward. Lemma 17 shows that for any class $H$, a PAP query-release algorithm for $H$ with public sample complexity $m$ implies the existence of a PAP learner for $H$ with the same public sample complexity (and essentially the same accuracy and privacy parameters). Hence, by Lemma 18, if $H$ has infinite Littlestone dimension, then this public sample complexity must satisfy $m = \Omega(1/\alpha)$. This proves the theorem.

Although the proof of Lemma 17 is almost straightforward given Theorem 5.5 in (Beimel et al., 2013), we elaborate on a couple of minor details. First, even though the reduction in (Beimel et al., 2013) involves pure differentially private algorithms, the same construction would also work for $(\varepsilon, \delta)$-differential privacy, with minor and obvious changes in the privacy analysis. Second, the reduction in (Beimel et al., 2013) is for proper sanitizers, which are query-release algorithms restricted to output a data set from the input domain rather than an arbitrary data structure mapping $H$ to $[-1, 1]$. As discussed in Remark 9, ignoring computational complexity, any PAP query-release algorithm satisfying Definition 8 can be transformed into one that outputs a data set from the input domain and has the same accuracy (up to a constant factor). Given these minor details, and since any PAP algorithm can obviously be viewed as a differentially private algorithm operating on the private data set (by hardwiring the public data set into the algorithm), Lemma 17 follows by invoking the reduction in (Beimel et al., 2013).

## Acknowledgments

We thank the reviewers for their helpful comments. RB's research is supported by NSF Awards AF-1908281 and SHF-1907715, a Google Faculty Research Award, and OSU faculty start-up support. AC and JU were supported by NSF grants CCF-1718088, CCF-1750640, CNS-1816028, and CNS-1916020. AN is supported by an Ontario ERA and an NSERC Discovery Grant (RGPIN-2016-06333). ZSW is supported by a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Mozilla research grant. Part of this work was done while RB, AC, AN, JU, and ZSW were visiting the Simons Institute for the Theory of Computing.

## References

Alon, N., Bassily, R., and Moran, S. Limits of private learning with access to public data. arXiv:1910.11519 [cs.LG] (appeared at NeurIPS 2019), 2019a.

Alon, N., Livni, R., Malliaris, M., and Moran, S. Private PAC learning implies finite Littlestone dimension. In STOC 2019, pp. 852–860 (arXiv:1806.00949), 2019b.
Assouad, P. Densité et dimension. Annales de l'Institut Fourier, 33:233–282, 1983.

Bassily, R., Thakurta, A. G., and Thakkar, O. D. Model-agnostic private learning. In Advances in Neural Information Processing Systems, pp. 7102–7112, 2018.

Beimel, A., Nissim, K., and Stemmer, U. Private learning and sanitization: Pure vs. approximate differential privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 363–378. Springer, 2013.

Ben-David, S., Pál, D., and Shalev-Shwartz, S. Agnostic online learning. In COLT, 2009.

Blum, A., Ligett, K., and Roth, A. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):1–25, 2013.

Bun, M., Nissim, K., Stemmer, U., and Vadhan, S. P. Differentially private release and learning of threshold functions. In IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS 2015), Berkeley, CA, USA, pp. 634–649, 2015.

Bun, M., Ullman, J., and Vadhan, S. Fingerprinting codes and the price of approximate differential privacy. SIAM Journal on Computing, 47(5):1888–1938, 2018.

Chaudhuri, K. and Hsu, D. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 155–186, 2011.

Dajani, A. N., Lauger, A. D., Singer, P. E., Kifer, D., Reiter, J. P., Machanavajjhala, A., Garfinkel, S. L., Dahl, S. A., Graham, M., Karwa, V., Kim, H., Leclerc, P., Schmutte, I. M., Sexton, W. N., Vilhuber, L., and Abowd, J. M. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.

Dinur, I. and Nissim, K. Revealing information while preserving privacy. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202–210, 2003.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284. Springer, 2006.

Dwork, C., Smith, A., Steinke, T., Ullman, J., and Vadhan, S. Robust traceability from trace amounts. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 650–669. IEEE, 2015.

Hamm, J., Cao, Y., and Belkin, M. Learning privately from multiparty data. In International Conference on Machine Learning, pp. 555–563, 2016.

Hardt, M. and Rothblum, G. N. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 61–70. IEEE, 2010.

Littlestone, N. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

Muthukrishnan, S. and Nikolov, A. Optimal private halfspace counting via discrepancy. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pp. 1285–1292, 2012.

Nandi, A. and Bassily, R. Privately answering classification queries in the agnostic PAC model. arXiv:1907.13553; to appear in ALT 2020, 2019.

Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I., and Talwar, K. Semi-supervised knowledge transfer for deep learning from private training data. In International Conference on Learning Representations (ICLR), 2017.
Papernot, N., Song, S., Mironov, I., Raghunathan, A., Talwar, K., and Erlingsson, Ú. Scalable private learning with PATE. arXiv:1802.08908, 2018.

Sauer, N. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.

Steinke, T. and Ullman, J. Between pure and approximate differential privacy. arXiv:1501.06095, 2015.