# Model-Agnostic Private Learning

Raef Bassily (Department of Computer Science & Engineering, The Ohio State University; bassily.1@osu.edu)
Om Thakkar (Department of Computer Science, Boston University; omthkkr@bu.edu)
Abhradeep Thakurta (Department of Computer Science, University of California Santa Cruz; aguhatha@ucsc.edu)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

We design differentially private learning algorithms that are agnostic to the learning model, assuming access to a limited amount of unlabeled public data. First, we provide a new differentially private algorithm for answering a sequence of $m$ online classification queries (given by a sequence of $m$ unlabeled public feature vectors) based on a private training set. Our algorithm follows the paradigm of subsample-and-aggregate, in which any generic non-private learner is trained on disjoint subsets of the private training set, and then, for each classification query, the votes of the resulting ensemble of classifiers are aggregated in a differentially private fashion. Our private aggregation is based on a novel combination of the distance-to-instability framework [26] and the sparse-vector technique [15, 18]. We show that our algorithm makes a conservative use of the privacy budget. In particular, if the underlying non-private learner yields a classification error of at most $\alpha \in (0, 1)$, then our construction answers more queries, by at least a factor of $1/\alpha$ in some cases, than what is implied by a straightforward application of the advanced composition theorem for differential privacy. Next, we apply the knowledge transfer technique to construct a private learner that outputs a classifier, which can be used to answer an unlimited number of queries. In the PAC model, we analyze our construction and prove upper bounds on the sample complexity for both the realizable and the non-realizable cases. Similar to the non-private sample complexity, our bounds are completely characterized by the VC dimension of the concept class.

1 Introduction

The main goal in the standard setting of differentially private learning is to design a differentially private learner that, given a private training set as input, outputs a model (or classifier) that is safe to publish. Despite being a natural way to define the private learning problem, this standard approach has several limitations. First, there are pessimistic lower bounds for various learning problems implying that the error of the final private model will generally have a necessary dependence on the dimensionality of the model [2], or the size of the model class [9]. Second, this approach often requires non-trivial, white-box modification of existing non-private learners [19, 11, 21, 26, 2, 27, 1], which can make some of these constructions less practical since they require changes to the infrastructure of existing systems. Third, designing algorithms for this setting often requires knowledge of the underlying structure of the learning problem, e.g., specific properties of the model class [4, 5, 9], or convexity, compactness, and other geometric properties of the model space [2, 27].

We study the problem of differentially private learning when the learner has access to a limited amount of public unlabeled data. Our central goal is to characterize, in a basic model such as the standard PAC model, the improvements one can achieve for private learning in such a relaxed setting
compared to the aforementioned standard setting. Towards this goal, we first consider a simpler problem, namely, privately answering classification queries given by a sequence of public unlabeled data points $Q = \{\tilde{x}_1, \ldots, \tilde{x}_m\}$. In this problem, one is given a private labeled dataset denoted by $D$, and the goal is to design an $(\epsilon, \delta)$-differentially private algorithm that labels all $m$ public feature vectors in $Q$. In designing such an algorithm, there are four main goals we aim to achieve:

(i) We wish to provide an algorithm that enables answering as many classification queries as possible while ensuring $(\epsilon, \delta)$-differential privacy. This is a crucial property for the utility of such an algorithm, since the utility in this problem is limited by the number of queries we can answer while satisfying the target privacy guarantee.

(ii) We want a modular design paradigm in which the private algorithm can use any generic non-private algorithm (learner) in a black-box fashion, i.e., it only has oracle access to the non-private algorithm. This property is very attractive from a practical standpoint, as implementing such an algorithm does not require changing the internal design of existing non-private algorithms.

(iii) We want a design paradigm that enables us to easily and formally transfer the accuracy guarantees of the underlying non-private algorithm into meaningful accuracy guarantees for the private algorithm. The most natural measure of accuracy in this setting is the misclassification rate.

(iv) We want to be able to use such an algorithm, together with the public unlabeled data, to construct a differentially private learner that outputs a classifier, which can then be used to answer as many classification queries as we wish. In particular, given the second goal above, the final private learner would be completely agnostic to the intricacies of the underlying non-private learner and its model; namely, it would be oblivious to whether the model is a simple logistic regression or a multi-layer deep neural network.

Given the above goals, a natural framework to consider is knowledge aggregation and transfer, which is inspired by the early work of Breiman [8]. The general idea is to train a non-private learner on different subsamples from the private dataset to generate an ensemble of classifiers. The ensemble is collectively used, in a differentially private manner, to generate privatized labels for the given unlabeled public data. Finally, the public data together with the private labels are used to train a non-private learner, which produces a final classifier that is safe to publish.

Related Work: For private learning via knowledge aggregation and transfer, Hamm et al. [17] explored a similar technique; however, their construction deviated from the above description. In particular, it was a white-box construction with weak accuracy guarantees, and their guarantees also involved strong assumptions about the learning model and the loss function used in training. In recent work [23, 24], of which [24] is independent of our work, Papernot et al. gave algorithms that follow the knowledge transfer paradigm described above. Their constructions are black-box. However, only empirical evaluations are given for their constructions; no formal utility guarantees are provided. For the query-answering setting, a recent independent work [12] considers the problem of private prediction, but only in the single-query setting, whereas we study the multiple-query setting.
The earliest idea of using ensemble classifiers to provide differentially private prediction can be traced back to Dwork, Rothblum, and Thakurta from 2013.

Our Techniques: In this work, we give a new construction for privately answering classification queries that is based on a novel framework combining two special techniques from the differential privacy literature, namely, the subsampling stability framework [22, 26] and the sparse-vector technique [15, 18, 16]. Our construction also follows the knowledge aggregation and transfer paradigm, but it exploits the stability properties of good non-private learners in a quantifiable and formal manner. Our construction is based on the following idea: if a good learner is independently trained $k$ times on equally sized, independent training sets, then one would expect the corresponding output classifiers $h_1, \ldots, h_k$ to predict similarly on a new example from the same distribution. Using this idea, we show that among $m$ classification queries, one only needs to pay the price of privacy for the queries on which there is significant disagreement among the $k$ classifiers. Using our construction and the unlabeled public data, we also provide a final private learner. We show via formal and quantifiable guarantees that our construction achieves the four main goals stated earlier. We note that our framework is not restricted to classification queries; it can be used for privately answering any sequence of online queries that satisfy certain stability properties in the sense of [26]. Due to space limitations, and to avoid distracting the reader from the main results of this work, we defer the description of the generic framework to the full version, and focus here on the special case of classification queries.

1.1 Our Contributions

Answering online classification queries using the privacy budget conservatively: In Section 3, we give our $(\epsilon, \delta)$-differentially private construction for answering a sequence of online classification queries. Our construction uses any generic non-private learner in a black-box fashion. The privacy guarantee is completely independent of the non-private learner and its accuracy. Moreover, the accuracy guarantee can be obtained directly from the accuracy of the non-private learner, i.e., the construction allows us to directly and formally transform the accuracy guarantee of the non-private learner into an accuracy guarantee for the final private algorithm. We provide a new privacy analysis for the novel framework combining subsampling stability and the sparse-vector technique. We analyze the accuracy of our algorithm in terms of its misclassification rate, defined as the ratio of misclassified queries to the total number of queries, in the standard (agnostic) PAC model. Our accuracy analysis is new, and is based on a simple counting argument. We consider both the realizable and the non-realizable (agnostic) cases. In the realizable case, the underlying non-private learner is assumed to be a PAC learner for a hypothesis class $\mathcal{H}$ of VC-dimension $V$. The private training set consists of $n$ labeled examples, where the labels are generated by some unknown hypothesis $h^* \in \mathcal{H}$. The queries are given by a sequence of $m$ i.i.d. unlabeled domain points drawn from the same distribution as the domain points in the training set. We show that, with high probability, our private algorithm can answer up to $n/V$ queries with a misclassification rate of $V/n$, which is essentially the optimal misclassification rate attainable without privacy.
Thus, answering those queries essentially comes with no cost for privacy. When answering $m > n/V$ queries, the misclassification rate is $\approx m V^2/n^2$. A straightforward application of the advanced composition theorem of differential privacy would have led to a misclassification rate of $\approx \sqrt{m}\,V/n$, which can be significantly larger than our rate. This is because our construction pays a privacy cost only for the hard queries on which the PAC learner tends to be incorrect. Our result for the realizable case is summarized below. We also provide an analogous statement for the non-realizable case (Theorem 3.5).

Informal Theorem 1.1 (Corresponding to Theorem 3.4). Given a PAC learner for a class $\mathcal{H}$ of VC-dimension $V$, a private training set of size $n$, and assuming realizability, our private construction (Algorithm 2) answers a sequence of up to $\Omega(n/V)$ binary classification queries such that, with high probability, the misclassification rate is $O(V/n)$. When the number of queries $m$ is beyond $\Omega(n/V)$, then with high probability, the misclassification rate is $O(m V^2/n^2)$.

A model-agnostic private learner with formal guarantees: In Section 4, we use the knowledge transfer technique to bootstrap a private learner from our construction above. The idea is to use our private construction to label a sufficient number of public feature vectors. Then, we use these newly labeled public data to train a non-private learner, which finally outputs a classifier. Since there is no privacy constraint associated with the public data, the overall construction remains private, as differential privacy is closed under post-processing. Note that this construction also uses the non-private learner as a black box, and hence it is agnostic to the structure of such a learner and the associated model. This general technique has also been adopted in [23]. Our main contribution here is that we provide formal and explicit utility guarantees for the final private learner in the standard (agnostic) PAC model. Our guarantees are in terms of upper bounds on the sample complexity in both the realizable and non-realizable cases. Let $\mathrm{err}(h; \mathcal{D}) \triangleq \Pr_{(x,y)\sim\mathcal{D}}[h(x) \neq y]$ denote the true classification error of a hypothesis $h$. Given black-box access to an agnostic PAC learner for a class $\mathcal{H}$ of VC-dimension $V$, we obtain the following results:

Informal Theorem 1.2 (Corresponding to Theorems 4.2, 4.3). Let $0 < \alpha < 1$. Let $n$ be the size of the private training set. There exists an $(\epsilon, \delta)$-differentially private algorithm (Algorithm 3) that, given access to $m = O(V/\alpha^2)$ unlabeled public data points, w.h.p. outputs a classifier $\hat{h} \in \mathcal{H}$ such that the following guarantees hold: (i) Realizable case: $\mathrm{err}(\hat{h}; \mathcal{D}) \le \alpha$ for $n = O\big(V^{3/2}/\alpha^{3/2}\big)$, and (ii) Agnostic case: $\mathrm{err}(\hat{h}; \mathcal{D}) \le \alpha + O(\gamma)$ for $n = O\big(V^{3/2}/\alpha^{5/2}\big)$, where $\gamma = \min_{h\in\mathcal{H}} \mathrm{err}(h; \mathcal{D})$.

Our bounds are only a factor of $O\big(\sqrt{V/\alpha}\big)$ worse than the corresponding optimal non-private bounds. In the agnostic case, however, we note that the accuracy of the output hypothesis in our case has a suboptimal dependency (by a small constant factor) on $\gamma \triangleq \min_{h\in\mathcal{H}} \mathrm{err}(h; \mathcal{D})$.

We note that the same construction can serve as a private learner in a less restrictive setting where only the labels of the training set are considered private information. This setting is known as label-private learning, and it has been explored before in [10] and [6]. Both works have only considered pure, i.e., $(\epsilon, 0)$, differentially private learners, and their constructions are white-box, i.e., they do not allow for using a black-box non-private learner.
The bounds in [10] involve smoothness assumptions on the underlying distribution. In [6], an upper bound on the sample complexity is derived for the realizable case. Their bound is a factor of $O(1/\alpha)$ worse than the optimal non-private bound for the realizable case.

2 Preliminaries

In this section, we formally define the notation, provide important definitions, and state the main existing results used in this work. We denote the data universe by $\mathcal{U} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes an abstract domain for unlabeled data (the feature-vector space) and $\mathcal{Y} = \{0, 1\}$. An $n$-element dataset is denoted by $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \in \mathcal{U}^n$. For any two datasets $D, D' \in \mathcal{U}^*$, we denote the symmetric difference between them by $D \triangle D'$. We will use the standard notion of agnostic PAC learning [20] (see the full version for a definition). We will also use the following parameterized version of the definition of agnostic PAC learning.

Definition 2.1 ($(\alpha, \beta, n)$-learner for a class $\mathcal{H}$). Let $\alpha, \beta \in (0, 1)$ and $n \in \mathbb{N}$. An algorithm $\Theta$ is an $(\alpha, \beta, n)$ (agnostic PAC) learner for $\mathcal{H}$ if, given an input dataset $D$ of $n$ i.i.d. examples from the underlying unknown distribution $\mathcal{D}$, with probability at least $1 - \beta$ it outputs a hypothesis $h_D \in \mathcal{H}$ with $\mathrm{err}(h_D; \mathcal{D}) \le \gamma + \alpha$, where $\mathrm{err}(h_D; \mathcal{D}) \triangleq \Pr_{(x,y)\sim\mathcal{D}}[h_D(x) \neq y]$ and $\gamma \triangleq \min_{h\in\mathcal{H}} \mathrm{err}(h; \mathcal{D})$.

Next, we define the notion of differential privacy.

Definition 2.2 ($(\epsilon, \delta)$-Differential Privacy [13, 14]). A (randomized) algorithm $M$ with input domain $\mathcal{U}^*$ and output range $\mathcal{R}$ is $(\epsilon, \delta)$-differentially private (DP) if for all pairs of datasets $D, D' \in \mathcal{U}^*$ s.t. $|D \triangle D'| = 1$, and every measurable $S \subseteq \mathcal{R}$, we have
$$\Pr\big(M(D) \in S\big) \le e^{\epsilon}\, \Pr\big(M(D') \in S\big) + \delta,$$
where the probability is over the coin flips of $M$.

2.1 Distance to Instability Framework

Next, we describe the distance-to-instability framework from [26], which releases the exact value of a function on a dataset while preserving differential privacy, provided the function is sufficiently stable on the dataset. We define the notion of stability first, and provide pseudocode for a private estimator for any function via this framework in Algorithm 1.

Definition 2.3 ($k$-stability [26]). A function $f : \mathcal{U}^* \to \mathcal{R}$ is $k$-stable on dataset $D$ if adding or removing any $k$ elements from $D$ does not change the value of $f$, i.e., for all $D'$ s.t. $|D \triangle D'| \le k$, we have $f(D) = f(D')$. We say $f$ is stable on $D$ if it is (at least) $1$-stable on $D$, and unstable otherwise. The distance to instability of a dataset $D \in \mathcal{U}^*$ with respect to a function $f$ is the number of elements that must be added to or removed from $D$ to reach a dataset on which $f$ is not stable.

Algorithm 1 $\mathcal{A}_{\mathrm{stab}}$ [26]: Private release of a classification query via distance to instability
Input: Dataset $D \in \mathcal{U}^*$, a function $f : \mathcal{U}^* \to \mathcal{R}$ for some range $\mathcal{R}$, the distance-to-instability function associated with $f$: $\mathrm{dist}_f : \mathcal{U}^* \to \mathbb{R}$, threshold $\Gamma$, privacy parameter $\epsilon > 0$
1: $\widehat{\mathrm{dist}} \leftarrow \mathrm{dist}_f(D) + \mathrm{Lap}(1/\epsilon)$
2: If $\widehat{\mathrm{dist}} > \Gamma$, then output $f(D)$, else output $\perp$

Theorem 2.4 (Privacy guarantee for $\mathcal{A}_{\mathrm{stab}}$). If the threshold $\Gamma = \log(1/\delta)/\epsilon$, and the distance-to-instability function is $\mathrm{dist}_f(D) = \arg\max_k\, [f(D) \text{ is } k\text{-stable}]$, then $\mathcal{A}_{\mathrm{stab}}$ (Algorithm 1) is $(\epsilon, \delta)$-DP.

Theorem 2.5 (Utility guarantee for $\mathcal{A}_{\mathrm{stab}}$). If the threshold $\Gamma = \log(1/\delta)/\epsilon$, the distance-to-instability function is chosen as in Theorem 2.4, and $f(D)$ is $\big((\log(1/\delta) + \log(1/\beta))/\epsilon\big)$-stable, then Algorithm 1 outputs $f(D)$ with probability at least $1 - \beta$.

The proof of the above two theorems follows from [26, Proposition 3].
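To make Algorithm 1 concrete, the following is a minimal Python sketch of the distance-to-instability release. The function names (`stable_release`, `majority_stability`) and the toy majority-vote example are our own illustration, not the authors' implementation; the threshold and noise are chosen as in Theorems 2.4 and 2.5.

```python
import numpy as np


def stable_release(dataset, f, dist_f, eps, delta, rng=None):
    """Release f(dataset) if it is sufficiently stable, else return None (playing the role of "bottom").

    f      : function mapping a dataset to a value in some range R.
    dist_f : distance to instability of f at `dataset`, i.e., the largest k
             such that f is k-stable on `dataset` (Definition 2.3).
    With the threshold below, the release is (eps, delta)-DP (Theorem 2.4).
    """
    rng = np.random.default_rng() if rng is None else rng
    threshold = np.log(1.0 / delta) / eps                      # Gamma in Algorithm 1
    noisy_dist = dist_f(dataset) + rng.laplace(scale=1.0 / eps)
    return f(dataset) if noisy_dist > threshold else None


# Toy usage: privately release the majority label of a list of binary votes.
def majority(votes):
    return int(sum(votes) > len(votes) / 2)


def majority_stability(votes):
    # Margin-based proxy: roughly how many votes can be added/removed before
    # the majority label could change.
    ones = sum(votes)
    return max(0, abs(2 * ones - len(votes)) - 1)


votes = [1] * 80 + [0] * 20
print(stable_release(votes, majority, majority_stability, eps=1.0, delta=1e-6))
```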
3 Privately Answering Classification Queries

In this section, we instantiate the distance-to-instability framework (Algorithm 1) with the subsample-and-aggregate framework [22, 26], and then combine it with the sparse-vector technique [15, 16] to obtain a construction for privately answering classification queries with a conservative use of the privacy budget (Algorithm 2 below). We consider here the case of binary classification for simplicity. However, we note that one can easily extend the construction (and obtain analogous guarantees) to multi-class classification.

Note: The full version [3] contains a detailed discussion of a more general framework, together with a more modular and generic description of the algorithmic techniques. The description of the algorithms in this short version involves only classification queries.

A private training set, denoted by $D$, is a set of $n$ private binary-labeled data points $\{(x_1, y_1), \ldots, (x_n, y_n)\} \in (\mathcal{X} \times \mathcal{Y})^n$ drawn i.i.d. from some (arbitrary, unknown) distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We will refer to the induced marginal distribution over $\mathcal{X}$ as $\mathcal{D}_{\mathcal{X}}$. We consider a sequence of (online) classification queries defined by a sequence of $m$ unlabeled points from $\mathcal{X}$, $Q = \{\tilde{x}_1, \ldots, \tilde{x}_m\} \in \mathcal{X}^m$, drawn i.i.d. from $\mathcal{D}_{\mathcal{X}}$, and let $\{\tilde{y}_1, \ldots, \tilde{y}_m\} \in \{0, 1\}^m$ be the corresponding true unknown labels. Algorithm 2 has oracle access to a non-private learner $\Theta$ for a hypothesis class $\mathcal{H}$. We will consider both the realizable and non-realizable cases of the standard PAC model. In particular, $\Theta$ is assumed to be an (agnostic) PAC learner for $\mathcal{H}$.

Algorithm 2 $\mathcal{A}_{\mathrm{binClas}}$: Private Online Binary Classification via subsample-and-aggregate and sparse vector
Input: Private dataset $D$, sequence of online unlabeled public data (defining the classification queries) $Q = \{\tilde{x}_1, \ldots, \tilde{x}_m\}$, oracle access to a non-private learner $\Theta : \mathcal{U}^* \to \mathcal{H}$ for a hypothesis class $\mathcal{H}$, cutoff parameter $T$, privacy parameters $\epsilon, \delta > 0$, failure probability $\beta$
1: $c \leftarrow 0$, $\lambda \leftarrow \sqrt{32\,T \log(2/\delta)}/\epsilon$, and $k \leftarrow 3\big\lceil 4\sqrt{2}\,\lambda \log\big(4mT/\min(\delta, \beta/2)\big)\big\rceil$
2: $w \leftarrow 2\lambda \log(2m/\delta)$, and $\widehat{w} \leftarrow w + \mathrm{Lap}(\lambda)$
3: Arbitrarily split $D$ into $k$ non-overlapping chunks of size $n/k$. Call them $D_1, \ldots, D_k$
4: For $j \in [k]$, train $\Theta$ on $D_j$ to get a classifier $h_j \in \mathcal{H}$
5: For $i \in [m]$ and while $c \le T$ do
6: Let $S_i = \{h_1(\tilde{x}_i), \ldots, h_k(\tilde{x}_i)\}$, and for $y \in \{0, 1\}$, let $\mathrm{ct}(y) = $ number of times $y$ appears in $S_i$
7: $\widehat{q}_{\tilde{x}_i}(D) \leftarrow \arg\max_{y \in \{0,1\}} \mathrm{ct}(y)$, and $\mathrm{dist}_{\widehat{q}_{\tilde{x}_i}} \leftarrow \max\big\{0,\ \mathrm{ct}(\widehat{q}_{\tilde{x}_i}) - \mathrm{ct}(1 - \widehat{q}_{\tilde{x}_i}) - 1\big\}$
8: $\mathrm{out}_i \leftarrow \mathcal{A}_{\mathrm{stab}}\big(D, \widehat{q}_{\tilde{x}_i}, \mathrm{dist}_{\widehat{q}_{\tilde{x}_i}}, \Gamma = \widehat{w}, \epsilon = 1/(2\lambda)\big)$
9: If $\mathrm{out}_i = \perp$, then $c \leftarrow c + 1$ and $\widehat{w} \leftarrow w + \mathrm{Lap}(\lambda)$
10: Output $\mathrm{out}_i$

Theorem 3.1 (Privacy guarantee for $\mathcal{A}_{\mathrm{binClas}}$). Algorithm $\mathcal{A}_{\mathrm{binClas}}$ (Algorithm 2) is $(\epsilon, \delta)$-DP.

The proof of this theorem follows from combining the guarantees of the distance-to-instability framework [26] and the sparse-vector technique [16]. The idea is that in each round of query response, if the algorithm outputs a label in $\{0, 1\}$, then there is no loss in privacy in terms of $\epsilon$ (as there is sufficient consensus). However, when the output is $\perp$, there is a loss of privacy. This argument is formalized via the distance-to-instability framework. The sparse-vector technique helps account for the privacy loss across all $m$ queries. A formal proof of this theorem is deferred to the full version.
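To make the interaction between subsample-and-aggregate and the sparse-vector accounting concrete, here is a minimal Python sketch of $\mathcal{A}_{\mathrm{binClas}}$. The function names and the learner interface (`train(chunk)` returning a callable classifier) are our own illustration, and the constants in the settings of $\lambda$ and $k$ mirror Step 1 as reconstructed above; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np


def private_online_classify(private_data, queries, train, T, eps, delta, beta,
                            rng=None):
    """Sketch of A_binClas: subsample-and-aggregate voting with sparse-vector accounting.

    private_data : list of (x, y) pairs (the private training set D).
    queries      : list of public unlabeled points (x_1, ..., x_m).
    train        : black-box non-private learner; train(chunk) -> classifier h, h(x) in {0, 1}.
    T            : cutoff on the number of unstable answers (None outputs).
    Returns a list whose entries are 0, 1, or None (None plays the role of "bottom").
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(queries)
    lam = np.sqrt(32.0 * T * np.log(2.0 / delta)) / eps               # Step 1: noise scale
    k = int(3 * np.ceil(4 * np.sqrt(2) * lam *
                        np.log(4 * m * T / min(delta, beta / 2))))    # Step 1: ensemble size
    w = 2.0 * lam * np.log(2.0 * m / delta)                           # Step 2: threshold
    w_hat = w + rng.laplace(scale=lam)

    # Steps 3-4: split D into k disjoint chunks and train the ensemble.
    chunks = np.array_split(np.arange(len(private_data)), k)
    ensemble = [train([private_data[i] for i in idx]) for idx in chunks]

    answers, c = [], 0                                                # c counts "bottom" outputs
    for x in queries:
        if c > T:                                                     # budget for unstable queries exhausted
            break
        votes = [h(x) for h in ensemble]                              # Step 6
        ones = sum(votes)
        label = int(ones > k - ones)                                  # Step 7: majority label
        dist = max(0, abs(2 * ones - k) - 1)                          # Step 7: margin-based distance to instability
        noisy_dist = dist + rng.laplace(scale=2.0 * lam)              # Step 8: A_stab with eps = 1/(2*lam)
        if noisy_dist > w_hat:
            answers.append(label)
        else:                                                         # Step 9: unstable query; pay privacy, resample threshold
            answers.append(None)
            c += 1
            w_hat = w + rng.laplace(scale=lam)
    return answers
```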
Theorem 3.2. Let $\alpha, \beta \in (0, 1)$, and let $\gamma \triangleq \min_{h\in\mathcal{H}} \mathrm{err}(h; \mathcal{D})$ (note that in the realizable case $\gamma = 0$). In Algorithm $\mathcal{A}_{\mathrm{binClas}}$ (Algorithm 2), suppose we set the cutoff parameter as
$$T = 3\left((\gamma + \alpha)m + \sqrt{(\gamma + \alpha)\,m \log(m/\beta)/2}\right).$$
If $\Theta$ is an $(\alpha, \beta/k, n/k)$-agnostic PAC learner (Definition 2.1), where $k$ is as defined in $\mathcal{A}_{\mathrm{binClas}}$, then (i) with probability at least $1 - 2\beta$, $\mathcal{A}_{\mathrm{binClas}}$ does not halt before answering all the $m$ queries in $Q$, and outputs $\perp$ for at most $T$ queries; and (ii) the misclassification rate of $\mathcal{A}_{\mathrm{binClas}}$ is at most $T/m = O(\gamma + \alpha)$.

Proof. First, notice that $\Theta$ is an $(\alpha, \beta/k, n/k)$-agnostic PAC learner; hence, w.p. at least $1 - \beta$, we have $\mathrm{err}(h_j; \mathcal{D}) \le \gamma + \alpha$ for all $j \in [k]$. So, by the standard Chernoff bound, with probability at least $1 - \beta$, none of the $h_j$'s misclassifies more than $(\gamma + \alpha)m + \sqrt{(\gamma + \alpha)\,m \log(m/\beta)/2} \triangleq B$ queries in $Q$. Now, we use the following Markov-style counting argument (Lemma 3.3) to bound the number of queries for which at least $k/3$ classifiers in the ensemble $\{h_1, \ldots, h_k\}$ result in a misclassification.

Lemma 3.3. Consider a set $\{(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_m, \tilde{y}_m)\} \subset \mathcal{X} \times \mathcal{Y}$, and $k$ binary classifiers $h_1, \ldots, h_k$, where each classifier is guaranteed to make at most $B$ mistakes in predicting the $m$ labels $\{\tilde{y}_1, \ldots, \tilde{y}_m\}$. For any $\xi \in (0, 1/2]$,
$$\Big|\big\{i \in [m] : |\{j \in [k] : h_j(\tilde{x}_i) \neq \tilde{y}_i\}| > \xi k\big\}\Big| < B/\xi.$$
(Indeed, the total number of mistakes across the ensemble is at most $kB$, while each such query accounts for more than $\xi k$ of them.)

Therefore, there are at most $3B$ queries $\tilde{x}_i \in Q$ for which the votes of the ensemble $\{h_1(\tilde{x}_i), \ldots, h_k(\tilde{x}_i)\}$ contain more than $k/3$ ones and more than $k/3$ zeros (i.e., for which the classifiers significantly disagree). Now, to prove part (i) of the theorem, observe that to satisfy the distance-to-instability condition (in Theorem 2.5) for the remaining $m - 3B$ queries, it would suffice to have $k/3 \ge 4\sqrt{2}\,\lambda \log\big(4mT/\min(\delta, \beta/2)\big) = 32\sqrt{T\log(2/\delta)}\,\log\big(4mT/\min(\delta, \beta/2)\big)/\epsilon$ (taking into account the noise in the threshold passed to $\mathcal{A}_{\mathrm{stab}}$ in Step 8 of $\mathcal{A}_{\mathrm{binClas}}$; a detailed argument is given in the full version). This condition on $k$ is satisfied by the setting of $k$ in $\mathcal{A}_{\mathrm{binClas}}$. For part (ii), note that by the same lemma above, w.p. at least $1 - \beta$, there are at least $2k/3$ classifiers that output the correct label on each of the remaining $m - 3B$ queries. Hence, w.p. at least $1 - 2\beta$, Algorithm $\mathcal{A}_{\mathrm{binClas}}$ will correctly classify such queries. This completes the proof.

Remark 1. A natural question about using Theorem 3.2 in the agnostic case is how one would know the value of $\gamma$ in practice, in order to set the right value for $T$. One simple approach is to set aside half of the training dataset, and compute the empirical misclassification rate with differential privacy to get a sufficiently accurate estimate for $\gamma + \alpha$ (as in standard validation techniques [25]), and use it to set $T$. Since the sensitivity of the misclassification rate is small, the amount of noise added would not affect the accuracy of the estimation. Furthermore, with a large enough training dataset, the asymptotics of Theorem 3.2 would not change either.
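Remark 1 leaves the private estimation step implicit; the following is a minimal Python sketch of one way to carry it out, with names of our own choosing. It releases the ensemble's average empirical error on a held-out half of the training data via the Laplace mechanism, and the stated sensitivity bound assumes the ensemble was trained on disjoint chunks of the other half, as in Algorithm 2.

```python
import numpy as np


def private_gamma_estimate(holdout, ensemble, eps, rng=None):
    """Noisy estimate of the ensemble's average misclassification rate (roughly gamma + alpha),
    usable for setting the cutoff T as suggested in Remark 1.

    holdout  : list of labeled pairs (x, y) set aside from the private data.
    ensemble : classifiers trained on disjoint chunks of the remaining half.
    Sensitivity of the averaged rate: changing one holdout point moves each of
    the k empirical rates by at most 1/len(holdout); changing one point of the
    other half affects a single classifier only, moving the average by at most
    1/k. Laplace noise calibrated to the larger of the two gives eps-DP.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, k = len(holdout), len(ensemble)
    rates = [np.mean([clf(x) != y for x, y in holdout]) for clf in ensemble]
    sensitivity = max(1.0 / h, 1.0 / k)
    return float(np.mean(rates) + rng.laplace(scale=sensitivity / eps))
```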
Explicit misclassification rate: In Theorem 3.2, it might seem that there is a circular dependency among the following quantities: $T \to \alpha \to k \to T$. However, the number of independent relations is equal to the number of parameters, and hence, we can set them meaningfully to obtain non-trivial misclassification rates. We now obtain an explicit misclassification rate for $\mathcal{A}_{\mathrm{binClas}}$ in terms of the VC-dimension of $\mathcal{H}$. Let $V$ denote the VC-dimension of $\mathcal{H}$. First, we consider the realizable case ($\gamma = 0$). Our result for this case is formally stated in the following theorem.

Theorem 3.4 (Misclassification rate in the realizable case). For any $\beta \in (0, 1)$, there exist $M = \Omega(\epsilon n/V)$ and a setting of $T = O\big(\bar{m}^2 V^2/(\epsilon^2 n^2)\big)$, where $\bar{m} \triangleq \max(M, m)$, such that w.p. at least $1 - \beta$, $\mathcal{A}_{\mathrm{binClas}}$ yields the following misclassification rate: (i) $O\big(V/(\epsilon n)\big)$ for up to $M$ queries, and (ii) $O\big(m V^2/(\epsilon^2 n^2)\big)$ for $m > M$ queries.

Proof. By standard uniform convergence arguments [25], there is an $(\alpha, \beta, n/k)$-PAC learner with misclassification rate $\alpha = O(kV/n)$. Setting $T$ as in Theorem 3.2 with the aforementioned setting of $\alpha$, and setting $k$ as in Algorithm $\mathcal{A}_{\mathrm{binClas}}$, gives the setting of $T$ in the theorem statement. For up to $M = \Omega(\epsilon n/V)$ queries, the setting of $T$ becomes $T = O(1)$, and hence Theorem 3.2 implies that $\mathcal{A}_{\mathrm{binClas}}$ yields a misclassification rate of $O\big(V/(\epsilon n)\big)$, which is essentially the same as the optimal non-private rate. Beyond $\Omega(\epsilon n/V)$ queries, $T = O\big(m^2 V^2/(\epsilon^2 n^2)\big)$, and hence, Theorem 3.2 implies that the misclassification rate of $\mathcal{A}_{\mathrm{binClas}}$ is $O\big(m V^2/(\epsilon^2 n^2)\big)$.

We note that the attainable misclassification rate is significantly smaller than the rate of $O\big(\sqrt{m}\,V/(\epsilon n)\big)$ implied by a direct application of the advanced composition theorem of differential privacy. Next, we provide an analogous statement for the non-realizable case ($\gamma > 0$).

Theorem 3.5 (Misclassification rate in the non-realizable case). For any $\beta \in (0, 1)$, there exist $M = \Omega\big(\min\{1/\gamma,\ \sqrt{\epsilon n/V}\}\big)$ and a setting of $T = O(\bar{m}\gamma) + O\big(\bar{m}^{4/3} V^{2/3}/(\epsilon n)^{2/3}\big)$, where $\bar{m} \triangleq \max\{M, m\}$, such that w.p. at least $1 - \beta$, $\mathcal{A}_{\mathrm{binClas}}$ yields the following misclassification rate: (i) $O(\gamma) + O\big(\sqrt{V/(\epsilon n)}\big)$ for up to $M$ queries, and (ii) $O(\gamma) + O\big(m^{1/3} V^{2/3}/(\epsilon n)^{2/3}\big)$ for $m > M$ queries.

Proof. Again, by a standard argument, $\Theta$ is an $(\alpha, \beta, n/k)$-agnostic PAC learner with $\alpha = O\big(\sqrt{kV/n}\big)$, and hence, it has a misclassification rate of $\gamma + O\big(\sqrt{kV/n}\big)$ when trained on a dataset of size $n/k$. Setting $T$ as in Theorem 3.2 with this value of $\alpha$, setting $k$ as in $\mathcal{A}_{\mathrm{binClas}}$, and then solving for $T$ in the resulting expression, we get the setting of $T$ as in the theorem statement (it helps here to consider the cases $\gamma > \alpha$ and $\gamma \le \alpha$ separately). For up to $M = \Omega\big(\min\{1/\gamma,\ \sqrt{\epsilon n/V}\}\big)$ queries, the setting of $T$ becomes $T = O(1)$, and hence Theorem 3.2 implies that $\mathcal{A}_{\mathrm{binClas}}$ yields a misclassification rate of $O(\gamma) + O\big(\sqrt{V/(\epsilon n)}\big)$, which is essentially the same as the optimal non-private rate. Beyond $\Omega\big(\min\{1/\gamma,\ \sqrt{\epsilon n/V}\}\big)$ queries, $T = O(m\gamma) + O\big(m^{4/3} V^{2/3}/(\epsilon n)^{2/3}\big)$, and hence, Theorem 3.2 implies that the misclassification rate of $\mathcal{A}_{\mathrm{binClas}}$ is $O(\gamma) + O\big(m^{1/3} V^{2/3}/(\epsilon n)^{2/3}\big)$.
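To make the parameter bookkeeping in the last two proofs concrete, here is a back-of-the-envelope version of the realizable-case calculation, with constants and logarithmic factors suppressed; it merely rederives the statement of Theorem 3.4.

```latex
% Realizable case (\gamma = 0); constants and logarithmic factors suppressed.
% Each classifier is trained on n/k samples, T is set as in Theorem 3.2,
% and k as in Step 1 of Algorithm 2:
\begin{align*}
\alpha \approx \frac{kV}{n}, \qquad T \approx \alpha m, \qquad k \approx \frac{\sqrt{T}}{\epsilon}.
\end{align*}
% Substituting k into \alpha and then \alpha into T resolves the apparent circularity:
\begin{align*}
T \approx \frac{kV}{n}\, m \approx \frac{\sqrt{T}\, V m}{\epsilon n}
\;\Longrightarrow\; \sqrt{T} \approx \frac{mV}{\epsilon n}
\;\Longrightarrow\; T \approx \frac{m^{2} V^{2}}{\epsilon^{2} n^{2}},
\qquad \frac{T}{m} \approx \frac{m V^{2}}{\epsilon^{2} n^{2}}.
\end{align*}
% In particular, T = O(1) once m = O(\epsilon n / V), in which case
% k = O(1/\epsilon) and the rate is simply \alpha \approx V/(\epsilon n),
% matching part (i) of Theorem 3.4.
```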
4 From Answering Queries to Model-agnostic Private Learning

In this section, we build on our algorithm and results in Section 3 to achieve a stronger objective. In particular, we bootstrap from our previous algorithm an $(\epsilon, \delta)$-differentially private learner that publishes a final classifier. The idea is based on a knowledge transfer technique: we use our private construction above to generate labels for a sufficient number of unlabeled domain points. Then, we use the resulting labeled set as a new training set for any standard (non-private) learner, which in turn outputs a classifier. We prove explicit sample complexity bounds for the final private learner in both the PAC and agnostic PAC settings.

Our final construction can also be viewed as a private learner in the less restrictive setting of label-private learning, where the learner is only required to protect the privacy of the labels in the training set. Note that any construction for our original setting can be used as a label-private learner simply by splitting the training set into two parts and throwing away the labels of one of them.

Let $h_{\mathrm{priv}}$ denote the mapping defined by $\mathcal{A}_{\mathrm{binClas}}$ (Algorithm 2) on a single query (unlabeled data point). That is, for $x \in \mathcal{X}$, $h_{\mathrm{priv}}(x) \in \{0, 1, \perp\}$ denotes the output of $\mathcal{A}_{\mathrm{binClas}}$ on a single input query $x$. Note that, w.l.o.g., we can view $h_{\mathrm{priv}}$ as a binary classifier by replacing $\perp$ with a uniformly random label in $\{0, 1\}$. Our private learner is described in Algorithm 3 below.

Algorithm 3 $\mathcal{A}_{\mathrm{Priv}}$: Private Learner
Input: Unlabeled set of $m$ i.i.d. feature vectors $Q = \{\tilde{x}_1, \ldots, \tilde{x}_m\}$, oracle access to our private classifier $h_{\mathrm{priv}}$, oracle access to an agnostic PAC learner $\Theta$ for a class $\mathcal{H}$.
1: For $t = 1, \ldots, m$ do
2: $\hat{y}_t \leftarrow h_{\mathrm{priv}}(\tilde{x}_t)$
3: Output $\hat{h} \leftarrow \Theta(\tilde{D})$, where $\tilde{D} = \{(\tilde{x}_1, \hat{y}_1), \ldots, (\tilde{x}_m, \hat{y}_m)\}$

Note that since differential privacy is closed under post-processing, $\mathcal{A}_{\mathrm{Priv}}$ is $(\epsilon, \delta)$-DP w.r.t. the original dataset (the input to $\mathcal{A}_{\mathrm{binClas}}$). Note also that the mapping $h_{\mathrm{priv}}$ is independent of $Q$; it only depends on the input training set $D$ (in particular, on $h_1, \ldots, h_k$) and the internal randomness of $\mathcal{A}_{\mathrm{binClas}}$. We now make the following claim about $h_{\mathrm{priv}}$.

Claim 4.1. Let $0 < \beta \le \alpha < 1$, and $m \ge 4\log(1/(\alpha\beta))/\alpha$. Suppose that $\Theta$ in $\mathcal{A}_{\mathrm{binClas}}$ (Algorithm 2) is an $(\alpha, \beta/k, n/k)$-(agnostic) PAC learner for the hypothesis class $\mathcal{H}$. Then, with probability at least $1 - 2\beta$ (over the randomness of the private training set $D$ and the randomness in $\mathcal{A}_{\mathrm{binClas}}$), we have $\mathrm{err}(h_{\mathrm{priv}}; \mathcal{D}) \le 3\gamma + 7\alpha = O(\gamma + \alpha)$, where $\gamma = \min_{h\in\mathcal{H}} \mathrm{err}(h; \mathcal{D})$.

Proof. The proof largely relies on the proof of Theorem 3.2. First, note that w.p. at least $1 - \beta$ (over the randomness of the input dataset $D$), for all $j \in [k]$, we have $\mathrm{err}(h_j; \mathcal{D}) \le \gamma + \alpha$. For the remainder of the proof, we will condition on this event. Let $\tilde{x}_1, \ldots, \tilde{x}_m$ be a sequence of i.i.d. domain points, and $\tilde{y}_1, \ldots, \tilde{y}_m$ be the corresponding (unknown) labels. Now, for every $t \in [m]$, define $v_t \triangleq \mathbf{1}\big(|\{j \in [k] : h_j(\tilde{x}_t) \neq \tilde{y}_t\}| > k/3\big)$. Note that since $(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_m, \tilde{y}_m)$ are i.i.d., it follows that $v_1, \ldots, v_m$ are i.i.d. (this is true conditioned on the original dataset $D$). As in the proof of Theorem 3.2, we have
$$\Pr_{\tilde{x}_1, \ldots, \tilde{x}_m}\left[\frac{1}{m}\sum_{t=1}^{m} v_t > 3\left(\alpha + \gamma + \sqrt{\frac{\log(m/\beta)}{2m}(\alpha + \gamma)}\right)\right] < \beta.$$
Hence, for any $t \in [m]$, we have
$$\mathbb{E}_{\tilde{x}_t}[v_t] = \mathbb{E}_{\tilde{x}_1, \ldots, \tilde{x}_m}\left[\frac{1}{m}\sum_{t=1}^{m} v_t\right] < \beta + 3\left(\alpha + \gamma + \sqrt{\frac{\log(m/\beta)}{2m}(\alpha + \gamma)}\right) \le 7\alpha + 3\gamma.$$
Let $\bar{v}_t = 1 - v_t$. Using the same technique as in the proof of Theorem 3.2, we can show that, w.p. at least $1 - \beta$ over the internal randomness in Algorithm 2, $\bar{v}_t = 1$ implies $h_{\mathrm{priv}}(\tilde{x}_t) = \tilde{y}_t$. Hence, conditioned on this event, we have $\Pr_{\tilde{x}_t}[h_{\mathrm{priv}}(\tilde{x}_t) \neq \tilde{y}_t] \le \Pr_{\tilde{x}_t}[v_t = 1] = \mathbb{E}_{\tilde{x}_t}[v_t] \le 7\alpha + 3\gamma$.
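Before turning to the sample complexity bounds, here is a minimal Python sketch of the knowledge-transfer step in Algorithm 3. The `label_privately` and `train` interfaces are our own illustration (e.g., the `private_online_classify` sketch above with its private dataset and parameters fixed); the point is that the student learner only ever sees privately produced labels, so post-processing preserves the $(\epsilon, \delta)$ guarantee.

```python
import numpy as np


def private_learner(label_privately, public_unlabeled, train, rng=None):
    """Sketch of A_Priv (Algorithm 3): knowledge transfer from a private labeler to a student.

    label_privately : callable mapping a list of unlabeled points to a list of
                      labels in {0, 1, None}, e.g., the private_online_classify
                      sketch above with its private dataset and parameters fixed.
    train           : black-box non-private (agnostic PAC) learner taking a
                      labeled dataset and returning a classifier.
    The output is safe to publish: differential privacy is closed under
    post-processing, and the student never touches the private data directly.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy_labels = label_privately(public_unlabeled)
    labeled_public = [
        (x, y if y is not None else int(rng.integers(2)))  # replace "bottom" with a fair coin flip
        for x, y in zip(public_unlabeled, noisy_labels)
    ]
    return train(labeled_public)  # Step 3: hat(h) = Theta(tilde(D))
```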
We now state and prove the main results of this section. Let $V$ denote the VC-dimension of $\mathcal{H}$.

Theorem 4.2 (Sample complexity bound in the realizable case). Let $0 < \beta \le \alpha < 1$. Let $m$ be such that $\Theta$ is an $(\alpha, \beta, m)$-agnostic PAC learner for $\mathcal{H}$, i.e., $m = O\big((V + \log(1/\beta))/\alpha^2\big)$. Let the parameter $T$ of $\mathcal{A}_{\mathrm{binClas}}$ (Algorithm 2) be set as in Theorem 3.4. There exists $n = O\big(V^{3/2}/(\epsilon\,\alpha^{3/2})\big)$ for the size of the private dataset such that, w.p. at least $1 - 3\beta$, the output hypothesis $\hat{h}$ of $\mathcal{A}_{\mathrm{Priv}}$ (Algorithm 3) satisfies $\mathrm{err}(\hat{h}; \mathcal{D}) = O(\alpha)$.

Proof. Let $h^* \in \mathcal{H}$ denote the true labeling hypothesis. We will denote the true distribution $\mathcal{D}$ as $(\mathcal{D}_{\mathcal{X}}, h^*)$. Note that since $T$ is set as in Theorem 3.4, and given the value of $m$ in the theorem statement, we get $k = O\big(V^2/(\epsilon^2 \alpha^2 n)\big)$. Hence, there is a setting $n = O\big(V^{3/2}/(\epsilon\,\alpha^{3/2})\big)$ such that $\Theta$ is an $(\alpha, \beta/k, n/k)$-PAC learner for $\mathcal{H}$ (in particular, the sample complexity in the realizable case is $n/k = O(V/\alpha)$). Hence, by Claim 4.1, w.p. at least $1 - 2\beta$, $\mathrm{err}(h_{\mathrm{priv}}; \mathcal{D}) \le 7\alpha$. For the remainder of the proof, we will condition on this event. Note that each $(\tilde{x}_t, \hat{y}_t)$, $t \in [m]$, is drawn independently from $(\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})$. Now, since $\Theta$ is also an $(\alpha, \beta, m)$-agnostic PAC learner for $\mathcal{H}$, w.p. at least $1 - \beta$ (over the new set $\tilde{D}$), the output hypothesis $\hat{h}$ satisfies
$$\mathrm{err}\big(\hat{h}; (\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})\big) - \mathrm{err}\big(h^*; (\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})\big) \;\le\; \mathrm{err}\big(\hat{h}; (\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})\big) - \min_{h\in\mathcal{H}} \mathrm{err}\big(h; (\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})\big) \;\le\; \alpha.$$
Observe that $\mathrm{err}\big(h^*; (\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})\big) = \mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\big[\mathbf{1}(h^*(x) \neq h_{\mathrm{priv}}(x))\big] = \mathrm{err}\big(h_{\mathrm{priv}}; (\mathcal{D}_{\mathcal{X}}, h^*)\big) = \mathrm{err}(h_{\mathrm{priv}}; \mathcal{D}) \le 7\alpha$, where the last inequality follows from Claim 4.1 (with $\gamma = 0$). Hence, $\mathrm{err}\big(\hat{h}; (\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})\big) \le 8\alpha$. Furthermore, observe that
$$\mathrm{err}(\hat{h}; \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\big[\mathbf{1}(\hat{h}(x) \neq h^*(x))\big] \le \mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\big[\mathbf{1}(\hat{h}(x) \neq h_{\mathrm{priv}}(x)) + \mathbf{1}(h_{\mathrm{priv}}(x) \neq h^*(x))\big] = \mathrm{err}\big(\hat{h}; (\mathcal{D}_{\mathcal{X}}, h_{\mathrm{priv}})\big) + \mathrm{err}(h_{\mathrm{priv}}; \mathcal{D}) \le 15\alpha.$$
Hence, w.p. at least $1 - 3\beta$, we have $\mathrm{err}(\hat{h}; \mathcal{D}) \le 15\alpha$.

Remark 2. In Theorem 4.2, if $\Theta$ is an ERM learner, then the value of $m$ can be reduced to $O(V/\alpha)$. Hence, the resulting sample complexity would be $n = O\big(V^{3/2}/(\epsilon\,\alpha)\big)$, saving us a factor of $1/\sqrt{\alpha}$. This is because the disagreement rate in the labels produced by $\mathcal{A}_{\mathrm{binClas}}$ is $\alpha$, and agnostic learning with such a low disagreement rate can be done using $O(V/\alpha)$ samples if the learner is an ERM [7, Corollary 5.2].

Remark 3. Our result involves using an agnostic PAC learner $\Theta$. Agnostic PAC learners with optimal sample complexity can be computationally inefficient. One way to obtain an efficient construction in the realizable case (with a slightly worse sample complexity) is to use a PAC learner (rather than an agnostic one) in $\mathcal{A}_{\mathrm{Priv}}$ with target accuracy $\alpha$ (and hence, $m = O(V/\alpha)$), but then train the PAC learner in $\mathcal{A}_{\mathrm{binClas}}$ towards a target accuracy of $1/m$. Hence, the misclassification rate of $\mathcal{A}_{\mathrm{binClas}}$ can be driven to zero. This yields a sample complexity bound of $n = O\big(V^2/(\epsilon\,\alpha)\big)$.

Theorem 4.3 (Sample complexity bound in the non-realizable case). Let $0 < \beta \le \alpha < 1$, and $m = O\big((V + \log(1/\beta))/\alpha^2\big)$. Let $T$ be set as in Theorem 3.5. There exists $n = O\big(V^{3/2}/(\epsilon\,\alpha^{5/2})\big)$ such that, w.p. at least $1 - 3\beta$, the output hypothesis $\hat{h}$ of $\mathcal{A}_{\mathrm{Priv}}$ (Algorithm 3) satisfies $\mathrm{err}(\hat{h}; \mathcal{D}) = O(\alpha + \gamma)$.

Proof. The proof is similar to the proof of Theorem 4.2.

5 Discussion

Implications, and comparison to prior work on label privacy: Our results also apply to the setting of label-private learning, where the learner is only required to protect the privacy of the labels in the training set. That is, in this setting, all unlabeled features in the training set can be viewed as public information. This is a less restrictive setting than the one we consider in this paper. In particular, our construction can be directly used as a label-private learner simply by splitting the training set into two parts and discarding the labels in one of them. The above theorems give sample complexity upper bounds that are only a factor of $O\big(\sqrt{V/\alpha}\big)$ worse than the optimal non-private sample complexity bounds. We note, however, that our sample complexity upper bound for the agnostic case has a suboptimal dependency (by a small constant factor) on $\gamma \triangleq \min_{h\in\mathcal{H}} \mathrm{err}(h; \mathcal{D})$.

Label-private learning has been considered before in [10] and [6]. Both works have only considered pure, i.e., $(\epsilon, 0)$, differentially private learners for this setting, and the constructions in both works are white-box, i.e., they do not allow for a modular construction based on black-box access to a non-private learner. The work of [10] gave upper and lower bounds on the sample complexity in terms of the doubling dimension. Their upper bound involves a smoothness condition on the distribution of the features $\mathcal{D}_{\mathcal{X}}$. The work of [6] showed that the sample complexity (of pure differentially private label-private learners) can be characterized in terms of the VC dimension. They proved an upper bound on the sample complexity for the realizable case.
The bound of [6] is only a factor of $O(1/\alpha)$ worse than the optimal non-private bound for the realizable case.

Beyond standard PAC learning with binary loss: In this paper, we used our algorithmic framework to derive sample complexity bounds for the standard (agnostic) PAC model with the binary 0-1 loss. However, it is worth pointing out that our framework is applicable in more general settings. In particular, if a surrogate loss (e.g., the hinge loss or the logistic loss) is used instead of the binary loss, then our framework can be instantiated with any non-private learner with respect to that loss. That is, our construction does not necessarily require an (agnostic) PAC learner. However, in such a case, the accuracy guarantees of our construction will be different from what we have here for the standard PAC model. In particular, in the surrogate-loss model, one often needs to invoke some weak assumptions on the data distribution in order to bound the optimization error [25]. One can still provide meaningful accuracy guarantees, since our framework allows for transferring the classification error guarantee of the underlying non-private learner to a classification error guarantee for the final private learner.

Acknowledgement: The authors would like to thank Vitaly Feldman and Adam Smith for helpful discussions during the course of this project. In particular, the authors are grateful for Vitaly's ideas about the possible extensions of the results in Section 4, which we outlined in Remarks 2 and 3. This work is supported by NSF grants TRIPODS-1740850, TRIPODS+X-1839317, and IIS-1447700, a grant from the Sloan foundation, and start-up support from OSU and UC Santa Cruz.

References

[1] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308-318. ACM, 2016.

[2] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464-473. IEEE, 2014.

[3] Raef Bassily, Om Thakkar, and Abhradeep Thakurta. Model-agnostic private learning via stability. arXiv preprint arXiv:1803.05101, 2018.

[4] Amos Beimel, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. In TCC, pages 437-454. Springer, 2010.

[5] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Characterizing the sample complexity of private learners. In ITCS. ACM, 2013.

[6] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory of Computing, 12(1):1-61, 2016.

[7] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

[8] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

[9] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October 2015, pages 634-649. IEEE Computer Society, 2015.

[10] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning.
In Proceedings of the 24th Annual Conference on Learning Theory, pages 155-186, 2011.

[11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 2011.

[12] Cynthia Dwork and Vitaly Feldman. Privacy-preserving prediction. arXiv preprint arXiv:1803.10266, 2018.

[13] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006.

[14] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, 2006.

[15] Cynthia Dwork, Moni Naor, Omer Reingold, Guy Rothblum, and Salil Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In STOC, pages 381-390, 2009.

[16] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

[17] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In International Conference on Machine Learning, pages 555-563, 2016.

[18] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, 2010.

[19] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? In FOCS, pages 531-540. IEEE Computer Society, 2008.

[20] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA, 1994.

[21] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research, 1:41, 2012.

[22] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.

[23] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. stat, 1050, 2017.

[24] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.

[25] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[26] Adam Smith and Abhradeep Thakurta. Differentially private feature selection via stability arguments, and the robustness of the lasso. In COLT, 2013.

[27] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly optimal private LASSO. In NIPS, 2015.