Private Everlasting Prediction

Moni Naor    Kobbi Nissim    Uri Stemmer    Chao Yan

A private learner is trained on a sample of labeled points and generates a hypothesis that can be used for predicting the labels of newly sampled points while protecting the privacy of the training set [Kasiviswanathan et al., FOCS 2008]. Past research uncovered that private learners may need to exhibit significantly higher sample complexity than non-private learners, as is the case with learning of one-dimensional threshold functions [Bun et al., FOCS 2015, Alon et al., STOC 2019].

We explore prediction as an alternative to learning. A predictor answers a stream of classification queries instead of outputting a hypothesis. Earlier work has considered a private prediction model with a single classification query [Dwork and Feldman, COLT 2018]. We observe that when answering a stream of queries, a predictor must modify the hypothesis it uses over time, and in a manner that cannot rely solely on the training set.

We introduce private everlasting prediction, taking into account the privacy of both the training set and the (adaptively chosen) queries made to the predictor. We then present a generic construction of private everlasting predictors in the PAC model. The sample complexity of the initial training sample in our construction is quadratic (up to polylog factors) in the VC dimension of the concept class. Our construction allows prediction for all concept classes with finite VC dimension, and in particular threshold functions over infinite domains, for which (traditional) private learning is known to be impossible.

1 Introduction

A PAC learner for a concept class C is given labeled examples S = {(x_i, y_i)}_{i∈[n]} drawn i.i.d. from an unknown underlying probability distribution D over a data domain X and outputs a hypothesis h that can be used for predicting the label of fresh points x_{n+1}, x_{n+2}, ... sampled from the same underlying probability distribution D [Valiant, 1984]. It is well known that when points are labeled by a concept selected from a concept class C = {c : X → {0,1}} then learning is possible with sample complexity proportional to VC(C).

Department of Computer Science and Applied Math, Weizmann Institute of Science. moni.naor@weizmann.ac.il. Incumbent of the Judith Kleeman Professorial Chair. Research supported in part by grants from the Israel Science Foundation (no. 2686/20), by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness, and by the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center.
Department of Computer Science, Georgetown University. kobbi.nissim@georgetown.edu. Research supported in part by NSF grant No. CNS-2001041 and a gift to Georgetown University.
Blavatnik School of Computer Science, Tel Aviv University, and Google Research. u@uri.co.il. Partially supported by the Israel Science Foundation (grant 1871/19) and by Len Blavatnik and the Blavatnik Family foundation.
Department of Computer Science, Georgetown University. cy399@georgetown.edu. Research supported in part by a gift to Georgetown University.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Learning often happens in settings where the underlying training data is related to individuals and privacy-sensitive. For legal, ethical, or other reasons, the learner is then required to protect personal information from being leaked in the learned hypothesis h.
Private learning was introduced by Kasiviswanathan et al. [2011] as a theoretical model for studying such tasks. A private learner is a PAC learner that preserves differential privacy with respect to its training set S. That is, the learner's distribution on outcome hypotheses must not depend too strongly on any single example in S. Kasiviswanathan et al. showed that any finite concept class can be learned privately with sample complexity n = O(log|C|), a value that is sometimes significantly higher than the VC dimension of the concept class C. It is now understood that the gap between the sample complexity of private and non-private learners is essential: an important example is private learning of threshold functions (defined over an ordered domain X as C_thresh = {c_t}_{t∈X}, where c_t(x) = 1 iff x ≤ t), which requires sample complexity asymptotically higher than the (constant) VC dimension of C_thresh.

In more detail, with pure differential privacy, the sample complexity of private learning is characterized by the representation dimension of the concept class [Beimel et al., 2013a]. The representation dimension of C_thresh (hence, the sample complexity of privately learning thresholds) is Θ(log|X|) [Feldman and Xiao, 2015]. With approximate differential privacy, the sample complexity of learning threshold functions is Θ̃(log* |X|) [Beimel et al., 2013b, Bun et al., 2015, Alon et al., 2019, Kaplan et al., 2020, Cohen et al., 2022]. Hence, whether with pure or with approximate differential privacy, the sample complexity of privately learning thresholds grows with the cardinality of the domain |X|, and no private learner exists for this task over infinite domains. In contrast, non-private learning is possible with constant sample complexity (independent of |X|).

Privacy-preserving (black-box) prediction. Dwork and Feldman [2018] proposed privacy-preserving prediction as an alternative to private learning. Noting that "[i]t is now known that for some basic learning problems [...] producing an accurate private model requires much more data than learning without privacy", they considered a setting where users may be allowed to query the prediction model on their inputs "only through an appropriate interface". That is, a setting where the learned hypothesis is not made public and may be accessed only in a black-box manner via a privacy-preserving query-answering prediction interface. The prediction interface is required to preserve the privacy of its training set S:

Definition 1.1 (private prediction interface [Dwork and Feldman, 2018] (rephrased)). A prediction interface A is (ε,δ)-differentially private if for every interactive query-generating algorithm Q, the output of the interaction between Q and A(S) is (ε,δ)-differentially private with respect to S.

Dwork and Feldman focused on the setting where the entire interaction between Q and A(S) consists of issuing a single prediction query and answering it:

Definition 1.2 (single query prediction [Dwork and Feldman, 2018]). Let A be an algorithm that given a set of labeled examples S and an unlabeled point x produces a label y. A is an (ε,δ)-differentially private prediction algorithm if for every x, the output A(S,x) is (ε,δ)-differentially private with respect to S.

For answering a single prediction query, Dwork and Feldman showed that the sample complexity of such predictors is proportional to the VC dimension of the concept class.

1.1 Our contributions

We extend private prediction to answering a sequence of prediction queries of unlimited length.
We refer to this as private everlasting prediction (PEP). Our goal is to present a generic private everlasting predictor with low training sample complexity |S|.

1.1.1 Private prediction interfaces when applied to a large number of queries

We begin by examining private everlasting prediction under the framework of Definition 1.1. We prove:

Theorem 1.3 (informal version of Theorem 3.3). Let A be a private everlasting prediction interface for a concept class C, and assume A bases its predictions solely on the initial training set S. Then there exists a private learner for the concept class C with sample complexity |S|.

This means that everlasting predictors that base their predictions solely on the initial training set S are subject to the same complexity lower bounds as private learners. Hence, to avoid private learning lower bounds, private everlasting predictors need to rely on more than the initial training sample S as a source of information about the underlying probability distribution and the labeling concept. In this work, we choose to allow the everlasting predictor to rely on the queries made, which are unlabeled points from the domain X, assuming the queries are drawn from the same distribution from which the initial training set S is sampled. This requires modifying the privacy definition, as Definition 1.1 does not protect the queries issued to the predictor.

1.1.2 A definition of private everlasting predictors

Our definition of private everlasting predictors is motivated by the observations above. Consider an algorithm A that is first fed with a training set S of labeled points and then executes for an unlimited number of rounds, where in round i algorithm A receives as input a query point x_i and produces a label ŷ_i. We say that A is an everlasting predictor if, when the (labeled) training set S and the (unlabeled) query points come from the same underlying distribution, A answers each query point x_i by invoking a good hypothesis h_i (i.e., h_i has low generalization error), and hence the label ŷ_i produced by A is correct with high probability. We say that A is a private everlasting predictor if its sequence of predictions ŷ_1, ŷ_2, ŷ_3, ... protects both the privacy of the training set S and the privacy of the query points x_1, x_2, x_3, ... in the face of any adversary that chooses the query points adaptively.

We emphasize that while private everlasting predictors need to exhibit only average-case utility (good prediction is required only for the case where the initial training set S and the queries x_1, x_2, x_3, ... are selected i.i.d. from the same underlying distribution), our privacy requirement is worst-case, and holds in the face of an adaptive adversary that chooses each query point x_i after receiving the predictions provided for x_1, ..., x_{i-1}, and not necessarily in accordance with any probability distribution.

1.1.3 A generic construction of private everlasting predictors

We show a reduction from private everlasting prediction of a concept class C to non-private PAC learning of C. Our algorithm GenericBBL, presented in Section 6, executes in rounds.

Initialization. The input to the first round is a labeled training set S, where |S| = Õ((VC(C))²), assumed to be labeled consistently with some unknown target concept c ∈ C. Denote S_1 = S and c_1 = c.

Rounds. Each round begins with a collection S_i of labeled examples and ends with a newly generated collection of labeled examples S_{i+1} that feeds as input into the next round.
The size of these collections grows by a constant factor in each round, to allow the accumulated error of the predictor to converge. The construction ensures that each labeled set S_i is consistent with some concept c_i ∈ C, which is not necessarily the original target concept c, but has a bounded generalization error with respect to c. In more detail, the construction ensures that Pr_{x∼D}[c_{i+1}(x) ≠ c_i(x)] decreases by a factor of two in every round, and hence, by the triangle inequality, Pr_{x∼D}[c_i(x) ≠ c(x)] is bounded.

We briefly describe the main computations performed in each round of GenericBBL (important details, such as privacy amplification via sampling and the management of the learning accuracy and error parameters, are omitted from the description provided in this section).

Round initialization: At the outset of a round, the labeled set S_i is partitioned into sub-sets, each with a number of samples proportional to the VC dimension (so we have roughly |S_i|/VC(C) sub-sets). Each of the sub-sets is used for training a classifier non-privately, hence creating a collection of classifiers F_i ⊆ {f : X → {0,1}} that is used throughout the round.

Query answering: Queries are issued to the predictor in an online manner. A query is first labeled by each of the classifiers in F_i. Then the predicted label is computed by applying a privacy-preserving majority vote on these intermediate labels (a short illustration of this step follows Theorem 1.4 below). By standard composition theorems for differential privacy, we can answer roughly |F_i|² ≈ (|S_i|/VC(C))² queries without exhausting the privacy budget.

Generating a labeled set for the following round: A round ends with the preparation of a collection S_{i+1} of labeled samples that is to be used in the initialization of the next round. We explore alternatives for creating S_{i+1}: As, at the end of a round, the predictor has already labeled all the queries presented during the round's execution, one alternative is to let S_{i+1} consist of these queries and the labels provided for them. Note, however, that it is not guaranteed that the majority vote would result in a set of labels that is consistent with some concept in C, even if S_i is consistent with some concept in C. Hence, following this alternative would require the use of non-private agnostic learners instead. Furthermore, the error introduced by the majority vote can be larger than the generalization error of the classifiers in F_i by a constant factor greater than one. This may prevent the generalization error of the classifiers used in future rounds from converging. (To see that the majority vote can increase the generalization error by a constant factor, consider three classifiers f_1, f_2, f_3, each with generalization error α, and let A, B, C be three disjoint subsets of the support of the underlying distribution, each of probability weight α/2. If classifier f_1 errs on inputs in A ∪ B, classifier f_2 errs on A ∪ C, and f_3 errs on B ∪ C, then the majority of f_1, f_2, f_3 errs on inputs in A ∪ B ∪ C, and hence the generalization error grows from α to 1.5α.)

To overcome this problem, we use algorithm LabelBoost, a tool developed by Beimel et al. [2021] in the context of private semi-supervised learning. LabelBoost takes as input the sample S_i (labeled by a concept c_i ∈ C) and the (unlabeled) queries made during the round. It labels them with a concept c_{i+1} ∈ C such that the error of c_{i+1} with respect to c_i, i.e., Pr_{x∼D}[c_{i+1}(x) ≠ c_i(x)], is bounded.

Controlling the generalization error: Let S_{i+1} denote the (re)labeled query points obtained in the i-th round. This is a collection of size |S_{i+1}| ≈ (|S_i|/VC(C))². Hence, provided that |S_i| ≫ (VC(C))², we get that |S_{i+1}| > |S_i|. This allows us to lower the accuracy parameters of the non-private learners in each round, and hence ensure that the total error converges.

Theorem 1.4 (informal version of Theorem 6.1). For every concept class C, Algorithm GenericBBL is a private everlasting predictor requiring an initial set of labeled examples which is (up to polylogarithmic factors) quadratic in the VC dimension of C.
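To make the noisy-majority step concrete, the following Python sketch shows one way to answer a single query in the spirit of the query-answering step above. It is an illustration only: the function and parameter names are ours, and the bookkeeping that GenericBBL performs across queries (budget accounting, sub-sampling) is omitted.

```python
import numpy as np

def noisy_majority_label(classifiers, x, eps_query, rng=None):
    """Answer one prediction query with a privacy-preserving majority vote.

    `classifiers` is the round's collection F_i of 0/1-valued callables, each
    trained non-privately on a disjoint sub-set of the labeled data. Changing
    a single training example changes at most one classifier, so the vote
    fraction (1/T) * sum_t f_t(x) changes by at most 1/T; adding Laplace noise
    of scale 1/(T * eps_query) therefore makes this single answer
    (eps_query, 0)-differentially private."""
    rng = rng or np.random.default_rng()
    T = len(classifiers)
    vote_fraction = sum(f(x) for f in classifiers) / T  # value in [0, 1]
    noisy_fraction = vote_fraction + rng.laplace(scale=1.0 / (T * eps_query))
    return 1 if noisy_fraction >= 0.5 else 0
```

With, say, T = 1000 classifiers and eps_query = 0.1, the noise has standard deviation roughly 0.014, so queries with a clear majority are almost never flipped; composing such answers across a round is what limits the number of queries that can be served before the privacy budget is exhausted.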
1.2 Related work

Beyond the work of Dwork and Feldman [2018] on private prediction mentioned above, our work is related to private semi-supervised learning and joint differential privacy.

Semi-supervised private learning. As in the model of private semi-supervised learning of Beimel et al. [2021], our predictors depend on both labeled and unlabeled samples. Beyond the difference between outputting a hypothesis and providing black-box prediction, a major difference between the settings is that in the work of Beimel et al. [2021] all samples, labeled and unlabeled, are given at once at the outset of the learning process, whereas in the setting of everlasting predictors the unlabeled samples are supplied in an online manner. Our construction of private everlasting predictors uses tools developed for the semi-supervised setting, and in particular algorithm LabelBoost of Beimel et al.

Joint differential privacy. Kearns et al. [2015] introduced joint differential privacy (JDP) as a relaxation of differential privacy applicable to mechanism design and games. For every user u, JDP requires that the outputs jointly seen by all other users preserve differential privacy w.r.t. the input of u. Crucially, in JDP users select their inputs ahead of the computation. In our setting, the inputs to a private everlasting predictor are prediction queries, which are chosen in an online manner, and hence a query can depend on previous queries and their answers. Yet, similarly to JDP, the outputs provided to queries not performed by a user u should jointly preserve differential privacy w.r.t. the query made by u. Our privacy requirement hence extends JDP to an adaptive online setting.

Additional works on private prediction. Bassily et al. [2018] studied a variant of the private prediction problem where the algorithm takes a labeled sample S and is then required to answer m prediction queries (i.e., to label a sequence of m unlabeled points sampled from the same underlying distribution). They presented algorithms for this task with sample complexity |S| growing as √m. This should be contrasted with our model and results, where the sample complexity is independent of m. The bounds presented by Dwork and Feldman [2018] and Bassily et al. [2018] were improved by Dagan and Feldman [2020] and by Nandi and Bassily [2020], who presented algorithms with improved dependency on the accuracy parameter in the agnostic setting.

1.3 Discussion and open problems

We show how to transform any (non-private) learner for the class C (with sample complexity proportional to the VC dimension of C) into a private everlasting predictor for C. Our construction is not polynomial time due to the use of algorithm LabelBoost, and it requires an initial set S of labeled examples which is quadratic in the VC dimension. We leave open the questions of whether |S| can be reduced to be linear in the VC dimension and whether the construction can be made polynomial time. A few remarks are in order:
1. While our generic construction is not computationally efficient, it does result in efficient learners for several interesting special cases. Specifically, algorithm LabelBoost can be implemented efficiently whenever, given an input sample S, it is possible to efficiently enumerate all dichotomies realized by the target class C over the points in S. In particular, this is the case for the class of one-dimensional threshold functions C_thresh, as well as additional classes with constant VC dimension. Another notable example is the class C^enc_thresh, which intuitively is an encrypted version of C_thresh. Bun and Zhandry [2016] showed that (under plausible cryptographic assumptions) the class C^enc_thresh cannot be learned privately and efficiently, while it can be learned efficiently non-privately. Our construction can be implemented efficiently for this class. This provides an example where private everlasting prediction can be done efficiently, while (standard) private learning is possible but necessarily inefficient.

2. It is now known that some learning tasks require the produced model to memorize parts of the training set in order to achieve good learning rates, which in particular disallows the learning algorithm from satisfying (standard) differential privacy [Brown et al., 2021]. Our notion of private everlasting prediction circumvents this issue, since the model is never publicly released, and hence the fact that it must memorize parts of the sample does not pose a direct privacy threat. In other words, our work puts forward a private learning model which, in principle, allows memorization. This could have additional applications in broader settings.

3. As mentioned above, in general, private everlasting predictors cannot base their predictions solely on the initial training set, and in this work we choose to rely on the queries presented to the algorithm (in addition to the training set). Our construction can be easily adapted to a setting where the content of the black box is updated based on fresh unlabeled samples (whose privacy would be preserved), instead of relying on the query points themselves. This might be beneficial to avoid poisoning attacks via the queries.

2 Preliminaries

2.1 Preliminaries from differential privacy

Definition 2.1 ((ε,δ)-indistinguishability). Let R_0, R_1 be two random variables over the same support. We say that R_0, R_1 are (ε,δ)-indistinguishable if for every event E defined over the support of R_0, R_1,
Pr[R_0 ∈ E] ≤ e^ε · Pr[R_1 ∈ E] + δ  and  Pr[R_1 ∈ E] ≤ e^ε · Pr[R_0 ∈ E] + δ.

Definition 2.2. Let X be a data domain. Two datasets x, x′ ∈ X^n are called neighboring if |{i : x_i ≠ x′_i}| = 1.

Definition 2.3 (differential privacy [Dwork et al., 2006]). A mechanism M : X^n → Y is (ε,δ)-differentially private if M(x) and M(x′) are (ε,δ)-indistinguishable for all neighboring x, x′ ∈ X^n.

In our analysis, we use the post-processing and composition properties of differential privacy, which we cite in their simplest forms.

Proposition 2.4 (post-processing). Let M_1 : X^n → Y be an (ε,δ)-differentially private algorithm and let M_2 : Y → Z be any algorithm. Then the algorithm that on input x ∈ X^n outputs M_2(M_1(x)) is (ε,δ)-differentially private.

Proposition 2.5 (composition). Let M_1 be an (ε_1,δ_1)-differentially private algorithm and let M_2 be an (ε_2,δ_2)-differentially private algorithm. Then the algorithm that on input x ∈ X^n outputs (M_1(x), M_2(x)) is (ε_1+ε_2, δ_1+δ_2)-differentially private.

Theorem 2.6 (advanced composition, Dwork et al. [2010b]). Let M_1, ..., M_k : X^n → Y be (ε,δ)-differentially private algorithms.
Then the algorithm that on input x ∈ X^n outputs (M_1(x), ..., M_k(x)) is (ε′, kδ + δ′)-differentially private for every δ′ > 0, where ε′ = √(2k ln(1/δ′)) · ε + kε(e^ε − 1).

Definition 2.7 (Laplace mechanism [Dwork et al., 2006]). For f : X^n → R, let Δf = max |f(x) − f(x′)|, where the maximum is taken over all neighboring x, x′ ∈ X^n. The Laplace mechanism is M(x) = f(x) + Y, where Y is sampled from the Laplace distribution Lap(Δf/ε). The Laplace mechanism is (ε,0)-differentially private.

Definition 2.8 (exponential mechanism [McSherry and Talwar, 2007]). Let q : X^n × Y → R be a score function defined over a data domain X and an output domain Y. Define Δ = max |q(x,y) − q(x′,y)|, where the maximum is taken over all y ∈ Y and all neighboring databases x, x′ ∈ X^n. The exponential mechanism is the (ε,0)-differentially private mechanism which selects an output y ∈ Y with probability proportional to e^{εq(x,y)/(2Δ)}.

Claim 2.9 (privacy amplification by sub-sampling [Kasiviswanathan et al., 2011]). Let A be an (ε′,δ′)-differentially private algorithm operating on a database of size n. Let ε ≤ 1 and let t = (n/ε)·(3 + exp(ε′)). Construct an algorithm B operating on a database D = (z_i)_{i=1}^t: algorithm B randomly selects a subset J ⊆ {1,2,...,t} of size n and executes A on D_J = (z_i)_{i∈J}. Then B is (ε, (4ε/(3+exp(ε′)))·δ′)-differentially private.

2.2 Preliminaries from PAC learning

A concept class C over a data domain X is a set of predicates c : X → {0,1} (called concepts) which label points of the domain X by either 0 or 1. A learner A for a concept class C is given n examples sampled i.i.d. from an unknown probability distribution D over the data domain X and labeled according to an unknown target concept c ∈ C. The learner should output a hypothesis h : X → [0,1] that approximates c for the distribution D. More formally:

Definition 2.10 (generalization error). The generalization error of a hypothesis h : X → [0,1] with respect to a concept c and a distribution D is defined as error_D(c,h) = E_{x∼D}[|h(x) − c(x)|].

Definition 2.11 (PAC learning [Valiant, 1984]). Let C be a concept class over a domain X. Algorithm A is an (α,β,n)-PAC learner for C if for all c ∈ C and all distributions D on X,
Pr[(x_1,...,x_n) ∼ D^n; h ← A((x_1,c(x_1)),...,(x_n,c(x_n))): error_D(c,h) ≤ α] ≥ 1 − β,
where the probability is over the sampling of (x_1,...,x_n) from D and the coin tosses of A. The parameter n is the sample complexity of A.

See Appendix A for additional preliminaries on PAC learning.

2.3 Preliminaries from private learning

Definition 2.12 (private PAC learning [Kasiviswanathan et al., 2011]). Algorithm A is an (α,β,ε,δ,n)-private PAC learner if (i) A is an (α,β,n)-PAC learner and (ii) A is (ε,δ)-differentially private.

Kasiviswanathan et al. [2011] provided a generic private learner with O(log|C|) labeled samples (we omit the dependency on ε, δ, α, β in this brief review). Beimel et al. [2013a] introduced the representation dimension and showed that any concept class C can be privately learned with Θ(RepDim(C)) samples. For the sample complexity of (ε,δ)-differentially private learning of threshold functions over a domain X, Bun et al. [2015] gave a lower bound of Ω(log* |X|). Recently, Cohen et al. [2022] gave a (nearly) matching upper bound of Õ(log* |X|).

3 Towards private everlasting prediction

In this work, we extend private prediction to answering a sequence of prediction queries of unlimited length. Our main goal is to present a generic private everlasting predictor with low training sample complexity |S|.

Definition 3.1 (everlasting prediction). Let A be an algorithm with the following properties:
1. Algorithm A receives as input n labeled examples S = {(x_i,y_i)}_{i=1}^n ∈ (X × {0,1})^n and selects a hypothesis h_0 : X → {0,1}.

2. For round r ∈ N, algorithm A gets a query, which is an unlabeled element x_{n+r} ∈ X, outputs h_{r-1}(x_{n+r}), and selects a hypothesis h_r : X → {0,1}.

We say that A is an (α,β,n)-everlasting predictor for a concept class C over a domain X if the following holds for every concept c ∈ C and for every distribution D over X. If x_1, x_2, ... are sampled i.i.d. from D, and the labels of the n initial samples S are correct, i.e., y_i = c(x_i) for i ∈ [n], then
Pr[∃r ≥ 0 s.t. error_D(c,h_r) > α] ≤ β,
where the probability is over the sampling of x_1, x_2, ... from D and the randomness of A.

Applying the Dwork-Feldman notion of a private prediction interface to everlasting predictors we get:

Definition 3.2. An algorithm A is an (α,β,ε,δ,n)-everlasting differentially private prediction interface if (i) A is an (ε,δ)-differentially private prediction interface (as in Definition 1.1), and (ii) A is an (α,β,n)-everlasting predictor.

As a warmup, consider an (α,β,ε,δ,n)-everlasting differentially private prediction interface A for a concept class C over a (finite) domain X (as in Definition 3.2 above). Assume that A does not vary its hypotheses, i.e., (in the language of Definition 3.1) h_r = h_0 for all r > 0. (Formally, A can be thought of as two mechanisms A = (M_0, M_1), where M_0 is (ε,δ)-differentially private: (i) on input a labeled training sample S, mechanism M_0 computes a hypothesis h_0; (ii) on a query x ∈ X, mechanism M_1 replies h_0(x).) Note that a computationally unlimited adversarial querying algorithm can recover the hypothesis h_0 by issuing all queries x ∈ X. Hence, when A is used indefinitely, we lose any potential sample-complexity benefit of restricting access to h_0 to be black-box, and we are back at the point where the lower bounds on n from private learning apply.

A consequence of this simple observation is that a generic private everlasting predictor should not answer all prediction queries with a single hypothesis; it should modify its hypothesis over time as it processes new queries. We now take this observation a step further, showing that a private everlasting predictor that answers prediction queries solely based on its training sample S (and randomness, but not on the queries) is subject to the same sample complexity lower bounds as private learners.

Consider an (α, β < 1/8, ε, δ, n)-everlasting differentially private prediction interface A for a concept class C over a (finite) domain X that upon receiving the training set S ∈ (X × {0,1})^n selects an infinite sequence of hypotheses {h_r}_{r≥0}, where h_r : X → {0,1}. Formally, we can think of A as composed of three mechanisms A = (M_0, M_1, M_2), where M_0 is (ε,δ)-differentially private: On input a labeled training sample S ∈ (X × {0,1})^n, mechanism M_0 computes an initial state and an initial hypothesis (σ_0,h_0) = M_0(S). On a query x_{n+r}, mechanism M_1 produces an answer M_1(x_{n+r}) = h_r(x_{n+r}), and mechanism M_2 updates the hypothesis-state pair (h_{r+1},σ_{r+1}) = M_2(σ_r). Note that as M_0 and M_2 do not receive the sequence {x_{n+r}}_{r≥0} as input, the sequence {h_r}_{r≥0} depends solely on S. Furthermore, as M_1 and M_2 post-process the outcome of M_0, the sequence of queries and predictions {(x_r, h_r(x_r))}_{r≥0} preserves (ε,δ)-differential privacy with respect to the training set S. In Appendix B we prove:

Theorem 3.3. A can be transformed into an (O(α), O(β), ε, δ, O(n log(1/β)))-private PAC learner for C.
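The warmup observation above can be made concrete with a toy Python sketch (the names below are ours, for illustration only): if the interface answers every query with the same hypothesis h_0, an adversary that queries every point of a finite domain reconstructs h_0 exactly, so hiding h_0 behind a black box yields no sample-complexity benefit.

```python
def reconstruct_fixed_hypothesis(predict, domain):
    """Recover a predictor that never changes its hypothesis by querying
    every point of the (finite) domain once and recording the answers."""
    return {x: predict(x) for x in domain}

if __name__ == "__main__":
    domain = range(100)
    h0 = lambda x: int(x <= 37)  # the hidden hypothesis inside the interface
    recovered = reconstruct_fixed_hypothesis(h0, domain)
    assert all(recovered[x] == h0(x) for x in domain)  # h0 fully reconstructed
```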
4 Private everlasting prediction: definition

Theorem 3.3 requires us to seek private predictors whose prediction relies on more information than what is provided by the initial labeled sample. Possibilities include requiring the input of additional labeled or unlabeled examples during the lifetime of the predictor, while protecting the privacy of these examples. We choose to rely on the queries for updating the predictor's internal state. This introduces a potential privacy risk for these queries, as sensitive information about a query may be leaked in the predictions following it. Furthermore, we need to take into account that a privacy attacker may choose their queries adversarially and adaptively.

Definition 4.1 (private everlasting black-box prediction). An algorithm A is an (α,β,ε,δ,n)-private everlasting black-box predictor for a concept class C if:

1. Prediction: A is an (α,β,n)-everlasting predictor for C (as in Definition 3.1).

2. Privacy: For every adversary B and every t ≥ 1, the random variables View^0_{B,t} and View^1_{B,t} (defined in Figure 1) are (ε,δ)-indistinguishable.

5 Tools from prior works

We briefly describe tools from prior works that we use in our construction. See Appendix C for a more detailed account.

Algorithm LabelBoost [Beimel et al., 2021]: Algorithm LabelBoost takes as input a partially labeled database S∘T ∈ (X × {0,1,⊥})* (where the first portion of the database, S, contains examples labeled by some concept c ∈ C) and outputs a similar database where both S and T are (re)labeled by a concept h ∈ C such that error_D(c,h) is bounded. We use the following lemmata from Beimel et al. [2021]:

Lemma 5.1 (privacy of algorithm LabelBoost). Let A be an (ε,δ)-differentially private algorithm operating on labeled databases. Construct an algorithm B that on input a partially labeled database S∘T ∈ (X × {0,1,⊥})* applies A on the outcome of LabelBoost(S∘T). Then B is (ε+3, 4eδ)-differentially private.

Lemma 5.2 (utility of algorithm LabelBoost). Fix α and β, and let S∘T be s.t. S is labeled by some target concept c ∈ C, and s.t. |T| ≤ (β/(e·VC(C))) · exp(α|S|/(2VC(C))) − |S|. Consider the execution of LabelBoost on S∘T, and let h denote the hypothesis chosen by LabelBoost to relabel S∘T. With probability at least (1−β) we have that error_S(h) ≤ α.

Parameters: b ∈ {0,1}, t ∈ N.
Training phase:
1. The adversary B chooses two sets of n labeled elements (x^0_1,y^0_1),...,(x^0_n,y^0_n) and (x^1_1,y^1_1),...,(x^1_n,y^1_n), subject to the restriction |{i ∈ [n] : (x^0_i,y^0_i) ≠ (x^1_i,y^1_i)}| ≤ 1.
2. If ∃i s.t. (x^0_i,y^0_i) ≠ (x^1_i,y^1_i) then set Flag = 1. Otherwise set Flag = 0.
3. Algorithm A gets (x^b_1,y^b_1),...,(x^b_n,y^b_n) and selects a hypothesis h_0 : X → {0,1}.
/* the adversary B does not get to see the hypothesis h_0 */
Prediction phase:
4. For round r = 1,2,...,t:
(a) If Flag = 1 then the adversary B chooses two identical elements x^0_{n+r} = x^1_{n+r} ∈ X. Otherwise, the adversary B chooses two elements x^0_{n+r}, x^1_{n+r} ∈ X.
(b) If x^0_{n+r} ≠ x^1_{n+r} then Flag is set to 1.
(c) If x^0_{n+r} = x^1_{n+r} then the adversary B gets h_{r-1}(x^b_{n+r}).
/* the adversary B does not get to see the label if x^0_{n+r} ≠ x^1_{n+r} */
(d) Algorithm A gets x^b_{n+r} and selects a hypothesis h_r : X → {0,1}.
/* the adversary B does not get to see the hypothesis h_r */
Let View^b_{B,t} be B's view of the execution, i.e., its own randomness and the sequence of predictions in Step 4c.

Figure 1: Definition of View^0_{B,t} and View^1_{B,t}.
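The following Python sketch illustrates the structure of the LabelBoost relabeling step (enumerate the dichotomies realized by the class, select one with the exponential mechanism at ε = 1, and relabel) for the special case of one-dimensional thresholds over a finite set of points, one of the cases where the dichotomies can be enumerated efficiently (cf. Section 1.3). The interface and names are ours, and many details of the actual algorithm are omitted.

```python
import math
import random

def labelboost_thresholds(S, T, eps=1.0, rng=random):
    """Simplified LabelBoost-style relabeling for thresholds c_t(x) = 1{x <= t}.

    S: list of labeled pairs (x, y) with y in {0, 1}; T: list of unlabeled points.
    Each candidate cut value realizes one dichotomy of the threshold class over
    the points of S and T.  A candidate's score is its number of agreements with
    the labels in S (sensitivity 1 w.r.t. changing one example of S), and a
    candidate is chosen with probability proportional to exp(eps * score / 2),
    i.e. via the exponential mechanism of Definition 2.8.  Finally S and T are
    relabeled by the chosen threshold."""
    points = sorted({x for x, _ in S} | set(T))
    cuts = [min(points) - 1] + points          # "all zero", then 1{x <= p} for each point p
    scores = [sum(1 for x, y in S if int(x <= t) == y) for t in cuts]
    weights = [math.exp(eps * s / 2) for s in scores]
    r, acc, chosen = rng.random() * sum(weights), 0.0, cuts[-1]
    for t, w in zip(cuts, weights):
        acc += w
        if r <= acc:
            chosen = t
            break
    relabel = lambda x: int(x <= chosen)
    return [(x, relabel(x)) for x, _ in S] + [(x, relabel(x)) for x in T]
```

Because a candidate that disagrees with S on even a handful more points than the best one is exponentially less likely to be selected, the relabeling hypothesis agrees with S almost everywhere with high probability, which is the shape of the guarantee in Lemma 5.2.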
6 A Generic Construction

Our generic construction, Algorithm GenericBBL, transforms a (non-private) learner for a concept class C into a private everlasting predictor for C. The theorem below follows from Theorem 6.2 and Claim 6.3, which are proved in Appendix E.

Theorem 6.1. Given α, β, δ < 1/16 and ε < 1, Algorithm GenericBBL is a (4α, 4β, ε, δ, n)-private everlasting predictor, where n is set as in Algorithm GenericBBL.

Theorem 6.2 (accuracy of algorithm GenericBBL). Given α, β, δ < 1/16 and ε < 1, for any concept c and any round r, algorithm GenericBBL predicts the label of x_r as h_r(x_r), where Pr[error_D(c, h_r) ≤ 4α] ≥ 1 − 4β.

Claim 6.3 (privacy of algorithm GenericBBL). GenericBBL is (ε,δ)-differentially private.

Remark 6.4. For simplicity, we analyzed GenericBBL in the realizable setting, i.e., under the assumption that the training set S is consistent with the target class C. Our construction carries over to the agnostic setting via standard arguments (ignoring computational efficiency). We refer the reader to [Beimel et al., 2021] and [Alon et al., 2020] for generic agnostic-to-realizable reductions in the context of private learning.

Algorithm GenericBBL

Initial input: A labeled database S ∈ (X × {0,1})^n, where
n = (8τ/(α²ε²)) · (8VC(C)log(26/α) + 4log(4/β))² · log(1/δ) · log²((64VC(C)log(26/α) + 32log(4/β))/(εα²βδ)) · (3 + exp(ε+4)).

1. Let τ > 50. Set α_1 = α/2, β_1 = β/2. Define λ_i = (8VC(C)log(13/α_i) + 4log(2/β_i))/α_i.
/* by Theorem A.2, λ_i samples suffice for PAC learning C with parameters α_i, β_i */
2. Let S_1 ⊆ S be a random subset of size nε/(3 + exp(ε+4)) = τλ_1² log(1/δ) log²(λ_1/(εα_1β_1δ))/ε.
3. Repeat for i = 1,2,3,...
(a) Divide S_i into T_i = τλ_i log(1/δ) log²(λ_i/(εα_iβ_iδ))/ε disjoint databases S_{i,1},...,S_{i,T_i} of size λ_i.
(b) For t ∈ [T_i] let f_t ∈ C be a hypothesis minimizing error_{S_{i,t}}(·). Define F_i = (f_1,...,f_{T_i}).
(c) Set R_i = 12800|S_i|/ε. Set the privacy parameters ε′_i = 1/(3√(R_i ln(2/δ))) and δ′_i = δ/(2R_i). Instantiate noises η_1,...,η_{R_i} ∼ Lap(1/(T_i ε′_i)).
(d) For ℓ = 1 to R_i:
    i. Receive as input a prediction query x_{i,ℓ} ∈ X.
    ii. Compute q_{x_{i,ℓ}}(F_i) = (1/T_i) Σ_{t∈[T_i]} f_t(x_{i,ℓ}), and let y_{i,ℓ} = q_{x_{i,ℓ}}(F_i) + η_ℓ.
    iii. Respond with the label 0 if y_{i,ℓ} < 1/2 and with the label 1 if y_{i,ℓ} ≥ 1/2.
(e) Denote D_i = (x_{i,1},...,x_{i,R_i}).
(f) Let Ŝ_i ⊆ S_i and D̂_i ⊆ D_i be random subsets of size ε|S_i|/(3+exp(ε+4)) and ε|D_i|/(3+exp(ε+4)), respectively, and let Ŝ′_i ∘ D̂′_i ← LabelBoost(Ŝ_i ∘ D̂_i). Let S_{i+1} ⊆ D̂′_i be a random subset of size λ_{i+1}T_{i+1}.
(g) Set α_{i+1} ← α_i/2 and β_{i+1} ← β_i/2.

6.1 Improving the sample complexity dependency on accuracy

We briefly sketch how to improve the sample complexity of Algorithm GenericBBL from n = Θ(VC²(C)/(α²ε²)) to n = Θ(VC²(C)/(αε²)) by modifying Steps 3(d)ii and 3(d)iii of Algorithm GenericBBL. To simplify the description, we consider a constant ε, we ignore the privacy amplification by subsampling occurring in Steps 2 and 3f, and we illustrate how the modified algorithm would execute by considering round i = 1 of Algorithm GenericBBL. Note that S_1 = S, where n = |S_1| = T_1·λ_1 and λ_1 = Θ(VC(C)/α_1) is the sample complexity needed so that each of the hypotheses computed in Step 3b has error at most α_1 except with probability β_1. Applying advanced composition, we get that Θ(T_1²) noisy majority queries, implemented in Steps 3(d)ii and 3(d)iii, can be performed, i.e., R_1 = Θ(T_1²). Finally, to feed the next phase with |S_2| = Θ(|S_1|) labeled samples, we need that R_1 = Θ(T_1·λ_1) = Θ(T_1·VC(C)/α), and hence T_1 = Θ(VC(C)/α), resulting in n = Θ(VC²(C)/α²).
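As a rough numerical sanity check of this counting, the routine below computes how many noisy-majority answers, each (1/T)-differentially private, fit under a total per-round budget of ε = 1 according to the advanced composition bound of Theorem 2.6; the constants are illustrative and are not the ones used in GenericBBL.

```python
import math

def max_queries_per_round(T, delta, eps_total=1.0):
    """Count how many (1/T)-DP answers compose to at most eps_total under
    advanced composition: sqrt(2*R*ln(1/delta))*eps0 + R*eps0*(e**eps0 - 1)."""
    eps0 = 1.0 / T
    R = 0
    while True:
        cost = (math.sqrt(2 * (R + 1) * math.log(1 / delta)) * eps0
                + (R + 1) * eps0 * (math.exp(eps0) - 1))
        if cost > eps_total:
            return R
        R += 1

# For T = 1000 and delta = 1e-6 the result is on the order of
# T**2 / (2 * ln(1/delta)), i.e. R = Theta(T^2), matching the calculation above.
```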
However, when the queries are made from the same underlying distribution from which S was selected, we expect that most of them would exhibit a clear majority in Steps 3(d)ii and 3(d)iii, except for a fraction of O(α) of them. Hence, a natural way to improve on the number of majority queries that can be performed is to replace Steps 3(d)ii and 3(d)iii with a decision based on the sparse vector technique of Dwork et al. [2009]. In particular, we can use the BetweenThresholds mechanism of Bun et al. [2016] to get an improved R_1 = Θ(T_1²/α), and hence get T_1 = Θ(VC(C)) and n = Θ(VC²(C)/α).

A final important wrinkle is that the above calculation is based on the queries coming from the same underlying distribution from which S was selected, while our worst-case privacy requirement allows for an adversarial choice of queries, which may in turn cause the execution of BetweenThresholds to halt too often and hence exhaust the privacy budget for the phase. It is hence important to estimate the number of times BetweenThresholds halts within a phase and to stop the execution of Algorithm GenericBBL when the estimate crosses a threshold. This estimate needs to be done in an online manner and should preserve differential privacy with respect to the queries (for example, stopping GenericBBL after the count of halts crosses a precise threshold would reveal that the last query caused BetweenThresholds to halt, and hence breach differential privacy). This can be done, e.g., using the private counter algorithm of Dwork et al. [2010a], which preserves differential privacy under continual observation.

References

Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852-860. ACM, 2019. doi: 10.1145/3313276.3316312. URL https://doi.org/10.1145/3313276.3316312.

Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. Closure properties for private classification and online prediction. In COLT, volume 125 of Proceedings of Machine Learning Research, pages 119-152. PMLR, 2020.

Raef Bassily, Abhradeep Guha Thakurta, and Om Dipakbhai Thakkar. Model-agnostic private learning. In NeurIPS, pages 7102-7112, 2018.

Amos Beimel, Kobbi Nissim, and Uri Stemmer. Characterizing the sample complexity of private learners. In ITCS, pages 97-110. ACM, 2013a.

Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In APPROX-RANDOM, pages 363-378, 2013b.

Amos Beimel, Kobbi Nissim, and Uri Stemmer. Learning privately with labeled and unlabeled examples. Algorithmica, 83(1):177-215, 2021.

Gavin Brown, Mark Bun, Vitaly Feldman, Adam D. Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In STOC, pages 123-132. ACM, 2021.

Mark Bun and Mark Zhandry. Order-revealing encryption and the hardness of private learning. In TCC (A1), volume 9562 of Lecture Notes in Computer Science, pages 176-206. Springer, 2016.

Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634-649, 2015.

Mark Bun, Thomas Steinke, and Jonathan Ullman. Make up your mind: The price of online queries in differential privacy. CoRR, abs/1604.04618, 2016. URL http://arxiv.org/abs/1604.04618.

Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, and Uri Stemmer.
Optimal differentially private learning of thresholds and quasi-concave optimization. CoRR, abs/2211.06387, 2022. doi: 10.48550/arXiv.2211.06387. URL https://doi.org/10.48550/arXiv.2211.06387.

Yuval Dagan and Vitaly Feldman. PAC learning with stable and private predictions. In COLT, volume 125 of Proceedings of Machine Learning Research, pages 1389-1410. PMLR, 2020.

Cynthia Dwork and Vitaly Feldman. Privacy-preserving prediction. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pages 1693-1702. PMLR, 2018. URL http://proceedings.mlr.press/v75/dwork18a.html.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284, 2006.

Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil P. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, pages 381-390, 2009.

Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Symposium on Theory of Computing (STOC), pages 715-724. ACM, 2010a.

Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51-60, 2010b.

Vitaly Feldman and David Xiao. Sample complexity bounds on differentially private learning via communication complexity. SIAM J. Comput., 44(6):1740-1764, 2015. doi: 10.1137/140991844. URL http://dx.doi.org/10.1137/140991844.

Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, volume 125 of Proceedings of Machine Learning Research, pages 2263-2285. PMLR, 2020.

Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM J. Comput., 40(3):793-826, 2011.

Michael J. Kearns, Mallesh M. Pai, Ryan M. Rogers, Aaron Roth, and Jonathan R. Ullman. Robust mediators in large games. CoRR, abs/1512.02698, 2015.

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94-103. IEEE, Oct 20-23 2007.

Anupama Nandi and Raef Bassily. Privately answering classification queries in the agnostic PAC model. In ALT, volume 117 of Proceedings of Machine Learning Research, pages 687-703. PMLR, 2020.

L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, November 1984. ISSN 0001-0782. doi: 10.1145/1968.1972. URL http://doi.acm.org/10.1145/1968.1972.

Vladimir N. Vapnik and Alexey Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

A Additional Preliminaries from PAC Learning

It is well known that a sample of size Θ(VC(C)) is necessary and sufficient for PAC learning a concept class C, where the Vapnik-Chervonenkis (VC) dimension of a class C is defined as follows:

Definition A.1 (VC dimension [Vapnik and Chervonenkis, 1971]). Let C be a concept class over a domain X. For a set B = {b_1,...,b_ℓ} ⊆ X of ℓ points, let Π_C(B) = {(c(b_1),...,c(b_ℓ)) : c ∈ C} be the set of all dichotomies that are realized by C on B. We say that the set B ⊆ X is shattered by C if C realizes all possible dichotomies over B, in which case we have |Π_C(B)| = 2^{|B|}.
The VC dimension of the class C, denoted VC(C), is the cardinality of the largest set B ⊆ X shattered by C.

Theorem A.2 (VC bound). Let C be a concept class over a domain X. For α, β < 1/2, there exists an (α,β,n)-PAC learner for C, where n = (8VC(C)log(13/α) + 4log(2/β))/α.

B Proof of Theorem 3.3

The proof of Theorem 3.3 follows from algorithms HypothesisLearner and AccuracyBoost and Claims B.1 and B.2, all described below. In Algorithm HypothesisLearner we assume that the everlasting differentially private prediction interface A was fed with n i.i.d. samples taken from some (unknown) distribution D and labeled by an unknown concept c ∈ C. Assuming the sequence of hypotheses {h_r}_{r≥0} produced by A satisfies

∀r : error_D(c,h_r) ≤ α    (1)

we use it to construct, with constant probability, a hypothesis h with error bounded by O(α).

Algorithm HypothesisLearner
Parameters: 0 < β ≤ 1/8, R = |X| · log(|X|) · log(1/β)
Input: hypothesis sequence {h_r}_{r≥0}
1. for all x ∈ X let L_x = ∅
2. for r = 0,1,2,...,R
(a) select x uniformly at random from X and let L_x = L_x ∪ {h_r(x)}
3. if L_x = ∅ for some x ∈ X then fail, output an arbitrary hypothesis, and halt
/* Pr[∃x such that L_x = ∅] ≤ |X|(1 − 1/|X|)^R ≤ |X|e^{−R/|X|} ≤ β */
4. for all x ∈ X let r_x be sampled uniformly at random from L_x
5. construct the hypothesis h, where h(x) = r_x

Claim B.1. If executed on a hypothesis sequence satisfying Equation (1), then with probability at least 3/4 Algorithm HypothesisLearner outputs a hypothesis h satisfying error_D(c,h) ≤ 8α.

Proof. With D and c ∈ C fixed, and given a hypothesis h, we define e_h(x) to be 1 if h(x) ≠ c(x) and 0 otherwise. Thus, we can write error_D(c,h) = E_{x∼D}[e_h(x)]. Observe that when Algorithm HypothesisLearner does not fail, r_x (and hence h(x)) is chosen with equal probability among (h_1(x), h_2(x), ..., h_R(x)), and hence E_θ[e_h(x)] = E_{i∼U[R]}[e_{h_i}(x)], where θ denotes the randomness of HypothesisLearner. We get:
E_θ[error_D(c,h)] = E_θ E_{x∼D}[e_h(x)] = E_{x∼D} E_θ[e_h(x)] = E_{x∼D} E_{i∼U[R]}[e_{h_i}(x)] = E_{i∼U[R]} E_{x∼D}[e_{h_i}(x)] ≤ E_{i∼U[R]}[α] = α.
By Markov's inequality, we have Pr_θ[error_D(c,h) ≥ 8α] ≤ 1/8. The claim follows by noting that Algorithm HypothesisLearner fails with probability at most β ≤ 1/8.

The second part of the transformation is Algorithm AccuracyBoost, which applies Algorithm HypothesisLearner O(log(1/β)) times to obtain, with high probability, a hypothesis with O(α) error.

Algorithm AccuracyBoost
Parameters: β, R = 104 ln(1/β)
Input: R labeled samples with n examples each, (S_1,...,S_R), where S_i ∈ (X × {0,1})^n
1. for i = 1,2,...,R
(a) execute A(S_i) to obtain a hypothesis sequence {h^i_r}_{r≥0}
(b) execute Algorithm HypothesisLearner on {h^i_r}_{r≥0} to obtain a hypothesis h_i
2. construct the hypothesis ĥ, where ĥ(x) = maj(h_1(x),...,h_R(x)).

Claim B.2. With probability 1 − β, Algorithm AccuracyBoost outputs a 24α-good hypothesis over the distribution D.

Proof. Define B_i to be the event where the sequence of hypotheses {h^i_r}_{r≥0} produced in Step 1a of AccuracyBoost does not satisfy Equation (1). We have
Pr[error_D(c,h_i) > 8α] ≤ Pr[B_i] + (1 − Pr[B_i]) · Pr[error_D(c,h_i) > 8α | ¬B_i] ≤ β + 1/4 < 3/8.
Hence, by the Chernoff bound, when R ≥ 104 ln(1/β), at least 7R/8 of the hypotheses are 8α-good over the distribution D. Consider the worst case, in which R/8 hypotheses always output wrong labels. For ĥ to output a wrong label on a point x, at least 3R/8 hypotheses must output wrong labels on x. Thus ĥ is 24α-good over the distribution D.

C Tools from Prior Works

C.1 Algorithm LabelBoost [Beimel et al., 2021]

Algorithm LabelBoost [Beimel et al., 2021]
Parameters: A concept class C.
Input: A partially labeled database S∘T ∈ (X × {0,1,⊥})*.
% We assume that the first portion of the database (denoted S) contains labeled examples. The algorithm outputs a similar database where both S and T are (re)labeled.
1. Initialize H = ∅.
2. Let P = {p_1,...,p_ℓ} be the set of all points p ∈ X appearing at least once in S∘T. Let Π_C(P) = {(c(p_1),...,c(p_ℓ)) : c ∈ C} be the set of all dichotomies generated by C on P.
3. For every (z_1,...,z_ℓ) ∈ Π_C(P), add to H an arbitrary concept c ∈ C s.t. c(p_i) = z_i for every 1 ≤ i ≤ ℓ.
4. Choose h ∈ H using the exponential mechanism with privacy parameter ε = 1, solution set H, and the database S.
5. (Re)label S∘T using h, and denote the resulting database (S∘T)^h; that is, if S∘T = (x_i,y_i)_{i=1}^t then (S∘T)^h = (x_i,y′_i)_{i=1}^t where y′_i = h(x_i).
6. Output (S∘T)^h.

Lemma C.1 (privacy of algorithm LabelBoost [Beimel et al., 2021]). Let A be an (ε,δ)-differentially private algorithm operating on partially labeled databases. Construct an algorithm B that on input a partially labeled database S∘T ∈ (X × {0,1,⊥})* applies A on the outcome of LabelBoost(S∘T). Then B is (ε+3, 4eδ)-differentially private.

Consider an execution of LabelBoost on a database S∘T, and assume that the examples in S are labeled by some target concept c ∈ C. Recall that for every possible labeling z of the elements in S and in T, algorithm LabelBoost adds to H a hypothesis from C that agrees with z. In particular, H contains a hypothesis that agrees with the target concept c on S (and on T). That is, ∃f ∈ H s.t. error_S(f) = 0. Hence, the exponential mechanism (in Step 4) chooses (w.h.p.) a hypothesis h ∈ H s.t. error_S(h) is small, provided that |S| is roughly log|H|, which is roughly VC(C)·log(|S|+|T|) by Sauer's lemma. So, algorithm LabelBoost takes an input database of which only a small portion is labeled, and returns a similar database in which the labeled portion grows exponentially.

Lemma C.2 (utility of algorithm LabelBoost [Beimel et al., 2021]). Fix α and β, and let S∘T be s.t. S is labeled by some target concept c ∈ C, and s.t. |T| ≤ (β/(e·VC(C))) · exp(α|S|/(2VC(C))) − |S|. Consider the execution of LabelBoost on S∘T, and let h denote the hypothesis chosen in Step 4. With probability at least (1−β) we have that error_S(h) ≤ α.

D Some Technical Facts

We refer to the execution of Steps 3a-3g of algorithm GenericBBL as a phase of the algorithm, indexed by i = 1,2,3,.... In GenericBBL, we require that the set of all predictions in phase i be (1,δ)-differentially private.

Claim D.1. For δ < 1, the composed Laplace mechanism used in Step 3c in the i-th phase is (1,δ)-differentially private.

Proof. Let ε′_i, δ′_i be as in Step 3c. Since e^{ε′_i} − 1 < 2ε′_i for 0 < ε′_i < 1, we have
√(2R_i ln(1/(R_iδ′_i))) · ε′_i + R_iε′_i(e^{ε′_i} − 1) ≤ √(2R_i ln(2/δ)) · ε′_i + 2R_iε′_i² = √2/3 + 2/(9 ln(2/δ)) ≤ 1.
The proof is concluded by using advanced composition (Theorem 2.6), taking δ′ = R_iδ′_i = δ/2, so that the composition is (1, R_iδ′_i + δ′) = (1,δ)-differentially private.

In Step 3f, GenericBBL takes a random subset of size λ_{i+1}T_{i+1} from D̂′_i. We show that the size of D̂′_i is at least λ_{i+1}T_{i+1}.

Claim D.2. When ε ≤ 1, for any i ≥ 1, we always have |D̂′_i| ≥ λ_{i+1}T_{i+1}.

Proof. Let m = 3 + exp(ε+4) < 200. By Steps 3c, 3e and 3f,
|D̂_j| = ε|D_j|/m = 12800|S_j|/m ≥ 64|S_j| = 64λ_j T_j.
It is therefore sufficient to verify that 64λ_j T_j ≥ λ_{j+1}T_{j+1}. We can verify that
4λ_j = 4 · (8VC(C)log(13/α_j) + 4log(2/β_j))/α_j = 2 · (8VC(C)(log(13/α_{j+1}) − 1) + 4(log(2/β_{j+1}) − 1))/α_{j+1} ≥ λ_{j+1},
and
16T_j = 16τλ_j log(1/δ) log²(λ_j/(εα_jβ_jδ))/ε ≥ 4τλ_{j+1} log(1/δ) log²(λ_{j+1}/(16εα_{j+1}β_{j+1}δ))/ε ≥ T_{j+1}.
The last inequality holds because λ_j ≥ 4 and α_j, β_j ≤ 1/2.

To apply the privacy and accuracy guarantees of LabelBoost, the sizes of the databases need to satisfy the inequality in Lemma C.2.
We verify that in each phase the sizes of the databases always satisfy this requirement.

Claim D.3. When ε ≤ 1, for any i ≥ 1, we have |D̂_i| ≤ (β_i/(e·VC(C))) · exp(α_i|Ŝ_i|/(2VC(C))).

Proof. By Claim D.2, Step 3c and Step 3f,
|D̂_i| = ε|D_i|/m = O(λ_i T_i) = O(VC(C) log²(VC(C)) · poly(1/α_i, 1/β_i, 1/ε, log(1/δ))), and
|Ŝ_i| = ε|S_i|/m = O(ελ_i T_i) = O(λ_i T_i) = O(VC(C) log²(VC(C)) · poly(1/α_i, 1/β_i, 1/ε, log(1/δ))),
while
(β_i/(e·VC(C))) · exp(α_i|Ŝ_i|/(2VC(C))) = Ω(VC²(C) · exp(poly(1/α_i, 1/β_i, 1/ε, log(1/δ)))).
Since T_i = τλ_i log(1/δ) log²(λ_i/(εα_iβ_iδ))/ε, the inequality holds when τ ≥ 50.

E Accuracy of Algorithm GenericBBL (Proof of Theorem 6.2)

We refer to the execution of Steps 3a-3g of algorithm GenericBBL as a phase of the algorithm, indexed by i = 1,2,3,.... We give some technical facts in Appendix D. In Claim E.1, we show that in each phase the samples are labeled with high accuracy. In Claim E.3, we prove that algorithm GenericBBL fails with low probability. In Claim E.4, we prove that algorithm GenericBBL predicts labels with high accuracy.

Claim E.1. For each phase i there exists g_i ∈ C s.t. error_{S_i}(g_i) = 0 and error_D(g_i,c) ≤ Σ_{j=1}^{i} α_j, except with probability at most 2Σ_{j=1}^{i} β_j.

Proof. The proof is by induction on i. The base case i = 1 is trivial, with g_1 = c. Assume the claim holds for all j ≤ i. By the properties of LabelBoost (Lemma C.2) and Claim D.3, with probability at least 1 − β_i we have that S_i is labeled by a hypothesis g_i ∈ C s.t. error_{S_{i−1}}(g_{i−1},g_i) ≤ α_i. Observe that the points in S_i (without their labels) are chosen i.i.d. from D, and hence, by Theorem A.2 (VC bounds) and |S_{i−1}| ≥ 128λ_{i−1} ≥ λ_i, with probability at least 1 − β_i we have that error_D(g_i,g_{i−1}) ≤ α_i. Hence, with probability at least 1 − 2β_i, we have error_D(g_i,g_{i−1}) ≤ α_i. Finally, by the triangle inequality, error_D(g_i,c) ≤ Σ_{j=1}^{i} α_j, except with probability at most 2Σ_{j=1}^{i} β_j.

Combining Claim E.1 with a union bound, we get:

Claim E.2. Let D be an underlying distribution and let c ∈ C be a target concept. Then
Pr[∀i ∃g_i ∈ C s.t. error_{S_i}(g_i) = 0 and error_D(g_i,c) ≤ α] ≥ 1 − 2β.

For each phase i, if there exists a noise with |η_j| ≥ 1/16 in Step 3c, we say that phase i fails. Define the following good event.

Event E1: Phase i does not fail, for all i ≥ 1.

Claim E.3. When λ ≥ 1, ε ≤ 1, α ≤ 1/2, β ≤ 1/2, δ ≤ 1/2, event E1 occurs with probability at least 1 − β.

Proof. In each phase i, for each η_j, by the properties of the Laplace distribution, we have
Pr[|η_j| ≥ 1/16] = e^{−T_iε′_i/16} = e^{−T_i/(48√(R_i ln(2/δ)))} ≤ β_i/R_i.
This inequality holds when τ ≥ 1. By a union bound, Pr[phase i fails] ≤ β_i. Again by a union bound, Pr[Event E1 occurs] ≥ 1 − β.

Notations. Consider the i-th phase of Algorithm GenericBBL, and focus on the j-th iteration of Step 3d. Fix all of the randomness of the noises. Now observe that the output in Step 3(d)iii is a deterministic function of the input x_{i,j}. This defines a hypothesis, which we denote as h_{i,j}.

Figure 2: Hypothesis h_{i,j}.

Claim E.4. For β < 1/16, with probability at least 1 − 4β, all of the hypotheses defined above are 4α-good w.r.t. D and c.

Proof. In phase i, by Claim E.2, with probability at least 1 − 2β we have that S_i is labeled by a hypothesis g_i ∈ C satisfying error_D(g_i,c) ≤ α. We continue with the analysis assuming that this is the case. In Step 3a of the i-th phase we divide S_i into T_i subsamples of size λ_i each, identify a consistent hypothesis f_t ∈ C for every subsample S_{i,t}, and denote F_i = (f_1,...,f_{T_i}). By Theorem A.2 (VC bounds), every hypothesis in F_i satisfies error_D(f_t,g_i) ≤ α with probability at least 1 − β_i, in which case, by the triangle inequality, we have that error_D(f_t,c) ≤ 2α. Since T_i ≥ 512(1 − 4β_i)ln(1/β_i)/(1 − 64β_i)², using the Chernoff bound it holds that at least 15T_i/16 of the hypotheses in F_i have error error_D(f_t,c) ≤ 2α, with probability at least 1 − β_i.
By Claim E.3, with probability at least 1 − β, all Laplace noises added in Step 3d are of magnitude less than 1/16. Let m : X → {0,1} be defined as m(x) = maj_{f_t∈F_i}(f_t(x)). For m to err on a point x (w.r.t. the target concept c), it must be that at least a 3/8-fraction of the 2α-good hypotheses in F_i err on x. Considering the worst case depicted in Figure 3, and based on Claim E.3, we have error_D(m,c) ≤ 4α with probability at least 1 − β.

Figure 3: The horizontal axis represents the input point. The vertical axis represents the hypotheses. The red parts represent incorrect predictions. We let T_i/16 of the hypotheses predict all labels incorrectly and let the Laplace noise be as large as T_i/16 votes. For the algorithm to output an incorrect label, at least 3T_i/8 hypotheses must output the incorrect label. In the worst case, at most a 4α-fraction of the points is incorrectly classified.

Combining all of the above, with probability at least 1 − 4β, all the hypotheses are 4α-good.

E.1 Privacy analysis (Proof of Claim 6.3)

Fix t ∈ N and the adversary B. We need to show that View^0_{B,t} and View^1_{B,t} (defined in Figure 1) are (ε,δ)-indistinguishable. We consider separately the case where the executions differ in the training phase (Claim E.5) and the case where the difference occurs during the prediction phase (Claim E.6).

Privacy of the initial training set S. Let S^0, S^1 ∈ (X × {0,1})^n be neighboring datasets of labeled examples and let View^0_{B,t} and View^1_{B,t} be as in Figure 1, where (x^0_1,y^0_1),...,(x^0_n,y^0_n) = S^0 and (x^1_1,y^1_1),...,(x^1_n,y^1_n) = S^1.

Claim E.5. For all adversaries B, all t > 0, and any two neighboring databases S^0 and S^1 selected by B, View^0_{B,t} and View^1_{B,t} are (ε,δ)-indistinguishable.

Proof. Let R′_1 = min(t, R_1). Note that View^b_{B,R′_1} is a prefix of View^b_{B,t} which includes the labels Algorithm GenericBBL produces in Step 3(d)iii for the first R′_1 unlabeled points selected by B. Let S^b_2 be the result of the first application of algorithm LabelBoost in Step 3f of GenericBBL (if t < R_1 we set S^b_2 = ⊥). The creation of these random variables is depicted in Figure 4, where D^L_1 denotes the labels Algorithm GenericBBL produces for the unlabeled points D_1.

Figure 4: Privacy of the labeled sample S.

Observe that View^b_{B,t} results from a post-processing (jointly by the adversary B and Algorithm GenericBBL) of the random variable (View^b_{B,R′_1}, S^b_2), and hence it suffices to show that (View^0_{B,R′_1}, S^0_2) and (View^1_{B,R′_1}, S^1_2) are (ε,δ)-indistinguishable. We follow the processes creating View^b_{B,t} and S^b_2 in Figure 4: (i) The mechanism M_1 corresponds to the loop in Step 3d of GenericBBL, where labels are produced for the adversarially chosen points D^b_1. By Claim D.1, M_1 is (1,δ)-differentially private. (ii) The mechanism M_2 corresponds to the subsampling of Ŝ^b_1 from S^b_1 and the application of procedure LabelBoost on the subsample in Step 3f of GenericBBL, resulting in S^b_2. By Claim 2.9 and Lemma C.1, M_2 is (ε,0)-differentially private. Thus (M_1,M_2) is (ε+1,δ)-differentially private. (iii) The mechanism M_3, with input S^b and output (D^{b,L}_1, S^b_2) = (View^b_{B,R′_1}, S^b_2), applies (M_1,M_2) on the sub-sample S^b_1 obtained from S^b in Step 2 of GenericBBL. By Claim 2.9, M_3 is (ε, 4εδ/(3+exp(ε+1)))-differentially private. Since 4εδ/(3+exp(ε+1)) ≤ δ for any ε, (View^0_{B,R′_1}, S^0_2) and (View^1_{B,R′_1}, S^1_2) are (ε,δ)-indistinguishable.

Privacy of the unlabeled points D. Let D^0, D^1 ∈ X^t be neighboring datasets of unlabeled examples and let View^0_{B,t} and View^1_{B,t} be as in Figure 1, where (x^0_1,...,x^0_t) = D^0 and (x^1_1,...,x^1_t) = D^1.

Claim E.6.
For all adversaries B, all t > 0, and any two neighboring databases D^0 and D^1 selected by B, View^0_{B,t} and View^1_{B,t} are (ε,δ)-indistinguishable.

Figure 5: Privacy leakage of D_i.

Proof. Let D^0_1, D^0_2, ..., D^0_k and D^1_1, D^1_2, ..., D^1_k be the sets of unlabeled databases in Step 3e of GenericBBL. Without loss of generality, we assume D^0_i and D^1_i differ in one entry. When i = k, View^0_{B,t} = View^1_{B,t}, because all the selected hypotheses are the same. When i < k, let R′ = min(Σ_{j=1}^{i+1} R_j, t). Similarly to the analysis in Claim E.5, View^b_{B,t} results from a post-processing of the random variable (View^b_{B,R′}, S^b_{i+2}) (if t < Σ_{j=1}^{i+1} R_j we set S^b_{i+2} = ⊥). Note that View^b_{B,R′} = (D^{b,L}_1,...,D^{b,L}_i, D^{b,L}_{i+1}), and that (D^{b,L}_1,...,D^{b,L}_{i−1}, D^{b,L}_i) follow the same distribution for b ∈ {0,1}, where here D^{b,L}_i denotes the labels of the points in D^b_i except for the differing point. So it suffices to show that (D^{0,L}_{i+1}, S^0_{i+2}) and (D^{1,L}_{i+1}, S^1_{i+2}) are (ε,δ)-indistinguishable.

We follow the processes creating D^{b,L}_{i+1} and S^b_{i+2} in Figure 5: (i) The mechanism M_1 corresponds to the loop in Step 3d of GenericBBL, where labels are produced for the adversarially chosen points D^b_{i+1}. By Claim D.1, M_1 is (1,δ)-differentially private. (ii) The mechanism M_2 corresponds to the subsampling of Ŝ^b_{i+1} from S^b_{i+1} and the application of procedure LabelBoost on the subsample in Step 3f of GenericBBL, resulting in S^b_{i+2}. By Claim 2.9 and Lemma C.1, M_2 is (ε,0)-differentially private. Thus (M_1,M_2) is (ε+1,δ)-differentially private. (iii) The mechanism M_3, with input D̂^b_i and output (D^{b,L}_{i+1}, S^b_{i+2}), applies (M_1,M_2) on S_{i+1}, which is generated from D̂^b_i in Step 3f of GenericBBL. By Lemma C.1, M_3 is (ε+4, 4eδ)-differentially private. (iv) The mechanism M_4 corresponds to the subsampling of D̂^b_i from D^b_i and the application of M_3 on D̂^b_i. By Claim 2.9, M_4 is (ε, 16eεδ/(3+exp(ε+4)))-differentially private. Since 16eε/(3+exp(ε+4)) ≤ 1 for any ε, (D^{0,L}_{i+1}, S^0_{i+2}) and (D^{1,L}_{i+1}, S^1_{i+2}) are (ε,δ)-indistinguishable.

Remark E.7. The above proofs work for the adversarially selected D because: (i) Lemma ?? works on adaptively selected queries (we treat the hypothesis class F_i as the database, and the unlabeled points x_{i,ℓ} as the query parameters); (ii) LabelBoost generates labels by applying one privately selected hypothesis to the points, and the labels are differentially private by post-processing.