# Black-Box Differential Privacy for Interactive ML

Haim Kaplan (Tel Aviv University and Google Research; haimk@tau.ac.il), Yishay Mansour (Tel Aviv University and Google Research; mansour.yishay@gmail.com), Shay Moran (Technion and Google Research; smoran@technion.ac.il), Kobbi Nissim (Georgetown University; kobbi.nissim@georgetown.edu), Uri Stemmer (Tel Aviv University and Google Research; u@uri.co.il)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

In this work we revisit an interactive variant of joint differential privacy, recently introduced by Naor et al. [2023], and generalize it towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies (suitable variants of) group privacy, composition, and post-processing. In order to demonstrate the advantages of this privacy definition compared to traditional forms of differential privacy, we consider the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference with traditional forms of differential privacy, such as the one studied by Golowich and Livni [2021], where only a doubly exponential overhead in the mistake bound is known (via an information-theoretic upper bound).

## 1 Introduction

In this work we study the privacy of interactive machine learning processes. As a motivating story, consider a chatbot that continuously improves itself by learning from the conversations it conducts with users. As these conversations might contain sensitive information, we would like to provide privacy guarantees to the users, in the sense that the content of their conversations with the chatbot would not leak. This setting fleshes out the following two requirements.

(1) Clearly, the answers given by the chatbot to user $u_i$ must depend on the queries made by user $u_i$. For example, the chatbot should provide different answers when asked by user $u_i$ for the weather forecast in Antarctica, and when asked by $u_i$ for a pasta recipe. This is in contrast to the plain formulation of differential privacy, where it is required that all of the mechanism's outputs be (almost) independent of any single user's input. Therefore, the privacy requirement we are aiming for is that the conversation of user $u_i$ will remain hidden from other users, and will not leak through the other users' interactions with the chatbot. Moreover, this should remain true even if a privacy attacker (aiming to extract information about the conversation user $u_i$ had) conducts many different conversations with the chatbot.

(2) The interaction with the chatbot is, by design, interactive and adaptive, as it aims to conduct dialogues with the users. This allows the privacy attacker (mentioned above) to choose its queries to the chatbot adaptively. Privacy, hence, needs to be preserved even in the presence of adaptive attackers.

While each of these two requirements was studied in isolation, to the best of our knowledge, they have not been (explicitly) unified into a combined privacy framework. Requirement (1) was formalized by Kearns et al. [2015] as joint differential privacy (JDP). It provides privacy against non-adaptive attackers.
Intuitively, in the chatbot example, JDP aims to hide the conversation of user $u_i$ from any privacy attacker that chooses in advance all the queries it poses to the chatbot. This is unsatisfactory, since the adaptive nature of this process invites adaptive attackers.

Requirement (2) was studied in many different settings, but to the best of our knowledge, only w.r.t. the plain formulation of DP, where the (adaptive) privacy attacker sees all of the outputs of the mechanism. Works in this vein include [Dwork et al., 2009, Chan et al., 2010, Hardt and Rothblum, 2010, Dwork et al., 2010b, Bun et al., 2017, Kaplan et al., 2021, Jain et al., 2021]. In the chatbot example, plain DP would require, in particular, that even the messages sent from the chatbot to user $u_i$ reveal (almost) no information about $u_i$. In theory, this could be obtained by making sure that the entire chatbot model is computed in a privacy-preserving manner, such that even its full description leaks almost no information about any single user. Then, when user $u_i$ comes, we can simply share the model with her and let her query it locally on her device. But this is likely unrealistic with large models involving hundreds of billions of parameters.

In this work we use challenge differential privacy, which was recently introduced by Naor et al. [2023] in the context of PAC learning. As discussed below, challenge differential privacy is particularly suitable for addressing interactive and adaptive learning processes, such as the one illustrated above. Challenge-DP can be viewed as an interactive variant of JDP, aimed at maintaining privacy against adaptive privacy attackers. Intuitively, in the chatbot example, this definition would guarantee that even an adaptive attacker that controls all of the users except for user $u_i$ learns (almost) no information about the conversation user $u_i$ had with the chatbot.

### 1.1 Private Online Classification

We initiate the study of challenge differential privacy in the basic setting of online classification. Let $X$ be the domain, $Y$ be the label space, and $Z = X \times Y$ be the set of labeled examples. An online learner is a (possibly randomized) mapping $A : Z^* \times X \to Y$. That is, it is a mapping that maps a finite sequence $S \in Z^*$ (the past examples) and an unlabeled example $x \in X$ (the current query point) to a label $y \in Y$, which is denoted by $y = A(x; S)$. Let $H \subseteq Y^X$ be a hypothesis class. A sequence $S \in Z^*$ is said to be realizable by $H$ if there exists $h \in H$ such that $h(x_i) = y_i$ for every $(x_i, y_i) \in S$. For a sequence $S = \{(x_t, y_t)\}_{t=1}^{T} \in Z^*$ we write $M(A; S)$ for the random variable denoting the number of mistakes $A$ makes during the execution on $S$. That is, $M(A; S) = \sum_{t=1}^{T} \mathbb{1}\{\hat{y}_t \neq y_t\}$, where $\hat{y}_t = A(x_t; S_{<t})$.

Challenge-DP satisfies a group-privacy property with respect to the number of challenge rounds. This is captured by the following theorem; see Appendix B for the proof.

Theorem 3.4. Let $M$ be an algorithm that in each round $i \in [T]$ obtains an input point $x_i$, outputs a predicted label $\hat{y}_i$, and obtains a true label $y_i$. If $M$ is $(\varepsilon, \delta)$-challenge-DP, then for every $g \in \mathbb{N}$ and every adversary $B$ (posing at most $g$ challenges) we have that $\mathrm{OnlineGame}_{M,B,T,g}$ is $(g\varepsilon,\, g \cdot e^{\varepsilon g} \cdot \delta)$-differentially private.

## 4 Algorithm ChallengeAT

Towards presenting our private online learner, we introduce a variant of algorithm AboveThreshold with additional guarantees, which we call ChallengeAT. Recall that AboveThreshold hides arbitrary modifications to a single input point.
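For concreteness, the following is a minimal sketch of the classic AboveThreshold (sparse vector) mechanism in its standard textbook parameterization (as in, e.g., Dwork and Roth [2014]); the function name and noise scales here are illustrative assumptions and are not the parameters used by ChallengeAT below.

```python
import numpy as np

def above_threshold(dataset, queries, threshold, eps, sensitivity=1.0):
    """Textbook AboveThreshold (sparse vector) sketch.

    Answers a stream of low-sensitivity queries with noisy below/above-threshold
    bits and halts after the first above-threshold report. The noise scales follow
    the standard presentation (e.g., Dwork and Roth [2014]); they are not the
    parameters of ChallengeAT from Section 4.
    """
    noisy_threshold = threshold + np.random.laplace(scale=2.0 * sensitivity / eps)
    answers = []
    for f in queries:  # in general the queries may be chosen adaptively
        noisy_value = f(dataset) + np.random.laplace(scale=4.0 * sensitivity / eps)
        if noisy_value >= noisy_threshold:
            answers.append(1)
            break  # halts after a single positive report
        answers.append(0)
    return answers
```

ChallengeAT extends this template in two ways: it keeps answering until $r$ positive reports have been observed (with the reports counted by a private counter under continual observation), and its privacy guarantee is stated against an adaptive challenge adversary.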
Intuitively, the new variant we present aims to hide both an arbitrary modification to a single input point and an arbitrary modification to a single query throughout the execution. Consider algorithm ChallengeAT.

Algorithm 6 ChallengeAT

Input: Dataset $S \in X^*$, privacy parameters $\varepsilon, \delta$, threshold $t$, number of positive reports $r$, and an adaptively chosen stream of queries $f_i : X^* \to \mathbb{R}$, each with sensitivity $\Delta$.

Tool used: An $(\varepsilon, 0)$-DP algorithm, PrivateCounter, for counting bits under continual observation, guaranteeing error at most $\lambda$ with probability at least $1 - \delta$.

1. Instantiate PrivateCounter.
2. Denote $\gamma = O\!\left(\frac{\Delta}{\varepsilon}\sqrt{(r+\lambda)\ln\frac{r+\lambda}{\delta}}\right)$.
3. In each round $i$, when receiving a query $f_i$, do the following:
   (a) Let $\hat{f}_i \leftarrow f_i(S) + \mathrm{Lap}(\gamma)$.
   (b) If $\hat{f}_i \geq t$, then let $\sigma_i = 1$, and otherwise let $\sigma_i = 0$.
   (c) Output $\sigma_i$.
   (d) Feed $\sigma_i$ to PrivateCounter and let $\mathrm{count}_i$ denote its current output.
   (e) If $\mathrm{count}_i \geq r$ then HALT.

Remark 4.1. When we apply ChallengeAT, it sets $\lambda = O\!\left(\frac{1}{\varepsilon}\log(T)\log\frac{T}{\beta}\right)$. Technically, for this it has to know $T$ and $\beta$. To simplify the description this is not explicit in our algorithms.

The utility guarantees of ChallengeAT are straightforward. The following theorem follows by bounding (w.h.p.) all the noises sampled throughout the execution (when instantiating ChallengeAT with the private counter from Theorem 2.10).

Theorem 4.2. Let $s$ denote the random coins of ChallengeAT. Then there exists an event $E$ such that: (1) $\Pr[s \in E] \geq 1 - \beta$, and (2) conditioned on every $s \in E$, for every input dataset $S$ and every sequence of $T$ queries $(f_1, \dots, f_T)$ it holds that

1. Algorithm ChallengeAT does not halt before the $r$th time in which it outputs $\sigma_i = 1$,
2. For every $i$ such that $\sigma_i = 1$ it holds that $f_i(S) \geq t - O\!\left(\frac{\Delta}{\varepsilon}\sqrt{(r+\lambda)\ln\frac{r+\lambda}{\delta}}\,\log\frac{T}{\beta}\right)$,
3. For every $i$ such that $\sigma_i = 0$ it holds that $f_i(S) \leq t + O\!\left(\frac{\Delta}{\varepsilon}\sqrt{(r+\lambda)\ln\frac{r+\lambda}{\delta}}\,\log\frac{T}{\beta}\right)$,

where $\lambda = O\!\left(\frac{1}{\varepsilon}\log(T)\log\frac{T}{\beta}\right)$ is the error of the counter of Theorem 2.10. The event $E$ in Theorem 4.2 occurs when all the Laplace noises of the counter and ChallengeAT are within a factor of $\log(T/\beta)$ of their expectation.

The privacy guarantees of ChallengeAT are more involved. They are defined via a game with an adversary $B$ whose goal is to guess a secret bit $b$. At the beginning of the game, the adversary chooses two neighboring datasets $S_0, S_1$, and ChallengeAT is instantiated with $S_b$. Then, throughout the game, the adversary specifies queries $f_i$ and observes the output of ChallengeAT on these queries. At some special round $i^\star$, chosen by the adversary, the adversary specifies two queries $f^0_{i^\star}, f^1_{i^\star}$, where only $f^b_{i^\star}$ is fed into ChallengeAT. In round $i^\star$ the adversary does not get to see the answer of ChallengeAT on $f^b_{i^\star}$ (otherwise it could easily learn the bit $b$, since $f^0_{i^\star}$ and $f^1_{i^\star}$ may be very different). We show that any such adversary $B$ has only a small advantage in guessing the bit $b$. The formal details are given in Appendix C.
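The following Python sketch mirrors the control flow of Algorithm 6 under two simplifying assumptions: the noise scale `gamma` is passed in directly (rather than computed from the expression above), and the continual-observation PrivateCounter is replaced by an exact counter. It therefore illustrates the mechanics of ChallengeAT, not its privacy accounting.

```python
import numpy as np

class ChallengeATSketch:
    """Control-flow sketch of Algorithm 6 (ChallengeAT).

    Assumptions: `gamma` is supplied by the caller, and `count` is an exact
    (non-private) stand-in for the PrivateCounter used in the paper.
    """

    def __init__(self, dataset, threshold, max_positive_reports, gamma):
        self.dataset = dataset
        self.threshold = threshold
        self.r = max_positive_reports
        self.gamma = gamma
        self.count = 0          # stand-in for PrivateCounter's running output
        self.halted = False

    def answer(self, query):
        """Process one (sensitivity-bounded) query f_i and return sigma_i."""
        if self.halted:
            raise RuntimeError("ChallengeAT has halted")
        noisy_value = query(self.dataset) + np.random.laplace(scale=self.gamma)
        sigma = 1 if noisy_value >= self.threshold else 0
        self.count += sigma     # the paper feeds sigma to PrivateCounter instead
        if self.count >= self.r:
            self.halted = True  # HALT after r positive reports (step 3e)
        return sigma
```

In the paper, the exact counter is replaced by an $(\varepsilon, 0)$-DP counter for bits under continual observation (Theorem 2.10), and its error $\lambda$ appears both in the noise scale $\gamma$ and in the utility bounds of Theorem 4.2.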
## 5 Online Classification under Challenge Differential Privacy

We are now ready to present our private online prediction algorithm. Consider algorithm POP.

Theorem 5.1. When executed with a learner $A$ that makes at most $d$ mistakes and with parameters $k = O\!\left(\frac{d}{\varepsilon^2}\log^2\frac{1}{\delta}\log^2\frac{T}{\beta}\right)$ and $r = O\!\left(dk + \ln\frac{1}{\beta}\right)$, then with probability at least $1-\beta$ the number of mistakes made by algorithm POP is bounded by $O\!\left(\frac{d^2}{\varepsilon^2}\log^2\frac{1}{\delta}\log^2\frac{T}{\beta}\right)$.

Algorithm 7 POP (Private Online Procedure)

Setting: $T \in \mathbb{N}$ denotes the number of rounds in the game. $A$ is a non-private online algorithm.

Parameters: $k$ determines the number of copies of $A$ we maintain. $r$ determines the number of positive reports we aim to receive from ChallengeAT.

1. Instantiate $k$ copies $A_1, \dots, A_k$ of algorithm $A$.
2. Instantiate algorithm ChallengeAT on an empty dataset with threshold $t = k/4$, privacy parameters $\varepsilon, \delta$, number of positive reports $r$, and sensitivity parameter $\Delta = 1$.
3. For $i = 1, 2, \dots, T$:
   (a) Obtain input $x_i$.
   (b) Let $A^{\mathrm{temp}}_1, \dots, A^{\mathrm{temp}}_k$ be duplicated copies of $A_1, \dots, A_k$.
   (c) Let $\ell_i \in [k]$ be chosen uniformly at random.
   (d) Let $\hat{y}_{i,\ell_i} \leftarrow A_{\ell_i}(x_i)$. For $j \in [k] \setminus \{\ell_i\}$ let $\hat{y}_{i,j} \leftarrow A^{\mathrm{temp}}_j(x_i)$.
   (e) Feed ChallengeAT the query $f_i \triangleq \min\!\left\{\sum_{j \in [k]} \hat{y}_{i,j},\; k - \sum_{j \in [k]} \hat{y}_{i,j}\right\}$ and obtain an outcome $\sigma_i$. (If ChallengeAT halts then POP also halts.) % Recall that $\sigma_i = 1$ indicates that $\min\{\sum_{j} \hat{y}_{i,j},\, k - \sum_{j} \hat{y}_{i,j}\} \gtrsim \frac{k}{4}$, meaning that there is a lot of disagreement among $\hat{y}_{i,1}, \dots, \hat{y}_{i,k}$.
   (f) If $\sigma_i = 1$ then sample $\hat{y}_i \in \{0, 1\}$ at random. Else let $\hat{y}_i = \mathrm{majority}\{\hat{y}_{i,1}, \dots, \hat{y}_{i,k}\}$.
   (g) Output the bit $\hat{y}_i$ as a prediction, and obtain a true label $y_i$.
   (h) Feed $y_i$ to $A_{\ell_i}$. % Note that $A_{\ell_i}$ is the only copy of $A$ that changes its state during this iteration.
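Below is a minimal sketch of POP for binary labels, built on the `ChallengeATSketch` above. The `learner_factory` interface (with `predict(x)` and `update(x, y)` methods) is an assumption made for illustration, the disagreement query follows step 3(e), and the parameters are left to the caller rather than set as in Theorem 5.1.

```python
import random

class POPSketch:
    """Sketch of Algorithm 7 (POP) for binary labels, on top of ChallengeATSketch.

    `learner_factory()` is an assumed interface returning a fresh copy of the
    non-private learner A, with methods predict(x) -> {0,1} and update(x, y).
    """

    def __init__(self, learner_factory, k, r, gamma):
        self.experts = [learner_factory() for _ in range(k)]
        self.k = k
        # ChallengeAT is run on an empty dataset with threshold k/4 (step 2).
        self.cat = ChallengeATSketch(dataset=None, threshold=k / 4,
                                     max_positive_reports=r, gamma=gamma)

    def predict(self, x):
        self.ell = random.randrange(self.k)              # step 3(c)
        self.x = x
        preds = [a.predict(x) for a in self.experts]     # steps 3(b)-(d); predict()
                                                         # is assumed not to mutate state
        ones = sum(preds)
        disagreement = min(ones, self.k - ones)          # minority size, sensitivity 1
        sigma = self.cat.answer(lambda _data: disagreement)   # step 3(e)
        if sigma == 1:
            return random.randint(0, 1)                  # step 3(f): random guess
        return int(ones > self.k / 2)                    # majority vote

    def observe(self, y_true):
        # Step 3(h): only the randomly chosen expert is updated with the true label.
        self.experts[self.ell].update(self.x, y_true)
```

The point mirrored in the analysis below is that each user's example updates only the single randomly chosen expert, and the only information about the experts' predictions that reaches the output is either the majority vote (released only when ChallengeAT reports little disagreement) or a uniformly random guess.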
Informally, the privacy guarantees of POP follow from those of ChallengeAT. We give the formal details in Appendix D, obtaining the following theorem.

Theorem 5.2. Algorithm POP is $(O(\varepsilon), O(\delta))$-challenge-DP. That is, for every adversary $B$ it holds that $\mathrm{OnlineGame}_{\mathrm{POP},B}$ is $(O(\varepsilon), O(\delta))$-DP w.r.t. the bit $b$ (the input of the game).

We proceed with the utility guarantees of POP. See Appendix F for an extension to the agnostic setting.

Theorem 5.3. When executed with a learner $A$ that makes at most $d$ mistakes and with parameters $k = O\!\left(\frac{d}{\varepsilon^2}\log^2\frac{1}{\delta}\log^2\frac{T}{\beta}\right)$ and $r = O\!\left(dk + \ln\frac{1}{\beta}\right)$, then with probability at least $1-\beta$ the number of mistakes made by algorithm POP is bounded by $O\!\left(\frac{d^2}{\varepsilon^2}\log^2\frac{1}{\delta}\log^2\frac{T}{\beta}\right)$.

Proof. By Theorem 4.2, with probability $1-\beta$ over the internal coins of ChallengeAT, for every input sequence, its answers are accurate up to an error of
$$\mathrm{error}_{\mathrm{CAT}} = O\!\left(\frac{1}{\varepsilon}\sqrt{(r+\lambda)\ln\frac{r+\lambda}{\delta}}\,\log\frac{T}{\beta}\right),$$
where in our case the sensitivity $\Delta$ is 1, and the error of the counter $\lambda$ is at most $O\!\left(\frac{1}{\varepsilon}\log(T)\log\frac{T}{\beta}\right)$ by Theorem 2.10. We continue with the proof assuming that this event occurs. Furthermore, we set $k = \Omega(\mathrm{error}_{\mathrm{CAT}})$, large enough, such that if less than $\frac{1}{5}$ of the experts disagree with the other experts, then algorithm POP returns the majority vote with probability 1.

Consider the execution of algorithm POP and define $1/5\text{-Err}$ to be a random variable that counts the number of time steps in which at least a $1/5$ fraction of the experts make an error. That is,
$$1/5\text{-Err} = \left|\left\{ i \in [T] : \textstyle\sum_{j \in [k]} \mathbb{1}\{\hat{y}_{i,j} \neq y_i\} > k/5 \right\}\right|.$$
We also define the random variable $\mathrm{expertAdvance} = \left|\{ i \in [T] : y_i \neq \hat{y}_{i,\ell_i} \}\right|$. That is, $\mathrm{expertAdvance}$ counts the number of time steps in which the random expert we choose (the $\ell_i$th expert) errs. Note that the $\ell_i$th expert is the expert that gets the true label $y_i$ as feedback. As we run $k$ experts, and as each of them is guaranteed to make at most $d$ mistakes, we get that $\mathrm{expertAdvance} \leq kd$.

We now show that with high probability $1/5\text{-Err}$ is not much larger than $\mathrm{expertAdvance}$. Let $i$ be a time step in which at least a $1/5$ fraction of the experts err. As the choice of $\ell_i$ (the expert we update) is random, with probability at least $\frac{1}{5}$ the chosen expert also errs. It is therefore unlikely that $1/5\text{-Err}$ is much larger than $\mathrm{expertAdvance}$, which is bounded by $kd$. Specifically, by standard concentration arguments (see Appendix E for the precise version we use) it holds that
$$\Pr\!\left[1/5\text{-Err} > 18dk + 18 + \ln\tfrac{1}{\beta}\right] \leq \beta.$$
Note that when at least $1/5$ of the experts disagree with the other experts, then at least $1/5$ of the experts err. It follows that $1/5\text{-Err}$ upper bounds the number of times in which algorithm ChallengeAT returns an above-threshold answer. Hence, by setting $r > 18dk + 18 + \ln\frac{1}{\beta}$ we ensure that w.h.p. algorithm ChallengeAT does not halt prematurely (and hence POP does not either). Furthermore, our algorithm errs either when there is a large disagreement between the experts or when the majority of the experts errs. It follows that $1/5\text{-Err}$ also upper bounds the number of times in which our algorithm errs. Overall, by setting $r = O\!\left(dk + \ln\frac{1}{\beta}\right)$ we ensure that POP does not halt prematurely, and by taking $k = \Omega(\mathrm{error}_{\mathrm{CAT}})$ we ensure that POP does not err too many times throughout the execution. Combining the requirement on $r$ and on $k$, it suffices to take $k = O\!\left(\frac{d}{\varepsilon^2}\log^2\frac{1}{\delta}\log^2\frac{T}{\beta}\right)$, in which case algorithm POP makes at most $O\!\left(\frac{d^2}{\varepsilon^2}\log^2\frac{1}{\delta}\log^2\frac{T}{\beta}\right)$ mistakes with high probability.

## 6 Conclusion

Our work presents a new privacy model for online classification, together with an efficiency-preserving transformation from non-private to private online classification, which exhibits a doubly exponential improvement in the error compared to prior works on this topic. We leave open the possibility that such an improvement could also be achieved in the setting of Golowich and Livni [2021], i.e., under the more restrictive notion of privacy where the sequence of predictors does not compromise privacy.

## Acknowledgments and Disclosure of Funding

Haim Kaplan was partially supported by the Israel Science Foundation (grant 1595/19) and by the Blavatnik Family Foundation. Yishay Mansour was partially funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grant number 993/17), by the Tel Aviv University Center for AI and Data Science (TAD), and by the Yandex Initiative for Machine Learning at Tel Aviv University. Shay Moran is a Robert J. Shillman Fellow; he acknowledges support by ISF grant 1225/20, by BSF grant 2018385, by an Azrieli Faculty Fellowship, by Israel PBC-VATAT, by the Technion Center for Machine Learning and Intelligent Systems (MLIS), and by the European Union (ERC, GENERALIZATION, 101039692). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. Kobbi Nissim was partially funded by NSF grant No. CNS-2001041 and by a gift to Georgetown University. Uri Stemmer was partially supported by the Israel Science Foundation (grant 1871/19) and by Len Blavatnik and the Blavatnik Family Foundation.

## References

Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 32–40. PMLR, 2017.

Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private online prediction from experts: Separations and faster rates. CoRR, abs/2210.13537, 2022.

Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009.

Mark Bun, Thomas Steinke, and Jonathan Ullman. Make up your mind: The price of online queries in differential privacy. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1306–1325. SIAM, 2017.

T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. In ICALP (2), volume 6199 of Lecture Notes in Computer Science, pages 405–417. Springer, 2010.
Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.

Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil P. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, pages 381–390, 2009.

Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Symposium on Theory of Computing (STOC), pages 715–724. ACM, 2010a.

Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60, 2010b.

Noah Golowich and Roi Livni. Littlestone classes are privately online learnable. In NeurIPS, pages 11462–11473, 2021.

Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially private combinatorial optimization. In SODA, pages 1106–1125, 2010.

Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70, 2010.

Palak Jain, Sofya Raskhodnikova, Satchit Sivakumar, and Adam D. Smith. The price of differential privacy under continual observation. CoRR, abs/2112.00828, 2021.

Prateek Jain and Abhradeep Guha Thakurta. (Near) dimension independent risk bounds for differentially private learning. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pages 476–484. JMLR.org, 2014.

Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In COLT, volume 23 of JMLR Proceedings, pages 24.1–24.34. JMLR.org, 2012.

Haim Kaplan, Yishay Mansour, and Uri Stemmer. The sparse vector technique, revisited. In COLT, volume 134 of Proceedings of Machine Learning Research, pages 2747–2776. PMLR, 2021.

Michael J. Kearns, Mallesh M. Pai, Ryan M. Rogers, Aaron Roth, and Jonathan R. Ullman. Robust mediators in large games. CoRR, abs/1512.02698, 2015.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, pages 285–318, 1988.

Yifat Nahmias, Oren Perez, Yotam Shlomo, and Uri Stemmer. Privacy preserving social norm nudges. Mich. Tech. L. Rev., 26:43, 2019.

Moni Naor, Kobbi Nissim, Uri Stemmer, and Chao Yan. Private everlasting prediction. CoRR, abs/2305.09579, 2023.

Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. In NeurIPS, pages 4301–4311, 2018.

Abhradeep Guha Thakurta and Adam D. Smith. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In NIPS, pages 2733–2741, 2013.

## A General Variant of Challenge-DP

Definition A.1. Consider an algorithm $M$ that, in each phase $i \in [T]$, conducts an arbitrary interaction with the $i$th user. We say that algorithm $M$ is $(\varepsilon, \delta)$-challenge differentially private if for any adversary $B$ we have that $\mathrm{GeneralGame}_{M,B,T}$, defined below, is $(\varepsilon, \delta)$-differentially private (w.r.t. its input bit $b$).

Algorithm 8 GeneralGame$_{M,B,T}$

Setting: $T \in \mathbb{N}$ denotes the number of phases. $M$ is an interactive algorithm and $B$ is an adaptive and interactive adversary.

Input of the game: A bit $b \in \{0, 1\}$. (The bit $b$ is unknown to $M$ and $B$.)

1. For $i = 1, 2, \dots, T$:
   (a) The adversary $B$ outputs a bit $c_i \in \{0, 1\}$, under the restriction that $\sum_{j=1}^{i} c_j \leq 1$.
   (b) The adversary $B$ chooses two interactive algorithms $I_{i,0}$ and $I_{i,1}$, under the restriction that if $c_i = 0$ then $I_{i,0} = I_{i,1}$.
   (c) Algorithm $M$ interacts with $I_{i,b}$. Let $\hat{y}_i$ denote the view of $I_{i,b}$ at the end of this interaction.
   (d) If $c_i = 0$ then set $\tilde{y}_i = \hat{y}_i$. Otherwise set $\tilde{y}_i = \bot$.
   (e) The adversary $B$ obtains $\tilde{y}_i$.
2. Output $B$'s view of the game.

## B Group privacy

In this section we prove Theorem 3.4.

Proof of Theorem 3.4. Fix $g \in \mathbb{N}$ and fix an adversary $B$ (that poses at most $g$ challenge rounds). We consider a sequence of games $W_0, W_1, \dots, W_g$, where $W_\ell$ is defined as follows.

1. Initialize algorithm $M$ and the adversary $B$.
2. For round $i = 1, 2, \dots, T$:
   (a) Obtain a challenge indicator $c_i$ and two labeled inputs $(x_{i,0}, y_{i,0})$ and $(x_{i,1}, y_{i,1})$ from $B$.
   (b) If $\sum_{j=1}^{i} c_j > \ell$ then set $(w_i, z_i) = (x_{i,0}, y_{i,0})$. Otherwise set $(w_i, z_i) = (x_{i,1}, y_{i,1})$.
   (c) Feed $w_i$ to algorithm $M$, obtain an outcome $\hat{y}_i$, and feed it $z_i$.
   (d) If $c_i = 0$ then set $\tilde{y}_i = \hat{y}_i$. Otherwise set $\tilde{y}_i = \bot$.
   (e) Give $\tilde{y}_i$ to $B$.
3. Output $\tilde{y}_1, \dots, \tilde{y}_T$ and the internal randomness of $B$.

That is, $W_\ell$ simulates the online game between $M$ and $B$, where during the first $\ell$ challenge rounds algorithm $M$ is given $(x_{i,1}, y_{i,1})$, and in the rest of the challenge rounds algorithm $M$ is given $(x_{i,0}, y_{i,0})$. Note that $\mathrm{OnlineGame}_{M,B,T,g}(0) \equiv W_0$ and $\mathrm{OnlineGame}_{M,B,T,g}(1) \equiv W_g$.

We claim that for every $0 < \ell \leq g$ it holds that $W_{\ell-1} \approx_{(\varepsilon,\delta)} W_\ell$. To this end, fix $0 < \ell \leq g$ and consider an adversary $\widehat{B}$, posing at most one challenge, defined as follows. Algorithm $\widehat{B}$ runs $B$ internally. In every round $i$, algorithm $\widehat{B}$ obtains from $B$ a challenge bit $c_i$ and two labeled inputs $(x_{i,0}, y_{i,0})$ and $(x_{i,1}, y_{i,1})$. As long as $B$ did not pose its $\ell$th challenge, algorithm $\widehat{B}$ outputs $(x_{i,1}, y_{i,1}), (x_{i,1}, y_{i,1})$. During the round $i$ in which $B$ poses its $\ell$th challenge, algorithm $\widehat{B}$ outputs $(x_{i,0}, y_{i,0}), (x_{i,1}, y_{i,1})$. This is the challenge round posed by algorithm $\widehat{B}$. In every round afterwards, algorithm $\widehat{B}$ outputs $(x_{i,0}, y_{i,0}), (x_{i,0}, y_{i,0})$. When algorithm $\widehat{B}$ obtains an answer $\tilde{y}_i$, it sets $\bar{y}_i = \tilde{y}_i$ if $c_i = 0$ and $\bar{y}_i = \bot$ if $c_i = 1$, and gives $\bar{y}_i$ to algorithm $B$.

As $\widehat{B}$ is an adversary that poses (at most) one challenge, by the privacy properties of $M$ we know that $\mathrm{OnlineGame}_{M,\widehat{B},T}$ is $(\varepsilon, \delta)$-DP. Recall that the output of $\mathrm{OnlineGame}_{M,\widehat{B},T}$ includes all of the randomness of $\widehat{B}$, as well as the answers generated throughout the game. This includes the randomness of $B$ (which $\widehat{B}$ runs internally), and hence determines also all of the $\bar{y}_i$'s defined by $\widehat{B}$ throughout the interaction. Let $P$ be a post-processing procedure that takes the output of $\mathrm{OnlineGame}_{M,\widehat{B},T}$ and returns the randomness of $B$ as well as $(\bar{y}_1, \dots, \bar{y}_T)$. By the closure of DP to post-processing, we have that $P(\mathrm{OnlineGame}_{M,\widehat{B},T}(0)) \approx_{(\varepsilon,\delta)} P(\mathrm{OnlineGame}_{M,\widehat{B},T}(1))$. Now note that $P(\mathrm{OnlineGame}_{M,\widehat{B},T}(0)) \equiv W_{\ell-1}$ and $P(\mathrm{OnlineGame}_{M,\widehat{B},T}(1)) \equiv W_\ell$, and hence $W_{\ell-1} \approx_{(\varepsilon,\delta)} W_\ell$.

Overall we have that
$$\mathrm{OnlineGame}_{M,B,T,g}(0) \equiv W_0 \approx_{(\varepsilon,\delta)} W_1 \approx_{(\varepsilon,\delta)} W_2 \approx_{(\varepsilon,\delta)} \cdots \approx_{(\varepsilon,\delta)} W_g \equiv \mathrm{OnlineGame}_{M,B,T,g}(1).$$
This shows that $\mathrm{OnlineGame}_{M,B,T,g}$ is $(g\varepsilon, g \cdot e^{\varepsilon g} \cdot \delta)$-differentially private, thereby completing the proof.

## C Privacy Analysis of ChallengeAT

The privacy guarantees of ChallengeAT are captured using the game specified in algorithm ChallengeAT-Game$_B$.

Theorem C.1. For every adversary $B$ it holds that ChallengeAT-Game$_B$ is $(O(\varepsilon), O(\delta))$-DP w.r.t. the bit $b$ (the input of the game).

Algorithm 9 ChallengeAT-Game$_B$

Setting: $B$ is an adversary that adaptively determines the inputs to ChallengeAT.

Input of the game: A bit $b \in \{0, 1\}$.
(The bit $b$ is unknown to ChallengeAT and $B$.)

1. The adversary $B$ specifies two neighboring datasets $S_0, S_1 \in X^*$.
2. Instantiate ChallengeAT with the dataset $S_b$ and parameters $\varepsilon, \delta$, threshold $t$, and number of positive reports $r$.
3. For $i = 1, 2, 3, \dots$:
   (a) Get a bit $c_i \in \{0, 1\}$ from $B$ subject to the restriction that $\sum_{j=1}^{i} c_j \leq 1$. % When $c_i = 1$ this is the challenge round.
   (b) Get two queries $f^0_i : X^* \to \mathbb{R}$ and $f^1_i : X^* \to \mathbb{R}$ from $B$, each with sensitivity $\Delta$, subject to the restriction that if $c_i = 0$ then $f^0_i \equiv f^1_i$.
   (c) Give the query $f^b_i$ to ChallengeAT and get back the bit $\sigma_i$.
   (d) If $c_i = 0$ then set $\hat{y}_i = \sigma_i$. Otherwise set $\hat{y}_i = \bot$.
   (e) Give $\hat{y}_i$ to the adversary $B$.
4. Publish $B$'s view of the game, that is, $\hat{y}_1, \dots, \hat{y}_T$ and the internal randomness of $B$.

Proof. Fix an adversary $B$. Let CATG denote the algorithm ChallengeAT-Game$_B$ with this fixed $B$. Consider a variant of algorithm CATG, which we call CATG-noCount, defined as follows. During the challenge round, inside the call to ChallengeAT, instead of feeding $\sigma_i$ to PrivateCounter we simply feed it 0 (in Step 3d of ChallengeAT). By the privacy properties of PrivateCounter (Theorem 2.10), for every $b \in \{0, 1\}$ we have that $\mathrm{CATG}(b) \approx_{(\varepsilon,0)} \mathrm{CATG\text{-}noCount}(b)$, so it suffices to show that CATG-noCount is DP (w.r.t. $b$).

Now observe that the execution of PrivateCounter during the execution of CATG-noCount can be simulated from the view of the adversary $B$ (the only bit that ChallengeAT feeds the counter which is not in the view of the adversary is the one of the challenge round, which we replaced by zero in CATG-noCount). Hence, we can generate the view of $B$ in algorithm CATG by interacting with AboveThreshold instead of with ChallengeAT. This is captured by algorithm CATG-AboveThreshold.

Algorithm 10 CATG-AboveThreshold

Setting: $B$ is an adversary that adaptively determines the inputs to ChallengeAT.

Input of the game: A bit $b \in \{0, 1\}$. (The bit $b$ is unknown to ChallengeAT and $B$.)

1. The adversary $B$ specifies two neighboring datasets $S_0, S_1 \in X^*$.
2. Instantiate PrivateCounter.
3. Instantiate AboveThreshold on the dataset $S_b$ with parameters $\varepsilon, \delta, t$, and $(r + \lambda)$.
4. For $i = 1, 2, 3, \dots$:
   (a) Get a bit $c_i \in \{0, 1\}$ from the adversary $B$ subject to the restriction that $\sum_{j=1}^{i} c_j \leq 1$.
   (b) Get two queries $f^0_i : X^* \to \mathbb{R}$ and $f^1_i : X^* \to \mathbb{R}$, each with sensitivity $\Delta$, from $B$, subject to the restriction that if $c_i = 0$ then $f^0_i \equiv f^1_i$.
   (c) Give the query $f^b_i$ to algorithm AboveThreshold and get back a bit $\sigma_i$.
   (d) If $c_i = 0$ then set $\hat{y}_i = \sigma_i$. Otherwise set $\hat{y}_i = \bot$.
   (e) Give $\hat{y}_i$ to the adversary $B$.
   (f) If $c_i = 0$ then feed $\sigma_i$ to PrivateCounter, and otherwise feed it 0.
   (g) Let $\mathrm{count}_i$ denote the current output of PrivateCounter, and HALT if $\mathrm{count}_i \geq r$.
5. Publish $B$'s view of the game, that is, $\hat{y}_1, \dots, \hat{y}_T$ and the internal randomness of $B$.

This algorithm is almost identical to CATG-noCount, except for the fact that AboveThreshold might halt the execution itself (even without the halting condition on the outcome of PrivateCounter). However, by the utility guarantees of PrivateCounter, with probability at least $1 - \delta$ it never errs by more than $\lambda$, in which case algorithm AboveThreshold never halts prematurely. Hence, for every bit $b \in \{0, 1\}$ we have that $\mathrm{CATG\text{-}AboveThreshold}(b) \approx_{(0,\delta)} \mathrm{CATG\text{-}noCount}(b)$. So it suffices to show that CATG-AboveThreshold is DP (w.r.t. its input bit $b$).
This almost follows directly from the privacy guarantees of AboveThreshold, since CATG-AboveThreshold interacts only with this algorithm, except for the fact that during the challenge round the adversary $B$ specifies two queries (and only one of them is fed into AboveThreshold). To bridge this gap, we consider one more (and final) modification to the algorithm, called $\widehat{\mathrm{CATG\text{-}AboveThreshold}}$. This algorithm is identical to CATG-AboveThreshold, except that in Step 4c we do not feed $f^b_i$ to AboveThreshold if $c_i = 1$. That is, during the challenge round we do not interact with AboveThreshold. Now, by the privacy properties of AboveThreshold we have that $\widehat{\mathrm{CATG\text{-}AboveThreshold}}$ is DP (w.r.t. its input bit $b$). Furthermore, when algorithm AboveThreshold does not halt prematurely, we have that $\widehat{\mathrm{CATG\text{-}AboveThreshold}}$ is identical to CATG-AboveThreshold. Therefore, for every bit $b \in \{0, 1\}$ we have $\mathrm{CATG\text{-}AboveThreshold}(b) \approx_{(0,\delta)} \widehat{\mathrm{CATG\text{-}AboveThreshold}}(b)$. Overall we get that
$$\mathrm{CATG}(0) \approx_{(\varepsilon,0)} \mathrm{CATG\text{-}noCount}(0) \approx_{(0,\delta)} \mathrm{CATG\text{-}AboveThreshold}(0) \approx_{(0,\delta)} \widehat{\mathrm{CATG\text{-}AboveThreshold}}(0) \approx_{(\varepsilon,\delta)} \widehat{\mathrm{CATG\text{-}AboveThreshold}}(1) \approx_{(0,\delta)} \mathrm{CATG\text{-}AboveThreshold}(1) \approx_{(0,\delta)} \mathrm{CATG\text{-}noCount}(1) \approx_{(\varepsilon,0)} \mathrm{CATG}(1).$$

## D Privacy Analysis of POP

In this section we prove Theorem 5.2.

Proof of Theorem 5.2. Let $B$ be an adversary that plays in OnlineGame against POP, posing at most 1 challenge. That is, at one time step $i^\star$, the adversary specifies two inputs $(x^0_{i^\star}, y^0_{i^\star}), (x^1_{i^\star}, y^1_{i^\star})$, algorithm POP processes $(x^b_{i^\star}, y^b_{i^\star})$, and the adversary does not see the prediction $\hat{y}_{i^\star}$ at this time step. We need to show that the view of the adversary is DP w.r.t. the bit $b$.

To show this, we observe that the view of $B$ can be generated (up to a small statistical distance of $\delta$) by interacting with ChallengeAT as in the game ChallengeAT-Game. Formally, consider the following adversary $\widehat{B}$ that simulates $B$ while interacting with ChallengeAT instead of POP. As $\widehat{B}$ only interacts with ChallengeAT, its view at the end of the execution (which includes the view of the simulated $B$) is DP w.r.t. the bit $b$. Furthermore, the view of the simulated $B$ generated in this process is almost identical to the view of $B$ had it interacted directly with POP. Specifically, the only possible difference is that the computation of $\hat{y}_i$ in Step 3(e)ii of $\widehat{B}$ might not be well-defined. But this does not happen when ChallengeAT maintains correctness, which holds with probability at least $1 - \delta$. Overall, letting ChallengeAT-Game$_{\widehat{B}}|_B$ denote the view of the simulated $B$ at the end of the interaction of $\widehat{B}$ with ChallengeAT, we have that
$$\mathrm{OnlineGame}_{\mathrm{POP},B}(0) \approx_{(0,\delta)} \text{ChallengeAT-Game}_{\widehat{B}}|_B(0) \approx_{(\varepsilon,\delta)} \text{ChallengeAT-Game}_{\widehat{B}}|_B(1) \approx_{(0,\delta)} \mathrm{OnlineGame}_{\mathrm{POP},B}(1).$$

Algorithm 11 $\widehat{B}$

Setting: This is an adversary that plays against ChallengeAT in the game ChallengeAT-Game.

1. Specify two datasets $S_0 = \{0\}$ and $S_1 = \{1\}$.
2. Instantiate algorithm $B$.
3. For $i = 1, 2, \dots, T$:
   (a) Obtain a challenge indicator $c_i$ and inputs $x^0_i, x^1_i$ from $B$ (where $x^0_i = x^1_i$ if $c_i = 0$).
   (b) Let $\ell_i \in [k]$ be chosen uniformly at random.
   (c) Define the query $q_i : \{0, 1\} \to \mathbb{R}$, where $q_i(b) = f_i$ and where $f_i$ is defined as in Step 3e of POP. % Note that, given $b$, this can be computed from $(x^0_1, x^1_1), \dots, (x^0_i, x^1_i)$ and $\ell_1, \dots, \ell_i$ and $y_1, \dots, y_{i-1}$. Furthermore, whenever $c_i = 0$ this is a query of sensitivity at most 1. When $c_i = 1$ the sensitivity might be large, in which case we view it as two separate queries, corresponding to a challenge round when playing against ChallengeAT.
   (d) Output the challenge bit $c_i$ and the query $q_i$, which is given to ChallengeAT.
   (e) If $c_i = 0$ then:
      i. Obtain an outcome $\sigma_i$ from ChallengeAT.
      ii. Define $\hat{y}_i$ as in Step 3f of POP, as a function of $\sigma_i$ and $(x^0_1, x^1_1), \dots, (x^0_i, x^1_i)$ and $\ell_1, \dots, \ell_i$ and $y_1, \dots, y_{i-1}$.
      iii. Feed the bit $\hat{y}_i$ to the adversary $B$.
   (f) Obtain a true label $y_i$ from the adversary $B$.

## E A Coin Flipping Game

Consider Algorithm 12, which specifies an $m$-round coin flipping game against an adversary $B$. In this game, the adversary adaptively chooses the biases of the coins we flip. In every flip, the adversary might gain a reward or incur a budget loss. The adversary aims to maximize the rewards it collects before its budget runs out.

Algorithm 12 CoinGame$_{B,k,m}$

Setting: $B$ is an adversary that determines the coin biases adaptively. $k$ denotes the budget of the adversary. $m$ denotes the number of iterations.

1. Set $\mathrm{budget} = k$ and $\mathrm{reward} = 0$.
2. In each round $i = 1, 2, \dots, m$:
   (a) The adversary chooses $p_i$ and $q_i$ with $0 \leq p_i \leq 5 q_i \leq 1 - p_i$, possibly based on the first $i-1$ rounds.
   (b) A random variable $X_i \in \{0, 1, 2\}$ is sampled, where $\Pr[X_i = 1] = p_i$, $\Pr[X_i = 2] = q_i$, and $\Pr[X_i = 0] = 1 - p_i - q_i$.
   (c) The adversary obtains $X_i$.
   (d) If $X_i = 1$ and $\mathrm{budget} > 0$ then $\mathrm{reward} = \mathrm{reward} + 1$.
   (e) Else if $X_i = 2$ then $\mathrm{budget} = \mathrm{budget} - 1$.
3. Output $\mathrm{reward}$.

The next theorem states that no adversary can obtain a reward much larger than $k$ in this game. Intuitively, this holds because in every time step $i$ the probability of $X_i = 2$ is not much smaller than the probability of $X_i = 1$; as the budget allows for only $k$ steps with $X_i = 2$, it is very unlikely that the number of rewards would be much larger than $k$.

Theorem E.1 ([Gupta et al., 2010, Kaplan et al., 2021]). For every adversary strategy, every $k \geq 0$, every $m \in \mathbb{N}$, and every $\lambda \in \mathbb{R}$, we have
$$\Pr[\mathrm{CoinGame}_{B,k,m} > \lambda] \leq \exp\!\left(-\frac{\lambda}{6} + 3(k + 1)\right).$$
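The following short simulation of Algorithm 12 is only a sketch for building intuition; the greedy adversary below (always playing the largest reward probability allowed by the constraint) is an illustrative assumption, not an adversary analyzed in the paper.

```python
import numpy as np

def coin_game(k, m, seed=0):
    """Simulate CoinGame_{B,k,m} against a simple greedy adversary.

    The adversary always plays the largest reward probability p_i allowed by the
    constraint p_i <= 5 * q_i; this strategy is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    budget, reward = k, 0
    for _ in range(m):
        q = 1.0 / 6.0
        p = 5.0 * q                      # most aggressive p for this q
        x = rng.choice(3, p=[1.0 - p - q, p, q])
        if x == 1 and budget > 0:
            reward += 1
        elif x == 2:
            budget -= 1
    return reward

# The reward stays O(k) (around 5k for this adversary) regardless of how large
# m is, in line with Theorem E.1.
print([coin_game(k=20, m=10_000, seed=s) for s in range(5)])
```

Roughly speaking, the proof of Theorem 5.3 uses this game with budget $kd$: rounds in which at least $1/5$ of the experts err play the role of rewards, and rounds in which the randomly updated expert errs play the role of budget losses.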
## F Extension to the Agnostic Case

In this section we extend the analysis of POP to the agnostic setting. We use the tilde-notation to hide logarithmic factors in $T$, $\frac{1}{\beta}$, and $\frac{1}{\delta}$.

Theorem F.1 ([Ben-David et al., 2009]). For any hypothesis class $H$ and scalar $M \geq 0$ there exists an online learning algorithm such that for any sequence $((x_1, y_1), \dots, (x_T, y_T))$ satisfying $\min_{h \in H} \sum_{i=1}^{T} |h(x_i) - y_i| \leq M$, the predictions $\hat{y}_1, \dots, \hat{y}_T$ given by the algorithm satisfy $\sum_{i=1}^{T} |\hat{y}_i - y_i| \leq O\!\left(M + \mathrm{Ldim}(H) \ln(T)\right)$.

Definition F.2. For parameters $u < w$, let POP$_{[u,w]}$ denote a variant of POP in which we halt the execution after the $v$th time in which we err, for some arbitrary value $u \leq v \leq w$. (Note that the execution might halt even before that, by the halting condition of POP itself.) This could be done while preserving privacy (for appropriate values of $u < w$) by using the counter of Theorem 2.10 for privately counting the number of mistakes.

Lemma F.3. Let $H$ be a hypothesis class with $d = \mathrm{Ldim}(H)$, and let $A$ denote the non-private algorithm from Theorem F.1 with $M = d \ln(T)$. Denote $k = \Theta\!\left(\frac{d^2}{\varepsilon}\right)$, $r = u = \Theta(kd \ln(T))$, and $w = 2u$. Consider executing POP$_{[u,w]}$ with $A$ and with parameters $k, r$ on an adaptively chosen sequence of inputs $(x_1, y_1), \dots, (x_{i^\star}, y_{i^\star})$, where $i^\star \leq T$ denotes the time at which POP$_{[u,w]}$ halts. Then, with probability at least $1 - \beta$ it holds that
$$\mathrm{OPT}_{i^\star} \triangleq \min_{h \in H} \sum_{i=1}^{i^\star} |h(x_i) - y_i| > d \ln(T).$$

Proof sketch. Similarly to the proof of Theorem 5.3, we set $k = \Omega\!\left(\frac{d^2}{\varepsilon}\right)$, and assume that if less than $\frac{1}{5}$ of the experts disagree with the other experts, then algorithm POP$_{[u,w]}$ returns the majority vote with probability 1. Let $1/5\text{-Err}$ denote the random variable that counts the number of time steps in which at least $1/5$th of the experts make an error.

As in the proof of Theorem 5.3, $1/5\text{-Err}$ upper bounds both the number of mistakes made by POP$_{[u,w]}$, which we denote by OurError, as well as the number of times in which algorithm ChallengeAT returns an above-threshold answer, which we denote by NumTop. By Theorem 4.2, we know that (w.h.p.) NumTop $\leq r$. Also let WorstExpert denote the largest number of mistakes made by a single expert. Consider the time $i^\star$ at which POP$_{[u,w]}$ halts. If it halts because $u \leq v \leq w$ mistakes have been made, then
$$k \cdot \mathrm{WorstExpert} \geq 1/5\text{-Err} \geq \mathrm{OurError} \geq u = \Omega(kd \ln(T)).$$
Alternatively, if POP$_{[u,w]}$ halts after $r$ above-threshold answers, then
$$k \cdot \mathrm{WorstExpert} \geq 1/5\text{-Err} \geq \mathrm{NumTop} \geq r = \Omega(kd \ln(T)).$$
In any case, when POP$_{[u,w]}$ halts it holds that at least one expert made at least $\Omega(d \ln(T))$ mistakes. Therefore, by Theorem F.1, we have that $\mathrm{OPT}_{i^\star} > d \ln(T)$.

Theorem F.4. Let $H$ be a hypothesis class with $\mathrm{Ldim}(H) = d$. There exists an $(\varepsilon, \delta)$-challenge-DP online learning algorithm providing the following guarantee. When executed on an adaptively chosen sequence of inputs $(x_1, y_1), \dots, (x_T, y_T)$, the algorithm makes at most $\tilde{O}(d \cdot \mathrm{OPT})$ mistakes (w.h.p.), where $\mathrm{OPT} \triangleq \min_{h \in H} \sum_{i=1}^{T} |h(x_i) - y_i|$.

Proof sketch. This is obtained by repeatedly re-running POP$_{[u,w]}$, with the parameter setting specified in Lemma F.3. We refer to the time span of every single execution of POP$_{[u,w]}$ as a phase. By construction, in every phase POP$_{[u,w]}$ makes at most $w = \tilde{\Theta}(kd)$ mistakes. By Lemma F.3, every hypothesis in $H$ makes at least $d \ln(T)$ mistakes in this phase. Therefore, there could be at most $O\!\left(\max\!\left\{1, \frac{\mathrm{OPT}}{d \ln(T)}\right\}\right)$ phases, during which we incur a total of at most $\tilde{O}(d \cdot \mathrm{OPT})$ mistakes.