# Littlestone Classes are Privately Online Learnable

Noah Golowich, MIT CSAIL, nzg@mit.edu
Roi Livni, Tel Aviv University, rlivni@tauex.tau.ac.il

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We consider the problem of online classification under a privacy constraint. In this setting a learner observes sequentially a stream of labelled examples $(x_t, y_t)$, for $1 \le t \le T$, and returns at each iteration $t$ a hypothesis $h_t$ which is used to predict the label of each new example $x_t$. The learner's performance is measured by her regret against a known hypothesis class $\mathcal{H}$. We require that the algorithm satisfies the following privacy constraint: the sequence $h_1, \dots, h_T$ of hypotheses output by the algorithm needs to be an $(\epsilon, \delta)$-differentially private function of the whole input sequence $(x_1, y_1), \dots, (x_T, y_T)$. We provide the first non-trivial regret bound for the realizable setting. Specifically, we show that if the class $\mathcal{H}$ has constant Littlestone dimension then, given an oblivious sequence of labelled examples, there is a private learner that makes in expectation at most $O(\log T)$ mistakes, comparable to the optimal mistake bound in the non-private case up to a logarithmic factor. Moreover, for general values of the Littlestone dimension $d$, the same mistake bound holds but with a doubly-exponential in $d$ factor. A recent line of work has demonstrated a strong connection between classes that are online learnable and those that are differentially-private learnable. Our results strengthen this connection and show that an online learning algorithm can in fact be directly privatized (in the realizable setting). We also discuss an adaptive setting and provide a sublinear regret bound of $O(\sqrt{T})$.

1 Introduction

Privacy-preserving machine learning has attracted considerable attention in recent years, motivated by the fact that individuals' data is often collected to train statistical models, and such models can leak sensitive data about those individuals [13, 32]. The notion of differential privacy has emerged as a central tool which can be used to formally reason about the privacy-accuracy tradeoffs one must make in the process of analyzing and learning from data. A considerable body of literature on differentially private machine learning has resulted, ranging from empirical works which train deep neural networks with a differentially private form of stochastic gradient descent [1], to a recent line of theoretical works which aim to characterize the optimal sample complexity of privately learning an arbitrary hypothesis class [3, 11, 20]. Nearly all of these prior works on differentially private learning, however, are limited to the statistical learning setting (also known as the offline setting): this is the setting where the labeled data $(x_t, y_t)$ are assumed to be drawn i.i.d. from some unknown population distribution. This setting, while very well-understood and readily amenable to analysis, is unlikely to hold in practice. Indeed, the data $(x_t, y_t)$ fed as input into the learning algorithm may shift over time (e.g., as a consequence of demographic changes in a population), or may be subject to more drastic changes which are adaptive to the algorithm's prior predictions (e.g., drivers' reactions to the recommendations of route-planning apps may affect traffic patterns, which influence the input data to those apps). For this reason, it is desirable to develop provable algorithms which make fewer assumptions on the data.
In this work, we do so by studying the setting of (private) online learning, in which the sequence of data $(x_t, y_t)$ is allowed to be arbitrary, and we also discuss a certain notion of privacy in a setting where it is even allowed to adapt to the algorithm's predictions in prior rounds. We additionally restrict our attention to the problem of classification, namely where the labels $y_t \in \{0, 1\}$; thus we introduce the problem of differentially private online classification, and prove the following results (see Section 3 for the exact setup):

In the realizable setting with an oblivious adversary, we introduce a private learning algorithm which, for hypothesis classes of Littlestone dimension $d$ (see Section 2.1) and time horizon $T$, achieves a mistake bound of $O(2^{O(2^d)} \log T)$, ignoring the dependence on privacy parameters (Theorem 4.1).

In the realizable setting with an adaptive adversary, we show that a slight modification of the above algorithm achieves a mistake bound of $O(2^{O(2^d)} \sqrt{T})$ (Theorem 4.2).

We remark that no algorithm (even without privacy, allowing randomization, and in the oblivious adversary setting) can achieve a mistake bound smaller than $\Omega(d)$ for classes of Littlestone dimension $d$ [30, 33]. Therefore, a class of infinite Littlestone dimension cannot have any finite mistake bound, and the regret of any algorithm, for any time horizon $T$, is $\Omega(T)$. Thus, our results listed above, which show a mistake bound (which is also the regret in the realizable setting) of $O_d(\sqrt{T})$ for classes of Littlestone dimension $d$, establish that in the realizable setting, finiteness of the Littlestone dimension is necessary and sufficient for online learnability ([31]) with differential privacy. Recently it was shown by Alon et al. [3] and Bun et al. [11] (later improved by Ghazi et al. [20]) that finiteness of the Littlestone dimension is necessary and sufficient for private learnability in the offline setting, namely with i.i.d. data (both in the realizable and agnostic settings). Since, as remarked above, the Littlestone dimension characterizes online learnability (even without privacy), this means that a binary hypothesis class is privately (offline) learnable if and only if it is online learnable. Our result thus strengthens this connection, showing that the equivalence also includes private online learnability (in the realizable setting).

1.1 Related work

A series of papers [15, 25, 21, 17, 2] has studied the problem of differentially private online convex optimization, which includes special cases such as private prediction from expert advice and, when one assumes imperfect feedback, private non-stochastic multi-armed bandits [35, 36, 18, 24]. These results show that in many regimes privacy is free for such problems: for instance, for the problem of prediction from expert advice (with $N$ experts), Agarwal and Singh [2] show that an $\epsilon$-differentially private algorithm (based on follow-the-regularized-leader) achieves a regret bound that matches the non-private regret bound of $O(\sqrt{T \log N})$ when $T \geq \Omega((N/\epsilon)^2)$. Our results can be seen as extending such "privacy is (nearly) free" results to the nonparametric setting where we instead optimize over an arbitrary class of finite Littlestone dimension. Our techniques are different from those of the above papers.
In addition to [11, 20], which establish private learning algorithms for classes with finite Littlestone dimension in the i.i.d. (offline) setting, there has been an extensive line of work on private learning algorithms in the offline setting: [29, 7, 5, 19] study the complexity of private learning with pure differential privacy, [26, 9, 10, 4] study the sample complexity of privately learning thresholds, and [27, 28, 6] study the sample complexity of privately learning halfspaces.

2 Preliminaries

In this section we introduce some background concepts used in the paper.

2.1 Online Learning

We begin by revisiting the standard setting of online learning. We consider a sequential game between a learner and an adversary. Both the learner and the adversary know the sets $\mathcal{X}$ and $\mathcal{H}$. The game proceeds for $T$ rounds ($T$ is also known to both parties), and at each round $t \le T$, the adversary chooses a pair $(x_t, y_t)$ and presents the learner with the example $x_t$. The learner then must present the adversary with a (perhaps randomly chosen) hypothesis $h_t : \mathcal{X} \to \{0, 1\}$; $h_t$ is not required to lie in $\mathcal{H}$.¹ Finally the adversary presents the learner with $y_t$, which the learner uses to update its internal state.

¹This setup is known as the improper learning problem. In the proper version of the problem, it is required that $h_t \in \mathcal{H}$, and we leave a study of proper private online learning for future work (see [22] for a discussion of proper online learning in the non-private case).

The performance of the learner is measured by its regret, which is its number of mistakes compared to the optimal decision in hindsight:

$$\sum_{t=1}^{T} \mathbb{1}[h_t(x_t) \neq y_t] \;-\; \min_{h^\star \in \mathcal{H}} \sum_{t=1}^{T} \mathbb{1}[h^\star(x_t) \neq y_t]. \qquad (1)$$

The adversary is said to be realizable if it presents the learner with a sequence of examples $(x_t, y_t)$ such that there is some $h^\star \in \mathcal{H}$ with $h^\star(x_t) = y_t$ for each $t \in [T]$. In the realizable setting, the regret simply counts the number of mistakes the learner makes, and we measure the learner's performance by its mistake bound, namely the maximum, over all possible realizable adversaries, of

$$\sum_{t=1}^{T} \mathbb{1}[h_t(x_t) \neq y_t].$$

In the setting with an agnostic adversary, we do not require such an $h^\star$ to exist, and we measure the learner by its (worst-case) regret, as in Eq. (1). In this paper we focus on the realizable setting; the (private) agnostic setting is left as an interesting direction for future work. Additionally, we normally make a distinction between two types of adversaries: an oblivious adversary chooses its sequence in advance, and at each iteration $(x_t, y_t)$ is revealed to the learner; an adaptive adversary may choose $(x_t, y_t)$ as a function of the learner's previous choices, i.e., $h_1, \dots, h_{t-1}$. This definition follows the standard setup of online learning (see [12], for example). We note, though, that in the non-private setting of online binary classification one can obtain results against an adversary that even gets to observe the learner's prediction at time-step $t$. However, we simplify here by considering the more standard setting; it is interesting to find out whether we can compete against such a strong adversary in the private setup.

Littlestone dimension. We next introduce the Littlestone dimension, a combinatorial measure that turns out to characterize learnability in the above setting. Let $\mathcal{H}$ be a class of hypotheses $h : \mathcal{X} \to \{0, 1\}$. To define the Littlestone dimension of $\mathcal{H}$, we first introduce mistake trees: a mistake tree of depth $d$ is a complete binary tree, each of whose non-leaf nodes $v$ is labeled by a point $x_v \in \mathcal{X}$, and whose two out-edges at $v$ are labeled by 0 and 1. We associate each root-to-leaf path in a mistake tree with a sequence $(x_1, y_1), \dots, (x_d, y_d)$, where for each $i \in [d]$, the $i$th node in the path is labeled $x_i$ and the path takes the out-edge from that node labeled $y_i$. A mistake tree is said to be shattered by $\mathcal{H}$ if for any root-to-leaf path whose corresponding sequence is $(x_1, y_1), \dots, (x_d, y_d)$, there is some $h \in \mathcal{H}$ so that $h(x_i) = y_i$ for all $i \in [d]$. The Littlestone dimension of $\mathcal{H}$, denoted $\mathrm{Ldim}(\mathcal{H})$, is the depth of the largest mistake tree that is shattered by $\mathcal{H}$.

The Standard Optimal Algorithm (SOA). Suppose $\mathcal{H}$ is a binary hypothesis class with Littlestone dimension $d$. Littlestone [30] showed that there is an algorithm, called the Standard Optimal Algorithm (SOA), which, against an adaptive and realizable adversary, has a mistake bound of $d$; moreover, this is the best possible mistake bound. We will access the SOA as a black box: the underlying assumption we make is that, given a realizable sequence $(x_1, y_1), \dots, (x_T, y_T)$, the SOA makes at most $\mathrm{Ldim}(\mathcal{H})$ mistakes. We will also assume that whenever the SOA makes a mistake it changes its state: namely, if the algorithm makes a mistake on example $t$ then $h_{t+1} \neq h_t$. This is in fact true for the SOA, and any algorithm with a finite mistake bound can be modified so that this holds (simply by repeating the mistaken example until the algorithm does change state). We refer the reader to [30, 33] for the specifics.
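To make these definitions concrete, the following is a small illustrative sketch (our own, not taken from the paper) that computes the Littlestone dimension of a finite class over a finite domain via the standard recursion, and implements an SOA-style predictor on top of it. All names (ldim, SOA) are ours; the recursion is exponential and only meant for tiny examples.

```python
# Illustrative sketch (not part of the paper). A hypothesis is represented as a
# tuple of labels, one entry per point of the finite domain X.

def ldim(H, X):
    """Littlestone dimension of H: depth of the largest shattered mistake tree."""
    H = list(H)
    if len(H) <= 1:
        return 0
    best = 0
    for i, _ in enumerate(X):
        H0 = [h for h in H if h[i] == 0]
        H1 = [h for h in H if h[i] == 1]
        if H0 and H1:  # X[i] can label the root of a mistake tree
            best = max(best, 1 + min(ldim(H0, X), ldim(H1, X)))
    return best

class SOA:
    """Predict the label whose restricted class keeps the larger Littlestone dimension."""
    def __init__(self, H, X):
        self.H, self.X = list(H), list(X)
    def predict(self, i):  # i = index of the queried point in X
        H0 = [h for h in self.H if h[i] == 0]
        H1 = [h for h in self.H if h[i] == 1]
        d0 = ldim(H0, self.X) if H0 else -1
        d1 = ldim(H1, self.X) if H1 else -1
        return 0 if d0 >= d1 else 1
    def update(self, i, y):  # realizable case: keep hypotheses consistent with (X[i], y)
        self.H = [h for h in self.H if h[i] == y]
```

For instance, with X = [0, 1, 2] and H the four threshold functions represented as (0,0,0), (0,0,1), (0,1,1), (1,1,1), the call ldim(H, X) returns 2, matching the known value for thresholds over three points.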
2.2 Differential Privacy

We next recall the standard notion of $(\epsilon, \delta)$-differential privacy.

Definition 2.1 (Differential privacy). Let $n$ be a positive integer, $\epsilon, \delta \in (0, 1)$, and let $\mathcal{W}$ be a set. A randomized algorithm $A : (\mathcal{X} \times \{0, 1\})^n \to \mathcal{W}$ is defined to be $(\epsilon, \delta)$-differentially private if for any two datasets $S, S' \in (\mathcal{X} \times \{0, 1\})^n$ differing in a single example, and any event $E \subseteq \mathcal{W}$, it holds that
$$\Pr[A(S) \in E] \le e^{\epsilon} \Pr[A(S') \in E] + \delta.$$

Adaptive Composition. The online nature of the problem naturally requires us to deal with adaptive mechanisms that query the database. We thus describe here the standard framework of adaptive querying, and we refer the reader to Dwork and Roth [13] for a more detailed exposition. In this framework we assume a sequential setting, where at step $t$ an adversary chooses two adjacent datasets $S^1_t$ and $S^0_t$, as well as a mechanism $M_t$ from a class $\mathcal{F}$, and receives $z^b_t = M_t(S^b_t)$ for some $b \in \{0, 1\}$ (where $b$ does not depend on $t$).

Definition 2.2. We say that the family $\mathcal{F}$ of algorithms over databases satisfies $(\epsilon, \delta)$-differential privacy under $T$-fold adaptive composition if for every adversary $A$ and every event $E$, we have
$$\Pr[(z^0_1, \dots, z^0_T) \in E] \le e^{\epsilon} \Pr[(z^1_1, \dots, z^1_T) \in E] + \delta.$$
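To ground Definition 2.1, here is the textbook Laplace mechanism applied to a counting query over a labeled stream. This is a generic illustration of the definition (not a component of the paper's algorithm); the function name and parameters are our own.

```python
# Standard Laplace mechanism for a counting query; illustrative only. A counting
# query has sensitivity 1 (changing one example changes the count by at most 1),
# so adding Lap(1/eps) noise yields eps-DP, and hence (eps, 0)-DP.
import numpy as np

def private_count(dataset, predicate, eps, rng=np.random.default_rng()):
    true_count = sum(1 for example in dataset if predicate(example))
    return true_count + rng.laplace(scale=1.0 / eps)

# Example: privately estimate how many examples (x, y) in the stream have label y == 1.
stream = [(0.3, 1), (0.7, 0), (0.5, 1)]
noisy = private_count(stream, lambda xy: xy[1] == 1, eps=0.5)
```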
3 Problem Setup

We now formally introduce the main problem considered in this paper, namely that of private online learning. Let $\mathcal{X}$ be a set, and let $\mathcal{H}$ be a set of hypotheses, namely of functions $h : \mathcal{X} \to \{0, 1\}$. We consider the setting described in Section 2.1, and within this framework we study the learnability of private learners, which are defined next. We make a distinction between the case of an oblivious and an adaptive adversary.

Private online learning vs. an oblivious adversary. As discussed, in this setting the adversary must choose the entire sequence $(x_1, y_1), \dots, (x_T, y_T)$ before its interaction with the learner (though it may use knowledge of the learner's algorithm). In particular, the samples $(x_t, y_t)$ do not depend on any random bits used by the learner. Thus, in the private online learning problem we merely require that the sequence of hypotheses $(h_1, \dots, h_T)$ output by the learner is an $(\epsilon, \delta)$-differentially private function of the entire input sequence $(x_1, y_1), \dots, (x_T, y_T)$.

Private online learning vs. an adaptive adversary. In the adaptive setting, the adversary may choose each example $(x_t, y_t)$ as a function of all of the learner's hypotheses up to time $t$. This makes the notion of privacy a little more subtle, so we need to carefully define what we mean here by $(\epsilon, \delta)$-privacy. We consider the following scenario: at each round $t$, the adversary outputs two outcomes $(x^0_t, y^0_t)$ and $(x^1_t, y^1_t)$. The learner then outputs $h^b_t$, and $(x^b_t, y^b_t)$ is revealed to the learner, where $b \in \{0, 1\}$ is independent of $t$. We require that the sequences $S^0_T = \{(x^0_t, y^0_t)\}$ and $S^1_T = \{(x^1_t, y^1_t)\}$ differ in at most a single example. We say that an adaptive online classification algorithm is $(\epsilon, \delta)$-differentially private if for any event $E$ and any adversary, it holds that
$$\Pr[(h^1_1, \dots, h^1_T) \in E] \le e^{\epsilon} \Pr[(h^0_1, \dots, h^0_T) \in E] + \delta.$$
This notion is similar to privacy under $T$-fold adaptive composition. Normally, though, for a mechanism to be $(\epsilon, \delta)$-differentially private under $T$-fold adaptive composition, Dwork et al. [16] require it to be private against an adversary that may choose at each iteration any two adjacent datasets $S^0_i, S^1_i$. Note, however, that in the online setup the utility depends only on a single point at each iteration; hence such a requirement would be too strong (in fact, the learner would then be tested on two arbitrary sequences).

4 Main Results

We next state the main results of this paper. We start with a logarithmic regret bound for realizable oblivious learning.

Theorem 4.1 (Private oblivious online learning). For a choice of $k_1 = O(2^{d+1})$ (and a corresponding choice of $k_2$), running DP-SOA (Algorithm 1) for $T$ iterations on any realizable sequence $(x_1, y_1), \dots, (x_T, y_T)$, the algorithm outputs a sequence of predictors $h_1, \dots, h_T$ such that the algorithm is $(\epsilon, \delta)$-differentially private, and the expected number of mistakes the algorithm makes is
$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[h_t(x_t) \neq y_t]\right] = O\!\left(2^{O(2^d)} \log T\right),$$
where the $O(\cdot)$ notation suppresses the dependence on the privacy parameters $\epsilon, \delta$.

Theorem 4.1 shows that, up to a logarithmic factor, the number of mistakes in the private case is comparable with the number of mistakes in the non-private case when $d$, the Littlestone dimension of the class, is constant. We do, though, incur a strong deterioration in terms of the Littlestone dimension: the non-private mistake bound depends linearly on $d$, whereas ours depends on $d$ doubly-exponentially. As discussed, Ghazi et al. [20] improved the dependence in the batch case to polynomial, and it remains an open question whether a similar improvement is possible in the online case. We next turn to the adaptive case.

Theorem 4.2 (Private adaptive online learning). There exists an adaptive online classification algorithm that is $(\epsilon, \delta)$-differentially private, with expected regret over a realizable sequence bounded as
$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[h_t(x_t) \neq y_t]\right] = O\!\left(2^{O(2^d)} \sqrt{T}\right),$$
again suppressing the dependence on $\epsilon, \delta$.

Theorem 4.2 provides a sublinear regret bound, which is in fact optimal (in terms of $T$) for the agnostic case. However, in the non-private (realizable) case it is known that constant regret can be obtained.² We leave it as an open problem whether one can achieve logarithmic regret in the realizable adaptive setting.

²And, as discussed, in the non-private case the adversary may even depend on $h_t$ at round $t$.

5 Algorithm

We next present our main algorithm for oblivious, realizable online private learning. The algorithm, DP-SOA, assumes access to a (not necessarily private) mistake-bound algorithm for the class $\mathcal{H}$, such as the SOA of [30], which we denote by $A$,³ and also calls a procedure HistSparse, which is depicted below (Algorithm 2). We can think of DP-SOA as an algorithm that runs several copies of the same procedure, where each copy works on its own subsequence of $(x_1, y_1), \dots, (x_T, y_T)$, and the subsequences form a random partition of the entire sequence.
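As a small illustration of this random partition (our own simplified sketch, not the paper's implementation), each round $t$ can be mapped to a uniformly random leaf of a forest of $k_2$ trees with $k_1$ leaves each; the tree containing that leaf is the copy that processes $(x_t, y_t)$.

```python
# Illustrative sketch: randomly partition a stream of T rounds across k2 trees,
# by assigning each round to a uniformly random leaf (tree index, leaf index).
import random

def random_leaf_assignment(T, k1, k2, seed=0):
    rng = random.Random(seed)
    return {t: (rng.randrange(k2), rng.randrange(k1)) for t in range(1, T + 1)}

pi = random_leaf_assignment(T=10, k1=4, k2=3)
subsequences = {i: [t for t, (tree, _) in pi.items() if tree == i] for i in range(3)}
```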
Each process can be described by a tree whose vertices are labelled by samples that are iteratively constructed. Each tree outputs a predictor according to the state of its vertices. Hence, overall the algorithm can be depicted as a forest, where at each iteration an example is randomly assigned to one of the trees, and that tree, in turn, makes an update.

At each time step we maintain a set of vertices $\mathcal{V}_t$, which we call pertinent vertices. Each pertinent vertex $v$ holds a sample $S_v$. At time $t = 1$ only the leaves are in $\mathcal{V}_1$, and each leaf $v$ is assigned the sample $S_v = \emptyset$. Then, at every time step where an example $(x_t, y_t)$ is assigned to the tree, it is randomly assigned to a pertinent vertex $v$ in $\mathcal{V}_t$ (in detail, it is first randomly assigned to a leaf and then propagated to a pertinent ancestor), and the sample $S_v$ is updated to $(S_v, (x_t, y_t))$. After that, as we next describe, a process starts that updates the set of pertinent vertices; this process follows the idea of the tournament examples presented in [11].

Whenever two siblings $v, s(v)$ are pertinent and assigned sequences $S_v$ and $S_{s(v)}$, respectively, they stay pertinent as long as $A(S_v) = A(S_{s(v)})$, and samples are assigned to them in their turn via the process described above. Whenever it becomes the case that $A(S_v) \neq A(S_{s(v)})$, let $\bar{v}$ denote the parent of $v, s(v)$; we consider an example $x_{\bar{v}}$ on which $A(S_v)$ and $A(S_{s(v)})$ disagree, and guess its label $y_{\bar{v}}$. Then $v, s(v)$ are removed from the set of pertinent vertices, their parent $\bar{v}$ becomes pertinent, and we set $S_{\bar{v}}$ to equal $(S_v, (x_{\bar{v}}, y_{\bar{v}}))$ if $A(S_v)[x_{\bar{v}}] \neq y_{\bar{v}}$, and $(S_{s(v)}, (x_{\bar{v}}, y_{\bar{v}}))$ otherwise. Once this procedure finishes, the tree outputs (randomly) some hypothesis $h = A(S_v)$, where $v$ is a pertinent vertex. The hypothesis changes only when the state of the tree changes (note that at initialization, the tree outputs $A(\emptyset)$).

³In particular, $A$ is required to be an algorithm that achieves a mistake bound of at most $d$ on hypothesis classes of Littlestone dimension $d$. We will use the following (easily verified) fact about such an algorithm: after making a mistake, the algorithm must change the hypothesis it outputs for the following round.

Algorithm 1: DP-SOA
Input: $(\epsilon, \delta)$, $k_1$, $k_2$.
Set $\eta = 2^{-4k_1}/(4k_1)$ and $c = 4k_1/\eta$.
Let $G = (V, E)$ be a forest of $k_2$ full binary trees, each with $k_1$ leaves.
Let $\pi : [T] \to \mathrm{Leaves}(V)$ be a random mapping that maps each $t \in [T]$ to a random leaf.
Set $S_v = \emptyset$ for each leaf $v$ and $S_u = \emptyset$ for each non-leaf vertex $u$ (where we define $A(\emptyset) = \emptyset$).
Initialize $\mathcal{V}_1$ to be the set of all leaves in the forest.
Set $v^{(i)}_1$ to be an arbitrary leaf of tree $G_i$, for each $i \in [k_2]$.
for $t = 1$ to $T$ do
  Run $\mathrm{HistSparse}_{\epsilon,\delta,\eta,c}(h_{t-1}, L_t)$ on the list $L_t = \{A(S_{v^{(i)}_t})\}_{i=1}^{k_2}$ and receive $h_t$.
  Predict $\hat{y}_t = h_t(x_t)$, and observe $y_t$.
  Choose $v_1 \in \mathcal{V}_t$ to be an ancestor of leaf $\pi(t)$.  % there exists a unique such ancestor in $\mathcal{V}_t$
  Set $v_2 = s(v_1)$ (if $v_1$ is the root, continue to the next iteration).
  Set $S_{v_1} \leftarrow (S_{v_1}, (x_t, y_t))$.
  while $A(S_{v_1}) \neq A(S_{v_2})$ AND $v_1, v_2 \in \mathcal{V}_t$ do
    Set $\bar{v}$ to be the parent of $v_1, v_2$.
    Choose an arbitrary $x_{\bar{v}}$ such that $A(S_{v_1})[x_{\bar{v}}] \neq A(S_{v_2})[x_{\bar{v}}]$, and choose $y_{\bar{v}}$ randomly.
    Set $S_{\bar{v}} \leftarrow (S_{v_i}, (x_{\bar{v}}, y_{\bar{v}}))$, where $i$ is such that $A(S_{v_i})[x_{\bar{v}}] \neq y_{\bar{v}}$.
    Remove $v_1, v_2$ from $\mathcal{V}_t$ and add $\bar{v}$ to $\mathcal{V}_t$. Let $v_1$ be $\bar{v}$.
    if $v_1$ is not the root then
      Set $v_2$ to be the sibling of $v_1$.
    else
      Set $v_1 = v_2$ (and hence exit the loop).
    end if
  end while
  if the while loop was executed at least once then
    Let $i'$ be the tree to which $\pi(t)$ belongs. Choose randomly a vertex $v$ in tree $i'$ such that $v, s(v) \in \mathcal{V}_t$ and $A(S_v) = A(S_{s(v)})$ (break ties by choosing randomly). (If no such $v$ exists, let $v$ be the root, set $S_v$ to be some sample for which $A(S_v) = \emptyset$, add the root to $\mathcal{V}_t$, and remove all other vertices that belong to tree $i'$.)
    Set $v^{(i')}_{t+1} = v$, and set $v^{(i)}_{t+1} = v^{(i)}_t$ for all $i \neq i'$.
  else
    Set $v^{(i)}_{t+1} = v^{(i)}_t$ for all $i \in [k_2]$.
  end if
  Set $\mathcal{V}_{t+1} = \mathcal{V}_t$.
end for
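To complement the pseudocode, the following is a simplified, non-private Python sketch (our own illustration) of the per-tree bookkeeping only: an incoming example is appended to the pertinent ancestor of a random leaf, and pertinent siblings whose learner outputs disagree are merged into their parent using a guessed label. The class name TournamentTree and the helper disagreement_point are hypothetical; A is the black-box mistake-bound learner; the HistSparse interaction and the root corner case are omitted.

```python
# Simplified sketch of one tree of DP-SOA (illustrative, not the paper's code).
# `A` maps a list of labeled examples to a hypothesis (a callable X -> {0,1});
# `disagreement_point(h1, h2)` is an assumed helper returning some x with
# h1(x) != h2(x), or None if the two hypotheses agree everywhere.
import random

class TournamentTree:
    def __init__(self, k1, A, disagreement_point, rng=random.Random(0)):
        self.A, self.diff, self.rng, self.k1 = A, disagreement_point, rng, k1
        # heap layout of a full binary tree (k1 a power of two): root is node 1,
        # children of node v are 2v and 2v+1, leaves are k1 .. 2*k1 - 1.
        self.samples = {v: [] for v in range(1, 2 * k1)}
        self.pertinent = set(range(k1, 2 * k1))          # initially: all leaves

    def _pertinent_ancestor(self, leaf):
        v = leaf
        while v not in self.pertinent:
            v //= 2
        return v

    def update(self, x, y):
        leaf = self.rng.randrange(self.k1, 2 * self.k1)  # random leaf of this tree
        v = self._pertinent_ancestor(leaf)
        self.samples[v].append((x, y))
        # merge siblings upward while their hypotheses disagree
        while v != 1 and (v ^ 1) in self.pertinent:
            sib = v ^ 1
            h_v, h_sib = self.A(self.samples[v]), self.A(self.samples[sib])
            x_bar = self.diff(h_v, h_sib)
            if x_bar is None:                            # hypotheses agree: stop
                break
            y_bar = self.rng.randint(0, 1)               # guess the label
            loser = v if h_v(x_bar) != y_bar else sib    # the child that is "wrong"
            parent = v // 2
            self.samples[parent] = self.samples[loser] + [(x_bar, y_bar)]
            self.pertinent -= {v, sib}
            self.pertinent.add(parent)
            v = parent

    def output(self):
        # the tree's candidate hypothesis: A applied to some pertinent vertex
        v = self.rng.choice(sorted(self.pertinent))
        return self.A(self.samples[v])
```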
5.1 Technical Overview

We next give a high-level overview of our proof techniques. We focus, until the end of this section, on the oblivious realizable case. The main procedure of the algorithm, DP-SOA, is Algorithm 1. Our proof strategy is similar to the approach of Bun et al. [11] for learning privately in the stochastic setting, which we next briefly describe. In the stochastic setup, the idea was to rely on global stability. In a nutshell, a randomized algorithm is called globally stable if it outputs a certain function with constant probability (over the random bits of the algorithm as well as the random i.i.d. sample). Once we can construct such an algorithm (with sufficiently small error), we run several copies of the algorithm on separate samples, and then we can use any mechanism, such as the one in Theorem 5.1 below, that publishes (privately) an estimated histogram of the frequency of appearance of each function. In detail, given a list $L = \{x_1, \dots, x_k\}$ we denote by $\mathrm{freq}_L$ the mapping
$$\mathrm{freq}_L(f) = \frac{1}{k} \sum_{x \in L} \mathbb{1}[x = f].$$

Theorem 5.1 ([8], essentially Proposition 2.20). For every $\epsilon, \delta$ and $\eta$, there exists an $(\epsilon, \delta)$-DP mechanism $\mathrm{hist}_{\epsilon,\delta,\eta}$ that, given a list $L = \{x_1, \dots, x_k\}$, outputs a mapping $\widehat{\mathrm{freq}}_L : \mathcal{X} \to [0, 1]$ (an estimate of $\mathrm{freq}_L$) such that, if
$$k \geq \Theta^{(2)}(\eta, \beta, \epsilon, \delta) := \frac{4/\eta + \log\left(1/(\eta^2 \beta \delta)\right)}{\eta \epsilon} = O\!\left(\frac{\log\left(1/(\eta\beta\delta)\right)}{\eta^2 \epsilon}\right),$$
then with probability at least $1 - \beta$:
(i) If $\widehat{\mathrm{freq}}_L(x) > 0$ then $\mathrm{freq}_L(x) > \eta$.
(ii) For every $x$ such that $\mathrm{freq}_L(x) > \eta$, we have that $\widehat{\mathrm{freq}}_L(x) > 0$.

Algorithm 2: HistSparse. Receives a sequence of 1-sensitive lists $L_1(D), \dots, L_T(D)$.
Initialize: parameters $\epsilon, \eta, \delta, c$. Let $\sigma = 2c/(k\epsilon)$ and $\theta = 1 - 3\eta/32$.
Let $\theta_1 = \theta + \mathrm{Lap}(\sigma)$. Let $\mathrm{counter} = 1$.
For list $L_1$, set $h_1 = \mathrm{hist}_{\epsilon/(2c),\, \delta/c,\, \eta}(L_1)$.
for $t = 1, \dots, T$ do
  Define the query $Q_t = 1 - \mathrm{freq}_{L_t}(h_{t-1})$. Let $\nu_t = \mathrm{Lap}(2\sigma)$.
  if $Q_t + \nu_t \geq \theta_{\mathrm{counter}}$ then
    Set $h_t = \mathrm{hist}_{\epsilon/(2c),\, \delta/c,\, \eta}(L_t)$.
    Set $\mathrm{counter} = \mathrm{counter} + 1$ and $\theta_{\mathrm{counter}} = \theta + \mathrm{Lap}(\sigma)$.
  else
    Set $h_t = h_{t-1}$.
  end if
  if $\mathrm{counter} \geq c$ then ABORT.
end for
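Algorithm 2 treats hist as a black box. As a rough illustration of such a stable-histogram primitive in the spirit of Theorem 5.1 (a generic textbook construction with noise and threshold choices of our own, not the exact mechanism of [8]):

```python
# Illustrative stable-histogram sketch: noisy frequencies with a threshold, released
# only for frequent items. The exact (eps, delta) accounting for the thresholding
# step is omitted; constants here are illustrative.
from collections import Counter
import numpy as np

def private_histogram(L, eps, delta, eta, rng=np.random.default_rng()):
    k = len(L)
    # Changing one element of L changes each count by at most 1 (and affects at most
    # two buckets), so Laplace noise of scale 2/eps per count covers the counting part.
    threshold = eta / 2 + (2.0 / (k * eps)) * np.log(1.0 / delta)  # illustrative choice
    released = {}
    for item, count in Counter(L).items():
        est = count / k + rng.laplace(scale=2.0 / (k * eps))
        if est >= threshold:
            released[item] = est
    return released  # mapping: item -> estimated frequency (items absent map to 0)

# Example: in DP-SOA, the list fed to hist holds one candidate hypothesis per tree.
L = ["h_a", "h_a", "h_b", "h_a", "h_c", "h_a"]
print(private_histogram(L, eps=1.0, delta=1e-6, eta=0.3))
```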
Our algorithm follows a similar strategy, but certain care needs to be taken due to the sequential (and distribution-free) nature of the data, as well as the fact that using the hist procedure $T$ times may be prohibitive (if we wish to obtain logarithmic regret). We next review these challenges.

Global Stability. Our first task is to construct an online version of a globally stable algorithm, which roughly means that different copies of the same algorithm run on disjoint subsequences of $(x_1, y_1), \dots, (x_T, y_T)$ and output a fixed hypothesis which may depend on the whole sequence but not on the disjoint subsequences. DP-SOA does so by assigning each subsequence to a tree which runs the procedure described in Section 5. We now explain how this procedure induces the desired stability. As in Section 5, recall that a vertex $v$ is pertinent if it is in the set $\mathcal{V}_t$. We will refer to the distance of a vertex to any of its leaves as that vertex's depth. Note that for each pertinent vertex $v$ at depth $k$, the algorithm makes $k$ mistakes on the sequence $S_v$: indeed, whenever a vertex $\bar{v}$ is made pertinent, we always append to $S_{\bar{v}}$ an example which forces a mistake for the sequence of a child of $\bar{v}$.

Also, notice that with probability $2^{-2k_1}$, where $k_1$ is the number of leaves in the tree, all sequences assigned to each pertinent vertex are consistent with the realized hypothesis $h^\star$ (recall that we are considering here the oblivious realizable case, hence $h^\star$ is well-defined). Indeed, this holds as long as we guessed the label $y_{\bar{v}}$ to equal $h^\star(x_{\bar{v}})$ at each round, and the number of guesses is bounded by the number of vertices, which is $2k_1 - 1 < 2k_1$. Ultimately, this allows two cases. In the first case, a vertex of depth $d$ is pertinent: in this case the vertex must identify $h^\star$ (indeed, if there were two different hypotheses consistent with a sample that already forces $d$ mistakes, then we could force a $(d+1)$-th mistake). So, if "many" trees have a $d$-depth pertinent vertex, then a fraction of $2^{-2k_1}$ of them output $h^\star$, hence we have found a frequent hypothesis. The second case is that in "many" of the trees, for some $k < d$, there are many pairs $v, s(v)$ of pertinent vertices at depth $k$ such that $A(S_v) = A(S_{s(v)})$; we will refer to such a pair $v, s(v)$ as a collision.

In the batch case, the latter case immediately implies that some hypothesis is output frequently (i.e., we get global stability), through a standard concentration inequality that relates the number of collisions between i.i.d. random variables and the frequency of the most probable hypothesis. In the online case it is a little more subtle, as the examples are not i.i.d., hence the sequences for the pertinent vertices are not i.i.d. copies of some random variable. However, suppose that there are many collisions at depth $k$, and that we now reassign the data by randomly permuting the $k$-depth subtrees (i.e., we reassign a random parent to each vertex at depth $k$, in order to form a new complete binary tree, and we do not change relations at other depths). Since the assignment of the data $(x_t, y_t)$ to the leaves is invariant under permutation, we can think of this process as randomly picking a new assignment, conditioned on the $k$-th level structure of the trees. Alternatively, we can also think of this process as randomly picking, without replacement, the different hypotheses output by the $k$-depth vertices, and counting collisions of siblings. We now want to relate the number of collisions to its expected mean and obtain a bound on the frequency of the most frequent hypothesis. We can do this using a variant of McDiarmid's inequality for permutations, or equivalently for sampling without replacement. The observation underlying this inequality was found in [23], which attributes it to Talagrand [34]. For completeness we provide the proof in the full version.

Lemma 5.2 (McDiarmid without replacement). Suppose $\bar{Z} = (Z_1, \dots, Z_n)$ are random variables sampled uniformly from some universe $\mathcal{Z} = \{z^{(1)}, \dots, z^{(N)}\}$ without replacement (in particular $n \le N$). Let $F : \mathcal{Z}^n \to [0, 1]$ be a mapping such that for $\bar{z} = (z_1, \dots, z_n)$ and $\bar{z}' = (z'_1, \dots, z'_n)$ that are of Hamming distance at most 1, $|F(\bar{z}) - F(\bar{z}')| \le c$. Then:
$$\mathbb{P}\left[\mathbb{E}[F(\bar{Z})] - F(\bar{Z}) \geq \epsilon\right] \le e^{-\frac{2\epsilon^2}{n c^2}}.$$

We use Lemma 5.2 as follows: our function $F$ counts the number of collisions between depth-$k$ vertices after a random permutation (where we think of a permutation as sampling without replacement); this function is 1-sensitive to changing a single element, as required. We thus obtain an estimate of the number of collisions for a random permutation, which we can relate to the frequency of the most frequent hypothesis.
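As a rough illustration (our own back-of-the-envelope calculation, not a statement from the paper) of why many sibling collisions force a frequent hypothesis: if the $N$ depth-$k$ hypotheses are randomly paired, then for a single pair drawn without replacement,
$$\Pr[\text{collision}] \;=\; \sum_{h} \frac{n_h}{N}\cdot\frac{n_h - 1}{N - 1} \;\le\; \max_{h}\frac{n_h - 1}{N - 1},$$
where $n_h$ is the number of depth-$k$ vertices whose sample yields the hypothesis $h$. Hence, if a constant fraction of the random sibling pairs collide (which Lemma 5.2 lets us infer from the observed number of collisions), some hypothesis must account for a constant fraction of the depth-$k$ vertices.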
The above calculation can be used to obtain a guarantee that there exists a hypothesis that appears with frequency $2^{-O(k_1)}$ (this frequency is roughly the probability that the tree remains consistent with $h^\star$). Since the number of leaves is exponential in the depth, and the depth needs to be at least $d$ (the upper bound on the level at which the algorithm stabilizes for sure), we overall obtain a doubly-exponential dependence of the frequency on the Littlestone dimension.

Mistake Bound. We next turn to bounding the number of mistakes. The crucial observation is that every time the algorithm makes a mistake, if the example $x_t$ is assigned to tree $i$, then with some positive probability (specifically, the frequency of $h_t$, lower bounded by $2^{-O(2^d)}$) tree $i$ outputs $h_t$. Moreover, with probability at least $1/k_1 > 0$, $x_t$ is assigned to the pertinent vertex that made the mistake. Once the example is assigned to this vertex, we have $A((S_v, (x_t, y_t))) \neq A(S_{s(v)})$. In particular, the two siblings are taken out of the set of pertinent vertices, and their parent becomes pertinent. In other words, every time the algorithm makes a mistake, with some constant probability (roughly $2^{-O(2^d)}$) the set of pertinent vertices shrinks by one. Since we start with a finite number of leaves as pertinent vertices, the expected number of mistakes is bounded (up to the $2^{O(2^d)}$ factor above) by the number of leaves in the forest. It remains to show that the number of leaves in the forest is logarithmic in the sequence length (but doubly exponential in the Littlestone dimension). The number of leaves is roughly $k_1$ (which is roughly $O(2^d)$) times the number of trees in the forest; the number of trees depends on the sample complexity of the private process by which we output the frequent hypothesis. We now explain why roughly $O(2^{O(2^d)} \ln T)$ trees suffice.

Online publishing of a globally stable hypothesis. The next challenge we meet is to output the frequent hypothesis. The most straightforward method is to repeat the idea from the batch setting and use the procedure hist; by Lemma 5.4, the repeated use of the hist procedure over $T$ rounds can be made private at the cost of an $O(\sqrt{T})$ factor of deterioration in the privacy parameter $\epsilon$. Our main observation, though, is that in most rounds the frequent hypothesis does not change, allowing us to exploit the sparse vector technique [14] (see also [13]). The sparse vector technique is a method to answer, adaptively, a stream of queries where, whenever the answer to a query does not exceed a certain threshold, the algorithm returns a negative result without any cost in privacy; we pay, though, in each round where the query exceeds the threshold. We will exploit this idea in the following setting: we receive a stream of 1-sensitive lists $L_1(S), \dots, L_T(S)$; namely, each list $L_t$ is derived from the data $S = \{(x_1, y_1), \dots, (x_T, y_T)\}$, and $L_t$ changes by at most one element given a change in a single $(x_t, y_t)$. We assume that at each iteration $t$ we want to output an element $h_t \in L_t$ with high frequency. Our key assumption is that the lists are related, and a very frequent element $h_t$ is also frequent at step $t + 1$. Thus in most rounds we just verify that $\mathrm{freq}_{L_t}(h_{t-1})$ is large, and only in rounds where it is too small do we use the stable histogram mechanism, paying for privacy. Indeed, in our setting, the number of appearances of the frequent hypothesis may diminish by at most one in each round, and once its frequency has diminished by a certain factor, we have already made a certain fraction of the maximum possible number of mistakes.
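The sparse-vector loop can be sketched as follows (a generic AboveThreshold-style sketch in the spirit of [14]; the threshold here is illustrative rather than the calibrated value of Algorithm 2, and stable_hist is an assumed black box that privately returns one frequent element of a list):

```python
# Generic sparse-vector-style loop (illustrative; see Algorithm 2 for the version
# used by DP-SOA). We keep a candidate hypothesis h and, each round, test the
# low-sensitivity query "1 - freq_{L_t}(h)" against a noisy threshold; only rounds
# where the noisy query crosses the threshold trigger a fresh (paid-for) histogram call.
import numpy as np

def hist_sparse_sketch(lists, stable_hist, eta, eps, c, rng=np.random.default_rng()):
    k = len(lists[0])                      # list length (number of trees)
    sigma = 2.0 * c / (k * eps)            # noise scale grows with the switch budget c
    threshold = 1.0 - eta / 2              # trigger when freq(h) drops below ~eta/2
    noisy_threshold = threshold + rng.laplace(scale=sigma)
    switches = 0
    h = stable_hist(lists[0])              # private histogram call (budgeted separately)
    outputs = []
    for L in lists:
        query = 1.0 - L.count(h) / k       # 1-sensitive in the underlying stream
        if query + rng.laplace(scale=2 * sigma) >= noisy_threshold:
            h = stable_hist(L)             # re-select a frequent hypothesis
            switches += 1
            noisy_threshold = threshold + rng.laplace(scale=sigma)
            if switches >= c:
                break                      # budget for "expensive" rounds exhausted
        outputs.append(h)
    return outputs
```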
Thus, in general we only need to verify in each round that the frequency of $h_{t-1}$ in $L_t$ is sufficiently large, which can be done via the sparse vector technique without loss of privacy. We next state the result more formally; the proof is provided in the full version.

Lemma 5.3. Consider the procedure $\mathrm{HistSparse}_{\eta, c, \epsilon}$ depicted in Algorithm 2. Given a sample $S$, suppose Algorithm 2 receives a stream of lists, where each list is a function of $S$ returning an array of elements, and each list is 1-sensitive. Then Algorithm 2 is $(\epsilon, \delta)$-differentially private. Moreover, set
$$\Theta^{(3)}(c, T, \beta) := 8c\left(\ln T + \ln(2c/\beta)\right)$$
and suppose
$$k \geq \Theta^{(4)}(c, \eta, T, \beta, \epsilon, \delta) := \max\left\{\Theta^{(3)}(c, T, \beta),\; \Theta^{(2)}(\eta, \beta, \epsilon, \delta)\right\} = O\!\left(\frac{c \ln\left(T/(\beta\delta)\right)}{\eta^2 \epsilon}\right).$$
The procedure then outputs a sequence $\{h_t\}_{t=1}^{T}$ with $h_t \in L_t$ such that, if for each list $L_t$ there exists $h$ with $\mathrm{freq}_{L_t}(h) \geq \eta$, then with probability at least $1 - 2\beta$, for all $t \le T$:
(i) Either the algorithm aborted before step $t$, or $\mathrm{freq}_{L_t}(h_t) \geq \eta/16$.
(ii) If $h_{t-1} \neq h_t$, then $\mathrm{freq}_{L_t}(h_{t-1}) \leq \eta/8$ and $\mathrm{freq}_{L_t}(h_t) \geq \eta/4$.

Adaptive adversaries. The proof for the oblivious case relies on the existence of an $h^\star$ that is consistent with the data (and independent of the random bits of the algorithm). In the adaptive case, while the sequence still has to be consistent, $h^\star$ need not be determined in advance, and the consistent hypothesis may depend on the algorithm's choices. However, to obtain a regret bound, we rely on the standard reduction which shows that a randomized learner against an oblivious adversary can attain a similar regret against an adaptive adversary ([12], Lemma 4.1). One issue, though, is that DP-SOA uses random bits that are shared across time. Hence, for the reduction to work, we need to reinitialize the algorithm at every time step. In this case, though, the assumptions we make for using the sparse vector technique no longer hold. Thus we run DP-SOA using hist (as we no longer obtain any guarantee from HistSparse), and we require that each output hypothesis be $(O(\epsilon/\sqrt{T}), O(\delta/T))$-DP. The privacy of the whole mechanism then follows from $T$-fold composition:

Lemma 5.4 (see, for example, Dwork and Roth [13]). Suppose $(\epsilon', \delta')$ satisfy
$$\delta' = \delta/(2T), \qquad \epsilon' = \frac{\epsilon}{\sqrt{2T \ln(1/\delta)}}. \qquad (5)$$
Then the class of $(\epsilon', \delta')$-differentially private mechanisms satisfies $(\epsilon, \delta)$-differential privacy under $T$-fold adaptive composition.

Unfortunately, though, the above strategy leads to a $\sqrt{T}$ factor in the regret.
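Lemma 5.4 translates directly into a small helper for choosing per-round privacy parameters; the sketch below simply evaluates the formulas in Eq. (5) as reconstructed above.

```python
# Per-round privacy parameters for T-fold adaptive composition, per Lemma 5.4 / Eq. (5).
import math

def per_round_privacy(eps, delta, T):
    delta_prime = delta / (2 * T)
    eps_prime = eps / math.sqrt(2 * T * math.log(1 / delta))
    return eps_prime, delta_prime

# e.g., to run the adaptive-adversary variant with overall (eps, delta) = (1.0, 1e-5):
eps_prime, delta_prime = per_round_privacy(1.0, 1e-5, T=10_000)
```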
Acknowledgments and Disclosure of Funding

The authors would like to thank Uri Stemmer for helpful discussions. N.G. is supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship; R.L. is supported by ISF grant no. 2188/20 and by a grant from the Tel Aviv University Center for AI and Data Science (TAD), in collaboration with Google, as part of the initiative of AI and DS for social good.

References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 308-318, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450341394.
[2] N. Agarwal and K. Singh. The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine Learning, pages 32-40, 2017.
[3] N. Alon, R. Livni, M. Malliaris, and S. Moran. Private PAC learning implies finite Littlestone dimension. In STOC, pages 852-860, 2019. ISBN 9781450367059.
[4] A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In APPROX-RANDOM, pages 363-378, 2013.
[5] A. Beimel, H. Brenner, S. P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for private learning and private data release. Machine Learning, 94:401-437, 2014.
[6] A. Beimel, S. Moran, K. Nissim, and U. Stemmer. Private center points and learning of halfspaces. In COLT, pages 269-282, 2019.
[7] A. Beimel, K. Nissim, and U. Stemmer. Characterizing the sample complexity of pure private learners. JMLR, 20(146):1-33, 2019.
[8] M. Bun, K. Nissim, and U. Stemmer. Simultaneous private learning of multiple concepts. arXiv preprint arXiv:1511.08552, 2015.
[9] M. Bun, K. Nissim, U. Stemmer, and S. P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634-649, 2015.
[10] M. Bun, C. Dwork, G. N. Rothblum, and T. Steinke. Composable and versatile privacy via truncated CDP. In STOC, pages 74-86, 2018.
[11] M. Bun, R. Livni, and S. Moran. An equivalence between private classification and online prediction. arXiv preprint arXiv:2003.00563, 2020.
[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[13] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[14] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 381-390, 2009.
[15] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 715-724, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450300506.
[16] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[17] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 11-20, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450327107.
[18] A. Ene, H. L. Nguyen, and A. Vladu. Projection-free bandit optimization with privacy guarantees. CoRR, abs/2012.12138, 2020.
[19] V. Feldman and D. Xiao. Sample complexity bounds on differentially private learning via communication complexity. In COLT, pages 1-20, 2014.
[20] B. Ghazi, N. Golowich, R. Kumar, and P. Manurangsi. Sample-efficient proper PAC learning with approximate differential privacy. arXiv preprint arXiv:2012.03893, 2020.
[21] A. Guha Thakurta and A. Smith. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
[22] S. Hanneke, R. Livni, and S. Moran. Online learning with simple predictors and a combinatorial characterization of minimax in 0/1 games. arXiv preprint arXiv:2102.01646, 2021.
[23] K. P. Costello (https://mathoverflow.net/users/405/kevin-p-costello). Concentration bounds for sums of random variables of permutations. MathOverflow, 2013. URL https://mathoverflow.net/q/120257 (version: 2013-01-29).
[24] B. Hu, Z. Huang, and N. A. Mehta. Optimal algorithms for private online learning in a stochastic environment. CoRR, abs/2102.07929, 2021.
[25] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 24.1-24.34, 2012.
[26] H. Kaplan, K. Ligett, Y. Mansour, M. Naor, and U. Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, pages 2263-2285, 2020.
[27] H. Kaplan, Y. Mansour, U. Stemmer, and E. Tsfadia. Private learning of halfspaces: Simplifying the construction and reducing the sample complexity. In NeurIPS, 2020.
[28] H. Kaplan, M. Sharir, and U. Stemmer. How to find a point in the convex hull privately. In SoCG, pages 52:1-52:15, 2020.
[29] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In FOCS, pages 531-540, 2008.
[30] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.
[31] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. JMLR, 16:155-186, 2015.
[32] A. Roth and M. Kearns. The Ethical Algorithm: The Science of Socially Aware Algorithm Design. Oxford University Press, 2019.
[33] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[34] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 81(1):73-205, 1995.
[35] A. C. Y. Tossou and C. Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI '16, pages 2087-2093. AAAI Press, 2016.
[36] A. C. Y. Tossou and C. Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI '17, pages 2653-2659. AAAI Press, 2017.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [N/A]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]