# Private Online Learning via Lazy Algorithms

Hilal Asi (Apple), hilal.asi94@gmail.com; Tomer Koren (Tel Aviv University), tkoren@tauex.tau.ac.il; Daogao Liu (University of Washington), liudaogao@gmail.com; Kunal Talwar (Apple), kunal@kunaltalwar.org

Part of this work was done while interning at Apple. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We study the problem of private online learning, focusing on online prediction from experts (OPE) and online convex optimization (OCO). We propose a new transformation that translates lazy, low-switching online learning algorithms into private algorithms. We apply our transformation to differentially private OPE and OCO using existing lazy algorithms for these problems. The resulting algorithms attain regret bounds that significantly improve over prior art in the high-privacy regime, where $\varepsilon \le 1$, obtaining $\tilde O\big(\sqrt{T \log d} + T^{1/3}\log(d)/\varepsilon^{2/3}\big)$ regret for DP-OPE and $\tilde O\big(\sqrt{T} + T^{1/3}\sqrt{d}/\varepsilon^{2/3}\big)$ regret for DP-OCO. We complement our results with a lower bound for DP-OPE, showing that these rates are optimal for a natural family of low-switching private algorithms.

1 Introduction

Online learning is a fundamental problem in machine learning, where an algorithm interacts with an oblivious adversary for $T$ rounds. First, the oblivious adversary chooses $T$ loss functions $\ell_1, \ldots, \ell_T : \mathcal{X} \to \mathbb{R}$ over a fixed decision set $\mathcal{X}$. Then, at each round $t$, the algorithm chooses a model $x_t \in \mathcal{X}$, and the adversary reveals the loss function $\ell_t$. The algorithm suffers loss $\ell_t(x_t)$, and its goal is to minimize its cumulative loss compared to the best model in hindsight, namely its regret:
$$\sum_{t=1}^{T} \ell_t(x_t) - \min_{x^\star \in \mathcal{X}} \sum_{t=1}^{T} \ell_t(x^\star).$$
In this work, we study two differentially private instances of this problem: differentially private online prediction from experts (DP-OPE), where the model $x$ is chosen from $d$ experts ($\mathcal{X} = [d]$); and differentially private online convex optimization (DP-OCO), where the model belongs to a convex set $\mathcal{X} \subseteq \mathbb{R}^d$. Both problems have been extensively studied recently [JKT12, ST13, JT14, AS17, KMS+21], and an exciting recent direction with promising results is that of designing private algorithms based on low-switching algorithms for online learning [AFKT23b, AFKT23a, AKST23a, AKST23b]. The main idea in these works is that the privacy cost of privatizing a low-switching algorithm can be significantly smaller, as these algorithms do not update their models too frequently, allowing them to allocate a larger privacy budget to each update. This direction was initiated by [AFKT23b], which used the shrinking dartboard algorithm to design new algorithms for DP-OPE; it was later revisited by [AKST23a], which designed new algorithms for DP-OCO using a regularized follow-the-perturbed-leader approach, and more recently by [AKST23b], which used a lazy and regularized version of the multiplicative weights algorithm to obtain improved rates for DP-OCO.

| | Prior work | This work |
|---|---|---|
| DP-OPE | $\sqrt{T \log d} + \min\{\sqrt{d},\, T^{1/3}\}\cdot\frac{\log d}{\varepsilon}$ [AS17, AFKT23b] | $\sqrt{T \log d} + \frac{T^{1/3}\log d}{\varepsilon^{2/3}}$ |
| DP-OCO | $\min\Big\{ d^{1/4}\sqrt{T/\varepsilon},\ \frac{\sqrt{d}}{\varepsilon} + \frac{T^{3/8}\sqrt{d}}{\varepsilon^{3/4}} \Big\}$ [KMS+21, AKST23b] | $\sqrt{T} + \frac{T^{1/3}\sqrt{d}}{\varepsilon^{2/3}}$ |

Table 1: Regret for approximate $(\varepsilon, \delta)$-DP algorithms. For readability, we omit logarithmic factors that depend on $T$ and $1/\delta$.

While all of these results build on low-switching algorithms for designing private online algorithms, each of them has a different method for achieving privacy and, to a larger extent, a different analysis.
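As a concrete reference point for the regret notion above, the following minimal sketch simulates the non-private multiplicative weights algorithm [AHK12] over $d$ experts, the baseline that Section 3.1 later makes lazy and private. The loss sequence and the learning rate tuning are illustrative choices, not part of the paper.

```python
# Non-private OPE baseline: multiplicative weights (Hedge) over d experts,
# losses in [0, 1], regret measured against the best fixed expert in hindsight.
import numpy as np

rng = np.random.default_rng(0)
T, d = 2000, 10
losses = rng.random((T, d))          # oblivious adversary: all losses fixed up front
eta = np.sqrt(np.log(d) / T)         # standard tuning for O(sqrt(T log d)) regret

weights = np.ones(d)
alg_loss = 0.0
for t in range(T):
    p = weights / weights.sum()      # distribution over experts at round t
    alg_loss += p @ losses[t]        # expected loss of playing x_t ~ p
    weights *= np.exp(-eta * losses[t])  # multiplicative weights update

best_expert = losses.sum(axis=0).min()   # best fixed expert in hindsight
print("regret:", alg_loss - best_expert) # should be on the order of sqrt(T log d)
```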
Moreover, it is not clear whether these transformations from lazy to private algorithms in prior work have fulfilled the full potential of lazy algorithms for private online learning, and whether better algorithms are possible through this approach. Indeed, the regret obtained in prior work [AFKT23b, AKST23b] is $T^{1/3}/\varepsilon$ (omitting the dependence on $d$) for DP-OPE, which implies that the normalized regret is $1/(T^{2/3}\varepsilon)$: this is different from what is exhibited in the majority of private optimization settings, where the normalized error is usually a function of $T\varepsilon$.

Figure 1: Regret bounds for (a) DP-OCO with $d = \mathrm{poly}\log(T)$, (b) DP-OCO with $d = T^{1/3}$, and (c) DP-OPE with $d = T$. We denote the privacy parameter $\varepsilon = T^{-\alpha}$ and the regret $T^{\beta}$, and plot $\beta$ as a function of $\alpha$ (ignoring logarithmic factors); each panel also plots the non-private rate for reference.

1.1 Our contributions

Our main contribution in this work is a new transformation that converts lazy online learning algorithms into private algorithms with similar regret guarantees, resulting in new state-of-the-art rates for DP-OPE and DP-OCO. We provide a summary in Table 1 and Figure 1.

L2P: a transformation from lazy to private algorithms (Section 3). Our main contribution is a new transformation, which we call L2P, that converts any lazy algorithm into a private one with only a slight cost in regret. This allows us to use a long line of work on lazy online learning [KV05, GVW10, AT18, CYLK20, SK21, AKST23a] to design new algorithms for the private setting. Our transformation builds on two new techniques: first, we design a new switching rule that depends only on the loss at the current round, so as to minimize the privacy cost of each switch and mitigate the accumulation of privacy loss. Second, we rely on a simple but key observation: by grouping losses into large batches, we can minimize the effect on the regret of lazy online learning algorithms. We introduce a new analysis of the regret of lazy online algorithms with a large batch size that improves over the existing analysis in [AFKT23b]; this allows us to reduce the total number of fake switches needed to guarantee privacy, improving the final regret.

Faster rates for DP-OPE (Section 3.1). As a first application, we apply our transformation in the DP-OPE problem to the multiplicative weights algorithm [AHK12]. This results in a new algorithm for DP-OPE that has regret $\sqrt{T\log(d)} + T^{1/3}\log(d)/\varepsilon^{2/3}$, improving over the best existing results for the high-dimensional regime, in which the regret is $\sqrt{T\log(d)} + T^{1/3}\log(d)/\varepsilon$ [AFKT23b]. ([AFKT23b] has another algorithm which slightly improves over this regret in the high-privacy regime and obtains regret $T^{2/5}/\varepsilon^{4/5}$; we include this algorithm in Figure 1.) The improvement is particularly crucial in the high-privacy regime, where $\varepsilon \le 1$: indeed, our regret shows that (for $d = \mathrm{poly}(T)$) it is sufficient to set $\varepsilon \ge T^{-1/4}$ to match the optimal non-private regret $\sqrt{T\log d}$, whereas previous results require a much larger $\varepsilon \ge T^{-1/6}$ to get privacy for free. This is also important in practice, when multiple applications of DP-OPE are necessary: using advanced composition, our result shows that we can solve $K \le \sqrt{T}$ instances of DP-OPE with $\varepsilon = 1$ and still obtain the non-private regret of order $\sqrt{T}$; in contrast, prior work only allows solving $K \le T^{1/3}$ instances while still attaining the non-private regret.

Faster rates for DP-OCO (Section 3.2).
As another application, we use our transformation for DP-OCO with the regularized multiplicative weights algorithm of [AKST23b]. We obtain a new algorithm for DP-OCO that has regret d/ε2/3, improving over the best existing results that established regret d/ε + T 3/8 d/ε3/4 [AKST23b] or d1/4 T/ ε using DP-FTRL [KMS+21]. Lower bounds for low-switching private algorithms (Section 4). To understand the limitations of low-switching private algorithms, we prove a lower bound for the natural family of private algorithms with limited switching, showing that the upper bounds obtained via our reduction are nearly tight for this family of algorithms up to logarithmic factors. This shows that new techniques, beyond limited switching, are required in order to improve upon our upper bounds. Related work. Our transformation and algorithms build on a long line of work in online learning with limited switching [KV05, GVW10, AT18, CYLK20, SK21, AKST23b]. As is evident from prior work in private online learning, the problems of lazy online learning and private online learning are tightly connected [AFKT23b, AFKT23a, AKST23a, AKST23b]. In this problem, the algorithm wishes to minimize its regret while making at most S switches: the algorithm can update the model at most S times throughout the T rounds. Recent work has resolved the lazy OPE problem: [AT18] show a lower bound of T + (T/S) log(d) on the regret, which is achieved by several algorithms such as Follow-the-perturbed-leader [KV05] and the shrinking dartboard algorithm [GVW10]. For lazy OCO, however, optimal rates are yet to be known: [AKST23b] recently show that a lazy version of the regularized multiplicative weights algorithm obtains regret d, whereas the best lower bound is T + T/S [SK21]. 2 Preliminaries 2.1 Problem setup We consider an interactive T-round game between an algorithm ALG and an oblivious adversary Adv. Before the interaction, the adversary Adv chooses T loss functions ℓ1, . . . , ℓT L = {ℓ| ℓ: X R}. Then, at each round t [T], the algorithm ALG, which observed ℓ1, , ℓt 1 chooses xt X, and then the loss function ℓt chosen by Adv is revealed. The regret of the algorithm ALG is defined below: Reg T (ALG) := t=1 ℓt(xt) min x X t=1 ℓt(x ). We study online optimization under the constraint that the algorithm is differentially private. For an algorithm ALG and a sequence S = (ℓ1, . . . , ℓT ) chosen by an oblivious adversary Adv, we let ALG(S) := (x1, . . . , x T ) denote the output of ALG over the loss sequence S. We have the following definition of privacy against oblivious adversaries.3 Definition 2.1 (Differential privacy). A randomized algorithm ALG is (ε, δ)-differentially private against oblivious adversaries ((ε, δ)-DP) if, for all neighboring sequences S = (ℓ1, . . . , ℓT ) LT 3Our regret bound may be invalid with an adaptive adversary, but our algorithms will satisfy a stronger notion of differential privacy against adaptive adversaries (see [AFKT23b]). However, to keep the notation and analysis simpler, we limit our attention to privacy against oblivious adversaries. and S = (ℓ 1, . . . , ℓ T ) LT that differ in a single element, and for all events O in the output space of ALG, we have Pr[ALG(S) O] eε Pr[ALG(S ) O] + δ. We focus on two important instances of differentially private online optimization: (i) DP Online Convex Optimization (DP-OCO). 
In this problem, the adversary picks loss functions ℓ LOCO := {ℓ| ℓ: X R is convex and L-Lipschitz} where X Rd is a convex set with diameter D = diam(X) := supx,y X x y , and the algorithm chooses xt X. The goal of the algorithm is to minimize regret while being (ε, δ)-differentially private. (ii) DP Online Prediction from Experts (DP-OPE). In this problem, the adversary picks loss functions ℓ LOP E = {ℓ| ℓ: [d] [0, 1]} where X = [d] is the set of d experts, and the algorithm chooses xt [d]. The goal of the algorithm is to minimize regret while being (ε, δ)-differentially private. 2.2 Tools from differential privacy Our analysis crucially relies on the following divergence between two distributions. Definition 2.2 (δ-Approximate Max Divergence). For two distributions µ and ν, we define Dδ (µ ν) := sup S supp(µ),µ(S) δ ln µ(S) δ We let Dδ (µ, ν) := max{Dδ (µ ν), Dδ (ν µ)}. We also use the notion of indistinguishability between two distributions. Definition 2.3. ((ε, δ)-indistinguishability) Two distributions µ, ν are (ε, δ)-indistinguishable, denoted µ (ε,δ) ν, if Dδ (µ, ν) ε. Note that if an algorithm ALG has ALG(S) (ε,δ) ALG(S ) for all neighboring datasets S, S then ALG is (ε, δ)-differentially private. We direct readers to Appendix A for additional background information and detailed preliminaries. 3 L2P: From Lazy to Private Algorithms for Online Learning This section presents our L2P transformation, which turns lazy online learning algorithms into private ones. The transformation has an input algorithm A with measure µt at round t and samples xt from the normalized measure µt, which satisfies the following condition: Assumption 3.1. The online algorithm A has at time t a measure µt that is a function of ℓ1, . . . , ℓt 1 (and density function µt) such that for some δ0 1 and 0 < η 1/10 that are data-independent, we have Dδ0 (µt+1, µt) η, µt+1(x)/µt(x) = func(ℓt, x) for all x X where func is a data-independent function. While algorithms satisfying Assumption 3.1 need not be lazy, this assumption is satisfied by most existing lazy online learning algorithms such as the shrinking dartboard (Section 3.1) and lazy regularized multiplicative weights (Section 3.2). Moreover, any algorithm that satisfies this assumption can be made lazy via our reduction. Technique Overview: Suppose the neighboring datasets differ from the s0-th loss function. The high-level intuition behind our framework is that our algorithm only loses the privacy budget when it makes a switch (draws a fresh sample) whenever t > s0. Hence, in the framework, we try to make the algorithm make as few switches as possible. This modification can lead to additional regret compared to lazy online learning algorithms, and we need to balance the privacy-regret trade-off. The family of low-switching algorithms is ideal for privatization because its built-in low-switching property can achieve a better trade-off. Our starting point is the ideas in [AFKT23b, AKST23b] to privatize low-switching algorithms, which use correlated sampling to argue that a sample from xt 1 µt 1 is likely a good sample from µt and therefore switching at round t is often not necessary. In particular, at round t, these algorithms sample a Bernoulli random variable St Ber(c µt(xt 1)/µt 1(xt 1)) for some constant c and use the same model xt = xt 1 if St = 1, and otherwise sample new model xt µt if St = 0 (which happens with small probability). This guarantees that the marginal probability of the lazy iterates remains the same as the original iterates. 
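The following minimal sketch (not the paper's Algorithm 1; the loss sequence, learning rate, and round index are illustrative) implements this correlated-sampling step in its shrinking-dartboard form for the multiplicative-weights measure, where the keep-probability is the unnormalized weight ratio $e^{-\eta\ell_{t-1}(x)}$, and checks by Monte Carlo that the marginal law of the lazy iterate still matches $\mu_t$.

```python
# Correlated sampling / shrinking-dartboard sketch: keep the previous iterate with
# probability w_t(x)/w_{t-1}(x) = exp(-eta * loss_{t-1}(x)), otherwise resample from
# mu_t. The empirical marginal of the lazy iterate should match mu_t.
import numpy as np

rng = np.random.default_rng(1)
d, T, eta, runs = 5, 30, 0.3, 100_000
losses = rng.random((T, d))

def weights_at(t):
    # unnormalized MW weights entering round t: w_t(x) = exp(-eta * sum_{i<t} loss_i(x))
    return np.exp(-eta * losses[:t].sum(axis=0))

w_prev, w_new = weights_at(T - 1), weights_at(T)
p_prev, p_new = w_prev / w_prev.sum(), w_new / w_new.sum()

counts = np.zeros(d)
for _ in range(runs):
    x = rng.choice(d, p=p_prev)                    # x_{t-1} ~ mu_{t-1}
    keep_prob = np.exp(-eta * losses[T - 1, x])    # = w_t(x) / w_{t-1}(x) <= 1
    if rng.random() >= keep_prob:                  # switch only with leftover probability
        x = rng.choice(d, p=p_new)                 # fresh sample from mu_t
    counts[x] += 1

print("empirical marginal of lazy iterate:", counts / runs)
print("target distribution mu_t:          ", p_new)   # the two should agree
```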
Finally, to preserve the privacy of the switching decisions, existing algorithms add a fake switching probability p where the algorithm switches independently of the input. To summarize, existing low-switching private algorithms work roughly as follows: At each round t: Sample St Ber(C µt(xt 1)/µt 1(xt 1)) and S t Ber(1 p) Sample new xt µt if St = 0 or S t = 0 Otherwise set xt = xt 1 This sketch is the starting point of our transformation, and we will introduce two new components to improve performance. The first component aims to avoid the accumulation of privacy cost for switching in the current approaches where each user can affect the switching probability for all subsequent rounds: this happens since µt(xt 1)/µt 1(xt 1) is usually a function of the whole history ℓ1, . . . , ℓt, and hence the existing low-switching private algorithms lose the privacy budget even it does not make real switches. To address this, we deploy a new correlated sampling strategy in L2P where the loss ℓt at time t affects the switching probability only at time t, hence paying a privacy cost for switching only in a single round. To this end, we construct a parallel sequence of models {yt}t [T ] (independent of xt) that is used for normalizing the ratio µt(xt 1)/µt 1(xt 1) to become independent of the history. In particular, at round t, we switch with probability proportional to µt(xt 1) µt 1(xt 1) µt 1(yt 1) The main observation here is that µt(xt 1) µt 1(xt 1) µt(yt 1) = µt(xt 1) µt 1(xt 1) µt 1(yt 1) µt(yt 1) and this ratio is a function of ℓt our input online learning algorithms which satisfies Assumption 3.1. This will, therefore, improve the privacy guarantee of the final algorithm. The second main observation in L2P is that having a large batch size (batching rounds together) does not significantly affect the regret of lazy online algorithms compared to non-lazy algorithms but can further reduce the times to make switches and save the privacy budget. Our main novelty is a new analysis of the effect of batching on the regret of lazy algorithms (Proposition 3.3), which states that running a lazy online algorithm with a batch size of B would have an additive error of TB2η2 to the regret where η is a measure of distance between µt and µt 1. This significantly improves over existing analysis by [AFKT23b, Theorem 2] which shows that batching can add an additive term of B/η to the regret. Having reviewed our main techniques, we proceed to present the full details of our L2P transformation in Algorithm 1, denoting νs = µ(s 1)B+1 where B is the batch size. The regret of our transformation depends on the regret of its input algorithm. For the measure {µt}T t=1, we denote its regret Reg T ({µt}T t=1) := t=1 E xt µt [ℓt(xt)] min x X The following theorem summarizes the main guarantees of Algorithm 1. Theorem 3.2. Let p (0, 1) and B N. Assuming Assumption 3.1, Tp/B 1, and for any δ1 > 0 such that ηB log(1/δ1)/p 1, our transformation L2P is (ε, δ)-DP with p + η + 3Tη2p log(1/δ1) 6Tη2p log2(1/δ1)/B, Algorithm 1: L2P 1 Input: Parameter η, measures {νt}t [T ], batch size B, fake switching parameter p ; 2 Sample x1, y1 ν1; 3 Observe ℓ1, . . . , ℓB and suffer loss PB i=1 ℓi(x1); 4 for s = 2, , T/B do 5 Sample Ss Ber min 1, νs(xs 1) e2Bηνs 1(xs 1) νs 1(ys 1) νs(ys 1) and S s Ber(1 p); 6 if Ss = 0 or S s = 0 then 7 Sample xs νs ; 9 Set xs = xs 1; 10 Sample As Ber(1 p); 11 if As = 0 then 12 Sample ys νs ; 14 Set ys = ys 1; 15 Play xs; 16 Observe ℓ(s 1)B+1, . . . 
, ℓs B and suffer loss Ps B i=(s 1)B+1 ℓi(xs); δ = 2T(2/η + log(1/δ1)/p)e Bδ0 + 2Tδ1, and has regret Reg T Reg T ({µt}T t=1) + O TB2η2 + δ0T 2 log( 1 We begin by proving the utility guarantees of our transformation. It will follow directly from the following proposition, which bounds the regret of running L2P over a lazy online learning algorithm. Proposition 3.3 (Regret of Batched Lazy Algorithm). Let ALG be an online learning algorithm that satisfies Assumption 3.1. Let ηB log(1/δ1)/p 1, and δ1, η < 1/2. Then running L2P with the input algorithm ALG has regret Reg T Reg T ({µt}T t=1) + O TB2η2 + δ0T 2 log( 1 To prove Proposition 3.3, we first show that we can instead analyze the utility of a simpler algorithm that samples from νs at each round. This is due to the following lemma, which shows that bνs νs T V is small where bνs is the marginal distribution of xs in Algorithm 1. Lemma 3.4. Let bνs be the marginal distribution of xs in Algorithm 1. When ηB log(1/δ1)/p 1, we have bνs νs T V 3(s 1)(2e + log(1/δ1)/p)Bδ0. We also require the following lemma which allows to build a coupling over multiple variables, such that the variables are as close as possible. This will be used to construct a coupling between the lazy algorithm and the L2P algorithm that runs it. Lemma 3.5 ([AS19]). Given a collection S of random variables, all absolutely continuous w.r.t. a common σ-finite measure. Then, there exists a coupling Γ, such that for any variables X, Y S, we have Pr[X = Y ] 2 X Y T V 1+ X Y T V . We are now ready to prove Proposition 3.3 Proof. Let Reg T denote the regret when the marginal distribution of xt is νt instead of bνt induced in the Algorithm. Since each loss function is bounded, Reg T Reg T + B X s [T/B] νs bνs T V . By Lemma 3.4, we have Reg T Reg T + B X s [T/B] 3(s 1)(2/η + log(1/δ1)/p)e Bδ0 Reg T + 8T 2δ0 log(1/δ1)/η. Thus, it now suffices to upper bound Reg T . Due to the preconditions that Dδ0 (µi+1, µi) η and δ0 η, we know µi+1 µi T V 2η. Recall that we assume xs νs. Suppose zi is the action taken by the input lazy algorithm A for i [T] and the marginal distribution of zi is µi. By Lemma 3.5, we can construct a coupling Γs between xs and z := (z(s 1)B+1, , zs B), such that Pr (xs,z) Γs[ i [(s 1)B + 1, s B], zi = xs] Bη. Letting Is = 1( i [(s 1)B + 1, s B], zi = xs), we have i=(s 1)B+1 ℓi(xs) = E (xs,z) Γs i=(s 1)B+1 ℓi(xs) = E xs,z Γs(1 Is) i=(s 1)B+1 ℓi(zi) + E xs,z Γs Is i=(s 1)B+1 ℓi(xs) E xs,z Γs(1 Is) i=(s 1)B+1 ℓi(zi) + E xs,z Γs Is i=(s 1)B+1 (ℓi(zi) + O(Bη)) i=(s 1)B+1 ℓi(zi) + O(Bη B2η). Hence we get Reg T Reg T ({µt}T t=1) + T B O(B3η2), which completes the proof. Now we turn to prove the privacy of L2P. We begin with the following lemma, which provides the privacy guarantees of sampling a new model xt from the distribution µt. We defer the proof to Appendix B. Lemma 3.6. Let {µt}T t=0 satisfy Assumption 3.1 where η 1/10. Then for any neighboring sequences S and S with corresponding {µt}T t=0 and {µ t}T t=0 that differ one loss function, we have D4δ0 (µt, µ t) 2η. We use correlated sampling in the algorithm rather than sampling from xt directly. To this end, we need the following lemma, which provides upper and lower bounds on the ratio used for correlated sampling. Lemma 3.7. For any s [T/B], if ηB log(1/δ1)/p 1, then with probability at least 1 (2/η + log(1/δ1)/p) e Bδ0 δ1, νs(xs) νs(ys) νs+1(ys) [e 2Bη, e2Bη]. The privacy proof will build on the previous two lemmas to control the privacy cost of updating the model and the cost of the switching time. 
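To make the first technique concrete, here is a short calculation, a sketch under Assumption 3.1 using the batched notation $\nu_s = \mu_{(s-1)B+1}$, showing why the switching ratio in line 5 of Algorithm 1 is a function of a single batch of losses only, and why it lies in the range of Lemma 3.7 for the multiplicative-weights measure of Section 3.1.

```latex
% Whether one uses the measures \nu_s or the normalized densities
% \bar{\nu}_s = \nu_s / Z_s, the switching ratio is the same, since the
% normalization constants Z_{s-1}, Z_s cancel between the x- and y-factors:
\[
\frac{\bar{\nu}_s(x_{s-1})}{\bar{\nu}_{s-1}(x_{s-1})}\cdot
\frac{\bar{\nu}_{s-1}(y_{s-1})}{\bar{\nu}_{s}(y_{s-1})}
=\frac{\nu_s(x_{s-1})}{\nu_{s-1}(x_{s-1})}\cdot
\frac{\nu_{s-1}(y_{s-1})}{\nu_{s}(y_{s-1})}
=\prod_{i=(s-2)B+1}^{(s-1)B}\frac{\mathrm{func}(\ell_i, x_{s-1})}{\mathrm{func}(\ell_i, y_{s-1})},
\]
% using \nu_s = \mu_{(s-1)B+1} and \mu_{t+1}(x)/\mu_t(x) = func(\ell_t, x) from
% Assumption 3.1, so the ratio depends only on the B losses of the most recent
% batch. For the multiplicative-weights measure of Section 3.1,
% func(\ell, x) = e^{-\eta\ell(x)} with \ell(x) \in [0,1], so each factor lies in
% [e^{-\eta}, e^{\eta}] and the product lies in [e^{-B\eta}, e^{B\eta}]
% \subseteq [e^{-2B\eta}, e^{2B\eta}], consistent with the truncation in
% Algorithm 1 and the range in Lemma 3.7.
```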
We defer the proof to Appendix B. One remaining issue is we need to conditional on the high probability events in Lemma 3.7 for the privacy guarantee and can not directly apply Advanced Composition (Lemma A.3). Now, we modify the Advanced Composition for our usage. In the classic k-fold adaptive composition experiment, the adversary, after getting the first i 1 answers Y1, , Yi 1 (denoted by Y[i 1] for simplicity), can output two datasets D0 i and D1 i , a query qi, and receives the answer Yi Mi(Db i, qi) for the secret bit b {0, 1}. If each Mi is (εi, δi)-DP, then the joint distributions over the answers Y[k] satisfy the advanced composition theorem. In our case, however, we know there exists a subset Gi 1(Db [i 1]), such that with probability at least 1 λi, Y[i 1] Gi 1(Db [i 1]). Conditional on Y[i 1] b {0,1}Gi 1(Db [i 1]), Mi(D0 i , qi | Y[i 1] b {0,1}Gi 1(Db [i 1])) (εi,δi) Mi(D1 i , qi | Y[i 1] b {0,1}Gi 1(Db [i 1])) (1) Then we have the following lemma: Lemma 3.8. Given the k mechanisms satisfying the Condition (1), then the class of mechanisms satisfy ( ε δ, 1 (1 δ)Πt [k](1 δt)) + 2 P t [k] λt-DP under k-fold adaptive composition, with ε δ defined in Equation (4). Proof. Without losing generality, suppose we know the adversary and how they generate the databases and queries. We can construct a series of mechanisms M i, such that M i draws Yi from Mi(Db i, qi), and outputs Yi if Yi b {0,1}Gi 1(Db [i 1]), and outputs 0 otherwise. Let (Y 1,b, , Y k,b) be the outputs of M i with secret bit b, and we know the TV distance between (Y 1,b, , Y k,b) and (Y1,b, , Yk,b) is at most P t [k] λt for any b {0, 1}. Moreover, we know (Y 1,0, , Y k,0) ε δ,1 (1 δ)Πt [k](1 δt)) (Y 1,1, , Y k,1) by the advanced composition. The basic composition finishes the proof. 3.1 Application to DP-OPE This section discusses the first application of our transformation to differentially private online prediction from experts (DP-OPE). Towards this end, we apply our transformation over the multiplicative weights algorithms [AHK12], which can be made lazy as done in the shrinking dartboard algorithm [GVW10]. It has the following measure at round t µmw t (x) = e η Pt 1 i=1 ℓi(x). (2) The following proposition shows that this measure satisfies the desired properties required by our transformation. We let µmw t denote the density corresponding to µmw t . Lemma 3.9. Assume ℓ1, . . . , ℓT where ℓt : [d] [0, 1]. Then we have that 1. Dδ0 (µmw t+1, µmw t ) η with δ0 = 0. µmw t+1(x) µmw t (x) = e ηℓt(x) for all x [d]. Proof. The first item follows from the guarantees of the exponential mechanism as ℓt(x) [0, 1] for all x [d]. The second item follows immediately from the definition of µmw. Having proved our desired properties, our transformation now gives the following theorem. Theorem 3.10 (DP-OPE). Let ℓ1, . . . , ℓT where ℓt : [d] [0, 1]. Setting B = 1/ε and η = min(ε0, ε)2/3/T 1/3 where ε0 = T 1/4 log3/4 d, the L2P transformation (Algorithm 1) applied with the measure {µmw t }T t=1 is (ε, δ)-DP and has regret Reg T = O p T log d + T 1/3 log d Proof. First, based on theorem 3.2, note that the setting of B = 1/ε and η min(ε0, ε)2/3/T 1/3 where ε0 = T 1/4 log3/4 d guarantee the algorithm is (ε, δ)-DP. 
To upper bound the regret, we use existing guarantees of the multiplicative weights algorithm [AHK12], combined with Theorem 3.2 to get that the regret is Reg T O ηT + log(d) O ηT + log(d) O (Tε0)2/3 + T 1/3 log(d) ε2/3 + T 1/3 T log d + T 1/3 log(d) where the second inequality follows by setting B = 1/ε, and the third inequality follows by setting η min(ε0, ε)2/3/T 1/3, and the last inequality follows since ε0 = T 1/4 log3/4 d. 3.2 Application to DP-OCO In this section, we use our transformation for differentially private online convex optimization (DPOCO) using the regularized multiplicative weights algorithm [AKST23b], which has the following measure µrmw t (x) = e β( Pt 1 i 1 ℓi(x)+λ x 2 2). (3) Letting µrmw denote the corresponding density function, we have the following properties. Lemma 3.11. Assume ℓ1, . . . , ℓT : X R be convex and L-Lipschitz functions. Then we have that 1. Dδ0 (µrmw t+1, µrmw t ) η where η = 2βL2 8βL2 log(2/δ0) µrmw t+1(x) µrmw t (x) = e βℓt(x) for all x X. Proof. The first item follows from Lemma 3.5 in [GLL22, AKST23b]. The second item follows immediately from the definition of µrmw t . Combining these properties with our transformation, we get the following result. Theorem 3.12 (DP-OCO). Let ℓ1, . . . , ℓT : X R be convex and L-Lipschitz functions. Setting B = 1 2ε log(1/δ), λ = L η }, β = η2λ/20L2, η = ε2/3 T 1/3 log(T/δ) and p = η/ε, the L2P transformation (Algorithm 1) applied with the measure {µrmw t }T t=1 is (ε, δ)-DP and has regret Reg T = LD O T + T 1/3 d log T log(T/δ) Proof. First, based on Theorem 3.2, note that there are three constraints to make the algorithm private: η/p ε/2, η p Tp log(1/δ)/B ε/2, ηB log(1/δ)/p 1. Setting of B = 1 2ε log(1/δ), λ = L D max{ η }, β = η2λ/20L2, η = ε2/3 T 1/3 log(T/δ) and p = η/ε guarantees the algorithm is (ε, δ)-DP. For utility, we use theorem 3.2 with the existing regret bounds for the regularized multiplicative weights algorithm (Theorem 4.1 in [AKST23b]) to get that the algorithm has regret Reg T O λD2 + L2T λ + d log(T) β + LDTB2η2 T + λD2 + L2d log T λη2 + LDTB2η2 T + T 1/3 d log T log(T/δ) 4 Lower bound for low-switching private algorithms In this section, we prove a lower bound for DP-OPE for a natural family of private low-switching algorithms that contains most of the existing low-switching private algorithms such as our algorithms and the ones in [AFKT23b, AKST23b]. Our lower bound matches our upper bounds for DP-OPE and suggests that new techniques beyond limited switching are required in order to obtain faster rates. For our lower bounds, we will assume that the algorithm satisfies the following condition: Condition 4.1. (Limited switching algorithms) The online algorithm ALG works as follows: at each round t, ALG is allowed to either set xt+1 = xt or sample xt+1 µt+1 where µt+1 is a function of ℓ1, . . . , ℓt and is supported over X. The algorithm releases the resampling rounds {t1, . . . , t S} and models {xt1, . . . , xt S}. Our lower bound will hold for algorithms that satisfy concentrated differential privacy. We use this notion as it allows to get tight characterization of the composition of private algorithms and in most settings have similar rates to approximate differential privacy. We can also prove a tight lower bound for pure differential privacy using the same techniques. We have the following lower bound for concentrated DP. We defer the proof to Appendix C. Theorem 4.2. Let T 1 and ε 100 log3/2(d T)/T. 
If an algorithm ALG satisfies Condition 4.1 and is ε2-CDP, then there exists an oblivious adversary that chooses ℓ1, . . . , ℓT : [d] [0, 1] such that the regret is lower bounded by Finally, we note that this lower bound only holds for switching-based algorithms: indeed, the binary-tree-based algorithm of [AS17] obtains regret d log(d)/ε which is better in the lowdimensional regime. This motivates the search for new strategies beyond limited switching for the high-dimensional regime. 5 Conclusion In this paper, we proposed a new transformation that allows the conversion of lazy online learning algorithms into private algorithms and demonstrates two applications (DP-OPE and DP-OCO) where this transformation offers significant improvements over prior work. Moreover, for DP-OPE, we show a lower bound for natural low-switching-based private algorithms, which shows that new techniques are required for low-switching algorithms to improve our transformation s regret. This begs the question of whether the same lower bound holds for all algorithms or whether a different strategy that breaks the low-switching lower bound exists. As for DP-OCO, it is interesting to see whether better upper or lower bounds can be obtained. The current normalized regret, omitting logarithmic terms, is proportional to d/(εT)2/3. This is different than most applications in private optimization where the normalized error is usually a function of d/(εT). Hence, it is natural to conjecture that the normalized regret can be improved to d1/3/(εT)2/3. [AFKT23a] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Near-optimal algorithms for private online optimization in the realizable regime. Proceedings of the 40th International Conference on Machine Learning, 2023. [AFKT23b] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private online prediction from experts: Separations and faster rates. Proceedings of the Thirty Sixth Annual Conference on Computational Learning Theory, 2023. [AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta algorithm and applications. Theory of Computing, 8(1):121 164, 2012. [AKST23a] Naman Agarwal, Satyen Kale, Karan Singh, and Abhradeep Thakurta. Differentially private and lazy online convex optimization. In The Thirty Sixth Annual Conference on Learning Theory, pages 4599 4632. PMLR, 2023. [AKST23b] Naman Agarwal, Satyen Kale, Karan Singh, and Abhradeep Thakurta. Improved differentially private and lazy online convex optimization. ar Xiv:2312.11534 [cs.CR], 2023. [AS17] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine Learning, pages 32 40, 2017. [AS19] Omer Angel and Yinon Spinka. Pairwise optimal coupling of multiple random variables. ar Xiv preprint ar Xiv:1903.00632, 2019. [AT18] Jason Altschuler and Kunal Talwar. Online learning over a finite action set with limited switching. In Proceedings of the Thirty First Annual Conference on Computational Learning Theory, 2018. [BS16] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography - 14th International Conference, TCC 2016-B, Proceedings, Part I, volume 9985 of Lecture Notes in Computer Science, pages 635 658, 2016. [CYLK20] Lin Chen, Qian Yu, Hannah Lawrence, and Amin Karbasi. Minimax regret of switchingconstrained online convex optimization: No phase transition. 
Advances in Neural Information Processing Systems, 33:3477 3486, 2020. [GLL22] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. Private convex optimization via exponential mechanism. In Conference on Learning Theory, pages 1948 1989. PMLR, 2022. [GVW10] Sascha Geulen, Berthold Vöcking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010. [JKT12] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Proceedings of the Twenty Fifth Annual Conference on Computational Learning Theory, 2012. [JT14] Prateek Jain and Abhradeep Thakurta. (Near) dimension independent risk bounds for differentially private learning. In Proceedings of the 31st International Conference on Machine Learning, pages 476 484, 2014. [KMS+21] Peter Kairouz, Brendan Mc Mahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. ar Xiv:2103.00039 [cs.CR], 2021. [KOV15] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376 1385. PMLR, 2015. [KV05] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 307, 2005. [SK21] Uri Sherman and Tomer Koren. Lazy oco: Online convex optimization on a switching budget. In Proceedings of the Thirty Fourth Annual Conference on Computational Learning Theory, 2021. [ST13] Adam Smith and Abhradeep Thakurta. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems 26, 2013. A Missing details for preliminaries In this section, we provide additional preliminaries and provide missing details for some of the results in the preliminaries section. For our lower bounds, we require the notion of concentrated differential privacy. To this end, we first define the α-Renyi divergence (α > 1) between two probability measures: Dα(µ ν) := 1 α 1 log Z µ(ω) Concentrated DP is defined below: Definition A.1 (concentrated DP). Let ρ 0. We say an algorithm ALG satisfies ρ-concentrated differential privacy (ρ-CDP) against oblivious adversaries if for any neighboring sequences S = (ℓ1, . . . , ℓT ) LT and S = (ℓ 1, . . . , ℓ T ) LT that differ in a single element, and any α 1, Dα(ALG(D) ALG(D )) αρ. Now we list a few standard results from the privacy literature that we use in the paper, namely group privacy and privacy composition. Lemma A.2. (Group Privacy) Let ALG be an (ε, δ)-DP algorithm and let S, S LT be two datasets that differ in k elements. Then for any measurable set S in the output space of ALG Pr[ALG(S) O] ekε Pr[ALG(S ) O] + ke(k 1)εδ. Lemma A.3 (Advanced Composition,[KOV15]). For any εt > 0, δt (0, 1) for t [k], and δ (0, 1), the class of (εt, δt)-DP mechanisms satisfy ( ε δ, 1 (1 δ)Πt [k](1 δt))-DP under k-fold adaptive composition, for t [k] εt + min v u u u t X t [k] 2ε2 t log(e + t [k] ε2 t δ ), s X t [k] 2ε2 t log(1/ δ) Moreover, we use the fact that distributions with bounded Dδ satisfy the following property. Lemma A.4. Let ε 1/10. If Dδ (µ, ν) ε/2 then we have ν(X) eε 1 6δ/ε and Pr X ν ν(X) eε 1 6δ/ε. Finally, we have the following standard conversion from ρ-concentrated DP to (ε, δ)-DP. Lemma A.5 ([BS16]). 
If ALG is ρ-CDP with ρ 1, then it is (3 p ρ log(1/δ), δ)-DP for all δ (0, 1/4). A.1 Proof of Lemma A.4 Let S = {X : µ(X) ν(X) [e ε, eε]}, and assume towards a contradiction that µ(S ) < 1 6δ/ε. Then there is a set S = S , such that µ[S] > 6δ/ε. We divide S by letting S1 = {X S : µ(X)/ν(X) > eε} and S2 = S \ S1. Case (1): µ(S1) µ(S)/2 3δ/ε. Then we know ν(S1) > ln eε µ(S1) δ µ(S1) ln eε 3/ε 1 3/ε ln eε/2 = ε/2, where we use that µ(S1) eεν(S1) and 1 ε/3 e ε/2 for ε < 1/10. This is a contradiction. Case (2): Suppose µ(S2) 3δ/ε. We know S2 = {X S : µ(X)/ν(X) < e ε} = {X S : ν(X)/µ(X) > eε}. Then we know ν(S2) eεµ(S2) 3eεδ/ε. Similarly, we have µ(S2) > ε/2. This is a contradiction as well and proves the statement. B Missing Proofs for Section 3 B.1 Proof of Theorem 3.2 The regret bound follows directly from Proposition 3.3. It suffices to prove the privacy guarantee. Fix two arbitrary neighboring datasets S and S , and suppose the sequences differ at t0-step, that is {ℓ(t0 1)B+1, , ℓt0B} differ one loss function from {ℓ (t0 1)B+1, , ℓ t0B}. Define ζt as the indicator that at least one of At+1, St+1 and S t+1 is zero. Let {(xt, yt, ζt)}t [T ] and {(x t, y t, ζ t)}t [T ] be the random variables with neighboring datasets. Let Σt = {(xτ, yτ, ζτ)}τ [t] be the random variables for the first t-iterations. We will argue that {(xt, yt, ζt)}t [T ] and {(x t, y t, ζ t)}t [T ] are indistinguishable, and privacy will follow immediately. Let Et be the event such that νt+1(xt) νt(xt) νt(yt) νt+1(yt) [e 2Bη, e2Bη]. Hence we know Pr[Et] 1 (2 + log(1/δ1)/p)e Bδ0 δ1 by Lemma 3.7 for any t. Define E t in a similar way. Moreover, let EG be the event that PT/B t=2 1(At = 0 or S t = 0) 2Tp log(1/δ1)/B. By Chernoff bound, we know Pr(EG) = Pr[ t=2 1(At = 0 or S t = 0) 2Tp log(1/δ1)/B] 1 δ1. We also let Et be the event such that νt+1(xt) νt(xt) νt(yt) νt+1(yt) [e 2Bη, e2Bη]. Lemma 3.7 implies that Pr[Et] 1 (2/η + log(1/δ1)/p)e Bδ0 δ1 for any t. Define E t in a similar way. Then it suffices to show {(xt, yt, ζt)}t [T ] and {(x t, y t, ζ t)}t [T ] are (ε, δx)-indistinguishable conditional on E := EG E G t [T/B] (Et E t). Then this will imply that {(xt, yt, ζt)}t [T ] and {(x t, y t, ζ t)}t [T ] are (ε, δx + (2T + 2)δ1 + 2T(2/η + log(1/δ1)/p)e Bδ0)-indistinguishable. Now we show, conditional on E, any value Σ such that Σt 1 = Σ t 1 = Σ, (xt, yt, ζt) and (x t, y t, ζ t) are (εt, δ0)-indistinguishable where 0, t < t0 2η/p t = t0 ζt η t > t0 (5) Case 1 (t < t0): It is clear that the claim is correct for t t0 as (xt, yt, ζt) and (x t, y t, ζ t) have the same distribution then. Case 2 (t = t0): Now consider the case where t = t0. Note that (xt0, yt0) and (x t0, y t0) have identical distributions, and it suffices to consider the indistinguishability of ζt0 and ζ t0. 
Pr[ζt0 = 0 | Σt0 1, xt0, yt0] Pr[ζ t0 = 0 | Σ t0 1, x t0, y t0] = (1 p)2 νt0+1(xt0) e2Bηνt0(xt0) νt0(yt0) νt0+1(yt0) (1 p)2 ν t0+1(x t0) e2Bην t0(x t0) ν t0(y t0) ν t0+1(y t0) = νt0+1(xt0) ν t0+1(xt0) ν t0+1(yt0) νt0+1(yt0) Similarly, we have Pr[ζt0 = 1 | Σt0 1, xt0, yt0] Pr[ζ t0 = 1 | Σ t0 1, x t0, y t0] = 1 (1 p)2 + (1 p)2(1 νt0+1(xt0) e2Bηνt0(xt0) νt0(yt0) νt0+1(yt0)) 1 (1 p)2 + (1 p)2(1 ν t0+1(x t0) e2Bην t0(x t0) ν t0(y t0) ν t0+1(y t0)) = 1 + (1 p)2( ν t0+1(x t0) e2Bην t0(x t0) ν t0(y t0) ν t0+1(y t0) νt0+1(xt0) e2Bηνt0(xt0) νt0(yt0) νt0+1(yt0)) 1 (1 p)2 + (1 p)2(1 ν t0+1(x t0) e2Bην t0(x t0) ν t0(y t0) ν t0+1(y t0)) Case 3 (t > t0): As for the case when t > t0, when ζt = 0 (At+1 = St+1 = S t+1 = 1), the variables are 0-indistinguishable since xt = xt 1 and yt = yt 1 in this case. Consider the remaining possibility. Given the assumption that µt+1/µt is a function of ℓt, for any possible Σ, we have Pr[ζt = 1 | Σt 1 = Σ] = Pr[ζ t = 1 | Σ t 1 = Σ]. For any set S, by the assumption on µt, we have Pr[ζt = 1, (xt, yt) S | Σt 1 = Σ] = Pr[(xt, yt) S | Σt 1 = Σ, ζt = 1] Pr[ζt = 1 | Σt 1 = Σ] = Pr[(xt, yt) S | Σt 1 = Σ, ζt = 1] Pr[ζ t = 1 | Σ t 1 = Σ] e2η Pr[(x t, y t) S | Σ t 1 = Σ, ζ t = 1] Pr[ζ t = 1 | Σ t 1 = Σ] + 4δ0 =e2η Pr[ζ t = 1, (x t, y t) S | Σ t 1 = Σ] + 4δ0, where the inequality comes from Lemma 3.6 by the divergence bound between µt and µ t. This completes the proof of Equation (5). The final privacy guarantee follows from combining Equation (5) and Advanced composition (Lemma A.3). B.2 Proof of Lemma 3.4 We prove this statement by induction. For t = 1, the statement is obviously correct. We assume bνt νt T V 3(t 1)(2(e/η + log(1/δ1)/p)Bδ0 + δ1) prove that bνt+1 νt+1 T V 3t(2e + log(1/δ0)/p)Bδ0. Let Xgood := {x : log νt+1(x) νt(x) [ Bη, Bη]} and Ygood := {y : log νt(y) νt+1(y) [ Bη, Bη]}. Let bφt(y) be the distribution of yt. Note that the distribution of yt is independent of {xτ}τ [T/B], while the distribution of xt+1 is independent of yt+1 but depends on yt. By the assumption and group privacy, we know DBe Bηδ0 (νt+1, νt) Bη, and hence we have νt(Y good) e Bηδ0/η 2eδ0/η. Let t0 t be largest integer such that At0 = 1, that is, yt is sampled from νt0 for some random t0 t. We have νt0(Y good) e Bη(t t0) νt(Ygood) + (t t0)Bδ0e Bη(t t0). With probability at least 1 δ1, we know |t t0| log(1/δ1)/p. Hence we get Pr y bφt[y Ygood] 1 2(e/η + log(1/δ1)/p)Bδ0 δ1. Pr x νt,y bφt[x Xgood, y Ygood] = Pr y bφt[y Ygood] Pr x νt[x Xgood | y Ygood] 1 2(e/η + log(1/δ1)/p)Bδ0 δ1. Denote the good set Sgood = (x, y) : x Xgood, Ygood . Let φt be the distribution of yt conditional on yt Ygood. Let bΓt be the marginal distribution over (xt, yt), that is xt bνt and yt bφt. Let Γt be the distribution over (xt, yt) where xt νt, yt φt, and Γt be the distribution of Γt conditional on (xt, yt) Sgood. We know bΓt Γt T V (2e/η + log(1/δ1)/p)Bδ0(3t 2). Let qt+1 be the distribution of xt+1 if (xt, yt) is sampled from Γt instead of bΓt. By the property that post-processing does not increase the TV distance, we know qt+1 bνt+1 T V bΓt Γt T V . Now it suffices to bound the TV distance between qt+1 and νt+1. For any set E, we have qt+1(E) = Z (Pr[S t = 0, xt+1 E | xt = x, yt = y] + Pr[S t = 1, St = 0, xt+1 E | xt = x, yt = y] + Pr[S t = 1, St = 1, xt+1 E | xt = x, yt = y])Γt(x, y)d(x, y) = pνt+1(E) + (1 p)νt+1(E) Z (1 νt+1(x) e2Bηνt(x) νt(y) νt+1(y))Γt(x, y)d(x, y) νt+1(x) e2Bηνt(x) νt(y) νt+1(y)Γt(x, y)d(x, y). 
Thus we have |qt+1(E) νt+1(E)| Z νt+1(x) e2Bηνt(x) νt(y) νt+1(y)Γt(x, y)d(x, y) νt+1(E) Z νt+1(x) e2Bηνt(x) νt(y) νt+1(y)Γt(x, y)d(x, y) . Note that for any (x, y) Sgood, we have Γt(x, y) = νt(x) φt(y) Γt(Sgood) . Fixing any y, we know the above term is bounded by | νt(y) e2Bηνt+1(y)Γt(Sgood)(νt+1(E Xgood) νt+1(E)νt+1(Xgood))| 2(e/η + B log(1/δ1)/p)δ0, where the last inequality follows from νt+1(Xgood) 1 Bδ0. Hence, we prove that qt+1 νt+1 T V 2(e/η + log(1/δ1)/p)Bδ0. This suggests that bνt+1 νt+1 T V bνt+1 qt+1 T V + qt+1 νt+1 T V 6t(e/η + log(1/δ1)/p)Bδ0. B.3 Proof of Lemma 3.6 Let S = (ℓ1, . . . , ℓT ) and S = (ℓ 1, . . . , ℓ T ) differ in a single round t0. We fix t and prove the claim is correct. If t t0, then the claim clearly holds as µt = µ t. For t = t0 + 1, note that Assumption 3.1 implies that Dδ0 (µt0+1, µt0) η and Dδ0 (µ t0+1, µt0) η, hence by group privacy we get that D(eη+1)δ0 (µt, µ t) 2η. Finally, for t > t0 + 1, note that Assumption 3.1 implies that µt = µ0 func(ℓ1) func(ℓ2) func(ℓt 1) and µ t = µ0 func(ℓ 1) func(ℓ 2) func(ℓ t 1). Thus, swapping the losses at rounds t 1 and t0 results in the same distributions µt and µ t, therefore privacy follows from the same arguments as the case when t = t0 + 1. The final claim follows as eη + 1 4. B.4 Proof of Lemma 3.7 To prove lemma 3.7, we first prove the same result under a simpler setting where xt νt and yt νt. Lemma B.1. For any 0 t T/B 1, if Bη 1/20, xt νt and yt νt independently, then with probability at least 1 6e Bηδ0/η, νt(xt) νt(yt) νt+1(yt) [e 2Bη, e2Bη] Proof. Let Zt = R νt(x)dx. We know νt = νt/Zt by our notation. Then we have that νt(xt) νt(yt) νt+1(yt) = νt+1(xt)Zt νt(xt)Zt+1 νt(yt)Zt+1 νt(xt) νt(yt) νt+1(yt). The statement follows from the Assumption 3.1 and the group privacy DBe Bηδ0 (νt+1, νt) Bη. Then the statement follows from Lemma A.4, the independence between xt, yt and Union bound. We are now ready to prove Lemma 3.7. Proof. Fix any t. Let t0 t be largest integer such that At0 = 1, that is, yt is sampled from νt0 for some random t0 t. By the group privacy, we know DBe Bη(t t0)δ0(t t0) (νt, νt0) Bη(t t0). Define the bad set Sbad = {y : νt+1(x) νt(x) νt(y) νt+1(y) / [e 2Bη, e2Bη], x νt}. By Lemma B.1, we know νt(y Sbad) 6e Bη δ0/η. Therefore, we have that νt0(y Sbad) e Bη(t t0) νt(y Sbad) + (t t0)Bδ0e Bη(t t0) 2e Bη(t t0+1) Bδ0/η + Bδ0(t t0)e Bη(t t0). By the CDF of the geometric distribution, we know with probability at least 1 δ1, we get |t0 t| log(1/δ1)/p. Let E be the event that |t0 t| log(1/δ1)/p. Hence we know νt0(y Sbad) νt0(y Sbad | E) Pr(E) + Pr(Ec) (2/η + log(1/δ1)/p) e Bδ0 + δ1. C Missing proofs for Section 4 We prove a sequence of lemmas that are needed for the proof. The first lemma shows that the algorithm has to split the privacy budget across all resampling rounds. To this end, let S be a random variable that corresponds to the number of resampling steps in the algorithm, let Ti be the random variable corresponding to the round of the i th resampling (where we let Ti = T + 1 if i > S), and let Zi be the random variable corresponding to the model sampled at time Ti (letting Zi = 1 if i > S). Lemma C.1. (Composition) Let S, Ti, Zi and S , T i, Z i denote the random variables for two neighboring datasets. Under the assumptions of Theorem 4.2, if ALG is ε2-CDP, then for all α 1 i=1 Dα(Zi||Z i | Ti) αε2. Proof. As ALG is ε2-concentrated DP and outputs T1, . . . , TS and Z1, . . . , ZS, we have that αε2 Dα(T1, Z1, . . . , TS, ZS||T 1, Z 1, . . . , T S , Z S ) Dα(T1, Z1, . . . , TT , ZT ||T 1, Z 1, . 
. . , T T , Z T ) i=1 Dα(Zi||Z i | Ti), where the second inequality follows as the random variables Ti, Zi and T j, Z j are constant for i > S and j > S , and the last inequality follows as Zi is independent of (T1, . . . , Ti) and (Z1, . . . , Zi 1) given Ti. We defer the proof of the following Lemma to the appendix. Lemma C.2. Let T 1, ε 1/T and δ 1/2. Assume ℓ: [d] {0, 1} where ℓ[x] Ber(1/2) for each x [d]. Let D = (ℓ, . . . , ℓ) and let ALG be an (ε, δ)-DP algorithm that outputs (z1, . . . , z T ) = ALG(D). Then t=1 ℓ(zt)] T 1 We are now ready to prove our main lower bound. Proof. (of Theorem 4.2) We consider the following construction for the lower bound: the adversary sets Sadv = (Tε)2/3, the sequence of losses will have E = S2 adv epochs, each of size B = T/E = T/(Tε)4/3 = 1 (T ε)1/3ε. Inside each epoch, the adversary samples ℓ Ber(1/2)d and plays the same loss function for the whole epoch. Let S be random variable denoting the number of switches in the algorithm. In this case, we argue that each switch must have a small privacy budget (Lemma C.1), and thus, the price inside each epoch has to be large (Lemma C.2). Let T1, . . . , TS be the rounds where the algorithm resamples (Ti = T + 1 for i > S) and let Z1, Z2, . . . , ZS be the resampled models (Zi = 1 for i > S). Lemma C.1 implies that T X i=1 Dα(Zi||Z i | Ti) αε2. Now note that inside an epoch e, if the algorithm does not switch, then it will suffer loss B/2 in that epoch. Otherwise, if it switches, assume without loss of generality there is at most one switch inside each epoch (see Lemma C.2). Let je [T] denote the index such that Zje was sampled in epoch e. Note that the algorithm in this epoch has Dα(Zje||Z je | Tje) = αε2 e, hence it is ε2 e CDP. Standard conversion from concentrated DP to approximate DP (Lemma A.5) implies that it is (3εe p log(1/δ), δ)-DP where δ 1/T 3d. Hence Lemma C.2 implies the error for this epoch is 1/T. Letting Eswitch [E] denote the epochs where there is a switch, we have that the loss of the algorithm is L(ALG) := E[ t=1 ℓt(xt)] e Eswitch B log(1/δ) 2 1 = T/2 1 3B2p e Eswitch εe E log(1/δ)ε where the last inequality follows since P e Eswitch εe q E PE e=1 ε2e Eε. Note also that the loss of the best expert is L := min x [d] t=1 ℓt(x) = T/2 Overall we get that the regret of the algorithm is E log(1/δ)ε (Tε)2/3 T (Tε)4/3 3 p log(1/δ) 2(Tε)2/3ε2 log(1/δ)E 2(Tε)2/3ε 1 (ii) = Ω T 1/3 where (i) follows since E (Tε)4/3, and (ii) holds since 3 2ε2/3 for ε 100 log3/2(d T)/T 27 log3/2(1/δ)/T. The claim follows. C.1 Proof of Lemma C.2 Proof. For this lower bound, we assume that the algorithm has full access to D to release z1, . . . , z T . First, note that if the algorithms picks z = zi with probability 1/T and releases (z, . . . , z), then it has the same error since t=1 ℓ(z)] = T E[ℓ(z)] = T E[ 1 t=1 ℓ(zt)] = E[ t=1 ℓ(zt)]. Therefore, we assume that the algorithm releases a single z = ALG(D) that is (ε, δ)-DP. Denote Dℓ= (ℓ, . . . , ℓ). Note that as we sample ℓ Ber(1/2)d, the probability p := Pr(ℓ= ℓ0) = Pr(ℓ= ℓ1) for all ℓ0, ℓ1 {0, 1}d. Letting ℓ= 1 ℓ, we have that E ℓ Ber(1/2)d t=1 ℓ(ALG(Dℓ)) = T E ℓ Ber(1/2)d [ℓ(ALG(Dℓ))] ℓ0 {0,1}d Pr ℓ Ber(1/2)d(ℓ= ℓ0) E [ℓ0(ALG(Dℓ0))] ℓ0 {0,1}d p E ℓ0(ALG(Dℓ0)) + ℓ0(ALG(Dℓ0)) 2 min ℓ0 {0,1}d E ℓ0(ALG(Dℓ0)) + ℓ0(ALG(Dℓ0)) . 
Now note that for any $\ell_0$ we have
$$
\begin{aligned}
\mathbb{E}\big[\ell_0(\mathrm{ALG}(D_{\ell_0})) + \bar\ell_0(\mathrm{ALG}(D_{\bar\ell_0}))\big]
&= \sum_{z\in[d]} \Pr(\mathrm{ALG}(D_{\ell_0}) = z)\,\ell_0(z) + \Pr(\mathrm{ALG}(D_{\bar\ell_0}) = z)\,\bar\ell_0(z)\\
&= \sum_{z\in[d]} \Pr(\mathrm{ALG}(D_{\ell_0}) = z)\,\ell_0(z) + \Pr(\mathrm{ALG}(D_{\bar\ell_0}) = z)\,(1-\ell_0(z))\\
&= 1 + \sum_{z\in[d]} \ell_0(z)\,\big(\Pr(\mathrm{ALG}(D_{\ell_0}) = z) - \Pr(\mathrm{ALG}(D_{\bar\ell_0}) = z)\big)\\
&\ge 1 + \sum_{z\in[d]} \ell_0(z)\,\big(e^{-T\varepsilon}\Pr(\mathrm{ALG}(D_{\bar\ell_0}) = z) - T\delta - \Pr(\mathrm{ALG}(D_{\bar\ell_0}) = z)\big)\\
&\ge 1 - \sum_{z\in[d]} \ell_0(z)\,\Pr(\mathrm{ALG}(D_{\bar\ell_0}) = z)\,T\varepsilon - dT\delta
\;\ge\; 1 - T\varepsilon - dT\delta,
\end{aligned}
$$
where the first inequality follows since ALG is $(\varepsilon, \delta)$-DP together with group privacy (Lemma A.2), and the second uses $e^{-T\varepsilon} - 1 \ge -T\varepsilon$. The claim follows.
The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We provide a full set of verifiable details for all of our theoretical results. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). 
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The paper does not include experiments requiring code. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). 
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [NA]

Justification: The paper does not include experiments.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We read the ethics guidelines and believe we meet them.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: We do not believe our paper poses a significant negative societal impact, as it is about making existing learning algorithms differentially private under less stringent distributional assumptions, which we do not foresee being used in any significant malicious cases. We do believe that our algorithms can have positive societal impacts, but do not wish to overclaim to this effect because our results are primarily theoretical.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper does not release data or models.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: The paper does not use existing assets.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The paper does not release new assets.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.