# Private Zeroth-Order Nonsmooth Nonconvex Optimization

Published as a conference paper at ICLR 2024

Qinzi Zhang, Hoang Tran & Ashok Cutkosky
Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA
{qinziz,tranhp,cutkosky}@bu.edu

We introduce a new zeroth-order algorithm for private stochastic optimization on nonconvex and nonsmooth objectives. Given a dataset of size $M$, our algorithm ensures $(\alpha, \alpha\rho^2/2)$-Rényi differential privacy and finds a $(\delta, \epsilon)$-stationary point so long as $M = \Omega\big(\frac{d}{\delta\epsilon^3} + \frac{d^{3/2}}{\rho\delta\epsilon^2}\big)$. This matches the optimal complexity of its non-private zeroth-order analog. Notably, although the objective is not smooth, we have privacy for free whenever $\rho \gtrsim \sqrt{d}\,\epsilon$.

1 INTRODUCTION

We study the stochastic optimization problem of the form
$$\min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}_z[f(x, z)],$$
where the stochastic function $f(x, z)$ might be both non-convex and non-smooth with respect to $x$. Our focus is on zeroth-order algorithms, often referred to as gradient-free algorithms. Unlike first-order algorithms that have access to the stochastic gradient $\nabla f(x, z)$, zeroth-order algorithms can access only the function value $f(x, z)$.

Non-convex stochastic optimization is fundamental to modern machine learning. For example, in deep learning, $x$ is the weight vector of some neural network model, $z$ is a datapoint, and $f(x, z)$ represents the loss of the model evaluated on $x$ and $z$. Consequently, training the machine learning model equates to minimizing the non-convex objective $F(x)$. Given its significant role, there has been a marked increase in research focusing on non-convex optimization in recent years (Ghadimi & Lan, 2013; Arjevani et al., 2019; 2020; Carmon et al., 2019; Fang et al., 2018; Cutkosky & Orabona, 2019).

Since finding a global minimum of non-convex $F(x)$ is intractable, many previous works assume that $F$ is smooth, and their goal is to find an $\epsilon$-stationary point $x^\ast$ where $\|\nabla F(x^\ast)\| \le \epsilon$. However, objectives are not always smooth, especially with the ever-increasing complexity of modern machine learning models. Making the problem even more difficult, recent research shows that finding an $\epsilon$-stationary point of a non-smooth objective might be intractable (Zhang et al., 2020; Kornowski & Shamir, 2022). To address this issue, Zhang et al. (2020) introduces an alternative objective for the convergence analysis of non-smooth non-convex optimization. Specifically, they propose the notion of a Goldstein stationary point (Goldstein, 1977): $x^\ast$ is said to be a $(\delta, \epsilon)$-stationary point of $F$ if there exists a subset $S$ of the ball of radius $\delta$ centered at $x^\ast$ such that $\|\mathbb{E}[\nabla F(y)]\| \le \epsilon$, where $y$ is uniformly distributed over $S$. Recently, there have been many works on first-order algorithms using this framework (Zhang et al., 2020; Tian et al., 2022; Cutkosky et al., 2023). Notably, Cutkosky et al. (2023) designs an Online-to-Nonconvex Conversion algorithm that finds a $(\delta, \epsilon)$-stationary point with $O(\delta^{-1}\epsilon^{-3})$ samples, which is proven to be the optimal rate.

However, first-order algorithms have their own limitations. For many applications, computing first-order gradients can be computationally expensive or even impossible (Nelson, 2010; Mania et al., 2018). Hence, there are many recent results focusing on designing efficient zeroth-order algorithms for non-smooth non-convex optimization (Lin et al., 2022; Chen et al., 2023; Kornowski & Shamir, 2023).
Recently, Kornowski & Shamir (2023) proposes a zeroth-order algorithm that requires only $O(d\delta^{-1}\epsilon^{-3})$ samples, which is the current state-of-the-art rate for zeroth-order optimization on non-smooth non-convex objectives. The additional dimension dependence is an inevitable cost of zeroth-order algorithms even when the objective is smooth or convex (Duchi et al., 2015).

In this paper, we study zeroth-order stochastic optimization on non-convex and non-smooth objectives. Zeroth-order optimization is particularly important in cases where memory is constrained or the model is so large that computing a backward pass is forbiddingly expensive. In addition, we also aim to protect the privacy of users' data. For example, consider the practical problem of training a recommendation model or a chat response model. At each step the algorithm is given a set of different users who report the model's performance in the form of "upvotes" or "downvotes". It's important to make updates using such zeroth-order feedback while preserving the privacy of feedback from the individual users. To this end, we require our algorithm to be differentially private (DP) (Dwork et al., 2006), which means that, with high probability, the output of our algorithm operating on any particular dataset is almost indistinguishable from the output when one datapoint in the dataset is perturbed.

The primary objective of private stochastic optimization is to minimize the dataset size required to solve the optimization problem while maintaining differential privacy guarantees. This area has seen extensive research efforts. While problems with convex objectives have been well studied over the past decade (Chaudhuri et al., 2011; Jain et al., 2012; Kifer et al., 2012; Bassily et al., 2014; Talwar et al., 2014; Jain & Thakurta, 2014; Bassily et al., 2019; Feldman et al., 2020; Bassily et al., 2021b; Zhang et al., 2022), more recent efforts have focused on private non-convex optimization (Wang et al., 2017; 2019a; Tran & Cutkosky, 2022; Wang et al., 2019b; Zhou et al., 2020; Bassily et al., 2021a; Arora et al., 2023). Recently, Arora et al. (2023) designs an algorithm that, assuming the objective is smooth, finds an $\epsilon$-stationary point and satisfies $(\rho, \gamma)$-DP with $\tilde O(\epsilon^{-3} + \sqrt{d}\,\rho^{-1}\epsilon^{-2})$ data complexity.

While private non-convex optimization has seen rapid advancements, certain challenges remain. Many previous studies hinge on the assumption of smooth objectives, since their methods are essentially extensions of non-private non-convex optimization algorithms. In fact, this limitation emerges even in situations where the objective is convex, in which case smoothness provides a critical tool for reducing sensitivity. Since the results in non-smooth non-convex optimization are recent, its differentially private counterpart remains largely unexplored.

1.1 OUR CONTRIBUTIONS

The main contribution of this paper is the introduction of a zeroth-order algorithm for differentially private stochastic optimization on non-convex and non-smooth objectives. To our knowledge, this is the first work under this framework. Our algorithm satisfies $(\alpha, \alpha\rho^2/2)$-Rényi differential privacy (RDP) (Mironov, 2017) (which is approximately $(\rho, \gamma)$-DP) and finds a $(\delta, \epsilon)$-stationary point with $\tilde O(d\delta^{-1}\epsilon^{-3} + d^{3/2}\rho^{-1}\delta^{-1}\epsilon^{-2})$ data complexity. Notably, the non-private term $\tilde O(d\delta^{-1}\epsilon^{-3})$ matches the state-of-the-art complexity found in its non-private counterpart. Moreover, when $\rho \gtrsim \sqrt{d}\,\epsilon$, the non-private term dominates, suggesting that privacy can be attained without additional cost. This is consistent with the observations when the objective is smooth (Arora et al., 2023).
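As a quick sanity check of this threshold (a worked rearrangement of the two complexity terms, not a new result), the privacy term is dominated by the non-private term exactly when
$$\frac{d^{3/2}}{\rho\,\delta\epsilon^{2}} \le \frac{d}{\delta\epsilon^{3}} \iff \rho \ge \sqrt{d}\,\epsilon,$$
so for any $\rho$ above this threshold the total data complexity is $\tilde O(d\delta^{-1}\epsilon^{-3})$, the same as without privacy.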
Our algorithm incorporates four essential components, each crucial for achieving the optimal rate. We leverage the non-private Online-to-Nonconvex Conversion (O2NC) framework by Cutkosky et al. (2023), which finds a $(\delta, \epsilon)$-stationary point of a non-smooth objective using a first-order oracle. We then build an approximate first-order oracle (i.e., a gradient estimator) from a zeroth-order oracle. Although this high-level strategy is a common technique, our approach distinguishes itself through its gradient oracle design. In contrast to prior non-private approaches that directly approximate the gradient with a standard zeroth-order estimator (Kornowski & Shamir, 2023), we introduce a variance-reduced gradient oracle. Our approach incorporates two zeroth-order estimators: one for the gradient and another for its difference between two points. Our gradient oracle also differs subtly from the standard zeroth-order estimator by sampling $d$ i.i.d. estimators for each data point, which is required to achieve the optimal dimension dependence. Both estimators exhibit reduced sensitivity. We reduce the privacy cost of our variance-reduced estimators by using the tree mechanism (Dwork et al., 2010; Chan et al., 2011), yielding a new state-of-the-art privacy guarantee.

Omitting any component from our algorithm significantly impacts the results. Forgoing either the O2NC technique or the sampling of $d$ i.i.d. estimators results in an additional dimension dependence of $\sqrt{d}$. Moreover, the combination of the variance-reduced oracle and the tree mechanism is crucial for optimal privacy. A naive approach might directly make O2NC private by adding noise to classical zeroth-order gradient estimators, but this yields a sub-optimal rate with an extra factor of $\epsilon^{-1}$. More details about the intuitions and challenges of our algorithm can be found in Section 4.1.

2 PRELIMINARY

Notation. We use bold font $x$ to denote a vector $x \in \mathbb{R}^d$, and denote the Euclidean norm of $x$ by $\|x\|$. We denote the unit ball and unit sphere in $\mathbb{R}^d$ by $B$ and $S$, and denote the uniform distributions on $B$ and $S$ by $U_B$ and $U_S$ respectively. We denote an open ball in $\mathbb{R}^d$ centered at $x$ of radius $r$ by $B(x, r) = \{y : \|x - y\| < r\}$. For $n \in \mathbb{N}$, we denote the set $\{1, \dots, n\}$ by $[n]$, and we denote a sequence $a_1, \dots, a_n$ by $a_{1:n}$. We use the standard big-$O$ notation, and use $\tilde O$ to hide additional logarithmic factors. We interchangeably use $f(x) \lesssim g(x)$ to denote $f(x) = O(g(x))$.

Non-smooth Optimization. A function $h : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz if $|h(x) - h(y)| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$; $h$ is $H$-smooth if it is differentiable and $\|\nabla h(x) - \nabla h(y)\| \le H\|x - y\|$ for all $x, y \in \mathbb{R}^d$. A point $x^\ast \in \mathbb{R}^d$ is an $\epsilon$-stationary point of $h$ if $\|\nabla h(x^\ast)\| \le \epsilon$.

Definition 2.1. Let $\delta > 0$ and $h : \mathbb{R}^d \to \mathbb{R}$ be differentiable. The Goldstein $\delta$-subdifferential of $h$ at $x$ is $\partial_\delta h(x) = \mathrm{conv}\big(\cup_{y \in B(x,\delta)} \nabla h(y)\big)$. We define $\|\nabla h(x)\|_\delta = \inf\{\|g\| : g \in \partial_\delta h(x)\}$. Then $x^\ast$ is said to be a Goldstein $(\delta, \epsilon)$-stationary point of $h$ if $\|\nabla h(x^\ast)\|_\delta \le \epsilon$.

Note that $\|\nabla h(x)\|_\delta \le \inf_{S \subseteq B(x,\delta)} \|\mathbb{E}_{y \sim \mathrm{Unif}(S)}[\nabla h(y)]\|$, so a Goldstein $(\delta, \epsilon)$-stationary point is also a $(\delta, \epsilon)$-stationary point in the sense described in the introduction.
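As a concrete illustration of Definition 2.1 (our example, not from the paper), take $d = 1$ and the nonsmooth function $h(x) = |x|$, which has no $\epsilon$-stationary point near the origin for $\epsilon < 1$. For any $|x| < \delta$, the ball $B(x, \delta)$ contains points with gradient $+1$ and points with gradient $-1$, so
$$\partial_\delta h(x) = \mathrm{conv}\{-1, +1\} = [-1, 1] \ni 0, \qquad \text{hence } \|\nabla h(x)\|_\delta = 0.$$
Thus every point within distance $\delta$ of the kink is a Goldstein $(\delta, 0)$-stationary point, even though $|h'(x)| = 1$ at every differentiable point.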
Differential Privacy. A stochastic optimization algorithm can be considered as a randomized algorithm that takes a dataset $Z$ (a collection of data points $z_1, \dots, z_M$) and produces an output $w$. Throughout this paper, we denote by $\mathcal{Z}$ the set of all possible datasets of size $M$. Two datasets $Z, Z' \in \mathcal{Z}$ are neighboring if they differ in only one data point. A randomized algorithm $\mathcal{A} : \mathcal{Z} \to \mathcal{R}$ is said to be $(\epsilon, \delta)$-differentially private ($(\epsilon, \delta)$-DP) for $\epsilon, \delta > 0$ if for all neighboring datasets $Z, Z'$ and all measurable $E \subseteq \mathcal{R}$, we have $\mathbb{P}\{\mathcal{A}(Z) \in E\} \le e^\epsilon\, \mathbb{P}\{\mathcal{A}(Z') \in E\} + \delta$ (Dwork et al., 2006). Another common privacy measure is Rényi differential privacy (RDP). Algorithm $\mathcal{A}$ is $(\alpha, \rho)$-RDP for $\alpha > 1$, $\rho > 0$ if for all neighboring $Z, Z'$, $D_\alpha(\mathcal{A}(Z) \,\|\, \mathcal{A}(Z')) \le \rho$ (Mironov, 2017), where $D_\alpha(\mu \,\|\, \nu)$ is the Rényi divergence of distributions $\mu, \nu$. RDP can be converted to $(\epsilon, \delta)$-DP as follows: if an algorithm is $(\alpha, \alpha\rho^2/2)$-RDP for all $\alpha > 1$, then it is also $(2\rho \ln(1/\delta)^{1/2}, \delta)$-DP for all $\delta \le \exp(-\rho^2)$ (Mironov, 2017, Proposition 3). Therefore, we use $(\alpha, \alpha\rho^2/2)$-RDP as the measure of differential privacy in our paper. If $\|\mathcal{A}(Z) - \mathcal{A}(Z')\| \le s$ for any neighboring $Z, Z'$, we say the sensitivity of $\mathcal{A}$ is bounded by $s$. It is well known that in this case, adding Gaussian noise $N(0, \sigma^2 I)$ with $\sigma = s/\rho$ to the output of $\mathcal{A}$ ensures that $\mathcal{A}$ is $(\alpha, \alpha\rho^2/2)$-RDP (Mironov, 2017). We also make use of the tree mechanism (Dwork et al., 2010; Chan et al., 2011), a technique that allows for private release of running sums of potentially sensitive data (Algorithm 5).

Online Learning. Our algorithm builds on the Online-to-Nonconvex Conversion by Cutkosky et al. (2023), so we briefly introduce the setting of online convex optimization (OCO) (Cesa-Bianchi & Lugosi, 2006; Hazan, 2019; Orabona, 2019). An OCO algorithm proceeds in rounds. In each round $t$, it outputs $x_t$, receives a convex loss function $\ell_t(x)$, and suffers loss $\ell_t(x_t)$. It is common to use a linear loss $\ell_t(x) = \langle v_t, x\rangle$. An OCO algorithm has domain bounded by $D$ if $\|x_t\| \le D$ for all $t$. The goal of online learning is to minimize the static regret, defined as
$$\mathrm{Reg}_T(u) = \sum_{t=1}^T \ell_t(x_t) - \ell_t(u) = \sum_{t=1}^T \langle v_t, x_t - u\rangle,$$
where $u$ is some comparator vector (often, $u = \arg\min \sum_{t=1}^T \ell_t(x)$). There are many OCO algorithms that achieve the minimax optimal $O(\sqrt{T})$ regret. For example, projected Online Subgradient Descent (OSD) (Zinkevich, 2003) with domain bounded by $D$ satisfies $\mathbb{E}[\mathrm{Reg}_T(u)] \le D\sqrt{\sum_{t=1}^T \mathbb{E}\|v_t\|^2}$.

Uniform Smoothing. Randomized smoothing is a well-known technique that converts a possibly non-smooth function into a smooth approximation (Flaxman et al., 2005; Duchi et al., 2012; Lin et al., 2022). Given a Lipschitz function $h : \mathbb{R}^d \to \mathbb{R}$ and $\delta > 0$, we define the uniform smoothing of $h$ as $\hat h_\delta(x) = \mathbb{E}_{v \sim U_B}[h(x + \delta v)]$. In later sections, we denote by $\hat F_\delta(x)$ and $\hat f_\delta(x, z)$ the uniform smoothings of the objective $F(x)$ and $f(x, z)$ respectively. As shown in prior works (see Yousefian et al. (2012) and Duchi et al. (2012, Section E)), uniform smoothing yields a smooth approximation whose gradient can be estimated by finite differences. We rephrase the key properties in the following lemma, whose proof is presented in the appendix for completeness.

Lemma 2.2. Suppose $h : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz. Then (i) $\hat h_\delta$ is $L$-Lipschitz; (ii) $|\hat h_\delta(x) - h(x)| \le L\delta$; (iii) $\hat h_\delta$ is differentiable and $\frac{\sqrt{d}L}{\delta}$-smooth; (iv) $\nabla \hat h_\delta(x) = \mathbb{E}_{u \sim U_S}\big[\frac{d}{\delta} h(x + \delta u)u\big] = \mathbb{E}_{u \sim U_S}\big[\frac{d}{2\delta}(h(x + \delta u) - h(x - \delta u))u\big]$.

As a corollary, finding a $(\delta, \epsilon)$-stationary point of $\hat F_\delta$ is sufficient to find a $(2\delta, \epsilon)$-stationary point of $F$. The formal statement is as follows (Kornowski & Shamir (2023, Lemma 4); the proof is presented in Appendix A for completeness).

Corollary 2.3. Suppose $F : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz. Then for any $\epsilon, \delta > 0$, $\|\nabla \hat F_\delta(x)\|_\delta \le \epsilon$ implies that $\|\nabla F(x)\|_{2\delta} \le \epsilon$.

2.1 ANALYSIS ORGANIZATION

As a brief overview, our algorithm leverages O2NC to privately optimize the uniform smoothing of the objective. Note that O2NC does not require any smoothness itself; we employ uniform smoothing because it allows us to construct a low-sensitivity gradient oracle from a zeroth-order oracle. The smoothness property is tangential. We construct a variance-reduced gradient oracle for O2NC and incorporate the tree mechanism for enhanced privacy. This oracle is based on two zeroth-order estimators: one for the stochastic gradient of the uniform smoothing and another for the gradient difference.

This paper is organized as follows. In Section 3, we introduce the zeroth-order estimators. In Section 4.1, we present the O2NC algorithm combined with the private variance-reduced gradient oracle. In Section 4.2, we demonstrate the improved privacy guarantee provided by the tree mechanism. We conclude with the convergence analysis in Section 4.3.
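Before turning to the estimators, here is a minimal Monte-Carlo sketch of Lemma 2.2 (iv): the symmetric two-point estimator $\frac{d}{2\delta}(h(x + \delta u) - h(x - \delta u))u$ with $u \sim U_S$ estimates $\nabla \hat h_\delta(x)$. The test function, dimension, $\delta$, and sample count below are our illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta, n = 5, 0.1, 200_000

def h(X):
    # 1-Lipschitz nonsmooth test function: h(x) = ||x||_1 / sqrt(d)
    return np.abs(X).sum(axis=-1) / np.sqrt(d)

x = np.array([0.8, -0.5, 0.3, -0.9, 0.6])      # coordinates kept away from the kinks
U = rng.standard_normal((n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)  # n i.i.d. points on the unit sphere S

# Two-point estimator of grad h_delta(x), averaged over n directions.
est = (d / (2 * delta)) * ((h(x + delta * U) - h(x - delta * U))[:, None] * U)
print(est.mean(axis=0))  # since |x_i| > delta, this approaches sign(x) / sqrt(d)
```

For this $h$ and this $x$, the smoothed gradient equals $\mathrm{sign}(x)/\sqrt{d}$ exactly, so the empirical mean should match it up to Monte-Carlo error.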
3 ZEROTH-ORDER ESTIMATORS

Let $F(x) = \mathbb{E}_z[f(x, z)]$ and denote by $\hat F_\delta$ and $\hat f_\delta$ the uniform smoothings of $F$ and $f$ respectively. Using the randomized smoothing technique from Lemma 2.2, we can construct unbiased zeroth-order estimators for both $\nabla \hat F_\delta(x)$ and $\nabla \hat F_\delta(x) - \nabla \hat F_\delta(y)$. The key properties of these estimators, GRAD and DIFF, as defined in Algorithms 1 and 2, are summarized in the following results.

Lemma 3.1. If $f(x, z)$ is differentiable and $L$-Lipschitz in $x$, then for any $\delta > 0$, $x \in \mathbb{R}^d$ and neighboring data batches $z_{1:b}, z'_{1:b}$ of size $b$,
$$\mathbb{E}[\mathrm{GRAD}_{f,\delta}(x, z_{1:b})] = \nabla \hat F_\delta(x), \quad\text{(unbiased)}$$
$$\mathbb{E}\|\mathrm{GRAD}_{f,\delta}(x, z_{1:b}) - \nabla \hat F_\delta(x)\|^2 \le \frac{16dL^2}{b}, \quad\text{(variance)}$$
$$\|\mathrm{GRAD}_{f,\delta}(x, z_{1:b}) - \mathrm{GRAD}_{f,\delta}(x, z'_{1:b})\| \le \frac{2dL}{b}. \quad\text{(sensitivity)}$$

Lemma 3.2. If $f(x, z)$ is differentiable and $L$-Lipschitz in $x$, then for any $\delta > 0$, $x, y \in \mathbb{R}^d$ and neighboring data batches $z_{1:b}, z'_{1:b}$ of size $b$,
$$\mathbb{E}[\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b})] = \nabla \hat F_\delta(x) - \nabla \hat F_\delta(y), \quad\text{(unbiased)}$$
$$\mathbb{E}\big\|\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b}) - [\nabla \hat F_\delta(x) - \nabla \hat F_\delta(y)]\big\|^2 \le \frac{16dL^2}{b\delta^2}\|x - y\|^2, \quad\text{(variance)}$$
$$\|\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b}) - \mathrm{DIFF}_{f,\delta}(x, y, z'_{1:b})\| \le \frac{2dL}{b\delta}\|x - y\|. \quad\text{(sensitivity)}$$

Algorithm 1 Zeroth-order gradient oracle $\mathrm{GRAD}_{f,\delta}(x, z_{1:b})$
Input: loss function $f$, constant $\delta > 0$, parameter $x$, i.i.d. data batch $z_{1:b}$.
1: Sample $u_{11}, \dots, u_{bd} \sim \mathrm{Uniform}(S)$ i.i.d.
2: Return $\mathrm{GRAD}_{f,\delta}(x, z_{1:b}) = \frac{1}{b}\sum_{i=1}^b \frac{1}{d}\sum_{j=1}^d \frac{d}{2\delta}\big(f(x + \delta u_{ij}, z_i) - f(x - \delta u_{ij}, z_i)\big)u_{ij}$.

Algorithm 2 Zeroth-order gradient difference oracle $\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b})$
Input: loss function $f$, constant $\delta > 0$, parameters $x, y$, i.i.d. data batch $z_{1:b}$.
1: Sample $u_{11}, \dots, u_{bd} \sim \mathrm{Uniform}(S)$ i.i.d.
2: Return $\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b}) = \frac{1}{b}\sum_{i=1}^b \frac{1}{d}\sum_{j=1}^d \frac{d}{\delta}\big(f(x + \delta u_{ij}, z_i) - f(y + \delta u_{ij}, z_i)\big)u_{ij}$.

The proofs of Lemmas 3.1 and 3.2 both stem from the uniform smoothing properties detailed in Lemma 2.2; we defer them to Appendix B. Notably, these lemmas indicate that both the gradient and gradient difference estimators are unbiased, with low variance and sensitivity. Specifically, in our final algorithm we will be able to control the distance between the parameters $x, y$ at two successive iterations to be $O(\delta/T)$. As a result, the variance in Lemma 3.2 is bounded by $O(dL^2/bT^2)$, and its sensitivity is bounded by $O(dL/bT)$.

Using these estimators, we construct zeroth-order gradient oracles by exploiting the decomposition
$$\nabla \hat F_\delta(x_t) = \nabla \hat F_\delta(x_1) + \sum_{i=2}^{t}\big[\nabla \hat F_\delta(x_i) - \nabla \hat F_\delta(x_{i-1})\big] \approx \mathrm{GRAD}_{f,\delta}(x_1, z) + \sum_{i=2}^{t} \mathrm{DIFF}_{f,\delta}(x_i, x_{i-1}, z).$$
Our constructed oracles have low variance and low sensitivity, thus allowing us to incorporate the tree mechanism to privately aggregate the sum with less noise. A more detailed exploration of this idea is presented in the subsequent sections, especially in Corollary 4.4 and Lemma 4.8.
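The following is a minimal sketch of Algorithms 1 and 2 (the function names and signatures are ours, not the paper's). Here `f(x, z)` is any per-example loss that is $L$-Lipschitz in $x$, and `batch` is an array of $b$ data points; for each $z_i$ we draw $d$ i.i.d. unit vectors $u_{i1}, \dots, u_{id}$, as the variance bounds require.

```python
import numpy as np

def _unit_vectors(d, rng):
    U = rng.standard_normal((d, d))
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def grad_oracle(f, x, batch, delta, rng=np.random.default_rng(1)):
    """GRAD_{f,delta}(x, z_{1:b}): symmetric two-point estimates of the
    smoothed gradient, averaged over d directions per data point."""
    d, est = len(x), np.zeros(len(x))
    for z in batch:
        U = _unit_vectors(d, rng)  # u_{i1}, ..., u_{id}
        diffs = np.array([f(x + delta * u, z) - f(x - delta * u, z) for u in U])
        est += (1.0 / d) * (d / (2 * delta)) * diffs @ U
    return est / len(batch)

def diff_oracle(f, x, y, batch, delta, rng=np.random.default_rng(2)):
    """DIFF_{f,delta}(x, y, z_{1:b}): the same directions u_{ij} couple the two
    evaluation points, which is what makes the output scale with ||x - y||."""
    d, est = len(x), np.zeros(len(x))
    for z in batch:
        U = _unit_vectors(d, rng)
        diffs = np.array([f(x + delta * u, z) - f(y + delta * u, z) for u in U])
        est += (1.0 / d) * (d / delta) * diffs @ U
    return est / len(batch)
```

Sharing the directions between the two evaluation points in `diff_oracle` is the design choice that Lemma 3.2 exploits: the per-direction difference is bounded by the Lipschitz constant times $\|x - y\|$, rather than by the raw function values.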
In addition, we would like to discuss the rationale behind sampling $d$ i.i.d. uniform vectors $u_{ij}$ for each data point $z_i$. For the gradient estimator GRAD, it is feasible to approximate $\nabla \hat f_\delta(x, z_i)$ with a single two-point estimator, namely $g_i = \frac{d}{2\delta}(f(x + \delta u, z_i) - f(x - \delta u, z_i))u$, as it is proven that $\mathbb{E}\|g_i\|^2 = O(dL^2)$ (Shamir, 2017, Lemma 10). The central argument hinges on a concentration inequality for Lipschitz functions, which states that if $h$ is $L$-Lipschitz and $u \sim U_S$, then $\mathbb{P}\{|h(u) - \mathbb{E}h(u)| \ge t\} \le 2\exp(-\frac{dt^2}{2L^2})$ (Wainwright, 2019, Proposition 3.11). In the context of their proof, $h(u) = f(x + \delta u)$, which is $L\delta$-Lipschitz in $u$. However, this argument is not applicable to the gradient difference estimator DIFF. To apply the concentration inequality, we would need to bound the Lipschitz constant of $h(u) = f(x + \delta u) - f(y + \delta u)$. Although this function is indeed $2L\delta$-Lipschitz, using this would bound the variance at $O(dL^2)$, which does not match our target of $O(dL^2/T^2)$. On the other hand, directly using the Lipschitzness of $f$ bounds $\mathbb{E}\|\frac{d}{\delta}(f(x + \delta u, z_i) - f(y + \delta u, z_i))u\|^2 \le (\frac{dL}{\delta}\|x - y\|)^2 = O(d^2L^2/T^2)$. Consequently, we need to sample $d$ i.i.d. uniform vectors to reduce the dimension dependence of the variance by a factor of $d$ in order to match the optimal rate. While a more efficient analysis that avoids sampling $d$ uniform vectors might exist, we are currently unaware of it.

4 OUR ALGORITHM AND RESULTS

4.1 PRIVATE ONLINE-TO-NONCONVEX CONVERSION

Our algorithm builds on the Online-to-Nonconvex Conversion (O2NC) by Cutkosky et al. (2023), a general framework that converts any OCO algorithm with $O(\sqrt{T})$ regret into a nonsmooth nonconvex optimization algorithm that finds a $(\delta, \epsilon)$-stationary point in $O(\delta^{-1}\epsilon^{-3})$ iterations. The pseudo-code of this framework is presented in Algorithm 3. Specifically, our algorithm chooses projected Online Subgradient Descent (OSD) as the OCO algorithm, which updates $\Delta_t^k$ in line 3 of Algorithm 3 following the explicit update rule: $\Delta_1^k \leftarrow 0$ and $\Delta_t^k \leftarrow \Pi_D(\Delta_{t-1}^k - \eta g_{t-1}^k)$ for $t \ge 2$. Here $\Pi_D(x) = \arg\min_{\|y\| \le D}\|x - y\|$ denotes the projection operator and $\eta$ is the stepsize tuned by OSD. Regarding notation in Algorithm 3, there are two round indices $t$ and $k$, and we use the subscript $t$ and superscript $k$, namely $x_t^k$, to denote a variable $x$ in outer loop $k$ and inner loop $t$.

Algorithm 3 Online-to-Nonconvex Conversion
Input: OCO algorithm $\mathcal{A}$ with domain $D$, gradient oracle $\mathcal{O}$, initial state $x_1$.
1: for $k = 1, 2, \dots, K$ do
2:  for $t = 1, 2, \dots, T$ do
3:   Receive $\Delta_t^k$ from $\mathcal{A}$. (For example, $\Delta_1^k \leftarrow 0$ and $\Delta_t^k \leftarrow \Pi_D(\Delta_{t-1}^k - \eta g_{t-1}^k)$ for $t \ge 2$.)
4:   Update $x_{t+1}^k \leftarrow x_t^k + \Delta_t^k$ and $w_t^k \leftarrow x_t^k + s_t^k \Delta_t^k$, where $s_t^k \sim \mathrm{Uniform}([0, 1])$.
5:   Query gradient estimator $g_t^k$ from $\mathcal{O}$.
6:   Send $\ell_t^k(\cdot) = \langle g_t^k, \cdot\rangle$ to $\mathcal{A}$ and let $\mathcal{A}$ update $\Delta_{t+1}^k$.
7:  end for
8:  Compute $\bar w^k \leftarrow \frac{1}{T}\sum_{t=1}^T w_t^k$.
9:  Reset $x_1^{k+1} \leftarrow x_{T+1}^k$ and restart $\mathcal{A}$.
10: end for
11: Output $\bar w \sim \mathrm{Uniform}(\{\bar w^1, \dots, \bar w^K\})$.

The selection of O2NC as our base non-private algorithm is deliberate and influenced by the literature on non-private zeroth-order algorithms for nonsmooth nonconvex optimization. Intuitively, in the realm of privacy, a variance-reduction algorithm is more appealing due to its inherently low sensitivity. However, a straightforward application of variance-reduced SGD paired with zeroth-order estimators only achieves a sub-optimal dimension dependence of $O(d^{3/2}\delta^{-1}\epsilon^{-3})$, as shown in (Chen et al., 2023). A recent advancement by Kornowski & Shamir (2023) has improved the rate to $O(d\delta^{-1}\epsilon^{-3})$.
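The following is a minimal runnable skeleton of Algorithm 3 with projected OSD as the OCO learner; the gradient oracle is abstracted as a callable, so plugging in the private oracle of Section 4 gives the private variant. The parameters `K, T, D, eta` are illustrative, not the tuned values from Theorem 4.9.

```python
import numpy as np

def o2nc(oracle, x1, K, T, D, eta, rng=np.random.default_rng(3)):
    """Sketch of Algorithm 3: oracle(k, t, w) returns a gradient estimate g_t^k."""
    d = len(x1)
    x, w_bars = x1.copy(), []
    for k in range(K):
        delta_t = np.zeros(d)                  # Delta_1^k = 0 (OSD restarts each epoch)
        w_sum, g_prev = np.zeros(d), None
        for t in range(T):
            if g_prev is not None:             # OSD step, projected onto the D-ball
                delta_t = delta_t - eta * g_prev
                norm = np.linalg.norm(delta_t)
                if norm > D:
                    delta_t *= D / norm
            s = rng.uniform()                  # s_t^k ~ Uniform([0, 1])
            w = x + s * delta_t                # w_t^k = x_t^k + s_t^k * Delta_t^k
            x = x + delta_t                    # x_{t+1}^k = x_t^k + Delta_t^k
            g_prev = oracle(k, t, w)           # linear loss <g_t^k, .> fed back to OSD
            w_sum += w
        w_bars.append(w_sum / T)
    return w_bars[rng.integers(K)]             # output uniform over epoch averages
```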
Central to their achievement is the observation that finding a $(2\delta, \epsilon)$-stationary point of a nonsmooth objective $F$ is equivalent to finding a $(\delta, \epsilon)$-stationary point of its uniform smoothing $\hat F_\delta$. Previous zeroth-order algorithms aim to find an $\epsilon$-stationary point of $\hat F_\delta$ via smooth optimization algorithms such as variance-reduced SGD, resulting in an additional smoothness cost of order $O(\sqrt{d})$. In contrast, the algorithm proposed by Kornowski & Shamir (2023) leverages O2NC to find a $(\delta, \epsilon)$-stationary point, achieving a linear dimension dependence.

Informed by the insights of Kornowski & Shamir (2023), we adopt O2NC as our foundational non-private algorithm. Specifically, we adapt Algorithm 3 for privacy by substituting the gradient oracle $\mathcal{O}$ in line 5 with a more carefully designed private oracle. A main challenge in this design is to minimize sensitivity, which is required to ensure privacy. In the algorithm proposed by Kornowski & Shamir (2023), the gradient oracle is simply the zeroth-order gradient estimator, $g_t^k = \frac{d}{2\delta}(f(w_t^k + \delta u, z_t^k) - f(w_t^k - \delta u, z_t^k))u$, whose sensitivity is $O(dL)$. A straightforward method to ensure privacy for this algorithm is to directly add Gaussian noise with variance $O(d^2L^2/\rho^2)$. While this might seem intuitive, it leads to significantly worse data complexity. Specifically, the resulting rate is $O(d^{3/2}\rho^{-1}\delta^{-1}\epsilon^{-3})$, which is worse by a factor of $\epsilon^{-1}$ when compared to our proposed algorithm, and importantly does not admit any value of $\rho$ for which we obtain the non-private convergence rate. For interested readers, we include the detailed analysis of this naive approach in Appendix E.

In contrast, we design a more refined private gradient oracle in Algorithm 4. According to Lemmas 3.1 and 3.2, the sensitivity of $g_1^k$ is $O(dL/B_1)$, while that of $d_t^k$ is $O(dL\|w_t^k - w_{t-1}^k\|/B_2\delta)$. Setting $B_1 = T$, $B_2 = 1$, and using Remark 4.1 with $D = \delta/T$, both $g_1^k$ and $d_t^k$ have sensitivity $O(dL/T)$. With these settings, we can apply the tree mechanism, a useful technique to privately release sums, ensuring that the variance-reduced gradient $\tilde g_t^k$ achieves $(\alpha, \alpha\rho^2/2)$-RDP. Specifically, the tree mechanism adds noise of order $\tilde O(d^2L^2/\rho^2T^2)$ to each $\tilde g_t^k$. Compared to the aforementioned naive approach, our refined approach reduces the noise by a factor of $T^2$, thus resulting in the desired data complexity.

Remark 4.1. Note that $w_{t+1}^k = x_{t+1}^k + s_{t+1}^k\Delta_{t+1}^k = w_t^k + (1 - s_t^k)\Delta_t^k + s_{t+1}^k\Delta_{t+1}^k$. Therefore, if the OCO algorithm has domain bounded by $D$ (i.e., $\|\Delta_t^k\| \le D$ for all $t, k$), then $\|w_{t+1}^k - w_t^k\| \le 2D$. In general, for all $t, t' \in [T]$, $\|w_t^k - w_{t'}^k\| \le 2|t - t'|D$.

4.2 PRIVACY GUARANTEE

To ensure the privacy of the Online-to-Nonconvex Conversion, we use a variant of the tree mechanism, as defined in Algorithm 5, to privately aggregate the sum of $g_1^k$ and the $d_t^k$. The privacy guarantee of the tree mechanism is presented in Theorem 4.3, and the proof is presented in Appendix C.1. Intuitively, the tree mechanism says that if each of the algorithms $\mathcal{M}_1, \dots, \mathcal{M}_n$ requires $\sigma^2$ noise for privacy, then the sum $\sum_{i=1}^n \mathcal{M}_i$ only needs $O(\ln(n)\sigma^2)$ noise for the same level of privacy. With Theorem 4.3, we determine the minimal noise required to ensure privacy for Algorithm 3 in Corollary 4.4, and we evaluate the total noise added by the tree mechanism in Corollary 4.5.

Algorithm 4 Private variance-reduced gradient oracle $\mathcal{O}$
Input: dataset $Z$, constants $B_1, B_2$.
Initialize: Partition $Z$ into $KT$ subsets: $Z_1^k$ of size $B_1$ for $k \in [K]$, and $Z_t^k$ of size $B_2$ for $t \in \{2, \dots, T\}$, $k \in [K]$. The total data size is $|Z| = M = K(B_1 + B_2(T - 1))$.
1: Upon receiving round indices $k, t$ and parameter $w_t^k$:
2: if $t = 1$ then
3:  Query $g_1^k \leftarrow \mathrm{GRAD}_{f,\delta}(w_1^k, Z_1^k)$.
4: else
5:  Query $d_t^k \leftarrow \mathrm{DIFF}_{f,\delta}(w_t^k, w_{t-1}^k, Z_t^k)$ and compute $g_t^k \leftarrow g_{t-1}^k + d_t^k$.
6: end if
7: Return $\tilde g_t^k \leftarrow g_t^k + \mathrm{TREE}(t)$.

Algorithm 5 Tree Mechanism
Input: index $t \in [T]$, global variables $\sigma_1, \dots, \sigma_T$.
1: If $t = 1$, set $\mathrm{NOISE} \leftarrow \{\}$.
2: Sample $\xi_t \sim N(0, \sigma_t^2 I)$ and store $\xi_t$ in NOISE.
3: Return $\sum_{(\cdot, i) \in \mathrm{NODE}(t)} \xi_i$.
4: function NODE($t$)
5:  Initialize: set $k \leftarrow 0$ and $S \leftarrow \{\}$.
6:  for $i = 0, \dots, \lceil\log_2(t)\rceil$ while $k < t$ do
7:   Set $k' \leftarrow k + 2^{\lceil\log_2(t)\rceil - i}$. If $k' \le t$, store $S \leftarrow S \cup \{(k + 1, k')\}$ and update $k \leftarrow k'$.
8:  end for
9:  Return $S$.
10: end function

Remark 4.2. To better understand the NODE function in Algorithm 5, consider a binary tree of depth $\lceil\log_2(t)\rceil$. Starting from the top layer, NODE($t$) greedily selects in each layer, if one exists, the node whose interval extends furthest to the right without passing $t$, so that the selected intervals partition $[1, t]$. As an example, $\mathrm{NODE}(7) = \{(1, 4), (5, 6), (7, 7)\}$ and $\mathrm{NODE}(8) = \{(1, 8)\}$. In particular, note that $|\mathrm{NODE}(t)| \le \lceil\log_2(t)\rceil \le 2\ln(t)$.
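A minimal implementation of the NODE function, matching the examples in Remark 4.2 (the function name and coding style are ours):

```python
import math

def node(t):
    """Maximal dyadic intervals covering [1, t], top layer first (Algorithm 5)."""
    depth = math.ceil(math.log2(t)) if t > 1 else 0
    k, s = 0, []
    for i in range(depth + 1):
        if k >= t:
            break
        k_next = k + 2 ** (depth - i)   # candidate interval of length 2^(depth - i)
        if k_next <= t:
            s.append((k + 1, k_next))
            k = k_next
    return s

print(node(7))   # [(1, 4), (5, 6), (7, 7)]  as in Remark 4.2
print(node(8))   # [(1, 8)]
```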
Next, we present the main privacy theorem for the tree mechanism as presented in Algorithm 5.

Theorem 4.3. Let $\mathcal{X}$ be a state space and $\mathcal{Z}^{(1)}, \dots, \mathcal{Z}^{(n)}$ be dataset spaces, and denote $\mathcal{Z}^{(1:i)} = \mathcal{Z}^{(1)} \times \cdots \times \mathcal{Z}^{(i)}$. Let $\mathcal{M}^{(i)} : \mathcal{X}^{i-1} \times \mathcal{Z}^{(i)} \to \mathcal{X}$ be a sequence of algorithms for $i \in [n]$, and let $\mathcal{A} : \mathcal{Z}^{(1:n)} \to \mathcal{X}^n$ be the algorithm that, given a dataset $Z_{1:n} \in \mathcal{Z}^{(1:n)}$, sequentially computes $x_i = \sum_{j=1}^{i} \mathcal{M}^{(j)}(x_{1:j-1}, Z_j) + \mathrm{TREE}(i)$ for $i \in [n]$ and then outputs $x_{1:n}$. Suppose for all $i \in [n]$ and neighboring $Z_{1:n}, Z'_{1:n} \in \mathcal{Z}^{(1:n)}$, $\|\mathcal{M}^{(i)}(x_{1:i-1}, Z_i) - \mathcal{M}^{(i)}(x_{1:i-1}, Z'_i)\| \le s_i$ for all auxiliary inputs $x_{1:i-1} \in \mathcal{X}^{i-1}$. Then for all $\alpha > 1$, $\mathcal{A}$ is $(\alpha, \alpha\rho^2/2)$-RDP where
$$\rho = \sqrt{2\ln n}\,\max_{b \in [n],\, i \le b} \frac{s_i}{\sigma_b}.$$

In our case, $\mathcal{X}$ is the collection of weights $x$ for $f(x, z)$, $\mathcal{Z}^{(1)}$ is the collection of all possible batches $Z_1^k$ of size $B_1$ defined in Algorithm 4, and $\mathcal{Z}^{(t)}$ for $t \ge 2$ is the collection of all possible batches $Z_t^k$ of size $B_2$. Moreover, $\mathcal{M}^{(1)}$ corresponds to GRAD, which computes $g_1^k$, and $\mathcal{M}^{(t)}$ corresponds to DIFF, which computes $d_t^k$. Upon substituting these specific definitions into Theorem 4.3, we have the following privacy guarantees.

Corollary 4.4. Suppose $f(x, z)$ is differentiable and $L$-Lipschitz in $x$ and the domain of $\mathcal{A}$ is bounded by $D = \delta/T$. Let $B_1, B_2$ satisfy $B_1 \ge TB_2/2$. Then for any $\alpha > 1$, $\rho > 0$, Algorithm 3 is $(\alpha, \alpha\rho^2/2)$-RDP if we set the noise levels $\sigma_{1:T}$ in the tree mechanism as
$$\sigma_t = \frac{4\sqrt{2\ln T}\, dL}{B_2 T \rho} \quad \text{for all } t \in [T].$$

Corollary 4.5. Following the assumptions and definitions in Corollary 4.4, for all $t \in [T]$,
$$\mathbb{E}\|\mathrm{TREE}(t)\|^2 \le 2\ln(t)\, d\sigma_t^2 \le \Big(\frac{8\ln(T)\, d^{3/2} L}{B_2 T \rho}\Big)^2.$$
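The following sketch wires the pieces of Algorithm 4 together: a variance-reduced running sum of `grad_oracle`/`diff_oracle` outputs (Section 3 sketches), privatized with the tree mechanism via `node` above. The class structure is ours, and `sigma` should follow Corollary 4.4 up to constants; treat this as an illustrative assembly, not the paper's reference implementation.

```python
import numpy as np

class PrivateOracle:
    """Sketch of Algorithm 4: g~_t^k = g_t^k + TREE(t)."""

    def __init__(self, f, delta, sigma, d, rng=np.random.default_rng(4)):
        self.f, self.delta, self.sigma, self.d, self.rng = f, delta, sigma, d, rng
        self.noise, self.g, self.w_prev = {}, None, None

    def tree(self, t):
        # Lazily sample xi_i ~ N(0, sigma^2 I) once per index, then sum over NODE(t).
        total = np.zeros(self.d)
        for (_, i) in node(t):
            if i not in self.noise:
                self.noise[i] = self.sigma * self.rng.standard_normal(self.d)
            total += self.noise[i]
        return total

    def query(self, t, w, batch):
        if t == 1:                         # fresh GRAD estimate, reset the noise tree
            self.noise = {}
            self.g = grad_oracle(self.f, w, batch, self.delta)
        else:                              # telescoped DIFF update: g_t = g_{t-1} + d_t
            self.g = self.g + diff_oracle(self.f, w, self.w_prev, batch, self.delta)
        self.w_prev = w
        return self.g + self.tree(t)
```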
4.3 CONVERGENCE ANALYSIS

To start the convergence analysis, we revisit the convergence result of the non-private O2NC algorithm, presented as Lemma 4.6, whose proof is included in Appendix D for completeness. Building upon the observation in Corollary 2.3, our objective shifts to finding a $(\delta, \epsilon)$-stationary point of the uniform smoothing $\hat F_\delta$, as opposed to directly optimizing $F$. This approach leads to the formulation of Corollary 4.7. Missing proofs in this section are deferred to Appendix D.

Lemma 4.6. For any function $F : \mathbb{R}^d \to \mathbb{R}$ that is differentiable with $F(x_1) - \inf_x F(x) \le F^\ast$, if the domain of the OCO algorithm $\mathcal{A}$ is bounded by $D = \delta/T$, then
$$\mathbb{E}\|\nabla F(\bar w)\|_\delta \le \frac{F^\ast + \sum_{k=1}^K \mathbb{E}\,\mathrm{Reg}_T(u^k) + \sum_{k=1}^K\sum_{t=1}^T \mathbb{E}\langle \nabla F(w_t^k) - g_t^k,\ \Delta_t^k - u^k\rangle}{DKT}, \quad \text{where } u^k = -D\,\frac{\sum_{t=1}^T \nabla F(w_t^k)}{\big\|\sum_{t=1}^T \nabla F(w_t^k)\big\|}.$$

Corollary 4.7. Suppose $F : \mathbb{R}^d \to \mathbb{R}$ is differentiable, $L$-Lipschitz, and $F(x_1) - \inf_x F(x) \le F^\ast$, and suppose the domain of $\mathcal{A}$ is bounded by $D = \delta/T$. Then
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \le \frac{F^\ast + 2L\delta + \sum_{k=1}^K \mathbb{E}\,\mathrm{Reg}_T(\hat u^k) + \sum_{k=1}^K\sum_{t=1}^T \mathbb{E}\langle \nabla \hat F_\delta(w_t^k) - \tilde g_t^k,\ \Delta_t^k - \hat u^k\rangle}{DKT}, \quad \text{where } \hat u^k = -D\,\frac{\sum_{t=1}^T \nabla \hat F_\delta(w_t^k)}{\big\|\sum_{t=1}^T \nabla \hat F_\delta(w_t^k)\big\|}.$$

Next, we introduce a key result in our convergence analysis. At its core, Lemma 4.8 states that the variance of the non-private gradient oracle, defined as $g_t^k$ in Algorithm 4, is approximately $O(dL^2/T)$. This result leads to the non-private term $O(d\delta^{-1}\epsilon^{-3})$ in the data complexity, matching the optimal rate in the non-private setting. Together with Corollary 4.5, this lemma implies the main result, stated as Theorem 4.9.

Lemma 4.8. Suppose $f(x, z)$ is differentiable and $L$-Lipschitz in $x$ and the domain of $\mathcal{A}$ is bounded by $D = \delta/T$. Then the variance of the gradient oracle $\mathcal{O}$ (Algorithm 4) is bounded by
$$\mathbb{E}\|\nabla \hat F_\delta(w_t^k) - g_t^k\|^2 \le \frac{16dL^2}{B_1} + \frac{64dL^2}{B_2 T}.$$

Theorem 4.9. Suppose $f(x, z)$ is differentiable and $L$-Lipschitz in $x$ and $F(x_1) - \inf_x F(x) \le F^\ast$. Then for any $\delta, \epsilon > 0$ and $\alpha > 1$, $\rho > 0$, there exists $M = \tilde O\big(\frac{dF^\ast L^2}{\delta\epsilon^3} + \frac{d^{3/2}F^\ast L}{\rho\delta\epsilon^2} + \frac{d^{3/2}L^2}{\rho\delta\epsilon}\big)$ such that upon running Algorithm 3 with projected OSD as the OCO algorithm, using Algorithm 4 as the private oracle, and setting $\sigma_t$ as defined in Corollary 4.4 with $B_1 = T + 1$, $B_2 = 1$, $D = \frac{\delta}{T}$, $T = \min\big\{\big(\frac{\sqrt d\, L\delta M}{F^\ast + L\delta}\big)^{2/3}, \big(\frac{d^{3/2}L\delta M}{(F^\ast + L\delta)\rho}\big)^{1/2}\big\}$, $K = \frac{M}{2T}$, the algorithm is $(\alpha, \alpha\rho^2/2)$-RDP and outputs a $(\delta, \epsilon)$-stationary point using $M$ data points.

Proof. Since $B_1 = T + 1$, $B_2 = 1$ satisfy $B_1 \ge TB_2/2$, Corollary 4.4 implies the privacy guarantee. For the convergence analysis, Corollary 4.7 states that
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \le \frac{F^\ast + 2L\delta + \sum_k \mathbb{E}\,\mathrm{Reg}_T(\hat u^k) + \sum_{k,t} \mathbb{E}\langle \nabla \hat F_\delta(w_t^k) - \tilde g_t^k, \Delta_t^k - \hat u^k\rangle}{DKT}.$$
Recall that given a sequence of stochastic vectors $v_1, \dots, v_T$, the regret of projected OSD is bounded by $\mathbb{E}[\mathrm{Reg}_T(u)] \le D\sqrt{\sum_{t=1}^T \mathbb{E}\|v_t\|^2}$ (Orabona, 2019). In our case, $v_t = \tilde g_t^k$, and
$$\mathbb{E}\|\tilde g_t^k\|^2 \le 3\,\mathbb{E}\|g_t^k - \nabla \hat F_\delta(w_t^k)\|^2 + 3\,\mathbb{E}\|\nabla \hat F_\delta(w_t^k)\|^2 + 3\,\mathbb{E}\|\mathrm{TREE}(t)\|^2 \lesssim \frac{dL^2}{B_2 T} + L^2 + \Big(\frac{\ln(T)\,d^{3/2}L}{B_2 T\rho}\Big)^2,$$
where the first term is bounded by Lemma 4.8 with $B_1 \ge TB_2/2$, the second by Lipschitzness of $\hat F_\delta$ (Lemma 2.2 (i)), and the third by Corollary 4.5. Consequently, we bound the regret by
$$\mathbb{E}[\mathrm{Reg}_T(\hat u^k)] \lesssim DL\sqrt{T}\Big(\sqrt{\tfrac{d}{B_2 T}} + 1 + \tfrac{d^{3/2}\ln T}{B_2 T\rho}\Big). \qquad (1)$$
Next, note that $\|\Delta_t^k - \hat u^k\| \le 2D$. Following the same bounds, we have
$$\mathbb{E}\langle \nabla \hat F_\delta(w_t^k) - \tilde g_t^k, \Delta_t^k - \hat u^k\rangle \le 2D\,\mathbb{E}\big[\|\nabla \hat F_\delta(w_t^k) - g_t^k\| + \|\mathrm{TREE}(t)\|\big] \lesssim DL\Big(\sqrt{\tfrac{d}{B_2 T}} + \tfrac{d^{3/2}\ln T}{B_2 T\rho}\Big). \qquad (2)$$
Upon substituting equation 1 and equation 2 into Corollary 4.7, we have
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \lesssim \frac{F^\ast + L\delta}{DKT} + \frac{L}{\sqrt{T}}\Big(\sqrt{\tfrac{d}{B_2 T}} + 1 + \tfrac{d^{3/2}\ln T}{B_2 T\rho}\Big) + L\Big(\sqrt{\tfrac{d}{B_2 T}} + \tfrac{d^{3/2}\ln T}{B_2 T\rho}\Big).$$
Upon setting $B_1 = T + 1$, $B_2 = 1$, $D = \frac{\delta}{T}$, the data size is $M = K(B_1 + B_2(T - 1)) = 2KT$, so $K = \frac{M}{2T}$ and
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \lesssim \frac{(F^\ast + L\delta)T}{\delta M} + L\sqrt{\frac{d}{T}} + \frac{d^{3/2}L\ln T}{T\rho}.$$
Upon setting $T = \min\big\{\big(\frac{\sqrt d\, L\delta M}{F^\ast + L\delta}\big)^{2/3}, \big(\frac{d^{3/2}L\delta M}{(F^\ast + L\delta)\rho}\big)^{1/2}\big\}$, we achieve the desired bound
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \lesssim \Big(\frac{(F^\ast + L\delta)dL^2}{\delta M}\Big)^{1/3} + \Big(\frac{(F^\ast + L\delta)d^{3/2}L}{\rho\delta M}\Big)^{1/2} + \frac{d^{3/2}L^2}{\rho\delta M}.$$
Equivalently, the right-hand side is bounded by $\epsilon$ if $M = \tilde\Omega\Big(\max\Big\{\frac{F^\ast dL^2}{\delta\epsilon^3},\ \frac{F^\ast d^{3/2}L}{\rho\delta\epsilon^2},\ \frac{d^{3/2}L^2}{\rho\delta\epsilon}\Big\}\Big)$, which concludes the proof.
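To see where the two choices of $T$ in Theorem 4.9 come from, one can equate the optimization term with each of the two error terms (a back-of-envelope check with constants and logarithms suppressed):
$$\frac{(F^\ast + L\delta)T}{\delta M} = L\sqrt{\frac{d}{T}} \iff T = \Big(\frac{\sqrt{d}\,L\delta M}{F^\ast + L\delta}\Big)^{2/3}, \qquad \frac{(F^\ast + L\delta)T}{\delta M} = \frac{d^{3/2}L}{T\rho} \iff T = \Big(\frac{d^{3/2}L\delta M}{(F^\ast + L\delta)\rho}\Big)^{1/2}.$$
Taking the minimum of the two choices ensures neither error term dominates the optimization term.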
5 CONCLUSION

This paper presents a novel zeroth-order algorithm for private nonsmooth nonconvex optimization. We prove that, given a dataset of size $M$, our algorithm finds a $(\delta, \epsilon)$-stationary point while achieving the $(\alpha, \alpha\rho^2/2)$-RDP privacy guarantee so long as $M = \tilde\Omega\big(\frac{dF^\ast L^2}{\delta\epsilon^3} + \frac{d^{3/2}F^\ast L}{\rho\delta\epsilon^2}\big)$. Notably, this is the first algorithm for the private nonsmooth nonconvex optimization problem. For future research, there are several intriguing directions. First, we are interested in exploring methods that could eliminate the need to sample $d$ uniform vectors per data point, a limitation we delve into in Section 3. Additionally, the design of efficient first-order private algorithms for nonsmooth nonconvex optimization stands out as an area of potential exploration. Finally, the optimality of the current rate is still uncertain, and finding lower bounds remains an open problem.

REFERENCES

Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. In Conference on Learning Theory, pp. 242–299, 2020.

Raman Arora, Raef Bassily, Tomás González, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. 2023.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464–473. IEEE, 2014.

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 11282–11291, 2019.

Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. Advances in Neural Information Processing Systems, 34, 2021a.

Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. In Conference on Learning Theory, pp. 474–499. PMLR, 2021b.

Dimitri P. Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973.

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, pp. 1–50, 2019.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

T-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.

Lesi Chen, Jing Xu, and Luo Luo. Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization, 2023.

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pp. 15210–15219, 2019.

Ashok Cutkosky, Harsh Mehta, and Francesco Orabona. Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion. In International Conference on Machine Learning (ICML), 2023.

John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015. doi: 10.1109/TIT.2015.2409256.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284. Springer, 2006.

Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pp. 715–724, 2010.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699, 2018.
Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 439–449, 2020.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394, 2005.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

A. A. Goldstein. Optimization of Lipschitz continuous functions. Mathematical Programming, 13(1):14–22, 1977. doi: 10.1007/BF01584320. URL https://doi.org/10.1007/BF01584320.

Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Prateek Jain and Abhradeep Guha Thakurta. (Near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pp. 476–484. PMLR, 2014.

Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012.

Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012.

Guy Kornowski and Ohad Shamir. Oracle complexity in nonsmooth nonconvex optimization. Journal of Machine Learning Research, 23(314):1–44, 2022.

Guy Kornowski and Ohad Shamir. An algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization, 2023.

Tianyi Lin, Zeyu Zheng, and Michael I. Jordan. Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization, 2022.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275. IEEE, 2017.

Barry L. Nelson. Optimization via simulation over discrete decision variables. In Risk and Optimization in an Uncertain World, pp. 193–207. INFORMS, 2010. doi: 10.1287/educ.1100.0069. URL https://doi.org/10.1287/educ.1100.0069.

Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.

Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.

Lai Tian, Kaiwen Zhou, and Anthony Man-Cho So. On the finite-time complexity and practical computation of approximate stationarity concepts of Lipschitz functions. In International Conference on Machine Learning, pp. 21360–21379. PMLR, 2022.
Hoang Tran and Ashok Cutkosky. Momentum aggregation for private non-convex ERM. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=x56v-UN7BjD.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019. doi: 10.1017/9781108627771.

Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: faster and more general. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2719–2728, 2017.

Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. In International Conference on Machine Learning, pp. 6526–6535. PMLR, 2019a.

Lingxiao Wang, Bargav Jayaraman, David Evans, and Quanquan Gu. Efficient privacy-preserving stochastic nonconvex optimization. arXiv preprint arXiv:1910.13659, 2019b.

Farzad Yousefian, Angelia Nedić, and Uday V. Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56–67, 2012.

Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Ali Jadbabaie, and Suvrit Sra. Complexity of finding stationary points of nonsmooth nonconvex functions. 2020.

Qinzi Zhang, Hoang Tran, and Ashok Cutkosky. Differentially private online-to-batch for smooth losses. In Advances in Neural Information Processing Systems, volume 35, 2022.

Yingxue Zhou, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Arindam Banerjee. Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds. arXiv preprint arXiv:2006.13501, 2020.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.

A PROOFS IN SECTION 2

A.1 PROOF OF LEMMA 2.2

Lemma 2.2. Suppose $h : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz. Then (i) $\hat h_\delta$ is $L$-Lipschitz; (ii) $|\hat h_\delta(x) - h(x)| \le L\delta$; (iii) $\hat h_\delta$ is differentiable and $\frac{\sqrt{d}L}{\delta}$-smooth; (iv) $\nabla \hat h_\delta(x) = \mathbb{E}_{u \sim U_S}\big[\frac{d}{\delta} h(x + \delta u)u\big] = \mathbb{E}_{u \sim U_S}\big[\frac{d}{2\delta}(h(x + \delta u) - h(x - \delta u))u\big]$.

Proof. For simplicity, we drop the subscript $\delta$ of $\hat h_\delta$. By Jensen's inequality and Lipschitzness,
$$|\hat h(x) - \hat h(y)| \le \mathbb{E}_v|h(x + \delta v) - h(y + \delta v)| \le L\|x - y\|, \quad \text{(i)}$$
$$|\hat h(x) - h(x)| \le \mathbb{E}_v|h(x + \delta v) - h(x)| \le L\,\mathbb{E}\|\delta v\| \le L\delta. \quad \text{(ii)}$$
Next, since $h$ is Lipschitz, $h$ is almost everywhere differentiable by Rademacher's theorem, and the uniform smoothing $\hat h$ is differentiable by (Bertsekas, 1973, Proposition 2.4); moreover, it holds that $\nabla \hat h(x) = \mathbb{E}_v[\nabla h(x + \delta v)]$. Denote by $\triangle$ the symmetric difference, i.e., $A \,\triangle\, B = (A \setminus B) \cup (B \setminus A)$, and let $S = B(x, \delta) \,\triangle\, B(y, \delta)$. Then
$$\|\nabla \hat h(x) - \nabla \hat h(y)\| = \Big\|\int_{B(x,\delta)} \frac{\nabla h(v)}{\mathrm{vol}(B(x,\delta))}\,dv - \int_{B(y,\delta)} \frac{\nabla h(v)}{\mathrm{vol}(B(y,\delta))}\,dv\Big\| \le \int_S \frac{\|\nabla h(v)\|}{\mathrm{vol}(\delta B)}\,dv \le \frac{\mathrm{vol}(S)}{\mathrm{vol}(\delta B)}\,L \le \frac{\sqrt{d}\,L}{\delta}\|x - y\|,$$
where the second-to-last inequality follows from $\|\nabla h(v)\| \le L$, and the last follows from Lemma A.1. This proves (iii).

Finally, by the definition of expectation,
$$\mathbb{E}_{v \sim U_B}[\nabla h(x + \delta v)] = \frac{1}{\mathrm{vol}(B)}\int_B \nabla_x h(x + \delta v)\,dv, \qquad \mathbb{E}_{u \sim U_S}[h(x + \delta u)u] = \frac{1}{\mathrm{area}(S)}\int_S h(x + \delta u)\,u\,du.$$
By Stokes' theorem,
$$\int_B \nabla_x h(x + \delta v)\,dv = \frac{1}{\delta}\int_S h(x + \delta u)\,u\,du.$$
We then establish the first equality in (iv) using $\mathrm{area}(S)/\mathrm{vol}(B) = d$, so that the prefactor is $d/\delta$. Next, let $p$ be a Rademacher random variable and $u' = pu$; observe that $u' \sim U_S$ as well. Therefore,
$$\mathbb{E}_u\Big[\frac{d}{\delta} h(x + \delta u)u\Big] = \mathbb{E}_{p, u'}\Big[\frac{d}{\delta} h(x + \delta p u')\,p u'\Big] = \frac{1}{2}\,\mathbb{E}_{u'}\Big[\frac{d}{\delta} h(x + \delta u')u' + \frac{d}{\delta} h(x - \delta u')(-u')\Big] = \mathbb{E}_u\Big[\frac{d}{2\delta}\big(h(x + \delta u) - h(x - \delta u)\big)u\Big].$$
This proves the second equality in (iv).
Lemma A.1. Let $x, y \in \mathbb{R}^d$ and $S = B(x, \delta) \,\triangle\, B(y, \delta)$. Then
$$\frac{\mathrm{vol}(S)}{\mathrm{vol}(\delta B)} \le \frac{\sqrt{d}\,\|x - y\|}{\delta}.$$

Proof. Let $D = \|x - y\|$, let $V_d(r)$ be the volume of a $d$-ball of radius $r$ in $\mathbb{R}^d$, and let $U_d(h, r)$ be the volume of a cap of height $h$ of a $d$-ball of radius $r$. If $D \ge 2\delta$ the two balls are disjoint and $\mathrm{vol}(S)/\mathrm{vol}(\delta B) = 2 \le \sqrt{d}D/\delta$. For $D \le 2\delta$, observe that
$$\frac{\mathrm{vol}(S)}{\mathrm{vol}(\delta B)} = \frac{2\big(U_d(\delta + \tfrac{D}{2}, \delta) - U_d(\delta - \tfrac{D}{2}, \delta)\big)}{V_d(\delta)}.$$
Next, we compute the value of $U_d(h, r)$:
$$U_d(h, r) = \int_0^h V_{d-1}\Big(\sqrt{r^2 - (r - t)^2}\Big)\,dt.$$
Recall that the volume of a $d$-ball is $V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}r^d$. Therefore,
$$\frac{\mathrm{vol}(S)}{\mathrm{vol}(\delta B)} = \frac{2\,\frac{\pi^{(d-1)/2}}{\Gamma(\frac{d-1}{2} + 1)}\int_{\delta - D/2}^{\delta + D/2}(2\delta t - t^2)^{\frac{d-1}{2}}\,dt}{\frac{\pi^{d/2}}{\Gamma(\frac{d}{2} + 1)}\,\delta^d} = \frac{2\,\Gamma(\frac{d}{2} + 1)}{\sqrt{\pi}\,\Gamma(\frac{d-1}{2} + 1)}\int_{1 - \frac{D}{2\delta}}^{1 + \frac{D}{2\delta}}(2u - u^2)^{\frac{d-1}{2}}\,du \qquad (t = \delta u).$$
Note that $2u - u^2 \le 1$ for all $u \in \mathbb{R}$, so the integral is bounded by $\int_{1 - D/2\delta}^{1 + D/2\delta} du = \frac{D}{\delta}$. The proof then follows from the fact that $\Gamma(\frac{d}{2} + 1)/\Gamma(\frac{d-1}{2} + 1) \le \frac{\sqrt{\pi d}}{2}$, which follows from the double-factorial bounds proved in Lemma A.2 for completeness.

Lemma A.2. For all $n \in \mathbb{N}$, $\frac{n!!}{(n-1)!!} \le \sqrt{2n}$ if $n$ is even and $\frac{n!!}{(n-1)!!} \le \sqrt{n}$ if $n$ is odd. Consequently, $\Gamma(\frac{n}{2} + 1)/\Gamma(\frac{n-1}{2} + 1) \le \frac{\sqrt{\pi n}}{2}$.

Proof. Let $n!! = n(n-2)(n-4)\cdots = \prod_{k=0}^{\lceil n/2\rceil - 1}(n - 2k)$. By $\Gamma(n + 1) = n\Gamma(n)$ and $\Gamma(\frac{1}{2}) = \sqrt{\pi}$, we have
$$\Gamma\Big(\frac{n}{2} + 1\Big) = \begin{cases} \dfrac{n!!}{2^{n/2}}, & n \text{ even}, \\[4pt] \dfrac{\sqrt{\pi}\,n!!}{2^{(n+1)/2}}, & n \text{ odd}. \end{cases}$$
Consequently,
$$\frac{\Gamma(\frac{n}{2} + 1)}{\Gamma(\frac{n-1}{2} + 1)} = \begin{cases} \dfrac{1}{\sqrt{\pi}}\cdot\dfrac{n!!}{(n-1)!!}, & n \text{ even}, \\[4pt] \dfrac{\sqrt{\pi}}{2}\cdot\dfrac{n!!}{(n-1)!!}, & n \text{ odd}, \end{cases}$$
so the claimed double-factorial bounds give $\frac{\sqrt{2n}}{\sqrt{\pi}} \le \frac{\sqrt{\pi n}}{2}$ in the even case (since $2\sqrt{2} \le \pi$) and $\frac{\sqrt{\pi}}{2}\sqrt{n} = \frac{\sqrt{\pi n}}{2}$ in the odd case.

We finish by proving the double-factorial bounds by induction, jointly with the lower bounds $\frac{n!!}{(n-1)!!} \ge \sqrt{n+1}$ for even $n$ and $\frac{n!!}{(n-1)!!} \ge \sqrt{\frac{n+1}{2}}$ for odd $n$. Let $a_n = \frac{n!!}{(n-1)!!}$, so that $a_{n+1} = \frac{n+1}{a_n}$. The base cases $a_1 = 1$ and $a_2 = 2$ satisfy all four bounds. If $n$ is even and $\sqrt{n+1} \le a_n \le \sqrt{2n}$, then $a_{n+1} = \frac{n+1}{a_n} \in \big[\frac{n+1}{\sqrt{2n}}, \sqrt{n+1}\big]$, and $\frac{n+1}{\sqrt{2n}} \ge \sqrt{\frac{n+2}{2}}$ since $(n+1)^2 \ge n(n+2)$. If $n$ is odd and $\sqrt{\frac{n+1}{2}} \le a_n \le \sqrt{n}$, then $a_{n+1} = \frac{n+1}{a_n} \in \big[\frac{n+1}{\sqrt{n}}, \sqrt{2(n+1)}\big]$, and $\frac{n+1}{\sqrt{n}} \ge \sqrt{n+2}$ since $(n+1)^2 \ge n(n+2)$. This concludes the proof by induction.

A.2 PROOF OF COROLLARY 2.3

Corollary 2.3. Suppose $F : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz. Then for any $\epsilon, \delta > 0$, $\|\nabla \hat F_\delta(x)\|_\delta \le \epsilon$ implies that $\|\nabla F(x)\|_{2\delta} \le \epsilon$.

Proof. Recall that the $\delta$-subdifferential of $\hat F_\delta$ is defined as $\partial_\delta \hat F_\delta(x) = \mathrm{conv}\big(\cup_{y \in B(x,\delta)} \nabla \hat F_\delta(y)\big)$. In other words, each $g \in \partial_\delta \hat F_\delta(x)$ is of the form $g = \sum_i \lambda_i \nabla \hat F_\delta(y_i)$, where the $\lambda_i$ are convex coefficients and $y_i \in B(x, \delta)$. (Lin et al., 2022, Theorem 10) proves that $\nabla \hat F_\delta(y_i) \in \partial_\delta F(y_i)$. In addition, since $\|y_i - x\| \le \delta$, we have $\partial_\delta F(y_i) \subseteq \partial_{2\delta} F(x)$. Consequently, $\nabla \hat F_\delta(y_i) \in \partial_{2\delta} F(x)$, and since $\partial_{2\delta} F(x)$ is convex, $g \in \partial_{2\delta} F(x)$ for all $g \in \partial_\delta \hat F_\delta(x)$; that is, $\partial_\delta \hat F_\delta(x) \subseteq \partial_{2\delta} F(x)$. We conclude the proof by recalling from Definition 2.1 that $\|\nabla \hat F_\delta(x)\|_\delta = \inf\{\|g\| : g \in \partial_\delta \hat F_\delta(x)\}$ and $\|\nabla F(x)\|_{2\delta} = \inf\{\|g\| : g \in \partial_{2\delta} F(x)\}$, together with the inclusion just established.
B PROOFS IN SECTION 3

Lemma 3.1. If $f(x, z)$ is differentiable and $L$-Lipschitz in $x$, then for any $\delta > 0$, $x \in \mathbb{R}^d$ and neighboring data batches $z_{1:b}, z'_{1:b}$ of size $b$, GRAD is unbiased with variance at most $\frac{16dL^2}{b}$ and sensitivity at most $\frac{2dL}{b}$, as stated in Section 3.

Proof. For simplicity, let $g_{ij} = \frac{d}{2\delta}(f(x + \delta u_{ij}, z_i) - f(x - \delta u_{ij}, z_i))u_{ij}$, so that $\mathrm{GRAD}_{f,\delta}(x, z_{1:b}) = \frac{1}{b}\sum_{i=1}^b \frac{1}{d}\sum_{j=1}^d g_{ij}$. Since $f$ is $L$-Lipschitz, it follows that $\|g_{ij}\| \le dL$.

First, by $F(x) = \mathbb{E}_z[f(x, z)]$ and Lemma 2.2 (iv),
$$\mathbb{E}_{u,z}[g_{ij}] = \mathbb{E}_u\Big[\frac{d}{2\delta}\big(F(x + \delta u_{ij}) - F(x - \delta u_{ij})\big)u_{ij}\Big] = \nabla \hat F_\delta(x).$$
Taking the average over $i, j$ gives $\mathbb{E}[\mathrm{GRAD}_{f,\delta}(x, z_{1:b})] = \nabla \hat F_\delta(x)$.

Next, let $g_i = \frac{1}{d}\sum_{j=1}^d g_{ij}$ and observe that $\mathbb{E}[g_i] = \nabla \hat F_\delta(x)$. Since the $g_i - \nabla \hat F_\delta(x)$ are mean-zero and independent across $i$, we have $\mathbb{E}\langle g_i - \nabla \hat F_\delta(x),\ g_{i'} - \nabla \hat F_\delta(x)\rangle = 0$ for $i \ne i'$; in other words, all cross terms vanish. Therefore,
$$\mathbb{E}\|\mathrm{GRAD}_{f,\delta}(x, z_{1:b}) - \nabla \hat F_\delta(x)\|^2 = \frac{1}{b^2}\sum_{i=1}^b \mathbb{E}\|g_i - \nabla \hat F_\delta(x)\|^2 \le \frac{1}{b^2}\sum_{i=1}^b \Big[2\,\mathbb{E}\|g_i - \nabla \hat f_\delta(x, z_i)\|^2 + 2\,\mathbb{E}\|\nabla \hat f_\delta(x, z_i) - \nabla \hat F_\delta(x)\|^2\Big]. \qquad (3)$$
We first evaluate the second term. By Lemma 2.2 (i), $\hat f_\delta(\cdot, z_i)$ and $\hat F_\delta$ are $L$-Lipschitz, which implies $\|\nabla \hat f_\delta(x, z_i)\| \le L$ and $\|\nabla \hat F_\delta(x)\| \le L$; consequently, $\mathbb{E}\|\nabla \hat f_\delta(x, z_i) - \nabla \hat F_\delta(x)\|^2 \le (2L)^2$. Next, note that the $g_{ij} - \nabla \hat f_\delta(x, z_i)$, conditioned on $z_i$, are mean-zero and independent across $j$. Therefore,
$$\mathbb{E}\|g_i - \nabla \hat f_\delta(x, z_i)\|^2 = \frac{1}{d^2}\sum_{j=1}^d \mathbb{E}\|g_{ij} - \nabla \hat f_\delta(x, z_i)\|^2 \le \frac{(d+1)^2 L^2}{d},$$
where the last inequality follows from $\|g_{ij} - \nabla \hat f_\delta(x, z_i)\| \le \|g_{ij}\| + \|\nabla \hat f_\delta(x, z_i)\| \le (d+1)L$. Upon substituting back into equation 3, we have
$$\mathbb{E}\|\mathrm{GRAD}_{f,\delta}(x, z_{1:b}) - \nabla \hat F_\delta(x)\|^2 \le \frac{2}{b}\Big(\frac{(d+1)^2 L^2}{d} + (2L)^2\Big) \le \frac{16dL^2}{b}.$$

Finally, let $g'_{ij}, g'_i$ be defined in the same way as $g_{ij}, g_i$ but using $z'_{1:b}$. Since $z_{1:b}, z'_{1:b}$ are neighboring, $g_i - g'_i = 0$ for all but one $i$. Let $k$ be the index of the differing data point. Then
$$\|\mathrm{GRAD}_{f,\delta}(x, z_{1:b}) - \mathrm{GRAD}_{f,\delta}(x, z'_{1:b})\| = \frac{1}{b}\|g_k - g'_k\| \le \frac{1}{bd}\sum_{j=1}^d \|g_{kj} - g'_{kj}\| \le \frac{2dL}{b}.$$

Lemma 3.2. If $f(x, z)$ is differentiable and $L$-Lipschitz in $x$, then for any $\delta > 0$, $x, y \in \mathbb{R}^d$ and neighboring data batches $z_{1:b}, z'_{1:b}$ of size $b$, DIFF is unbiased with variance at most $\frac{16dL^2}{b\delta^2}\|x - y\|^2$ and sensitivity at most $\frac{2dL}{b\delta}\|x - y\|$, as stated in Section 3.

Proof. Let $d_{ij} = \frac{d}{\delta}(f(x + \delta u_{ij}, z_i) - f(y + \delta u_{ij}, z_i))u_{ij}$, so that $\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b}) = \frac{1}{b}\sum_{i=1}^b \frac{1}{d}\sum_{j=1}^d d_{ij}$. Since $f$ is $L$-Lipschitz, it also holds that $\|d_{ij}\| \le \frac{dL}{\delta}\|x - y\|$.

First, by $F(x) = \mathbb{E}_z[f(x, z)]$ and Lemma 2.2 (iv),
$$\mathbb{E}_{u,z}[d_{ij}] = \mathbb{E}_u\Big[\frac{d}{\delta}\big(F(x + \delta u_{ij}) - F(y + \delta u_{ij})\big)u_{ij}\Big] = \nabla \hat F_\delta(x) - \nabla \hat F_\delta(y).$$
Taking the average over $i, j$ gives $\mathbb{E}_{u,z}[\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b})] = \nabla \hat F_\delta(x) - \nabla \hat F_\delta(y)$.

Next, let $d_i = \frac{1}{d}\sum_{j=1}^d d_{ij}$, $\Delta F = \nabla \hat F_\delta(x) - \nabla \hat F_\delta(y)$, and $\Delta f(z_i) = \nabla \hat f_\delta(x, z_i) - \nabla \hat f_\delta(y, z_i)$. Observe that $\mathbb{E}[d_i] = \Delta F$ and $\mathbb{E}[d_{ij} \mid z_i] = \Delta f(z_i)$. Since the $d_i - \Delta F$ are independent and mean-zero across $i$, we have
$$\mathbb{E}\|\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b}) - \Delta F\|^2 = \frac{1}{b^2}\sum_{i=1}^b \mathbb{E}\|d_i - \Delta F\|^2 \le \frac{1}{b^2}\sum_{i=1}^b \Big[2\,\mathbb{E}\|d_i - \Delta f(z_i)\|^2 + 2\,\mathbb{E}\|\Delta f(z_i) - \Delta F\|^2\Big]. \qquad (4)$$
We first evaluate the second term. By Lemma 2.2 (iii), $\hat f_\delta$ and $\hat F_\delta$ are $\frac{\sqrt{d}L}{\delta}$-smooth, which implies $\|\Delta f(z_i)\|, \|\Delta F\| \le \frac{\sqrt{d}L}{\delta}\|x - y\|$; consequently, $\mathbb{E}\|\Delta f(z_i) - \Delta F\|^2 \le \big(\frac{2\sqrt{d}L}{\delta}\|x - y\|\big)^2$. Next, since the $d_{ij} - \Delta f(z_i)$, conditioned on $z_i$, are mean-zero and independent across $j$,
$$\mathbb{E}\|d_i - \Delta f(z_i)\|^2 = \frac{1}{d^2}\sum_{j=1}^d \mathbb{E}\|d_{ij} - \Delta f(z_i)\|^2 \le \frac{(2d^2 + 2d)L^2}{d\,\delta^2}\|x - y\|^2,$$
where the last inequality follows from $\|d_{ij}\| \le \frac{dL}{\delta}\|x - y\|$ and $\|\Delta f(z_i)\| \le \frac{\sqrt{d}L}{\delta}\|x - y\|$. Therefore, upon substituting back into equation 4, we have
$$\mathbb{E}\|\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b}) - [\nabla \hat F_\delta(x) - \nabla \hat F_\delta(y)]\|^2 \le \frac{16dL^2}{b\delta^2}\|x - y\|^2.$$

Finally, let $d'_{ij}, d'_i$ be defined in the same way as $d_{ij}, d_i$ but using $z'_{1:b}$. Since $z_{1:b}, z'_{1:b}$ are neighboring, $d_i - d'_i = 0$ for all but one $i$. Let $k$ be the index of the differing data point. Then
$$\|\mathrm{DIFF}_{f,\delta}(x, y, z_{1:b}) - \mathrm{DIFF}_{f,\delta}(x, y, z'_{1:b})\| = \frac{1}{b}\|d_k - d'_k\| \le \frac{2dL}{b\delta}\|x - y\|.$$
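As a quick numerical sanity check of Lemma 3.1 (using the `grad_oracle` sketch from Section 3; the loss, dimensions, and batch size below are our illustrative choices): with $f(x, z) = \|x - z\|$ we have $L = 1$, so the lemma predicts a total variance of at most $16d/b$.

```python
import numpy as np

rng = np.random.default_rng(5)
d, delta, b, trials = 4, 0.2, 8, 1000

def f(x, z):
    return np.linalg.norm(x - z)   # 1-Lipschitz in x

batch = rng.standard_normal((b, d))
x = rng.standard_normal(d)

ests = np.array([grad_oracle(f, x, batch, delta, rng) for _ in range(trials)])
print("mean estimate:", ests.mean(axis=0))
print("spread:", ests.var(axis=0).sum(), "vs Lemma 3.1 bound 16 d / b =", 16 * d / b)
```

The empirical spread of the estimates should sit well below the (loose) $16d/b$ bound.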
C PROOFS IN SECTION 4.2

C.1 PROOF OF THEOREM 4.3

Before we prove the theorem, we first prove a more general composition result for RDP mechanisms. For two datasets $Z, Z' \in \mathcal{Z}$, we write $Z \simeq_q Z'$ if $Z, Z'$ are neighbors and they differ at the $q$-th entry.

Lemma C.1. For any domain $\mathcal{Z}$, let $\mathcal{R}^{(i)} : \mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(i-1)} \times \mathcal{Z} \to \mathcal{S}^{(i)}$ be a sequence of algorithms such that $\mathcal{R}^{(i)}(x_{1:i-1}, \cdot)$ is $(\alpha, \epsilon_i)$-RDP for all auxiliary inputs $x_{1:i-1} \in \mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(i-1)}$. Let $f_{\mathcal{R}^{(i)}(x_{1:i-1}, Z)}$ denote the distribution of $\mathcal{R}^{(i)}(x_{1:i-1}, Z)$. Suppose each dataset $Z \in \mathcal{Z}$ has $m$ data points, and for each $q \in [m]$ let
$$\mathrm{IN}(q) := \{i \in [n] : f_{\mathcal{R}^{(i)}(\cdot, Z)} = f_{\mathcal{R}^{(i)}(\cdot, Z')} \text{ whenever } Z \simeq_q Z'\}, \qquad \mathrm{OUT}(q) := [n] \setminus \mathrm{IN}(q).$$
Let $\mathcal{A}_n : \mathcal{Z} \to \mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(n)}$ be the algorithm that, given a dataset $Z \in \mathcal{Z}$, sequentially computes $x_i = \mathcal{R}^{(i)}(x_{1:i-1}, Z)$ for $i \in [n]$ and then outputs $x_{1:n}$. Then $\mathcal{A}_n$ is $(\alpha, \epsilon)$-RDP, where $\epsilon \le \max_{q \in [m]} \sum_{i \in \mathrm{OUT}(q)} \epsilon_i$.

Proof. Suppose $Z \simeq_q Z'$ and let $P, Q$ be the distributions of $\mathcal{A}_n(Z), \mathcal{A}_n(Z')$ respectively. Note that $P = P_1 \cdots P_n$, where $P_i$ is the conditional distribution of the $i$-th output, i.e., $P_i(x_i \mid x_{1:i-1}) = f_{\mathcal{R}^{(i)}(x_{1:i-1}, Z)}(x_i)$; similarly $Q = Q_1 \cdots Q_n$ with $Q_i(x_i \mid x_{1:i-1}) = f_{\mathcal{R}^{(i)}(x_{1:i-1}, Z')}(x_i)$. For all $i \in \mathrm{IN}(q)$, $P_i(\cdot \mid x_{1:i-1}) = Q_i(\cdot \mid x_{1:i-1})$ for all $x_{1:i-1}$ by definition of $\mathrm{IN}(q)$; and for all $i \in \mathrm{OUT}(q)$, since $\mathcal{R}^{(i)}(x_{1:i-1}, \cdot)$ is $(\alpha, \epsilon_i)$-RDP, $D_\alpha(P_i(\cdot \mid x_{1:i-1}) \,\|\, Q_i(\cdot \mid x_{1:i-1})) \le \epsilon_i$. Therefore,
$$\int P_i(x_i \mid x_{1:i-1})^\alpha\, Q_i(x_i \mid x_{1:i-1})^{1-\alpha}\,dx_i = \exp\big((\alpha - 1)D_\alpha(P_i \,\|\, Q_i)\big) \le \begin{cases} 1, & i \in \mathrm{IN}(q), \\ e^{(\alpha - 1)\epsilon_i}, & i \in \mathrm{OUT}(q). \end{cases}$$
We then conclude the proof by recursively integrating from $dx_n$ down to $dx_1$:
$$e^{(\alpha - 1)D_\alpha(P \| Q)} = \int \prod_{i} P_i(x_i \mid x_{1:i-1})^\alpha\, Q_i(x_i \mid x_{1:i-1})^{1-\alpha}\,dx_n \cdots dx_1 \le \prod_{i \in \mathrm{OUT}(q)} e^{(\alpha - 1)\epsilon_i}.$$

Next, we introduce some tree notation. Let $k \in \mathbb{N}_0$ and consider a complete binary tree with $n = 2^k$ leaves. Leaves are indexed by $i \in [n]$ (or $(i, i)$ interchangeably) in ascending order from left to right, and every other node is indexed by $(i, j)$, where $i$ is its left-most child leaf and $j$ is its right-most child leaf. Let $\mathbb{T}$ be the set of all nodes, and let $<$ be the total order on $\mathbb{T}$ with $(a, b) < (c, d)$ if either $b < d$, or $b = d$ and $a < c$. In other words, we compare the right-most child first and break ties by the left-most child. We regard $(\mathbb{T}, <)$ as sorted in ascending order, so it is valid to index a sequence by $\mathbb{T}$ (e.g., we may rewrite $a_1, \dots, a_{|\mathbb{T}|}$ as $a_{(i,j)}$ for $(i, j) \in \mathbb{T}$). We write $a_{(c,d) < (a,b)}$ as shorthand for $\{a_{(c,d)} : (c, d) < (a, b)\}$.

Now we are ready to prove the theorem.

Theorem 4.3 (restated). Let $\mathcal{X}$, $\mathcal{Z}^{(1)}, \dots, \mathcal{Z}^{(n)}$, the $\mathcal{M}^{(i)}$, the algorithm $\mathcal{A}$, and the sensitivities $s_i$ be as in Section 4.2. Then for all $\alpha > 1$, $\mathcal{A}$ is $(\alpha, \alpha\rho^2/2)$-RDP where $\rho = \sqrt{2\ln n}\,\max_{b \in [n],\, i \le b} \frac{s_i}{\sigma_b}$.

Proof. For any $n \in \mathbb{N}$, define $\mathbb{T}$ as the tree set with $2^{\lceil\log_2 n\rceil}$ leaves. Let $\mathcal{Z} = \mathcal{Z}^{(1:n)}$ and $\mathcal{S}^{(a,b)} = \mathcal{X}$ for all $(a, b) \in \mathbb{T}$. For each $(a, b) \in \mathbb{T}$, define $\mathcal{R}^{(a,b)} : \big(\prod_{(c,d) < (a,b)} \mathcal{S}^{(c,d)}\big) \times \mathcal{Z} \to \mathcal{S}^{(a,b)}$ as
$$\mathcal{R}^{(a,b)}(y_{(c,d) < (a,b)}, Z_{1:n}) = \sum_{i=a}^{b} \mathcal{M}^{(i)}(x_{1:i-1}, Z_i) + \xi_b, \qquad (5)$$
where $x_j = \sum_{(c,d) \in \mathrm{NODE}(j)} y_{(c,d)}$ for each $j \in [i - 1]$ and $\xi_b \sim N(0, \sigma_b^2 I)$. We can check that $x_j$ is well-defined: for all $j \le b$ and all $(c, d) \in \mathrm{NODE}(j)$, we have $(c, d) < (a, b)$ by the definition of NODE, and thus $x_{1:i-1}$ can be computed from $y_{(c,d) < (a,b)}$ for all $i \in [a, b]$.

Next, since for all neighboring $Z_{1:n}, Z'_{1:n} \in \mathcal{Z}$ we have $\|\mathcal{M}^{(i)}(x_{1:i-1}, Z_i) - \mathcal{M}^{(i)}(x_{1:i-1}, Z'_i)\| \le s_i$, the mechanism $\mathcal{R}^{(a,b)}(y_{(c,d) < (a,b)}, \cdot)$ has sensitivity bounded by $s_{(a,b)} := \max_{i \in [a,b]} s_i$. Therefore, upon adding Gaussian noise $N(0, \sigma_b^2 I)$, $\mathcal{R}^{(a,b)}(y_{(c,d) < (a,b)}, \cdot)$ is $(\alpha, \epsilon_{(a,b)})$-RDP where $\epsilon_{(a,b)} = \alpha s_{(a,b)}^2 / 2\sigma_b^2$.

Now we are ready to apply Lemma C.1. Let $\mathcal{A}_{\mathcal{R}} : \mathcal{Z} \to \prod \mathcal{S}^{(a,b)}$ be the composition of the $\mathcal{R}^{(a,b)}$: given dataset $Z_{1:n}$, it sequentially computes $y_{(a,b)} = \mathcal{R}^{(a,b)}(y_{(c,d) < (a,b)}, Z_{1:n})$ and outputs $(y_{(a,b)})_{(a,b) \in \mathbb{T}}$. Then by Lemma C.1, $\mathcal{A}_{\mathcal{R}}$ is $(\alpha, \epsilon)$-RDP where
$$\epsilon \le \max_q \sum_{(a,b) \in \mathrm{OUT}(q)} \epsilon_{(a,b)} = \max_q \sum_{(a,b) \in \mathrm{OUT}(q)} \max_{i \in [a,b]} \frac{\alpha s_i^2}{2\sigma_b^2}.$$
Observe that a node $(a, b) \in \mathrm{OUT}(q)$ if and only if the $q$-th datum of $Z_{1:n}$ lies in some $Z_i$ with $i \in [a, b]$. Since there is exactly one such node in each of the $\lceil\log_2 n\rceil + 1$ layers of the binary tree associated with $\mathbb{T}$, we have $|\mathrm{OUT}(q)| \le \lceil\log_2 n\rceil + 1 \le 2\ln n$. Consequently,
$$\epsilon \le 2\alpha \ln n \max_{b \in [n],\, i \le b} \frac{s_i^2}{2\sigma_b^2} = \frac{\alpha}{2}\Big(\sqrt{2\ln n}\,\max_{b \in [n],\, i \le b} \frac{s_i}{\sigma_b}\Big)^2 = \frac{\alpha\rho^2}{2}.$$
Finally, we conclude the proof by claiming that the algorithm $\mathcal{A}$ defined in the theorem can be derived from $\mathcal{A}_{\mathcal{R}}$ by post-processing. Observe that the intervals in $\mathrm{NODE}(i)$ partition $[1, i]$, so by equation 5,
$$\sum_{(a,b) \in \mathrm{NODE}(i)} y_{(a,b)} = \sum_{(a,b) \in \mathrm{NODE}(i)} \Big[\sum_{j=a}^{b} \mathcal{M}^{(j)}(x_{1:j-1}, Z_j) + \xi_b\Big] = \sum_{j=1}^{i} \mathcal{M}^{(j)}(x_{1:j-1}, Z_j) + \mathrm{TREE}(i).$$
Therefore, $\mathcal{A}$ is indeed a post-processing of $\mathcal{A}_{\mathcal{R}}$, which concludes the proof.
C.2 PROOF OF COROLLARY 4.4

Corollary 4.4. Suppose $f(x, z)$ is differentiable and $L$-Lipschitz in $x$ and the domain of $\mathcal{A}$ is bounded by $D = \delta/T$. Let $B_1, B_2$ satisfy $B_1 \ge TB_2/2$. Then for any $\alpha > 1$, $\rho > 0$, Algorithm 3 is $(\alpha, \alpha\rho^2/2)$-RDP if we set the noise levels $\sigma_{1:T}$ in the tree mechanism as $\sigma_t = \frac{4\sqrt{2\ln T}\, dL}{B_2 T \rho}$ for all $t \in [T]$.

Proof. Since the private oracle defined in Algorithm 4 uses disjoint data in different iterations $k \in [K]$, Algorithm 3 is a post-processing of $\{(\tilde g_1^k, \dots, \tilde g_T^k)\}_{k \in [K]}$. Therefore, it suffices to prove that for each $k \in [K]$, the composition $(\tilde g_1^k, \dots, \tilde g_T^k)$ is $(\alpha, \alpha\rho^2/2)$-RDP. The key is to observe that $(\tilde g_1^k, \dots, \tilde g_T^k)$ can be written as the adaptive composition in Theorem 4.3. Let $g_1^k = \mathcal{M}^{(1)}(Z_1^k) := \mathrm{GRAD}_{f,\delta}(w_1^k, Z_1^k)$ and $d_i^k = \mathcal{M}^{(i)}(\tilde g_{1:i-1}^k, Z_i^k) := \mathrm{DIFF}_{f,\delta}(w_i^k, w_{i-1}^k, Z_i^k)$. This is well-defined because $w_i^k$ is a post-processing of $\tilde g_{1:i-1}^k$ by the construction of O2NC. Therefore, $\tilde g_t^k = g_1^k + \sum_{i=2}^t d_i^k + \mathrm{TREE}(t) = \sum_{i=1}^t \mathcal{M}^{(i)}(\tilde g_{1:i-1}^k, Z_i^k) + \mathrm{TREE}(t)$, and Theorem 4.3 can be applied.

By Lemma 3.1, the sensitivity of $\mathcal{M}^{(1)}$ is bounded by $s_1 = \frac{2dL}{B_1}$. By Remark 4.1 and Lemma 3.2, the sensitivity of $\mathcal{M}^{(i)}(\tilde g_{1:i-1}^k, \cdot)$ is bounded by $s_i = \frac{2dL}{B_2\delta}\|w_i^k - w_{i-1}^k\| \le \frac{4dL}{B_2 T}$ for all $i \ge 2$. Since we assume $B_1 \ge TB_2/2$, it holds that $s_1 \le s_t$ for all $t \ge 2$. Therefore, upon setting $\sigma_t = \sqrt{2\ln T}\, s_2/\rho = \frac{4\sqrt{2\ln T}\, dL}{B_2 T \rho}$ for all $t \in [T]$, Theorem 4.3 implies that $(\tilde g_1^k, \dots, \tilde g_T^k)$ is $(\alpha, \alpha\rho'^2/2)$-RDP where
$$\rho' = \sqrt{2\ln T}\,\max_{b \in [T],\, i \le b} \frac{s_i}{\sigma_b} \le \sqrt{2\ln T}\cdot\frac{s_2}{\sigma_t} = \rho.$$

C.3 PROOF OF COROLLARY 4.5

Corollary 4.5. Following the assumptions and definitions in Corollary 4.4, for all $t \in [T]$,
$$\mathbb{E}\|\mathrm{TREE}(t)\|^2 \le 2\ln(t)\,d\sigma_t^2 \le \Big(\frac{8\ln(T)\,d^{3/2}L}{B_2 T \rho}\Big)^2.$$

Proof. Recall that $\mathrm{TREE}(t) = \sum_{(\cdot, i) \in \mathrm{NODE}(t)} \xi_i$. Since the $\xi_i$ are independent and $\xi_i \sim N(0, \sigma_i^2 I)$,
$$\mathbb{E}\|\mathrm{TREE}(t)\|^2 = \sum_{(\cdot, i) \in \mathrm{NODE}(t)} \mathbb{E}\|\xi_i\|^2 = \sum_{(\cdot, i) \in \mathrm{NODE}(t)} d\sigma_i^2 \le 2\ln(t)\,d\sigma_t^2,$$
where the last inequality follows from Remark 4.2, i.e., $|\mathrm{NODE}(t)| \le 2\ln(t)$. We then conclude the proof by substituting the value of $\sigma_t$ defined in Corollary 4.4.

D MISSING PROOFS IN SECTION 4.3

Lemma 4.6 (restated). For any function $F : \mathbb{R}^d \to \mathbb{R}$ that is differentiable with $F(x_1) - \inf_x F(x) \le F^\ast$, if the domain of the OCO algorithm $\mathcal{A}$ is bounded by $D = \delta/T$, then
$$\mathbb{E}\|\nabla F(\bar w)\|_\delta \le \frac{F^\ast + \sum_{k=1}^K \mathbb{E}\,\mathrm{Reg}_T(u^k) + \sum_{k=1}^K\sum_{t=1}^T \mathbb{E}\langle \nabla F(w_t^k) - g_t^k,\ \Delta_t^k - u^k\rangle}{DKT}, \quad \text{where } u^k = -D\,\frac{\sum_{t=1}^T \nabla F(w_t^k)}{\big\|\sum_{t=1}^T \nabla F(w_t^k)\big\|}.$$

Proof. First, by the fundamental theorem of calculus, we have
$$F(x_{t+1}^k) - F(x_t^k) = \int_0^1 \big\langle \nabla F\big(x_t^k + s(x_{t+1}^k - x_t^k)\big),\ x_{t+1}^k - x_t^k\big\rangle\,ds = \int_0^1 \big\langle \nabla F(x_t^k + s\Delta_t^k),\ \Delta_t^k\big\rangle\,ds = \mathbb{E}_{s_t^k}\big\langle \nabla F(w_t^k),\ \Delta_t^k\big\rangle.$$
Next, upon taking the telescoping sum (and recalling $x_1^{k+1} = x_{T+1}^k$), we have
$$\mathbb{E}F(x_{T+1}^K) - F(x_1^1) = \sum_{k=1}^K\sum_{t=1}^T \mathbb{E}\big[F(x_{t+1}^k) - F(x_t^k)\big] = \sum_{k=1}^K\sum_{t=1}^T \mathbb{E}\big\langle \nabla F(w_t^k),\ \Delta_t^k\big\rangle.$$
Note that $\langle \nabla F(w_t^k), \Delta_t^k\rangle = \langle \nabla F(w_t^k), u^k\rangle + \langle g_t^k, \Delta_t^k - u^k\rangle + \langle \nabla F(w_t^k) - g_t^k, \Delta_t^k - u^k\rangle$. Therefore, by the definitions of $u^k$ and $\mathrm{Reg}_T(u^k)$,
$$\sum_{t=1}^T \big\langle \nabla F(w_t^k),\ \Delta_t^k\big\rangle = -D\Big\|\sum_{t=1}^T \nabla F(w_t^k)\Big\| + \mathrm{Reg}_T(u^k) + \sum_{t=1}^T \big\langle \nabla F(w_t^k) - g_t^k,\ \Delta_t^k - u^k\big\rangle.$$
Upon rearranging the inequality and applying $F(x_{T+1}^K) - F(x_1^1) \ge -F^\ast$, we have
$$D\,\mathbb{E}\sum_{k=1}^K \Big\|\sum_{t=1}^T \nabla F(w_t^k)\Big\| \le F^\ast + \sum_{k=1}^K \mathbb{E}\,\mathrm{Reg}_T(u^k) + \mathbb{E}\sum_{k=1}^K\sum_{t=1}^T \big\langle \nabla F(w_t^k) - g_t^k,\ \Delta_t^k - u^k\big\rangle.$$
Finally, note that $\|w_t^k - \bar w^k\| \le DT = \delta$, so $\|\nabla F(\bar w^k)\|_\delta \le \frac{1}{T}\|\sum_{t=1}^T \nabla F(w_t^k)\|$ by Definition 2.1, since $\frac{1}{T}\sum_t \nabla F(w_t^k)$ is a convex combination of gradients at points within distance $\delta$ of $\bar w^k$. Moreover, $\mathbb{E}\|\nabla F(\bar w)\|_\delta = \mathbb{E}\big[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_\delta\big]$. Therefore, dividing both sides of the inequality by $DKT$ concludes the proof.

Corollary 4.7 (restated). Suppose $F : \mathbb{R}^d \to \mathbb{R}$ is differentiable, $L$-Lipschitz, and $F(x_1) - \inf_x F(x) \le F^\ast$, and suppose the domain of $\mathcal{A}$ is bounded by $D = \delta/T$. Then
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \le \frac{F^\ast + 2L\delta + \sum_{k=1}^K \mathbb{E}\,\mathrm{Reg}_T(\hat u^k) + \sum_{k=1}^K\sum_{t=1}^T \mathbb{E}\langle \nabla \hat F_\delta(w_t^k) - \tilde g_t^k,\ \Delta_t^k - \hat u^k\rangle}{DKT}, \quad \text{where } \hat u^k = -D\,\frac{\sum_{t=1}^T \nabla \hat F_\delta(w_t^k)}{\big\|\sum_{t=1}^T \nabla \hat F_\delta(w_t^k)\big\|}.$$

Proof. By Lemma 2.2, $\hat F_\delta$ is differentiable. Next, since $\|\hat F_\delta - F\|_\infty \le L\delta$,
$$\hat F_\delta(x_1) - \inf_x \hat F_\delta(x) \le F(x_1) - \inf_x F(x) + 2L\delta = F^\ast + 2L\delta.$$
Hence, upon applying Lemma 4.6 to the uniform smoothing $\hat F_\delta$ (with gradient estimates $\tilde g_t^k$), we bound $\mathbb{E}\|\nabla \hat F_\delta(\bar w)\|_\delta$ by the right-hand side above. Finally, by Corollary 2.3, $\mathbb{E}\|\nabla F(\bar w)\|_{2\delta}$ is also bounded by the same right-hand side, which concludes the proof.
Lemma 4.8 (restated). Suppose $f(x, z)$ is differentiable and $L$-Lipschitz in $x$ and the domain of $\mathcal{A}$ is bounded by $D = \delta/T$. Then the variance of the gradient oracle $\mathcal{O}$ (Algorithm 4) is bounded by
$$\mathbb{E}\|\nabla \hat F_\delta(w_t^k) - g_t^k\|^2 \le \frac{16dL^2}{B_1} + \frac{64dL^2}{B_2 T}.$$

Proof. First, we recall a martingale concentration bound. Let $v_1, \dots, v_n$ be a sequence of random vectors and let $\mathcal{F}_i = \sigma(v_{1:i})$ be the $\sigma$-algebra generated by $v_{1:i}$. If $\mathbb{E}[v_i \mid \mathcal{F}_{i-1}] = 0$, then for all cross terms with $i > j$, $v_j$ is $\mathcal{F}_{i-1}$-measurable and thus $\mathbb{E}\langle v_i, v_j\rangle = \mathbb{E}\langle \mathbb{E}[v_i \mid \mathcal{F}_{i-1}], v_j\rangle = 0$. Consequently, $\mathbb{E}\|\sum_{i=1}^n v_i\|^2 = \sum_{i=1}^n \mathbb{E}\|v_i\|^2$.

In our case, note that $\nabla \hat F_\delta(w_t^k) = \nabla \hat F_\delta(w_1^k) + \sum_{i=2}^t \big[\nabla \hat F_\delta(w_i^k) - \nabla \hat F_\delta(w_{i-1}^k)\big]$ and recall that $g_t^k = g_1^k + \sum_{i=2}^t d_i^k$, where each term uses a fresh data batch. Therefore, by the concentration identity and Lemmas 3.1 and 3.2,
$$\mathbb{E}\|\nabla \hat F_\delta(w_t^k) - g_t^k\|^2 \le \mathbb{E}\|\nabla \hat F_\delta(w_1^k) - g_1^k\|^2 + \sum_{i=2}^t \mathbb{E}\big\|\nabla \hat F_\delta(w_i^k) - \nabla \hat F_\delta(w_{i-1}^k) - d_i^k\big\|^2 \le \frac{16dL^2}{B_1} + \frac{16dL^2}{B_2\delta^2}\sum_{i=2}^t \mathbb{E}\|w_i^k - w_{i-1}^k\|^2.$$
We conclude the proof by Remark 4.1: $\|w_i^k - w_{i-1}^k\| \le 2D = \frac{2\delta}{T}$, so the sum is at most $(t - 1)\frac{4\delta^2}{T^2} \le \frac{4\delta^2}{T}$.

Algorithm 6 Naive private gradient oracle
Input: dataset $Z$, constant $B$, noise $\sigma$.
Initialize: Partition $Z$ into $KT$ subsets $Z_t^k$ of size $B$. The data size is $|Z| = M = BKT$.
1: Upon receiving round indices $k, t$ and parameter $w_t^k$:
2: Sample $u_1, \dots, u_B \sim \mathrm{Uniform}(S)$ i.i.d.
3: $g_t^k \leftarrow \frac{1}{B}\sum_{i=1}^B \frac{d}{2\delta}\big(f(w_t^k + \delta u_i, z_i) - f(w_t^k - \delta u_i, z_i)\big)u_i$, where $z_i$ is the $i$-th data point in $Z_t^k$.
4: Return $\tilde g_t^k \leftarrow g_t^k + \xi_t^k$, where $\xi_t^k \sim N(0, \sigma^2 I)$ i.i.d.

E COMPARISON WITH NAIVE APPROACH

To make the discussion from Section 4.1 concrete, we delve into the analysis of the naive private algorithm built on O2NC. Specifically, this algorithm replaces the gradient oracle (line 5 of Algorithm 3) with a direct zeroth-order estimator, as defined in Algorithm 6. As a remark, $g_t^k$ is an unbiased estimator and $\mathbb{E}\|g_t^k\|^2 \le L^2(\frac{d}{B} + 1)$ (Kornowski & Shamir, 2023, Remark 6). Moreover, following the same argument as the sensitivity bound in Lemma 3.1, the sensitivity of $g_t^k$ is bounded by $O(dL/B)$. Consequently, if we set $\sigma = dL/B\rho$, each $\tilde g_t^k$ achieves $(\alpha, \alpha\rho^2/2)$-RDP.

Although the algorithm appears more straightforward, its convergence analysis is less favorable. Following the proof idea of Theorem 4.9, we first recall the result of Corollary 4.7:
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \le \frac{F^\ast + 2L\delta + \sum_k \mathbb{E}\,\mathrm{Reg}_T(\hat u^k) + \sum_{k,t} \mathbb{E}\langle \nabla \hat F_\delta(w_t^k) - \tilde g_t^k,\ \Delta_t^k - \hat u^k\rangle}{DKT}.$$
Recall that $\mathbb{E}[\mathrm{Reg}_T(\hat u^k)] \le D\sqrt{\sum_{t=1}^T \mathbb{E}\|\tilde g_t^k\|^2}$. Therefore, we can bound the regret of OSD by
$$\mathbb{E}[\mathrm{Reg}_T(\hat u^k)] \le D\sqrt{\sum_{t=1}^T \mathbb{E}\|g_t^k\|^2 + \mathbb{E}\|\xi_t^k\|^2} \le D\sqrt{\sum_{t=1}^T L^2\Big(\frac{d}{B} + 1\Big) + \frac{d^3L^2}{B^2\rho^2}} \lesssim DL\sqrt{T}\Big(\sqrt{\tfrac{d}{B}} + 1 + \tfrac{d^{3/2}}{B\rho}\Big).$$
Next, we bound the sum of inner products. It is important to note that $\mathbb{E}\langle \nabla \hat F_\delta(w_t^k) - g_t^k, \Delta_t^k\rangle = 0$ because $\Delta_t^k$ is independent of $Z_t^k$. Therefore, using the martingale concentration bound,
$$\sum_{t=1}^T \mathbb{E}\big\langle \nabla \hat F_\delta(w_t^k) - \tilde g_t^k,\ \Delta_t^k - \hat u^k\big\rangle = -\mathbb{E}\Big\langle \sum_{t=1}^T \big(\nabla \hat F_\delta(w_t^k) - \tilde g_t^k\big),\ \hat u^k\Big\rangle \le D\sqrt{\sum_{t=1}^T \mathbb{E}\|\nabla \hat F_\delta(w_t^k) - g_t^k\|^2 + \mathbb{E}\|\xi_t^k\|^2} \lesssim DL\sqrt{T}\Big(\sqrt{\tfrac{d}{B}} + \tfrac{d^{3/2}}{B\rho}\Big).$$
Upon substituting back into Corollary 4.7, we have
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \lesssim \frac{F^\ast + L\delta}{DKT} + \frac{L}{\sqrt{T}}\Big(\sqrt{\tfrac{d}{B}} + 1 + \tfrac{d^{3/2}}{B\rho}\Big).$$
Upon replacing $D = \delta/T$ and $M = KBT$, we have
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \lesssim \frac{(F^\ast + L\delta)BT}{\delta M} + \frac{L}{\sqrt{T}}\Big(\sqrt{\tfrac{d}{B}} + 1 + \tfrac{d^{3/2}}{B\rho}\Big).$$
To achieve the optimal non-private rate $O(d\delta^{-1}\epsilon^{-3})$, we need to set $B = 1$. Therefore, upon setting $T = \min\big\{\big(\frac{\sqrt{d}\,L\delta M}{F^\ast + L\delta}\big)^{2/3}, \big(\frac{d^{3/2}L\delta M}{(F^\ast + L\delta)\rho}\big)^{2/3}\big\}$, we have
$$\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \lesssim \Big(\frac{(F^\ast + L\delta)dL^2}{\delta M}\Big)^{1/3} + \Big(\frac{(F^\ast + L\delta)d^{3/2}L^2}{\rho\delta M}\Big)^{1/3}.$$
Equivalently, $\mathbb{E}\|\nabla F(\bar w)\|_{2\delta} \le \epsilon$ if $M = \tilde\Omega\big(d\big(\frac{F^\ast L^2}{\delta\epsilon^3}\big) + d^{3/2}\big(\frac{F^\ast L^2}{\rho\delta\epsilon^3}\big)\big)$, whose dominating term is $\tilde\Omega(d^{3/2}\rho^{-1}\delta^{-1}\epsilon^{-3})$. In contrast, our carefully designed algorithm only requires $\tilde\Omega(d\delta^{-1}\epsilon^{-3} + d^{3/2}\rho^{-1}\delta^{-1}\epsilon^{-2})$ samples, as proved in Theorem 4.9. Notably, the naive algorithm suffers an additional cost of $\epsilon^{-1}$ for privacy.
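For completeness, here is a minimal sketch of the naive oracle of Algorithm 6 (the function name and parameters `L`, `rho` are our illustrative additions). The point of contrast with the `PrivateOracle` sketch above is the per-query noise scale: here it is calibrated to the $O(dL/B)$ sensitivity of a single query, whereas the tree mechanism only pays for noise of scale roughly $dL/(T\rho)$ per query.

```python
import numpy as np

def naive_oracle(f, w, batch, delta, L, rho, rng=np.random.default_rng(6)):
    """Sketch of Algorithm 6: one two-point estimate per data point, plus
    fresh Gaussian noise at every query."""
    B, d = len(batch), len(w)
    g = np.zeros(d)
    for z in batch:
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                         # u_i ~ Uniform(S)
        g += (d / (2 * delta)) * (f(w + delta * u, z) - f(w - delta * u, z)) * u
    g /= B
    sigma = d * L / (B * rho)                          # calibrated to O(dL/B) sensitivity
    return g + sigma * rng.standard_normal(d)          # g~_t^k = g_t^k + xi_t^k
```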