# differentially_private_optimization_with_sparse_gradients__e3012063.pdf

Differentially Private Optimization with Sparse Gradients

Badih Ghazi Google Research badihghazi@google.com

Cristóbal Guzmán Google Research and Pontificia Universidad Católica de Chile crguzman@google.com

Pritish Kamath Google Research pritishk@google.com

Ravi Kumar Google Research ravi.k53@gmail.com

Pasin Manurangsi Google Research pasin@google.com

Motivated by applications of large embedding models, we study differentially private (DP) optimization problems under sparsity of individual gradients. We start with new near-optimal bounds for the classic mean estimation problem but with sparse data, improving upon existing algorithms particularly for the highdimensional regime. The corresponding lower bounds are based on a novel blockdiagonal construction that is combined with existing DP mean estimation lower bounds. Next, we obtain pureand approximate-DP algorithms with almost optimal rates for stochastic convex optimization with sparse gradients; the former represents the first nearly dimension-independent rates for this problem. Furthermore, by introducing novel analyses of bias reduction in mean estimation and randomly-stopped biased SGD we obtain nearly dimension-independent rates for near-stationary points for the empirical risk in nonconvex settings under approximate-DP.

1 Introduction

The pervasiveness of personally sensitive data in machine learning applications (e.g., advertising, public policy, and healthcare) has led to the major concern of protecting users data from their exposure. When releasing or deploying these trained models, differential privacy (DP) offers a rigorous and quantifiable guarantee on the privacy exposure risk [1].

Consider neural networks whose inputs have categorical features with large vocabularies. These features can be modeled using embedding tables; namely, for a feature that takes K distinct values, we create trainable parameters w1, . . . , w K Rk, and use wa as input to the neural network when the corresponding input feature is a. A natural outcome of such models is that the per-example gradients are guaranteed to be sparse; when the input feature is a, then only the gradient with respect to wa is non-zero. Given the prevalence of sparse gradients in practical deep learning applications, GPUs/TPUs that are optimized to leverage gradient sparsity are commercially offered and widely used in industry [2, 3, 4, 5]. To leverage gradient sparsity, recent practical work has considered DP stochastic optimization with sparse gradients for large embedding models for different applications including recommendation systems, natural language processing, and ads modeling [6, 7].

Despite its relevance and promising empirical results, there is limited understanding of the theoretical limits of DP learning under gradient sparsity. This gap motivates our work.

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

Setting Upper bound Lower bound

sd εn (Thm. 3.2) 1

s ln(d/(εn))

sd εn (Thm. 4.1)

(ε, δ)-DP 1 (s ln(d/s) ln(1/δ))1/4

εn (Thm. B.1) 1 (s ln(1/δ))1/4

εn (Thm. 4.5)

Table 1: Rates for DP mean estimation with sparse data of unit ℓ2-norm. Bounds stated for constant success/failure probability, resp. We use a b to denote min(a, b). New results highlighted .

Setting Guarantee

New Upper bound (sparse)

Upper bound (non-sparse)

Convex ERM (s ln(d) ln(1/δ))1/4

εn Rε,δ (Thm. 5.4, 6.1) Rε,δ

SCO (s ln(d) ln(1/δ))1/4

εn Rε,δ + 1 n (Thm. 6.3) Rε,δ + 1 n

Rε (Thm. 6.1, G.4) Rε

Rε + 1 n (Thm. 6.3) Rε + 1 n

(ε, δ)-DP Emp. Grad. Norm (s ln(d/s) ln3(1/δ))1/8

(εn)1/4 Rε,δ 2/3 (Thm. 5.4)

Table 2: Rates for DP optimization with sparse gradients, compared to best-existing upper bounds in the non-sparse case. In the above, the bounds are stated for constant success probability, the function parameters and polylog(n) factors are omitted, Rε,δ = p

d ln(1/δ)/(εn), Rε = d/(εn), and our improvements are highlighted .

1.1 Our Results

We initiate the study of DP optimization under gradient sparsity. More precisely, we consider a stochastic optimization (SO) problem, min{FD(x) : x X}, where X Rd is a convex set, and FD(x) = Ez D[f(x, z)], with f( , z) enjoying some regularity properties, and D is a probability measure supported on a set Z. Our main assumption is gradient sparsity: for an integer 0 s d,

x X, z Z : f(x, z) 0 s ,

where y 0 denotes the number of nonzero entries of y. We also study empirical risk minimization (ERM), where given a dataset S = (z1, . . . , zn) we aim to minimize FS(x) := 1

i [n] f(x, zi).

Our results unearth three regimes of accuracy rates for the above setting: (i) the small dataset size regime where the optimal rate is constant, (ii) the large dataset size where the optimal rates are polynomial in the dimension, and (iii) an intermediate dataset size regime characterized by a new high-dimensional rate1 (see Table 1 and Table 2, for precise rates). These results imply in particular that even for high-dimensional models, this problem is tractable under gradient sparsity. Without sparsity, these polylogarithmic rates is impossible due to known lower bounds [8].

In Section 3, we start with the fundamental task of ℓ2-mean estimation with sparse data (which reduces to ERM with sparse linear losses [8]). Here, we obtain new upper bounds (see Table 1). These rates are obtained by adapting the projection mechanism [9], with a convex relaxation that makes our algorithms efficient. Note that for pure-DP, even our large dataset rate of

sd/(εn) can be substantially smaller than the dense pure-DP rate of d/(εn) [8], whenever s d. For approximate DP we also obtain a sharper upper bound by solving an ℓ1-regression problem of a noisy projection of the empirical mean over a random subspace. Its analysis combines ideas from compressed sensing [10] with sparse approximation via the Approximate Carathéodory Theorem [11].

In Section 4, we prove lower bounds that show the near-optimality of our algorithms. For pure-DP, we obtain a new lower bound of Ω(s log(d/s)/(nε)), which is based on a packing of sparse vectors.

1We will generally refer to high-dimensional or nearly dimension-independent rates indistinguishably, meaning more precisely that the rates scale polylogarithmically with the dimension.

While this lower bound looks weaker than the standard Ω(d/(nε)) lower bound based on dense packings [12, 8], we design a novel bootstrapping via a block diagonal construction where each block contains a sparse lower bound as above. This, together with a padding argument [8], yields lower bounds for the three regimes of interest. For approximate-DP, we also use the block diagonal bootstrapping, where this time the blocks use classical fingerprinting codes in dimension s [8, 13]. Our approximate-DP lower bounds, however, have a gap of ln(d/s)1/4 in the high-dimensional regime; we conjecture that the aforementioned compressed sensing-based upper bound is tight.

In Section 5, we study DP-ERM with sparse gradients, under approximate-DP. We propose the use of stochastic gradient (SGD) with a mean estimation gradient oracle based on the results in Section 3. This technique yields nearly-tight bounds in the convex case (similar to first row of Table 2), and for the nonconvex case the stationarity rates are nearly dimension independent (last row of Table 2). The main challenge here is the bias in mean estimation, which dramatically deteriorates the rates of SGD. Hence we propose a bias reduction method inspired by the simulation literature [14]. This technique uses a random batch size in an exponentially increasing schedule and a telescopic estimator of the gradient which used in conjunction with our DP mean estimation methods provides a stochastic first-order oracle that attains bias similar to the one of a full-batch algorithm, with moderately bounded variance. Note that using the full-batch in this case would lead to polynomially weaker rates; in turn, our method leverages the batch randomization to conduct a more careful privacy accounting based on subsampling and the fully-adaptive properties of DP [15]. The introduction of random batch sizes and the random evolution of the privacy budget leads to various challenges in analyzing the performance of SGD. First, we analyze a randomly stopped method, where the stopping time dictated by the privacy budget. Noting that the standard SGD analysis bounds the cumulative regret, which is a submartingale, we carry out this analysis by integrating ideas from submartingales and stopping times [16]. Second, this analysis only yields the desired rates with constant probability. Towards high probability results, we leverage a private model selection [17] based on multiple runs of randomly-stopped SGD that exponentially boosts the success probability (details in Appendix F).

In Section 6, we study further DP-SO and DP-ERM algorithms for the convex case. Our algorithms are based on regularized output perturbation with an ℓ projection post-processing step. While this projection step is rather unusual, its role is clear from the analysis: it leverages the ℓ bounds of noise addition, which in conjunction with convexity provides an error guarantee that also leverages the gradient sparsity. This algorithm is nearly-optimal for approximate-DP. For pure-DP, the previous algorithm requires an additional smoothness assumption, hence we propose a second algorithm based on the exponential mechanism [18] run over a net of suitably sparse vectors. Neither of the pure-DP algorithms matches the lower bound for mean estimation (the gap in the exponent of the rate is of 1/6), but they attain the first nearly dimension-independent rates for this problem.

1.2 Related Work

DP optimization is an extensively studied topic for over a decade (see [8, 19, 20], and the references therein). In this field, some works have highlighted the role of model sparsity (e.g., using sparsitypromoting ℓ1-ball constraints) in near-dimension independent excess-risk rates for DP optimization, both for ERM and SCO [21, 22, 23, 24, 25, 26, 27]. These settings are unrelated to ours, as sparse predictors are typically related to dense gradients.

Another proposed assumption to mitigate the impact of dimension in DP learning is that gradients lie (approximately) in a low dimensional subspace [28, 29, 30, 31] or where dimension is substituted by a bound on the trace of the Hessian of the loss [32]. These useful results are unfortunately not applicable to our setting of interest, as we are interested in arbitrary gradient sparsity patterns for different datapoints.

Substantially less studied is the role of gradient sparsity. Closely related to our work, [6] studied approximate DP-ERM under gradient sparsity, with some stronger assumptions. Aside from an additional ℓ bound on individual gradients, the following partitioning sparsity assumption is imposed. The dataset S can be uniformly partitioned into subsets S1, . . . , Sm with a uniform gradient sparsity bound: for all k [m] and x X, P z Sk f(x, z) 0 c1. The work shows polylogarithmic in the dimension rates, for both convex and nonconvex settings. Our results only assume individual gradient sparsity, so on top of being more general, they are also faster and provably nearly optimal in the convex case. Another relevant work is [7], which studies the computational and utility benefits for DP with sparse gradients in neural networks with embedding tables. With the

caveat that variable selection on stochastic gradients is performed at the level of contributing buckets (i.e., rows of the embedding table), rather than on gradient coordinates, this work shows substantial improvements on computational efficiency and also on the resulting utility.

In [33], bias reduction is used to mitigate the regularization bias in SCO. While they also borrow inspiration from [14], both their techniques and scope are unrelated to ours.

1.3 Future Directions

We present some of the main open questions and future directions of this work. First, we conjecture that for approximate-DP mean estimation similarly to the pure-DP case a lower bound Ω p

s log(d/s) ln(1/δ)/[nε] should exist; such construction could be bootstrapped with a blockdiagonal dataset for a tight lower bound (Lemma 4.3). Second, for pure DP-SCO, we believe an algorithm should exist that achieves rates analogous to those for mean estimation. Unfortunately, most of variants of output perturbation (including phasing [20, 24, 34]) cannot attain such rates. From a practical perspective, the main open question is whether our rates are attainable without prior knowledge of s; note that all our mean estimation algorithms (which carries over to our optimization results) depend crucially on knowledge of this parameter. While we can treat s as a hyperparameter, it would be highly beneficial to design algorithms that automatically adapt to it.

We believe our bias reduction is of broader interest. For example, [35, 36] have shown strong negative results about bias in DP mean estimation. While similar lower bounds may hold for sparse estimation, bias reduction allows us to amortize this error within an iterative method, preventing error accumulation.

Finally, there is no evidence of our nonconvex rate being optimal. In this vein, we should remark that even in the dense case the optimal stationarity rates are still open [37].

2 Notation and Preliminaries

In this work, = 2 is the standard Euclidean norm on Rd. We will also make use of ℓp-norms, where x p := P j [d] |xj|p 1/p for 1 p . For p = 0, we use the notation x 0 = |{j [d] : xj = 0}|, i.e., the size of the support of x. We denote the r-radius ball centered at x of the p-norm in Rd by Bd p(x, r) := {y Rd : y x p r}. Given s [d] and L > 0, the set of s-sparse vectors is (the scaling factor L is omitted in the notation for brevity)

Sd s := {x Rd : x 0 s, x 2 L}. (1)

Note that Jensen s inequality implies: if x 0 s and 1 p < q , then x p s1/p 1/q x q.

Remark 2.1. The upper bound results in this paper hold even if we replace the set Sd s of sparse vectors by the strictly larger ℓ1-ball Bd 1(0, L s). Note that while our upper bounds extend to the ℓ1 assumption above, our lower bounds work under the original sparsity assumption.

Let f : X Z 7 R be a loss function. The function evaluation f(x, z) represents the loss incurred by hypothesis x X on datapoint z Z. In stochastic optimization (SO), we consider a data distribution D, and our goal is to minimize the expected loss under this distribution

minx X n FD(x) := Ez D[f(x, z)] o . (SO)

Throughout, we use x (D) to denote an optimal solution to (SO), which we assume exists. In the empirical risk minimization (ERM) problem, we consider sample datapoints S = (z1, . . . , zn) and our goal is to minimize the empirical error with respect to the sample

minx X n FS(x) := 1

i [n] f(x, zi) o . (ERM)

We denote by x (S) an arbitrary optimal solution to (ERM), which we assume exists. Even when S is drawn i.i.d. from D, solutions (or optimal values) of (SO) and (ERM) do not necessarily coincide.

We present the definition of differential privacy (DP), deferring useful properties and examples to Appendix A. Let Z be a sample space, and X an output space. A dataset is a tuple S Zn, and datasets S, S Zn are neighbors (denoted as S S ) if they differ in only one of their entries.

Definition 2.2 (Differential Privacy). Let A : Zn 7 X. We say that A is (ε, δ)-(approximately) differentially private (DP) if for every pair S S , we have for all E X that Pr[A(S) E] eε Pr[A(S ) E] + δ. When δ = 0, we say that A is ε-DP or pure-DP.

3 Upper Bounds for DP Mean Estimation with Sparse Data

We first study DP mean estimation with sparse data. Our first result is that the projection mechanism [9] is nearly optimal, both for pureand approximate-DP. In our case, we interpret the marginals on each of the d dimensions as the queries of interest: this way, the ℓ2-error on private query answers corresponds exactly to the ℓ2-norm estimation error. A key difference to the approach in [9] and related works is that we project the noisy answers onto the set K := Bd 1(0, L s), which is a (coarse) convex relaxation of conv(Sd s ). This is crucial to make our algorithm efficiently implementable. Due to space limitations, proofs from this section have been deferred to Appendix B.

Algorithm 1 Projection_Mechanism( z(S), ε, δ, n)

Require: Vector z(S) = 1

n Pn i=1 zi from dataset S (Sd s )n; ε, δ 0, privacy parameters

z = z(S) + ξ, with ξ

Lap(σ) d with σ = 2L s

nε if δ = 0 ,

N(0, σ2I) with σ2 = 8L2 ln(1.25/δ)

(nε)2 if δ > 0 .

return ˆz = argmin{ z z 2 : z K}, where K := Bd 1(0, L s)

Lemma 3.1. In Algorithm 1, it holds that ˆz z(S) 2 p

2L ξ s, almost surely.

We now provide the privacy and accuracy guarantees of Algorithm 1. Theorem 3.2. For δ = 0, Algorithm 1 is ε-DP, and with probability 1 β:

ˆz z(S) 2 L min

Theorem 3.3. For δ > 0, Algorithm 1 is (ε, δ)-DP, and with probability 1 β:

ˆz z(S) 2 L min (

ln(1/δ) nε , (s log(1/δ) log(d/β))1/4

Sharper Upper Bound via Compressed Sensing In Appendix B.4 we propose a faster mean estimation approximate-DP algorithm. Its rate nearly matches the lower bound we will prove in Theorem 4.4. We believe that this rate is essentially optimal. This algorithm projects the data average into a low dimensional subspace (via a random projection matrix), and uses compressed sensing to recover a noisy version of this projection: this way, noise provides privacy, which is further boosted by the random projection, and the accuracy follows from an application of the stable and noisy recovery properties of compressed sensing [10], together with the Approximate Carathéodory Theorem.

4 Lower Bounds for DP Mean Estimation with Sparse Data

We provide matching lower bounds to those from Section 3. Moreover, although the stated lower bounds are for mean estimation, known reductions imply analogous lower bounds for DP-ERM and DP-SCO [8, 19]. First, for pure-DP we provide a packing-type construction based on sparse vectors. This is used in a novel block-diagonal construction, which provides the right low/highdimensional transition. On the other hand, for approximate-DP, a block diagonal reduction with existing fingerprinting codes [38, 13], suffices to obtain lower bounds that exhibit a nearly tight low/high-dimensional transition. For simplicity, we consider the case of L = 1, i.e., Sd s = {z Rd : z 0 s, z 2 1}; it is easy to see that any lower bound scales linearly in L. We defer proofs from this section to Appendix C.

4.1 Lower Bounds for Pure-DP

Our main lower bound for pure-DP mechanisms is as follows.

Theorem 4.1. Let ε > 0 and s < d/2. Then the empirical mean estimation problem over Sd s satisfies

inf A : ε-DP sup S (Sd s )n P A(S) z(S) 2 min n 1, q

s log(d/[εn])

The statement above as well as those which follow should be read as for all DP algorithms A, there exists a dataset S, such that the mean estimation error is lower bounded by α(n, d, ε, δ) with

probability at least β(n, d, ε, δ) (where in this case α min n 1, q

s log(d/[εn])

sd εn o and β 1).

We also introduce a strengthening of the worst case lower bound, based on hard distributions. Definition 4.2. We say that a probability µ over Zn induces an (α, β)-distributional lower bound for (ε, δ)-DP mean estimation if inf A: (ε,δ)-DP PS µ,A A(S) z(S) 2 α β.

Note this type of lower bound readily implies a worst case lower bound. On the other hand, while the existence of hard distributions follows by the existence of hard datasets (by Yao s minimax principle), we provide explicit constructions of these distributions, for the sake of clarity.

Theorem 4.1 follows by combining the two results that we provide next. First, and our main technical innovation in the sparse case is a block-diagonal dataset bootstrapping construction, which turns a low-dimensional lower bound into a high-dimensional one. Lemma 4.3 (Block-Diagonal Lower Bound Bootstrapping). Let n0, t N. Let µ be a distribution over (St s)n0 that induces an (α0, ρ0)-distributional lower bound for (ε, δ)-DP mean estimation. Then, for any d t, n n0 and K min n

t , there exists µ over (Sd s )n that induces an (α, ρ)-distributional lower bound for (ε, δ)-DP mean estimation, where α α0n0

n ρ0K and ρ 1 exp( ρ0/8).

Note that the above result needs a base lower bound for which packing-based constructions suffice. Theorem 4.4. Let ε > 0 and s < d/2. Then there exists an (α, ρ)-distributional lower bound for

ε-DP mean estimation over (Sd s )n with α min n 1, s log(d/s)

εn o and ρ = 1/2.

4.2 Lower Bounds for Approximate-DP

While the lower bound for the approximate-DP case is similarly based on the block-diagonal reduction, its base lower bound follows more directly from the dense case. Theorem 4.5. Let ε (0, 1], 2 o(n) δ 1 n1+Ω(1) . Then the empirical mean estimation problem over Sd s satisfies

inf A : (ε,δ)-DP sup S (Sd s )n P A(S) z(S) 2 min n 1, [s ln(1/δ)]1/4

5 Bias Reduction Method for DP-ERM with Sparse Gradients

We now start with our study of DP-ERM with sparse gradients. We defer some proofs to Appendix E. In this section and later, we will impose subsets of the following assumptions:

(A.1) Initial distance: For SCO, x0 x (D) D; for ERM, x0 x (S) D. (A.2) Diameter bound: x y D, for all x, y X. (A.3) Convexity: f( , z) is convex, for all z Z. (A.4) Loss range: f(x, z) f(y, z) B, for all x, y X, z Z. (A.5) Lipschitzness: f( , z) is L-Lipschitz, for all z Z. (A.6) Smoothness: f( , z) is H-Lipschitz, for all z Z. (A.7) Individual gradient sparsity: f(x, z) is s-sparse, for all x X and z Z.

The most natural and popular DP optimization algorithms are based on SGD. Here we show how to integrate the mean estimation algorithms from Section 3 to design a stochastic first-order oracle that can be readily used by any stochastic first-order method. The key challenge here is that estimators from Section 3 are inherently biased, which is known to dramatically deteriorate the convergence rates. Hence, we start by introducing a bias reduction method.

Algorithm 2 Subsampled_Bias-Reduced_Gradient_Estimator(x, S, N, ε, δ) Require: Dataset S = (z1, . . . , zn) Zn, ε, δ > 0 privacy parameters, L-Lipschitz loss f(x, z) with s-sparse gradient, x X, batch size parameter N TGeom(M) with M = log2(n) 1 Let B Unif n 2N+1 , O, E a partition of B with |O| = |E| = 2N, I Unif([n]) G+ N+1(x, B) = Projection_Mechanism( FB(x), ε/4, δ/4, 2N+1) (Algorithm 1) G N(x, O) = Projection_Mechanism( FO(x), ε/4, δ/4, 2N) G N(x, E) = Projection_Mechanism( FE(x), ε/4, δ/4, 2N) G0(x, I) = Projection_Mechanism( f(x, z I), ε/4, δ/4, 1) Return (below pk = P[TGeom(M) = k])

G(x) = 1 p N

G+ N+1(x, B) 1

2 G N(x, O) + G N(x, E) + G0(x, I)

Algorithm 3 Subsampled_Bias-Reduced_Sparse_SGD(x0, S, ε, δ)

Require: Initialization x0 X; Dataset S = (z1, . . . , zn) Zn; ε, δ, privacy parameters; stepsize η > 0; gradient oracle for L-Lipschitz and with s-sparse gradient loss f( , z) t 1

δ Pt 1 s=0 3 2Ns+1+1

2 Pt 1 s=0 3 2Ns+1+1

2 and Pt 1 s=0 3 2Ns+1+1

t t + 1 Nt TGeom(M) where M = log2(n) 1 G(xt) = Subsampled_Bias-Reduced_Gradient_Estimator(xt, S, Nt, ε/8, δ/4) (Alg. 2) xt+1 = ΠX xt ηG(xt)

( x = 1 t+1 Pt s=0 xs if f( , z) is convex ,

xˆt where ˆt Unif({0, . . . , T}) iff( , z) is not convex.

5.1 Subsampled Bias-Reduced Gradient Estimator for DP-ERM

We propose Algorithm 2, inspired by a debiasing technique proposed in [14]. The idea is the following: we know that the projection mechanism2 would provide more accurate gradient estimators with larger sample sizes, and we will see that its bias improves analogously. We choose our batch size as a random variable with exponentially increasing range, and given such a realization we subtract the projection mechanism applied to the whole batch minus the same mechanism applied to both halves of this batch.3 This subtraction, together with a multiplicative and additive correction, results in the expected value of the outcome G(x) corresponding to the estimator with the largest batch size, leading to its expected accuracy being boosted by such large sample size, without necessarily utilizing such amount of data (in fact, the probability of such batch size being picked is polynomially smaller, compared to the smallest possible one). The caveat with this technique, as we will see, relates to a heavy-tailed distribution of outcomes, and therefore great care is needed for its analysis.

Instrumental to our analysis is the following truncated geometric distribution with parameter M N, whose law will be denoted by TGeom(M): we say N TGeom(M) if it is supported on {0, . . . , M}, and takes value k with probability pk := CM/2k, where CM = (2(1 2 (M+1))) 1, is the normalizing constant. Note that 1/2 CM 1, thus it is bounded away from 0 and + .

We propose Algorithm 3, which interacts with the oracle given in Algorithm 2. For convenience, we will denote the random realization from the truncated geometric distribution used in iteration t by Nt. The idea is that, using the fully adaptive composition property of DP [15], we can run the method until our privacy budget is exhausted. Due to technical reasons, related to the bias reduction, we need

2Note that we use the projection mechanism (Algorithm 1) as subroutine for Algorithm 2 only to have a self-contained presentation in the main body of the paper. We will analyze and state the sharper bounds obtained with Algorithm 5 as subroutine. 3We follow the Blanchet-Glynn notation of O and E to denote the odd and even terms for the batch partition [14]; this partitioning is arbitrary.

to shift by one the termination condition in the algorithm. In particular, our algorithm goes over the reduced privacy budget of (ε/2, δ/2). The additional slack in the privacy budget guarantees that even with the extra oracle call the algorithm respects the privacy constraint. Lemma 5.1. Algorithm 3 is (ε, δ)-DP.

5.2 Bias and Moment Estimates for the Debiased Gradient Estimator

We provide bias and second moment estimates for our debiased estimator of the empirical gradient. In summary, we show that this estimator has bias matching that of the full-batch gradient estimator, while at the same time its second moment is bounded by a mild function of the problem parameters.

Lemma 5.2. Let d nε

ln(1/δ) . Algorithm 2, enjoys bias and second moment bounds

E[G(x) FS(x)|x] L[s ln(d/s) ln(1/δ)]1/4

E[ G(x) 2|x] L2 ln(n)

s ln(d/s) ln(1/δ)

Proof. For simplicity, we assume without loss of generality that n is a power of 2, so that 2M+1 = n.

Bias. Let, for k = 0, . . . , M, G+ k+1(x) = E[G+ N+1(x, B) | N = k, x], and

G k (x) = E[G N(x, E) | N = k, x] = E[G N(x, O) | N = k, x],

where the last equality follows from the identical distribution of O and E. Noting further that G+ k (x) = G k (x) (which follows from the uniform sampling and the cardinality of the used datapoints), and using the law of total probability, we have

E[G(x) | x] = PM k=0 G+ k (x) G k 1(x) + E[G0(x, I) | x]

= G+ M+1(x) G 0 (x) + E[G0(x, I) | x]

= E[G+ M+1(x) FS(x)|x] + FS(x),

where we also used that E[G0(x, I) | x] = G 0 (x) (since I is a singleton). Next, by Theorem B.1

E[G(x) | x] FS(x) E[G+ M+1(x) FS(x)|x] L [s ln(d/s) ln(1/δ)]1/4

Second moment bound. Using the law of total probability, and that O, E are a partition of B:

E[ G(x) 2 | x] =

k=0 pk E h 1

pk [G+ N+1(x, B) FB(x)]

G N(x, O) FO(x) + G N(x, E) FE(x) + G0(x, I) 2 x, N = k i

2E[ G0(x, I) 2 | x] + 4

1 pk E h G+ N+1(x, B) FB(x) 2 x, N = k i

1 pk E h G N(x, O) FO(x) 2 + G N(x, E) FE(x) 2 x, N = k i .

We now use Theorem B.1, to conclude that

E h G+ N+1(x, B) FB(x) 2 x, N = k i L2

s ln(d/s) ln(1/δ)

max A {O,E} n E h G N(x, A) FA(x) 2 x, N = k io L2

s ln(d/s) ln(1/δ)

E G0(x, I) 2 x L2

s ln(d/s) ln(1/δ)

Recalling that M + 1 = log2 n and pk = 2 k, these bounds readily imply that E G(x) 2 ν2.

5.3 Accuracy Guarantees for Subsampled Bias-Reduced Sparse SGD

The previous results provide useful information about the privacy, bias, and second-moment of our proposed oracle. Our goal now is to provide excess risk rates for DP-ERM. For this, we need to prove the algorithm runs for long enough, i.e., a lower bound on the stopping time of Algorithm 3,

T := inf n t : ε

2 < ε 2 ln 4

16n 2 1/2 + ε2

16n or δ 4 <

(3 2Ns+1+1)δ

The proof of Theorem 5.2 implies that moments of G increase exponentially in M. This heavy-tailed behavior implies that T may not concentrate strongly enough to obtain high probability lower bounds for T. What we will do instead is showing that with constant probability T behaves as desired.

To justify the approach, let us provide a simple in-expectation bound on how the privacy budget accumulates in the definition of T: letting εt = (3 2Nt+1 + 1)ε/[16n], we have that

E h Pt s=0 ε2 s i = (t+1)ε2

(16n)2 E h (3 2N1+1 + 1)2i 2(t+1)ε2

(16n)2 9E[22(N1+1)] + 1 tε2

where in the last step we used that E 22(N1+1) = CM PM+1 k=1 2k n. This in-expectation analysis can be used in combination with ideas from stopping times to establish bounds for T. Lemma 5.3. Let 0 < δ < 1/n2. Let T be the stopping time defined in eqn. (2). Then, there exists t = Cn/ log(2/δ) (with C > 0 an absolute constant) such that P[T t] 1/4. On the other hand,

n2 (n+1) ln(4/δ) 1 E[T] 64n 9 ln(4/δ).

With our bounds on T, further analysis involving regret bounds on randomly stopped SGD yields the following bounds for convex and nonconvex losses. See Theorem E.2 and Theorem E.3 for details. Theorem 5.4. Consider a (SO) problem under initial distance (Item (A.1)), Lipschitzness (Item (A.5)) and gradient sparsity (Item (A.7)) assumptions.

In the convex case (Item (A.3)), Algorithm 3 satisfies

P h FS(ˆx) FS(x (S)) LD

ln n[s ln(d/s) ln3(1/δ)]1/4

In the nonconvex case, additionally assuming smoothness (Item (A.6)) and the following initial suboptimality assumption: namely, that given our initialization x0 Rd, there exists Γ > 0 such that FS(x0) FS(x (S)) Γ; Algorithm 3 satisfies

P h FS(xˆt) 2 2

ln(n) ln(1/δ) + L2 [s ln(d/s) ln(1/δ)]1/4

Boosting the Confidence of the Bias-Reduced SGD To conclude, in Appendix F we provide a boosting algorithm that can exponentially amplify the success probability of Algorithm 3. The approach is based on making parallel runs of the method and using private model selection to obtain the best performing model.

6 DP Convex Optimization with Sparse Gradients via Regularized Output Perturbation

We conclude our work introducing another class of algorithms that attains nearly optimal rates for approximate-DP ERM and SO in the convex setting. These algorithms are based on solving a regularized ERM problem and privatizing its output by an output perturbation method. The main innovation of this technique is that we reduce the noise error by a -projection. This type of projection leverages the concentration of the noise in high-dimensions. We carry out an analysis that also leverages the convexity of the risk and the gradient sparsity to obtain these rates. The full description is included in Algorithm 4. We defer missing proofs from this section, as well as additional results, to Appendix G.

Algorithm 4 Output_Perturbation Require: Dataset S = (z1, . . . , zn) Zn, ε, δ 0 privacy params., f( , z) L-Lipschitz convex function (if δ = 0 further assume H-smooth) with s-sparse gradient, λ 0 regularization param. Let x λ(S) = argminx X F λ S (x), where F λ S (x) := FS(x) + λ

x = x λ(S) + ξ, with ξ

Lap(σ) d with σ = 2

2s L λεn 2H

λ + 1 if δ = 0 ,

N(0, σ2I) with σ2 = 8L2 ln(1.25/δ)

[λεn]2 if δ > 0 . return ˆx = argminx X x x (breaking ties arbitrarily)

Theorem 6.1. Consider an ERM problem under assumptions: initial distance (Item (A.1)), convexity (Item (A.3)), Lipchitzness (Item (A.5)) and gradient sparsity (Item (A.7)). Then, Algorithm 4 is (ε, δ)-DP, and it satisfies the following excess risk guarantees, for any 0 < β < 1:

If δ = 0, and under the additional assumption of smoothness (A.6) and unconstrained domain,

X = Rd, then selecting λ = L2H

D2 s log(d/β)

εn 1/3 , it holds with probability 1 β that

FS(ˆx) FS(x (S)) L2/3H1/3D4/3 s log(d/β)

If δ > 0 then selecting λ = L

D [s log(1/δ) log(d/β)]1/4

εn , we have with probability 1 β that

FS(ˆx) FS(x (S)) LD (s log(1/δ) log(d/β))1/4

Remark 6.2. For approximate-DP, the theorem above can also be proved if we replace assumption (Item (A.1)) by the diameter assumption (Item (A.2)). On the other hand, for the pure-DP case it is a natural question whether the smoothness assumption is essential. In Appendix G.3, we provide a version of the exponential mechanism that works without the smoothness and unconstrained domain assumptions. This algorithm is inefficient and it does require an structural assumption on the feasible set, but it illustrates the possibilities of more general results in the pure-DP setting.

We note that the proposed output perturbation approach (Algorithm 4) leads to nearly optimal population risk bounds for approximate-DP, by a different tuning of the regularization parameter λ. Theorem 6.3. Consider a problem (SO) under bounded initial distance (Item (A.1)) (or bounded diameter, Item (A.2), if δ > 0), convexity (Item (A.3)), Lipschitzness (Item (A.5)), bounded range (Item (A.4)), and gradient sparsity (Item (A.7)). Then, Algorithm 4 is (ε, δ)-DP, and for 0 < β < 1,

If δ = 0, and under the additional assumption of smoothness (A.6) and unconstrained domain,

X = Rd. Selecting λ = L2H

D2 s log(d/β)

εn 1/3 , then with probability 1 β

FS(ˆx) FS(x (D)) L2/3H1/3D4/3 s log(d/β)

εn 1/3 + B q

If δ > 0. Selecting λ = L

D ln(n) ln(1/β)

s ln(1/δ) ln(d/β)

εn 1/2 , then with probability 1 β

FD(ˆx) FD(x (D)) LD [s ln(1/δ) log(d/β)]1/4

ln n + B) q

Acknowledgments and Disclosure of Funding

C.G. s research was partially supported by INRIA Associate Teams project, ANID FONDECYT 1210362 grant, ANID Anillo ACT210005 grant, and National Center for Artificial Intelligence CENIA FB210017, Basal ANID.

[1] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211 407, 2014.

[2] Alec Gunny, Chirayu Garg, Levs Dolgovs, and Akshay Subramaniam. Accelerating wide & deep recommender inference on GPUs, 2019. https://developer.nvidia.com/blog/ accelerating-wide-deep-recommender-inference-on-gpus.

[3] Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G Abel, Xu Guo, Jianbing Dong, et al. Merlin huge CTR: GPU-accelerated recommender system training and inference. In Rec Sys, 2022.

[4] Gagik Amirkhanyan and Bruce Fontaine. Building large scale recommenders using cloud TPUs, 2022. https://cloud.google.com/blog/topics/developers-practitioners/ building-large-scale-recommenders-using-cloud-tpus.

[5] Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. TPU v4: an optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In ISCA, 2023.

[6] Huanyu Zhang, Ilya Mironov, and Meisam Hejazinia. Wide network learning with differential privacy. Co RR, abs/2103.01294, 2021.

[7] Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. Sparsity-preserving differentially private training of large embedding models. In Neur IPS, 2023.

[8] Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464 473, 2014.

[9] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: the sparse and approximate cases. In STOC, pages 351 360, 2013.

[10] Przemyslaw Wojtaszczyk. Stability and instance optimality for Gaussian measurements in compressed sensing. Foundations of Computational Mathematics, 10:1 13, 2010.

[11] G Pisier. Remarques sur un résultat non publié de b. maurey. Séminaire Analyse fonctionnelle (dit), pages 1 12, 1981.

[12] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In STOC, pages 705 714, 2010.

[13] Thomas Steinke and Jonathan R. Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016.

[14] Jose H. Blanchet and Peter W. Glynn. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In WSC, pages 3656 3667, 2015.

[15] Justin Whitehouse, Aaditya Ramdas, Ryan Rogers, and Steven Wu. Fully-adaptive composition in differential privacy. In ICML, pages 36990 37007, 2023.

[16] Jeffrey S Rosenthal. A First Look at Rigorous Probability Theory. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, second edition, 2006.

[17] Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In STOC, pages 298 309, 2019.

[18] Frank Mc Sherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94 103, 2007.

[19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Neur IPS, 2019.

[20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In STOC, pages 439 449, 2020.

[21] Prateek Jain and Abhradeep Guha Thakurta. (Near) dimension independent risk bounds for differentially private learning. In ICML, pages 476 484, 2014.

[22] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. Co RR, abs/1411.5417, 2014.

[23] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly-optimal private LASSO. In NIPS, pages 3025 3033, 2015.

[24] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in L1 geometry. In ICML, pages 393 403, 2021.

[25] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. In COLT, pages 474 499, 2021.

[26] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. Ann. Stat., 49(5):2825 2850, 2021.

[27] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy in generalized linear models: Algorithms and minimax lower bounds. Co RR, abs/2011.03900, 2020.

[28] Yingxue Zhou, Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Private SGD with gradient subspace identification. In ICLR, 2021.

[29] Peter Kairouz, Monica Ribero Diaz, Keith Rush, and Abhradeep Thakurta. (Nearly) Dimension Independent Private ERM with Ada Grad Rates via Publicly Estimated Subspaces. In COLT, pages 2717 2746, 2021.

[30] Xuechen Li, Daogao Liu, Tatsunori Hashimoto, Huseyin Inan, Janardhan (Jana) Kulkarni, Yin Tat Lee, and Abhradeep Guha Thakurta. When does differentially private learning not suffer in high dimensions? In Neur IPS, November 2022.

[31] Yin Tat Lee, Daogao Liu, and Zhou Lu. The power of sampling: Dimension-free risk bounds in private ERM. Co RR, abs/2105.13637, 2024.

[32] Yi-An Ma, Teodor Vanislavov Marinov, and Tong Zhang. Dimension independent generalization of DP-SGD for overparameterized smooth convex optimization. Co RR, abs/2206.01836, 2022.

[33] Hilal Asi, Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Stochastic bias-reduced gradient methods. In Neur IPS, 2021.

[34] Hilal Asi, Daniel Lévy, and John C Duchi. Adapting to function difficulty and growth conditions in private optimization. In Neur IPS, pages 19069 19081, 2021.

[35] Gautam Kamath, Argyris Mouzakis, Matthew Regehr, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A bias-variance-privacy trilemma for statistical estimation, 2023.

[36] Aleksandar Nikolov and Haohua Tang. Gaussian noise is nearly instance optimal for private unbiased mean estimation. Co RR, abs/2301.13850, 2023.

[37] Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In ICML, pages 1060 1092, 2023.

[38] Mark Bun, Jonathan R. Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, pages 1 10, 2014.

[39] Roman Vershynin. On the role of sparsity in compressed sensing and random matrix theory, 2009.

[40] Emmanuel J. Candès and Mark A. Davenport. How well can we estimate a sparse vector? Applied and Computational Harmonic Analysis, 34(2):317 323, 2013.

[41] E.J. Candes and T. Tao. Decoding by linear programming. TOIT, 51(12):4203 4215, 2005.

[42] Mark Bun, Thomas Steinke, and Jonathan R. Ullman. Make up your mind: The price of online queries in differential privacy. In SODA, pages 1306 1325, 2017.

[43] Gautam Kamath and Jonathan R. Ullman. A primer on private statistics. Co RR, abs/2005.00010, 2020.

[44] Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499 526, 2002.

[45] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009.

[46] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 12:1069 1109, 2011.

[47] Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In COLT, pages 610 626, 2020.

A Auxiliary Privacy Results

The privacy and accuracy of some of the perturbation based methods we use to privatize our algorithms are based on the following simple facts (see, e.g., [1]).

Fact A.1 (Laplace & Gaussian mechanisms). For all g : Zn 7 Rd

(a) If the ℓ1-sensitivity of g is bounded, i.e., g 1 := sup S S g(S) g(S ) 1 < + , then Ag Lap(S) := g(S) + ξ where ξ Lap d( g 1/ε) is ε-DP.

(b) If the ℓ2-sensitivity of g is bounded, i.e., g 2 := sup S S g(S) g(S ) 2 < + , then

Ag N (S) := g(S) + ξ, where ξ N 0, σ2I for σ g 2

2 log(1.25/δ)

ε is (ε, δ)-DP.

Fact A.2 (Laplace & Gaussian concentration). Let σ > 0 and 0 < β < 1.

(a) For ξ Lap(σ) d: (i) ξ σ log(d/β) holds with probability 1 β, and (ii) ξ 2 σ

d log(d/β) holds with probability 1 β.

(b) For ξ N(0, σ2I), (i) ξ σ p

log(d/β) holds with probability 1 β, (ii) ξ 2 σ

log(1/β) holds with probability 1 β, and (iii) E ξ 2 2 = dσ2.

We note the existence of packing sets of sparse vectors (e.g., [39, 40]). Denote by Cd s the set of all s-sparse vectors in {0, 1/ s}d; note that Cd s Sd s .

Lemma A.3. For all s and d such that s d/2, there exists a subset P Cd s such that |P| (d/s 1/2)s/2 and for all u, v P, it holds that u v 2 1/

Proof. This follows from a simple packing-based construction (see, e.g., [40]). There are d s vectors in Cd s, and for each vector v Cd s, there are at most d s/2 many vectors u Cd s such that u v 0

s/2 and hence u v 2 1/

2. Thus, we can greedily pick vectors to be C, guaranteeing that all

vectors u, v Cd s satisfy u v 0 > s/2, and have |C| d s / d s/2 d

For completeness, we provide a classical dataset bootstrapping argument used for DP mean estimation lower bounds [8]. Whereas in the original reference this bootstrapping is achieved by appending dummy vectors which mutually cancel out with the goal of maintaining the structure of vectors, we simply append zero vectors as dummies as we do not need to satisfy an exact sparsity pattern.

Lemma A.4 (Dataset bootstrapping argument from [8]). Suppose for some n, there exists a mechanism A such that for all S (Sd s )n, it holds with probability at least 1/2 that A(S) z(S) 2 C, for some C 0. Then for all n < n, there exists a mechanism A such that for all S (Sd s )n , it holds with probability at least 1/2 that A(S ) z(S ) 2 C n

n . Furthermore, A satisfies the same privacy guarantees as A, namely if A is ε-DP (or (ε, δ)-DP), then so is A .

Proof. Given mechanism A, consider mechanism A that for any dataset S (Sd s )n , builds dataset S by adding n n copies of 0 to S and returns n n A(S). From the guarantees of A, it holds that P [ A(S) z(S) 2 C] 1

2. Since A (S ) = n n A(S) and z(S ) = n n z(S), it follows that

P h A (S ) z(S ) 2 C n

Since A just applies A once, it follows that A satisfies the same privacy guarantee as A.

Next we provide a generic reduction of existence of packing sets with pure-DP mean estimation lower bounds. Note however that the lower bounds we state work on the distributional sense.

Lemma A.5 (Packing-based mean estimation lower bound, adapted from [12, 8]). Let P Rd be an α0-packing set of vectors with |P| = p. Then, there exists a distribution µ over Pn that induces

an (α, ρ)-distributional lower bound for ε-DP mean estimation with α = α0

2 min n 1, log(p/2)

Proof. Let n = log(p/2)

ε . First, consider the case where n < n . We construct p datasets S1, . . . Sp where Sl consists of n copies of zl, and define µ = Unif({S1, . . . , Sp}). Note that for all k = l, it holds that z(Sk) z(Sl) 2 α0. Suppose µ does not induce a distributional lower bound. Then there exists A which is ε-DP and has ℓ2-accuracy better than α0/2 w.p. at least 1/2: this implies in particular that

Pl Unif([p]) A(Sl) Bd 2(zl), α0

For all distinct k, l, the datasets Sk and Sl differ in all n entries, and hence for any ε-DP mechanism A, it holds that P[A(Sl) Bd 2(zk, α0

2e εn. However, by construction, Bd 2(zl, α0

2 ) are pairwise disjoint. Hence,

k=1 PS µ[A(S) Bd 2(zk, α0/2)] =

k=1 PS µ[A(S) Bd 2(zk, α0/2)|S = zj]1

k=1 PS µ[A(S) Bd 2(zk, α0/2)|S = zk] e εnp

Thus, we get that n log(p/2)

ε , which is a contradiction since we assumed n < n . Hence, µ induces an (α0/2, 1/2)-distributional lower bound for ε-DP mean estimation. Next, consider the case where n > n . Then the previous argument together with Lemma A.4 implies an (α, ρ)-lower bounded, where α = n

2n and ρ = 1/2, as desired.

We will make use of the following fully adaptive composition property of DP, which informally states that for a prescribed privacy budget, a composition of (adaptively chosen) mechanisms whose privacy parameters are predictable, if we stop the algorithm before the (predictable) privacy budget is exhausted, the result of the full transcript is DP. Theorem A.6 ((ε, δ)-DP Filter, [15]). Suppose (At)t 0 is a sequence of algorithms such that, for any t 0, At is (εt, δt)-DP, conditionally on (A0:t 1) (in particular, (εt, δt)t is (At)t-predictable). Let ε > 0 and δ = δ + δ be the target privacy parameters such that δ > 0, δ 0. Let

v u u t2 ln 1

s=0 ε2s + 1

s=0 ε2 s, and δ[0:t] :=

and define the stopping time

T((εt, δt)t) := inf n t : ε < ε[0:t+1] o inf n t : δ < δ[0:t+1] o .

Then, the algorithm A0:T ( )( ) is (ε, δ)-DP, where T(x) = T (εt(x), δt(x)

B Missing Proofs from Section 3

B.1 Proof of Lemma 3.1

Proof. From the properties of the Euclidean projection, we have

ˆz z(S), ˆz z 0. (3)

ˆz z(S) 2 2 = ˆz z(S), ˆz z + ˆz z(S), ξ (3) ˆz z(S), ξ

2 max u K u, ξ 2 max u K u 1 ξ = 2L ξ s,

where we used the fact that conv(Sd s ) K.

B.2 Proof of Theorem 3.2

Proof. First, the privacy follows from the ℓ1-sensitivity bound of the empirical mean

1 = sup S S z(S) z(S ) 1 = 1

n supz,z Sd s z z 1 2L s

together with Theorem A.1(a).

For the accuracy, the first term follows from Theorem A.2(a)-(ii), and the fact that Euclidean projection does not increase the ℓ2-estimation error, and the second term follows from Lemma 3.1 with the fact that ξ O L s

nε log(d/β) holds with probability at least 1 β, by Theorem A.2(a)-(i).

B.3 Proof of Theorem 3.3

Proof. The privacy guarantee follows from the ℓ2-sensitivity bound of the empirical mean, 2 = 2L

n , together with Theorem A.1(b). For the accuracy, the first term in the minimum follows from Theorem A.2(b)-(ii), and the fact that Euclidean projection does not increase the ℓ2-estimation error. The second term follows from Lemma 3.1 and Theorem A.2(b)-(ii).

B.4 Sharper DP Mean Estimation Upper Bounds via Compressed Sensing

We propose Algorithm 5, a more accurate method for approximate-DP mean estimation based on compressed sensing [10]. The precise improvements relate to reducing the log(d) factor to log(d/s), and a faster rate dependence on the confidence β. The idea is that for sufficiently high dimensions, a small number of random measurements suffices to estimate a noisy and approximately sparse signal. These properties follow from existing results in compressed sensing, which provide guarantees based on the ℓ2-norm of the noise, and the best sparse approximation in the ℓ2-norm (known as ℓ2-ℓ2-stable and noisy recovery) [10]. We will exploit such robustness in two ways: regarding the noise robustness, this property is used in order to perturb our measurements, which will certify the privacy; on the other hand, the approximate recovery property is used to find a sparser approximation of our empirical mean. As the approximation is only used for analysis, we can appeal to the Approximate Caratheodory Theorem to certify the existence of a sparse vector whose sparsity increases more moderately with n than the empirical average [11].

An interesting feature of this algorithm is that ℓ1-minimization promotes sparse solutions, and thus we expect our output to be approximately sparse: this is not a feature that we particularly exploit, but it may be relevant for computational and memory considerations. Furthermore, note that the ℓ1minimization problem does not require exact optimality for the privacy guarantee, hence approximate solvers can be used without compromising privacy.

Algorithm 5 Gaussian ℓ1-Recovery( z(S), ε, δ, n)

Require: z(S) = 1

i [n] zi Rd from dataset S (Ss,d)n; privacy parameters ε, δ > 0

return ˆz =

z(S) + ξ, where ξ N(0, σ2Id d) and σ2 = 8L2 ln(1.25/δ)

(nε)2 , if d < m ln2 m,

z 1{ z 2 2L}, where z = arg min{ z 1 : Az = b}, A (N(0, 1

b = A z(S) + ξ and ξ N(0, σ2Im m) with σ2 = 18L2 ln(2.5/δ)

(nε)2 , otherwise

Theorem B.1. If 6 exp{ cm} δ < s ln(d/s)

m2 (where c > 0 is a constant) and 0 < ε 1, then Algorithm 5 is (ε, δ)-DP, and with probability 1 δ/2 β,

ˆz z(S) 2 L min n (

ln(1/δ) nε , (s ln(d/s) ln(1/δ))1/4

ln(1/β) ln(1/δ)

Moreover, we have the following second moment estimate,

E[ ˆz z 2 2] L2 min nd ln(1/δ)

s ln(d/s) ln(1/δ)

Proof. First, if d < m ln2 m, then Algorithm 5 is (ε, δ)-DP by privacy of Gaussian noise addition and the post-processing property of DP. Moreover, its (high probability and second moment) accuracy guarantees follow from Theorem A.2.

Next, if d m ln2 m, we start with the privacy analysis. Let S S and suppose they only differ in their ith entry. We note that due to our choice of m, A is an approximate restricted isometry with probability 1 3 exp{ cm} [41] (where c is the same as in the theorem statement); in particular, letting K nε

s ln(d/s) ln(1/δ), we have that for all v Rd which is (s K)-sparse

1 2 v 2 Av 2 3

Hence, due to our assumption on δ, the event above has probability at least 1 δ/2, and therefore

A( z z ) 2 = 1

n A(zi z i) 2 3L

where we used the fact that zi z i is (2s)-sparse. We conclude by the choice of σ2 that A z + ξ is (ε, δ)-DP, and thus z is (ε, δ)-DP by postprocessing.

We now proceed to the accuracy guarantee. By [10, Theorem 3.6 (b)], under the same event as stated above (which has probability 1 δ/2) we have

ˆz z 2 ξ 2 + inf z: z 0 s K z z 2.

For the first term, we use Gaussian norm concentration to guarantee that with probability 1 β,

ln(1/β) σ p

Ks ln(d/s) + r

For the second term, by the Approximate Carátheodory Theorem [11], the infimum above is upper bounded by O(L/

K); for this, note that z lies in the convex hull of Sd s . Given our choice of K, we have that, with probability 1 δ/2 β

ˆz z 2 L [s ln(d/s) ln(1/δ)]1/4

ln(1/β) ln(1/δ)

We conclude by providing the second moment estimate, by a simple tail integration argument. First, by the law of total probability, and letting E be the event of A being an approximate restricted isometry, E ˆz z 2 2 E[ ˆz z 2 2|E] + 9L2δ,

where we also used that ˆz 2 2L and z 2 L, almost surely. Now, conditionally on E, we have

that letting α L [s ln(d/s) ln(1/δ)]1/4

nε (below c > 0 is an absolute constant),

E[ ˆz z 2 2|E] = Z

0 P h ˆz z 2 u i (2u)du

0 P h ˆz z 2 α τ i 2(α + τ)dτ

0 2 exp n c(nε)2

L2 ln(1/δ)τ 2o (α + τ)dτ

nε + L2 ln(1/δ)

where in the second inequality we used the previous high probability upper bound (here c > 0 is an absolute constant), and in the last step we used that nε > p

ln(1/δ). Finally, by our assumptions on δ, 9L2δ α2, and this concludes the proof.

C Missing Proofs from Section 4

C.1 Proof of Lemma 4.3

Proof. Consider an n d data matrix D whose rows correspond to datapoints of a dataset S, and whose columns correspond to their d features. We will indistinctively refer to S or D as needed (these are equivalent representations of a dataset). This data matrix will be comprised of K diagonal blocks, D1, . . . , DK; in particular, outside of these blocks, the matrix has only zeros. These blocks are sampled i.i.d. from the hard distribution µ given by hypothesis. Denote µ the law of D. Let now zk(Dk) Rt be the mean (over rows) of dataset Dk. Then, the mean (over rows) of dataset D is given by z(D) = n0

n z1(D1) . . . z K(DK) , where [z1| . . . |z K] Rd denotes the concatenation of z1, . . . , z K (note that if K < d/t, then the concatenation above needs to be padded with (d t K)-zeros, which we omit for simplicity). Let A be an (ε, δ)-DP algorithm, and let Ak its output on the kth block variables, then

A(D) z(D) 2 2 =

2 = n2 0 n2

n0 Ak(D) zk(Dk) 2

Let now Bk(D) := n n0 Ak(D), and note it is (ε, δ)-DP w.r.t. Dk (as it is DP w.r.t. D); further, by the independence of D1, . . . , DK, we can condition on (Dh)h =k, to conclude that the squared ℓ2-error Bk(D) zk(Dk) 2 2 must be at least α2 0, with probability at least ρ0 (both on Dk and the internal randomness of Bk). Letting Yk := 1{ Bk(D) zk(Dk) 2 α0}, we have

P h A(D) z(D) 2 2 α0n0

We will now use a coupling argument to lower bound the probability above. First, we let U1, . . . , UK i.i.d. Unif([0, 1]), and Wk = 1{Ui ρ0} which are i.i.d. On the other hand, we define

pk(y1, . . . , yk 1) := P[Yk = 1|Y1 = y1, . . . , Yk 1 = yk 1] Yk := 1{Uk pk( Y1,..., Yk 1)}.

Noting that Y d= Y , and that Yk Wk almost surely, due to the fact that pk ρ0 almost surely (which it follows from the ℓ2-error argument discussed above), we have

i = P h K X

i 1 exp( ρ0/8),

where we used a one-sided multiplicative Chernoff bound.

Therefore, A(D) z(D) 2 2 α0n0

n 2 ρ0K, with probability 1 exp( ρ0/8). We conclude that

µ induces an (α, ρ)-distributional lower bound for (ε, δ)-DP mean estimation, as claimed.

C.2 Proof of Theorem 4.4

Proof. By Lemma A.3, there exists a set P of 1/

2-packing vectors on Cd s with log(|P|) s log(d/s). Lemma A.5 thus implies the desired lower bound.

C.3 Proof of Theorem 4.1

With all the building blocks in place, we now prove Theorem 4.1.

Proof of Theorem 4.1. We divide the analysis into the different regimes of sample size n. First, if n s log(d/s)

ε , then Theorem 4.4 provides an Ω(1) lower bound.

Next we consider the case s log(d/s)

ε. For s t d to be determined, let n0 = s log(t/s)

ε . We choose t so that d

t n n0 : this can be attained by choosing t ds

εn . This implies in the

context of Lemma 4.3 that K = d

t n n0 . By Theorem 4.4, this implies a lower bound α0 1, with constant probability 1/2 for sparse mean estimation in dimension t. By Lemma 4.3, we conclude

a sparse mean estimation lower bound of α0n0

s log(d/nε)

εn holds with constant probability.

On the other hand, if n d

ε. By the previous paragraph, for datasets of size n the

following lower bound holds, Ω q

s log(d/εn )

d. For any n > n , by Lemma A.4, we have

the lower bound Ω p s

sd εn holds with constant probability.

C.4 Proof of Theorem 4.5

Proof. We divide the analysis into the different regimes of sample size n. First, if n p

s ln(1/δ)/ε, then embedding an s-dimensional lower bound construction [42]4 and padding it with zeros for the remaining d s features, provides an Ω(1) lower bound with constant probability.

Next, we consider the case p

s ln(1/δ)/ε n d

ln(1/δ) sε . Let n0 = p

s ln(1/δ)/ε, t = s,

and K = n n0 d s, where the last inequality holds by our regime assumption. The classic sdimensional mean estimation lower bound by [42] provides an α0 1 lower bound with constant probability. Hence by Lemma 4.3, the sparse mean estimation problem satisfies a lower bound Ω α0n0

K (s ln(1/δ))1/4

εn , with constant probability.

We conclude with the final range, n d

ln(1/δ) sε . First, letting n d

ln(1/δ) sε , we note that this sample size falls within the range of the previous analysis, which implies a lower bound with constant probability of (s ln(1/δ))1/4

d. Now, if n > n , by Lemma A.4, we conclude that the following

lower bound holds with constant probability, Ω s

D Analysis of Biased SGD

Given the heavy-tailed nature of our estimators, our guarantees for a single run of SGD with biasreduced first-order oracles only yields constant probability guarantees. Here we prove pathwise bounds that facilitate such analyses.

D.1 Excess Empirical Risk: Convex Case

First, we provide a path-wise guarantee for a run of SGD with a biased oracle. Importantly, this guarantee is made of a method which runs for a random number of steps.

Proposition D.1. Let (Ft)t be the natural filtration, and T be a random time. Let (xt)t be the trajectory of projected SGD with deterministic stepsize sequence (ηt)t, and (biased) stochastic first-order oracle G for a given function F. If x arg min{F(x) : x X}, then the following event holds almost surely

t=0 [F(xt) F(x )] 1 2ηt x0 x 2 +

2 G(xt) 2 + F(xt) G(xt), xt x .

4While [42] only provides 1-dimensional distributional lower bounds for approximate-DP mean estimation, it is easy to convert these into higher dimensional lower bounds, see, e.g., [26, 43].

Proof. By convexity

F(xt) F(x ) F(xt), xt x = F(xt) G(xt), xt x | {z } :=bt

+ G(xt), xt x

bt + G(xt), xt xt+1 + G(xt), xt+1 x

2 G(xt) 2 + 1

2ηt xt xt+1 2 + G(xt), xt+1 x

( ) bt + ηt

2 G(xt) 2 + 1

2ηt xt xt+1 2 + 1

ηt xt+1 xt, x xt+1

2 G(xt) 2 + 1

2ηt xt x 2 1

2ηt xt+1 x 2,

where the second inequality follows by the Young inequality, and step ( ) we used the optimality conditions of the projected SGD step:

ηt G(xt) + [xt+1 xt], x xt+1 0 ( x X).

Therefore, summing up these inequalities, we obtain

t=0 [F(xt) F(x )] 1 2η0 x0 x 2 +

2 G(xt) 2 + bt i .

Plugging in the definition of bt proves the result.

D.2 Stationary Points: Nonconvex Case

Proposition D.2. Let F satisfy (A.6), and let G be a biased first-order stochastic oracle for F. Let (xt)t be the trajectory of SGD with oracle G, constant stepsize 0 < η 1/[2H], and initialization x0 such that F(x0) minx Rd F(x) Γ. Let T be a random time. Then the following event holds almost surely

t=0 F(xt) 2 2 Γ

t=0 G(xt) 2 2

t=0 F(xt), G(xt) F(xt)

Proof. By smoothness of f, we have

F(xt+1) F(xt) η F(xt), G(xt) + η2H

2 G(xt) 2 2

η F(xt) 2 2 η F(xt), G(xt) F(xt) + η2H

2 G(xt) 2 2.

t=0 F(xt) 2 2 F(x0) F(x T +1)

t=0 F(xt), G(xt) F(xt) + ηH

t=0 G(xt) 2 2

t=0 F(xt), G(xt) F(xt) + ηH

t=0 G(xt) 2 2.

E Missing proofs from Section 5

E.1 Proof of Lemma 5.1

Proof. The proof is based on the fully adaptive composition theorem of DP [15]. For this, we consider {At}t 0, where A0(S) = (x0, N0) (here N0 the first truncated geometric parameter), and inductively, At+1(At(S), S) for t 0 takes as input At(S) = (xt, Nt), computes G(xt) using the subsampled debiased gradient estimator (Algorithm 2), and performs a projected gradient step based on G(xt). Let Ht be the σ-algebra induced by (As)s=0,...,t.

Suppose now that At is (εt, δt)-DP, where (εt, δt) are Ht-measurable (we will later obtain these parameters), and let T := inf{t : ε[0:t] > ε/2, δ[0:t] > δ/2}, in the language of Theorem A.6 (notice that in the context of that theorem, we are choosing δ = δ = δ/4). We first claim that (xt)t=0,...,T 1 is (ε/2, δ/2)-DP, which follows directly from Theorem A.6. Next, we will later show that εt ε/4 and δt δ/4, almost surely (this applies in particular to x T ), and therefore by the composition property of DP, (xt)t T is (ε, δ)-DP.

Next, we provide the bounds on (εt, δt) required to conclude the proof. For this, we first note that conditionally on xt, Nt and Bt the computation of G+ Nt+1(xt, Bt), G Nt(xt, Ot), G Nt(xt, Et), is (3ε/32, 3δ/16)-DP. Furthermore, by privacy amplification by subsampling, this triplet of random variables is (ε , δ ), with

ε = ln 1 + 2Nt+1

n (e3ε/32 1) 2Nt+1

n 3ε 16, δ = 2Nt+1

where we used above that ε 1. Similarly, we have that G0(x, I) is ε 16n, δ 16n -DP. Therefore, by the basic composition theorem of DP, we have the following privacy parameters for the tth iteration of the algorithm

εt = (3 2Nt+1 + 1) ε

16n, δt = (3 2Nt+1 + 1) δ

This proves in particular that (εt, δt) are Ht-measurable, and that εt ε/4, and δt δ/4 almost surely, which concludes the proof

E.2 Proof of Lemma 5.3

Proof. Let A = Pt 1 s=0 3 2Ns+1+1

16n 2, and note that for t T + 1, A 1 almost surely. Then, we have that

ε[0:t 1] = p

2 ln(4/δ)ε2A + ε2

2 ln(4/δ)A.

Now, by eqn. (2) and the union bound,

P[T t] P h 2ε p

2 ln(4/δ)A > ε/2 i + P h t 1 X

s=0 (3 2Nt+1 + 1) > 4n i

3 2Nt+1 + 1 2 > 32n2

δ i + P h t 1 X

s=0 (3 2Nt+1 + 1) > 4n i

16n2 9E[22(Nt+1)] + 1 + t

4n[6(M + 1) + 1]

16n2 [18n + 1] + t

4n[6 log(n) + 1]

1/4, where the third step follows from Markov s inequality and the fact that (Ns)s are i.i.d., and the last step follows from our choice of t = Cn/ log(4/δ) with C > 0 sufficiently small (here we use the fact that δ < 1/n2).

For the second part, we use that by the definition of T (eqn. (2))

v u u t2ε2 ln 4

(3 2Ns+1 + 1)2

(16n)2 + ε2

(3 2Ns+1 + 1)2

(16n)2 1 4 <

3 2Ns+1 + 1

(3 2Ns+1 + 1)2

3 2Ns+1 + 1

Taking expectations and bounding the maximum by the sum allows us to use Wald s identity as follows,

n2 < E[T + 1] 8 ln 4

162 + n3 log(n) + 1

E[T + 1] ln 4

which proves the claimed bound.

The upper nound on E[T] is obtained similarly. Again, by eqn. (2),

ln(4/δ) E h T 1 X

3 2Ns+1 + 1) 2i E[T]9n

Re-arranging terms provides the claimed lower bound.

E.3 Excess Empirical Risk in the Convex Setting

As a first application, we study the accuracy guarantees of Algorithm 3 in the convex setting. We remark that these rates will be slightly weaker than those provided in Section 6, but this example is useful to illustrate the technique. Towards this goal, we analyze the cumulative regret of the algorithm, namely RT := PT t=0[FS(xt) FS(x (S))]. Although this is a standard and well-studied object in optimization, we need to obtain bounds for this object when the stopping time T is random. The key observation here is that since T is a stopping time, the event {T t} is Ft 1-measurable (here and throughout, Ft = σ((xs)s t) is the natural filtration). This permits using our bias and second moment bounds similarly to the case where T is deterministic.5 Moreover, for the sake of analysis, we will consider Algorithm 3 as running indefinitely, for all t 0. This would of course eventually violate privacy. However, since our algorithm stops at time T, then privacy is guaranteed as done earlier in this section. Proposition E.1. Let Rt := Pt t=0[FS(xt) FS(x (S))], let T be the stopping time defined in eqn. (2). Then

2η x0 x (S) 2 + E[T + 1] ην2

where b and ν2 are defined as in Lemma 5.2.

Proof. By Proposition D.1 (see Appendix D),

2η x0 x (S) 2 +

2 G(xt) 2 + F(xt) G(xt), xt x (S)

2η x0 x (S) 2

2E[1{T t} G(xt) 2|Ft 1] + E[1{T t} F(xt) G(xt), xt x (S) |Ft 1] o

2η x0 x (S) 2

2 E[ G(xt) 2|Ft 1] + 1{T t}E[ F(xt) G(xt), xt x (S) |Ft 1] o

where in the first equality we used the tower property of the conditional expectation, and in the second equality we used that {T t} = {T t 1}c is Ft 1-measurable.

Now, by Lemma 5.2, E[ F(xt) G(xt), xt x (S) |Ft 1] Db and E[ G(xt) 2|Ft 1] ν2 (note that Ft 1 does not include the randomness of Nt, and therefore the bias and moment estimates as in the mentioned lemma hold), thus

2η x0 x (S) 2 + E[T + 1] ην2

We conclude with the constant probability guarantee for the biased and randomly stopped SGD, Algorithm 3.

5This idea is related to the Wald identities [16]; however, we provide a direct analysis for the sake of clarity.

Theorem E.2. Consider a (SO) problem under convexity (Item (A.3)), initial distance (Item (A.1)), Lipschitzness (Item (A.5)) and gradient sparsity (Item (A.7)) assumptions. Let τ = C n ln(2/δ), where C > 0 is an absolute constant. Let η = D ν τ , U = CD[ν τ + bτ], where C > 0 is an absolute constant. Then Algorithm 3 satisfies

P h FS( x) FS(x (S)) U

Proof. We start by noting that

P h FS( x) FS(x (S)) > U

i P {T τ} {RT > U} P T τ] + P[RT > U].

For the first event, by Lemma 5.3, we have that P[T τ] 1/4 (which determines C ). On the other hand, using Proposition E.1 and Lemma 5.3, we have that for our choice of η, we have that

E[RT ] Dν τ

2 + E[T + 1]D ν

2 τ + b D[ν τ + τb].

In particular, for our choice of U (with C > 0 sufficiently large),

P[RT > U] E[RT ]

The above result implies a nearly optimal empirical excess risk rate for DP-SCO,

ln n[s ln(d/s) ln3(1/δ)]1/4

but only with constant probability. We defer to the next section how to boost this guarantee to hold with arbitrarily high probability.

E.3.1 Near Stationary Points for the Empirical Risk

For nonconvex objectives it is known that obtaining vanishing excess risk is computationally difficult. Hence, we study the more modest goal of approximating stationary points, i.e., points with small norm of the gradient. By combining known analyses of biased SGD with our bias-reduced oracle, we can establish bounds on the success probability of the algorithm.

Theorem E.3. Consider a (nonconvex) (SO) problem, under the following assumptions: Lipschitzness (Item (A.5)), smoothness (Item (A.6)), gradient sparsity (Item (A.7)), and the following initial suboptimality assumption: namely, that given our initialization x0 Rd, we know Γ > 0 such that

FS(x0) FS(x (S)) Γ. (5)

Let τ = C n ln(2/δ) with C > 0 an absolute constant. Let η = q

Γ Htν2 and U = C

with C > 0 an absolute constant. Then Algorithm 3 satisfies P h FS(xˆt) 2 2 U

τ i 1/2, and

ln(n) ln(1/δ) + L2 [s ln(d/s) ln(1/δ)]1/4

Proof. First, given any U > 0, we have that

P h FS(xˆt) 2 >

i P[T < τ] + P[T FS(xˆt) 2 2 > U] 1

4 + E[T FS(xˆt) 2 2] U ,

where the last step follows by Lemma 5.3 and Chebyshev s inequality, respectively. Next, by definition of ˆt and Proposition D.2 (see Appendix D.2),

E[(T + 1) F(xˆt) 2 2] = E h T X

t=0 F(xt) 2 2 i

t=0 G(xt) 2 2 i E h T X

t=0 F(xt), G(xt) F(xt) i

t=0 E[1{T t} G(xt) 2 2]

t=0 E[1{T t} F(xt), G(xt) F(xt) ]

t=0 P[T t]E E[ G(xt) 2 2|Ft 1]

t=0 P[T t]E E[ F(xt), G(xt) F(xt)|Ft 1 ]

2 E[T + 1]ν2 + E[T + 1]Lb

ΓHτν + τLb,

where the third inequality holds since {T t} is Ft 1-measurable (see the proof of Theorem E.1 for details), and the fourth inequality follows from Theorem 5.2, used the upper bound on E[T] from Lemma 5.3, and our choice for η. Selecting U = C

ΓHτν + Lτb with C > 0 sufficiently large, we get E[T F(xˆt) 2 2]/U 1/4, concluding the proof.

F Boosting the Confidence for the Bias-Reduced Stochastic Gradient Method

We conclude by providing a boosting method to amplify the success probability of our bias-reduced method. This private boosting method is a particular instance of a private selection method [17], and it is based on running a random number of independent runs of Algorithm 3 with noisy evaluations of their performance. Among the independent runs, we select the best performing one based on the noisy evaluations. This particular implementation sharpens some polylogarithmic factors that would appear for other private selection methods, such as Report Noisy Min [18, 1].

Algorithm 6 Boosting_Bias-Reduced_SGD(S, ε, δ, K) Require: Dataset S Dn, ε, δ > 0 privacy parameters, random stopping parameter γ (0, 1)

for k = 1, . . . , K do

Run Algorithm 3 with privacy budget (ε/12, (δ/[4K])2), ˆxk its output and if f( , z) convex then

Set sk = [FS(ˆxk) + ξk], where ξk Lap(λ), and λ = 12B

Set sk = [ FS(ˆxk) 2 + ξk], where ξk Lap(λ), and λ = 24L

nε . end if Flip a γ-biased coin: with probability γ, return ˆx = ˆxˆk, where ˆk = arg minl k sl end for Return ˆx = ˆx ˆ K, where ˆK = arg mink K sk

Theorem F.1. Let ε, δ > 0 such that δ ε/10. Then Algorithm 6 is (ε, δ)-DP. Let 0 < β < 1 and

γ = min{1/2, 3β/4}. In the convex case, Algorithm 6 attains excess risk P h FS(ˆx) FS(x (S))

α i 1 β, where

ln n[s ln(d/s) ln3 ln(1/δ)/[βδ] ]1/4 εn + B

On the other hand, in the nonconvex case, P h FS(ˆx) 2 2 α i 1 β, where

ln(n) ln ln(1/δ)

βδ + L2 [s ln(d/s) ln(ln(1/δ)/[βδ])]1/4

Proof. The privacy analysis follows easily from [17]. First, by basic composition, we have that for each k the pair (ˆxk, sk) is (ε1, δ1)-DP, with ε1 = ε/6, and δ1 = (δ/[4K])2. By [17, Thm 3.4], the private selection with random stopping used in Algorithm 6 is such that ˆx is (3ε1 + 3 2δ1, 2δ1K + δ/2)-DP; notice that

2δ1K + δ/2 δ, due to our choices of ε1, δ1. This proves that the algorithm is (ε, δ)-DP.

The accuracy of the algorithm closely follows [17, Theorem 3.3]. First, let κ be the number of runs the algorithm makes before stopping, and let α > 0 to be determined. Conditioning on κ

P FS(ˆx) FS(x (S)) > α =

k=1 P FS(ˆx) FS(x (S)) > α κ = k P[κ = k]

k=1 P FS(ˆx) FS(x (S)) > α κ = k (1 γ)k 1γ.

We will now bound the conditional probability above. By the subexponential tails of the Laplace distribution, we have that letting E := {( j [κ]) : |ξj| α } (here, α > 0 is arbitrary),

P[Ec|κ = k] = P h ( j [κ]) |ξk| > α κ = k i 2k exp n nεα

P h FS(ˆx) FS(x (S)) > α κ = k i P h FS(ˆx) FS(x (S)) > α E κ = k i + P[Ec|κ = k].

Next we have

P h FS(ˆx) FS(x (S)) > α E κ = k i P h FS(ˆxˆk) + ξˆk FS(x (S)) > α α E κ = k i

= P h min k [κ]

FS(ˆxk) + ξk FS(x (S)) > α α E κ = k i

P h min k [κ]

FS(ˆxk) FS(x (S)) > α 2α κ = k i

P h FS(ˆx1) FS(x (S)) > α 2α i k ,

where in the last step we used that the runs are i.i.d.

We now choose α, α such that α 2α = U/τ (where U, τ are those from Theorem E.2). Hence,

P h FS(ˆx) FS(x (S)) > α κ = k i 2 k + 2k exp n nεα

We can now bound the failure probability as follows:

P FS(ˆx) FS(x (S)) > α

2 k + 2K exp n nεα

o (1 γ)k 1γ

2 γ 1 γ2 + 2

δ exp n nεα

δ exp n nεα

where in the last step we used that γ = min{1/2, 3β/4}. It is clear then that α =

nε ln 16 3β2 ln 2

δ makes the probability above at most β. These choices lead to a final bound

ln n[s ln(d/s) ln3 ln(1/δ)/[βδ] ]1/4 εn + B

For the nonconvex case, we need to replace B by 2L in the Laplace concentration bound. Further, we consider the event { F(ˆxk) 2 > α} (as opposed to the optimality gap event). This implies that we need to set α > 0 such that α 2α p

U/τ from Theorem E.3. This leads to

P h FS(ˆx) 2 > α i

2 k + 2K exp n nεγ

o (1 γ)k 1γ.

The rest of the derivations are analogous.

G Missing Proofs and Results from Section 6

G.1 Proof of Theorem 6.1

Proof. We proceed by cases:

Case δ = 0. First, we prove that privacy of the algorithm. To do this, we first establish a bound on the ℓ1-sensitivity of the (quadratically) regularized ERM. Note that the first-order optimality conditions in this case correspond to

λ FS(x λ(S)).

Therefore, if S S , where S = (z1, . . . , zn) and S = (z 1, . . . , z n) only differ in one entry,

x λ(S) x λ(S ) 1 1

λ FS(x λ(S)) FS (x λ(S )) 1

i=1 f(x λ(S), zi) f(x λ(S ), z i) 1

2s H x λ(S) x λ(S ) 2 + 2

Above, in the third inequality we used the gradient sparsity (A.7), and the smoothness (A.6), assumptions. In the fourth inequality we used that the regularized ERM has ℓ2-sensitivity 4L

λn [44, 45, 46]. We conclude the privacy then by Theorem A.1(a).

We also remark that by Theorem A.2(a)-(i), ξ L s ln(d/β)

λ +1 , with probability 1 β.

Case δ > 0. The privacy guarantee follows from the fact that the ℓ2-sensitivity of x λ(S) is 4L

λn [44, 45, 46], together with Theorem A.1(b).

Moreover, by Theorem A.2(b)-(i), ξ L

ln(d/β) λnε , with probability 1 β.

We continue with the accuracy analysis, making a unified presentation for both pure and approximate DP. First, by the optimality conditions of the regularized ERM,

FS(x λ(S)) FS(x (S)) λ

2 x (S) 2 λ

We need the following key fact, which follows by the definitions of ˆx and x,

ˆx x λ(S) ˆx x + x x λ(S) 2 ξ . (7)

Using these two bounds, we proceed as follows

FS(ˆx) FS(x (S)) FS(ˆx) FS(x λ(S)) + λ

2 D2 FS(ˆx), ˆx x λ(S) + λ

FS(ˆx) 1 ˆx x λ(S) + λ

where the second inequality follows by convexity of FS, and the fourth one by the gradient sparsity assumption and (7).

The conclusion follows by plugging in the respective bounds of λ and ξ , for both pureand approximate-DP cases.

G.2 Proof of Theorem 6.3

Remark G.1. Note first that in the proof below we are not addressing the privacy of Algorithm 4, as this has already been proven in Theorem 6.1.

On the other hand, note that the same proof below using the in-expectation generalization guarantees of uniformly stable algorithms [44] provides a sharper upper bound for the expected excess risk for the pure and approximate-DP cases, which would hold w.p. 1 β over the algorithm internal randomness

ES[FD(ˆx) FD(x (D))] L2/3H1/3D4/3 s log(d/β)

ES[FD(ˆx) FD(x (D))] LD[s ln(1/δ) log(d/β)]1/4

Proof. Using the ℓ2-sensitivity of x λ(S), 2 = 4L

λn, we have the following generalization bound [47]: with probability 1 β/2

FD(x λ(S)) FS(x λ(S)) L2

λn ln(n) ln 1

The bound of (6) can be obviously modified by comparison with the population risk minimizer, x (D): in particular, the event above6 implies that

FD(x λ(S)) FD(x (D)) FS(x λ(S)) FS(x (D)) + γ λ

2 x (D) 2 2 + γ λD2 + γ.

On the other hand, the bound (7) works exactly as in the proof of Theorem 6.1. Hence, we have that with probability 1 β/2,

FD(ˆx) FD(x (D)) FD(ˆx) FD(x λ(S)) + λD2 + γ

FD(ˆx), ˆx x λ(S) + λD2 + γ

2L s ξ + L2

λn ln(n) ln 1

+ λD2 + B n

where in the last step we used that FD(ˆx) 1 = Ez[ f(ˆx, z)] 1 Ez[ f(ˆx, z) 1] L s (the last step which follows by the gradient sparsity), inequality (7), and the definition of γ.

We proceed now by separately studying the different cases for δ:

6We also need concentration to upper bound FS(x (D)) FD(x (D)). However, this is easy to do by e.g., Hoeffding s inequality, leading to a bound γ.

Case δ = 0. The bound above becomes

FD(ˆx) F(x (D)) L2

λ + 1 + ln n ln(1/β) + λD2 + B

Our choice of λ provides the claimed bound.

Case δ > 0. Here, the upper bound takes the form

FD(ˆx) F(x (D)) L2

s ln(d/β) ln(1/δ)

ε + ln(n) ln(1/β) + λD2 + B

The proposed value of λ leads to the bound below that holds with probability 1 β,

FD(ˆx) FD(x (D)) B

ln n ln(1/β)

s ln(1/δ) log(d/β)

n + LD[s ln(1/δ) log(d/β)]1/4

G.3 A Pure DP-ERM Algorithm for Nonsmooth Losses

We now prove that the rates of pure DP-ERM in the convex case above can be obtained without the smoothness assumption, albeit with an inefficient algorithm. This algorithm is based on the exponential mechanism, and it leverages the fact that the convex ERM with sparse gradient always has an approximate solution which is sparse. This result requires an additional assumption on the feasible set: x X P [d] = x|P X, (8)

where x|P Rd is the vector such that x P,j = xj if j P, and x P,j = 0 otherwise. We will say that X is sparsifiable if (8) holds. Note this property holds e.g., for ℓp-balls centered at the origin. Lemma G.2. Let X be a convex sparsifiable set. Consider the problem (ERM) under convexity (Item (A.3)), bounded diameter (Item (A.2)), Lipschitzness (Item (A.5)) and gradient sparsity (Item (A.7)), assumptions. If x (S) is an optimal solution of (ERM) and τ > 0, then there exists x X such that x 0 1/τ 2, and

FS( x) FS(x (S)) L sτ.

Proof. Let x Rd be defined as

xj = xj if |x S,j| τ 0 otherwise.

Note that x X since x (S) X and X is sparsifiable. Now we note that

j: |x S,j| τ

Finally, for the accuracy guarantee, we use convexity as follows,

FS( x) FS(x (S)) FS( x), ˆx x (S) FS( x) 1 x x (S) L sτ,

where in the last step we used that f(ˆx, zi) Sd s and the definition of x.

We present now the sparse exponential mechanism, which uses the result above to approximately solve (ERM) with nearly dimension-independent rates.

Algorithm 7 Sparse_Exponential_Mechanism Require: Dataset S = {z1, . . . , zn} Z, ε privacy parameter, f( , z) L-Lipschitz convex function with s-sparse gradients and range bounded by B, 0 < β < 1 confidence parameter Let τ > 0 be such that τ 3 ln(d/[τβ]) = L sεn

Let Nτ be a τ-net of 1/τ 2-sparse vectors over X with |Nτ| d 1/τ 2 3

Let ˆx be a random variable supported on Nτ such that P[ˆx = x] exp B

Remark G.3. The bound on |Nτ| claimed in Algorithm 7 follows from a standard combinatorial

argument (e.g., [39]). Moreover, it follows that |Nτ| d τ 1/τ 2 .

Theorem G.4. Let X be a convex sparsifiable set. Consider a problem (ERM) under bounded diameter (Item (A.2)), convexity (Item (A.3)), bounded range (Item (A.4)), Lipschitzness (Item (A.5)) and gradient sparsity (Item (A.7)), assumptions. Then Algorithm 7 satisfies with probability 1 β

FS(ˆx) FS(x (S)) L2/3B1/3 s

εn ln L sεn

Proof. Let x be the vector whose existence is guaranteed by Theorem G.2. By the high probability guarantee of the exponential mechanism [1] with probability 1 β,

FS(ˆx) FS( x) B

ln |Nτ| + ln(1/β) B

Hence, using Theorem G.2 with the upper bound above,

FS(ˆx) FS(x (S)) FS(ˆx) FS( x) + FS( x) FS(x (S))

εn ln(d/[τβ])

εn ln L sεn

where we used our choice of τ.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] . Justification: All results have full proofs and are self-contained. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The paper includes a description of current limitations in the Future Directions section. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]

Justification: All proofs contain a precise description of the underlying assumptions and proofs. Some of the proofs were deferred to the Appendix due to space limitations. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] . Justification: The paper does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [NA] . Justification: The paper does not include experiments requiring code. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] . Justification: The paper does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] . Justification: The paper does not include experiments Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean.

It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] . Justification: The paper does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes]

Justification: The work in this paper is theoretical, and therefore there are no ethical concerns. Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] . Justification: This work is theoretical in nature and thus it has no negative societal impact. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to

generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] . Justification: The paper poses no such risks. Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] . Justification: The paper does not use existing assets. Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA] . Justification: The paper does not release new assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] . Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] . Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.