# Differentially Private Generalized Linear Models Revisited

Raman Arora (Department of Computer Science, The Johns Hopkins University, arora@cs.jhu.edu) · Raef Bassily (Department of Computer Science & Engineering and Translational Data Analytics Institute (TDAI), The Ohio State University, bassily.1@osu.edu) · Cristóbal Guzmán (Inst. for Mathematical and Comput. Eng., Fac. de Matemáticas and Esc. de Ingeniería, Pontificia Universidad Católica de Chile, c.guzman@utwente.nl) · Michael Menart (Department of Computer Science & Engineering, The Ohio State University, menart.2@osu.edu) · Enayat Ullah (Department of Computer Science, The Johns Hopkins University, enayat@jhu.edu)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** We study the problem of $(\epsilon, \delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\|w^*\|}{\sqrt{n}} + \min\left\{\frac{\|w^*\|^2}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\,\|w^*\|^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\|w^*\|$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + \min\left\{\frac{\|w^*\|^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\,\|w^*\|}{n\epsilon}\right\}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\|w^*\|}{\sqrt{n}} + \min\left\{\frac{\|w^*\|}{\sqrt{n\epsilon}}, \frac{\sqrt{\mathrm{rank}}\,\|w^*\|}{n\epsilon}\right\}\right)$, where $\mathrm{rank}$ is the rank of the design matrix. This improves over existing work in the high-privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\|w^*\|$.

## 1 Introduction

Ensuring privacy of users' data in machine learning algorithms is an important desideratum for multiple domains, such as social and medical sciences. Differential privacy (DP) has become the gold standard of privacy-preserving data analysis, as it offers formal and quantitative privacy guarantees and enjoys many attractive properties from an algorithmic design perspective [DR+14]. However, for various basic machine learning tasks, the known risk guarantees are potentially sub-optimal and pessimistic.

Table 1 (Summary of Rates) collects our upper and lower bounds for $H$-smooth non-negative GLMs in the regimes $d \le (n\epsilon)^{2/3}$ and $d > (n\epsilon)^{2/3}$ (Theorems 1-4) and for $G$-Lipschitz GLMs in the regimes $\mathrm{rank} \le n\epsilon$ and $\mathrm{rank} > n\epsilon$ (Theorems 5 and 6, and [SSTT20]), together with the corresponding non-private rates and lower bounds [SST10, Sha15, NY83]. Parameters: $d$: dimension; $n$: sample size; $H$: smoothness parameter; $w^*$: minimum-norm population risk minimizer; $X$: bound on feature vectors; $Y$: bound on the loss at zero; $G$: Lipschitzness parameter; $\mathrm{rank}$: expected rank of the design matrix; $\epsilon$: privacy parameter ($\delta$ factors omitted). The actual private excess risk bounds are the sum of the expressions shown in the DP rows and their non-private counterparts. Details on the non-private lower bounds are in Appendix E.

In this work, we make progress towards resolving one of the most basic machine learning problems under differential privacy: learning linear predictors with convex losses.
### 1.1 Related Work

Differentially private machine learning has been thoroughly studied for over a decade. In the Lipschitz convex setting, tight rates are known for both the empirical and population risk [BST14, BFTGT19]. Specifically, it was shown that in the constrained setting, dependence on the dimension in the form of $\Omega\left(\frac{\sqrt{d}}{n\epsilon}\right)$ is unavoidable even for generalized linear models (GLMs) (see Section 2 for a formal definition). By contrast, in the unconstrained setting, it has been shown that dimension-independent rates are possible for GLMs [JT14]. In this setting, assuming prior knowledge of $\|w^*\|$, the best known rate is $O\left(\frac{\|w^*\|}{\sqrt{n}} + \frac{\|w^*\|\sqrt{\mathrm{rank}}}{n\epsilon}\right)$ [SSTT20], where $\mathrm{rank}$ is the expected rank of the design matrix. However, without prior knowledge of $\|w^*\|$, these methods exhibit quadratic dependence on $\|w^*\|$. Furthermore, these results crucially rely on the assumption that the loss is Lipschitz to bound the sensitivity. Although gradient clipping has been proposed to remedy this problem [SSTT20, CWH20], it is known that the solution obtained by clipping may not coincide with that of the original model. Without Lipschitzness, work on differentially private GLMs has largely been limited to linear regression [Wan18, CWZ21]; here, dimension-independent rates have only been obtained under certain sparsity assumptions. More generally, smooth non-negative losses have been studied in the non-private setting by [SST10], where it was shown that such losses admit risk guarantees with linear dependence on the minimizer norm (as in the Lipschitz case). That work also established a lower bound of $\Omega\left(\frac{1}{\sqrt{n}}\right)$ on the excess population risk for this class of loss functions. [Sha15] additionally establishes a lower bound of $\Omega\left(\min\left\{\frac{\|w^*\|^2 + d}{n}, \frac{\|w^*\|}{\sqrt{n}}\right\}\right)$ on the excess population risk by way of linear regression.^1

^1 The [SST10] bound assumes $Y, H, X = \Omega(1)$. The bounds of [Sha15] were originally stated for the constrained setting, but can easily be converted. More details in Appendix E.

### 1.2 Our Contributions

**Smooth non-negative GLMs.** Our primary contribution is a new and nearly optimal rate for the problem of differentially private learning of smooth GLMs. In this setting, we focus on characterizing the excess risk in terms of $n$, $d$, $\epsilon$ and $\|w^*\|$. Specifically, we show that it is possible to achieve a rate of $\tilde{O}\left(\frac{\|w^*\|}{\sqrt{n}} + \min\left\{\frac{\|w^*\|^2}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\,\|w^*\|^2}{n\epsilon}\right\}\right)$ on the excess population risk. Our new rates exhibit an interesting low/high-dimensional transition at $d \approx (\|w^*\| n\epsilon)^{2/3}$. First, in the low-dimensional regime, we develop a novel analysis of noisy gradient descent (GD) inspired by techniques from [SST10]. In particular, we show that Noisy GD gives an improved rate for non-negative smooth functions (not necessarily GLMs). This is based on an average-stability analysis of Noisy GD. As we elaborate in Section 3.1, a straightforward application of uniform stability leads to sub-optimal bounds, and hence a new analysis is required. We note in passing that this upper bound works for (unconstrained) DP-SCO with smooth (non-Lipschitz) losses, which is of independent interest. For the high-dimensional regime, we perform random projections of the data (specifically, the Johnson-Lindenstrauss transform) for dimensionality reduction, roughly reducing the problem to its low-dimensional counterpart. We also develop a lower bound for the excess risk under DP of $\tilde{\Omega}\left(\min\left\{\frac{\|w^*\|^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\,\|w^*\|}{n\epsilon}\right\}\right)$. We note that non-privately a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + \min\left\{\frac{d + \|w^*\|^2}{n}, \frac{\|w^*\|}{\sqrt{n}}\right\}\right)$ is known on the excess population risk [SST10, Sha15].
We note that these private and non-private lower bounds imply that our bound is optimal up to factors of $\|w^*\|$ (see Table 1).

**Lipschitz GLMs.** For the Lipschitz case, we close a subtle but important gap in existing rates. In this setting, it has been shown that one can characterize the excess risk in terms of the expected rank of the design matrix, $\mathrm{rank}$, instead of $d$ [SSTT20]. The best known rate was $\tilde{O}\left(\frac{\|w^*\|}{\sqrt{n}} + \frac{\|w^*\|\sqrt{\mathrm{rank}}}{n\epsilon}\right)$. We show an improved rate of $\tilde{O}\left(\frac{\|w^*\|}{\sqrt{n}} + \min\left\{\frac{\|w^*\|}{\sqrt{n\epsilon}}, \frac{\|w^*\|\sqrt{\mathrm{rank}}}{n\epsilon}\right\}\right)$. This improves in the high-privacy regime where $\epsilon \lesssim \frac{\mathrm{rank}}{n}$. In fact, the upper bound $O\left(\frac{\|w^*\|}{\sqrt{n\epsilon}}\right)$ for this rate can be obtained with only minor adjustments to the regularization method of [JT14]. Our second contribution in this setting is extending the lower bound of [SSTT20] to hold for all values of $\|w^*\| > 0$ and $\mathrm{rank} \in [n]$. This is in contrast to the original lower bound, which only holds for problem instances where $\|w^*\|^2 = \mathrm{rank}$ and $\mathrm{rank} \in [n\epsilon]$.

**Model selection.** As part of our methods, we develop a differentially private model selection approach which eliminates the need for a-priori knowledge of $\|w^*\|$. Although such methods are well established in the non-private case (see, e.g., [SSBD14]), in the private case no such methods have been established. Our method, as in the non-private case, performs a grid search over estimates of $\|w^*\|$ and picks the best model based on the loss. However, in the private setting we must account for the fact that the loss evaluation must be privatized. This is non-trivial in the non-Lipschitz smooth case, as the loss at a point $w$ may grow quadratically with $\|w\|$.

**Lower bounds for non-Euclidean DP-SCO.** Our lower bound construction generalizes to non-Euclidean $\ell_p/\ell_q$ variants of DP-SCO with Lipschitz convex losses [BGN21]. Herein, we assume that the loss function is $G_q$-Lipschitz with respect to the $\ell_q$ norm, and the radius of the constraint set is bounded in the $\ell_p$ norm by $B_p$. For this setting, we give a lower bound of $\Omega\left(G_q B_p \min\left\{\frac{1}{(n\epsilon)^{1/p}}, \frac{d^{(p-1)/p}}{n\epsilon}\right\}\right)$ on the excess empirical/population risk of any (potentially unconstrained) $(\epsilon, \delta)$-DP algorithm; see Corollary 4 in Appendix B.4 for a formal statement and proof. For $p = \infty$, and for $p \ge 2$ with $d \le n\epsilon$, this matches the best known upper bounds in [BGN21].

**Non-private settings.** As by-products, we give the following new results for the non-private setting. For details on the parameters used below, we refer to Table 1.

1. We show that gradient descent, when run on convex non-negative $\tilde{H}$-smooth functions (not necessarily GLMs), achieves the optimal rate of $O\left(\frac{\sqrt{\tilde{H}}\,\|w^*\|\,Y}{\sqrt{n}}\right)$ (see Corollary 1). This is done via an average-stability analysis of gradient descent. This result is interesting as it also shows GD only needs $n$ iterations, which is known not to work for non-smooth SCO [BFGT20, ACKL21, AKL21].
2. In Section D, we give a procedure to boost the confidence of algorithms for risk minimization with convex non-negative $\tilde{H}$-smooth functions (not necessarily GLMs). The standard boosting analysis based on Hoeffding's inequality does not give a bound with a linear dependence on the parameters ($\|w^*\|$, $X$, $Y$), and hence a tighter analysis is required.

### 1.3 Techniques

**Upper bounds.** We give two algorithms for both the smooth and the Lipschitz case. The first method is simple and has two main steps. First, optimize the regularized empirical risk over the constraint set $\{w : \|w\| \le B\}$ for some $B \ge \|w^*\|$. Then output a perturbation of the regularized minimizer with Gaussian noise (which is not required to be in the constraint set).
This method is akin to that of [JT14], with the modification that the regularized minimizer is constrained to a ball. We elaborate on this key difference shortly. The second method is based on dimensionality reduction. We use lower-dimensional data-oblivious embeddings of the feature vectors. A linear JL transform suffices to give embeddings with the required properties. We then run a constrained DP-SCO method (Noisy GD) in the embedded space, and use the transpose of the JL transform to get a $d$-dimensional model. In this method, the embedding dimension required is roughly the threshold on the dimension at which the rates switch from dimension-dependent to dimension-independent bounds. We also remark that [NêUZ20] applied a similar technique to provide dimension-independent classification error guarantees for privately learning support vector machines under hard-margin conditions. We note that a crucial part in all of these methods is the use of constrained optimization as a subroutine, where the constraint set is a ball of radius roughly $\|w^*\|$. This is in stark contrast to the Lipschitz case, where existing methods such as those presented by [JT14, SSTT20] rely on the fact that projection is not required. In the smooth case, however, constrained optimization helps ensure that the norm of the gradient is roughly bounded by the diameter of the constraint set. We note that in the high-dimensional regime, the property that the final output of the algorithm can have large norm is still crucial to the success of our algorithms.

**Lower bounds.** For our lower bounds in the smooth case we rely on the connection between stability and privacy. Specifically, we utilize a lemma from [CH12] which bounds the accuracy of one-dimensional differentially private algorithms. We then combine this with packing arguments to obtain stronger lower bounds for high-dimensional problems. For the Lipschitz case, we adapt the method of [SSTT20].

## 2 Preliminaries

In the following we detail several concepts needed for the presentation of this paper.

**Risk minimization.** Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the domain of features, and $\mathcal{Y} \subseteq \mathbb{R}$ be the domain of responses. A linear predictor is any $w \in \mathbb{R}^d$. Let $\ell: \mathbb{R}^d \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$ be a loss function. Given some unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, we define the population loss $L(w; \mathcal{D}) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(w; (x,y))]$. Given some dataset $S \in (\mathcal{X} \times \mathcal{Y})^n$ drawn i.i.d. from $\mathcal{D}$, the objective is to obtain $\hat{w} \in \mathbb{R}^d$ which minimizes the excess population risk, $L(\hat{w}; \mathcal{D}) - \min_{w \in \mathbb{R}^d}\{L(w; \mathcal{D})\}$. Given a population risk minimization problem, we denote by $w^*$ a minimum-norm solution to this problem. We define the empirical risk as $\hat{L}(w; S) = \frac{1}{n}\sum_{(x,y)\in S}\ell(w; (x,y))$. We define the following quantities for notational convenience: $\varepsilon_{\mathrm{risk}}(w) = L(w; \mathcal{D}) - L(w^*; \mathcal{D})$, $\varepsilon_{\mathrm{erm}}(w) = \hat{L}(w; S) - \hat{L}(w^*; S)$, and $\varepsilon_{\mathrm{gen}}(w) = L(w; \mathcal{D}) - \hat{L}(w; S)$. We define $\mathcal{B}_B$ to be the Euclidean ball of radius $B$ in $\mathbb{R}^d$.

**Generalized linear models.** We will more specifically be interested in the problem of learning generalized linear models, where there exists some function $\phi: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ such that the loss function can be written as $\ell(w; (x,y)) = \phi(\langle w, x\rangle, y)$. We define the parameter bounds $X = \max_{x\in\mathcal{X}}\|x\|$ and $Y^2 = \max_{y\in\mathcal{Y}}|\phi(0, y)|$. Note that for many common loss functions, the latter condition is the moral equivalent of assuming labels bounded by $Y$. For ease of notation, we write $\phi(\langle w, x\rangle, y)$ as $\phi_y(\langle w, x\rangle)$ and denote by $\phi_y$ the function $z \mapsto \phi(z, y)$. We say that the loss function is a $G$-Lipschitz GLM if for all $y \in \mathcal{Y}$, $\phi_y: \mathbb{R} \to \mathbb{R}$ is $G$-Lipschitz. We similarly define an $H$-smooth GLM.
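To make the notation above concrete, the following sketch (illustrative only; the synthetic data, the squared-loss choice, and all identifiers are ours, not from the paper) instantiates a squared-loss GLM, $\phi_y(z) = (z - y)^2$, and evaluates the empirical risk $\hat{L}(w; S)$.

```python
import numpy as np

# Squared-loss GLM: ell(w; (x, y)) = phi_y(<w, x>) with phi_y(z) = (z - y)^2.
def glm_losses(w, feats, labels):
    """Per-example squared-loss GLM values for a batch of (x, y) pairs."""
    z = feats @ w                       # inner products <w, x_i>
    return (z - labels) ** 2

def empirical_risk(w, feats, labels):
    """hat L(w; S): average loss over the dataset S = {(x_i, y_i)}."""
    return glm_losses(w, feats, labels).mean()

# Illustrative synthetic data (not from the paper); features normalized so X = 1.
rng = np.random.default_rng(0)
n, d = 200, 5
feats = rng.normal(size=(n, d))
feats /= np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1.0)
w_star = np.array([1.0, -0.5, 0.0, 0.2, 0.0])
labels = feats @ w_star + 0.1 * rng.normal(size=n)

print("empirical risk at 0: ", empirical_risk(np.zeros(d), feats, labels))
print("empirical risk at w*:", empirical_risk(w_star, feats, labels))
```

Under this instantiation, $Y^2 = \max_y \phi_y(0) = \max_y y^2$ and each $\phi_y$ is $H$-smooth with $H = 2$.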
**Differential privacy [DKM+06].** We restrict our investigation to the class of algorithms which minimize the excess population risk under the constraint of differential privacy. A randomized algorithm $\mathcal{A}$ is said to be $(\epsilon, \delta)$-differentially private (i.e., $(\epsilon, \delta)$-DP) if for any pair of datasets $S$ and $S'$ differing in one point and any event $E$ in the range of $\mathcal{A}$, it holds that $\mathbb{P}[\mathcal{A}(S) \in E] \le e^{\epsilon}\,\mathbb{P}[\mathcal{A}(S') \in E] + \delta$.

For our lower bounds, we will make use of the following lemma from [CH12].

**Lemma 1.** Let $\mathcal{Z}$ be a data domain and let $S$ and $S'$ be two datasets, each in $\mathcal{Z}^n$, that differ in at most $\Delta$ entries, and let $\mathcal{A}: \mathcal{Z}^n \to \mathbb{R}$ be any $(\epsilon, \delta)$-DP algorithm. For all $\tau, \tau' \in \mathbb{R}$ and $\gamma \in (0, \tfrac{1}{2})$, if $\Delta \le \frac{\log(1/(2\gamma))}{\epsilon}$ and $\delta \le \frac{1 - e^{-\epsilon}}{16}$, then
$$\mathbb{E}\big[|\mathcal{A}(S) - \tau| + |\mathcal{A}(S') - \tau'|\big] \ge \tfrac{1}{8}\,|\tau - \tau'|.$$

Finally, we introduce the Johnson-Lindenstrauss (JL) transform, which we use to perform random projections.

**Definition 1 ($(\alpha,\beta)$-JL property).** A distribution over matrices $\Phi \in \mathbb{R}^{k\times d}$ satisfies the $(\alpha, \beta)$-JL property if for any $u, v \in \mathbb{R}^d$,
$$\mathbb{P}_{\Phi}\big[\,|\langle \Phi u, \Phi v\rangle - \langle u, v\rangle| > \alpha\,\|u\|\,\|v\|\,\big] \le \beta.$$

It is well known that several such data-oblivious (i.e., independent of $u, v$) distributions exist with $k = O\left(\frac{\log(1/\beta)}{\alpha^2}\right)$ [Nel11]. We note that the JL property is typically described as approximation of norms (or distances), but it is easy to deduce the above dot-product preservation property from it; for completeness we give this as Lemma 12. Finally, we use $\Phi\mathcal{D}$ to denote the push-forward measure of the distribution $\mathcal{D}$ under the map $\Phi: (x, y) \mapsto (\Phi x, y)$. Similarly, given a dataset $S = \{(x_i, y_i)\}_i$, we define $\Phi S := \{(\Phi x_i, y_i)\}_i$.

## 3 Smooth Non-negative GLMs

For smooth non-negative GLMs, we present new upper and lower bounds on the excess risk. For our upper bounds, we assume here that the algorithm is given access to some upper bound on $\|w^*\|$, which we denote by $B$. We later show in Section 5 how to obtain such a rate without prior knowledge of $\|w^*\|$. We also emphasize that the privacy of these algorithms holds regardless of whether or not $B \ge \|w^*\|$.

### 3.1 Upper Bounds

Before presenting our algorithms, we highlight some key ideas underlying all our methods. A crucial property of non-negative smooth loss functions which allows one to bound sensitivity, and thus ensure privacy, is the self-bounding property (e.g., Sec. 12.1.3 in [SSBD14]), which states that for an $\tilde{H}$-smooth non-negative function $f$ and $u \in \mathrm{dom}(f)$, $\|\nabla f(u)\| \le \sqrt{2\tilde{H}f(u)}$. This property implies that the gradient grows at most linearly with $\|w\|$. More precisely, we have the following.

**Lemma 2.** Let $\ell$ be an $H$-smooth non-negative GLM. Then for any $w \in \mathcal{B}_B$ and $(x, y) \in \mathcal{X}\times\mathcal{Y}$ we have $\|\nabla\ell(w; (x,y))\| \le 2Y\sqrt{H}\,X + 2HBX^2$.

In order to leverage this property, all our algorithms in this setting utilize constrained optimization as a subroutine, where the constraint set is $\mathcal{B}_B$ and the Lipschitz constant is $G = 2Y\sqrt{H}X + 2HBX^2$. This, in conjunction with the self-bounding property, ensures reasonable bounds on sensitivity. In turn, this allows us to ensure privacy without excessive levels of noise. Finally, we note that our upper bounds for smooth GLMs distinguish low- and high-dimensional regimes, transitioning at $d$ of order $(n\epsilon)^{2/3}$ (up to factors of $B$).
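To illustrate where this Lipschitz constant comes from, the following is a sketch of the calculation behind Lemma 2 (our rendering; the paper's constants may be arranged slightly differently):
$$\|\nabla \ell(w;(x,y))\| = |\phi_y'(\langle w,x\rangle)|\,\|x\| \le \big(|\phi_y'(0)| + H\,|\langle w,x\rangle|\big)\|x\| \le \big(\sqrt{2H\,\phi_y(0)} + HBX\big)X \le \sqrt{2H}\,YX + HBX^2,$$
using $H$-smoothness of $\phi_y$ for the first inequality, the self-bounding property applied to $\phi_y$ at zero for the second, and $\phi_y(0) \le Y^2$, $|\langle w, x\rangle| \le BX$ for $w \in \mathcal{B}_B$, and $\|x\| \le X$ throughout.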
#### 3.1.1 Low-dimensional regime

We start with the low-dimensional setting, where we use techniques developed for constrained DP-SCO with Lipschitz losses (not necessarily GLMs). Existing private algorithms for DP-SCO (e.g., [BFTGT19, FKT20]) lead to excess risk bounds that scale with $\frac{HB^2}{\sqrt{n}}$. On the other hand, the optimal non-private rate [SST10] scales with $\frac{\sqrt{H}B}{\sqrt{n}}$, which may indicate that the private rate implied by the known methods is sub-optimal. We show that this gap can be closed by a novel analysis of private GD.

A standard proof of the excess risk of (noisy) gradient descent for smooth convex functions is based on uniform stability [HRS15, BFTGT19]. However, this still leads to sub-optimal rates. Hence we turn to an average-stability-based analysis of GD, yielding the following result.

**Theorem 1.** Let $\ell$ be a non-negative, convex, $\tilde{H}$-smooth loss function, bounded at zero by $Y^2$. Let $B, n > 0$, $n_0 = \frac{\tilde{H}B^2}{Y^2}$, and $G = 2Y\sqrt{\tilde{H}} + 2\tilde{H}B$. Then, for any $\epsilon, \delta > 0$, Algorithm 1 invoked with $\mathcal{W} = \mathcal{B}_B$, $T = n$, $\sigma^2 = \frac{8G^2T\log(1/\delta)}{n^2\epsilon^2}$, and $\eta = \min\left\{\frac{B}{\sigma\sqrt{dT}}, \frac{1}{4\tilde{H}}\right\}$ is $(\epsilon, \delta)$-differentially private. Further, given a dataset $S$ of $n \ge n_0$ i.i.d. samples from an unknown distribution $\mathcal{D}$, the excess risk of the output of Algorithm 1 is bounded as
$$\mathbb{E}[\varepsilon_{\mathrm{risk}}(\hat{w})] \le O\left(\frac{\sqrt{\tilde{H}}BY}{\sqrt{n}} + \frac{GB\sqrt{d\log(1/\delta)}}{n\epsilon}\right).$$

We note that since $H$-smooth GLMs satisfy the theorem's condition with parameter $\tilde{H} \le HX^2$, we obtain results for GLMs as a direct corollary.

Non-private risk bound: As a corollary (see Corollary 1 in Appendix A.3.1), with no privacy constraint, the above result (setting $\epsilon \to \infty$ and $\delta = 1$) shows that gradient descent achieves the optimal excess risk bound, previously shown to be achievable by regularized ERM and one-pass SGD [SST10]. The lower bound on the sample size, $n_0$, simply means that the trivial solution "zero" has larger excess risk.

Proof sketch of Theorem 1. The privacy proof follows directly from [BST14], since the loss function is $G$-Lipschitz on the constraint set. For utility, we first introduce two concepts used in the proof. Let $S$ be a dataset of $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$, and let $S^{(i)}$ be the dataset where the $i$-th data point is replaced by an i.i.d. point $(x', y')$. Let $\mathcal{A}$ be an algorithm which takes a dataset as input and outputs $\mathcal{A}(S)$. The average argument stability of $\mathcal{A}$, denoted $\varepsilon_{\mathrm{av\text{-}stab}}(\mathcal{A})$, is defined as $\varepsilon_{\mathrm{av\text{-}stab}}(\mathcal{A})^2 = \mathbb{E}_{S, i, (x',y')}\big\|\mathcal{A}(S) - \mathcal{A}(S^{(i)})\big\|^2$. The average regret of gradient descent (Algorithm 1) with iterates $\{w_t\}_{t=1}^T$ is $\varepsilon_{\mathrm{reg}}(\mathcal{A}; w^*) = \frac{1}{T}\sum_{j=1}^{T}\mathbb{E}[\hat{L}(w_j; S) - \hat{L}(w^*; S)]$. The key arguments are as follows: we first bound the generalization error, or on-average stability, in terms of average argument stability and excess empirical risk (Lemma 4). We then bound average argument stability in terms of average regret (Lemma 5). Finally, in Lemma 6, we provide bounds on the excess empirical risk and average regret of noisy gradient descent. Substituting these in the excess risk decomposition gives the claimed bound. The full proof with the above lemmas is deferred to Appendix A.3.1.

**Algorithm 1 (Noisy GD).**
Input: dataset $S$, loss function $\ell$, constraint set $\mathcal{W}$, steps $T$, learning rate $\eta$, noise scale $\sigma^2$.
1: $w_0 = 0$
2: for $t = 0$ to $T - 1$ do
3: $\xi \sim \mathcal{N}(0, \sigma^2 I)$
4: $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta\big(\nabla\hat{L}(w_t; S) + \xi\big)\big)$, where $\Pi_{\mathcal{W}}$ is the Euclidean projection onto $\mathcal{W}$
5: end for
Output: $\hat{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$
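A minimal NumPy sketch of Algorithm 1 for the squared-loss GLM follows; it is ours and mirrors the pseudocode directly (noise added to the full-batch gradient, Euclidean projection onto $\mathcal{B}_B$, average iterate returned). The parameter values in the usage example are illustrative rather than the settings prescribed by Theorem 1.

```python
import numpy as np

def noisy_gd(feats, labels, B, T, eta, sigma2, seed=0):
    """Projected Noisy GD (Algorithm 1) on the empirical squared-loss GLM risk.

    Constraint set W = {w : ||w|| <= B}; returns the average iterate."""
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    w = np.zeros(d)
    iterates = []
    for _ in range(T):
        grad = 2.0 * feats.T @ (feats @ w - labels) / n   # grad of (1/n) sum (<w,x>-y)^2
        xi = rng.normal(scale=np.sqrt(sigma2), size=d)    # Gaussian noise for privacy
        w = w - eta * (grad + xi)
        norm = np.linalg.norm(w)
        if norm > B:                                      # Euclidean projection onto B_B
            w *= B / norm
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)

# Illustrative usage with hypothetical parameter values.
rng = np.random.default_rng(1)
n, d = 500, 10
feats = rng.normal(size=(n, d))
feats /= np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1.0)
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
labels = feats @ w_star + 0.05 * rng.normal(size=n)
w_hat = noisy_gd(feats, labels, B=2.0, T=n, eta=0.05, sigma2=1e-3)
print("||w_hat - w*|| =", np.linalg.norm(w_hat - w_star))
```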
#### 3.1.2 High-dimensional regime

In the high-dimensional setting, we present two techniques.

**JL method.** In the JL method, Algorithm 2, we use a data-oblivious JL map $\Phi$ to embed all feature vectors in the dataset $S$ into $k < d$ dimensions. Let $\tilde{S} = \{(\Phi x_i, y_i)\}_{i=1}^n$. We then run the projected Noisy GD method (Algorithm 1) on the loss with dataset $\tilde{S}$ and with the diameter of the constraint set set to $2B$. Finally, we map the returned output $\tilde{w}$ back to $d$ dimensions using $\Phi^\top$ to get $\hat{w} = \Phi^\top\tilde{w}$. We note that no projection is performed on $\hat{w}$, and thus the output may have large norm due to the re-scaling induced by $\Phi^\top$.

**Algorithm 2 (JL Method).**
Input: dataset $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, loss function $\ell$, $k$, $B$, $\eta$, $T$, $\sigma^2$.
1: Sample a JL matrix $\Phi \in \mathbb{R}^{k\times d}$
2: $\tilde{S} = \{(\Phi x_1, y_1), (\Phi x_2, y_2), \dots, (\Phi x_n, y_n)\}$
3: $\tilde{w} = \mathrm{NoisyGD}(\tilde{S}, \ell, \mathcal{B}_{2B}, T, \eta, \sigma^2)$
Output: $\hat{w} = \Phi^\top\tilde{w}$

**Algorithm 3 (Regularized Output Perturbation).**
Input: dataset $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, loss function $\ell$, $\lambda$, $B$, $\sigma^2$.
1: $\tilde{w} = \arg\min_{w\in\mathcal{B}_B}\left\{\hat{L}(w; S) + \frac{\lambda}{2}\|w\|^2\right\}$
2: $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$
Output: $\hat{w} = \tilde{w} + \xi$

**Theorem 2.** Let $k = O\left(\left(\frac{B\sqrt{H}X\log(2n/\delta)\,n\epsilon}{YX + \sqrt{H}BX^2}\right)^{2/3}\right)$, $\mathcal{W} = \mathcal{B}_{2B}$, $T = n$, $\sigma^2 = \frac{8G^2T\log(1/\delta)}{n^2\epsilon^2}$, $\eta$ as in Theorem 1 (with $d$ replaced by $k$ and $\tilde{H} = HX^2$), and $n_0 = \frac{HB^2X^2}{Y^2}$. Algorithm 2 satisfies $(\epsilon, \delta)$-differential privacy. Given a dataset $S$ of $n \ge n_0$ i.i.d. samples, the excess risk of the output $\hat{w}$ is bounded by
$$\mathbb{E}[\varepsilon_{\mathrm{risk}}(\hat{w})] \le \tilde{O}\left(\frac{\sqrt{H}XBY}{\sqrt{n}} + \frac{(\sqrt{H}XB)^{4/3}Y^{2/3}}{(n\epsilon)^{2/3}}\right).$$

Proof sketch of Theorem 2. From the JL property, with $k = O(\log(n/\delta)/\alpha^2)$, w.h.p. the norms of all feature vectors and of $w^*$, as well as the inner products between them, are preserved up to an $\alpha$ tolerance (see Definition 1). The preservation of the norms of the feature vectors implies that the gradient norms are preserved, and thus the privacy guarantee of the sub-routine, Algorithm 1, suffices to establish DP. For the utility proof, from our analysis of Noisy GD (Theorem 1) and using the JL property, the excess risk of $\hat{w}$ under $\mathcal{D}$ w.r.t. the risk of $\Phi w^*$ under $\Phi\mathcal{D}$ is bounded as
$$\mathbb{E}[L(\hat{w}; \mathcal{D}) - L(\Phi w^*; \Phi\mathcal{D})] \le O\left(\frac{\sqrt{H}XBY}{\sqrt{n}} + \frac{(\sqrt{H}XY + HX^2B)\,B\sqrt{k\log(1/\delta)}}{n\epsilon}\right).$$
From smoothness and the JL property, the JL error is
$$\mathbb{E}[L(\Phi w^*; \Phi\mathcal{D})] - L(w^*; \mathcal{D}) \le \tilde{O}\left(\frac{HB^2X^2}{k}\right).$$
The above is optimized by the value of $k$ prescribed in Theorem 2, substituting which gives the claimed bound.

**Constrained regularized ERM + output perturbation.** Our second technique is constrained regularized ERM with output perturbation (Algorithm 3). A similar technique for the Lipschitz case was used in [JT14]; however, we note that the addition of the constraint set $\mathcal{B}_B$ is crucial to bounding the sensitivity in the smooth case.

**Theorem 3.** Let $n_0 = \frac{HB^2X^2}{Y^2}$. Then Algorithm 3 run with $\sigma^2 = O\left(\frac{G^2\log(1/\delta)}{\lambda^2 n^2\epsilon^2}\right)$ (with $G$ as in Lemma 2) and $\lambda = \frac{(\sqrt{H}XY + HBX^2)^{2/3}(\log(1/\delta))^{1/3}}{B^{2/3}(n\epsilon)^{2/3}}$ satisfies $(\epsilon, \delta)$-differential privacy. Given a dataset $S$ of $n \ge n_0$ i.i.d. samples, the excess risk of its output $\hat{w}$ is bounded as
$$\mathbb{E}[\varepsilon_{\mathrm{risk}}(\hat{w})] \le \tilde{O}\left(\frac{\sqrt{H}BXY + Y^2}{\sqrt{n}} + \frac{(\sqrt{H}BX)^{4/3}Y^{2/3} + Y^2}{(n\epsilon)^{2/3}}\right).$$

We note that we can use the same technique in the low-dimensional setting as well, yielding a rate whose statistical term is again $\frac{\sqrt{H}BXY + Y^2}{\sqrt{n}}$ and whose privacy term scales with $\frac{\sqrt{d}}{n\epsilon}$. However, in contrast to Theorems 1 and 2, these results have an additional $\frac{Y^2}{\sqrt{n}}$ term (in both regimes). Thus, in the regime where $Y \lesssim \sqrt{H}BX$, the two upper bounds are of the same order.

Proof sketch of Theorem 3. The privacy proof follows from the Gaussian mechanism guarantee together with the fact that the $\ell_2$-sensitivity of constrained regularized ERM is $O\left(\frac{G}{\lambda n}\right)$ [BE02]. For utility, we use the Rademacher-complexity-based result of [SST10] to bound the generalization error of $\tilde{w}$. The other term, the error due to the noise, satisfies $\mathbb{E}[L(\hat{w}; \mathcal{D}) - L(\tilde{w}; \mathcal{D})] \le O(H\sigma^2X^2)$ by smoothness of the GLM. Combining these two and optimizing over $\lambda$ gives the claimed result.
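A sketch of Algorithm 3 in the same style follows; it is ours and again specializes to the squared-loss GLM. The constrained regularized ERM step is solved here by projected gradient descent, a solver choice of ours (the paper does not prescribe one), and the inputs `lam` and `sigma2` are to be set as in Theorem 3 (smooth case) or Theorem 5 (Lipschitz case).

```python
import numpy as np

def constrained_regularized_erm(feats, labels, lam, B, steps=2000, eta=0.1):
    """argmin over {||w|| <= B} of (1/n) sum (<w,x>-y)^2 + (lam/2) ||w||^2,
    solved here by projected gradient descent (our solver choice)."""
    n, d = feats.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2.0 * feats.T @ (feats @ w - labels) / n + lam * w
        w = w - eta * grad
        norm = np.linalg.norm(w)
        if norm > B:
            w *= B / norm
    return w

def regularized_output_perturbation(feats, labels, lam, B, sigma2, seed=0):
    """Algorithm 3: constrained regularized ERM followed by Gaussian output perturbation."""
    rng = np.random.default_rng(seed)
    w_tilde = constrained_regularized_erm(feats, labels, lam, B)
    xi = rng.normal(scale=np.sqrt(sigma2), size=w_tilde.shape)
    return w_tilde + xi   # released model; it is not projected back onto B_B
```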
### 3.2 Lower Bounds

The proof technique for our lower bound relies on the connection between differential privacy and sensitivity shown in [CH12]. The underlying mechanism behind the proof is also similar in nature to the method seen in [SU17]. We note that although we present the following theorem for the empirical risk, a reduction found in [BFTGT19] implies that the result holds for the population risk as well (up to log factors).

**Theorem 4.** Let $\epsilon \in [1/n, 1]$ and $\delta \le \frac{1 - e^{-\epsilon}}{16}$. For any $(\epsilon, \delta)$-DP algorithm $\mathcal{A}$, there exists a dataset $S$ with empirical minimizer of norm $B$ such that the excess empirical risk of $\mathcal{A}$ is lower bounded by
$$\Omega\left(\min\left\{Y^2,\ \frac{(BX)^{4/3}(HY)^{2/3}}{(n\epsilon)^{2/3}}\right\}\right).$$

The problem instance used in the proof is the squared loss function, and thus this result holds additionally for the more specific case of linear regression. We also note that this lower bound implies our upper bound is optimal whenever $BX \le Y$, which is a commonly studied regime in linear regression [SST10, KL15]. We here provide a proof sketch and defer the full proof to Appendix A.7.

Proof sketch of Theorem 4. Define $F(w; S) = \frac{1}{n}\sum_{(x,y)\in S}(\langle w, x\rangle - y)^2$. Let $d' < \min\{n, d\}$ and $b, p \in [0, 1]$ be parameters to be chosen later. For any $\sigma \in \{\pm 1\}^{d'}$, define the dataset $S_\sigma$ which consists of the union of $d'$ sub-datasets $S_1, \dots, S_{d'}$, given as follows. Set $\frac{pn}{d'}$ of the feature vectors in $S_j$ to $Xe_j$ (the rescaled $j$-th standard basis vector) and the rest to the zero vector. Set $\frac{pn}{2d'}(1+b)$ of the labels to $\sigma_j Y$ and $\frac{pn}{2d'}(1-b)$ of the labels to $-\sigma_j Y$. Let $w_\sigma = \arg\min_{w\in\mathbb{R}^d}\{F(w; S_\sigma)\}$ be the ERM minimizer of $F(\cdot; S_\sigma)$. Following from Lemma 2 of [Sha15], we have for any $\bar{w} \in \mathbb{R}^d$ that
$$F(\bar{w}; S_\sigma) - F(w_\sigma; S_\sigma) \ge \frac{pX^2}{d'}\sum_{j=1}^{d'}(\bar{w}_j - w_{\sigma,j})^2. \qquad (1)$$
We will now show lower bounds on the per-coordinate error. Consider any $\sigma$ and $\sigma'$ which differ only at index $j$ for some $j \in [d']$. Note that the datasets $S_\sigma$ and $S_{\sigma'}$ differ in $\Delta = \frac{pn}{2d'}[(1+b)-(1-b)] = \frac{pbn}{d'}$ points. Let $\tau = w_{\sigma,j} = \frac{Yb}{X}$ and $\tau' = w_{\sigma',j} = -\frac{Yb}{X}$ (i.e., the $j$-th components of the empirical minimizers for $S_\sigma$ and $S_{\sigma'}$, respectively). Note that $|w_{\sigma,j} - w_{\sigma',j}| = \frac{2Yb}{X}$. Setting $d' = \left(\frac{pXBn\epsilon}{Y}\right)^{2/3}$ and $b = \left(\frac{XB}{Y\sqrt{pn\epsilon}}\right)^{2/3}$ ensures $\Delta \le \frac{1}{\epsilon}$, and thus Lemma 1 can be used to obtain
$$\mathbb{E}\left[|\mathcal{A}(S_\sigma)_j - w_{\sigma,j}|^2 + |\mathcal{A}(S_{\sigma'})_j - w_{\sigma',j}|^2\right] \ge \frac{1}{32}\left(\frac{Yb}{X}\right)^2 = \frac{B^{4/3}Y^{2/3}}{32(Xpn\epsilon)^{2/3}}.$$
One can now show via a packing argument that
$$\max_{\sigma}\left\{\mathbb{E}\left[F(\mathcal{A}(S_\sigma)) - F(w_\sigma)\right]\right\} \ge \frac{(XB)^{4/3}Y^{2/3}p^{1/3}}{128(n\epsilon)^{2/3}}.$$
The result then follows from setting $p = \min\left\{1, \frac{d^{3/2}Y}{XBn\epsilon}\right\}$.

## 4 Lipschitz GLMs

In the Lipschitz case, we close a subtle gap in existing rates. We recall that in this setting a more precise characterization in terms of the expected rank of the design matrix is possible (as opposed to using $d$). The best known upper bound is $\tilde{O}\left(\frac{\|w^*\|\sqrt{\mathrm{rank}}}{n\epsilon}\right)$, assuming knowledge of $\|w^*\|$. This bound was shown to be optimal when $\epsilon \gtrsim \mathrm{rank}/n$ and $\|w^*\| = \sqrt{\mathrm{rank}}$ [SSTT20]. We first show that in the high-privacy regime where $\epsilon \lesssim \mathrm{rank}/n$, an improved rate is possible. Specifically, we show that in this regime constrained regularized ERM with output perturbation achieves the optimal rate. In fact, we note that the method of [JT14] (i.e., unconstrained regularized ERM with output perturbation) can obtain this rate when $\epsilon = O(1)$ if the regularization parameter is set differently. We present the constrained version in order to leverage Rademacher complexity arguments and provide a slightly cleaner bound that holds for all $\epsilon > 0$.

**Theorem 5.** Algorithm 3 run with parameters $\sigma^2 = \frac{4G^2X^2\log(1/\delta)}{\lambda^2n^2\epsilon^2}$ and $\lambda = \frac{GX(\log(1/\delta))^{1/4}}{B\sqrt{n\epsilon}}$ satisfies $(\epsilon, \delta)$-DP. Given a dataset of $n$ i.i.d. samples from $\mathcal{D}$, its output has excess risk
$$\mathbb{E}[\varepsilon_{\mathrm{risk}}(\hat{w})] = \tilde{O}\left(\frac{GBX}{\sqrt{n}} + \frac{GBX}{\sqrt{n\epsilon}}\right).$$

We also state and prove a similar bound using the JL technique in Appendix B. Next, we show that a modification of the lower bound of [SSTT20] establishes that this new upper bound is tight for all settings of $B$, $\mathrm{rank}$, and $\epsilon$. We note that their lower bound only held for problem instances where $\mathrm{rank} = \|w^*\|^2$ and $\epsilon \gtrsim \mathrm{rank}/n$. By contrast, the upper bound $O\left(\frac{GBX}{\sqrt{n\epsilon}}\right)$ holds for any values of $\mathrm{rank}$ and $\|w^*\|$.

**Theorem 6.** Let $G, Y, X, B > 0$, $\epsilon \le 1.2$ and $\delta \le \epsilon$. For any $(\epsilon, \delta)$-DP algorithm $\mathcal{A}$, there exists a $G$-Lipschitz GLM loss bounded at zero by $Y$ and a distribution $\mathcal{D}$ with $\|w^*\| \le B$ such that the output of $\mathcal{A}$ on $S \sim \mathcal{D}^n$ satisfies
$$\mathbb{E}[\varepsilon_{\mathrm{risk}}(\mathcal{A}(S))] = \Omega\left(GBX\min\left\{1,\ \frac{1}{\sqrt{n\epsilon}},\ \frac{\sqrt{\mathrm{rank}}}{n\epsilon}\right\}\right).$$

All proofs for this section are deferred to Appendix B.
## 5 Adapting to $\|w^*\|$

Our method for privately adapting to $\|w^*\|$ is given in Algorithm 4. We start by giving a high-level overview and defining some necessary preliminaries. The algorithm works in the following manner. First, we define a number of guesses $K$ for $\|w^*\|$: $B_1, \dots, B_K$, where $B_j = 2^j$ for $j \in [K]$. Then, given black-box access to a DP optimization algorithm $\mathcal{A}$, Algorithm 4 generates $K$ candidate vectors $w_1, \dots, w_K$ using $\mathcal{A}$, a training set $S_1 \in (\mathcal{X}\times\mathcal{Y})^{n/2}$, and the guesses $B_1, \dots, B_K$. We assume $\mathcal{A}$ satisfies the following accuracy assumption for some confidence parameter $\beta > 0$.

**Assumption 1.** There exists a function $\mathrm{ERR}: \mathbb{R}_+ \to \mathbb{R}_+$ such that for any $B \in \mathbb{R}_+$, whenever $B \ge \|w^*\|$, with probability at least $1 - \frac{\beta}{4K}$ under the randomness of $S_1 \sim \mathcal{D}^{n/2}$ and $\mathcal{A}$, it holds that $\varepsilon_{\mathrm{risk}}(\mathcal{A}(S_1, B); \mathcal{D}) \le \mathrm{ERR}(B)$.

After generating the candidate vectors, the goal is to pick the guess with the smallest excess population risk in a differentially private manner using a validation set $S_2$. The following assumption on $\mathcal{A}$ allows us both to ensure the privacy of the model selection algorithm and to verify that $\hat{L}(w_j; S_2)$ provides a tight estimate of $L(w_j; \mathcal{D})$.

**Assumption 2.** There exists a function $\Delta: \mathbb{R}_+ \to \mathbb{R}_+$ such that for any dataset $S_2 \in (\mathcal{X}\times\mathcal{Y})^{n/2}$ and any $B > 0$,
$$\mathbb{P}_{\mathcal{A}}\big[\exists\,(x, y) \in S_2 : |\ell(\mathcal{A}(S_1, B); (x, y))| \ge \Delta(B)\big] \le \min\{\delta, \beta\}.$$

Specifically, our strategy will be to use the Generalized Exponential Mechanism, GenExpMech, of [RS15] in conjunction with a penalized score function. Roughly, this score-function penalty ensures that the looser guarantees on the population loss estimate when $B$ is large do not interfere with the loss estimates at smaller values of $B$. We provide the relevant details for GenExpMech in Appendix C.1. We now state our result.

**Algorithm 4 (Private Grid Search).**
Input: dataset $S \in (\mathcal{X}\times\mathcal{Y})^n$, grid parameter $K$, optimization algorithm $\mathcal{A}: (\mathcal{X}\times\mathcal{Y})^{n/2}\times\mathbb{R} \to \mathbb{R}^d$, privacy parameters $(\epsilon, \delta)$.
1: Partition $S$ into two disjoint sets, $S_1$ and $S_2$, of size $n/2$
2: $w_0 = 0$
3: for $j \in [K]$ do
4: $B_j = 2^j$
5: $w_j = \mathcal{A}(S_1, B_j)$
6: $\tilde{L}_j = \hat{L}(w_j; S_2) + \Delta(B_j)\frac{\log(K/\beta)}{n\epsilon} + 4Y^2\sqrt{\frac{\log(K/\beta)}{n}}$
7: end for
8: Set $j^*$ as the output of GenExpMech run with privacy parameter $\frac{\epsilon}{2}$, confidence parameter $\frac{\beta}{4}$, and sensitivity/score pairs $(0, Y^2), (\Delta(B_1), \tilde{L}_1), \dots, (\Delta(B_K), \tilde{L}_K)$
9: Output $w_{j^*}$

**Theorem 7.** Let $\ell: \mathbb{R}^d \times (\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}$ be a smooth non-negative loss function such that $\ell(0; (x, y)) \le Y^2$ for all $(x, y) \in \mathcal{X}\times\mathcal{Y}$. Let $\epsilon, \delta, \beta \in [0, 1]$. Let $K > 0$ satisfy $\mathrm{ERR}(2^K) \ge Y^2$. Let $\mathcal{A}$ be an $(\frac{\epsilon}{2K}, \frac{\delta}{2K})$-DP algorithm satisfying Assumption 2. Then Algorithm 4 is $(\epsilon, \delta)$-DP. Further, if $\mathcal{A}$ satisfies Assumption 1 and $S_1 \sim \mathcal{D}^{n/2}$, then Algorithm 4 outputs $\bar{w}$ such that with probability at least $1 - \beta$,
$$\varepsilon_{\mathrm{risk}}(\bar{w}; \mathcal{D}) \le \min\left\{Y^2,\ \mathrm{ERR}(2\max\{\|w^*\|, 1\}) + 4Y^2\sqrt{\frac{\log(4K/\beta)}{n}} + 5\Delta(2\max\{\|w^*\|, 1\})\frac{\log(4K/\beta)}{n\epsilon}\right\}.$$
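The control flow of Algorithm 4 is sketched below. This is our simplification: the selection in step 8 is replaced by a report-noisy-min rule with Laplace noise scaled to each candidate's sensitivity, standing in for the Generalized Exponential Mechanism of [RS15], so the snippet illustrates the structure of the method rather than its exact privacy accounting. The helpers `base_alg`, `loss_fn`, and `delta_fn` are assumed to be supplied by the caller.

```python
import numpy as np

def private_grid_search(S1, S2, K, base_alg, loss_fn, delta_fn, eps, beta, Y, seed=0):
    """Simplified Algorithm 4: grid search over guesses B_j = 2^j for ||w*||.

    S1, S2    : (features, labels) tuples for training and validation.
    base_alg  : DP optimizer, base_alg(S1, B) -> model for radius guess B.
    loss_fn   : loss_fn(w, S) -> average loss of w on dataset S.
    delta_fn  : Delta(B), a bound on the per-example loss (Assumption 2).
    Selection uses report-noisy-min with Laplace noise scaled to each candidate's
    sensitivity, as a stand-in for the Generalized Exponential Mechanism."""
    rng = np.random.default_rng(seed)
    n2 = len(S2[1])
    models, scores = [], []
    for j in range(1, K + 1):
        B_j = 2.0 ** j
        w_j = base_alg(S1, B_j)
        penalty = delta_fn(B_j) * np.log(K / beta) / (n2 * eps) \
                  + 4 * Y**2 * np.sqrt(np.log(K / beta) / n2)
        models.append(w_j)
        scores.append(loss_fn(w_j, S2) + penalty)
    noisy = [s + rng.laplace(scale=2 * delta_fn(2.0 ** (j + 1)) / (n2 * eps))
             for j, s in enumerate(scores)]
    return models[int(np.argmin(noisy))]
```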
We note that we develop a generic confidence-boosting approach to obtain high-probability guarantees from our previously described algorithms (see Appendix D), and thus obtaining algorithms which satisfy Assumption 1 is straightforward. We provide more details on how our algorithms satisfy Assumption 2 in Appendix C.4. The following theorem details the guarantees implied by this method for output perturbation with boosting (see Theorems 13 and 15). Full details are in Appendix C.3.

**Theorem 8.** Let $K, \epsilon, \delta, \beta > 0$ and let $\mathcal{A}$ be the algorithm formed by running Algorithm 3 with boosting and privacy parameters $\epsilon' = \frac{\epsilon}{K}$, $\delta' = \frac{\delta}{K}$. Then there exists a setting of $K$ with $K = \Theta\left(\log\left(\max\left\{\frac{Y\sqrt{n}}{X\sqrt{H}}, \frac{Y^2(n\epsilon)^{2/3}}{HX^2}\right\}\right)\right)$ such that Algorithm 4 run with $\mathcal{A}$ and $K$ is $(\epsilon, \delta)$-DP and, when given $S \sim \mathcal{D}^n$, satisfies the following with probability at least $1 - \beta$ (letting $B = 2\max\{\|w^*\|, 1\}$):
$$\varepsilon_{\mathrm{risk}}(\hat{w}) = \tilde{O}\left(\frac{(\sqrt{H}BX)^{4/3}Y^{2/3}}{(n\epsilon)^{2/3}} + \frac{\sqrt{H}BX\max\{Y, 1\} + Y^2}{\sqrt{n}} + \frac{Y^2 + H(BX)^2}{n\epsilon}\right).$$

**Confidence boosting.** We give an algorithm to boost the confidence of unconstrained, smooth DP-SCO (with possibly non-Lipschitz losses). We split the dataset $S$ into $m + 1$ chunks and run an $(\epsilon, \delta)$-DP algorithm over each of the first $m$ chunks to get $m$ models, and then use the Report Noisy Max mechanism to select a model with approximately the least empirical risk. We show that this achieves the optimal rate of $\tilde{O}\left(\frac{\sqrt{H}BXY}{\sqrt{n}}\right)$, whereas the previous high-probability result of [SST10] carried an additional additive term and was limited to GLMs. The key idea is that non-negativity, convexity, smoothness, and the loss being bounded at zero together enable strong bounds on the variance of the loss, and consequently give stronger concentration bounds. The details are deferred to Appendix D.
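A sketch of this confidence-boosting procedure (ours, with hypothetical helper names): the data are split into $m + 1$ chunks, a DP training algorithm is run on each of the first $m$, and a model is selected via a noisy minimum of held-out empirical risks, standing in for Report Noisy Max; the noise scale would have to be calibrated to the sensitivity of the empirical risk, which we leave as an input.

```python
import numpy as np

def boost_confidence(feats, labels, m, train_alg, emp_risk, noise_scale, seed=0):
    """Split the data into m+1 chunks, train a DP model on each of the first m,
    and pick the (approximately) best one on the held-out chunk via a noisy
    minimum of empirical risks (Report Noisy Max applied to negated risks)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    chunks = np.array_split(idx, m + 1)
    holdout = chunks[-1]
    candidates = [train_alg(feats[c], labels[c]) for c in chunks[:-1]]
    noisy_risks = [emp_risk(w, feats[holdout], labels[holdout])
                   + rng.laplace(scale=noise_scale) for w in candidates]
    return candidates[int(np.argmin(noisy_risks))]
```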
## Acknowledgments and Disclosure of Funding

RA and EU are supported, in part, by NSF BIGDATA award IIS-1838139 and NSF CAREER award IIS-1943251. RB's and MM's work on this research is supported by NSF Award AF-1908281 and a Google Faculty Research Award. CG's research is partially supported by the INRIA Associate Teams project, FONDECYT grant 1210362, and the National Center for Artificial Intelligence CENIA FB210017, Basal ANID. Part of this work was done while CG was at the University of Twente.

## References

[ACKL21] Idan Amir, Yair Carmon, Tomer Koren, and Roi Livni. Never go full batch (in stochastic convex optimization). Advances in Neural Information Processing Systems, 34, 2021.

[AKL21] Idan Amir, Tomer Koren, and Roi Livni. SGD generalizes better than GD (and regularization doesn't help). In Proceedings of the Thirty-Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 63-92. PMLR, 2021.

[BBL03] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Summer School on Machine Learning, pages 169-207. Springer, 2003.

[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

[BFGT20] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33, 2020.

[BFTGT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. In Conference on Learning Theory, pages 474-499. PMLR, 2021.

[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464-473. IEEE, 2014.

[CH12] Kamalika Chaudhuri and Daniel J. Hsu. Convergence rates for differentially private statistical estimation. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, UK. icml.cc / Omnipress, 2012.

[CWH20] Xiangyi Chen, Steven Z. Wu, and Mingyi Hong. Understanding gradient clipping in private SGD: A geometric perspective. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13773-13782. Curran Associates, Inc., 2020.

[CWZ21] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49(5):2825-2850, 2021.

[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006.

[DR+14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

[FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439-449, 2020.

[HRS15] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. CoRR, abs/1509.01240, 2015.

[JT14] Prateek Jain and Abhradeep Guha Thakurta. (Near) dimension independent risk bounds for differentially private learning. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 476-484, Beijing, China, 2014. PMLR.

[KL15] Tomer Koren and Kfir Y. Levy. Fast rates for exp-concave empirical risk minimization. In NIPS, pages 1477-1485, 2015.

[Nel11] Jelani Nelson. Sketching and streaming high-dimensional vectors. PhD thesis, Massachusetts Institute of Technology, 2011.

[NêUZ20] Huy Lê Nguyễn, Jonathan Ullman, and Lydia Zakynthinou. Efficient private algorithms for learning large-margin halfspaces. In Aryeh Kontorovich and Gergely Neu, editors, Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 of Proceedings of Machine Learning Research, pages 704-724. PMLR, 2020.

[NY83] A. S. Nemirovsky and E. R. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, 1983.

[RS15] Sofya Raskhodnikova and Adam D. Smith. Efficient Lipschitz extensions for high-dimensional graph statistics and node private degree distributions. CoRR, abs/1504.07912, 2015.

[Sha15] Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. Journal of Machine Learning Research, 16:3475-3486, 2015.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[SST10] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.

[SSTT20] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Characterizing private clipped gradient descent on convex generalized linear problems. CoRR, abs/2006.06783, 2020.

[SU17] Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 552-563. IEEE, 2017.

[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
[VGNA20] Mariia Vladimirova, Stéphane Girard, Hien Nguyen, and Julyan Arbel. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier tailed distributions. Stat, 9(1):e318, 2020.

[Wan18] Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. In Amir Globerson and Ricardo Silva, editors, Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI 2018), pages 93-103. AUAI Press, 2018.