Learning with User-Level Privacy

Daniel Levy,*1 Ziteng Sun,*2 Kareem Amin,3 Satyen Kale,3 Alex Kulesza,3 Mehryar Mohri,3,4 Ananda Theertha Suresh3

1Stanford University  2Cornell University  3Google Research  4Courant Institute

danilevy@stanford.edu, zs335@cornell.edu, {kamin, satyenkale, kulesza, mohri, theertha}@google.com

We propose and analyze algorithms to solve a range of learning tasks under user-level differential privacy constraints. Rather than guaranteeing only the privacy of individual samples, user-level DP protects a user's entire contribution (m ≥ 1 samples), providing more stringent but more realistic protection against information leaks. We show that for high-dimensional mean estimation, empirical risk minimization with smooth losses, stochastic convex optimization, and learning hypothesis classes with finite metric entropy, the privacy cost decreases as O(1/√m) as users provide more samples. In contrast, when increasing the number of users n, the privacy cost decreases at a faster O(1/n) rate. We complement these results with lower bounds showing the minimax optimality of our algorithms for mean estimation and stochastic convex optimization. Our algorithms rely on novel techniques for private mean estimation in arbitrary dimension with error scaling as the concentration radius τ of the distribution rather than the entire range.

1 Introduction

Releasing seemingly innocuous functions of a data set can easily compromise the privacy of individuals, whether the functions are simple counts [35] or complex machine learning models like deep neural networks [52, 30]. To protect against such leaks, Dwork et al. proposed the notion of differential privacy (DP).
Given some data from n participants in a study, we say that a statistic of the data is differentially private if an attacker who already knows the data of n − 1 participants cannot reliably determine from the statistic whether the remaining n-th participant is Alice or Bob. With the recent explosion of publicly available data, progress in machine learning, and widespread public release of machine learning models and other statistical inferences, differential privacy has become an important standard and is widely adopted by both industry and government [32, 5, 21, 55]. The standard setting of DP described in [22] assumes that each participant contributes a single data point to the dataset, and preserves privacy by noising the output in a way that is commensurate with the maximum contribution of a single example. This is not the situation faced in many applications of machine learning, where users often contribute multiple samples to the model: for example, when language and image recognition models are trained on users' own data, or in federated learning settings [37]. As a result, current techniques either provide privacy guarantees that degrade with a user's increased participation or naively add a substantial amount of noise, relying on the group property of differential privacy, which significantly harms the performance of the deployed model. To remedy this issue, we consider user-level DP, which instead of guaranteeing privacy for individual samples, protects a user's entire contribution (m ≥ 1 samples). This is a more stringent but more realistic privacy desideratum. For it to hold, the output of our algorithm must not change significantly when an entire user's contribution is changed, i.e., when up to m samples are swapped in total. We make this formal in Definition 1.

*Equal contribution. Work was done during an internship at Google Research.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
Very recently, for the reasons outlined above, there has been increasing interest in user-level DP for applications such as estimating discrete distributions under user-level privacy constraints [46], PAC learning with user-level privacy [31], and bounding user contributions in ML models [4, 26]. Differentially private SQL with bounded user contributions was proposed in [59]. User-level privacy has also been studied in the context of learning models via federated learning [49, 48, 58, 6]. In this paper, we tackle the problem of learning with user-level privacy in the central model of DP. In particular, we provide algorithms and analyses for the tasks of mean estimation, empirical risk minimization (ERM), stochastic convex optimization (SCO), and learning hypothesis classes with finite metric entropy. Our utility analyses assume that all users draw their samples i.i.d. from related distributions, a setting we refer to as limited heterogeneity. On these tasks, naively applying standard mechanisms, such as Laplace or Gaussian, or using the group property with item-level DP estimators, both yield a privacy error independent of m. We first develop novel private mean estimators in high dimension whose statistical and privacy errors scale with the (arbitrary) concentration radius rather than the range, and apply these to the statistical query setting [SQ; 41]. Our algorithms then rely on (privately) answering a sequence of adaptively chosen queries using users' samples, e.g., gradient queries in stochastic gradient descent algorithms. We show that for these tasks, the additional error due to privacy constraints decreases as O(1/√m), in contrast with the naive rate, which is independent of m. Interestingly, increasing n, the number of users, decreases the privacy cost at a faster O(1/n) rate. Importantly, our results imply concrete practical recommendations on sample collection, regardless of the level of heterogeneity. Indeed, increasing m will yield the most value in the i.i.d.
setting and will yield no improvement when the users' distributions are arbitrary. As the real world will lie somewhere in between, our results exhibit a regime where, for any level of heterogeneity, it is strictly better to collect more users (increasing n) than more samples per user (increasing m).

1.1 Our Contributions and Related Work

We provide a theoretical tool to construct estimators for tasks with user-level privacy constraints and apply it to a range of learning problems.

Optimal private mean estimation and uniformly concentrated queries (Section 3). We show that for a random variable in [−B, B] concentrated in an unknown interval of radius τ (made precise in Definition 2), we can privately estimate its mean with error proportional to τ rather than B, as we would obtain using standard private mean estimation techniques such as the Laplace mechanism [24]. When data is concentrated in ℓ∞-norm, several papers show that one can achieve an error scaling with τ rather than B, either asymptotically [53], for Gaussian mean estimation [40, 38], for sub-Gaussian symmetric distributions [18, 17], or for distributions with bounded p-th moment [39]. We propose a private mean estimator (Algorithm 2) with error scaling with τ that works in arbitrary dimension when data is concentrated in ℓ2-norm (Theorem 2). In Corollary 1, we show it (optimally) solves mean estimation under user-level privacy constraints for random vectors bounded in ℓ2-norm. In Appendix D.6, we show that for uniformly concentrated queries (see Definition 3), sequentially applying Algorithm 2 privately answers K adaptively chosen queries with privacy cost O(τ√(dK)/(nε)). Our conclusions relate to the growing literature on adaptive data analysis. While a sequence of works [25, 9, 27, 28] uses techniques from differential privacy and their answers are (ε, δ)-DP with ε = Θ(1), our work guarantees privacy for arbitrary ε under the additional assumption of uniform concentration.
Empirical risk minimization (Section 4). An influential line of papers studies ERM under item-level privacy constraints [19, 42, 8]. Importantly, these papers assume arbitrary data, i.e., not necessarily samples from users' distributions. The exact analog of ERM in the user-level setting is consequently less interesting: for n data points {z1, ..., zn}, in the worst case each user u ∈ [n] contributes m copies of zu and the problem reduces to the item-level setting. Instead, we consider the (related) problem of ERM when users contribute points sampled i.i.d. Assuming some regularity (A3 and A4), we develop and analyze algorithms for ERM under user-level DP constraints for convex, strongly convex, and non-convex losses (Theorem 3).

Optimal stochastic convex optimization (Section 5). Under item-level DP (or equivalently, user-level DP with m = 1), a sequence of works [19, 8, 10, 11, 29] establishes the constrained minimax risk as Θ(1/√n + √d/(nε)). In this paper, under the additional assumptions that the losses are individually smooth² and the gradients are sub-Gaussian random vectors, we prove matching upper (Theorem 4) and lower bounds (Theorem 5) of order Θ(1/√(nm) + √d/(n√m·ε)) in a regime we make precise. We leave closing the gap outside of this regime to future work.

Limit of learning with a fixed number of users (Appendix B). Finally, we resolve a conjecture of [4] and prove that with a fixed number of users, even in the limit m → ∞ (i.e., each user has an infinite number of samples), we cannot reach zero error. In particular, we prove that for all the learning tasks we consider, the risk under user-level privacy constraints is at least Ω(e^(−εn)) regardless of m. Note that this does not contradict the results above, since they require n = Ω((log m)/ε). Finally, we provide results in Appendix A for learning under pure user-level DP for function classes with finite metric entropy. We apply these to SCO with ℓ∞ constraints (Remark 1) and achieve (near-)optimal rates.
2 Preliminaries

Notation. Throughout this work, d denotes the dimension, n the number of users, and m the number of samples per user. Generically, σ denotes the sub-Gaussian parameter, τ the concentration radius, ν the variance of a random vector, and P a data distribution. We denote the optimization variable by θ ∈ Θ ⊂ R^d, use z (or Z when random) to denote a data sample supported on a space Z, and write ℓ: Θ × Z → R for the loss function. Gradients (denoted ∇) are always taken with respect to the optimization variable θ. For a convex set C, Π_C denotes the Euclidean projection onto C, i.e., Π_C(y) := argmin_{z∈C} ∥y − z∥₂. We use A to refer to (possibly random) private mechanisms and Xⁿ as a shorthand for the dataset (X₁, ..., Xₙ). For two distributions P and Q, we denote by ∥P − Q∥_TV their total variation distance and by D_kl(P∥Q) their Kullback-Leibler divergence. For a random vector X ∼ P supported on R^d, we use Var(P) or Var(X) to denote E[∥X − E[X]∥₂²], which is equal to the trace of the covariance matrix of X.

Next, we consider differential privacy in the most general way, which only requires specifying a dataset space S and a distance d on S.

Definition 1 (Differential Privacy). Let ε, δ ≥ 0. Let A: S → Θ be a (potentially randomized) mechanism. We say that A is (ε, δ)-DP with respect to d if for any measurable subset O ⊆ Θ and all S, S′ ∈ S satisfying d(S, S′) ≤ 1,

P(A(S) ∈ O) ≤ e^ε · P(A(S′) ∈ O) + δ. (1)

If δ = 0, we refer to this guarantee as pure differential privacy. For a data space Z, choosing S = Zⁿ and d(S, S′) = d_Ham(S, S′) = Σ_{i=1}^n 1{z_i ≠ z′_i} recovers the canonical setting considered in most of the literature; we refer to this as item-level differential privacy. When we wish to guarantee privacy for users rather than individual samples, we instead assume a structured dataset into which each of n users contributes m > 1 samples. This corresponds to S = (Z^m)ⁿ such that for S ∈ S, we have S = (S₁, ..., Sₙ), where S_u = {z_1^(u), ..., z_m^(u)}, and

d_user(S, S′) := Σ_{u=1}^n 1{S_u ≠ S′_u},

which means that, in this setting, two datasets are neighboring if at most one user's contribution differs. We henceforth refer to this setting as user-level differential privacy.

Distributional assumptions. In the case of user-level privacy with n users each providing m samples, we assume the existence of a collection of distributions {P_u}_{u∈[n]} over Z. One then observes the following user-level dataset³:

S = (S₁, ..., Sₙ) where S_u ∼ (P_u)^m i.i.d. (2)

²We note that the results only require O(n^(3/2))-smooth losses. For large n, keeping all other problem parameters fixed, this is a very weak assumption. More precisely, when n > poly(d, m, 1/ε), our algorithm applied to a smoothed version ℓ̃ of ℓ (e.g., using the Moreau envelope [33]) yields optimal rates for non-smooth losses. Whether the smoothness assumption can be removed altogether is an open question.

³For simplicity, we assume that |S_u| = m, but our guarantees directly extend to the setting where users have different numbers of samples, with m replaced by median(m₁, ..., mₙ), using techniques from [46]. We leave eliciting the optimal rates in settings where m_u is an arbitrary random variable to future work.

In this paper, we consider the limited heterogeneity setting, i.e., when the users have related distributions. This setting is more reflective of practice, especially in light of growing interest in federated learning applications [37, 60].

Assumption A1 (Limited heterogeneity setting). There exists a distribution P₀ over Z such that all the user distributions are close to P₀ in total variation distance, i.e.,

max_{u∈[n]} ∥P_u − P₀∥_TV ≤ κ,

where κ ≥ 0 quantifies the level of heterogeneity. Note that κ = 0 corresponds to Assumption A2. Note also that our TV-based definition is natural in this setting, as it is closely related to the notion of discrepancy (or d_A distance), which plays a key role in domain adaptation scenarios [47, 12].
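To make the two neighboring relations concrete, here is a small illustrative sketch (the helper names `d_ham` and `d_user` are ours, mirroring Definition 1, not functions from the paper):

```python
def d_ham(S, S_prime):
    """Item-level Hamming distance: number of differing samples.
    Both arguments are flat lists of samples of equal length."""
    return sum(z != z2 for z, z2 in zip(S, S_prime))

def d_user(S, S_prime):
    """User-level distance d_user: number of users whose entire m-sample
    contribution differs. Both arguments are lists of per-user sample lists."""
    return sum(Su != Su2 for Su, Su2 in zip(S, S_prime))
```

A pair of datasets differing in one user's entire contribution is neighboring (distance 1) at the user level, while at the item level it is at distance up to m; this factor of m is exactly what naive group-privacy arguments pay in noise.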
Lower bound results have been given in terms of the discrepancy measure (see [13]), which further justifies the adoption of this definition in the presence of multiple distributions. In the case of zero heterogeneity, A1 reduces to the standard homogeneous setting. Many fundamental papers choose this setting when explicating minimax rates under constraints (e.g., in distributed optimization and federated learning [61] or under communication constraints [63, 15]).

Assumption A2 (Homogeneous setting). The distributions of individual users are equal, meaning there exists P₀ such that for all u ∈ [n], P_u = P₀.

In this paper, we develop techniques and provide matching upper and lower bounds for solving learning tasks in the homogeneous setting. In Appendix C, we prove that our techniques naturally apply to the heterogeneous setting in a black-box fashion and, for all considered problems, provide meaningful guarantees under Assumption A1. Moreover, the algorithm achieves an almost optimal rate whenever the heterogeneity level is (polynomially) small. See the detailed statement in Theorem 9.

2.1 ERM and stochastic convex optimization

Assumptions on the loss. Throughout this work, we assume that the parameter space Θ is closed, convex, and satisfies ∥θ − ϑ∥₂ ≤ R for all θ, ϑ ∈ Θ. We also assume that the loss ℓ: Θ × Z → R is G-Lipschitz w.r.t. the ℓ2-norm⁴, meaning that for all z ∈ Z and all θ ∈ Θ, ∥∇ℓ(θ; z)∥₂ ≤ G. We further consider the following assumptions.

Assumption A3. The function ℓ(·; z) is H-smooth. In other words, the gradient ∇ℓ(θ; z) is H-Lipschitz in the variable θ for all z ∈ Z.

Assumption A4. The random vector ∇ℓ(θ; Z) is σ²-sub-Gaussian for all θ ∈ Θ and Z ∼ P₀. Equivalently, for all v ∈ R^d, ⟨v, ∇ℓ(θ; Z)⟩ is a σ²-sub-Gaussian random variable, i.e.,

E[exp(⟨v, ∇ℓ(θ; Z) − E[∇ℓ(θ; Z)]⟩)] ≤ exp(∥v∥₂² σ²/2).

In this work, our rates often depend on the sub-Gaussianity and Lipschitz parameters σ and G, and thus we define the shorthands G̃ := σ√d and Ḡ := min{G, G̃}.
Intuitively, the G-Lipschitzness assumption bounds the gradient in a ball around 0 (independently of θ), while sub-Gaussianity implies that, for each θ, ∇ℓ(θ; Z) likely lies in B₂^d(∇L(θ; P₀), G̃). Generically, there is no ordering between G and G̃: for the linear loss ℓ(θ; z) = ⟨θ, z⟩, depending on P₀, it can hold that G ≪ G̃ (e.g., P₀ = Unif{−v, v} for v ∈ R^d), G̃ ≪ G (e.g., P₀ is N(µ, σ²I_d) truncated in a ball around µ, with ∥µ∥₂ ≫ σ√d), or G ≈ G̃ (e.g., P₀ = Unif{−1, +1}^d). We now introduce the tasks we consider in this work, namely empirical risk minimization (ERM) and stochastic convex optimization (SCO). For a collection of samples from n users S = (S₁, ..., Sₙ), where each S_u = {z_1^(u), ..., z_m^(u)} ∈ Z^m, we define the empirical risk objectives

L(θ; S_u) := (1/m) Σ_{i=1}^m ℓ(θ; z_i^(u)) and L(θ; S) := (1/n) Σ_{u=1}^n L(θ; S_u) = (1/(mn)) Σ_{u=1}^n Σ_{i=1}^m ℓ(θ; z_i^(u)). (3)

In the user-level setting, we wish to minimize L(θ; S) under user-level privacy constraints. Going beyond the empirical risk, we also solve SCO [51], i.e., minimizing a convex population objective when provided with samples from each user's distribution. In the user-level setting, for a convex loss ℓ and a convex constraint set Θ, we observe S = (S₁, ..., Sₙ) ∼ ⊗_{u∈[n]} (P_u)^m and wish to solve

min_{θ∈Θ} (1/n) Σ_{u∈[n]} L(θ; P_u) := (1/n) Σ_{u∈[n]} E_{P_u}[ℓ(θ; Z)]. (4)

In the homogeneous case (Assumption A2), this reduces to the classic SCO setting:

min_{θ∈Θ} L(θ; P₀) := E_{P₀}[ℓ(θ; Z)]. (5)

⁴It is straightforward to develop analogs of the results of Sections 3 and 4 for arbitrary norms, but we restrict our attention to the ℓ2-norm in this work for clarity.

2.2 Uniform concentration of queries

Let φ: Z → R^d be a d-dimensional query function. We define concentration of random variables and uniform concentration of multiple queries as follows.

Definition 2. A (random) sample Xⁿ supported on [−B, B]^d is (τ, γ)-concentrated (and we call τ the concentration radius) if there exists x₀ ∈ [−B, B]^d such that, with probability at least 1 − γ,

max_{i∈[n]} ∥X_i − x₀∥₂ ≤ τ.
Definition 3 (Uniform concentration of vector queries). Let Q_B^d = {φ: Z → [−B, B]^d} be a family of queries with bounded range. For Zⁿ = (Z₁, ..., Zₙ) drawn i.i.d. from P, we say that (Zⁿ, Q_B^d) is (τ, γ)-uniformly-concentrated if, with probability at least 1 − γ, we have

max_{i∈[n]} sup_{φ∈Q_B^d} ∥φ(Z_i) − E_{Z∼P}[φ(Z)]∥₂ ≤ τ.

In this work, we will often consider σ²-sub-Gaussian random variables (or vectors), which are concentrated according to Definition 2. For example, if Xⁿ is drawn i.i.d. from a σ²-sub-Gaussian random vector supported on [−B, B]^d, then it is (σ√(d log(2n/γ)), γ)-concentrated around its mean (see, e.g., [56]). Finally, we define a distance between random variables (and estimators).

Definition 4 (β-close random variables). For any two random variables X₁ ∼ P₁ and X₂ ∼ P₂, we say X₁ and X₂ are β-close if ∥P₁ − P₂∥_TV ≤ β. We use the notation X₁ ≈_β X₂ if X₁ and X₂ are β-close.

β-closeness is useful because, in many of our results, the private estimator we propose returns a simple unbiased estimate with high probability and is bounded otherwise. Thus, it suffices to do the analysis in the nice case and crudely bound the error otherwise.

3 High-Dimensional Mean Estimation and Uniformly Concentrated Queries

In this section, we present a private mean estimator with privacy cost proportional to the concentration radius. Using these techniques, we show that, under uniform concentration, we can answer adaptively chosen queries with privacy cost proportional to the concentration radius instead of the whole range. Our theorems guarantee that the estimator is β-close (with β exponentially small in n) to a simple unbiased estimator with small noise. We further show how to directly translate these results into bounds on the estimation error, which we demonstrate by providing tight bounds on estimating the mean of ℓ2-bounded random vectors under user-level DP constraints (Corollary 1).
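As a quick numerical illustration of Definition 2 (a sketch under parameters we chose for illustration, not an experiment from the paper), i.i.d. coordinate-wise bounded vectors concentrate in ℓ2-norm around their empirical mean within a radius far smaller than the trivial ℓ2 diameter 2B√d of the support:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 1000, 64, 1.0
# Coordinate-wise bounded (hence sub-Gaussian) samples in [-B, B]^d.
X = rng.uniform(-B, B, size=(n, d))
x0 = X.mean(axis=0)                              # candidate center
radius = np.linalg.norm(X - x0, axis=1).max()    # empirical concentration radius
diameter = 2 * B * np.sqrt(d)                    # trivial l2 range of the support
```

With these parameters the empirical radius comes out around B√(d/3) ≈ 5, roughly a third of the trivial diameter 16, which is the gap a mechanism calibrated to the concentration radius exploits.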
Given i.i.d. samples Xⁿ from a distribution P supported on R^d with mean µ, the goal of mean estimation is to design a private estimator that minimizes E[∥A(Xⁿ) − µ∥₂²]. We focus on distributions with bounded support [−B, B]^d; however, our algorithm also generalizes to the case where only the mean is guaranteed to be in [−B, B]^d. In the user-level setting (in the homogeneous case), one observes a dataset S sampled as in (2) and wishes to minimize E[∥A(S) − E_{P₀}[Z]∥₂²] under user-level privacy constraints. We first focus on the scalar case.

Mean estimation in one dimension. The algorithm uses a two-stage procedure, similar in spirit to those of [53], [40], and [39]. In the first stage of this procedure, we use the approximate median estimation of [27], detailed in Algorithm 6 in Appendix D.1, to privately estimate a crude interval in which the mean lies, with accuracy Θ(τ).

Algorithm 1 Winsorized Mean1D(Xⁿ, ε, τ, B): Winsorized Mean Estimator (WME)
Require: Xⁿ := (X₁, X₂, ..., Xₙ) ∈ [−B, B]ⁿ, τ: concentration radius, privacy parameter ε > 0.
1: [a, b] = Private Range(Xⁿ, ε/2, τ, B) with |b − a| = 4τ. {Algorithm 6 in Appendix D.1.}
2: Sample ξ ∼ Lap(0, 8τ/(εn)) and return (1/n) Σ_{i=1}^n Π_{[a,b]}(X_i) + ξ, where Π_{[a,b]}(x) = max{a, min{x, b}}.

The second stage clips the samples to this interval, reducing the sensitivity from O(B) to O(τ), and adds the appropriate Laplace noise. With high probability, we recover the guarantee of the Laplace mechanism with smaller sensitivity, since the samples are concentrated in a radius τ. We present the formal guarantees of Algorithm 1 in Theorem 1 and defer its proof to Appendix D.2.

Theorem 1. Let Xⁿ be a dataset supported on [−B, B]. The output of Algorithm 1, denoted by A(Xⁿ), is ε-DP. Furthermore, if Xⁿ is (τ, γ)-concentrated, it holds that

A(Xⁿ) ≈_β (1/n) Σ_{i=1}^n X_i + Lap(0, 8τ/(εn)), where β = min{1, γ + (B/τ)·e^(−nε/8)}.

Moreover, Algorithm 1 runs in time O(n + log(B/τ)).
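A minimal sketch of Algorithm 1's second stage, assuming the interval [a, b] has already been produced by the private range step (Algorithm 6 is not reproduced here, and the ε/2 budget split between the two stages is elided):

```python
import numpy as np

def winsorized_mean_1d(x, eps, interval, seed=0):
    """Clip samples to the (privately estimated) interval [a, b], average,
    and add Laplace noise scaled to the reduced sensitivity (b - a)/(n*eps)
    rather than the full-range sensitivity 2B/(n*eps)."""
    a, b = interval
    x = np.asarray(x, dtype=float)
    noise = np.random.default_rng(seed).laplace(0.0, (b - a) / (len(x) * eps))
    return np.clip(x, a, b).mean() + noise
```

With samples concentrated in [0.4, 0.6] and an interval (0.3, 0.7) of width 4τ for τ = 0.1, the clipping is inactive with high probability and the noise scale is 0.4/(nε) instead of 2B/(nε).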
Compared to [40, 38, 39], our algorithm runs in time O(n + log(B/τ)) instead of O(n + B/τ), owing to the approximate median estimation algorithm of [27], which is faster when τ ≪ B.

Mean estimation in arbitrary dimension. In the general d-dimensional case, if Xⁿ is concentrated in ℓ∞-norm, one simply applies Algorithm 1 to each dimension. However, when Xⁿ is concentrated in ℓ2-norm, naively upper bounding the ℓ∞-norm by the ℓ2-norm will incur a superfluous d factor: if ∥v∥₂ ≤ ρ, each |v_j| is possibly as large as ρ. To remedy this issue, we use the random rotation trick of [3, 54]. This guarantees that all coordinates have roughly the same range: for v ∈ R^d, with high probability, ∥Rv∥_∞ = Õ(∥v∥₂/√d), where R is the random rotation. We present this procedure in Algorithm 2 and its performance in Theorem 2.

Algorithm 2 Winsorized Mean High D(Xⁿ, ε, δ, τ, B, γ): WME - High Dimension
Require: Xⁿ := (X₁, X₂, ..., Xₙ), X_i ∈ [−B, B]^d; τ, γ: concentration radius and probability; privacy parameters ε, δ > 0.
1: Let D = Diag(ω), where ω is sampled uniformly from {±1}^d.
2: Set U = d^(−1/2) H D, where H is a d-dimensional Hadamard matrix. For all i ∈ [n], compute Y_i = U X_i.
3: Let ε′ = ε/√(8d log(1/δ)) and τ′ = 10τ√(log(dn/γ)/d). For j ∈ [d], compute Ȳ(j) = Winsorized Mean1D({Y_i(j)}_{i∈[n]}, ε′, τ′, B).
4: return X̄ = U^(−1) Ȳ.

Theorem 2. Let A(Xⁿ) = Winsorized Mean High D(Xⁿ, ε, δ, τ, B, γ) be the output of Algorithm 2. A(Xⁿ) is (ε, δ)-DP. Furthermore, if Xⁿ is (τ, γ)-concentrated in ℓ2-norm, there exists an estimator A′(Xⁿ) such that A(Xⁿ) ≈_β A′(Xⁿ) and

E[A′(Xⁿ) | Xⁿ] = (1/n) Σ_{i=1}^n X_i and Var(A′(Xⁿ) | Xⁿ) ≤ c₀ · dτ² log(dn/γ) log(1/δ) / (n²ε²),

where c₀ = 102,400 and β = min{1, 2γ + (d²B/τ)·exp(−nε′/24)}.

We present the proof of Theorem 2 in Appendix D.3. We are able to transfer both Theorem 1 and Theorem 2 into finite-sample estimation error bounds for various types of concentrated distributions and obtain near-optimal guarantees (see Appendix D.5 for an example on mean estimation of sub-Gaussian distributions).
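The rotation step of Algorithm 2 can be sketched directly (this sketch uses the Sylvester construction of the Hadamard matrix, so d must be a power of two):

```python
import numpy as np

def random_rotation(d, seed=0):
    """U = d^{-1/2} H D as in Algorithm 2: H a d x d Hadamard matrix (Sylvester
    construction), D a diagonal matrix of independent random signs.
    U is orthogonal, so it preserves l2 norms."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    rng = np.random.default_rng(seed)
    D = np.diag(rng.choice([-1.0, 1.0], size=d))
    return (H @ D) / np.sqrt(d)

# A vector with all of its l2 mass on one coordinate is spread evenly:
# every rotated coordinate has magnitude exactly ||v||_2 / sqrt(d).
d = 256
U = random_rotation(d)
v = np.zeros(d)
v[0] = 10.0
w = U @ v
```

Since U is orthogonal, it preserves ℓ2 geometry while flattening the coordinate ranges, which is the saving the per-coordinate clipping radius in step 3 exploits.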
The next corollary characterizes the risk of mean estimation for distributions supported on an ℓ2-bounded domain under user-level DP guarantees (see Appendix D.4 for the proof).

Corollary 1. Assume A2 holds with P₀ supported on B₂^d(0, B) with mean µ, and let S = (S₁, S₂, ..., Sₙ), |S_u| = m, consist of m i.i.d. samples from each user's distribution. There exists an (ε, δ)-user-level DP algorithm A(S) such that, if n ≥ (c₁√(d log(1/δ))/ε)·log(m(dn + n²ε²)) for a numerical constant c₁, we have⁵

E[∥A(S) − µ∥₂²] = Var(P₀)/(mn) + Õ(dB²/(m n²ε²)).

Note that Var(P₀) ≤ B² for any P₀ supported on B₂^d(0, B). Replacing Var(P₀) by B², the bound is minimax optimal up to logarithmic factors. When only A1 holds with a heterogeneity level bounded by an inverse polynomial in d and 1/ε, the same error bound holds (up to constants) for estimating E_{Z∼P_u}[Z] for any u ∈ [n].

Note that the algorithms of [38, 39], which focus on estimating the mean of d-dimensional sub-Gaussian distributions, can also be used to estimate the mean of ℓ2-bounded distributions, since bounded random variables are also sub-Gaussian. However, applying these algorithms directly will incur a superfluous d factor in the mean squared error. We avoid this using the random rotation trick in Algorithm 2.

Answering multiple queries. We end this section by noting that, when a family of queries Q is uniformly concentrated (as made precise in Definition 3), we can answer sequences of K d-dimensional, adaptively chosen queries with error scaling as O(√(dK)·τ/(nε)) by applying Algorithm 2 to {φ_k(Z_i)}_{i∈[n]} with the right (ε₀, δ₀). We make this formal in Theorem 10 in Appendix D.6.

4 Empirical Risk Minimization with User-Level Differential Privacy

In this section, we present an algorithm to solve the ERM objective of (3) under user-level DP constraints. We apply the results of Section 3 by noting that the SQ framework encompasses stochastic gradient methods. Informally, one can sequentially choose queries φ_k(z) = ∇ℓ(θ_k; z) and, for a stepsize η, update θ_{k+1} = Π_Θ(θ_k − ηv_k), where v_k is the answer to the k-th query.
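The budget split behind the multiple-query guarantee can be sketched as follows; this uses one standard advanced-composition split, not necessarily the paper's exact constants:

```python
import math

def per_query_budget(eps, delta, K):
    """Split a total (eps, delta) budget across K adaptively chosen queries so
    that K runs of an (eps0, delta0)-DP mechanism (e.g. Algorithm 2) compose
    to (eps, delta)-DP under advanced composition."""
    eps0 = eps / (2.0 * math.sqrt(2.0 * K * math.log(2.0 / delta)))
    delta0 = delta / (2.0 * K)
    return eps0, delta0
```

Since each call to Algorithm 2 has error on the order of √d·τ/(nε₀) up to logarithmic factors, plugging in ε₀ ∝ ε/√K gives the stated overall O(√(dK)·τ/(nε)) scaling.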
For the results to hold, we require a uniform concentration result over the appropriate class of queries.

Uniform concentration of stochastic gradients. The class of queries for stochastic gradient methods is Q_erm := {∇ℓ(θ; ·) : θ ∈ Θ}. We prove that when Assumptions A3 and A4 hold, ({∇L(·; S_u)}_{u∈[n]}, Q_erm) is (Õ(σ√(d/m)), α)-uniformly concentrated. The next proposition is a simplification of the result of [50] under the (stronger) assumption A3 that ℓ is uniformly H-smooth. The proof, which we defer to Appendix E.1, hinges on a covering number argument.

Proposition 1 (Concentration of random gradients). Let S_u ∼ (P_u)^m i.i.d., |S_u| = m for u ∈ [n], and α ≥ 0. Under Assumptions A3 and A4, with probability greater than 1 − α it holds that

max_{u∈[n]} sup_{θ∈Θ} ∥∇L(θ; S_u) − ∇L(θ; P_u)∥₂ = Õ(σ√(d/m)).

Stochastic gradient methods. We state classical convergence results for stochastic gradient methods for both convex and non-convex losses under smoothness. For a function F: Θ → R, we assume access to a first-order stochastic oracle O_{F,ν²}, i.e., a random mapping such that for all θ ∈ Θ, O_{F,ν²}(θ) = ∇̂F(θ) with E[∇̂F(θ)] = ∇F(θ) and Var(∇̂F(θ)) ≤ ν². We abstract optimization algorithms in the following way: an algorithm consists of an output set O, a sub-routine Query: O → Θ that takes the last output and indicates the next point to query, and a sub-routine Update: O × R^d → O that takes the previous output and a stochastic gradient and returns the next output. After T steps, we call Aggregate: O^T → Θ, which takes all the previous outputs and returns the final point. (See Algorithm 7 in Appendix E.2 for how to instantiate generic first-order optimization in this framework.) We detail in Proposition 4 in Appendix E.2 standard convergence results for variations of (projected) stochastic gradient descent (SGD). We introduce this abstraction to forego the details of each specific algorithm and instead focus on the privacy and utility guarantees.

⁵For precise log factors, see Appendix D.4.
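One way to instantiate the {O, Query, Update, Aggregate} abstraction is plain projected SGD with iterate averaging; this sketch is ours (cf. Algorithm 7 in Appendix E.2, which is not reproduced here):

```python
import numpy as np

def make_projected_sgd(lr, proj):
    """Instantiate the first-order abstraction: outputs o_t are the iterates
    themselves, Query returns the next point to evaluate, Update takes a
    projected stochastic-gradient step, and Aggregate averages the iterates
    (the standard choice for convex SGD)."""
    def query(o):
        return o
    def update(o, g):
        return proj(o - lr * g)
    def aggregate(outputs):
        return np.mean(outputs, axis=0)
    return query, update, aggregate
```

For instance, with an exact gradient oracle for F(θ) = ∥θ∥₂²/2 and no constraint (proj = identity), the iterates contract geometrically and the averaged output approaches the minimizer 0.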
Algorithm. We recall the ERM setting with user-level DP: we observe S = (S₁, ..., Sₙ) with S_u ∈ Z^m for u ∈ [n] and wish to solve the constrained optimization problem with the objective in (3). We present our method in Algorithm 3 and provide utility and privacy guarantees in Theorem 3.

Algorithm 3 Winsorized First-Order Optimization
1: Input: number of iterations T, optimization algorithm {O, Query, Update, Aggregate}, privacy parameters (ε, δ), data S = (S₁, ..., Sₙ), initial output o₀, parameter set Θ, concentration radius τ, probability γ.
2: Set ε′ = ε/(2√(2T log(2/δ))) and δ′ = δ/(2T).
3: for t = 0, ..., T − 1 do
4:   θ_t ← Query(o_t).
5:   For each user u ∈ [n], compute g_t^(u) = ∇L(θ_t; S_u) = (1/m) Σ_{j∈[m]} ∇ℓ(θ_t; z_j^(u)).
6:   Compute g_t = Winsorized Mean High D({g_t^(u)}_{u∈[n]}, ε′, δ′, τ, G, γ).
7:   o_{t+1} ← Update(o_t, g_t).
8: end for
9: return θ̂ ← Aggregate(o₀, ..., o_T).

Theorem 3 (Privacy and utility guarantees for ERM). Assume A2 holds, recall that G̃ = σ√d, assume⁶ n = Ω(√(dT)/ε), and let θ̂ be the output of Algorithm 3. There exist variants of projected SGD (e.g., the ones we present in Proposition 4) such that, with probability greater than 1 − γ:

(i) If for all z ∈ Z, ℓ(·; z) is convex, then E[L(θ̂; S) − inf_{θ*∈Θ} L(θ*; S) | S] = Õ(GR/√T + G̃R√d/(n√m·ε)).

(ii) If for all z ∈ Z, ℓ(·; z) is µ-strongly convex, then E[L(θ̂; S) − inf_{θ*∈Θ} L(θ*; S) | S] = Õ(GR·exp(−(µ/H)·T) + G̃²d/(µn²mε²)).

(iii) Otherwise, defining the gradient mapping⁷ G_{F,γ}(θ) := (1/γ)·[θ − Π_Θ(θ − γ∇F(θ))], we have E[∥G_{L(·;S),1/H}(θ̂)∥₂² | S] = Õ(HGR/T + T·G̃²d/(n²mε²)).

For ε ≤ 1 and δ > 0, Algorithm 3 instantiated with any first-order gradient algorithm is (ε, δ)-user-level DP. In the case that only A1 holds, the same guarantees hold whenever the heterogeneity level is bounded by an inverse polynomial in d and 1/ε.

We present the proof in Appendix E.3. For the utility guarantees, the crux of the proof resides in Theorem 10: as well as ensuring small excess loss in expectation, the SQ algorithm produces, with high probability, a sample from the stochastic gradient oracle O_{L(·;S),ν²} where ν² = Õ(T·G̃²d/(n²mε²)).
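The loop of Algorithm 3 can be sketched as follows. This is an illustration only: the private high-dimensional winsorized mean (Algorithm 2) is replaced by a simplified stand-in that clips each per-user gradient to an ℓ2-ball of radius τ around the coordinate-wise median and adds Gaussian noise; in particular, the centering step here is not itself privatized, whereas the paper's mechanism obtains it from a private range estimate.

```python
import numpy as np

def winsorized_first_order_opt(user_grads_fn, theta0, T, eps, delta, tau,
                               lr, seed=0):
    """Sketch of Algorithm 3 with plain (unprojected) SGD as the first-order
    method. user_grads_fn(theta) returns the n per-user average gradients
    g_t^(u) as an (n, d) array. NOT a DP implementation: the clipping center
    below is computed non-privately, as a stand-in for Algorithm 2."""
    rng = np.random.default_rng(seed)
    # Per-step budget via an advanced-composition style split.
    eps_t = eps / (2.0 * np.sqrt(2.0 * T * np.log(2.0 / delta)))
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        G = np.asarray(user_grads_fn(theta))            # shape (n, d)
        n, d = G.shape
        center = np.median(G, axis=0)                   # non-private stand-in
        diffs = G - center
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        G_clip = center + diffs * np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        # Gaussian noise calibrated to the clipped l2 sensitivity 2*tau/n.
        sigma = 2.0 * tau * np.sqrt(2.0 * np.log(1.25 / delta)) / (n * eps_t)
        g_priv = G_clip.mean(axis=0) + rng.normal(0.0, sigma, size=d)
        theta = theta - lr * g_priv
    return theta
```

On a toy problem, ℓ(θ; z) = ∥θ − z∥₂²/2 with every user's average gradient equal to θ − 1, the iterates converge to a small noisy neighborhood of the optimum 1.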
When this happens for all T steps, the analysis of stochastic gradient methods provides the desired regret. The privacy guarantees follow from the strong composition theorem of [23]. Importantly, when the function exhibits (some) strong convexity (which will be the case for any regularized objective), we are able to localize the optimal parameter up to the privacy cost in O(H/µ) steps. This will be particularly important in Section 5.

Corollary 2 (Localization). Let θ̂ be the output of Algorithm 3 on the ERM problem of (3). Assume that ℓ(·; z) is µ-strongly convex for all z ∈ Z and that n = Ω(√(dH/µ)/ε), and set T = (H/µ)·log(n²m·(G/G̃²)·µRε²/d) and γ = σ²d²/(µ²n²mε²R²). For θ*_S ∈ argmin_{θ*∈Θ} L(θ*; S), it holds⁸ that

E[∥θ̂ − θ*_S∥₂²] = Õ(σ²d²/(µ²n²mε²)).

⁶For precise log factors, see Appendix E.3.

⁷In the unconstrained case Θ = R^d, this corresponds to an ϵ-stationary point, as G_{F,γ}(x) = ∇F(x).

⁸A logarithmic dependence on T is hidden in the result. Since T = O(H/µ), we implicitly assume H/µ is polynomial in the stated parameters, which is satisfied when we later apply these results to regularized objectives.

5 Stochastic Convex Optimization with User-Level Privacy

In this section we address the SCO task of (5) under user-level DP constraints. Our approach (shown in Algorithm 4) solves a sequence of carefully regularized ERM problems, drawing on the guarantees of the previous section. Recall that G̃ = σ√d and Ḡ = min{G, G̃}, and that ℓ is H-smooth under Assumption A3. In this section, we assume that ℓ is convex. We first present our results, stating an upper and a lower bound for SCO with user-level privacy constraints.

Theorem 4 (Phased ERM for SCO). Algorithm 4 is user-level (ε, δ)-DP.
When A2 holds and n = Ω(min{(d²mH²R²/(ḠGε⁴))^(1/3), HR√m/(σε)}) or, equivalently, H = O(√(σ²n²ε²/(R²m) + ḠGn³ε⁴/(d²R²m))), for all P and ℓ satisfying Assumptions A3 and A4, we have

E[L(A_PhasedERM(S); P₀)] − min_{θ*∈Θ} L(θ*; P₀) = Õ(R√(ḠG/(mn)) + RG̃√d/(n√m·ε)).

Furthermore, our results still hold in the heterogeneous setting (Assumption A1) whenever the heterogeneity level is bounded by an inverse polynomial in d and 1/ε; the risk guarantee is then with respect to any user distribution P_u.

Theorem 5 (Lower bound for SCO). There exist a distribution P and a loss ℓ satisfying Assumptions A3 and A4 such that for any algorithm A satisfying (ε, δ)-DP at the user level, we have

E[L(A(S); P)] − min_{θ*∈Θ} L(θ*; P) = Ω(GR·(1/√(mn) + √d/(n√m·ε))).

When G = Θ(σ√d), the upper bound matches the lower bound up to logarithmic factors. We present the algorithm and proof for Theorem 4 in Section 5.1. Theorem 5 is proved in Section 5.2.

5.1 Upper bound: minimizing a sequence of regularized ERM problems

We now present Algorithm 4, which achieves the upper bound of Theorem 4. It is similar in spirit to Phased ERM [29] and Epoch GD [34]: at each round, we minimize a regularized ERM problem with fresh samples and increased regularization, initializing each round from the final iterate of the previous round. This allows us to localize the optimum with exponentially increasing accuracy without blowing up our privacy budget. We solve each round using Algorithm 3 to guarantee privacy and obtain an approximate minimizer; we show that the guarantee in Corollary 2 is enough to achieve optimal rates. We provide the proof of Theorem 4 in Appendix F and present a sketch here.

Algorithm 4 A_PhasedERM: Phased ERM
Require: private dataset S = (S₁, ..., Sₙ) ∈ (Z^m)ⁿ: n × m i.i.d. samples from P; H-smooth, convex loss function ℓ; convex set Θ ⊂ R^d; privacy parameters ε ≤ 1, δ ≤ 1/n²; sub-Gaussian parameter σ.
1: Set T = ⌈log₂(Gn√m·ε/(σd))⌉ and λ = √(ḠG/(nm) + σ²d²/(n²mε²))/R.
2: for t = 1 to T do
3:   Set n_t = ⌊n/2^t⌋ and λ_t = 4^t·λ.
4:   Sample S_t: n_t users that have not participated in previous rounds.
Using Algorithm 3, compute an approximate minimizer $\hat{\theta}_t$, to the accuracy of Corollary 2, of the objective
$$L_{\lambda_t, \hat{\theta}_{t-1}}(\theta; S_t) = \frac{1}{m n_t} \sum_{u \in S_t} \sum_{j=1}^{m} \ell(\theta; z^{(u)}_j) + \frac{\lambda_t}{2} \left\|\theta - \hat{\theta}_{t-1}\right\|_2^2. \qquad (7)$$
5: end for
6: return $\hat{\theta}_T$.

Proof sketch of Theorem 4. The privacy guarantee comes directly from the privacy guarantee of Algorithm 3 and the fact that the sets $S_t$ are non-overlapping. The proof of utility is similar to that of Theorem 4.8 in [29]. In round $t$ of Algorithm 4, we consider the true minimizer $\theta^\star_t$ and the approximate minimizer $\hat{\theta}_t$. By stability [14], we can bound the generalization error of $\theta^\star_t$ (see Proposition 5 in Appendix F), and, by Corollary 2, we can bound $\mathbb{E}\,\|\hat{\theta}_t - \theta^\star_t\|_2^2$. We finally choose $\{(\lambda_t, n_t)\}_{t \le T}$ so that the assumptions of Corollary 2 hold and the final error is minimized.

5.2 Lower bound: SCO is harder than Gaussian mean estimation

First, note that it suffices to prove the lower bounds in the homogeneous setting, as any level of heterogeneity only makes the problem harder. Theorem 5 holds for $(\varepsilon, \delta)$ user-level DP; importantly, this is a setting in which lower bounds are generally more challenging (we provide a related lower bound for $\varepsilon$ user-level DP in Appendix A.2). We present the proof in Appendix F.2 and a sketch here.

Proof sketch of Theorem 5. The (constrained) minimax lower bound decomposes into a statistical rate and a privacy rate. The statistical rate is standard (see, e.g., [44, 2]), so we focus on the privacy rate. We consider linear losses of the form $\ell(\theta; z) = \langle \theta, z \rangle$ and show that optimizing $L(\theta; P) = \mathbb{E}_P[\ell(\theta; Z)]$ over $\theta \in \Theta$ is harder than the mean estimation task for $P$. Intuitively, $L(\theta; P) = \langle \theta, \mathbb{E}[Z] \rangle$ attains its minimum at $\theta^\star = -R\,\mathbb{E}[Z]/\|\mathbb{E}[Z]\|_2$, so finding $\theta^\star$ provides a good estimate of (the direction of) $\mathbb{E}[Z]$. We make this formal in Proposition 6. Next, for Gaussian mean estimation, we reduce, in Proposition 3, user-level DP to item-level DP with lower variance by having each user contribute their sample average (which is a sufficient statistic).
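The averaging reduction above can be sketched concretely. The following is a minimal illustration, not the paper's actual construction: it assumes $d$-dimensional data, replaces each user's $m$ samples with their mean, and then applies a clipped Gaussian mechanism over the $n$ user averages as the item-level estimator (standing in for the more refined estimator of [38]); the function name, `clip_radius`, and the noise calibration are illustrative choices.

```python
import numpy as np

def user_level_to_item_level_mean(user_samples, clip_radius, eps, delta, rng=None):
    """Estimate a mean with user-level (eps, delta)-DP via the averaging reduction.

    Each user contributes only the average of their m samples (a sufficient
    statistic for the Gaussian mean), whose standard deviation shrinks as
    sigma / sqrt(m). Swapping one user's entire contribution changes exactly
    one of the n averages, so an item-level Gaussian mechanism applied to the
    averages yields a user-level guarantee.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(user_samples)
    # Sufficient statistic: one average per user.
    averages = np.array([np.mean(s, axis=0) for s in user_samples])
    # Clip each average in l2 norm to bound the sensitivity of the mean.
    norms = np.linalg.norm(averages, axis=1, keepdims=True)
    clipped = averages * np.minimum(1.0, clip_radius / norms)
    # l2-sensitivity of the mean of n clipped vectors under a user swap.
    sensitivity = 2.0 * clip_radius / n
    # Standard Gaussian-mechanism calibration (valid for eps <= 1).
    noise_std = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + rng.normal(0.0, noise_std, size=averages.shape[1])
```

Because each user average concentrates within $O(\sigma/\sqrt{m})$ of the true mean, the clipping radius (and hence the privacy noise) can be taken much smaller than the range of individual samples; this is the source of the $1/\sqrt{m}$ improvement over naive per-sample protection.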
We conclude using the results of [38] (see Proposition 7), proving in Corollary 6 that estimating the direction of the mean under item-level privacy is hard.

Acknowledgments

The authors would like to thank Hilal Asi and Karan Chadha for comments on an earlier draft, as well as Yair Carmon, Peter Kairouz, Gautam Kamath, Sai Praneeth Karimireddy, Thomas Steinke, and Sebastian Stich for useful discussions and pointers to very relevant references.

References

[1] J. Acharya, Z. Sun, and H. Zhang. Differentially private Assouad, Fano, and Le Cam. In V. Feldman, K. Ligett, and S. Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, pages 48–78. PMLR, 16–19 Mar 2021. URL https://proceedings.mlr.press/v132/acharya21a.html.
[2] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
[3] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pages 557–563, 2006.
[4] K. Amin, A. Kulesza, A. Munoz, and S. Vassilvitskii. Bounding user contributions: A bias-variance trade-off in differential privacy. In International Conference on Machine Learning, pages 263–271, 2019.
[5] Apple Privacy Team. Learning with privacy at scale, 2017. Available at https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html.
[6] S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, and B. A. y Arcas. Generative models for effective ML on private, decentralized datasets. In International Conference on Learning Representations, 2019.
[7] R. F. Barber and J. C. Duchi. Privacy and statistical risk: Formalisms and minimax bounds. arXiv:1412.4451 [math.ST], 2014.
[8] R. Bassily, A. Smith, and A. Thakurta.
Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
[9] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1046–1059, 2016.
[10] R. Bassily, V. Feldman, K. Talwar, and A. G. Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288, 2019.
[11] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4381–4391. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/2e2c4bf7ceaa4712a72dd5ee136dc9a8-Paper.pdf.
[12] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 20, 2007.
[13] S. Ben-David, T. Lu, T. Luu, and D. Pál. Impossibility theorems for domain adaptation. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
[14] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
[15] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, 2016. URL https://arxiv.org/abs/1506.07216.
[16] S. Bubeck. Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980, 2014.
[17] M. Bun and T. Steinke.
Average-case averages: Private algorithms for smooth sensitivity and mean estimation. In Advances in Neural Information Processing Systems, pages 181–191, 2019.
[18] T. T. Cai, Y. Wang, and L. Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.
[19] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
[20] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
[21] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 3571–3580, 2017.
[22] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Theory of Cryptography Conference, pages 265–284, 2006.
[23] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2010.
[24] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[25] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.
[26] A. Epasto, M. Mahdian, J. Mao, V. Mirrokni, and L. Ren. Smoothly bounding user contributions in differential privacy. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13999–14010. Curran Associates, Inc., 2020.
URL https://proceedings.neurips.cc/paper/2020/file/a0dc078ca0d99b5ebb465a9f1cad54ba-Paper.pdf.
[27] V. Feldman and T. Steinke. Generalization for adaptively-chosen estimators via stable median. In S. Kale and O. Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 728–757, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
[28] V. Feldman and T. Steinke. Calibrating noise to variance in adaptive data analysis. In Conference on Learning Theory, pages 535–544. PMLR, 2018.
[29] V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
[30] M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333, New York, NY, USA, 2015. ACM. doi: 10.1145/2810103.2813677. URL http://doi.acm.org/10.1145/2810103.2813677.
[31] B. Ghazi, R. Kumar, and P. Manurangsi. User-level private learning via correlated sampling. arXiv preprint arXiv:2110.11208, 2021.
[32] Google. Enabling developers and organizations to use differential privacy, 2019. Available at https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html.
[33] C. Guzmán and A. Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.
[34] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436, 2011.
[35] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig.
Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 4(8):e1000167, 2008.
[36] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv:1902.03736 [math.PR], 2019.
[37] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021. ISSN 1935-8237. doi: 10.1561/2200000083. URL http://dx.doi.org/10.1561/2200000083.
[38] G. Kamath, J. Li, V. Singhal, and J. Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR, 2019.
[39] G. Kamath, V. Singhal, and J. Ullman. Private mean estimation of heavy-tailed distributions. In J. Abernethy and S. Agarwal, editors, Proceedings of the Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2204–2235. PMLR, 09–12 Jul 2020.
[40] V. Karwa and S. Vadhan. Finite sample differentially private confidence intervals. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), 2018.
[41] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the Association for Computing Machinery, 45(6):983–1006, 1998.
[42] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25.1–25.40. JMLR Workshop and Conference Proceedings, 2012.
[43] A. Kulunchakov and J. Mairal. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. Journal of Machine Learning Research, 21(155):1–52, 2020.
[44] D. Levy and J. C. Duchi. Necessary and sufficient geometries for gradient methods.
In Advances in Neural Information Processing Systems 32, 2019. URL https://arxiv.org/abs/1909.10455.
[45] J. Liu and K. Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 298–309, 2019.
[46] Y. Liu, A. Theertha Suresh, F. Yu, S. Kumar, and M. Riley. Learning discrete distributions: user vs. item-level privacy. In Advances in Neural Information Processing Systems, 2020.
[47] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the Twenty-Second Annual Conference on Computational Learning Theory, 2009.
[48] H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz. A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210, 2018.
[49] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.
[50] S. Mei, Y. Bai, A. Montanari, et al. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.
[51] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Conference on Learning Theory, 2009.
[52] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310–1321, 2015.
[53] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 813–822, 2011.
[54] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In D. Precup and Y. W.
Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3329–3337, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[55] United States Census Bureau. Statistical safeguards, 2018. Available at https://www.census.gov/about/policies/privacy/statistical_safeguards.html.
[56] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2019.
[57] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[58] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi. Beyond inferring class representatives: User-level privacy leakage from federated learning. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 2512–2520. IEEE, 2019.
[59] R. J. Wilson, C. Y. Zhang, W. Lam, D. Desfontaines, D. Simmons-Marengo, and B. Gipson. Differentially private SQL with bounded user contribution. Proceedings on Privacy Enhancing Technologies, 2:230–250, 2020.
[60] B. Woodworth, K. K. Patel, and N. Srebro. Minibatch vs. local SGD for heterogeneous distributed learning. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[61] B. E. Woodworth, J. Wang, A. Smith, B. McMahan, and N. Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems 31, 2018.
[62] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.
[63] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/d6ef5f7fa914c19931a55bb262ec879c-Paper.pdf.