# Private Stochastic Convex Optimization with Optimal Rates

Raef Bassily (The Ohio State University, bassily.1@osu.edu) · Vitaly Feldman (Google Research, Brain Team) · Kunal Talwar (Google Research, Brain Team, kunal@google.com) · Abhradeep Thakurta (University of California Santa Cruz and Google Research, Brain Team)

Part of this work was done while visiting the Simons Institute for the Theory of Computing. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We study differentially private (DP) algorithms for stochastic convex optimization (SCO). In this problem the goal is to approximately minimize the population loss given i.i.d. samples from a distribution over convex and Lipschitz loss functions. A long line of existing work on private convex optimization focuses on the empirical loss and derives asymptotically tight bounds on the excess empirical loss. However, a significant gap exists in the known bounds for the population loss. We show that, up to logarithmic factors, the optimal excess population loss for DP algorithms is equal to the larger of the optimal non-private excess population loss and the optimal excess empirical loss of DP algorithms. This implies that, contrary to intuition based on private ERM, private SCO has asymptotically the same rate of $1/\sqrt{n}$ as non-private SCO in the parameter regime most common in practice. The best previous result in this setting gives a rate of $1/n^{1/4}$. Our approach builds on existing differentially private algorithms and relies on the analysis of algorithmic stability to ensure generalization.

1 Introduction

Many fundamental problems in machine learning reduce to the problem of minimizing the expected loss (a.k.a. population loss) $L(w) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(w, z)]$ for convex loss functions of $w$, given access to samples $z_1, \dots, z_n$ from the data distribution $\mathcal{D}$. This problem arises in various settings, such as estimating the mean of a distribution, least squares regression, or minimizing a convex surrogate loss for a classification problem. It is commonly referred to as Stochastic Convex Optimization (SCO) and has been the subject of extensive study in machine learning and optimization [SSBD14]. In this work we study this problem under the additional constraint of differential privacy.

A closely related problem is that of minimizing the empirical loss $\hat{L}(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i)$ on the sampled set of functions, often known as Empirical Risk Minimization (ERM). Private ERM has been well studied, and asymptotically tight upper and lower bounds on the excess empirical loss are known. In this work we give nearly tight upper and lower bounds on the excess population loss (a.k.a. excess population risk).

At first glance these two problems may appear to be essentially the same, as an optimal algorithm for minimizing the empirical risk should also achieve the best bounds for the population risk; that is, the best approach to private SCO would be to use the best private ERM algorithm. This simple intuition is unfortunately false, even in the non-private case. A natural way to bound the population loss is to prove an upper bound on $\mathbb{E}_{z_1,\dots,z_n}\big[\sup_{w}\big(L(w) - \hat{L}(w)\big)\big]$; this is known as uniform convergence. There are examples of distributions over losses where uniform-convergence-based bounds are provably suboptimal. For example, for convex Lipschitz losses in $d$-dimensional Euclidean space, the best bound on the population loss achievable via uniform convergence is $\Omega(\sqrt{d/n})$ [Fel16].
In contrast, SGD is known to achieve excess loss $O(\sqrt{1/n})$, which is independent of the dimension. As a result, in the high-dimensional settings often considered in modern ML (when $n = \Theta(d)$), the optimal achievable excess loss is $O(\sqrt{1/n})$, whereas the uniform convergence bound is $\Omega(1)$. This discrepancy implies that using private ERM and appealing to uniform convergence will not lead to optimal bounds for private SCO.

The first work to address the population loss for private SCO is [BST14], which gives bounds based on several natural approaches. Their first approach is to use the generalization properties of differential privacy itself to bound the gap between the empirical and population losses [DFH+15, BNS+16], and thus derive bounds for SCO from bounds on ERM. This approach leads to a suboptimal bound for private SCO (specifically², $\max\big\{d^{1/4}/\sqrt{n},\ \sqrt{d}/(\epsilon n)\big\}$ [BST14, Sec. F]). For the important case of $d$ on the order of $n$ and $\epsilon$ on the order of one, this results in an $\Omega(n^{-1/4})$ bound on the excess population loss. Their second approach uses the stability induced by regularizing the empirical loss before it is minimized via a private ERM algorithm for strongly convex losses. This technique also yields a suboptimal bound on the excess population loss ($d^{1/4}/\sqrt{\epsilon n}$).

²In this Introduction we are concerned with the dependence on $d$ and $n$ for $(\epsilon, \delta)$-DP. We suppress the dependence on $\delta$ and on parameters of the loss function such as the Lipschitz constant and the radius of the constraint set.

There are two natural lower bounds that apply to private SCO. The $\Omega(\sqrt{1/n})$ lower bound on the excess loss of non-private SCO applies to private SCO as well. Further, it is not hard to show that lower bounds for private ERM translate to essentially the same lower bounds for private SCO, leading to a lower bound of the form $\Omega(\sqrt{d}/(\epsilon n))$. We give a detailed argument for this lower bound in the full version [BFTT19]. In this work, we address the question: What is the optimal excess loss for private SCO? Is the rate $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{\epsilon n}\big)$ achievable?

1.1 Our contribution

We show that the rate $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{\epsilon n}\big)$ is achievable. In particular, we obtain the statistically optimal rate of $O(1/\sqrt{n})$ whenever $d = O(n)$. This is in contrast to the situation for private ERM, where the cost of privacy grows with the dimension for all $n$. In fact, under relatively mild smoothness assumptions, this rate is achieved by a variant of standard noisy mini-batch SGD. The parameters of the scheme need to be tuned carefully to satisfy a delicate balance. The classical analyses for non-private SCO depend crucially on making only one pass over the dataset; however, single-pass noisy SGD is not sufficiently accurate, as a non-trivial amount of noise must be added in each step to carry out the privacy analysis. We rely instead on a different approach to generalization, known as uniform stability [BE02]. The stability parameter degrades with the number of passes over the dataset [HRS15, FV19], while the empirical accuracy improves as we make more passes. In addition, the batch size needs to be sufficiently large to ensure that the noise added for privacy is small. Specifically, we show that $\min(n, n^2\epsilon^2/d)$ steps of SGD with a batch size of $\max(\sqrt{\epsilon n}, 1)$ are sufficient to obtain all the desired properties.

Our second contribution is to show that the smoothness assumption can be relaxed at essentially no additional loss. We use a general smoothing technique based on the Moreau–Yosida envelope operator that allows us to derive the same asymptotic bounds as in the smooth case.
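To keep the target rate in view, the bounds just discussed can be collected in a single display. This is only a restatement of the rates already quoted in this introduction (with constants and the $\log(1/\delta)$ factor suppressed), not a new claim:

```latex
% Two lower bounds for (epsilon, delta)-DP-SCO, and the matching upper bound
% established in this paper (up to logarithmic factors):
\underbrace{\Omega\!\Big(\tfrac{1}{\sqrt{n}}\Big)}_{\text{non-private SCO}}
\quad\text{and}\quad
\underbrace{\Omega\!\Big(\tfrac{\sqrt{d}}{\epsilon n}\Big)}_{\text{inherited from private ERM}}
\qquad\Longrightarrow\qquad
\text{optimal excess population loss}
\;=\; \tilde{\Theta}\!\Big(\max\Big\{\tfrac{1}{\sqrt{n}},\;\tfrac{\sqrt{d}}{\epsilon n}\Big\}\Big).
```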
This operator cannot be implemented efficiently in general, but for algorithms based on gradient steps we exploit the well-known connection between a gradient step on the smoothed function and a proximal step on the original function. Thus our algorithm is equivalent to (stochastic, noisy, mini-batch) proximal descent on the unsmoothed function. We show that our analysis in the smooth case is robust to inaccuracies in the computation of the gradient. This allows us to show that a sufficient approximation to the proximal steps can be implemented in polynomial time given access to the gradients of the $\ell(w, z_i)$'s.

Finally, we show that objective perturbation [CMS11, KST12] also achieves optimal bounds for private SCO. However, objective perturbation is only known to satisfy privacy under some additional assumptions (most notably, the Hessian of the loss having rank at most one at all points in the domain). The generalization analysis in this case is based on the uniform stability of the solution to strongly convex ERM. Aside from extending the analysis of this approach to the population loss, we show that it can lead to algorithms for private SCO that use only a near-linear number of gradient evaluations (whenever these assumptions hold). In particular, we give a variant of objective perturbation in conjunction with stochastic variance-reduced gradient descent (SVRG) that uses only $O(n \log n)$ gradient evaluations. We remark that the known lower bounds for uniform convergence [Fel16] hold even under the additional assumptions invoked by objective perturbation. Finding algorithms with near-linear running time in the general setting of SCO is a natural avenue for future research.

Our work highlights the importance of uniform stability as a tool for the analysis of this important class of problems. We believe it should have applications to other differentially private statistical analyses.

Related work: Differentially private empirical risk minimization (ERM) is a well-studied area spanning over a decade [CM08, CMS11, JKT12, KST12, ST13, SCS13, DJW13, Ull15, JT14, BST14, TTZ15, STU17, WLK+17, WYX17, INS+19]. Aside from [BST14] and work in the local model of DP [DJW13], these works focus on achieving optimal empirical risk bounds under privacy. Our work builds heavily on the algorithms and analyses developed in this line of work while contributing additional insights. Optimal bounds for private SCO are known for some simple subclasses of convex functions, such as generalized linear models [JT14, BST14], where uniform convergence bounds on the order of $1/\sqrt{n}$ are known [KST08].

2 Preliminaries

Notation: We use $\mathcal{W} \subseteq \mathbb{R}^d$ to denote the parameter space, which is assumed to be a convex, compact set, and we denote by $M = \max_{w\in\mathcal{W}} \|w\|_2$ the $L_2$ radius of $\mathcal{W}$. We use $\mathcal{Z}$ to denote an arbitrary data universe and $\mathcal{D}$ to denote an arbitrary distribution over $\mathcal{Z}$. We let $\ell : \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}$ be a loss function that takes a parameter vector $w \in \mathcal{W}$ and a data point $z \in \mathcal{Z}$ as inputs and outputs a real value. The empirical loss of $w \in \mathcal{W}$ w.r.t. loss $\ell$ and dataset $S = (z_1, \dots, z_n)$ is defined as $\hat{L}(w; S) \triangleq \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i)$, and the excess empirical loss of $w$ is $\hat{L}(w; S) - \min_{\tilde{w}\in\mathcal{W}} \hat{L}(\tilde{w}; S)$. The population loss of $w \in \mathcal{W}$ with respect to a loss $\ell$ and a distribution $\mathcal{D}$ over $\mathcal{Z}$ is defined as $L(w; \mathcal{D}) \triangleq \mathbb{E}_{z\sim\mathcal{D}}[\ell(w, z)]$, and the excess population loss of $w$ is $L(w; \mathcal{D}) - \min_{\tilde{w}\in\mathcal{W}} L(\tilde{w}; \mathcal{D})$.
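To fix the notation in code form, the following NumPy scaffolding instantiates these definitions for one concrete choice of loss. The absolute-deviation loss below is our own illustrative example (convex and $\|x\|_2$-Lipschitz in $w$), not one singled out by the paper, and all function names are ours; the sketches later in this document reuse these helpers.

```python
import numpy as np

# Hypothetical running example: z = (x, y) with l(w, z) = |<w, x> - y|,
# which is convex in w and Lipschitz (in w) with constant ||x||_2.
def loss(w, z):
    x, y = z
    return abs(np.dot(w, x) - y)

def loss_grad(w, z):
    # A subgradient of l(., z) at w.
    x, y = z
    return np.sign(np.dot(w, x) - y) * x

def empirical_loss(w, S):
    # \hat{L}(w; S) = (1/n) * sum_i l(w, z_i)
    return float(np.mean([loss(w, z) for z in S]))

def population_loss(w, sample_z, trials=10_000):
    # Monte Carlo estimate of L(w; D) = E_{z ~ D}[l(w, z)], given a sampler for D.
    return float(np.mean([loss(w, sample_z()) for _ in range(trials)]))
```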
Definition 2.1 (Uniform stability). Let $\alpha > 0$. A (randomized) algorithm $\mathcal{A} : \mathcal{Z}^n \to \mathcal{W}$ is $\alpha$-uniformly stable (w.r.t. loss $\ell : \mathcal{W}\times\mathcal{Z}\to\mathbb{R}$) if for any pair $S, S' \in \mathcal{Z}^n$ differing in at most one data point, we have
$$\sup_{z\in\mathcal{Z}}\ \mathbb{E}_{\mathcal{A}}\big[\ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z)\big] \le \alpha,$$
where the expectation is taken only over the internal randomness of $\mathcal{A}$.

The following is a useful implication of uniform stability.

Lemma 2.2 (See, e.g., [SSBD14]). Let $\mathcal{A} : \mathcal{Z}^n \to \mathcal{W}$ be an $\alpha$-uniformly stable algorithm w.r.t. loss $\ell : \mathcal{W}\times\mathcal{Z}\to\mathbb{R}$. Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S \sim \mathcal{D}^n$. Then
$$\mathbb{E}\big[L(\mathcal{A}(S); \mathcal{D}) - \hat{L}(\mathcal{A}(S); S)\big] \le \alpha.$$

Definition 2.3 (Smooth function). Let $\beta > 0$. A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth if for every $w, v \in \mathbb{R}^d$, we have
$$f(v) \le f(w) + \langle \nabla f(w),\, v - w\rangle + \frac{\beta}{2}\,\|v - w\|^2.$$

In the sequel, whenever we attribute a property (e.g., convexity, Lipschitz property, smoothness, etc.) to a loss function $\ell$, we mean that for every data point $z \in \mathcal{Z}$, the loss $\ell(\cdot, z)$ possesses that property.

Stochastic Convex Optimization (SCO): Let $\mathcal{D}$ be an arbitrary (unknown) distribution over $\mathcal{Z}$, and let $S = \{z_1, \dots, z_n\}$ be a sample of i.i.d. draws from $\mathcal{D}$. Let $\ell : \mathcal{W}\times\mathcal{Z}\to\mathbb{R}$ be a convex loss function. A (possibly randomized) algorithm $\mathcal{A}$ for SCO uses the sample $S$ to generate an (approximate) minimizer $\hat{w}_S$ of $L(\cdot\,; \mathcal{D})$. We measure the accuracy of $\mathcal{A}$ by the expected excess population loss of its output parameter $\hat{w}_S$, defined as
$$\Delta L(\mathcal{A}; \mathcal{D}) \triangleq \mathbb{E}\Big[L(\hat{w}_S; \mathcal{D}) - \min_{w\in\mathcal{W}} L(w; \mathcal{D})\Big],$$
where the expectation is taken over the choice of $S \sim \mathcal{D}^n$ and any internal randomness of $\mathcal{A}$.

Differential privacy [DMNS06, DKM+06]: A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if, for any pair of datasets $S$ and $S'$ differing in exactly one data point, and for all events $\mathcal{O}$ in the output range of $\mathcal{A}$, we have
$$\Pr[\mathcal{A}(S) \in \mathcal{O}] \le e^{\epsilon}\,\Pr[\mathcal{A}(S') \in \mathcal{O}] + \delta,$$
where the probability is taken over the random coins of $\mathcal{A}$. For meaningful privacy guarantees, the typical settings of the privacy parameters are $\epsilon < 1$ and $\delta \ll 1/n$.

Differentially Private Stochastic Convex Optimization (DP-SCO): An $(\epsilon, \delta)$-DP-SCO algorithm is an SCO algorithm that satisfies $(\epsilon, \delta)$-differential privacy.

3 Private SCO via Mini-batch Noisy SGD

In this section, we consider the setting where the loss $\ell$ is convex, Lipschitz, and smooth. We give a technique based on a mini-batch variant of the Noisy Stochastic Gradient Descent (NSGD) algorithm [BST14, ACG+16].

Algorithm 1 $\mathcal{A}_{\mathrm{NSGD}}$: Mini-batch noisy SGD for convex, smooth losses
Input: private dataset $S = (z_1, \dots, z_n) \in \mathcal{Z}^n$; an $L$-Lipschitz, $\beta$-smooth, convex loss function $\ell$; convex set $\mathcal{W} \subseteq \mathbb{R}^d$; step size $\eta$; mini-batch size $m$; number of iterations $T$; privacy parameters $\epsilon \le 1$, $\delta \le 1/n^2$.
1: Set noise variance $\sigma^2 := \frac{8\, T\, L^2 \log(1/\delta)}{n^2 \epsilon^2}$.
2: Set mini-batch size $m := \max\big(n\sqrt{\epsilon/(4T)},\ 1\big)$.
3: Choose an arbitrary initial point $w_0 \in \mathcal{W}$.
4: for $t = 0$ to $T - 1$ do
5:  Sample a batch $B_t = \{z_{i(t,1)}, \dots, z_{i(t,m)}\}$ from $S$ uniformly with replacement.
6:  $w_{t+1} := \mathrm{Proj}_{\mathcal{W}}\Big(w_t - \eta\,\big(\frac{1}{m}\sum_{j=1}^{m} \nabla\ell(w_t, z_{i(t,j)}) + G_t\big)\Big)$, where $\mathrm{Proj}_{\mathcal{W}}$ denotes the Euclidean projection onto $\mathcal{W}$, and $G_t \sim \mathcal{N}(0, \sigma^2\,\mathbb{I}_d)$ is drawn independently in each iteration.
7: return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$

Theorem 3.1 (Privacy guarantee of $\mathcal{A}_{\mathrm{NSGD}}$). Algorithm 1 is $(\epsilon, \delta)$-differentially private.

Proof. The proof follows from [ACG+16, Theorem 1].
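The following is a minimal NumPy sketch of Algorithm 1, assuming for concreteness that $\mathcal{W}$ is the $L_2$ ball of radius $M$ (so the projection has a closed form) and that `grad` returns a (sub)gradient of the $L$-Lipschitz loss, as in the scaffolding above. It mirrors steps 1–7 but is illustrative only: the privacy guarantee rests on the moments-accountant analysis of [ACG+16], which the code assumes rather than re-derives.

```python
import numpy as np

def project_ball(w, M):
    # Euclidean projection onto {w : ||w||_2 <= M}, our stand-in for Proj_W.
    nrm = np.linalg.norm(w)
    return w if nrm <= M else (M / nrm) * w

def noisy_minibatch_sgd(S, grad, d, L, M, eps, delta, eta, T, seed=0):
    """Sketch of A_NSGD (Algorithm 1): projected noisy mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n = len(S)
    sigma2 = 8 * T * L**2 * np.log(1 / delta) / (n**2 * eps**2)  # step 1
    m = max(int(n * np.sqrt(eps / (4 * T))), 1)                  # step 2
    w = np.zeros(d)                                              # step 3: w_0 in W
    w_avg = np.zeros(d)
    for _ in range(T):                                           # steps 4-6
        batch = rng.integers(0, n, size=m)   # uniform sampling with replacement
        g = np.mean([grad(w, S[i]) for i in batch], axis=0)
        G_t = rng.normal(0.0, np.sqrt(sigma2), size=d)           # G_t ~ N(0, sigma^2 I_d)
        w = project_ball(w - eta * (g + G_t), M)
        w_avg += w / T
    return w_avg                                                 # step 7: averaged iterate
```

With the settings of Theorem 3.2 below, one would call this with $T = \min\big(n/8,\ \epsilon^2 n^2/(32\, d\log(1/\delta))\big)$ and $\eta = M/(L\sqrt{T})$.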
The population loss attained by $\mathcal{A}_{\mathrm{NSGD}}$ is given by the next theorem.

Theorem 3.2 (Excess population loss of $\mathcal{A}_{\mathrm{NSGD}}$). Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S \sim \mathcal{D}^n$. Set
$$T = \min\Big(\frac{n}{8},\ \frac{\epsilon^2 n^2}{32\, d \log(1/\delta)}\Big) \qquad\text{and}\qquad \eta = \frac{M}{L\sqrt{T}},$$
and suppose the smoothness parameter satisfies $\beta \le 2/\eta = \frac{2L\sqrt{T}}{M}$ (the condition required by Lemma 3.4 below). Then
$$\Delta L(\mathcal{A}_{\mathrm{NSGD}}; \mathcal{D}) \le 10\, M L\, \max\Big(\frac{1}{\sqrt{n}},\ \frac{\sqrt{d \log(1/\delta)}}{\epsilon n}\Big).$$

Before proving the above theorem, we first state and prove the following useful lemmas.

Lemma 3.3. Let $S \in \mathcal{Z}^n$. Suppose the parameter set $\mathcal{W}$ is convex and $M$-bounded. For any $\eta > 0$, the excess empirical loss of $\mathcal{A}_{\mathrm{NSGD}}$ satisfies
$$\mathbb{E}\big[\hat{L}(\bar{w}_T; S)\big] - \min_{w\in\mathcal{W}} \hat{L}(w; S) \le \frac{M^2}{2\,\eta\, T} + \eta\, L^2\Big(\frac{1}{2} + \frac{16\, T\, d \log(1/\delta)}{n^2\epsilon^2}\Big),$$
where the expectation is taken with respect to the choice of the mini-batches (step 5) and the independent Gaussian noise vectors $G_1, \dots, G_T$.

Proof. The proof follows from the classical analysis of SGD in the stochastic oracle model (see, e.g., [SSBD14]). In particular, we can show that
$$\mathbb{E}\big[\hat{L}(\bar{w}_T; S)\big] - \min_{w\in\mathcal{W}} \hat{L}(w; S) \le \frac{M^2}{2\,\eta\, T} + \frac{\eta\, L^2}{2} + \eta\,\sigma^2 d,$$
where the last term captures the extra empirical error due to privacy. The statement now follows from the setting of $\sigma^2$ in Algorithm 1.

The following lemma is a simple extension of the results on uniform stability of gradient descent methods that appeared in [HRS15] and [FV19, Lemma 4.3] to the case of mini-batch noisy SGD. We provide a proof for this lemma in the full version [BFTT19].

Lemma 3.4. In $\mathcal{A}_{\mathrm{NSGD}}$, suppose $\eta \le 2/\beta$, where $\beta$ is the smoothness parameter of $\ell$. Then $\mathcal{A}_{\mathrm{NSGD}}$ is $\alpha$-uniformly stable with $\alpha = \frac{L^2\, T\, \eta}{n}$.

Proof of Theorem 3.2. By Lemma 2.2, $\alpha$-uniform stability implies that the expected generalization error is bounded by $\alpha$. Hence, combining Lemma 3.3 with Lemma 3.4, we have
$$\mathbb{E}_{S\sim\mathcal{D}^n,\,\mathcal{A}_{\mathrm{NSGD}}}\big[L(\bar{w}_T; \mathcal{D})\big] - \min_{w\in\mathcal{W}} L(w; \mathcal{D}) \le \mathbb{E}\big[\hat{L}(\bar{w}_T; S)\big] - \min_{w\in\mathcal{W}} L(w; \mathcal{D}) + \frac{L^2\,\eta\, T}{n}$$
$$\le \mathbb{E}\Big[\hat{L}(\bar{w}_T; S) - \min_{w\in\mathcal{W}} \hat{L}(w; S)\Big] + \frac{L^2\,\eta\, T}{n} \tag{1}$$
$$\le \frac{M^2}{2\,\eta\, T} + \eta\, L^2\Big(\frac{1}{2} + \frac{16\, T\, d \log(1/\delta)}{n^2\epsilon^2}\Big) + \frac{L^2\,\eta\, T}{n},$$
where (1) follows from the fact that $\mathbb{E}_{S\sim\mathcal{D}^n}\big[\min_{w\in\mathcal{W}} \hat{L}(w; S)\big] \le \min_{w\in\mathcal{W}} \mathbb{E}_{S\sim\mathcal{D}^n}\big[\hat{L}(w; S)\big] = \min_{w\in\mathcal{W}} L(w; \mathcal{D})$. Optimizing the above bound in $\eta$ and $T$ yields the values in the theorem statement for these parameters, as well as the stated bound on the excess population loss.

4 Private SCO for Non-smooth Losses

In this section, we consider the setting where the convex loss is non-smooth. First, we show a generic reduction to the smooth case by employing the smoothing technique known as Moreau–Yosida regularization (a.k.a. Moreau envelope smoothing) [Nes05]. Given an appropriately smoothed version of the loss, we obtain the optimal population loss w.r.t. the original non-smooth loss function. Computing the smoothed loss via this technique is generally computationally inefficient, so we then describe a computationally efficient algorithm for the non-smooth case with essentially optimal population loss. Our construction is based on an adaptation of our noisy SGD algorithm $\mathcal{A}_{\mathrm{NSGD}}$ (Algorithm 1) that exploits some useful properties of the Moreau–Yosida smoothing technique that stem from its connection to proximal operations.

Definition 4.1 (Moreau envelope). Let $f : \mathcal{W} \to \mathbb{R}$ be a convex function, and let $\beta > 0$. The $\beta$-Moreau envelope of $f$ is the function $f_\beta : \mathcal{W} \to \mathbb{R}$ defined as
$$f_\beta(w) = \min_{v\in\mathcal{W}} \Big(f(v) + \frac{\beta}{2}\,\|w - v\|^2\Big), \quad w \in \mathcal{W}.$$

The Moreau envelope has a direct connection with the proximal operator of a function, defined below.

Definition 4.2 (Proximal operator). The prox operator of $f : \mathcal{W} \to \mathbb{R}$ is defined as
$$\mathrm{prox}_f(w) = \arg\min_{v\in\mathcal{W}} \Big(f(v) + \frac{1}{2}\,\|w - v\|^2\Big), \quad w \in \mathcal{W}.$$

It follows that the Moreau envelope $f_\beta$ can be written as
$$f_\beta(w) = f\big(\mathrm{prox}_{f/\beta}(w)\big) + \frac{\beta}{2}\,\big\|w - \mathrm{prox}_{f/\beta}(w)\big\|^2.$$

The following lemma states some useful, known properties of the Moreau envelope.

Lemma 4.3 (See [Nes05, Can11]). Let $f : \mathcal{W} \to \mathbb{R}$ be a convex, $L$-Lipschitz function, and let $\beta > 0$. The $\beta$-Moreau envelope $f_\beta$ satisfies the following:
1. $f_\beta$ is convex, $2L$-Lipschitz, and $\beta$-smooth.
2. For all $w \in \mathcal{W}$: $f_\beta(w) \le f(w) \le f_\beta(w) + \frac{L^2}{2\beta}$.
3. For all $w \in \mathcal{W}$: $\nabla f_\beta(w) = \beta\,\big(w - \mathrm{prox}_{f/\beta}(w)\big)$.
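Property 3 is what makes the envelope algorithmically useful: one gradient of $f_\beta$ costs one prox evaluation, and the prox is a $\beta$-strongly convex minimization that a simple inner loop can approximate. Below is a sketch under the same assumptions as before (ball-shaped $\mathcal{W}$, subgradient access, `project_ball` from the earlier sketch); the inner step-size schedule and iteration budget are our illustrative choices, and the formal treatment of the resulting approximation error is in the full version [BFTT19].

```python
import numpy as np

def approx_prox(f_grad, w, beta, M, steps=200):
    """Approximate prox_{f/beta}(w) = argmin_{v in W} f(v) + (beta/2)||w - v||^2
    by projected subgradient descent on the beta-strongly convex objective."""
    v = w.copy()
    for t in range(1, steps + 1):
        g = f_grad(v) + beta * (v - w)            # subgradient of the prox objective
        v = project_ball(v - g / (beta * t), M)   # ~1/(strong convexity * t) step size
    return v

def moreau_grad(f_grad, w, beta, M):
    # Property 3 of Lemma 4.3: grad f_beta(w) = beta * (w - prox_{f/beta}(w)).
    return beta * (w - approx_prox(f_grad, w, beta, M))
```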
Let $\ell : \mathcal{W}\times\mathcal{Z}\to\mathbb{R}$ be a convex, $L$-Lipschitz loss. For any $z \in \mathcal{Z}$, let $\ell_\beta(\cdot, z)$ denote the $\beta$-Moreau envelope of $\ell(\cdot, z)$. For a dataset $S = (z_1, \dots, z_n) \in \mathcal{Z}^n$, let $\hat{L}_\beta(\cdot\,; S) \triangleq \frac{1}{n}\sum_{i=1}^{n} \ell_\beta(\cdot, z_i)$ be the empirical risk w.r.t. the $\beta$-smoothed loss. For any distribution $\mathcal{D}$, let $L_\beta(\cdot\,; \mathcal{D}) \triangleq \mathbb{E}_{z\sim\mathcal{D}}[\ell_\beta(\cdot, z)]$ denote the corresponding population loss. The following theorem asserts that, with an appropriate setting for $\beta$, running $\mathcal{A}_{\mathrm{NSGD}}$ over the $\beta$-smoothed losses $\ell_\beta(\cdot, z_i)$, $i \in [n]$, yields the optimal population loss w.r.t. the original non-smooth loss $\ell$.

Theorem 4.4 (Excess population loss for non-smooth losses via smoothing). Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S = (z_1, \dots, z_n) \sim \mathcal{D}^n$. Let
$$\beta = \frac{L}{4M}\,\min\Big(\sqrt{n},\ \frac{\epsilon n}{\sqrt{d\log(1/\delta)}}\Big),$$
so that the smoothing error $\frac{L^2}{2\beta}$ from property 2 of Lemma 4.3 is at most $2ML\max\big(\frac{1}{\sqrt{n}}, \frac{\sqrt{d\log(1/\delta)}}{\epsilon n}\big)$. Suppose we run $\mathcal{A}_{\mathrm{NSGD}}$ (Algorithm 1) over the $\beta$-smoothed version of $\ell$ associated with the points in $S$, i.e., $\{\ell_\beta(\cdot, z_i) : i \in [n]\}$, with $\eta$ and $T$ set as in Theorem 3.2. Then the excess population loss of the output of $\mathcal{A}_{\mathrm{NSGD}}$ w.r.t. $\ell$ satisfies
$$\Delta L(\mathcal{A}_{\mathrm{NSGD}}; \mathcal{D}) \le 24\, M L\, \max\Big(\frac{1}{\sqrt{n}},\ \frac{\sqrt{d\log(1/\delta)}}{\epsilon n}\Big).$$

Proof. Let $\bar{w}_T$ be the output of $\mathcal{A}_{\mathrm{NSGD}}$. Using property 1 of Lemma 4.3 (each $\ell_\beta(\cdot, z_i)$ is convex, $2L$-Lipschitz, and $\beta$-smooth) together with Theorem 3.2, we have
$$\mathbb{E}_{S\sim\mathcal{D}^n,\,\mathcal{A}_{\mathrm{NSGD}}}\big[L_\beta(\bar{w}_T; \mathcal{D})\big] - \min_{w\in\mathcal{W}} L_\beta(w; \mathcal{D}) \le 20\, M L\, \max\Big(\frac{1}{\sqrt{n}},\ \frac{\sqrt{d\log(1/\delta)}}{\epsilon n}\Big).$$
By property 2 of Lemma 4.3 and the setting of $\beta$ in the theorem statement, for every $w \in \mathcal{W}$ we have
$$L_\beta(w; \mathcal{D}) \le L(w; \mathcal{D}) \le L_\beta(w; \mathcal{D}) + 2\, M L\, \max\Big(\frac{1}{\sqrt{n}},\ \frac{\sqrt{d\log(1/\delta)}}{\epsilon n}\Big).$$
Putting these together gives the stated result.

Computationally efficient algorithm $\mathcal{A}_{\mathrm{ProxGD}}$ (NSGD + Prox): Computing the Moreau envelope of a function is computationally inefficient in general. However, by property 3 of Lemma 4.3, evaluating the gradient of the Moreau envelope at any point reduces to evaluating the proximal operator of the function at that point. Evaluating the proximal operator is equivalent to minimizing a strongly convex function (see Definition 4.2), which can be approximated efficiently, e.g., via gradient descent. Since our $\mathcal{A}_{\mathrm{NSGD}}$ algorithm (Algorithm 1) requires only sufficiently accurate gradient evaluations, we can use an efficient, approximate proximal operator to approximate the gradients of the smoothed losses. The gradient evaluations in $\mathcal{A}_{\mathrm{NSGD}}$ are thus replaced with approximate gradients evaluated via the approximate proximal operator. The resulting algorithm, referred to as $\mathcal{A}_{\mathrm{ProxGD}}$, approximately minimizes the smoothed empirical loss without ever computing the smoothed losses explicitly. Our construction of $\mathcal{A}_{\mathrm{ProxGD}}$ involves $n^2 T^2 m$ gradient evaluations (of individual losses), where $T$ is the number of iterations of $\mathcal{A}_{\mathrm{NSGD}}$ reported in Theorem 3.2 and $m$ is its mini-batch size.

We argue that the approximate proximal operation has essentially no impact on the guarantees of $\mathcal{A}_{\mathrm{ProxGD}}$ as compared to those of $\mathcal{A}_{\mathrm{NSGD}}$. In terms of privacy, the sensitivity of the approximate gradients (evaluated via the approximate prox operator) remains essentially the same as that of the exact gradients. In terms of empirical error, since the approximation error in the prox operations can be made sufficiently small (while maintaining computational efficiency), its impact on the empirical loss guarantee of $\mathcal{A}_{\mathrm{ProxGD}}$ is negligible. Finally, in terms of uniform stability, again since the approximation error is sufficiently small, the error accumulated across iterations has no pronounced impact on the uniform stability of $\mathcal{A}_{\mathrm{NSGD}}$ (established in Lemma 3.4). Putting these together shows that $\mathcal{A}_{\mathrm{ProxGD}}$ achieves the optimal population loss bound of Theorem 4.4. A more detailed description of $\mathcal{A}_{\mathrm{ProxGD}}$ and its guarantees is given in the full version [BFTT19]; a sketch of a single update appears below.
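Concretely, one iteration of $\mathcal{A}_{\mathrm{ProxGD}}$ is just step 6 of Algorithm 1 with the approximate envelope gradient standing in for $\nabla\ell_\beta$. The sketch below reuses the helpers defined earlier (`moreau_grad`, `loss_grad`, `project_ball`); it is our own illustration, not the paper's reference code.

```python
import numpy as np

def aproxgd_step(w, batch, beta, eta, sigma2, M, rng):
    # One A_ProxGD update: approximate gradients of the beta-smoothed losses,
    # obtained via the approximate prox, plugged into the noisy SGD step.
    g = np.mean([moreau_grad(lambda v, z=z: loss_grad(v, z), w, beta, M)
                 for z in batch], axis=0)
    G_t = rng.normal(0.0, np.sqrt(sigma2), size=len(w))
    return project_ball(w - eta * (g + G_t), M)
```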
5 Private SCO via Objective Perturbation

In this section, we show that the technique known as objective perturbation [CMS11, KST12] can be used to attain the optimal population loss under additional assumptions on the loss; these assumptions are invoked to ensure differential privacy. The excess empirical loss of this technique for smooth convex losses was originally analyzed in the aforementioned works, and was shown to be optimal by the lower bound in [BST14]. We revisit this technique and show that the regularization term added for privacy can be used to attain the optimal excess population loss, by exploiting the stability-inducing property of regularization. The objective perturbation algorithm $\mathcal{A}_{\mathrm{ObjP}}$ is described in Algorithm 2. In addition to smoothness and convexity, we make the following assumption on the loss.

Assumption 5.1. For all $z \in \mathcal{Z}$, $\ell(\cdot, z)$ is twice-differentiable, and the rank of its Hessian $\nabla^2 \ell(w, z)$ at any $w \in \mathcal{W}$ is at most 1.

Algorithm 2 $\mathcal{A}_{\mathrm{ObjP}}$: Objective perturbation for convex, smooth losses
Input: private dataset $S = (z_1, \dots, z_n) \in \mathcal{Z}^n$; an $L$-Lipschitz, $\beta$-smooth, convex loss function $\ell$; convex set $\mathcal{W} \subseteq \mathbb{R}^d$; privacy parameters $\epsilon \le 1$, $\delta \le 1/n^2$; regularization parameter $\lambda$.
1: Sample $G \sim \mathcal{N}(0, \sigma^2\,\mathbb{I}_d)$, where $\sigma^2 = \frac{10\, L^2 \log(1/\delta)}{\epsilon^2}$.
2: return $\hat{w} = \arg\min_{w\in\mathcal{W}} \Big(\hat{L}(w; S) + \frac{\langle G, w\rangle}{n} + \lambda\,\|w\|^2\Big)$, where $\hat{L}(w; S) \triangleq \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i)$.

Note: Unlike in [KST12], the regularization term as it appears in $\mathcal{A}_{\mathrm{ObjP}}$ is not normalized by $n$. Hence, whenever results from [KST12] are used here, the regularization parameter in their statements should be replaced with $n\lambda$. This presentation choice is more consistent with the literature on regularization.

The privacy guarantee of $\mathcal{A}_{\mathrm{ObjP}}$ follows directly from [KST12]:

Theorem 5.2 (Privacy guarantee of $\mathcal{A}_{\mathrm{ObjP}}$; restatement of Theorem 2 in [KST12]). Suppose that Assumption 5.1 holds and that the smoothness parameter satisfies $\beta \le \epsilon\, n\, \lambda$. Then $\mathcal{A}_{\mathrm{ObjP}}$ is $(\epsilon, \delta)$-differentially private.

We now state our main result for this section, showing that with an appropriate setting for $\lambda$, $\mathcal{A}_{\mathrm{ObjP}}$ yields the optimal population loss.

Theorem 5.3 (Excess population loss of $\mathcal{A}_{\mathrm{ObjP}}$). Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S \sim \mathcal{D}^n$. Suppose that Assumption 5.1 holds and that $\mathcal{W}$ is $M$-bounded. In $\mathcal{A}_{\mathrm{ObjP}}$, set
$$\lambda = \frac{L}{M}\,\sqrt{\frac{2}{n} + \frac{4\, d\log(1/\delta)}{\epsilon^2 n^2}}.$$
Then we have
$$\Delta L(\mathcal{A}_{\mathrm{ObjP}}; \mathcal{D}) \le 2\, M L\, \sqrt{\frac{2}{n} + \frac{4\, d\log(1/\delta)}{\epsilon^2 n^2}}.$$

Note: According to Theorem 5.2, the privacy of $\mathcal{A}_{\mathrm{ObjP}}$ entails the assumption that $\beta \le \epsilon\, n\, \lambda$. With the setting of $\lambda$ in Theorem 5.3, it suffices to assume that $\beta \le \frac{L}{M}\sqrt{2\,\epsilon^2 n + 4\, d\log(1/\delta)}$.

To prove the above theorem, we use the following lemmas.

Lemma 5.4 (Excess empirical loss of $\mathcal{A}_{\mathrm{ObjP}}$; restatement of Theorem 26 in [KST12]). Let $S \in \mathcal{Z}^n$. Under Assumption 5.1, the excess empirical loss of $\mathcal{A}_{\mathrm{ObjP}}$ satisfies
$$\mathbb{E}\big[\hat{L}(\hat{w}; S)\big] - \min_{w\in\mathcal{W}} \hat{L}(w; S) \le \frac{16\, L^2\, d\log(1/\delta)}{n^2\epsilon^2\,\lambda} + \lambda\, M^2,$$
where the expectation is taken over the Gaussian noise in $\mathcal{A}_{\mathrm{ObjP}}$.

Lemma 5.5 ([SSBD14]). Let $f : \mathcal{W}\times\mathcal{Z}\to\mathbb{R}$ be a convex, $\rho$-Lipschitz loss, and let $\lambda > 0$. Let $S = (z_1, \dots, z_n) \in \mathcal{Z}^n$. Let $\mathcal{A}$ be the algorithm that outputs $\tilde{w} = \arg\min_{w\in\mathcal{W}} \big(\hat{F}(w; S) + \lambda\,\|w\|^2\big)$, where $\hat{F}(w; S) = \frac{1}{n}\sum_{i=1}^{n} f(w, z_i)$. Then $\mathcal{A}$ is $\frac{2\rho^2}{\lambda n}$-uniformly stable.
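Before turning to the proof of Theorem 5.3, here is a minimal sketch of $\mathcal{A}_{\mathrm{ObjP}}$ in the same style as the earlier code (reusing `project_ball`). One caveat is essential: the privacy guarantee of Theorem 5.2 is stated for the exact minimizer, so the finite projected-gradient loop below is only an illustration; the "Efficient Objective Perturbation" paragraph at the end of this section explains how approximate minimization is handled rigorously.

```python
import numpy as np

def objective_perturbation(S, grad, d, L, M, eps, delta, lam, steps=5000, seed=0):
    """Sketch of A_ObjP (Algorithm 2): approximately minimize
    L_hat(w; S) + <G, w>/n + lam * ||w||^2 over W for one Gaussian draw G."""
    rng = np.random.default_rng(seed)
    n = len(S)
    sigma2 = 10 * L**2 * np.log(1 / delta) / eps**2         # step 1
    G = rng.normal(0.0, np.sqrt(sigma2), size=d)
    w = np.zeros(d)
    for t in range(1, steps + 1):
        g = (np.mean([grad(w, z) for z in S], axis=0)       # gradient of L_hat(w; S)
             + G / n + 2 * lam * w)                         # perturbation + regularizer
        # Step size ~ 1/(mu * t) for the 2*lam-strongly convex objective. Exact
        # minimization, as required by Theorem 5.2, is assumed in the analysis.
        w = project_ball(w - g / (2 * lam * t), M)
    return w
```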
Proof of Theorem 5.3. Fix any realization of the noise vector $G$. For every $w, z$ define $f_G(w, z) \triangleq \ell(w, z) + \frac{\langle G, w\rangle}{n}$, and note that $f_G$ is $\big(L + \frac{\|G\|}{n}\big)$-Lipschitz. For any $S = (z_1, \dots, z_n) \in \mathcal{Z}^n$, let $\hat{F}_G(w; S) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_G(w, z_i)$. Hence, the output of $\mathcal{A}_{\mathrm{ObjP}}$ can be written as $\hat{w} = \arg\min_{w\in\mathcal{W}} \big(\hat{F}_G(w; S) + \lambda\,\|w\|^2\big)$. Define $F_G(w; \mathcal{D}) \triangleq \mathbb{E}_{z\sim\mathcal{D}}[f_G(w, z)]$. Thus, by combining Lemmas 5.5 and 2.2, we have
$$\mathbb{E}\big[F_G(\hat{w}; \mathcal{D}) - \hat{F}_G(\hat{w}; S)\big] \le \frac{2\,\big(L + \|G\|/n\big)^2}{\lambda\, n}.$$
On the other hand, note that $F_G(\hat{w}; \mathcal{D}) - \hat{F}_G(\hat{w}; S) = L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)$, since the linear term cancels out. Hence,
$$\mathbb{E}\big[L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)\big] \le \frac{2\,\big(L + \|G\|/n\big)^2}{\lambda\, n}.$$
By taking the expectation over $G \sim \mathcal{N}(0, \sigma^2\,\mathbb{I}_d)$ as well (using $\mathbb{E}\|G\|^2 = \sigma^2 d \le L^2 n^2$, which holds whenever the stated bound is non-trivial), we get
$$\mathbb{E}\big[L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)\big] \le \frac{8\, L^2}{\lambda\, n}.$$
Now, observe that
$$\Delta L(\mathcal{A}_{\mathrm{ObjP}}; \mathcal{D}) \le \mathbb{E}\Big[\hat{L}(\hat{w}; S) - \min_{w\in\mathcal{W}} \hat{L}(w; S)\Big] + \mathbb{E}\big[L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)\big] \le \frac{16\, L^2\, d\log(1/\delta)}{\epsilon^2 n^2\,\lambda} + \lambda\, M^2 + \frac{8\, L^2}{\lambda\, n},$$
where we use Lemma 5.4 in the last bound. Optimizing this bound in $\lambda$ yields the result.

A note on the rank assumption: The assumption on the rank of $\nabla^2\ell(w, z)$ can actually be relaxed (using an argument similar to that of [INS+19]) to a rank of $\tilde{O}\big(\frac{L\sqrt{n+d}}{\beta M}\big)$ without affecting the asymptotic population loss guarantees (see the full version [BFTT19] for a discussion).

Efficient Objective Perturbation: The privacy guarantee of the standard objective perturbation technique is given only when the output is the exact minimizer [CMS11, KST12], and exact minimization is not usually attainable in practice. We give a practical version of algorithm $\mathcal{A}_{\mathrm{ObjP}}$ that attains the same guarantees of privacy and optimal population loss as $\mathcal{A}_{\mathrm{ObjP}}$ and, in addition, makes only $O(n \log n)$ gradient evaluations. The main idea is to first obtain an approximate minimizer $\bar{w}$ that is sufficiently close to the true minimizer, and then perturb $\bar{w}$ with only a small amount of Gaussian noise to ensure privacy. The extra error due to this small amount of noise has a negligible impact on the population loss, so the algorithm achieves the same guarantees as $\mathcal{A}_{\mathrm{ObjP}}$. Crucially, it attains the optimal population loss in an efficient manner: we use Stochastic Variance Reduced Gradient Descent (SVRG) [JZ13, XZ14] to perform the optimization step, which leads to a construction with $O(n \log n)$ gradient evaluations. A detailed discussion can be found in the full version [BFTT19].

Acknowledgements

We thank Adam Smith, Thomas Steinke, and Jon Ullman for insightful discussions of the problem at the early stages of this project. We are also grateful to Tomer Koren for bringing the Moreau–Yosida smoothing technique to our attention. R. Bassily's research is supported by NSF Awards AF-1908281 and SHF-1907715, a Google Faculty Research Award, and OSU faculty start-up support. A. Thakurta's research is supported by NSF Awards TRIPODS+X-1839317, AF-1908281, and TRIPODS-1740850, and a Google Faculty Research Award.

References

[ACG+16] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. arXiv preprint arXiv:1908.09970, 2019.

[BNS+16] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1046–1059. ACM, 2016.

[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. arXiv preprint arXiv:1405.7085, 2014.

[Can11] Emmanuel Candès. Mathematical optimization. Lecture notes, MATH 301, Stanford University, 2011.
[CM08] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, NIPS. MIT Press, 2008.

[CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

[DFH+15] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126. ACM, 2015.

[DJW13] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 429–438, 2013.

[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[Fel16] Vitaly Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, pages 3576–3584, 2016.

[FV19] Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. arXiv preprint arXiv:1902.10710, 2019.

[HRS15] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

[INS+19] Roger Iyengar, Joseph P. Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In IEEE Symposium on Security and Privacy (Oakland), 2019.

[JKT12] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In 25th Annual Conference on Learning Theory (COLT), pages 24.1–24.34, 2012.

[JT14] Prateek Jain and Abhradeep Thakurta. (Near) dimension independent risk bounds for differentially private learning. In ICML, 2014.

[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[KST08] Sham Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008.

[KST12] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25.1–25.40, 2012.

[Nes05] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[SCS13] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing, 2013.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[ST13] Adam Smith and Abhradeep Thakurta. Differentially private feature selection via stability arguments, and the robustness of the LASSO. In Conference on Learning Theory (COLT), pages 819–850, 2013.

[STU17] Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is interaction necessary for distributed private learning? In IEEE Symposium on Security and Privacy, pages 58–77, 2017.

[TTZ15] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly optimal private LASSO. In Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 2, pages 3025–3033, 2015.

[Ull15] Jonathan Ullman. Private multiplicative weights beyond linear queries. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 303–312. ACM, 2015.

[WLK+17] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In SIGMOD. ACM, 2017.

[WYX17] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, pages 2722–2731, 2017.

[XZ14] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.