Momentum Aggregation for Private Non-convex ERM

Hoang Tran, Boston University, tranhp@bu.edu
Ashok Cutkosky, Boston University, ashok@cutkosky.com

Abstract: We introduce new algorithms and convergence guarantees for privacy-preserving non-convex Empirical Risk Minimization (ERM) on smooth $d$-dimensional objectives. We develop an improved sensitivity analysis of stochastic gradient descent on smooth objectives that exploits the recurrence of examples in different epochs. By combining this new approach with a recent analysis of momentum with private aggregation techniques, we provide an $(\epsilon, \delta)$-differentially private algorithm that finds a gradient of norm $\tilde O\!\left(\frac{d^{1/3}}{(\epsilon N)^{2/3}}\right)$ in $O\!\left(\frac{N^{7/3}\epsilon^{4/3}}{d^{2/3}}\right)$ gradient evaluations, improving the previous best gradient bound of $\tilde O\!\left(\frac{d^{1/4}}{\sqrt{\epsilon N}}\right)$.

1 Introduction

In recent years, statistical machine learning models have been deployed in many domains such as health care, education, criminal justice, and social studies [Chen et al., 2021, He et al., 2019, Jiang et al., 2017]. However, the release of statistical estimates based on such sensitive data carries the risk of leaking personal information about individuals in the original dataset. One naive solution to this problem is to remove all identifying information such as names, races, or social security numbers. Unfortunately, this is usually not enough to preserve privacy: it has been shown in various works that an adversary can take advantage of structural properties of the rest of the dataset to reconstruct information about certain individuals [Backstrom et al., 2007, Dinur and Nissim, 2003]. Thus, we need a stronger privacy-preserving mechanism. Over the past two decades, differential privacy [Dwork et al., 2006] has emerged as the dominant privacy notion for machine learning problems.

Definition 1 (Differential Privacy [Dwork and Roth, 2014]). A randomized algorithm $M: \mathcal{X}^N \to \mathbb{R}^d$ satisfies $(\epsilon, \delta)$-differential privacy ($(\epsilon, \delta)$-DP) if for any two datasets $D, D' \in \mathcal{X}^N$ differing in at most one element and any event $E \subseteq \mathbb{R}^d$, it holds that:
$$P[M(D) \in E] \le \exp(\epsilon)\, P[M(D') \in E] + \delta.$$

Roughly speaking, differential privacy guarantees that the output distributions on two neighboring datasets (datasets that differ in at most one datapoint) are almost the same, thus preventing an adversary from identifying any individual's data. In this paper, we are interested in designing $(\epsilon, \delta)$-private algorithms for non-convex empirical risk minimization (ERM) problems. In ERM, given $N$ i.i.d. samples $x_1, \dots, x_N \in \mathcal{X}$ from some unknown distribution $P$, the goal is to find $w \in \mathbb{R}^d$ that minimizes the empirical loss, defined as:
$$F(w) = \frac{1}{N}\sum_{i=1}^{N} f(w, x_i),$$
where $f: \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$ is the loss function associated with the learning problem. This setting commonly arises in modern machine learning. For example, in image classification, the datapoint $x$ is an (image, label) pair, $w$ denotes the parameters of our model, and $f(w, x)$ composes the model's prediction with some loss function such as cross-entropy. We are interested in finding a critical point, that is, a point at which the norm of the empirical gradient $\|\nabla F(w)\|$ is as small as possible. Further, we want all the outputs $w_1, w_2, \dots, w_T$ to be differentially private with respect to the $N$ training samples.
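To make the setting concrete, the following is a minimal sketch of the ERM quantities above, assuming a toy quadratic loss $f(w, x) = \frac{1}{2}\|w - x\|^2$; the loss, the dataset, and all names here are illustrative only, while the paper's results apply to any Lipschitz, smooth $f$:

```python
import numpy as np

# Toy ERM instance: f(w, x) = 0.5 * ||w - x||^2, so that
# F(w) = (1/N) * sum_i f(w, x_i) and grad F(w) = w - mean_i(x_i).
def empirical_loss_and_grad(w, data):
    residuals = w[None, :] - data                        # shape (N, d)
    loss = 0.5 * np.mean(np.sum(residuals ** 2, axis=1))
    grad = np.mean(residuals, axis=0)                    # grad F(w)
    return loss, grad

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))                         # N = 100, d = 5
loss, grad = empirical_loss_and_grad(np.zeros(5), data)
print(np.linalg.norm(grad))  # ||grad F(w)||: the quantity we want small
```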
Private ERM has been well studied in the convex setting. The approaches in this line of work fall into three main categories: output perturbation [Dwork et al., 2006, Chaudhuri et al., 2011, Zhang et al., 2017, Wu et al., 2017], objective perturbation [Chaudhuri et al., 2011, Kifer et al., 2012, Iyengar et al., 2019, Talwar et al., 2014], and gradient perturbation [Bassily et al., 2014, Wang et al., 2017, Jayaraman et al., 2018, Wang et al., 2018]. All of these approaches have been shown to achieve the asymptotically optimal bound $O\!\left(\frac{\sqrt{d}}{\epsilon N}\right)$ for smooth convex losses (with output perturbation requiring strong convexity to get the optimal bound) in (near-)linear time.

On the other hand, the literature on private non-convex ERM is nowhere near as comprehensive. The first theoretical bound for private non-convex ERM is from Zhang et al. [2017]. They propose an algorithm called Random Round Private Stochastic Gradient Descent (RRSGD), inspired by the results of Bassily et al. [2014] and Ghadimi and Lan [2013]. RRSGD guarantees a utility bound of $O\!\left(\frac{(d\log(n/\delta)\log(1/\delta))^{1/4}}{\sqrt{\epsilon N}}\right)$. However, RRSGD takes $O(N^2 d)$ gradient computations to achieve this, which can be troublesome for high-dimensional problems. Wang et al. [2018] then improve upon this utility bound by a factor of $O\big((\log(n/\delta))^{1/4}\big)$. They achieve this rate by using full-batch gradient descent, which is not common practice in non-private machine learning, where very large batch sizes actually require careful work to make training efficient. Recently, Wang et al. [2019b] tackled both the runtime and the utility issues by introducing a private version of Stochastic Recursive Momentum (DP-SRM) [Cutkosky and Orabona, 2019]. By appealing to variance reduction as well as privacy amplification by subsampling [Balle et al., 2018, Abadi et al., 2016], DP-SRM achieves the bound $O\!\left(\frac{(d\log(1/\delta))^{1/4}}{\sqrt{\epsilon N}}\right)$ with $O\big((\epsilon N)^{3/2}\big)$ gradient complexity. However, DP-SRM still requires the batch size to be $O\!\left(\frac{\sqrt{\epsilon N}}{d^{1/4}}\right)$ for the analysis to work. Finally, although our focus in this paper is the ERM problem, there is also a range of work on private stochastic non-convex/convex optimization [Bassily et al., 2019, 2021a,b, Feldman et al., 2018, 2020, Zhou et al., 2020, Asi et al., 2021, Wang et al., 2019a, Kulkarni et al., 2021].

Contributions. We first provide an analysis of a private version of Normalized SGD (DP-NSGD) [Cutkosky and Mehta, 2020] for unconstrained non-convex ERM. By using the tree-aggregation technique [Chan et al., 2011, Dwork et al., 2010] to compute the momentum privately, we can ensure the privacy guarantee while keeping the added noise small: each momentum estimate touches only $O(\log T)$ noisy tree nodes (where $T$ is the total number of iterations). This allows us to achieve the same asymptotic bound $O\!\left(\frac{d^{1/4}}{\sqrt{\epsilon N}}\right)$ on the expected gradient norm as [Zhang et al., 2017, Wang et al., 2018, 2019b], without appealing to the privacy amplification techniques that are usually required for private SGD to have a good utility guarantee [Abadi et al., 2016]. DP-NSGD also does not require a large batch size as in [Wang et al., 2018, 2019b]; it has a utility guarantee for any batch size. This tree-aggregation technique is morally similar to the approach in [Kairouz et al., 2021, Guha Thakurta and Smith, 2013] for online learning. However, unlike in those works, we do not restrict our loss function to be convex, and we also extend the tree-aggregation technique to SGD with momentum. Further, we provide a new variant of Normalized SGD that takes advantage of the fact that the gradients of nearby iterates are close to each other due to smoothness.
This new algorithm guarantees an error of $O\!\left(\frac{d^{1/3}}{(\epsilon N)^{2/3}}\right)$ in $O\!\left(\frac{N^{7/3}\epsilon^{4/3}}{d^{2/3}}\right)$ gradient computations, which, to our knowledge, is the best known rate for private non-convex ERM.

Organization. The rest of the paper is organized as follows. In Section 2, we define our problem of interest and the assumptions we make on the problem setting. We also provide some background on differential privacy as well as some high-level intuition for the tree-aggregation technique. We then formally describe our first private variant of Normalized SGD in Section 3 and discuss its privacy guarantee and theoretical utility bound. In Section 4, we introduce a novel sensitivity-reduced analysis for Normalized SGD that allows us to improve upon the utility bound of Section 3. Finally, we conclude with a discussion in Section 5.

2 Preliminaries

Assumptions. We define our loss function as $f(w, x): \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$, where $x \in \mathcal{X}$ is some sample from the dataset. Throughout the paper, we make the following assumptions on $f(w, x)$. Define a differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ to be $G$-Lipschitz iff $\|\nabla f(y)\| \le G$ for all $y \in \mathbb{R}^d$, and to be $L$-smooth iff $\|\nabla f(y_1) - \nabla f(y_2)\| \le L\|y_1 - y_2\|$, where $\|\cdot\|$ is the standard 2-norm unless specified otherwise. We assume that $f(w, x)$ is differentiable, $G$-Lipschitz, and $L$-smooth with probability 1. We also assume that, given the initial point $w_1 \in \mathbb{R}^d$, there exists some upper bound $R$ on the suboptimality of $w_1$; formally, $\sup_{w \in \mathbb{R}^d} F(w_1) - F(w) \le R$.

Differential Privacy. Let $D$ be a dataset containing $N$ datapoints. Two datasets $D, D'$ are said to be neighbors if $\|D - D'\|_1 = 1$, or in other words, if the two datasets differ in exactly one entry. Now we can define sensitivity:

Definition 2 ($L_2$-sensitivity). The $L_2$-sensitivity of a function $f: \mathcal{X}^N \to \mathbb{R}^d$ is defined as:
$$\Delta(f) = \sup_{D, D'} \|f(D) - f(D')\|_2.$$

In this work, we also make use of Rényi Differential Privacy (RDP) [Mironov, 2017]. RDP is a relaxation of $(\epsilon, \delta)$-DP and can be used to improve the utility bound and perform composition efficiently while still guaranteeing privacy.

Definition 3 ($(\alpha, \epsilon)$-RDP [Mironov, 2017]). A randomized algorithm $f: \mathcal{X}^N \to \mathbb{R}^d$ is said to have $\epsilon$-RDP of order $\alpha$ if for any neighboring datasets $D, D'$ it holds that $D_\alpha(f(D) \,\|\, f(D')) \le \epsilon$, where
$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1}\log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right].$$

[Figure 1: Visualization of the tree-aggregation mechanism on a stream $g_1, \dots, g_8$. Each node of the tree holds the value $g_{[y:z]} = \sum_{i=y}^{z} g_i$; the leaves hold $g_{[1:1]}, \dots, g_{[8:8]}$.]

Tree-aggregation. Consider the problem of releasing private partial sums $s_{[y,z]} = \sum_{i=y}^{z} g_i$ given a stream of vectors $g_1, \dots, g_t$. Assuming $\|g_i\| \le G$ for all $i \in [t]$, the simplest solution is to add Gaussian noise with standard deviation $O(1/\epsilon)$ to each $g_i$ so that each $g_i$ is $(\epsilon, \delta)$-DP. The error from the noise added to any partial sum is then bounded by $O(\sqrt{t}/\epsilon)$. However, this problem can be solved much more efficiently using tree-aggregation [Chan et al., 2011, Dwork et al., 2010]. In tree-aggregation, we consider a complete binary tree where each leaf node is labeled with $g_i$ and each internal node holds the sum of its children. The tree is illustrated in Fig. 1. It is clear from Fig. 1 that each $g_i$ only affects its ancestors, i.e., at most $\log t + 1$ nodes. Thus, by adding Gaussian noise with standard deviation $O(\log t/\epsilon)$ to every node, as sketched below, the complete tree will be $(\epsilon, \delta)$-DP.
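The following is a minimal sketch of this construction, assuming for simplicity that the stream length is a power of two; the noise scale `sigma` is an illustrative stand-in for the calibrated $O(\log t/\epsilon)$ standard deviation discussed above:

```python
import numpy as np

def build_noisy_tree(grads, sigma, rng):
    """Tree-aggregation: return a dict mapping each dyadic interval
    (y, z), 1-indexed and inclusive, to a noisy version of
    g_[y:z] = g_y + ... + g_z. Assumes len(grads) is a power of two."""
    T, d = len(grads), len(grads[0])
    tree, width = {}, 1
    while width <= T:
        for start in range(0, T, width):
            y, z = start + 1, start + width
            partial = np.sum(grads[start:start + width], axis=0)
            tree[(y, z)] = partial + rng.normal(scale=sigma, size=d)
        width *= 2
    return tree
```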
Furthermore, since we only need at most $\log t$ nodes to compute any partial sum $s_{[y,z]}$, the noise added to any $s_{[y,z]}$ is bounded by $O(\log^{1.5} t/\epsilon)$, which is far less than the noise we would add using advanced composition. To get a general idea of how we can use this tree-aggregation technique to design private algorithms, let us analyze the update of SGD with initial point $w_1 = 0$:
$$w_{t+1} = w_t - \eta\,\nabla f(w_t, x_t) = -\eta\sum_{i=1}^{t}\nabla f(w_i, x_i),$$
where $\nabla f(w_t, x_t)$ is the gradient of the loss function evaluated at the current iterate $w_t$ on some sample $x_t$. Notice that every iterate $w_{t+1}$ is a function of the sum of the gradients $s_t = \sum_{i=1}^{t}\nabla f(w_i, x_i)$ up to iteration $t$. Thus, if we use tree-aggregation to privately compute every partial sum $s_{[y,z]} = \sum_{i=y}^{z}\nabla f(w_i, x_i)$, we can then use these partial sums to compute every iterate $w_t$, and $w_t$ is also private by post-processing [Dwork and Roth, 2014]. Similarly, we can use this aggregation technique for SGD with momentum, where we replace $\nabla f(w_t, x_t)$ with the momentum $m_t$. This method of computing the momentum is the main ingredient of our private algorithms. More detailed discussions are provided in Sections 3 and 4. Tree-aggregated momentum was briefly mentioned in the momentum variant of Differentially Private Follow-The-Regularized-Leader (DP-FTRL) in [Kairouz et al., 2021]. However, their algorithm does not directly compose the momentum using tree-aggregation, and they do not provide any formal guarantee for the algorithm.

3 Private Normalized SGD

Before going into the details of our private algorithm, let us first set up some notation that is used frequently in the rest of the paper. We define the set of intervals $\mathcal{I}$ as follows:
$$\mathcal{I} = \big\{[a2^b + 1, (a+1)2^b] \;\big|\; a, b \in \mathbb{Z}_{\ge 0},\; [a2^b + 1, (a+1)2^b] \subseteq \{1, \dots, T\}\big\}.$$
Each interval in $\mathcal{I}$ corresponds to a node of the tree described above (in Fig. 1, the set $\mathcal{I}$ is $\{[1,1], [2,2], [1,2], \dots\}$). Thus, the set of intervals $\mathcal{I}$ is essentially a different way to store the binary tree for the tree-aggregation mechanism. Then, for every interval $[a, b]$, there exist $O(\log T)$ disjoint intervals $[y_1, z_1], \dots, [y_n, z_n] \in \mathcal{I}$ such that $[a, b] = \bigcup_{i=1}^{n}[y_i, z_i]$ [Daniely et al., 2015]. We denote the set of such intervals as COMPOSE(a, b), which can be computed using Algorithm 3.

Now we can proceed to the analysis of the private Normalized SGD algorithm described in Algorithm 1. The main idea of Algorithm 1 is based on two observations. The first observation is that $w_t$ is a function of the momentum. Thus, similarly to SGD, if we can compute the momentum privately, every iterate $w_1, \dots, w_T$ is also private by post-processing. The second observation is that $m_t$ is an exponentially weighted sum of the gradients and can be computed efficiently using tree-aggregation. To see this, notice that the momentum $m_t$ can be written as follows:
$$m_t = (1-\alpha)m_{t-1} + \alpha\nabla f(w_t, x_{\pi^{q_t}_{r_t}}) = \alpha\sum_{i=1}^{t}(1-\alpha)^{t-i}\nabla f(w_i, x_{\pi^{q_i}_{r_i}}) = \sum_{[y,z]\in\mathrm{COMPOSE}(1,t)}(1-\alpha)^{t-z}\,\alpha\sum_{t'=y}^{z}(1-\alpha)^{z-t'}\nabla f(w_{t'}, x_{\pi^{q_{t'}}_{r_{t'}}}),$$
where $x_{\pi^{q_i}_{r_i}}$ denotes the sample drawn at iteration $i = q_i N + r_i$. Now, if we set $f_{[y,z]} = \alpha\sum_{t'=y}^{z}(1-\alpha)^{z-t'}\nabla f(w_{t'}, x_{\pi^{q_{t'}}_{r_{t'}}})$, then:
$$m_t = \sum_{[y,z]\in\mathrm{COMPOSE}(1,t)}(1-\alpha)^{t-z} f_{[y,z]}.$$
Thus, if we view $f_{[y,z]}$ as the value of node $[y, z]$ in the tree, we can compute $m_t$ using at most $O(\log t)$ nodes of the tree, as in the sketch below.
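Below is an illustrative sketch of this decomposition. The `compose` helper is a hypothetical stand-in for the paper's Algorithm 3 (which we do not reproduce), and `tree` is assumed to map each interval in $\mathcal{I}$ to its noisy node value:

```python
def compose(t):
    """COMPOSE(1, t): disjoint dyadic intervals [y, z] in I whose
    union is [1, t]; there are at most O(log t) of them."""
    intervals, z = [], t
    while z > 0:
        width = z & (-z)                # largest power of 2 dividing z
        intervals.append((z - width + 1, z))
        z -= width
    return intervals

def reconstruct_momentum(tree, t, alpha):
    """m_t = sum over [y, z] in COMPOSE(1, t) of (1-alpha)^(t-z) f_[y,z];
    applied to noisy nodes, this is private post-processing."""
    return sum((1 - alpha) ** (t - z) * tree[(y, z)]
               for (y, z) in compose(t))

# e.g. compose(11) == [(11, 11), (9, 10), (1, 8)]
```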
We will also accumulate the noise in the same way. Writing $\zeta_{[y,z]}$ for the noise we need to add to node $[y, z]$ to make $f_{[y,z]}$ private, the noisy momentum $\hat m_t$ is:
$$\hat m_t = \sum_{[y,z]\in\mathrm{COMPOSE}(1,t)}(1-\alpha)^{t-z} f_{[y,z]} + \sum_{[y,z]\in\mathrm{COMPOSE}(1,t)}(1-\alpha)^{t-z}\zeta_{[y,z]} = m_t + \sum_{[y,z]\in\mathrm{COMPOSE}(1,t)}(1-\alpha)^{t-z}\zeta_{[y,z]},$$
which is essentially the update of Algorithm 1. Then, as long as $f_{[y,z]}$ has low sensitivity, the noise we need to add to make $m_t$ private is small, since there are only at most $O(\log t)$ intervals in COMPOSE(1, t).

Algorithm 1 Differentially Private Normalized SGD with momentum (DP-NSGD)
Input: noise parameter $\sigma$, momentum parameter $\alpha = 1 - \beta$, dataset $(x_1, \dots, x_N)$, tree depth $R$, Lipschitz constant $G$, loss function $f(w, x)$, noise map $Z$.
  $m_0 \leftarrow \hat m_0 \leftarrow 0$; NOISE $\leftarrow 0$
  $V \leftarrow (\min(R, \lceil\log_2 N\rceil) + 1)\lceil T/N\rceil + \sum_{j=\lceil\log_2 N\rceil+1}^{R}\lceil T/2^j\rceil$
  for $b = 0, \dots, R-1$ do
    for $a = 0, \dots, \lceil T/2^b\rceil - 1$ do
      Sample $\zeta_{[a2^b+1,(a+1)2^b]} \sim N(0,\ 16\alpha^2 G^2\sigma^2 V I)$
      $Z[a2^b+1, (a+1)2^b] \leftarrow \zeta_{[a2^b+1,(a+1)2^b]}$
    end for
  end for
  for $q = 0, \dots, \lceil T/N\rceil$ do
    Sample a permutation $\pi^q_1, \dots, \pi^q_N$ of $\{1, \dots, N\}$ uniformly at random
    for $i = 1, \dots, \min\{N, T - qN\}$ do
      $t \leftarrow qN + i$
      Send the interval $[1, t]$ to Algorithm 3 to get the set $S = \mathrm{COMPOSE}(1, t)$
      for every interval $S_i \in S$ do
        NOISE $\leftarrow$ NOISE $+\ (1-\alpha)^{t - \mathrm{end}(S_i)} Z[S_i]$  (where $\mathrm{end}(S_i)$ is the end index of interval $S_i$)
      end for
      $m_t \leftarrow (1-\alpha)m_{t-1} + \alpha\nabla f(w_t, x_{\pi^q_i})$
      $\hat m_t \leftarrow m_t + $ NOISE
      $w_{t+1} \leftarrow w_t - \eta\,\hat m_t/\|\hat m_t\|$
      NOISE $\leftarrow 0$
    end for
  end for
  return $w_1, \dots, w_T$

Theorem 4 (Privacy guarantee). Suppose that $f(w, x)$ is $G$-Lipschitz for all $w \in \mathbb{R}^d$, $x \in \mathcal{X}$. Then Algorithm 1 is $\big(z, \frac{z}{2\sigma^2}\big)$-Rényi differentially private for all $z$. Consequently, if $\delta \le \exp(-\epsilon)$, then for $\sigma = O\big(\sqrt{\log(1/\delta)}/\epsilon\big)$, Algorithm 1 is $(\epsilon, \delta)$-differentially private.

Proof sketch of Theorem 4. Write $t = q_t N + r_t$ for any iteration $t$, with $r_t \in [1, N]$ and $q_t \ge 0$. We define:
$$f_{[y,z]} = \alpha\sum_{t'=y}^{z}(1-\alpha)^{z-t'}\nabla f(w_{t'}, x_{\pi^{q_{t'}}_{r_{t'}}}), \qquad m_t = \sum_{[y,z]\in\mathrm{COMPOSE}(1,t)}(1-\alpha)^{t-z} f_{[y,z]}.$$
To compute the sensitivity of $f_{[y,z]}$, recall that the loss function $f$ is $G$-Lipschitz. Further, any given datapoint $x_i$ can contribute at most $\lceil|y - z|/N\rceil$ gradient terms to the summation defining $f_{[y,z]}$. Thus $f_{[y,z]}$ has sensitivity $\Delta = 2\alpha G\sum_{j=0}^{\lceil|y-z|/N\rceil - 1}(1-\alpha)^{jN} \le 4\alpha G$ for $\alpha \ge \frac{1}{N}$ (Proposition 13). Next, we compute the maximum number of intervals whose value depends on any single datapoint $x_i$, $i \in \{1, \dots, N\}$. Given two neighboring datasets $D$ and $D'$, we use $s$ to denote the index of the datapoint that differs between the two datasets. We denote by $S_{[y,z]} \subseteq \{1, \dots, N\}$ the set of indices of the datapoints that contribute to node $[y, z]$ (essentially, which elements of the dataset are relevant to the value of node $[y, z]$). For any $b \le \lceil\log_2 N\rceil$, there are at most $\lceil T/N\rceil$ different $a \in \{0, \dots, \lceil T/2^b\rceil - 1\}$ such that $s \in S_{[a2^b+1,(a+1)2^b]}$. Further, for any $b > \lceil\log_2 N\rceil$, $S_{[a2^b+1,(a+1)2^b]} = \{1, \dots, N\}$ for all $a$. Therefore, for any $i$ there are at most $V = (\min(R, \lceil\log_2 N\rceil) + 1)\lceil T/N\rceil + \sum_{j=\lceil\log_2 N\rceil+1}^{R}\lceil T/2^j\rceil$ sets $S_i$ such that $i \in S_i$. The privacy guarantee then follows via the ordinary composition properties of Rényi differential privacy. The full proof is included in the appendix.

Before proving our utility bound, we need the following lemma.

Lemma 5 (essentially Lemma 2 of Cutkosky and Mehta [2020]). Define $\hat\epsilon_t = \hat m_t - \nabla F(w_t)$. Suppose $w_1, \dots, w_T$ is a sequence of iterates defined by $w_{t+1} = w_t - \eta\,\hat m_t/\|\hat m_t\|$ for some arbitrary sequence $\hat m_1, \dots, \hat m_T$. Pick $\hat w$ uniformly at random from $w_1, \dots, w_T$. Then:
$$\mathbb{E}[\|\nabla F(\hat w)\|] \le \frac{3\,\mathbb{E}[F(w_1) - F(w_{T+1})]}{\eta T} + \frac{8}{T}\sum_{t=1}^{T}\mathbb{E}[\|\hat\epsilon_t\|].$$
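For intuition, here is a simplified sketch of how the pieces above combine into the main loop of Algorithm 1. It is a single-permutation rendering that elides the per-epoch reshuffling and the exact noise calibration; `grad_f` is a placeholder gradient oracle, `noise_tree` is assumed to map each interval in $\mathcal{I}$ to its Gaussian noise $\zeta_{[y,z]}$, and `compose` is the helper sketched earlier:

```python
import numpy as np

def dp_nsgd_sketch(w1, data, grad_f, T, eta, alpha, noise_tree, rng):
    """Illustrative DP-NSGD loop: true momentum plus tree-composed
    noise, then a normalized step. Not the full Algorithm 1."""
    N = len(data)
    w, m = w1.copy(), np.zeros_like(w1)
    order = rng.permutation(N)              # one permutation only
    iterates = []
    for t in range(1, T + 1):
        x = data[order[(t - 1) % N]]
        m = (1 - alpha) * m + alpha * grad_f(w, x)
        noise = sum((1 - alpha) ** (t - z) * noise_tree[(y, z)]
                    for (y, z) in compose(t))
        m_hat = m + noise                   # private by construction
        w = w - eta * m_hat / np.linalg.norm(m_hat)
        iterates.append(w.copy())
    return iterates
```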
Now we are ready to prove the utility guarantee of Algorithm 1.

Theorem 6 (Utility guarantee). Assume $f(w, x)$ is $G$-Lipschitz and $L$-smooth for all $w \in \mathbb{R}^d$, $x \in \mathcal{X}$, and the suboptimality of $F(w_1)$ is bounded by $R$. Then Algorithm 1 with $\eta = \frac{1}{\sqrt{NT}}$, $\alpha = \frac{\epsilon N}{T\log_2 T\sqrt{d\log(1/\delta)}}$, $T = \frac{(\epsilon N)^{3/2}}{\sqrt{d\log(1/\delta)}}$, $\epsilon \le \frac{T\log_2 T\sqrt{d\log(1/\delta)}}{N}$ (so that $\alpha \le 1$), and $\hat w$ picked uniformly at random from $w_1, \dots, w_T$ guarantees:
$$\mathbb{E}[\|\nabla F(\hat w)\|] \le \frac{\big(\sqrt{3\log_2 T\log(1/\delta)} + 12G\sqrt{\log_2 T}\,\log(1/\delta)^{1/4}\big)d^{1/4}}{\sqrt{\epsilon N}} + \frac{3L(d\log(1/\delta))^{1/4}\sqrt{\log_2 T}}{4(\epsilon N)^{3/2}} + \frac{6G}{N} + \frac{6G\log_2 T\sqrt{d\log(1/\delta)}}{\epsilon N}.$$

The analysis of Algorithm 1 largely follows the analysis of non-private Normalized SGD in Cutkosky and Mehta [2020]. We first bound the error between the noisy momentum and the true empirical gradient: $\mathbb{E}[\|\hat m_t - \nabla F(w_t)\|] \le \mathbb{E}[\|\hat m_t - m_t\|] + \mathbb{E}[\|m_t - \nabla F(w_t)\|]$. The first term is bounded by the standard deviation of the added noise. The second term requires a more delicate analysis due to the use of random reshuffling (RR) in our algorithm (which is more common in real-world practice than i.i.d. sampling). RR introduces bias into the gradient estimates: formally, if we define $q_i = \lfloor i/N\rfloor$ and $r_i = i - q_i N$ for every iteration $i \in [T]$, we have $\mathbb{E}[\nabla f(w_i, x_{\pi^{q_i}_{r_i}})] \ne \nabla F(w_i)$. Most known analyses of RR are done for plain SGD [Mishchenko et al., 2020, Nguyen et al., 2019], since the bias becomes an even bigger problem in SGD with momentum due to the accumulative nature of the momentum. To circumvent this problem, we rewrite $m_t$ as $\alpha\sum_{i=1}^{t}(1-\alpha)^{t-i}\nabla f(w_{q_i N}, x_{\pi^{q_i}_{r_i}}) + O(\eta N)$ and use the fact that $\mathbb{E}[\nabla f(w_{q_i N}, x_{\pi^{q_i}_{r_i}})] = \nabla F(w_{q_i N})$, where $w_{q_i N}$ is the iterate at the beginning of the epoch and $x_{\pi^{q_i}_{r_i}}$ is a datapoint sampled in that particular epoch (this holds because $w_{q_i N}$ is independent of any data sampled within the epoch). Then, by applying Jensen's inequality and the definition of the empirical loss, we obtain the bound $\mathbb{E}[\|\hat m_t - \nabla F(w_t)\|] \le O(\sqrt{\alpha} + \eta N + \eta)$. Finally, we apply Lemma 5 with the appropriate hyperparameters to get the utility bound in Theorem 6.

Algorithm 2 Sensitivity-reduced Normalized SGD
Input: initial point $w_1$, training set $x_1, \dots, x_N$, learning rate $\eta$, momentum parameter $\alpha = 1 - \beta$, cross-batch parameter $\gamma$, Lipschitz constant $G$, smoothness constant $L$, noise variance $\sigma^2$.
  Set $V_{\le 2N} = 3\log_2 N$, $V_{>2N} = \frac{4T}{N}\log_2 N$
  Set $\delta_G = 4G$, $\delta_\Delta = \delta_r = 2\eta N$
  for $q = 0, \dots, \lceil T/N\rceil$ do
    Sample a permutation $\pi^q_1, \dots, \pi^q_N$ of $\{1, \dots, N\}$ uniformly at random
    for $r = 0, \dots, \min\{N-1, T - qN\}$ do
      $t \leftarrow qN + r$
      for $k = 0, \dots, \lfloor\log_2 t\rfloor$ do
        if $t \bmod 2^k = 0$ then
          if $t \le 2N$ then
            Sample $\zeta^G_t \sim N(0, \delta_G^2\sigma^2 V_{\le 2N} I)$
          else
            Sample $\zeta^G_t \sim N(0, \delta_G^2\sigma^2 V_{>2N} I)$
          end if
          Sample $\zeta^r_t \sim N(0, \delta_r^2\sigma^2 V_{>2N} I)$
          $\hat F^G_{[t-2^k+1,t]} = \sum_{i=t-2^k+1}^{t}(1-\alpha)^{t-i}\nabla f(w_{qN}, x_{\pi^q_{r_i}}) + \zeta^G_t$
          $\hat F^r_{[t-2^k+1,t]} = \sum_{i=t-2^k+1}^{t}(1-\alpha)^{t-i}\big(\nabla f(w_i, x_{\pi^q_{r_i}}) - \nabla f(w_{qN}, x_{\pi^q_{r_i}})\big) + \zeta^r_t$
          if $t - 2^k + 1 > N$ then
            Sample $\zeta^\Delta_t \sim N(0, \delta_\Delta^2\sigma^2 V_{>2N} I)$
            $\hat F^\Delta_{[t-2^k+1,t]} = \sum_{i=t-2^k+1}^{t}(1-\alpha)^{t-i}\big(\nabla f(w_{qN}, x_{\pi^q_{r_i}}) - \nabla f(w_{(q-1)N}, x_{\pi^q_{r_i}})\big) + \zeta^\Delta_t$
          end if
        end if
      end for
      Compute $\hat m_t$ using the reconstruction algorithm (Algorithm 4)
      $w_t \leftarrow w_{t-1} - \eta\,\hat m_t/\|\hat m_t\|$
    end for
  end for
  Return $w_1, \dots, w_T$

4 Sensitivity Reduced Normalized SGD

In this section, we devise an algorithm with an asymptotically tighter bound than $O\!\left(\frac{d^{1/4}}{\sqrt{\epsilon N}}\right)$, the best previously known bound for $(\epsilon, \delta)$-DP non-convex ERM. One natural direction is to use a better algorithm for non-convex optimization than SGD. It is known that in non-convex optimization on smooth losses, the optimal rate for SGD is $O(1/N^{1/4})$ [Arjevani et al., 2019], while more advanced methods that use variance reduction [Fang et al., 2018, Zhou et al., 2018, Tran-Dinh et al., 2019] or Hessian-vector products [Arjevani et al., 2020, Tran and Cutkosky, 2021] can achieve the rate $O(1/N^{1/3})$. Thus, one could hope that these methods would improve the bound for private non-convex ERM.
Unfortunately, this does not seem to be the case. For example, Wang et al. [2019b] propose a variant of the variance-reduction-based stochastic recursive momentum (STORM) [Cutkosky and Orabona, 2019] called DP-SRM, but their algorithm shows little asymptotic improvement over previously known bounds. DP-SRM improves the error bound by a factor of $(\log(n/\delta))^{1/4}$, but this appears to be the result of subsampling [Abadi et al., 2016] rather than variance reduction. We therefore explore an alternative direction for improving the error bound. Instead of using a better non-convex optimization method, we reuse the Normalized SGD of Section 3, but with a tighter privacy analysis that requires us to add less noise.

The motivation for our method comes from full-batch SGD. Consider the following full-batch normalized gradient descent update:
$$w_{t+1} = w_t - \eta\,\frac{g_t}{\|g_t\|}, \qquad g_t = \nabla F(w_t).$$
Notice that (for $t > 1$) $g_t = g_{t-1} + \nabla F(w_t) - \nabla F(w_{t-1})$. Now, given $g_{t-1}$, the sensitivity of $g_t$ is bounded by the sensitivity of $\nabla F(w_t) - \nabla F(w_{t-1})$, which in turn is only $O(\eta/N)$ due to smoothness and the fact that $\|w_t - w_{t-1}\| = \eta$. This is much smaller than the naive bound of $O(1/N)$ since $\eta \ll 1$. We can iterate this idea to write $g_t = \nabla F(w_1) + \sum_{i=2}^{t}\nabla F(w_i) - \nabla F(w_{i-1})$. In this way, each $g_t$ is a partial sum of terms with sensitivity $O(\eta/N)$, which can be privately estimated with an error of $O(\eta/N\epsilon)$ using tree-aggregation. The main intuition here is that, except for the first iterate, each iterate is close to the previous iterate due to smoothness; that is, $\nabla F(w_t) - \nabla F(w_{t-1})$ has low sensitivity. In fact, we can further control the sensitivity of $g_t$ by expressing $g_t = (1-\gamma)(g_{t-1} + \nabla F(w_t) - \nabla F(w_{t-1})) + \gamma\nabla F(w_t)$ for some parameter $\gamma < 1$. With an appropriate value of $\gamma$, $g_t$ can have even lower sensitivity than in the case $\gamma = 0$ discussed above.

Moving beyond the full-batch case, we would like to replace the full-batch gradients $\nabla F(w_t)$ in this argument with momentum estimates $m_t$. Unfortunately, it is not the case that $m_t - m_{t-1}$ has low sensitivity. Instead, notice that the distance between the two iterates at which the same sample is evaluated in two consecutive epochs is at most $\eta N \le 1$ for appropriate $\eta$, and we can show that the sensitivity of $m_t - m_{t-N}$ scales with this same distance factor. Our method, described in Algorithm 2, follows this observation to compute low-sensitivity momentum queries. To see this, let us rewrite the momentum $m_t$ of Algorithm 1:
$$m_t = (1-\alpha)m_{t-1} + \alpha\nabla f(w_t, x_{\pi^{q_t}_{r_t}}) = \alpha\sum_{i=1}^{t}(1-\alpha)^{t-i}\nabla f(w_i, x_{\pi^{q_i}_{r_i}}) = \alpha\sum_{i=1}^{t}(1-\alpha)^{t-i}\nabla f(w_{q_i N}, x_{\pi^{q_i}_{r_i}}) + \alpha\sum_{i=1}^{t}(1-\alpha)^{t-i}\big(\nabla f(w_i, x_{\pi^{q_i}_{r_i}}) - \nabla f(w_{q_i N}, x_{\pi^{q_i}_{r_i}})\big). \quad (1)$$
In order to exploit this decomposition, we define
$$G_{[a,b]} = \sum_{t=a}^{b}(1-\alpha)^{b-t}\nabla f(w_{q_t N}, x_{\pi^{q_t}_{r_t}}), \qquad r_{[a,b]} = \sum_{t=a}^{b}(1-\alpha)^{b-t}\big(\nabla f(w_t, x_{\pi^{q_t}_{r_t}}) - \nabla f(w_{q_t N}, x_{\pi^{q_t}_{r_t}})\big),$$
$$\Delta_{[a,b]} = \sum_{t=a}^{b}(1-\alpha)^{b-t}\big(\nabla f(w_{q_t N}, x_{\pi^{q_t}_{r_t}}) - \nabla f(w_{(q_t-1)N}, x_{\pi^{q_t}_{r_t}})\big).$$
Then, (1) states:
$$m_t = \alpha G_{[1,t]} + \alpha r_{[1,t]}. \quad (2)$$
Writing $t = qN + r$, we can further refine this expansion by observing (Lemma 17):
$$G_{[1,t]} = (1-\alpha)^{t-r}G_{[1,r]} + \sum_{i=0}^{q-1}(1-\alpha)^{Ni}(1-\gamma)^{q-(i+1)}G_{[r+1,r+N]} + \sum_{i=1}^{q}\sum_{j=0}^{i-1}(1-\gamma)^{j}(1-\alpha)^{(i-1-j)N}\Big((1-\gamma)\,\Delta_{[t-iN+1,\,t-(i-1)N]} + \gamma\,G_{[t-iN+1,\,t-(i-1)N]}\Big). \quad (3)$$
Thus, in Algorithm 2, we estimate $G_{[a,b]}$, $r_{[a,b]}$ and $\Delta_{[a,b]}$ for all intervals $[a, b]$ using tree-aggregation, and then recombine these estimates using (2) and (3) to estimate $m_t$. The critical observation is that each term of the sums defining $r_{[a,b]}$ and $\Delta_{[a,b]}$ has sensitivity at most $O(\eta N)$, because the normalized update implies $\|w_t - w_{t'}\| \le \eta N$ for all $|t - t'| \le N$, while the $O(1)$-sensitivity terms in $G_{[a,b]}$ are scaled down by $\gamma$. A full-batch sketch of this $\gamma$-mixed recursion follows.
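To make the full-batch intuition concrete, here is a sketch of the $\gamma$-mixed recursion with per-step Gaussian noise. The noise scales `sigma_diff` and `sigma_full` are illustrative placeholders; Algorithm 2 instead aggregates these low-sensitivity differences through the tree so that the noise does not accumulate linearly:

```python
import numpy as np

def sensitivity_reduced_ngd_sketch(w1, grad_F, T, eta, gamma,
                                   sigma_diff, sigma_full, rng):
    """Full-batch normalized GD where g_t is updated via
    g_t = (1-gamma) * (g_{t-1} + grad F(w_t) - grad F(w_{t-1}))
          + gamma * grad F(w_t).
    The difference term has sensitivity O(eta/N) by smoothness, since
    ||w_t - w_{t-1}|| = eta, so it needs much less noise than a fresh
    full gradient does."""
    w = w1.copy()
    prev = grad_F(w)
    g = prev + rng.normal(scale=sigma_full, size=w.shape)
    for _ in range(T):
        w = w - eta * g / np.linalg.norm(g)      # normalized step
        cur = grad_F(w)
        diff = (cur - prev) + rng.normal(scale=sigma_diff, size=w.shape)
        fresh = cur + rng.normal(scale=sigma_full, size=w.shape)
        g = (1 - gamma) * (g + diff) + gamma * fresh
        prev = cur
    return w
```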
However, notice that in the first two epochs we do not have the extra parameter $\gamma$ to help us control the noise added to the momentum. Thus, if we just naively added noise based on the maximum sensitivity count ($V_{>2N}$), the noise would blow up, since it scales with $T/N$. To remedy this, we add one of two different noise magnitudes to $\hat F^G_{[t-2^k+1,t]}$ ($k \in \{0, \dots, \lfloor\log_2 t\rfloor\}$), depending on whether $t \le 2N$. By splitting the sensitivity into two cases, we can make the momentum of the first two epochs $(\epsilon, \delta)$-DP by adding noise of $O(1/\epsilon)$, and that of every later epoch $(\epsilon, \delta)$-DP by adding noise proportional to $\gamma$. Then, the whole procedure is $(O(\epsilon), O(\delta))$-DP overall. Morally speaking, we can think of the first two terms of the update of Algorithm 2 (Eq. (3)) as playing the role of the first iterate in the full-batch example, with some exponentially weighted factors, and the third term as playing the role of the $(1-\gamma)(\nabla F(w_t) - \nabla F(w_{t-1})) + \gamma\nabla F(w_t)$ part. Overall, if we set $\gamma = \eta N$, the noise we need to add to make $m_t$ private (quantified in Lemma 7 below) is smaller than the noise we add in Section 3. Formally, we have the following error bound for the momentum.

Lemma 7. Suppose $\alpha \le \frac{1}{2}$. Using the update of Algorithm 2, we have:
$$\mathbb{E}[\|\hat m_t - m_t\|] \le \frac{192G\alpha\log_2 T\sqrt{d\log(1/\delta)}}{\epsilon} + \frac{128(G + 2L)\alpha\sqrt{\eta d T\log(1/\delta)}\log_2 T}{\epsilon} + \frac{16\eta L\alpha\log_2 T\sqrt{3dTN\log(1/\delta)}}{\epsilon}.$$
To prove Lemma 7, we sum up the variance of the Gaussian noise that must be added to make every term in Eq. (3) and $\hat r_t$ private. Then $\mathbb{E}[\|\hat m_t - m_t\|] \le \sqrt{d\cdot\mathrm{VAR}}$, where VAR is the total noise variance and $d$ is the dimension of the noise vector.

Theorem 8 (Privacy guarantee). Suppose $\alpha \ge 1/N$ and $f$ is $G$-Lipschitz and $L$-smooth. Then Algorithm 2 is $(z, 3z/2\sigma^2)$-Rényi differentially private for all $z$. With $\sigma \ge \frac{4\sqrt{\log(1/\delta)}}{\epsilon}$, Algorithm 2 is $(\epsilon, \delta)$-DP.

The proof of Theorem 8 follows the proof of tree-aggregated momentum with Rényi differential privacy in Section 3. We defer the proof to the appendix. Now we give the utility bound for Algorithm 2.

Theorem 9 (Utility guarantee). Assume $f(w, x)$ is $G$-Lipschitz and $L$-smooth for all $w \in \mathbb{R}^d$, $x \in \mathcal{X}$, and the suboptimality of $F(w_1)$ is bounded by $R$. Then Algorithm 2 with $\gamma = \eta N$, $\eta = \frac{1}{\sqrt{NT}}$, $\alpha = \frac{N^{3/4}\epsilon}{T^{3/4}\sqrt{d}}$, $T = \frac{N^{7/3}\epsilon^{4/3}}{d^{2/3}}$, and $\hat w$ picked uniformly at random from $w_1, \dots, w_T$ guarantees:
$$\mathbb{E}[\|\nabla F(\hat w)\|] \le \frac{\big(\sqrt{3\cdot 2R} + 24K(G + 2L) + 12L\big)d^{1/3}}{(\epsilon N)^{2/3}} + 6G\left(\frac{1}{(\epsilon N)^{2/3}} + \frac{1}{N}\right),$$
where $K = 16\log_2 T\log(1/\delta)$.

The analysis of Algorithm 2 is similar to that of Algorithm 1. We still bound the error between the noisy momentum and the true empirical gradient as $\mathbb{E}[\|\hat m_t - \nabla F(w_t)\|] \le \mathbb{E}[\|\hat m_t - m_t\|] + \mathbb{E}[\|m_t - \nabla F(w_t)\|]$. The term $\mathbb{E}[\|m_t - \nabla F(w_t)\|]$ is the same as in Theorem 6, but now the noise bound $\mathbb{E}[\|\hat m_t - m_t\|]$ is only $O\!\left(\frac{\alpha\sqrt{d}}{\epsilon} + \frac{\alpha\sqrt{\eta d T}}{\epsilon}\right)$, instead of the larger bound of Section 3, due to our sensitivity-reduced analysis. We then apply Lemma 5 to get our final utility guarantee. The full proof is provided in the appendix.

5 Conclusions

In this paper, we present two new private algorithms for non-convex ERM. Both are variants of Normalized SGD with momentum, and both utilize the tree-aggregation method to privately compute the momentum. Our first algorithm removes the large-batch-size and privacy-amplification requirements of previous works while still achieving the state-of-the-art asymptotic bound $O\!\left(\frac{d^{1/4}}{\sqrt{\epsilon N}}\right)$. Our second algorithm uses the insight from full-batch SGD on smooth losses to reduce the sensitivity of the momentum. This allows the algorithm to achieve the utility bound $O\!\left(\frac{d^{1/3}}{(\epsilon N)^{2/3}}\right)$, which, to our knowledge, is the best known bound for private non-convex ERM.

Limitations. There are several limitations that one could explore to further improve the results of this paper.
The most natural direction is to incorporate privacy amplification by shuffling into both algorithms. Based on the current analysis, it is not obvious that privacy amplification applies immediately to tree-aggregated momentum, but there may be a clever way to make the analysis work. It has been shown in previous work that privacy amplification by shuffling does help achieve the $O\!\left(\frac{d^{1/4}}{\sqrt{\epsilon N}}\right)$ utility bound. Therefore, it is our hope that by combining privacy amplification with our sensitivity-reduced algorithm, one could achieve an even better utility bound than $O\!\left(\frac{d^{1/3}}{(\epsilon N)^{2/3}}\right)$. Furthermore, even though our first algorithm has a runtime comparable to previous works, the second algorithm is somewhat less ideal: it requires $O\!\left(\frac{N^{7/3}\epsilon^{4/3}}{d^{2/3}}\right)$ gradient evaluations but has a total runtime of $O(N^{11/3}/d^{2/3})$ due to the extra work of the compose and reconstruction algorithms. We would like to reduce this to quadratic runtime. We leave these directions for future research.

References

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. In Conference on Learning Theory, pages 242–299, 2020.
Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, and Kunal Talwar. Private adaptive gradient methods for convex optimization. In International Conference on Machine Learning, pages 383–392. PMLR, 2021.
Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 181–190, New York, NY, USA, 2007. Association for Computing Machinery. doi: 10.1145/1242572.1242598. URL https://doi.org/10.1145/1242572.1242598.
Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in Neural Information Processing Systems, 31, 2018.
Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 11282–11291, 2019.
Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. Advances in Neural Information Processing Systems, 34, 2021a.
Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. In Conference on Learning Theory, pages 474–499. PMLR, 2021b.
T-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics.
ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.
Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.
Zhaoyi Chen, Xiong Liu, William Hogan, Elizabeth Shenkman, and Jiang Bian. Applications of artificial intelligence in drug development using real-world data. Drug Discovery Today, 26(5):1256–1264, 2021.
Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In International Conference on Machine Learning, 2020.
Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pages 15210–15219, 2019.
Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411. PMLR, 2015.
Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '03, pages 202–210, New York, NY, USA, 2003. Association for Computing Machinery. doi: 10.1145/773153.773173. URL https://doi.org/10.1145/773153.773173.
Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC '06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag. doi: 10.1007/11681878_14. URL https://doi.org/10.1007/11681878_14.
Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 715–724, 2010.
Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521–532. IEEE, 2018.
Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
Abhradeep Guha Thakurta and Adam Smith. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26:2733–2741, 2013.
Jianxing He, Sally Baxter, Jie Xu, Jiming Xu, Xingtao Zhou, and Kang Zhang. The practical implementation of artificial intelligence technologies in medicine. Nature Medicine, 25, 2019. doi: 10.1038/s41591-018-0307-0.
Roger Iyengar, Joseph P. Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP), pages 299–316, 2019. doi: 10.1109/SP.2019.00001.
Bargav Jayaraman, Lingxiao Wang, David Evans, and Quanquan Gu.
Distributed learning without distress: Privacy-preserving empirical risk minimization. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/7221e5c8ec6b08ef6d3f9ff3ce6eb1d1-Paper.pdf.
Fei Jiang, Yong Jiang, Hui Zhi, Yi Dong, Hao Li, Sufeng Ma, Yilong Wang, Qiang Dong, Haipeng Shen, and Yongjun Wang. Artificial intelligence in healthcare: Past, present and future. Stroke and Vascular Neurology, 2(4):230–243, 2017. doi: 10.1136/svn-2017-000101. URL https://svn.bmj.com/content/2/4/230.
Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. arXiv preprint arXiv:2103.00039, 2021.
Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 25.1–25.40, Edinburgh, Scotland, 2012. PMLR. URL https://proceedings.mlr.press/v23/kifer12.html.
Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps. arXiv preprint arXiv:2103.15352, 2021.
Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. arXiv preprint arXiv:2006.05988, 2020.
Phuong Ha Nguyen, Lam M. Nguyen, and Marten van Dijk. Tight dimension independent lower bound on the expected convergence rate for diminishing step sizes in SGD. Advances in Neural Information Processing Systems, 32, 2019.
Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.
Hoang Tran and Ashok Cutkosky. Better SGD using second-order momentum. arXiv preprint arXiv:2103.03265, 2021.
Quoc Tran-Dinh, Nhan H. Pham, Dzung T. Phan, and Lam M. Nguyen. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, 2019.
Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2719–2728, 2017.
Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. arXiv preprint arXiv:1802.05251, 2018.
Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. In International Conference on Machine Learning, pages 6526–6535. PMLR, 2019a.
Lingxiao Wang, Bargav Jayaraman, David Evans, and Quanquan Gu. Efficient privacy-preserving stochastic nonconvex optimization. arXiv preprint arXiv:1910.13659, 2019b.
Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics.
In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1307–1322, 2017.
Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private ERM for smooth objectives. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3922–3928, 2017.
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3925–3936, 2018.
Yingxue Zhou, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Arindam Banerjee. Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds. arXiv preprint arXiv:2006.13501, 2020.