Published as a conference paper at ICLR 2023

DIFFERENTIALLY PRIVATE ADAPTIVE OPTIMIZATION WITH DELAYED PRECONDITIONERS

Tian Li, Manzil Zaheer, Ziyu Liu, Sashank Reddi, Brendan McMahan, Virginia Smith
Carnegie Mellon University, Google DeepMind, Google Research
{litian,ziyuliu,smithv}@cs.cmu.edu, {manzilzaheer,sashank,mcmahan}@google.com

Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.

1 INTRODUCTION

Adaptive optimizers such as AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010) and RMSProp (Hinton et al., 2012) are commonly used to improve convergence speed in machine learning training. However, in privacy-sensitive applications, the benefits of adaptivity may degrade as a result of noise added to the preconditioners to guarantee differential privacy (Li et al., 2022). Prior works typically address this issue by using non-sensitive auxiliary data to approximate the underlying structures of private gradients (Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022). While this can boost performance, assuming access to informative public data may be unrealistic in many privacy-sensitive applications. In this work, we instead ask: Can we improve privacy/utility trade-offs in private adaptive optimization without accessing auxiliary data?

Figure 1: Preconditioner values do not change drastically during optimization (IMDB dataset; distributions of preconditioner values at epochs 10, 25, and 40).

A key insight we have in addressing this question is that for many machine learning problems, the gradient geometry may not change drastically during successive steps of optimization (e.g., see Figure 1, which plots successive distributions of preconditioner values). This presents an opportunity to estimate the preconditioners used by adaptive optimizers with smaller noise, by averaging across previous iterates. To this end, we propose DP2, a differentially private adaptive method that uses historical gradients to construct delayed preconditioners with reduced noise. Despite the simplicity of this approach, we find that it can significantly improve performance in practice, improving convergence speed by as much as 4× relative to non-adaptive baselines, all without the need to access auxiliary data.
To better understand these performance gains, we theoretically and empirically analyze the method to study the effect of using delayed preconditioners, including the trade-offs that emerge between noise reduction and staleness.

Contributions. We propose DP2 as a method for differentially private adaptive optimization with delayed preconditioners. Unlike prior work, DP2 does not rely on auxiliary data to improve privacy/utility trade-offs in private training. We provide convergence guarantees for DP2 in both convex and non-convex settings, and analyze the trade-offs between delay and privacy noise. We conduct extensive experiments to showcase the effectiveness of DP2, which can significantly improve model utility for a given privacy budget across text and recommendation benchmarks.

2 BACKGROUND AND RELATED WORK

In this section we discuss closely related works and set up some preliminaries. We start by discussing prior work in differentially private optimization, considering the classic framework of (ε, δ)-differential privacy (DP) (Dwork et al., 2006), defined as follows.

Definition 1 (Differential privacy (Dwork et al., 2006)). A randomized algorithm $\mathcal{M}$ is (ε, δ)-differentially private if for all neighboring datasets $D, D'$ differing by one element, and every possible subset of outputs $O$,
$$\Pr\big(\mathcal{M}(D) \in O\big) \le e^{\varepsilon}\,\Pr\big(\mathcal{M}(D') \in O\big) + \delta.$$

Differentially Private SGD. Informally, DP in machine learning offers protection by masking the influence of individual examples (example-level DP, e.g., Abadi et al., 2016; Bassily et al., 2014; Song et al., 2013) or all of the examples from one user (user-level DP, e.g., Kairouz et al., 2021b; McMahan et al., 2018) on the trained model. In this work, we consider example-level DP, using the popular subsampled Gaussian mechanism (Dwork et al., 2014; Mironov et al., 2019) to perturb gradients to ensure DP. Unless much larger batch sizes and possibly larger datasets are used, DP mechanisms often lead to a significant utility drop. Extensive research has thus been devoted to investigating improved privacy/utility/computation trade-offs for DP-SGD, including various training techniques (e.g., data augmentation and large-batch training) (De et al., 2022), leveraging public data (Amid et al., 2022; Zhou et al., 2021), and releasing gradient statistics via tree aggregation to reduce the amount of noise (Chan et al., 2011; Denisov et al., 2022; Kairouz et al., 2021b). These prior works are orthogonal to and could be applied in conjunction with our proposed method, which focuses specifically on privacy in the context of adaptive optimization.

Differentially Private Adaptive Optimization. To reduce privacy cost in iterative DP algorithms, it is natural to consider applying adaptive optimizers (e.g., AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010), RMSProp (Hinton et al., 2012), AMSGrad (Reddi et al., 2018), and Yogi (Zaheer et al., 2018)) to speed up convergence. A straightforward approach is to first privatize mini-batch gradients and then plug the noisy gradients into any adaptive updating rule (Zhou et al., 2020). However, estimating gradient moments in this way may yield preconditioners with too much noise, resulting in adaptive methods that may not have meaningful improvements over DP-SGD (Li et al., 2022).
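For concreteness, below is a minimal sketch of the privatization step shared by DP-SGD and these vanilla DP adaptive methods: clip each per-example gradient, sum, add Gaussian noise, and average. The function and variable names here are ours, not from the paper.

```python
import numpy as np

def privatize_gradients(per_example_grads, clip_norm, noise_multiplier, rng):
    """Subsampled Gaussian mechanism on a mini-batch of per-example gradients.

    per_example_grads: array of shape (b, d) holding raw gradients g_{i,t}.
    Returns a noisy mean gradient whose L2 sensitivity is bounded by clip_norm.
    """
    b, d = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Per-example clipping: rescale each gradient to L2 norm <= clip_norm.
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Gaussian noise calibrated to the clipping threshold.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
    return (clipped.sum(axis=0) + noise) / b

# Example usage:
# rng = np.random.default_rng(0)
# g_tilde = privatize_gradients(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```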
As we discuss in Section 1, more recent works suggest the use of non-sensitive public information to estimate the preconditioners (or other gradient structures) (Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022), which may not always be available in practice. In Section 5.2, we empirically benchmark two baselines along this line of work and demonstrate that DP2 can perform comparably to these state-of-the-art methods, even though it does not require access to auxiliary data. Finally, we note that previous works have explored the high-level direction of delayed preconditioners, but mainly as a compromise for computational considerations in non-private training (Gupta et al., 2018). In this work, we instead show that staleness can be leveraged to improve privacy/utility trade-offs in private adaptive optimization, and we propose and analyze a novel method for delaying preconditioner computation in the context of private training.

Notation. In this work, we consider using adaptive optimization methods to solve the classic empirical risk minimization objective, i.e., $\min_{w} F(w) = \frac{1}{n}\sum_{i=1}^{n} f(x_i; w)$, where $w \in \mathbb{R}^d$ and $\{f(x_i; w)\}_{i\in[n]}$ are individual loss functions on training samples $i \in [n]$. For vectors $u, v \in \mathbb{R}^d$, we use $u + v$ for coordinate-wise addition, and $u/v$ for coordinate-wise division. For any vector $v$, $v_j$ denotes the $j$-th coordinate of $v$; for example, $g^{i,t}_j$ refers to the $j$-th coordinate of gradient $g^{i,t}$. Finally, $|v| \in \mathbb{R}^d$ denotes taking coordinate-wise absolute values, and $\|\cdot\|_M$ denotes the matrix norm defined as $\|x\|_M := \sqrt{\langle x, Mx\rangle}$ for a symmetric and positive definite matrix $M \in \mathbb{R}^{d\times d}$, or for a diagonal matrix with non-negative diagonal entries populated by a vector $M \in \mathbb{R}^d$ (in which case $\|x\|_M = (\sum_{j=1}^{d} M_j x_j^2)^{1/2}$).

3 DP2: DELAYED PRECONDITIONERS FOR DIFFERENTIALLY PRIVATE ADAPTIVE OPTIMIZATION

We now introduce our DP2 framework. While we discuss DP2 in the context of a particular adaptive method (RMSProp), we note that the approach is method-agnostic in that it can generally be applied to any private adaptive optimization method where preconditioners are calculated at each iteration. As an initial step towards understanding the algorithm, we first investigate the effects of delayed preconditioners in non-private training in Section 3.1. We then explain how to apply this idea to construct less noisy preconditioners from prior gradients in private training in Section 3.2.

3.1 DELAYED PRECONDITIONERS IN NON-PRIVATE SETTINGS

Adaptive methods use preconditioners to adapt to gradient geometry, effectively resulting in coordinate-wise learning rates. This can be advantageous for many applications, especially those with sparse gradients or non-uniform stochastic noise (e.g., Hinton et al., 2012; McMahan & Streeter, 2010; Reddi et al., 2021; Zhang et al., 2020). One of the key design choices of DP2 is to update preconditioners less frequently and use the average of past gradients to reduce noise. Our observation is that a wide range of learning problems are tolerant to the staleness of preconditioners. In this subsection, we validate this empirically on the benchmark datasets considered throughout this paper. There are potentially many ways that one could instantiate the idea of delayed preconditioner computation in adaptive optimization. Here we consider a specific algorithm, which is the exact non-private version of our proposed DP2 framework (Algorithm 1) introduced in later sections.
The basic idea is to alternate between s steps of SGD and s steps of an adaptive method (for simplicity we assume RMSProp as the adaptive algorithm), where s is a constant larger than 1. Each time we switch from SGD to RMSProp, we average the s past SGD gradients and use the average to update the preconditioner. The preconditioner is then used in subsequent RMSProp updates (thus being stale). As motivation for DP2, we empirically show that RMSProp with delayed preconditioners achieves almost the same optimization performance as RMSProp (Figure 2).

Figure 2: In non-private training, RMSProp with delayed preconditioners achieves similar training loss as standard RMSProp across all datasets (training loss versus # epochs for SGD, RMSProp, and delayed-preconditioner RMSProp). Final test accuracies are presented in Section 5.1.

This observation provides motivation for our proposed DP2 framework for private training (Section 3.2). As discussed in Section 2, we note that the idea of delayed preconditioning has been briefly discussed in prior work (Gupta et al., 2018) for the purpose of speeding up the computation of adaptive optimization in non-private training. Unlike this prior work, we focus on the goal of reducing noise in private training, propose an alternative method for using stale preconditioners that is more amenable to differential privacy, and analyze our method in both convex and non-convex settings.

3.2 CONSTRUCTING DELAYED PRECONDITIONERS WITH REDUCED NOISE

Without access to public data or other side information, prior works typically update preconditioners based on noisy gradients at each iteration (Zhou et al., 2020). For instance, a natural way to privatize RMSProp is to update the preconditioner $v \in \mathbb{R}^d$ as $v \leftarrow \beta v + (1-\beta)\,(\tilde g)^2$, where $\beta \in (0,1)$ is a moving average constant, and $\tilde g \in \mathbb{R}^d$ is the noisy gradient output by some standard privacy mechanism (e.g., the Gaussian mechanism).¹ However, a drawback to this is that the noise gets accumulated at each iteration, making adaptive methods significantly less effective (Li et al., 2022).

¹We consider the practical diagonal (as opposed to matrix) form of adaptive methods throughout the paper.

Inspired by the observation that many problems are tolerant to the staleness of preconditioners (Figure 2), we propose to update the preconditioners less frequently to reduce noise. For instance, we update v every s steps using some aggregate function of the s most recent private gradients from DP-SGD. During iterations where v is not updated, we simply apply the most recent (stale) v to precondition the gradients. In order to mitigate the noise, we average over these s gradients to form a pseudo-gradient $\bar g$, which can be plugged into arbitrary adaptive optimization algorithms. Note that the privacy noise variance will be reduced by a factor of s if we average s Gaussian random variables (i.e., the DP noise).
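To make this variance-reduction claim concrete: if each private gradient carries independent Gaussian noise $N^{t_k} \sim \mathcal{N}\big(0, \frac{\sigma^2 C^2}{b^2} I_d\big)$ (as in the Gaussian mechanism with clipping threshold C, noise multiplier σ, and batch size b), then averaging s of them gives

$$\mathbb{E}\bigg[\Big\|\frac{1}{s}\sum_{k=1}^{s} N^{t_k}\Big\|_2^2\bigg] = \frac{1}{s^2}\sum_{k=1}^{s}\mathbb{E}\big[\|N^{t_k}\|_2^2\big] = \frac{d\,\sigma^2 C^2}{s\,b^2},$$

so each coordinate of the averaged noise has variance $\frac{\sigma^2 C^2}{s b^2}$, a factor of s smaller than for a single private gradient; this is exactly the noise term that appears in Lemma 1 (Appendix A).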
Algorithm 1: DP2-RMSProp: Delayed Preconditioners for Differentially Private RMSProp

Input: number of iterations T, batch size b, noise multiplier σ, clipping threshold C, initial model $w^0 \in \mathbb{R}^d$, $v = 0$, constant $\epsilon \in \mathbb{R}^+$, learning rate schedule $\alpha^t$, moving average parameter β, SGD cumulative aggregation steps $s_1$, RMSProp cumulative steps $s_2$

1: for $t = 0, \ldots, T-1$ do
2:   if $t \bmod (s_1+s_2) = 0$ then
3:     Reset accumulator $G^t \leftarrow 0$
4:   if $t \bmod (s_1+s_2) = s_1$ then
5:     Update moment estimates as $v \leftarrow \beta v + (1-\beta)\,(G^t/s_1)^2$
6:     Reset accumulator $G^t \leftarrow 0$
7:   Uniformly randomly sample a mini-batch B of size b from the private training data
8:   Get individual gradients for each sample $i \in B$: $g^{i,t} \leftarrow \nabla f(x_i; w^t)$
9:   Privatize the (preconditioned) gradients using the Gaussian mechanism:
$$\tilde g^t \leftarrow \frac{1}{b}\bigg(\sum_{i\in B} \mathrm{clip}\Big(\frac{g^{i,t}}{D^t},\, C\Big) + \mathcal{N}\big(0,\, \sigma^2 C^2\big)\bigg), \qquad \text{where } D^t \leftarrow \begin{cases} 1 & \text{if } t \bmod (s_1+s_2) < s_1, \\ \sqrt{v} + \epsilon & \text{otherwise.} \end{cases}$$
10:  Accumulate the private gradients $\tilde g^t$: $G^{t+1} \leftarrow G^t + \tilde g^t$
11:  Update model parameters w: $w^{t+1} \leftarrow w^t - \alpha^t\, \tilde g^t$
12: return $w^T$

DP2 is summarized in Algorithm 1. For simplicity of presentation, we assume RMSProp as the adaptive method (denoted as DP2-RMSProp) throughout this section. However, our framework can be generally applied to other common adaptive methods (see Appendices C.3 and D). The high-level idea is to alternate between $s_1$ steps of private SGD and $s_2$ private RMSProp steps, and use averages of $s_1$ SGD gradients (i.e., the average of the accumulator $G \in \mathbb{R}^d$) to update the preconditioner v. Next, we discuss some key components of our algorithm.

Order of privatization and preconditioning. Given a private preconditioner v, there are generally two choices to perform adaptive optimization over the raw gradients $\{g^{i,t}\}_{i\in B}$ generated from mini-batch B at the t-th iteration.

1. First privatize gradients with clipping threshold $C_1$, then precondition the noisy gradient with $\sqrt{v}+\epsilon$, where $\epsilon$ is a small constant:
$$\tilde g^t = \frac{1}{\sqrt{v}+\epsilon}\cdot\frac{1}{b}\bigg(\sum_{i\in B}\mathrm{clip}\big(g^{i,t},\, C_1\big) + \mathcal{N}\big(0,\,\sigma^2 C_1^2\big)\bigg).$$

2. First precondition gradients with $\sqrt{v}+\epsilon$, then privatize the output with clipping threshold $C_2$:
$$\tilde g^t = \frac{1}{b}\bigg(\sum_{i\in B}\mathrm{clip}\Big(\frac{g^{i,t}}{\sqrt{v}+\epsilon},\, C_2\Big) + \mathcal{N}\big(0,\,\sigma^2 C_2^2\big)\bigg).$$

The difference is that the privacy noise in the first choice may be scaled in an undesired direction: dividing $\mathcal{N}(0, \sigma^2 C_1^2)$ by $\sqrt{v}+\epsilon$ with a less noisy estimated v (perfect estimation removing all privacy noise in the extreme case) would amplify the noise $\mathcal{N}(0, \sigma^2 C_1^2)$ on informative coordinates (i.e., coordinates with smaller preconditioner values), which is consistent with the argument made in Li et al. (2022). We empirically compare the two options and show that the latter gives better performance (Section 5.3).

It is critical to average noisy gradients to construct a cleaner estimate of the preconditioner (Lines 5 and 10 in Algorithm 1) and to apply it for adaptive optimization (Line 9). As these two steps access raw gradients twice, we would need to privatize them separately, and unfortunately the privacy budget would accumulate with each query to the raw training data. Hence, we use the private SGD gradients for both the model update and the preconditioner estimation. This results in a hybrid method that alternates between private SGD and private adaptive optimization steps. Note that to get an unbiased estimate of the true delayed preconditioners, we can correct the bias in $(G^t/s_1)^2$ (Line 5) by subtracting the privacy noise variance term $\frac{\sigma^2 C^2}{s_1 b^2}$ from $(G^t/s_1)^2$. But this value is usually very small and negligible in practice.
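Putting the pieces together, the following is a minimal NumPy sketch of one iteration of Algorithm 1 (the function shape and variable names are ours; the paper's implementation uses JAX with auto-vectorized per-example operations). Looping this step over t with a learning-rate schedule reproduces the alternating private SGD / private RMSProp pattern analyzed in Section 4.

```python
import numpy as np

def dp2_rmsprop_step(t, w, v, G, per_example_grads, *, s1, s2, lr, beta,
                     clip, sigma, eps, rng):
    """One iteration of Algorithm 1 (a sketch; names are ours).

    per_example_grads: (b, d) array of raw gradients g_{i,t}.
    Gradients are preconditioned *before* privatization (choice 2 above):
    D_t = 1 during SGD phases and sqrt(v) + eps during RMSProp phases.
    """
    b = per_example_grads.shape[0]
    cycle = t % (s1 + s2)
    if cycle == 0:                                  # Line 3: reset accumulator
        G = np.zeros_like(w)
    if cycle == s1:                                 # Line 5: delayed preconditioner
        v = beta * v + (1 - beta) * (G / s1) ** 2
        # Optional bias correction mentioned above (usually negligible):
        # v -= (1 - beta) * sigma**2 * clip**2 / (s1 * b**2)
        G = np.zeros_like(w)                        # Line 6: reset accumulator
    D = np.ones_like(w) if cycle < s1 else np.sqrt(v) + eps
    pg = per_example_grads / D                      # precondition, then privatize
    norms = np.linalg.norm(pg, axis=1, keepdims=True)
    clipped = pg * np.minimum(1.0, clip / (norms + 1e-12))
    g_tilde = (clipped.sum(axis=0)
               + rng.normal(0.0, sigma * clip, size=w.shape)) / b  # Line 9
    G = G + g_tilde                                 # Line 10: accumulate
    w = w - lr * g_tilde                            # Line 11: model update
    return w, v, G
```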
While in principle the non-adaptive and adaptive updates can take different numbers of consecutive iterations, in our empirical evaluation we simply set $s_1 = s_2$ and find that this works reasonably well across all datasets (Section 5).

Privacy guarantees. From Algorithm 1, we see that at each iteration we access the raw data and pass them through the privacy barrier once (Line 9) to generate private gradients $\tilde g^t$ with the same noise multiplier σ and batch size b, and the preconditioner only accumulates already differentially private gradients. Since the final model is a composition of these private releases (noisy gradients), Algorithm 1 (or DP2 in general) achieves the same privacy guarantees as standard DP-SGD training under the same training settings. For completeness, we formally state the privacy guarantee below.

Theorem 1 (Privacy guarantee of Algorithm 1 (Abadi et al., 2016)). There exist constants $c_1$ and $c_2$ such that for any $\varepsilon < c_1 b^2 T/n^2$, Algorithm 1 is (ε, δ)-differentially private for any δ > 0 if $\sigma \ge c_2\, \frac{b\sqrt{T\log(1/\delta)}}{n\varepsilon}$.

In practice, we use the Rényi differential privacy (RDP) accountant for the subsampled Gaussian mechanism (Mironov et al., 2019) to compute the actual ε's reported in the experiments (Section 5).

4 CONVERGENCE ANALYSIS

In this section, we analyze Algorithm 1 for both convex and non-convex problems. We aim to study the convergence properties of DP2 and investigate the trade-offs between delay and privacy noise. In doing so, key challenges are introduced by the alternation between adaptive and non-adaptive updates and by the staleness of the preconditioners.

4.1 CONVEX CASES

For convex functions, we define the optimal model $w^*$ as $w^* \in \arg\min_w F(w)$. First we state some assumptions (apart from convexity) that are used in the analysis.

Assumption 1. There exists a constant R such that $\|w^t - w^*\|_2 \le R$ for any iteration t.

Assumption 2 (Bounded stochastic gradient norm). There exists a constant C such that $\|g^{i,t}\|_2 \le C$ for any $i \in [n]$ and iteration t.

Assumption 1 (bounded domain across all iterations) is commonly used in the adaptive optimization literature (Asi et al., 2021; Levy et al., 2018; Li et al., 2022; Reddi et al., 2018). Assumption 2 bounds the L2 norm of the stochastic gradient, thus helping bound the L2 sensitivity of calculating and averaging individual gradients from a mini-batch. Assuming a bounded stochastic gradient norm is standard in prior works on convex and non-convex private optimization (e.g., Kairouz et al., 2021a; Li et al., 2022; Zhou et al., 2020). Under this assumption, supposing that clipping does not happen, we have $\tilde g^t \sim \bar g^t + \mathcal{N}(0, \sigma^2 C^2/b^2)$, where $\bar g^t := \frac{1}{b}\sum_{i\in B} g^{i,t}$. Without loss of generality, let $s_1 = s_2 = s$ in Algorithm 1. Our main convergence result is as follows (assuming t starts from 1).

Theorem 2 (Convergence of Algorithm 1 for convex problems). Let Assumptions 1 and 2 hold, and assume F is a convex function. Let the learning rate be set as $\alpha^t = \alpha/\sqrt{t}$. After running Algorithm 1 for T iterations with $s = \upsilon T$ for a small constant $\upsilon \in (0, 1]$, we obtain

$$\min_{t\in[T]}\ \mathbb{E}\,F(w^t) - F(w^*) \le \frac{R^2+\kappa}{2\alpha\sqrt{T}}\sum_{t\in\mathcal{T}_\upsilon}\mathbb{E}\big[\|D^t\|_1\big] + \frac{1}{\alpha T}\sum_{t=1}^{T}\frac{\sqrt{t+2\upsilon T}+\sqrt{t+\upsilon T}}{\sqrt{t}}\,\alpha^2\,\mathbb{E}\big[\|N^t\|^2_{D^t}\big],$$

where $\mathcal{T}_\upsilon$ denotes the iteration indices where we switch from private RMSProp steps to private SGD steps, plus the last iteration, with cardinality $|\mathcal{T}_\upsilon| = \frac{1}{2\upsilon}$; $N^t \sim \mathcal{N}(0, \sigma^2 C^2/b^2)$; and

$$\kappa \ge \max\Big\{\alpha^2 C^2,\ \frac{C\,h(s)}{\epsilon\sqrt{1-\beta}}\Big\}, \qquad \alpha = \min\Big\{\epsilon,\ \frac{1}{\sqrt{M}+\epsilon},\ 1\Big\}, \qquad M := C^2 + \frac{\sigma^2 C^2}{s b^2}.$$

We defer all proofs to Appendix A and state simplified convergence results in Corollary 1.
As we can see, the above upper bound relies on a critical quantity h(s), which is related to temporal gradient similarity and the amount of staleness s, formally defined as

$$h(s) := \max_{t\in[T]}\ \frac{\mathbb{E}\big[\|\bar g^t\|_1\big]}{\mathbb{E}\big[\big\|G^{\lfloor t/s\rfloor s}/s\big\|_1\big] + d\epsilon},$$

where the expectation is taken with respect to all randomness in the algorithm, and $G^{\lfloor t/s\rfloor s} \in \mathbb{R}^d$ refers to the latest accumulator that was used to update v (Line 5 in Algorithm 1). A smaller h(s) indicates better convergence. The denominator of h(s) can be decomposed into the average of past raw gradients and the average of random Gaussian noise. Intuitively, h(s) tends to be smaller when the gradients across the s iterations in $G^{\lfloor t/s\rfloor s}$ are more similar to the current gradient $\bar g^t$ in terms of gradient norms. In Appendix A.2, we show that an upper bound on h(s) can be expressed as $c_1 + c_2 s$, where $c_1, c_2$ are two constants. We also visualize the value of h(s) on the IMDB dataset in Figure 3, and observe that (1) the values of h(s) are consistently small across all delays, and (2) h(s) increases as s gets larger, which is consistent with the $c_1 + c_2 s$ expression.

Figure 3: Visualization of h(s) versus the delay s on IMDB.

Trade-offs between delay and noise. Here we discuss how s affects convergence based on our analysis. Intuitively, a larger s (larger delay) results in staler preconditioners, but introduces less noise due to private gradient averaging. In our convergence bound, there are several terms that depend on s (or υ). Although this makes it difficult to derive a closed-form characterization of an optimal s, we can analyze the effects of s in simplified settings. In particular, examining the first term on the RHS of the convergence bound, let $\alpha = \frac{1}{\sqrt{c_3 + c_4/\upsilon} + \epsilon}$ (where $c_3, c_4$ are two constants, following $M = C^2 + \sigma^2 C^2/(\upsilon T b^2)$), and assume $\frac{1}{2\upsilon}$ is an integer so that $|\mathcal{T}_\upsilon| = \frac{1}{2\upsilon}$. Combined with h(s), the dependence on υ can be expressed as $(c_1 + c_2\upsilon)\sqrt{c_3 + c_4/(2\upsilon)}$. This suggests that there exists an optimal υ that achieves the minimal value. In Section 5.1, we empirically study the effects of s across real-world datasets, and demonstrate that there exist specific ranges of s that provide favorable trade-offs between delay and noise (Figure 6).

Corollary 1. Let Assumptions 1 and 2 hold, and assume F is a convex function. Ignoring constants, the convergence rate under the learning rate $\alpha^t = O(1/\sqrt{t})$ simplifies to

$$\min_{t\in[T]}\ \mathbb{E}[F(w^t)] - F(w^*) \le O\Big(\frac{1}{\sqrt{T}}\Big)\max_{t\in\mathcal{T}_s}\mathbb{E}\big[\|D^t\|_1\big] + O\bigg(\frac{1}{T}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\,\mathbb{E}\big[\|N^t\|^2_{D^t}\big]\bigg),$$

where $\mathcal{T}_s$ denotes the iteration indices where we switch from private RMSProp steps to private SGD steps, plus the last iteration (thus having a constant cardinality), and $N^t \sim \mathcal{N}(0, \sigma^2 C^2/b^2)$.

At a high level, the first term is due to adaptive optimization using RMSProp, and the second term corresponds to the added privacy noise. Our $O(1/\sqrt{T})$ rate is the same as previous results for SGD (or DP-SGD) in convex cases with decaying learning rates (Bassily et al., 2014; Nemirovski et al., 2009). Compared with DP-SGD, the added privacy noise is reduced from $\frac{1}{T}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\mathbb{E}[\|N^t\|^2]$ to $\frac{1}{T}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\mathbb{E}[\|N^t\|^2_{D^t}]$ when the gradients are sparse (so that $\|D^t\|_1 < d$ in adaptive iterations). Hence, this theorem suggests a constant improvement relative to DP-SGD when we switch a constant number of times.

4.2 NON-CONVEX CASES

We make the following additional assumptions that are common in non-convex convergence analyses.

Assumption 3 (Smoothness). Each $f(x_i; w)$ ($i \in [n]$) is L-smooth with respect to $w \in \mathbb{R}^d$.

Assumption 4 (Bounded stochastic gradient variance). $\mathbb{E}\big[\|g^{i,t} - \mathbb{E}[g^{i,t}]\|_2^2\big] \le \tau^2$ for all i, t.
Theorem 3 (Convergence of Algorithm 1 for non-convex problems). Let Assumptions 1–4 hold, and define the constant $M := C^2 + \frac{\sigma^2 C^2}{s b^2}$. Under any delay parameter s, after running Algorithm 1 with constant learning rates $\alpha^t = \alpha$ such that $L\alpha/\epsilon \le 1$, we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le \frac{2(\sqrt{M}+1)F(w^1)}{\alpha T} + 2\alpha L\big(\sqrt{M}+1\big)\bigg(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2\epsilon^2 b^2}\bigg).$$

The proof is deferred to Appendix B. Compared with Theorem 2, here we do not have constraints on s. Note that to guarantee (ε, δ)-DP after running T iterations, we can set $\sigma^2 = O\big(\frac{b^2 T\log(1/\delta)}{n^2\varepsilon^2}\big)$ and $T = O\big(\frac{n\varepsilon}{b\sqrt{d\log(1/\delta)}}\big)$, to arrive at a convergence bound of $O\big(\frac{\sqrt{d\log(1/\delta)}}{n\varepsilon}\big)$. Under any s, our rate (with and without noise) is the same as previous results on DP-SGD and (DP) adaptive methods for non-convex problems (Li et al., 2022; Zaheer et al., 2018). We note that our non-convex analysis does not directly highlight the benefits of adaptivity or the trade-offs around s; hence the optimal choice of s according to this result is s = T, which maximizes the reduction of privacy noise. However, the practical performance can be better than the upper bound derived here, as shown in our experiments (Section 5). Most of the previous works studying stochastic non-convex adaptive optimization do not prove improvements relative to SGD (e.g., Alacaoglu et al., 2020; De et al., 2018; Ward et al., 2020; Zaheer et al., 2018). It remains an open problem to rigorously characterize the benefits of adaptivity for non-convex problems, which we leave for future work.

5 EMPIRICAL EVALUATION

In this section we report empirical results on a range of learning tasks. In Section 5.1, we compare DP2 with the baselines of DP-SGD and vanilla DP adaptive methods across various privacy budgets, and investigate the effects of delay on all datasets. We additionally compare DP2 with more recent, more advanced private adaptive methods in Section 5.2, and conduct ablation studies to validate the effectiveness of different DP2 components in Section 5.3.

In all experiments, we use the Rényi differential privacy (RDP) accountant for the subsampled Gaussian mechanism (Mironov et al., 2019) for privacy accounting. We focus on the RMSProp optimizer (Hinton et al., 2012) and provide results for other adaptive methods such as AdaGrad (Duchi et al., 2011; Streeter & McMahan, 2010) in Appendix C. Our experiments are implemented in JAX (Bradbury et al., 2018) with Haiku (Hennigan et al., 2020) to auto-vectorize over the per-example operations (e.g., per-example clipping) for substantial speedups (Subramani et al., 2021). Unless explicitly stated, we report results with the best grid-searched hyperparameters. Note that for DP2 we tune the learning rates and clipping thresholds separately for private SGD iterations and private adaptive (RMSProp) iterations. See Appendix C.2 for hyperparameter details. Our code is publicly available at github.com/kenziyuliu/DP2.

Tuning s. In all experiments, we tune the delay parameter s via grid search. For the convex tasks, we choose s from {0.025, 0.05, 0.1, 0.5, 1, 2} epochs. For the non-convex model, we choose s from {0.5, 3, 10, 25} epochs. We explore the sensitivity of DP2 to s in Section 5.2, and show that there exists a wide range of s values that result in superior performance compared with the baseline methods.

Datasets and Tasks. We pick datasets and tasks where adaptivity is crucial (e.g., those involving sparse gradients). For such tasks, adaptive methods have major benefits relative to SGD in non-private training, and we expect DP2 to retain these benefits in private training.
See Appendix C.1 for a detailed description. For all datasets, we explore the effects of several noise multiplier (σ) values, and set $\delta = 10^{-k}$, where k is the smallest integer that satisfies $10^{-k} \le 1/n$ for the training dataset size n.

5.1 DP2 COMPARED WITH DP-SGD AND VANILLA DP ADAPTIVE METHODS

We consider two popular baselines: DP-SGD (Abadi et al., 2016) and vanilla DP-RMSProp (Zhou et al., 2020). In vanilla DP adaptive methods, private gradients are plugged into adaptive updating rules to approximate the preconditioners at each iteration. Figure 4 compares DP2-RMSProp with DP-SGD and DP-RMSProp. We observe that across all datasets, DP2 consistently and substantially outperforms the baselines in terms of both convergence and absolute performance.

Privacy/utility trade-offs. Figure 4 reports learning curves under specific privacy budgets determined by the batch size and the number of epochs. Here, we additionally explore privacy/utility trade-offs across a range of privacy parameters, where the ε ranges are consistent with prior works (e.g., Kairouz et al., 2021b). Results are shown in Figure 5. We observe that, similar to the results in Figure 4, DP2 significantly outperforms DP-SGD and DP-RMSProp under each privacy budget. For reference, the non-private RMSProp method achieves 87% accuracy, 62% accuracy, and 0.88 mean squared error (MSE) on IMDB, Stack Overflow, and MovieLens, respectively. Indeed, with weaker privacy (larger ε), we expect smaller utility gaps between private and non-private optimization. In Appendix C.4, we additionally explore how increasing the computational budget may affect the privacy/utility trade-off.

Figure 4: Test performance of DP2 compared to DP-SGD and DP-RMSProp on IMDB (left, ε = 3.04, δ = 10⁻⁵), Stack Overflow (middle, ε = 0.87, δ = 10⁻⁶), and MovieLens-100k (right) for a fixed privacy budget. For all datasets, we calculate the privacy loss (ε) under fixed δ's, noise multipliers {1.0, 1.0, 0.5}, and batch size 64. All runs are repeated over 5 random seeds. Dotted lines correspond to training metrics.

Figure 5: Privacy/utility trade-offs of DP2-RMSProp (Algorithm 1) compared with DP-SGD and DP-RMSProp for a range of privacy budgets. We see that DP2-RMSProp consistently achieves more favorable privacy/utility trade-offs than the baseline methods.

Figure 6: Effect of the delay parameter s. The first three subplots show trade-offs between delay and noise (test performance versus delay, measured in epochs). The rightmost subplot shows convergence curves on Stack Overflow under different delays (s = 10000 corresponds to delaying for 3 epochs), where DP2 achieves a 4× convergence speedup over DP-SGD. Privacy settings follow those of Figure 4.
Although a specific value of s achieves the greatest improvements, we observe that nearly all instantiations of DP2 improve upon the baselines.

Effects of s. Finally, we empirically study the effect of the delay parameter s. Intuitively, there exists a trade-off between the amount of delay and the privacy noise in the preconditioner: averaging over more historical gradients (larger s) could yield less noisy preconditioners, while introducing more staleness. In Figure 6, we report test performance versus the delay s across all datasets in the first three subplots. In the last subplot, we additionally show the convergence behavior under different values of s. These results suggest that there is a sweet spot for s: small delays gradually improve over DP-RMSProp; moderate delays perform best in terms of convergence and absolute performance; and large delays may slow down convergence (although it is possible to reach similar performance with sufficient training). These empirical results are consistent with the implications of our convergence analysis discussed in Section 4.1.

5.2 DP2 COMPARED WITH RECENT METHODS FOR PRIVATE OPTIMIZATION

As discussed in Section 2, beyond DP-SGD and vanilla DP adaptive methods, another line of work uses auxiliary, public data to improve private (adaptive) optimization. While not directly comparable to DP2, since DP2 does not require any side/public information, we compare DP2 to two state-of-the-art methods along this direction²: (1) AdaDPS (Li et al., 2022), which uses public data or their statistics to estimate gradient geometry, and (2) PDA-DPMD (Amid et al., 2022), which uses the loss on public data as a mirror map to learn the underlying gradient geometry. Results are reported in Table 1, which shows that DP2 has comparable performance to these state-of-the-art baselines, but without the need to access auxiliary data. See Appendix C.6 for full details and convergence curves.

| Dataset | DP-SGD | DP-RMSProp | PDA-DPMD | AdaDPS (w/ RMSProp) | DP2-RMSProp |
|---|---|---|---|---|---|
| IMDB | .687 ± .018 | .713 ± .005 | .703 ± .005 | .826 ± .003 | .815 ± .011 |
| Stack Overflow | .330 ± .002 | .328 ± .002 | .353 ± .001 | .406 ± .027 | .391 ± .001 |
| MovieLens | 3.02 ± .068 | 2.96 ± .062 | 3.74 ± .053 | 2.86 ± .042 | 2.78 ± .054 |

Table 1: DP2 compared with other private (adaptive) methods that use public data (Amid et al., 2022; Li et al., 2022). Even though DP2 does not require auxiliary information, we find that it achieves comparable performance with these state-of-the-art approaches that require additional public data. Corresponding convergence plots are presented in Figure 11 in Appendix C.6.

5.3 ABLATION STUDIES

Figure 7: Different ablation variants of DP2 on IMDB (ε = 3.04, δ = 10⁻⁵). The dotted lines correspond to training accuracy.

Finally, we also study the effectiveness of the different components of DP2. Recall that in Algorithm 1, we use noisy gradients from DP-SGD iterations to update both the model parameters and the preconditioner, such that the total privacy cost is identical to that of DP-SGD. The first variant considers accumulating DP-SGD gradients in the same way, but it runs private adaptive methods using the delayed preconditioner in almost all iterations. This requires us to add independent noise twice at most iterations (when accumulating the preconditioner and when noising the preconditioned update), thus increasing the total privacy budget.
The second variant is identical to DP2 except that it applies the delayed preconditioner after noising the clean gradient; this is to study the order of preconditioning and privatization discussed in Section 3. As illustrated in Figure 7, both variants indeed significantly underperform our proposed method on the IMDB dataset, validating the design choices of DP2. We defer complete results to Figure 10 and Table 4 in Appendix C.5. See also Appendix D for the exact algorithms of both variants.

6 CONCLUSION AND FUTURE WORK

In this work, we proposed DP2, a private adaptive optimization framework that uses historical gradients to construct delayed but less noisy preconditioners, yielding improved privacy/utility trade-offs without the need to access auxiliary data. We demonstrated the effectiveness of DP2 both theoretically and empirically. In the future, it would be interesting to extend the techniques developed herein to other privacy-sensitive applications such as federated learning (McMahan et al., 2017; Reddi et al., 2021). It is also worth exploring interplays between DP2 and private online optimization with tree aggregation, which similarly releases cumulative statistics with reduced noise (Chan et al., 2011).

²We do not directly compare with the prior work of Asi et al. (2021), as the code is not publicly available and implementation details are missing from the paper; however, the more recent PDA-DPMD work of Amid et al. (2022) that we compare with reports superior performance to Asi et al. (2021). We also implement the diagonal variant of the method proposed in the theoretically-focused work of Kairouz et al. (2021a), but observe that accuracy improves only marginally beyond random guessing (see Figure 12 in Appendix C.6).

ACKNOWLEDGMENTS

The work of TL, ZL, and VS was supported in part by National Science Foundation Grant IIS1838017, a Google Faculty Award, a Meta Faculty Award, the Private AI Collaborative Research Institute, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or any other funding agency.

REFERENCES

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Conference on Computer and Communications Security, 2016.

Ahmet Alacaoglu, Yura Malitsky, Panayotis Mertikopoulos, and Volkan Cevher. A new regret analysis for Adam-type algorithms. In International Conference on Machine Learning, 2020.

Ehsan Amid, Arun Ganesh, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song, Thomas Steinke, Vinith M Suriyakumar, Om Thakkar, and Abhradeep Thakurta. Public data-assisted mirror descent for private model training. In International Conference on Machine Learning, 2022.

Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, and Kunal Talwar. Private adaptive gradient methods for convex optimization. In International Conference on Machine Learning, 2021.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In IEEE Symposium on Foundations of Computer Science, 2014.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security, 2011.

Soham De, Anirbit Mukherjee, and Enayat Ullah. Convergence guarantees for RMSProp and Adam in non-convex optimization and an empirical comparison to Nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018.

Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale. arXiv preprint arXiv:2204.13650, 2022.

Sergey Denisov, Brendan McMahan, Keith Rush, Adam Smith, and Abhradeep Thakurta. Improved differential privacy for SGD via optimal private linear operators on adaptive streams. In Advances in Neural Information Processing Systems, 2022.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, 2006.

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 2014.

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, 2018.

F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 2015.

Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. RMSProp: Divide the gradient by a running average of its recent magnitude. Neural Networks for Machine Learning, Coursera Lecture 6e, 2012.

Kaggle. Stack Overflow Data on Kaggle. https://www.kaggle.com/datasets/stackoverflow/stackoverflow, 2022.

Peter Kairouz, Monica Ribero Diaz, Keith Rush, and Abhradeep Thakurta. (Nearly) dimension independent private ERM with AdaGrad rates via publicly estimated subspaces. In Conference on Learning Theory, 2021a.

Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, 2021b.

Kfir Y Levy, Alp Yurtsever, and Volkan Cevher. Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, 2018.

Tian Li, Manzil Zaheer, Sashank Reddi, and Virginia Smith. Private adaptive optimization with side information. In International Conference on Machine Learning, 2022.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.

Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.

H. Brendan McMahan and Matthew J. Streeter.
Adaptive bound optimization for online convex optimization. In Conference on Learning Theory, 2010.

Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled Gaussian mechanism. arXiv preprint arXiv:1908.10530, 2019.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009.

Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing, 2013.

Matthew Streeter and H Brendan McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862, 2010.

Pranav Subramani, Nicholas Vadivelu, and Gautam Kamath. Enabling fast differentially private SGD via just-in-time compilation and vectorization. In Advances in Neural Information Processing Systems, 2021.

TensorFlow Federated. TensorFlow Federated Stack Overflow Dataset. https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/stackoverflow/load_data, 2022.

Rachel Ward, Xiaoxia Wu, and Leon Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Journal of Machine Learning Research, 2020.

Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018.

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, 2020.

Yingxue Zhou, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Arindam Banerjee. Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds. arXiv preprint arXiv:2006.13501, 2020.

Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Private SGD with gradient subspace identification. In International Conference on Learning Representations, 2021.

Lemma 1. Under Assumption 2, let $s_1 = s_2 = s$ in Algorithm 1. Then for any $j \in [d]$, $\mathbb{E}[v_j] \le C^2 + \frac{\sigma^2 C^2}{s b^2}$.

Proof. Recall that C is the gradient norm bound (Assumption 2); let the clipping threshold be C as well. For $j \in [d]$,

$$\mathbb{E}\Big[\big(\tfrac{G_j}{s}\big)^2\Big] = \mathbb{E}\bigg[\Big(\frac{1}{s}\big(\bar g^{i_1}_j + \cdots + \bar g^{i_s}_j\big) + \frac{1}{s}\big(N^{i_1}_j + \cdots + N^{i_s}_j\big)\Big)^2\bigg] = \mathbb{E}\Big[\frac{1}{s^2}\big(\bar g^{i_1}_j + \cdots + \bar g^{i_s}_j\big)^2\Big] + \mathbb{E}\Big[\frac{1}{s^2}\big(N^{i_1}_j + \cdots + N^{i_s}_j\big)^2\Big], \quad (2)$$

where $\{i_1, \ldots, i_s\}$ denotes the indices of the s noisy gradients used to obtain $G_j$, and $N^{i_1}_j, \ldots, N^{i_s}_j$ are zero-mean Gaussian variables with variance $\frac{\sigma^2 C^2}{b^2}$ under noise multiplier σ, clipping threshold C, and mini-batch size b. Hence for any $j \in [d]$ and $t \in [T]$,

$$\mathbb{E}\Big[\big(\tfrac{G_j}{s}\big)^2\Big] \le C^2 + \frac{\sigma^2 C^2}{s b^2} := M, \quad (4) \qquad \mathbb{E}[v_j] \le M, \quad (5) \qquad \mathbb{E}\big[D^t_j\big] \le \max\big\{\sqrt{M} + \epsilon,\ 1\big\}. \quad (7)$$

A.1 PROOF OF THEOREM 2

Based on the updating rule, we have

$$\|w^{t+1} - w^*\|^2_{D^t} = \Big\|w^t - w^* - \alpha^t\,\frac{\bar g^t}{D^t} - \alpha^t N^t\Big\|^2_{D^t} \quad (8)$$

$$= \|w^t - w^*\|^2_{D^t} + \Big\|\alpha^t\,\frac{\bar g^t}{D^t} + \alpha^t N^t\Big\|^2_{D^t} - 2\Big\langle w^t - w^*,\ \alpha^t \bar g^t + \alpha^t D^t N^t\Big\rangle \quad (10)$$

$$= \|w^t - w^*\|^2_{D^t} - 2\alpha^t\big\langle \bar g^t, w^t - w^*\big\rangle + (\alpha^t)^2\Big\langle \bar g^t, \frac{\bar g^t}{D^t}\Big\rangle - 2\alpha^t\big\langle w^t - w^*, D^t N^t\big\rangle + (\alpha^t)^2\|N^t\|^2_{D^t} + 2(\alpha^t)^2\big\langle \bar g^t, N^t\big\rangle. \quad (11)$$
Rearranging terms gives

$$\big\langle \bar g^t, w^t - w^*\big\rangle = \frac{\|w^t - w^*\|^2_{D^t} - \|w^{t+1} - w^*\|^2_{D^t}}{2\alpha^t} + \frac{\alpha^t}{2}\Big\langle \bar g^t, \frac{\bar g^t}{D^t}\Big\rangle - \big\langle w^t - w^*, D^t N^t\big\rangle + \frac{\alpha^t}{2}\|N^t\|^2_{D^t} + \alpha^t\big\langle \bar g^t, N^t\big\rangle. \quad (12)$$

Taking the expectation on both sides conditioned on $w^t$,

$$\big\langle \nabla F(w^t), w^t - w^*\big\rangle = \frac{\mathbb{E}_t\|w^t - w^*\|^2_{D^t} - \mathbb{E}_t\big[\|w^{t+1} - w^*\|^2_{D^t}\big]}{2\alpha^t} + \frac{\alpha^t}{2}\,\mathbb{E}_t\Big\langle \bar g^t, \frac{\bar g^t}{D^t}\Big\rangle + \frac{\alpha^t}{2}\,\mathbb{E}_t\big[\|N^t\|^2_{D^t}\big], \quad (13)$$

where we have used the fact that $N^t$ is a zero-mean Gaussian variable independent of $\bar g^t$ and $w^t$. Taking the expectation on both sides and using the convexity of $F(\cdot)$:

$$\mathbb{E}[F(w^t)] - F(w^*) \le \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big] - \mathbb{E}\big[\|w^{t+1} - w^*\|^2_{D^t}\big]}{2\alpha^t} + \frac{\alpha^t}{2}\,\mathbb{E}\Big\langle \bar g^t, \frac{\bar g^t}{D^t}\Big\rangle + \frac{\alpha^t}{2}\,\mathbb{E}\big[\|N^t\|^2_{D^t}\big]. \quad (14)$$

Applying a telescope sum, we have

$$\sum_{t=1}^{T}\Big(\mathbb{E}[F(w^t)] - F(w^*)\Big) \le \underbrace{\frac{\|w^1 - w^*\|^2_{D^1}}{2\alpha^1} + \sum_{t=2}^{T}\bigg(\frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big]}{2\alpha^t} - \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^{t-1}}\big]}{2\alpha^{t-1}}\bigg)}_{T_1} + \underbrace{\sum_{t=1}^{T}\frac{\alpha^t}{2}\bigg(\mathbb{E}\Big\langle \bar g^t, \frac{\bar g^t}{D^t}\Big\rangle + \mathbb{E}\big[\|N^t\|^2_{D^t}\big]\bigg)}_{T_2}, \quad (15)\text{–}(16)$$

where the vector $D^t \in \mathbb{R}^d$ satisfies $D^t = 1$ when running private SGD steps, and $D^t = \sqrt{v} + \epsilon$ when running private RMSProp steps. Let the delay parameter be scheduled as

$$s = \upsilon T \quad (0 < \upsilon < 1), \quad (17)$$

and let the learning rate be $\alpha^t = \alpha/\sqrt{t}$, where $\alpha = \min\{\epsilon, \frac{1}{\sqrt{M}+\epsilon}, 1\}$, and M is the upper bound on $\mathbb{E}[v_j]$ for $j \in [d]$, as defined and proved in Lemma 1. (18)

We first consider the $T_1$ term. There are four cases.

1. DP-SGD at the (t−1)-th iteration, and DP-SGD at the t-th iteration: As $D^t = D^{t-1}$, there is not much requirement other than that the learning rates satisfy $\alpha^t \le \alpha^{t-1}$, which holds for our choice.

2. Private RMSProp at the (t−1)-th iteration, and private RMSProp at the t-th iteration: Similar to the previous case, the learning rates need to satisfy $\alpha^t \le \alpha^{t-1}$, which holds for our choice.

3. DP-SGD at the (t−1)-th iteration, and private RMSProp at the t-th iteration: We require

$$\frac{D^t_j}{\alpha^t} \ge \frac{D^{t-1}_j}{\alpha^{t-1}}. \quad (19)$$

But in this case we must have $t \bmod s = 0$, so this is satisfied by our choice as long as $\alpha \le \epsilon$.

4. Private RMSProp at the (t−1)-th iteration, and DP-SGD at the t-th iteration.

The first three cases form an updating pattern of DP-SGD → ⋯ → DP-SGD → DP-RMSProp → ⋯ → DP-RMSProp, where every pattern takes 2s iterations, except for the first pattern, because the telescope sum starts from t = 2. For the first pattern, we have

$$\frac{\|w^1 - w^*\|^2_{D^1}}{2\alpha^1} + \sum_{t=2}^{2s}\bigg(\frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big]}{2\alpha^t} - \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^{t-1}}\big]}{2\alpha^{t-1}}\bigg) \le \frac{\|w^1 - w^*\|^2_{D^1}}{2\alpha^1} + \frac{R^2}{2\alpha^{2s}}\,\mathbb{E}\big[\|D^{2s}\|_1\big], \quad (22)$$

where $D^{2s} = \sqrt{v}+\epsilon$. For $k \ge 1$ (the k-th subsequent pattern, spanning iterations $2sk+1$ to $2sk+2s$), we similarly have

$$\sum_{t=2sk+1}^{2sk+2s}\bigg(\frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big]}{2\alpha^t} - \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^{t-1}}\big]}{2\alpha^{t-1}}\bigg) \le \frac{R^2}{2\alpha^{2sk+2s}}\,\mathbb{E}\big[\|D^{2sk+2s}\|_1\big], \quad (23)$$

where $D^{2sk+2s} = \sqrt{v}+\epsilon$ belongs to DP-RMSProp updates.

We now look at the second term $T_2$, and prove by induction that there exists a constant κ such that

$$\sum_{t'=1}^{t}\frac{\alpha^{t'}}{2}\,\mathbb{E}\Big\langle \bar g^{t'}, \frac{\bar g^{t'}}{D^{t'}}\Big\rangle \le \frac{\kappa}{\alpha^{t}}\,\mathbb{E}\big[\|D^{t}\|_1\big]. \quad (24)$$

When T = 1 ($\alpha^1 = \alpha$ and $D^1 = 1$), $\frac{\alpha}{2}\mathbb{E}[\|\bar g^1\|^2] \le \frac{\kappa d}{\alpha}$ holds if $\kappa \ge \alpha^2 C^2$. At each step t, the goal is to show

$$\frac{\kappa}{\alpha^{t-1}}\,\mathbb{E}\big[\|D^{t-1}\|_1\big] + \frac{\alpha^t}{2}\,\mathbb{E}\Big\langle \bar g^{t}, \frac{\bar g^{t}}{D^{t}}\Big\rangle \le \frac{\kappa}{\alpha^{t}}\,\mathbb{E}\big[\|D^{t}\|_1\big]. \quad (25)$$

1. DP-SGD at the (t−1)-th iteration, and DP-SGD at the t-th iteration: We require

$$\frac{\kappa d}{\alpha^{t-1}} + \frac{\alpha^t}{2}\,\mathbb{E}\big[\|\bar g^t\|^2\big] \le \frac{\kappa d}{\alpha^t}, \quad (26)$$

which holds for our choice of $\alpha^t$, as gradients are bounded and $\kappa \ge \alpha^2 C^2$.

2. Private RMSProp at the (t−1)-th iteration, and private RMSProp at the t-th iteration: We need

$$\frac{\kappa}{\alpha^{t-1}}\,\mathbb{E}\big[\|\sqrt{v^{t-1}}+\epsilon\|_1\big] + \frac{\alpha^t}{2}\,\mathbb{E}\Big\langle \bar g^{t}, \frac{\bar g^{t}}{\sqrt{v^t}+\epsilon}\Big\rangle \le \frac{\kappa}{\alpha^{t}}\,\mathbb{E}\big[\|\sqrt{v^t}+\epsilon\|_1\big]. \quad (27)$$

Recalling the definition of h(s), based on our updating rule,

$$\frac{\alpha^t}{2}\,\mathbb{E}\Big\langle \bar g^{t}, \frac{\bar g^{t}}{\sqrt{v^t}+\epsilon}\Big\rangle \le \frac{\alpha^t}{2\epsilon}\,\mathbb{E}\big[\|\bar g^t\|^2\big] \le \frac{\alpha^t C}{2\epsilon}\,\mathbb{E}\big[\|\bar g^t\|_1\big], \quad (31)$$

where we have used the assumption that $\|\bar g^t\| \le C$. Combining the above two,

$$\frac{\alpha^t C}{2\epsilon}\,\mathbb{E}\big[\|\bar g^t\|_1\big] \le \frac{\alpha^t C}{2\epsilon}\,h(s)\,\Big(\mathbb{E}\Big[\big\|G^{\lfloor t/s\rfloor s}/s\big\|_1\Big] + d\epsilon\Big) \le \frac{\alpha^t C\,h(s)}{2\epsilon\sqrt{1-\beta}}\,\mathbb{E}\big[\|\sqrt{v^t}+\epsilon\|_1\big],$$

where the last step uses $\sqrt{v^t_j} \ge \sqrt{1-\beta}\,\big|G^{\lfloor t/s\rfloor s}_j/s\big|$ from the moving-average update of v. This implies that the condition holds as long as κ satisfies

$$\kappa \ge \frac{C\,h(s)}{\epsilon\sqrt{1-\beta}}. \quad (35)$$

3. DP-SGD at the (t−1)-th iteration, and private RMSProp at the t-th iteration: We want to prove

$$\frac{\kappa d}{\alpha^{t-1}} + \frac{\alpha^t}{2}\,\mathbb{E}\Big\langle \bar g^{t}, \frac{\bar g^{t}}{D^t}\Big\rangle \le \frac{\kappa}{\alpha^{t}}\,\mathbb{E}\big[\|D^t\|_1\big]. \quad (36)$$
As $\|\bar g^t\| \le C$, it holds that

$$\frac{\alpha^t}{2\epsilon}\,\mathbb{E}\big[\|\bar g^t\|^2\big] \le \frac{\alpha^t C}{2\epsilon}\,\mathbb{E}\big[\|\bar g^t\|\big] \le \frac{\alpha^t C}{2\epsilon}\,\mathbb{E}\big[\|\bar g^t\|_1\big] \le \frac{\alpha^t C\,h(s)}{2\epsilon\sqrt{1-\beta}}\,\mathbb{E}\big[\|\sqrt{v^t}+\epsilon\|_1\big]. \quad (37)\text{–}(38)$$

Since this case occurs only when $t \bmod s = 0$, our learning rate schedule in Eq. (18) with $\alpha \le \epsilon$, together with $\mathbb{E}[\|D^t\|_1] \ge d\epsilon$ for adaptive iterations, ensures

$$\frac{\kappa d}{\alpha^{t-1}} \le \frac{\kappa\, d\,\epsilon}{\alpha^{t}} \le \frac{\kappa}{\alpha^{t}}\,\mathbb{E}\big[\|D^t\|_1\big]. \quad (39)\text{–}(40)$$

Combining with (37)–(38),

$$\frac{\kappa d}{\alpha^{t-1}} + \frac{\alpha^t C\,h(s)}{2\epsilon\sqrt{1-\beta}}\,\mathbb{E}\big[\|D^t\|_1\big] \le \frac{\kappa}{\alpha^{t}}\,\mathbb{E}\big[\|D^t\|_1\big], \quad (41)\text{–}(42)$$

where we require

$$\kappa \ge \frac{C\,h(s)}{\epsilon\sqrt{1-\beta}}. \quad (43)$$

4. Private RMSProp at the (t−1)-th iteration, and DP-SGD at the t-th iteration: We need

$$\frac{\kappa}{\alpha^{t-1}}\,\mathbb{E}\big[\|\sqrt{v^{t-1}}+\epsilon\|_1\big] + \frac{\alpha^t}{2}\,\mathbb{E}\big[\|\bar g^t\|^2\big] \le \frac{\kappa d}{\alpha^{t}}. \quad (44)$$

Plugging in $\mathbb{E}[\sqrt{v^{t-1}_j}] \le \sqrt{M}$ (Lemma 1) and $\|\bar g^t\|^2 \le C^2$, it suffices to have

$$\frac{\kappa d\,(\sqrt{M}+\epsilon)}{\alpha^{t-1}} + \frac{\alpha^t C^2}{2} \le \frac{\kappa d}{\alpha^{t}}. \quad (45)$$

Based on our learning rate schedule in Eq. (18), for some constant γ we have $\alpha^{t-1} = \frac{\gamma}{\sqrt{t-1}}$ and $\alpha^{t} \le \frac{\gamma}{\sqrt{t}\,(\sqrt{M}+1)}$ at such switching iterations, (46) so that

$$\frac{\kappa d\,(\sqrt{M}+1)}{\alpha^{t-1}} \le \frac{\kappa d}{\alpha^{t}}, \quad (47)$$

and the remaining slack absorbs the term $\frac{\alpha^t C^2}{2}$, which holds as long as $\kappa \ge \alpha^2 C^2$.

To sum up, the requirement on κ is

$$\kappa \ge \max\Big\{\alpha^2 C^2,\ \frac{C\,h(s)}{\epsilon\sqrt{1-\beta}}\Big\}. \quad (49)$$

Final convergence result:

$$\min_{t\in[T]}\ \mathbb{E}\,F(w^t) - F(w^*) \quad (50)$$

$$\le \frac{R^2+\kappa}{2\alpha\sqrt{T}}\sum_{t\in\mathcal{T}_\upsilon}\mathbb{E}\big[\|D^t\|_1\big] + \frac{1}{\alpha T}\sum_{t=1}^{T}\frac{\sqrt{t+2\upsilon T}+\sqrt{t+\upsilon T}}{\sqrt{t}}\,\alpha^2\,\mathbb{E}\big[\|N^t\|^2_{D^t}\big], \quad (51)$$

where $\mathcal{T}_\upsilon$ denotes the iteration indices where we switch from private RMSProp steps to private SGD steps, plus the last iteration, with cardinality $|\mathcal{T}_\upsilon| = \frac{1}{2\upsilon}$; and $\kappa \ge \max\{\alpha^2 C^2, \frac{C h(s)}{\epsilon\sqrt{1-\beta}}\}$, $\alpha = \min\{\epsilon, \frac{1}{\sqrt{M}+\epsilon}, 1\}$.

A.2 A CLOSER LOOK AT h(s)

We closely examine h(s), defined as

$$h(s) := \max_{t\in[T]}\ \frac{\mathbb{E}\big[\|\bar g^t\|_1\big]}{\mathbb{E}\big[\big\|G^{\lfloor t/s\rfloor s}/s\big\|_1\big] + d\epsilon}. \quad (52)$$

Let us assume mini-batch gradients on consecutive time steps are not very different, i.e., $\|\bar g^t - \bar g^{t-1}\|_1 \le \bar M$. This means the gradient norms cannot be too far from each other, which can be used to show the dependence of h(s) on the delay parameter s. Denote the gap between the current iteration t and the iteration where v was last updated as k, i.e., $k := t - \lfloor t/s\rfloor s$. Hence,

$$h(s) \le \max_{t}\ \frac{\mathbb{E}\Big[\big\|\frac{1}{s}\sum_{l=1}^{s}\bar g^{t-k-l}\big\|_1\Big] + \frac{1}{s}\sum_{l=1}^{s}\mathbb{E}\big[\|\bar g^t - \bar g^{t-k-l}\|_1\big]}{\mathbb{E}\Big[\big\|\frac{1}{s}\sum_{l=1}^{s}\bar g^{t-k-l} + \frac{1}{s}\sum_{l=1}^{s}N^{t-k-l}\big\|_1\Big] + d\epsilon} \le \max_{t}\ \frac{\mathbb{E}\Big[\big\|\frac{1}{s}\sum_{l=1}^{s}\bar g^{t-k-l}\big\|_1\Big] + \frac{1}{s}\big((s+1)\bar M + \cdots + 2s\,\bar M\big)}{\mathbb{E}\Big[\big\|\frac{1}{s}\sum_{l=1}^{s}\bar g^{t-k-l} + \frac{1}{s}\sum_{l=1}^{s}N^{t-k-l}\big\|_1\Big] + d\epsilon}, \quad (53)\text{–}(56)$$

where we have used $\|\bar g^t - \bar g^{t-k-l}\|_1 \le (k+l)\bar M \le (s+l)\bar M$ since $k \le s$. Denote $a := \frac{1}{s}\sum_{l=1}^{s}N^{t-k-l}$ and $b := \frac{1}{s}\sum_{l=1}^{s}\bar g^{t-k-l}$. Then

$$h(s) \le \frac{\mathbb{E}[\|b\|_1] + \frac{3s+1}{2}\bar M}{\mathbb{E}[\|a+b\|_1] + d\epsilon} = \frac{\mathbb{E}[\|b\|_1]}{\mathbb{E}[\|a+b\|_1] + d\epsilon}\bigg(1 + \frac{\frac{3s+1}{2}\bar M}{\mathbb{E}[\|b\|_1]}\bigg). \quad (57)$$

In the special case where gradients are sparse, i.e., $\mathbb{E}[\|b\|_1] < \mathbb{E}[\|a\|_1]$, we have

$$h(s) \le \frac{\mathbb{E}[\|b\|_1]}{\mathbb{E}[\|a\|_1] - \mathbb{E}[\|b\|_1] + d\epsilon}\bigg(1 + \frac{\frac{3s+1}{2}\bar M}{\mathbb{E}[\|b\|_1]}\bigg). \quad (58)\text{–}(59)$$

It is easy to see that the RHS is $O(s)$ and increases with s. We can informally express it as $c_1 + c_2 s$, where $c_1$ and $c_2$ are two constants.

B PROOF OF THEOREM 3

First we introduce a result that will be used in this section. Under the bounded stochastic gradient variance assumption (Assumption 4), conditioned on $w^t$ we have

$$\mathbb{E}_t\big[\|\bar g^t\|^2\big] \le \frac{\tau^2}{b} + \|\nabla F(w^t)\|^2, \quad (60)$$

where b refers to the mini-batch size used to obtain the gradient $\bar g^t$, i.e., $\bar g^t = \frac{1}{b}\sum_{i\in B} g^{i,t}$. This lemma is proved in Zaheer et al. (2018). The per-coordinate version of this result is that for $j \in [d]$,

$$\mathbb{E}_t\big[(\bar g^t_j)^2\big] \le \frac{\tau_j^2}{b} + \big(\nabla_j F(w^t)\big)^2, \quad (61)$$

where $\sum_{j\in[d]}\tau_j^2 = \tau^2$. As we assume F(w) is L-smooth, at each iteration t,

$$F(w^{t+1}) \le F(w^t) + \big\langle \nabla F(w^t),\ w^{t+1}-w^t\big\rangle + \frac{L}{2}\,\|w^{t+1}-w^t\|^2. \quad (62)$$

Based on the updating rule of Algorithm 1, we have

$$F(w^{t+1}) \le F(w^t) - \alpha^t\Big\langle \nabla F(w^t),\ \frac{\bar g^t}{D^t} + N^t\Big\rangle + \frac{(\alpha^t)^2 L}{2}\,\Big\|\frac{\bar g^t}{D^t} + N^t\Big\|^2, \quad (63)\text{–}(64)$$

where $N^t \in \mathbb{R}^d$ with $N^t_j \sim \mathcal{N}\big(0, \frac{\sigma^2 C^2}{b^2}\big)$ under noise multiplier σ and clipping threshold C, and $D^t$ satisfies

$$D^t = \begin{cases} 1 & \text{if } t \bmod 2s < s, \\ \sqrt{v} + \epsilon & \text{otherwise.} \end{cases} \quad (65)$$

Taking expectation with respect to the samples at the t-th iteration and $N^t$,

$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \alpha^t \sum_{j\in[d]} \frac{\big(\nabla_j F(w^t)\big)^2}{D^t_j} + \frac{(\alpha^t)^2 L}{2}\sum_{j\in[d]}\frac{\mathbb{E}_t\big[(\bar g^t_j)^2\big]}{(D^t_j)^2} + \frac{d\,(\alpha^t)^2 L\,\sigma^2 C^2}{2b^2}, \quad (66)$$

where we have used the fact that $N^t$ is a zero-mean random variable independent of $w^t$, and $D^t$ is independent of the samples at time t. We need to consider two cases.
1. DP-SGD at the t-th iteration: In this case, $D^t = 1$. Hence, plugging in $\mathbb{E}_t\big[(\bar g^t_j)^2\big] \le \frac{\tau_j^2}{b} + (\nabla_j F(w^t))^2$, (67) we get

$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2}\Big)\|\nabla F(w^t)\|^2 + (\alpha^t)^2 L\bigg(\frac{\tau^2}{2b} + \frac{d\sigma^2 C^2}{2b^2}\bigg). \quad (68)$$

Under a constant learning rate $\alpha^t = \alpha \le \frac{1}{L}$,

$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \frac{\alpha}{2}\|\nabla F(w^t)\|^2 + \alpha^2 L\bigg(\frac{\tau^2}{2b} + \frac{d\sigma^2 C^2}{2b^2}\bigg). \quad (69)$$

Taking expectation on both sides gives

$$\frac{\alpha}{2}\,\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le \mathbb{E}[F(w^t)] - \mathbb{E}[F(w^{t+1})] + \alpha^2 L\bigg(\frac{\tau^2}{2b} + \frac{d\sigma^2 C^2}{2b^2}\bigg). \quad (70)$$

2. Private RMSProp at the t-th iteration: We have

$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \alpha^t\sum_{j\in[d]}\frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j}+\epsilon} + \frac{(\alpha^t)^2 L}{2}\sum_{j\in[d]}\frac{\mathbb{E}_t\big[(\bar g^t_j)^2\big]}{\big(\sqrt{v^t_j}+\epsilon\big)^2} + \frac{d\,(\alpha^t)^2 L\,\sigma^2 C^2}{2b^2}. \quad (71)$$

Plugging in $\mathbb{E}_t\big[(\bar g^t_j)^2\big] \le \frac{\tau_j^2}{b} + (\nabla_j F(w^t))^2$ results in

$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2\epsilon}\Big)\sum_{j\in[d]}\frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j}+\epsilon} + (\alpha^t)^2 L\bigg(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2\epsilon^2 b^2}\bigg). \quad (72)\text{–}(75)$$

Taking expectation on both sides yields

$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2\epsilon}\Big)\sum_{j\in[d]}\mathbb{E}\bigg[\frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j}+\epsilon}\bigg] + (\alpha^t)^2 L\bigg(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2\epsilon^2 b^2}\bigg). \quad (76)$$

We need to lower bound $\sum_{j\in[d]}\mathbb{E}\big[\frac{(\nabla_j F(w^t))^2}{\sqrt{v^t_j}+\epsilon}\big]$. We know from Hölder's inequality that $\mathbb{E}[\langle u, v\rangle] \le \mathbb{E}[\|u\|_1]\,\mathbb{E}[\|v\|_\infty]$. Now note that

$$\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] = \mathbb{E}\bigg[\Big\langle \frac{|\nabla F(w^t)|^2}{D^t},\ D^t\Big\rangle\bigg] \le \mathbb{E}\bigg[\sum_{j\in[d]}\frac{\big(\nabla_j F(w^t)\big)^2}{D^t_j}\bigg]\cdot\big(\sqrt{M}+\epsilon\big), \quad (77)\text{–}(78)$$

so

$$\sum_{j\in[d]}\mathbb{E}\bigg[\frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j}+\epsilon}\bigg] \ge \frac{\mathbb{E}\big[\|\nabla F(w^t)\|^2\big]}{\sqrt{M}+\epsilon}. \quad (79)$$

Hence

$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2\epsilon}\Big)\frac{\mathbb{E}\big[\|\nabla F(w^t)\|^2\big]}{\sqrt{M}+\epsilon} + (\alpha^t)^2 L\bigg(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2\epsilon^2 b^2}\bigg). \quad (80)$$

Letting $\alpha^t = \alpha \le \frac{\epsilon}{L}$, we obtain

$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \frac{\alpha}{2(\sqrt{M}+\epsilon)}\,\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] + \alpha^2 L\bigg(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2\epsilon^2 b^2}\bigg). \quad (81)$$

Combining the two cases, for any t we have

$$\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le \frac{2(\sqrt{M}+\epsilon)}{\alpha}\Big(\mathbb{E}[F(w^t)] - \mathbb{E}[F(w^{t+1})]\Big) + 2\alpha L\big(\sqrt{M}+\epsilon\big)\bigg(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2\epsilon^2 b^2}\bigg). \quad (82)\text{–}(83)$$

Taking a telescope sum results in

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le \frac{2(\sqrt{M}+1)F(w^1)}{\alpha T} + 2\alpha L\big(\sqrt{M}+1\big)\bigg(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2\epsilon^2 b^2}\bigg), \quad (84)$$

where $M := C^2 + \frac{\sigma^2 C^2}{s b^2}$.

C EXPERIMENTAL DETAILS AND ADDITIONAL RESULTS

C.1 DATASETS

IMDB (Maas et al., 2011) is a binary classification dataset for sentiment analysis of movie reviews that includes 25,000/25,000 training/test samples. Each sample is a review under a vocabulary size of 10,000. We train a logistic regression model with 10,001 parameters.

Stack Overflow (Kaggle, 2022; TensorFlow Federated, 2022) is a large-scale text dataset containing questions and answers from Stack Overflow. We focus on the task of classifying the tag(s) of a given sentence described in TensorFlow Federated (2022), though we focus on the usual centralized training setting instead of a federated setting. We randomly sample 246,092 sentences for training and 61,719 for testing, where each sentence is described by 10,000 features. We format the task as a 500-class classification problem, and the resulting model has roughly 5 million parameters.

MovieLens-100k (Harper & Konstan, 2015) is a movie review dataset commonly used for recommendation systems. It contains 100,000 movie ratings from 943 users on 1,682 items (≈6% non-zero entries). We study a (non-convex) matrix factorization task with embedding size 100, thus totaling 262,500 parameters. We treat each non-zero entry as a record for differential privacy, and randomly partition the entries for training and evaluation.

C.2 HYPERPARAMETERS

Unless otherwise stated, we fix the following hyperparameters in our experiments: for IMDB, Stack Overflow, and MovieLens respectively, we train for 100/50/50 epochs with batch size 64 and privacy parameter $\delta = 10^{-5}/10^{-6}/10^{-6}$. We then perform a grid search on the other hyperparameters:

Learning rates: We grid search over {0.03, 0.1, 0.3, 1, 3, 5} for SGD / AdaGrad update rules and over {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3} for the RMSProp update rule.

Per-example clipping thresholds: We grid search over {0.1, 0.25, 0.5, 1} when performing per-example clipping on clean gradients without preconditioning (e.g.,
for DP-SGD updates), and over {0.1, 0.25, 0.5, 1, 2, 3, 5} when clipping preconditioned clean gradients (e.g., for DP2 updates in adaptive iterations). The rationale is that, in general, preconditioned gradient norms are usually larger than those without preconditioning (recall from Section 3.2 that we apply preconditioning before privatization in DP2). For AdaDPS and DP2-RMSProp, we also tried a few even larger clipping thresholds (≥ 10), though we did not perform a full sweep of the other hyperparameters at those values due to computational constraints.

Delay parameter s: For all datasets, s (i.e., the number of optimization steps) is chosen heuristically as a function of the number of steps in an epoch. When reporting the best results (e.g., Figure 4, Figure 5), we search over s ∈ {195, 390, 780} (roughly 0.5, 1, 2 epochs, respectively) for IMDB (390 steps/epoch); s ∈ {100, 300, 1000, 3000} for Stack Overflow (3845 steps/epoch); and s ∈ {1250, 15625, 31250, 50000} for MovieLens (1250 steps/epoch).

Adaptivity ϵ: In our settings, the adaptivity parameter ϵ for RMSProp/AdaGrad (in the denominator $D^t = \sqrt{v}+\epsilon$) affects the amount of adaptivity as well as the norms of the preconditioned gradients, which may in turn influence the privacy/utility trade-off under per-example clipping. We tune ϵ over a small grid of $\{10^{-2}, 10^{-3}, 10^{-5}, 10^{-7}\}$.

All reported results use the best hyperparameter configurations, which are selected using training set metrics (as overfitting generally does not occur under DP noise). To facilitate reproducibility, we summarize the tuned hyperparameters for the main experiments and the ablation studies in Table 2 and Table 3 below, respectively.

| Dataset | DP-SGD | DP-RMSProp | PDA-DPMD | AdaDPS (w/ RMSProp) | DP2-RMSProp |
|---|---|---|---|---|---|
| IMDB | (5, 0.5) | (0.3, 0.1, $10^{-3}$) | (5, 0.5) | (1, 5, $10^{-3}$) | (0.1, 3, 0.5, 5, $10^{-7}$, 195) |
| Stack Overflow | (3, 0.25) | (0.03, 0.1, $10^{-3}$) | (3, 0.25) | (0.4, 5, $10^{-3}$) | (0.3, 0.3, 0.25, 5, $10^{-5}$, 1000) |
| MovieLens | (0.1, 1) | (0.001, 0.5, $10^{-3}$) | (0.1, 1) | (0.01, 10, $10^{-2}$) | (0.1, 0.03, 1, 5, $10^{-3}$, 31250) |

Table 2: Tuned hyperparameters for different methods across three datasets. For DP-SGD and PDA-DPMD, the values refer to (LR, clip); for DP-RMSProp and AdaDPS, the values refer to (LR, clip, adaptivity ϵ); and for DP2, the values refer to (LR for SGD iters, LR for RMSProp iters, clip for SGD iters, clip for RMSProp iters, adaptivity ϵ, delay s). Bold values were experimented on the edges of the hyperparameter grids.

| Dataset | Ablation Variant 1 | Ablation Variant 2 |
|---|---|---|
| IMDB | (3.0, 0.1, 0.5, 2.0, $10^{-7}$, 780) | (0.3, 0.3, 0.25, $10^{-3}$, 780) |
| Stack Overflow | (1.0, 1.0, 1.0, 1.0, $10^{-5}$, 1000) | (0.3, 0.001, 0.25, $10^{-5}$, 1000) |

Table 3: Tuned hyperparameters for the ablation studies (Section 5.3) on IMDB and Stack Overflow. Both variants use the RMSProp update rule for the adaptive steps. Bold values were experimented on the edges of the hyperparameter grids. For Variants 1 and 2 respectively, the values refer to (LR for SGD iters, LR for RMSProp iters, clip for SGD iters, clip for RMSProp iters, adaptivity ϵ, delay s) and (LR for SGD iters, LR for RMSProp iters, clip for both SGD/RMSProp iters, adaptivity ϵ, delay s). Note that for Variant 2 the clipping threshold does not need to be tuned separately for SGD/RMSProp iterations, as it applies to preconditioned gradients in both cases.

C.3 RESULTS FOR DP2-ADAGRAD

The DP2 framework can be applied to a range of adaptive methods beyond the RMSProp instantiation discussed in the main text.
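In fact, moving from DP2-RMSProp to DP2-AdaGrad requires only a one-line change to the preconditioner update (Line 5 of Algorithm 1; see Appendix D for the full algorithms). A minimal sketch, assuming the standard AdaGrad accumulation (function and variable names are ours):

```python
import numpy as np

def update_preconditioner(v, G, s1, beta=0.9, rule="rmsprop"):
    """Line 5 of Algorithm 1 for two update rules (a sketch)."""
    avg_sq = (np.asarray(G) / s1) ** 2          # squared average of s1 private gradients
    if rule == "rmsprop":
        return beta * v + (1 - beta) * avg_sq   # RMSProp: exponential moving average
    return v + avg_sq                           # AdaGrad: running sum
```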
C.3 RESULTS FOR DP2-ADAGRAD

The DP2 framework can be applied to a range of adaptive methods beyond the RMSProp variant that is the focus of the main text. We extend DP2 to the AdaGrad update rule (a one-line code change; see Appendix D) and benchmark its convergence and privacy/utility trade-offs. The results in Figure 8 and Figure 9 indicate that DP2-AdaGrad, like DP2-RMSProp, consistently and substantially improves over the baselines in terms of both convergence and absolute performance, demonstrating the generality of DP2 across adaptive optimizers.

Figure 8: (Extension of Figure 4 to the AdaGrad update rule.) Test accuracy of DP2 compared to DP-SGD, DP-RMSProp, and DP-AdaGrad on IMDB (ε = 3.04, δ = 10⁻⁵) and Stack Overflow (ε = 0.87, δ = 10⁻⁶). Dotted lines denote training performance.

C.4 EFFECTS OF INCREASING COMPUTATIONAL BUDGETS

When differential privacy introduces a large utility gap between private and non-private training, one approach to improving the privacy/utility trade-off is to increase the computational cost by using larger batch sizes for a fixed number of steps. The noise multiplier must then increase to meet the same privacy target, but the overall privacy noise may still be reduced thanks to the larger batch size. This technique may be adopted in practice when we want to prioritize the utility of private optimization under a fixed privacy budget. In Figure 9 (right), we explore the effect of such increased computation on Stack Overflow. With a 4× increase in computational cost (4× larger batch sizes with the same number of training iterations), the privacy/utility trade-offs of all methods improve substantially, narrowing the utility gap to non-private training. In particular, the absolute improvement of DP2 over the vanilla DP baselines remains similar.

Figure 9: (Extension of Figure 5 to the AdaGrad update rule and increased computational cost.) Privacy/utility trade-offs of DP2 compared to DP-SGD, DP-RMSProp, and DP-AdaGrad on IMDB and Stack Overflow, with non-private SGD, RMSProp, AdaGrad, and their delayed-preconditioner variants as references. "4×" denotes increasing the batch size and the number of epochs simultaneously by a factor of 4 and picking the appropriate noise multiplier to arrive at a similar privacy cost (ε).

C.5 ADDITIONAL RESULTS FOR ABLATION STUDIES

Table 4 summarizes the results of the ablation studies on IMDB, Stack Overflow, and MovieLens, and Figure 10 reports test accuracies on IMDB and Stack Overflow during optimization. The variants are discussed in Section 5.3, and the complete algorithms are presented in Appendix D. We observe that DP2 consistently outperforms the two (weaker) variants on all datasets, verifying our design choices for DP2. In particular, the utility drop of Variant 2 (adding noise before preconditioning) is more significant on Stack Overflow than on IMDB; we attribute this to Stack Overflow being a high-dimensional learning task (roughly 5 million model parameters), which magnifies the detrimental effect of preconditioning per-coordinate noise (a numerical sketch of this effect follows Figure 10).

| Dataset | Variant 1 | Variant 2 | DP2-RMSProp |
|---|---|---|---|
| IMDB | .799 ± .006 | .643 ± .007 | .815 ± .011 |
| Stack Overflow | .382 ± .002 | .265 ± .004 | .391 ± .001 |
| MovieLens | 3.32 ± .088 | 3.18 ± .066 | 2.78 ± .054 |

Table 4: Summary of the ablation studies on all three datasets.

Figure 10: Test accuracies for the ablation studies on DP2 on IMDB (ε = 3.04, δ = 10⁻⁵) and Stack Overflow (ε = 0.87, δ = 10⁻⁶), comparing Variant 1 (extra query) and Variant 2 (noise first) against DP-SGD, DP-RMSProp, and DP2-RMSProp. Dotted lines correspond to training metrics.
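To see why preconditioning per-coordinate noise is so damaging in high dimensions, consider the expected squared norm of the injected Gaussian noise under the two orderings. The following self-contained numerical sketch compares them, assuming the common convention that the per-coordinate noise standard deviation in the averaged gradient is σC/b; the second-moment distribution, the adaptivity ε, and the default σ, C, b below are illustrative assumptions, not our experimental values.

```python
import numpy as np

rng = np.random.default_rng(0)

def preconditioned_noise_energy(v, eps, sigma=1.0, C=1.0, b=64):
    """Expected squared l2 norm of Gaussian privacy noise that is divided
    coordinate-wise by D = sqrt(v) + eps *after* injection (Variant 2).
    Assumes per-coordinate noise std sigma * C / b in the averaged gradient."""
    base_var = (sigma * C / b) ** 2
    return np.sum(base_var / (np.sqrt(v) + eps) ** 2)

for d in (10_001, 5_000_000):            # IMDB-scale vs. Stack Overflow-scale
    v = rng.uniform(0.0, 1e-4, size=d)   # many coordinates with tiny second moments
    noise_after = d * (1.0 / 64) ** 2    # DP2: noise is added after preconditioning
    noise_before = preconditioned_noise_energy(v, eps=1e-3)
    print(f"d={d:>9,}: after={noise_after:10.3g}  before={noise_before:10.3g}")
```

Under Variant 2 the noise energy scales like Σ_j σ²C²/(b²(√v_j + ε)²) and blows up when many v_j are small, whereas adding noise after preconditioning (as DP2 does) keeps it at dσ²C²/b² regardless of the preconditioner, so the gap grows with the dimension d.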
C.6 ADDITIONAL RESULTS FOR COMPARISON WITH PUBLIC-DATA-ASSISTED METHODS

Figure 11 extends the results in Section 5.2 with convergence plots on IMDB and Stack Overflow. On IMDB, we observe that despite not using any auxiliary information, the convergence of DP2-RMSProp is comparable to that of AdaDPS-RMSProp (Li et al., 2022), which uses 1% of the training data as public data (250 examples) to approximate the preconditioner. On Stack Overflow, where the same 1% public split corresponds to 2,460 examples, AdaDPS-RMSProp can outperform DP2. On the other hand, the extra public data do not help PDA-DPMD outperform DP2.

Figure 11: Test accuracies of DP2 compared against recent private (adaptive) methods that leverage public data (Amid et al., 2022; Li et al., 2022), on IMDB (ε = 3.04, δ = 10⁻⁵) and Stack Overflow (ε = 0.87, δ = 10⁻⁶). Dotted lines correspond to training metrics.

Figure 12: Comparing DP2 against a noisy AdaGrad variant based on Kairouz et al. (2021a), in which the gradients and the preconditioner are privatized separately (IMDB, ε = 3.04, δ = 10⁻⁵).

In Figure 12, we additionally implement a private AdaGrad method proposed in Kairouz et al. (2021a) that also leverages public data. Specifically, in each iteration, the algorithm clips and adds independent noise to both the clean gradients and the preconditioner estimated from clean gradients; it then uses public data to estimate a gradient subspace onto which the clipped and noised preconditioner is projected, in order to reduce the effect of the noise; finally, it preconditions the noisy gradient with the noisy preconditioner and takes an update step. Our implementation differs from Kairouz et al. (2021a) in that we use the diagonal form of the preconditioner instead of the full matrix form. To estimate the gradient subspace, we follow the approach described in Zhou et al. (2021): the projection matrix $V \in \mathbb{R}^{d\times k}$, where $d$ is the number of parameters and $k$ is the dimension of the subspace, is obtained by taking the top-$k$ eigenspace of
$$M^t = \frac{1}{|X_{\text{pub}}|}\sum_{x_i \in X_{\text{pub}}} \nabla_{w^t} f\big(x_i; w^t\big)\,\nabla_{w^t} f\big(x_i; w^t\big)^\top,$$
where $X_{\text{pub}}$ is the set of public examples (a sketch of this step is given below). Unfortunately, we have not obtained satisfactory results for this noisy AdaGrad algorithm. We remark that since the method is extremely computationally expensive (it involves the eigendecomposition of a d × d matrix, with d = 10,001, at every iteration), further hyperparameter tuning may help improve its performance. However, our ablation studies (Section 5.3 and Appendix C.5) may shed light on the current observations, since this method also privatizes gradients before preconditioning.
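For reference, here is a minimal NumPy sketch of this subspace-estimation step under our diagonal-preconditioner setup; the function names are our own illustrative choices, and a careful implementation would avoid materializing the d × d matrix when the number of public examples is much smaller than d.

```python
import numpy as np

def gradient_subspace(public_grads, k):
    """Top-k eigenspace of M = (1/n) sum_i g_i g_i^T from public gradients.

    public_grads: (n, d) array whose rows are per-example gradients on
    public data. Returns V of shape (d, k) spanning the estimated subspace.
    """
    n, d = public_grads.shape
    M = public_grads.T @ public_grads / n   # d x d second-moment matrix M^t
    eigvals, eigvecs = np.linalg.eigh(M)    # eigenvalues in ascending order
    return eigvecs[:, -k:]                  # eigenvectors of the top-k eigenvalues

def project_onto_subspace(u, V):
    """Project a d-dimensional vector (e.g., the noisy diagonal
    preconditioner) onto the column space of V."""
    return V @ (V.T @ u)
```

The d × d eigendecomposition here is exactly the per-iteration cost noted above; when n ≪ d, an equivalent top-k eigenspace can be recovered more cheaply from the n × n Gram matrix.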
D ALGORITHMS

For completeness, we present all algorithms mentioned in the main text in detail.

- Non-private version of DP2: only change Line 9 in Algorithm 1 so that the (preconditioned) batch gradients are averaged directly, without clipping or added Gaussian noise.

- DP2 with the AdaGrad update rule (DP2-AdaGrad): only change Line 5 in Algorithm 1 to
$$v \leftarrow v + \big(G^t/s_1\big)^2.$$

- DP2 with Yogi's additive update rule (DP2-Yogi): only change Line 5 in Algorithm 1 to
$$v \leftarrow v + (1-\beta)\,\mathrm{sign}\Big(\big(G^t/s_1\big)^2 - v\Big)\odot\big(G^t/s_1\big)^2.$$

- Ablation Variant 1 (extra query) with delayed preconditioners: see Algorithm 2. Observe that the clean batch gradients {g_{i,t}}_{i∈B} are privatized twice in most iterations (whenever (t − 1) mod s ≠ 0), increasing the total privacy cost.

- Ablation Variant 2 (noise before preconditioning) with delayed preconditioners: in Line 9 of Algorithm 1, privatize the batch gradients with the following replacement:
$$\tilde g^t \leftarrow \sum_{i\in B} \mathrm{clip}\big(g_{i,t}, C\big) + \mathcal{N}\big(0, \sigma^2 C^2 \mathbf{I}_d\big).$$

Algorithm 2: Ablation Variant 1 (extra query) using delayed preconditioners

Input: T, batch size b, noise multiplier σ, clipping thresholds C₁, C₂, initial model w⁰ ∈ R^d, v = 0, constant ε ∈ R₊, learning rate schedule α^t, moving-average parameter β, delay steps s

1: Set accumulator G⁰ ← 0
2: for t = 1, …, T do
3:   Uniformly at random sample a mini-batch B of size b from the private training data
4:   Compute individual gradients for each sample i ∈ B: g_{i,t} ← ∇f(x_i; w^{t−1})
5:   Privatize the gradients using the Gaussian mechanism, $\tilde g^t \leftarrow \sum_{i\in B} \mathrm{clip}(g_{i,t}, C_1) + \mathcal{N}(0, \sigma^2 C_1^2 \mathbf{I}_d)$, and accumulate the private gradients: G^t ← G^{t−1} + g̃^t
6:   if (t − 1) mod s = 0 then
7:     Update moment estimates: v ← βv + (1 − β)(G^t/s)²
8:     Reset accumulator: G^t ← 0
9:     Set final gradient: g^t ← g̃^t
10:  else
11:    Privatize the clean, preconditioned gradients using the Gaussian mechanism, $\hat g^t \leftarrow \sum_{i\in B} \mathrm{clip}\big(g_{i,t}/(\sqrt{v}+\epsilon), C_2\big) + \mathcal{N}(0, \sigma^2 C_2^2 \mathbf{I}_d)$, and set the final gradient: g^t ← ĝ^t
12:  Update model parameters: w^t ← w^{t−1} − α^t g^t
13: return w^T
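As a compact summary of the one-line differences above, the following Python sketch collects the three moment-update rules in a single helper; the function name and signature are our own illustrative choices.

```python
import numpy as np

def update_moments(v, G, s1, rule="rmsprop", beta=0.9):
    """Delayed second-moment update from the gradients accumulated in G over
    s1 steps; this is the single line that differs across DP2 variants."""
    g2 = (G / s1) ** 2                      # squared average delayed gradient
    if rule == "rmsprop":                   # Line 5 of Algorithm 1
        return beta * v + (1 - beta) * g2
    if rule == "adagrad":                   # DP2-AdaGrad
        return v + g2
    if rule == "yogi":                      # DP2-Yogi (additive update)
        return v + (1 - beta) * np.sign(g2 - v) * g2
    raise ValueError(f"unknown rule: {rule}")

# Example: one preconditioner refresh with the Yogi rule.
v = np.zeros(4)
v = update_moments(v, G=np.array([0.4, -0.2, 0.1, 0.0]), s1=2, rule="yogi")
```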