# Private Adaptive Optimization with Side Information

Tian Li 1, Manzil Zaheer 2, Sashank J. Reddi 3, Virginia Smith 1

Adaptive optimization methods have become the default solvers for many machine learning tasks. Unfortunately, the benefits of adaptivity may degrade when training with differential privacy, as the noise added to ensure privacy reduces the effectiveness of the adaptive preconditioner. To this end, we propose AdaDPS, a general framework that uses non-sensitive side information to precondition the gradients, allowing the effective use of adaptive methods in private settings. We formally show AdaDPS reduces the amount of noise needed to achieve similar privacy guarantees, thereby improving optimization performance. Empirically, we leverage simple and readily available side information to explore the performance of AdaDPS in practice, comparing to strong baselines in both centralized and federated settings. Our results show that AdaDPS improves accuracy by 7.7% (absolute) on average, yielding state-of-the-art privacy-utility trade-offs on large-scale text and image benchmarks.

1Carnegie Mellon University, 2Google DeepMind, 3Google Research. Correspondence to: Tian Li. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Privacy-sensitive applications in areas such as healthcare and cross-device federated learning have fueled a demand for optimization methods that ensure differential privacy (DP) (Dwork et al., 2006; Chaudhuri et al., 2011; Abadi et al., 2016; McMahan et al., 2018). These methods typically perturb gradients with random noise at each iteration in order to mask the influence of individual examples on the trained model. As the amount of privacy is directly related to the number of training iterations, private applications stand to benefit from optimizers that improve convergence speed. To capitalize on this, a number of recent works have naturally tried to combine DP with adaptive optimizers such as AdaGrad, RMSProp, and Adam, which have proven to be effective for non-private machine learning tasks, especially those involving sparse gradients or non-uniform stochastic noise (Duchi et al., 2011; Hinton et al., 2012; Kingma & Ba, 2015; Reddi et al., 2018a;b; Zhang et al., 2020).

[Figure 1. Test accuracy vs. number of epochs on IMDB with logistic regression (left: non-private SGD, Adam, and AdaS (ours); right: DP-SGD, DP-Adam, and AdaDPS (ours)). AdaS refers to using preconditioning in AdaDPS for non-private training. Adaptive methods (Adam) become less effective when trained with differential privacy (DP-Adam), while AdaDPS retains the benefits of adaptivity.]

Unfortunately, tasks where adaptive optimizers work particularly well (e.g., sparse, high-dimensional problems) are exactly the tasks where DP is known to degrade performance (Bassily et al., 2014). Indeed, as we show in Figure 1, this can result in existing private adaptive optimization methods performing only marginally better than simple baselines such as differentially private stochastic gradient descent (DP-SGD), even under generous privacy budgets. In this work, we aim to close the gap between adaptive optimization in non-private and private settings. We propose AdaDPS, a simple yet powerful framework leveraging non-sensitive side information to effectively adapt to the gradient geometry.
Our framework makes two key changes over prior art in private adaptive optimization: (1) rather than privatizing the gradients first and then applying preconditioners, we show that transforming the gradients prior to privatization can reduce the detrimental impact of noise; (2) to perform gradient transformations, we explore using simple, easily obtainable side information in the data. We discuss two practical scenarios for obtaining such information below.

With Public Data. A natural choice for side information is a small amount of public data generated from a similar distribution as the private data, a common assumption in private optimization (Amid et al., 2021; Asi et al., 2021; Kairouz et al., 2021a; Zhou et al., 2021; 2020). In practice, public data could be obtained through proxy data or from opt-out users who are willing to share their information (Kairouz et al., 2021a; Aldaghri et al., 2021). Indeed, the notion of heterogeneous DP, where subsets of samples require zero or weak privacy, has been extensively studied in prior works (e.g., Alaggan et al., 2015; Jorgensen et al., 2015). Another line of work assumes that the gradients are low rank and uses public data to estimate this gradient subspace, thereby mitigating some of the poor performance of DP in high-dimensional regimes discussed earlier. We do not consider using public data for this purpose in this work, as such a low-rank assumption might not hold in practice, particularly for the problem settings where adaptive optimizers are known to excel (Asi et al., 2021). Instead, we propose to use the public data more directly: we estimate gradient statistics on public data at each iteration, and then apply these statistics as a preconditioner before privatizing the gradients. Despite the simplicity of this procedure, we are unaware of any work that has explored it previously.

Without Public Data. Of course, there may also be applications where it is difficult to obtain public data, particularly data that follows the same distribution as the private data. In such scenarios, our insight is that for many applications, in lieu of public data we may have access to some common knowledge about the training data that can be (i) computed before training, and (ii) used to improve optimization performance in both private and non-private settings. For instance, in many language tasks, certain aggregate statistics (e.g., frequency) of different words/tokens are common knowledge or may be easily computed prior to training, and serve as reasonably good estimates of the predictiveness of each feature. AdaDPS considers leveraging such simple heuristics to precondition the gradients. Perhaps surprisingly, in our experiments (Section 6), we demonstrate that the performance of AdaDPS when scaling gradients via these simple statistics (such as feature frequency) can even match the performance of AdaDPS with public data.

We summarize our main contributions below. We propose a simple yet effective framework, AdaDPS, which preconditions the gradients with non-sensitive side information before privatizing them, realizing the full benefits of adaptive methods in private training. Depending on the application at hand, we show that such side information can be estimated from either public data or some common knowledge about the data (e.g., feature frequencies or TF-IDF scores in NLP applications). We analyze AdaDPS and provide convergence guarantees for both convex and non-convex objectives.
In convex cases, we analyze a specific form of AdaDPS using RMSProp updates to provably demonstrate the benefits of our approach relative to differentially private SGD when the gradients are sparse. Empirically, we evaluate our method on a set of real-world datasets. AdaDPS improves absolute accuracy by 7.7% on average compared with strong baselines under the same privacy budget, and can even achieve accuracy similar to adaptive methods in non-private settings. We additionally demonstrate how to apply AdaDPS to federated learning, where it outperforms existing baselines by a large margin.

2. Related Work

There are many DP algorithms for machine learning, including objective perturbation, gradient perturbation, and model perturbation. In this work, we focus on the popular gradient perturbation method with Gaussian mechanisms (Dwork et al., 2014). Without additional assumptions on the problem structure, DP algorithms can suffer from $O(\sqrt{d}/(n\varepsilon))$ excess empirical risk, where $d$ is the dimension of the model parameters and $n$ is the number of training samples (Bassily et al., 2014). While it is possible to mitigate such a dependence in the unconstrained setting (Kairouz et al., 2021a; Song et al., 2021) or by assuming oracle access to a constant-rank gradient subspace (Zhou et al., 2021; Kairouz et al., 2021a), we do not focus on such a setting in this work, as it does not align with the settings in which adaptive methods have been designed (Duchi et al., 2011; Asi et al., 2021). Recent work (Amid et al., 2021) proposes to use public data differently (evaluating the public loss as the mirror map in a mirror descent algorithm) to obtain dimension-independent bounds. However, this method does not account for gradient preconditioning, as its approximation is a linear combination of private and public gradients. We empirically verify AdaDPS's superior performance in Section 6 (Table 2).

In the context of adaptive differentially private optimization, we note that adaptivity may have various meanings. For example, Andrew et al. (2021) adaptively set the clipping threshold based on a private estimation of gradient norms, which is out of the scope of this work. Instead, our work is related to a line of work that aims to develop and analyze differentially private variants of common adaptive optimizers (e.g., private AdaGrad, private Adam), which mostly focus on estimating gradient statistics from noisy gradients (Zhou et al., 2020; Asi et al., 2021; Pichapati et al., 2019). They differ from AdaDPS in the order of preconditioning and privatization, the techniques used to approximate the gradient geometry, and the convergence analysis (see Section 5 for details). In our empirical evaluation (Section 6), we compare AdaDPS with these works on diverse tasks and show that even AdaDPS without public data can outperform them.

3. Preliminaries and Setup

In terms of privacy formulations, we consider classic sample-level DP in centralized settings (Section 6.1), and a variant of it, user-level DP, in distributed/federated environments (Section 6.2). We define both more formally below.

Definition 1 (Differential privacy (Dwork et al., 2006)). A randomized algorithm $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if for all neighbouring datasets $D, D'$ differing by one element, and every possible subset of outputs $O$,
$$\Pr(\mathcal{M}(D) \in O) \le e^{\varepsilon} \Pr(\mathcal{M}(D') \in O) + \delta.$$

Within DP, neighbouring datasets can be defined in different ways depending on the application of interest.
In this work, we also apply AdaDPS to federated learning (Section 6.2), where differential privacy is commonly defined at the granularity of users/devices (McMahan et al., 2018; Kairouz et al., 2021c), as stated below.

Definition 2 (User-level DP for federated learning (McMahan et al., 2018)). A randomized algorithm $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if for all datasets $U, U'$ differing by one user, and every possible subset of outputs $O$,
$$\Pr(\mathcal{M}(U) \in O) \le e^{\varepsilon} \Pr(\mathcal{M}(U') \in O) + \delta.$$

In centralized empirical risk minimization, our goal is to learn model parameters $w \in \mathbb{R}^d$ that fit $n$ training samples $\{x_i\}_{i \in [n]}$: $\min_w F(w) = \frac{1}{n}\sum_{i=1}^{n} f(x_i; w)$, where $f(\cdot)$ is the individual loss function. Optionally, there may exist public data, denoted $x_{\text{pub}}$, which does not overlap with $\{x_i\}_{i \in [n]}$. We focus primarily on classic centralized training, and later extend our approach to federated settings (Objective (1)) (McMahan et al., 2017).

[Figure 2. Gradient second-moment estimates across coordinate ids for Adam and DP-Adam. The estimates of gradient statistics (e.g., second moments) in private adaptive methods (e.g., DP-Adam) are noisy and may become uninformative of the relative importance of coordinates.]

In gradient-based optimization, adaptive optimizers and their properties have been extensively studied (e.g., Mukkamala & Hein, 2017; Reddi et al., 2018a; Duchi et al., 2011; Kingma & Ba, 2015). They effectively result in coordinate-wise learning rates, which can be advantageous for many learning problems. The preconditioners can be estimated via a moving average of mini-batch gradients (as in, e.g., Adam) or simply by summing the gradients observed so far (AdaGrad). In private settings, estimating the required statistics on noisy gradients can introduce significant noise (Figure 2), making these methods less effective. To address this, we introduce AdaDPS in the next section, and set up some notation below.

Notations. For vectors $u, v \in \mathbb{R}^d$, we use $u + v$ for coordinate-wise addition and $u / v$ for coordinate-wise division. For a vector $v$ and scalar $a$, $v + a$ denotes adding $a$ to every dimension of $v$. For any vector $v$, $v_j$ always denotes the $j$-th coordinate of $v$; for example, $g^{i,t}_j$ refers to the $j$-th coordinate of gradient $g^{i,t}$. We use $\|\cdot\|_M$ to denote the matrix norm defined as $\|x\|_M := \sqrt{\langle x, M x \rangle}$ for a symmetric and positive definite matrix $M \in \mathbb{R}^{d \times d}$, or a diagonal matrix with non-negative diagonal entries $M \in \mathbb{R}^d$.

4. AdaDPS: Private Adaptive Optimization with Side Information

Gradient-based private optimization methods usually update the model parameters with noisy gradients at each iteration and then release the private final models (Abadi et al., 2016). To control the sensitivity of computing and summing individual gradients from a mini-batch, methods based on the subsampled Gaussian mechanism typically first clip each individual gradient and then add i.i.d. zero-mean Gaussian noise with variance determined by the clipping threshold and the privacy budget. To use adaptivity effectively, in AdaDPS we instead propose first preconditioning the raw gradients with side information estimated either on public data or via some auxiliary knowledge, and then applying the Gaussian mechanism with noise multiplier $\sigma$ on top. AdaDPS in centralized training is summarized in Algorithm 1. Note that AdaDPS is a general framework in that it incorporates a set of private adaptive methods. As described in Algorithm 1, the functions $\phi$, $\varphi$, and $\mathcal{A}$ abstract the updating rules of different adaptive methods; a sketch of the underlying clip-and-noise step appears below.
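To be concrete, here is a minimal NumPy sketch of the per-example clip-and-noise primitive described above, following the $\frac{1}{b}\big(\sum_i \mathrm{clip}(g_i, C) + \mathcal{N}(0, \sigma^2 C^2)\big)$ form used in Algorithm 1. Function names and the small constant added to the norm are ours; this is not the authors' released code.

```python
import numpy as np

def clip(g, C):
    """Clip a vector g to L2 norm at most C (the clip(g, C) operation in Algorithm 1)."""
    norm = np.linalg.norm(g)
    return g * min(1.0, C / (norm + 1e-12))

def privatize(per_example_grads, C, sigma, rng):
    """Average clipped per-example gradients and add zero-mean Gaussian noise with
    per-coordinate standard deviation sigma * C before the 1/b scaling."""
    b = len(per_example_grads)
    clipped_sum = sum(clip(g, C) for g in per_example_grads)
    noise = rng.normal(0.0, sigma * C, size=clipped_sum.shape)
    return (clipped_sum + noise) / b
```

DP-SGD applies this primitive to the raw gradients; AdaDPS applies it to the preconditioned gradients, as instantiated next.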
Next, we describe the algorithm in more detail and instantiate $\phi$, $\varphi$, and $\mathcal{A}$.

Option 1 (With Public Data). We first consider estimating gradient statistics based on public data. Functions $\phi$, $\varphi$, and $\mathcal{A}$ can define a very broad set of common adaptive updating rules, as shown below.

Adam: $A^t$ is the square root of the second moment estimate and $M^t$ is the first moment estimate, with $\mathcal{A}(g^{i,t}, A^t, M^t) = \frac{\beta_t g^{i,t} + (1-\beta_t) M^t}{A^t}$ for some moving averaging parameter $\beta_t$.

AdaGrad: The update corresponds to $M^t = 0$, $(A^t)^2 = (A^{t-1})^2 + (\hat{g}^t)^2$, and $\mathcal{A}(g^{i,t}, A^t, M^t) = g^{i,t} / A^t$.

RMSProp: $A^t$ is the square root of the second moment estimate with $M^t = 0$, and $\mathcal{A}(g^{i,t}, A^t, M^t) = g^{i,t} / A^t$.

Algorithm 1: AdaDPS
Input: $T$, batch size $b$, noise multiplier $\sigma$, clipping threshold $C$, initial model $w^1 \in \mathbb{R}^d$, side information $A^t \in \mathbb{R}^d$, learning rate $\alpha_t$, potential momentum buffer $M^0 \in \mathbb{R}^d$
1: for $t = 1, \dots, T-1$ do
2:   Uniformly sample a mini-batch $B$ ($|B| = b$) from the training set and get $b$ gradients: $g^{i,t} \leftarrow \nabla f(x_i; w^t)$, $i \in B$
3:   Option 1: With public data $x_{\text{pub}}$
4:     Uniformly sample a mini-batch $B'$ ($|B'| = b'$) from $x_{\text{pub}}$, get gradients $\hat{g}^t \leftarrow \frac{1}{b'}\sum_{j \in B'} \nabla f(x_j; w^t)$, $x_j \in x_{\text{pub}}$, and update $A^t$ and $M^t$ with recurrences $\phi$ and $\varphi$ respectively: $A^t \leftarrow \phi(A^{t-1}, \hat{g}^t)$, $M^t \leftarrow \varphi(M^{t-1}, \hat{g}^t)$
5:   Option 2: Without public data, $A^t$ is estimated via heuristics
6:   Precondition individual gradients by $\mathcal{A}$: $\tilde{g}^{i,t} \leftarrow \mathcal{A}(g^{i,t}, A^t, M^t)$
7:   Privatize preconditioned gradients: $\bar{g}^t \leftarrow \frac{1}{b}\big(\sum_{i \in B} \mathrm{clip}(\tilde{g}^{i,t}, C) + \mathcal{N}(0, \sigma^2 C^2)\big)$, where $\mathrm{clip}(g, C)$ clips a vector $g$ to L2 norm $C$
8:   Update the model parameters: $w^{t+1} \leftarrow w^t - \alpha_t \bar{g}^t$
9: return $w^T$

We note that the AdaDPS framework can potentially incorporate other adaptive optimizers beyond those listed above. In our analysis and experiments, we mainly focus on using RMSProp updates to obtain the preconditioner, because AdaGrad, which sums up gradients over all iterations so far in the denominator, often has poor practical performance, and Adam needs to maintain an additional mean estimator. However, we also evaluate the use of Adam within AdaDPS in Table 6 in Appendix C, showing that it yields similar improvements as RMSProp across all datasets. In Section 5, we analyze convergence with $A^t$ set to $\sqrt{\mathbb{E}[(\hat{g}^t)^2]}$, and prove that in sparse settings, AdaDPS allows the addition of less noise under the same privacy budget.

Option 2 (Without Public Data). When there is no public data available, we develop simple and effective heuristics to determine which coordinates are more predictive based on non-sensitive side information. In particular, for generalized linear models in NLP applications, we set $A^t$ to be (i) proportional to the frequency of input tokens, or (ii) inversely proportional to the TF-IDF values of input tokens. Following a similar analysis to that of Option 1, we provide theoretical justification in Theorem 3 in Section 5.1.1. While these are simple approaches to remove the dependence on public data, we find that they can significantly outperform DP-SGD for real-world tasks with several million model parameters (Section 6.1).

Privacy guarantees. We now state the differential privacy guarantees of Algorithm 1. As the side information $A^t$ (as well as the potential momentum buffer $M^t$) is non-sensitive, its privacy property directly follows from previous results (Abadi et al., 2016).

Theorem 1 (Privacy guarantee of Algorithm 1 (Abadi et al., 2016)). Assume the side information $A^t$ is non-sensitive. There exist constants $c_1$ and $c_2$ such that for any $\varepsilon < c_1 b^2 T / n^2$, Algorithm 1 is $(\varepsilon, \delta)$-differentially private for any $\delta > 0$ if $\sigma \ge c_2 \frac{b \sqrt{T \log(1/\delta)}}{n \varepsilon}$.
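To make Algorithm 1 concrete, the following is a minimal NumPy sketch of one iteration of Option 1 with the RMSProp preconditioner. All names are ours and `grad_fn(x, w)` is an assumed helper returning a per-example gradient; this is a sketch, not the authors' released implementation.

```python
import numpy as np

def adadps_step(w, private_batch, public_batch, grad_fn, v, alpha, beta,
                C, sigma, eps, rng):
    """One iteration of Algorithm 1 (Option 1) with an RMSProp-style preconditioner.

    v is the running second-moment estimate, computed on public data only.
    """
    # Side information: second moments estimated on a public mini-batch.
    g_pub = np.mean([grad_fn(x, w) for x in public_batch], axis=0)
    v = beta * v + (1.0 - beta) * g_pub ** 2
    A = np.sqrt(v) + eps                      # non-sensitive preconditioner

    # Precondition each private per-example gradient BEFORE privatization.
    preconditioned = [grad_fn(x, w) / A for x in private_batch]

    # Clip and add Gaussian noise (subsampled Gaussian mechanism).
    b = len(private_batch)
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12)) for g in preconditioned]
    noisy_mean = (np.sum(clipped, axis=0)
                  + rng.normal(0.0, sigma * C, size=w.shape)) / b

    # Model update.
    w = w - alpha * noisy_mean
    return w, v
```

Note that the per-example gradients are divided by `A` before clipping and noising; moving that division to after the noise addition recovers the privatize-first baselines compared against in Section 6.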
4.1. Intuition for $A^t$

In this section we provide further intuition for the AdaDPS framework. When $A^t$ is an all-ones vector, AdaDPS reduces to the standard DP-SGD algorithm. Otherwise, $A^t$ is indicative of how informative each coordinate is. Intuitively, suppose clipping does not happen and the public data come from the same distribution as the private data, so that for the RMSProp preconditioner we have $A^t = \sqrt{\mathbb{E}[(g^{i,t})^2]}$. Then the effective transformation on each individual gradient $g^{i,t}$ is
$$\frac{g^{i,t} + \mathcal{N}\!\left(0,\ \sigma^2 C^2\, \mathbb{E}[(g^{i,t})^2]\right)}{\sqrt{\mathbb{E}[(g^{i,t})^2]}}.$$
This can be viewed as first adding non-isotropic noise proportional to the second moment of the gradients, and then applying RMSProp updates, which is beneficial as coordinates with higher second moments are more tolerant to noise. Therefore, AdaDPS could improve privacy/utility trade-offs via adding coordinate-specific noise (formalized in Theorem 2).

We next consider a toy example to highlight one of the regimes where AdaDPS (or, adaptive methods) is particularly effective. Consider a linear regression task with the objective $\min_{w \in \mathbb{R}^{500}} \frac{1}{2n}\sum_{i \in [n]} (w^\top x_i - y_i)^2$, where $n = 1{,}000$ and each sample $x_i \in \mathbb{R}^{500}$. In many real-world applications, the tokens (features) are sparse and their frequencies follow heavy-tailed distributions. Without loss of generality, we assume the first 10% of the features are frequent and uninformative, and the remaining 90% rare and informative. Let the $j$-th feature of each data point be sampled from a random variable $x_j \in \{0, 1\}$. Features and the underlying true model $w^*$ are generated as
$$\Pr(x_j = 1) = \begin{cases} 0.9, & j \le 50 \\ 0.01, & j > 50, \end{cases} \qquad w^*_j = \begin{cases} 0.01, & j \le 50 \\ 1.0, & j > 50. \end{cases}$$
Labels are generated by $y_i = \langle w^*, x_i \rangle + b_i$, where $b_i \sim \mathcal{N}(0, 0.01)$. For AdaDPS, we assume model engineers know which words are more frequent, thus setting $A^t_j = 1$ for $j \le 50$ and $A^t_j = 0.01$ otherwise, for all $t$. By using larger learning rates on informative coordinates, the side information helps to improve optimization performance dramatically (see results in Figure 3).

[Figure 3. Training loss on the linear regression problem described in Section 4, averaged over five runs (left: non-private training; right: private training comparing DP-SGD and AdaDPS). We tune optimal learning rates separately for each method. Private training (right) achieves $(4.13, 10^{-3})$-DP.]

Comparison to Asi et al. (2021). The most related work to ours is Asi et al. (2021), which adds non-isotropic noise lying in a non-uniform ellipsoid to the gradients, and (optionally) transforms the resulting gradients with the denominator used in AdaGrad. AdaDPS differs from their approach in several ways, as we (i) first precondition and then add noise (as opposed to the other way round), (ii) consider a broader class of preconditioners (beyond AdaGrad), and (iii) make the approaches to estimating the gradient geometry in Asi et al. (2021) more explicit, with or without public data (as discussed in previous sections). Empirically, we compare with another state-of-the-art method (Amid et al., 2021) which outperforms Asi et al. (2021), and demonstrate AdaDPS's superior performance (Section 6, Table 2).¹

5. Convergence Analysis

We now analyze the convergence of AdaDPS (Algorithm 1) with and without public data, in both convex (Section 5.1) and non-convex (Section 5.2) settings. When public data is available, we prove that AdaDPS adds less noise (with noise proportional to the magnitude of the gradients) compared with DP-SGD.
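For reference, the per-iteration update analyzed in this section (RMSProp preconditioner estimated on public data) can be written compactly as follows. This is a restatement of Algorithm 1 in the notation of the analysis, with $\epsilon_t$ a small constant:

```latex
% AdaDPS with the RMSProp preconditioner estimated on public data
% (restatement of Algorithm 1 for the convergence analysis).
\begin{align*}
  g^t       &= \tfrac{1}{b}\textstyle\sum_{i \in B} \nabla f(x_i; w^t)
               && \text{(private mini-batch gradient)} \\
  \hat{g}^t &= \tfrac{1}{b'}\textstyle\sum_{j \in B'} \nabla f(x_j; w^t), \quad x_j \in x_{\mathrm{pub}}
               && \text{(public mini-batch gradient)} \\
  v^t       &= \beta_t v^{t-1} + (1-\beta_t)\,(\hat{g}^t)^2
               && \text{(second-moment estimate)} \\
  A^t       &= \sqrt{v^t} + \epsilon_t
               && \text{(preconditioner)} \\
  w^{t+1}   &= w^t - \alpha_t \Big( \tfrac{g^t}{A^t} + N \Big), \quad
               N \sim \tfrac{1}{b}\,\mathcal{N}\!\big(0, \sigma^2 C^2 I\big)
               && \text{(noisy preconditioned step)}
\end{align*}
```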
Our theory extends previous proofs in related contexts, but considers stochastic gradients, adding random Gaussian noise to the processed gradients, and estimating the preconditioner on public data (as opposed to updating it with the raw gradients on training data). When there is no public data, we present convergence results for general At, covering the heuristics used in practice. 1We do not compare with Asi et al. (2021) directly as the code is not publicly available. 5.1. Convex Convergence For convex functions, we define the optimal model w as w argminw F(w). First we state a set of assumptions that are used in the analyses. Assumption 1. There exists a constant D such that wt w 2 D for any t [T]. Assumption 2. There exists a constant C such that gi,t At 2 C for any t [T] and i [n]. Assumption 3. Denote gt:= 1 i Bgi,t. There exists a constant a (0,1] such that for any j [d] and t [T], a(ˆgt j)2 (gt j)2 1 a(ˆgt j)2 holds. Assumption 2 aims to bound the L2 norm of the transformed gradient, thus resulting in bounded L2 sensitivity on the operation of calculating and averaging (scaled) individual gradients from a mini-batch. Assuming bounded stochastic gradient norm is standard in prior works on convex and non-convex private optimization (e.g., Kairouz et al., 2021a; Zhou et al., 2020). Assumption 3 bounds the dissimilarity between public and private data. Within the framework of Ada DPS, we explore the convergence of the RMSProp preconditioner where At= p E[(ˆgt)2]+ϵt (with public data) where ϵt is some small constant or At is obtained via side information heuristics. We first look at the case where there exist public data, and therefore the exact updating rule at the t-th iteration is: j B f(xj;wt) (xj xpub),1 i B f(xi;wt), vt βtvt 1+(1 βt)(ˆgt)2, At wt+1 wt αt gt At +N , N 1 b N(0,σ2C2). Theorem 2 below states the convergence guarantees. Theorem 2. Assume F(w) is a convex function w.r.t. w. Let Assumptions 1-3 hold. Additionally, choose βt such that 1 γ t holds for some γ (0,1]; and let t+1ϵt+1 tϵt for any t. After running Algorithm 1 using learning rates αt= α t with public data for T iterations, we have min t [T ]E[F(wt)] F G j=1 E AT j + α T max t [T ]E N 2 At , where F :=F(w ), G= D2 2α + α(2 γ) aγ , and AT j = q v T j +ϵT . The full proof is deferred to Appendix A.1. Our proof extend the proof of the original RMSProp method with full gradients in Mukkamala & Hein (2017) to stochastic and Private Adaptive Optimization with Side information private settings. From Theorem 4, we see that the first term in the bound is standard for the RMSProp optimizer, and the last term is due to noise added to ensure differential privacy. Fixing the noise multiplier σ, the second term depends on the clipping value C and the preconditioner At. We see that when the gradients are sparse, it is likely that the added DP noise would be significantly reduced. In other words, to guarantee overall (ε,δ)-differential privacy by run- ning T total iterations, we can set σ2=O b2T log(1/δ) maxt [T ]E[ At 1]log(1/δ) . The convergence rate therefore becomes min t [T ]E[F(wt)] F O maxt [T ]E[ At 1] When gradients are sparse (hence maxt [T ]E[ At 1]0 with probability p, and 0 with probability 1 p. It is straightforward to see the scaling of the last two terms in the convergence bound in Theorem 3: O(dpv)=O(dv), E N 2 A =σ2C2 d X j=1 Aj=O max i,t gi,t 2 dpv (pv+ϵ)2 E N 2 A will reduce to O(d) if the gradients are sparse in a certain way, i.e., maxi,t gi,t 2=O(p). 
Note that only using this simple first moment information (Aj=E[|xj|]+ ϵ), we are not able to obtain a constant improvement in the convergence bound as in Theorem 2 with public data. However, we empirically demonstrate that these ideas can be effective in practice in Section 6. 5.2. Non-Convex Convergence In addition to convex problems, we also consider convergence to a stationary point for non-convex and smooth functions. We introduce additional assumptions in the following. Assumption 4. F( ) is L-smooth. Assumption 5. The expectation of stochastic gradient variance is bounded, i.e., E[ gi,t j E[gi,t j ] 2 2] τ 2 j for all i,t,j. Denote τ 2:=(τ 2 1 , ,τ 2 d) Rd. Assumption 4 together with Assumption 1 implies that there exists a constant that bounds F(wt) for any t, which we denote as B. Theorem 4. Let Assumptions 1-5 hold. After running Algorithm 1 with public data for T iterations using a constant learning rate α and a constant ϵ, choosing the constants to satisfy α ϵ 2L and B 1 β aϵ 4 , we have t [T ] E F(wt) 2 O 1 The proof is mostly extended from Adam s proof in Reddi et al. (2018a) (see Appendix A.2 for complete steps). When the batch size increases, both the stochastic gradient noise and differential privacy noise would be reduced. Private Adaptive Optimization with Side information Theorem 5. Let Assumptions 1-5 hold. After running Algorithm 1 with a fixed A as prior information for T iterations using a constant learning rate α and a constant ϵ, choosing the constants to satisfy α ϵ t [T ] E F(wt) 2 A 1 O 1 6. Experiments We evaluate the performance of Ada DPS in both centralized (Section 6.1) and federated (Section 6.2) settings for various tasks and models. In centralized training, we investigate two practical scenarios for obtaining side information with and without public data (Section 6.1.1 and 6.1.2). We describe our experimental setup below; details of datasets, models, and hyperparameter tuning are described in Appendix B. Our code is publicly available at github.com/litian96/Ada DPS. Datasets. We consider common benchmarks for adaptive optimization in centralized or federated settings (Amid et al., 2021; Reddi et al., 2018a; 2021) involving varying types of models (both convex and non-convex) and data (both text and image data). Both linear and non-convex models contain millions of learnable parameters. Hyperparameters. We fix the noise multiplier σ for each task, and select an individual (fixed) clipping threshold for each differentially private run. To track the privacy loss (to ensure (ε,δ)-DP), we add the same amount of noise to all compared methods, set the δ value to be the inverse of the number of all training samples, and compute ε using R enyi differential privacy (RDP) accountant for the subsampled Gaussian mechanism (Mironov et al., 2019). 6.1. Centralized Training Common Baselines. One can directly privatize an adaptive optimizer by first privatizing the raw gradients, and then applying that adaptive method on top of noisy gradients (Zhou et al., 2020). We consider these baselines named DP-Adam or DP-RMSProp where the adaptive optimizer is chosen to be Adam or RMSProp (same as DP-Adam appearing in previous sections). As the empirical results of DP-Adam and DP-RMSProp are very similar (Table 6 in the appendix), in the main text, we mainly compare Ada DPS with DPAdam (Zhou et al., 2020) and DP-SGD (Abadi et al., 2016). For completeness, we present the exact DP-Adam algorithm in Appendix B. 6.1.1. 
WITH PUBLIC DATA In the main text, we set the public data size to be 1% of training data size. We further explore the effects of public data size empirically in Table 10, Appendix C. Next, we present results of comparing Ada DPS with several baselines, and results using both in-distribution (ID) and out-of-distribution (OOD) data as public data to estimate the preconditioner. Preconditioning Noisy Gradients with Public Data. In addition to DP-SGD and DP-Adam mentioned in Section 6.1, we consider another method of preconditioning the noisy gradients with second moment estimates obtained from public data. Specifically, the updating rule at iteration t is i B clip gi,t,C +1 b N(0,σ2C2), E[(ˆgt)2]+ϵt , where ˆgt:= f(x;wt) for x xpub. We call this adaptive baseline DP-R-Pub, which is equivalent to standard DP-RMSProp but using public data to estimate the preconditioner. Comparing with this method directly reflects the importance of the order of preconditioning and privatization in Ada DPS. Results with in-distribution proxy data (randomly sampled from training sets) are shown in Figure 4 and Table 1 below. We see that across three datasets, (i) DP-Adam does not necessarily outperform DP-SGD all the same, (ii) Ada DPS improves over all baselines significantly, including DP-RPub. Full results involving DP-RMSProp and Ada DPS with Adam as the updating rule are presented in Table 6. 0 5 10 15 20 25 30 35 40 # epochs test accuracy Stack Overflow DP-SGD DP-Adam DP-R-Pub Ada DPS 0 20 40 60 80 100 # epochs test accuracy IMDB (LSTM) DP-SGD DP-Adam DP-R-Pub Ada DPS Figure 4. Test accuracies of baselines and Ada DPS assuming access to public data. ε values on these two datasets are 0.84 and 2.8, respectively. Ada DPS significantly improves test performance, even reaching an accuracy much higher than the accuracy of SGD in non-private training on Stack Overflow. Methods Loss 100 (σ=1) Loss 100 (σ=0.75) DP-SGD 5.013 (.001) 3.792 (.001) DP-Adam 3.697 (.020) 3.286 (.016) Ada DPS 3.566 (.008) 3.158 (.003) Table 1. Test reconstruction loss (mean and standard deviation across three runs) on MNIST under a deep autoencoder model. σ=1 and σ=0.75 correspond to ε=1.6 and ε=3, respectively. DPAdam works well in this task compared with DP-SGD. Ada DPS improves over DP-Adam. Private Adaptive Optimization with Side information For completeness, we also evaluate Ada DPS on the MNIST image classification benchmark, and observe that it yields 0.5% 2% accuracy improvements (Table 6), depending on which specific adaptive method to use. Comparisons with Amid et al. (2021). We compare Ada DPS with one recent work, PDA-DPMD, which is the state-of-the-art that leverages public data to improve privacy/utility tradeoffs in a mirror descent framework (Amid et al., 2021). We take their proposed approximation, where the actual gradients are a convex combination of gradients on private and public data. As this approximation does not precondition gradients, PDA-DPMD could underperform Ada DPS in the tasks where adaptivity is critical, as shown in Table 2 below. Datasets Metrics PDA-DPMD Ada DPS Ada DPS w/ public w/ public w/o public IMDB accuracy 0.62 0.80 0.75 Stack Overflow accuracy 0.33 0.40 0.41 MNIST loss 0.039 0.036 Table 2. Comparison with a recent method (PDA-DPMD) using public data in private mirror descent. Ada DPS outperforms PDADPMD due to preconditioning. 
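Since the order of preconditioning and privatization is the key difference from the DP-R-Pub baseline introduced at the beginning of this subsection, a minimal NumPy sketch of a DP-R-Pub step is included here for contrast. All names are ours, and `grad_fn(x, w)` is the same assumed per-example gradient helper as before.

```python
import numpy as np

def dp_r_pub_step(w, private_batch, public_batch, grad_fn, v, alpha, beta,
                  C, sigma, eps, rng):
    """One step of the DP-R-Pub baseline: privatize raw gradients first, then
    apply an RMSProp-style preconditioner estimated on public data."""
    b = len(private_batch)

    # Clip and noise the RAW per-example gradients (no preconditioning yet).
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
               for g in (grad_fn(x, w) for x in private_batch)]
    noisy_mean = (np.sum(clipped, axis=0)
                  + rng.normal(0.0, sigma * C, size=w.shape)) / b

    # Precondition AFTER privatization, using public-data second moments.
    g_pub = np.mean([grad_fn(x, w) for x in public_batch], axis=0)
    v = beta * v + (1.0 - beta) * g_pub ** 2
    w = w - alpha * noisy_mean / (np.sqrt(v) + eps)
    return w, v
```

Compared with the AdaDPS step sketched in Section 4, only the point at which the division by the preconditioner happens changes: here the noise is added to the unscaled gradients, so it is no longer shaped by the coordinate-wise statistics.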
Out-Of-Distribution Public Data As mentioned in Section 1, public data could be obtained via a small amount of proxy data or opt-out users that do not need privacy protection. We consider two practical cases where we use OOD data to extract side information. For IMDB sentiment analysis, we use a small subset of Amazon reviews2 (Zhang et al., 2015) as public data (1% the size of IMDB). Amazon reviews study a more fine-grained 5-class classification problem on product reviews, and we map labels {1, 2} to 0 (negative), and labels {4, 5} to 1 (positive). For Stack Overflow tag prediction task which consists of 400 users with different styles and interested topics, we simply hold out the first four of them to provide public data. We show results in Table 3 below. We see that the preconditioners obtained from out-of-distribution but related public data are fairly effective. Datasets DP-SGD Ada DPS Ada DPS OOD public ID public IMDB 0.63 0.79 0.80 Stack Overflow 0.28 0.40 0.40 Table 3. Using small out-of-distribution data as public data achieves the same improvements. For IMDB, we leverage Amazon reviews data (1% the size of IMDB) as public data. For Stack Overflow, we hold out 1% users as those who opt out of privacy training. 2figshare.com/articles/dataset/Amazon Reviews Full/13232537/1 6.1.2. WITHOUT PUBLIC DATA When it is difficult to obtain public data that follow sufficiently similar distribution as private training data, we explore two specific heuristics as side information tailored to language tasks: token frequencies and TF-IDF values of input tokens (or features). These statistics are always known as open knowledge, thus can be used as an approximate how important each feature is. We compare Ada DPS with DP-SGD and DP-Adam described before. At Based on Token Frequencies. One easily obtainable side information is token frequencies, and we can set At (t [T]) to be proportional to that accordingly. Note that our implicit assumption here is that rare features are more informative than frequent ones. We investigate the logistic regression model on two datasets in Figure 4 below. Despite the simplicity, this simple method works well on Stack Overflow and IMDB under a tight privacy budget, outperforming DP-SGD and DP-Adam significantly. Especially for Stack Overflow, the test accuracy is the same as that of Ada DPSwith a small set of public data (Figure 4). 0 5 10 15 20 25 30 35 40 # epochs test accuracy Stack Overflow DP-SGD DP-Adam Ada DPS 0 20 40 60 80 100 # epochs test accuracy IMDB (convex) DP-SGD DP-Adam Ada DPS Figure 5. Ada DPS uses token frequencies as the side information, demonstrating superior performance than the baselines. Methods in each subfigure reach (0.84,4.2 10 6)- and (2.8,4 10 5)- DP. At Based on TF-IDF Values. Another common criterion to measure the relevance of tokens to each data point in a collection of training data is TF-IDF values. With the presence of such information available in a data processing pipeline, we explore the effects of At being inversely proportional to TF-IDF scores for linear models. The results are reported in Table 4. The ideal method refers to Ada DPS with RMSProp updates and the second moment estimated on clean gradients from private training data, which serves a performance upper bound. As expected, across all methods, TF-IDF features result in higher accuracies than Bo W features. Ada DPS outperforms the baselines by 10% absolute accuracy, only slightly underperforming the ideal oracle. Remark (Side Information in Non-Private Training). 
The idea of using side information (with or without public data) can also improve the performance of vanilla SGD in non-private training, yielding similar accuracies as that of Private Adaptive Optimization with Side information Features Methods DP-SGD DP-Adam Ada DPS ideal Bo W 0.62 (.02) 0.68 (.01) 0.75 (.01) 0.82 (.01) TF-IDF 0.68 (.01) 0.65 (.01) 0.80 (.00) 0.83 (.00) Table 4. We preprocess IMDB into two versions with either Bo W or TF-IDF features, and report average test accuracy along with standard deviation across three runs. Ada DPS with At being inversely proportional to features TF-IDF values outperforms the baselines of DP-SGD and DP-Adam by a large margin. Ada DPS also performs relatively closely to the ideal upper bound. adaptive methods. We report additional results along this line in Table 9 in the appendix. 6.2. Federated Learning In this section, we discuss Ada DPS adapted to federated learning (FL) (learning statistical models over heterogeneous networks while keeping all data local) to satisfy userlevel, global differential privacy (assuming a trusted central server). The canonical objective for FL is to fit a single model w Rd to data across a network of n devices: min w Rd F(w)= i=1 pifi(w), (1) where fi(w) is the empirical local loss on each device i, and pi is some pre-defined weight for device i such that Pn i=1pi=1, which can be 1 n or proportional to the number of local samples. In this work, we simply set pi= 1 n, i [n]. For this privacy-sensitive application, we assume there is no public data available. Due to the practical constraints of federated settings (e.g., unreliable network conditions, device heterogeneity, etc), federated optimization algorithms typically randomly samples a small subset of devices at each communication round, and allows each device to run optimization methods locally (e.g., local mini-batch SGD) before sending the updates back to the server (Mc Mahan et al., 2017). Adapting Ada DPS to federated learning is not straightforward, raising questions of applying preconditioning at the server side, the device side, or both (Wang et al., 2021). We empirically find that on the considered dataset, preconditioning the mini-batch gradients locally at each iteration demonstrates superior performance than preconditioning the entire model updates at the server side. The exact algorithm is summarized in Algorithm 3 in the appendix. We investigate the same Stack Overflow dataset described in Section 6.1, but follow its original, natural partition (by Stack Overflow users) for federated settings, one device per user. There are 400 devices in total for the subsampled version we use. We select 20 devices at each communication round and use a noise multiplier σ=0.3. While we arrive at a large ε value for user-level DP, the final model could still be useful for defending against membership inference attacks in practice (Kairouz et al., 2021b). 0 10 20 30 40 # communication rounds test accuracy Stack Overflow, non-private Fed Avg Fed Adam Ada S 0 10 20 30 40 # communication rounds test accuracy Stack Overflow, private DP-Fed Avg DP-Fed Adam Ada DPS Figure 6. Test accuracy of Stack Overflow in non-private and private settings under (34,0.0025) user-level DP (Definition 2). Ada DPS extended to federated learning (Algorithm 3 in the appendix) improves over baselines of DP-Fed Avg (Mc Mahan et al., 2018) and DP-Fed Adam by 5% in terms of test accuracy. In Figure 6, we plot test accuracy versus the number of communication rounds. 
Ada DPS has 5% higher test accuracy than other two methods. We note that in federated learning applications involving massive and unreliable networks, it is not always realistic to allow for uniform device sampling. Incorporating recent advances in DP without sampling (e.g., Kairouz et al., 2021b) to address this is left for future work. 7. Conclusion In this work, we explored a simple and effective framework, Ada DPS, to realize the benefits of adaptive optimizers in differentially private learning via side information. Such information is used to precondition gradients before privatizing them. We analyzed the benefits of Ada DPS in terms of reducing the effective noise to reach similar privacy bounds, and empirically validated its superior performance across various tasks in both centralized and federated settings. Acknowledgements We thank Brendan Mc Mahan and Abhradeep Thakurta for valuable discussions and feedback. The work of TL and VS was supported in part by the National Science Foundation Grant IIS1838017, a Google Faculty Award, a Facebook Faculty Award, the Private AI Collaborative Research Institute, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the National Science Foundation or any other funding agency. Abadi, M., Chu, A., Goodfellow, I., Mc Mahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Conference on Computer and Private Adaptive Optimization with Side information Communications Security, 2016. Alaggan, M., Gambs, S., and Kermarrec, A.-M. Heterogeneous differential privacy. ar Xiv preprint ar Xiv:1504.06998, 2015. Aldaghri, N., Mahdavifar, H., and Beirami, A. Feo2: Federated learning with opt-out differential privacy. ar Xiv preprint ar Xiv:2110.15252, 2021. Amid, E., Ganesh, A., Mathews, R., Ramaswamy, S., Song, S., Steinke, T., Suriyakumar, V. M., Thakkar, O., and Thakurta, A. Public data-assisted mirror descent for private model training. ar Xiv preprint ar Xiv:2112.00193, 2021. Andrew, G., Thakkar, O., Mc Mahan, H. B., and Ramaswamy, S. Differentially private learning with adaptive clipping. In Advances in Neural Information Processing Systems, 2021. Asi, H., Duchi, J., Fallah, A., Javidbakht, O., and Talwar, K. Private adaptive gradient methods for convex optimization. In International Conference on Machine Learning, 2021. Authors, T. T. F. Tensor Flow Federated Stack Overflow dataset, 2019. URL https://www.tensorflow. org/federated/api_docs/python/tff/ simulation/datasets/stackoverflow/ load_data. Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Symposium on Foundations of Computer Science, 2014. Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. Journal of Machine Learning Research, 2011. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011. Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, 2006. Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 2014. Hinton, G., Srivastava, N., and Swersky, K. Rmsprop: Divide the gradient by a running average of its recent magnitude. 
Neural Networks for Machine Learning, Coursera Lecture 6e, 2012. Jorgensen, Z., Yu, T., and Cormode, G. Conservative or liberal? personalized differential privacy. In International Conference on Data Engineering, 2015. Kairouz, P., Diaz, M. R., Rush, K., and Thakurta, A. (nearly) dimension independent private erm with adagrad rates via publicly estimated subspaces. In Conference on Learning Theory, 2021a. Kairouz, P., Mc Mahan, B., Song, S., Thakkar, O., Thakurta, A., and Xu, Z. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, 2021b. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 2021c. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 1998. Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017. Mc Mahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. International Conference on Learning Representations, 2018. Mironov, I., Talwar, K., and Zhang, L. R enyi differential privacy of the sampled gaussian mechanism. ar Xiv preprint ar Xiv:1908.10530, 2019. Mukkamala, M. C. and Hein, M. Variants of rmsprop and adagrad with logarithmic regret bounds. In International Conference on Machine Learning, 2017. Pichapati, V., Suresh, A. T., Yu, F. X., Reddi, S. J., and Kumar, S. Adaclip: Adaptive clipping for private sgd. ar Xiv preprint ar Xiv:1908.07643, 2019. Reddi, S., Zaheer, M., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018a. Private Adaptive Optimization with Side information Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Koneˇcn y, J., Kumar, S., and Mc Mahan, H. B. Adaptive federated optimization. International Conference on Learning Representations, 2021. Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018b. Song, S., Steinke, T., Thakkar, O., and Thakurta, A. Evading the curse of dimensionality in unconstrained private glms via private gradient descent. In International Conference on Artificial Intelligence and Statistics, 2021. Wang, J., Charles, Z., Xu, Z., Joshi, G., Mc Mahan, H. B., et al. A field guide to federated optimization. ar Xiv preprint ar Xiv:2107.06917, 2021. Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S. J., Kumar, S., and Sra, S. Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, 2020. Zhang, X., Zhao, J., and Le Cun, Y. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 2015. Zhou, Y., Chen, X., Hong, M., Wu, Z. S., and Banerjee, A. 
Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds. ar Xiv preprint ar Xiv:2006.13501, 2020. Zhou, Y., Wu, Z. S., and Banerjee, A. Bypassing the ambient dimension: Private sgd with gradient subspace identification. In International Conference on Learning Representations, 2021. Private Adaptive Optimization with Side information A. Convergence A.1. Proof for Theorem 2 (Convex cases) Based on our assumptions, the updating rule becomes i B f(xi;wt) (2) j B f(xj;wt), xj xpub (3) vt βtvt 1+(1 βt)(ˆgt)2 (4) wt+1 wt αt gt At +N , N 1 b N(0,σ2C2) (6) We extend the proof in Mukkamala & Hein (2017) to stochastic, private cases with preconditioner estimated on public data. Based on the updating rule, we have wt+1 w 2 At (7) = wt αt(At) 1gt αt N w 2 At (8) = wt w 2 At+ αt(At) 1gt+αt N 2 At 2 wt w ,αtgt+αt At N (9) = wt w 2 At 2αt gt,wt w +(αt)2 gt,(At) 1gt 2αt wt w ,At N +(αt)2 N 2 At+2(αt)2 gt,N . (10) Rearranging terms gives gt,wt w = wt w 2 At wt+1 w 2 At 2αt +αt 2 gt,(At) 1gt wt w ,At N +αt 2 N 2 At+αt gt,N . (11) Take expectation on both sides conditioned on wt, F(wt),wt w =Et wt w 2 At Et[ wt+1 w 2 At] 2αt +αt 2 Et gt,(At) 1gt +αt 2 Et N 2 At , (12) where we have used the fact that N is a zero-mean Gaussian variable independent of gt,wt. Taking expectation on both sides and using the convexity of F( ): E[F(wt)] F(w ) E[ wt w 2 At] E[ wt+1 w 2 At] 2αt +αt 2 E[ gt,(At) 1gt ]+αt 2 E N 2 At . (13) Applying telescope sum, we have E[F(wt)] F(w ) (14) w1 w 2 A1 2α1 + E wt w 2 At 2αt E wt w 2 At 1 2 E gt,(At) 1gt + 2 E N 2 At . (15) E[F(wt)] F(w ) (16) w1 w 2 A1 2α + t At t 1At 1 i 2α | {z } T1 t E gt,(At) 1gt t E N 2 At . (17) Private Adaptive Optimization with Side information t for some γ (0,1] and tϵt t 1ϵt 1. We first bound T1. Based on the relations between vt and vt 1 and βt 1 1 t , we can prove for any t,j βtvt 1 j +(1 βt)(ˆgt j)2+ϵt q (t 1)vt 1 j + t 1ϵt 1. (18) So for any j,t, t 1At 1 j . (19) t At t 1At 1 j=1 (wt j w j )2 q (t 1)vt 1 j t=2 (wt j w j )2 q (t 1)vt 1 j (t 1)vt 1 j We next bound T2. We prove a variant of Lemma 4.1 in Mukkamala & Hein (2017). The major differences are in that we consider the stochastic case and estimating vt on public data. We prove the following inequality by induction: TϵT i , j [d]. (24) (1 β1)(ˆg1 j )2+ϵ1 (1 β1)(ˆg1 j )2+ϵ1 (1 β1)(ˆg1 j )2+ϵ1 Suppose that the conclusion holds when t=T 1, i.e., for any j [d], (T 1)v T 1 j + T 1ϵT 1 . (26) In addition, combined with the fact that v T j =βT v T 1 j +(1 βT )(ˆg T j )2 and T 1ϵT 1, we have (T 1)v T 1 j + (T 1)v T j βT (T 1)(1 βT )(ˆg T j )2 Tv T j (T 1)(1 βT )(ˆg T j )2 Tv T j (T 1)(1 βT )(ˆg T j )2 TϵT a(T 1)(1 βT )(g T j )2 Private Adaptive Optimization with Side information The third inequality comes from a b a b 2 a (a b) by letting a be Tv T j and b be (T 1)(1 βT )(ˆg T j )2 TϵT a(T 1)(1 βT )(g T j )2 (g T j )2 q TϵT i + 1 (2 γ)(T 1)(1 βT ) (g T j )2 q TϵT i . (33) We then bound T2 as follows. TϵT i . (34) Noting that w1 w 2 A1 2α D2 v1 j +ϵ1 , (35) combined with the bounds of T1,T2 yields E[F(wt) F(w )] D2 TϵT i , (36) which implies that min t [T ]E[F(wt)] F(w ) D2 t E N 2 At (37) v T j +ϵT i + α T max t [T ]E N 2 At . (38) The first term is standard for the RMSprop optimizer, and the last term is due to noise added to ensure differential privacy. To guarantee overall (ε,δ)-differential privacy by running T total iterations, we set σ2=O b2T log(1/δ) maxt [T ]E[ Pd j=1At j]log(1/δ) . 
The convergence rate becomes min t [T ]E[F(wt)] F(w ) O maxt [T ]E h Pd j=1At j i log(1/δ) A.1.1. PROOF FOR THEOREM 3 (FIX At BEFORE TRAINING) Denote the side information as a fixed A at any iteration t. Similar as previous analysis, setting a decaying learning rate αt= α E[F(wt)] F(w ) (40) w1 w 2 A 2α + 2α | {z } T1 t E gt,(A) 1gt t E N 2 A . (41) Private Adaptive Optimization with Side information To bound T1, we have t=2 E h wt w 2 t A t 1A i =E j=1 (wt j w j )2 ( w1 w 2 A. (42) We consider T2 next. From the assumptions on the clipping bound, R:=max j,t E (gt j)2 A2 j C2. (43) t Aj E (gt j)2 j=1 Aj. (44) Therefore, we obtain min t [T ]E[F(wt)] F(w ) O A.2. Proof for Theorem 4 (Non-convex and smooth cases) We use the same ϵ at each iteration. Let j F(w) denote the j-th coordinate of F(w) for any w. Based on the L-smoothness of F( ), F(wt+1) F(wt) αt d X vt j+ϵ 2 +N 2+ gt j q b N(0,σ2C2) and E[N 2]= σ2C2 b2 . Taking expectation conditioned on wt on both sides gives Et[F(wt+1)] F(wt) αt d X j=1 j F(wt)Et 2b2 σ2C2. (47) The following proof is extended from that of Theorem 1 in Reddi et al. (2018a). Et[F(wt+1)] F(wt) αt d X j=1 j F(wt)Et vt j+ϵ gt j q βvt 1 j +ϵ + gt j q 2b2 σ2C2 (48) F(wt) αt d X ( j F(wt))2 βvt 1 j +ϵ +αt d X vt j+ϵ gt j q 2b2 σ2C2. (49) Private Adaptive Optimization with Side information vt j+ϵ gt j q βvt 1 j +ϵ gt j βvt 1 j +ϵ (1 β)(ˆgt j)2 q βvt 1 j (51) βvt 1 j +ϵ (1 β)(ˆgt j)2 q βvt 1 j +(1 β)(ˆgt j)2+ q βvt 1 j (52) βvt 1 j +ϵ 1 β a (gt j)2 1 q βvt 1 j +ϵ ϵ 1 β a (gt j)2. (53) We have used the observation (1 β)(ˆgt j)2 q βvt 1 j +(1 β)(ˆgt j)2+ q βvt 1 j 1 βˆgt j, and ˆgt j 1 agt j. From L-smoothness of F( ) which implies that F(u) F(v) L u v for any u,v Rd, and Assumption 1, it is easy to see there exists a constant B such that | j F(w)| B for any j [d]. Et[F(wt+1)] (54) F(wt) αt d X ( j F(wt))2 βvt 1 j +ϵ +αt B 1 β F(wt) αt d X ( j F(wt))2 βvt 1 j +ϵ +αt B 1 β where the last inequality holds due to q vt j+ϵ 2 ϵ ϵ+ q vt j ϵ ϵ+ q βvt 1 j . Lemma 1 in Reddi et al. (2018a) proves that Et (gt j)2 τ 2 j b +( j F(wt))2 where τ 2 j is the variance bound of the j-th coordinate, i.e., E gt j j F(wt) 2 τ 2 j . Plugging this inequality into Eq. (56), combined with Lαt 4 and B 1 β aϵ 1 4, we obtain Et[F(wt+1)] F(wt) αt 2( βB+ϵ) F(wt) 2+ αt B 1 β aϵ2 +L(αt)2 j [d]τ 2 j b +(αt)2Ld 2b2 σ2C2. (57) Taking expectation on both sides and applying the telescope sum yields t [T ] E[ F(wt) 2] O 1 A.2.1. PROOF FOR THEOREM 5 (FIX A BEFORE TRAINING) Due to L-smoothness of F( ), we have F(wt+1) F(wt) αt d X j F(wt) gt j Aj +N +(αt)2L A2 j +N 2+ gt j Aj 2N Private Adaptive Optimization with Side information b N(0,σ2C2). Taking expectation conditioned on wt on both sides gives Et F(wt+1) F(wt) αt d X ( j F(wt))2 1 A2 j Et (gt j)2 +(αt)2Ld 2b2 σ2C2 (60) F(wt) αt d X ( j F(wt))2 τ 2 j b +( j F(wt))2 ! 2b2 σ2C2 (61) F(wt) αt d X ( j F(wt))2 τ 2 j b +( j F(wt))2 ! 2b2 σ2C2 (62) F(wt) 2 A 1+(αt)2L τ 2 j Ajb+(αt)2Ld 2b2 σ2C2. (63) The last inequality is due to αt ϵ L. Taking expectation on both sides yields E F(wt+1) E F(wt) αt 2 E[ F(wt) 2 A 1]+(αt)2L τ 2 j Ajb+(αt)2Ld 2b2 σ2C2. (64) Similarly, by rearranging terms and applying telescope sum, we obtain t [T ] E F(wt) 2 A 1 O 1 B. Experimental Details Pseudo Code of some Algorithms. For completeness, we present the full baseline DP-Adam algorithm in Algorithm 2 and Ada DPS extended to federated learning in Algorithm 3. 
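As a complement to the pseudocode below, the following is a compact NumPy simulation sketch of one round of Algorithm 3 (AdaDPS applied to federated learning). All function and variable names are ours, `grad_fn(x, w)` is an assumed per-example gradient helper, and the server-side learning rate is fixed to 1 (the unscaled average of noisy updates), as in our experiments.

```python
import numpy as np

def local_update(w_global, data, grad_fn, A, eta, s, batch_size, rng):
    """Device-side: s steps of mini-batch SGD preconditioned by side information A."""
    w = w_global.copy()
    for _ in range(s):
        batch = [data[k] for k in rng.choice(len(data), batch_size)]
        g = np.mean([grad_fn(x, w) for x in batch], axis=0)
        w = w - eta * g / A                     # precondition locally at each step
    return w - w_global                         # model update (delta)

def federated_round(w_global, devices, grad_fn, A, eta, s, batch_size,
                    num_selected, C, sigma, rng):
    """Server-side: one simulated round of AdaDPS for federated learning (Algorithm 3)."""
    selected = rng.choice(len(devices), num_selected, replace=False)
    noisy_deltas = []
    for i in selected:
        delta = local_update(w_global, devices[i], grad_fn, A, eta, s, batch_size, rng)
        delta = delta * min(1.0, C / (np.linalg.norm(delta) + 1e-12))   # clip update
        noisy_deltas.append(delta + rng.normal(0.0, sigma * C, size=delta.shape))
    # Server learning rate fixed to 1: apply the unscaled average of noisy updates.
    return w_global + np.mean(noisy_deltas, axis=0)
```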
Algorithm 2: DP-Adam (Zhou et al., 2020) Input: T, batch size b, noise multiplier σ, clipping threshold C, initial model w1 Rd, v0=0, m0=0, small constant vector ϵt Rd, learning rate αt, moving average parameters β1, β2 1 for t=1, ,T 1 do 2 Uniformly randomly sample a mini-batch B with size b from private training data 3 Get individual gradients for sample i B: gi,t f(xi;wt) 4 Private gradients using Gaussian mechanism: i B clip gi,t,C +1 b N(0,σ2C2) 5 Update first and second moment estimates as mt β1mt 1+(1 β1) gt vt β2vt 1+(1 β2)( gt)2 6 Update the model parameter w as wt+1 wt αt mt/(1 βt 1) p vt/(1 βt 2)+ϵt , where βt 1,βt 2 denotes β1,β2 to the power of t (with slight abuse of notations) 7 return w T Private Adaptive Optimization with Side information Algorithm 3: Ada DPS applied to federated learning Input: T communication rounds, b selected devices each round, noise multiplier σ, clipping threshold C, initial model w1 Rd, non-sensitive side information A, number of local iterations s, local learning rate ηt 1 for t=1, ,T 1 do 2 Server uniformly selects a subset B of b devices and sends the current global model wt to them 3 Each device i B sets the local model to be the current global model: 4 Each device i B runs adaptive mini-batch SGD locally with side information A to obtain model updates: 5 for j=0, ,s do wi,j+1 wi,j ηt f(wi,j) 7 And then privatize model updates: i,t wi,s+1 wi,0 i,t clip( i,t,C)+N(0,σ2C2) 8 Each device i B sends i,t to the server 9 Server updates the global model as: 10 return w T Datasets and Models. We consider a diverse set of datasets and tasks. Stack Overflow (Authors, 2019) consists of posts on the Stack Overflow website, where the task is tag prediction (500class classification). We randomly subsample 246,092 samples from the entire set. There are 10,000 input features in Stack Overflow, resulting in more than 5 million learnable parameters in a logistic regression model. IMDB (Maas et al., 2011) is widely used for for binary sentiment classification of movie reviews, consisting of 25,000 training and 25,000 testing samples. We study two models on IMDB: logistic regression and neural networks (with LSTM) with 20,002 and 706,690 parameters, respectively. For logistic regression, we set the vocabulary size to 10,000 and consider two sets of commonly-used features separately: bag-of-words (Bo W) and TF-IDF values. MNIST (Le Cun et al., 1998) images with a deep autoencoder model (for image reconstruction) which has the same architecture as that in previous works (Reddi et al., 2018a) (containing more than 2 million parameters). The loss is reconstruction error measured as the mean squared distance in the pixel space. We scale each input feature to the range of [0,1]. Hyperparameter Tuning. We detail our hyperparameter tuning protocol and the hyperparameter values here. Our code is publicly available at github.com/litian96/Ada DPS. For non-private training experiments, we fix the mini-batch size to 64, and tune fixed learning rates by performing a grid search over {0.0005,0.001,0.005,0.01,0.05,0.1,0.2,0.5,1,2} separately for all methods on validation data. We do not use momentum for Ada S (i.e., applying the idea of preconditioning of Ada DPS without privatization) for all centralized training experiments. For differentially private training, the δ values in the privacy budget are always inverse of the number of training samples. We fix the noise multiplier σ for each dataset, tune the clipping threshold, and compute the final ε values. 
Specifically, the σ values are 1, 1, and 0.95 for IMDB (convex), IMDB (LSTM), and Stack Overflow; 1 and 0.75 for MNIST (autoencoder). The clipping threshold C (in Algorithm 1) is tuned from {0.01,0.02,0.05,0.1,0.2,0.5,1,2,3}, jointly with tuning the (fixed) learning rates. The number of micro-batches is 16 for all related experiments, and the mini-batch size is 64 (i.e., we privatize each gradient averaged over 4 individual ones to speed up computation). Private Adaptive Optimization with Side information For federated learning experiments, we fix server-side learning rate to be 1 (i.e., simply applying the unscaled average of noisy model updates in Line 9 of Algorithm 3), and apply server-side momentum (Reddi et al., 2021) with a moving average parameter 0.9 for all methods in the left sub-figure in Figure 6. The number of local epochs is set to 1 for all runs, and the local mini-batch size is 100. The tuned hyperparameter values (clipping threshold C, learning rate) for private training are summarized in Table 5 below. Datasets DP-SGD DP-Adam DP-RMSProp Ada DPS (RMSProp) Ada DPS (Adam) IMDB (convex) (0.1, 1) (0.02, 0.1) (0.05, 0.1) (2, 0.5) (2, 1) IMDB (LSTM) (0.1, 0.1) (0.1, 0.001) (0.1, 0.001) (0.2, 0.1) (0.2, 0.05) Stack Overflow (linear) (0.1, 1) (0.1, 0.01) (0.2, 0.01) (1, 0.5) (1, 0.5) MNIST (autoencoder) (0.05, 0.5) (0.01, 0.001) (0.01, 0.001) (3, 0.005) (3, 0.005) MNIST (classification) (0.5, 0.01) (0.5, 0.001) (0.5, 0.001) (2, 0.005) (2, 0.005) Table 5. Major hyperparameter values (learning rate and clipping threshold C) used in private experiments for all datasets. IMDB (convex) is IMDB (Bo W features) on a logistic regression model. Stack Overflow results are for centralized training. The noise multiplier σ values in these four tasks are 1, 1, 0.95, 1, respectively, resulting in ε values being 1.5, 2.8, 0.84, and 1.6. C. Additional Results C.1. Additional Baselines Other Private Adaptive Optimization Baselines. In the main text, we mainly compare Ada DPS with DP-Adam (summarized in Algorithm 2). There are other possible baselines similar as DP-Adam, by replacing Adam with other adaptive methods, resulting in DP-Ada Grad and DP-RMSProp. This line of differentially private optimizers has similar empirical performance as DP-Adam, as shown in the results in Table 6 below (using DP-RMSProp as an example). Datasets Metrics DP-SGD DP-Adam DP-RMSProp Ada DPS (RMSProp) Ada DPS (Adam) IMDB (convex) accuracy 0.63 0.69 0.67 0.80 0.80 IMDB (LSTM) accuracy 0.70 0.69 0.69 0.73 0.73 Stack Overflow (linear) accuracy 0.28 0.30 0.31 0.40 0.40 MNIST (autoencoder) loss ( 100) 5.013 3.697 3.636 3.566 3.443 MNIST (classification) accuracy 0.9273 0.9333 0.9314 0.9377 0.9541 Table 6. Full comparisons between Ada DPS and private adaptive optimization methods. The evaluation metrics are reported on test data. IMDB (convex) is IMDB (Bo W features) on a logistic regression model. For (ε,δ)-differential privacy, the ε values of experiments in the five rows are 1.5, 2.8, 0.84, 1.6, and 1.25, respectively, and the δ values are the inverse of the number of training samples (as mentioned in the main text). Using Public Data for Pretraining. Another possible way of leveraging public data to improve privacy/utility tradeoffs is to pretrain on them. However, this would give only limited performance improvement if the amount of public data is very small. In the main text, when needed, Ada DPS randomly samples 1% training data as public data. 
Under this setup, we empirically compare AdaDPS with the pre-training baseline (denoted DP-SGD w/ warm start). From Table 7, we see that AdaDPS outperforms it by a large margin.

| Datasets | Metrics | DP-SGD | DP-SGD w/ warm start | AdaDPS w/ public |
|---|---|---|---|---|
| IMDB (convex) | accuracy | 0.63 | 0.73 | 0.80 |
| Stack Overflow | accuracy | 0.28 | 0.33 | 0.40 |
| MNIST (autoencoder) | loss | 0.050 | 0.049 | 0.036 |

Table 7. Comparison of AdaDPS with an additional baseline of DP-SGD pre-trained on public data on three datasets. For DP-SGD w/ warm start, we first train on public data for 10 epochs via adaptive methods (RMSProp), and then run DP-SGD on private data starting from that initialization.

DP-Adam with Public Data. In the main text (Section 6.1.1), we discuss the DP-R-Pub baseline based on the RMSProp method with the preconditioner estimated on public data. Similarly, one can also apply such clean preconditioners in DP-Adam updates, resulting in another baseline which we call DP-Adam-Pub. The main difference between DP-Adam-Pub and DP-R-Pub is that DP-Adam-Pub additionally considers momentum. Formally, the update of $w^t$ is as follows (a short code sketch is given at the end of this appendix):
$$\tilde{g}^t \leftarrow \frac{1}{b}\Big(\sum_{i \in B} \mathrm{clip}(g^{i,t}, C) + \mathcal{N}(0, \sigma^2 C^2)\Big), \qquad \hat{g}^t \leftarrow \mathbb{E}_x[\nabla f(x; w^t)] \ \text{for } x \sim x_{\text{pub}},$$
$$m^t \leftarrow \beta_2 m^{t-1} + (1-\beta_2)\big(\beta_1 \tilde{g}^t + (1-\beta_1)\hat{g}^t\big), \qquad v^t \leftarrow \beta_3 v^{t-1} + (1-\beta_3)(\hat{g}^t)^2,$$
$$w^{t+1} \leftarrow w^t - \alpha_t \frac{m^t / (1 - (\beta_2)^t)}{\sqrt{v^t / (1 - (\beta_3)^t)} + \epsilon},$$
where $\beta_1, \beta_2, \beta_3, \epsilon$ are constants.

| Datasets | Metrics | DP-SGD | DP-Adam-Pub | AdaDPS w/ public |
|---|---|---|---|---|
| IMDB (convex) | accuracy | 0.63 | 0.74 | 0.80 |
| Stack Overflow | accuracy | 0.28 | 0.31 | 0.40 |
| MNIST (autoencoder) | loss | 0.050 | 0.064 | 0.036 |

Table 8. Results comparing AdaDPS to DP-Adam-Pub (i.e., DP-Adam using clean preconditioners estimated on public data).

C.2. Side Information in Non-Private Training

In the main text, we mainly focus on private optimization. It is expected that side information (even without the assistance of public data) would also be beneficial in non-private settings, and could serve as a simple alternative to adaptive methods. We report results in Table 9.

| Datasets | Metrics | SGD | Adam | AdaS (w/ public) | AdaS (w/o public) |
|---|---|---|---|---|---|
| IMDB (convex) | accuracy | 0.66 | 0.88 | 0.88 | 0.88 |
| IMDB (LSTM) | accuracy | 0.88 | 0.88 | 0.88 | 0.88 |
| Stack Overflow (linear) | accuracy | 0.38 | 0.64 | 0.64 | 0.64 |
| MNIST (autoencoder) | loss (×100) | 5.013 | 1.151 | 1.805 | |

Table 9. Performance of each method in non-private training. We see that AdaS can match the performance of Adam in non-private settings.

C.3. Effects of Public Data Size

We further study the effects of public data size. Only a very small set of public data (even 0.04% the size of the private training data) can provide good preconditioner estimates.

| Datasets | upper bound | AdaDPS (1% public) | AdaDPS (5x less) | AdaDPS (25x less) |
|---|---|---|---|---|
| IMDB (convex) | 0.82 | 0.80 | 0.80 | 0.75 |
| Stack Overflow | 0.41 | 0.40 | 0.40 | 0.39 |

Table 10. Effects of public data sizes.
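For completeness, here is a minimal NumPy sketch of the DP-Adam-Pub update from Appendix C.1. All names are ours and `grad_fn(x, w)` is an assumed per-example gradient helper; the public expectation is approximated by a mini-batch mean.

```python
import numpy as np

def dp_adam_pub_step(w, private_batch, public_batch, grad_fn, m, v, t, alpha,
                     beta1, beta2, beta3, C, sigma, eps, rng):
    """One DP-Adam-Pub step: noisy private gradient mixed with a clean public
    gradient in the first moment; second moment estimated on public data only."""
    b = len(private_batch)
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
               for g in (grad_fn(x, w) for x in private_batch)]
    g_priv = (np.sum(clipped, axis=0) + rng.normal(0.0, sigma * C, size=w.shape)) / b
    g_pub = np.mean([grad_fn(x, w) for x in public_batch], axis=0)

    m = beta2 * m + (1 - beta2) * (beta1 * g_priv + (1 - beta1) * g_pub)
    v = beta3 * v + (1 - beta3) * g_pub ** 2
    m_hat = m / (1 - beta2 ** t)          # bias correction
    v_hat = v / (1 - beta3 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```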