# Revisiting Differentially Private ReLU Regression

Meng Ding (University at Buffalo), Mingxi Lei (University at Buffalo), Liyang Zhu (KAUST), Shaowei Wang (Guangzhou University), Di Wang (KAUST), Jinhui Xu (University at Buffalo)

Correspondence to: Di Wang, Jinhui Xu. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

As one of the most fundamental non-convex learning problems, ReLU regression under differential privacy (DP) constraints, especially in high-dimensional settings, remains a challenging area in privacy-preserving machine learning. Existing results are limited to the assumption of a bounded norm $\|x\|_2 \le 1$, which becomes increasingly stringent as the data dimensionality grows. In this work, we revisit the problem of DP ReLU regression in overparameterized regimes. We propose two innovative algorithms, DP-GLMtron and DP-TAGLMtron, that outperform the conventional DP-SGD. DP-GLMtron is based on a generalized linear model perceptron approach, integrating adaptive clipping and the Gaussian mechanism for enhanced privacy. To overcome the constraint of small privacy budgets in DP-GLMtron, namely $\varepsilon = \tilde{O}(\sqrt{1/N})$ where $N$ is the sample size, we introduce DP-TAGLMtron, which utilizes a tree aggregation protocol to balance privacy and utility effectively; we show that DP-TAGLMtron achieves comparable performance with only an additional factor of $O(\log N)$ in the utility upper bound. Moreover, our theoretical analysis extends beyond Gaussian-like data distributions to settings with eigenvalue decay, showing how the data distribution impacts learning in high dimensions. Notably, our findings suggest that the utility bound can be independent of the dimension $d$, even when $d \gg N$. Experiments on synthetic and real-world datasets also validate our results.

1 Introduction

Protecting individual privacy in data analysis has emerged as a critical concern. In light of the vast quantities of personal and sensitive information involved, traditional methods of ensuring privacy are encountering significant challenges. Differential Privacy (DP), introduced by [13], has gained widespread recognition as a method for preserving privacy by adding a controlled amount of random noise to the data or query responses, thereby effectively concealing the details of any individual.

Differentially Private Stochastic Optimization (DP-SO) and its empirical form, Differentially Private Empirical Risk Minimization (DP-ERM), are foundational models extensively studied in the DP community. Although there is a substantial body of research on DP-SO and DP-ERM with convex loss functions [10, 46, 7, 16, 35, 37, 6, 9, 26, 36, 38, 39], the understanding of non-convex optimization in a DP context remains limited. Recent efforts have developed private algorithms with proven utility bounds for differentially private non-convex optimization [51, 46, 43, 52, 8, 47]. However, most of the current work employs the gradient norm of the population risk function as the measurement, instead of the excess population risk commonly used in convex scenarios. Recently, [33] comprehensively studied different instances of non-convex learning, such as generalized linear models, ReLU regression, and multi-layer neural networks. However, for each problem they imposed different strong assumptions, which leaves considerable room for further understanding of DP non-convex learning. In this paper, we consider the most fundamental one, the ReLU (Rectified Linear Unit) regression model.
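Before turning to prior work on this problem, a quick numerical check (a minimal sketch assuming standard Gaussian features; not from the paper) illustrates why bounded-norm assumptions such as $\|x\|_2 \le 1$ become untenable in high dimensions: the norm of a Gaussian feature vector concentrates around $\sqrt{d}$.

```python
# Minimal sketch (not from the paper): for standard Gaussian features the norm
# ||x||_2 concentrates around sqrt(d), so an assumption ||x||_2 <= O(1) cannot
# hold once the dimension d is large.
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 1000, 10000]:
    x = rng.standard_normal((1000, d))        # 1000 i.i.d. samples x ~ N(0, I_d)
    norms = np.linalg.norm(x, axis=1)
    print(f"d={d:6d}  mean ||x||_2 = {norms.mean():8.2f}  (sqrt(d) = {np.sqrt(d):8.2f})")
```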
As one of the most fundamental non-convex models, ReLU regression stands out due to its widespread use and effectiveness in deep learning. Despite its prevalence in industry, theoretical exploration of ReLU regression has been relatively limited. [33] studies the DP ReLU regression problem in both well-specified and misspecified settings. However, the problem is still far from well understood. By and large, several challenges need to be addressed. First, their analysis heavily relies on strong assumptions of bounded norms of feature vectors and labels, i.e., $\|x\|_2 \le O(1)$ and $|y| \le O(1)$, which do not hold even for standard Gaussian distributions. Secondly, while they provide a utility upper bound of $\tilde{O}\big(\tfrac{1}{N} + \min\{\tfrac{d^{1/2}}{(N\varepsilon)^{1/2}}, \tfrac{1}{(N\varepsilon)^{2/3}}\}\big)$ with sample size $N$ and dimension $d$, their dimension-independent results heavily rely on their data assumption, leading to a dimension-dependent upper bound when considering Bernoulli or $O(1)$-sub-Gaussian data, where $\|x\|_2 \le O(\sqrt{d})$ with high probability. That means that when $d \gg N$, i.e., in the overparameterized regime, previous results become trivial.

In this paper, we revisit the DP ReLU regression problem in the overparameterized regime. Specifically, we provide a general analysis under the well-specified setting and propose two novel algorithms (DP-GLMtron and DP-TAGLMtron). Our contributions are three-fold; see Table 1 for details.

1) We propose an innovative algorithm, DP-GLMtron, built upon the Generalized Linear Model Perceptron (GLM-tron) algorithm of [23]. Specifically, DP-GLMtron makes a single pass over the dataset and processes one data point at a time, integrating appropriate clipping and adding Gaussian noise to maintain privacy. Additionally, we provide an excess population risk upper bound of $\tilde{O}\big(\tfrac{D_{\mathrm{eff}}}{N} + \tfrac{D^{\mathrm{pri}}_{\mathrm{eff}}}{N^2\varepsilon^2}\big)$, where $D_{\mathrm{eff}}$ ($D^{\mathrm{pri}}_{\mathrm{eff}}$) is the (private) effective dimension. Note that $D_{\mathrm{eff}}$ and $D^{\mathrm{pri}}_{\mathrm{eff}}$ are defined in a way that does not directly rely on the dimension $d$. Furthermore, our results are applicable not only to Gaussian-like data, typically considered in the traditional differential privacy community, but also to data distributions that exhibit either polynomial or exponential eigenvalue decay. In these cases, we demonstrate that the utility bound is independent of the dimension $d$, which marks a significant departure from traditional results (see more discussions in Remark 4). Finally, we also show that our analysis of DP-GLMtron is almost tight by providing a lower bound of $\tilde{\Omega}\big(\tfrac{D_{\mathrm{eff}}}{N} + \tfrac{D^{\mathrm{pri}}_{\mathrm{eff}}}{N^2\varepsilon^2}\big)$.

2) A significant issue with DP-GLMtron is that its upper bound only holds with a small privacy budget $\varepsilon = \tilde{O}(1/\sqrt{N})$ (for further details, see Remark 3), which is due to privacy amplification via shuffling. To tackle this limitation, we propose DP-TAGLMtron, which utilizes the tree aggregation protocol to introduce noise into the sum of mini-batch gradients instead of relying on privacy amplification. Our analysis reveals that the utility upper bound of DP-TAGLMtron is closely comparable to that of DP-GLMtron, with only an additional factor of $O(\log N)$. This finding indicates that DP-TAGLMtron can offer similar performance benefits while being applicable in settings with larger privacy budgets.

3) We conducted experiments with both synthetic data and real data on the ReLU regression model across various algorithms and privacy budgets. Specifically, in the first part, synthetic data was generated under two eigenvalue decay scenarios, $\lambda_i \propto i^{-2}$ and $\lambda_i \propto i^{-3}$.
The results revealed that faster eigenvalue decay consistently resulted in lower excess risk compared to slower decay, indicating that it may suffer less from dimensionality with respect to utility. Additionally, on both synthetic and real data, DP-GLMtron and DP-TAGLMtron outperformed DP-SGD and DP-FTRL [22], especially as the sample size increased, highlighting the effectiveness of our proposed methods. Due to space limitations, we have included the algorithms and proofs in the appendix.

Table 1: Comparison of our work with related studies on $(\varepsilon, \delta)$-DP ReLU regression in the statistical estimation setting. Here, $N$ represents the sample size, $d$ refers to the dimension, and $D_{\mathrm{eff}}$, $D^{\mathrm{pri}}_{\mathrm{eff}}$ are the (private) effective dimensions. Note that $D_{\mathrm{eff}}$ and $D^{\mathrm{pri}}_{\mathrm{eff}}$ are defined in a way that does not directly rely on the dimension $d$; instead, they segment the feature space into an effective subspace. For further discussion, please see Remark 4.

| Method | Upper Bound | Lower Bound | Privacy Constraints | Data Assumptions |
|---|---|---|---|---|
| DP-PGD [33] | $\tilde{O}\big(\tfrac{1}{N} + \min\{\tfrac{d^{1/2}}{(N\varepsilon)^{1/2}}, \tfrac{1}{(N\varepsilon)^{2/3}}\}\big)$ | - | - | bounded features and labels |
| DP-GLMtron | $\tilde{O}\big(\tfrac{D_{\mathrm{eff}}}{N} + \tfrac{D^{\mathrm{pri}}_{\mathrm{eff}}}{N^2\varepsilon^2}\big)$ | $\tilde{\Omega}\big(\tfrac{D_{\mathrm{eff}}}{N} + \tfrac{D^{\mathrm{pri}}_{\mathrm{eff}}}{N^2\varepsilon^2}\big)$ | $\varepsilon = \tilde{O}(1/\sqrt{N})$ | $(H, C_2, a, b)$-Tail |
| DP-TAGLMtron | $\tilde{O}\big(\tfrac{D_{\mathrm{eff}}}{N} + \tfrac{D^{\mathrm{pri}}_{\mathrm{eff}} + D_{\mathrm{eff}}}{N^2\varepsilon^2}\big)$ | - | - | - |

2 Related Work

Private Nonconvex Optimization. Previous research focusing on the utility of DP-SO/DP-ERM with convex loss functions predominantly employed the excess population risk as the utility metric. In contrast, for non-convex scenarios, utility assessment can be broadly categorized into three types: first-order stationarity, second-order stationarity, and excess population risk. The first-order stationarity-based method [43, 53, 34, 8, 52, 50] evaluates the $\ell_2$-norm of the gradient of the population risk function. This method, however, encounters several challenges. For example, [2] shows that the gradient norm tends to approach zero as the sample size increases indefinitely, while there is no assurance that a differentially private estimator will converge to, or even be near, any significant local minimum. The second-order stationarity-based method [42, 45] employs the norm of the gradient and the minimal eigenvalue of the Hessian matrix of the population risk function. However, this approach is particularly suited to specific scenarios where any second-order stationary point is a local minimum of the problem and all local minima are global minima, such as matrix completion and dictionary learning. The third approach directly employs the excess population risk as the measure of utility [33, 43], and our work aligns with this direction. However, as mentioned, these bounds either need strong assumptions or become trivial when $d \gg N$.

Private Generalized Linear Models. DP-SO and DP-ERM with generalized linear losses (DP-GLL) and generalized linear models (DP-GLM) have gained significant attention in recent years. The key developments in this area can be grouped into three aspects: 1) For convex loss functions, an early study on DP-GLL [21] revealed that, assuming features have bounded $\ell_2$ norm, the error bound can reach $\tilde{O}(\tfrac{1}{N\varepsilon})$, contrasting with the general convex DP-ERM bound of $O(\tfrac{\sqrt{d}}{N\varepsilon})$. A subsequent study [25] found that in constrained cases, the error bound depends on the Gaussian width of the constraint set. [34] improved the bound to $O(\tfrac{\sqrt{\theta}}{N\varepsilon})$ under the assumption that the data space is a bounded set, where $\theta$ is the rank of the expectation of the data matrix.
2) For constrained DP-GLM, [8] explored various scenarios, including smooth/non-smooth losses and losses in $\ell_p$ space for $1 \le p \le 2$. [5] provided the optimal rates of DP-GLM in unconstrained settings, identifying different rates depending on the nature of the loss function (smooth and non-negative but not necessarily Lipschitz, or Lipschitz). 3) For non-convex losses, [34, 7] provided dimension-independent bounds for the gradient $\ell_2$ norm of the population risk function. As a specific instance of GLM, this work focuses on the overparameterized regime with sub-Gaussian data rather than bounded data.

3 Preliminaries

Notations: In this paper, we adhere to a consistent notation style for clarity. We use boldface lower-case letters such as $x, w$ for vectors, and boldface capital letters (e.g., $A, H$) for matrices. Let $\|A\|_2$ denote the spectral norm of $A$. For two matrices $A$ and $B$ of appropriate dimensions, their inner product is defined as $\langle A, B\rangle := \mathrm{tr}(A^\top B)$. For a positive semi-definite (PSD) matrix $A$ and a vector $v$ of appropriate dimension, we write $\|v\|_A^2 := v^\top A v$. The outer product is denoted by $\otimes$.

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^N$, each data point has a feature vector $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ and a response variable $y_i \in \mathcal{Y}$. In this paper, we assume that $\{(x_i, y_i)\}_{i=1}^N$ are i.i.d. samples from a ReLU regression model, i.e., each $(x_i, y_i)$ is a realization of the model $y = \mathrm{ReLU}(x^\top w^*) + z$, where $\mathrm{ReLU}(\cdot) := \max\{\cdot, 0\}$ is the Rectified Linear Unit (ReLU), $z$ is zero-mean random noise, and $w^* \in \mathbb{R}^d$ is the optimal model parameter. Our goal is to output a model $w_{\mathrm{priv}}$ that maintains privacy while minimizing the excess population risk, i.e., $L(w_{\mathrm{priv}}) - L(w^*)$, where
$$L(w) := \frac{1}{2}\,\mathbb{E}_{(x,y)\sim \mathcal{D}}\big[(\mathrm{ReLU}(x^\top w) - y)^2\big]. \quad (1)$$
In this paper, we focus on classic Differential Privacy to ensure privacy.

Definition 3.1 (Differential Privacy [13]). Given a data universe $\mathcal{X}$, we say that two datasets $D, D' \subseteq \mathcal{X}$ are neighbors if they differ by only one element, which is denoted as $D \sim D'$. A randomized algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private (DP) if for all adjacent datasets $D, D'$ and for all events $S$ in the output space of $\mathcal{A}$, we have $\Pr(\mathcal{A}(D) \in S) \le e^{\varepsilon}\,\Pr(\mathcal{A}(D') \in S) + \delta$.

In the following, we introduce some definitions and assumptions related to the model.

Definition 3.2 (Well-specified Condition). Assume that there exists a parameter $w^* \in \mathbb{R}^d$ such that $\mathbb{E}[y \mid x] = \mathrm{ReLU}(x^\top w^*)$, and denote the variance of the model noise by $\sigma^2 := \mathbb{E}\big[(y - \mathrm{ReLU}(x^\top w^*))^2\big]$.

In the literature, the well-specified setting is also extensively referred to as the noisy teacher setting [18] or the well-structured noise model [19]. It has been widely studied previously [56, 41, 33, 12].

Assumption 3.3 (Data Covariance). Define $H := \mathbb{E}[xx^\top]$ as the expected data covariance matrix. Denote the eigendecomposition of the data covariance by $H = \sum_i \lambda_i v_i v_i^\top$, where $(\lambda_i)_{i \ge 1}$ are the eigenvalues in nonincreasing order and $(v_i)_{i \ge 1}$ are the corresponding eigenvectors. Define $H_{k_1:k_2} := \sum_{k_1 < i \le k_2} \lambda_i v_i v_i^\top$. In the case of a Gaussian distribution, the data covariance matrix reduces to an identity matrix scaled by the variance.

Assumption 3.4 (Fourth moment conditions). Define the fourth moment of $x$ as $\mathcal{M}$:
(A) There exists a constant $\alpha > 0$ such that for any positive semi-definite (PSD) matrix $A$, $\mathbb{E}[xx^\top A xx^\top] \preceq \alpha\,\mathrm{tr}(HA)\,H$.
(B) There exists a constant $\beta > 0$ such that for every PSD matrix $A$, $\mathbb{E}[xx^\top A xx^\top] - HAH \succeq \beta\,\mathrm{tr}(HA)\,H$.

Remark 1. For the Gaussian distribution, it can be verified that Assumption 3.4 holds with $\alpha = 3$ and $\beta = 1$ [56]. Furthermore, the assumption of sub-Gaussianity of the data, specifically that $H^{-1/2}x$ is a sub-Gaussian random vector, as posited in previous work [44, 41, 54, 28, 55, 20, 32, 31] for private linear regression models, can be shown to imply Assumption 3.4 (A) [56]. It is important to note that Assumption 3.4 (B) is utilized in our analysis only for establishing the lower bound.
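For concreteness, the following minimal sketch (an assumed setup, not the authors' code) samples from the well-specified ReLU regression model $y = \mathrm{ReLU}(x^\top w^*) + z$ and estimates the excess population risk of Eq. (1) by Monte Carlo. The Gaussian design and the helper names (`sample_data`, `excess_risk`) are illustrative choices.

```python
# A minimal sketch (assumed setup): sample from y = ReLU(x^T w*) + z and
# estimate the excess population risk L(w) - L(w*) of Eq. (1) by Monte Carlo.
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def sample_data(n, w_star, sigma, rng):
    d = w_star.shape[0]
    X = rng.standard_normal((n, d))                  # x ~ N(0, I_d), a sub-Gaussian design
    y = relu(X @ w_star) + sigma * rng.standard_normal(n)
    return X, y

def excess_risk(w, w_star, sigma, rng, n_mc=100_000):
    X, y = sample_data(n_mc, w_star, sigma, rng)
    loss_w = 0.5 * np.mean((relu(X @ w) - y) ** 2)
    loss_star = 0.5 * np.mean((relu(X @ w_star) - y) ** 2)   # roughly sigma^2 / 2
    return loss_w - loss_star

rng = np.random.default_rng(1)
d = 50
w_star = rng.standard_normal(d) / np.sqrt(d)
print(excess_risk(np.zeros(d), w_star, sigma=0.1, rng=rng))
```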
Definition 3.5 ($(H, C_2, a, b)$-Tail). Let both $H$ and $\mathcal{M}$ exist and be finite. A random vector $x$ satisfies the $(H, C_2, a, b)$-Tail condition if the following holds: there exists $a > 0$ such that with probability $1 - b$,
$$\|x\|_2^2 \le \mathbb{E}[\|x\|_2^2]\,\log^{2a}(1/b), \quad (2)$$
and
$$\max_{v,\,\|v\|=1} \mathbb{E}\Big[\exp\Big(\Big(\tfrac{|\langle x, v\rangle|^2}{C_2^2\,\|H\|_2}\Big)^{1/2a}\Big)\Big] \lesssim 1.$$
That is, for any fixed $v$, with probability $1 - b$: $\langle x, v\rangle^2 \le C_2^2\,\|H\|_2\,\|v\|^2\,\log^{2a}(1/b)$.

Both conditions above provide Gaussian-like tail bounds determining the resilience of the dataset. Note that Definition 3.5 is widely used in the recent literature on DP analysis for sub-Gaussian types of data, such as [41, 29, 27]. In this work, we assume each sample $x$ satisfies the $(H, C_2, a, b)$-Tail condition and the distribution of the inherent noise $z$ satisfies the $(\sigma^2, C_2, a, b)$-Tail condition, which captures statistical properties beyond expectations. Additionally, according to Assumption 3.4 (A), Equation (2) implies $\|x\|_2^2 \le \alpha\,\mathrm{tr}(H)\log^{2a}(1/b)$.

Assumption 3.6 (Symmetricity conditions). Assume that for every $u, v \in \mathbb{R}^d$, it holds that:
$$\mathbb{E}\big[xx^\top\,\mathbb{1}[x^\top u > 0,\, x^\top v > 0]\big] = \mathbb{E}\big[xx^\top\,\mathbb{1}[x^\top u < 0,\, x^\top v < 0]\big],$$
$$\mathbb{E}\big[(x^\top v)^2 xx^\top\,\mathbb{1}[x^\top u > 0,\, x^\top v > 0]\big] = \mathbb{E}\big[(x^\top v)^2 xx^\top\,\mathbb{1}[x^\top u < 0,\, x^\top v < 0]\big].$$

Remark 2. Here we require that both the second and fourth moments of $x$ remain symmetric. It can be shown that Assumption 3.6 is satisfied when $x$ and $-x$ follow the same distribution, which applies to symmetric sub-Gaussian distributions such as the symmetric Bernoulli or Gaussian distributions.

[Figure 1: Training trajectories of DP-SGD and DP-GLMtron on a 2D noiseless ReLU regression with symmetric Bernoulli data. (a) Trajectory with smaller noise; (b) Trajectory with larger noise.]

4 Proposed DP-GLMtron Algorithm

Before presenting the analysis of the DP ReLU regression problem, we first introduce our novel algorithm, DP-GLMtron (Algorithm 1), which is inspired by and built upon the Generalized Linear Model Perceptron (GLM-tron) algorithm of [23]. The fundamental distinction between SGD and GLMtron lies in their respective update rules:
$$\text{SGD:}\quad w_t = w_{t-1} - \eta\,(\mathrm{ReLU}(x_t^\top w_{t-1}) - y_t)\,x_t\,\mathbb{1}[x_t^\top w_{t-1} > 0],$$
$$\text{GLMtron:}\quad w_t = w_{t-1} - \eta\,(\mathrm{ReLU}(x_t^\top w_{t-1}) - y_t)\,x_t, \quad (3)$$
where the algorithm commences from an initial point $w_0$ and iterates from $t = 0$ to $t = N$ with stepsize $\eta$. In contrast to the update rule of SGD, GLMtron differs by dropping the derivative of the ReLU function from its update mechanism. Typically, in neural network training, the gradient of the activation function, such as ReLU, is a critical component of the backpropagation process. The exclusion of this derivative in GLMtron's framework not only simplifies the computational process but also enhances efficiency. Our motivation for developing the DP-GLMtron algorithm stems from observations regarding the limitations of DP-SGD: it sometimes fails to reach the optimal solution and can struggle with convergence under conditions of high noise, often settling at a saddle point instead (as illustrated in Figure 4). This issue is particularly pronounced when considering a low privacy budget. Furthermore, [23] demonstrates that this specific omission plays a significant role in GLMtron's ability to efficiently identify a predictor that closely approximates the optimal solution.
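The contrast in Eq. (3) can be written out directly; the sketch below (illustrative, not the paper's implementation) shows that the only difference between the two non-private updates is the ReLU derivative $\mathbb{1}[x_t^\top w_{t-1} > 0]$, which GLM-tron omits.

```python
# Non-private single-sample updates from Eq. (3). The only difference between
# the two rules is that GLM-tron drops the ReLU derivative 1[x^T w > 0].
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def sgd_step(w, x, y, eta):
    residual = relu(x @ w) - y
    grad = residual * x * float(x @ w > 0)   # chain rule keeps the ReLU derivative
    return w - eta * grad

def glmtron_step(w, x, y, eta):
    residual = relu(x @ w) - y
    return w - eta * residual * x            # ReLU derivative is omitted
```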
Building upon these foundations, we now present the detailed implementation of our proposed DP-GLMtron. In Algorithm 1, we begin with a random permutation of the dataset, ensuring that each data point is treated independently and uniformly; such shuffling can also amplify privacy [17]. Subsequently, the algorithm performs a single pass over the dataset. Throughout this pass, we need to determine the clipping threshold before the iterative updates of $w$. Establishing an appropriate clipping threshold is crucial for balancing the trade-off between utility and privacy: if the clipping threshold is set excessively low, gradient information is lost, leading to high bias [30, 3]. Consequently, we propose Algorithm 2, based on the residual estimator in [28], to achieve adaptive clipping [4] by utilizing a small batch of data points.

Recall that the clipped term is a product of the residual, $(\mathrm{ReLU}(x_t^\top w_t) - y_t)$, and the covariate $x_t$. Based on Definition 3.5 and Assumption 3.4, we know that $\|x_t\|_2^2 \le \alpha\,\mathrm{tr}(H)\log^{2a}(1/b)$ with probability at least $1 - b$. Thus, it suffices to obtain an upper bound on $|\mathrm{ReLU}(x_t^\top w_t) - y_t|$. To bound this, we consider three cases: 1) $x_t^\top w_t > 0,\, x_t^\top w^* > 0$; 2) $x_t^\top w_t < 0,\, x_t^\top w^* > 0$; 3) $x_t^\top w_t > 0,\, x_t^\top w^* < 0$, corresponding to $d_t^1$, $d_t^2$, and $d_t^3$ in step 2 of Algorithm 2. Note that the inherent model noise is not considered when initially calculating the residual for the third case at step 2, while this noise is taken into account at step 15. Moreover, the special case $x_t^\top w_t < 0,\, x_t^\top w^* < 0$ is integrated into the third scenario. For each residual, we first partition the dataset into $K$ batches, compute an estimate for each batch, and form a histogram over those $K$ estimates. We then apply a private histogram mechanism with geometrically increasing bin sizes. The bin with the most estimates is used, ensuring a constant-factor approximation of the distance to the optimum. Finally, we take the maximum among the three private residual estimates. Upon determining the clipping threshold $\gamma_t$, it is incorporated into the existing clipping list $\gamma$. Following this, Gaussian noise with a standard deviation of $2f\gamma_t$ is added to the clipped $l_t$, and the weight vector $w_{t+1}$ is updated. Finally, the algorithm outputs the average of the iterates.

Theorem 4.1 (Privacy Guarantee). Suppose DP-GLMtron is applied to $N$ input samples with a noise multiplier set to $f = O\big(\frac{\log(N/\delta)}{\varepsilon\sqrt{N}}\big)$. Under these conditions, the algorithm satisfies $(\varepsilon, \delta)$-differential privacy for $\varepsilon = \tilde{O}\big(\sqrt{1/N}\big)$, with $\varepsilon' = \frac{\varepsilon}{\sqrt{8N\log(2/\delta)}}$ and $\delta' = \frac{\delta}{2N}$ in Algorithm 2.

Remark 3. For general data distributions, DP-GLMtron necessitates a more involved approach to clipping decisions. To this end, we incorporate Algorithm 2, which uses the magnitude of the residual along with privacy parameters $\varepsilon'$ and $\delta'$ to estimate the private threshold. Additionally, we take only one pass over the randomly shuffled data in Algorithm 1 [15, 17, 41], which is a well-known method of privacy amplification. Hence, we need to add noise of order $\tilde{O}\big(\frac{\gamma_t}{\varepsilon\sqrt{N}}\big)$, where $\gamma_t$ is the clipping threshold, and $\varepsilon$ must be restricted to $O(1/\sqrt{N})$ to satisfy the requirement of [17] and to ensure differential privacy.
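A minimal sketch of a single private update of DP-GLMtron (steps 7-9 of Algorithm 1), under the assumptions above: the pseudo-gradient is clipped to an adaptively chosen threshold $\gamma_t$ (returned by Algorithm 2, treated here as given) and perturbed with Gaussian noise of standard deviation $2f\gamma_t$, where $f$ is the noise multiplier. This is illustrative only and omits the shuffling and the threshold-estimation subroutine.

```python
# One private DP-GLMtron update (sketch of Algorithm 1, steps 7-9):
# clip the pseudo-gradient to gamma_t, add Gaussian noise with std 2*f*gamma_t.
import numpy as np

def clip(v, gamma):
    norm = np.linalg.norm(v)
    return v if norm <= gamma else v * (gamma / norm)

def dp_glmtron_step(w, x, y, eta, f, gamma_t, rng):
    residual = max(x @ w, 0.0) - y                    # ReLU(x^T w) - y
    l_t = clip(residual * x, gamma_t)                 # clipped pseudo-gradient
    noise = 2.0 * f * gamma_t * rng.standard_normal(size=x.shape)
    return w - eta * (l_t + noise)
```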
Theorem 4.2 (Utility Guarantee, Upper Bound). Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^N$, where each $x_i$ is drawn from a distribution $\mathcal{D}$ satisfying the $(H, C_2, a, b)$-Tail condition and the distribution of the inherent noise $z$ satisfies the $(\sigma^2, C_2, a, b)$-Tail condition. Given Assumptions 3.4 and 3.6, if we set a step size $\eta \le \frac{1}{2\alpha\,\mathrm{tr}(H)\log^{2a}(N)}$, a noise multiplier $f = O\big(\frac{\log(N/\delta)}{\varepsilon\sqrt{N}}\big)$, and an estimation data size $m = \Omega\big(\frac{\log(N)\log(N/\delta')}{\varepsilon'}\big)$, then with probability at least $1 - 1/N$, the output of Algorithm 1 satisfies
$$\mathbb{E}[L(\bar{w}_N)] - L(w^*) \lesssim \mathrm{bias} + \mathrm{variance} + \mathrm{privacy},$$
where the bias, variance, and privacy terms are upper bounded by
$$\mathrm{bias} \lesssim \frac{1}{\eta^2 N^2}\,\|w_0 - w^*\|^2_{H^{-1}_{0:k^*}} + \|w_0 - w^*\|^2_{H_{k^*:\infty}} + \frac{\alpha}{N}\Big(\frac{\|w_0 - w^*\|^2_{I_{0:k^*}}}{N\eta} + \|w_0 - w^*\|^2_{H_{k^*:\infty}}\Big)\,D_{\mathrm{eff}},$$
$$\mathrm{variance} \lesssim \frac{\sigma^2}{N}\,D_{\mathrm{eff}}, \qquad \mathrm{privacy} \lesssim \Gamma^2 f^2 \cdots,$$
where the effective dimensions are given by $D_{\mathrm{eff}} = k^* + N^2\eta^2\sum_{i > k^*}\lambda_i^2$ and $D^{\mathrm{pri}}_{\mathrm{eff}} = \sum_i \cdots$

Corollary 4.3. Under the same setting as Theorem 4.2:
1. If the spectrum of $H$ satisfies $\lambda_k = k^{-r}$ for some $r > 1$, then with probability at least $1 - 1/N$, the utility is upper bounded by $\tilde{O}\Big(N^{-\frac{r}{1+r}}\big(1 + (\varepsilon^2 N^{-\frac{r}{1+r}})^{-\frac{r}{1+2r}}\big)\Big)$.
2. If the spectrum of $H$ satisfies $\lambda_k = e^{-k}$, then with probability at least $1 - 1/N$, the utility is upper bounded by $\tilde{O}\Big(N^{-1}\big(1 + (\varepsilon^2 N^{-1})^{-\frac{1}{2}}\big)\Big)$.

Remark 5. Corollary 4.3 indicates that the utility bound in the overparameterized setting can still vanish if the spectrum exhibits either polynomial or exponential decay. Specifically, a constant privacy budget $\varepsilon = O(1)$ yields utility bounds of $\tilde{O}(N^{-\frac{r}{1+2r}})$ for polynomial decay and $\tilde{O}(N^{-\frac{1}{2}})$ for exponential decay. Notably, faster eigenvalue decay (e.g., $r = 2, 3$) consistently results in lower excess risk compared to slower decay, indicating that it may suffer less from dimensionality with respect to utility. This insight cannot be derived when considering only traditional Gaussian-like data; our experiments in Section 6 also validate this conclusion. Moreover, increasing the privacy budget to $\varepsilon = O(N^{\frac{r}{2+2r}})$ for polynomial decay and $\varepsilon = O(N^{\frac{1}{2}})$ for exponential decay improves the utility bounds to $\tilde{O}(N^{-\frac{r}{1+r}})$ and $\tilde{O}(N^{-1})$, respectively, matching non-private and linear regression results [49, 48].

In the following, we show that our previous analysis of Algorithm 1 is almost tight.

Theorem 4.4 (Utility Guarantee, Lower Bound). Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^N$, where each $x_i$ is drawn from a distribution $\mathcal{D}$ satisfying the $(H, C_2, a, b)$-Tail condition and the distribution of the inherent noise $z$ satisfies the $(\sigma^2, C_2, a, b)$-Tail condition. Given Assumptions 3.4 and 3.6, if we set the same parameters as in Theorem 4.2, then with probability at least $1 - 1/N$, the output of Algorithm 1 satisfies
$$\mathbb{E}[L(\bar{w}_N)] - L(w^*) \gtrsim \mathrm{bias} + \mathrm{variance} + \mathrm{privacy},$$
$$\mathrm{bias} \gtrsim \frac{1}{\eta^2 N^2}\,\|w_0 - w^*\|^2_{H^{-1}_{0:k^*}} + \|w_0 - w^*\|^2_{H_{k^*:\infty}} + \frac{\beta}{N}\Big(\frac{\|w_0 - w^*\|^2_{I_{0:k^*}}}{N\eta} + \|w_0 - w^*\|^2_{H_{k^*:\infty}}\Big)\,D_{\mathrm{eff}},$$
$$\mathrm{variance} \gtrsim \frac{\sigma^2}{N}\,D_{\mathrm{eff}}, \qquad \mathrm{privacy} \gtrsim \tilde{\Gamma}^2 f^2 \cdots,$$
where the effective dimensions and $k^*$ are defined as in Theorem 4.2 and $\tilde{\Gamma} := \min\{\gamma_t\}_{t=0}^{T-1}$.

Remark 6. Similar to Theorem 4.2, the lower bound stated here consists of three terms: the bias term, the variance term, and the privacy term. Note that Theorem 4.4 provides an algorithmic lower bound for Algorithm 1, applicable in both the under-parameterized and over-parameterized settings. Notably, our lower bound aligns with the upper bound, differing only by absolute constants.
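To see how the spectrum drives the bound, the following sketch evaluates the effective dimension $D_{\mathrm{eff}} = k^* + N^2\eta^2\sum_{i>k^*}\lambda_i^2$ from Theorem 4.2 under polynomial and exponential eigenvalue decay. The cutoff is taken as $k^* = \max\{k : \lambda_k \ge \frac{1}{N\eta}\}$, the standard choice in this line of work; that choice, like the code itself, is an illustrative assumption rather than the paper's exact definition.

```python
# Sketch (assumed form): effective dimension D_eff = k* + N^2 eta^2 * sum_{i>k*} lambda_i^2,
# with cutoff k* = max{k : lambda_k >= 1/(N*eta)}, evaluated for two decay profiles.
import numpy as np

def effective_dimension(lams, n, eta):
    lams = np.sort(lams)[::-1]
    k_star = int(np.sum(lams >= 1.0 / (n * eta)))
    return k_star + (n * eta) ** 2 * np.sum(lams[k_star:] ** 2)

d, n, eta = 1024, 500, 1e-2
poly = np.array([i ** -2.0 for i in range(1, d + 1)])      # lambda_i ~ i^{-2}
expo = np.array([np.exp(-float(i)) for i in range(1, d + 1)])  # lambda_i ~ e^{-i}
print("polynomial decay:", effective_dimension(poly, n, eta))
print("exponential decay:", effective_dimension(expo, n, eta))
```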
5 Advanced DP-TAGLMtron

A pivotal concern with Algorithm 1 (DP-GLMtron) lies in its primary applicability to the setting where the privacy budget $\varepsilon$ is small, which could potentially limit its practical utility (see Remark 3 for more details). To address this challenge, we introduce DP-TAGLMtron (Algorithm 3) in this section, which is designed to attain an improved trade-off between privacy and utility.

In Algorithm 3, the process starts by setting the initial weights. For each iteration $t$, a private residual is determined by Algorithm 2 using the estimation samples, the current weights, and the privacy parameters. Subsequently, the clipping bound is computed as the product of the private residual and the covariate norm bound, analogous to the previous case. Once this clipping bound $\gamma_t$ for the current iteration is computed, it is added to the existing clipping list $\gamma$. With $\gamma_t$ in place, the clipped "gradient" $l_t$ is calculated, and the maximum value of the list is selected as the clipping threshold $\Gamma$ for the next step. The "gradient" $l_t$ is then forwarded to the differentially private tree-aggregation mechanism with noise multiplier $f$ and the maximal clipping threshold $\Gamma$.

Note that our proposed algorithm, DP-TAGLMtron, does not depend on privacy amplification. Instead, it employs tree aggregation [14, 11, 22, 40] to introduce noise into the sum of mini-batch gradients. At the beginning of training, we construct a binary tree consisting of $2^{\lceil\log_2 N\rceil + 1} - 1$ nodes, corresponding to $N$ leaves. Each leaf node is assigned a label $l_i$, and the value of each internal node represents the sum of its child nodes. In each iteration, the tree receives a vector $l_t$, derived from the current data pair $(x_t, y_t)$, which is used to update the tree, incorporating the new node. The final output, denoted by $\tilde{l}_t$, is obtained by aggregating the gradients accumulated in the binary tree and adding Gaussian noise to this aggregate. As a result, $\tilde{l}_t$ is the private estimate of $\sum_{i=0}^{t} l_i$ generated by the tree aggregation protocol:
$$\tilde{l}_t = \sum_{i=0}^{t} l_i + v_t, \quad \text{where } l_i = \mathrm{clip}_{\gamma_i}\big(x_i(\mathrm{ReLU}(x_i^\top w_i) - y_i)\big),$$
and $v_t$ is a combination of at most $O(\lceil\log_2 t\rceil)$ Gaussian vectors. Therefore, each $l_i$ influences only its ancestor nodes, affecting at most $\log_2 t + 1$ nodes. By introducing Gaussian noise with a standard deviation of $O(\log t/\varepsilon)$ to each node, the entire tree achieves $(\varepsilon, \delta)$-differential privacy.

When receiving the output of Algorithm 4, the optimization objective at step 9 determines the weight parameter $w$. This objective consists of two key components: the first term, $\langle \tilde{l}_t, w\rangle$, is the inner product of $\tilde{l}_t$ and the weight parameter $w$; the second term, $\frac{1}{2\eta}\|w\|_2^2$, introduces a regularization mechanism, where $\eta$ controls the balance between fitting the training data and constraining the magnitude of the parameters to prevent overfitting. A common way to solve this optimization problem is to take the derivative of the objective function and set it to zero. Therefore, for the problem at step 9, we have the update
$$w_{t+1} \leftarrow -\eta\,\tilde{l}_t = -\eta\Big(\sum_{i=0}^{t} l_i + v_t\Big).$$
It should be noted that in Algorithm 1, if the initialization is set to $w_0 = 0$, then the update rule for $w$ can be considered equivalent to the one presented here, provided that the variance of the noise and the clipping threshold are disregarded. Upon completing all iterations, the final model weights are obtained by averaging the updated weights from all iterations.

Theorem 5.1 (Privacy Guarantee). Suppose DP-TAGLMtron is applied to $N$ input samples with a noise multiplier set to $f = \frac{2\sqrt{\lg(N+1)\ln(1/\delta)}}{\varepsilon}$. Under these conditions, Algorithm 3 satisfies $(\varepsilon, \delta)$-differential privacy, with $\varepsilon' = \frac{\varepsilon}{\sqrt{8N\log(2/\delta)}}$ and $\delta' = \frac{\delta}{2N}$ in Algorithm 2.

Remark 7. Differing from Algorithm 1, the update rule of DP-TAGLMtron mainly utilizes the DP-Tree-Aggregation mechanism, which aggregates the "gradients" collected in a binary tree and then adds Gaussian noise to this aggregate. Based on [22], Algorithm 4 achieves $(\varepsilon, \delta)$-differential privacy with an appropriate noise multiplier $f = \frac{2\sqrt{\lg(N+1)\ln(1/\delta)}}{\varepsilon}$, ensuring that Algorithm 3 also upholds $(\varepsilon, \delta)$-DP under similar settings with $\varepsilon', \delta'$ as in Algorithm 2.
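The prefix-sum view of tree aggregation can be sketched as follows (an illustrative implementation of the idea behind Algorithm 4 and DP-FTRL, not the authors' code): every dyadic block of the stream receives its own Gaussian noise, and each private partial sum $\sum_{i \le t} l_i$ is assembled from at most $\log_2 N + 1$ noisy blocks, so every $l_i$ touches only $O(\log N)$ noise terms. Sensitivity calibration via the clipping bound $\Gamma$ is simplified here.

```python
# Sketch of the binary-tree (dyadic block) mechanism for private prefix sums.
import numpy as np

def tree_prefix_sums(vectors, noise_std, rng):
    """vectors: list of N equally shaped 1-D arrays (already clipped).
    Returns the N private prefix sums produced by the tree mechanism."""
    n, d = len(vectors), vectors[0].shape[0]
    levels = int(np.ceil(np.log2(n))) + 1
    # noisy_block[k][j]: noisy sum of the j-th dyadic block of length 2^k
    noisy_block = []
    for k in range(levels):
        size = 2 ** k
        blocks = []
        for j in range(0, n, size):
            block_sum = np.sum(vectors[j:j + size], axis=0)
            blocks.append(block_sum + noise_std * rng.standard_normal(d))
        noisy_block.append(blocks)
    prefix = []
    for t in range(1, n + 1):
        total, pos = np.zeros(d), 0
        for k in reversed(range(levels)):        # greedy dyadic decomposition of [0, t)
            size = 2 ** k
            if t - pos >= size:
                total = total + noisy_block[k][pos // size]
                pos += size
        prefix.append(total)
    return prefix

rng = np.random.default_rng(0)
grads = [rng.standard_normal(3) for _ in range(8)]
private_sums = tree_prefix_sums(grads, noise_std=0.1, rng=rng)
```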
Theorem 5.2 (Utility Guarantee). Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^N$, where each $x_i$ is drawn from a distribution $\mathcal{D}$ satisfying the $(H, C_2, a, b)$-Tail condition and the distribution of the inherent noise $z$ satisfies the $(\sigma^2, C_2, a, b)$-Tail condition. Suppose Assumptions 3.4 and 3.6 hold. If we set a stepsize $\eta < \frac{1}{\alpha\,\mathrm{tr}(H)}$ and a noise multiplier $f = \frac{2\sqrt{\lg(N+1)\ln(1/\delta)}}{\varepsilon}$, then with probability at least $1 - 1/N$, the output of Algorithm 3 satisfies
$$\mathbb{E}[L(\bar{w}_N)] - L(w^*) \lesssim \mathrm{bias} + \mathrm{variance} + \mathrm{privacy},$$
$$\mathrm{bias} \lesssim \frac{1}{\eta^2 N^2}\,\|w_0 - w^*\|^2_{H^{-1}_{0:k^*}} + \|w_0 - w^*\|^2_{H_{k^*:\infty}} + \frac{\alpha}{N}\Big(\frac{\|w_0 - w^*\|^2_{I_{0:k^*}}}{N\eta} + \|w_0 - w^*\|^2_{H_{k^*:\infty}}\Big)\,D_{\mathrm{eff}},$$
$$\mathrm{variance} \lesssim \frac{\sigma^2}{N}\,D_{\mathrm{eff}}, \qquad \mathrm{privacy} \lesssim \frac{f^2\log_2 N}{N}\big(\alpha\eta\log_2 N\,\mathrm{tr}(H) + \Gamma^2\big)\big(D_{\mathrm{eff}} + D^{\mathrm{pri}}_{\mathrm{eff}}\big),$$
where the effective dimensions and $k^*$ are defined as in Theorem 4.2.

Remark 8. It is noteworthy that in Theorem 5.2, both the bias term and the variance term are consistent with those in Theorem 4.2. This alignment extends to the non-private results, with the only differences being in the absolute constants. Additionally, the privacy term in Theorem 5.2 closely aligns with that in Theorem 4.2, with the primary difference being a factor of $\log_2 N$. Intuitively, the tree aggregation mechanism used in DP-TAGLMtron means that the noise added for privacy is a combination of at most $\log_2 N + 1$ (and at least one) random Gaussian vectors. This suggests that DP-TAGLMtron deviates from the upper bound of DP-GLMtron by at most a factor of $\log_2 N$ and adheres to the same lower bound as DP-GLMtron with a different noise variance. Moreover, in contrast to Algorithm 1, DP-TAGLMtron does not require $\varepsilon = O(1/\sqrt{N})$ to maintain differential privacy, allowing a broader range of privacy budget settings in terms of flexibility and applicability.

6 Empirical Simulation

In this section, we first conduct experiments with synthetic data to validate the theoretical insights related to the role of eigenvalues in a high-dimensional setting. Additionally, due to space constraints, we present a real-data experiment on the MNIST dataset in Appendix B to demonstrate the performance of our proposed method.

[Figure 2: Comparative performance of various differential privacy algorithms across different data spectra and privacy budgets. The figure illustrates the performance of four differential privacy algorithms (DP-SGD, DP-GLMtron, DP-TAGLMtron, and DP-FTRL) on generated Bernoulli-distributed data with eigenvalue decay $\lambda_i \propto i^{-2}$ and $\lambda_i \propto i^{-3}$, comparing excess risk across different sample sizes and privacy budgets ($\varepsilon = 0.05, 0.2, 0.5$).]

Experiment Setup. The experiments were designed with varying privacy budgets $\varepsilon \in \{0.05, 0.2, 0.5\}$ and $\delta = \frac{1}{n^{1.1}}$ to observe the impact of differential privacy constraints on the learning process. We generated the data under two eigenvalue decay scenarios, $\lambda_i \propto i^{-2}$ and $\lambda_i \propto i^{-3}$, simulating different distributions of feature importance. The dimensionality of the data was fixed at 1,024 to maintain consistency across trials. The learning rate was initially set to $10^{-2}$, and the sample size $N$ varied from 50 to 550 in steps of 100. We trained a ReLU regression model with noise using the previously mentioned algorithms, integrating privacy-preserving noise adjustments. The algorithms performed a single pass over the generated data. The primary evaluation metric was the average excess risk across varying dataset sizes and $\varepsilon$ values. For each experiment, we also consider DP-SGD [1] and DP-FTRL [22] as baselines.
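A sketch of the synthetic-data generator described in the setup above (assumed details: symmetric Bernoulli coordinates rescaled so that the covariance $H$ has eigenvalues $\lambda_i \propto i^{-r}$, plus labels from the noisy ReLU teacher); the function name and exact scaling are illustrative, not the authors' code.

```python
# Illustrative generator: symmetric Bernoulli features with covariance
# eigenvalues lambda_i ~ i^{-decay}, labels from y = ReLU(x^T w*) + noise.
import numpy as np

def make_dataset(n, d, decay, sigma, rng):
    lam = np.arange(1, d + 1, dtype=float) ** (-decay)   # lambda_i ~ i^{-decay}
    signs = rng.choice([-1.0, 1.0], size=(n, d))          # symmetric Bernoulli coordinates
    X = signs * np.sqrt(lam)                               # so that E[x x^T] = diag(lam)
    w_star = rng.standard_normal(d) / np.sqrt(d)
    y = np.maximum(X @ w_star, 0.0) + sigma * rng.standard_normal(n)
    return X, y, w_star

rng = np.random.default_rng(0)
X, y, w_star = make_dataset(n=500, d=1024, decay=2, sigma=0.1, rng=rng)
```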
Empirical Findings. Compared with Figures 2(a)-2(c), which show data with eigenvalue decay $\lambda_i \propto i^{-2}$, Figures 2(d)-2(f), which show data with eigenvalue decay $\lambda_i \propto i^{-3}$, exhibit lower excess risk under the same training algorithms and conditions, regardless of the privacy budget. This indicates that fast eigenvalue decay suffers less from dimensionality with respect to utility, thereby validating our results in Section 4. Additionally, the results for both decay settings show that DP-GLMtron and DP-TAGLMtron consistently outperform DP-SGD and DP-FTRL in terms of excess risk, particularly as the sample size increases. Moreover, DP-SGD and DP-FTRL exhibit poor convergence even with comparatively large data sizes. Finally, in both settings, increasing the privacy budget from $\varepsilon = 0.05$ to $\varepsilon = 0.5$ results in lower excess risk for both algorithms. This observation confirms the anticipated trade-off between privacy and learning performance, with a looser privacy constraint leading to more accurate models.

7 Conclusions

In conclusion, our study significantly advances the field of differential privacy for ReLU regression problems. Our first proposed algorithm, DP-GLMtron, achieves an excess population risk upper bound of $\tilde{O}\big(\tfrac{D_{\mathrm{eff}}}{N} + \tfrac{D^{\mathrm{pri}}_{\mathrm{eff}}}{N^2\varepsilon^2}\big)$, offering a novel perspective on utility decomposition and highlighting the impact of eigenvalues in overparameterized settings. Our results apply not only to Gaussian-like data, typically considered in traditional differential privacy studies, but also to data distributions with polynomial or exponential eigenvalue decay. To address the limitation of small privacy budgets, we developed DP-TAGLMtron, which offers performance comparable to DP-GLMtron but supports broader privacy budgets. Empirical results from both synthetic and real data experiments demonstrate that DP-GLMtron and DP-TAGLMtron consistently outperform DP-SGD and DP-FTRL.

Acknowledgments

The research of Meng Ding and Jinhui Xu was supported in part by KAUST through grant CRG104663.2. Di Wang was supported in part by the baseline funding BAS/1/1689-01-01, funding from the CRG grant URF/1/4663-01-01, REI/1/5232-01-01, REI/1/5332-01-01, and FCC/1/1976-49-01 from CBRC of King Abdullah University of Science and Technology (KAUST). Di Wang was also supported by the funding RGC/3/4816-09-01 of the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
Limitations While this study advances previous work by relaxing certain assumptions and providing both synthetic and real-data validations of the theoretical findings, the analysis is specifically confined to the Re LU regression problem. Consequently, it remains uncertain whether these conclusions can be generalized to other nonconvex optimization scenarios. Future research could explore whether similar patterns hold across a broader range of nonconvex problems. [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan Mc Mahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308 318, 2016. [2] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195 1199, 2017. [3] Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvtiskii. Bounding user contributions: A bias-variance trade-off in differential privacy. In International Conference on Machine Learning, pages 263 271. PMLR, 2019. [4] Galen Andrew, Om Thakkar, Brendan Mc Mahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34:17455 17466, 2021. [5] Raman Arora, Raef Bassily, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. Differentially private generalized linear models revisited. Advances in Neural Information Processing Systems, 35:22505 22517, 2022. [6] Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, and Kunal Talwar. Private adaptive gradient methods for convex optimization. In International Conference on Machine Learning, pages 383 392. PMLR, 2021. [7] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in neural information processing systems, 32, 2019. [8] Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. Advances in Neural Information Processing Systems, 34:9317 9329, 2021. [9] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. In Conference on Learning Theory, pages 474 499. PMLR, 2021. [10] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th annual symposium on foundations of computer science, pages 464 473. IEEE, 2014. [11] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1 24, 2011. [12] Meng Ding, Kaiyi Ji, Di Wang, and Jinhui Xu. Understanding forgetting in continual learning with linear regression. ar Xiv preprint ar Xiv:2405.17583, 2024. [13] Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265 284. Springer, 2006. [14] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 715 724, 2010. 
[15] Úlfar Erlingsson, Ilya Mironov, Ananth Raghunathan, and Shuang Song. That which we call private. ar Xiv preprint ar Xiv:1908.03566, 2019. [16] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439 449, 2020. [17] Vitaly Feldman, Audra Mc Millan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 954 964. IEEE, 2022. [18] Spencer Frei, Yuan Cao, and Quanquan Gu. Agnostic learning of a single neuron with gradient descent. Advances in Neural Information Processing Systems, 33:5417 5428, 2020. [19] Surbhi Goel and Adam R Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470 1499. PMLR, 2019. [20] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. High dimensional differentially private stochastic optimization with heavy-tailed data. In Proceedings of the 41st ACM SIGMODSIGACT-SIGAI Symposium on Principles of Database Systems, pages 227 236, 2022. [21] Prateek Jain and Abhradeep Guha Thakurta. (near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pages 476 484. PMLR, 2014. [22] Peter Kairouz, Brendan Mc Mahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213 5225. PMLR, 2021. [23] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011. [24] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. ar Xiv preprint ar Xiv:1711.03908, 2017. [25] Shiva Prasad Kasiviswanathan and Hongxia Jin. Efficient private empirical risk minimization for high-dimensional learning. In International Conference on Machine Learning, pages 488 497. PMLR, 2016. [26] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps. ar Xiv preprint ar Xiv:2103.15352, 2021. [27] Xiyang Liu, Prateek Jain, Weihao Kong, Sewoong Oh, and Arun Suggala. Label robust and differentially private linear regression: Computational and statistical efficiency. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. [28] Xiyang Liu, Prateek Jain, Weihao Kong, Sewoong Oh, and Arun Sai Suggala. Near optimal private and robust linear regression. ar Xiv preprint ar Xiv:2301.13273, 2023. [29] Xiyang Liu, Weihao Kong, Prateek Jain, and Sewoong Oh. Dp-pca: Statistically optimal and differentially private pca. Advances in Neural Information Processing Systems, 35:29929 29943, 2022. [30] H Brendan Mc Mahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. ar Xiv preprint ar Xiv:1710.06963, 2017. [31] Yuan Qiu, Jinyan Liu, and Di Wang. Truthful generalized linear models. 
In Kristoffer Arnsfelt Hansen, Tracy Xiao Liu, and Azarakhsh Malekian, editors, Web and Internet Economics - 18th International Conference, WINE 2022, Troy, NY, USA, December 12-15, 2022, Proceedings, volume 13778 of Lecture Notes in Computer Science, pages 369 370. Springer, 2022. [32] Yuan Qiu, Jinyan Liu, and Di Wang. Truthful and privacy-preserving generalized linear models. Information and Computation, 301:105225, 2024. [33] Hanpu Shen, Cheng-Long Wang, Zihang Xiang, Yiming Ying, and Di Wang. Differentially private non-convex learning for multi-layer neural networks. ar Xiv preprint ar Xiv:2310.08425, 2023. [34] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private glms. In International Conference on Artificial Intelligence and Statistics, pages 2638 2646. PMLR, 2021. [35] Shuang Song, Om Thakkar, and Abhradeep Thakurta. Characterizing private clipped gradient descent on convex generalized linear problems. ar Xiv preprint ar Xiv:2006.06783, 2020. [36] Jinyan Su, Lijie Hu, and Di Wang. Faster rates of differentially private stochastic convex optimization. Journal of Machine Learning Research, 25(114):1 41, 2024. [37] Jinyan Su and Di Wang. Faster rates of differentially private stochastic convex optimization. ar Xiv preprint ar Xiv, 2108, 2021. [38] Jinyan Su, Changhong Zhao, and Di Wang. Differentially private stochastic convex optimization in (non)-euclidean space revisited. In Uncertainty in Artificial Intelligence, pages 2026 2035. PMLR, 2023. [39] Youming Tao, Yulian Wu, Xiuzhen Cheng, and Di Wang. Private stochastic convex optimization and sparse learning with heavy-tailed data revisited. In IJCAI, pages 3947 3953. ijcai.org, 2022. [40] Hoang Tran and Ashok Cutkosky. Momentum aggregation for private non-convex erm. Advances in Neural Information Processing Systems, 35:10996 11008, 2022. [41] Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (nearly) optimal private linear regression via adaptive clipping. ar Xiv preprint ar Xiv:2207.04686, 2022. [42] Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. In International Conference on Machine Learning, pages 6526 6535. PMLR, 2019. [43] Di Wang and Jinhui Xu. Differentially private empirical risk minimization with smooth nonconvex loss functions: A non-stationary view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1182 1189, 2019. [44] Di Wang and Jinhui Xu. On sparse linear regression in the local differential privacy model. In International Conference on Machine Learning, pages 6628 6637. PMLR, 2019. [45] Di Wang and Jinhui Xu. Escaping saddle points of empirical risk privately and scalably via dptrust region method. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14 18, 2020, Proceedings, Part III, pages 90 106. Springer, 2021. [46] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. Advances in Neural Information Processing Systems, 30, 2017. [47] Lingxiao Wang, Bargav Jayaraman, David Evans, and Quanquan Gu. Efficient privacypreserving stochastic nonconvex optimization. In Uncertainty in Artificial Intelligence, pages 2203 2213. PMLR, 2023. [48] Lingxiao Wang, Difan Zou, Kumar Kshitij Patel, Jingfeng Wu, and Nathan Srebro. 
Private overparameterized linear regression without suffering in high dimensions, 2024. [49] Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Sham M Kakade. Finite-sample analysis of learning high-dimensional single relu neuron. 2023. [50] Hanshen Xiao, Zihang Xiang, Di Wang, and Srinivas Devadas. A theory to instruct differentiallyprivate learning via clipping bias reduction. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2170 2189. IEEE Computer Society, 2023. [51] Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private erm for smooth objectives. ar Xiv preprint ar Xiv:1703.09947, 2017. [52] Qiuchen Zhang, Jing Ma, Jian Lou, and Li Xiong. Private stochastic non-convex optimization with improved utility rates. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021. [53] Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Private sgd with gradient subspace identification. ar Xiv preprint ar Xiv:2007.03813, 2020. [54] Liyang Zhu, Meng Ding, Vaneet Aggarwal, Jinhui Xu, and Di Wang. Improved analysis of sparse linear regression in local differential privacy model. ar Xiv preprint ar Xiv:2310.07367, 2023. [55] Liyang Zhu, Amina Manseur, Meng Ding, Jinyan Liu, Jinhui Xu, and Di Wang. Truthful high dimensional sparse linear regression. ar Xiv preprint ar Xiv:2410.13046, 2024. [56] Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Benign overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory, pages 4633 4635. PMLR, 2021. A Algorithms Algorithm 1 DP-GLMtron 1: Input: Training Samples: {(xi, yi)}N 1 i=0 , Estimating Samples: {(x i, y i)}m i=1, Learning Rate: η, DP Noise Multiplier: f, Expected x Norm: p α tr(H), Privacy Parameters: ε , δ , Clipping list γ = {} 2: Randomly permute {(xi, yi)}N 1 i=0 3: Initialize w0 0 4: for t = 0, . . . , N 1 do 5: st DP-Threshold({(x i, y i)}m i=1, wt, ε , δ ) 6: γt = p 2α tr(H)C2 log2a N st 7: Sample gt N(0, Id d) 8: wt+1 wt η(lt + 2fγtgt) 9: where lt = clipγt(x t (Re LU(x t wt) yt)) 10: end for 11: return w 1 N PN 1 t=0 wt Algorithm 2 DP-Threshold 1: Input: Estimating Samples: {(x i, y i)}m i=1, Privacy Parameters: ε , δ , Model w, Distribution Parameters: a, bz in Definition 3.5 2: i [m], d1 i (x i w yi)2, d2 i y2 i , d3 i (x i w)2, i [m] 3: Partition {d1 i }m i=1, {d2 i }m i=1 and {d3 i }m i=1 into three K = C1 log(N/(δ ))/ε subsets of equal size separately, namely {D1 j}K j=1, {D2 j}K j=1 and {D3 j}K j=1 4: for p=1,2,3 do 5: (dp j) P i Dp j dp i /|Dp j |, j [K] 6: Partition [0, ) into geometrically increasing intervals Ωp := {. . . , [2 1, 1), [1, 2), [2, 22), . . .} {[0, 0]} 7: Compute d p k P j Dp k dp j1(dp j (dp k) )/|Dp k|, where Dp k is the interval in Ωp 8: if d p k (0, 2 log(1/δ )/(ε K) + (1/K)) then 9: edp k d p k + pp k, where qp k Lap(0, 2/(ε K)) 10: else 11: edp k 0 12: end if 13: Let [sp 1, sp 2] be the nonempty interval containing the largest edp k 14: if p=3 then 15: set sp = q 2(sp 2 + σ2 log2a(1/bz)) 16: else 17: set sp = p 2sp 2 18: end if 19: end for 20: return s = max{sp}3 p=1 Algorithm 3 DP-TAGLMtron 1: Input: Training Samples: {(xi, yi)}N 1 i=0 , Estimating Samples: {(x i, y i)}m i=1, Learning Rate: η, DP Noise Multiplier: f, Expected x Norm: p α tr(H), Privacy Parameters: ε , δ , Clipping list γ = {} 2: Initialize w0 0 3: for t = 0 . . . 
N 1 do 4: st DP-Threshold({(x i, y i)}m i=1, wt, ε , δ ) 5: γt p 2α tr(H)C2 log2a N st 6: add γt to the list γ and Γ = max γ 7: lt clipγt(xt(Re LU(x t wt) yt)) 8: l t DP-Tree-Aggregation(l t, f, Γ) 9: Update wt+1 argminw w, l t + 1 2η w 2 2 10: end for 11: return w N 1 N PN 1 t=0 wt Algorithm 4 DP-Tree-Aggregation 1: Input: Vectors: {li}t 1 i=0, Noise Paramters: f, Γ 2: Initialization Create a binary tree T of size 2 log2 N +1 1 with N leaf nodes, and each node is sampled i.i.d. from N(0, f 2Γ2Id d). 3: At t-th iteration, do: 4: Accept lt and add it to all the nodes along the path to the root of T . 5: Let [e1, . . . , eh] be the list of nodes from the root of T to the t-th leaf node, with node e1 being the root node and node eh being the leaf node, where h is the height of tree. 6: Let l t = 0d and {b1, . . . , bh} be the h bit binary representation of t. 7: for j [h] do 8: if bj = 1 then 9: l t l t + ej, where ej is the value in left sibling of ej node. 10: end if 11: end for 12: return l t B Additional Experiment 0 1000 2000 3000 4000 Epoch Excess Risk DP-SGD DP-GLMtron DP-TAGLMtron DP-FTRL 0 1000 2000 3000 4000 Epoch Excess Risk DP-SGD DP-GLMtron DP-TAGLMtron DP-FTRL 0 1000 2000 3000 4000 Epoch Excess Risk DP-SGD DP-GLMtron DP-TAGLMtron DP-FTRL Figure 3: Comparative Performance of Various Differential Privacy Algorithms on MNIST Across Different Privacy Budgets. This figure illustrates the performance of four differential privacy algorithms DP-SGD, DP-GLMtron, DP-TAGLMtron, and DP-FTRL on generated Bernoulli distributed data, comparing excess risk across different sample sizes and privacy budgets (ε = 0.05, 0.2, 0.5). In our second experiment, we evaluate the performance of four different differential privacy (DP) algorithms DP-SGD, DP-GLMtron, DP-TAGLMtron, and DP-FTRL using the MNIST dataset across varying sample sizes and privacy budgets denoted by ε, with values set at 0.05, 0.2, and 0.5. The learning rate was initially set to 0.05, with N representing the sample size, which varied from 0 to 1000 and increased in steps to 100. We utilized a Re LU regression model with noise to be trained using the previously mentioned algorithms, integrating privacy-preserving noise adjustments. The primary metric for evaluation was the average excess risk across varying dataset sizes and ε values. The key insights from our findings are: Under Tight Privacy Constraints (ε = 0.05): DP-TAGLMtron demonstrated superior performance across all sample sizes, achieving the lowest excess risk, while DP-GLMtron s excess risk increased with larger sample sizes. Under Moderate Privacy Constraints (ε = 0.2): The performance differential between our DP-GLMtron and the baseline narrowed, and DP-TAGLMtron continued to exhibit lower excess risks. DP-SGD s performance suffered with larger sample sizes, highlighting its sensitivity to sample size under moderate privacy levels. Under Relaxed Privacy Constraints (ε = 0.5): Our proposed methods markedly outperformed both DP-SGD and DP-FTRL, showcasing their enhanced risk mitigation capabilities when privacy constraints are loosened. These insights demonstrate the effectiveness of our proposed algorithms, DP-TAGLMtron and DP-GLMtron, in optimizing the balance between privacy protection and model accuracy. 200 400 Epoch Excess Risk =0.05 =0.20 =0.50 200 400 Epoch Excess Risk =0.05 =0.20 =0.50 200 400 Epoch Excess Risk =0.05 =0.20 =0.50 200 400 Epoch Excess Risk DP-TAGLMtron =0.05 =0.20 =0.50 Figure 4: Performance of Different Algorithms under Various Privacy Budgets. 
Sensitivity of Privacy Budgets. Here, we also included additional experimental results related to the performance of different algorithms under various privacy budgets. we demonstrate the sensitivity of the privacy budget for each algorithm, and it is clear that DP-SGD faces the same issue. When ε is small, the tree mechanism in DP-FTRL allows it to add less noise compared to the naive Gradient Descent or GLMtron algorithm, resulting in similar performance between DP-FTRL and DP-GLMtron. Additionally, it is important to note that DP-FTRL should be compared with DP-TAGLMtron rather than DP-GLMtron, as both DP-FTRL and DP-TAGLMtron utilize the tree mechanism, whereas DP-GLMtron does not. Moreover, even though DP-FTRL converges at the end of the training, it can be observed that the excess risk for DP-FTRL is significantly larger than that of DP-TAGLMtron, indicating that it may converge to a suboptimal solution. C Supported Lemmas Here, we provide some relaxation for Assumption 3.6. Assumption C.1 (Moment symmetricity conditions [49]). Assume that: 1. For every u H, it holds that E[xx 1[x u > 0]] = E[xx 1[x u < 0]]. 2. For every u H and v H, it holds that E[xx 1[x u > 0, x v > 0]] = E[xx 1[x u < 0, x v < 0]]. 3. For every u H, it holds that E[x 4 1[x u > 0]] = E[x 4 1[x u < 0]]. 4. For every u H and v H, it holds that E[(x v)2xx 1[x u > 0, x v > 0]] = E[(x v)2xx 1[x u < 0, x v < 0]]. Lemma C.2 ([49]). The following results are direct consequences of Assumption C.1. 1. Under Assumption C.1, it holds that: for every vector u H, E[xx 1[x u > 0]] = 1 2 E[xx ] =: 1 2. Under Assumption C.1, it holds that: for every vector u H, E[x 4 1[x u > 0]] = 1 2 E[x 4] =: 1 D DP-GLMtron D.1 Privacy Guarantee To guarantee the privacy of Algorithm 1, we should first ensure that each step of DP-GLMtron is private. Lemma D.1. Each update step of DP-GLMtron (Algorithm 1)ensures (ε0, δ0)-differential privacy, provided that f = c ε0 , where c p 2 log(1.25/δ0), and γt denotes the clipping norm. Proof. We first consider {wt}N 1 t=0 . Each update step (excluding the DP-noise addition) is of the form: wt+1 wt η clipγt(xt(Re LU(x t wt) yt)), where clipγt(v) = v max{1, γt v 2 }. Consequently, the local L2 sensitivity of wt+1 is determined by analyzing the variation in the tth iteration data sample, as follows: 2 = w t+1 wt+1 = η clipγt(x t(Re LU(x t wt) y t)) η clipγt(xt(Re LU(x t wt) yt)) 2η clipγt(xt(Re LU(x t wt) yt)) Moreover, denoting ε = ε0/ p 8N log(2/δ0) and δ = δ0/(2N), according to Lemma 2.3 in [24], we could know that {st}N t=0 is (ε , δ )-DP. Lemma D.2 ([17]). For a domain D, let R(i) : f D S(i) for i [n] (where S(i) is the range space of R(i) ) be a sequence of algorithms such that R(i)(z1:i 1, ) is a (ε0, δ0)-DP local randomizer for all values of auxiliary inputs z1:i 1 S(1) S(i 1). Let As : Dn S(1) S(n) be the algorithm that given a dataset x1:n Dn, samples a uniformly random permutation π, then sequentially computes zi = R(i)(z1:i 1, xπ(i)) for i [n], and outputs z1:n. Then for any δ [0, 1] such that ε0 log( n 16 log(2/δ)), As is (ε, δ + O(eεδ0n))-DP where ε is: ε = O((1 e ε0)( eε0 log(1/δ) n + eε0 We are now prepared to prove Theorem 4.1, utilizing the lemmas discussed above. Firstly, we reformulate the update rule into a sequence of one-step algorithms as follows: R(t+1)(u0:t, (x, y)) := wt+1 wt(u0:t) η clipγt(xπ(t)(Re LU(x π(t)wt(u0:t)) yπ(t))) 2ηγtfgt, where u denotes auxiliary inputs, and π(t) represents the sample at the t-th iteration after randomly permuting the input data. 
From Lemma D.1, each R(t+1)(u0:t, ) is a (ε0, δ0)-DP local randomizer algorithm, where ε0 log( N 16 log(2/bδ)). The output of DP-GLMtron is derived through post-processing of the shuffled outputs ut+1 = R(t+1)(u0:t, (x, y)) for t 0, . . . , N 1. Therefore, by Lemma D.2, Algorithm DP-GLMtron adheres to (bε, bδ + O(ebεδ0N))-DP, where: bε = O((1 e ε0)( eε0 log(1/bδ) Assuming ε0 1 2, we can infer the existence of some constant c1 > 0 such that: bε c1 (1 e ε0)( eε0 log(1/bδ) c1 ((eε0/2 e ε0/2) N + (eε0 1) 1 c1 (((1 + ε0) (1 ε0/2)) N + ((1 + 2ε0) 1) 1 By setting f = 2 log(1.25/δ0) ε0 in Lemma D.1, we ensure that each update step of DP-GLMtron independently satisfies (ε0, δ0)-DP, based on standard Gaussian mechanism. Replacing ε0 = 2 log(1.25/δ0) f , we obtain: 2 log(1.25/δ0) log(1/bδ) To satisfy overall (ε, δ)-DP, set bδ = δ 2, and δ0 = c2 δ ebεN for some constant c2 > 0. From this, we have: 2 log(c2 1.25 ebεN/δ) log(2/δ) For any ε 1, setting f = c3 log(N/δ) log(N/δ) log(1/δ) N for a sufficiently large c3 > 0 ensures that bε ε. Additionally, to fulfill Lemma D.2 s assumption, ε0 < 1 2 must be satisfied, which is attainable by setting ε = O( q This implies that for f = Ω( log(N/δ) N ), DP-GLMtron achieves (ε, δ)-DP as long as ε = N ), thereby completing the proof. D.2 Utility Guarantee Here we first provide several results that will be used in our DP-GLMtron utility bound. Lemma D.3 ([28]). Let m = Ω(b log(b/δ ) log N/ε ) and denote V = max{ w wt 2 H + σ2, w 2 H + σ2, wt 2 H + σ2}, with probability at least 1 1/b, Algorithm 2 returns st such that V γ2 t 2V . Proof. Consider each sample xi D satisfying (H, C2, a, b)-Tail and the inherent noise zi satisfying (σ2, C2, a, b)-Tail. It implies the following facts with Assumption 3.4 a > 0 s.t. with probability 1 bx, such that x 2 2 α tr(H) log2a(1/bx). a > 0 s.t. with probability 1 bv, such that x, v 2 C2 2v Hv log(1/bv). a > 0 s.t. with probability 1 bz, such that z 2 2 σ2 log2a(1/bz). Additionally, we know that xt(Re LU(x t wt) yt) = xt(max(0, x t wt) yt) xt (max(0, x t wt) yt) According to the aforementioned facts, it holds that with probability 1 bx, such that x 2 2 α tr(H) log2a(1/bx). Moreover, st max{ q 2((x t (wt w ))2 + z2 t ), q 2((x t w )2 + z2 t ), q 2((x t wt)2 + z2 t )} 2C2 2 log2a(1/b) max{ q w wt 2 H + σ2, q w 2 H + σ2, q wt 2 H + σ2}. The second part of the proof can be directly derived from Lemma 2.3 in [24] and Theorem 5 in [28], because each part of dp i , with high probability, holds that st max{ q w wt 2 H + σ2, q w 2 H + σ2, q wt 2 H + σ2}. Therefore, we have the following clipping results: xt(Re LU(x t wt) yt) 2 2 2α tr(H) log2a(1/bx) C2 2 log2a(1/b) max{ w wt 2 H + σ2, w 2 H + σ2, wt 2 H + σ2} 2α tr(H) log4a(1/b) C2 2 s2 t. Lemma D.4 (Generic bounds on the DP-GLMtron iterates [49]). Suppose that Assumption 3.6 holds. Considering the DP-GLMtron algorithm, we have the following recursion: 2(HAt 1 + At 1H) + η2M At 1 + η2σ2H + 4η2Γ2f 2I 2(HAt 1 + At 1H) + η2 4 M At 1 + η2σ2H + 4η2eΓ2f 2I where At := E(wt w )(wt w ) , t 0 Now consider the recursion of At given in Lemma D.4. 
Note that At is related to At 1 through a linear operator, therefore At can be understood as the sum of two iterates, i.e., At := Bt + Ct, where Bt (I η 2 T (2η)) Bt 1; B0 = (w0 w ) 2 2 T (2η)) Ct 1 + η2σ2H + 4η2Γ2f 2I; C0 = 0 (7) and Bt (I η 2)) Bt 1; B0 = (w0 w ) 2 2)) Ct 1 + η2σ2H + 4η2eΓ2f 2I; C0 = 0 (8) 2 T (2η)) At 1 := At 1 η 2(HAt 1 + At 1H) + η2M At 1; (I η 2)) At 1 := At 1 η 2(HAt 1 + At 1H) + η2 4 M At 1 Besides, since our DP-GLM-tron is run with constant stepsize η and outputs the average of the iterates: t=0 wt. (9) Then, the following lemma holds: Lemma D.5. Suppose that Assumption 3.6 hold. For w N defined in Equation (9), we have that E H, (w N w ) 2 1 ηN 2 (I η 2H)k t H, At E H, (w N w ) 2 1 2ηN 2 (I η 2H)k t H, At . E[wt w | wt 1] =E[(I η1[x t wt 1 > 0]xtx t )(wt 1 w ) | wt 1] + 2ηΓf E[gt 1 | wt 1] + η E[(1[x t w > 0] 1[x t wt 1 > 0])xtx t w | wt 1] + ηE[ztxt | wt 1] =E[(I η1[x t wt 1 > 0]xtx t )(wt 1 w ) | wt 1] 2H)(wt 1 w ) The remaining proof simply follows [56]. From the decomposition presented in Equation (7) and Equation (8), we know that PN t=0 At = PN t=0 Bt +PN t=0 Ct. With this foundation, we can now bound the bias and variance terms separately. Upper bound for variance error For t = 0 we have C0 = 0 ησ2 1 η(α tr(H))I + 4ηΓ2f 2 1 η(α tr(H))H 1. We then assume that Ct 1 ησ2 1 η(α tr(H))I + 4ηΓ2f 2 1 η(α tr(H))H 1, and exam Ct based on Equation (7): Ct (I η T (η)) Ct 1 + η2σ2H + 4η2Γ2f 2I 2(HCt 1 + Ct 1H) + η2M Ct 1 + η2σ2H + 4η2Γ2f 2I 1 η(α tr(H))I + 4ηΓ2f 2 1 η(α tr(H))H 1 η( ησ2 1 η(α tr(H))H + 4ηΓ2f 2 1 η(α tr(H))I) + η2(α tr(H))( ησ2 1 η(α tr(H))H + 4ηΓ2f 2 1 η(α tr(H))I) + η2σ2H + 4η2Γ2f 2I 1 η(α tr(H))I + 4ηΓ2f 2 1 η(α tr(H))H 1. For the simplicity, we define Σ := σ2H + 4Γ2f 2I. By the definitions of T and e T , we have: 2 T (2η)) Ct 1 + η2Σ 2 e T (2η)) Ct 1 + η2(M 1 4 f M) Ct 1 + η2Σ 2 e T (2η)) Ct 1 + η2M Ct 1 + η2Σ, where the last inequality is due to the fact that f M is a PSD mapping. Then by the iteration of variance, we have for all t 0, M Ct M ( ησ2 1 η(α tr(H))I+ 4ηΓ2f 2 1 η(α tr(H))H 1) ησ2(α tr(H)) 1 η(α tr(H))H+4ηΓ2f 2(α tr(H)) 1 η(α tr(H)) I Substituting above into Appendix D.2, we obtain: 2 e T (2η)) Ct 1 + η3(α tr(H)) 1 η(α tr(H)) (σ2H + 4Γ2f 2I) + η2Σ 2 e T (2η)) Ct 1 + η3(α tr(H)) 1 η(α tr(H)) (σ2H + 4Γ2f 2I) + η2(σ2H + 4Γ2f 2I) 2 e T (2η)) Ct 1 + η2 1 η(α tr(H)) (σ2H + 4Γ2f 2I) 1 η(α tr(H)) 2 e T (2η))k (σ2H + 4Γ2f 2I) (10) 1 η(α tr(H)) 2H)k(σ2H + 4Γ2f 2I)(I η 1 η(α tr(H)) 2H)k(σ2H + 4Γ2f 2I) 1 η(α tr(H)) (I (I η 2H)t) + 4ηΓ2f 2 1 η(α tr(H)) (I (I η Consequently, the variance error can be represented as follows, in accordance with Lemma D.5: Variance error 1 ηN 2 I (I η 1 ηN 2 I (I η 1 η(α tr(H)) (I (I η + 1 ηN 2 I (I η 1 η(α tr(H)) (I (I η 2H)t) H 1 (11) (1 η(α tr(H)))N I (I η 2H)N, (I (I η (1 η(α tr(H)))N I (I η 2H)N, (I (I η 2H)N) H 1 . Noting that for any x (0, 1/η), we have 1 (1 ηx)N min{1, Nηx}, then 2H)N I0:k + N η Final results can be obtained: variance error: (1 η(α tr(H)))N (k + N 2η2 i>k λ2 i ) + 4Γ2f 2 (1 η(α tr(H)))N Tr(H 1 0:k + N 2η2 (1 η(α tr(H)))N (k + N 2η2 i>k λ2 i ) + 4Γ2f 2 N(1 η(α tr(H)))(k Nη + N 2η2 N (k + N 2η2 X i>k λ2 i ) + Γ2f 2 ik λ2 i ) + Γ2f 2η(k + Nη X where the second inequality holds since tr(H 1 0:k ) = Pk i=1 λ 1 i Nη k and λi 1 ηN . Lower bound for variance error Similarly, we derive a recursion for the variance term to establish the lower bound: Ct = (I ηT (η)) Ct 1 + η2Σ = (I η e T (η)) Ct 1 + η2(M 1 4 f M) Ct 1 + η2Σ (I η e T (η)) Ct 1 + η2(σ2H + 4eΓ2f 2I). 
It implies that 2H)k(σ2H + 4eΓ2f 2I)(I η 2H)k(σ2H + 4eΓ2f 2I) = η2σ2 (I (I η 2H)2t) (ηI η2 4 H) 1 + 4η2eΓ2f 2 (I (I η 2H)2t) (ηH η2 ησ2 (I (I η 2H)2t) + 4ηeΓ2f 2 (I (I η 2H)2t) H 1. where the last inequality comes from ηI η2 Applying Lemma D.5, we could derive the lower bound for variance error: variance error 1 2ηN 2 2H)k t H, Ct 2H)N t, (I (I η 2H)N t, (I (I η t=0 (1 (1 η 2λi)N t)(1 (1 η t=0 (1 (1 η 2λi)N t)(1 (1 η 2λi)2t)λ 1 i t=0 (1 (1 η 2λi)N t 1)(1 (1 η t=0 (1 (1 η 2λi)N t 1)(1 (1 η 2λi)t)λ 1 i . By defining f(x) := PN 1 t=0 (1 (1 x)N t 1)(1 (1 x)t) and according to Lemma C.3 in [56], we have: variance error σ2 i f(ηλi) + 4eΓ2f 2 i f(ηλi)λ 1 i i>k λ2 i ) + 4eΓ2f 2 ik λ2 i ) + 4eΓ2f 2 ik λ2 i ) + eΓ2f 2 i 0]xtx t )(wt 1 w ) + η(1[x t w > 0] 1[x t wt 1 > 0])xtx t w + ηztxt ηΓt. Let us consider the expected outer product: At = E[ wt w 2 2] = E[(wt w )(wt w ) ] = E ((I η1[x t wt 1 > 0] xtx t ) (wt 1 w ))((I η1[x t wt 1 > 0] xtx t ) (wt 1 w )) | {z } (quadratic term 1) + η2 E (1[x t w > 0] 1[x t wt 1 > 0])2 (xtx t w )(xtx t w ) | {z } (quadratic term 2) + 2η E(1[x t w > 0] 1[x t wt 1 > 0]) (I η1[x t wt 1 > 0] xtx t )(wt 1 w )(xtx t w ) | {z } (crossing term 1) η (E[(I η1[x t wt 1 > 0]xtx t )(wt w )Γ t ] + E[Γt(wt w ) (I 1[x t wt 1 > 0]ηxtx t )]) | {z } (crossing term 2) η (E[(1[x t w > 0] 1[x t wt 1 > 0])(xtx t w )Γ t ] + E[Γt(xtx t w ) (1[x t w > 0] 1[x t wt 1 > 0])]) | {z } (crossing term 3) + η2 E(σ2x t xt) | {z } (quadratic term 3) +η2 E[ΓtΓ t ] | {z } (quadratic term 4) It can be observed that quadratic terms 1,2,3 and crossing term 1 are original terms for non-private problems, where the remaining are introduced by the private noise. According to Assumption 3.6, Lemma C.2 and the independence of w , we have: (crossing term 3) = 0, (crossing term 2) = E[(I η1[x t wt 1 > 0]xtx t )wtΓ t ] + E[Γtw t (I 1[x t wt 1 > 0]ηxtx t )]. Now we examine the crossing term 2, denoting Rt := (I η1[x t wt 1 > 0]xtx t ). According to the update rule, we have: wt = wt 1 η (Re LU(x t wt 1) yt)xt + Γt = (I η1[x t wt 1 > 0] xtx t )wt 1 + ηxtyt + Γt j=1 (Rt 1Rt 2 Rj(ηxj 1yj 1 + ηΓj 1)) + ηxt 1yt 1 + ηΓt 1. Hence, we have: j=1 (Rt 1Rt 2 Rj(ηxj 1yj 1 + ηΓj 1))Γ t + ηxt 1Γ t yt 1 + ηΓt 1Γ t . By multiplying both sides by A and taking the expected value, we can obtain: E[(I ηxtx t )wtΓ t ] = E[ηRt j=1 (Rt 1Rt 2 RjΓj 1)Γ t ] + E[ηRtxt 1yt 1Γ t ] + E[ηRtΓt 1Γ t ]. We first consider the second term, notice that: E[ηRtxt 1yt 1Γ t ] = E[ηRt(1[x t w > 0] xtx t w + ztxt)Γ t ] = 0. Furthermore, considering Γt and Γj, when j < t and j / Q(t), it follows that E[ΓjΓ t ] = 0, otherwise, we have: E[ΓjΓ t ] = ( X q Q(j 1) γ(q)f 2gq)( X q Q(t) γ(q )f 2gq ) . These results arise because only Γq is dependent on gq for q Q(t). Hence, denoting R[t:j] = Rt Rj, we have: j=1 (Rt 1Rt 2 RjΓj 1)Γ t ] j Q(t) (Rt 1Rt 2 RjΓj 1)Γ t ] j Q(t) R[t:j]( X q Q(j 1) γ(q)f 2gq)( X q Q(t) γ(q )f 2gq ) ] j Q(t) R[t:j]( X eq Q(j 1) Q(t) γ2(eq)f 2geqg eq + X q =q γ(q )γ(q)f 2gqg q )] eq Q(j 1) Q(t) E[γ2(eq)f 2R[t:j]]. For the last term of crossing term 3, we could also derive: E[ηRtΓt 1Γ t ] = X eq Q(t 1) Q(t) E[γ2(eq)f 2Rt]. Combining the above results with the symmetric of Rt, it implies that: (crossing term 2) = ηf 2 X eq Q(j 1) Q(t) E[γ2(eq)R[t:j] + γ2(eq)R[j:t]] ηf 2( log2 N + 1) X j Q(t) E[Rt . . . Rj Rj . . . Rt] + η2f 2( log2 N + 1)E[γ4(t)]I. Recall Rt := (I η1[x t wt 1 > 0]xtx t ) and the lemma in [49]. It indicates that: E[Rt . . . Rj Rj . . . Rt] (I η 2 T (2η))t j+1 I. We first consider Pp = (I η 2 T (2η)) Pp 1 and P0 = I. 
Then, it follows that: 2 e T (2η)) Pp 1 + η2(M f M) Pp 1 2 e T (2η)) Pp 1 + η2α tr(HPp 1)H 2 e T (2η))p I + 2η2α tr(H) 2 e T (2η))i H 2 e T (2η))p I + 4ηα tr(H)(I (I η where the term HPp 1 can be bounded by Lemma E.2. Now, we could derive: X j Q(t) E[Rt . . . Rj Rj . . . Rt] X j Q(t) (I η 2 e T (2η))p I + 4ηα|Q(t)| tr(H)(I (I η As a results, the generic bounds on the DP-TAGLMtron iterate could have the following recursion: 2 T (2η)) At 1 + η2σ2H + η2f 2( log2 N + 1)E[γ4(t) + γ2(t)]I. + η2f 2( log2 N + 1)( X j Q(t) (I η 2 e T (2η))p I + 4ηα|Q(t)| tr(H)(I (I η We now move on to the second part. In light of the Γt definition, we know it is at least a random Gaussian vector such that Γt = eΓgt, where eΓ = min γ. Moreover, it holds that E[(I ηxtx t )wtΓ t ] = E[ηRt j=1 (Rt 1Rt 2 RjΓj 1)Γ t ] + E[ηRtxt 1yt 1Γ t ] + E[ηRtΓt 1Γ t ] = 0. Since E[ΓjΓ t ] = 0 for any j = t. Therefore, the iteration of At will be the same case in DPGLMtron. Consequently, analogous to the previous case, we can decompose At into bias term Bt and variance term Ct. 2 T (2η)) Bt 1, B0 = (w0 w )(w0 w ) ; 2 T (2η)) Ct 1 + η2Mt, C0 = 0 Mt σ2H + f 2( log2 N + 1)E[γ4(t) + γ2(t)]I + f 2( log2 N + 1)( X j Q(t) (I η 2 e T (2η))p I + 4ηα|Q(t)| tr(H)(I (I η Lemma E.6. Let M represent a union upper bound for M0, . . . , MN 1. Under Assumptions (A) and 3.6, and provided that the step size η satisfies η 1/(tr(H) log N), the following holds: 1 αη tr(H) I (I η 2H)N, M (I (I η 2 e T (2η))t M. Proof of Lemma E.6. Our proofs follow the main idea of [48], we include them here for completeness. According to the update rule of Ct, it follows that: Ct = η2 N 1 X 2 T (2η)) Mt. It can be observed that C0 C1 CN since (I η 2 T (2η)) is a PSD mapping. Let M be an union upper bound of {M0, . . . , MN 1}, we will have: CN η2 N 1 X 2 T (2η)) M = 2η T 1 (I (I η 2 T (2η))N) M 2η T 1 (I (I η 2 e T (2η))N) M. The last inequality comes from (e T T ), (I η 2 T (2η)) and (I η 2 e T (2η)) are PSD mappings. Hence, it implies that: Ct+1 = (I η 2 T (2η)) Ct + η2 (M f M) Ct + η2Mt 2 T (2η)) Ct + η2 (M f M) CN + η2M 2 T (2η)) Ct + η2 (M f M) 2η T 1 (I (I η 2 e T (2η))N) M + η2M 2 T (2η)) Ct + 4αη3 1 αη tr(H) I (I η 2H)N, M H + η2M, where the last inequality comes from Lemma E.3. Therefore, we have: 1 αη tr(H) I (I η 2 e T (2η))N 1 t H + η2 2 e T (2η))N 1 t M 1 αη tr(H) I (I η 2H)N, M (I (I η 2 e T (2η))t M. Lemma E.7. Variance error] If the step size satisfies η 1/(tr(H) log N), then the following holds for any k 1, variance error αη N(1 αη tr(H)) I0:k +Nη 2 Hk : , f M (k +N 2η2 i>k λ2 i )+ 1 N H 1 0:k +N 2η2 4 Hk : , f M , f M = σ2 H+ηαf 2(log2 N)2 tr(H) (I0:k + Nη 2 Hk : )+f 2 log2 N (E[γ(N)2+γ(N)4]+1) I. Proof. According to Lemma D.5, we have: variance error 16αη N(1 αη tr(H)) I (I η 16αη N(1 αη tr(H)) I (I η 2H)N, M I (I η 2H)N, I (I η 2 e T (2η))t M = 16αη N(1 αη tr(H)) I (I η 2H)N, M I (I η 2H)N, I (I η 2 e T (2η))t (I (I η = 16αη N(1 αη tr(H)) I (I η 2H)N, M I (I η 2H)N, I (I η 2H)N)2H 1, M . Furthermore, Lemma E.5 implies that, for any t N: Mt σ2H + f 2( log2 N + 1)E[γ4(t) + γ2(t)]I + f 2( log2 N + 1)( X j Q(t) (I η 2 e T (2η))p I + 4ηα|Q(t)| tr(H)(I (I η Now we consider an arbitrary PSD mapping A commuting with H, it holds that: A, M σ2 H, A + f 2 log2 N E[γ4(t) + γ2(t)] tr(A) + ηαf 2(log2 N)2 tr(H) A, I (I η Moreover, noting that for any x (0, 2/η), we have 1 (1 ηx 2 )N min{1, Nηx/2}, then 2H)N I0:k + Nη for any k 0. 
Therefore, we can define a new matrix as follows: f M = σ2 H + ηαf 2(log2 N)2 tr(H) (I0:k + Nη 2 Hk : ) + f 2 log2 N (E[γ(N)2 + γ(N)4] + 1) I. Combining the previous results, we obtain αη N(1 αη tr(H)) I0:k + Nη 2 Hk : , f M (k + N 2η2 i>k λ2 i ) + 1 N H 1 0:k + N 2η2 4 Hk : , f M N (1 + αη( X ik λ2 i )) (k + N 2η2 + f 2 log2 N N (αη log2 N tr(H) + E[γ(N)2 + γ(N)4]) (αη(k + Nη i>k λi) + ( X ik λ2 k k + Nη k>k λk (Nη) 1 r + Nη (Nη) 1 r r (Nη) 1 r . For the sake of simplicity, we assume w0 w 2 W 2, then it hold that: Utility W 2 N 2 ) (Nη) 1 r . If we choose the step size as follows: η min{N 1 1+r , N 1+r 1+2r f 2r 1+2r }, which indicates: Utility (σ + W 2) N r 1+r + (W 2 + Γ2) (f 2N) r 1+2r . Thus, it holds that Utility = e O(N r 1+r (1 + (f 2N r 1+r ) r 1+2r )). 2. If λk = e k, we have k = log(Nη), then: k>k λ2 k k + Nη X k>k λk log(Nη) + Nη 2 (Nη) 1 log(Nη). Utility W 2 N 2 ) log(Nη) W 2 If we choose the following step size η min{1, 1 Nf }, then we complete our results: Utility e O(N 1(1 + (ε 2N)1/2)). Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The paper s main contributions and scope are included in the abstract and introduction. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The paper provides a discussion on the limitation and future direction. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. 
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All assumptions and proofs are included in the submission. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: All experiments can be reproduced. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. 
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: All data and code will be released after acceptance. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We provide the detailed experiment setup Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: The factors that will influence the experiment are discussed. Guidelines: The answer NA means that the paper does not include experiments. 
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: The Experiments Compute Resources are discussed in the Appendix. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The authors have reviewed the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. 
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper poses no such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: All related works are cited correctly and explicitly. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. 
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.