
Published as a conference paper at ICLR 2025

TOWARDS HYPERPARAMETER-FREE OPTIMIZATION WITH DIFFERENTIAL PRIVACY

Ruixuan Liu Emory University ruixuan.liu2@emory.edu

Zhiqi Bu Amazon woodyx218@gmail.com

Differential privacy (DP) is a privacy-preserving paradigm that protects the training data when training deep learning models. Critically, the performance of models is determined by the training hyperparameters, especially those of the learning rate schedule, thus requiring fine-grained hyperparameter tuning on the data. In practice, it is common to tune the learning rate hyperparameters through grid search, which (1) is computationally expensive as multiple runs are needed, and (2) increases the risk of data leakage as the selection of hyperparameters is data-dependent. In this work, we adapt the automatic learning rate schedule to DP optimization for any model and optimizer, so as to significantly mitigate or even eliminate the cost of hyperparameter tuning when applied together with automatic per-sample gradient clipping. Our hyperparameter-free DP optimization is almost as computationally efficient as the standard non-DP optimization, and achieves state-of-the-art DP performance on various language and vision tasks.

1 INTRODUCTION

The performance of deep learning models relies on a proper configuration of training hyperparameters. In particular, the learning rate schedule is critical to the optimization, as a large learning rate may lead to divergence, while a small learning rate may slow down the convergence too much to be useful. In practice, people have used heuristic learning rate schedules that are controlled by many hyperparameters. For example, many large language models, including LLaMa2 (Touvron et al., 2023), use linear warmup and cosine decay in their learning rate schedules, which are controlled by 3 hyperparameters. Generally speaking, hyperparameter tuning (especially for multiple hyperparameters) can be expensive for large datasets and large models.

To address this challenge, it is desirable or even necessary to determine the learning rate schedule in an adaptive and data-dependent way, with little if any manual effort. Recent advances such as D-adaptation (Defazio & Mishchenko, 2023), Prodigy (Mishchenko & Defazio, 2024), DoG (Ivgi et al., 2023), DoWG (Khaled et al., 2023), U-DoG (Kreisler et al., 2024), and GeN (Bu & Xu, 2024) have demonstrated the feasibility of automatic learning rate schedules, with some promising empirical results in deep learning (see a detailed discussion in Section 2.3).

While data-dependent learning rates are evolving in the standard non-DP regime, the adaptive data dependency can raise the privacy risks of memorizing and reproducing the training data. As an example, we consider the zeroth-order optimizer (ZO-SGD),

$$w_{t+1} = w_t - \eta_{\text{ZO-SGD}}\, z_t$$

where $w_t$ is the model parameters, $z_t$ is a random vector that is independent of any data, and $\eta_{\text{ZO-SGD}} \propto \frac{\partial L}{\partial w_t}^\top z_t \in \mathbb{R}$ is the effective learning rate, with $L$ being the loss value. In Malladi et al. (2023), ZO-SGD can effectively optimize models such as OPT 13B to 60B in the few-shot and many-shot settings. Hence, by using an adaptive hyperparameter $\eta_{\text{ZO-SGD}}$, the models are able to learn and possibly memorize the training data even if the descent direction is data-independent. In fact, the private information can also leak through other hyperparameters and in other models, where an outlier datapoint can be revealed via membership inference attacks in support vector machines (Papernot & Steinke, 2021).

Equal contribution. This work does not relate to ZB's position at Amazon.

Specifically, in the regime of deep learning with differential privacy (DP), the privacy risks from non-DP hyperparameter tuning could render the privacy guarantee non-rigorous. In practice, most work has leveraged DP optimizers (Abadi et al., 2016) to offer the privacy guarantee, where the privatization is only on the gradients, but not on hyperparameters such as the clipping threshold $R_g$ and the learning rate $\eta$. A common approach is to trial-and-error on multiple $(R_g, \eta)$ pairs and select the best hyperparameters, as showcased by Li et al. (2021) and Kurakin et al. (2022) in Figure 8.

At a high level, there are two approaches to accommodate the privacy risk in hyperparameter tuning: (1) The more explored approach is to assign a small amount of privacy budget to privatize the hyperparameter tuning. Examples include Renyi-DP tuning (Papernot & Steinke, 2021), DP-Hypo (Wang et al., 2023), DP-ZO-SGD (Tang et al., 2024; Liu et al., 2024; Zhang et al.) and many more (Liu & Talwar, 2019; Panda et al., 2024; Koskela & Kulkarni, 2023). However, these methods may suffer from worse performance due to larger DP noise addition from the reduced privacy budget, or from high computation overhead. (2) The less explored approach is to adopt hyperparameter-free methods. For instance, Bu et al. (2023b) and Yang et al. (2022) replace the per-sample gradient clipping with the per-sample gradient normalization (i.e. setting $R_g \to 0^+$), so as to remove the tuning of the hyperparameter $R_g$ in DP optimizers.

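In this work, we work towards hyperparameter-free optimization with DP, by adapting learning-rate-free methods in DP optimization:

$$\text{Vanilla:} \quad w_{t+1} = w_t - \eta_{\text{manual}} \Big( \sum_{i \in B} \min\Big\{\frac{R_g}{\|g_i\|}, 1\Big\}\, g_i + \sigma R_g z \Big)$$

$$\text{Hyperparameter-free:} \quad w_{t+1} = w_t - \eta_{\text{GeN-DP}} \Big( \sum_{i \in B} \frac{g_i}{\|g_i\|} + \sigma z \Big) \tag{1}$$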

Our contributions are summarized as follows:

1. We propose HyFreeDP, a hyperparameter-free DP framework that rigorously guarantees DP on the hyperparameter tuning of $(R_g, \eta_t)$, and works with any optimizer.
2. We apply the loss privatization to leverage the GeN learning rate under DP, with a specific auto-regressive clipping threshold $R_l$ that aims to minimize the clipping bias.
3. We give an end-to-end privacy accounting method that adds < 1% more gradient noise, while accurately capturing the loss curvature to determine the adaptive learning rate.
4. We show the strong performance and high efficiency of our method empirically.

2 PRELIMINARIES AND RELATED WORKS

2.1 DIFFERENTIALLY PRIVATE OPTIMIZATION

Overview of DP optimization. We aim to minimize the loss $\sum_i L(w, x_i)$ where $w \in \mathbb{R}^d$ is the model parameters and $x_i$ is one of the data points with $1 \le i \le N$. We denote the per-sample gradient as $g_i(w) := \frac{\partial L(w, x_i)}{\partial w} \in \mathbb{R}^d$ and the mini-batch gradient at the $t$-th updating iteration as $m_t \in \mathbb{R}^d$: for a batch size $B \le N$,

$$m_t(\{g_i\}_{i=1}^B; R_g, \sigma_g) := \Big[ \sum_i c_i(R_g)\, g_i + \sigma_g R_g \cdot z_g \Big] / B \tag{2}$$

where $z_g \sim N(0, I_d)$ is the Gaussian noise, and $c_i(R_g) = \min(R_g / \|g_i\|, 1)$ is the per-sample gradient clipping factor in (Abadi et al., 2016; De et al., 2022) with $R_g$ being the clipping threshold.

In particular, $m_t$ reduces to the standard non-DP gradient $\sum_i g_i / B$ when $\sigma_g = 0$ and $R_g$ is sufficiently large (i.e., no clipping is applied). The clipped and perturbed batch gradient becomes the DP gradient $m \equiv m_{\text{DP}}$ whenever $\sigma_g > 0$, with stronger privacy guarantee for larger $\sigma_g$. On top of $m$, models can be optimized using any optimizer such as SGD and AdamW through

$$w_{t+1} = w_t - \eta_t\, G_t\big(m_t(\{g_i\}_{i=1}^B)\big) \tag{3}$$


in which Gt is the post-processing such as momentum, adaptive pre-conditioning, and weight decay.
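As an illustration of (2), the following minimal NumPy sketch forms the privatized batch gradient from a matrix of per-sample gradients. The function name `dp_batch_gradient`, the small `eps` stability constant, and the convention that `R_g=None` selects automatic clipping ($c_i = 1/\|g_i\|$ with unit sensitivity) are assumptions made for this example, not the authors' implementation. It materializes all per-sample gradients for clarity, which efficient implementations avoid.

```python
import numpy as np

def dp_batch_gradient(per_sample_grads, sigma_g, R_g=None, rng=None, eps=1e-6):
    """Sketch of Eq. (2): clip (or normalize) per-sample gradients, add noise, average.

    per_sample_grads: array of shape (B, d), one gradient per example.
    R_g=None is a convention of this sketch for automatic clipping (c_i = 1/||g_i||).
    eps is a hypothetical stability constant to avoid division by zero.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(per_sample_grads, dtype=float)
    B, d = g.shape
    norms = np.linalg.norm(g, axis=1)
    if R_g is None:
        c = 1.0 / (norms + eps)                    # per-sample normalization
        sensitivity = 1.0
    else:
        c = np.minimum(R_g / (norms + eps), 1.0)   # per-sample clipping factor c_i(R_g)
        sensitivity = R_g
    noise = sigma_g * sensitivity * rng.standard_normal(d)
    return ((c[:, None] * g).sum(axis=0) + noise) / B
```

With `sigma_g=0` and a very large `R_g`, the function returns the plain mini-batch mean, matching the non-DP limit described above.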

Hyper-parameters matter for DP optimization. Previous works (De et al., 2022; Li et al., 2021) reveal that the performance of DP optimization is sensitive to the hyper-parameter choices. On the one hand, DP by itself brings extra hyper-parameters, such as the gradient clipping threshold $R_g$, making the tuning more complex. An adaptive clipping method (Andrew et al., 2021) proposes to automatically learn $R_g$ at each iteration, with an extra privacy budget that translates to worse accuracy. There are also methods that do not incur extra privacy budget. Automatic clipping (or per-sample normalization) (Bu et al., 2023b; Yang et al., 2022) uses $c_i = 1/\|g_i\|$ to replace the $R_g$-dependent per-sample clipping, and thus removes the hyperparameter $R_g$ from DP algorithms. On the other hand, hyper-parameter tuning for DP training has different patterns than non-DP training and cannot borrow previous experience from non-DP training. For example, previous works (Li et al., 2021; De et al., 2022) observe that there is no benefit from decaying the learning rate during training and that the optimal learning rate can be much higher than the optimal one in non-DP training.

2.2 END-TO-END DP GUARANTEE FOR OPTIMIZATION AND TUNING

However, hyper-parameter tuning enlarges privacy risks (Papernot & Steinke, 2021), and it is necessary to provide end-to-end privacy guarantee for DP. There are two existing technical paths to solve this problem:

1. Multiple runs with multiple choices of hyper-parameters, selecting among these choices in a DP manner (Mohapatra et al., 2022; Papernot & Steinke, 2021; Wang et al., 2023; Liu & Talwar, 2019). This path requires some knowledge to determine the choices a priori and consumes a significant privacy budget as well as computation due to the multiple runs. For instance, Liu & Talwar (2019) showed that repeated searching with random iterations satisfies (3ϵ, 0)-DP if each run was (ϵ, 0)-DP.

2. Making DP optimization hyper-parameter free, so that only a single run is needed. For example, automatic clipping (Bu et al., 2023b; Yang et al., 2022) eliminates the hyperparameter $R_g$.

Our work follows the second path, aiming to resolve the tuning of $\eta$, the last critical hyper-parameter, by finding a universal configuration across datasets and tasks.

2.3 AUTOMATIC LEARNING RATE SCHEDULE

Automatic learning rate schedules (also known as learning-rate-free or parameter-free methods) have demonstrated promising performance in deep learning, with little to no manual effort to select the learning rate. Specifically, D-adaptation (Defazio & Mishchenko, 2023), Prodigy (Mishchenko & Defazio, 2024), and DoG (Ivgi et al., 2023) (and its variants) have proposed to estimate $\eta \approx \frac{D}{G\sqrt{T}}$, where $D = \|w_0 - w_*\|$ is the initialization-to-minimizer distance, $G$ is the Lipschitz continuity constant, and $T$ is the total number of iterations. These methods have their roots in the convergence theory under the convex and Lipschitz conditions, and may not be accurate when applied in the DP regime, where the gradient is noisy. In fact, the estimation of $D$ and $G$ can deviate significantly from the truth when ϵ is small and the number of trainable parameters is large, i.e. when the DP noise in the gradient is large, as shown in Table 2.

Along an orthogonal direction, GeN (Bu & Xu, 2024) leverages the Taylor approximation of the loss to set the learning rate $\eta_{\text{GeN}}$, without assuming the Lipschitz continuity or the knowledge of $D$. Given any descent vector $m$, and omitting the iteration index $t$ for brevity, we can use the GeN learning rate in (4) to approximately minimize $L$:

$$\eta_{\text{GeN}}(m) := \frac{G^\top m}{m^\top H m} = \arg\min_\eta \Big[ L(w) - G^\top m\, \eta + m^\top H m\, \frac{\eta^2}{2} \Big] \approx \arg\min_\eta L(w - \eta m) \tag{4}$$

in which $G = \frac{\partial L}{\partial w}$ is the gradient and $H = \frac{\partial^2 L}{\partial w^2}$ is the Hessian matrix. Notice that because $L$ is approximated by a quadratic function, the minimizer $\eta_{\text{GeN}}$ is unique and in closed form.

Numerically, $\eta_{\text{GeN}}$ can be computed up to any precision by curve fitting or finite difference. Under the non-DP regime, given a series of $\eta_i$ and loss values $L(w - \eta_i m)$, Bu & Xu (2024) obtain the numerator and denominator of $\eta_{\text{GeN}}$ by solving the problem in (5):

$$m^\top H m,\; G^\top m \approx \arg\min_{a,b} \sum_i \Big| L(w - \eta_i m) - \Big( L(w) - b\,\eta_i + a\,\frac{\eta_i^2}{2} \Big) \Big|^2 \tag{5}$$

Nevertheless, directly applying these non-DP automatic learning rate schedulers with the DP gradient in (2) and using $\eta_{\text{GeN}}(m)$ will violate DP, because the learning rate estimation is obtained from forward passes on batches of private data. We defer the explanation and solution to Section 3.

3 LOSS VALUE PRIVATIZATION WITH MINIMAL CLIPPING BIAS

3.1 PRIVATIZED QUADRATIC FUNCTION

We emphasize that the learning rate $\eta_{\text{GeN}}(m)$ is not DP, because even though $m$ is privatized, the data is accessed without protection through $G$ and $H$. To solve this issue, we introduce a privatized variant of (5),

$$(m^\top H m)_{\text{DP}},\; (G^\top m)_{\text{DP}} := \arg\min_{a,b} \sum_i \Big| \tilde{L}(w - \eta_i m_{\text{DP}}) - \Big( \tilde{L}(w) - b\,\eta_i + a\,\frac{\eta_i^2}{2} \Big) \Big|^2 \tag{6}$$

which not only replaces $m$ with the DP gradient $m_{\text{DP}}$ but also privatizes the loss by $\tilde{L}(w - \eta_i m_{\text{DP}})$ as in (7). The resulting learning rate is $\eta_{\text{GeN-DP}} = \frac{(G^\top m)_{\text{DP}}}{(m^\top H m)_{\text{DP}}}$, which is DP since every quantity in (6) is DP and because of the post-processing property. We now discuss the specifics of the loss value privatization $\tilde{L}(w - \eta_i m) \in \mathbb{R}$. In Table 1, we emphasize that the loss privatization is distinctively different from the gradient privatization because the loss is scalar, whereas the gradient is high-dimensional.

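Table 1: Difference between the privatization of loss and gradient.

| Aspects | Loss Privatization | Gradient Privatization |
|---|---|---|
| Dimension | 1 | $d$ |
| Clipping Norm | L2 or L1 | L2 |
| Noise Magnitude | $\mathbb{E}\lvert z_l\rvert\,\sigma_l R_l / B$ | $\mathbb{E}\lVert z_g\rVert\,\sigma_g R_g / B$ |
| Key to Convergence | Clipping Bias | Noise Magnitude |
| Per-sample Operation | Clipping ($R_l \approx L$) | Normalization ($R_g \to 0^+$) |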

From the perspective of per-sample clipping, the gradient is ubiquitously clipped on the L2 norm, because $\|g_i\|_2 \ll \|g_i\|_1$ in large neural networks, and consequently a Gaussian noise is added to privatize the gradient. In contrast, we can apply the L2 or L1 norm for the loss clipping, and add Gaussian (by default) or Laplacian noise to the loss, respectively.

From the perspective of noising, the expected noise magnitude for loss privatization is $\mathbb{E}|z_l| \cdot \frac{\sigma_l R_l}{B} = \sqrt{\frac{2}{\pi}}\,\frac{\sigma_l R_l}{B}$, where $z_l \sim N(0, 1)$ is the noise on the loss, and that for gradient privatization is $\mathbb{E}\|z_g\| \cdot \frac{\sigma_g R_g}{B} \approx \sqrt{d}\,\frac{\sigma_g R_g}{B}$, where the gradient noise vector is $z_g \sim N(0, I_d)$, by the law of large numbers. On the one hand, the gradient noise $z_g$ is large and requires a small $R_g$ to suppress the noise magnitude, as many works have used a very small $R_g$ (Li et al., 2021; De et al., 2022). This leads to the automatic clipping in Bu et al. (2023b), where the gradient clipping effectively becomes the gradient normalization as $R_g \to 0^+$. In fact, each per-sample gradient (as a vector) has a magnitude and a direction, and the normalization neglects some if not all magnitude information about the per-sample gradients. On the other hand, we must not use a small $R_l$ for the loss clipping because the per-sample loss (as a scalar) only has the magnitude information. We will show by Theorem 1 that the choice of threshold $R_l$ creates a bias-variance tradeoff between the clipping and the noising for the loss privatization:

$$\tilde{L} := \frac{1}{B} \Big[ \sum_i \min\Big(\frac{R_l}{L_i}, 1\Big) L_i + \sigma_l R_l \cdot N(0, 1) \Big] \tag{7}$$
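The loss privatization in (7) amounts to clipping each per-sample loss at $R_l$, summing, adding one Gaussian draw scaled by $\sigma_l R_l$, and dividing by the batch size. A minimal sketch follows, assuming non-negative per-sample losses (as with cross-entropy), so that $\min(R_l/L_i, 1)\,L_i = \min(L_i, R_l)$; the function name `privatize_loss` is illustrative and not taken from the paper's code.

```python
import numpy as np

def privatize_loss(per_sample_losses, R_l, sigma_l, rng=None):
    """Sketch of Eq. (7): clipped-and-noised mean of per-sample losses.

    Assumes every loss is non-negative, so min(R_l / L_i, 1) * L_i == min(L_i, R_l).
    """
    rng = np.random.default_rng() if rng is None else rng
    losses = np.asarray(per_sample_losses, dtype=float)
    clipped = np.minimum(losses, R_l)               # per-sample clipping at R_l
    noise = sigma_l * R_l * rng.standard_normal()   # scalar Gaussian noise
    return (clipped.sum() + noise) / losses.size
```

For example, with losses `[0.5, 2.0, 4.0]`, `R_l=3.0` and `sigma_l=0`, only the last loss is clipped, and the result falls below the true mean by `(4.0 - 3.0) / 3`, which is the clipping bias.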

[Figure 1: diagram of the HyFreeDP pipeline. Stages: sample & forward, loss clipping, loss perturbation, curve fitting, privatized update, update of the loss clipping threshold, and privacy accounting. Auto-tuning (Algorithm 1) is activated when t % K == 0. Privacy accounting computes σ = GetSigma(ϵ, δ, B, T, N) and solves ϵ_ours(B, T, N, σ_g, σ_l, K) = ϵ(B, T, N, σ) for the loss noise, where σ_g = 1.01σ. Legend: non-DP and data-independent; data-dependent; auto-updated params.]

Figure 1: HyFreeDP overview with three types of hyper-parameters in the DP training. HyFreeDP saves tuning efforts via automatically tuning hyper-parameters in red text, and sets other parameters as default constants. We showcase with 5 points in curve fitting.

We note (7) is a private mean estimation that has been heavily studied in previous works (Biswas et al., 2020; Kamath et al., 2020), though many use an asymptotic threshold like Rl + O(log B), whereas Theorem 1 is more suited for our application in practice.

3.2 BIAS-VARIANCE TRADE-OFF IN LOSS PRIVATIZATION

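Theorem 1. The per-sample clipping bias of (7) is

$$\Big| \mathbb{E}(\tilde{L}) - \frac{1}{B}\sum_i L_i \Big| = \Big| \frac{1}{B}\sum_i \min\Big(\frac{R_l}{L_i}, 1\Big) L_i - \frac{1}{B}\sum_i L_i \Big| = \frac{1}{B}\sum_i (L_i - R_l)\, \mathbb{I}(L_i > R_l)$$

which is monotonically decreasing in $R_l$, and converges to $[\mathbb{E}(L_i \mid L_i > R_l) - R_l] \cdot \mathbb{P}(L_i > R_l)$ as $B \to \infty$. In contrast, the noise variance is $\text{Var}(\tilde{L}) = (\sigma_l R_l / B)^2$, which is increasing in $R_l$.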

In words, a large $R_l$ reduces the clipping bias but magnifies the noise, and vice versa for a small $R_l$. We propose to use $R_l \approx L$ so that the clipping bias is close to zero (i.e. $\tilde{L}$ is approximately unbiased), and the loss noise shown in Table 1 is reasonably small for a large batch size.

To put this into perspective, we give the explicit form of the clipping bias when $L_i$ follows a Gaussian distribution in Corollary 1.

Corollary 1. Suppose $L_i \sim N(\mu, \xi^2)$, then the asymptotic clipping bias in Theorem 1 is

$$[\mathbb{E}(L_i \mid L_i > R_l) - R_l] \cdot \mathbb{P}(L_i > R_l) = \xi\,\big[\phi(\alpha) - \alpha\,(1 - \Phi(\alpha))\big], \tag{8}$$

where $\alpha = \frac{R_l - \mu}{\xi}$, $\phi$ is the probability density function and $\Phi$ is the cumulative distribution function of the standard normal distribution. The term (8) is strictly decreasing in $\alpha$ as well as $R_l$ (see Figure 7).
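The closed form in (8) can be checked numerically. The sketch below evaluates $\xi[\phi(\alpha) - \alpha(1 - \Phi(\alpha))]$ with the standard library and compares it against a Monte Carlo estimate of $\mathbb{E}[(L_i - R_l)\,\mathbb{I}(L_i > R_l)]$; the function names and the sample size are choices made for this illustration only.

```python
import math
import numpy as np

def gaussian_clipping_bias(R_l, mu, xi):
    """Closed-form asymptotic clipping bias of Eq. (8) for L_i ~ N(mu, xi^2)."""
    alpha = (R_l - mu) / xi
    pdf = math.exp(-0.5 * alpha**2) / math.sqrt(2.0 * math.pi)   # phi(alpha)
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))         # Phi(alpha)
    return xi * (pdf - alpha * (1.0 - cdf))

def monte_carlo_clipping_bias(R_l, mu, xi, n=200_000, seed=0):
    """Empirical mean of (L_i - R_l) * 1{L_i > R_l} over n Gaussian draws."""
    rng = np.random.default_rng(seed)
    losses = rng.normal(mu, xi, size=n)
    return float(np.maximum(losses - R_l, 0.0).mean())
```

Evaluating `gaussian_clipping_bias` on an increasing grid of `R_l` values gives a decreasing sequence, in line with the monotonicity stated in Corollary 1.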

4 ALGORITHM

4.1 HYPERPARAMETER-FREE DP OPTIMIZATION

We present our algorithm in Algorithm 1, which is DP as guaranteed in Theorem 2, almost as efficient as the standard non-DP optimization by Section 4.4, and highly accurate and fast in convergence as demonstrated in Section 5. Importantly, we have split the hyperparameters into three classes, as shown in Figure 1:

1. DP-related hyperparameters that do not depend on the tasks, such as the gradient noise $\sigma_g$, the loss noise $\sigma_l$, and the update interval $K$, can be set as default constants. For example, we fix $R_g \to 0^+$ and re-scale the learning rate by $1/R_g$ according to automatic clipping, and set K = 5.

2. Training hyperparameters that are robust to different models and datasets, which we view as data-independent, need-not-to-search, and not violating DP, such as the batch size $B$ and the number of iterations $T$. We also fix other hyperparameters not explicitly displayed in Algorithm 1, e.g. throughout this paper, we fix the momentum coefficients and weight decay ($\beta_1$, $\beta_2$, weight decay) = (0.9, 0.999, 0.01), which is the default in PyTorch AdamW.

3. Training hyperparameters that are data-dependent, which require dynamic searching under DP, such as $R_l$ and $\eta$. For these hyperparameters, Algorithm 1 adopts multiple auto-regressive designs, i.e. the variables used in the $t$-th iteration are based on the $(t-1)$-th iteration, which has already been privatized. These auto-regressive designs allow new variables to preserve DP by the post-processing property of DP.¹

To be specific, we have used $\tilde{L}^{(0)}_{t-1}$ as the loss clipping threshold $R_l$ for the next iteration in Line 8, because loss values remain similar within a few iterations. In practice, we set a more conservative loss clipping threshold $R_l = \sum_k \tilde{L}^{(k)}_{t-1}$ to avoid the clipping bias. We have used the previous $\eta$ to construct the next losses in Line 7, which in turn determine the new $\eta$ in Lines 9 to 11.

Algorithm 1 Hyperparameter-free Optimization with Differential Privacy

1: INPUT: initial $\eta$ = 1e-4, initial $R_l = 1$
2: Forward pass to compute per-sample losses $L^{(0)}_{t,i} = L(w_t, x_i)$
3: Compute the mini-batch loss $L^{(0)}_t = \frac{1}{B}\sum_i L^{(0)}_{t,i}$
4: Back-propagate from $L^{(0)}_t$ to compute $m_{\text{DP}}$ in (2) with automatic clipping
5: Post-process $m_{\text{DP}}$ by any optimizer $G_{\text{DP}} := G(m_{\text{DP}})$ in (3)
6: if t % K == 0 (e.g. K = 10) then
7:     Forward pass to get per-sample losses $L^{(\pm 1)}_{t,i} = L(w_t \mp \eta\, G_{\text{DP}}, x_i)$
8:     Privatize losses $\tilde{L}^{(k)}_t$ by (7) with $R_l = \tilde{L}^{(0)}_{t-1}$ for $k \in \{-1, 0, +1\}$
9:     Fit the quadratic function in (6) from $\{-\eta, 0, \eta\}$ to $\{\tilde{L}^{(-1)}_t, \tilde{L}^{(0)}_t, \tilde{L}^{(+1)}_t\}$
10:    Extract coefficients of the fitted quadratic function $(m^\top H m)_{\text{DP}}$, $(G^\top m)_{\text{DP}}$
11:    Update $\eta$ with $\eta_{\text{GeN-DP}} = \frac{(G^\top m)_{\text{DP}}}{(m^\top H m)_{\text{DP}}} \approx \arg\min_\eta L(w_t - \eta\, G_{\text{DP}})$
12: end if
13: Update $w_{t+1} = w_t - \eta\, G_{\text{DP}}$
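The curve-fitting step of Algorithm 1 (fitting the quadratic in (6) and taking the ratio of its coefficients) can be sketched as follows. The code assumes three already privatized loss values at step sizes $\{-\eta, 0, +\eta\}$ and solves the least-squares problem for $a \approx (m^\top H m)_{\text{DP}}$ and $b \approx (G^\top m)_{\text{DP}}$. The function name and the `fallback` argument, which is returned when the fitted curvature is not positive, are assumptions of this sketch; the handling of that case in the authors' code is not described in this section.

```python
import numpy as np

def fit_gen_dp_lr(eta, loss_minus, loss_zero, loss_plus, fallback=None):
    """Fit L(w - s*G) ~ loss_zero - b*s + a*s^2/2 at s in {-eta, +eta}; return b/a.

    loss_minus corresponds to s = -eta, loss_plus to s = +eta.
    All three losses are assumed to be privatized already, so the result
    inherits DP by post-processing. `fallback` is a hypothetical safeguard.
    """
    steps = np.array([-eta, eta], dtype=float)
    targets = np.array([loss_minus, loss_plus], dtype=float) - loss_zero
    design = np.stack([0.5 * steps**2, -steps], axis=1)   # columns for a and b
    a, b = np.linalg.lstsq(design, targets, rcond=None)[0]
    if a <= 0:          # no positive curvature: the quadratic has no minimizer
        return fallback
    return b / a
```

On an exactly quadratic loss, e.g. values 2.32, 2.0, 1.72 at steps -0.1, 0, +0.1 (that is, b = 3 and a = 4), the sketch recovers the minimizer 0.75 regardless of the probing step `eta`.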

[Figure 2: four panels at Iterations 10, 20, 40, and 50. x-axis: learning rate update (from -0.02 to 0.02); y-axis: loss values. Curves: curve fitting w/o clipping w/o noise; w/ clipping w/o noise; w/ clipping w/ noise; Gaussian noise.]

Figure 2: Impact of loss value clipping and perturbation on curve fitting along different training iterations on CIFAR100 with Vit-Small full fine-tuning, where zero on the x-axis denotes the current $w_t$. We use 5 points for the ease of illustration and use 3 points in Algorithm 1 and experiments.

4.2 PRIVACY GUARANTEE

Theorem 2. Algorithm 1 is (ϵours, δ)-DP, where ϵours depends on the batch size B, the number of iterations T, the noises (σg, σl), and the update interval K. In contrast, vanilla DP-SGD is (ϵvanilla, δ)-DP, where ϵvanilla depends on B, T, σ. Furthermore, we have ϵours(B, T, N, σg, σl, K) > ϵvanilla(B, T, N, σ) if σg = σ.

We omit the concrete formulae of ϵ because they depend on the choice of privacy accountants. For example, if we use µ-GDP as an asymptotic estimation, then we show in Appendix B that

$$\mu_{\text{vanilla}} = \frac{B}{N}\sqrt{T\,\big(e^{1/\sigma^2} - 1\big)}, \qquad \mu_{\text{ours}} = \sqrt{\mu_{\text{vanilla}}^2 + \frac{B^2}{N^2}\,\frac{T}{K}\,\big(e^{1/\sigma_l^2} - 1\big)} \tag{9}$$

which can translate into (ϵ, δ)-DP by Equation (6) in Bu et al. (2020). Note that in our experiments, we use the improved RDP as the privacy accountant.

¹ The post-processing of DP ensures that if X is (ϵ, δ)-DP then g(X) is also (ϵ, δ)-DP for any function g.

Theorem 2 and (9) show that given the same (B, T, σ), our Algorithm 1 uses more privacy budget because we additionally privatize the loss, whereas the vanilla DP-SGD does not protect the hyperparameter tuning. To maintain the same privacy budget as vanilla DP-SGD, we use 99% budget for the gradient privatization and 1% budget for the loss privatization. Therefore, we need 1% larger gradient noise σg = γσ and then select σl based on

$$\epsilon_{\text{ours}}(B, T, N, \gamma\sigma, \sigma_l, K) = \epsilon_{\text{vanilla}}(B, T, N, \sigma), \quad \text{where we can use } \gamma \approx 1.01. \tag{10}$$
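To make the budget-matching idea concrete, the sketch below solves for the loss noise under the µ-GDP central-limit approximation rather than the RDP accountant used in the experiments. It assumes (i) $\mu = \frac{B}{N}\sqrt{T(e^{1/\sigma^2} - 1)}$ for $T$ subsampled Gaussian steps, (ii) that µ values compose by adding their squares, and (iii) that the loss is released $T/K$ times at the same sampling rate, with `releases_per_update` as a hypothetical knob in case the three losses are accounted separately. All function names are illustrative.

```python
import math

def gdp_mu(p, steps, sigma):
    """CLT approximation of mu for `steps` subsampled Gaussian mechanisms (rate p)."""
    return p * math.sqrt(steps * math.expm1(1.0 / sigma**2))

def gdp_mu_ours(p, T, K, sigma_g, sigma_l, releases_per_update=1):
    """Compose gradient and loss privatization: squared mu values add up."""
    mu_grad = gdp_mu(p, T, sigma_g)
    mu_loss = gdp_mu(p, releases_per_update * T / K, sigma_l)
    return math.sqrt(mu_grad**2 + mu_loss**2)

def solve_loss_noise(sigma, gamma=1.01, K=10, releases_per_update=1):
    """Return sigma_l such that mu_ours(gamma * sigma, sigma_l) equals mu_vanilla(sigma).

    Under the assumptions above, p and T cancel, leaving a closed form.
    """
    gap = math.expm1(1.0 / sigma**2) - math.expm1(1.0 / (gamma * sigma) ** 2)
    return 1.0 / math.sqrt(math.log1p(K * gap / releases_per_update))
```

Under these assumptions the sampling rate and the number of iterations drop out, so the required loss noise depends only on σ, γ and K. With a different accountant, the same matching would be done numerically, for instance by bisection on the loss noise.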

4.3 END-TO-END NOISE DETERMINATION

We use the autodp library² to compute $\sigma_l$ based on $\sigma_g$ in (10). We give more details of our implementation in Appendix C. We visualize both noises in Figure 3 with RDP accounting and Gaussian mechanisms for loss privatization; the dashed line indicates the case when $\sigma_g$ and $\sigma_l$ are set equally. More examples for different mechanisms or different accounting methods are shown in the Appendix. Note that we only add a little more gradient noise, hence $\sigma_g$ introduces a negligible accuracy drop to Algorithm 1, as empirically shown in Section 5. Additionally, we demonstrate in Figure 2 that $\sigma_l$ has negligible interference with the precision of estimating $\eta_{\text{GeN}}$.

4.4 EFFICIENCY OF ALGORITHM

[Figure 3: noise scale for gradient and loss (y-axis) versus update interval K (x-axis, 2 to 14), under RDP composition (Gauss and Gauss). Curves: σ_g and σ_l with γ = 1.05, and σ_g and σ_l with γ = 1.01.]

Figure 3: Gradient and loss noise.

We illustrate that Algorithm 1 can be almost as efficient as the standard non-DP optimization, in terms of training time and memory cost. We identify three orthogonal components that are absent from non-DP optimization:

(1) Gradient privatization. DP optimization (including vanilla DP-SGD) always requires per-sample gradient clipping. Due to the high dimension of gradients, this could incur high cost in memory and time if implemented inefficiently. We directly leverage recent advances like ghost clipping and book-keeping (BK), which have allowed DP optimization to be almost as efficient as non-DP optimization, up to 256 GPUs and 100B parameters.

(2) Loss privatization. The cost of loss privatization alone is O(B) and thus negligible, compared to the forward passes and back-propagation, which are O(Bd).

(3) Learning rate computation. The cost of computing $\eta_{\text{GeN-DP}}$ in (6) mainly comes from the additional forward passes³ for $L^{(\pm 1)}_{t,i}$. Given that the back-propagation costs approximately 2× the training time of a forward pass, the optimizer without the GeN learning rate (non-DP or DP) roughly uses 3 units of time at each iteration. In contrast, Algorithm 1 uses 3 + 2/K ≈ 3.2 units if we set K = 10, i.e. < 7% overhead.

We emphasize that the actual overhead to training time is even lower, because the training also includes non-optimization operations such as data loading and inter-GPU communication. In short, the efficiency gap between Algorithm 1 and non-DP optimization is negligible in practice.
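The per-iteration cost estimate above can be reproduced with a few lines of arithmetic. The sketch assumes the unit costs stated in the text (one unit per forward pass, two per back-propagation) and two extra forward passes every K iterations; the function name and its default arguments are illustrative.

```python
def relative_iteration_cost(K, forward=1.0, backward=2.0, extra_forwards=2):
    """Cost of one iteration with auto-tuning, relative to one without it.

    Baseline: one forward plus one backward pass (3 units by default).
    Auto-tuning adds `extra_forwards` forward passes once every K iterations.
    """
    base = forward + backward
    return (base + extra_forwards * forward / K) / base

# Theoretical ratios for the update intervals used in Table 4.
for K in (1, 5, 10):
    print(f"K={K}: x{relative_iteration_cost(K):.3f}")
```

For K = 10 this gives 3.2 / 3, about 1.067, consistent with the "< 7% overhead" figure. For K = 1 the estimate is 5 / 3, about 1.667.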

5 EXPERIMENTS

To comprehensively evaluate the effectiveness of the proposed method, we conduct experiments on both computer vision tasks and natural language tasks, across different model architectures (Vit (Yuan et al., 2021), GPT2 (Radford et al., 2019) and LLaMa2-7B (Touvron et al., 2023)) and fine-tuning paradigms (Full, BitFit (Zaken et al., 2022) and LoRA (Hu et al., 2021)). BitFit only tunes the bias terms of a pre-trained model, while LoRA only tunes the injected low-rank matrices, both keeping all other parameters frozen. We use the privacy budget ϵ = {1, 3, 8} with $\delta \le N^{-1.1}$ for a training dataset with $N$ samples⁴ and also run a non-DP baseline (ϵ = ∞).

² https://github.com/yuxiangw/autodp

³ The number of $\{\eta_i\}_i$ in (6) is at least two since there are two unknown variables. More $\eta_i$ may stabilize the algorithm, at the cost of more forward passes, longer training time, and using more privacy budget.

⁴ https://github.com/lxuechen/private-transformers


Table 2: Performance comparison of HyFreeDP with other baselines. We use cosine learning rate decay for NonDP-GS w/ LS in the BitFit fine-tuning. Detailed results are provided in the Appendix.

Fully Fine-Tune

| Privacy budget | Method | Vit-Small CIFAR10 | Vit-Small CIFAR100 | Vit-Small SVHN | Vit-Small GTSRB | Vit-Small Food101 | Vit-Base CIFAR10 | Vit-Base CIFAR100 | Vit-Base SVHN | Vit-Base GTSRB | Vit-Base Food101 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ϵ = 1 | NonDP-GS | 96.49 | 79.70 | 89.86 | 67.00 | 70.38 | 95.20 | 67.18 | 89.10 | 81.11 | 58.80 |
| ϵ = 1 | D-adaptation | 23.74 | 0.80 | 15.27 | 1.86 | 1.17 | 19.15 | 0.83 | 18.56 | 4.72 | 1.48 |
| ϵ = 1 | Prodigy | 27.54 | 0.80 | 15.27 | 1.86 | 1.19 | 19.15 | 0.83 | 18.56 | 4.72 | 1.43 |
| ϵ = 1 | DP-hyper | 92.98 | 74.63 | 34.58 | 28.23 | 19.06 | 94.86 | 6.59 | 79.94 | 77.85 | 57.02 |
| ϵ = 1 | HyFreeDP | 96.36 | 78.17 | 91.15 | 74.68 | 67.98 | 95.76 | 67.17 | 88.24 | 79.00 | 58.16 |
| ϵ = 3 | NonDP-GS | 96.86 | 83.69 | 91.60 | 83.08 | 74.58 | 95.84 | 79.03 | 91.74 | 91.03 | 66.87 |
| ϵ = 3 | D-adaptation | 40.99 | 0.94 | 17.22 | 2.65 | 1.37 | 30.29 | 1.07 | 19.77 | 5.68 | 1.68 |
| ϵ = 3 | Prodigy | 77.18 | 0.95 | 18.11 | 2.68 | 1.47 | 30.29 | 1.07 | 19.77 | 5.68 | 1.65 |
| ϵ = 3 | DP-hyper | 95.10 | 78.82 | 48.82 | 42.02 | 36.21 | 95.75 | 19.92 | 87.82 | 90.25 | 66.09 |
| ϵ = 3 | HyFreeDP | 96.89 | 81.62 | 92.21 | 89.92 | 75.24 | 97.29 | 80.34 | 91.71 | 91.76 | 73.46 |
| NonDP | NonDP-GS | 98.33 | 89.41 | 95.58 | 98.66 | 85.19 | 98.88 | 92.40 | 96.87 | 98.41 | 89.61 |
| NonDP | D-adaptation | 54.29 | 86.35 | 97.09 | 97.38 | 74.75 | 63.37 | 88.41 | 97.02 | 98.65 | 78.35 |
| NonDP | Prodigy | 97.85 | 89.21 | 96.87 | 98.31 | 87.49 | 98.59 | 91.01 | 97.14 | 98.77 | 89.47 |
| NonDP | HyFreeDP | 98.32 | 90.84 | 96.86 | 99.00 | 86.38 | 98.83 | 92.58 | 96.75 | 97.37 | 88.73 |

BitFit Fine-Tune

| Privacy budget | Method | Vit-Small CIFAR10 | Vit-Small CIFAR100 | Vit-Small SVHN | Vit-Small GTSRB | Vit-Small Food101 | Vit-Base CIFAR10 | Vit-Base CIFAR100 | Vit-Base SVHN | Vit-Base GTSRB | Vit-Base Food101 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ϵ = 1 | NonDP-GS | 96.74 | 84.22 | 90.18 | 86.20 | 77.39 | 97.34 | 84.97 | 90.91 | 87.15 | 76.61 |
| ϵ = 1 | NonDP-GS w/ LS | 97.40 | 81.36 | 67.95 | 48.64 | 74.55 | 97.90 | 81.87 | 79.86 | 55.66 | 73.76 |
| ϵ = 1 | Prodigy | 96.36 | 82.23 | 90.30 | 86.85 | 76.36 | 97.13 | 84.87 | 91.08 | 81.54 | 78.72 |
| ϵ = 1 | HyFreeDP | 96.20 | 83.84 | 90.49 | 88.08 | 79.27 | 97.42 | 84.01 | 92.21 | 81.83 | 79.25 |
| ϵ = 3 | NonDP-GS | 97.13 | 85.95 | 91.42 | 91.50 | 80.37 | 97.73 | 87.00 | 92.20 | 91.46 | 80.06 |
| ϵ = 3 | NonDP-GS w/ LS | 97.61 | 84.87 | 78.91 | 60.13 | 78.08 | 98.05 | 85.48 | 86.22 | 69.03 | 78.71 |
| ϵ = 3 | Prodigy | 95.72 | 83.63 | 91.90 | 91.64 | 79.20 | 96.96 | 86.26 | 92.13 | 90.27 | 81.98 |
| ϵ = 3 | HyFreeDP | 97.09 | 86.07 | 91.64 | 92.38 | 84.55 | 97.65 | 87.04 | 92.37 | 90.67 | 84.53 |
| NonDP | NonDP-GS | 97.91 | 89.39 | 91.88 | 95.20 | 86.29 | 98.56 | 91.59 | 94.42 | 92.87 | 88.37 |
| NonDP | NonDP-GS w/ LS | 97.89 | 89.40 | 91.98 | 90.31 | 86.27 | 98.56 | 91.61 | 94.43 | 92.90 | 88.39 |
| NonDP | Prodigy | 97.88 | 87.85 | 95.19 | 95.34 | 86.56 | 98.41 | 90.70 | 96.02 | 95.02 | 88.60 |
| NonDP | HyFreeDP | 97.97 | 89.86 | 93.60 | 95.19 | 87.24 | 98.43 | 91.69 | 95.48 | 95.23 | 89.06 |

Since HyFreeDP is the first hyperparameter-free method for differentially private optimization, we compare it with the following baselines:

1) NonDP-GS: We manually perform grid search over a predefined range of learning rates, selecting the best without accumulating privacy budget across runs. This serves as the performance upper bound since tuning is non-DP. We also experiment with a manually tuned learning rate scheduler, denoted as NonDP-GS w/ LS. We search the learning rate over the range [5e-5, 1e-4, 5e-4, 1e-3, 5e-3] based on previous works (Bu et al., 2023b;a) to cover suitable η for various DP levels.

2) DP-hyper (Liu & Talwar, 2019; Wang et al., 2023; Papernot & Steinke, 2021): We simulate with a narrow range around the optimal η, spending 85% of the privacy budget for DP training with the searched η.

3) D-Adaptation (Defazio & Mishchenko, 2023) and Prodigy (Mishchenko & Defazio, 2024): Both are state-of-the-art learning rate tuning algorithms in non-DP optimization. We adopt their optimizers with recommended hyperparameters, alongside DP-specific clipping and perturbation.

4) HyFreeDP: We initialize $R_l = 1$ and η = 1e-4 by default, allowing automatic updates in training.

5.1 IMAGE CLASSIFICATION TASKS

Experimental setups. Our main results are shown in Table 2, where we compare HyFreeDP to other baselines on CIFAR10, CIFAR100, SVHN, GTSRB and Food101 with Vit-Small and Vit-Base models.

Evaluation results. HyFreeDP outperforms almost all end-to-end DP baselines. While non-DP automatic learning rate schedulers (D-adaptation and Prodigy) sometimes match grid-searched constants, they perform poorly in DP training, especially with tighter privacy budgets and when the number of trainable parameters is large, confirming our analysis in Section 2.3. Although DP-hyper surpasses these schedulers, likely due to our intentionally narrow search range, finding suitable ranges remains challenging as training dynamics vary by dataset and privacy budget. Even in this optimistic setting, DP-hyper underperforms due to increased gradient noise. In contrast, HyFreeDP achieves consistent performance across various ϵ values and datasets without manual learning rate tuning, thanks to our gradient noise control and efficient learning rate estimation with minimal loss value perturbation. HyFreeDP achieves comparable or superior performance to the NonDP-GS baseline without manual learning rate tuning in BitFit experiments. NonDP-GS w/ LS and Prodigy show improved stability compared to full fine-tuning, suggesting sensitivity to training paradigms and trainable model size. In Figure 4, we illustrate training dynamics ($R_l$, η, loss, test accuracy) across methods. HyFreeDP automatically finds optimal learning rate schedules, with clipping thresholds peaking early and decreasing gradually, enabling more accurate learning rate estimation as training progresses. While updating with K = 1 yields optimal convergence, HyFreeDP remains robust to less frequent updates (e.g., K = 5), balancing tuning cost and convergence speed.

[Figure 4: two rows of four panels, each with training steps on the x-axis.]

Figure 4: Automatic learning of clipping threshold, learning rate, training loss, and testing accuracy for SVHN (top) and GTSRB (bottom). HyFreeDP schedules $R_l$ and η during training, approaching the manually tuned baseline with end-to-end DP guarantees, and is robust to varying intervals K.

5.2 NATURAL LANGUAGE GENERATION TASKS

Experimental setups. We conduct experiments on the E2E dataset with a table-to-text task on the GPT2 model, and also evaluate the language generation task with the PubMed dataset by fine-tuning the LLaMa2-7B model with LoRA (Hu et al., 2021), to demonstrate the scalability and generality of HyFreeDP. We follow the experimental setups of previous works (Bu et al., 2024a) and use the dataset provided by Yu et al. (2022; 2023), which contains over 75,000 abstracts of medical papers that were published after the cut-off date of LLaMa2. Based on the non-DP experience, LoRA typically requires an order of magnitude greater learning rate than full fine-tuning⁵, thus we scale up our default initial learning rate by 10×. We tune the best learning rate for LLaMa2-7B with LoRA fine-tuning on 4,000 samples of PubMed for NonDP-GS when training on the full dataset.

Table 3: Performance comparison on GPT-2 for E2E dataset with different privacy budgets. Best end-to-end DP results are bolded, and results surpassing the manually tuned baseline are underlined.

Model: GPT-2, Full Fine-Tune

| Method | ϵ = 3 BLEU | ϵ = 3 CIDEr | ϵ = 3 METEOR | ϵ = 3 NIST | ϵ = 3 ROUGE-L | ϵ = 8 BLEU | ϵ = 8 CIDEr | ϵ = 8 METEOR | ϵ = 8 NIST | ϵ = 8 ROUGE-L |
|---|---|---|---|---|---|---|---|---|---|---|
| NonDP-GS | 0.583 | 1.566 | 0.367 | 5.656 | 0.653 | 0.612 | 1.764 | 0.385 | 6.772 | 0.664 |
| D-Adaptation | 0.000 | 0.000 | 0.003 | 0.082 | 0.016 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Prodigy | 0.082 | 0.000 | 0.157 | 1.307 | 0.239 | 0.012 | 0.000 | 0.003 | 0.000 | 0.003 |
| HyFreeDP | 0.585 | 1.564 | 0.365 | 5.736 | 0.636 | 0.612 | 1.768 | 0.378 | 6.702 | 0.655 |

Figure 5: Training dynamics of Llama2-7B on PubMed (x-axis: steps).

Evaluation results. As shown in Table 3, even when the privacy budget is not small (e.g., ϵ = 8), the non-DP automatic learning rate schedulers (D-Adaptation and Prodigy) do not perform well. HyFreeDP consistently obtains performance comparable to the NonDP-GS baseline without extra tuning. In Figure 5, we observe that HyFreeDP automatically discovers a learning rate schedule that achieves better generalization performance than the early-stopped NonDP-GS. The automatically determined learning rate (η) reveals a consistent pattern across model scales: an increase-then-decrease trajectory that holds from smaller models (Vit-Small) to larger models (LLaMa2-7B).

⁵ See this reference as an example. Note that LoRA may use a similar learning rate if it is set proportional to rank (Biderman et al., 2024).

Table 4: Comparison of training time in minutes with and without HyFreeDP across various models and datasets. The coefficients in parentheses are the ratios relative to the configuration without HyFreeDP.

| Models | Dataset | K=1 | K=5 | K=10 | w/o HyFreeDP |
|---|---|---|---|---|---|
| Llama2-7B (LoRA-FT) | PubMed (4k) | 409.750 (×2.040) | 244.333 (×1.217) | 222.583 (×1.108) | 200.833 (×1.000) |
| GPT2 | E2E | 163.333 (×1.888) | 97.167 (×1.123) | 94.983 (×1.098) | 86.500 (×1.000) |
| Vit-base | CIFAR100 | 152.617 (×1.370) | 118.483 (×1.063) | 113.317 (×1.017) | 111.433 (×1.000) |
| Vit-base (BitFit-FT) | CIFAR100 | 113.450 (×1.654) | 74.733 (×1.089) | 73.817 (×1.076) | 68.600 (×1.000) |
| Vit-small | SVHN | 102.000 (×1.255) | 84.500 (×1.040) | 82.800 (×1.019) | 81.250 (×1.000) |

5.3 EFFICIENCY COMPARISON

Based on Section 4.4, we compare the training efficiency of HyFreeDP with a single run of DP training using the same non-DP and data-independent hyperparameters, as shown in Table 4. For LLaMa2-7B, we sample 4,000 records from PubMed and train for 3 epochs on a single A100 (80GB). For smaller datasets and models, we use the previous setups on a Titan RTX (24GB). HyFreeDP costs at most roughly 2× a single run of DP training, even with frequent updates at K = 1 (the largest ratio in Table 4 is ×2.040). Smaller models or those using LoRA or BitFit have lower additional costs, especially with K = 1, and the gap narrows as K increases, approaching a cost factor of 1×.
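The ratios reported in Table 4 can be recomputed directly from the wall-clock times. The snippet below is a small sketch of that arithmetic; the dictionary keys and the helper name `cost_ratios` are illustrative choices, not part of the paper's code.

```python
# Wall-clock training time in minutes, copied from Table 4.
# Tuple order: (K=1, K=5, K=10, without HyFreeDP).
TIMES_MIN = {
    "Llama2-7B (LoRA-FT) / PubMed (4k)": (409.750, 244.333, 222.583, 200.833),
    "GPT2 / E2E": (163.333, 97.167, 94.983, 86.500),
    "ViT-base / CIFAR100": (152.617, 118.483, 113.317, 111.433),
    "ViT-base (BitFit-FT) / CIFAR100": (113.450, 74.733, 73.817, 68.600),
    "ViT-small / SVHN": (102.000, 84.500, 82.800, 81.250),
}


def cost_ratios(times):
    """Return the cost of each K setting relative to the baseline run."""
    *with_auto, baseline = times
    return [round(t / baseline, 3) for t in with_auto]


RATIOS = {name: cost_ratios(t) for name, t in TIMES_MIN.items()}

for name, (k1, k5, k10) in RATIOS.items():
    print(f"{name}: K=1 x{k1:.3f}, K=5 x{k5:.3f}, K=10 x{k10:.3f}")
```

Reading the ratios this way makes the trade-off explicit: the extra cost is concentrated at K = 1 and shrinks quickly once the learning rate is updated less often.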

6 DISCUSSION AND CONCLUSION

In conclusion, we tackle the challenge of hyperparameter tuning in differential privacy (DP) by introducing a hyperparameter-free DP training method that privately and automatically updates the learning rate. Combined with automatic clipping, our approach reduces tuning efforts and ensures end-to-end DP during training. This bridges the gap between hyperparameter-free methods in non-DP settings and DP optimization, opening promising avenues for future research.

ACKNOWLEDGMENTS

R.L. is partially supported by a National Institutes of Health grant (R01LM013712) and National Science Foundation grants (CNS-2124104 and CNS-2125530). We sincerely thank Prof. Li Xiong for her invaluable support and guidance throughout this work. We also appreciate the insightful comments and constructive feedback from the reviewers and the area chair.

REFERENCES

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318, 2016.

Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34:17455–17466, 2021.

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less. Transactions on Machine Learning Research, 2024.

Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. CoinPress: Practical private mean and covariance estimation. Preprint, 2020.

Zhiqi Bu and Shiyun Xu. Gradient descent with generalized Newton's method. In The Thirteenth International Conference on Learning Representations, 2024.

Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J Su. Deep learning with gaussian differential privacy. Harvard data science review, 2020(23), 2020.


Zhiqi Bu, Ruixuan Liu, Yu-Xiang Wang, Sheng Zha, and George Karypis. On the accuracy and efficiency of group-wise clipping in differentially private optimization. arXiv preprint arXiv:2310.19215, 2023a.

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Automatic clipping: Differentially private deep learning made easier and stronger. Advances in Neural Information Processing Systems, 36, 2023b.

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private bias-term fine-tuning of foundation models. In Forty-first International Conference on Machine Learning, 2024a. URL https://openreview.net/forum?id=fqeANcjBMT.

Zhiqi Bu, Xinwei Zhang, Sheng Zha, Mingyi Hong, and George Karypis. Pre-training differentially private models with limited public data. Advances in Neural Information Processing Systems, 37:94652–94683, 2024b.

Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale. arXiv preprint arXiv:2204.13650, 2022.

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. In International Conference on Machine Learning, pp. 7449–7479. PMLR, 2023.

Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(1):3–37, 2022.

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.

Maor Ivgi, Oliver Hinder, and Yair Carmon. Dog is sgd's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning, pp. 14465–14499. PMLR, 2023.

Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In Conference on Learning Theory, pp. 2204–2235. PMLR, 2020.

Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. Dowg unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36:6748–6769, 2023.

Antti Koskela and Tejas D Kulkarni. Practical differentially private hyperparameter tuning with subsampling. Advances in Neural Information Processing Systems, 36:28201–28225, 2023.

Itai Kreisler, Maor Ivgi, Oliver Hinder, and Yair Carmon. Accelerated parameter-free stochastic optimization. In The Thirty Seventh Annual Conference on Learning Theory, pp. 3257–3324. PMLR, 2024.

Alexey Kurakin, Shuang Song, Steve Chien, Roxana Geambasu, Andreas Terzis, and Abhradeep Thakurta. Toward training at imagenet scale with differential privacy. arXiv preprint arXiv:2201.12328, 2022.

Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2021.

Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 298–309, 2019.

Zhihao Liu, Jian Lou, Wenjie Bao, Zhan Qin, and Kui Ren. Differentially private zeroth-order methods for scalable large language model finetuning. arXiv preprint arXiv:2402.07818, 2024.

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038–53075, 2023.


Konstantin Mishchenko and Aaron Defazio. Prodigy: an expeditiously adaptive parameter-free learner. In Proceedings of the 41st International Conference on Machine Learning, pp. 35779–35804, 2024.

Shubhankar Mohapatra, Sajin Sasy, Xi He, Gautam Kamath, and Om Thakkar. The role of adaptive optimizers for honest private hyperparameter selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7806–7813, 2022.

Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar, Vikash Sehwag, and Prateek Mittal. A new linear scaling rule for private adaptive hyperparameter optimization. In International Conference on Machine Learning, pp. 39364–39399. PMLR, 2024.

Nicolas Papernot and Thomas Steinke. Hyperparameter tuning with renyi differential privacy. In International Conference on Learning Representations, 2021.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal. Private finetuning of large language models with zeroth-order optimization. In ICML 2024 Workshop on Foundation Models in the Wild, 2024.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Hua Wang, Sheng Gao, Huanyu Zhang, Weijie J Su, and Milan Shen. Dp-hypo: an adaptive private hyperparameter optimization framework. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 41868–41891, 2023.

Xiaodong Yang, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. Normalized/clipped sgd with perturbation for differentially private non-convex optimization. arXiv preprint arXiv:2206.13033, 2022.

Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, et al. Differentially private fine-tuning of language models. In International Conference on Learning Representations (ICLR), 2022.

Da Yu, Arturs Backurs, Sivakanth Gopi, Huseyin Inan, Janardhan Kulkarni, Zinan Lin, Chulin Xie, Huishuai Zhang, and Wanrong Zhang. Training private and efficient language models with synthetic data from llms. In Socially Responsible Language Modelling Research, 2023.

Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 558–567, 2021.

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1–9, 2022.

Liang Zhang, Kiran Koshy Thekumparampil, Sewoong Oh, and Niao He. Dpzero: Dimension-independent and differentially private zeroth-order optimization. In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023.


A EXPERIMENTAL DETAILS

Image classification. In full fine-tuning, we tried to integrate a constant linear scheduler as a common practice, but the result does not steadily outperform the tuned NonDP-GS, so we omit these results from the main text and report them in Table 5. We do not apply the linear scheduler to Prodigy and D-adaptation, in order to keep their originally recommended configurations. The results with ϵ = 8 are shown in Table 7 and lead to consistent conclusions across different datasets.

| Privacy Budget | Vit-Small CIFAR10 | Vit-Small CIFAR100 | Vit-Small SVHN | Vit-Small GTSRB | Vit-Small Food101 | Vit-Base CIFAR10 | Vit-Base CIFAR100 | Vit-Base SVHN | Vit-Base GTSRB | Vit-Base Food101 |
|---|---|---|---|---|---|---|---|---|---|---|
| ϵ = 1 | 88.02 | 3.46 | 29.00 | 10.16 | 9.80 | 96.62 | 58.74 | 89.13 | 64.39 | 60.74 |
| ϵ = 3 | 92.48 | 8.00 | 35.54 | 17.68 | 20.40 | 97.11 | 75.73 | 91.79 | 82.31 | 69.09 |
| ϵ = 8 | 93.79 | 15.04 | 43.73 | 24.15 | 31.05 | 97.41 | 81.31 | 93.05 | 88.79 | 73.65 |
| NonDP | 98.00 | 89.11 | 96.55 | 96.94 | 84.55 | 98.88 | 92.51 | 97.27 | 98.29 | 89.10 |

Table 5: Full fine-tuning results by adding a linear scheduler to NonDP-GS across different datasets with privacy budgets for Vit-Small and Vit-Base models. Results show that directly integrating a constant learning rate scheduler in NonDP does not hurt performance, but the DP training performance is sensitive to the learning rate scheduler.

In Table 2, we report the test accuracy averaged over 3 checkpoints. To demonstrate the stability of different methods, we present the mean accuracy with standard deviation in Table 6 for the low privacy budget regime and NonDP.

[Figure 6 plots. Panels: RDP Composition (Gauss and Lap); GDP Composition (Gauss and Gauss). x-axis: update interval K; y-axis: noise scale for gradient and loss. Curves: σg and σl with γ = 1.05 and γ = 1.01.]

Figure 6: Privacy composition for the privatization of gradient and loss values, with privacy accountants of RDP and GDP, and loss perturbation with Gaussian and Laplacian mechanisms. The gray dashed line indicates the naive, even budget splitting for every access to the private gradient and loss, which results in a larger noise magnitude on the gradient, especially when the adjustment is frequent. The privacy accounting strategy proposed in HyFreeDP effectively restrains the privacy budget spent on hyperparameter tuning and spends it wisely by perturbing only a single-dimensional loss value.

Clipping bias. Additionally, we demonstrate the loss clipping bias in Figure 7.

Figure 7: ϕ(α) − α(1 − Φ(α)), which is minimized at α = ∞.


Table 6: Averaged test accuracy with standard deviation of HyFreeDP and other baselines (full fine-tuning).

| Privacy budget | Method | Vit-Small CIFAR10 | Vit-Small CIFAR100 | Vit-Small SVHN | Vit-Small GTSRB | Vit-Small Food101 | Vit-Base CIFAR10 | Vit-Base CIFAR100 | Vit-Base SVHN | Vit-Base GTSRB | Vit-Base Food101 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ϵ = 1 | NonDP-GS | 96.49 ± 0.02 | 79.7 ± 0.13 | 89.86 ± 0.21 | 67.00 ± 0.44 | 70.38 ± 0.23 | 95.2 ± 0.02 | 67.18 ± 0.52 | 89.1 ± 0.08 | 81.11 ± 0.27 | 58.8 ± 0.32 |
| ϵ = 1 | D-adaptation | 23.74 ± 0.19 | 0.8 ± 0.01 | 15.27 ± 0.08 | 1.86 ± 0 | 1.17 ± 0.01 | 19.15 ± 0.41 | 0.83 ± 0.31 | 18.56 ± 0.74 | 4.72 ± 0.02 | 1.48 ± 0.10 |
| ϵ = 1 | Prodigy | 27.54 ± 0.61 | 0.80 ± 0.01 | 15.27 ± 0.08 | 1.86 ± 0.00 | 1.19 ± 0.00 | 19.15 ± 0.41 | 0.83 ± 0.00 | 18.56 ± 0.04 | 4.72 ± 0.02 | 1.43 ± 0.00 |
| ϵ = 1 | DP-hyper | 92.98 ± 0.14 | 74.63 ± 0.09 | 34.58 ± 0.58 | 28.23 ± 0.91 | 19.06 ± 0.59 | 94.86 ± 0.10 | 6.59 ± 0.00 | 79.94 ± 0.04 | 77.85 ± 0.16 | 57.02 ± 0.00 |
| ϵ = 1 | HyFreeDP | 96.36 ± 0.03 | 78.17 ± 0.12 | 91.15 ± 0.06 | 74.68 ± 0.54 | 67.98 ± 0.09 | 95.76 ± 0.03 | 67.17 ± 0.11 | 88.24 ± 0.14 | 79.00 ± 0.51 | 58.16 ± 0.07 |
| ϵ = 3 | NonDP-GS | 96.86 ± 0.03 | 83.69 ± 0.12 | 91.6 ± 0.11 | 83.08 ± 0.69 | 74.58 ± 0.23 | 95.84 ± 0.07 | 79.03 ± 0.18 | 91.74 ± 0.04 | 91.03 ± 0.11 | 66.87 ± 0.06 |
| ϵ = 3 | D-adaptation | 40.99 ± 0.92 | 0.94 ± 0.01 | 17.22 ± 0.09 | 2.65 ± 0.03 | 1.37 ± 0.01 | 30.29 ± 0.51 | 1.07 ± 0.01 | 19.77 ± 0.04 | 5.68 ± 0.00 | 1.68 ± 0.01 |
| ϵ = 3 | Prodigy | 77.18 ± 1.85 | 0.95 ± 0.01 | 18.11 ± 0.16 | 2.68 ± 0.05 | 1.47 ± 0.02 | 30.29 ± 0.51 | 1.07 ± 0.01 | 19.77 ± 0.04 | 5.68 ± 0.00 | 1.65 ± 0.02 |
| ϵ = 3 | DP-hyper | 95.10 ± 0.04 | 78.82 ± 0.14 | 48.82 ± 1.15 | 42.02 ± 0.47 | 36.21 ± 0.8 | 95.75 ± 0.09 | 19.92 ± 0.73 | 87.82 ± 0.19 | 90.25 ± 0.42 | 66.09 ± 0.05 |
| ϵ = 3 | HyFreeDP | 96.89 ± 0.06 | 81.62 ± 0.06 | 92.21 ± 0.12 | 89.92 ± 0.57 | 75.24 ± 0.06 | 97.29 ± 0.02 | 80.34 ± 0.05 | 91.71 ± 0.18 | 91.76 ± 0.48 | 73.46 ± 0.06 |
| NonDP | NonDP-GS | 98.33 ± 0.01 | 89.41 ± 0.18 | 95.58 ± 0.14 | 98.66 ± 0.02 | 85.19 ± 0.12 | 98.88 ± 0.01 | 92.4 ± 0.04 | 96.87 ± 0.11 | 98.41 ± 0.02 | 89.61 ± 0.08 |
| NonDP | D-adaptation | 54.29 ± 0.53 | 86.35 ± 0.06 | 97.09 ± 0.08 | 97.38 ± 0.31 | 74.75 ± 1.00 | 63.37 ± 1.10 | 88.41 ± 0.05 | 97.02 ± 0.09 | 98.65 ± 0.01 | 78.35 ± 1.17 |
| NonDP | Prodigy | 97.85 ± 0.03 | 89.21 ± 0.07 | 96.87 ± 0.07 | 98.31 ± 0.14 | 87.49 ± 0.08 | 98.59 ± 0.06 | 91.01 ± 0.10 | 97.14 ± 0.06 | 98.77 ± 0.14 | 89.47 ± 0.06 |
| NonDP | HyFreeDP | 98.32 ± 0.03 | 90.84 ± 0.02 | 96.86 ± 0.02 | 99.00 ± 0.02 | 86.38 ± 0.03 | 98.83 ± 0.01 | 92.58 ± 0.02 | 96.75 ± 0.05 | 97.37 ± 0.05 | 88.73 ± 0.02 |

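Table 7: Performance comparison of HyFreeDP to other baselines. We use cosine learning rate decay for the NonDP-GS w/ LS baseline in the BitFit fine-tuning setting.

Full Fine-Tune

| Privacy budget | Method | Vit-Small CIFAR10 | Vit-Small CIFAR100 | Vit-Small SVHN | Vit-Small GTSRB | Vit-Small Food101 | Vit-Base CIFAR10 | Vit-Base CIFAR100 | Vit-Base SVHN | Vit-Base GTSRB | Vit-Base Food101 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ϵ = 8 | NonDP-GS | 96.28 | 84.99 | 92.53 | 89.97 | 77.08 | 96.14 | 82.52 | 92.38 | 94.91 | 71.05 |
| ϵ = 8 | D-adaptation | 78.31 | 1.07 | 19.10 | 3.67 | 1.76 | 40.90 | 1.25 | 20.86 | 6.84 | 1.94 |
| ϵ = 8 | Prodigy | 95.74 | 1.29 | 20.75 | 4.86 | 2.92 | 45.89 | 1.25 | 20.86 | 6.84 | 1.90 |
| ϵ = 8 | DP-hyper | 95.70 | 80.58 | 64.72 | 50.10 | 49.18 | 96.28 | 36.46 | 90.27 | 94.62 | 70.63 |
| ϵ = 8 | HyFreeDP | 97.04 | 86.00 | 95.15 | 88.06 | 80.61 | 97.79 | 87.57 | 95.00 | 94.07 | 81.91 |

BitFit Fine-Tune

| Privacy budget | Method | Vit-Small CIFAR10 | Vit-Small CIFAR100 | Vit-Small SVHN | Vit-Small GTSRB | Vit-Small Food101 | Vit-Base CIFAR10 | Vit-Base CIFAR100 | Vit-Base SVHN | Vit-Base GTSRB | Vit-Base Food101 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ϵ = 8 | NonDP-GS | 97.23 | 86.56 | 92.23 | 93.34 | 82.12 | 97.81 | 88.05 | 93.38 | 93.04 | 81.88 |
| ϵ = 8 | NonDP-GS w/ LS | 97.63 | 86.21 | 83.17 | 67.35 | 79.85 | 98.08 | 87.17 | 88.25 | 75.61 | 81.18 |
| ϵ = 8 | HyFreeDP | 97.31 | 88.43 | 89.17 | 93.54 | 74.61 | 97.90 | 90.34 | 92.36 | 93.14 | 87.14 |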

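Proof of Theorem 1. It is not hard to see that

$$\mathbb{E}(\tilde{L}) - L = \frac{1}{B}\sum_i \min\left(\frac{R_l}{L_i}, 1\right) L_i - \frac{1}{B}\sum_i L_i = \frac{1}{B}\sum_i (R_l - L_i)\,\mathbb{I}(L_i > R_l) = -\frac{1}{B}\sum_i \mathrm{ReLU}(L_i - R_l),$$

in which the non-negative $\mathrm{ReLU}(x) = x \cdot \mathbb{I}(x > 0)$. Taking the absolute value, we have

$$\left|\mathbb{E}(\tilde{L}) - L\right| = \frac{1}{B}\sum_i \mathrm{ReLU}(L_i - R_l) = \frac{1}{B}\sum_i (L_i - R_l)\,\mathbb{I}(L_i > R_l),$$

which is decreasing in $R_l$ because ReLU is increasing in its input $(L_i - R_l)$, and this input is decreasing in $R_l$.

As $B \to \infty$, the clipping bias tends to

$$\mathbb{E}(\mathrm{ReLU}(L_i - R_l)) = \mathbb{E}(\mathrm{ReLU}(L_i - R_l) \mid L_i > R_l)\,\mathbb{P}(L_i > R_l) = \mathbb{E}((L_i - R_l) \mid L_i > R_l)\,\mathbb{P}(L_i > R_l) = \left[\mathbb{E}(L_i \mid L_i > R_l) - R_l\right]\mathbb{P}(L_i > R_l),$$

where we have used $\mathrm{ReLU}(x) = x$ when $x > 0$.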
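The finite-batch identity used in this proof can be checked numerically. The following is a minimal sketch, assuming positive per-sample losses drawn from an arbitrary (hypothetical) uniform distribution; the function names are illustrative and not taken from the paper's code.

```python
import random


def clipping_bias(losses, r_l):
    """|mean of clipped losses - mean of raw losses| for threshold r_l."""
    clipped = [min(r_l / li, 1.0) * li for li in losses]
    return abs(sum(clipped) / len(losses) - sum(losses) / len(losses))


def relu_form(losses, r_l):
    """Mean of ReLU(L_i - R_l), the closed form of the same bias."""
    return sum(max(li - r_l, 0.0) for li in losses) / len(losses)


random.seed(0)
# Hypothetical positive per-sample losses (illustration only).
LOSSES = [random.uniform(0.1, 5.0) for _ in range(1000)]
THRESHOLDS = [0.5, 1.0, 2.0, 4.0, 6.0]
BIASES = [clipping_bias(LOSSES, r) for r in THRESHOLDS]

for r, b in zip(THRESHOLDS, BIASES):
    print(f"R_l = {r:.1f}: bias = {b:.4f}, ReLU form = {relu_form(LOSSES, r):.4f}")
```

The bias shrinks as the threshold grows and vanishes once the threshold exceeds every per-sample loss, which is the monotonicity the proof relies on.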

Proof of Corollary 1. The key part in (8) is $\mathbb{E}(L_i \mid L_i > R_l)$, which is the expectation of the normal distribution under one-sided truncation. It is known that for $\alpha = \frac{R_l - \mu}{\xi}$,

$$\mathbb{E}(L_i \mid L_i > R_l) = \mu + \xi\,\frac{\phi(\alpha)}{1 - \Phi(\alpha)}, \qquad \mathbb{P}(L_i > R_l) = 1 - \Phi(\alpha).$$

The proof is complete by inserting these quantities.
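Inserting the truncated-normal quantities into the limiting bias [E(Li | Li > Rl) − Rl] · P(Li > Rl) gives ξ[ϕ(α) − α(1 − Φ(α))], the function plotted in Figure 7. The sketch below compares this closed form with a Monte Carlo estimate of E(ReLU(Li − Rl)); it assumes Li follows a normal distribution with mean µ and standard deviation ξ, and the numeric values of µ, ξ, and Rl are hypothetical.

```python
import math
import random


def std_normal_pdf(a):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(a):
    """CDF of the standard normal distribution."""
    return 0.5 * math.erfc(-a / math.sqrt(2.0))


def gaussian_clipping_bias(r_l, mu, xi):
    """xi * [phi(alpha) - alpha * (1 - Phi(alpha))] with alpha = (r_l - mu) / xi."""
    alpha = (r_l - mu) / xi
    return xi * (std_normal_pdf(alpha) - alpha * (1.0 - std_normal_cdf(alpha)))


def monte_carlo_bias(r_l, mu, xi, n=200_000, seed=0):
    """Empirical mean of ReLU(L_i - r_l) for L_i ~ N(mu, xi^2)."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(mu, xi) - r_l, 0.0) for _ in range(n)) / n


MU, XI, R_L = 2.0, 0.5, 2.5  # hypothetical values, so alpha = 1
CLOSED_FORM = gaussian_clipping_bias(R_L, MU, XI)
SIMULATED = monte_carlo_bias(R_L, MU, XI)
print(f"closed form: {CLOSED_FORM:.5f}, Monte Carlo: {SIMULATED:.5f}")
```

Because the derivative of ϕ(α) − α(1 − Φ(α)) with respect to α is −(1 − Φ(α)) < 0, the bias decreases monotonically in the threshold and only vanishes in the limit of a very large Rl.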

Proof of Theorem 2. All privacy budget of Algorithm 1 goes into two components: privatizing the gradient (with noise level $\sigma_g$) and privatizing the loss (with noise level $\sigma_l$).

Under the same $(B, T, N, \sigma_g)$, we have $T$ mechanisms of gradient privatization, each of $(\epsilon_g, \delta_g)$-DP, and $3T/K$ mechanisms of loss privatization, each of $(\epsilon_l, \delta_l)$-DP. Hence it is clear that $\epsilon_{\text{ours}} > \epsilon_{\text{vanilla}}$.

To be more specific, we demonstrate with $\mu$-GDP. The vanilla DP-SGD is $\mu$-GDP with

$$\mu_{\text{vanilla}} = \frac{B}{N}\sqrt{T\left(e^{1/\sigma_g^2} - 1\right)},$$

which is the same as the gradient privatization component of our DP-SGD. We additionally spend

$$\mu_l = \frac{B}{N}\sqrt{\frac{3T}{K}\left(e^{1/\sigma_l^2} - 1\right)},$$

leading to a total budget of

$$\mu_{\text{ours}} = \sqrt{\mu_{\text{vanilla}}^2 + \mu_l^2}$$

by Corollary 3.3 in Dong et al. (2022). It is clear that $\mu_{\text{ours}} > \mu_{\text{vanilla}}$.
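To give a feel for the size of the extra budget, the sketch below evaluates the GDP composition numerically. It assumes the CLT-style expression µ = (B/N)·sqrt(rounds·(e^{1/σ²} − 1)) for both the gradient and the loss components, with T rounds for the gradient and 3T/K rounds for the loss; all numeric settings are hypothetical and chosen only for illustration.

```python
import math


def gdp_mu(batch_size, n_samples, n_rounds, sigma):
    """GDP parameter (B/N) * sqrt(rounds * (exp(1/sigma^2) - 1))."""
    q = batch_size / n_samples
    return q * math.sqrt(n_rounds * math.expm1(1.0 / sigma**2))


def total_mu(batch_size, n_samples, n_steps, k, sigma_g, sigma_l):
    """Return (mu_gradient, mu_loss, mu_total) under GDP composition."""
    mu_grad = gdp_mu(batch_size, n_samples, n_steps, sigma_g)
    mu_loss = gdp_mu(batch_size, n_samples, 3 * n_steps / k, sigma_l)
    # GDP parameters compose as the square root of the sum of squares.
    return mu_grad, mu_loss, math.hypot(mu_grad, mu_loss)


# Hypothetical settings: B=1000, N=50000, T=1000, K=5, sigma_g=1, sigma_l=10.
MU_GRAD, MU_LOSS, MU_TOTAL = total_mu(1000, 50_000, 1000, 5, 1.0, 10.0)
print(f"mu_gradient = {MU_GRAD:.4f}, mu_loss = {MU_LOSS:.4f}, mu_total = {MU_TOTAL:.4f}")
```

Because the two parameters add in squares, a loss component that is much smaller than the gradient component changes the total only marginally, which is why a large σl keeps the cost of loss privatization low.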

C END-TO-END PRIVACY ACCOUNTING AND INVERSE

We demonstrate how to determine $(\sigma_g, \sigma_l)$ given an $(\epsilon, \delta)$-DP budget. In vanilla DP optimization, we can leverage privacy accountants such as RDP, GDP, PRV, etc. Each accountant is a function whose input is the hyperparameters $(B, T, N, \delta, \sigma)$ and whose output is $\epsilon$ (see an example in (9)). In this section, we denote any accountant as $f$, so that

$$f(\sigma; B, N, \delta) = \epsilon \tag{11}$$

$$f^{T}(\sigma; B, N, \delta) = \epsilon \tag{12}$$

$$f^{-T}(\epsilon; B, N, \delta) = \sigma \tag{13}$$

in which $\epsilon$ is the single-iteration budget, $f^{T}$ means a composition of $T$ iterations, and $f^{-T}$ is the inverse function known as GetSigma in Figure 1.

In this work, we develop an end-to-end privacy accountant to compose both the gradient privatization and the loss privatization. Our accountant takes the input $(B, T, N, \delta, \sigma_l, \gamma, K)$, where $\gamma = 1.01$ by default and can take a smaller value for larger models or a smaller $(\epsilon, \delta)$ budget. The gradient noise is then $\sigma_g = \gamma\sigma$.

Firstly, we call $f^{T}(\gamma\sigma; B, N, \delta) = \hat{\epsilon}$ to get the reference budget $\hat{\epsilon}$, which is strictly smaller than $\epsilon$ because $f$ is monotonically decreasing in its input. Then we guess the loss noise and call $f^{3T/K}(\sigma_l; B, N, \delta) = \epsilon_l$, since there are $3T/K$ rounds of loss privatization. We continue guessing until

$$f^{3T/K}(\sigma_l; B, N, \delta) + f^{T}(\gamma\sigma; B, N, \delta) = \epsilon_l + \hat{\epsilon} = \epsilon.$$

Notice that the left-hand side is monotonically decreasing in $\sigma_l$. Hence we use the bisection method to find the unique solution $\sigma_l$ at an exponentially fast speed.

Algorithm 2 End-to-End Privacy Accounting

1: INPUT: The end-to-end DP budget (ϵ, δ), GetSigma(·), Compose(·), Solve(·)
2: OUTPUT: Noise magnitude for gradient and loss privatization σg and σl
3: Compute the gradient noise scale σ by assuming there is only a single training run
4: σ = GetSigma(ϵ, δ, B, T, N)
5: Compute the σg in Algorithm 1 with a controlled noise increase
6: σg = γσ with the constant γ slightly greater than 1 (e.g., γ = 1.01)
7: Define the composition function with input variable as the loss noise scale c
8: ϵours(c | σg, δ, B, N, T, K) = Compose(f_g^T(σg; B, N, δ), f_l^{3T/K}(c; B, N, δ))
9: Solve the minimization of the scalar function with respect to c
10: σl = Solve(c, ϵ) = arg min_c |ϵours(c) − ϵ|

In the above, we have used the functions GetSigma(·)⁶, Compose(·)⁷, and any root-finding method Solve(·) such as the bisection in the scipy library. We highlight that Algorithm 1 and Algorithm 2 are sufficiently flexible to work with the general DP notions, including GDP, Renyi DP, tCDP, per-instance DP, per-user DP, etc.

⁶ https://github.com/yuxiangw/autodp/blob/master/example/example_calibrator.py
⁷ https://github.com/yuxiangw/autodp/blob/master/example/example_composition.py
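The accounting procedure can be sketched end to end with a simple stand-in accountant. The code below is a minimal illustration that assumes a GDP accountant with the CLT-style parameter (B/N)·sqrt(rounds·(e^{1/σ²} − 1)) and the GDP-to-(ϵ, δ) conversion of Dong et al. (2022). It does not use the autodp library, and the helper names (`gdp_mu`, `eps_of`, `bisect`, `end_to_end_sigmas`), search intervals, and example budget are all hypothetical.

```python
import math


def _phi(x):
    """Standard normal CDF, written with erfc for accurate tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gdp_mu(q, rounds, sigma):
    """GDP parameter of `rounds` noisy steps with sampling rate q = B/N."""
    return q * math.sqrt(rounds * math.expm1(1.0 / sigma**2))


def delta_of(eps, mu):
    """delta(eps) implied by mu-GDP."""
    return _phi(-eps / mu + mu / 2) - math.exp(eps) * _phi(-eps / mu - mu / 2)


def bisect(fn, lo, hi, iters=80):
    """Root of a monotonically decreasing fn with fn(lo) > 0 > fn(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fn(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eps_of(mu, delta):
    """Smallest eps such that mu-GDP gives (eps, delta)-DP."""
    return bisect(lambda e: delta_of(e, mu) - delta, 0.0, 100.0)


def end_to_end_sigmas(eps, delta, batch, n, steps, k, gamma=1.01):
    """Return (sigma_g, sigma_l) whose composition spends the (eps, delta) budget."""
    q = batch / n
    # Steps 3-4: noise of a vanilla run that spends all budget on gradients.
    sigma = bisect(lambda s: eps_of(gdp_mu(q, steps, s), delta) - eps, 0.3, 50.0)
    # Steps 5-6: slightly larger gradient noise frees budget for the loss.
    sigma_g = gamma * sigma
    mu_g = gdp_mu(q, steps, sigma_g)

    # Steps 7-8: composed budget as a function of the loss noise scale c.
    def eps_ours(c):
        mu_l = gdp_mu(q, 3 * steps / k, c)
        return eps_of(math.hypot(mu_g, mu_l), delta)

    # Steps 9-10: eps_ours is decreasing in c, so bisection finds sigma_l.
    sigma_l = bisect(lambda c: eps_ours(c) - eps, 0.3, 1e4)
    return sigma_g, sigma_l


# Hypothetical budget and training configuration.
SIGMA_G, SIGMA_L = end_to_end_sigmas(3.0, 1e-5, 1000, 50_000, 1000, 5)
print(f"sigma_g = {SIGMA_G:.4f}, sigma_l = {SIGMA_L:.4f}")
```

The same structure would carry over to other accountants by swapping `gdp_mu` and `eps_of` for the corresponding composition and conversion routines; the search intervals passed to `bisect` are ad hoc and would need to bracket the solution for other settings.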

Hyper-parameter matters for DP optimization. Previous works (De et al., 2022; Li et al., 2021) reveal that the performance of DP optimization is sensitive to the hyper-parameter choices, as illustrated by the results we cite in Figure 8.

[Figure 8 heatmaps. Both panels have the axes "Clipping norm" and "Learning rate"; cell values are listed row by row.]

Left panel:

| 60.26 | 59.58 | 60.01 | 57.87 | 35.44 |
|---|---|---|---|---|
| 50.48 | 50.31 | 50.48 | 49.25 | 34.47 |
| 33.17 | 33.18 | 33.13 | 32.96 | 29.50 |
| 29.74 | 29.55 | 29.61 | 29.62 | 27.07 |
| 9.69 | 9.69 | 9.68 | 9.65 | 5.79 |

Right panel:

| 43.0 | 2.0 | 0.1 | 0.1 | 0.1 |
|---|---|---|---|---|
| 22.0 | 45.0 | 0.1 | 0.1 | 0.1 |
| 2.1 | 22.0 | 45.0 | 11.0 | 0.09 |
| 0.29 | 2.2 | 22.0 | 24.0 | 0.11 |
| 0.1 | 0.25 | 2.1 | 13.0 | 0.44 |

Figure 8: Hyperparameter tuning of (Rg, η), cited from Figure 1 of Bu et al. (2023b). Here Rg is the clipping norm and η is the learning rate. Left: BLEU score of GPT2 on the E2E dataset (Li et al., 2021). Right: Test accuracy of ResNet18 on ImageNet (Kurakin et al., 2022).

End-to-end DP guarantee for optimization and tuning. As shown in Table 8, we compare our work with representative works that try to ensure end-to-end DP for both optimization and tuning.

Table 8: Comparison of hyperparameter search strategies. Budget splitting percentages for Adaptive Clipping and DP-hyper are estimated values, as the actual percentages depend on specific datasets.

| Method | Searching η | Searching Rg | % Budget on Hyperparam |
|---|---|---|---|
| Vanilla (Abadi et al., 2016) | | | 0% |
| Automatic Clipping (Bu et al., 2023b) | | | 0% |
| Adaptive Clipping (Andrew et al., 2021) | | | 20% |
| DP-hyper (Papernot & Steinke, 2021) | | | 20% |
| Ours (this work) | | | <1% |

Taylor approximation of next-iteration loss. This analysis of the next-iteration loss via the Taylor approximation is also presented in Section 2 of Bu et al. (2024b), which focuses on explaining the DP training dynamics rather than on the learning rate.