# Shifted Interpolation for Differential Privacy

Jinho Bok¹, Weijie J. Su¹, Jason Altschuler¹

Abstract: Noisy gradient descent and its variants are the predominant algorithms for differentially private machine learning. It is a fundamental question to quantify their privacy leakage, yet tight characterizations remain open even in the foundational setting of convex losses. This paper improves over previous analyses by establishing (and refining) the "privacy amplification by iteration" phenomenon in the unifying framework of f-differential privacy, which tightly captures all aspects of the privacy loss and immediately implies tighter privacy accounting in other notions of differential privacy, e.g., (ε, δ)-DP and Rényi DP. Our key technical insight is the construction of shifted interpolated processes that unravel the popular shifted-divergences argument, enabling generalizations beyond divergence-based relaxations of DP. Notably, this leads to the first exact privacy analysis in the foundational setting of strongly convex optimization. Our techniques extend to many settings: convex/strongly convex, constrained/unconstrained, full/cyclic/stochastic batches, and all combinations thereof. As an immediate corollary, we recover the f-DP characterization of the exponential mechanism for strongly convex optimization in Gopi et al. (2022), and moreover extend this result to more general settings.

¹Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Jinho Bok.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Private optimization is the primary approach for private machine learning. The goal is to train good models while not leaking sensitive attributes of the training data. Differential privacy (DP) is the gold standard for measuring information leakage (Dwork et al., 2006; Dwork & Roth, 2014), and noisy gradient descent and its variants are the predominant algorithms for private optimization. It is therefore a central question to quantify the differential privacy of these algorithms. However, tight characterizations remain open, even in the seemingly simple setting of convex optimization.

In words, DP measures how distinguishable the output of a (randomized) algorithm is when run on two adjacent datasets, i.e., two datasets that differ in only one individual record. There are several ways to measure distinguishability, leading to many relaxations of DP, e.g., (Bun & Steinke, 2016; Mironov, 2017; Dong et al., 2022). Different DP notions lead to different privacy analyses, and a long line of work has sought to prove sharp privacy bounds for noisy gradient descent and its variants (Bassily et al., 2014; Abadi et al., 2016; Feldman et al., 2018; Chourasia et al., 2021; Ye & Shokri, 2022; Altschuler & Talwar, 2022). A common approach is to use the composition theorem, which essentially pays a price in privacy for every intermediate iterate along the optimization trajectory, leading to possibly suboptimal privacy bounds. Recent work has significantly improved the privacy analysis in the case of convex and strongly convex losses by leveraging stability properties of (stochastic) gradient descent (Chourasia et al., 2021; Ye & Shokri, 2022; Altschuler & Talwar, 2022) in order to show that the privacy leakage does not increase ad infinitum in the number of iterations t.
This is in stark contrast to the composition-based approach, which gives privacy bounds that scale as √t and thus diverge as t → ∞.

All these convergent privacy bounds were proved in the Rényi DP framework, which is inherently lossy. To achieve the tightest possible privacy bound on private gradient methods, a natural goal is to use the f-DP framework (Dong et al., 2022) for the analysis, since it is an information-theoretically lossless definition of DP. This definition measures distinguishability in terms of the Type I vs Type II error tradeoff curve f for the hypothesis testing problem of whether a given user was in the training dataset. The f-DP framework is desirable because: (1) f-DP exactly characterizes all relevant aspects of the hypothesis testing problem defining DP, and thus (optimal) f-DP bounds can be losslessly converted to (optimal) bounds in other notions of privacy such as (ε, δ)-DP or Rényi DP; (2) f-DP is lossless under composition of multiple private mechanisms, which is the most ubiquitous operation in DP since it enables combining building blocks; and (3) f-DP is easily interpretable in terms of the original hypothesis testing definition of DP.

However, analyzing privacy leakage in the f-DP framework is often challenging, since quantifying the entire tradeoff between Type I/II errors is substantially more difficult than quantifying (less informative) alternative notions of privacy. Consequently, the analysis toolbox for f-DP is currently limited. These limitations are pronounced for the fundamental problem of analyzing the privacy loss of noisy gradient descent and its variants. To put it into perspective, existing privacy guarantees based on f-DP diverge as the number of iterations t increases, whereas the aforementioned recent work has used divergence-based DP definitions to show that for convex problems, noisy gradient descent and its variants can remain private even when run indefinitely (Chourasia et al., 2021; Ye & Shokri, 2022; Altschuler & Talwar, 2022). Convergent privacy bounds complement celebrated results for minimax-optimal privacy-utility tradeoffs (Bassily et al., 2014; 2019) because they enable longer training, which is useful since typical learning problems are not worst-case and benefit from training longer.

Can convergent privacy bounds be achieved directly¹ in the tight framework of f-DP?

All current arguments are tailored to Rényi DP, an analytically convenient but inherently lossy relaxation of DP, and do not appear to extend. Answering this question necessitates developing fundamentally different techniques for f-DP, since convergent privacy bounds require only releasing the algorithm's final iterate, in sharp contrast to existing f-DP techniques such as the composition theorem, which can only argue about the accumulated privacy loss of releasing all intermediate iterates. Tight f-DP analyses typically require closed-form expressions for the random variable in question in order to argue about the tradeoff of Type I/II errors, but this is impossible for the final iterate of (stochastic) gradient descent due to the non-linearity intrinsic to each iteration.

1.1. Contribution

Our primary technical contribution is establishing (and refining) the privacy amplification by iteration phenomenon in the unifying framework of f-DP. This enables directly analyzing the privacy loss of the final iterate of noisy gradient descent (and its variants), leading to the first direct f-DP analysis that is convergent as the number of iterations t → ∞.
§1.2 overviews this new analysis technique.

Notably, this leads to the first exact privacy analysis in the foundational setting of strongly convex losses. To our knowledge, there is no other setting where exact convergent privacy analyses are known for any t > 1, except for the setting of convex quadratic losses, which is analytically trivial because all iterates are explicit Gaussians.²

¹Convergent RDP bounds can of course be lossily converted to convergent f-DP bounds, but that defeats the purpose of using the lossless f-DP framework.

²The standard analysis approach based on the composition theorem is nearly tight for small numbers of iterations t, but as mentioned above, yields an arbitrarily loose (in fact vacuous) bound for convex losses as t → ∞.

Figure 1. Left: improved f-DP vs the standard composition analysis. Right: improved (ε, δ)-DP by losslessly converting from f-DP. Our privacy bound is optimal in all parameters, here for Noisy GD on strongly convex losses; see §D for the parameter choices and other settings. Our f-DP analysis also implies optimal bounds for the Rényi DP framework (previously unknown), but f-DP is strictly better since it captures all aspects of the privacy leakage, whereas Rényi DP is intrinsically lossy.

We emphasize that our techniques are versatile and readily extend to many settings, a well-known challenge for other convergent analyses, even for simpler relaxations of DP like Rényi DP (Chourasia et al., 2021; Ye & Shokri, 2022; Altschuler & Talwar, 2022). In §4, we illustrate how our analysis extends to convex/strongly convex losses, constrained/unconstrained optimization, full/cyclic/stochastic batches, and all combinations thereof.

Since our improved privacy guarantees are for f-DP (Figure 1, left), lossless conversions immediately imply improved guarantees for other notions of privacy like Rényi DP and (ε, δ)-DP (Figure 1, right). For example, for the strongly convex setting, our exact bound improves over previous results by a factor of 2 in Rényi DP, and thus by even more in (ε, δ)-DP due to the intrinsic lossiness of Rényi DP, which we overcome by analyzing directly in f-DP. In practice, improving the privacy by a factor of two enables training with half the noise while satisfying the same privacy budget. Although this paper's focus is the theoretical methodology, preliminary numerics in §4.4 corroborate that our improved privacy guarantees can be helpful in practice.

Since our privacy bounds are convergent in the number of iterations t, we can take the limit t → ∞ to bound the f-DP of the stationary distributions of these optimization algorithms. As an immediate corollary, we recover the recent f-DP characterization of the exponential mechanism for strongly convex losses in Gopi et al. (2022), and moreover extend this result to more general settings in §5.

1.2. Techniques

The core innovation underlying our results is the construction of certain auxiliary processes, shifted interpolated processes, which enable directly analyzing the Type I/II error tradeoff between the final iterates of two stochastic processes, even when their laws are complicated and non-explicit. Informally, this argument enables running coupling arguments, traditionally possible only for Wasserstein analysis, to analyze tradeoff functions for the first time.
In this paper, the two processes are noisy (stochastic, projected) gradient descent run on two adjacent datasets, but the technique is more general and we believe may be of independent interest. See §3 for a detailed discussion.

Crucially, our argument is geometrically aware: it exploits (strong) convexity of losses via (strong) contractivity of gradient descent updates, in order to argue that the sensitive gradient queries have (exponentially) decaying privacy leakage the longer ago they were performed. This is essential for convergent privacy bounds, and is impossible with the standard composition-based analysis, which only exploits the sensitivity of the losses and is oblivious to any further geometric phenomena like convexity or contractivity.

A key motivation behind the construction of our auxiliary sequence is that it demystifies the popular privacy amplification by iteration analysis (Feldman et al., 2018), which has been used in many contexts, and in particular was recently shown to give convergent Rényi DP bounds (Altschuler & Talwar, 2022). Those arguments rely on shifted divergences, which combine Rényi divergence and Wasserstein distance, and it was an open question whether this ad hoc potential function could be simplified. Our shifted interpolated process answers this: its iterates coincide with the optimal shifts in the shifted-divergences argument, which allows us to disentangle the Rényi and Wasserstein components of the shifted-divergences argument; details in §B. Crucially, this disentanglement enables generalizations beyond divergence-based relaxations of DP, to f-DP.³

³Naïvely extending the shifted-divergences argument to shifted tradeoffs runs into subtle but fundamental issues, since tradeoff functions do not enjoy key properties that divergences do. Details in §B.

1.3. Organization

§2 recalls relevant preliminaries from differential privacy and convex optimization. §3 introduces our core technique of shifted interpolation. §4 uses this technique to establish improved privacy bounds for noisy gradient descent and its variants for the foundational settings of convex and strongly convex losses. §5 describes how, as immediate corollaries of these convergent privacy bounds, taking an appropriate limit recovers and generalizes recent results on the f-DP of the exponential mechanism. §6 discusses future directions motivated by our results. Code reproducing our numerics can be found here: https://github.com/jinhobok/shifted_interpolation_dp.

2. Preliminaries

2.1. Differential privacy

DP measures the distinguishability between outputs of a randomized algorithm run on adjacent datasets, i.e., datasets that differ on at most one data point (Dwork et al., 2006). The most popular definition is (ε, δ)-DP.

Definition 2.1 ((ε, δ)-DP). A randomized algorithm A is (ε, δ)-DP if for any adjacent datasets S, S′ and event E,

$$\mathbb{P}(A(S) \in E) \le e^{\varepsilon}\,\mathbb{P}(A(S') \in E) + \delta.$$

However, the most precise quantification of DP is based on the (optimal) hypothesis testing formulation (Wasserman & Zhou, 2010; Kairouz et al., 2017). This is formalized as f-DP (Dong et al., 2022), where f denotes a tradeoff function, i.e., a curve of hypothesis testing errors.

Definition 2.2 (f-DP). For distributions P, Q on the same space, the tradeoff function T(P, Q) : [0, 1] → [0, 1] is

$$T(P, Q)(\alpha) = \inf\{1 - \mathbb{E}_Q \phi \;:\; \mathbb{E}_P \phi \le \alpha,\; 0 \le \phi \le 1\}.$$

A randomized algorithm A is f-DP if for any adjacent datasets S and S′, T(A(S), A(S′)) ≥ f.
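To make Definition 2.2 concrete, here is a minimal sketch (ours, not from the paper; assumes NumPy/SciPy) for the tradeoff curve between two unit-variance Gaussians. The Neyman–Pearson-optimal test is a threshold test, and the resulting curve has the closed form α ↦ Φ(Φ⁻¹(1−α) − µ):

```python
import numpy as np
from scipy.stats import norm

def gaussian_tradeoff(mu, alpha):
    """Closed-form tradeoff curve T(N(0,1), N(mu,1))(alpha) = Phi(Phi^{-1}(1-alpha) - mu)."""
    return norm.cdf(norm.ppf(1 - alpha) - mu)

def empirical_tradeoff(mu, alpha, n=200_000, seed=0):
    """Monte Carlo estimate: the optimal test rejects when the sample exceeds a threshold."""
    rng = np.random.default_rng(seed)
    thresh = norm.ppf(1 - alpha)            # Type I error = P(N(0,1) > thresh) = alpha
    samples = rng.normal(mu, 1.0, size=n)   # draws from the alternative N(mu, 1)
    return np.mean(samples <= thresh)       # Type II error = P(fail to reject)

alpha = 0.05
for mu in [0.5, 1.0, 5.0]:
    print(mu, gaussian_tradeoff(mu, alpha), empirical_tradeoff(mu, alpha))
```

The two columns agree up to Monte Carlo error, illustrating that the tradeoff function is exactly the optimal Type I/II error frontier.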
The following lemma provides a useful characterization of tradeoff functions (Dong et al., 2022, Proposition 1).

Lemma 2.3 (Characterization of tradeoff functions). A function f : [0, 1] → [0, 1] is a tradeoff function iff f is decreasing, convex, and f(α) ≤ 1 − α for all α ∈ [0, 1].

It follows that the most private tradeoff function is Id : [0, 1] → [0, 1], given by Id(α) = 1 − α. See Figure 2. See §A.2 for further details on tradeoff functions.

Gaussian tradeoff functions are a particularly useful family, providing a notion of Gaussian DP (GDP) parametrized by a single scalar. These are central to our analysis due to the Gaussian noise in noisy (stochastic) gradient descent.

Definition 2.4 (GDP). For GDP parameter µ ≥ 0, the Gaussian tradeoff function G(µ) is defined as G(µ) = T(N(0, 1), N(µ, 1)). Its value at α ∈ [0, 1] is given by G(µ)(α) = Φ(Φ⁻¹(1 − α) − µ), where Φ denotes the CDF of N(0, 1). A randomized algorithm A is µ-GDP if for any adjacent datasets S and S′, T(A(S), A(S′)) ≥ G(µ).

Figure 2. Illustration of f-DP and GDP. Gaussian tradeoff functions G(µ) are less private as µ increases from 0 (full privacy) to ∞ (essentially no privacy). The closer to Id, the more private. Here A is 1-GDP but not 0.5-GDP because its tradeoff function is pointwise above G(1) but not pointwise above G(0.5).

We now recall two key properties of tradeoff functions that are central to our analysis. The first states that post-processing two distributions in the same way cannot make them easier to distinguish (Dong et al., 2022, Lemma 1).

Lemma 2.5 (Post-processing). For any probability distributions P, Q and (random) map Proc, we have T(Proc(P), Proc(Q)) ≥ T(P, Q).

The next lemma enables analyzing the composition of multiple private mechanisms (Dong et al., 2022, Definition 5 & Lemma C.1).

Definition 2.6 (Composition). The composition of two tradeoff functions f = T(P, Q) and g = T(P′, Q′) is defined as f ⊗ g = T(P × P′, Q × Q′). The n-fold composition of f with itself is denoted f^⊗n.

Lemma 2.7 (Strong composition). Let K₁, K′₁, K₂, K′₂ be (random) maps such that for all y, T(K₁(y), K′₁(y)) ≥ T(K₂(y), K′₂(y)). Then

$$T\big((P, K_1(P)), (Q, K'_1(Q))\big) \ge T\big((P, K_2(P)), (Q, K'_2(Q))\big).$$

If g = T(K₂(y), K′₂(y)) does not depend on y, then T((P, K₁(P)), (Q, K′₁(Q))) ≥ T(P, Q) ⊗ g.

2.2. Convex optimization

This paper focuses on convex losses because tight privacy guarantees for noisy gradient descent (and variants) are open even in this seemingly simple setting. We make use of the following two basic facts from convex optimization. Below, we say a function is contractive if it is 1-Lipschitz. Recall that a function f is M-smooth if ∇f is M-Lipschitz, and is m-strongly convex if x ↦ f(x) − (m/2)‖x‖² is convex.

Lemma 2.8. If f is convex and M-smooth, then the gradient descent update g(x) = x − η∇f(x) is contractive for each η ∈ [0, 2/M]. If f is additionally m-strongly convex and η ∈ (0, 2/M), then g is c-Lipschitz where c = max{|1 − ηm|, |1 − ηM|} < 1.

Lemma 2.9. Let K be a closed and convex set in Rᵈ. Then the projection Π_K(x) = argmin_{z∈K} ‖z − x‖ is well-defined and is a contraction.

2.3. Algorithms

Throughout, we consider a private optimization setting in which the goal is to minimize the objective function F(x) = (1/n)∑ᵢ₌₁ⁿ fᵢ(x), where the i-th loss function fᵢ is associated with the i-th data point in a dataset S. An adjacent dataset S′ corresponds to loss functions {f′ᵢ}_{i∈[n]} where fᵢ ≡ f′ᵢ except for a single index i*.
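As a quick illustration of Lemma 2.8, the following sketch (ours; assumes NumPy; the quadratic loss and constants are illustrative) numerically estimates the Lipschitz constant of a gradient step on an m-strongly convex, M-smooth quadratic and compares it to c = max{|1−ηm|, |1−ηM|}:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, M = 5, 0.5, 4.0
# Quadratic loss f(x) = x^T A x / 2 with spectrum in [m, M]: m-strongly convex, M-smooth.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q @ np.diag(np.linspace(m, M, d)) @ Q.T

def gd_step(x, eta):
    return x - eta * (A @ x)   # gradient descent update x - eta * grad f(x)

eta = 2 / (M + m)
c = max(abs(1 - eta * m), abs(1 - eta * M))
ratios = []
for _ in range(1000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    ratios.append(np.linalg.norm(gd_step(x, eta) - gd_step(y, eta)) / np.linalg.norm(x - y))
print(max(ratios), "<=", c)   # empirical contraction factor vs Lemma 2.8's c
```

For this stepsize, c = (M−m)/(M+m) < 1, which is the contraction factor that drives all of the "geometrically aware" privacy bounds below.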
Noisy gradient descent and its variants follow the general template

$$X_{k+1} = \Pi_K\Big(X_k - \frac{\eta}{b}\sum_{i \in B_k} \nabla f_i(X_k) + Z_{k+1}\Big), \qquad k = 0, 1, \ldots, t-1, \qquad (1)$$

where X₀ is the initialization (e.g., zero), η is the learning rate, Z_{k+1} ∼ N(0, η²σ²I_d) independently, σ is the noise rate, K is the constraint set, and t is the number of steps. The batch B_k of size b can be chosen in several ways:

- Full batches (Noisy GD): B_k ≡ [n].
- Cyclic batches (Noisy CGD): partition [n] into batches of size b and cycle through them.
- Stochastic batches (Noisy SGD): choose a batch of size b uniformly at random from [n].

The advantage of the latter two variants is that they avoid computing the full gradient of the objective, which can be computationally burdensome when n is large. A standard assumption in private optimization is the following notion of gradient sensitivity:

Definition 2.10 (Gradient sensitivity). A family of loss functions F (defined on X) has gradient sensitivity L if

$$\sup_{f, g \in F,\; x \in X} \|\nabla f(x) - \nabla g(x)\| \le L.$$

For example, a family of L-Lipschitz loss functions has gradient sensitivity 2L. Another example is loss functions of the form fᵢ = ℓᵢ + r, where ℓᵢ are convex, L-Lipschitz losses, and r is a (non-Lipschitz) strongly convex regularizer; the point being that this family of loss functions {fᵢ} has finite gradient sensitivity 2L despite each fᵢ not being Lipschitz.

3. Shifted interpolation for f-DP

Here we explain the key conceptual ideas enabling our convergent f-DP bounds (see §1.2 for a high-level overview). To preserve the logical flow of ideas, we defer proofs to §C.1. Below, in §3.1 we first recall the standard f-DP analysis based on the composition theorem and why it yields divergent bounds. Then in §3.2 we describe our technique of shifted interpolated processes and how this enables convergent f-DP bounds.

To explain the ideas in their simplest form, we consider here the setting of full-batch gradients and unconstrained optimization. Let {fᵢ}_{i∈[n]} and {f′ᵢ}_{i∈[n]} be the losses corresponding to two adjacent datasets, where fᵢ ≡ f′ᵢ except for one index i*. Then Noisy GD forms the iterates

$$X_{k+1} = \phi(X_k) + Z_{k+1} \qquad (2)$$
$$X'_{k+1} = \phi'(X'_k) + Z'_{k+1} \qquad (3)$$

where X₀ = X′₀, φ(x) := x − (η/n)∑ᵢ₌₁ⁿ ∇fᵢ(x), φ′(x′) := x′ − (η/n)∑ᵢ₌₁ⁿ ∇f′ᵢ(x′), and Z_{k+1}, Z′_{k+1} ∼ N(0, η²σ²I_d).

3.1. Previous (divergent) f-DP bounds, via composition

f-DP requires bounding T(X_t, X′_t). The standard approach, based on the composition theorem, argues as follows:

$$T(X_t, X'_t) \ge T(X_{t-1}, X'_{t-1}) \otimes G(c) \ge T(X_{t-2}, X'_{t-2}) \otimes G(c\sqrt{2}) \ge \cdots \ge \underbrace{T(X_0, X'_0)}_{=\,\mathrm{Id}\ \text{since}\ X_0 = X'_0} \otimes\, G(c\sqrt{t}). \qquad (4)$$

Here, the composition theorem simultaneously unrolls both processes, at some price G(c) in each iteration. (These prices are collected via a basic GDP identity; see Lemma A.2.) This is due to the following simple lemma, which relies on the f-DP of the Gaussian mechanism using different updates φ, φ′ (Dong et al., 2022, Theorem 2).

Lemma 3.1. Suppose ‖φ(x) − φ′(x)‖ ≤ s for all x. Then

$$T\big(\phi(X) + N(0, \sigma^2 I_d),\; \phi'(X') + N(0, \sigma^2 I_d)\big) \ge T(X, X') \otimes G(s/\sigma).$$

Bounding s via sensitivity enables the argument (4) and gives the appropriate c. See Dong et al. (2022) for details.

However, while this argument (4) is reasonably tight for small t, it is vacuous as t → ∞. Conceptually, this is because this analysis considers releasing all intermediate iterates, hence it bounds T((X₁, ..., X_t), (X′₁, ..., X′_t)) ≥ G(c√t). Concretely, this is because the above analysis requires completely unrolling to iteration 0. Indeed, the identical initialization X₀ = X′₀ ensures T(X₀, X′₀) = Id, whereas at any other iteration k > 0 it is unclear how to directly bound T(X_k, X′_k) since X_k ≠ X′_k. This inevitably leads to final privacy bounds that diverge in t, since a penalty is incurred in each of the t iterations.
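For intuition, here is a small self-contained simulation (our sketch; assumes NumPy; the quadratic losses and all constants are illustrative) of the two coupled Noisy GD processes (2)–(3), together with the composition bound c√t from (4), where c = s/(ησ) and s = ηL/n is the per-step sensitivity:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, t = 2, 10, 100
eta, sigma, L = 0.1, 1.0, 1.0

# Adjacent datasets: grad f_i(x) = x for all i, except index 0's gradient differs by a
# vector of norm L, so the full-batch updates phi, phi' differ by s = eta * L / n.
def phi(x):   return x - eta * x
def phip(x):  return x - eta * (x - (L / n) * np.array([1.0] + [0.0] * (d - 1)))

X = Xp = np.zeros(d)
for _ in range(t):
    Z = eta * sigma * rng.normal(size=d)   # shared noise, purely for visualization
    X, Xp = phi(X) + Z, phip(Xp) + Z

s = eta * L / n
mu_composition = (s / (eta * sigma)) * np.sqrt(t)   # GDP parameter c*sqrt(t) from (4)
print("final gap:", np.linalg.norm(X - Xp), "composition GDP:", mu_composition)
```

The final iterates stay close (the contraction forgets the single differing gradient), yet the composition accounting charges for every one of the t steps; this gap is exactly what shifted interpolation repairs.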
3.2. Convergent f-DP bounds, via shifted interpolation

The central idea underlying our analysis is the construction of a certain auxiliary process {X̃_k} that interpolates between the two processes, in the sense that X̃_τ = X′_τ at some intermediate time τ and X̃_t = X_t at the final time. See Figure 3.

Figure 3. Illustration of the shifted interpolated process (6). The interpolated process {X̃_k} starts from one process (X̃_τ = X′_τ) and ends at the other (X̃_t = X_t). The intermediate time τ is an analysis parameter that we optimize to get the best final privacy bound.

Crucially, this enables running the argument (5), where we unroll only from t to τ rather than all the way to initialization:

$$T(X_t, X'_t) = T(\tilde X_t, X'_t) \ge T(\tilde X_{t-1}, X'_{t-1}) \otimes G\Big(\frac{a_t}{\eta\sigma}\Big) \ge T(\tilde X_{t-2}, X'_{t-2}) \otimes G\Big(\frac{(a_t^2 + a_{t-1}^2)^{1/2}}{\eta\sigma}\Big) \ge \cdots \ge \underbrace{T(\tilde X_\tau, X'_\tau)}_{=\,\mathrm{Id}\ \text{since}\ \tilde X_\tau = X'_\tau} \otimes\, G\Bigg(\frac{1}{\eta\sigma}\Big(\sum_{k=\tau+1}^{t} a_k^2\Big)^{1/2}\Bigg). \qquad (5)$$

Intuitively, this argument replaces the divergent √t dependence of prior f-DP bounds with something scaling in t − τ. Here τ is an analysis parameter that we can optimize based on the following intuitive tradeoff: larger τ enables unrolling less, whereas smaller τ gives the auxiliary process {X̃_k} more time to interpolate between X′_τ and X_t, which leads to smaller penalties a_k for unrolling at each iteration.

Formalizing (5) leads to two interconnected questions:

Q1. How to construct the auxiliary process {X̃_k}?

Q2. How to unroll each iteration? I.e., what is the analog of Lemma 3.1?

3.2.1. Shifted interpolated process

For Q1, we initialize X̃_τ = X′_τ and define

$$\tilde X_{k+1} = \lambda_{k+1}\,\phi(X_k) + (1 - \lambda_{k+1})\,\phi'(\tilde X_k) + Z_{k+1} \qquad (6)$$

for k = τ, ..., t−1. Intuitively, this auxiliary process {X̃_k} uses a convex combination of the updates performed by the two processes, enabling it to gracefully interpolate from its initialization at one process to its termination at the other. Here λ_k controls the speed at which we shift from one process to the other. We set λ_t = 1 so that X̃_t = X_t achieves the desired interpolation; the other {λ_k} are analysis parameters that we optimize to get the best final bound. An important technical remark is that this auxiliary process uses the same noise increments {Z_k} as {X_k}; this coupling enables bounding the distance between X_k and X̃_k by a deterministic value (i.e., in the ∞-Wasserstein distance W_∞).

We remark that auxiliary interpolating processes have been used in the context of proving Harnack inequalities (or equivalently, Rényi reverse transport inequalities) for diffusions on manifolds (Arnaudon et al., 2006; Wang, 2013; 2014; Altschuler & Chewi, 2023b;c). Two key challenges posed by the present setting are that f-DP requires tradeoff functions (rather than Rényi divergences), and also tracking stochastic processes that undergo different dynamics (rather than the same diffusion). This requires constructing and analyzing the auxiliary process (6).
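To make the construction (6) concrete, the sketch below (ours; assumes NumPy; the 1-d contractive updates and the linear schedule λ_k are illustrative, not the optimized schedule) builds {X̃_k} alongside the two coupled processes and checks both endpoint properties X̃_τ = X′_τ and X̃_t = X_t:

```python
import numpy as np

rng = np.random.default_rng(3)
t, tau = 50, 20
eta, sigma = 0.1, 1.0
phi  = lambda x: 0.9 * x          # update for dataset S (c = 0.9 contraction)
phip = lambda x: 0.9 * x + 0.01   # update for adjacent dataset S' (sensitivity s = 0.01)

X = Xp = 0.0
Xtilde = None
for k in range(t):                              # this iteration maps index k to k + 1
    Z = eta * sigma * rng.normal()              # noise shared by {X_k} and {X~_k}
    X_next, Xp_next = phi(X) + Z, phip(Xp) + Z
    if k + 1 > tau:                             # recursion (6), using phi(X_k), phip(X~_k)
        lam = (k + 1 - tau) / (t - tau)         # illustrative schedule with lambda_t = 1
        Xtilde = lam * phi(X) + (1 - lam) * phip(Xtilde) + Z
    X, Xp = X_next, Xp_next
    if k + 1 == tau:
        Xtilde = Xp                             # endpoint 1: X~_tau = X'_tau
print("Xtilde_t - X_t =", Xtilde - X)           # endpoint 2: 0, since lambda_t = 1
```

Because {X̃_k} reuses the noise of {X_k}, the gap ‖X_k − X̃_k‖ evolves deterministically, which is exactly the quantity z_k tracked in the analysis below.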
3.2.2. Geometrically aware composition

For Q2, we develop the following lemma, which generalizes Lemma 3.1 by allowing for an auxiliary process X̃ and a shift parameter λ (Lemma 3.1 is recovered in the special case λ = 1 and X̃ = X). A key feature is that, unlike Lemma 3.1, this lemma is geometrically aware in that it exploits the Lipschitzness of the gradient descent updates φ, φ′; recall from Lemma 2.8 that φ, φ′ are (strongly) contractive whenever the losses are (strongly) convex. Intuitively, this contractivity ensures that long-ago gradient queries incur (exponentially) less privacy loss, thus making the total privacy loss convergent; cf. the discussion in §1.2.

Lemma 3.2. Suppose that ‖φ(x) − φ′(x)‖ ≤ s for all x and that φ, φ′ are c-Lipschitz. Then for any λ ≥ 0 and any random variable X̃ satisfying ‖X − X̃‖ ≤ z,

$$T\big(\lambda\phi(X) + (1-\lambda)\phi'(\tilde X) + N(0, \sigma^2 I_d),\; \phi'(X') + N(0, \sigma^2 I_d)\big) \ge T(\tilde X, X') \otimes G\Big(\frac{\lambda(cz+s)}{\sigma}\Big).$$

3.2.3. Convergent f-DP bounds

Combining our answers to Q1 (shifted interpolated process) and Q2 (geometrically aware composition) enables formalizing the argument (5). The remaining proof details are straightforward and deferred to §C.2. For clarity, we state this result as a meta-theorem where the shifts λ_k and intermediate time τ are parameters; our final bounds are obtained by optimizing them, see §4.

Theorem 3.3. Consider the stochastic processes {X_k}, {X′_k}, {X̃_k} defined in (2), (3), (6), with λ_t = 1. Suppose that ‖φ(x) − φ′(x)‖ ≤ s for all x and that φ, φ′ are c-Lipschitz. For any sequence {z_k} such that ‖X_k − X̃_k‖ ≤ z_k,

$$T(X_t, X'_t) \ge G\Bigg(\frac{1}{\eta\sigma}\Big(\sum_{k=\tau+1}^{t} a_k^2\Big)^{1/2}\Bigg)$$

where a_k = λ_k(c z_{k−1} + s).

We emphasize that although this technique-overview section focused on the simple case of full-batch gradients and strongly convex losses for clarity, these techniques readily extend to more general settings. Briefly: for constrained optimization, projections are handled by using the post-processing inequality for tradeoff functions; for (non-strongly) convex optimization, the optimal shifts a_k will be of similar size rather than geometrically increasing; for cyclic batches, the update functions φ_k, φ′_k and corresponding sensitivities s_k are time-varying; and for stochastic batches, the analog of Lemma 3.2 incorporates the celebrated privacy amplification by subsampling phenomenon. Details in §4.
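Theorem 3.3 reduces privacy accounting to a one-dimensional recursion over the schedule {λ_k}. A minimal evaluator (our sketch; assumes NumPy; the linear schedule is a placeholder for the optimized schedules of §C) is:

```python
import numpy as np

def meta_gdp(c, s, eta, sigma, t, tau, lambdas, z_tau=0.0):
    """GDP parameter from Theorem 3.3: mu = sqrt(sum a_k^2) / (eta * sigma), where
    a_k = lambda_k (c z_{k-1} + s) and z_k = (1 - lambda_k)(c z_{k-1} + s)."""
    z, total = z_tau, 0.0
    for lam in lambdas:        # lambdas = (lambda_{tau+1}, ..., lambda_t), last entry 1
        a = lam * (c * z + s)
        z = (1 - lam) * (c * z + s)
        total += a ** 2
    assert lambdas[-1] == 1 and abs(z) < 1e-12   # lambda_t = 1 forces X~_t = X_t
    return np.sqrt(total) / (eta * sigma)

c, s, eta, sigma, t, tau = 0.9, 0.01, 0.1, 1.0, 100, 0
lambdas = [(k - tau) / (t - tau) for k in range(tau + 1, t + 1)]  # illustrative schedule
print(meta_gdp(c, s, eta, sigma, t, tau, lambdas))
```

Optimizing over {λ_k} and τ is what yields the closed-form bounds of the next section (explicit optimal schedules are derived in Lemmas C.6 and C.8).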
Then for any η Shifted Interpolation for Differential Privacy (0, 2/M), Noisy GD is µ-GDP where 1 + ct 1 + c 1 c L nσ . For η (0, 2 M+m], this bound is optimal. Theorem 4.3. Consider convex, M-smooth loss functions with gradient sensitivity L and constraint set K of diameter D. Then for any η [0, 2/M] and t Dn ηL , Noisy GD is µ-GDP where Theorem 4.2 is exactly tight in all parameters, and improves over the composition-based analysis (Theorem 4.1) for all t > 1. Theorem 4.3 is tight up to a constant factor (see Theorem C.16), and for t > 4Dn ηL it dominates Theorem 4.1 since its convergent nature outweighs the slightly suboptimal constant. 4.2. Noisy cyclic gradient descent We now turn to cyclic batches. For simplicity, suppose that the number of batches per epoch l = n/b and the number of epochs E = t/l are integers. We state our results with respect to E rather than t. For comparison, we first state the standard (divergent) f-DP bound implied by the composition theorem (Dong et al., 2022). Theorem 4.4. Consider loss functions with gradient sensitivity L. Then Noisy CGD is µ-GDP where µ = L Below we show convergent f-DP bounds for Noisy CGD in the setting of strongly convex losses, and the setting of constrained convex losses. Theorem 4.5. Consider m-strongly convex, M-smooth loss functions with gradient sensitivity L. Then for any η (0, 2/M), Noisy CGD is µ-GDP where 1 + c2l 2 1 c2 (1 cl)2 1 cl(E 1) 1 + cl(E 1) L bσ . Theorem 4.6. Consider convex, M-smooth loss functions with gradient sensitivity L and constraint set K of diameter D. Then for any η [0, 2/M] and E Db ηL, Noisy CGD is µ-GDP where The convergent nature of these bounds ensures that they dominate Theorem 4.4 when Noisy CGD is run long enough. This threshold is roughly E c2l 2 1 c2 (1 cl)2 for Theorem 4.5 and E 4Db ηℓL for Theorem 4.6. 4.3. Noisy stochastic gradient descent Compared to Noisy GD, the privacy leakage in Noisy SGD only occurs when the index i is in the sampled batch. This phenomenon is known as privacy amplification by subsampling (Kasiviswanathan et al., 2011), which is formulated in f-DP as follows (Dong et al., 2022, Definition 6). Definition 4.7. For tradeoff function f and p [0, 1], define fp = pf + (1 p)Id. The subsampling operator Cp (with respect to f) is defined as Cp(f) = min{fp, (fp) 1} where 1 denotes the (left-continuous) inverse and denotes the convex conjugate. For comparison, we first recall the standard f-DP bound based on composition (Dong et al., 2022, Theorem 9). Theorem 4.8. Consider loss functions with gradient sensitivity L. Then Noisy SGD is f-DP where f = Cb/n(G( L This (divergent) bound is tight for t = 1 without further assumptions on the losses. Below we show convergent f DP bounds for Noisy SGD in the setting of strongly convex losses, and the setting of constrained convex losses. Theorem 4.9. Consider m-strongly convex, M-smooth loss functions with gradient sensitivity L. Then for any η (0, 2/M), Noisy SGD is f-DP for 2L bσ ct τ+1 ct 2L bσ )) Cb/n(G(2L bσ )) (t τ) for any τ = 0, 1, . . . , t 1. Theorem 4.10. Consider convex, M-smooth loss functions with gradient sensitivity L and constraint set K of diameter D. Then for any η [0, 2/M], Noisy SGD is f-DP where 2D ησ t τ ) Cb/n(G(2 2L bσ )) (t τ) for any τ = 0, 1, . . . , t 1. Both theorems give convergent privacy by taking t τ constant as t . In contrast, Theorem 4.8 is convergent in the regime t = O( n2 b2 ), but yields a vacuous privacy as t for fixed b n (Dong et al., 2022). 
We remark that for finite but large t, one can set t − τ to be sufficiently large and apply the CLT (Lemma A.5) to approximate the composition of the C_p(G(·)) factors; see Lemma A.11 and §C.4.3. We also remark that by choosing t − τ = Θ(Dn/(ηL)), Theorem 4.10 recovers the asymptotically tight Rényi DP bound of Altschuler & Talwar (2022).

4.4. Numerical example

As a proof of concept, here we consider regularized logistic regression on MNIST (LeCun et al., 2010). We compare our results with the state-of-the-art Rényi DP bounds, and with existing f-DP bounds (based on the composition theorem), which we denote as GDP Composition. For a fair comparison, we use the same algorithm, Noisy CGD, with all parameters unchanged, and only focus on the privacy accounting. Table 1 demonstrates that for this problem, our privacy guarantees are tighter, enabling longer training for the same privacy budget, which helps both training and testing accuracy (cf. Table 2). For full details of the experiment, see §D.2.

Table 1. Privacy ε of Noisy CGD on regularized logistic regression for δ = 10⁻⁵ in (ε, δ)-DP. Our results provide better privacy than both GDP Composition and RDP bounds in all cases.

| Epochs | GDP Composition | RDP  | Our Bounds |
|--------|-----------------|------|------------|
| 50     | 30.51           | 5.82 | 4.34       |
| 100    | 49.88           | 7.61 | 5.60       |
| 200    | 83.83           | 9.88 | 7.58       |

Table 2. Training and test accuracy (%) of Noisy CGD for regularized logistic regression, averaged over 10 runs. Both the training and test accuracy improve as the number of epochs increases.

| Epochs | Training     | Test         |
|--------|--------------|--------------|
| 50     | 89.36 ± 0.03 | 90.12 ± 0.04 |
| 100    | 90.24 ± 0.03 | 90.94 ± 0.07 |
| 200    | 90.85 ± 0.02 | 91.37 ± 0.08 |
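Conversions like those behind Table 1 can be reproduced from a GDP guarantee via Lemma A.3, which states that a µ-GDP mechanism is (ε, δ(ε))-DP with δ(ε) = Φ(−ε/µ + µ/2) − e^ε Φ(−ε/µ − µ/2). A small utility (our sketch; assumes SciPy; the bisection tolerance is arbitrary) inverts this to report ε at a target δ:

```python
import numpy as np
from scipy.stats import norm

def delta_of_eps(mu, eps):
    """Lemma A.3: a mu-GDP mechanism is (eps, delta(eps))-DP."""
    return norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)

def eps_of_delta(mu, delta, hi=200.0):
    """Smallest eps with delta(eps) <= delta, by bisection (delta_of_eps is decreasing)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if delta_of_eps(mu, mid) > delta else (lo, mid)
    return hi

print(eps_of_delta(mu=2.0, delta=1e-5))   # eps for a 2-GDP guarantee at delta = 1e-5
```

Because this conversion is exact for GDP, any improvement in µ translates directly into a smaller reported ε, which is the effect visible in Table 1.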
5. f-DP of the exponential mechanism

Since we show convergent f-DP bounds for randomized algorithms, we can take the limit t → ∞ to obtain f-DP bounds for their stationary distributions. We focus here on Noisy GD because, up to a simple rescaling, it is equivalent to Langevin Monte Carlo (LMC), one of the most well-studied sampling algorithms in the statistics literature; see, e.g., (Robert et al., 1999; Liu, 2001; Andrieu et al., 2003). Our results for (strongly) convex losses not only imply new results for (strongly) log-concave sampling via LMC, but also imply f-DP bounds for the exponential mechanism (McSherry & Talwar, 2007), a foundational concept in DP, since it is obtained from LMC's stationary distribution in the limit as the stepsize η → 0.

5.1. Strongly log-concave targets

Our optimal f-DP bounds for Noisy GD immediately imply optimal⁴ f-DP bounds for LMC.

⁴Although here we bound the optimal constants for simplicity.

Proposition 5.1. Suppose that F, F′ are m-strongly convex, M-smooth and F − F′ is L-Lipschitz. Consider the LMC iterations

$$X_{k+1} = X_k - \eta\nabla F(X_k) + Z_{k+1}, \qquad X'_{k+1} = X'_k - \eta\nabla F'(X'_k) + Z'_{k+1},$$

where X₀ = X′₀ and Z_{k+1}, Z′_{k+1} ∼ N(0, 2ηI_d). Then for any η ∈ (0, 2/(M+m)],

$$T(X_t, X'_t) \ge G\Bigg(L\sqrt{\frac{(1-c^t)(2-\eta m)}{2m(1+c^t)}}\Bigg), \qquad c = 1 - \eta m.$$

Proof. LMC is a special case of Noisy GD with n = 1, f₁ = F, f′₁ = F′, and σ = √(2/η). Apply Theorem 4.2.

Taking t → ∞ gives f-DP guarantees for the stationary distributions π^(η) and π′^(η) of these LMC chains. We also obtain f-DP guarantees between the exponential mechanisms π ∝ e^{−F} and π′ ∝ e^{−F′} for F and F′.

Corollary 5.2. In the setting of Proposition 5.1,

$$T(\pi^{(\eta)}, \pi'^{(\eta)}) \ge G\Bigg(L\sqrt{\frac{2-\eta m}{2m}}\Bigg) \qquad \text{and} \qquad T(\pi, \pi') \ge G\Big(\frac{L}{\sqrt{m}}\Big).$$

Proof. It is well-known that under these assumptions, LMC converges to its stationary distribution in TV as t → ∞, and the stationary distribution converges to the exponential mechanism as η → 0; see, e.g., (Chewi, 2023). By Lemma A.10, tradeoff functions converge under TV.

Thus, we recover the recent result (Gopi et al., 2022, Theorem 4) which characterizes the f-DP of the exponential mechanism. The proof in (Gopi et al., 2022) is entirely different, based on the Gaussian isoperimetric inequality (Ledoux, 1999) rather than connecting LMC to the exponential mechanism. Our results can be viewed as algorithmic generalizations of theirs, in the sense that we also obtain tight f-DP bounds on the iterates of LMC and its stationary distribution.

Remark 5.3 (Tightness). As noted in (Gopi et al., 2022), the exponential mechanism bound in Corollary 5.2 is tight by considering F(x) = (m/2)‖x‖² and F′(x) = (m/2)‖x − (L/m)v‖² (where v is a unit vector), which yields π = N(0, (1/m)I_d) and π′ = N((L/m)v, (1/m)I_d). With the same loss functions, it is straightforward to check that this construction also shows optimality for our results on the f-DP of LMC and its stationary distribution.

5.2. Log-concave targets

A similar story holds in the setting of convex losses, although this requires a constrained setting since otherwise stationary distributions may not exist. Hence we consider projected Noisy GD (Theorem 4.3), which corresponds to projected LMC. As above, this leads to f-DP bounds for the exponential mechanism due to known TV convergence results, for projected LMC to its stationary distribution as t → ∞ (Altschuler & Talwar, 2023), and from that distribution to the exponential mechanism as η → 0 (Bubeck et al., 2018).

Corollary 5.4. Let F, F′ be convex, M-smooth and L-Lipschitz functions, and let K be a convex body with diameter D containing a unit ball. Then for π ∝ e^{−F}1_K and π′ ∝ e^{−F′}1_K,

$$T(\pi, \pi') \ge G(2\sqrt{LD}).$$

Furthermore, for η ∈ (0, 2/M], the respective stationary distributions π^(η), π′^(η) of projected LMC satisfy T(π^(η), π′^(η)) ≥ G(√(4LD + 2ηL²)).

Unlike the strongly convex case (Minami et al., 2016; Gopi et al., 2022), we are unaware of any results in this setting beyond the standard analysis (McSherry & Talwar, 2007) of the exponential mechanism. That yields (2LD, 0)-DP, and our result provides a nontrivial improvement in privacy when LD > 0.677. See §D.4.
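Returning to Remark 5.3, the tightness construction there can be checked directly (our sketch; assumes NumPy/SciPy; constants are illustrative): for π = N(0, (1/m)I_d) and π′ = N((L/m)v, (1/m)I_d), the likelihood-ratio test thresholds the projection onto v, and the resulting tradeoff curve is exactly G(L/√m):

```python
import numpy as np
from scipy.stats import norm

m, L, d, n = 2.0, 1.5, 3, 400_000
mu = L / np.sqrt(m)                     # Corollary 5.2: T(pi, pi') = G(L / sqrt(m))
rng = np.random.default_rng(5)
v = np.zeros(d); v[0] = 1.0

# pi = N(0, I/m), pi' = N((L/m) v, I/m); the LR test thresholds the projection onto v.
alpha = 0.1
thresh = norm.ppf(1 - alpha) / np.sqrt(m)                # Type I error alpha under pi
Y = rng.normal(size=(n, d)) / np.sqrt(m) + (L / m) * v   # samples from pi'
beta_hat = np.mean(Y @ v <= thresh)                      # Type II error under pi'
print(beta_hat, "vs", norm.cdf(norm.ppf(1 - alpha) - mu))
```

The empirical Type II error matches G(L/√m)(α) up to Monte Carlo error, confirming that the Gaussian pair realizes the worst case.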
6. Discussion

The techniques and results of this paper suggest several directions for future work. One natural direction is whether convergent f-DP bounds can be shown in more general settings, e.g., (structured) non-convex landscapes, heteroscedastic or correlated noise (Choquette-Choo et al., 2023), adaptive first-order algorithms, or second-order algorithms (Ganesh et al., 2023). A technical question is whether one can relax the W_∞ bounds between our shifted interpolated process {X̃_k} and the target process {X_k}, and whether this can enable tighter analyses of stochastic algorithms. While W_∞ has traditionally been used for privacy amplification by iteration (Feldman et al., 2018), Altschuler & Chewi (2023a) recently showed that some of this analysis extends to the Orlicz–Wasserstein distance, which is even necessary in some applications. Another natural direction is more computationally tractable f-DP bounds. Although the f-DP framework provides an information-theoretically lossless quantification of DP, it is often computationally burdensome, e.g., for Noisy SGD, where bounds are expressed as the composition of many tradeoff functions. Recent work has developed useful tools for approximation (Zheng et al., 2020; Gopi et al., 2021; Zhu et al., 2022), and further developments would help practitioners who need to adhere to given privacy budgets.

Acknowledgements

We thank Sinho Chewi, Kunal Talwar, and Jiayuan Ye for insightful conversations about the literature, and the anonymous reviewers for helpful comments. We also thank Jiayuan Ye for sharing helpful code for numerical experiments.

Impact Statement

This paper establishes better privacy guarantees for learning from sensitive user data. Therefore we believe that the broader societal implications can only be positive. Indeed, we are proposing to use the same standard learning algorithms, but now with improved privacy guarantees.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Conference on Computer and Communications Security, pp. 308–318, 2016.

Altschuler, J. and Talwar, K. Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. In Advances in Neural Information Processing Systems, volume 35, pp. 3788–3800, 2022.

Altschuler, J. and Talwar, K. Resolving the mixing time of the Langevin algorithm to its stationary distribution for log-concave sampling. In Conference on Learning Theory, volume 195, pp. 2509–2510. PMLR, 2023.

Altschuler, J. M. and Chewi, S. Faster high-accuracy log-concave sampling via algorithmic warm starts. In Symposium on Foundations of Computer Science (FOCS), pp. 2169–2176, 2023a.

Altschuler, J. M. and Chewi, S. Shifted composition I: Harnack and reverse transport inequalities. arXiv preprint arXiv:2311.14520, 2023b.

Altschuler, J. M. and Chewi, S. Shifted composition II: Shift Harnack inequalities and curvature upper bounds. arXiv preprint arXiv:2401.00071, 2023c.

Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.

Arnaudon, M., Thalmaier, A., and Wang, F.-Y. Harnack inequality and heat kernel estimates on manifolds with curvature unbounded below. Bulletin des Sciences Mathématiques, 130(3):223–233, 2006.

Asoodeh, S., Liao, J., Calmon, F. P., Kosut, O., and Sankar, L. Three variants of differential privacy: Lossless conversion and applications. IEEE Journal on Selected Areas in Information Theory, 2(1):208–222, 2021.

Awan, J. and Dong, J. Log-concave and multivariate canonical noise distributions for differential privacy. In Advances in Neural Information Processing Systems, volume 35, pp. 34229–34240, 2022.

Balle, B. and Wang, Y.-X. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, volume 80, pp. 394–403. PMLR, 2018.

Balle, B., Barthe, G., Gaboardi, M., Hsu, J., and Sato, T. Hypothesis testing interpretations and Renyi differential privacy. In International Conference on Artificial Intelligence and Statistics, volume 108, pp. 2496–2506. PMLR, 2020.

Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Symposium on Foundations of Computer Science (FOCS), pp. 464–473, 2014.

Bassily, R., Feldman, V., Talwar, K., and Guha Thakurta, A. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32, 2019.

Bu, Z., Dong, J., Long, Q., and Su, W. Deep learning with Gaussian differential privacy. Harvard Data Science Review, 2(3), 2020.

Bubeck, S., Eldan, R., and Lehec, J. Sampling from a log-concave distribution with projected Langevin Monte Carlo.
Discrete & Computational Geometry, 59(4):757–783, 2018.

Bun, M. and Steinke, T. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In International Conference on Theory of Cryptography, pp. 635–658, 2016.

Chewi, S. Log-concave sampling. Forthcoming, 2023. Draft available at https://chewisinho.github.io/.

Choquette-Choo, C. A., Dvijotham, K., Pillutla, K., Ganesh, A., Steinke, T., and Thakurta, A. Correlated noise provably beats independent noise for differentially private learning. arXiv preprint arXiv:2310.06771, 2023.

Chourasia, R., Ye, J., and Shokri, R. Differential privacy dynamics of Langevin diffusion and noisy gradient descent. In Advances in Neural Information Processing Systems, volume 34, pp. 14771–14781, 2021.

Dong, J., Roth, A., and Su, W. J. Gaussian differential privacy. Journal of the Royal Statistical Society, Series B, Statistical Methodology, 84(1):3–54, 2022. With discussions and a reply by the authors.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, volume 3876 of Lecture Notes in Computer Science, pp. 265–284. Springer, Berlin, 2006.

Feldman, V., Mironov, I., Talwar, K., and Thakurta, A. Privacy amplification by iteration. In Symposium on Foundations of Computer Science (FOCS), pp. 521–532, 2018.

Ganesh, A., Haghifam, M., Steinke, T., and Thakurta, A. Faster differentially private convex optimization via second-order methods. arXiv preprint arXiv:2305.13209, 2023.

Gopi, S., Lee, Y. T., and Wutschitz, L. Numerical composition of differential privacy. In Advances in Neural Information Processing Systems, volume 34, pp. 11631–11642, 2021.

Gopi, S., Lee, Y. T., and Liu, D. Private convex optimization via exponential mechanism. In Conference on Learning Theory, volume 178, pp. 1948–1989. PMLR, 2022.

Kairouz, P., Oh, S., and Viswanath, P. The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037–4049, 2017.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.

LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Ledoux, M. Concentration of measure and logarithmic Sobolev inequalities. In Azéma, J., Émery, M., Ledoux, M., and Yor, M. (eds.), Séminaire de Probabilités XXXIII, pp. 120–216, Berlin, Heidelberg, 1999. Springer Berlin Heidelberg.

Liu, J. S. Monte Carlo strategies in scientific computing, volume 75. Springer, 2001.

McSherry, F. and Talwar, K. Mechanism design via differential privacy. In Foundations of Computer Science (FOCS), pp. 94–103, 2007.

Minami, K., Arai, H., Sato, I., and Nakagawa, H. Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, volume 29, 2016.

Mironov, I. Rényi differential privacy. In Computer Security Foundations Symposium, pp. 263–275, 2017.

Robert, C. P., Casella, G., and Casella, G. Monte Carlo statistical methods, volume 2. Springer, 1999.

Wang, C., Su, B., Ye, J., Shokri, R., and Su, W. Unified enhancement of privacy bounds for mixture mechanisms via f-differential privacy.
In Advances in Neural Information Processing Systems, volume 36, pp. 55051–55063, 2023.

Wang, F.-Y. Harnack inequalities for stochastic partial differential equations, volume 1332. Springer, 2013.

Wang, F.-Y. Analysis for diffusion processes on Riemannian manifolds, volume 18. World Scientific, 2014.

Wasserman, L. and Zhou, S. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, 2010.

Ye, J. and Shokri, R. Differentially private learning needs hidden state (or much faster convergence). In Advances in Neural Information Processing Systems, volume 35, pp. 703–715, 2022.

Zheng, Q., Dong, J., Long, Q., and Su, W. Sharp composition bounds for Gaussian differential privacy via Edgeworth expansion. In International Conference on Machine Learning, volume 119, pp. 11420–11435. PMLR, 2020.

Zhu, Y., Dong, J., and Wang, Y.-X. Optimal accounting of differential privacy via characteristic function. In International Conference on Artificial Intelligence and Statistics, volume 151, pp. 4782–4817. PMLR, 2022.

A. Rényi DP and tradeoff functions

Here we provide helper lemmas and other relevant background about Rényi DP (§A.1), tradeoff functions (§A.2), and their convergence properties (§A.3).

A.1. Rényi DP

A popular notion of DP that is often analytically tractable is Rényi DP (RDP) (Mironov, 2017).

Definition A.1 (RDP). The Rényi divergence of order α > 1 between probability distributions P, Q is defined as

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1}\log \int \Big(\frac{dP}{dQ}(\omega)\Big)^{\alpha}\, dQ(\omega).$$

A randomized algorithm A is (α, ε)-RDP if for any adjacent datasets S and S′, D_α(A(S)‖A(S′)) ≤ ε.

Numerical conversion. Conversion from RDP to (ε, δ)-DP is inherently lossy, and there are many proposed formulae for this. Since the RDP bounds mentioned in this text are of the form (α, ρα)-RDP for all α > 1, given ρ and a fixed level of δ, the corresponding converted value of ε = ε(α, ρ, δ) can be found by optimizing over α. Also, in addition to RDP, results on zero-concentrated DP (Bun & Steinke, 2016) can be applied. Throughout, we calculate the minimum ε (i.e., the best bound) among the following formulae: (Bun & Steinke, 2016, Lemma 3.5), (Mironov, 2017, Proposition 3), (Balle et al., 2020, Theorem 20), and (Asoodeh et al., 2021, Lemma 1).

A.2. Lemmas on tradeoff functions

Here we recall various useful facts about tradeoff functions. The first lemma records basic properties of tradeoff functions that we use repeatedly (Dong et al., 2022, Proposition D.1).

Lemma A.2 (Basic properties). For tradeoff functions f, g₁, g₂ and µ = (µ₁, ..., µ_d) ∈ Rᵈ:

(a) g₁ ≥ g₂ ⟹ f ⊗ g₁ ≥ f ⊗ g₂.

(b) f ⊗ Id = Id ⊗ f = f.

(c) T(N(0, σ²I_d), N(µ, σ²I_d)) = G(|µ₁|/σ) ⊗ ⋯ ⊗ G(|µ_d|/σ) = G(‖µ‖/σ).

Next, we recall tight conversion formulae from GDP to other standard notions of DP, namely (ε, δ)-DP (Balle & Wang, 2018, Theorem 8) and RDP (Dong et al., 2022, Corollary B.6).

Lemma A.3 (GDP to (ε, δ)-DP). A µ-GDP mechanism is (ε, δ(ε))-DP for all ε > 0, where

$$\delta(\varepsilon) = \Phi\Big(-\frac{\varepsilon}{\mu} + \frac{\mu}{2}\Big) - e^{\varepsilon}\,\Phi\Big(-\frac{\varepsilon}{\mu} - \frac{\mu}{2}\Big).$$

Lemma A.4 (GDP to RDP). A µ-GDP mechanism is (α, ½µ²α)-RDP for any α > 1.

An appealing property of tradeoff functions is that they admit a central limit theorem (CLT) that approximates multiple compositions by GDP. In particular, subsampled GDP can be approximated as follows (Dong et al., 2022, Corollary 4).

Lemma A.5 (CLT). Let µ ≥ 0 and assume that p√t → p₀ as t → ∞. Then

$$C_p(G(\mu))^{\otimes t} \to G\Big(p_0\sqrt{e^{\mu^2}\,\Phi(1.5\mu) + 3\,\Phi(-0.5\mu) - 2}\Big).$$
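A direct transcription of Lemma A.5's limit (our sketch; assumes SciPy; the example parameters are illustrative) gives a quick GDP approximation for many subsampled compositions, e.g., the C_{b/n}(G(·))^{⊗(t−τ)} factors in Theorems 4.9 and 4.10:

```python
import numpy as np
from scipy.stats import norm

def clt_gdp(mu, p, t):
    """Approximate C_p(G(mu))^{(x) t} by G(mu_clt), per Lemma A.5 with p0 = p * sqrt(t)."""
    p0 = p * np.sqrt(t)
    return p0 * np.sqrt(np.exp(mu**2) * norm.cdf(1.5 * mu) + 3 * norm.cdf(-0.5 * mu) - 2)

# e.g. batch size b = 100, n = 10,000 data points, t - tau = 5,000 subsampled steps
print(clt_gdp(mu=0.5, p=100 / 10_000, t=5_000))
```

Note the expression under the square root vanishes at µ = 0, so the approximation degrades gracefully to perfect privacy.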
A.3. Convergence of tradeoff functions

Here we present results about the convergence of distributions as measured by tradeoff functions. The main results are Lemma A.6 and Lemma A.10, which state that this is equivalent to convergence in TV distance; we also present intermediate results which may be of independent interest. For notation, we use P_n, P, Q_n, Q to denote probability distributions, and α, α′ to respectively denote elements of [0, 1] and (0, 1]. Also, we use a ∨ b and a ∧ b to respectively denote max{a, b} and min{a, b}.

Lemma A.6. The following are equivalent. (a) T(P_n, P) → Id. (b) T(P, P_n) → Id. (c) TV(P_n, P) → 0.

Proof. On one hand, if TV(P, P_n) → 0, then T(P, P_n) → Id since

$$1 - \mathrm{TV}(P, P_n) \le \alpha + T(P, P_n)(\alpha) \le 1.$$

On the other hand, if TV(P, P_n) ̸→ 0, then by taking a subsequence {n′} such that TV(P, P_{n′}) ≥ ε > 0, we know that the first inequality holds with equality for some α = α_{n′}, and thus T(P, P_{n′})(α_{n′}) ≤ 1 − ε − α_{n′}. By taking a further subsequence {n″} of {n′} such that α_{n″} → ᾱ for some ᾱ (note that α_{n″} ≤ 1 − ε for all n″ and thus ᾱ ≤ 1 − ε), there exists N ∈ ℕ such that n″ > N ⟹ α_{n″} < ᾱ + ε/2, from which we have

$$T(P, P_{n''})(\bar\alpha + \varepsilon/2) \le T(P, P_{n''})(\alpha_{n''}) \le 1 - \varepsilon - \alpha_{n''}$$

for all n″ > N. Thus

$$\liminf_{n''\to\infty} T(P, P_{n''})(\bar\alpha + \varepsilon/2) \le 1 - \varepsilon - \bar\alpha < 1 - (\bar\alpha + \varepsilon/2),$$

implying that T(P, P_n) does not converge to Id.

Lemma A.7. If T(P_n, P) → Id, then for any probability distribution Q,

$$\lim_{n\to\infty} T(P_n, Q)(\alpha') = T(P, Q)(\alpha') \qquad \text{for every } \alpha' \in (0, 1].$$

In particular, if T(P, Q)(0) = 1 then lim_n T(P_n, Q) = T(P, Q).

Proof. From (Dong et al., 2022, Lemma A.5) we have

$$T(P, Q)(\alpha') \ge T(P_n, Q)\big(1 - T(P, P_n)(\alpha')\big)$$
$$T(P_n, Q)(\alpha) \ge T(P, Q)\big(1 - T(P_n, P)(\alpha)\big).$$

By taking liminf_n in the second line, we have liminf_n T(P_n, Q)(α) ≥ T(P, Q)(α). On the other hand, for any α′ ∈ (0, 1] and sufficiently small ε > 0, we have 1 − T(P, P_n)(α′) ≤ (α′ + ε) ∧ 1 for all sufficiently large n, from which the first line yields

$$T(P, Q)(\alpha') \ge T(P_n, Q)\big((\alpha' + \varepsilon) \wedge 1\big).$$

Taking limsup_n (it is straightforward to check that a limit supremum of tradeoff functions is continuous on (0, 1)) and letting ε → 0, we have T(P, Q)(α′) ≥ limsup_n T(P_n, Q)(α′).

Remark A.8 (Necessity of the restriction on α′). The restriction α′ ∈ (0, 1] is necessary. For example, if P = δ₀, Q = ½δ₀ + ½δ₁, and P_n = (1 − 1/n)δ₀ + (1/n)δ₁ (here, δ_x denotes the Dirac measure at x, and pP + (1−p)Q denotes the mixture of (P, Q) with mixing rates (p, 1−p)), then TV(P_n, P) → 0 implies T(P_n, P) → Id and T(P, Q)(α) = ½(1 − α), yet

$$\lim_{n\to\infty} T(P_n, Q)(\alpha) = \begin{cases} 1 & \alpha = 0 \\ \tfrac{1}{2}(1-\alpha) & \alpha > 0. \end{cases}$$

However, if the limit is switched to the second argument, then this restriction on α′ is unnecessary, as proven in the following lemma.

Lemma A.9. If T(Q_n, Q) → Id, then for any probability distribution P,

$$\lim_{n\to\infty} T(P, Q_n) = T(P, Q).$$

Proof. Again, from (Dong et al., 2022, Lemma A.5) we have

$$T(P, Q_n)(\alpha) \ge T(Q, Q_n)\big(1 - T(P, Q)(\alpha)\big)$$
$$T(P, Q)(\alpha) \ge T(Q_n, Q)\big(1 - T(P, Q_n)(\alpha)\big).$$

Taking liminf_n in the first line, we have liminf_n T(P, Q_n)(α) ≥ T(P, Q)(α). On the other hand, we know that the limit T(Q_n, Q) → Id is uniform over [0, 1] (see, for example, (Dong et al., 2022, Lemma A.7)), and thus for any ε > 0 we have T(Q_n, Q)(α) ≥ 1 − α − ε for all α ∈ [0, 1] when n is sufficiently large, from which the second line yields

$$T(P, Q)(\alpha) \ge T(P, Q_n)(\alpha) - \varepsilon.$$

Taking limsup_n and letting ε → 0, we have T(P, Q)(α) ≥ limsup_n T(P, Q_n)(α).
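As a quick sanity check of this TV/tradeoff equivalence (our sketch; assumes SciPy; the Gaussian family is illustrative): for P_n = N(1/n, 1), P = N(0, 1), and Q = N(µ, 1), both TV(P_n, P) and the sup-gap between T(P_n, Q) = G(µ − 1/n) and T(P, Q) = G(µ) vanish as n grows:

```python
import numpy as np
from scipy.stats import norm

mu = 1.0
alpha = np.linspace(0, 1, 2001)
G = lambda m: norm.cdf(norm.ppf(1 - alpha) - m)   # Gaussian tradeoff curve G(m)

for n in [1, 10, 100, 1000]:
    tv = 2 * norm.cdf(1 / (2 * n)) - 1            # TV(N(0,1), N(1/n,1))
    gap = np.max(np.abs(G(mu - 1 / n) - G(mu)))   # sup |T(P_n, Q) - T(P, Q)|
    print(n, tv, gap)
```

Both columns decay together, in line with Lemma A.6 and Lemma A.10.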
Lemma A.10. If TV(P_n, P) → 0 and TV(Q_n, Q) → 0, then lim_n T(P_n, Q_n)(α′) = T(P, Q)(α′) for every α′ ∈ (0, 1].⁵ In particular, if T(P, Q)(0) = 1 then lim_n T(P_n, Q_n) = T(P, Q).

Proof. From T(P_n, Q_n)(α) ≥ T(P, Q_n)(1 − T(P_n, P)(α)) and T(P, Q_n) → T(P, Q) uniformly over [0, 1] (by Lemma A.6 and Lemma A.9), taking liminf_n we have liminf_n T(P_n, Q_n)(α) ≥ T(P, Q)(α). From T(P, Q_n)(α) ≥ T(P_n, Q_n)(1 − T(P, P_n)(α)), for any α′ ∈ (0, 1] and sufficiently small ε > 0, for all sufficiently large n we have 1 − T(P, P_n)(α′) ≤ (α′ + ε) ∧ 1 and thus

$$T(P, Q_n)(\alpha') \ge T(P_n, Q_n)\big((\alpha' + \varepsilon) \wedge 1\big).$$

Taking limsup_n and letting ε → 0, we have T(P, Q)(α′) ≥ limsup_n T(P_n, Q_n)(α′).

The final lemma shows how composition and limits of tradeoff functions can be combined. This is useful when, for example, we have a lower bound of the form G(µ) ⊗ g_t and g_t converges to G(ν) as t → ∞ (e.g., by the CLT), which the lemma allows us to approximate as G(µ) ⊗ g_t ⪆ G(√(µ² + ν²)).

Lemma A.11. Let f, g, g_n be tradeoff functions such that g(α) > 0 for all α < 1,⁶ and g_n → g. Then

$$\liminf_{n\to\infty}\,(f \otimes g_n) \ge f \otimes g.$$

Proof. Fix 0 < δ < 1, and let h_δ be the tradeoff function defined as

$$h_\delta(\alpha) = \begin{cases} 1 - \delta - \alpha & \alpha \le 1 - \delta \\ 0 & \alpha > 1 - \delta. \end{cases}$$

Then it is known, see (Dong et al., 2022, Equation 12), that for any tradeoff function f,

$$(f \otimes h_\delta)(\alpha) = \begin{cases} (1-\delta)\, f\big(\frac{\alpha}{1-\delta}\big) & \alpha \le 1 - \delta \\ 0 & \alpha > 1 - \delta. \end{cases}$$

Now we approximate g by g ⊗ h_δ. Defining r(δ) = min_{0≤α≤1−δ} |g(α) − (g ⊗ h_δ)(α)| (the minimum exists as the function is continuous and [0, 1−δ] is compact), we have r(δ) > 0 because for any α ∈ [0, 1−δ],

$$g(\alpha) - (g \otimes h_\delta)(\alpha) = g(\alpha) - g\Big(\frac{\alpha}{1-\delta}\Big) + \delta\, g\Big(\frac{\alpha}{1-\delta}\Big) \ge \delta\, g\Big(\frac{\alpha}{1-\delta}\Big) \ge 0,$$

where the first inequality is from g decreasing; if this value were 0 then we would need α = 1 − δ from the second inequality, but then g(α) − g(α/(1−δ)) = g(1−δ) − g(1) > 0, a contradiction. Since the limit g_n → g is uniform, for all sufficiently large n we have g_n ≥ (g − r(δ)) ∨ 0 ≥ g ⊗ h_δ, implying

$$\liminf_{n\to\infty}\,(f \otimes g_n) \ge f \otimes (g \otimes h_\delta) = h_\delta \otimes (f \otimes g).$$

Then from lim_{δ→0} h_δ ⊗ (f ⊗ g) = f ⊗ g, we obtain the result.

⁵In (Awan & Dong, 2022), this result is stated without the restriction on α′ ∈ (0, 1]. However, this restriction is needed, as evidenced by the counterexample in Remark A.8.

⁶This condition is technical and is not necessary; the same proof applies by defining r(δ) as the minimum over α ∈ [0, z(1−δ)] where z = inf{α : g(α) = 0}.

B. Disentangling the shift in shifted divergences

As mentioned in §1.2, a key motivation behind the construction of our shifted interpolated process (6) is that it demystifies the popular privacy amplification by iteration analysis for Rényi DP (Feldman et al., 2018), which has been used in many contexts, and in particular was recently shown to give convergent Rényi DP bounds for Noisy GD and variants (Altschuler & Talwar, 2022). Here we explain this connection. Briefly, privacy amplification by iteration arguments for Rényi DP use as a Lyapunov function the shifted Rényi divergence

$$D_\alpha^{(z)}(P\|Q) = \inf_{P' :\, W_\infty(P, P') \le z} D_\alpha(P'\|Q),$$

which combines the Rényi divergence D_α and the ∞-Wasserstein distance W_∞. Feldman et al. (2018) bound the Rényi DP via an argument of the form

$$D_\alpha(X_t\|X'_t) = D_\alpha^{(z_t)}(X_t\|X'_t) \le D_\alpha^{(z_{t-1})}(X_{t-1}\|X'_{t-1}) + O(a_t^2) \le D_\alpha^{(z_{t-2})}(X_{t-2}\|X'_{t-2}) + O(a_t^2 + a_{t-1}^2) \qquad (7)$$
$$\le \cdots \le \underbrace{D_\alpha^{(z_0)}(X_0\|X'_0)}_{=\,0\ \text{since}\ X_0 = X'_0} + O\Big(\sum_{k=1}^{t} a_k^2\Big) \qquad (8)$$

where z_t = 0 and z_{k+1} = c z_k + s − a_{k+1}. Altschuler & Talwar (2022) obtained convergent Rényi DP bounds by essentially unrolling this argument only to an intermediate time τ, and then arguing that the shifted Rényi divergence D_α^{(z_τ)}(X_τ‖X′_τ) = 0 if the shift z_τ is made sufficiently large. Several open questions remained: (1) Can this argument be performed without using shifted divergences, which is an admittedly ad hoc combination of Rényi divergences and Wasserstein distances? (2) Can this argument be extended beyond divergence-based relaxations of DP, namely to f-DP? Our paper answers both questions.
For (1), our argument makes explicit the surrogates implicit in the shifted divergences

$$D_\alpha^{(z_k)}(X_k\|X'_k) = \inf_{\tilde X_k :\, W_\infty(\tilde X_k, X_k) \le z_k} D_\alpha(\tilde X_k\|X'_k)$$

in each intermediate iteration of the argument. Indeed, it can be shown that our shifted interpolated process {X̃_k}, defined in (6), gives such a random variable that achieves the value required by this shifted-divergence argument. This enables re-writing the argument (8) without any notion of shifted divergences, in terms of the auxiliary process {X̃_k}, as we did for f-DP in §3.2. This completely disentangles the Rényi divergence and the Wasserstein distance in the shifted-divergence argument.

For (2), the disentangling we achieve in (1) appears essential. The naïve approach of directly extending the shifted-divergence argument to shifted tradeoffs

$$T^{(z)}(P, Q) = \sup_{P' :\, W_\infty(P, P') \le z} T(P', Q)$$

runs into several subtle technical issues. For example, the argument appears to require the existence of an optimal shift P′. For the shifted Rényi argument, it suffices to find a nearly-optimal shift attaining D_α^{(z)}(P‖Q) = inf_{P′ : W_∞(P, P′) ≤ z} D_α(P′‖Q) up to any desired accuracy, and moreover to have the shift be nearly-optimal for a given Rényi parameter α but perhaps not uniformly so over all α. Due to the more involved calculus of tradeoff functions, these issues become subtle but important problems, and have led others to state the problem of privacy amplification by iteration in f-DP as open, e.g., (Wang et al., 2023). Although the general problem of finding an optimal shift for general tradeoff functions remains open, the answer to (1), our shifted interpolated process, explicitly constructs an optimal shift for the tradeoff functions specifically needed to analyze two contractive noisy iterations.

C. Deferred details for §4

In this section we provide details for the proofs in §4. See §3 for a high-level overview of the analysis approach. We formalize the technique of shifted interpolated processes in a general context in §C.1, then prove the results of §4.1, §4.2, §4.3 in §C.2, §C.3, §C.4, respectively.

C.1. Shifted interpolation for contractive noisy iterations

We begin by providing definitions that unify the presentation of the different settings. The first definition abstracts the fundamental reason why noisy gradient descent and all its variants enjoy the phenomenon of privacy amplification by iteration for convex optimization, and it is why the results stated in this section are for contractive noisy iterations (CNI). This is based on the observation that the variants of noisy gradient descent update by alternately applying contraction maps and noise convolutions (Feldman et al., 2018, Definition 19).

Definition C.1 (CNI). The CNI corresponding to a sequence of contractive functions {φ_k}_{k∈[t]}, a sequence of noise distributions {ξ_k}_{k∈[t]}, and a closed and convex set K, is the stochastic process

$$X_{k+1} = \Pi_K\big(\phi_{k+1}(X_k) + Z_{k+1}\big) \qquad (9)$$

where Z_{k+1} ∼ ξ_{k+1} is independent of (X₀, ..., X_k). Although CNI(X₀, {φ_k}_{k∈[t]}, {ξ_k}_{k∈[t]}, K) usually refers to the distribution of the final iterate X_t, we occasionally abuse notation by using this to refer to the entire sequence of iterates {X_k}.

The second definition abstracts the idea of shifted interpolated processes at the level of generality of CNI. See §3.2 for an informal overview.

Definition C.2 (Shifted interpolated process). Consider processes {X_k} and {X′_k} corresponding respectively to CNI(X₀, {φ_k}_{k∈[t]}, {ξ_k}_{k∈[t]}, K) and CNI(X′₀, {φ′_k}_{k∈[t]}, {ξ_k}_{k∈[t]}, K). The shifted interpolated process between these two CNI is the auxiliary process {X̃_k} satisfying X̃_τ = X′_τ and

$$\tilde X_{k+1} = \Pi_K\big(\lambda_{k+1}\phi_{k+1}(X_k) + (1-\lambda_{k+1})\phi'_{k+1}(\tilde X_k) + Z_{k+1}\big) \qquad (10)$$

for all k = τ, ..., t−1. Here, the noise Z_k ∼ ξ_k is coupled between the processes {X_k} and {X̃_k}. The parameters τ ∈ {0, ..., t} and λ_k ∈ [0, 1] can be chosen arbitrarily, with the one restriction that λ_t = 1 so that X̃_t = X_t.

The upshot of shifted interpolation is the following meta-theorem. See §3.2 for a high-level overview of this result, its proof, and its uses. Here, we state this meta-theorem in the more general framework of CNI.

Theorem C.3 (Meta-theorem for shifted interpolation). Let X_t and X′_t respectively be the outputs of CNI(X₀, {φ_k}_{k∈[t]}, {N(0, σ²I_d)}_{k∈[t]}, K) and CNI(X₀, {φ′_k}_{k∈[t]}, {N(0, σ²I_d)}_{k∈[t]}, K) such that each φ_k, φ′_k is c-Lipschitz and ‖φ_k(x) − φ′_k(x)‖ ≤ s_k for all x and k ∈ [t]. Then for any intermediate time τ and shift parameters λ_{τ+1}, ..., λ_t ∈ [0, 1] with λ_t = 1,

$$T(X_t, X'_t) \ge G\Bigg(\frac{1}{\sigma}\Big(\sum_{k=\tau+1}^{t} a_k^2\Big)^{1/2}\Bigg)$$

where a_{k+1} = λ_{k+1}(c z_k + s_{k+1}), z_{k+1} = (1 − λ_{k+1})(c z_k + s_{k+1}), and ‖X_τ − X′_τ‖ ≤ z_τ.
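A literal transcription of Definitions C.1 and C.2 (our sketch; assumes NumPy; the contraction, ball constraint, and linear schedule are illustrative) makes the coupling in Theorem C.3 concrete:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    """Projection onto the Euclidean ball of given radius (a contraction, Lemma 2.9)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def cni(x0, phis, sigma, rng):
    """Definition C.1: X_{k+1} = Pi_K(phi_{k+1}(X_k) + Z_{k+1}); returns iterates, noises."""
    xs, zs = [x0], []
    for phi in phis:
        z = sigma * rng.normal(size=np.shape(x0))
        xs.append(proj_ball(phi(xs[-1]) + z))
        zs.append(z)
    return xs, zs

rng = np.random.default_rng(7)
t, tau, sigma, d = 30, 10, 0.3, 2
phis  = [lambda x: 0.9 * x] * t                          # c = 0.9 contractions
phips = [lambda x: 0.9 * x + np.array([0.05, 0.0])] * t  # differ by s_k = 0.05

xs, zs = cni(np.zeros(d), phis, sigma, rng)
# Run of the adjacent CNI reusing the same noises (illustrative; Definition C.2 only
# requires the coupling between {X_k} and {X~_k}).
xps = [np.zeros(d)]
for phip, z in zip(phips, zs):
    xps.append(proj_ball(phip(xps[-1]) + z))

# Shifted interpolated process (10) with an illustrative linear schedule, lambda_t = 1.
xt = xps[tau]
for k in range(tau, t):
    lam = (k + 1 - tau) / (t - tau)
    xt = proj_ball(lam * phis[k](xs[k]) + (1 - lam) * phips[k](xt) + zs[k])
print("||Xtilde_t - X_t|| =", np.linalg.norm(xt - xs[t]))   # 0, since lambda_t = 1
```

Note that projections commute with the argument via post-processing (Lemma 2.5), which is exactly how the constrained settings are handled in the proof below.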
The shifted interpolated process between these two CNI is the auxiliary process { e Xk} satisfying e Xτ = X τ and e Xk+1 = ΠK λk+1ϕk+1(Xk) + (1 λk+1)ϕ k+1( e Xk) + Zk+1 (10) for all k = τ, . . . , t 1. Here, the noise Zk ξk is coupled between the processes {Xk} and { e Xk}. The parameters τ {0, . . . , t}, and λk [0, 1] can be chosen arbitrarily, with the one restriction that λt = 1 so that e Xt = Xt. The upshot of shifted interpolation is the following meta-theorem. See 3.2 for a high-level overview of this result, its proof, and its uses. Here, we state this meta-theorem in the more general framework of CNI. Theorem C.3 (Meta-theorem for shifted interpolation). Let Xt and X t respectively be the output of CNI(X0, {ϕk}k [t], {N(0, σ2Id)}k [t], K) and CNI(X0, {ϕ k}k [t], {N(0, σ2Id)}k [t], K) such that each ϕk, ϕ k is c Lipschitz and ϕk(x) ϕ k(x) sk for all x and k [t]. Then for any intermediate time τ and shift parameters λτ+1, . . . , λt [0, 1] with λt = 1, T(Xt, X t) G where ak+1 = λk+1(czk + sk+1), zk+1 = (1 λk+1)(czk + sk+1), and Xτ X τ zτ. To prove Theorem C.3, we first prove two helper lemmas. The first lemma characterizes the worst-case tradeoff function between a Gaussian and its convolution with a bounded random variable. The lemma is tight, with equality achieved when the random variable is a constant. Shifted Interpolation for Differential Privacy Lemma C.4. For s 0, let R(s, σ) = inf{T(W + Z, Z) : W s, Z N(0, σ2Id), W, Z are independent}, where the infimum is taken pointwise.7 Then R(s, σ) = G( s Proof. For any random variable W with W s, the post-processing inequality (Lemma 2.5) implies T(W + Z, Z) T((W, Z), (W, W + Z)) . Letting K1(y) = Z and K 1(y) = y + Z, we have T(K1(y), K 1(y)) = G( y σ) for any fixed y with y s and thus by strong composition (Lemma 2.7), T((W, Z), (W, W + Z)) T(W, W) G( s The bound is tight since equality holds with W = sv for any fixed unit vector v. The second lemma, Lemma 3.2, is the one-step version of the desired result Theorem C.3. It uses the first lemma in its proof. Proof of Lemma 3.2. For shorthand, let Z, Z N(0, σ2Id) be independent. Then T(λϕ(X) + (1 λ)ϕ ( e X) + Z, ϕ (X ) + Z ) T(( e X, λ(ϕ(X) ϕ ( e X)) + Z), (X , Z )) T( e X, X ) R(λ(cz + s), σ) = T( e X, X ) G(λ(cz + s) Above, the first step is by the post-processing inequality (Lemma 2.5) for the post-processing function (x, y) 7 ϕ (x) + y. The second step is by strong composition (Lemma 2.7), which we can apply since λ( ϕ(X) ϕ ( e X) ) λ( ϕ(X) ϕ( e X) + ϕ( e X) ϕ ( e X) ) λ(cz + s). The final step is by Lemma C.4. Proof of Theorem C.3. Let { e Xk} be as in (10). By induction, Xk e Xk zk for all k = τ, . . . , t from Xk+1 e Xk+1 (1 λk+1)( ϕk+1(Xk) ϕ k+1(Xk) + ϕ k+1(Xk) ϕ k+1( e Xk) ) (1 λk+1)(sk+1 + czk) , where the first line holds from Lemma 2.9. Letting Zk+1, Z k+1 N(0, σ2Id) be independent noises, T( e Xk+1, X k+1) T(λk+1ϕk+1(Xk) + (1 λk+1)ϕ k+1( e Xk) + Zk+1, ϕ k+1(X k) + Z k+1) T( e Xk, X k) G(ak+1 where the first inequality is by the post-processing inequality (Lemma 2.5) with respect to ΠK, and the second inequality is by Lemma 3.2. Repeating this for k = t 1, . . . , τ, and using the fact that the shifted interpolated process satisfy e Xt = Xt (from λt = 1) and e Xτ = X τ, we conclude the desired bound T(Xt, X t) = T( e Xt, X t) T( e Xτ, X τ) G 7The infimum of tradeoff functions is in general not a tradeoff function; however, we prove a lower bound that is in fact a tradeoff function. 
That is, we show that T(W + Z, Z) G( s σ ) for all W, Z satisfying the conditions in the definition of R(s, σ). An analogous discussion also applies to Lemma C.12. Shifted Interpolation for Differential Privacy C.2. Deferred proofs for 4.1 C.2.1. PROOF OF THEOREM 4.2 First, we consider the following setting where the contractive factor is strictly less than 1, which corresponds to the strongly convex setting for Noisy GD. Theorem C.5. In the setting of Theorem C.3, additionally assume that 0 < c < 1 and sk s. Then T(Xt, X t) G 1 + ct 1 + c 1 c s σ with equality holding if X0 = X 0 = 0, ϕk(x) = cx, ϕ k(x) = cx + sv for any unit vector v and K = Rd. Proof. In Theorem C.3, we can take τ = 0 and zτ = 0. Then the values of {λk}, {zk}, {ak} obtained from the elementary optimization problem (Lemma C.6) yield the desired result. Finally, for the equality case, by direct calculation we have Xt N(0, 1 c2t 1 c2 σ2Id) and X t N( 1 ct 1 c sv, 1 c2t 1 c2 σ2Id), giving T(Xt, X t) = G( 1 + ct 1 + c 1 c s σ ) . Lemma C.6. Given s > 0 and 0 < c < 1, the optimal value of subject to zk+1 = (1 λk+1)(czk + s), ak+1 = λk+1(czk + s) zk, ak 0, z0 = zt = 0 Proof. Since zk+1 = (1 λk+1)(czk +s) and ak+1 = λk+1(czk +s), we have zk+1 +ak+1 = s+czk for k = 0, . . . , t 1, from which we obtain zt = ctz0 + (1 + c + + ct 1)s (at + cat 1 + + ct 1a1) . From z0 = zt = 0, we have at + cat 1 + + ct 1a1 = 1 ct By the Cauchy-Schwarz inequality, k=1 a2 k (at + cat 1 + + ct 1a1)2 Pt k=1 c2(t k) = 1 ct 1 + ct 1 + c 1 cs2 where equality holds if the corresponding equality criterion of the Cauchy-Schwarz inequality is satisfied. The explicit formulae are zk = (1 ck)(1 ct k) (1+ct)(1 c) s, ak = ct k(1+c) 1+ct s, and λk = ct k(1 c2) 1 ct k+2 ck+ct . Proof of Theorem 4.2. This follows as a direct corollary of Theorem C.5 with ϕk(x) ϕ(x) = x η n Pn i=1 fi(x) and ϕ k(x ) ϕ (x ) = x η n Pn i=1 f i(x ). For this application, consider parameters c = max{|1 ηm|, |1 ηM|} < 1 (by Lemma 2.8), s ηL/n, and σ ησ (rescaling to simplify notation). The equality case is a straightforward calculation in the setting that K = Rd, X0 = 0, fi(x) = mx for all i [n], and f i(x) defined as mx for i = i , and otherwise mx Lv for some unit vector v. Shifted Interpolation for Differential Privacy C.2.2. PROOF OF THEOREM 4.3 We consider here the setting of optimization over a bounded constraint set. Theorem C.7. In the setting of Theorem C.3, additionally assume that K has a finite diameter D and sk s. Then for any integer 0 τ < t, T(Xt, X t) G 1 t τ + D t τ ) . In particular, if t D/s then T(Xt, X t) G 3s D + s2 D Proof. From Xτ e Xτ D for all τ, in Theorem C.3 we can take zτ = D and c = 1. The values of {λk}, {zk}, {ak} are obtained by analyzing the following elementary optimization problem (Lemma C.8). Lemma C.8. Given s > 0 and D > 0, the optimal value of subject to zk+1 = (1 λk+1)(zk + s), ak+1 = λk+1(zk + s) zk, ak 0, zτ = D, zt = 0 is s + D t τ 2 (t τ). As a function of t τ (0, ), this value is minimized when t τ = D/s. Proof. By adding the equations zk+1 = (1 λk+1)(zk + s) and ak+1 = λk+1(zk + s) for k = τ, . . . , t 1, we obtain (with zτ = D and zt = 0) at + at 1 + + aτ+1 = D + (t τ)s . By the Cauchy-Schwarz inequality, the minimum value of Pt k=τ+1 a2 k is (s+R)2(t τ) and is obtained when zk = R(t k), ak s + R, and λk = s+R s+R(t k+1), where we use the shorthand R := D t τ . The last part is straightforward from the strict convexity of the one-dimensional function z 7 z s + D z 2 = s2z + D2 z + 2s D, z > 0. Proof of Theorem 4.3. 
This follows by considering s ηL n and σ ησ as in the proof of Theorem 4.2. C.3. Deferred proofs for 4.2 As done in the case of Noisy GD, we first characterize Noisy CGD as a particular instance of CNI and proceed to the proofs of the theorems. The following proposition holds straight from the definition; recall that l = n/b is the number of batches. Proposition C.9. For t = l E and k = 0, 1, . . . , t 1, let B1, . . . , Bl be a fixed partition of [n] with size b, and define ϕk+1(x) = x η i Br fi(x) and ϕ k+1(x ) = x η i Br f i(x ) where r = k + 1 l k l . Then the f-DP of Noisy CGD is equal to that between Xt = CNI(X0, {ϕk}k [t], {N(0, η2σ2Id)}k [t], K) and X t = CNI(X0, {ϕ k}k [t], {N(0, η2σ2Id)}k [t], K). Proof of Theorem 4.5. Let j [l] be the index such that i Bj and consider the setting in Proposition C.9. We will establish a lower bound on T(Xt , X t ) where t = t + j l 1; the lower bound on T(Xt, X t) is then given by T(Xt, X t) T(Xt+j l, X t+j l) T(Xt+j l 1, X t+j l 1) G( L where the first inequality holds from the post-processing inequality with ϕk ϕ k for all k = t + j l + 1, . . . , t, and the second inequality holds from ϕt+j l(x) ϕ t+j l(x) ηL b for all x with Lemma 3.1. Shifted Interpolation for Differential Privacy In general, ϕk+1 = ϕ k+1 when r = r(k) = k + 1 l k l is not equal to j ; otherwise ϕk+1(x) ϕ k+1(x) ηL b for all x. Thus, in Theorem C.3 we can take τ = 0, zτ = 0 and sk+1 = s1{r=j } where s = ηL b . Using {λk}, {zk}, {ak} obtained from the following result (Lemma C.10), T(Xt, X t) G Lemma C.10. Given s > 0 and 0 < c < 1, let t = t + j l 1 and consider a system zk+1 = (1 λk+1)(czk + s1{r=j }), ak+1 = λk+1(czk + s1{r=j }) zk, ak 0, z0 = zt = 0 λk [0, 1] . Then {ak}1 k t , {zk}0 k t , {λk}1 k t defined as 1+ct l s k j zk+1 = czk + s1{r=j } ak+1, z0 = 0 ( ak zk+ak zk + ak > 0 0 zk + ak = 0 is a solution, where r = r(k) = k + 1 l k Proof. From the stated formulae and zk+1 + ak+1 = czk + s1{r=j }, every condition except zk 0 and zt = 0 are straightforward to check. If k < j then zk 0. For k j , let q be the integer such that l(q 1) + j k < lq + j and r = k (l(q 1) + j ). Then zk = cr (1 + cl + + cl(q 1))s (ak + cak 1 + + ck j aj ) = cr (1 + cl + + cl(q 1)) ct k+j 2 1 c2 (1 cl)(1 + ct l)(1 + c2 + + c2(k j )) s . For any fixed q, this is a decreasing function in r and thus it suffices to consider r = l 1. Then cr (1 + cl + + cl(q 1)) ct k+j 2 1 c2 (1 cl)(1 + ct l)(1 + c2 + + c2(k j )) = cl 1 1 clq 1 cl cl(E q) 1 1 c2lq (1 cl)(1 + ct l) = cl 1 1 clq 1 cl 1 cl(E q 1) 1 + c E(l 1) , which is nonnegative for q E 1. Also, for q = E 1 this is equal to 0, implying zl(E 1)+j 1 = zt = 0 and thus λt = 1. Proof of Theorem 4.6. As in the proof of Theorem 4.5, we establish a lower bound on T(Xt , X t ) for t = t + j l 1. For any τ, letting τ = j + l(τ 1) we have Xτ e Xτ D. Thus in Theorem C.3 we can take zτ = D, sk+1 = s1{r=j }(s = ηL b ) and c = 1. The sequences {λk}, {zk}, {ak} can be chosen as in the following result (Lemma C.11), which yields a bound of T(Xt, X t) G k=τ +1 a2 k 2 + (D/η + L(E τ)/b)2 by Proposition C.9. Optimizing over the choice of E τ can be done similarly as in Theorem C.7; in particular, one can take E τ = Db ηL when E Db Shifted Interpolation for Differential Privacy Lemma C.11. Given s > 0 and D > 0, let t = t + j l 1, τ = j + l(τ 1) and consider a system zk+1 = (1 λk+1)(zk + s1{r=j }), ak+1 = λk+1(zk + s1{r=j }) zk, ak 0, zτ = D, zt = 0 λk [0, 1] . 
Then {ak}τ +1 k t , {zk}τ k t , {λk}τ +1 k t defined as ak D + s(E τ) l(E τ) zk+1 = zk + s1{r=j } ak+1, zτ = D ( ak zk+ak zk + ak > 0 0 zk + ak = 0 is a solution, where r = r(k) = k + 1 l k Proof. As in the proof of Lemma C.10, it suffices to check that zk 0 and zt = 0. Let q τ be the integer such that l(q 1) + j k < lq + j and r = k (l(q 1) + j ). Then zk = D + (q τ + 1)s (l(q τ) + r + 1)D + s(E τ) D + (q τ + 1)s (q τ + 1)D + s(E τ) = D(1 q τ + 1 where the inequality is from that the first line is minimized when r = l 1 for any fixed q. Also, zk = 0 and λk = 1 when r = l 1 and q = E 1, i.e., k = t + j l 1 = t . C.4. Deferred proofs for 4.3 C.4.1. NO I S YSGD AS STOCHASTIC VERSION OF CNI We first revisit the composition bound (Theorem 4.8). The key point in this proof relevant to our new results is the following formulation, which can be considered as a stochastic version of CNI (9) with each map x 7 ψs(x), ϕs(x), ϕ s(x) being contractive. Xk+1 = ΠK(ψSk(Xk) + Vk(ϕSk ψSk)(Xk) + Zk+1) X k+1 = ΠK(ψS k(X k) + V k(ϕ S k ψS k)(X k) + Z k+1) (11) Proof of Theorem 4.8. Let Xk and Xk+1 respectively be the k-th and (k +1)-th iterate of Noisy SGD with losses {fi}i [n], and similarly define X k and X k+1 for Noisy SGD with losses {f i}i [n]. It suffices to show T(Xk+1, X k+1) T(Xk, X k) Cb/n(G( L For the corresponding random batch Bk, we sample a random pair of set and element Sk = (Rk, Ck) as described below. This Sk will be here and after used as a representation for the random batch Bk. 1. Sample a set A1 of size b in [n] \ {i } uniformly at random. 2. Sample an element A2 from A1 uniformly at random. This element will serve as a candidate to be (potentially) replaced by i . 3. Let Rk = A1 \ {A2}, Ck = A2. Shifted Interpolation for Differential Privacy Finally, let Vk Ber(p) be a Bernoulli random variable with success probability p = b/n, which serves as an indicator denoting whether i Bk (i.e., Vk = 1) or not (i.e., Vk = 0). Then ( Rk {Ck} Vk = 0 Rk {i } Vk = 1 is a valid sampling procedure for Bk (i.e., the marginal distribution of Bk is uniform over size b subsets of [n]). These can be defined similarly for X k+1 as B k, V k and S k. The reason for formulating this alternative sampling scheme is to separate the subsampling part which only depends on whether the index i is included in the batch from the rest of the information on the batch. In particular, in (11), Sk and Vk are independent and Vk is still distributed as Ber(p) after conditioning on Sk. Now for a pair of set and element S = (R, C), define ϕS(x) = x η ϕ S(x) = x η b ( f i + X ψS(x) = x η i R {C} fi(x) . Then the updates for Xk+1 and X k+1 can be respectively written as (11), where Zk+1, Z k+1 N(0, η2σ2Id) are independent of anything else. Now the tradeoff function between Xk+1 and X k+1 satisfies T(Xk+1, X k+1) T((Xk, Sk, Vk(ϕSk ψSk)(Xk) + Zk+1), (X k, S k, V k(ϕ S k ψS k)(X k) + Z k+1)) by the post-processing inequality with respect to (x, s, y) 7 ΠK(ψs(x) + y). For any fixed realization (x, s) = (x, (r, c)) of the first two arguments, we find a lower bound on T(Vk(ϕs ψs)(x) + Zk+1, V k(ϕ s ψs)(x) + Z k+1) . (13) In fact, this is tradeoff function of the subsampled Gaussian mechanism as presented in (Dong et al., 2022, Theorem 9). To see this, we construct a new private setting as follows: Datasets: S = {y1, y2, . . . , yn}, S = {y 1, y2, . . . , yn} where y 1, y1, . . . , yn are distinct alphabets. 
Note that the datasets here are considered only for this part of the proof and are irrelevant with the original datasets in the private optimization setting. Mechanisms: (a) Sampleb: From a set of size n, sample a set of size b uniformly at random. (b) M: Given a set R of size b, output θ(R) + N(0, η2σ2Id) where (ϕs ψs)(x) y1 R, y 1 / R (ϕ s ψs)(x) y1 / R, y 1 R 0 else . Then a lower bound f on (13) is equivalent to M Sampleb being f-DP (when considered as being applied to S and S ). Note that from b ( fc fi ) b ( fc f i ) b ( f i fi ) , Shifted Interpolation for Differential Privacy 0.0 0.5 1.0 Figure 4. Illustration of Cp(G( L bσ )) and f (λ), for λ {0, 0.5, 1} with p = 0.25, L/(bσ) = 2.5. θ has (l2-)sensitivity ηL b and thus M is G( L bσ)-DP by (Dong et al., 2022, Theorem 1). Then by (Dong et al., 2022, Theorem 9), M Sampleb is Cb/n(G( L bσ))-DP. Thus, T((Xk, Sk, Vk(ϕSk ψSk)(Xk) + Zk+1), (X k, S k, V k(ϕ S k ψS k)(X k) + Z k+1)) T((Xk, Sk), (X k, S k)) Cb/n(G( L = T(Xk, X k) Cb/n(G( L where the equality is from that Sk(S k) is independent of Xk(X k), and that Sk and S k have the same distribution. One step optimality. Now we show the optimality of Theorem 4.8 for t = 1, i.e., T(X1, X 1) Cb/n(G( L Let X0 = X 0 = 0, K = Rd and for λ [0, 1], consider the gradients fi = f i = 0 fi = (1 λ)Lu where u is a unit vector. Then T(X1, X 1) = T( (1 λ)ηL b u V0 + N(0, η2σ2Id), ληL b u V 0 + N(0, η2σ2Id)) = T( (1 λ) L bσ V0 + N(0, 1), λ L bσ V 0 + N(0, 1)) where V0, V 0 Ber(p). Denoting the corresponding tradeoff function as f (λ), a valid lower bound for T(X1, X 1) is (pointwise) at most infλ [0,1] f (λ) and thus it suffices to show that infλ [0,1] f (λ) = Cb/n(G( L bσ)). Now the rest of the proof is a combination of following facts. (a) f (1)(α) Cb/n(G( L bσ))(α) with equality holding for all α [0, Φ( L (b) f (0)(α) Cb/n(G( L bσ))(α) with equality holding for all α [pΦ( L 2bσ) + (1 p)Φ( L (c) For λ (0, 1), f (λ)(α) Cb/n(G( L bσ))(α) with equality holding at α = pΦ( L 2bσ)+(1 p)Φ(( 1 bσ) (note that as λ varies, this covers the range of α at which Cb/n(G( L bσ)) is linear with slope 1 and interpolates the boundaries in (a) and (b)). Shifted Interpolation for Differential Privacy The first two facts are straightforward from Definition 4.7, with G(µ)p = T(N(0, 1), p N(µ, 1) + (1 p)N(0, 1)) and (f (0)) 1 = f (1).8 For (c), note that as a mixture of one-dimensional Gaussians the likelihood ratio between the two distributions is monotone and thus for any z R, with α = 1 (1 p)Φ(z) pΦ(z + (1 λ) L bσ) we have f (λ)(α) = (1 p)Φ(z) + pΦ(z λ L Thus from (here φ denotes the probability density function of N(0, 1)) dz = (1 p)φ(z) pφ(z + (1 λ) L dz = (1 p)φ(z) + pφ(z λ L at z = (λ 1 bσ we have α = pΦ( L 2bσ)+(1 p)Φ(( 1 bσ) where α+f (λ)(α) = (1+p)Φ( L 2bσ)+(1 p)Φ( L dα (α) = dfλ(α) dz = 1. This implies that f (λ) is tangent to Cb/n(G( L bσ)) at the point, and (c) follows by Lemma 2.3. C.4.2. PROOFS OF NEW RESULTS As in the case of Noisy GD (Lemma C.4), we start by establishing a lower bound for tradeoff function between convolutions of Gaussian random variables with bounded random variables now including the subsampling. Lemma C.12. For s 0 and p = b/n, let R(s, σ, p) = inf{T(V W + Z, V W + Z) : V Ber(p), W , W s, Z N(0, σ2Id)} where the infimum is taken pointwise and is over independent V, W, W , Z. Then R(s, σ, p) Cp(G( 2s Proof. The proof is fairly similar to the subsampling part in the proof of Theorem 4.8. 
Let V, W, W , Z be as in the definition of R(s, σ, p), and consider the following private setting: Datasets: S = {y1, y2, . . . , yn}, S = {y 1, y2, . . . , yn} where y 1, y1, . . . , yn are distinct alphabets. Mechanisms: (a) Sampleb: From a set of size n, sample a set of size b uniformly at random. (b) M: Given a set R of size b, output θ(R) + Z where W y1 R, y 1 / R W y1 / R, y 1 R 0 else . From W , W , W W 2s, θ has sensitivity 2s and thus M is a G( 2s σ )-DP mechanism by (Dong et al., 2022, Theorem 1). By (Dong et al., 2022, Theorem 9), M Sampleb is Cp(G( 2s σ ))-DP, which is equivalent to T(V W +Z, V W + Z) Cp(G( 2s Now we proceed to the proofs of the new results. The key point here is that we build shifted interpolated processes by not only coupling the noise Zk+1 but also the subsampling indicator Vk; see Figure 5. For the strongly convex and smooth setting, we state and prove a general theorem that allows one to choose the sequences of shift and sensitivity. 8In fact, since tradeoff functions are convex, (a) and (b) are enough to conclude that Cp(G(µ)) is the best tradeoff function bound; (c) provides an additional explanation on the linear part of Cp(G(µ)). See Figure 4. 9We conjecture that a strictly better lower bound holds, which corresponds to the case when W and W are constant vectors aligned in the opposite direction, i.e., R(s, σ, p) = T(p N( s σ , 1) + (1 p)N(0, 1), p N( s σ , 1) + (1 p)N(0, 1)). Shifted Interpolation for Differential Privacy Figure 5. Illustration of shifted interpolated processes in the proofs of Theorem 4.9 (left) and Theorem 4.10 (right). The solid lines denote the updates based on the realized values of {Vk}, and the dashed lines denote the alternative updates based on their unrealized values; each interpolated process uses the same (coupled) values of {Vk} as expressed in the figure. In Theorem 4.9, we build two processes, each of which tracks its corresponding original process. In Theorem 4.10, only one process is built and it inherits the identical deviation based on the realizations of {Vk}. Theorem C.13. Consider m-strongly convex, M-smooth loss functions with gradient sensitivity L. Then for any η (0, 2/M), Noisy SGD is f-DP where k=0 Cb/n(G(2ak for any sequence {zk}0 k t 1, {ak}0 k t 1 such that z0 = 0, a0 = b for all k 1 and zt 1 = b Pt 1 k=1 ct k 1ak and where c = max{|1 ηm|, |1 ηM|}. Proof. As in (11) and (12), the iterates of Noisy SGD with respect to {fi}i [n] and {f i}i [n] are Xk+1 = ΠK(ψSk(Xk) + Vk(ϕSk ψSk)(Xk) + Zk+1) X k+1 = ΠK(ψS k(X k) + V k(ϕ S k ψS k)(X k) + Z k+1) , where Zk+1, Z k+1 N(0, η2σ2Id). Now consider shifted interpolated processes defined as e Xk+1 = ΠK(ψSk( e Xk) + λk+1Vk(ϕSk(Xk) ψSk( e Xk)) + Zk+1) e X k+1 = ΠK(ψS k( e X k) + λk+1V k(ϕ S k(X k) ψS k( e X k)) + Z k+1) , with e X0 = e X 0 = X0 and λk = ak zk+ak 1{zk+ak>0} for {zk}0 k t 1 and {ak}0 k t 1 such that z0 = 0, a0 = zk+1 = czk + ηL b ak+1 for all k 0. Then inductively e Xk Xk zk for all k from e Xk+1 Xk+1 ( ψSk(Xk) ψSk( e Xk) czk Vk = 0 (1 λk+1)(ϕSk(Xk) ψSk( e Xk)) czk + ηL b ak+1 Vk = 1 and λk+1(ϕSk(Xk) ψSk( e Xk)) ak+1; similar results hold for {X k}. Thus as in the proof of Theorem 4.8 (see also Theorem C.3), with Lemma C.12 T( e Xt 1, e X t 1) k=1 Cb/n(G(2ak Shifted Interpolation for Differential Privacy To relate this with T(Xt, X t), note that there is no choice of λt that yields e Xt = Xt. 
Instead, we can proceed as follows: write down the corresponding update (before taking the projection) as ψSt 1(Xt 1) + Vt 1(ϕSt 1 ψSt 1)(Xt 1) + Zt = ψSt 1( e Xt 1) + ψSt 1(Xt 1) ψSt 1( e Xt 1) + Z(1) t + Vt 1(ϕSt 1 ψSt 1)(Xt 1) + Z(2) t where Z(1) t , Z(2) t N(0, η2σ2 2 Id)10 are independent, ψSt 1(Xt 1) ψSt 1( e Xt 1) is bounded by czt 1 and (ϕSt 1 ψSt 1)(Xt 1) is bounded by ηL T(( e Xt 1, St 1, ψSt 1(Xt 1) ψSt 1( e Xt 1) + Z(1) t ), ( e X t 1, S t 1, ψS t 1(X t 1) ψS t 1( e X t 1) + Z(1) T(( e Xt 1, St 1), ( e X t 1, S t 1)) R(czt 1, ησ T( e Xt 1, e X t 1) G(2 ησ ) Cb/n(G(2 In this formulation, optimizing over the sequences {zk} and {ak} is intractable because of the analytically complicated nature of the subsampled operator and composition of tradeoff functions. Heuristically, when b/n is small, each individual Cb/n(G( )) is very close to Id and the most substantial factor is the GDP part. In this sense, sequences that make zt 1 small can be considered as a reasonable choice. Proof of Theorem 4.9. Consider at 1 = = aτ = ηL b and ak = 0 for all 1 k < τ in Theorem C.13. Proof of Theorem 4.10. For the iterates (11) and (12), consider the shifted interpolated process e Xk+1 = ΠK(ψSk( e Xk) + λk+1(ψSk(Xk) ψSk( e Xk)) + Vk(ϕSk ψSk)(Xk) + Zk+1) where λk+1 = 1 t k and e Xτ = X τ. Then for any k τ, e Xk Xk zk and λk+1(ψSk(Xk) ψSk( e Xk)) ak+1 where zk = D t τ (t k) ak+1 D t τ . The first inequality is inductively from e Xτ Xτ = X τ Xτ D and e Xk+1 Xk+1 (1 λk+1) e Xk Xk zk+1 . The second inequality is from λk+1(ψSk(Xk) ψSk( e Xk)) λk+1zk = ak+1. Also, note that e Xt = Xt. As in the proof of Theorem 4.9, we can write down as ψSk( e Xk) + λk+1(ψSk(Xk) ψSk( e Xk)) + Vk(ϕSk(Xk) ψSk(Xk)) + Zk+1 = ψSk( e Xk) + λk+1(ψSk(Xk) ψSk( e Xk)) + Z(1) k+1 + Vk(ϕSk ψSk)(Xk) + Z(2) k+1 10In general, we can split the noise into Zt = Z(1) t + Z(2) t where Z(1) t N(0, η2σ2 α2 Id) and Z(2) t N(0, η2σ2 β2 Id) are independent and 1/α2 + 1/β2 = 1. Then the part G( 2 ησ ) Cb/n(G( 2 2L bσ )) in the last line of the proof is replaced with G( 2αczt 1 ησ ) Cb/n(G( 2βL Shifted Interpolation for Differential Privacy where Z(1) k+1, Z(2) k+1 N(0, η2σ2 2 Id)11 are independent and similarly ψS k(X k) + V k(ϕ S k ψS k)(X k) + Z k+1 = ψS k(X k) + Z(1) k+1 + V k(ϕ S k ψS k)(X k) + Z(2) T( e Xk+1, X k+1) T(( e Xk, Sk, λk+1(ψSk(Xk) ψSk( e Xk)) + Z(1) k+1), (X k, S k, Z(1) T(( e Xk, Sk), (X k, S k)) R(ak+1, ησ T( e Xk, X k) G( 2D (t τ)ησ ) Cb/n(G(2 Repeating this for k = t 1, . . . , τ yields the result. C.4.3. CHOICE OF t τ BASED ON APPROXIMATION Since Theorem 4.9 and Theorem 4.10 hold for every t τ, we can calculate the corresponding f-DP bound for each t τ and then take the pointwise maximum as a valid privacy guarantee; however, this may be computationally burdensome if t is large. One way to bypass this calculation is to approximate the composition of subsampled Gaussian mechanisms via CLT, (Lemma A.5), where the resulting f-DP bound becomes a GDP bound and thus optimization over t τ is analytically tractable. Proposition C.14. In the setting of Theorem 4.9, by choosing (modulo floor or ceiling) t τ = log b2σ(1 c) e4L2/(bσ)2Φ( 3L bσ ) + 3Φ( L Noisy SGD is approximately µ-GDP, where n2 (t τ)(e4L2/(bσ)2Φ(3L bσ ) + 3Φ( L bσ ) 2) . (14) Proof. By Lemma A.5, 2L bσ )) Cb/n(G(2L bσ )) (t τ) G (t τ)(e4L2/(bσ)2Φ(3L bσ ) + 3Φ( L Also, by bounding 2 2L bσ ct τ+1 ct 2L bσ ct τ+1 1 c we obtain an approximate lower bound G(µ) of the form (14). 
As a function of t τ (0, ) it is convex, and the first-order optimality condition provides the stated formula for t τ. Proposition C.15. In the setting of Theorem 4.10, by choosing (modulo floor or ceiling) e8L2/(bσ)2Φ( 3 2L bσ ) + 3Φ( Noisy SGD is approximately µ-GDP, where η2σ2(t τ) + 2 b2 n2 (t τ)(e8L2/(bσ)2Φ(3 2L bσ ) + 3Φ( 2L bσ ) 2) . (15) Proof. As in the proof of Proposition C.14, the CLT approximation of Cb/n(G( 2 2L bσ )) (t τ) provides a lower bound G(µ) of the form (15), which is a convex function in t τ; the first-order optimality condition yields the stated result. 11As before, setting Z(1) k+1 N(0, η2σ2 α2 Id) and Z(2) k+1 N(0, η2σ2 β2 Id) with 1/α2 + 1/β2 = 1 replaces G( 2D (t τ)ησ ) 2L bσ )) with G( αD (t τ)ησ ) Cb/n(G( 2βL Shifted Interpolation for Differential Privacy C.5. Lower bounds Here we elaborate on lower bounds for the amount of privacy preserved (i.e., upper bounds on the f-DP guarantee) that complement our results in 4. Note that an exactly matching bound for Noisy GD in the strongly convex setting was obtained in Theorem 4.2, and an asymptotically matching bound for Noisy SGD in the constrained convex setting was obtained in (Altschuler & Talwar, 2022). Below, we present results for the other related settings using similar techniques. For the strongly convex setting, these lower bounds are built based on convex quadratics which yield iterates with explicit Gaussians; and for the constrained convex setting, these are obtained by comparing symmetric and biased (projected) Gaussians. We refer the readers to (Altschuler & Talwar, 2022) for further discussion about these constructions. Theorem C.16. Consider the setting of Theorem 4.3 or Theorem 4.6. There exist universal constants 0 < c0 < 1/5, c1 > 0 such that if σ2 c0 LD ηn and µ = c1 1 (a) Noisy GD is not µ-GDP for all t Dn (b) Noisy CGD is not µ-GDP for all E Db Proof. (a) For Noisy GD, let µ0 = µ ηn . Consider d = 1 and loss functions such that fi(x) = 0 for all i [n], f i(x) = 0 for all i = i and f i (x) = L.12 Also, let X0 = 0 and K = [ D 2 ]. Note that by Lemma A.3, a µ-GDP algorithm is (µ2, Φ( µ 2 ))-DP. We will show that for E = [ D P(Xt E) = 1 P(X t E) < exp( µ2)(1 which implies that Noisy GD is not (µ2, Φ( µ 2 ))-DP and thus Noisy GD is not µ-GDP. First, recall that Xk+1 = ΠK(Xk + Zk+1) X k+1 = ΠK(X k + ηL where Zk+1, Z k+1 N(0, η2σ2). Since the distribution of Xk is symmetric for all k, P(Xt E) = 1 2. On the other hand, for t0 = t 0.8 Dn ηL + 1 consider a process {X k }t0 k t such that X t0 = D X k+1 = min{X k + ηL n + Z k+1, D Then inductively, P(X k z) P(X k z) for all z. Letting E0 = {maxt0 k t Pk j=t0 Zj 0.1D}, by Doob s submartingale inequality we have P(Ec 0) exp( (0.1D)2 ηL (ησ)2 ) exp( 0.01LD 5.6nησ2 ) = exp( 0.01 5.6 µ2 0) . 12For general d > 1, a similar argument (with slightly different constants) can be made by considering f i (x) = Le1 and K = [ Θ(D), Θ(D)] [ Θ(D/ d 1)]d 1 (constant factors chosen such that K has diameter D). Shifted Interpolation for Differential Privacy Also, conditioning on E0, X t = D ηL + Pt j=t0 Zj 0.3D + Pt j=t0 Zj. Thus P(X t / E) P(X t > 0) P({X t > 0} E0) j=t0 Zj > 0} E0) j=t0 Zj > 0) exp( 0.01 2.8µ0) exp( 0.01 5.6µ2 0) exp( 0.01 where the penultimate inequality is from that Pt j=t0 Zj is a mean zero Gaussian with variance 0.8 Dn µ2 0 , and the last inequality is from Φ(x) 1 exp( 1 2x2) for all x 1 2π (with 0.3µ0/ 0.3/ 2.8c0 1/ 2π). 
By taking sufficiently small c1 < q 5.6 and c0 < c2 1 such that 5.6µ2 0) + exp( 0.01 5.6 µ2 0) exp( c2 1µ2 0)(1 for all µ2 0 1 c0 , we have 2 )) = exp( c2 1µ2 0)(1 exp( c2 1µ2 0)(1 > exp( c2 1µ2 0)(1 5.6µ2 0) + exp( 0.01 as desired. (b) The proof for Noisy CGD is similar to that for Noisy GD (recall that t = l E and n = lb); consider the same loss functions, initialization, constraint set with i Bl. Then Xk+1 = ΠK(Xk + Zk+1) X k+1 = ΠK(X k + ηL b 1{r(k)=l} + Z k+1) where r(k) = k + 1 l k l . For t0 = l(E 0.8 Db ηL ) + 1, consider a process {X k }t0 k t such that X t0 = D X k+1 = min{X k + ηL b 1{r(k)=l} + Z k+1, D Then with the same events E and E0, P(Xt E) = 1/2 and P(X k z) P(X k z) for all z P(Ec 0) exp( (0.1D)2 ηL (ησ)2 ) exp( 0.01 and conditioning on E0, X t = D ηL 0.3D + Pt j=t0 Zj; the rest are identical. Shifted Interpolation for Differential Privacy Theorem C.17. In the setting of Theorem 4.5, any valid f-DP lower bound for Noisy CGD satisfies where µ = L 1 cl E 1+cl E 1 c2 (1 cl)2 . Proof. Consider the loss functions in the proof of Theorem 4.2, with X0 = 0, K = Rd and i Bl. By direct calculation Xl E = N(0, 1 c2l E 1 c2 η2σ2Id) and X l E = N( ηL 1 cl v, 1 c2l E 1 c2 η2σ2Id), implying T(Xl E, X l E) = G(µ) with µ as stated. D. Numerical details and results In this section, we provide numerical details of the figures and experiments in the main text and additional numerical results for different algorithms. Code reproducing these numerics can be found here: https://github.com/jinhobok/ shifted_interpolation_dp. D.1. Details for Figure 1 In Figure 1, we consider 1-strongly convex and 10-smooth loss functions with learning rate η = 0.05, effective sensitivity L/(nσ) = 0.1, and t {10, 20, 40, 80, 160}, with t = 160 in the left figure and δ = 10 5 in the right figure. Our f-DP bound is from Theorem 4.2, our RDP bound is from Theorem 4.2 and Lemma A.4, the prior RDP bound is from (Ye & Shokri, 2022, Theorem D.6), and the composition bound is from Theorem 4.1. For conversion from GDP and RDP to (ε, δ)-DP, see A.1 and A.2. We emphasize that different choices of parameters lead to qualitatively similar plots; see D.3 for further numerical comparisons in other settings. D.2. Details for 4.4 Here we provide further numerical details for the experiment in 4.4. The purpose of this simple numerical example is to corroborate our theoretical findings by comparing them with existing privacy bounds. As such, we simply compare algorithms with the same hyperparameters, and do not attempt to optimize these choices for individual algorithms. In 4.4, Table 1 shows that our results provide improved privacy bounds. That table considers the privacy leakage of Noisy CGD in (ε, δ)-DP with regularization parameter λ = 0.002. Table 3 and Table 4 provide more details on this numerical comparison by also considering another algorithm (Noisy SGD), another notion of privacy leakage (GDP), and another parameter (λ = 0.004). Details on these tables: for the GDP Composition privacy bound on Noisy SGD, we present the approximate value of the GDP parameter provided by CLT since this is computationally tractable; for (ε, δ)-DP we compute the corresponding ε to an error of 10 3 using the numerical procedure in D.5; and we convert the currently known best RDP bounds provided by (Ye & Shokri, 2022, Theorem 3.3) to (ε, δ)-DP using the numerical procedure in A.1. Table 3. More detailed version of Table 1, for GDP. Lists the GDP parameters of private algorithms for the regularized logistic regression problem. 
Note that GDP Composition yields the same privacy bound regardless of the regularization parameter. Our results provide improved privacy. Epochs GDP Composition Our Bounds Algorithms Noisy CGD Noisy SGD Noisy CGD λ {0.002, 0.004} 0.002 0.004 50 4.71 1.03 0.99 0.99 100 6.67 1.45 1.24 1.22 200 9.43 2.05 1.59 1.51 Shifted Interpolation for Differential Privacy Table 4. More detailed version of Table 1, for (ε, δ)-DP. Lists ε of private algorithms on the regularized logistic regression problem for δ = 10 5. Note that GDP Composition yields the same privacy bound regardless of λ. Our results provide improved privacy over both GDP Composition and RDP. Epochs GDP Composition RDP Our Bounds Algorithms Noisy CGD Noisy SGD Noisy CGD Noisy CGD λ {0.002, 0.004} 0.002 0.004 0.002 0.004 50 30.51 4.44 5.82 5.61 4.34 4.32 100 49.88 6.65 7.61 7.00 5.60 5.51 200 83.83 10.11 9.88 8.38 7.58 7.09 These tables show that compared to our results (Theorem 4.5), the standard GDP Composition bound for Noisy CGD (Theorem 4.4) provides essentially no privacy. This is because that standard bound incurs a large privacy loss in each epoch (at the step in which the adjacent datasets use different gradients), and this privacy leakage accumulates indefinitely whereas our analysis captures the contractivity of the algorithm s updates, which effectively ensures that previous gradient queries leak less privacy the longer ago they were performed. See 3 for a further discussion of this. Combined with the lossless conversion enabled by our f-DP analysis, our results also provide better privacy than the state-of-the-art RDP bounds. Table 5 and Table 6 (reporting (mean) (standard deviation) of accuracies over 10 runs) show that (1) Noisy CGD and Noisy SGD have comparable training and test accuracy for this problem, and (2) both algorithms improve when run longer, thus necessitating better privacy guarantees in order to achieve a target error (for either training or test) given a fixed privacy budget. Note that while Noisy SGD enjoys better privacy bounds than Noisy CGD using the standard GDP Composition argument, our new privacy guarantees for Noisy CGD improve over GDP Composition bounds for both algorithms (c.f., Table 3 and Table 4). In particular, observe that while running algorithms longer leads to better accuracy, the privacy leak in Noisy SGD from GDP Composition grows faster relative to our results (e.g., compare the values of ε when E = 50 and E = 200). This highlights the convergent dynamics of our privacy bounds and exemplifies how this enables algorithms to be run longer while preserving privacy. Table 5. More detailed version of Table 2. Lists training accuracy (%) of Noisy CGD and Noisy SGD for regularized logistic regression. Note that both algorithms perform similarly and improve when run longer. Epochs Noisy CGD Noisy SGD λ 0.002 0.004 0.002 0.004 50 89.36 0.03 89.23 0.02 89.36 0.04 89.22 0.04 100 90.24 0.03 90.00 0.03 90.25 0.02 89.99 0.03 200 90.85 0.02 90.39 0.04 90.84 0.03 90.37 0.02 Table 6. More detailed version of Table 2. Lists test accuracy (%) of Noisy CGD and Noisy SGD for regularized logistic regression. Again, note that both algorithms perform similarly and improve when run longer. 
Epochs Noisy CGD Noisy SGD λ 0.002 0.004 0.002 0.004 50 90.12 0.04 90.03 0.07 90.12 0.08 90.00 0.06 100 90.94 0.07 90.70 0.05 90.97 0.04 90.75 0.03 200 91.37 0.08 91.02 0.07 91.40 0.07 91.01 0.04 For the experiment, we closely follow the setting considered in (Ye & Shokri, 2022) for proofs and details on theoretical guarantees with respect to the setting, see (Ye & Shokri, 2022, Section 5). The MNIST dataset has n = 60000 training data points and 10000 test data points; for both Noisy CGD and Noisy SGD, we set the parameters as C = 8, η = 0.05, b = 1500, σ = 1/100, L = 10, E {50, 100, 200} and λ {0.002, 0.004}. First, we clip the feature so that it has norm Shifted Interpolation for Differential Privacy C. For the loss function l(θ, (x, y)) of the (unregularized) logistic regression, we calculate the gradient for each data point (x, y) as f(θ, (x, y)) = l(θ, (x, y)) l(θ, (x, y)) min{ l(θ, (x, y)) , L In other words, we first clip the gradient by L/2 so that the gradient sensitivity is L, and add a gradient λθ of the regularization term (λ/2) θ 2 (which does not affect the gradient sensitivity). D.3. Additional numerics Here we provide additional numerical results to illustrate our privacy bounds in 4, by comparing our f-DP bounds with the counterparts derived by the standard GDP Composition analysis. We cover the settings and algorithms covered in the main text over a broad range of parameters, emphasizing the convergent dynamics of our privacy bounds. The different settings lead to qualitatively similar comparisons. Recall that the relevant parameters of the algorithms are the learning rate η, noise rate σ, number of data points n, batch size b, gradient sensitivity L, and diameter D of the constraint set K; see 2.3. D.3.1. NO I S YGD Figure 6 shows our results for f-DP (left) and its conversion into (ε, δ)-DP (right) for Noisy GD in the strongly convex setting (Theorem 4.2), where our bound is exact. In contrast, observe that while the bound from GDP Composition is nearly tight for a small number of iterations t, the guarantee becomes vacuous as t increases. This is also evident from the (ε, δ)-DP plot, where the discrepancy between the two bounds increases in t. In Table 7, since we obtain GDP bounds, we provide the GDP parameter µ as a function of the number of iterations t and the contractivity c = max{|1 ηm|, |1 ηM|}. All values in Table 7 scale linearly in the effective sensitivity L/(nσ); for simplicity we set it to 0.1. Note that the GDP Composition bound is independent of c because it is not geometrically aware in the sense described in 3. Our bound is optimal and always improves over GDP composition substantially so as t increases. 0.0 0.5 1.0 Type I Error Type II Error Our Bound, t = 10 Our Bound, t = 100 Our Bound, t = 1000 GDP Composition, t = 10 GDP Composition, t = 100 GDP Composition, t = 1000 Our Bound, = 1e-03 Our Bound, = 1e-05 GDP Composition, = 1e-03 GDP Composition, = 1e-05 Figure 6. Comparison of our exact privacy characterization (Theorem 4.2) with the standard GDP Composition bound (Theorem 4.1) for Noisy GD, for c = 0.99. Shown for f-DP (left) and (ε, δ)-DP (right). Table 7. GDP parameter µ from our exact privacy characterization (Theorem 4.2), for varying t and c. 
Steps GDP Composition Our Bounds c {0.92, 0.96, 0.98, 0.99, 0.995} 0.92 0.96 0.98 0.99 0.995 10 0.316 0.308 0.314 0.316 0.316 0.316 100 1.000 0.490 0.688 0.871 0.961 0.990 1000 3.162 0.490 0.700 0.995 1.411 1.984 Figure 7 and Table 8 turn to the setting of constrained convex optimization in Theorem 4.3. In the (ε, δ)-DP figure, we plot the minimum ε between Theorem 4.3 and GDP Composition. A distinctive feature from both plots is that our privacy bound stays constant after a number of iterations, compared to GDP Composition. In particular, there exists a threshold Shifted Interpolation for Differential Privacy t = t (L/n, η) such that the algorithm can run beyond the threshold (and even indefinitely) with a provable guarantee of µ -GDP. To highlight this fact, we provide the pairs of (t , µ ) in the table over multiple combinations of parameters. We set the diameter of the constraint set K to be D = 1 and noise parameter to be σ = 8; note that as in the previous case, the GDP parameters in this setting scale linearly with respect to 1/σ. Other parameter choices lead to qualitatively similar comparisons. 0.0 0.5 1.0 Type I Error Type II Error Our Bound, t 80 GDP Composition, t = 80 GDP Composition, t = 160 GDP Composition, t = 320 7 Our Bound, = 1e-03 Our Bound, = 1e-05 GDP Composition, = 1e-03 GDP Composition, = 1e-05 Figure 7. Comparison of our bound (Theorem 4.3) with the standard GDP Composition bound (Theorem 4.1) for Noisy GD, for L/n = 0.5 and η = 0.1. Shown for f-DP (left) and (ε, δ)-DP (right). Table 8. Threshold number of iterations t at which point our GDP bound µ from Theorem 4.3 no longer increases. Shown for varying parameters L/n and η. L/n \ η 0.2 0.1 0.05 0.25 (80, 0.280) (160, 0.395) (320, 0.559) 0.5 (40, 0.395) (80, 0.559) (160, 0.791) 1 (20, 0.559) (40, 0.791) (80, 1.118) D.3.2. NO I S YCGD Figure 8 and Table 9 show the analog of Figure 6 and Table 7, now for Noisy CGD rather than Noisy GD. Recall that l denotes the number of batches and c = max{|1 ηm|, |1 ηM|} is the contraction factor. All GDP parameters scale linearly in the effective sensitivity L/(bσ); we set it to 0.2 for concreteness. The improvement of our bounds over the standard GDP Composition bound is pronounced: our bounds yield strong privacy in both f-DP (left) and (ε, δ)-DP (right), whereas the GDP Composition bound becomes effectively non-private as the number of epochs E increases. This is also evident from Table 9, where our bound produces better privacy (even with E = 500 epochs) than GDP Composition (even with E = 5). 0.0 0.5 1.0 Type I Error Type II Error Our Bound, E = 5 Our Bound, E = 50 Our Bound, E = 500 GDP Composition, E = 5 GDP Composition, E = 50 GDP Composition, E = 500 20 Our Bound, = 1e-03 Our Bound, = 1e-05 GDP Composition, = 1e-03 GDP Composition, = 1e-05 Figure 8. (Left) f-DP, (Right) (ε, δ)-DP comparison of our bound (Theorem 4.5) with the standard GDP Composition bound (Theorem 4.4) for Noisy CGD, for c = 0.99 and l = 20. Shifted Interpolation for Differential Privacy Table 9. GDP parameter µ, for varying number of epochs E and contractivity c. Epochs GDP Composition Our Bounds l {10, 20, 40} 10 20 40 c {0.98, 0.99, 0.995} 0.98 0.99 0.995 0.98 0.99 0.995 0.98 0.99 0.995 5 0.447 0.229 0.233 0.235 0.211 0.215 0.217 0.202 0.205 0.208 50 1.414 0.270 0.334 0.410 0.216 0.237 0.275 0.203 0.208 0.219 500 4.472 0.270 0.336 0.439 0.216 0.237 0.276 0.203 0.208 0.219 Next we turn to the setting of constrained convex losses from Theorem 4.6. 
Again, our bounds converge in the number of epochs quickly and uniformly improve over the bounds from GDP Composition after only a few number of epochs for the (ε, δ)-DP plot, we show the minimum ε between Theorem 4.6 and GDP Composition. In particular, from the table one can observe that there are even a few cases in which our bounds are better than those from GDP Composition after less than 10 epochs. We chose D = 1 and σ = 3; the GDP parameters scale linear in 1/σ. 0.0 0.5 1.0 Type I Error Type II Error Our Bound, E 25 GDP Composition, E = 25 GDP Composition, E = 50 GDP Composition, E = 100 Our Bound, = 1e-03 Our Bound, = 1e-05 GDP Composition, = 1e-03 GDP Composition, = 1e-05 Figure 9. (Left) f-DP, (Right) (ε, δ)-DP comparison of our bound (Theorem 4.6) with the existing bound from GDP Composition (Theorem 4.4) for Noisy CGD under constrained set, with η = 0.02, L/b = 0.5 and l = 20. Table 10. (E , µ ) over different values of (L/b, η), with l = 10. L/b \ η 0.04 0.02 0.01 0.25 (31, 0.534) (62, 0.750) (123, 1.057) 0.5 (17, 0.764) (33, 1.067) (65, 1.500) 1 (10, 1.106) (20, 1.528) (40, 2.134) Table 11. (E , µ ) over different values of (L/b, η), with l = 20. L/b \ η 0.04 0.02 0.01 0.25 (16, 0.382) (31, 0.534) (62, 0.750) 0.5 (9, 0.553) (17, 0.764) (33, 1.067) 1 (5, 0.816) (10, 1.106) (20, 1.528) Table 12. (E , µ ) over different values of (L/b, η), with l = 40. L/b \ η 0.04 0.02 0.01 0.25 (8, 0.276) (16, 0.382) (31, 0.534) 0.5 (5, 0.408) (9, 0.553) (17, 0.764) 1 (3, 0.624) (5, 0.816) (10, 1.106) Shifted Interpolation for Differential Privacy D.3.3. NO I S YSGD For brevity, here we consider just the setting of constrained convex losses; similar plots can be obtained for the strongly convex setting. Figure 10 compares our new privacy bound (Theorem 4.10) with the standard GDP Composition bound (Theorem 4.8), by illustrating the f-DP tradeoff curves of both bounds for a broad range of parameters. For most parameter choices, our bounds provide reasonable privacy that is valid even beyond the number of iterations in the plots. On the other hand, the divergence of the GDP Composition bound clearly degrades the privacy as the number of iterations increases. 
0.0 0.5 1.0 b/n = 0.0025, L/b = 0.125 0.0 0.5 1.0 b/n = 0.0025, L/b = 0.25 0.0 0.5 1.0 b/n = 0.0025, L/b = 0.5 0.0 0.5 1.0 b/n = 0.0025, L/b = 1 0.0 0.5 1.0 b/n = 0.0025, L/b = 2 0.0 0.5 1.0 b/n = 0.005, L/b = 0.125 0.0 0.5 1.0 b/n = 0.005, L/b = 0.25 0.0 0.5 1.0 b/n = 0.005, L/b = 0.5 0.0 0.5 1.0 b/n = 0.005, L/b = 1 0.0 0.5 1.0 b/n = 0.005, L/b = 2 0.0 0.5 1.0 b/n = 0.01, L/b = 0.125 0.0 0.5 1.0 b/n = 0.01, L/b = 0.25 0.0 0.5 1.0 b/n = 0.01, L/b = 0.5 0.0 0.5 1.0 b/n = 0.01, L/b = 1 0.0 0.5 1.0 b/n = 0.01, L/b = 2 0.0 0.5 1.0 b/n = 0.02, L/b = 0.125 0.0 0.5 1.0 b/n = 0.02, L/b = 0.25 0.0 0.5 1.0 b/n = 0.02, L/b = 0.5 0.0 0.5 1.0 b/n = 0.02, L/b = 1 0.0 0.5 1.0 b/n = 0.02, L/b = 2 0.0 0.5 1.0 b/n = 0.04, L/b = 0.125 0.0 0.5 1.0 b/n = 0.04, L/b = 0.25 0.0 0.5 1.0 b/n = 0.04, L/b = 0.5 0.0 0.5 1.0 b/n = 0.04, L/b = 1 0.0 0.5 1.0 b/n = 0.04, L/b = 2 = 0.2 = 0.1 = 0.05 GDP Composition, t = 5000 GDP Composition, t = 20000 GDP Composition, t = 80000 = 0.2 = 0.1 = 0.05 GDP Composition, t = 5000 GDP Composition, t = 20000 GDP Composition, t = 80000 = 0.2 = 0.1 = 0.05 GDP Composition, t = 5000 GDP Composition, t = 20000 GDP Composition, t = 80000 = 0.2 = 0.1 = 0.05 GDP Composition, t = 5000 GDP Composition, t = 20000 GDP Composition, t = 80000 = 0.2 = 0.1 = 0.05 GDP Composition, t = 5000 GDP Composition, t = 20000 GDP Composition, t = 80000 Figure 10. Comparison of our new privacy bound (Theorem 4.10) with the standard GDP Composition bound (Theorem 4.8) for Noisy SGD in the setting of constrained convex losses. Each subplot illustrates the f-DP tradeoff curve of these bounds, for a given relative batch size b/n and effective sensitivity L/b. To make this figure, we approximated the compositions of Cb/n(G( )) in Theorem 4.10 by CLT, by choosing the best privacy bound among t τ {100, 200, . . . , 4900} after applying CLT; this is a valid approximation by Lemma A.11. We also note that for improved numerical results, instead of the respective factors of ( 2) inside G( ) and Cb/n(G( )) of the statement of Theorem 4.10 we used ( 10/3) (see the proof of Theorem 4.10 for details). For parameters unspecified in the plots we used D = 1 and σ = 3. Shifted Interpolation for Differential Privacy D.4. Comparison of privacy bounds for the exponential mechanism Let fε be the tradeoff function corresponding to (ε, 0)-DP. From (Dong et al., 2022, Proposition 3), it is straightforward to check that fε G(µ) iff Plugging in ε = 2LD and µ = 2 LD from Corollary 5.4, one can check (using standard nonlinear equation solvers) that this holds iff LD c where c 0.676. D.5. Numerical composition of subsampled GDP Here we mention details on the numerical procedure for calculating an (ε, δ)-DP bound from an f-DP bound of the form f = Cp(G(µ)) t; this is used in 4.4. This formula appears in multiple settings of Noisy SGD, with each Cp(G(µ)) representing the f-DP of subsampled Gaussian mechanism. This conversion process is important for both notions: First, while the composition can be approximated by CLT (Lemma A.5), in practice the approximation is not enough to guarantee whether the algorithm achieves a given privacy budget, typically expressed in (ε, δ)-DP. On the other hand, if one can obtain an accurate collection of different (ε, δ)-DP bounds, it can be converted into an f-DP bound of comparable accuracy due to the duality between the two notions (Dong et al., 2022, Proposition 5 & 6). 
We implement the framework of privacy loss random variables (PRV) which is an equivalent notion of f-DP and the corresponding analytical procedure provided in (Gopi et al., 2021). Given a fixed value of δ and (possibly different) compositions of private mechanisms, the algorithm presented in the paper allows one to numerically calculate ε with user-specified margin of error. We refer the readers to (Gopi et al., 2021) for the background and overview of the PRV framework and only present relevant results for the problem of our interest.13 The privacy curve and PRV are characterized as follows (Gopi et al., 2021, Definition 2.1 & 3.1). Definition D.1 (Privacy curve and PRV). Let f = T(X, Y ) be a tradeoff function. Then the privacy curve δ : R [0, 1] with respect to (X, Y ) is defined as δ(X||Y )(ε) = sup E{P(Y E) eεP(X E)} where the supermum is over all events. Conversely, given a privacy curve δ : R [0, 1], (X, Y ) are privacy loss random variables if the following holds. X, Y are supported on the extended real line R. δ(X||Y ) δ. Let X(t), Y (t) respectively be the probability density functions of X, Y . Then Y (t) = et X(t) and Y ( ) = X( ) = 0. The probability density functions of PRVs can be calculated from the privacy curve (Gopi et al., 2021, Theorem 3.3). Lemma D.2 (Conversion). Given a privacy curve δ : R [0, 1], the probability density functions of its PRVs (X, Y ) are given as Y (t) = δ (t) δ (t) and X(t) = et(δ (t) δ (t)). Also, symmetric tradeoff functions have the simple form of PRVs (X, Y ) with X = Y (Gopi et al., 2021, Proposition D.9). Lemma D.3 (Symmetry). If (X, Y ) are PRVs for a privacy curve δ(P||Q), the PRVs for δ(Q||P) are ( Y, X). In particular, if the privacy curve is symmetric (i.e., δ(P||Q) = δ(Q||P); equivalently, the corresponding tradeoff function is symmetric) then X = Y . By the following result, we can numerically calculate the (ε, δ)-DP converted from f-DP for f = Cp(G(µ)) t. Proposition D.4. Let (X, Y ) be such that the CDF of Y is given as 2 ) + (1 p)Φ( ε+ 13We also note that while a corresponding result for the one-sided version of G(µ)p = T(N(0, 1), p N(µ, 1) + (1 p)N(0, 1)) was already presented in (Gopi et al., 2021) and is often interchangeably used, it is quantitatively different from Cp(G(µ)), even in the limiting regime of CLT. Compare, for example, Lemma A.5 and (Bu et al., 2020). Shifted Interpolation for Differential Privacy where ε+ = log((p 1 + et)/p), ε = log((p 1 + e t)/p) and X = Y . Then (X, Y ) are PRVs for the tradeoff function Cp(G(µ)). Proof. Let δ, δ0 respectively be the privacy curves of Cp(G(µ)) and G(µ)p. Then it is straightforward to check that ( δ0(t) t > 0 1 et(1 δ0( t)) t 0 from the definition of Cp(G(µ)) (as a symmetrized version of G(µ)p; see Definition 4.7) and the duality between (ε, δ)-DP and f-DP. By taking antiderivative from Lemma D.2, the CDF of Y is given as ( δ 0(t) δ0(t) + C t > 0 etδ 0( t) 1 + C t 0 for some constant C. By either obtaining δ0(t) directly from (Dong et al., 2022, Lemma 2) or comparing the t > 0 part of FY (t) with (Gopi et al., 2021, Proposition C.4), one can derive the formula of FY (t) as stated with C = 1. Also, X = Y from Lemma D.3.