Published as a conference paper at ICLR 2023

INDIVIDUAL PRIVACY ACCOUNTING WITH GAUSSIAN DIFFERENTIAL PRIVACY

Antti Koskela, Nokia Bell Labs, antti.h.koskela@nokia-bell-labs.com
Marlon Tobaben, University of Helsinki, marlon.tobaben@helsinki.fi
Antti Honkela, University of Helsinki, antti.honkela@helsinki.fi

ABSTRACT

Individual privacy accounting enables bounding differential privacy (DP) loss individually for each participant involved in the analysis. This can be informative, as the individual privacy losses are often considerably smaller than those indicated by DP bounds that consider worst-case losses at each data access. In order to account for the individual privacy losses in a principled manner, we need a privacy accountant for adaptive compositions of randomised mechanisms, where the loss incurred at a given data access is allowed to be smaller than the worst-case loss. This kind of analysis has been carried out for Rényi differential privacy by Feldman and Zrnic (12), however not yet for the so-called optimal privacy accountants. We take first steps in this direction by providing a careful analysis using Gaussian differential privacy, which gives optimal bounds for the Gaussian mechanism, one of the most versatile DP mechanisms. This approach is based on determining a certain supermartingale for the hockey-stick divergence and on extending the Rényi divergence-based fully adaptive composition results of Feldman and Zrnic (12). We also consider measuring the individual (ε, δ)-privacy losses using so-called privacy loss distributions. With the help of the Blackwell theorem, we can then make use of the results of Feldman and Zrnic (12) to construct an approximative individual (ε, δ)-accountant.

1 INTRODUCTION

Differential privacy (DP) (8) provides means to accurately bound the compound privacy loss of multiple accesses to a database. Common DP composition accounting techniques such as Rényi differential privacy (RDP) based techniques (23; 33; 38; 24) or so-called optimal accounting techniques (19; 15; 37) require that the privacy parameters of all algorithms are fixed beforehand. Rogers et al. (28) were the first to analyse fully adaptive compositions, wherein the mechanisms are allowed to be selected adaptively. They introduced two objects for measuring privacy in fully adaptive compositions: privacy filters, which halt the algorithm when a given budget is exceeded, and privacy odometers, which output bounds on the privacy loss incurred so far. Whitehouse et al. (34) have tightened these composition bounds using filters to match the tightness of the so-called advanced composition theorem (9). Feldman and Zrnic (12) obtain similar (ε, δ)-asymptotics via RDP analysis. In their RDP analysis, Feldman and Zrnic (12) consider individual filters, where the algorithm stops releasing information about the data elements that have exceeded a pre-defined RDP budget. This kind of individual analysis has not yet been considered for the optimal privacy accountants. We take first steps in this direction by providing a fully adaptive individual DP analysis using Gaussian differential privacy (7).
Our analysis leads to tight bounds for the Gaussian mechanism; it is based on determining a certain supermartingale for the hockey-stick divergence and on proof techniques similar to those in the RDP-based fully adaptive composition results of Feldman and Zrnic (12). We note that the idea of individual accounting of privacy losses has previously been considered in various forms by, e.g., Ghosh and Roth (13); Ebadi et al. (10); Wang (32); Cummings and Durfee (6); Ligett et al. (22); Redberg and Wang (27). We also consider measuring the individual (ε, δ)-privacy losses using so-called privacy loss distributions (PLDs). Using the Blackwell theorem, we can in this case rely on the results of (12) to construct an approximative (ε, δ)-accountant that often leads to smaller individual ε-values than commonly used RDP accountants. For this accountant, evaluating the individual DP parameters using existing methods requires computing an FFT at each step of the adaptive analysis. We speed up this computation by placing the individual DP hyperparameters into well-chosen buckets and by using pre-computed Fourier transforms. Moreover, by using the Plancherel theorem, we obtain a further speed-up.

1.1 OUR CONTRIBUTIONS

Our main contributions are the following:

- We show how to analyse fully adaptive compositions of DP mechanisms using Gaussian differential privacy. Our results give tight (ε, δ)-bounds for compositions of Gaussian mechanisms and are the first results with tight bounds for fully adaptive compositions.
- Using the concept of dominating pairs of distributions and by utilising the Blackwell theorem, we propose an approximative individual (ε, δ)-accountant that in several cases leads to smaller individual ε-bounds than the individual RDP analysis.
- We propose efficient numerical techniques to compute individual privacy parameters using privacy loss distributions (PLDs) and the FFT algorithm. We show that individual ε-values can be accurately approximated in O(n) time, where n is the number of discretisation points for the PLDs. Due to lack of space, this is described in Appendix D.
- We give experimental results that illustrate the benefits of replacing the RDP analysis with GDP accounting or with FFT-based numerical accounting techniques. As an observation of independent interest, we notice that individual filtering leads to a disparate loss of accuracy among subgroups when training a neural network using DP gradient descent.

2 BACKGROUND

2.1 DIFFERENTIAL PRIVACY

We first shortly review the required definitions and results for our analysis. For a more detailed discussion, see e.g. (7) and (37). An input dataset containing N data points is denoted as X = (x_1, ..., x_N) ∈ X^N, where x_i ∈ X, 1 ≤ i ≤ N. We say X and X′ are neighbours if we get one by adding or removing one element in the other (denoted X ∼ X′). Similarly to Feldman and Zrnic (12), we also denote by X_{−i} the dataset obtained by removing element x_i from X, i.e. X_{−i} = (x_1, ..., x_{i−1}, x_{i+1}, ..., x_N). A mechanism M is (ε, δ)-DP if its outputs are (ε, δ)-indistinguishable for neighbouring datasets.

Definition 1. Let ε ≥ 0 and δ ∈ [0, 1]. Mechanism M : X^N → O is (ε, δ)-DP if for every pair of neighbouring datasets X, X′ and every measurable set E ⊆ O,

P(M(X) ∈ E) ≤ e^ε · P(M(X′) ∈ E) + δ.

We call M tightly (ε, δ)-DP if there does not exist δ′ < δ such that M is (ε, δ′)-DP.

The (ε, δ)-DP bounds can also be characterised using the hockey-stick divergence.
For α > 0, the hockey-stick divergence H_α from a distribution P to a distribution Q is defined as

H_α(P||Q) = ∫ [P(t) − α·Q(t)]_+ dt,

where [x]_+ = max{0, x} for x ∈ ℝ. Tight (ε, δ)-values for a given mechanism can be obtained using the hockey-stick divergence:

Lemma 2 (Zhu et al. (37)). For a given ε ≥ 0, the tight δ(ε) is given by the expression

δ(ε) = max_{X ∼ X′} H_{e^ε}(M(X)||M(X′)).

Thus, if we can bound the divergence H_{e^ε}(M(X)||M(X′)) accurately, we also obtain accurate δ(ε)-bounds. To this end we consider so-called dominating pairs of distributions:

Definition 3 (Zhu et al. (37)). A pair of distributions (P, Q) is a dominating pair of distributions for mechanism M(X) if for all neighbouring datasets X and X′ and for all α > 0,

H_α(M(X)||M(X′)) ≤ H_α(P||Q).

If the equality holds for all α for some X, X′, then (P, Q) is tightly dominating.

Dominating pairs of distributions also give upper bounds for adaptive compositions:

Theorem 4 (Zhu et al. (37)). If (P, Q) dominates M and (P′, Q′) dominates M′, then (P × P′, Q × Q′) dominates the adaptive composition M′ ∘ M.

To convert the hockey-stick divergence from P × P′ to Q × Q′ into an efficiently computable form, we consider so-called privacy loss random variables.

Definition 5. Let P and Q be probability density functions. We define the privacy loss function L_{P/Q} as

L_{P/Q}(t) = log( P(t) / Q(t) ).

We define the privacy loss random variable (PRV) ω_{P/Q} as

ω_{P/Q} = L_{P/Q}(t), t ∼ P(t).

With a slight abuse of notation, we denote the probability density function of the random variable ω_{P/Q} by ω_{P/Q}(t), and call it the privacy loss distribution (PLD).

Theorem 6 (Gopi et al. (15)). The δ(ε)-bounds can be represented using the PRV:

H_{e^ε}(P||Q) = E_{t∼P} [1 − e^{ε − L_{P/Q}(t)}]_+ = E_{s∼ω_{P/Q}} [1 − e^{ε − s}]_+.    (2.1)

Moreover, if ω_{P/Q} is the PRV for the pair of distributions (P, Q) and ω_{P′/Q′} the PRV for the pair of distributions (P′, Q′), then the PRV for the pair of distributions (P × P′, Q × Q′) is given by ω_{P/Q} + ω_{P′/Q′}.

When we set α = e^ε, the following characterisation follows directly from Theorem 6.

Corollary 7. If the pair of distributions (P, Q) is a dominating pair of distributions for a mechanism M, then for all neighbouring datasets X and X′ and for all ε ∈ ℝ,

E_{t∼M(X)} [1 − e^{ε − L_{M(X)/M(X′)}(t)}]_+ ≤ E_{t∼P} [1 − e^{ε − L_{P/Q}(t)}]_+.

We will in particular consider the Gaussian mechanism and its subsampled variant.

Example: hockey-stick divergence between two Gaussians. Let x_0, x_1 ∈ ℝ, σ > 0, and let P be the density function of N(x_0, σ²) and Q the density function of N(x_1, σ²). Then, the PRV ω_{P/Q} is distributed as (Lemma 11 of (30))

ω_{P/Q} ∼ N( (x_0 − x_1)²/(2σ²), (x_0 − x_1)²/σ² ).    (2.2)

Thus, in particular, H_α(P||Q) = H_α(Q||P) for all α > 0. Plugging the PLD ω_{P/Q} into the expression (2.1), we find that for all ε ≥ 0, the hockey-stick divergence H_{e^ε}(P||Q) is given by the expression

δ(ε) = Φ( Δ/(2σ) − εσ/Δ ) − e^ε · Φ( −Δ/(2σ) − εσ/Δ ),    (2.3)

where Φ denotes the CDF of the standard univariate Gaussian distribution and Δ = |x_0 − x_1|. This formula was originally given by Balle and Wang (3). If M is of the form M(X) = f(X) + Z, where f : X^N → ℝ^d and Z ∼ N(0, σ²I_d), and Δ = max_{X ∼ X′} ||f(X) − f(X′)||_2, then for x_0 = 0, x_1 = Δ, the pair (P, Q) of the above form gives a tightly dominating pair of distributions for M (37). Subsequently, by Theorem 6, M is (ε, δ)-DP for δ(ε) given by (2.3).
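To make the formula (2.3) concrete, the following minimal sketch (our code, not the authors'; it assumes NumPy and SciPy are available) evaluates the tight δ(ε) of the Gaussian mechanism:

```python
# Minimal sketch: tight delta(eps) of the Gaussian mechanism via formula (2.3),
#   delta(eps) = Phi(D/(2*sigma) - eps*sigma/D) - e^eps * Phi(-D/(2*sigma) - eps*sigma/D),
# where D is the L2-sensitivity and Phi is the standard Gaussian CDF.
import numpy as np
from scipy.stats import norm

def gaussian_mechanism_delta(eps: float, sigma: float, sensitivity: float = 1.0) -> float:
    a = sensitivity / (2.0 * sigma)
    b = eps * sigma / sensitivity
    return norm.cdf(a - b) - np.exp(eps) * norm.cdf(-a - b)

# e.g. sigma = 2, sensitivity 1 (a 0.5-GDP mechanism): delta at eps = 1.0
print(gaussian_mechanism_delta(eps=1.0, sigma=2.0))
```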
Lemma 8 allows tight analysis of the subsampled Gaussian mechanism using the hockey-stick divergence. We state the result for the case of Poisson subsampling with sampling rate γ.

Lemma 8 (Zhu et al. (37)). If (P, Q) dominates a mechanism M for add neighbours, then (P, (1 − γ)·P + γ·Q) dominates the mechanism M ∘ S_Poisson for add neighbours and ((1 − γ)·Q + γ·P, P) dominates M ∘ S_Poisson for removal neighbours.

We will also use Rényi differential privacy (RDP) (23), which is defined as follows. The Rényi divergence of order α ∈ (1, ∞) between two distributions P and Q is defined as

D_α(P||Q) = (1/(α − 1)) · log ∫ P(t)^α Q(t)^{1−α} dt.

By continuity, lim_{α→1+} D_α(P||Q) equals the KL divergence KL(P||Q).

Definition 9. We say that a mechanism M is (α, ρ)-RDP if for all neighbouring datasets X, X′, the output distributions M(X) and M(X′) have Rényi divergence of order α less than ρ, i.e.

max_{X ∼ X′} { D_α(M(X)||M(X′)), D_α(M(X′)||M(X)) } ≤ ρ.
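As a small illustration of Definition 9 (our sketch; the closed form below is the standard one for Gaussians and is not stated explicitly in the text), the Rényi divergence between two Gaussians with shared variance is available in closed form, which immediately gives the RDP of the Gaussian mechanism:

```python
# For Gaussians with shared variance s^2 the Renyi divergence has the closed form
#   D_alpha(N(m1, s^2) || N(m2, s^2)) = alpha * (m1 - m2)^2 / (2 * s^2),
# so a Gaussian mechanism with sensitivity Delta and noise sigma is
# (alpha, alpha * Delta^2 / (2 * sigma^2))-RDP for every alpha > 1.
def renyi_divergence_gaussians(alpha: float, m1: float, m2: float, s: float) -> float:
    return alpha * (m1 - m2) ** 2 / (2.0 * s ** 2)

print(renyi_divergence_gaussians(alpha=2.0, m1=1.0, m2=0.0, s=2.0))  # 0.25
```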
2.2 INFORMAL DESCRIPTION: FILTRATIONS, SUPERMARTINGALES, STOPPING TIMES

Similarly to (34) and (12), we use the notions of filtrations and supermartingales for analysing fully adaptive compositions, where the individual worst-case pairs of distributions are not fixed but can be chosen adaptively based on the outcomes of the previous mechanisms. Given a probability space (Ω, F, P), a filtration (F_n)_{n∈ℕ} of F is a sequence of σ-algebras satisfying: (i) F_n ⊆ F_{n+1} for all n ∈ ℕ, and (ii) F_n ⊆ F for all n ∈ ℕ. In the context of the so-called natural filtration generated by a stochastic process X_t, t ∈ ℕ, the σ-algebra F_n of the filtration represents all the information contained in the outcomes of the first n random variables (X_1, ..., X_n). The law of total expectation states that if F_n ⊆ F_{n+1}, then

E( E(X | F_{n+1}) | F_n ) = E(X | F_n).

Thus, if we have the natural filtrations F_0, ..., F_n for a stochastic process X_0, ..., X_n, then

E(X_n | F_0) = E( E( ··· E(X_n | F_{n−1}) ··· | F_1 ) | F_0 ).    (2.4)

The supermartingale property means that for all n,

E(X_n | F_{n−1}) ≤ X_{n−1}.    (2.5)

From the law of total expectation it then follows that for all n ∈ ℕ, E(X_n | F_0) ≤ X_0.

We follow the analysis of Feldman and Zrnic (12) and first set a maximum number of steps (denoted by k) for the compositions. We do not release more information if a pre-defined privacy budget is exceeded. Informally speaking, the stochastic process X_n that we analyse represents the sum of the realised privacy loss up to step n and the budget remaining at that point. The privacy budget has to be constructed such that the amount of budget left at step n is included in the filtration F_{n−1}. This allows us to deduct the privacy loss of the adaptively chosen nth mechanism from the remaining budget. Mathematically, this means that the integration in E(X_n | F_{n−1}) will only be w.r.t. the outputs of the nth mechanism. Consider, e.g., the differentially private version of the gradient descent (GD) method, where the amount by which the privacy budget is spent depends on the gradient norms, which depend on the parameter values at step n − 1, i.e., they are included in F_{n−1}. Then, E(X_k | F_0) corresponds to the total privacy loss. If we can show that (2.5) holds for X_n, then by the law of total expectation the total privacy loss is less than X_0, the pre-defined budget. In our case, the total budget X_0 will equal the (ε, δ)-curve for a µ-Gaussian DP mechanism, where µ determines the total privacy budget, and E[X_T], the expectation of the consumed privacy loss at step T, will equal the (ε, δ)-curve of the fully adaptive composition to be analysed.

A discrete-valued stopping time τ is a random variable in the probability space (Ω, F, {F_t}, P) with values in ℕ which gives a decision of when to stop. It must be based only on the information present at time t, i.e., it has to hold that {τ = t} ∈ F_t. The optional stopping theorem states that if the stochastic process X_n is a supermartingale and T is a stopping time, then E[X_T] ≤ E[X_0]. In the analysis of fully adaptive compositions, the stopping time T will equal the step at which the privacy budget is about to exceed the limit B. Then, only the outputs of the (adaptively selected) mechanisms up to step T are released, and from the optional stopping theorem it follows that E[X_T] ≤ X_0.

3 FULLY ADAPTIVE COMPOSITIONS

In order to compute tight δ(ε)-bounds for fully adaptive compositions, we determine a suitable supermartingale that gives us the analogues of the RDP results of (12).

3.1 NOTATION AND THE EXISTING ANALYSIS

Similarly to Feldman and Zrnic (12), we denote the mechanism corresponding to the fully adaptive composition of the first n mechanisms as

M^{(n)}(X) = ( M_1(X), M_2(M_1(X), X), ..., M_n(M_1(X), ..., M_{n−1}(X), X) )

and the outcomes of M^{(n)}(X) as a^{(n)} = (a_1, ..., a_n). For datasets X and X′, define L^{(n)}_{X/X′} as

L^{(n)}_{X/X′} = log( P(M^{(n)}(X) = a^{(n)}) / P(M^{(n)}(X′) = a^{(n)}) )

and, given a^{(n−1)}, we define L^n_{X/X′} as the privacy loss of the mechanism M_n,

L^n_{X/X′} = log( P(M_n(a^{(n−1)}, X) = a_n) / P(M_n(a^{(n−1)}, X′) = a_n) ).

Using the Bayes rule it follows that

L^{(n)}_{X/X′} = L^{(n−1)}_{X/X′} + L^n_{X/X′} = Σ_{m=1}^{n} L^m_{X/X′}.

Whitehouse et al. (34) obtain the advanced-composition-like (ε, δ)-privacy bounds for fully adaptive compositions via a certain privacy loss martingale. Our approach, however, is motivated by the analysis of Feldman and Zrnic (12); we review its main points in Appendix A. The approach of Feldman and Zrnic (12) does not work directly in our case, since the hockey-stick divergence does not factorise the way the Rényi divergence does. However, we can determine a certain random variable via the hockey-stick divergence and show that it has the desired properties in case the individual mechanisms M_i have dominating pairs of distributions that are Gaussians. As we show, this requirement is equivalent to them being Gaussian differentially private.

3.2 GAUSSIAN DIFFERENTIAL PRIVACY

Informally speaking, a randomised mechanism M is µ-GDP, µ ≥ 0, if for all neighbouring datasets the outcomes of M are not more distinguishable than two unit-variance Gaussians µ apart from each other (7). Commonly, Gaussian differential privacy (GDP) is defined using so-called trade-off functions (7). For the purpose of this work, we equivalently formalise GDP using pairs of dominating distributions:

Lemma 10. A mechanism M is µ-GDP if and only if for all neighbouring datasets X, X′ and for all α > 0:

H_α(M(X)||M(X′)) ≤ H_α(N(0, 1)||N(µ, 1)).    (3.1)

Proof. By Corollary 2.13 of (7), a mechanism is µ-GDP if and only if it is (ε, δ(ε))-DP for all ε ≥ 0, where

δ(ε) = Φ(−ε/µ + µ/2) − e^ε · Φ(−ε/µ − µ/2).

From (2.3) we see that this is equivalent to the fact that for all neighbouring datasets X, X′ and for all ε ≥ 0: H_{e^ε}(M(X)||M(X′)) ≤ H_{e^ε}(N(0, 1)||N(µ, 1)). By Lemma 31 of (37), H_α(M(X)||M(X′)) ≤ H_α(P||Q) for all α ≥ 1 if and only if H_α(M(X)||M(X′)) ≤ H_α(Q||P) for all 0 < α ≤ 1. As P and Q are Gaussians, we see from the form of the privacy loss distribution (2.2) that H_α(Q||P) = H_α(P||Q), and hence (3.1) holds for all α > 0.
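The conversion quoted in the proof above is easy to evaluate numerically; the following hedged sketch (our code, assuming SciPy) computes the (ε, δ)-curve of a µ-GDP mechanism and inverts it by root finding:

```python
# delta(eps) for a mu-GDP mechanism (Corollary 2.13 of Dong et al. (7)) and a
# numerical inversion eps(delta); delta(eps) is decreasing in eps, so a
# bracketing root-finder suffices (assumes delta <= delta(0) so a root exists).
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def gdp_delta(eps: float, mu: float) -> float:
    return norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)

def gdp_eps(delta: float, mu: float, eps_max: float = 100.0) -> float:
    return brentq(lambda e: gdp_delta(e, mu) - delta, 0.0, eps_max)

print(gdp_eps(delta=1e-5, mu=0.5))
```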
3.3 GDP ANALYSIS OF FULLY ADAPTIVE COMPOSITIONS

Analogously to the individual RDP parameters (A.1), we define the conditional GDP parameters as

µ_m = inf{ µ ≥ 0 : M_m(·, a^{(m−1)}) is µ-GDP }.    (3.2)

By Lemma 10 above, this in particular means that for all neighbouring datasets X, X′ and for all α > 0:

H_α( M_m(X, a^{(m−1)}) || M_m(X′, a^{(m−1)}) ) ≤ H_α( N(µ_m, 1) || N(0, 1) ).

Notice that for all m, the GDP parameter µ_m depends on the history a^{(m−1)} and is therefore a random variable, similarly to the conditional RDP values ρ_m defined in (A.1).

Example: private GD. Suppose each mechanism M_i, i ∈ [k], is of the form M_i(X, a) = Σ_{x∈X} f(x, a) + N(0, σ²). Since the hockey-stick divergence is scaling invariant, and since the sensitivity of the deterministic part of M_m(X, a^{(m−1)}) is max_{x∈X} ||f(x, a^{(m−1)})||_2, we have that µ_m = max_{x∈X} ||f(x, a^{(m−1)})||_2 / σ.

We now give the main theorem, which is a GDP equivalent of (Thm. 3.1, 12).

Theorem 11. Let k denote the maximum number of compositions. Suppose that, almost surely,

Σ_{m=1}^{k} µ_m² ≤ B².

Then, M^{(k)}(X) is B-GDP.

Proof. We describe here the main points; a more detailed proof is given in Appendix B. We remark that an alternative proof of this result is given in the independent and concurrent work by Smith and Thakurta (29). First, recall the notation from Section 3.1: L^{(k)}_{X/X′} denotes the privacy loss between M^{(k)}(X) and M^{(k)}(X′) with outputs a^{(k)}. Let ε ∈ ℝ. Our proof is based on showing the supermartingale property for the random variable M_n(ε), n ∈ [k], defined as

M_k(ε) = [1 − e^{ε − L^{(k)}_{X/X′}}]_+,
M_n(ε) = E_{t∼R_n} [1 − e^{ε − L^{(n)}_{X/X′} − L_n(t)}]_+,  0 ≤ n ≤ k − 1,    (3.3)

where L_n(t) = log(R_n(t)/Q(t)), R_n is the density function of N(√(B² − Σ_{m=1}^{n} µ_m²), 1), and Q is the density function of N(0, 1). In particular, M_0(ε) = E_{t∼R_0}[1 − e^{ε − L_0(t)}]_+, where L_0(t) = log(R_0(t)/Q(t)) and R_0 is the density function of N(B, 1), which means that M_0(ε) gives δ(ε) for a B-GDP mechanism.

Let F_n denote the natural filtration σ(a^{(n)}). First, we need to show that E[M_k(ε) | F_{k−1}] ≤ M_{k−1}(ε). Since the pair of distributions (N(µ_k, 1), N(0, 1)) dominates the mechanism M_k, we have by the Bayes rule and Corollary 7,

E[M_k(ε) | F_{k−1}] = E_{a_k∼M_k} [1 − e^{ε − L^{(k−1)}_{X/X′} − L^k_{X/X′}}]_+ ≤ E_{t∼P_k} [1 − e^{ε − L^{(k−1)}_{X/X′} − L̃_k(t)}]_+,    (3.4)

where L̃_k(t) = log(P_k(t)/Q(t)), P_k is the density function of N(µ_k, 1) and Q is the density function of N(0, 1). Above we have also used the fact that L^{(k−1)}_{X/X′} ∈ F_{k−1}. The right-hand side of (3.4) is further bounded by M_{k−1}(ε); this follows from the fact that Σ_{m=1}^{k} µ_m² ≤ B² a.s., i.e., µ_k ≤ (B² − Σ_{m=1}^{k−1} µ_m²)^{1/2} a.s., and from the data-processing inequality. Moreover, we see that (B² − Σ_{m=1}^{k−1} µ_m²)^{1/2} ∈ F_{k−2}. Thus we can repeat the step (3.4) and use the fact that a composition of a µ̂_1-GDP and a µ̂_2-GDP mechanism is (µ̂_1² + µ̂_2²)^{1/2}-GDP (Cor. 3.3, 7), and by induction see that M_n(ε) is a supermartingale. By the law of total expectation (2.4), E[M_k(ε)] ≤ M_0(ε). By Theorem 6, E[M_k(ε)] = H_{e^ε}(M^{(k)}(X)||M^{(k)}(X′)) and M_0(ε) = H_{e^ε}(N(B, 1)||N(0, 1)). As ε was taken to be an arbitrary real number, the inequality E[M_k(ε)] ≤ M_0(ε) holds for all ε ∈ ℝ and by Lemma 10, M^{(k)}(X) is B-GDP.

4 INDIVIDUAL GDP FILTER

Similarly to (12), we can determine an individual GDP privacy filter that keeps track of individual privacy losses and adaptively drops the data elements for which the cumulative privacy loss is about to cross the pre-determined budget (Alg. 1). First, we define a GDP filter:

F_B(µ_1, ..., µ_t) = HALT, if Σ_{i=1}^{t} µ_i² > B²; CONT, else.    (4.1)

Also, similarly to (12), we define S(x_i, n) as the set of dataset pairs (S, S′), where |S| = n and S′ is obtained from S by deleting the data element x_i. A filter check of this form is sketched in code after this paragraph.
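The following toy sketch (our naming; an illustration rather than the authors' implementation) shows the filter check (4.1) together with the per-step GDP parameter of the private GD example, µ_m = ||f(x, a^{(m−1)})||_2 / σ:

```python
# Sketch of the GDP filter F_B of Eq. (4.1) and the per-step GDP parameter of
# private full-batch GD (the gradient vector below is a hypothetical example).
import numpy as np

def gdp_filter(mus: list, budget_B: float) -> str:
    """HALT once the sum of squared GDP parameters exceeds B^2, else CONT."""
    return "HALT" if sum(mu ** 2 for mu in mus) > budget_B ** 2 else "CONT"

def step_mu(per_example_grad: np.ndarray, sigma: float) -> float:
    """GDP parameter of one noisy GD step: ||grad||_2 / sigma."""
    return float(np.linalg.norm(per_example_grad) / sigma)

mus = []
for step in range(3):
    mu = step_mu(np.array([0.3, 0.4]), sigma=1.0)   # ||.||_2 = 0.5
    if gdp_filter(mus + [mu], budget_B=0.75) == "HALT":
        break
    mus.append(mu)
print(len(mus))  # 2 steps fit: 0.5^2 + 0.5^2 = 0.5 <= 0.75^2; a third would exceed
```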
Algorithm 1: Individual GDP Filter.
Input: budget B, maximum number of compositions k, initial value a_0.
for j = 1, ..., k do
  For each i ∈ [N], find µ^{(i)}_j ≥ 0 such that for all α > 0 and for all (S, S′) ∈ S(x_i, n):
    H_α( M_j(S, a^{(j−1)}) || M_j(S′, a^{(j−1)}) ) ≤ H_α( N(µ^{(i)}_j, 1) || N(0, 1) ).    (4.2)
  Define the active set S_j = { x_i : F_B(µ^{(i)}_1, ..., µ^{(i)}_j) = CONT }.
  For all x_i: set µ^{(i)}_j = µ^{(i)}_j · 1{x_i ∈ S_j}.
  Compute a_j = M_j(a^{(j−1)}, S_j).
end for
return a^{(j)}.

Using Theorem 11 and the supermartingale property of a personalised version of M_n(ε) (Eq. (3.3)), we can show that the output of Alg. 1 is B-GDP.

Theorem 12. Denote by M the output of Algorithm 1. M is B-GDP under the remove neighbourhood relation, meaning that for all datasets X ∈ X^N, for all i ∈ [N] and for all α > 0:

max{ H_α(M(X)||M(X_{−i})), H_α(M(X_{−i})||M(X)) } ≤ H_α( N(B, 1) || N(0, 1) ).

Proof. The proof goes the same way as the proof of (Thm. 4.3, 12), which holds for the RDP filter. Let F_t denote the natural filtration σ(a^{(t)}), and let the privacy filter F_B be defined as in (4.1). We see that the random variable

T = min{ min{ t : F_B(µ^{(i)}_1, ..., µ^{(i)}_{t+1}) = HALT }, k }

is a stopping time since {T = t} ∈ F_t. Let M_n(ε) be the random variable of Eq. (3.3) defined for the pair of datasets (X, X_{−i}) or (X_{−i}, X). From the optional stopping theorem (26) and the supermartingale property of M_n(ε) it follows that E[M_T(ε)] ≤ M_0(ε) for all ε ∈ ℝ. By the reasoning of the proof of Thm. 11, Alg. 1 is B-GDP.

4.1 BENEFITS OF GDP VS. RDP: MORE ITERATIONS FOR THE SAME PRIVACY

When we replace the RDP filter with a GDP filter for private GD, we get considerably smaller ε-values. As an example, consider the private GD experiment by Feldman and Zrnic (12) and set σ = 100 and the number of compositions k = 420 (this corresponds to a worst-case analysis ε = 0.8 for δ = 10^{−5}). When using GDP instead of RDP, we can run k = 495 iterations for an equal value of ε. Figure 3 (Section C.4) depicts the differences in (ε, δ)-values computed via RDP and GDP; the code sketch below gives an indicative comparison.
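The following back-of-the-envelope sketch (our code, not the paper's exact computation) compares the two accountings for k compositions of a Gaussian mechanism with sensitivity 1 and noise σ. Note that for the RDP side it swaps in the classical RDP-to-DP conversion ε = ρ + log(1/δ)/(α − 1); the paper uses the tighter conversion of Canonne et al. (4) (Lemma C.4), so the gap printed here is only indicative:

```python
# Indicative GDP-vs-RDP comparison for k compositions of a Gaussian mechanism
# (sensitivity 1, noise sigma). GDP: the composition is sqrt(k)/sigma-GDP.
# RDP: rho(alpha) = k * alpha / (2 sigma^2), converted with the classical rule.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def gdp_eps(delta, mu):
    f = lambda e: norm.cdf(-e/mu + mu/2) - np.exp(e)*norm.cdf(-e/mu - mu/2) - delta
    return brentq(f, 0.0, 50.0)

def rdp_eps_classical(delta, sigma, k):
    alphas = np.linspace(1.01, 200.0, 2000)
    return min(k*a/(2*sigma**2) + np.log(1/delta)/(a - 1) for a in alphas)

sigma, k, delta = 100.0, 420, 1e-5
print("GDP eps:", gdp_eps(delta, np.sqrt(k)/sigma))
print("RDP eps (classical conversion):", rdp_eps_classical(delta, sigma, k))
```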
5 APPROXIMATIVE (ε, δ)-FILTER VIA BLACKWELL'S THEOREM

We next consider a filter that can use any individual dominating pairs of distributions, not just Gaussians. To this end, we need to determine pairs of dominating distributions at each iteration.

Assumption. Given neighbouring datasets X, X′, we assume that for all i ∈ [n], we can determine a dominating pair of distributions (P_i, Q_i) such that for all α > 0,

H_α( M_i(a^{(i−1)}, X) || M_i(a^{(i−1)}, X′) ) ≤ H_α(P_i||Q_i).

A tightly dominating pair of distributions (P_i, Q_i) always exists (Proposition 8, 37), and on the other hand, uniquely determining such a pair is straightforward for, e.g., the subsampled Gaussian mechanism (see Lemma 8). For so-called shufflers, such worst-case pairs can be obtained by post-processing (11). As we show in Appendix C.6, the orderings determined by the trade-off functions and the hockey-stick divergence are equivalent. Therefore, from the Blackwell theorem (7, Thm. 2.10) it follows that there exists a stochastic transformation (Markov kernel) T such that T(P_i) = M_i(a^{(i−1)}, X) and T(Q_i) = M_i(a^{(i−1)}, X′).

First, we replace the GDP filter condition Σ µ_i² ≤ B² by the condition: there exists µ > 0 such that

E_{t_1∼P_1, ..., t_n∼P_n} [1 − e^{ε − Σ_{m=1}^{n} L_m(t_m)}]_+ ≤ H_{e^ε}( N(µ, 1) || N(0, 1) )    (5.1)

for all ε ∈ ℝ. By the Blackwell theorem there exists a stochastic transformation that maps N(µ, 1) and N(0, 1) to the product distributions P_1 × ··· × P_n and Q_1 × ··· × Q_n, respectively. From the data-processing inequality for the Rényi divergence we then have

D_α( P_1 × ··· × P_n || Q_1 × ··· × Q_n ) ≤ D_α( N(0, 1) || N(µ, 1) )    (5.2)

for all α ≥ 1, where D_α denotes the Rényi divergence of order α. Since the pairs (P_i, Q_i), 1 ≤ i ≤ n, are worst-case pairs also for RDP (as described above, due to the data-processing inequality), by (5.2) and the RDP filter results of Feldman and Zrnic (12), we have that for all α ≥ 1,

D_α( M(X) || M(X′) ) ≤ D_α( N(µ, 1) || N(0, 1) ).    (5.3)

By converting the RDPs of Gaussians in (5.3) to (ε, δ)-bounds, this procedure provides (ε, δ)-upper bounds and can be straightforwardly modified into an individual PLD filter as in the case of GDP. One difficulty, however, is how to compute the parameter µ in (5.1), given the individual pairs (P_i, Q_i), 1 ≤ i ≤ n. When the number of iterations is large, by the central limit theorem the PLD of the composition starts to resemble that of a Gaussian mechanism (30), and it is then easy to numerically approximate µ (see Fig. 1 for an example and the bisection sketch below).

It is well known that the (ε, δ)-bounds obtained via RDP are always non-tight, since the conversion of RDP to (ε, δ) is lossy (37). Moreover, the computation of the RDP values themselves is often lossy. In the procedure described here, the only loss comes from converting (5.3) to (ε, δ)-bounds. In Appendix D we show how to efficiently compute the individual PLDs numerically using FFT.

To illustrate the differences between the individual ε-values obtained with an RDP accountant and with our approximative PLD-based accountant, we consider DP-SGD training of a small feedforward network for MNIST classification. We choose a random subset of 1000 data elements and compute their individual ε-values (see Fig. 1). To compute the ε-values, we compare RDP accounting and our approach based on PLDs. We train for 50 epochs with batch size 300, noise parameter σ = 2.0 and clipping constant C = 5.0.

Figure 1: MNIST experiment. Left: a randomly chosen data element and its accurate (ε, δ)-curve after 50 epochs vs. the µ-GDP upper bound approximation. Right: comparison of individual ε-values obtained via RDPs and PLDs: histograms for 1000 randomly selected samples after 50 epochs (δ = 10^{−6}). Computation using PLDs is better able to capture small individual ε-values.
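The following sketch (our construction, assuming SciPy) illustrates one way to approximate the parameter µ of (5.1) numerically, in the spirit of the bisection approach mentioned in Appendix D.2: bisect for the smallest µ whose Gaussian curve dominates a numerically computed δ-upper bound on a grid of ε values.

```python
# Sketch: given a numerically computed upper bound delta_hat(eps) for the
# composition (e.g. from an FFT accountant), bisect for the smallest mu such
# that the mu-GDP curve dominates delta_hat on a grid of eps values.
import numpy as np
from scipy.stats import norm

def gdp_delta(eps, mu):
    return norm.cdf(-eps/mu + mu/2) - np.exp(eps)*norm.cdf(-eps/mu - mu/2)

def fit_gdp_mu(delta_hat, eps_grid, lo=1e-3, hi=50.0, iters=60):
    # gdp_delta is increasing in mu, so "dominates" is monotone in mu
    dominates = lambda mu: all(gdp_delta(e, mu) >= delta_hat(e) for e in eps_grid)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if dominates(mid) else (mid, hi)
    return hi  # smallest dominating mu found

# toy target: the curve of a 0.3-GDP mechanism; the fit should return ~0.3
eps_grid = np.linspace(0.0, 5.0, 101)
print(fit_gdp_mu(lambda e: gdp_delta(e, 0.3), eps_grid))
```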
6 EXPERIMENTS WITH MIMIC-III: GROUP-WISE ε-VALUES

For further illustration, we consider the phenotype classification task from a MIMIC-III benchmark library (16) on the clinical database MIMIC-III (17), freely available from PhysioNet (14). The task is a multi-label classification and aims to predict which of 25 acute care conditions are present in a patient's MIMIC-III record. We have trained a multi-layer perceptron to maximise the macro-averaged AUC-ROC, the task's primary metric. We train the model using DP-GD combined with the Adam optimiser, and use the individual GDP filtering Algorithm 1. See Appendix E for further details.

To study the model behaviour between subgroups, we observe five non-overlapping groups of size 1000 from the train set and of size 400 from the test set, grouped by the present acute care condition: subgroup 0: no condition at all; subgroups 1 and 2: diagnosed with/without pneumonia; subgroups 3 and 4: diagnosed with/without acute myocardial infarction (heart attack). Similarly to Yu et al. (36), we see a correlation between individual ε-values and model accuracies across the subgroups: the groups with the best privacy protection (smallest average ε-values) also have the smallest average training and test losses. Fig. 2 shows that after running the filtered DP-GD beyond the worst-case ε-threshold for a number of iterations, both the training and test loss get smaller for the best performing group and larger for the other groups. Just as DP-SGD has a disparate impact on model accuracies across subgroups (2), we find that while individual filtering leads to more equal group-wise ε-values, it leads to even larger differences in model accuracies. Here, one could alternatively consider other than algorithmic solutions for balancing the privacy protection among subgroups, e.g., collecting more data from the groups with the weakest privacy protection according to the individual ε-values (36). Finally, we observe negligible improvements of the macro-averaged AUC-ROC in the optimal hyperparameter regime using filtered DP-GD, but similarly to (12), improvements can be seen when choosing sub-optimal hyperparameters (see Appendix E.1).

Figure 2: MIMIC-III experiment and individual filtering for private GD. Comparing the test losses, training losses and average privacy losses of the subgroups before and after filtering has started (at 50 epochs). The filtering has a further disparate impact on model accuracies across subgroups.

7 CONCLUSIONS

To conclude, we have shown how to rigorously carry out fully adaptive analysis and individual DP accounting using Gaussian DP. We have also proposed an approximative (ε, δ)-accountant that can utilise any dominating pairs of distributions and shown how to implement it efficiently. As an application, we have studied the connection between group-wise individual privacy parameters and model accuracies when using DP-GD, and found that the filtering further amplifies the model accuracy imbalance between groups. An open question remains how to carry out tight fully adaptive analysis using arbitrary dominating pairs of distributions.

8 ETHICS STATEMENT

Our work is on improving differential privacy techniques, which contributes to the strong theoretical foundation of privacy-preserving machine learning, an essential component of trustworthy machine learning. Our method provides accurate estimates of individual privacy loss, therefore helping to evaluate the impact of privacy-preserving machine learning on individual privacy. Our experiments indicate that filtered DP gradient descent has a disparate impact on subgroups of data, and it should therefore be used with caution. Our experiments use the MIMIC-III dataset of pseudonymised health data by permission of the data providers. The data was processed according to usage rules defined by the data providers, and all reported results are anonymised. All the code related to the MIMIC-III dataset is publicly available (https://github.com/DPBayes/individual-accounting-gdp), as requested by PhysioNet (https://physionet.org/content/mimiciii/view-dua/1.4/).

9 REPRODUCIBILITY STATEMENT

All details missing from the proofs in the main text, as well as missing proofs and more detailed descriptions of the experiments, are given in the Appendix. The Python code needed for the experiments and plots is publicly available (https://github.com/DPBayes/individual-accounting-gdp). To compute (ε, δ)-bounds via RDP, we have used the conversion formula of Canonne et al. (4). GDP bounds are easily converted to (ε, δ)-bounds using the formula (C.2).
10 ACKNOWLEDGMENTS

The authors acknowledge CSC – IT Center for Science, Finland, and the Finnish Computing Competence Infrastructure (FCCI) for computational and data storage resources. This work was supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; and grant 325573), the Strategic Research Council at the Academy of Finland (grant 336032) as well as the European Union (project 101070617). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them.

REFERENCES

[1] Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[2] Bagdasaryan, E., Poursaeed, O., and Shmatikov, V. (2019). Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems, 32.
[3] Balle, B. and Wang, Y.-X. (2018). Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pages 394–403. PMLR.
[4] Canonne, C. L., Kamath, G., and Steinke, T. (2020). The discrete Gaussian for differential privacy. Advances in Neural Information Processing Systems, 33:15676–15688.
[5] Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301.
[6] Cummings, R. and Durfee, D. (2020). Individual sensitivity preprocessing for data privacy. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 528–547. SIAM.
[7] Dong, J., Roth, A., and Su, W. J. (2022). Gaussian differential privacy. Journal of the Royal Statistical Society Series B, 84(1):3–37.
[8] Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC, Proceedings 3, pages 265–284. Springer.
[9] Dwork, C. and Roth, A. (2014). The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407.
[10] Ebadi, H., Sands, D., and Schneider, G. (2015). Differential privacy: Now it's getting personal. ACM SIGPLAN Notices, 50(1):69–81.
[11] Feldman, V., McMillan, A., and Talwar, K. (2023). Stronger privacy amplification by shuffling for Rényi and approximate differential privacy. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4966–4981. SIAM.
[12] Feldman, V. and Zrnic, T. (2021). Individual privacy accounting via a Rényi filter. Advances in Neural Information Processing Systems, 34.
[13] Ghosh, A. and Roth, A. (2011). Selling privacy at auction. In Proceedings of the 12th ACM Conference on Electronic Commerce, pages 199–208.
[14] Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220.
[15] Gopi, S., Lee, Y. T., and Wutschitz, L. (2021). Numerical composition of differential privacy. Advances in Neural Information Processing Systems, 34:11631–11642.
[16] Harutyunyan, H., Khachatrian, H., Kale, D. C., Ver Steeg, G., and Galstyan, A. (2019). Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1):1–18.
[17] Johnson, A. E., Pollard, T. J., Shen, L., Li-Wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.
[18] Koskela, A. and Honkela, A. (2021). Computing differential privacy guarantees for heterogeneous compositions using FFT. arXiv preprint arXiv:2102.12412.
[19] Koskela, A., Jälkö, J., and Honkela, A. (2020). Computing tight differential privacy guarantees using FFT. In International Conference on Artificial Intelligence and Statistics, pages 2560–2569. PMLR.
[20] Koskela, A., Jälkö, J., Prediger, L., and Honkela, A. (2021). Tight differential privacy for discrete-valued mechanisms and for the subsampled Gaussian mechanism using FFT. In International Conference on Artificial Intelligence and Statistics, pages 3358–3366. PMLR.
[21] Lécuyer, M. (2021). Practical privacy filters and odometers with Rényi differential privacy and applications to differentially private deep learning. arXiv preprint arXiv:2103.01379.
[22] Ligett, K., Peale, C., and Reingold, O. (2020). Bounded-leakage differential privacy. In 1st Symposium on Foundations of Responsible Computing (FORC 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[23] Mironov, I. (2017). Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275.
[24] Mironov, I., Talwar, K., and Zhang, L. (2019). Rényi differential privacy of the sampled Gaussian mechanism. arXiv preprint arXiv:1908.10530.
[25] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2007). Numerical Recipes, 3rd edition: The Art of Scientific Computing. Cambridge University Press.
[26] Protter, P. (2004). Stochastic Integration and Stochastic Differential Equations. Berlin: Springer-Verlag.
[27] Redberg, R. and Wang, Y.-X. (2021). Privately publishable per-instance privacy. Advances in Neural Information Processing Systems, 34:17335–17346.
[28] Rogers, R., Roth, A., Ullman, J., and Vadhan, S. (2016). Privacy odometers and filters: Pay-as-you-go composition. Advances in Neural Information Processing Systems, 30:1929–1937.
[29] Smith, A. and Thakurta, A. (2022). Fully adaptive composition for Gaussian differential privacy. arXiv preprint arXiv:2210.17520.
[30] Sommer, D. M., Meiser, S., and Mohammadi, E. (2019). Privacy loss classes: The central limit theorem in differential privacy. Proceedings on Privacy Enhancing Technologies, 2019(2):245–269.
[31] Stoer, J. and Bulirsch, R. (2013). Introduction to Numerical Analysis, volume 12. Springer Science & Business Media.
[32] Wang, Y.-X. (2019). Per-instance differential privacy. Journal of Privacy and Confidentiality, 9(1).
[33] Wang, Y.-X., Balle, B., and Kasiviswanathan, S. P. (2019). Subsampled Rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235.
[34] Whitehouse, J., Ramdas, A., Rogers, R., and Wu, Z. S. (2022). Fully adaptive composition in differential privacy. arXiv preprint arXiv:2203.05481.
[35] Yousefpour, A., Shilov, I., Sablayrolles, A., Testuggine, D., Prasad, K., Malek, M., Nguyen, J., Ghosh, S., Bharadwaj, A., Zhao, J., et al. (2021). Opacus: User-friendly differential privacy library in PyTorch. In NeurIPS 2021 Workshop Privacy in Machine Learning.
[36] Yu, D., Kamath, G., Kulkarni, J., Yin, J., Liu, T.-Y., and Zhang, H. (2022). Per-instance privacy accounting for differentially private stochastic gradient descent. arXiv preprint arXiv:2206.02617.
[37] Zhu, Y., Dong, J., and Wang, Y.-X. (2022). Optimal accounting of differential privacy via characteristic function. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.
[38] Zhu, Y. and Wang, Y.-X. (2019). Poisson subsampled Rényi differential privacy. In International Conference on Machine Learning, pages 7634–7642.

A EXISTING ANALYSIS USING RDP BY FELDMAN AND ZRNIC (12)

We next illustrate how the stochastic process X_n used for analysing fully adaptive compositions is determined in the case of the RDP analysis (12). Central to the analysis is showing the supermartingale property (2.5) of X_n. The fully adaptive RDP analysis by Feldman and Zrnic (12) is based on studying the properties of the supermartingale M_n, which they define as

M_n(X, X′) = Loss(a^{(n)}, X, X′, α) · e^{−(α−1) Σ_{m=1}^{n} ρ_m},
Loss(a^{(n)}, X, X′, α) = ( P(M^{(n)}(X) = a^{(n)}) / P(M^{(n)}(X′) = a^{(n)}) )^α,

where ρ_m gives the RDP of order α of M_m given M_{1:m−1}(X), i.e.,

ρ_m = (1/(α−1)) log sup_{(X,X′)∈S} E_{a_m ∼ M_m(a^{(m−1)}, X′)} [ ( P(M_m(a^{(m−1)}, X) = a_m) / P(M_m(a^{(m−1)}, X′) = a_m) )^α ],    (A.1)

where S is a pre-determined set of neighbouring datasets. In particular, the RDP bounds for the fully adaptive compositions are obtained by showing that M_n(X, X′) has the supermartingale property, meaning that

E( M_n(X, X′) | F_{n−1} ) ≤ M_{n−1}(X, X′).    (A.2)

Feldman and Zrnic (12) show that from this property and from the law of total expectation (2.4) it follows that if Σ_{i=1}^{k} ρ_i ≤ B almost surely, where k is the maximum number of compositions, then the fully adaptive composition is (α, B)-RDP (Thm. 3.1, 12). Due to the factorisability of the Rényi divergence, the property (A.2) is straightforward to show for the random variable M_n(X, X′) using the Bayes theorem:

E( M_n(X, X′) | F_{n−1} )
= E( Loss(a^{(n)}, X, X′, α) · e^{−(α−1) Σ_{m=1}^{n} ρ_m} | F_{n−1} )
= E( ( P(M_n(X) = a_n | a^{(n−1)}) / P(M_n(X′) = a_n | a^{(n−1)}) )^α | F_{n−1} ) · Loss(a^{(n−1)}, X, X′, α) · e^{−(α−1) Σ_{m=1}^{n} ρ_m},    (A.3)

since ρ_1, ..., ρ_n ∈ F_{n−1} and Loss(a^{(n−1)}, X, X′, α) ∈ F_{n−1}. Moreover, as M_n is ρ_n-RDP given a^{(n−1)},

E( ( P(M_n(X) = a_n | a^{(n−1)}) / P(M_n(X′) = a_n | a^{(n−1)}) )^α | F_{n−1} ) ≤ e^{(α−1)ρ_n},

and the supermartingale property E(M_n(X, X′) | F_{n−1}) ≤ M_{n−1}(X, X′) follows from (A.3). As the hockey-stick divergence does not factorise in this way, we need to take another approach to obtain the required supermartingale.

B MAIN THEOREM

Theorem B.1. Let k denote the maximum number of compositions. Suppose that, almost surely,

Σ_{m=1}^{k} µ_m² ≤ B².    (B.1)

Then, M^{(k)}(X) is B-GDP.

Proof. First, recall the notation from Section 3.1: L^{(k)}_{X/X′} denotes the privacy loss between M^{(k)}(X) and M^{(k)}(X′) with outputs a^{(k)}. Let ε ∈ ℝ. Our proof is based on showing the supermartingale property for the random variable M_n(ε), n ∈ [k], defined as

M_k(ε) = [1 − e^{ε − L^{(k)}_{X/X′}}]_+,
M_n(ε) = E_{t∼R_n} [1 − e^{ε − L^{(n)}_{X/X′} − L_n(t)}]_+,  0 ≤ n ≤ k − 1,    (B.2)

where L_n(t) = log(R_n(t)/Q(t)), R_n is the density function of N(√(B² − Σ_{m=1}^{n} µ_m²), 1), and Q is the density function of N(0, 1). Moreover,

M_0(ε) = E_{t∼R_0} [1 − e^{ε − L_0(t)}]_+,

where L_0(t) = log(R_0(t)/Q(t)) and R_0 is the density function of N(B, 1). Notice that in particular this means that M_0(ε) gives δ(ε) for a B-GDP mechanism.
We next show that E[M_k(ε) | F_{k−1}] ≤ M_{k−1}(ε); the supermartingale property then follows by induction. Since the pair of distributions (N(µ_k, 1), N(0, 1)) dominates the mechanism M_k, we have by the Bayes rule and Corollary 7 that

E[M_k(ε) | F_{k−1}] = E_{a_k∼M_k} [1 − e^{ε − L^{(k−1)}_{X/X′} − L^k_{X/X′}}]_+ ≤ E_{t∼P_k} [1 − e^{ε − L^{(k−1)}_{X/X′} − L̃_k(t)}]_+,

where L̃_k(t) = log(P_k(t)/Q(t)), P_k is the density function of N(µ_k, 1) and Q is the density function of N(0, 1). Above we have also used the fact that L^{(k−1)}_{X/X′} ∈ F_{k−1}. Since Σ_{m=1}^{k} µ_m² ≤ B² almost surely, i.e., µ_k ≤ √(B² − Σ_{m=1}^{k−1} µ_m²) almost surely, by the data-processing inequality for the α-divergence we have that, almost surely,

E_{t∼P_k} [1 − e^{ε − L^{(k−1)}_{X/X′} − L̃_k(t)}]_+ ≤ E_{t∼R_{k−1}} [1 − e^{ε − L^{(k−1)}_{X/X′} − L_{k−1}(t)}]_+ = M_{k−1}(ε),

where L_{k−1}(t) = log(R_{k−1}(t)/Q(t)) and R_{k−1} is the density function of N(√(B² − Σ_{m=1}^{k−1} µ_m²), 1). Therefore, E[M_k(ε) | F_{k−1}] ≤ M_{k−1}(ε). Since µ_1, ..., µ_{k−1} ∈ F_{k−2}, we have that, almost surely,

E[M_{k−1}(ε) | F_{k−2}]
= E_{a_{k−1}∼M_{k−1}} E_{t∼R_{k−1}} [1 − e^{ε − L^{(k−2)}_{X/X′} − L^{k−1}_{X/X′} − L_{k−1}(t)}]_+
≤ E_{t∼R_{k−1}} E_{t_{k−1}∼P_{k−1}} [1 − e^{ε − L^{(k−2)}_{X/X′} − L̃_{k−1}(t_{k−1}) − L_{k−1}(t)}]_+
= E_{t∼R_{k−2}} [1 − e^{ε − L^{(k−2)}_{X/X′} − L_{k−2}(t)}]_+ = M_{k−2}(ε),    (B.3)

where L̃_{k−1}(t) = log(P_{k−1}(t)/Q(t)) and P_{k−1} is the density function of N(µ_{k−1}, 1). In the inequality step we use Corollary 7 and the fact that the pair of distributions (N(µ_{k−1}, 1), N(0, 1)) dominates the mechanism M_{k−1}. In the second-to-last step we have also used the fact that if P̂_1 is the density of N(µ̂_1, 1), P̂_2 the density of N(µ̂_2, 1) and Q the density of N(0, 1), and L̂_1(t) = log(P̂_1(t)/Q(t)) and L̂_2(t) = log(P̂_2(t)/Q(t)), then

E_{t_1∼P̂_1} E_{t_2∼P̂_2} [1 − e^{ε − L̂_1(t_1) − L̂_2(t_2)}]_+ = E_{t∼P̂_3} [1 − e^{ε − L̂_3(t)}]_+,

where P̂_3 is the density of N(√(µ̂_1² + µ̂_2²), 1) and L̂_3(t) = log(P̂_3(t)/Q(t)). This follows directly from the facts that the PLDs determined by the pairs of distributions (P̂_1, Q) and (P̂_2, Q) are Gaussians (see Eq. (2.2)) and that the convolution of two Gaussians is a Gaussian.

By induction, we see from (B.3) that the supermartingale property holds for the random variable M_n(ε). By the law of total expectation (2.4), E[M_k(ε)] ≤ M_0(ε). By Theorem 6, E[M_k(ε)] = H_{e^ε}(M^{(k)}(X)||M^{(k)}(X′)) and M_0(ε) = H_{e^ε}(N(B, 1)||N(0, 1)). As ε was taken to be an arbitrary real number, the inequality E[M_k(ε)] ≤ M_0(ε) holds for all ε ∈ ℝ and by Lemma 10 we see that M^{(k)}(X) is B-GDP.

C FILTERS AND ODOMETERS

We give here additional details on the GDP filters and also shortly discuss the implementation of GDP privacy odometers.

C.1 GDP PRIVACY FILTER

For simplicity, we here consider a GDP filter that chooses the privacy parameters adaptively, but not individually (unlike the filter in the main text), i.e., the amount of privacy budget spent at each step has to provide a guarantee over the whole dataset. To this end we formally define a GDP filter as

F_B(µ_1, ..., µ_t) = HALT, if Σ_{i=1}^{t} µ_i² > B²; CONT, else.

Using the filter F_B, a GDP filter algorithm is given as Alg. 2.

Algorithm 2: GDP Filter.
Input: budget B, maximum number of compositions k, initial value a_0.
for j = 1, ..., k do
  Find a parameter µ_j ≥ 0 such that M_j(·, a^{(j−1)}) is µ_j-GDP.
  if Σ_{ℓ=0}^{j} µ_ℓ² > B² then
    BREAK
  else
    a_j = M_j(X, a^{(j−1)})
  end if
end for
return a^{(j−1)}.

In principle, the supermartingale property of the random variable M_n(ε), as defined in (3.3), is sufficient to show that Alg. 2 is B-GDP. The only difference is that the algorithm can stop at a random time.
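For concreteness, a minimal Python rendering of this loop (our naming; illustrative only, not the authors' implementation):

```python
# Illustrative sketch of Algorithm 2: run adaptively chosen mu_j-GDP mechanisms
# and BREAK once the squared budget would be exceeded.
def gdp_filter_run(mechanisms, gdp_param_of, budget_B, X, a0):
    """mechanisms[j](X, history) -> output; gdp_param_of(j, history) -> mu_j."""
    history, spent = [a0], 0.0
    for j, mech in enumerate(mechanisms):
        mu_j = gdp_param_of(j, history)       # may depend on the history a^{(j-1)}
        if spent + mu_j ** 2 > budget_B ** 2:
            break                             # the filter F_B returns HALT
        history.append(mech(X, history))
        spent += mu_j ** 2
    return history
```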
To include that feature in the analysis, we need to use the optional stopping theorem.

Theorem C.1. Denote by M the output of Algorithm 2. M is B-GDP under the remove neighbourhood relation, meaning that for all datasets X ∈ X^N, for all i ∈ [N] and for all α > 0:

max{ H_α(M(X)||M(X_{−i})), H_α(M(X_{−i})||M(X)) } ≤ H_α( N(B, 1) || N(0, 1) ).    (C.1)

Proof. The proof goes exactly the same way as the proof of (Thm. 4.3, 12), which holds for the RDP filter. Using the fact that for all t ≥ 0: µ_{t+1} ∈ F_t, where F_t is the natural filtration σ(a^{(t)}), we see that the random variable T = min{ min{ t : F_B(µ_1, ..., µ_{t+1}) = HALT }, k } is a stopping time since {T = t} ∈ F_t. Let M_n(ε) be the random variable of Eq. (3.3) defined for the pair of datasets (X, X_{−i}) or (X_{−i}, X). From the optional stopping theorem and the supermartingale property it then follows that for all ε ∈ ℝ, E[M_T(ε)] ≤ M_0(ε), which by the reasoning of the proof of Thm. 11 shows that (C.1) holds, i.e., the output of Alg. 2 is B-GDP w.r.t. the removal neighbourhood relation of datasets.

A benefit of the GDP filter compared to the RDP filter is that we obtain tight (ε, δ(ε))-bounds for adaptive compositions of Gaussian mechanisms. Moreover, from Lemma 10 it follows that these tight (ε, δ(ε))-DP bounds are given by an analytic formula:

Corollary C.2. For a GDP budget B > 0, the outputs of Algorithms 1 and 2 are (ε, δ(ε))-DP for all ε ≥ 0, where

δ(ε) = Φ(−ε/B + B/2) − e^ε · Φ(−ε/B − B/2).    (C.2)

C.2 GDP PARAMETERS FOR THE INDIVIDUAL FILTERING OF PRIVATE GD

Notice that we have the following for the individual GDP filtering. Suppose each mechanism M_i, i ∈ [k], in the sequence is of the form

M_i(X, a) = Σ_{x∈X} f(x, a) + N(0, σ²).

Since the hockey-stick divergence is scaling invariant, and since the sensitivity of Σ_{x∈X} f(x, a^{(j−1)}) w.r.t. the removal of x_i is ||f(x_i, a^{(j−1)})||_2, we have that µ^{(i)}_j = ||f(x_i, a^{(j−1)})||_2 / σ.

C.3 TIGHT BOUNDS FOR THE GAUSSIAN MECHANISM

When running, e.g., the DP-GD algorithm and using either the filtering of Alg. 2 or the individual filtering of Alg. 1, each individual data element can be made to fully consume its privacy budget by appropriately scaling the gradients. This scaling for individual filtering is given in (Algorithm 3, 12).

Remark C.3. Suppose we use the Gaussian mechanism and scale the noise for each data element x_i, i ∈ [N], at the last step such that the GDP budget is fully consumed, i.e., Σ_j (µ^{(i)}_j)² = B². Then the resulting algorithm is tightly (ε, δ)-DP for δ(ε) given by the expression (C.2), in the sense that for all i ∈ [N],

max{ H_{e^ε}(M^{(k)}(X)||M^{(k)}(X_{−i})), H_{e^ε}(M^{(k)}(X_{−i})||M^{(k)}(X)) } = δ(ε).

C.4 BENEFITS OF GDP VS. RDP FILTERING

To experimentally illustrate the benefits of GDP accounting, consider one of the private GD experiments of (12), where σ = 100 and the number of compositions corresponding to the worst-case analysis is k = 420. The RDP value of order α corresponding to this iteration is then α/(2σ̃²), where σ̃ = σ/√k. Figure 3 shows the (ε, δ)-values computed via RDP and GDP. To get the (ε, δ)-values from the RDP values, we use the conversion formula of Lemma C.4 below; Rényi DP parameters are converted to (ε, δ)-DP by minimising the value given by (C.3) w.r.t. λ. When using GDP instead of RDP, we can run k = 495 iterations instead of k = 420 iterations, for an equal privacy budget of ε = 0.8 when δ = 10^{−5}.

Lemma C.4 (Canonne et al. (4)). Suppose the mechanism M is (λ, ε)-RDP. Then M is also (ε̂, δ(ε̂))-DP for arbitrary ε̂ ≥ 0 with

δ(ε̂) = ( exp((λ − 1)(ε − ε̂)) / (λ − 1) ) · (1 − 1/λ)^λ.    (C.3)
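A small sketch (our code) of this conversion, minimising (C.3) over a grid of orders λ; the computation is done in log-space so that large orders do not overflow:

```python
# RDP -> (eps, delta) conversion of Lemma C.4 (Canonne et al. (4)), minimised
# over lambda; each term is exp((lam-1)(rho-eps)) / (lam-1) * (1-1/lam)^lam.
import numpy as np

def cks_delta(eps_hat: float, lambdas, rhos) -> float:
    logs = [(lam - 1) * (rho - eps_hat) - np.log(lam - 1) + lam * np.log1p(-1.0 / lam)
            for lam, rho in zip(lambdas, rhos) if lam > 1]
    return float(min(np.exp(min(logs)), 1.0))

# Gaussian mechanism with sigma = 2, sensitivity 1: rho(lambda) = lambda / 8
lambdas = np.linspace(1.5, 100.0, 400)
print(cks_delta(2.0, lambdas, lambdas / 8))
```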
Figure 3: Comparison of RDP and GDP for private GD with σ = 100 and number of iterations k = 420 (experiment considered by Feldman and Zrnic (12)).

This means that when we replace the RDP filter with a GDP filter for private GD, we get roughly 10 percent smaller ε-values.

C.4.1 FURTHER COMPARISONS OF RDP AND GDP

Figures 4 and 5 further illustrate the differences between RDP and GDP accounting for filtering. Figure 4 shows the effect of the number of compositions k when σ = 100, and Figure 5 illustrates the maximum number of compositions for a given privacy budget ε, when σ = 100 and δ = 10^{−5}.

Figure 4: Comparison of RDP and GDP for private GD with σ = 100 and different numbers of compositions k (k = 300, 500, 900).

Figure 5: Comparison of RDP and GDP for private GD with σ = 100: maximum number of allowed steps for private GD (number of compositions k) for different values of ε, when δ = 10^{−5}.

C.5 GDP PRIVACY ODOMETERS

We also comment shortly on privacy odometers, considered e.g. by (28; 12; 21). In practice, one might want to track the privacy loss incurred so far. Rogers et al. (28) were the first to formalise this in terms of a privacy odometer. Feldman and Zrnic (12) utilise a sequence of valid Rényi privacy filters such that a fixed sequence of privacy budgets B_1, B_2, ... determines random stopping times T_1, T_2, ... such that the privacy spent up to time T_i is at most B_i. By assuming, for example, that B_{i+1} − B_i = Δ for all i, for a fixed discretisation parameter Δ > 0, we may employ the RDP filter such that whenever the privacy loss counter crosses Δ (suppose for the mth time), we release the sequence a^{(T_m)} and initialise the privacy loss counter to zero. The fact that a^{(T_m)} is mΔ-RDP follows directly from the RDP results that hold for the filters. With GDP, we can construct in exactly the same way an algorithm that always outputs states after every pre-determined amount of GDP budget is spent. If at round i we spend a Δ_i-GDP budget, by the results for GDP filters we know that B²_{i+1} = B²_i + Δ²_i, and that the output a^{(T_m)} is B̃_m-GDP, where B̃_m = √(Δ²_1 + ... + Δ²_m). A minimal odometer of this form is sketched below.
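```python
# A small sketch (our naming) of the GDP odometer described above: spend a
# Delta_i-GDP budget per round; after round m the released outputs are
# sqrt(Delta_1^2 + ... + Delta_m^2)-GDP.
import math

class GDPOdometer:
    def __init__(self):
        self.squared_budget = 0.0

    def spend(self, delta_i: float) -> float:
        """Record a delta_i-GDP release; return the running GDP guarantee."""
        self.squared_budget += delta_i ** 2   # B_{i+1}^2 = B_i^2 + Delta_i^2
        return math.sqrt(self.squared_budget)

odo = GDPOdometer()
for d in [0.1, 0.2, 0.2]:
    print(odo.spend(d))   # 0.1, 0.2236..., 0.3
```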
C.6 BLACKWELL'S THEOREM VIA DOMINATING PAIRS OF DISTRIBUTIONS

There is a one-to-one relationship between the orderings determined by trade-off functions and by the hockey-stick divergence, and this follows from the results of Zhu et al. (37) as follows. Let P and Q be probability distributions, denote by T[P, Q] the trade-off function determined by P and Q, and let H_α(P||Q) denote the hockey-stick divergence of order α > 0. Lemma 20 of (37) is a restatement of (Proposition 2.12, 7) and states that for any ε ∈ ℝ,

H_{e^ε}(P||Q) = 1 + T[P, Q]*(−e^ε),    (C.4)

where, for a trade-off function f, f* denotes the convex conjugate f*(y) = sup_{x∈ℝ} { yx − f(x) }. From (C.4) the equivalence follows directly: if (P̃, Q̃) is another pair of probability distributions, then H_{e^ε}(P̃||Q̃) ≤ H_{e^ε}(P||Q) for all ε ∈ ℝ if and only if T[P̃, Q̃] ≥ T[P, Q]. This also means that, if H_α(P̃||Q̃) ≤ H_α(P||Q) for all α > 0, then by the Blackwell theorem (see e.g. Thm. 2.10, 7) there exists a stochastic transformation (Markov kernel) T such that T(P) = P̃ and T(Q) = Q̃.

D EFFICIENT INDIVIDUAL NUMERICAL ACCOUNTING FOR DP-SGD

We next show how to compute the individual PLDs for DP-SGD. These are needed when implementing the approximative individual (ε, δ)-accountant described in Section 5. The errors arising from the approximations are generally negligible, and a rigorous error analysis could be carried out using the techniques presented in (20) and (15). The numerical approximation is based on:

1. a numerical σ-grid which allows evaluating upper bounds for the δ's efficiently: we precompute FFTs for different σ-values, and no additional FFT computations are then needed during the evaluation of the individual ε-values. By the data-processing inequality, this grid approximation also leads to upper (ε, δ)-bounds;
2. the Plancherel theorem, which removes the need to compute inverse FFTs when evaluating individual PLDs.

First, we recall some basics about numerical accounting using FFT (see also (19; 15)).

D.1 NUMERICAL EVALUATION OF DP PARAMETERS USING FFT

We use the Fast Fourier Transform (FFT)-based method by Koskela et al. (19; 20) called the Fourier Accountant (FA). The same approximation could be carried out with the PRV accountant by Gopi et al. (15). Using FFT requires that we truncate the PLD ω and place it on an equidistant numerical grid over an interval [−L, L], L > 0. Convolutions are evaluated using the FFT algorithm, and using the existing error analysis (see e.g. (20)), the error incurred by the numerical FFT approximation can be bounded.

The Fast Fourier Transform (FFT) is described as follows (5). Let x = [x_0, ..., x_{n−1}]^T, w = [w_0, ..., w_{n−1}]^T ∈ ℝ^n. The discrete Fourier transform F and its inverse F^{−1} are defined as (31)

(Fx)_k = Σ_{j=0}^{n−1} x_j e^{−i·2πkj/n},  (F^{−1}w)_k = (1/n) Σ_{j=0}^{n−1} w_j e^{i·2πkj/n},

where i = √(−1). Using FFT, the running time of evaluating Fx and F^{−1}w reduces to O(n log n). FFT also enables evaluating discrete convolutions efficiently via the so-called convolution theorem. For obtaining computational speed-ups, we use the Plancherel theorem (Chpt. 12, 25), which states that the DFT preserves inner products: for x, y ∈ ℝ^n,

⟨x, y⟩ = (1/n) ⟨F(x), F(y)⟩.

When using the FA to approximate δ(ε), we need to evaluate an expression of the form

b_k = D F^{−1}( F(Da_1)^{⊙k_1} ⊙ ··· ⊙ F(Da_m)^{⊙k_m} ),  D = [ 0  I_{n/2} ; I_{n/2}  0 ],

where ⊙ denotes the element-wise product, a_i corresponds to a numerical PLD for a combination of DP hyperparameters i, and k_i is the number of times the composition contains a mechanism with this PLD. An approximation of δ(ε) is then obtained from the discrete sum that approximates the hockey-stick integral:

δ̃(ε) = Σ_{−L+ℓΔx > ε} ( 1 − e^{ε − (−L+ℓΔx)} ) (b_k)_ℓ.

The Plancherel theorem gives the following:

Lemma D.1. Let δ̃(ε) and b_k be defined as above. Denote by w_ε ∈ ℝ^n the vector such that

(w_ε)_ℓ = max{ 0, 1 − e^{ε − (−L+ℓΔx)} }.    (D.1)

Then, we have that

δ̃(ε) = (1/n) ⟨ F(Dw_ε), F(Da_1)^{⊙k_1} ⊙ ··· ⊙ F(Da_m)^{⊙k_m} ⟩.

Proof. See (18).
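The following self-contained sketch (our discretisation choices; grid sizes are illustrative) composes one PLD k times with FFTs and reads off δ̃(ε) in the Plancherel form of Lemma D.1, with the half-swap permutation playing the role of D:

```python
# Sketch of the FA pipeline: discretise a PLD on [-L, L), compose it k times
# via the convolution theorem, and evaluate delta(eps) with Lemma D.1 (no
# inverse FFT needed). D is the half-swap permutation [[0, I],[I, 0]].
import numpy as np

L, n = 20.0, 2 ** 14
dx = 2 * L / n
grid = -L + dx * np.arange(n)                       # grid points -L + l*dx
D = np.concatenate([np.arange(n // 2, n), np.arange(n // 2)])

def fa_delta(pld_masses: np.ndarray, k: int, eps: float) -> float:
    Fa = np.fft.fft(pld_masses[D])                  # F(D a)
    w = np.maximum(0.0, 1.0 - np.exp(eps - grid))   # weight vector (D.1)
    Fw = np.fft.fft(w[D])
    return float(np.real(np.vdot(Fw, Fa ** k)) / n) # (1/n) <F(D w), F(D a)^k>

# toy input: the PLD of a 0.1-GDP step is N(mu^2/2, mu^2) (Eq. (2.2)); k = 25
# compositions should then approximate delta(eps) of a 0.5-GDP mechanism.
mu = 0.1
masses = np.exp(-(grid - mu ** 2 / 2) ** 2 / (2 * mu ** 2))
masses /= masses.sum()
print(fa_delta(masses, k=25, eps=1.0))
```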
We immediately see that if both F(Dw_ε) and F(Da_i), 1 ≤ i ≤ m, are precomputed, δ̃(ε) can be computed in O(n) time (where n is the number of discretisation points for the PLD). We can utilise this by placing the individual DP hyperparameters into well-chosen buckets and by pre-computing the FFTs corresponding to the hyperparameter values of each bucket. Then, the approximative numerical PLD for each sequence of DP hyperparameters (e.g., a sequence of noise ratios) can be written in the form F(Da_1)^{⊙k_1} ⊙ ··· ⊙ F(Da_m)^{⊙k_m}, where the k_i's correspond to the numbers of elements in each bucket. If we also have F(Dw_ε) precomputed for different values of ε, we can easily construct a numerical accountant that outputs an approximation of ε as a function of δ.

D.2 NOISE VARIANCE GRID FOR FAST INDIVIDUAL ACCOUNTING

We next show how to construct the DP hyperparameter grid for DP-SGD: a numerical σ-grid. We remark that Yu et al. (36) carry out similar approximations to speed up their approximative individual RDP accountants. Suppose we have models a_0, ..., a_T as an output of the DP-SGD iteration that we run with subsampling ratio q, clipping constant C > 0 and noise parameter σ. Also, suppose that for a given data element x, along the iteration the gradients have norms C_{x,i} := ||∇_θ f(a_i, x)||, 0 ≤ i ≤ T − 1. We then get the individual ε_x-value (or, more generally, the individual numerical PLD) for the entry x by considering heterogeneous compositions of DP-SGD mechanisms with parameter values q and

σ̃_{x,i} = (C / C_{x,i}) · σ,  0 ≤ i ≤ T − 1.

A naive approach would require up to T FFT evaluations, which quickly becomes computationally heavy. For the approximation, we determine a σ-grid Σ = {σ_0, ..., σ_{n_σ}}, where n_σ ∈ ℤ_+ is the number of intervals in the grid and

σ_i = σ_min + i · (σ_max − σ_min)/n_σ.

We then encode the sequence of noise ratios Σ̃ := {σ̃_{x,0}, ..., σ̃_{x,n−1}} into a tuple of integers k = (k_0, k_1, ..., k_{n_σ}), where

k_i = #{ σ̃ ∈ Σ̃ : σ_i ≤ σ̃ < σ_{i+1} } for i < n_σ,  and  k_{n_σ} = #{ σ̃ ∈ Σ̃ : σ_{n_σ} ≤ σ̃ },    (D.2)

i.e., k_i is the number of scaled noise parameters σ̃ hitting bin number i of the grid Σ. By the construction of the approximation, we have the following:

Theorem D.2. Consider the approximation described above. Denote the FFT transformation of the approximative numerical PLD obtained with the Σ-grid as

ã_x = F(Dã_1)^{⊙k_1} ⊙ ··· ⊙ F(Dã_{n_σ})^{⊙k_{n_σ}}

and the corresponding δ (as a function of ε), as given by Lemma D.1, as

δ̃(ε) = (1/n) ⟨ F(Dw_ε), ã_x ⟩,

where w_ε is the weight vector (D.1). Then, we have for each ε ≥ 0:

δ(ε) ≤ δ̃(ε) + err,

where δ(ε) is the tight value of δ corresponding to the actual sequence of noise ratios {σ̃_{x,0}, ..., σ̃_{x,n−1}} and err denotes the (controllable) numerical error arising from the discretisations of the PLDs.

Proof. The result follows from the data-processing inequality, since each σ̃_{x,i}-value is placed in the bucket corresponding to a smaller noise ratio. The numerical error term err can be bounded using the techniques and results of (15).

The important point here is that if the FFTs F(Dã_i), 0 ≤ i ≤ n_σ, are precomputed, as well as F(Dw_ε), then evaluating δ̃(ε) is an O(n) operation. To implement the approximative accountant described in Section 5, we numerically approximate the individual upper bound µ-GDP values using the bisection method. A sketch of the bucketing step is given below.
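A short sketch of the bucketing (D.2) (our code; the example ratios are hypothetical):

```python
# Bucketing of Eq. (D.2): place per-element noise ratios sigma~ = (C/C_{x,i})*sigma
# into n_sigma + 1 buckets so that FFTs of the bucket PLDs can be precomputed
# once and reused. Rounding down to the bucket with the smaller noise ratio is
# conservative (Theorem D.2); values below sigma_min are clamped to bucket 0.
import numpy as np

def bucket_noise_ratios(ratios, sigma_min, sigma_max, n_sigma):
    """Return the counts k_0, ..., k_{n_sigma} of Eq. (D.2)."""
    edges = sigma_min + np.arange(n_sigma + 1) * (sigma_max - sigma_min) / n_sigma
    counts = np.zeros(n_sigma + 1, dtype=int)
    for r in ratios:
        i = np.searchsorted(edges, r, side="right") - 1
        counts[min(max(i, 0), n_sigma)] += 1
    return counts

ratios = [2.3, 2.4, 5.0, 7.2, 11.0]                 # hypothetical per-step ratios
print(bucket_noise_ratios(ratios, 2.0, 10.0, 4))    # [2, 1, 1, 0, 1]
```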
D.3 POISSON SUBSAMPLING OF THE GAUSSIAN MECHANISM

For completeness, we show how to determine the PLDs for the Poisson subsampled Gaussian mechanism, as required for the individual accounting of DP-SGD. Consider the Gaussian mechanism

$\mathcal{M}(X) = f(X) + \mathcal{N}(0, \sigma^2 I_d),$

where $f$ maps datasets to $\mathbb{R}^d$. If the datasets $X$ and $X'$ are neighbours such that $X = X' \cup \{x'\}$ for some entry $x'$, then from the translation invariance of the hockey-stick divergence and the unitary invariance of the Gaussian noise it follows that, for all $\alpha \ge 0$,

$H_\alpha\big( \mathcal{M}(X) \,\|\, \mathcal{M}(X') \big) = H_\alpha\big( \mathcal{N}(\lVert f(x') \rVert_2, \sigma^2) \,\|\, \mathcal{N}(0, \sigma^2) \big).$

Furthermore, from the scaling invariance of the hockey-stick divergence, we have for all $\alpha \ge 0$

$H_\alpha\big( \mathcal{M}(X) \,\|\, \mathcal{M}(X') \big) = H_\alpha\big( \mathcal{N}(C, \sigma^2) \,\|\, \mathcal{N}(0, \sigma^2) \big) = H_\alpha\big( \mathcal{N}(1, (\sigma/C)^2) \,\|\, \mathcal{N}(0, (\sigma/C)^2) \big),$

where $C = \lVert f(x') \rVert_2$. Using the subsampling amplification results of Zhu et al. (37), we get a unique worst-case pair $(P, Q)$, where

$P = q\, \mathcal{N}(1, \widetilde{\sigma}^2) + (1 - q)\, \mathcal{N}(0, \widetilde{\sigma}^2), \qquad Q = \mathcal{N}(0, \widetilde{\sigma}^2),$

and $\widetilde{\sigma} = \sigma / C$. The PLD $\omega_{P/Q}$ is then determined by $P$ and $Q$ as defined in Def. 5.
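As a sketch of how $\omega_{P/Q}$ can be discretised onto the numerical grid in practice, the snippet below evaluates the privacy loss $\log \frac{\mathrm{d}P}{\mathrm{d}Q}(t) = \log\big( q\, \mathrm{e}^{(2t-1)/(2\widetilde{\sigma}^2)} + 1 - q \big)$ under $t \sim P$ and accumulates the probability mass onto the grid over $[-L, L)$. The function name, the outcome-grid resolution and the simple truncation of tail mass into the end bins are our own simplifications (we also assume $0 < q < 1$); a rigorous implementation would bound these errors as in (20) and (15).

```python
import numpy as np
from scipy.stats import norm

def pld_subsampled_gaussian(q, sigma_t, L, n, t_range=30.0, m=1_000_000):
    """Discretised PLD omega_{P/Q} for the Poisson subsampled Gaussian:
    P = q*N(1, s^2) + (1-q)*N(0, s^2), Q = N(0, s^2), s = sigma/C.
    Returns probability masses on the grid -L + l*dx, l = 0, ..., n-1."""
    dx = 2.0 * L / n
    # fine grid of outcomes t, with probability masses under P
    t = np.linspace(-t_range, t_range, m)
    dt = t[1] - t[0]
    p_mass = (q * norm.pdf(t, loc=1.0, scale=sigma_t)
              + (1.0 - q) * norm.pdf(t, loc=0.0, scale=sigma_t)) * dt
    # privacy loss log(dP/dQ)(t), computed stably via logaddexp
    loss = np.logaddexp(np.log(q) + (2.0 * t - 1.0) / (2.0 * sigma_t**2),
                        np.log1p(-q))
    # accumulate onto the PLD grid; tail mass is lumped into the end bins
    idx = np.clip(np.floor((loss + L) / dx).astype(int), 0, n - 1)
    omega = np.zeros(n)
    np.add.at(omega, idx, p_mass)
    return omega
```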
E EXPERIMENTS WITH MIMIC-III

We use the preprocessing provided by (16) to obtain the train and test data for the phenotyping task, and refer to (16) for the details of the preprocessing pipeline and of the phenotyping task. We tune the hyperparameters with Bayesian optimisation using the hyperparameter tuning library Optuna (1) to maximise the macro-averaged AUC-ROC, the task's primary metric. We train using DP-GD and opacus (Yousefpour et al.) with noise parameter σ ≈ 10.61 and determine the optimal clipping constant as C ≈ 0.79 in our training runs. We compute the budget B so that filtering starts after 50 epochs and set the maximum number of epochs to 100. With these parameter values, ε = 2.75 when δ = 10⁻⁵.

E.1 EFFECT OF SUBOPTIMAL HYPERPARAMETER VALUES ON FILTERED DP-GD

We study the effect of choosing suboptimal clipping constants by evaluating the effects of filtering with clipping constants ranging from half the optimum to five times the optimum (Figure 6). We observe that filtering improves the utility only for clipping constants that are suboptimal (e.g. 5x the optimum). Our observations complement those of (12), who also observe the largest improvements from filtering in suboptimal hyperparameter regimes.

Figure 6: Filtered DP-GD with a maximum privacy loss of (ε, δ) = (2.75, 10⁻⁵) using clipping constants of 0.5x, 1.0x, 1.5x, 3.0x and 5.0x the optimum; the rest of the hyperparameters are tuned. Left: the test AUC-ROC as a function of epochs; the red vertical line denotes the starting point of the filtering. Right: the number of active elements.

E.2 HISTOGRAMS OF INDIVIDUAL ε-VALUES FOR THE MIMIC-III EXPERIMENT

As described in the main text, to observe the differences across subgroups, we choose five non-overlapping groups of size 1000 based on the following criteria: subgroup 0: no diagnosis at all; subgroups 1 and 2: pneumonia / no pneumonia; subgroups 3 and 4: heart attack / no heart attack. In the training data, there are in total 2072 cases without a diagnosis, 4105 pneumonia cases and 9413 heart attack cases. We remark that it is not uncommon for a patient to have multiple conditions. During the training, we track the gradient norms $C_{x,i}$ for all elements of the training dataset, and can thus compute the individual ε-values after a given number of iterations for a given value of δ (here δ = 10⁻⁵). Figures 7 and 8 display histograms of the individual ε-values after 50 epochs. With the optimal clipping constant, a majority of the data points have an individual ε close to 2.75, i.e. near the budget. With a clipping constant that is 5x the optimum, most individual ε-values are significantly smaller than 2.75.

Figure 7: Histogram of individual ε after 50 epochs without filtering when using the optimal clipping constant. A majority of the individual ε-values are near ε = 2.75, which means that the corresponding elements will be deactivated in epoch 51, which uses filtering.

Figure 8: Histogram of individual ε after 50 epochs without filtering when using a clipping constant that is 5x the optimum. A majority of the individual ε-values are far from ε = 2.75, which means that the corresponding elements will not be instantly deactivated in epoch 51, which uses filtering.

E.3 FURTHER EXPERIMENTAL RESULTS

We run the same experiment as above but with maximum privacy losses of ε = 0.5 and ε = 10.0 instead of (ε = 2.75, δ = 10⁻⁵). Figures 9 and 10 depict the performance (test AUC-ROC curves) in these cases. Similarly to the experiments of (12), we see that the overall performance slightly increases when using the individual filtering.

Figure 9: Filtered DP-GD with a maximum privacy loss of (ε, δ) = (0.5, 10⁻⁵) using tuned hyperparameters. Left: the test AUC-ROC as a function of epochs. Right: the number of active elements.

Figure 10: Filtered DP-GD with a maximum privacy loss of (ε, δ) = (10.0, 10⁻⁵) using tuned hyperparameters. Left: the test AUC-ROC as a function of epochs. Right: the number of active elements.