# Numerical Composition of Differential Privacy

Sivakanth Gopi (Microsoft Research, sigopi@microsoft.com) · Yin Tat Lee (University of Washington, yintat@uw.edu) · Lukas Wutschitz (Microsoft, lukas.wutschitz@microsoft.com)

Abstract. We give a fast algorithm to optimally compose privacy guarantees of differentially private (DP) algorithms to arbitrary accuracy. Our method is based on the notion of privacy loss random variables to quantify the privacy loss of DP algorithms. The running time and memory needed for our algorithm to approximate the privacy curve of a DP algorithm composed with itself k times is Õ(√k). This improves over the best prior method by [KH21], which requires Ω(k^{1.5}) running time. We demonstrate the utility of our algorithm by accurately computing the privacy loss of the DP-SGD algorithm of Abadi et al. [ACG+16] and showing that our algorithm speeds up the privacy computations by a few orders of magnitude compared to prior work, while maintaining similar accuracy.

Author ordering is alphabetical. Code is available at https://github.com/microsoft/prv_accountant.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

1 Introduction

Differential privacy (DP), introduced by [DMNS06], provides a provable and quantifiable guarantee of privacy when the results of an algorithm run on private data are made public. Formally, we can define an (ε, δ)-differentially private algorithm as follows.

Definition 1.1 ((ε, δ)-DP [DMNS06, DKM+06]). An algorithm M is (ε, δ)-DP if for any two neighboring databases D, D′ differing in exactly one user and any subset S of outputs, we have

  Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.

Intuitively, it says that looking at the outcome of M, we cannot tell whether it was run on D or D′. Hence an adversary cannot infer the presence of any particular user in the input database, and therefore cannot learn any personal data of any particular user.

DP algorithms have an important property called composition. Suppose M_1 and M_2 are DP algorithms and let M(D) = (M_1(D), M_2(D)), i.e., M runs both algorithms on D and outputs their results. Then M is also a DP algorithm.

Proposition 1.2 (Simple composition [DKM+06, DL09]). If M_1 is (ε_1, δ_1)-DP and M_2 is (ε_2, δ_2)-DP, then M(D) = (M_1(D), M_2(D)) is (ε_1 + ε_2, δ_1 + δ_2)-DP. This also holds under adaptive composition (denoted by M = M_2 ∘ M_1), where M_2 can look at both the database and the output of M_1 (here M(D) = (M_1(D), M_2(D, M_1(D)))).

It turns out that both compositions enjoy much better DP guarantees than this simple composition rule. Let M be an (ε, δ)-DP algorithm and let M^k denote the (adaptive) composition of M with itself k times. The naive composition rule shows that M^k is (kε, kδ)-DP. This was significantly improved in [DRV10].

Proposition 1.3 (Advanced composition [DRV10, DR+14]). If M is (ε, δ)-DP, then M^k is (ε′, kδ + δ′)-DP where

  ε′ = √(2k·ln(1/δ′)) · ε + k·ε·(e^ε − 1).

Note that if ε = O(1/√k) and δ = o(1/k), then M^k satisfies (O_{δ′}(1), δ′)-DP. Using simple composition (Proposition 1.2), we can only claim that M^k is (O(√k), o(1))-DP. Thus advanced composition often results in a √k-factor savings in privacy, which is significant in practice.
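To make the √k savings concrete, the sketch below evaluates both bounds numerically. This is a minimal illustration of Propositions 1.2 and 1.3; the helper names are ours, not from the paper's codebase.

```python
import math

def simple_composition_eps(eps: float, k: int) -> float:
    # Proposition 1.2 applied k times: the epsilons simply add up.
    return k * eps

def advanced_composition_eps(eps: float, k: int, delta_prime: float) -> float:
    # Proposition 1.3: eps' = sqrt(2k ln(1/delta')) * eps + k * eps * (e^eps - 1).
    return (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
            + k * eps * math.expm1(eps))

k = 10_000
eps = 1 / math.sqrt(k)          # eps = O(1/sqrt(k)) regime
delta_prime = 1e-6
print(simple_composition_eps(eps, k))                 # 100.0
print(advanced_composition_eps(eps, k, delta_prime))  # ~6.26
```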
The optimal DP guarantees for k-fold composition of an (ε, δ)-DP algorithm were finally obtained by [KOV15]. For composing different algorithms, the situation is more complicated. If M_1, M_2, ..., M_k are DP algorithms such that M_i is (ε_i, δ_i)-DP, then it is shown by [MV16] that computing the exact DP guarantees for M = M_1 ∘ M_2 ∘ ... ∘ M_k is #P-complete. They also give an algorithm to approximate the DP guarantees of M to a desired accuracy η, which runs in time

  O(k³ ε̄ (1 + ε̄) / η),  (1)

where ε̄ = (Σ_{i=1}^k ε_i)/k (here ε has an additive error of η and δ has a multiplicative error of η). If each ε_i ≈ 1/√k (so that M satisfies reasonable privacy guarantees by advanced composition), then the running time is O(k^{2.5}/η).

In most situations, DP algorithms come with a collection of (ε, δ)-DP guarantees, i.e., for each value of ε, there exists a δ such that the algorithm is (ε, δ)-DP.

Definition 1.4 (Privacy curve). A DP algorithm M is said to have privacy curve δ : R → [0, 1] if for every ε ∈ R, M is (ε, δ(ε))-DP.

For example, the privacy curve of a Gaussian mechanism (with sensitivity 1 and noise scale σ) is given by δ(ε) = Φ(−εσ + 1/(2σ)) − e^ε · Φ(−εσ − 1/(2σ)), where Φ(·) is the Gaussian CDF [BW18]. Suppose we want to compose several Gaussian mechanisms: which (ε, δ)-DP guarantee should we choose for each mechanism? Any single choice will lead to suboptimal DP guarantees for the final composition. Instead, we need a way to compose the privacy curves directly. This was suggested through the use of the privacy region in [KOV15] and explicitly studied in the f-DP framework of [DRS19]; f-DP is a dual (and equivalent) way to look at the privacy curve δ(ε). Independently, an algorithm called Privacy Buckets for approximately composing privacy curves was initiated in [MM18]. This algorithm depends on the notion of the Privacy Loss Random Variable (PRV) [DR16], whose distribution is called the Privacy Loss Distribution (PLD). One can associate a PRV with any DP algorithm, and the privacy curve of that algorithm can easily be obtained from the PRV. The really useful property of PRVs is that under adaptive composition, they simply add up: the PRV Y of the composition M = M_1 ∘ M_2 ∘ ... ∘ M_k is given by Y = Σ_{i=1}^k Y_i, where Y_i is the PRV of M_i. ([KJH+20] only state this for non-adaptive composition; in this paper we show how to extend it to adaptive composition as well.) Therefore, one can find the distribution of Y by convolving the distributions of Y_1, Y_2, ..., Y_k.

In an important paper, [KJH+20] proposed that one can speed up the convolutions using the Fast Fourier Transform (FFT), as sketched below. Explicit error bounds for the approximation produced by their algorithm were obtained in [KJH+20, KJPH21, KH21]. The running time of this algorithm was analyzed in [KH21], where it was shown that the privacy curve δ_M(ε) of M = M_1 ∘ M_2 ∘ ... ∘ M_k can be computed up to an additive error of δ_error in time

  O(k³ ε̄ (1 + ε̄) / δ_error)  (2)

if each algorithm M_i satisfies (ε_i, 0)-DP and ε̄ = (1/k) Σ_{i=1}^k ε_i. Assuming that each ε_i ≈ 1/√k, we get O(k^{2.5}/δ_error) running time. Note that this is slightly worse than (1), where the denominator η is the multiplicative error in δ_M, whereas δ_error here is an additive error. When composing the same algorithm with itself k times, the running time can be improved to O(k² ε̄ / δ_error), which is O(k^{1.5}/δ_error) when ε̄ ≈ 1/√k.
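The following minimal numpy sketch illustrates the core FFT idea (our own illustration, not the algorithm of [KJH+20]): the k-fold convolution of a discretized distribution is computed by raising its discrete Fourier transform to the k-th power. The example PMF is a placeholder; Algorithm 1 below will instead work with circular convolution on a fixed grid.

```python
import numpy as np

def self_convolve_pmf(pmf: np.ndarray, k: int) -> np.ndarray:
    """k-fold convolution of a discrete PMF via FFT.

    The transform length is padded so the full support of the k-fold
    sum fits without wrap-around (no circular aliasing).
    """
    n = len(pmf)
    out_len = k * (n - 1) + 1
    fft_len = 1 << (out_len - 1).bit_length()   # next power of two
    f = np.fft.rfft(pmf, fft_len)
    out = np.fft.irfft(f ** k, fft_len)[:out_len]
    return np.clip(out, 0, None)                # remove tiny negative FFT noise

# Example: distribution of the sum of 100 fair dice.
pmf = np.full(6, 1 / 6)
pmf100 = self_convolve_pmf(pmf, 100)
print(pmf100.sum())  # ~1.0
```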
Moments Accountant and Renyi DP. In an influential paper introducing differentially private deep learning, [ACG+16] proposed a method called the Moments Accountant (MA) for giving an upper bound on the privacy curve of a composition of DP algorithms. They applied their method to bound the privacy loss of the differentially private stochastic gradient descent (DP-SGD) algorithm, which they also introduced. Analyzing the privacy loss of DP-SGD involves composing the privacy curve of each training iteration with itself k times, where k is the total number of training iterations. Typical values of k range from 1,000 to 300,000 (such as when training large models like GPT-3).

The Moments Accountant was subsumed into the framework of Renyi Differential Privacy (RDP) introduced by [Mir17]. The running times of these accountants are independent of k, but they only give an upper bound and cannot approximate the privacy curve to arbitrary accuracy. DP-SGD is one of the most important DP algorithms in practice, because one can use it to train neural networks with good privacy-vs-utility tradeoffs. Therefore obtaining accurate and tight privacy guarantees for DP-SGD is important: for example, reducing ε from 2 to 1 can mean that one can train the network for 4 times more epochs while staying within the same privacy budget. DP-SGD is thus one of the main motivations for this work. There are also situations where the PRVs do not have bounded moments, so the Moments Accountant and Renyi DP cannot be applied to analyze privacy. An example of such an algorithm is the DP-SGD-JL algorithm of [BGK+21], which uses numerical composition of PRVs to analyze privacy.

GDP Accountant. [DRS19, BDLS19] introduced the notion of Gaussian Differential Privacy (GDP) and used it to develop an accountant for DP-SGD. The accountant is based on the central limit theorem and only gives an approximation to the true privacy curve, where the approximation gets better with k. But as we show in Figure 1, the GDP accountant can significantly underreport the true epsilon value.

Several other notions of privacy were introduced for obtaining good upper bounds on the privacy curve of compositions of DP algorithms, such as Concentrated DP (CDP) [DR16, BS16] and Truncated CDP [BDRS18]. None of these methods can approximate the privacy curve of compositions to arbitrary accuracy. The notion of f-DP introduced by [DRS19] allows for a lossless composition theorem, but computing the privacy curve of a composition appears computationally hard and they do not give any algorithms for doing it.

1.1 Our Contributions

The main contribution of this work is a new algorithm, with an improved analysis, for computing the privacy curve of the composition of a large number of DP algorithms.

Theorem 1.5 (Informal version of Theorem 5.5). Suppose M_1, M_2, ..., M_k are DP algorithms. Then the privacy curve δ_M(ε) of the adaptive composition M = M_1 ∘ M_2 ∘ ... ∘ M_k can be approximated in time

  O( ε_upper · k^{1.5} · log k · √(log(1/δ_error)) / ε_error ),

where ε_error is the additive error in ε, δ_error is the additive error in δ, and ε_upper is an upper bound on max{ε_M(δ_error), max_i ε_{M_i}(δ_error)} (here ε_A(δ) denotes the inverse of δ_A(ε)).

If each M_i satisfies (1/√k, 0)-DP, then by advanced composition (Proposition 1.3) we can set ε_upper = O(1). Therefore the running time of our algorithm in this case is O(k^{1.5} · log k · √(log(1/δ_error)) / ε_error). We can save a factor of k when we compose the same algorithm with itself k times.

Theorem 1.6. Suppose M is a DP algorithm. Then the privacy curve δ_{M^k}(ε) of M (adaptively) composed with itself k times can be approximated in time

  O( ε_upper · k^{1/2} · log k · √(log(1/δ_error)) / ε_error ),

where ε_error is the additive error in ε, δ_error is the additive error in δ, and ε_upper is an upper bound on max{ε_{M^k}(δ_error), ε_M(δ_error)}.

Thus we improve the state of the art by at least a factor of k in running time. We also note that our algorithm improves the memory required by a factor of k. See Figure 1 for a comparison of our algorithm with that of [KJPH21]. Also note that the RDP Accountant (equivalent to the Moments Accountant) significantly overestimates the true ε, while the GDP Accountant significantly underestimates it. In contrast, the upper and lower bounds provided by our algorithm lie very close to each other.
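As a point of reference for the RDP curve in Figure 1, the RDP upper bound for the (unsubsampled) Gaussian mechanism can be reproduced in a few lines. This is a minimal sketch using the standard Gaussian RDP bound α/(2σ²) and the RDP-to-DP conversion of [Mir17]; it is not the accountant used in our experiments.

```python
import math

def rdp_epsilon_gaussian(sigma: float, k: int, delta: float) -> float:
    """Upper bound on eps(delta) for the k-fold composition of the
    Gaussian mechanism via Renyi DP.

    RDP of one Gaussian mechanism at order alpha is alpha / (2 sigma^2);
    composition multiplies it by k; then convert to (eps, delta)-DP with
    eps = k * alpha / (2 sigma^2) + log(1/delta) / (alpha - 1),
    minimized over alpha > 1.
    """
    best = float("inf")
    for alpha in (1 + x / 100 for x in range(1, 10_000)):
        eps = k * alpha / (2 * sigma**2) + math.log(1 / delta) / (alpha - 1)
        best = min(best, eps)
    return best

print(rdp_epsilon_gaussian(sigma=10.0, k=100, delta=1e-6))  # an upper bound on the true eps
```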
Figure 1: Case study on DP-SGD (sampling probability p = 10^{-3}, noise scale σ = 0.8, δ = 10^{-7}); the x-axes show the number of DP-SGD steps. (a) Our algorithm gives much closer upper and lower bounds (ε_up, ε_low) on the true privacy curve compared to [KJPH21], under the same mesh size of 4·10^{-5}; our upper and lower bounds nearly coincide. (b) Our algorithm can improve significantly over the RDP Accountant, while the GDP Accountant can significantly underreport the true ε. Here ε_error = 0.1 and δ_error = δ/1000.

Our Techniques. Our algorithm (like the prior work of [KJH+20]) proceeds by approximating the privacy loss random variables (PRVs) by truncating and discretizing them, and then uses the Fast Fourier Transform (FFT) to convolve the resulting distributions efficiently. The main differences are in the approximation procedure and the error analysis. In the approximation procedure, we correct the approximation so that the expected value of the discretization matches the expected value of the PRV. To analyze the approximation error, we introduce the concept of coupling approximation (Definition 5.1), a variant of the Wasserstein (optimal transport) distance specifically tailored to this application. We first show that the approximation our algorithm outputs for each privacy loss random variable is a good coupling approximation. We then show that when independent coupling approximations are added, cancellation happens between the errors due to a Hoeffding bound, producing a much better coupling approximation than one would naively expect from the triangle inequality. This allows us to choose the mesh size in our discretization to be ≈ 1/√k, whereas [KH21] choose a mesh size of ≈ 1/k. The other improvement is in the truncation procedure: we give a tight tail bound for the PRVs (Lemma 5.4), which allows us to choose the domain size in the truncation to be O(1), whereas [KH21] choose O(√k). Both ideas together save a factor of k in the running time and memory.

For the analysis, the previous work analyzes the discretization error by studying the stability of convolution. This leads to complicated calculations with a runtime linear in 1/δ_error (see (2)). Since δ_error ≤ δ ≤ 1/N is required to give a meaningful privacy guarantee (N is the number of users), this term 1/δ_error is huge. In this paper, we show various facts about how coupling approximations accumulate and use them to give a runtime depending only on √(log(1/δ_error)).

2 DP Preliminaries

Given a DP algorithm M, for each value of ε ≥ 0 there exists some δ ∈ [0, 1] such that M is (ε, δ)-DP. We can represent all these privacy guarantees by a function δ_M(ε) : R_{≥0} → [0, 1] and say that δ_M(·) is the privacy curve of M. This inspires the following definition of a privacy curve between two random variables.

Definition 2.1 (Privacy curve). Given two random variables X, Y supported on some set Ω, define δ(X||Y) : R → [0, 1] as:

  δ(X||Y)(ε) = sup_{S ⊆ Ω} ( Pr[Y ∈ S] − e^ε · Pr[X ∈ S] ).

Therefore an algorithm M is (ε, δ)-DP iff δ(M(D)||M(D′))(ε) ≤ δ for all neighboring databases D, D′ (see the short sketch below for how this supremum is computed in the discrete case).

Remark 2.2. Note that not all functions δ : R → [0, 1] are privacy curves. A characterization of privacy curves can be obtained using the f-DP framework of [DRS19]. The notions of privacy curve δ(X||Y) and tradeoff function T(X||Y) are dual to each other via convex duality [DRS19], and this implies a characterization of privacy curves as shown in [ZDW21].
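For discrete distributions, the supremum in Definition 2.1 is attained by the set S = {ω : Pr[Y = ω] > e^ε · Pr[X = ω]}, so the privacy curve can be computed pointwise. A minimal sketch (our own illustration; the distributions are placeholders):

```python
import numpy as np

def privacy_curve(p_x: np.ndarray, p_y: np.ndarray, eps: float) -> float:
    """delta(X||Y)(eps) for discrete X, Y on a common finite support.

    The optimal event S keeps exactly the outcomes where
    Pr[Y = w] > e^eps * Pr[X = w].
    """
    return float(np.sum(np.maximum(p_y - np.exp(eps) * p_x, 0.0)))

# Example: two distributions on {0, 1, 2}.
p_x = np.array([0.5, 0.3, 0.2])
p_y = np.array([0.4, 0.2, 0.4])
for eps in (0.0, 0.5, 1.0):
    print(eps, privacy_curve(p_x, p_y, eps))  # at eps = 0 this is the TV distance, 0.2
```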
Definition 2.3 (Composition of privacy curves [DRS19]). Let δ_1 = δ(X_1||Y_1) and δ_2 = δ(X_2||Y_2) be any two privacy curves. The composition of the privacy curves, denoted δ_1 ⊗ δ_2, is defined as δ_1 ⊗ δ_2 = δ((X_1, X_2)||(Y_1, Y_2)), where X_1, X_2 are sampled independently and Y_1, Y_2 are sampled independently.

Note that there can be many pairs of random variables which have the same privacy curve, but the above operation is well-defined: if δ(X_1||Y_1) = δ(X′_1||Y′_1) and δ(X_2||Y_2) = δ(X′_2||Y′_2), then it was shown by [DRS19] that δ((X_1, X_2)||(Y_1, Y_2)) = δ((X′_1, X′_2)||(Y′_1, Y′_2)). [DRS19] also show that ⊗ is a commutative and associative operation.

Given two DP algorithms M_1 and M_2, the adaptive composition (M_2 ∘ M_1)(D) is an algorithm which outputs (M_1(D), M_2(D, M_1(D))), i.e., M_2 can look at the database D and also the output of the previous algorithm M_1(D). Adaptive composition of more than two algorithms is defined similarly. Suppose M_1 has privacy curve δ_1 and M_2 has privacy curve δ_2 (i.e., M_2(·, y) is a DP algorithm with privacy curve δ_2 for any fixed y). The following composition theorem shows how to get the privacy curve of M_2 ∘ M_1.

Theorem 2.4 (Composition theorem [DRS19]). Let M_1, M_2, ..., M_k be DP algorithms with privacy curves given by δ_1, δ_2, ..., δ_k respectively. The privacy curve of the adaptive composition M_k ∘ M_{k−1} ∘ ... ∘ M_1 is given by δ_1 ⊗ δ_2 ⊗ ... ⊗ δ_k.

3 Privacy loss random variables

The notion of privacy loss random variables (PRVs) is a way to assign to any privacy curve δ a pair (X, Y) such that δ = δ(X||Y). PRVs allow us to compute the composition of two algorithms by summing random variables (Theorem 3.5), or equivalently by convolving their distributions. Thus PRVs can be thought of as a reparametrization of privacy curves under which composition becomes convolution. In this paper, we differ from the usual definition of PRVs given in [DR16, KJH+20], which is tied to a specific algorithm. Instead we think of them as a reparametrization of privacy curves and study them directly. This allows us to succinctly prove many useful properties of PRVs.

Let R̄ = R ∪ {−∞, ∞} be the extended real line, where we define ∞ + x = ∞ and −∞ + x = −∞ for x ∈ R.

Definition 3.1 (Privacy loss random variables (PRVs)). Given a privacy curve δ : R → [0, 1], we say that (X, Y) are privacy loss random variables for δ if they satisfy the following conditions: X, Y are supported on R̄; δ(X||Y) = δ; Y(t) = e^t · X(t) for every t ∈ R; and Y(−∞) = 0 and X(∞) = 0, where X(t), Y(t) are the probability density functions of X, Y respectively.

Mathematically, the correct way to write the condition Y(t) = e^t · X(t) is to say that E_Y[φ(Y)] = E_X[φ(X) e^X] for all test functions φ : R̄ → [0, 1] with φ(−∞) = φ(∞) = 0. This generalizes to all situations where X, Y are continuous or discrete or a mixture of both. For ease of exposition, we ignore this complication and assume that X(t), Y(t) represent the PDFs if X, Y are continuous at t, or the probability masses if they have point masses at t.

The following theorem shows that the PRVs for a privacy curve δ = δ(P||Q) are given by the log-likelihood-ratio random variables of P and Q.

Theorem 3.2. Let δ : R → [0, 1] be a privacy curve given by δ = δ(P||Q), where P, Q are two random variables supported on Ω. The PRVs (X, Y) for the privacy curve δ are given by:

  X = log(Q(ω)/P(ω)) where ω ∼ P,  and  Y = log(Q(ω)/P(ω)) where ω ∼ Q.

(Here Q(ω) and P(ω) are the probability density functions of Q and P respectively; the mathematically precise way is to replace the ratio Q(ω)/P(ω) by the Radon–Nikodym derivative (dQ/dP)(ω).)
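For example, for the Gaussian mechanism we can take P = N(0, σ²) and Q = N(1, σ²); the log-likelihood ratio is linear in ω, so the PRV Y is itself Gaussian with mean μ = 1/(2σ²) and variance 2μ. A quick Monte Carlo sanity check of Theorem 3.2 under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
mu = 1 / (2 * sigma**2)

# Sample omega ~ Q = N(1, sigma^2) and compute the log-likelihood ratio
# log(Q(omega) / P(omega)) = (2*omega - 1) / (2*sigma^2).
omega = rng.normal(loc=1.0, scale=sigma, size=1_000_000)
y = (2 * omega - 1) / (2 * sigma**2)

print(y.mean(), mu)      # both ~0.125
print(y.var(), 2 * mu)   # both ~0.25
```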
The following theorem provides a formula for computing the privacy curve δ in terms of the PRVs, and conversely a formula for the PRVs in terms of the privacy curve. A similar statement appears in [SMM19, KJH+20].

Theorem 3.3. The privacy curve δ can be expressed in terms of its PRVs (X, Y) as:

  δ(ε) = Pr[Y > ε] − e^ε · Pr[X > ε] = E_Y[(1 − e^{ε−Y})_+] = Pr[Y ≥ ε + Z],  (5)

where Z is an exponential random variable and x_+ = max{x, 0} for x ∈ R. Conversely, given a privacy curve δ : R → [0, 1], we can compute the PDFs of its PRVs (X, Y) as:

  Y(t) = δ″(t) − δ′(t)  and  X(t) = e^{−t}(δ″(t) − δ′(t)).  (6)

Remark 3.4. Theorem 3.3 shows that the PRVs X, Y do not depend on the particular P, Q used to represent the privacy curve δ in Theorem 3.2. So we should think of the PDF of the PRV Y (or X) as an equivalent reparametrization of the privacy curve δ : R → [0, 1], just as the notion of f-DP [DRS19] is a reparametrization of the privacy curve δ.

PRVs are useful in computing privacy curves because the composition of two privacy curves can be computed by adding the corresponding pairs of PRVs. A similar statement appears in [DR16].

Theorem 3.5. Let δ_1, δ_2 be two privacy curves with PRVs (X_1, Y_1) and (X_2, Y_2) respectively. Then the PRVs for δ_1 ⊗ δ_2 = δ((X_1, X_2)||(Y_1, Y_2)) are given by (X_1 + X_2, Y_1 + Y_2). In particular, δ_1 ⊗ δ_2 = δ(X_1 + X_2 || Y_1 + Y_2).

Proof. Let (X, Y) be the privacy loss random variables for δ((X_1, X_2)||(Y_1, Y_2)). By Theorem 3.2,

  X = log( (Y_1, Y_2)(t_1, t_2) / (X_1, X_2)(t_1, t_2) ) where (t_1, t_2) ∼ (X_1, X_2)
    = log( Y_1(t_1)·Y_2(t_2) / (X_1(t_1)·X_2(t_2)) ) where t_1 ∼ X_1, t_2 ∼ X_2   (by independence of X_1, X_2 and independence of Y_1, Y_2)
    = log( e^{t_1} · e^{t_2} ) where t_1 ∼ X_1, t_2 ∼ X_2
    = t_1 + t_2 where t_1 ∼ X_1, t_2 ∼ X_2
    = X_1 + X_2.

Similarly,

  Y = log( (Y_1, Y_2)(t_1, t_2) / (X_1, X_2)(t_1, t_2) ) where (t_1, t_2) ∼ (Y_1, Y_2) = t_1 + t_2 where t_1 ∼ Y_1, t_2 ∼ Y_2 = Y_1 + Y_2.

In the supplementary material, we provide proofs of Theorems 3.2 and 3.3. We also discuss how to compute the PRVs for a subsampled mechanism given the PRVs for the original mechanism, and give examples of PRVs for a few standard mechanisms. These are used in our experiments to calculate the PRVs for DP-SGD.

4 Numerical composition of privacy curves

In this section, we present an efficient and numerically accurate method, ComposePRV (Algorithm 1), for composing privacy guarantees by utilizing the notion of PRVs.

Algorithm 1: ComposePRV: Composing privacy curves using PRVs
  Input: CDFs of PRVs Y_1, Y_2, ..., Y_k; mesh size h; truncation parameter L ∈ hZ
  Output: PDF of an approximation Ỹ of Y = Σ_{i=1}^k Y_i; Ỹ will be supported on μ + (hZ ∩ [−L, L]) for some μ ∈ [0, h/2]
  for i = 1 to k do
    Ỹ_i ← DiscretizePRV(Y_i, L, h);
  end
  Compute the PDF of Ỹ = Ỹ_1 ⊕_L Ỹ_2 ⊕_L ... ⊕_L Ỹ_k by convolving the PDFs of Ỹ_1, Ỹ_2, ..., Ỹ_k using FFT;
  Compute δ_Ỹ(ε) = E_Ỹ[(1 − e^{ε−Ỹ})_+] for all ε ∈ [0, L];
  Return Ỹ, δ_Ỹ(·)

In the algorithm ComposePRV, we compute the circular convolution ⊕_L using the Fast Fourier Transform (FFT). Fix some L > 0. For x ∈ R, we define x (mod 2L) = x − 2Ln, where n ∈ Z is chosen such that x − 2Ln ∈ (−L, L]. Given x, y, we define the circular addition x ⊕_L y = x + y (mod 2L). When we use the FFT to compute the convolution of two discrete distributions Y_1, Y_2 supported on hZ ∩ [−L, L], we are implicitly calculating the distribution of Y_1 ⊕_L Y_2. We will later show that Ỹ_1 ⊕_L Ỹ_2 ⊕_L ... ⊕_L Ỹ_k is a good approximation of Y_1 + Y_2 + ... + Y_k.

The subroutine DiscretizePRV (Algorithm 2) is used to truncate and discretize PRVs. In this subroutine, we shift the discretized random variable so that it has the same mean as the original variable. This is one of the main differences between our algorithm and the algorithm in [KJPH21, KH21]. We show that this significantly decreases the discretization error and allows us to use a much coarser mesh h ≈ 1/√k instead of h ≈ 1/k. A Python sketch of both routines is given after the pseudocode below.

Algorithm 2: DiscretizePRV: Discretize and truncate a PRV
  Input: CDF_Y(·) of a PRV Y; mesh size h; truncation parameter L ∈ hZ
  Output: PDF of an approximation Ỹ supported on μ + (hZ ∩ [−L, L]) for some μ ∈ [0, h/2]
  n ← L/h;
  for i = −n to n do
    q_i ← CDF_Y(ih + h/2) − CDF_Y(ih − h/2);
  end
  q ← q / Σ_{i=−n}^{n} q_i ;  // normalize q to make it a probability distribution
  Y^L ← Y | (|Y| ≤ L)  (i.e., Y conditioned on |Y| ≤ L);
  μ ← E[Y^L] − Σ_{i=−n}^{n} ih·q_i;
  Ỹ ← ih + μ w.p. q_i for −n ≤ i ≤ n;
  Return Ỹ;
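The sketch below implements the two routines for the special case of composing one mechanism with itself k times (so a single FFT power suffices). It is a minimal illustration under our own simplifications, not the released prv_accountant implementation; cdf may be any PRV CDF, here that of the Gaussian mechanism. For simplicity we pass E[Y] for the truncated mean, which is valid when the truncated mass is negligible.

```python
import numpy as np
from scipy.stats import norm

def discretize_prv(cdf, trunc_mean, L, h):
    """Algorithm 2: bucket the PRV on a grid of mesh h over [-L, L],
    normalize, and record the shift mu that makes the discretization's
    mean equal to trunc_mean = E[Y | |Y| <= L]."""
    n = int(round(L / h))
    grid = h * np.arange(-n, n + 1)
    q = cdf(grid + h / 2) - cdf(grid - h / 2)
    q /= q.sum()
    mu = trunc_mean - (grid * q).sum()
    return grid, q, mu

def compose_prv_self(cdf, trunc_mean, k, L, h):
    """Algorithm 1 for k-fold self-composition: circular convolution of
    the discretized PLD with itself k times via an FFT power."""
    grid, q, mu = discretize_prv(cdf, trunc_mean, L, h)
    f = np.fft.fft(np.fft.ifftshift(q))              # center grid point 0 at index 0
    qk = np.fft.fftshift(np.fft.ifft(f**k).real).clip(min=0)
    return grid + k * mu, qk                          # support is shifted by k*mu

def delta_from_pmf(grid, q, eps):
    # Theorem 3.3: delta(eps) = E[(1 - e^{eps - Y})_+].
    return float(np.sum(q * np.clip(1 - np.exp(eps - grid), 0, None)))

# Gaussian mechanism: its PRV is N(mu0, 2*mu0) with mu0 = 1/(2 sigma^2).
sigma, k = 10.0, 100
mu0 = 1 / (2 * sigma**2)
cdf = lambda t: norm.cdf(t, loc=mu0, scale=np.sqrt(2 * mu0))
grid, qk = compose_prv_self(cdf, mu0, k, L=30.0, h=1e-3)
print(delta_from_pmf(grid, qk, eps=1.0))  # ~0.127, matching the closed form at sigma/sqrt(k) = 1
```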
For simplicity, throughout this paper we will assume that the PRVs Y_1, Y_2, ..., Y_k do not have any mass at ∞. This is without loss of generality: suppose Pr[Y_i = ∞] = δ_i for each i, and let Y′_i = Y_i | (Y_i ≠ ∞). Then Y_1 + Y_2 + ... + Y_k equals Y′_1 + Y′_2 + ... + Y′_k with probability (1 − δ_1)(1 − δ_2)···(1 − δ_k), and equals ∞ with probability 1 − (1 − δ_1)(1 − δ_2)···(1 − δ_k). Therefore we can use Algorithm 1 to approximate the distribution of Y′_1 + Y′_2 + ... + Y′_k, and use it to approximate the distribution of Y_1 + Y_2 + ... + Y_k.

5 Error analysis

To analyze the discretization error, we introduce the notion of coupling approximation, a variant of the Wasserstein distance. Intuitively, a good coupling approximation is a coupling under which the two random variables are close to each other with high probability.

Definition 5.1 (Coupling approximation). Given two random variables Y_1, Y_2, we write |Y_1 − Y_2| ≤_η h if there exists a coupling between Y_1 and Y_2 such that Pr[|Y_1 − Y_2| > h] ≤ η.

The following lemma shows that if we have a good coupling approximation Ỹ to a PRV Y, then the privacy curves δ_Y(ε) and δ_Ỹ(ε) must be close.

Lemma 5.2. If Y and Ỹ are two random variables such that |Y − Ỹ| ≤_η h, then for every ε ∈ R,

  δ_Ỹ(ε + h) − η ≤ δ_Y(ε) ≤ δ_Ỹ(ε − h) + η.

Proof. By Theorem 3.3, δ_Y(ε) = Pr[Y ≥ ε + Z] and hence

  δ_Y(ε) = Pr[Y − Ỹ + Ỹ ≥ ε + Z] ≤ Pr[Y − Ỹ ≥ h] + Pr[Ỹ ≥ ε − h + Z] ≤ η + δ_Ỹ(ε − h).

Similarly, we have δ_Ỹ(ε) ≤ η + δ_Y(ε − h) for all ε ∈ R.

Therefore the goal of our analysis is to show that the ComposePRV algorithm finds a good coupling approximation Ỹ to Y = Σ_{i=1}^k Y_i. We first show that the DiscretizePRV algorithm computes a good coupling approximation to the PRVs and, crucially, that it preserves the expected value after truncation. Lemma D.5 shows that |Ỹ − Y^L| ≤_0 h, where Ỹ is the approximation of a PRV Y output by Algorithm 2 and Y^L is the truncation of Y to [−L, L]. We then use the following key lemma, which shows that when we add independent coupling approximations whose expected values match, we get a much better coupling approximation than the triangle inequality predicts.

Lemma 5.3. Suppose Y_1, Y_2, ..., Y_k and Ỹ_1, Ỹ_2, ..., Ỹ_k are two collections of independent random variables such that |Y_i − Ỹ_i| ≤_0 h and E[Y_i] = E[Ỹ_i] for all i. Then for every η > 0,

  | Σ_{i=1}^k Y_i − Σ_{i=1}^k Ỹ_i | ≤_η h·√((k/2)·log(2/η)).

Proof. Let X_i = Y_i − Ỹ_i, where (Y_i, Ỹ_i) are coupled such that |Y_i − Ỹ_i| ≤ h w.p. 1. Then X_i ∈ [−h, h] w.p. 1 and E[X_i] = 0. Note that X_1, X_2, ..., X_k are independent of each other. By Hoeffding's inequality, setting t = h·√((k/2)·log(2/η)) gives Pr[|Σ_i X_i| > t] ≤ η.

This lemma shows that the error after k-fold composition is only about √k·h, and hence setting h ≈ 1/√k gives a small enough error.
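A quick simulation illustrates the cancellation behind Lemma 5.3 (our own illustration): with mean-matched perturbations, the total coupling error grows like √k·h rather than the worst-case k·h.

```python
import numpy as np

rng = np.random.default_rng(1)
k, h = 10_000, 1e-2

# Couple each Y_i with Ytilde_i = Y_i + X_i where X_i is uniform on [-h, h],
# so |Y_i - Ytilde_i| <= h almost surely and E[Y_i] = E[Ytilde_i].
x = rng.uniform(-h, h, size=k)
total_error = abs(x.sum())

print(total_error)  # typically ~ sqrt(k/3) * h ~ 0.58, far below the worst case k*h = 100
```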
Let (X, Y ) be the privacy random variables for a (ε, δ)-DP algorithm, then for any t 0, we have Pr[|Y | ε + t] δ (1 + e ε t) This shows that Pr[|Y | ε + 2] 4 3δ and hence truncating the domain with L = 2 + ε only introduces an additive δ error in the privacy curve. Therefore, if the composition satisfies a good privacy guarantee (namely ε = O(1) for small enough δ), we can truncate the domain at L = Θ(1). Together with the fact that mesh size is 1/ k, this gives a O( k)-time algorithm for computing the privacy curve when we compose the same mechanism with itself k times. The following theorem gives a formal statement of the error bounds of our algorithm, it is proved in the supplementary material. Theorem 5.5. Let εerror, δerror > 0 be some fixed error terms. Let M1, M2, . . . , Mk be DP algorithms with privacy curves δMi(ε). Let Yi be the PRV corresponding to Mi such that δMi(ε) = δYi(ε) for ε 0. Let M be the (adaptive) composition of M1, M2, . . . , Mk and let δM(ε) be its privacy curve. Set L 2 + εerror sufficiently large such that i=1 δMi(L 2) δerror 8 and δM(L 2 εerror) δerror Let e Y be the approximation of Y = Pk i=1 Yi produced by Compose PRV algorithm with mesh size h = εerror q k 2 log 12 δerror Then δe Y (ε + εerror) δerror δY (ε) = δM(ε) δe Y (ε εerror) + δerror. (8) Furthermore, our algorithm takes O b L h time where b is the number of distinct algorithms among M1, M2, . . . , Mk. Remark 5.6. A simple way to set L such that the condition (7) holds is by choosing an L such that: L 2 + max εerror + εM , max i [k] εMi where εA(δ) is the inverse of δA(ε). To set the value of L, we do not need the exact value of εM (or εMi). We only need an upper bound on εM, which can often by obtained by using the RDP Accountant or any other method to derive upper bounds on privacy. 6 Experiments 10 8 10 7 10 6 10 5 10 4 10 3 10 2 10 1 εexact εup, εshift = 0.1 εlow, εshift = 0.1 εup, εshift = 0.5 εlow, εshift = 0.5 εup, εshift = 1.0 εlow, εshift = 1.0 Figure 2: Setting p = 1 and comparing to the analytical solution (10). We demonstrate the utility of our composition method by computing the privacy curves for the DP-SGD algorithm, which is one of the most important algorithms in differential privacy. The DP-SGD algorithm [ACG+16] is a variant of stochastic gradient descent with k steps. In each step, the algorithm selects a p fraction of training examples uniformly at random. The algorithm adds a Gaussian vector with variance σ2 to the clipped gradient of the selected batch. Then it performs a gradient step (or any other iterative methods) using the noisy gradient computed. The privacy loss of DP-SGD involves composing the privacy curve of each iteration with itself k times. The PRVs for each iteration have a closed form and depend only p, σ (see supplementary material). Our algorithms use this closed form of PRVs. See Figure 1(b) for the comparison between our algorithm and the GDP and RDP Accountant. Our method provides a lower and upper bound of the privacy curve according to (8). In Figure 1(a), we compare our algorithm with [KJPH21] (implemented in [KP21]). Under the same mesh size, our algorithm computes a much closer upper and lower bound. We validate our program for the case p = 1. When p = 1, we have an exact formula for k σ . In Figure 2, we show that the true privacy curve is indeed sandwiched between the bounds we compute and that the vertical distance between our bounds is indeed 2εerror with a neglible δerror of 10 10. 
Floating point errors. Note that our error analysis in Section 5 ignores floating point errors. This is because they are negligible compared to the discretization and truncation errors analyzed in Section 5 for the range of δ we are interested in. See the supplementary material for more details.

6.1 Comparison with [KJPH21]

Figure 3: (a) Comparison of the error bounds on δ (δ_up and δ_low, for ours and for [KJPH21]) with a varying number of discretization points (10³ to 10⁹), for p = 4·10^{-3}, σ = 0.8, ε = 1.5, k = 1000. (b) Comparison of runtimes (in seconds, as a function of the number of DP-SGD steps) for our algorithm and that of [KJPH21] when aligned on accuracy, for σ = 0.8, p = 4·10^{-3}; the speedup is plotted on a secondary y-axis. We see a significant reduction in runtime, in particular for a large number of DP-SGD steps. We were not able to run the algorithm of [KJPH21] (using the implementation from [KP21]) beyond 2,000 steps, since it becomes unstable beyond that point.

In this section, we provide more results demonstrating the practical use of our algorithm. We compare the runtime of our algorithm with that of [KJPH21], which is the state of the art, for typical values of the privacy parameters (σ = 0.8, p = 4·10^{-3}, ε = 1.5). See Figure 3(a) for the effect of the number of discretization points n on the accuracy of δ: our algorithm requires a few orders of magnitude fewer discretization points to converge than the algorithm of [KJPH21]. In order to compare runtimes, we align the accuracy of both FFT algorithms: we set the numerical parameters (number of discretization bins and domain length) such that both algorithms give similarly accurate bounds, and verify this visually. Figure 3(b) shows a significant speedup (about 100×) for our algorithm. We note that runtimes are directly proportional to the memory required by the algorithms, so a separate memory analysis is not required; both runtime and memory are dominated by the number of points in the discretization of the PRV. All experiments were performed on an Intel Xeon W-2155 CPU at 3.30 GHz with 128 GB of memory.

Acknowledgments and Disclosure of Funding

We would like to thank Janardhan Kulkarni and Sergey Yekhanin for several useful discussions and for encouraging us to work on this problem. L.W. would like to thank Daniel Jones and Victor Rühle for fruitful discussions and helpful guidance.

References

[ACG+16] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[BDLS19] Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J. Su. Deep learning with Gaussian differential privacy, 2019.

[BDRS18] Mark Bun, Cynthia Dwork, Guy N. Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74–86, 2018.

[BGK+21] Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Judy Hanwen Shen, and Uthaipon Tantipongpipat. Fast and memory efficient differentially private-SGD via JL projections. arXiv preprint arXiv:2102.03013, 2021.

[BS16] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
[BW18] Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pages 403–412, 2018.

[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.

[DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 371–380, 2009.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[DR+14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[DR16] Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.

[DRS19] Jinshuo Dong, Aaron Roth, and Weijie J. Su. Gaussian differential privacy, 2019.

[DRV10] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2010.

[KH21] Antti Koskela and Antti Honkela. Computing differential privacy guarantees for heterogeneous compositions using FFT. arXiv preprint arXiv:2102.12412, 2021.

[KJH+20] Antti Koskela, Joonas Jälkö, and Antti Honkela. Computing tight differential privacy guarantees using FFT. In Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[KJPH21] Antti Koskela, Joonas Jälkö, Lukas Prediger, and Antti Honkela. Tight differential privacy for discrete-valued mechanisms and for the subsampled Gaussian mechanism using FFT. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3358–3366. PMLR, 13–15 Apr 2021.

[KOV15] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning, pages 1376–1385. PMLR, 2015.

[KP21] Antti Koskela and Lukas Prediger. GitHub repository for Fourier accountant. https://github.com/DPBayes/PLD-Accountant, 2021.

[Mir17] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

[MM18] Sebastian Meiser and Esfandiar Mohammadi. Tight on budget? Tight bounds for r-fold approximate differential privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 247–264, 2018.

[MV16] Jack Murtagh and Salil Vadhan. The complexity of computing the optimal composition of differential privacy. In Theory of Cryptography Conference, pages 157–175. Springer, 2016.

[SMM19] David M. Sommer, Sebastian Meiser, and Esfandiar Mohammadi. Privacy loss classes: The central limit theorem in differential privacy. Proceedings on Privacy Enhancing Technologies, 2019(2):245–269, 2019.

[ZDW21] Yuqing Zhu, Jinshuo Dong, and Yu-Xiang Wang. Optimal accounting of differential privacy via characteristic function. arXiv preprint arXiv:2106.08567, 2021.