# How Private are DP-SGD Implementations?

Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang (Google Research)

Abstract

We demonstrate a substantial gap between the privacy guarantees of the Adaptive Batch Linear Queries (ABLQ) mechanism under different types of batch sampling: (i) shuffling, and (ii) Poisson subsampling; the typical analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) follows by interpreting it as a post-processing of ABLQ. While shuffling-based DP-SGD is more commonly used in practical implementations, it has not been amenable to easy privacy analysis, either analytically or even numerically. On the other hand, Poisson subsampling-based DP-SGD is challenging to implement scalably, but has a well-understood privacy analysis, with multiple open-source, numerically tight privacy accountants available. This has led to the common practice of training with shuffling-based DP-SGD while reporting the privacy analysis of the corresponding Poisson subsampling version. Our results show that there can be a substantial gap between the privacy analyses under the two types of batch sampling, and thus advise caution in reporting privacy parameters for DP-SGD.

1. Introduction

Using noisy gradients in first-order methods such as stochastic gradient descent (SGD) has become a prominent approach for adding differential privacy (DP) to the training of differentiable models such as neural networks. This approach, introduced by Abadi et al. (2016), has come to be known as Differentially Private Stochastic Gradient Descent, and we use the term DP-SGD to refer to any such first-order method. DP-SGD is currently the canonical algorithm for training deep neural networks with privacy guarantees, and there currently exist multiple open-source implementations, such as in TensorFlow Privacy (a), PyTorch Opacus (Yousefpour et al., 2021), and JAX Privacy (Balle et al., 2022).

Figure 1. Privacy parameter ε for different noise parameters σ, for fixed δ = 10⁻⁶ and number of steps T = 10,000. ε_D: for deterministic batching; ε_P: upper bounds when using Poisson subsampling (computed using different accountants); ε_S: a lower bound when using shuffling. We observe that shuffling does not provide much amplification for small values of σ, incurring a significantly higher privacy cost compared to Poisson subsampling.

The algorithm has been applied widely in various machine learning domains, such as image classification (Tramer & Boneh, 2021; Papernot et al., 2021; Klause et al., 2022; De et al., 2022; Bu et al., 2022), generative models with GANs (Torkzadehmahani et al., 2019; Chen et al., 2020), diffusion models (Dockhorn et al., 2022), language models (Li et al., 2022; Yu et al., 2022; Anil et al., 2022; He et al., 2023), medical imaging (Ziller et al., 2021), as well as private spatial querying (Zeighami et al., 2022), ad modeling (Denison et al., 2023), and recommendation (Fang et al., 2022).

DP-SGD operates by processing the training data in mini-batches and, at each step, performs a first-order gradient update using a noisy estimate of the average gradient for each mini-batch.
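For concreteness, the noisy batch query at the heart of this update can be sketched in a few lines; the following is our illustrative NumPy sketch (not code from the paper), using the clipping and noising convention spelled out in the next paragraph.

```python
# Illustrative sketch (ours, not from the paper): one noisy mini-batch query of
# DP-SGD, i.e., the sum of per-record gradients clipped to l2-norm C, plus
# Gaussian noise of standard deviation sigma * C added to every coordinate.
import numpy as np

def noisy_clipped_gradient_sum(per_record_grads: np.ndarray,  # shape (batch, d)
                               clip_norm: float,              # C
                               sigma: float,                  # noise multiplier
                               rng: np.random.Generator) -> np.ndarray:
    norms = np.linalg.norm(per_record_grads, axis=1, keepdims=True)
    # [g]_C = g * min(1, C / ||g||_2); the small constant avoids division by zero.
    clipped = per_record_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, sigma * clip_norm, size=per_record_grads.shape[1])
    return clipped.sum(axis=0) + noise

# Example usage:
# g = noisy_clipped_gradient_sum(grads, clip_norm=1.0, sigma=0.5,
#                                rng=np.random.default_rng(0))
```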
In particular, the gradient g for each record is first clipped to have a pre-determined bounded ℓ₂-norm C, by setting [g]_C := g · min{1, C/‖g‖₂}, and Gaussian noise of standard deviation σC is then added to every coordinate of the sum of clipped gradients in the mini-batch.[1]

[1] While some distributed training setups use the sum of gradients directly, it is common to rescale the sum by the batch size to obtain the average gradient before applying the optimization step. Since this scaling factor can be assimilated into the learning rate, we focus on the sum of gradients for simplicity in this paper. Note that, in the case of DP-SGD using Poisson subsampling, the scaling is done by the expected batch size, and not the realized batch size.

The privacy guaranteed by the mechanism depends on the following: the choice of σ, the size of the dataset, the size of the mini-batches, the number of steps of gradient update performed, and finally the process used to generate the batches. Almost all deep learning systems generate fixed-sized batches of data by going over the dataset sequentially. When feasible, a global shuffling of all the examples in the dataset is performed for each training epoch by making a single pass over the dataset. On the other hand, the process analyzed by Abadi et al. (2016) constructs each batch by including each record with a certain probability, chosen i.i.d. However, this leads to variable-sized mini-batches, which are technically challenging to handle in practice. As a result, there is generally a mismatch between the actual training pipeline and the privacy accounting in many applications of DP-SGD, with the implicit assumption that this subtle difference is negligible and excusable. However, in this paper, we show that this is not true in a typical setting: as shown in Figure 1, the privacy loss from the correct accounting is significantly larger than expected.

Adaptive Batch Linear Queries and Batch Samplers. Formally, the privacy analysis of DP-SGD, especially in the case of non-convex differentiable models, is performed by viewing it as a post-processing of a mechanism performing adaptive batch linear queries (ABLQ_B), as defined in Algorithm 1 (we use the subscript B to emphasize the role of the batch sampler), where the linear query ψ_t(x_i) corresponds to the clipped gradient of the record x_i.

Algorithm 1 ABLQ_B: Adaptive Batch Linear Queries
  Parameters: Batch sampler B using (expected) batch size b and number of batches T, noise parameter σ, and an (adaptive) query method A : (R^d)^* × X → B^d.
  Input: Dataset x = (x₁, . . . , xₙ).
  Output: Query estimates g₁, . . . , g_T ∈ R^d.
  (S₁, . . . , S_T) ← B_{b,T}(n)
  for t = 1, . . . , T do
    ψ_t(·) := A(g₁, . . . , g_{t−1}; ·)
    g_t ← Σ_{i ∈ S_t} ψ_t(x_i) + e_t, where e_t ~ N(0, σ²·I_d)
  return (g₁, . . . , g_T)

Algorithm 2 D_{b,T}: Deterministic Batch Sampler
  Parameters: Batch size b, number of batches T.
  Input: Number of datapoints n = b·T.
  Output: Sequence of disjoint batches S₁, . . . , S_T ⊆ [n].
  for t = 0, . . . , T − 1 do
    S_{t+1} ← {tb + 1, . . . , tb + b}
  return S₁, . . . , S_T

Algorithm 3 S_{b,T}: Shuffle Batch Sampler
  Parameters: Batch size b, number of batches T.
  Input: Number of datapoints n = b·T.
  Output: Sequence of disjoint batches S₁, . . . , S_T ⊆ [n].
  Sample a random permutation π over [n].
  for t = 0, . . . , T − 1 do
    S_{t+1} ← {π(tb + 1), . . . , π(tb + b)}
  return S₁, . . . , S_T

Algorithm 4 P_{b,T}: Poisson Batch Sampler
  Parameters: Expected batch size b, number of batches T.
  Input: Number of datapoints n.
  Output: Sequence of batches S₁, . . . , S_T ⊆ [n].
  for t = 1, . . . , T do
    S_t ← ∅
    for i = 1, . . . , n do
      S_t ← S_t ∪ {i} with probability b/n, and S_t ← S_t with probability 1 − b/n
  return S₁, . . . , S_T
For any x, we require that ψ_t(x) ∈ B^d := {w ∈ R^d : ‖w‖₂ ≤ 1}. Note that, without loss of generality, we can treat the norm bound C to be 1 by defining ψ_t(x) := [g]_C / C for the corresponding gradient g, and rescaling g_t by C to get back the noisy gradient for DP-SGD.

As mentioned above, a canonical way to generate the mini-batches is to go through the dataset in a fixed deterministic order and divide the data into mini-batches of a fixed size (Algorithm 2, denoted D). Another commonly used option is to first randomly permute the entire dataset before dividing it into mini-batches of a fixed size (Algorithm 3, denoted S); this option provides amplification by shuffling, namely, the privacy guarantees are better compared to a fixed deterministic ordering. However, obtaining such amplification bounds is non-trivial, and while some bounds have recently been established for such privacy amplification (Erlingsson et al., 2019; Feldman et al., 2021; 2023), they tend to be loose in our setting and only kick in when the basic mechanism is already sufficiently private. Instead, the approach followed by Abadi et al. (2016), and henceforth commonly used in reporting privacy parameters for DP-SGD, is to assume that each batch is sampled i.i.d. by including each record with a certain probability, referred to as Poisson subsampling (Algorithm 4, denoted P). The advantage of this approach is that the privacy analysis of such sampling is easier to carry out, since the ABLQ_P mechanism can be viewed as a composition of T independent sub-mechanisms. This enables privacy accounting methods such as Rényi DP (Mironov, 2017), as well as numerically tight accounting methods using privacy loss distributions (elaborated on in Section 3). This has been the case even when the algorithm being implemented in fact uses some form of shuffling-based batch sampler. To quote Abadi et al. (2016) (with our emphasis): "In practice, for efficiency, the construction of batches and lots is done by randomly permuting the examples and then partitioning them into groups of the appropriate sizes. For ease of analysis, however, we assume that each lot is formed by independently picking each example with probability q = L/N, where N is the size of the input dataset."

Most implementations of DP-SGD mentioned earlier also use some form of shuffling, with the rare exception of PyTorch Opacus (Yousefpour et al., 2021), which has the option of Poisson subsampling to be consistent with the privacy analysis. But this approach does not scale to large datasets, as random access for datasets that do not fit in memory is generally inefficient. Moreover, variable batch sizes are inconvenient to handle in deep learning systems.[2] TensorFlow Privacy (b) provides the compute_dp_sgd_privacy_statement method for computing the privacy parameters and reminds users about the implicit assumption of Poisson subsampling. Ponomareva et al. (2023) note in their survey that "It is common, though inaccurate, to train without Poisson subsampling, but to report the stronger DP bounds as if amplification was used." As DP-SGD is being deployed in more applications with such a discrepancy, it has become crucial to understand exactly how the privacy guarantee depends on the precise choice of the batch sampler, especially when one cares about specific (ε, δ) privacy parameters (and not asymptotic bounds).

[2] For example, when the input shape changes, jax.jit will trigger recompilation, and tf.function will retrace the graph. Google TPUs require all operations to have fixed (input and output) shapes. Moreover, in various forms of data parallelism, the batch size needs to be divisible by the number of accelerators.
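For reference, the three batch samplers compared throughout the paper (Algorithms 2–4) are easy to state in code. The following NumPy sketch is ours and purely illustrative (0-indexed, unlike the 1-indexed presentation in the algorithms).

```python
# Illustrative sketches (ours) of the batch samplers of Algorithms 2-4.
import numpy as np

def deterministic_batches(n: int, b: int):                             # Algorithm 2 (D)
    assert n % b == 0, "D requires n = b * T"
    return [list(range(t * b, (t + 1) * b)) for t in range(n // b)]

def shuffle_batches(n: int, b: int, rng: np.random.Generator):         # Algorithm 3 (S)
    assert n % b == 0, "S requires n = b * T"
    perm = rng.permutation(n)
    return [perm[t * b:(t + 1) * b].tolist() for t in range(n // b)]

def poisson_batches(n: int, b: int, T: int, rng: np.random.Generator):  # Algorithm 4 (P)
    q = b / n  # each record joins each batch independently with probability b/n
    return [np.flatnonzero(rng.random(n) < q).tolist() for _ in range(T)]
```

Note that only the Poisson sampler produces batches of variable size, which is the practical inconvenience discussed above.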
This leads us to our main motivating question: How do the privacy guarantees of ABLQ_B compare when using different batch samplers B?

1.1. Our Contributions

We study the privacy guarantees of ABLQ_B for different choices of batch samplers B. While we defer the formal definition of (ε, δ)-DP to Section 2, let δ_B(ε) denote the privacy loss curve of ABLQ_B for any B ∈ {D, P, S}, for a fixed choice of σ and T. Namely, for all ε ≥ 0, let δ_B(ε) be the smallest δ ≥ 0 such that ABLQ_B satisfies (ε, δ)-DP for any underlying adaptive query method A. Let ε_B(δ) be defined analogously.

D vs S. We observe that ABLQ_S always satisfies stronger privacy guarantees than ABLQ_D, i.e., δ_S(ε) ≤ δ_D(ε) for all ε ≥ 0.

D vs P. We show that the privacy guarantees of ABLQ_D and ABLQ_P are incomparable. Namely, for all values of T (number of steps) and σ (noise parameter), it holds (i) for small enough ε > 0 that δ_P(ε) < δ_D(ε), but, perhaps more interestingly, (ii) for sufficiently large ε that δ_P(ε) > δ_D(ε). We also demonstrate this separation in a specific numerical setting of parameters.

S vs P. By combining the above, it follows that for sufficiently large ε, it holds that δ_S(ε) < δ_P(ε). If δ_S(ε) < δ_P(ε) were to hold for all ε > 0, then reporting privacy parameters for ABLQ_P would provide correct, even if pessimistic, privacy guarantees for ABLQ_S. However, we demonstrate multiple concrete settings of parameters for which δ_P(ε) ≪ δ_S(ε), or alternately, ε_P(δ) ≪ ε_S(δ). For example, in Figure 1, we fix δ = 10⁻⁶ and the number of steps T = 10,000,[3] and compare the value of ε_B(δ) for various values of σ. For σ = 0.5, we find ε_P(δ) < 1.96 (PLD) and ε_P(δ) < 3.43 (RDP), but ε_S(δ) > 10.994 and ε_D(δ) ≈ 10.997. For σ = 1.3, we find ε_P(δ) < 0.031 (PLD), whereas ε_S(δ) > 0.26. This suggests that reporting privacy guarantees using the Poisson batch sampler can significantly underestimate the privacy loss when the implementation in fact uses the shuffle batch sampler.

[3] Recall that since we are analyzing a single epoch, the subsampling probability of Poisson subsampling is b/n = 1/T.

Our main takeaway is that batch sampling plays a crucial role in determining the privacy guarantees of ABLQ_B, and hence caution must be exercised in reporting privacy parameters for mechanisms such as DP-SGD.

1.2. Technical Overview

Our techniques rely on the notion of dominating pairs as defined by Zhu et al. (2022) (see Definition 2.3), which, when tightly dominating, capture the privacy loss curve δ_B(ε).

D vs S. We observe that applying a mechanism to a random permutation of the input dataset does not degrade its privacy guarantees. While standard, we include a proof for completeness.

D vs P. In order to show that δ_P(ε) < δ_D(ε) for small enough ε > 0, we first show that δ_P(0) < δ_D(0), by showing that the total variation distance between the tightly dominating pair for ABLQ_D is larger than that in the case of ABLQ_P. Thus, by the continuity of the hockey stick divergence in ε, we obtain the same for small enough ε.
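To make the total-variation step explicit, the following display is our sketch of the computation carried out in Appendix C, using the standard expression for the total variation distance between unit-shifted Gaussians and the pairs (P_D, Q_D), (A, B), (P_P, Q_P) defined in Section 3.

```latex
% Sketch (ours) of the total-variation comparison behind part (i); see Appendix C.
\begin{aligned}
D_1(P_D \| Q_D) &= \mathrm{TV}\bigl(\mathcal{N}(1,\sigma^2),\, \mathcal{N}(0,\sigma^2)\bigr)
  = 2\,\Phi\!\left(\tfrac{1}{2\sigma}\right) - 1,
\qquad
D_1(A \| B) = \tfrac{1}{T}\, D_1(P_D \| Q_D), \\[2pt]
D_1(P_P \| Q_P) &= D_1\!\bigl(A^{\otimes T} \,\big\|\, B^{\otimes T}\bigr)
  \;\le\; 1 - \bigl(1 - D_1(A \| B)\bigr)^{T}
  \;<\; T\, D_1(A \| B) \;=\; D_1(P_D \| Q_D) \quad \text{for } T > 1 .
\end{aligned}
```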
In order to show that δ_P(ε) > δ_D(ε) for large enough ε > 0, we demonstrate an explicit set E (a halfspace) that realizes a lower bound on the hockey stick divergence between the tightly dominating pair of ABLQ_P, and show that this decays more slowly than the hockey stick divergence between the tightly dominating pair for ABLQ_D. We demonstrate the separation in a specific numerical setting of parameters, using the dp_accounting library (Google's DP Library, 2020) to provide lower bounds on the hockey stick divergence between the tightly dominating pair for ABLQ_P.

S vs P. One challenge in understanding the privacy guarantee of ABLQ_S is the lack of a clear dominating pair for the mechanism. Nevertheless, we consider a specific instance of the query method A, and a specific adjacent pair x ~ x'. The key insight we use in constructing this A and x, x' is that in the case of ABLQ_S, since the batches are of a fixed size, the responses to queries on batches containing only the non-differing records in x and x' can leak information about the location of the differing record in the shuffled order. This limitation is not faced by ABLQ_P, since each record is independently sampled in each batch. We show a lower bound on the hockey stick divergence between the output distributions of the mechanism on these adjacent inputs by constructing an explicit set E, thereby obtaining a lower bound on δ_S(ε). In order to show that this can be significantly larger than the privacy guarantee of ABLQ_P, we again use the dp_accounting library, this time to provide upper bounds on the hockey stick divergence between the dominating pair of ABLQ_P.

We provide the IPython notebook[4] that was used for all the numerical demonstrations; the notebook can be executed using a free CPU runtime on Google Colab.

[4] https://colab.research.google.com/drive/1zI2H8YEXbQyD6gZVVskFwcOiM5YMvqRe?usp=sharing

Related work. The phenomenon of non-differing records leaking information about whether the differing record is in a batch can also exist in sampling-based batch samplers, such as sampling independent batches of fixed size, as studied in the concurrent work of Lebeda et al. (2024). However, we focus on the shuffle-based batch sampler, as it is the most common in practical implementations.

2. Differential Privacy

We consider mechanisms M that map input datasets to distributions over an output space O. That is, on input dataset x = (x₁, . . . , xₙ), where each record xᵢ ∈ X, M(x) is a probability distribution over the output space O; we abuse notation to also use M(x) to denote the underlying random variable. Two datasets x and x' are said to be adjacent, denoted x ~ x', if, loosely speaking, they "differ in one record". This can be formalized in multiple ways, which we elaborate on in Section 2.2, but for any notion of adjacency, Differential Privacy (DP) can be defined as follows.

Definition 2.1 (DP). For ε, δ ≥ 0, a mechanism M satisfies (ε, δ)-DP if for all adjacent datasets x ~ x', and for any (measurable) event E, it holds that Pr[M(x) ∈ E] ≤ e^ε · Pr[M(x') ∈ E] + δ.

For any mechanism M, let δ_M : R≥0 → [0, 1] be its privacy loss curve, namely δ_M(ε) is the smallest δ for which M satisfies (ε, δ)-DP; ε_M : [0, 1] → R≥0 can be defined analogously.

2.1. Adaptive Batch Linear Queries Mechanism

We primarily study the adaptive batch linear queries mechanism ABLQ_B, using a batch sampler B and an adaptive query method A, as defined in Algorithm 1. The batch sampler B can be instantiated with any algorithm that produces a (randomized) sequence of batches S₁, . . . , S_T ⊆ [n], where n is the number of examples.
ABLQ_B produces a sequence (g₁, . . . , g_T) of responses, where each g_t ∈ R^d. The response g_t is produced recursively using the adaptive query method A that, given g₁, . . . , g_{t−1}, constructs a new query ψ_t : X → B^d (for B^d := {v ∈ R^d : ‖v‖₂ ≤ 1}), and we estimate the sum of ψ_t(x) over the batch S_t with zero-mean Gaussian noise of scale σ added to all coordinates. As explained in Section 1, DP-SGD falls under this abstraction by considering an adaptive query method that is specified by a differentiable loss function f : R^d × X → R, starts with an initial w₀ ∈ R^d, and defines the query A(g₁, . . . , g_{t−1}; x) as the clipped gradient [∇_w f_{w_{t−1}}(x)]₁, where w_t is the t-th model iterate recursively obtained by performing gradient descent, e.g., w_t ← w_{t−1} − η_t·g_t (or any other first-order optimization step).

We consider the Deterministic D (Algorithm 2), Poisson P (Algorithm 4), and Shuffle S (Algorithm 3) batch samplers. As in Section 1, we will continue to use δ_B(ε) as a shorthand for the privacy loss curve of ABLQ_B for any B ∈ {D, P, S}. Namely, for all ε > 0, let δ_B(ε) be the smallest δ ≥ 0 such that ABLQ_B satisfies (ε, δ)-DP for all choices of the underlying adaptive query method A. And ε_B(δ) is defined analogously.

2.2. Adjacency Notions

As alluded to earlier, the notion of adjacency is crucial to Definition 2.1. Commonly used adjacency notions are:

Add-Remove adjacency. Datasets x, x' are said to be add-remove adjacent if there exists an i such that x' = x₋ᵢ or vice-versa (where x₋ᵢ represents the dataset obtained by removing the i-th record from x).

Substitution adjacency. Datasets x, x' are said to be substitution adjacent if there exists an i such that x₋ᵢ = x'₋ᵢ and x'ᵢ ≠ xᵢ.

The privacy analysis of DP-SGD is typically done for the Poisson batch sampler P (Abadi et al., 2016; Mironov, 2017), with respect to the add-remove adjacency. However, it is impossible to analyze the privacy of ABLQ_D or ABLQ_S with respect to the add-remove adjacency, because the batch samplers D and S require that the number of records n equals b·T. On the other hand, using the substitution adjacency for D and S leads to an unfair comparison to ABLQ_P, whose analysis is with respect to the add-remove adjacency. Thus, to make a fair comparison, we consider the following adjacency (proposed by Kairouz et al. (2021)).

Zero-out adjacency. We augment the input space to be X̄ := X ∪ {⊥} and extend any adaptive query method A so that A(g₁, . . . , g_t; ⊥) = 0 for all g₁, . . . , g_t ∈ R^d. Datasets x, x' ∈ X̄ⁿ are said to be zero-out adjacent if there exists an i such that x₋ᵢ = x'₋ᵢ, and exactly one of {xᵢ, x'ᵢ} is in X and the other is ⊥. Whenever we need to specifically emphasize that xᵢ ∈ X and x'ᵢ = ⊥, we denote it as x →_z x'. In this notation, x ~ x' if either x →_z x' or x' →_z x.

The privacy analysis of ABLQ_P with respect to zero-out adjacency is the same as that with respect to the add-remove adjacency; it essentially replaces a record by a "ghost" record that makes the query method always return 0. In the rest of this paper, we only consider this zero-out adjacency. We note that our separations where ε_P(δ) ≪ ε_S(δ), such as in Figure 1, also hold under the substitution adjacency, by using "group privacy": namely, if a mechanism satisfies (ε, δ)-DP with respect to zero-out adjacency, then it satisfies (2ε, δ·(1 + e^ε))-DP with respect to substitution adjacency (see, e.g., Vadhan (2017)).
2.3. Hockey Stick Divergence

We interchangeably use the same notation (e.g., letters such as P) to denote both a probability distribution and its corresponding density function. For µ ∈ R^D and positive semi-definite Σ ∈ R^{D×D}, we use N(µ, Σ) to denote the Gaussian distribution with mean µ and covariance Σ. For probability densities P and Q, we use αP + βQ to denote the weighted sum of the corresponding densities. P ⊗ Q denotes the product distribution sampled as (u, v) for u ~ P, v ~ Q, and P^{⊗T} denotes the T-fold product distribution P ⊗ · · · ⊗ P.

Definition 2.2. For all ε ∈ R, the e^ε-hockey stick divergence between P and Q is D_{e^ε}(P‖Q) := sup_E {P(E) − e^ε·Q(E)}, where the supremum is over all (measurable) events E.

It is immediate to see that M satisfies (ε, δ)-DP iff for all adjacent x ~ x', it holds that D_{e^ε}(M(x)‖M(x')) ≤ δ.

Definition 2.3 (Dominating Pair (Zhu et al., 2022)). The pair (P, Q) dominates the pair (A, B) if D_{e^ε}(P‖Q) ≥ D_{e^ε}(A‖B) holds for all ε ∈ R. We say that (P, Q) dominates a mechanism M if (P, Q) dominates (M(x), M(x')) for all adjacent x →_z x'. If (P, Q) dominates M, then for all ε ≥ 0, δ_M(ε) ≤ max{D_{e^ε}(P‖Q), D_{e^ε}(Q‖P)}. We say that (P, Q) tightly dominates a mechanism M if (P, Q) dominates M and there exist adjacent datasets x →_z x' such that D_{e^ε}(P‖Q) = D_{e^ε}(M(x)‖M(x')) holds for all ε ∈ R (note that this includes ε < 0); in this case, δ_M(ε) = max{D_{e^ε}(P‖Q), D_{e^ε}(Q‖P)}. Thus, tightly dominating pairs completely characterize the privacy loss of a mechanism (although they are not guaranteed to exist for all mechanisms).[5]

[5] Zhu et al. (2022) define "tightly dominating pair" differently, in a manner that is guaranteed to exist. They additionally define the notion of a "worst-case pair", which is a pair of adjacent datasets x ~ x' such that (M(x), M(x')) is a tightly dominating pair. Thus, our notion of tightly dominating pair refers precisely to the pair (M(x), M(x')) for a worst-case adjacent pair x, x'. It is also worth noting that our notation for tightly dominating pairs is asymmetric, as it only considers pairs x →_z x'; the reverse setting is handled implicitly, because (P, Q) dominates (M(x), M(x')) if and only if (Q, P) dominates (M(x'), M(x)).

Dominating pairs behave nicely under mechanism composition. Namely, if (P₁, Q₁) dominates M₁ and (P₂, Q₂) dominates M₂, then (P₁ ⊗ P₂, Q₁ ⊗ Q₂) dominates the (adaptive) composition of M₁ and M₂.

3. Dominating Pairs for ABLQ_B

We discuss the dominating pairs for ABLQ_B, for B ∈ {D, P, S}, that will be crucial for establishing our results.

Tightly dominating pair for ABLQ_D. It follows from the standard analysis of the Gaussian mechanism and parallel composition that a tightly dominating pair for ABLQ_D is the pair (P_D := N(1, σ²), Q_D := N(0, σ²)), leading to a closed-form expression for δ_D(ε).

Proposition 3.1 (Theorem 8 in Balle & Wang (2018)). For all ε ≥ 0, it holds that δ_D(ε) = Φ(−σε + 1/(2σ)) − e^ε·Φ(−σε − 1/(2σ)), where Φ(·) is the cumulative distribution function (CDF) of the standard normal random variable N(0, 1).
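The closed form in Proposition 3.1 is straightforward to evaluate numerically; a minimal SciPy sketch (ours, purely illustrative) follows.

```python
# Illustrative sketch (ours): evaluate delta_D(eps) from Proposition 3.1.
import numpy as np
from scipy.stats import norm

def delta_D(eps: float, sigma: float) -> float:
    # delta_D(eps) = Phi(-sigma*eps + 1/(2*sigma)) - exp(eps) * Phi(-sigma*eps - 1/(2*sigma))
    return (norm.cdf(-sigma * eps + 1.0 / (2.0 * sigma))
            - np.exp(eps) * norm.cdf(-sigma * eps - 1.0 / (2.0 * sigma)))

# Example: delta_D(4.0, sigma=0.3) evaluates the deterministic-sampler curve at eps = 4.
```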
Tightly dominating pair for ABLQ_P. Zhu et al. (2022) showed[6] that a tightly dominating pair for a single step of ABLQ_P, i.e., a Poisson subsampled Gaussian mechanism, is given by the pair (A := (1 − q)·N(0, σ²) + q·N(1, σ²), B := N(0, σ²)), where q is the subsampling probability of each record, namely q = b/n; in the case where n = b·T, we have q = 1/T. Since ABLQ_P is a T-fold composition of this Poisson subsampled Gaussian mechanism, it follows that the tightly dominating pair for ABLQ_P is (P_P := A^{⊗T}, Q_P := B^{⊗T}).

[6] This was implicit in prior work, e.g., (Koskela et al., 2020).

The hockey stick divergence D_{e^ε}(P_P‖Q_P) does not have a closed-form expression, but there are privacy accountants based on the methods of Rényi DP (RDP) (Mironov, 2017) as well as privacy loss distributions (PLD) (Meiser & Mohammadi, 2018; Sommer et al., 2019), the latter providing numerically accurate algorithms (Koskela et al., 2020; Gopi et al., 2021; Ghazi et al., 2022; Doroshenko et al., 2022) that have been the basis of multiple open-source implementations from both industry and academia, including (Prediger & Koskela, 2020; Google's DP Library, 2020; Microsoft, 2021). While Rényi-DP-based accounting provides an upper bound on max{D_{e^ε}(P_P‖Q_P), D_{e^ε}(Q_P‖P_P)}, the PLD-based accounting implementations can provide upper and lower bounds on max{D_{e^ε}(P_P‖Q_P), D_{e^ε}(Q_P‖P_P)} to high accuracy, as controlled by a certain discretization parameter.

Tightly dominating pair for ABLQ_S? It is not clear which adjacent pair would correspond to a tightly dominating pair for ABLQ_S; moreover, it is a priori unclear whether one even exists. However, in order to prove lower bounds on the privacy parameters, it suffices to consider a specific instantiation of the adaptive query method A and a particular adjacent pair x ~ x'. In particular, we instantiate the query method A as follows. Consider the input space X = [−1, 1], and assume that the query method A is non-adaptive and always produces the query ψ_t(x) = x. We consider the adjacent datasets x = (x₁ = −1, . . . , x_{n−1} = −1, xₙ = 1) and x' = (x₁ = −1, . . . , x_{n−1} = −1, xₙ = ⊥). In this case, when the differing record falls in batch t*, the output of the mechanism on every batch t ≠ t* is centered at −b, and on batch t* is centered at −b + 2 (on input x) or at −b + 1 (on input x'). Thus it follows that the distributions A = ABLQ_S(x) and B = ABLQ_S(x') are given as A = Σ_{t=1}^T (1/T)·N(−b·1 + 2e_t, σ²I) and B = Σ_{t=1}^T (1/T)·N(−b·1 + e_t, σ²I), where 1 denotes the all-ones vector in R^T and e_t denotes the t-th standard basis vector in R^T. Shifting the distributions by b·1 does not change the hockey stick divergence D_{e^ε}(A‖B), hence we might as well consider the pair

P_S := Σ_{t=1}^T (1/T)·N(2e_t, σ²I),   (1)
Q_S := Σ_{t=1}^T (1/T)·N(e_t, σ²I).   (2)

Thus, δ_S(ε) ≥ max{D_{e^ε}(P_S‖Q_S), D_{e^ε}(Q_S‖P_S)}. We conjecture that this pair is in fact tightly dominating for ABLQ_S for all instantiations of query methods A (including adaptive ones). We elaborate more in Section 4.3.1.

Conjecture 3.2. The pair (P_S, Q_S) tightly dominates ABLQ_S for all adaptive query methods A.

The results in this paper do not rely on this conjecture being true, as we only use the pair (P_S, Q_S) to establish lower bounds on δ_S(·).

4. Privacy Loss Comparisons

4.1. ABLQ_D vs ABLQ_S

We first note that ABLQ_S enjoys stronger privacy guarantees than ABLQ_D.

Theorem 4.1. For all σ ≥ 0, ε ≥ 0, and T ≥ 1: δ_S(ε) ≤ δ_D(ε).

This follows from the standard fact that shuffling cannot degrade the privacy guarantee satisfied by a mechanism. For completeness, we provide a proof in Appendix B.

4.2. ABLQ_D vs ABLQ_P

We show that ABLQ_D and ABLQ_P have incomparable privacy loss. In particular, we show the following.

Theorem 4.2. For all σ > 0 and T > 1, there exist ε₀, ε₁ ≥ 0 such that (a) for all ε ∈ [0, ε₀), it holds that δ_D(ε) > δ_P(ε), and (b) for all ε > ε₁, it holds that δ_D(ε) < δ_P(ε).

We defer the detailed proof to Appendix C and provide a proof sketch here. Part (a) is shown by first establishing that the total variation distance (corresponding to D₁(·‖·)) between P_D and Q_D is strictly larger than the total variation distance between P_P and Q_P when T > 1 and σ > 0.
This implies that δ_D(0) > δ_P(0). By using the continuity of D_{e^ε}(·‖·) in ε, we conclude the same for all ε < ε₀.

For part (b), we construct an explicit set E such that P_P(E) − e^ε·Q_P(E) > δ_D(ε). In particular, we choose a halfspace E := {w ∈ R^T : Σᵢ wᵢ ≥ (ε + log 2 + T log T)·σ² + T/2} and show that P_P(E) − e^ε·Q_P(E) is at least ½·Φ(−(ε + T log T + log 2)·σ/√T − √T/(2σ)). For large ε, the dominant term inside Φ(·) is −εσ/√T. On the other hand, δ_D(ε) is at most Φ(−εσ + 1/(2σ)) (from Proposition 3.1), which has the dominant term −εσ. Since Φ(−εσ/√T − ·) decays more slowly than Φ(−εσ + ·), we get that for some sufficiently large ε₁, it holds that δ_D(ε) < δ_P(ε) for all ε > ε₁.

Even though Theorem 4.2 is proved for some values of ε₀ and ε₁, we conjecture that it holds with ε₀ = ε₁.

Conjecture 4.3. Theorem 4.2 holds with ε₀ = ε₁.

We do not rely on this conjecture being true in the rest of this paper. We provide a numerical example that validates Theorem 4.2 and provides evidence for Conjecture 4.3. In Figure 2, for σ = 0.3 and T = 10, we plot the numerically computed δ_D(ε) (using Proposition 3.1), as well as lower and upper bounds on δ_P(ε), computed using the open-source dp_accounting library (Google's DP Library, 2020).

Figure 2. δ_D(ε) and δ_P(ε) for σ = 0.3 and T = 10.

4.3. ABLQ_P vs ABLQ_S

From Theorems 4.1 and 4.2, it follows that there exists an ε₁ such that for all ε > ε₁, it holds that δ_S(ε) ≤ δ_D(ε) < δ_P(ε). On the other hand, while we know that δ_D(ε) > δ_P(ε) for sufficiently small ε, this does not imply anything about the relationship between δ_S(ε) and δ_P(ε) for small ε. We demonstrate simple numerical settings where δ_S(ε) is significantly larger than δ_P(ε).

We prove lower bounds on δ_S(ε) by constructing specific sets E and using the fact that δ_S(ε) ≥ P_S(E) − e^ε·Q_S(E). In particular, we consider sets E_C, parameterized by C, of the form {w ∈ R^T : max_t w_t ≥ C}; note that E_C is the complement of the T-fold Cartesian product of the set (−∞, C). For a single Gaussian distribution D = N(µ, σ²I), we can compute the probability mass of E_C under D as

D(E_C) = 1 − D(R^T \ E_C) = 1 − Π_{t=1}^T Pr_{x ~ N(µ_t, σ²)}[x < C] = 1 − Π_{t=1}^T Φ((C − µ_t)/σ).

In particular, when µ = α·e_t for any standard basis vector e_t, we have D(E_C) = 1 − Φ((C − α)/σ)·Φ(C/σ)^{T−1}. Thus, we have P_S(E_C) = Σ_{t=1}^T (1/T)·D_t(E_C) = 1 − Φ((C − 2)/σ)·Φ(C/σ)^{T−1} for D_t = N(2e_t, σ²I), and similarly Q_S(E_C) = 1 − Φ((C − 1)/σ)·Φ(C/σ)^{T−1}. Thus, we use the following lower bound:

δ_S(ε) ≥ max_{C ∈ C} {P_S(E_C) − e^ε·Q_S(E_C)}   (3)

for any suitable set C of thresholds that can be enumerated over. In our experiments described below, we set C to be the set of all values of C ranging from 0 to 100 in increments of 0.01.

In Figure 3, we set σ = 0.4 and the number of steps T = 10,000, and plot δ_D(ε), an upper bound on δ_P(ε) (obtained using dp_accounting), and a lower bound on δ_S(ε) as obtained via (3). We find that while δ_P(4) ≤ 1.18 × 10⁻⁵, we have δ_S(4) ≥ 0.226, which is close to δ_D(4) ≈ 0.244. Even δ_S(12) ≥ 7.5 × 10⁻⁵ is larger than δ_P(4). While the (4, 1.2 × 10⁻⁵)-DP guarantee of ABLQ_P could have been considered sufficiently private, ABLQ_S only satisfies much worse privacy guarantees. We provide additional examples in Appendix A.

Figure 3. δ_D(ε), upper bounds on δ_P(ε), and a lower bound on δ_S(ε) for varying ε and fixed σ = 0.4 and T = 10,000.
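A short script suffices to evaluate the lower bound (3). The following SciPy sketch is ours (not the paper's released notebook); it uses the closed-form expressions for P_S(E_C) and Q_S(E_C) derived above and should roughly reproduce the δ_S(4) ≥ 0.226 figure quoted for σ = 0.4 and T = 10,000.

```python
# Illustrative sketch (ours): lower bound (3) on delta_S(eps) via threshold
# events E_C = {w : max_t w_t >= C}, using
#   P_S(E_C) = 1 - Phi((C-2)/sigma) * Phi(C/sigma)^(T-1),
#   Q_S(E_C) = 1 - Phi((C-1)/sigma) * Phi(C/sigma)^(T-1).
import numpy as np
from scipy.stats import norm

def delta_S_lower_bound(eps: float, sigma: float, T: int,
                        thresholds=np.arange(0.0, 100.0, 0.01)) -> float:
    tail = norm.cdf(thresholds / sigma) ** (T - 1)
    p = 1.0 - norm.cdf((thresholds - 2.0) / sigma) * tail   # P_S(E_C)
    q = 1.0 - norm.cdf((thresholds - 1.0) / sigma) * tail   # Q_S(E_C)
    return float(np.max(p - np.exp(eps) * q))

print(delta_S_lower_bound(eps=4.0, sigma=0.4, T=10_000))  # approximately 0.226
```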
4.3.1. Intuition for Conjecture 3.2

We attempt to provide some intuition for why ABLQ_S does not provide as much amplification over ABLQ_D as ABLQ_P does, and why we suggest Conjecture 3.2. For the sake of intuition, consider the setting where the query method A always generates the query ψ_t(x) = x, and we have two adjacent datasets x = (x₁ = −L, . . . , x_{n−1} = −L, xₙ = 1) and x' = (x₁ = −L, . . . , x_{n−1} = −L, xₙ = ⊥). The case of L > 1 is not valid, since in this case |ψ_t(x)| = L > 1. However, we can still ask how well the privacy of the n-th example is preserved by ABLQ_B, by considering the hockey stick divergence between ABLQ_B(x) and ABLQ_B(x').

The crucial difference between ABLQ_P and ABLQ_S is that the privacy analysis of ABLQ_P does not depend at all on the non-differing records in the two datasets. In the case of ABLQ_S, we observe that for any fixed σ and T, the hockey stick divergence D_{e^ε}(ABLQ_S(x)‖ABLQ_S(x')) approaches δ_D(ε) as L → ∞. We sketch this argument intuitively: for any batch S_t that does not contain n, the corresponding response is g_t = −bL + e_t for e_t ~ N(0, σ²), whereas for the batch S_{t*} that contains n, the corresponding response is g_{t*} = −(b − 1)L + 1 + e_{t*} in the case of input x, or g_{t*} = −(b − 1)L + e_{t*} in the case of input x'. As L → ∞, we can identify the batch S_{t*} that contains n with probability approaching 1, thereby not providing any amplification.

In summary, the main differing aspect between ABLQ_S and ABLQ_P is that in the former, the non-differing examples can leak information about the location of the differing example in the shuffled order, whereas this is not the case in the latter. While we sketched an argument that works asymptotically as L → ∞, we see glimpses of it already at L = 1. Our intuitive reasoning behind Conjecture 3.2 is that, even in the case of (vector-valued) adaptive query methods, in order to leak the most information about the differing record xᵢ between x and x', it seems natural that the query ψ_t(x_j) should evaluate to −ψ_t(xᵢ) for all j ≠ i. If the query method satisfies this condition for all t, then it is easy to show that (P_S, Q_S) tightly dominates (ABLQ_S(x), ABLQ_S(x')). Conjecture 3.2 then asserts that this is indeed the worst case.

5. Conclusion & Future Directions

We identified significant gaps between the privacy analyses of adaptive batch linear query mechanisms under the deterministic, Poisson, and shuffle batch samplers. We find that while shuffling always provides better privacy guarantees than deterministic batching, Poisson batch sampling can provide a worse privacy guarantee than even deterministic sampling at large ε. But perhaps most surprisingly, we demonstrate that the amplification guarantees due to shuffle batch sampling can be severely limited compared to the amplification due to Poisson subsampling, in various regimes that could be considered of practical interest.

Several interesting directions remain to be investigated. In our work, we provide a technique that only gives a lower bound on the privacy guarantee when using a shuffle batch sampler. It would be interesting to have a tight accounting method for ABLQ_S. A first step towards this could be to establish Conjecture 3.2, which, if true, might make numerical accountants for computing the hockey stick divergence possible. While this involves computing a high-dimensional integral, it might be possible to approximate it using importance sampling; e.g., such approaches have been attempted for ABLQ_P (Wang et al., 2023). Also, our approach for analyzing the privacy with the shuffle batch sampler is limited to a single-epoch mechanism, whereas in practice it is common to use DP-SGD with multiple epochs. Extending our approach to multiple epochs will be interesting.
However, it remains to be seen how the utility of DP-SGD would be affected if we use the correct privacy analysis for ABLQ_S instead of the analysis for ABLQ_P, which has been used extensively so far and treated as a "good approximation". Alternative approaches such as DP-FTRL (Kairouz et al., 2021; McMahan et al., 2022), which do not rely on amplification, might turn out to be better if we instead use the correct analysis for ABLQ_S, in the regimes where such methods are currently reported to underperform compared to DP-SGD.

An important point to note is that the model of shuffle batch sampler we consider here is a simple one. There are various types of data loaders used in practice, which are not necessarily captured by our model. For example, tf.data.Dataset.shuffle takes a buffer-size parameter b: it returns a random record among the first b records, immediately replaces it with the next record (the (b+1)-st in this case), and continues repeating this process. This leads to an asymmetric form of shuffling when the dataset size exceeds the size of the buffer. Such batch samplers might thus require a more careful privacy analysis.

The notion of DP aims to guarantee privacy even in the worst case. For example, in the context of DP-SGD, it aims to protect the privacy of a single record even when the model trainer and all other records participating in the training are colluding to leak information about this one record. Moreover, releasing the final model is assumed to leak as much information as releasing all intermediate iterates. Such a strong adversarial setting might make obtaining good utility difficult. Alternative techniques for privacy amplification, such as amplification by iteration (Feldman et al., 2018; Altschuler & Talwar, 2022) or through convergence of Langevin dynamics (Chourasia et al., 2021), have been studied, where only the last iterate of the model is released. However, these analyses rely on additional assumptions such as convexity and smoothness of the loss functions. Investigating whether it is possible to relax these assumptions to make amplification by iteration applicable to non-convex models, even if under some assumptions that are applicable to the ones used in practice, is an interesting future direction.

Acknowledgments

We thank the anonymous ICML reviewers for their thoughtful comments and suggestions, which have improved the presentation of this paper. We also thank Christian Janos Lebeda, Matthew Regehr, Gautam Kamath, and Thomas Steinke for correspondence about their concurrent work (Lebeda et al., 2024).

Impact Statement

This paper presents work that points out that the particular type of batch sampling used can play a significant role in the privacy analysis of DP-SGD-type algorithms. The impact we see coming from our work is to make practitioners of DP more aware of these gaps. There are many potential societal consequences of DP itself, none of which we feel must be specifically highlighted here.

References

Abadi, M., Chu, A., Goodfellow, I. J., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In CCS, pp. 308–318, 2016.

Altschuler, J. M. and Talwar, K. Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. In NeurIPS, 2022.

Anil, R., Ghazi, B., Gupta, V., Kumar, R., and Manurangsi, P. Large-scale differentially private BERT. In EMNLP (Findings), pp. 6481–6491, 2022.
Balle, B. and Wang, Y. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In ICML, pp. 403–412, 2018.

Balle, B., Berrada, L., De, S., Ghalebikesabi, S., Hayes, J., Pappu, A., Smith, S. L., and Stanforth, R. JAX Privacy: Algorithms for privacy-preserving machine learning in JAX, 2022. URL http://github.com/google-deepmind/jax_privacy.

Bu, Z., Mao, J., and Xu, S. Scalable and efficient training of large convolutional neural networks with differential privacy. In NeurIPS, pp. 38305–38318, 2022.

Chen, D., Orekondy, T., and Fritz, M. GS-WGAN: A gradient-sanitized approach for learning differentially private generators. In NeurIPS, pp. 12673–12684, 2020.

Chourasia, R., Ye, J., and Shokri, R. Differential privacy dynamics of Langevin diffusion and noisy gradient descent. In NeurIPS, pp. 14771–14781, 2021.

De, S., Berrada, L., Hayes, J., Smith, S. L., and Balle, B. Unlocking high-accuracy differentially private image classification through scale. arXiv:2204.13650, 2022.

Denison, C., Ghazi, B., Kamath, P., Kumar, R., Manurangsi, P., Narra, K. G., Sinha, A., Varadarajan, A. V., and Zhang, C. Private ad modeling with DP-SGD. In AdKDD, 2023.

Dockhorn, T., Cao, T., Vahdat, A., and Kreis, K. Differentially private diffusion models. arXiv:2210.09929, 2022.

Doroshenko, V., Ghazi, B., Kamath, P., Kumar, R., and Manurangsi, P. Connect the dots: Tighter discrete approximations of privacy loss distributions. PoPETS, 2022(4):552–570, 2022.

Erlingsson, U., Feldman, V., Mironov, I., Raghunathan, A., Talwar, K., and Thakurta, A. Amplification by shuffling: From local to central differential privacy via anonymity. In SODA, pp. 2468–2479, 2019.

Fang, L., Du, B., and Wu, C. Differentially private recommender system with variational autoencoders. Knowledge-Based Systems, 250:109044, 2022.

Feldman, V., Mironov, I., Talwar, K., and Thakurta, A. Privacy amplification by iteration. In FOCS, pp. 521–532, 2018.

Feldman, V., McMillan, A., and Talwar, K. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In FOCS, pp. 954–964, 2021.

Feldman, V., McMillan, A., and Talwar, K. Stronger privacy amplification by shuffling for Rényi and approximate differential privacy. In SODA, pp. 4966–4981, 2023.

Ghazi, B., Kamath, P., Kumar, R., and Manurangsi, P. Faster privacy accounting via evolving discretization. In ICML, pp. 7470–7483, 2022.

Google Colab. URL https://colab.research.google.com/.

Google's DP Library. DP Accounting Library. https://github.com/google/differential-privacy/tree/main/python/dp_accounting, 2020.

Gopi, S., Lee, Y. T., and Wutschitz, L. Numerical composition of differential privacy. In NeurIPS, pp. 11631–11642, 2021.

He, J., Li, X., Yu, D., Zhang, H., Kulkarni, J., Lee, Y. T., Backurs, A., Yu, N., and Bian, J. Exploring the limits of differentially private deep learning with group-wise clipping. In ICLR, 2023.

Kairouz, P., McMahan, B., Song, S., Thakkar, O., Thakurta, A., and Xu, Z. Practical and private (deep) learning without sampling or shuffling. In ICML, pp. 5213–5225, 2021.

Klause, H., Ziller, A., Rueckert, D., Hammernik, K., and Kaissis, G. Differentially private training of residual networks with scale normalisation. arXiv:2203.00324, 2022.

Koskela, A., Jälkö, J., and Honkela, A. Computing tight differential privacy guarantees using FFT. In AISTATS, pp. 2560–2569, 2020.
Lebeda, C. J., Regehr, M., Kamath, G., and Steinke, T. Avoiding pitfalls for privacy accounting of subsampled mechanisms under composition, 2024. URL https://arxiv.org/abs/2405.20769.

Li, X., Tramer, F., Liang, P., and Hashimoto, T. Large language models can be strong differentially private learners. In ICLR, 2022.

McMahan, B., Rush, K., and Thakurta, A. G. Private online prefix sums via optimal matrix factorizations. arXiv:2202.08312, 2022.

Meiser, S. and Mohammadi, E. Tight on budget? Tight bounds for r-fold approximate differential privacy. In CCS, pp. 247–264, 2018.

Microsoft. A fast algorithm to optimally compose privacy guarantees of differentially private (DP) mechanisms to arbitrary accuracy. https://github.com/microsoft/prv_accountant, 2021.

Mironov, I. Rényi differential privacy. In CSF, pp. 263–275, 2017.

Papernot, N., Thakurta, A., Song, S., Chien, S., and Erlingsson, U. Tempered sigmoid activations for deep learning with differential privacy. In AAAI, pp. 9312–9321, 2021.

Ponomareva, N., Hazimeh, H., Kurakin, A., Xu, Z., Denison, C., McMahan, H. B., Vassilvitskii, S., Chien, S., and Thakurta, A. G. How to DP-fy ML: A practical guide to machine learning with differential privacy. J. Artif. Intell. Res., 77:1113–1201, 2023.

Prediger, L. and Koskela, A. Code for computing tight guarantees for differential privacy. https://github.com/DPBayes/PLD-Accountant, 2020.

Sommer, D. M., Meiser, S., and Mohammadi, E. Privacy loss classes: The central limit theorem in differential privacy. PoPETS, 2019(2):245–269, 2019.

TensorFlow Privacy, a. URL https://www.tensorflow.org/responsible_ai/privacy/api_docs/python/tf_privacy.

TensorFlow Privacy, b. URL https://www.tensorflow.org/responsible_ai/privacy/api_docs/python/tf_privacy/compute_dp_sgd_privacy_statement. Documentation for compute_dp_sgd_privacy_statement.

Torkzadehmahani, R., Kairouz, P., and Paten, B. DP-CGAN: Differentially private synthetic data and label generation. In CVPR Workshops, 2019.

Tramer, F. and Boneh, D. Differentially private learning needs better features (or much more data). In ICLR, 2021.

Vadhan, S. The Complexity of Differential Privacy. Springer, 2017.

Wang, J. T., Mahloujifar, S., Wu, T., Jia, R., and Mittal, P. A randomized approach to tight privacy accounting. In NeurIPS, 2023.

Yousefpour, A., Shilov, I., Sablayrolles, A., Testuggine, D., Prasad, K., Malek, M., Nguyen, J., Ghosh, S., Bharadwaj, A., Zhao, J., Cormode, G., and Mironov, I. Opacus: User-friendly differential privacy library in PyTorch. arXiv:2109.12298, 2021.

Yu, D., Naik, S., Backurs, A., Gopi, S., Inan, H. A., Kamath, G., Kulkarni, J., Lee, Y. T., Manoel, A., Wutschitz, L., et al. Differentially private fine-tuning of language models. In ICLR, 2022.

Zeighami, S., Ahuja, R., Ghinita, G., and Shahabi, C. A neural database for differentially private spatial range queries. In VLDB, pp. 1066–1078, 2022.

Zhu, Y., Dong, J., and Wang, Y. Optimal accounting of differential privacy via characteristic function. In AISTATS, pp. 4782–4817, 2022.

Ziller, A., Usynin, D., Braren, R., Makowski, M., Rueckert, D., and Kaissis, G. Medical imaging deep learning with differential privacy. Scientific Reports, 11(1):13524, 2021.

A. Additional Evaluations

A.1. ε vs. σ for fixed δ and T

We first plot ε against σ for fixed δ and T. We compute upper bounds on ε_P(δ) using Rényi DP (RDP) as well as using privacy loss distributions (PLD). These accounting methods are provided in the open-source Google dp_accounting library (Google's DP Library, 2020).
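For instance, the Poisson upper bounds can be computed along the following lines. This is our sketch assuming the DpEvent/accountant interface of the dp_accounting library (module paths and argument names may differ across versions), instantiated for the Figure 1 setting of δ = 10⁻⁶, T = 10,000, q = 1/T.

```python
# Illustrative sketch (ours): upper bounds on eps_P(delta) for the Poisson
# subsampled Gaussian mechanism composed over T steps, via the PLD and RDP
# accountants of the open-source dp_accounting library.
import dp_accounting

T, sigma, delta = 10_000, 0.5, 1e-6
event = dp_accounting.PoissonSampledDpEvent(
    sampling_probability=1.0 / T,
    event=dp_accounting.GaussianDpEvent(noise_multiplier=sigma))

pld_accountant = dp_accounting.pld.PLDAccountant()
pld_accountant.compose(event, count=T)
print("PLD:", pld_accountant.get_epsilon(delta))   # about 1.96, per Section 1.1

rdp_accountant = dp_accounting.rdp.RdpAccountant()
rdp_accountant.compose(event, count=T)
print("RDP:", rdp_accountant.get_epsilon(delta))   # about 3.43, per Section 1.1
```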
In particular, we consistently find that for small values of σ, ABLQ_S provides almost no improvement over ABLQ_D, and has ε_S significantly larger than ε_P.

In Figure 4, we set δ = 10⁻⁶ and the number of steps T = 100,000. In particular, for σ = 0.4, we find that ε_P(δ) ≤ 3 (as per PLD accounting) and ε_P(δ) ≤ 4.71 (as per RDP accounting), whereas, on the other hand, ε_S(δ) ≥ 14.45, which is very close to ε_D(δ). For σ = 1.3, we find that ε_P(δ) < 0.01 (as per PLD accounting), whereas ε_S(δ) > 0.029.

Figure 4. ε_D(δ), upper bounds on ε_P(δ), and a lower bound on ε_S(δ) for varying σ and fixed δ = 10⁻⁶ and T = 10,000.

In Figure 5, we set δ = 10⁻⁵ and the number of steps T = 1000. In particular, for σ = 0.7, ε_P(δ) ≤ 0.61 (as per PLD accounting) and ε_P(δ) ≤ 1.64 (as per RDP accounting), whereas, on the other hand, ε_S(δ) ≥ 6.528 and ε_D(δ) ≈ 6.652. For σ = 1.3, we find that ε_P(δ) < 0.092 (as per PLD accounting), whereas ε_S(δ) > 0.83.

Figure 5. ε_D(δ), upper bounds on ε_P(δ), and a lower bound on ε_S(δ) for varying σ and fixed δ = 10⁻⁵ and T = 1000.

A.2. δ vs. ε for fixed σ and T

Next, we plot δ against ε for fixed σ and T. We compute upper bounds on δ_P(ε) using Rényi DP (RDP) as well as using privacy loss distributions (PLD). In particular, when σ is closer to 1.0, we find that our lower bound on δ_S(ε) is distinctly smaller than δ_D(ε), but still significantly larger than δ_P(ε).

In Figure 6, we set σ = 0.8 and the number of steps T = 1000. In particular, while δ_P(1) ≤ 9.873 × 10⁻⁹ (as per PLD accounting) and δ_P(1) ≤ 3.346 × 10⁻⁵ (as per RDP accounting), we have that δ_S(1) ≥ 0.018 and δ_S(4) ≥ 1.6 × 10⁻⁴. For ε = 4, we find the upper bound using PLD accounting to be larger than the upper bound using Rényi DP accounting. This is attributable to the numerical instability of PLD accounting when δ is very small.

Figure 6. δ_D(ε), upper bounds on δ_P(ε), and a lower bound on δ_S(ε) for varying ε and fixed σ = 0.8 and T = 1000.

In Figure 7, we set σ = 1.0 and the number of steps T = 1000. In particular, while δ_P(1) ≤ 2.06 × 10⁻¹⁰ (as per PLD accounting) and δ_P(1) ≤ 8.45 × 10⁻⁵ (as per RDP accounting), we have that δ_S(1) ≥ 0.004 and δ_S(4) ≥ 4.38 × 10⁻⁷ (the last value is not shown in the plot). For ε > 1.0, we find that the upper bound using PLD accounting appears to not decrease as much, which could be due to the numerical instability of PLD accounting when δ is very small.

Figure 7. δ_D(ε), upper bounds on δ_P(ε), and a lower bound on δ_S(ε) for varying ε and fixed σ = 1.0 and T = 1000.

B. Proof of Theorem 4.1 (ABLQ_D vs ABLQ_S)

We use the joint convexity property of the hockey stick divergence. While this is standard, we include a proof for completeness.

Lemma B.1 (Joint Convexity of Hockey Stick Divergence). Given two mixture distributions P = Σ_{i=1}^m αᵢ·Pᵢ and Q = Σ_{i=1}^m αᵢ·Qᵢ, it holds for all ε ∈ R that D_{e^ε}(P‖Q) ≤ Σᵢ αᵢ·D_{e^ε}(Pᵢ‖Qᵢ).

Proof. We have that

D_{e^ε}(P‖Q) = sup_E {P(E) − e^ε·Q(E)} = sup_E Σ_{i=1}^m αᵢ·(Pᵢ(E) − e^ε·Qᵢ(E)) ≤ Σ_{i=1}^m αᵢ·sup_E {Pᵢ(E) − e^ε·Qᵢ(E)} = Σ_{i=1}^m αᵢ·D_{e^ε}(Pᵢ‖Qᵢ).

Thus, it follows that shuffling the dataset first cannot degrade the privacy guarantee of any mechanism, as shown below.

Lemma B.2. Fix a mechanism M mapping datasets of n records to distributions over O, and let M_S be defined as M_S(x) := M(x_π) for a uniformly random permutation π over [n], where x_π := (x_{π(1)}, . . . , x_{π(n)}). Then, if M satisfies (ε, δ)-DP, M_S also satisfies (ε, δ)-DP.

Proof. Consider any adjacent pair of datasets x ~ x'. For any permutation π over [n], let P_π := M(x_π) and Q_π := M(x'_π), and let P := M_S(x) and Q := M_S(x'). It is easy to see that P = Σ_π (1/n!)·P_π and Q = Σ_π (1/n!)·Q_π.
Since M satisfies (ε, δ)-DP, it follows that D_{e^ε}(P_π‖Q_π) ≤ δ for all permutations π. Thus, from Lemma B.1, it follows that D_{e^ε}(P‖Q) ≤ Σ_π (1/n!)·D_{e^ε}(P_π‖Q_π) ≤ δ. Hence M_S also satisfies (ε, δ)-DP.

Proof of Theorem 4.1. The proof follows by observing that if we choose M = ABLQ_D in Lemma B.2, then ABLQ_S is precisely the corresponding mechanism M_S.

C. Proof of Theorem 4.2 (ABLQ_D vs ABLQ_P)

We first state and prove some intermediate statements required for the proof of Theorem 4.2. We use the Gaussian measure of a halfspace.

Proposition C.1. For P = N(µ, σ²I) and the set E := {w ∈ R^d : a·w − b ≥ 0}, it holds that P(E) = Φ((a·µ − b)/(σ‖a‖₂)).

Proposition C.2. For all T ∈ N and distributions A, B, it holds that D₁(A^{⊗T}‖B^{⊗T}) ≤ 1 − (1 − D₁(A‖B))^T. Hence D₁(A^{⊗T}‖B^{⊗T}) ≤ T·D₁(A‖B), and equality can occur only if T = 1 or D₁(A‖B) = 0.

Proof. Recall that D₁(A‖B) is the total variation distance between A and B, which has the characterization inf_{(X,Y)} Pr[X ≠ Y], where the infimum is over couplings (X, Y) such that X ~ A and Y ~ B. Given any coupling (X, Y) of A, B, we construct a coupling ((X₁, . . . , X_T), (Y₁, . . . , Y_T)) of A^{⊗T}, B^{⊗T} by sampling each (Xᵢ, Yᵢ) independently according to the coupling (X, Y). For this coupling, we have Pr[(X₁, . . . , X_T) ≠ (Y₁, . . . , Y_T)] = 1 − Pr[(X₁, . . . , X_T) = (Y₁, . . . , Y_T)] = 1 − Pr[X = Y]^T. By taking the infimum over all couplings (X, Y) such that X ~ A, Y ~ B, we arrive at the desired bound.

We also note the simple observation that, for all P, Q, the hockey stick divergence D_{e^ε}(P‖Q) is 1-Lipschitz in e^ε.

Proposition C.3. For ε₁ < ε₂, it holds that D_{e^{ε₁}}(P‖Q) − D_{e^{ε₂}}(P‖Q) ≤ e^{ε₂} − e^{ε₁}.

Proof. We have that

D_{e^{ε₁}}(P‖Q) − D_{e^{ε₂}}(P‖Q) = sup_E [P(E) − e^{ε₁}·Q(E)] − sup_{E'} [P(E') − e^{ε₂}·Q(E')] ≤ sup_E [P(E) − e^{ε₁}·Q(E) − P(E) + e^{ε₂}·Q(E)] = (e^{ε₂} − e^{ε₁})·sup_E Q(E) = e^{ε₂} − e^{ε₁}.

Proof of Theorem 4.2. We prove each part as follows.

For part (a), first consider the case of ε = 0. In this case, D_{e^ε}(P‖Q) is simply the total variation distance between P and Q. Recall that P_P = A^{⊗T} and Q_P = B^{⊗T}, where A = (1 − 1/T)·Q_D + (1/T)·P_D and B = Q_D. Observe that D₁(A‖B) = (1/T)·D₁(P_D‖Q_D). Thus, we have that

D₁(P_P‖Q_P) = D₁(A^{⊗T}‖B^{⊗T}) < T·D₁(A‖B)   (Proposition C.2)
  = T·(1/T)·D₁(P_D‖Q_D) = D₁(P_D‖Q_D).

Note that the inequality is strict for T > 1. Since D_{e^ε}(P‖Q) is continuous in ε (see Proposition C.3), there exists some ε₀ > 0 such that for all ε ∈ [0, ε₀), δ_D(ε) > δ_P(ε).

For part (b), we construct an explicit "bad" event E ⊆ R^T such that P_P(E) − e^ε·Q_P(E) > D_{e^ε}(P_D‖Q_D). In particular, we consider

E := {w ∈ R^T : Σᵢ wᵢ ≥ (ε + log 2 + T log T)·σ² + T/2}.

The choice of E is such that, for all w ∈ E,

log(P_P(w)/Q_P(w)) = Σ_{t=1}^T log(A(w_t)/B(w_t)) = Σ_{t=1}^T log((1 − 1/T) + (1/T)·e^{(2w_t − 1)/(2σ²)}) ≥ −T log T + (Σ_{t=1}^T (2w_t − 1))/(2σ²) ≥ ε + log 2.

Hence it follows that log(P_P(w)/Q_P(w)) ≥ ε + log 2 for all w ∈ E, or equivalently, P_P(E) ≥ 2e^ε·Q_P(E). This implies that D_{e^ε}(P_P‖Q_P) ≥ P_P(E) − e^ε·Q_P(E) ≥ P_P(E)/2.

Next, we obtain a lower bound on P_P(E). For N_µ := N(µ, σ²I) and p_k := (1/T)^k·(1 − 1/T)^{T−k}, it holds that

P_P(E) = Σ_{µ ∈ {0,1}^T} p_{‖µ‖₁}·N_µ(E) ≥ N_0(E) = Φ(−(ε + T log T + log 2)·σ/√T − √T/(2σ)),

where we use Proposition C.1 in the last two steps (shifting the mean from 0 to any µ ∈ {0,1}^T can only increase the measure of E).

On the other hand, we have from Proposition 3.1 that

δ_D(ε) = Φ(−εσ + 1/(2σ)) − e^ε·Φ(−εσ − 1/(2σ)) ≤ Φ(−εσ + 1/(2σ)).   (5)

There exists a sufficiently large ε₁ such that, for all ε > ε₁,

δ_P(ε) ≥ D_{e^ε}(P_P‖Q_P) ≥ ½·Φ(−(ε + T log T + log 2)·σ/√T − √T/(2σ))   (6)
  > Φ(−εσ + 1/(2σ)) ≥ δ_D(ε),   (7)

by noting that for large ε the most significant term inside Φ(·) in (6) is −εσ/√T, whereas in (7) the most significant term inside Φ(·) is −εσ, which decreases much faster as ε → ∞, for a fixed T > 1 and σ > 0.
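As a numerical sanity check (ours) of the comparison between (6) and (7) as reconstructed above, the following SciPy sketch evaluates both bounds; for large enough ε, the lower bound (6) on δ_P(ε) exceeds the upper bound (7) on δ_D(ε). Note that the crossover of these loose bounds occurs at a much larger ε than the crossover of the exact curves shown in Figure 2.

```python
# Illustrative sketch (ours): compare the lower bound (6) on delta_P(eps) with
# the upper bound (7) on delta_D(eps); (6) eventually exceeds (7).
import numpy as np
from scipy.stats import norm

def lower_bound_delta_P(eps, sigma, T):   # right-hand side of (6)
    c = (eps + np.log(2.0) + T * np.log(T)) * sigma**2 + T / 2.0
    return 0.5 * norm.cdf(-c / (np.sqrt(T) * sigma))

def upper_bound_delta_D(eps, sigma):      # right-hand side of (7)
    return norm.cdf(-eps * sigma + 1.0 / (2.0 * sigma))

sigma, T = 0.3, 10
for eps in [1.0, 10.0, 30.0, 50.0, 70.0]:
    print(eps, lower_bound_delta_P(eps, sigma, T), upper_bound_delta_D(eps, sigma))
```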