# privacy_amplification_for_matrix_mechanisms__0719f5a5.pdf

Published as a conference paper at ICLR 2024

PRIVACY AMPLIFICATION FOR MATRIX MECHANISMS

Christopher A. Choquette-Choo Arun Ganesh Thomas Steinke

Abhradeep Thakurta

Privacy ampliﬁcation exploits randomness in data selection to provide tighter differential privacy (DP) guarantees. This analysis is key to DP-SGD s success in machine learning (ML), but, is not readily applicable to the newer state-of-the-art (SOTA) algorithms. This is because these algorithms, known as DP-FTRL, use the matrix mechanism to add correlated noise instead of independent noise as in DP-SGD.

In this paper, we propose MMCC , the ﬁrst algorithm to analyze privacy ampliﬁcation via sampling for any generic matrix mechanism. MMCC is nearly tight in that it approaches a lower bound as ε 0. To analyze correlated outputs in MMCC, we prove that they can be analyzed as if they were independent, by conditioning them on prior outputs. Our conditional composition theorem has broad utility: we use it to show that the noise added to binary-tree-DP-FTRL can asymptotically match the noise added to DP-SGD with ampliﬁcation. Our algorithm also has practical empirical utility. We show that ampliﬁcation leads to signiﬁcant improvement in the privacy/utility trade-offs for DP-FTRL style algorithms for standard benchmark tasks.

1 INTRODUCTION

Privacy ampliﬁcation is key in differentially private (DP) machine learning (ML) as it enables tighter privacy budgets under certain assumptions on the data processing. For example, one of the main contributions in the DP-SGD (DP Stochastic Gradient Descent) work by Abadi et al. (2016) was the moments accountant , which relies on privacy ampliﬁcation (Kasiviswanathan et al., 2008; Bassily et al., 2014) for bounding the privacy cost. Recently, privacy ampliﬁcation analysis enabled Choquette-Choo et al. (2023a) to show that a class of DP-FTRL (DP Follow-The-Regularized-Leader) algorithms (Smith & Thakurta, 2013; Kairouz et al., 2021) was superior in privacy-utility tradeoffs to DP-SGD.1 At the heart of DP-FTRL is the construct of matrix mechanism (Mc Kenna et al., 2021; Denisov et al., 2022) that effectively computes the preﬁx sums P i t xi over a sequence of adaptively chosen vectors {xi : i [n]} (given by A x, where A is a lower triangular matrix of all ones and x = [x1| |xn] Rn d). Matrix mechanism corresponds to factorizing A = B C to minimize the error in the estimation of the preﬁx sums, while ensuring C x + z satisﬁes DP, where z is drawn from an isotropic normal distribution. Bringing privacy ampliﬁcation to matrix mechanisms is thus an important area of research to enable better privacy-utility tradeoffs. (In the rest of the paper, we will refer to the matrix B as the decoder matrix, and C as the encoder matrix.)

Google Deep Mind. cchoquette@google.com. Google Research. arunganesh@google.com. Google Deep Mind. steinke@google.com. Google Deep Mind. athakurta@google.com. 1Precisely, they showed DP-FTRL is never worse, and often better, than DP-SGD it pareto-dominates .

Published as a conference paper at ICLR 2024

Matrix mechanism poses a major challenge for privacy ampliﬁcation analysis of DP-FTRL style algorithms. Standard privacy ampliﬁcation exploits randomness in the selection of minibatches2 but requires that the noise added to each minibatch is independent. In the matrix mechanism, a minibatch (given by xi) contributes to multiple rows of C x + z, thus preventing direct application of ampliﬁcation. This challenge can be seen by the limitations of the ampliﬁcation analysis of Choquette-Choo et al. (2023a) which only applies to a special class of b-banded matrix mechanisms (i.e., the ﬁrst b principal diagonals of C are non-zero), that in-turn leads to multiplicatively higher sampling probabilities preventing the full beneﬁts of ampliﬁcation. Resulting from these limitations is a large region of epsilons where banded matrix mechanisms cannot simultaneously leverage the beneﬁts of correlated noise and privacy ampliﬁcation; in other words, they perform equivalent to, but no better than, DP-SGD3. Further, since their analysis only applies to the special banded case, matrix mechanisms from the extant literature cannot leverage ampliﬁcation and correlated noise, e.g., Kairouz et al. (2021); Choquette-Choo et al. (2023b); Denisov et al. (2022).

In this work, we provide a generic privacy ampliﬁcation machinery for adaptive matrix mechanisms for any lower-triangular encoder matrix C that strictly generalizes the approach in (Choquette-Choo et al.,

1.1 OUR CONTRIBUTIONS

Our main contribution is to prove a general privacy ampliﬁcation analysis for any matrix mechanism, i.e., arbitrary encoder matrices C for non-adaptively chosen x, and for lower-triangular C s when x is adaptively chosen (which is the typical situation for learning tasks). We then demonstrate that our method yields both asymptotic improvements and experimental improvements.

Conditional composition (Sec. 3, Theorem 3.1): This is our main technical tool that gracefully handles dependence between the queries C[i,:] x, and C[i+1,:] x that arises due to multiple participation of a single row of the data matrix x. Standard composition theorems (Dwork & Roth, 2014) only handle this via a pessimistic worst-case privacy guarantee that holds with certainty for each query (in our application, a query is C[i,:] x+zi conditioned on C[1:i 1,:] x+z1:i 1). Theorem 3.1 relaxes this to holding with high probability (over the randomness of the algorithm) leading to signiﬁcantly better guarantees. This generalizes an idea previously used in (Erlingsson et al., 2019; Balle et al., 2020) to analyze privacy ampliﬁcation by shufﬂing. We believe this theorem will be useful for analyzing correlated noise mechanisms beyond those studied herein.

Matrix mechanism with uniform sampling via MMCC (Sec. 4): We prove ampliﬁed privacy guarantees for the matrix mechanism with uniform sampling, using Theorem 3.1, that are nearly-tight in the low-epsilon regime (as ε 0). We improve over Choquette-Choo et al. (2023a) because we enable more randomness in sampling instead of participating w.p. bp in n/b rounds, data records can participate w.p. p in all n rounds. Recall we need to analyze the privacy of outputting Cx + z, where rows of x are chosen via uniform sampling. We use Thm. 4.3 to reduce Cx + z to a series of mixture of Gaussians (Mo G) mechanisms for which we can use privacy loss distribution (PLD) accounting (see the proof sketch of Thm. 4.3 in Sec. 4). Our algorithm MMCC formally stating this reduction is given in Fig. 1. We plan to publicly release a library implementing MMCC with the ﬁnal manuscript. The analysis of Mo G mechanisms included in this library has other uses, such as tighter privacy guarantees for DP-SGD with group-level DP or for linear losses, see App. F for more discussion.

2E.g., for a data set D, a row x[i,:] = P

d S θℓ(θi; d), where S is a randomly chosen subset of D (e.g., sampled u.a.r. from D, or a subset from a random shufﬂing of D), ℓis a loss function, and θi is obtained via an SGD state update process. 3In Choquette-Choo et al. (2023a), this region surfaces empirically even for larger ε 1.

Published as a conference paper at ICLR 2024

Binary tree analysis (Sec. 5): Letting σε,δ be the noise required for the Gaussian mechanism to achieve to satisfy (ε, δ)-DP, the binary tree mechanism requires noise σε,δ log n. Owing to the versatility of conditional composition, we show that with shufﬂing, the (non-adaptive) binary tree mechanism only needs noise σε,δ min{ log n, p

log log(1/δ)}. This is optimal given current ampliﬁcation by shufﬂing results, which require n = Ω(log 1/δ)4. To the best of our knowledge, this is the ﬁrst ampliﬁcation guarantee (of any kind) for the binary tree mechanism.

Empirical improvements (Sec. 6): First we implement and show that ε computed via MMCC for the binary tree mechanism matches the theoretical predictions of Ω( log n) from Sec. 5. Then we apply our work to machine learning and show we can improve the privacy-utility tradeoff for binary-tree-DP-FTRL (Kairouz et al., 2021) entirely post-hoc. Finally, we show that the every round sampling enabled by MMCC achieves better ampliﬁcation than the b-min-sep sampling of (Choquette-Choo et al., 2023a).

1.2 PROBLEM DEFINITION

Matrix mechanism (MM): Consider a workload matrix A Rn n, and consider a data set D = {d1, . . . , dm} Dm. Let x = [x1(D)| |xn(D)] Rn d be a matrix s.t. each row xi : D Rd is a randomized function that ﬁrst selects a subset of the data set D, and then maps it to a real valued vector. Furthermore, each of the xi has the following two properties. a) Decomposability: For the subset of the data set D that xi chooses (call it Si), we have xi(D) = P

d Si gi(d) with gi : D Rd is a vector valued function, and b) bounded sensitivity: d D : gi(d) 2 1. Matrix mechanism corresponds to the class of DP algorithms that approximates Ax with low-error. Typically, one designs a pair of matrices A = BC ( which we will call the decoder and the encoder matrices respectively) s.t. Cx + z satisﬁes DP5 (with z being isotropic normal noise), and Bz is minimized in appropriately chosen norm. We will assume C is non-negative for simplicity.

Privacy ampliﬁcation for MM: In this work we study the problem of amplifying the DP guarantee of MM if we incorporate the randomness in the selection of the subsets of D by each function xi. In particular we consider two selection strategies: i) uniform sampling: each xi selects each entry of D independently w.p. p, ii) shufﬂing: First the records of D are randomly permuted, and then each xi picks a ﬁxed disjoint subset (of equal size) from D.

Adaptivity: In our work we allow the choice of xi s to be adaptive, i.e., xi can be chosen based on the ﬁrst i 1 outputs of MM. Under adaptivity, we will only consider encoder (B) decoder matrices (C) that are lower triangular. However, for non-adaptive choices of the xi s we allow arbitrary choice of the matrices B and C. Unless mentioned speciﬁcally, all our results will be for the adaptive setting.

2 BACKGROUND AND RELATED WORKS

2.1 PRIVACY LOSS DISTRIBUTIONS (PLD)

Suppose we have a DP mechanism M that outputs a sample from the continuous distribution P = M(D) when given database D, and outputs a sample from Q = M(D ) when given D . The ε-hockey stick divergence between two distributions P, Q is deﬁned as:

Hε(P, Q) = Z

x max{P(x) eεQ(x), 0}dx = Ex P

eln(P (x)/Q(x)) , 0 .

4We believe this requirement is fundamental and thus σε,δ min{ log n, p

log log(1/δ)} is optimal, but if it were removed, our result would improve to O(1) σε,δ. 5We use the zero-out adjacency (Ponomareva et al., 2023) to deﬁne DP in this paper.

Published as a conference paper at ICLR 2024

A mechanism M satisﬁes (ε, δ)-DP if and only if for all adjacent databases D, D we have Hε(M(D), M(D )) δ. From the deﬁnition, we see that to obtain the ε-hockey stick divergence between P and Q, it sufﬁces to know their privacy loss distribution (PLD):

Deﬁnition 2.1. The privacy loss random variable for P and Q is given by sampling x P, and computing ln(P(x)/Q(x)). The PLD of P and Q is the distribution of this random variable.

We frequently use the notion of dominating PLDs:

Deﬁnition 2.2 (Deﬁnition 7 in (Zhu et al., 2022)). The PLD of P, Q dominates the PLD of P , Q if for any ε, Hε(P, Q) Hε(P , Q ). We will also say random variable L dominates random variable L if for any ε, Hε(L) Hε(L ), where Hε(L) = Eℓ L max 1 eε ℓ, 0 .

Informally, a PLD dominates another PLD if any privacy guarantee satisﬁed by mechanisms with the dominating PLD is also satisﬁed by mechanisms with the dominated PLD. In particular, if the PLD of some pair of distributions P, Q dominates the PLDs of all pairs M(D), M(D ) for adjacent D, D , then if Hε(P, Q) δ, M satisﬁes (ε, δ)-DP.

2.2 PRIVACY AMPLIFICATION

Privacy ampliﬁcation via sampling analyzes the improvement in privacy given by randomly sampling a minibatch of examples instead of choosing it deterministically. Roughly, a (ε, δ)-DP mechanism run on a batch where each example participates with probability p satisﬁes (log(1 p + peε), δ)-DP. The relative improvement from ε to log(1 p + peε) gets better as ε gets smaller: log(1 p + peε) pε for ε < 1, but log(1 p + peε) ε log(1/p) for large ε. The beneﬁts of privacy ampliﬁcation via sampling in the independent noise setting of DP-SGD, i.e., the decoder matrix C = I, are extremely well-studied (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016; Mironov et al., 2019; Steinke, 2022; Koskela et al., 2020) with tight analyses. In particular, one round of DP-SGD is dominated by the PLD of N(0, σ2) and (1 p) N(0, σ2)+p N(1, σ2) and since each round of DP-SGD has independent randomness, composing this PLD with itself n times gives a tight dominating PLD, i.e. tight (ε, δ) curve, for DP-SGD.

3 CONDITIONAL COMPOSITION

We ﬁrst show a conditional composition theorem, which allows us to analyze a sequence of adaptive mechanisms using high-probability instead of worst-case privacy guarantees for each mechanism. We state conditional composition formally as Theorem C.2. This is a generalization of an idea used in (Erlingsson et al., 2019; Balle et al., 2020) to analyze ampliﬁcation by shufﬂing.

Theorem 3.1 (Informal version of Theorem C.2). Let M1, M2, . . . be a sequence of adaptive mechanisms, where each Mi takes D and the previous mechanisms output as input. Suppose there is some bad event E that happens with probability at most δbad over the randomness of M1, M2, . . . for any input D. If the composition of the worst-case privacy guarantees of M1, M2, . . . each conditioned on E not happening satisﬁes (ε, δ)-DP, then the composition of M1, M2, . . . satisﬁes (ε, δ + δbad)-DP.

The proof is given in App. C. To apply Theorem 3.1 to correlated noise mechanisms, we observe that they can be viewed as a sequence of adaptive independent-noise mechanisms:

Observation 3.2. Let M : D X1 X2 . . . Xn be a mechanism that takes a dataset D and outputs the tuple x = (x1, x2, . . . , xn) drawn from the distribution M(D). Let Mi : X1 X2 . . . Xi 1 D Xi be the mechanism that takes x 1, x 2, . . . , x i 1 and a dataset D and outputs x i with probability (or likelihood) Prx M(D) xi = x i|x1 = x 1, x2 = x 2, . . . , xi 1 = x i 1 . The output distributions of M and the composition of M1, M2, . . . are the same.

Published as a conference paper at ICLR 2024

Algorithm 1 Matrix Mechanism Conditional Composition algorithm, MMCC(C, p, σ, δ1, δ2)

1: Input: Matrix C, sampling probability p, noise standard deviation σ, probabilities δ1, δ2. 2: {epi,j}i,j [n] Probability Tail Bounds(C, p, σ, δ1). epi,j is a high-probability upper bound on the probability that an example participated in round j, conditioned on output in rounds 1 to i 1. 3: for i [n] do 4: PLDi PLD of MP Mo G({epi,j}j [n], {Ci,j}j [n]).

5: PLD convolution of {PLDi}i [n]. 6: return min ({ε : PLD satisﬁes (ε, δ2)-DP}).

Figure 1: Algorithm MMCC for computing ampliﬁed privacy guarantees of the matrix mechanism. The subroutine Probability Tail Bounds is given in Fig. 5 in App. C.

4 PRIVACY ANALYSIS FOR MATRIX MECHANISMS

In this section, we give an algorithm for computing an upper bound on the privacy guarantees of the matrix mechanism, and prove its correctness.

4.1 MIXTURE OF GAUSSIANS MECHANISMS

The key tool in our privacy analysis is a mixture of Gaussians mechanism, a generalization of the Gaussian mechanism with sampling. Here we deﬁne these mechanisms under the add adjacency, i.e. D contains an example zeroed out in D. Deﬁnition 4.1. A mixture of Gaussians (Mo G) mechanism is deﬁned by two lists, a list of probabilities {p1, p2, . . . , pk}, with P

i pi = 1, pi [0, 1], a list of sensitivities {c1, c2, . . . ck} and a noise level σ. For simplicity, we will assume ci 0. Given D, the mechanism MMo G({p1, p2, . . . , pk}, {c1, c2, . . . ck}) outputs z N(0, σ2). Given D , it samples s from the distribution with support {ci}i [k] and associated probabilities {pi}i [k], and outputs z N(s, σ2). In other words, it is a Gaussian mechanism where the sensitivity s is a random variable distributed according to {pi}i [k], {ci}i [k].

A vector mixture of Gaussians (VMo G) mechanism MV Mo G is the same as a Mo G mechanism, except the sensitivities ci are allowed to be vectors ci instead of scalars, and our output is sampled from a multivariate Gaussian z N(0, σ2 I) or z N(s, σ2 I).

For our proofs, we will need to prove a few properties of Mo G mechanisms. We give these properties and their proofs in App. B. It will be easier for us to work with a special case of Mo G mechanisms, where the probabilities and sensitivities arise from a product distribution: Deﬁnition 4.2. A product mixture of Gaussians (PMo G) mechanism is deﬁned by two lists {p1, . . . , pk} and {c1, . . . ck} and a noise level σ. The mechanism MP Mo G({p1, . . . , pk}, {c1, . . . , ck}) is deﬁned equivalently as MMo G({Q

i S(1 pi)|S 2[k]}, {P

i S ci|S 2[k]}).

4.2 MATRIX MECHANISM CONDITIONAL COMPOSITION

The high-level idea of our algorithm, MMCC (short for matrix mechanism conditional composition), for analyzing the matrix mechanism with ampliﬁcation is the following: The output of each round conditioned on the previous rounds output is a Mo G mechanism. For each round, we specify a Mo G mechanism that dominates this Mo G mechanism with high probability. Then by Theorem 3.1, it sufﬁces to compute the

Published as a conference paper at ICLR 2024

privacy loss distribution of each of the dominating Mo Gs, and then use composition to get our ﬁnal privacy guarantee. MMCC is given in Fig. 1. In App. C, we prove that MMCC computes a valid DP guarantee:

Theorem 4.3. Let ε be the output of MMCC(C, p, σ, δ1, δ2). The matrix mechanism with matrix C, uniform sampling probability p, and noise level σ satisﬁes (ε, δ1 + δ2)-DP.

We give a high-level overview of the proof. The proof proceeds in three steps. First, we show the matrix mechanism is dominated by a sequence of adaptively chosen scalar Mo G mechanisms, by analyzing the distribution for each round conditioned on previous rounds and applying a vector-to-scalar reduction (Lem. B.4). Second, we simplify these Mo G mechanisms by showing that each is dominated by a PMo G mechanism with probabilities pi,j depending on the outputs from previous rounds. Third, we show that with high probability pi,j epi,j for all i, j, i.e., the upper bounds generated by Probability Tail Bounds hold. We then apply Theorem 3.1.

Tightness: To get a sense for how tight MMCC is, if in MMCC we instead set epi,j = p for all i, j, this is equivalent to analyzing the matrix mechanism as if each row were independent. Since the rows are actually correlated, we expect this analysis to give a lower bound on the true value of ε. So we can use maxi,j epi,j/p as roughly an upper bound on the ratio of the ε reported by MMCC and the true ε value. In particular, as σ , for epi,j computed by Probability Tail Bounds this ratio approaches 1, i.e. MMCC gives tight ε guarantees in the limit as σ .

Sampling scheme of (Choquette-Choo et al., 2023a): The techniques used in MMCC are complementary to those in (Choquette-Choo et al., 2023a): In App. D, we give a generalization of MMCC that analyzes the matrix mechanism under their b-min-sep sampling. For b = 1, this is the same as i.i.d. sampling every round so this generalization retrieves MMCC. For b-banded matrices this generalization retrieves exactly the DP-SGD-like analysis of (Choquette-Choo et al., 2023a). In other words, this generalization subsumes all existing ampliﬁcation results for matrix mechanisms.

Beneﬁts of i.i.d. sampling: MMCC is the ﬁrst analysis that allows us to beneﬁt from both correlated noise and privacy ampliﬁcation via i.i.d. (i.e., maximally random) sampling. In Sec. 6.3 we demonstrate that the combination of beneﬁts allows us to get better ℓ2 2-error for computing all preﬁx sums than independent-noise mechanisms, for much smaller ε than prior work.

5 AMPLIFICATION VIA SHUFFLING FOR NON-ADAPTIVE BINARY TREE

In this section, we show that ampliﬁcation allows us to improve the privacy guarantees of the binary tree mechanism of (Dwork et al., 2010; Chan et al., 2011). We consider the setting where ﬁrst the data set D is randomly permuted (call it Π(D)), and each function xi (in the deﬁnition of MM from Section 1.2) picks the i-th data record in Π(D). Roughly speaking, using privacy ampliﬁcation by shufﬂing (see Section 1.2) we improve σ for this mechanism by Ω( log n/ p

log log(1/δ)), while maintaining that each example participates once. For simplicity throughout the section we restrict to the case where n is a power of 2.

Binary tree mechanism: The binary tree computes sums of rows of x over the intervals [1 : 1], [2 : 2], . . . , [n : n], [1 : 2], [3 : 4], . . . , [n 1 : n], [1 : 4], . . . [1 : n] with noise. That is, it outputs n P

k 2j+1 i (k+1) 2j xi + zj,k o

0 j log n,0 k<n/2j , where zj,k i.i.d. N(0, σ2). Equivalently, it is a (non-

square) matrix mechanism where for each j, k pair there is a row of C where the entries in the interval [k 2j + 1 : (k + 1) 2j] are 1 and the remaining entries are 0. We refer to all the noisy sums indexed by the same j as level j. In the single-epoch setting (without shufﬂing), each row of xi is a sensitivity-1 function computed on the ith example in D. The binary tree mechanism then satisﬁes the privacy guarantees of distinguishing z and Cei +z, where ei is an elementary vector. Since each row of x is included in log n+1 of the

Published as a conference paper at ICLR 2024

sums, we have Cei 2 = log n + 1, i.e. the binary tree mechanism satisﬁes O

log(n) log(1/δ)

We now analyze the binary tree mechanism under shufﬂing. To apply Theorem 3.1, we need the following analysis of approximate shufﬂing given in Lem. 5.1, proven in App. C.

Lemma 5.1 (Simpliﬁcation of Lem. C.4 in App. C). Suppose we run n Gaussian mechanisms on n inputs, where the order of the inputs is chosen according to a distribution such that no input appears in a certain position with probability more than 1/n . Then for δ 2 Ω(n ), δ0 0, this set of mechanisms satisﬁes O

ln(1/δ0) ln(1/δ)

Theorem 5.2. The non-adaptive binary tree mechanism run on Π(D) satisﬁes O

log(1/δ) log log(1/δ)

, δ -DP for σ = Ω( p

log(1/δ) log log(1/δ)), δ [2 Ω(n), 1/n].

The proof of Theorem 5.2 is given in App. C. We give a summary here: For the top levels j [log n O(log log(1/δ)), log n], the number of sums per level (i.e., n in Lem. 5.1) is less than log(1/δ) so we cannot apply Lem. 5.1. Instead we use the unampliﬁed privacy analysis of the Gaussian mechanism to analyze the privacy of the corresponding noisy sums. For the remaining levels, we combine Theorem 3.1 and Lem. 5.1 to show level j s privacy parameter ε is exponentially decaying in log n j, i.e. the top O(log log(1/δ)) levels dominate the privacy guarantee. Note that the theorem is proven in the non-adaptive case; our argument for adaptivity in Sec. 4 implicitly requires independence of participations across examples, which does not hold for shufﬂing.

6 EMPIRICAL IMPROVEMENTS

We implement MMCC by building on methods in the open-source dp accounting Python library (DP Team, 2022), and perform empirical studies of the ampliﬁcation beneﬁts from MMCC. PLD accounting for Mo G mechanisms is currently open-sourced as part of the dp accounting library. We plan to opensource our implementation of MMCC which builds on dp accounting. There are some challenges in the implementation which we discuss in App. F. For simplicity we use δ1, δ2 = δ/2 in MMCC.

6.1 BINARY TREE MECHANISM AMPLIFICATION

In this section, we show how the privacy guarantee of the binary tree mechanism empirically improves if we use sampling and MMCC. In App. E we repeat this study for a different matrix mechanism proposed by (Fichtenberger et al., 2023).

As a baseline, we ﬁx a constant c, and consider the binary tree mechanism under a single-participation constraint, with σ = c p

log(n) + 1. By the analysis of the Gaussian mechanism, for all n that are powers of 2, the binary tree mechanism with this choice of σ under a single-participation constraint without ampliﬁcation satisﬁes (ε, δ)-DP for the same ε, δ. In other words, as we increase n, the privacy guarantee of the unampliﬁed mechanism remains ﬁxed. Then, for the same c and n that are powers of 2, we use MMCC to compute a privacy guarantee for the binary tree mechanism with subsampling probability 1/n and the same choice of σ. By the analyses in Section 5, we expect that with subsampling, the value of ε will decrease as Ω( log n).

In Fig. 2, we observe that empirical improvement in ε due to ampliﬁcation is roughly proportional to p

log(n) + 1. We also observe two improvements as c (i.e., σ) increases. First, the multiplicative improvement in ε increases; second, empirical improvements better match a linear ﬁt to p

log(n) + 1. Both

Published as a conference paper at ICLR 2024

2 4 6 8 10 log(n)

Ratio of unamplified to amplified

. Higher is better

c = 5 (unamplified = 0.834)

y=0.2921 (log n + 1)+0.2573

2 4 6 8 10 log(n)

Ratio of unamplified to amplified

. Higher is better

c = 10 (unamplified = 0.397)

y=0.4861 (log n + 1)+0.1441

2 4 6 8 10 log(n)

Ratio of unamplified to amplified

. Higher is better

c = 20 (unamplified = 0.189)

y=0.5804 (log n + 1)+0.1172

Figure 2: Multiplicative improvement of our ampliﬁcation analysis (roughly) matches p

log(n) + 1. A higher ratio (> 1) indicates ampliﬁcation is better. We plot n = 2i, i {1, 2, . . . , 10} with σ = c p

log(n) + 1 so ε is ﬁxed for unampliﬁed single-participation. δ = 10 6.

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 Privacy Budget,

Test Accuracy

None (Kairouz et al. 2023) MMCC (Ours)

Figure 3: Our ampliﬁcation analysis leads to signiﬁcant gains over Kairouz et al. (2021) on practical ML experiments (CIFAR-10), entirely posthoc.

50 100 150 200 Noise Standard Deviation,

Privacy Budet, , @ =10 6

Analysis Method

MMCC, (Ours) Choquette-Choo et al. (2023)

Figure 4: MMCC gives tighter ε than the analysis of (Choquette-Choo et al., 2023a) for a DPFTRL-Tree Restart mechanism of height 4. Ran for n = 512 steps with p = 1 16.

these improvements are explained by the fact that (as discussed in Sec. 4) as σ , MMCC reports a tighter ε.

6.2 LEARNING EXPERIMENTS WITH BINARY-TREE-DP-FTRL

A motivating reason for us to study matrix mechanisms is that the analysis of Kairouz et al. (2021) has a suboptimal scaling in the amount of noise added, which manifests in their experiments with DP machine learning. We reproduce the centralized DP training on CIFAR-10 from Choquette-Choo et al. (2023b), including model architecture, tuning setup, hyperparameter choices, and optimizations to the tree aggregation mechanism for ML; we use these as our baseline results.

In Fig. 3, we re-analyze the baseline using MMCC and show signiﬁcant improvements in privacy-utility tradeoffs for DP-FTRL via binary trees. In particular, we observe that these beneﬁts become larger as ε becomes small. Note that these improvements are entirely post-hoc, i.e. the algorithm is the same, but with a better privacy analysis.

6.3 I.I.D. SAMPLING ENABLES BETTER AMPLIFICATION THAN b-MIN-SEP SAMPLING

The prior work of (Choquette-Choo et al., 2023a) gives an ampliﬁcation result using a sampling scheme we call b-min-sep sampling for b-banded matrices. In their sampling scheme, each example participates in

Published as a conference paper at ICLR 2024

n/b rounds with sampling probability bp. In contrast, MMCC enables sampling each example in all n rounds with probability p, a more random form of sampling. We compare the two ampliﬁcation analyses using the DP-FTRL-Tree Restart algorithm of (Kairouz et al., 2021), which sequentially runs n/2h 1 height-h binary tree mechanisms, each binary tree mechanism run for 2h 1 rounds. This corresponds to a matrix mechanism that is 2h 1-banded, so we can apply the results of (Choquette-Choo et al., 2023a). In Fig. 4, we compare the ε for DP-FTRL-Tree Restart computed as a function of σ using MMCC and the analysis of (Choquette-Choo et al., 2023a), in the setting of n = 512, p = 1/16, h = 4, and we see that indeed the more random sampling enabled by MMCC allows for improved privacy guarantees compared to b-min-sep sampling.

7 DISCUSSION, FUTURE DIRECTIONS, AND CONCLUSION

In this paper, we proposed MMCC, which gives tight ampliﬁcation guarantees for sampling in the limit as ε 0. One limitation of our work is that we are not able to prove adaptivity for non-lower triangular C, which captures important matrix mechanisms like the fully efﬁcient binary tree mechanism (Honaker, 2015). It is an important future direction to fully understand what combinations of privacy ampliﬁcation and correlated noise allow the same privacy for non-adaptive and adaptive inputs. In addition, there are many potential improvements to MMCC, as well as open problems that naturally follow from our work. Another open problem that we make progress towards is proving DP-FTRL strictly dominates DP-SGD, i.e. for any ε > 0 DP-FTRL achieves strictly6 better utility than DP-SGD under an appropriate deﬁnition of utility. In particular, we conjecture that a tighter ampliﬁcation analysis than that of MMCC could show that even for ε close to 0, DP-FTRL with a matrix mechanism where C is close to but not equal to the identity has strictly better utility than DP-SGD.

Our interest in the matrix mechanism is primarily motivated by the works of (Denisov et al., 2022; Choquette-Choo et al., 2023b;a) which considered the problem of choosing C that optimizes (a proxy for) the utility of DP-FTRL. The utility of DP-FTRL can be written as a function of C 1, and thus can be optimized under a constraint of the form the matrix mechanism deﬁned by C satisﬁes a given privacy deﬁnition . Without ampliﬁcation, this constraint can usually be easily written as e.g. C S where S is a convex set of matrices, which makes optimizing under this constraint easy. An interesting question is whether we can solve the same problem, except the privacy constraint accounts for ampliﬁcation. This would likely require designing a function that takes C, p, σ and approximates ε that is differentiable in C (unlike MMCC, which is an algorithmic computation that is not easily differentiable).

In these works, DP-FTRL is always strictly better than DP-SGD without ampliﬁcation, but with ampliﬁcation for small ε the optimal choice of C with ampliﬁcation is the identity, i.e. the optimal DP-FTRL is just DP-SGD (with independent noise). If we could optimize C under an ampliﬁed privacy constraint, we conjecture the following (perhaps surprising) statement could be proven as a corollary: As long as we are not in the full-batch setting, even with ampliﬁcation by sampling, the optimal choice of C is never the identity for ε > 0. In other words, despite its ubiquity, DP-SGD is never the optimal algorithm to use (ignoring computational concerns).

Due to space constraints, we defer the discussion of other future directions to App. A.

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan Mc Mahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308 318, 2016.

6Note that (Choquette-Choo et al., 2023a) showed that DP-FTRL is always at least as good as DP-SGD, but in their results for small ε DP-SGD is just as good as DP-FTRL, i.e. DP-FTRL is not strictly better.

Published as a conference paper at ICLR 2024

Borja Balle, Peter Kairouz, H Brendan Mc Mahan, Om Thakkar, and Abhradeep Thakurta. Privacy ampliﬁcation via random check-ins. In Neur IPS, 2020.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efﬁcient algorithms and tight error bounds. In Proc. of the 2014 IEEE 55th Annual Symp. on Foundations of Computer Science (FOCS), pp. 464 473, 2014.

T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Trans. on Information Systems Security, 14(3):26:1 26:24, November 2011.

Christopher A Choquette-Choo, Arun Ganesh, Ryan Mc Kenna, H Brendan Mc Mahan, Keith Rush, Abhradeep Guha Thakurta, and Zheng Xu. (ampliﬁed) banded matrix factorization: A uniﬁed approach to private training. ar Xiv preprint ar Xiv:2306.08153, 2023a. URL https://arxiv.org/abs/2306. 08153.

Christopher A. Choquette-Choo, Hugh Brendan Mc Mahan, J Keith Rush, and Abhradeep Guha Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 5924 5963. PMLR, 23 29 Jul 2023b. URL https://proceedings.mlr. press/v202/choquette-choo23a.html.

Sergey Denisov, H. Brendan Mc Mahan, John Rush, Adam Smith, and Abhradeep Guha Thakurta. Improved differential privacy for sgd via optimal private linear operators on adaptive streams. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 5910 5924. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 271ec4d1a9ff5e6b81a6e21d38b1ba96-Paper-Conference.pdf.

DP Team. Google s differential privacy libraries., 2022. https://github.com/google/ differential-privacy.

Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 4):211 407, 2014.

Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715 724, 2010.

Ulfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Ampliﬁcation by shufﬂing: From local to central differential privacy via anonymity. In ACMSIAM Symposium on Discrete Algorithms (SODA), 2019. URL https://arxiv.org/abs/1811. 12469.

Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy ampliﬁcation by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 521 532. IEEE, 2018.

Vitaly Feldman, Audra Mc Millan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy ampliﬁcation by shufﬂing. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 954 964, 2022. doi: 10.1109/FOCS52979.2021.00096.

Published as a conference paper at ICLR 2024

Hendrik Fichtenberger, Monika Henzinger, and Jalaj Upadhyay. Constant matters: Fine-grained error bound on differentially private continual observation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 10072 10092. PMLR, 2023. URL https://proceedings.mlr.press/ v202/fichtenberger23a.html.

Arun Ganesh. Tight group-level dp guarantees for dp-sgd with sampling via mixture of gaussians mechanisms, 2024.

Monika Henzinger, Jalaj Upadhyay, and Sarvagya Upadhyay. A unifying framework for differentially private sums under continual observation, 2023.

James Honaker. Efﬁcient use of differentially private binary trees, 2015.

Peter Kairouz, Brendan Mc Mahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shufﬂing. In ICML, 2021.

Shiva Prasad Kasiviswanathan and Adam Smith. On the semantics of differential privacy: A bayesian formulation. J. Priv. Conﬁdentiality, 6(1), 2014. doi: 10.29012/jpc.v6i1.634. URL https://doi. org/10.29012/jpc.v6i1.634.

Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? In 49th Annual IEEE Symp. on Foundations of Computer Science (FOCS), pp. 531 540, 2008.

Antti Koskela, Joonas J alk o, and Antti Honkela. Computing tight differential privacy guarantees using fft. In International Conference on Artiﬁcial Intelligence and Statistics, pp. 2560 2569. PMLR, 2020.

Ryan Mc Kenna, Gerome Miklau, Michael Hay, and Ashwin Machanavajjhala. Hdmm: Optimizing error of high-dimensional statistical queries under differential privacy. ar Xiv preprint ar Xiv:2106.12118, 2021.

Ilya Mironov, Kunal Talwar, and Li Zhang. R\ enyi differential privacy of the sampled gaussian mechanism. ar Xiv preprint ar Xiv:1908.10530, 2019. URL https://arxiv.org/abs/1908.10530.

Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan Mc Mahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Guha Thakurta. How to DP-fy ML: A practical guide to machine learning with differential privacy. Journal of Artiﬁcial Intelligence Research, 77:1113 1201, jul 2023. doi: 10.1613/jair.1.14649. URL https://doi.org/10.1613%2Fjair.1.14649.

Adam Smith and Abhradeep Thakurta. (nearly) optimal algorithms for private online learning in fullinformation and bandit settings. In Advances in Neural Information Processing Systems, pp. 2733 2741, 2013.

Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 245 248. IEEE, 2013.

Thomas Steinke. Composition of differential privacy & privacy ampliﬁcation by subsampling. ar Xiv preprint ar Xiv:2210.00597, 2022. URL https://arxiv.org/abs/2210.00597.

Salil Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pp. 347 450. Springer, 2017.

Published as a conference paper at ICLR 2024

Yuqing Zhu, Jinshuo Dong, and Yu-Xiang Wang. Optimal accounting of differential privacy via characteristic function. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera (eds.), Proceedings of The 25th International Conference on Artiﬁcial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 4782 4817. PMLR, 28 30 Mar 2022. URL https: //proceedings.mlr.press/v151/zhu22c.html.

Published as a conference paper at ICLR 2024

A MORE FUTURE DIRECTIONS

First, our tail bound on the conditional sampling probabilities epi,j approach p as σ . However, for ﬁnite σ, epi,j can be much larger than p, i.e. the ε computed by MMCC can be much larger than the true ε. We believe the values of epi,j we compute are not tight and can be improved. In particular, in computing epi,j, we give a tail bound on the maximum of the dot product of a Gaussian with a set of vectors, and the values of epi,j we compute effectively correspond to the case where this tail bound is attained by every dot product. This is overly pessimistic, and it should be possible to obtain tighter ε via a more reﬁned tail-bounding approach.

Second, while MMCC has a polynomial dependence on n (whereas computing Hε via e.g. numerical integration would require time exponential in n), empirically we found that even with many optimizations for runtime, running MMCC for n 2000 still took several hours. In practice, we would often like to run for larger n, or do multiple sequential runs of MMCC in order to e.g. compute the smallest σ that gives a certain ε via binary search. In turn, it is practically interesting/important to make MMCC more efﬁcient, or discover another algorithm that gives ε comparable to or better than MMCC, but with a smaller runtime.

B PROPERTIES OF MOG MECHANISMS

The following lemma informally lets us show that domination for the remove adjacency (i.e., D contains an example zeroed out in D ) is equivalent to domination for the add adjacency (i.e., D contains an example zeroed out in D). Thus, we usually only need to prove statements under one of the two adjacencies, and it is implied for the other as well.

Lemma B.1 (Lemma 29 in (Zhu et al., 2022)). The PLD of P, Q dominates the PLD of P, Q if and only if the PLD of Q, P dominates the PLD of Q , P.

B.1 MONOTONICITY OF MOG MECHANISMS

The following shows the privacy guarantees of a Mo G mechanism are monotonic in the sensitivity random variable ci:

Lemma B.2. Let {p1, p2, . . . pk}, {c1, c2, . . . , ck} and {c 1, c 2, . . . c k} be such that (i) each ci is nonnegative and (ii) c i is entry-wise greater than or equal to ci for all i, i.e. each c i ci is non-negative.

Then the PLD of MV Mo G({p1, p2, . . . pk}, {c1, c2, . . . , ck})

is dominated by the PLD of

MV Mo G({p1, p2, . . . pk}, {c 1, c 2, . . . c k}).

Proof. By Lem. B.1, it sufﬁces to only consider the remove adjacency, i.e. given D we sample ci and then sample from N(ci, σ2I) and given D from N(0, σ2I). The privacy loss of outputting x is:

PL(x) := ln

2 ci, x ci 2 2 2σ2

Let S Rd be monotonic if for any x S, y such that y x is non-negative, y is also in S. In other words, increasing any subset of the entries of x S gives another vector in S. Since all ci are non-negative, if y x is non-negative, then the privacy loss of outputting y is larger than that of outputting x. So for any

Published as a conference paper at ICLR 2024

VMo G mechanism and any t, the set of outputs St = {x : PL(x) t} is monotonic. By the Neyman Pearson lemma it sufﬁces to consider only the sets St in the deﬁnition of (ε, δ)-DP, i.e. a mechanism satisﬁes (ε, δ)-DP if and only if

t : Pr x M(D)[x St] eε Pr x M(D )[x St] + δ.

So, in order to show that to show the ﬁrst VMo G mechanism is dominated by the second, it sufﬁces to show the probability that x N(ci, σ2I) is in any monotonic set S is at most the probability that x N(c i, σ2I) is in S. This is immediate by a coupling of the two random variables: we let the ﬁrst random variable be ci + z and the second random variable be c i + z, where the choice of i and Gaussian noise z are the same for both random variables. For any monotonic S, since c i ci is non-negative, ci + z is in S only if c i + z is in S, giving that the probability x N(ci, σ2I) is in S is at most the probability x N(c i, σ2I) is in S.

Since the above proof holds for any ci, c i satisfying the assumptions in the lemma, it also holds if c i are ﬁxed/non-adaptive but the entries in ci are chosen adaptively (while still satisfying the assumptions in the lemma), i.e. the jth coordinate of ci is chosen only after seeing the ﬁrst j 1 coordinates of the output. In the scalar case, we get the following corollary:

Corollary B.3. Let {p1, p2, . . . pk}, {c1, c2, . . . , ck} and {p 1, p 2, . . . p k }, {c 1, c 2, . . . c k } be such that for all T, P

i:c i T p i P

i:ci T pi. In other words, the random variable induced by {pi}i, {ci}i is stochastically dominated by the random variable induced by {p i}i, {c i}i. We also assume ci, c i 0 for all i.

Then the PLD of MMo G({p1, p2, . . . pk}, {c1, c2, . . . , ck})

is dominated by the PLD of

MMo G({p 1, p 2, . . . p k }, {c 1, c 2, . . . c k }).

Cor. B.3 follows from Lem. B.2 since by allowing duplicate ci values, we can reduce to the setting where the probabilities are the same, and ci c i for all ci. For example, if ci is 0 or 1 w.p. 1/2 and c i is 0, 1, or 2 w.p. 1/3, we can use {pi} = {1/3, 1/6, 1/6, 1/3}, {ci} = {0, 0, 1, 1}, and {c i} = {0, 1, 1, 2}.

B.2 DIMENSION REDUCTION FOR MOG MECHANISMS

We now give the following lemma, which lets us reduce the dimensions of a VMo G mechanism.

Lemma B.4. Let c1, c2, . . . , ck Rn p. Let c 1, c 2, . . . , c k Rn be vectors such that (ci)j,: 2 c i(j) for all i, j, i.e. the entries of c i upper bound the ℓ2-norms of the corresponding rows of ci. Then the PLD of

MV Mo G({p1, p2, . . . , pk}, {c1, c2, . . . , ck})

is dominated by the PLD of

MV Mo G({p1, p2, . . . , pk}, {c 1, c 2, . . . , c k}).

Furthermore, this holds even if the rows of each ci are adaptively chosen and c i are ﬁxed, i.e. the jth row of all ci is chosen by an adversary after seeing the ﬁrst j 1 rows of the output of the VMo G mechanism, as long as the assumption (ci)j,: 2 c i(j) holds.

We need the following lemma, which we can apply multiple times to prove Lem. B.4:

Published as a conference paper at ICLR 2024

Lemma B.5. Let w1, w2, . . . wk > 0 be positive scalars and let c1, c2, . . . ck Rp be arbitrary vectors. Then for any ε and σ > 0:

Ex N(0,σ2Ip)

i wi exp( ci, x ) eε, 0

i wi exp( ci 2 x) eε, 0

i wi exp( ci 2 x) as a function of x is continuous, increasing in x, and has range R+. So, there exists some t such that P

i wi exp( ci 2 t) = eε. For this choice of t, let ti = wi exp( ci 2 t). Then we have for all x:

i wi exp( ci 2 x) eε, 0

i max {wi exp( ci 2 x) ti, 0} .

Now, by linearity of expectation and the fact that max{P i ai, P i bi} P i max{ai, bi}:

Ex N(0,σ2Ip)

i wi exp( ci, x ) eε, 0

Ex N(0,σ2Ip)

i max {wi exp( ci, x ) ti, 0}

i Ex N(0,σ2Ip) [max {wi exp( ci, x ) ti, 0}]

i Ex N(0,σ2) [max {wi exp( ci 2 x) ti, 0}]

= Ex N(0,σ2)

i max {wi exp( ci 2 x) ti, 0}

= Ex N(0,σ2)

i wi exp( ci 2 x) eε, 0

Proof of Lem. B.4. Lem. B.2 holds for adaptively chosen ci and ﬁxed c i (using the notation of that lemma), so by Lem. B.2 it sufﬁces to prove the lemma for adaptive ci and ﬁxed c i such that (ci)j,: 2 = c i(j) for all i, j. Further, by Lem. B.1, it sufﬁces to show the lemma under the remove adjacency. That is, P = N(ci, σ2In p), Q = N(0, σ2In p), P = N(c i, σ2In), Q = N(0, σ2In), and it sufﬁces to show Hε(P, Q) Hε(P , Q ) for all ε.

Hε(P, Q) = Ex Q

= Ex N(0,σ2In p)

i pi exp( x ci 2 2 /2σ2)

exp( x 2 2 /2σ2) eε, 0

= Ex N(0,σ2In p)

2 ci, x ci 2 2 2σ2

Published as a conference paper at ICLR 2024

To reﬂect the fact that ci can be chosen adaptively, let ci,j(x1:j 1) denote any adversary s adaptive choice of the jth row of ci after observing the ﬁrst j 1 rows of x. We can then write Hε(P, Q) as:

Ex1,x2,...xn N(0,σ2Ip)

2 ci,j(x1:j 1), xj ci,j(x1:j 1) 2 2 2σ2

Ex1,x2,...xn 1 N(0,σ2Ip)

Exn N(0,σ2Ip)

2 ci,j(x1:j 1), xj ci,j(x1:j 1) 2 2 2σ2

Note that the values of all ci,j in (1) are constants with respect to the inner expectation. So for any realization of x1, x2, . . . , xn 1, choosing

wi = pi exp

ci,n(x1:n 1) 2 2 2σ2

j [n 1] exp

2 ci,j(x1:j 1), xj ci,j(x1:j 1) 2 2 2σ2

and observing that by assumption ci,n(x1:n 1) 2 = c i(n), we can apply Lem. B.5 to upper bound the inner expectation in (1) as:

(1) Ex1,x2,...xn 1 N(0,σ2Ip),xn N(0,σ2)

2 ci,j(x1:j 1), xj ci,j(x1:j 1) 2 2 2σ2

exp 2c i(n)xn c i(n)2

We can then iteratively repeat this argument for rows n 1, n 2, ... 1 to get:

Hε(P, Q) Ex1,x2,...xn 1 N(0,σ2Ip),xn N(0,σ2)

j [n 1] exp

2 ci,j(x1:j 1), xj ci,j(x1:j 1) 2 2 2σ2

exp 2c i(n)xn c i(n)2

Ex1,x2,...xn 2 N(0,σ2Ip),xn 1,xn N(0,σ2)

j [n 2] exp

2 ci,j(x1:j 1), xj ci,j(x1:j 1) 2 2 2σ2

j [n]\[n 2] exp 2c i(j)xj c i(j)2

Ex1,x2,...xn 3 N(0,σ2Ip),xn 2,xn 1,xn N(0,σ2)

j [n 3] exp

2 ci,j(x1:j 1), xj ci,j(x1:j 1) 2 2 2σ2

j [n]\[n 3] exp 2c i(j)xj c i(j)2

Published as a conference paper at ICLR 2024

Ex1,x2,...,xn N(0,σ2)

j [n] exp 2c i(j)xj c i(j)2

= Ex N(0,σ2In)

2 c i, x c i 2 2 2σ2

= Hε(P , Q ).

As a corollary to the above matrix-to-vector reduction , we have a vector-to-scalar reduction for Mo G mechanisms:

Corollary B.6. The PLD of

MV Mo G({p1, p2, . . . , pk}, {c1, c2, . . . , ck})

is dominated by the PLD of

MMo G({p1, p2, . . . , pk}, { c1 2 , c2 2 , . . . , ck 2}).

C DEFERRED PROOFS

C.1 PROOF OF THEOREM C.2

We will need the following lemma, which shows domination is preserved by composition:

Lemma C.1 (Theorem 10 in (Zhu et al., 2022)). Let M1, . . . , Mk be an adaptive sequence of mechanisms, i.e., each mechanism receives the output of all previous mechanism and the database. Suppose for all i and joint outputs x of M1, . . . Mi 1, the PLD of Mi(x, D) and Mi(x, D ) is dominated by the PLD of Pi, Qi. Then letting M be the composition of these mechanisms, the PLD of M(D), M(D ) is dominated by the PLD of P1 P2 . . . , Q1 Q2 . . ..

Similarly, if L1, L2, . . . , Lk and L 1, L 2, . . . , L k are random variables such that Li dominates L i for all i, then L1 + L2 + . . . + Lk dominates L 1 + L 2 + . . . + L k.

In (Zhu et al., 2022), only the ﬁrst part of Lem. C.1 is stated. However, the proof allows arbitrary measures, i.e., measures that don t integrate to 1, which implies the second part of Lem. C.1.

Theorem C.2. Let M1 : D X1, M2 : X1 D X2, M3 : X1 X2 D X3, . . . Mn be a sequence of adaptive mechanisms, where each Mi takes a dataset in D and the output of mechanisms M1, . . . , Mi 1 as input. Let M be the mechanism that outputs (x1 = M1(D), x2 = M2(x1, D), . . . , xn = Mn(x1, . . . , xn 1, D)). Fix any two adjacent datasets D, D .

Suppose there exists bad events E1 X1, E2 X1 X2, . . . ...En 1 X1 X2 . . . Xn 1 such that

Pr x M(D) [ i : (x1, x2, . . . xi) Ei] δ

and pairs of distributions (P1, Q1), (P2, Q2), . . . (Pn, Qn) such that the PLD of M1(D) and M1(D ) is dominated by the PLD of P1, Q1 and for any i 1 and good output (x1, x2, . . . xi) / Ei, the PLD of Mi+1(x1, . . . , xi, D) and Mi+1(x1, . . . , xi, D ) is dominated by the PLD of Pi+1, Qi+1. Then for all ε:

Hε(M(D), M(D )) Hε (P1 P2 . . . Pn, Q1 Q2 . . . Qn) + δ.

Published as a conference paper at ICLR 2024

Algorithm 2 Probability Tail Bounds(C, p, σ, δ1)

1: Input: Matrix C, sampling probability p, noise standard deviation σ, probability δ1. 2: δ = δ1 2 (nnz(C) n) nnz is the number of non-zeros.

3: z = Φ 1(1 δ ) Tail bound on normal distribution; here, Φ is the standard normal CDF. 4: for i, j [n] do 5: if Ci,j = 0 then 6: epi,j = 1 7: else 8: si,j = minimum s s.t. Pr[P

j i xj C1:i 1,j, C1:i 1,j > s] δ , xj i.i.d. Bern(p) si,j is a tail bound on the dot product of ﬁrst i 1 entries of Cx and C1:i 1,j.

9: εi,j = z C1:i 1,j 2

σ + 2si,j C1:i 1,j 2 2 2σ2 εi,j is a tail bound on the privacy loss of a participation in round j after outputting ﬁrst i 1 rounds

10: epi,j = p exp(εi,j) p exp(εi,j)+(1 p) 11: return {epi,j}i,j [n].

Figure 5: Algorithm for computing epi,j, tail bound on conditional probability of participating in round j given ﬁrst i 1 outputs.

Proof. Let L1 be the privacy loss random variable of M, and let L2 be the privacy loss random variable of P1 P2 . . . Pn, Q1 Q2 . . . Qn. We want to show Hε(L1) Hε(L2) + δ for all δ.

Let L 1 be the random variable coupled with L1, with the coupling deﬁned as follows: If i : (x1, x2, . . . , xi) Ei, then L 1 = , otherwise L 1 = L1. Let E = {x| i : (x1, x2, . . . xi) Ei}. Then for all ε: Hε(L1) = Ex h max n 1 eε L1(x), 0 oi

= Pr x [x / E] Ex h max n 1 eε L1(x), 0 o x / E i + Pr x [x E] Ex h max n 1 eε L1(x), 0 o x E i

= Hε(L 1) + Pr x [x E] Ex h max n 1 eε L1(x), 0 o x E i Hε(L 1) + Pr x [x E] Hε(L 1) + δ.

So it sufﬁces to show L 1 is dominated by L2. We consider the following process for sampling L 1: For each i, if for any i < i, (x1, x2, . . . , xi ) Ei , then we let L1,i = deterministically. Otherwise we sample

xi Mi(x1, . . . , xi 1, D), L1,i = ln Pryi Mi(x1,...,xi 1,D)[yi=xi]

Pryi Mi(x1,...,xi 1,D)[yi=xi] . Then L 1 = P

i L 1,i. Similarly, let

L2,i be the privacy loss random variable for Pi, Qi, and let L2 = P

i L2,i. By assumption, the distribution of L 1,i conditioned on x1, x2, . . . , xi 1 is always dominated by L2,i. So by Lem. C.1, L 1 is dominated by L2.

C.2 PROOF OF THM. 4.3

Before stating the proof, in Fig. 5 we give Probability Tail Bounds, the subroutine used to compute the values of epi,j.

Proof of Thm. 4.3. For simplicity in the proof we only consider remove adjacency, i.e. D contains a sensitive example zeroed out in D . By symmetry the proof also works for add adjacency. By quasi-convexity

Published as a conference paper at ICLR 2024

of approximate DP, it sufﬁces to prove the theorem assuming the participation of all examples except the sensitive example is deterministic, i.e. we know the contribution of all examples except the sensitive example to x, so we can assume these contributions are zero. So, let x be the matrix used in the matrix mechanism if we were to sample the sensitive example in each round. Then, the matrix mechanism is a VMo G mechanism with probabilities {p|S|(1 p)n |S|}S [n] and sensitivities {P

j S C:,jxj}S [n].

Our proof proceeds in three high-level steps:

1. We show the matrix mechanism is dominated by a sequence of adaptively chosen Mo G mechanisms. 2. We show each of the adaptively chosen Mo G mechanisms is further dominated by a PMo G mechanism. 3. We show these PMo G mechanisms are with high probability dominated by the PMo G mechanisms in MMCC, and then apply Theorem C.2.

Step 1 (matrix mechanism dominated by sequence of Mo G mechanisms): Let f be the function that takes a matrix M and returns a vector f(M) where the ith entry of this vector is the ℓ2-norm of the ith row of M. Using triangle inequality, for any x such that each row of x has norm at most 1, f(P

j S C:,jxj) is entrywise less than or equal to P

j S C:,j. So by Lem. B.4 the matrix mechanism is dominated by the VMo G mechanism with probabilities {p|S|(1 p)n |S|}S [n] and sensitivities {P

j S C:,j}S [n]7. Note that this is exactly the (non-adaptive) matrix mechanism where each xi = 1 (prior to sampling), i.e. it sufﬁces to prove the privacy guarantee holds for this choice of x. So, for the rest of the proof we will assume the input of the matrix mechanism (prior to sampling) is the all ones vector.

Now, let θ1:i denote the output of rounds 1 to i. By Observation 3.2, this random variable is the same as the composition over i of outputting θi sampled from its distribution conditioned on θ1:i 1. Let Si denote the set of rounds in [i] in which we sample the sensitive example. Abusing notation to let Pr denote a likelihood, the likelihood of the matrix mechanism M(D) outputting θi in round i conditioned on θ1:i 1 is:

T [i] Pr τ Θ(D)[Si = T|τ1:i 1 = θ1:i 1] Pr τi N(P

j T Ci,j,σ2 I) [τi = θi]

The likelihood of M(D ) outputting θi in round i (conditioned on θ1:i 1, which doesn t affect the likelihood since since each coordinate of θ is independent when sampled from M(D )) is

Pr τi N(0,σ2 I) [τi = θi] .

In other words, the distribution of θi conditioned on θ1:i 1 under M(D), M(D ) is exactly the same as the pairs of output distributions given by the Mo G mechanism.

Pr τ Θ(D)[Si = T|τ1:i 1 = θ1:i 1]

T [i] , { X

j T Ci,j}T [i]

So the matrix mechanism with x being all ones is the same as the sequence of (adaptively chosen) Mo G mechanisms given by

7Note that since C is lower-triangular, so the choice of the distribution of the ith row of Cx by an adaptive adversary depends only on rows 1 to i 1 of Cx + z. That is, an adversary who chooses the jth row of x after seeing the j 1st ﬁrst rows of the matrix mechanism satisﬁes the adaptivity condition in Lem. B.4.

Published as a conference paper at ICLR 2024

Pr τ Θ(D)[Si = T|τ1:i 1 = θ1:i 1]

T [i] , { X

j T Ci,j}T [i]

Step 2 (each Mo G is dominated by a PMo G): To achieve step 2, we use the following lemma:

Lemma C.3. Let

pi,j = p exp 2 θ1:i 1,C1:i 1,j C1:i 1,j 2 2 2σ2

p exp 2 θ1:i 1,C1:i 1,j C1:i 1,j 2 2 2σ2 + 1 p .

The random variable induced by probabilities n Q

j [i]\T (1 pi,j) o

T [i] and support

j T Ci,j}T [i] stochastically dominates the random variable induced by probabilities {Prτ Θ(D)[Si = T|τ1:i 1 = θ1:i 1]}T [i] and the same support.

Proving Lem. C.3 completes the step as with this lemma and Cor. B.3, the PLD of

Pr τ Θ(D)[Si = T|τ1:i 1 = θ1:i 1]

T [i] , { X

j T Ci,j}T [i]

is dominated by the PLD of

MP Mo G {pi,j}j [n], {Ci,j}j [n] .

Proof of Lem. C.3. Sampling T according to probabilities {Prτ Θ(D)[Si = T|τ1:i 1 = θ1:i 1]}T [i] is equivalent to the following process: We start with T = , and for each j [i], add it to T with probability Pr[T {j} Si|T Si, τ1:i 1 = θ1:i 1]. Similarly, sampling T according to n Q

j [i]\T (1 pi,j) o

T [i] is equivalent to the same process, except we add j with probabil-

ity pi,j. If we show that Pr[T {j} Si|T Si, τ1:i 1 = θ1:i 1] pi,j for all T, j, then we can couple these sampling processes such that with probability 1, P

j T Ci,j is at least as large for the second process as for the ﬁrst, which implies the lemma. The posterior distribution of Si satisﬁes:

Pr τ Θ(D)[Si = T|τ1:i 1 = θ1:i 1] Pr τ Θ(D)[Si = T] Pr τ Θ(D)[τ1:i 1 = θ1:i 1|Si = T]

p|T |(1 p)i |T | exp

2 θ1:i 1, P j T C1:i 1,j P j T C1:i 1,j 2

Pr[T {j} Si|T Si, τ1:i 1 = θ1:i 1] =

Published as a conference paper at ICLR 2024

T T {j} p|T |(1 p)i |T | exp 2 θ1:i 1,P

j T C1:i 1,j P

j T C1:i 1,j 2

T T p|T |(1 p)i |T | exp 2 θ1:i 1,P

j T C1:i 1,j P

j T C1:i 1,j 2

Fix some T T {j}. Consider the term in the numerator sum corresponding to T , and the two terms in the denominator sum corresponding to T and T \ {j}. The ratio of the numerator term to the sum of the two denominator terms is:

p exp 2 θ1:i 1,C1:i 1,j P

j T C1:i 1,j 2

p exp 2 θ1:i 1,C1:i 1,j P

j T C1:i 1,j 2

+ (1 p) exp P

j T \{j} C1:i 1,j 2

Since entries of C are non-negative, we have P

j T C1:i 1,j 2

j T \j C1:i 1,j 2

C1:i 1,j 2 2, hence this ratio and thus Pr[T {j} Si|T Si, τ1:i 1 = θ1:i 1] are at most pi,j, which proves the lemma.

Step 3 (replacing pi,j with epi,j via conditional composition): By Theorem C.2 and Cor. B.3, it now sufﬁces to show that w.p. 1 δ1, pi,j epi,j for all i, j simultaneously. The bound trivially holds for entries where Ci,j = 0, so we only need the bound to hold for all nnz(C) pairs i, j such that Ci,j > 0. Furthermore, if Ci,j is the ﬁrst non-zero entry of column j, then C1:i 1,j is the all zero-vector, so we get pi,j = epi,j = p.

So, there are only nnz(C) n non-trivial pairs we need to prove the tail bound for; by a union bound, we can show each of these bounds individually holds w.p. δ1 nnz(C) n. By deﬁnition of pi,j, epi,j, this is equivalent to showing θ1:i 1, C1:i 1,j z C1:i 1,j 2 σ + si,j for each of these i, j pairs. We have:

θ1:i 1, C1:i 1,j = X

j Si C1:i 1,j , C1:i 1,j + z1:i 1, C1:i 1,j .

The ﬁrst term is tail bounded by si,j with probability 1 δ1 2(nnz(C) n) by deﬁnition, the second term is

drawn from N(0, C1:i 1,j 2 2 σ2) and thus tail bounded by z C1:i 1,j 2 σ with the same probability by deﬁnition. A union bound over these two events gives the desired tail bound on θ1:i 1, C1:i 1,j .

C.3 PROOF OF LEM. 5.1

Proof. Since each 0 pi 1/n , the mechanism is the same as the following: For each example we choose a subset S [n] of size n according to some distribution that is a function of the pi, and then choose i uniformly at random from the elements of S, and include the example in the ith subset. By quasi-convexity of approximate DP, it sufﬁces to prove the DP guarantee for a ﬁxed choice of S. For any ﬁxed choice of S, the mechanism is equivalent to the shufﬂed Gaussian mechanism over n coordinates. Each unshufﬂed

Gaussian mechanism satisﬁes (

2 ln(1.25/δ0)

σ , δ0)-DP, and then the lemma follows by the ampliﬁcation via shufﬂing statement of Theorem 3.8 of (Feldman et al., 2022).

Published as a conference paper at ICLR 2024

C.4 PROOF OF THEOREM 5.2

We ﬁrst analyze a simpliﬁed case where xi = 0 if i = i , and otherwise xi = 1 for D and xi = 0 for D . We later give a proof for the general case.

Proof of Theorem 5.2 in simpliﬁed case. Let τj,k be the value of the noisy sum P

k 2j+1 i (k+1) 2j xi + zj,k, τ = {τj,k}0 j log n,0 k<n/2j and let Θ(D) be the distribution of these values under dataset D. We consider a single sensitive example; let i be the (random) coordinate of xi that this example contributes to.

Now, again abusing notation to let Pr denote a likelihood, we have for any j:

h {τj,k}0 k<n/2j = {θj ,k}j >j,0 k<n/2j {τj ,k}j >j,0 k<n/2j = {θj ,k}j >j,0 k<n/2j i

0 k n/2j Pr τ Θ(D)

h {τj,k}0 k<n/2j = {θj ,k}j >j,0 k<n/2j k 2j + 1 i (k + 1) 2ji

h k 2j + 1 i (k + 1) 2j {τj ,k}j >j,0 k<n/2j = {θj ,k}j >j,0 k<n/2j i

In other words, for any j, the distribution of the j-th level of the tree, {τj,k}0 k n/2j, conditioned on the higher levels of the tree, {τj ,k}j >j,0 k n/2j , is the output distribution of mechanism described in Lem. 5.1, where the probabilities are

pj,k := Pr τ Θ(D)

h k 2j + 1 i (k + 1) 2j {τj ,k}j >j,0 k<n/2j = {θj ,k}j >j,0 k<n/2j i .

We now show a high probability bound on each of these probabilities. We have:

h k 2j + 1 i (k + 1) 2j {τj ,k}j >j,0 k<n/2j = {θj ,k}j >j,0 k<n/2j i

i=k 2j+1 Prτ Θ(D) h {τj ,k}j >j,0 k<n/2j = {θj ,k}j >j,0 k<n/2j i = i i

Pn i=1 Prτ Θ(D) h {τj ,k}j >j,0 k<n/2j = {θj ,k}j >j,0 k<n/2j i = i i

j >j,0 k<n/2j exp

τj ,k (k 2j +1 i (k+1) 2j ) 2

j >j,0 k<n/2j exp (τj ,k (k 2j +1 i (k+1) 2j )) 2

j,k:k 2j +1 i (k+1) 2j exp τj ,k

j,k:k 2j +1 i (k+1) 2j exp τj ,k

n maxi [n] Q

j,k:k 2j +1 i (k+1) 2j exp τj ,k

j,k:k 2j +1 i (k+1) 2j exp τj ,k

n exp (log n j) maxj ,k τj ,k minj ,k τj ,k

Published as a conference paper at ICLR 2024

With probability 1 δ/2, by a union bound for all 2n pairs j, k we have |τj,k| p

2 ln(4n/δ)σ, so the above bound is at most:

(log n j)2 p

2 ln(4n/δ) σ

ln 2 in turn this is at most:

2 log n j =

Now, by Theorem C.2 and Observation 3.2 it sufﬁces to show that conditioned on this probability

1 δ/2 event, the binary tree mechanism satisﬁes O

log(1/δ) log log(1/δ)

, δ/2 -DP. For log n

16e log log(16n/δ) j log n, releasing τj,k satisﬁes O

log(1/δ) log log(n/δ)

, δ/4 -DP by the

analysis of the (unampliﬁed) Gaussian mechanism. For levels 0 j < log n 16e log log(16n/δ), our upper bound on the conditional probabilities and Lem. 5.1 with δ0 = δ/8n log n shows that, conditioned on the high-probability event, the distribution of the privacy loss of outputting {τj,k}k conditioned on levels

j > j satisﬁes O

ln(n /δ) ln(1/δ)

, δ/4 log n -DP, with n = p n

2j . By basic composition, the

overall privacy loss distribution conditioned on the 1 δ/2 probability event satisﬁes:

log(1/δ) log log(1/δ)

j=log n 16e log log(16n/δ) +2j/4 ln(1/δ)

Here we use the upper bound on δ which is equivalent to log(n/δ) = O(log(1/δ)). We conclude by bounding the sum as:

j=log n 16e log log(16n/δ) +2j/4 ln(1/δ)

log n 16e log log(16n/δ) X

l=0 + ln(1/δ)

log n 16e log log(16n/δ) X

ln(1/δ) σ(1 2 1/4).

We now discuss how to extend the proof to a more general case. In other words, we choose some y with each row having 2-norm at most 1 for D, and then set y for D to be y with the ﬁrst row zeroed out. Then, x is chosen by shufﬂing the rows of y or y .

Lemma C.4. Under the above setup, for some k that divides n, consider the mechanism that chooses a random size k equipartition of [n], P = (S1, S2, . . . Sk) of [n] according to some distribution and outputs (θ1, θ2, . . . , θk), θi N(P

j Si yi, σ2). Suppose for any two equipartitions P, P , the probability of choosing P is at most c times the probability of choosing P , and let n = k/c .

Published as a conference paper at ICLR 2024

Then, for any δ 2e n

16e , δ0 0, if σ = Ω( p

ln(1/δ0)) then this mechanism satisﬁes O

ln(1/δ0) ln(1/δ)

Proof. Recall that y1 is the example differing between D and D . By post-processing and quasi-convexity, we can instead analyze the mechanism that for each Si, also publishes all but one element in Si, and specifically for the Si including 1 (the sensitive element), the element of Si not published must be 1. This is equivalent to saying: without loss of generality we can assume n = k.

Next, the assumption on the distribution over P implies that the distribution is in the convex hull of distributions over P that deterministically choose k n elements of P, with 1 being one of these n unchosen elements, and then uniformly shufﬂe the remaining n elements. In terms of privacy guarantees, each individual mechanism using one of these distributions is equivalent to n Gaussian mechanisms on shufﬂed elements. Then, by quasi-convexity, the privacy guarantees of this mechanism are no worse than those of a Gaussian mechanism over n shufﬂed elements. We conclude using the analysis of ampliﬁcation by shufﬂing in (Feldman et al., 2022).

Next, we use the following black-box reduction from (ε, δ)-DP guarantees to high-probability privacy loss bounds:

Lemma C.5 ((Kasiviswanathan & Smith, 2014)). If a mechanism satisﬁes (ε, δ)-DP, then the probability the privacy loss of its output exceeds 2ε is at most δ εeε .

Now, the high-level idea is that the (ε, δ)-DP guarantee on outputting levels j > j implies a high-probability bound on the privacy loss of outputting these levels via Lem. C.5, which in turn implies a bound on c in Lem. C.4 if we use the posterior distribution over shufﬂes as the distribution in that lemma. Then, we can use Lem. C.4 to get an (ε, δ)-DP guarantee for round j conditioned on the previous rounds, and as before the resulting ε per level decays geometrically and we can use basic composition.

Proof of Theorem 5.2. By the upper bound on δ, log(poly (n) /δ) = O(log(1/δ)). So, for any constant c1 and another constant c2 depending on c1, releasing levels log n c1 log log 1/δ to log n satisﬁes c2

log(1/δ) log log(1/δ)

σ , δ/n2 -DP by analysis of the Gaussian mechanism.

Now, we will show by induction that releasing levels j to log n, j log n c1 log log 1/δ, satisﬁes (εj, δj)- DP for:

log(1/δ) log log(1/δ)

j =log n c1 log log(1/δ)

c3 log(1/δ)

j =log n c1 log log(1/δ) (1 + 1/e)log n c1 log log(1/δ) j.

In the inequality on εj we assume c1 is sufﬁciently large. The base case of j = log n c1 log log 1/δ holds by the aforementioned analysis of the Gaussian mechanism. Now, assuming releasing levels j + 1 to log n satisﬁes (εj, δj)-DP, we will prove releasing levels j to log n satisﬁes (εj, δj)-DP. Consider the output distribution of level j, conditioned on the event that the privacy loss of releasing levels j + 1 to log n is at most 1. The privacy loss being at most 1 implies that conditioned on levels j +1 to log n s output, no shufﬂe is more than e times as likely as any other shufﬂe, and thus the same is true for equipartitions of the data into

Published as a conference paper at ICLR 2024

the sums in level j. Then by Lem. C.4, level j satisﬁes, say, ( c3

2log n j , δ/n2)-DP for some sufﬁciently

large constant c3, assuming c1 is sufﬁciently large and σ c4 p

log(1/δ) log log(1/δ) for a sufﬁciently large constant c4. We have

log(1/δ) log log(1/δ) + 4c3 p

So εj < 1/2 for all j, again assuming σ c4 p

log(1/δ) log log(1/δ) for sufﬁciently large c4, by Lem. C.5 this event happens with probability at least 1 δj+1/e. Then assuming releasing levels j + 1 to log n satisﬁes (εj+1, δj+1)-DP by Thm. 4.3 and basic composition, we have proven releasing levels j to log n satisﬁes (εj, δj)-DP for

εj = εj+1 + c3 log(1/δ)

2log n j , δj = δj+1(1 + 1/e) + δ/n2.

The claimed non-recursively deﬁned values for εj, δj follow by unrolling the above recursive formula and plugging in the base case j = log n c1 log log 1/δ. Now, the full binary tree mechanism with shufﬂing

satisﬁes (ε0, δ0)-DP for ε0 = O

log(1/δ) log log(1/δ)

, δ0 δ as desired. (Note that the between the

constants c1, c2, c3, c4 there are no circular dependencies, i.e. there does exist a set of constants satisfying the assumptions in the proof.)

D EXTENDING MMCC TO b-MIN-SEP SAMPLING

(Choquette-Choo et al., 2023a) analyzed the b-banded matrix mechanism under the following scheme, which we ll call b-min-sep sampling : We partition the dataset D into b equal-size subsets, D1, D2, . . . Db. To compute xi, we use independently include each element of Di (mod b) (where we say i (mod b) = b if b divides i) with probability bp; here, we write the sampling probability in these rounds as bp instead of p to reﬂect the fact that the average example still participates in fraction p of rounds in expectation for any choice of b.

We give a generalization of MMCC that analyzes the matrix mechanism under b-min-sep sampling, that matches the analysis of (Choquette-Choo et al., 2023a) when C is b-banded but can generalize to arbitrary lower triangular matrices. In other words, this generalization of MMCC subsumes the analysis in (Choquette Choo et al., 2023a).

Note that if we want to analyze the privacy guarantee for an example in Di, i 1, this is the same as analyzing the privacy guarantee for an example in D1, if we use C with the ﬁrst i 1 rows/columns cut off. Then, without loss of generality we only need to state a privacy analysis for examples in D1 - to get a privacy guarantee that holds for all examples simultaneously, for each Di we can compute a privacy guarantee using the above reduction, and then take the worst of these. Further, for some classes of matrices, such as Toeplitz matrices, the examples in D1 will have the worst privacy guarantee and thus it sufﬁces to only analyze these examples.

We now show Generalized-MMCC, given in Fig. 6, computes a valid privacy guarantee under b-min-sep sampling.

Theorem D.1. Let ε be the output of Generalized-MMCC. Then the matrix mechanism with matrix C, b-min-sep sampling, sampling probability p, noise level σ satisﬁes (ε, δ1 + δ2)-DP (for examples in D1).

Published as a conference paper at ICLR 2024

Algorithm 3 Generalized-MMCC

1: Input: Matrix C, sampling probability p, noise standard deviation σ, probabilities δ1, δ2, min-sep b. 2: Delete all columns of C except columns 1, b + 1, 2b + 1 . . . 3: {epi,j}i [n],j [ n/b Generalized Probability Tail Bounds(C, bp, σ, bδ1). epi,j is a high-probability upper bound on the probability that an example participated in round j, conditioned on output in rounds 1 to i 1.

4: ep(b) i,j = ep(i 1)b+1,(j 1)b+1

5: C(b) i,j = C(i 1)b+1:ib,(j 1)b+1 2 6: for i [ n/b ] do

7: PLDi PLD of MP Mo G({ep(b) i,j }j [ n/b ], {C(b) i,j }j [ n/b ]).

8: PLD convolution of {PLDi}i [ n/b ]. 9: return min ({ε : PLD satisﬁes (ε, δ2)-DP}).

Figure 6: Extension of MMCC to b-min-sep sampling.

Algorithm 4 Generalized Probability Tail Bounds(C, p, σ, δ1)

1: Input: Matrix C Rm n, sampling probability p, noise standard deviation σ, probability δ1. 2: δ = δ1 2 (nnz(C) n) nnz is the number of non-zeros.

3: z = Φ 1(1 δ ) Tail bound on normal distribution; here, Φ is the standard normal CDF. 4: for i [m], j [n] do 5: if Ci,j = 0 then 6: epi,j = 1 7: else 8: si,j = minimum s s.t. Pr[P

j i xj C1:i 1,j, C1:i 1,j > s] δ , xj i.i.d. Bern(p) si,j is a tail bound on the dot product of ﬁrst i 1 entries of Cx and C1:i 1,j.

9: εi,j = z C1:i 1,j 2

σ + 2si,j C1:i 1,j 2 2 2σ2 εi,j is a tail bound on the privacy loss of a participation in round j after outputting ﬁrst i 1 rounds

10: epi,j = p exp(εi,j) p exp(εi,j)+(1 p) 11: return {epi,j}i [m],j [n].

Figure 7: Generalization of Probability Tail Bounds.

Published as a conference paper at ICLR 2024

1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 sqrt(log(n)+1)

unamplified epsilon / amplified epsilon

c = 10 (unamplified epsilon = 0.397)

y=-0.0313x+1.1625

1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 sqrt(log(n)+1)

unamplified epsilon / amplified epsilon

c = 20 (unamplified epsilon = 0.189)

y=0.2053x+0.8877

1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 sqrt(log(n)+1)

unamplified epsilon / amplified epsilon

c = 40 (unamplified epsilon = 0.0901)

y=0.3452x+0.7164

Figure 8: Plot of multiplicative improvement in ε for the optimal continual counting matrix mechanism as a function of p

log(n) + 1 Ce1 2. We plot n = 2i, i {1, 2, . . . , 7}. We use σ = c Cei 2, so the ε value in the unampliﬁed single-participation setting is ﬁxed. All ε are for δ = 10 6.

Proof. The algorithm is almost the same as Thm. 4.3, so we just need to justify the key differences. In particular, we need to justify (1) the deletion of columns, (2) the choice of ep(b) i,j , and (3) the choice of C(b).

(1) is justiﬁed by the proof of Theorem 4 in (Choquette-Choo et al., 2023a), which observes that the products of columns j of C for which j (mod b) = 1 and the corresponding rows of x are independent of D1, i.e. we can treat their products as public information. So it does not affect the privacy analysis to delete these rows/columns from C/x, and then view the resulting x as generated by i.i.d sampling every round with probability bp.

(2) and (3) are both justiﬁed if we use conditional composition over sequential mechanisms corresponding to b rows of Cx + z instead of a single row. Each of these sequential mechanisms is a VMo G mechanism, which Cor. B.6 allows us to reduce to the scalar PMo G mechanism deﬁned in terms of C(b) in Generalized-MMCC. The probabilities ep(b) are then valid to use in the conditional composition by the same argument as in Thm. 4.3, up to the adjustment to use bδ1 instead of δ1. This adjustment is valid, since we only use fraction 1/b of the values generated by Generalized Probability Tail Bounds, i.e. we are union bounding over 1/b as many bad events as in the original proof, so we can increase the allowed probability for each bad events by b (which is implicitly done by increasing δ1 by b).

One can verify that (i) for b = 1, Generalized-MMCC is equivalent to MMCC, and that (ii) if C is bbanded, Generalized-MMCC is equivalent to the privacy analysis in (Choquette-Choo et al., 2023a).

E MORE EMPIRICAL ε COMPUTATIONS

(Fichtenberger et al., 2023; Henzinger et al., 2023) showed that a post-processing of the matrix mechanism using the following lower-triangular matrix achieves 1 + o(1) times the optimal ℓ2 2 error for preﬁx sums (without ampliﬁcation): Ci,j = f(i j), where f is deﬁned as

0, for k < 0 1, for k = 0 f(k 1) 1 1 2k , for k > 0 .

Similarly to the binary tree mechanism, we will consider the unampliﬁed single-participation setting as a baseline. In this case, the sensitivity of this matrix mechanism is Ce1 2, i.e. the ℓ2-norm of the ﬁrst column of C. So again, setting σ = c Ce1 2 results in a ﬁxed ε for a ﬁxed δ. Our comparison will be applying the same matrix mechanism with subsampling probability 1/n and the same choice of σ.

Published as a conference paper at ICLR 2024

In Fig. 8, we reproduce the plots in Fig. 2 but for this matrix mechanism instead of the binary tree mechanism. The ℓ2-norm of the columns of this matrix asymptotically are Θ( log n); because of this, and to make a direct comparison to the binary tree mechanism easier, we use p

log(n) + 1 as the x-axis and plot the least squares linear regression. Because the columns of this matrix are less orthogonal than those of the matrix for the binary tree mechanism, there is less beneﬁt from ampliﬁcation in this setting than the binary tree mechanism setting, so we use a larger range of values c {10, 20, 40} for the noise multiplier to better demonstrate the behavior of the improvement in ε.

For sufﬁciently large σ, the improvement in ε due to the ampliﬁcation analysis is again roughly proportional to p

log(n) + 1. For the same reasons as for the binary tree mechanism, the ﬁt of the linear regression is better as σ increases: here, because the columns of this matrix are less orthogonal on average, a larger value of c is needed for the ﬁt to improve. Here, the constant multiplier in the improvement is smaller; this makes sense as these matrices improve on the error of the binary tree mechanism by a constant, and thus the amount by which we can improve the privacy analysis of this matrix mechanism without violating lower bounds is smaller than for the binary tree mechanism.

F IMPLEMENTATION DETAILS

To implement the MMCC algorithm, we use the open-source Python library dp accounting.pld8. We extend the class dp accounting.pld.pld mechanism.Additive Noise Privacy Loss to create a class, Mixture Gaussian Privacy Loss that represents the privacy loss distribution of MMo G, which can be used along with other tools in the dp accounting.pld library to implement MMCC. We discuss our implementation and some challenges here. The dp accounting.pld library uses the convention that privacy losses are decreasing; we use the same convention in the discussions in this section for consistency. Our implementation is currently open-sourced as part of the dp accounting library, and PLD accounting for Mo G mechanisms can be done using dp accounting.pld.PLDAccountant and dp accounting.dp event.Mixture Of Gaussians Dp Event.

F.1 EXTENDING ADDITIVENOISEPR I V A C YLO S S

In order to perform all the necessary computations in MMCC, we need to implement the following methods in Mixture Gaussian Privacy Loss:

1. A method to compute the CDF of the mixture of Gaussians distribution. 2. A method to compute the privacy loss at x. 3. An inverse privacy loss method, i.e. a method which takes ε and computes the smallest x achieving this ε.

Given the probabilities and sensitivities {p1, p2, . . . , pk} and {c1, c2, . . . , ck}, as well as σ, the ﬁrst two can easily be done by just summing the PDFs/CDFs of the Gaussians in the mixture. This takes at most O(k) times the runtime of the corresponding method for the (subsampled) Gaussian mechanism.

The third is more problematic. For the subsampled Gaussian mechanism with sampling probability p and sensitivity 1, the privacy loss function (under the remove adjacency) is:

ln p exp 2x 1

8https://github.com/google/differential-privacy/tree/main/python/dp_ accounting/dp_accounting

Published as a conference paper at ICLR 2024

This function is easily invertible. However, if we consider MMo G({p, 1 p}, {c1, c2}), the privacy loss at x is:

ln p exp 2c1x c2 1 2σ2

+ (1 p) exp 2c2x c2 2 2σ2

Because this function includes the sum of two exponential functions of x, it is not easy to invert. We instead use binary search to get the smallest multiple of 1 which achieves the desired privacy loss, where 1 is a parameter we choose that trades off between efﬁciency and accuracy. That is, if L is the privacy loss function, and we want to compute the inverse privacy loss of y, we return x = L 1(y)/ 1 1. Note that by overestimating x, we also overestimate the privacy loss since we assume the privacy loss is decreasing. Hence this approximation is pessimistic, i.e. does not cause us to report an (ε, δ)-DP guarantee that is not actually satisﬁed by MMo G.

Note that using binary search requires a O(log(1/ 1)) multiplicative dependence on 1, that is not incurred for e.g. the subsampled Gaussian for which we can quickly compute the exact inverse privacy loss. Indeed, we observed that this inverse privacy loss method is the bottleneck for our implementation.

F.2 EFFICIENTLY REPRESENTING PMOG AS MOG

As discussed in the previous section, the runtime of our implementation has a linear dependence on the number of components in the Mo G. However, in MMCC, we are actually using PMo Gs, which are Mo Gs with potentially 2n components. So, even just listing the components can be prohibitively expensive.

We instead choose another approximation parameter 2, and round each entry of C up to the nearest multiple of 2. By Lemma B.4, this only worsens the privacy guarantee, i.e. any privacy guarantee we prove for the rounded version of C also applies to the original C. After this rounding, the number of components in any Mo G we compute the PLD of is at most maxi e i C 1 / 2 + n (maxi e i C 1 is the maximum row norm of C). Furthermore, we can compute the probabilities/sensitivities efﬁciently since we are working with PMo Gs. In particular, for each epi,j, Ci,j pair, we can construct the probability mass function (PMF) of the random variable that is Ci,j w.p. epi,j and 0 otherwise, and then take the convolution of all such PMFs for a row to get the PMF of the discretized sensitivity for the PMo G. For each row, this can be done in at most n 1 convolutions, each convolution between two PMFs that have support size at most 2 and maxi e i C 1 / 2 + n. So the convolutions can be done in time O(maxi e i C 1 / 2 + n), i.e. our overall runtime is O(n2 maxi e i C 1 / 2 + n3), i.e. polynomial instead of exponential in n if e.g. all entries of C are bounded by a constant. By doing the convolutions in a divide-and-conquer fashion, and using FFT for the convolutions, we can further improve the runtime to e O(n maxi e i C 1 / 2 + n2), i.e. nearly linear in the input size and 1/ 2 if the entries of C are bounded by a constant.

F.3 COMPUTING si,j

Similar to computing the probabilities and sensitivities for the PMo Gs, any overestimate of si,j can be used in place of si,j to get a valid privacy guarantee from MMCC by Lemma B.3. Since si,j only appears in a lower order term in the deﬁnition of εi,j, a weaker tail bound will not affect the privacy guarantee as much. So, in our implementation, we use the following simple and efﬁcient approximation: We use the binomial CDF to obtain an exact tail bound t on x1:i 1 = P

j i xj in the deﬁnition of si,j. We then take the sum of the t largest values of C1:i 1,j, C1:i 1,j to be our overestimate of si,j.

Published as a conference paper at ICLR 2024

F.4 COMPUTING ALL ROW PLDS

Putting this all together, we must compute n PLDs in MMCC, one for each row of C. Though only an O(n) overhead in runtime over computing a single PLD, this O(n) overhead is undesirable as each PLD computation is already quite expensive due to the aforementioned difﬁculties. However, this component is embarrassingly parallel, which we leverage to massively speed up runtimes.

Note that for some special classes of matrices, we will have that multiple rows share the same PLD, which also allows us to dramatically speed up the calculation even without parallelization. For example, this is the case for the binary tree mechanism due to symmetry, as well for as b-banded Toeplitz C due to the fact that rows 2b 1 to n of ep and C are the same (up to an offset in indices that doesn t affect the PLD).

F.5 APPLICATIONS BEYOND MATRIX MECHANISMS

We believe that Mo G mechanisms/Mixture Gaussian Privacy Loss are useful analytic tools for privacy analysis of mechanisms beyond the matrix mechanism. We discuss two examples here.

Privacy ampliﬁcation via iteration on linear losses: Consider running DP-SGD with sampled minibatches. To get a (ε, δ)-DP guarantee, we can compute the PLD for the subsampled Gaussian mechanism, and then compose this PLD with itself n times. For general non-convex losses, this accounting scheme is tight, even if we only release the last iterate.

For linear losses, we can give a better privacy guarantee for releasing only the last iterate, similarly to (Feldman et al., 2018): Releasing the last iterate is equivalent in terms of privacy guarantees to a Gaussian mechanism with random sensitivity Binom(n, p) and variance nσ2. Using Mixture Gaussian Privacy Loss we can get tight (ε, δ)-DP guarantees for this mechanism. Empirically, we found that these can be a lot tighter than composition of subsampled Gaussians. For example, using n = 128, p = 1/128, σ = 1 we found that composition of subsampled Gaussians gives a proof of (.806, 10 6)-DP, whereas analyzing the last iterate as a Mo G mechanism gives a proof of (.291, 10 6)-DP. We conjecture a similar improvement is possible for all convex losses, rather than linear losses.

Tight group privacy guarantees for DP-SGD: Consider analyzing the privacy guarantees of DP-SGD under group privacy. That is, we want to give a privacy guarantee for pairs of databases differing in k > 1 examples. One way of doing this is to compute a DP guarantee for k = 1, then use an exampleto-group privacy theorem such as that of (Vadhan, 2017), which shows an (ε, δ)-DP mechanism satisﬁes (kε, k exp(kε)δ)-DP for groups of size k. This is overly pessimistic, since the black-box theorem doesn t account for the speciﬁc structure of the mechanism. We can instead get relatively tight guarantees via Mixture Gaussian Privacy Loss: If each example is sampled independently, then the privacy loss of a group of k examples in each round of DP-SGD is dominated by a Gaussian mechanism with sensitivity Binom(k, p). Then, we can use Mixture Gaussian Privacy Loss to analyze the composition of n of these Mo G mechanisms. Further, note that e.g. in the case where we instead sample a random batch of size B in each round (i.e. different examples participations within the same round are no longer independent), we can still use Mixture Gaussian Privacy Loss to get a tight analysis by adjusting the sensitivity random variable used. See the follow-up note (Ganesh, 2024) for more details.