# on_differentially_private_u_statistics__1cabea90.pdf

On Differentially Private U Statistics

Kamalika Chaudhuri Po-Ling Loh University of California San Diego University of Cambridge kamalika@ucsd.edu pll28@cam.ac.uk

Shourya Pandey Purnamrita Sarkar University of Texas at Austin University of Texas at Austin shouryap@utexas.edu purna.sarkar@austin.utexas.edu

We consider the problem of privately estimating a parameter E[h(X1, . . . , Xk)], where X1, X2, . . . , Xk are i.i.d. data from some distribution and h is a permutationinvariant function. Without privacy constraints, the standard estimators for this task are U-statistics, which commonly arise in a wide range of problems, including nonparametric signed rank tests, symmetry testing, uniformity testing, and subgraph counts in random networks, and are the unique minimum variance unbiased estimators under mild conditions. Despite the recent outpouring of interest in private mean estimation, privatizing U-statistics has received little attention. While existing private mean estimation algorithms can be applied in a black-box manner to obtain confidence intervals, we show that they can lead to suboptimal private error, e.g., constant-factor inflation in the leading term, or even Θ(1/n) rather than O(1/n2) in degenerate settings. To remedy this, we propose a new thresholding-based approach that reweights different subsets of the data using local Hájek projections. This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.

1 Introduction

A fundamental task in statistical inference is to estimate a parameter of the form E[h(X1, . . . , Xk)], where h is a possibly vector-valued function and {Xi}n i=1 are i.i.d. draws from an unknown distribution. U-statistics are a well-established class of estimators for such parameters and can be expressed as averages of functions of the form h(X1, . . . , Xk). U-statistics arise in many areas of statistics and machine learning, encompassing diverse estimators such as the sample mean and variance, hypothesis tests such as the Mann-Whitney, Wilcoxon signed rank, and Kendall s tau tests, symmetry and uniformity testing [26, 21], goodness-of-fit tests [58], counts of combinatorial objects such as the number of subgraphs in a random graph [30], ranking and clustering [19, 18], and subsampling [52].

U-statistics are a natural generalization of the sample mean. However, little work has been done on U-statistics under differential privacy, in contrast to the sizable body of existing work on private mean estimation [42, 39, 14, 40, 10, 41, 24, 11, 33, 15]. Ghazi et al. [29] and Bell et al. [8] consider Ustatistics in the setting of local differential privacy [43], while we are interested in privacy guarantees under the central model. Moreover, existing work on private U-statistics focuses on discrete data, and relies on simple privacy mechanisms (such as the Laplace mechanism [25]) which are usually optimal in these settings.

Many U-statistics converge to a limiting Gaussian distribution with variance O(k2/n) when suitably scaled. This is commonly used in hypothesis testing [35, 5, 37]. However, there are also examples of non-degenerate U-statistics, which often arise in a variety of hypothesis tests [26, 58, 21], where the statistic is degenerate at the null hypothesis (in which case the U-statistic converges to a sum of centered chi-squared distributions [31]). Another interesting U-statistic arises in subgraph counts in random geometric graphs [30]. When the probability of an edge being present tends to zero with n, creating a private estimator by simply adding Laplace noise with a suitable scale may not be effective.

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

Our contributions

1. We present a new algorithm for private mean estimation that achieves nearly optimal private and non-private errors for non-degenerate U-statistics with sub-Gaussian kernels.

2. We provide a lower bound for privately estimating non-degenerate sub-Gaussian kernels, which nearly matches the upper bound of our algorithm. We also derive a lower bound for degenerate kernels and provide evidence that the private error achieved by our algorithm in the degenerate case is nearly optimal. A summary of the utility guarantees of our algorithm and adaptations of existing private mean estimation methods is presented in Table 1.

3. The computational complexity of our first algorithm scales as O( n k ). We generalize this algorithm and develop an estimator based on subsampled data, providing theoretical guarantees for a more efficient version with O(n2) computational complexity.

The paper is organized as follows. Section 2 reviews the background on U-statistics and key concepts in differential privacy. Section 3 introduces an initial set of estimators based on the Coin Press algorithm for private mean estimation [10]. Section 4 presents our main algorithm, which leverages what we term local Hájek projections. Section 5 discusses applications of our algorithm to private uniformity testing and density estimation in sparse geometric graphs. Section 6 concludes the paper.

2 Background and Problem Setup

Let n and k be positive integers with k n. Let D be an unknown probability distribution over a set X, and let h : X k R be a known function symmetric in its arguments1. Let H be the distribution of h(X1, X2, . . . , Xk), where X1, X2, . . . , Xk D are i.i.d. random variables. We are interested in providing a ϵ-differentially private confidence interval for the estimable parameter [32] θ = E[h(X1, X2, . . . , Xk)], which is the mean of the distribution H, given access to n i.i.d. samples from D; we use X1, X2, . . . , Xn to denote these n samples. The kernel h, the degree k, and the estimable parameter θ are allowed to depend on n; we omit the subscript n for the sake of brevity.

We consider bounded kernels h and unbounded kernels h where the distribution H is sub-Gaussian. We write Y sub-Gaussian(τ) if E[exp(λ(Y EY ))] exp(τλ2/2) for all λ R. The quantity τ is called a variance proxy for the distribution H and satisfies the inequality τ σ2 [59]. Throughout the paper, we assume that the privacy parameter ϵ = O(1). We also use the notation O( ) in error terms, which hides poly-logarithmic factors in n/α.

2.1 U-Statistics

Let [n] denote {1, . . . , n}, and let In,k be the set of all k-element subsets of [n]. Denote the n i.i.d. samples by X1, X2, . . . , Xn. For any S In,k, let XS be the (unordered) k-tuple {Xi : i S}. The U-statistic associated with the data and the function h is

Un := 1 n k X

{i1,...,ik} In,k h(Xi1, . . . , Xik). (1)

The function h is the kernel of Un and k is the degree of Un. While U-statistics can be vector-valued, we consider scalar U-statistics in this paper. The variance of Un can be expressed in terms of the conditional variances of h(X1, X2, . . . , Xn). For c [k], we define the conditional variance

ζc := Var (E [h(X1, . . . , Xk)|X1, . . . , Xc]) . (2)

Equivalently, ζc = cov (h(XS1), h(XS2)) where S1, S2 In,k and |S1 S2| = c. The number of such pairs of sets S1 and S2 is equal to n k k c n k k c , which implies

Var(Un) = n k

1That is, h(x1, x2, . . . , xk) = h(xσ(1), xσ(2), . . . , xσ(k)) for any permutation σ. We do not assume that the distribution H itself is symmetric about its mean.

Since E[Un] = θ, Un is an unbiased estimate of θ. Moreover, the variance of Un is a lower bound on the variance of any unbiased estimator of θ. (cf Lee [45, Chapter 1, Theorem 3]). We also have the following inequality from Serfling [54] (see Appendix A.2 for a proof):

Infinite-order U-Statistics: Classical U-statistics typically have small, fixed k. However, important estimators that appear in the contexts of subsampling [52] and Breiman s random forest algorithm [56, 50] have k growing with n. These types of U-statistics are sometimes referred to as infinite-order Ustatistics [27, 48]). U-statistics also frequently appear in the analysis of random geometric graphs [30]. The difference between this setting and the examples above is that the conditional variances {ζc} vanish with n in the sparse setting. (See Section 5.)

Degenerate U-statistics: A U-statistic is degenerate of order ℓ k 1 if ζi = 0 for all i [ℓ] and ζℓ+1 > 0 (if ζk = 0, the distribution is almost surely constant). Degenerate U-statistics arise in hypothesis tests such as Cramer-Von Mises and Pearson tests of goodness of fit [31, 3, 55] and tests for unformity [21]. They also appear in tests for model misspecification in econometrics [46, 47]. For more examples of degenerate U-statistics, see [20, 61, 34].

2.2 Differential privacy

The main idea of differential privacy [25] is that the participation or non-participation of a single person should not affect the outcome significantly. A (randomized) algorithm M, that takes as input a dataset D X n and outputs an element of its range space S, satisfies ϵ-differential privacy if for any pair of adjacent datasets D and D and any measurable subset S S of the range space S, Pr(M(D) S) eϵ Pr(M(D ) S). A dataset is D := (X1, X2, . . . , Xn) from some domain X, for some n which is public. Two datasets D and D are adjacent if they differ in exactly one index. An important property of differentially private algorithms is composition. We defer composition theorems for differentially private algorithms to Appendix A.2.

Basic DP algorithms. One way to ensure an algorithm satisfies differential privacy is through the Laplace Mechanism [25]. The global sensitivity of a function f : X n S is

GS(f) = max |D D |=1 |f(D) f(D )|, (5)

where D D := |{i : Di = D i}| A fundamental result in differential privacy is that we can achieve privacy for f by adding noise calibrated to its global sensitivity. Lemma 1. (Laplace mechanism [25]) Let f : X n S be a function and let ϵ > 0 be the privacy parameter. Then the algorithm A(D) = f(D) + Lap GS(f)

ϵ 2 is ϵ-differentially private.

The global sensitivity of a function is the worst-case change in the function value and may be high on atypical datasets. To account for the small sensitivity on typical datasets, the notion of local sensitivity is useful. The local sensitivity of a function f : X n S at D is defined as

LS(f, D) = max |D D |=1 |f(D) f(D )|. (6)

Unfortunately, adding noise proportional to the local sensitivity does not ensure differential privacy, because variation in the magnitude of noise itself may leak information. Instead, [49] proposed the notion of a smooth upper bound on LS(f, D). A function SS(f, ) is said to be an ϵ-smooth upper bound on the local sensitivity of f if (i) SS(f, D) LS(f, D) for all D, and (ii) SS(f, D) eϵSS(f, D ) for all |D D | = 1. Intuitively, (i) ensures that enough noise is added, and (ii) ensures that the noise itself does not leak information about the data. Lemma 2. (Smoothed Sensitivity mechanism [49]) Let f : X n S be a function, ϵ > 0, and SS(f, ) be an ϵ-smooth upper bound on LS(f, ). Then, the algorithm A(D) = f(D) + SS(f, D)/ϵ Z, where Z has density h(z) 1 1+z4 , is ϵ/10-differentially private.

2The Laplace Distribution with parameter b > 0, denoted Lap(b), has density ℓ(z) exp ( |z|/b) .

3 Lower Bounds and Application of Off-the-shelf Tools

Algorithm Sub-Gaussian, non-degenerate Bounded, degenerate Private error Is non-private error Private error Is non-private error O(Var(Un))? O(Var(Un))?

Naive (Proposition A.1) O k τ

All-tuples (Proposition 1) O k3/2 τ

Main algorithm O k τ

Yes O k3/2C

Yes Corollary 1 Corollary 2

Lower bound Ω k τ

for private algorithms Theorem 1 Theorem 3

Table 1: We compare our application of off-the-shelf tools to Algorithm 1. We only provide the leading terms in the private error. The non-private lower bound on E(ˆθ Eh(X1, . . . , Xk))2 for all unbiased ˆθ is Var(Un), which our private algorithms nearly match.

We start with a simple, non-private estimator involving an average of independent quantities. Let m = n/k , and define Sj = {(j 1)k + 1, (j 1)k + 2, . . . , jk} for all j [m]. Define the naive estimator ˆθnaive := Pm j=1 h(XSj)/m. Directly applying existing private mean estimation algorithms [42, 40, 39, 10] to our setting yields an error bound of3

Var(ˆθnaive) + k τ/(nϵ) = O p

kζk/n + k τ/(nϵ) , (7)

since Var(ˆθnaive) = kζk/n. Note that this variance is larger than the dominant term k2ζ1/n of Var(Un) (see Lemma A.1 and Eq 4); indeed, ˆθnaive is a suboptimal estimator of θ.

In Algorithms A.2 and A.3, we present a general extension of the Coin Press algorithm [10], which is then used to obtain a private estimate of θ with the non-private error term matching p

Var(Un). For completeness, we present the algorithms and their proofs in Appendix A.3. Definition 1 (All-tuples family). Let M = n k and let Sall = {S1, S2, . . . , SM} = In,k be the set of all k-element subsets of [n]. Call Sall the all-tuples family. Proposition 1. Suppose θ [ R, R]. Let Sall be the all-tuples family in Definition 1. Then Wrapper 1, with f = all, failure probability α, and A = U-Stat Mean (Algorithm A.2) returns an estimate θall of the mean θ such that, with probability at least 1 α,

| θall θ| O p

Var(Unα) + O k3/2 τ

as long as nα = Ω k

. Moreover, the algorithm is ϵ-differentially private and runs in time

O log(1/α) k + log R

Remark 1. While Lemma 1 recovers the correct first term of the deviation, the private error term is a

k factor worse. Moreover, we need k2/n = o(1) for the private error to be asymptotically smaller than the non-private error. Note, however, that existing concentration [35, 5] or convergence in probability results [58, 48] only require k = o(n) (see Lemmas A.1 and A.3 in the Appendix). Remark 2 (Degenerate and sparse settings). While Lemma 1 improves over the naive estimator, the private error can overwhelm the non-private error for degenerate and sparse U-statistics (see Section 5). We show that for uniformity testing, using this estimator can lead to suboptimal sample complexity if the distribution is already close to uniform.

Proposition 1 improves over the naive estimator at the cost of computational complexity. We can trade off the computational and statistical efficiencies using a different family S parameterized by the size M of S. In Appendix 3, we show a result similar to 1 for the subsampled family.

3 O(.) hides logarithmic factors in the problem parameters k, n, τ, α, ζk, where α is the error probability.

Definition 2 (Subsampled Family). Draw M i.i.d. samples S1, . . . , SM from the uniform distribution over the elements of In,k, and let Sss := {S1, . . . , SM}. Call Sall the subsampled family.

The next result shows a nearly optimal dependence on n and ϵ in the bounds for ˆθnaive and θall. In particular, the dependence of the modified Coinpress algorithm (Lemma 1) on k is suboptimal. Theorem 1 (Lower bound for non-degenerate kernels). Let n and k be positive integers with k < n/2 and let ϵ = Ω(k/n). Let F be the set of all sub-Gaussian distributions over R with variance proxy 1, and let µ be the output of any ϵ-differentially private algorithm applied to n i.i.d. observations from D. Then, suph,D:H F E | µ(X1, . . . , Xn) E[h(X1, . . . , Xk)]| = Ω k

Among unbiased estimators, Un is the best non-private estimator [35, 45]. The most widely used non-private estimators are Uand V -statistics, which share similar asymptotic properties [58]. The above lower bound also has a log factor arising from an optimal choice of the sub-Gaussian proxy for Bernoulli random variables [4]. Proofs are deferred to Appendix A.4.

3.1 Boosting the error probability via median-of-means

If we use Algorithm A.2 as stated with failure probability α, then the error in the algorithm has a O(1/ α) factor, which is undesirable. Instead, we use the Algorithm with a constant failure probability (say, 0.25) and then boost this failure probability to α via a median-of-means procedure. We incorporate the median-of-means in all of our theoretical results. Wrapper 1 (Median Of Means(n, k, Algorithm A, Parameters Λ, Failure probability α, Family type f {all, ss})). Divide [n] into q = 8 log(1/α) independent chunks Ii, i [q] of roughly the same size. For each i [q], run Algorithm A with subset family Si := Sf(Ii), Dataset {h(XS)}S Si, and other parameters Λ for A to output ˆθi, i [q]. Return θ = median(ˆθ1, . . . , ˆθq).

In the above wrapper, Sf(Di) simply creates the appropriate family of subsets for the dataset Di. For example, if Di = {X1, . . . Xnα}, f = all, then Sall(Di) is {h(XS)}S Inα,k. If f = ss, then Sss(Di) is {h(XS)}S Si, where Si is the set of M subsampled subsets of Di. Here, nα := n 8 log(1/α). Wrapper 1 can be used to boost the failure probability from constant to α by splitting the data into O(log(1/α)) chunks, applying the algorithm on each of the chunks, and taking the median output. The expense of this procedure is the reduction in the effective sample size to nα = Θ(n/ log(1/α)). Details and proofs on the median-of-means procedure can be found in Section A.3.2.

4 Main Results

In Section 3, we showed that off-the-shelf private mean estimation tools applied to U-statistics either achieve a sub-optimal non-private error (see Remark 1) or a sub-optimal private error. If the U-statistic is degenerate of order 1, the non-private and private errors (assuming ϵ = Θ(1)) are Θ(1/n). We now present an algorithm that achieves nearly optimal private error for sub-Gaussian non-degenerate kernels. Our algorithm can be viewed as a generalization of the algorithm proposed in Ullman and Sealfon [57] for privately estimating the edge density of an Erd os-Rényi graph. We provide strong evidence that, for bounded degenerate kernels, we achieve nearly optimal non-private error. All proofs for this section can be found in Section A.5.

4.1 Key intuition

Our key insight is to leverage the Hájek projection [58, 45], which gives the best representation of a U-statistic as a linear function of the form Pn i=1 f(Xi):

i=1 E[Tn|Xi] (n 1)E[Tn] (ii) = k

i=1 E[h(XS)|Xi] (n 1)θ.

Equality (i) gives the form of the Hájek projection for a general statistic Tn, whereas (ii) gives the form when Tn is a U-statistic. Let I(i) n,k = {S In,k : i S}. In practice, one uses the estimates

b E[h(XS)|Xi] := 1 n 1 k 1 X

which we call local Hájek projections. In some sense, this is the U-statistic when viewed locally at Xi. When the dataset is clear from context, we write ˆh In,k(i), or simply ˆh(i), for b E[h(XS)|Xi].

4.2 Proposed algorithm

Consider a family of subsets S In,k of size M. Let Si = {S S : i S} and Mi = |Si|, and suppose Mi = 0 for all i [n]. Assume also that S satisfies the inequalities

Mi/M 3k/n and Mij/Mi 3k/n (10)

for any distinct indices i and j in [n] (one such family is S = In,k, for which Mi/M = k/n and Mij/Mi = (k 1)/(n 1) k/n, but there are other such families). Define

S S h(XS), and ˆh S(i) := 1

S Si h(XS), i [n]. (11)

Un in Eq (1) and b E[h(XS)|Xi] in equation (9) are the same as the quantities An(In,k) and ˆh In,k(i).

A standard procedure in private estimation algorithms is to clip the data to an appropriate interval [10, 41] in such a way that the sensitivity of the overall estimate can be bounded. Similarly, we use the concentration of the local Hájek projections to define an interval such that each i can be classified as good" or bad based on the distance between ˆh S(i) and the interval. The final estimator is devised so the contribution of the bad indices to the estimator is low and the estimator has low sensitivity.

Let ξ and C be parameters to be chosen later; they will be chosen in such a way that with high probability, (i) |ˆh(i) θ| ξ for all i, and (ii) each h(XS) is at most C away from θ. Define

LS := arg min t N>0

n i : ˆh S(i) An(S) > ξ + 6k Ct/n o t . (12)

In other words, LS is the smallest positive integer t such that at most t indices i [n] satisfy |ˆh S(i) An(S)| > ξ + 6k Ct

n (such an integer t always exists because t = n works). Define

Good(S) := n i : ˆh S(i) An(S) ξ + 6k CLS/n o , Bad(S) := [n] \ Good. (13)

For each index i [n], define the weight of i with respect to S as

wt S(i) := max 0, 1 ϵn

6Ck dist ˆh S(i) An, [ ξ 6k CLS/n, ξ + 6k CLS/n] . (14)

Here, ϵ is the privacy parameter and dist(x, I) is the distance between x and the interval I.

Based on whether a datapoint is good or bad, we will define a weighting scheme that reweights the h(XS) in equation (1). For each S S, let

wt S(S) := min i S wt S(i), and g S(XS) := h(XS)wt S(S) + An(S) (1 wt S(S)) .

In particular, if wt S(S) = 1, then g S(XS) = h(XS); and if wt S(S) = 0, then g S(XS) = An(S). Finally, define the quantities

S S g S(XS), ˆg S(i) := 1

S Si g S(XS) i [n]. (15)

To simplify notation, we will drop the argument S from L, An, An, ˆh, ˆg, Good, and Bad.

Idea behind the algorithm: If all ˆh(i) s are within ξ of the empirical mean An, then Bad = and An = An. Otherwise, for any set S containing a bad index, we replace h(XS) by a weighted combination of h(XS) and An. This averaging-out of the bad indices allows a bound on the local sensitivity of An. We then provide a smooth upper bound on the local sensitivity characterized by the quantity L, which can be viewed as an indicator of how well-concentrated the data is. The choice of ξ will be such that L = 1 with high probability and that the smooth sensitivity of An at X is small. This ensures that a smaller amount of noise is added to An to preserve privacy.

Algorithm 1 Private Mean Local Hájek(n, k, {h(XS), S S}, ϵ, C, ξ, S)

1: M |S| 2: Si {S S : i S} 3: Mi |Si| 4: if there exist indices i = j such that Mi = 0 or Mi/M > 3k/n or Mij/Mi > 3k/n, then 5: return 6: end if 7: An P

S S h(XS)/M 8: for i = 1, 2, . . . , n do 9: ˆh(i) P

S Si h(XS)/Mi 10: end for 11: Let L be the smallest positive integer such that n i : ˆh(i) An > ξ + 6k CL

12: Good(S) n i : ˆh(i) An ξ + 6k CL

n o ; Bad(S) [n] \ Good(S)

13: for i = 1, 2, . . . , n do

14: wt(i) max 0, 1 ϵ 6Ck/ndist ˆh(i) An, ξ 6k CL

n , ξ + 6k CL

15: end for 16: for S S do 17: g(XS) h(XS) mini S wt(i) + An (1 mini S wt(i)) 18: end for 19: An P

S S g(XS)/M

20: S(S) max0 ℓ n e ϵℓ k n(ξ+ k C(L+ℓ)

n )(1+ϵ(L+ℓ))+ k2C(L+ℓ)2 min(k,(L+ℓ))

21: Draw Z from distribution with density h(z) 1/(1 + |z|4) 22: return An + S(S)/ϵ Z

Theorem 2. Algorithm 1 is 10ϵ-differentially private for any ξ. Moreover, suppose h is bounded with additive range C,4 and with probability at least 0.99, we have maxi |ˆh S(i) An| ξ. Run Wrapper 1 with f = all, and A = Private Mean Local Hajek (Algorithm 1). With probability at least 1 α, the output θ satisfies | θ θ| = O p

Var(Unα) + kξ nαϵ + k2 n2αϵ2 + k3 n3αϵ3 C .

Connections to Ullman and Sealfon [57]: Ullman and Sealfon [57] estimate the edge density of an Erd os Renyi graph using strong concentration properties of its degrees. This idea can be loosely generalized to a broader setting of U-statistics: consider a hypergraph with n nodes and n k edges, where the ith node corresponds to index i. An edge corresponds to a k-tuple of data points S In,k, and the edge weight is given by h(XS). The natural counterpart of a degree in a graph becomes a local Hájek projection, defined as in equation (9). In degenerate cases and cases where k2ζ1 kζk, the local Hájek projections are tightly concentrated around the mean θ. We exploit this fact and reweight the edges (k-tuples) so that the local sensitivity of the reweighted U-statistic is small.

4.3 Application to non-degenerate and degenerate kernels

Algorithm 1 can be extended from bounded kernels to sub-Gaussian(τ) kernels. First, split the samples into two roughly equal halves. The first half of the samples will be used to obtain a coarse estimate of the mean θ. For this, we can use any existing private mean estimation algorithm to obtain an ϵ/2-differentially private estimate θcoarse such that with probability at least 1 α, | θcoarse θ| = O( p

kζk/n + k τ/(nϵ)). By a union bound, with probability at least 1 α,

|h(XS) θ| is within 4 q

τ log n k /α for all S In,k, and therefore also within c p

kτ log(n/α)

of the coarse estimate θcoarse, for some universal constant c, as long as ϵ = Ω(

Define the projected function h(X1, X2, . . . , Xk) to be the value h(X1, X2, . . . , Xk) projected to the interval [ θcoarse c p

kτ log(n/α), θcoarse + c p

kτ log(n/α)]. The final estimate of the mean θ is obtained by applying Algorithm 1 to the other half of the samples, the function h, and the

4More precisely, suppose supx h(x) infx h(x) C.

privacy parameter ϵ/2. The following lemma shows that p

2τ log(2n/α) is a valid choice of the concentration parameter ξ for sub-Gaussian, non-degenerate kernels.

Lemma 3. If H is sub-Gaussian(τ), the local Hájek projections ˆh(i) are also sub-Gaussian(τ). In particular, with probability at least 1 α, we have max1 i n |ˆh(i) θ| p

2τ log(2n/α).

Combining these parameters with Theorem 2 gives us the following result: Corollary 1 (Non-degenerate sub-Gaussian kernels). Suppose h is sub-Gaussian(τ) and the privacy parameter ϵ = Ω(k1/2/n). Split the samples into two halves and compute a private estimate of the mean by applying the naive estimator on the first half of the samples with privacy parameter ϵ/2 to obtain θcoarse. Let h be the projection of of the function h onto the interval θcoarse O( p

kτ log(2n/α)). Run Wrapper 1 on the remaining half of the samples with f = all, failure probability α/2, algorithm A = Private Mean Local Hájek (Algorithm 1), C = p

2kτ log(n/2α), ξ = p

2τ log(n/2α), function h, and privacy parameter ϵ/20. Then, the output θ is ϵ-differentially private. Moreover, with probability at least 1 α,

| θ θ| = O p

Var(Unα) + O k τ

nαϵ + k2.5 τ

n2αϵ2 + k3.5 τ

From our lower bound on non-degenerate kernels in Theorem 1, we see that the above corollary is optimal in terms of k, n, ϵ (up to log factors). In contrast, Lemma 1 is suboptimal in k.

Many degenerate U-statistics (e.g., all the degenerate ones in Section 5) have bounded kernels. For these, we see that the local Hájek projections concentrate strongly around the U-statistic. Lemma 4. Suppose H is bounded, with additive range C. Let i [n] be an arbitrary index and Si In,k be a set containing i, and suppose xi R is some element in the support of D. With probability at least 1 β

n, conditioned on Xi = xi, we have

b E[h(XS)|Xi = xi] E [h(XSi)|Xi = xi] 2σi

where b E[h(XS)|Xi = xi] =

( n 1 k 1) , and σ2 i = Var (h(XSi)|Xi = xi).

For bounded kernels with additive range C, σi C/2 by Popoviciu s inequality [53]. Moreover, for degenerate kernels, ζ1 = 0. That is, the conditional expectation E [h(XSi)|Xi = xi] is equal to θ for all xi, because the variance of this conditional expectation is ζ1. Based on this, we can show that the choice of ξ = O Ck1/2/n1/2 satisfies the requirement that the local Hájek projections are within ξ of θ with probability at least 1 α. Corollary 2 (Degenerate bounded kernels). Suppose h is bounded with additive range C and the kernel is degenerate ζ1 = 0. Let ϵ = Ω(k1/2/n) be the privacy parameter. Run Wrapper 1 with f = all, failure probability α, and algorithm A = Private Mean Local Hájek (Algorithm 1) with ξ = O(C p

k/n log(n/α)), to output θ. With probability 1 α, we have θ θ = O p

Var(Unα) + O k1.5

n1.5 α ϵC + k2

n2αϵ2 C + k3

Obtaining a result for sub-Gaussian degenerate kernels poses difficulties on bounding the concentration parameter ξ. However, for bounded kernels, we see that the above result obtains better private error than the application of off-the-shelf methods (Lemma 1). In the next subsection, we provide a lower bound for degenerate bounded kernels, which, together with Corollary 2, gives strong indication that our algorithm achieves optimal private error for private degenerate kernels.

4.4 Lower bound

To obtain a lower bound of the private error, we construct a dataset and kernel function such that the local Hájek projections are 1/ n concentrated around the corresponding U-statistic. This is one way of characterizing a degenerate U-statistic. The proof of the following theorem is in Appendix A.4.

Theorem 3. For any n, k N with k n, ϵ = Ω((k/n)1 1/2(k 1)), and ϵ-differentially private algorithm A : X n R, there exists a function h : X k {0, 1} and dataset D such that |ˆh(i) Un| p

k/n (where ˆh(i) and Un are computed on D) for every i [n] and E|A(D) Un| =

n3/2ϵ , where the expectation is taken over the randomness of A.

Remark 3. The above lower bound is in some sense informal because we created a deterministic dataset and h that mimics the property of a degenerate U statistic; the local Hájek projections concentrate around Un at a rate p

k/n. However, it gives us a strong reason to believe that the private error cannot be smaller than O(k3/2/n3/2ϵ) for degenerate U statistics of order k. Note that for bounded kernels, Corollary 2 does achieve this bound, as opposed to Lemma 1.

4.5 Subsampling estimator

We now focus on subsampled U-statistics. Previous work has shown how to use random subsampling to obtain computationally efficient approximations of U-statistics [38, 52, 17], where the sum is replaced with an average of samples (drawn with or without replacement) from In,k.

Recall Definition 2. Let S := {S1, . . . , SM} denote the subsampled set of subsets, let Si := {S S : i S}, and let Mi := |Si|. The proof of Theorem 2 with S = In,k (cf. Appendix A.5) uses the property of In,k that Mi/M = k/n and Mij/Mi = (k 1)/(n 1), so the inequalities (10) certainly hold. Indeed, we can show that for subsampled data (cf. Lemma A.11), the following inequalities hold with probability at least 1 α, provided M = Ω(n2/k2 log(n/α)):

Mi/M 3k/n and Mij/Mi 3k/n. (17)

Algorithmically, we check if the bounds (17) hold for S, and output if not. Privacy is not compromised because the check only depends on S and is agnostic to the data. Theorem 4. Let Mn = Ω (n2/k2) log n . Then Algorithm 1, modified to output if the bounds (17) do not hold, is 10ϵ-differentially private. Moreover, suppose that with probability at least 0.99, we have maxi |ˆh S(i) An| ξ and |h(S) θ| C for all S In,k. Run Wrapper 1 with f = ss, failure probability α, and A = Private Mean Local Hajek (Algorithm 1) to output θ. With probability at least 1 α, we have

|A(X) θ| = O

ζk Mnα + kξ

n2αϵ2 + k3C

Remark 4. If the kernel is non-degenerate and the number of times we subsample (for each run of the algorithm) is Ω n2 α/k2 , then Theorem 4 nearly achieves the same error as Algorithm 1 with S = In,k with a better computational complexity for k 3. The lower-order terms have an additional min(k, 1/ϵ) factor, which can be removed with Ω(n3) subsamples.

5 Applications

We apply our methods to private uniformity testing and estimation in random networks. For more applications, see Appendix A.6.4.

1. Uniformity testing: A fundamental task in distributional property testing [6, 7] is deciding whether a discrete distribution is uniform on its domain, called the problem of uniformity testing. Let X1, X2, . . . , Xn be n i.i.d. samples from a discrete distribution with support [m], characterized by the probability masses p1, p2, . . . , pm on the atoms. Given an error tolerance δ > 0, the task is to distinguish between approximately uniform distributions p : ℓ2(p, U) δ/

2m and far-fromuniform distributions {p : ℓ2(p, U) δ/ m}.

Without the constraint of privacy, Diakonikolas et al. [21] perform this test by rejecting the uniformity hypothesis whenever the test statistic Un := P

i<j 1(Xi = Xj)/ n k > (1 + 3δ2/4)/m, and show that this test succeeds with probability 0.9 as long as n = Ω m1/2/δ2 . As detailed in Algorithm A.4 in the appendix, instead of using Un, we use the private estimate Un using Algorithm 1.

For our algorithm to work, we require the distributions to satisfy pi 2/m for all i. Let pi = (1 + ai)/m for all i, with ai [ 1, 1]. Under H1, we have E[1(X1 = X2)] = (1 + a 2 /m)/m

(1 + δ2)/m, where a is the ℓ2 norm of (a1, a2, . . . , am). Under H0, the mean is 1/m. The difference between the threshold (1 + δ2/2)/m and the mean (under either of the two hypotheses) is at least δ2/(2m). Moreover, [21] shows that the standard deviation of Un is much smaller than the difference in the means under H0 and H1 as long as n = Ω(m1/2/δ2). However, we must also account for the noise added to ensure privacy. In Appendix A.6.1, we show that the choice of ξ = Θ(1/m + 1/n) works and establish the following result: Theorem 5. Let pi = (1+ai)/m with ai [ 1, 1], Pm i=1 ai = 0. Let {Xj}n j=1 be i.i.d. multinomial random variables such that P(X1 = i) = pi, for all i [m]. There exists an algorithm that distinguishes between a 2

m2 δ2 m from a 2

m2 < δ2 2m with probability at least 1 α, as long as

nα = Ω m1/2

(δϵ) + m1/2 log(m/δϵ)

δϵ1/2 + m1/3

δ2 , and is 10ϵ-differentially private.

The non-private error term of Theorem 5 is the same as in Theorem 1 of [21] and is optimal [22]. Proposition 1 shows that Algorithm A.2 with the all-tuples family leads to a private error bounded by O(1/nϵ). This private error is O(δ2/m) only when n = Ω(m/δ2ϵ). In comparison, Algorithm A.4 has error O(δ2/m) for n = Ω m1/2/min(δ2, δϵ) , which is quadratically better in m. Remark 5 (Comparison with existing algorithms). Existing results for private uniformity testing [1, 13] distinguish between the uniform distributions (ℓ1(p, U) = 0) and distributions away from uniform in TV-distance (ℓ1(p, U) δ). Our algorithm considers the alternative hypothesis to be ℓ2(p, U) δ/ m. Hence, our results are not strictly comparable. One caveat is that we restrict ourselves to distributions p such that ℓ (p, U) 1/m. Our algorithm also allows some tolerance in the null hypothesis, similar to [21] and other collision-based testers. That is, can allow some slack and take the null hypothesis H0 : ℓ2(p, U) δ/

2m instead of ℓ2(p, U) = 0.

2. Sparse graph statistics: The geometric random graph (see [30]) has edges h(Xi, Xj) := 1( Xi Xj 2 rn), where rn governs the average degree. Under a suitable distribution for X1, . . . , Xn, the subgraph counts show normal convergence for a large range of rn [30]. Typically, we only observe the graph and do not know the underlying distribution of the latent variables Xi or the radius rn. This is why estimates of the network moments are of interest since they reveal information about the underlying unknown distribution and parameters.

Let the Xi s be uniformly distributed on the three-dimensional sphere to ignore boundary conditions. For edge density, Eh(Xi, Xj) r2 n. For any distinct indices i, j, k and a given Xi, the random variables h(Xi, Xj) and h(Xi, Xk) are independent. Therefore, ζ1 = cov(h(Xi, Xj)h(Xi, Xk)) = 0. We have ζ2 = var[h(Xi, Xj)] = O(r2 n), so the non-private error is O(rn/n). In Appendix A.6.2, we provide Algorithm A.5 that uses Algorithm 1 to obtain a private estimate of the edge density of a graph {h(Xi, Xj)}1 i<j n. Note that the Xi s themselves can be unknown. Our methods can also be used for private triangle density estimation. See the extended version [16] for details.

Theorem 6. Let rn = Ω(n 1/2) and ϵ = Ω 1/nr2 n . Let {X1, . . . , Xn} be i.i.d. latent positions such that Xi is distributed uniformly on S2. Let the observed geometric network have adjacency matrix {Aij, 1 i < j n} where Aij = 1( Xi Xj rn). There exists a 10ϵ-differentially private algorithm that estimates the edge density θ of the geometric graph. With probability at least 1 α, the output θ satisfies | θ θ| = O rn nα + 1 n2 αϵ2 + 1 n3 αϵ3

Remark 6. By Lemma 1, the all-tuples estimator (1) satisfies | θall θ| O (rn/n + τ/nϵ) with probability 1 α, where τ is the variance proxy of the distribution. Since τ = Ω(1) [4], the private error overpowers the main variance term in sparse settings where rn = o(1).

6 Discussion We have considered the problem of estimating θ := Eh(X1, . . . , Xk) for a broad class of kernel functions h. The best non-private unbiased estimator is a U statistic, which is widely used in estimation and hypothesis testing. While existing private mean estimation algorithms can be used for this setting, they can be suboptimal for large k or for non-degenerate U statistics, which have O(1/n) limiting variance. We provide lower bounds for both degenerate and non-degenerate settings. We analyze bounded degenerate kernels motivated by typical applications with degenerate U statistics. To extend this to the subgaussian setting is part of future work. We propose an algorithm that matches our lower bounds for sub-Gaussian non-degenerate kernels and bounded degenerate kernels. We also provide applications of our theory to private hypothesis testing and estimation in sparse graphs.

Acknowledgments and Disclosure of Funding

PS and SP gratefully acknowledge NSF grants 2217069, 2019844, and DMS 2109155. We thank the reviewers and Gautam Kamath for their insightful comments and suggestions.

[1] J. Acharya, Z. Sun, and H. Zhang. Differentially private testing of identity and closeness of discrete distributions. Advances in Neural Information Processing Systems, 31, 2018.

[2] M. Aliakbarpour, I. Diakonikolas, D. Kane, and R. Rubinfeld. Private testing of distributions via sample permutations. Advances in Neural Information Processing Systems, 32, 2019.

[3] T. W. Anderson and D. Darling. Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. Annals of Mathematical Statistics, 23:193 212, 1952.

[4] Julyan Arbel, Olivier Marchal, and Hien D Nguyen. On strict sub-gaussianity, optimal proxy variance and symmetry for bounded random variables. ESAIM: Probability and Statistics, 24:39 55, 2020.

[5] M. A. Arcones and E. Gine. Limit theorems for U-processes. The Annals of Probability, 21(3):1494 1542, 1993.

[6] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 259 269. IEEE, 2000.

[7] Tu gkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing closeness of discrete distributions. Journal of the ACM (JACM), 60(1):1 25, 2013.

[8] J. Bell, A. Bellet, A. Gascón, and T. Kulkarni. Private protocols for U-statistics in the local model and beyond. In International Conference on Artificial Intelligence and Statistics, pages 1573 1583. PMLR, 2020.

[9] Sergei Bernstein. On a modification of chebyshev s inequality and of the error formula of laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38 49, 1924.

[10] S. Biswas, Y. Dong, G. Kamath, and J. Ullman. Coinpress: Practical private mean and covariance estimation. Advances in Neural Information Processing Systems, 33:14475 14485, 2020.

[11] Gavin Brown, Samuel Hopkins, and Adam Smith. Fast, sample-efficient, affine-invariant private mean and covariance estimation for subgaussian distributions. In The Thirty Sixth Annual Conference on Learning Theory, pages 5578 5579. PMLR, 2023.

[12] C. Butucea, A. Rohde, and L. Steinberger. Interactive versus noninteractive locally differentially private estimation: Two elbows for the quadratic functional. The Annals of Statistics, 51(2):464 486, 2023.

[13] Bryan Cai, Constantinos Daskalakis, and Gautam Kamath. Priv it: Private and sample efficient identity testing. In International Conference on Machine Learning, pages 635 644. PMLR, 2017.

[14] T. T. Cai, Y. Wang, and L. Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49(5):2825 2850, 2021.

[15] K. Chaudhuri and D. Hsu. Convergence rates for differentially private statistical estimation. In Proceedings of the International Conference on Machine Learning, volume 2012, page 1327. NIH Public Access, 2012.

[16] Kamalika Chaudhuri, Po-Ling Loh, Shourya Pandey, and Purnamrita Sarkar. On differentially private u statistics, 2024.

[17] X. Chen and K. Kato. Randomized incomplete U-statistics in high dimensions. The Annals of Statistics, 47(6):3127 3156, 2019.

[18] S. Clémençon. A statistical view of clustering performance through the theory of U-processes. Journal of Multivariate Analysis, 124:42 56, 2014.

[19] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844 874, 2008.

[20] Tertius de Wet. Degenerate uand v-statistics. South African Statistical Journal, 21:99 129, 1987.

[21] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. ar Xiv preprint ar Xiv:1611.03579, 2016.

[22] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Optimal identity testing with high probability. ar Xiv preprint ar Xiv:1708.02728, 2017.

[23] A. Dubois, T. B. Berrett, and C. Butucea. Goodness-of-fit testing for Hölder continuous densities under local differential privacy. In Foundations of Modern Statistics, pages 53 119. Springer, 2019.

[24] John Duchi, Saminul Haque, and Rohith Kuditipudi. A fast algorithm for adaptive private mean estimation. ar Xiv preprint ar Xiv:2301.07078, 2023.

[25] C. Dwork, F. Mc Sherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265 284. Springer, 2006.

[26] Andrey Feuerverger and Roman A. Mureika. The Empirical Characteristic Function and Its Applications. The Annals of Statistics, 5(1):88 97, 1977.

[27] E. W. Frees. Infinite order U-statistics. Scandinavian Journal of Statistics, 16(1):29 45, 1989.

[28] M. Gaboardi, H. Lim, R. Rogers, and S. Vadhan. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. In International conference on Machine Learning, pages 2111 2120. PMLR, 2016.

[29] B. Ghazi, P. Kamath, R. Kumar, P. Manurangsi, and A. Sealfon. On computing pairwise statistics with local differential privacy. Advances in Neural Information Processing Systems, 33:14475 14485, 2020.

[30] E. N. Gilbert. Random plane networks. Journal of The Society for Industrial and Applied Mathematics, 9:533 543, 1961.

[31] G. Gregory. Large sample theory for U-statistics and tests of fit. Annals of Statistics, 5:110 123, 1977.

[32] P. R. Halmos. The theory of unbiased estimation. The Annals of Mathematical Statistics, 17(1):34 43, 1946.

[33] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 705 714, 2010.

[34] H.-C. Ho and G. S. Shieh. Two-stage U-statistics for hypothesis testing. Scandinavian Journal of Statistics, 33, 2006.

[35] W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293 325, 1948.

[36] Wassily Hoeffding. The strong law of large numbers for u-statistics. Technical report, North Carolina State University. Dept. of Statistics, 1961.

[37] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963.

[38] S. Janson. The asymptotic distributions of incomplete U-statistics. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 66(4):495 505, 1984.

[39] G. Kamath, J. Li, V. Singhal, and J. Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853 1902. PMLR, 2019.

[40] G. Kamath, O. Sheffet, V. Singhal, and J. Ullman. Differentially private algorithms for learning mixtures of separated Gaussians. Advances in Neural Information Processing Systems, 32, 2019.

[41] G. Kamath, V. Singhal, and J. Ullman. Private mean estimation of heavy-tailed distributions. Ar Xiv, abs/2002.09464, 2020.

[42] V. Karwa and S. Vadhan. Finite sample differentially private confidence intervals. ar Xiv preprint ar Xiv:1711.03908, 2017.

[43] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793 826, 2011.

[44] J. Lam-Weil, B. Laurent, and J.-M. Loubes. Minimax optimal goodness-of-fit testing for densities and multinomials under a local differential privacy constraint. Bernoulli, 28(1):579 600, 2022.

[45] A J Lee. U-statistics: Theory and Practice. Routledge, 2019.

[46] C. Li and X. Fan. On nonparametric conditional independence tests for continuous variables. WIREs Computational Statistics, 12(3):e1489, 2020.

[47] O. Linton and P. Gozalo. Testing conditional independence restrictions. Econometric Reviews, 33(5-6):523 552, 2014.

[48] S. Minsker. U-statistics of growing order and sub-Gaussian mean estimators with sharp constants, 2023.

[49] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 75 84, 2007.

[50] W. Peng, T. Coleman, and L. Mentch. Asymptotic distributions and rates of convergence for random forests via generalized U-statistics. ar Xiv preprint ar Xiv:1905.10651, 2019.

[51] Yannik Pitcan. A note on concentration inequalities for u-statistics. ar Xiv preprint ar Xiv:1712.06160, 2017.

[52] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Science & Business Media, 2012.

[53] T. Popoviciu. Sur certaines inégalités qui caractérisent les fonctions convexes. Analele Stiintifice Univ. Al. I. Cuza , Iasi, Sectia Mat, 11:155 164, 1965.

[54] R. J. Serfling. Approximation theorems of mathematical statistics. Wiley Series in Probability and Mathematical Statistics : Probability and Mathematical Statistics. Wiley, New York, NY [u.a.], [nachdr.] edition, 1980.

[55] Galen R Shorack and Jon A Wellner. Empirical processes with applications to statistics. SIAM, 2009.

[56] Yanglei Song, Xiaohui Chen, and Kengo Kato. Approximating high-dimensional infinite-order u-statistics: Statistical and computational guarantees. Electronic Journal of Statistics, 13(2), January 2019.

[57] J. Ullman and A. Sealfon. Efficiently estimating erdos-renyi graphs with node differential privacy. Advances in Neural Information Processing Systems, 32, 2019.

[58] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.

[59] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[60] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[61] N. C. Weber. Incomplete degenerate u-statistics. Scandinavian Journal of Statistics, 8(2):120 123, 1981.

Roadmap of Appendix

1. Section A.1: U statistics and concentration of U statistics 2. Section A.2: Composition theorems for differential privacy 3. Section A.3: Details on our extension of the Coinpress algorithm [10] 4. Section A.4: Proofs of lower bounds 5. Section A.5: Proofs of the main theorems 2 and 4 6. Section A.6: More applications and details on uniformity testing and edge-density estimation

A.1 U-Statistics

Let h : X k R be a symmetric function, and let X1, . . . , Xn X. The U statistic on the n variables X1, . . . , Xn, Un(h), associated with h is defined as

Un = 1 n k X

{i1,...,ik} In,k h(Xi1, . . . , Xik). (A.18)

The mean of Un based on iid variables X1, . . . , Xn D, for some distribution D on X, is simply θ := E[h(X1, . . . , Xk)]. Moreover, the variance of Un can be expressed succinctly in terms of conditional expectations [45]. For c = 1, 2, . . . , k, define hc : X c R as hc(X1, . . . , Xc) := E[h(X1, . . . , Xk)|X1 = x1, . . . , Xc = xc], (A.19) and let ζc = Var (hc(X1, . . . , Xc)) . (A.20) Equivalently, ζc = cov (h(XS1), h(XS2)) where S1, S2 In,k and |S1 S2| = c. The number of such pairs of sets S1 and S2 is equal to n k k c n k k c , which implies implies Eq 3.

Hoeffding decomposition. A U statistic of degree k can be written as the sum of uncorrelated U statistics of degrees 1, 2, . . . , k. Define

h(1)(X1) = h(X1) θ and for all 2 c k,

h(c)(X1, . . . , Xc) = (hc(X1, . . . , Xc) θ) X

ϕ S Ic,i h(i) (XS) .

Then, Un can be written as

U (c) n , (A.21)

where U (c) n is the U statistic on X1, . . . , Xn based on the kernel h(c). Equation A.21 is called the Hoeffding decomposition (or the H-decomposition) of Un [36]. [36] also shows that the c functions h(1), . . . , h(k) are pairwise uncorrelated. That is, let 1 c < d k and let Sc and Sd be subsets of [n] of sizes c and d respectively. Then,

cov h(c)(XSc), h(d)(XSd) = 0.

This allows us to write the variance of Un in terms of the variances of h(c). For all c [k], define

δ2 c = Var(h(c)). (A.22) Then,

1 δ2 c. (A.23)

Moreover, the conditional covariances ζc are related to the variances δ2 c in the following manner:

δ2 i , δ2 c =

i=1 ( 1)c i c i

Lemma A.1. Suppose k n/2.

(i) If ζ1 > 0, then

Var(Un) = k2ζ1

n + O ζk k2

(ii) If ζ1 = 0 and ζ2 > 0, then

Var(Un) = k2(k 1)2ζ2

2n(n 1) + O ζk k3

Proof. This result follows directly from a calculation appearing in the proof of Theorem 3.1 in [48]. Note that

for all j [k]. Moreover,

For part (i), we write

Var(Un) = k2ζ1

1 δ2 j k2ζ1

For part (ii), we write

Var(Un) = k2ζ1

n + k2(k 1)2ζ2

k j 2 n j ζj k2(k 1)2ζ2

2n(n 1) + ζk

2n(n 1) + ζk

j = k2(k 1)2ζ2

2n(n 1) + 2k3ζk

Lemma A.2. For all 1 c d k,

In particular, kζ1 ζk.

Proof. Using equation A.24,

Lemma A.3 (Concentration of U-statistics). [37, 9]

(i) If H is sub-Gaussian with variance proxy τ, then for all t > 0, we have

P (|Un θ| t) 2 exp n

(ii) If H is almost surely bounded in ( C, C), then for all t > 0, we have

P (|Un θ| t) exp

2ζk + 2Ct/3

Proof. Without loss of generality, let θ = 0. For any permutation σ of [n], let

i=1 h(Xσ(k(i 1)+1), Xσ(k(i 1)+2), . . . , Xσ(ki)),

where m = n/k . By symmetry, Un = 1

σ Vσ. For any s > 0,

P (Un t) = P es Un est e st E es Un = e st E

σ exp(s Vn)

= e st E [exp(s Vid)]

= e st E h exp s

mh(X1, . . . , Xk) im .

If H is sub-Gaussian with variance proxy τ, then we can further bound the inequality above as

P (Un t) e st exp s2τ

m = exp st + s2τ

τ 2 to get the desired result. Note how the argument is similar to the classical Hoeffding s inequality argument after applying Jensen s inequality on Vσ. The second result follows similarly by adapting the tricks of Bernstein s inequality [9] to [37]; for a detailed proof, see [51].

A.2 Details on Privacy Mechanisms

Lemma A.4. (Basic Composition) If Ai : X n Si is ϵi-differentially private for all i [k], then the mechanism A : X n S1 Sk defined as

A (X1, . . . , Xn) = (A1 (X1, . . . , Xn) , . . . , Ak (X1, . . . , Xn))

is Pk i=1 ϵi-differentially private.

Lemma A.5. (Parallel Composition) If Ai : X n Si is ϵ-differentially private for all i [k], then the mechanism A : X kn S1 Sk defined as

A (X1, . . . , Xkn) = A1 (X1, . . . , Xn) , A2 (Xn+1, . . . , X2n) , . . . , Ak X(k 1)n+1, . . . , Xkn

is ϵ-differentially private.

A.2.1 Private mean estimation

A fundamental task in private statistical inference is to privately estimate the mean based on a set of IID observations. One way to do this is via the global sensitivity method, wherein the standard deviation of the noise scales with the ratio between the range of the distribution and the size of the dataset. In the fairly realistic case where the range is large or unbounded, this leads to highly noisy estimation even in the setting where typical samples are small in size.

To remedy this effect, a line of work [42, 39, 14, 40, 24, 11] has looked into designing better private mean estimators for (sub)-Gaussian vectors. Our work will build on one such method: Coin Press [10]. The idea is to iteratively refine an estimate for the parameters until one obtains a small range containing most of the data with high probability; noise is then added proportional to this smaller range. Note that some dependence on the range of the mean is inevitable for estimation with pure differential privacy [33, 15].

A.3 Details for Section 3

A.3.1 General result

We will prove a more general theorem than Lemma 1, from which Lemma 1 and related Lemmas using other families of subsets are derived.

In Algorithms A.2 and A.3, we present a natural extension of the Coin Press algorithm [10], which is then used to obtain a private estimate of θ with the non-private term matching Var(Un). Originally, this algorithm was used for private mean and covariance estimation of i.i.d. (sub)Gaussian data. We extend the algorithm to take as input data {Yj}j [m] such that (i) each Yj is equal to h(XS) for some S, (ii) the Yj s are weakly dependent on each other, and (iii) each Yj, as well as their mean P

j [m] Yj/m, has sufficiently strong concentration around the population mean.

For instance, suppose m = n/k and Yj = h(XSj) for all j [m], where Sj = {(j 1)k + 1, . . . , (j 1)k + k}. Then Algorithm A.2 reduces to the Coin Press algorithm applied to n/k independent observations h(XS1), h(XS2), . . . , h(XSm).

Algorithm A.2 U-Stat Mean n, k, h, {Xi}i [n] , F = {S1, . . . , Sm}, R, ϵ, γ, Q( ), Qavg( )

1: t log (R/Q(γ)) , [l0, r0] [ R, R] 2: for j = 1, . . . , m do 3: Y0,j h(XSj) 4: end for 5: for i = 1, 2, ..., t do 6: {Yi,j}j [m], [li, ri] U-Stat One Step n, k, {Yi 1,j}, F, [li 1, ri 1], ϵ

t , Q( ), Qavg( )

7: end for 8: {Yt+1,j}j [m], [lt+1, rt+1] U-Stat One Step (n, k, {Yt,j}, F, [lt, rt], ϵ/2, γ, Q( ), Qavg( )) 9: return (lt+1 + rt+1)/2

Algorithm A.3 U-Stat One Step n, k, {Yi}i [m] , F, [l, r], ϵ , β, Q( ), Qavg( )

1: Yj projl Q(β),r+Q(β) (Yj) for all 1 j m. 2: depn,k (F) (r l + 2Q(β))

m Pm j=1 Yj + W, where W Lap

4: [l, r] h Z Qavg(β) +

β , Z + Qavg(β) +

5: return {Yj}j [m], [l, r]

Setting 1. Let n and k be positive integers with k n/2, and let h : X k R be a symmetric function and let D be an unknown distribution over X with E h(Dk) = θ such that |θ| < R for some known parameter R.

Let m be an integer and F = {S1, S2, . . . , Sm} be a family of not necessarily distinct elements of In,k. Define

fi := |{j [m] : i Sj}|

the fraction of indices j such that Sj contains i, and define the maximal dependence fraction depn,k (S) := max i [n] fi. (A.31)

For each j [m], let Yj denote h(XSj). Clearly, E [Yj] = θ. To allow for small noise addition while ensuring privacy, it is desirable to choose S with small depn,k (S).

Define functions Q(β) = Qn,k,h,D,S(β) and Qavg(β) = Qavg n,k,h,D,S(β) on β (0, 1] such that

sup j [m] |Yj θ| > Q(β)

< β. (A.32)

We will refer to Q(β) and Qavg(β) as β-confidence bounds for supj [m] |Yj θ| and 1

j [m] Yj θ , respectively. We apply Theorem A.1 (specifically, the form obtained in Lemma A.6) to different F to obtain private estimates of θ, with statistical and computational tradeoffs depending on the family F. As Remark A.7 suggests, we will also need to privately estimate concentration bounds on the Yj s and their average. Naturally, this requires a private estimate of the variance ζk. We provide guarantees from Biswas et al. [10] for private variance estimation and mean estimation here, where we have translated the mean estimation guarantee to fit our setting.

Before we do that, a natural approach to this problem is to view it as a standard private mean estimation task: split the data into n/k equally-sized chunks, apply the function h to each chunk, and run any existing private mean estimation algorithm to these n/k values. We show that the error guarantee of such an algorithm is suboptimal compared to the error guarantee using the all-tuples estimator 1 or even the subsampling estimator 2 if sufficiently many samples are used. Before stating the propositions associated with these families, we state and prove the mother theorem.

Theorem A.1. For ϵ > 0, Algorithm A.2 with input n, k, h, {Xi}i [n], F, R, ϵ, γ, Q( ), Qavg( )

returns θn such that

m γ | {z } non-private error

+ depn,k (F) Q(γ)

| {z } private error

with probability at least 1 6γ,5 as long as

depn,k (F) Q(γ)ϵ 10t Q(γ/t) log t/γ and Qavg(γ/t) < Q(γ), (A.34)

where t = C log (R/Q(γ)) . Moreover, Algorithm A.2 is ϵ-differentially private and runs in time O(n + m log (R/Q(γ)) + k|F|).

Remark A.7. Theorem A.1 assumes that Q( ) and Qavg( ) are known, despite the mean θ being unknown. Note that we only need to know the value of these functions at γ and γ/t, for a given γ. If these bounds are not known, we may first need to (privately) compute Q(x) and Qavg(x) and then use those privately computed bounds in the algorithm. For example, if the Y i s are sub-Gaussian with variance proxy 1, then Q(x) = p

log(m/x). We will see how to estimate these parameters for various families F of indices used in Algorithm A.2.

Proof of Theorem A.1. We will prove privacy and accuracy guarantees separately.

Privacy. Algorithm A.2 makes t + 1 calls to Algorithm A.3; let i, Wi, and Zi be the values taken by , W, and Z in the ith call to Algorithm A.3, for 1 i t + 1. Let β := γ

t . It can be shown inductively that the interval lengths ri li and the values i do not depend on the dataset. For any 1 i t, note that Yi,j = projli 1 Q(β),ri 1,Q(β) (Yi 1,j) for all 1 j m. Suppose we change Xw to X w for some index w. For any 1 i t + 1, conditioned on the values of Zi for 1 i < i, at most a depn,k (F) fraction of {Yi,j}j [m] depend on w (this is true by the definition of depn,k (F)). Since Yi,j = projli 1 Q(β),ri 1+Q(β/m) (Yi 1,j) has range ri 1 li 1 + 2Q(β), the sensitivity of 1 m Pm j=1 Yi,j is at most depn,k (F) (ri 1 li 1 + 2Q(β)) = i. Therefore, by standard results (cf. Lemma 1), for all 1 i t, the output Zi (and therefore the interval [li, ri]), conditioned on Zi for 1 i < i, is ϵ 2t-differentially private. Similarly, the output (lt+1 + rt+1)/2 = Zt+1, conditioned on {Zi}i [t], is ϵ

2-differentially private. By Basic Composition (see Lemma A.4), Algorithm A.2 is ϵ-differentially private.

Utility. First, we show that if Algorithm A.3 is invoked with θ [l, r], it returns an interval [l , r ] such that θ [l , r ] with probability at least 1 3β. Consider running a variant of Algorithm A.2 with the projection step omitted in every call of Algorithm A.3. With probability at least 1 β, we have 1

m Pm i=1 Yi θ Qavg(β), and with probability at least 1 β, we have |W|

5The following subsection modifies the algorithm so that the error depends polylogarithmically in α.

Therefore, with probability at least 1 2β, we have

|Z θ| Qavg(β) +

in which case θ [l , r ].

Finally, reintroducing the projection step only increases the error probability by at most β. Taking a union bound over t steps, we see that θ [l , r ], with probability at least 1 3γ.

Next, we claim that if r l > 28Q(γ), then r l (r l)/2. Using the assumption, we have

depn,k (F) Q(γ)ϵ 10t Q(γ/t) log t/γ min ϵ

5 log 1/β , Q(γ)ϵ

5Q(β) log 1/β

where the second inequality follows from taking ϵ = ϵ 2t and using the fact that Q(γ) Q γ

t , since the quantile function is nonincreasing. Furthermore, by the assumption Qavg(β) < Q(γ), we have

r l = 2depn,k (F) log 1/β

ϵ (r l) + 2Qavg(β) + 4depn,k (F) Q(β) log 1/β

5 + 2Q(γ) + 4

Thus, after t = Ω log R Q(γ) iterations, we are guaranteed that the length of the final interval

[lt, rt] is at most 28Q(γ).

Finally, consider lines 8 and 9 of Algorithm A.2. The algorithm returns the midpoint of the interval [lt+1, rt+1], which is Zt+1 in the final call of Algorithm A.3. By Chebyshev s inequality, we have

with probability at least 1 γ, and with probability at least 1 γ, none of the Yi s are truncated in the projection step in the final call of Algorithm A.3. Finally, with probability at least 1 γ, we have

Wt+1 = O t+1

= O depn,k (F) Q(γ)

The conclusion follows from a union bound over all events.

A.3.2 Boosting the error probability via median-of-means

Algorithm A.2 incurs a 1/ γ multiplicative factor in the non-private error, stemming from an application of Chebyshev s inequality to bound | Pm j=1 h(XSj)/m θ|. For specific families F, we may be able to provide stronger concentration bounds for Pm j=1 h(XSj)/m in inequality (A.35). Instead, we complement the result of Theorem A.1 by applying the following median-of-means procedure that allows for an improved dependence on the failure probability α with only a log (1/α) multiplicative blowup in the sample complexity: Lemma A.6. Let α (0, 1) and ϵ 0. Let A be an ϵ-differentially private algorithm. Consider a size n dataset Dn i.i.d D, for a distribution D with some unknown parameter θ such that with probability at least 0.75, we have

|A(Dn) θ| rn.

Split Dn into q := 8 log(1/α) equal independent chunks,6 and run A on each chunk to obtain ϵ-differentially private estimates { θn,i}i [d] of θ. Let θmed n be the median of these q estimates. Then θmed n is ϵ-differentially private, and with probability at least 1 α, we have θmed n θ rn/q. (A.36)

6Assume for simplicity that n is divisible by q and q is an odd integer.

Proof. The privacy of θmed n follows from parallel composition (Lemma A.5). For utility, we know from the hypothesis that for each i [q], with probability at least 3/4, the estimate θn,i satisfies

| θn,i θ| rn/q.

If more than half the estimates θn,i satisfy the above equation, then so does the median. Let Ti be the random variable that assumes the value 0 if θn,i satisfies the above equation and assumes the value 1 otherwise. Then, E[Ti] 1/4, and it suffices to show that

Pr (T1 + T2 + + Tq q/2) 1 α.

This follows from a standard Hoeffding inequality; as long as q 8 log(1/α),

Pr (T1 + T2 + + Tq > q/2) Pr

i (Ti E[Ti]) > q/4

e 2(1/4)2q α.

A.3.3 Application to the all-tuples family: proof of Proposition 1

For any i [n], there are exactly n 1 k 1 sets S In,k such that i S. Following the notation from the setting of Theorem A.1, we have fi = n 1 k 1 / n k = k

n for all i [n], so depn,k (Fall) = k

Moreover, for each S In,k, we have P (|h(XS) θ| y) 2 exp y2

2τ . Letting

we see that each Yi is within Q(γ) of θ with probability at least γ ( n k). A union bound implies that this

choice of Q(γ) is valid. For the concentration of the average 1

j [m] Yj, which is simply Un, we can use Lemma A.3) to see that

is a valid choice. We now verify the conditions in Theorem A.1:

depn,k (S) = k

n Q(γ)ϵ 10t Q(γ/t) log(t/γ) = ϵ 10t log(t/γ)

log 2n/γ log 2nt/γ

if and only if

n 10kt log(t/γ)

Recalling that t = C log (R/Q(γ)) , we see that this holds under the sample complexity assumption on n. Furthermore, we have Qavg(γ/t) Q(γ) if and only if s

which is also true by assumption. Therefore, with probability at least 1 6γ, we have

| θall θ| O

Var(Un) + k

Algorithm A.3 uses a constant failure probability of γ = 0.01, which assures a success probability of at least 0.75. This is further boosted by Wrapper 1. Now, an application of Lemma A.6 gives the stated result.

A.3.4 Application to the naive family

Definition A.3. Consider the following estimator: divide the n data points into n/k disjoint chunks, compute h(XS) on each of these chunks, and apply the Coin Press algorithm [10] to obtain a private estimate of the mean θ. We will call this naive estimator ˆθnaive.

The following proposition records the guarantee of the naive estimator ˆθnaive:

Proposition A.1. The naive estimator ˆθnaive satisfies

|ˆθnaive θ| O

with probability at least 0.9, as long as n = Ω k

ϵ log R τ . The estimate ˆθnaive is ϵ-differentially

private and the algorithm runs in time O n + n

k log R τ .

First, suppose the variance ζk is known. It is easy to see that depn,k (Fnaive) = k

n. By the assumption that h(XS) is τ-sub-Gaussian, we have

P(|h(XS) θ| y) 2 exp y2

Hence, with probability at least 1 γ/m, we have

2τ log(2m/γ), (A.37)

for each 1 j m, where we use the notation as in the setting of Theorem A.1. By a union bound,

we can take the quantile function Q(γ) = r

2τ log 2n kγ . Moreover, since the Yj s are independent,

the average 1

j [m] Yj is τ

m-sub-Gaussian with variance ζk

m . Therefore, we have

This yields a bound of Qavg(γ) = q

2kτ log(2/γ)

n . It remains to verify the conditions of Theorem A.1. We have

k n Q(γ)ϵ 10t Q(γ/t) log(t/γ) k

n ϵ 10t log(t/γ)

log(2n/kγ) log(2nt/kγ),

Qavg(γ/t) < Q(γ)

2kτ log(2t/γ)

2τ log(2n/kγ)

n k log(2t/γ)

log(2n/kγ),

which are both true by the sample size assumption, noting that t = C log(R/Q(γ)) . Therefore, with probability at least 1 6γ, we have

ˆθnaive θ O 1 γ Var(ˆθnaive) + k

2τ log(2n/kγ) log 1

Choosing γ to be an appropriate constant, we arrive at the deisred result.

A.3.5 Application to the subsampled family

Unlike the all-pairs family Sall, the subsampled Sss in Definition 2 is randomized. Define ˆθss = PM j=1 h(XSj)/M. Recall from our discussion before (cf. Theorem A.1) that we want each of the h(XSj) s, as well as ˆθss, to concentrate around θ, and we also want depn,k (Sss) to be small. As we show, the former concentration holds as in the all-tuples case, and the latter holds with high probability. Proposition A.2. Let Mn = Ω((n/k) log n). Then Wrapper 1, with f = ss, failure probability α, algorithm A = U-Stat Mean (Algorithm A.2), S(Ii) a set of Mnα i.i.d. subsets of size k picked with replacement from Ii, returns an estimate θss such that, with probability at least 1 α,

| θss θ| O p

Var(Unα) + O

ζk Mnα + k3/2 τ

as long as nα = Ω k

. Moreover, the estimator θss is ϵ-differentially private and runs

in time O log(1/α) k + log R

First, we need the following helper lemmas:

Lemma A.7. Define ˆθss = 1 M P

S Fss h(XS). We have Var h ˆθss i = 1 1 M Var(Un) + 1 M ζk.

Proof. Clearly, E[ˆθss] = θ. We compute both terms of the following decomposition of the variance of ˆθss separately; recall that X = {Xi}i [n]:

Var ˆθss = Var E h ˆθss|X i + E h Var ˆθss|X i .

Var E h ˆθss|X i = Var

j [M] h (XSi)

E h Var ˆθss X i = E

M E [Var (h(XS)| X)]

S In,k h(XS)2

S In,k h(XS)

M ζk + θ2 Var(Un) + θ2 = ζk Var(Un)

Adding the two equalities yields the result.

Lemma A.8. Let γ > 0, and let M = Ω( n

γ ). Then depn,k (Fss) 4k

n with probability at least 1 γ.

Proof. Let Zi be the number of sampled subsets of which i is an element. Observe that Zi Binom(M, k/n), with mean µ := Mk/n. By a Chernoff bound, for any δ > 0 and any i [n], we have

P (Zi (1 + δ)µ) eδ

By a union bound, we have

P depn,k (Fss) > 4k

= P max i Zi > 4µ n e3

which is at most γ by our choice of M.

Let Gγ denote the event that depn,k (Fss) 4k

n , which occurs with high probability by Lemma A.8.

Note also that conditioned on any family Fss of subsets of In,k, the run of Algorithm A.2 is ϵdifferentially private. Since the randomness of Fss is independent of the data, the algorithm (along with the private variance estimation) is still 2ϵ-differentially private.

and Qavg(γ) = 4

τk min (M, n) log 4n

We show that these are indeed the corresponding confidence bounds for YS, S Fss and ˆθss.

By a sub-Gaussian tail bound, for any S In,k, the probability that |h(XS) θ| > Q(γ) is at most 2 γ

4n k γ 2nk . By a union bound over all n k sets S, we then have |h(XS) θ| Q(γ) for all S In,k, with probability at least 1 γ

2 . Call this event Eγ.

Next, E[ˆθss|X1, . . . Xn] = Un. Moreover, for any c > 0,

P |ˆθss θ| c P |ˆθss Un| c/2 + P (|Un θ| c/2)

EX1,...,Xn P |ˆθss Un| c/2|X1, . . . , Xn + 2 exp nc2

where we used Lemma A.3 to bound the second term. For the first term in inequality (A.40), note that conditioned on the data X1, . . . , Xn, the h(XSj) s are independent draws from a uniform distribution over the n k values {h(XS)}S In,k, with mean Un, and the h(XSj) θ s are bounded by max S In,k |h(XS) θ| Q(γ). Therefore, each h(XSj) Un is sub-Gaussian(Q(γ)2), implying that

E h P |ˆθss Un| c/2|X1, . . . , Xn i 2E exp Mc2

|X1, . . . , Xn, Eγ

16τk log(4n/γ)

16τk log(4n/γ)

Combining inequalities (A.40) and (A.39), we have

P |ˆθss θ| c 2 exp Mc2

16τk log(4n/γ)

2 + 2 exp nc2

τk min (M, n) log 2n

This justifies the choice of Qavg(γ). We now verify the conditions in Theorem A.1. Conditioned on Gγ, we have

depn,k(S) = 4k

n Q(γ)ϵ 10t Q(γ/t) log(t/γ) = ϵ 10t log(t/γ)

log 4n/γ log 4nt/γ

if and only if

n 40kt log(t/γ)

Recalling that t = C log (R/Q(γ)) , we see that the above holds under the sample complexity assumption on n. Furthermore, Qavg(γ/t) Q(γ) iff

τk min (M, n) log 4nt

min(M, n) log(4nt/γ)2

k log n γ , the assumption on the sample complexity of n implies the above result.

Conditioned on Eγ, the projection steps are never invoked in Algorithm A.2 or A.3, so we have θss = ˆθss + Wt+1, where Wt+1 is a Laplace random variable with parameter 2

ϵ , where = depn,k(F)Q(γ)

ϵ (coming from the noise added to 1 M PM i=1 Yi in the (t + 1)th step of Algorithm A.2). Finally, using Lemma A.8, we have

| θss ˆθss| = |Wt+1| 2

γ = 2depn,k (S) Q(γ)

on the event Gγ. Combined with Lemmas A.6, A.7, inequality (A.40), and Theorem A.1, with probability at least 1 7γ, we obtain

Var(Un) + 1 γ

The success probability is 1 7γ instead of 1 6γ because we also require depn,k (Fss) 4k

n , which holds with probability 1 γ as in Lemma A.8. Algorithm A.3 uses a constant failure probability of γ = 0.01, which assures a success probability of at least 0.75. This is further boosted by Wrapper 1. Now, an application of Lemma A.6 gives the stated result.

A.4 Proofs of Lower Bounds

In this appendix, we provide the proofs of our two lower bound results.

A.4.1 Proof of Theorem 1

Lemma A.9 (Lemma 6.2 in [41]). Let P = {P1, P2, . . . } be a finite family of distributions over a domain X such that for any i = j, the total variation distance between Pi and Pj is at most α. Suppose there exists a positive integer n and an ϵ-differentially private algorithm B : X n [|P|] such that for every Pi P, we have

Pr X1,...,Xn Pi,B (B(X1, . . . , Xn) = i) 2/3.

Then, n = Ω log |P|

Lemma A.10 (Proposition 4.1 in [4]). The Bernoulli distribution Bern(p) is sub-Gaussian with optimal variance proxy τp, where

τp = τ1 p =

log 1 p 1 ,

for p (0, 1) \ {1/2}. In particular, if 0 < p < 1/10, then τp 1 2 log 1 2p .

Define D0 = Bern(1) and D1 = Bern(1 β), where β = c nϵ and c > 0 is small enough such that kβ < 1/10. The TV-distance between D0 and D1 is β. Since n = c βϵ, we also choose c small enough

such that Lemma A.9 is violated: for any ϵ-differentially private algorithm B : {0, 1}n {0, 1}, there exists an i {0, 1} such that

Pr (B(X1, . . . , Xn) = i) < 2/3, (A.41)

by Lemma A.9.

Consider now the task of ϵ-privately estimating the parameter

θ(D) := EX1,...,Xk D [h(X1, X2, . . . , Xk)] ,

where Xi D and

h(X1, X2, . . . , Xk) = 1 τ(1 β)k 1 (X1 = X2 = = Xk = 1) ,

for some distribution D. Suppose there exists an ϵ-differentially private algorithm A such that

EX1,...,Xn D,A h |A(X1, . . . , Xn) θ(D)| i 1

τ(1 β)k , (A.42)

for any D such that the distribution H of h(X1, . . . , Xk) is sub-Gaussian(1). If D = Bern(1) or Bern(1 β), Lemma A.10 shows that the distribution H of h(X1, . . . , Xk) is indeed sub-Gaussian(1). If inequality (A.42) holds, then by Markov s inequality,

|A(X1, . . . , Xn) θ (Di)| 3

for i {0, 1}.

Also, θ (D0) = 1/ τ1 β and θ (D1) = (1 β)k/ τ1 β. The difference between these means is

θ (D0) θ (D1) = 1 (1 β)k

Therefore, the following algorithm violates inequality (A.41): Run A on X1, . . . , Xn to obtain θ. Output 0 if θ is closer to θ(D0) than to θ(D1), and output 1 otherwise.

This implies inequality (A.42) does not hold, so we have a lower bound on the expected error. Theorem 1 follows from the calculation

2 log 1 2 (1 (1 β)k) = Θ kβ r

where we used 1 kβ < (1 β)k < 1 kβ/2 for kβ < 1/10.

A.4.2 Proof of Theorem 3

Consider two datasets D0 and D1 of size n each, differing in at most 1/ϵ data points. Suppose EA|A(D) Un(D)| < 1 10|Un(D0) Un(D1)| for D {D0, D1}. By Markov s inequality, we have

Pr |A(D) Un(D)| < 1

2|Un(D0) Un(D1)| 0.8,

for D {D0, D1}. Moreover, since A is ϵ-differentially private, we have

Pr |A(D1) Un(D0)| < |Un(D0) Un(D1)|

e Pr |A(D0) Un(D0)| < |Un(D0) Un(D1)|

By the triangle inequality, the event n |A(D1) Un(D0)| < |Un(D0) Un(D1)|

2 o is disjoint from the

event n |A(D1) Un(D1)| < |Un(D0) Un(D1)|

2 o . The sum of the probabilities of these two events is

at least 0.8 + 0.8e 1 > 1, a contradiction. Therefore, A has expected error Ω(|Un(D0) Un(D1)|) on at least one of D0 or D1. Next, we will define appropriate choices of D0 and D1.

For simplicity, assume 1/ϵ is an integer. Define bn = k +k1/(2k 2)n1 1/(2k 2) 1

ϵ . The assumed range of ϵ implies that bn 2k/ϵ. Let h(x1, . . . , xk) = 1(x1 = = xk). We define D0 such that

xi = 1, i bn, i, i > bn.

We define D1 = {y1, . . . , yn} such that yi = xi for all i {bn + 1, . . . , bn + 1/ϵ} and yi = 1 for bn < i bn + 1

Un(D1) Un(D0) =

bn+1/ϵ k n k

Furhtermore, for k

ϵ bn, using the fact that (1 + x)r 1 1 rx for x [ 1, 1/r), we have

bn+1/ϵ k = 1

bn i bn + 1/ϵ i 1 bn bn + 1/ϵ

= 1 1 1/ϵ bn + 1/ϵ

1 + k/ϵ bn+1/ϵ = k/ϵ bn + (k + 1)/ϵ,

implying that

Un(D1) Un(D0) k/ϵ bn + (k + 1)/ϵ

For i with xi = 1 in D0, we have ˆh D0(i) = ( bn 1 k 1 ) ( n 1 k 1) Un(D0). For i with xi = 1 in D1, we have

ˆh D1(i) = ( bn+1/ϵ 1 k 1 ) ( n 1 k 1) Un(D1). Therefore, we have |ˆh D(i) Un(D)| ˆh D(i) ξ for all i and

D {D0, D1}, where

bn+1/ϵ 1 k 1

by our choice of bn. Moreover, we have

ξ bn + 1/ϵ k

k 1 k1/(2k 2)n1 1/(2k 2)

k n. (A.44)

By inequality (A.43) and the definition of ξ, we see that

Un(D1) Un(D0) k/ϵ bn + 2k/ϵ bn + 1/ϵ

n ξ k 3nϵξ,

where the second inequality follows from the assumption that k/ϵ bn. Using the lower bound on ξ as in inequality (A.44), we obtain the desired result.

A.5 Proof of Theorems 2 and 4

We first prove Theorem 4 with S equal to any subsampled family that satisfies the inequalities (17). In particular, the following lemma guarantees that the required bounds hold with high probability for a subsampled family chosen uniformly at random from In,k:

Lemma A.11. Let γ > 0, and let M = Ω n2 k2 log n γ . Let S be a collection of M i.i.d sets sampled uniformly from In,k. For each i [n], let Si be the number of sets in S containing i, and define Mi = |Si|. For each i = j [n], let Sij be the number of sets in S containing i and j, and define Mij = |Sij|. With probability at least 1 γ, for all distinct indices i and j, we have k 2n Mi

n , k 2n Mij

Proof. Note that Mi Binom(M, k/n). By a Chernoff bound, for any δ > 0 and any i [n], we have

e (0.5)2(k M/n)/3 e Ω( n

which is much smaller than γ 2n. Call this event Ei, and for the remaining argument, assume Ei holds for all i (which, by a union bound, holds with probability at least 1 γ

2 ). This gives the first inequality.

For any distinct i, j [n], conditioned on the value of Mi, we have Mij Binom(mi, (k 1)/(n 1)). By a Chernoff bound, for any δ > 0 and i, j [n] with j = i, we have

Pr Mij (k 1)Mi

e (0.5)2((k 1)Mi/(n 1))

48n2 γ 2n2 ,

using the fact that Mi k M

2n and our assumption on M. The second inequality then follows from a union bound over all pairs of indices.

A.5.1 Proof of Theorem 4

Consider two adjacent datasets X = (X1, X2, . . . , Xn) and X = (X 1, X 2, . . . , X n) differing only in the index i , that is, X i = Xi for all i = i . Throughout the proof, we will use the superscript prime to denote quantities related to X .

Let B := (Bad(X) Bad(X )) \ {i } and G := (Good(X) Good(X )) \ {i }. Then

m An A n = X

S S (g(XS) g(X S))

S B = i / S

(g(XS) g(X S))

(g(XS) g(X S))

i S (g(XS) g(X S))

We bound each of the three terms separately. The term T2 is equal to 0: all indices i S have weight 1, and i / S, so g(XS) = h(XS) = h(X S) = g(X S). We prove some preliminary lemmas before bounding the first and last terms. Recall the definitions of L and wt in equations (12) and (14), respectively. Lemma A.12. We have:

(i) |An A n| 2k C

n and |L L | 1.

(ii) For all i = i , we have |ˆh (i) A n| |ˆh(i) An| 4k C

|wt(i) wt (i)| ϵ, (A.47)

and for S, such that i S,

|wt(S) wt (S)| ϵ. (A.48)

Proof. For (i), note that

|An A n| = 1

S Si (h(XS) h(X S))

where the last inequality comes from Lemma A.11. Similarly, for any i = i , we have

ˆh(i) ˆh (i) = 1

S Sij (h(XS) h(X S))

Figure A.1: Weighting scheme in Eq (14)

Therefore, if an index i = i is in Good(X), using inequalities (A.49) and (A.50), we have ˆh (i) A n ˆh (i) ˆh(i) + ˆh(i) An + |An A n| (A.51)

n + ξ + 6k CL

n = ξ + k C (4 + 6L)

which leaves at most 1 + L potential indices i for which |ˆh (i) A n| > ξ + 6k C(1+L)

n : the indices in Bad(X) and the index i . Therefore, L L + 1. Similarly, L L + 1.

For (ii), note that from inequalities (A.49) and (A.50), we have |ˆh (i) A n| |ˆh(i) An| 4k C

n for i = i . Recalling the definition (14), this implies that the difference between the weights on an index i can never be greater than ϵ.

Finally, note that the weight of a subset S, wt(S) = mini S wt(i). Now, by inequality (A.47), each wt(i) differs by ϵ. Say a = arg mini S wt(i). In order to make the difference between wt(S) and wt(S ) large we will set wt (a) = wt(a) + ϵ and take some other b and set wt (b) = wt(b) ϵ such that wt (b) wt (a). But then, wt(a) wt(b) wt(a) + 2ϵ. This completes the proof of inequality (A.48).

Next, we show that the weighted Hájek variants ˆg(i) are close to the empirical mean An and have low sensitivity.

Lemma A.13. For all indices i, we have |ˆg(i) An| ξ + 9k CL

nϵ . Moreover, if i = i , we have

|(ˆg(i) An) (ˆg (i) A n)| ξ + k C(14 + 6L)

n2 (1 + 2L).

Proof. If wt(i) = 0, then g(XS) = An for all S i and ˆg(i) = An. For clarity, we add a picture of this weighting scheme here: Otherwise, we write

ˆg(i) An = 1

S Si (h(XS) An)wt(S)

S Si (h(XS) An) wt(i) + 1

S Si (h(XS) An) (wt(S) wt(i))

= (ˆh(i) An)wt(i) + 1

S Si (h(XS) An) (wt(S) wt(i)) . (A.52)

From equation (14) and the assumption that the weight of index i is strictly positive, the magnitude of the first term in equation (A.52) is bounded by ξ + 6k CL

nϵ . For the second term, note

that wt(S) = wt(i) unless an index j with a lower weight than i exists. Note that such an index j is necessarily in Bad(X). Therefore, the absolute value of the second term is bounded above by 1 mi P

j Bad(X),j =i C 2k CL

n . This proves the first part of the lemma.

To bound the sensitivity of ˆg(i), by the triangle inequality, we have (ˆh(i) An)wt(i) (ˆh (i) A n)wt (i)

ˆh(i) An |wt(i) wt (i)| + (ˆh(i) An) (ˆh (i) A n) wt (i)

ξ + k C (4 + 6L)

= ξ + k C (10 + 6L)

where the argument for the second inequality is as follows: To bound the first term, note that when wt(i) = wt (i), it is zero. If wt(i) > 0, then |ˆh(i) An| ξ + 6k CL

nϵ . Now, if |ˆh(i) An| > ξ + 6k CL

n , then wt(i) = 0, and by inequality (A.46) of Lemma A.12, we see that |ˆh(i) A n| > ξ + 6k CL

nϵ , so wt (i) will also be zero. The second term is bounded directly by Lemma A.12. Overall, we arrive at a bound on the sensitivity of the first term in equation (A.52).

For the sensitivity of the second term of equation (A.52), note that if i has minimum weight among the indices in S, then wt(S) = wt(i). Otherwise, some index j S has strictly lower weight than i. Such an index j is necessarily in Bad(X) Bad(X ) because it has weight less than 1. If S also does not contain the index i , then h(XS) = h(X S), so by Lemma A.12, we have

|(h(X S) A n) (h(XS) An)| |An A n| 2k C

|(wt (S) wt (i)) (wt(S) wt(i))| 2ϵ, and letting TS := ((h(XS) An) (wt(S) wt(i)) (h(X S) A n) (wt (S) wt (i))) ,

we have |TS| 2k C

Moreover, there are at most Mi,i sets S containing both i and i , and for each such set S, the change in (h(XS) An) (wt(S) wt(i)) is at most 2C, since the weights lie in [0, 1] and |h(XS) An| C. Combining these bounds, we obtain

S Si ((h(XS) An) (wt(S) wt(i)) (h(X S) A n) (wt (S) wt (i))) | {z } TS

S Si\Sii |TS|1(|S (Bad(X) Bad(X ))| > 0) + X

S Si\Sii 1(|S (Bad(X) Bad(X ))| > 0) + 1

j Bad(X) Bad(X )

S Sij 1 + 2C Mi,i

(|Bad(X)| + |Bad(X )|) sup j Bad(X) Bad(X )

Mi + 2C Mi,i

(1 + 2L) + 4k C

where the first inequality uses the fact that the weights are all equal if S (Bad(X) Bad(X )) = , and the last inequality uses Lemma A.11. Combining inequalities (A.53) and (A.54) into equation (A.52) yields the result.

To bound the term T1 in (A.45), we decompose it as

(g(XS) g(X S))

min(k,|B|) X

S S |S B|=a i / S

(a 1) (g(XS) g(X S)) . (A.55)

The first term sums over all subsets that contain some element in B. However, this leads to overcounting every subset with a elements in common with B exactly a 1 times. The second term corrects for this over-counting, akin to an inclusion-exclusion argument. The following lemmas bound each of the two terms: Lemma A.14. For all i B, we have

S Si,i / S (g(XS) g(X S))

Proof. We have X

(g(XS) g(X S)) = Mi (ˆg(i) ˆg(i)) X

S Si,i (g(XS) g(X S)) . (A.56)

By Lemmas A.12 and A.13, we have

|ˆg(i) ˆg(i)| ξ + k C(14 + 6L)

n2 (1 + 2L).

Moreover, the second term in equation (A.56) is clearly upper-bounded by

2CMi,i 4k C

Summing these two bounds yields the result.

For the second term of equation (A.55), we use the following lemma: Lemma A.15. We have min(k,|B|) X

S S |S B|=a i / S

(a 1) |g(XS) g(X S)| 36k2L2

min(k, 2L)M.

Moreover, if S = In,k, with m = n k , we have the stronger inequality

min(k,|B|) X

S S |S B|=a i / S

(a 1) |g(XS) g(X S)| 9k2L2

Proof. For any S not containing i , we have |g(XS) g(X S)| = |(h(XS) An)wt(S) + An (h(X S) A n)wt (S) A n|

|(h(XS) An)(wt(S) wt (S))| + |(h(XS) h(X S))wt (S)| + |An A n|

using Lemma A.12 and the fact that the second term is zero. Moreover, we have

min(k,|B|) X

S S |S B|=a i / S

S S |S B|=a

min(k, |B|) = X

S S |S B| 2

min(k, |B|)

S Sij min(k, |B|) 9k2

n2 min(k, |B|)|B|2m.

The last inequality follows from Lemma A.11, which implies that Mij

M = O k2 n2 .

In the case when S = In,k, we have

min(k,|B|) X

S S |S B|=a i / S

min(k,|B|) X

a=2 (a 1) |B| a

n 1 + n |B| k

n + 1 k2|B|2

where the first equality used the identities Pk a=0 n a m k a = n+m k and Pk a=0 a n a m k a = nk m+n n+m k , and the third inequality used the fact that (1 x)k 1 1+kx for all x [0, 1]. The statement in the lemma follows because |B| 2L + 1 3L.

Combining the results of Lemma A.14 and A.15 yields Lemma A.16. Lemma A.16. We have

where γ = 4 min(k, 2L). If S = In,k, then the bound also holds for γ = 1.

It remains to bound the third term, T3, of equation (A.45), which we do in the following lemma: Lemma A.17. We have

2ξ + k C(11 + 18L)

Proof. Using Lemmas A.12 and A.13, we have

1 Mi |T3| = |ˆg(i ) ˆg (i )| |ˆg(i ) An| + |ˆg (i ) A n| + |An A n|

+ ξ + 9k CL

2ξ + k C(11 + 18L)

The lemma follows after using the fact that Mi 2k

Combining the bounds on T1 and T3 from Lemmas A.16 and A.17 in equation (A.45), the local sensitivity of An at X is then bounded as

LS An (X) = O k

(1 + ϵL) + k2CL2 min(k, L)

Let g(ξ, L, n) denote the upper bound, where to simplify the following argument, we assume the constant prefactor is 1, i.e.,

g(ξ, L, n) := k

(1 + ϵL) + k2CL2 min(k, L)

Note that g is strictly increasing in L. Also define

S(X) = max ℓ Z 0 e ϵℓg(ξ, LX + ℓ, n). (A.57)

Lemma A.18. The function S(X) is an ϵ-smooth upper bound on LS An(X).

ξ + k C(1/ϵ + L)

(1 + ϵL) + k2C(1/ϵ + L)2 min(k, 1/ϵ + L)

Proof of Lemma A.18. Clearly, we have S(X) g(ξ, LX, n) LS An(X), and for any two adjacent X and X , we have S(X ) = max ℓ Z 0 e ϵℓg(ξ, LX + ℓ, n) max ℓ Z 0 e ϵℓg(ξ, LX + ℓ+ 1, n)

= max ℓ Z>0 e ϵ(ℓ 1)g(ξ, LX + ℓ, n) eϵ max ℓ Z 0 e ϵℓg(ξ, LX + ℓ, n) = eϵS(X).

This shows that S is indeed a ϵ-smooth upper bound on the local sensitivity. As for the upper bound on S, for any ℓ 0, we have

e ϵℓg(ξ, LX + ℓ, n) = k

ξe ϵℓ/2 + k C(ℓe ϵℓ/2 + Le ϵℓ/2)

e ϵℓ/2 + ϵ(ℓe ϵℓ/2 + Le ϵℓ/2)

n2 C(ℓe ϵℓ/3 + Le ϵℓ/3)2 min(ke ϵℓ/3, ℓe ϵℓ/3 + Le ϵℓ/3) ϵ + k

ξ + k C(1/ϵ + L)

(1 + ϵ(1 + L)) + k2

ϵ + L 2 min k, 2

ϵ + L ϵ + k

where we used in multiple places the inequalities e ϵℓ 1 and ℓe ℓ/c c/e, for any c > 0.

By Lemma A.18, it is clear that the term S(X) added to An in Algorithm 1 is the smoothed sensitivity defined in equation (A.57).

Therefore, An + S(X)

ϵ Z, where Z is sampled from the distribution with density h(z) 1/(1+|z|4), is O(ϵ)-differentially private, by Lemma 2. Moreover, if S = In,k, the above bound on the smooth sensitivity holds without the min(k, 1/ϵ + L) term, due to Lemma A.15.

Utility. By Chebyshev s inequality, we have

Var(An) = 1 γ

with probability at least 1 γ. Moreover, with probability at least 1 γ, each of the Hájek projections is within ξ of An. This implies that every index i has weight 1, which further implies that g(XS) = h(XS) for all S S, and consequently, An = An. Also, L = 1 for such an X. Finally, with probability at least 1 γ, we have Z 3 γ . Combining these inequalities and using Lemma A.18, we have

|A(X) θ| An An + |An θ| + |S(X)/ϵ Z|

with probability at least 1 4γ, recalling that ϵ = O(1) when simplifying the expression. Algorithm 1 uses a constant failure probability of γ = 0.01, which ensures a success probability of at least 0.75. This is further boosted by Wrapper 1. Now, an application of Lemma A.6 gives the stated result.

A.5.2 Proof of Theorem 2

The proof of this theorem proceeds nearly identically to that of Theorem 4 with some exceptions. If S = In,k, the smoothed sensitivity bound has no min(k, 1/ϵ) term, owing to Lemmas A.15 and A.16, which gives the bound

|A(X) θ| 1 γ O p

Var(An) + kξ

Furthermore, since S = In,k, we have An = Un with probability at least 1 3γ. Algorithm 1 uses a constant failure probability of γ = 0.01. This ensures a success probability of at least 0.75, which is further boosted by Wrapper 1. An application of Lemma A.6 gives the stated result.

A.5.3 Concentration of local Hájek projections

Proof of Lemma 3. We first show that if Y1, Y2, . . . , Yt are random variables such that each Yj is τj-sub-Gaussian, the sum Y1 + + Yt is τ1 + τ2 + + τt 2-sub-Gaussian.

Define pj =

Pt i=1 τi τj . Clearly, we have Pt j=1 1/pj = 1. By Hölder s inequality, for any λ > 0, we have

i=1 exp(λYi)

i=1 E [exp(λYi)pi]1/pi

exp λ2p2 i τi 2

i=1 exp λ2piτi

= exp λ2( τ1 + τ2 + + τt)2

Now, h(XS) is sub-Gaussian(τ) for all S In,k. Since ˆh(i) is the average of t = n 1 k 1 such quantities, it is clear from the previous claim that ˆh(i) sub-Gaussian with parameter

1 t2 ( τ + τ + + τ | {z } t terms )2 = τ.

Proof of Lemma 4. Define σ2 i := Var (h(XSi)|Xi = xi). First, conditioned on Xi, the projection ˆh(i) can be viewed as a U-statistic on the other n 1 data points. First, for this lemma, we use S = In,k, and

ˆh(i) = b E [h(XS)|Xi] =

S Si h(XS) n 1 k 1

Since h is bounded, the random quantity ˆh(i) E [h(XS)|Xi] [ 2C, 2C] satisfies the Bernstein moment condition and also the Bernstein tail inequality (cf. Proposition 2.3 in [60]). By Bernstein s inequality for U-statistics (see inequality (A.29)), for all t > 0, we have

P ˆh(i) E [h(XS)|Xi] t Xi 2 exp

j n 1 k 1 k t2

2σ2 i + 4Ct/3

which is at most β/n as long as t = σi q

A.6 Applications

A.6.1 Uniformity testing

To motivate the test, consider the expectation θ := E[h(Xi, Xj)] and the variance var(Un):

Lemma A.19. We have E[h(X1, X2)] = 1

m2 . In particular, the means under the two hypothesis classes differ by at least δ2

Proof. We have

E[h(X1, X2)] =

1 + 2ai + a2 i m2 = 1

Under approximate uniformity, this is at most δ2

2m; and under the alternative hypothesis, this quantity is at least δ2

Lemma A.20. The variance of Un is

var(Un) = 2 n(n 1)

i<j pipj(pi pj)2 +

Proof. The conditional variances ζ1 and ζ2 can be written as

ζ1 = cov(h(X1, X2), h(X1, X3)) = E[1[X1 = X2]1[X1 = X3]] E[1[X1 = X2]]E[1[X1 = X3]]

i<j (p3 i pj + pip3 j 2p2 i p2 j) = X

i<j pipj(pi pj)2 0, and

ζ2 = cov(h(X1, X2), h(X1, X2)) =

Also from equation (3), we have

var(Un) = n 2

1 (2(n 2)ζ1 + ζ2) .

Combining the above bounds with equation (3) shows the result.

Algorithm A.4 Private Uniformity Test n, m, X = {Xi}i [n] , ϵ

1: C 1, γ 0.01 2: ξ 6/m + 8 log(4n/γ)/n 3: S {(i, j) : 1 i < j n} 4: θ Private Mean Hájek(n, 2, {1(Xi = Xj), (i, j) S}, ϵ, α, C, ξ, S)

5: if θ 1+3δ2/4

m then 6: DEC 1 {Reject approximate uniformity} 7: else 8: DEC 0 {Accept approximate uniformity} 9: end if 10: return DEC

Proof of Theorem 5: Recall that θ denotes the private test statistic, which is thresholded at the value 1+3δ2/4

m to determine the output of the hypothesis test. We claim that the validity of the test is established if we can show that

P | θ E[Un]| δ2

1 O(γ) (A.60)

under both hypotheses. Indeed, it would then hold that:

(i) Under approximate uniformity,

4m = 1 + 3δ2/4

(ii) Under the alternative hypothesis,

4m = 1 + 3δ2/4

To establish inequality (A.60), we further write

P | θ E[Un]| > δ2

P | θ Un| > δ2

+ P |Un E[Un]| > δ2

The second term can be controlled using an argument in Diakonikolas et al. [21], which further develops the variance bound in Lemma A.20 for the two hypothesis classes and then uses Chebyshev s inequality. It is shown that the second probability term in inequality (A.61) can be bounded by α if n = Ω m

To bound the first probability term in inequality (A.61), we study the concentration parameter ξ for the local Hájek projection ˆh(i) = 1 n 1 P

j =i h(Xi, Xj). We have the following result:

Lemma A.21. If ξ = 6

m + 8 log(4n/γ)

γ , then |ˆh(i) Un| ξ for all i, with probability at least 1 γ.

Proof. By the triangle inequality, we have

|ˆh(i) Un| |ˆh(i) E[h(X1, X2)|X1]| + |E[h(X1, X2)|X1] θ| + |Un θ|. (A.62)

We will provide a bound on each of these three terms. Note that

h(X1, X2)|X1 Bern(p X1),

which has variance

σ2 i = var(h(Xi, Xj)|Xi) = p Xi(1 p Xi) 2

Hence, with probability at least 1 γ

2 , the first term in inequality (A.62) can be bounded as

|ˆh(i) E[h(X1, X2)|X1]| 2

2 mn log 4n

where we have used the AM-GM inequality. The second term in inequality (A.62) can be bounded as 1 + a Xi

Finally, by Chebyshev s inequality, the third term can be bounded as |Un θ| q

γ with probability at least 1 γ

2 . It remains to find the variance of Un.

By Lemma A.20, if |ai| 1 for all i, we have

var(Un) = 2 n(n 1)

(1 + ai)(1 + aj)(ai aj)2

(1 + ai)(1 + aj)(ai aj)2

8 m2n + 8 mn2 . (A.63)

Combining the three bounds into inequality (A.62), with probability at least 1 γ, we have

|ˆh(i) Un| 2

m + 7 log(4n/γ)

m n + 4/ γ mn

m + 8 log(4n/γ)

where in the second inequality, we used the AM-GM inequality and the assumption n 16

γ . The statement of the lemma follows after discarding lower-order terms.

By Lemma A.21, with probability at least 1 γ, the weights of all projections in Algorithm 1 are equal to 1 and Un = An. Then | θ Un| is simply the magnitude of the noise added in the final step of Algorithm 1 (which uses a constant γ = 0.01), which (cf. the proof of Theorem 2) takes the form

nϵ + 1 n2ϵ2 + 1 n3ϵ3

n2ϵ + 1 mnϵ + 1 n2ϵ2 + 1 n3ϵ3

with probability at least 0.75. This is bounded by δ2

8m as long as

δϵ1/2 log m1/2

Wrapper 1 further boosts this constant probability of success. Now, an application of Lemma A.6 gives the stated result.

A.6.2 Sparse graph statistics

Algorithm A.5 Private Network Edge(n, m, {Aij}1 i<j n, ϵ)

1: C 1, γ 0.01 2: Un 1 ( n 2) P

3: ν2 Un + 1

nϵZ, where Z is a standard Laplace random variable 4: if ν < 0 then 5: return 6: end if

8: S {(i, j) : 1 i < j n} 9: θ Private Mean Hájek(n, 2, {1(Aij = 1), (i, j) S}, ϵ, C, ξ, S) 10: return θ

A.6.3 Proof of Theorem 6

The privacy of the algorithm follows by composing (see Lemma A.4) the ϵ-privacy of ν and the O(ϵ)-privacy of θ conditioned on ν. It remains to show the utility of the algorithm.

The kernel h(x, y) = 1( x y rn) is degenerate, since P( Xi Xj rn|Xi) does not depend on Xi. So var(E[h(Xi, Xj)|Xi = x]) = 0. We have var[h(Xi, Xj)] πr2 n, so the non-private error is O(rn/n) (Eq 3). Using Proposition 2.3 from Arcones and Gine [5], there exist universal constants c1, c2, and c3 such that

2 (Un r2 n/4) t c1 exp c2t c3rn + (t/n)1/3

Setting t = nr2 n/16, we have, for large enough n, since rn = Ω(n 1/2),

P Un r2 n/4 r2 n/8 c1 exp

c2nr2 n c3rn + r2/3 n

c1 exp( c nr4/3 n ) = O exp( n1/3) .

Therefore, with probability 1 o(1), we have

Un [r2 n/8, 3r2 n/8]

ν2 := Un + Z/nϵ [r2 n/9, r2 n/2]. (A.64)

In particular, the probability that Un + Z

nϵ computed in step 3 of Algorithm A.5 is then positive.

From Lemma 4, we have

ˆh(i) E [h(X1, X2)|X1 = x1] 4σi

with probability at least 1 γ, where σ2 i = Var (h(X1, X2)|X1 = x1) πr2 n.

Thus, from Eq A.64, we see that:

σ2 i πr2 n 9πν2

Moreover, since g is degenerate, we have E[h(Xi, Xj)|Xi] = θn. Using the fact that |Un θn| q

1 γ with probability at least 1 O(γ), we have

max i |ˆh(i) Un| max i |ˆh(i) θn| + |θn Un|

with probability 1 O(γ). Hence, using Theorem 2, and noting that ξ = O rn n , we can ensure

that the estimate θ output by Algorithm 1 satisfies

| θ θ| = O p

Var(Unα) + kξ

nα + ξ nαϵ + 1 n2αϵ2 + 1 n3αϵ3

nα + rn n3/2 α ϵ + 1 n2αϵ + 1 n2αϵ2 + 1 n3αϵ3

with probability at least 0.75. Wrapper 1 further boosts this probability of success from constant to 1 α. Now, an application of Lemma A.6 gives the stated result.

A.6.4 Other applications

In this section, we provide more applications of U statistics for important hypothesis testing problems.

1. Goodness-of-fit testing: The Cramer-Von Mises statistic for testing the hypothesis that the cumulative distribution function of a random variable is equal to a function F0 is given by

Z (1{Xi x} F0(x)) (1{Xj x} F0(x)) d F0(x).

Under the null H0 : X F0, the distribution of the statistic is degenerate [58]. Thus, our techniques from Section 4.2 provide a method for private goodness-of-fit testing based on the Cramer-Von Mises statistic. Private goodness-of-fit testing has so far mostly been studied in the setting of discrete data [28, 1, 2]. For continuous distributions, we are only aware of work that analyzes the local DP framework [23, 44, 12], which is therefore not directly comparable to our proposed approach.

2. Pearson s chi-squared test: The chi-squared goodness of fit test is widely used to test if a discrete random variable comes from a given distribution. The corresponding statistic (which can be written as a U statistic plus a smaller order term) is degenerate [20].

3. Symmetry testing: Testing the symmetry of the underlying distribution of i.i.d. X1, . . . , Xn is often used in paired tests. [26] use the test statistic P

i,j(g(Xi Xj) g(Xi + Xj))/n2 (which is a U statistic plus a lower-order term), where g is the characteristic function of some distribution symmetric around 0. When the distribution of Xi is symmetric, this is degenerate.

Neur IPS Paper Checklist

The checklist is designed to encourage best practices for responsible machine learning research, addressing issues of reproducibility, transparency, research ethics, and societal impact. Do not remove the checklist: The papers not including the checklist will be desk rejected. The checklist should follow the references and precede the (optional) supplemental material. The checklist does NOT count towards the page limit.

Please read the checklist guidelines carefully for information on how to answer these questions. For each question in the checklist:

You should answer [Yes] , [No] , or [NA] .

[NA] means either that the question is Not Applicable for that particular paper or the relevant information is Not Available.

Please provide a short (1 2 sentence) justification right after your answer (even for NA).

The checklist answers are an integral part of your paper submission. They are visible to the reviewers, area chairs, senior area chairs, and ethics reviewers. You will be asked to also include it (after eventual revisions) with the final version of your paper, and its final version will be published with the paper.

The reviewers of your paper will be asked to use the checklist as one of the factors in their evaluation. While "[Yes] " is generally preferable to "[No] ", it is perfectly acceptable to answer "[No] " provided a proper justification is given (e.g., "error bars are not reported because it would be too computationally expensive" or "we were unable to find the license for the dataset we used"). In general, answering "[No] " or "[NA] " is not grounds for rejection. While the questions are phrased in a binary way, we acknowledge that the true answer is often more nuanced, so please just use your best judgment and write a justification to elaborate. All supporting evidence can appear either in the main paper or the supplemental material, provided in appendix. If you answer [Yes] to a question, in the justification please point to the section(s) where related material for the question can be found.

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope?

Answer: [Yes]

Justification: We have assured that the abstract and introduction accurately reflect our contributions.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We have acknowledged any limitations of our work in the paper and in the discussion section.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We have provided proofs for all theoretical results in the paper. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: Yes, we followed the Neur IPS Code of Ethics.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA] Justification: This paper presents theoretical work whose goal is to advance the field of differential privacy and statistics. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification:

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.