On Differentially Private U Statistics

Kamalika Chaudhuri (University of California San Diego) kamalika@ucsd.edu
Po-Ling Loh (University of Cambridge) pll28@cam.ac.uk
Shourya Pandey (University of Texas at Austin) shouryap@utexas.edu
Purnamrita Sarkar (University of Texas at Austin) purna.sarkar@austin.utexas.edu

We consider the problem of privately estimating a parameter E[h(X_1, . . . , X_k)], where X_1, X_2, . . . , X_k are i.i.d. data from some distribution and h is a permutation-invariant function. Without privacy constraints, the standard estimators for this task are U-statistics, which commonly arise in a wide range of problems, including nonparametric signed rank tests, symmetry testing, uniformity testing, and subgraph counts in random networks, and are the unique minimum variance unbiased estimators under mild conditions. Despite the recent outpouring of interest in private mean estimation, privatizing U-statistics has received little attention. While existing private mean estimation algorithms can be applied in a black-box manner to obtain confidence intervals, we show that they can lead to suboptimal private error, e.g., constant-factor inflation in the leading term, or even Θ(1/n) rather than O(1/n^2) in degenerate settings. To remedy this, we propose a new thresholding-based approach that reweights different subsets of the data using local Hájek projections. This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.

1 Introduction

A fundamental task in statistical inference is to estimate a parameter of the form E[h(X_1, . . . , X_k)], where h is a possibly vector-valued function and {X_i}_{i=1}^n are i.i.d. draws from an unknown distribution. U-statistics are a well-established class of estimators for such parameters and can be expressed as averages of functions of the form h(X_1, . . . , X_k). U-statistics arise in many areas of statistics and machine learning, encompassing diverse estimators such as the sample mean and variance, hypothesis tests such as the Mann-Whitney, Wilcoxon signed rank, and Kendall's tau tests, symmetry and uniformity testing [26, 21], goodness-of-fit tests [58], counts of combinatorial objects such as the number of subgraphs in a random graph [30], ranking and clustering [19, 18], and subsampling [52]. U-statistics are a natural generalization of the sample mean. However, little work has been done on U-statistics under differential privacy, in contrast to the sizable body of existing work on private mean estimation [42, 39, 14, 40, 10, 41, 24, 11, 33, 15]. Ghazi et al. [29] and Bell et al. [8] consider U-statistics in the setting of local differential privacy [43], while we are interested in privacy guarantees under the central model. Moreover, existing work on private U-statistics focuses on discrete data and relies on simple privacy mechanisms (such as the Laplace mechanism [25]), which are usually optimal in those settings. Many U-statistics converge to a limiting Gaussian distribution with variance O(k^2/n) when suitably scaled; this is commonly used in hypothesis testing [35, 5, 37]. However, there are also U-statistics, often arising in a variety of hypothesis tests [26, 58, 21], that are degenerate at the null hypothesis (in which case the U-statistic converges to a sum of centered chi-squared distributions [31]).
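For concreteness, a few standard kernels of this type are sketched below. This is a minimal illustration using textbook kernels for the sample mean, the sample variance, the Wilcoxon signed rank statistic, and Kendall's tau; the function names are ours and nothing here is specific to this paper's algorithms.

```python
# Illustrative U-statistic kernels (standard textbook examples).

def kernel_mean(x1):
    # k = 1: the sample mean is the U-statistic with the identity kernel.
    return x1

def kernel_variance(x1, x2):
    # k = 2: E[(X1 - X2)^2 / 2] is the population variance, so the
    # corresponding U-statistic is the unbiased sample variance.
    return (x1 - x2) ** 2 / 2

def kernel_signed_rank(x1, x2):
    # k = 2: kernel behind the Wilcoxon signed rank test,
    # estimating P(X1 + X2 > 0).
    return float(x1 + x2 > 0)

def kernel_kendall(p1, p2):
    # k = 2 on pairs p = (x, y): kernel behind Kendall's tau,
    # +1 for concordant pairs and -1 for discordant pairs.
    (x1, y1), (x2, y2) = p1, p2
    s = (x1 - x2) * (y1 - y2)
    return float(s > 0) - float(s < 0)
```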
Another interesting U-statistic arises in subgraph counts in random geometric graphs [30]. When the probability of an edge being present tends to zero with n, creating a private estimator by simply adding Laplace noise with a suitable scale may not be effective.

Our contributions

1. We present a new algorithm for private mean estimation that achieves nearly optimal private and non-private errors for non-degenerate U-statistics with sub-Gaussian kernels.
2. We provide a lower bound for privately estimating non-degenerate sub-Gaussian kernels, which nearly matches the upper bound of our algorithm. We also derive a lower bound for degenerate kernels and provide evidence that the private error achieved by our algorithm in the degenerate case is nearly optimal. A summary of the utility guarantees of our algorithm and adaptations of existing private mean estimation methods is presented in Table 1.
3. The computational complexity of our first algorithm scales as O(\binom{n}{k}). We generalize this algorithm and develop an estimator based on subsampled data, providing theoretical guarantees for a more efficient version with O(n^2) computational complexity.

The paper is organized as follows. Section 2 reviews the background on U-statistics and key concepts in differential privacy. Section 3 introduces an initial set of estimators based on the CoinPress algorithm for private mean estimation [10]. Section 4 presents our main algorithm, which leverages what we term local Hájek projections. Section 5 discusses applications of our algorithm to private uniformity testing and density estimation in sparse geometric graphs. Section 6 concludes the paper.

2 Background and Problem Setup

Let n and k be positive integers with k ≤ n. Let D be an unknown probability distribution over a set X, and let h : X^k → R be a known function that is symmetric in its arguments (that is, h(x_1, x_2, . . . , x_k) = h(x_{σ(1)}, x_{σ(2)}, . . . , x_{σ(k)}) for any permutation σ; we do not assume that the distribution H itself is symmetric about its mean). Let H be the distribution of h(X_1, X_2, . . . , X_k), where X_1, X_2, . . . , X_k ∼ D are i.i.d. random variables. We are interested in providing an ϵ-differentially private confidence interval for the estimable parameter [32] θ = E[h(X_1, X_2, . . . , X_k)], which is the mean of the distribution H, given access to n i.i.d. samples from D; we use X_1, X_2, . . . , X_n to denote these n samples. The kernel h, the degree k, and the estimable parameter θ are allowed to depend on n; we omit the subscript n for the sake of brevity. We consider bounded kernels h, as well as unbounded kernels h for which the distribution H is sub-Gaussian. We write Y ∼ sub-Gaussian(τ) if E[exp(λ(Y − EY))] ≤ exp(τλ^2/2) for all λ ∈ R. The quantity τ is called a variance proxy for the distribution H and satisfies the inequality τ ≥ σ^2 [59]. Throughout the paper, we assume that the privacy parameter satisfies ϵ = O(1). We also use the notation Õ(·) in error terms, which hides poly-logarithmic factors in n/α.

2.1 U-Statistics

Let [n] denote {1, . . . , n}, and let I_{n,k} be the set of all k-element subsets of [n]. Denote the n i.i.d. samples by X_1, X_2, . . . , X_n. For any S ∈ I_{n,k}, let X_S be the (unordered) k-tuple {X_i : i ∈ S}. The U-statistic associated with the data and the function h is

U_n := \binom{n}{k}^{-1} \sum_{\{i_1, . . . , i_k\} \in I_{n,k}} h(X_{i_1}, . . . , X_{i_k}). (1)

The function h is the kernel of U_n and k is the degree of U_n. While U-statistics can be vector-valued, we consider scalar U-statistics in this paper.
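The following is a minimal sketch of equation (1), evaluating U_n by direct enumeration of I_{n,k}; the kernel and data are placeholders, and the O(\binom{n}{k}) cost of this enumeration is what later motivates the subsampled families.

```python
from itertools import combinations
from math import comb

def u_statistic(data, h, k):
    """Direct evaluation of U_n in equation (1): the average of h over all
    k-element subsets of the n samples. Cost: C(n, k) kernel evaluations."""
    n = len(data)
    total = 0.0
    for S in combinations(range(n), k):
        total += h(*(data[i] for i in S))
    return total / comb(n, k)

if __name__ == "__main__":
    # Example: the degree-2 kernel h(x, y) = (x - y)^2 / 2 yields the
    # unbiased sample variance as its U-statistic.
    import random
    random.seed(0)
    xs = [random.gauss(0.0, 1.0) for _ in range(30)]
    print(u_statistic(xs, lambda x, y: (x - y) ** 2 / 2, k=2))
```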
The variance of U_n can be expressed in terms of the conditional variances of h(X_1, X_2, . . . , X_k). For c ∈ [k], we define the conditional variance

ζ_c := Var(E[h(X_1, . . . , X_k) | X_1, . . . , X_c]). (2)

Equivalently, ζ_c = cov(h(X_{S_1}), h(X_{S_2})), where S_1, S_2 ∈ I_{n,k} and |S_1 ∩ S_2| = c. The number of such pairs of sets S_1 and S_2 is equal to \binom{n}{k}\binom{k}{c}\binom{n-k}{k-c}, which implies

Var(U_n) = \binom{n}{k}^{-1} \sum_{c=1}^{k} \binom{k}{c}\binom{n-k}{k-c} ζ_c. (3)

Since E[U_n] = θ, U_n is an unbiased estimate of θ. Moreover, the variance of U_n is a lower bound on the variance of any unbiased estimator of θ (cf. Lee [45, Chapter 1, Theorem 3]). We also have the following inequality from Serfling [54] (see Appendix A.2 for a proof):

ζ_c / c ≤ ζ_d / d, for all 1 ≤ c ≤ d ≤ k. (4)

Infinite-order U-Statistics: Classical U-statistics typically have small, fixed k. However, important estimators that appear in the contexts of subsampling [52] and Breiman's random forest algorithm [56, 50] have k growing with n. These types of U-statistics are sometimes referred to as infinite-order U-statistics [27, 48]. U-statistics also frequently appear in the analysis of random geometric graphs [30]. The difference between this setting and the examples above is that the conditional variances {ζ_c} vanish with n in the sparse setting. (See Section 5.)

Degenerate U-statistics: A U-statistic is degenerate of order ℓ ≤ k − 1 if ζ_i = 0 for all i ∈ [ℓ] and ζ_{ℓ+1} > 0 (if ζ_k = 0, the distribution is almost surely constant). Degenerate U-statistics arise in hypothesis tests such as Cramér-von Mises and Pearson tests of goodness of fit [31, 3, 55] and tests for uniformity [21]. They also appear in tests for model misspecification in econometrics [46, 47]. For more examples of degenerate U-statistics, see [20, 61, 34].

2.2 Differential privacy

The main idea of differential privacy [25] is that the participation or non-participation of a single person should not affect the outcome significantly. A (randomized) algorithm M that takes as input a dataset D ∈ X^n and outputs an element of its range space S satisfies ϵ-differential privacy if, for any pair of adjacent datasets D and D′ and any measurable subset S ⊆ S of the range space S,

Pr(M(D) ∈ S) ≤ e^ϵ Pr(M(D′) ∈ S).

A dataset is D := (X_1, X_2, . . . , X_n) from some domain X, for some n which is public. Two datasets D and D′ are adjacent if they differ in exactly one index. An important property of differentially private algorithms is composition. We defer composition theorems for differentially private algorithms to Appendix A.2.

Basic DP algorithms. One way to ensure an algorithm satisfies differential privacy is through the Laplace mechanism [25]. The global sensitivity of a function f : X^n → S is

GS(f) = max_{‖D − D′‖ = 1} |f(D) − f(D′)|, (5)

where ‖D − D′‖ := |{i : D_i ≠ D′_i}|. A fundamental result in differential privacy is that we can achieve privacy for f by adding noise calibrated to its global sensitivity.

Lemma 1 (Laplace mechanism [25]). Let f : X^n → S be a function and let ϵ > 0 be the privacy parameter. Then the algorithm A(D) = f(D) + Lap(GS(f)/ϵ) is ϵ-differentially private.

The global sensitivity of a function is the worst-case change in the function value and may be high on atypical datasets. To account for the small sensitivity on typical datasets, the notion of local sensitivity is useful. The local sensitivity of a function f : X^n → S at D is defined as

LS(f, D) = max_{‖D − D′‖ = 1} |f(D) − f(D′)|. (6)

Unfortunately, adding noise proportional to the local sensitivity does not ensure differential privacy, because variation in the magnitude of the noise itself may leak information. Instead, [49] proposed the notion of a smooth upper bound on LS(f, D).
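Before turning to smooth sensitivity, the following sketch illustrates the Laplace mechanism of Lemma 1 applied to a clipped empirical mean. The clipping interval [lo, hi] is an assumption of this illustration (it is what bounds the global sensitivity of the mean by (hi − lo)/n), not part of the lemma.

```python
import random

def laplace_mechanism_mean(data, lo, hi, eps):
    """Release the empirical mean privately via the Laplace mechanism (Lemma 1).

    Clipping each point to the assumed interval [lo, hi] bounds the global
    sensitivity of the mean by (hi - lo) / n; adding Lap(GS / eps) noise then
    gives eps-differential privacy."""
    n = len(data)
    clipped = [min(max(x, lo), hi) for x in data]
    gs = (hi - lo) / n                                          # global sensitivity of the mean
    noise = random.expovariate(1.0) - random.expovariate(1.0)   # standard Laplace(1) draw
    return sum(clipped) / n + (gs / eps) * noise
```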
A function SS(f, ) is said to be an ϵ-smooth upper bound on the local sensitivity of f if (i) SS(f, D) LS(f, D) for all D, and (ii) SS(f, D) eϵSS(f, D ) for all |D D | = 1. Intuitively, (i) ensures that enough noise is added, and (ii) ensures that the noise itself does not leak information about the data. Lemma 2. (Smoothed Sensitivity mechanism [49]) Let f : X n S be a function, ϵ > 0, and SS(f, ) be an ϵ-smooth upper bound on LS(f, ). Then, the algorithm A(D) = f(D) + SS(f, D)/ϵ Z, where Z has density h(z) 1 1+z4 , is ϵ/10-differentially private. 2The Laplace Distribution with parameter b > 0, denoted Lap(b), has density ℓ(z) exp ( |z|/b) . 3 Lower Bounds and Application of Off-the-shelf Tools Algorithm Sub-Gaussian, non-degenerate Bounded, degenerate Private error Is non-private error Private error Is non-private error O(Var(Un))? O(Var(Un))? Naive (Proposition A.1) O k τ All-tuples (Proposition 1) O k3/2 τ Main algorithm O k τ Yes O k3/2C Yes Corollary 1 Corollary 2 Lower bound Ω k τ for private algorithms Theorem 1 Theorem 3 Table 1: We compare our application of off-the-shelf tools to Algorithm 1. We only provide the leading terms in the private error. The non-private lower bound on E(ˆθ Eh(X1, . . . , Xk))2 for all unbiased ˆθ is Var(Un), which our private algorithms nearly match. We start with a simple, non-private estimator involving an average of independent quantities. Let m = n/k , and define Sj = {(j 1)k + 1, (j 1)k + 2, . . . , jk} for all j [m]. Define the naive estimator ˆθnaive := Pm j=1 h(XSj)/m. Directly applying existing private mean estimation algorithms [42, 40, 39, 10] to our setting yields an error bound of3 Var(ˆθnaive) + k τ/(nϵ) = O p kζk/n + k τ/(nϵ) , (7) since Var(ˆθnaive) = kζk/n. Note that this variance is larger than the dominant term k2ζ1/n of Var(Un) (see Lemma A.1 and Eq 4); indeed, ˆθnaive is a suboptimal estimator of θ. In Algorithms A.2 and A.3, we present a general extension of the Coin Press algorithm [10], which is then used to obtain a private estimate of θ with the non-private error term matching p Var(Un). For completeness, we present the algorithms and their proofs in Appendix A.3. Definition 1 (All-tuples family). Let M = n k and let Sall = {S1, S2, . . . , SM} = In,k be the set of all k-element subsets of [n]. Call Sall the all-tuples family. Proposition 1. Suppose θ [ R, R]. Let Sall be the all-tuples family in Definition 1. Then Wrapper 1, with f = all, failure probability α, and A = U-Stat Mean (Algorithm A.2) returns an estimate θall of the mean θ such that, with probability at least 1 α, | θall θ| O p Var(Unα) + O k3/2 τ as long as nα = Ω k . Moreover, the algorithm is ϵ-differentially private and runs in time O log(1/α) k + log R Remark 1. While Lemma 1 recovers the correct first term of the deviation, the private error term is a k factor worse. Moreover, we need k2/n = o(1) for the private error to be asymptotically smaller than the non-private error. Note, however, that existing concentration [35, 5] or convergence in probability results [58, 48] only require k = o(n) (see Lemmas A.1 and A.3 in the Appendix). Remark 2 (Degenerate and sparse settings). While Lemma 1 improves over the naive estimator, the private error can overwhelm the non-private error for degenerate and sparse U-statistics (see Section 5). We show that for uniformity testing, using this estimator can lead to suboptimal sample complexity if the distribution is already close to uniform. 
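For reference, here is a minimal sketch of the naive estimator ˆθ_naive described above: split the data into m = ⌊n/k⌋ disjoint blocks, evaluate h once per block, and privatize the mean of the m independent block values. For simplicity the sketch uses clipped-range Laplace noise rather than the CoinPress-style routine analyzed in Proposition A.1, so the clipping interval [lo, hi] is an assumption of this illustration.

```python
import random

def naive_private_estimator(data, h, k, eps, lo, hi):
    """Sketch of the naive estimator: one kernel evaluation per disjoint block
    of size k, then a private mean of the m = n // k block values.

    Changing one sample changes at most one clipped block value, so the mean
    has sensitivity (hi - lo) / m; Laplace noise at that scale gives eps-DP."""
    n = len(data)
    m = n // k
    blocks = [data[j * k:(j + 1) * k] for j in range(m)]
    ys = [min(max(h(*b), lo), hi) for b in blocks]               # clipped block values
    gs = (hi - lo) / m
    noise = random.expovariate(1.0) - random.expovariate(1.0)    # Laplace(1)
    return sum(ys) / m + (gs / eps) * noise
```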
Proposition 1 improves over the naive estimator at the cost of computational complexity. We can trade off the computational and statistical efficiencies using a different family S parameterized by the size M of S. In Appendix 3, we show a result similar to 1 for the subsampled family. 3 O(.) hides logarithmic factors in the problem parameters k, n, τ, α, ζk, where α is the error probability. Definition 2 (Subsampled Family). Draw M i.i.d. samples S1, . . . , SM from the uniform distribution over the elements of In,k, and let Sss := {S1, . . . , SM}. Call Sall the subsampled family. The next result shows a nearly optimal dependence on n and ϵ in the bounds for ˆθnaive and θall. In particular, the dependence of the modified Coinpress algorithm (Lemma 1) on k is suboptimal. Theorem 1 (Lower bound for non-degenerate kernels). Let n and k be positive integers with k < n/2 and let ϵ = Ω(k/n). Let F be the set of all sub-Gaussian distributions over R with variance proxy 1, and let µ be the output of any ϵ-differentially private algorithm applied to n i.i.d. observations from D. Then, suph,D:H F E | µ(X1, . . . , Xn) E[h(X1, . . . , Xk)]| = Ω k Among unbiased estimators, Un is the best non-private estimator [35, 45]. The most widely used non-private estimators are Uand V -statistics, which share similar asymptotic properties [58]. The above lower bound also has a log factor arising from an optimal choice of the sub-Gaussian proxy for Bernoulli random variables [4]. Proofs are deferred to Appendix A.4. 3.1 Boosting the error probability via median-of-means If we use Algorithm A.2 as stated with failure probability α, then the error in the algorithm has a O(1/ α) factor, which is undesirable. Instead, we use the Algorithm with a constant failure probability (say, 0.25) and then boost this failure probability to α via a median-of-means procedure. We incorporate the median-of-means in all of our theoretical results. Wrapper 1 (Median Of Means(n, k, Algorithm A, Parameters Λ, Failure probability α, Family type f {all, ss})). Divide [n] into q = 8 log(1/α) independent chunks Ii, i [q] of roughly the same size. For each i [q], run Algorithm A with subset family Si := Sf(Ii), Dataset {h(XS)}S Si, and other parameters Λ for A to output ˆθi, i [q]. Return θ = median(ˆθ1, . . . , ˆθq). In the above wrapper, Sf(Di) simply creates the appropriate family of subsets for the dataset Di. For example, if Di = {X1, . . . Xnα}, f = all, then Sall(Di) is {h(XS)}S Inα,k. If f = ss, then Sss(Di) is {h(XS)}S Si, where Si is the set of M subsampled subsets of Di. Here, nα := n 8 log(1/α). Wrapper 1 can be used to boost the failure probability from constant to α by splitting the data into O(log(1/α)) chunks, applying the algorithm on each of the chunks, and taking the median output. The expense of this procedure is the reduction in the effective sample size to nα = Θ(n/ log(1/α)). Details and proofs on the median-of-means procedure can be found in Section A.3.2. 4 Main Results In Section 3, we showed that off-the-shelf private mean estimation tools applied to U-statistics either achieve a sub-optimal non-private error (see Remark 1) or a sub-optimal private error. If the U-statistic is degenerate of order 1, the non-private and private errors (assuming ϵ = Θ(1)) are Θ(1/n). We now present an algorithm that achieves nearly optimal private error for sub-Gaussian non-degenerate kernels. 
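Before presenting the main algorithm, here is a minimal sketch of the median-of-means boosting performed by Wrapper 1; base_estimator is a placeholder for any ϵ-differentially private base routine (such as Algorithm A.2, or Algorithm 1 below) run with a constant failure probability.

```python
import math
from statistics import median

def median_of_means_wrapper(data, base_estimator, alpha):
    """Wrapper 1 (sketch): split the data into q = ceil(8 * log(1/alpha))
    chunks, run the eps-DP base estimator independently on each chunk, and
    return the median of the q estimates. Parallel composition keeps the
    overall output eps-DP, and the failure probability drops to alpha."""
    q = max(1, math.ceil(8 * math.log(1 / alpha)))
    chunk = max(1, len(data) // q)
    estimates = [base_estimator(data[i * chunk:(i + 1) * chunk]) for i in range(q)]
    return median(estimates)
```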
Our algorithm can be viewed as a generalization of the algorithm proposed in Ullman and Sealfon [57] for privately estimating the edge density of an Erd os-Rényi graph. We provide strong evidence that, for bounded degenerate kernels, we achieve nearly optimal non-private error. All proofs for this section can be found in Section A.5. 4.1 Key intuition Our key insight is to leverage the Hájek projection [58, 45], which gives the best representation of a U-statistic as a linear function of the form Pn i=1 f(Xi): i=1 E[Tn|Xi] (n 1)E[Tn] (ii) = k i=1 E[h(XS)|Xi] (n 1)θ. Equality (i) gives the form of the Hájek projection for a general statistic Tn, whereas (ii) gives the form when Tn is a U-statistic. Let I(i) n,k = {S In,k : i S}. In practice, one uses the estimates b E[h(XS)|Xi] := 1 n 1 k 1 X which we call local Hájek projections. In some sense, this is the U-statistic when viewed locally at Xi. When the dataset is clear from context, we write ˆh In,k(i), or simply ˆh(i), for b E[h(XS)|Xi]. 4.2 Proposed algorithm Consider a family of subsets S In,k of size M. Let Si = {S S : i S} and Mi = |Si|, and suppose Mi = 0 for all i [n]. Assume also that S satisfies the inequalities Mi/M 3k/n and Mij/Mi 3k/n (10) for any distinct indices i and j in [n] (one such family is S = In,k, for which Mi/M = k/n and Mij/Mi = (k 1)/(n 1) k/n, but there are other such families). Define S S h(XS), and ˆh S(i) := 1 S Si h(XS), i [n]. (11) Un in Eq (1) and b E[h(XS)|Xi] in equation (9) are the same as the quantities An(In,k) and ˆh In,k(i). A standard procedure in private estimation algorithms is to clip the data to an appropriate interval [10, 41] in such a way that the sensitivity of the overall estimate can be bounded. Similarly, we use the concentration of the local Hájek projections to define an interval such that each i can be classified as good" or bad based on the distance between ˆh S(i) and the interval. The final estimator is devised so the contribution of the bad indices to the estimator is low and the estimator has low sensitivity. Let ξ and C be parameters to be chosen later; they will be chosen in such a way that with high probability, (i) |ˆh(i) θ| ξ for all i, and (ii) each h(XS) is at most C away from θ. Define LS := arg min t N>0 n i : ˆh S(i) An(S) > ξ + 6k Ct/n o t . (12) In other words, LS is the smallest positive integer t such that at most t indices i [n] satisfy |ˆh S(i) An(S)| > ξ + 6k Ct n (such an integer t always exists because t = n works). Define Good(S) := n i : ˆh S(i) An(S) ξ + 6k CLS/n o , Bad(S) := [n] \ Good. (13) For each index i [n], define the weight of i with respect to S as wt S(i) := max 0, 1 ϵn 6Ck dist ˆh S(i) An, [ ξ 6k CLS/n, ξ + 6k CLS/n] . (14) Here, ϵ is the privacy parameter and dist(x, I) is the distance between x and the interval I. Based on whether a datapoint is good or bad, we will define a weighting scheme that reweights the h(XS) in equation (1). For each S S, let wt S(S) := min i S wt S(i), and g S(XS) := h(XS)wt S(S) + An(S) (1 wt S(S)) . In particular, if wt S(S) = 1, then g S(XS) = h(XS); and if wt S(S) = 0, then g S(XS) = An(S). Finally, define the quantities S S g S(XS), ˆg S(i) := 1 S Si g S(XS) i [n]. (15) To simplify notation, we will drop the argument S from L, An, An, ˆh, ˆg, Good, and Bad. Idea behind the algorithm: If all ˆh(i) s are within ξ of the empirical mean An, then Bad = and An = An. Otherwise, for any set S containing a bad index, we replace h(XS) by a weighted combination of h(XS) and An. 
This averaging-out of the bad indices allows a bound on the local sensitivity of An. We then provide a smooth upper bound on the local sensitivity characterized by the quantity L, which can be viewed as an indicator of how well-concentrated the data is. The choice of ξ will be such that L = 1 with high probability and that the smooth sensitivity of An at X is small. This ensures that a smaller amount of noise is added to An to preserve privacy. Algorithm 1 Private Mean Local Hájek(n, k, {h(XS), S S}, ϵ, C, ξ, S) 1: M |S| 2: Si {S S : i S} 3: Mi |Si| 4: if there exist indices i = j such that Mi = 0 or Mi/M > 3k/n or Mij/Mi > 3k/n, then 5: return 6: end if 7: An P S S h(XS)/M 8: for i = 1, 2, . . . , n do 9: ˆh(i) P S Si h(XS)/Mi 10: end for 11: Let L be the smallest positive integer such that n i : ˆh(i) An > ξ + 6k CL 12: Good(S) n i : ˆh(i) An ξ + 6k CL n o ; Bad(S) [n] \ Good(S) 13: for i = 1, 2, . . . , n do 14: wt(i) max 0, 1 ϵ 6Ck/ndist ˆh(i) An, ξ 6k CL n , ξ + 6k CL 15: end for 16: for S S do 17: g(XS) h(XS) mini S wt(i) + An (1 mini S wt(i)) 18: end for 19: An P S S g(XS)/M 20: S(S) max0 ℓ n e ϵℓ k n(ξ+ k C(L+ℓ) n )(1+ϵ(L+ℓ))+ k2C(L+ℓ)2 min(k,(L+ℓ)) 21: Draw Z from distribution with density h(z) 1/(1 + |z|4) 22: return An + S(S)/ϵ Z Theorem 2. Algorithm 1 is 10ϵ-differentially private for any ξ. Moreover, suppose h is bounded with additive range C,4 and with probability at least 0.99, we have maxi |ˆh S(i) An| ξ. Run Wrapper 1 with f = all, and A = Private Mean Local Hajek (Algorithm 1). With probability at least 1 α, the output θ satisfies | θ θ| = O p Var(Unα) + kξ nαϵ + k2 n2αϵ2 + k3 n3αϵ3 C . Connections to Ullman and Sealfon [57]: Ullman and Sealfon [57] estimate the edge density of an Erd os Renyi graph using strong concentration properties of its degrees. This idea can be loosely generalized to a broader setting of U-statistics: consider a hypergraph with n nodes and n k edges, where the ith node corresponds to index i. An edge corresponds to a k-tuple of data points S In,k, and the edge weight is given by h(XS). The natural counterpart of a degree in a graph becomes a local Hájek projection, defined as in equation (9). In degenerate cases and cases where k2ζ1 kζk, the local Hájek projections are tightly concentrated around the mean θ. We exploit this fact and reweight the edges (k-tuples) so that the local sensitivity of the reweighted U-statistic is small. 4.3 Application to non-degenerate and degenerate kernels Algorithm 1 can be extended from bounded kernels to sub-Gaussian(τ) kernels. First, split the samples into two roughly equal halves. The first half of the samples will be used to obtain a coarse estimate of the mean θ. For this, we can use any existing private mean estimation algorithm to obtain an ϵ/2-differentially private estimate θcoarse such that with probability at least 1 α, | θcoarse θ| = O( p kζk/n + k τ/(nϵ)). By a union bound, with probability at least 1 α, |h(XS) θ| is within 4 q τ log n k /α for all S In,k, and therefore also within c p kτ log(n/α) of the coarse estimate θcoarse, for some universal constant c, as long as ϵ = Ω( Define the projected function h(X1, X2, . . . , Xk) to be the value h(X1, X2, . . . , Xk) projected to the interval [ θcoarse c p kτ log(n/α), θcoarse + c p kτ log(n/α)]. The final estimate of the mean θ is obtained by applying Algorithm 1 to the other half of the samples, the function h, and the 4More precisely, suppose supx h(x) infx h(x) C. privacy parameter ϵ/2. 
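The following is a minimal Python sketch of the deterministic core of Algorithm 1 (lines 1-19) with the all-tuples family S = I_{n,k}: it computes the local Hájek projections ˆh(i), the level L, the per-index weights of equation (14), and the reweighted average Ān of equation (15). The smooth-sensitivity noise of lines 20-22 is deliberately omitted (its exact constants are stated in the paper), so the returned value is the un-noised Ān; the function and argument names are ours.

```python
from itertools import combinations

def hajek_reweighted_mean(data, h, k, eps, C, xi):
    """Sketch of Algorithm 1, lines 1-19, for the all-tuples family I_{n,k}.

    Returns the reweighted average A_bar; Algorithm 1 then adds heavy-tailed
    noise scaled by a smooth upper bound on the local sensitivity (lines 20-22),
    which is not reproduced here."""
    n = len(data)
    hvals = {S: h(*(data[i] for i in S)) for S in combinations(range(n), k)}
    M = len(hvals)
    A_n = sum(hvals.values()) / M

    # Local Hajek projections: average of h(X_S) over subsets S containing i.
    sums, counts = [0.0] * n, [0] * n
    for S, v in hvals.items():
        for i in S:
            sums[i] += v
            counts[i] += 1
    h_hat = [sums[i] / counts[i] for i in range(n)]

    # L: smallest t >= 1 such that at most t indices deviate from A_n by more
    # than xi + 6*k*C*t/n (t = n always works, so the search terminates).
    L = next(t for t in range(1, n + 1)
             if sum(abs(h_hat[i] - A_n) > xi + 6 * k * C * t / n for i in range(n)) <= t)

    # Per-index weights, equation (14): weight decays linearly with the distance
    # of h_hat(i) - A_n from the interval [-(xi + 6kCL/n), xi + 6kCL/n].
    half_width = xi + 6 * k * C * L / n
    wt = [max(0.0, 1.0 - (eps * n / (6 * C * k)) * max(0.0, abs(h_hat[i] - A_n) - half_width))
          for i in range(n)]

    # Reweighted kernel values, equation (15), and their average A_bar.
    total = 0.0
    for S, v in hvals.items():
        w = min(wt[i] for i in S)
        total += v * w + A_n * (1.0 - w)
    return total / M
```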
The following lemma shows that p 2τ log(2n/α) is a valid choice of the concentration parameter ξ for sub-Gaussian, non-degenerate kernels. Lemma 3. If H is sub-Gaussian(τ), the local Hájek projections ˆh(i) are also sub-Gaussian(τ). In particular, with probability at least 1 α, we have max1 i n |ˆh(i) θ| p 2τ log(2n/α). Combining these parameters with Theorem 2 gives us the following result: Corollary 1 (Non-degenerate sub-Gaussian kernels). Suppose h is sub-Gaussian(τ) and the privacy parameter ϵ = Ω(k1/2/n). Split the samples into two halves and compute a private estimate of the mean by applying the naive estimator on the first half of the samples with privacy parameter ϵ/2 to obtain θcoarse. Let h be the projection of of the function h onto the interval θcoarse O( p kτ log(2n/α)). Run Wrapper 1 on the remaining half of the samples with f = all, failure probability α/2, algorithm A = Private Mean Local Hájek (Algorithm 1), C = p 2kτ log(n/2α), ξ = p 2τ log(n/2α), function h, and privacy parameter ϵ/20. Then, the output θ is ϵ-differentially private. Moreover, with probability at least 1 α, | θ θ| = O p Var(Unα) + O k τ nαϵ + k2.5 τ n2αϵ2 + k3.5 τ From our lower bound on non-degenerate kernels in Theorem 1, we see that the above corollary is optimal in terms of k, n, ϵ (up to log factors). In contrast, Lemma 1 is suboptimal in k. Many degenerate U-statistics (e.g., all the degenerate ones in Section 5) have bounded kernels. For these, we see that the local Hájek projections concentrate strongly around the U-statistic. Lemma 4. Suppose H is bounded, with additive range C. Let i [n] be an arbitrary index and Si In,k be a set containing i, and suppose xi R is some element in the support of D. With probability at least 1 β n, conditioned on Xi = xi, we have b E[h(XS)|Xi = xi] E [h(XSi)|Xi = xi] 2σi where b E[h(XS)|Xi = xi] = ( n 1 k 1) , and σ2 i = Var (h(XSi)|Xi = xi). For bounded kernels with additive range C, σi C/2 by Popoviciu s inequality [53]. Moreover, for degenerate kernels, ζ1 = 0. That is, the conditional expectation E [h(XSi)|Xi = xi] is equal to θ for all xi, because the variance of this conditional expectation is ζ1. Based on this, we can show that the choice of ξ = O Ck1/2/n1/2 satisfies the requirement that the local Hájek projections are within ξ of θ with probability at least 1 α. Corollary 2 (Degenerate bounded kernels). Suppose h is bounded with additive range C and the kernel is degenerate ζ1 = 0. Let ϵ = Ω(k1/2/n) be the privacy parameter. Run Wrapper 1 with f = all, failure probability α, and algorithm A = Private Mean Local Hájek (Algorithm 1) with ξ = O(C p k/n log(n/α)), to output θ. With probability 1 α, we have θ θ = O p Var(Unα) + O k1.5 n1.5 α ϵC + k2 n2αϵ2 C + k3 Obtaining a result for sub-Gaussian degenerate kernels poses difficulties on bounding the concentration parameter ξ. However, for bounded kernels, we see that the above result obtains better private error than the application of off-the-shelf methods (Lemma 1). In the next subsection, we provide a lower bound for degenerate bounded kernels, which, together with Corollary 2, gives strong indication that our algorithm achieves optimal private error for private degenerate kernels. 4.4 Lower bound To obtain a lower bound of the private error, we construct a dataset and kernel function such that the local Hájek projections are 1/ n concentrated around the corresponding U-statistic. This is one way of characterizing a degenerate U-statistic. The proof of the following theorem is in Appendix A.4. Theorem 3. 
For any n, k N with k n, ϵ = Ω((k/n)1 1/2(k 1)), and ϵ-differentially private algorithm A : X n R, there exists a function h : X k {0, 1} and dataset D such that |ˆh(i) Un| p k/n (where ˆh(i) and Un are computed on D) for every i [n] and E|A(D) Un| = n3/2ϵ , where the expectation is taken over the randomness of A. Remark 3. The above lower bound is in some sense informal because we created a deterministic dataset and h that mimics the property of a degenerate U statistic; the local Hájek projections concentrate around Un at a rate p k/n. However, it gives us a strong reason to believe that the private error cannot be smaller than O(k3/2/n3/2ϵ) for degenerate U statistics of order k. Note that for bounded kernels, Corollary 2 does achieve this bound, as opposed to Lemma 1. 4.5 Subsampling estimator We now focus on subsampled U-statistics. Previous work has shown how to use random subsampling to obtain computationally efficient approximations of U-statistics [38, 52, 17], where the sum is replaced with an average of samples (drawn with or without replacement) from In,k. Recall Definition 2. Let S := {S1, . . . , SM} denote the subsampled set of subsets, let Si := {S S : i S}, and let Mi := |Si|. The proof of Theorem 2 with S = In,k (cf. Appendix A.5) uses the property of In,k that Mi/M = k/n and Mij/Mi = (k 1)/(n 1), so the inequalities (10) certainly hold. Indeed, we can show that for subsampled data (cf. Lemma A.11), the following inequalities hold with probability at least 1 α, provided M = Ω(n2/k2 log(n/α)): Mi/M 3k/n and Mij/Mi 3k/n. (17) Algorithmically, we check if the bounds (17) hold for S, and output if not. Privacy is not compromised because the check only depends on S and is agnostic to the data. Theorem 4. Let Mn = Ω (n2/k2) log n . Then Algorithm 1, modified to output if the bounds (17) do not hold, is 10ϵ-differentially private. Moreover, suppose that with probability at least 0.99, we have maxi |ˆh S(i) An| ξ and |h(S) θ| C for all S In,k. Run Wrapper 1 with f = ss, failure probability α, and A = Private Mean Local Hajek (Algorithm 1) to output θ. With probability at least 1 α, we have |A(X) θ| = O ζk Mnα + kξ n2αϵ2 + k3C Remark 4. If the kernel is non-degenerate and the number of times we subsample (for each run of the algorithm) is Ω n2 α/k2 , then Theorem 4 nearly achieves the same error as Algorithm 1 with S = In,k with a better computational complexity for k 3. The lower-order terms have an additional min(k, 1/ϵ) factor, which can be removed with Ω(n3) subsamples. 5 Applications We apply our methods to private uniformity testing and estimation in random networks. For more applications, see Appendix A.6.4. 1. Uniformity testing: A fundamental task in distributional property testing [6, 7] is deciding whether a discrete distribution is uniform on its domain, called the problem of uniformity testing. Let X1, X2, . . . , Xn be n i.i.d. samples from a discrete distribution with support [m], characterized by the probability masses p1, p2, . . . , pm on the atoms. Given an error tolerance δ > 0, the task is to distinguish between approximately uniform distributions p : ℓ2(p, U) δ/ 2m and far-fromuniform distributions {p : ℓ2(p, U) δ/ m}. Without the constraint of privacy, Diakonikolas et al. [21] perform this test by rejecting the uniformity hypothesis whenever the test statistic Un := P i (1 + 3δ2/4)/m, and show that this test succeeds with probability 0.9 as long as n = Ω m1/2/δ2 . 
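For concreteness, here is a sketch of the non-private collision test just described: the degree-2 U-statistic with kernel h(x, y) = 1(x = y) has mean Σ_i p_i^2, which equals 1/m under uniformity. The rejection threshold (1 + 3δ^2/4)/m mirrors the statement above and should be read as an assumption of this illustration of the test in [21].

```python
from itertools import combinations
from math import comb

def collision_statistic(samples):
    """Degree-2 U-statistic with the collision kernel h(x, y) = 1(x == y);
    its expectation is sum_i p_i^2, equal to 1/m under Uniform([m])."""
    n = len(samples)
    collisions = sum(samples[i] == samples[j] for i, j in combinations(range(n), 2))
    return collisions / comb(n, 2)

def uniformity_test(samples, m, delta):
    """Reject (approximate) uniformity when the collision statistic exceeds
    (1 + 3 * delta**2 / 4) / m, as in the non-private test discussed above."""
    return collision_statistic(samples) > (1 + 3 * delta ** 2 / 4) / m
```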
As detailed in Algorithm A.4 in the appendix, instead of using Un, we use the private estimate Un using Algorithm 1. For our algorithm to work, we require the distributions to satisfy pi 2/m for all i. Let pi = (1 + ai)/m for all i, with ai [ 1, 1]. Under H1, we have E[1(X1 = X2)] = (1 + a 2 /m)/m (1 + δ2)/m, where a is the ℓ2 norm of (a1, a2, . . . , am). Under H0, the mean is 1/m. The difference between the threshold (1 + δ2/2)/m and the mean (under either of the two hypotheses) is at least δ2/(2m). Moreover, [21] shows that the standard deviation of Un is much smaller than the difference in the means under H0 and H1 as long as n = Ω(m1/2/δ2). However, we must also account for the noise added to ensure privacy. In Appendix A.6.1, we show that the choice of ξ = Θ(1/m + 1/n) works and establish the following result: Theorem 5. Let pi = (1+ai)/m with ai [ 1, 1], Pm i=1 ai = 0. Let {Xj}n j=1 be i.i.d. multinomial random variables such that P(X1 = i) = pi, for all i [m]. There exists an algorithm that distinguishes between a 2 m2 δ2 m from a 2 m2 < δ2 2m with probability at least 1 α, as long as nα = Ω m1/2 (δϵ) + m1/2 log(m/δϵ) δϵ1/2 + m1/3 δ2 , and is 10ϵ-differentially private. The non-private error term of Theorem 5 is the same as in Theorem 1 of [21] and is optimal [22]. Proposition 1 shows that Algorithm A.2 with the all-tuples family leads to a private error bounded by O(1/nϵ). This private error is O(δ2/m) only when n = Ω(m/δ2ϵ). In comparison, Algorithm A.4 has error O(δ2/m) for n = Ω m1/2/min(δ2, δϵ) , which is quadratically better in m. Remark 5 (Comparison with existing algorithms). Existing results for private uniformity testing [1, 13] distinguish between the uniform distributions (ℓ1(p, U) = 0) and distributions away from uniform in TV-distance (ℓ1(p, U) δ). Our algorithm considers the alternative hypothesis to be ℓ2(p, U) δ/ m. Hence, our results are not strictly comparable. One caveat is that we restrict ourselves to distributions p such that ℓ (p, U) 1/m. Our algorithm also allows some tolerance in the null hypothesis, similar to [21] and other collision-based testers. That is, can allow some slack and take the null hypothesis H0 : ℓ2(p, U) δ/ 2m instead of ℓ2(p, U) = 0. 2. Sparse graph statistics: The geometric random graph (see [30]) has edges h(Xi, Xj) := 1( Xi Xj 2 rn), where rn governs the average degree. Under a suitable distribution for X1, . . . , Xn, the subgraph counts show normal convergence for a large range of rn [30]. Typically, we only observe the graph and do not know the underlying distribution of the latent variables Xi or the radius rn. This is why estimates of the network moments are of interest since they reveal information about the underlying unknown distribution and parameters. Let the Xi s be uniformly distributed on the three-dimensional sphere to ignore boundary conditions. For edge density, Eh(Xi, Xj) r2 n. For any distinct indices i, j, k and a given Xi, the random variables h(Xi, Xj) and h(Xi, Xk) are independent. Therefore, ζ1 = cov(h(Xi, Xj)h(Xi, Xk)) = 0. We have ζ2 = var[h(Xi, Xj)] = O(r2 n), so the non-private error is O(rn/n). In Appendix A.6.2, we provide Algorithm A.5 that uses Algorithm 1 to obtain a private estimate of the edge density of a graph {h(Xi, Xj)}1 i 0, then Var(Un) = k2ζ1 n + O ζk k2 (ii) If ζ1 = 0 and ζ2 > 0, then Var(Un) = k2(k 1)2ζ2 2n(n 1) + O ζk k3 Proof. This result follows directly from a calculation appearing in the proof of Theorem 3.1 in [48]. Note that for all j [k]. 
Moreover, For part (i), we write Var(Un) = k2ζ1 1 δ2 j k2ζ1 For part (ii), we write Var(Un) = k2ζ1 n + k2(k 1)2ζ2 k j 2 n j ζj k2(k 1)2ζ2 2n(n 1) + ζk 2n(n 1) + ζk j = k2(k 1)2ζ2 2n(n 1) + 2k3ζk Lemma A.2. For all 1 c d k, In particular, kζ1 ζk. Proof. Using equation A.24, Lemma A.3 (Concentration of U-statistics). [37, 9] (i) If H is sub-Gaussian with variance proxy τ, then for all t > 0, we have P (|Un θ| t) 2 exp n (ii) If H is almost surely bounded in ( C, C), then for all t > 0, we have P (|Un θ| t) exp 2ζk + 2Ct/3 Proof. Without loss of generality, let θ = 0. For any permutation σ of [n], let i=1 h(Xσ(k(i 1)+1), Xσ(k(i 1)+2), . . . , Xσ(ki)), where m = n/k . By symmetry, Un = 1 σ Vσ. For any s > 0, P (Un t) = P es Un est e st E es Un = e st E σ exp(s Vn) = e st E [exp(s Vid)] = e st E h exp s mh(X1, . . . , Xk) im . If H is sub-Gaussian with variance proxy τ, then we can further bound the inequality above as P (Un t) e st exp s2τ m = exp st + s2τ τ 2 to get the desired result. Note how the argument is similar to the classical Hoeffding s inequality argument after applying Jensen s inequality on Vσ. The second result follows similarly by adapting the tricks of Bernstein s inequality [9] to [37]; for a detailed proof, see [51]. A.2 Details on Privacy Mechanisms Lemma A.4. (Basic Composition) If Ai : X n Si is ϵi-differentially private for all i [k], then the mechanism A : X n S1 Sk defined as A (X1, . . . , Xn) = (A1 (X1, . . . , Xn) , . . . , Ak (X1, . . . , Xn)) is Pk i=1 ϵi-differentially private. Lemma A.5. (Parallel Composition) If Ai : X n Si is ϵ-differentially private for all i [k], then the mechanism A : X kn S1 Sk defined as A (X1, . . . , Xkn) = A1 (X1, . . . , Xn) , A2 (Xn+1, . . . , X2n) , . . . , Ak X(k 1)n+1, . . . , Xkn is ϵ-differentially private. A.2.1 Private mean estimation A fundamental task in private statistical inference is to privately estimate the mean based on a set of IID observations. One way to do this is via the global sensitivity method, wherein the standard deviation of the noise scales with the ratio between the range of the distribution and the size of the dataset. In the fairly realistic case where the range is large or unbounded, this leads to highly noisy estimation even in the setting where typical samples are small in size. To remedy this effect, a line of work [42, 39, 14, 40, 24, 11] has looked into designing better private mean estimators for (sub)-Gaussian vectors. Our work will build on one such method: Coin Press [10]. The idea is to iteratively refine an estimate for the parameters until one obtains a small range containing most of the data with high probability; noise is then added proportional to this smaller range. Note that some dependence on the range of the mean is inevitable for estimation with pure differential privacy [33, 15]. A.3 Details for Section 3 A.3.1 General result We will prove a more general theorem than Lemma 1, from which Lemma 1 and related Lemmas using other families of subsets are derived. In Algorithms A.2 and A.3, we present a natural extension of the Coin Press algorithm [10], which is then used to obtain a private estimate of θ with the non-private term matching Var(Un). Originally, this algorithm was used for private mean and covariance estimation of i.i.d. (sub)Gaussian data. 
We extend the algorithm to take as input data {Yj}j [m] such that (i) each Yj is equal to h(XS) for some S, (ii) the Yj s are weakly dependent on each other, and (iii) each Yj, as well as their mean P j [m] Yj/m, has sufficiently strong concentration around the population mean. For instance, suppose m = n/k and Yj = h(XSj) for all j [m], where Sj = {(j 1)k + 1, . . . , (j 1)k + k}. Then Algorithm A.2 reduces to the Coin Press algorithm applied to n/k independent observations h(XS1), h(XS2), . . . , h(XSm). Algorithm A.2 U-Stat Mean n, k, h, {Xi}i [n] , F = {S1, . . . , Sm}, R, ϵ, γ, Q( ), Qavg( ) 1: t log (R/Q(γ)) , [l0, r0] [ R, R] 2: for j = 1, . . . , m do 3: Y0,j h(XSj) 4: end for 5: for i = 1, 2, ..., t do 6: {Yi,j}j [m], [li, ri] U-Stat One Step n, k, {Yi 1,j}, F, [li 1, ri 1], ϵ t , Q( ), Qavg( ) 7: end for 8: {Yt+1,j}j [m], [lt+1, rt+1] U-Stat One Step (n, k, {Yt,j}, F, [lt, rt], ϵ/2, γ, Q( ), Qavg( )) 9: return (lt+1 + rt+1)/2 Algorithm A.3 U-Stat One Step n, k, {Yi}i [m] , F, [l, r], ϵ , β, Q( ), Qavg( ) 1: Yj projl Q(β),r+Q(β) (Yj) for all 1 j m. 2: depn,k (F) (r l + 2Q(β)) m Pm j=1 Yj + W, where W Lap 4: [l, r] h Z Qavg(β) + β , Z + Qavg(β) + 5: return {Yj}j [m], [l, r] Setting 1. Let n and k be positive integers with k n/2, and let h : X k R be a symmetric function and let D be an unknown distribution over X with E h(Dk) = θ such that |θ| < R for some known parameter R. Let m be an integer and F = {S1, S2, . . . , Sm} be a family of not necessarily distinct elements of In,k. Define fi := |{j [m] : i Sj}| the fraction of indices j such that Sj contains i, and define the maximal dependence fraction depn,k (S) := max i [n] fi. (A.31) For each j [m], let Yj denote h(XSj). Clearly, E [Yj] = θ. To allow for small noise addition while ensuring privacy, it is desirable to choose S with small depn,k (S). Define functions Q(β) = Qn,k,h,D,S(β) and Qavg(β) = Qavg n,k,h,D,S(β) on β (0, 1] such that sup j [m] |Yj θ| > Q(β) < β. (A.32) We will refer to Q(β) and Qavg(β) as β-confidence bounds for supj [m] |Yj θ| and 1 j [m] Yj θ , respectively. We apply Theorem A.1 (specifically, the form obtained in Lemma A.6) to different F to obtain private estimates of θ, with statistical and computational tradeoffs depending on the family F. As Remark A.7 suggests, we will also need to privately estimate concentration bounds on the Yj s and their average. Naturally, this requires a private estimate of the variance ζk. We provide guarantees from Biswas et al. [10] for private variance estimation and mean estimation here, where we have translated the mean estimation guarantee to fit our setting. Before we do that, a natural approach to this problem is to view it as a standard private mean estimation task: split the data into n/k equally-sized chunks, apply the function h to each chunk, and run any existing private mean estimation algorithm to these n/k values. We show that the error guarantee of such an algorithm is suboptimal compared to the error guarantee using the all-tuples estimator 1 or even the subsampling estimator 2 if sufficiently many samples are used. Before stating the propositions associated with these families, we state and prove the mother theorem. Theorem A.1. For ϵ > 0, Algorithm A.2 with input n, k, h, {Xi}i [n], F, R, ϵ, γ, Q( ), Qavg( ) returns θn such that m γ | {z } non-private error + depn,k (F) Q(γ) | {z } private error with probability at least 1 6γ,5 as long as depn,k (F) Q(γ)ϵ 10t Q(γ/t) log t/γ and Qavg(γ/t) < Q(γ), (A.34) where t = C log (R/Q(γ)) . 
Moreover, Algorithm A.2 is ϵ-differentially private and runs in time O(n + m log (R/Q(γ)) + k|F|). Remark A.7. Theorem A.1 assumes that Q( ) and Qavg( ) are known, despite the mean θ being unknown. Note that we only need to know the value of these functions at γ and γ/t, for a given γ. If these bounds are not known, we may first need to (privately) compute Q(x) and Qavg(x) and then use those privately computed bounds in the algorithm. For example, if the Y i s are sub-Gaussian with variance proxy 1, then Q(x) = p log(m/x). We will see how to estimate these parameters for various families F of indices used in Algorithm A.2. Proof of Theorem A.1. We will prove privacy and accuracy guarantees separately. Privacy. Algorithm A.2 makes t + 1 calls to Algorithm A.3; let i, Wi, and Zi be the values taken by , W, and Z in the ith call to Algorithm A.3, for 1 i t + 1. Let β := γ t . It can be shown inductively that the interval lengths ri li and the values i do not depend on the dataset. For any 1 i t, note that Yi,j = projli 1 Q(β),ri 1,Q(β) (Yi 1,j) for all 1 j m. Suppose we change Xw to X w for some index w. For any 1 i t + 1, conditioned on the values of Zi for 1 i < i, at most a depn,k (F) fraction of {Yi,j}j [m] depend on w (this is true by the definition of depn,k (F)). Since Yi,j = projli 1 Q(β),ri 1+Q(β/m) (Yi 1,j) has range ri 1 li 1 + 2Q(β), the sensitivity of 1 m Pm j=1 Yi,j is at most depn,k (F) (ri 1 li 1 + 2Q(β)) = i. Therefore, by standard results (cf. Lemma 1), for all 1 i t, the output Zi (and therefore the interval [li, ri]), conditioned on Zi for 1 i < i, is ϵ 2t-differentially private. Similarly, the output (lt+1 + rt+1)/2 = Zt+1, conditioned on {Zi}i [t], is ϵ 2-differentially private. By Basic Composition (see Lemma A.4), Algorithm A.2 is ϵ-differentially private. Utility. First, we show that if Algorithm A.3 is invoked with θ [l, r], it returns an interval [l , r ] such that θ [l , r ] with probability at least 1 3β. Consider running a variant of Algorithm A.2 with the projection step omitted in every call of Algorithm A.3. With probability at least 1 β, we have 1 m Pm i=1 Yi θ Qavg(β), and with probability at least 1 β, we have |W| 5The following subsection modifies the algorithm so that the error depends polylogarithmically in α. Therefore, with probability at least 1 2β, we have |Z θ| Qavg(β) + in which case θ [l , r ]. Finally, reintroducing the projection step only increases the error probability by at most β. Taking a union bound over t steps, we see that θ [l , r ], with probability at least 1 3γ. Next, we claim that if r l > 28Q(γ), then r l (r l)/2. Using the assumption, we have depn,k (F) Q(γ)ϵ 10t Q(γ/t) log t/γ min ϵ 5 log 1/β , Q(γ)ϵ 5Q(β) log 1/β where the second inequality follows from taking ϵ = ϵ 2t and using the fact that Q(γ) Q γ t , since the quantile function is nonincreasing. Furthermore, by the assumption Qavg(β) < Q(γ), we have r l = 2depn,k (F) log 1/β ϵ (r l) + 2Qavg(β) + 4depn,k (F) Q(β) log 1/β 5 + 2Q(γ) + 4 Thus, after t = Ω log R Q(γ) iterations, we are guaranteed that the length of the final interval [lt, rt] is at most 28Q(γ). Finally, consider lines 8 and 9 of Algorithm A.2. The algorithm returns the midpoint of the interval [lt+1, rt+1], which is Zt+1 in the final call of Algorithm A.3. By Chebyshev s inequality, we have with probability at least 1 γ, and with probability at least 1 γ, none of the Yi s are truncated in the projection step in the final call of Algorithm A.3. 
Finally, with probability at least 1 γ, we have Wt+1 = O t+1 = O depn,k (F) Q(γ) The conclusion follows from a union bound over all events. A.3.2 Boosting the error probability via median-of-means Algorithm A.2 incurs a 1/ γ multiplicative factor in the non-private error, stemming from an application of Chebyshev s inequality to bound | Pm j=1 h(XSj)/m θ|. For specific families F, we may be able to provide stronger concentration bounds for Pm j=1 h(XSj)/m in inequality (A.35). Instead, we complement the result of Theorem A.1 by applying the following median-of-means procedure that allows for an improved dependence on the failure probability α with only a log (1/α) multiplicative blowup in the sample complexity: Lemma A.6. Let α (0, 1) and ϵ 0. Let A be an ϵ-differentially private algorithm. Consider a size n dataset Dn i.i.d D, for a distribution D with some unknown parameter θ such that with probability at least 0.75, we have |A(Dn) θ| rn. Split Dn into q := 8 log(1/α) equal independent chunks,6 and run A on each chunk to obtain ϵ-differentially private estimates { θn,i}i [d] of θ. Let θmed n be the median of these q estimates. Then θmed n is ϵ-differentially private, and with probability at least 1 α, we have θmed n θ rn/q. (A.36) 6Assume for simplicity that n is divisible by q and q is an odd integer. Proof. The privacy of θmed n follows from parallel composition (Lemma A.5). For utility, we know from the hypothesis that for each i [q], with probability at least 3/4, the estimate θn,i satisfies | θn,i θ| rn/q. If more than half the estimates θn,i satisfy the above equation, then so does the median. Let Ti be the random variable that assumes the value 0 if θn,i satisfies the above equation and assumes the value 1 otherwise. Then, E[Ti] 1/4, and it suffices to show that Pr (T1 + T2 + + Tq q/2) 1 α. This follows from a standard Hoeffding inequality; as long as q 8 log(1/α), Pr (T1 + T2 + + Tq > q/2) Pr i (Ti E[Ti]) > q/4 e 2(1/4)2q α. A.3.3 Application to the all-tuples family: proof of Proposition 1 For any i [n], there are exactly n 1 k 1 sets S In,k such that i S. Following the notation from the setting of Theorem A.1, we have fi = n 1 k 1 / n k = k n for all i [n], so depn,k (Fall) = k Moreover, for each S In,k, we have P (|h(XS) θ| y) 2 exp y2 2τ . Letting we see that each Yi is within Q(γ) of θ with probability at least γ ( n k). A union bound implies that this choice of Q(γ) is valid. For the concentration of the average 1 j [m] Yj, which is simply Un, we can use Lemma A.3) to see that is a valid choice. We now verify the conditions in Theorem A.1: depn,k (S) = k n Q(γ)ϵ 10t Q(γ/t) log(t/γ) = ϵ 10t log(t/γ) log 2n/γ log 2nt/γ if and only if n 10kt log(t/γ) Recalling that t = C log (R/Q(γ)) , we see that this holds under the sample complexity assumption on n. Furthermore, we have Qavg(γ/t) Q(γ) if and only if s which is also true by assumption. Therefore, with probability at least 1 6γ, we have | θall θ| O Var(Un) + k Algorithm A.3 uses a constant failure probability of γ = 0.01, which assures a success probability of at least 0.75. This is further boosted by Wrapper 1. Now, an application of Lemma A.6 gives the stated result. A.3.4 Application to the naive family Definition A.3. Consider the following estimator: divide the n data points into n/k disjoint chunks, compute h(XS) on each of these chunks, and apply the Coin Press algorithm [10] to obtain a private estimate of the mean θ. We will call this naive estimator ˆθnaive. 
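The proposition below uses the fact that dep_{n,k}(F_naive) = k/n. The following small sketch evaluates the maximal dependence fraction of equation (A.31) on both the all-tuples family of Section A.3.3 and the disjoint-blocks family behind ˆθ_naive; both evaluate to k/n. It is purely illustrative.

```python
from itertools import combinations
from collections import Counter

def dependence_fraction(n, family):
    """dep_{n,k}(F) from equation (A.31): the largest fraction of sets in the
    family containing any single index i in [n]."""
    counts = Counter(i for S in family for i in S)
    return max(counts.get(i, 0) for i in range(n)) / len(family)

if __name__ == "__main__":
    n, k = 12, 3
    all_tuples = list(combinations(range(n), k))
    disjoint_blocks = [tuple(range(j * k, (j + 1) * k)) for j in range(n // k)]
    # Both families have dependence fraction k / n = 0.25 in this example.
    print(dependence_fraction(n, all_tuples), dependence_fraction(n, disjoint_blocks))
```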
The following proposition records the guarantee of the naive estimator ˆθnaive: Proposition A.1. The naive estimator ˆθnaive satisfies |ˆθnaive θ| O with probability at least 0.9, as long as n = Ω k ϵ log R τ . The estimate ˆθnaive is ϵ-differentially private and the algorithm runs in time O n + n k log R τ . First, suppose the variance ζk is known. It is easy to see that depn,k (Fnaive) = k n. By the assumption that h(XS) is τ-sub-Gaussian, we have P(|h(XS) θ| y) 2 exp y2 Hence, with probability at least 1 γ/m, we have 2τ log(2m/γ), (A.37) for each 1 j m, where we use the notation as in the setting of Theorem A.1. By a union bound, we can take the quantile function Q(γ) = r 2τ log 2n kγ . Moreover, since the Yj s are independent, the average 1 j [m] Yj is τ m-sub-Gaussian with variance ζk m . Therefore, we have This yields a bound of Qavg(γ) = q 2kτ log(2/γ) n . It remains to verify the conditions of Theorem A.1. We have k n Q(γ)ϵ 10t Q(γ/t) log(t/γ) k n ϵ 10t log(t/γ) log(2n/kγ) log(2nt/kγ), Qavg(γ/t) < Q(γ) 2kτ log(2t/γ) 2τ log(2n/kγ) n k log(2t/γ) log(2n/kγ), which are both true by the sample size assumption, noting that t = C log(R/Q(γ)) . Therefore, with probability at least 1 6γ, we have ˆθnaive θ O 1 γ Var(ˆθnaive) + k 2τ log(2n/kγ) log 1 Choosing γ to be an appropriate constant, we arrive at the deisred result. A.3.5 Application to the subsampled family Unlike the all-pairs family Sall, the subsampled Sss in Definition 2 is randomized. Define ˆθss = PM j=1 h(XSj)/M. Recall from our discussion before (cf. Theorem A.1) that we want each of the h(XSj) s, as well as ˆθss, to concentrate around θ, and we also want depn,k (Sss) to be small. As we show, the former concentration holds as in the all-tuples case, and the latter holds with high probability. Proposition A.2. Let Mn = Ω((n/k) log n). Then Wrapper 1, with f = ss, failure probability α, algorithm A = U-Stat Mean (Algorithm A.2), S(Ii) a set of Mnα i.i.d. subsets of size k picked with replacement from Ii, returns an estimate θss such that, with probability at least 1 α, | θss θ| O p Var(Unα) + O ζk Mnα + k3/2 τ as long as nα = Ω k . Moreover, the estimator θss is ϵ-differentially private and runs in time O log(1/α) k + log R First, we need the following helper lemmas: Lemma A.7. Define ˆθss = 1 M P S Fss h(XS). We have Var h ˆθss i = 1 1 M Var(Un) + 1 M ζk. Proof. Clearly, E[ˆθss] = θ. We compute both terms of the following decomposition of the variance of ˆθss separately; recall that X = {Xi}i [n]: Var ˆθss = Var E h ˆθss|X i + E h Var ˆθss|X i . Var E h ˆθss|X i = Var j [M] h (XSi) E h Var ˆθss X i = E M E [Var (h(XS)| X)] S In,k h(XS)2 S In,k h(XS) M ζk + θ2 Var(Un) + θ2 = ζk Var(Un) Adding the two equalities yields the result. Lemma A.8. Let γ > 0, and let M = Ω( n γ ). Then depn,k (Fss) 4k n with probability at least 1 γ. Proof. Let Zi be the number of sampled subsets of which i is an element. Observe that Zi Binom(M, k/n), with mean µ := Mk/n. By a Chernoff bound, for any δ > 0 and any i [n], we have P (Zi (1 + δ)µ) eδ By a union bound, we have P depn,k (Fss) > 4k = P max i Zi > 4µ n e3 which is at most γ by our choice of M. Let Gγ denote the event that depn,k (Fss) 4k n , which occurs with high probability by Lemma A.8. Note also that conditioned on any family Fss of subsets of In,k, the run of Algorithm A.2 is ϵdifferentially private. Since the randomness of Fss is independent of the data, the algorithm (along with the private variance estimation) is still 2ϵ-differentially private. 
and Qavg(γ) = 4 τk min (M, n) log 4n We show that these are indeed the corresponding confidence bounds for YS, S Fss and ˆθss. By a sub-Gaussian tail bound, for any S In,k, the probability that |h(XS) θ| > Q(γ) is at most 2 γ 4n k γ 2nk . By a union bound over all n k sets S, we then have |h(XS) θ| Q(γ) for all S In,k, with probability at least 1 γ 2 . Call this event Eγ. Next, E[ˆθss|X1, . . . Xn] = Un. Moreover, for any c > 0, P |ˆθss θ| c P |ˆθss Un| c/2 + P (|Un θ| c/2) EX1,...,Xn P |ˆθss Un| c/2|X1, . . . , Xn + 2 exp nc2 where we used Lemma A.3 to bound the second term. For the first term in inequality (A.40), note that conditioned on the data X1, . . . , Xn, the h(XSj) s are independent draws from a uniform distribution over the n k values {h(XS)}S In,k, with mean Un, and the h(XSj) θ s are bounded by max S In,k |h(XS) θ| Q(γ). Therefore, each h(XSj) Un is sub-Gaussian(Q(γ)2), implying that E h P |ˆθss Un| c/2|X1, . . . , Xn i 2E exp Mc2 |X1, . . . , Xn, Eγ 16τk log(4n/γ) 16τk log(4n/γ) Combining inequalities (A.40) and (A.39), we have P |ˆθss θ| c 2 exp Mc2 16τk log(4n/γ) 2 + 2 exp nc2 τk min (M, n) log 2n This justifies the choice of Qavg(γ). We now verify the conditions in Theorem A.1. Conditioned on Gγ, we have depn,k(S) = 4k n Q(γ)ϵ 10t Q(γ/t) log(t/γ) = ϵ 10t log(t/γ) log 4n/γ log 4nt/γ if and only if n 40kt log(t/γ) Recalling that t = C log (R/Q(γ)) , we see that the above holds under the sample complexity assumption on n. Furthermore, Qavg(γ/t) Q(γ) iff τk min (M, n) log 4nt min(M, n) log(4nt/γ)2 k log n γ , the assumption on the sample complexity of n implies the above result. Conditioned on Eγ, the projection steps are never invoked in Algorithm A.2 or A.3, so we have θss = ˆθss + Wt+1, where Wt+1 is a Laplace random variable with parameter 2 ϵ , where = depn,k(F)Q(γ) ϵ (coming from the noise added to 1 M PM i=1 Yi in the (t + 1)th step of Algorithm A.2). Finally, using Lemma A.8, we have | θss ˆθss| = |Wt+1| 2 γ = 2depn,k (S) Q(γ) on the event Gγ. Combined with Lemmas A.6, A.7, inequality (A.40), and Theorem A.1, with probability at least 1 7γ, we obtain Var(Un) + 1 γ The success probability is 1 7γ instead of 1 6γ because we also require depn,k (Fss) 4k n , which holds with probability 1 γ as in Lemma A.8. Algorithm A.3 uses a constant failure probability of γ = 0.01, which assures a success probability of at least 0.75. This is further boosted by Wrapper 1. Now, an application of Lemma A.6 gives the stated result. A.4 Proofs of Lower Bounds In this appendix, we provide the proofs of our two lower bound results. A.4.1 Proof of Theorem 1 Lemma A.9 (Lemma 6.2 in [41]). Let P = {P1, P2, . . . } be a finite family of distributions over a domain X such that for any i = j, the total variation distance between Pi and Pj is at most α. Suppose there exists a positive integer n and an ϵ-differentially private algorithm B : X n [|P|] such that for every Pi P, we have Pr X1,...,Xn Pi,B (B(X1, . . . , Xn) = i) 2/3. Then, n = Ω log |P| Lemma A.10 (Proposition 4.1 in [4]). The Bernoulli distribution Bern(p) is sub-Gaussian with optimal variance proxy τp, where τp = τ1 p = log 1 p 1 , for p (0, 1) \ {1/2}. In particular, if 0 < p < 1/10, then τp 1 2 log 1 2p . Define D0 = Bern(1) and D1 = Bern(1 β), where β = c nϵ and c > 0 is small enough such that kβ < 1/10. The TV-distance between D0 and D1 is β. Since n = c βϵ, we also choose c small enough such that Lemma A.9 is violated: for any ϵ-differentially private algorithm B : {0, 1}n {0, 1}, there exists an i {0, 1} such that Pr (B(X1, . . 
. , Xn) = i) < 2/3, (A.41) by Lemma A.9. Consider now the task of ϵ-privately estimating the parameter θ(D) := EX1,...,Xk D [h(X1, X2, . . . , Xk)] , where Xi D and h(X1, X2, . . . , Xk) = 1 τ(1 β)k 1 (X1 = X2 = = Xk = 1) , for some distribution D. Suppose there exists an ϵ-differentially private algorithm A such that EX1,...,Xn D,A h |A(X1, . . . , Xn) θ(D)| i 1 τ(1 β)k , (A.42) for any D such that the distribution H of h(X1, . . . , Xk) is sub-Gaussian(1). If D = Bern(1) or Bern(1 β), Lemma A.10 shows that the distribution H of h(X1, . . . , Xk) is indeed sub-Gaussian(1). If inequality (A.42) holds, then by Markov s inequality, |A(X1, . . . , Xn) θ (Di)| 3 for i {0, 1}. Also, θ (D0) = 1/ τ1 β and θ (D1) = (1 β)k/ τ1 β. The difference between these means is θ (D0) θ (D1) = 1 (1 β)k Therefore, the following algorithm violates inequality (A.41): Run A on X1, . . . , Xn to obtain θ. Output 0 if θ is closer to θ(D0) than to θ(D1), and output 1 otherwise. This implies inequality (A.42) does not hold, so we have a lower bound on the expected error. Theorem 1 follows from the calculation 2 log 1 2 (1 (1 β)k) = Θ kβ r where we used 1 kβ < (1 β)k < 1 kβ/2 for kβ < 1/10. A.4.2 Proof of Theorem 3 Consider two datasets D0 and D1 of size n each, differing in at most 1/ϵ data points. Suppose EA|A(D) Un(D)| < 1 10|Un(D0) Un(D1)| for D {D0, D1}. By Markov s inequality, we have Pr |A(D) Un(D)| < 1 2|Un(D0) Un(D1)| 0.8, for D {D0, D1}. Moreover, since A is ϵ-differentially private, we have Pr |A(D1) Un(D0)| < |Un(D0) Un(D1)| e Pr |A(D0) Un(D0)| < |Un(D0) Un(D1)| By the triangle inequality, the event n |A(D1) Un(D0)| < |Un(D0) Un(D1)| 2 o is disjoint from the event n |A(D1) Un(D1)| < |Un(D0) Un(D1)| 2 o . The sum of the probabilities of these two events is at least 0.8 + 0.8e 1 > 1, a contradiction. Therefore, A has expected error Ω(|Un(D0) Un(D1)|) on at least one of D0 or D1. Next, we will define appropriate choices of D0 and D1. For simplicity, assume 1/ϵ is an integer. Define bn = k +k1/(2k 2)n1 1/(2k 2) 1 ϵ . The assumed range of ϵ implies that bn 2k/ϵ. Let h(x1, . . . , xk) = 1(x1 = = xk). We define D0 such that xi = 1, i bn, i, i > bn. We define D1 = {y1, . . . , yn} such that yi = xi for all i {bn + 1, . . . , bn + 1/ϵ} and yi = 1 for bn < i bn + 1 Un(D1) Un(D0) = bn+1/ϵ k n k Furhtermore, for k ϵ bn, using the fact that (1 + x)r 1 1 rx for x [ 1, 1/r), we have bn+1/ϵ k = 1 bn i bn + 1/ϵ i 1 bn bn + 1/ϵ = 1 1 1/ϵ bn + 1/ϵ 1 + k/ϵ bn+1/ϵ = k/ϵ bn + (k + 1)/ϵ, implying that Un(D1) Un(D0) k/ϵ bn + (k + 1)/ϵ For i with xi = 1 in D0, we have ˆh D0(i) = ( bn 1 k 1 ) ( n 1 k 1) Un(D0). For i with xi = 1 in D1, we have ˆh D1(i) = ( bn+1/ϵ 1 k 1 ) ( n 1 k 1) Un(D1). Therefore, we have |ˆh D(i) Un(D)| ˆh D(i) ξ for all i and D {D0, D1}, where bn+1/ϵ 1 k 1 by our choice of bn. Moreover, we have ξ bn + 1/ϵ k k 1 k1/(2k 2)n1 1/(2k 2) k n. (A.44) By inequality (A.43) and the definition of ξ, we see that Un(D1) Un(D0) k/ϵ bn + 2k/ϵ bn + 1/ϵ n ξ k 3nϵξ, where the second inequality follows from the assumption that k/ϵ bn. Using the lower bound on ξ as in inequality (A.44), we obtain the desired result. A.5 Proof of Theorems 2 and 4 We first prove Theorem 4 with S equal to any subsampled family that satisfies the inequalities (17). In particular, the following lemma guarantees that the required bounds hold with high probability for a subsampled family chosen uniformly at random from In,k: Lemma A.11. Let γ > 0, and let M = Ω n2 k2 log n γ . Let S be a collection of M i.i.d sets sampled uniformly from In,k. 
Lemma A.11. Let $\gamma > 0$, and let $M = \Omega\bigl(\frac{n^2}{k^2}\log\frac{n}{\gamma}\bigr)$. Let $\mathcal S$ be a collection of $M$ i.i.d. sets sampled uniformly from $I_{n,k}$. For each $i \in [n]$, let $\mathcal S_i$ be the collection of sets in $\mathcal S$ containing $i$, and define $M_i = |\mathcal S_i|$. For each pair of distinct $i, j \in [n]$, let $\mathcal S_{ij}$ be the collection of sets in $\mathcal S$ containing both $i$ and $j$, and define $M_{ij} = |\mathcal S_{ij}|$. With probability at least $1-\gamma$, for all distinct indices $i$ and $j$, we have
$$\frac{kM}{2n} \le M_i \le \frac{2kM}{n}, \qquad \frac{k^2M}{2n^2} \le M_{ij} \le \frac{2k^2M}{n^2}.$$

Proof. Note that $M_i \sim \mathrm{Binom}(M, k/n)$. By a multiplicative Chernoff bound with $\delta = 1/2$, for any $i \in [n]$, we have
$$\Pr\Bigl(\bigl|M_i - \tfrac{kM}{n}\bigr| \ge \tfrac{1}{2}\cdot\tfrac{kM}{n}\Bigr) \le 2e^{-(0.5)^2(kM/n)/3} = e^{-\Omega\bigl(\frac{n}{k}\log\frac{n}{\gamma}\bigr)},$$
which is much smaller than $\frac{\gamma}{2n}$. Call this event $E_i$, and for the remaining argument, assume $E_i$ holds for all $i$ (which, by a union bound, holds with probability at least $1-\frac{\gamma}{2}$). This gives the first inequality.

For any distinct $i, j \in [n]$, conditioned on the value of $M_i$, we have $M_{ij} \sim \mathrm{Binom}\bigl(M_i, \frac{k-1}{n-1}\bigr)$. By a Chernoff bound, for any $i, j \in [n]$ with $j \ne i$, we have
$$\Pr\Bigl(\bigl|M_{ij} - \tfrac{(k-1)M_i}{n-1}\bigr| \ge \tfrac{1}{2}\cdot\tfrac{(k-1)M_i}{n-1}\Bigr) \le 2e^{-(0.5)^2\frac{(k-1)M_i}{3(n-1)}} \le \frac{\gamma}{2n^2},$$
using the fact that $M_i \ge \frac{kM}{2n}$ and our assumption on $M$. The second inequality then follows from a union bound over all pairs of indices.

A.5.1 Proof of Theorem 4

Consider two adjacent datasets $X = (X_1, X_2, \ldots, X_n)$ and $X' = (X'_1, X'_2, \ldots, X'_n)$ differing only in the index $i^*$, that is, $X'_i = X_i$ for all $i \ne i^*$. Throughout the proof, we will use the superscript prime to denote quantities related to $X'$. Let $B := (\mathrm{Bad}(X) \cup \mathrm{Bad}(X')) \setminus \{i^*\}$ and $G := (\mathrm{Good}(X) \cap \mathrm{Good}(X')) \setminus \{i^*\}$. Then
$$m\bigl(\bar A_n - \bar A'_n\bigr) = \sum_{S \in \mathcal S} \bigl(g(X_S) - g(X'_S)\bigr) = \underbrace{\sum_{\substack{S:\, S\cap B \ne \emptyset \\ i^* \notin S}} \bigl(g(X_S) - g(X'_S)\bigr)}_{T_1} + \underbrace{\sum_{\substack{S:\, S\cap B = \emptyset \\ i^* \notin S}} \bigl(g(X_S) - g(X'_S)\bigr)}_{T_2} + \underbrace{\sum_{S:\, i^* \in S} \bigl(g(X_S) - g(X'_S)\bigr)}_{T_3}. \tag{A.45}$$
We bound each of the three terms separately. The term $T_2$ is equal to 0: all indices $i \in S$ have weight 1, and $i^* \notin S$, so $g(X_S) = h(X_S) = h(X'_S) = g(X'_S)$. We prove some preliminary lemmas before bounding the first and last terms. Recall the definitions of $L$ and $\mathrm{wt}$ in equations (12) and (14), respectively. (Figure A.1 illustrates the weighting scheme in equation (14).)

Lemma A.12. We have:
(i) $|A_n - A'_n| \le \frac{2kC}{n}$ and $|L - L'| \le 1$.
(ii) For all $i \ne i^*$, we have
$$\bigl|\,|\hat h'(i) - A'_n| - |\hat h(i) - A_n|\,\bigr| \le \frac{4kC}{n}, \tag{A.46}$$
$$|\mathrm{wt}(i) - \mathrm{wt}'(i)| \le \epsilon, \tag{A.47}$$
and for $S$ such that $i^* \notin S$,
$$|\mathrm{wt}(S) - \mathrm{wt}'(S)| \le \epsilon. \tag{A.48}$$

Proof. For (i), note that
$$|A_n - A'_n| = \frac{1}{m}\Bigl|\sum_{S \in \mathcal S_{i^*}} \bigl(h(X_S) - h(X'_S)\bigr)\Bigr| \le \frac{2C\,M_{i^*}}{m} \le \frac{2kC}{n}, \tag{A.49}$$
where the last inequality comes from Lemma A.11. Similarly, for any $i \ne i^*$, we have
$$|\hat h(i) - \hat h'(i)| = \frac{1}{M_i}\Bigl|\sum_{S \in \mathcal S_{i\,i^*}} \bigl(h(X_S) - h(X'_S)\bigr)\Bigr| \le \frac{2C\,M_{i\,i^*}}{M_i} \le \frac{2kC}{n}. \tag{A.50}$$
Therefore, if an index $i \ne i^*$ is in $\mathrm{Good}(X)$, using inequalities (A.49) and (A.50), we have
$$|\hat h'(i) - A'_n| \le |\hat h'(i) - \hat h(i)| + |\hat h(i) - A_n| + |A_n - A'_n| \le \frac{2kC}{n} + \xi + \frac{6kCL}{n} + \frac{2kC}{n} = \xi + \frac{kC(4+6L)}{n}, \tag{A.51}$$
which leaves at most $1 + L$ potential indices $i$ for which $|\hat h'(i) - A'_n| > \xi + \frac{6kC(1+L)}{n}$: the indices in $\mathrm{Bad}(X)$ and the index $i^*$. Therefore, $L' \le L + 1$. Similarly, $L \le L' + 1$.

For (ii), note that from inequalities (A.49) and (A.50), we have $\bigl|\,|\hat h'(i) - A'_n| - |\hat h(i) - A_n|\,\bigr| \le \frac{4kC}{n}$ for $i \ne i^*$. Recalling the definition (14), this implies that the difference between the weights on an index $i$ can never be greater than $\epsilon$. Finally, note that the weight of a subset $S$ is $\mathrm{wt}(S) = \min_{i \in S} \mathrm{wt}(i)$. Now, by inequality (A.47), each $\mathrm{wt}(i)$ differs from $\mathrm{wt}'(i)$ by at most $\epsilon$. Say $a = \arg\min_{i\in S} \mathrm{wt}(i)$. In order to make the difference between $\mathrm{wt}(S)$ and $\mathrm{wt}'(S)$ large, we would set $\mathrm{wt}'(a) = \mathrm{wt}(a) + \epsilon$ and take some other $b \in S$ with $\mathrm{wt}'(b) = \mathrm{wt}(b) - \epsilon$ such that $\mathrm{wt}'(b) \le \mathrm{wt}'(a)$. But then $\mathrm{wt}(a) \le \mathrm{wt}(b) \le \mathrm{wt}(a) + 2\epsilon$, so $\mathrm{wt}'(S)$ lies within $\epsilon$ of $\mathrm{wt}(S) = \mathrm{wt}(a)$. This completes the proof of inequality (A.48).

Next, we show that the weighted Hájek variants $\hat g(i)$ are close to the empirical mean $A_n$ and have low sensitivity; a short illustrative sketch of the weighting computation follows, after which we state the lemma.
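The following is a minimal Python sketch (illustrative names; the exact threshold and slope should be taken from equations (12) and (14), so `lo` and `hi` below are stand-ins) of the quantities used in this part of the argument: the local Hájek projections $\hat h(i)$, the truncation weights $\mathrm{wt}(i)$, the set weights $\mathrm{wt}(S) = \min_{i\in S}\mathrm{wt}(i)$, and the reweighted average of $g(X_S) = A_n + (h(X_S) - A_n)\,\mathrm{wt}(S)$.

```python
import numpy as np

def hajek_projections(family, data, h, n):
    """hat_h(i): average of h(X_S) over the sets S in the family that contain i."""
    sums, counts = np.zeros(n), np.zeros(n)
    for S in family:
        val = h(tuple(data[i] for i in S))
        for i in S:
            sums[i] += val
            counts[i] += 1
    return sums / np.maximum(counts, 1)

def weights(hat_h, A_n, lo, hi):
    """wt(i): 1 if |hat_h(i) - A_n| <= lo, 0 if >= hi, linear in between (assumes hi > lo).
    lo and hi are placeholders for the thresholds in equation (14)."""
    dev = np.abs(hat_h - A_n)
    return np.clip((hi - dev) / (hi - lo), 0.0, 1.0)

def reweighted_average(family, data, h, wt, A_n):
    """Weighted estimator: mean over S of g(X_S) = A_n + (h(X_S) - A_n) * min_{i in S} wt(i)."""
    vals = []
    for S in family:
        w_S = min(wt[i] for i in S)
        vals.append(A_n + (h(tuple(data[i] for i in S)) - A_n) * w_S)
    return float(np.mean(vals))
```

When no projection deviates from $A_n$ by more than the lower threshold, every weight equals 1 and the reweighted average coincides with the unweighted one, which is the situation exploited in the utility analysis below.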
Lemma A.13. For all indices $i$, we have $|\hat g(i) - A_n| \le \xi + \frac{9kCL}{n\epsilon}$. Moreover, if $i \ne i^*$, we have
$$\bigl|(\hat g(i) - A_n) - (\hat g'(i) - A'_n)\bigr| \le \xi + \frac{kC(14+6L)}{n^2}(1+2L).$$

Proof. If $\mathrm{wt}(i) = 0$, then $g(X_S) = A_n$ for all $S \ni i$, and hence $\hat g(i) = A_n$. Otherwise, we write
$$\hat g(i) - A_n = \frac{1}{M_i}\sum_{S\in \mathcal S_i}\bigl(h(X_S) - A_n\bigr)\,\mathrm{wt}(S) = \bigl(\hat h(i)-A_n\bigr)\,\mathrm{wt}(i) + \frac{1}{M_i}\sum_{S\in \mathcal S_i}\bigl(h(X_S)-A_n\bigr)\bigl(\mathrm{wt}(S)-\mathrm{wt}(i)\bigr). \tag{A.52}$$
From equation (14) and the assumption that the weight of index $i$ is strictly positive, the magnitude of the first term in equation (A.52) is bounded by $\xi + \frac{6kCL}{n\epsilon}$. For the second term, note that $\mathrm{wt}(S) = \mathrm{wt}(i)$ unless an index $j \in S$ with a lower weight than $i$ exists. Such an index $j$ is necessarily in $\mathrm{Bad}(X)$. Therefore, the absolute value of the second term is bounded above by
$$\frac{1}{M_i}\sum_{j\in\mathrm{Bad}(X),\,j\ne i} M_{ij}\,C \le \frac{2kCL}{n}.$$
This proves the first part of the lemma.

To bound the sensitivity of $\hat g(i)$, by the triangle inequality, we have
$$\bigl|(\hat h(i)-A_n)\,\mathrm{wt}(i) - (\hat h'(i)-A'_n)\,\mathrm{wt}'(i)\bigr| \le |\hat h(i)-A_n|\,|\mathrm{wt}(i)-\mathrm{wt}'(i)| + \bigl|(\hat h(i)-A_n) - (\hat h'(i)-A'_n)\bigr|\,\mathrm{wt}'(i) \le \Bigl(\xi + \frac{kC(4+6L)}{n}\Bigr)\epsilon + \frac{4kC}{n} \le \xi + \frac{kC(10+6L)}{n}, \tag{A.53}$$
where the argument for the second inequality is as follows. To bound the first term, note that when $\mathrm{wt}(i) = \mathrm{wt}'(i)$, it is zero. If $\mathrm{wt}(i) > 0$, then $|\hat h(i) - A_n| \le \xi + \frac{6kCL}{n\epsilon}$. Now, if $|\hat h(i) - A_n| > \xi + \frac{6kCL}{n\epsilon}$, then $\mathrm{wt}(i) = 0$, and by inequality (A.46) of Lemma A.12, we see that $|\hat h'(i) - A'_n| > \xi + \frac{6kCL}{n}$, so $\mathrm{wt}'(i)$ will also be zero. The second term is bounded directly by Lemma A.12. Overall, we arrive at a bound on the sensitivity of the first term in equation (A.52).

For the sensitivity of the second term of equation (A.52), note that if $i$ has minimum weight among the indices in $S$, then $\mathrm{wt}(S) = \mathrm{wt}(i)$. Otherwise, some index $j \in S$ has strictly lower weight than $i$. Such an index $j$ is necessarily in $\mathrm{Bad}(X) \cup \mathrm{Bad}(X')$, because it has weight less than 1. If $S$ also does not contain the index $i^*$, then $h(X_S) = h(X'_S)$, so by Lemma A.12, we have
$$\bigl|(h(X'_S) - A'_n) - (h(X_S) - A_n)\bigr| \le |A_n - A'_n| \le \frac{2kC}{n}, \qquad \bigl|(\mathrm{wt}'(S) - \mathrm{wt}'(i)) - (\mathrm{wt}(S) - \mathrm{wt}(i))\bigr| \le 2\epsilon,$$
and letting
$$T_S := \bigl(h(X_S) - A_n\bigr)\bigl(\mathrm{wt}(S) - \mathrm{wt}(i)\bigr) - \bigl(h(X'_S) - A'_n\bigr)\bigl(\mathrm{wt}'(S) - \mathrm{wt}'(i)\bigr),$$
we have $|T_S| \le \frac{2kC}{n} + 2C\epsilon$. Moreover, there are at most $M_{i\,i^*}$ sets $S$ containing both $i$ and $i^*$, and for each such set $S$, the change in $(h(X_S) - A_n)(\mathrm{wt}(S) - \mathrm{wt}(i))$ is at most $2C$, since the weights lie in $[0,1]$ and $|h(X_S) - A_n| \le C$. Combining these bounds, we obtain
$$\Bigl|\sum_{S\in \mathcal S_i} T_S\Bigr| \le \sum_{S\in \mathcal S_i\setminus \mathcal S_{i\,i^*}} |T_S|\,\mathbf 1\bigl(|S\cap(\mathrm{Bad}(X)\cup\mathrm{Bad}(X'))| > 0\bigr) + 2C\,M_{i\,i^*} \le \Bigl(\frac{2kC}{n}+2C\epsilon\Bigr)\bigl(|\mathrm{Bad}(X)| + |\mathrm{Bad}(X')|\bigr)\sup_{j\in\mathrm{Bad}(X)\cup\mathrm{Bad}(X')} M_{ij} + 2C\,M_{i\,i^*} \le \Bigl(\frac{2kC}{n}+2C\epsilon\Bigr)(1+2L)\sup_{j} M_{ij} + \frac{4k^2CM}{n^2}, \tag{A.54}$$
where the first inequality uses the fact that the weights are all equal if $S\cap(\mathrm{Bad}(X)\cup\mathrm{Bad}(X')) = \emptyset$, and the last inequality uses Lemma A.11. Combining inequalities (A.53) and (A.54) into equation (A.52) yields the result.

To bound the term $T_1$ in (A.45), we decompose it as
$$T_1 = \sum_{i\in B}\ \sum_{\substack{S\in \mathcal S_i \\ i^*\notin S}} \bigl(g(X_S) - g(X'_S)\bigr) \;-\; \sum_{a=2}^{\min(k,|B|)} (a-1) \sum_{\substack{S \in \mathcal S:\ |S\cap B| = a \\ i^*\notin S}} \bigl(g(X_S) - g(X'_S)\bigr). \tag{A.55}$$
The first term sums over all subsets that contain some element in $B$. However, this leads to over-counting every subset with $a$ elements in common with $B$ exactly $a-1$ times. The second term corrects for this over-counting, akin to an inclusion–exclusion argument. The following lemmas bound each of the two terms.

Lemma A.14. For all $i \in B$, we have
$$\Bigl|\sum_{S\in \mathcal S_i,\ i^*\notin S} \bigl(g(X_S) - g(X'_S)\bigr)\Bigr| \le M_i\Bigl(\xi + \frac{kC(14+6L)}{n^2}(1+2L)\Bigr) + 2C\,M_{i\,i^*}.$$

Proof. We have
$$\sum_{S\in \mathcal S_i,\ i^*\notin S}\bigl(g(X_S) - g(X'_S)\bigr) = M_i\bigl(\hat g(i) - \hat g'(i)\bigr) - \sum_{S \in \mathcal S_{i\,i^*}} \bigl(g(X_S) - g(X'_S)\bigr). \tag{A.56}$$
By Lemmas A.12 and A.13, we have
$$|\hat g(i) - \hat g'(i)| \le \xi + \frac{kC(14+6L)}{n^2}(1+2L).$$
Moreover, the second term in equation (A.56) is clearly upper-bounded by $2C\,M_{i\,i^*}$. Summing these two bounds yields the result.

For the second term of equation (A.55), we use the following lemma:
Lemma A.15. We have
$$\sum_{a=2}^{\min(k,|B|)}(a-1)\!\!\sum_{\substack{S\in \mathcal S:\,|S\cap B|=a\\ i^*\notin S}}\!\!\bigl|g(X_S)-g(X'_S)\bigr| \;\le\; \Bigl(C\epsilon + \frac{2kC}{n}\Bigr)\,\frac{36k^2L^2\min(k,2L)M}{n^2}.$$
Moreover, if $\mathcal S = I_{n,k}$, with $m = \binom{n}{k}$, we have the stronger inequality
$$\sum_{a=2}^{\min(k,|B|)}(a-1)\!\!\sum_{\substack{S\in \mathcal S:\,|S\cap B|=a\\ i^*\notin S}}\!\!\bigl|g(X_S)-g(X'_S)\bigr| \;\le\; \Bigl(C\epsilon + \frac{2kC}{n}\Bigr)\,\frac{9k^2L^2\,m}{n^2}.$$

Proof. For any $S$ not containing $i^*$, we have
$$|g(X_S) - g(X'_S)| = \bigl|(h(X_S) - A_n)\,\mathrm{wt}(S) + A_n - (h(X'_S) - A'_n)\,\mathrm{wt}'(S) - A'_n\bigr| \le \bigl|(h(X_S) - A_n)(\mathrm{wt}(S) - \mathrm{wt}'(S))\bigr| + \bigl|(h(X_S) - h(X'_S))\,\mathrm{wt}'(S)\bigr| + |A_n - A'_n| \le C\epsilon + \frac{2kC}{n},$$
using Lemma A.12 and the fact that the second term is zero. Moreover, we have
$$\sum_{a=2}^{\min(k,|B|)}\ \sum_{\substack{S\in \mathcal S:\,|S\cap B|=a\\ i^*\notin S}} (a-1) \le \min(k,|B|)\sum_{\substack{i,j\in B\\ i\ne j}} M_{ij} \le \frac{9k^2}{n^2}\min(k,|B|)\,|B|^2\,m.$$
The last inequality follows from Lemma A.11, which implies that $M_{ij} = O\bigl(\frac{k^2}{n^2}\bigr)M$.

In the case when $\mathcal S = I_{n,k}$, we have
$$\sum_{a=2}^{\min(k,|B|)} (a-1)\binom{|B|}{a}\binom{n-|B|}{k-a} = \binom{n}{k}\Bigl(\frac{k|B|}{n} - 1 + \frac{\binom{n-|B|}{k}}{\binom{n}{k}}\Bigr) \le \binom{n}{k}\Bigl(\frac{k|B|}{n} - 1 + \frac{1}{1 + k|B|/n}\Bigr) \le \frac{k^2|B|^2}{n^2}\binom{n}{k},$$
where the first equality used the identities $\sum_{a=0}^{k}\binom{n}{a}\binom{m}{k-a} = \binom{n+m}{k}$ and $\sum_{a=0}^{k} a\binom{n}{a}\binom{m}{k-a} = \frac{nk}{m+n}\binom{n+m}{k}$, and the third inequality used the fact that $(1-x)^k \le \frac{1}{1+kx}$ for all $x \in [0,1]$. The statement in the lemma follows because $|B| \le 2L+1 \le 3L$.

Combining the results of Lemmas A.14 and A.15 yields Lemma A.16.

Lemma A.16. Combining Lemmas A.14 and A.15, we obtain a bound on $|T_1|$ that holds with $\gamma = 4\min(k, 2L)$. If $\mathcal S = I_{n,k}$, then the bound also holds for $\gamma = 1$.

It remains to bound the third term, $T_3$, of equation (A.45), which we do in the following lemma:

Lemma A.17. We have $\frac{1}{M_{i^*}}|T_3| \le 2\xi + \frac{kC(11+18L)}{n\epsilon}$.

Proof. Using Lemmas A.12 and A.13, we have
$$\frac{1}{M_{i^*}}|T_3| = |\hat g(i^*) - \hat g'(i^*)| \le |\hat g(i^*) - A_n| + |\hat g'(i^*) - A'_n| + |A_n - A'_n| \le \Bigl(\xi + \frac{9kCL}{n\epsilon}\Bigr) + \Bigl(\xi + \frac{9kCL'}{n\epsilon}\Bigr) + \frac{2kC}{n} \le 2\xi + \frac{kC(11+18L)}{n\epsilon}.$$
The lemma follows after using the fact that $M_{i^*} \le \frac{2kM}{n}$.

Combining the bounds on $T_1$ and $T_3$ from Lemmas A.16 and A.17 in equation (A.45), the local sensitivity of $\bar A_n$ at $X$ is then bounded as
$$LS_{\bar A_n}(X) = O\Bigl(\frac{k}{n}\Bigl(\xi + \frac{kCL}{n}\Bigr)(1+\epsilon L) + \frac{k^2CL^2\min(k,L)}{n^2}\Bigr).$$
Let $g(\xi, L, n)$ denote the upper bound, where to simplify the following argument, we assume the constant prefactor is 1, i.e.,
$$g(\xi, L, n) := \frac{k}{n}\Bigl(\xi + \frac{kCL}{n}\Bigr)(1+\epsilon L) + \frac{k^2CL^2\min(k,L)}{n^2}.$$
Note that $g$ is strictly increasing in $L$. Also define
$$S(X) = \max_{\ell\in\mathbb{Z}_{\ge 0}} e^{-\epsilon\ell}\, g(\xi, L_X + \ell, n). \tag{A.57}$$

Lemma A.18. The function $S(X)$ is an $\epsilon$-smooth upper bound on $LS_{\bar A_n}(X)$. Moreover,
$$S(X) = O\Bigl(\frac{k}{n}\Bigl(\xi + \frac{kC(1/\epsilon + L)}{n}\Bigr)(1+\epsilon L) + \frac{k^2C(1/\epsilon+L)^2\min(k, 1/\epsilon+L)}{n^2}\Bigr).$$

Proof of Lemma A.18. Clearly, we have $S(X) \ge g(\xi, L_X, n) \ge LS_{\bar A_n}(X)$, and for any two adjacent $X$ and $X'$, we have
$$S(X') = \max_{\ell\in\mathbb{Z}_{\ge 0}} e^{-\epsilon\ell} g(\xi, L_{X'}+\ell, n) \le \max_{\ell\in\mathbb{Z}_{\ge 0}} e^{-\epsilon\ell} g(\xi, L_X + \ell + 1, n) = \max_{\ell\in\mathbb{Z}_{> 0}} e^{-\epsilon(\ell-1)} g(\xi, L_X+\ell, n) \le e^{\epsilon}\max_{\ell\in\mathbb{Z}_{\ge 0}} e^{-\epsilon\ell} g(\xi, L_X+\ell, n) = e^{\epsilon}S(X).$$
This shows that $S$ is indeed an $\epsilon$-smooth upper bound on the local sensitivity. As for the upper bound on $S$, for any $\ell \ge 0$, we have
$$e^{-\epsilon\ell} g(\xi, L_X+\ell, n) \le \frac{k}{n}\Bigl(\xi e^{-\epsilon\ell/2} + \frac{kC\bigl(\ell e^{-\epsilon\ell/2} + Le^{-\epsilon\ell/2}\bigr)}{n}\Bigr)\Bigl(e^{-\epsilon\ell/2} + \epsilon\bigl(\ell e^{-\epsilon\ell/2} + Le^{-\epsilon\ell/2}\bigr)\Bigr) + \frac{k^2C}{n^2}\bigl(\ell e^{-\epsilon\ell/3} + Le^{-\epsilon\ell/3}\bigr)^2 \min\bigl(k e^{-\epsilon\ell/3},\, \ell e^{-\epsilon\ell/3}+Le^{-\epsilon\ell/3}\bigr) \le \frac{k}{n}\Bigl(\xi + \frac{kC(1/\epsilon+L)}{n}\Bigr)\bigl(1+\epsilon(1+L)\bigr) + \frac{k^2C}{n^2}\Bigl(\frac{2}{\epsilon}+L\Bigr)^2\min\Bigl(k,\ \frac{2}{\epsilon}+L\Bigr),$$
where we used in multiple places the inequalities $e^{-\epsilon\ell} \le 1$ and $\ell e^{-\epsilon\ell/c} \le \frac{c}{e\epsilon}$ for any $c > 0$.

By Lemma A.18, it is clear that the term $S(X)$ used in Algorithm 1 is the smoothed sensitivity defined in equation (A.57). Therefore, $\bar A_n + \frac{S(X)}{\epsilon}Z$, where $Z$ is sampled from the distribution with density $h(z) \propto \frac{1}{1+|z|^4}$, is $O(\epsilon)$-differentially private, by Lemma 2. Moreover, if $\mathcal S = I_{n,k}$, the above bound on the smooth sensitivity holds without the $\min(k, 1/\epsilon + L)$ term, due to Lemma A.15. A short sketch of this release step is given below.
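For illustration only, the following minimal Python sketch mirrors the structure of the release $\bar A_n + \frac{S(X)}{\epsilon}Z$. Here `g` is passed in as the local-sensitivity upper bound derived above (its exact form is the one in the text), the truncation of the maximum over $\ell$ is a practical simplification, and the heavy-tailed noise is drawn by rejection sampling; the admissible noise distribution and constants in an actual implementation should be taken from Lemma 2.

```python
import numpy as np

def smooth_sensitivity(g, L_X, eps, ell_max=1000):
    """S(X) = max_{ell >= 0} exp(-eps * ell) * g(L_X + ell), as in (A.57).
    g grows only polynomially in ell while exp(-eps * ell) decays geometrically,
    so truncating the maximisation at a large ell_max is harmless in practice."""
    return max(np.exp(-eps * ell) * g(L_X + ell) for ell in range(ell_max + 1))

def sample_heavy_tailed(rng):
    """Draw Z with density proportional to 1/(1 + z^4), by rejection sampling
    against a standard Cauchy proposal; (1 + z^2)/(1 + z^4) <= ~1.21 for all z,
    so the acceptance probability below is always in (0, 1]."""
    bound = 1.3
    while True:
        z = rng.standard_cauchy()
        accept = (1.0 + z ** 2) / (bound * (1.0 + z ** 4))
        if rng.uniform() < accept:
            return z

def private_release(A_bar, g, L_X, eps, rng=None):
    """Release A_bar + (S(X) / eps) * Z."""
    rng = np.random.default_rng() if rng is None else rng
    return A_bar + smooth_sensitivity(g, L_X, eps) / eps * sample_heavy_tailed(rng)
```

The sketch is only meant to make the mechanism concrete: the privacy guarantee rests on the smoothness property established in Lemma A.18, not on any particular implementation detail above.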
Utility. By Chebyshev's inequality, we have $|A_n - \theta| \le \sqrt{\frac{\operatorname{Var}(A_n)}{\gamma}}$ with probability at least $1-\gamma$. Moreover, with probability at least $1-\gamma$, each of the Hájek projections is within $\xi$ of $A_n$. This implies that every index $i$ has weight 1, which further implies that $g(X_S) = h(X_S)$ for all $S \in \mathcal S$, and consequently, $\bar A_n = A_n$. Also, $L = 1$ for such an $X$. Finally, with probability at least $1-\gamma$, we have $|Z| \le \frac{3}{\gamma}$. Combining these inequalities and using Lemma A.18, we have
$$|\mathcal A(X) - \theta| \le |\bar A_n - A_n| + |A_n - \theta| + \Bigl|\frac{S(X)}{\epsilon}\Bigr|\,|Z|$$
with probability at least $1-4\gamma$, recalling that $\epsilon = O(1)$ when simplifying the expression. Algorithm 1 uses a constant failure probability of $\gamma = 0.01$, which ensures a success probability of at least 0.75. This is further boosted by Wrapper 1. Now, an application of Lemma A.6 gives the stated result.

A.5.2 Proof of Theorem 2

The proof of this theorem proceeds nearly identically to that of Theorem 4, with some exceptions. If $\mathcal S = I_{n,k}$, the smoothed sensitivity bound has no $\min(k, 1/\epsilon)$ term, owing to Lemmas A.15 and A.16, which gives a bound of the form
$$|\mathcal A(X) - \theta| \le O\biggl(\sqrt{\frac{1}{\gamma}\operatorname{Var}(A_n)} + \frac{k\xi}{n\epsilon} + \cdots\biggr),$$
with the remaining terms as in the theorem statement. Furthermore, since $\mathcal S = I_{n,k}$, we have $\bar A_n = U_n$ with probability at least $1-3\gamma$. Algorithm 1 uses a constant failure probability of $\gamma = 0.01$. This ensures a success probability of at least 0.75, which is further boosted by Wrapper 1. An application of Lemma A.6 gives the stated result.

A.5.3 Concentration of local Hájek projections

Proof of Lemma 3. We first show that if $Y_1, Y_2, \ldots, Y_t$ are random variables such that each $Y_j$ is $\tau_j$-sub-Gaussian, then the sum $Y_1 + \cdots + Y_t$ is $\bigl(\sqrt{\tau_1} + \sqrt{\tau_2} + \cdots + \sqrt{\tau_t}\bigr)^2$-sub-Gaussian. Define $p_j = \frac{\sum_{i=1}^t \sqrt{\tau_i}}{\sqrt{\tau_j}}$. Clearly, we have $\sum_{j=1}^t \frac{1}{p_j} = 1$. By Hölder's inequality, for any $\lambda > 0$, we have
$$E\Bigl[\prod_{i=1}^t \exp(\lambda Y_i)\Bigr] \le \prod_{i=1}^t E\bigl[\exp(\lambda Y_i)^{p_i}\bigr]^{1/p_i} \le \prod_{i=1}^t \exp\Bigl(\frac{\lambda^2 p_i^2\tau_i}{2p_i}\Bigr) = \prod_{i=1}^t \exp\Bigl(\frac{\lambda^2 p_i \tau_i}{2}\Bigr) = \exp\Bigl(\frac{\lambda^2\bigl(\sqrt{\tau_1}+\sqrt{\tau_2}+\cdots+\sqrt{\tau_t}\bigr)^2}{2}\Bigr).$$
Now, $h(X_S)$ is sub-Gaussian$(\tau)$ for all $S \in I_{n,k}$. Since $\hat h(i)$ is the average of $t = \binom{n-1}{k-1}$ such quantities, it is clear from the previous claim that $\hat h(i)$ is sub-Gaussian with parameter
$$\frac{1}{t^2}\bigl(\underbrace{\sqrt{\tau} + \sqrt{\tau} + \cdots + \sqrt{\tau}}_{t \text{ terms}}\bigr)^2 = \tau.$$

Proof of Lemma 4. Define $\sigma_i^2 := \operatorname{Var}\bigl(h(X_{S_i}) \mid X_i = x_i\bigr)$. First, conditioned on $X_i$, the projection $\hat h(i)$ can be viewed as a U-statistic on the other $n-1$ data points. For this lemma, we use $\mathcal S = I_{n,k}$, and
$$\hat h(i) = \widehat E\bigl[h(X_S)\mid X_i\bigr] = \frac{\sum_{S \in \mathcal S_i} h(X_S)}{\binom{n-1}{k-1}}.$$
Since $h$ is bounded, the random quantity $\hat h(i) - E[h(X_S)\mid X_i] \in [-2C, 2C]$ satisfies the Bernstein moment condition and also the Bernstein tail inequality (cf. Proposition 2.3 in [60]). By Bernstein's inequality for U-statistics (see inequality (A.29)), for all $t > 0$, we have
$$P\Bigl(\bigl|\hat h(i) - E[h(X_S)\mid X_i]\bigr| \ge t \,\Big|\, X_i\Bigr) \le 2\exp\biggl(-\frac{\bigl\lfloor \frac{n-1}{k-1}\bigr\rfloor\, t^2}{2\sigma_i^2 + 4Ct/3}\biggr),$$
which is at most $\beta/n$ as long as $t \ge \sigma_i\sqrt{\frac{2\log(2n/\beta)}{\lfloor (n-1)/(k-1)\rfloor}} + \frac{4C\log(2n/\beta)}{3\lfloor (n-1)/(k-1)\rfloor}$.

A.6 Applications

A.6.1 Uniformity testing

To motivate the test, consider the expectation $\theta := E[h(X_i, X_j)]$ and the variance $\operatorname{var}(U_n)$:

Lemma A.19. We have $E[h(X_1,X_2)] = \frac{1}{m} + \frac{\sum_i a_i^2}{m^2}$. In particular, the means under the two hypothesis classes differ by at least $\frac{\delta^2}{2m}$.

Proof. We have
$$E[h(X_1,X_2)] = \sum_{i=1}^m \frac{(1+a_i)^2}{m^2} = \sum_{i=1}^m \frac{1 + 2a_i + a_i^2}{m^2} = \frac{1}{m} + \frac{\sum_i a_i^2}{m^2},$$
since $\sum_i a_i = 0$. Under approximate uniformity, the second term is at most $\frac{\delta^2}{2m}$; and under the alternative hypothesis, this quantity is at least $\frac{\delta^2}{m}$.

Lemma A.20. The variance of $U_n$ is
$$\operatorname{var}(U_n) = \frac{2}{n(n-1)}\Bigl(\frac{2(n-2)}{m^4}\sum_{i<j}(1+a_i)(1+a_j)(a_i-a_j)^2 + \zeta_2\Bigr),$$
where $\zeta_2 = \operatorname{var}\bigl(h(X_1,X_2)\bigr)$.

The probability that the test errs is at most
$$P\Bigl(|\tilde\theta - U_n| > \frac{\delta^2}{8m}\Bigr) + P\Bigl(|U_n - E[U_n]| > \frac{\delta^2}{8m}\Bigr). \tag{A.61}$$
The second term can be controlled using an argument in Diakonikolas et al. [21], which further develops the variance bound in Lemma A.20 for the two hypothesis classes and then uses Chebyshev's inequality. It is shown that the second probability term in inequality (A.61) can be bounded by $\alpha$ if $n = \Omega\bigl(\frac{\sqrt{m}}{\delta^2}\bigr)$, with the constant depending on $\alpha$. To bound the first probability term in inequality (A.61), we study the concentration parameter $\xi$ for the local Hájek projection $\hat h(i) = \frac{1}{n-1}\sum_{j\ne i} h(X_i, X_j)$. We have the following result:

Lemma A.21. If $\xi = \frac{6}{m} + 8\sqrt{\frac{\log(4n/\gamma)}{mn}}$, then $|\hat h(i) - U_n| \le \xi$ for all $i$, with probability at least $1-\gamma$. A short sketch of the quantities entering this lemma is given below, before the proof.
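For intuition, here is a minimal Python sketch (illustrative only, not part of the formal development) of the statistics used in this application: the pairwise collision kernel $h(x,y) = \mathbf 1\{x = y\}$ appearing here, the U-statistic $U_n$, and the local Hájek projections $\hat h(i)$ whose concentration Lemma A.21 quantifies.

```python
import numpy as np

def collision_stats(x):
    """Pairwise collision U-statistic and its local Hajek projections.

    x: integer array of n samples from a distribution on {0, ..., m-1}.
    Returns (U_n, hat_h), where
      U_n      = (2 / (n(n-1))) * sum_{i<j} 1{x_i == x_j},
      hat_h[i] = (1 / (n-1)) * sum_{j != i} 1{x_i == x_j}.
    """
    x = np.asarray(x)
    n = len(x)
    counts = np.bincount(x)          # occurrences of each symbol
    same_as_i = counts[x] - 1        # for each i, #{j != i : x_j == x_i}
    hat_h = same_as_i / (n - 1)
    U_n = same_as_i.sum() / (n * (n - 1))
    return U_n, hat_h
```

In Algorithm 1, these projections are compared with $U_n$ against the threshold $\xi$ of Lemma A.21; when all of them fall within $\xi$, every weight equals 1 and no truncation occurs, which is the event used in the argument that follows.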
Proof. By the triangle inequality, we have
$$|\hat h(i) - U_n| \le \bigl|\hat h(i) - E[h(X_1,X_2)\mid X_1]\bigr| + \bigl|E[h(X_1,X_2)\mid X_1] - \theta\bigr| + |U_n - \theta|. \tag{A.62}$$
We will provide a bound on each of these three terms. Note that $h(X_1,X_2)\mid X_1 \sim \mathrm{Bern}(p_{X_1})$, which has variance
$$\sigma_i^2 = \operatorname{var}\bigl(h(X_i,X_j)\mid X_i\bigr) = p_{X_i}(1-p_{X_i}) \le \frac{2}{m}.$$
Hence, with probability at least $1-\frac{\gamma}{2}$, the first term in inequality (A.62) can be bounded as
$$\bigl|\hat h(i) - E[h(X_1,X_2)\mid X_1]\bigr| \le 2\sqrt{\frac{2}{mn}\log\frac{4n}{\gamma}},$$
where we have used the AM-GM inequality. The second term in inequality (A.62) can be bounded as
$$\bigl|E[h(X_1,X_2)\mid X_1] - \theta\bigr| \le \frac{1+a_{X_i}}{m} \le \frac{2}{m}.$$
Finally, by Chebyshev's inequality, the third term can be bounded as
$$|U_n - \theta| \le \sqrt{\frac{2\operatorname{var}(U_n)}{\gamma}}$$
with probability at least $1-\frac{\gamma}{2}$. It remains to find the variance of $U_n$. By Lemma A.20, if $|a_i| \le 1$ for all $i$, we have
$$\operatorname{var}(U_n) = \frac{2}{n(n-1)}\Bigl(\frac{2(n-2)}{m^4}\sum_{i<j}(1+a_i)(1+a_j)(a_i-a_j)^2 + \zeta_2\Bigr) \le \frac{4}{n m^4}\sum_{i<j}(1+a_i)(1+a_j)(a_i-a_j)^2 + \frac{4\zeta_2}{n^2} \le \frac{8}{m^2n} + \frac{8}{mn^2}. \tag{A.63}$$
Combining the three bounds into inequality (A.62), with probability at least $1-\gamma$, we have
$$|\hat h(i) - U_n| \le \frac{2}{m} + 7\sqrt{\frac{\log(4n/\gamma)}{mn}} + \frac{4}{m\sqrt{\gamma n}} + \frac{4}{n\sqrt{\gamma m}} \le \frac{6}{m} + 8\sqrt{\frac{\log(4n/\gamma)}{mn}},$$
where in the second inequality, we used the AM-GM inequality and the assumption $n \ge \frac{16}{\gamma}$. The statement of the lemma follows after discarding lower-order terms.

By Lemma A.21, with probability at least $1-\gamma$, the weights of all projections in Algorithm 1 are equal to 1 and $U_n = \bar A_n$. Then $|\tilde\theta - U_n|$ is simply the magnitude of the noise added in the final step of Algorithm 1 (which uses a constant $\gamma = 0.01$), which (cf. the proof of Theorem 2) takes the form
$$O\Bigl(\frac{\xi}{n\epsilon} + \frac{1}{n^2\epsilon^2} + \frac{1}{n^3\epsilon^3}\Bigr) = O\Bigl(\frac{1}{n^2\epsilon} + \frac{1}{mn\epsilon} + \frac{1}{n^2\epsilon^2} + \frac{1}{n^3\epsilon^3}\Bigr)$$
with probability at least 0.75. This is bounded by $\frac{\delta^2}{8m}$ as long as $n \gtrsim \frac{m^{1/2}(\log m)^{1/2}}{\delta\,\epsilon^{1/2}}$. Wrapper 1 further boosts this constant probability of success. Now, an application of Lemma A.6 gives the stated result.

A.6.2 Sparse graph statistics

Algorithm A.5 Private Network Edge(n, m, {Aij}1 i