Instance-optimal Mean Estimation Under Differential Privacy

Ziyue Huang, Yuting Liang, Ke Yi
{zhuangbq,yliangbs,yike}@cse.ust.hk
Department of Computer Science and Engineering
Hong Kong University of Science and Technology

Mean estimation under differential privacy is a fundamental problem, but worst-case optimal mechanisms do not offer meaningful utility guarantees in practice when the global sensitivity is very large. Instead, various heuristics have been proposed to reduce the error on real-world data that do not resemble the worst-case instance. This paper takes a principled approach, yielding a mechanism that is instance-optimal in a strong sense. In addition to its theoretical optimality, the mechanism is also simple and practical, and adapts to a variety of data characteristics without the need for parameter tuning. It easily extends to the local and shuffle models as well.

1 Introduction

Mean estimation is one of the most fundamental problems in statistics, optimization, and machine learning. However, privacy concerns forbid us from using the exact mean in these applications, and the problem of how to achieve the smallest error under a given privacy model has received considerable attention in the literature. Differential privacy (DP) is a rigorous mathematical definition for protecting individual privacy and has emerged as the gold standard in privacy-preserving data analysis, having been deployed by Apple [16], Google [22], and Microsoft [17]. Given a dataset D := {x_i}_{i∈[n]} ⊆ U^d, where U = [u], i.e., each coordinate of an input vector is an integer (real-valued coordinates can be handled by quantization; see Remark 1), our goal is to obtain a differentially private estimate M(D) of the mean f(D) = (1/n) Σ_{i=1}^n x_i with small ℓ2 error ‖M(D) − f(D)‖₂.
Because f(·) has global ℓ2 sensitivity GS_f = √d·u/n, the standard DP mechanism simply adds Gaussian noise scaled to GS_f to each coordinate of f(D), which results in an ℓ2 error proportional to √d·u/n. This simple mechanism is worst-case optimal [28], but it is certainly undesirable in practice, as people often conservatively use a large u (e.g., u = 2^32) while the actual dataset D may have much smaller coordinates. Instead, the clipped-mean estimator [1] (see Section 3.1 for details) has been widely used as an effective heuristic, but two questions remain unresolved: (1) how to choose the clipping threshold C; and (2) whether it can yield any optimality guarantees. We answer these questions in a fairly strong sense in this paper.

1.1 Instance Optimality

As worst-case optimality is theoretically trivial and practically meaningless for the mean estimation problem when the global sensitivity is too large, one may aim at instance optimality. More precisely, let 𝓜 be the class of DP mechanisms and let

R_ins(D) := inf_{M′∈𝓜} inf{ξ | Pr[‖M′(D) − f(D)‖₂ ≤ ξ] ≥ 2/3}

be the smallest error any M′ ∈ 𝓜 can achieve (with constant probability) on D. The standard definition of instance optimality then requires us to design an M such that

Pr[‖M(D) − f(D)‖₂ ≤ c · R_ins(D)] ≥ 2/3   (1)

for every D, where c is called the optimality ratio. Unfortunately, for any D, one can design a trivial M′(·) ≡ f(D) that has 0 error on D (but fails miserably on other instances), so R_ins(·) ≡ 0, which rules out instance-optimal DP mechanisms by a standard argument [21]. Since R_ins(·) is unachievable, relaxed versions can be considered. The above trivial M′ exists because it is only required to work well on one instance D. Imposing higher requirements on M′ would yield relaxed notions of instance optimality.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
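To make the baseline concrete, the worst-case mechanism discussed at the start of this section can be sketched as follows. This is an illustrative sketch, not the paper's code: the function name and interface are ours, and the noise is calibrated for ρ-zCDP as in the Gaussian mechanism stated formally in the preliminaries.

```python
import numpy as np

def worst_case_mean(X, u, rho, rng=None):
    """Worst-case zCDP mean estimate: Gaussian noise scaled to the global
    l2-sensitivity GS_f = sqrt(d)*u/n (changing one row can move the mean
    by at most sqrt(d)*u/n in l2). Illustrative sketch only."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    gs = np.sqrt(d) * u / n            # global l2-sensitivity of the mean
    sigma = gs / np.sqrt(2 * rho)      # Gaussian mechanism for rho-zCDP
    return X.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```

Note that the noise scale grows with u even when the actual data have tiny coordinates, which is exactly the weakness that motivates clipping.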
One natural requirement is that M should work well not just on D, but also on its neighbors, i.e., we raise the target error from R_ins(D) to

R_nbr(D) := inf_{M′∈𝓜} sup_{D′ : d_ham(D, D′) ≤ 1} inf{ξ | Pr[‖M′(D′) − f(D′)‖₂ ≤ ξ] ≥ 2/3}.

Vadhan [35] observes that R_nbr(D) is exactly LS_f(D), the local sensitivity of f at D, up to constant factors. However, LS_f(·) may not be an appropriate target to shoot at, depending on what f is. For the MEDIAN problem, LS_f(D) = 0 for certain D's and no DP mechanism can achieve this error [33], while for mean estimation, LS_f(D) = Θ(GS_f) = Θ(√d·u/n) for all D, so this relaxation turns instance optimality into worst-case optimality.

The reason why the above relaxation is too much for the mean estimation problem is that D′ may change one vector of D arbitrarily, e.g., from (0, …, 0) to (u, …, u). We restrict this. More precisely, letting supp(D) denote the set of distinct vectors in D, we consider the target error

R_in-nbr(D) := inf_{M′∈𝓜} sup_{D′ : d_ham(D, D′) ≤ 1, supp(D′) ⊆ supp(D)} inf{ξ | Pr[‖M′(D′) − f(D′)‖₂ ≤ ξ] ≥ 2/3},

namely, we require M to work well only on D and its in-neighbors, in which a vector can only be changed to another one already existing in D. Correspondingly, an instance-optimal M (w.r.t. the in-neighborhood) is one such that (1) holds with R_ins replaced by R_in-nbr. We make a few notes on this notion of instance optimality: (1) This optimality is only about the utility of the mechanism, not its privacy. We still require the mechanism to satisfy the DP requirement between any D, D′ with d_ham(D, D′) = 1, not only between an instance and its in-neighbors. (2) In general, a smaller neighborhood leads to a stronger notion of instance optimality. Thus, the optimality using in-neighbors is stronger than that using all neighbors, which is in turn stronger than worst-case optimality (i.e., D′ can be any instance), while the latter two are actually the same for the mean estimation problem.
(3) For an instance-optimal M (by our notion), there may still exist D, M′ such that M′ does better on D than M, but it is not possible for M′ to achieve a smaller error than that of M on D over all in-neighbors of D. This is more meaningful than ranging over all neighbors of D, some of which (e.g., one with (u, …, u) as a datum) are unlikely to be the actual instances encountered in practice.

1.2 Our Results

To design an M(D) for the mean function f(D) = (1/n) Σ_{i=1}^n x_i that achieves an error comparable to R_in-nbr(D) for all D, we need an upper bound and a lower bound. For the lower bound, we show that R_in-nbr(D) = Ω(w(D)/n), where w(D) := max_{i,i′∈[n]} ‖x_i − x_{i′}‖₂ is the diameter of D.

2 Preliminaries

2.1 Differential Privacy

Definition 1 (Differential Privacy). For ε > 0 and δ ≥ 0, a randomized algorithm M : X^n → Y is (ε, δ)-differentially private if for any neighboring datasets D ∼ D′ (i.e., d_ham(D, D′) = 1) and any E ⊆ Y, Pr[M(D) ∈ E] ≤ e^ε · Pr[M(D′) ∈ E] + δ.

Definition 2 (Concentrated Differential Privacy (zCDP) [10]). For ρ > 0, a randomized algorithm M : X^n → Y is ρ-zCDP if for any D ∼ D′, D_α(M(D) ‖ M(D′)) ≤ ρα for all α > 1, where D_α(M(D) ‖ M(D′)) is the α-Rényi divergence between M(D) and M(D′).

Note that (ε, 0)-DP implies (ε²/2)-zCDP, which in turn implies (ε²/2 + ε√(2 log(1/δ)), δ)-DP for any δ > 0. To release a numeric function f(D) taking values in R^d, the most common technique for achieving zCDP is to mask the result with Gaussian noise calibrated to the ℓ2-sensitivity of f.

Lemma 1 (Gaussian Mechanism [10]). Let f : X^n → R^d be a function with global ℓ2-sensitivity GS_f := max_{D∼D′} ‖f(D) − f(D′)‖₂. For a given dataset D ∈ X^n, the mechanism that releases f(D) + N(0, GS_f²/(2ρ) · I_{d×d}) satisfies ρ-zCDP.

Lemma 2 (Composition Theorem [10, 21]). If M is an adaptive composition of differentially private algorithms M₁, M₂, …, M_k, then
1. If each M_i satisfies (ε_i, δ_i)-DP, then M satisfies (Σ_i ε_i, Σ_i δ_i)-DP.
2. For all ε, δ, δ′ ≥ 0, if each M_i satisfies (ε, δ)-DP, then M satisfies (ε′, kδ + δ′)-DP, where ε′ = √(2k log(1/δ′)) · ε + kε(e^ε − 1).
3. If each M_i satisfies ρ_i-zCDP, then M satisfies (Σ_i ρ_i)-zCDP.
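The zCDP conversions and composition rule above reduce to a few one-line formulas; a minimal sketch (the helper names are ours, not from the paper):

```python
import math

def pure_dp_to_zcdp(eps):
    # (eps, 0)-DP implies (eps^2 / 2)-zCDP
    return eps ** 2 / 2

def zcdp_to_approx_dp(rho, delta):
    # rho-zCDP implies (rho + 2*sqrt(rho * log(1/delta)), delta)-DP,
    # the standard Bun-Steinke conversion
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

def compose_zcdp(rhos):
    # item 3 of the composition theorem: zCDP budgets add up
    return sum(rhos)
```

This additive bookkeeping is why the paper can freely split the budget ρ among subroutines (e.g., ρ/4 for quantile selection and 3ρ/4 for the clipped mean) and still account for the total privacy cost exactly.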
2.2 Differential Privacy in the Local Model and Shuffle Model

The above definitions of DP and zCDP assume that D is handled by a trusted curator and only the output of the mechanism is released to the public. Therefore, if the curator is corrupted, the privacy of all users is breached. For weaker trust assumptions, the most popular models are the local model and the shuffle model, where each user holds their own datum and locally privatizes (by some randomized mechanism) the message before sending it out for analysis. Hence, no third party has direct access to D. Formally, each user holds one datum x_i ∈ D, the protocol interacts with the dataset through some local randomizer R : X → Y, and the privacy guarantee is defined over the transcript (all messages sent during the protocol). For simplicity, we only present the definition for one-round protocols; the privacy guarantee of multi-round protocols composes across rounds by the composition theorem. The definition below uses zCDP; other DP notions can be defined similarly.

Definition 3 (Local Model (LDP)). A protocol using R(·) as the local randomizer satisfies ρ-zCDP in the local model if for any x, x′ ∈ X and any α > 1, D_α(R(x) ‖ R(x′)) ≤ ρα.

Due to the much stronger privacy requirement, the best accuracy guarantee of LDP protocols for several fundamental problems [12, 6, 18, 31] is a √n-factor worse than in the central model. The shuffle model sits at an intermediate level of trust between the local model and the central model, and aims for errors closer to the central model. The key feature of the shuffle model is a trusted shuffler S, which randomly permutes all messages before sending them to the analyzer, so that an adversary cannot identify the source of any message.
Specifically, we consider the multi-message shuffle model, where each local randomizer R : X → Y^m outputs m messages, and the transcript of the protocol Π_P(D) is a random permutation of all mn messages. The following definition uses (ε, δ)-DP; the other two DP notions can also be defined similarly, but they do not offer the improvements that we want over LDP protocols.

Definition 4 (Shuffle Model). A protocol P satisfies (ε, δ)-DP in the shuffle model if for any D ∼ D′ and any set E ⊆ Y^{mn}, Pr[Π_P(D) ∈ E] ≤ e^ε · Pr[Π_P(D′) ∈ E] + δ.

3 Our Method

3.1 Clipped-Mean Estimator

In the rest of the paper, we focus on the mean function f(D) = (1/n) Σ_{i=1}^n x_i. Since GS_f is large, a very natural idea is to clip each vector to ℓ2 norm at most some threshold C. This reduces the sensitivity to 2C/n, leading to the clipped-mean estimator [1]:

M_C(D) = (1/n) Σ_{i=1}^n min{C/‖x_i‖₂, 1} · x_i + N(0, 2C²/(ρn²) · I).   (2)

Lemma 3. For any given C, M_C(D) satisfies ρ-zCDP, and has an expected ℓ2 error at most

E[‖M_C(D) − f(D)‖₂] ≤ E(C; D) := (1/n) Σ_{i=1}^n max{‖x_i‖₂ − C, 0} + (C/n)·√(2d/ρ).

An important remaining question is how to set the clipping threshold C. Setting it too low will result in a large bias, while setting it too high will introduce a large amount of noise. We show how to choose the optimal C to balance this bias-variance trade-off. It is easy to see that the error E(C; D) is a convex function of C, so the optimal C can be found by setting the derivative of E(C; D) to zero, i.e.,

−(1/n)·|{i ∈ [n] | ‖x_i‖₂ > C}| + (1/n)·√(2d/ρ) = 0.

Therefore, the optimal choice of C is the (n − √(2d/ρ))-th quantile of {‖x_i‖₂}_{i∈[n]}.

3.2 Private Quantile Selection

However, we cannot use the optimal C directly, as that would violate DP. Instead, we find a privatized quantile with small rank error. Specifically, for this problem, D consists of a sequence of ordered integers 0 ≤ x_(1) ≤ ⋯ ≤ x_(n) ≤ u. We would like to design a DP mechanism that, for a given m, returns an x (not necessarily an element of D) such that x_(m−τ) ≤ x ≤ x_(m+τ) w.h.p. Here τ is referred to as the rank error.
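The clipped-mean estimator (2) with the bias-variance-optimal threshold can be sketched as follows. For clarity this sketch uses the non-private quantile of the norms as C (a privacy leak); the actual mechanism replaces it with the private quantile-selection subroutine. The function name is ours.

```python
import numpy as np

def clipped_mean(X, rho, rng=None):
    """Sketch of the clipped-mean estimator with C chosen as the
    (n - sqrt(2d/rho))-th smallest l2 norm, the minimizer of the
    bias-variance trade-off E(C; D). Non-private C, for illustration."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1)
    k = max(int(n - np.sqrt(2 * d / rho)), 1)
    C = np.sort(norms)[k - 1]                    # optimal clipping threshold
    scale = np.minimum(C / np.maximum(norms, 1e-12), 1.0)
    clipped = X * scale[:, None]                 # l2-clip each row to norm <= C
    noise = rng.normal(0.0, np.sqrt(2 * C**2 / (rho * n**2)), size=d)
    return clipped.mean(axis=0) + noise
```

With this choice, exactly about √(2d/ρ) vectors are clipped, balancing the clipping bias against the injected noise.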
Existing methods for private range counting queries [11, 20] can be used for this purpose, but they actually find all quantiles, which is overkill. Instead, we use a simple binary search algorithm [25, 14], which not only simplifies the algorithm but also reduces the rank error (by polylog(u) factors) to nearly optimal. Our algorithm PrivQuant makes use of a function NoisyRC([a, b], D) that returns a noisy count of |D ∩ [a, b]|; we present the details in the supplementary material. In the central DP model, we simply use NoisyRC([0, mid], D) = |D ∩ [0, mid]| + N(0, log u/(2ρ)).

Theorem 1. The algorithm PrivQuant satisfies ρ-zCDP, and it returns a quantile with rank error τ with probability at least 1 − β for τ = O(√(log u · log((log u)/β) / ρ)).

In Section 4, we prove an Ω(√(log u / ρ)) lower bound (Corollary 1) on the rank error under zCDP for constant β. Thus the algorithm is optimal up to just an O(√(log log u))-factor.

We can now use PrivQuant to find an approximately optimal clipping threshold. Specifically, we invoke PrivQuant with ρ′ = ρ/4 to find the max{n − max{√(2d/ρ), τ}, 1}-th quantile of {‖x_i‖₂²}_{i∈[n]}. These are integers no more than du², so replacing u by du² in Theorem 1 yields a rank error of τ = 2√(log(du) · log((log(du))/β) / ρ). Then we set C to the square root of the returned quantile. Finally, we return the clipped-mean estimator M_C(D) with ρ′ = 3ρ/4. The following theorem analyzes its error.

Theorem 2. Our mean estimation mechanism is ρ-zCDP and has ℓ2 error O((√(d/ρ) + τ) · r(D)/n) with probability 1 − β, where τ = 2√(log(du) · log((log(du))/β) / ρ) and r(D) denotes the largest ℓ2 norm of a vector in D.

(Throughout, define x_(j) := 0 for j < 1 and x_(j) := u for j > n.)

3.3 Shifted-Clipped-Mean Estimator

To reduce the error from being proportional to r(D) to being proportional to w(D), we perform a random rotation on D followed by a translation. The rotation is done as x̂_i := HDx_i, where H is the Hadamard matrix and D is a diagonal matrix whose diagonal entries are independently and uniformly drawn from {−1, +1}.
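The noisy binary search underlying PrivQuant (Section 3.2) can be sketched as follows. This is an illustrative sketch with our own function name; the per-query noise follows the NoisyRC calibration above, splitting the zCDP budget evenly across the roughly log₂ u search steps.

```python
import numpy as np

def priv_quant(data, m, u, rho, rng=None):
    """Sketch of binary-search quantile selection over the integer
    universe [0, u]: each step asks a noisy count of |{x in data : x <= mid}|
    and recurses left or right. Returns a value whose rank is within the
    rank error tau of m with high probability."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    steps = max(int(np.ceil(np.log2(u + 1))), 1)
    sigma = np.sqrt(steps / (2 * rho))   # per-query noise for rho-zCDP overall
    lo, hi = 0, u
    while lo < hi:
        mid = (lo + hi) // 2
        noisy = (data <= mid).sum() + rng.normal(0.0, sigma)
        if noisy < m:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Each of the ~log₂ u counting queries has sensitivity 1, so by the composition theorem the whole search satisfies ρ-zCDP.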
Note that for now we omit the normalization coefficient 1/√d, so that each coordinate of x̂_i is still an integer; we will apply the normalization to the final estimator instead. Then, for each j ∈ [d], we invoke PrivQuant with ρ′ = ρ/(4d) to find an approximate median of {x̂_i}_{i∈[n]} along dimension j, denoted c̃_j. Next, we shift the dataset to be centered around c̃ = (c̃_1, …, c̃_d), obtaining D̃ = {x̃_i := x̂_i − c̃}_{i∈[n]}. Note that c̃ has integer coordinates, so x̃_i does too. Finally, we apply the clipped-mean estimator of Theorem 2 with ρ′ = (3/4)ρ on D̃, obtaining an estimate ỹ, and return y := (1/d)·DH(ỹ + c̃) as the mean estimate over D.

Theorem 3. Set τ = √(log(du) · log((d log(du))/β) / ρ) and assume n = Ω(τ√d). Our mean estimation mechanism is ρ-zCDP, and has ℓ2 error O((√(d/ρ) + τ) · √(log(nd/β))) · w(D)/n with probability 1 − β.

Remark 1. For a dataset with real coordinates bounded by R (in absolute value), one can quantize each coordinate to an integer using bucket size α/√d, for any 0 < α < R, and then apply our algorithm over an integer universe of size u = 2R√d/α. This only brings an additive α error to the error bound of Theorem 3.

Statistical Mean Estimation. Suppose D consists of i.i.d. samples drawn from the multivariate Gaussian distribution N(µ, Σ), and we wish to estimate µ, assuming a priori bounds ‖µ‖₂ ≤ R and σ²_min·I ⪯ Σ ⪯ σ²_max·I. We first clip each sample, x_i ← x_i · min{R′/‖x_i‖₂, 1}, where R′ := R + 2σ_max·√(log(n/β)). Then we apply our mechanism with a suitably small bucket size α/√d. We analyze the error in the supplementary material. When Σ = I and ignoring log^{O(1)}(dnR/σ_min) factors, the error becomes O(√(d/n) + d/(n√ρ)), matching the known optimal bound for Gaussian mean estimation [8].

4 Lower Bounds

In this section we establish the instance optimality of Theorem 3 via three lower bounds: (1) R_in-nbr(D) = Ω(w(D)/n) for all D; (2) an Ω(√(d/ρ)) lower bound on the optimality ratio; and (3) that the condition on n in Theorem 3 is necessary. The first lower bound follows from an observation by Vadhan [35]:

Lemma 4 ([35]).
For any f, any (ε, δ)-DP mechanism M′, and any neighboring datasets D₀ ∼ D₁, there is a b ∈ {0, 1} such that Pr[‖M′(D_b) − f(D_b)‖₂ < ‖f(D₀) − f(D₁)‖₂/2] < (1 + δ)/(1 + e^{−ε}).

Theorem 4. For ε < 0.1 and δ < 0.1, R_in-nbr(D) = Ω(w(D)/n).

The lower bound on the optimality ratio is by a reduction from statistical mean estimation [26].

Theorem 5. Let M be any ρ-zCDP mechanism for mean estimation that has ℓ2 error c · w(D)/n with constant probability for any D = {x_i}_{i∈[n]} drawn from [u]^d. If ρ = O(√d/n), then c = Ω(√(d/ρ)).

For the lower bound on n, we consider a weaker problem (so the lower bound is stronger), which is the d-dimensional version of the interior point problem [9]: Given a dataset D = {x_i}_{i∈[n]} drawn from [u]^d, the mechanism is only required to return a y ∈ [u]^d such that min_i x_{ij} ≤ y_j ≤ max_i x_{ij} for all j with constant probability.

Theorem 6. If there exists a ρ-zCDP mechanism that solves the interior point problem with success probability 2/3, then n = Ω(√(d log u / ρ)).

Given a quantile selection mechanism with rank error τ, by finding the median, the 1-dimensional interior point problem can be solved whenever n = Ω(τ). The following corollary then follows from Theorem 6.

Corollary 1. Any ρ-zCDP mechanism for the quantile selection problem must have rank error Ω(√(log u / ρ)).

5 Extension to the Local Model and Shuffle Model

Our mean estimation framework can be summarized as follows: (1) Given D = {x₁, …, x_n}, perform a random rotation, obtaining D̂ = {x̂_i := HDx_i}_{i∈[n]}; (2) For each j ∈ [d], find an approximate median c̃_j of D̂ along dimension j, and shift D̂ to be centered around c̃, obtaining D̃ = {x̃_i := x̂_i − c̃}; (3) Find a clipping threshold C, which is the m-th quantile of the ℓ2 norms of the vectors in D̃ (in the central model, the optimal choice is m = n − √(2d/ρ)); (4) Perform ℓ2 clipping over D̃ using C, and obtain a mean estimate ỹ of the clipped vectors.
Finally, return y := (1/d)·DH(ỹ + c̃).

We note that each step has its counterpart in the local and shuffle models: Step (1) is easy, as the randomized diagonal matrix D can be generated using public randomness, or sent from the aggregator to each user if public randomness is not available. Steps (2) and (3) both rely on quantile selection, which has alternatives in the local and shuffle models. Step (4) is also easy since all vectors have norms bounded by C. We elaborate on the details in the supplementary material.

The Local Model. For step (4), the standard LDP mechanism is that each user applies the Gaussian mechanism (Lemma 1) with GS = 2C to inject noise into their clipped vector and sends it out; the aggregator adds the reports up and divides the sum by n. For quantile selection in steps (2) and (3), we use the LDP range counting protocol of [13], which returns a data structure from which any range counting query can be answered. Putting things together, we obtain the following result.

Theorem 7. Set τ = √(n/ρ) · log²(du) · log((du)/β) and assume n = Ω(τ√d). There is a 3-round ρ-zCDP mean estimation mechanism in the local model, achieving an ℓ2 error of O((√(dn/ρ) + τ) · √(log(nd/β))) · w(D)/n with probability 1 − β.

The Shuffle Model. The error in the local model is worse than that in the central model by a √n-factor. It turns out that in the shuffle model, we can match the result in the central model up to logarithmic factors, albeit with (ε, δ)-DP. This is mostly due to highly accurate summation and range counting protocols discovered recently for the shuffle model. We first extend the one-dimensional summation protocol in [5] to high-dimensional data by using a random rotation.

Lemma 5.
Given D = {x_i}_{i∈[n]} ⊆ R^d with ‖x_i‖₂ ≤ C for all i, there is a one-round (ε, δ)-DP mean estimation protocol in the shuffle model that returns a y such that ‖y − f(D)‖₂ ≤ O((C/(εn)) · log(nd) · √(log d)) with constant probability.

For the NoisyRC queries in the algorithm PrivQuant, we can use the range counting mechanism [24] in the shuffle model. Putting things together, we obtain the following result.

Theorem 8. Set τ = (1/ε) · log^{3.5}(du) · √(log d · log((du)/δ)) and assume n = Ω(τ√d). There is a 3-round (ε, δ)-DP mean estimation mechanism in the shuffle model, achieving an ℓ2 error of O((√d/ε + τ) · log(nd)) · w(D)/n with probability 2/3.

6 Experiments

We performed both statistical and empirical mean estimation experiments to evaluate our method. For statistical mean estimation, we used multivariate Gaussian distributions with various µ and Σ. All algorithms are given the same R, σ_min, σ_max. We tried various R, while fixing σ_min = 0.1 and σ_max = R/√d. For empirical mean estimation, we used a real-world dataset, MNIST, which consists of 70,000 images of handwritten digits, each represented by a vector of dimension d = 784 = 28 × 28. We quantized the values to integers in [u] for u = 2^10. We measured the ℓ2 error by taking the trimmed mean with trimming parameter 0.1 over 100 trials (as in [8]).

For the central model, we compared with COINPRESS² [8]. It starts with a given confidence ball of radius R that captures the mean, and iteratively refines the confidence ball. The number of iterations t is an important internal parameter of this algorithm; we tried t = 1, 2, 3, 4, 10 following their suggestion.

[Figure 1: ℓ2 error vs. d for N(µ, I_{d×d}), where n = 4000, ρ = 0.5, R = 50; panels (a) µ = 0·1_d, (b) µ = 5·1_d, (c) µ = 10·1_d. Methods: Non-private, COINPRESS (t = 1, 2, 3, 4, 10), CM, Shifted-CM.]

[Figure 2: ℓ2 error vs. ρ for N(µ, I_{d×d}), where n = 4000, d = 128, R = 50; panels (a) µ = 0·1_d, (b) µ = 5·1_d, (c) µ = 10·1_d. Same methods as Figure 1.]

[Figure 3: ℓ2 error vs. d for N(µ, Σ), where Σ = diag([σ₁², σ₂², …, σ_d²]), n = 4000, ρ = 0.5, R = 50; panels (a) σ_j² = 0.1 for all j ∈ [d], (b) σ_j² = 10, (c) σ_j² ∈ [0, 10]. Same methods as Figure 1.]

The results for statistical mean estimation are shown in Fig. 1 to 5, where the detailed parameter settings are given in the captions. The results show that our method (Shifted-CM) performs at least as well as COINPRESS with the best t across a variety of settings. In particular, when Σ is not the identity I_{d×d}, our method significantly outperforms COINPRESS and offers errors close to the non-private estimator, as demonstrated in Fig. 3 and Fig. 4. Moreover, as seen from Fig. 3, 4, and 5, the best choice of t for COINPRESS is quite sensitive to Σ and R, making it difficult to tune in practice.

²We also compared with another version of COINPRESS in the supplementary material, where we first scale the data by an estimated Σ (obtained by MVCRec [8] using half of the privacy budget ρ).
[Figure 4: ℓ2 error vs. ρ for N(µ, Σ), where Σ = diag([σ₁², σ₂², …, σ_d²]), n = 4000, d = 128, R = 50; panels (a) σ_j² = 0.1 for all j ∈ [d], (b) σ_j² = 10 for all j ∈ [d], (c) σ_j² ∈ [0, 10]. Methods: Non-private, COINPRESS (t = 1, 2, 3, 4, 10), CM, Shifted-CM.]

[Figure 5: ℓ2 error vs. R for N(µ, Σ), where Σ = diag([σ₁², σ₂², …, σ_d²]) and σ_j² drawn u.a.r. from [0, 10] for each j ∈ [d], n = 4000, ρ = 0.5, d = 128; panels (a) µ = 0·1_d, (b) µ = 5·1_d, (c) µ = 10·1_d.]

Both our method and COINPRESS are translation-invariant. This can be verified from Fig. 1, 2, and 5, where the results are not affected by µ. However, the approaches taken are different: COINPRESS uses an iterative process, while we shift the dataset to be centered around an approximate center point. In Fig. 1 and 2, we also include CM, our estimator without this shift operation, which is indeed affected by µ.

[Figure 6: ℓ2 error vs. ρ for various digits in MNIST, where d = 784, R = 50; panels (a) digit 0, (b) digit 1, (c) digit 2. Methods: COINPRESS (t = 1, 2, 3, 4, 10), Shifted-CM.]

For empirical mean estimation on the MNIST dataset, we see in Fig. 6 that our method outperforms COINPRESS at various privacy levels. This suggests that this dataset, as with most real-world datasets, does not follow a Gaussian distribution with identity Σ. The instance optimality of our method is precisely the reason behind its robustness to different distributional assumptions, or the lack thereof.

One important component of our method is the optimal clipping threshold C, which is the (n − √(2d/ρ))-th quantile of the norms. Note that this depends on d and ρ. Prior work [3] used a fixed quantile (e.g., the median). To better see the relationship between the optimal C and d, ρ, we used a synthetic dataset D = {i·1_d}_{i∈[n]} with n = 500 and tried different quantiles as the clipping threshold. The results in Fig. 7 confirm our theoretical analysis: the optimal C indeed depends on d and ρ, while our choice attains the optimum.

[Figure 7: ℓ2 error vs. d for clipping at various quantiles of the norms (50%, 75%, 85%, 95%, Ours) on a synthetic dataset; panels for (a) ρ = 0.1, (b) ρ = 0.5, and a third ρ value.]

We also evaluated our algorithm in the local model and compared it with [23]; the results are presented in the supplementary material.

Acknowledgments and Disclosure of Funding

We would like to thank Gautam Kamath and Jonathan Ullman for helpful discussions about CoinPress [8]. Funding in direct support of this work: this work is supported by HKRGC under grants 16201318, 16201819, and 16205420.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[2] Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvitskii.
Bounding user contributions: A bias-variance trade-off in differential privacy. In International Conference on Machine Learning, pages 263–271. PMLR, 2019.

[3] Galen Andrew, Om Thakkar, H. Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. arXiv preprint arXiv:1905.03871, 2019.

[4] Hilal Asi and John C. Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. Advances in Neural Information Processing Systems, 33, 2020.

[5] Borja Balle, James Bell, Adrià Gascón, and Kobbi Nissim. Private summation in the multi-message shuffle model. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 657–676, 2020.

[6] Raef Bassily and Adam Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 127–135, 2015.

[7] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.

[8] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. CoinPress: Practical private mean and covariance estimation. Advances in Neural Information Processing Systems, 2020.

[9] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 634–649. IEEE, 2015.

[10] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.

[11] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security, 2011.

[12] T.-H. Hubert Chan, Elaine Shi, and Dawn Song.
Optimal lower bound for differentially private multi-party aggregation. In European Symposium on Algorithms, pages 277–288. Springer, 2012.

[13] Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. Answering range queries under local differential privacy. Proceedings of the VLDB Endowment, 12(10):1126–1138, 2019.

[14] Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[15] Peter Davies, Vijaykrishna Gurunanthan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. New bounds for distributed mean estimation and variance reduction. In International Conference on Learning Representations, 2021.

[16] Apple Differential Privacy Team. Learning with privacy at scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf, December 2017.

[17] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In NIPS, 2017.

[18] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory, pages 1161–1191. PMLR, 2019.

[19] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy, data processing inequalities, and minimax rates. arXiv preprint arXiv:1302.3203, 2013.

[20] Cynthia Dwork, Moni Naor, Omer Reingold, and Guy N. Rothblum. Pure differential privacy for rectangle queries via private partitions. In International Conference on the Theory and Application of Cryptology and Information Security, pages 735–751. Springer, 2015.

[21] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

[22] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response.
In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.

[23] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private mean estimation: z-test and tight confidence intervals. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2545–2554. PMLR, 2019.

[24] Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. On the power of multiple anonymous messages. In EUROCRYPT, 2021.

[25] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, pages 454–465. Elsevier, 2002.

[26] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR, 2019.

[27] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In Conference on Learning Theory, pages 2204–2235. PMLR, 2020.

[28] Gautam Kamath and Jonathan Ullman. A primer on private statistics. arXiv preprint arXiv:2005.00010, 2020.

[29] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018.

[30] Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and Ananda Theertha Suresh. Learning with user-level privacy. arXiv preprint arXiv:2102.11845, 2021.

[31] Andrew McGregor, Ilya Mironov, Toniann Pitassi, Omer Reingold, Kunal Talwar, and Salil Vadhan. The limits of two-party differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 81–90. IEEE, 2010.

[32] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang.
Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.

[33] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 75–84, 2007.

[34] Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. AdaClip: Adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643, 2019.

[35] Salil Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pages 347–450. Springer, 2017.