Instance-optimal Mean Estimation Under Differential Privacy

Ziyue Huang, Yuting Liang, Ke Yi
{zhuangbq,yliangbs,yike}@cse.ust.hk
Department of Computer Science and Engineering
Hong Kong University of Science and Technology

Mean estimation under differential privacy is a fundamental problem, but worst-case optimal mechanisms do not offer meaningful utility guarantees in practice when the global sensitivity is very large. Instead, various heuristics have been proposed to reduce the error on real-world data that do not resemble the worst-case instance. This paper takes a principled approach, yielding a mechanism that is instance-optimal in a strong sense. In addition to its theoretical optimality, the mechanism is also simple and practical, and adapts to a variety of data characteristics without the need for parameter tuning. It easily extends to the local and shuffle models as well.

1 Introduction

Mean estimation is one of the most fundamental problems in statistics, optimization, and machine learning. However, privacy concerns forbid us from using the exact mean in these applications, and the problem of how to achieve the smallest error under a given privacy model has received considerable attention in the literature. Differential privacy (DP) is a rigorous mathematical definition for protecting individual privacy and has emerged as the gold standard in privacy-preserving data analysis, having been deployed by Apple [16], Google [22], and Microsoft [17]. Given a dataset D := {x_i}_{i∈[n]} ⊆ U^d, where U = [u], i.e., each coordinate of an input vector is an integer (real-valued coordinates can be handled by quantization; see Remark 1), our goal is to obtain a differentially private estimate M(D) of the mean f(D) = (1/n) Σ_{i=1}^n x_i with small ℓ2 error ‖M(D) − f(D)‖₂.
Because f(·) has global ℓ2 sensitivity GS_f = √d·u/n, the standard DP mechanism simply adds Gaussian noise scaled to GS_f to each coordinate of f(D), which results in an ℓ2 error proportional to √d·u/n. This simple mechanism is worst-case optimal [28], but it is certainly undesirable in practice, as people often conservatively use a large u (e.g., u = 2^32) while the actual dataset D may have much smaller coordinates. Instead, the clipped-mean estimator [1] (see Section 3.1 for details) has been widely used as an effective heuristic, but two questions remain unresolved: (1) how to choose the clipping threshold C; and (2) whether it can yield any optimality guarantees. We answer these questions in a fairly strong sense in this paper.

1.1 Instance Optimality

As worst-case optimality is theoretically trivial and practically meaningless for the mean estimation problem when the global sensitivity is too large, one may aim at instance optimality. More precisely, let 𝓜 be the class of DP mechanisms and let

R_ins(D) := inf_{M′∈𝓜} inf{ξ | Pr[‖M′(D) − f(D)‖₂ ≤ ξ] ≥ 2/3}

be the smallest error any M′ ∈ 𝓜 can achieve (with constant probability) on D. The standard definition of instance optimality then requires us to design an M such that

Pr[‖M(D) − f(D)‖₂ ≤ c · R_ins(D)] ≥ 2/3   (1)

for every D, where c is called the optimality ratio. Unfortunately, for any D, one can design a trivial M′(·) ≡ f(D) that has 0 error on D (but fails miserably on other instances), so R_ins(·) ≡ 0, which rules out instance-optimal DP mechanisms by a standard argument [21]. Since R_ins(·) is unachievable, relaxed versions can be considered. The above trivial M′ exists because it is only required to work well on one instance D. Imposing higher requirements on M′ would yield relaxed notions of instance optimality.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
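To make the baseline concrete, the worst-case mechanism discussed at the start of this section can be sketched as follows. This is an illustrative sketch, not the paper's code: the function name and interface are ours, and the noise is calibrated for ρ-zCDP as in the Gaussian mechanism stated formally in the preliminaries.

```python
import numpy as np

def worst_case_mean(X, u, rho, rng=None):
    """Worst-case zCDP mean estimate: Gaussian noise scaled to the global
    l2-sensitivity GS_f = sqrt(d)*u/n (changing one row can move the mean
    by at most sqrt(d)*u/n in l2). Illustrative sketch only."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    gs = np.sqrt(d) * u / n            # global l2-sensitivity of the mean
    sigma = gs / np.sqrt(2 * rho)      # Gaussian mechanism for rho-zCDP
    return X.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```

Note that the noise scale grows with u even when the actual data have tiny coordinates, which is exactly the weakness that motivates clipping.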
One natural requirement is that M should work well not just on D, but also on its neighbors, i.e., we raise the target error from R_ins(D) to

R_nbr(D) := inf_{M′∈𝓜} sup_{D′ : d_ham(D, D′) ≤ 1} inf{ξ | Pr[‖M′(D′) − f(D′)‖₂ ≤ ξ] ≥ 2/3}.

Vadhan [35] observes that R_nbr(D) is exactly LS_f(D), the local sensitivity of f at D, up to constant factors. However, LS_f(·) may not be an appropriate target to shoot at, depending on what f is. For the MEDIAN problem, LS_f(D) = 0 for certain D's and no DP mechanism can achieve this error [33], while for mean estimation, LS_f(D) = Θ(GS_f) = Θ(√d·u/n) for all D, so this relaxation turns instance optimality into worst-case optimality.

The reason why the above relaxation is too much for the mean estimation problem is that D′ may change one vector of D arbitrarily, e.g., from (0, …, 0) to (u, …, u). We restrict this. More precisely, letting supp(D) denote the set of distinct vectors in D, we consider the target error

R_in-nbr(D) := inf_{M′∈𝓜} sup_{D′ : d_ham(D, D′) ≤ 1, supp(D′) ⊆ supp(D)} inf{ξ | Pr[‖M′(D′) − f(D′)‖₂ ≤ ξ] ≥ 2/3},

namely, we require M to work well only on D and its in-neighbors, in which a vector can only be changed to another one already existing in D. Correspondingly, an instance-optimal M (w.r.t. the in-neighborhood) is one such that (1) holds with R_ins replaced by R_in-nbr. We make a few notes on this notion of instance optimality: (1) This optimality is only about the utility of the mechanism, not its privacy. We still require the mechanism to satisfy the DP requirement between any D, D′ with d_ham(D, D′) = 1, not only between an instance and its in-neighbors. (2) In general, a smaller neighborhood leads to a stronger notion of instance optimality. Thus, the optimality using in-neighbors is stronger than that using all neighbors, which is in turn stronger than worst-case optimality (i.e., D′ can be any instance), while the latter two are actually the same for the mean estimation problem.
(3) For an instance-optimal M (by our notion), there may still exist D, M′ such that M′ does better on D than M, but it is not possible for M′ to achieve a smaller error than that of M on D over all in-neighbors of D. This is more meaningful than ranging over all neighbors of D, some of which (e.g., one with (u, …, u) as a datum) are unlikely to be the actual instances encountered in practice.

1.2 Our Results

To design an M(D) for the mean function f(D) = (1/n) Σ_{i=1}^n x_i that achieves an error comparable to R_in-nbr(D) for all D, we need an upper bound and a lower bound. For the lower bound, we show that R_in-nbr(D) = Ω(w(D)/n), where w(D) := max_{i,i′∈[n]} ‖x_i − x_{i′}‖₂ is the diameter of D.

2 Preliminaries

2.1 Differential Privacy

Definition 1 (Differential Privacy). For ε > 0 and δ ≥ 0, a randomized algorithm M : X^n → Y is (ε, δ)-differentially private if for any neighboring datasets D ∼ D′ (i.e., d_ham(D, D′) = 1) and any E ⊆ Y, Pr[M(D) ∈ E] ≤ e^ε · Pr[M(D′) ∈ E] + δ.

Definition 2 (Concentrated Differential Privacy (zCDP) [10]). For ρ > 0, a randomized algorithm M : X^n → Y is ρ-zCDP if for any D ∼ D′, D_α(M(D) ‖ M(D′)) ≤ ρα for all α > 1, where D_α(M(D) ‖ M(D′)) is the α-Rényi divergence between M(D) and M(D′).

Note that (ε, 0)-DP implies (ε²/2)-zCDP, which in turn implies (ε²/2 + ε√(2 log(1/δ)), δ)-DP for any δ > 0. To release a numeric function f(D) taking values in R^d, the most common technique for achieving zCDP is to mask the result with Gaussian noise calibrated to the ℓ2-sensitivity of f.

Lemma 1 (Gaussian Mechanism [10]). Let f : X^n → R^d be a function with global ℓ2-sensitivity GS_f := max_{D∼D′} ‖f(D) − f(D′)‖₂. For a given dataset D ∈ X^n, the mechanism that releases f(D) + N(0, GS_f²/(2ρ) · I_{d×d}) satisfies ρ-zCDP.

Lemma 2 (Composition Theorem [10, 21]). If M is an adaptive composition of differentially private algorithms M₁, M₂, …, M_k, then
1. If each M_i satisfies (ε_i, δ_i)-DP, then M satisfies (Σ_i ε_i, Σ_i δ_i)-DP.
2. For all ε, δ, δ′ ≥ 0, if each M_i satisfies (ε, δ)-DP, then M satisfies (ε′, kδ + δ′)-DP, where ε′ = √(2k log(1/δ′)) · ε + kε(e^ε − 1).
3. If each M_i satisfies ρ_i-zCDP, then M satisfies (Σ_i ρ_i)-zCDP.
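The zCDP conversions and composition rule above reduce to a few one-line formulas; a minimal sketch (the helper names are ours, not from the paper):

```python
import math

def pure_dp_to_zcdp(eps):
    # (eps, 0)-DP implies (eps^2 / 2)-zCDP
    return eps ** 2 / 2

def zcdp_to_approx_dp(rho, delta):
    # rho-zCDP implies (rho + 2*sqrt(rho * log(1/delta)), delta)-DP,
    # the standard Bun-Steinke conversion
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

def compose_zcdp(rhos):
    # item 3 of the composition theorem: zCDP budgets add up
    return sum(rhos)
```

This additive bookkeeping is why the paper can freely split the budget ρ among subroutines (e.g., ρ/4 for quantile selection and 3ρ/4 for the clipped mean) and still account for the total privacy cost exactly.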
2.2 Differential Privacy in the Local Model and Shuffle Model

The above definitions of DP and zCDP assume that D is handled by a trusted curator and only the output of the mechanism is released to the public. Therefore, if the curator is corrupted, the privacy of all users is breached. For weaker trust assumptions, the most popular models are the local model and the shuffle model, where each user holds their own datum and locally privatizes (by some randomized mechanism) the message before sending it out for analysis. Hence, no third party has direct access to D. Formally, each user holds one datum x_i ∈ D, the protocol interacts with the dataset through some local randomizer R : X → Y, and the privacy guarantee is defined over the transcript (all messages sent during the protocol). For simplicity, we only present the definition for one-round protocols; the privacy guarantee of multi-round protocols composes across rounds by the composition theorem. The definition below uses zCDP; other DP notions can be defined similarly.

Definition 3 (Local Model (LDP)). A protocol using R(·) as the local randomizer satisfies ρ-zCDP in the local model if for any x, x′ ∈ X and any α > 1, D_α(R(x) ‖ R(x′)) ≤ ρα.

Due to the much stronger privacy requirement, the best accuracy guarantee of LDP protocols for several fundamental problems [12, 6, 18, 31] is a √n-factor worse than in the central model. The shuffle model sits at an intermediate level of trust between the local model and the central model, and aims for errors closer to the central model. The key feature of the shuffle model is a trusted shuffler S, which randomly permutes all messages before sending them to the analyzer, so that an adversary cannot identify the source of any message.
Specifically, we consider the multi-message shuffle model, where each local randomizer R : X → Y^m outputs m messages, and the transcript of the protocol Π_P(D) is a random permutation of all mn messages. The following definition uses (ε, δ)-DP; the other two DP notions can also be defined similarly, but they do not offer the improvements that we want over LDP protocols.

Definition 4 (Shuffle Model). A protocol P satisfies (ε, δ)-DP in the shuffle model if for any D ∼ D′ and any set E ⊆ Y^{mn}, Pr[Π_P(D) ∈ E] ≤ e^ε · Pr[Π_P(D′) ∈ E] + δ.

3 Our Method

3.1 Clipped-Mean Estimator

In the rest of the paper, we focus on the mean function f(D) = (1/n) Σ_{i=1}^n x_i. Since GS_f is large, a very natural idea is to clip each vector to ℓ2 norm at most some threshold C. This reduces the sensitivity to 2C/n, leading to the clipped-mean estimator [1]:

M_C(D) = (1/n) Σ_{i=1}^n min{C/‖x_i‖₂, 1} · x_i + N(0, 2C²/(ρn²) · I).   (2)

Lemma 3. For any given C, M_C(D) satisfies ρ-zCDP, and has an expected ℓ2 error at most

E[‖M_C(D) − f(D)‖₂] ≤ E(C; D) := (1/n) Σ_{i=1}^n max{‖x_i‖₂ − C, 0} + (C/n)·√(2d/ρ).

An important remaining question is how to set the clipping threshold C. Setting it too low will result in a large bias, while setting it too high will introduce a large amount of noise. We show how to choose the optimal C to balance this bias-variance trade-off. It is easy to see that the error E(C; D) is a convex function of C, so the optimal C can be found by setting the derivative of E(C; D) to zero, i.e.,

−(1/n)·|{i ∈ [n] | ‖x_i‖₂ > C}| + (1/n)·√(2d/ρ) = 0.

Therefore, the optimal choice of C is the (n − √(2d/ρ))-th quantile of {‖x_i‖₂}_{i∈[n]}.

3.2 Private Quantile Selection

However, we cannot use the optimal C directly, as that would violate DP. Instead, we find a privatized quantile with small rank error. Specifically, for this problem, D consists of a sequence of ordered integers 0 ≤ x_(1) ≤ ⋯ ≤ x_(n) ≤ u. We would like to design a DP mechanism that, for a given m, returns an x (not necessarily an element of D) such that x_(m−τ) ≤ x ≤ x_(m+τ) w.h.p. Here τ is referred to as the rank error.
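The clipped-mean estimator (2) with the bias-variance-optimal threshold can be sketched as follows. For clarity this sketch uses the non-private quantile of the norms as C (a privacy leak); the actual mechanism replaces it with the private quantile-selection subroutine. The function name is ours.

```python
import numpy as np

def clipped_mean(X, rho, rng=None):
    """Sketch of the clipped-mean estimator with C chosen as the
    (n - sqrt(2d/rho))-th smallest l2 norm, the minimizer of the
    bias-variance trade-off E(C; D). Non-private C, for illustration."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1)
    k = max(int(n - np.sqrt(2 * d / rho)), 1)
    C = np.sort(norms)[k - 1]                    # optimal clipping threshold
    scale = np.minimum(C / np.maximum(norms, 1e-12), 1.0)
    clipped = X * scale[:, None]                 # l2-clip each row to norm <= C
    noise = rng.normal(0.0, np.sqrt(2 * C**2 / (rho * n**2)), size=d)
    return clipped.mean(axis=0) + noise
```

With this choice, exactly about √(2d/ρ) vectors are clipped, balancing the clipping bias against the injected noise.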
Existing methods for private range counting queries [11, 20] can be used for this purpose, but they actually find all quantiles, which is overkill. Instead, we use a simple binary search algorithm [25, 14], which not only simplifies the algorithm but also reduces the rank error (by polylog(u) factors) to nearly optimal. Our algorithm PrivQuant makes use of a function NoisyRC([a, b], D) that returns a noisy count of |D ∩ [a, b]|; we present the details in the supplementary material. In the central DP model, we simply use NoisyRC([0, mid], D) = |D ∩ [0, mid]| + N(0, log u/(2ρ)).

Theorem 1. The algorithm PrivQuant satisfies ρ-zCDP, and it returns a quantile with rank error τ with probability at least 1 − β for τ = O(√(log u · log((log u)/β) / ρ)).

In Section 4, we prove an Ω(√(log u / ρ)) lower bound (Corollary 1) on the rank error under zCDP for constant β. Thus the algorithm is optimal up to just an O(√(log log u))-factor.

We can now use PrivQuant to find an approximately optimal clipping threshold. Specifically, we invoke PrivQuant with ρ′ = ρ/4 to find the max{n − max{√(2d/ρ), τ}, 1}-th quantile of {‖x_i‖₂²}_{i∈[n]}. These are integers no more than du², so replacing u by du² in Theorem 1 yields a rank error of τ = 2√(log(du) · log((log(du))/β) / ρ). Then we set C to the square root of the returned quantile. Finally, we return the clipped-mean estimator M_C(D) with ρ′ = 3ρ/4. The following theorem analyzes its error.

Theorem 2. Our mean estimation mechanism is ρ-zCDP and has ℓ2 error O((√(d/ρ) + τ) · r(D)/n) with probability 1 − β, where τ = 2√(log(du) · log((log(du))/β) / ρ) and r(D) denotes the largest ℓ2 norm of a vector in D.

(Throughout, define x_(j) := 0 for j < 1 and x_(j) := u for j > n.)

3.3 Shifted-Clipped-Mean Estimator

To reduce the error from being proportional to r(D) to being proportional to w(D), we perform a random rotation on D followed by a translation. The rotation is done as x̂_i := HDx_i, where H is the Hadamard matrix and D is a diagonal matrix whose diagonal entries are independently and uniformly drawn from {−1, +1}.
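The noisy binary search underlying PrivQuant (Section 3.2) can be sketched as follows. This is an illustrative sketch with our own function name; the per-query noise follows the NoisyRC calibration above, splitting the zCDP budget evenly across the roughly log₂ u search steps.

```python
import numpy as np

def priv_quant(data, m, u, rho, rng=None):
    """Sketch of binary-search quantile selection over the integer
    universe [0, u]: each step asks a noisy count of |{x in data : x <= mid}|
    and recurses left or right. Returns a value whose rank is within the
    rank error tau of m with high probability."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    steps = max(int(np.ceil(np.log2(u + 1))), 1)
    sigma = np.sqrt(steps / (2 * rho))   # per-query noise for rho-zCDP overall
    lo, hi = 0, u
    while lo < hi:
        mid = (lo + hi) // 2
        noisy = (data <= mid).sum() + rng.normal(0.0, sigma)
        if noisy < m:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Each of the ~log₂ u counting queries has sensitivity 1, so by the composition theorem the whole search satisfies ρ-zCDP.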
Note that for now we omit the normalization coefficient 1/√d, so that each coordinate of x̂_i is still an integer; we will apply the normalization to the final estimator instead. Then, for each j ∈ [d], we invoke PrivQuant with ρ′ = ρ/(4d) to find an approximate median of {x̂_i}_{i∈[n]} along dimension j, denoted c̃_j. Next, we shift the dataset to be centered around c̃ = (c̃_1, …, c̃_d), obtaining D̃ = {x̃_i := x̂_i − c̃}_{i∈[n]}. Note that c̃ has integer coordinates, so x̃_i does too. Finally, we apply the clipped-mean estimator of Theorem 2 with ρ′ = (3/4)ρ on D̃, obtaining an estimate ỹ, and return y := (1/d)·DH(ỹ + c̃) as the mean estimate over D.

Theorem 3. Set τ = √(log(du) · log((d log(du))/β) / ρ) and assume n = Ω(τ√d). Our mean estimation mechanism is ρ-zCDP, and has ℓ2 error O((√(d/ρ) + τ) · √(log(nd/β))) · w(D)/n with probability 1 − β.

Remark 1. For a dataset with real coordinates bounded by R (in absolute value), one can quantize each coordinate to an integer using bucket size α/√d, for any 0 < α < R, and then apply our algorithm over an integer universe of size u = 2R√d/α. This only brings an additive α error to the error bound of Theorem 3.

Statistical Mean Estimation. Suppose D consists of i.i.d. samples drawn from the multivariate Gaussian distribution N(µ, Σ), and we wish to estimate µ, assuming a priori bounds ‖µ‖₂ ≤ R and σ²_min·I ⪯ Σ ⪯ σ²_max·I. We first clip each sample, x_i ← x_i · min{R′/‖x_i‖₂, 1}, where R′ := R + 2σ_max·√(log(n/β)). Then we apply our mechanism with a suitably small bucket size α/√d. We analyze the error in the supplementary material. When Σ = I and ignoring log^{O(1)}(dnR/σ_min) factors, the error becomes O(√(d/n) + d/(n√ρ)), matching the known optimal bound for Gaussian mean estimation [8].

4 Lower Bounds

In this section we establish the instance optimality of Theorem 3 via three lower bounds: (1) R_in-nbr(D) = Ω(w(D)/n) for all D; (2) an Ω(√(d/ρ)) lower bound on the optimality ratio; and (3) that the condition on n in Theorem 3 is necessary. The first lower bound follows from an observation by Vadhan [35]:

Lemma 4 ([35]).
For any f, any (ε, δ)-DP mechanism M′, and any neighboring datasets D₀ ∼ D₁, there is a b ∈ {0, 1} such that Pr[‖M′(D_b) − f(D_b)‖₂ < ‖f(D₀) − f(D₁)‖₂/2] < (1 + δ)/(1 + e^{−ε}).

Theorem 4. For ε < 0.1 and δ < 0.1, R_in-nbr(D) = Ω(w(D)/n).

The lower bound on the optimality ratio is by a reduction from statistical mean estimation [26].

Theorem 5. Let M be any ρ-zCDP mechanism for mean estimation that has ℓ2 error c · w(D)/n with constant probability for any D = {x_i}_{i∈[n]} drawn from [u]^d. If ρ = O(√d/n), then c = Ω(√(d/ρ)).

For the lower bound on n, we consider a weaker problem (so the lower bound is stronger), which is the d-dimensional version of the interior point problem [9]: Given a dataset D = {x_i}_{i∈[n]} drawn from [u]^d, the mechanism is only required to return a y ∈ [u]^d such that min_i x_{ij} ≤ y_j ≤ max_i x_{ij} for all j with constant probability.

Theorem 6. If there exists a ρ-zCDP mechanism that solves the interior point problem with success probability 2/3, then n = Ω(√(d log u / ρ)).

Given a quantile selection mechanism with rank error τ, by finding the median, the 1-dimensional interior point problem can be solved whenever n = Ω(τ). The following corollary then follows from Theorem 6.

Corollary 1. Any ρ-zCDP mechanism for the quantile selection problem must have rank error Ω(√(log u / ρ)).

5 Extension to the Local Model and Shuffle Model

Our mean estimation framework can be summarized as follows: (1) Given D = {x₁, …, x_n}, perform a random rotation, obtaining D̂ = {x̂_i := HDx_i}_{i∈[n]}; (2) For each j ∈ [d], find an approximate median c̃_j of D̂ along dimension j, and shift D̂ to be centered around c̃, obtaining D̃ = {x̃_i := x̂_i − c̃}; (3) Find a clipping threshold C, which is the m-th quantile of the ℓ2 norms of the vectors in D̃ (in the central model, the optimal choice is m = n − √(2d/ρ)); (4) Perform ℓ2 clipping over D̃ using C, and obtain a mean estimate ỹ of the clipped vectors.
Finally, return y := (1/d)·DH(ỹ + c̃).

We note that each step has its counterpart in the local and shuffle models: Step (1) is easy, as the randomized diagonal matrix D can be generated using public randomness, or sent from the aggregator to each user if public randomness is not available. Steps (2) and (3) both rely on quantile selection, which has alternatives in the local and shuffle models. Step (4) is also easy since all vectors have norms bounded by C. We elaborate on the details in the supplementary material.

The Local Model. For step (4), the standard LDP mechanism is that each user applies the Gaussian mechanism (Lemma 1) with GS = 2C to inject noise into their clipped vector and sends it out; the aggregator adds the reports up and divides the sum by n. For quantile selection in steps (2) and (3), we use the LDP range counting protocol of [13], which returns a data structure from which any range counting query can be answered. Putting things together, we obtain the following result.

Theorem 7. Set τ = √(n/ρ) · log²(du) · log((du)/β) and assume n = Ω(τ√d). There is a 3-round ρ-zCDP mean estimation mechanism in the local model, achieving an ℓ2 error of O((√(dn/ρ) + τ) · √(log(nd/β))) · w(D)/n with probability 1 − β.

The Shuffle Model. The error in the local model is worse than that in the central model by a √n-factor. It turns out that in the shuffle model, we can match the result in the central model up to logarithmic factors, albeit with (ε, δ)-DP. This is mostly due to highly accurate summation and range counting protocols discovered recently for the shuffle model. We first extend the one-dimensional summation protocol in [5] to high-dimensional data by using a random rotation.

Lemma 5.
Given D = {x_i}_{i∈[n]} ⊆ R^d with ‖x_i‖₂ ≤ C for all i, there is a one-round (ε, δ)-DP mean estimation protocol in the shuffle model that returns a y such that ‖y − f(D)‖₂ ≤ O((C/(εn)) · log(nd) · √(log d)) with constant probability.

For the NoisyRC queries in the algorithm PrivQuant, we can use the range counting mechanism [24] in the shuffle model. Putting things together, we obtain the following result.

Theorem 8. Set τ = (1/ε) · log^{3.5}(du) · √(log d · log((du)/δ)) and assume n = Ω(τ√d). There is a 3-round (ε, δ)-DP mean estimation mechanism in the shuffle model, achieving an ℓ2 error of O((√d/ε + τ) · log(nd)) · w(D)/n with probability 2/3.

6 Experiments

We performed both statistical and empirical mean estimation experiments to evaluate our method. For statistical mean estimation, we used multivariate Gaussian distributions with various µ and Σ. All algorithms are given the same R, σ_min, σ_max. We tried various R, while fixing σ_min = 0.1 and σ_max = R/√d. For empirical mean estimation, we used a real-world dataset, MNIST, which consists of 70,000 images of handwritten digits, each represented by a vector of dimension d = 784 = 28 × 28. We quantized the values to integers in [u] for u = 2^10. We measured the ℓ2 error by taking the trimmed mean with trimming parameter 0.1 over 100 trials (as in [8]).

For the central model, we compared with COINPRESS² [8]. It starts with a given confidence ball of radius R that captures the mean, and iteratively refines the confidence ball. The number of iterations t is an important internal parameter of this algorithm; we tried t = 1, 2, 3, 4, 10 following their suggestion.

[Figure 1: ℓ2 error vs. d for N(µ, I_{d×d}), where n = 4000, ρ = 0.5, R = 50; panels (a) µ = 0·1_d, (b) µ = 5·1_d, (c) µ = 10·1_d. Methods: Non-private, COINPRESS (t = 1, 2, 3, 4, 10), CM, Shifted-CM.]

[Figure 2: ℓ2 error vs. ρ for N(µ, I_{d×d}), where n = 4000, d = 128, R = 50; panels (a) µ = 0·1_d, (b) µ = 5·1_d, (c) µ = 10·1_d. Same methods as Figure 1.]

[Figure 3: ℓ2 error vs. d for N(µ, Σ), where Σ = diag([σ₁², σ₂², …, σ_d²]), n = 4000, ρ = 0.5, R = 50; panels (a) σ_j² = 0.1 for all j ∈ [d], (b) σ_j² = 10, (c) σ_j² ∈ [0, 10]. Same methods as Figure 1.]

The results for statistical mean estimation are shown in Fig. 1 to 5, where the detailed parameter settings are given in the captions. The results show that our method (Shifted-CM) performs at least as well as COINPRESS with the best t across a variety of settings. In particular, when Σ is not the identity I_{d×d}, our method significantly outperforms COINPRESS and offers errors close to the non-private estimator, as demonstrated in Fig. 3 and Fig. 4. Moreover, as seen from Fig. 3, 4, and 5, the best choice of t for COINPRESS is quite sensitive to Σ and R, making it difficult to tune in practice.

²We also compared with another version of COINPRESS in the supplementary material, where we first scale the data by an estimated Σ (obtained by MVCRec [8] using half of the privacy budget ρ).
[Figure 4: ℓ2 error vs. ρ for N(µ, Σ), where Σ = diag([σ₁², σ₂², …, σ_d²]), n = 4000, d = 128, R = 50; panels (a) σ_j² = 0.1 for all j ∈ [d], (b) σ_j² = 10 for all j ∈ [d], (c) σ_j² ∈ [0, 10]. Methods: Non-private, COINPRESS (t = 1, 2, 3, 4, 10), CM, Shifted-CM.]

[Figure 5: ℓ2 error vs. R for N(µ, Σ), where Σ = diag([σ₁², σ₂², …, σ_d²]) and σ_j² drawn u.a.r. from [0, 10] for each j ∈ [d], n = 4000, ρ = 0.5, d = 128; panels (a) µ = 0·1_d, (b) µ = 5·1_d, (c) µ = 10·1_d.]

Both our method and COINPRESS are translation-invariant. This can be verified from Fig. 1, 2, and 5, where the results are not affected by µ. However, the approaches taken are different: COINPRESS uses an iterative process, while we shift the dataset to be centered around an approximate center point. In Fig. 1 and 2, we also include CM, our estimator without this shift operation, which is indeed affected by µ.

[Figure 6: ℓ2 error vs. ρ for various digits in MNIST, where d = 784, R = 50; panels (a) digit 0, (b) digit 1, (c) digit 2. Methods: COINPRESS (t = 1, 2, 3, 4, 10), Shifted-CM.]

For empirical mean estimation on the MNIST dataset, we see in Fig. 6 that our method outperforms COINPRESS at various privacy levels. This suggests that this dataset, as with most real-world datasets, does not follow a Gaussian distribution with identity Σ. The instance optimality of our method is precisely the reason behind its robustness to different distributional assumptions, or the lack thereof.

One important component of our method is the optimal clipping threshold C, which is the (n − √(2d/ρ))-th quantile of the norms. Note that this depends on d and ρ. Prior work [3] used a fixed quantile (e.g., the median). To better see the relationship between the optimal C and d, ρ, we used a synthetic dataset D = {i·1_d}_{i∈[n]} with n = 500 and tried different quantiles as the clipping threshold. The results in Fig. 7 confirm our theoretical analysis: the optimal C indeed depends on d and ρ, while our choice attains the optimum.

[Figure 7: ℓ2 error vs. d for clipping at various quantiles of the norms (50%, 75%, 85%, 95%, Ours) on a synthetic dataset; panels for (a) ρ = 0.1, (b) ρ = 0.5, and a third ρ value.]

We also evaluated our algorithm in the local model and compared it with [23]; the results are presented in the supplementary material.

Acknowledgments and Disclosure of Funding

We would like to thank Gautam Kamath and Jonathan Ullman for helpful discussions about CoinPress [8]. Funding in direct support of this work: this work is supported by HKRGC under grants 16201318, 16201819, and 16205420.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[2] Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvitskii.
Bounding user contributions: A bias-variance trade-off in differential privacy. In International Conference on Machine Learning, pages 263–271. PMLR, 2019.

[3] Galen Andrew, Om Thakkar, H. Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. arXiv preprint arXiv:1905.03871, 2019.

[4] Hilal Asi and John C. Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. Advances in Neural Information Processing Systems, 33, 2020.

[5] Borja Balle, James Bell, Adrià Gascón, and Kobbi Nissim. Private summation in the multi-message shuffle model. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 657–676, 2020.

[6] Raef Bassily and Adam Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 127–135, 2015.

[7] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.

[8] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. CoinPress: Practical private mean and covariance estimation. Advances in Neural Information Processing Systems, 2020.

[9] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 634–649. IEEE, 2015.

[10] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.

[11] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security, 2011.

[12] T.-H. Hubert Chan, Elaine Shi, and Dawn Song.
Optimal lower bound for differentially private multi-party aggregation. In European Symposium on Algorithms, pages 277–288. Springer, 2012.

[13] Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. Answering range queries under local differential privacy. Proceedings of the VLDB Endowment, 12(10):1126–1138, 2019.

[14] Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[15] Peter Davies, Vijaykrishna Gurunanthan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. New bounds for distributed mean estimation and variance reduction. In International Conference on Learning Representations, 2021.

[16] Apple Differential Privacy Team. Learning with privacy at scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf, December 2017.

[17] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In NIPS, 2017.

[18] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory, pages 1161–1191. PMLR, 2019.

[19] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy, data processing inequalities, and minimax rates. arXiv preprint arXiv:1302.3203, 2013.

[20] Cynthia Dwork, Moni Naor, Omer Reingold, and Guy N. Rothblum. Pure differential privacy for rectangle queries via private partitions. In International Conference on the Theory and Application of Cryptology and Information Security, pages 735–751. Springer, 2015.

[21] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

[22] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response.
In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.

[23] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private mean estimation: z-test and tight confidence intervals. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2545–2554. PMLR, 2019.

[24] Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. On the power of multiple anonymous messages. In EUROCRYPT, 2021.

[25] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, pages 454–465. Elsevier, 2002.

[26] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR, 2019.

[27] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In Conference on Learning Theory, pages 2204–2235. PMLR, 2020.

[28] Gautam Kamath and Jonathan Ullman. A primer on private statistics. arXiv preprint arXiv:2005.00010, 2020.

[29] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018.

[30] Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and Ananda Theertha Suresh. Learning with user-level privacy. arXiv preprint arXiv:2102.11845, 2021.

[31] Andrew McGregor, Ilya Mironov, Toniann Pitassi, Omer Reingold, Kunal Talwar, and Salil Vadhan. The limits of two-party differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 81–90. IEEE, 2010.

[32] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang.
Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.

[33] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 75–84, 2007.

[34] Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. AdaClip: Adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643, 2019.

[35] Salil Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pages 347–450. Springer, 2017.