# subsetbased_instance_optimality_in_private_estimation__96daa579.pdf Subset-Based Instance Optimality in Private Estimation Travis Dick 1 Alex Kulesza 1 Ziteng Sun 1 Ananda Theertha Suresh 1 We propose a new definition of instance optimality for differentially private estimation algorithms. Our definition requires an optimal algorithm to compete, simultaneously for every dataset D, with the best private benchmark algorithm that (a) knows D in advance and (b) is evaluated by its worst-case performance on large subsets of D. That is, the benchmark algorithm need not perform well when potentially extreme points are added to D; it only has to handle the removal of a small number of real data points that already exist. This makes our benchmark significantly stronger than those proposed in prior work. We nevertheless show, for real-valued datasets, how to construct private algorithms that achieve our notion of instance optimality when estimating a broad class of dataset properties, including means, quantiles, and ℓp-norm minimizers. For means in particular, we provide a detailed analysis and show that our algorithm simultaneously matches or exceeds the asymptotic performance of existing algorithms under a range of distributional assumptions. 1. Introduction Differentially private algorithms intentionally inject noise to obscure the contributions of individual data points (Dwork et al., 2006; 2014). This noise, of course, reduces the accuracy of the result, so it is natural to ask whether we can derive a private algorithm that minimzes the cost to accuracy, optimally estimating some property θ(D) of a dataset D given the constraint of differential privacy. But what do we mean by optimal? Optimal worst-case loss is often achievable by algorithms that add noise calibrated to the global sensitivity of θ (Dwork et al., 2006; Ald a & Simon, 2017; Balle & Wang, 2018; Fernandes et al., 2021). 1Google Research, New York. Correspondence to: Ziteng Sun . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). However, realistic datasets may have local sensitivities that are much smaller, making this a weak notion of optimality that does not reward reducing loss on typical easy examples (Nissim et al., 2007; Bun & Steinke, 2019). At the other extreme, we might hope for instance optimality, where a single algorithm competes, simultaneously for all datasets D, with the best possible benchmark algorithm chosen with knowledge of D. It is easy to see that this is too strong: a constant algorithm that always returns exactly θ(D), regardless of its input, is perfectly private and gives zero loss on D. Of course, we cannot hope to achieve zero loss for all datasets at once. Thus, there has been recent interest in finding variations of instance optimality that are strong and yet still achievable. Asi & Duchi (2020b) gave a variant defined using local minimax risk, which softens the definition above by allowing an optimal algorithm s performance to degrade on D whenever there is a second dataset D such that no private algorithm can simultaneously achieve low loss on both D and D . Huang et al. (2021) gave a different variant of instance optimality for mean estimation in which the worst-case loss across nearby datasets sharing support with D defines the risk for D. Related ideas were also explored by Błasiok et al. (2019); Brunel & Avella-Medina (2020); Mc Millan et al. (2022), and others. In this work we propose a new definition of instance optimality. We refer to our definition as subset optimality because it calibrates the risk of a dataset D using only subsets of D. To motivate this approach, consider the basic semantics of differential privacy. A DP algorithm is required to give similar output distributions on D and D whenever D is a neighbor of D that is, whenever D can be obtained from D by adding or removing a single point. Technically, this definition is symmetric, but in practice addition and removal can have very different implications. For typical real datasets where data points tend to be similar to one another, removing a point may change the target property θ(D) only modestly, and a correspondingly modest amount of noise might ensure sufficiently similar output distributions. On the other hand, an added point could potentially be an extreme outlier that dramatically changes θ(D), requiring a large amount of noise to conceal its presence. Concretely, imagine we have a database reporting the annual Subset-Based Instance Optimality in Private Estimation incomes for n households in a particular neighborhood, the largest of which happens to be $100,000. We wish to privately compute the mean household income. Removing one of the existing households from the database can change the mean by at most roughly $100,000/n. Adding a new household, on the other hand, could have a dramatically larger effect a new household s income might theoretically be, say, $100,000,000, requiring 1000x more noise. Thus, an algorithm that is required to perform well only on subsets of the real dataset intuitively has an advantage over one that must perform well everywhere. The surprising implication of our results is that this is not always true. We demonstrate how to construct subset-optimal differentially private algorithms that simultaneously compete, for all datasets D, with the best private algorithm that (a) knows D in advance and (b) is evaluated by its worstcase performance over only (large) subsets of D. Subset optimality is achievable for a class of monotone properties that include means, quantiles, and other common estimators, despite being stronger than prior definitions in the literature. We begin by describing our setting in more detail, defining subset optimality, and comparing our definition to those found in related work. We then show how to achieve subset optimality for mean estimation, giving an algorithm that simultaneously matches or exceeds the asymptotic performance of existing algorithms under a range of distributional assumptions (see Table 1). Finally, we generalize the result for means and show how to construct optimal algorithms for a broad class of monotone properties. 2. Problem formulation A dataset D is a collection of points from the domain [ R, R]. Let |D| be the cardinality of the set D. For two datasets D and D we define the distance between them to be the number of points that need to be added to and/or removed from D to obtain D . More formally, d(D, D ) |D \ D | + |D \ D|. For example, d({1, 2, 3}, {2, 3, 4}) = 2. We refer to D and D as neighboring datasets if d(D, D ) = 1 (Dwork et al., 2014, Chapter 2). Note that this notion of neighboring datasets is sometimes called the add-remove model, in contrast to the swap model where all datasets are the same size and datasets are neighbors if they differ in a single point. We use the add-remove model here since we study algorithms that accept input datasets of various sizes. Because the add-remove model leads to stronger privacy than the swap model, our algorithms also provide guarantees in the swap model, with at most a factor of two increase in the privacy parameter ε. Definition 2.1 (Differential privacy). A randomized algo- rithm A with range R satisfies ε-differential privacy if for any two neighboring datasets D, D and for any output S R, it holds that Pr[A(D) S] eεPr[A(D ) S]. Let Aε be the set of all ε-DP algorithms. Our goal is to estimate some property of D, denoted by θ(D), and we use a loss function ℓ: R R R 0 to measure the performance of an algorithm A on a dataset D as ℓ(A(D), θ(D)). Given ℓ, we would like to identify an algorithm A that performs well not just on average or for certain datasets, but simultaneously for every dataset D. For each D, we will aim to compete with a benchmark algorithm that can be selected using knowledge of D but is evaluated according to its performance on large subsets of D. In particular, we adopt the following definition of subset-based risk. R(D, ε) inf A Aε sup D D |D | |D| 1/ε E [ℓ(A(D ), θ(D ))] 1 (1) The benchmark algorithm for D is the one with the lowest worst-case loss on datasets obtained by removing up to 1/ε elements from D. Intuitively, if it is feasible to perform well on all of these subsets at once, then the benchmark risk is small and an optimal algorithm will be expected to perform correspondingly well, even if adding elements to D would dramatically increase the benchmark algorithm s loss. When no private algorithm performs well on all large subsets of D, then an optimal algorithm will also be permitted to have larger loss. Definition 2.2. We say an algorithm A is subset-optimal with respect to Aε if there exist constants α, β, and c such that, for all D, we have E [ℓ(A(D), θ(D))] α R(D, ε) + β, and A is c ε-DP. Ideally, we want the constants α and c to be close to 1 and β to be close to 0. Remark 2.3. The constraint |D | |D| 1/ε in Equation (1) could be generalized to |D| τ for any τ 0. Intuitively, smaller τ would lead to a stronger notion of optimality, since the benchmark algorithm would need to perform well on fewer datasets. Our choice of τ = 1/ε gives the strongest possible optimality definition that remains achievable: for any τ 1/ε, it is impossible to 1We focus on the case when ε 1 in this paper. To simply our notation, we often treat 1/ε as a positive integer. If this is not the case, we can set ε = 1/ 1/ε (ε/2, ε]. This will affect our results by at most constant factors. Subset-Based Instance Optimality in Private Estimation compete with the benchmark algorithm. To see this, consider the task of real mean estimation. Let D1 contain (n 1/ε) copies of 0 and 1/ε copies of 1, and let D2 contain (n 1/ε) copies of 0 and 1/ε copies of 1. Since the add/remove distance between D1 and D2 and is 2/ε, standard packing arguments (e.g., Lemma 4.4 or Lemma 8.4 in Dwork et al. (2014)) show that an (ε, δ)-DP algorithm cannot give significantly different answers on D1 and D2, and therefore must incur an error of at least Θ(1/nε) on one of them. On the other hand, a benchmark algorithm that knows the dataset can always output 1/nε for D1, and on subsets of D1 with size at least |D1| τ, the error will be at most τ/n (and similarly for D2). Thus τ 1/ε does not yield an achievable definition of instance-optimality. Notation. For a dataset D = {x1 x2 . . . xn}, let Lm(D) denote the multiset {x1, x2, . . . , xm} (the m lower elements of D) and Um(D) denote the multiset {xn m+1, . . . , xn} (the m upper elements of D). 2.1. Our Contributions. Subset-based instance optimality. We propose a new notion of instance-optimality (Definition 2.2) for private estimation. The notion has the advantage of only considering the effect of removing values from the dataset, which leads to tighter (or as tight) rates compared to other instanceoptimal formulations that need to handle extreme data points. See Section 2.2 for a detailed discussion. Moreover, we propose Algorithm 4 based on private threshold estimation and the inverse sensitivity mechanism of (Asi & Duchi, 2020a). For real-valued datasets, the algorithm is subset-optimal for a wide range of monotone properties with arbitrary β > 0 and α and c at most logarithmic in problem-specific parameters (see Section 4 and Theorem 4.3 for the precise definition and statement). Improvement on mean estimation. For the task of private mean estimation (Section 3), we propose an efficient algorithm (Algorithm 3) that is subset-optimal. In the statistical setting (Section 5) we show how this algorithm obtains distribution-specific rate guarantees that depend on all centralized absolute moments (Corollary 5.3). To the best of our knowledge, the rate improves upon the previously bestknown distribution-specific results for distributions whose kth-moment is much smaller than its best sub-Gaussian proxy for some k 2. For distribution families with concentration assumptions, the distribution-specific rate recovers (up to logarithmic factors) the min-max optimal rate for each corresponding family. Moreover, our proposed algorithm achieves this rate without explicit assumptions on which family the distribution comes from. See Table 1 for a detailed comparison. 2.2. Related Work Instance-optimality in private estimation. Several variations of instance optimality for differential privacy have been studied recently. Asi & Duchi (2020b) initiated the study of instance optimality using the following notion of local minimax risk: R1(D, ε) = sup D inf A Aε sup D {D,D } E[ℓ(A( D), θ( D))]. They showed that the inverse sensitivity mechanism gives nearly optimal results with respect to this notion of risk. However, for mean estimation in one dimension, with all values bounded in [ R, R], it can be shown that for every dataset D. In contrast, subset optimality provides much tighter guarantees for mean estimation, roughly replacing the full range R with the range actually spanned by D, as described in the sections below. For general losses, Asi & Duchi (2020b) showed that R1(D, ε) max D :d(D,D ) 1/ε ℓ(θ(D), θ(D )), (2) while we show that R(D, ε) max D :d(D,D ) 1/ε,D D ℓ(θ(D), θ(D )), which is strictly tighter due to the subset constraint. (As illustrated above, the difference can be dramatic for realistic D.) Mc Millan et al. (2022) studied a quantity similar to R1(D, ε) in the setting where D is drawn from a distribution. We focus on the comparison to Asi & Duchi (2020a) since it is most relevant to our setting. Huang et al. (2021) considered a slightly different notion of instance optimality given by R2(D, ε) = inf A Aε sup supp(D ) supp(D) d(D,D )=1 Pr (ℓ(A(D ),θ(D ))>η)<2 where supp(D) denotes the set of unique elements in the dataset D. They proposed an algorithm for d-dimensional mean estimation and showed that it is O( p d/ρ)-optimal for ρ-z CDP (concentrated differential privacy). Their definition and results differ from R(D, ε) and R1(D, ε) in two basic ways. First, their definition is a high probability definition, whereas the others are in expectation. Second, even for one-dimensional mean estimation, their proposed algorithm is only O(1/ ρ) competitive with the lower bound, and they further show that no algorithm can achieve a better competitive guarantee. Our definition (modulo expectation) Subset-Based Instance Optimality in Private Estimation Table 1. Results for statistical mean estimation. R(A, p) is the expected absolute error of A given n i.i.d. samples from p (Equation (3)). Mk(p) denotes the kth absolute central moment of p (Definition 5.2). σG(p) denotes the best sub-Gaussian proxy of p. is due to (Huang et al., 2021). is due to Kamath et al. (2020) and is due to Karwa & Vadhan (2017). Assumption Metric Prior work This work N.A. R(A, p) O M2(p) n + σG(p) M2(p) n +mink Mk(p)1/k Bounded Moment maxp:Mk(p) mk R(A, p) O m1/k k n + m1/k k (nε)1 1/k O m1/k k n + m1/k k (nε)1 1/k Sub-Gaussian maxp:σG(p) σ R(A, p) O σ n + σ nε O σ n + σ coincides with this definition for ε = 1; however, we are able to construct constant optimal algorithms for general ε. We also show that our definition of optimality is achievable for target properties beyond just means. Remark 2.4. The definition of Huang et al. (2021) can be extended by changing d(D, D ) = 1 to d(D, D ) 1/ε, bringing it closer to our definition. However, this leads to an instance-dependent risk of R2(D, ε) sup supp(D ) supp(D) d(D,D ) 1/ε ℓ(θ(D), θ(D )), which can be much larger than bound in Lemma 3.2. In Appendix F, we use ℓp minimization as a concrete example to compare these instance-dependent risks and show that our new definition gives significant quantitative improvements for specific datasets. It is worth remarking here that both Asi & Duchi (2020a) and Huang et al. (2021) provide achievability results when the dataset is supported over a high-dimensional space while our result mainly focuses on one-dimensional datasets. Whether our subset-based instance optimality can be achieved in the high-dimensional setting is an interesting future direction to explore. Private statistical mean estimation. Private mean estimation has been widely studied in the statistical setting, where the dataset is assumed to be generated i.i.d. from an underlying distribution. Classic methods such as the Laplace or Gaussian mechanism (Dwork et al., 2014) incur a privacy cost that scales with the worst-case sensitivity. However, recently, by assuming certain concentration properties of the underlying distribution (e.g., sub-Gaussianity (Karwa & Vadhan, 2017; Cai et al., 2019; Smith, 2011; Kamath et al., 2019; Bun & Steinke, 2019; Biswas et al., 2020), bounded moments (Feldman & Steinke, 2018; Kamath et al., 2020; Hopkins et al., 2022) and high probability concentration (Levy et al., 2021; Huang et al., 2021)), it has been shown that the privacy cost can be improved to scale with the concentration radius of the underlying distribution. These algorithms are in some cases known to be (nearly) optimal for distributions satisfying these specific concentration properties. It is worth remarking here that most of these works consider high-dimensional distributions while our work only focuses on real-valued datasets. Moreover, for moment-bounded distributions, (Kamath et al., 2020) achieves constant-optimal estimation risk while our obtained rate is only optimal up to logarithmic factors. The results are outlined in Table 1. Our mean estimation algorithm is similar to that of (Huang et al., 2021). However, we choose a tighter threshold in the clipping bound estimation step, which is crucial in the analysis to achieve the instance-optimal bound. 3. Subset-Optimal Private Means We start by presenting an efficient subset-optimal algorithm for estimating means; in Section 4 we will generalize this approach to a larger class of properties. Let µ(D) denote the mean of a dataset D. We will use ℓ(x, y) = |x y| as our loss function. Our main result is stated in the theorem below. Theorem 3.1. Let D [ R, R] be a multiset of points and let ˆµ be the output of Algorithm 3 with parameters R and ε > 0. Publishing ˆµ is 3ε-differentially private and for any γ > 0, we have E [|ˆµ µ(D)|] = O R (D, ε) log Rε We begin by establishing an up-to-constant characterization of the subset-based risk for mean estimation under this loss. Lemma 3.2. For any ε (0, 1] and multiset D R with |D| > 1 R(D, ε) = c D,ε µ(D \ L 1 ε ) µ(D \ U 1 Subset-Based Instance Optimality in Private Estimation where c D,ε [1/(2e2), 1]. The upper bound is straightforward; the lower bound proof is based on standard packing arguments (Dwork et al., 2014) and appears in Appendix B.1. Now we turn to designing a subset-optimal algorithm that is competitive with this lower bound for every dataset D and prove Theorem 3.1. First, we describe a straightforward algorithm for private mean estimation of bounded datasets. Pseudocode is given in Algorithm 1, and privacy and utility analyses are given in Lemma 3.3. We use D + m to denote {x + m | x D}, and clip(D, [l, u]) to denote {min(max(x, l), u) | x D}. Algorithm 1 Bounded mean estimation Input: Multiset D [l, u], ε > 0. 1 Let w = u l and m = l+u 2 . 2 Let D = D m. 3 Let ˆn = n + Zn, where n = |D |, Zn Lap( 2 ε). 4 Let ˆs = s + Zs, where s = P x D x, Zs Lap( w ε ). 5 Output ˆµ = clip( ˆs We provide the proof in Appendix B.2. Lemma 3.3. Let [l, u] be any interval and ε > 0 be any privacy parameter. Publishing ˆµ output by Algorithm 1 is ε-differentially private. Furthermore, E [|ˆµ µ(D)|] 3(u l) In order to construct a subset-optimal mean algorithm from Algorithm 1, the high level idea is to first find an interval [ˆl, ˆu] that contains all but a small number of outliers from the dataset D, clip D to [ˆl, ˆu], and finally apply Algorithm 1. The error of this algorithm will have two main components: error incurred by clipping the data to [ˆl, ˆh], and error due to the noise added by Algorithm 1. Our analysis shows that both error components are not significantly larger than the lower bound given by Lemma 3.2. We start by describing the subroutine that we use to choose ˆl and ˆu; following Lemma 3.2, our goal will be to find ˆl and ˆu that delineate approximately 1/ε elements of D each. 3.1. Private Thresholds We present an ε-differentially private algorithm that, given a multiset D of real numbers and a target rank r, outputs a threshold τ R that is approximately a rank-r threshold for D. Roughly, τ being a rank-r threshold for D means that r points in D are less than or equal to τ. However, if D Figure 1. Example of ranks as defined in Definition 3.4. There are 4 points with x2 = x3. For each rank r {0, . . . , 4} we show the interval of points that are rank-r thresholds. For r = 0, 1, . . . , 4, the intervals of rank-r thresholds are given by I0 = [ , x1], I1 = [x1, x2], I2 = {x2}, I3 = [x3, x4], and I4 = [x4, ], respectively. has repeated points then there can be ranks for which no such threshold exists. (In the extreme, consider the dataset D = {x, . . . , x}, containing n copies of the same point; exactly 0 or n points are less than or equal to any threshold τ.) We will therefore define a rank-r threshold in a slightly more general way so that there is at least one rank-r threshold for every r {0, . . . , |D|}. This definition is consistent with the standard definition of quantiles for distributions with point-masses. Definition 3.4. Let D be any multiset of real numbers. We say that τ R is a rank-r threshold for D if P x D I{x < τ} r and P x D I{x τ} r. That is, there are at most r points strictly smaller than x, and at least r points that are greater than or equal to x. For any threshold τ, let ranks(τ, D) denote the set of ranks r such that τ is a rank-r threshold for D. See Figure 1 for an example of this rank definition in a dataset with repeated points. Definition 3.5. The rank error of a threshold τ for dataset D and target rank r is errrank(τ, r, D) = min r ranks(τ,D) |r r |. The rank error measures how close τ is to being a rank-r threshold for D and is equal to 0 if and only if τ is a rank-r threshold. When privately estimating rank-r thresholds for a dataset D, we will incur a bicriteria error: our output will be close (on the real line) to a threshold with low rank error. Definition 3.6. Let D be any multiset of real numbers. We say that τ is an (α, β)-approximate rank-r threshold for D if there exists τ R so that |τ τ | α and errrank(τ , r, D) β. Our algorithm for finding approximate rank-r thresholds is an instance of the exponential mechanism. The loss (or Subset-Based Instance Optimality in Private Estimation Algorithm 2 Threshold estimation Input: Dataset D [a, b], target rank r, data range [a, b], distance α > 0, privacy parameter ε > 0. 1 Define ℓ(τ) = minτ α τ τ+α errrank(τ , r, D). 2 Let f(τ) = exp( ε 2ℓ(τ))/ R b a exp( ε 2ℓ(τ)) dτ. 3 Output sample ˆτ in [a, b] drawn from density f. negative utility) function that we minimize is parameterized by the distance error α > 0, and the loss of a threshold τ is defined to be the minimum rank-error of any threshold τ within distance α of τ. Intuitively, by allowing the loss function to search in a window around τ, we guarantee that there is always an interval of width α with loss zero (namely, any interval centered around a true rank-r threshold). This is sufficient to argue that the exponential mechanism outputs a low-loss threshold with high probability and, by definition of the loss, this implies that there is a nearby threshold with low rankerror. Pseudocode is given in Algorithm 2. Since the density f is piecewise constant with at most 2|D| discontinuities at locations x α for x D, it is possible to sample from f in time O(|D| log |D|) (see the proof of Theorem 3.7 for details). Theorem 3.7, which is proved in Appendix A, shows that this algorithm outputs an (α, β)-approximate threshold. Theorem 3.7. Fix data range [a, b], ε > 0, and distance error 0 < α b a 2 . For any dataset D [a, b] and rank 1 r |D|, let ˆτ be the output of Algorithm 2 run on D with parameters [a, b], α, and ε. Publishing ˆτ satisfies ε-DP and, for any ζ > 0, with probability at least 1 ζ, ˆτ is an (α, β)-approximate rank-r threshold for D with β = 2 αζ . Moreover, Algorithm 2 can be implemented with O(|D| log |D|) running time. Next, we argue that when the dataset is supported on a grid Z = {z1, . . . , zm}, where zi+1 = zi + γ, running Algorithm 2 with a sufficiently small distance parameter α and rounding to the nearest grid point results in a (0, β)- approximate rank-r threshold with high probability. The proof of Corollary 3.8 is in Appendix A. Corollary 3.8. Let Z = {z1 zm} be such that zi+1 = zi + γ for all i = 1, . . . , m 1 and let D be any multiset supported on Z. Let ˆτ be the output of Algorithm 2 run on D with parameters [a, b] = [z1, zm], α = γ/3, and ε > 0, and let τ the closest point in Z to ˆτ. Then for any ζ > 0, with probability at least 1 ζ we have that τ is a (0, β)-approximate rank-r quantile for D with β = 2 3.2. Mean Estimation We now present the pseudo-code for our subset-optimal mean estimation algorithm in Algorithm 3 and give the proof sketch of Theorem 3.1. Proof sketch of Algorithm 3. We provide the proof sketch here and provide a detailed proof in Appendix B.3. Our goal is to show that the expected error of Algorithm 3 is not much larger than R(D, ε) µ(D \ L 1 ε ) µ(D \ U 1 With probability at least 1 ζ, by Theorem 3.7 we are guaranteed that ˆl and ˆu are (α, β)-approximate rank-tl and rank-tu thresholds, respectively for the values of α and β defined in Algorithm 3. In particular, this implies that there exist l and t l such that |ˆl l | α, |t l tl| β, and l is a rank-t l threshold for D. Similarly, there exist u and t u such that |ˆu u | α, |t u tu| β, and u is a rank-t u threshold for D. Let G denote this high probability event. We first argue that conditioned on G, the expected loss of ˆµ is small (where the expectation is taken only over the randomness of Algorithm 1). To convert this bound into a bound that holds in expectation, we bound the error when G does not hold by 2R. Let ˆµ be the output of Algorithm 3. We decompose the error of ˆµ into three terms: |ˆµ µ(D)| |ˆµ µ(clip(D, [ˆl, ˆu]))| + |µ(clip(D, [ˆl, ˆu]) µ(clip(D, [l , u ])| + |µ(clip(D, [l , u ])) µ(D)|. Roughly speaking, the first term captures the variance incurred by using Algorithm 1 to estimate the mean of the clipped data, the second term measures our bias due to α in our (α, β)-approximate thresholds, and the third term measures the bias due to β. Our goal is to prove that all of these terms are not much larger than R(D, ε). Bounding first term. At a high level, we argue that all points in L 1 ε are to the left of ˆl + α, and all points in R 1 ε are to the right of ˆu α. It follows that the distance from any point in L 1 ε to any point in U 1 ε is at least ˆu ˆl 2α. In particular, this guarantees that the difference between the means of D \ L 1 ε and D \ U 1 ε must be at least ˆu ˆl 2α ε(|D| 1 since we move 1/ε points a distance at least ˆu ˆl 2α. This expression is close to the loss incurred by Algorithm 1 when run on the clipped dataset. 2Note that when τ is a rank-r threshold for a dataset D, τ is a rank |D| r threshold for D. Therefore, we can use Algorithm 2 to find an approximate rank-(|D| r) threshold for D by negating an approximate rank-r threshold for D. Subset-Based Instance Optimality in Private Estimation Algorithm 3 Subset optimal mean estimation Input: Range R, dataset D [ R, R], privacy parameter ε > 0 and α > 0. 1 Define α = γ |D|, ζ = α R|D|ε, β = 2 αζ , tl = 1 ε + β, and tu = |D| 1 2 Let ˆl be the output of Algorithm 2 run on D with parameters [a, b] = [ R, R], r = tl, α, and ε. 3 Let ˆu be the output of Algorithm 2 run on D with parameters [a, b] = [ R, R], r = tu, α, and ε. 4 Let ˆµ be the output of Algorithm 1 run on D with interval [a, b] = [ˆl, ˆu] and privacy parameter ε. 5 Output ˆµ. Bounding the second term. The key idea behind bounding the second term is that, whenever ˆl is close to l and ˆu is close to u , then clipping a point x to [ˆl, ˆu] is approximately the same as clipping it to [l , u ]. Bounding the third term. Our bound for the third term is the most involved. At a high level, we show that the bias introduced by clipping D to the interval [l , u ] is at most the worst one-sided clipping bias incurred clipping the points to the left of l or to the right of u . To see this, observe that when we clip from both sides, the left and right biases cancel out. Next, we argue that clipping points to the left of l (or to the right of u ) introduces less bias than removing those points. This step bridges the gap between Algorithm 1 which clips points and the lower bound on R(D, ε), which removes points. We argue that the number of points removed to the left of l or to the right of u is not much larger than 1 ε and use Lemma B.1 to show that the resulting bias is not much larger than if we had removed exactly 1 ε points instead. Finally, to finish the bound, combine our two one-sided bias bounds to show that the overall bias is never much larger than R(D, ε). 3.3. Intuition The lower bound in Lemma 3.2 is obtained by showing (roughly) that no private algorithm can reliably determine whether O(1/ε) outliers have been removed from D. In proving the upper bound in Theorem 3.1, then, the challenge is to show that those outliers can be identified and removed privately without introducing asymptotically larger error even when D is not known in advance. This is possible in Algorithm 3 due to a careful choice of the rank targets tl and tu. In particular, if we are overly aggressive in trying to privately remove outliers, we run the risk of adding too much bias, since we are clipping away important information about the mean. On the other hand, if we are too tentative, we may end up with wide clipping thresholds that require adding too much variance (in the form of noise) when calling Algorithm 1. The key to our construction, therefore, is choosing rank targets such that the risk of excess bias and the risk of excess variance both exactly balance with the lower bound; that is, they match the error incurred by removing outliers in the first place. There is no reason to think a priori that this should be possible. Indeed, for certain properties (such as the mode), such a result does not seem to exist errors due to overor underestimating outliers can change the property arbitrarily. However, we show in the next section that the result for mean estimation can be extended to a relatively large class of common properties. 4. Instance-optimal algorithm for monotone properties We now show that subset-optimal estimation algorithms can be constructed for any monotone property. We start by defining our notion of monotonicity. Definition 4.1 (First-order stochastic dominance (Lehmann, 1955; Mann & Whitney, 1947)). Let D and D both be multisets of real numbers. D is said to dominate D (denoted D D) if, v R, P x D 1 {x v} x D 1 {x v} In other words, first-order stochastic dominance requires the cumulative density function (CDF) of D to be larger than the CDF of D for all points on the real line. Definition 4.2 (Monotone property). A property is called monotone if, for all D , D with D D, we have or, for all D , D with D D, we have θ(D ) θ(D). Intuitively, the definition requires that if we move points from a dataset in one direction, we will always increase (or always decrease) the property. The family of monotone properties includes natural functions such as the mean, median, and quantiles. It also includes minimizers of ℓp distances, i.e., θp(D) = arg min y x D |x y|p, and other common estimators. Subset-Based Instance Optimality in Private Estimation Algorithm 4 Subset-optimal monotone property estimation Input: Range R, dataset D [ R, R], privacy parameter ε > 0, and discretization parameter β. Algorithms: Private threshold algorithm Prv Threshold (Algorithm 2). Inverse sensitivity algorithm Inv Sen (Asi & Duchi, 2020a) (Algorithm 5). 1: Quantize each value in the dataset D to the nearest multiple of β and denote the quantized dataset by Dquant. 2: Set error probability η = Lβ B , rank r = 32 log(6R/ηβ) ε . 3: l = Prv Threshold(Dquant, r/4, [ R, R], β/3, ε/4). 4: u = Prv Threshold(Dquant, |D| r/4, [ R, R], β/3, ε/4). 5: Let Dquant = {x1 x2 , . . . , xn}. For i 3r/2, let yi = xi β, for i n 3r/2 yi = xi + β and otherwise yi = xi. Let D quant = {y1 y2 , . . . , yn}. 6: Prune the dataset to obtain D[l,u] = D quant [l, u]. 7: Return the output of Inv Sen on D[l,u] with range [l, u], granularity β, and privacy parameter ε/2. We also make the following assumptions on the property θ and loss function ℓ: For any dataset D supported on [ R, R], θ(D) [ R, R]3. ℓis a metric; that is, it is commutative, it satisfies the triangle inequality, and ℓ(θ, θ) = 0. ℓis finite and bounded for all datasets under consideration. Let B = sup D,D ℓ(θ(D), θ(D )). Whenever θ θ1 θ2, ℓ(θ, θ1) ℓ(θ, θ2). ℓis L-Lipschitz, as defined below4. Let xi(D) denote the ith largest element in D. For all D, D such that |D| = |D |, we have ℓ(θ(D), θ(D )) L max i |D| |xi(D) xi(D )|. For all θ and θ1 = θ2, we have ℓ(θ, θ1) ℓ(θ, θ2) + L|θ1 θ2|. Observe that both mean and median are 1-Lipschitz when ℓ(θ, θ ) = |θ θ |. Our main result is stated below. Theorem 4.3. For any ε (0, c 1 ε ), there exists a cε εDP Algorithm (Algorithm 4) with cε = 128 log(6RB/Lβ2), whose output A(D) satisfies E[ℓ(A(D), θ(D))] 2e2R(D, ε) + 7Lβ. We first show a simple lower bound on R(D, ε) for general properties. This bound generalizes the mean estimation lower bound in Lemma 3.2; the proof is given in Appendix D.1. 3In general, we only need to assume the property is bounded and our result only depends on the bound logarithmically. We use the same R here to simplify notations. 4The constants in the two conditions below need not be the same. We use L here for both to keep notations simple. Lemma 4.4. For ε [0, 1], let S = {(D1, D2) : min(|D1|, |D2|) |D| 1/ε, d(D1, D2) 2/ε}. If S = , then R(D, ε) is at least 1 2e2 max (D1,D2) S ℓ(θ(D1), θ(D2)). We show that the above lower bound can be achieved (up to logarithmic factors). The algorithm is given in Algorithm 4. It is similar in spirit to Algorithm 3, but we need to make a few modifications to ensure the algorithm works for general monotone properties. We briefly describe the steps in the algorithm below. Discretization: As before, we will use the private threshold algorithm to remove outliers. The approximation guarantee in Theorem 3.7 has an additive rank error β and an additive threshold error α; however, for general properties, it is technically challenging to bound the effects of nonzero α. To work around this, we first discretize the interval into steps of size β. This allows us to use Corollary 3.8 to get a guarantee with α = 0. Private thresholds: We then find the private thresholds l and u as in Algorithm 3. As noted above, these estimates come with (0, β) approximation guarantees due to discretization. Pruning outliers: In Algorithm 3, we clip outliers outside the thresholds. However, the effect of clipping is difficult to analyze generally. Instead, in Algorithm 4 we simply prune outliers. Technically, it is possible for all values to lie on the thresholds, in which case we might not prune any elements. Hence, for ease of analysis, we deliberately move a small fraction of points outside the thresholds. Inverse sensitivity mechanism. Finally, while in Algorithm 3 we directly computed the private mean of the clipped dataset, here we use the inverse sensitivity mechanism (Asi & Duchi, 2020a) to estimate the desired property. Subset-Based Instance Optimality in Private Estimation 5. Implications on private statistical mean estimation. In this section, we apply our mean estimation algorithm to the statistical mean estimation (SME) task where D = Xn, which are i.i.d. samples from a distribution p with mean µ. And the performance of the algorithm is measured by the expected distance from the mean, RSME(A, p):=E [|A(D) µ|] . (3) We apply Algorithm 3 on D and obtain obtain distributionspecific bounds on RSME(A, p). For distribution families with various concentration assumptions, we show that our instance-based bound is almost as tight (up to logarithmic factors) as algorithms designed for specific distribution families. We first state a generic result for statistical mean estimation. Theorem 5.1. Let D = Xn be i.i.d. samples from a distribution p with mean µ and A be Algorithm 3. We have RSME(A, p) E [|µ(D) µ|] + C E h µ(D \ L 1 ε ) µ(D \ U 1 where C hides logarithmic factors in the problem parameters. RSME(A, p) = E [|A(D) µ|] E [|µ(D) µ|] + E [|A(D) µ(D)|] . Applying Theorem 3.1 to the second term directly leads to the claim. The bound in Theorem 5.1 can be hard to compute for a specific distribution. For distributions with bounded moments, we can obtain explicit upper bounds on the quantities above. Definition 5.2. Let p be a distribution supported on R with mean µ. Its kth absolute central moment is denoted as Mk(p):=EX p |X µ(p)|k if it is finite; otherwise Mk(p) = . In Appendix E.1, we prove the following result for statistical mean estimation on distributions with bounded moments. Corollary 5.3. For any distribution p over [ R, R], Algorithm 3 satisfies RSME(A, p) = O M2(p)1/2 n + min k 2 Mk(p)1/k Note that Algorithm 3 obtains the above instance-specific rate without any knowledge on the underlying distribution p. Moreover, for specific distribution families such as subgaussian distributions and distributions with bounded kth moments (k 2), Corollary 5.3 implies almost tight minmax rates. Subgaussian distributions. A distribution p is subgaussian with proxy σ if t 0, P(|X µ| t) 2 exp t2 We denote all such distributions as Gσ. For such distributions, we have max p Gσ RSME(A, p) = O σ n + σ This matches the optimal rate for sub-Gaussian distributions (e.g.,, in Karwa & Vadhan (2017); Kamath et al. (2019)). Distributions with bounded moments. Let Mk,m be the family of distributions with Mk(p) m, we have max p Mk,m RSME(A, p) = O M 1/k k n + M 1/k k (nε)1 1/k This matches the optimal rate for distributions with bounded kth moment (e.g.,, in Kamath et al. (2020)). We list the detailed comparisons in Table 1. Extending to higher dimensions. For (ε, δ)-DP mean estimation in the high-dimensional case, algorithms in Levy et al. (2021); Huang et al. (2021) rely on pre-processing techniques (e.g., random rotation) and apply a one-dimensional estimation algorithm to each dimension. Our algorithm can also be combined with this procedure to obtain similar bounds since our algorithm provably provides an instanceoptimal solution to each one-dimensional problem. We leave exploring better instance-specific bounds in high dimension as a direction for future work. 6. Conclusion We proposed a new definition of instance optimality for differentially private estimation and showed that our notion of instance optimality is stronger than those proposed in prior work. We furthermore constructed private algorithms that achieve our notion of instance optimality when estimating a broad class of monotone properties. We also showed that our algorithm matches the asymptotic performance of prior work under a range of distributional assumptions on dataset generation. Subset-Based Instance Optimality in Private Estimation Ald a, F. and Simon, H. U. On the optimality of the exponential mechanism. In Cyber Security Cryptography and Machine Learning: First International Conference, CSCML 2017, Beer-Sheva, Israel, June 29-30, 2017, Proceedings 1, pp. 68 85. Springer, 2017. Asi, H. and Duchi, J. C. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 14106 14117. Curran Associates, Inc., 2020a. URL https://proceedings. neurips.cc/paper/2020/file/ a267f936e54d7c10a2bb70dbe6ad7a89-Paper. pdf. Asi, H. and Duchi, J. C. Near instance-optimality in differential privacy. ar Xiv preprint ar Xiv:2005.10630, 2020b. Balle, B. and Wang, Y.-X. Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pp. 394 403. PMLR, 2018. Biswas, S., Dong, Y., Kamath, G., and Ullman, J. Coinpress: Practical private mean and covariance estimation. Advances in Neural Information Processing Systems, 33: 14475 14485, 2020. Błasiok, J., Bun, M., Nikolov, A., and Steinke, T. Towards instance-optimal private query release. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2480 2497. SIAM, 2019. Brunel, V.-E. and Avella-Medina, M. Propose, test, release: Differentially private estimation with high probability. ar Xiv preprint ar Xiv:2002.08774, 2020. Bun, M. and Steinke, T. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. In Advances in Neural Information Processing Systems, pp. 181 191, 2019. Cai, T. T., Wang, Y., and Zhang, L. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. ar Xiv preprint ar Xiv:1902.04495, 2019. Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pp. 265 284. Springer, 2006. Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends R in Theoretical Computer Science, 9(3 4):211 407, 2014. Feldman, V. and Steinke, T. Calibrating noise to variance in adaptive data analysis. In Conference On Learning Theory, pp. 535 544. PMLR, 2018. Fernandes, N., Mc Iver, A., and Morgan, C. The laplace mechanism has optimal utility for differential privacy over continuous queries. In 2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pp. 1 12. IEEE, 2021. Hopkins, S. B., Kamath, G., and Majid, M. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022, pp. 1406 1417, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392648. doi: 10.1145/ 3519935.3519947. URL https://doi.org/10. 1145/3519935.3519947. Huang, Z., Liang, Y., and Yi, K. Instance-optimal mean estimation under differential privacy. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 25993 26004. Curran Associates, Inc., 2021. URL https://proceedings. neurips.cc/paper/2021/file/ da54dd5a0398011cdfa50d559c2c0ef8-Paper. pdf. Kamath, G., Li, J., Singhal, V., and Ullman, J. Privately learning high-dimensional distributions. In Conference on Learning Theory, pp. 1853 1902. PMLR, 2019. Kamath, G., Singhal, V., and Ullman, J. Private mean estimation of heavy-tailed distributions. In Abernethy, J. and Agarwal, S. (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 2204 2235. PMLR, 09 12 Jul 2020. Karwa, V. and Vadhan, S. Finite sample differentially private confidence intervals, 2017. Lehmann, E. L. Ordered families of distributions. The Annals of Mathematical Statistics, pp. 399 419, 1955. Levy, D. A. N., Sun, Z., Amin, K., Kale, S., Kulesza, A., Mohri, M., and Suresh, A. T. Learning with userlevel privacy. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https: //openreview.net/forum?id=G1jmx FOt Y_. Subset-Based Instance Optimality in Private Estimation Mann, H. B. and Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pp. 50 60, 1947. Mc Millan, A., Smith, A., and Ullman, J. Instanceoptimal differentially private estimation. ar Xiv preprint ar Xiv:2210.15819, 2022. Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pp. 75 84, 2007. Smith, A. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the fortythird annual ACM symposium on Theory of computing, pp. 813 822, 2011. Subset-Based Instance Optimality in Private Estimation A. Proofs for Section 3.1 We will make use of the following characterization of rank-r thresholds. Lemma A.1. Let D be any multiset of real numbers. Then τ R is a rank-r threshold for D if and only if every point x Lr satisfies x τ and every point x U|D| r satisfies x τ. Proof. First we show that if τ is a rank-r threshold for D, then every point x Lr satisfies x τ and every point x U|D| r satisfies x τ. By definition, we have that P x D I{x τ} r, which means that there are at least r points in D that are less than or equal to τ. Since Lr contains the r smallest points in D, every point in Lr must also be less than or equal to τ. Similarly, by definition we have that r P x D I{x < τ} = |D| P x D I{x τ}, which means that there are at least |D| r points in D that are greater than or equal to τ. Since U|D| r contains the largest |D| r points in D, they must all be greater than or equal to τ. This proves the first implication. Now suppose that τ is a threshold with the property that every x Lr satisfies x τ and every x U|D| r satisfies x τ. We will show that this implies that τ is a rank-r threshold. Let x D be any point with x < τ. We know that all such points must belong to Lr (since x U|D| r would violate our assumption). It follows that P x D I{x < τ} |Lr| = r. Now let x D be any point with x > τ. We know that all such points must belong to U|D| r (since x Lr would violate our assumption). It follows that P x D I{x τ} = |D| P x D I{x > τ} |D| |U|D| r| = r. Together, these arguments show that P x D I{x < τ} r and P x D I{x τ} r, which proves the second implication. It follows that τ is a rank-r threshold if and only if every point x Lr satisfies x τ and every point x U|D| r satisfies x τ. Lemma A.2. Let D be a multiset of real numbers and τ R be any threshold. Then ranks(τ, D) = x D I{x < τ}, . . . , X Proof. By definition, τ is a rank-r threshold if P x D I{x < τ} r and P x D I{x τ} r. Let rmin = P x D I{x < τ} and rmax = P x D{x τ}. We will show that τ is a rank-r threshold for D if and only if rmin r rmax. First, since rmin = P x D I{x < τ}, we have that P x D I{x < τ} r if and only if rmin . Similarly, since rmax = P x D I{x τ}, we have that P x D I{x τ} r if and only if r rmax. Since τ is a rank-r threshold if and only if both of the above inequalities hold, it follows that τ is a rank-r threshold iff rmin r rmax, as required. Theorem 3.7. Fix data range [a, b], ε > 0, and distance error 0 < α b a 2 . For any dataset D [a, b] and rank 1 r |D|, let ˆτ be the output of Algorithm 2 run on D with parameters [a, b], α, and ε. Publishing ˆτ satisfies ε-DP and, for any ζ > 0, with probability at least 1 ζ, ˆτ is an (α, β)-approximate rank-r threshold for D with β = 2 αζ . Moreover, Algorithm 2 can be implemented with O(|D| log |D|) running time. Proof. Algorithm 2 is an instance of the exponential mechanism, so to prove that it satisfies ε-differential privacy, it is sufficient to argue that the sensitivity of the loss ℓ(τ) is bounded by one. Let D and D be any neighboring datasets. Since adding or removing a point from D changes the value of each sum in the expression for ranks(τ , D) given by Lemma A.2 by at most one, we are guaranteed that whenever r ranks(τ , D), then at least one of r 1, r , or r + 1 belongs to ranks(τ , D ). From this, it follows that | errrank(τ , r, D) errrank(τ , r, D )| 1. Taking the minimum of both sides of this inequality with respect to τ [τ α, τ + α] shows that the sensitivity of ℓis bounded by one, as required. Next we argue that there exists an interval I [a, b] of width at least α so that for every τ I we have ℓ(τ) = 0. Let τ [a, b] be any rank-r threshold for the dataset D. Next, define I = [τ α, τ + α] [a, b]. The width of I is at least α (since at least half of it is contained in [a, b]). Moreover, for every τ I we have that ℓ(τ) = 0, since τ is within distance α of an exact rank-r threshold, as required. Finally, we follow the standard analysis of the exponential mechanism to prove that this is sufficient to find an approximate rank-r threshold. For any c 0, define Sc = {τ [a, b] | ℓ(τ) c} to be the set of thresholds whose loss is at least c. We Subset-Based Instance Optimality in Private Estimation have that Z τ Sc exp εc (b a) exp εc On the other hand, we have Z b I exp(0) dτ α. Together, it follows that Pr(ˆτ Sc) = Z τ Sc f(τ) dτ b a Choosing c = 2 αζ results in Pr(ˆτ Sc) ζ. It follows that with probability at least 1 ζ, we have that ℓ(ˆτ) < 2 αζ . From the definition of the loss, it follows that ˆτ is an (α, β)-approximate rank-r threshold for S. Finally, we prove the running time guarantee. First, the loss function ℓ(τ) is piecewise constant with at most 2|D| discontinuities. This is because, as we slide a threshold τ from left to right, the minimum rank error within the interval [τ α, τ + α] only changes when an endpoint of the interval crosses a datapoint in D, which can hapen at most 2|D| times. It follows that the output distribution of Algorithm 2 is also piecewise constant with discontinuities (potentially) occurring at x α for each x D. Let the constant intervals be I1, . . . , IM and let p1, . . . , p M be the value of the exponential mechanism density on the intervals, respectively. Now let ˆτ be a sample from the output distribution and ˆI {I1, . . . , IM} be the interval that contains ˆτ. The key idea behind the sampling strategy is to sample ˆI first, and then sample ˆτ conditioned on the choice of ˆI. Since the density is constant on each interval I1, . . . , IM, the second step is equivalent to outputting a uniformly random sample from ˆI. This works as long as the probability we choose ˆI = Ii is equal to the marginal distribution of ˆI, which is given by P(ˆI = Ii) width(Ii) pi. The running time of the above approach is dominated by the cost of computing the piecewise constant representation of the output density. This can be accomplished by constructing the set of 2|D| candidate discontinuity locations x α for x D, sorting them, and then making a linear pass from left to right computing the constant intervals and the value of ℓ(τ) on each interval. The overall running time of this is O(|D| log |D|). Corollary 3.8. Let Z = {z1 zm} be such that zi+1 = zi + γ for all i = 1, . . . , m 1 and let D be any multiset supported on Z. Let ˆτ be the output of Algorithm 2 run on D with parameters [a, b] = [z1, zm], α = γ/3, and ε > 0, and let τ the closest point in Z to ˆτ. Then for any ζ > 0, with probability at least 1 ζ we have that τ is a (0, β)-approximate rank-r quantile for D with β = 2 Proof. For any threshold τ, Lemma A.2 guarantees that ranks(τ, D) = x D I{x < τ}, . . . , X When the dataset D is supported on a grid Z, moving τ to its nearest grid point never removes ranks from ranks(τ, D), since the left sum is either the same or decreases, and the right sum is either the same or increases. This implies that moving τ to its closest grid point never increases its rank error. By Theorem 3.7 we are guaranteed that with probability at least 1 ζ, there exists τ with |ˆτ τ | α and errrank(τ , r, D) 2 ε log b a αζ β (where we used the fact that (b a)/α 3m). Assume this high probability event holds for the remainder of the proof. Let τ and τ be the closest grid points to ˆτ and τ , respectively. If τ = τ then we have that errrank( τ, r, D) = errrank( τ , r, D) errrank(τ , r, D) β and we have shown that τ is a (0, β)-approximate rank-r threshold for D. Now suppose that τ = τ . First we argue that D [ˆτ, τ ] must be the empty set. Suppose for contradiction that there is x D [ˆτ, τ ]. Then, x is a grid point, and we have that max{|ˆτ x|, |τ x|} |ˆτ τ | α = γ/3. In particular, this Subset-Based Instance Optimality in Private Estimation implies that both ˆτ and τ are within distance γ/3 of the same grid point, which means we must have τ = τ , which is a contradiction. Since there are no datapoints in [ˆτ, τ ], by Lemma A.2 we have that ranks(ˆτ, D) = ranks(τ , D). It follows that errrank( τ, r, D) errrank(ˆτ, r, D) = errrank(τ , r, D) β. In either case, we showed that the rank-error of τ is bounded by β. B. Proofs of mean estimation B.1. Proof of Lemma 3.2 Upper bound. c D,ε 1. This can be seen since for all D D , |D | |D| 1/ε, we have ε ) µ(D ) µ(D \ U 1 Hence a fixed algorithm that outputs µ(D) will always have |µ(D) µ(D )| µ(D \ L 1 ε ) µ(D \ U 1 Lower bound. c D,ε 1/(2e2). The proof follows from substituting D1 = D \ L 1 ε and D2 = D \ U 1 ε in Lemma 4.4. B.2. Proof of Lemma 3.3 First we prove the privacy guarantee. Let D1 and D2 be any pair of datasets and let D 1 and D 2 be the shifted and clipped versions of them for the interval [l, u]. The size of the symmetric difference between D 1 and D 2 cannot be larger than between D1 and D2, so whenever D1 and D2 are neighbors, so are D 1 and D 2. It follows that we can ignore the shifting and clipping step in the privacy analysis. Since the add/remove sensitivity of n is 1, step 3 estimates n using the Laplace mechanism with a budget of ε/2. The add/remove sensitivity of the sum s of the shifted data is w/2, so step 4 estimates s using the Laplace mechanism with a budget of ε/2. The overall privacy guarantee of the algorithm then follows from basic composition and post-processing, since w and m are public quantities (i.e., they depend on the algorithm parameters, not on the actual dataset). Now we turn to the utility analysis. Recall that n = |D | = |D|. Since D = D m, ˆµ µ(D) = clip ˆs Since all all elements of D lie in [ w i µ(D ) ˆs ˆn µ(D ) = ˆs ˆn s Hence, we can bound the desired expectation as E [|ˆµ µ(D)|] = Pr(Zn < n/2)E [|ˆµ µ(D)||Zn < n/2] + Pr(Zn n/2)E [|ˆµ µ(D)||Zn n/2] = Pr(Zn < n/2)E [|ˆµ µ(D)||Zn < n/2] + E ˆs ˆn s 2e nε/4|l u| + E ˆs ˆn s |Zn n/2 , (4) Subset-Based Instance Optimality in Private Estimation where the last inequality follows by observing that both ˆµ and µ(D) lie in [l, u]. We now simplify the second term in (4). |Zn n/2 = E s + Zs n + Zn s = E n Zs (n + Zn)n = E Zs (n + Zn) (a) = E[|Zs|] E 1 (n + Zn) where (a) uses the fact that Zs and Zn are independent of each other. Combining (4) and (5) yields, E [|ˆµ µ(D)|] 1 2e nε/4|l u| + 2(u l) e + 2 3(u l) where the penultimate inequality uses the fact that e xx e 1 for all x 0. B.3. Proof of Theorem 3.1 Our goal is to show that the expected error of Algorithm 3 is not much larger than R(D, ε) µ(D \ L 1 ε ) µ(D \ U 1 With probability at least 1 ζ, by Theorem 3.7 we are guaranteed that ˆl and ˆu are (α, β)-approximate rank-tl and rank-tu thresholds, respectively for the values of α and β defined in Algorithm 3. In particular, this implies that there exist l and t l such that |ˆl l | α, |t l tl| β, and l is a rank-t l threshold for D. Similarly, there exist u and t u such that |ˆu u | α, |t u tu| β, and u is a rank-t u threshold for D. Let G denote this high probability event. We first argue that conditioned on G, the expected loss of ˆµ is small (where the expectation is taken only over the randomness of Algorithm 1). To convert this bound into a bound that holds in expectation, we bound the error when G does not hold by 2R. Let ˆµ be the output of Algorithm 3. We decompose the error of ˆµ into three terms: |ˆµ µ(D)| |ˆµ µ(clip(D, [ˆl, ˆu]))| + |µ(clip(D, [ˆl, ˆu]) µ(clip(D, [l , u ])| + |µ(clip(D, [l , u ])) µ(D)| Roughly speaking, the first term captures the variance incurred by using Algorithm 1 to estimate the mean of the clipped data, the second term measures our bias due to α in our (α, β)-approximate thresholds, and the third term measures the bias due to β. Our goal is to prove that all of these terms are not much larger than R(D, ε). Bounding first term. At a high level, we argue that all points in L 1 ε are to the left of ˆl + α, and all points in R 1 ε are to the right of ˆu α. It follows that the distance from any point in L 1 ε to any point in U 1 ε is at least ˆu ˆl 2α. In particular, this guarantees that the difference between the means of D \ L 1 ε and D \ U 1 ε must be at least ˆu ˆl 2α ε(|D| 1 ε ), since we move 1/ε points a distance at least ˆu ˆl 2α. This expression is close to the loss incurred by Algorithm 1 when run on the clipped dataset. Formally, let S = (D \ L 1 ε ) (D \ U 1 ε ) be the set of common points in the two means from the lower bound. Then we Subset-Based Instance Optimality in Private Estimation ε ) µ(D \ U 1 ε ) = 1 |D| 1 Next, since l is a rank-t l threshold with t l 1 ε, we know that every x Lt l L 1 ε satisfies x l ˆl + α. Similarly, since u is a rank-t u threshold with t u |D| + β = |D| 1 ε, we are guaranteed that every x U|D| t u U|D| |D|+ 1 ε satisfies x u ˆu α. Substituting these bounds into the above expression gives 2e2R(D, ε) µ(D \ L 1 ε ) µ(D \ R 1 By Lemma 3.3, we have that the expectation of the first term in the error decomposition conditioned on the choice of ˆl and ˆu is bounded by 3(ˆu ˆl) |D|ε . It follows that E |ˆµ µ(clip(D, [ˆl, ˆu]))| G 6e2R(D, ε) + 6α Bounding the second term. The key idea behind bounding the second term is that, whenever ˆl is close to l and ˆu is close to u , then clipping a point x to [ˆl, ˆu] is approximately the same as clipping it to [l , u ]. Formally, we have |µ(clip(D, [ˆl, ˆu])) µ(clip(D, [l , u ]))| 1 |D| x D | clip(x, [ˆl, ˆu]) clip(x, [l , u ])| x D | min(ˆu, max(x, ˆl)) min(u , max(x, l ))| Bounding the third term. Our bound for the third term is the most involved. At a high level, we show that the bias introduced by clipping D to the interval [l , u ] is at most the worst one-sided clipping bias incurred clipping the points to the left of l or to the right of u . To see this, observe that when we clip from both sides, the left and right biases cancel out. Next, we argue that clipping points to the left of l (or to the right of u ) introduces less bias than removing those points. This step bridges the gap between Algorithm 1 which clips points and the lower bound on R(D, ε), which removes points. We argue that the number of points removed to the left of l or to the right of u is not much larger than 1 ε and use Lemma B.1 to show that the resulting bias is not much larger than if we had removed exactly 1 ε points instead. Finally, to finish the bound, combine our two one-sided bias bounds to show that the overall bias is never much larger than R(D, ε). We begin by showing that the bias is bounded by the worst one-sided bias. We have that µ(D) = 1 |D| Subset-Based Instance Optimality in Private Estimation µ(clip(D, [l , r ])) = 1 |D| Therefore, we have that |µ(clip(D, [l , u ])) µ(D)| = 1 |D| where the inequality follows because the two sums have opposite signs. This expression is the maximum bias we introduce if we only cliped either points to the left of l or to the right of u . Next, we relate the bias of clipping the points to the left of l to the bias of removing those points instead. Let q = P x D I{x < l } be the number of points that are clipped to l . Next, since adding copies of µ(D \ Lq) to D \ Lq does not change its mean, we have that µ(D \ Lq) µ(D) = 1 |D| x D\Lq (x x) + X x Lq µ(D \ Lq) x x Lq µ(D \ Lq) x where the final inequality follows from the fact that µ(D \ Lq) l , since every element of D \ Lq is at least l . We have shown that the bias from clipping the points to l is at most the bias from deleting them. Next, we use the fact that l is a rank-t l threshold for D to argue that the number of points clipped, q, cannot be too large. Since l is a rank-t l threshold for D, we know that every x U|D| t l satisfies x l . Therefore, x D I{x τ l} |U|D| t l = |D| t l. Since q = |D| P x D I{x τ }, we have that q t l 1 ε + 2β. Putting these bounds together and using Lemma B.1 to handle the fact that q may be larger than 1 x Lq l x µ(D \ Lq) µ(D) 1/ε µ(D \ L 1 (2 + 4βε) µ(D \ L 1 Next we turn to the bias of clipping points to the right of u . Let p = P x D I{x > u } be the number of points in D that are strictly greater than u . A similar argument to the above shows that p 1 ε + 2β and that x Uq x u µ(D) µ(D \ Up) (2 + 4βε) µ(D) µ(D \ R 1 Subset-Based Instance Optimality in Private Estimation Putting it all together, the third term of the error decomposition is upper bounded by |µ(clip(D, [l , u ])) µ(D)| (2 + 4βε) max{µ(D \ L 1 ε ) µ(D), µ(D) µ(D \ R 1 (2 + 4βε) µ(D \ L 1 ε ) µ(D \ U 1 2e(2 + 4βε)R(D, ε), where the second inequality follows from the fact that the maximum of two numbers is not larger than the sum. Finally, conditioned on the good event G, we have shown that the expected loss of ˆµ is bounded by 2e(2 + 4βε2) + 6e2 R(D, ε) + α 6 ε|D| + 2 . Using the fact that β = 2 αζ and bounding the error when the good event G fails to hold by 2R, we have that E [|ˆµ µ(D)|] 56 + 44 log 2R R(D, ε) + α 6 ε|D| + 1 + 2Rζ, as required. Lemma B.1. Let D be any multiset of real numbers of size n, and let n1 n2 n/2. Then we have µ(D \ Ln2) µ(D) 2n2 µ(D \ Ln1) µ(D) and µ(D) µ(D \ Rn2) 2n2 µ(D) µ(D \ Rn1) Proof. We only prove the first inequality and the second will follow similarly. For any n n/2, we have µD µD\Ln = 1 i [n] (µD\Ln Xi)1{Xi Ln } i [n] (µD\Ln Xi)1{Xi < a} + (n n )µD\Ln (n n )µD\Ln = n µD\Ln + (n n )µD\Ln X i [n] 1{Xi < a}Xi X i [n] 1{Xi a}Xi = n µD\Ln µD . Setting n to be n1, n2 respectively, it would be enough to prove that µD\Ln2 µLn2 2 µD\Ln1 µLn1 This follows since 2 µD\Ln1 µLn1 µD\Ln2 µLn2 2µD\Ln1 µD\Ln2 µLn1 = 2µD\Ln1 (n n1)µD\Ln1 Pn2 i=n1+1 Xi n n2 µLn1 µD\Ln1 + n2 n1 n n2 µLn2\Ln1 µLn1 Subset-Based Instance Optimality in Private Estimation C. Inverse sensitivity mechanism In this section, we state the inverse sensitivity mechanism and its guarantee in the add/remove model of DP. Most of the proof follows from (Asi & Duchi, 2020a) in the replacement model of DP. We include its add/remove variant here for completeness. We first provide a few definitions. Let D be a dataset supported over Z and k N+, let ωℓ(D, k) be defined as ωℓ(D, k):= max {ℓ(θ(D), θ(D )) | D Zn, d(D, D ) k} i.e., the maximum change in the parameter by k add/remove operations on D. Let lenθ(D, t):= min {d(D , D) | θ(D ) = t} , which is the minimum number of add/remove operations needed to change D to some D with θ(D ) = t. Then the add/remove version of inverse sensitivity mechanism is stated below. Algorithm 5 Inverse Sensitivity Mechanism (Asi & Duchi, 2020a) Input: Range R, dataset D [ R, R], privacy parameter ε > 0 and granularity β > 0. 1: Let T be the set of points in [ R, R] that are also mulitples of β. 2: Output t T with distribution Pr (A(D) = t) = exp ( lenθ(D, t)) P t T exp ( lenθ(D, t )), The following guarantee holds for the inverse sensitivity mechanism. Theorem C.1 ((Asi & Duchi, 2020a)). For any β (0, B), the inverse sensitivity mechanism A with privacy parameter ε = 2ε log 2BR β satisfies that E [ℓ(A(D), θ(D))] ωℓ(D, 1/ε) + Lβ. D. Proofs for the general algorithm D.1. Proof of the Lemma 4.4 For any algorithm A, max D D:|D | |D| 1/ε E [ℓ(A(D ), θ(D ))] max (D1,D2) S max (E [ℓ(A(D1), θ(D1))] , E [ℓ(A(D2), θ(D2))]) max (D1,D2) S 0.5 (E [ℓ(A(D1), θ(D1))] + E [ℓ(A(D2), θ(D2))]) . Since d(D1, D2) 2/ϵ, if A Aϵ, E [ℓ(A(D1), θ(D1))] + E [ℓ(A(D2), θ(D2))] e (2/ε)εE [ℓ(A(D2), θ(D1))] + E [ℓ(A(D2), θ(D2))] e 2 (E [ℓ(A(D2), θ(D1))] + E [ℓ(A(D2), θ(D2))]) e 2ℓ(θ(D1), θ(D2)) e 2ℓ(θ(D1), θ(D2)). Hence, for any algorithm A Aϵ, max D D:|D | |D| 1/ε E [ℓ(A(D ), θ(D ))] max max(D1,D2) S 1 2e2 ℓ(θ(D1), θ(D2)). D.2. Proof of Theorem 4.3 We first prove a general result on stochastic dominance which will be helpful later in our results. For a dataset D and thresholds l, u, let D[l,u] = D [l, u]. Subset-Based Instance Optimality in Private Estimation Lemma D.1. Let l and u satisfy, 2r |D ( , l)| 3 2r |D (u, )| 3 For all D such that all of their elements are in [l, u] and d(D[l,u], D ) r/2, D \ L4r(D) D D \ U4r(D). Proof. Let L4r denote L4r(D) and U4r denote U4r(D) for convenience. We present the proof for D \ L4r D and the other relation will follow similarly. We divide the proof into three cases depending on the value of v in Definition 4.1. Case v < minx D\L4r x: Since there are no points in D \ L4r in this range, P x D\L4r 1 {x v} |D \ L4r| = 0 P x D 1 {x v} Case v u: Since all points of D lie below u, P x D\L4r 1 {x v} |D \ L4r| 1 = P x D 1 {x v} Case v [minx D\L4r x, u): Since d(D[l,h], D ) r/2, x D 1 {x v} x D[l,u] 1 {x v} r/2 |D[l,u]| + r/2 x D[l,u] 1 {x v} r/2 |D \ L4r| + 3r/2 , where the last inequality follows by observing that the assumptions in the lemma imply, |D[l,u]| + r/2 |D| 5r/2 |D \ L4r| + 3r/2. For v [minx D\L4r x, u), x D[l,u] 1 {x v} r/2 X x D 1 {x v} 2r r/2 x D 1 {x v} 5r/2 x D\L4r 1 {x v} + 4r 5r/2 x D\L4r 1 {x v} + 3r/2. x D 1 {x v} x D\L4r 1 {x v} + 3r/2 |D \ L4r| + 3r/2 x D\L4r 1 {x v} |D \ L4r| . Subset-Based Instance Optimality in Private Estimation Proof of Theorem 4.3. In this proof, we show that Algorithm 4 is ε-DP and achieves E[ℓ(A(D), θ(D))] 2e2R(D, ε ) + 7Lβ, where ε = ε 128 log(6RB/Lβ2). Theorem 4.3 can be obtained by applying Algorithm 4 with privacy parameter 128 log(6RB/Lβ2)ε. By composition theorem, the overall privacy budget is ε. In the rest of the proof, we focus on the utility guarantee. We first quantify the effect of quantization. Observe that the output of the algorithm does not change for inputs D and the corresponding quantized dataset Dquant. Hence together with the Lipschitz property of θ, E[ℓ(A(D), θ(D)] = E[ℓ(A(Dquant), θ(D)] E[ℓ(A(Dquant), θ(Dquant)] + ℓ(θ(Dquant), θ(D)), E[ℓ(A(Dquant), θ(Dquant)] + Lβ. By Corollary 3.8, with probability at least 1 η, there exists a r such that |r 7r/4| 8 ηβ and r R(l, Dquant). In other words, R(l, Dquant) [3r/2, 2r] = . Hence, |Dquant ( , l)| 2r, and hence, 3r/2 |D quant ( , l)| 2r. Similarly, 3r/2 |D quant (u, )| 2r. Let E be the event where both the above equations hold. By triangle inequality, E[ℓ(A(D quant), θ(D quant)] E[ℓ(A(D quant), θ(D[l,h])] + E[ℓ(θ(D quant), θ(D[l,h])], . Let L 4r = L4r(D quant) and U 4r = U4r(D quant). By Lemma D.1, D quant \ L4r D[l,u] D quant \ U4r. Hence, conditioned on the event E, by the monotonicity property E[ℓ(θ(D quant), θ(D[l,h])] max ℓ(θ(D quant), θ(D quant \ L 4r), ℓ(θ(D quant), θ(D quant \ U 4r) . Furthermore by triangle inequality, ℓ(θ(Dquant), θ(D quant \ L 4r) ℓ(θ(D), θ(D \ L4r) + ℓ(θ(D), D quant)) + ℓ(θ(D \ L4r, θ(D quant \ L 4r) ℓ(θ(D), θ(D \ L4r) + 4Lβ. Similarly, ℓ(θ(D quant), θ(D quant \ U 4r) ℓ(θ(D), θ(D \ U4r) + 4Lβ. ℓ(θ(D quant), θ(D[l,h]) max (ℓ(θ(D), θ(D \ L 4r), ℓ(θ(D), θ(D \ U 4r)) + 4Lβ max D1,D2 D:d(D1,D2) 4r ℓ(θ(D1), θ(D2)) + 4Lβ. E[ℓ(θ(Dquant), θ(D[l,h])] max D1,D2 D:d(D1,D2) 4r ℓ(θ(D1), θ(D2)) + 4Lβ + Pr(E)B = max D1,D2 D:d(D1,D2) 4r ℓ(θ(D1), θ(D2)) + 6Lβ. Subset-Based Instance Optimality in Private Estimation Since inverse sensitivity mechanism is applied on D[l,u], E[ℓ(A(D quant), θ(D[l,h])] = E[ℓ(Inv Sen(D[l,h]), θ(D[l,h])]. By the guarantee of the inverse sensitivity mechanism, E[ℓ(Inv Sen(D[l,h]), θ(D[l,h])] max D1,D2 D:d(D1,D2) r ℓ(θ(D1), θ(D2)) + Lβ, where r = 4 log((8BR)/Lβ2) ε . Combining the above equations yield E[ℓ(A(D), θ(D))] 2 max D1,D2 D:d(D1,D2) max(4r,r ) ℓ(θ(D1), θ(D2)) + 7Lβ 2e R(D, ε ) + 7Lβ, where ε = ε 128 log(6RB/Lβ2). E. Proofs for statistical mean estimation E.1. Proof of Corollary 5.3 It is straightforward to see that E [|µ(Xn) µ|] E h |µ(Xn) µ|2i = Hence it remains to show that k 2, E h µ(Xn \ L 1 ϵ ) µ(Xn \ U 1 n + Mk(p)1/k Our proof will be based on the lemma below. Lemma E.1. For any sequence of samples Xn and k 2, let c Mk(Xn):= 1 n Pn i=1 |Xi µ(Xn)|k, we have ϵ ) µ(Xn \ U 1 c Mk(Xn)1/k We first prove Equation (6) based on Lemma E.1 and then present the proof of Lemma E.1. E h µ(Xn \ L 1 ϵ ) µ(Xn \ U 1 " c Mk(Xn)1/k E h c Mk(Xn)1/ki = E " Pn i=1 |Xi µ(Xn)|k " Pn i=1 |Xi µ(p)|k 1/k + Pn i=1 |µ(Xn) µ(p)|k = 2E Pn i=1 |Xi µ(p)|k 1/k + 2E [|µ(Xn) µ(p)|] 2Mk(p)1/k + 2M2(p)1/2 Subset-Based Instance Optimality in Private Estimation Proof of Lemma E.1: By definition, ϵ ) µ(Xn \ U 1 ϵ ) = 1 n 1/ε |Xi µ(Xn)|k i [n] |Xi µ(Xn)|k = 4c Mk(Xn)1/k (nε)1 1/k . F. Comparison between different instance-dependent risks for ℓp minimization. Consider the task of estimating the ℓp minimizer of a dataset over [0, R] with R 1, i.e., for p > 1, θ(D) = min µ R x D |x µ|p. Consider a dataset D consisting of n 1 0 s and one 1. It can be shown that the minimizer is µp(D) = 1 1 + (n 1)1/(p 1) . Let ℓ(x, x ) = |x x |. The definition in (Asi & Duchi, 2020a) (Equation (2)) will have R1(D, ε) max D :d(D,D ) 1/ε ℓ(θ(D), θ(D )) R 1 + (nε 1)1/(p 1) 1 1 + (n 1)1/(p 1) , where we take D 1 to be the dataset with n 1/ε 0 s and 1/ε R s. The modified definition of (Huang et al., 2021) (Remark 2.4) will lead to a instance dependent risk of R2(D, ε) = sup supp(D ) supp(D) d(D,D ) 1/ε ℓ(θ(D), θ(D )) 1 1 + (nε 1)1/(p 1) 1 1 + (n 1)1/(p 1) , where we take D 2 be the dataset with n 1/ε 0 s and 1/ε 1 s. Our propose subset-risk will only allow D D. Hence µp(D ) [0, µp(D)], and R(D, ε) sup D D,d(D,D ) 1/ε ℓ(θ(D), θ(D )) 1 1 + (n 1)1/(p 1) . In the regimen when p < log(nε), ε 1 and R 1, we have R1(D, ε) R2(D, ε) R(D, ε).