# differentially_private_approximate_quantiles__c08bb0a0.pdf

Differentially Private Approximate Quantiles

Haim Kaplan 1 2 Shachar Schnapp 3 Uri Stemmer 1 2

In this work we study the problem of differentially private (DP) quantiles, in which given dataset X and quantiles q1, ..., qm [0, 1], we want to output m quantile estimations which are as close as possible to the true quantiles and preserve DP. We describe a simple recursive DP algorithm, which we call Approximate Quantiles (AQ), for this task. We give a worst case upper bound on its error, and show that its error is much lower than of previous implementations on several different datasets. Furthermore, it gets this low error while running time two orders of magnitude faster that the best previous implementation.

1. Introduction

Quantiles are the values that divide a sorted dataset in a certain proportion. They are one of the most basic and important data statistics, with various usages, ranging from computing the median to standardized test scores (GRE, 2021). Given sensitive data, publishing the quantiles can expose information about the individuals that are part of the dataset. For example, suppose that a company wants to publish the median of its users ages. Doing so means to reveal the date of birth of a certain user, thus compromising the user s privacy. Differential privacy (DP) offers a solution to this problem by requiring the output distribution of the computation to be insensitive to the data of any single individual. This leads us to the definition of the DP-quantiles problem:

Definition 1.1 (The DP-Quantiles Problem). Let a, b R. Given a dataset X (a, b) containing n points from (a, b), and a set of m quantiles 0 < q1 qm < 1, privately identify quantile estimations v1, ..., vm such that for every j [m] we have Prx UX[x vj] qj,1 where UX is the

*Equal contribution 1Tel Aviv University 2Google Research 3Ben-Gurion University. Correspondence to: Shachar Schnapp <schnapp@post.bgu.ac.il>.

Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). 1We will make this precise later.

uniform distribution over the data X.2

On the theory side, the DP-quantiles problem is relatively well-understood, with advanced constructions achieving very small asymptotic error (Beimel et al., 2016; Bun et al., 2015; Kaplan et al., 2020). However, as was recently observed by Gillenwater et al. (2021), due to the complexity of these advanced constructions and their large hidden constants, their practicality is questionable. This led Gillenwater et al. (2021) to design a simple algorithm for the DP-quantiles problem that performs well in practice. In this work we revisit the DP-quantiles problem. We build on the theoretical construction of Bun et al. (2015), and present a new (and simple) practical algorithm that obtains better utility and runtime than the state-of-the-art construction of Gillenwater et al. (2021) (and all other existing implementations). When the number of quantiles is large, the error of AQ algorithm reaches up to 7.14 time lower than the best existing baseline.

1.1. Our Contributions

We provide Approximate Quantiles, an algorithm and implementation for the DP-quantiles problem (Section 3). We prove a worst case bound on the error of the AQ algorithm for arbitrary m quantiles, and a tighter error bound for the case of uniform quantiles qi = i/(m+1), i = 1, . . . , m (Section 3.2). We experimentally evaluated the AQ algorithm and conclude that it obtains higher accuracy and faster runtime than the existing state-of-the-art implementations (Section 4). In addition, we adapt our algorithm (and its competitors) to the definition of Concentrated Differential Privacy (z CDP) (Bun & Steinke, 2016). We show that its dominance over other methods is even more significant with this definition of privacy.

1.2. A Technical Overview of Our Construction: Algorithm Approximate Quantiles

Our algorithm operates using a DP algorithm for estimating a single quantile. Specifically, we assume that we have a DP algorithm A : R2 Rn (0, 1) R that takes a domain I = (a, b) R2 (an interval on the line), a database X Rn (containing n points in I), and a sin-

2We assume that there are no duplicate points in X.

Differentially Private Approximate Quantiles

Figure 1: The recursion tree of the algorithm. At each node we write the range of the corresponding subproblem and above it the part of the data in this subproblem.

(v1 2, v1 1)

(v1 1, v2 2)

gle quantile q (0, 1), and returns a point v I such that Prx UX[x v] q. Estimating a single quantile is a much easier task, with a very simple algorithm based on the exponential mechanism (Mc Sherry & Talwar, 2007). A naive approach of using A for solving the DP-quantiles problem would be to run it m times (once for every given quantile). However, due to composition costs of running m DP algorithms on the same data, the error with this approach would scale polynomially with m. As we demonstrate in our experimental results (in Section 4), this leads to a significantly reduced performance. In contrast, as we explain next, in our algorithm the error scales only logarithmically in the number of quantiles m. The AQ algorithm privately estimates (using A) the middle quantile p = q m/2 and observes an answer v. Then it splits the problem into two sub-problems. The first sub-problem is defined over the dataset Xℓwhich contains the values from X that are smaller than v. Its goal is to privately compute the quantiles (q1, .., q m/2 1)/p on Xℓ. The second problem is defined over the dataset Xu which contains the values from X that are greater than v. The goal of the second problem is to privately compute (q m/2 +1 p, . . . , qm p)/(1 p) on Xu. Notice that the recursive sub-problems have smaller ranges. Specifically, at the first level of the recursion tree we compute one quantile q1 1 q m/2 on data points from a range (a, b). We denote by v1 1 the estimate which we receive. At the second level we compute two normalized quantiles q1 2 q m/4 /q1 1 and q2 2 q 3m/4 /(1 q1 1). The quantile q1 2 is computed on data in the range (a, v1 1), and we denote its estimate by v1 2. The second quantile is computed on the data in the range [v1 1, b], and we denote its estimate by v1 2, and so on. Figure 1 illustrates this recursion tree. A few remarks are in order.

1. By shrinking the data range from one level to the next, we effectively reduce the error of algorithm A (because its error depends on the data range). We have found that, in practice, this has a large impact on our accuracy guarantees.

2. Note that every single data point participates only in log(m) + 1 sub-problems (one at each level). This allows our privacy loss (and, hence, our error) to grow only logarithmically in m.

1.3. Related Work

Private (Cumulative Distribution Function) CDF estimation can be applied to estimating all the quantiles (Bun et al., 2015; Kaplan et al., 2020), however the theoretically best known algorithm for private CDF estimation Bun et al. (2015) relies on several reductions, thus limiting its practicality. We present our AQ algorithm, inspired by Bun et al. (2015), taking their CDF estimation algorithm into practice. As opposed to the algorithm proposed by Bun et al. (2015), our algorithm avoids discretization of the domain by solving the DP single quantile problem using the exponential mechanism (Smith, 2011) instead of an interior point algorithm (Bun et al., 2015). Moreover, we split the data according the desired quantiles (rather than uniformly) while avoiding using the laplace mechanism in the process. A common tree-based approach to CDF estimation is included in our empirical error analysis (Section 4). A recent work by Gillenwater et al. (2021) proposed an instance of the exponential mechanism that simultaneously draws m quantiles. The naive implementation of this exponential mechanism for m quantiles is computationally difficult, but Gillenwater et al. (2021) provide a sophisticated O(mn log n + m2n) implementation. The empirical experiments of Gillenwater et al. (2021) show that when the number of quantiles is small, Joint Exp algorithm performs best. A comparison with this algorithm is included in our experiments (Section 4).

2. Preliminaries

A database X is a set of n points from some data domain D.3 Differential privacy uses the definition of neighbors as follows. Definition 2.1. Databases X and X Dn are neighbors, denoted X X , if one of them can be obtained from the other by adding or removing a single element.

We use the add-remove definition of neighbors, as opposed to the swap definition (in which we replace a point instead of deleting or inserting it), although it is important to note that our algorithm Approximate Quantiles (AQ) easily adapts to the swap framework. Differential privacy can be defined using the notion of neighboring databases as follows: Definition 2.2 (Dwork et al. (2006b)). A randomized algorithm A: D Y is (ε, δ)-differentially private ((ε, δ)- DP) if, for every pair of neighboring databases X, X

and every output subset Y Y, Pr A[A(X) Y ] eε Pr A[A(X ) Y ] + δ. When δ > 0, we say A sat-

3The domain in this paper is the interval (a, b).

Differentially Private Approximate Quantiles

isfies approximate differential privacy. If δ = 0, we say A satisfies pure differential privacy, and shorthand this as ε-differential privacy (ε-DP).

The composition property is a key benefit of differential privacy: an algorithm that relies on differentially private subroutines inherits an overall privacy guarantee by simply adding up the privacy guarantees of its components. Lemma 2.3 (Dwork et al. (2006a;b)). Let A1, . . . , Ak be k algorithms that respectively satisfy (ε1, δ1)-, . . . , (εk, δk)- differential privacy. Then running A1, . . . , Ak satisfies Pk i=1 εi, Pk i=1 δi -differential privacy.

The above lemma has come to be known as basic composition in the literature of differential privacy (see (Dwork et al., 2010b) for more advanced composition theorems). Given Lemma 2.3, a simplistic approach for the DPquantiles problem would be to estimate each of the m quantiles separately, using an ε

m-DP algorithm, and then to apply Lemma 2.3 in order to show that the m estimations together satisfy ε-DP. As we will see, our algorithm outperforms this simplistic approach by a large gap. We will also rely on the exponential mechanism, a common building block for differentially private algorithms. Definition 2.4 (Mc Sherry & Talwar (2007)). Given utility function u: D S R mapping (database, output) pairs to real-valued scores with L1 sensitivity

u max X X ,s S |u(X, s) u(X , s)|,

the probability that the exponential mechanism M has an output s S is:

Pr[M(X) = s] exp εu(X, s)

where elides the normalization factor.

The exponential mechanism maintains the database s privacy while prioritizing its higher utility outputs. Lemma 2.5 (Mc Sherry & Talwar (2007)). The mechanism described in Definition 2.4 is ε-DP and for error parameter γ > 0 satisfies that:

Pr s S[u(X, s) > Opt γ] |S| exp εγ

where Opt = maxs S{u(X, s)}.

In our case in this paper, the solution space S is infinite. Specifically, it is an interval (a, b). Lemma A.1 in Appendix A adapts the exponential mechanism to this setting.

3. Approximate Quantiles Algorithm

This section demonstrates our differentially private quantiles algorithm, Approximate Quantiles. As we mentioned in the

Algorithm 1 - Approximate Quantiles (AQ)

input Domain D = (a, b), database X = (x1, . . . , xn) Dn, quantiles Q = (q1, . . . , qm), privacy parameter ε. 1: Let A : R2 Rn (0, 1) R be a ε log m+1-DP mechanism for a single quantile. 2: function F((a, b), X, Q) 3: if m = 0 then 4: return null 5: else if m = 1 then 6: return {A((a, b), X, q1)} 7: end if 8: Let ˆm = m/2 9: Let p = q ˆm 10: Let v = A((a, b), X, p) 11: Let Xℓ, Xu = {x X | x < v}, {x X | x > v} 12: Let Qℓ= (q1, . . . , q ˆm 1)/p 13: Let Qu = (q ˆm+1 p, . . . , qm p)/(1 p) 14: return F((a, v), Xℓ, Qℓ) {v} F((v, b), Xu, Qu) 15: end function

introduction, our algorithm uses a subroutine A for privately estimating a single quantile. We implement algorithm A using the exponential mechanism (see Appendix A). We remark that, if the dataset is given sorted, then A runs in linear time, and the time required for all recursive calls at the same level of the recursion tree is O(n). It follows that the overall time complexity of AQ algorithm is O(n log m).

We denote by qj i the normalized quantile computed by the jth subproblem at the ith level of the algorithm s recursion tree, and by vj i the result of this computation. Note that the number of subproblems at level i is 2i 1. (At the last level some of the subproblems may be empty.) We let Xj i be the data used to compute vj i . It was created by splitting the data X j/2 i 1 into two parts according to v j/2 i 1 . We note that Xj1 i and Xj2 i are disjoint for a fixed level i and j1 = j2. This allows us to avoid splitting our privacy budget between subproblems at the same level of the recursion, and we split it only between different levels, see Section 3.1. We also denote Qi (q1 i , . . . , q2i 1 i ) and Vi (v1 i , . . . , v2i 1 i ).

We assume that the data does not contain duplicate points. This can be enforced by adding small perturbations to the points. The answer we return is with respect to the perturbed points. In fact the utility of our algorithm depends on the minimum distance between a pair of points.

3.1. Privacy Analysis

First we prove that AQ algorithm is ε-DP.

Lemma 3.1 (Differential Privacy). If A is an ε log m+1-DP mechanism for a single quantile then AQ algorithmis ε-DP.

Differentially Private Approximate Quantiles

Proof. It suffices to show that for each level 1 i log m + 1 the output Vi is ε log m+1-DP, since the number of levels is log m + 1, from composition (Lemma 2.3) we get that the AQ algorithm satisfies (ε, δ)-DP.

Let X and X = X {x } be neighboring databases, mark as X k i the part of level i that contains x among X j i, 1 j 2i 1 (as explained above, only one X j i contains x ). For each j = k the data Xj i equals X j i and therefore the probability of the output vj i is the same under X or X . The output vk i is obtained by A which is a ε log m+1-DP mechanism, therefore it satisfies ε log m+1-DP.

3.2. Utility Analysis

A qi-quantile is any point oi (a, b) such that the number points of X which are in (a, oi) is qin . We also define the gap, Gap X(d1, d2), between d1, d2 (a, b) with respect to X are the number of points in the data X that fall between d1 and d2, formally: Gap X(d1, d2) = |{x X | x [min{d1, d2}, max{d1, d2})}|. Using this notion we define the error of the algorithm. Given dataset X, quantiles Q = (q1, . . . , qm), solution V = (v1, . . . , vm) and true quantiles O = (o1, . . . , om), the maximum missed points error is defined as:

Err X(O, V ) = max j [m]{Gap(oj, vj)} =

max j [m]{||{x X|x < vj}| qj n |}.

This error was first defined by Smith (2011) and is widely used in the literature on the differentially private quantiles problem. We first analyze the error of our algorithm in the general case, without any constraints on the given quantiles.

Lemma 3.2 (General Quantiles Utility). Let X (a, b)n

be a database, and let A be an algorithm that computes an approximation v for a single quantile q of X such that Pr[Gap X(o, v) > γ] β m. for some constants β, γ > 0, where o is a true q-quantile. We run AQ algorithm using A on a database X, and quantiles Q = (q1, . . . , qm). Then, with probability 1 β, we get approximate quantiles V = (v1, . . . , vm) such that Err X(O, V ) (log m + 1)γ.

Proof. For the computation of m quantiles the AQ algorithm applies A at most m times (once per each internal node of the recursion tree). Since in each run A has error at most γ with probability 1 β/m it follows by a union bound that:

Pr[ i, j s.t. Gap X(ˆoj i, vj i ) > γ] β, (1)

where ˆoj i is a true qj i -quantile with respect to the dataset Xj i . We also denote by oj i a true qk-quantile with respect to X

Figure 2: An error in computing v1 1 by at most γ = 2 points from the optimal value o1 1, might cause an error of at most γ points between the value of o2 j compared to ˆo2 j. The example demonstrates the computed quantiles q1 = 1

2 and q3 = 3

4. Note that q1 2, q2 2 = 1

0 5 10 15 20 X

v1 1 o1 1 = ˆo1 1 o2 2 o1 2

0 2 4 6 8 10

v1 1 ˆo1 2 o1 2

v1 1 ˆo2 2 o2 2

where qk is the original fraction in Q that corresponds to qj i (see Figure 2).

At the first level (i = 1) of the recursive tree we compute one quantile q1 1 on the data X = X1 1 and therefore Gap X(ˆo1 1, o1 1) = 0. At the second level (i = 2), we split the data X according to v1 1 into X1 2, X2 2, since Gap X(v1 1, o1 1) γ, the qj 2-quantile ˆoj 2 of the dataset Xj 2 satisfies that Gap X(ˆoj 2, oj 2) γ (see Figure 2).

By induction, at layer i, for every j we have that Gap X(ˆoj i, oj i) (i 1) γ. Combining this with Equation (1) results in Gap X(vj i , oj i) i γ. At the last level (i = log m+1) we have that Gap X(vj i , oj i) (log m+1) γ.

Theorem 3.3. Assume that we implement A using the exponential mechanism with privacy parameter ε log m+1, as described in Appendix A to solve the single quantile problem. Then, given a database X (a, b)n and quantiles Q = (q1, . . . , qm) the AQ algorithm is ε-DP, and with probability 1 β output V = (v1, . . . , vm) that satisfies

Err X(O, V ) O

log2 m(log ψ + log m + log 1

where ψ = b a mink [n+1]{xk xk 1}.4

Proof. By Lemma A.1, the exponential mechanism with privacy parameter ε log m+1 has an error more than 2(log m+

1) log ψ+log m log β

ε with probability at most β

m. Therefore, by Lemma 3.2, the AQ algorithm has an error no larger than γ = 2(log m+1)2 log ψ+log m log β

ε with probability 1 β. Combining this with Lemma 3.1 the theorem follows.

The uniform quantiles problem is a common instance of the quantiles problem where qi = i m+1, for i = 1, . . . , m. Lemma 3.4 shows that when the desired quantiles are uniform it is possible to guarantee that Err X(O, V ) = O(γ) with probability 1 β.

4We define x0 = a and xn+1 = b. These are not real data points.

Differentially Private Approximate Quantiles

Lemma 3.4 (Uniform Quantiles Utility). Let X (a, b)n

be a database, and let A be an algorithm that computes an approximation v for a single quantile q of X such that

Pr[Gap X(o, v) > γ] β

for some constants β, γ > 0, where o is a true q-quantile. We run AQ algorithm using A on a database X, and quantiles Q = (q1, . . . , qm) where qi = i m+1. Then, with probability 1 β, the AQ algorithm returns approximate quantiles V = (v1, . . . , vm) satisfying Err X(O, V ) 2γ.

Proof. For simplicity we assume that m = 2k 1. The proof is similar for other values of m. For this value of m we have that qℓ= ℓ/2k, ℓ= 1, . . . , 2k 1 and qj i = 1/2 for all i and j. Furthermore, we have that the number of points in X (a, oj i) is 2 j 1

2i n , j = 1, . . . , 2i 1. As in Lemma 3.2, Equation (1) holds. So we assume in the rest of the proof that Gap X(vj i , ˆoj i) γ for all i and j.

For the root (i = 1) of the recursion tree we compute one quantile q1 1 on the data X = X1 1 and therefore Gap X(ˆo1 1, o1 1) = 0. At the second level, (i = 2), we split the data X according to v1 1 into X1 2, and X2 2. Since Gap X(v1 1, o1 1) γ, we have that n/2 γ |Xj 2| n/2 + γ. Recall that |X (a, o1 2)| = n/4 and |X (a, o2 2)| = 3n/4 and therefore Gap X(ˆoj 2, oj 2) γ/2 for j = 1, 2. Now, since Gap X(vj 2, ˆoj 2) γ we get that n/4 3γ/2 |Xj 3| n/4 + 3γ/2, and therefore Gap X(ˆoj 3, oj 3) 3γ/4, for j [4]. Similarly, since Gap X(vj 3, ˆoj 3) γ, it follows that n/8 7γ/4 |Xj 4| n/8 + 7γ/4 and therefore Gap X(ˆoj 4, oj 4) 7γ/8, for j [8]. We conclude that by induction on the levels we get that,

Gap X(ˆoj i, oj i) 2i 1 1

for all i and j. Combining this upper bound with Equation (1) we get that Gap X(vi j, oi j) 2γ.

Theorem 3.5 improves the bound of Theorem 3.3 for uniform quantiles. We omit its proof which is similar to the proof of Theorem 3.3 using Lemma 3.4 instead of Lemma 3.2. Theorem 3.5. If we set A to be the exponential mechanism with privacy parameter ε log m+1, as described in Appendix A, then given data X (a, b)n and quantiles Q = (q1, . . . , qm), where qi = i m+1, the AQ algorithm is ε-DP and with probability 1 β output V = (v1, . . . , vm) that satisfies

Err X(O, V ) = O

log m(log ψ + log m + log 1

where ψ = b a mink [n+1]{xk xk 1}.

Quantiles sanitization. Given a database X = (x1, .., xn) (a, b)n, we can produce a differentially private dataset ˆX = (ˆx1, . . . , ˆxn) (a, b)n, such that for each point ˆxℓ, the number of points in ˆX that are smaller than ˆxℓ is similar to the number of points in X that are smaller than ˆxℓ. This is specified precisely in the following Corollary of Theorem 3.3. In particular this corollary implies that for every interval I (a, b), |X I| is approximately equal to | ˆX I|.

Corollary 3.6. Assume that we implement A using the exponential mechanism with privacy parameter ε log n+1, as described in Appendix A, to solve the single quantile problem. Then, given a database X (a, b)n and n quantiles Q = (q1, . . . , qn), where qi = i/n, the AQ algorithm is ε-DP, and with probability 1 β outputs ˆX = (ˆx1, . . . , ˆxn) that satisfies

Err X(X, ˆX) = O

log n(log ψ + log n + log 1

where ψ = b a mink [n+1]{xk xk 1}.

3.3. Zero-Concentrated Differential Privacy

Zero Concentrated Differential Privacy (z CDP) (Bun & Steinke, 2016) offers smoother composition properties than standard (ε, δ)-DP. The general idea is to compare the R enyi divergence of the privacy losses random variables for neighboring databases. We analyse our algorithm also under this definition of privacy. As in Section 3.1, the privacy analysis of our algorithm applies composition of the processing of different levels in the recursion tree. z CDP s composition theorem allows us to run the exponential mechanism at each level with a higher privacy parameter, which results in a tighter error bound for the exponential mechanism. For precise statements see Theorem 3.11 and Theorem 3.12 below. In Section 4.3 we measure empirically the gain from this smoother composition of z CDP.

Definition 3.7 (Zero-Concentrated Differential Privacy (z CDP) Bun & Steinke (2016)). An algorithm M : Dn R is ρ-z CDP if for all neighbouring X, X Dn, and α (1, ) RDα(M(Z), M(Z )) ρα, where RDα is the α-R enyi 5 divergence between random variables A and B. (D is the domain of the database elements, in our case it is (a, b).)

The following lemma shows that DP implies z CDP.

Lemma 3.8. (Bun & Steinke, 2016) if algorithm M satisfies ε-DP, then M satisfies ρ-z CDP with ρ = ε2

5the R enyi divergence of order α (1, ) between two probability distributions on P and Q over ω is RDα := 1 1 α log( R

ω P(x)αQ(x)1 α dx)

Differentially Private Approximate Quantiles

The above lemma is generic in that it applies to every mechanism that satisfies differential privacy. Better bounds are known for specific cases. In particular, we will use the following lemma.

Lemma 3.9. (Cesar & Rogers, 2021) The exponential mechanism with parameter ε satisfies ε2

Lemma 3.10. (Bun & Steinke, 2016) Let M : Dn Y and M : Dn Z. Suppose M satisfies ρ-z CDP and M

satisfies ρ -z CDP. Define M : Dn Z by M (X) = M (X, M(X)). Then M satisfies (ρ + ρ )-z CDP.

Theorem 3.11 (General Quantiles Utility with z CDP). Suppose we implement A using the exponential mechanism to solve the single quantile problem, with privacy parameter ε = q

8ρ log m+1. Then, given data X and quantiles Q = (q1, . . . , qm), the AQ algorithm is ρ-z CDP and with probability 1 β output V = (v1, . . . , vm) that satisfies

O log1.5 m ρ (log ψ + log m + log 1

where ψ = b a mink [n+1]{xk xk 1}.

Proof. By Lemma 3.1, the computation at each recursive level 1 i log m + 1 is q

8ρ log m+1-DP, and therefore, by Lemma 3.9, also ( ρ log m+1)-z CDP. Since the number of levels is log m + 1, by Lemma 3.10, the AQ algorithm is ρ-z CDP. The error bound follows exactly as in the proof of Theorem 3.3.

The following theorem is analogous to Theorem 3.5. Its privacy proof is as for Theorem 3.11 and the error analysis is as in the proof of Theorem 3.5.

Theorem 3.12 (Uniform Quantiles Utility with z CDP). Suppose we implement A using the exponential mechanism to solve the single quantile problem, with privacy parameter q

8ρ log m+1. Then, given data X and quantiles

Q = (q1, . . . , qm), where qi = i m+1, the AQ algorithm is ρz CDP, and with probability 1 β output V = (v1, . . . , vm) that satisfies

Err X(O, V ) = O

ρ (log ψ + log m + log 1

where ψ = b a mink [n+1]{xk xk 1}.

4. Experiments

We implemented the AQ algorithm in Python and its code is publicly available on Git Hub6. We used the exponential

6https://github.com/Shachar Schnapp/DP AQ

mechanism for the DP single quantile algorithm A, with Gap X(o, v) as the utility of a solution v, where o is a true quantile, see Appendix A. We also experimented with the AQ-z CDP algorithm, a version of our algorithm that is private with respect to the definition of zero-concentrated differential privacy (z CDP) (Section 3.3). We compared our algorithms to the three best performing algorithms from Gillenwater et al. (2021) called: (1) Joint Exp (2) Appind Exp and (3) Agg Tree. We ran the implementations provided by Gillenwater et al. (2021). We describe these baseline algorithms in Section 4.1. We tested the algorithms using two synthetic datasets and four real datasets that are described in Section 4.2. For each dataset we compared the accuracy (Section 4.3) and runtime (Section 4.4) of the competing algorithms.

4.1. Baseline Algorithms

Joint Exp Gillenwater et al. (2021) solves the DP quantiles problem by an efficient implementation of the exponential mechanism (Definition 2.4) on m-tuples V = (v1, . . . , vm) (a, b)m, where the utility of V is defined as follows:

u(X, V ) = X

j [m+1] |Gap X(oj, oj 1) Gap X(vj, vj 1)}|,

where we define v0 = o0 = a and vm+1 = om+1 = b. The naive implementation of the exponential mechanism with this utility function is computationally difficult: The number of m tuples V is infinite, and there may even be exponentially many (in m) equivalence classes of such m-tuples. Gillenwater et al. (2021) give an O(mn log n + m2n) time algorithm to sample from the distribution defined by the exponential mechanism. The experiments of Gillenwater et al. (2021) show that when the number of quantiles, m, is small the Joint Exp algorithm performs best.

Appind Exp solves the DP quantiles problem by applying the exponential mechanism as described in Appendix A to find every quantile qi separately. Since Appind Exp applies the exponential mechanism m times, if we use ε/m as the privacy parameter for each application of the exponential mechanism, then by composition we get that Appind Exp is ε-DP. The advanced composition theorem Dwork et al. (2010b) shows that if we use ε ε/ p

m log(1/δ) for each application of the exponential mechanism then the overall algorithm would be (ε, δ)-DP. The implementation of Gillenwater et al. (2021) uses a tighter advanced composition theorem specific for nonadaptive applications of the exponential mechanism (Dong et al., 2020), to determine an ε for each quatile computation such that the overall composition of the m applications is (ε, δ)-DP. We use n = 1000 data points in our experiments, so we chose

Differentially Private Approximate Quantiles

δ = 10 6 in accordance with the common practice that δ 1

Agg Tree Dwork et al. (2010a) and Chan et al. (2011) implement an ε-DP tree-based counting algorithm for CDF estimation. Given a domain (a, b) the algorithm builds a balanced tree T with branching factor b and height h, so T has bh leaves. The j th leaf of the tree is associated with sub-domain [cj 1, cj] where cj := a + j(b a)/bh. Given a dataset D (a, b)n, the algorithm starts by counting the number of elements from the dataset that fall into each leaf (i.e. are contained in its sub-domain). Each internal node of T is associated with the sum of the counters of its children (which equals to the number of elements in the leaves of its subtree). In particular, the count associated with the root is n. Since each element in the data contributes to at most h nodes (the path from the leaf containing it to the root), it suffice to add Lap(h/ε) noise to the value of each node to make the counts of T ϵ-DP. We can approximate any quantile q using this data structure as follows. We find the leftmost leaf z such that the sum of the noisy counts of all leaves to the left of z (including z) is at least q n. In particular, if the counts were not noisy that z would contain a qth quantile. Let c(z) be the noisy count of z and let c (z) be the sum of the noisy counts of the leaves to the left of z. Let p = (qn c (z))/c(z). Without noise p would have been the approximate relative quantile of the qth quantile among the elements in z. Let [ck 1, ck] be the range associated with z. We approxmate the qth quantile using linear interpolation inside [ck 1, ck]. That is we return (1 p)ck 1 + pck. We utilize the Agg Tree implementation provided by Gillenwater et al. (2021), we used b = 12 and h = 4 those values came from the exhaustive hyperparameters search of Gillenwater et al. (2021) (see their Appendix E) and the results are given in Section 4.3.

4.2. Datasets

We tested our four algorithms on six datasets. Two data sets are synthetic. One contains independent samples from the uniform distribution U( 5, 5), and the other contains independent samples from the Gaussian N(0, 5). Two real continuous datasets from Goodreads (2019), one contains book ratings and the other contains books page counts. Last we have two categorial datasets from the adults census data (Dua & Graff, 2019). One contains working hours per week and the other ages of different persons. Table 1 shows the properties of each dataset, and Figure 3 shows the histograms of 100 equal-width bins for each dataset.

4.3. Empirical Error Analysis

We compare the error of Approximate Quantiles and the baseline algorithms. Our error metric is the average gap of

Figure 3: Histograms of 100 equal-width bins.

Table 1: Data set properties.

Data set Size Data Characteristics

Uniform (synthetic) 10000 Continuous

Gaussian (synthetic) 10000 Continuous

Goodreads rating 11123 Continuous

Goodreads pages 11123 Continuous

Adult hours 48842 96 Categories

Adult age 48842 74 Categories

the approximate quantiles V = (v1, . . . , vm) and and the

true ones O = (o1, . . . , om):

j [m] Gap X(oj,vj)

DP Error Analysis: We randomly chose 1000 samples from each dataset and checked the error of each algorithm with m = 1 to m = 120 uniform quantiles and D = [ 100, 100] as a (loose) user-provided domain. We used the privacy parameter ε = 1. This process was repeated 100 times. Figure 4 shows the average of the error across the 100 iterations. Figure 5 zooms in on the error for m = 1, . . . , 35 quantiles. Approximate Quantiles performs better than the baselines almost in all experiments, except for a few small values of m where the performance of Joint Exp was slightly better. As the number of quantiles increases Approximate Quantiles wins by a larger margin.

z CDP Error Analysis: As in previous experiments we randomly chose 1000 samples from each dataset and checked the error of each algorithm for m = 1, . . . , 120 uniform quantiles and D = [ 100, 100] as a (loose) userprovided domain. All algorithms were ρ-z CDP for ρ = 1

8. For this we used ε = 1/ m in each application of the exponential mechanism by Appind Exp, and a Laplace noise of magnitude ε =

h in each node of the tree computed by Agg Tree. In each application of the exponential mechanism by Approximate Quantiles we used ε = p

1/(log m + 1) as described in Theorem 3.12. The algorithm Joint Exp

Differentially Private Approximate Quantiles

Figure 4: The x axis shows the number of quantiles and the y axis shows the gap per quantile averaged over 100 trials with ε = 1. Note that the graphs are in log scale.

Figure 5: The x axis shows the number of quantiles and the y axis shows the gap per quantile averaged over 100 trials with ε = 1. Note that the graphs are in log scale.

with ε = 1 is 1

8-z CDP by Lemma 3.10, so its error is the same as in the previous experiment. Figure 6 shows the average of the error for z-CDP across the 100 iterations. Figure 7 zooms in on the error for m = 1, . . . , 35. Approximate Quantiles performs much better than the baselines even for small number of quantiles. When the number of quantiles is large, the error of AQ algorithm reaches up to 7.14 time lower than the best existing baseline.

4.4. Time Complexity Experiments

Given a sorted dataset, it takes O(n) time to find all quantiles at a single level of the recursion tree of Approximate Quantiles. Therefore the overall time complexity (i.e., without the sort) of the AQ algorithm is O(n log m), where m is the number of quantiles. In comparison, the baseline algorithms are computationally more expensive: Joint Exp algorithm runs in O(mn log n+m2n) time, Appind Exp algorithm runs in O(mn) time and Agg Tree algorithm runs in O(n log n) time. We empirically compared the running time of Approximate Quantiles to the running times of the baseline algorithms. For each dataset we measured the time required to find m [120] quantiles

Figure 6: z CDP: The x axis shows the number of quantiles and the y axis shows the gap per quantile averaged over 100 trials with ρ = 1/8. Note that the graphs are in log scale.

Figure 7: z CDP: The x axis shows the number of quantiles and the y axis shows the gap per quantile averaged over 100 trials with ρ = 1/8. Note that the graphs are in log scale.

in a sub-sample of size 1000 of each dataset, averaged over 100 trials per dataset. Figure 8 shows the average running time across all datasets (Section 4.2), each experiment used one core of an Intel i9-9900K processor. We see that the running time of Approximate Quantiles is about ten times smaller than of Agg Tree and at least a 100 smaller than of Joint Exp and Appind Exp .

Figure 8: x axis number of quantiles, y axis is the run time averaged across all datasets, each dataset averaged across 100 trials. Note that the graphs are in log scale.

Differentially Private Approximate Quantiles

Acknowledgements

Haim Kaplan and Uri Stemmer are partially supported by the Israel Science Foundation (grants 1595/19 and 1871/19) and by Len Blavatnik and the Blavatnik Family foundation. Shachar Schnapp is partially supported by the Cyber Security Research Center at Ben-Gurion University of the Negev.

Beimel, A., Nissim, K., and Stemmer, U. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1 61, 2016.

Bun, M. and Steinke, T. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635 658. Springer, 2016.

Bun, M., Nissim, K., Stemmer, U., and Vadhan, S. Differentially private release and learning of threshold functions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 634 649. IEEE, 2015.

Cesar, M. and Rogers, R. Bounding, concentrating, and truncating: Unifying privacy loss composition for data analytics. In ALT, volume 132 of Proceedings of Machine Learning Research, pp. 421 457. PMLR, 2021.

Chan, T.-H. H., Shi, E., and Song, D. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1 24, 2011.

Dong, J., Durfee, D., and Rogers, R. Optimal differential privacy composition for exponential mechanisms. In International Conference on Machine Learning, pp. 2597 2606. PMLR, 2020.

Dua, D. and Graff, C. UCI machine learning repository, 2019. URL http://archive.ics.uci.edu/ml. Date accessed: 2021-07-01.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211 407, 2014.

Dwork, C., Kenthapadi, K., Mc Sherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, volume 4004 of Lecture Notes in Computer Science, pp. 486 503. Springer, 2006a.

Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265 284. Springer, 2006b.

Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715 724, 2010a.

Dwork, C., Rothblum, G. N., and Vadhan, S. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 51 60, 2010b.

Gillenwater, J., Joseph, M., and Kulesza, A. Differentially private quantiles. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 3713 3722. PMLR, 2021.

Goodreads. Goodreads-books dataset, 2019. URL https://www.kaggle.com/jealousleopard/. Date accessed: 2021-07-01.

GRE. Ets. gre guide to the use of scores., 2021. URL https: //www.ets.org/s/gre/pdf/gre_guide.pdf. Date accessed: 2021-07-19.

Kaplan, H., Ligett, K., Mansour, Y., Naor, M., and Stemmer, U. Privately learning thresholds: Closing the exponential gap. In Conference on Learning Theory, pp. 2263 2285. PMLR, 2020.

Mc Sherry, F. and Talwar, K. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 07), pp. 94 103. IEEE, 2007.

Smith, A. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the fortythird annual ACM symposium on Theory of computing, pp. 813 822, 2011.

Differentially Private Approximate Quantiles

A. DP Single quantile

In the DP single quantile problem, the input is a single quantile q and a database X (a, b)n. The output is a quantile estimate v (a, b) such that Prx UX[x v] q (in a sense that Lemma A.1 would make precise). We solve this problem using the exponential mechanism of (Mc Sherry & Talwar, 2007) on (a, b) with the utility function:

u(X, w) = ||{x X|x < w}| q n | = Gap X(o, w),

where o is a q-quantile and w (a, b). This mechanism samples each point w (a, b) to be the output with density proportional to exp(εu(X, w)/2). Note that the sensitivity of u is 1, that is max{|u(X, ω) u(X , ω)|} 1, where the maximum is over neighboring datasets X and X and points ω (a, b). The largest utility is of a q-quantile and equals to 0. We can sample from this distribution using the technique given by (Smith, 2011) (see their Algorithm 2). The idea is to split the sampling process into two steps:

1. Let Ik = [xk 1, xk], k = 1, . . . , n + 1 where x0 = a, xn+1 = b, be the set of n + 1 intervals between data points. We sample an interval from this set of intervals, where the probability of sampling Ik is proportional to

Pr[A(X) = Ik] exp εu(X, Ik)

Note that all points in Ik have the same utility which we denote by u(X, Ik).

2. Return a uniform random point from the sampled interval.

Lemma A.1. Given dataset X (a, b)n and quantile q [0, 1], the exponential mechanism is ε-DP, and with probability 1 β outputs v that satisfies

Gap X(o, v) 2 log ψ log β

where o is a true q-quantile and ψ = b a mink [n+1]{xk xk 1}.

Proof. ε-DP follows in a straightforward way by bounding the ratio of the densities of a point w in the destribution defined by X and in the distribution defined by X ; See also (Dwork & Roth, 2014).

Let It be an interval such that u(X, It) γ. It follows that the probability of sampling a point from It is at most

Pr[A(X) = It] exp εγ

2 (xt xt 1) P k [n+1] exp εu(X,Ik)

2 (xk xk 1) .

Using the union bound we get that:

Pr[A(X) γ] exp εγ

k [n+1] exp εu(X,Ik)

2 (xk xk 1)

exp εu(X,Io)

2 (xo xo 1)

b a mink [n+1](xk xk 1) exp εγ

where Io is the interval containing the q-quantiles so u(X, I0) = 0. It follows that with probability less than β, we sample an interval whose utility is at most γ for

γ = 2 u log ψ log β

Differentially Private Approximate Quantiles

Since in the second step we sample the q-quantile v uniformly from the interval selected in the first step, we get that with probability 1 β the output v satisfies:

Gap X(o, v) 2 log ψ log β