# differentially_private_boxplots__7a09255a.pdf Differentially Private Boxplots Kelly Ramsay * 1 Jairo Diaz-Rodriguez * 1 Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparison of multiple datasets. Consequentially, we introduce a differentially private boxplot. We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently, which is not always the case for a boxplot naively constructed from a single existing differentially private quantile algorithm. As a byproduct of this exposition, we introduce several new results concerning private quantile estimation. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms the naive boxplot. Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization. 1. Introduction It is now well established that differential privacy (Dwork et al., 2006) is a powerful framework for protecting the privacy of individuals data. As a result, a plethora of differentially private data analysis tools have been developed over the last several decades (Dwork, 2008; Ji et al., 2014; Liu et al., 2023). However, one area that has been considerably underdeveloped is that of differentially private visualization. This is despite the fact that data visualization is a key tool in exploratory analysis, which is an essential component of data analysis. In fact, in a recent study of data analysts and *Equal contribution 1Department of Mathematics and Statistics, York University, Toronto, Canada. Correspondence to: Kelly Ramsay . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). practitioners, Garrido et al. (2023) found that most analysts employed aggregations and visualizations to fulfill their use case, rather than machine learning models. Differentially private versions of several popular visualizations have been considered in the literature. Nanayakkara et al. (2022) developed a plot to visualize privacy utility trade-offs. Zhou et al. (2022) developed DPVis Creator, a visualization system, which relies on publishing a synthetic dataset. Visualizations in other privacy models were considered by Dasgupta & Kosara (2011); Dasgupta et al. (2013; 2019), and recently reviewed by Bhattacharjee et al. (2020). There has been substantial study of the differentially private histogram (Hay et al., 2010; Li et al., 2010; Acs et al., 2012; Kellaris & Papadopoulos, 2013; Xu et al., 2013; Zhang et al., 2014; Budiu et al., 2022). Visualizations based on clustering (Hongde et al., 2014), for mobility data (He et al., 2016), heatmaps (Zhang et al., 2016) and scatterplots (Panavas et al., 2024) have also been considered. Surprisingly, despite its widespread use, a differentially private boxplot has not been directly studied. Boxplots, invented by Tukey et al. (1977), are used for visualizing the characteristics of a univariate sample, as well as for visualizing the relationship between a continuous variable and a categorical variable. Despite its simplicity, many key distributional characteristics can be evaluated from the boxplot: namely, location, scale, skewness, and tails. This has made it popular among practitioners. In addition, its simplicity is beneficial from a privacy standpoint. The boxplot only requires estimating a few summary statistics in order to convey the distributional characteristics of a univariate sample, making the boxplot efficient in its use of privacy budget. For instance, by contrast, a histogram can also be used to convey the same information. However, a histogram is generally a consistent estimate of the population density, which intuitively, should then require more noise to ensure privacy. Motivated by these observations, we take the first steps in developing and studying differentially private boxplots. First, one can observe that a boxplot is made up of sample quantiles. It is then natural to first consider how well a boxplot constructed from naive applications of existing differentially private quantile algorithms. Succinctly: Differentially Private Boxplots How well does the naive private boxplot perform? On this front, with a combination of new theoretical results and simulations, we demonstrate that this method may fail on data coming from common distribution families. We focus on the state-of-the-art algorithms Joint Exp (Gillenwater et al., 2021), Private Quantile (Smith, 2011), Approx Quantile (Kaplan et al., 2022), and unbounded (Durfee, 2023). Due to limited space, our rationale for this choice, and a full literature review on private quantiles, including how our results build on this literature, is relegated to Appendix D. See also the related problem of differentially private range estimation (Kaplan et al., 2020). While range queries can be used to compute quantiles (Bun et al., 2015; Kulkarni, 2019; Kaplan et al., 2020), our focus is on more practical methods that estimate differentially private quantiles. In summary: We prove that Joint Exp, Private Quantile and Approx Quantile are (almost always) inconsistent for extreme quantiles (Lemma 4.1), which makes them a poor choice for generating the whiskers in the boxplot. (To our knowledge, no results concerning the consistency of extreme private quantiles previously existed. Lalanne et al. (2023b) proved consistency for inner quantiles, under the assumption of a lower bound on the density, which would not apply to extreme quantiles.) We prove that unbounded is consistent for extreme quantiles (Lemma B.3) . This result also confirms an empirical observation of Durfee (2023), which says that the unbounded quantile algorithm performs better than existing algorithms for estimating extreme quantiles. (To our knowledge, no statistical results concerning unbounded previously existed.) Our simulations reveal that the unbounded algorithm does not perform as well as Joint Exp, Approx Quantile or Private Quantile for estimating the inner quantiles of Gaussian data. This makes it a poor choice for generating the box in the plot. We support this finding by providing matching (up to logarithmic factors) upper and lower bounds for the sample complexity of Joint Exp, Approx Quantile and Private Quantile for estimating inner quantiles, under general distributional assumptions (Theorem 4.2 and Theorem 4.4). In particular, we only assume that the density is positive in a neighborhood of the theoretical quantile and bounded everywhere (Condition 1). Previous concentration results for these algorithms assume the population density has bounded support (Lalanne et al., 2023a). The lower bound is new, and does not follow from the lower bound on private medians (Tzamos & Vlatakis Gkaragkounis, 2020). Given that the naive boxplot is not satisfactory, we then sought to answer: Can we develop a novel private boxplot, which performs well at generating both the box and whiskers? To this end, building on the aforementioned results concerning private quantiles, we present a novel differentially private boxplot, which we call DPBoxplot. We then evaluate its ability to represent key features location, scale, skewness, and tails both theoretically and empirically. Figure 1 provides an informal, graphical illustration of our approach. Specifically, we use Joint Exp for estimating the quartiles and the median to construct the box, and unbounded for determining extreme values necessary for the whiskers and the number of outliers. Instead of revealing the precise locations of outliers, we privately disclose their count using the Laplace mechanism. In summary, our main contributions concerning the new differentially private boxplot are: We carefully combine private quantile estimators, ensuring that those used for the location and scale of the boxplot are estimated optimally (Theorem 4.2), while the extreme quantiles required to depict skewness and tails remain consistent (Theorem 4.5). Extensive simulations demonstrate that DPBoxplot is often more accurate than those constructed naively using existing differentially private quantile methods, see Section 5. We conduct a differentially private exploratory data analysis, showcasing the practical utility and shortcomings of DPBoxplot, see Section 6. 2. Preliminaries First, we briefly review the boxplot. It is also helpful to introduce some notation used throughout the paper. We define a dataset of size n N to be a set of n real numbers. Let Dn be the set of datasets of size n and let Xn = {X1, . . . , Xn} be a dataset size n. In this work, we assume the analyst has access to a dataset, denoted by Xn. We will also assume that Xn consists of n independent draws of from a common population distribution or measure, denoted ν. We consider the traditional boxplot, which consists of a box with a line drawn through it, with two whiskers emanating from the lower and upper bounds of the box, see Figure 1. The box is made up of the median and the quartiles, while the whiskers, represent the tails and skewness of the data. Points beyond the whiskers are explicitly plotted. Together, this 5 number summary describes the empirical distribution of the data, Differentially Private Boxplots Laplace mechanism Laplace mechanism Figure 1. Graphical illustration of algorithms for the traditional non-private boxplot (left) and our proposed differentially private boxplot DPBoxplot (right). including its location, scale, skewness and tails. A more detailed review of the boxplot can be seen in Appendix A. 2.1. Differential privacy Next, we introduce differential privacy. First, we say that Yn Dn, is adjacent to Xn if Xn and Yn differ in exactly one point. Let An be the set of pairs of adjacent datasets of size n. Next, denote by PXn a probability measure which depends on Xn. For a space S, let B(S) denote the Borel sets of S and let M1(S) be the set of Borel probability measures over S. Formally, PXn is a map from Rn to M1(Rd), for some d N. We can now define differential privacy. Definition 2.1. Given ϵ > 0, the quantity θ PXn is ϵ-differentially private if sup (Xn,Yn) An sup B B(Rd) PXn(B) PYn(B) eϵ. (1) Here, θ is a d-dimensional differentially private quantity and ϵ is the privacy budget, where values of ϵ which are closer to zero enforce stricter privacy constraints. The proposed differentially private boxplot makes use of existing differentially private algorithms. Specifically, the Laplace mechanism, and algorithms for computing differentially private quantiles. The Laplace mechanism (Dwork et al., 2006) is a fundamental mechanism for constructing differentially private statistics. Let Z be a standard Laplace random variable. Specifically, for a statistic T : Dn R, define its global sensitivity to be sup(Xn,Yn) An |T(Xn) T(Yn)|. If T has global sensitivity bounded by 1, then T = T(Xn) + Z/ϵ is a differentially private quantity. This is known as the Laplace mechanism. 2.2. Private quantiles Of course, a private boxplot relies on differentially private quantiles. The proposed private boxplot relies on the unbounded algorithm of Durfee (2023) and the Joint Exp algorithm of Gillenwater et al. (2021). These algorithms are the main focus of our theoretical exposition. However, the Private Quantile is a special case of Joint Exp and so our results also apply to Private Quantile. In addition, combining our results with those of Kaplan et al. (2022) allows our results to also hold for Approx Quantile. We do, however, consider the performance of the naive boxplot constructed from all three of these algorithms in our simulation study, see Section 5. We now give a brief overview of Joint Exp and unbounded. Joint Exp works as follows: Suppose we wish to estimate m N quantiles of levels 0 < q1 < . . . < qm 1. For an integer k N, let [k] denote the set {1, . . . , k}. Define the jth quantile generated by Joint Exp for a given j [m] to be ξqj, and let ξq = ( ξq1, . . . , ξqm). For a given µ M1(R), let Fµ be the cumulative distribution function (CDF) of µ and, if it exists, let fµ be its probability density function. For the dataset Xn, let ˆν be the corresponding empirical measure, so that Fˆν is then the associated empirical CDF. For x Rm with x1 < . . . < xm, define the following utility function: ϕ(x) = Pm+1 j=1 |Fˆν(xj) Fˆν(xj 1) (qj qj 1)|, where x0 = , xm+1 = , q0 = 0 and qm+1 = 1. Then ξq is a random vector drawn from the density f QXn(x) exp( nϕ(x)/2ϵ)1 {x [a, b]m} , where QXn is used to denote the distribution of ξq, given Xn.1 It was shown by Gillenwater et al. (2021) that ξq is ϵ-differentially private. Furthermore, Lalanne et al. (2023b) showed that this algorithm is consistent if ν is continuous. If one suspects that ν contains atoms, then we recommend that one uses the jittering modification proposed by Lalanne et al. (2023b). We show in Section 4 that, for a large class of distributions, this algorithm has optimal sample complexity for estimating a single inner quantile, and therefore, for the scale and location of the boxplot. (Here we are only interested in estimating three quantiles, and so, how sample complexity scales in m in not of particular concern.) The original unbounded algorithm produces a private quantile given a lower bound on the data. We present a slightly modified version, which performs better for extreme quantiles. For a given quantile q [1/2, 1] and a > 0, let V0, V1, . . . be independent, standard exponential random variables and let βn > 1. Define i = inf{i N: Fˆν(βi n + a 1) + 2 nϵVi q + V0 2 nϵ}. Here i is the smallest integer i such that a noisy empirical CDF computed at βi n + a 1 is greater than a noisy version of q. The quantile estimate generated by the unbounded algorithm is given by: ψq = βi n +a 1. The intuition is that if a noisy version of Fˆν(βi n + a 1) is close to a noisy version of q, 1Note f QXn is only defined for x Rm with x1 < . . . < xm. Differentially Private Boxplots then βi n + a 1 should be close to the nonprivate quantile. If q (0, 1/2), then, we apply the above procedure to the dataset Xn = { X1, . . . , Xn}, with input parameters 1 q+1/n and a = b. It follows from (Durfee, 2023) that this algorithm is ϵ-differentially private. In the proposed differentially private boxplot algorithm, we use unbounded to estimate the minimum and maximum of the dataset. For a measure µ M1(R), let ξp,µ = inf{x R: p Fµ(x)} be the associated pth quantile, min(µ) = inf{x: Fµ(x) > 0} and max(µ) = inf{x: Fν(x) = 1}. We show that if Xn is an independent sample from ν, then ψ1 is weakly consistent for max(ν), and ψ1/n is weakly consistent for min(ν), see Lemma B.3. By contrast, we show that when the support of fν is bounded below, Joint Exp is inconsistent for min(ν) when q = 1/n, unless a = min(ν), see Lemma 4.1. Lastly, we note that instead of exponential noise, one could also use Laplace or Gumbel noise (Durfee, 2023). We found this to make little difference in simulation, and so we only present the version which incorporates exponential noise. 3. A differentially private boxplot We can now introduce the differentially private boxplot. Given lower and upper bounds on the data a and b, we first generate the minimum and maximum estimates using the unbounded algorithm, ψ1/n and ψ1 with a privacy budget of 3ϵ/16 each. If one does not wish to supply input bounds for the data, one can use the fully unbounded version of unbounded (Durfee, 2023). However, in simulation, (see Section 5), we have observed that the procedure is still accurate, even when the input bounds are very loose. Critically, we use the unbounded algorithm because Joint Exp (and, by consequence, Private Quantile and Approx Quantile) is often inconsistent for estimating the aforementioned extreme quantiles, unless the input bounds a and b match min(ν) and max(ν), respectively, see Lemma 4.1. Next, we run one instance of Joint Exp with a total privacy budget of ϵ/2 to generate ξ1/4, ξ1/2, ξ3/4 simultaneously to be the lower box bound, center line and upper box bound, respectively. We then calculate the private interquartile range IQR = ξ3/4 ξ1/4. Let ℓ1 = ξ1/4 1.5 IQR, u1 = ξ3/4 +1.5 IQR and λn > 0 be the buffer parameter. The lower and upper whisker are defined as ( ℓ1 ξ1/n λn| ℓ1| + ℓ1 ψ1/n ξ1/n > λn| ℓ1| + ℓ1, ( u1 ξ1 u1 λn| u1| ψ1 ξ1 < u1 λn| u1| , respectively. To elaborate, the role of the minimum and the maximum of the dataset in a traditional boxplot is played by the estimated extreme quantiles ψ1/n and ψ1. Given that estimated extreme quantiles are more variable than the estimated inner quantiles, we add a buffer λn to account for this. Specifically, we take the upper whisker equal to the extreme quantile ψ1 when ψ1 is (λn 100)% smaller than 1.5 times the IQR. An analogous procedure is done for the lower whisker. The role of λn is to account for the fact that the algorithm is more prone to erroneously replacing the IQRbased whisker with an extreme quantile. This is because the extreme quantiles are more variable than the inner quantiles. To mitigate this, we introduce λn as a trust parameter that favors the IQR whisker allowing it to be replaced only when the extreme quantile is smaller in magnitude by at least (λn 100)%. We take λn = n 1/4. Simulation results (Appendix C.2) indicate that the performance of the method is not overly sensitive to the choice of this parameter. The boxplot includes the observations that lie above and below the upper and lower whiskers, respectively. We cannot release such data points under the constraints of differential privacy. Instead, we plot a noisy version of the number of points above u, denoted ou and below ℓ, denoted oℓ. These are generated via the Laplace mechanism, where it is easy to see that the count of observations above or below a threshold has global sensitivity 1. Lastly, we attribute less privacy budget for computing the outliers, as we deem these values to be of less interest than the box itself. We assign each noisy outlyingness number a privacy budget of ϵ/16. Together, these seven values make up the differentially private boxplot: B(Xn, ϵ) = B(ˆν, ϵ) = ( oℓ, ℓ, ξ1/4, ξ1/2, ξ3/4, u, ou). The algorithm, which we call DPBoxplot, is summarized in Algorithm 3, see also, Figure 1. It follows from sequential composition that DPBoxplot is ϵ-differentially private. Note that the time and space complexities of DPBoxplot are given by the maximum of those of the differentially private quantile algorithms unbounded and Joint Exp. Since the unbounded quantile can be computed in linear time (Durfee, 2023), the overall complexity of DPBoxplot is determined by that of Joint Exp, which is O(n log(n)) (Gillenwater et al., 2021). 4. Theoretical results In this section, we derive several results concerning private quantiles and the different elements of the private boxplot. Recall that we assume that the dataset Xn consists of n independent, identically distributed random variables drawn from a population measure ν M1(R). We first present a lemma which says that Joint Exp is inconsistent for extreme quantiles. We then present an upper bound on the sample complexity of the inner quantiles generated from Joint Exp. After which, we present a minimax lower bound for privately estimating a quantile from a population measure lying in a general set, which matches the upper Differentially Private Boxplots Algorithm 1 DPBoxplot Input: data Xn, ϵ, a, b, λn {Construct the private quantiles2} ψ1 unbounded(1, a, b, 3ϵ/16) ψ1/n unbounded(1/n, a, b, 3ϵ/16) ξq Joint Exp(q, a, b, ϵ/2) {Ensure the box contains the median} ξ1/4 min( ξ1/4, ξ1/2) ξ3/4 max( ξ3/4, ξ1/2) {Construct the whiskers and outlyingness counts} ℓ ξ1/4 1.5( ξ3/4 ξ1/4) u ξ3/4 + 1.5( ξ3/4 ξ1/4) if ψ1/n > λn| ℓ| + ℓthen ℓ ψ1/n oℓ 0 else oℓ Pn i=1 1 n xi < ℓ o + Laplace(0, 1 ϵ/16) end if if ψ1 < u λn| u| then u ψ1 ou 0 else ou Pn i=1 1 {xi > u} + Laplace(0, 1 ϵ/16) end if Output: B(Xn, ϵ) = ( oℓ, ℓ, ξ1/4, ξ1/2, ξ3/4, u, ou) bound up to logarithmic terms. Lastly, we present a result that says the whiskers and outlyingness numbers are consistent for their population counterparts. We now define a boxplot rigorously, as a function of a measure on the set of real numbers, which is convenient mathematically. The non-private boxplot constructed from the data is taken to be the boxplot computed on the empirical measure of the dataset Xn, denoted by ˆν. For ν M1(R), let IQR(ν) = ξ3/4,ν ξ1/4,ν. Now, letting ℓ1,ν = ξ1/4,ν 1.5 IQR(ν) and u1,ν = ξ3/4,ν + 1.5 IQR(ν), the whiskers are defined as ℓν = max(ℓ1,ν, min(ν)) and uν = min(u1,ν, max(ν)). Lastly, we can define oℓ,ν = Fν(ℓν) ν({X = ℓν}), ou,ν = 1 Fν(uν). The population boxplot is the following seven number summary B(ν) = (oℓ,ν, ℓν, ξ1/4,ν, ξ1/2,ν, ξ3/4,ν, uν, ou,ν). The next lemma says that Joint Exp (and, by consequence, Private Quantile and Approx Quantile since these algorithms are equivalent for m = 1 quantile) is inconsistent for the minimum of ν when we set q = O(1/n) if the input bounds do not exactly match those of the support of the population distribution, and the support of the population distribution is bounded. Lemma 4.1. For < a < b < , 0 < q 1, if there exists M > a such that ν({X M}) = 0, then for any 0 < t < M a, we have that Pr | ξq ξq,ν| t e ϵ nq In this case, if q = n 1, then the sample complexity is bounded below by infinity, which implies inconsistency. The proof of Lemma 4.1 can be seen in Appendix B. Before getting to our next results, we introduce a condition on the population distribution. For L, r > 0 and q (0, 1], let GL,r,q be the set of absolutely continuous measures µ M1(R) such that infξq r x ξq+r fµ(x) L > 0. Therefore, if the population measure ν GL,r,q, then it has a density which is bounded below by L in a neighborhood of size r around its qth quantile. Condition 1. Given p = (p1, . . . , pm) with 0 < p1 < . . . < pm 1 then a ξp1 < ξpm b and there exists K, r, L > 0 such that ν m i=1GL,r,pi and supx [a,b] fν(x) K. This condition has three requirements. First, we require that the interval [a, b] contains the population quantiles we wish to estimate. Our theoretical and simulation results show that this interval can be chosen loosely, as the error in our estimates does not depend strongly on the size of this interval. Thus, this is not a strict requirement. Next, we require that for each pi, i [m], in a neighborhood of size r of the pith quantile of ν, the density is bounded away from 0. This requirement is standard in quantile estimation, e.g, (Tzamos & Vlatakis-Gkaragkounis, 2020; Lalanne et al., 2023a). Lastly, we require that the population density is bounded above everywhere. Requiring the density being bounded above is a weak restriction, which is satisfied by many distribution families, such as the Gaussian, Beta and Gamma families. We can now state an upper bound on the sample complexity of the private quantiles generated via Joint Exp, and by consequence, Private Quantile and Approx Quantile, see Remark 4.3. For q = (q1, . . . , qm) such that 0 < q1 < . . . < qm 1, define ξq,ν = (ξq1,ν, . . . , ξqm,ν). Next, for x, y R, we write x y (x y) for the maximum (minimum) of x and y. Theorem 4.2. Given < a < b < and q = (q1, . . . , qm) such that 0 < q1 < . . . < qm 1, if Condition 1 holds with p = q, then there exists a universal constant C > 0 such that for all 0 < γ < 1 and 0 < t r, it holds that ξq, ν ξq,ν t with probability 1 γ, provided that m5/2 log(1/γ) m log(m/ct L) _ m2 log (1/γ) log K(b a) q1 (1 qm) t L Differentially Private Boxplots The proof of Theorem 4.2 relies on a general bound for the exponential mechanism detailed by Ramsay et al. (2024), it can be seen in Appendix B. Theorem 4.2 gives an upper bound on the sample complexity of the quantiles generated by Joint Exp for distributions who satisfy Condition 1 with p being the vector of quantile levels to be estimated. Comparing to existing results, Lalanne et al. (2023a) give a concentration result for Joint Exp which yields the same upper bound given by Theorem 4.2, when the support of fν is bounded. Therefore, our bound is essentially a generalization of theirs, though we assume that fν is bounded above. The sample complexity of the median of Tzamos & Vlatakis-Gkaragkounis (2020), which is a different estimator, also matches the upper bound given in Theorem 4.2, under similar assumptions. Lastly, we note that by combining our techniques used to prove the upper bound given in Theorem 4.2 with the Approx Quantile algorithm of Kaplan et al. (2022), one can obtain logarithmic scaling in m. However, in this context, m = 3, and this is not particularly relevant, thought may be interesting for other applications. In simulation, we observe almost identical performance between Approx Quantile and Joint Exp for the purposes of DPBoxplot. Remark 4.3. We now expand on how Theorem 4.2 applies to Private Quantile and Approx Quantile. The Private Quantile algorithm only generates one quantile. So, for Private Quantile, we would be applying it m times to estimate m quantiles. Given that all three algorithms are the same with m = 1, Theorem 4.2 applies to Private Quantile with m = 1. Applying Theorem 4.2 m times with t = t/m1/2 yields the same bound that is given in Theorem 4.2. For Approx Quantile, if you inspect the Approx Quantile algorithm, it is made up of successive applications of Private Quantile at each level of a tree, with the input bounds depending on the previous levels of the tree. In that case, you can combine our bound for Private Quantile with m = 1, t = t/(log(m) + 1) and γ = γ/m. Then you can apply Lemma 3.2 of Kaplan et al. (2022) (Note their notation uses β for γ and gamma for t) and the result follows immediately. Next, we present a minimax lower bound for estimating a single quantile subject to differential privacy. Next, for a set S, let M1,ϵ,n(S) be the set of maps from Dn to M1(S) that satisfy (1). Note that M1,ϵ,n(S) is just a formal way of writing the set of all differentially private mechanisms whose output lies in S. Next, we write Tϵ M1,ϵ,n(R) to denote the set of all univariate differentially private estimators.3 For 0 < q 1, let hq(x) = x p ( log x [Φ 1 (q)]2/2)/π and Cq = argmaxx>0 hq(x). Denote the minimax risk for 3To be clear, given an element of M1,ϵ,n(R), say P., Tϵ could be the associated estimator. Given a dataset Xn with empirical measure ˆν, Tϵ(ˆν) would be a draw from PXn, or Tϵ(ˆν) PXn. differentially private quantile estimation by R(ϵ, L, r, q) = inf Tϵ M1,ϵ,n(R) supν GL,r,q E|Tϵ(ˆν) ξq,ν|. Theorem 4.4. For all n 1, q (0, 1), L, r > 0 such that r L supx>0 hq(x), it holds that C2 q L2t2 Cq samples are required for R(ϵ, L, r, q) t to hold. Theorem 4.4 gives a lower bound on the sample complexity for estimating the qth quantile from a distribution whose density is bounded below by L in a neighborhood of size r around the qth population quantile. That is, every differentially privacy quantile estimator requires at least Ω C2 q L2t2 Cq Ltϵ samples in order to have a minimax risk bounded above by t. What is important for this context, is that Theorem 4.4 implies a lower bound on the sample complexity for estimating m quantiles from a distribution whose density is bounded below by L in neighborhoods of size r around each of the m population quantiles. That is, for all q = (q1, . . . , qm), n 1, L, r > 0 with r L infj [m] Cqj, it holds that (2) samples are also required for inf Tϵ M1,ϵ,n(Rm) supν GL,r,q E Tϵ(ˆν) ξq,ν t. Applying this lower bound in conjunction with Theorem 4.2 yields that the scale and location of the proposed private boxplot are estimated optimally, up to logarithmic factors. Note that Tzamos & Vlatakis-Gkaragkounis (2020) give a similar minimax lower bound on differentially private median estimation. However, their proof does not directly extend to estimating an arbitrary quantile, since it relies on a set of uniform distributions that have the property that the median coincides with the mean. The proof of Theorem 4.4 makes use of the differentially private version of Fano s inequality (Acharya et al., 2021), it is deferred to Appendix B. Next, we show that the whiskers and the outlyingness numbers, when appropriately normalized, are weakly consistent for their population counterparts. Theorem 4.5. For < a < b < , if λn 0 and βn 1 as n , and Condition 1 holds for p = (1/4, 1/2, 3/4), then it holds that ℓ p ℓν, u p uν oℓ/n p oℓ,ν and ou/n p ou,ν. Here p denotes convergence in probability. Theorem 4.5 says if the population density is bounded below in neighborhoods of each of the population median and the population quartiles, the population density is bounded above everywhere, and [a, b] contain the quartiles, then the whiskers and outlyingness numbers will be consistent. To our knowledge, this is the first result concerning extreme, private quantiles. Together, Theorems 4.2 and 4.5 imply that, given a large enough sample size, DPBoxplot will correctly represent Differentially Private Boxplots normal distribution non-private Private Quantile unbounded Joint Exp Approx Quantile DPBoxplot skew distribution 103 104 105 uniform distribution 103 104 105 103 104 105 103 104 105 Figure 2. Average error in each metric with 95% confidence intervals between the different differentially private boxplots and their corresponding population boxplot (y-axis) with ϵ = 1 and increasing n (x-axis). The DPBoxplot algorithm is presented, along with the naive method paired with each of the algorithms Joint Exp, Approx Quantile, unbounded and Private Quantile (line color). The different generated distributions are represented row-wise, and the different metrics: location, scale, skewness and tails are presented column-wise. The non-private line is the distance between the non-private boxplot and its corresponding population boxplot. the location, scale, skewness and tail of the underlying distribution. Thus, making it statistically valid for one to use the private boxplot to describe samples in a differentially private exploratory data analysis. 5. Simulations We now assess the empirical performance of DPBoxplot. We compare DPBoxplot to the naive private boxplot and the non-private sample boxplot. The naive private boxplot is one where a single differentially private quantile algorithm is used to generate all quantiles that are used in the boxplot. Using these quantiles, the whiskers and outlyingness counts are then constructed in the same manner as that of Algorithm 3. We consider the following quantile estimation methods: Private Quantile, unbounded, Approx Quantile and Joint Exp. For all algorithms, we set a = 50 and b = 50 and λn = n 1/4. We ran the same simulations with other values of λn and found it did not alter the conclusions of the study, see Appendix C.2. For the unbounded algorithm, β was set to the default value of 1.001 for both the naive boxplot and DPBoxplot. We assess the boxplots across the four key metrics: scale, location, skewness and tails. We consider the error between each boxplot and the population boxplot. For DPBoxplot, errors in each of the components are quantified as follows: | ξ1/2 ξ1/2,ν| (location), | IQR IQR(ν)| (scale), | ℓ ℓν|+| u uν| (skewness), | oℓ noℓ,ν|+| ou nou,ν| (tails). Errors for remaining boxplots are defined analogously. Data were generated from the five distinct distributions parametrized to have mean 0 and variance 1: normal distribution (normal), skew normal distribution (skew), uniform distribution (uniform), beta distribution (beta), and an empirical distribution of 2019 NY airbnb listing prices (airbnb)4. We considered values of n and ϵ and each scenario was simulated 1000 times. (For more details, see Appendix C.1). Results from beta and airbnb distributions, and those with ϵ = 0.5 or ϵ = 5 did not add additional insights, so we also defer to Appendix C.1. We also assessed DPBoxplot s ability to compare multiple distributions simultaneously, concluding that DPBoxplot also performs well under this setting. For brevity, these results are deferred to Appendix C.3. Figure 2 provides a comprehensive summary of the simulation results. Each column of figures corresponds to a distinct boxplot component, while each row pertains to a different generating distribution. The sample size is depicted along the x-axis, while the average error is represented on the y-axis. The line color within the figures denotes the method used to generate the boxplot. 4These data were sampled without replacement. Differentially Private Boxplots Consistent with Theorems 4.2, 4.4, and 4.5, the error for all components of DPBoxplot is converging to zero as n increases in all scenarios. Further, we see that the error for DPBoxplot is similar to that of the error for non-private. This means that the sampling error is larger than that of the error attributed to privatization. On the other hand, the naive boxplots do not always exhibit the same behavior. In particular, the inconsistency of the whiskers for the naive methods based on Joint Exp, Approx Quantile, and Private Quantile is reflected through poor performance in the skew metric for the skew and uniform distributions. On the other hand, though it does well at estimating the extreme quantiles, the unbounded algorithm worse than the other algorithms for estimating scale and location. We see that DPBoxplot retains the best of both worlds. In the setting of normally distributed data at small sample sizes, DPBoxplot does perform worse than the naive methods in the skewness metric. This happens because at small sample sizes, the unbounded algorithm tends to underestimate the magnitude of the minimum and maximum of the dataset, while the other algorithms overestimate the magnitude of the minimum and maximum. Nevertheless, we can still conclude that overall, DPBoxplot exhibits consistently better behavior than the naive methods. 6. Case study We now conduct an exploratory data analysis via boxplots, within the framework of differential privacy. We consider two business inquiries, each with a privacy budget of ϵ = 1. Each inquiry consists of several visualizations. The privacy budget for each visualization is proportional to the number of boxplots in the visualization. To elaborate, out of the total budget (ϵ = 1), each visualization (and therefore, as a result of parallel composition, each boxplot) is assigned a privacy budget equal to the number of boxplots in the visualization, divided by the number of boxplots on all generated visualizations. We chose this allocation so that more budget is allotted to visualizations where the data is partitioned more times. The intuition is that if the data is partitioned more times, then each boxplot in the visualization has a sample size which is smaller.Code for replicating our visualizations has been made publicly accessible (). We analyze a dataset containing Airbnb listing prices and associated metrics within New York City (NYC) in 2019 (Kaggle, 2019). After removing listings priced above 500 US dollars (USD) and requiring minimum nights of stay fewer than 10, this dataset has n = 40738 observations and d = 4 explanatory variables of business interest. We only consider listings priced below 500 USD, and so we set a = 0 and b = 500. We address two distinct business inquiries. Inquiry 1: Do discernible patterns emerge in Airbnb listing prices across various boroughs in New York City and differing room types? The dataset encompasses five distinct boroughs within New York City, namely the Bronx, Brooklyn, Manhattan, Queens, and Staten Island, alongside three offered room types: Entire Home, Private Room, and Shared Room. We present three visualizations: boxplots of prices by borough, room type, and by borough and room type combinations. These generate 5, 3, and 15 boxplots, respectively, totaling 23. The differentially private boxplots are displayed in Figure 3, juxtaposed with non-private counterparts. Only looking at the differentially private boxplots, Figure 3 reveals that prices predominantly lie in the bottom end of the range [0,500], with a right skew and heavy right tail observed across all boroughs and rooms, except for in the Bronx, and for shared rooms. The distribution of prices in the Bronx still exhibits a right skew, but has light tails. The distribution of prices for shared rooms appears to be symmetric, with light tails. Notably, prices appear elevated for Manhattan. As expected, entire homes are priced higher than shared spaces. The right-most plot, which displays prices by both borough and room type, affirms previous observations, except shared rooms in Staten Island and the Bronx. Staten Island and the Bronx exhibits higher variability for shared rooms. It is natural to compare the patterns observed on the differentially private boxplots to those observed in the non-private boxplots. As previously mentioned, Figure 3 juxtaposes the differentially private boxplots with the non-private boxplots. Many conclusions derived from the private visualizations persist in the non-private domain. However, there are several disparities, which are outlined are as follows: In reality, both the listing prices for shared rooms and properties in the Bronx do have a heavy, right tail. The second disparity occurs in the third plot, where non-private plots does not indicate higher prices and variability for properties in Staten Island and the Bronx in shared rooms. These discrepancies can be explained by small sample sizes and our conservative choice of privacy budget. For instance, the number of shared room listings in Staten Island is only 9. Overall, the patterns observed are generally consistent between the private and non-private boxplots. The principal disparities are attributable to sample size, a factor we encourage practitioners to consider when conducting differentially private exploratory analyses. Inquiry 2: Are there observable trends in Airbnb listing prices concerning minimum nights required for reservation and the types of rooms offered? We create a new categorical variable called minimum nights, Differentially Private Boxplots Manhattan Queens Staten Island Manhattan Queens Staten Island non-private Entire home/apt Private room Shared room Entire home/apt Private room Shared room non-private Entire home/apt Private room Shared room Bronx Brooklyn Manhattan Queens Staten Island Entire home/apt Private room Shared room non-private Figure 3. Private and non-private boxplots for inquiry 1. The boxplots display listing prices against borough, room type and room type separated by borough in columns 1, 2 and 3, respectively. Private boxplots are obtained with a total aggregated budget of ϵ = 1. high low Minimum nights high low Minimum nights non-private Entire home/apt Private room Shared room Minimum nights Entire home/apt Private room Shared room non-private Figure 4. Private and non-private boxplots for inquiry 2. The boxplots display listing prices against minimum nights and room type separated by minimum nights in columns 1 and 2, respectively. Private boxplots are obtained with a total aggregated budget of ϵ = 1. in which a listing is assigned low if it has less than or equal to three minimum nights, and high otherwise. For this inquiry, we generate two visualizations: Boxplots of prices by minimum nights, and prices by combinations of minimum nights and room offered. These visualizations require 2, and 6 boxplots, respectively, totaling 8. The differentially private boxplots are displayed in Figure 4, juxtaposed with their non-private counterparts. Again, we first analyze the patterns in the differentially private boxplots, and compare to those observed in the private boxplots afterward. The differentially private boxplots indicate the emergence of a phenomenon akin to Simpson s paradox. Specifically, a preliminary examination suggests that listings requiring a higher minimum number of nights are priced more steeply than their counterparts with lower minimum requirements. However, this trend disappears when the data is divided by room type. Listings for entire rooms exhibit no significant price differential based on minimum night requirement, while private rooms slightly favor lower minimum nights in terms of price. In shared rooms the median price does not seem to be significantly different. A comparative analysis with non-private boxplots reaffirms these observations; we observe relatively consistent patterns between the private and non-private boxplots. The pri- mary visual disparity pertains to the positioning of the lower whiskers on the boxplots. This underscores the recognized challenge associated with differentially private estimation of extreme quantiles. However, this discrepancy does not materially impede the analytical value of our findings. 7. Discussion We have demonstrated that differentially private boxplots are a theoretically and practically viable standalone tool for private data analysis. Our work concerns differentially private data visualization, which, despite its impact potential, is severely underexplored. Though some work has been done in this area, a deep investigation into the practical aspects of differentially private exploratory data analysis is still needed, and a promising direction of future research. In particular, a large barrier for private exploratory analysis is optimal budget allocation for sequential and iterative workflows. Addressing this challenge is crucial to facilitate real-world adoption of differential privacy in data analysis pipelines. Code. Official implementation is available at https: //github.com/jairoadiazr/DPBoxplot. Differentially Private Boxplots Acknowledgements This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC); through grants DGECR-2023-00311 and DGECR-2022-04531, respectively. We thank the anonymous reviewers and session chair for their valuable comments and suggestions, which helped improve the quality of this work. Impact statement Our proposed methodology inherits the extensive implications, both beneficial and detrimental, of differential privacy. Differential privacy significantly enhances data confidentiality by introducing noise into datasets, which complicates the ability to link specific data points to individual identities. This greatly enhances the privacy awarded to individuals whose information is contained in such datasets. However, this method is not without its drawbacks. The primary challenge lies in the trade-off between privacy protection and data accuracy. The introduction of noise, while safeguarding privacy, may distort the data, potentially yielding inaccurate insights or conclusions. Such inaccuracies are especially problematic in contexts requiring high precision, such as policy development or clinical research, where exact data is crucial. Acharya, J., Sun, Z., and Zhang, H. Differentially private Assouad, Fano, and Le Cam. In Feldman, V., Ligett, K., and Sabato, S. (eds.), Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pp. 48 78. PMLR, 16 19 Mar 2021. URL https://proceedings.mlr.press/ v132/acharya21a.html. Acs, G., Castelluccia, C., and Chen, R. Differentially private histogram publishing through lossy compression. In 2012 IEEE 12th International Conference on Data Mining, pp. 1 10. IEEE, 2012. Alabi, D., Ben-Eliezer, O., and Chaturvedi, A. Bounded space differentially private quantiles, 2022. Asi, H. and Duchi, J. C. Near instance-optimality in differential privacy. ar Xiv preprint ar Xiv:2005.10630, 2020. Bhattacharjee, K., Chen, M., and Dasgupta, A. Privacypreserving data visualization: Reflections on the state of the art and research opportunities. Computer Graphics Forum, 39:675 692, 6 2020. ISSN 0167-7055. doi: 10. 1111/cgf.14032. URL https://onlinelibrary. wiley.com/doi/10.1111/cgf.14032. Budiu, M., Thaker, P., Gopalan, P., Wieder, U., and Zaharia, M. Overlook: Differentially private exploratory visualization for big data. Journal of Privacy and Confidentiality, 12, 7 2022. ISSN 2575-8527. doi: 10.29012/jpc.779. URL https://journalprivacyconfidentiality. org/index.php/jpc/article/view/779. Bun, M., Nissim, K., Stemmer, U., and Vadhan, S. Differentially private release and learning of threshold functions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 634 649. IEEE, 2015. Dasgupta, A. and Kosara, R. Adaptive privacy-preserving visualization using parallel coordinates. IEEE Transactions on Visualization and Computer Graphics, 17(12): 2241 2248, 2011. doi: 10.1109/TVCG.2011.163. Dasgupta, A., Chen, M., and Kosara, R. Measuring privacy and utility in privacy-preserving visualization. Computer Graphics Forum, 32:35 47, 12 2013. ISSN 0167-7055. doi: 10.1111/cgf. 12142. URL https://onlinelibrary.wiley. com/doi/10.1111/cgf.12142. Dasgupta, A., Kosara, R., and Chen, M. Guess me if you can: A visual uncertainty model for transparent evaluation of disclosure risks in privacy-preserving data visualization. In 2019 IEEE Symposium on Visualization for Cyber Security (Viz Sec), pp. 1 10, 2019. doi: 10.1109/Viz Sec48167.2019.9161608. Durfee, D. Unbounded differentially private quantile and maximum estimation. In Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 77691 77712. Curran Associates, Inc., 2023. Dwork, C. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pp. 1 19. Springer, 2008. Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265 284, 2006. Garrido, G. M., Liu, X., Matthes, F., and Song, D. Lessons learned: Surveying the practicality of differential privacy in the industry. Proceedings on Privacy Enhancing Technologies, 2023:151 170, 4 2023. ISSN 2299-0984. doi: 10.56553/ popets-2023-0045. URL https://petsymposium. org/popets/2023/popets-2023-0045.php. Gillenwater, J., Joseph, M., and Kulesza, A. Differentially private quantiles. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Differentially Private Boxplots Learning Research, pp. 3713 3722. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/ v139/gillenwater21a.html. Hay, M., Rastogi, V., Miklau, G., and Suciu, D. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow., 3(1 2):1021 1032, sep 2010. ISSN 2150-8097. doi: 10.14778/1920841. 1920970. URL https://doi.org/10.14778/ 1920841.1920970. He, X., Raval, N., and Machanavajjhala, A. A demonstration of visdpt. Proceedings of the VLDB Endowment, 9:1489 1492, 9 2016. ISSN 2150-8097. doi: 10.14778/3007263.3007291. URL https://dl.acm. org/doi/10.14778/3007263.3007291. Hongde, R., Shuo, W., and Hui, L. Differential privacy data aggregation optimizing method and application to data visualization. In 2014 IEEE Workshop on Electronics, Computer and Applications, pp. 54 58, 2014. doi: 10. 1109/IWECA.2014.6845555. Ji, Z., Lipton, Z. C., and Elkan, C. Differential privacy and machine learning: a survey and review, 2014. Kaggle. New york city airbnb open data, 2019. URL https://www. kaggle.com/datasets/dgomonov/ new-york-city-airbnb-open-data. Kaplan, H., Ligett, K., Mansour, Y., Naor, M., and Stemmer, U. Privately learning thresholds: Closing the exponential gap. In Conference on learning theory, pp. 2263 2285. PMLR, 2020. Kaplan, H., Schnapp, S., and Stemmer, U. Differentially private approximate quantiles. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 10751 10761. PMLR, 17 23 Jul 2022. Kellaris, G. and Papadopoulos, S. Practical differential privacy via grouping and smoothing. Proceedings of the VLDB Endowment, 6(5):301 312, 2013. Kulkarni, T. Answering range queries under local differential privacy. In Proceedings of the 2019 International Conference on Management of Data, pp. 1832 1834, 2019. Lalanne, C., Garivier, A., and Gribonval, R. Private statistical estimation of many quantiles. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 18399 18418. PMLR, 23 29 Jul 2023a. URL https://proceedings.mlr. press/v202/lalanne23a.html. Lalanne, C. S., Gastaud, C., Grislain, N., Garivier, A., and Gribonval, R. Private quantiles estimation in the presence of atoms, 2023b. Li, C., Hay, M., Rastogi, V., Miklau, G., and Mc Gregor, A. Optimizing linear counting queries under differential privacy. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 123 134, 2010. Liu, W., Zhang, Y., Yang, H., and Meng, Q. A survey on differential privacy for medical data analysis. Annals of Data Science, 11(2):733 747, June 2023. ISSN 2198-5812. doi: 10.1007/s40745-023-00475-3. URL http://dx. doi.org/10.1007/s40745-023-00475-3. Nanayakkara, P., Bater, J., He, X., Hullman, J., and Rogers, J. Visualizing privacy-utility tradeoffs in differentially private data releases. Proceedings on Privacy Enhancing Technologies, 2022:601 618, 4 2022. ISSN 2299-0984. doi: 10.2478/ popets-2022-0058. URL https://petsymposium. org/popets/2022/popets-2022-0058.php. Panavas, L., Crnovrsanin, T., Adams, J. L., Ullman, J., Sargavad, A., Tory, M., and Dunne, C. Investigating the visual utility of differentially private scatterplots. IEEE Transactions on Visualization and Computer Graphics, pp. 1 16, 2024. ISSN 1077-2626. doi: 10.1109/TVCG. 2023.3292391. URL https://ieeexplore.ieee. org/document/10173631/. Ramsay, K., Jagannath, A., and Chenouri, S. Differentially private multivariate medians, 2024. Smith, A. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the fortythird annual ACM symposium on Theory of computing, STOC 11. ACM, June 2011. doi: 10.1145/1993636. 1993743. URL http://dx.doi.org/10.1145/ 1993636.1993743. Tukey, J. W. et al. Exploratory data analysis, volume 2. Springer, 1977. Tzamos, C. and Vlatakis-Gkaragkounis. Optimal private median estimation under minimal distributional assumptions. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3301 3311, 2020. Xu, J., Zhang, Z., Xiao, X., Yang, Y., Yu, G., and Winslett, M. Differentially private histogram publication. The VLDB journal, 22:797 822, 2013. Differentially Private Boxplots Zhang, D., Hay, M., Miklau, G., and O connor, B. Challenges of visualizing differentially private data, 2016. Zhang, X., Chen, R., Xu, J., Meng, X., and Xie, Y. Towards accurate histogram publication under differential privacy. pp. 587 595. Society for Industrial and Applied Mathematics, 4 2014. ISBN 978-1-61197-344-0. doi: 10.1137/ 1.9781611973440.68. URL https://epubs.siam. org/doi/10.1137/1.9781611973440.68. Zhou, J., Wang, X., Wong, J. K., Wang, H., Wang, Z., Yang, X., Yan, X., Feng, H., Qu, H., Ying, H., and Chen, W. Dpviscreator: Incorporating pattern constraints to privacypreserving visualizations via differential privacy. IEEE Transactions on Visualization and Computer Graphics, pp. 1 11, 8 2022. ISSN 1077-2626. doi: 10.1109/TVCG. 2022.3209391. URL https://ieeexplore.ieee. org/document/9904449/. Differentially Private Boxplots A. A review of the boxplot The boxplot consists of a box with a line drawn through it, with two whiskers emanating from the lower and upper bounds of the box. Specific points are sometimes indicated above or below the whiskers. The central line inside the box marks the median of the dataset. The box itself is constructed from the quartiles, detailing the middle 50% of Xn. The lower (upper) whisker is constructed from the larger (smaller) of the minimum (maximum) of Xn and the lower (upper) quartile of Xn minus (plus) 1.5 times the interquartile range of Xn. Lastly, any points in Xn falling outside of the whiskers are added to the plot. See Figure 3 for an example. The boxplot describes various characteristics of Xn. The median line approximates the location of Xn. The box itself approximates the spread or scale of Xn. The skewness of Xn can be approximated as follows: A median line placed in the center of the box, along with whiskers that are similar in magnitude, suggests a symmetric distribution. A median line placed away from the center of the box or asymmetric whiskers indicates skewness. Additionally, the tails can be assessed by the number of points beyond the whiskers. Many such points signify outliers or heavy tails, highlighting data points that deviate significantly from the rest. B. Technical proofs Here for a, b R, a b (a b) denotes the maximum (minimum) of a and b. Proof of Lemma 4.1. We have that Pr | ξqn ξqn,ν| t Pr ξqn M t R M t a e ϵ n|Fˆν (x) qn| 2 dx R b a e n|Fˆν (x) qn| 2ϵ dx e ϵ nqn Before presenting the proof of Theorem 4.2, we first prove a general sample complexity bound for quantiles. For t 0, q = (q1, . . . , qm) such that 0 < q1 < . . . < qm < 1, < a < b < , and ν M1(R), define: α(t) = inf a 0 such that for all q = (q1, . . . , qm) such that 0 < q1 < . . . < qm < 1 and a ξq1,ν . . . ξqm,ν b, all t 0 and 0 < γ < 1, it holds that ξq, ν ξq,ν t, with probability 1 γ, provided that n Cm2 log(1/γ) m log m/cα(t) _ m log (1/γ) log K(b a) q1 (1 qm) α(t) α(t)ϵ . (3) Proof of Lemma B.1. Given that ξq,ˆν is a draw from the exponential mechanism, the result follows from an application of Corollary 7 of (Ramsay et al., 2024). Corollary 7 of (Ramsay et al., 2024) gives an upper bound on the sample complexity of a draw from the exponential mechanism, provided the utility function meets certain criteria. In order to apply Corollary 7 of (Ramsay et al., 2024), we must show that the following function j=1 |Fν(xj) Fν(xj 1) (qj qj 1)|1 {[x1, xm] [a, b]} (1 1 {[x1, xm] [a, b]}), and ν satisfy three conditions. First, we must show that ϕ (x, ν) has a maximum, which is easily seen by definition, ϕ (x, ν) is maximized at ξq,ν. The second requirement is that ϕ (x, ν) is K-Lipschitz, which follows by assumption. Lastly, Corollary 7 of (Ramsay et al., 2024) requires that ϕ (x, ν) is a (L, F) regular function for some L > 0 with VC(F) < , see Definition E.2. It is easy to see that ϕ (x, ν) is (m, F) regular function with VC(F) = 1. In order to apply the bound given in Corollary 7 of (Ramsay et al., 2024), we must compute the discrepancy function of ϕ , ν, which is given by: α (t) = ϕ (ξq,ν, ν) sup x ξq,ν t ϕ (x, ν). A straightforward calculation yields that α (t) = α(t). Applying Corollary Differentially Private Boxplots 7 of (Ramsay et al., 2024), yields that there exists a universal constant C > 0 such that for any t > 0 and 0 γ 1, ξq, ν ξq,ν t with probability at least 1 γ, if m2 log(1/γ) m log m/cα(t) _ m log (1/γ) log (b a) |ξq1,ν a| |b ξqm,ν| α(t)/K Define a b (a b) if there is a universal constant c > 0 such that a cb (a cb). Next, using the Lipschitz assumption and the mean value theorem, we have that |ξq1,ν a| |b ξqm,ν| [q1 (1 qm)]/K, which yields the desired result. We can now prove Theorem 4.2. Proof of Theorem 4.2. The proof is based on an application of Lemma B.1. All of the conditions of Lemma B.1 are satisfied by assumption, and so it remains to lower bound α. To this end, let kx = argmaxi [m] |xi ξqi,ν|. Using this notation, for 0 t r, the definition of α in conjunction with the mean value theorem yields that α(t) = inf a 0 and any P M1(R), let Q = {Q1, . . . , Qm} P. If for all i = j, it holds that |Tϵ(Qi) Tϵ(Qj)| 3τ, KL(Qi, Qj) βn, and TV(Qi, Qj) γ, then n = Ω log m(β 1 n (γϵ) 1) . In order to apply Corollary B.2, we first define a class of Gaussian measures which lie in GL,r,q. Recall that we consider ν with densities which satisfy: infx ξq,ν r fν(x) L. Let Pµ,σ = N(µ, σ2) and let hµ,σ denote the associated quantile function. We have that for q 1/2, it holds that inf x hµ,σ(q) r f Pµ,σ(x) (2πσ2) 1/2 exp [erf 1(2q 1) r/ (2πσ2) 1/2 exp r2/2σ2 [erf 1(2q 1)]2 . Taking σ = Cq/ 2πL gives that inf x hµ,σ(q) r f Pµ,σ(x) L/Cq exp [erf 1(2q 1)]2 πr2L2/C2 q . Now, note that erf 1(2q 1) = Φ 1 (q) / Differentially Private Boxplots which in turn implies that inf x hµ,σ(q) r f(x) L exp [Φ 1 (q)]2/2 r2L2 / π. Next, we have that L/Cq exp [Φ 1 (q)]2/2 πr2L2/C2 q L exp [Φ 1 (q)]2/2 πr2L2/C2 q Cq [Φ 1 (q)]2/2 + πr2L2/C2 q log Cq ( log Cq [Φ 1 (q)]2/2)/π. We must then have that r L Cq p ( log Cq [Φ 1(q)]2/2)/π, which holds by assumption. Define FL,r = N(µ, C2 q /2πL2): µ R . We find the values of βn, γ and τ for FL,r. Consider the values µ0, µ1, and for i=1 or i = 0, denote Qi = Pµi,C2q /2πL2. Note that for any quantile q, we have that |ξq,Q0 ξq,Q1| = |µ0 µ1|. That is, the distance between the quantiles is just the difference between the means. For some τ > 0, take any µ0, µ1 such that |µ0 µ1| 3τ. It follows that KL(Q0, Q1) τ 2L2/C2 q and TV(Q0, Q1) = 2Φ(cτL/Cq) 1 Lτ/Cq. Therefore, applying Corollary B.2 gives that C2 q L2τ 2 Cq Before proceeding with the proof of Theorem 4.5, we first prove that the unbounded algorithm estimates extreme quantiles consistently. Lemma B.3. For all non-increasing sequences βn 1 and qn 0, and all ν M(R), it holds that i if min(ν) < b, then ψqn p min(ν). ii if max(ν) > a, then ψ1 qn p max(ν). Proof. First, without loss of generality, take a = 0. Next, using the fact that V0 is a standard exponential random variable. Pr (2V0/nϵ > 1 qn) = e n(1 qn)ϵ/2. (4) In addition, letting cn = 1/ log n, the Dvoretzky Kiefer Wolfowitz inequality yields that Pr sup x R |Fˆν(x) Fν(x)| > cn 2e 2n/(log n)2. (5) It suffices to show that for all t > 0, it holds that Pr | ψ1 qn max(ν)| > t 0 as n . (A symmetric argument then implies that also, Pr | ψqn min(ν)| > t 0 as n .) Letting k be the integer such that ψ1 qn = βk n 1, we have that Pr | ψ1 qn max(ν)| > t = Pr max(ν) βk n 1 > t + Pr βk n 1 max(ν) > t , for which we can write Pr max(ν) βk n 1 > t = Pr k < log (max(ν) t 1) := Pr (k < Rt,1) , Pr βk n 1 max(ν) > t = Pr k > log (max(ν) + t + 1) := Pr (k > Rt,2) . It suffices to show that Pr (k < Rt,1) , Pr (k > Rt,2) 0 as n . To this end, first, note that if log (max(ν) + t + 1) log βn log (max(ν) t 1) log βn < 1, Differentially Private Boxplots then Pr ({k < Rt,1} {k > Rt,2}) = 1. However, we have that βn is decreasing to 1. This immediately implies that for all t > 0, there exists n0 > 1 such that for all n > n0, we have that log (max(ν) + t + 1) log βn log (max(ν) t 1) log βn > 1. Next, utilizing (5), the fact that V0 > 0 and the definition of Vi, we have that Pr (k < Rt,1) = 1 i=1 Pr Fˆν(βi n 1) + 2Vi/nϵ < 1 qn + 2V0/nϵ i=1 Pr Fν(βi n 1) + 2Vi/nϵ < 1 qn cn + 2e 2n/(log n)2 1 exp nϵ(1 qn cn Fν(βi n 1))/2 + 2e 2n/(log n)2 := 1 I + 2e 2n/(log n)2. (6) Now, using the fact that (1 + x/n)n 1 + x for all n 1 and |x| n, we have that 1 exp nϵ(1 qn cn Fν(βi n 1))/2 1 exp nϵ(1 qn cn Fν(βRt,1 n 1))/2 Rt,1 1 Rt,1 exp nϵ(1 qn cn Fν(βRt,1 n 1))/2 . Now, applying the preceding inequality in conjunction with (6) yields that Pr (k < Rt,1) Rt,1 exp nϵ(1 qn cn Fν(βRt,1 n 1))/2 + 2e 2n/(log n)2 0, as n . On the other hand, for the term Pr (k > Rt,2), utilizing (4) and (5), we have that Pr (k > Rt,2) i=1 Pr Fν(βi n 1) + 2Vi/nϵ < 1 qn/2 + cn + 2e 2n/(log n)2 + e nϵ(1 qn)/2 Pr Fν(βRt,2 n 1) + 2VRt,2/nϵ < 1 qn/2 + cn + 2e 2n/(log n)2 + e nϵ(1 qn)/2 1 1 qn/2 + cn Fν(βRt,2 n 1) 0 + 2e 2n/(log n)2 + e nϵ(1 qn)/2 0, We can now prove Theorem 4.5. Proof of Theorem 4.5. We first prove that the whiskers are consistent. Theorem 4.2 implies that ℓ1 p ℓ1,ν and Lemma B.3 implies that ℓ2 p min(ν). Continuous mapping theorem and the fact that λn 0 as n yields that ℓ p ℓν. The same argument applies to the upper whisker. For the outlyingness number, first, we have that |oℓ,ν/n oℓ| |oℓ,ˆν oℓ/n| + |oℓ,ν oℓ,ˆν| := I + II. Next, the properties of the Laplace distribution give that Pr (|oℓ,ˆν oℓ/n| t) e nϵt. Therefore I p 0. Next, using the fact that supx R fν(x) K, we have that Fν is K-Lipschitz. It follows that |oℓ,ν/n oℓ,ˆν/n| K|ℓˆν ℓν| + supx R |Fν(x) Fˆν(x)|. Next, the Dvoretzky Kiefer Wolfowitz inequality yields that supx R |Fν(x) Fˆν(x)| p 0 and the fact that the whiskers are consistent implies that K|ℓˆν ℓν| p 0. The same argument can be made for the upper outlyingness number. Differentially Private Boxplots normal distribution non-private Private Quantile unbounded Joint Exp Approx Quantile DPBoxplot skew distribution 103 104 105 uniform distribution 103 104 105 103 104 105 103 104 105 Figure 5. Average error in each metric with 95% confidence intervals between the different differentially private boxplots and their population (oracle) counterparts (y-axis) with ϵ = 0.5 and increasing n (x-axis). The DPBoxplot algorithm is presented, along with the naive method paired with each of the algorithms Joint Exp, Approx Quantile, unbounded and Private Quantile (line color). The different generated distributions are represented row-wise, and the different metrics: location, scale, skewness and tails are presented column-wise. The non-private line is the distance between the non-private boxplot and its population counterpart. C. Simulation details and additional results C.1. Single boxplot estimation This section presents more details on the simulation study described in Section 5. We simulated data from five distributions, each with mean 0 and variance 1: standard normal distribution (normal), skew normal distribution with scale parameter 20 (skew), uniform distribution with interval [ 3] (uniform), a normalized and mean-centered version of the beta distribution with α = βn = 2 (beta), and normalized and mean-centered empirical distribution generated from 2019 NY airbnb listing prices. We considered sample sizes of n {1000, 3500, 10000, 35000, 100000} and privacy budgets of ϵ {0.5, 1, 5, 10}. Each scenario was simulated 1000 times, leading to simulated data vectors Xn. For each generated dataset, we computed both the non-private boxplot B(Xn) = B(ˆν) and the population boxplot. We also calculated differentially private boxplots B(Xn, ϵ) using each of the methods. Figure 5 and Figure 6 are analogous to Figure 1 in the manuscript, except under ϵ = 0.5 and ϵ = 5, respectively. As anticipated, the performance of all methods decreases with lower ϵ values, and increases with higher ϵ values, reflecting the trade-off between privacy and accuracy. However, the variation in performance is marginal in most of the cases, and the same conclusions as those given in Section 5 hold across both privacy levels. In addition, we present Figures 7 and 8, which give a comprehensive summary of the results for DPBoxplot from this simulation study. C.2. Parameter tuning We performed simulations under similar settings as those discussed in Section C.1. We varied λn {n 1/2, n 1/4, 1}, fixing ϵ = 1 and estimated differentially private boxplots with all methods. Figure 9 summarizes the results for simulated distributions (column-wise), boxplot metrics (row-wise). Variations in the method and λn and are represented with different line colors and line styles, respectively. We observe that none of the methods are highly sensitive to the chosen parameters, except for skewness and tails. Moreover, DPBoxplot was generally the best regardless of λn. Based on this observation Differentially Private Boxplots normal distribution non-private Private Quantile unbounded Joint Exp Approx Quantile DPBoxplot skew distribution 103 104 105 uniform distribution 103 104 105 103 104 105 103 104 105 Figure 6. Average error in each metric with 95% confidence intervals between the different differentially private boxplots and their population (oracle) counterparts (y-axis) with ϵ = 5 and increasing n (x-axis). The DPBoxplot algorithm is presented, along with the naive method paired with each of the algorithms Joint Exp, Approx Quantile, unbounded and Private Quantile (line color). The different generated distributions are represented row-wise, and the different metrics: location, scale, skewness and tails are presented column-wise. The non-private line is the distance between the non-private boxplot and its population counterpart. and to account for better results in small samples, we concluded that λn = n 1/4 was a reasonable choice for all methods. C.3. Multiple boxplot estimation This section presents an extensive simulation study to assess the differentially private boxplot s ability to describe, and facilitate comparison of multiple samples. Specifically, we analyze whether the metrics (quantified by distances on location, scale, skewness and tails as in Section 5) between differentially private boxplots across distinct datasets mirror the corresponding pairwise distances observed between their population counterparts. The similitude between two distinct differentially private boxplots is anticipated to align closely with the similitude observed between their respective population counterparts, thereby facilitating analogous visual comparisons and interpretation. For the purpose of quantifying this phenomenon, we introduce the concept of relative similitude between two differentially private boxplots as follows: d B(Xn, ϵ), B(Ym, ϵ) = 1 d B(Xn, ϵ), B(Ym, ϵ) + 1 d (B(νX), B(νY )) + 1 Here, distance is related to the four metrics previously described. In order to empirically validate the consistency of our proposed approach with this behavior, we conducted Monte Carlo simulations by generating t datasets (z1, z2, , zt), and two vectors m and s of size t sampling from uniform distributions with interval [ 1, 1] and [0.5, 2], respectively. Each vector zi is size ni for i {1, 2, t} where ni is chosen randomly such that n = n1 + n2 + nt. Here, t plays the role of the number of treatments in an experiment. We replicated this simulation 1000 times for each combination of t 4, 8 and n {500, 5000, 50000} leading to datasets (X1, X2, , Xt) such that Xi = sizi + mi. We performed this simulation generating the initial t datasets using mixture of the five distributions described in the previous section such that each distribution generates the same amount of vectors on each replica. For each dataset we calculated non-private boxplots B(X1), B(X2), , B(Xt), and differentially private boxplots using each of the quantile estimation methods and for each Differentially Private Boxplots non-private 0.5 1 5 10 skew uniform airbnb beta 103 104 105 103 104 105 103 104 105 103 104 105 103 104 105 Figure 7. Average error in each metric with 95% confidence intervals between estimated differentially private boxplots and oracle boxplot (y-axis) with ϵ {0.5, 1, 5, 10} (line colors) and increasing n (x-axis); using DPBoxplot for private boxplot estimation; from all sampled distributions (column-wise) comparing location, scale, skewness and tails (row-wise). In the legend, non-private refers to the distance between the non-private boxplot and the population boxplot. ϵ {0.5, 1, 5, 10}. We then calculated pair-wise relative boxplot distances for every pair of differentially private boxplots. Figure 10 and 11 offers a detailed overview of the results obtained from various scenarios considered. Each boxplot within these plots represents the average pairwise distances observed among all differentially private boxplots generated in a specific scenario. The x-axis on each plot denotes the sample size (n). Variations in hue color correspond to different methods and different privacy budgets (ϵ), respectively. The first and last two plots in the first row correspond to relative location and relative scale, respectively. The first and last two plots in the second row correspond to relative skewness and relative outliers, respectively. For each pair of relative distance plots, the first and second visualization correspond to t = 5 and t = 10, respectively. Figure 11 uses ϵ = 1 and Figure 10 shows results with DPBoxplot method. Figure 10 shows that DPBoxplot performs consistently well across different scenarios. All other methods have poor performance in at least one scenario. In Figure 11 relative distances exhibit a diminishing trend with augmented sample sizes and ϵ, consistent with expectations illustrated by the non-private boxplot. A marginal increase in the parameter t leads to a slight augmentation in the relative distances, attributable to the partitioning of data into smaller subsets and Differentially Private Boxplots non-private Private Quantile unbounded Joint Exp Approx Quantile DPBoxplot skew uniform airbnb beta 103 104 105 103 104 105 103 104 105 103 104 105 103 104 105 Figure 8. Average error in each metric with 95% confidence intervals between estimated differentially private boxplots and oracle boxplots (y-axis) with ϵ = 1 and increasing n (x-axis); using boxplot estimation methods Joint Exp, Approx Quantile, unbounded, Private Quantile, and DPBoxplot (line colors); from all sampled distributions (column-wise) comparing location, scale, skewness and tails (row-wise). In the legend, non-private refers to the distance between the non-private boxplot and the population boxplot. consequent reduction in sample sizes; however, this impact appears to be negligible. Hence, our analysis suggests that our proposed methodology effectively maintains the visual coherence of relative similarities among diverse boxplots during the execution of multiple comparisons. C.4. Computational requirements All simulations were conducted on a single CPU and did not require significant computational resources. The execution time for the simulations presented in the paper did not exceed 24 hours. D. Differentially private quantiles Here, we include a longer review of the differentially private quantile estimation literature. We focus on the existing statistical results for such estimators, as statistical results for private quantiles is one of our main contributions. It is Differentially Private Boxplots method Private Quantile unbounded Joint Exp Approx Quantile DPBoxplot skew uniform airbnb beta 103 104 105 103 104 105 103 104 105 103 104 105 103 104 105 Figure 9. Average error in each metric with 95% confidence intervals between estimated differentially private boxplots (y-axis) and oracle boxplots with ϵ = 1 and increasing n (x-axis); using various methods (color), with varying λn (line style); from all sampled distributions (column-wise) comparing location, scale, skewness and tails (row-wise). also important to mention that range queries can be used to compute quantiles (Bun et al., 2015; Kulkarni, 2019; Kaplan et al., 2020), however, currently, these algorithms may not be very practical, so we do not cover a full review of those. First, Smith (2011) introduced Private Quantile based on the empirical CDF and the exponential mechanism, and Differentially Private Boxplots relative location Approx Quantile DPBoxplot Joint Exp Private Quantile unbounded relative scale t = 5 t = 10 500 5000 50000 n relative skewness 500 5000 50000 n 500 5000 50000 n relative tails 500 5000 50000 n Figure 10. Average pair-wise relative boxplot distances (y-axis) between multiple sample boxplots in terms of relative location, scale, skewness and outliers, with ϵ = 1 using different boxplot estimations (color), n {500, 5000, 50000} (x-axis) and t {5, 10} (columns). Legend non-private is the relative similitude of the non-private boxplots. relative location 0.5 1 5 10 non-private relative scale t = 5 t = 10 500 5000 50000 n relative skewness 500 5000 50000 n 500 5000 50000 n relative tails 500 5000 50000 n Figure 11. Average pair-wise relative boxplot distances (y-axis) between multiple private boxplots in terms of relative location, scale, skewness and outliers, under varying conditions of ϵ {0.5, 1, 5, 10} (color), n {500, 5000, 50000} (x-axis); using DPBoxplot quantile estimation and t {5, 10} (columns). Legend non-private is the relative similitude between the non-private boxplots. present a finite sample accuracy guarantee, under the assumption that the population distribution is close to the normal distribution. An extension of this algorithm is Joint Exp, which can be used to estimate multiple quantiles (Gillenwater Differentially Private Boxplots et al., 2021). Later, Lalanne et al. (2023b) and Lalanne et al. (2023a) considered the statistical properties of Joint Exp. Lalanne et al. (2023a) show various concentration results, assuming that the population density has bounded support and is positive in a neighbourhood of the quantiles being estimated. Lalanne et al. (2023b) show that Joint Exp is consistent when the population distribution is continuous, and provide a modified algorithm which can be consistent for population distributions with atoms. This algorithm could also be used to generate the box in the boxplot, if atoms are suspected. Earlier, Asi & Duchi (2020) provided an instance optimal algorithm for the median, based on the inverse sensitivity mechanism. We do not consider this algorithm here because it was shown by Lalanne et al. (2023b) that the algorithm is very similar to Private Quantile. Similarly, a multiple quantile version based on the inverse sensitivity mechanism is similar to Joint Exp (Lalanne et al., 2023b). Tzamos & Vlatakis-Gkaragkounis (2020) also consider private median estimation, which could be used for private quantile estimation, however, their algorithm takes n4 time, which may be slow in practice, so we did not consider it here. However, we see no reason it would perform poorly to generate the median line in the box. Recently, (Alabi et al., 2022) present a bounded space quantile algorithm, with some concentration bounds that condition on the dataset at hand, and require the sample space to be finite. They also present some statistical guarantees under the assumption of normality. On a related note, some general works imply that no differentially private quantile estimate can have finite sample complexity without making distributional assumptions, .e.g, (Bun et al., 2015). Recently, Durfee (2023) present the unbounded algorithm for quantile estimation, which is meant to estimate extreme quantiles well. They do not provide statistical guarantees. Recently, Kaplan et al. (2022) introduce a clever algorithm for estimating multiple quantiles from an existing quantile algorithm, where the statistical error has minimal dependence on the number of quantiles. We call this algorithm paired with Private Quantile the Approx Quantile. We build on this literature by providing some new statistical results concerning Joint Exp, Private Quantile, Approx Quantile and unbounded. Specifically, we show consistency of the unbounded algorithm under general distributional assumptions, a lower bound on differentiall private quantile estimation for a large class of distributions, and that Joint Exp, Private Quantile, and Approx Quantile provide minimax optimal, single quantile estimates. We also show that Joint Exp, Private Quantile, and Approx Quantile are inconsistent for extreme quantiles. E. Auxiliary results Lemma E.1. For all n 1, a, c R, we have that n a log n + c if n 2(a log a + c) a. Proof. First, note that h(n) = n a log n c is increasing for n a. Now, for n a we can use the fact that x log x x/2 for x 1 to get h(n) 0 whenever n 2(a log a + c). Let B denote the space of Borel functions from Rd to [0, 1]. For a family of functions, F B, define a pseudometric on M1(Rd), d F(µ, ν) = supg F | R gd(µ ν)|, where µ, ν M1(Rd). Definition E.2 ((Ramsay et al., 2024)). We say that ϕ(x, ν) is (K, F)-regular if there exists a class of functions F B such that ϕ(x, ) is K-Lipschitz with respect to the F-pseudometric uniformly in x, i.e., for all µ, ν M1(Rd) sup x |ϕ(x, µ) ϕ(x, ν)| Kd F(µ, ν).