# differentially_private_heatmaps__641b55b4.pdf Differentially Private Heatmaps Badih Ghazi, Junfeng He, Kai Kohlhoff, Ravi Kumar Pasin Manurangsi, Vidhya Navalpakkam, Nachiappan Valliappan Google Research, Mountain View, CA {badihghazi, ravi.k53}@gmail.com {junfenghe,kohlhoff,pasin,vidhyan,nac}@google.com We consider the task of producing heatmaps from users aggregated data while protecting their privacy. We give a differentially private (DP) algorithm for this task and demonstrate its advantages over previous algorithms on real-world datasets. Our core algorithmic primitive is a DP procedure that takes in a set of distributions and produces an output that is close in Earth Mover s Distance to the average of the inputs. We prove theoretical bounds on the error of our algorithm under a certain sparsity assumption and that these are near-optimal. 1 Introduction Recently, differential privacy (DP) (Dwork et al. 2006b,a) has emerged as a strong notion of user privacy for data aggregation and machine learning, with practical deployments including the 2022 US Census (Abowd 2018), in industry (Erlingsson, Pihur, and Korolova 2014; Shankland 2014; Greenberg 2016; Apple Differential Privacy Team 2017; Ding, Kulkarni, and Yekhanin 2017) and in popular machine learning libraries (Radebaugh and Erlingsson 2019; Testuggine and Mironov 2020). Over the last few years, DP algorithms have been developed for several analytic tasks involving aggregation of user data. One of the basic data aggregation tools is a heatmap. Heatmaps are popular for visualizing aggregated data in two or higher dimensions. They are widely used in many fields including computer vision and image processing, spatial data analysis, bioinformatics, etc. Many of these applications involve protecting the privacy of user data. For example, heatmaps for gaze or gene microdata (Liu et al. 2019; Steil et al. 2019) would be based on data from individuals that would be considered private. Similarly, a heatmap of popular locations in a geographic area will be based on user location check-ins, which are sensitive. Motivated by such applications, in this paper, we present an efficient DP algorithm for computing heatmaps with provable guarantees, and evaluate it empirically. At the core of our algorithm is a primitive solving the following basic task: how to privately aggregate sparse input vectors with a small error as measured by the Earth Mover s Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Distance (EMD)? While closely related to heatmaps, the EMD measure is of independent interest: it was originally proposed for computer vision tasks (Rubner, Tomasi, and Guibas 2000) since it matches perceptual similarity better than other measures such as ℓ1, ℓ2, or KLdivergence (Stricker and Orengo 1995; Levina and Bickel 2001; Wang and Guibas 2012). It is also well-suited for spatial data analysis since it takes the underlying metric space into account and considers neighboring bins. EMD is used in spatial analysis (Kranstauber, Smolla, and Safi 2017), human mobility (Isaacman et al. 2012), image retrieval (Rubner, Tomasi, and Guibas 1998; Puzicha et al. 1999), face recognition (Xu, Yan, and Luo 2008), visual tracking (Zhao, Yang, and Tao 2008), shape matching (Grauman and Darrell 2004), etc. For the task of sparse aggregation under EMD, we give an efficient algorithm with asymptotically tight error. We next describe our results in more detail. 1.1 Our Results We consider the setting where each user i holds a probability distribution pi over points in [0, 1)2, and the goal is to compute the heatmap of the average of these probabilities, i.e., 1 n Pn i=1 pi. We give an ε-DP algorithm for this task, establish its theoretical guarantees, and provide empirical evaluations of its performance. (For definitions, see Section 2.) Sparse Aggregation under EMD. At the heart of our approach is the study of aggregation under EMD1, where we would like to output the estimate of 1 n Pn i=1 pi with the error measured in EMD. There are two main reasons why we consider EMD for the error measure. First, a bound on the EMD to the average distribution implies bounds on several metrics commonly used in evaluating heatmaps, including the KL-divergence, ℓ1 distance, and EMD itself. Second, while it is possible to obtain DP aggregation algorithms with bounded EMD error, as we will discuss below, any DP aggregation algorithm must suffer errors under other metrics, including KL-divergence or ℓ1 distance, that grow with the resolution2, rendering them impractical when the number of users is small compared to the resolution. 1For a formal definition of EMD, please see Section 2.1. 2Specifically, it follows from previous work (Dwork et al. 2015) that, if we consider the ℓ1 distance or KL-divergence for grid and n Oε( ), then the error must be Ω(1). The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23) When the distributions pi s are arbitrary, we show that a simple ε-DP algorithm yields a guarantee of Oε(1/ n) on EMD, and that this bound is essentially optimal. While this is already a reasonable bound, we can improve on it by exploiting a property that is commonly present in distributions used for aggregations: sparsity (Cormode et al. 2012a). Following the literature on compressed sensing (Indyk and Price 2011; Backurs et al. 2016), we define our approximation guarantee for the sparse EMD aggregation problem with respect to the best k-sparse distribution3 that approximates the average a := 1 n Pn i=1 pi under EMD. More formally, we say that an output distribution ˆa is a (λ, κ)- approximation for sparse EMD aggregation if EMD (ˆa, a) λ mink-sparse a EMD (a , a) + κ, where λ, κ > 0 denote the multiplicative approximation ratio and additive error respectively. Our main algorithmic contribution is in showing that under such a sparse approximation notion, we can achieve an error of only Oε( k/n) and that this is tight.4 Theorem 1.1 (Informal). There exists an ε-DP algorithm that, for any constant λ (0, 1), can output a (λ, Oε( k/n))-approximation for sparse EMD aggregation w.p. 0.99. Furthermore, no ε-DP algorithm can output a (λ, oε( k/n))-approximate solution w.p. 0.1. Due to a known connection between sparse EMD aggregation and k-median clustering on the plane (Indyk and Price 2011; Backurs et al. 2016), our result also yields an improved DP algorithm for the latter. Due to space constraints, we omit the formal statement of our k-median results. Experimental results. We test our algorithm on both realworld location datasets and synthetic datasets. The results demonstrate its practicality even for moderate values of ε [0.5, 5] and a number of users equal to 200. Furthermore, we compare our algorithm with simple baselines; under popular metrics for heatmaps, our results demonstrate significant improvements on these regimes of parameters. 1.2 Overview of Techniques At a high level, our algorithm is largely inspired by the work of Indyk and Price (2011) on compressed sensing under EMD. Roughly speaking, in compressed sensing, there is an underlying vector x that is known to be well-approximated by a sparse vector; we have to provide a matrix A such that, when we observe the measurements Ax, we can reconstruct x that is close to x (under a certain metric). This can of course be done trivially by taking A to, e.g., be the identity matrix. Thus, the objective is to perform this recovery task using as few measurements (i.e., number of rows of A) as possible. There is a rich literature on compressive sensing; most relevant to our work are the prior papers studying compressive sensing with EMD, in particular, Indyk and Price (2011) and Backurs et al. (2016). 3A distribution is k-sparse if it is non-zero on at most k points. 4Note that the output ˆa need not be k-sparse. This is the reason why the approximation ratio λ can be less than one. Indyk and Price (2011) presented an elegant framework for reducing the compressed sensing problem under EMD to one under ℓ1, which is well-studied (see, e.g., Berinde et al. 2008; Berinde, Indyk, and Ruzic 2008; Indyk and Ruzic 2008; Berinde and Indyk 2009). Their reduction centers around finding a linear transformation with certain properties. Once such a transformation is specified, the algorithm proceeds (roughly) as follows: transform the input x, run the compressed sensing scheme for ℓ1, and invert the transformation to get x . Note that the number of measurements required is that of the ℓ1 compressed sensing scheme. One can try to use the Indyk Price scheme for DP aggregation by viewing the hidden vector x as the sum Pn i=1 pi, and then adding Laplace noise to each measurement to ensure privacy. Although they did not analyze their guarantees for noisy measurements, one can follow the robustness of known ℓ1 compressed sensing schemes to analyze the error. Unfortunately, since the error will scale according to the ℓ1 norm of the noise vector and the noise vector consists of O(k log(n/k)) entries, this approach only provides an error guarantee of O(k (poly log n)/n). To overcome this, we observe that, while compressed sensing and DP aggregation seem similar, they have different goals: the former aims to minimize the number of measurements whereas the latter aims to minimize the error due to the noise added (irrespective of the number of measurements). With this in mind, we proceed by using the Indyk Price framework but without compressing, i.e., we simply measure the entire transformation. Even with this, the noise added to achieve DP is still too large and makes the error dependent on log n. As a final step, to get rid of this factor we carefully select a different noise magnitude for each measurement, which allows us to finally achieve the O( k) error as desired. The details are presented in Section 3. Our lower bound follows the packing framework of Hardt and Talwar (2010). Specifically, we construct a set of ksparse distributions whose pairwise EMDs are at least Ω(1/ k). The construction is based on an ℓ1 packing of the k grid, which gives a set of size 2Ω(k). It then immediately follows from Hardt and Talwar (2010) that the error must be at least Ωε( k/n) with probability 0.9. 1.3 Related Work and Discussion In a concurrent and independent work, Bagdasaryan et al. (2022) also study the private heatmaps problem. However, our work differs from theirs in three aspects: (i) they do not formulate the problem in terms of EMD, (ii) their work does not provide any formal utility guarantees unlike ours, (iii) their emphasis is on communication efficiency in distributed/federated setting whereas our focus is more general. Our DP EMD sparse aggregation algorithm bears highlevel similarity to known algorithms for DP hierarchical histograms (see, e.g., Cormode et al. 2012b; Qardaji, Yang, and Li 2013): all algorithms may be viewed as traversing the grid in a top-down manner, starting with larger subgrids and moving on to smaller ones, where a noise is added to the measurement corresponding to each subgrid. The differences between the algorithms are in the amount of noise added to each step and how the noisy measurement is used to reconstruct the final output. Our choices of the noise amount and the Indyk Price reconstruction algorithm are crucial to achieve the optimal EMD error bound stated in Theorem 1.1. There are also DP hierarchical histogram algorithms that do not fit into the above outline, such as the Priv Tree algorithm (Zhang, Xiao, and Xie 2016). An advantage of our approach is that the only aggregation primitive required is the Laplace mechanism; therefore, while we focus on the central model of DP (where the analyzer can see the raw input and only the output is required to be DP), our algorithm extends naturally to distributed models that can implement the Laplace mechanism, including the secure aggregation model and the shuffle model (Balle et al. 2020; Ghazi et al. 2020). On the other hand, algorithms such as Priv Tree that use more complicated primitives cannot be easily implemented in these models. 2 Notation and Preliminaries For N N {0}, we write [N] to denote {0, . . . , N}. Let G be the set of ( ) grid points in [0, 1)2; specifically, G = {(i/ , j/ ) | i, j [ 1]}. For notational convenience, we assume throughout that = 2ℓfor some ℓ N. For an index set I, we view p RI as a vector indexed by I and we write p(i) to denote the value of its ith coordinate; this notation extends naturally to the set S I of coordinates, for which we let p(S) := P i S p(i). Furthermore, we use p|S to denote the restriction of p to S; more formally, p|S(i) = p(i) if i S and p|S(i) = 0 otherwise. We also write p| S as a shorthand for p p|S, i.e., the restriction of p to the complement of S. We use supp(p) to denote the set of non-zero coordinates of vector p. A vector is said to be k-sparse if its support is of size at most k. Recall that the ℓ1-norm of a vector p RI is p 1 := P i I |p(i)|. 2.1 Earth Mover s Distance (EMD) Given two non-negative vectors p, q RG 0 such that p 1 = q 1, their Earth Mover s Distance (EMD) is EMD(p, q) := minγ P x G P y G γ(x, y) x y 1, where the minimum is over γ RG G 0 whose marginals are p and q. (I.e., for all x G , P y G γ(x, y) = p(x) and, for all y G , P x G γ(x, y) = q(y).) We define the EMD norm of a vector w RG by w EMD := min p,q RG 0 p q+r=w, p 1= q 1 EMD(p, q) + α r , where α = 2 is the diameter of our space [0, 1) [0, 1). The following simple lemma will be useful when dealing with unnormalized vs normalized vectors. Lemma 2.1. Suppose that s,ˆs RG 0 are such that s 1 = n and s ˆs EMD n/2. Let a = s/ s 1 and ˆa = ˆs/ ˆs 1. Then, we have a ˆa EMD 4 s ˆs EMD/n. Proof. Let ζ = s ˆs EMD; observe that | s 1 ˆs 1| ζ. As a result, we have ˆs 1 [n ζ, n + ζ]. Thus, ˆs/n ˆa EMD ˆs EMD 1 n 1 ˆs 1 (n + ζ) 1 n 1 n ζ As a result, from the triangle inequality, we have a ˆa EMD a ˆs/n EMD + ˆs/n ˆa EMD 2.2 Differential Privacy Two input datasets X, X are neighbors if X results from adding or removing a single user s data from X. In our setting, each user i s data is a distribution pi over G . Definition 2.1 (Differential Privacy; Dwork et al. (2006b)). A mechanism M is said to be ε-DP iff, for every set O of outputs and every pair X, X of neighboring datasets, Pr[M(X) O] eε Pr[M(X ) O]. For a vector-valued function f, its ℓ1-sensitivity, denoted by S1(f), is defined as maxneighbors X,X f(X) f(X ) 1. Definition 2.2 (Laplace Mechanism). The Laplace mechanism with parameter b > 0 adds an independent noise drawn from the Laplace distribution Lap(b) to each coordinate of a vector-valued function f. Lemma 2.2 (Dwork et al. (2006b)). The Laplace mechanism with parameter S1(f)/ε is ε-DP. 2.3 Heatmaps Given p RG 0 , its associated heatmap with Gaussian filter variance σ2 is defined as Hσ p(x, y) = P (x ,y ) G e (x x )2+(y y )2 2σ2 Z(x ,y ) p(x , y ) for all (x, y) G , where Z(x , y ) := P (x ,y ) G e (x x )2+(y y )2 2σ2 is the normalization factor. In the heatmap aggregation problem over n users, each user i has a probability distribution pi over G . The goal is to output an estimate of the aggregated heatmap Hσ a where a = 1 n P i [n] pi. 3 Algorithm In this section, we describe our private sparse EMD aggregation algorithm and prove our main result. Theorem 3.1. For any ε > 0 and λ (0, 1), there is an εDP algorithm that can, w.p. 0.99, output a λ, O k λεn - approximation for the k-sparse EMD aggregation problem. 3.1 Pyramidal Transform As alluded to in Section 1, we use a linear transformation from (Indyk and Price 2011). This linear transformation is the so-called (scaled) pyramidal transform, whose variant is also often used in (metric) embedding of EMD to ℓ1 (Charikar 2002; Indyk and Thaper 2003). Roughly speaking, the transform represents a hierarchical partitioning of [0, 1)2 into subgrids, where a subgrid at a level is divided into four equal subgrids at the next level. The (scaled) pyramidal transform has one row corresponding to each subgrid; Algorithm 1: DPSPARSEEMDAGG 1: Input: distributions p1, . . . , pn on G 2: Parameters: ε1, . . . , εℓ> 0, w N 3: s Pn i=1 pi 4: for i = 0, . . . , ℓdo 5: νi Lap(1/εi) mi 2i (Pis + νi) 7: end for 8: y [y 0 y ℓ] 9: ˆs RECONSTRUCT(y ; w) 10: return ˆa := ˆs/ ˆs Algorithm 2: RECONSTRUCT 1: Input: noisy measurements y R S 2: Parameters: w N 3: S0 C1 4: for i = 1, . . . , ℓdo 5: Ti children(Si 1) 6: Si the set of min{w, |Ti|} coordinates in Ti with maximum values in y 7: end for 8: S S i [ℓ] Si and ˆy y |S 9: return ˆs arg mins 0 ˆy Ps 1 the row is equal to the indicator vector of the subgrid scaled by its side length. These are formalized below. Definition 3.1. For i N {0}, we let C2i denote the set of level i grid cells defined as C2i := {[a, a + 2 i) [b, b + 2 i) | (a, b) G2i}; let mi := |C2i|. For i [ℓ], the level-i grid partition map is defined as the matrix Pi {0, 1}C2i G where Pi(c, p) = 1 iff p c. The (scaled) pyramidal transform is the matrix P R Sℓ i=0 C2i G defined by P := P 0 2 1P 1 . . . 2 ℓP ℓ . 3.2 The Algorithm Our algorithm for sparse EMD aggregation consists of two components. The first component (Algorithm 1) aggregates the input distributions (Line 3) and applies the pyramidal transform to the aggregate, adding different amounts of Laplace noise for different levels of the grid (Lines 5, 6). (The parameters ε1, . . . , εℓ, which govern the amount of Laplace noise, will be specified in the next subsection.) The second component (Algorithm 2) takes these noisy measurements for every level of the grid and reconstructs the solution by first recovering the ℓ1 solution (Line 8) and then the EMD solution using a linear program (Line 9). We stress that our algorithm is similar to that of Indyk and Price (2011) except for two points: first, we add noise to the measurements and, second, we are not doing any compression in contrast to (Indyk and Price 2011), which takes a wide matrix A for ℓ1 recovery and multiplies it with Ps. 3.3 Analysis Following the framework of Indyk and Price (2011), our analysis proceeds in two stages. We first show that the recovered ˆy is close, in the ℓ1 metric, to the true value of Ps. Then, we use the properties of P to argue that the output ˆs is close, in EMD, to s. Since we are adding noise to our measurement, we need to extend the work of Indyk and Price (2011) to be robust to noise. Finally, we set the privacy parameters ε1, . . . , εℓto finish our proof of Theorem 3.1. Let us now briefly demystify the additive error bound Oε,λ( k) that we end up with for ˆs (which ultimately gives the Oε,λ( k/n) error bound for the normalized ˆa). We will select w = Oλ(k) so as to have an additive error of Oε( w). At a high level, each noise 1 2i νi(t) added to a queried term yi(t) for t Ti permeates to an error of the same order. For simplicity, assume for the moment that |νi(t)| = O(1/εi). Now, notice that if we are at level i < log w, then |Ti| = |C2i| = 22i and thus the total error contribution of this level is O(2i/εi). On the other hand, for a level i log w, we will have |Ti| = w and the error contribution is O w 2iεi . Now, when i = log w O(1), these error terms are O( w/εi) and thus we should set εi = Ω(1) to get the desired bound. However, in terms of |i log w|, these error terms become exponentially smaller, i.e., O w 2|i log w|εi . This leads to the natural choices of εi we use, which is to make it proportional to γ|i log w| for some constant γ > 0.5. This indeed leads to the desired Oε( w) = Oε,λ( Phase I: ℓ1 Recovery. We will now analyze the ℓ1 recovery guarantee of ˆy. Our recovery algorithm, which is an adaptation of Indyk and Price (2011), does not work for general hidden vectors. However, it works well for those that follow a certain tree-like structure , formalized below. Definition 3.2 (Indyk and Price (2011)). For i 1, a grid cell c C2i is said to be a child of grid cell c C2i 1 if c c . This forms a tree rooted at [0, 1) [0, 1) C0 where every internal node has exactly four children. We let Tw denote the set of all trees such that the number of nodes at each level is at most w. Let Mw denote the set of y = [y0 yℓ] where yi R C2i 0 such that 1. supp(y) T for some tree T Tw. 2. For all i [ℓ 1], p C2i, the following holds: y(p) 2 y(children(p)). Under the above notion, we can adapt the ℓ1 recovery analysis of Indyk and Price (2011) in the no-noise case to our regime, where the noise shows up as an error: Lemma 3.2. Let y arg miny Mw Ps y 1 where supp(y ) T for some T Tw; let T i denote T C2i and Vi = T i \ Si for all i [ℓ]. Then, ˆy on Line 8 of RECONSTRUCT satisfies ˆy Ps 1 3 y Ps 1 + O P i [ℓ] 1 2i νi|Vi Si 1 . Proof. For every q T \S, let R(q) be the highest ancestor of q that does not belong to S. We have y | S 1 = X q T \S y (q) = X q R 1(p) y (q) p Vi 2y (p) = 2 X i [ℓ] y (Vi), (1) where ( ) follows from the second property of Mw. Next, consider the algorithm at the ith iteration and p Vi. Since p was not picked, the following must hold for all q Si \ T i : y (p) y (q). Observe also that from |Si| = max{w, |C2i|} and |T i | max{w, |C2i|}, we also have |Si \ T i | |T i \ Si| = |Vi|. Thus, we get y (Vi) y (Si \ T i ). (2) From this and (1), we can further derive y | S 1 i [ℓ] (y (Vi) Ps(Vi)) + Ps(Si \ T i ) i [ℓ] Ps(Vi) Ps((Si \ T i )) ( ) 2 y Ps 1 + 2 i [ℓ] Ps(Vi) Ps((Si \ T i )) ( ) 2 y Ps 1 + 2 νi|Vi (Si\T i ) i [ℓ] y (Vi) y ((Si \ T i )) (2) 2 y Ps 1 + 2 νi|Vi (Si\T i ) where ( ) follows from supp(y ) T = S i [ℓ] T i and ( ) follows from how y is calculated. Finally, from ˆy = y |S and how each entry of y is computed, we have ˆy Ps 1 = y |S Ps|S 1 + Ps S 1 1 2i νi|Si 1 + y | S 1 + y | S Ps| S 1 1 2i νi|Si 1 2 y Ps 1 + 2 νi|Vi (Si\T i ) 3 y Ps 1 + 3 1 2i νi|Vi Si 1 Phase II: From ℓ1 to EMD. We now proceed to bound the EMD error. The main lemma is stated below. Lemma 3.3. Let the notation be as in Lemma 3.2. For any η (0, 1), by setting w = O(k/(η )2), the output ˆs of RECONSTRUCT satisfies s s EMD η mink-sparse s s s EMD + O P i [ℓ] 1 2i νi|Vi Si 1 . Similar to the proof of Indyk and Price (2011), our proof of Lemma 3.3 converts the recovery guarantee under ℓ1 metric to that under EMD; to do this, we need the following two statements from prior work. Lemma 3.4 (Model-Alignment of EMD with Mw (Indyk and Price 2011)). For any x RG 0 , k N and η (0, 1), there exist w = O(k/η2) and y Mw such that y Ps 1 η mink-sparse x x x EMD. Lemma 3.5 (EMD-to-ℓ1 Expansion (Indyk and Thaper 2003)). For all z RG , z EMD Pz 1. Proof of Lemma 3.3. Recall that we use s to denote the true sum Pn i=1 pi. We set η = η /6 and let w = O(k/η2) = O(k/(η )2) be as in Lemma 3.4, which ensures that there exists y Mw with y Ps 1 η min k-sparse s s s EMD. (4) Thus, using Lemma 3.5, we can derive s s EMD P(s s ) 1 (triangle inequality) ˆy Ps 1 + ˆy Ps 1 (how s is computed) 2 ˆy Ps 1 (Lemma 3.2) 6 y Ps 1 1 2i νi|Vi Si 1 (4) η min k-sparse s s s EMD 1 2i νi|Vi Si 1 Finishing the Proof. We now select the privacy parameters and complete the proof of Theorem 3.1. Proof of Theorem 3.1. Let w = O(k/(η )2) be as in Lemma 3.3 with η = λ/4, and let q = log2 w . Let γ = 0.8 be the decay rate for εi s, and let Z = Pℓ i=0 γ|i q| O(1) be the normalization factor. We run Algorithm 1 with εi = γ|i q| ε/Z. Privacy Analysis. We can view the ith iteration of the algorithm as releasing 2iy i = Pis + νi. Since each pi has ℓ1 norm at most one, its sensitivity with respect to Pis is at most one; thus, Lemma 2.2 implies that the ith iteration is εi DP. As a result, by basic composition theorem of DP, we can conclude that releasing all of y 0, . . . , y ℓis (ε0+ +εℓ)-DP. Since the reconstruction is simply a post-processing step, the post-processing property of DP ensures that Algorithm 1 is (ε0 + +εℓ)-DP. Finally, observe that by definition of εi s, we have ε0 + + εℓ= ε as desired. Figure 1: Metrics averaged over 60 runs when varying ε. Shaded areas indicate 95% confidence interval. Utility Analysis. Applying Lemma 3.3, we can conclude that s s EMD η mink-sparse s s s EMD+ξ, where ξ = O P i [ℓ] 1 2i νi|Vi Si 1 . Recall that each of Vi, Si s is of size at most max{w, 22i} (because of definition of Mw and the fact that mi = |C2i| = 22i), and that each entry of νi is sampled from Lap(1/εi). As a result, we have 1 2i max{w, 22i} 1 i {q+1,...,ℓ} k λ2 1 2iγi qε where the last bound follows from our choice of γ > 0.5 and q = log2 w . Hence, by Markov s inequality, w.p. 0.99, we have s s EMD η mink-sparse s s s EMD + 100E[ξ] = η mink-sparse s s s EMD+O k λε . Finally, applying Lemma 2.1 concludes the proof. 4 Experiments In this section, we study the performance of our algorithms on real-world datasets. Implementation Details. We implement Algorithm 1 with a minor modification: we do not measure at the level i < q = log w . In other words, we start directly at the lowest level for which the number of grid cells is at most w. It is possible to adjust the proof to show that, even with this modification, the error remains Oε( k). Apart from this, the algorithm is exactly the same as presented earlier. We note that the linear program at the end of Algorithm 2 can be formulated so that the number of variables is only O(wℓ); the reason is that we only need one variable per cell that is left out at each stage. This allows us to solve it efficiently even when the resolution = 2ℓis large. As for our parameters, we use the decay rate γ = 1/ 2, which is obtained from minimizing the second error term in the proof of Theorem 3.1 as ℓ 5. We use w = 20 in our experiments, which turns out to work well already for datasets we consider. We refrain from tuning parameters further since a privacy analysis of the tuning step has to be taken into account if we want to be completely rigorous. (See, e.g., (Liu and Talwar 2019) for a formal treatment.) Datasets. We use two datasets available at snap.stanford. edu to generate the input distribution for users. The first dataset6, called GOWALLA, consists of location check-ins by users of the location-based social network Gowalla. Each record consists of, among other things, an anonymized user id together with the latitude (lat) and longitude (lon) of the check-in and a timestamp. We filtered this dataset to consider only check-ins roughly in the continental US (i.e., lon ( 135, 60) and lat (0, 50)) for the month of January 2010; this resulted in 196,071 check-ins corresponding to 10,196 users. The second dataset7, called BRIGHTKITE, also contains check-ins from a different and now defunct location-based social network Brightkite; each record is similar to GOWALLA. Once again, we filtered this dataset to consider only check-ins in the continental US for the months of November and December 2008; this resulted in 304,608 check-ins corresponding to 10,177 users. For each of these datasets, we partition the whole area into a 300 300 grid. We then took the top 30 cells (in both datasets combined) that have the most check-ins. (Each of the 30 cells is mostly around some city like New York, Austin, etc, and has check-ins from at least 200 unique users). We then consider each cell, partition into subgrids and snap each check-in to one of these subgrids. Metrics. To evaluate the quality of an output heatmap ˆh compared to the true heatmap h, we use the following commonly used metrics: Similarity, Pearson coefficient, KLdivergence, and EMD. (See, e.g., (Bylinskii et al. 2019) for detailed discussions of these metrics.) We note that the first two metrics should increase as ˆh, h are more similar, 5When w (and thus q) is fixed, the second error term is proportional to Z Pℓ q 1 i=0 1 (2γ)i which converges to 1 (1 γ)(1 0.5/γ) as ℓ . The latter term is minimized when γ = 1/ 2 6Available at http://snap.stanford.edu/data/loc-Gowalla.html 7Available at http://snap.stanford.edu/data/loc-Brightkite.html Figure 2: Example visualization of different algorithms for ε = 1 (top) and ε = 5 (bottom). The algorithms from left to right are: original heatmap (no privacy), baseline, baseline with top 0.01% and our algorithm. whereas the latter two should decrease. Baselines. We consider as a baseline an algorithm recently proposed in (Liu et al. 2019),8 where we simply add Laplace noise to each subgrid cell of the sum s, zero out any negative cells, and produce the heatmap from this noisy aggregate. We also consider a thresholding variant of this baseline that is more suited to sparse data: only keep top t% of the cell values after noising (and zero out the rest). 4.1 Results In the first set of experiments, we fix = 256. For each ε {0.1, 0.5, 1, 2, 5, 10}, we run our algorithms together with the baseline and its variants on all 30 cells, with 2 trials for each cell. In each trial, we sample a set of 200 users and run all the algorithms; we then compute the distance metrics between the true heatmap and the estimated heatmap. The average of these metrics over the 60 runs is presented in Figure 1, together with the 95% confidence interval. As can be seen in the figure, the baseline has rather poor performance across all metrics, even for large ε = 10. We experiment with several values of t for the thresholding variant, which yields a significant improvement. Despite this, we still observe an advantage of our algorithm consistently across all metrics. These improvements are especially significant when ε is not too large or too small (i.e., 0.2 ε 5). In the second set of experiments, we study the effect of varying the number n of users. By fixing a single cell (with > 500 users) and ε, we sweep n 8We remark that (Liu et al. 2019) also propose using the Gaussian mechanism. However, this algorithm does not satisfy ε-DP. Moreover, even when considering (ε, δ)-DP for moderate value of δ (e.g., δ = 10 3), the Gaussian mechanism will still add more noise in expectation than the Laplace mechanism. {50, 100, 200, 300, 400, 500} users. For each value of n, we run 10 trials and average their results. As predicted by theory, our algorithms and the original baseline perform better as n increases. However, the behavior of the thresholding variants of the baseline are less predictable, and sometimes the performance degrades with a larger number of users. It seems plausible that a larger number of users cause an increase in the sparsity, which after some point makes the simple thresholding approach unsuited for the data. We also run another set of experiments where we fix a single cell and ε, and vary the resolution {64, 128, 256}. In agreement with theory, our algorithm s utility remains nearly constant for the entire range of . On the other hand, the original baseline suffers across all metrics as increases. The thresholding variants are more subtle; they occasionally improve as increases, which might be attributed to the fact that when is small, thresholding can zero out too many subgrid cells. We include examples of the heatmaps from each approach in Figure 2. 5 Discussions and Future Directions We present an algorithm for sparse distribution aggregation under the EMD metric, which in turn yields an algorithm for producing heatmaps. As discussed earlier, our algorithm extends naturally to distributed models that can implement the Laplace mechanism, including the secure aggregation model and the shuffle model (Balle et al. 2020; Ghazi et al. 2020). Unfortunately, this does not apply to the more stringent local DP model (Kasiviswanathan et al. 2008) and it remains an interesting open question to devise practical local DP heatmap/EMD aggregation algorithms for moderate number of users and privacy parameters. References Abowd, J. M. 2018. The US Census Bureau adopts differential privacy. In KDD, 2867 2867. Apple Differential Privacy Team. 2017. Learning with privacy at scale. Apple Machine Learning Journal. Backurs, A.; Indyk, P.; Razenshteyn, I. P.; and Woodruff, D. P. 2016. Nearly-optimal bounds for sparse recovery in generic norms, with applications to k-median sketching. In SODA, 318 337. Bagdasaryan, E.; Kairouz, P.; Mellem, S.; Gasc on, A.; Bonawitz, K.; Estrin, D.; and Gruteser, M. 2022. Towards Sparse Federated Analytics: Location Heatmaps under Distributed Differential Privacy with Secure Aggregation. Po PETS, 4: 162 182. Balle, B.; Bell, J.; Gasc on, A.; and Nissim, K. 2020. Private Summation in the Multi-Message Shuffle Model. In CCS, 657 676. Berinde, R.; Gilbert, A. C.; Indyk, P.; Karloff, H.; and Strauss, M. J. 2008. Combining geometry and combinatorics: A unified approach to sparse signal recovery. In Allerton, 798 805. Berinde, R.; and Indyk, P. 2009. Sequential sparse matching pursuit. In Allerton, 36 43. Berinde, R.; Indyk, P.; and Ruzic, M. 2008. Practical nearoptimal sparse recovery in the ℓ1 norm. In Allerton, 198 205. Bylinskii, Z.; Judd, T.; Oliva, A.; Torralba, A.; and Durand, F. 2019. What Do Different Evaluation Metrics Tell Us About Saliency Models? PAMI, 41(3): 740 757. Charikar, M. 2002. Similarity estimation techniques from rounding algorithms. In STOC, 380 388. Cormode, G.; Procopic, C.; Srivastava, D.; and Tran, T. T. L. 2012a. Differentially private summaries for sparse data. In ICDT, 299 311. Cormode, G.; Procopiuc, C. M.; Srivastava, D.; Shen, E.; and Yu, T. 2012b. Differentially Private Spatial Decompositions. In ICDE, 20 31. Ding, B.; Kulkarni, J.; and Yekhanin, S. 2017. Collecting telemetry data privately. In Neur IPS, 3571 3580. Dwork, C.; Kenthapadi, K.; Mc Sherry, F.; Mironov, I.; and Naor, M. 2006a. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 486 503. Dwork, C.; Mc Sherry, F.; Nissim, K.; and Smith, A. 2006b. Calibrating noise to sensitivity in private data analysis. In TCC, 265 284. Dwork, C.; Smith, A. D.; Steinke, T.; Ullman, J. R.; and Vadhan, S. P. 2015. Robust Traceability from Trace Amounts. In Guruswami, V., ed., FOCS, 650 669. Erlingsson, U.; Pihur, V.; and Korolova, A. 2014. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS, 1054 1067. Ghazi, B.; Manurangsi, P.; Pagh, R.; and Velingker, A. 2020. Private Aggregation from Fewer Anonymous Messages. In EUROCRYPT, 798 827. Grauman, K.; and Darrell, T. 2004. Fast contour matching using approximate Earth Mover s Distance. In CVPR, 220 227. Greenberg, A. 2016. Apple s differential privacy is about collecting your data but not your data. Wired, June, 13. Hardt, M.; and Talwar, K. 2010. On the geometry of differential privacy. In STOC, 705 714. Indyk, P.; and Price, E. 2011. K-median clustering, modelbased compressive sensing, and sparse recovery for earth mover distance. In STOC, 627 636. Indyk, P.; and Ruzic, M. 2008. Near-Optimal Sparse Recovery in the L1 Norm. In FOCS, 199 207. Indyk, P.; and Thaper, N. 2003. Fast color image retrieval via embeddings. In Workshop on Statistical and Computational Theories of Vision (at ICCV), 2003. Isaacman, S.; Becker, R.; C aceres, R.; Martonosi, M.; Rowland, J.; Varshavsky, A.; and Willinger, W. 2012. Human mobility modeling at metropolitan scales. In Mobi Sys, 239 252. Kasiviswanathan, S. P.; Lee, H. K.; Nissim, K.; Rashkodnikova, S.; and Smith, A. 2008. What can we Learn Privately? In FOCS, 531 540. Kranstauber, B.; Smolla, M.; and Safi, K. 2017. Similarity in spatial utilization distributions measured by the earth mover s distance. Methods in Ecology and Evolution, 8(2): 155 160. Levina, E.; and Bickel, P. 2001. The earth mover s distance is the Mallows distance: Some insights from statistics. In ICCV, 251 256. Liu, A.; Xia, L.; Duchowski, A.; Bailey, R.; Holmqvist, K.; and Jain, E. 2019. Differential privacy for eye-tracking data. In ETRA, 1 10. Liu, J.; and Talwar, K. 2019. Private selection from private candidates. In Charikar, M.; and Cohen, E., eds., STOC, 298 309. Puzicha, J.; Rubner, Y.; Tomasi, C.; and Buhmann, J. M. 1999. Empirical evaluation of dissimilarity measures for color and texture. In ICCV, 1165 1173. Qardaji, W. H.; Yang, W.; and Li, N. 2013. Understanding Hierarchical Methods for Differentially Private Histograms. VLDB, 6(14): 1954 1965. Radebaugh, C.; and Erlingsson, U. 2019. Introducing Tensor Flow Privacy: Learning with Differential Privacy for Training Data. https://blog.tensorflow.org/2019/03/ introducing-tensorflow-privacy-learning.html. Accessed: 2023-03-05. Rubner, Y.; Tomasi, C.; and Guibas, L. 1998. A metric for distributions with applications to image databases. In ICCV, 59 66. Rubner, Y.; Tomasi, C.; and Guibas, L. J. 2000. The earth mover s distance as a metric for image retrieval. IJCV, 40(2): 99 121. Shankland, S. 2014. How Google tricks itself to protect Chrome user privacy. CNET, October. Steil, J.; Hagestedt, I.; Huang, M. X.; and Bulling, A. 2019. Privacy-aware eye tracking using differential privacy. In ETRA, 1 9. Stricker, M.; and Orengo, M. 1995. Similarity of color images. In SPIE, 381 392. Testuggine, D.; and Mironov, I. 2020. Py Torch Differential Privacy Series Part 1: DP-SGD Algorithm Explained. https://medium.com/pytorch/differential-privacy-seriespart-1-dp-sgd-algorithm-explained-12512c3959a3. Accessed: 2023-03-05. Wang, F.; and Guibas, L. J. 2012. Supervised earth mover s distance learning and its computer vision applications. In ECCV, 442 455. Xu, D.; Yan, S.; and Luo, J. 2008. Face recognition using spatially constrained Earth Mover s Distance. Trans. Image Processing, 17(11): 2256 2260. Zhang, J.; Xiao, X.; and Xie, X. 2016. Priv Tree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD, 155 170. Zhao, Q.; Yang, Z.; and Tao, H. 2008. Differential Earth Mover s Distance with its applications to visual tracking. PAMI, 32: 274 287.