# Practical Locally Private Heavy Hitters

Journal of Machine Learning Research 21 (2020) 1-42. Submitted 11/18; Revised 12/19; Published 2/20.

Raef Bassily (bassily.1@osu.edu), Department of Computer Science and Engineering, The Ohio State University

Kobbi Nissim (kobbi.nissim@georgetown.edu), Department of Computer Science, Georgetown University

Uri Stemmer (u@uri.co.il), Department of Computer Science, Ben-Gurion University

Abhradeep Thakurta (aguhatha@ucsc.edu), Department of Computer Science, University of California Santa Cruz

Editor: Moritz Hardt

We present two new practical locally differentially private heavy hitters algorithms, TreeHist and Bitstogram, achieving optimal or near-optimal worst-case error and running time. In both algorithms, server running time is $O(n)$ and user running time is $O(1)$, hence improving on the prior state-of-the-art result of Bassily and Smith [STOC 2015], which requires $O(n^{5/2})$ server time and $O(n^{3/2})$ user time. With a typically large number of participants in local algorithms ($n$ in the millions), this reduction in time complexity, in particular at the user side, is crucial for making locally private heavy hitters algorithms usable in practice. We implemented Algorithm TreeHist to verify our theoretical analysis and compared its performance with the performance of Google's RAPPOR code.

Keywords: differential privacy, local differential privacy, heavy hitters, histograms, sketching.

(c) 2020 Raef Bassily, Kobbi Nissim, Uri Stemmer, Abhradeep Thakurta. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-786.html.

1. Introduction

We revisit the problem of computing heavy hitters with local differential privacy. Such computations have already been implemented to provide organizations with valuable information about their user base while providing users with the strong guarantee that their privacy is preserved even if the organization is subpoenaed for the entire information seen during an execution. Two prominent examples are Google's use of RAPPOR in the Chrome browser (Erlingsson et al., 2014) and Apple's use of differential privacy in iOS 10 (Thakurta et al., 2017). These tools are used for learning new words typed by users and identifying frequently used emojis and frequently accessed websites.

Differential privacy in the local model. Differential privacy (Dwork et al., 2006) provides a framework for rigorously analyzing privacy risk and hence can help organizations mitigate users' privacy concerns, as it ensures that what is learned about any individual user would be (almost) the same whether the user's information is used as input to an analysis or not. Differentially private algorithms work in two main modalities: curator and local. The curator model assumes a trusted centralized curator that collects all the personal information and then analyzes it. In contrast, the local model does not involve a central repository. Instead, each piece of personal information is randomized by its provider to protect privacy even if all information provided to the analysis is revealed. Holding a central repository of personal information can become a liability to organizations in the face of security breaches, employee misconduct, subpoenas, etc. This makes locally private computations attractive for implementation.
Indeed, in the last few years Google and Apple have announced successful deployments of locally private analyses (Erlingsson et al., 2014; Thakurta et al., 2017).

Challenges of the local model. A disadvantage of the local model is that it requires introducing noise at a significantly higher level than what is required in the curator model. Furthermore, some tasks which are possible in the curator model are impossible in the local model (Dwork et al., 2006; Kasiviswanathan et al., 2011; Chan et al., 2012). To see the effect of noise, consider estimating the number of HIV positives in a given population of $n$ participants. In the curator model, it suffices to add Laplace noise of magnitude $O_\epsilon(1)$, i.e., independent of $n$ (Dwork et al., 2006). In contrast, a lower bound of $\Omega_\epsilon(\sqrt{n})$ is known for the local model (Chan et al., 2012). A higher noise level implies that the number of participants $n$ needs to be large (maybe in the millions for a reasonable choice of $\epsilon$). An important consequence is that, to be practical, locally private algorithms must exhibit low time, space, and communication complexity, especially at the user side. This is the problem addressed in our work.

Heavy hitters and histograms in the local model. Assume each of $n$ users holds an element $x_i$ taken from a domain of size $d$. A histogram of this data lists (an estimate of) the multiplicity of each domain element. When $d$ is large, a succinct representation of the histogram is desired, either in the form of a frequency oracle (a data structure allowing to approximate the multiplicity of any domain element) or a list of heavy hitters (listing the most frequent domain elements together with their multiplicities, implicitly considering the multiplicity of all other domain elements as zero). The problem of computing histograms with differential privacy has attracted significant attention both in the curator model (Dwork et al., 2006; Beimel et al., 2016; Bun et al., 2015) and the local model (Hsu et al., 2012; Erlingsson et al., 2014; Bassily and Smith, 2015). Of relevance is also the work in (Mishra and Sandler, 2006).

We briefly report on the state-of-the-art heavy hitters algorithms of Bassily and Smith (2015) and Thakurta et al. (2017), which are most relevant for the current work. Bassily and Smith provide matching lower and upper bounds of $\Theta\left(\sqrt{n \log(d)}/\epsilon\right)$ on the worst-case error of local heavy hitters algorithms. Their local algorithm exhibits optimal communication but a rather high time complexity: server running time is $O(n^{5/2})$ and, crucially, user running time is $O(n^{3/2})$, a complexity that severely hampers the practicality of this algorithm. (If public randomness or efficient Private Information Retrieval is assumed, then the user running time in (Bassily and Smith, 2015) can be improved to $O(n)$.) The construction by Thakurta et al. is a heuristic with no bounds on server running time and accuracy; their underlying construction is of a frequency oracle. (We refer to the maximal difference between the true and estimated multiplicities of a data element as the error, or the accuracy, of the protocol.) User computation time is $O(1)$, a significant improvement over (Bassily and Smith, 2015).

Our contributions. The focus of this work is on the design of locally private heavy hitters algorithms with near-optimal error, keeping time, space, and communication complexity minimal. We provide two new constructions of heavy hitters algorithms: TreeHist and Bitstogram. These algorithms achieve similar performance but apply different techniques.
We implemented Algorithm TreeHist and provide measurements in comparison with RAPPOR (Erlingsson et al., 2014) (the only currently available implementation for local histograms). Our measurements are performed with a setting that is favorable to RAPPOR (i.e., a small input domain), yet they indicate that Algorithm TreeHist performs better than RAPPOR in terms of noise level.

Table 1 details various performance parameters of TreeHist and Bitstogram, and the reader can check in the table that these are similar up to small factors, which we ignore for the rest of this paragraph. Comparing with (Bassily and Smith, 2015), we improve time complexity both at the server (reduced from $O(n^{5/2})$ to $O(n)$) and at the user (reduced from $O(n^{3/2})$ to $O(\max(\log n, \log d)^2)$). Comparing with (Thakurta et al., 2017), we get provable bounds on the server running time and worst-case error. Note that Algorithm Bitstogram achieves optimal worst-case error, whereas Algorithm TreeHist is almost optimal, off by a factor of $\sqrt{\log n}$.

| Performance metric | TreeHist | Bitstogram |
|---|---|---|
| Server time (modular multiplications) | $O(n)$ | $O(n)$ |
| User time (modular multiplications) | $O(\max(\log n, \log d)^2)$ | $O(\max(\log n, \log d)^2)$ |
| Server processing memory | $O(\sqrt{n})$ | $O(\sqrt{n})$ |
| User memory | $O(\max(\log d, \log n))$ | $O(\max(\log d, \log n))$ |
| Communication/user | $O(1)$ | $O(1)$ |
| Worst-case error | $O\left(\sqrt{n \log(n) \log(d)}\right)$ | $O\left(\sqrt{n \log(d)}\right)$ |

Table 1: Performance of our protocols. Dependency on the privacy parameter $\epsilon$ and failure probability $\beta$ is omitted.

Elements of the constructions. The main details of our constructions are presented in Sections 3.1 and 3.2. These are complemented with detailed descriptions and analyses in the appendix. Both our algorithms make use of frequency oracles: data structures that allow estimating various counts.

Algorithm TreeHist identifies heavy hitters and estimates their frequencies by scanning the levels of a binary prefix tree whose leaves correspond to dictionary items. The recovery of the heavy hitters is in a bit-by-bit manner. As the algorithm progresses down the tree, it prunes all the nodes that cannot be prefixes of heavy hitters, hence leaving $O(\sqrt{n})$ nodes at every depth. This is done by making queries to a frequency oracle. Once the algorithm reaches the final level of the tree, it identifies the list of heavy hitters. It then invokes the frequency oracle once more on those particular items to obtain more accurate estimates for their frequencies.

Algorithm Bitstogram hashes the input domain into a domain of size roughly $\sqrt{n}$. The observation behind this algorithm is that if a heavy hitter $x$ does not collide with other heavy hitters, then $(h(x), x_i)$ would have a significantly higher count than $(h(x), 1 - x_i)$ for every $i$, where $x_i$ is the $i$th bit of $x$. This allows recovering all bits of $x$ in parallel given an appropriate frequency oracle.

Followup work. Our work presents locally private heavy hitters algorithms with error rates that depend optimally on the number of users, the size of the domain, and the privacy parameter. Following our work, Bun et al. (2018) presented an improved protocol with an error rate that depends optimally also on the failure probability. Their protocol is a refined version of our Bitstogram protocol.

2. Background and Preliminaries

2.1. Definitions and Notation

Dictionary and user items: Let $V = [d]$. We consider a set of $n$ users, where each user $i \in [n]$ holds an item $v_i \in V$.
We will use $v_i$ also to refer to the binary representation of $v_i$ when it is clear from the context.

Item frequency (multiplicity): For each item $v \in V$, we define the frequency $f(v)$ of $v$ as the number of users holding $v$, namely, $f(v) \triangleq |\{i \in [n] : v_i = v\}|$.

Frequency oracle: A frequency oracle is a data structure together with an algorithm that, for any given $v \in V$, allows computing an estimate $\hat f(v)$ of the frequency $f(v)$.

Heavy hitters (succinct histogram): A succinct histogram is a data structure that provides a (short) list of items $(\hat v_1, \ldots, \hat v_k)$, called the heavy hitters, together with estimates for their frequencies $(\hat f(\hat v_j) : j \in [k])$. The frequencies of the items not in the list are implicitly estimated as $\hat f(v) = 0$. We measure the error in a succinct histogram by the $\ell_\infty$ distance between the estimated and true frequencies, $\max_{v \in [d]} |\hat f(v) - f(v)|$. We will also consider the maximum error restricted to the items in the list $(\hat v_1, \ldots, \hat v_k)$, that is, $\max_{j \in [k]} |\hat f(\hat v_j) - f(\hat v_j)|$. If a succinct histogram aims to provide $\ell_\infty$ error $\eta$, the list does not need to contain more than $O(n/\eta)$ items, since items with estimated frequencies below $\eta$ may be omitted from the list.

2.2. Local Differential Privacy

In the local model, an algorithm $A : V^n \to Z$ accesses the database $v = (v_1, \ldots, v_n) \in V^n$ only via an oracle that, given an index $i \in [n]$, runs a randomized algorithm (local randomizer) $R : V \to Z$ on input $v_i$ and returns $R(v_i)$ to $A$.

Definition 2.1 (Local differential privacy; Dwork et al. (2006); Evfimievski et al. (2003); Kasiviswanathan et al. (2011)) An algorithm satisfies $\epsilon$-local differential privacy (LDP) if it accesses the database $v = (v_1, \ldots, v_n) \in V^n$ only via invocations of a local randomizer $R$, and if for all $i \in [n]$, letting $R^{(1)}, \ldots, R^{(k)}$ denote the algorithm's invocations of $R$ on the data sample $v_i$, the algorithm $A(\cdot) \triangleq \left(R^{(1)}(\cdot), R^{(2)}(\cdot), \ldots, R^{(k)}(\cdot)\right)$ is $\epsilon$-differentially private. That is, for any pair of inputs $v, v'$ that differ on a single entry, and for all $S \subseteq \mathrm{Range}(A)$, $\Pr[A(v) \in S] \le e^\epsilon \Pr[A(v') \in S]$.

2.3. Count Sketch and Hadamard Transform

Count sketch (Charikar et al., 2002), together with the Hadamard transform, forms the basis of our differentially private construction outlined in Section 3.1 and discussed in detail in Section 5. Count sketch is a sketching algorithm for finding frequent elements in a data stream. Let $V = [d]$ be a domain of data elements, and let $v = (v_1, \ldots, v_n)$ be a stream of data elements. Count sketch ensures that, for any given $v \in V$, using a data structure of size $m = O\left(\frac{n \log(1/\beta)}{k}\right)$, with probability at least $1 - \beta$ the estimated frequency of $v$ is within $k$ of the true frequency, and the estimate is unbiased. The algorithm works as follows. First, pick $t$ pairs of hash functions $\left(h_i : V \to [m],\ g_i : V \to \{-1, 1\}\right)$, and initialize a matrix $M = \{0\}^{t \times m}$. Second, for every data sample $v_i$, populate the matrix as follows: for each $j \in [t]$, $M[j, h_j(v_i)] \leftarrow M[j, h_j(v_i)] + g_j(v_i)$. Finally, to estimate the frequency of an element $v \in V$, compute $\mathrm{median}\left\{M[1, h_1(v)] \cdot g_1(v),\ \ldots,\ M[t, h_t(v)] \cdot g_t(v)\right\}$.
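For concreteness, the update and estimate steps just described fit in a few lines; the following sketch is illustrative only (the affine hash construction and all parameter choices are our own stand-ins for the pairwise independent hash pairs, not the paper's specification):

```python
import random
from statistics import median

def make_count_sketch(t, m, seed=0):
    """Count sketch (Charikar et al., 2002) as described above: t hash pairs
    (h_j, g_j) and a t x m counter matrix M. The hashes below (random affine
    maps modulo a Mersenne prime) are simple illustrative stand-ins."""
    rng = random.Random(seed)
    P = (1 << 61) - 1
    hs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
    gs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
    M = [[0] * m for _ in range(t)]

    def h(j, v):  # h_j : V -> [m]
        a, b = hs[j]
        return ((a * v + b) % P) % m

    def g(j, v):  # g_j : V -> {-1, +1}
        a, b = gs[j]
        return 1 if ((a * v + b) % P) % 2 == 0 else -1

    def update(v):  # M[j, h_j(v)] += g_j(v) for every j
        for j in range(t):
            M[j][h(j, v)] += g(j, v)

    def estimate(v):  # median over j of M[j, h_j(v)] * g_j(v)
        return median(M[j][h(j, v)] * g(j, v) for j in range(t))

    return update, estimate

update, estimate = make_count_sketch(t=7, m=256)
for x in [1] * 1000 + [2] * 300 + list(range(3, 500)):
    update(x)
print(estimate(1), estimate(2))  # close to the true counts 1000 and 300
```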
Hadamard transform: We use the Hadamard transform, followed by sampling, in order to compress the data transmission from the client to the server and to reduce the space requirements of our protocol. The Hadamard transform of a vector $w \in \mathbb{R}^m$ is obtained by multiplying it with the Hadamard transform matrix $H_m \in \left\{-\frac{1}{\sqrt m}, \frac{1}{\sqrt m}\right\}^{m \times m}$, defined recursively as $H_m = \frac{1}{\sqrt 2}\begin{pmatrix} H_{m/2} & H_{m/2} \\ H_{m/2} & -H_{m/2} \end{pmatrix}$ and $H_1 = [1]$.

Two main properties of the Hadamard transform that we use are: (i) it is a dense basis transformation, i.e., the columns of the Hadamard matrix form a basis, and each entry of $\sqrt m \cdot H_m$ is in $\{-1, 1\}$; and (ii) any entry $(i, j)$ of $\sqrt m \cdot H_m$ can be computed in $O(\log m)$ time.

2.4. Error-Correcting Codes

We will use error-correcting codes in order to reduce the error of (some of) our constructions outlined in Section 3.2 and discussed in detail in Section 6.

Definition 2.2 A binary $(n, k)$-code is a pair of mappings $(\mathrm{Enc}, \mathrm{Dec})$ where $\mathrm{Enc} : \{0,1\}^k \to \{0,1\}^n$ and $\mathrm{Dec} : \{0,1\}^n \to \{0,1\}^k$. The code is $\zeta$-decodable if for every $x \in \{0,1\}^k$ and every $y \in \{0,1\}^n$ whose Hamming distance to $\mathrm{Enc}(x)$ is at most $\zeta n$, we have that $\mathrm{Dec}(y) = x$.

For any constant $0 < \zeta < 1/4$, there is a construction of a $\zeta$-decodable $(n, k)$-code, where $n = O(k)$, and furthermore, Enc and Dec run in time $O(n)$. See, e.g., (Guruswami, 2001).

2.5. Tools from Probability

2.5.1. Pairwise and k-wise Independence

The definitions given here are taken from (Vadhan, 2012).

Definition 1 A family (i.e., multiset) of functions $\mathcal H = \{h : [N] \to [M]\}$ is pairwise independent if the following two conditions hold when $H$ is a function chosen uniformly at random from $\mathcal H$:

1. For all $x \in [N]$, the random variable $H(x)$ is uniformly distributed in $[M]$.
2. For all $x_1 \ne x_2 \in [N]$, the random variables $H(x_1)$ and $H(x_2)$ are independent.

We will use the following tail bound on sums of $k$-wise independent random variables.

Lemma 2.3 (Bellare and Rompel (1994)) Let $\lambda \ge 6$ be an even integer. Suppose $X_1, \ldots, X_n$ are $\lambda$-wise independent random variables taking values in $[0, 1]$. Let $X = X_1 + \cdots + X_n$ and $\mu = \mathbb E[X]$, and let $\alpha > 0$. Then, $\Pr[|X - \mu| \ge \alpha] \le 8\left(\frac{\lambda \mu + \lambda^2}{\alpha^2}\right)^{\lambda/2}$.

2.5.2. The Poisson Approximation

We will use the following useful facts about the Poisson approximation. When throwing $n$ balls into $R$ bins, the distribution of the number of balls in a given bin is $\mathrm{Bin}(n, 1/R)$. As the Poisson distribution is the limit distribution of the binomial distribution, the distribution of the number of balls in a given bin is approximately $\mathrm{Pois}(n/R)$. In fact, in some cases we can approximate the joint distribution of the number of balls in all the bins by assuming the load in each bin is an independent Poisson random variable with mean $n/R$.

Theorem 2.4 (e.g., Mitzenmacher and Upfal (2005)) Suppose that $n$ balls are thrown into $R$ bins independently and uniformly at random, and let $X_i$ be the number of balls in the $i$th bin, where $1 \le i \le R$. Let $Y_1, \ldots, Y_R$ be independent Poisson random variables with mean $n/R$. Let $f(x_1, \ldots, x_R)$ be a nonnegative function. Then, $\mathbb E[f(X_1, \ldots, X_R)] \le e\sqrt{n}\; \mathbb E[f(Y_1, \ldots, Y_R)]$.

In particular, the theorem states that any event that takes place with probability $p$ in the Poisson case takes place with probability at most $p e \sqrt n$ in the exact case (this follows by letting $f$ be the indicator function of that event). We will also use the following bounds for the tail probabilities of a Poisson random variable:

Theorem 2.5 (Alon and Spencer (1992)) Let $X$ have Poisson distribution with mean $\mu$. For $0 \le \alpha \le 1$, $\Pr[X \le \mu(1 - \alpha)] \le e^{-\alpha^2 \mu / 2}$ and $\Pr[X \ge \mu(1 + \alpha)] \le e^{-\alpha^2 \mu / 2}$.

3. Our Algorithms

In this section we give an informal description of our algorithms and highlight some of the ideas behind our constructions.

3.1. The TreeHist Protocol

We briefly give an overview of our construction, which is based on a compressed, noisy version of the count sketch. To maintain clarity of the main ideas, we give here a high-level description of our construction. We refer to Section 5 for a detailed description of this construction.
We first introduce some objects and public parameters that will be used in the construction.

Prefixes: For a binary string $v$, we will use $v[1:\ell]$ to denote the $\ell$-bit prefix of $v$. Let $\mathcal{V} = \left\{v \in \{0,1\}^\ell \text{ for some } \ell \in [\log d]\right\}$. Note that the elements of $\mathcal{V}$ can be arranged in a binary prefix tree of depth $\log d$, where the nodes at level $\ell$ of the tree represent all binary strings of length $\ell$. The items of the dictionary $V$ represent the bottommost level of that tree. Note that $|\mathcal{V}| < 2d$. We will use $\perp$ to denote the empty string. For $v \in \{0,1\}^\ell$ and $b \in \{0,1\}$, let $v \circ b$ denote the $(\ell+1)$-bit string resulting from appending the bit $b$ to $v$. For a binary string $v$, we define $\mathrm{Child}(v) \triangleq \{v \circ 0,\ v \circ 1\}$, that is, the set containing the two children of $v$ in the prefix tree. Similarly, for a set of strings $U$, we define $\mathrm{ChildSet}(U) \triangleq \{v : v \in \mathrm{Child}(u) \text{ for some } u \in U\}$.

Hashes: Let $t, m$ be positive integers to be specified later. We will consider a set of $t$ pairs of hash functions $\{(h_1, g_1), \ldots, (h_t, g_t)\}$, where for each $i \in [t]$, $h_i : \mathcal{V} \to [m]$ and $g_i : \mathcal{V} \to \{-1, +1\}$ are independently and uniformly chosen pairwise independent hash functions.

Basis matrix: Let $W \in \{-1, +1\}^{m \times m}$ be $\sqrt{m} \cdot H_m$, where $H_m$ is the Hadamard transform matrix of size $m$. As will be shown later, we will be making operations over the entries of this matrix. It is important to note that we do not need to store this matrix: the value of any entry can be computed in $O(\log m)$ bit operations given the (row, column) index of that entry. In particular, suppose we want to compute the value of the entry $W_{i,j}$ located at the $i$-th row and $j$-th column. Let $(i_0, i_1, \ldots, i_{\log m - 1})$ and $(j_0, j_1, \ldots, j_{\log m - 1})$ denote the bit representations of $i$ and $j$, respectively. Then, $W_{i,j} = (-1)^{\sum_{k=0}^{\log m - 1} i_k j_k}$.
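A minimal sketch of this entry computation (Python, 0-indexed row and column, assuming $m$ is a power of two): the sign is simply the parity of the bitwise AND of the two indices.

```python
def hadamard_sign(i, j):
    """Sign of W[i, j] = (-1)^{sum_k i_k j_k}: the parity of popcount(i & j),
    computable in O(log m) bit operations, so W is never stored explicitly."""
    return -1 if bin(i & j).count("1") % 2 else 1

# Sanity check: the 4 x 4 case reproduces sqrt(4) * H_4.
print([[hadamard_sign(i, j) for j in range(4)] for i in range(4)])
```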
Global parameters: The total number of users $n$, the size of the Hadamard matrix $m$, the number of hash pairs $t$, the privacy parameter $\epsilon$, the confidence parameter $\beta$, and the hash functions $(h_1, g_1), \ldots, (h_t, g_t)$ are assumed to be public information. We set $t = O(\log(n/\beta))$ and $m = O\left(\sqrt{n/\log(n/\beta)}\right)$.

Public randomness: In addition to the $t$ hash pairs $\{(h_1, g_1), \ldots, (h_t, g_t)\}$, we assume that the server creates a random partition $\Pi : [n] \to [\log d] \times [t]$ that assigns to each user $i \in [n]$ a random pair $(\ell_i, j_i) \in [\log d] \times [t]$, and another random function $Q : [n] \to [m]$ that assigns to each user $i$ a uniformly random index $r_i \in [m]$. (We could have grouped $\Pi$ and $Q$ into one random function mapping $[n]$ to $[\log d] \times [t] \times [m]$; however, we prefer to split them for clarity of exposition, as each source of randomness will be used for a different role.) We assume that such random indices $\ell_i, j_i, r_i$ are shared between the server and each user.

First, we describe the two main modules of our protocol.

3.1.1. A Local Randomizer: LocalRnd

For each $i \in [n]$, user $i$ runs her own independent copy of a local randomizer, denoted LocalRnd, to generate her private report. LocalRnd of user $i$ starts by acquiring the index triple $(\ell_i, j_i, r_i) \in [\log d] \times [t] \times [m]$ from public randomness. For each user, LocalRnd is invoked twice in the full protocol: once during the first phase of the protocol (called the pruning phase), where the high-frequency items (heavy hitters) are identified, and a second time during the final phase (the estimation phase), to enable the protocol to get better estimates for the frequencies of the heavy hitters. In the first invocation, LocalRnd of user $i$ performs its computation on the $\ell_i$-bit prefix of the item $v_i$ of user $i$, whereas in the second invocation it performs the computation on the user's entire string $v_i$. Apart from this, LocalRnd follows similar steps in both invocations.

It first selects the hash pair $(h_{j_i}, g_{j_i})$ and computes $c_i = h_{j_i}(v_i[1:\ell])$ (where $\ell = \ell_i$ in the first invocation and $\ell = \log d$ in the second invocation, and $v_i[1:\ell]$ is the $\ell$-bit prefix of $v_i$); then it computes a bit $x_i = g_{j_i}(v_i[1:\ell]) \cdot W_{r_i, c_i}$ (where $W_{r,c}$ denotes the $(r, c)$ entry of the basis matrix $W$). Finally, to guarantee $\epsilon$-local differential privacy, it generates a randomized response $y_i$ based on $x_i$ (i.e., $y_i = x_i$ with probability $e^{\epsilon/2}/(1 + e^{\epsilon/2})$ and $y_i = -x_i$ with probability $1/(1 + e^{\epsilon/2})$), which is sent to the server.

Our local randomizer can be thought of as a transformed, compressed (via sampling), and randomized version of the count sketch (Charikar et al., 2002). In particular, LocalRnd starts off with steps similar to the standard count sketch algorithm, but then deviates from it: it applies the Hadamard transform to the user's signal, then samples one bit from the result. By doing so, we can achieve significant savings in space and communication without sacrificing accuracy.
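To see why the server can later de-bias these one-bit reports, a short calculation from the randomized-response probabilities above (the resulting constant $a_\epsilon$ is exactly the scaling used in Algorithm 2 below):

```latex
\mathbb{E}[y_i]
  = x_i \cdot \frac{e^{\epsilon/2}}{e^{\epsilon/2}+1}
  - x_i \cdot \frac{1}{e^{\epsilon/2}+1}
  = x_i \cdot \frac{e^{\epsilon/2}-1}{e^{\epsilon/2}+1},
\qquad \text{so} \qquad
a_\epsilon \, \mathbb{E}[y_i] = x_i
\quad \text{for} \quad
a_\epsilon = \frac{e^{\epsilon/2}+1}{e^{\epsilon/2}-1}.
```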
3.1.2. A Frequency Oracle: FreqOracle

Suppose we want to allow the server to estimate the frequencies of some given subset $\hat{V} \subseteq \{0,1\}^\ell$, for some given $\ell \in [\log d]$, based on the noisy users' reports. We give a protocol, denoted FreqOracle, for accomplishing this task.

For each queried item $\hat v \in \hat V$ and for each hash index $j \in [t]$, FreqOracle computes $c = h_j(\hat v)$, then collects the noisy reports of a collection of users $I_{\ell,j}$ that contains every user $i$ whose pair of prefix and hash indices $(\ell_i, j_i)$ matches $(\ell, j)$. Next, it estimates the inverse Hadamard transform of the compressed and noisy signal of each user in $I_{\ell,j}$. In particular, for each $i \in I_{\ell,j}$, it computes $y_i \cdot W_{r_i, c}$, which can be described as a multiplication between $y_i e_{r_i}$ (where $e_{r_i}$ is the indicator vector with 1 at the $r_i$-th position) and the scaled Hadamard matrix $W$, followed by selecting the $c$-th entry of the resulting vector. This brings us back to the standard count sketch representation. It then sums all the results and multiplies the outcome by $g_j(\hat v)$ to obtain an estimate $\hat f_j(\hat v)$ for the frequency of $\hat v$. As in the count sketch algorithm, this is done for every $j \in [t]$; FreqOracle then obtains a high-confidence estimate by computing the median of all $t$ frequency estimates.

3.1.3. The Protocol: TreeHist

The protocol is easier to describe via operations over nodes of the prefix tree $\mathcal V$ of depth $\log d$ (described earlier). The protocol runs through two main phases: the pruning (or scanning) phase, and the final estimation phase. In the pruning phase, the protocol scans the levels of the prefix tree, starting from the top level (which contains just 0 and 1) down to the bottom level (which contains all items of the dictionary). For a given node at level $\ell \in [\log d]$, using FreqOracle as a subroutine, the protocol gets an estimate for the frequency of the corresponding $\ell$-bit prefix. For any $\ell \in [\log(d) - 1]$, before the protocol moves to level $\ell + 1$ of the tree, it prunes all the nodes in level $\ell$ that cannot be prefixes of actual heavy hitters (high-frequency items in the dictionary). Then, as it moves to level $\ell + 1$, the protocol considers only the children of the surviving nodes in level $\ell$.

The construction guarantees that, with high probability, the number of surviving nodes in each level cannot exceed $O\left(\sqrt{\frac{n}{\log(d)\log(n)}}\right)$. Hence, the total number of nodes queried by the protocol (i.e., submitted to FreqOracle) is at most $O\left(\sqrt{\frac{n\log(d)}{\log(n)}}\right)$.

In the second and final phase, after reaching the final level of the tree, the protocol will have already identified a list of candidate heavy hitters; however, their estimated frequencies may not be as accurate as desired, due to the large variance caused by the random partitioning of users across all the levels of the tree. Hence, it invokes the frequency oracle once more on those particular items, and this time the sampling variance is reduced, as the set of users is partitioned only across the $t$ hash pairs (rather than across $\log(d) \cdot t$ bins as in the pruning phase). By doing this, the server obtains more accurate estimates for the frequencies of the identified heavy hitters.

The privacy and accuracy guarantees are stated below. The full details are given in Section 5.

3.1.4. Privacy and Utility Guarantees

Theorem 3.1 Protocol TreeHist is $\epsilon$-local differentially private.

Theorem 3.2 There is a number $\eta = O\left(\sqrt{n \log(n/\beta) \log(d)}/\epsilon\right)$ such that, with probability at least $1 - \beta$ (over the choice of all public and private randomness), the output list of the TreeHist protocol satisfies the following properties:

1. it contains all items $v \in V$ with true frequencies above $3\eta$;
2. it does not contain any item $v \in V$ with true frequency below $\eta$;
3. every frequency estimate in the output list is accurate up to an error $O\left(\sqrt{n \log(n/\beta)}/\epsilon\right)$.

3.2. The Bitstogram Protocol

We now present a simplified description of our second protocol that captures most of the ideas. Any informalities made hereafter are removed in the full description of the protocol (Section 6).

First Step: Frequency Oracle. Recall that a frequency oracle is a protocol that, after communicating with the users, outputs a data structure capable of approximating the frequency of every domain element $v \in V$. So, if we were to allow the server to have runtime linear in the domain size $|V| = d$, then a frequency oracle would suffice for computing histograms. As we are interested in protocols with a significantly lower runtime, we will only use a frequency oracle as a subroutine, and query it only for (roughly) $\sqrt n$ elements.

Let $Z \in \{\pm 1\}^{d \times n}$ be a matrix chosen uniformly at random, and assume that $Z$ is publicly known. (As we later explain, $Z$ has a short description, as it need not be uniform.) That is, for every domain element $v \in V$ and every user $j \in [n]$, we have a random bit $Z[v, j] \in \{\pm 1\}$. As $Z$ is publicly known, every user $j$ can identify her corresponding bit $Z[v_j, j]$, where $v_j \in V$ is the input of user $j$. Now consider a protocol in which users send randomized responses of their corresponding bits. That is, user $j$ sends $y_j = Z[v_j, j]$ w.p. $\frac12 + \frac\epsilon2$ and sends $y_j = -Z[v_j, j]$ w.p. $\frac12 - \frac\epsilon2$. We can now estimate the frequency of every domain element $v \in V$ as $a(v) = \frac1\epsilon \sum_{j \in [n]} y_j \cdot Z[v, j]$.

To see that $a(v)$ is accurate, observe that $a(v)$ is the sum of $n$ independent random variables (one for every user). For the users $j$ holding the input $v$ (that is, $v_j = v$) we have that $\frac1\epsilon \mathbb{E}[y_j \cdot Z[v, j]] = 1$. For the other users, $y_j$ and $Z[v, j]$ are independent, and hence $\mathbb{E}[y_j \cdot Z[v, j]] = \mathbb{E}[y_j] \cdot \mathbb{E}[Z[v, j]] = 0$. That is, $a(v)$ can be expressed as the sum of $n$ independent random variables: $f(v)$ variables with expectation 1, and $(n - f(v))$ variables with expectation 0. The fact that $a(v)$ is an accurate estimate of $f(v)$ now follows from the Hoeffding bound.
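A minimal simulation of this estimator (our own illustrative code: the matrix $Z$ is sampled explicitly here, whereas the protocol derives it from public randomness, and the flip probability $\frac12 + \frac\epsilon2$ is the simplified one used in this informal description):

```python
import numpy as np

def explicit_hist_sim(values, d, eps, rng):
    """Estimate a(v) = (1/eps) * sum_j y_j * Z[v, j] for every v in [0, d)."""
    n = len(values)
    Z = rng.choice([-1, 1], size=(d, n))      # public random matrix Z
    bits = Z[values, np.arange(n)]            # user j's bit Z[v_j, j]
    keep = rng.random(n) < 0.5 + eps / 2      # send the bit w.p. 1/2 + eps/2
    y = np.where(keep, bits, -bits)           # randomized responses
    return (Z @ y) / eps

rng = np.random.default_rng(0)
vals = rng.choice(10, size=100_000, p=[0.3] + [0.7 / 9] * 9)
est = explicit_hist_sim(vals, d=10, eps=0.5, rng=rng)
print(int(est[0]), int(np.sum(vals == 0)))   # estimate vs. true count of item 0
```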
Lemma 3.3 (Algorithm Hashtogram) Let $\epsilon \le 1$. Algorithm Hashtogram satisfies $\epsilon$-LDP. Furthermore, with probability at least $1 - \beta$, algorithm Hashtogram answers every query $v \in V$ with $a(v)$ satisfying $|a(v) - f(v)| \le O\left(\frac1\epsilon \sqrt{n \log(d/\beta)}\right)$.

Second Step: Identifying Heavy Hitters. Let us assume that we have a frequency oracle protocol with worst-case error $\tau$. We now want to use our frequency oracle in order to construct a protocol that operates in two steps: first, it identifies a small set of potential "heavy hitters", i.e., domain elements that appear in the database at least $2\tau$ times; afterwards, it uses the frequency oracle to estimate the frequencies of those potential heavy elements. (Even though we describe the protocol as having two steps, the necessary communication for these steps can be done in parallel; hence, our protocol has only one round of communication.)

Let $h : V \to [T]$ be a (publicly known) random hash function mapping domain elements into $[T]$, where $T$ will be set later. (As with the matrix $Z$, the hash function $h$ can have a short description length.) We will now use $h$ in order to identify the heavy hitters. To that end, let $v^* \in V$ denote such a heavy hitter, appearing at least $2\tau$ times in the database $S$, and denote $t^* = h(v^*)$. Assuming that $T$ is big enough, w.h.p. we will have that $v^*$ is the only input element (from $S$) that is mapped (by $h$) into the hash value $t^*$. Assuming that this is indeed the case, we will now identify $v^*$ bit by bit.

For $\ell \in [\log d]$, denote $S_\ell = (h(v_j), v_{j,\ell})_{j \in [n]}$, where $v_{j,\ell}$ is bit $\ell$ of $v_j$. That is, $S_\ell$ is a database over the domain $[T] \times \{0,1\}$, where the row corresponding to user $j$ is $(h(v_j), v_{j,\ell})$. Observe that every user can compute her own row locally. As $v^*$ is a heavy hitter, for every $\ell \in [\log d]$ we have that $(t^*, v^*_\ell)$ appears in $S_\ell$ at least $2\tau$ times, where $v^*_\ell$ is bit $\ell$ of $v^*$. On the other hand, as we assumed that $v^*$ is the only input element that is mapped into $t^*$, we get that $(t^*, 1 - v^*_\ell)$ does not appear in $S_\ell$ at all. Recall that our frequency oracle has error at most $\tau$; hence, we can use it to accurately determine the bits of $v^*$.

To make things more concrete, consider the protocol that, for every hash value $t \in [T]$, every coordinate $\ell \in [\log d]$, and every bit $b \in \{0,1\}$, obtains an estimate (using the frequency oracle) of the multiplicity of $(t, b)$ in $S_\ell$ (so there are $\log d$ invocations of the frequency oracle, and a total of $2T \log d$ estimations). Now, for every $t \in [T]$, let us define $\hat v_t$, where bit $\ell$ of $\hat v_t$ is the bit $b$ s.t. $(t, b)$ is more frequent than $(t, 1-b)$ in $S_\ell$. By the above discussion, we will have that $\hat v_{t^*} = v^*$. That is, the protocol identifies a set of $T$ domain elements containing all of the heavy hitters. The frequency of the identified heavy hitters can then be estimated using the frequency oracle.

Remark 2 As should be clear from the above discussion, it suffices to take $T \gtrsim n^2$, as this will ensure that there are no collisions among different input elements. As we only care about collisions between heavy hitters (appearing in $S$ at least $\sqrt n$ times), it would suffice to take $T \gtrsim n$ to ensure that w.h.p. there are no collisions between heavy hitters. In fact, we could even take $T \approx \sqrt n$, which would ensure that a heavy hitter $v^*$ has no collisions with constant probability, and then amplify our confidence using repetitions.

Lemma 3.4 (Algorithm Bitstogram) Let $\epsilon \le 1$. Algorithm Bitstogram satisfies $\epsilon$-LDP.
Furthermore, the algorithm returns a list $L$ of length $O(\sqrt n)$ satisfying:

1. With probability $1 - \beta$, for every $(v, a) \in L$ we have that $|a - f(v)| \le O\left(\frac1\epsilon \sqrt{n \log(n/\beta)}\right)$.
2. W.p. $1 - \beta$, for every $v \in V$ s.t. $f(v) \ge O\left(\frac1\epsilon \sqrt{n \log(d/\beta) \log(\frac1\beta)}\right)$, we have that $v$ is in $L$.

4. Detailed Experimental Results

In this section we discuss implementation details of our algorithms mentioned in Section 5. (The experiments are performed without the Hadamard compression during data transmission.) The main objective of this section is to emphasize the empirical efficacy of our algorithms, alongside their theoretical optimality in terms of error, space, time, and communication. Thakurta et al. (2017) recently claimed space optimality for a similar problem, but a formal analysis (or empirical evidence) was not provided. Our experiments corroborate both the analytical bounds in our current work and those in (Thakurta et al., 2017). Our experiments are performed on a macOS Sierra 10.12 system (in Python 2.7) with a 3.3 GHz Intel Core i5 and 16 GB of DDR3 RAM.

4.1. Private Frequency Oracle

In this experiment, the objective is to test the efficacy of our algorithm in estimating the frequencies of a known dictionary of user items under local differential privacy. We estimate the error in estimation while varying the size of the data set $n$ and the privacy parameter $\epsilon$. (See Section 2.1 for a refresher on the notation.)

Figure 1 shows results on a synthetic data set with a domain size of one hundred (i.e., $d = 100$) drawn from a power law distribution with power of 15. The default parameters used in Figure 1 are: number of data samples ($n$): 10 million; range of the hash function ($m$): $\sqrt n$; number of hash functions ($t$): 285; and privacy parameter $\epsilon = 2.0$. For the hash functions, we used the prefix bits of SHA-256. The estimated frequency is scaled by the number of samples to normalize the result, and each experiment is averaged over ten runs. The bars for "True" refer to the true frequencies, and the bars for "Priv" correspond to the differentially private frequencies. The "pctle" label corresponds to the frequency of a domain element at the corresponding percentile in the frequency distribution of the data set.

Observations: (i) The plots corroborate the fact that the frequency oracle is indeed unbiased: the average frequency estimate (over ten runs) for each percentile is within one standard deviation of the corresponding true estimate. (ii) The error in the estimates goes down significantly as the number of samples or the privacy parameter $\epsilon$ is increased.

[Figure 1: Estimated frequency vs. number of samples (n), and estimated frequency vs. privacy parameter (ε), on the synthetic data set; each panel compares True and Priv at the 50th, 90th, and 95th percentiles.]

In Figure 2 we show the result of changing the range of the hash function ($m$). The observation is that the results are seemingly insensitive to the range of the hash function. We also ran the same experiment (Figure 3) on a real data set drawn uniformly at random from the NLTK Brown corpus (NLT). The data set we created has $n = 10$ million samples drawn i.i.d. from the corpus with replacement, and the system parameters are the same default parameters described earlier.
In this plot, the rank corresponds to the rank of a domain element in the distribution of frequencies in the data set. The observation here is also consistent with that of Figure 1.

[Figure 2: Estimated frequency vs. sketch width (m), measured as sketch-width/sqrt(n), on the synthetic data set, at the 50th, 90th, and 95th percentiles.]

[Figure 3: Estimated frequency vs. privacy parameter (ε) on the NLTK Brown corpus, at ranks 1, 10, and 100.]

Comparison to RAPPOR (Erlingsson et al., 2014): Here we compare ourselves to the only other system for the private frequency estimation problem whose code is publicly available (the RAPPOR project from Google). We took a snapshot of their code base (https://github.com/google/rappor) on May 9th, 2017. In order to perform a fair comparison, we tested our algorithm against one of their available demo experiments (Demo3, using the demo.sh script). We used the privacy parameter $\epsilon = \ln(3)$, the number of data samples $n = 1$ million, and the same data set generated by the demo.sh script. In Figure 4 we observe that for higher frequencies both RAPPOR and our algorithm perform similarly. However, in lower frequency regimes, the RAPPOR estimates are zero most of the time, while our estimates are closer to the true estimates.

[Figure 4: Comparison between count sketch and RAPPOR on the Demo 3 experiment from RAPPOR: true frequency, Count_Sketch, and RAPPOR estimates.]

N.B. We do not claim that our algorithm would outperform the RAPPOR system on all problem instances. However, our current experiment does highlight the need to perform an at-scale comparison between the two algorithms.

4.2. Private Heavy Hitters

In this section, we take on the harder task of identifying the heavy hitters, rather than estimating the frequencies of known domain elements. We run our experiments on the same two data sets described earlier with the same setting of parameters, except that now we assume that we do not know the domain. As a part of our algorithm design, we assume that every element in the domain is from the English alphabet [a-z] and is of length exactly six. If elements are of larger length, we truncate them before entering them in the data set, and if they are of smaller length we pad them at the end. We generate the domain elements for the synthetic data set by first generating the frequency histogram based on the power law distribution described earlier, and then assigning random strings of length eight to each bin of the histogram. That becomes our data set. For the NLTK Brown corpus (NLT), we sample $n = 10$ million samples with replacement from the corpus to form our data set. We set a threshold of $15\sqrt n$ for being a heavy hitter. We measure the efficacy of our system by measuring precision and recall.

Figures 5 and 6 show the true data distributions for the synthetic and the NLTK data sets.

[Figure 5: Distribution for the synthetic data set (normalized frequency per word).]

[Figure 6: Distribution for the Brown corpus, top 100 words (normalized frequency per word).]

In Table 2 we state the corresponding precision and recall parameters.
Our recall numbers are much better than the precision numbers, primarily because of the large number of negative examples ($\approx 3 \times 10^8$ examples). In practice, if there are false positives, they can be easily pruned using domain expertise. For example, if we are trying to identify new words that users are typing in English (app, 2016), then using the domain expertise of English, a set of false positives can easily be ruled out by inspecting the list of heavy hitters output by the algorithm. Further, notice that since we are working with domain elements of size six characters, a brute-force algorithm would require $26^6$ queries to the frequency oracle, which would be computationally (near) infeasible. While there are other algorithms for finding heavy hitters (Bassily and Smith, 2015; Hsu et al., 2012), either they do not provide any theoretical guarantee on utility (Erlingsson et al., 2014; Fanti et al., 2015; Thakurta et al., 2017) or there does not exist a scalable and efficient implementation for them. Our work scores well on both these aspects.

| Data set | # of unique words | Precision | Recall |
|---|---|---|---|
| Synthetic | 93 | 0.36 (σ = 0.05) | 0.95 (σ = 0.03) |
| NLTK Brown corpus | 25991 | 0.24 (σ = 0.04) | 0.86 (σ = 0.05) |

Table 2: Private heavy hitters with threshold $15\sqrt n$. Here σ corresponds to the standard deviation.

5. Locally Private Heavy Hitters via Count Sketch: The TreeHist Protocol

We start with a detailed description of the construction described at a high level in Section 3.1. We refer to Section 3.1 for the definitions and public parameters that will be used in the construction, namely, prefixes, hashes, the basis matrix, global parameters, and public randomness. We restate below our public parameters and, when applicable, their specific settings.

Global parameters: We will assume that the total number of users $n$, the size of the Hadamard matrix $m$, the number of hash pairs $t$, the privacy parameter $\epsilon$, and the confidence parameter $\beta$ are public parameters, and hence they will not be explicitly provided as inputs to the algorithms. For integer parameters, we will implicitly assume that results are rounded to the nearest integer. We set $t = 110 \log(n/\beta)$ and $m = 48\sqrt{n/\log(n/\beta)}$. The hash functions $(h_1, g_1), \ldots, (h_t, g_t)$ will also be assumed to be public information (this is $O(\log(d)\log(n/\beta))$ bits of shared randomness).

Public randomness: In addition to the $t$ hash pairs $\{(h_1, g_1), \ldots, (h_t, g_t)\}$, we assume that the server creates a random partition $\Pi : [n] \to [\log d] \times [t]$ over the set of users; that is, each user $i$ gets a random pair $(\ell_i, j_i) \in [\log d] \times [t]$ that represents the index of one of $\log(d) \cdot t$ buckets. Moreover, the server uses another random function $Q : [n] \to [m]$ that assigns to each user $i$ a uniformly random index $r_i \in [m]$. We assume that such random indices $\ell_i, j_i, r_i$ are shared between the server and each user. For each $\ell \in [\log d]$ and each $j \in [t]$, we define $I_{\ell,j} \triangleq \{i : \ell_i = \ell,\ j_i = j\}$ and $\bar I_j \triangleq \{i : j_i = j\} = \cup_{\ell \in [\log d]} I_{\ell,j}$.
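The shared indices can be derived without extra communication; a minimal sketch (our own illustrative scheme, echoing the experiments' use of SHA-256 prefix bits; the paper only requires that the triples be shared between the server and each user):

```python
import hashlib

def shared_indices(user_id, seed, log_d, t, m):
    """Derive user i's triple (l_i, j_i, r_i) from a common seed, so that the
    server and the user agree on Pi(i) = (l_i, j_i) and Q(i) = r_i."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode()).digest()
    x = int.from_bytes(digest, "big")  # 256 near-uniform bits
    l_i = x % log_d + 1                # level in [log d]  (partition Pi)
    x //= log_d
    j_i = x % t + 1                    # hash-pair index in [t]  (partition Pi)
    x //= t
    r_i = x % m + 1                    # row index in [m]  (function Q)
    return l_i, j_i, r_i
```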
5.1. A Local Randomizer: LocalRnd

For each $i \in [n]$, user $i$ runs her own independent copy of Algorithm 1 below, referred to as LocalRnd, to generate her private report. We note that LocalRnd takes a flag Final $\in \{0, 1\}$ as an input. The role of this input will become clear when we discuss the full protocol; in a nutshell, the flag is used to distinguish between two invocations of LocalRnd. In particular, the local randomizer of each user is invoked twice in the full protocol: once during the first phase of the protocol (called the pruning phase), where the high-frequency items (heavy hitters) are identified, and a second time during the final phase (the final estimation phase), to enable the protocol to get better estimates for the frequencies of the heavy hitters.

Connection to count sketch and Hadamard transform: Our local randomizer can be thought of as a transformed, compressed (via sampling), and randomized version of the count sketch. Up to Step 5 in LocalRnd (Algorithm 1), our algorithm follows the standard count sketch algorithm (Charikar et al., 2002). Starting from Step 6, we deviate from the standard count sketch, as we apply the Hadamard transform to the user's signal and then sample one bit from the result. Indeed, Step 6 can be thought of as a composition of two operations: first, we multiply the indicator vector $e_{c_i} \in \{0,1\}^m$ by the scaled Hadamard matrix $W$ (this is equivalent to selecting $w_{c_i}$, the $c_i$-th column of $W$), then we randomly sample one entry from $w_{c_i}$. By doing so, we can achieve significant savings in space and communication without sacrificing accuracy.

Algorithm 1 LocalRnd
Input: user $i$'s item $v_i \in V$; flag Final $\in \{0, 1\}$.
1: Using shared randomness, get the random indices $(\ell_i, j_i) \in [\log d] \times [t]$ and $r_i \in [m]$.
2: if Final = 0 then
3: &nbsp;&nbsp;Set $s_i := g_{j_i}(v_i[1:\ell_i])$ and $c_i := h_{j_i}(v_i[1:\ell_i])$. {$v[1:\ell]$ denotes the $\ell$-bit prefix of $v$.}
4: else
5: &nbsp;&nbsp;Set $s_i := g_{j_i}(v_i)$ and $c_i := h_{j_i}(v_i)$.
6: Compute $x_i := s_i \cdot W_{r_i, c_i}$. {$W_{r,c}$ denotes the sign of the $(r, c)$ entry of $H_m$ (the Hadamard matrix of size $m$).}
7: Generate a randomized version $y_i$ of the bit $x_i$: $y_i = x_i$ w.p. $\frac{e^{\epsilon/2}}{e^{\epsilon/2}+1}$, and $y_i = -x_i$ w.p. $\frac{1}{e^{\epsilon/2}+1}$.
8: return $y_i$.

The output of LocalRnd is one bit per invocation. Hence, during the entire span of the protocol, each user sends 2 bits to the server. Here, we will assume that the user's identity (i.e., the index $i \in [n]$) associated with each report is known to the server (e.g., from a higher layer in the communication protocol stack).
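A compact rendering of Algorithm 1 (a sketch under assumed signatures: h(j, s) returns a 0-indexed column in [m] and g(j, s) a sign in {-1, +1}, standing in for the public hash pairs; v_bits is the item as a bit string):

```python
import math
import random

def local_rnd(v_bits, final, ell_i, j_i, r_i, h, g, eps, log_d):
    """Sketch of LocalRnd (Algorithm 1). The index r_i and the hash outputs
    are taken 0-indexed so the Hadamard sign is the parity of a bitwise AND."""
    ell = log_d if final else ell_i
    prefix = v_bits[:ell]                            # the prefix v_i[1:ell]
    s_i = g(j_i, prefix)
    c_i = h(j_i, prefix)
    x_i = s_i * (-1) ** bin(r_i & c_i).count("1")    # s_i * W[r_i, c_i]
    # eps-LDP randomized response on the single bit x_i (Step 7).
    if random.random() < math.exp(eps / 2) / (math.exp(eps / 2) + 1):
        return x_i
    return -x_i
```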
5.2. A Frequency Oracle: FreqOracle

Before describing our protocol for identifying the heavy hitters and estimating their frequencies, we first discuss a protocol for a simpler task. Suppose we want to allow the server to estimate the frequencies of some given subset $\hat V \subseteq \{0,1\}^\ell$, for some given $\ell \in [\log d]$, based on the users' reports. Algorithm 2 describes a protocol, denoted FreqOracle, for accomplishing this task. Note that in this protocol we assume that all items whose frequencies are in question are given as inputs.

For each queried item $\hat v \in \hat V$ and for each hash index $j \in [t]$, FreqOracle starts by collecting the noisy reports of the collection $I_{\ell,j}$ of users, where each user $i \in I_{\ell,j}$ is assigned a pair of prefix and hash indices $(\ell_i, j_i)$ that matches $(\ell, j)$. Next, it estimates the inverse Hadamard transform of the compressed and noisy signal of each user in $I_{\ell,j}$. It then sums all the results and multiplies the outcome by $g_j(\hat v)$ to obtain an estimate $\hat f_j(\hat v)$ for the frequency of $\hat v$. As in the count sketch algorithm, this is done for every $j \in [t]$; FreqOracle then obtains a high-confidence estimate for the frequency of $\hat v$ by computing the median of all $t$ frequency estimates.

Inverse transform and back to count sketch: We note here that Step 7 in FreqOracle (Algorithm 2) can be described as an inverse Hadamard transform of the users' compressed and noisy signals. In particular, each term $y_i \cdot W_{r_i, c}$ inside the sum can be described as a multiplication between $y_i e_{r_i}$ (where $e_{r_i}$ is the indicator vector with 1 at the $r_i$-th position) and the scaled Hadamard matrix $W$, followed by picking the $c$-th entry of the resulting vector. This brings us back to the standard count sketch representation (Charikar et al., 2002). The FreqOracle protocol then proceeds to process the frequency estimates in the same way a count sketch does. Indeed, Step 8 is a standard step in the count sketch algorithm, where a high-confidence frequency estimate is obtained via the median technique. Hence, we attain the functionality of the count sketch algorithm with much less space and communication, by transforming the users' signals to the Fourier domain, compressing the signals via sampling, and then transforming them back at the server.

Algorithm 2 FreqOracle
Input: prefix length $\ell \in [\log d]$; a subset of $\ell$-bit prefixes $\hat V \subseteq \{0,1\}^\ell$; a collection of $t$ disjoint subsets of users $\{\tilde I_j : j \in [t]\}$; oracle access to the items of the relevant users $\{v_i : i \in \cup_{j \in [t]} \tilde I_j\}$; scaling factor $\gamma$; flag Final $\in \{0, 1\}$. {$a_\epsilon \triangleq \frac{e^{\epsilon/2}+1}{e^{\epsilon/2}-1}$ is assumed to be a global constant seen by FreqOracle.}
1: for $\hat v \in \hat V$ do
2: &nbsp;&nbsp;for hash index $j = 1$ to $t$ do
3: &nbsp;&nbsp;&nbsp;&nbsp;Set $s := g_j(\hat v)$ and $c := h_j(\hat v)$.
4: &nbsp;&nbsp;&nbsp;&nbsp;for users $i \in \tilde I_j$ do
5: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Get user $i$'s 1-bit report: $y_i = \mathrm{LocalRnd}(v_i, \mathrm{Final})$.
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Get user $i$'s random index $r_i = Q(i)$ using public randomness.
7: &nbsp;&nbsp;&nbsp;&nbsp;Compute the $j$-th estimate of the frequency of $\hat v$: $\hat f_j(\hat v) := \gamma\, a_\epsilon \sum_{i \in \tilde I_j} y_i \cdot s \cdot W_{r_i, c}$.
8: &nbsp;&nbsp;Compute the final estimate of the frequency of $\hat v$: $\hat f(\hat v) := \mathrm{Median}\left\{\hat f_j(\hat v) : j \in [t]\right\}$.
9: FreqList := $\left\{\left(\hat v, \hat f(\hat v)\right) : \hat v \in \hat V\right\}$.
10: return FreqList.

5.3. Succinct Histogram via a Tree Aggregation Protocol: TreeHist

We now describe our protocol (Algorithm 3), denoted TreeHist, that outputs a succinct histogram; that is, it outputs a list of heavy hitters together with estimates for their frequencies. W.l.o.g., we will assume that $n > \log(n/\beta)\log(d)$, since otherwise we cannot guarantee less than trivial error (i.e., an error of order $n$).

A high-level description of the protocol is given in Section 3.1.3. The protocol comprises two main phases: the pruning phase and the final estimation phase. We note that the binary flag Final is used to distinguish between these two phases. In particular, depending on the value of this flag, the partition of users and the scaling factor passed to FreqOracle differ. More precisely, when Final = 0 (the pruning phase), for each prefix length $\ell$ the partition of users is given by the collection $\{I_{\ell,j} : j \in [t]\}$ and the scaling factor is set as $\gamma = t \log d$. When Final = 1 (the final phase), the partition of users is $\{\bar I_j : j \in [t]\}$ and the scaling factor is $\gamma = t$.

Algorithm 3 TreeHist: The Full Protocol
Input: oracle access to the users' items $(v_i \in V : i \in [n])$.
1: Set $\eta := 147\sqrt{n\log(n/\beta)\log(d)}/\epsilon$ and $a_\epsilon := \frac{e^{\epsilon/2}+1}{e^{\epsilon/2}-1}$.
2: Set $\gamma := t \log d$. {This setting will change at the final stage of the protocol.}
3: Initialize Prefixes $:= \{\perp\}$ (the empty string).
4: for prefix length $\ell = 1$ to $\log d$ do
5: &nbsp;&nbsp;Get the collection $\{I_{\ell,j} : j \in [t]\}$ using the random partition $\Pi$. {See the "Public randomness" paragraph above.}
6: &nbsp;&nbsp;$\left\{\left(\hat v, \hat f(\hat v)\right) : \hat v \in \mathrm{ChildSet(Prefixes)}\right\}$ := FreqOracle$\left(\ell,\ \mathrm{ChildSet(Prefixes)},\ \{I_{\ell,j} : j \in [t]\},\ \{v_i : i \in \cup_{j \in [t]} I_{\ell,j}\},\ \gamma,\ \mathrm{Final} = 0\right)$.
7: &nbsp;&nbsp;Initialize NewPrefixes $:= \emptyset$.
8: &nbsp;&nbsp;for $\hat v \in \mathrm{ChildSet(Prefixes)}$ do
9: &nbsp;&nbsp;&nbsp;&nbsp;if $\hat f(\hat v) \ge 2\eta$ then
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add $\hat v$ to NewPrefixes.
11: &nbsp;&nbsp;Update Prefixes $\leftarrow$ NewPrefixes.
12: Set $\gamma := t$.
13: SuccHist := FreqOracle$\left(\log d,\ \mathrm{Prefixes},\ \{\bar I_j : j \in [t]\},\ \{v_i : i \in [n]\},\ \gamma,\ \mathrm{Final} = 1\right)$. {Here $\bar I_j = \cup_{\ell \in [\log d]} I_{\ell,j}$, as defined earlier in the "Public randomness" paragraph.}
14: return SuccHist.

5.3.1. Running Time and Processing Memory of TreeHist

Running time: We note that Algorithm FreqOracle (Algorithm 2) is invoked $\log(d) + 1 = O(\log d)$ times by TreeHist (Algorithm 3): once per level of the tree, and once more in the final estimation phase at the bottom level of the tree. The main time-consuming step in FreqOracle is the summation in Step 7. This step is executed $O\left(\sqrt{\frac{n \log n}{\log d}}\right)$ times in every invocation of FreqOracle, since there are $O\left(\sqrt{\frac{n}{\log n \log d}}\right)$ nodes queried in every level of the tree and for each node FreqOracle computes $t \approx \log n$ frequency estimates. That is, in total, this step is executed $O\left(\sqrt{n \log n \log d}\right)$ times in the TreeHist protocol. A direct implementation would involve summing up to $n$ bits each time this step is executed, and hence would amount to a running time of roughly $n^{1.5}$. However, we now show that this can be reduced to nearly $n$.

Consider Step 7 of Algorithm FreqOracle when FreqOracle is invoked by TreeHist at any given prefix level $\ell \in [\log d]$ during the pruning phase. (We do not consider the final estimation phase here, since its running time is dominated by that of the pruning phase.) Note that since the size of the basis matrix $W$ is $m = O\left(\sqrt{\frac{n}{\log n}}\right)$, there are only $O\left(\sqrt{\frac{n}{\log n}}\right)$ values that the row index $r_i$ can take. Hence, $\hat f_j(\hat v)$ (computed in Step 7) can be expressed as follows:

$$\hat f_j(\hat v) = \gamma\, a_\epsilon\, g_j(\hat v) \sum_{\kappa \in [m]} W_{\kappa,\, h_j(\hat v)} \left(\sum_{i \in I_{\ell,j}:\ r_i = \kappa} y_i\right).$$

Thus, to implement Step 7 of FreqOracle for all items $(\ell, j, \hat v)$, we first compute $\sum_{i \in I_{\ell,j}: r_i = \kappa} y_i$ for every $\kappa \in [m]$, $\ell \in [\log d]$, and $j \in [t]$ (this amounts to a running time of $O(n)$ in total). Then, for every value of $(\ell, j, \hat v)$, computing $\hat f_j(\hat v)$ requires summing $m = O\left(\sqrt{\frac{n}{\log n}}\right)$ numbers. Hence, in total, the running time of TreeHist is $O\left(\sqrt{\frac{n}{\log n \log d}} \cdot m \cdot t \cdot \log d\right) = O\left(n\sqrt{\log d}\right)$.

Processing memory: For the implementation described above, Algorithm FreqOracle maintains $m \cdot t \cdot \log d$ sums, each over at most $n$ bits (and hence representable in $O(\log n)$ bits). This requires a processing memory of $O\left(\sqrt n \log^{1.5}(n) \log(d)\right)$ bits. The memory required for all the remaining steps of the TreeHist protocol does not exceed this amount; hence, the total processing memory required by TreeHist is $O\left(\sqrt n \log^{1.5}(n) \log(d)\right)$ bits.
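The bucketing trick above in code form (a sketch with assumed, simplified signatures: reports are 0-indexed (l_i, j_i, r_i, y_i) tuples, and h(j, v), g(j, v) stand in for the public hash pairs):

```python
import numpy as np

def freq_estimate(reports, log_d, t, m, gamma, a_eps, h, g, v_hat, ell):
    """One O(n) pass buckets the +/-1 reports by (level, hash index, row
    index); each per-hash estimate then sums only m = O(sqrt(n / log n))
    bucket totals, as in Section 5.3.1."""
    sigma = np.zeros((log_d, t, m))
    for (l, j, r, y) in reports:               # single pass over all users
        sigma[l, j, r] += y
    estimates = []
    for j in range(t):
        c = h(j, v_hat)
        # Column c of W, evaluated on demand: W[k, c] = (-1)^popcount(k & c).
        w_col = np.array([-1 if bin(k & c).count("1") % 2 else 1 for k in range(m)])
        estimates.append(gamma * a_eps * g(j, v_hat) * (w_col @ sigma[ell, j]))
    return float(np.median(estimates))
```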
5.3.2. Privacy and Utility Guarantees

In this section we provide the privacy and utility guarantees for the TreeHist protocol.

Theorem 5.1 [$\epsilon$-Local Differential Privacy of the TreeHist Protocol] The TreeHist protocol (Algorithm 3) is $\epsilon$-local differentially private.

Theorem 5.2 [Utility of the TreeHist Protocol] Let $\eta = 147\sqrt{n\log(n/\beta)\log(d)}/\epsilon$. With probability at least $1 - \beta$ (over the choice of all public and private randomness), the output SuccHist of the TreeHist protocol (Algorithm 3) satisfies the following properties:

1. SuccHist contains all items $v \in V$ with true frequencies above $3\eta$.
2. SuccHist does not contain any item $v \in V$ with true frequency below $\eta$.
3. Every frequency estimate in SuccHist is accurate up to an error $\frac{147}{\epsilon}\sqrt{n\log(n/\beta)}$.

We defer the proofs of Theorems 5.1 and 5.2 to Appendix A. The following lemma states the error guarantees of Algorithm FreqOracle as used by the TreeHist protocol. Note that FreqOracle is invoked by the TreeHist protocol during both the pruning and the final estimation phases. This lemma is central to our proof of Theorem 5.2.

In the following lemma, we will use $\{\tilde I_j\}$ to generically denote the collection of user subsets passed to FreqOracle. As described early in Section 5.3, the form of this collection depends on whether the protocol is in the pruning phase (flag Final = 0) or the final estimation phase (Final = 1). In particular, for any prefix length $\ell \in [\log d]$, when FreqOracle is invoked by TreeHist in the pruning phase (i.e., Final = 0), the collection $\{\tilde I_j : j \in [t]\} = \{I_{\ell,j} : j \in [t]\}$ and the scaling factor is $\gamma = t\log d$, whereas when FreqOracle is invoked by TreeHist in the final phase (i.e., Final = 1), the collection $\{\tilde I_j : j \in [t]\} = \{\bar I_j : j \in [t]\}$ and $\gamma = t$. We will use the generic notation $\{\tilde I_j : j \in [t]\}$ and $\gamma$, since the same proof works for both cases.

Lemma 5.3 Let $\beta \in (0, 1)$. Let the number of hash pairs be $t \ge 110\log(n/\beta)$, and the size of the Hadamard basis matrix be $m \ge 48\sqrt{n/\log(n/\beta)}$. Consider Algorithm FreqOracle (Algorithm 2) as invoked by the TreeHist protocol (Algorithm 3). For any $\ell \in [\log d]$, suppose that FreqOracle is invoked on inputs: $\ell$; a subset $\hat V \subseteq \{0,1\}^\ell$ of size $|\hat V| \le n$; a collection of user subsets $\{\tilde I_j : j \in [t]\}$; and a scaling factor $\gamma$. Assuming that $n \ge 48\, t\log(d)$, then with probability at least $1 - \frac{\beta}{\log d}$ we have

$$\forall v \in \hat V :\quad |\hat f(v) - f(v)| \le 14\sqrt{n\gamma}/\epsilon,$$

where $\hat f(v)$ is the estimate of FreqOracle for the true frequency $f(v)$ of the item $v \in \hat V$.

Proof Fix $\beta \in (0,1)$, $\ell \in [\log d]$, $t \ge 110\log(n/\beta)$, $m \ge 48\sqrt{n/\log(n/\beta)}$, and a subset $\hat V \subseteq \{0,1\}^\ell$ of size $|\hat V| \le n$. There are three independent sources of randomness. The first is due to the randomness in the collection $\{\tilde I_j : j \in [t]\}$ induced by the random partitioning of users via $\Pi$. The second source of randomness is due to the randomness in the choice of the $t$ hash pairs $(h_1, g_1), \ldots, (h_t, g_t)$. The third source of randomness is due to the random row indices $r_i$, $i \in [n]$, generated by $Q$, and the randomization for privacy (Step 7 in Algorithm 1).

Before we discuss the guarantees we can attain under these sources of randomness, we first introduce some notation. For $j \in [t]$ and $v \in \hat V$, we define $f_{\tilde I_j}(v) \triangleq \sum_{i \in \tilde I_j} \mathbb{1}\left(v_i[1:\ell] = v\right)$; that is, $f_{\tilde I_j}(v)$ is the number of users in $\tilde I_j$ for whom the $\ell$-bit prefix of their item is $v$. We also define $n_{\tilde I_j} \triangleq |\tilde I_j|$, i.e., $n_{\tilde I_j}$ is the number of users in $\tilde I_j$. Let $\hat f_{\tilde I_1}(v), \ldots, \hat f_{\tilde I_t}(v)$ be independent Poisson random variables with mean $\frac{f(v)}{\gamma}$, and let $\hat n_{\tilde I_1}, \ldots, \hat n_{\tilde I_t}$ be independent Poisson random variables with mean $\frac{n}{\gamma}$. For each $v \in \hat V$, let

$$G_1(v) = \left\{ j \in [t] :\ |\gamma f_{\tilde I_j}(v) - f(v)| \le 4\sqrt{n\gamma},\ \ \tfrac{n}{2} \le \gamma\, n_{\tilde I_j} \le 2n \right\}.$$

Claim 5.4 With probability at least $1 - \frac{\beta}{3\log d}$ over the randomness in $\{\tilde I_j : j \in [t]\}$, for every $v \in \hat V$ we have $|G_1(v)| \ge \frac{9}{10}t$.

Fix $v \in \hat V$ and $j \in [t]$. Consider the Poisson random variables $\hat f_{\tilde I_j}(v)$ and $\hat n_{\tilde I_j}$, with means $\frac{f(v)}{\gamma}$ and $\frac{n}{\gamma}$, respectively. By Theorem 2.5 (a tail bound for the Poisson distribution), the union bound, and the assumption that $n \ge 48\gamma$ (implied by the assumption stated in the lemma), with probability at least 0.996 we have $|\gamma \hat f_{\tilde I_j}(v) - f(v)| \le 4\sqrt{n\gamma}$ and $\frac{n}{2} \le \gamma \hat n_{\tilde I_j} \le 2n$. Now, consider the sequences of independent Poisson random variables $\hat f_{\tilde I_1}(v), \ldots, \hat f_{\tilde I_t}(v)$ and $\hat n_{\tilde I_1}, \ldots, \hat n_{\tilde I_t}$ as defined above. Using the fact that $t \ge 110\log(n/\beta) \ge 55\log\left(\frac{3 e n\log d}{\beta}\right)$, by Chernoff's bound, with probability at least $1 - \frac{\beta}{3 e n\log d}$ we have

$$\left|\left\{ j \in [t] :\ |\gamma \hat f_{\tilde I_j}(v) - f(v)| \le 4\sqrt{n\gamma},\ \ \tfrac{n}{2} \le \gamma \hat n_{\tilde I_j} \le 2n \right\}\right| \ge \frac{9}{10}t.$$

Using Theorem 2.4 (the Poisson approximation), with probability at least $1 - \frac{\beta}{3 n\log(d)}$ we have $|G_1(v)| \ge \frac{9}{10}t$. Hence, by the fact that $|\hat V| \le n$ and the union bound, with probability at least $1 - \frac{\beta}{3\log d}$, for all $v \in \hat V$ we have $|G_1(v)| \ge \frac{9}{10}t$.

For the remainder of the proof, we will condition on the event in Claim 5.4.
Let $w_c$ denote the $c$-th column of the basis matrix $W$. Since the prefix length $\ell$ is fixed, we will denote $v_i[1:\ell]$ (the $\ell$-bit prefix of the item of user $i$) as $v_i$, for brevity. For each $v \in \hat V$, let

$$G_2(v) = \left\{ j \in [t] :\ \left|\gamma \sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m} \left\langle w_{h_j(v_i)}, w_{h_j(v)} \right\rangle - f(v)\right| \le 8\sqrt{\gamma n} \right\}.$$

Claim 5.5 Conditioned on the event in Claim 5.4, with probability at least $1 - \frac{\beta}{3\log d}$ over the choice of the $t$ hash pairs $\{(h_j, g_j) : j \in [t]\}$, for all $v \in \hat V$ we have $|G_2(v)| \ge \frac45 t$.

Fix $v \in \hat V$ and $j \in G_1(v)$. First consider $\left|\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\langle w_{h_j(v_i)}, w_{h_j(v)}\rangle - f_{\tilde I_j}(v)\right|$. Note that the columns of $W$ are orthogonal and each of them has norm $\sqrt m$. Hence, the error quantity above comes from those $i \in \tilde I_j$ with $v_i \ne v$ and yet $h_j(v_i) = h_j(v)$. In particular, we can write this error as $\left|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\right|$, where $z_i = \mathbb 1\left(v_i \ne v,\ h_j(v_i) = h_j(v)\right)$. Define $V_{\mathrm{Bad}} = \left\{\hat v \in \hat V : f_{\tilde I_j}(\hat v) \ge 2\sqrt{n_{\tilde I_j}}\right\}$. Note that $|V_{\mathrm{Bad}}| \le \sqrt{n_{\tilde I_j}}/2$. Now, observe that

$$\Pr\left[\Big|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\Big| > 2\sqrt{n_{\tilde I_j}}\right] \le \Pr_{h_j, g_j}\left[\exists\, \hat v \in V_{\mathrm{Bad}} :\ h_j(\hat v) = h_j(v)\right] + \Pr\left[\Big|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\Big| > 2\sqrt{n_{\tilde I_j}}\ \middle|\ \forall\, \hat v \in V_{\mathrm{Bad}} :\ h_j(\hat v) \ne h_j(v)\right].$$

By the pairwise independence property of $h_j$ and the union bound, the first probability term on the right-hand side is bounded from above by $\frac{|V_{\mathrm{Bad}}|}{m} \le \frac{\sqrt{n_{\tilde I_j}}}{2m}$. We now consider the second probability term. Note that the event we condition on in the second probability term implies that for every $i \in \tilde I_j$ with $z_i = 1$ we must have $f_{\tilde I_j}(v_i) < 2\sqrt{n_{\tilde I_j}}$. Hence, conditioned on this event, by the pairwise independence of each of $h_j$ and $g_j$, we have

$$\mathrm{Var}\left[\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\right] \le \sum_{i \in \tilde I_j} \mathrm{Var}\left[z_i\, g_j(v_i)\right] + \sum_{\substack{i, k \in \tilde I_j:\\ v_i = v_k \ne v}} \mathbb E[z_i z_k]\; \mathbb E[g_j(v_i)\, g_j(v_k)] \le n_{\tilde I_j}/m + 2 n_{\tilde I_j}^{3/2}/m \le 3 n_{\tilde I_j}^{3/2}/m.$$

Hence, by Chebyshev's inequality, the second probability term is bounded by $\frac{3\sqrt{n_{\tilde I_j}}}{4m}$. Altogether,

$$\Pr\left[\Big|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\Big| > 2\sqrt{n_{\tilde I_j}}\right] \le \frac{5\sqrt{n_{\tilde I_j}}}{4m} \le \frac{1}{250},$$

where the last inequality follows from the fact that $j \in G_1(v)$ (so that $n_{\tilde I_j} \le 2n/\gamma$, with $\gamma \ge t \ge 110\log(n/\beta)$) and the fact that $m \ge 48\sqrt{n/\log(n/\beta)}$. Thus, with probability at least 0.996, we have

$$\left|\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle - f_{\tilde I_j}(v)\right| \le 2\sqrt{n_{\tilde I_j}}.$$

Now, as we conditioned on the event in Claim 5.4, this implies that with probability at least 0.996 we have

$$\left|\gamma \sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle - f(v)\right| \le \gamma\left|\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle - f_{\tilde I_j}(v)\right| + \left|\gamma f_{\tilde I_j}(v) - f(v)\right| \le 8\sqrt{\gamma n}.$$

Conditioned on the event of Claim 5.4, $|G_1(v)| \ge \frac{9}{10}t \ge 99\log(n/\beta) \ge 49\log\left(\frac{3 n\log d}{\beta}\right)$; note also that the sums $\left\{\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\langle w_{h_j(v_i)}, w_{h_j(v)}\rangle : j \in [t]\right\}$ are independent. Thus, conditioned on the event of Claim 5.4, by Chernoff's bound, with probability at least $1 - \frac{\beta}{3 n\log d}$ we have $|G_2(v)| \ge \frac{22}{25}|G_1(v)| \ge \frac45 t$. Hence, by the fact that $|\hat V| \le n$ and the union bound, with probability at least $1 - \frac{\beta}{3\log d}$, for all $v \in \hat V$ we have $|G_2(v)| \ge \frac45 t$. We continue the proof while conditioning on this event as well.

For every $i \in [n]$, let $r_i \in [m]$ be the row index chosen uniformly at random for user $i$ using the random function $Q$, and let $y_i$ denote the randomized bit generated by user $i$ in Step 7 of Algorithm 1. As denoted in Algorithm FreqOracle, for each $v \in \hat V$, let $\hat f_j(v) = \gamma\, a_\epsilon \sum_{i \in \tilde I_j} y_i \cdot g_j(v) \cdot W_{r_i, h_j(v)}$ denote the $j$-th frequency estimate of FreqOracle for $v \in \hat V$. For each $v \in \hat V$, let

$$G_3(v) = \left\{ j \in [t] :\ \left|\hat f_j(v) - f(v)\right| \le 14\sqrt{n\gamma}/\epsilon \right\}.$$

Claim 5.6 Conditioned on the events in Claims 5.4 and 5.5, with probability at least $1 - \frac{\beta}{3\log d}$ over the randomness in $\{(r_i, y_i) : i \in \tilde I_j,\ j \in [t]\}$, for all $v \in \hat V$ we have $|G_3(v)| \ge \frac{7}{10}t$.

Fix $v \in \hat V$ and $j \in G_2(v)$. Note that, conditioned on any realization of $\tilde I_j$, each term in the sum

$$\sum_{i \in \tilde I_j}\left(a_\epsilon\, y_i\, g_j(v)\, W_{r_i, h_j(v)} - \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle\right)$$

is an independent, zero-mean random variable whose magnitude is bounded by $a_\epsilon + 1 = O\left(\frac1\epsilon\right)$.
Hence, by Chernoff's bound, with probability at least 0.99, we have

$$\Big| \sum_{i \in I_j} \Big( a_\epsilon\,y_i\,g_j(v)\,W_{r_i, h_j(v)} - g_j(v_i)\,g_j(v)\,\tfrac{1}{m}\langle w_{h_j(v_i)}, w_{h_j(v)}\rangle \Big) \Big| \leq 4\sqrt{n_{I_j}}/\epsilon.$$

Thus, conditioned on the events in Claims 5.4 and 5.5, with probability at least 0.99, we have

$$\Big| \gamma \sum_{i \in I_j} a_\epsilon\,y_i\,g_j(v)\,W_{r_i, h_j(v)} - f(v) \Big| \leq 14\sqrt{\gamma n}/\epsilon,$$

i.e., $|\hat f_j(v) - f(v)| \leq 14\sqrt{n\gamma}/\epsilon$. Conditioned on the event of Claim 5.5, $|G_2(v)| \geq \tfrac{4}{5}t \geq 88\log(n/\beta) \geq 44\log(3n\log(d)/\beta)$. We note also that the above sums for $j = 1,\dots,t$ are independent. Thus, conditioned on the event of Claim 5.5, by Chernoff's bound, with probability at least $1 - \beta/(3n\log d)$, we have $|G_3(v)| \geq \tfrac{22}{25}|G_2(v)| \geq \tfrac{88}{125}t \geq \tfrac{7}{10}t$. Hence, by the fact that $|\widehat V| \leq n$ and the union bound, with probability at least $1 - \beta/(3\log d)$, for all $v \in \widehat V$ we have $|G_3(v)| \geq \tfrac{7}{10}t$.

By combining Claims 5.4, 5.5, and 5.6, we conclude that with probability at least $1 - \beta/\log d$, for all $v \in \widehat V$ we have

$$\Big|\big\{ j \in [t] : |\hat f_j(v) - f(v)| \leq 14\sqrt{n\gamma}/\epsilon \big\}\Big| > \tfrac{1}{2}t.$$

Since for any item $v$ the final frequency estimate $\hat f(v)$ generated by FreqOracle is the median of $\hat f_1(v),\dots,\hat f_t(v)$, the above implies that with probability at least $1 - \beta/\log d$, for all $v \in \widehat V$ we have $|\hat f(v) - f(v)| \leq 14\sqrt{n\gamma}/\epsilon$. This completes the proof of the lemma.

6. Locally Private Heavy Hitters Bit-by-bit: The Bitstogram Protocol

In this section we provide the full details of our Bitstogram protocol. We will use the following notation. Let $S \in V^n$ be a database, which may be distributed across $n$ users (each holding one row). For $v \in V$, we will be interested in estimating the multiplicity of $v$ in $S$, i.e., $f_S(v) = |\{v_i \in S : v_i = v\}|$. As we explained in Section 3.2, we provide both an alternative frequency oracle protocol and an alternative protocol for identifying the heavy hitters.

6.1. Warmup: A Simple Protocol for Heavy-Hitters

For readability, we first present a simplification of our protocol that captures most of the ideas presented in Section 3.2. This simplification does not achieve the optimal error rate, space complexity, and time complexity that we aim for. We will later modify this construction in order to reduce these complexities.

6.1.1. Frequency Oracle

Our protocols use the simple local randomizer R (Algorithm 4), where every user holds one bit and flips it with probability $1/(e^\epsilon+1)$.

Algorithm 4 R: Basic Randomizer
Inputs: $x \in \{\pm1\}$, and privacy parameter $\epsilon$.
1. Generate and return a random bit $z$, where $z = x$ w.p. $e^\epsilon/(e^\epsilon+1)$ and $z = -x$ w.p. $1/(e^\epsilon+1)$.

Let $Z \in \{\pm1\}^{d\times n}$ be a matrix chosen uniformly at random, and assume that $Z$ is publicly known. That is, for every domain element $v \in V$ and every user $j \in [n]$, we have a random bit $Z[v,j] \in \{\pm1\}$. As $Z$ is publicly known, every user $j$ can identify its corresponding bit $Z[v_j, j]$, where $v_j \in V$ is the input of user $j$. We now present a simple protocol, ExplicitHist, in which users send randomized responses of their corresponding bits in this matrix. As we show, this allows the server to estimate the frequency of every domain element.

Algorithm 5 ExplicitHist
Public randomness: Uniformly random matrix $Z \in \{\pm1\}^{d\times n}$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. Define $\tilde S = (\tilde v_1,\dots,\tilde v_n)$, where $\tilde v_j = Z[v_j, j]$.
Input: Oracle access to $\tilde S$.
1. For $j \in [n]$, get user $j$'s 1-bit report: $y_j \leftarrow R(\tilde v_j)$.
2. On input $v \in V$, return $a(v) = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{j\in[n]} y_j\,Z[v,j]$, and wait for the next input.
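Before stating the utility guarantee, here is a minimal NumPy sketch of Algorithms 4 and 5 (ours, for illustration only; the domain is kept tiny so that the $d \times n$ matrix $Z$ fits in memory, and all parameter values are assumptions). It simulates the one-bit randomized reports and the debiased estimator $a(v)$.

```python
import numpy as np

# Minimal sketch of Algorithms 4 and 5 (ExplicitHist) for a small domain.
rng = np.random.default_rng(1)
eps, d, n = 1.0, 16, 200_000

v = rng.integers(0, 4, size=n)            # users' inputs (concentrated on {0,1,2,3})
Z = rng.choice([-1, 1], size=(d, n))      # public random matrix Z in {+-1}^(d x n)

x = Z[v, np.arange(n)]                    # user j's bit: Z[v_j, j]
flip = rng.random(n) < 1.0 / (np.exp(eps) + 1)
y = np.where(flip, -x, x)                 # Algorithm 4: randomized response R(x)

def a(query):
    """Debiased frequency estimate for a domain element (Algorithm 5, step 2)."""
    return (np.exp(eps) + 1) / (np.exp(eps) - 1) * np.sum(y * Z[query])

for q in range(4):
    print(q, int(np.sum(v == q)), round(a(q)))  # true vs. estimated frequency
```

The debiasing factor $\frac{e^\epsilon+1}{e^\epsilon-1}$ exactly cancels the shrinkage introduced by the bit flips, which is the computation carried out in the proof below.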
Lemma 6.1 Let $\epsilon \leq 1$, and fix a subset $V' \subseteq V$ of size $d' \leq d$. With probability at least $1-\beta$, algorithm ExplicitHist answers every $v \in V'$ with $a(v)$ satisfying: $|a(v) - f_S(v)| \leq \frac{3}{\epsilon}\sqrt{n\ln(4d'/\beta)}$.

Proof Fix $v \in V'$, and denote $c(v) = \sum_{j\in[n]} y_j\,Z[v,j]$; recall that algorithm ExplicitHist answers the query $v$ with $a(v) = \frac{e^\epsilon+1}{e^\epsilon-1}c(v)$. We start by analyzing the expectation of $c(v)$:

$$\mathbb{E}[c(v)] = \sum_{j\in[n]}\mathbb{E}[y_j\,Z[v,j]] = \sum_{j\in[n]:\,v_j=v}\mathbb{E}[y_j\,Z[v,j]] + \sum_{j\in[n]:\,v_j\neq v}\mathbb{E}[y_j]\,\mathbb{E}[Z[v,j]] = f_S(v)\,\frac{e^\epsilon-1}{e^\epsilon+1}.$$

That is, $c(v)$ can be expressed as a sum of $n$ independent $\pm1$ random variables: $f_S(v)$ variables with expectation $\frac{e^\epsilon-1}{e^\epsilon+1}$, and $(n - f_S(v))$ variables with expectation 0. Using the Hoeffding bound, with probability at least $1 - \beta/d'$ we have $|c(v) - \frac{e^\epsilon-1}{e^\epsilon+1}f_S(v)| \leq \sqrt{n\ln(4d'/\beta)}$. That is, $|a(v) - f_S(v)| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{n\ln(4d'/\beta)} \leq \frac{3}{\epsilon}\sqrt{n\ln(4d'/\beta)}$. Using the union bound, this holds simultaneously for every $v \in V'$ with probability at least $1-\beta$.

Observation 6.2 For the analysis above it suffices that, for every $j \in [n]$, the entries of column $j$ of $Z$ are only pairwise independent. Furthermore, appealing to Lemma 2.3 (concentration of $k$-wise independent random variables) instead of the Hoeffding bound, it suffices that, for every $v \in V$, the entries of row $v$ of $Z$ are only $k$-wise independent, for $k = 3\ln(d/\beta)$.

6.1.2. A Simple Heavy Hitters Protocol

We next give a (simplified version of) our construction that uses a frequency oracle protocol with worst-case error $\tau$ in order to identify the heavy hitters. Let $h: V \to [T]$ be a (publicly known) random hash function, mapping domain elements into $[T]$, where $T$ will be set later. We will use $h$ in order to identify the heavy hitters. To that end, let $v^* \in V$ denote such a heavy hitter, appearing at least $2\tau$ times in the database $S$, and denote $t^* = h(v^*)$. Assuming that $T$ is big enough, w.h.p. we will have that $v^*$ is the only input element (from $S$) that is mapped (by $h$) into the hash value $t^*$. Assuming that this is indeed the case, we will now identify $v^*$ bit by bit. For $\ell \in [\log d]$, denote $S_\ell = (h(v_j), v_{j,\ell})_{j\in[n]}$, where $v_{j,\ell}$ is bit $\ell$ of $v_j$. That is, $S_\ell$ is a database over the domain $[T]\times\{0,1\}$, where the row corresponding to user $j$ is $(h(v_j), v_{j,\ell})$. Observe that every user can compute her own row locally. As $v^*$ is a heavy hitter, for every $\ell \in [\log d]$ we have that $(t^*, v^*_\ell)$ appears in $S_\ell$ at least $2\tau$ times, where $v^*_\ell$ is bit $\ell$ of $v^*$. On the other hand, as we assumed that $v^*$ is the only input element that is mapped into $t^*$, we get that $(t^*, 1-v^*_\ell)$ does not appear in $S_\ell$ at all. Recall that our frequency oracle has error at most $\tau$, and hence we can use it to accurately determine the bits of $v^*$. The details are given in algorithm SuccinctHist below; a short code sketch follows the algorithm.

Algorithm 6 SuccinctHist
Public randomness: Random hash function $h: V \to [T]$. Random partition of $[n]$ into $\log d$ subsets $I_1,\dots,I_{\log d}$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. For $\ell \in [\log d]$, let $S_\ell = (h(v_j), v_{j,\ell})_{j\in I_\ell}$, where $v_{j,\ell}$ is bit $\ell$ of $v_j$. That is, $S_\ell$ is a database over the domain $[T]\times\{0,1\}$.
Input: Oracle access to $S$ and to $S_\ell$ for $\ell \in [\log d]$.
1. For $\ell \in [\log d]$, use ExplicitHist($S_\ell$) with privacy parameter $\frac{\epsilon}{2}$ to get $a_\ell(t,b)$ for all $(t,b) \in [T]\times\{0,1\}$.
2. For $t \in [T]$, define $\hat v_t \in V$, where bit $\ell$ of $\hat v_t$ is $\hat v_{t,\ell} = \arg\max_{b\in\{0,1\}} a_\ell(t,b)$.
3. Use ExplicitHist($S$) with privacy parameter $\frac{\epsilon}{2}$ to obtain $a(\hat v_t)$ for all $t \in [T]$.
4. Return the list $L = \{(\hat v_t, a(\hat v_t)) : t \in [T]\}$.
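The following sketch (ours, illustrative only) isolates the bit-by-bit recovery logic of Algorithm 6. To keep it short, a stand-in oracle returning true counts plus bounded noise of magnitude $\tau$ replaces ExplicitHist; the hash function and all parameters are assumptions.

```python
import numpy as np

# Sketch of the bit-by-bit recovery in Algorithm 6 (SuccinctHist).
rng = np.random.default_rng(2)

n, num_bits, T, tau = 100_000, 16, 64, 300
heavy = 12345                                  # a heavy hitter (appears often)
vals = np.where(rng.random(n) < 0.05, heavy,   # ~5% of users hold `heavy`
                rng.integers(0, 2**num_bits, size=n))

h = rng.permutation(2**num_bits) % T           # a toy hash V -> [T]

def noisy_count(mask):
    """Stand-in frequency oracle: true count + noise bounded by tau."""
    return mask.sum() + rng.uniform(-tau, tau)

# Recover, for each hash cell t, a candidate item bit by bit.
candidates = []
for t in range(T):
    in_cell = h[vals] == t
    bits = []
    for ell in range(num_bits):
        bit_ell = (vals >> ell) & 1
        a0 = noisy_count(in_cell & (bit_ell == 0))
        a1 = noisy_count(in_cell & (bit_ell == 1))
        bits.append(int(a1 > a0))
    candidates.append(sum(b << ell for ell, b in enumerate(bits)))

print(heavy in candidates)                     # True w.h.p.: heavy hitter recovered
```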
Lemma 6.3 Let $\epsilon \leq 1$, denote $w \triangleq 32\log(d)\log(16\log d) + \frac{48}{\epsilon}\sqrt{2n\log(d)\ln(64\log d)}$, and set $T = 32n/w$. Algorithm SuccinctHist returns a list $L$ of length $T$ satisfying:
1. With probability $1-\beta$, for every $(v,a) \in L$ we have $|a - f_S(v)| \leq \frac{6}{\epsilon}\sqrt{n\ln(4T/\beta)}$.
2. For every $v \in V$ s.t. $f_S(v) \geq w$, with probability $1/2$ we have that $v$ is in $L$.

Remark 3 In this simplified version of our protocol, we successfully recover a heavy hitter only if we recover (exactly) all of its bits. As will be illustrated in Section 6.2.2, this requirement can be relaxed using an error-correcting code, so that in order to recover a heavy hitter $v$ it suffices to recover only part of its bits correctly.

Proof Item 1 of the lemma follows directly from Lemma 6.1. We now prove item 2. Assuming that $n \geq 12\log(d)\log(12\log d)$, by the Chernoff bound, with probability at least $7/8$ (over partitioning $[n]$ into subsets $I_1,\dots,I_{\log d}$), for every $\ell \in [\log d]$ we have $\frac{n}{2\log d} \leq |I_\ell| \leq \frac{2n}{\log d}$. We continue the analysis assuming that this is the case. Fix $v^* \in V$ s.t. $f_S(v^*) \geq w$, and consider the following good event (over sampling $h$):

Event $E_1$: $|\{v \in S : v \neq v^* \text{ and } h(v) = h(v^*)\}| \leq w/4$.

Event $E_1$ states that $v^*$ is mapped (by the hash function $h$) into a cell without too many collisions with different input elements. Denote $t^* = h(v^*)$. While the multiplicity of $v^*$ in $S$ is at least $w$, event $E_1$ states that other than $v^*$ there are at most $w/4$ elements mapped into $t^*$. That is, $v^*$ dominates the cell $t^*$. We first show that if $E_1$ occurs, then w.h.p. $v^*$ is in the list $L$. As $w \geq 32\log(d)\log(16\log d)$, by the Chernoff bound we get that with probability at least $7/8$ (over partitioning $[n]$ into subsets $I_1,\dots,I_{\log d}$), for every $\ell \in [\log d]$ we have

$$f_{S_\ell}(h(v^*), v^*_\ell) \geq f_{S_\ell}(h(v^*), 1-v^*_\ell) + \frac{w}{4\log d}.$$

If that is the case, then by the properties of algorithm ExplicitHist, for $w \geq \frac{48}{\epsilon}\sqrt{2n\log(d)\ln(64\log d)}$, with probability at least $1 - \frac{1}{8\log d}$ we have $\hat v_{t^*,\ell} = v^*_\ell$, where $t^* = h(v^*)$. Using the union bound, this holds simultaneously for all $\ell \in [\log d]$ with probability at least $7/8$, in which case $v^* = \hat v_{t^*}$ is in the list $L$. It remains to show that event $E_1$ occurs with high probability. To that end, observe that

$$\mathbb{E}_h\big[|\{v \in S : v \neq v^* \text{ and } h(v) = h(v^*)\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_h\big[\mathbb{1}_{h(v)=h(v^*)}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h(v) = h(v^*)\}| \geq \frac{8n}{T}\Big] \leq \frac{1}{8}.$$

Setting $T = 32n/w$ (so that $8n/T = w/4$) completes the proof.

6.2. Reducing Space and Time Complexities

In this section we give the full details of algorithm Bitstogram.

6.2.1. Space and Time Efficient Frequency Oracle

We modify algorithm ExplicitHist to obtain our frequency oracle with improved time and space complexity: algorithm Hashtogram. Intuitively, this algorithm is obtained by applying algorithm ExplicitHist after hashing every domain element using a (pairwise independent) hash function (to boost accuracy, we use several hash functions). This means that the random matrix $Z$ we use has only $T$ rows, instead of $|V|$ rows. As we will see, this will allow us to implement the algorithm efficiently.

Algorithm 7 Hashtogram
Public randomness: Random partition of $[n]$ into $R$ subsets $I_1,\dots,I_R$. Random hash functions $h_1,\dots,h_R$ mapping $V$ to $[T]$. Uniformly random matrix $Z \in \{\pm1\}^{T\times n}$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. Define $\tilde S = (\tilde v_1,\dots,\tilde v_n)$, where $\tilde v_j = Z[h_r(v_j), j]$ and $r$ is s.t. $j \in I_r$.
Input: Oracle access to $\tilde S$.
1. For $j \in [n]$, get user $j$'s 1-bit report: $y_j \leftarrow R(\tilde v_j)$.
2. For every $(r,t) \in [R]\times[T]$ compute $a_r(t) = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{j\in I_r} y_j\,Z[t,j]$.
3. On input $v \in V$, return $a(v) = R\cdot\mathrm{Median}\{a_1(h_1(v)),\dots,a_R(h_R(v))\}$, and wait for the next input.

Lemma 6.4 Algorithm Hashtogram satisfies $\epsilon$-LDP.

Lemma 6.5 Let $\epsilon \leq 1$, and fix a subset $V' \subseteq V$ of size $d' \leq d$ to be queried to algorithm Hashtogram.
Let algorithm Hashtogram be executed with $R \geq 132\log(4d'/\beta)$, $T \geq \epsilon^2\log(d'/\beta) + \epsilon\sqrt{n/\log(d'/\beta)}$, and $n \geq 8R\log(8d'/\beta)$. With probability at least $1-\beta$, algorithm Hashtogram answers every $v \in V'$ with $a(v)$ satisfying:

$$|a(v) - f_S(v)| \leq \frac{27}{\epsilon}\sqrt{nR\log(2Rd'/\beta)}.$$

Observe that the error in the lemma is sub-optimal, as the optimal error behaves like $\frac{1}{\epsilon}\sqrt{n\log d}$. However, we will only use Lemma 6.5 with constant $d'$ and constant $\beta$, and hence will not be affected by this issue. A similar analysis (for the same algorithm) gives better bounds for other settings of parameters. Specifically,

Lemma 6.6 Let $\epsilon \leq 1$. Fix a subset $V' \subseteq V$ of size $d' \leq d$ to be queried to algorithm Hashtogram. Let algorithm Hashtogram be executed with $R \geq 300\log(12nd'/\beta)$, $n \geq 43R$, and $T \geq \epsilon\sqrt{n/\log(nd'/\beta)}$. With probability at least $1-\beta$, algorithm Hashtogram answers every $v \in V'$ with $a(v)$ satisfying:

$$|a(v) - f_S(v)| \leq \frac{400}{\epsilon}\sqrt{n\log(12nd'/\beta)}.$$

As the analyses of the two lemmas are very similar, we only present the proof of Lemma 6.6. The proof of Lemma 6.5 appears in Section B for completeness.

Proof [Proof of Lemma 6.6] Consider the following good event:

Event $E_1$ (over sampling $h_1,\dots,h_R$): For every query $v^* \in V'$ there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$.

Event $E_1$ states that for at least $7R/8$ of the hash functions, $v^*$ is mapped into a cell without too many collisions with different input elements. Informally, for every single hash function $h_r$, algorithm Hashtogram estimates the number of occurrences of $h_r(v^*)$ in $S$. Hence, if event $E_1$ occurs, then most of the estimations result in accurate answers. We start by showing that event $E_1$ happens with high probability. To that end, fix $v^* \in V'$ and $r \in [R]$. We have

$$\mathbb{E}_{h_r}\big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_{h_r}\big[\mathbb{1}_{h_r(v)=h_r(v^*)}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \geq \frac{16n}{T}\Big] \leq \frac{1}{16}.$$

As the hash functions are independent of each other, for $R \geq 48\ln(d'/\beta)$, by the Chernoff bound we get that with probability at least $1-\beta/d'$ (over sampling $h_1,\dots,h_R$) there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$. Using the union bound over every $v^* \in V'$, event $E_1$ happens with probability at least $1-\beta$.

Event $E_2$ (over partitioning $[n]$ into $I_1,\dots,I_R$): There exists a subset $R_2 \subseteq [R]$ of size $|R_2| \geq \frac{7}{8}R$ s.t. for every $r \in R_2$ it holds that $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$.

Following Theorem 2.4 (the Poisson approximation), we analyze event $E_2$ in the Poisson case. To that end, let $\hat I_1,\dots,\hat I_R$ be independent Poisson random variables with mean $n/R$. Now fix $r \in [R]$. Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $n \geq 43R$, we have $\Pr[\frac{n}{2R} \leq \hat I_r \leq \frac{2n}{R}] \geq \frac{99}{100}$. As $\hat I_1,\dots,\hat I_R$ are independent, assuming that $R \geq 300\ln(3n/\beta)$, by the Chernoff bound we get that event $E_2$ happens with probability at least $1-\frac{\beta}{3n}$ in the Poisson case. Hence, by Theorem 2.4, event $E_2$ happens with probability at least $1-\beta$.⁸

For every $r \in [R]$, let $S_r = (v_j)_{j\in I_r}$ denote a database containing the data of all users $j$ s.t. $j \in I_r$. Also, for $v^* \in V'$ and $r \in [R]$, denote $S_{r,v^*} = \{v \in S : h_r(v) = h_r(v^*)\}$; that is, $|S_{r,v^*}|$ is the number of users $j$ s.t. $h_r(v_j) = h_r(v^*)$. Furthermore, for $v^* \in V'$ and $r \in [R]$, denote $I^{v^*}_r = \{v \in S_r : h_r(v) = h_r(v^*)\}$; that is, $|I^{v^*}_r|$ is the number of users $j$ s.t. $j \in I_r$ and $h_r(v_j) = h_r(v^*)$. Observe that $|S_{r,v^*}| \geq f_S(v^*)$ and that $|I^{v^*}_r| \geq f_{S_r}(v^*)$.

8. Alternatively, instead of the requirement on $R$, it suffices to ensure that $n \geq O(R\log(R/\beta))$.
In such a case, by the Chernoff bound, for every fixed $r \in [R]$ we will have $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$ with probability at least $1-\frac{\beta}{R}$. Hence, using the union bound, this happens simultaneously for all $r \in [R]$ with probability at least $1-\beta$. Observe that in any case we already require $R \geq O(\log(d'/\beta))$ for event $E_1$.

Event $E_3$ (over partitioning $[n]$ into $I_1,\dots,I_R$): For every query $v^* \in V'$ there exists a subset $R_3^{v^*} \subseteq [R]$ of size $|R_3^{v^*}| \geq \frac{9}{10}R$ s.t. for every $r \in R_3^{v^*}$ it holds that

$$\big| R\cdot|I^{v^*}_r| - |S_{r,v^*}| \big| \leq \sqrt{20R\,|S_{r,v^*}|}. \tag{1}$$

We analyze event $E_3$ in the Poisson case. To that end, fix $v^* \in V'$, and let $\hat I^{v^*}_1,\dots,\hat I^{v^*}_R$ be independent Poisson random variables, where $\hat I^{v^*}_r$ has mean $|S_{r,v^*}|/R$. Now fix $r \in [R]$. Using a tail bound for the Poisson distribution (see Theorem 2.5), with probability at least $19/20$ we have $|R\cdot\hat I^{v^*}_r - |S_{r,v^*}|| \leq \sqrt{20R\,|S_{r,v^*}|}$. As $\hat I^{v^*}_1,\dots,\hat I^{v^*}_R$ are independent, assuming that $R \geq 300\ln(3nd'/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3nd'}$, Inequality (1) holds for at least $9/10$ of the choices of $r \in [R]$. By Theorem 2.4 (the Poisson approximation), with probability at least $1-\frac{\beta}{d'}$, this is also the case for the random variables $|I^{v^*}_r|$. That is, with probability at least $1-\frac{\beta}{d'}$, there exists a subset $R_3^{v^*} \subseteq [R]$ of size $|R_3^{v^*}| \geq \frac{9}{10}R$ s.t. for every $r \in R_3^{v^*}$, Inequality (1) holds. Using the union bound over every choice of $v^* \in V'$, event $E_3$ happens with probability at least $1-\beta$.⁹

Event $E_4$ (over sampling $Z$ and the coins of the local randomizers): For every query $v^* \in V'$ there exists a subset $R_4^{v^*} \subseteq [R]$ of size $|R_4^{v^*}| \geq \frac{7}{10}R$ s.t. for every $r \in R_4^{v^*}$ it holds that

$$\big| R\cdot a_r(h_r(v^*)) - R\cdot|I^{v^*}_r| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR}. \tag{2}$$

For $v^* \in V'$ and $r \in [R]$, denote $c_r(v^*) = \sum_{j\in I_r} y_j\,Z[h_r(v^*), j]$, and recall that algorithm Hashtogram answers the query $v^*$ with $a(v^*) = R\cdot\frac{e^\epsilon+1}{e^\epsilon-1}\cdot\mathrm{Median}\{c_r(v^*)\}_{r\in[R]}$. Fix $v^* \in V'$ and $r \in R_2$ (where $R_2 \subseteq [R]$ is the subset from event $E_2$). We now analyze the expectation of $c_r(v^*)$:

$$\mathbb{E}[c_r(v^*)] = \sum_{j\in I_r}\mathbb{E}[y_j\,Z[h_r(v^*),j]] = \sum_{\substack{j\in I_r:\\ h_r(v_j)=h_r(v^*)}}\mathbb{E}[y_j\,Z[h_r(v^*),j]] + \sum_{\substack{j\in I_r:\\ h_r(v_j)\neq h_r(v^*)}}\mathbb{E}[y_j]\,\mathbb{E}[Z[h_r(v^*),j]] = |\{v \in S_r : h_r(v) = h_r(v^*)\}|\cdot\frac{e^\epsilon-1}{e^\epsilon+1} = |I^{v^*}_r|\cdot\frac{e^\epsilon-1}{e^\epsilon+1}.$$

That is, $c_r(v^*)$ can be expressed as a sum of independent $\pm1$ random variables: $|I^{v^*}_r|$ variables with expectation $\frac{e^\epsilon-1}{e^\epsilon+1}$, and $(|I_r|-|I^{v^*}_r|)$ variables with expectation 0 (recall that by event $E_2$ we have $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$). Using the Hoeffding bound, with probability at least $43/44$ we have $|c_r(v^*) - \frac{e^\epsilon-1}{e^\epsilon+1}|I^{v^*}_r|| \leq \sqrt{11n/R}$; that is, Inequality (2) holds. Fix $v^* \in V'$, and observe that the above sums are independent for different values of $r$. Hence, using the Chernoff bound and asserting that $R \geq 150\ln(d'/\beta)$, for that fixed $v^*$, with probability at least $1-\beta/d'$ we have that Inequality (2) holds for at least a $7/8$ fraction of the choices of $r \in R_2$; as $|R_2| \geq \frac{7}{8}R$, this gives a subset $R_4^{v^*}$ of size at least $\frac{49}{64}R \geq \frac{7}{10}R$. Using the union bound, with probability at least $1-\beta$, this is true for every $v^* \in V'$ simultaneously. That is, event $E_4$ happens with probability at least $1-\beta$.

We are now ready to complete the proof, by showing that the guarantees of the lemma hold whenever events $E_1, E_2, E_3, E_4$ occur. Fix $v^* \in V'$. Combining events $E_3$ and $E_4$, we get that for every $r \in R_3^{v^*}\cap R_4^{v^*}$,

$$\big| R\cdot a_r(h_r(v^*)) - |S_{r,v^*}| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{20R\,|S_{r,v^*}|}.$$

Recall that for every $r \in [R]$ we have $|S_{r,v^*}| \geq f_S(v^*)$. Furthermore, by event $E_1$, for every $r \in R_1^{v^*}$ we have $|S_{r,v^*}| \leq f_S(v^*) + \frac{16n}{T}$.

9. Alternatively, instead of the requirement on $R$, it suffices to ensure that $n \geq O(R\log(d'R/\beta))$, in which case the analysis follows from the Chernoff bound.
Hence, for every $r \in R_1^{v^*}\cap R_3^{v^*}\cap R_4^{v^*}$ we have

$$\big| R\cdot a_r(h_r(v^*)) - f_S(v^*) \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{20R\Big(f_S(v^*)+\frac{16n}{T}\Big)} + \frac{16n}{T} \triangleq \mathrm{error}(v^*).$$

That is, for every $r \in R_1^{v^*}\cap R_3^{v^*}\cap R_4^{v^*}$ we have that $R\cdot a_r(h_r(v^*))$ is accurate up to $\mathrm{error}(v^*)$. As $|R_1^{v^*}\cap R_3^{v^*}\cap R_4^{v^*}| \geq \frac{9}{16}R$, and as algorithm Hashtogram answers $v^*$ with $a(v^*)$ chosen as the median of $\{R\cdot a_r(h_r(v^*))\}_{r\in[R]}$, we get that $|a(v^*) - f_S(v^*)| \leq \mathrm{error}(v^*)$. This holds for every $v^* \in V'$.

Processing memory. Algorithm Hashtogram maintains (in step 2) $R\cdot T$ sums, each over at most $n$ bits. This requires $O(R\cdot T\cdot\log n)$ bits of processing memory.

Runtime. Observe that a direct implementation of step 2 of algorithm Hashtogram consists of summing $T$ bits per user, i.e., a total of $T\cdot n$ bits, and hence results in a runtime of $\approx n^{1.5}$ (for $T = O(\sqrt n)$). As we next explain, this can be reduced to $\tilde O(n)$. First observe that for the analysis of Lemma 6.6 (specifically, for the analysis of event $E_4$), it suffices that, for every $j \in [n]$, the entries of column $j$ of $Z$ are only pairwise independent. That is, each column of $Z$ consists of $T \leq \sqrt n$ pairwise independent bits. We can represent such a column using $\log T \leq \frac{1}{2}\log n$ bits, in which case there are at most $O(\sqrt n)$ choices for the columns of $Z$ (see, e.g., Construction 3.18 of Vadhan (2012)). So, even though the matrix $Z$ contains $n$ columns, it has at most $\tilde T = O(\sqrt n)$ distinct columns. Let us denote those distinct columns as $z_1,\dots,z_{\tilde T}$, where $z_\gamma[t]$ denotes the bit in position $t$ of this column. We will write $Z[\cdot,j] = z_\gamma$ to indicate that the $j$th column of $Z$ is $z_\gamma$. With this notation, we can restate $a_r(t)$ (computed in step 2 of algorithm Hashtogram) as follows:

$$a_r(t) = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{j\in I_r} y_j\,Z[t,j] = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{\gamma=1}^{\tilde T} z_\gamma[t] \sum_{\substack{j\in I_r \text{ s.t.}\\ Z[\cdot,j]=z_\gamma}} y_j.$$

Thus, we can implement step 2 of algorithm Hashtogram by first computing $\sum_{j\in I_r \text{ s.t. } Z[\cdot,j]=z_\gamma} y_j$ for every $1 \leq \gamma \leq \tilde T$ and $1 \leq r \leq R$ (this amounts to summing a total of $n$ bits, and can be done in time $O(n)$). Afterwards, for every choice of $(r,t)$, computing $a_r(t)$ consists of summing $\tilde T$ elements. Overall, step 2 of the algorithm can be executed in time $O(R\cdot T\cdot\tilde T\cdot\log n) = \tilde O(n)$.

6.2.2. The Full Protocol

We are now ready to present the full Bitstogram protocol. This protocol is obtained by repeating the strategy outlined in algorithm SuccinctHist to boost the success probability, and by using an error-correcting code to reduce the error (as mentioned in Remark 3).

Algorithm 8 Bitstogram
Tool used: A binary code (Enc, Dec), where Enc: $V \to V'$, correcting a $\frac{1}{8}$-fraction of errors with constant rate, that is, $\log d' = O(\log d)$, where $d = |V|$ and $d' = |V'|$.
Public randomness: Random partition of $[n]$ into $R\cdot\log d'$ subsets $I_{r,\ell}$ for $(r,\ell) \in [R]\times[\log d']$. Random hash functions $h_1,\dots,h_R$ mapping $V'$ to $[T]$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. For every $j \in [n]$ denote $c_j = \mathrm{Enc}(v_j) \in V'$. For $(r,\ell) \in [R]\times[\log d']$, let $S_{r,\ell} = (h_r(c_j), c_{j,\ell})_{j\in I_{r,\ell}}$, where $c_{j,\ell}$ is bit $\ell$ of $c_j$. That is, $S_{r,\ell}$ is a database over the domain $[T]\times\{0,1\}$.
Input: Oracle access to $S$ and to $S_{r,\ell}$ for every $(r,\ell) \in [R]\times[\log d']$.
1. For $(r,\ell) \in [R]\times[\log d']$, use Hashtogram($S_{r,\ell}$) with privacy parameter $\frac{\epsilon}{2}$ to get $\{a_{r,\ell}(t,b) : (t,b) \in [T]\times\{0,1\}\}$.
2. For $(r,t) \in [R]\times[T]$, define $\hat c_{r,t} \in V'$, where bit $\ell$ of $\hat c_{r,t}$ is $\hat c_{r,t,\ell} = \arg\max_{b\in\{0,1\}} a_{r,\ell}(t,b)$. That is, if $a_{r,\ell}(t,0) \geq a_{r,\ell}(t,1)$ then $\hat c_{r,t,\ell} = 0$, and otherwise $\hat c_{r,t,\ell} = 1$.
3. For $(r,t) \in [R]\times[T]$, define $\hat v_{r,t} = \mathrm{Dec}(\hat c_{r,t}) \in V$.
4. Use Hashtogram($S$) with privacy parameter $\frac{\epsilon}{2}$ to obtain $a(\hat v_{r,t})$ for all $(r,t) \in [R]\times[T]$.
5. Return the list $L = \{(\hat v_{r,t}, a(\hat v_{r,t})) : (r,t) \in [R]\times[T]\}$.
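The distinct-columns speedup of Hashtogram's step 2 can be verified directly. The following sketch (ours; all sizes and names are illustrative assumptions, and the debiasing factor is omitted) checks that grouping users by their column identifier yields the same $a_r(t)$ values as the direct double sum.

```python
import numpy as np

# Sketch of the step-2 speedup for Hashtogram: when Z has few distinct
# columns, group users by their column id before forming a_r(t).
rng = np.random.default_rng(3)
n, R, T, n_cols = 50_000, 8, 32, 64              # n_cols: distinct columns of Z

cols = rng.choice([-1, 1], size=(n_cols, T))     # the distinct columns z_gamma
col_id = rng.integers(0, n_cols, size=n)         # Z[:, j] = cols[col_id[j]]
group = rng.integers(0, R, size=n)               # random partition I_1..I_R
y = rng.choice([-1, 1], size=n)                  # users' 1-bit reports (stand-in)

# Naive step 2: for each (r, t), sum y_j * Z[t, j] over j in I_r.
naive = np.zeros((R, T))
for r in range(R):
    idx = group == r
    naive[r] = (y[idx, None] * cols[col_id[idx]]).sum(axis=0)

# Fast step 2: first sum y_j per (r, gamma), then one small matmul.
s = np.zeros((R, n_cols))
np.add.at(s, (group, col_id), y)                 # time ~ n
fast = s @ cols                                  # time ~ R * T * n_cols

print(np.allclose(naive, fast))                  # True: same a_r(t) up to scaling
```

The first pass touches each user exactly once, which is what brings the server back to near-linear time.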
Remark 4 The execution of Hashtogram in step 4 uses the parameters stated in Lemma 6.6, in order to obtain accurate answers for every fixed set of $n$ queries with probability $1-\beta$. The executions of Hashtogram in step 1 use the parameters stated in Lemma 6.5, in order to obtain accurate answers for every fixed pair of queries with probability $255/256$. Observe that every such instantiation of Hashtogram is queried $2T$ times, and hence some of these queries might result in inaccurate answers. Nevertheless, as will be made clear later, due to our use of an error-correcting code, these inaccurate answers will not affect the final outcome of the algorithm.

Lemma 6.7 Algorithm Bitstogram satisfies $\epsilon$-LDP.

Lemma 6.8 Let $\epsilon \leq 1$, and assume that $\log d \geq \Omega(\log(n/\beta))$. Set $R = O(\log(1/\beta))$ and $T = O\big(\frac{\epsilon n}{\sqrt{R\log d}}\big)$. Algorithm Bitstogram returns a list $L$ of length $R\cdot T$ satisfying:
1. With probability $1-\beta$, for every $(v,a) \in L$ we have $|a - f_S(v)| \leq O\big(\frac{1}{\epsilon}\sqrt{n\log(n/\beta)}\big)$.
2. With probability $1-\beta$, for every $v \in V$ s.t. $f_S(v) \geq O\big(\frac{1}{\epsilon}\sqrt{n\log(d)\log(\frac{1}{\beta})}\big)$, we have that $v$ is in $L$.

Remark 5 The assumption in Lemma 6.8 that $\log d \geq \Omega(\log(n/\beta))$ is without loss of generality, as otherwise the universe size $d$ is small enough to allow the server to run in time linear in $d$, which makes the problem much easier. Specifically, if $d < n$ then we can instantiate the frequency oracle of Lemma 6.1 and query it for every domain element. As $d < n$, this can be executed in time $\tilde O(n)$ (the runtime analysis is similar to the one in Section 6.2.1). Otherwise, if $d \geq n$ then we already have $\log d \geq \log n$, and (if necessary) we can pad the representation of domain elements to satisfy the assumption that $\log d \geq \Omega(\log(n/\beta))$.

Proof [Proof of Lemma 6.8] Item 1 of the lemma follows directly from Lemma 6.6. We now prove item 2. Consider the following good event (over sampling $h_1,\dots,h_R$):

Event $E_1$ (over sampling $h_1,\dots,h_R$): There exists a subset $R_1 \subseteq [R]$ of size $|R_1| \geq \frac{7}{8}R$ s.t. for every $r \in R_1$ and for every $v^* \in S$ satisfying $f_S(v^*) \geq \frac{n^{1.5}}{T}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{16n^{1.5}}{T}$.

We start by showing that event $E_1$ happens with high probability. To that end, fix $v^* \in S$ and $r \in [R]$. We have

$$\mathbb{E}_{h_r}\big[|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_{h_r}\big[\mathbb{1}_{h_r(\mathrm{Enc}(v))=h_r(\mathrm{Enc}(v^*))}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \geq \frac{16n^{1.5}}{T}\Big] \leq \frac{1}{16\sqrt n}.$$

Assuming that $T \leq n$, there can be at most $\sqrt n$ heavy elements $v^*$ satisfying $f_S(v^*) \geq \frac{n^{1.5}}{T}$. Hence, using the union bound,

$$\Pr\Big[\exists\,v^* \text{ s.t. } f_S(v^*) \geq \frac{n^{1.5}}{T} \text{ and } |\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \geq \frac{16n^{1.5}}{T}\Big] \leq \frac{1}{16}.$$

As the hash functions are independent of each other, for $R \geq 48\ln(1/\beta)$, by the Chernoff bound we get that with probability at least $1-\beta$ (over sampling $h_1,\dots,h_R$) there exists a subset $R_1 \subseteq [R]$ of size $|R_1| \geq \frac{7}{8}R$ s.t. for every $v^*$ satisfying $f_S(v^*) \geq \frac{n^{1.5}}{T}$ and every $r \in R_1$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{16n^{1.5}}{T}$. That is, event $E_1$ happens with probability at least $1-\beta$. We continue the analysis assuming that event $E_1$ occurs.

Event $E_2$ (over partitioning $[n]$ into $\{I_{r,\ell}\}$): There exists a subset $R_2 \subseteq [R]$ of size $|R_2| \geq \frac{7}{8}R$ s.t. for every $r \in R_2$ there exist at least $\frac{31}{32}\log d'$ choices of $\ell \in [\log d']$ for which $|I_{r,\ell}| \leq \frac{2n}{R\log d'}$.

Following Theorem 2.4 (the Poisson approximation), we analyze event $E_2$ in the Poisson case. To that end, let $\hat I_{1,1},\dots,\hat I_{R,\log d'}$ be independent Poisson random variables with mean $\frac{n}{R\log d'}$. Now fix $(r,\ell) \in [R]\times[\log d']$.
Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $n \geq 10R\log d'$, we have $\Pr[\hat I_{r,\ell} \leq \frac{2n}{R\log d'}] \geq \frac{99}{100}$. As $\hat I_{1,1},\dots,\hat I_{R,\log d'}$ are independent, assuming that $R\log d' \geq 300\ln(3n/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3n}$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell) \in [R]\times[\log d']$ s.t. $\hat I_{r,\ell} \leq \frac{2n}{R\log d'}$. Hence, by Theorem 2.4, with probability at least $1-\beta$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell) \in [R]\times[\log d']$ s.t. $|I_{r,\ell}| \leq \frac{2n}{R\log d'}$. If that is the case, then there must be at least $\frac{7R}{8}$ choices of $r \in [R]$ for which

$$\Big|\Big\{\ell \in [\log d'] : |I_{r,\ell}| \leq \frac{2n}{R\log d'}\Big\}\Big| \geq \frac{31}{32}\log d'.$$

That is, $\Pr[E_2] \geq 1-\beta$.

Event $E_3$ (over partitioning $[n]$ into $\{I_{r,\ell}\}$): For every $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$ there exists a subset $R_3^{v^*} \subseteq [R]$ of size $|R_3^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_3^{v^*}$ there exist at least $\frac{31}{32}\log d'$ choices of $\ell \in [\log d']$ for which $|\{j \in I_{r,\ell} : v_j = v^*\}| \geq \frac{f_S(v^*)}{2R\log d'}$.

We analyze event $E_3$ in the Poisson case. To that end, fix $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$, and let $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ be independent Poisson random variables with mean $\frac{f_S(v^*)}{R\log d'}$. Now fix $(r,\ell) \in [R]\times[\log d']$. Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $\frac{n^{1.5}}{T} \geq 37R\log d'$, we have $\Pr[\hat I^{v^*}_{r,\ell} \geq \frac{f_S(v^*)}{2R\log d'}] \geq \frac{99}{100}$. As $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ are independent, assuming that $R\log d' \geq 300\ln(3n^2/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3n^2}$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell) \in [R]\times[\log d']$ s.t. $\hat I^{v^*}_{r,\ell} \geq \frac{f_S(v^*)}{2R\log d'}$. Hence, by Theorem 2.4, with probability at least $1-\frac{\beta}{n}$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell)$ s.t. $|\{j \in I_{r,\ell} : v_j = v^*\}| \geq \frac{f_S(v^*)}{2R\log d'}$. If that is the case, then there must be at least $\frac{7R}{8}$ choices of $r \in [R]$ for which

$$\Big|\Big\{\ell \in [\log d'] : |\{j \in I_{r,\ell} : v_j = v^*\}| \geq \frac{f_S(v^*)}{2R\log d'}\Big\}\Big| \geq \frac{31}{32}\log d'.$$

Using the union bound, this holds simultaneously for every such $v^*$ with probability at least $1-\beta$. That is, $\Pr[E_3] \geq 1-\beta$.

Event $E_4$ (over partitioning $[n]$ into $\{I_{r,\ell}\}$): For every $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$ there exists a subset $R_4^{v^*} \subseteq [R]$ of size $|R_4^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_4^{v^*}$ there exist at least $\frac{31}{32}\log d'$ choices of $\ell \in [\log d']$ for which $|\{j \in I_{r,\ell} : v_j \neq v^* \text{ and } h_r(\mathrm{Enc}(v_j)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{32n^{1.5}}{RT\log d'}$.

We analyze event $E_4$ in the Poisson case. To that end, fix $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$. For $r \in [R]$ denote $\mathrm{Col}_r(v^*) = |\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}|$, and recall that by event $E_1$, for $r \in R_1$ we have $\mathrm{Col}_r(v^*) \leq \frac{16n^{1.5}}{T}$. Let $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ be independent Poisson random variables, where $\hat I^{v^*}_{r,\ell}$ has mean $\frac{\mathrm{Col}_r(v^*)}{R\log d'}$. Now fix $(r,\ell) \in R_1\times[\log d']$. Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $T \leq \frac{n^{1.5}}{R\log d'}$, we have $\Pr[\hat I^{v^*}_{r,\ell} \leq \frac{32n^{1.5}}{RT\log d'}] \geq \frac{99}{100}$. As $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ are independent, assuming that $|R_1|\log d' \geq 300\ln(3n^2/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3n^2}$ there are at least $\frac{98}{100}|R_1|\log d'$ choices of $(r,\ell) \in R_1\times[\log d']$ s.t. $\hat I^{v^*}_{r,\ell} \leq \frac{32n^{1.5}}{RT\log d'}$. Hence, by Theorem 2.4, with probability at least $1-\frac{\beta}{n}$ there are at least $\frac{98}{100}|R_1|\log d'$ choices of $(r,\ell) \in R_1\times[\log d']$ s.t.

$$|\{j \in I_{r,\ell} : v_j \neq v^* \text{ and } h_r(\mathrm{Enc}(v_j)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{32n^{1.5}}{RT\log d'}.$$

If that is the case, then there must be at least $\frac{7|R_1|}{8}$ choices of $r \in R_1$ for which

$$\Big|\Big\{\ell \in [\log d'] : |\{j \in I_{r,\ell} : v_j \neq v^* \text{ and } h_r(\mathrm{Enc}(v_j)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{32n^{1.5}}{RT\log d'}\Big\}\Big| \geq \frac{31}{32}\log d'.$$

Using the union bound, this holds simultaneously for every such $v^*$ with probability at least $1-\beta$. That is, $\Pr[E_4] \geq 1-\beta$.

We are now ready to complete the proof. Fix $v^* \in S$ s.t. $f_S(v^*) \geq \frac{264\,n^{1.5}}{T}$.
Let $R_1, R_2, R_3^{v^*}, R_4^{v^*}$ be the sets from events $E_1, E_2, E_3, E_4$, and observe that $|R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}| \geq \frac{3}{8}R$; furthermore, for every $r^* \in R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}$ there exist at least $\frac{29}{32}\log d'$ choices of $\ell \in [\log d']$ for which

(a) $|\{j \in I_{r^*,\ell} : v_j = v^*\}| \geq \frac{132\,n^{1.5}}{TR\log d'}$,
(b) $|\{j \in I_{r^*,\ell} : v_j \neq v^* \text{ and } h_{r^*}(\mathrm{Enc}(v_j)) = h_{r^*}(\mathrm{Enc}(v^*))\}| \leq \frac{32\,n^{1.5}}{TR\log d'}$,
(c) $|I_{r^*,\ell}| \leq \frac{2n}{R\log d'}$.

Denote $c^* = \mathrm{Enc}(v^*)$. Observe that by items (a) and (b) above, we have

$$f_{S_{r^*,\ell}}(h_{r^*}(c^*), c^*_\ell) \geq f_{S_{r^*,\ell}}(h_{r^*}(c^*), 1-c^*_\ell) + \frac{100\,n^{1.5}}{TR\log d'}.$$

Let us assume that $T \leq \frac{\epsilon n}{326\sqrt{R\log d'}}$. By the properties of algorithm Hashtogram (Lemma 6.5), for every $r^*, \ell$ satisfying items (a), (b), (c), algorithm Hashtogram($S_{r^*,\ell}$) ensures that with probability at least $\frac{255}{256}$ we have $\hat c_{r^*,t^*,\ell} = c^*_\ell$, where $t^* = h_{r^*}(\mathrm{Enc}(v^*))$. Now fix $r^* \in R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}$, and recall that the coins of Hashtogram($S_{r^*,\ell}$) are independent for different values of $\ell$. Hence, by the Chernoff bound, for $\log d' \geq 850\log(n/\beta)$ we get

$$\Pr\Big[\big|\{\ell \in [\log d'] : \hat c_{r^*,t^*,\ell} = c^*_\ell\}\big| \geq \frac{9}{10}\log d'\Big] \geq 1-\frac{\beta}{n}.$$

That is, for our fixed $v^*$ there exists an $r^* \in R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}$ s.t. with probability at least $1-\frac{\beta}{n}$ we have that $\hat c_{r^*,t^*}$ and $c^* = \mathrm{Enc}(v^*)$ differ on at most a $\frac{1}{10}$-fraction of their bits. Using the union bound, with probability at least $1-\beta$, this holds simultaneously for every $v^*$ s.t. $f_S(v^*) \geq \frac{264\,n^{1.5}}{T}$. By the properties of the error-correcting code, in such a case, for every $v^*$ s.t. $f_S(v^*) \geq \frac{264\,n^{1.5}}{T}$ we have $\hat v_{r^*,t^*} = \mathrm{Dec}(\hat c_{r^*,t^*}) = v^*$, and $v^*$ is in the list $L$.

Runtime. In step 1, algorithm Bitstogram instantiates $O(R\log d)$ copies of the frequency oracle Hashtogram. Every such instantiation takes time $\tilde O(n)$. Afterwards, the algorithm queries the oracles for a total of $O(2RT\log d) \leq \tilde O(n)$ queries, each of which takes time $O(1)$. Thus, step 1 takes time $\tilde O(n)$ to compute. Steps 2 and 3 loop over every $(r,t) \in [R]\times[T]$, and take time $RT \leq \tilde O(n)$ to compute. Finally, step 4 instantiates Hashtogram (in time $\tilde O(n)$) and queries it for a total of $R\cdot T \leq \tilde O(n)$ queries. Overall, algorithm Bitstogram runs in time $\tilde O(n)$.

Processing memory. Recall that step 1 of algorithm Bitstogram requires querying algorithm Hashtogram for a total of $O(2RT\log d) \leq \tilde O(n)$ queries. As we are aiming for an algorithm with processing memory far below $n$, we cannot store all of the answers in memory simultaneously. To resolve this issue, let us reorganize step 1 of algorithm Bitstogram as follows:

1a. For every $(r,\ell) \in [R]\times[\log d']$, invoke Hashtogram($S_{r,\ell}$) with privacy parameter $\frac{\epsilon}{2}$.
1b. For every $(r,t) \in [R]\times[T]$ and every $\ell \in [\log d']$, query Hashtogram($S_{r,\ell}$) to get $\{a_{r,\ell}(t,b) : b \in \{0,1\}\}$.

Now, every one of steps 1b, 2, 3, 4 of algorithm Bitstogram contains a loop over every choice of $(r,t) \in [R]\times[T]$, and we can group all of these steps together into one loop over $(r,t) \in [R]\times[T]$. If an iteration of this loop results in a value $a(\hat v_{r,t}) \leq \sqrt n$, we can simply ignore it (recall that the frequencies of domain elements that are not in the list $L$ are estimated as zero, and that $\sqrt n$ is less than the guaranteed bound on the error of the protocol, so this step does not affect our error bounds). As there can be at most $\sqrt n$ elements with frequencies at least $\sqrt n$, the necessary processing memory is only $\tilde O(\sqrt n)$.
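The fused loop described above can be sketched as follows (ours, illustrative only; `oracle_bits`, `oracle_S`, and `dec` are hypothetical stand-ins for the Hashtogram instances and the decoder, not the paper's implementation).

```python
import math

# Sketch of the memory-bounded reorganization of Bitstogram's server loop
# (steps 1b, 2, 3, 4 fused into one pass over (r, t)).
def fused_loop(oracle_bits, oracle_S, R, T, num_bits, n, dec):
    """Keep only candidates whose estimated frequency exceeds sqrt(n)."""
    threshold = math.sqrt(n)        # below the protocol's error bound anyway
    survivors = {}                  # at most ~sqrt(n) entries are ever kept
    for r in range(R):
        for t in range(T):
            # Step 1b + 2: reconstruct the candidate codeword bit by bit.
            bits = []
            for ell in range(num_bits):
                a0 = oracle_bits[(r, ell)](t, 0)
                a1 = oracle_bits[(r, ell)](t, 1)
                bits.append(0 if a0 >= a1 else 1)
            v_hat = dec(bits)       # step 3: decode the error-corrected word
            a_v = oracle_S(v_hat)   # step 4: estimate its frequency
            if a_v > threshold:     # discard light candidates immediately
                survivors[v_hat] = a_v
    return survivors
```

Discarding candidates below $\sqrt n$ on the fly is exactly what caps the dictionary at about $\sqrt n$ entries, since no more than $\sqrt n$ elements can each appear $\sqrt n$ times among $n$ users.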
Acknowledgments

R. B.'s research is supported by NSF Awards AF-1908281 and SHF-1907715, a Google Faculty Research Award, and OSU faculty start-up support. Work of K. N. was supported by NSF grant No. 1565387, TWC: Large: Collaborative: Computing Over Distributed Sensitive Data. Work of U. S. was supported in part by the Israel Science Foundation (grant No. 1871/19). A. T.'s research is supported by NSF Awards TRIPODS+X-1839317, AF-1908281, and TRIPODS-1740850, and a Google Faculty Research Award.

References

NLTK Brown corpus. www.nltk.org.

Apple tries to peek at user habits without violating privacy. The Wall Street Journal, 2016.

Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley, 1992. ISBN 0-471-53588-5.

Raef Bassily and Adam Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 127-135. ACM, 2015.

Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory of Computing, 12(1):1-61, 2016. doi: 10.4086/toc.2016.v012a001. URL http://dx.doi.org/10.4086/toc.2016.v012a001.

M. Bellare and J. Rompel. Randomness-efficient oblivious sampling. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, SFCS '94, pages 276-287, Washington, DC, USA, 1994. IEEE Computer Society. ISBN 0-8186-6580-7. doi: 10.1109/SFCS.1994.365687. URL http://dx.doi.org/10.1109/SFCS.1994.365687.

Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 634-649. IEEE Computer Society, 2015. ISBN 978-1-4673-8191-8. doi: 10.1109/FOCS.2015.45. URL http://dx.doi.org/10.1109/FOCS.2015.45.

Mark Bun, Jelani Nelson, and Uri Stemmer. Heavy hitters and the structure of local privacy. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 435-447, 2018. doi: 10.1145/3196959.3196981. URL https://doi.org/10.1145/3196959.3196981.

T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Optimal lower bound for differentially private multi-party aggregation. In Leah Epstein and Paolo Ferragina, editors, Algorithms - ESA 2012 - 20th Annual European Symposium, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, volume 7501 of Lecture Notes in Computer Science, pages 277-288. Springer, 2012. ISBN 978-3-642-33089-6. doi: 10.1007/978-3-642-33090-2_25. URL http://dx.doi.org/10.1007/978-3-642-33090-2_25.

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP, 2002.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, 2006.

Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS, 2014.

Alexandre Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211-222. ACM, 2003. ISBN 1-58113-670-6.

Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. arXiv preprint arXiv:1503.01214, 2015.

Venkatesan Guruswami. List Decoding of Error-Correcting Codes. PhD thesis, Massachusetts Institute of Technology, 2001. Supervisor: Madhu Sudan.

Justin Hsu, Sanjeev Khanna, and Aaron Roth. Distributed private heavy hitters. In International Colloquium on Automata, Languages, and Programming, pages 461-472. Springer, 2012.
Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793-826, 2011.

Nina Mishra and Mark Sandler. Privacy via pseudorandom sketches. In Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 143-152. ACM, 2006.

Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005. ISBN 0521835402.

A. G. Thakurta, A. H. Vyrros, U. S. Vaishampayan, G. Kapoor, J. Freudiger, V. R. Sridhar, and D. Davidson. Learning new words. US Patent 9594741, 2017.

Salil P. Vadhan. Pseudorandomness. Foundations and Trends in Theoretical Computer Science, 7(1-3):1-336, 2012. ISSN 1551-305X. doi: 10.1561/0400000010. URL http://dx.doi.org/10.1561/0400000010.

A. Missing Proofs from Section 5

A.1. Proof of Theorem 5.1

It is easy to see that each invocation of LocalRnd is $\epsilon/2$-locally differentially private: conditioned on any realization of $\ell_i, j_i, r_i$, for any pair of possible input items $v_i, v_i' \in V$ to LocalRnd and any output bit $b$ generated in step 7 of Algorithm 1, we have

$$\Pr[y_i = b \mid v_i] \leq e^{\epsilon/2}\,\Pr[y_i = b \mid v_i'].$$

Note that LocalRnd is invoked only twice for each user: once when Final = 0 (the scanning/pruning phase of TreeHist) and once when Final = 1 (the final phase of TreeHist). Thus, it follows that protocol TreeHist is $\epsilon$-differentially private.

A.2. Proof of Theorem 5.2

Let $\beta \in (0,1)$ and let $\eta$ be as defined in the theorem statement. Consider the pruning phase of TreeHist, that is, steps 1 to 11 in Algorithm 3. Let $\gamma$ be as set in step 2. In this phase, TreeHist invokes FreqOracle once in every iteration of the outer for loop (over the levels of the tree) with the flag Final = 0. Consider any such iteration $\ell$. Suppose, for now, that the size of ChildSet(Prefixes) passed to FreqOracle in that iteration is at most $2n/\eta$. (We will show that with probability at least $1-\beta$ this condition is satisfied at all levels $\ell$ of the tree, i.e., it is a loop invariant.) By invoking Lemma 5.3 with $\widehat V = \mathrm{ChildSet(Prefixes)}$, $\{I_j : j \in [t]\} = \{I_{\ell,j} : j \in [t]\}$, and $\gamma = t\log d = 110\log(n/\beta)\log d$, we have that with probability at least $1-\beta/\log d$, for every $\hat v \in \mathrm{ChildSet(Prefixes)}$ with $f(\hat v) > 3\eta$, FreqOracle gives an estimate $\hat f(\hat v) \geq 2\eta$, and for every $\hat v \in \mathrm{ChildSet(Prefixes)}$ with $f(\hat v) \leq \eta$, FreqOracle gives an estimate $\hat f(\hat v) < 2\eta$. Hence, step 9 implies that with probability at least $1-\beta/\log d$, all $\hat v \in \mathrm{ChildSet(Prefixes)}$ with true frequencies $f(\hat v) \geq 3\eta$ will proceed to the next iteration $\ell+1$, and all those $\hat v \in \mathrm{ChildSet(Prefixes)}$ with true frequencies $f(\hat v) < \eta$ will be pruned out. Since the number of nodes $\hat v$ with true frequency $f(\hat v) \geq \eta$ cannot be more than $n/\eta$, the number of surviving nodes in the next iteration $\ell+1$ cannot be more than $2n/\eta$. Hence, this condition will be satisfied in the next iteration, and we can proceed in the same fashion. Note that when $\ell = 1$, the condition is trivially satisfied since there are only $2 < 2n/\eta$ nodes at that level. This induction argument shows that with probability at least $1-\beta$, for every level $\ell \in [\log d]$, the surviving nodes at level $\ell$ correspond to prefixes whose true frequencies are not below $\eta$, and they include all prefixes whose true frequencies are above $3\eta$. In particular, with probability at least $1-\beta$, all items in SuccHist satisfy these properties.
This covers the proof of items 1 and 2 of Theorem 5.2. Now, consider the final phase of TreeHist, that is, steps 12 to 14 of Algorithm 3. Let $\gamma$ be as set in step 12. In this phase, TreeHist invokes FreqOracle on the surviving nodes at the final level of the tree (the last update of Prefixes), with input flag Final = 1. Now, by invoking Lemma 5.3 with $\widehat V = \mathrm{Prefixes}$, $\{I_j : j \in [t]\} = \{\bar I_j : j \in [t]\}$, and $\gamma = t = 110\log(n/\beta)$, we have that with probability at least $1-\beta/\log d$, for every $\hat v \in \mathrm{Prefixes}$, $|\hat f(\hat v) - f(\hat v)| \leq \frac{14}{\epsilon}\sqrt{110\,n\log(n/\beta)}$. This proves item 3 of the theorem.

B. Missing Proofs from Section 6

B.1. Proof of Lemma 6.5

Consider the following good event:

Event $E_1$ (over sampling $h_1,\dots,h_R$): For every query $v^* \in V'$ there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$.

Event $E_1$ states that for at least $7R/8$ of the hash functions, $v^*$ is mapped into a cell without too many collisions with different input elements. Informally, for every single hash function $h_r$, algorithm Hashtogram estimates the number of occurrences of $h_r(v^*)$ in $S$. Hence, if event $E_1$ occurs, then most of the estimations result in accurate answers. We start by showing that event $E_1$ happens with high probability. To that end, fix $v^* \in V'$ and $r \in [R]$. We have

$$\mathbb{E}_{h_r}\big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_{h_r}\big[\mathbb{1}_{h_r(v)=h_r(v^*)}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \geq \frac{16n}{T}\Big] \leq \frac{1}{16}.$$

As the hash functions are independent of each other, for $R \geq 48\ln(d'/\beta)$, by the Chernoff bound we get that with probability at least $1-\beta/d'$ (over sampling $h_1,\dots,h_R$) there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$. Using the union bound, event $E_1$ happens with probability at least $1-\beta$. We continue the analysis assuming that event $E_1$ occurs.

For every $r \in [R]$, let $S_r = (v_j)_{j\in I_r}$ denote a database containing the data of all users $j$ s.t. $j \in I_r$. Also, for $v^* \in V'$ and $r \in [R]$, denote $S_{r,v^*} = \{v \in S : h_r(v) = h_r(v^*)\}$; that is, $|S_{r,v^*}|$ is the number of users $j$ s.t. $h_r(v_j) = h_r(v^*)$. Furthermore, for $v^* \in V'$ and $r \in [R]$, denote $I^{v^*}_r = \{v \in S_r : h_r(v) = h_r(v^*)\}$; that is, $|I^{v^*}_r|$ is the number of users $j$ s.t. $j \in I_r$ and $h_r(v_j) = h_r(v^*)$. Observe that $|S_{r,v^*}| \geq f_S(v^*)$ and that $|I^{v^*}_r| \geq f_{S_r}(v^*)$.

Fix $v^* \in V'$. By the Chernoff bound, with probability at least $1-\beta/d'$ (over partitioning $[n]$ into subsets $I_1,\dots,I_R$), for every $r \in [R]$ we have

$$\big| R\cdot|I^{v^*}_r| - |S_{r,v^*}| \big| \leq \sqrt{3R\,|S_{r,v^*}|\log(2Rd'/\beta)}. \tag{4}$$

Using the union bound, this holds simultaneously for every $v^* \in V'$ and $r \in [R]$ with probability at least $1-\beta$. Also, assuming that $n \geq 8R\log(2R/\beta)$, by the Chernoff bound, with probability at least $1-\beta$ (over partitioning $[n]$ into subsets $I_1,\dots,I_R$), for every $r \in [R]$ we have $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$. We continue the analysis assuming that this is the case and that Inequality (4) holds.

Event $E_2$ (over sampling $Z$ and the coins of the local randomizers): For every query $v^* \in V'$ there exists a subset $R_2^{v^*} \subseteq [R]$ of size $|R_2^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_2^{v^*}$ it holds that

$$\big| R\cdot a_r(h_r(v^*)) - R\cdot|I^{v^*}_r| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR}. \tag{5}$$

For $v^* \in V'$ and $r \in [R]$, denote $c_r(v^*) = \sum_{j\in I_r} y_j\,Z[h_r(v^*), j]$, and recall that algorithm Hashtogram answers the query $v^*$ with $a(v^*) = R\cdot\frac{e^\epsilon+1}{e^\epsilon-1}\cdot\mathrm{Median}_{r\in[R]}\{c_r(v^*)\}$. Fix $v^* \in V'$ and $r \in [R]$.
We now analyze the expectation of $c_r(v^*)$:

$$\mathbb{E}[c_r(v^*)] = \sum_{j\in I_r}\mathbb{E}[y_j\,Z[h_r(v^*),j]] = \sum_{\substack{j\in I_r:\\ h_r(v_j)=h_r(v^*)}}\mathbb{E}[y_j\,Z[h_r(v^*),j]] + \sum_{\substack{j\in I_r:\\ h_r(v_j)\neq h_r(v^*)}}\mathbb{E}[y_j]\,\mathbb{E}[Z[h_r(v^*),j]] = |\{v \in S_r : h_r(v) = h_r(v^*)\}|\cdot\frac{e^\epsilon-1}{e^\epsilon+1} = |I^{v^*}_r|\cdot\frac{e^\epsilon-1}{e^\epsilon+1}.$$

That is, $c_r(v^*)$ can be expressed as a sum of independent $\pm1$ random variables: $|I^{v^*}_r|$ variables with expectation $\frac{e^\epsilon-1}{e^\epsilon+1}$, and $(|I_r|-|I^{v^*}_r|)$ variables with expectation 0 (recall that $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$). Using the Hoeffding bound, with probability at least $43/44$ we have $|c_r(v^*) - \frac{e^\epsilon-1}{e^\epsilon+1}|I^{v^*}_r|| \leq \sqrt{11n/R}$; that is,

$$\big| R\cdot a_r(h_r(v^*)) - R\cdot|I^{v^*}_r| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR}.$$

Fix $v^* \in V'$, and observe that the above sums are independent for different values of $r$. Hence, using the Chernoff bound and asserting that $R \geq 132\ln(d'/\beta)$, for that fixed $v^*$, with probability at least $1-\beta/d'$ we have that Inequality (5) holds for at least $7R/8$ choices of $r \in [R]$. Using the union bound, with probability at least $1-\beta$, this is true for every $v^* \in V'$ simultaneously. That is, event $E_2$ happens with probability at least $1-\beta$. We continue the analysis assuming that event $E_2$ occurs.

For every $v^* \in V'$, denote $R_3^{v^*} = R_1^{v^*}\cap R_2^{v^*}$. Combining event $E_2$ with Inequality (4), we get that for every $r \in R_2^{v^*}$,

$$\big| R\cdot a_r(h_r(v^*)) - |S_{r,v^*}| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{3R\,|S_{r,v^*}|\log(2Rd'/\beta)}.$$

Recall that for every $v^* \in V'$ and every $r \in [R]$ we have $|S_{r,v^*}| \geq f_S(v^*)$. Furthermore, for every $v^* \in V'$ and every $r \in R_1^{v^*}$ we have $|S_{r,v^*}| \leq f_S(v^*) + \frac{16n}{T}$. Hence, for every $v^* \in V'$ and every $r \in R_3^{v^*}$ we have

$$\big| R\cdot a_r(h_r(v^*)) - f_S(v^*) \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{3R\Big(f_S(v^*)+\frac{16n}{T}\Big)\log(2Rd'/\beta)} + \frac{16n}{T} \triangleq \mathrm{error}(v^*).$$

That is, for every $r \in R_3^{v^*}$ we have that $R\cdot a_r(h_r(v^*))$ is accurate up to $\mathrm{error}(v^*)$. As $|R_3^{v^*}| \geq \frac{3}{4}R$, and as algorithm Hashtogram answers $v^*$ with $a(v^*)$ chosen as the median of $\{R\cdot a_r(h_r(v^*))\}_{r\in[R]}$, we get that $|a(v^*) - f_S(v^*)| \leq \mathrm{error}(v^*)$ for every $v^* \in V'$.
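As an end-to-end sanity check of the estimator analyzed in this appendix, the following simulation sketch (ours, not the paper's implementation; the domain is capped at 100,000 and all parameter values are toy assumptions) runs Hashtogram's pipeline of hashing, one-bit randomized reports, debiasing, and the median over $R$ repetitions.

```python
import numpy as np

# End-to-end sanity check of the Hashtogram estimate (illustrative only).
rng = np.random.default_rng(4)
eps, n, R, T, D = 1.0, 50_000, 15, 512, 100_000   # D: toy domain size

v = np.minimum(rng.zipf(2.0, size=n), D - 1)      # skewed inputs; item 1 is heavy
h = rng.integers(0, T, size=(R, D))               # toy hash tables h_r : V -> [T]
group = rng.integers(0, R, size=n)                # random partition I_1..I_R
Z = rng.choice(np.array([-1, 1], dtype=np.int8),  # public matrix Z in {+-1}^(T x n)
               size=(T, n))

x = Z[h[group, v], np.arange(n)]                  # user j's bit Z[h_r(v_j), j]
flip = rng.random(n) < 1.0 / (np.exp(eps) + 1)
y = np.where(flip, -x, x)                         # 1-bit randomized reports

def a(query):
    """Median-of-R debiased estimate, as in step 3 of Hashtogram."""
    ests = []
    for r in range(R):
        idx = group == r
        c_r = np.sum(y[idx] * Z[h[r, query], idx])
        ests.append((np.exp(eps) + 1) / (np.exp(eps) - 1) * c_r)
    return R * np.median(ests)

for q in [1, 2, 3]:
    print(q, int(np.sum(v == q)), round(a(q)))    # true vs. estimated frequency
```

For heavy items the two printed counts should roughly agree, with deviations on the order of the $\mathrm{error}(v^*)$ bound derived above.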