# Practical Locally Private Heavy Hitters

Journal of Machine Learning Research 21 (2020) 1-42. Submitted 11/18; Revised 12/19; Published 2/20.

Raef Bassily (bassily.1@osu.edu), Department of Computer Science and Engineering, The Ohio State University

Kobbi Nissim (kobbi.nissim@georgetown.edu), Department of Computer Science, Georgetown University

Uri Stemmer (u@uri.co.il), Department of Computer Science, Ben-Gurion University

Abhradeep Thakurta (aguhatha@ucsc.edu), Department of Computer Science, University of California Santa Cruz

Editor: Moritz Hardt

We present two new practical locally differentially private heavy hitters algorithms, TreeHist and Bitstogram, achieving optimal or near-optimal worst-case error and running time. In both algorithms, server running time is $O(n)$ and user running time is $O(1)$, hence improving on the prior state-of-the-art result of Bassily and Smith [STOC 2015], which requires $O(n^{5/2})$ server time and $O(n^{3/2})$ user time. With a typically large number of participants in local algorithms ($n$ in the millions), this reduction in time complexity, in particular at the user side, is crucial for making locally private heavy hitters algorithms usable in practice. We implemented Algorithm TreeHist to verify our theoretical analysis and compared its performance with the performance of Google's RAPPOR code.

Keywords: differential privacy, local differential privacy, heavy hitters, histograms, sketching.

(c) 2020 Raef Bassily, Kobbi Nissim, Uri Stemmer, Abhradeep Thakurta. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-786.html.

1. Introduction

We revisit the problem of computing heavy hitters with local differential privacy. Such computations have already been implemented to provide organizations with valuable information about their user base while providing users with the strong guarantee that their privacy is preserved even if the organization is subpoenaed for the entire information seen during an execution. Two prominent examples are Google's use of RAPPOR in the Chrome browser (Erlingsson et al., 2014) and Apple's use of differential privacy in iOS 10 (Thakurta et al., 2017). These tools are used for learning new words typed by users and identifying frequently used emojis and frequently accessed websites.

Differential privacy in the local model. Differential privacy (Dwork et al., 2006) provides a framework for rigorously analyzing privacy risk and hence can help organizations mitigate users' privacy concerns, as it ensures that what is learned about any individual user would be (almost) the same whether the user's information is used as input to an analysis or not. Differentially private algorithms work in two main modalities: curator and local. The curator model assumes a trusted centralized curator that collects all the personal information and then analyzes it. In contrast, the local model does not involve a central repository. Instead, each piece of personal information is randomized by its provider to protect privacy even if all information provided to the analysis is revealed. Holding a central repository of personal information can become a liability to organizations in the face of security breaches, employee misconduct, subpoenas, etc. This makes locally private computations attractive for implementation.
Indeed, in the last few years Google and Apple have announced successful deployments of locally private analyses (Erlingsson et al., 2014; Thakurta et al., 2017).

Challenges of the local model. A disadvantage of the local model is that it requires introducing noise at a significantly higher level than what is required in the curator model. Furthermore, some tasks which are possible in the curator model are impossible in the local model (Dwork et al., 2006; Kasiviswanathan et al., 2011; Chan et al., 2012). To see the effect of noise, consider estimating the number of HIV positives in a given population of $n$ participants. In the curator model, it suffices to add Laplace noise of magnitude $O_\epsilon(1)$, i.e., independent of $n$ (Dwork et al., 2006). In contrast, a lower bound of $\Omega_\epsilon(\sqrt{n})$ is known for the local model (Chan et al., 2012). A higher noise level implies that the number of participants $n$ needs to be large (maybe in the millions for a reasonable choice of $\epsilon$). An important consequence is that, to be practical, locally private algorithms must exhibit low time, space, and communication complexity, especially at the user side. This is the problem addressed in our work.

Heavy hitters and histograms in the local model. Assume each of $n$ users holds an element $x_i$ taken from a domain of size $d$. A histogram of this data lists (an estimate of) the multiplicity of each domain element. When $d$ is large, a succinct representation of the histogram is desired, either in the form of a frequency oracle (a data structure allowing to approximate the multiplicity of any domain element) or a list of heavy hitters (listing the most frequent domain elements together with their multiplicities, implicitly considering the multiplicity of all other domain elements as zero). The problem of computing histograms with differential privacy has attracted significant attention both in the curator model (Dwork et al., 2006; Beimel et al., 2016; Bun et al., 2015) and the local model (Hsu et al., 2012; Erlingsson et al., 2014; Bassily and Smith, 2015). Of relevance is also the work in (Mishra and Sandler, 2006).

We briefly report on the state-of-the-art heavy hitters algorithms of Bassily and Smith (2015) and Thakurta et al. (2017), which are most relevant for the current work. Bassily and Smith provide matching lower and upper bounds of $\Theta\left(\sqrt{n \log(d)}/\epsilon\right)$ on the worst-case error of local heavy hitters algorithms. Their local algorithm exhibits optimal communication but a rather high time complexity: server running time is $O(n^{5/2})$ and, crucially, user running time is $O(n^{3/2})$, a complexity that severely hampers the practicality of this algorithm. (If public randomness or efficient Private Information Retrieval is assumed, then the user running time in (Bassily and Smith, 2015) can be improved to $O(n)$.) The construction by Thakurta et al. is a heuristic with no bounds on server running time and accuracy; their underlying construction is of a frequency oracle. (We refer to the maximal difference between the true and estimated multiplicities of a data element as the error, or the accuracy, of the protocol.) User computation time is $O(1)$, a significant improvement over (Bassily and Smith, 2015).

Our contributions. The focus of this work is on the design of locally private heavy hitters algorithms with near-optimal error, keeping time, space, and communication complexity minimal. We provide two new constructions of heavy hitters algorithms: TreeHist and Bitstogram. These algorithms achieve similar performance but apply different techniques.
We implemented Algorithm TreeHist and provide measurements in comparison with RAPPOR (Erlingsson et al., 2014) (the only currently available implementation for local histograms). Our measurements are performed with a setting that is favorable to RAPPOR (i.e., a small input domain), yet they indicate that Algorithm TreeHist performs better than RAPPOR in terms of noise level.

Table 1 details various performance parameters of TreeHist and Bitstogram, and the reader can check in the table that these are similar up to small factors, which we ignore for the rest of this paragraph. Comparing with (Bassily and Smith, 2015), we improve time complexity both at the server (reduced from $O(n^{5/2})$ to $O(n)$) and at the user (reduced from $O(n^{3/2})$ to $O(\max(\log n, \log d)^2)$). Comparing with (Thakurta et al., 2017), we get provable bounds on the server running time and worst-case error. Note that Algorithm Bitstogram achieves optimal worst-case error, whereas Algorithm TreeHist is almost optimal, off by a factor of $\sqrt{\log n}$.

| Performance metric | TreeHist | Bitstogram |
|---|---|---|
| Server time (modular multiplications) | $O(n)$ | $O(n)$ |
| User time (modular multiplications) | $O(\max(\log n, \log d)^2)$ | $O(\max(\log n, \log d)^2)$ |
| Server processing memory | $O(\sqrt{n})$ | $O(\sqrt{n})$ |
| User memory | $O(\max(\log d, \log n))$ | $O(\max(\log d, \log n))$ |
| Communication/user | $O(1)$ | $O(1)$ |
| Worst-case error | $O\left(\sqrt{n \log(n) \log(d)}\right)$ | $O\left(\sqrt{n \log(d)}\right)$ |

Table 1: Performance of our protocols. Dependency on the privacy parameter $\epsilon$ and failure probability $\beta$ is omitted.

Elements of the constructions. The main details of our constructions are presented in Sections 3.1 and 3.2. These are complemented with detailed descriptions and analyses in the appendix. Both our algorithms make use of frequency oracles: data structures that allow estimating various counts.

Algorithm TreeHist identifies heavy hitters and estimates their frequencies by scanning the levels of a binary prefix tree whose leaves correspond to dictionary items. The recovery of the heavy hitters is in a bit-by-bit manner. As the algorithm progresses down the tree, it prunes all the nodes that cannot be prefixes of heavy hitters, hence leaving $O(\sqrt{n})$ nodes at every depth. This is done by making queries to a frequency oracle. Once the algorithm reaches the final level of the tree, it identifies the list of heavy hitters. It then invokes the frequency oracle once more on those particular items to obtain more accurate estimates for their frequencies.

Algorithm Bitstogram hashes the input domain into a domain of size roughly $\sqrt{n}$. The observation behind this algorithm is that if a heavy hitter $x$ does not collide with other heavy hitters, then $(h(x), x_i)$ would have a significantly higher count than $(h(x), 1 - x_i)$ for every $i$, where $x_i$ is the $i$th bit of $x$. This allows recovering all bits of $x$ in parallel given an appropriate frequency oracle.

Followup work. Our work presents locally private heavy hitters algorithms with error rates that depend optimally on the number of users, the size of the domain, and the privacy parameter. Following our work, Bun et al. (2018) presented an improved protocol with an error rate that depends optimally also on the failure probability. Their protocol is a refined version of our Bitstogram protocol.

2. Background and Preliminaries

2.1. Definitions and Notation

Dictionary and user items: Let $V = [d]$. We consider a set of $n$ users, where each user $i \in [n]$ holds an item $v_i \in V$.
We will use $v_i$ also to refer to the binary representation of $v_i$ when it is clear from the context.

Item frequency (multiplicity): For each item $v \in V$, we define the frequency $f(v)$ of $v$ as the number of users holding $v$, namely, $f(v) \triangleq |\{i \in [n] : v_i = v\}|$.

Frequency oracle: A frequency oracle is a data structure together with an algorithm that, for any given $v \in V$, allows computing an estimate $\hat f(v)$ of the frequency $f(v)$.

Heavy hitters (succinct histogram): A succinct histogram is a data structure that provides a (short) list of items $(\hat v_1, \ldots, \hat v_k)$, called the heavy hitters, together with estimates for their frequencies $(\hat f(\hat v_j) : j \in [k])$. The frequencies of the items not in the list are implicitly estimated as $\hat f(v) = 0$. We measure the error in a succinct histogram by the $\ell_\infty$ distance between the estimated and true frequencies, $\max_{v \in [d]} |\hat f(v) - f(v)|$. We will also consider the maximum error restricted to the items in the list $(\hat v_1, \ldots, \hat v_k)$, that is, $\max_{j \in [k]} |\hat f(\hat v_j) - f(\hat v_j)|$. If a succinct histogram aims to provide $\ell_\infty$ error $\eta$, the list does not need to contain more than $O(n/\eta)$ items, since items with estimated frequencies below $\eta$ may be omitted from the list.

2.2. Local Differential Privacy

In the local model, an algorithm $A : V^n \to Z$ accesses the database $v = (v_1, \ldots, v_n) \in V^n$ only via an oracle that, given an index $i \in [n]$, runs a randomized algorithm (local randomizer) $R : V \to Z$ on input $v_i$ and returns $R(v_i)$ to $A$.

Definition 2.1 (Local differential privacy; Dwork et al. (2006); Evfimievski et al. (2003); Kasiviswanathan et al. (2011)) An algorithm satisfies $\epsilon$-local differential privacy (LDP) if it accesses the database $v = (v_1, \ldots, v_n) \in V^n$ only via invocations of a local randomizer $R$, and if for all $i \in [n]$, letting $R^{(1)}, \ldots, R^{(k)}$ denote the algorithm's invocations of $R$ on the data sample $v_i$, the algorithm $A(\cdot) \triangleq \left(R^{(1)}(\cdot), R^{(2)}(\cdot), \ldots, R^{(k)}(\cdot)\right)$ is $\epsilon$-differentially private. That is, for any pair of inputs $v, v'$ that differ on a single entry, and for all $S \subseteq \mathrm{Range}(A)$, $\Pr[A(v) \in S] \le e^\epsilon \Pr[A(v') \in S]$.

2.3. Count Sketch and Hadamard Transform

Count sketch (Charikar et al., 2002), together with the Hadamard transform, forms the basis of our differentially private construction outlined in Section 3.1 and discussed in detail in Section 5. Count sketch is a sketching algorithm for finding frequent elements in a data stream. Let $V = [d]$ be a domain of data elements, and let $v = (v_1, \ldots, v_n)$ be a stream of data elements. Count sketch ensures that, for any given $v \in V$, using a data structure of size $m = O\left(\frac{n \log(1/\beta)}{k}\right)$, with probability at least $1 - \beta$ the estimated frequency of $v$ is within $k$ of the true frequency, and the estimate is unbiased. The algorithm works as follows. First, pick $t$ pairs of hash functions $\left(h_i : V \to [m],\ g_i : V \to \{-1, 1\}\right)$, and initialize a matrix $M = \{0\}^{t \times m}$. Second, for every data sample $v_i$, populate the matrix as follows: for each $j \in [t]$, $M[j, h_j(v_i)] \leftarrow M[j, h_j(v_i)] + g_j(v_i)$. Finally, to estimate the frequency of an element $v \in V$, compute $\mathrm{median}\left\{M[1, h_1(v)] \cdot g_1(v),\ \ldots,\ M[t, h_t(v)] \cdot g_t(v)\right\}$.
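For concreteness, the update and estimate steps just described fit in a few lines; the following sketch is illustrative only (the affine hash construction and all parameter choices are our own stand-ins for the pairwise independent hash pairs, not the paper's specification):

```python
import random
from statistics import median

def make_count_sketch(t, m, seed=0):
    """Count sketch (Charikar et al., 2002) as described above: t hash pairs
    (h_j, g_j) and a t x m counter matrix M. The hashes below (random affine
    maps modulo a Mersenne prime) are simple illustrative stand-ins."""
    rng = random.Random(seed)
    P = (1 << 61) - 1
    hs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
    gs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
    M = [[0] * m for _ in range(t)]

    def h(j, v):  # h_j : V -> [m]
        a, b = hs[j]
        return ((a * v + b) % P) % m

    def g(j, v):  # g_j : V -> {-1, +1}
        a, b = gs[j]
        return 1 if ((a * v + b) % P) % 2 == 0 else -1

    def update(v):  # M[j, h_j(v)] += g_j(v) for every j
        for j in range(t):
            M[j][h(j, v)] += g(j, v)

    def estimate(v):  # median over j of M[j, h_j(v)] * g_j(v)
        return median(M[j][h(j, v)] * g(j, v) for j in range(t))

    return update, estimate

update, estimate = make_count_sketch(t=7, m=256)
for x in [1] * 1000 + [2] * 300 + list(range(3, 500)):
    update(x)
print(estimate(1), estimate(2))  # close to the true counts 1000 and 300
```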
Hadamard transform: We use the Hadamard transform, followed by sampling, in order to compress the data transmission from the client to the server and to reduce the space requirements of our protocol. The Hadamard transform of a vector $w \in \mathbb{R}^m$ is obtained by multiplying it with the Hadamard transform matrix $H_m \in \left\{-\frac{1}{\sqrt m}, \frac{1}{\sqrt m}\right\}^{m \times m}$, defined recursively as $H_m = \frac{1}{\sqrt 2}\begin{pmatrix} H_{m/2} & H_{m/2} \\ H_{m/2} & -H_{m/2} \end{pmatrix}$ and $H_1 = [1]$.

Two main properties of the Hadamard transform that we use are: (i) it is a dense basis transformation, i.e., the columns of the Hadamard matrix form a basis, and each entry of $\sqrt m \cdot H_m$ is in $\{-1, 1\}$; and (ii) any entry $(i, j)$ of $\sqrt m \cdot H_m$ can be computed in $O(\log m)$ time.

2.4. Error-Correcting Codes

We will use error-correcting codes in order to reduce the error of (some of) our constructions outlined in Section 3.2 and discussed in detail in Section 6.

Definition 2.2 A binary $(n, k)$-code is a pair of mappings $(\mathrm{Enc}, \mathrm{Dec})$ where $\mathrm{Enc} : \{0,1\}^k \to \{0,1\}^n$ and $\mathrm{Dec} : \{0,1\}^n \to \{0,1\}^k$. The code is $\zeta$-decodable if for every $x \in \{0,1\}^k$ and every $y \in \{0,1\}^n$ whose Hamming distance to $\mathrm{Enc}(x)$ is at most $\zeta n$, we have that $\mathrm{Dec}(y) = x$.

For any constant $0 < \zeta < 1/4$, there is a construction of a $\zeta$-decodable $(n, k)$-code, where $n = O(k)$, and furthermore, Enc and Dec run in time $O(n)$. See, e.g., (Guruswami, 2001).

2.5. Tools from Probability

2.5.1. Pairwise and k-wise Independence

The definitions given here are taken from (Vadhan, 2012).

Definition 1 A family (i.e., multiset) of functions $\mathcal H = \{h : [N] \to [M]\}$ is pairwise independent if the following two conditions hold when $H$ is a function chosen uniformly at random from $\mathcal H$:

1. For all $x \in [N]$, the random variable $H(x)$ is uniformly distributed in $[M]$.
2. For all $x_1 \ne x_2 \in [N]$, the random variables $H(x_1)$ and $H(x_2)$ are independent.

We will use the following tail bound on sums of $k$-wise independent random variables.

Lemma 2.3 (Bellare and Rompel (1994)) Let $\lambda \ge 6$ be an even integer. Suppose $X_1, \ldots, X_n$ are $\lambda$-wise independent random variables taking values in $[0, 1]$. Let $X = X_1 + \cdots + X_n$ and $\mu = \mathbb E[X]$, and let $\alpha > 0$. Then, $\Pr[|X - \mu| \ge \alpha] \le 8\left(\frac{\lambda \mu + \lambda^2}{\alpha^2}\right)^{\lambda/2}$.

2.5.2. The Poisson Approximation

We will use the following useful facts about the Poisson approximation. When throwing $n$ balls into $R$ bins, the distribution of the number of balls in a given bin is $\mathrm{Bin}(n, 1/R)$. As the Poisson distribution is the limit distribution of the binomial distribution, the distribution of the number of balls in a given bin is approximately $\mathrm{Pois}(n/R)$. In fact, in some cases we can approximate the joint distribution of the number of balls in all the bins by assuming the load in each bin is an independent Poisson random variable with mean $n/R$.

Theorem 2.4 (e.g., Mitzenmacher and Upfal (2005)) Suppose that $n$ balls are thrown into $R$ bins independently and uniformly at random, and let $X_i$ be the number of balls in the $i$th bin, where $1 \le i \le R$. Let $Y_1, \ldots, Y_R$ be independent Poisson random variables with mean $n/R$. Let $f(x_1, \ldots, x_R)$ be a nonnegative function. Then, $\mathbb E[f(X_1, \ldots, X_R)] \le e\sqrt{n}\; \mathbb E[f(Y_1, \ldots, Y_R)]$.

In particular, the theorem states that any event that takes place with probability $p$ in the Poisson case takes place with probability at most $p e \sqrt n$ in the exact case (this follows by letting $f$ be the indicator function of that event). We will also use the following bounds for the tail probabilities of a Poisson random variable:

Theorem 2.5 (Alon and Spencer (1992)) Let $X$ have Poisson distribution with mean $\mu$. For $0 \le \alpha \le 1$, $\Pr[X \le \mu(1 - \alpha)] \le e^{-\alpha^2 \mu / 2}$ and $\Pr[X \ge \mu(1 + \alpha)] \le e^{-\alpha^2 \mu / 2}$.

3. Our Algorithms

In this section we give an informal description of our algorithms and highlight some of the ideas behind our constructions.

3.1. The TreeHist Protocol

We briefly give an overview of our construction, which is based on a compressed, noisy version of the count sketch. To maintain clarity of the main ideas, we give here a high-level description of our construction. We refer to Section 5 for a detailed description of this construction.
We first introduce some objects and public parameters that will be used in the construction.

Prefixes: For a binary string $v$, we will use $v[1:\ell]$ to denote the $\ell$-bit prefix of $v$. Let $\mathcal{V} = \left\{v \in \{0,1\}^\ell \text{ for some } \ell \in [\log d]\right\}$. Note that the elements of $\mathcal{V}$ can be arranged in a binary prefix tree of depth $\log d$, where the nodes at level $\ell$ of the tree represent all binary strings of length $\ell$. The items of the dictionary $V$ represent the bottommost level of that tree. Note that $|\mathcal{V}| < 2d$. We will use $\perp$ to denote the empty string. For $v \in \{0,1\}^\ell$ and $b \in \{0,1\}$, let $v \circ b$ denote the $(\ell+1)$-bit string resulting from appending the bit $b$ to $v$. For a binary string $v$, we define $\mathrm{Child}(v) \triangleq \{v \circ 0,\ v \circ 1\}$, that is, the set containing the two children of $v$ in the prefix tree. Similarly, for a set of strings $U$, we define $\mathrm{ChildSet}(U) \triangleq \{v : v \in \mathrm{Child}(u) \text{ for some } u \in U\}$.

Hashes: Let $t, m$ be positive integers to be specified later. We will consider a set of $t$ pairs of hash functions $\{(h_1, g_1), \ldots, (h_t, g_t)\}$, where for each $i \in [t]$, $h_i : \mathcal{V} \to [m]$ and $g_i : \mathcal{V} \to \{-1, +1\}$ are independently and uniformly chosen pairwise independent hash functions.

Basis matrix: Let $W \in \{-1, +1\}^{m \times m}$ be $\sqrt{m} \cdot H_m$, where $H_m$ is the Hadamard transform matrix of size $m$. As will be shown later, we will be making operations over the entries of this matrix. It is important to note that we do not need to store this matrix: the value of any entry can be computed in $O(\log m)$ bit operations given the (row, column) index of that entry. In particular, suppose we want to compute the value of the entry $W_{i,j}$ located at the $i$-th row and $j$-th column. Let $(i_0, i_1, \ldots, i_{\log m - 1})$ and $(j_0, j_1, \ldots, j_{\log m - 1})$ denote the bit representations of $i$ and $j$, respectively. Then, $W_{i,j} = (-1)^{\sum_{k=0}^{\log m - 1} i_k j_k}$.
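A minimal sketch of this entry computation (Python, 0-indexed row and column, assuming $m$ is a power of two): the sign is simply the parity of the bitwise AND of the two indices.

```python
def hadamard_sign(i, j):
    """Sign of W[i, j] = (-1)^{sum_k i_k j_k}: the parity of popcount(i & j),
    computable in O(log m) bit operations, so W is never stored explicitly."""
    return -1 if bin(i & j).count("1") % 2 else 1

# Sanity check: the 4 x 4 case reproduces sqrt(4) * H_4.
print([[hadamard_sign(i, j) for j in range(4)] for i in range(4)])
```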
Global parameters: The total number of users $n$, the size of the Hadamard matrix $m$, the number of hash pairs $t$, the privacy parameter $\epsilon$, the confidence parameter $\beta$, and the hash functions $(h_1, g_1), \ldots, (h_t, g_t)$ are assumed to be public information. We set $t = O(\log(n/\beta))$ and $m = O\left(\sqrt{n/\log(n/\beta)}\right)$.

Public randomness: In addition to the $t$ hash pairs $\{(h_1, g_1), \ldots, (h_t, g_t)\}$, we assume that the server creates a random partition $\Pi : [n] \to [\log d] \times [t]$ that assigns to each user $i \in [n]$ a random pair $(\ell_i, j_i) \in [\log d] \times [t]$, and another random function $Q : [n] \to [m]$ that assigns to each user $i$ a uniformly random index $r_i \in [m]$. (We could have grouped $\Pi$ and $Q$ into one random function mapping $[n]$ to $[\log d] \times [t] \times [m]$; however, we prefer to split them for clarity of exposition, as each source of randomness will be used for a different role.) We assume that such random indices $\ell_i, j_i, r_i$ are shared between the server and each user.

First, we describe the two main modules of our protocol.

3.1.1. A Local Randomizer: LocalRnd

For each $i \in [n]$, user $i$ runs her own independent copy of a local randomizer, denoted LocalRnd, to generate her private report. LocalRnd of user $i$ starts by acquiring the index triple $(\ell_i, j_i, r_i) \in [\log d] \times [t] \times [m]$ from public randomness. For each user, LocalRnd is invoked twice in the full protocol: once during the first phase of the protocol (called the pruning phase), where the high-frequency items (heavy hitters) are identified, and a second time during the final phase (the estimation phase), to enable the protocol to get better estimates for the frequencies of the heavy hitters. In the first invocation, LocalRnd of user $i$ performs its computation on the $\ell_i$-bit prefix of the item $v_i$ of user $i$, whereas in the second invocation it performs the computation on the user's entire string $v_i$. Apart from this, LocalRnd follows similar steps in both invocations.

It first selects the hash pair $(h_{j_i}, g_{j_i})$ and computes $c_i = h_{j_i}(v_i[1:\ell])$ (where $\ell = \ell_i$ in the first invocation and $\ell = \log d$ in the second invocation, and $v_i[1:\ell]$ is the $\ell$-bit prefix of $v_i$); then it computes a bit $x_i = g_{j_i}(v_i[1:\ell]) \cdot W_{r_i, c_i}$ (where $W_{r,c}$ denotes the $(r, c)$ entry of the basis matrix $W$). Finally, to guarantee $\epsilon$-local differential privacy, it generates a randomized response $y_i$ based on $x_i$ (i.e., $y_i = x_i$ with probability $e^{\epsilon/2}/(1 + e^{\epsilon/2})$ and $y_i = -x_i$ with probability $1/(1 + e^{\epsilon/2})$), which is sent to the server.

Our local randomizer can be thought of as a transformed, compressed (via sampling), and randomized version of the count sketch (Charikar et al., 2002). In particular, LocalRnd starts off with steps similar to the standard count sketch algorithm, but then deviates from it: it applies the Hadamard transform to the user's signal, then samples one bit from the result. By doing so, we can achieve significant savings in space and communication without sacrificing accuracy.
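To see why the server can later de-bias these one-bit reports, a short calculation from the randomized-response probabilities above (the resulting constant $a_\epsilon$ is exactly the scaling used in Algorithm 2 below):

```latex
\mathbb{E}[y_i]
  = x_i \cdot \frac{e^{\epsilon/2}}{e^{\epsilon/2}+1}
  - x_i \cdot \frac{1}{e^{\epsilon/2}+1}
  = x_i \cdot \frac{e^{\epsilon/2}-1}{e^{\epsilon/2}+1},
\qquad \text{so} \qquad
a_\epsilon \, \mathbb{E}[y_i] = x_i
\quad \text{for} \quad
a_\epsilon = \frac{e^{\epsilon/2}+1}{e^{\epsilon/2}-1}.
```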
3.1.2. A Frequency Oracle: FreqOracle

Suppose we want to allow the server to estimate the frequencies of some given subset $\hat{V} \subseteq \{0,1\}^\ell$, for some given $\ell \in [\log d]$, based on the noisy users' reports. We give a protocol, denoted FreqOracle, for accomplishing this task.

For each queried item $\hat v \in \hat V$ and for each hash index $j \in [t]$, FreqOracle computes $c = h_j(\hat v)$, then collects the noisy reports of a collection of users $I_{\ell,j}$ that contains every user $i$ whose pair of prefix and hash indices $(\ell_i, j_i)$ matches $(\ell, j)$. Next, it estimates the inverse Hadamard transform of the compressed and noisy signal of each user in $I_{\ell,j}$. In particular, for each $i \in I_{\ell,j}$, it computes $y_i \cdot W_{r_i, c}$, which can be described as a multiplication between $y_i e_{r_i}$ (where $e_{r_i}$ is the indicator vector with 1 at the $r_i$-th position) and the scaled Hadamard matrix $W$, followed by selecting the $c$-th entry of the resulting vector. This brings us back to the standard count sketch representation. It then sums all the results and multiplies the outcome by $g_j(\hat v)$ to obtain an estimate $\hat f_j(\hat v)$ for the frequency of $\hat v$. As in the count sketch algorithm, this is done for every $j \in [t]$; FreqOracle then obtains a high-confidence estimate by computing the median of all $t$ frequency estimates.

3.1.3. The Protocol: TreeHist

The protocol is easier to describe via operations over nodes of the prefix tree $\mathcal V$ of depth $\log d$ (described earlier). The protocol runs through two main phases: the pruning (or scanning) phase, and the final estimation phase. In the pruning phase, the protocol scans the levels of the prefix tree, starting from the top level (which contains just 0 and 1) down to the bottom level (which contains all items of the dictionary). For a given node at level $\ell \in [\log d]$, using FreqOracle as a subroutine, the protocol gets an estimate for the frequency of the corresponding $\ell$-bit prefix. For any $\ell \in [\log(d) - 1]$, before the protocol moves to level $\ell + 1$ of the tree, it prunes all the nodes in level $\ell$ that cannot be prefixes of actual heavy hitters (high-frequency items in the dictionary). Then, as it moves to level $\ell + 1$, the protocol considers only the children of the surviving nodes in level $\ell$.

The construction guarantees that, with high probability, the number of surviving nodes in each level cannot exceed $O\left(\sqrt{\frac{n}{\log(d)\log(n)}}\right)$. Hence, the total number of nodes queried by the protocol (i.e., submitted to FreqOracle) is at most $O\left(\sqrt{\frac{n\log(d)}{\log(n)}}\right)$.

In the second and final phase, after reaching the final level of the tree, the protocol will have already identified a list of candidate heavy hitters; however, their estimated frequencies may not be as accurate as desired, due to the large variance caused by the random partitioning of users across all the levels of the tree. Hence, it invokes the frequency oracle once more on those particular items, and this time the sampling variance is reduced, as the set of users is partitioned only across the $t$ hash pairs (rather than across $\log(d) \cdot t$ bins as in the pruning phase). By doing this, the server obtains more accurate estimates for the frequencies of the identified heavy hitters.

The privacy and accuracy guarantees are stated below. The full details are given in Section 5.

3.1.4. Privacy and Utility Guarantees

Theorem 3.1 Protocol TreeHist is $\epsilon$-local differentially private.

Theorem 3.2 There is a number $\eta = O\left(\sqrt{n \log(n/\beta) \log(d)}/\epsilon\right)$ such that, with probability at least $1 - \beta$ (over the choice of all public and private randomness), the output list of the TreeHist protocol satisfies the following properties:

1. it contains all items $v \in V$ with true frequencies above $3\eta$;
2. it does not contain any item $v \in V$ with true frequency below $\eta$;
3. every frequency estimate in the output list is accurate up to an error $O\left(\sqrt{n \log(n/\beta)}/\epsilon\right)$.

3.2. The Bitstogram Protocol

We now present a simplified description of our second protocol that captures most of the ideas. Any informalities made hereafter are removed in the full description of the protocol (Section 6).

First Step: Frequency Oracle. Recall that a frequency oracle is a protocol that, after communicating with the users, outputs a data structure capable of approximating the frequency of every domain element $v \in V$. So, if we were to allow the server to have runtime linear in the domain size $|V| = d$, then a frequency oracle would suffice for computing histograms. As we are interested in protocols with a significantly lower runtime, we will only use a frequency oracle as a subroutine, and query it only for (roughly) $\sqrt n$ elements.

Let $Z \in \{\pm 1\}^{d \times n}$ be a matrix chosen uniformly at random, and assume that $Z$ is publicly known. (As we later explain, $Z$ has a short description, as it need not be uniform.) That is, for every domain element $v \in V$ and every user $j \in [n]$, we have a random bit $Z[v, j] \in \{\pm 1\}$. As $Z$ is publicly known, every user $j$ can identify her corresponding bit $Z[v_j, j]$, where $v_j \in V$ is the input of user $j$. Now consider a protocol in which users send randomized responses of their corresponding bits. That is, user $j$ sends $y_j = Z[v_j, j]$ w.p. $\frac12 + \frac\epsilon2$ and sends $y_j = -Z[v_j, j]$ w.p. $\frac12 - \frac\epsilon2$. We can now estimate the frequency of every domain element $v \in V$ as $a(v) = \frac1\epsilon \sum_{j \in [n]} y_j \cdot Z[v, j]$.

To see that $a(v)$ is accurate, observe that $a(v)$ is the sum of $n$ independent random variables (one for every user). For the users $j$ holding the input $v$ (that is, $v_j = v$) we have that $\frac1\epsilon \mathbb{E}[y_j \cdot Z[v, j]] = 1$. For the other users, $y_j$ and $Z[v, j]$ are independent, and hence $\mathbb{E}[y_j \cdot Z[v, j]] = \mathbb{E}[y_j] \cdot \mathbb{E}[Z[v, j]] = 0$. That is, $a(v)$ can be expressed as the sum of $n$ independent random variables: $f(v)$ variables with expectation 1, and $(n - f(v))$ variables with expectation 0. The fact that $a(v)$ is an accurate estimate of $f(v)$ now follows from the Hoeffding bound.
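A minimal simulation of this estimator (our own illustrative code: the matrix $Z$ is sampled explicitly here, whereas the protocol derives it from public randomness, and the flip probability $\frac12 + \frac\epsilon2$ is the simplified one used in this informal description):

```python
import numpy as np

def explicit_hist_sim(values, d, eps, rng):
    """Estimate a(v) = (1/eps) * sum_j y_j * Z[v, j] for every v in [0, d)."""
    n = len(values)
    Z = rng.choice([-1, 1], size=(d, n))      # public random matrix Z
    bits = Z[values, np.arange(n)]            # user j's bit Z[v_j, j]
    keep = rng.random(n) < 0.5 + eps / 2      # send the bit w.p. 1/2 + eps/2
    y = np.where(keep, bits, -bits)           # randomized responses
    return (Z @ y) / eps

rng = np.random.default_rng(0)
vals = rng.choice(10, size=100_000, p=[0.3] + [0.7 / 9] * 9)
est = explicit_hist_sim(vals, d=10, eps=0.5, rng=rng)
print(int(est[0]), int(np.sum(vals == 0)))   # estimate vs. true count of item 0
```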
Lemma 3.3 (Algorithm Hashtogram) Let $\epsilon \le 1$. Algorithm Hashtogram satisfies $\epsilon$-LDP. Furthermore, with probability at least $1 - \beta$, algorithm Hashtogram answers every query $v \in V$ with $a(v)$ satisfying $|a(v) - f(v)| \le O\left(\frac1\epsilon \sqrt{n \log(d/\beta)}\right)$.

Second Step: Identifying Heavy Hitters. Let us assume that we have a frequency oracle protocol with worst-case error $\tau$. We now want to use our frequency oracle in order to construct a protocol that operates in two steps: first, it identifies a small set of potential "heavy hitters", i.e., domain elements that appear in the database at least $2\tau$ times; afterwards, it uses the frequency oracle to estimate the frequencies of those potential heavy elements. (Even though we describe the protocol as having two steps, the necessary communication for these steps can be done in parallel; hence, our protocol has only one round of communication.)

Let $h : V \to [T]$ be a (publicly known) random hash function mapping domain elements into $[T]$, where $T$ will be set later. (As with the matrix $Z$, the hash function $h$ can have a short description length.) We will now use $h$ in order to identify the heavy hitters. To that end, let $v^* \in V$ denote such a heavy hitter, appearing at least $2\tau$ times in the database $S$, and denote $t^* = h(v^*)$. Assuming that $T$ is big enough, w.h.p. we will have that $v^*$ is the only input element (from $S$) that is mapped (by $h$) into the hash value $t^*$. Assuming that this is indeed the case, we will now identify $v^*$ bit by bit.

For $\ell \in [\log d]$, denote $S_\ell = (h(v_j), v_{j,\ell})_{j \in [n]}$, where $v_{j,\ell}$ is bit $\ell$ of $v_j$. That is, $S_\ell$ is a database over the domain $[T] \times \{0,1\}$, where the row corresponding to user $j$ is $(h(v_j), v_{j,\ell})$. Observe that every user can compute her own row locally. As $v^*$ is a heavy hitter, for every $\ell \in [\log d]$ we have that $(t^*, v^*_\ell)$ appears in $S_\ell$ at least $2\tau$ times, where $v^*_\ell$ is bit $\ell$ of $v^*$. On the other hand, as we assumed that $v^*$ is the only input element that is mapped into $t^*$, we get that $(t^*, 1 - v^*_\ell)$ does not appear in $S_\ell$ at all. Recall that our frequency oracle has error at most $\tau$; hence, we can use it to accurately determine the bits of $v^*$.

To make things more concrete, consider the protocol that, for every hash value $t \in [T]$, every coordinate $\ell \in [\log d]$, and every bit $b \in \{0,1\}$, obtains an estimate (using the frequency oracle) of the multiplicity of $(t, b)$ in $S_\ell$ (so there are $\log d$ invocations of the frequency oracle, and a total of $2T \log d$ estimations). Now, for every $t \in [T]$, let us define $\hat v_t$, where bit $\ell$ of $\hat v_t$ is the bit $b$ s.t. $(t, b)$ is more frequent than $(t, 1-b)$ in $S_\ell$. By the above discussion, we will have that $\hat v_{t^*} = v^*$. That is, the protocol identifies a set of $T$ domain elements containing all of the heavy hitters. The frequency of the identified heavy hitters can then be estimated using the frequency oracle.

Remark 2 As should be clear from the above discussion, it suffices to take $T \gtrsim n^2$, as this will ensure that there are no collisions among different input elements. As we only care about collisions between heavy hitters (appearing in $S$ at least $\sqrt n$ times), it would suffice to take $T \gtrsim n$ to ensure that w.h.p. there are no collisions between heavy hitters. In fact, we could even take $T \approx \sqrt n$, which would ensure that a heavy hitter $v^*$ has no collisions with constant probability, and then amplify our confidence using repetitions.

Lemma 3.4 (Algorithm Bitstogram) Let $\epsilon \le 1$. Algorithm Bitstogram satisfies $\epsilon$-LDP.
Furthermore, the algorithm returns a list $L$ of length $O(\sqrt n)$ satisfying:

1. With probability $1 - \beta$, for every $(v, a) \in L$ we have that $|a - f(v)| \le O\left(\frac1\epsilon \sqrt{n \log(n/\beta)}\right)$.
2. W.p. $1 - \beta$, for every $v \in V$ s.t. $f(v) \ge O\left(\frac1\epsilon \sqrt{n \log(d/\beta) \log(\frac1\beta)}\right)$, we have that $v$ is in $L$.

4. Detailed Experimental Results

In this section we discuss implementation details of our algorithms mentioned in Section 5. (The experiments are performed without the Hadamard compression during data transmission.) The main objective of this section is to emphasize the empirical efficacy of our algorithms, alongside their theoretical optimality in terms of error, space, time, and communication. Thakurta et al. (2017) recently claimed space optimality for a similar problem, but a formal analysis (or empirical evidence) was not provided. Our experiments corroborate both the analytical bounds in our current work and those in (Thakurta et al., 2017). Our experiments are performed on a macOS Sierra 10.12 system (in Python 2.7) with a 3.3 GHz Intel Core i5 and 16 GB of DDR3 RAM.

4.1. Private Frequency Oracle

In this experiment, the objective is to test the efficacy of our algorithm in estimating the frequencies of a known dictionary of user items under local differential privacy. We estimate the error in estimation while varying the size of the data set $n$ and the privacy parameter $\epsilon$. (See Section 2.1 for a refresher on the notation.)

Figure 1 shows results on a synthetic data set with a domain size of one hundred (i.e., $d = 100$) drawn from a power law distribution with power of 15. The default parameters used in Figure 1 are: number of data samples ($n$): 10 million; range of the hash function ($m$): $\sqrt n$; number of hash functions ($t$): 285; and privacy parameter $\epsilon = 2.0$. For the hash functions, we used the prefix bits of SHA-256. The estimated frequency is scaled by the number of samples to normalize the result, and each experiment is averaged over ten runs. The bars for "True" refer to the true frequencies, and the bars for "Priv" correspond to the differentially private frequencies. The "pctle" label corresponds to the frequency of a domain element at the corresponding percentile in the frequency distribution of the data set.

Observations: (i) The plots corroborate the fact that the frequency oracle is indeed unbiased: the average frequency estimate (over ten runs) for each percentile is within one standard deviation of the corresponding true estimate. (ii) The error in the estimates goes down significantly as the number of samples or the privacy parameter $\epsilon$ is increased.

[Figure 1: Estimated frequency vs. number of samples (n), and estimated frequency vs. privacy parameter (ε), on the synthetic data set; each panel compares True and Priv at the 50th, 90th, and 95th percentiles.]

In Figure 2 we show the result of changing the range of the hash function ($m$). The observation is that the results are seemingly insensitive to the range of the hash function. We also ran the same experiment (Figure 3) on a real data set drawn uniformly at random from the NLTK Brown corpus (NLT). The data set we created has $n = 10$ million samples drawn i.i.d. from the corpus with replacement, and the system parameters are the same default parameters described earlier.
In this plot, the rank corresponds to the rank of a domain element in the distribution of frequencies in the data set. The observation here is also consistent with that of Figure 1.

[Figure 2: Estimated frequency vs. sketch width (m), measured as sketch-width/sqrt(n), on the synthetic data set, at the 50th, 90th, and 95th percentiles.]

[Figure 3: Estimated frequency vs. privacy parameter (ε) on the NLTK Brown corpus, at ranks 1, 10, and 100.]

Comparison to RAPPOR (Erlingsson et al., 2014): Here we compare ourselves to the only other system for the private frequency estimation problem whose code is publicly available (the RAPPOR project from Google). We took a snapshot of their code base (https://github.com/google/rappor) on May 9th, 2017. In order to perform a fair comparison, we tested our algorithm against one of their available demo experiments (Demo3, using the demo.sh script). We used the privacy parameter $\epsilon = \ln(3)$, the number of data samples $n = 1$ million, and the same data set generated by the demo.sh script. In Figure 4 we observe that for higher frequencies both RAPPOR and our algorithm perform similarly. However, in lower frequency regimes, the RAPPOR estimates are zero most of the time, while our estimates are closer to the true estimates.

[Figure 4: Comparison between count sketch and RAPPOR on the Demo 3 experiment from RAPPOR: true frequency, Count_Sketch, and RAPPOR estimates.]

N.B. We do not claim that our algorithm would outperform the RAPPOR system on all problem instances. However, our current experiment does highlight the need to perform an at-scale comparison between the two algorithms.

4.2. Private Heavy Hitters

In this section, we take on the harder task of identifying the heavy hitters, rather than estimating the frequencies of known domain elements. We run our experiments on the same two data sets described earlier with the same setting of parameters, except that now we assume that we do not know the domain. As a part of our algorithm design, we assume that every element in the domain is from the English alphabet [a-z] and is of length exactly six. If elements are of larger length, we truncate them before entering them in the data set, and if they are of smaller length we pad them at the end. We generate the domain elements for the synthetic data set by first generating the frequency histogram based on the power law distribution described earlier, and then assigning random strings of length eight to each bin of the histogram. That becomes our data set. For the NLTK Brown corpus (NLT), we sample $n = 10$ million samples with replacement from the corpus to form our data set. We set a threshold of $15\sqrt n$ for being a heavy hitter. We measure the efficacy of our system by measuring precision and recall.

Figures 5 and 6 show the true data distributions for the synthetic and the NLTK data sets.

[Figure 5: Distribution for the synthetic data set (normalized frequency per word).]

[Figure 6: Distribution for the Brown corpus, top 100 words (normalized frequency per word).]

In Table 2 we state the corresponding precision and recall parameters.
Our recall numbers are much better than the precision numbers, primarily because of the large number of negative examples ($\approx 3 \times 10^8$ examples). In practice, if there are false positives, they can be easily pruned using domain expertise. For example, if we are trying to identify new words that users are typing in English (app, 2016), then using the domain expertise of English, a set of false positives can easily be ruled out by inspecting the list of heavy hitters output by the algorithm. Further, notice that since we are working with domain elements of size six characters, a brute-force algorithm would require $26^6$ queries to the frequency oracle, which would be computationally (near) infeasible. While there are other algorithms for finding heavy hitters (Bassily and Smith, 2015; Hsu et al., 2012), either they do not provide any theoretical guarantee on utility (Erlingsson et al., 2014; Fanti et al., 2015; Thakurta et al., 2017) or there does not exist a scalable and efficient implementation for them. Our work scores well on both these aspects.

| Data set | # of unique words | Precision | Recall |
|---|---|---|---|
| Synthetic | 93 | 0.36 (σ = 0.05) | 0.95 (σ = 0.03) |
| NLTK Brown corpus | 25991 | 0.24 (σ = 0.04) | 0.86 (σ = 0.05) |

Table 2: Private heavy hitters with threshold $15\sqrt n$. Here σ corresponds to the standard deviation.

5. Locally Private Heavy Hitters via Count Sketch: The TreeHist Protocol

We start with a detailed description of the construction described at a high level in Section 3.1. We refer to Section 3.1 for the definitions and public parameters that will be used in the construction, namely, prefixes, hashes, the basis matrix, global parameters, and public randomness. We restate below our public parameters and, when applicable, their specific settings.

Global parameters: We will assume that the total number of users $n$, the size of the Hadamard matrix $m$, the number of hash pairs $t$, the privacy parameter $\epsilon$, and the confidence parameter $\beta$ are public parameters, and hence they will not be explicitly provided as inputs to the algorithms. For integer parameters, we will implicitly assume that results are rounded to the nearest integer. We set $t = 110 \log(n/\beta)$ and $m = 48\sqrt{n/\log(n/\beta)}$. The hash functions $(h_1, g_1), \ldots, (h_t, g_t)$ will also be assumed to be public information (this is $O(\log(d)\log(n/\beta))$ bits of shared randomness).

Public randomness: In addition to the $t$ hash pairs $\{(h_1, g_1), \ldots, (h_t, g_t)\}$, we assume that the server creates a random partition $\Pi : [n] \to [\log d] \times [t]$ over the set of users; that is, each user $i$ gets a random pair $(\ell_i, j_i) \in [\log d] \times [t]$ that represents the index of one of $\log(d) \cdot t$ buckets. Moreover, the server uses another random function $Q : [n] \to [m]$ that assigns to each user $i$ a uniformly random index $r_i \in [m]$. We assume that such random indices $\ell_i, j_i, r_i$ are shared between the server and each user. For each $\ell \in [\log d]$ and each $j \in [t]$, we define $I_{\ell,j} \triangleq \{i : \ell_i = \ell,\ j_i = j\}$ and $\bar I_j \triangleq \{i : j_i = j\} = \cup_{\ell \in [\log d]} I_{\ell,j}$.
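The shared indices can be derived without extra communication; a minimal sketch (our own illustrative scheme, echoing the experiments' use of SHA-256 prefix bits; the paper only requires that the triples be shared between the server and each user):

```python
import hashlib

def shared_indices(user_id, seed, log_d, t, m):
    """Derive user i's triple (l_i, j_i, r_i) from a common seed, so that the
    server and the user agree on Pi(i) = (l_i, j_i) and Q(i) = r_i."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode()).digest()
    x = int.from_bytes(digest, "big")  # 256 near-uniform bits
    l_i = x % log_d + 1                # level in [log d]  (partition Pi)
    x //= log_d
    j_i = x % t + 1                    # hash-pair index in [t]  (partition Pi)
    x //= t
    r_i = x % m + 1                    # row index in [m]  (function Q)
    return l_i, j_i, r_i
```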
5.1. A Local Randomizer: LocalRnd

For each $i \in [n]$, user $i$ runs her own independent copy of Algorithm 1 below, referred to as LocalRnd, to generate her private report. We note that LocalRnd takes a flag Final $\in \{0, 1\}$ as an input. The role of this input will become clear when we discuss the full protocol; in a nutshell, the flag is used to distinguish between two invocations of LocalRnd. In particular, the local randomizer of each user is invoked twice in the full protocol: once during the first phase of the protocol (called the pruning phase), where the high-frequency items (heavy hitters) are identified, and a second time during the final phase (the final estimation phase), to enable the protocol to get better estimates for the frequencies of the heavy hitters.

Connection to count sketch and Hadamard transform: Our local randomizer can be thought of as a transformed, compressed (via sampling), and randomized version of the count sketch. Up to Step 5 in LocalRnd (Algorithm 1), our algorithm follows the standard count sketch algorithm (Charikar et al., 2002). Starting from Step 6, we deviate from the standard count sketch, as we apply the Hadamard transform to the user's signal and then sample one bit from the result. Indeed, Step 6 can be thought of as a composition of two operations: first, we multiply the indicator vector $e_{c_i} \in \{0,1\}^m$ by the scaled Hadamard matrix $W$ (this is equivalent to selecting $w_{c_i}$, the $c_i$-th column of $W$), then we randomly sample one entry from $w_{c_i}$. By doing so, we can achieve significant savings in space and communication without sacrificing accuracy.

Algorithm 1 LocalRnd
Input: user $i$'s item $v_i \in V$; flag Final $\in \{0, 1\}$.
1: Using shared randomness, get the random indices $(\ell_i, j_i) \in [\log d] \times [t]$ and $r_i \in [m]$.
2: if Final = 0 then
3: &nbsp;&nbsp;Set $s_i := g_{j_i}(v_i[1:\ell_i])$ and $c_i := h_{j_i}(v_i[1:\ell_i])$. {$v[1:\ell]$ denotes the $\ell$-bit prefix of $v$.}
4: else
5: &nbsp;&nbsp;Set $s_i := g_{j_i}(v_i)$ and $c_i := h_{j_i}(v_i)$.
6: Compute $x_i := s_i \cdot W_{r_i, c_i}$. {$W_{r,c}$ denotes the sign of the $(r, c)$ entry of $H_m$ (the Hadamard matrix of size $m$).}
7: Generate a randomized version $y_i$ of the bit $x_i$: $y_i = x_i$ w.p. $\frac{e^{\epsilon/2}}{e^{\epsilon/2}+1}$, and $y_i = -x_i$ w.p. $\frac{1}{e^{\epsilon/2}+1}$.
8: return $y_i$.

The output of LocalRnd is one bit per invocation. Hence, during the entire span of the protocol, each user sends 2 bits to the server. Here, we will assume that the user's identity (i.e., the index $i \in [n]$) associated with each report is known to the server (e.g., from a higher layer in the communication protocol stack).
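A compact rendering of Algorithm 1 (a sketch under assumed signatures: h(j, s) returns a 0-indexed column in [m] and g(j, s) a sign in {-1, +1}, standing in for the public hash pairs; v_bits is the item as a bit string):

```python
import math
import random

def local_rnd(v_bits, final, ell_i, j_i, r_i, h, g, eps, log_d):
    """Sketch of LocalRnd (Algorithm 1). The index r_i and the hash outputs
    are taken 0-indexed so the Hadamard sign is the parity of a bitwise AND."""
    ell = log_d if final else ell_i
    prefix = v_bits[:ell]                            # the prefix v_i[1:ell]
    s_i = g(j_i, prefix)
    c_i = h(j_i, prefix)
    x_i = s_i * (-1) ** bin(r_i & c_i).count("1")    # s_i * W[r_i, c_i]
    # eps-LDP randomized response on the single bit x_i (Step 7).
    if random.random() < math.exp(eps / 2) / (math.exp(eps / 2) + 1):
        return x_i
    return -x_i
```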
5.2. A Frequency Oracle: FreqOracle

Before describing our protocol for identifying the heavy hitters and estimating their frequencies, we first discuss a protocol for a simpler task. Suppose we want to allow the server to estimate the frequencies of some given subset $\hat V \subseteq \{0,1\}^\ell$, for some given $\ell \in [\log d]$, based on the users' reports. Algorithm 2 describes a protocol, denoted FreqOracle, for accomplishing this task. Note that in this protocol we assume that all items whose frequencies are in question are given as inputs.

For each queried item $\hat v \in \hat V$ and for each hash index $j \in [t]$, FreqOracle starts by collecting the noisy reports of the collection $I_{\ell,j}$ of users, where each user $i \in I_{\ell,j}$ is assigned a pair of prefix and hash indices $(\ell_i, j_i)$ that matches $(\ell, j)$. Next, it estimates the inverse Hadamard transform of the compressed and noisy signal of each user in $I_{\ell,j}$. It then sums all the results and multiplies the outcome by $g_j(\hat v)$ to obtain an estimate $\hat f_j(\hat v)$ for the frequency of $\hat v$. As in the count sketch algorithm, this is done for every $j \in [t]$; FreqOracle then obtains a high-confidence estimate for the frequency of $\hat v$ by computing the median of all $t$ frequency estimates.

Inverse transform and back to count sketch: We note here that Step 7 in FreqOracle (Algorithm 2) can be described as an inverse Hadamard transform of the users' compressed and noisy signals. In particular, each term $y_i \cdot W_{r_i, c}$ inside the sum can be described as a multiplication between $y_i e_{r_i}$ (where $e_{r_i}$ is the indicator vector with 1 at the $r_i$-th position) and the scaled Hadamard matrix $W$, followed by picking the $c$-th entry of the resulting vector. This brings us back to the standard count sketch representation (Charikar et al., 2002). The FreqOracle protocol then proceeds to process the frequency estimates in the same way a count sketch does. Indeed, Step 8 is a standard step in the count sketch algorithm, where a high-confidence frequency estimate is obtained via the median technique. Hence, we attain the functionality of the count sketch algorithm with much less space and communication, by transforming the users' signals to the Fourier domain, compressing the signals via sampling, and then transforming them back at the server.

Algorithm 2 FreqOracle
Input: prefix length $\ell \in [\log d]$; a subset of $\ell$-bit prefixes $\hat V \subseteq \{0,1\}^\ell$; a collection of $t$ disjoint subsets of users $\{\tilde I_j : j \in [t]\}$; oracle access to the items of the relevant users $\{v_i : i \in \cup_{j \in [t]} \tilde I_j\}$; scaling factor $\gamma$; flag Final $\in \{0, 1\}$. {$a_\epsilon \triangleq \frac{e^{\epsilon/2}+1}{e^{\epsilon/2}-1}$ is assumed to be a global constant seen by FreqOracle.}
1: for $\hat v \in \hat V$ do
2: &nbsp;&nbsp;for hash index $j = 1$ to $t$ do
3: &nbsp;&nbsp;&nbsp;&nbsp;Set $s := g_j(\hat v)$ and $c := h_j(\hat v)$.
4: &nbsp;&nbsp;&nbsp;&nbsp;for users $i \in \tilde I_j$ do
5: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Get user $i$'s 1-bit report: $y_i = \mathrm{LocalRnd}(v_i, \mathrm{Final})$.
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Get user $i$'s random index $r_i = Q(i)$ using public randomness.
7: &nbsp;&nbsp;&nbsp;&nbsp;Compute the $j$-th estimate of the frequency of $\hat v$: $\hat f_j(\hat v) := \gamma\, a_\epsilon \sum_{i \in \tilde I_j} y_i \cdot s \cdot W_{r_i, c}$.
8: &nbsp;&nbsp;Compute the final estimate of the frequency of $\hat v$: $\hat f(\hat v) := \mathrm{Median}\left\{\hat f_j(\hat v) : j \in [t]\right\}$.
9: FreqList := $\left\{\left(\hat v, \hat f(\hat v)\right) : \hat v \in \hat V\right\}$.
10: return FreqList.

5.3. Succinct Histogram via a Tree Aggregation Protocol: TreeHist

We now describe our protocol (Algorithm 3), denoted TreeHist, that outputs a succinct histogram; that is, it outputs a list of heavy hitters together with estimates for their frequencies. W.l.o.g., we will assume that $n > \log(n/\beta)\log(d)$, since otherwise we cannot guarantee less than trivial error (i.e., an error of order $n$).

A high-level description of the protocol is given in Section 3.1.3. The protocol comprises two main phases: the pruning phase and the final estimation phase. We note that the binary flag Final is used to distinguish between these two phases. In particular, depending on the value of this flag, the partition of users and the scaling factor passed to FreqOracle differ. More precisely, when Final = 0 (the pruning phase), for each prefix length $\ell$ the partition of users is given by the collection $\{I_{\ell,j} : j \in [t]\}$ and the scaling factor is set as $\gamma = t \log d$. When Final = 1 (the final phase), the partition of users is $\{\bar I_j : j \in [t]\}$ and the scaling factor is $\gamma = t$.

Algorithm 3 TreeHist: The Full Protocol
Input: oracle access to the users' items $(v_i \in V : i \in [n])$.
1: Set $\eta := 147\sqrt{n\log(n/\beta)\log(d)}/\epsilon$ and $a_\epsilon := \frac{e^{\epsilon/2}+1}{e^{\epsilon/2}-1}$.
2: Set $\gamma := t \log d$. {This setting will change at the final stage of the protocol.}
3: Initialize Prefixes $:= \{\perp\}$ (the empty string).
4: for prefix length $\ell = 1$ to $\log d$ do
5: &nbsp;&nbsp;Get the collection $\{I_{\ell,j} : j \in [t]\}$ using the random partition $\Pi$. {See the "Public randomness" paragraph above.}
6: &nbsp;&nbsp;$\left\{\left(\hat v, \hat f(\hat v)\right) : \hat v \in \mathrm{ChildSet(Prefixes)}\right\}$ := FreqOracle$\left(\ell,\ \mathrm{ChildSet(Prefixes)},\ \{I_{\ell,j} : j \in [t]\},\ \{v_i : i \in \cup_{j \in [t]} I_{\ell,j}\},\ \gamma,\ \mathrm{Final} = 0\right)$.
7: &nbsp;&nbsp;Initialize NewPrefixes $:= \emptyset$.
8: &nbsp;&nbsp;for $\hat v \in \mathrm{ChildSet(Prefixes)}$ do
9: &nbsp;&nbsp;&nbsp;&nbsp;if $\hat f(\hat v) \ge 2\eta$ then
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add $\hat v$ to NewPrefixes.
11: &nbsp;&nbsp;Update Prefixes $\leftarrow$ NewPrefixes.
12: Set $\gamma := t$.
13: SuccHist := FreqOracle$\left(\log d,\ \mathrm{Prefixes},\ \{\bar I_j : j \in [t]\},\ \{v_i : i \in [n]\},\ \gamma,\ \mathrm{Final} = 1\right)$. {Here $\bar I_j = \cup_{\ell \in [\log d]} I_{\ell,j}$, as defined earlier in the "Public randomness" paragraph.}
14: return SuccHist.

5.3.1. Running Time and Processing Memory of TreeHist

Running time: We note that Algorithm FreqOracle (Algorithm 2) is invoked $\log(d) + 1 = O(\log d)$ times by TreeHist (Algorithm 3): once per level of the tree, and once more in the final estimation phase at the bottom level of the tree. The main time-consuming step in FreqOracle is the summation in Step 7. This step is executed $O\left(\sqrt{\frac{n \log n}{\log d}}\right)$ times in every invocation of FreqOracle, since there are $O\left(\sqrt{\frac{n}{\log n \log d}}\right)$ nodes queried in every level of the tree and for each node FreqOracle computes $t \approx \log n$ frequency estimates. That is, in total, this step is executed $O\left(\sqrt{n \log n \log d}\right)$ times in the TreeHist protocol. A direct implementation would involve summing up to $n$ bits each time this step is executed, and hence would amount to a running time of roughly $n^{1.5}$. However, we now show that this can be reduced to nearly $n$.

Consider Step 7 of Algorithm FreqOracle when FreqOracle is invoked by TreeHist at any given prefix level $\ell \in [\log d]$ during the pruning phase. (We do not consider the final estimation phase here, since its running time is dominated by that of the pruning phase.) Note that since the size of the basis matrix $W$ is $m = O\left(\sqrt{\frac{n}{\log n}}\right)$, there are only $O\left(\sqrt{\frac{n}{\log n}}\right)$ values that the row index $r_i$ can take. Hence, $\hat f_j(\hat v)$ (computed in Step 7) can be expressed as follows:

$$\hat f_j(\hat v) = \gamma\, a_\epsilon\, g_j(\hat v) \sum_{\kappa \in [m]} W_{\kappa,\, h_j(\hat v)} \left(\sum_{i \in I_{\ell,j}:\ r_i = \kappa} y_i\right).$$

Thus, to implement Step 7 of FreqOracle for all items $(\ell, j, \hat v)$, we first compute $\sum_{i \in I_{\ell,j}: r_i = \kappa} y_i$ for every $\kappa \in [m]$, $\ell \in [\log d]$, and $j \in [t]$ (this amounts to a running time of $O(n)$ in total). Then, for every value of $(\ell, j, \hat v)$, computing $\hat f_j(\hat v)$ requires summing $m = O\left(\sqrt{\frac{n}{\log n}}\right)$ numbers. Hence, in total, the running time of TreeHist is $O\left(\sqrt{\frac{n}{\log n \log d}} \cdot m \cdot t \cdot \log d\right) = O\left(n\sqrt{\log d}\right)$.

Processing memory: For the implementation described above, Algorithm FreqOracle maintains $m \cdot t \cdot \log d$ sums, each over at most $n$ bits (and hence representable in $O(\log n)$ bits). This requires a processing memory of $O\left(\sqrt n \log^{1.5}(n) \log(d)\right)$ bits. The memory required for all the remaining steps of the TreeHist protocol does not exceed this amount; hence, the total processing memory required by TreeHist is $O\left(\sqrt n \log^{1.5}(n) \log(d)\right)$ bits.
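The bucketing trick above in code form (a sketch with assumed, simplified signatures: reports are 0-indexed (l_i, j_i, r_i, y_i) tuples, and h(j, v), g(j, v) stand in for the public hash pairs):

```python
import numpy as np

def freq_estimate(reports, log_d, t, m, gamma, a_eps, h, g, v_hat, ell):
    """One O(n) pass buckets the +/-1 reports by (level, hash index, row
    index); each per-hash estimate then sums only m = O(sqrt(n / log n))
    bucket totals, as in Section 5.3.1."""
    sigma = np.zeros((log_d, t, m))
    for (l, j, r, y) in reports:               # single pass over all users
        sigma[l, j, r] += y
    estimates = []
    for j in range(t):
        c = h(j, v_hat)
        # Column c of W, evaluated on demand: W[k, c] = (-1)^popcount(k & c).
        w_col = np.array([-1 if bin(k & c).count("1") % 2 else 1 for k in range(m)])
        estimates.append(gamma * a_eps * g(j, v_hat) * (w_col @ sigma[ell, j]))
    return float(np.median(estimates))
```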
5.3.2. Privacy and Utility Guarantees

In this section we provide the privacy and utility guarantees for the TreeHist protocol.

Theorem 5.1 [$\epsilon$-Local Differential Privacy of the TreeHist Protocol] The TreeHist protocol (Algorithm 3) is $\epsilon$-local differentially private.

Theorem 5.2 [Utility of the TreeHist Protocol] Let $\eta = 147\sqrt{n\log(n/\beta)\log(d)}/\epsilon$. With probability at least $1 - \beta$ (over the choice of all public and private randomness), the output SuccHist of the TreeHist protocol (Algorithm 3) satisfies the following properties:

1. SuccHist contains all items $v \in V$ with true frequencies above $3\eta$.
2. SuccHist does not contain any item $v \in V$ with true frequency below $\eta$.
3. Every frequency estimate in SuccHist is accurate up to an error $\frac{147}{\epsilon}\sqrt{n\log(n/\beta)}$.

We defer the proofs of Theorems 5.1 and 5.2 to Appendix A. The following lemma states the error guarantees of Algorithm FreqOracle as used by the TreeHist protocol. Note that FreqOracle is invoked by the TreeHist protocol during both the pruning and the final estimation phases. This lemma is central to our proof of Theorem 5.2.

In the following lemma, we will use $\{\tilde I_j\}$ to generically denote the collection of user subsets passed to FreqOracle. As described early in Section 5.3, the form of this collection depends on whether the protocol is in the pruning phase (flag Final = 0) or the final estimation phase (Final = 1). In particular, for any prefix length $\ell \in [\log d]$, when FreqOracle is invoked by TreeHist in the pruning phase (i.e., Final = 0), the collection $\{\tilde I_j : j \in [t]\} = \{I_{\ell,j} : j \in [t]\}$ and the scaling factor is $\gamma = t\log d$, whereas when FreqOracle is invoked by TreeHist in the final phase (i.e., Final = 1), the collection $\{\tilde I_j : j \in [t]\} = \{\bar I_j : j \in [t]\}$ and $\gamma = t$. We will use the generic notation $\{\tilde I_j : j \in [t]\}$ and $\gamma$, since the same proof works for both cases.

Lemma 5.3 Let $\beta \in (0, 1)$. Let the number of hash pairs be $t \ge 110\log(n/\beta)$, and the size of the Hadamard basis matrix be $m \ge 48\sqrt{n/\log(n/\beta)}$. Consider Algorithm FreqOracle (Algorithm 2) as invoked by the TreeHist protocol (Algorithm 3). For any $\ell \in [\log d]$, suppose that FreqOracle is invoked on inputs: $\ell$; a subset $\hat V \subseteq \{0,1\}^\ell$ of size $|\hat V| \le n$; a collection of user subsets $\{\tilde I_j : j \in [t]\}$; and a scaling factor $\gamma$. Assuming that $n \ge 48\, t\log(d)$, then with probability at least $1 - \frac{\beta}{\log d}$ we have

$$\forall v \in \hat V :\quad |\hat f(v) - f(v)| \le 14\sqrt{n\gamma}/\epsilon,$$

where $\hat f(v)$ is the estimate of FreqOracle for the true frequency $f(v)$ of the item $v \in \hat V$.

Proof Fix $\beta \in (0,1)$, $\ell \in [\log d]$, $t \ge 110\log(n/\beta)$, $m \ge 48\sqrt{n/\log(n/\beta)}$, and a subset $\hat V \subseteq \{0,1\}^\ell$ of size $|\hat V| \le n$. There are three independent sources of randomness. The first is due to the randomness in the collection $\{\tilde I_j : j \in [t]\}$ induced by the random partitioning of users via $\Pi$. The second source of randomness is due to the randomness in the choice of the $t$ hash pairs $(h_1, g_1), \ldots, (h_t, g_t)$. The third source of randomness is due to the random row indices $r_i$, $i \in [n]$, generated by $Q$, and the randomization for privacy (Step 7 in Algorithm 1).

Before we discuss the guarantees we can attain under these sources of randomness, we first introduce some notation. For $j \in [t]$ and $v \in \hat V$, we define $f_{\tilde I_j}(v) \triangleq \sum_{i \in \tilde I_j} \mathbb{1}\left(v_i[1:\ell] = v\right)$; that is, $f_{\tilde I_j}(v)$ is the number of users in $\tilde I_j$ for whom the $\ell$-bit prefix of their item is $v$. We also define $n_{\tilde I_j} \triangleq |\tilde I_j|$, i.e., $n_{\tilde I_j}$ is the number of users in $\tilde I_j$. Let $\hat f_{\tilde I_1}(v), \ldots, \hat f_{\tilde I_t}(v)$ be independent Poisson random variables with mean $\frac{f(v)}{\gamma}$, and let $\hat n_{\tilde I_1}, \ldots, \hat n_{\tilde I_t}$ be independent Poisson random variables with mean $\frac{n}{\gamma}$. For each $v \in \hat V$, let

$$G_1(v) = \left\{ j \in [t] :\ |\gamma f_{\tilde I_j}(v) - f(v)| \le 4\sqrt{n\gamma},\ \ \tfrac{n}{2} \le \gamma\, n_{\tilde I_j} \le 2n \right\}.$$

Claim 5.4 With probability at least $1 - \frac{\beta}{3\log d}$ over the randomness in $\{\tilde I_j : j \in [t]\}$, for every $v \in \hat V$ we have $|G_1(v)| \ge \frac{9}{10}t$.

Fix $v \in \hat V$ and $j \in [t]$. Consider the Poisson random variables $\hat f_{\tilde I_j}(v)$ and $\hat n_{\tilde I_j}$, with means $\frac{f(v)}{\gamma}$ and $\frac{n}{\gamma}$, respectively. By Theorem 2.5 (a tail bound for the Poisson distribution), the union bound, and the assumption that $n \ge 48\gamma$ (implied by the assumption stated in the lemma), with probability at least 0.996 we have $|\gamma \hat f_{\tilde I_j}(v) - f(v)| \le 4\sqrt{n\gamma}$ and $\frac{n}{2} \le \gamma \hat n_{\tilde I_j} \le 2n$. Now, consider the sequences of independent Poisson random variables $\hat f_{\tilde I_1}(v), \ldots, \hat f_{\tilde I_t}(v)$ and $\hat n_{\tilde I_1}, \ldots, \hat n_{\tilde I_t}$ as defined above. Using the fact that $t \ge 110\log(n/\beta) \ge 55\log\left(\frac{3 e n\log d}{\beta}\right)$, by Chernoff's bound, with probability at least $1 - \frac{\beta}{3 e n\log d}$ we have

$$\left|\left\{ j \in [t] :\ |\gamma \hat f_{\tilde I_j}(v) - f(v)| \le 4\sqrt{n\gamma},\ \ \tfrac{n}{2} \le \gamma \hat n_{\tilde I_j} \le 2n \right\}\right| \ge \frac{9}{10}t.$$

Using Theorem 2.4 (the Poisson approximation), with probability at least $1 - \frac{\beta}{3 n\log(d)}$ we have $|G_1(v)| \ge \frac{9}{10}t$. Hence, by the fact that $|\hat V| \le n$ and the union bound, with probability at least $1 - \frac{\beta}{3\log d}$, for all $v \in \hat V$ we have $|G_1(v)| \ge \frac{9}{10}t$.

For the remainder of the proof, we will condition on the event in Claim 5.4.
Let $w_c$ denote the $c$-th column of the basis matrix $W$. Since the prefix length $\ell$ is fixed, we will denote $v_i[1:\ell]$ (the $\ell$-bit prefix of the item of user $i$) as $v_i$, for brevity. For each $v \in \hat V$, let

$$G_2(v) = \left\{ j \in [t] :\ \left|\gamma \sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m} \left\langle w_{h_j(v_i)}, w_{h_j(v)} \right\rangle - f(v)\right| \le 8\sqrt{\gamma n} \right\}.$$

Claim 5.5 Conditioned on the event in Claim 5.4, with probability at least $1 - \frac{\beta}{3\log d}$ over the choice of the $t$ hash pairs $\{(h_j, g_j) : j \in [t]\}$, for all $v \in \hat V$ we have $|G_2(v)| \ge \frac45 t$.

Fix $v \in \hat V$ and $j \in G_1(v)$. First consider $\left|\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\langle w_{h_j(v_i)}, w_{h_j(v)}\rangle - f_{\tilde I_j}(v)\right|$. Note that the columns of $W$ are orthogonal and each of them has norm $\sqrt m$. Hence, the error quantity above comes from those $i \in \tilde I_j$ with $v_i \ne v$ and yet $h_j(v_i) = h_j(v)$. In particular, we can write this error as $\left|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\right|$, where $z_i = \mathbb 1\left(v_i \ne v,\ h_j(v_i) = h_j(v)\right)$. Define $V_{\mathrm{Bad}} = \left\{\hat v \in \hat V : f_{\tilde I_j}(\hat v) \ge 2\sqrt{n_{\tilde I_j}}\right\}$. Note that $|V_{\mathrm{Bad}}| \le \sqrt{n_{\tilde I_j}}/2$. Now, observe that

$$\Pr\left[\Big|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\Big| > 2\sqrt{n_{\tilde I_j}}\right] \le \Pr_{h_j, g_j}\left[\exists\, \hat v \in V_{\mathrm{Bad}} :\ h_j(\hat v) = h_j(v)\right] + \Pr\left[\Big|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\Big| > 2\sqrt{n_{\tilde I_j}}\ \middle|\ \forall\, \hat v \in V_{\mathrm{Bad}} :\ h_j(\hat v) \ne h_j(v)\right].$$

By the pairwise independence property of $h_j$ and the union bound, the first probability term on the right-hand side is bounded from above by $\frac{|V_{\mathrm{Bad}}|}{m} \le \frac{\sqrt{n_{\tilde I_j}}}{2m}$. We now consider the second probability term. Note that the event we condition on in the second probability term implies that for every $i \in \tilde I_j$ with $z_i = 1$ we must have $f_{\tilde I_j}(v_i) < 2\sqrt{n_{\tilde I_j}}$. Hence, conditioned on this event, by the pairwise independence of each of $h_j$ and $g_j$, we have

$$\mathrm{Var}\left[\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\right] \le \sum_{i \in \tilde I_j} \mathrm{Var}\left[z_i\, g_j(v_i)\right] + \sum_{\substack{i, k \in \tilde I_j:\\ v_i = v_k \ne v}} \mathbb E[z_i z_k]\; \mathbb E[g_j(v_i)\, g_j(v_k)] \le n_{\tilde I_j}/m + 2 n_{\tilde I_j}^{3/2}/m \le 3 n_{\tilde I_j}^{3/2}/m.$$

Hence, by Chebyshev's inequality, the second probability term is bounded by $\frac{3\sqrt{n_{\tilde I_j}}}{4m}$. Altogether,

$$\Pr\left[\Big|\sum_{i \in \tilde I_j} z_i\, g_j(v_i)\, g_j(v)\Big| > 2\sqrt{n_{\tilde I_j}}\right] \le \frac{5\sqrt{n_{\tilde I_j}}}{4m} \le \frac{1}{250},$$

where the last inequality follows from the fact that $j \in G_1(v)$ (so that $n_{\tilde I_j} \le 2n/\gamma$, with $\gamma \ge t \ge 110\log(n/\beta)$) and the fact that $m \ge 48\sqrt{n/\log(n/\beta)}$. Thus, with probability at least 0.996, we have

$$\left|\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle - f_{\tilde I_j}(v)\right| \le 2\sqrt{n_{\tilde I_j}}.$$

Now, as we conditioned on the event in Claim 5.4, this implies that with probability at least 0.996 we have

$$\left|\gamma \sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle - f(v)\right| \le \gamma\left|\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle - f_{\tilde I_j}(v)\right| + \left|\gamma f_{\tilde I_j}(v) - f(v)\right| \le 8\sqrt{\gamma n}.$$

Conditioned on the event of Claim 5.4, $|G_1(v)| \ge \frac{9}{10}t \ge 99\log(n/\beta) \ge 49\log\left(\frac{3 n\log d}{\beta}\right)$; note also that the sums $\left\{\sum_{i \in \tilde I_j} \frac{g_j(v_i)\, g_j(v)}{m}\langle w_{h_j(v_i)}, w_{h_j(v)}\rangle : j \in [t]\right\}$ are independent. Thus, conditioned on the event of Claim 5.4, by Chernoff's bound, with probability at least $1 - \frac{\beta}{3 n\log d}$ we have $|G_2(v)| \ge \frac{22}{25}|G_1(v)| \ge \frac45 t$. Hence, by the fact that $|\hat V| \le n$ and the union bound, with probability at least $1 - \frac{\beta}{3\log d}$, for all $v \in \hat V$ we have $|G_2(v)| \ge \frac45 t$. We continue the proof while conditioning on this event as well.

For every $i \in [n]$, let $r_i \in [m]$ be the row index chosen uniformly at random for user $i$ using the random function $Q$, and let $y_i$ denote the randomized bit generated by user $i$ in Step 7 of Algorithm 1. As denoted in Algorithm FreqOracle, for each $v \in \hat V$, let $\hat f_j(v) = \gamma\, a_\epsilon \sum_{i \in \tilde I_j} y_i \cdot g_j(v) \cdot W_{r_i, h_j(v)}$ denote the $j$-th frequency estimate of FreqOracle for $v \in \hat V$. For each $v \in \hat V$, let

$$G_3(v) = \left\{ j \in [t] :\ \left|\hat f_j(v) - f(v)\right| \le 14\sqrt{n\gamma}/\epsilon \right\}.$$

Claim 5.6 Conditioned on the events in Claims 5.4 and 5.5, with probability at least $1 - \frac{\beta}{3\log d}$ over the randomness in $\{(r_i, y_i) : i \in \tilde I_j,\ j \in [t]\}$, for all $v \in \hat V$ we have $|G_3(v)| \ge \frac{7}{10}t$.

Fix $v \in \hat V$ and $j \in G_2(v)$. Note that, conditioned on any realization of $\tilde I_j$, each term in the sum

$$\sum_{i \in \tilde I_j}\left(a_\epsilon\, y_i\, g_j(v)\, W_{r_i, h_j(v)} - \frac{g_j(v_i)\, g_j(v)}{m}\left\langle w_{h_j(v_i)}, w_{h_j(v)}\right\rangle\right)$$

is an independent, zero-mean random variable whose magnitude is bounded by $a_\epsilon + 1 = O\left(\frac1\epsilon\right)$.
Hence, by Chernoff's bound, with probability at least 0.99, we have

$$\Big| \sum_{i \in I_j} \Big( a_\epsilon\,y_i\,g_j(v)\,W_{r_i, h_j(v)} - g_j(v_i)\,g_j(v)\,\tfrac{1}{m}\langle w_{h_j(v_i)}, w_{h_j(v)}\rangle \Big) \Big| \leq 4\sqrt{n_{I_j}}/\epsilon.$$

Thus, conditioned on the events in Claims 5.4 and 5.5, with probability at least 0.99, we have

$$\Big| \gamma \sum_{i \in I_j} a_\epsilon\,y_i\,g_j(v)\,W_{r_i, h_j(v)} - f(v) \Big| \leq 14\sqrt{\gamma n}/\epsilon,$$

i.e., $|\hat f_j(v) - f(v)| \leq 14\sqrt{n\gamma}/\epsilon$. Conditioned on the event of Claim 5.5, $|G_2(v)| \geq \tfrac{4}{5}t \geq 88\log(n/\beta) \geq 44\log(3n\log(d)/\beta)$. We note also that the above sums for $j = 1,\dots,t$ are independent. Thus, conditioned on the event of Claim 5.5, by Chernoff's bound, with probability at least $1 - \beta/(3n\log d)$, we have $|G_3(v)| \geq \tfrac{22}{25}|G_2(v)| \geq \tfrac{88}{125}t \geq \tfrac{7}{10}t$. Hence, by the fact that $|\widehat V| \leq n$ and the union bound, with probability at least $1 - \beta/(3\log d)$, for all $v \in \widehat V$ we have $|G_3(v)| \geq \tfrac{7}{10}t$.

By combining Claims 5.4, 5.5, and 5.6, we conclude that with probability at least $1 - \beta/\log d$, for all $v \in \widehat V$ we have

$$\Big|\big\{ j \in [t] : |\hat f_j(v) - f(v)| \leq 14\sqrt{n\gamma}/\epsilon \big\}\Big| > \tfrac{1}{2}t.$$

Since for any item $v$ the final frequency estimate $\hat f(v)$ generated by FreqOracle is the median of $\hat f_1(v),\dots,\hat f_t(v)$, the above implies that with probability at least $1 - \beta/\log d$, for all $v \in \widehat V$ we have $|\hat f(v) - f(v)| \leq 14\sqrt{n\gamma}/\epsilon$. This completes the proof of the lemma.

6. Locally Private Heavy Hitters Bit-by-bit: The Bitstogram Protocol

In this section we provide the full details of our Bitstogram protocol. We will use the following notation. Let $S \in V^n$ be a database, which may be distributed across $n$ users (each holding one row). For $v \in V$, we will be interested in estimating the multiplicity of $v$ in $S$, i.e., $f_S(v) = |\{v_i \in S : v_i = v\}|$. As we explained in Section 3.2, we provide both an alternative frequency oracle protocol and an alternative protocol for identifying the heavy hitters.

6.1. Warmup: A Simple Protocol for Heavy-Hitters

For readability, we first present a simplification of our protocol that captures most of the ideas presented in Section 3.2. This simplification does not achieve the optimal error rate, space complexity, and time complexity that we aim for. We will later modify this construction in order to reduce these complexities.

6.1.1. Frequency Oracle

Our protocols use the simple local randomizer R (Algorithm 4), where every user holds one bit and flips it with probability $1/(e^\epsilon+1)$.

Algorithm 4 R: Basic Randomizer
Inputs: $x \in \{\pm1\}$, and privacy parameter $\epsilon$.
1. Generate and return a random bit $z$, where $z = x$ w.p. $e^\epsilon/(e^\epsilon+1)$ and $z = -x$ w.p. $1/(e^\epsilon+1)$.

Let $Z \in \{\pm1\}^{d\times n}$ be a matrix chosen uniformly at random, and assume that $Z$ is publicly known. That is, for every domain element $v \in V$ and every user $j \in [n]$, we have a random bit $Z[v,j] \in \{\pm1\}$. As $Z$ is publicly known, every user $j$ can identify its corresponding bit $Z[v_j, j]$, where $v_j \in V$ is the input of user $j$. We now present a simple protocol, ExplicitHist, in which users send randomized responses of their corresponding bits in this matrix. As we show, this allows the server to estimate the frequency of every domain element.

Algorithm 5 ExplicitHist
Public randomness: Uniformly random matrix $Z \in \{\pm1\}^{d\times n}$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. Define $\tilde S = (\tilde v_1,\dots,\tilde v_n)$, where $\tilde v_j = Z[v_j, j]$.
Input: Oracle access to $\tilde S$.
1. For $j \in [n]$, get user $j$'s 1-bit report: $y_j \leftarrow R(\tilde v_j)$.
2. On input $v \in V$, return $a(v) = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{j\in[n]} y_j\,Z[v,j]$, and wait for the next input.
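Before stating the utility guarantee, here is a minimal NumPy sketch of Algorithms 4 and 5 (ours, for illustration only; the domain is kept tiny so that the $d \times n$ matrix $Z$ fits in memory, and all parameter values are assumptions). It simulates the one-bit randomized reports and the debiased estimator $a(v)$.

```python
import numpy as np

# Minimal sketch of Algorithms 4 and 5 (ExplicitHist) for a small domain.
rng = np.random.default_rng(1)
eps, d, n = 1.0, 16, 200_000

v = rng.integers(0, 4, size=n)            # users' inputs (concentrated on {0,1,2,3})
Z = rng.choice([-1, 1], size=(d, n))      # public random matrix Z in {+-1}^(d x n)

x = Z[v, np.arange(n)]                    # user j's bit: Z[v_j, j]
flip = rng.random(n) < 1.0 / (np.exp(eps) + 1)
y = np.where(flip, -x, x)                 # Algorithm 4: randomized response R(x)

def a(query):
    """Debiased frequency estimate for a domain element (Algorithm 5, step 2)."""
    return (np.exp(eps) + 1) / (np.exp(eps) - 1) * np.sum(y * Z[query])

for q in range(4):
    print(q, int(np.sum(v == q)), round(a(q)))  # true vs. estimated frequency
```

The debiasing factor $\frac{e^\epsilon+1}{e^\epsilon-1}$ exactly cancels the shrinkage introduced by the bit flips, which is the computation carried out in the proof below.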
Lemma 6.1 Let $\epsilon \leq 1$, and fix a subset $V' \subseteq V$ of size $d' \leq d$. With probability at least $1-\beta$, algorithm ExplicitHist answers every $v \in V'$ with $a(v)$ satisfying: $|a(v) - f_S(v)| \leq \frac{3}{\epsilon}\sqrt{n\ln(4d'/\beta)}$.

Proof Fix $v \in V'$, and denote $c(v) = \sum_{j\in[n]} y_j\,Z[v,j]$; recall that algorithm ExplicitHist answers the query $v$ with $a(v) = \frac{e^\epsilon+1}{e^\epsilon-1}c(v)$. We start by analyzing the expectation of $c(v)$:

$$\mathbb{E}[c(v)] = \sum_{j\in[n]}\mathbb{E}[y_j\,Z[v,j]] = \sum_{j\in[n]:\,v_j=v}\mathbb{E}[y_j\,Z[v,j]] + \sum_{j\in[n]:\,v_j\neq v}\mathbb{E}[y_j]\,\mathbb{E}[Z[v,j]] = f_S(v)\,\frac{e^\epsilon-1}{e^\epsilon+1}.$$

That is, $c(v)$ can be expressed as a sum of $n$ independent $\pm1$ random variables: $f_S(v)$ variables with expectation $\frac{e^\epsilon-1}{e^\epsilon+1}$, and $(n - f_S(v))$ variables with expectation 0. Using the Hoeffding bound, with probability at least $1 - \beta/d'$ we have $|c(v) - \frac{e^\epsilon-1}{e^\epsilon+1}f_S(v)| \leq \sqrt{n\ln(4d'/\beta)}$. That is, $|a(v) - f_S(v)| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{n\ln(4d'/\beta)} \leq \frac{3}{\epsilon}\sqrt{n\ln(4d'/\beta)}$. Using the union bound, this holds simultaneously for every $v \in V'$ with probability at least $1-\beta$.

Observation 6.2 For the analysis above it suffices that, for every $j \in [n]$, the entries of column $j$ of $Z$ are only pairwise independent. Furthermore, appealing to Lemma 2.3 (concentration of $k$-wise independent random variables) instead of the Hoeffding bound, it suffices that, for every $v \in V$, the entries of row $v$ of $Z$ are only $k$-wise independent, for $k = 3\ln(d/\beta)$.

6.1.2. A Simple Heavy Hitters Protocol

We next give a (simplified version of) our construction that uses a frequency oracle protocol with worst-case error $\tau$ in order to identify the heavy hitters. Let $h: V \to [T]$ be a (publicly known) random hash function, mapping domain elements into $[T]$, where $T$ will be set later. We will use $h$ in order to identify the heavy hitters. To that end, let $v^* \in V$ denote such a heavy hitter, appearing at least $2\tau$ times in the database $S$, and denote $t^* = h(v^*)$. Assuming that $T$ is big enough, w.h.p. we will have that $v^*$ is the only input element (from $S$) that is mapped (by $h$) into the hash value $t^*$. Assuming that this is indeed the case, we will now identify $v^*$ bit by bit. For $\ell \in [\log d]$, denote $S_\ell = (h(v_j), v_{j,\ell})_{j\in[n]}$, where $v_{j,\ell}$ is bit $\ell$ of $v_j$. That is, $S_\ell$ is a database over the domain $[T]\times\{0,1\}$, where the row corresponding to user $j$ is $(h(v_j), v_{j,\ell})$. Observe that every user can compute her own row locally. As $v^*$ is a heavy hitter, for every $\ell \in [\log d]$ we have that $(t^*, v^*_\ell)$ appears in $S_\ell$ at least $2\tau$ times, where $v^*_\ell$ is bit $\ell$ of $v^*$. On the other hand, as we assumed that $v^*$ is the only input element that is mapped into $t^*$, we get that $(t^*, 1-v^*_\ell)$ does not appear in $S_\ell$ at all. Recall that our frequency oracle has error at most $\tau$, and hence we can use it to accurately determine the bits of $v^*$. The details are given in algorithm SuccinctHist below; a short code sketch follows the algorithm.

Algorithm 6 SuccinctHist
Public randomness: Random hash function $h: V \to [T]$. Random partition of $[n]$ into $\log d$ subsets $I_1,\dots,I_{\log d}$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. For $\ell \in [\log d]$, let $S_\ell = (h(v_j), v_{j,\ell})_{j\in I_\ell}$, where $v_{j,\ell}$ is bit $\ell$ of $v_j$. That is, $S_\ell$ is a database over the domain $[T]\times\{0,1\}$.
Input: Oracle access to $S$ and to $S_\ell$ for $\ell \in [\log d]$.
1. For $\ell \in [\log d]$, use ExplicitHist($S_\ell$) with privacy parameter $\frac{\epsilon}{2}$ to get $a_\ell(t,b)$ for all $(t,b) \in [T]\times\{0,1\}$.
2. For $t \in [T]$, define $\hat v_t \in V$, where bit $\ell$ of $\hat v_t$ is $\hat v_{t,\ell} = \arg\max_{b\in\{0,1\}} a_\ell(t,b)$.
3. Use ExplicitHist($S$) with privacy parameter $\frac{\epsilon}{2}$ to obtain $a(\hat v_t)$ for all $t \in [T]$.
4. Return the list $L = \{(\hat v_t, a(\hat v_t)) : t \in [T]\}$.
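The following sketch (ours, illustrative only) isolates the bit-by-bit recovery logic of Algorithm 6. To keep it short, a stand-in oracle returning true counts plus bounded noise of magnitude $\tau$ replaces ExplicitHist; the hash function and all parameters are assumptions.

```python
import numpy as np

# Sketch of the bit-by-bit recovery in Algorithm 6 (SuccinctHist).
rng = np.random.default_rng(2)

n, num_bits, T, tau = 100_000, 16, 64, 300
heavy = 12345                                  # a heavy hitter (appears often)
vals = np.where(rng.random(n) < 0.05, heavy,   # ~5% of users hold `heavy`
                rng.integers(0, 2**num_bits, size=n))

h = rng.permutation(2**num_bits) % T           # a toy hash V -> [T]

def noisy_count(mask):
    """Stand-in frequency oracle: true count + noise bounded by tau."""
    return mask.sum() + rng.uniform(-tau, tau)

# Recover, for each hash cell t, a candidate item bit by bit.
candidates = []
for t in range(T):
    in_cell = h[vals] == t
    bits = []
    for ell in range(num_bits):
        bit_ell = (vals >> ell) & 1
        a0 = noisy_count(in_cell & (bit_ell == 0))
        a1 = noisy_count(in_cell & (bit_ell == 1))
        bits.append(int(a1 > a0))
    candidates.append(sum(b << ell for ell, b in enumerate(bits)))

print(heavy in candidates)                     # True w.h.p.: heavy hitter recovered
```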
Lemma 6.3 Let $\epsilon \leq 1$, denote $w \triangleq 32\log(d)\log(16\log d) + \frac{48}{\epsilon}\sqrt{2n\log(d)\ln(64\log d)}$, and set $T = 32n/w$. Algorithm SuccinctHist returns a list $L$ of length $T$ satisfying:
1. With probability $1-\beta$, for every $(v,a) \in L$ we have $|a - f_S(v)| \leq \frac{6}{\epsilon}\sqrt{n\ln(4T/\beta)}$.
2. For every $v \in V$ s.t. $f_S(v) \geq w$, with probability $1/2$ we have that $v$ is in $L$.

Remark 3 In this simplified version of our protocol, we successfully recover a heavy hitter only if we recover (exactly) all of its bits. As will be illustrated in Section 6.2.2, this requirement can be relaxed using an error-correcting code, so that in order to recover a heavy hitter $v$ it suffices to recover only part of its bits correctly.

Proof Item 1 of the lemma follows directly from Lemma 6.1. We now prove item 2. Assuming that $n \geq 12\log(d)\log(12\log d)$, by the Chernoff bound, with probability at least $7/8$ (over partitioning $[n]$ into subsets $I_1,\dots,I_{\log d}$), for every $\ell \in [\log d]$ we have $\frac{n}{2\log d} \leq |I_\ell| \leq \frac{2n}{\log d}$. We continue the analysis assuming that this is the case. Fix $v^* \in V$ s.t. $f_S(v^*) \geq w$, and consider the following good event (over sampling $h$):

Event $E_1$: $|\{v \in S : v \neq v^* \text{ and } h(v) = h(v^*)\}| \leq w/4$.

Event $E_1$ states that $v^*$ is mapped (by the hash function $h$) into a cell without too many collisions with different input elements. Denote $t^* = h(v^*)$. While the multiplicity of $v^*$ in $S$ is at least $w$, event $E_1$ states that other than $v^*$ there are at most $w/4$ elements mapped into $t^*$. That is, $v^*$ dominates the cell $t^*$. We first show that if $E_1$ occurs, then w.h.p. $v^*$ is in the list $L$. As $w \geq 32\log(d)\log(16\log d)$, by the Chernoff bound we get that with probability at least $7/8$ (over partitioning $[n]$ into subsets $I_1,\dots,I_{\log d}$), for every $\ell \in [\log d]$ we have

$$f_{S_\ell}(h(v^*), v^*_\ell) \geq f_{S_\ell}(h(v^*), 1-v^*_\ell) + \frac{w}{4\log d}.$$

If that is the case, then by the properties of algorithm ExplicitHist, for $w \geq \frac{48}{\epsilon}\sqrt{2n\log(d)\ln(64\log d)}$, with probability at least $1 - \frac{1}{8\log d}$ we have $\hat v_{t^*,\ell} = v^*_\ell$, where $t^* = h(v^*)$. Using the union bound, this holds simultaneously for all $\ell \in [\log d]$ with probability at least $7/8$, in which case $v^* = \hat v_{t^*}$ is in the list $L$. It remains to show that event $E_1$ occurs with high probability. To that end, observe that

$$\mathbb{E}_h\big[|\{v \in S : v \neq v^* \text{ and } h(v) = h(v^*)\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_h\big[\mathbb{1}_{h(v)=h(v^*)}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h(v) = h(v^*)\}| \geq \frac{8n}{T}\Big] \leq \frac{1}{8}.$$

Setting $T = 32n/w$ (so that $8n/T = w/4$) completes the proof.

6.2. Reducing Space and Time Complexities

In this section we give the full details of algorithm Bitstogram.

6.2.1. Space and Time Efficient Frequency Oracle

We modify algorithm ExplicitHist to obtain our frequency oracle with improved time and space complexity: algorithm Hashtogram. Intuitively, this algorithm is obtained by applying algorithm ExplicitHist after hashing every domain element using a (pairwise independent) hash function (to boost accuracy, we use several hash functions). This means that the random matrix $Z$ we use has only $T$ rows, instead of $|V|$ rows. As we will see, this will allow us to implement the algorithm efficiently.

Algorithm 7 Hashtogram
Public randomness: Random partition of $[n]$ into $R$ subsets $I_1,\dots,I_R$. Random hash functions $h_1,\dots,h_R$ mapping $V$ to $[T]$. Uniformly random matrix $Z \in \{\pm1\}^{T\times n}$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. Define $\tilde S = (\tilde v_1,\dots,\tilde v_n)$, where $\tilde v_j = Z[h_r(v_j), j]$ and $r$ is s.t. $j \in I_r$.
Input: Oracle access to $\tilde S$.
1. For $j \in [n]$, get user $j$'s 1-bit report: $y_j \leftarrow R(\tilde v_j)$.
2. For every $(r,t) \in [R]\times[T]$ compute $a_r(t) = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{j\in I_r} y_j\,Z[t,j]$.
3. On input $v \in V$, return $a(v) = R\cdot\mathrm{Median}\{a_1(h_1(v)),\dots,a_R(h_R(v))\}$, and wait for the next input.

Lemma 6.4 Algorithm Hashtogram satisfies $\epsilon$-LDP.

Lemma 6.5 Let $\epsilon \leq 1$, and fix a subset $V' \subseteq V$ of size $d' \leq d$ to be queried to algorithm Hashtogram.
Let algorithm Hashtogram be executed with $R \geq 132\log(4d'/\beta)$, $T \geq \epsilon^2\log(d'/\beta) + \epsilon\sqrt{n/\log(d'/\beta)}$, and $n \geq 8R\log(8d'/\beta)$. With probability at least $1-\beta$, algorithm Hashtogram answers every $v \in V'$ with $a(v)$ satisfying:

$$|a(v) - f_S(v)| \leq \frac{27}{\epsilon}\sqrt{nR\log(2Rd'/\beta)}.$$

Observe that the error in the lemma is sub-optimal, as the optimal error behaves like $\frac{1}{\epsilon}\sqrt{n\log d}$. However, we will only use Lemma 6.5 with constant $d'$ and constant $\beta$, and hence will not be affected by this issue. A similar analysis (for the same algorithm) gives better bounds for other settings of parameters. Specifically,

Lemma 6.6 Let $\epsilon \leq 1$. Fix a subset $V' \subseteq V$ of size $d' \leq d$ to be queried to algorithm Hashtogram. Let algorithm Hashtogram be executed with $R \geq 300\log(12nd'/\beta)$, $n \geq 43R$, and $T \geq \epsilon\sqrt{n/\log(nd'/\beta)}$. With probability at least $1-\beta$, algorithm Hashtogram answers every $v \in V'$ with $a(v)$ satisfying:

$$|a(v) - f_S(v)| \leq \frac{400}{\epsilon}\sqrt{n\log(12nd'/\beta)}.$$

As the analyses of the two lemmas are very similar, we only present the proof of Lemma 6.6. The proof of Lemma 6.5 appears in Section B for completeness.

Proof [Proof of Lemma 6.6] Consider the following good event:

Event $E_1$ (over sampling $h_1,\dots,h_R$): For every query $v^* \in V'$ there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$.

Event $E_1$ states that for at least $7R/8$ of the hash functions, $v^*$ is mapped into a cell without too many collisions with different input elements. Informally, for every single hash function $h_r$, algorithm Hashtogram estimates the number of occurrences of $h_r(v^*)$ in $S$. Hence, if event $E_1$ occurs, then most of the estimations result in accurate answers. We start by showing that event $E_1$ happens with high probability. To that end, fix $v^* \in V'$ and $r \in [R]$. We have

$$\mathbb{E}_{h_r}\big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_{h_r}\big[\mathbb{1}_{h_r(v)=h_r(v^*)}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \geq \frac{16n}{T}\Big] \leq \frac{1}{16}.$$

As the hash functions are independent of each other, for $R \geq 48\ln(d'/\beta)$, by the Chernoff bound we get that with probability at least $1-\beta/d'$ (over sampling $h_1,\dots,h_R$) there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$. Using the union bound over every $v^* \in V'$, event $E_1$ happens with probability at least $1-\beta$.

Event $E_2$ (over partitioning $[n]$ into $I_1,\dots,I_R$): There exists a subset $R_2 \subseteq [R]$ of size $|R_2| \geq \frac{7}{8}R$ s.t. for every $r \in R_2$ it holds that $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$.

Following Theorem 2.4 (the Poisson approximation), we analyze event $E_2$ in the Poisson case. To that end, let $\hat I_1,\dots,\hat I_R$ be independent Poisson random variables with mean $n/R$. Now fix $r \in [R]$. Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $n \geq 43R$, we have $\Pr[\frac{n}{2R} \leq \hat I_r \leq \frac{2n}{R}] \geq \frac{99}{100}$. As $\hat I_1,\dots,\hat I_R$ are independent, assuming that $R \geq 300\ln(3n/\beta)$, by the Chernoff bound we get that event $E_2$ happens with probability at least $1-\frac{\beta}{3n}$ in the Poisson case. Hence, by Theorem 2.4, event $E_2$ happens with probability at least $1-\beta$.⁸

For every $r \in [R]$, let $S_r = (v_j)_{j\in I_r}$ denote a database containing the data of all users $j$ s.t. $j \in I_r$. Also, for $v^* \in V'$ and $r \in [R]$, denote $S_{r,v^*} = \{v \in S : h_r(v) = h_r(v^*)\}$; that is, $|S_{r,v^*}|$ is the number of users $j$ s.t. $h_r(v_j) = h_r(v^*)$. Furthermore, for $v^* \in V'$ and $r \in [R]$, denote $I^{v^*}_r = \{v \in S_r : h_r(v) = h_r(v^*)\}$; that is, $|I^{v^*}_r|$ is the number of users $j$ s.t. $j \in I_r$ and $h_r(v_j) = h_r(v^*)$. Observe that $|S_{r,v^*}| \geq f_S(v^*)$ and that $|I^{v^*}_r| \geq f_{S_r}(v^*)$.

8. Alternatively, instead of the requirement on $R$, it suffices to ensure that $n \geq O(R\log(R/\beta))$.
In such a case, by the Chernoff bound, for every fixed $r \in [R]$ we will have $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$ with probability at least $1-\frac{\beta}{R}$. Hence, using the union bound, this happens simultaneously for all $r \in [R]$ with probability at least $1-\beta$. Observe that in any case we already require $R \geq O(\log(d'/\beta))$ for event $E_1$.

Event $E_3$ (over partitioning $[n]$ into $I_1,\dots,I_R$): For every query $v^* \in V'$ there exists a subset $R_3^{v^*} \subseteq [R]$ of size $|R_3^{v^*}| \geq \frac{9}{10}R$ s.t. for every $r \in R_3^{v^*}$ it holds that

$$\big| R\cdot|I^{v^*}_r| - |S_{r,v^*}| \big| \leq \sqrt{20R\,|S_{r,v^*}|}. \tag{1}$$

We analyze event $E_3$ in the Poisson case. To that end, fix $v^* \in V'$, and let $\hat I^{v^*}_1,\dots,\hat I^{v^*}_R$ be independent Poisson random variables, where $\hat I^{v^*}_r$ has mean $|S_{r,v^*}|/R$. Now fix $r \in [R]$. Using a tail bound for the Poisson distribution (see Theorem 2.5), with probability at least $19/20$ we have $|R\cdot\hat I^{v^*}_r - |S_{r,v^*}|| \leq \sqrt{20R\,|S_{r,v^*}|}$. As $\hat I^{v^*}_1,\dots,\hat I^{v^*}_R$ are independent, assuming that $R \geq 300\ln(3nd'/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3nd'}$, Inequality (1) holds for at least $9/10$ of the choices of $r \in [R]$. By Theorem 2.4 (the Poisson approximation), with probability at least $1-\frac{\beta}{d'}$, this is also the case for the random variables $|I^{v^*}_r|$. That is, with probability at least $1-\frac{\beta}{d'}$, there exists a subset $R_3^{v^*} \subseteq [R]$ of size $|R_3^{v^*}| \geq \frac{9}{10}R$ s.t. for every $r \in R_3^{v^*}$, Inequality (1) holds. Using the union bound over every choice of $v^* \in V'$, event $E_3$ happens with probability at least $1-\beta$.⁹

Event $E_4$ (over sampling $Z$ and the coins of the local randomizers): For every query $v^* \in V'$ there exists a subset $R_4^{v^*} \subseteq [R]$ of size $|R_4^{v^*}| \geq \frac{7}{10}R$ s.t. for every $r \in R_4^{v^*}$ it holds that

$$\big| R\cdot a_r(h_r(v^*)) - R\cdot|I^{v^*}_r| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR}. \tag{2}$$

For $v^* \in V'$ and $r \in [R]$, denote $c_r(v^*) = \sum_{j\in I_r} y_j\,Z[h_r(v^*), j]$, and recall that algorithm Hashtogram answers the query $v^*$ with $a(v^*) = R\cdot\frac{e^\epsilon+1}{e^\epsilon-1}\cdot\mathrm{Median}\{c_r(v^*)\}_{r\in[R]}$. Fix $v^* \in V'$ and $r \in R_2$ (where $R_2 \subseteq [R]$ is the subset from event $E_2$). We now analyze the expectation of $c_r(v^*)$:

$$\mathbb{E}[c_r(v^*)] = \sum_{j\in I_r}\mathbb{E}[y_j\,Z[h_r(v^*),j]] = \sum_{\substack{j\in I_r:\\ h_r(v_j)=h_r(v^*)}}\mathbb{E}[y_j\,Z[h_r(v^*),j]] + \sum_{\substack{j\in I_r:\\ h_r(v_j)\neq h_r(v^*)}}\mathbb{E}[y_j]\,\mathbb{E}[Z[h_r(v^*),j]] = |\{v \in S_r : h_r(v) = h_r(v^*)\}|\cdot\frac{e^\epsilon-1}{e^\epsilon+1} = |I^{v^*}_r|\cdot\frac{e^\epsilon-1}{e^\epsilon+1}.$$

That is, $c_r(v^*)$ can be expressed as a sum of independent $\pm1$ random variables: $|I^{v^*}_r|$ variables with expectation $\frac{e^\epsilon-1}{e^\epsilon+1}$, and $(|I_r|-|I^{v^*}_r|)$ variables with expectation 0 (recall that by event $E_2$ we have $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$). Using the Hoeffding bound, with probability at least $43/44$ we have $|c_r(v^*) - \frac{e^\epsilon-1}{e^\epsilon+1}|I^{v^*}_r|| \leq \sqrt{11n/R}$; that is, Inequality (2) holds. Fix $v^* \in V'$, and observe that the above sums are independent for different values of $r$. Hence, using the Chernoff bound and asserting that $R \geq 150\ln(d'/\beta)$, for that fixed $v^*$, with probability at least $1-\beta/d'$ we have that Inequality (2) holds for at least a $7/8$ fraction of the choices of $r \in R_2$; as $|R_2| \geq \frac{7}{8}R$, this gives a subset $R_4^{v^*}$ of size at least $\frac{49}{64}R \geq \frac{7}{10}R$. Using the union bound, with probability at least $1-\beta$, this is true for every $v^* \in V'$ simultaneously. That is, event $E_4$ happens with probability at least $1-\beta$.

We are now ready to complete the proof, by showing that the guarantees of the lemma hold whenever events $E_1, E_2, E_3, E_4$ occur. Fix $v^* \in V'$. Combining events $E_3$ and $E_4$, we get that for every $r \in R_3^{v^*}\cap R_4^{v^*}$,

$$\big| R\cdot a_r(h_r(v^*)) - |S_{r,v^*}| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{20R\,|S_{r,v^*}|}.$$

Recall that for every $r \in [R]$ we have $|S_{r,v^*}| \geq f_S(v^*)$. Furthermore, by event $E_1$, for every $r \in R_1^{v^*}$ we have $|S_{r,v^*}| \leq f_S(v^*) + \frac{16n}{T}$.

9. Alternatively, instead of the requirement on $R$, it suffices to ensure that $n \geq O(R\log(d'R/\beta))$, in which case the analysis follows from the Chernoff bound.
Hence, for every $r \in R_1^{v^*}\cap R_3^{v^*}\cap R_4^{v^*}$ we have

$$\big| R\cdot a_r(h_r(v^*)) - f_S(v^*) \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{20R\Big(f_S(v^*)+\frac{16n}{T}\Big)} + \frac{16n}{T} \triangleq \mathrm{error}(v^*).$$

That is, for every $r \in R_1^{v^*}\cap R_3^{v^*}\cap R_4^{v^*}$ we have that $R\cdot a_r(h_r(v^*))$ is accurate up to $\mathrm{error}(v^*)$. As $|R_1^{v^*}\cap R_3^{v^*}\cap R_4^{v^*}| \geq \frac{9}{16}R$, and as algorithm Hashtogram answers $v^*$ with $a(v^*)$ chosen as the median of $\{R\cdot a_r(h_r(v^*))\}_{r\in[R]}$, we get that $|a(v^*) - f_S(v^*)| \leq \mathrm{error}(v^*)$. This holds for every $v^* \in V'$.

Processing memory. Algorithm Hashtogram maintains (in step 2) $R\cdot T$ sums, each over at most $n$ bits. This requires $O(R\cdot T\cdot\log n)$ bits of processing memory.

Runtime. Observe that a direct implementation of step 2 of algorithm Hashtogram consists of summing $T$ bits per user, i.e., a total of $T\cdot n$ bits, and hence results in a runtime of $\approx n^{1.5}$ (for $T = O(\sqrt n)$). As we next explain, this can be reduced to $\tilde O(n)$. First observe that for the analysis of Lemma 6.6 (specifically, for the analysis of event $E_4$), it suffices that, for every $j \in [n]$, the entries of column $j$ of $Z$ are only pairwise independent. That is, each column of $Z$ consists of $T \leq \sqrt n$ pairwise independent bits. We can represent such a column using $\log T \leq \frac{1}{2}\log n$ bits, in which case there are at most $O(\sqrt n)$ choices for the columns of $Z$ (see, e.g., Construction 3.18 of Vadhan (2012)). So, even though the matrix $Z$ contains $n$ columns, it has at most $\tilde T = O(\sqrt n)$ distinct columns. Let us denote those distinct columns as $z_1,\dots,z_{\tilde T}$, where $z_\gamma[t]$ denotes the bit in position $t$ of this column. We will write $Z[\cdot,j] = z_\gamma$ to indicate that the $j$th column of $Z$ is $z_\gamma$. With this notation, we can restate $a_r(t)$ (computed in step 2 of algorithm Hashtogram) as follows:

$$a_r(t) = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{j\in I_r} y_j\,Z[t,j] = \frac{e^\epsilon+1}{e^\epsilon-1}\sum_{\gamma=1}^{\tilde T} z_\gamma[t] \sum_{\substack{j\in I_r \text{ s.t.}\\ Z[\cdot,j]=z_\gamma}} y_j.$$

Thus, we can implement step 2 of algorithm Hashtogram by first computing $\sum_{j\in I_r \text{ s.t. } Z[\cdot,j]=z_\gamma} y_j$ for every $1 \leq \gamma \leq \tilde T$ and $1 \leq r \leq R$ (this amounts to summing a total of $n$ bits, and can be done in time $O(n)$). Afterwards, for every choice of $(r,t)$, computing $a_r(t)$ consists of summing $\tilde T$ elements. Overall, step 2 of the algorithm can be executed in time $O(R\cdot T\cdot\tilde T\cdot\log n) = \tilde O(n)$.

6.2.2. The Full Protocol

We are now ready to present the full Bitstogram protocol. This protocol is obtained by repeating the strategy outlined in algorithm SuccinctHist to boost the success probability, and by using an error-correcting code to reduce the error (as mentioned in Remark 3).

Algorithm 8 Bitstogram
Tool used: A binary code (Enc, Dec), where Enc: $V \to V'$, correcting a $\frac{1}{8}$-fraction of errors with constant rate, that is, $\log d' = O(\log d)$, where $d = |V|$ and $d' = |V'|$.
Public randomness: Random partition of $[n]$ into $R\cdot\log d'$ subsets $I_{r,\ell}$ for $(r,\ell) \in [R]\times[\log d']$. Random hash functions $h_1,\dots,h_R$ mapping $V'$ to $[T]$.
Setting: Each player $j \in [n]$ holds a value $v_j \in V$. Define $S = (v_1,\dots,v_n)$. For every $j \in [n]$ denote $c_j = \mathrm{Enc}(v_j) \in V'$. For $(r,\ell) \in [R]\times[\log d']$, let $S_{r,\ell} = (h_r(c_j), c_{j,\ell})_{j\in I_{r,\ell}}$, where $c_{j,\ell}$ is bit $\ell$ of $c_j$. That is, $S_{r,\ell}$ is a database over the domain $[T]\times\{0,1\}$.
Input: Oracle access to $S$ and to $S_{r,\ell}$ for every $(r,\ell) \in [R]\times[\log d']$.
1. For $(r,\ell) \in [R]\times[\log d']$, use Hashtogram($S_{r,\ell}$) with privacy parameter $\frac{\epsilon}{2}$ to get $\{a_{r,\ell}(t,b) : (t,b) \in [T]\times\{0,1\}\}$.
2. For $(r,t) \in [R]\times[T]$, define $\hat c_{r,t} \in V'$, where bit $\ell$ of $\hat c_{r,t}$ is $\hat c_{r,t,\ell} = \arg\max_{b\in\{0,1\}} a_{r,\ell}(t,b)$. That is, if $a_{r,\ell}(t,0) \geq a_{r,\ell}(t,1)$ then $\hat c_{r,t,\ell} = 0$, and otherwise $\hat c_{r,t,\ell} = 1$.
3. For $(r,t) \in [R]\times[T]$, define $\hat v_{r,t} = \mathrm{Dec}(\hat c_{r,t}) \in V$.
4. Use Hashtogram($S$) with privacy parameter $\frac{\epsilon}{2}$ to obtain $a(\hat v_{r,t})$ for all $(r,t) \in [R]\times[T]$.
5. Return the list $L = \{(\hat v_{r,t}, a(\hat v_{r,t})) : (r,t) \in [R]\times[T]\}$.
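The distinct-columns speedup of Hashtogram's step 2 can be verified directly. The following sketch (ours; all sizes and names are illustrative assumptions, and the debiasing factor is omitted) checks that grouping users by their column identifier yields the same $a_r(t)$ values as the direct double sum.

```python
import numpy as np

# Sketch of the step-2 speedup for Hashtogram: when Z has few distinct
# columns, group users by their column id before forming a_r(t).
rng = np.random.default_rng(3)
n, R, T, n_cols = 50_000, 8, 32, 64              # n_cols: distinct columns of Z

cols = rng.choice([-1, 1], size=(n_cols, T))     # the distinct columns z_gamma
col_id = rng.integers(0, n_cols, size=n)         # Z[:, j] = cols[col_id[j]]
group = rng.integers(0, R, size=n)               # random partition I_1..I_R
y = rng.choice([-1, 1], size=n)                  # users' 1-bit reports (stand-in)

# Naive step 2: for each (r, t), sum y_j * Z[t, j] over j in I_r.
naive = np.zeros((R, T))
for r in range(R):
    idx = group == r
    naive[r] = (y[idx, None] * cols[col_id[idx]]).sum(axis=0)

# Fast step 2: first sum y_j per (r, gamma), then one small matmul.
s = np.zeros((R, n_cols))
np.add.at(s, (group, col_id), y)                 # time ~ n
fast = s @ cols                                  # time ~ R * T * n_cols

print(np.allclose(naive, fast))                  # True: same a_r(t) up to scaling
```

The first pass touches each user exactly once, which is what brings the server back to near-linear time.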
Remark 4 The execution of Hashtogram in step 4 uses the parameters stated in Lemma 6.6, in order to obtain accurate answers for every fixed set of $n$ queries with probability $1-\beta$. The executions of Hashtogram in step 1 use the parameters stated in Lemma 6.5, in order to obtain accurate answers for every fixed pair of queries with probability $255/256$. Observe that every such instantiation of Hashtogram is queried $2T$ times, and hence some of these queries might result in inaccurate answers. Nevertheless, as will be made clear later, due to our use of an error-correcting code, these inaccurate answers will not affect the final outcome of the algorithm.

Lemma 6.7 Algorithm Bitstogram satisfies $\epsilon$-LDP.

Lemma 6.8 Let $\epsilon \leq 1$, and assume that $\log d \geq \Omega(\log(n/\beta))$. Set $R = O(\log(1/\beta))$ and $T = O\big(\frac{\epsilon n}{\sqrt{R\log d}}\big)$. Algorithm Bitstogram returns a list $L$ of length $R\cdot T$ satisfying:
1. With probability $1-\beta$, for every $(v,a) \in L$ we have $|a - f_S(v)| \leq O\big(\frac{1}{\epsilon}\sqrt{n\log(n/\beta)}\big)$.
2. With probability $1-\beta$, for every $v \in V$ s.t. $f_S(v) \geq O\big(\frac{1}{\epsilon}\sqrt{n\log(d)\log(\frac{1}{\beta})}\big)$, we have that $v$ is in $L$.

Remark 5 The assumption in Lemma 6.8 that $\log d \geq \Omega(\log(n/\beta))$ is without loss of generality, as otherwise the universe size $d$ is small enough to allow the server to run in time linear in $d$, which makes the problem much easier. Specifically, if $d < n$ then we can instantiate the frequency oracle of Lemma 6.1 and query it for every domain element. As $d < n$, this can be executed in time $\tilde O(n)$ (the runtime analysis is similar to the one in Section 6.2.1). Otherwise, if $d \geq n$ then we already have $\log d \geq \log n$, and (if necessary) we can pad the representation of domain elements to satisfy the assumption that $\log d \geq \Omega(\log(n/\beta))$.

Proof [Proof of Lemma 6.8] Item 1 of the lemma follows directly from Lemma 6.6. We now prove item 2. Consider the following good event (over sampling $h_1,\dots,h_R$):

Event $E_1$ (over sampling $h_1,\dots,h_R$): There exists a subset $R_1 \subseteq [R]$ of size $|R_1| \geq \frac{7}{8}R$ s.t. for every $r \in R_1$ and for every $v^* \in S$ satisfying $f_S(v^*) \geq \frac{n^{1.5}}{T}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{16n^{1.5}}{T}$.

We start by showing that event $E_1$ happens with high probability. To that end, fix $v^* \in S$ and $r \in [R]$. We have

$$\mathbb{E}_{h_r}\big[|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_{h_r}\big[\mathbb{1}_{h_r(\mathrm{Enc}(v))=h_r(\mathrm{Enc}(v^*))}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \geq \frac{16n^{1.5}}{T}\Big] \leq \frac{1}{16\sqrt n}.$$

Assuming that $T \leq n$, there can be at most $\sqrt n$ heavy elements $v^*$ satisfying $f_S(v^*) \geq \frac{n^{1.5}}{T}$. Hence, using the union bound,

$$\Pr\Big[\exists\,v^* \text{ s.t. } f_S(v^*) \geq \frac{n^{1.5}}{T} \text{ and } |\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \geq \frac{16n^{1.5}}{T}\Big] \leq \frac{1}{16}.$$

As the hash functions are independent of each other, for $R \geq 48\ln(1/\beta)$, by the Chernoff bound we get that with probability at least $1-\beta$ (over sampling $h_1,\dots,h_R$) there exists a subset $R_1 \subseteq [R]$ of size $|R_1| \geq \frac{7}{8}R$ s.t. for every $v^*$ satisfying $f_S(v^*) \geq \frac{n^{1.5}}{T}$ and every $r \in R_1$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{16n^{1.5}}{T}$. That is, event $E_1$ happens with probability at least $1-\beta$. We continue the analysis assuming that event $E_1$ occurs.

Event $E_2$ (over partitioning $[n]$ into $\{I_{r,\ell}\}$): There exists a subset $R_2 \subseteq [R]$ of size $|R_2| \geq \frac{7}{8}R$ s.t. for every $r \in R_2$ there exist at least $\frac{31}{32}\log d'$ choices of $\ell \in [\log d']$ for which $|I_{r,\ell}| \leq \frac{2n}{R\log d'}$.

Following Theorem 2.4 (the Poisson approximation), we analyze event $E_2$ in the Poisson case. To that end, let $\hat I_{1,1},\dots,\hat I_{R,\log d'}$ be independent Poisson random variables with mean $\frac{n}{R\log d'}$. Now fix $(r,\ell) \in [R]\times[\log d']$.
Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $n \geq 10R\log d'$, we have $\Pr[\hat I_{r,\ell} \leq \frac{2n}{R\log d'}] \geq \frac{99}{100}$. As $\hat I_{1,1},\dots,\hat I_{R,\log d'}$ are independent, assuming that $R\log d' \geq 300\ln(3n/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3n}$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell) \in [R]\times[\log d']$ s.t. $\hat I_{r,\ell} \leq \frac{2n}{R\log d'}$. Hence, by Theorem 2.4, with probability at least $1-\beta$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell) \in [R]\times[\log d']$ s.t. $|I_{r,\ell}| \leq \frac{2n}{R\log d'}$. If that is the case, then there must be at least $\frac{7R}{8}$ choices of $r \in [R]$ for which

$$\Big|\Big\{\ell \in [\log d'] : |I_{r,\ell}| \leq \frac{2n}{R\log d'}\Big\}\Big| \geq \frac{31}{32}\log d'.$$

That is, $\Pr[E_2] \geq 1-\beta$.

Event $E_3$ (over partitioning $[n]$ into $\{I_{r,\ell}\}$): For every $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$ there exists a subset $R_3^{v^*} \subseteq [R]$ of size $|R_3^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_3^{v^*}$ there exist at least $\frac{31}{32}\log d'$ choices of $\ell \in [\log d']$ for which $|\{j \in I_{r,\ell} : v_j = v^*\}| \geq \frac{f_S(v^*)}{2R\log d'}$.

We analyze event $E_3$ in the Poisson case. To that end, fix $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$, and let $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ be independent Poisson random variables with mean $\frac{f_S(v^*)}{R\log d'}$. Now fix $(r,\ell) \in [R]\times[\log d']$. Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $\frac{n^{1.5}}{T} \geq 37R\log d'$, we have $\Pr[\hat I^{v^*}_{r,\ell} \geq \frac{f_S(v^*)}{2R\log d'}] \geq \frac{99}{100}$. As $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ are independent, assuming that $R\log d' \geq 300\ln(3n^2/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3n^2}$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell) \in [R]\times[\log d']$ s.t. $\hat I^{v^*}_{r,\ell} \geq \frac{f_S(v^*)}{2R\log d'}$. Hence, by Theorem 2.4, with probability at least $1-\frac{\beta}{n}$ there are at least $\frac{98}{100}R\log d'$ choices of $(r,\ell)$ s.t. $|\{j \in I_{r,\ell} : v_j = v^*\}| \geq \frac{f_S(v^*)}{2R\log d'}$. If that is the case, then there must be at least $\frac{7R}{8}$ choices of $r \in [R]$ for which

$$\Big|\Big\{\ell \in [\log d'] : |\{j \in I_{r,\ell} : v_j = v^*\}| \geq \frac{f_S(v^*)}{2R\log d'}\Big\}\Big| \geq \frac{31}{32}\log d'.$$

Using the union bound, this holds simultaneously for every such $v^*$ with probability at least $1-\beta$. That is, $\Pr[E_3] \geq 1-\beta$.

Event $E_4$ (over partitioning $[n]$ into $\{I_{r,\ell}\}$): For every $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$ there exists a subset $R_4^{v^*} \subseteq [R]$ of size $|R_4^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_4^{v^*}$ there exist at least $\frac{31}{32}\log d'$ choices of $\ell \in [\log d']$ for which $|\{j \in I_{r,\ell} : v_j \neq v^* \text{ and } h_r(\mathrm{Enc}(v_j)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{32n^{1.5}}{RT\log d'}$.

We analyze event $E_4$ in the Poisson case. To that end, fix $v^* \in S$ s.t. $f_S(v^*) \geq \frac{n^{1.5}}{T}$. For $r \in [R]$ denote $\mathrm{Col}_r(v^*) = |\{v \in S : v \neq v^* \text{ and } h_r(\mathrm{Enc}(v)) = h_r(\mathrm{Enc}(v^*))\}|$, and recall that by event $E_1$, for $r \in R_1$ we have $\mathrm{Col}_r(v^*) \leq \frac{16n^{1.5}}{T}$. Let $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ be independent Poisson random variables, where $\hat I^{v^*}_{r,\ell}$ has mean $\frac{\mathrm{Col}_r(v^*)}{R\log d'}$. Now fix $(r,\ell) \in R_1\times[\log d']$. Using a tail bound for the Poisson distribution (see Theorem 2.5), assuming that $T \leq \frac{n^{1.5}}{R\log d'}$, we have $\Pr[\hat I^{v^*}_{r,\ell} \leq \frac{32n^{1.5}}{RT\log d'}] \geq \frac{99}{100}$. As $\hat I^{v^*}_{1,1},\dots,\hat I^{v^*}_{R,\log d'}$ are independent, assuming that $|R_1|\log d' \geq 300\ln(3n^2/\beta)$, by the Chernoff bound we get that with probability at least $1-\frac{\beta}{3n^2}$ there are at least $\frac{98}{100}|R_1|\log d'$ choices of $(r,\ell) \in R_1\times[\log d']$ s.t. $\hat I^{v^*}_{r,\ell} \leq \frac{32n^{1.5}}{RT\log d'}$. Hence, by Theorem 2.4, with probability at least $1-\frac{\beta}{n}$ there are at least $\frac{98}{100}|R_1|\log d'$ choices of $(r,\ell) \in R_1\times[\log d']$ s.t.

$$|\{j \in I_{r,\ell} : v_j \neq v^* \text{ and } h_r(\mathrm{Enc}(v_j)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{32n^{1.5}}{RT\log d'}.$$

If that is the case, then there must be at least $\frac{7|R_1|}{8}$ choices of $r \in R_1$ for which

$$\Big|\Big\{\ell \in [\log d'] : |\{j \in I_{r,\ell} : v_j \neq v^* \text{ and } h_r(\mathrm{Enc}(v_j)) = h_r(\mathrm{Enc}(v^*))\}| \leq \frac{32n^{1.5}}{RT\log d'}\Big\}\Big| \geq \frac{31}{32}\log d'.$$

Using the union bound, this holds simultaneously for every such $v^*$ with probability at least $1-\beta$. That is, $\Pr[E_4] \geq 1-\beta$.

We are now ready to complete the proof. Fix $v^* \in S$ s.t. $f_S(v^*) \geq \frac{264\,n^{1.5}}{T}$.
Let $R_1, R_2, R_3^{v^*}, R_4^{v^*}$ be the sets from events $E_1, E_2, E_3, E_4$, and observe that $|R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}| \geq \frac{3}{8}R$; furthermore, for every $r^* \in R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}$ there exist at least $\frac{29}{32}\log d'$ choices of $\ell \in [\log d']$ for which

(a) $|\{j \in I_{r^*,\ell} : v_j = v^*\}| \geq \frac{132\,n^{1.5}}{TR\log d'}$,
(b) $|\{j \in I_{r^*,\ell} : v_j \neq v^* \text{ and } h_{r^*}(\mathrm{Enc}(v_j)) = h_{r^*}(\mathrm{Enc}(v^*))\}| \leq \frac{32\,n^{1.5}}{TR\log d'}$,
(c) $|I_{r^*,\ell}| \leq \frac{2n}{R\log d'}$.

Denote $c^* = \mathrm{Enc}(v^*)$. Observe that by items (a) and (b) above, we have

$$f_{S_{r^*,\ell}}(h_{r^*}(c^*), c^*_\ell) \geq f_{S_{r^*,\ell}}(h_{r^*}(c^*), 1-c^*_\ell) + \frac{100\,n^{1.5}}{TR\log d'}.$$

Let us assume that $T \leq \frac{\epsilon n}{326\sqrt{R\log d'}}$. By the properties of algorithm Hashtogram (Lemma 6.5), for every $r^*, \ell$ satisfying items (a), (b), (c), algorithm Hashtogram($S_{r^*,\ell}$) ensures that with probability at least $\frac{255}{256}$ we have $\hat c_{r^*,t^*,\ell} = c^*_\ell$, where $t^* = h_{r^*}(\mathrm{Enc}(v^*))$. Now fix $r^* \in R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}$, and recall that the coins of Hashtogram($S_{r^*,\ell}$) are independent for different values of $\ell$. Hence, by the Chernoff bound, for $\log d' \geq 850\log(n/\beta)$ we get

$$\Pr\Big[\big|\{\ell \in [\log d'] : \hat c_{r^*,t^*,\ell} = c^*_\ell\}\big| \geq \frac{9}{10}\log d'\Big] \geq 1-\frac{\beta}{n}.$$

That is, for our fixed $v^*$ there exists an $r^* \in R_1\cap R_2\cap R_3^{v^*}\cap R_4^{v^*}$ s.t. with probability at least $1-\frac{\beta}{n}$ we have that $\hat c_{r^*,t^*}$ and $c^* = \mathrm{Enc}(v^*)$ differ on at most a $\frac{1}{10}$-fraction of their bits. Using the union bound, with probability at least $1-\beta$, this holds simultaneously for every $v^*$ s.t. $f_S(v^*) \geq \frac{264\,n^{1.5}}{T}$. By the properties of the error-correcting code, in such a case, for every $v^*$ s.t. $f_S(v^*) \geq \frac{264\,n^{1.5}}{T}$ we have $\hat v_{r^*,t^*} = \mathrm{Dec}(\hat c_{r^*,t^*}) = v^*$, and $v^*$ is in the list $L$.

Runtime. In step 1, algorithm Bitstogram instantiates $O(R\log d)$ copies of the frequency oracle Hashtogram. Every such instantiation takes time $\tilde O(n)$. Afterwards, the algorithm queries the oracles for a total of $O(2RT\log d) \leq \tilde O(n)$ queries, each of which takes time $O(1)$. Thus, step 1 takes time $\tilde O(n)$ to compute. Steps 2 and 3 loop over every $(r,t) \in [R]\times[T]$, and take time $RT \leq \tilde O(n)$ to compute. Finally, step 4 instantiates Hashtogram (in time $\tilde O(n)$) and queries it for a total of $R\cdot T \leq \tilde O(n)$ queries. Overall, algorithm Bitstogram runs in time $\tilde O(n)$.

Processing memory. Recall that step 1 of algorithm Bitstogram requires querying algorithm Hashtogram for a total of $O(2RT\log d) \leq \tilde O(n)$ queries. As we are aiming for an algorithm with processing memory far below $n$, we cannot store all of the answers in memory simultaneously. To resolve this issue, let us reorganize step 1 of algorithm Bitstogram as follows:

1a. For every $(r,\ell) \in [R]\times[\log d']$, invoke Hashtogram($S_{r,\ell}$) with privacy parameter $\frac{\epsilon}{2}$.
1b. For every $(r,t) \in [R]\times[T]$ and every $\ell \in [\log d']$, query Hashtogram($S_{r,\ell}$) to get $\{a_{r,\ell}(t,b) : b \in \{0,1\}\}$.

Now, every one of steps 1b, 2, 3, 4 of algorithm Bitstogram contains a loop over every choice of $(r,t) \in [R]\times[T]$, and we can group all of these steps together into one loop over $(r,t) \in [R]\times[T]$. If an iteration of this loop results in a value $a(\hat v_{r,t}) \leq \sqrt n$, we can simply ignore it (recall that the frequencies of domain elements that are not in the list $L$ are estimated as zero, and that $\sqrt n$ is less than the guaranteed bound on the error of the protocol, so this step does not affect our error bounds). As there can be at most $\sqrt n$ elements with frequencies at least $\sqrt n$, the necessary processing memory is only $\tilde O(\sqrt n)$.
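The fused loop described above can be sketched as follows (ours, illustrative only; `oracle_bits`, `oracle_S`, and `dec` are hypothetical stand-ins for the Hashtogram instances and the decoder, not the paper's implementation).

```python
import math

# Sketch of the memory-bounded reorganization of Bitstogram's server loop
# (steps 1b, 2, 3, 4 fused into one pass over (r, t)).
def fused_loop(oracle_bits, oracle_S, R, T, num_bits, n, dec):
    """Keep only candidates whose estimated frequency exceeds sqrt(n)."""
    threshold = math.sqrt(n)        # below the protocol's error bound anyway
    survivors = {}                  # at most ~sqrt(n) entries are ever kept
    for r in range(R):
        for t in range(T):
            # Step 1b + 2: reconstruct the candidate codeword bit by bit.
            bits = []
            for ell in range(num_bits):
                a0 = oracle_bits[(r, ell)](t, 0)
                a1 = oracle_bits[(r, ell)](t, 1)
                bits.append(0 if a0 >= a1 else 1)
            v_hat = dec(bits)       # step 3: decode the error-corrected word
            a_v = oracle_S(v_hat)   # step 4: estimate its frequency
            if a_v > threshold:     # discard light candidates immediately
                survivors[v_hat] = a_v
    return survivors
```

Discarding candidates below $\sqrt n$ on the fly is exactly what caps the dictionary at about $\sqrt n$ entries, since no more than $\sqrt n$ elements can each appear $\sqrt n$ times among $n$ users.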
Acknowledgments

R. B.'s research is supported by NSF Awards AF-1908281 and SHF-1907715, a Google Faculty Research Award, and OSU faculty start-up support. Work of K. N. was supported by NSF grant No. 1565387, TWC: Large: Collaborative: Computing Over Distributed Sensitive Data. Work of U. S. was supported in part by the Israel Science Foundation (grant No. 1871/19). A. T.'s research is supported by NSF Awards TRIPODS+X-1839317, AF-1908281, and TRIPODS-1740850, and a Google Faculty Research Award.

References

NLTK Brown corpus. www.nltk.org.

Apple tries to peek at user habits without violating privacy. The Wall Street Journal, 2016.

Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley, 1992. ISBN 0-471-53588-5.

Raef Bassily and Adam Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 127-135. ACM, 2015.

Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory of Computing, 12(1):1-61, 2016. doi: 10.4086/toc.2016.v012a001. URL http://dx.doi.org/10.4086/toc.2016.v012a001.

M. Bellare and J. Rompel. Randomness-efficient oblivious sampling. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, SFCS '94, pages 276-287, Washington, DC, USA, 1994. IEEE Computer Society. ISBN 0-8186-6580-7. doi: 10.1109/SFCS.1994.365687. URL http://dx.doi.org/10.1109/SFCS.1994.365687.

Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 634-649. IEEE Computer Society, 2015. ISBN 978-1-4673-8191-8. doi: 10.1109/FOCS.2015.45. URL http://dx.doi.org/10.1109/FOCS.2015.45.

Mark Bun, Jelani Nelson, and Uri Stemmer. Heavy hitters and the structure of local privacy. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 435-447, 2018. doi: 10.1145/3196959.3196981. URL https://doi.org/10.1145/3196959.3196981.

T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Optimal lower bound for differentially private multi-party aggregation. In Leah Epstein and Paolo Ferragina, editors, Algorithms - ESA 2012 - 20th Annual European Symposium, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, volume 7501 of Lecture Notes in Computer Science, pages 277-288. Springer, 2012. ISBN 978-3-642-33089-6. doi: 10.1007/978-3-642-33090-2_25. URL http://dx.doi.org/10.1007/978-3-642-33090-2_25.

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP, 2002.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, 2006.

Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS, 2014.

Alexandre Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211-222. ACM, 2003. ISBN 1-58113-670-6.

Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. arXiv preprint arXiv:1503.01214, 2015.

Venkatesan Guruswami. List Decoding of Error-Correcting Codes. PhD thesis, Massachusetts Institute of Technology, 2001. Supervisor: Madhu Sudan.

Justin Hsu, Sanjeev Khanna, and Aaron Roth. Distributed private heavy hitters. In International Colloquium on Automata, Languages, and Programming, pages 461-472. Springer, 2012.
Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793-826, 2011.

Nina Mishra and Mark Sandler. Privacy via pseudorandom sketches. In Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 143-152. ACM, 2006.

Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005. ISBN 0521835402.

A. G. Thakurta, A. H. Vyrros, U. S. Vaishampayan, G. Kapoor, J. Freudiger, V. R. Sridhar, and D. Davidson. Learning new words. US Patent 9594741, 2017.

Salil P. Vadhan. Pseudorandomness. Foundations and Trends in Theoretical Computer Science, 7(1-3):1-336, 2012. ISSN 1551-305X. doi: 10.1561/0400000010. URL http://dx.doi.org/10.1561/0400000010.

A. Missing Proofs from Section 5

A.1. Proof of Theorem 5.1

It is easy to see that each invocation of LocalRnd is $\epsilon/2$-locally differentially private: conditioned on any realization of $\ell_i, j_i, r_i$, for any pair of possible input items $v_i, v_i' \in V$ to LocalRnd and any output bit $b$ generated in step 7 of Algorithm 1, we have

$$\Pr[y_i = b \mid v_i] \leq e^{\epsilon/2}\,\Pr[y_i = b \mid v_i'].$$

Note that LocalRnd is invoked only twice for each user: once when Final = 0 (the scanning/pruning phase of TreeHist) and once when Final = 1 (the final phase of TreeHist). Thus, it follows that protocol TreeHist is $\epsilon$-differentially private.

A.2. Proof of Theorem 5.2

Let $\beta \in (0,1)$ and let $\eta$ be as defined in the theorem statement. Consider the pruning phase of TreeHist, that is, steps 1 to 11 in Algorithm 3. Let $\gamma$ be as set in step 2. In this phase, TreeHist invokes FreqOracle once in every iteration of the outer for loop (over the levels of the tree) with the flag Final = 0. Consider any such iteration $\ell$. Suppose, for now, that the size of ChildSet(Prefixes) passed to FreqOracle in that iteration is at most $2n/\eta$. (We will show that with probability at least $1-\beta$ this condition is satisfied at all levels $\ell$ of the tree, i.e., it is a loop invariant.) By invoking Lemma 5.3 with $\widehat V = \mathrm{ChildSet(Prefixes)}$, $\{I_j : j \in [t]\} = \{I_{\ell,j} : j \in [t]\}$, and $\gamma = t\log d = 110\log(n/\beta)\log d$, we have that with probability at least $1-\beta/\log d$, for every $\hat v \in \mathrm{ChildSet(Prefixes)}$ with $f(\hat v) > 3\eta$, FreqOracle gives an estimate $\hat f(\hat v) \geq 2\eta$, and for every $\hat v \in \mathrm{ChildSet(Prefixes)}$ with $f(\hat v) \leq \eta$, FreqOracle gives an estimate $\hat f(\hat v) < 2\eta$. Hence, step 9 implies that with probability at least $1-\beta/\log d$, all $\hat v \in \mathrm{ChildSet(Prefixes)}$ with true frequencies $f(\hat v) \geq 3\eta$ will proceed to the next iteration $\ell+1$, and all those $\hat v \in \mathrm{ChildSet(Prefixes)}$ with true frequencies $f(\hat v) < \eta$ will be pruned out. Since the number of nodes $\hat v$ with true frequency $f(\hat v) \geq \eta$ cannot be more than $n/\eta$, the number of surviving nodes in the next iteration $\ell+1$ cannot be more than $2n/\eta$. Hence, this condition will be satisfied in the next iteration, and we can proceed in the same fashion. Note that when $\ell = 1$, the condition is trivially satisfied since there are only $2 < 2n/\eta$ nodes at that level. This induction argument shows that with probability at least $1-\beta$, for every level $\ell \in [\log d]$, the surviving nodes at level $\ell$ correspond to prefixes whose true frequencies are not below $\eta$, and they include all prefixes whose true frequencies are above $3\eta$. In particular, with probability at least $1-\beta$, all items in SuccHist satisfy these properties.
This covers the proof of items 1 and 2 of Theorem 5.2. Now, consider the final phase of TreeHist, that is, steps 12 to 14 of Algorithm 3. Let $\gamma$ be as set in step 12. In this phase, TreeHist invokes FreqOracle on the surviving nodes at the final level of the tree (the last update of Prefixes), with input flag Final = 1. Now, by invoking Lemma 5.3 with $\widehat V = \mathrm{Prefixes}$, $\{I_j : j \in [t]\} = \{\bar I_j : j \in [t]\}$, and $\gamma = t = 110\log(n/\beta)$, we have that with probability at least $1-\beta/\log d$, for every $\hat v \in \mathrm{Prefixes}$, $|\hat f(\hat v) - f(\hat v)| \leq \frac{14}{\epsilon}\sqrt{110\,n\log(n/\beta)}$. This proves item 3 of the theorem.

B. Missing Proofs from Section 6

B.1. Proof of Lemma 6.5

Consider the following good event:

Event $E_1$ (over sampling $h_1,\dots,h_R$): For every query $v^* \in V'$ there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$.

Event $E_1$ states that for at least $7R/8$ of the hash functions, $v^*$ is mapped into a cell without too many collisions with different input elements. Informally, for every single hash function $h_r$, algorithm Hashtogram estimates the number of occurrences of $h_r(v^*)$ in $S$. Hence, if event $E_1$ occurs, then most of the estimations result in accurate answers. We start by showing that event $E_1$ happens with high probability. To that end, fix $v^* \in V'$ and $r \in [R]$. We have

$$\mathbb{E}_{h_r}\big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}|\big] = \sum_{v\in S:\,v\neq v^*}\mathbb{E}_{h_r}\big[\mathbb{1}_{h_r(v)=h_r(v^*)}\big] \leq \frac{n}{T}.$$

Thus, by Markov's inequality, we have

$$\Pr\Big[|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \geq \frac{16n}{T}\Big] \leq \frac{1}{16}.$$

As the hash functions are independent of each other, for $R \geq 48\ln(d'/\beta)$, by the Chernoff bound we get that with probability at least $1-\beta/d'$ (over sampling $h_1,\dots,h_R$) there exists a subset $R_1^{v^*} \subseteq [R]$ of size $|R_1^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_1^{v^*}$ it holds that $|\{v \in S : v \neq v^* \text{ and } h_r(v) = h_r(v^*)\}| \leq \frac{16n}{T}$. Using the union bound, event $E_1$ happens with probability at least $1-\beta$. We continue the analysis assuming that event $E_1$ occurs.

For every $r \in [R]$, let $S_r = (v_j)_{j\in I_r}$ denote a database containing the data of all users $j$ s.t. $j \in I_r$. Also, for $v^* \in V'$ and $r \in [R]$, denote $S_{r,v^*} = \{v \in S : h_r(v) = h_r(v^*)\}$; that is, $|S_{r,v^*}|$ is the number of users $j$ s.t. $h_r(v_j) = h_r(v^*)$. Furthermore, for $v^* \in V'$ and $r \in [R]$, denote $I^{v^*}_r = \{v \in S_r : h_r(v) = h_r(v^*)\}$; that is, $|I^{v^*}_r|$ is the number of users $j$ s.t. $j \in I_r$ and $h_r(v_j) = h_r(v^*)$. Observe that $|S_{r,v^*}| \geq f_S(v^*)$ and that $|I^{v^*}_r| \geq f_{S_r}(v^*)$.

Fix $v^* \in V'$. By the Chernoff bound, with probability at least $1-\beta/d'$ (over partitioning $[n]$ into subsets $I_1,\dots,I_R$), for every $r \in [R]$ we have

$$\big| R\cdot|I^{v^*}_r| - |S_{r,v^*}| \big| \leq \sqrt{3R\,|S_{r,v^*}|\log(2Rd'/\beta)}. \tag{4}$$

Using the union bound, this holds simultaneously for every $v^* \in V'$ and $r \in [R]$ with probability at least $1-\beta$. Also, assuming that $n \geq 8R\log(2R/\beta)$, by the Chernoff bound, with probability at least $1-\beta$ (over partitioning $[n]$ into subsets $I_1,\dots,I_R$), for every $r \in [R]$ we have $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$. We continue the analysis assuming that this is the case and that Inequality (4) holds.

Event $E_2$ (over sampling $Z$ and the coins of the local randomizers): For every query $v^* \in V'$ there exists a subset $R_2^{v^*} \subseteq [R]$ of size $|R_2^{v^*}| \geq \frac{7}{8}R$ s.t. for every $r \in R_2^{v^*}$ it holds that

$$\big| R\cdot a_r(h_r(v^*)) - R\cdot|I^{v^*}_r| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR}. \tag{5}$$

For $v^* \in V'$ and $r \in [R]$, denote $c_r(v^*) = \sum_{j\in I_r} y_j\,Z[h_r(v^*), j]$, and recall that algorithm Hashtogram answers the query $v^*$ with $a(v^*) = R\cdot\frac{e^\epsilon+1}{e^\epsilon-1}\cdot\mathrm{Median}_{r\in[R]}\{c_r(v^*)\}$. Fix $v^* \in V'$ and $r \in [R]$.
We now analyze the expectation of $c_r(v^*)$:

$$\mathbb{E}[c_r(v^*)] = \sum_{j\in I_r}\mathbb{E}[y_j\,Z[h_r(v^*),j]] = \sum_{\substack{j\in I_r:\\ h_r(v_j)=h_r(v^*)}}\mathbb{E}[y_j\,Z[h_r(v^*),j]] + \sum_{\substack{j\in I_r:\\ h_r(v_j)\neq h_r(v^*)}}\mathbb{E}[y_j]\,\mathbb{E}[Z[h_r(v^*),j]] = |\{v \in S_r : h_r(v) = h_r(v^*)\}|\cdot\frac{e^\epsilon-1}{e^\epsilon+1} = |I^{v^*}_r|\cdot\frac{e^\epsilon-1}{e^\epsilon+1}.$$

That is, $c_r(v^*)$ can be expressed as a sum of independent $\pm1$ random variables: $|I^{v^*}_r|$ variables with expectation $\frac{e^\epsilon-1}{e^\epsilon+1}$, and $(|I_r|-|I^{v^*}_r|)$ variables with expectation 0 (recall that $\frac{n}{2R} \leq |I_r| \leq \frac{2n}{R}$). Using the Hoeffding bound, with probability at least $43/44$ we have $|c_r(v^*) - \frac{e^\epsilon-1}{e^\epsilon+1}|I^{v^*}_r|| \leq \sqrt{11n/R}$; that is,

$$\big| R\cdot a_r(h_r(v^*)) - R\cdot|I^{v^*}_r| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR}.$$

Fix $v^* \in V'$, and observe that the above sums are independent for different values of $r$. Hence, using the Chernoff bound and asserting that $R \geq 132\ln(d'/\beta)$, for that fixed $v^*$, with probability at least $1-\beta/d'$ we have that Inequality (5) holds for at least $7R/8$ choices of $r \in [R]$. Using the union bound, with probability at least $1-\beta$, this is true for every $v^* \in V'$ simultaneously. That is, event $E_2$ happens with probability at least $1-\beta$. We continue the analysis assuming that event $E_2$ occurs.

For every $v^* \in V'$, denote $R_3^{v^*} = R_1^{v^*}\cap R_2^{v^*}$. Combining event $E_2$ with Inequality (4), we get that for every $r \in R_2^{v^*}$,

$$\big| R\cdot a_r(h_r(v^*)) - |S_{r,v^*}| \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{3R\,|S_{r,v^*}|\log(2Rd'/\beta)}.$$

Recall that for every $v^* \in V'$ and every $r \in [R]$ we have $|S_{r,v^*}| \geq f_S(v^*)$. Furthermore, for every $v^* \in V'$ and every $r \in R_1^{v^*}$ we have $|S_{r,v^*}| \leq f_S(v^*) + \frac{16n}{T}$. Hence, for every $v^* \in V'$ and every $r \in R_3^{v^*}$ we have

$$\big| R\cdot a_r(h_r(v^*)) - f_S(v^*) \big| \leq \frac{e^\epsilon+1}{e^\epsilon-1}\sqrt{11nR} + \sqrt{3R\Big(f_S(v^*)+\frac{16n}{T}\Big)\log(2Rd'/\beta)} + \frac{16n}{T} \triangleq \mathrm{error}(v^*).$$

That is, for every $r \in R_3^{v^*}$ we have that $R\cdot a_r(h_r(v^*))$ is accurate up to $\mathrm{error}(v^*)$. As $|R_3^{v^*}| \geq \frac{3}{4}R$, and as algorithm Hashtogram answers $v^*$ with $a(v^*)$ chosen as the median of $\{R\cdot a_r(h_r(v^*))\}_{r\in[R]}$, we get that $|a(v^*) - f_S(v^*)| \leq \mathrm{error}(v^*)$ for every $v^* \in V'$.
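As an end-to-end sanity check of the estimator analyzed in this appendix, the following simulation sketch (ours, not the paper's implementation; the domain is capped at 100,000 and all parameter values are toy assumptions) runs Hashtogram's pipeline of hashing, one-bit randomized reports, debiasing, and the median over $R$ repetitions.

```python
import numpy as np

# End-to-end sanity check of the Hashtogram estimate (illustrative only).
rng = np.random.default_rng(4)
eps, n, R, T, D = 1.0, 50_000, 15, 512, 100_000   # D: toy domain size

v = np.minimum(rng.zipf(2.0, size=n), D - 1)      # skewed inputs; item 1 is heavy
h = rng.integers(0, T, size=(R, D))               # toy hash tables h_r : V -> [T]
group = rng.integers(0, R, size=n)                # random partition I_1..I_R
Z = rng.choice(np.array([-1, 1], dtype=np.int8),  # public matrix Z in {+-1}^(T x n)
               size=(T, n))

x = Z[h[group, v], np.arange(n)]                  # user j's bit Z[h_r(v_j), j]
flip = rng.random(n) < 1.0 / (np.exp(eps) + 1)
y = np.where(flip, -x, x)                         # 1-bit randomized reports

def a(query):
    """Median-of-R debiased estimate, as in step 3 of Hashtogram."""
    ests = []
    for r in range(R):
        idx = group == r
        c_r = np.sum(y[idx] * Z[h[r, query], idx])
        ests.append((np.exp(eps) + 1) / (np.exp(eps) - 1) * c_r)
    return R * np.median(ests)

for q in [1, 2, 3]:
    print(q, int(np.sum(v == q)), round(a(q)))    # true vs. estimated frequency
```

For heavy items the two printed counts should roughly agree, with deviations on the order of the $\mathrm{error}(v^*)$ bound derived above.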