# Federated Heavy Hitter Recovery under Linear Sketching

Adrià Gascón¹ Peter Kairouz¹ Ziteng Sun¹ Ananda Theertha Suresh¹

¹Google Research. Authorship is in alphabetical order. Correspondence to: Ziteng Sun.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Motivated by real-life deployments of multi-round federated analytics with secure aggregation, we investigate the fundamental communication-accuracy tradeoffs of the heavy hitter discovery and approximate (open-domain) histogram problems under a linear sketching constraint. We propose efficient algorithms based on local subsampling and invertible Bloom look-up tables (IBLTs). We also show that our algorithms are information-theoretically optimal for a broad class of interactive schemes. The results show that the linear sketching constraint does increase the communication cost for both tasks by introducing an extra linear dependence on the number of users in a round. Moreover, our results establish a separation between the communication cost of heavy hitter discovery and that of approximate histogram estimation in the multi-round setting: the dependence on the number of rounds R is at most logarithmic for heavy hitter discovery, whereas that of approximate histogram is Θ(√R). We also empirically demonstrate our findings.

1. Motivation

Collecting and aggregating user data drives improvements in the app and web ecosystems. For instance, learning popular out-of-dictionary words can improve the auto-complete feature in a smart keyboard, and discovering malicious URLs can enhance the security of a browser. However, sharing user data directly with a service provider introduces several privacy risks. It is thus desirable to only make aggregated data available to the service provider, rather than directly sharing (unanonymized) user data with them. This is typically achieved via multi-party cryptographic primitives, such as a secure vector summation protocol (Melis et al., 2016; Bonawitz et al., 2017; Corrigan-Gibbs et al., 2020; Bell et al., 2020). For instance, for closed-domain histogram applications, the users can first one-hot encode their data into a vector of length d (the size of the domain) and then participate in a secure vector summation protocol to make the aggregate histogram (but never the individual user contributions) available to the service provider.

Federated heavy hitters recovery. The above-mentioned solution requires Ω(d) communication. However, in many real-life applications the domain size is very large or even unknown a priori. For example, the set of new URLs can be represented via 8-bit character strings of length 100, and can thus take d = 256^100 values, which is clearly impossible to support in practice. In such settings, linear sketching is often used to reduce the communication load. (Linearity is necessary because non-linear compression/sketching schemes would not work under the secure vector summation primitive, which only makes the sum of client-held vectors available to the server.) For example, Melis et al. use secure count-min sketch aggregation for privacy-preserving training of recommender systems, and Corrigan-Gibbs & Boneh rely on count-min sketches for gathering browser statistics, i.e., aggregate histogram queries. Similarly, Hu et al. rely on secure aggregation of variants of Flajolet-Martin sketches for distributed cardinality estimation. Boneh et al. use sketching to reduce the cost of distributed subset-histogram queries. In the work closest to ours, Chen et al. show that count-sketches can be used to recover the heavy hitter items (i.e.,
frequently appearing items) while reducing the communication overhead. All these protocols operate in the single-round setting.

Sketching in multi-round aggregation schemes. Even though count-sketches are a great step towards solving the heavy hitters problem, this approach has only been analyzed in the single-round data aggregation setting. However, most commonly deployed systems for federated analytics employ multi-round schemes for data aggregation (Bonawitz et al., 2019). This is primarily because (a) not all users are available around the same time, (b) the population may be very large (in the billions of devices) and therefore the server has to aggregate data over batches for bandwidth/compute reasons, and (c) running the cryptographic secure vector summation protocol has compute and communication costs that are super-linear in the number of users we are aggregating over (Bell et al., 2020; Bonawitz et al., 2017). Further, count-sketch based approaches have a decoding runtime that is linear in d, which is infeasible in the open-domain setting, and improving it to log d involves blowing up the communication cost by the same factor.

Our contributions. Our paper thus takes a principled approach towards uncovering the fundamental accuracy-communication tradeoffs of the heavy hitters recovery problem under the linearity constraints imposed by secure vector summation protocols. We show that linearity constraints increase the per-user communication complexity. For a fixed total number of users, as the number of rounds increases, the required communication decreases due to a less stringent linearity constraint. Moreover, surprisingly, we show that count-sketches are strictly sub-optimal for this application, and we develop a novel provably optimal approach that combines client-side (local) subsampling with invertible Bloom lookup tables (IBLTs). Roughly speaking, we show (via lower bounds) that in the R-round case, any approach that solves an approximate histogram problem (with additive error) will incur a √R factor penalty in the communication cost, while our optimal approach incurs only a log(R) factor. Hence, even non-trivial modifications of count-sketches are strictly sub-optimal. We also empirically evaluate our proposed algorithms and compare them with count-sketch baselines. A significant advantage of our algorithm is observed, especially when R is large. In the setting of Figure 1, to achieve an F1 score of 0.8, we see a 10x improvement in communication compared to the baseline using count-sketch.

Organization. We formally define the problem in Section 2 and then discuss our results in Section 3. Algorithms for heavy hitter recovery and approximate histogram are presented in Section 4 and Section 5, respectively. We discuss a practical modification of our algorithm in Section 6 and present the experimental results in Section 7.

2. Problem setup and preliminaries

We consider heavy hitter discovery in the distributed setting with multi-round communication between the users and a central server. Suppose users come in R rounds. In round r ∈ [R], there are n users, denoted by the set B_r. We assume the sets are pairwise disjoint, i.e., for r ≠ r′, B_r ∩ B_{r′} = ∅. Each user i ∈ B_r contributes m_i samples, with a contribution bound m_i ≤ m, from a finite domain X of size d.
Let h_i denote user i's local histogram, where for each x ∈ X, h_i(x) is the number of times element x appears in user i's local samples. By assumption, we have ∥h_i∥₁ = m_i ≤ m. Let h^(r) be the aggregated histogram in round r, i.e.,

∀x ∈ X: h^(r)(x) = Σ_{i∈B_r} h_i(x).

The aggregated histogram across all R rounds is denoted by h^[R], where

∀x ∈ X: h^[R](x) = Σ_{r∈[R]} h^(r)(x).

The total number of users is denoted by N := nR. We focus on cases where d ≫ Nm, i.e., the case where the support is large and the data is sparse. The goal of the server is to learn useful information about the aggregated histogram h^[R]. More precisely, we consider the two tasks described below.

τ-heavy hitter (Approx HH). For a given threshold τ, the goal of τ-heavy hitter recovery on the entire data stream is to return a set H such that, with probability 1 − β,

1. if h^[R](x) ≥ τ, then x ∈ H;
2. if h^[R](x) ≤ τ/10, then x ∉ H.

τ-approximate histogram (Approx Hist). The goal is to return an approximate histogram ĥ^[R] such that, with probability 1 − β, for all x ∈ X, |ĥ^[R](x) − h^[R](x)| ≤ τ.

It can be seen that τ/3-approximate histogram is a harder problem than τ-heavy hitter (HH), since a τ/3-approximate histogram yields a set of approximate heavy hitters by returning H as the list of elements with approximate frequency more than τ/2. Previous work often solves Approx HH by reducing it to Approx Hist (e.g., Chen et al. (2022)). However, as we show in this paper, there is a separation between the two tasks in terms of the communication complexity in the multi-round setting.

Efficient decoding. Since d ≫ Nm, we require efficient encoding (run by users) and decoding (run by the server). More precisely, the encoding/decoding time should be polynomial in N, m, R, log d, log(1/β) and other parameters.

Per-user communication complexity. We focus on distributed settings where each user has limited uplink communication capacity. In particular, each user must compress their local histogram h_i into a message of bit length ℓ, denoted by Y_i, and the server must solve the above tasks based on the received messages. The communication complexity of each task is the smallest bit length ℓ such that there exists a communication protocol that solves the task.

| Task / Setting | 1-round Lin Sketch | R-round Lin Sketch | Without Lin Sketch (R = N) |
|---|---|---|---|
| τ-Approx HH | Θ̃(mN/τ) | Θ̃(mn/τ) | Θ̃(m/τ) |
| τ-Approx Hist | Θ̃(mN/τ) | Θ̃(min{mn√R/τ, mn}) | Θ̃(min{m√N/τ, m}) |

Table 1. Per-user communication complexity. All described bounds can be achieved by a non-interactive protocol with server runtime poly(m, n, R, log(d), log(1/β)). Recall N = nR denotes the total number of users. None of the bounds can be improved by more than logarithmic factors, even under interactive protocols. Θ̃ omits factors that are logarithmic in τ, R, m and N.

Distributed estimation with linear sketching (Lin Sketch). An even more stringent communication model is the linear summation model. In each round r, each user i ∈ B_r can only send a message Y_i from a finite ring G_r, computed from their local histogram and shared randomness U: for all i ∈ B_r, Y_i = f_i(h_i, U). Under the linear aggregation model, the server only observes

Y^(r) = Σ_{i∈B_r} Y_i,

where the addition is the additive operation of the ring G_r and, by definition, Y^(r) ∈ G_r. The reason we restrict ourselves to a finite ring is compatibility with cryptographic protocols for secure vector summation (Bonawitz et al., 2017; Bell et al., 2020), which operate over a finite space. These protocols ensure that any additional information observed by the server beyond Y^(r) can in fact be simulated given Y^(r), under standard cryptographic assumptions. As mentioned above, we abstract away the specifics of the underlying protocol and assume that the server observes exactly Y^(r).
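To make the Lin Sketch abstraction concrete, the following minimal Python sketch (with hypothetical parameters d, ℓ, q and a dense random map A standing in for the shared randomness U) shows clients sending linear sketches in Z_q^ℓ and a server that observes only their modular sum. By linearity, that sum equals the sketch of the round's aggregated histogram. This illustrates the communication model only; it is not the paper's actual encoder.

```python
import numpy as np

# Every client applies the SAME random linear map A (derived from the shared
# randomness U) to its local histogram and reduces mod q; the server only
# observes the entry-wise sum of the client messages, which by linearity
# equals A @ (sum of local histograms) mod q.

d, ell, q = 10_000, 64, 2**16          # domain size, sketch length, ring Z_q (hypothetical values)
rng_shared = np.random.default_rng(0)  # stands in for the shared randomness U
A = rng_shared.integers(0, q, size=(ell, d), dtype=np.int64)

def client_message(local_hist: np.ndarray) -> np.ndarray:
    """Linear sketch of one client's local histogram, an element of Z_q^ell."""
    return (A @ local_hist) % q

def secure_sum(messages: list[np.ndarray]) -> np.ndarray:
    """What the server observes: only the sum of the messages in Z_q^ell."""
    return np.sum(messages, axis=0) % q

# Toy round with n = 3 clients holding sparse local histograms.
clients = []
for seed in range(3):
    h = np.zeros(d, dtype=np.int64)
    idx = np.random.default_rng(seed).integers(0, d, size=5)
    np.add.at(h, idx, 1)
    clients.append(h)

Y_round = secure_sum([client_message(h) for h in clients])
# Linearity: the aggregate message equals the sketch of the aggregate histogram.
assert np.array_equal(Y_round, client_message(np.sum(clients, axis=0)))
```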
For vector summation, it is convenient to think of G_r as Z_{q_r}^ℓ, i.e., length-ℓ vectors with integer entries mod q_r (we might choose q_r to be prime when we require division, e.g., in the IBLT construction). If the protocol is interactive, then for i ∈ B_r, Y_i is allowed to depend on Y^(1), ..., Y^(r−1); in this case, each f_i is a function of Y^(1), ..., Y^(r−1). If the protocol is non-interactive, the f_i's must be fixed independently of previous messages. The server then must recover the heavy hitters (and their frequencies) based on the transcript of the protocol, denoted by Π = (Y^(1), ..., Y^(R), U).

2.1. Connection to other constrained settings

Below we discuss the connection between our setting and other popular constrained settings, including the streaming setting and the general communication-constrained setting.

Connection to the streaming setting. When R = 1, the setting is similar to the streaming setting (e.g., in Cormode & Muthukrishnan (2005)), since both require that all information about the dataset be compressed into a small state. One important difference is that in the distributed setting, the local data is processed independently at each user, and only linear operations on the state are permitted due to the linear aggregation operation. For the R-round case, our setting is different since the server observes R states, each with bit length at most ℓ. These states cannot be viewed as a single mega-state with bit length R·ℓ, since there is a further restriction that each sub-state (corresponding to the aggregated message observed in a round) can only contain information about the data in the corresponding round rather than the entire data stream. Another naive way to reduce the problem to the streaming setting is to sum over all R states and obtain a single state with ℓ bits. However, our result implies that this reduction is strictly sub-optimal (see Table 1). To the best of our knowledge, similar settings have not been studied in either the federated analytics literature or the streaming literature.

Connection to distributed estimation without linear sketching. The general communication-constrained setting where linear sketching is not enforced can be viewed as a special case of the proposed framework with n = 1 and R = N, since in this case the linear aggregation is performed only over one user's message and is hence trivial. We list comparisons to these settings for the considered tasks in Table 1.

3. Results and technique

We consider both approximate heavy hitter recovery and approximate histogram estimation in the linear aggregation model. We establish tight (up to logarithmic factors) communication complexity bounds for both tasks in the single-round and multi-round settings. The results are summarized in Table 1. Our results have the following interesting implications for the communication complexity of these problems.

Linear aggregation increases the communication cost. As shown in Table 1, under Lin Sketch, for both tasks the per-user communication incurs a linear dependence on n = N/R, the number of users in each round.
On the other hand, without the linear aggregation constraint, there is no linear dependence on n, since each user can simply send their m local samples losslessly using O(m log d) bits. The result establishes the fundamental cost of linear-aggregation communication protocols for heavy hitter recovery.

Approx Hist is harder than Approx HH. As mentioned before, a natural way to obtain heavy hitters is to compute an approximate histogram and apply a suitable threshold to select the heavy elements. In the single-round case, there is at most a logarithmic gap between the communication complexities of the two problems. In the R-round case, however, our results show that this reduction is strictly sub-optimal. More precisely, the communication cost for τ-Approx Hist increases by a factor of √R, while that of τ-Approx HH depends at most logarithmically on R. This implies a gap between the per-user communication costs of τ-Approx Hist and τ-Approx HH in the multi-round case.

The impact of R. With a fixed total number of users N, our results show that the per-user communication complexity decreases as R increases. This is because, as R increases, the linearity constraint is imposed over a smaller group of users of size N/R, and is hence less stringent. However, this also comes at the cost that the privacy implication of aggregation becomes weaker.

3.1. Our technique - IBLT with local subsampling

As discussed above, when solving the approximate heavy hitter problem in the multi-round setting, algorithms that rely on obtaining an approximate histogram and thresholding do not give the optimal communication complexity. In this paper, we propose to use invertible Bloom lookup tables (IBLTs) (Goodrich & Mitzenmacher, 2011) together with local subsampling. At a high level, the IBLT is a Bloom filter-type linear data structure that supports efficient listing of the inserted elements and their exact counts. The size of the table scales linearly with the number of unique keys inserted. To reduce the communication cost, we perform local threshold sampling (Duffield et al., 2005a) on users' local datasets. This guarantees that light elements are discarded with high probability and hence do not take up the capacity of the IBLT data structure. Compared to frequency-oracle based approaches, the noise introduced by our subsampling-based approach for each item is proportional to its cumulative count, which gives better estimates for elements with frequencies near the threshold. For elements with counts well above the threshold, the frequency estimate has a larger error, but this does not affect heavy hitter recovery, since only whether the count is above τ matters for our problem. See the details of the algorithm in Section 4.

3.2. Related work

Linear dimensionality reduction techniques for frequency estimation and heavy hitter recovery have been widely studied to reduce storage or communication cost, e.g., count-sketch and count-min sketch (Charikar et al., 2002; Cormode & Muthukrishnan, 2005; Donoho, 2006; Minton & Price, 2014), and efficient decoding techniques have also been proposed (Cormode & Muthukrishnan, 2006; Gilbert et al., 2010). The closest to our work is the independent work of Chen et al. (2022), which studies approximate histogram estimation under a linear sketching constraint. That work also establishes a gap between communication complexities with and without secure aggregation. However, their result is in a more restricted setting of m = 1 and R = 1.
Moreover, our algorithm also has the advantage of being computationally efficient (its runtime depends only logarithmically on d), which is important for applications with a large support but sparse data. Their work also considers algorithms with distributed differential privacy guarantees, which we leave as an interesting future direction.

4. Approximate heavy hitter under linear aggregation

In this section, we study the approximate heavy hitter problem and show that it can be solved with per-user communication complexity Õ((mn/τ) log d), as stated in Theorem 4.1. A natural comparison is the heavy hitter recovery algorithm obtained from a frequency oracle with accuracy Θ(τ). Since there are R rounds, the naive approach would require an accuracy of Θ(τ/R) in each round, and classic methods such as count-min and count-sketch would then require a per-user communication complexity of Θ(mnR/τ). In fact, as we show in Theorem 5.2, any frequency-oracle based approach requires a per-user communication complexity of at least Ω(mn√R/τ). Our result improves upon these and shows that the dependence on R is at most logarithmic.

Theorem 4.1. There exists a non-interactive linear sketching protocol with communication cost Õ((mn/τ) log d) bits per user which solves the τ-approximate heavy hitter problem, where Õ hides factors logarithmic in m, n, R, τ and 1/β. Moreover, the running time of the algorithm is polynomial in m, n, R, log d and log(1/β).

The next theorem shows that the above communication complexity is minimax optimal up to logarithmic factors.

Theorem 4.2. For any τ and any interactive linear sketching protocol A with per-user communication cost o(mn/τ), there exists a dataset {h_i}, i ∈ B_r, r ∈ [R], such that A cannot solve the τ-heavy hitter (HH) problem with success probability at least 4/5.

We next present the protocol that achieves Theorem 4.1; the proof of the lower bound (Theorem 4.2) is discussed in Appendix D.1. At a high level, the protocol relies on two main components: (i) a probabilistic data structure called the invertible Bloom lookup table (IBLT), introduced by Goodrich & Mitzenmacher, and (ii) local subsampling. We start by introducing IBLTs, starting from the more standard (counting) Bloom filters.

IBLT: Bloom filters with efficient listing. Note that each user's local histogram h_i can be viewed as a sequence of key-value pairs (x, h_i(x)). The Bloom filter is a standard linear data structure for representing a set of key-value pairs with keys coming from a large domain. The IBLT is a version of the Bloom filter that supports an efficient listing operation while preserving the other nice properties of counting Bloom filters, namely linearity (and thus mergeability by summation) and succinctness (size linear in the number of keys it holds). (In our algorithm, the IBLT could be replaced by any data structure with these properties.) These properties are summarized in the following lemma.

Lemma 4.3 ((Goodrich & Mitzenmacher, 2011)). Consider a collection of local histograms (h_i)_{i∈[n]} over [d] such that Σ_{i∈[n]} ∥h_i∥₀ ≤ L₀. For any γ > 0, there exist local linear sketches {f_i}_{i∈[n]} of length ℓ = O(γL₀) and an O(ℓ)-time decoding procedure Dec(·) such that

Dec(Σ_{i∈[n]} f_i(h_i)) = Σ_{i∈[n]} h_i

succeeds except with probability at most O(L₀^{2−γ}).

For the purpose of this paper, we can focus on the two main operations supported by an IBLT instance B (see (Goodrich & Mitzenmacher, 2011) for details on deletions and lookups):

- Insert(k, v), which inserts the pair (k, v) into B.
- ListEntries(), which enumerates the set of key-value pairs in B.
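Below is a minimal, self-contained Python sketch of an IBLT in the spirit of the operations above. The cell layout (keySum, checkSum, valueSum, count), the SHA-256-based hashing, and all parameters are illustrative choices for exposition, not the exact construction of Goodrich & Mitzenmacher (2011). The peeling step also handles the same key being inserted by several clients, in the spirit of the duplicate-handling variant discussed in Appendix C.

```python
import hashlib

class SimpleIBLT:
    """Minimal invertible Bloom lookup table: linear (mergeable by cell-wise
    addition) and peelable to list the inserted (key, value) pairs."""

    def __init__(self, num_cells: int, num_hashes: int = 3, seed: int = 0):
        assert num_cells % num_hashes == 0
        self.m, self.k, self.seed = num_cells, num_hashes, seed
        # Each cell stores (key_sum, check_sum, value_sum, count).
        self.cells = [[0, 0, 0, 0] for _ in range(num_cells)]

    def _positions(self, key: int):
        # One sub-table per hash function, so the k positions never collide.
        sub = self.m // self.k
        out = []
        for j in range(self.k):
            h = int(hashlib.sha256(f"{self.seed}-{j}-{key}".encode()).hexdigest(), 16)
            out.append(j * sub + h % sub)
        return out

    def _check(self, key: int) -> int:
        # Fingerprint used to recognize cells that hold copies of a single key.
        return int(hashlib.sha256(f"chk-{self.seed}-{key}".encode()).hexdigest(), 16) % (1 << 32)

    def insert(self, key: int, value: int):
        for pos in self._positions(key):
            cell = self.cells[pos]
            cell[0] += key
            cell[1] += self._check(key)
            cell[2] += value
            cell[3] += 1

    def merge(self, other: "SimpleIBLT"):
        # Linearity: the aggregate IBLT is the cell-wise sum of client IBLTs.
        for c, o in zip(self.cells, other.cells):
            for t in range(4):
                c[t] += o[t]

    def list_entries(self):
        """Peel cells holding copies of a single key; returns (entries, fully_decoded)."""
        entries = {}
        progress = True
        while progress:
            progress = False
            for cell in self.cells:
                key_sum, chk_sum, val_sum, cnt = cell
                if cnt <= 0 or key_sum % cnt != 0:
                    continue
                key = key_sum // cnt
                if self._check(key) * cnt != chk_sum:
                    continue
                entries[key] = entries.get(key, 0) + val_sum
                # Remove all cnt copies of this key from its k positions.
                for pos in self._positions(key):
                    c = self.cells[pos]
                    c[0] -= key_sum
                    c[1] -= chk_sum
                    c[2] -= val_sum
                    c[3] -= cnt
                progress = True
        decoded = all(c[3] == 0 for c in self.cells)
        return entries, decoded

# Two clients encode their local histograms and the server decodes the sum.
a, b = SimpleIBLT(num_cells=48), SimpleIBLT(num_cells=48)
a.insert(101, 3); a.insert(7, 1)
b.insert(101, 2); b.insert(55, 4)
a.merge(b)
print(a.list_entries())  # all three keys with their aggregated values, and decoded == True
```

Merging by cell-wise addition is what makes the structure compatible with secure vector summation: each client can send its own table, and the server decodes only the sum.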
Note that f_i(h_i) in Lemma 4.3 corresponds to the IBLT B_i resulting from inserting the set {(x, h_i(x)) | h_i(x) > 0} into an empty IBLT. Also, ListEntries() corresponds to Dec in Lemma 4.3. Finally, Σ_{i∈[n]} f_i(h_i) corresponds to the encoding of the IBLT resulting from inserting the set {(x, Σ_{i∈[n]} h_i(x)) | ∃i ∈ [n]: h_i(x) > 0} into an empty IBLT. In other words, each client i ∈ [n] computes a local IBLT B_i := f_i(h_i), and the (secure) aggregation of the B_i's results in the global IBLT B := Σ_{i∈[n]} f_i(h_i). Further details on IBLTs are given in Appendix C.

Reducing capacity via threshold sampling. The second tool in our main protocol is threshold sampling. Note that the guarantee in Lemma 4.3 relies on the number of unique elements in Σ_{i∈[n]} h_i, which can be as large as mn in the worst case, leading to an O(mn) cost that does not match our lower bound in Theorem 4.2. For heavy hitter recovery, we reduce the communication cost by local subsampling. More precisely, we use the threshold sampling algorithm from (Duffield et al., 2005b), detailed in Algorithm 1, to achieve the (optimal) dependence O(mn/τ).

Remark 4.4. Threshold sampling can be replaced by any unbiased local subsampling method that offers sparsity, e.g., binomial sampling where h′(x) ∼ Binomial(h(x), p) for some p ∈ (0, 1), and a similar theoretical guarantee will hold. In this work, we choose threshold sampling because it minimizes the total variance of h′ under an expected sparsity constraint (see Duffield et al. (2005b) for details).

Algorithm 1 Threshold sampling
1: Input: h: local histogram; t ∈ R₊: threshold.
2: for x ∈ supp(h) do
3:   if h(x) ≥ t then
4:     h′(x) = h(x).
5:   else
6:     h′(x) = t with probability h(x)/t, and 0 otherwise.
7:   end if
8: end for
9: Return: h′.

The protocol that achieves the desired communication complexity in Theorem 4.1 is detailed in Algorithm 2. The algorithm can be viewed as b := 20 log(40mnR/(τβ)) independent runs of a basic protocol, each of which returns a list H_j of potential heavy hitters; the repetition boosts the success probability. In each basic protocol, users first apply Algorithm 1 to subsample their data, which reduces the number of unique elements while maintaining the heavy hitters upon aggregation. Then each user encodes their subsampled histogram using an IBLT, whose aggregation is sent to the server for decoding. Since the number of unique elements is reduced through subsampling, the decoding of the aggregated IBLT succeeds with high probability, hence recovering the aggregation of the subsampled local histograms. The detailed proof of Theorem 4.1 is presented in Appendix A.

Algorithm 2 Subsampled IBLT with Lin Sketch
1: Input: {h_i}_{i∈B_r, r∈[R]}: local histograms; d: alphabet size; R: number of rounds; m: per-user contribution bound; n: number of users per round; τ: threshold for heavy hitter recovery; β: failure probability.
2: Let t = max{τ/2, 1}, b = 10 log(4mnR/(τβ)), L₀ = 20(mn/τ) log R, and γ = log R.
3: for r ∈ [R] do
4:   for j ∈ [b] do
5:     Each user i ∈ B_r applies Algorithm 1 with threshold t to their local histogram with fresh randomness to get h′_{i,j}.
6:     Each user sends message Y_{i,j} = f_{i,j}(h′_{i,j}), where the f_{i,j}'s are mappings from Lemma 4.3 with parameters L₀, γ and fresh randomness.
7:     Server observes Σ_{i∈B_r} Y_{i,j} and computes ĥ_{r,j} = Dec(Σ_{i∈B_r} Y_{i,j}). If the decoding is not successful, set ĥ_{r,j} to be the all-zero vector.
8:   end for
9: end for
10: for j ∈ [b] do
11:   Server computes ĥ^[R]_j = Σ_{r∈[R]} ĥ_{r,j} and obtains the list H_j = {x ∈ [d] | ĥ^[R]_j(x) > 0}.
12: end for
13: Return: H = {x : Σ_{j∈[b]} 1{x ∈ H_j} ≥ b/2}.
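The subsampling step of Algorithm 1 is equally simple to state in code. The snippet below is a small illustrative implementation on a toy local histogram (the data and threshold are hypothetical); it checks the two properties the analysis relies on: the sampled histogram is unbiased for every element, and most elements with small counts are dropped, so few unique keys survive to occupy IBLT capacity.

```python
import random

def threshold_sample(hist: dict[str, float], t: float) -> dict[str, float]:
    """Threshold sampling (Algorithm 1): counts >= t are kept exactly; a count
    c < t is rounded up to t with probability c / t and dropped otherwise, so
    E[h'(x)] = h(x) for every x while most light elements are discarded."""
    out = {}
    for x, c in hist.items():
        if c >= t:
            out[x] = c
        elif random.random() < c / t:
            out[x] = t          # note: the sampled value may be fractional
    return out

# Quick check of unbiasedness and sparsity on a toy local histogram
# (hypothetical data; t = tau/2 = 25 as in Algorithm 2 with tau = 50).
random.seed(0)
local = {"heavy": 60, "warm": 20, **{f"tail{i}": 1 for i in range(200)}}
t = 25.0
trials = 2000
mean_warm = sum(threshold_sample(local, t).get("warm", 0) for _ in range(trials)) / trials
kept = len(threshold_sample(local, t))
print(f"E[h'(warm)] ~ {mean_warm:.1f} (true value 20); unique elements kept: {kept} out of {len(local)}")
```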
5. Approximate histogram under linear aggregation

In this section, we study the task of obtaining an approximate histogram in the multi-round linear aggregation model. The first observation we make is that, using Algorithm 2 with threshold τ, we are able to return a list H of heavy hitters such that, with high probability, the list contains all x's with frequency more than τ and no tail elements. The approximate histogram algorithm builds on this and further asks each user to send a linear sketch of their unsampled local data alongside the IBLT data structures of Algorithm 2. The server then uses the aggregation of these linear sketches as a frequency oracle to estimate the frequency of the elements in H.

The above protocol leads to near-optimal performance in the single-round case. However, the R-round case is trickier, since the error builds up across all R rounds and a naive application of the sketching algorithm leads to an error that depends linearly on R. This can be avoided by carefully designing the correlation among the hash functions across the R rounds, and we show that the dependence on R can be reduced to √R. We further show that the √R dependence is in fact optimal by proving a matching lower bound, stated in Theorem 5.2.

At a high level, to improve the dependence on R, we use count-sketches where the location hashes are fixed across rounds while the sign hashes are generated with fresh randomness. The details of the algorithm are described in Algorithm 3. The proof follows from the guarantee in Theorem 4.1 and a standard analysis of the count-sketch algorithm. We defer the complete proof to Appendix B.

Theorem 5.1. In the R-round setting, there exists a linear aggregation protocol with communication cost Õ(min{mn√R/τ, mn}) per user, which solves the τ-approximate histogram problem. Moreover, the running time of the algorithm is Õ(min{mn√R/τ, mn}).

Lower bound for Approx Hist. We prove the following lower bound for Approx Hist, which shows that the bound in Theorem 5.1 is tight up to logarithmic factors, establishing the separation between the communication complexities of Approx HH and Approx Hist.

Theorem 5.2. For any τ and any R-round Approx Hist protocol with per-user communication cost o(min{mn√R/τ, mn}), there exists a dataset {h_i}_{i∈B_r, r∈[R]} such that the protocol cannot solve the τ-approximate histogram problem with error probability at most 1/5.

Algorithm 3 R-round Approx Hist with Lin Sketch
1: Input: {h_i}_{i∈B_r, r∈[R]}: local histograms; d: alphabet size; R: number of rounds; m: per-user contribution bound; n: number of users per round; τ: error for approximate histogram; β: failure probability.
2: if τ ≤ √R then
3:   Users implement Algorithm 2 with τ = 1 and return the histogram obtained in Line 11 of Algorithm 2.
4: end if
5: Let w = 10mn√R/τ and b = log(4mnR/(τβ)).
6: Draw the same set of location hash functions {g_j : [d] → [w]}_{j∈[b]} for all rounds, and independent sets of sign hashes {s_{j,r} : [d] → {±1}}_{j∈[b], r∈[R]} across rounds.
7: for r ∈ [R] do
8:   (In parallel) Each user i ∈ B_r implements the protocol in Algorithm 2 and sends messages Y_i.
9:   (In parallel) Each user i ∈ B_r encodes, for j ∈ [b] and k ∈ [w], T_i(j, k) = Σ_x 1{g_j(x) = k} · s_{j,r}(x) · h_i(x).
10: end for
11: Server obtains a list H of heavy hitters from the messages Y_i.
12: Server obtains, for r ∈ [R], T^(r) = Σ_{i∈B_r} T_i and constructs ĥ, where for x ∈ H, ĥ(x) = Median_{j∈[b]} { Σ_{r∈[R]} T^(r)(j, g_j(x)) · s_{j,r}(x) }, and for x ∉ H, ĥ(x) = 0.
13: Return ĥ.
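The following minimal Python sketch illustrates the count-sketch construction used in Algorithm 3: the location hashes g_j are shared across rounds while the sign hashes s_{j,r} are drawn fresh in every round, and the server takes a median over j of the across-round sums. The hash construction, parameters, and toy data are hypothetical placeholders, and the heavy hitter list from Algorithm 2 is assumed to be obtained separately.

```python
import hashlib
import statistics

def _hash(tag: str, x: str, mod: int) -> int:
    return int(hashlib.sha256(f"{tag}|{x}".encode()).hexdigest(), 16) % mod

def location(j: int, x: str, w: int) -> int:
    # Location hash g_j: shared by ALL rounds (no round index in the tag).
    return _hash(f"loc-{j}", x, w)

def sign(j: int, r: int, x: str) -> int:
    # Sign hash s_{j,r}: fresh randomness in every round r.
    return 1 if _hash(f"sign-{j}-{r}", x, 2) == 0 else -1

def encode_round(hist: dict[str, int], r: int, b: int, w: int):
    """Count-sketch T^(r) of one round's aggregated histogram (b x w table)."""
    T = [[0] * w for _ in range(b)]
    for x, c in hist.items():
        for j in range(b):
            T[j][location(j, x, w)] += sign(j, r, x) * c
    return T

def estimate(x: str, sketches: list, b: int, w: int) -> float:
    """Median over j of the across-round sums, as in Line 12 of Algorithm 3."""
    return statistics.median(
        sum(T[j][location(j, x, w)] * sign(j, r, x) for r, T in enumerate(sketches))
        for j in range(b)
    )

# Toy run: R rounds of hypothetical per-round histograms over a small domain.
R, b, w = 30, 9, 64
rounds = [{"popular": 40, **{f"noise{r}-{i}": 1 for i in range(50)}} for r in range(R)]
sketches = [encode_round(rounds[r], r, b, w) for r in range(R)]
true = sum(h.get("popular", 0) for h in rounds)
print("true:", true, "estimate:", estimate("popular", sketches, b, w))
```

Because the sign of each colliding item is independent across rounds, the accumulated collision noise behaves like a sum of R independent zero-mean terms and therefore grows like √R rather than R, which is the source of the √R dependence in Theorem 5.1.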
6. Practical adaptive tuning for instance-specific bounds

In practical scenarios, the per-user communication cost ℓ is often determined by system constraints (e.g., delay tolerance, bandwidth constraints), and the goal is to recover heavy hitters with as small a threshold τ as possible under a fixed communication cost ℓ_max. As we have shown in Theorem 4.2, in the worst case we can only reliably recover heavy hitters with frequency at least Ω(mn/ℓ_max). However, since successful decoding of an IBLT only requires the number of unique elements in a round to be small, it is possible to obtain better instance-specific bounds when the users' data is more concentrated on heavy elements.

When interactivity across rounds is allowed, we give an adaptive tuning algorithm for the subsampling parameter. The details of the algorithm are described in Algorithm 4. At a high level, our algorithm is based on an estimate of Σ_{i∈B_r} ∥h′_i∥₀, where the h′_i's are the subsampled histograms. When the decoding is successful, we can compute Σ_{i∈B_r} ∥h′_i∥₀ exactly from the recovered histogram. When the decoding is not successful, we rely on an analysis based on the core size of the random hypergraph (Molloy, 2005) induced by the hashing process to get an estimate of Σ_{i∈B_r} ∥h′_i∥₀. We discuss this in detail in Appendix C. Under the assumption that, for a fixed subsampling parameter t, Σ_{i∈B_r} ∥h′_i∥₀ is relatively stable across rounds, we can then increase or decrease t based on past estimates of the data process. We empirically demonstrate the effectiveness of our tuning procedure, and we leave proving rigorous guarantees for the adaptive tuning algorithm as an interesting future direction.

Algorithm 4 Adaptive subsampled IBLT
Input: communication budget C; number of users per round n; user contribution bound m; Update: a tuning function that updates the subsampling parameter based on past observations.
1: Set t_0 = Θ(nm/C).
2: for r = 0, 1, 2, . . . , R do
3:   Each user i ∈ B_r applies Algorithm 1 with threshold t_r to their local histogram with fresh randomness to get h′_i.
4:   Each user sends message Y_i = f_i(h′_i), where the f_i's are mappings from Lemma 4.3 with parameters L₀, γ and fresh randomness.
5:   Server observes Σ_{i∈B_r} Y_i and computes ĥ_r = Dec(Σ_{i∈B_r} Y_i). If the decoding is not successful, set ĥ_r to be the all-zero vector.
6:   if the decoding is successful then
7:     Set ŝ_r = ∥ĥ_r∥₀.
8:   else
9:     Get an estimate ŝ_r of Σ_{i∈B_r} ∥h′_i∥₀ based on Σ_{i∈B_r} Y_i using (3) and (4) (Appendix C).
10:  end if
11:  Set t_{r+1} = Update(t_r, C, ŝ_r).
12: end for
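The sketch below illustrates the adaptive loop of Algorithm 4 on toy data. The specific Update rule used here (move t toward the value that would make the number of surviving unique keys match the IBLT capacity, smoothed with the current value) is an assumption for illustration only, not the paper's exact rule, and the simulated survivor count stands in for the estimate ŝ_r that Algorithm 4 obtains from the decoded or saturated IBLT.

```python
import random

def update_threshold(t: float, capacity: float, s_hat: float) -> float:
    """One plausible Update rule (an assumption, not the paper's exact choice):
    scale t toward t * s_hat / capacity, so that roughly `capacity` unique
    keys survive in the next round, smoothed with the current value of t."""
    return max(1.0, 0.5 * t + 0.5 * t * s_hat / capacity)

def surviving_uniques(counts, t: float) -> int:
    """Unique keys that survive threshold sampling (Algorithm 1) at threshold t.
    This stands in for the estimate s_hat_r that Algorithm 4 reads off the
    decoded IBLT, or off a saturated IBLT via the hypergraph-core estimate."""
    return sum(1 if c >= t else (random.random() < c / t) for c in counts)

# Toy data that is stable across rounds: 3000 light keys and 100 heavy keys,
# with an IBLT capacity of L0 = 500 unique keys per round (all hypothetical).
random.seed(1)
counts = [1] * 3000 + [50] * 100
L0, t = 500, 1.0
for r in range(10):
    s_hat = surviving_uniques(counts, t)
    t = update_threshold(t, L0, s_hat)
    print(f"round {r}: surviving uniques ~ {s_hat}, next threshold t = {t:.2f}")
```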
7. Experiments

In this section, we empirically evaluate our proposed algorithms (Algorithms 2 and 4) for the task of heavy hitter recovery and compare them with baseline methods, including (1) a count-sketch based method and (2) an IBLT-based method without subsampling (Algorithm 2 with τ = 1). We measure communication cost in units of words (denoted by C), where each word is an int16 object (communicated with 2 bytes) in our Python and C++ implementations. We mainly focus on string data with characters from a set ROOT consisting of lower-case letters, digits and special symbols in { @ # ; : . / }. Our code is available at https://github.com/google-research/federated. Below are the details of our implementation.

Count-median sketch. We use H hash functions, each with domain size W, and the total communication cost is C = H·W words. (In our experiments, mn is between 1,000 and 10,000, and hence one word is enough to store an entry of the sketch.) In the R-round setting, for each round r we loop over all x ∈ X and compute an estimate ĥ_r(x), and the recovered heavy hitters are those with Σ_r ĥ_r(x) ≥ τ. Note that in the open-domain setting d = |X| can be large and this decoding procedure can be computationally infeasible. There are more computationally feasible variants, including tree-based decoding, but these come at the cost of higher communication or lower utility. We stick to the described version in this work and show that our proposed algorithms outperform even this computationally expensive version; the advantage will be at least as large when comparing to the more computationally feasible variants.

IBLT-based method. In our experiments, each IBLT data structure is of size 8L₀, where L₀ is the targeted capacity of the IBLT. We give more details about how the size is computed in Appendix C. We consider fixed subsampling and adaptive subsampling. For fixed subsampling, when the maximum number of items in each round is upper bounded by M_max, we set the subsampling parameter in Algorithm 1 to t = max{1, min{M_max/L₀, τ/2}}. In practice, M_max can be obtained from system parameters such as the number of users in a round and the maximum contribution of a single user. Setting t ≤ τ/2 guarantees that the heavy hitters are kept with high probability (Lemma A.2), and setting t = M_max/L₀ guarantees that, with high probability, the decoding of the IBLT in each round succeeds so that we can obtain more information. We set b = 1 in our experiments, and the estimated heavy hitters are defined as those with estimated frequency at least τ. In the adaptive algorithm (Algorithm 4), for the update rule we use t_{r+1} = 0.5·t_r + 0.5·t_r·ŝ_r. We leave designing better update rules as future work.

Client data simulation. We take the ground-truth distribution of strings in the Stack Overflow dataset in TensorFlow Federated and truncate them to the first 3 characters in the set ROOT. This is to make sure that the computation is feasible for the count-median sketch; the data universe is of size d = 97,336. In each round, we take M_r ∼ N(M, M/10) i.i.d. samples from this distribution and encode them using the algorithms mentioned above. In the experiments, we assume all samples come from different users (m = 1). For count-sketch, this does not affect the performance. For IBLT with threshold sampling, this is equivalent to IBLT with binomial sampling, and by (Duffield et al., 2005a) it only increases the variance of the noise introduced in the sampling process. The metric we use is the F1 score between the real heavy hitters (heavy hitters with true cumulative frequency at least τ) and the estimated heavy hitters. We take R ∈ {10, 30, 50, 100}, τ ∈ {20, 50, 100, 200}, M ∈ {1000, 2000, 5000, 10000} and C ∈ {100, 200, 500, 1000, 2000, 5000, 8000, 10000, 20000, 30000, 40000}. For the count-median method, we take the maximum F1 score over all H ∈ {5, 7, 9, 11} for each communication cost. We run each experiment 5 times and compute the mean and standard deviation of the obtained F1 score.
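For concreteness, the evaluation metric used in the plots below (the F1 score between the true and recovered heavy hitter sets) can be computed with a few lines of Python; the sets passed in are hypothetical.

```python
def heavy_hitter_f1(true_hh: set[str], recovered_hh: set[str]) -> float:
    """F1 score between the set of true heavy hitters (cumulative frequency
    >= tau) and the recovered set, as used for the plots in this section."""
    if not true_hh and not recovered_hh:
        return 1.0
    tp = len(true_hh & recovered_hh)
    if tp == 0:
        return 0.0
    precision = tp / len(recovered_hh)
    recall = tp / len(true_hh)
    return 2 * precision * recall / (precision + recall)

# Toy usage with hypothetical heavy hitter sets.
print(heavy_hitter_f1({"the", "and", "url"}, {"the", "and", "foo"}))  # 0.666...
```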
Our proposed algorithms consistently outperform the sketching-based method. Below we list and analyze a few representative plots.

In Figure 1, we plot the F1 score comparison under different communication costs when R = 30, τ = 50, M = 10000. It can be seen that our proposed algorithms significantly outperform the count-sketch method. Among the IBLT-based methods, subsampled IBLT with adaptive tuning performs best. For non-interactive algorithms, subsampled IBLT with a fixed subsampling probability is better than its unsampled counterpart for a wide range of small capacities.

Figure 1. F1 score comparison under different communication costs (R = 30, τ = 50, M = 10000). Each F1 score is an average of 5 runs and the error bar represents 3x the standard deviation of the runs.

In Figure 2, we plot the F1 score comparison under different numbers of rounds when C = 10000, τ = 200, M = 10000. As we can see, the performance of count-sketch decreases significantly as the number of rounds increases, while the performance of the IBLT-based methods remains relatively flat, which is consistent with the theoretical results. The slight increase in the F1 score as R increases might be due to the i.i.d. generating process of the data in each round: as R increases, we get more information about the underlying distribution, and this effect outweighs the additional noise introduced by multiple rounds. A better understanding of this effect is an interesting future direction.

Figure 2. F1 score comparison with different numbers of rounds (τ = 200, M = 10000, C = 10000). Each F1 score is an average of 5 runs and the error bar represents 3x the standard deviation of the runs.

In Figure 3, we further demonstrate our adaptive tuning method by showing that it is comparable with the best possible subsampling parameter in a candidate set. More specifically, we run subsampled IBLT with t ∈ {1, 1.25, 2, 5, 10, 20, 50, 100} for all communication costs, and the F1 score for Subsampled IBLT (best fixed) is defined as the best F1 score among these candidates. Our results show that the performance of the adaptive algorithm is on par with the best fixed subsampling parameter. It outperforms the best fixed subsampling parameter in certain cases because the set of subsampling parameters we choose from has limited granularity, and hence the adaptive algorithm may find better parameters for the underlying instance.

Figure 3. F1 score comparison (adaptive vs. best fixed probability) (τ = 200, M = 10000, C = 5000). Each F1 score is an average of 5 runs and the error bar represents 3x the standard deviation of the runs.

8. Conclusion

We provided lower bounds and matching upper bounds for central tasks in multi-round distributed data analysis: heavy hitters recovery and approximate histograms over large domains. Our findings show that porting single-round approaches based on standard sketching does not achieve optimality, and that optimality can be achieved via subsampled IBLTs. Several interesting and non-trivial questions remain to be addressed, including (a) developing distributed differential privacy schemes that are provably optimal for this problem, and (b) developing (non-linear) cryptographic (or other secure) primitives that allow us to extract heavy hitters with smaller (sublinear in mn) communication.

9. Acknowledgments

The authors thank Wennan Zhu for early discussions on the work, and Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha, and Ameya Velingker for proposing the IBLT as the linear data structure in federated heavy hitter recovery.

References

Acharya, J., Canonne, C.
L., Sun, Z., and Tyagi, H. Unified lower bounds for interactive high-dimensional estimation under information constraints. arXiv preprint arXiv:2010.06562, 2020.

Bar-Yossef, Z., Jayram, T., Kumar, R., and Sivakumar, D. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4):702-732, 2004. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2003.11.006. URL https://www.sciencedirect.com/science/article/pii/S0022000003001855. Special Issue on FOCS 2002.

Bell, J. H., Bonawitz, K. A., Gascón, A., Lepoint, T., and Raykova, M. Secure single-server aggregation with (poly)logarithmic overhead. In CCS, pp. 1253-1269. ACM, 2020.

Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pp. 1175-1191, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450349468. doi: 10.1145/3133956.3133982. URL https://doi.org/10.1145/3133956.3133982.

Bonawitz, K. A., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečný, J., Mazzocchi, S., McMahan, B., Van Overveldt, T., Petrou, D., Ramage, D., and Roselander, J. Towards federated learning at scale: System design. In MLSys. mlsys.org, 2019.

Boneh, D., Boyle, E., Corrigan-Gibbs, H., Gilboa, N., and Ishai, Y. Lightweight techniques for private heavy hitters. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 762-776, 2021. doi: 10.1109/SP40001.2021.00048.

Braverman, M., Garg, A., Ma, T., Nguyen, H. L., and Woodruff, D. P. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 1011-1020, 2016.

Charikar, M., Chen, K., and Farach-Colton, M. Finding frequent items in data streams. In Automata, Languages and Programming: 29th International Colloquium, ICALP 2002, Málaga, Spain, July 8-13, 2002, Proceedings, pp. 693-703. Springer, 2002.

Chen, W.-N., Özgür, A., Cormode, G., and Bharadwaj, A. The communication cost of security and privacy in federated frequency estimation, 2022. URL https://arxiv.org/abs/2211.10041.

Cormode, G. and Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58-75, 2005. ISSN 0196-6774. doi: https://doi.org/10.1016/j.jalgor.2003.12.001. URL https://www.sciencedirect.com/science/article/pii/S0196677403001913.

Cormode, G. and Muthukrishnan, S. Combinatorial algorithms for compressed sensing. In 2006 40th Annual Conference on Information Sciences and Systems, pp. 198-201. IEEE, 2006.

Corrigan-Gibbs, H. and Boneh, D. Prio: Private, robust, and scalable computation of aggregate statistics. In NSDI, pp. 259-282. USENIX Association, 2017.

Corrigan-Gibbs, H., Boneh, D., Chen, G., Englehardt, S., Helmer, R., Hutten-Czapski, C., Miyaguchi, A., Rescorla, E., and Saint-Andre, P. Privacy-preserving Firefox telemetry with Prio. https://rwc.iacr.org/2020/slides/Gibbs.pdf, 2020.

Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.

Duffield, N., Lund, C., and Thorup, M. Learn more, sample less: control of volume and variance in network measurement.
IEEE Transactions on Information Theory, 51(5): 1756 1775, 2005a. doi: 10.1109/TIT.2005.846400. Duffield, N., Lund, C., and Thorup, M. Learn more, sample less: control of volume and variance in network measurement. IEEE Transactions on Information Theory, 51(5): 1756 1775, 2005b. doi: 10.1109/TIT.2005.846400. Gilbert, A. C., Li, Y., Porat, E., and Strauss, M. J. Approximate sparse recovery: optimizing time and measurements. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 475 484, 2010. Goodrich, M. T. and Mitzenmacher, M. Invertible bloom lookup tables. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 792 799, 2011. doi: 10.1109/Allerton.2011.6120248. Han, Y., Ozg ur, A., and Weissman, T. Geometric lower bounds for distributed parameter estimation under communication constraints. IEEE Transactions on Information Theory, 67(12):8248 8263, 2021. doi: 10.1109/TIT. 2021.3108952. Hu, C., Li, J., Liu, Z., Guo, X., Wei, Y., Guang, X., Loukides, G., and Dong, C. How to make private distributed cardinality estimation practical, and get differential privacy for free. In USENIX Security Symposium, pp. 965 982. USENIX Association, 2021. Jayram, T. S. Hellinger strikes back: A note on the multiparty information complexity of and. In Dinur, I., Jansen, K., Naor, J., and Rolim, J. (eds.), Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 562 573, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. ISBN 978-3-642-03685-9. Melis, L., Danezis, G., and Cristofaro, E. D. Efficient private statistics with succinct sketches. In NDSS. The Internet Society, 2016. Minton, G. T. and Price, E. Improved concentration bounds for count-sketch. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pp. 669 686. SIAM, 2014. Mitzenmacher, M. and Upfal, E. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, USA, 2nd edition, 2017. ISBN 110715488X. Molloy, M. Cores in random hypergraphs and boolean formulas. Random Structures & Algorithms, 27(1):124 135, 2005. Federated Heavy Hitter Recovery under Linear Sketching A. Proof of Theorem 4.1 Note that the algorithm can be viewed as b:= 20 log( 40mn R τβ ) independent runs of a basic protocol, each of which returns a list Hi of potential heavy hitters. We assume b 260, else we take b = max{b, 260} and the result will change by at most a constant factor. The next lemma states that the probabilities of heavy elements and tail elements falling in the list. Lemma A.1. All Hj defined in Algorithm 2 satisfy that, if h[R](x) τ, Pr (x Hj) 4/5. Else if h[R](x) τ/10, Pr (x Hj) 2h[R](x) Before proving the lemma, we first show how Theorem 4.1 can be implied by Lemma A.1. By Lemma A.1, for x with h[R](x) τ, we have Pr (x H) Pr (Binom(b, 4/5) b/2) 1 βτ 40mn R, where the last inequality follows from standard concentration bounds for Binomial random variables (e.g., Chernoff bound (Mitzenmacher & Upfal, 2017)). Hence by union bound, we have Pr {x [d] | h[R](x) τ} H 1 β For any x, with h[R](x) τ/10, by Lemma A.1, we have Pr (x H) Pr Binom b, 2h[R](x) where the last inequality follows from Binomial tail bound (see Lemma E.1). 
Hence by union bound we have Pr {x [d] | h[R](x) τ/10} H = x:h[R](x) τ/10 where (1) follows from xb/2 + yb/2 (x + y)b/2, and hence we can combine symbols to increase the sum of tail probability and end up with at most 20mn R τ symbols with frequencies at most τ/10. (2) follows from the inequality (x + 1/2)(8e/25)x e x/10 for x 130. By union bound, we get the guarantee claimed in Theorem 4.1. Proof of Lemma A.1: The proof mainly consists of two parts. We will first show that local subsampling will keep each heavy hitter with a high probability and each tail element with a low probability, stated in Lemma A.2. We will then show that after local subsampling, the number of unique elements in each round will decrease so that the decoding in Algorithm 2 will succeed with high probability. Federated Heavy Hitter Recovery under Linear Sketching Lemma A.2. Let h [R] j be the aggregation of locally subsampled histogram for run j, i.e., Then if h[R](x) τ, [R] j (x) > 0 1 1 Else if h[R](x) τ/10, Pr (x Hj) 2h[R](x) Proof. When h[R](x) τ, [R] j (x) > 0 = 1 Πr [R],i Br min{1 2hi,j(x) τ , 0} 1 Πr [R],i Bre 2hi,j (x) τ = 1 e 2h[R](x) When h[R](x) τ/10 [R] j (x) > 0 = 1 Πr [R],i Br The next lemma shows that with high probability, the number of elements in each round will decrease by least a factor of τ. Lemma A.3. With probability at least 1 1/32, we have max r [R] { h r 0} = O mn Proof. Since all rounds are independent, it would be enough to show that i, with probability at least 1 1/32R, we have h r 0 = O mn To see this, we have Pr h r 0 2mn τ log R Pr Binom mn, 1 τ log R 1 32R, where the first step follows from that the left hand side is maximized when all mn elements in hr are distinct, and the second step follows from standard binomial tail bound when mn > 4τ and R > 32. Finally, it would be enough to show that when the condition in Lemma A.3 holds, the decoding of the aggregated IBLT will succeed with high probability. This is true since by Lemma 4.3 and union bound, we have Pr j, ˆh[R] j = h [R] j 1 R (mn τ log R)2 γ 1 1/32, where the last inequality holds when mn > 4τ and R > 32. Combining the above and Lemmas A.2 and A.3, we conclude the proof since 1/e2 + 1/32 + 1/32 1/5. Federated Heavy Hitter Recovery under Linear Sketching B. Proof of Theorem 5.1 We start with the case when τ R. In this case, Algorithm 3 implements Algorithm 2 with τ = 1 and returns the obtained histogram in Line 11. Notice that when τ = 1, the subsampling step is trivial and each user encodes their entire histogram. Hence as long as long the decoding of IBLT succeeds (as promised in the performance analysis of Algorithm 2), we recover the histogram perfectly, i.e., ˆh[R] = h[R]. And the communication cost will be Θ(mn). Next we focus on the case when τ R. We will condition on the event that the list H obtained in Line 8 of Algorithm 3 is a τ approximate heavy hitter set and hence setting ˆx = 0 for x / H won t introduce error larger than τ. The rest of the proof follows similarly as the standard proof for Count-sketch. Since b = log 4mn R τβ , it would be enough to prove that x X, with probability at least 2/3, we have r [R] Tr(j, gj(x)) sj,r(x) h[R](x)| = O(τ). r [R] T (r)(j, gj(x)) sj,r(x) x 1 {gj(x ) = gj(x)} sj,r(x )sj,r(x) h(r)(x ) x 1 {gj(x ) = gj(x)} X r [R] sj,r(x )sj,r(x) h(r)(x ) =h[R](x) + X x =x 1 {gj(x ) = gj(x)} X r [R] sj,r(x )sj,r(x) h(r)(x ) Then we have E h ˆhj(x) = h[R](x) i . Next we provide a bound on the variance. 
Let H10τ/ R be the set of elements with frequency at least 10τ/ R, then we have |H10τ/ R 10τ . Since w = 10mn R τ , we have with probability at least 5/6, X R,x =x 1 {gj(x ) = gj(x)} = 0. Conditioned on this event, we have E ˆhj(x) h[R](x) 2 = E R,x =x 1 {gj(x ) = gj(x)} X r [R] sj,r(x )sj,r(x) h(r)(x ) maxx / H10τ/ R h[R](x) P Hence with probability at least 5/6, we have E h ˆhj(x) h[R](x) i We conclude the proof by a union bound over the two events. C. Additional details on IBLT Intuition on List Entries for IBLT. The intuition behind the IBLT construction is as follows: Start with an array B of length ℓcontaining 4-tuples of the form (0, 0, 0, 0). To insert pair (x, v) hash the tuple (x, x, v, 1) into k locations l1, . . . , lk Federated Heavy Hitter Recovery under Linear Sketching in B based on the key x, where x := G(x) is a hash of x into a sufficiently large domain so that collision probability is sufficiently unlikely. Then add, using component-wise sum, (x, x, v, 1) to the contents of B in all locations l1, . . . , lk. The List Entries/Dec operation corresponds to the result of the following procedure: (1) find an entry (xsum, xsum, vsum, j) such that G(xsum/j) = xsum/j holds, (2) add (xsum/j, vsum) to the output, and (3) remove the pair (xsum/j, vsum) by subtracting (xsum, xsum, vsum, j) from the entries l 1, . . . , l k in the array B to which an insertion would add the tuple for key xsum/j and get back to step (1). The process of listing entries a.k.a peeling off B. might terminate before the IBLT is empty. This is the failure procedure in Lemma 4.3, which corresponds to the natural procedure to find a 2-core in a random graph (Goodrich & Mitzenmacher, 2011). Sketch size. The above intuition corresponds to the IBLT construction variant from (Goodrich & Mitzenmacher, 2011) that can handle duplicates. It can be implemented with four length ℓvectors with entries in [d], Im(G), [mn], [mn], respectively. In terms of concrete parameters (see (Goodrich & Mitzenmacher, 2011) for details), k = 3, ℓ> 1.3L0, and G = Zp with p = 231 1 give good performance, and require 1.3L0(32 + log2 d + 2 log2(mn)) bits. For the experiment setting considered in Section 7, this is will take at most 8L0 words. Cardinality estimation from saturated IBLT. Lemma 4.3 tells us that a tight bound L0 on the number of distinct non-zero indices in the intended histogram, can save us space in an IBLT encoding. However, getting that bound wrong results in an undecodable IBLT. While in the single round case all is lost, in the multi-round setting we leverage a property of undecodable IBLTs that helps update our L0 bound for subsequent rounds after a failed round. This is the main ingredient for our adaptive tuning heuristic presented in Section 6. Let B be an undecodable IBLT, and let S be the size of the undecoded graph of B. Also let ℓbe the size of B, and let N the (unknown) number of distinct elements inserted in B (note that N corresponds to the correct bound L0 that enables decoding). By (Molloy, 2005), we have the following relation: For large enough N, if S < ℓ, we have S C = 1 e x(1 + x) + o(1), (3) where x is the greatest solution to 6N C = 2x (1 e x)2 . (4) Hence we can have an estimate for N (and thus a correct choice for L0 in a subsequent round) based on S and C. We first solve (3) ignoring the o(1) term to get x and then plug x and C into (4) to get an estimate for N. As mentioned above we leverage this fact in Section 6. D. Proof of lower bounds. D.1. 
Proof of Theorem 4.2 We will focus on the case when R = 1 since the claimed bound doesn t depend on R and we can assume there is no data in other R 1 rounds. We will consider the case when 10 < τ < n/4. We prove the theorem by a reduction to the set disjointness problem (Bar-Yossef et al., 2004; Jayram, 2009). The set disjointness problem (DISTt,d) considers the setting where t users where user i has a set of elements Si {1, 2, . . . , d}. The goal is to distinguish between the following two chase with success probability at least 4/5. 1. All Si s disjoint. 2. There exists x [d] such that for all i, j [t], Si Sj = {x}. And the goal is to minimize the size of the transcript of all communications among all users. More specifically, we will use the following lemma: Lemma D.1 ((Jayram, 2009)). Any protocol that solves DISTt,d must have a transcript of size at least Θ(d/t). Next we show that DISTt,d with t = τ and d = mn/2 can be reduced to the approximate heavy hitter problem. We divide users into τ + 1 groups. For i [τ], the ith group has ni = |Si|/m users. And let Si be set of all elements held by users in group i. We partition Si to subsets of size at most m and distribute them to users in group i arbitrarily. This can be Federated Heavy Hitter Recovery under Linear Sketching done since mni |Si|. The total number of users in the first τ groups is P i [τ] ni d+τ m + τ n. The τ + 1 group has n P i [τ] ni users and each user has zero element. Suppose there exists a τ-Approx HH linear sketch algorithm with communication cost per-user o( mn τ ). When Si s are disjoint, all elements in [d] will have frequency 1 < τ/10. The algorithm should output an empty list. When Si s have an unique intersection, the element will have frequency τ, and hence the algorithm should output a list with size 1. By distinguishing between the two cases, the Approx HH algorithm can be used to solve DISTt,d. Moreover, under linear sketching constraint, the size of the transcript is the same as the per-user communication. Hence we conclude the proof by noticing that this violates Lemma D.1. D.2. Proof of Theorem 5.2 Here we prove a stronger version of the lower bound where in each round r, the communication among users is not limited but the users in Br must compress h(r) to an element Y (r) Gr with |Gr| 2ℓ, which is observed by the server. And the server will then obtain an approximate histogram bh[R] based on Π = (Y (1), . . . , Y (R), U). For a given τ, next we show that any protocol with ℓ= o min{ mn R τ , mn} won t solve τ-approximate heavy hitter with error probability at most 1/5. We will focus on the case when τ R and ℓ= o mn R τ . When τ < R, the bound follows by setting τ = R and the fact that the problem gets harder as τ decreases. To simply the proof, we assume R 400 without loss of generality. We consider histograms h(r), r [R] supported over the domain 10ℓand are generated i.i.d. from a distribution P. Let Z be uniformly distributed over { 1}5ℓ, and under distribution PZ, we have r [R], i [5ℓ], 5ℓ with prob 1 0 with prob 1 and h(r)(2i 1) = 1 h(r)(2i). It can be check that h(r) 1 = mn with probability 1. We prove the theorem by contradiction. If the protocol solves τ-approximate heavy hitter with error probability at most 1/5, let ˆZi = 1 ˆh[R](2i) > mn R Pr ˆZi = Zi Pr ˆh[R](2i) h[R](2i) mn h[R](2i) mn R where the first probability is bounded by the success probability of the algorithm and the second probability is bounded using Hoeffding bound. 
Hence we have X i [5ℓ] I(Zi; Π) X i [5ℓ] I(Zi; ˆZi) X i [5ℓ] (1 H( 5 where H(p) is the Shannon entropy of a Bernoulli random variable with success probability 6/25. To upper bound P i [5ℓ] I(Zi; Π), we notice that the vector mn , 5ℓh(r)(4) mn , . . . , 5ℓh(r)(10ℓ) follows a product distribution with the marginal of each coordinate being a Bernoulli distribution. Hence by standard arguments on communication-limited estimation of product of Bernoulli random variables (e.g., in (Braverman et al., 2016; Federated Heavy Hitter Recovery under Linear Sketching Han et al., 2021; Acharya et al., 2020)). In particular, following almost the same steps as in Acharya et al. (2020, Section 7.1), X i [5ℓ] I(Zi; Π) R 1 which leads to a contradiction. This concludes the proof. E. Binomial tail bound. Lemma E.1. Let X Binom(n, p) be a binomial distribution, when n > 10 and p < 1/5, we have Pr (X n/2) n + 1 Pr (X n/2) = i= (n+1)/2 Pr (X = i) ((1 p)p)n/2 (5) 2 (2e)n/2 4p where (5) follows from Pr (X = i) is monotonically decreasing when i n/2 and (6) follows from standard bounds on binomial coefficients.