# Learning-Augmented Data Stream Algorithms

Published as a conference paper at ICLR 2020.

Tanqiu Jiang (Department of Electrical and Computer Engineering, Lehigh University, Bethlehem, PA 18015, USA; taj320@lehigh.edu), Yi Li (School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; yili@ntu.edu.sg), Honghao Lin (Zhiyuan College, Shanghai Jiao Tong University, Shanghai, China 200240; honghao_lin@sjtu.edu.cn), Yisong Ruan (Department of Software Engineering, Xiamen University, Xiamen, Fujian, China 361000; 24320152202802@stu.edu.xmu.cn), David P. Woodruff (Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA; dwoodruf@cs.cmu.edu)

ABSTRACT

The data stream model is a fundamental model for processing massive data sets with limited memory and fast processing time. Recently Hsu et al. (2019) incorporated machine learning techniques into the data stream model in order to learn relevant patterns in the input data. Such techniques led to the training of an oracle to predict item frequencies in the streaming model. In this paper we explore the full power of such an oracle, showing that it can be applied to a wide array of problems in data streams, sometimes resulting in the first optimal bounds for such problems, and sometimes bypassing known lower bounds without such an oracle. We apply the oracle to counting distinct elements on the difference of streams, estimating frequency moments, estimating cascaded aggregates, and estimating moments of geometric data streams. For the distinct elements problem, we bypass the known lower bound and obtain a new space-optimal algorithm. For estimating the p-th frequency moment for 0 < p < 2 we obtain the first algorithms with optimal space and update time. For estimating the p-th frequency moment for p > 2 we obtain a quadratic savings in memory, bypassing known lower bounds without an oracle. We empirically validate our results, demonstrating also our improvements in practice.

1 INTRODUCTION

Processing data streams has been an active research field in the past two decades. This is motivated by the increasingly common scenario where the size of the data far exceeds the size of the available storage, and the only feasible access to the data is to make a single or a few passes over it. This situation occurs, for example, in network applications with high-speed packets being sent through routers with limited resources. Other examples include processing internet search logs, sensor networks, and scientific data streams. For an introduction to data streams, see, e.g., Muthukrishnan (2005).

Formally, in the data stream model, we assume there is an underlying frequency vector $x \in \mathbb{Z}^n$, initialized to $0^n$, which evolves throughout the course of a stream. The stream consists of updates of the form $(i, w)$, meaning $x_i \leftarrow x_i + w$. Suppose $M$ is such that $\|x\|_\infty = \max_{i \in \{1,\ldots,n\}} |x_i| \le M$ throughout the stream. At the end of the stream, we are asked to approximate $f(x)$ for a function $f : \mathbb{Z}^n \to \mathbb{R}$. Because of the sheer size of the stream, most algorithms are approximate and randomized. Typically, such algorithms output an estimate $Z$ for which $\Pr\{(1-\epsilon)f(x) \le Z \le (1+\epsilon)f(x)\} \ge 2/3$, where the probability is over the randomness used by the algorithm, and not over the input stream, which may be a worst-case stream.
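The following minimal sketch (ours, not code from the paper) illustrates the turnstile stream model just described: updates $(i, w)$ are applied to an underlying frequency vector, and at the end some statistic $f(x)$ is evaluated exactly for reference. A streaming algorithm must approximate this value with far less memory than the full vector.

```python
# A minimal illustration (ours) of the turnstile stream model described above.
import random

n = 10          # universe size (illustrative)
x = [0] * n     # underlying frequency vector, initialized to 0^n

stream = [(random.randrange(n), random.choice([-1, 1])) for _ in range(1000)]
for i, w in stream:
    x[i] += w   # the update (i, w) means x_i <- x_i + w

# Exact reference evaluation of f(x) = ||x||_p^p; a streaming algorithm would
# approximate this using much less than the n counters kept here.
p = 3
f_exact = sum(abs(v) ** p for v in x)
print(x, f_exact)
```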
The success probability 2/3 can be replaced with any constant greater than 1/2 by independently repeating the algorithm a constant number of times and taking the median estimate. The space complexity of the algorithm is measured in bits, and the goal is to use much less than the trivial $n$ bits of space required to store $x$.

A common difficulty in designing algorithms in the data stream model is that the coordinates $i$ for which $|x_i|$ is large, i.e., the heavy hitters, are not known in advance. Such coordinates usually require very accurate estimates in order to achieve an accurate final estimate of the function value. A number of techniques have emerged for identifying such coordinates, such as Count-Min and Count-Sketch. These techniques are often used at multiple scales by subsampling the input coordinates and finding heavy items in the subsample, and have found applications to estimating frequency moments, cascaded aggregates, earthmover distance, and empirical entropy, among other statistics (Indyk & Woodruff, 2005; Chakrabarti et al., 2006; Jayram & Woodruff, 2009; Andoni et al., 2009; McGregor et al., 2016).

In many applications, however, the worst-case input is rarely seen; in fact, there is usually a pattern inside the underlying vector. For instance, in network traffic monitoring, the large or elephant flows fluctuate on a daily basis, but the actual source and destination servers of those flows do not vary much. In our formulation, this implies that the value $x_i$ fluctuates while whether or not $x_i$ is a heavy coordinate remains stable. Motivated by this as well as related work on learning oracles for algorithm design (Gupta & Roughgarden, 2016; Dick et al., 2017; Balcan et al., 2018a;b), Hsu et al. (2019) introduced the notion of a heavy hitter oracle into the data stream model. The heavy hitter oracle, trained on past data, receives a coordinate index $i$ and returns a prediction of whether $x_i$ will be heavy or not. They showed that such an oracle simplifies classical heavy hitter algorithms such as Count-Min and Count-Sketch. We note that similar oracles were also studied in previous work, e.g., membership oracles for Bloom filters by Kraska et al. (2017). These works demonstrate the feasibility of creating such an oracle using machine learning techniques. Recently, learned oracles for low-rank approximation, which can be viewed as an analogue of heavy hitters for heavy directions, were explored by Indyk et al. (2019).

The work of Hsu et al. (2019) is the starting point of our work. While Hsu et al. (2019) showed applications of the heavy hitter oracle to frequency estimation, the full power of such an oracle remained largely unexplored. Our main goal is to understand whether such an oracle is more broadly useful, and whether it can give memory and time improvements for the large number of data stream problems mentioned above. Accurately estimating the heavy hitters in a vector is the bottleneck of existing algorithms for numerous data stream problems, and one cannot help but ask: Can a heavy hitter oracle be applied to a large number of problems in the data stream model to obtain significantly improved bounds for these problems?

We consider estimating the following common functions $f(x)$ in the data stream model:

- Distinct Elements: $f(x) = \|x\|_0$, where $\|x\|_0$ is the number of nonzero coordinates of $x$, defined as $\|x\|_0 = |\{i : x_i \neq 0\}|$. This quantity is useful for detecting denial-of-service attacks and database query optimization (Kane et al., 2010b).
- The $F_p$-Moment: $f(x) = \|x\|_p^p$, where $\|x\|_p$ is the usual $\ell_p$-norm of $x$, defined as $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$. For $0 < p < 2$, these are often more robust than the Euclidean norm (Indyk, 2000). For $p > 2$ these are used to estimate skewness and kurtosis (Indyk & Woodruff, 2005).
- The $(k, p)$-Cascaded Norm: in this problem $x$ is an $n \times d$ matrix that receives updates to its individual entries, and $f(x) = \|x\|_{k,p} := \big(\sum_i\big(\sum_j |x_{ij}|^p\big)^{k/p}\big)^{1/k}$. These are useful for executing multiple database queries (Cormode & Muthukrishnan, 2005; Jayram & Woodruff, 2009). (A small worked example computing these three statistics exactly is given below, after the overview of our results.)

These problems have also been studied for non-numerical data, such as geometric data. For example, the $F_p$-moment problem was studied on a stream of rectangles in Tirthapura & Woodruff (2012). One can think of an underlying $\Delta^d$-dimensional vector $x$, identified with the $d$-dimensional rectangle $\{1, \ldots, \Delta\}^d$. Without loss of generality, we assume that $\Delta$ is a power of 2. Each item in the stream is a pair of the form $(R, w)$, where $R$ is a $d$-dimensional rectangle and $w$ is an integer, meaning $x_i \leftarrow x_i + w$ for all points $i \in R$. The goal is to estimate $f(x) = \|x\|_p^p$. This is called the Rectangle $F_p$-Moment problem, which occurs in spatial data streams and constraint databases.

Our Results. We show that a heavy hitter oracle can greatly improve the complexity of a wide array of commonly studied problems in the data stream model, leading to the first optimal bounds for several important problems, and shattering lower bounds that have stood in the way of making further progress on important problems. We note that not only do we give new algorithms, we also give several lower bounds for algorithms equipped with the oracle, which show the optimality of our algorithms even with an oracle. Our algorithms not only give practically better, theoretically grounded improvements to these problems with an oracle, they also shed light on what exactly the difficulties are in making further progress on data stream problems without an oracle. We consider both perfect oracles and oracles that may sometimes make mistakes.
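As promised above, here is a small exact (offline) computation of the three statistics just defined. This is our own illustration of the definitions, not one of the paper's streaming algorithms; the function names are hypothetical.

```python
# A small worked example (ours) evaluating the three statistics defined above
# exactly on toy data, to make the definitions concrete.
def l0(x):
    """Number of nonzero coordinates, ||x||_0."""
    return sum(1 for v in x if v != 0)

def fp_moment(x, p):
    """F_p moment, ||x||_p^p = sum_i |x_i|^p."""
    return sum(abs(v) ** p for v in x)

def cascaded_norm(X, k, p):
    """(k, p)-cascaded norm of a matrix X: (sum_i (sum_j |x_ij|^p)^{k/p})^{1/k}."""
    return sum((sum(abs(v) ** p for v in row)) ** (k / p) for row in X) ** (1 / k)

x = [0, 3, -1, 0, 2]
X = [[1, -2, 0], [0, 0, 4], [3, 1, 1]]
print(l0(x), fp_moment(x, 3), cascaded_norm(X, 4, 2))
```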
We summarize in Table 1 our improvements to existing algorithms using appropriate heavy hitter oracles, whose conditions are summarized in Table 2. Throughout, we make the conventional assumption that $m, M = \mathrm{poly}(n)$.

| Problem | Result Type | Previous Result | This Work |
|---|---|---|---|
| Distinct Elements | S | $O(\frac{1}{\epsilon^2}\log n(\log\frac{1}{\epsilon}+\log\log n))$ (Kane et al., 2010b) | $O(\frac{1}{\epsilon^2}\log n\log\frac{1}{\epsilon})$ (Theorem 3) |
| $F_p$ Moment ($p>2$) | S | $\tilde O(n^{1-2/p})$ (e.g., Andoni et al., 2011) | $\tilde O(n^{1/2-1/p})$ (Theorem 7) |
| $F_p$ Moment Fast Update ($p<2$) | Update T | amortized $O(\log^2\frac{1}{\epsilon}\log\log\frac{1}{\epsilon})$ (Kane et al., 2011) | expected amortized $O(1)$ (Theorem 17) |
| $F_p$ Moment Fast Update ($p<2$) | Reporting T | $O(\frac{1}{\epsilon^2}\log\log\frac{1}{\epsilon})$ (Kane et al., 2011) | $O(\frac{1}{\epsilon^2})$ (Theorem 17) |
| Cascaded Norm ($k\ge p\ge 2$) | S | $\tilde O(n^{1-2/k}d^{1-2/p})$ (Jayram & Woodruff, 2009) | $\tilde O(n^{1-1/k-p/(2k)}d^{1/2-1/p})$ (Corollary 20) |
| Rectangle $F_p$ Moment ($p>2$) | S | $\tilde O(\Delta^{d(1-2/p)}\,\mathrm{poly}(\frac{d}{\epsilon}))$ (Tirthapura & Woodruff, 2012) | $\tilde O(\Delta^{d(1/2-1/p)}\,\mathrm{poly}(\frac{d}{\epsilon}))$ (Theorem 23) |

Table 1: Summary of previous results and the results obtained in this work. We assume that $m, M = \mathrm{poly}(n)$. In the Result Type column, S denotes space complexity and T denotes time complexity. We view $\epsilon$ as a constant for the listed results of the $F_p$ Moment and Cascaded Norm problems.

| Problem | Heavy Hitter Oracle Threshold |
|---|---|
| Distinct Elements | $\lvert x_i\rvert \ge 2^{\mathrm{poly}(1/\epsilon)}$ |
| $F_p$ Moment ($p>2$) | $\lvert x_i\rvert^p \ge \frac{1}{\sqrt{n}}\lVert x\rVert_p^p$ |
| $F_p$ Moment Fast Update ($p<2$) | $\lvert x_i\rvert^p \ge \epsilon^2\lVert x\rVert_p^p$ and $\lvert x_i\rvert^p \ge \frac{\epsilon^2}{\log^2(1/\epsilon)\log\log(1/\epsilon)}\lVert x\rVert_p^p$ |
| Cascaded Norm ($k\ge p\ge 2$) | $\lvert x_i\rvert^p \ge \lVert x\rVert_p^p/(d^{1/2}\,n^{1-\frac{p}{2k}})$ |
| Rectangle $F_p$ Moment ($p>2$) | $\lvert x_i\rvert^p \ge \lVert x\rVert_p^p/\Delta^{d/2}$ |

Table 2: Summary of the thresholds of the heavy hitter oracles used in each problem. Note that two oracles are used for the $F_p$ Moment Fast Update problem.

Distinct Elements: For the problem of estimating $\|x\|_0$, we improve the prior memory bound of $O(\frac{1}{\epsilon^2}\log n(\log\frac{1}{\epsilon}+\log\log n))$ to $O(\frac{1}{\epsilon^2}\log n\log\frac{1}{\epsilon})$. For constant $\epsilon$, this gives an optimal $O(\log n)$ bits of memory, breaking a recent $\Omega(\log n\log\log n)$-bit lower bound shown in Woodruff & Yang (2019). Examining the lower bound argument in Woodruff & Yang (2019), the hard instance is precisely when there are many items of large frequency, namely, the product of the first small number of primes. This prevents hashing-based algorithms, such as the one in Kane et al. (2010b), from working with $O(\log n)$ bits of space, since hash values are typically taken modulo a small prime and such heavy items necessitate the use of a large enough prime. Surprisingly, we show that with a heavy hitter oracle, we can separately handle such items and thus bypass this lower bound. We note that our $O(\log n)$-bit algorithm is optimal even given a heavy hitter oracle; see the $\Omega(\log n)$ lower bound in Alon et al. (1999), which holds even if all item frequencies are in the set $\{0, 1, 2\}$.

$F_p$-Moment Estimation, $0 < p < 2$: There is a long line of work on this problem, with the best known bounds given in Kane et al. (2011), which achieve an optimal $O(\epsilon^{-2}\log n)$ bits of space and $O(\log^2(1/\epsilon)\log\log(1/\epsilon))$ time to process each element. The existing algorithm is based on separately handling those elements $i$ for which $|x_i|^p \ge \epsilon^2\|x\|_p^p$, i.e., the heavy hitters, from the remaining light items. Although we can simply store the heavy hitters given a heavy hitter oracle, unfortunately, the large $O(\log^2(1/\epsilon)\log\log(1/\epsilon))$ processing time in the algorithm of Kane et al. (2011) also comes from running a fast multipoint evaluation on the light elements. We observe that if we instead train two oracles, one for detecting heavy hitters, and one for detecting items $x_i$ with $|x_i|^p \ge \frac{\epsilon^2}{\log^2(1/\epsilon)\log\log(1/\epsilon)}\|x\|_p^p$, then we can also separate out medium items, i.e., items that are detected with this oracle but not with our heavy hitter oracle. There are only $O(\epsilon^{-2}\log^2(1/\epsilon)\log\log(1/\epsilon))$ medium items, and so assuming $\log n \ge \mathrm{poly}(\log(1/\epsilon))$, one could perfectly hash all such items and maintain their hashed identities in the optimal $O(\epsilon^{-2}\log n)$ bits of space. However, one cannot maintain their counts, since each count is $O(\log n)$ bits, and approximate counts do not work if the stream can increment or decrement the coordinate values $x_i$. This is too much memory, as it would lead to $O(\epsilon^{-2}(\log n)\log^2(1/\epsilon)\log\log(1/\epsilon))$ bits of space, even worse than without the oracle. We could also maintain a so-called $p$-stable random variable for each medium item, and use it to estimate the $F_p$-moment of the medium items, but this would again be $O(\epsilon^{-2}(\log n)\log^2(1/\epsilon)\log\log(1/\epsilon))$ bits of space to store one random variable per medium item. We instead observe that most of the $p$-stable random variables are small, and so can be stored with many fewer than $\log n$ bits, and therefore globally we can get by with only $O(\epsilon^{-2}\log n)$ bits in total for estimating the medium items. Finally, for the light items, the critical observation is that we can sub-sample a $\frac{1}{\log^2(1/\epsilon)\log\log(1/\epsilon)}$ fraction of them and still accurately estimate their contribution. But this means we can run the multipoint evaluation algorithm of Kane et al.
(2011) on such items, since while that algorithm has update time $O(\log^2(1/\epsilon)\log\log(1/\epsilon))$, on average it is only executed on one out of every $\log^2(1/\epsilon)\log\log(1/\epsilon)$ items. In summary, with the aid of our oracles and by separately handling these three types of items, we are able to achieve the optimal $O(1)$ update time per item, while retaining the optimal $O(\epsilon^{-2}\log n)$ bits of memory.

$F_p$-Moment Estimation for $p > 2$: It is well known (Bar-Yossef et al., 2004) that a hard input distribution for this problem is when a random coordinate of $x$ has absolute value $n^{1/p}$, and all other coordinates have value 0 or 1. In this case, one needs to certify that there is indeed a coordinate of value $n^{1/p}$, as it contributes a constant fraction of $\|x\|_p^p$. There is a known $\Omega(n^{1-2/p})$ bits of space lower bound for this. However, a heavy hitter oracle exactly allows us to overcome this, as it can be used to separately identify such items. Our idea is to separate out every item $i$ for which $|x_i|^p \ge \frac{1}{n^{1/2}}\|x\|_p^p$ with the oracle, and run an existing $F_p$-estimation algorithm on such items. Since there are at most $n^{1/2}$ such items, existing algorithms give $(n^{1/2})^{1-2/p}\,\mathrm{poly}(\log n) = n^{1/2-1/p}\,\mathrm{poly}(\log n)$ bits of memory for this part. For the remaining light elements, one can show that if one subsamples only a $1/n^{1/2}$ fraction of them and then runs an existing $F_p$-estimation algorithm on them, one can use this to accurately estimate their contribution. This again takes $n^{1/2-1/p}\,\mathrm{poly}(\log n)$ bits of memory, balancing the memory with the other part of our algorithm and giving a quadratic improvement in memory over the case without a heavy hitter oracle. Overall our algorithm uses $O(n^{1/2-1/p}\,\mathrm{poly}(\log n))$ bits with an oracle, and we can in fact show that this is tight up to logarithmic factors, as there exists a lower bound of $\Omega(n^{1/2-1/p})$ bits even when provided the same oracle.

We describe our algorithms for cascaded norms and rectangle $F_p$-moments in the appendix. We conduct experiments for the distinct elements and the $F_p$ moment ($p > 2$) problems, on both real-world and synthetic data, which demonstrate significant practical benefits.

2 PRELIMINARIES

Notation. We let $[n]$ denote the set $\{1, \ldots, n\}$. The function $\mathrm{lsb}(x)$ denotes the index of the least significant bit of a non-negative integer $x$.

Space and Time Complexity. For streaming algorithms, the efficiency is characterized by the space (memory) complexity and the time complexity. The time is further divided into update time and reporting time. A data stream algorithm maintains a data structure $D$ while processing the data stream, and at the end of the stream it outputs an approximation $Z$ to $f(x)$. The space complexity refers to the size, in bits, of $D$. The update time refers to the time spent to update the data structure $D$ upon receiving a single stream update. The reporting time refers to the time to output the approximation $Z$ at the end of the stream.

Heavy Hitter Oracle. We assume access to a heavy hitter oracle, which receives an input $i \in [n]$ and outputs whether $x_i$ will be a heavy hitter at the end of the stream. The definition of heavy hitter varies across data stream problems. There are two kinds of oracles. The first kind indicates whether or not $|x_i| \ge T$, and the second kind indicates whether or not $|x_i|^p \ge \frac{1}{T}\|x\|_p^p$, where $T$ is a pre-determined threshold associated with the problem. We shall use an oracle of the first kind for the distinct elements problem and oracles of the second kind for all other problems.
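For concreteness, the two kinds of oracle predicates could be wrapped as follows. This is a sketch of ours with hypothetical class names; the paper's oracles are learned models, and here the predicted vector is simply assumed to be given. The second kind follows the top-$T$ implementation described in the experiments note just below.

```python
# A sketch (ours) of the two kinds of heavy hitter oracle described above, both
# derived from a (given) learned prediction x_hat of the final frequency vector.
class ThresholdOracle:
    """First kind: report whether |x_i| >= T for an absolute threshold T."""
    def __init__(self, x_hat, T):
        self.x_hat, self.T = x_hat, T

    def is_heavy(self, i):
        return abs(self.x_hat[i]) >= self.T

class RelativeOracle:
    """Second kind: report whether |x_i|^p >= ||x||_p^p / T. Following the
    implementation used in the experiments, we simply take the T largest
    predicted coordinates as the heavy hitters (this rule does not need p)."""
    def __init__(self, x_hat, T):
        order = sorted(range(len(x_hat)), key=lambda i: abs(x_hat[i]), reverse=True)
        self.heavy = set(order[:T])

    def is_heavy(self, i):
        return i in self.heavy

x_hat = [5, 0, 1200, 3, 900, 2, 1, 0]      # hypothetical predicted frequencies
oracle1 = ThresholdOracle(x_hat, T=1024)
oracle2 = RelativeOracle(x_hat, T=2)
print([i for i in range(len(x_hat)) if oracle1.is_heavy(i)],
      [i for i in range(len(x_hat)) if oracle2.is_heavy(i)])
```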
In the experiments, the heavy hitter oracle receives an $i \in [n]$ and outputs the predicted value of $x_i$. The implementation of the first kind of oracle is thus straightforward. For the second kind of oracle, there are at most $T$ coordinates $x_i$ satisfying $|x_i|^p \ge \frac{1}{T}\|x\|_p^p$. We shall regard the largest $T$ coordinates of the predicted vector as the heavy hitters in our implementation (instead of computing the $\ell_p^p$ value of the predicted vector), which is in conformity with the implementation in the prior experiments of Hsu et al. (2019).

3 DISTINCT ELEMENTS

The current best algorithm for estimating $L_0 = \|x\|_0$, in the presence of additions and deletions to coordinates of $x$, is due to Kane et al. (2010b). A key component of their algorithm is a constant-factor approximation algorithm, called ROUGHL0ESTIMATOR, which returns a value $\tilde L_0$ satisfying $\frac{1}{110}L_0 \le \tilde L_0 \le L_0$ with probability at least 2/3. The space is $O(\log n\log\log(mM))$ bits and the update and reporting times are $O(1)$. In this section, we improve the space of ROUGHL0ESTIMATOR to $O(\log n)$ bits with the help of a trained oracle. All proofs are postponed to Section A.

The algorithm first picks a pairwise-independent hash function $h : [n] \to [n]$ and subsamples the $n$ coordinates at $\log n$ scales, where the $j$-th scale consists of the coordinates $i$ for which $\mathrm{lsb}(h(i)) = j$. In each scale, it uses a small-space counting algorithm EXACTCOUNT to obtain, with probability at least $1 - \eta$, the exact value of $\|x\|_0$ at that scale. Then it finds the deepest scale, i.e., the largest $j$, with $\|x\|_0$ above some fixed constant threshold, and returns $\tilde L_0 = 2^j$ as the estimate. If such a $j$ does not exist, it returns $\tilde L_0 = 1$.

The EXACTCOUNT procedure in Kane et al. (2010b) uses a hash function $h' : [n] \to [B]$, where $B = \Theta(c^2)$ for some constant $c$ for which $h'$ is a perfect hashing of a set of size at most $c$. Each bucket $\ell \in [B]$ is a counter which keeps track of $(\sum_{i : h'(i) = \ell} x_i) \bmod p$ for some large prime $p$. The estimate is the number of nonzero buckets, and EXACTCOUNT returns the maximum estimate over $\Theta(\log(1/\eta))$ parallel repetitions.

Now, with the trained oracle, we can pick $p$ to be a smaller $\mathrm{poly}(1/\epsilon)$ value in EXACTCOUNT. In the higher-level ROUGHL0ESTIMATOR algorithm, whenever an update is made to a heavy coordinate $i$ identified by the oracle, the corresponding bucket inside EXACTCOUNT is marked as nonempty regardless of the counter value, since we know the heavy item will never be entirely deleted, as by definition it is heavy at the end of the stream (note this is not true of non-heavy elements). In the end, ROUGHL0ESTIMATOR finds the deepest scale with at least one nonempty bucket. The following is our new guarantee for EXACTCOUNT.

Lemma 1. Assume that $L_0 \le c$ and suppose that $B = \Theta(c^2)$. The subroutine EXACTCOUNT returns the exact value of $L_0$ with probability at least $1 - \eta$ using $O(B\log(1/\epsilon))$ bits of space, in addition to storing $O(\log(1/\eta))$ independently chosen pairwise-independent hash functions from $[n]$ to $[B]$. The update and reporting times are $O(1)$.

Now we specify some details of the algorithm ROUGHL0ESTIMATOR. It chooses $c = 141$ and $\eta = 1/16$ for EXACTCOUNT, and the EXACTCOUNT instances for different scales all share the same $O(\log(1/\eta))$ hash functions. The following is our new guarantee.

Theorem 2. The algorithm ROUGHL0ESTIMATOR outputs a value $\tilde L_0$ which satisfies $\frac{1}{110}L_0 \le \tilde L_0 \le L_0$ with probability at least 13/16. The space complexity is $O(\log(n)\log(1/\epsilon))$ bits, and the update and reporting times are $O(1)$.
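The following simplified sketch (ours, with hypothetical names and a simplified hash function) illustrates the oracle-aided EXACTCOUNT idea described above: counters are kept modulo a small prime, and any bucket that ever receives an update to an oracle-predicted heavy coordinate is marked nonempty for good. The actual algorithm takes the maximum estimate over $\Theta(\log(1/\eta))$ repetitions sharing the same pairwise-independent hash functions.

```python
# A simplified sketch (ours) of oracle-aided EXACTCOUNT.
import random

class ExactCountSketch:
    def __init__(self, B, prime, is_heavy):
        self.B, self.p, self.is_heavy = B, prime, is_heavy
        self.count = [0] * B          # bucket sums modulo a small prime
        self.marked = [False] * B     # buckets known to contain a heavy item
        self.a = random.randrange(1, 10**9 + 7)   # a simplified stand-in for the
        self.b = random.randrange(10**9 + 7)      # pairwise-independent hashing

    def bucket(self, i):
        return (self.a * i + self.b) % (10**9 + 7) % self.B

    def update(self, i, w):
        ell = self.bucket(i)
        self.count[ell] = (self.count[ell] + w) % self.p
        if self.is_heavy(i):
            self.marked[ell] = True   # heavy items survive to the end of the stream

    def estimate(self):
        return sum(1 for c, m in zip(self.count, self.marked) if m or c != 0)

# toy usage: coordinate 42 is predicted heavy, so its bucket stays marked even
# if its counter later cancels modulo the small prime
ec = ExactCountSketch(B=100, prime=31, is_heavy=lambda i: i == 42)
ec.update(42, 5)
ec.update(7, 31)   # 31 = 0 (mod 31): this light item's bucket looks empty
print(ec.estimate())
```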
With a constant-factor estimate $R$ of $L_0$, the high-level idea is to subsample each index in $[n]$ with probability $K/R$. Then the number of distinct surviving items is $\Theta(K)$. To estimate the number of distinct surviving items efficiently, one hashes them into $\Theta(K)$ buckets and counts the number $T$ of non-empty buckets, which is simulated by updating a bit vector of length $\Theta(K)$. The final estimate can be deduced from the number of distinct surviving items, which can be shown to be a $(1 + \epsilon)$-approximation to $F_0$ when $K = \Theta(1/\epsilon^2)$. Since we do not know $L_0$ in advance, we have to guess its value using $\log n$ geometrically increasing values, and so our memory can be thought of as a bucket matrix $A$ of dimension $(\log n) \times K$. In parallel we run the ROUGHL0ESTIMATOR above, so at the end of the stream we will obtain a constant-factor estimate $R$, allowing us to look into the appropriate row of $A$ (the appropriate buckets).

A problem is how to efficiently maintain a bucket. In Kane et al. (2010b), each bucket stores the dot product of the frequencies hashing to it with a random vector over a large finite field of order $\mathrm{poly}(K\log(nM))$, using $O(\log K + \log\log(nM)) = O(\log(1/\epsilon) + \log\log(nM))$ bits. Now, using an argument similar to the proof of Lemma 1, with the help of our heavy hitter oracle, we can mark the buckets containing a heavy hitter as nonempty and reduce the order of the finite field to $\mathrm{poly}(1/\epsilon)$. This improves the space complexity of each bucket to $O(\log(1/\epsilon))$ bits. Combined with the complexity of ROUGHL0ESTIMATOR, we conclude with the following theorem.

Theorem 3. There is an algorithm for $(1 \pm \epsilon)$-approximating $L_0$ using space $O(\epsilon^{-2}(\log n)\log(1/\epsilon))$ with success probability at least 2/3, and with $O(1)$ update and reporting times.

4 $F_p$ ESTIMATION, $p > 2$

We start with a well-known algorithm for estimating $\|x\|_p^p$ in a stream.

Lemma 4 (Precision Sampling, Andoni et al. (2011)). There is a randomized algorithm, called the Precision Sampling algorithm, which returns an estimate $X$ of $\|x\|_p^p$ such that, with probability at least 0.9, $(1-\epsilon)\|x\|_p^p \le X \le (1+\epsilon)\|x\|_p^p$. The algorithm uses $B(n) = O(p^2\epsilon^{-2-4/p}n^{1-2/p}\log(n)\log(M))$ bits of space.

In the rest of the section, we assume that there is an oracle which can tell us whether $|x_i|^p \ge \frac{1}{s}\|x\|_p^p$, i.e., whether $i$ is a heavy hitter. The main idea is that we separately estimate the $F_p$-moment of the heavy hitters, and for the remaining light elements, we use sub-sampling to estimate their contribution with sampling rate $1/\rho$. Let $I_H, I_L \subseteq [n]$ denote the subsets of indices of the heavy hitters and the light elements, respectively. Since $|I_H| \le s$, we can use an $F_p$ estimator to $(1 + \epsilon)$-approximate $\|x_{I_H}\|_p^p$ with space $B(s)$. For the light elements, we use a sub-sampling algorithm: we set a sampling rate $1/\rho$, and choose each item $i \in I_L$ independently with probability $1/\rho$. For the chosen items we compute $F_S = \sum_{i\ \mathrm{chosen}} |x_i|^p$, and use $\rho F_S$ to estimate $\|x_{I_L}\|_p^p$. By a Chernoff bound, with high probability we sample at most $3n/\rho$ elements. To analyze this, we let $Y = \rho F_S$, and we have the following lemmas. All proofs in this section are postponed to Section B.

Lemma 5. $\mathbb{E}[Y] = \|x_{I_L}\|_p^p$ and $\mathrm{Var}[Y] \le \frac{\rho}{s}\|x\|_p^{2p}$.

Lemma 6. If we repeat the sub-sampling $k\rho/s$ times independently with sub-sampling rate $\frac{1}{\rho\epsilon^2}$ and take the average, we obtain a $(1 \pm \epsilon)$-approximation of $F_p$ with probability at least $1 - 1/k$.

If we use the estimator in Lemma 4 on both parts (the heavy part and the light part), the total space complexity is $B(s)$ plus the space of the estimators run on the sub-sampled light elements. Letting $\rho = s = \sqrt{n}$, we obtain an $F_p$ estimation algorithm with space $O(\epsilon^{-4}n^{1/2-1/p}\log(n)\log(M))$.
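The sketch below (ours) illustrates the heavy/light split just described. For clarity it computes the moment of each part exactly; the paper instead feeds the heavy items and the sub-sampled light items into a small-space $F_p$ estimator such as Precision Sampling, so this is only an illustration of the splitting and rescaling, not of the space savings.

```python
# A sketch (ours) of the oracle-aided F_p estimator for p > 2 described above.
import random

def fp_with_oracle(x, p, is_heavy, rho):
    heavy = [i for i in range(len(x)) if is_heavy(i)]
    light = [i for i in range(len(x)) if not is_heavy(i)]

    F_heavy = sum(abs(x[i]) ** p for i in heavy)     # heavy part handled separately

    # Sub-sample each light element with probability 1/rho and rescale (Lemma 5).
    sample = [i for i in light if random.random() < 1.0 / rho]
    F_light = rho * sum(abs(x[i]) ** p for i in sample)

    return F_heavy + F_light

# Toy usage with a perfect oracle at threshold ||x||_p^p / sqrt(n).
n, p = 10000, 3
x = [random.randint(0, 3) for _ in range(n)]
x[7] = 50                                            # one large coordinate
total = sum(abs(v) ** p for v in x)
oracle = lambda i: abs(x[i]) ** p * (n ** 0.5) >= total
print(total, fp_with_oracle(x, p, oracle, rho=n ** 0.5))
```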
We remark that this bound is tight in $n$ up to logarithmic factors, as there is a lower bound of $\Omega(n^{1/2-1/p})$ bits even in the presence of the oracle.

Theorem 7. Under the assumption of a heavy hitter oracle, we can estimate $\|x\|_p^p$ within a factor of $1 \pm 2\epsilon$ in $O(\epsilon^{-4}n^{1/2-1/p}\log(n)\log(M))$ bits with success probability at least 3/5.

The above analysis is based on the assumption that the oracle is perfect. In practice, the oracle sometimes makes mistakes. Here we assume that for each item $i$, the probability that the oracle gives us an incorrect prediction is $\delta$. We have the following theorem for noisy oracles.

Theorem 8. With a noisy heavy hitter oracle with error probability $\delta$, we can estimate $\|x\|_p^p$ within a factor of $1 \pm 2\epsilon$ in $O(\epsilon^{-4}n^{1/2-1/p}\log(n)\log(M))$ bits of space when $\delta = O(1/\sqrt{n})$, or in $O(\epsilon^{-4}(n\delta)^{1-2/p}\log(n)\log(M))$ bits of space otherwise. Both guarantees hold with success probability at least 3/5.

5 EXPERIMENTS

5.1 DISTINCT ELEMENTS

5.1.1 INTERNET TRAFFIC DATA

For this experiment, the goal is to estimate the number of distinct packets for each network flow. A flow is a sequence of packets between two machines on the Internet. It is identified by the IP addresses of its source and destination and the application ports.

Dataset: The traffic data was collected at a backbone link of a Tier 1 ISP between Chicago and Seattle in 2016 (CAIDA). Each recording session is around one hour. Within each minute, there are around 30 million packets (meaning 30 million updates) and 1 million unique flows. The distribution of packet counts over Internet flows is heavy-tailed; see Hsu et al. (2019).

Model: We use the prediction results of Hsu et al. (2019), which predict the logarithm of the packet count for each flow. In Hsu et al. (2019), the authors use two RNNs to encode the source and destination IP addresses separately and additionally use two-layer fully-connected networks to encode the source and destination ports. They then concatenate the encoded IP vectors, encoded port vectors, and the protocol type as the final features to predict the packet counts. They use the first 7 minutes for training, the following minute for validation, and estimate the packet counts in subsequent minutes.

Results: We plot the estimation error vs. the total number of buckets for the 20th minute of the CAIDA data. We run the estimation algorithm 5 times and calculate the average estimation error. The result is shown in Figure 2. The total number of unique flows appearing in a single test minute is about $1.1 \times 10^6$. For every light coordinate, we make an extra update that makes this coordinate 0 with probability 0.5. There are two parts to our algorithm. For ROUGHL0ESTIMATOR, we set $c = 10$ and $\eta = 1/4$. We use the heavy hitter oracle to predict whether the coordinate will be larger than $2^{10}$. We randomly select a prime from $[11, 31]$ for the hash buckets. The total number of buckets in this part is about $4 \times 10^3$ (each bucket can store a number in $\{0, 1, \ldots, 31\}$ and has an extra bit to indicate whether it has been marked). For the other part of our algorithm, we create a bucket matrix $A$ of dimension $(\log n) \times K$ with $K$ ranging from $2^6$ to $2^9$. Each entry of $A$ can also store a number between 0 and 31, with an additional bit to indicate whether it has been marked. From the plot we can see that the algorithm achieves an approximation error of about 2.5% using at most $1.5 \times 10^4$ buckets, approximately 1.5% of the space needed if we were to keep an individual counter for each flow.
5.1.2 SYNTHETIC DATA

To further demonstrate the advantage of having a heavy hitter oracle, we run the previous algorithm due to Kane et al. (2010b) (where we randomly choose a prime $p \in [11, 31]$ as the divisor for each bucket) and our modified algorithm on synthetic data designed as follows: first, we generate an input vector $x$ of dimension $n = 10^6$ with i.i.d. entries uniform on $\{0, 1, \ldots, 100\}$. Then for each coordinate $i$ and each prime $p \in [11, 31]$, we multiply $x_i$ by $p$ with probability 0.2. We set the threshold for being a heavy hitter to $2^{10}$, that is, the oracle reports $x_i$ as a heavy hitter if $|x_i| > 2^{10}$.

Results: We use the optimal parameters from the last experiment ($c = 10$, $\eta = 1/4$ and $K = 2^9$) and vary the total number of buckets from 7000 to 15000. The results are shown in Figure 1. We see that having an oracle significantly reduces the estimation error.

Figure 1: The estimation error of the $L_0$ norm in the synthetic data.

Figure 2: The estimation error of the $L_0$ norm in the 20th minute of the CAIDA data.

5.2 $F_p$ ESTIMATION

5.2.1 SEARCH QUERY DATA

For this experiment, the goal is to estimate the frequency moments of the vector indicating the number of occurrences of each search query.

Dataset: We use the AOL query log dataset, which consists of 21 million search queries collected from 650 thousand users over 90 days. There are 3.8 million unique queries. Each query is a search phrase with multiple words. The distribution of the search query frequencies is Zipfian; see Hsu et al. (2019).

Model: We use the prediction results of Hsu et al. (2019), which predict the number of times each search phrase appears. In Hsu et al. (2019), the authors train an RNN with LSTM cells that takes the characters of a search phrase as input. The final states encoded by the RNN are fed to a fully-connected layer to predict the query frequency. They use the first 5 days for training, the following day for validation, and estimate the number of times different search queries appear in subsequent days. We use this prediction result as an indicator of which search phrases are the heavy hitters.

Results: We plot the estimation error versus the total number of buckets for the 50th and 80th day ($p = 3$ or $p = 4$). For the same data, we run the estimator 10 times and calculate the average estimation error. The results are shown in Figure 3. For our algorithm, we note that the total number of search phrases that appear in a single day is about $n = 1.5 \times 10^5$, and so $\sqrt{n} < 400$. We can store the heavy coordinates identified by the heavy hitter oracle directly. Given the budget on the number of buckets, we divide them into two parts: half of them are used to store the heavy items, and the other half are used by the sub-sampling algorithm with the precision sampling estimators to estimate the frequency moment of the light elements. In comparison, we also run the classical precision sampling estimators. The latter show a larger estimation error owing to the constraint of relatively few buckets. Note that for a fixed bucket budget $B_{\mathrm{tot}}$, there is a trade-off between the number $k$ of repetitions and the number $B = B_{\mathrm{tot}}/k$ of buckets allotted to the hash function in each repetition. We plot the results for different values of $k = 10, 20, 30$. It is clear from the plots that the estimation error of the oracle-aided algorithm is about 1% even when the total number of buckets is small, demonstrating a strong advantage over classical precision sampling estimators.
Figure 3: The estimation error of the moment of search queries. The left figure corresponds to $p = 3$ (80th test day) and the right to $p = 4$ (50th test day).

5.2.2 SYNTHETIC DATA

A number of real-world datasets have the property that the heavy coordinates contribute a large fraction of the $F_p$-moment as $p$ increases, and so one needs only to obtain a good approximation of the $F_p$-moment of the heavy coordinates. This is indeed one of the reasons why we achieved low error for search query estimation. In order to show that our subsampling algorithm also achieves a good approximation for the light coordinates, we considered the following data set. We generated an input vector $x$ of dimension $n = 10^8$ with the first $\sqrt{n}$ coordinates being heavy and the remaining coordinates being light. Each coordinate is a uniform random variable in $(0, 1)$, later renormalized such that the heavy and the light parts contribute equally to the total $F_p$-moment, i.e., $\|x_{I_H}\|_p^p = \|x_{I_L}\|_p^p$. Hence, a good approximation of $\|x\|_p^p$ requires a good approximation of both parts.

Results: We plot the estimation error vs. the total number of buckets for $p = 3$ in Figure 4. For our algorithm and classical precision sampling, we fixed the number $k$ of repetitions to 100 and varied the total number of buckets from $10^6$ to $2 \times 10^6$ (1% to 2% of the length of $x$). We repeated the sub-sampling algorithm 9 times. In each of the 9 repetitions, we also used the precision sampling estimators, repeating 100 times. From the plots we see that our algorithm achieves a significantly smaller relative approximation error than the classical precision sampling algorithm for every number of buckets. In particular, using $2 \times 10^6$ buckets, i.e., 2% of the space needed to store the entire vector $x$, our algorithm achieves a 15% relative estimation error while the classical precision sampling obtains only a 27% relative error. Another observation is that since subsampling reduces the space consumption, our algorithm has a faster update time when $n$ is large.

Figure 4: Estimation error for the $F_3$ moment on synthetic data.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for helpful comments. Y. Li was supported in part by Singapore Ministry of Education (AcRF) Tier 2 grant MOE2018-T2-1-013. D. Woodruff would like to thank partial support from the National Science Foundation under Grant No. CCF-1815840 and the Office of Naval Research (ONR) under grant N00014-18-1-2562.

REFERENCES

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.

Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David P. Woodruff. Efficient sketches for earthmover distance, with applications. In 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pp. 324-330, 2009.

Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS 2011), pp. 363-372, 2011.

Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In 59th IEEE Annual Symposium on Foundations of Computer Science (FOCS 2018), pp. 603-614, 2018a.

Maria-Florina Balcan, Travis Dick, and Colin White.
Data-driven clustering via parameterized Lloyd's families. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 10664-10674, 2018b.

Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4):702-732, 2004.

Vladimir Braverman and Rafail Ostrovsky. Recursive sketching for frequency moments. CoRR, 2010.

CAIDA. CAIDA Internet traces 2016 Chicago. http://www.caida.org/data/monitors/passive-equinix-chicago.xml.

Amit Chakrabarti, Khanh Do Ba, and S. Muthukrishnan. Estimating entropy and entropy norm on data streams. Internet Mathematics, 3(1):63-78, 2006.

Graham Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. In Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2005), pp. 271-282, 2005.

Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Nina Balcan, and Alexander J. Smola. Data driven resource allocation for distributed learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), pp. 662-671, 2017.

Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science (ITCS 2016), pp. 123-134, 2016.

Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In 7th International Conference on Learning Representations (ICLR 2019), 2019.

Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), pp. 189-197, 2000.

Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC 2005), pp. 202-208, 2005.

Piotr Indyk, Ali Vakilian, and Yang Yuan. Learning-based low-rank approximations. CoRR, abs/1910.13984, 2019.

T. S. Jayram and D. P. Woodruff. The data stream space complexity of cascaded norms. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pp. 765-774, 2009.

Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2010), pp. 1161-1178, 2010a.

Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2010), pp. 41-52, 2010b.

Daniel M. Kane, Jelani Nelson, Ely Porat, and David P. Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing (STOC 2011), pp. 745-754, 2011.

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. CoRR, abs/1712.01208, 2017.

Ping Li.
Estimators and tail bounds for dimension reduction in $\ell_\alpha$ ($0 < \alpha \le 2$) using stable random projections. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2008), pp. 10-19, 2008.

Andrew McGregor, A. Pavan, Srikanta Tirthapura, and David P. Woodruff. Space-efficient estimation of statistics over sub-sampled streams. Algorithmica, 74(2):787-811, 2016.

S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.

John Nolan. Stable distributions: models for heavy-tailed data. Birkhauser Boston, 2003.

Anna Ostlin and Rasmus Pagh. Uniform hashing in constant time and linear space. In Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing (STOC 2003), pp. 622-628, 2003.

Srikanta Tirthapura and David Woodruff. Rectangle-efficient aggregation in spatial data streams. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2012), pp. 283-294, 2012.

David P. Woodruff and Guang Yang. Separating k-player from t-player one-way communication, with applications to data streams. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), pp. 97:1-97:14, 2019.

Vladimir M. Zolotarev. One-dimensional stable distributions, volume 65. American Mathematical Society, 1986.

A OMITTED PROOFS IN SECTION 3

Proof of Lemma 1. If a bucket contains a heavy hitter, it is marked nonempty, and it is indeed nonempty. It therefore suffices to show that a bucket containing only light coordinates and having a nonzero sum remains nonzero modulo $p$. Since the light coordinates are at most $2^{\mathrm{poly}(1/\epsilon)}$ in magnitude and there are at most $c$ of them by assumption, their sum $S$ is at most $2^{\mathrm{poly}(1/\epsilon)}$. This implies that $S$ has at most $\mathrm{poly}(1/\epsilon)$ prime factors. If we choose a random prime $p$ of size $\mathrm{poly}(1/\epsilon)$, then $S \bmod p \neq 0$ with probability at least $1 - \mathrm{poly}(\epsilon)$ whenever $S \neq 0$. Taking a union bound over the $\Theta(c^2)$ buckets proves correctness. The space and time complexity are immediate from the description of the algorithm.

Proof of Theorem 2. The proof is almost identical to that in Kane et al. (2010b). The space and time complexities follow from the description of the algorithm. Next we show correctness. Let $L_0(j)$ denote the true number of distinct elements in the $j$-th scale. Then $\mathbb{E}[L_0(j)] = L_0/2^j$. Let $j^* = \max\{j : \mathbb{E}[L_0(j)] \ge 1\}$ and $j' = \max\{j < j^* : \mathbb{E}[L_0(j)] \ge 55\}$. As shown in Kane et al. (2010b), if $j'$ exists, it holds that $55 \le \mathbb{E}[L_0(j')] < 110$ and $\Pr\{32 < L_0(j') < 142\} \ge 8/9$. With our choices of $c = 141$ and $\eta = 1/16$, EXACTCOUNT returns a nonzero value for the $j'$-th scale with probability at least $8/9 - 1/16 > 13/16$ by Lemma 1. It follows that the deepest scale $\hat j$ we find satisfies $j' \le \hat j \le j^*$. Hence, $L_0 = 2^{j'}\,\mathbb{E}[L_0(j')] \le 110 \cdot 2^{j'} \le 110 \cdot 2^{\hat j}$ and $L_0 = 2^{j^*}\,\mathbb{E}[L_0(j^*)] \ge 2^{j^*} \ge 2^{\hat j}$, as desired. If $j'$ does not exist, then $L_0 < 55$, and $\tilde L_0 = 1$ is a 55-approximation in this case.

B OMITTED PROOFS IN SECTION 4

Proof of Lemma 5. We let $Y = \rho F_S = \rho\sum_{i \in I_L} f_i|x_i|^p$, where $f_i = 1$ if $i$ was chosen and $f_i = 0$ otherwise. So we have that $\mathbb{E}[f_i] = 1/\rho$. It thus follows that $\mathbb{E}[Y] = \|x_{I_L}\|_p^p$, that is, $Y$ is an unbiased estimate. Then, using $\mathbb{E}[f_i] = 1/\rho$ and $\mathbb{E}[f_i f_j] = 1/\rho^2$ for $i \neq j$,
$$\mathrm{Var}[Y] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 = \rho^2\,\mathbb{E}\Big[\sum_i f_i|x_i|^{2p} + \sum_{i \neq j} f_i f_j |x_i|^p|x_j|^p\Big] - \Big(\sum_{i \in I_L}|x_i|^p\Big)^2 \le \rho\sum_{i \in I_L}|x_i|^{2p}.$$
To bound this, we use the facts that $|x_i|^p < \frac{\|x\|_p^p}{s}$ for every $i \in I_L$ and that $p > 2$, so
$$\rho\sum_{i \in I_L}|x_i|^{2p} \le \rho\cdot\frac{\|x\|_p^p}{s}\sum_{i \in I_L}|x_i|^p \le \frac{\rho}{s}\|x\|_p^{2p}.$$
So finally we have that $\mathrm{Var}[Y] \le \frac{\rho}{s}\|x\|_p^{2p}$.
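The following quick Monte Carlo check (ours, with synthetic data chosen only to satisfy the light-element condition) illustrates Lemma 5: the rescaled sub-sampled sum is an unbiased estimate of the light part's moment, and its empirical variance stays well below the bound $\frac{\rho}{s}\|x\|_p^{2p}$.

```python
# A quick Monte Carlo check (ours) of Lemma 5.
import random

random.seed(0)
p, s, rho = 3.0, 100.0, 50.0
heavy = [25.0] * 50                                  # a few large coordinates
light = [random.uniform(0, 1) for _ in range(5000)]  # the light coordinates I_L
F_total = sum(v ** p for v in heavy + light)
F_light = sum(v ** p for v in light)
assert all(v ** p <= F_total / s for v in light)     # light condition of Lemma 5

samples = []
for _ in range(2000):
    Y = rho * sum(v ** p for v in light if random.random() < 1.0 / rho)
    samples.append(Y)

mean = sum(samples) / len(samples)
var = sum((y - mean) ** 2 for y in samples) / len(samples)
print(round(mean / F_light, 3))            # close to 1: unbiased
print(var <= rho / s * F_total ** 2)       # empirical variance vs. the Lemma 5 bound
```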
Proof of Lemma 6. Let $Y$ denote the average of the $k\rho/s$ independent repetitions. Since each repetition is run with sub-sampling rate $\frac{1}{\rho\epsilon^2}$, Lemma 5 gives a per-repetition variance of at most $\frac{\rho\epsilon^2}{s}\|x\|_p^{2p}$, and hence $\mathrm{Var}[Y] \le \frac{s}{k\rho}\cdot\frac{\rho\epsilon^2}{s}\|x\|_p^{2p} = \frac{\epsilon^2}{k}\|x\|_p^{2p}$. By Chebyshev's inequality, we have that
$$\Pr\Big\{\big|Y - \|x_{I_L}\|_p^p\big| > \epsilon\|x\|_p^p\Big\} \le \frac{\mathrm{Var}[Y]}{\epsilon^2\|x\|_p^{2p}} \le \frac{1}{k}.$$

Proof of Theorem 7. For the heavy hitters, note that $|I_H| \le n/s = \sqrt{n}$. For this part, we can use $9B(\sqrt{n})$ space to obtain an estimate $H$ for which $|H - \|x_{I_H}\|_p^p| \le \epsilon\|x_{I_H}\|_p^p$ with probability at least $7/8$, by independently running 9 estimators and taking the median.

For the light elements, for every sub-sampling round $i$, we independently run $O(1)$ estimators with rate $\frac{1}{\sqrt{n}\epsilon^2}$ and use their median $Z_i$ as the estimate of this level of sub-sampling. Let $Y_i$ be the true $F_p$-moment of this sub-sample. Then with probability at least $99/100$ it holds that $|Y_i - Z_i| \le \epsilon Y_i$. Taking a union bound, we get that if we do the sub-sampling 10 times, with probability at least $9/10$, $|Y_i - Z_i| \le \epsilon\|x_{I_L}\|_p^p$ holds for every $i$. Using Lemma 6 we get that, with probability at least $9/10$, the averaged estimate $Z$ satisfies $|Z - \|x_{I_L}\|_p^p| \le \epsilon\|x_{I_L}\|_p^p + \epsilon\|x\|_p^p$. The overall success probability of this part is at least $4/5$.

So we get that $|H + Z - \|x\|_p^p| \le \epsilon\|x_{I_L}\|_p^p + \epsilon\|x\|_p^p + \epsilon\|x_{I_H}\|_p^p \le 2\epsilon\|x\|_p^p$ with success probability at least $3/5$. The total space complexity is $B(\sqrt{n}) + O(1)\cdot B(\sqrt{n}/\epsilon^2) = O(\epsilon^{-4}n^{\frac{1}{2}-\frac{1}{p}}\log(n)\log(M))$.

Proof of Theorem 8. For an element $x_i$ which the oracle indicates as light, we know that with probability $1 - \delta$ it is truly light. Combined with Lemma 5, we obtain that
$$\mathrm{Var}[Y] \le \rho(1-\delta)\sum_{i \in I_L}|x_i|^{2p} + \rho\delta\sum_{i \in I_H}|x_i|^{2p} \le (1-\delta)\frac{\rho}{s}\|x\|_p^{2p} + \rho\delta\|x\|_p^{2p}.$$
Hence, when $\delta = O(\frac{1}{\sqrt{n}})$, Theorem 7 continues to hold. Otherwise we can let $s = n\delta$ and $\rho = \epsilon^2\delta^{-1}$, and the space complexity becomes $O(\epsilon^{-4}(n\delta)^{1-\frac{2}{p}}\log(n)\log(M))$ bits.

Lower Bound. The upper bound in Theorem 7 is asymptotically tight in $n$ up to logarithmic factors, since we have the following matching lower bound even in the presence of an oracle which indicates whether $|x_i|^p > \frac{1}{\sqrt{n}}\|x\|_p^p$.

Theorem 9. Suppose that $1/\sqrt{n} < \epsilon < 1/4$ and that a heavy hitter oracle indicates whether $|x_i|^p > \frac{1}{\sqrt{n}}\|x\|_p^p$ for the whole underlying vector $x$. Any randomized streaming algorithm that estimates $\|x\|_p^p$ within a factor of $1 \pm \epsilon$ with probability $9/10$ requires $\Omega(\epsilon^{-2/p}n^{1/2-1/p})$ bits of space.

Proof of Theorem 9. We assume $\sqrt{n}$ to be an integer and define our hard instance as follows: for every $1 \le i \le \sqrt{n}$, we let $x_i = 1$ with probability $1 - 1/\sqrt{n}$, and $x_i = (4\epsilon)^{1/p}n^{1/(2p)}$ with probability $1/\sqrt{n}$; for $i > \sqrt{n}$, we set $x_i = x_{i \bmod \sqrt{n}}$. In this case, $\|x\|_p^p \ge n$, so no coordinate is a heavy hitter. We note that if our algorithm can output a $(1 \pm \epsilon)$-approximation of $\|x\|_p^p$, it can also output a $(1 \pm \epsilon)$-approximation of $\|x_{[\sqrt{n}]}\|_p^p$.

To prove our lower bound, we use the $\ell_\infty$ communication problem with gap $k$ from Bar-Yossef et al. (2004): there are two parties, Alice and Bob, holding vectors $a, b \in \mathbb{Z}^m$ respectively, and their goal is to decide whether $\|a - b\|_\infty \le 1$ or $\|a - b\|_\infty \ge k$. From Bar-Yossef et al. (2004), this problem has an $\Omega(m/k^2)$-bit lower bound. Now we claim that if we set $m = \sqrt{n}$, $k = (4\epsilon)^{1/p}n^{1/(2p)}$ and make $a - b = x_{[\sqrt{n}]}$, then any streaming algorithm that outputs a $(1 \pm \epsilon)$-approximation $Y$ of $\|x_{[\sqrt{n}]}\|_p^p$ can be used to build a communication protocol solving this problem with communication proportional to the algorithm's space complexity. In this case, if $\|a - b\|_\infty \le 1$, then we have $Y \le (1+\epsilon)\|x_{[\sqrt{n}]}\|_p^p \le (1+\epsilon)\sqrt{n}$. Otherwise, $Y \ge (1-\epsilon)(\sqrt{n} - 1 + 4\epsilon\sqrt{n}) > (1+\epsilon)\sqrt{n}$. By this reduction, we finally get an $\Omega(m/k^2) = \Omega(\epsilon^{-2/p}n^{1/2-1/p})$-bit space lower bound under the oracle assumption.
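The following small numerical sanity check (ours, with an arbitrary choice of $n$, $\epsilon$ and $p$) verifies the two claims used in the proof above: no coordinate of the hard instance is heavy with respect to the stated oracle threshold, and the two communication cases are separated by more than a $(1 \pm \epsilon)$ factor.

```python
# A numerical sanity check (ours) of the separation claims in the proof of Theorem 9.
n, eps, p = 10**8, 0.1, 3.0
sqrt_n = int(n ** 0.5)
big = (4 * eps) ** (1 / p) * n ** (1 / (2 * p))   # value of the large coordinates

# No coordinate is heavy: |x_i|^p <= ||x||_p^p / sqrt(n), using ||x||_p^p >= n
# (there are n coordinates, each of absolute value at least 1).
print(big ** p <= n / sqrt_n)

# Separation between the two cases, restricted to the first sqrt(n) coordinates.
small_case = (1 + eps) * sqrt_n
large_case = (1 - eps) * (sqrt_n - 1 + 4 * eps * sqrt_n)
print(large_case > small_case)
```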
C FAST MOMENT ESTIMATION, $0 < p < 2$

When $0 < p < 2$, the $F_p$ estimation problem can be solved in an optimal $O(\epsilon^{-2}\log n)$ bits of space. Here we show how to improve the update time while maintaining this amount of space.

In the rest of the section, we assume there are two oracles: one tells us whether $|x_i|^p \ge \epsilon^2\|x\|_p^p$, and the other tells us whether $|x_i|^p \ge \frac{\epsilon^2}{\log^2(1/\epsilon)\log\log(1/\epsilon)}\|x\|_p^p$. We call the three resulting categories of elements heavy elements, medium elements, and light elements. Let $I_H, I_M, I_L \subseteq [n]$ denote the subsets of indices of the heavy elements, medium elements, and light elements, respectively. For the heavy elements, we can use $O(\epsilon^{-2}\log(M))$ bits of space to store them. We now turn to the other two parts.

For the light elements, we will use the LIGHTESTIMATOR from Section 3 of Kane et al. (2011):

Lemma 10 (Kane et al. (2011)). Suppose we are given $0 < \epsilon < 1$ and a list $L \subseteq [n]$ such that $i \in L$ if and only if $|x_i|^p \ge \epsilon^2\|x\|_p^p$. There is an algorithm LIGHTESTIMATOR which returns an estimate $X$ of $\|x_{[n]\setminus L}\|_p^p$ such that $|X - \|x_{[n]\setminus L}\|_p^p| \le \epsilon\|x\|_p^p$ with probability at least 0.9. The space usage is $O(\epsilon^{-2}\log(nmM))$, the amortized update time is $O(\log^2(1/\epsilon)\log\log(1/\epsilon))$, and the reporting time is $O(1/\epsilon^2)$.

For the light elements $x_i$, note that we have the condition $|x_i|^p \le \frac{\epsilon^2}{\log^2(1/\epsilon)\log\log(1/\epsilon)}\|x\|_p^p$. Using Lemma 5, Lemma 6 and the argument in Theorem 7, if we use LIGHTESTIMATOR in the sub-sampling algorithm with rate $1/\rho$ for $\rho = \log^2(1/\epsilon)\log\log(1/\epsilon)$ on the light elements, we just need to sub-sample $O(1)$ times to obtain an approximation $X$ of $\|x_{I_L}\|_p^p$ such that $|X - \|x_{I_L}\|_p^p| \le \epsilon\|x\|_p^p$ with probability at least 0.9. Note that for each update $(i, v)$ in the sub-sampling process, if $x_i$ is not selected by the sub-sampling algorithm, the update time is $O(1)$ (we do not need to do anything); otherwise, by Kane et al. (2011) we know that the amortized update time is $O(\log^2(1/\epsilon)\log\log(1/\epsilon))$. Notice that in Lemma 5 we only need a pairwise-independent hash function for the sub-sampling. So during the process, the expected amortized update time is $O(\frac{\log^2(1/\epsilon)\log\log(1/\epsilon)}{\rho}) = O(1)$.

Theorem 11. Suppose we are given $0 < \epsilon < 1$ and the list $I_L$. There is an algorithm which returns an estimate $X$ of $\|x_{I_L}\|_p^p$ such that $|X - \|x_{I_L}\|_p^p| \le \epsilon\|x\|_p^p$ with probability at least 0.9. The space usage is $O(\epsilon^{-2}\log(nmM))$, the expected amortized update time is $O(1)$, and the reporting time is $O(1/\epsilon^2)$.

It remains to handle the medium elements. Here we will still use LIGHTESTIMATOR, but with some adjustments. Before describing our LIGHTESTIMATOR data structure, we first define the $p$-stable distribution.

Definition 1 (Zolotarev (1986)). For $0 < p < 2$, there exists a probability distribution $\mathcal{D}_p$ called the $p$-stable distribution satisfying the following property. For any positive integer $n$ and vector $x \in \mathbb{R}^n$, if $Z_1, \ldots, Z_n \sim \mathcal{D}_p$ are independent, then $\sum_{j=1}^n Z_j x_j \sim \|x\|_p Z$ for $Z \sim \mathcal{D}_p$.

Lemma 12 (Pagh and Pagh, Ostlin & Pagh (2003), Theorem 1.1). Let $S \subseteq U = [u]$ be a set of $z > 1$ elements, and let $V = [v]$ with $1 < v \le u$. Suppose the machine word size is $\Omega(\log(u))$. For any constant $c > 0$ there is a word-RAM algorithm that, using time $\log(z)\log^{O(1)}(v)$ and $O(\log(z) + \log\log(u))$ bits of space, selects a family $H$ of functions from $U$ to $V$ (independent of $S$) such that:
1. With probability $1 - O(1/z^c)$, $H$ is $z$-wise independent when restricted to $S$.
2. Any $h \in H$ can be represented by a RAM data structure using $O(z\log(v))$ bits of space, and $h$ can be evaluated in constant time after an initialization step taking $O(z)$ time.

In Kane et al. (2011), LIGHTESTIMATOR was defined by creating $R = 4/\epsilon^2$ independent instantiations of an estimator $D_1$, which we label $D_1^1, \ldots, D_1^R$, and picking a hash function $h : [n] \to [R]$ using Lemma 12.
Upon receiving an update to $x_i$ in the stream, the update was fed to $D_1^{h(i)}$. The estimator $D_1^i$ is a slightly modified version of the geometric mean estimator of Li (2008): it takes a matrix $A \in \mathbb{R}^{t \times n}$, where each row contains $\Omega(1/\epsilon^p)$-wise independent $p$-stable entries and the rows are independent of each other, and maintains $y = Ax$ in the stream. Furthermore, in parallel we run the algorithm of Kane et al. (2010a) with constant error parameter to obtain a value $\tilde F_p \in [\|x\|_p^p/2,\ 3\|x\|_p^p/2]$. The estimator $\mathrm{Est}_p$ is $\min\{C_{t,p}\prod_{j=1}^t |y_j|^{p/t},\ \tilde F_p/\epsilon\}$. Hence in Kane et al. (2011), the update time is the time to evaluate a $\Theta(1/\epsilon^p)$-wise independent hash function over a field of size $\mathrm{poly}(nmM)$, which is $O(\log^2(1/\epsilon)\log\log(1/\epsilon))$. Therefore, if we can store the hash values of the $p$-stable entries for the medium elements, then we can improve the update time to $O(1)$ for this part (note that the parameter $t$ is a constant). We need the following lemma.

Lemma 13. Let $k = O(\frac{1}{\epsilon^2}\log^2\frac{1}{\epsilon}\log\log\frac{1}{\epsilon})$ and let $X_1, X_2, \ldots, X_k$ be $p$-stable random variables. With probability at least 0.9, the following holds. There exists $T \subseteq [k]$ with $|T| = O(1/\epsilon^2)$ such that $|X_i| \le \mathrm{poly}\log(1/\epsilon)$ for all $i \notin T$; consequently, each such $X_i$ can be approximated by some $\tilde X_i$ satisfying $|X_i - \tilde X_i| \le \mathrm{poly}(\epsilon)$, and each such $\tilde X_i$ can be recovered from $O(\log(1/\epsilon))$ bits.

To prove this, we need the following result.

Lemma 14 (Nolan (2003), Theorem 1.12). For fixed $0 < p < 2$, the probability density function $\varphi_p$ of the $p$-stable distribution satisfies $\varphi_p(x) = O(1/(1 + |x|^{p+1}))$ and is an even function. The cumulative distribution function satisfies $1 - \Phi_p(x) = O(|x|^{-p})$.

Proof of Lemma 13. Let $I = [-\log^{3/p}(1/\epsilon),\ \log^{3/p}(1/\epsilon)]$. For a $p$-stable random variable $X_i$, by Lemma 14, $\Pr\{X_i \notin I\} = O(1/\log^3(1/\epsilon))$. So the expected number of $X_i$ for which $X_i \notin I$ is $O(k/\log^3(1/\epsilon)) = O(1/\epsilon^2)$. By Markov's inequality, with probability at least 0.9, the number of these $X_i$ is $O(1/\epsilon^2)$. For the $X_i$ contained in $I$, we can uniformly partition $I$ into subintervals of length $\mathrm{poly}(\epsilon)$, of which there are $O(\mathrm{poly}(1/\epsilon))$ many. Let $\tilde X_i$ be the nearest partition point. It is clear that $\tilde X_i$ can be recovered from the $O(\log(1/\epsilon))$ bits used to store the index of the partition point.

The following lemma tells us that we can use the approximate values of the $p$-stable random variables in $\mathrm{Est}_p$.

Lemma 15. Let $k = O(\frac{1}{\epsilon^2}\log^2\frac{1}{\epsilon}\log\log\frac{1}{\epsilon})$. Suppose we are given $a, b, x \in \mathbb{R}^k$ with $|a_i - b_i| \le \epsilon^q$ for all $i$. Then we have $|\langle a, x\rangle - \langle b, x\rangle| \le \epsilon^{q-2}\|x\|_p$ for sufficiently small $\epsilon$.

Proof. By the Cauchy-Schwarz inequality, $|\langle a - b, x\rangle| \le \|a - b\|_2\|x\|_2 \le \sqrt{k\epsilon^{2q}}\,\|x\|_2 \le \epsilon^{q-2}\|x\|_2 \le \epsilon^{q-2}\|x\|_p$ (recall that $p < 2$).

Finally, we have the following theorem.

Theorem 16. Suppose we are given $0 < \epsilon < 1$ and the list $I_M$. There is an algorithm which returns an estimate $X$ of $\|x_{I_M}\|_p^p$ such that $|X - \|x_{I_M}\|_p^p| \le 2\epsilon\|x\|_p^p$ with probability at least 0.8. The space usage is $O(\epsilon^{-2}\max\{\log(nmM),\ \log^3(1/\epsilon)\log\log(1/\epsilon)\})$, the amortized update time is $O(1)$, and the reporting time is $O(1/\epsilon^2)$.

Proof. Let $k = \frac{1}{\epsilon^2}\log^2\frac{1}{\epsilon}\log\log\frac{1}{\epsilon}$. Note that there are at most $k$ medium elements. Using Lemma 12, we pick two hash functions $h_1 : [n] \to [k]$ and $h_2 : [n] \to [1/\epsilon^2]$, and we can assume that with high probability they perfectly hash our input set of items. Lemma 13 tells us that with probability at least 0.9, we can use these two hash functions and $O(1)$ time to read the $p$-stable random variables for the medium elements (if $X_i \in I$, we store $\tilde X_i$ in the cell to which $h_1(i)$ refers, using $O(\log(1/\epsilon))$ bits;
otherwise we store $X_i$ in the cell to which $h_2(i)$ refers, using $O(\log(M))$ bits). Suppose that $\varphi(x)$ is the pdf of the $p$-stable distribution. For a $p$-stable $Y$, we have that $\Pr\{|Y| < \epsilon^q\} = \int_{-\epsilon^q}^{\epsilon^q}\varphi(x)\,dx \le 2\varphi(0)\epsilon^q = \mathrm{poly}(\epsilon)$ when $q$ is sufficiently large. Hence, in Lemma 13 and Lemma 15, we can pick a sufficiently large $q$ such that with probability at least $1 - \mathrm{poly}(\epsilon)$, all of the true values $y_i$ used by $\mathrm{Est}_p$ are at least $\epsilon^q\|x\|_p$ in magnitude, and the values $\tilde y_i$ computed by $\mathrm{Est}_p$ using the stored $p$-stable random variables satisfy $|y_i - \tilde y_i| \le \epsilon^{q+\frac{1}{p}}\|x\|_p$. So the approximation from the $p$-stable random variables affects our estimate by at most $\epsilon\|x\|_p^p$. This completes the proof.

Using Theorems 11 and 16 we have our result:

Theorem 17. Let $0 < p < 2$ and $0 < \epsilon < 1/2$. There exists a randomized algorithm which outputs $(1 \pm 3\epsilon)\|x\|_p^p$ with probability at least 0.7 using $O(\epsilon^{-2}\max\{\log(nmM),\ \log^3(1/\epsilon)\log\log(1/\epsilon)\})$ bits of space. The expected amortized update time is $O(1)$. The reporting time is $O(1/\epsilon^2)$.

In the case of a noisy oracle, we assume that for a fixed $s$, our oracle does not identify more than $O(s)$ items $x_i$ as heavy hitters (meaning that $|x_i|^p > \frac{1}{s}\|x\|_p^p$). Then we can adapt the preceding theorem to the noisy-oracle setting.

Theorem 18. Let $0 < p < 2$ and $0 < \epsilon < 1/2$. Suppose that the heavy hitter oracle errs with probability $\delta = O(\frac{\epsilon^2}{\log^2(1/\epsilon)\log\log(1/\epsilon)})$. There exists an algorithm which outputs $(1 \pm 3\epsilon)\|x\|_p^p$ with probability at least 0.7 using $O(\epsilon^{-2}\max\{\log(nmM),\ \log^3(1/\epsilon)\log\log(1/\epsilon)\})$ bits of space. The expected amortized update time is $O(1)$. The reporting time is $O(1/\epsilon^2)$.

Proof. We only need to consider the difference in the estimation of $\|x_{I_L}\|_p^p$. When $\delta = O(\frac{\epsilon^2}{\log^2(1/\epsilon)\log\log(1/\epsilon)})$, note that our sub-sampling rate is $\frac{1}{\log^2(1/\epsilon)\log\log(1/\epsilon)}$. It follows from Theorem 8 that Theorem 17 continues to hold.

D CASCADED NORMS

For notational simplicity, we consider approximating $F_k(F_p(x)) = \sum_{i=1}^n\big(\sum_{j=1}^d |x_{ij}|^p\big)^k$ with $k \ge 1$ and $p \ge 2$. The cascaded norm $\|x\|_{k,p}$ then satisfies $\|x\|_{k,p}^k = F_{k/p}(F_p(x))$ for $k \ge p \ge 2$. We follow the algorithm in Jayram & Woodruff (2009), which first downsamples the rows in $\log n$ levels, then samples in each level $Q = O(n^{1-1/k})$ entries from the row-sampled submatrix proportionally to $|x_{ij}|^p$, and at last estimates the $F_k(F_p)$ cascaded norm from those samples. The first step of $\ell_p$-sampling is further decomposed into two steps: (i) dividing the entries into layers by their magnitudes and uniformly sampling a sufficiently large set of entries in each layer, and (ii) performing $\ell_p$-sampling from those samples. We shall improve the space complexity of Step (i) with a heavy hitter oracle.

We first elaborate on how Step (i) was done. It splits the entries of the row-sampled matrix $X$ into layers, where the $t$-th layer $S_t(X)$ consists of the entries in $[\zeta\eta^{t-1}, \zeta\eta^t]$. A layer $t$ is said to be contributing if $|S_t(X)|(\zeta\eta^t)^p \ge F_p(X)/(B\theta)$. It needs to return $\beta_t = \theta Q|S_t(X)|(\zeta\eta^t)^p/F_p(X)$ samples from each contributing layer $t$. Suppose that $|S_t(X)|/2^j \le \beta_t < |S_t(X)|/2^{j-1}$ for some $j$, and the algorithm subsamples each entry with probability $1/2^j$, obtaining a subset $Y_j$ of entries. It can be shown that, with high probability, for a contributing layer $t$ it holds that $(\zeta\eta^t)^2 \ge \|Y_j\|_2^2/(Q^{2/p}\theta^4|X|^{1-2/p}\log|X|)$, and therefore the standard COUNT-SKETCH algorithm can recover all those entries with $\tilde O(Q^{2/p}|X|^{1-2/p})$ space.

Consider the top level, without subsampling of the rows. Suppose that the heavy hitter oracle determines whether $|x_i|^p \ge \|x\|_p^p/T$ for each entry $x_i$ of the matrix $x$, for some threshold $T$.
In the algorithm of Step (i), without loss of generality, we can assume each layer contains either heavy entries only or light entries only, and we call the layer a heavy layer or a light layer accordingly. If a light layer is contributing, then since each entry contributes at most $F_p/T$, there must exist at least $N_t = T|S_t(X)|(\zeta\eta^t)^p/F_p$ entries in that layer. We require that $N_t \ge \beta_t$, for which it suffices to have $T = \Omega(\theta Q)$. We can then downsample the layer at rate $\theta Q/T$ and apply the preceding algorithm, so the number of buckets can be reduced by a factor of $T/(\theta Q)$. Note that the heavy layers come from the at most $T$ heavy hitters. This leads to a space complexity of $\tilde O\big(Q^{2/p}T^{1-2/p} + Q^{2/p}|X|^{1-2/p}/(T/(\theta Q))^{1-2/p}\big)$. Let $T = \Theta((|X|Q)^{1/2})$. In this case, $T = \Omega(\theta Q)$ is satisfied and the space complexity is $\tilde O(Q^{1/2+1/p}|X|^{1/2-1/p})$.

For the $j$-th level of row sampling, note that the same (global) heavy hitter oracle, when applied to this row-subsampled matrix, effectively has $T$ replaced with $T/2^j$. However, $|X|$ is also replaced with $|X|/2^j$, and thus we would require $T = \Omega(2^j\theta Q)$ and we can downsample the layer at rate $2^j\theta Q/T$. This means that the space complexity is a $(2^{-j})^{1-2/p}$ fraction of the space of the top level. The constraint $T = \Omega(2^j\theta Q)$ allows for $j \le \frac{1}{2}\log\frac{|X|}{Q}$. For $j > \frac{1}{2}\log\frac{|X|}{Q}$, we can just run the old algorithm with space $\tilde O(Q^{2/p}(|X|/2^j)^{1-2/p}) \le \tilde O(Q^{1/2+1/p}|X|^{1/2-1/p})$. Plugging in $Q = \Theta(n^{1-1/k})$ and $|X| = nd$, and noting that the space of the subsequent steps is dominated by this very first sampling step, we have:

Theorem 19. Let $\epsilon > 0$ and let $k \ge 1$, $p \ge 2$ be constants. There exists a randomized algorithm which receives an $n \times d$ matrix $x$ and outputs a $(1 + \epsilon)$-approximation to $F_k(F_p(x))$ with probability at least 2/3 using $\tilde O(n^{1-\frac{1}{2k}-\frac{1}{pk}}d^{\frac{1}{2}-\frac{1}{p}})$ bits of space.

Replacing $k$ with $k/p$, we obtain our final result for cascaded norms.

Corollary 20. Let $\epsilon > 0$ and let $k \ge p \ge 2$ be constants. There exists a randomized algorithm which receives an $n \times d$ matrix $x$ in a stream of additions and deletions to its coordinates, and outputs a $(1 + \epsilon)$-approximation to $\|x\|_{k,p}$ with probability at least 2/3 using $\tilde O(n^{1-\frac{1}{k}-\frac{p}{2k}}d^{\frac{1}{2}-\frac{1}{p}})$ bits of space.

E RECTANGULAR $F_p$

The rectangle-efficient algorithm in Tirthapura & Woodruff (2012) is based on the $F_p$ moment estimation algorithm in Braverman & Ostrovsky (2010). In Braverman & Ostrovsky (2010), they first subsample to obtain substreams $D_j$ ($j = 0, 1, 2, \ldots, \log n$), where $D_0$ is the full stream and, if $i \in D_j$, then $i \in D_{j+1}$ with probability 1/2. For every substream $D_j$, they run the COUNT-SKETCH algorithm to recover all $(\alpha^{2/p}/n^{1-2/p})$-heavy hitters, whence they can obtain a good final estimate. The algorithm in Tirthapura & Woodruff (2012) is similar. Instead of updating the counter of each coordinate inside a rectangle, they developed a rectangle-efficient data structure called RECTANGLECOUNTSKETCH. We follow their notation that $O^*(f)$ denotes a function of the form $f \cdot \mathrm{poly}(1/\epsilon, d, \log(m\Delta/\delta))$.

Lemma 21 (RECTANGLECOUNTSKETCH, Tirthapura & Woodruff (2012)). The data structure RECTANGLECOUNTSKETCH($\gamma$) can be updated rectangle-efficiently in time $O^*(\gamma^{-2})$. The total space is $O^*(\gamma^{-2})$ words. The data structure can be used to answer any query $i \in GF(\Delta)^d$, returning a number $\Phi(i)$ with $|\Phi(i) - x_i| \le \gamma\|x\|_2$. The algorithm succeeds on all queries simultaneously with probability $1 - \delta$.

Under an arbitrary mapping $g : GF(\Delta)^d \to \{1, 2, \ldots, \Delta^d\}$, they subsample the items into a logarithmic number $\varphi = d\log(\Delta)$ of levels. In particular, for $j \in [\varphi]$, $i \in D_j$ if $g(i) \le \Delta^d/2^{j-1}$.
In each level, they give an algorithm which invokes RECTANGLECOUNTSKETCH($\gamma$) to recover a list of the $F_p$ heavy hitters. They then use the heavy hitters from each level to calculate the final estimate.

Lemma 22 (Tirthapura & Woodruff (2012)). For $p > 2$, there is an algorithm which uses RECTANGLECOUNTSKETCH($\gamma$) with $\gamma = \Theta(\epsilon^{1+\frac{1}{p}}\alpha^{\frac{1}{p}}/\Delta^{\frac{d}{2}-\frac{d}{p}})$ and with probability at least $1 - \delta$ outputs a set $P$ of $O(1/\alpha)$ pairs $(i, x_i')$ such that $\sum_{(i, x_i') \in P}\big||x_i'|^p - |x_i|^p\big| \le \epsilon\|x\|_p^p$ and all elements $i$ with $|x_i|^p \ge \alpha\|x\|_p^p$ appear as the first element of some pair in $P$. The algorithm uses $O^*(\gamma^{-2})$ words of space.

Since the algorithm relies on a subroutine that finds the $\ell_p$-heavy hitters, the improvement with a heavy hitter oracle is similar to that for the $F_p$-moment problem. Specifically, suppose the heavy hitter oracle is able to indicate whether $|x_i|^p > \frac{1}{\Delta^{d/2}}\|x\|_p^p$. We know from Theorem 7 that we can use a rectangle-efficient estimator for the substream which contains all the heavy items, while for the light items we can use the subsampling algorithm with rate $1/\Delta^{d/2}$. Note that for each substream we have $\|x\|_0 \le \Delta^{d/2}$ (for those $i$'s that are not in the substream, we treat $x_i = 0$), so by Hölder's inequality it holds that $\|x\|_2 \le \Delta^{\frac{d}{4}-\frac{d}{2p}}\|x\|_p$. Following the proof of Lemma 22, we can therefore set $\gamma = \Theta(\epsilon^{1+\frac{1}{p}}\alpha^{\frac{1}{p}}/\Delta^{\frac{d}{4}-\frac{d}{2p}})$ to find those heavy hitters. We thus have the following theorem.

Theorem 23. Under the assumption of a heavy hitter oracle, there is a rectangle-efficient single-pass streaming algorithm which outputs a $(1 \pm \epsilon)$-approximation to $\|x\|_p^p$ with probability $1 - \delta$ for $p > 2$. It uses $O^*(\Delta^{d(1/2-1/p)})$ bits of space and $O^*(\Delta^{d(1/2-1/p)})$ time to process each rectangle in the stream.
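The argument above relies on the standard relation $\|x\|_2 \le \|x\|_0^{\frac{1}{2}-\frac{1}{p}}\|x\|_p$ for $p > 2$, so that a substream with at most $\Delta^{d/2}$ nonzero coordinates satisfies $\|x\|_2 \le \Delta^{\frac{d}{4}-\frac{d}{2p}}\|x\|_p$. The following quick numerical check (ours, on random vectors with an assumed support size) illustrates this inequality.

```python
# A quick numerical check (ours) of the Hölder-type step used above.
import random

def norm(x, q):
    return sum(abs(v) ** q for v in x) ** (1.0 / q)

random.seed(1)
p = 4.0
for _ in range(5):
    support = random.randint(1, 50)
    x = [random.uniform(-10, 10) for _ in range(support)] + [0.0] * 100
    lhs = norm(x, 2)
    rhs = support ** (0.5 - 1.0 / p) * norm(x, p)   # ||x||_0^{1/2-1/p} * ||x||_p
    print(lhs <= rhs + 1e-9)
```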