Published as a conference paper at ICLR 2025

LEARNING-AUGMENTED FREQUENT DIRECTIONS

Anders Aamand, University of Copenhagen (andersaamanda@gmail.com)
Justin Y. Chen, MIT (justc@mit.edu)
Siddharth Gollapudi, Independent (sgollapu@berkeley.edu)
Sandeep Silwal, UW-Madison (silwal@cs.wisc.edu)
Hao Wu, University of Waterloo (hao.wu1@uwaterloo.ca)

ABSTRACT

An influential paper of Hsu et al. (ICLR 19) introduced the study of learning-augmented streaming algorithms in the context of frequency estimation. A fundamental problem in the streaming literature, the goal of frequency estimation is to approximate the number of occurrences of items appearing in a long stream of data using only a small amount of memory. Hsu et al. develop a natural framework to combine the worst-case guarantees of popular solutions such as Count Min and Count Sketch with learned predictions of high-frequency elements. They demonstrate that learning the underlying structure of data can be used to yield better streaming algorithms, both in theory and practice. We simplify and generalize past work on learning-augmented frequency estimation. Our first contribution is a learning-augmented variant of the Misra-Gries algorithm which improves upon the error of learned Count Min and learned Count Sketch and achieves the state-of-the-art performance of randomized algorithms (Aamand et al., NeurIPS 23) with a simpler, deterministic algorithm. Our second contribution is to adapt learning-augmentation to a high-dimensional generalization of frequency estimation corresponding to finding important directions (top singular vectors) of a matrix given its rows one-by-one in a stream. We analyze a learning-augmented variant of the Frequent Directions algorithm, extending the theoretical and empirical understanding of learned predictions to matrix streaming.
1 INTRODUCTION

Learning-augmented algorithms combine the worst-case analysis of traditional algorithm design with machine learning to exploit structure in the specific inputs on which the algorithm is deployed. A burgeoning line of work in this context has studied algorithms furnished with predictions given by domain experts or learned from past data. This general methodology has been applied to create input-optimized data structures (Kraska et al., 2018; Mitzenmacher, 2018), graph algorithms (Dinitz et al., 2021; Chen et al., 2022c), online algorithms (Lykouris & Vassilvitskii, 2021; Gollapudi & Panigrahi, 2019), and streaming algorithms (Hsu et al., 2019; Jiang et al., 2020; Chen et al., 2022a), among many other applications.¹

¹There are hundreds of papers written on this topic. See the survey of Mitzenmacher & Vassilvitskii (2022) or the website https://algorithms-with-predictions.github.io/.

Within the context of streaming algorithms, where the input arrives in an online fashion and the algorithm has too little memory to store everything, predictors can highlight data which are worth remembering. This intuition was formalized in an influential work of Hsu et al. (2019) in the context of frequency estimation, a fundamental streaming problem where the goal is to provide an estimate of how many times any element appeared in the stream. Given access to a heavy-hitter oracle identifying the highest frequency elements, Hsu et al. (2019) give a natural framework where the heavy hitters are counted exactly while the rest of the frequencies are approximated using standard algorithms such as Count Min (Cormode & Muthukrishnan, 2005) or Count Sketch (Charikar et al., 2002). They study the setting where the frequencies follow a power-law distribution, commonly seen in practice and therefore well-studied for frequency estimation (Cormode & Muthukrishnan, 2005; Metwally et al., 2005; Minton & Price, 2012). Given access
to an oracle which can recover the heaviest elements, they give improved error bounds, where the error is taken in expectation over the empirical distribution of frequencies. A sequence of follow-up works investigates how to learn good predictors (Du et al., 2021; Chen et al., 2022b), applies the results to other streaming models (Shahout et al., 2024), and gives improved algorithms (Aamand et al., 2023).

| Algorithm | Weighted Error | Predictions? | Analysis |
|---|---|---|---|
| Frequent Directions | $\Theta\!\left(\frac{\ln m \,\ln\frac{2d}{m}}{(\ln d)^2} \cdot \frac{\|A\|_F^2}{m}\right)$ | No | Theorem 3.3 |
| Learned Frequent Directions | $\Theta\!\left(\frac{1}{(\ln d)^2} \cdot \frac{\|A\|_F^2}{m}\right)$ | Yes | Theorem 3.4 |

Table 1: Error bounds for Frequent Directions with $n$ input vectors from the domain $\mathbb{R}^d$ using $m \cdot d$ words of memory, assuming that the matrix consisting of the input vectors has singular values $\sigma_i^2 \propto 1/i$. The weighted error is defined by Equation (2).

In this work, we define and analyze the corresponding problem in the setting where each data point is a vector rather than an integer, and the goal is to find frequent directions rather than frequent elements (capturing low-rank structure in the space spanned by the vectors). This setting of matrix streaming is an important tool in big-data applications including image analysis, text processing, and numerical linear algebra. Low-rank approximations via SVD/PCA are ubiquitous in these applications, and streaming algorithms for this problem allow for memory-efficient estimation of these approximations. In the matrix context, we define a corresponding notion of expected error and power-law distributed data. We develop a learning-augmented streaming algorithm for the problem based on the Frequent Directions (FD) algorithm (Ghashami et al., 2016) and give a detailed theoretical analysis of the space/error tradeoffs of our algorithm given predictions of the important directions of the input matrix. Our framework captures and significantly generalizes that of frequency estimation in one dimension.
When the input vectors are basis vectors, our algorithm corresponds to a learning-augmented version of the popular Misra-Gries (Misra & Gries, 1982) heavy-hitters algorithm. In this special case of our model, our algorithm achieves state-of-the-art bounds for learning-augmented frequency estimation, matching those of Aamand et al. (2023). In contrast to prior work, we achieve this performance without specializing our algorithm for power-law distributed data.

We experimentally verify the performance of our learning-augmented algorithms on real data. Following prior work, we consider datasets containing numerous problem instances in a temporal order (each instance being either a sequence of items for frequency estimation or a sequence of matrices for matrix streaming). Using predictions trained on past data, we demonstrate the power of incorporating learned structure in our algorithms, achieving state-of-the-art performance in both settings.

Our Contributions. We generalize the learning-augmented frequency estimation model to the matrix streaming setting: each stream element is a row vector of a matrix $A$. We define corresponding notions of expected error and power-law distributed data with respect to the singular vectors and values of $A$. In this setting, we develop and analyze a learning-augmented version of the Frequent Directions algorithm of Ghashami et al. (2016). Given predictions of the important directions (corresponding to the top right singular vectors of $A$), we demonstrate an asymptotically better space/error tradeoff than the base algorithm without learning. See Table 1.

As a special case of our setting and a corollary of our analysis, we bound the performance of learning-augmented Misra-Gries for frequency estimation. In the learning-augmented setting, past work has analyzed the randomized algorithms Count Min and Count Sketch as well as specialized variants (Hsu et al., 2019; Aamand et al., 2023).
To our knowledge, no such analysis has been done prior to our work for the popular, deterministic Misra-Gries algorithm. Our analysis shows that learned Misra-Gries achieves state-of-the-art learning-augmented frequency estimation bounds, without randomness or specializing the algorithm for Zipfian data. See Table 2.

We empirically validate our theoretical results via experiments on real data. For matrix streaming, our learning-augmented Frequent Directions algorithm outperforms the non-learned version by 1-2 orders of magnitude on all datasets. For frequency estimation, our learned Misra-Gries algorithm achieves superior or competitive performance against the baselines.

| Algorithm | Weighted Error | Rand? | Pred? | Reference |
|---|---|---|---|---|
| Count Min | $\Theta(\cdots)$ | Yes | No | (Aamand et al., 2023) |
| Learned Count Min | $\Theta\!\left(\frac{(\ln\frac{d}{m})^2}{m} \cdot \frac{n}{(\ln d)^2}\right)$ | Yes | Yes | (Hsu et al., 2019) |
| Count Sketch | $O(\cdots)$ | Yes | No | (Aamand et al., 2023) |
| Learned Count Sketch | $O\!\left(\frac{\ln\frac{d}{m}}{m} \cdot \frac{n}{(\ln d)^2}\right)$ | Yes | Yes | (Aamand et al., 2023) |
| Count Sketch++ | $O\!\left(\frac{\ln m + \mathrm{poly}(\ln\ln d)}{m} \cdot \frac{n}{(\ln d)^2}\right)$ | Yes | No | (Aamand et al., 2023) |
| Learned Count Sketch++ | $O\!\left(\frac{1}{m} \cdot \frac{n}{(\ln d)^2}\right)$ | Yes | Yes | (Aamand et al., 2023) |
| Misra-Gries | $\Theta\!\left(\frac{\ln m \,\ln\frac{2d}{m}}{m} \cdot \frac{n}{(\ln d)^2}\right)$ | No | No | Theorem 3.1 |
| Learned Misra-Gries | $\Theta\!\left(\frac{1}{m} \cdot \frac{n}{(\ln d)^2}\right)$ | No | Yes | Theorem 3.2 |

Table 2: Error bounds for frequency estimation with $n$ input elements from the domain $[d]$ using $m$ words of memory, assuming that the frequency of element $i \in [d]$ follows $f(i) \propto 1/i$. The weighted error indicates that element $i$ is queried with probability proportional to $1/i$.

Related Work. The learning-augmented frequency estimation problem was introduced in Hsu et al. (2019). They suggest the model of predicted frequencies and give the first analysis of learning-augmented Count Min and Count Sketch with weighted error and Zipfian frequencies. Du et al. (2021) evaluate several choices of loss function for learning the frequency predictor, and Chen et al.
(2022b) develop a procedure to learn a good predictor itself with a streaming algorithm. Shahout et al. (2024) extend the model to sliding-window streams, where frequency estimation is restricted to recently appearing items. Shahout & Mitzenmacher (2024) analyze a learning-augmented version of the Space Saving algorithm (Metwally et al., 2005), a deterministic algorithm for frequency estimation, but, unlike our work, they do not give space/error tradeoffs comparable to Hsu et al. (2019). Aamand et al. (2023) give a tight analysis of Count Min and Count Sketch both with and without learned predictions in the setting of weighted error with Zipfian data. Furthermore, they develop a new algorithm based on Count Sketch, which we refer to as learning-augmented Count Sketch++, which has better asymptotic and empirical performance.

Matrix sketching and low-rank approximations are ubiquitous in machine learning. The line of work most pertinent to ours is that on matrix streaming, where rows arrive one-by-one and the goal is to maintain, in small space, a low-rank approximation of the full matrix. The Frequent Directions algorithm for the matrix streaming problem was introduced by Liberty (2013). Subsequent work of Ghashami & Phillips (2014a) and Woodruff (2014) refined the analysis and gave a matching lower bound. These works were merged and developed in Ghashami et al. (2016), with an even simpler analysis given by Liberty (2022).

A related line of work is on learning sketching matrices for low-rank approximation, studied in Indyk et al. (2019; 2021). Their goal is to learn a sketching matrix $S$ with few rows so that a low-rank approximation of $A$ can be recovered from $SA$. The main guarantee is that the classical low-rank approximation algorithm of Clarkson & Woodruff (2013), which uses a random $S$, can be augmented so that only half of its rows are random, while retaining worst-case error.
The learned half of $S$ can be optimized empirically, leading to a small sketch $SA$ in practice. The difference between these works and ours is that their overall procedure cannot be implemented in a single pass over the stream. We discuss other related works in Appendix A.

Organization. Section 2 delves into the necessary preliminaries for our algorithm. We define the problem of frequency estimation and its natural higher-dimensional version, introduce our notion of estimation error for these problems, and discuss the two related algorithms, Misra-Gries and Frequent Directions. In Section 3, we introduce our learning-augmented versions of Misra-Gries and Frequent Directions. We also analyze the performance of the learned Misra-Gries algorithm, postponing the analysis of learned Frequent Directions to Appendix D. Section 4 presents our experimental results, with extensive figures given in Appendix F.

2 PRELIMINARIES

Frequency Estimation. Let $n, d \in \mathbb{N}^+$, and consider a sequence $a_1, a_2, \ldots, a_n \in [d]$ arriving one by one. We are interested in the number of times each element in $[d]$ appears in the stream. Specifically, the frequency $f(i)$ of element $i \in [d]$ is defined as $f(i) := |\{t \in [n] : a_t = i\}|$. Thus, $\sum_{i \in [d]} f(i) = n$. Given estimates $\hat{f}(i)$ for each $i \in [d]$, we focus on the following weighted estimation error (Hsu et al., 2019; Aamand et al., 2023):

$$\mathrm{Err} := \sum_{i \in [d]} \frac{f(i)}{n} \left| \hat{f}(i) - f(i) \right|. \quad (1)$$

The weighted error assigns a weight to each element's estimation error proportional to its frequency, reflecting the intuition that frequent elements are queried more often than less frequent ones.

Direction Frequency. The frequency estimation problem has a natural high-dimensional extension. The input now consists of a stream of vectors $A_1, A_2, \ldots, A_n \in \mathbb{R}^d$. For each unit vector $\vec{v} \in \mathbb{R}^d$, we define its frequency $f(\vec{v})$ as the sum of the squares of the projected lengths of each input vector onto $\vec{v}$.
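Before moving to the high-dimensional version, the weighted error of Equation (1) is easy to compute directly. The following toy sketch is our own illustration (not from the paper's implementation), assuming numpy and 0-indexed elements:

```python
import numpy as np

def frequencies(stream, d):
    """Exact frequencies f(i) of the elements of [d] (0-indexed here)."""
    f = np.zeros(d)
    for a in stream:
        f[a] += 1
    return f

def weighted_error(f, f_hat):
    """Weighted error of Equation (1): sum_i (f(i)/n) * |f_hat(i) - f(i)|."""
    n = f.sum()
    return float(np.sum((f / n) * np.abs(f_hat - f)))

# Toy stream over d = 4 elements.
stream = [0, 0, 0, 1, 1, 2]
f = frequencies(stream, 4)              # exact frequencies: [3, 2, 1, 0]
f_hat = np.array([3.0, 1.0, 1.0, 1.0])  # some estimates
err = weighted_error(f, f_hat)          # only element 1 contributes: (2/6)*1 = 1/3
```

Note that the error on element 3 (estimated 1, true 0) is weighted by its true frequency 0 and therefore does not contribute, reflecting that never-appearing elements are never queried.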
More precisely, let $A \in \mathbb{R}^{n \times d}$ denote the matrix whose rows are $A_1^T, \ldots, A_n^T$. Then $f(\vec{v}) := \|A\vec{v}\|_2^2$. To see that this definition is a natural extension of element frequency in the frequency estimation problem, suppose each input vector $A_t$ is one of the standard basis vectors $\vec{e}_1, \ldots, \vec{e}_d$ in $\mathbb{R}^d$, and further restrict the frequency query vector $\vec{v}$ to be one of these standard basis vectors, i.e., $\vec{v} = \vec{e}_i$ for some $i \in [d]$. Then $f(\vec{e}_i) = \|A\vec{e}_i\|_2^2 = \sum_{t \in [n]} \langle A_t, \vec{e}_i \rangle^2 = \sum_{t \in [n]} \mathbb{1}[A_t = \vec{e}_i]$, which is simply the number of times $\vec{e}_i$ appears in $A$.

Estimation Error. Consider an algorithm that can provide an estimate $\hat{f}(\vec{v})$ of $f(\vec{v})$ for any unit vector $\vec{v} \in \mathbb{R}^d$. The estimation error of a single vector is given by $|\hat{f}(\vec{v}) - f(\vec{v})| = |\|A\vec{v}\|_2^2 - \hat{f}(\vec{v})|$. Since the set of all unit vectors in $\mathbb{R}^d$ is uncountably infinite, we propose to study the following weighted error:

$$\mathrm{Err} := \sum_{i \in [d]} \frac{\sigma_i^2}{\|A\|_F^2} \left| \|A\vec{v}_i\|_2^2 - \hat{f}(\vec{v}_i) \right|, \quad (2)$$

where $\sigma_1, \ldots, \sigma_d$ denote the singular values of $A$, $\vec{v}_1, \ldots, \vec{v}_d$ are the corresponding right singular vectors, and $\|A\|_F$ is its Frobenius norm. To see how this generalizes Equation (1), assume again that the rows of $A$ consist of standard basis vectors and that $f(\vec{e}_1) \geq f(\vec{e}_2) \geq \cdots \geq f(\vec{e}_d)$. In this case, it is straightforward to verify that $\sigma_i^2 = f(\vec{e}_i)$ and $\vec{v}_i = \vec{e}_i$ for all $i \in [d]$. Consequently, $\|A\vec{v}_i\|_2^2 = f(\vec{e}_i)$, and $\|A\|_F^2 = \sum_{i \in [d]} \sigma_i^2 = n$. Therefore, Equation (2) reduces to Equation (1) in this case.

Moreover, for a specific class of algorithms, we can offer an alternative and intuitive interpretation of the weighted error.

Lemma 2.1. For algorithms that estimate $f(\vec{v})$ by first constructing a matrix $B$ and then applying the formula $\hat{f}(\vec{v}) = \|B\vec{v}\|_2^2$, such that $0 \leq \hat{f}(\vec{v}) \leq f(\vec{v})$, the weighted error defined in Equation (2) satisfies

$$\mathrm{Err} \propto \mathbb{E}_{\vec{v} \sim N(0, A^T A)}\left[ \|A\vec{v}\|_2^2 - \hat{f}(\vec{v}) \right]. \quad (3)$$

The conditions stated in the lemma apply to the Frequent Directions algorithm (Ghashami et al., 2016), discussed later in the section.
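The basis-vector reduction above can be checked numerically. The following self-contained snippet is our own illustration (assuming numpy): for a matrix whose rows are standard basis vectors, $f(\vec{e}_i) = \|A\vec{e}_i\|_2^2$ recovers the element counts, and the squared singular values equal the sorted frequencies.

```python
import numpy as np

d = 4
counts = [3, 2, 1, 0]  # element i appears counts[i] times

# Rows of A are standard basis vectors, one row per occurrence.
A = np.array([np.eye(d)[i] for i, c in enumerate(counts) for _ in range(c)])

def f(v):
    """Direction frequency f(v) = ||A v||_2^2."""
    return float(np.linalg.norm(A @ v) ** 2)

# For v = e_i, the direction frequency is exactly the element frequency.
for i in range(d):
    assert np.isclose(f(np.eye(d)[i]), counts[i])

# The squared singular values of A are the frequencies in sorted order.
sigma = np.linalg.svd(A, compute_uv=False)
assert np.allclose(np.sort(sigma ** 2)[::-1], sorted(counts, reverse=True))
```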
The lemma asserts that the weighted error is proportional to the expected difference between $\|A\vec{v}\|_2^2$ and $\hat{f}(\vec{v})$, where $\vec{v}$ is sampled from a multivariate normal distribution with mean $0$ and covariance matrix $A^T A$. The proof is included in Appendix B.

Zipfian Distribution. We follow the assumption from prior work that, in the frequency estimation problem, the element frequencies follow a Zipfian distribution (Hsu et al., 2019; Aamand et al., 2023), i.e., $f(i) \propto 1/i$ for all $i \in [d]$. Naturally, for the high-dimensional counterpart, we assume that $\sigma_i^2 \propto 1/i$.

Misra-Gries and Frequent Directions Algorithms. The Misra-Gries algorithm (Misra & Gries, 1982) is a well-known algorithm developed for frequency estimation in the streaming setting with limited memory. Its high-dimensional counterpart is the Frequent Directions algorithm (Ghashami et al., 2016). We focus on presenting the Frequent Directions algorithm here, along with a brief explanation of how Misra-Gries can be derived from it.

The algorithm is described in Algorithm 1. The matrix $B$ created during the initialization phase can be viewed as an array of $m$ buckets, where each bucket can store a vector in $\mathbb{R}^d$. As each input vector $A_i$ arrives, the algorithm updates $B$ using an update procedure, inserting $A_i^T$ into the first available bucket of $B$. If $B$ is full, additional operations are triggered (Lines 7-10): essentially, the algorithm performs a singular value decomposition (SVD) of $B$, such that $B = \sum_{j \in [d]} \sigma_j^{(i)} \vec{u}_j^{(i)} (\vec{v}_j^{(i)})^T$, where $\vec{u}_j^{(i)}$ and $\vec{v}_j^{(i)}$ are the columns of matrices $U^{(i)}$ and $V^{(i)}$, respectively, and $\sigma_j^{(i)}$ are the diagonal entries of $\Sigma^{(i)}$. The algorithm then retains only the first $\tau - 1$ right singular vectors, $\vec{v}_1^{(i)}, \ldots, \vec{v}_{\tau-1}^{(i)}$, scaled by the factors $\left((\sigma_1^{(i)})^2 - (\sigma_\tau^{(i)})^2\right)^{1/2}, \ldots, \left((\sigma_{\tau-1}^{(i)})^2 - (\sigma_\tau^{(i)})^2\right)^{1/2}$, respectively.
Algorithm 1 Frequent Directions $\mathcal{A}_{FD}$
1: Procedure INITIALIZATION
2:   Input: sketch parameters $m, \tau, d \in \mathbb{N}^+$, s.t. $\tau \leq m \leq d$
3:   Reserve $m \times d$ space for an empty matrix $B$
4: Procedure UPDATE
5:   Input: an input vector $A_i \in \mathbb{R}^d$
6:   $B \leftarrow [B; A_i^T]$   (the matrix obtained by appending $A_i^T$ after the last row of $B$)
7:   if $B$ has $m$ rows then
8:     $U^{(i)}, \Sigma^{(i)}, V^{(i)} \leftarrow \mathrm{SVD}(B)$
9:     $\hat{\Sigma}^{(i)} \leftarrow \left(\max\{(\Sigma^{(i)})^2 - (\sigma_\tau^{(i)})^2 I,\, 0\}\right)^{1/2}$, where $\sigma_\tau^{(i)}$ is the $\tau$-th largest singular value
10:    $B \leftarrow \hat{\Sigma}^{(i)} (V^{(i)})^T$
11: Procedure RETURN
12:   return $B$

To reduce the algorithm to Misra-Gries, we make the following modifications: each input vector $A_i$ is an element of $[d]$, and $B$ is replaced by a dynamic array with a capacity of $m$. The SVD operation is replaced by an aggregation step, where identical elements in $B$ are grouped together, retaining only one copy of each along with its frequency in $B$. Consequently, Lines 7-10 now correspond to selecting the top-$(\tau - 1)$ elements and reducing their frequencies by $f(\tau)$.²

Based on recent work by Liberty (2022), Algorithm 1 possesses the following properties. For completeness, we provide a brief proof in the Appendix.

Proposition 2.2 ((Liberty, 2022)). Algorithm 1 uses $O(md)$ space, operates in $O\!\left(\frac{nm^2 d}{m + 1 - \tau}\right)$ time, and ensures that $A^T A - B^T B \succeq 0$. Moreover, it guarantees the following error bound:

$$\left\| A^T A - B^T B \right\|_2 \leq \min_{k \in [0\,.\,.\,\tau-1]} \frac{\|A - [A]_k\|_F^2}{\tau - k}, \quad (4)$$

where $\|\cdot\|_2$ is the spectral norm of a matrix, and $[A]_k$ is the best rank-$k$ approximation of $A$. Note that the error in this context is defined by the maximum distortion rather than a weighted one. If $\tau = (1 - \Omega(1))m$, the running time reduces to $O(nmd)$. Furthermore, for $k = 0$, the error bound simplifies to the original bound established by Liberty (2013). These bounds can be adapted for the Misra-Gries algorithm, where $A^T A - B^T B \succeq 0$ implies that the algorithm never overestimates element frequencies. Additionally, when implemented with a hash table, the running time of Misra-Gries can be further optimized to $O(n)$.
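The shrinkage step of Algorithm 1 is short enough to sketch directly. The following minimal numpy implementation is our own simplification (it densifies $B$ rather than reusing buffer space, and class/method names are ours), but the update rule matches Lines 6-10 above:

```python
import numpy as np

class FrequentDirections:
    """Minimal sketch of Algorithm 1: keep at most m rows; when full, shrink
    all squared singular values by the tau-th largest and drop zeroed rows."""

    def __init__(self, m, tau, d):
        assert 1 <= tau <= m <= d
        self.m, self.tau = m, tau
        self.B = np.zeros((0, d))

    def update(self, a):
        self.B = np.vstack([self.B, a.reshape(1, -1)])
        if self.B.shape[0] >= self.m:
            _, s, Vt = np.linalg.svd(self.B, full_matrices=False)
            s2 = np.maximum(s ** 2 - s[self.tau - 1] ** 2, 0.0)
            # rows tau-1 and beyond are zeroed by the shrink; keep the rest
            self.B = np.sqrt(s2[: self.tau - 1])[:, None] * Vt[: self.tau - 1]

    def estimate(self, v):
        """f_hat(v) = ||B v||_2^2, which never exceeds ||A v||_2^2."""
        return float(np.linalg.norm(self.B @ v) ** 2)

# usage: sketch a random 50 x 10 matrix with m = 6, tau = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
fd = FrequentDirections(m=6, tau=3, d=10)
for row in A:
    fd.update(row)
gap = A.T @ A - fd.B.T @ fd.B   # PSD, with spectral norm <= ||A||_F^2 / tau
```

The final comment restates Proposition 2.2 with $k = 0$: the gap matrix $A^T A - B^T B$ is positive semidefinite and its spectral norm is at most $\|A\|_F^2 / \tau$.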
3 LEARNING-AUGMENTED FREQUENT DIRECTIONS

We aim to augment the Frequent Directions algorithm with learned predictions. The framework is presented in Algorithm 2. Given space for storing $m$ vectors in $\mathbb{R}^d$, the algorithm reserves $m_L \leq m/2$ slots for the predicted frequent directions $\vec{w}_1, \ldots, \vec{w}_{m_L}$, which are orthonormal vectors returned by a learned oracle. The algorithm then initializes two separate instances of Algorithm 1, denoted by $\mathcal{A}_{FD}^{\parallel}$ and $\mathcal{A}_{FD}^{\perp}$, with space usage $m_L$ and $m - 2m_L$, respectively. After initialization, when an input vector $A_i$ arrives, the algorithm decomposes it into two components, $A_i = A_{i,\parallel} + A_{i,\perp}$, where $A_{i,\parallel}$ is the projection of $A_i$ onto the subspace spanned by $\vec{w}_1, \ldots, \vec{w}_{m_L}$, and $A_{i,\perp}$ is the component orthogonal to this subspace. The vector $A_{i,\parallel}$ is passed to $\mathcal{A}_{FD}^{\parallel}$, while $A_{i,\perp}$ is passed to $\mathcal{A}_{FD}^{\perp}$. Intuitively, $\mathcal{A}_{FD}^{\parallel}$ is responsible for computing a sketch matrix for the subspace predicted by the learned oracle, whereas $\mathcal{A}_{FD}^{\perp}$ is responsible for computing a sketch matrix for the orthogonal subspace. When the algorithm terminates, the output matrix is obtained by stacking the matrices returned by $\mathcal{A}_{FD}^{\parallel}$ and $\mathcal{A}_{FD}^{\perp}$. To adapt this framework to the learning-augmented Misra-Gries algorithm, $\mathcal{A}_{FD}^{\parallel}$ corresponds to an array recording the exact counts of the predicted elements, and $\mathcal{A}_{FD}^{\perp}$ corresponds to a Misra-Gries algorithm over all other elements.

²A common implementation of Misra-Gries sets $\tau = m$, and the aggregation step can be optimized using hash tables.

Algorithm 2 Learning-Augmented Frequent Directions $\mathcal{A}_{LFD}$
1: Procedure INITIALIZATION
2:   Input: sketch parameters $m, d \in \mathbb{N}^+$; learned oracle parameter $m_L$ s.t. $m_L \leq m/2$
3:   Let $P_H = [\vec{w}_1 \mid \ldots \mid \vec{w}_{m_L}] \in \mathbb{R}^{d \times m_L}$ be the matrix consisting of the $m_L$ orthonormal columns which are the frequent directions predicted by the learned oracle
4:   Initialize an instance of Algorithm 1: $\mathcal{A}_{FD}^{\parallel}$.initialization($m_L$, $0.5\, m_L$, $d$)
5:   Initialize an instance of Algorithm 1: $\mathcal{A}_{FD}^{\perp}$.initialization($m - 2m_L$, $0.5\,(m - 2m_L)$, $d$)
6: Procedure UPDATE
7:   Input: an input vector $A_i$
8:   $A_{i,\parallel} \leftarrow P_H P_H^T A_i$
9:   $A_{i,\perp} \leftarrow A_i - A_{i,\parallel}$
10:  $\mathcal{A}_{FD}^{\parallel}$.update($A_{i,\parallel}$)
11:  $\mathcal{A}_{FD}^{\perp}$.update($A_{i,\perp}$)
12: Procedure RETURN
13:  $B^{\parallel} \leftarrow \mathcal{A}_{FD}^{\parallel}$.return()
14:  $B^{\perp} \leftarrow \mathcal{A}_{FD}^{\perp}$.return()
15:  $B \leftarrow [B^{\parallel}; B^{\perp}]$
16:  return $B$

3.1 THEORETICAL ANALYSIS

We present the theoretical analysis of the (learned) Misra-Gries and (learned) Frequent Directions algorithms under a Zipfian distribution. The error bounds for the (learned) Misra-Gries algorithms are detailed in Theorems 3.1 and 3.2. The corresponding results for the (learned) Frequent Directions algorithms are provided in Theorems 3.3 and 3.4. The complete proofs are provided in Appendix C and Appendix D, respectively. Due to space constraints, we provide proof sketches for the (learned) Misra-Gries algorithm only. The proofs for the (learned) Frequent Directions algorithm follow similar techniques. Since the structure of Misra-Gries is simpler, analyzing its bounds first offers clearer insight into the problems.

Theorem 3.1 (Expected Error of the Misra-Gries Algorithm). Given a stream of $n$ elements from a domain $[d]$, where each element $i$ has frequency $f(i) \propto 1/i$ for $i \in [d]$, the Misra-Gries algorithm using $m$ words of memory achieves an expected error of

$$\mathrm{Err} = \Theta\!\left(\frac{\ln m \,\ln\frac{2d}{m}}{(\ln d)^2} \cdot \frac{n}{m}\right). \quad (5)$$

Proof Sketch. At a high level, we first derive an upper bound on the maximum estimation error using Proposition 2.2 under the Zipfian distribution assumption. We then partition the elements into two groups: those with frequencies exceeding this error and those without. For the first group, the estimation error for each element is bounded by the maximum error.
For the second group, since Misra-Gries never overestimates their frequencies, the error is limited to the actual frequency of each element. For each group, we can show that the weighted error is bounded above by the RHS of (5). For the lower bound, we construct an adversarial input sequence such that the weighted error of elements in the first group indeed matches the upper bound, proving that the bound is tight.

Theorem 3.2 (Expected Error of the Learned Misra-Gries Algorithm). Given a stream of $n$ elements from a domain $[d]$, where each element $i$ has frequency $f(i) \propto 1/i$ for $i \in [d]$, and assuming a perfect oracle, the learning-augmented Misra-Gries algorithm using $m$ words of memory achieves an expected error of $\mathrm{Err} = \Theta\!\left(\frac{1}{m} \cdot \frac{n}{(\ln d)^2}\right)$.

Here, a perfect oracle is defined as one that makes no mistakes in predicting the top frequent elements. The scenario where the learning oracle is not perfect is discussed later in this section.

Proof Sketch. Under the assumption of access to a perfect oracle, the algorithm makes no estimation error on the top-$m_L$ elements. For the remaining elements, the Misra-Gries algorithm never overestimates their frequencies: $\hat{f}(i) \in [0, f(i)]$. Hence the weighted error is at most

$$\sum_{i > m_L} \frac{f(i)}{n} \left| \hat{f}(i) - f(i) \right| \leq \sum_{i > m_L} \frac{1}{i \ln d} \cdot \frac{n}{i \ln d} = O\!\left(\frac{1}{m} \cdot \frac{n}{(\ln d)^2}\right). \quad (6)$$

The lower bound is obtained using a technique similar to that of Theorem 3.1, by constructing an input sequence such that the error incurred by the non-predicted elements matches the upper bound.

Comparison with Previous Work. This guarantee matches that of the learning-augmented frequency estimation algorithm of Aamand et al.
(2023), but with significant simplifications. Aamand et al. (2023) also reserve separate buckets for the predicted heavy hitters, but to get a robust algorithm in case of faulty predictions, they maintain $O(\log\log n)$ additional Count Sketch tables for determining whether an arriving element (which is not predicted to be heavy) is in fact a heavy hitter with reasonably high probability. If these tables deem the element light, they output zero as the estimate; otherwise, they use the estimate of a separate Count Sketch table. In contrast, our algorithm uses just a single implementation of the simple and classic Misra-Gries algorithm. This approach has the additional advantage of being deterministic, in contrast to Count Sketch, which is randomized.

Robustness and Resilience to Prediction Errors. We note that the learned Misra-Gries algorithm is robust in the sense that it essentially retains the error bounds of its classic counterpart regardless of predictor quality. Indeed, the learned version allocates $m/2$ space to maintain exact counts of elements predicted to be heavy and uses a classic Misra-Gries sketch of size $m/2$ for the remaining elements. Thus, it incurs no error on the elements predicted to be heavy, and on the elements predicted to be light, we get the error guarantees of classic Misra-Gries (using space $m/2$ instead of $m$).

It is further worth noting that the error bound of Theorem 3.2 holds even for non-perfect learning oracles or predictions, as long as their accuracy is high enough. Specifically, assume that the algorithm allocates some $m_L = \Omega(m)$ buckets for the learned oracle. Further, assume that only the top $c \cdot m_L$ elements with the highest frequencies are included among the $m_L$ heavy hitters predicted by the oracle, for some $c \leq 1$ (e.g., $c = 0.1$). In this case, Inequality (6) still holds: the summation now starts from $c \cdot m_L + 1$ instead of $m_L + 1$, which does not affect the asymptotic error.

The corresponding theorems for (learned) Frequent Directions are below, with proofs in Appendix D.
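The learned Misra-Gries scheme analyzed above is only a few lines of code: exact counters for the predicted heavy hitters, plus a classic Misra-Gries sketch with the remaining space for everything else. The following minimal sketch is our own (function names and the toy stream are ours, not the paper's):

```python
from collections import Counter

def learned_misra_gries(stream, m, predicted):
    """Exact counters for the m_L <= m/2 predicted heavy hitters, and classic
    Misra-Gries with the remaining counters for all other elements.
    Returned estimates never overestimate the true frequency."""
    exact = {x: 0 for x in predicted}
    mg, m_rest = {}, m - len(predicted)
    for x in stream:
        if x in exact:
            exact[x] += 1          # predicted heavy hitter: count exactly
        elif x in mg:
            mg[x] += 1
        elif len(mg) < m_rest:
            mg[x] = 1              # free counter available
        else:                      # all counters busy: decrement every counter
            for y in list(mg):
                mg[y] -= 1
                if mg[y] == 0:
                    del mg[y]
    return lambda x: exact.get(x, mg.get(x, 0))

# Zipf-like toy stream: element i appears 100 // (i + 1) times.
stream = [i for i in range(50) for _ in range(100 // (i + 1))]
est = learned_misra_gries(stream, m=10, predicted=[0, 1, 2, 3, 4])
true = Counter(stream)
# predicted elements are exact; all others are never overestimated
```

With a perfect prediction of the top-$m_L$ elements, the only error comes from the tail handled by the classic sketch, which is the structure exploited in the proof of Theorem 3.2.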
Theorem 3.3 (Expected Error of the Frequent Directions Algorithm). Assume that the singular values of the input matrix $A$ to Algorithm 1 satisfy $\sigma_i^2 \propto 1/i$ for all $i \in [d]$. Then it achieves an expected error of

$$\mathrm{Err}(\mathcal{A}_{FD}) = \Theta\!\left(\frac{\ln m \,\ln\frac{2d}{m}}{(\ln d)^2} \cdot \frac{\|A\|_F^2}{m}\right).$$

Theorem 3.4 (Expected Error of the Learned Frequent Directions Algorithm). Assume that the singular values of the input matrix $A$ to Algorithm 2 satisfy $\sigma_i^2 \propto 1/i$ for all $i \in [d]$, and that the learning oracle is perfect. Then it achieves an expected error of

$$\mathrm{Err}(\mathcal{A}_{LFD}) = \Theta\!\left(\frac{1}{(\ln d)^2} \cdot \frac{\|A\|_F^2}{m}\right).$$

Robustness of Learned Frequent Directions. It turns out that Algorithm 2 does not come with a robustness guarantee similar to that of learned Misra-Gries discussed above. In fact, we can construct adversarial inputs for which the expected error is much worse than in the classic setting. Fortunately, there is a way to modify the algorithm slightly, using the fact that the residual error $\|A - [A]_k\|_F^2$ can be computed within a constant factor by an algorithm from Li et al. (2024). Since the error of the algorithm scales with the residual error, this essentially allows us to determine whether we should output the result of a learned or a standard Frequent Directions algorithm. The result is Theorem E.1 on robustness. Combined with Theorem E.2, which explicitly bounds the error of Algorithm 2 in terms of the true and predicted frequent directions, we obtain consistency/robustness tradeoffs for the modified algorithm. Details are provided in Appendix E.

4 EXPERIMENTS

We complement our theoretical results with experiments on real data, both in the frequency estimation (1-dimensional stream elements) and frequent directions (row-vector stream elements) settings. We highlight the main experimental results here and include extensive figures in Appendix F.

4.1 FREQUENT DIRECTIONS EXPERIMENTS

Datasets and Predictions. We use datasets from Indyk et al. (2019) and Indyk et al.
(2021), prior works on learning-based low-rank approximation not in the streaming setting. The Hyper dataset (Imamoglu et al., 2018) contains a sequence of hyperspectral images of natural scenes. We consider 80 images, each of dimension 1024 × 768. The Logo, Friends, and Eagle datasets come from high-resolution YouTube videos³. We consider 20 frames from each video, each with dimension 3240 × 1920. We plot the distribution of singular values for each dataset in Appendix F. For each dataset, we use the top singular vectors of the first matrix in the sequence to form the prediction via a low-rank projection matrix (see Algorithm 2).

Baselines. We compare two streaming algorithms and one incomparable baseline. In the streaming setting, we compare the Frequent Directions algorithm of Ghashami et al. (2016) with our learning-augmented variant. Both implementations are based on an existing implementation of Frequent Directions⁴. We additionally plot the performance of the low-rank approximation given by the largest right singular vectors (weighted by singular values). This matrix is not computable in a stream, as it involves taking the SVD of the entire matrix $A$, but we evaluate it for comparison purposes. Results are displayed based on the rank of the matrix output by the algorithm, which we vary from 20 to 200. For both Frequent Directions and our learned variant, the space used by the streaming algorithm is twice the rank: this is a choice made in the Frequent Directions implementation to avoid running SVD on every insertion and thus improve the update time. We use half of the space for the learned projection component and half for the orthogonal component in our algorithm.

Results. For each of the four datasets, we plot tradeoffs between median error (across the sequence of matrices) and rank, as well as error across the sequence for a fixed rank of 100 (see Figure 1). We include the latter plots for all choices of rank in Appendix F.
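This pipeline, forming $P_H$ from the top right singular vectors of the first matrix and then running the two Frequent Directions instances of Algorithm 2, can be sketched compactly. The following numpy code is our own simplified illustration, not the paper's implementation; for the sanity check at the end, the prediction is evaluated on the first matrix itself (a perfect oracle), whereas in the experiments the same $P_H$ is reused for the later matrices in the sequence:

```python
import numpy as np

def fd_update(B, a, m, tau):
    """One Frequent Directions (Algorithm 1) update of sketch B with row a."""
    B = np.vstack([B, a.reshape(1, -1)])
    if B.shape[0] >= m:
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        s2 = np.maximum(s ** 2 - s[tau - 1] ** 2, 0.0)
        B = np.sqrt(s2[: tau - 1])[:, None] * Vt[: tau - 1]
    return B

class LearnedFD:
    """Algorithm 2: split each row into its component inside the predicted
    subspace (columns of P_H) and the orthogonal remainder; sketch the two
    parts with separate Frequent Directions instances and stack the results."""

    def __init__(self, m, P_H):
        d, m_L = P_H.shape
        assert 2 * m_L <= m
        self.P = P_H
        self.m_par, self.tau_par = m_L, m_L // 2
        self.m_perp, self.tau_perp = m - 2 * m_L, (m - 2 * m_L) // 2
        self.B_par = np.zeros((0, d))
        self.B_perp = np.zeros((0, d))

    def update(self, a):
        a_par = self.P @ (self.P.T @ a)   # projection onto the predicted subspace
        self.B_par = fd_update(self.B_par, a_par, self.m_par, self.tau_par)
        self.B_perp = fd_update(self.B_perp, a - a_par, self.m_perp, self.tau_perp)

    def sketch(self):
        return np.vstack([self.B_par, self.B_perp])

# prediction: top m_L right singular vectors of the "first matrix"
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 12))                      # stand-in for a data matrix
P_H = np.linalg.svd(A, full_matrices=False)[2][:4].T   # d x m_L, orthonormal columns

lfd = LearnedFD(m=16, P_H=P_H)
for row in A:                                          # stream the rows one by one
    lfd.update(row)
gap = A.T @ A - lfd.sketch().T @ lfd.sketch()          # ~PSD under a perfect oracle
```

Because the predicted and orthogonal components are sketched independently, the combined sketch inherits the no-overestimation property of each instance whenever $P_H$ spans an exact invariant subspace of $A^T A$.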
Our learning-augmented Frequent Directions algorithm improves upon the base Frequent Directions by 1-2 orders of magnitude on all datasets. In most cases, it performs within an order of magnitude of the (full-memory, non-streaming) SVD approximation. In all cases, increasing rank, or equivalently, space, yields significant improvement in the error. These results indicate that learned hints taken from the SVD solution on the first matrix in the sequence can be extremely powerful in improving matrix approximations in streams. As the sequences of matrices retain self-similarity (e.g., due to being a sequence of frames in a video), the predicted projection allows our streaming algorithm to achieve error closer to that of the memory-intensive SVD solution than that of the base streaming algorithm.

4.2 FREQUENCY ESTIMATION EXPERIMENTS

Datasets and Predictions. We test our algorithm and baselines on the CAIDA (CAIDA, 2016) and AOL (Pass et al., 2006) datasets used in prior work (Hsu et al., 2019; Aamand et al., 2023). The CAIDA dataset contains 50 minutes of internet traffic data, with a stream corresponding to the IP addresses associated with packets passing through an ISP over a minute of data. Each minute of data contains approximately 30 million packets with 1 million unique IPs. The AOL dataset

³Originally downloaded from http://youtu.be/L5HQoFIaT4I, http://youtu.be/xmLZsEfXEgE, and http://youtu.be/ufnf_q_3Ofg, and appearing in Indyk et al. (2019).
contains 80 days of internet search query data, with each stream (corresponding to a day) having around 300k total queries and 100k unique queries. We plot the frequency distribution for both datasets in Appendix F. We use recurrent neural networks trained in past work of Hsu et al. (2019) as the predictor for both datasets.

⁴https://github.com/edoliberty/frequent-directions

[Figure 1: four panels, one per dataset (panel (d): Friends); each pairs an error/rank tradeoff plot with a per-matrix error plot for SVD, FD, and Learned FD (ours).]

Figure 1: Comparison of matrix approximations. The Frequent Directions and learning-augmented Frequent Directions algorithms are streaming algorithms, while the exact SVD stores the entire matrix to compute a low-rank approximation (so it cannot be implemented in a stream). For each dataset, the left plot shows median error (error formula from Equation (2)) as the rank of the approximation varies, while the right plot shows error over the sequence of matrices with a fixed rank of 100. The sudden drop in error on Eagle corresponds to several frames of a black screen in the video.

Algorithms. We compare our learning-augmented Misra-Gries algorithm with learning-augmented Count Sketch (Hsu et al., 2019) and learning-augmented Count Sketch++ (Aamand et al., 2023). As in Aamand et al. (2023), we forego comparisons against Count Min, as it has worse performance both in theory (Aamand et al., 2023) and practice (Hsu et al., 2019).
For the prior state-of-the-art, learned CS++, the implemented algorithm does not exactly correspond to the one which achieves the best theoretical bounds: only a single Count Sketch table is used (as opposed to two), and the number of rows of the sketch is 3 (as opposed to $O(\log\log n)$). There is a tunable hyperparameter C: elements with estimated frequency less than $Cn/w$ have their estimates truncated to zero (where n is the stream length and w is the sketch width). The space stored by the sketch corresponds to $3w$, as there are 3 rows. For Misra-Gries, the space corresponds to double the number of stored counters, as each counter requires storing a key as well as a count. As in prior work, for all algorithms, the learned variants use half of the space for the normal algorithm and half of the space to store exact counts for the elements with the top predicted frequencies.

Figure 2: Comparison of learning-augmented frequency estimation algorithms (CS, CS++ with C=1, 2, 5, and MG (ours)). Top: CAIDA, Bottom: AOL. For both datasets, the left plot shows the median error of each method (across all 50 streams) with varying space budgets. The right plot shows the performance of each algorithm across streams with a fixed space of 750 words. Randomized algorithms are averaged across 10 trials and one standard deviation is shaded.

Results For both datasets, we compare the learning-augmented algorithms by plotting the tradeoff between median error and space, as well as error across the sequence of streams for a fixed space of 750 (see Figure 2).
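The two error metrics behind these plots can be made concrete. This is our reading of the metrics (weighted error charges each item's absolute error by its true frequency, i.e. once per stream position; unweighted error charges it once per distinct item); the helper names and the toy estimator are illustrative assumptions.

```python
from collections import Counter

def weighted_error(freqs, estimate):
    """Frequency-weighted error: each item's absolute estimation error is
    weighted by its true frequency."""
    return sum(f * abs(f - estimate(x)) for x, f in freqs.items())

def unweighted_error(freqs, estimate):
    """Unweighted sum of absolute errors, one term per distinct item."""
    return sum(abs(f - estimate(x)) for x, f in freqs.items())

# Tiny illustration with a trivial all-zeros estimator.
freqs = Counter({"a": 3, "b": 1})
zero_estimator = lambda x: 0
```

Under the weighted metric, errors on heavy hitters dominate, which is exactly where the learned oracle helps.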
In Appendix F, we include the latter plots for all choices of space, as well as all corresponding plots both without predictions (to compare the base CS, CS++, and MG algorithms) and under unweighted error (taken as the unweighted sum of absolute errors over all stream items regardless of frequency), which was also evaluated in prior work.

The learning-augmented Misra-Gries algorithm improves significantly over learning-augmented Count Sketch, as implied by our theoretical bounds. Furthermore, it is competitive with the state-of-the-art learning-augmented CS++ algorithm: it sometimes outperforms the best hyperparameter choice of CS++ and often outperforms several of the hyperparameter choices of CS++. Furthermore, learning-augmented MG has no equivalent tunable parameter and is simpler to deploy (especially as CS++ is already a simplification of the theoretical algorithm of Aamand et al. (2023)). As learning-augmented MG is the only deterministic algorithm with provable guarantees in the setting of Hsu et al. (2019), our results indicate that there is essentially no cost to derandomization.

ACKNOWLEDGMENTS

Justin Y. Chen is supported by an NSF Graduate Research Fellowship under Grant No. 17453. Hao Wu was a Postdoctoral Fellow at the University of Copenhagen, supported by Providentia, a Data Science Distinguished Investigator grant from Novo Nordisk Fonden. Siddharth Gollapudi is supported in part by the NSF (CSGrad4US award no. 2313998).

REFERENCES

Anders Aamand, Justin Y. Chen, Huy Lê Nguyen, Sandeep Silwal, and Ali Vakilian. Improved frequency estimation algorithms with and without predictions. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
URL http://papers.nips.cc/paper_files/paper/2023/hash/2e49934cac6cb8604b0c67cfa0828718-Abstract-Conference.html.

CAIDA. CAIDA internet traces, Chicago. http://www.caida.org/data/monitors/passive-equinixchicago.xml, 2016.

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pp. 693-703. Springer, 2002.

Justin Y. Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David Woodruff, and Michael Zhang. Triangle and four cycle counting with predictions in graph streams. In International Conference on Learning Representations, 2022a.

Justin Y. Chen, Piotr Indyk, and Tal Wagner. Streaming algorithms for support-aware histograms. In Proceedings of the 39th International Conference on Machine Learning, 2022b.

Justin Y. Chen, Sandeep Silwal, Ali Vakilian, and Fred Zhang. Faster fundamental graph algorithms via learned predictions. In International Conference on Machine Learning, pp. 3583-3602. PMLR, 2022c.

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Dan Boneh, Tim Roughgarden, and Joan Feigenbaum (eds.), Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pp. 81-90. ACM, 2013. doi: 10.1145/2488608.2488620. URL https://doi.org/10.1145/2488608.2488620.

Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58-75, 2005.

Michael Dinitz, Sungjin Im, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Faster matchings via learned duals. arXiv, abs/2107.09770, 2021. URL https://api.semanticscholar.org/CorpusID:236154892.

Elbert Du, Franklyn Wang, and Michael Mitzenmacher. Putting the learning into learning-augmented algorithms for frequency estimation.
In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2860-2869. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/du21d.html.

Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '14, pp. 707-717, USA, 2014a. Society for Industrial and Applied Mathematics. ISBN 9781611973389.

Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pp. 707-717. SIAM, 2014b.

Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762-1792, 2016. doi: 10.1137/15M1009718. URL https://doi.org/10.1137/15M1009718.

Sreenivas Gollapudi and Debmalya Panigrahi. Online algorithms for rent-or-buy with expert advice. In International Conference on Machine Learning, 2019. URL https://api.semanticscholar.org/CorpusID:174800680.

Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=r1lohoCqY7.

Nevrez Imamoglu, Yu Oishi, Xiaoqiang Zhang, Guanqun Ding, Yuming Fang, Toru Kouyama, and Ryosuke Nakamura. Hyperspectral image dataset for benchmarking on salient object detection. In 2018 Tenth international conference on quality of multimedia experience (QoMEX), pp. 1-3. IEEE, 2018.

Piotr Indyk, Ali Vakilian, and Yang Yuan. Learning-based low-rank approximations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R.
Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1625abb8e458a79765c62009235e9d5b-Paper.pdf.

Piotr Indyk, Tal Wagner, and David Woodruff. Few-shot data-driven algorithms for low rank approximation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 10678-10690. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/588da7a73a2e919a23cb9a419c4c6d44-Paper.pdf.

Tanqiu Jiang, Yi Li, Honghao Lin, Yisong Ruan, and David P. Woodruff. Learning-augmented data stream algorithms. ICLR, 2020.

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pp. 489-504, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450347037. doi: 10.1145/3183713.3196909. URL https://doi.org/10.1145/3183713.3196909.

Yi Li, Honghao Lin, and David P. Woodruff. Optimal sketching for residual error estimation for matrix and vector norms. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=RsJwmWvE6Q.

Edo Liberty. Simple and deterministic matrix sketching. In Inderjit S. Dhillon, Yehuda Koren, Rayid Ghani, Ted E. Senator, Paul Bradley, Rajesh Parekh, Jingrui He, Robert L. Grossman, and Ramasamy Uthurusamy (eds.), The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013, pp. 581-588. ACM, 2013. doi: 10.1145/2487575.2487623. URL https://doi.org/10.1145/2487575.2487623.

Edo Liberty. Even simpler deterministic matrix sketching. CoRR, abs/2202.01780, 2022.
URL https://arxiv.org/abs/2202.01780.

Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. J. ACM, 68(4), July 2021. ISSN 0004-5411. doi: 10.1145/3447579. URL https://doi.org/10.1145/3447579.

Ahmed Metwally, Divyakant Agrawal, and A. Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pp. 398-412, 2005. ISBN 978-3-540-24288-8. doi: 10.1007/978-3-540-30570-5_27.

Raphael A. Meyer, Cameron Musco, Christopher Musco, and David P. Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pp. 142-155. SIAM, 2021.

Gregory T. Minton and Eric Price. Improved concentration bounds for count-sketch. In ACM-SIAM Symposium on Discrete Algorithms, 2012. URL https://api.semanticscholar.org/CorpusID:11724394.

Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143-152, 1982. doi: 10.1016/0167-6423(82)90012-0. URL https://doi.org/10.1016/0167-6423(82)90012-0.

Michael Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. Advances in Neural Information Processing Systems, 31, 2018.

Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Commun. ACM, 65(7):33-35, June 2022. ISSN 0001-0782. doi: 10.1145/3528087. URL https://doi.org/10.1145/3528087.

Greg Pass, Abdur Chowdhury, and Cayley Torgeson. A picture of search. In Proceedings of the 1st international conference on Scalable information systems, pp. 1-es, 2006.

Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS'06), pp. 143-152. IEEE, 2006.

Rana Shahout and Michael Mitzenmacher. Learning-based heavy hitters and flow frequency estimation in streams. arXiv preprint, 2024. doi: 10.48550/arXiv.2406.16270.
Rana Shahout, Ibrahim Sabek, and Michael Mitzenmacher. Learning-augmented frequency estimation in sliding windows. arXiv preprint, 2024. doi: 10.48550/arXiv.2409.11516.

Joel A. Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454-1485, 2017.

David Woodruff. Low rank approximation lower bounds in row-update streams. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/58e4d44e550d0f7ee0a23d6b02d9b0db-Paper.pdf.

David P. Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.

Table of Contents

A Other Related Works
B Missing Proofs for Preliminaries
C Analysis of Misra-Gries
D Analysis of Frequent Directions
  D.1 Frequent Directions Under Zipfian
  D.2 Learned Frequent Directions Under Zipfian
E Consistency/Robustness Trade-offs
  E.1 The Error of Non-Perfect Oracles
F Additional Experiments
  F.1 Dataset Statistics
  F.2 Noise Analysis in Frequent Directions
  F.3 Additional Frequent Directions Experiments
  F.4 Additional Frequency Estimation Experiments
Table 3: Definitions of Main Notation.

n: Number of inputs in the stream
d: Domain size (frequency estimation); dimension (Frequent Directions)
f(·): Frequency
$a_i$: The i-th input element in the stream (frequency estimation)
$A_i$: The i-th input vector in the stream (Frequent Directions)
A: Stream input matrix
$e_i$: Standard basis vector
$\sigma_i$: Singular value of a matrix
N(·,·): Normal distribution
$U^{(i)}, \Sigma^{(i)}, V^{(i)}$: The SVD decomposition matrices at the i-th iteration of Algorithm 1

A OTHER RELATED WORKS

There is also a vast literature on sketching-based algorithms for low-rank approximation without any learning augmentation (Sarlos, 2006; Ghashami & Phillips, 2014b; Liberty, 2013; Tropp et al., 2017; Meyer et al., 2021). We refer to the monograph Woodruff et al. (2014) for more details.

B MISSING PROOFS FOR PRELIMINARIES

We refer the reader to Section 2 for the full statements.

Proof of Lemma 2.1. For $v \sim N(0, A^T A)$, using $\mathbb{E}[vv^T] = A^T A$,
$$\mathbb{E}_{v}\big[\|Av\|_2^2 - \hat f(v)\big] = \mathbb{E}\big[v^T A^T A v\big] - \mathbb{E}\big[v^T B^T B v\big] = \mathbb{E}\big[\operatorname{tr}(v^T A^T A v)\big] - \mathbb{E}\big[\operatorname{tr}(v^T B^T B v)\big]$$
$$= \mathbb{E}\big[\operatorname{tr}(A^T A\, vv^T)\big] - \mathbb{E}\big[\operatorname{tr}(B^T B\, vv^T)\big] = \operatorname{tr}\big(A^T A\, \mathbb{E}[vv^T]\big) - \operatorname{tr}\big(B^T B\, \mathbb{E}[vv^T]\big) = \operatorname{tr}\big((A^T A)^2\big) - \operatorname{tr}\big(B^T B A^T A\big).$$
Let $v_i$ be the right singular vectors of A:
$$\operatorname{tr}\big((A^T A)^2\big) = \operatorname{tr}\Big(\Big(\sum_{i\in[d]} \sigma_i^2 v_i v_i^T\Big)^2\Big) = \operatorname{tr}\Big(\sum_{i\in[d]} \sigma_i^4 v_i v_i^T\Big) = \sum_{i\in[d]} \sigma_i^4, \qquad (8)$$
$$\operatorname{tr}\big(B^T B A^T A\big) = \operatorname{tr}\Big(B^T B \sum_{i\in[d]} \sigma_i^2 v_i v_i^T\Big) = \sum_{i\in[d]} \sigma_i^2 \operatorname{tr}\big(B^T B v_i v_i^T\big) = \sum_{i\in[d]} \sigma_i^2\, v_i^T B^T B v_i. \qquad (10)$$
Hence
$$\mathbb{E}_{v\sim N(0,A^TA)}\big[v^T(A^TA - B^TB)v\big] = \sum_{i\in[d]} \sigma_i^4 - \sum_{i\in[d]} \sigma_i^2\, v_i^T B^T B v_i. \qquad (11)$$
Further, since $\hat f(v) = \|Bv\|_2^2$ and $0 \le \hat f(v) \le f(v)$, Equation (2) can be written as
$$\sum_{i\in[d]} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^TA - B^TB) v_i. \qquad (12)$$
The first term is given by
$$\sum_{i\in[d]} \sigma_i^2\, v_i^T \Big(\sum_{j\in[d]} \sigma_j^2 v_j v_j^T\Big) v_i = \sum_{i\in[d]} \sigma_i^4\, v_i^T v_i = \sum_{i\in[d]} \sigma_i^4. \qquad (13)$$
Further note that
$$\mathbb{E}_{v\sim N(0,A^TA)}\big[\|v\|_2^2\big] = \mathbb{E}_{z\sim N(0,I)}\big[z^T A A^T z\big] = \mathbb{E}_{z\sim N(0,I)}\big[\operatorname{tr}(z^T A A^T z)\big] \qquad (14)$$
$$= \mathbb{E}_{z\sim N(0,I)}\big[\operatorname{tr}(A A^T z z^T)\big] = \operatorname{tr}(A A^T) = \|A\|_F^2. \qquad (15)$$
Therefore,
$$\sum_{i\in[d]} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^TA - B^TB) v_i = \frac{\mathbb{E}_{v\sim N(0,A^TA)}\big[v^T(A^TA - B^TB)v\big]}{\mathbb{E}_{v\sim N(0,A^TA)}\big[\|v\|_2^2\big]}. \qquad (16)$$
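The chain of trace identities in the proof of Lemma 2.1 is easy to sanity-check numerically. The sketch below verifies, for stand-in matrices A and B of our own choosing, that $\operatorname{tr}((A^TA)^2) - \operatorname{tr}(B^TB\,A^TA)$ matches its expansion in the right singular vectors of A, and that $\mathbb{E}\|v\|_2^2 = \operatorname{tr}(A^TA) = \|A\|_F^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
B = rng.standard_normal((5, 8))   # any sketch matrix works for the identity

AtA, BtB = A.T @ A, B.T @ B

# E_{v ~ N(0, A^T A)}[v^T (A^T A - B^T B) v], computed via E[vv^T] = A^T A:
lhs = np.trace(AtA @ AtA) - np.trace(BtB @ AtA)

# The same quantity expanded in the right singular vectors of A:
_, s, Vt = np.linalg.svd(A, full_matrices=False)
rhs = np.sum(s ** 4) - sum(s[i] ** 2 * Vt[i] @ BtB @ Vt[i]
                           for i in range(len(s)))

# And E[||v||_2^2] = tr(A^T A) = ||A||_F^2:
ev_norm = np.trace(AtA)
```

Because the Gaussian expectations reduce exactly to traces, no sampling is needed: both sides agree to floating-point precision.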
Proof of Fact 2.2. For consistency, at each iteration, if B has fewer than m rows after insertion, we still define $U^{(i)}, \Sigma^{(i)}, V^{(i)} \leftarrow \mathrm{SVD}(B)$, and set $\bar\Sigma^{(i)} = \Sigma^{(i)}$. To analyze the error, let $B^{(i)}$ denote the value of B after the i-th iteration, and define
$$\Delta^{(i)} := A_i^T A_i + B^{(i-1)T} B^{(i-1)} - B^{(i)T} B^{(i)} = V^{(i)} \Sigma^{(i)T} U^{(i)T} U^{(i)} \Sigma^{(i)} V^{(i)T} - V^{(i)} \bar\Sigma^{(i)T} \bar\Sigma^{(i)} V^{(i)T} = V^{(i)} \big(\Sigma^{(i)T}\Sigma^{(i)} - \bar\Sigma^{(i)T}\bar\Sigma^{(i)}\big) V^{(i)T}.$$
Telescoping,
$$A^T A - B^{(n)T} B^{(n)} = \sum_{i\in[n]} \big(A_i^T A_i + B^{(i-1)T} B^{(i-1)} - B^{(i)T} B^{(i)}\big) = \sum_{i\in[n]} \Delta^{(i)}. \qquad (17)$$
Since each $\Delta^{(i)} \succeq 0$, this proves that
$$A^T A - B^{(n)T} B^{(n)} \succeq 0. \qquad (18)$$
Let $v_1,\dots,v_d \in \mathbb{R}^d$ be the right singular vectors of A. For each $k = 0,\dots,\tau-1$, define the projection matrix $\bar P_k = [\,0\,|\,\dots\,|\,0\,|\,v_{k+1}\,|\,\dots\,|\,v_d\,] \in \mathbb{R}^{d\times d}$, consisting of column vectors $0,\dots,0,v_{k+1},\dots,v_d$. Its null space is thus spanned by the top-k right singular vectors of A. We claim the following holds:
$$\|\Delta^{(i)}\|_2 \le \frac{1}{\tau-k} \operatorname{tr}\big(\bar P_k^T \Delta^{(i)} \bar P_k\big), \qquad k = 0,\dots,\tau-1. \qquad (19)$$
Before proving it, we complete the proof of the error bound:
$$\big\|A^T A - B^{(n)T} B^{(n)}\big\|_2 = \Big\|\sum_{i\in[n]} \Delta^{(i)}\Big\|_2 \le \sum_{i\in[n]} \big\|\Delta^{(i)}\big\|_2 \le \frac{1}{\tau-k} \sum_{i\in[n]} \operatorname{tr}\big(\bar P_k^T \Delta^{(i)} \bar P_k\big) \qquad (20)$$
$$= \frac{1}{\tau-k} \operatorname{tr}\Big(\bar P_k^T \Big(\sum_{i\in[n]} \Delta^{(i)}\Big) \bar P_k\Big) \le \frac{1}{\tau-k} \operatorname{tr}\big(\bar P_k^T A^T A \bar P_k\big) = \frac{1}{\tau-k} \big\|A - [A]_k\big\|_F^2. \qquad (22)$$
Proof of Inequality (19): First,
$$\big\|\Delta^{(i)}\big\|_2 = \big\|V^{(i)} \big(\Sigma^{(i)T}\Sigma^{(i)} - \bar\Sigma^{(i)T}\bar\Sigma^{(i)}\big) V^{(i)T}\big\|_2 = \big\|\Sigma^{(i)T}\Sigma^{(i)} - \bar\Sigma^{(i)T}\bar\Sigma^{(i)}\big\|_2 = \big\|\min\big\{\Sigma^{(i)2},\, \sigma_\tau^{(i)2} I\big\}\big\|_2 = \sigma_\tau^{(i)2},$$
where $\sigma_\tau^{(i)}$ is the $\tau$-th largest singular value of $\Sigma^{(i)}$. Next,
$$\operatorname{tr}\big(\Delta^{(i)}\big) = \operatorname{tr}\big(V^{(i)} \big(\Sigma^{(i)T}\Sigma^{(i)} - \bar\Sigma^{(i)T}\bar\Sigma^{(i)}\big) V^{(i)T}\big) \qquad (23)$$
$$= \operatorname{tr}\big(\Sigma^{(i)T}\Sigma^{(i)} - \bar\Sigma^{(i)T}\bar\Sigma^{(i)}\big) = \sum_j \min\big\{\sigma_j^{(i)2},\, \sigma_\tau^{(i)2}\big\} \ge \tau\, \sigma_\tau^{(i)2}. \qquad (24)$$
Next, let $P_k = [\,v_1\,|\,\dots\,|\,v_k\,|\,0\,|\,\dots\,|\,0\,] \in \mathbb{R}^{d\times d}$ be the projection matrix to the space spanned by the top-k right singular vectors of A. Then $P_k + \bar P_k = [\,v_1\,|\,\dots\,|\,v_d\,]$ has orthonormal columns, and
$$\operatorname{tr}\big(\Delta^{(i)}\big) = \operatorname{tr}\big((P_k + \bar P_k)^T \Delta^{(i)} (P_k + \bar P_k)\big) = \operatorname{tr}\big(P_k^T \Delta^{(i)} P_k\big) + \operatorname{tr}\big(\bar P_k^T \Delta^{(i)} \bar P_k\big). \qquad (25)$$
Expanding $\operatorname{tr}(P_k^T \Delta^{(i)} P_k)$ we get
$$\operatorname{tr}\big(P_k^T \Delta^{(i)} P_k\big) = \sum_{j\in[k]} v_j^T \Delta^{(i)} v_j \le k\, \sigma_\tau^{(i)2}, \qquad (26)$$
which implies that
$$\operatorname{tr}\big(\bar P_k^T \Delta^{(i)} \bar P_k\big) \ge (\tau - k)\, \sigma_\tau^{(i)2} = (\tau - k)\, \big\|\Delta^{(i)}\big\|_2.$$
Running Time: For each $A_i$, inserting it into B takes O(d) time. When B reaches its capacity of m rows, the operations in Lines 7-10 are triggered, and performing the SVD requires $O(m^2 d)$ time.
After completing this step, B has at least $m - \tau + 1$ empty rows. Thus, the algorithm can accommodate at least $m - \tau + 1$ additional insertions before Lines 7-10 need to be executed again. Consequently, the total running time is
$$O\Big(\frac{n}{m - \tau + 1}\, m^2 d\Big). \qquad (28)$$

C ANALYSIS OF MISRA-GRIES

Proof of Theorem 3.1. We need to prove both the upper and the lower bound.

Upper Bound for Theorem 3.1. W.l.o.g., assume that $f(1) \ge f(2) \ge \dots \ge f(d)$. Since $f(i) \propto 1/i$ for each $i \in [d]$ and the input stream consists of n elements, it follows that $f(i) \approx \frac{n}{i \ln d}$. Assume that we have a Misra-Gries sketch of size $m \in \mathbb{N}^+$. Then, translating the error guarantee from Fact 2.2 to Misra-Gries, we have
$$f(i) - \hat f(i) \le \min_{k \in [0\,..\,m-1]} \frac{n - \sum_{j\in[k]} f(j)}{m - k} \le \frac{n - \sum_{j\in[m/2]} f(j)}{m/2} = \frac{2 n \ln\frac{2d}{m}}{m \ln d}. \qquad (29)$$
In particular, for $i \le \frac{m}{2 \ln(2d/m)}$,
$$f(i) = \frac{n}{i \ln d} \ge \frac{2 n \ln\frac{2d}{m}}{m \ln d}. \qquad (30)$$
Since we also know that $0 \le \hat f(i) \le f(i)$, it follows that, with $t := \frac{m}{2\ln(2d/m)}$,
$$\sum_{i\in[d]} \frac{f(i)}{n}\big(f(i) - \hat f(i)\big) = \sum_{i \le t} \frac{f(i)}{n}\big(f(i) - \hat f(i)\big) + \sum_{i > t} \frac{f(i)}{n}\big(f(i) - \hat f(i)\big)$$
$$\le \sum_{i \le t} \frac{1}{i \ln d} \cdot \frac{2 n \ln\frac{2d}{m}}{m \ln d} + \sum_{i > t} \frac{1}{i \ln d} \cdot \frac{n}{i \ln d} = O\Big(\frac{n \ln\frac{2d}{m}\, \ln m}{m (\ln d)^2}\Big),$$
since the first sum is $O\big(\frac{n \ln\frac{2d}{m}}{m (\ln d)^2} \ln t\big)$ and the second is $O\big(\frac{n}{t (\ln d)^2}\big) = O\big(\frac{n \ln\frac{2d}{m}}{m (\ln d)^2}\big)$.

Lower Bound for Theorem 3.1. To prove the lower bound, we assume there is an adversary which controls the order in which the input elements arrive, under the constraints that $\sum_{i\in[d]} f(i) = n$ and $f(i) \propto 1/i$, so as to maximize the error of the Misra-Gries algorithm. Denote by B the array maintained by the Misra-Gries algorithm, containing m buckets. Initially, all buckets are empty. First, the adversary inserts elements $1, \dots, t$ into the Misra-Gries algorithm, with multiplicities $f(1), \dots, f(t)$, where $t = \frac{m}{\ln(2d/m)}$. After this, B contains t non-empty buckets (for simplicity, here we assume that $\sum_{i\in[t]} f(i)$ is a multiple of m), which store elements $1,\dots,t$ together with their recorded frequencies $f(1),\dots,f(t)$, which we call their counters. Next, let C be the multi-set consisting of elements $t+1, \dots$
$, d$, such that each element $i \in [t+1\,..\,d]$ has multiplicity $f(i)$ in C. Consider the following game:

Adversary: pick an element i from C that is not in B. If such an element exists, remove one copy of it from C, and send it to the Misra-Gries algorithm as the next input. If there is no such element, stop the game.

Misra-Gries Algorithm: process the input i.

During this game, B fills up and contains no empty bucket after at most every m inputs, and the counters of elements $1,\dots,t$ decrease by 1 (if they are still above zero) each time B is updated to create empty buckets. Further, when the game stops, there can be at most m distinct elements remaining in C, with frequency sum at most $\sum_{i=t+1}^{t+m} f(i)$. It follows that the counters of elements $1,\dots,t$ in B decrease in total by at least
$$\frac{1}{m}\sum_{i=m+t+1}^{d} f(i) \ge \Omega\Big(\frac{n \ln\frac{d}{m}}{m \ln d}\Big),$$
since $m + t + 1 \le 2m$. Therefore, the weighted error introduced by these counters is at least
$$\sum_{i\in[t]} \frac{1}{i \ln d} \cdot \Omega\Big(\frac{n \ln\frac{d}{m}}{m \ln d}\Big) = \Omega\Big(\frac{n \ln\frac{d}{m}\, \ln t}{m (\ln d)^2}\Big).$$

Proof of Theorem 3.2. We need to prove both the upper and the lower bound.

Upper Bound for Theorem 3.2. It suffices to show that there exists a parameter setting of $m_L$ which enables the algorithm to achieve the desired error bound. Assume that the algorithm reserves $m_L = m/3$ words for the learned oracle. Then for each element $i \in [m_L]$, its frequency estimate $\hat f(i) = f(i)$. And for each $i \notin [m_L]$, the Misra-Gries algorithm never overestimates its frequency: $\hat f(i) \in [0, f(i)]$. Hence
$$\sum_{i > m_L} \frac{f(i)}{n}\big(f(i) - \hat f(i)\big) \le \sum_{i > m_L} \frac{1}{i \ln d} \cdot \frac{n}{i \ln d} = O\Big(\frac{1}{m}\cdot\frac{n}{(\ln d)^2}\Big).$$

Lower Bound for Theorem 3.2. To establish the lower bound, we consider an adversarial scenario where an adversary controls the order in which elements arrive, subject to the constraints $\sum_{i\in[d]} f(i) = n$ and $f(i) \propto 1/i$. The adversary's goal is to maximize the error of the learned Misra-Gries algorithm.
According to the framework presented in Algorithm 2 for learned Frequent Directions, the learned Misra-Gries algorithm initializes two separate Misra-Gries instances: one for the $m_L$ elements predicted to be frequent and one for the elements predicted to be non-frequent. Since $m_L$ memory words are already reserved for storing the frequencies of the predicted frequent elements, we do not need to run a full Misra-Gries algorithm on these elements. Instead, we only record their observed frequencies. Overloading the notation a little, let B denote the array used by the Misra-Gries instance managing the predicted non-frequent elements, which has a capacity of $m - m_L$ buckets. Initially, all buckets in B are empty. Since the learned Misra-Gries algorithm incurs no estimation error for the predicted frequent elements, our analysis focuses on the non-frequent elements and the potential error introduced by the Misra-Gries instance that processes them. Let C denote the multi-set of elements $m_L+1, \dots, d$, where each element $i \in [m_L+1\,..\,d]$ appears with multiplicity $f(i)$ in C. Consider the following adversarial game:

Adversary's Role: At each step, the adversary selects an element i from C that is not currently stored in the array B. If such an element exists, the adversary removes one occurrence of i from C and sends it to the Misra-Gries algorithm as the next input. If there is no such element left, the adversary halts the game.

Misra-Gries Algorithm's Role: The Misra-Gries algorithm processes the incoming element i as it would normally, using the array B of capacity $m - m_L$.

After the game, the remaining elements in C are fed to the Misra-Gries algorithm in arbitrary order by the adversary. Now, consider the estimation error made by the Misra-Gries algorithm on the elements $m_L+1, \dots, m_L + 2(m - m_L)$.
Since the array B can only store up to $m - m_L$ elements, the algorithm must estimate the frequency of at least $m - m_L$ elements from this range as zero. Therefore, the error is at least
$$\sum_{i=m_L+(m-m_L)+1}^{m_L+2(m-m_L)} \frac{f(i)}{n}\, f(i) \ge \sum_{i=m_L+(m-m_L)+1}^{m_L+2(m-m_L)} \frac{1}{i \ln d}\cdot\frac{n}{i \ln d} \ge \Omega\Big(\frac{n}{m \ln^2 d}\Big),$$
which finishes the proof.

D ANALYSIS OF FREQUENT DIRECTIONS

D.1 FREQUENT DIRECTIONS UNDER ZIPFIAN

Proof of Theorem 3.3. We need to prove both the upper and the lower bound.

Upper Bound for Theorem 3.3. First, based on the assumption that $\sigma_i^2 \propto 1/i$ and Fact 2.2, we have
$$\big\|A^T A - B^T B\big\|_2 \le \min_{k\in[0\,..\,m-1]} \frac{\|A - [A]_k\|_F^2}{\tau - k} \qquad (32)$$
$$= \min_{k\in[0\,..\,m-1]} \frac{\sum_{i=k+1}^d \sigma_i^2}{\tau - k} \qquad (33)$$
$$= \min_{k\in[0\,..\,m-1]} \frac{(H_d - H_k)\, \|A\|_F^2}{(\tau - k) \ln d} \qquad (34)$$
$$\le \frac{2 (H_d - H_{m/2})\, \|A\|_F^2}{m \ln d} \approx \frac{2\|A\|_F^2 \ln\frac{2d}{m}}{m \ln d}, \qquad (35)$$
where $H_m := \sum_{j\in[m]} \frac{1}{j}$, $m \in \mathbb{N}^+$, are the harmonic numbers. Further, for $i \le \frac{m}{2\ln(2d/m)}$,
$$\sigma_i^2 = \frac{\|A\|_F^2}{i \ln d} \ge \frac{2\|A\|_F^2 \ln\frac{2d}{m}}{m \ln d}. \qquad (37)$$
Since $B^T B \succeq 0$ and $A^T A - B^T B \succeq 0$ by Fact 2.2, it follows that for each right singular vector $v_i$ of A,
$$0 \le v_i^T (A^T A - B^T B) v_i \le v_i^T A^T A v_i = \sigma_i^2, \qquad (38)$$
where $\sigma_i$ is the singular value associated with $v_i$. Therefore, with $t := \frac{m}{2\ln(2d/m)}$, the expected error is given by
$$\mathrm{Err}(\mathcal{A}_{FD}) := \sum_{i\in[d]} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^T A - B^T B) v_i \qquad (39)$$
$$= \sum_{i \le t} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^T A - B^T B) v_i + \sum_{i > t} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^T A - B^T B) v_i \qquad (40)$$
$$\le \sum_{i \le t} \frac{1}{i \ln d} \cdot \frac{2\|A\|_F^2 \ln\frac{2d}{m}}{m \ln d} + \sum_{i > t} \frac{1}{i \ln d} \cdot \frac{\|A\|_F^2}{i \ln d} = O\Big(\frac{\|A\|_F^2 \ln\frac{2d}{m}\, \ln m}{m (\ln d)^2}\Big),$$
mirroring the computation for Misra-Gries in Theorem 3.1.

Lower Bound for Theorem 3.3. The proof of the lower bound follows the same approach as the one for Theorem 3.1 in Appendix C. Assume that A consists of standard basis vectors $e_1,\dots,e_d \in \mathbb{R}^d$. Let $f(e_i)$ denote the number of occurrences of $e_i$ in A. Without loss of generality, assume that $f(e_1) \ge \dots \ge f(e_d)$. In this case, we have $f(e_i) = \sigma_i^2$ and $\sum_{i\in[d]} f(e_i) = \sum_{i\in[d]} \sigma_i^2 = \|A\|_F^2 = n$.
Further, we can then view the sketch B maintained by the Frequent Directions algorithm as an array of m buckets. Now the setting is exactly the same as for the Misra-Gries algorithm. Consequently, the constructive lower bound proof from Theorem 3.1 directly applies to Frequent Directions.

D.2 LEARNED FREQUENT DIRECTIONS UNDER ZIPFIAN

We need an additional result to prove Theorem 3.4. Recall that $P_H$ in Algorithm 2 consists of orthonormal column vectors $w_1,\dots,w_{m_L} \in \mathbb{R}^d$. Extend this set of vectors to an orthonormal basis of $\mathbb{R}^d$: $w_1,\dots,w_{m_L},w_{m_L+1},\dots,w_d$, and write $P_{\bar H} = [\,w_{m_L+1}\,|\,\dots\,|\,w_d\,]$ for the projection matrix onto the orthogonal subspace. Let $A^{\parallel} := A P_H P_H^T$ be the matrix obtained by projecting the rows of A onto the predicted subspace, and $A^{\perp} := A - A^{\parallel} = A(I - P_H P_H^T)$. The following lemma holds.

Lemma D.1. For a vector $x \in \mathbb{R}^d$, we have
$$x^T A^T A x = x^T A^{\parallel T} A^{\parallel} x + x^T A^{\perp T} A^{\perp} x + 2 \sum_{i\in[d]} \sigma_i^2\, \big\langle P_H^T v_i,\, P_H^T x\big\rangle \big\langle P_{\bar H}^T v_i,\, P_{\bar H}^T x\big\rangle.$$
The proof of the lemma is included at the end of this section.

Proof of Theorem 3.4. We need to prove both the upper and the lower bound.

Upper Bound for Theorem 3.4. It suffices to show that there exists a parameter setting of $m_L$ which enables the algorithm to achieve the desired error bound. We assume that the algorithm uses $m_L = m/3$ predicted directions from the learned oracle. Recall that Algorithm 2 maintains two instances of Algorithm 1: $\mathcal{A}^{\parallel}_{FD}$ and $\mathcal{A}^{\perp}_{FD}$. The former processes the vectors projected onto the subspace defined by $P_H$, while the latter handles the vectors projected onto the orthogonal subspace. Therefore, the input to $\mathcal{A}^{\parallel}_{FD}$ is $A^{\parallel} = A P_H P_H^T$, and the input to $\mathcal{A}^{\perp}_{FD}$ is $A^{\perp} = A - A^{\parallel} = A(I - P_H P_H^T) = A P_{\bar H} P_{\bar H}^T$. Ultimately, the resulting matrix B is a combination of the matrices returned by $\mathcal{A}^{\parallel}_{FD}$ and $\mathcal{A}^{\perp}_{FD}$, denoted $B^{\parallel}$ and $B^{\perp}$, respectively.
Combined with Lemma D.1, for each right singular vector $v_j$ of A, we have
$$v_j^T (A^T A - B^T B) v_j = v_j^T A^T A v_j - v_j^T B^{\parallel T} B^{\parallel} v_j - v_j^T B^{\perp T} B^{\perp} v_j \qquad (47)$$
$$= \big(v_j^T A^{\parallel T} A^{\parallel} v_j - v_j^T B^{\parallel T} B^{\parallel} v_j\big) \qquad (48)$$
$$+ \big(v_j^T A^{\perp T} A^{\perp} v_j - v_j^T B^{\perp T} B^{\perp} v_j\big) \qquad (49)$$
$$+ 2 \sum_{i\in[d]} \sigma_i^2\, \big\langle P_H^T v_i,\, P_H^T v_j\big\rangle \big\langle P_{\bar H}^T v_i,\, P_{\bar H}^T v_j\big\rangle. \qquad (50)$$
First, observe that since $\mathcal{A}^{\parallel}_{FD}$ is allocated $m_L \times d$ space for the matrix $A^{\parallel}$ with rank at most $m_L$, the error guarantee of the Frequent Directions algorithm (Fact 2.2) ensures that $v_j^T A^{\parallel T} A^{\parallel} v_j - v_j^T B^{\parallel T} B^{\parallel} v_j = 0$. Second, note that $\langle P_H^T v_i, P_H^T v_j\rangle$ is the inner product between the vectors $v_i$ and $v_j$ projected onto the subspace H specified by the predicted frequent directions, and that $\langle P_{\bar H}^T v_i, P_{\bar H}^T v_j\rangle$ is the inner product between $v_i$ and $v_j$ projected onto the orthogonal complement of H.

In particular, when the machine learning oracle makes perfect predictions of $v_1,\dots,v_{m_L}$, i.e., $w_1 = v_1, \dots, w_{m_L} = v_{m_L}$, then for each i, either $P_H^T v_i$ or $P_{\bar H}^T v_i$ is zero. Therefore, it holds that
$$v_j^T (A^T A - B^T B) v_j = v_j^T A^{\perp T} A^{\perp} v_j - v_j^T B^{\perp T} B^{\perp} v_j. \qquad (51)$$
Further, by the property of the Frequent Directions instance $\mathcal{A}^{\perp}_{FD}$, we have $A^{\perp T} A^{\perp} - B^{\perp T} B^{\perp} \succeq 0$. And since $A^{\perp}$ is the projection of A onto the subspace spanned by the right singular vectors $v_{m_L+1},\dots,v_d$, it still has right singular vectors $v_{m_L+1},\dots,v_d$, associated with singular values $\sigma_{m_L+1},\dots,\sigma_d$. It follows that
$$0 \le v_j^T (A^T A - B^T B) v_j \qquad (52)$$
$$= v_j^T A^{\perp T} A^{\perp} v_j - v_j^T B^{\perp T} B^{\perp} v_j \qquad (53)$$
$$\le v_j^T A^{\perp T} A^{\perp} v_j \qquad (54)$$
$$\le \begin{cases} \sigma_j^2, & j > m_L \\ 0, & j \le m_L. \end{cases} \qquad (55)$$
Therefore, the weighted error is given by
$$\mathrm{Err}(\mathcal{A}_{LFD}) := \sum_{i\in[d]} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^T A - B^T B) v_i \qquad (56)$$
$$= \sum_{i \le m_L} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^T A - B^T B) v_i + \sum_{i > m_L} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^T A - B^T B) v_i \qquad (57)$$
$$= \sum_{i > m_L} \frac{\sigma_i^2}{\|A\|_F^2}\, v_i^T (A^T A - B^T B) v_i \qquad (58)$$
$$\le \sum_{i > m_L} \frac{\sigma_i^2}{\|A\|_F^2}\cdot \sigma_i^2 = \sum_{i > m_L} \frac{1}{i \ln d} \cdot \frac{\|A\|_F^2}{i \ln d} = O\Big(\frac{1}{m_L + 1}\cdot\frac{\|A\|_F^2}{(\ln d)^2}\Big).$$
Noting that $m_L \in \Theta(m)$ finishes the proof of the upper bound.

Lower Bound for Theorem 3.4. The proof of the lower bound follows the same approach as the one for Theorem 3.2 in Appendix C.
Assume that A consists of standard basis vectors $e_1,\dots,e_d \in \mathbb{R}^d$. Let $f(e_i)$ denote the number of occurrences of $e_i$ in A. Without loss of generality, assume that $f(e_1) \ge \dots \ge f(e_d)$. In this case, we have $f(e_i) = \sigma_i^2$ and $\sum_{i\in[d]} f(e_i) = \sum_{i\in[d]} \sigma_i^2 = \|A\|_F^2 = n$. Further, we can then view the sketch B maintained by the Frequent Directions algorithm as an array of m buckets. Now the setting is exactly the same as for the Misra-Gries algorithm. Consequently, the constructive lower bound proof from Theorem 3.2 directly applies to learned Frequent Directions.

We next prove Lemma D.1.

Proof of Lemma D.1. First, observe that
$$x^T A^T A x = x^T (A^{\parallel} + A^{\perp})^T (A^{\parallel} + A^{\perp}) x \qquad (62)$$
$$= x^T A^{\parallel T} A^{\parallel} x + x^T A^{\perp T} A^{\perp} x + x^T A^{\parallel T} A^{\perp} x + x^T A^{\perp T} A^{\parallel} x. \qquad (63)$$
It suffices to show that
$$x^T A^{\parallel T} A^{\perp} x = x^T A^{\perp T} A^{\parallel} x = \sum_{i\in[d]} \sigma_i^2\, \big\langle P_H^T v_i,\, P_H^T x\big\rangle \big\langle P_{\bar H}^T v_i,\, P_{\bar H}^T x\big\rangle. \qquad (64)$$
First, since these are scalars and each is the transpose of the other,
$$x^T A^{\parallel T} A^{\perp} x = x^T P_H P_H^T A^T A (I - P_H P_H^T) x \qquad (65)$$
$$= \big(x^T P_H P_H^T A^T A (I - P_H P_H^T) x\big)^T \qquad (66)$$
$$= x^T (I - P_H P_H^T)^T A^T A P_H P_H^T x = x^T A^{\perp T} A^{\parallel} x. \qquad (67)$$
Hence, it remains to study one of them. Expanding,
$$x^T A^{\perp T} A^{\parallel} x = x^T (I - P_H P_H^T)^T A^T A P_H P_H^T x \qquad (68)$$
$$= x^T (I - P_H P_H^T)^T \Big(\sum_{i\in[d]} \sigma_i^2 v_i v_i^T\Big) P_H P_H^T x \qquad (69)$$
$$= \sum_{i\in[d]} \sigma_i^2\, \big\langle v_i,\, (I - P_H P_H^T) x\big\rangle \big\langle v_i,\, P_H P_H^T x\big\rangle. \qquad (70)$$
Since $I - P_H P_H^T = P_{\bar H} P_{\bar H}^T$,
$$x^T A^{\perp T} A^{\parallel} x = \sum_{i\in[d]} \sigma_i^2\, \big\langle v_i,\, P_{\bar H} P_{\bar H}^T x\big\rangle \big\langle v_i,\, P_H P_H^T x\big\rangle \qquad (71)$$
$$= \sum_{i\in[d]} \sigma_i^2\, \big\langle P_{\bar H}^T v_i,\, P_{\bar H}^T x\big\rangle \big\langle P_H^T v_i,\, P_H^T x\big\rangle. \qquad (72)$$

E CONSISTENCY/ROBUSTNESS TRADE-OFFS

If the predictions are perfect, the sketch B output by Algorithm 2 satisfies $B^T B \preceq A^T A$. This property in particular gives $x^T B^T B x \le x^T A^T A x$ for any vector x, and as further $B^T B \succeq 0$, we get that the error $|v_i^T B^T B v_i - v_i^T A^T A v_i| \le \sigma_i^2$ for any of the singular vectors $v_i$ (and with perfect predictions the errors on the predicted singular vectors are zero). Unfortunately, with imperfect predictions, the guarantee that $B^T B \preceq A^T A$ is not retained. To take a simple example, suppose that $d = 2$ and that the input matrix $A = (1, 1)$ has just one row.
Suppose we create two Frequent Directions sketches by projecting onto the standard basis vectors $e_1$ and $e_2$, and stack the resulting sketches $B_1$ and $B_2$ to get a sketch matrix B. It is then easy to check that B is in fact the identity matrix. In particular, if $x = e_1 - e_2$, then $\|Bx\|_2^2 = 2$ whereas $\|Ax\|_2^2 = 0$, showing that $A^T A - B^T B$ is not positive semidefinite. The absence of this property poses issues in proving consistency/robustness trade-offs for the algorithm. Indeed, our analysis of the classic Frequent Directions algorithm under Zipfian distributions crucially uses that the error incurred in the light directions $v_i$, for $i \ge \frac{m}{\ln(2d/m)}$, is at most $\sigma_i^2$.

In this section, we address this issue by presenting a variant of Algorithm 2 that does indeed provide consistency/robustness trade-offs with only a constant-factor blow-up in space. To do so, we maintain three different sketches of the matrix A. The first sketch is the standard Frequent Directions sketch (Liberty, 2013) in Algorithm 1, the second is the learning-augmented sketch produced by Algorithm 2, and the final sketch computes an approximation to the residual error $\|A - [A]_k\|_F^2$ within a constant factor using an algorithm from Li et al. (2024).

Let $B_1$ be the output of Algorithm 1 on input A and $B_2$ be the output of Algorithm 2 on input A. Suppose for simplicity that we knew $\|A - [A]_k\|_F^2$ exactly. Then, the idea is that when queried with a unit vector x, we compute $\|B_1 x\|_2^2$ and $\|B_2 x\|_2^2$. If these are within $\frac{2\|A - [A]_k\|_F^2}{m-k}$ of each other, we output $\|B_2 x\|_2^2$ as the final estimate of $x^T A^T A x$; otherwise, we output $\|B_1 x\|_2^2$. The idea behind this approach is that in the latter case, we know that the learning-based algorithm must have performed poorly, with an error of at least $\frac{\|A - [A]_k\|_F^2}{m-k}$, and by outputting the estimate from the classic algorithm, we retain its theoretical guarantee.
On the other hand, in the former case, we know that the error is at most $3\frac{\|A - [A]_k\|_F^2}{m-k}$ but could be much better if the learning-augmented algorithm performed well. To state the exact result, we recall that the algorithm from Li et al. (2024), using space $O(k^2/\varepsilon^4)$, maintains a sketch of $A$ from which we can compute an estimate $\alpha$ such that $\|A - [A]_k\|_F^2 \leq \alpha \leq (1 + \varepsilon)\|A - [A]_k\|_F^2$. We denote this algorithm $\mathcal{A}_{res}(k, \varepsilon)$. Our final algorithm is Algorithm 3, for which we prove the following result.

Theorem E.1 (Worst-case guarantees). For any unit vector $x$, the estimate $\Gamma$ of $\|Ax\|_2^2$ returned by Algorithm 3 satisfies
$$\left| \|Ax\|_2^2 - \Gamma \right| \leq \min\left( \left| \|Ax\|_2^2 - \|B_2 x\|_2^2 \right|, \; 6\,\frac{\|A - [A]_k\|_F^2}{m-k} \right).$$
In other words, the error of Algorithm 3 is asymptotically bounded by the minimum of the errors of Algorithm 2 and the classic Frequent Directions algorithm.

Proof. Suppose first that $\left| \|B_2 x\|_2^2 - \|B_1 x\|_2^2 \right| \leq 2\alpha$. Then $\left| \|Ax\|_2^2 - \Gamma \right| = \left| \|Ax\|_2^2 - \|B_2 x\|_2^2 \right|$. Moreover, by the approximation guarantees of $\mathcal{A}_{res}$ (with $\varepsilon = 1$, so that $\alpha \leq 2\frac{\|A - [A]_k\|_F^2}{m-k}$) and of $\mathcal{A}_{FD}$,
$$\left| \|Ax\|_2^2 - \|B_2 x\|_2^2 \right| \leq \left| \|Ax\|_2^2 - \|B_1 x\|_2^2 \right| + \left| \|B_1 x\|_2^2 - \|B_2 x\|_2^2 \right| \leq \alpha + 2\alpha \leq 6\,\frac{\|A - [A]_k\|_F^2}{m-k},$$
as desired. Suppose on the other hand that $\left| \|B_2 x\|_2^2 - \|B_1 x\|_2^2 \right| > 2\alpha$. Since by Fact 2.2 we always have $\left| \|Ax\|_2^2 - \|B_1 x\|_2^2 \right| \leq \frac{\|A - [A]_k\|_F^2}{m-k} \leq \alpha$, it follows that $\left| \|Ax\|_2^2 - \|B_2 x\|_2^2 \right| > \alpha \geq \frac{\|A - [A]_k\|_F^2}{m-k}$. But since in this case we output $\|B_1 x\|_2^2$, the estimate of the standard Frequent Directions sketch, we again have by Fact 2.2 that $\left| \|Ax\|_2^2 - \|B_1 x\|_2^2 \right| \leq \frac{\|A - [A]_k\|_F^2}{m-k}$, as desired. $\square$

We note that the constant 6 in the theorem can be replaced by any constant $> 3$ by increasing the space used for $\mathcal{A}_{res}$.

Algorithm 3 Robust Learning-based Frequent Directions $\mathcal{A}_{RLFD}$
1: procedure INITIALIZATION
2: Input: sketch parameters $m, d \in \mathbb{N}^+$; learned oracle parameter $m_L$ s.t. $m_L \leq m$; predicted frequent directions $P_H = [w_1 \,|\, \ldots \,|\, w_{m_L}] \in \mathbb{R}^{d \times m_L}$; query vector $x$.
3: Initialize an instance of Algorithm 1: $\mathcal{A}_{FD}$.initialization$(m, 0.5m, d)$
4: Initialize an instance of Algorithm 2: $\mathcal{A}_{LFD}$.initialization$(m, 0.5m, d)$
5: Initialize the residual-error estimation algorithm of Li et al. (2024): $\mathcal{A}_{res}(m/2, 1)$
6: procedure UPDATE
7: $\mathcal{A}_{FD}$.update$(A_i)$
8: $\mathcal{A}_{LFD}$.update$(A_i)$
9: $\mathcal{A}_{res}$.update$(A_i)$
10: procedure RETURN
11: $B_1 \leftarrow \mathcal{A}_{FD}$.return()
12: $B_2 \leftarrow \mathcal{A}_{LFD}$.return()
13: $\alpha_0 \leftarrow \mathcal{A}_{res}$.return()
14: $\alpha \leftarrow \frac{\alpha_0}{m-k}$
15: return $(B_1, B_2, \alpha)$
16: procedure QUERY$(x)$
17: if $\left| \|B_2 x\|_2^2 - \|B_1 x\|_2^2 \right| \leq 2\alpha$ then
18: return $\|B_2 x\|_2^2$
19: else
20: return $\|B_1 x\|_2^2$

E.1 THE ERROR OF NON-PERFECT ORACLES

We now obtain a more fine-grained understanding of the consistency/robustness trade-off of Algorithm 3. Consider the SVD $A = \sum_{i \in [d]} \sigma_i u_i v_i^\top$. Let $\bar{A} \doteq A P_H P_H^\top$ be the matrix obtained by projecting the rows of $A$ onto the predicted subspace, and let $\widetilde{A} \doteq A - \bar{A} = A(I - P_H P_H^\top)$. Recall that $P_H$ consists of orthonormal column vectors $w_1, \ldots, w_{m_L} \in \mathbb{R}^d$. Extend this set of vectors to an orthonormal basis $w_1, \ldots, w_{m_L}, w_{m_L+1}, \ldots, w_d$ of $\mathbb{R}^d$, and write $\bar{P}_H = [w_{m_L+1} \,|\, \ldots \,|\, w_d]$ for the projection matrix onto the orthogonal subspace. By Lemma D.1, for each vector $x$ we can write
$$x^\top A^\top A x = x^\top \bar{A}^\top \bar{A} x + x^\top \widetilde{A}^\top \widetilde{A} x + 2 \sum_{i \in [d]} \sigma_i^2 \langle P_H^\top v_i, P_H^\top x \rangle \langle \bar{P}_H^\top v_i, \bar{P}_H^\top x \rangle. \tag{73}$$
To understand the significance of Lemma D.1, note that our algorithm attempts to approximate the first two terms (through either exact or approximate Frequent Directions sketches) but ignores the final one. Therefore, regardless of how successfully it approximates $x^\top \bar{A}^\top \bar{A} x + x^\top \widetilde{A}^\top \widetilde{A} x$, the quantity $2 \sum_{i \in [d]} \sigma_i^2 \langle P_H^\top v_i, P_H^\top x \rangle \langle \bar{P}_H^\top v_i, \bar{P}_H^\top x \rangle$ occurs as additional error. Note that $\langle P_H^\top v_i, P_H^\top v_j \rangle$ is the inner product between the projections of $v_i$ and $v_j$ onto the subspace $H$ specified by the predicted frequent directions, and that $\langle \bar{P}_H^\top v_i, \bar{P}_H^\top v_j \rangle$ is the inner product between the projections of $v_i$ and $v_j$ onto the orthogonal complement of $H$.
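The decomposition of Lemma D.1 can be checked numerically. The sketch below is our own illustration (names and dimensions are arbitrary): it evaluates the three terms of the identity for a random predicted subspace, and confirms that the cross term vanishes when $P_H$ spans true top singular vectors of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m_L = 50, 8, 3
A = rng.standard_normal((n, d))
_, s, Vt = np.linalg.svd(A, full_matrices=False)  # rows of Vt are the v_i
x = rng.standard_normal(d)

def decompose(P_H):
    """Return the three terms of Lemma D.1 for the subspace spanned by P_H."""
    proj = P_H @ P_H.T                    # projection onto H
    comp = np.eye(d) - proj               # projection onto H's complement
    head = np.linalg.norm(A @ proj @ x) ** 2          # x^T Abar^T Abar x
    tail = np.linalg.norm(A @ comp @ x) ** 2          # x^T Atilde^T Atilde x
    cross = 2 * sum(
        s[i] ** 2 * (Vt[i] @ proj @ x) * (Vt[i] @ comp @ x) for i in range(d)
    )
    return head, tail, cross

# A random predicted subspace: the identity holds, but the cross term is nonzero.
P_rand = np.linalg.qr(rng.standard_normal((d, m_L)))[0]
head, tail, cross = decompose(P_rand)
lhs = np.linalg.norm(A @ x) ** 2          # x^T A^T A x

# Correct predictions (true top singular vectors): the cross term vanishes,
# since every v_i lies either inside H or inside its complement.
_, _, cross_exact = decompose(Vt[:m_L].T)
```

Here `lhs` matches `head + tail + cross` up to floating-point rounding, while `cross_exact` is numerically zero.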
In particular, if $P_H$ consists of a set of correctly predicted singular vectors of $A$, then for any $i$, either $P_H^\top v_i$ or $\bar{P}_H^\top v_i$ is zero, and in particular the additional error is zero. In order to obtain an algorithm performing as well as if we had perfect predictions, it therefore suffices that the predictions are accurate enough that
$$\sum_{i \in [d]} \sigma_i^2 \left| \langle P_H^\top v_i, P_H^\top v_j \rangle \langle \bar{P}_H^\top v_i, \bar{P}_H^\top v_j \rangle \right| \leq O\!\left( \frac{\|A - [A]_k\|_F^2}{m} \right). \tag{74}$$

To obtain a more general smoothness/robustness trade-off, one can plug into Theorem E.1. Doing so in the setting of Theorem 3.4, where the singular values follow a Zipfian distribution, we obtain the following immediate corollary.

Corollary E.2. Consider the setting of Theorem 3.4, but where we run Algorithm 3 instead of Algorithm 2 and where we make no assumptions on the quality of the oracle. Then the error $\mathrm{Err}(\mathcal{A}_{RLFD})$ is at most
$$O\!\left( \frac{1}{(\ln d)^2} \cdot \frac{\|A\|_F^2}{m} + \sum_{i \in [d]} \frac{\sigma_i^2}{\|A\|_F^2} \sum_{j \in [d]} \sigma_j^2 \left| \langle P_H^\top v_j, P_H^\top v_i \rangle \langle \bar{P}_H^\top v_j, \bar{P}_H^\top v_i \rangle \right| \right) \tag{75}$$
but also always bounded by
$$O\!\left( \frac{\ln \frac{d}{m}}{(\ln d)^2} \cdot \frac{\|A\|_F^2}{m} \right). \tag{76}$$

We finish by showing an example demonstrating that even with very accurate predictions, the extra added error can be prohibitive. Assume that the input space is $\mathbb{R}^2$ and the input vectors are either $(1, 0)$ or $(0, 1)$. Assume that $\sigma_1^2 = 10^7$, $\sigma_2^2 = 1$, $v_1 = (1, 0)$, and $v_2 = (0, 1)$, and take $m_L = 1$. A perfect prediction would be $P_H = (1, 0)$, but we will assume that the actual prediction is slightly perturbed, say to $P_H = (\cos \frac{1}{100}, \sin \frac{1}{100})$. Then $\bar{P}_H = (-\sin \frac{1}{100}, \cos \frac{1}{100})$, and
$$\sum_{i \in [2]} \sigma_i^2 \langle P_H^\top v_i, P_H^\top v_1 \rangle \langle \bar{P}_H^\top v_i, \bar{P}_H^\top v_1 \rangle \approx 10^7 \cos^2 \tfrac{1}{100} \sin^2 \tfrac{1}{100} \tag{77}$$
$$\approx 10^7 \cdot \frac{1}{100^2} = 10^3. \tag{78}$$
In general, assume that $P_H = (\cos\theta, \sin\theta)$ for small $\theta$. Then
$$\sum_{i \in [2]} \sigma_i^2 \langle P_H^\top v_i, P_H^\top v_1 \rangle \langle \bar{P}_H^\top v_i, \bar{P}_H^\top v_1 \rangle \approx \sigma_1^2 \cos^2\theta \sin^2\theta \approx \sigma_1^2 \theta^2, \tag{79}$$
so we need $\theta \lesssim 1/\sqrt{m}$ in order for this bound to be comparable with the normal Frequent Directions bound.

F ADDITIONAL EXPERIMENTS

In this section, we include figures which did not fit in the main text.
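Before turning to the additional experiments, the perturbation example of Section E.1 above can be reproduced numerically. The snippet below is our own illustrative code; it evaluates the extra error term at $x = v_1$ and recovers the $\approx \sigma_1^2 \theta^2$ scaling from equation (79).

```python
import numpy as np

# Spectrum and singular vectors from the example: sigma_1^2 = 1e7, sigma_2^2 = 1.
s_sq = np.array([1e7, 1.0])
V = np.eye(2)                                    # v_1 = e_1, v_2 = e_2

def extra_error(theta):
    """Extra error term at x = v_1 for the perturbed prediction
    P_H = (cos theta, sin theta)."""
    p = np.array([np.cos(theta), np.sin(theta)])       # predicted direction
    q = np.array([-np.sin(theta), np.cos(theta)])      # orthogonal complement
    proj, comp = np.outer(p, p), np.outer(q, q)
    return sum(
        s_sq[i] * (V[i] @ proj @ V[0]) * (V[i] @ comp @ V[0]) for i in range(2)
    )

err = extra_error(1 / 100)   # (sigma_1^2 - sigma_2^2) cos^2(theta) sin^2(theta), about 1e3
```

With $\theta = 1/100$ the term comes out close to $10^3$, matching equation (78), and shrinking $\theta$ by a factor of 10 shrinks the term by roughly a factor of 100.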
F.1 DATASET STATISTICS

Figure 3: Log-log plot of frequencies for the CAIDA and AOL datasets.

Figure 4: Log-log plot of singular values for the first Hyper and Logo matrices.

Figure 5: Log-log plot of singular values for the first Eagle and Friends matrices.

F.2 NOISE ANALYSIS IN FREQUENT DIRECTIONS

We present the following figure analyzing the effect of prediction noise.

Figure 6: Analysis of prediction noise in matrix streaming on the first matrix of the Logo dataset. The rank of the algorithms is 100. The baselines of Frequent Directions and the true SVD are shown as dashed lines. Our learned Frequent Directions algorithm uses perfect predictions corrupted by a matrix of Gaussian noise with standard deviation $\sigma/\sqrt{d}$, where $\sigma$ is displayed as the amount of noise on the horizontal axis. The linear relationship on the log-log plot indicates that the performance of our algorithm decays polynomially with the amount of noise.

F.3 ADDITIONAL FREQUENT DIRECTIONS EXPERIMENTS

We present plots of error/rank tradeoffs and error across sequences of matrices with fixed rank for all four datasets: Hyper, Logo, Eagle, and Friends.

F.3.1 HYPER DATASET

Figure 7: Frequent Directions results on the Hyper dataset.

F.3.2 LOGO DATASET

Figure 8: Frequent Directions results on the Logo dataset.
F.3.3 EAGLE DATASET

Figure 9: Frequent Directions results on the Eagle dataset.

F.3.4 FRIENDS DATASET

Figure 10: Frequent Directions results on the Friends dataset.

F.4 ADDITIONAL FREQUENCY ESTIMATION EXPERIMENTS

Here, we present all frequency estimation results comparing our Learned Misra-Gries algorithm with the Learned Count Sketch of Hsu et al. (2019) and the Learned Count Sketch++ of Aamand et al. (2023). We present results both with and without learned predictions. Additionally, we present results both with the standard weighted error discussed in this paper and with the unweighted error also evaluated in the experiments of prior work. The unweighted error corresponds to taking the sum of absolute errors across all items appearing in the stream (not weighted by their frequencies).

F.4.1 NO PREDICTIONS, WEIGHTED ERROR

Figure 11: Frequency estimation on the CAIDA dataset with weighted error and no predictions.

Figure 12: Frequency estimation on the AOL dataset with weighted error and no predictions.
F.4.2 WITH PREDICTIONS, WEIGHTED ERROR

Figure 13: Frequency estimation on the CAIDA dataset with weighted error and learned predictions.

Figure 14: Frequency estimation on the AOL dataset with weighted error and learned predictions.

F.4.3 NO PREDICTIONS, UNWEIGHTED ERROR

Figure 15: Frequency estimation on the CAIDA dataset with unweighted error and no predictions.

Figure 16: Frequency estimation on the AOL dataset with unweighted error and no predictions.

F.4.4 WITH PREDICTIONS, UNWEIGHTED ERROR

Figure 17: Frequency estimation on the CAIDA dataset with unweighted error and learned predictions.
Figure 18: Frequency estimation on the AOL dataset with unweighted error and learned predictions.
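The two error measures used throughout Section F.4 can be sketched as follows. This is our own illustration: we assume the weighted error weights each item's absolute error by its empirical frequency normalized by the stream length (the exact normalization used in the experiments may differ), while the unweighted error is the plain sum of absolute errors over distinct items, as described above.

```python
from collections import Counter

def errors(stream, estimate):
    """Weighted and unweighted estimation error over the distinct stream items.

    `estimate(item)` returns the sketch's frequency estimate for `item`.
    """
    freq = Counter(stream)
    n = len(stream)
    unweighted = sum(abs(estimate(x) - f) for x, f in freq.items())
    weighted = sum(f / n * abs(estimate(x) - f) for x, f in freq.items())
    return weighted, unweighted

stream = ["a"] * 6 + ["b"] * 3 + ["c"]
est = {"a": 6, "b": 4, "c": 0}.get        # a toy estimator for illustration
w, u = errors(stream, est)                # u = |6-6| + |4-3| + |0-1| = 2
```

The weighted measure emphasizes mistakes on heavy hitters, which is why the two metrics can rank algorithms differently.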