# Count-Based Frequency Estimation With Bounded Memory

Marc G. Bellemare, Google DeepMind, London, United Kingdom. bellemare@google.com

## Abstract

Count-based estimators are a fundamental building block of a number of powerful sequential prediction algorithms, including Context Tree Weighting and Prediction by Partial Matching. Keeping exact counts, however, typically results in a high memory overhead. In particular, when dealing with large alphabets the memory requirements of count-based estimators often become prohibitive. In this paper we propose three novel ideas for approximating count-based estimators using bounded memory. Our first contribution, of independent interest, is an extension of reservoir sampling for sampling distinct symbols from a stream of unknown length, which we call K-distinct reservoir sampling. We combine this sampling scheme with a state-of-the-art count-based estimator for memoryless sources, the Sparse Adaptive Dirichlet (SAD) estimator. The resulting algorithm, the Budget SAD, naturally guarantees a limit on its memory usage. We finally demonstrate the broader use of K-distinct reservoir sampling in nonparametric estimation by using it to restrict the branching factor of the Context Tree Weighting algorithm. We demonstrate the usefulness of our algorithms with empirical results on two sequential, large-alphabet prediction problems.

## 1 Introduction

Undoubtedly, counting is the simplest way of estimating the relative frequency of symbols in a data sequence. In the sequential prediction setting, these estimates are used to assign to each symbol a probability proportionate to its frequency. When dealing with binary alphabets the counting approach gives rise to the Krichevsky-Trofimov estimator, known to be minimax optimal [Krichevsky and Trofimov, 1981]. Count-based approaches are also useful in dealing with large alphabets, and a number of methods coexist [Katz, 1987; Tjalkens et al., 1993; Friedman and Singer, 1999; Hutter, 2013], each with its own statistical assumptions and regret properties. While these simple estimators deal only with memoryless sources, whose output is context invariant, they play a critical role as building blocks for more complicated estimators [Cleary and Witten, 1984; Willems et al., 1995; Tziortziotis et al., 2014].

We are interested in the large-alphabet setting as it naturally lends itself to the compression of language data, where symbols consist of letters, phonemes, words or even sentence fragments. This setting is also of practical interest in reinforcement learning, for example in learning dynamical models for planning [Farias et al., 2010; Bellemare et al., 2013]. In both contexts, the statistical efficiency of count-based estimators makes them excellent modelling candidates. However, the memory demands of such estimators typically increase linearly with the effective alphabet size M, which is problematic when M grows with the sequence length. Perhaps surprisingly, this issue is more than just a theoretical concern and often occurs in language data [Gale and Church, 1990].

Our work is also motivated by the desire to improve the Context Tree Weighting algorithm [CTW; Willems et al., 1995] to model k-Markov, large-alphabet sources. While CTW is a powerful modelling technique, its practical memory requirements (linear in the size of the sequence) often preclude its use in large problems.
Here, even when the alphabet size is fixed and small, the number of observed order-k contexts typically grows with the sequence length. We are particularly interested in reducing the long-term memory usage of such an estimator, for example when estimating attributes of internet packets going through a router, or when training natural language models on nonstationary data.

Our solution takes its inspiration from the reservoir sampling algorithm [Knuth, 1981] and the work of Dekel et al. [2008] on budget perceptrons. We propose an online, randomized algorithm for sampling distinct symbols from a data stream; by choosing a sublinear reservoir size, we effectively force our statistical estimator to forget infrequent symbols. We show that this algorithm can easily be combined with a typical count-based estimator, the Sparse Adaptive Dirichlet estimator, to guarantee good statistical estimation while avoiding unbounded memory usage. As our results demonstrate, the resulting estimator is best suited to sources that emit a few frequent symbols while also producing a large pool of low-frequency noise. We further describe a simple application of our sampling algorithm to branch pruning within the CTW algorithm.

The idea of sampling distinct symbols from a stream is well-studied in the literature. Perhaps most related to ours is the work of Gibbons and Matias [1998], who proposed the notion of a concise sample to more efficiently store a hot list of frequent elements. Their algorithm is also based on reservoir sampling, but relies on a user-defined threshold and probabilistic counting. By contrast, our algorithm maintains a reservoir based on a random permutation of the data stream, which avoids the need for a threshold. Gibbons [2001] proposed an algorithm called Distinct Sampling to estimate the number of distinct values within a data stream. Similar to the hot list algorithm, Distinct Sampling employs a growing threshold, which is used to determine when infrequent symbols should be discarded. Metwally et al. [2005] proposed to track frequent symbols using a linked list and a hash table. Their Space-Saving algorithm aims to find all frequently occurring items (heavy hitters), rather than provide a frequency-biased sample; a similar scheme was also proposed by Demaine et al. [2002]. Manku and Motwani [2002] proposed a probabilistic counting and halving approach to find all items whose frequency is above a user-defined threshold. Charikar et al. [2004] showed how to use a probabilistic count sketch in order to find frequent items in a data stream. While count sketches may seem the natural fit for count-based estimators, their memory requirements grow quadratically with the inverse accuracy parameter $\epsilon^{-1}$; by contrast, our algorithm depends only linearly on the implied $\epsilon^{-1}$.

## 2 Background

In the sequential prediction setting we consider, an algorithm must make probabilistic predictions over a sequence of symbols observed one at a time. These symbols belong to a finite alphabet $\mathcal{X}$, with $x_{1:n} := x_1 \dots x_n \in \mathcal{X}^n$ denoting a string drawn from this alphabet. A probabilistic prediction consists in assigning a probability to a new symbol $x_{n+1}$ given that $x_{1:n}$ has been observed. Let $[n] := \{1, \dots, n\}$. We denote the set of finite strings by $\mathcal{X}^* := \bigcup_{i=0}^{\infty} \mathcal{X}^i$, denote the concatenation of strings $s$ and $r$ by $sr$, and use $\epsilon$ to represent the empty string.
For a set $B \subseteq \mathcal{X}$, we write $x_{1:n} \setminus B := (x_i : i \in [n], x_i \notin B)$ to denote the substring produced by excising from $x_{1:n}$ the symbols in $B$. We denote by $N_n(x) := N(x, x_{1:n})$ the number of occurrences of $x \in \mathcal{X}$ within $x_{1:n}$, and write $\tau_n(x) := \tau(x, x_{1:n})$ for the time of first occurrence of $x$ in $x_{1:n}$, with $\tau(x, x_{1:n}) := n + 1$ whenever $x \notin x_{1:n}$. We denote the set of permutations of $x_{1:n}$ by $\mathcal{P}(x_{1:n})$ and the discrete uniform distribution over $\mathcal{X}$ by $\mathcal{U}_{\mathcal{X}}(\cdot)$. Finally, unless otherwise specified, we write $\log$ to mean the natural logarithm.

A coding distribution $\rho$ is a sequence $(\rho_t : t \in \mathbb{N})$ of probability distributions $\rho_t : \mathcal{X}^t \to [0, 1]$, each respecting

1. $\sum_{x \in \mathcal{X}} \rho_{t+1}(x_{1:t}x) = \rho_t(x_{1:t})$, and
2. $\rho_0(\epsilon) = 1$.

In what follows the subscript to $\rho_t$ is always implied from its argument; we therefore omit it. A coding distribution assigns conditional probabilities according to $\rho(x \mid x_{1:t}) := \rho(x_{1:t}x)/\rho(x_{1:t})$, which after rearrangement yields the familiar chain rule $\rho(x_{1:n}) = \prod_{t=1}^{n} \rho(x_t \mid x_{1:t-1})$.

A source is a distinguished type of coding distribution which we assume generates sequences. When a coding distribution is not a source, we call it a model. A source $(\mu_t : t \in \mathbb{N})$ is said to be memoryless when $\mu_t(x \mid x_{1:t}) = \mu_t(x \mid \epsilon) =: \mu_t(x)$ for all $t$, all $x \in \mathcal{X}$, and all sequences $x_{1:t} \in \mathcal{X}^t$; for $B \subseteq \mathcal{X}$, we then write $\mu_t(B) := \sum_{x \in B} \mu_t(x)$. When $\mu_t(x) := \mu_1(x)$ for all $t$, we say it is stationary and omit its subscript.

Let $x_{1:n} \in \mathcal{X}^n$ and let $\mu$ be an unknown coding distribution, for example the source that generates $x_{1:n}$. We define the redundancy of a model $\rho$ with respect to $\mu$ as

$$F_n(\rho, \mu) := F(\rho, \mu, x_{1:n}) := \log \frac{\mu(x_{1:n})}{\rho(x_{1:n})}.$$

The redundancy $F_n(\rho, \mu)$ can be interpreted as the excess bits (or nats) required to encode $x_{1:n}$ using $\rho$ rather than $\mu$. The expectation of this redundancy with respect to a random sequence $x_{1:n}$ is the Kullback-Leibler divergence $\mathrm{KL}_n(\mu \,\|\, \rho)$:

$$\mathrm{KL}_n(\mu \,\|\, \rho) := \mathbb{E}_{x_{1:n} \sim \mu}\big[F_n(\rho, \mu)\big].$$

Typically, we are interested in how well $\rho$ compares to a class of coding distributions $\mathcal{M}$. The redundancy of $\rho$ with respect to this class $\mathcal{M}$ is defined as

$$F_n(\rho, \mathcal{M}) := \max_{\rho' \in \mathcal{M}} F_n(\rho, \rho').$$

Intuitively, $F_n(\rho, \mathcal{M})$ measures the cost of encoding $x_{1:n}$ with $\rho$ rather than the best model in $\mathcal{M}$.

### 2.1 The Sparse Adaptive Dirichlet Estimator

The Sparse Adaptive Dirichlet estimator [SAD; Hutter, 2013] is a count-based frequency estimator designed for large-alphabet sources with sparse support. Informally, the SAD predicts according to empirical frequencies in $x_{1:n}$, while reserving some escape probability $\gamma$ for unseen symbols. Given $A_t := \bigcup_{i=1}^{t}\{x_i\}$, the set of symbols seen up to time $t$, and a probability distribution $w_t$ over $\mathcal{X} \setminus A_t$, the SAD estimator predicts according to

$$\rho_{\mathrm{SAD}}(x \mid x_{1:t}) := \begin{cases} \dfrac{N_t(x)}{t + \gamma_t} & \text{if } x \in A_t, \\[4pt] \dfrac{\gamma_t w_t(x)}{t + \gamma_t} & \text{otherwise,} \end{cases} \qquad \gamma_t := \gamma(t, A_t) := \begin{cases} 1 & \text{if } t = 0, \\ 0 & \text{if } A_t = \mathcal{X}, \\ |A_t| & \text{otherwise.} \end{cases}$$

Let $\mathcal{M}_{\mathrm{SAD}}$ be the class of memoryless sources generating symbols from a subalphabet $A \subseteq \mathcal{X}$ of size $M$. The redundancy of the SAD estimator with respect to this class is bounded as

$$F_n(\rho_{\mathrm{SAD}}, \mathcal{M}_{\mathrm{SAD}}) \le \frac{M - 1}{2}\log n + O(M \log\log n),$$

making it particularly apt at dealing with sparse-support sources.
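To make the prediction rule concrete, here is a minimal Python sketch of the SAD update and prediction, taking $w_t$ uniform over the unseen part of the alphabet and the escape mass $\gamma_t$ as given above; it is an illustrative sketch, not Hutter's reference implementation.

```python
from collections import Counter

class SADEstimator:
    """Minimal sketch of the Sparse Adaptive Dirichlet estimator (Section 2.1).

    Seen symbols are predicted in proportion to their counts N_t(x); the escape
    mass gamma_t is spread uniformly over the unseen symbols (w_t uniform here).
    """

    def __init__(self, alphabet_size):
        self.alphabet_size = alphabet_size   # |X|
        self.counts = Counter()              # N_t(x) for x in A_t
        self.t = 0

    def _gamma(self):
        if self.t == 0:
            return 1.0
        if len(self.counts) == self.alphabet_size:
            return 0.0                       # nothing left to escape to
        return float(len(self.counts))       # escape mass grows with |A_t|

    def predict(self, x):
        """Return rho_SAD(x | x_{1:t})."""
        gamma = self._gamma()
        if x in self.counts:
            return self.counts[x] / (self.t + gamma)
        unseen = self.alphabet_size - len(self.counts)
        return gamma / ((self.t + gamma) * unseen) if unseen else 0.0

    def observe(self, x):
        self.counts[x] += 1
        self.t += 1
```

The memory usage is driven by `counts`, which holds one entry per distinct observed symbol; this is precisely the linear growth that the budget estimator of Section 3 is designed to remove.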
## 3 Budget Sequential Estimation

The SAD estimator requires counting the occurrences of all encountered symbols. This is usually undesirable when the observed alphabet grows with the sequence length. Here we propose a sampling solution to this problem; this solution, K-distinct reservoir sampling, lets us impose a memory constraint on our estimator. More broadly, our sampling scheme can be combined with any algorithm with a non-parametric component; as an illustration, in Section 5.2 we apply our solution to restrict the branching factor of the Context Tree Weighting algorithm.

Let $\mu$ be an unknown source. We are given a parameter $K \in \mathbb{N}$ and seek a count-based estimator that predicts $x_{1:n} \sim \mu$ sequentially without using more than $O(K)$ memory. We call this problem the budget, sequential, count-based prediction setting. While similar settings have been studied extensively in the context of linear classifiers [Dekel et al., 2008; Cavallanti et al., 2007; Orabona et al., 2008] and in relation to the 0-1 loss [Lu and Lu, 2011], to the best of our knowledge the notion of a budget count-based frequency estimator is novel. From a practical perspective, we are interested in prediction problems for which the alphabet is large but the source skewed towards a few high-frequency symbols; it is for these problems that we expect meaningful prediction to be achievable.

Our competitor class $\mathcal{M}$ is the set of budget multinomial distributions. More specifically, we consider models of the form $\rho(x\,; Z, \theta, \theta_0)$, where $Z := \{x_{i_1}, \dots, x_{i_K}\} \subseteq \mathcal{X}$ is a subalphabet of cardinality $K$, $\theta \in \mathbb{R}^K$, and $\theta_0 \in \mathbb{R}$. Each such model predicts according to

$$\rho(x\,; Z, \theta, \theta_0) := \begin{cases} \theta_j & \text{if } x = x_{i_j} \in Z, \\ \theta_0 / |\mathcal{X} \setminus Z| & \text{otherwise,} \end{cases} \tag{1}$$

with the usual requirement that $\sum_{x \in \mathcal{X}} \rho(x\,; Z, \theta, \theta_0) = 1$.

Let $\mu$ be a memoryless stationary source and, for $B \subseteq \mathcal{X}$, let $\mu_B(x) := 0$ if $x \notin B$ and $\mu_B(x) := \mu(x)/\mu(B)$ otherwise. Finally, arrange $\mathcal{X} := \{x_1, x_2, \dots, x_m\}$ in order of decreasing probability, i.e. such that for all $i$, $\mu(x_i) \ge \mu(x_{i+1})$.

**Lemma 1.** The model $\rho^*_\mu \in \mathcal{M}$ which minimizes $\mathrm{KL}_1(\mu \,\|\, \cdot)$ is given by

$$Z^* = \arg\min_{Z : |Z| = K} \mu(\mathcal{X} \setminus Z)\,\mathrm{KL}_1(\mu_{\mathcal{X} \setminus Z} \,\|\, \mathcal{U}_{\mathcal{X} \setminus Z}), \qquad \theta^*(x) = \mu(x) \text{ for } x \in Z^*, \qquad \theta^*_0 = \mu(\mathcal{X} \setminus Z^*).$$

*Proof (sketch).* We first show that for a fixed $Z$, the optimal parameters correspond to those given above. We then use the convexity property of the KL divergence, namely that for $\lambda \in [0, 1]$ and pairs of probability distributions $(p_1, q_1)$, $(p_2, q_2)$,

$$\mathrm{KL}_1\big(\lambda p_1 + (1 - \lambda) p_2 \,\big\|\, \lambda q_1 + (1 - \lambda) q_2\big) \le \lambda\,\mathrm{KL}_1(p_1 \,\|\, q_1) + (1 - \lambda)\,\mathrm{KL}_1(p_2 \,\|\, q_2),$$

with equality when $(p_1, q_1)$ and $(p_2, q_2)$ have disjoint support. The result follows by setting $p_1 = q_1 = \mu_Z$, and taking $p_2, q_2$ to be respectively $\mu_{\mathcal{X} \setminus Z}$ and $\rho^*_{\mu, \mathcal{X} \setminus Z} = \mathcal{U}_{\mathcal{X} \setminus Z}$. The full proof is provided in the supplemental.

The next result shows that the optimal $Z^*$ is composed of a combination of the most and least frequent symbols according to $\mu$.

**Lemma 2.** Let $Z^*$ be the subalphabet associated with the model $\rho^*_\mu$. Then $Z^* = H \cup L$, where $H, L \subseteq \mathcal{X}$ are disjoint sets such that for all $x \in H$, $y \in \mathcal{X} \setminus H$, $\mu(x) \ge \mu(y)$, and for all $x \in L$, $y \in \mathcal{X} \setminus L$, $\mu(x) \le \mu(y)$.

*Proof (sketch).* The proof follows by using the identity $\mathrm{KL}(X \,\|\, \mathcal{U}_{\mathcal{X}}) = \log|\mathcal{X}| - H(X)$ for a discrete random variable $X$ distributed over $\mathcal{X}$, with $H(X)$ its entropy [Cover and Thomas, 1991]. We then show that, for $Z = \{x_i\}$, the function $\mu(\mathcal{X} \setminus Z)\,\mathrm{KL}_1(\mu_{\mathcal{X} \setminus Z} \,\|\, \mathcal{U}_{\mathcal{X} \setminus Z})$ is concave in $i$. The result is then extended to the general case $Z = \{x_{i_1}, \dots, x_{i_K}\}$.

When $\mu$ is known, $\rho^*_\mu$ can be constructed in time $O(K)$ by considering the different choices of $H$ and $L$. Furthermore, the set $\hat{Z}$ containing the $K$ most frequent symbols achieves an expected redundancy at most $1/e + \frac{K}{|\mathcal{X}|}\log K$ greater than the expected redundancy of $\rho^*_\mu$ (Lemma 3, supplemental); in our experiments and for large alphabets this redundancy is negligible. In the sequel we therefore assume that $Z^* = \hat{Z}$.
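As an illustration of the $\hat{Z}$ approximation discussed above, the following sketch builds the budget multinomial of Equation 1 from a known $\mu$: the $K$ most probable symbols keep their true probabilities and the leftover mass is spread uniformly over the rest of the alphabet. Representing $\mu$ as a plain dictionary is an assumption of this sketch.

```python
def top_k_budget_model(mu, K):
    """Sketch: budget multinomial (Equation 1) with Z = Z-hat, the K most
    frequent symbols under a known memoryless stationary source mu.

    mu: dict mapping every symbol of X to its probability (sums to 1).
    Returns a function x -> rho(x; Z, theta, theta_0).
    """
    ranked = sorted(mu, key=mu.get, reverse=True)
    Z = set(ranked[:K])
    theta_0 = max(0.0, 1.0 - sum(mu[x] for x in Z))   # mass left for X \ Z
    outside = len(mu) - len(Z)                        # |X \ Z|

    def predict(x):
        if x in Z:
            return mu[x]                              # theta(x) = mu(x) on Z
        return theta_0 / outside if outside else 0.0

    return predict
```

Lemma 2 says the truly optimal $Z^*$ may also include some of the least frequent symbols, but as noted above the top-$K$ choice costs at most a small additive redundancy, which is why the sequel works with $\hat{Z}$.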
However, we are interested in the setting where $\mu$ is unknown. In this case, our budget requirement poses a circular problem: we need good estimates of $\mu$ to find $Z^*$, and simultaneously need $Z^*$ to estimate $\mu$ (with bounded memory). Our solution is to maintain a time-varying set $\mathcal{Y}$ which contains (with high probability) the $K$ most frequent symbols. We call our algorithm K-distinct reservoir sampling, in relation to the reservoir sampling algorithm [Knuth, 1981]. As we shall see in Section 3.2, we can leverage K-distinct reservoir sampling to approximate the SAD estimator by estimating the frequency of symbols in $\mathcal{Y}$. We call the resulting approximation the Budget Sparse Adaptive Dirichlet estimator, or Budget SAD for short.

### 3.1 K-Distinct Reservoir Sampling

Let $x_{1:n}$ be a string of a priori unknown length and let $K \in \mathbb{N}$. Reservoir sampling [Algorithm R; Knuth, 1981] is an algorithm for sampling $y := y_1 \dots y_K \in \mathcal{X}^K$ from $x_{1:n}$, without replacement and using $O(K)$ memory. It proceeds as follows:

1. Randomly assign $x_1 \dots x_K$ to $y_1 \dots y_K$.
2. For $t = K + 1, \dots, n$, draw $r \sim \mathcal{U}(\{1, \dots, t\})$. If (a) $r \le K$, then assign $x_t$ to $y_r$; (b) otherwise do nothing.

We would like to use reservoir sampling to sample an approximation to $Z^*$. However, Algorithm R is inadequate for our purposes: since $y$ may contain duplicates, sampling $K$ distinct items may require significant memory. Our solution is to modify Algorithm R to compactly represent duplicates. We begin with the following observation:

**Observation 1.** Let $\tilde{x}_{1:n}$ be a random permutation of $x_{1:n}$, that is $\tilde{x}_{1:n} \sim \mathcal{U}(\mathcal{P}(x_{1:n}))$. Then $y := \tilde{x}_1 \dots \tilde{x}_K$ is a sample of size $K$ sampled without replacement from $x_{1:n}$.

We can therefore interpret Algorithm R as maintaining the first $K$ elements of a random permutation of $x_{1:n}$. By analogy, we define the K-distinct operator, which describes the first $K$ distinct elements of a string:

**Definition 1.** Let $w_{1:m} \in \mathcal{X}^m$. The K-distinct operator $\phi_K : \mathcal{X}^* \to \bigcup_{i=0}^{K} \mathcal{X}^i$ is recursively defined as

$$\phi_K(w_{1:m}) := \begin{cases} \epsilon & \text{if } K = 0 \text{ or } m = 0, \\ w_1\,\phi_{K-1}(w_{1:m} \setminus \{w_1\}) & \text{otherwise.} \end{cases}$$

The next step is to use this operator to define a K-distinct sample of $x_{1:n}$. From here onwards, we assume without loss of generality that sequences contain at least $K$ distinct symbols.

**Definition 2.** Let $Y_n := Y(x_{1:n})$ be a random variable taking values in $\mathcal{X}^K$, and let $\tilde{x}_{1:n} \sim \mathcal{U}(\mathcal{P}(x_{1:n}))$. Then $Y_n$ is a K-distinct sample of $x_{1:n}$ if $\Pr\{Y_n = y\} = \Pr\{\phi_K(\tilde{x}_{1:n}) = y\}$ for every $y \in \mathcal{X}^K$.

In our algorithm, the simple reservoir is replaced by a K-concise summary which keeps track of these first $K$ distinct symbols and their times $\tau(\cdot, w_{1:m})$ of first occurrence:

**Definition 3.** Let $w_{1:m} \in \mathcal{X}^m$ with $y_1 \dots y_{K+1} := \phi_{K+1}(w_{1:m})$. The K-concise summary of $w_{1:m}$ is a vector of $K$ pairs $S := \big((y_i, d_i) \in \mathcal{X} \times \mathbb{N} : i \in [K]\big)$ such that, for all $i \le K$,

1. $d_i = \tau(y_{i+1}, w_{1:m}) - \tau(y_i, w_{1:m})$, or equivalently
2. $D_i := \sum_{j=1}^{i-1} d_j = \tau(y_i, w_{1:m}) - 1$ for $i \in [K + 1]$.

Each pair $(y_i, d_i)$ in the K-concise summary of a string thus describes this string's $i$th distinct symbol and the gap $d_i$ between this distinct symbol and the next (Figure 1, bottom row).

*Figure 1: Insertion of C at r = 3 into a 3-concise summary: the summary (A,1)(B,3)(C,3) of $x_{1:n}$ = ABBACBADAB becomes (A,1)(B,2)(C,5) for $\tilde{x}_{1:n+1}$.*

In the spirit of Algorithm R, K-distinct reservoir sampling incrementally generates a random permutation $\tilde{x}_{1:n}$ through a sequence of insertions, and maintains the K-concise summary of this permutation. Effectively, this summary allows us to describe the beginning of $\tilde{x}_{1:n}$ compactly (using $O(K)$ memory), while preserving the information needed for later updates.
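A direct, offline rendering of Definition 3 may help fix the data structure before turning to the online algorithm. The sketch below scans a full string, which is exactly what Algorithm 1 avoids; measuring the last gap to the end of the string when fewer than $K+1$ distinct symbols exist is a convention assumed by this sketch.

```python
def k_concise_summary(w, K):
    """Sketch of Definition 3: the K-concise summary [(y_i, d_i)] of a string w,
    where y_1..y_K are its first K distinct symbols and d_i is the gap between
    the first occurrences of y_i and y_{i+1}."""
    firsts = []                              # (symbol, 1-based first-occurrence time)
    seen = set()
    for t, symbol in enumerate(w, start=1):
        if symbol not in seen:
            seen.add(symbol)
            firsts.append((symbol, t))
            if len(firsts) == K + 1:         # only the first K+1 distinct matter
                break
    times = [t for _, t in firsts] + [len(w) + 1]
    return [(sym, times[i + 1] - times[i]) for i, (sym, _) in enumerate(firsts[:K])]


# The example of Figure 1:
# k_concise_summary("ABBACBADAB", 3) == [("A", 1), ("B", 3), ("C", 3)]
```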
For $x \in \mathcal{X}$, $r \in \{0, \dots, D_{K+1}\}$, and $S := \big((y_i, d_i) : i \in [K]\big)$, define the vector and set of stored symbols

$$y(S) := \big(y_i : i \in [K]\big), \qquad \mathcal{Y}(S) := \{y_i : i \in [K]\},$$

as well as the indices $I : \mathcal{Y}(S) \to [K]$ and $J : \{0, \dots, D_{K+1}\} \to [K + 1]$:

$$I(x) := i \in [K] \text{ s.t. } y_i = x, \qquad J(r) := \min\{i \in [K + 1] : r \le D_i\}.$$

In K-distinct reservoir sampling (Algorithm 1), $I(x)$ and $J(r)$ are used to indicate, respectively, the index of $x$ in $S$ and the index in $S$ at which a new symbol should be inserted. For example, in Figure 1 (left) $I(\mathrm{B}) = 2$ and $J(3) = 3$.

**Algorithm 1** K-Distinct Reservoir Sampling

    Initially: S = ∅

    SAMPLE(K, x_{1:n}):
        for t = 1 ... n do
            Observe x_t
            Draw r ~ U({0, ..., t-1})
            if r > D_{K+1} then do nothing
            else if x_t ∉ Y(S) then INSERT(S, x_t, r)
            else if I(x_t) < J(r) then d_{J(r)-1} ← d_{J(r)-1} + 1
            else MERGE(S, x_t) and INSERT(S, x_t, r)
        Output y(S)

    INSERT(S, x_t, r):
        Insert (x_t, D_{J(r)} - r + 1) at position J(r) in S
        if J(r) > 1 then d_{J(r)-1} ← r - D_{J(r)-1}
        if |S| > K then remove item K + 1 from S

    MERGE(S, x_t):
        d_{I(x_t)-1} ← d_{I(x_t)-1} + d_{I(x_t)}
        Remove item I(x_t) from S

**Theorem 1.** When Algorithm 1 terminates, its summary $S$ is the K-concise summary of a permutation $\tilde{x}_{1:n}$ sampled uniformly at random from $\mathcal{P}(x_{1:n})$.

*Proof (sketch).* Let $r_{1:n}$ be the string of random integers generated by Algorithm 1. These induce a sequence $(\tilde{x}^t_{1:t} : t \in \mathbb{N})$ of permutations of $x_{1:t}$. To show the correctness of Algorithm 1, we simply show that its operations mirror this sequence of permutations and incrementally generate the sequence of K-concise summaries of $\tilde{x}^t_{1:t}$. The different cases handled by Algorithm 1 then arise from different kinds of insertions.

**Corollary 1.** Algorithm 1 outputs a K-distinct sample of $x_{1:n}$.

Note here the significance of Theorem 1: we can simulate, in a fully online fashion and in $O(K)$ memory and running time, the sampling of a permutation $\tilde{x}_{1:n} \sim \mathcal{U}(\mathcal{P}(x_{1:n}))$ and the subsequent selection of its first $K$ distinct symbols. Interestingly enough, the use of random insertions, rather than swaps (as is done in Algorithm R), seems necessary to guarantee the correct distribution on $y(S)$.
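The following Python sketch mirrors Algorithm 1 as reconstructed above, storing $S$ as a list of (symbol, gap) pairs and using plain linear scans for $D_i$, $I$ and $J$ (so each step costs $O(K)$ time, with no attempt at cleverer bookkeeping). The explicit handling of the corner case where the front symbol is re-inserted at the very front ($r = 0$) is an assumption of this sketch.

```python
import random

class KDistinctReservoir:
    """Sketch of K-distinct reservoir sampling (Algorithm 1).

    Maintains the K-concise summary S = [(y_i, d_i)] of a uniformly random
    permutation of the stream, built online by simulated random insertions.
    """

    def __init__(self, K, rng=None):
        self.K = K
        self.S = []                       # (symbol, gap) pairs; position i <-> index i-1
        self.t = 0                        # number of symbols observed so far
        self.rng = rng or random.Random()

    def _D(self, i):
        """D_i: sum of the first i-1 gaps, i.e. first-occurrence time of y_i minus 1."""
        return sum(d for _, d in self.S[:i - 1])

    def _J(self, r):
        """Summary position at which an element inserted after r stream positions falls."""
        for i in range(1, len(self.S) + 2):
            if r <= self._D(i):
                return i
        return len(self.S) + 1

    def _I(self, x):
        """1-based index of x in the summary, or None if x is not stored."""
        for i, (y, _) in enumerate(self.S, start=1):
            if y == x:
                return i
        return None

    def _insert(self, x, r):
        j = self._J(r)
        self.S.insert(j - 1, (x, self._D(j) - r + 1))
        if j > 1:                         # shorten the gap of the preceding symbol
            y, _ = self.S[j - 2]
            self.S[j - 2] = (y, r - self._D(j - 1))
        if len(self.S) > self.K:
            self.S.pop()                  # forget the (K+1)-th distinct symbol

    def _merge(self, i):
        y_prev, d_prev = self.S[i - 2]    # fold x's gap into its predecessor's
        _, d_i = self.S[i - 1]
        self.S[i - 2] = (y_prev, d_prev + d_i)
        self.S.pop(i - 1)

    def observe(self, x):
        self.t += 1
        r = self.rng.randrange(self.t)    # r ~ U({0, ..., t-1})
        if len(self.S) == self.K and r > self._D(self.K + 1):
            return                        # insertion lands beyond the summarized prefix
        i = self._I(x)
        if i is None:
            self._insert(x, r)            # new distinct symbol within the prefix
        elif i == 1 and r == 0:
            y, d = self.S[0]              # duplicate of the front symbol at the front:
            self.S[0] = (y, d + 1)        # only the first gap widens (corner case)
        elif i < self._J(r):
            j = self._J(r)                # duplicate occurrence, not a new earliest one:
            y, d = self.S[j - 2]
            self.S[j - 2] = (y, d + 1)    # widen the gap the insertion falls into
        else:
            self._merge(i)                # x now occurs earlier: drop its old slot
            self._insert(x, r)            # and re-insert it at its new position

    def sample(self):
        """y(S): the (up to) K distinct symbols currently summarized."""
        return [y for y, _ in self.S]
```

A quick way to sanity-check such a sketch is to compare, over many random runs, the empirical distribution of `sample()` against first-K-distinct samples of explicitly permuted copies of a short stream (Definition 2).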
### 3.2 The Budget SAD Estimator

Naturally, we wish to use K-distinct reservoir sampling to approximate the set $Z^*$ of frequently occurring symbols in $x_{1:n}$. However, the $d_i$ variables do not estimate the frequencies of their respective symbols. To estimate these frequencies, our Budget Sparse Adaptive Dirichlet estimator augments a K-concise summary with count information. The new summary is a vector of tuples

$$U := \big((y_i, d_i, c_i) \in \mathcal{X} \times \mathbb{N} \times \mathbb{N} : i \in [K]\big),$$

where $c_i$ represents the number of occurrences of $y_i$ since it last entered the summary.

**Algorithm 2** Budget SAD

    Initially: U = ∅, z = 0
    for t = 1 ... n do
        Predict x_t according to Equation 2
        Observe x_t
        Update U as per Algorithm 1
        if x_t is new in U then c_{I(x_t)} ← 1
        else if x_t ∈ Y(U) then c_{I(x_t)} ← c_{I(x_t)} + 1
        else z ← z + 1    // discard x_t

New symbols are added to $U$ with $c = 1$; then, whenever a symbol already present in $U$ is observed, its count is incremented. Recall that the SAD prediction depends on $N_t(x)$, the number of occurrences of $x$ in $x_{1:t}$, as well as the subalphabet $A_t$ (Section 2.1). Our process yields $\mathcal{Y}_t := \mathcal{Y}(U_t)$ as an approximation to $A_t$, as well as an approximate count function $\hat{N}_t : \mathcal{X} \to \mathbb{R}$, with $\hat{N}_t(x) = 0$ for $x \notin \mathcal{Y}_t$. The algorithm also tracks $z_t$, the number of discarded symbols, and approximates the SAD prediction rule as

$$\hat{N}_t(x) := (t - z_t)\,\frac{c_{t, I_t(x)}}{\sum_{x'} c_{t, I_t(x')}} \;\text{ for } x \in \mathcal{Y}_t, \qquad \hat{\gamma}_t := \gamma(t, \mathcal{Y}_t) + z_t, \qquad \bar{N}_t(x) := \max\big\{\hat{N}_t(x),\, \hat{\gamma}_t w_t(x)\big\},$$

$$\rho_B(x \mid x_{1:t}) \propto \begin{cases} \bar{N}_t(x) & \text{if } x \in \mathcal{Y}_t, \\ \hat{\gamma}_t w_t(x) & \text{otherwise,} \end{cases} \tag{2}$$

where $c_{t,\cdot}$ denote the stored counts at time $t$ and the normalization is such that $\sum_{x' \in \mathcal{X}} \rho_B(x' \mid x_{1:t}) = 1$.

Our Budget SAD is designed to overestimate the probability of frequent symbols (via the $(t - z_t)$ renormalization); the definition of $\bar{N}_t(x)$ ensures that symbols in $\mathcal{Y}_t$ are never deemed less likely than unseen symbols. The process is summarized in Algorithm 2.

The $z_t$ term keeps track of discarded counts; this is to correct for two types of counting errors. First, the counter $c_{t, I(x)}$ for a symbol $x \in \mathcal{Y}_t$ only equals $N_t(x)$ if 1) it was never discarded and 2) never removed from the reservoir; in general, $c_{t, I(x)}$ is an underestimate of $N_t(x)$. We also need to compensate for our lack of knowledge of $|A_t|$ in computing $\hat{\gamma}_t$, the probability mass associated with unseen symbols. Effectively, $z_t$ mitigates both issues by rectifying the stored counts so as to approximate the SAD denominator. In particular, when $N_t(x) = \hat{N}_t(x)$ for all $x \in \mathcal{Y}_t$, we have

$$\sum_{x \in \mathcal{Y}_t} \bar{N}_t(x) + \hat{\gamma}_t = t + \gamma(t, \mathcal{Y}_t).$$

One may also increment $z_t$ to keep track of counts that are discarded when a symbol leaves the reservoir; however, this does not affect our redundancy bound.

In addition to these counting errors, the Budget SAD suffers a redundancy cost from forgetting symbols. Clearly, this problem arises only when symbols are removed from the reservoir. As we now show, the Budget SAD achieves low redundancy with respect to the optimal memory-bounded model (Section 3) whenever the source's tail distribution is close to uniform.
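To make Equation 2 concrete, here is a sketch of the prediction rule given the reservoir's stored counts, taking $w_t$ uniform over $\mathcal{X}$ (the same simplification used in the analysis below) and the escape rule $\gamma(t, \mathcal{Y}_t)$ as in Section 2.1. The interface, a plain dictionary of counts plus the discard counter $z_t$, is an assumption of this sketch rather than a fixed API.

```python
def budget_sad_predict(x, stored_counts, t, z, alphabet_size):
    """Sketch of the Budget SAD prediction rule (Equation 2).

    stored_counts: {symbol: c_i} for the symbols currently in the reservoir Y_t.
    t: number of symbols observed so far; z: number of discarded symbols z_t.
    """
    Y = stored_counts
    w = 1.0 / alphabet_size                     # w_t taken uniform over X
    if t == 0:
        gamma = 1.0
    elif len(Y) == alphabet_size:
        gamma = 0.0
    else:
        gamma = float(len(Y))                   # gamma(t, Y_t), as in the SAD
    gamma_hat = gamma + z                       # unseen mass inflated by discards
    total = sum(Y.values()) or 1

    def weight(s):
        n_hat = (t - z) * Y[s] / total          # renormalized count estimate N-hat
        return max(n_hat, gamma_hat * w)        # stored symbols never below unseen mass

    norm = sum(weight(s) for s in Y) + (alphabet_size - len(Y)) * gamma_hat * w
    return (weight(x) if x in Y else gamma_hat * w) / norm
```

Wired to the reservoir of Section 3.1, the counts would be maintained as in Algorithm 2: set to 1 when a symbol enters the summary, incremented when a stored symbol recurs, and `z` incremented whenever a symbol is observed while absent from the reservoir.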
## 4 Analysis

Let $\mu$ be a memoryless stationary source and let $x_{1:n} \sim \mu$. We compare our Budget SAD estimator $\rho_B$ to the optimal memory-bounded model $\rho^*_\mu \in \mathcal{M}$ described in Section 3. In our analysis we consider the associated optimal set $Z^*$ of size $K^*$, and derive a redundancy bound which depends on the probability mass $\mu(Z^*)$ as well as the reservoir size $K > K^*$. For clarity, we take the SAD probability distribution $w_t(\cdot)$ to be uniform over the alphabet $\mathcal{X}$, noting that our result extends to the more general case.

We take a conservative approach and bound the expected redundancy $\mathbb{E}\,F_n(\rho_B, \rho^*_\mu)$ according to the probability of a best-case scenario. In this best-case scenario, the symbols in $Z^*$ are never ejected from the reservoir. We define a good event

$$G_n(x) := \big\{c_{n, I_n(x)} = N_n(x)\big\},$$

which occurs when the stored count for $x$ matches the total number of occurrences of $x$ in $x_{1:n}$; this implies that $c_{t, I_t(x)} = N_t(x)$ for all $t \in [n]$. The instantaneous redundancy of our algorithm at time $t$ then depends on the probability of $G_n(x_t)$. Critically, we set $K$ so that $G_n(x)$ must occur with probability $1 - \delta$ for all $x \in Z^*$. For $B \subseteq \mathcal{X}$ and a memoryless stationary source $\mu$, define $\upsilon(B, \mu) := \min_{x \in B} \mu(x)$. We first provide a lemma stating the conditions under which $G_n(B) := \bigcap_{x \in B} G_n(x)$ occurs with high probability.

**Lemma 3.** Let $B \subseteq \mathcal{X}$, $\delta \in (0, 1)$, and $K \ge \upsilon(B, \mu)^{-1} \log(|B| n \delta^{-1})$. Then $\Pr\{G_n(B)\} \ge 1 - \delta$.

*Proof (sketch).* The event $G_n(x)$ occurs if $x$ never leaves the K-concise summary after entering it. Because K-distinct reservoir sampling induces a permutation of $x_{1:n}$, we can view the corresponding K-distinct sample as generated from independent draws without replacement from $\mu$. Computing the probability that a symbol $x \in B$ is not drawn in $K$ tries and taking union bounds yields the result.

Our main theoretical result follows from this lemma:

**Theorem 2.** Let $\delta = o(\log|\mathcal{X}| + n \log n)^{-1}$ and $K \ge \upsilon(Z^*, \mu)^{-1}\log(|Z^*| n \delta^{-1})$. Let $Z_{\mathrm{AUG}} \subseteq \mathcal{X}$ be the set of size $K$ maximizing $\mu(\cdot)$, and let $\kappa := K(1 - \mu(Z_{\mathrm{AUG}}))^{-1}$. The expected redundancy of $\rho_B$ with respect to $\rho^*_\mu$ is bounded as

$$\mathbb{E}\,F_n(\rho_B, \rho^*_\mu) \le \mu(Z^*)\,\frac{|Z^*| - 1}{2}\log n + (1 - \mu(Z^*))\,\kappa \log^2 n + O\big(\kappa \log(\kappa n)\log\log n\big),$$

where the expectation is with respect to both $x_{1:n} \sim \mu$ and $r_{1:n}$, the random integers chosen by Algorithm 1.

*Proof (sketch).* We consider a partition of the event space at each time step: 1. $x_t \in Z^*$ and $G_n(Z^*)$, 2. $x_t \in Z^*$ and $\neg G_n(Z^*)$, or 3. $x_t \notin Z^*$. In the first case, the Budget SAD correctly estimates the probability of $x_t$, and exhibits similar regret guarantees to the SAD. The second case contributes negligible expected redundancy, since $\Pr\{\neg G_n(Z^*)\}$ is small. The final case gives rise to most of the expected redundancy, which we bound by providing bounds on $z_t$ (which lets us estimate the probability of symbols not in the K-concise summary), which itself depends on a bound on $D_{t, K+1}$, the expected gap between the first and $(K+1)$th symbols in our summary. Let $r_{1:n}$ be the random draws of $r$ within K-distinct reservoir sampling. We show that $\mathbb{E}_{x_{1:n}, r_{1:n}}[D_{t, K+1}] \le \kappa - 1$, which provides the last two terms of our bound.

**Remark 1.** The expected redundancy of $\rho_B$ with respect to $\rho^*_\mu$ is also bounded as $\mathbb{E}\,F_n(\rho_B, \rho^*_\mu) \le O(n \log n)$, and in particular the bound of Theorem 2 is meaningful whenever $\kappa \log n = o(n)$.

The three terms in Theorem 2 correspond respectively to 1) the cost of learning $\mu$ over $Z^*$, 2) estimating the quantity $1 - \mu(Z^*)$, and 3) the redundancy for symbols which enter, but do not remain in, our reservoir. In the presence of a favourable distribution (i.e., when $(1 - \mu(Z^*))/(1 - \mu(Z_{\mathrm{AUG}}))$ is small), our algorithm is asymptotically optimal, in the sense that its per-step redundancy goes to 0 as $n \to \infty$. This is the case, for example, when the source is in the class $\mathcal{M}$ of memory-bounded models (Equation 1) and $|\mathcal{X}| \gg n$.

As a final note, Lu and Lu [2011] recently provided an $\Omega(n)$ lower bound on the regret suffered by a bounded-memory, sequential prediction algorithm in the 0-1 loss setting. We point out that our analysis does not contradict their bound since, in the worst case, the second term may grow superlinearly (see Remark 1 and Section 6 below). In this case, one may (trivially) construct a Bayesian mixture between $\rho_B$ and a uniform estimator to achieve $O(n \log|\mathcal{X}|)$ regret.

## 5 Empirical Evaluation

The Budget SAD is especially motivated by the desire to minimize memory usage in the presence of infrequent symbols. Here we present two experiments showing the practical value of our approach.

### 5.1 Synthetic Image Domain

We first consider a synthetic image domain to demonstrate the effectiveness of our algorithm in the presence of symbol noise. In this domain, the source first selects one of $m$ images uniformly at random, then corrupts this image with probability $p$ and otherwise leaves it unchanged. Here we consider 50×60, 8-bit images and a uniform noise corruption model (Figure 2; images from Wikipedia, 2015). Thus, while the source emits symbols from an alphabet $\mathcal{X}$ of size $256^{50 \times 60}$, only $m$ of these symbols occur with non-negligible frequency.

*Figure 2: Sample from the synthetic image source with m = 4 and p = 0.9. Symbols are corrupted using uniform pixel noise.*
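Since each corrupted image is, with overwhelming probability, a symbol never seen before, the source can be abstracted as follows for experimentation. Treating every corrupted image as a fresh random 32-bit symbol is an assumption of this sketch, not the paper's exact pixel-level procedure.

```python
import random

def synthetic_image_source(m, p, rng=None):
    """Sketch of the Section 5.1 source: pick one of m frequent symbols uniformly,
    then with probability p replace it by a noise symbol that is effectively unique
    (here, a fresh random 32-bit integer standing in for a pixel-corrupted image)."""
    rng = rng or random.Random()
    frequent = list(range(m))                 # the m uncorrupted images
    while True:
        symbol = rng.choice(frequent)
        if rng.random() < p:
            symbol = rng.randrange(2 ** 32)   # corrupted image ~ unique noise symbol
        yield symbol
```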
We implemented the Budget SAD with a time-dependent reservoir size $K$ to match the regret bound of Theorem 2: from a parameter $R \in \mathbb{R}^+$, we set $K_t = \lceil R \log(t + 1) \rceil$. We trained both SAD and Budget SAD on one million symbols generated with $m = 4$ and $p \in \{0.1, 0.9\}$. We performed experiments with $R \in [1.0, 2.0]$ (when $p = 0.1$) and $R \in [0.5, 10.0]$ (when $p = 0.9$), with the results of each Budget SAD experiment averaged across 100 trials. To study the behaviour of our algorithm, we map each value of $R$ to its corresponding end-of-sequence reservoir size and report the algorithm's redundancy as a function of this value.

Results for both values of $p$ are shown in Figure 3. The Budget SAD achieves equal or lower redundancy using a small fraction of the SAD's memory budget (0.01% for both $p = 0.1$ and $p = 0.9$). Note that the Budget SAD even performs slightly better than the SAD in this instance, as it assigns a higher probability mass to unseen symbols. We conclude that the Budget SAD has the ability to significantly reduce memory usage.

*Figure 3: Redundancy (bits per symbol) for the Budget SAD as a function of maximum reservoir size (memory budget K), with the full SAD (K = 99,895 and K = 899,828, respectively) and the source entropy shown for reference. Images are encoded as 32-bit integers (i.e., log2 |X| = 32).*

### 5.2 Branch Pruning in Context Tree Weighting

Our analysis so far has focused on memoryless sources and their models. A common method of augmenting memoryless models is to embed them into a tree structure in order to guarantee low redundancy with respect to the class of k-Markov sources. One such method, Context Tree Weighting [Willems et al., 1995], uses the generalized distributive law [Aji and McEliece, 2000] to efficiently perform Bayesian model averaging over tree structures. At the core of CTW is a growing $|\mathcal{X}|$-ary context tree; each new symbol creates up to $D$ new nodes, where $D$ is a depth parameter. Each node corresponds to a context (a subsequence of $x_{1:n}$) and maintains a statistical estimator (here, the SAD). Here we consider a modified CTW in which each node uses a K-distinct reservoir to store its children: removing an item from the reservoir prunes a whole subtree.

To understand the need for branch pruning, consider the Wikipedia Clean Text data set [Mahoney, 2011], which we use as our sequential prediction task here. This data set is composed of 713M lowercase characters (a-z and whitespace; $|\mathcal{X}| = 27$). Within are 105M unique contexts of length at most $D = 9$. This number grows linearly over time, making a memory-bounded solution highly desirable. However, each context sees, on average, two or fewer symbols; as a result, using the Budget SAD at each node cannot significantly reduce memory usage. Branch pruning, on the other hand, is relatively effective.

For our experiment we used the more efficient CTW relative, Context Tree Switching [CTS; Veness et al., 2012], and derived our parameters from its authors' open-source implementation. As with the synthetic experiment, we considered values in the range $R \in [1.0, 3.0]$. For each node $u$, its reservoir size was increased as $\lceil R \log(t_u + 1) \rceil$, with $t_u$ the number of visits to this node. We set the depth parameter to $D = 9$, as larger values of $D$ did not affect redundancy, and averaged each experiment across 10 trials.

*Figure 4: Left: Redundancy for the Budget SAD and SAD estimators (bits per character versus characters processed, in millions). Right: Memory usage for the Budget SAD (% of CTS, versus characters processed, in millions).*

As illustrated in Figure 4, our algorithm, through the $R$ parameter, can smoothly trade off memory usage and redundancy (with respect to an unpruned CTS). For example, for a small redundancy (≈5%) we observed a 44.6% average reduction in memory usage ($R \approx 1.76$, not illustrated). Furthermore, there was minimal variance between trials.
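Both experiments grow the reservoir with the same logarithmic schedule. The sketch below shows how such a budget might be wired into the reservoir class sketched in Section 3.1; raising K in place is an assumption of this sketch, and past truncations are of course not undone when the budget grows.

```python
import math

def reservoir_budget(t, R):
    """K_t = ceil(R * log(t + 1)), the time-dependent budget used in Section 5."""
    return math.ceil(R * math.log(t + 1))

class GrowingKDistinctReservoir(KDistinctReservoir):
    """K-distinct reservoir whose budget follows K_t = ceil(R log(t + 1))."""

    def __init__(self, R, rng=None):
        super().__init__(K=1, rng=rng)
        self.R = R

    def observe(self, x):
        # Raise the budget before processing the (t+1)-th symbol.
        self.K = max(self.K, reservoir_budget(self.t + 1, self.R))
        super().observe(x)
```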
While memory usage still remains linear in $n$, we believe this result is particularly significant given the Zipfian behaviour of natural language data, which in our analysis corresponds to a large $\kappa$ term.

## 6 Discussion

In our setting, the optimal model $\rho^*_\mu$ assigns a uniform probability over the set $\mathcal{X} \setminus Z^*$. Interestingly enough, the main source of redundancy in the Budget SAD is the $\kappa \log^2 n$ cost for approximating this uniform probability. In particular, for some distributions (e.g. $\mu(x_i) \propto 2^{-i}$ for $i \in \mathbb{N}$) our analysis suggests an unpleasant $O(n \log n)$ redundancy, i.e., possibly worse-than-random performance. This cost stems from our overestimation of symbols in $Z^*$; it seems probable that an alternative approach, which instead overestimates $1 - \mu(Z^*)$, exists.

On the other hand, with well-behaved data the Budget SAD provides significant memory savings. This is the case, for example, when the tail of $\mu$ is approximately uniform. Most count-based frequency estimators [Friedman and Singer, 1999; Teh, 2006; Veness and Hutter, 2012; Hutter, 2013] model this behaviour by setting aside a lump probability mass $\varepsilon$ for all unobserved symbols; our $z_t$ term serves a similar purpose.

Surprisingly, K-distinct reservoir sampling is unnecessary if the source $\mu$ is truly memoryless and stationary. In fact, in this case the first $K$ distinct symbols in $x_{1:n}$ are just as good an approximation to $Z^*$! The benefit of our approach is to (partially) inoculate the budget estimator against adversarial empirical sources; a similar role is played by the Dirichlet-multinomial prior in estimating counts.

Orlitsky et al. [2003] proposed two methods for achieving o(n) per-symbol redundancy on open alphabets, where new symbols appear at an O(n) rate. As an alternative to the $z_t$ term, in this setting it may be sufficient to accurately estimate $|A_t|$, the size of the alphabet at time $t$; work on distinct value queries suggests this is possible [Gibbons, 2001]. However, Orlitsky et al.'s method requires that we instead estimate the prevalence, or frequency distribution, of symbols. Whether this can be done with bounded memory is not known to us.

## 7 Conclusion

In this paper we presented a novel algorithm, K-distinct reservoir sampling, which incrementally maintains a sample of frequently occurring symbols within a data stream. We described how to combine K-distinct reservoir sampling with a simple count-based estimator and derived a corresponding redundancy bound. As our empirical results show, our algorithm can significantly reduce the memory requirements of both memoryless and tree-based estimators when dealing with long data sequences.

## 8 Acknowledgements

The author wishes to thank Joel Veness for uncountably many discussions on the topic of count-based estimation, Laurent Orseau for providing thorough editorial comments, Alvin Chua and Michael Bowling for helpful discussion, and finally the anonymous reviewers for their superb feedback.

## References

[Aji and McEliece, 2000] Srinivas M. Aji and Robert J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[Bellemare et al., 2013] Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013.
[Cavallanti et al., 2007] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3):143–167, 2007.

[Charikar et al., 2004] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004.

[Cleary and Witten, 1984] John G. Cleary and Ian Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.

[Cover and Thomas, 1991] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

[Dekel et al., 2008] Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The Forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.

[Demaine et al., 2002] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In Proceedings of the Tenth Annual European Symposium on Algorithms, pages 348–360, 2002.

[Farias et al., 2010] Vivek F. Farias, Ciamac C. Moallemi, Benjamin Van Roy, and Tsachy Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441–2454, 2010.

[Friedman and Singer, 1999] Nir Friedman and Yoram Singer. Efficient Bayesian parameter estimation in large discrete domains. Advances in Neural Information Processing Systems 11, 11:417, 1999.

[Gale and Church, 1990] William Gale and Kenneth W. Church. Poor estimates of context are worse than none. In Proceedings of the Workshop on Speech and Natural Language, pages 283–287, 1990.

[Gibbons and Matias, 1998] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD, volume 27, pages 331–342, 1998.

[Gibbons, 2001] Phillip B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the International Conference on Very Large Databases, 2001.

[Hutter, 2013] Marcus Hutter. Sparse adaptive Dirichlet-multinomial-like processes. In COLT, 2013.

[Katz, 1987] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.

[Knuth, 1981] Donald E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, 1981.

[Krichevsky and Trofimov, 1981] R. Krichevsky and V. Trofimov. The performance of universal coding. IEEE Transactions on Information Theory, 27:199–207, 1981.

[Lu and Lu, 2011] Chi-Jen Lu and Wei-Fu Lu. Making online decisions with bounded memory. In Algorithmic Learning Theory, pages 249–261, 2011.

[Mahoney, 2011] Matt Mahoney. Large text compression benchmark, 2011. [Online; accessed 01-February-2015].

[Manku and Motwani, 2002] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Databases, pages 346–357, 2002.

[Metwally et al., 2005] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Technology, pages 398–412, 2005.
[Orabona et al., 2008] Francesco Orabona, Joseph Keshet, and Barbara Caputo. The Projectron: a bounded kernel-based perceptron. In Proceedings of the 25th International Conference on Machine Learning, pages 720–727. ACM, 2008.

[Orlitsky et al., 2003] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always Good Turing: Asymptotically optimal probability estimation. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science, 2003.

[Teh, 2006] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics, pages 985–992, 2006.

[Tjalkens et al., 1993] Tj. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems. Context tree weighting: Multi-alphabet sources. In 14th Symposium on Information Theory in the Benelux, pages 128–135, 1993.

[Tziortziotis et al., 2014] Nikolaos Tziortziotis, Christos Dimitrakakis, and Konstantinos Blekas. Cover tree Bayesian reinforcement learning. The Journal of Machine Learning Research, 15(1):2313–2335, 2014.

[Veness and Hutter, 2012] Joel Veness and Marcus Hutter. Sparse sequential Dirichlet coding. arXiv e-prints, 2012.

[Veness et al., 2012] Joel Veness, Kee Siong Ng, Marcus Hutter, and Michael H. Bowling. Context tree switching. In Data Compression Conference (DCC), pages 327–336, 2012.

[Wikipedia, 2015] Wikipedia. Campbell's Soup Cans, 2015. [Online; accessed 01-February-2015].

[Willems et al., 1995] Frans M. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41:653–664, 1995.