# Skip Context Tree Switching

Marc G. Bellemare (BELLEMARE@GOOGLE.COM), Joel Veness (VENESS@GOOGLE.COM), Google DeepMind
Erik Talvitie (ERIK.TALVITIE@FANDM.EDU), Franklin and Marshall College

Abstract

Context Tree Weighting is a powerful probabilistic sequence prediction technique that efficiently performs Bayesian model averaging over the class of all prediction suffix trees of bounded depth. In this paper we show how to generalize this technique to the class of K-skip prediction suffix trees. In contrast to regular prediction suffix trees, K-skip prediction suffix trees are permitted to ignore up to K contiguous portions of the context. This allows for significant improvements in predictive accuracy when irrelevant variables are present, a case which often occurs within record-aligned data and images. We provide a regret-based analysis of our approach, and empirically evaluate it on the Calgary corpus and a set of Atari 2600 screen prediction tasks.

1. Introduction

The sequential prediction setting, in which an unknown environment generates a stream of observations which an algorithm must probabilistically predict, is highly relevant to a number of machine learning problems such as statistical language modelling, data compression, and model-based reinforcement learning. A powerful algorithm for this setting is Context Tree Weighting (CTW; Willems et al., 1995), which efficiently performs Bayesian model averaging over a class of prediction suffix trees (Ron et al., 1996). In a compression setting, Context Tree Weighting is known to be an asymptotically optimal coding distribution for D-Markov sources.

A significant practical limitation of CTW stems from the fact that model averaging is only performed over prediction suffix trees whose ordering of context variables is fixed in advance. As we discuss in Section 3, reordering these variables can lead to significant performance improvements given limited data. This idea was leveraged by the class III algorithm of Willems et al. (1996), which performs Bayesian model averaging over the collection of prediction suffix trees defined over all possible fixed variable orderings. Unfortunately, the O(2^D) computational requirements of the class III algorithm prohibit its use in most practical applications.

Our main contribution is the Skip Context Tree Switching (Skip CTS) algorithm, a polynomial-time compromise between the linear-time CTW and the exponential-time class III algorithm. We introduce a family of nested model classes, the Skip Context Tree classes, which form the basis of our approach. The Kth-order member of this family corresponds to prediction suffix trees which may skip up to K runs of contiguous variables. The usual model class associated with CTW is a special case, and corresponds to K = 0. In many cases of interest, Skip CTS's O(D^{2K+1}) running time is practical and provides significant performance gains compared to Context Tree Weighting.

Skip CTS is best suited to sequential prediction problems where a good fixed variable ordering is unknown a priori. As a simple example, consider the record-aligned data depicted in Figure 1. Skip CTS with K = 1 can improve on the CTW ordering by skipping the five most recent symbols and directly learning the lexicographical relation.

Figure 1. A sequence of lexicographically sorted fixed-length strings (AFRAID, AGAIN!, ALWAYS, AMAZED, BECOME, BEHOLD), which is particularly well-modelled by Skip CTS.
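To make the skipping idea concrete, the following sketch (our own illustration, not the paper's algorithm; the function names and the way a skip is specified are assumptions) contrasts an ordinary suffix-tree context with a context that skips one contiguous run of recent symbols on record-aligned data of the kind shown in Figure 1.

```python
# Minimal sketch contrasting an ordinary suffix context with a 1-skip context.
# Illustrative only; the skip-pattern encoding below is an assumption.

def suffix_context(history, depth):
    """Standard CTW-style context: the `depth` most recent symbols, newest first."""
    return history[-depth:][::-1]

def skip_context(history, depth, skip_start, skip_len):
    """Context of length `depth` that ignores `skip_len` contiguous positions
    starting `skip_start` symbols back (0 = most recent symbol)."""
    recent = history[::-1]                         # index 0 = most recent symbol
    kept = recent[:skip_start] + recent[skip_start + skip_len:]
    return kept[:depth]

# Record-aligned data as in Figure 1: sorted six-character strings, back to back.
history = list("AFRAIDAGAIN!ALWAYSAMAZEDBECOME")
# An ordinary depth-6 context for predicting the next record's first letter
# conditions on the tail of the previous record ...
print("".join(suffix_context(history, 6)))        # -> EMOCEB
# ... whereas skipping the five most recent symbols conditions directly on the
# previous record's first letter, which carries the lexicographic information.
print("".join(skip_context(history, 1, 0, 5)))    # -> B
```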
While Context Tree Weighting has traditionally been used as a data compression algorithm, it has proven useful in a diverse range of sequential prediction settings. For example, Veness et al. (2011) proposed an extension (FAC-CTW) for Bayesian, model-based reinforcement learning in structured, partially observable domains. Bellemare et al. (2013b) used FAC-CTW as a base model in their Quad-Tree Factorization algorithm, which they applied to the problem of predicting high-dimensional video game screen images. Our empirical results on the same video game domains (Section 4.2) suggest that Skip CTS is particularly beneficial in this more complex prediction setting.

2. Background

We consider the problem of probabilistically predicting the output of an unknown sequential data generating source. Given a finite alphabet X, we write x_{1:n} := x_1 x_2 ... x_n ∈ X^n to denote a string of length n, xy to denote the concatenation of two strings x and y, and x^i to denote the concatenation of i copies of x. We further denote x_{<i} := x_{1:i-1}. A sequential probabilistic model ρ assigns a probability ρ(x_{1:n}) to every string x_{1:n} ∈ X^n; the conditional probability of a symbol is ρ(x_i | x_{<i}) := ρ(x_{1:i}) / ρ(x_{<i}) whenever ρ(x_{<i}) > 0, from which the chain rule ρ(x_{1:n}) = ∏_{i=1}^n ρ(x_i | x_{<i}) follows.

Given a class M of sequential models, a standard way to compete with the best model in M is Bayesian model averaging, which uses the mixture ξ_MIX(x_{1:n}) := Σ_{ρ∈M} w_ρ ρ(x_{1:n}), where the w_ρ > 0 are prior weights satisfying Σ_{ρ∈M} w_ρ = 1. It can readily be shown that, for any ρ ∈ M, we have R_n(ξ_MIX, {ρ}) ≤ -log w_ρ, where R_n denotes the excess log-loss (regret) with respect to the given model class; this implies that the regret of ξ_MIX(x_{1:n}) with respect to M is bounded uniformly by a constant that depends only on the prior weight assigned to the best model in M. For example, the Context Tree Weighting approach of Willems et al. (1995) applies this principle recursively to efficiently construct a mixture model over a doubly-exponential class of tree models.

A more refined nonparametric Bayesian approach to mixing is also possible. Given a model class M, the switching method (Koolen & de Rooij, 2013) efficiently maintains a mixture model ξ_SWITCH over all sequences of models in M. We review here a restricted application of this technique based on the work of Veness et al. (2012) and Herbster & Warmuth (1998). More formally, given an indexed set of models {ρ_1, ρ_2, ..., ρ_k} and an index sequence i_{1:n} ∈ {1, 2, ..., k}^n, let ρ_{i_{1:n}}(x_{1:n}) := ∏_{t=1}^n ρ_{i_t}(x_t | x_{<t}) denote the probability assigned to x_{1:n} by the sequence of models indexed by i_{1:n}.

In particular, the model cost term Γ^K_D(S) appearing in our regret bound is, for any K > 0, necessarily larger than its CTS counterpart. While these differences negatively affect our regret bound, we have seen in Section 3 that we should expect significant savings whenever the data can be well-modelled by a small K-skip prediction suffix tree. We explore these issues further in the next section.

4. Experiments

We tested the Skip Context Tree Switching algorithm on a series of prediction problems. The first set of experiments uses a popular data compression benchmark, while the second investigates performance on a diverse set of structured image prediction problems taken from an open-source reinforcement learning test framework. A reference implementation of Skip CTS is provided at: http://github.com/mgbellemare/SkipCTS.

4.1. The Calgary Corpus

Our first experiment evaluated Skip CTS in a pure compression setting. Recall that any algorithm which sequentially assigns probabilities to symbols can be used for compression by means of arithmetic coding (Witten et al., 1987). In particular, given a model ξ assigning a probability ξ(x_{1:n}) to x_{1:n} ∈ X^n, arithmetic coding is guaranteed to produce a compressed file of essentially -log_2 ξ(x_{1:n}) bits.
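To tie together the quantities just introduced, the following toy sketch (ours, not from the paper; the Bernoulli model class and uniform prior are assumed for illustration) computes the ideal code length -log_2 ξ_MIX(x_{1:n}) of a small Bayesian mixture and checks that its regret against the best model in the class is bounded by -log_2 w_ρ.

```python
# Toy illustration (not from the paper): a Bayesian mixture xi_MIX over a small
# model class, its ideal arithmetic-coding cost -log2 xi_MIX(x_{1:n}), and the
# regret bound -log2 w_rho against the best model in the class.
import math

def bernoulli(theta):
    """Memoryless binary model: rho(x_i = 1 | x_{<i}) = theta."""
    return lambda symbol, context: theta if symbol == 1 else 1.0 - theta

models = {"theta=0.2": bernoulli(0.2),
          "theta=0.5": bernoulli(0.5),
          "theta=0.8": bernoulli(0.8)}
weights = {name: 1.0 / len(models) for name in models}      # uniform prior w_rho

def log2_prob(model, x):
    """log2 rho(x_{1:n}), accumulated via the chain rule."""
    return sum(math.log2(model(x[i], x[:i])) for i in range(len(x)))

def log2_mixture(x):
    """log2 xi_MIX(x_{1:n}) with xi_MIX = sum_rho w_rho * rho(x_{1:n})."""
    return math.log2(sum(w * 2.0 ** log2_prob(models[name], x)
                         for name, w in weights.items()))

x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]            # a sequence favouring theta=0.8
best = max(models, key=lambda name: log2_prob(models[name], x))
ideal_bits = -log2_mixture(x)                  # ideal compressed size in bits
regret = log2_prob(models[best], x) - log2_mixture(x)
print(f"ideal code length: {ideal_bits:.2f} bits")
print(f"regret vs {best}: {regret:.3f} <= -log2 w_rho = {math.log2(len(models)):.3f}")
```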
We ran Skip CTS (with D = 48, K = 1) and CTS (with D = 48) on the Calgary corpus (Bell et al., 1989), an established compression benchmark composed of 14 different files. The results, provided in Table 1, show that Skip CTS performs significantly better than CTS on certain files, and never suffers by more than a negligible amount. Of interest, the files best improved by Skip CTS are those which contain highly structured binary data: GEO, OBJ1, and OBJ2. For reference, we also included some CTW experiments, indicated by the CTW and Skip CTW rows, that measured the performance of skipping using the original recursive CTW weighting scheme; here we see that the addition of skipping also helps. Table 1 also provides results for CTW*, an enhanced version of CTW for byte-based data (Willems, 2013). Here both CTS and Skip CTS outperform CTW*, with Skip CTS providing the best results overall.

Finally, it is worth noting that, averaged over the Calgary corpus, the bits-per-byte performance of Skip CTS (2.10 vs. 2.12) is superior to that of DEPLUMP (Gasthaus et al., 2010), a state-of-the-art n-gram model. While Skip CTS is consistently slightly worse for text data, it is significantly better on binary data. It is also worth pointing out that no regret guarantees are yet known for DEPLUMP.

Table 1. Compression results on the Calgary corpus, in average bits needed to encode each byte. The Diff. columns give the improvement from CTW to Skip CTW and from CTS to Skip CTS, respectively; the original table highlights improvements greater than 3%. CTW* results are taken from Willems (2013).

| File   | CTW* | CTW  | Skip CTW | Diff. | CTS  | Skip CTS | Diff. |
|--------|------|------|----------|-------|------|----------|-------|
| bib    | 1.83 | 2.25 | 2.15     | 4.4%  | 1.81 | 1.75     | 3.3%  |
| book1  | 2.18 | 2.31 | 2.32     | -0.4% | 2.20 | 2.20     | 0.0%  |
| book2  | 1.89 | 2.12 | 2.10     | 0.9%  | 1.90 | 1.89     | 0.5%  |
| geo    | 4.53 | 5.01 | 3.91     | 22.0% | 4.18 | 3.60     | 13.9% |
| news   | 2.35 | 2.78 | 2.77     | 0.4%  | 2.34 | 2.34     | 0.0%  |
| obj1   | 3.72 | 4.63 | 4.57     | 1.3%  | 3.66 | 3.40     | 7.1%  |
| obj2   | 2.40 | 3.19 | 2.96     | 7.2%  | 2.36 | 2.19     | 7.2%  |
| paper1 | 2.29 | 2.84 | 2.75     | 3.2%  | 2.28 | 2.26     | 0.9%  |
| paper2 | 2.23 | 2.59 | 2.54     | 1.9%  | 2.23 | 2.22     | 0.4%  |
| pic    | 0.80 | 0.90 | 0.90     | 0.0%  | 0.79 | 0.76     | 3.8%  |
| progc  | 2.33 | 3.00 | 2.91     | 3.0%  | 2.33 | 2.30     | 1.3%  |
| progl  | 1.65 | 2.11 | 2.00     | 5.2%  | 1.61 | 1.59     | 1.2%  |
| progp  | 1.68 | 2.24 | 2.08     | 7.1%  | 1.64 | 1.61     | 1.8%  |
| trans  | 1.44 | 2.09 | 1.83     | 12.4% | 1.39 | 1.35     | 2.9%  |

4.2. Atari 2600 Frame Prediction

We also tested our algorithm on the task of video game screen prediction. We used the Arcade Learning Environment (Bellemare et al., 2013a), an interface that allows agents to interact with Atari 2600 games. Figure 6 depicts the well-known PONG, one of the Atari 2600's flagship games.

Figure 6. The game PONG, in which the player controls the vertical position of a paddle in order to return a ball and score points.

In the Atari 2600 prediction setting, the alphabet X is the set of all possible Atari 2600 screens. Because each screen contains 160 × 210 7-bit pixels, it is both impractical and undesirable to learn a model which predicts each x_t ∈ X atomically. Instead, we take a similar approach to that of Bellemare et al. (2013b): we divide the screen into 16 × 16 blocks and predict each block atomically using Skip CTS or CTS combined with the SAD estimator (Hutter, 2013). Each block prediction is made using a context composed of the symbol values of neighbouring blocks at previous time steps, as well as the last action taken, for a total of 11 variables. In this setting, skipping irrelevant variables is particularly important because of the high branching factor at each level. For example, when predicting the motion of the opponent's paddle in PONG, Skip CTS can disregard horizontally neighbouring blocks and the player's action.
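The following sketch (ours; the particular neighbourhood and number of time steps are assumptions chosen only so that the context has 11 variables, since the exact layout is not reproduced here) shows one way such a per-block context could be assembled.

```python
# Illustrative sketch of assembling a per-block prediction context of the kind
# described above.  The neighbourhood layout below is an assumption, chosen
# only so that the context totals 11 variables (10 block values + last action).
from typing import Dict, List, Tuple

Frame = Dict[Tuple[int, int], int]        # (row, col) block coordinate -> symbol

def block_context(frames: List[Frame], actions: List[int],
                  row: int, col: int) -> List[int]:
    """Context used to predict block (row, col) at the next time step."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]   # block + 4 neighbours
    context = []
    for frame in (frames[-1], frames[-2]):                 # two most recent frames
        for dr, dc in offsets:
            context.append(frame.get((row + dr, col + dc), 0))   # 0 = off-screen
    context.append(actions[-1])                            # last action: 11th variable
    return context

# A skip pattern over this context can then ignore, e.g., the horizontally
# neighbouring blocks and the action when they are irrelevant, as in the PONG
# paddle example above.
```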
We trained Skip CTS with K = 0 and 1 on 55 Atari 2600 games. Each experiment consisted of 10 trials, each lasting 100,000 time steps, where one time step corresponds to 4 emulated frames. Each trial was assigned a specific random seed which was used for all values of K. We report the average log-loss per frame over the last 4,500 time steps, corresponding to 5 minutes of real-time Atari 2600 play. Throughout our trials, actions were selected uniformly at random from each game's set of legal actions. The full table of results is provided as supplementary material.

For each game we computed the improvement in log-loss per frame and determined whether the difference in loss was statistically significant using the Wilcoxon signed-rank test. As a whole, Skip CTS achieved lower log-loss than CTS in 54 out of 55 games; all these differences are significant. While Skip CTS performed slightly worse in ELEVATOR ACTION, the difference was not statistically significant. The average overall log-loss improvement was 9.0% and the median 8.25%; improvements ranged from -2% (ELEVATOR ACTION) to 36% (FREEWAY). Skip CTS with K = 1 processed on average 34 time steps (136 frames) per second, corresponding to just over twice the real-time speed of the Atari 2600.

We further ran our algorithm with K = 2 and observed an additional, significant increase in predictive performance on 18 games (up to 21.7% over K = 1 for TIME PILOT). On games where K = 2 is unnecessary, however, the performance of Skip CTS degraded somewhat. As discussed above, this behaviour is an expected consequence of the larger Γ^K_D(S).

5. Discussion

We have seen that by allowing context trees to skip over variables, Skip CTS can achieve substantially better performance than CTS in problems where a good variable ordering may not be known a priori. Theoretically, we have seen that Skip CTS can, in the extreme case, have exponentially lower regret. Empirically, we observe substantial benefits over state-of-the-art lossless compression algorithms in problems involving highly structured data (e.g. the GEO file in the Calgary corpus). The dramatic and consistent improvement seen across over 50 Atari prediction problems indicates that Skip CTS is especially beneficial in multi-dimensional prediction problems, where issues of variable ordering are naturally exacerbated.

The main drawback of Skip CTS is the increased computational complexity of inference, a result of the more expressive model class. However, our experiments have demonstrated that small values of K can make a substantial difference. Furthermore, the computational and memory costs of Skip CTS can be alleviated in practice. The tree structure induced by the recursive Skip CTS update (Equations 3-5) can naturally be parallelized, while the Skip CTS memory requirements can easily be bounded through hashing. Finally, note that sampling from the model remains an O(D) operation, so, for instance, planning with a Skip CTS-based reinforcement learning model is nearly as efficient as planning with a CTS-based model.
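As a concrete example of the memory-bounding idea mentioned above, here is a minimal sketch (ours, not the reference implementation): contexts are hashed into a fixed-size table of simple estimators, here binary Krichevsky-Trofimov (1981) counts, so that memory stays constant at the cost of occasional collisions.

```python
# Minimal sketch (not the authors' implementation) of bounding context-tree
# memory through hashing: every context maps to a slot in a fixed-size table
# of estimators, so memory is O(table size) regardless of how many distinct
# contexts are seen; hash collisions simply share an estimator.

class HashedKTTable:
    """Fixed-size table of binary Krichevsky-Trofimov estimators."""

    def __init__(self, size: int = 1 << 20):
        self.size = size
        self.counts = [[0, 0] for _ in range(size)]    # [zeros, ones] per slot

    def _slot(self, context: tuple) -> int:
        return hash(context) % self.size               # collisions are tolerated

    def prob(self, context: tuple, symbol: int) -> float:
        """KT estimate of P(symbol | context): (count + 1/2) / (total + 1)."""
        zeros, ones = self.counts[self._slot(context)]
        count = ones if symbol == 1 else zeros
        return (count + 0.5) / (zeros + ones + 1.0)

    def update(self, context: tuple, symbol: int) -> None:
        self.counts[self._slot(context)][symbol] += 1

# Usage: table = HashedKTTable(); p = table.prob(ctx, 1); table.update(ctx, 1)
```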
Tree-based models have a long history in sequence prediction, and the persistent issue of variable ordering has been confronted in many ways. The main strengths of Skip CTS are inherited from CTW: efficient, incremental, and exact Bayesian inference, together with strong theoretical guarantees on asymptotic regret. Other approaches with more representational flexibility lack these strengths. In the model-based reinforcement learning setting, some methods (e.g. McCallum, 1996; Holmes & Isbell, 2006; Talvitie, 2012) extend the traditional prediction suffix tree by allowing variables from different time steps to be added in any order, or by allowing the tree to excise portions of history, but these methods are not incremental and do not provide regret guarantees. Bayesian decision tree learning methods (e.g. Chipman et al., 1998; Lakshminarayanan et al., 2013) could in principle be applied in the sequential prediction setting. These typically allow arbitrary variable orderings, but require approximate inference to remain tractable.

6. Conclusion

In this paper we presented Skip Context Tree Switching, a polynomial-time algorithm which efficiently mixes over sequences of prediction suffix trees that may skip over up to K contiguous runs of variables. Our results show that Skip CTS is practical for small K and can produce significant empirical improvements compared to members of the Context Tree Weighting family (even with K = 1) in problems where irrelevant variables are naturally present.

Acknowledgments

The authors would like to thank Alex Graves, Andriy Mnih and Michael Bowling for some helpful discussions.

References

Bell, Timothy, Witten, Ian H., and Cleary, John G. Modeling for text compression. ACM Computing Surveys, 21(4):557-591, 1989.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, June 2013a.

Bellemare, Marc G., Veness, Joel, and Bowling, Michael. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013b.

Chipman, Hugh A., George, Edward I., and McCulloch, Robert E. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935-948, 1998.

Erven, Tim van, Grünwald, Peter, and de Rooij, Steven. Catching up faster in Bayesian model selection and model averaging. In Advances in Neural Information Processing Systems (NIPS), 2007.

Gasthaus, Jan, Wood, Frank, and Teh, Yee Whye. Lossless compression based on the sequence memoizer. In Data Compression Conference (DCC), 2010.

Herbster, Mark and Warmuth, Manfred K. Tracking the best expert. Machine Learning, 32(2):151-178, 1998.

Holmes, Michael P. and Isbell, Jr., Charles Lee. Looping suffix tree-based inference of partially observable hidden state. In Proceedings of the 23rd International Conference on Machine Learning, pp. 409-416, 2006.

Hutter, Marcus. Sparse adaptive Dirichlet-multinomial-like processes. In Proceedings of the Conference on Learning Theory (COLT), 2013.

Koolen, Wouter M. and de Rooij, Steven. Universal codes from switching strategies. IEEE Transactions on Information Theory, 59(11):7168-7185, 2013.

Krichevsky, R. and Trofimov, V. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199-207, 1981.

Lakshminarayanan, Balaji, Roy, Daniel M., and Teh, Yee Whye. Top-down particle filtering for Bayesian decision trees. In Proceedings of the 30th International Conference on Machine Learning, 2013.

McCallum, Andrew K. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, 1996.

Ron, Dana, Singer, Yoram, and Tishby, Naftali. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2):117-149, 1996.
Talvitie, Erik. Learning partially observable models using temporally abstract decision trees. In Advances in Neural Information Processing Systems 25, 2012.

Tjalkens, Tj. J., Shtarkov, Y. M., and Willems, F. M. J. Context tree weighting: Multi-alphabet sources. In 14th Symposium on Information Theory in the Benelux, pp. 128-135, 1993.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, Uther, William T. B., and Silver, David. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95-142, 2011.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael H. Context tree switching. In Data Compression Conference (DCC), pp. 327-336, 2012.

Volf, P. Weighting techniques in data compression: Theory and algorithms. PhD thesis, Eindhoven University of Technology, 2002.

Willems, Frans M. J., Shtarkov, Yuri M., and Tjalkens, Tjalling J. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41:653-664, 1995.

Willems, Frans M. J., Shtarkov, Yuri M., and Tjalkens, Tjalling J. Context weighting for general finite-context sources. IEEE Transactions on Information Theory, 42(5):1514-1520, 1996.

Willems, Frans M. J. CTW website. http://www.ele.tue.nl/ctw/, 2013.

Witten, Ian H., Neal, Radford M., and Cleary, John G. Arithmetic coding for data compression. Communications of the ACM, 30(6):520-540, 1987.