# Skip Context Tree Switching

Marc G. Bellemare (BELLEMARE@GOOGLE.COM), Joel Veness (VENESS@GOOGLE.COM), Google DeepMind
Erik Talvitie (ERIK.TALVITIE@FANDM.EDU), Franklin and Marshall College

Abstract

Context Tree Weighting is a powerful probabilistic sequence prediction technique that efficiently performs Bayesian model averaging over the class of all prediction suffix trees of bounded depth. In this paper we show how to generalize this technique to the class of K-skip prediction suffix trees. In contrast to regular prediction suffix trees, K-skip prediction suffix trees are permitted to ignore up to K contiguous portions of the context. This allows for significant improvements in predictive accuracy when irrelevant variables are present, a case which often occurs within record-aligned data and images. We provide a regret-based analysis of our approach, and empirically evaluate it on the Calgary corpus and a set of Atari 2600 screen prediction tasks.

1. Introduction

The sequential prediction setting, in which an unknown environment generates a stream of observations which an algorithm must probabilistically predict, is highly relevant to a number of machine learning problems such as statistical language modelling, data compression, and model-based reinforcement learning. A powerful algorithm for this setting is Context Tree Weighting (CTW; Willems et al., 1995), which efficiently performs Bayesian model averaging over a class of prediction suffix trees (Ron et al., 1996). In a compression setting, Context Tree Weighting is known to be an asymptotically optimal coding distribution for D-Markov sources.

A significant practical limitation of CTW stems from the fact that model averaging is only performed over prediction suffix trees whose ordering of context variables is fixed in advance. As we discuss in Section 3, reordering these variables can lead to significant performance improvements given limited data. This idea was leveraged by the class III algorithm of Willems et al. (1996), which performs Bayesian model averaging over the collection of prediction suffix trees defined over all possible fixed variable orderings. Unfortunately, the O(2^D) computational requirements of the class III algorithm prohibit its use in most practical applications.

Our main contribution is the Skip Context Tree Switching (Skip CTS) algorithm, a polynomial-time compromise between the linear-time CTW and the exponential-time class III algorithm. We introduce a family of nested model classes, the Skip Context Tree classes, which form the basis of our approach. The Kth-order member of this family corresponds to prediction suffix trees which may skip up to K runs of contiguous variables. The usual model class associated with CTW is a special case, and corresponds to K = 0. In many cases of interest, Skip CTS's O(D^{2K+1}) running time is practical and provides significant performance gains compared to Context Tree Weighting.

Skip CTS is best suited to sequential prediction problems where a good fixed variable ordering is unknown a priori. As a simple example, consider the record-aligned data depicted in Figure 1. Skip CTS with K = 1 can improve on the CTW ordering by skipping the five most recent symbols and directly learning the lexicographical relation.

Figure 1. A sequence of lexicographically sorted fixed-length strings (AFRAID, AGAIN!, ALWAYS, AMAZED, BECOME, BEHOLD), which is particularly well-modelled by Skip CTS.
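To make the skipping idea concrete, the following sketch (our own illustration, not the paper's algorithm; the function names and the way a skip is specified are assumptions) contrasts an ordinary suffix-tree context with a context that skips one contiguous run of recent symbols on record-aligned data of the kind shown in Figure 1.

```python
# Minimal sketch contrasting an ordinary suffix context with a 1-skip context.
# Illustrative only; the skip-pattern encoding below is an assumption.

def suffix_context(history, depth):
    """Standard CTW-style context: the `depth` most recent symbols, newest first."""
    return history[-depth:][::-1]

def skip_context(history, depth, skip_start, skip_len):
    """Context of length `depth` that ignores `skip_len` contiguous positions
    starting `skip_start` symbols back (0 = most recent symbol)."""
    recent = history[::-1]                         # index 0 = most recent symbol
    kept = recent[:skip_start] + recent[skip_start + skip_len:]
    return kept[:depth]

# Record-aligned data as in Figure 1: sorted six-character strings, back to back.
history = list("AFRAIDAGAIN!ALWAYSAMAZEDBECOME")
# An ordinary depth-6 context for predicting the next record's first letter
# conditions on the tail of the previous record ...
print("".join(suffix_context(history, 6)))        # -> EMOCEB
# ... whereas skipping the five most recent symbols conditions directly on the
# previous record's first letter, which carries the lexicographic information.
print("".join(skip_context(history, 1, 0, 5)))    # -> B
```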
While Context Tree Weighting has traditionally been used as a data compression algorithm, it has proven useful in a diverse range of sequential prediction settings. For example, Veness et al. (2011) proposed an extension (FAC-CTW) for Bayesian, model-based reinforcement learning in structured, partially observable domains. Bellemare et al. (2013b) used FAC-CTW as a base model in their Quad-Tree Factorization algorithm, which they applied to the problem of predicting high-dimensional video game screen images. Our empirical results on the same video game domains (Section 4.2) suggest that Skip CTS is particularly beneficial in this more complex prediction setting.

2. Background

We consider the problem of probabilistically predicting the output of an unknown sequential data generating source. Given a finite alphabet X, we write x_{1:n} := x_1 x_2 ... x_n ∈ X^n to denote a string of length n, xy to denote the concatenation of two strings x and y, and x^i to denote the concatenation of i copies of x. We further denote x_{<i} := x_{1:i-1}. A sequential probabilistic model ρ assigns a probability ρ(x_{1:n}) to every string x_{1:n} ∈ X^n; the conditional probability of a symbol is ρ(x_i | x_{<i}) := ρ(x_{1:i}) / ρ(x_{<i}) whenever ρ(x_{<i}) > 0, from which the chain rule ρ(x_{1:n}) = ∏_{i=1}^n ρ(x_i | x_{<i}) follows.

Given a class M of sequential models, a standard way to compete with the best model in M is Bayesian model averaging, which uses the mixture ξ_MIX(x_{1:n}) := Σ_{ρ∈M} w_ρ ρ(x_{1:n}), where the w_ρ > 0 are prior weights satisfying Σ_{ρ∈M} w_ρ = 1. It can readily be shown that, for any ρ ∈ M, we have R_n(ξ_MIX, {ρ}) ≤ -log w_ρ, where R_n denotes the excess log-loss (regret) with respect to the given model class; this implies that the regret of ξ_MIX(x_{1:n}) with respect to M is bounded uniformly by a constant that depends only on the prior weight assigned to the best model in M. For example, the Context Tree Weighting approach of Willems et al. (1995) applies this principle recursively to efficiently construct a mixture model over a doubly-exponential class of tree models.

A more refined nonparametric Bayesian approach to mixing is also possible. Given a model class M, the switching method (Koolen & de Rooij, 2013) efficiently maintains a mixture model ξ_SWITCH over all sequences of models in M. We review here a restricted application of this technique based on the work of Veness et al. (2012) and Herbster & Warmuth (1998). More formally, given an indexed set of models {ρ_1, ρ_2, ..., ρ_k} and an index sequence i_{1:n} ∈ {1, 2, ..., k}^n, let ρ_{i_{1:n}}(x_{1:n}) := ∏_{t=1}^n ρ_{i_t}(x_t | x_{<t}) denote the probability assigned to x_{1:n} by the sequence of models indexed by i_{1:n}.

In particular, the model cost term Γ^K_D(S) appearing in our regret bound is, for any K > 0, necessarily larger than its CTS counterpart. While these differences negatively affect our regret bound, we have seen in Section 3 that we should expect significant savings whenever the data can be well-modelled by a small K-skip prediction suffix tree. We explore these issues further in the next section.

4. Experiments

We tested the Skip Context Tree Switching algorithm on a series of prediction problems. The first set of experiments uses a popular data compression benchmark, while the second investigates performance on a diverse set of structured image prediction problems taken from an open-source reinforcement learning test framework. A reference implementation of Skip CTS is provided at: http://github.com/mgbellemare/SkipCTS.

4.1. The Calgary Corpus

Our first experiment evaluated Skip CTS in a pure compression setting. Recall that any algorithm which sequentially assigns probabilities to symbols can be used for compression by means of arithmetic coding (Witten et al., 1987). In particular, given a model ξ assigning a probability ξ(x_{1:n}) to x_{1:n} ∈ X^n, arithmetic coding is guaranteed to produce a compressed file of essentially -log_2 ξ(x_{1:n}) bits.
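To tie together the quantities just introduced, the following toy sketch (ours, not from the paper; the Bernoulli model class and uniform prior are assumed for illustration) computes the ideal code length -log_2 ξ_MIX(x_{1:n}) of a small Bayesian mixture and checks that its regret against the best model in the class is bounded by -log_2 w_ρ.

```python
# Toy illustration (not from the paper): a Bayesian mixture xi_MIX over a small
# model class, its ideal arithmetic-coding cost -log2 xi_MIX(x_{1:n}), and the
# regret bound -log2 w_rho against the best model in the class.
import math

def bernoulli(theta):
    """Memoryless binary model: rho(x_i = 1 | x_{<i}) = theta."""
    return lambda symbol, context: theta if symbol == 1 else 1.0 - theta

models = {"theta=0.2": bernoulli(0.2),
          "theta=0.5": bernoulli(0.5),
          "theta=0.8": bernoulli(0.8)}
weights = {name: 1.0 / len(models) for name in models}      # uniform prior w_rho

def log2_prob(model, x):
    """log2 rho(x_{1:n}), accumulated via the chain rule."""
    return sum(math.log2(model(x[i], x[:i])) for i in range(len(x)))

def log2_mixture(x):
    """log2 xi_MIX(x_{1:n}) with xi_MIX = sum_rho w_rho * rho(x_{1:n})."""
    return math.log2(sum(w * 2.0 ** log2_prob(models[name], x)
                         for name, w in weights.items()))

x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]            # a sequence favouring theta=0.8
best = max(models, key=lambda name: log2_prob(models[name], x))
ideal_bits = -log2_mixture(x)                  # ideal compressed size in bits
regret = log2_prob(models[best], x) - log2_mixture(x)
print(f"ideal code length: {ideal_bits:.2f} bits")
print(f"regret vs {best}: {regret:.3f} <= -log2 w_rho = {math.log2(len(models)):.3f}")
```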
We ran Skip CTS (with D = 48, K = 1) and CTS (with D = 48) on the Calgary corpus (Bell et al., 1989), an established compression benchmark composed of 14 different files. The results, provided in Table 1, show that Skip CTS performs significantly better than CTS on certain files, and never suffers by more than a negligible amount. Of interest, the files best improved by Skip CTS are those which contain highly structured binary data: GEO, OBJ1, and OBJ2. For reference, we also included some CTW experiments, indicated by the CTW and Skip CTW rows, that measured the performance of skipping using the original recursive CTW weighting scheme; here we see that the addition of skipping also helps. Table 1 also provides results for CTW*, an enhanced version of CTW for byte-based data (Willems, 2013). Here both CTS and Skip CTS outperform CTW*, with Skip CTS providing the best results overall.

Finally, it is worth noting that, averaged over the Calgary corpus, the bits-per-byte performance of Skip CTS (2.10 vs. 2.12) is superior to that of DEPLUMP (Gasthaus et al., 2010), a state-of-the-art n-gram model. While Skip CTS is consistently slightly worse for text data, it is significantly better on binary data. It is also worth pointing out that no regret guarantees are yet known for DEPLUMP.

Table 1. Compression results on the Calgary corpus, in average bits needed to encode each byte. The Diff. columns give the improvement from CTW to Skip CTW and from CTS to Skip CTS, respectively; the original table highlights improvements greater than 3%. CTW* results are taken from Willems (2013).

| File   | CTW* | CTW  | Skip CTW | Diff. | CTS  | Skip CTS | Diff. |
|--------|------|------|----------|-------|------|----------|-------|
| bib    | 1.83 | 2.25 | 2.15     | 4.4%  | 1.81 | 1.75     | 3.3%  |
| book1  | 2.18 | 2.31 | 2.32     | -0.4% | 2.20 | 2.20     | 0.0%  |
| book2  | 1.89 | 2.12 | 2.10     | 0.9%  | 1.90 | 1.89     | 0.5%  |
| geo    | 4.53 | 5.01 | 3.91     | 22.0% | 4.18 | 3.60     | 13.9% |
| news   | 2.35 | 2.78 | 2.77     | 0.4%  | 2.34 | 2.34     | 0.0%  |
| obj1   | 3.72 | 4.63 | 4.57     | 1.3%  | 3.66 | 3.40     | 7.1%  |
| obj2   | 2.40 | 3.19 | 2.96     | 7.2%  | 2.36 | 2.19     | 7.2%  |
| paper1 | 2.29 | 2.84 | 2.75     | 3.2%  | 2.28 | 2.26     | 0.9%  |
| paper2 | 2.23 | 2.59 | 2.54     | 1.9%  | 2.23 | 2.22     | 0.4%  |
| pic    | 0.80 | 0.90 | 0.90     | 0.0%  | 0.79 | 0.76     | 3.8%  |
| progc  | 2.33 | 3.00 | 2.91     | 3.0%  | 2.33 | 2.30     | 1.3%  |
| progl  | 1.65 | 2.11 | 2.00     | 5.2%  | 1.61 | 1.59     | 1.2%  |
| progp  | 1.68 | 2.24 | 2.08     | 7.1%  | 1.64 | 1.61     | 1.8%  |
| trans  | 1.44 | 2.09 | 1.83     | 12.4% | 1.39 | 1.35     | 2.9%  |

4.2. Atari 2600 Frame Prediction

We also tested our algorithm on the task of video game screen prediction. We used the Arcade Learning Environment (Bellemare et al., 2013a), an interface that allows agents to interact with Atari 2600 games. Figure 6 depicts the well-known PONG, one of the Atari 2600's flagship games.

Figure 6. The game PONG, in which the player controls the vertical position of a paddle in order to return a ball and score points.

In the Atari 2600 prediction setting, the alphabet X is the set of all possible Atari 2600 screens. Because each screen contains 160 × 210 7-bit pixels, it is both impractical and undesirable to learn a model which predicts each x_t ∈ X atomically. Instead, we take a similar approach to that of Bellemare et al. (2013b): we divide the screen into 16 × 16 blocks and predict each block atomically using Skip CTS or CTS combined with the SAD estimator (Hutter, 2013). Each block prediction is made using a context composed of the symbol values of neighbouring blocks at previous time steps, as well as the last action taken, for a total of 11 variables. In this setting, skipping irrelevant variables is particularly important because of the high branching factor at each level. For example, when predicting the motion of the opponent's paddle in PONG, Skip CTS can disregard horizontally neighbouring blocks and the player's action.
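The following sketch (ours; the particular neighbourhood and number of time steps are assumptions chosen only so that the context has 11 variables, since the exact layout is not reproduced here) shows one way such a per-block context could be assembled.

```python
# Illustrative sketch of assembling a per-block prediction context of the kind
# described above.  The neighbourhood layout below is an assumption, chosen
# only so that the context totals 11 variables (10 block values + last action).
from typing import Dict, List, Tuple

Frame = Dict[Tuple[int, int], int]        # (row, col) block coordinate -> symbol

def block_context(frames: List[Frame], actions: List[int],
                  row: int, col: int) -> List[int]:
    """Context used to predict block (row, col) at the next time step."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]   # block + 4 neighbours
    context = []
    for frame in (frames[-1], frames[-2]):                 # two most recent frames
        for dr, dc in offsets:
            context.append(frame.get((row + dr, col + dc), 0))   # 0 = off-screen
    context.append(actions[-1])                            # last action: 11th variable
    return context

# A skip pattern over this context can then ignore, e.g., the horizontally
# neighbouring blocks and the action when they are irrelevant, as in the PONG
# paddle example above.
```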
We trained Skip CTS with K = 0 and 1 on 55 Atari 2600 games. Each experiment consisted of 10 trials, each lasting 100,000 time steps, where one time step corresponds to 4 emulated frames. Each trial was assigned a specific random seed which was used for all values of K. We report the average log-loss per frame over the last 4,500 time steps, corresponding to 5 minutes of real-time Atari 2600 play. Throughout our trials, actions were selected uniformly at random from each game's set of legal actions. The full table of results is provided as supplementary material.

For each game we computed the improvement in log-loss per frame and determined whether the difference in loss was statistically significant using the Wilcoxon signed-rank test. As a whole, Skip CTS achieved lower log-loss than CTS in 54 out of 55 games; all these differences are significant. While Skip CTS performed slightly worse in ELEVATOR ACTION, the difference was not statistically significant. The average overall log-loss improvement was 9.0% and the median 8.25%; improvements ranged from -2% (ELEVATOR ACTION) to 36% (FREEWAY). Skip CTS with K = 1 processed on average 34 time steps (136 frames) per second, corresponding to just over twice the real-time speed of the Atari 2600.

We further ran our algorithm with K = 2 and observed an additional, significant increase in predictive performance on 18 games (up to 21.7% over K = 1 for TIME PILOT). On games where K = 2 is unnecessary, however, the performance of Skip CTS degraded somewhat. As discussed above, this behaviour is an expected consequence of the larger Γ^K_D(S).

5. Discussion

We have seen that by allowing context trees to skip over variables, Skip CTS can achieve substantially better performance than CTS in problems where a good variable ordering may not be known a priori. Theoretically, we have seen that Skip CTS can, in the extreme case, have exponentially lower regret. Empirically, we observe substantial benefits over state-of-the-art lossless compression algorithms in problems involving highly structured data (e.g. the GEO file in the Calgary corpus). The dramatic and consistent improvement seen across over 50 Atari prediction problems indicates that Skip CTS is especially beneficial in multi-dimensional prediction problems, where issues of variable ordering are naturally exacerbated.

The main drawback of Skip CTS is the increased computational complexity of inference, a result of the more expressive model class. However, our experiments have demonstrated that small values of K can make a substantial difference. Furthermore, the computational and memory costs of Skip CTS can be alleviated in practice. The tree structure induced by the recursive Skip CTS update (Equations 3-5) can naturally be parallelized, while the Skip CTS memory requirements can easily be bounded through hashing. Finally, note that sampling from the model remains an O(D) operation, so, for instance, planning with a Skip CTS-based reinforcement learning model is nearly as efficient as planning with a CTS-based model.
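As a concrete example of the memory-bounding idea mentioned above, here is a minimal sketch (ours, not the reference implementation): contexts are hashed into a fixed-size table of simple estimators, here binary Krichevsky-Trofimov (1981) counts, so that memory stays constant at the cost of occasional collisions.

```python
# Minimal sketch (not the authors' implementation) of bounding context-tree
# memory through hashing: every context maps to a slot in a fixed-size table
# of estimators, so memory is O(table size) regardless of how many distinct
# contexts are seen; hash collisions simply share an estimator.

class HashedKTTable:
    """Fixed-size table of binary Krichevsky-Trofimov estimators."""

    def __init__(self, size: int = 1 << 20):
        self.size = size
        self.counts = [[0, 0] for _ in range(size)]    # [zeros, ones] per slot

    def _slot(self, context: tuple) -> int:
        return hash(context) % self.size               # collisions are tolerated

    def prob(self, context: tuple, symbol: int) -> float:
        """KT estimate of P(symbol | context): (count + 1/2) / (total + 1)."""
        zeros, ones = self.counts[self._slot(context)]
        count = ones if symbol == 1 else zeros
        return (count + 0.5) / (zeros + ones + 1.0)

    def update(self, context: tuple, symbol: int) -> None:
        self.counts[self._slot(context)][symbol] += 1

# Usage: table = HashedKTTable(); p = table.prob(ctx, 1); table.update(ctx, 1)
```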
Tree-based models have a long history in sequence prediction, and the persistent issue of variable ordering has been confronted in many ways. The main strengths of Skip CTS are inherited from CTW: efficient, incremental, and exact Bayesian inference, together with strong theoretical guarantees on asymptotic regret. Other approaches with more representational flexibility lack these strengths. In the model-based reinforcement learning setting, some methods (e.g. McCallum, 1996; Holmes & Isbell, 2006; Talvitie, 2012) extend the traditional prediction suffix tree by allowing variables from different time steps to be added in any order, or by allowing the tree to excise portions of history, but these methods are not incremental and do not provide regret guarantees. Bayesian decision tree learning methods (e.g. Chipman et al., 1998; Lakshminarayanan et al., 2013) could in principle be applied in the sequential prediction setting. These typically allow arbitrary variable orderings, but require approximate inference to remain tractable.

6. Conclusion

In this paper we presented Skip Context Tree Switching, a polynomial-time algorithm which efficiently mixes over sequences of prediction suffix trees that may skip over up to K contiguous runs of variables. Our results show that Skip CTS is practical for small K and can produce significant empirical improvements compared to members of the Context Tree Weighting family (even with K = 1) in problems where irrelevant variables are naturally present.

Acknowledgments

The authors would like to thank Alex Graves, Andriy Mnih and Michael Bowling for some helpful discussions.

References

Bell, Timothy, Witten, Ian H., and Cleary, John G. Modeling for text compression. ACM Computing Surveys, 21(4):557-591, 1989.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, June 2013a.

Bellemare, Marc G., Veness, Joel, and Bowling, Michael. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013b.

Chipman, Hugh A., George, Edward I., and McCulloch, Robert E. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935-948, 1998.

Erven, Tim van, Grünwald, Peter, and de Rooij, Steven. Catching up faster in Bayesian model selection and model averaging. In Advances in Neural Information Processing Systems (NIPS), 2007.

Gasthaus, Jan, Wood, Frank, and Teh, Yee Whye. Lossless compression based on the sequence memoizer. In Data Compression Conference (DCC), 2010.

Herbster, Mark and Warmuth, Manfred K. Tracking the best expert. Machine Learning, 32(2):151-178, 1998.

Holmes, Michael P. and Isbell, Jr., Charles Lee. Looping suffix tree-based inference of partially observable hidden state. In Proceedings of the 23rd International Conference on Machine Learning, pp. 409-416, 2006.

Hutter, Marcus. Sparse adaptive Dirichlet-multinomial-like processes. In Proceedings of the Conference on Learning Theory (COLT), 2013.

Koolen, Wouter M. and de Rooij, Steven. Universal codes from switching strategies. IEEE Transactions on Information Theory, 59(11):7168-7185, 2013.

Krichevsky, R. and Trofimov, V. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199-207, 1981.

Lakshminarayanan, Balaji, Roy, Daniel M., and Teh, Yee Whye. Top-down particle filtering for Bayesian decision trees. In Proceedings of the 30th International Conference on Machine Learning, 2013.

McCallum, Andrew K. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, 1996.

Ron, Dana, Singer, Yoram, and Tishby, Naftali. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2):117-149, 1996.
Talvitie, Erik. Learning partially observable models using temporally abstract decision trees. In Advances in Neural Information Processing Systems 25, 2012.

Tjalkens, Tj. J., Shtarkov, Y. M., and Willems, F. M. J. Context tree weighting: Multi-alphabet sources. In 14th Symposium on Information Theory in the Benelux, pp. 128-135, 1993.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, Uther, William T. B., and Silver, David. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95-142, 2011.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael H. Context tree switching. In Data Compression Conference (DCC), pp. 327-336, 2012.

Volf, P. Weighting techniques in data compression: Theory and algorithms. PhD thesis, Eindhoven University of Technology, 2002.

Willems, Frans M. J., Shtarkov, Yuri M., and Tjalkens, Tjalling J. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41:653-664, 1995.

Willems, Frans M. J., Shtarkov, Yuri M., and Tjalkens, Tjalling J. Context weighting for general finite-context sources. IEEE Transactions on Information Theory, 42(5):1514-1520, 1996.

Willems, Frans M. J. CTW website. http://www.ele.tue.nl/ctw/, 2013.

Witten, Ian H., Neal, Radford M., and Cleary, John G. Arithmetic coding for data compression. Communications of the ACM, 30(6):520-540, 1987.