# Dynamic-Depth Context Tree Weighting

João V. Messias, Morpheus Labs, Oxford, UK. jmessias@morpheuslabs.co.uk
Shimon Whiteson, University of Oxford, Oxford, UK. shimon.whiteson@cs.ox.ac.uk

*During the development of this work, the main author was employed by the University of Amsterdam.*

## Abstract

Reinforcement learning (RL) in partially observable settings is challenging because the agent's observations are not Markov. Recently proposed methods can learn variable-order Markov models of the underlying process but have steep memory requirements and are sensitive to aliasing between observation histories due to sensor noise. This paper proposes dynamic-depth context tree weighting (D2-CTW), a model-learning method that addresses these limitations. D2-CTW dynamically expands a suffix tree while ensuring that the size of the model, but not its depth, remains bounded. We show that D2-CTW approximately matches the performance of state-of-the-art alternatives at stochastic time-series prediction while using at least an order of magnitude less memory. We also apply D2-CTW to model-based RL, showing that, on tasks that require memory of past observations, D2-CTW can learn without prior knowledge of a good state representation, or even of the length of history upon which such a representation should depend.

## 1 Introduction

Agents must often act given an incomplete or noisy view of their environments. While decision-theoretic planning and reinforcement learning (RL) methods can discover control policies for agents whose actions can have uncertain outcomes, partial observability greatly increases the problem difficulty, since each observation does not provide sufficient information to disambiguate the true state of the environment and accurately gauge the utility of the agent's available actions. Moreover, when stochastic models of the system are not available a priori, probabilistic inference over latent state variables is not feasible. In such cases, agents must either learn to memorize past observations and actions [21, 9] or learn history-dependent models of the system [15, 8].

Variable-order Markov models (VMMs), which have long excelled in stochastic time-series prediction and universal coding [23, 14, 2], have recently also found application in RL under partial observability [13, 7, 24, 19]. VMMs build a context-dependent predictive model of future observations and/or rewards, where a context is a variable-length subsequence of recent observations. Since the number of possible contexts grows exponentially with both the context length and the number of possible observations, VMMs' memory requirements may grow accordingly. Conversely, the frequency of each particular context in the data decreases as its length increases, so it may be difficult to accurately model long-term dependencies without requiring prohibitive amounts of data.

Existing VMMs address these problems by allowing models to differentiate between contexts at nonconsecutive past timesteps, ignoring intermediate observations [13, 22, 10, 24, 4]. However, they typically assume that either the amount of input data is naturally limited or there is a known bound on the length of the contexts to be considered. In most settings in which an agent interacts continuously with its environment, neither assumption is well justified. The lack of a defined time limit means that the approaches that make the former assumption, e.g., [13, 24], may eventually and indiscriminately
use all the agent's physical memory, while those that assume a bound on the context length, e.g., [19], may perform poorly if observations older than this bound are relevant.

This paper proposes dynamic-depth context tree weighting (D2-CTW), a VMM designed for general continual learning tasks. D2-CTW extends context tree weighting (CTW) [23] by allowing it to dynamically grow a suffix tree that discriminates between observations at different depths only insofar as that improves its ability to predict future inputs. This allows it to bound the number of contexts represented in the model without sacrificing the ability to model long-term dependencies. Our empirical results show that, when used for general stochastic time-series prediction, D2-CTW produces models that are much more compact than those of CTW while providing better results in the presence of noise. We also apply D2-CTW as part of a model-based RL architecture and show that it outperforms multiple baselines on the problem of RL under partial observability, particularly when an effective bound on the length of its contexts is not known a priori.

## 2 Background

### 2.1 Stochastic Time-Series Prediction

Let an alphabet $\Sigma = \{\sigma^1, \sigma^2, \ldots, \sigma^{|\Sigma|}\}$ be a discrete set of symbols, and let $\Pi(\Sigma)$ represent the space of probability distributions over $\Sigma$ (the $(|\Sigma|-1)$-simplex). Consider a discrete-time stochastic process that, at each time $t \geq 0$, samples a symbol $\sigma_t$ from a probability distribution $p_t \in \Pi(\Sigma)$. We assume that this stochastic process is stationary and ergodic, and that $p_t$ is a conditional probability distribution, which for some (unknown) constant integer $D$ with $0 < D \leq t$ has the form:

$$p_t(\sigma) = P(\sigma_t = \sigma \mid \sigma_{t-1}, \sigma_{t-2}, \ldots, \sigma_{t-D}). \qquad (1)$$

Let $\sigma_{t-D:t-1} = (\sigma_{t-D}, \sigma_{t-D+1}, \ldots, \sigma_{t-1})$ be a string of symbols from time $t-D$ to $t-1$. Since $\sigma_{t-D:t-1} \in \Sigma^D$ and $\Sigma$ is finite, there is a finite number of length-$D$ strings on which the evolution of our stochastic process can be conditioned. Thus, the stochastic process can also be represented by a time-invariant function $F : \Sigma^D \to \Pi(\Sigma)$ such that $p_t = F(\sigma_{t-D:t-1})$ at any time $t \geq D$.

Let $s$ be a string of symbols from alphabet $\Sigma$ with length $|s|$ and elements $[s]_i$, $i \in \{1, \ldots, |s|\}$. A string $q$ with $|q| < |s|$ is said to be a prefix of $s$ iff $q_{1:|q|} = s_{1:|q|}$, and a suffix of $s$ iff $q_{1:|q|} = s_{|s|-|q|+1:|s|}$. We write $sq$ or $\sigma s$ for the concatenation of strings $s$ and $q$, or of symbol $\sigma \in \Sigma$ and $s$. A complete and proper suffix set is a set of strings $S$ such that any string not in $S$ has exactly one suffix in $S$, but no string in $S$ has a suffix in $S$.

Although $D$ is an upper bound on the age of the oldest symbol on which the process $F$ depends, at any time $t$ it may depend only on some suffix of $\sigma_{t-D:t-1}$ of length less than $D$. Given the variable-length nature of its conditional arguments, $F$ can be tractably encoded as a $D$-bounded tree source [2], which arranges a complete and proper suffix set into a tree-like graphical structure. Each node at depth $d \leq D$ corresponds to a length-$d$ string; all internal nodes correspond to suffixes of the strings associated with their children; and each leaf encodes a distribution over $\Sigma$ representing the value of $F$ for that string.

Given a single, uninterrupted sequence $\sigma_{0:t}$ generated by $F$, we wish to learn the $\hat{F} : \Sigma^D \to \Pi(\Sigma)$ that minimises the average log-loss of the observed data $\sigma_{0:t}$. Letting $\hat{P}_{\hat{F}}(\cdot \mid \sigma_{i-D}, \ldots, \sigma_{i-1}) := \hat{F}(\sigma_{i-D:i-1})$:

$$\ell(\sigma_{0:t} \mid \hat{F}) = -\frac{1}{t-D+1} \sum_{i=D}^{t} \log \hat{P}_{\hat{F}}(\sigma_i \mid \sigma_{i-D}, \ldots, \sigma_{i-1}). \qquad (2)$$
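To make the loss in (2) concrete, the following minimal sketch scores an arbitrary conditional predictor on a symbol sequence. The `predictor` interface here is our own illustration for exposition, not an API from the paper's implementation.

```python
import math

def average_log_loss(seq, predictor, D):
    """Average log-loss of Eq. (2): score each symbol seq[i], i >= D,
    under the predictor's conditional probability given the D preceding
    symbols, then average the negative log-probabilities."""
    total = 0.0
    for i in range(D, len(seq)):
        p = predictor(seq[i], seq[i - D:i])  # P(sigma_i | sigma_{i-D:i-1})
        total -= math.log(p)
    return total / (len(seq) - D)

# Example: a memoryless uniform predictor over a 256-symbol alphabet.
uniform = lambda symbol, context: 1.0 / 256
print(average_log_loss(bytes(range(32)) * 4, uniform, D=3))  # = log(256)
```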
### 2.2 Context Tree Weighting

The depth-$K$ context tree on alphabet $\Sigma$ is a graphical structure obtained by arranging all possible strings in $\Sigma^K$ into a full tree. A context tree has a fixed depth at all leaves and potentially encodes all strings in $\Sigma^K$, not just those required by $F$. More specifically, given a sequence of symbols $\sigma_{0:t-1}$, the respective length-$K$ context $\sigma_{t-K:t-1}$ induces a context path along the context tree, obtained by following at each level $d \leq K$ the edge corresponding to $\sigma_{t-d}$. The root of the context tree represents the empty string $\epsilon$, a suffix of all strings. Furthermore, each node keeps track of the input symbols that have immediately followed its respective context. Let $\mathrm{sub}(\sigma_{0:t-1}, s)$ represent the string obtained by concatenating all symbols $\sigma_i$ in $\sigma_{0:t-1}$ whose preceding symbols verify $\sigma_{i-k} = [s]_k$ for $k = 1, \ldots, |s|$. Then, each node $s$ in the context tree maintains its own estimate of the probability of observing the string $\mathrm{sub}(\sigma_{0:t-1}, s)$.

Context tree weighting (CTW) [23] learns a mixture of the estimates of $P(\mathrm{sub}(\sigma_{0:t-1}, s))$ at all contexts $s$ of length $|s| \leq K$ and uses it to estimate the probability of the entire observed sequence. Let $P^s_e(\sigma_{0:t-1})$ represent the estimate of $P(\mathrm{sub}(\sigma_{0:t-1}, s))$ at the node corresponding to $s$, and let $P^s_w(\sigma_{0:t-1})$ be a weighted estimate of the same measure, defined recursively as:

$$P^s_w(\sigma_{0:t-1}) := \begin{cases} \frac{1}{2} P^s_e(\sigma_{0:t-1}) + \frac{1}{2} \prod_{\sigma \in \Sigma} P^{\sigma s}_w(\sigma_{0:t-1}) & \text{if } |s| < K, \\ P^s_e(\sigma_{0:t-1}) & \text{if } |s| = K. \end{cases} \qquad (3)$$

Since $\mathrm{sub}(\sigma_{0:t-1}, \epsilon) = \sigma_{0:t-1}$ by definition of the empty context, $P^\epsilon_w(\sigma_{0:t-1})$ is an estimate of $P(\sigma_{0:t-1})$. The conditional probability of symbol $\sigma_t$ is approximated as $\hat{P}_{\hat{F}}(\sigma_t \mid \sigma_{0:t-1}) = P^\epsilon_w(\sigma_{0:t}) / P^\epsilon_w(\sigma_{0:t-1})$.

The (unweighted) estimate $P^s_e(\sigma_{0:t-1})$ at each context is often computed by keeping $|\Sigma|$ incrementally updated counters $[c_{s,t}]_{i=1,\ldots,|\Sigma|} \in \mathbb{N}_0$, where for each $\sigma^i \in \Sigma$, $[c_{s,t}]_i$ represents the total number of instances where the substring $s\sigma^i$ can be found within $\sigma_{0:t-1}$. The vector of counters $c_{s,t}$ can be modelled as the output of a Dirichlet-multinomial distribution with concentration parameter vector $\alpha = [\alpha_i]_{i=1,\ldots,|\Sigma|}$. An estimate of the probability of observing symbol $\sigma^k$ at time $t+1$ can then be obtained as follows: if $s$ is on the context path at time $t$ and $\sigma_t = \sigma^k$ is the next observed symbol, then $[c_{s,t+1}]_k = [c_{s,t}]_k + 1$ and $[c_{s,t+1}]_i = [c_{s,t}]_i$ for all $i \neq k$. Then:

$$P^s_e(\sigma^k \mid \sigma_{0:t}) := \frac{P_{\mathrm{DirM}}(c_{s,t+1} \mid \alpha)}{P_{\mathrm{DirM}}(c_{s,t} \mid \alpha)} = \frac{[c_{s,t}]_k + [\alpha]_k}{c^+_{s,t} + \alpha^+}, \qquad (4)$$

where $\alpha^+ = \sum_{i=1}^{|\Sigma|} [\alpha]_i$, $c^+_{s,t} = \sum_{i=1}^{|\Sigma|} [c_{s,t}]_i$, and $P_{\mathrm{DirM}}$ is the Dirichlet-multinomial mass function. The estimate of the probability of the full sequence is then $P^s_e(\sigma_{0:t}) = \prod_{\tau=0}^{t} P^s_e(\sigma_\tau \mid \sigma_{0:\tau-1})$, which can be updated in constant time as each new symbol is received. The choice of $\alpha$ affects the overall quality of the estimator; we use the sparse adaptive Dirichlet (SAD) estimator [11], which is especially suited to large alphabets.
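As a concrete illustration of the counter-based estimator in (4), here is a minimal per-node estimator with a fixed symmetric concentration vector; the SAD estimator [11] used in the paper adapts the concentration instead, which we omit for brevity.

```python
class DirichletMultinomialEstimator:
    """Per-node estimator implementing Eq. (4) with a fixed symmetric
    concentration alpha (a simplification of the paper's SAD estimator)."""

    def __init__(self, alphabet_size, alpha=0.5):
        self.counts = [0] * alphabet_size   # [c_s]_i, one per symbol
        self.alpha = alpha                  # [alpha]_i (symmetric)
        self.alpha_total = alpha * alphabet_size  # alpha+
        self.total = 0                      # c+

    def predict(self, k):
        # ([c]_k + [alpha]_k) / (c+ + alpha+), as in Eq. (4)
        return (self.counts[k] + self.alpha) / (self.total + self.alpha_total)

    def update(self, k):
        # Incremental counter update after observing symbol k.
        self.counts[k] += 1
        self.total += 1
```

Each node on the context path calls `predict` before `update`, so (4) is always evaluated with the counts accumulated up to time $t$.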
In principle, a depth-$K$ context tree has $|\Sigma|^{K+1} - 1$ nodes, each with at most $|\Sigma|$ integer counters. In practice, there may be fewer nodes, since one need only allocate space for contexts found in the data at least once, but their total number may still grow linearly with the length of the input string. Thus, for problems such as partially observable RL, in which the amount of input data is unbounded, or for large $|\Sigma|$ and $K$, the memory used by CTW can quickly become unreasonable. Previous extensions to CTW and other VMM algorithms exist that do not explicitly bound the depth of the model [6, 22]. However, these still take up memory that is worst-case linear in the length of the input sequence, and are therefore not applicable to reinforcement learning. To overcome this problem, most existing approaches artificially limit $K$ to a low value, which limits the agent's ability to address long-term dependencies. To our knowledge, the only existing principled approach to reducing the amount of memory required by CTW was proposed in [5], through the use of a modified (Budget) SAD estimator that limits the branching factor in the context tree to $B < |\Sigma|$. This approach still requires $K$ to be set a priori, and is best suited to prediction problems with large alphabets but few high-frequency symbols (e.g., word prediction), which is not generally the case in decision-making problems.

### 2.3 Model-Based RL with VMMs

In RL with partial observability, an agent performs at each time $t$ an action $a_t \in A$, and receives an observation $o_t \in O$ and a reward $r_t \in \mathbb{R}$ with probabilities $P(o_t \mid o_{0:t-1}, r_{0:t-1}, a_{0:t-1})$ and $P(r_t \mid o_{0:t-1}, r_{0:t-1}, a_{0:t-1})$ respectively. This representation results from marginalising out the latent state variables and assuming that the agent observes rewards. The agent's goal is to maximise the expected cumulative future reward $E\{\sum_{\tau=t+1}^{\infty} \lambda^{\tau-t-1} r_\tau\}$ for some discount factor $\lambda \in [0, 1)$. Letting $\mathcal{R} = \{r_t : P(r_t \mid o_{0:t-1}, r_{0:t-1}, a_{0:t-1}) > 0 \text{ for some } o_{0:t-1}, r_{0:t-1}, a_{0:t-1}\}$ represent the set of possible rewards, and $z_t \in \{1, \ldots, |\mathcal{R}|\}$ the unique index of $r_t \in \mathcal{R}$, a percept $(o_t, z_t)$ is received at each time step with probability $P(o_t, z_t \mid o_{0:t-1}, z_{0:t-1}, a_{0:t-1})$. VMMs such as CTW can then learn a model of this process, using the alphabet $\Sigma = O \times \{1, \ldots, |\mathcal{R}|\}$. This predictive model must condition on past actions, but its output should only estimate the probability of the next percept (not the next action). This is achieved by interleaving actions and percepts in the input context, but only updating the estimators based on the value of the next percept [19]. The resulting action-conditional model can be used as a simulator by sample-based planning methods such as UCT [12].

### 2.4 Utile Suffix Memory

Utile suffix memory (USM) [13] is an RL algorithm similar to VMMs for stochastic time-series prediction. USM learns a suffix tree that is conceptually similar to a context tree, with the following differences. First, each node in the suffix tree directly maintains an estimate of the expected cumulative future reward for each action. To compute this estimate, USM still predicts (immediate) future observations and rewards at each context, analogously to VMM methods. This prediction is done in a purely frequentist manner, which often yields inferior prediction performance compared to other VMMs, especially given noisy data. Second, USM's suffix tree does not have a fixed depth; instead, the tree is grown incrementally, by testing potential expansions for statistically significant differences between their respective predictions of cumulative future reward. USM maintains a fixed-depth subtree of fringe nodes below the proper leaf nodes of the suffix tree. Fringe nodes do not contribute to the model's output, but they also maintain count vectors. At regular intervals, USM compares the distribution over cumulative future reward of each fringe node against that of its leaf ancestor, through a Kolmogorov-Smirnov (K-S) test. If this test succeeds at some threshold confidence, then all fringe nodes below that leaf node become proper nodes, and a new fringe subtree is created below the new leaf nodes.
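For comparison with D2-CTW's criterion in Section 3.2, the following sketch shows what USM's fringe test looks like using SciPy's two-sample K-S test. The node fields (`returns`) and the confidence threshold are illustrative assumptions, not USM's exact implementation.

```python
from scipy.stats import ks_2samp

def usm_should_expand(leaf, fringe_children, confidence=0.05):
    """USM-style fringe test (sketch): compare the empirical distribution
    of discounted future returns recorded at a leaf against each fringe
    child; expand if any child differs significantly."""
    for child in fringe_children:
        if len(child.returns) < 2 or len(leaf.returns) < 2:
            continue  # not enough data for a meaningful test
        _, p_value = ks_2samp(leaf.returns, child.returns)
        if p_value < confidence:
            return True  # distributions differ: the distinction is "utile"
    return False
```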
USM's fringe expansion allows it to use memory efficiently, as only the contextual distinctions that are actually significant for prediction are represented. However, USM is computationally expensive. Performing K-S tests for all nodes in a fringe subtree requires, in the worst case, time linear in the amount of (real-valued) data contained at each node, and exponential in the depth of the subtree. This cost can be prohibitive even if the expansion test is run only infrequently. Furthermore, USM does not explicitly bound its memory use, and simply stopping growth once a memory bound is hit would bias the model towards symbols received early in learning.

## 3 Dynamic-Depth Context Tree Weighting

We now propose dynamic-depth context tree weighting (D2-CTW). Rather than fixing the depth a priori, like CTW, or using unbounded memory, like USM, D2-CTW learns $\hat{F}$ with dynamic depth, subject to the constraint $|\hat{F}_t| \leq L$ at any time $t$, where $L$ is a fixed memory bound.

### 3.1 Dynamic Expansion in CTW

To use memory efficiently and avoid requiring a fixed depth, we could simply replicate USM's fringe expansion in CTW, by performing K-S tests on distributions over symbols ($P^s_e$) instead of distributions over expected reward. However, doing so would introduce bias. The weighted estimate $P^s_w(\sigma_{0:t})$ for each context $s$ depends on the ratio between the probability of the observed data at $s$ itself, $P^s_e(\sigma_{0:t})$, and that of the data observed at its children, $P^{s'}_w(\sigma_{0:t})$ for $s' = \sigma s$, $\sigma \in \Sigma$. These estimates depend on the number of times each symbol followed a context, implying that $c_{s,t} = \sum_{\sigma \in \Sigma} c_{\sigma s,t}$. Thus, the weighting in (3) assumes that each symbol observed to follow the non-leaf context $s$ was also observed to follow exactly one of its children $s'$. If this were not so and, e.g., $s$ was created at time $0$ but its children only at $\tau > 0$, then, since $P^{s'}_w(\sigma_{\tau:t}) \geq P^{s'}_w(\sigma_{0:t})$, the weighting would be biased towards the children, which would have been exposed to less data.

Fortunately, an alternative CTW recursion, originally proposed for numerical stability [20], overcomes this issue. In CTW, for a context tree of fixed depth $K$, let $\beta^s_t$ be the likelihood ratio between the weighted estimate below $s$ and the local estimate at $s$ itself:

$$\beta^s_t := \begin{cases} \dfrac{\prod_{\sigma \in \Sigma} P^{\sigma s}_w(\sigma_{0:t})}{P^s_e(\sigma_{0:t})} & \text{if } |s| < K, \\ 1 & \text{if } |s| = K. \end{cases} \qquad (5)$$

Then the weighted estimate of the conditional probability of an observed symbol $\sigma_t$ at node $s$ is:

$$P^s_w(\sigma_t \mid \sigma_{0:t-1}) := \frac{P^s_w(\sigma_{0:t})}{P^s_w(\sigma_{0:t-1})} = \frac{\frac{1}{2} P^s_e(\sigma_{0:t})\,(1 + \beta^s_t)}{\frac{1}{2} P^s_e(\sigma_{0:t-1})\,(1 + \beta^s_{t-1})} = P^s_e(\sigma_t \mid \sigma_{0:t-1})\,\frac{1 + \beta^s_t}{1 + \beta^s_{t-1}}. \qquad (6)$$

Furthermore, $\beta^s_t$ can be updated for each $s$ as follows. Let $C_t$ represent the set of suffixes on the context path at time $t$ (the set of all suffixes of $\sigma_{0:t-1}$). Then:

$$\beta^s_t = \begin{cases} \dfrac{P^{s'}_w(\sigma_t \mid \sigma_{0:t-1})}{P^s_e(\sigma_t \mid \sigma_{0:t-1})}\,\beta^s_{t-1} & \text{if } s \in C_t, \\ \beta^s_{t-1} & \text{otherwise}, \end{cases} \qquad (7)$$

where $s' = \sigma_{t-1-|s|} s$ is the child of $s$ that follows it on the context path. For any context, we set $\beta^s_0 = 1$. This reformulation allows the computation of $P^s_w(\sigma_t \mid \sigma_{0:t-1})$ using only the nodes on the context path, while storing only a single value $\beta^s_t$ in each node, regardless of $|\Sigma|$. Since this reformulation depends only on conditional probability estimates, we can perform fringe expansion in CTW and add nodes dynamically without biasing the mixture; a sketch of the update follows.
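The sketch below implements one such update step along the context path, which is ordered root to leaf; the node interface (`estimator`, `beta`) is illustrative. Each node needs its child's weighted conditional estimate before updating its own $\beta$, so the walk proceeds bottom-up.

```python
def update_context_path(path, symbol):
    """One CTW update along the context path using the beta
    reformulation of Eqs. (6)-(7); `path` is ordered root -> leaf."""
    p_w_child = None
    for node in reversed(path):
        p_e = node.estimator.predict(symbol)  # local estimate, Eq. (4)
        if p_w_child is None:
            p_w = p_e                         # leaf: no children to mix in
        else:
            beta_new = (p_w_child / p_e) * node.beta           # Eq. (7)
            p_w = p_e * (1.0 + beta_new) / (1.0 + node.beta)   # Eq. (6)
            node.beta = beta_new
        node.estimator.update(symbol)         # increment counters
        p_w_child = p_w
    return p_w_child  # root's weighted conditional probability of `symbol`
```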
Disregard the fixed depth limit $K$ and consider instead a suffix tree in which all leaf nodes have a depth greater than a fringe depth $H > 0$. For any leaf node at depth $d$, its ancestor at depth $d - H$ is its frontier node, and the descendants of any frontier node are fringe nodes.

Let $f_t$ represent the frontier node on the context path at time $t$. At every timestep $t$, we traverse down the tree by following the context path as in CTW. At every node on the context path above $f_t$, we apply (6) and (7) while treating $f_t$ as a leaf node. For $f_t$ and the fringe nodes on the context path below it, we apply the same updates while treating fringe nodes normally. Thus, the recursion in (6) does not carry over to fringe nodes, but otherwise all nodes update their values of $\beta$ in the same manner.

Once the fringe expansion criterion is met (see Section 3.2), the fringe nodes below $f_t$ simply stop being labelled as such, while the values of $\beta$ for the nodes above $f_t$ must be updated to reflect the change in the model. Let $\tilde{P}^{f_t}_w(\sigma_{0:t})$ represent the weighted (unconditional) output at $f_t$ after the fringe expansion step. We then have $\tilde{P}^{f_t}_w(\sigma_{0:t}) := \frac{1}{2} P^{f_t}_e(\sigma_{0:t})(1 + \beta^{f_t}_t)$, whereas prior to the expansion $P^{f_t}_w(\sigma_{0:t}) = P^{f_t}_e(\sigma_{0:t})$. The net change in the likelihood of $\sigma_{0:t}$, according to $f_t$, is:

$$\alpha^{f_t}_{\mathrm{exp}} := \frac{\tilde{P}^{f_t}_w(\sigma_{0:t})}{P^{f_t}_w(\sigma_{0:t})} = \frac{1 + \beta^{f_t}_t}{2}. \qquad (8)$$

This induces a change in the likelihood of the data according to all of the ancestors of $f_t$. We need to determine $\alpha^\epsilon_{\mathrm{exp}} := \tilde{P}^\epsilon_w(\sigma_{0:t}) / P^\epsilon_w(\sigma_{0:t})$, which quantifies the effect of the fringe expansion on the global output of the weighted model.

**Proposition 1.** Let $f$ be a string corresponding to a frontier node, and let $p_d$ be the length-$d$ suffix of $f$ (with $p_0 = \epsilon$). Also let $\rho^f_t := \prod_{d=0}^{|f|-1} \frac{\beta^{p_d}_t}{1 + \beta^{p_d}_t}$ and $\alpha^f_{\mathrm{exp}} := \frac{1 + \beta^f_t}{2}$. Then:

$$\alpha^\epsilon_{\mathrm{exp}} := \frac{\tilde{P}^\epsilon_w(\sigma_{0:t})}{P^\epsilon_w(\sigma_{0:t})} = 1 + \rho^f_t \left( \alpha^f_{\mathrm{exp}} - 1 \right).$$

The proof can be found in the supplementary material of this paper (Appendix A.1). This formulation is useful since, for any node $s$ in the suffix tree with ancestors $(p_0, p_1, \ldots, p_{|s|-1})$, we can associate a value $\rho^s_t = \prod_{d=0}^{|s|-1} \frac{\beta^{p_d}_t}{1 + \beta^{p_d}_t} = \rho^{p_{|s|-1}}_t \frac{\beta^{p_{|s|-1}}_t}{1 + \beta^{p_{|s|-1}}_t}$ that measures the sensitivity of the whole model to changes below $s$, not necessarily just fringe expansions. Thus, a node with $\rho^s_t \approx 0$ is a good candidate for pruning (see Section 3.3). Furthermore, this value can be computed while traversing the tree along the context path. Although the computation of $\rho^s_t$ for a particular node still requires $O(|s|)$ operations, the values of $\rho$ for all ancestors of $s$ are computed along the way.

### 3.2 Fringe Expansion Criterion

As a likelihood ratio, $\alpha^f_{\mathrm{exp}}$ provides a statistical measure of the difference between the predictive model at each frontier node $f$ and that formed by its fringe children. Analogously, $\alpha^\epsilon_{\mathrm{exp}}$ can be seen as the likelihood ratio between two models that differ only in the subtree below $f$. Therefore, we can test the hypothesis that the subtree below $f$ should be added to the model by checking whether $\alpha^\epsilon_{\mathrm{exp}} > \gamma$ for some $\gamma > 1$. Since the form of $P^\epsilon_w(\cdot)$ is unknown, we cannot establish proper confidence levels for $\gamma$; however, the following result shows that the value of $\gamma$ is not especially important, since if the subtree below $f$ improves the model, the test will eventually succeed given enough data.

**Theorem 1.** Let $S$ and $S_{\mathrm{exp}}$ be two proper suffix sets such that $S_{\mathrm{exp}} = (S \setminus \{f\}) \cup F$, where $f$ is a suffix of every $f' \in F$. Furthermore, let $M$ and $M_{\mathrm{exp}}$ be the CTW models using the suffix trees induced by $S$ and $S_{\mathrm{exp}}$ respectively, and $P^\epsilon_w(\sigma_{0:t}; M)$, $P^\epsilon_w(\sigma_{0:t}; M_{\mathrm{exp}})$ their estimates of the likelihood of $\sigma_{0:t}$. If there is a $T \in \mathbb{N}$ such that, for any $\tau > 0$:

$$\prod_{t=\tau}^{\tau+T} P^f_e(\sigma_t \mid \sigma_{0:t-1}; M) < \prod_{t=\tau}^{\tau+T} \prod_{\sigma \in \Sigma} P^{\sigma f}_w(\sigma_t \mid \sigma_{0:t-1}; M_{\mathrm{exp}}),$$

then for any $\gamma \in [1, \infty)$, there is a $T' > 0$ such that $P^\epsilon_w(\sigma_{0:T'}; M_{\mathrm{exp}}) / P^\epsilon_w(\sigma_{0:T'}; M) > \gamma$. The proof can be found in the supplementary material (Appendix A.2).
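In code, the test of Section 3.2 reduces to a product over the $\beta$ values of a frontier node's ancestors; a minimal sketch, assuming those values are collected (root first) during the downward traversal:

```python
def expansion_gain(ancestor_betas, beta_frontier):
    """Proposition 1 (sketch): accumulate rho over the ancestors
    p_0..p_{|f|-1} of a frontier node, then combine it with the local
    ratio alpha_f = (1 + beta^f) / 2 to obtain the global effect of
    expanding the fringe below f."""
    rho = 1.0
    for beta in ancestor_betas:              # betas of f's ancestors
        rho *= beta / (1.0 + beta)
    alpha_f = (1.0 + beta_frontier) / 2.0    # Eq. (8)
    return 1.0 + rho * (alpha_f - 1.0)       # alpha_exp at the root

def fringe_expansion_test(ancestor_betas, beta_frontier, gamma=10.0):
    # Fringe expansion criterion of Section 3.2: alpha_exp > gamma.
    return expansion_gain(ancestor_betas, beta_frontier) > gamma
```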
Using $\alpha^\epsilon_{\mathrm{exp}} > \gamma$ as a statistical test instead of K-S tests yields great computational savings, since the procedure described in Proposition 1 allows us to evaluate the test in $O(|f_t|)$ time, typically much lower than the $O(|\Sigma|^{H+1})$ complexity of K-S testing all fringe children. Theorem 1 also ensures that, if sufficient memory is available, D2-CTW eventually performs as well as CTW with the optimal depth bound $K = D$. This follows from the fact that, for every node $s$ at depth $d_s \leq D$ in a CTW suffix tree, if $\beta^s_t \geq 1$ for all $t > \tau$, then the D2-CTW suffix tree will be at least as deep as $d_s$ at context $s$ after some time $t' \geq \tau$. That is, at some point, the D2-CTW model will contain the useful subtree of the optimal-depth context tree.

**Corollary 1.** Let $\ell(\cdot \mid \hat{F}_{\mathrm{CTW}}, D)$ represent the average log-loss of CTW using fixed depth $K = D$ when modelling a $D$-bounded tree source, and $\ell(\cdot \mid \hat{F}_{\mathrm{D2\text{-}CTW}}, \gamma, H, L)$ the same metric when using D2-CTW. For any values of $\gamma > 1$ and $H > 1$, and for sufficiently high $L > 0$, there exists a time $T > 0$ such that, for any $t > T$, $\ell(\sigma_{T:t} \mid \hat{F}_{\mathrm{D2\text{-}CTW}}, \gamma, H, L) \leq \ell(\sigma_{T:t} \mid \hat{F}_{\mathrm{CTW}}, D)$.

### 3.3 Ensuring the Memory Bound

To ensure that the memory bound $|\hat{F}_t| \leq L$ is respected, we must first check whether a potential fringe expansion requires more memory than is available: if the subtree below frontier node $f$ has size $L_f$, we must test whether $|\hat{F}_t| + L_f \leq L$. Fringe nodes are not taken into account when computing $|\hat{F}_t|$, as they do not contribute to the output of $\hat{F}_t$; they are considered memory overhead and are discarded after training.

Once $|\hat{F}_t|$ is such that no fringe expansion is possible without violating the memory bound, it may still be possible to improve the model by pruning low-quality subtrees to create space for more valuable fringe expansions. Pruning operations also have a quantifiable effect on the likelihood of the observed data according to $\hat{F}_t$. Let $\tilde{P}^s_w(\sigma_{0:t})$ represent the weighted estimate at internal node $s$ after pruning its subtree. Analogously to (8), we can define $\alpha^s_{\mathrm{prune}} := \tilde{P}^s_w(\sigma_{0:t}) / P^s_w(\sigma_{0:t}) = 2 / (1 + \beta^s_t)$, and we can compute $\alpha^\epsilon_{\mathrm{prune}}$, the global effect on the likelihood, using the procedure in Proposition 1. Since $\alpha^\epsilon_{\mathrm{prune}} = 1 + \rho^s_t(\alpha^s_{\mathrm{prune}} - 1)$, typically with $\alpha^s_{\mathrm{prune}} < 1$, if a fringe expansion at $f$ increases $P^\epsilon_w(\sigma_{0:t})$ by a factor $\alpha^\epsilon_{\mathrm{exp}}$ but requires space $L_f$ such that $|\hat{F}_t| + L_f > L$, we should prune the subtree below $s \neq f$ that frees $L_s$ space and reduces $P^\epsilon_w(\sigma_{0:t})$ by $\alpha^\epsilon_{\mathrm{prune}}$ if: 1) $\alpha^\epsilon_{\mathrm{exp}}\,\alpha^\epsilon_{\mathrm{prune}} > 1$; 2) $|\hat{F}_t| + L_f - L_s \leq L$; and 3) $s$ is not an ancestor of $f$. The last condition requires $O(|f| - |s|)$ time to validate, while the first can be checked in constant time if $\rho^s_t$ is available.

In general, some combination of subtrees could be pruned to free enough space for some combination of fringe expansions, but determining the best possible combination of operations at each step is too computationally expensive. As a tractable approximation, we compare only the best single expansion and the best single prune, at nodes $f^*$ and $s^*$ respectively, quantified through the two heuristics $H^f_{\mathrm{exp}} := \log \alpha^\epsilon_{\mathrm{exp}}(f)$ and $H^s_{\mathrm{prune}} := \log \alpha^\epsilon_{\mathrm{prune}}(s)$, such that $f^* = \arg\max_f H^f_{\mathrm{exp}}$ and $s^* = \arg\min_s H^s_{\mathrm{prune}}$; a sketch of this decision follows.
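A minimal sketch of this bookkeeping, assuming the candidates $f^*$ and $s^*$ have already been identified during the traversal; the `tree` object and node helpers are illustrative stand-ins for the paper's implementation, not its actual API.

```python
import math

def maybe_expand(tree, f_best, s_best, L):
    """Section 3.3 (sketch): try the single best fringe expansion
    f_best; if it would exceed the node budget L, prune the single
    best candidate subtree s_best first, provided the combined
    likelihood change is still a net gain."""
    if tree.size + f_best.subtree_size <= L:
        tree.expand(f_best)                 # fits within the budget as-is
        return
    h_exp = math.log(f_best.alpha_exp)      # H_exp = log alpha_exp
    h_prune = math.log(s_best.alpha_prune)  # H_prune = log alpha_prune
    fits = tree.size + f_best.subtree_size - s_best.subtree_size <= L
    if (h_exp + h_prune > 0.0               # 1) alpha_exp * alpha_prune > 1
            and fits                        # 2) result respects |F_t| <= L
            and not s_best.is_ancestor_of(f_best)):  # 3)
        tree.prune(s_best)
        tree.expand(f_best)
```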
As $L$ is decreased, the performance of D2-CTW may naturally degrade. Although Corollary 1 may no longer be applicable in that case, a weaker bound on the performance of memory-constrained D2-CTW can be obtained as follows, regardless of $L$: let $d^t_{\min}$ denote the minimum depth of any frontier node at time $t$; then the D2-CTW suffix tree covers the set of $d^t_{\min}$-bounded models [23]. The redundancy of D2-CTW, measured as the Kullback-Leibler divergence $D_{KL}(F \,\|\, \hat{F}_t)$, is then at least as low as the redundancy of a multi-alphabet CTW implementation with $K = d^t_{\min}$ [17].

[Figure 1: Calgary Corpus performance with CTW (red) and D2-CTW (blue). For average log-loss, lower is better: (a)-(c) using optimal parameters; (d) with a bound on the number of nodes; (e) with size bound and uniform noise; (f) log-loss vs. $\gamma$ on book2, with 10% noise (over 30 runs).]

### 3.4 Complete Algorithm and Complexity

The complete D2-CTW algorithm operates as follows (please refer to Appendix A.3 for the respective pseudo-code). A suffix tree is first initialised containing only a root node. At every timestep, the suffix tree is updated using the observed symbol $\sigma_t$ and the preceding context (if it exists), reaching back to time $t - (d^t_{\max} + H)$, where $d^t_{\max}$ is the current maximum depth of the tree and $H$ is the fringe depth. This update returns the weighted conditional probability of $\sigma_t$, and it also keeps track of the best known fringe expansion and pruning operations. Then, a post-processing step expands and possibly prunes the tree as necessary, ensuring that the memory bound is respected. This step also corrects the values of $\beta$ for any nodes affected by these topological operations.

D2-CTW trains on each new symbol in $O(d^t_{\max} + H)$ time, the same as CTW with depth bound $K = d^t_{\max} + H$. A worst-case $O((d^t_{\max} + H)\,|\Sigma|)$ operations are necessary to sample a symbol from the learned model, also equivalent to CTW. Post-processing requires $O(\max\{|f^*|, |s^*|\})$ time.
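Putting the pieces together, one training pass can be organised as below; the `model` methods are illustrative stand-ins for the pseudo-code in Appendix A.3.

```python
import math

def train(model, stream, H):
    """Sketch of the D2-CTW training loop of Section 3.4: update along
    the context path, then expand/prune subject to the memory bound."""
    history, loss, n = [], 0.0, 0
    for symbol in stream:
        # The context reaches back d_max + H symbols so that fringe
        # nodes are updated alongside proper nodes.
        context = history[-(model.max_depth + H):]
        # Returns the weighted conditional probability of `symbol` and
        # records the best fringe-expansion / pruning candidates.
        p = model.update(symbol, context)
        # Expand and possibly prune subject to |F_t| <= L, correcting
        # beta on any ancestors affected by the topological change.
        model.post_process()
        loss -= math.log(p)
        n += 1
        history.append(symbol)
    return loss / max(n, 1)  # average log-loss over the stream
```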
## 4 Experiments

We now present empirical results on byte-prediction tasks and partially observable RL. Our code and instructions for its use are publicly available at https://bitbucket.org/jmessias/vmm_py.

**Byte prediction.** We compare the performance of D2-CTW against CTW on the 18-file variant of the Calgary Corpus [3], a benchmark of text and binary data files. For each file, we ask the algorithms to predict the next byte given the preceding data, so that $|\Sigma| = 256$ across all files. We first compare performance when using (approximately) optimal hyperparameters. For CTW, we performed a grid search over $K \in \{1, \ldots, 10\}$ for each file. For D2-CTW, we investigated the effect of $\gamma$ on the prediction log-loss across different files and found no significant effect for sufficiently large values (an example is shown in Fig. 1f), in accordance with Theorem 1. Consequently, we set $\gamma = 10$ for all our D2-CTW runs, together with $L = \infty$ and $H = 2$. The corpus results, shown in Figs. 1a-1c, show that D2-CTW achieves performance comparable to CTW: on average, D2-CTW's loss is 2% higher, which is expected since D2-CTW grows dynamically from a single node, while CTW starts with a fully grown model of optimal height. By contrast, D2-CTW uses far fewer nodes than CTW, by at least one order of magnitude (average factor 28). D2-CTW also automatically discovers depths that are similar to the optimal values for CTW.

We then ran a similar test but with a bound $L = 1000$ on the number of nodes. For CTW, we enforced this bound by simply stopping the suffix tree from growing beyond this point.¹ The results are shown in Fig. 1d. In this case, the log-loss of CTW is on average 11.4%, and up to 32.3%, higher than that of D2-CTW, showing that D2-CTW makes significantly better use of memory. Finally, we repeated this test but randomly replaced 5% of symbols with uniform noise. This makes the advantage of D2-CTW even more evident, with CTW scoring on average 20.0% worse (Fig. 1e). While the presence of noise still impacts performance, the results show that D2-CTW, unlike CTW, is resilient to noise: spurious contexts are not deemed significant, avoiding wasted memory.

¹For simplicity, we did not use CTW with Budget SAD as a baseline. Budget SAD could also be used to extend D2-CTW, so a fair comparison would necessitate the optional integration of Budget SAD into both CTW and D2-CTW. This is an interesting possibility for future work.

**Model-based RL.** For our empirical study on online partially observable RL tasks, we take as a baseline MC-AIXI, a combination of fixed-depth CTW modelling with ρUCT planning [19], and investigate the effect of replacing CTW with D2-CTW and limiting the available memory. We also compare against PPM-C, a frequentist VMM that is competitive with CTW [2]. Our experimental domains are further described in the supplementary material.

Our first domain is the T-maze [1], in which an agent must remember its initial observation in order to act optimally at the end of the maze. We consider a maze of length 4, and set $K = 3$ for CTW and PPM-C, the minimum depth guaranteed to produce the optimal policy. For D2-CTW we set $\gamma = 1$ and $H = 2$, and do not enforce a memory bound. As in [19], we use an $\epsilon$-greedy exploration strategy. Fig. 2a shows that D2-CTW discovers the length of the T-maze automatically, whereas CTW and PPM-C fail to learn to retain the required observations: during the initial stages of learning, the agent may need more than 3 steps to reach the goal (D2-CTW learns a model of depth 4).

Our second scenario is the cheese maze [13], a navigation task with aliased observations. Under optimal parameters, D2-CTW and CTW both achieve near-optimal performance on this task. We investigated the effect of setting a bound $L = 1000$ on the number of nodes, roughly 1/5 of the amount used by CTW with optimal hyperparameters. Fig. 2b shows that the quality of D2-CTW degrades less than that of both CTW and PPM-C, still achieving a near-optimal policy. As this is a small problem with $D = 2$, CTW and PPM-C still produce reasonable results in this case, albeit with lower-quality models than D2-CTW.

Finally, we tested a partially observable version of mountain car [16], in which the position of the car is observed but not its velocity. We coarsely discretised the position of the car into 10 states. In this task we have no strong prior knowledge about the required context length, but found $K = 4$ to be sufficient for optimal PPM-C and CTW performance. For D2-CTW, we used $\gamma = 10$ and $H = 2$. We also set $L = 1000$ for all methods. Fig. 2c shows the markedly superior performance of D2-CTW when subject to this memory constraint.
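All three tasks use the action-conditional scheme of Section 2.3: actions are interleaved into the context, but the model is queried and updated only on percepts. A minimal sketch of that interface, with illustrative names throughout:

```python
def predict_percept(model, history, action, observation, reward_idx):
    """P(o_t, z_t | history, a_t) (sketch): condition on the action by
    appending it to the context, but query the model only on the joint
    percept symbol, never on the action itself."""
    context = history + [("a", action)]
    return model.predict(("o", observation, reward_idx), context)

def record_step(model, history, action, observation, reward_idx):
    """Update the model's estimators on the percept alone, then extend
    the history with both the action and the percept."""
    percept = ("o", observation, reward_idx)
    model.update(percept, history + [("a", action)])
    history.extend([("a", action), percept])
```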
## 5 Conclusions and Future Work

We introduced D2-CTW, a variable-order modelling algorithm that extends CTW by using a fringe expansion mechanism that tests contexts for statistical significance, and by allowing the dynamic adaptation of its suffix tree subject to a memory bound. We showed both theoretically and empirically that D2-CTW requires little configuration across domains and outperforms CTW under memory constraints and/or in the presence of noise. In future work, we will investigate the use of the Budget SAD estimator with a dynamic budget as an alternative mechanism for informed pruning. We also aim to apply a similar approach to context tree switching (CTS) [18], an algorithm that is closely related to CTW but enables mixtures over a larger model class.

[Figure 2: Performance measured as (running) average rewards in (a) T-maze; (b) cheese maze; (c) partially observable mountain car. Results show the mean over 10 runs, with the first to third quartile shaded.]

**Acknowledgments.** This work was supported by the European Commission under grant agreement FP7-ICT-611153 (TERESA).

## References

[1] B. Bakker. Reinforcement learning with long short-term memory. In Proceedings of the 14th International Conference on Neural Information Processing Systems, NIPS'01, pages 1475-1482, Cambridge, MA, USA, 2001. MIT Press.

[2] R. Begleiter, R. El-Yaniv, and G. Yona. On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22:385-421, 2004.

[3] T. Bell, I. H. Witten, and J. G. Cleary. Modeling for text compression. ACM Computing Surveys (CSUR), 21(4):557-591, 1989.

[4] M. Bellemare, J. Veness, and E. Talvitie. Skip context tree switching. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1458-1466, 2014.

[5] M. G. Bellemare. Count-based frequency estimation with bounded memory. In IJCAI, pages 3337-3344, 2015.

[6] J. G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. The Computer Journal, 40(2-3):67-75, 1997.

[7] V. F. Farias, C. C. Moallemi, B. Van Roy, and T. Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441-2454, 2010.

[8] W. L. Hamilton, M. M. Fard, and J. Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In ICML, pages 178-186, 2013.

[9] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium Series, 2015.

[10] M. P. Holmes and C. L. Isbell Jr. Looping suffix tree-based inference of partially observable hidden state. In Proceedings of the 23rd International Conference on Machine Learning, pages 409-416. ACM, 2006.

[11] M. Hutter et al. Sparse adaptive Dirichlet-multinomial-like processes. In Conference on Learning Theory: JMLR Workshop and Conference Proceedings, volume 30, 2013.

[12] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282-293. Springer, 2006.

[13] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995.

[14] D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2-3):117-149, 1996.
[15] S. Singh, M. R. James, and M. R. Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 512-519. AUAI Press, 2004.

[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[17] T. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems. Context tree weighting: Multi-alphabet sources. In 14th Symposium on Information Theory in the Benelux, pages 128-135, 1993.

[18] J. Veness, K. S. Ng, M. Hutter, and M. Bowling. Context tree switching. In Data Compression Conference (DCC), 2012, pages 327-336. IEEE, 2012.

[19] J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40(1):95-142, 2011.

[20] P. A. J. Volf. Weighting Techniques in Data Compression: Theory and Algorithms. PhD thesis, Technische Universiteit Eindhoven, 2002.

[21] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber. Solving deep memory POMDPs with recurrent policy gradients. In International Conference on Artificial Neural Networks, pages 697-706. Springer, 2007.

[22] F. M. Willems. The context-tree weighting method: Extensions. IEEE Transactions on Information Theory, 44(2):792-798, 1998.

[23] F. M. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653-664, 1995.

[24] F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 54(2):91-98, 2011.