# The Forget-me-not Process

Kieran Milan∗, Joel Veness∗, James Kirkpatrick, Demis Hassabis
Google DeepMind
{kmilan,aixi,kirkpatrick,demishassabis}@google.com

Anna Koop, Michael Bowling
University of Alberta
{anna,bowling}@cs.ualberta.ca

∗ indicates joint first authorship.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We introduce the Forget-me-not Process, an efficient, non-parametric meta-algorithm for online probabilistic sequence prediction for piecewise stationary, repeating sources. Our method works by taking a Bayesian approach to partitioning a stream of data into postulated task-specific segments, while simultaneously building a model for each task. We provide regret guarantees with respect to piecewise stationary data sources under the logarithmic loss, and validate the method empirically across a range of sequence prediction and task identification problems.

## 1 Introduction

Modeling non-stationary temporal data sources is a fundamental problem in signal processing, statistical data compression, quantitative finance and model-based reinforcement learning. One widely-adopted and successful approach has been to design meta-algorithms that automatically generalize existing stationary learning algorithms to various non-stationary settings. In this paper we introduce the Forget-me-not Process, a probabilistic meta-algorithm that provides the ability to model the class of memory-bounded, piecewise-repeating sources given an arbitrary, probabilistic, memory-bounded stationary model.

The most well studied class of probabilistic meta-algorithms are those for piecewise stationary sources, which model data sequences with abruptly changing statistics. Almost all meta-algorithms for abruptly changing sources work by performing Bayesian model averaging over a class of hypothesized temporal partitions. To the best of our knowledge, the earliest demonstration of this fundamental technique was [21], for the purpose of data compression; closely related techniques have gained popularity within the machine learning community for change point detection [1] and have been proposed by neuroscientists as a mechanism by which humans deal with open-ended environments composed of multiple distinct tasks [4, 6]. One of the reasons for the popularity of this approach is that the temporal structure can be exploited to make exact Bayesian inference tractable via dynamic programming; in particular, inference over all possible temporal partitions of $n$ data points results in an algorithm of $O(n^2)$ time complexity and $O(n)$ space complexity [21, 1]. Many variants have been proposed in the literature [20, 11, 10, 17], which trade off predictive accuracy for improved time and space complexity; in particular, the Partition Tree Weighting meta-algorithm [17] has $O(n \log n)$ time and $O(\log n)$ space complexity, and has been shown empirically to exhibit superior performance versus other low-complexity alternatives on piecewise stationary sources.

A key limitation of these aforementioned techniques is that they can perform poorly when there exist multiple segments of data that are similarly distributed. For example, consider data generated according to the schedule depicted in Figure 1. For all these methods, once a change-point occurs, the base (stationary) model is invoked from scratch, even if the task repeats, which is clearly undesirable in many situations of interest.

Figure 1: An example task segmentation.
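The dynamic-programming construction referenced above is easy to state concretely. The following is a minimal illustration, not the Forget-me-not Process itself: a Bayesian mixture over postulated segment start times in the spirit of [21, 1], using a Krichevsky-Trofimov (Beta(1/2, 1/2)) estimator over a binary alphabet as the base model and an assumed constant per-step change probability `hazard`; the function names (`kt_predict`, `changepoint_mixture`) are illustrative only. Processing $n$ symbols takes $O(n^2)$ total time and $O(n)$ space, matching the complexities quoted above.

```python
def kt_predict(ones, zeros, symbol):
    """KT estimator: predictive probability of `symbol` under a Beta(1/2, 1/2) prior."""
    count = ones if symbol == 1 else zeros
    return (count + 0.5) / (ones + zeros + 1.0)


def changepoint_mixture(data, hazard=0.01):
    """Per-symbol predictive probabilities under a Bayesian mixture over all
    postulated segment start times, updated by dynamic programming."""
    preds = []
    weights = {1: 1.0}    # start time -> joint weight of x_{<t} and "segment began then"
    counts = {1: (0, 0)}  # start time -> (ones, zeros) observed inside that segment
    for t, x in enumerate(data, start=1):
        # predictive probability: weighted average over postulated start times
        total = sum(weights.values())
        preds.append(sum(w * kt_predict(*counts[s], x)
                         for s, w in weights.items()) / total)
        # each hypothesis either continues, or a new segment begins at time t + 1
        born = 0.0
        for s in list(weights):
            joint = weights[s] * kt_predict(*counts[s], x)
            weights[s] = joint * (1.0 - hazard)
            born += joint * hazard
            ones, zeros = counts[s]
            counts[s] = (ones + (x == 1), zeros + (x == 0))
        weights[t + 1] = born
        counts[t + 1] = (0, 0)
    return preds
```

Note how each change-point hypothesis restarts the base model from scratch; this is exactly the behaviour that becomes wasteful when tasks repeat.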
Our main contribution in this paper is to introduce the Forget-me-not Process, which has the ability to avoid having to relearn repeated tasks, while still maintaining essentially the same theoretical performance guarantees as Partition Tree Weighting on piecewise stationary sources.

## 2 Preliminaries

We now introduce some notation and necessary background material.

**Sequential Probabilistic Data Generators.** We begin with some terminology for sequential, probabilistic data generating sources. An alphabet is a finite, non-empty set of symbols, which we will denote by $\mathcal{X}$. A string $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$, where $j \leq n$, is denoted by $x_{\leq j}$ or $x_{<j+1}$; for $m > n$, we define $x_{1:m} := x_{1:n}$ and $x_{m:n} := \epsilon$, where $\epsilon$ denotes the empty string. The concatenation of two strings $s, r \in \mathcal{X}^*$ is denoted by $sr$. Unless otherwise specified, base 2 is assumed for all logarithms.

A sequential probabilistic data generating source $\rho$ is defined by a sequence of probability mass functions $\rho_n : \mathcal{X}^n \to [0, 1]$, for all $n \in \mathbb{N}$, satisfying the constraint that $\rho_n(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho_{n+1}(x_{1:n} y)$ for all $x_{1:n} \in \mathcal{X}^n$, with base case $\rho_0(\epsilon) = 1$. From here onwards, whenever the meaning is clear from the argument to $\rho$, the subscripts on $\rho$ will be dropped. Under this definition, the conditional probability of a symbol $x_n$ given previous data $x_{<n}$ is $\rho(x_n \mid x_{<n}) := \rho(x_{1:n}) / \rho(x_{<n})$ whenever $\rho(x_{<n}) > 0$, with the familiar chain rule $\rho(x_{i:j} \mid x_{<i}) = \prod_{k=i}^{j} \rho(x_k \mid x_{<k})$ applying as usual.

**Bayesian Mixture Predictors.** Given a finite set $\mathcal{M}$ of candidate sources and prior weights $w_0^\rho > 0$ for each $\rho \in \mathcal{M}$ such that $\sum_{\rho \in \mathcal{M}} w_0^\rho = 1$, the Bayesian mixture predictor is defined in terms of its marginal by $\xi(x_{1:n}) := \sum_{\rho \in \mathcal{M}} w_0^\rho \, \rho(x_{1:n})$. The predictive probability is thus given by the ratio of the marginals, $\xi(x_n \mid x_{<n}) = \xi(x_{1:n}) / \xi(x_{<n})$.

## 3 The Forget-me-not Process

The process maintains a pool of models $\mathcal{M}_t$ and associates with each postulated segment beginning at time $a$ a model $\nu_a$, defined in Equation 5 as a Bayesian mixture over the pool available at time $a$; Equation 6 specifies how the pool is updated, for $t > 1$, by remembering the best performing model on each completed segment. Finally, substituting Equation 5 in for the base model of PTW yields our Forget-me-not process

$$\text{FMN}_d(x_{1:n}) := \sum_{\mathcal{P} \in \mathcal{C}_d} 2^{-\Gamma_d(\mathcal{P})} \prod_{(a,b) \in \mathcal{P}_n} \nu_a(x_{a:b}). \qquad (7)$$

**Algorithm.** Algorithm 1 describes how to compute the marginal probability $\text{FMN}_d(x_{1:n})$. The $r_j$ variables store the segment start times for the unclosed segments at depth $j$; the $b_j$ variables implement a dynamic programming caching mechanism to speed up the PTW computation, as explained in Section 3.3 of [17]; the $w_j$ variables hold intermediate results needed to apply Lemma 1. The Most Significant Changed Bit routine $\text{MSCB}_d(t)$, invoked at line 4, is used to determine the range of segments ending at the current time $t$, and is defined for $t > 1$ as the number of bits to the left of the most significant location at which the $d$-bit binary representations of $t-1$ and $t-2$ differ, with $\text{MSCB}_d(1) := 0$ for all $d \in \mathbb{N}$. For example, in Figure 3, at $t = 5$, before processing $x_5$, we need to deal with the segments $(1,4)$, $(3,4)$ and $(4,4)$ finishing. The method UPDATEMODELPOOL applies Equation 6 to remember the best performing model in the mixture $\nu_{r_j}$ on the completed segment $(r_j, t-1)$. Lines 11 to 13 apply Lemma 1 bottom-up to compute the desired marginal probability $\text{FMN}_d(x_{1:n}) = w_0$.

    Algorithm 1 FORGET-ME-NOT, FMN_d(x_{1:n})
    Require: A depth parameter d ∈ N, and a base probabilistic model ρ
    Require: A data sequence x_{1:n} ∈ X^n satisfying n ≤ 2^d
     1: b_j ← 1, w_j ← 1, r_j ← 1, for 0 ≤ j ≤ d
     2: M ← {ρ}
     3: for t = 1 to n do
     4:   i ← MSCB_d(t)
     5:   b_i ← w_{i+1}
     6:   for j = i + 1 to d do
     7:     M ← UPDATEMODELPOOL(ν_{r_j}, x_{r_j:t−1})
     8:     w_j ← 1, b_j ← 1, r_j ← t
     9:   end for
    10:   w_d ← ν_{r_d}(x_{r_d:t})
    11:   for i = d − 1 to 0 do
    12:     w_i ← (1/2) ν_{r_i}(x_{r_i:t}) + (1/2) w_{i+1} b_i
    13:   end for
    14: end for
    15: return w_0
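To make the bookkeeping concrete, here is a minimal Python sketch of Algorithm 1, not the authors' implementation. It assumes a base model exposing `copy()` and a sequential `update(x)` returning the predictive probability of `x`, and uses a Krichevsky-Trofimov estimator over a binary alphabet as a stand-in for $\rho$. Each $\nu_{r_j}$ is maintained as a uniform mixture over copies of the model pool taken at the segment's start time, weights are kept in log space, and UPDATEMODELPOOL is reduced to the simplest reading of the prose above (append the best performing model of each completed non-empty segment, with no bound on the pool); the exact forms of Equations 5 and 6 are not reproduced here, and the names `KTModel`, `SegmentMixture` and `fmn_log2_marginal` are ours.

```python
import math


def _logaddexp(a, b):
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


class KTModel:
    """Krichevsky-Trofimov estimator over {0, 1}; a stand-in for the base model rho.
    Any model exposing copy() and update(x) -> P(x | data seen so far) would do."""

    def __init__(self):
        self.counts = [0.5, 0.5]

    def copy(self):
        fresh = KTModel()
        fresh.counts = list(self.counts)
        return fresh

    def update(self, x):
        p = self.counts[x] / (self.counts[0] + self.counts[1])
        self.counts[x] += 1.0
        return p


class SegmentMixture:
    """nu_a: a uniform Bayesian mixture over copies of the model pool taken at the
    segment's start time, maintained incrementally on the segment's data."""

    def __init__(self, pool):
        self.models = [m.copy() for m in pool]
        self.log_marg = [0.0] * len(self.models)  # log rho(x_{a:t}) for each model
        self.log_total = 0.0                      # log nu_a(x_{a:t})
        self.length = 0

    def update(self, x):
        for idx, m in enumerate(self.models):
            self.log_marg[idx] += math.log(m.update(x))
        acc = self.log_marg[0]
        for lm in self.log_marg[1:]:
            acc = _logaddexp(acc, lm)
        self.log_total = acc - math.log(len(self.models))
        self.length += 1

    def best_model(self):
        # the best performing model within the mixture on this segment
        idx = max(range(len(self.models)), key=lambda j: self.log_marg[j])
        return self.models[idx]


def fmn_log2_marginal(data, d, base_model):
    """Follows Algorithm 1 and returns log2 FMN_d(x_{1:n}); requires n <= 2**d.
    Weights are kept in log space; the pool is left unbounded (no replacement)."""
    n = len(data)
    assert n <= 2 ** d
    log_b = [0.0] * (d + 1)                              # b_j <- 1
    log_w = [0.0] * (d + 1)                              # w_j <- 1
    pool = [base_model]                                  # M <- {rho}
    segs = [SegmentMixture(pool) for _ in range(d + 1)]  # open segments, r_j = 1
    for t in range(1, n + 1):
        x = data[t - 1]
        # MSCB_d(t): most significant changed bit of the d-bit representations of t-1, t-2
        i = 0 if t == 1 else d - ((t - 1) ^ (t - 2)).bit_length()
        log_b[i] = log_w[i + 1] if i < d else 0.0
        for j in range(i + 1, d + 1):                    # segments (r_j, t-1) have closed
            if segs[j].length > 0:                       # ignore empty segments (t = 1)
                pool.append(segs[j].best_model())        # remember best model (cf. Eq. 6)
        for j in range(i + 1, d + 1):                    # restart those segments at r_j = t
            log_w[j], log_b[j] = 0.0, 0.0
            segs[j] = SegmentMixture(pool)
        for j in range(d + 1):                           # every open segment absorbs x_t
            segs[j].update(x)
        log_w[d] = segs[d].log_total
        for j in range(d - 1, -1, -1):                   # Lemma 1, applied bottom-up
            log_w[j] = _logaddexp(math.log(0.5) + segs[j].log_total,
                                  math.log(0.5) + log_w[j + 1] + log_b[j])
    return log_w[0] / math.log(2)
```

If the pool updates are disabled so that the pool always contains just the base model, the sketch reduces to PTW with a KT base model, which is a useful sanity check.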
**Space and Time Overhead.** Under the assumption that each base model conditional probability can be obtained in $O(1)$ time, the time complexity to process a sequence of length $n$ is $O(nk \log n)$, where $k$ is an upper bound on $|\mathcal{M}|$. The $n \log n$ factor is due to the number of iterations in the inner loops on Lines 6 to 9 and Lines 11 to 13 being upper bounded by $d + 1$. The $k$ factor is due to the cost of maintaining the $\nu_{r_j}$ terms for the segments which have not yet closed. An upper bound on $k$ can be obtained from inspection of Figure 3: if we set $n = 2^d$, the number of completed segments is given by $\sum_{i=0}^{d} 2^i = 2^{d+1} - 1 = 2n - 1 = O(n)$; thus the time complexity is $O(n^2 \log n)$. The space overhead is $O(k \log n)$, due to the $O(\log n)$ instances of Equation 5.

**Complexity Reducing Operations.** For many applications of interest, a running time of $O(n^2 \log n)$ is unacceptable. A workaround is to fix $k$ in advance and use a model replacement strategy that enforces $|\mathcal{M}| \leq k$ via a modified UPDATEMODELPOOL routine; this reduces the time complexity to $O(nk \log n)$. We found the following heuristic scheme to be effective in practice: when a segment $(a, b)$ closes, the best performing model $\rho^* \in \mathcal{M}_a$ for this segment is identified. Now, 1) letting $y_{\rho^*}$ denote a uniform sub-sample of the data used to train $\rho^*$, if $\log \rho^*[x_{a:b}](y_{\rho^*}) - \log \rho^*(y_{\rho^*}) > \alpha$, then replace $\rho^*$ with $\rho^*[x_{a:b}]$ in $\mathcal{M}$; else 2) if a uniform Bayes mixture $\xi$ over $\mathcal{M}$ assigns sufficiently higher probability to a uniform sub-sample $s$ of $x_{a:b}$ than $\rho^*$ does, that is, $\log \xi(s) - \log \rho^*(s) > \beta$, then leave $\mathcal{M}$ unchanged; else 3) add $\rho^*[x_{a:b}]$ to $\mathcal{M}$, and if $|\mathcal{M}| > k$, remove the oldest model in $\mathcal{M}$. This requires choosing hyperparameters $\alpha, \beta \in \mathbb{R}$ and appropriate constant sub-sample sizes. Step 1 avoids adding multiple models for the same task; Step 2 avoids adding a redundant model to the model pool. Note that the per-model and per-segment sub-samples can be efficiently maintained online using reservoir sampling [19] (see the sketch below). As a further complexity reducing operation, one can skip calls to UPDATEMODELPOOL unless $(b - a + 1) \geq 2^c$ for some $c < d$.

**Strongly Online Prediction.** A strongly online FMN process, where one does not need to fix a $d$ in advance such that $n \leq 2^d$, can be obtained by defining $\text{FMN}(x_{1:n}) := \prod_{i=1}^{n} \text{FMN}_{\lceil \log i \rceil}(x_i \mid x_{<i})$.
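The following sketch, under stated assumptions, illustrates the two pieces of bookkeeping described above: a reservoir sampler (Algorithm R) for maintaining the per-model and per-segment uniform sub-samples online, and a bounded-pool variant of UPDATEMODELPOOL implementing the three-step heuristic. It assumes each model carries a `train_sample` reservoir of its own training data and that a caller-supplied `log_prob(model, symbols)` returns the log probability a frozen model assigns to a list of symbols; `Reservoir` and `update_model_pool` are our names, not the paper's, and treating `pool.pop(0)` as "remove the oldest model" assumes the pool list is kept in insertion order.

```python
import math
import random


class Reservoir:
    """Uniform sub-sample of a stream (reservoir sampling, Algorithm R [19]).
    `size` plays the role of the constant sub-sample size mentioned above."""

    def __init__(self, size, rng=None):
        self.size = size
        self.sample = []
        self.seen = 0
        self.rng = rng or random.Random()

    def add(self, x):
        self.seen += 1
        if len(self.sample) < self.size:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.seen)  # keep x with probability size/seen
            if j < self.size:
                self.sample[j] = x


def _logsumexp(values):
    hi = max(values)
    return hi + math.log(sum(math.exp(v - hi) for v in values))


def update_model_pool(pool, best, best_updated, seg_sample, alpha, beta, k, log_prob):
    """Bounded-pool heuristic for a closed segment (a, b).
    `best` is rho*, `best_updated` is rho*[x_{a:b}], `seg_sample` is a Reservoir
    over x_{a:b}, and log_prob(model, symbols) is an assumed callback."""
    # Step 1: same-task test -- compare rho* trained on the segment against rho*
    # itself on rho*'s own training sub-sample; if the gain exceeds alpha,
    # update rho* in place rather than adding a second model for the same task.
    y = best.train_sample.sample
    if log_prob(best_updated, y) - log_prob(best, y) > alpha:
        pool[pool.index(best)] = best_updated
        return pool
    # Step 2: redundancy test -- if a uniform Bayes mixture over the pool already
    # assigns the segment sub-sample sufficiently higher probability than rho*,
    # leave the pool unchanged.
    s = seg_sample.sample
    log_mix = _logsumexp([log_prob(m, s) for m in pool]) - math.log(len(pool))
    if log_mix - log_prob(best, s) > beta:
        return pool
    # Step 3: otherwise remember rho*[x_{a:b}], evicting the oldest model if needed.
    pool.append(best_updated)
    if len(pool) > k:
        pool.pop(0)
    return pool
```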