# The Forget-me-not Process

Kieran Milan∗, Joel Veness∗, James Kirkpatrick, Demis Hassabis
Google DeepMind
{kmilan,aixi,kirkpatrick,demishassabis}@google.com

Anna Koop, Michael Bowling
University of Alberta
{anna,bowling}@cs.ualberta.ca

∗ indicates joint first authorship.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We introduce the Forget-me-not Process, an efficient, non-parametric meta-algorithm for online probabilistic sequence prediction for piecewise stationary, repeating sources. Our method works by taking a Bayesian approach to partitioning a stream of data into postulated task-specific segments, while simultaneously building a model for each task. We provide regret guarantees with respect to piecewise stationary data sources under the logarithmic loss, and validate the method empirically across a range of sequence prediction and task identification problems.

## 1 Introduction

Modeling non-stationary temporal data sources is a fundamental problem in signal processing, statistical data compression, quantitative finance and model-based reinforcement learning. One widely-adopted and successful approach has been to design meta-algorithms that automatically generalize existing stationary learning algorithms to various non-stationary settings. In this paper we introduce the Forget-me-not Process, a probabilistic meta-algorithm that provides the ability to model the class of memory-bounded, piecewise-repeating sources given an arbitrary, probabilistic, memory-bounded stationary model.

The most well studied class of probabilistic meta-algorithms are those for piecewise stationary sources, which model data sequences with abruptly changing statistics. Almost all meta-algorithms for abruptly changing sources work by performing Bayesian model averaging over a class of hypothesized temporal partitions. To the best of our knowledge, the earliest demonstration of this fundamental technique was [21], for the purpose of data compression; closely related techniques have gained popularity within the machine learning community for change point detection [1] and have been proposed by neuroscientists as a mechanism by which humans deal with open-ended environments composed of multiple distinct tasks [4, 6]. One of the reasons for the popularity of this approach is that the temporal structure can be exploited to make exact Bayesian inference tractable via dynamic programming; in particular, inference over all possible temporal partitions of $n$ data points results in an algorithm of $O(n^2)$ time complexity and $O(n)$ space complexity [21, 1]. Many variants have been proposed in the literature [20, 11, 10, 17], which trade off predictive accuracy for improved time and space complexity; in particular, the Partition Tree Weighting meta-algorithm [17] has $O(n \log n)$ time and $O(\log n)$ space complexity, and has been shown empirically to exhibit superior performance versus other low-complexity alternatives on piecewise stationary sources.

A key limitation of these aforementioned techniques is that they can perform poorly when there exist multiple segments of data that are similarly distributed. For example, consider data generated according to the schedule depicted in Figure 1. For all these methods, once a change-point occurs, the base (stationary) model is invoked from scratch, even if the task repeats, which is clearly undesirable in many situations of interest.

Figure 1: An example task segmentation.
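The dynamic-programming construction referenced above is easy to state concretely. The following is a minimal illustration, not the Forget-me-not Process itself: a Bayesian mixture over postulated segment start times in the spirit of [21, 1], using a Krichevsky-Trofimov (Beta(1/2, 1/2)) estimator over a binary alphabet as the base model and an assumed constant per-step change probability `hazard`; the function names (`kt_predict`, `changepoint_mixture`) are illustrative only. Processing $n$ symbols takes $O(n^2)$ total time and $O(n)$ space, matching the complexities quoted above.

```python
def kt_predict(ones, zeros, symbol):
    """KT estimator: predictive probability of `symbol` under a Beta(1/2, 1/2) prior."""
    count = ones if symbol == 1 else zeros
    return (count + 0.5) / (ones + zeros + 1.0)


def changepoint_mixture(data, hazard=0.01):
    """Per-symbol predictive probabilities under a Bayesian mixture over all
    postulated segment start times, updated by dynamic programming."""
    preds = []
    weights = {1: 1.0}    # start time -> joint weight of x_{<t} and "segment began then"
    counts = {1: (0, 0)}  # start time -> (ones, zeros) observed inside that segment
    for t, x in enumerate(data, start=1):
        # predictive probability: weighted average over postulated start times
        total = sum(weights.values())
        preds.append(sum(w * kt_predict(*counts[s], x)
                         for s, w in weights.items()) / total)
        # each hypothesis either continues, or a new segment begins at time t + 1
        born = 0.0
        for s in list(weights):
            joint = weights[s] * kt_predict(*counts[s], x)
            weights[s] = joint * (1.0 - hazard)
            born += joint * hazard
            ones, zeros = counts[s]
            counts[s] = (ones + (x == 1), zeros + (x == 0))
        weights[t + 1] = born
        counts[t + 1] = (0, 0)
    return preds
```

Note how each change-point hypothesis restarts the base model from scratch; this is exactly the behaviour that becomes wasteful when tasks repeat.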
Our main contribution in this paper is to introduce the Forget-me-not Process, which has the ability to avoid having to relearn repeated tasks, while still maintaining essentially the same theoretical performance guarantees as Partition Tree Weighting on piecewise stationary sources.

## 2 Preliminaries

We now introduce some notation and necessary background material.

**Sequential Probabilistic Data Generators.** We begin with some terminology for sequential, probabilistic data generating sources. An alphabet is a finite, non-empty set of symbols, which we will denote by $\mathcal{X}$. A string $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$, where $j \leq n$, is denoted by $x_{\leq j}$ or $x_{<j+1}$; for $m > n$, we define $x_{1:m} := x_{1:n}$ and $x_{m:n} := \epsilon$, where $\epsilon$ denotes the empty string. The concatenation of two strings $s, r \in \mathcal{X}^*$ is denoted by $sr$. Unless otherwise specified, base 2 is assumed for all logarithms.

A sequential probabilistic data generating source $\rho$ is defined by a sequence of probability mass functions $\rho_n : \mathcal{X}^n \to [0, 1]$, for all $n \in \mathbb{N}$, satisfying the constraint that $\rho_n(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho_{n+1}(x_{1:n} y)$ for all $x_{1:n} \in \mathcal{X}^n$, with base case $\rho_0(\epsilon) = 1$. From here onwards, whenever the meaning is clear from the argument to $\rho$, the subscripts on $\rho$ will be dropped. Under this definition, the conditional probability of a symbol $x_n$ given previous data $x_{<n}$ is $\rho(x_n \mid x_{<n}) := \rho(x_{1:n}) / \rho(x_{<n})$ whenever $\rho(x_{<n}) > 0$, with the familiar chain rule $\rho(x_{i:j} \mid x_{<i}) = \prod_{k=i}^{j} \rho(x_k \mid x_{<k})$ applying as usual.

**Bayesian Mixture Predictors.** Given a finite set $\mathcal{M}$ of candidate sources and prior weights $w_0^\rho > 0$ for each $\rho \in \mathcal{M}$ such that $\sum_{\rho \in \mathcal{M}} w_0^\rho = 1$, the Bayesian mixture predictor is defined in terms of its marginal by $\xi(x_{1:n}) := \sum_{\rho \in \mathcal{M}} w_0^\rho \, \rho(x_{1:n})$. The predictive probability is thus given by the ratio of the marginals, $\xi(x_n \mid x_{<n}) = \xi(x_{1:n}) / \xi(x_{<n})$.

## 3 The Forget-me-not Process

The process maintains a pool of models $\mathcal{M}_t$ and associates with each postulated segment beginning at time $a$ a model $\nu_a$, defined in Equation 5 as a Bayesian mixture over the pool available at time $a$; Equation 6 specifies how the pool is updated, for $t > 1$, by remembering the best performing model on each completed segment. Finally, substituting Equation 5 in for the base model of PTW yields our Forget-me-not process

$$\text{FMN}_d(x_{1:n}) := \sum_{\mathcal{P} \in \mathcal{C}_d} 2^{-\Gamma_d(\mathcal{P})} \prod_{(a,b) \in \mathcal{P}_n} \nu_a(x_{a:b}). \qquad (7)$$

**Algorithm.** Algorithm 1 describes how to compute the marginal probability $\text{FMN}_d(x_{1:n})$. The $r_j$ variables store the segment start times for the unclosed segments at depth $j$; the $b_j$ variables implement a dynamic programming caching mechanism to speed up the PTW computation, as explained in Section 3.3 of [17]; the $w_j$ variables hold intermediate results needed to apply Lemma 1. The Most Significant Changed Bit routine $\text{MSCB}_d(t)$, invoked at line 4, is used to determine the range of segments ending at the current time $t$, and is defined for $t > 1$ as the number of bits to the left of the most significant location at which the $d$-bit binary representations of $t-1$ and $t-2$ differ, with $\text{MSCB}_d(1) := 0$ for all $d \in \mathbb{N}$. For example, in Figure 3, at $t = 5$, before processing $x_5$, we need to deal with the segments $(1,4)$, $(3,4)$ and $(4,4)$ finishing. The method UPDATEMODELPOOL applies Equation 6 to remember the best performing model in the mixture $\nu_{r_j}$ on the completed segment $(r_j, t-1)$. Lines 11 to 13 apply Lemma 1 bottom-up to compute the desired marginal probability $\text{FMN}_d(x_{1:n}) = w_0$.

    Algorithm 1 FORGET-ME-NOT, FMN_d(x_{1:n})
    Require: A depth parameter d ∈ N, and a base probabilistic model ρ
    Require: A data sequence x_{1:n} ∈ X^n satisfying n ≤ 2^d
     1: b_j ← 1, w_j ← 1, r_j ← 1, for 0 ≤ j ≤ d
     2: M ← {ρ}
     3: for t = 1 to n do
     4:   i ← MSCB_d(t)
     5:   b_i ← w_{i+1}
     6:   for j = i + 1 to d do
     7:     M ← UPDATEMODELPOOL(ν_{r_j}, x_{r_j:t−1})
     8:     w_j ← 1, b_j ← 1, r_j ← t
     9:   end for
    10:   w_d ← ν_{r_d}(x_{r_d:t})
    11:   for i = d − 1 to 0 do
    12:     w_i ← (1/2) ν_{r_i}(x_{r_i:t}) + (1/2) w_{i+1} b_i
    13:   end for
    14: end for
    15: return w_0
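To make the bookkeeping concrete, here is a minimal Python sketch of Algorithm 1, not the authors' implementation. It assumes a base model exposing `copy()` and a sequential `update(x)` returning the predictive probability of `x`, and uses a Krichevsky-Trofimov estimator over a binary alphabet as a stand-in for $\rho$. Each $\nu_{r_j}$ is maintained as a uniform mixture over copies of the model pool taken at the segment's start time, weights are kept in log space, and UPDATEMODELPOOL is reduced to the simplest reading of the prose above (append the best performing model of each completed non-empty segment, with no bound on the pool); the exact forms of Equations 5 and 6 are not reproduced here, and the names `KTModel`, `SegmentMixture` and `fmn_log2_marginal` are ours.

```python
import math


def _logaddexp(a, b):
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


class KTModel:
    """Krichevsky-Trofimov estimator over {0, 1}; a stand-in for the base model rho.
    Any model exposing copy() and update(x) -> P(x | data seen so far) would do."""

    def __init__(self):
        self.counts = [0.5, 0.5]

    def copy(self):
        fresh = KTModel()
        fresh.counts = list(self.counts)
        return fresh

    def update(self, x):
        p = self.counts[x] / (self.counts[0] + self.counts[1])
        self.counts[x] += 1.0
        return p


class SegmentMixture:
    """nu_a: a uniform Bayesian mixture over copies of the model pool taken at the
    segment's start time, maintained incrementally on the segment's data."""

    def __init__(self, pool):
        self.models = [m.copy() for m in pool]
        self.log_marg = [0.0] * len(self.models)  # log rho(x_{a:t}) for each model
        self.log_total = 0.0                      # log nu_a(x_{a:t})
        self.length = 0

    def update(self, x):
        for idx, m in enumerate(self.models):
            self.log_marg[idx] += math.log(m.update(x))
        acc = self.log_marg[0]
        for lm in self.log_marg[1:]:
            acc = _logaddexp(acc, lm)
        self.log_total = acc - math.log(len(self.models))
        self.length += 1

    def best_model(self):
        # the best performing model within the mixture on this segment
        idx = max(range(len(self.models)), key=lambda j: self.log_marg[j])
        return self.models[idx]


def fmn_log2_marginal(data, d, base_model):
    """Follows Algorithm 1 and returns log2 FMN_d(x_{1:n}); requires n <= 2**d.
    Weights are kept in log space; the pool is left unbounded (no replacement)."""
    n = len(data)
    assert n <= 2 ** d
    log_b = [0.0] * (d + 1)                              # b_j <- 1
    log_w = [0.0] * (d + 1)                              # w_j <- 1
    pool = [base_model]                                  # M <- {rho}
    segs = [SegmentMixture(pool) for _ in range(d + 1)]  # open segments, r_j = 1
    for t in range(1, n + 1):
        x = data[t - 1]
        # MSCB_d(t): most significant changed bit of the d-bit representations of t-1, t-2
        i = 0 if t == 1 else d - ((t - 1) ^ (t - 2)).bit_length()
        log_b[i] = log_w[i + 1] if i < d else 0.0
        for j in range(i + 1, d + 1):                    # segments (r_j, t-1) have closed
            if segs[j].length > 0:                       # ignore empty segments (t = 1)
                pool.append(segs[j].best_model())        # remember best model (cf. Eq. 6)
        for j in range(i + 1, d + 1):                    # restart those segments at r_j = t
            log_w[j], log_b[j] = 0.0, 0.0
            segs[j] = SegmentMixture(pool)
        for j in range(d + 1):                           # every open segment absorbs x_t
            segs[j].update(x)
        log_w[d] = segs[d].log_total
        for j in range(d - 1, -1, -1):                   # Lemma 1, applied bottom-up
            log_w[j] = _logaddexp(math.log(0.5) + segs[j].log_total,
                                  math.log(0.5) + log_w[j + 1] + log_b[j])
    return log_w[0] / math.log(2)
```

If the pool updates are disabled so that the pool always contains just the base model, the sketch reduces to PTW with a KT base model, which is a useful sanity check.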
**Space and Time Overhead.** Under the assumption that each base model conditional probability can be obtained in $O(1)$ time, the time complexity to process a sequence of length $n$ is $O(nk \log n)$, where $k$ is an upper bound on $|\mathcal{M}|$. The $n \log n$ factor is due to the number of iterations in the inner loops on Lines 6 to 9 and Lines 11 to 13 being upper bounded by $d + 1$. The $k$ factor is due to the cost of maintaining the $\nu_{r_j}$ terms for the segments which have not yet closed. An upper bound on $k$ can be obtained from inspection of Figure 3: if we set $n = 2^d$, the number of completed segments is given by $\sum_{i=0}^{d} 2^i = 2^{d+1} - 1 = 2n - 1 = O(n)$; thus the time complexity is $O(n^2 \log n)$. The space overhead is $O(k \log n)$, due to the $O(\log n)$ instances of Equation 5.

**Complexity Reducing Operations.** For many applications of interest, a running time of $O(n^2 \log n)$ is unacceptable. A workaround is to fix $k$ in advance and use a model replacement strategy that enforces $|\mathcal{M}| \leq k$ via a modified UPDATEMODELPOOL routine; this reduces the time complexity to $O(nk \log n)$. We found the following heuristic scheme to be effective in practice: when a segment $(a, b)$ closes, the best performing model $\rho^* \in \mathcal{M}_a$ for this segment is identified. Now, 1) letting $y_{\rho^*}$ denote a uniform sub-sample of the data used to train $\rho^*$, if $\log \rho^*[x_{a:b}](y_{\rho^*}) - \log \rho^*(y_{\rho^*}) > \alpha$, then replace $\rho^*$ with $\rho^*[x_{a:b}]$ in $\mathcal{M}$; else 2) if a uniform Bayes mixture $\xi$ over $\mathcal{M}$ assigns sufficiently higher probability to a uniform sub-sample $s$ of $x_{a:b}$ than $\rho^*$ does, that is, $\log \xi(s) - \log \rho^*(s) > \beta$, then leave $\mathcal{M}$ unchanged; else 3) add $\rho^*[x_{a:b}]$ to $\mathcal{M}$, and if $|\mathcal{M}| > k$, remove the oldest model in $\mathcal{M}$. This requires choosing hyperparameters $\alpha, \beta \in \mathbb{R}$ and appropriate constant sub-sample sizes. Step 1 avoids adding multiple models for the same task; Step 2 avoids adding a redundant model to the model pool. Note that the per-model and per-segment sub-samples can be efficiently maintained online using reservoir sampling [19] (see the sketch below). As a further complexity reducing operation, one can skip calls to UPDATEMODELPOOL unless $(b - a + 1) \geq 2^c$ for some $c < d$.

**Strongly Online Prediction.** A strongly online FMN process, where one does not need to fix a $d$ in advance such that $n \leq 2^d$, can be obtained by defining $\text{FMN}(x_{1:n}) := \prod_{i=1}^{n} \text{FMN}_{\lceil \log i \rceil}(x_i \mid x_{<i})$.
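The following sketch, under stated assumptions, illustrates the two pieces of bookkeeping described above: a reservoir sampler (Algorithm R) for maintaining the per-model and per-segment uniform sub-samples online, and a bounded-pool variant of UPDATEMODELPOOL implementing the three-step heuristic. It assumes each model carries a `train_sample` reservoir of its own training data and that a caller-supplied `log_prob(model, symbols)` returns the log probability a frozen model assigns to a list of symbols; `Reservoir` and `update_model_pool` are our names, not the paper's, and treating `pool.pop(0)` as "remove the oldest model" assumes the pool list is kept in insertion order.

```python
import math
import random


class Reservoir:
    """Uniform sub-sample of a stream (reservoir sampling, Algorithm R [19]).
    `size` plays the role of the constant sub-sample size mentioned above."""

    def __init__(self, size, rng=None):
        self.size = size
        self.sample = []
        self.seen = 0
        self.rng = rng or random.Random()

    def add(self, x):
        self.seen += 1
        if len(self.sample) < self.size:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.seen)  # keep x with probability size/seen
            if j < self.size:
                self.sample[j] = x


def _logsumexp(values):
    hi = max(values)
    return hi + math.log(sum(math.exp(v - hi) for v in values))


def update_model_pool(pool, best, best_updated, seg_sample, alpha, beta, k, log_prob):
    """Bounded-pool heuristic for a closed segment (a, b).
    `best` is rho*, `best_updated` is rho*[x_{a:b}], `seg_sample` is a Reservoir
    over x_{a:b}, and log_prob(model, symbols) is an assumed callback."""
    # Step 1: same-task test -- compare rho* trained on the segment against rho*
    # itself on rho*'s own training sub-sample; if the gain exceeds alpha,
    # update rho* in place rather than adding a second model for the same task.
    y = best.train_sample.sample
    if log_prob(best_updated, y) - log_prob(best, y) > alpha:
        pool[pool.index(best)] = best_updated
        return pool
    # Step 2: redundancy test -- if a uniform Bayes mixture over the pool already
    # assigns the segment sub-sample sufficiently higher probability than rho*,
    # leave the pool unchanged.
    s = seg_sample.sample
    log_mix = _logsumexp([log_prob(m, s) for m in pool]) - math.log(len(pool))
    if log_mix - log_prob(best, s) > beta:
        return pool
    # Step 3: otherwise remember rho*[x_{a:b}], evicting the oldest model if needed.
    pool.append(best_updated)
    if len(pool) > k:
        pool.pop(0)
    return pool
```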