Learning Multi-Step Predictive State Representations

Lucas Langer, McGill University
Borja Balle, Lancaster University, United Kingdom
Doina Precup, McGill University

Abstract. Recent years have seen the development of efficient and provably correct spectral algorithms for learning models of partially observable environments arising in many applications. But despite the high hopes raised by this new class of algorithms, their practical impact is still below expectations. One reason for this is the difficulty in adapting spectral methods to exploit structural constraints about different target environments which can be known beforehand. A natural structure intrinsic to many dynamical systems is a multi-resolution behaviour where interesting phenomena occur at different time scales during the evolution of the system. In this paper we introduce the multi-step predictive state representation (M-PSR) and an associated learning algorithm that finds and leverages frequent patterns of observations at multiple scales in dynamical systems with discrete observations. We perform experiments on robot exploration tasks in a wide variety of environments and conclude that the use of M-PSRs improves over the classical PSR for varying amounts of data, environment sizes, and numbers of observation symbols.

1 Introduction

Learning models of partially observable dynamical systems is a problem that arises in many practical applications, including robotics, medical monitoring, market analysis, etc. Predictive state representations (PSRs) [Littman et al., 2001; Singh et al., 2004; Rosencrantz et al., 2004] provide a theoretically justified approach in which at any time-step the current state is modeled as a set of predictions about the future evolution of the system, conditioned on past outcomes.
The most popular of these models are linear PSRs, in which predictions for any future trajectories can be obtained as a linear combination of a finite set of core predictions. A variety of spectral algorithms for learning linear PSRs have been proposed in recent works, e.g. [Boots et al., 2011; Hamilton et al., 2013]. Spectral algorithms are appealing because of their strong theoretical properties, such as statistical consistency and learning efficiency [Hsu et al., 2009; Bailly et al., 2010]. Unfortunately, their practical uptake is still limited, in part due to the fact that these general-purpose algorithms are not designed to leverage the structural regularities frequently found in applications. This is a challenging question that needs to be tackled in order to facilitate more efficient spectral learning algorithms for specific applications. Recent work along these lines has focused on sparse state spaces [Hamilton et al., 2013] and special structures encountered in epigenomics [Zhang et al., 2015]. In this paper, we focus on a specific type of structure which arises frequently in dynamical systems: behaviour which takes place at different time scales. This type of structure is often leveraged to provide efficient algorithms in signal processing. Similar gains can be obtained by using multiple time scales in planning and reinforcement learning, e.g. [Sutton et al., 1999]. Our approach is based on a new class of PSRs which we call the multi-step PSR (M-PSR). Like the classical linear PSR, M-PSRs represent the dynamics of a partially observable system with discrete observations. However, unlike PSRs, M-PSRs are able to capture structure occurring at multiple time scales by representing transitions between states over multi-step observations. Our main result is a spectral learning algorithm for M-PSRs combined with a data-driven strategy for multi-step observation selection.
Through an extensive empirical evaluation, we show that in environments where characteristic multi-step observations occur frequently, M-PSRs improve the quality of learning with respect to classical PSRs. This improvement is uniform over a range of environment sizes, numbers of observation symbols, and amounts of training data.

Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

2 The Multi-Step PSR

A linear predictive state representation (PSR) for an autonomous dynamical system with discrete observations is a tuple $A = \langle \Sigma, \alpha, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma} \rangle$ where: $\Sigma$ is a finite set of possible observations, $\alpha, \alpha_\infty \in \mathbb{R}^n$ are vectors of initial and final weights, and $A_\sigma \in \mathbb{R}^{n \times n}$ are the transition operators associated with each possible observation. The dimension $n$ is the number of states of $A$. Formally, a PSR is a weighted finite automaton (WFA) [Droste and Vogler, 2009] computing a function given by the probability distribution of sequences of observations in a partially observable dynamical system with a finite number of states. The function $f_A : \Sigma^* \to \mathbb{R}$ computed by $A$ on input $x = x_1 \cdots x_t$ is given by:

$$f_A(x_1 \cdots x_t) = \alpha^\top A_{x_1} \cdots A_{x_t} \alpha_\infty.$$

The value of $f_A(x)$ is interpreted as the probability that the system produces the sequence of observations $x = x_1 \cdots x_t$ starting from the initial state specified by $\alpha$. Depending on the semantics of the model, this can be the probability that the system generates $x$ and stops, or the probability that the system generates $x$ and continues. Given a partial history $u$ of the evolution of the system, the state of a PSR can be updated from $\alpha^\top$ to $\alpha_u^\top = \alpha^\top A_u$. This update allows for conditional queries about future observations. For example, the probability of observing a string $v$ given that we have already observed $u$ is $f_{A,u}(v) = \alpha_u^\top A_v \alpha_\infty / Z_A(u)$, where $Z_A(u)$ is a normalizing constant.
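To make these definitions concrete, the following is a minimal sketch in Python (not part of the paper): a two-state PSR with arbitrary illustrative weights, evaluating $f_A$ and a conditional query. The normalizer here is the value of $u$ itself under the "generate and stop" semantics, which is one of the choices mentioned above.

```python
import numpy as np

# Illustrative 2-state PSR over observations {"a", "b"}; numbers are arbitrary.
alpha = np.array([1.0, 0.0])          # initial weights alpha
alpha_inf = np.array([0.5, 0.5])      # final weights alpha_inf
A = {
    "a": np.array([[0.3, 0.2], [0.0, 0.4]]),
    "b": np.array([[0.1, 0.1], [0.3, 0.2]]),
}

def f_A(x):
    """Value of the PSR on a sequence x: alpha^T A_{x_1} ... A_{x_t} alpha_inf."""
    state = alpha
    for sym in x:
        state = state @ A[sym]
    return state @ alpha_inf

def conditional(u, v):
    """Probability of observing v given that u was already observed."""
    state_u = alpha
    for sym in u:
        state_u = state_u @ A[sym]    # updated state alpha_u^T = alpha^T A_u
    z = state_u @ alpha_inf           # normalizer Z_A(u) ("stop" semantics)
    num = state_u
    for sym in v:
        num = num @ A[sym]
    return (num @ alpha_inf) / z
```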
To define a multi-step PSR, we augment a PSR with two extra objects: a set of multi-step observations $\Sigma' \subseteq \Sigma^+$ containing non-empty strings formed by basic observations, and a coding function $\kappa : \Sigma^* \to \Sigma'^*$ that takes a string of basic observations and produces an equivalent string composed of multi-step observations. The choice of $\Sigma'$ and $\kappa$ can be customized for each application, and will typically capture frequent patterns of observations arising in a particular environment. For the sake of simplicity and to avoid degenerate situations, we assume these objects satisfy the following requirements:

1. The set $\Sigma'$ must contain all symbols in $\Sigma$; i.e. $\Sigma \subseteq \Sigma'$.
2. The function $\kappa$ satisfies $\delta(\kappa(x)) = x$ for all $x \in \Sigma^*$, where $\delta : \Sigma'^* \to \Sigma^*$ is the decoding morphism between free monoids given by $\delta(z) = z \in \Sigma^*$ for all $z \in \Sigma'$.

Note this implies that $\kappa(\epsilon) = \epsilon$, that $\kappa(\sigma) = \sigma$ for all $\sigma \in \Sigma$, and that $\kappa$ is injective. Using this notation, we define a multi-step PSR (M-PSR) as a tuple $A' = \langle \Sigma, \Sigma', \kappa, \alpha, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma'} \rangle$ containing a PSR with observations in $\Sigma'$, together with the basic observations $\Sigma$ and the corresponding coding function $\kappa$. In addition to the standard PSR function $f_{A'} : \Sigma'^* \to \mathbb{R}$, an M-PSR also defines the function $f'_{A'} : \Sigma^* \to \mathbb{R}$ given by $f'_{A'}(x) = f_{A'}(\kappa(x))$. In many cases we will abuse notation and write $f_{A'}$ for $f'_{A'}$ when there is no risk of confusion.

2.1 Examples of M-PSRs

We now describe several examples of M-PSRs, putting special emphasis on models that will be used in our experiments.

Base M-PSR. A PSR with a single observation $\Sigma = \{\sigma\}$ can be used to measure time, i.e. the number of discrete time-steps until a certain event happens [Bacon et al., 2015]. In this case, a natural approach to build an M-PSR for timing models is to build a set of multi-step observations containing sequences of $\sigma$'s whose lengths are powers of a fixed base. That is, given an integer $b \geq 2$, the set of multi-step observations is $\Sigma' = \{\sigma, \sigma^b, \sigma^{b^2}, \ldots, \sigma^{b^K}\}$ for some positive $K$.
A natural choice of coding map in this case is the one that represents any length $t \geq 0$ as a number in base $b$, with the difference that the largest power of $b$ that is allowed is $b^K$. This corresponds to writing (in a unique way) $t = t_0 b^0 + t_1 b^1 + t_2 b^2 + \cdots + t_K b^K$, where $0 \leq t_k \leq b - 1$ for $0 \leq k < K$, and $t_K \geq 0$. With this decomposition, the coding map is given by $\kappa(\sigma^t) = (\sigma^{b^K})^{t_K} (\sigma^{b^{K-1}})^{t_{K-1}} \cdots (\sigma^b)^{t_1} (\sigma)^{t_0}$. Note that we choose to write powers of longer multi-step observations first, followed by powers of shorter multi-step observations. For further reference, we will call this model the Base M-PSR.

Base M-PSRs can also be extended to the case with multiple basic observations, $|\Sigma| > 1$. For example, if there are two observations $\Sigma = \{\sigma_1, \sigma_2\}$, we can take $\Sigma' = \{\sigma_1, \sigma_2, \sigma_2^b, \ldots, \sigma_2^{b^K}\}$. The encoding map first splits the string into sequences of consecutive repeated symbols and then uses the same encoding as before. For example, $\kappa(\sigma_2^5) = (\sigma_2^2)(\sigma_2^2)(\sigma_2)$ when using $b = 2$ and $K = 1$.

Tree M-PSR. A more flexible generalization of the Base M-PSR for the multiple observation case is what we call the Tree M-PSR. In a Tree M-PSR, we set $\Sigma' = \{x \in \Sigma^+ : |x| \leq L\}$, where $L$ is a parameter. The coding map first splits a string $x$ as $x = u_1 u_2 \cdots u_m u_f$, where $|u_i| = L$ for $1 \leq i \leq m$ and $|u_f| = |x| - mL < L$, and then sets $\kappa(x) = (u_1)(u_2) \cdots (u_m)(u_f)$. Note that this encoding promotes the use of multi-step symbols corresponding to longer strings more often than those corresponding to shorter strings. Note also that $|\Sigma'| = O(|\Sigma|^L)$, so in practice $L$ must remain small if we want the M-PSR to be representable using a small number of parameters.

Data-Driven M-PSR. The constructions above have some parameters that can be tuned depending on the particular application, but in general they are not directly dependent on the sequences of observations more frequently produced by the target environment.
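The two coding maps just described can be sketched as follows (an illustration, not the paper's code). For the Base M-PSR we return the list of block lengths standing in for the symbols $\sigma^{b^k}$, longest first; for the Tree M-PSR we split a string into blocks of length $L$ plus a shorter tail.

```python
def base_encode(t, b=2, K=3):
    """Encode sigma^t as blocks sigma^{b^K}, ..., sigma^b, sigma, longest first.
    Returns the list of block lengths; their sum is t."""
    blocks = []
    for k in range(K, -1, -1):
        size = b ** k
        if k == K:
            count = t // size                      # t_K may exceed b - 1
        else:
            count = (t % (b ** (k + 1))) // size   # digit t_k in base b
        blocks.extend([size] * count)
    return blocks

def tree_encode(x, L=3):
    """Split x into consecutive blocks of length L, plus a shorter tail u_f."""
    return [x[i:i + L] for i in range(0, len(x), L)]
```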
Intuition suggests that the performance of M-PSRs will depend heavily on how well $\Sigma'$ reflects frequent observation patterns; this is corroborated by the experiments in Section 4. Thus, we develop an algorithm for adaptively choosing $\Sigma'$ from data. For this to work, we need to provide a generic coding function $\kappa$ that can be applied to any M-PSR. This encoding needs to deliver good predictive performance and computationally cheap queries. We call the output of this learning process a Data-Driven M-PSR. The details of this approach are described in the next section.

3 Learning Algorithm for M-PSR

This section describes a learning algorithm for M-PSRs extending standard spectral techniques [Boots et al., 2011] and combining them with an algorithm for building an extended set of symbols $\Sigma'$ containing frequent patterns of observations.

3.1 Spectral Learning Algorithm

We start by extending the spectral learning algorithm to M-PSRs under the assumption that $\Sigma$ and $\Sigma'$ are known. In this case, the learning procedure only needs to recover the operators $A_\sigma$ for all $\sigma \in \Sigma'$, and the initial and final weights $\alpha, \alpha_\infty$. We first recall some basic notation about Hankel matrices [Carlyle and Paz, 1971; Fliess, 1974], and then proceed to describe the learning algorithm. To simplify the description of the learning algorithm we assume that the function $f : \Sigma^* \to \mathbb{R}$ associated with the target M-PSR can be evaluated for every string. In practice these values are unknown, but they can be estimated effectively from data because they correspond to probabilities of observations. Given $f : \Sigma^* \to \mathbb{R}$, we will use its Hankel matrix representation $H_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$, which is an infinite matrix with rows and columns indexed by strings in $\Sigma^*$ and whose entries satisfy $H_f(u, v) = f(uv)$. To work efficiently with this matrix we only consider finite sub-blocks indexed by a finite set of prefixes $P \subseteq \Sigma^*$ and suffixes $S \subseteq \Sigma^*$.
Both $P$ and $S$ are input parameters given to the algorithm; see [Balle et al., 2012] for a discussion of general strategies to choose these in practice. The pair $B = (P, S)$ is called a basis, and it determines a sub-block $H_B \in \mathbb{R}^{P \times S}$ of $H_f$ with entries given by $H_B(u, v) = H_f(u, v)$ for all $u \in P$ and $v \in S$. For a fixed basis, we also consider the vectors $h_S \in \mathbb{R}^S$ with entries given by $h_S(v) = H_f(\epsilon, v)$ for every $v \in S$, and $h_P \in \mathbb{R}^P$ with $h_P(u) = H_f(u, \epsilon)$. Note that the definitions above depend only on $f$. In order to recover the operators $A_\sigma$ for all $\sigma \in \Sigma'$ we need to consider multi-step shifts of the finite Hankel matrix $H_B$. In particular, given $\sigma \in \Sigma'$ we define the sub-block $H_\sigma \in \mathbb{R}^{P \times S}$ whose entries are given by $H_\sigma(u, v) = f(u \sigma v)$. This can be interpreted either using the lift $f(\kappa(u) \sigma \kappa(v))$ or the decoding $f(u \delta(\sigma) v)$, but the value of $H_\sigma$ is the same in both cases.

Using the notation above we can give a simple description of the learning algorithm. Suppose the basis $B$ and the desired number of states $n$ are given. The algorithm starts by collecting a set of sampled trajectories and uses them to estimate the matrices $H_B, H_\sigma \in \mathbb{R}^{P \times S}$ and the vectors $h_P \in \mathbb{R}^P$, $h_S \in \mathbb{R}^S$. Then, it takes the truncated SVD $U_n D_n V_n^\top$ of $H_B$, where $D_n \in \mathbb{R}^{n \times n}$ contains the first $n$ singular values of $H_B$, and $U_n \in \mathbb{R}^{P \times n}$ and $V_n \in \mathbb{R}^{S \times n}$ contain the corresponding left and right singular vectors respectively. Finally, it computes the transition operators of the M-PSR as $A_\sigma = D_n^{-1} U_n^\top H_\sigma V_n$ for all $\sigma \in \Sigma'$, and the initial and final weights as $\alpha^\top = h_S^\top V_n$ and $\alpha_\infty = D_n^{-1} U_n^\top h_P$. This algorithm yields an M-PSR with $n$ states. It was proved in [Boots et al., 2011] that this algorithm is statistically consistent for PSRs (under a mild condition on $B$). A straightforward argument shows that the same guarantees extend to M-PSRs.

3.2 A Generic Coding Function

Given $\Sigma$ and $\Sigma'$, a generic coding function $\kappa : \Sigma^* \to \Sigma'^*$ can be obtained by minimizing the coding length $|\kappa(x)|$ of every string $x \in \Sigma^*$.
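The recovery step above can be sketched with numpy as follows; this is an illustration, not the paper's code, and assumes the Hankel blocks have already been estimated. The function name `spectral_mpsr` is ours.

```python
import numpy as np

def spectral_mpsr(H_B, H_sigmas, h_P, h_S, n):
    """Recover M-PSR parameters from (estimated) Hankel blocks.

    H_B: |P| x |S| Hankel sub-block.  H_sigmas: dict mapping each multi-step
    symbol to its shifted block H_sigma.  h_P, h_S: prefix/suffix vectors.
    n: desired number of states."""
    U, d, Vt = np.linalg.svd(H_B)
    Un = U[:, :n]                     # first n left singular vectors
    Dn_inv = np.diag(1.0 / d[:n])     # inverse of D_n
    Vn = Vt[:n, :].T                  # first n right singular vectors
    A = {s: Dn_inv @ Un.T @ Hs @ Vn for s, Hs in H_sigmas.items()}
    alpha = h_S @ Vn                  # alpha^T = h_S^T V_n
    alpha_inf = Dn_inv @ Un.T @ h_P   # alpha_inf = D_n^{-1} U_n^T h_P
    return alpha, alpha_inf, A
```

On an exact rank-n Hankel matrix this recovers a model computing the same function f; with empirical estimates it inherits the consistency guarantees discussed in the text.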
More formally, consider the coding defined recursively by $\kappa(\epsilon) = \epsilon$ and

$$\kappa(x) = \kappa(y)\,z, \quad \text{where } z = \operatorname*{argmin}_{z \in \Sigma',\; x = yz,\; |y| < |x|} \; |\kappa(y)| + 1.$$

Note that for the single observation case, this is equivalent to the optimal coin change problem, which is a textbook example of dynamic programming. This coding has the advantage of minimizing the number of operators $A_\sigma$ that need to be multiplied to compute the value of the M-PSR on a string $x$. At the same time, operators expressing long transition sequences can capture the contributions of all intermediate states even if the chosen model size is too small to represent every state traversed by a single-observation model. We will see in Section 4 that this is one of the reasons why M-PSRs show better performance than PSRs for smaller models.

Algorithm 1 gives pseudocode for computing this function. It inductively computes the optimal string encoding for the prefix $x_1 \cdots x_i$ for all $1 \leq i \leq |x|$. The optimum at index $i$ is obtained by minimizing over all symbols $s \in \Sigma'$ that terminate at index $i$ of $x$. We use the following notation:

bestEnc: a map from indices $i$ of the query string $x$ to the optimal encoding of $x[:i]$.
minEnc: a map from indices $i$ of the query string $x$ to $|\text{bestEnc}[i]|$.
opsEnd: a map from indices $i$ of $x$ to the set of symbols in $\Sigma'$ ending at $i$: $\{s \in \Sigma' : x[i - |s| : i] = s\}$.

Algorithm 1 Encoding Algorithm
INPUT: x
OUTPUT: κ(x)
1: procedure DPENCODE
2:   bestEnc[0] ← ε
3:   minEnc[0] ← 0
4:   for i in [1, |x|] do
5:     opsEnd[i] ← {s ∈ Σ′ : x[i − |s| : i] = s}
6:   end for
7:   for i in [1, |x|] do
8:     bestOp ← null
9:     m ← ∞
10:    for s ∈ opsEnd[i] do
11:      t ← minEnc[i − |s|] + 1
12:      if t < m then
13:        m ← t
14:        bestOp ← s
15:      end if
16:    end for
17:    minEnc[i] ← m
18:    bestEnc[i] ← bestEnc[i − |bestOp|] · bestOp
19:  end for
20:  return bestEnc[|x|]
21: end procedure

3.3 State Update

When using a classical PSR in an online environment, one typically updates the state vector every time a new observation becomes available.
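Algorithm 1 translates directly into Python. The sketch below (our illustration) returns the encoding as a list of multi-step symbols represented as strings; it assumes $\Sigma \subseteq \Sigma'$, so an encoding always exists.

```python
def dp_encode(x, sigma_prime):
    """Minimal-length encoding of x over the symbols in sigma_prime, via the
    coin-change-style dynamic program of Algorithm 1.  Returns a list of
    symbols whose concatenation equals x."""
    INF = float("inf")
    min_enc = [0] + [INF] * len(x)   # min_enc[i]: optimal encoding length of x[:i]
    best = [None] * (len(x) + 1)     # best[i]: last symbol of an optimal encoding
    for i in range(1, len(x) + 1):
        for s in sigma_prime:        # symbols ending at index i
            if len(s) <= i and x[i - len(s):i] == s:
                t = min_enc[i - len(s)] + 1
                if t < min_enc[i]:
                    min_enc[i], best[i] = t, s
    # Walk back through best[] to recover the encoding itself.
    enc, i = [], len(x)
    while i > 0:
        enc.append(best[i])
        i -= len(best[i])
    return enc[::-1]
```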
This eliminates the need to repeatedly transform the initial state over the whole history of observations $h$ when making predictions. In the case of M-PSRs, the dynamic programming algorithm for minimizing the length of string encodings provides a natural way to perform efficient state updates. To do so, we cache past state vectors along with their encoding lengths. When a new observation $\sigma$ is read, we determine the encoding for the extended observation string $h\sigma$ with the same recurrence relation as in the previous section. Because this minimization is over $\{s \in \Sigma' : \exists p \in \Sigma^*,\; ps = h\sigma\}$, one needs to cache at most $\max_{s \in \Sigma'} |s|$ state vectors and encoding lengths.

3.4 Greedy Selection of Multi-Step Observations

Algorithm 2 Base Selection Algorithm
INPUT: Train, SubM, numOps
OUTPUT: Σ′
1: procedure BASESELECTION
2:   Σ′ ← Σ
3:   bestEnc ← null
4:   for all obs ∈ Train do
5:     bestEnc[obs] ← |obs|
6:   end for
7:   i ← 0
8:   while i < numOps do
9:     bestOp ← null
10:    m ← 0
11:    for all s ∈ SubM do
12:      c ← 0
13:      for all obs ∈ Train do
14:        c ← c + |DPENCODE(obs, Σ′ ∪ {s})| − bestEnc[obs]
15:      end for
16:      if c < m then
17:        bestOp ← s
18:        m ← c
19:      end if
20:    end for
21:    Σ′ ← Σ′ ∪ {bestOp}
22:    for all obs ∈ Train do
23:      bestEnc[obs] ← |DPENCODE(obs, Σ′)|
24:    end for
25:    i ← i + 1
26:  end while
27:  return Σ′
28: end procedure

Selecting the multi-step transition sequences that build $\Sigma'$ can be achieved with an adaptive greedy algorithm, depicted in Algorithm 2. A $\Sigma'$ that correctly captures the frequent observations produced by a target environment should promote short encodings when coupled with the coding function described above. In practice, we want $\Sigma'$ to contain substrings appearing in the training data which are long, frequent, and diverse. From an intuitive standpoint, one can view structure in observation sequences as relating to the level of entropy in the distribution over multi-step observations produced by the system, and interpret this selection method as a data compression scheme.
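The cached state update can be sketched as follows. This is an illustration with names of our choosing; it assumes $\Sigma \subseteq \Sigma'$ so that every prefix of the history admits an encoding, and keeps only the last $\max_{s \in \Sigma'} |s|$ cached states, as discussed above.

```python
import numpy as np

class MPSRState:
    """Incremental M-PSR state tracking with a bounded cache of past states."""

    def __init__(self, alpha, alpha_inf, A, sigma_prime):
        self.alpha_inf, self.A, self.sigma_prime = alpha_inf, A, sigma_prime
        self.window = max(len(s) for s in sigma_prime)
        # cache[k] = (state vector, minimal encoding length) after k observations
        self.cache = {0: (alpha, 0)}
        self.t = 0
        self.history = ""

    def update(self, obs):
        """Read one observation; extend the minimal encoding of the history."""
        self.history += obs
        self.t += 1
        best = None
        for s in self.sigma_prime:        # symbols that could end the history
            k = self.t - len(s)
            if k in self.cache and self.history[k:] == s:
                state, length = self.cache[k]
                cand = (length + 1, state @ self.A[s])
                if best is None or cand[0] < best[0]:
                    best = cand
        self.cache[self.t] = (best[1], best[0])
        # Only the last `window` cached states can ever be extended again.
        self.cache.pop(self.t - self.window, None)

    def value(self):
        state, _ = self.cache[self.t]
        return state @ self.alpha_inf
```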
In general terms, the algorithm works as follows. In a preprocessing step, we find all substrings in $\Sigma^*$ that appear in the training dataset. For computational reasons this can be constrained to the $M$ most frequent substrings, where $M$ is a parameter chosen by the user. Frequency of occurrence is measured by the number of training trajectories that contain a given substring. The construction of $\Sigma'$ is then initialised to $\Sigma$ and proceeds iteratively by adding a new multi-step symbol at each phase. A phase starts by evaluating every candidate substring in terms of how much reduction in the number of transition operators used to encode the whole training set would be achieved if the corresponding symbol were added to $\Sigma'$. The algorithm then adds to $\Sigma'$ the multi-step symbol corresponding to the best substring, i.e. the one that most reduces the whole coding cost (with respect to $\kappa$). More formally, at the $i$th iteration the algorithm finds

$$\operatorname*{argmin}_{u \in \text{subM}} \; \sum_{x \in \text{train}} \left| \kappa_{\Sigma'_i \cup \{u\}}(x) \right|,$$

where train is the training set, subM is the set of candidate substrings under consideration, $\Sigma'_i$ is the set of multi-step observations at the beginning of phase $i$, and, for clarity, $\kappa_{\Sigma'}$ denotes the encoding function with respect to a given set of multi-step observations $\Sigma'$. The algorithm terminates after $\Sigma'$ reaches a predetermined size.

Figure 1: The Double Loop environment (left) and the Pacman Labyrinth (right)

4 Experiments

In this section, we assess the performance of PSRs and M-PSRs in various environments, with different model sizes, and learned from both large and small datasets. In all plots, the x-axis is the model size of the PSR/M-PSRs and the y-axis is an error measurement of the learned PSR/M-PSRs. In all the experiments, an agent is positioned in a partially observable environment and navigates stochastically based on state-to-state transition probabilities. An observation symbol is produced on every transition.
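Algorithm 2 can be sketched in Python as follows (our illustration, not the paper's code). The helper `enc_len` recomputes the minimal coding length from the dynamic program of Section 3.2 each time, trading speed for clarity.

```python
def enc_len(x, sigma_prime):
    """Length of the minimal encoding of x over sigma_prime (DP recurrence)."""
    INF = float("inf")
    m = [0] + [INF] * len(x)
    for i in range(1, len(x) + 1):
        for s in sigma_prime:
            if len(s) <= i and x[i - len(s):i] == s:
                m[i] = min(m[i], m[i - len(s)] + 1)
    return m[len(x)]

def greedy_select(train, candidates, base_syms, num_ops):
    """Greedily grow Sigma': start from the basic symbols and repeatedly add
    the candidate substring that most reduces the total encoding length of
    the training set."""
    sigma_prime = list(base_syms)
    cost = {obs: len(obs) for obs in train}   # encoding over base symbols
    for _ in range(num_ops):
        best_s, best_gain = None, 0
        for s in candidates:
            if s in sigma_prime:
                continue
            gain = sum(cost[obs] - enc_len(obs, sigma_prime + [s])
                       for obs in train)
            if gain > best_gain:
                best_s, best_gain = s, gain
        if best_s is None:                    # no candidate improves the coding
            break
        sigma_prime.append(best_s)
        cost = {obs: enc_len(obs, sigma_prime) for obs in train}
    return sigma_prime
```

On the toy training set below the method picks the long repeated pattern first, mirroring the behaviour reported for the Double Loop experiments, where the selected operators match the loop lengths.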
When the agent exits, we record the concatenation of the symbols produced, which constitutes the observation sequence. We perform experiments in two environments: a Double Loop maze and a Pacman-style environment, depicted in Figure 1.

For the Base M-PSR (i.e., the timing models), we construct the empirical Hankel matrix by taking $P = S = \{\sigma^i : 0 \leq i \leq N\}$, where $N$ is an application-dependent parameter. For Double Loop environments, we set $N = 150$, while for the Pacman domain, $N = 600$. For these choices of $N$, we verify that as the amount of data increases, the learned PSR with the true model size converges to the true model. For the Base M-PSR, we set $b = 2$ and $K = 8$, so that the longest string in $\Sigma'$ is $\sigma^{256}$.

For the tasks with multiple observations, a slightly more complex approach is required to choose $P$ and $S$. For the prefixes $P$, we select the $k$ most frequent prefixes from our observation set. For the suffixes $S$, we take all suffixes that occur in $P$. We also require prefix completeness: if $p'$ is a prefix of $p \in P$, then $p' \in P$. This heuristic for constructing empirical Hankel matrices was also used in [Balle et al., 2012]. For the Base M-PSR, we take $K = 8$ and $b = 2$ for both symbols in $\Sigma = \{g, b\}$, where $g$ stands for green and $b$ for blue. For the Tree M-PSR, we set $L = 7$ for a total of 128 operators, a far larger limit than for the other M-PSRs.

To measure the performance of our learning algorithms we compute the difference between the values computed by $f$, the true probability distribution over observations, and $\hat{f}$, the function associated with the learned M-PSR/PSR. In our environments, $f$ can be computed exactly, as we have access to the underlying HMMs. Since there are infinitely many strings, we only compute the difference over a fixed set $T \subseteq \Sigma^*$ which depends on the problem.

Figure 2: Results for the Double Loop of size 64-16: (1) 100 samples, p = 0; (2) 10000 samples, p = 0; (3) 10000 samples, p = 0.2; and the Double Loop of size 47-27: (4) 100 samples, p = 0.
Thus, reported errors are given by $\sum_{x \in T} (f(x) - \hat{f}(x))^2$. For the Base M-PSR (timing) case, we take $T = \{\sigma^i : i \leq C\}$ with $C = 600$. Importantly, $C$ has to be large enough to ensure the error measure captures accuracy over the most frequent strings, say $\sum_{i=0}^{C} f(\sigma^i) > 0.99$. For the multiple observation case, we take all possible strings that can be produced from the prefixes and suffixes in our dataset: $T = \{ps : p \in P, s \in S\}$. In all our experiments we use the same $P$ and $S$ to learn the PSR and the different types of M-PSRs.

4.1 Double Loop Timing

We start by considering the Double Loop environment, and the task of learning timing models. The length of a loop is the number of states in the loop. A trajectory begins with the agent at the intersection of the two loops. Here, the agent has a 50% probability of entering either loop. At intermediate states, the agent moves to the next state in the loop with probability $1 - p$ or remains in its current state with probability $p$. Exit states are located halfway into each loop. The agent leaves the environment at an exit state with 50% probability.

Figure 2 (1-3) provides results for the case in which one loop consists of 64 states and the other of 16 states. Panel (1) presents the results when learning from 100 observation sequences, while the results in panel (2) are obtained using 10000 observations. In both cases, the M-PSRs outperform the simple PSR for small model sizes. Panel (3) shows the effect of raising the self-transition probability to $p = 0.2$, in order to simulate noise in the environment. We find that the environment with noisy loops is more compressible: one can achieve better performance for low model sizes, but the performance becomes worse as the model size approaches the environment's true size. M-PSRs still significantly outperform the standard PSR for reduced model sizes. Panel (1) in Figure 4 illustrates the effect of $K$ on the performance of the Base M-PSRs.
We note that larger values of $K$ yield better performance, up to $K = 7$. In Figure 4 (2), we illustrate how varying the number of multi-step transition operators affects performance for the 64-16 double loop. As expected, a higher number of operators improves performance, but the effect tapers off at about 20 operators. Here, the most important operators are $\{\sigma, \sigma^8, \sigma^{24}, \sigma^{32}, \sigma^{72}\}$, which are closely tied to the environment's periodic structure. In Figure 2 (4), we plot the results for a 47-27 labyrinth. Here, because of the lengths of the loops, observations are not as compactly expressed by the Base M-PSR. Again, M-PSRs outperform the standard PSR for reduced model sizes.

4.2 Pacman Labyrinth Timing

We now turn our attention to the environment on the right of Figure 1. The weights on the transitions represent the lengths of the corridors, and at each state the agent has equal probability of following any corridor starting at that state. To test the effect of the geometry of the environment we add a parameter $sF$ (stretch factor) which is used to multiply the length of the corridors (by adding extra intermediate states). In the left panels of Figure 3 we vary the number of observations used for learning. M-PSRs outperform the traditional PSR regardless of the amount of data. In the right two panels, we vary the stretch factor while keeping the size of the dataset fixed. We find that a higher value of $sF$ increases the improvement of the M-PSR relative to the standard PSR.
Figure 3: Results for the Pacman environment with (from left to right) a low amount of data, a high amount of data, and stretch factor equal to 1 and 5 (right panels)

Figure 4: (1) Effect of exponent in Base M-PSR (Double Loop 64-16); (2) Effect of the number of operators (Double Loop 64-16); (3) Effect of model size (Colored Double Loop 27-17); (4) Effect of the number of operators (Colored Double Loop 27-17)

4.3 Multiple Observations: Coloured Loops

To test the case of multiple observations, we construct a Double Loop environment where the first loop has length $l_1 = 27$ and its observations are green. The second loop is blue, with length $l_2 = 17$. We fix the length of each trajectory at $3(l_1 + l_2)$. We build empirical estimates for the Hankel matrix as follows: $\hat{f}(x) = |\text{train} \cap x\Sigma^*| / |\text{train}|$. This means that the PSRs compute the probability of $x$ occurring as a prefix. As for the timing case, we find that M-PSRs perform far better, especially the Data-Driven M-PSR. Results are displayed in panel (3) of Figure 4. In panel (4) we vary the number of multi-step transition operators learned. The important operators learned are $\{g, b, g^{27}, b^{17}\}$, which is again very encouraging, as it reflects the structure of this particular environment.

5 Discussion

We presented an approach to leveraging multiple temporal scales in the behaviour of dynamical systems in order to learn predictive state representations. The proposed model, the M-PSR, can be learned by adapting existing spectral approaches. To the best of our knowledge, the only other work that attempts to include temporal abstraction in PSRs is due to [Wolfe and Singh, 2006], who use temporally extended actions, or options [Sutton et al., 1999], and build PSRs on top of these. However, that model has a very different flavour from our approach, as we bundle together observations into multi-step transitions.
In particular, our approach is applicable even when there are no actions, as in the case of HMMs, whereas this previous work requires structure in the action space. In all the experiments we conducted, M-PSRs offer significantly better predictive performance for reduced model sizes than PSRs. In addition, Data-Driven M-PSRs improve over generic M-PSRs by learning transition operators specific to the environment, which is particularly important when prior information about appropriate modelling time scales is not available. Our evaluation focused on illustrating the advantage of the proposed model when the amount of available data is small and in noisy environments. These experiments suggest that M-PSRs with a small number of states provide much better approximations to a given environment than PSRs with the same number of states. We think this is a very promising property, and we plan to study it in detail in future work.

We used simulated domains in our experiments in order to compare with a ground truth. However, we anticipate that the proposed models would be useful in real tasks, for example in financial market prediction. In this case, one could discretize market levels into bins and use M-PSRs to learn predictive models over multiple time steps, in order to capture both long-term trends and fast fluctuations. We also plan to experiment with this approach in the domain of physiological signal recordings, which similarly exhibit short- and long-term trends. On the theoretical side, it would be interesting to analyze the existence of an optimal set of symbols for the data-driven M-PSR, and to develop further algorithms for finding it.

References

[Bacon et al., 2015] P.-L. Bacon, B. Balle, and D. Precup. Learning and planning with timing information in Markov Decision Processes. In UAI, 2015.
[Bailly et al., 2010] R. Bailly, A. Habrard, and F. Denis. A spectral approach for probabilistic grammatical inference on trees. In ALT, 2010.
[Balle et al., 2012] B. Balle, A. Quattoni, and X. Carreras. Local loss optimization in operator models: A new insight into spectral learning. In ICML, 2012.
[Boots et al., 2011] B. Boots, S. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. International Journal of Robotics Research, 2011.
[Carlyle and Paz, 1971] J. W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 1971.
[Droste and Vogler, 2009] M. Droste, W. Kuich, and H. Vogler. Handbook of Weighted Automata. Springer, 2009.
[Fliess, 1974] M. Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 1974.
[Hamilton et al., 2013] W. L. Hamilton, M. M. Fard, and J. Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In ICML, 2013.
[Hsu et al., 2009] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning Hidden Markov Models. In COLT, 2009.
[Littman et al., 2001] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In NIPS, 2001.
[Rosencrantz et al., 2004] M. Rosencrantz, G. Gordon, and S. Thrun. Learning low dimensional predictive representations. In ICML, 2004.
[Singh et al., 2004] S. Singh, M. R. James, and M. R. Rudary. Predictive state representations: A new theory for modeling dynamical systems. In UAI, 2004.
[Sutton et al., 1999] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
[Wolfe and Singh, 2006] B. Wolfe and S. Singh. Predictive state representations with options. In ICML, 2006.
[Zhang et al., 2015] C. Zhang, J. Song, K. Chaudhuri, and K. Chen. Spectral learning of large structured HMMs for comparative epigenomics. In NIPS, 2015.