# Recurrent Predictive State Policy Networks

Ahmed Hefny*¹, Zita Marinho*²³, Wen Sun², Siddhartha S. Srinivasa⁴, Geoffrey Gordon¹

We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz & Gordon, 2004; Sun et al., 2016) by modeling the predictive state: a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2018) to initialize the recursive filter. A predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP network can be purely reactive, simplifying training while still allowing optimal behaviour. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on the prediction error of the recursive filter. We show the efficacy of RPSP networks under partial observability on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP networks perform well compared with memory-preserving networks such as GRUs, as well as finite-memory models, being the overall best performing method.

*Equal contribution. ¹Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA. ²Robotics Institute, Carnegie Mellon University, Pittsburgh, USA. ³ISR/IT, Instituto Superior Técnico, Lisbon, Portugal. ⁴Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA. Correspondence to: Ahmed Hefny, Zita Marinho.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## 1. Introduction

Recently, there has been significant progress in deep reinforcement learning (Bojarski et al., 2016; Schulman et al., 2015; Mnih et al., 2013; Silver et al., 2016). Deep reinforcement learning combines deep networks as a representation of the policy with reinforcement learning algorithms and enables end-to-end training. While traditional applications of deep learning rely on standard architectures with sigmoid or ReLU activations, there is an emerging trend of using composite architectures that contain parts explicitly resembling other algorithms such as Kalman filtering (Haarnoja et al., 2016) and value iteration (Tamar et al., 2016). It has been shown that such architectures can outperform standard neural networks.

In this work, we focus on partially observable environments, where the agent does not have full access to the state of the environment, but only to partial observations thereof. The agent instead has to maintain a distribution over states, i.e., a belief state, based on the entire history of observations and actions. The standard approach to this problem is to employ recurrent architectures such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Cho et al., 2014).
Despite their success (Hausknecht & Stone, 2015), these methods are difficult to train due to non-convexity, and their hidden states lack a predefined statistical meaning. Models based on predictive state representations (Littman et al., 2001; Singh et al., 2004; Rosencrantz & Gordon, 2004; Boots et al., 2013) offer an alternative method to construct a surrogate for the belief state in a partially observable environment. These models represent state as the expectation of sufficient statistics of future observations, conditioned on history and future actions. Predictive state models admit efficient learning algorithms with theoretical guarantees. Moreover, the successive application of the predictive state update procedure (i.e., the filtering equations) results in a recursive computation graph that is differentiable with respect to model parameters. Therefore, we can treat predictive state models as recurrent networks and apply backpropagation through time (BPTT) (Hefny et al., 2018; Downey et al., 2017) to optimize model parameters.

We use this insight to construct a Recurrent Predictive State Policy (RPSP) network, a special recurrent architecture that consists of (1) a predictive state model acting as a recursive filter to keep track of a predictive state, and (2) a feed-forward neural network that directly maps predictive states to actions. This configuration results in a recurrent policy, where the recurrent part is implemented by a PSR instead of an LSTM or a GRU. As predictive states are a sufficient summary of the history of observations and actions, the reactive policy has rich enough information to make its decisions, as if it had access to a true belief state. There are a number of motivations for this architecture:

- Using a PSR means we can benefit from methods in the spectral learning literature to provide an efficient and statistically consistent initialization of a core component of the policy.
- Predictive states have a well-defined probabilistic interpretation as a conditional distribution of observed quantities. This can be utilized for optimization.
- The recursive filter in RPSP networks is fully differentiable, meaning that once a good initialization is obtained from spectral learning methods, we can refine RPSP networks using gradient descent.

This network can be trained end-to-end, for example using policy gradients in a reinforcement learning setting (Sutton et al., 2001) or supervised learning in an imitation learning setting (Ross et al., 2011). In this work we focus on the former.

We discuss the predictive state model component in Section 3. The control component is presented in Section 4 and the learning algorithm in Section 5. In Section 6 we describe the experimental setup and results on control tasks: we evaluate the performance of reinforcement learning using predictive state policy networks in multiple partially observable environments with continuous observations and actions.

## 2. Background and Related Work

Throughout the rest of the paper, we write vectors in bold notation $\mathbf{v}$ and matrices in capital letters $W$. We use $\otimes$ to denote the vectorized outer product: $\mathbf{x} \otimes \mathbf{y}$ is $\mathbf{x}\mathbf{y}^\top$ reshaped into a vector. We assume an agent is interacting with the environment in episodes, where each episode consists of $T$ time steps, in each of which the agent takes an action $a_t \in \mathcal{A}$, and observes an observation $o_t \in \mathcal{O}$ and a reward $r_t \in \mathbb{R}$.
The agent chooses actions based on a stochastic policy $\pi_\theta$ parameterized by a parameter vector $\theta$: $\pi_\theta(a_t \mid o_{1:t-1}, a_{1:t-1}) \equiv p(a_t \mid o_{1:t-1}, a_{1:t-1}, \theta)$. We would like to improve the policy's rewards by optimizing $\theta$ based on the agent's experience, in order to maximize the expected long-term reward

$$J(\pi_\theta) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\gamma^{t-1} r_t \mid \pi_\theta\right],$$

where $\gamma \in [0, 1]$ is a discount factor.

There are two major approaches for model-free reinforcement learning. The first is the value-function-based approach, where we seek to learn a function (e.g., a deep network (Mnih et al., 2013)) to evaluate the value of each action at each state (a.k.a. the Q-value) under the optimal policy (Sutton & Barto, 1998). Given the Q-function, the agent can act greedily based on estimated values. The second approach is direct policy optimization, where we learn a function to directly predict optimal actions (or optimal action distributions). This function is optimized to maximize $J(\pi_\theta)$ using policy gradient methods (Schulman et al., 2015; Duan et al., 2016) or derivative-free methods (Szita & Lőrincz, 2006). We focus on the direct policy optimization approach as it is more robust to noisy continuous environments and modeling uncertainty (Sutton et al., 2001; Wierstra et al., 2010). Our aim is to provide a new class of policy functions that combines recurrent reinforcement learning with recent advances in modeling partially observable environments using predictive state representations (PSRs).

There have been previous attempts to combine predictive state models with policy learning. Boots et al. (2011) proposed a method for planning in partially observable environments. The method first learns a PSR from a set of trajectories collected using an explorative blind policy. The predictive states estimated by the PSR are then considered as states in a fully observable Markov decision process. A value function is learned on these states using least-squares temporal difference learning (Boots & Gordon, 2010) or point-based value iteration (PBVI) (Boots et al., 2011). The main disadvantage of these approaches is that they assume a one-time initialization of the PSR and do not propose a mechanism to update the model based on subsequent experience. Hamilton et al. (2014) proposed an iterative method to simultaneously learn a PSR and use the predictive states to fit a Q-function. Azizzadenesheli et al. (2016) proposed a tensor decomposition method to estimate the parameters of a discrete partially observable Markov decision process (POMDP). One common limitation of the aforementioned methods is that they are restricted to discrete actions (some even assume discrete observations). Also, it has been shown that PSRs can benefit greatly from local optimization after a moment-based initialization (Downey et al., 2017; Hefny et al., 2018).

Venkatraman et al. (2017) proposed predictive state decoders, where an LSTM or a GRU network is trained on a mixed objective function in order to obtain high cumulative rewards while accurately predicting future observations. While this has shown improvement over standard training objectives, it does not solve the initialization issue of the recurrent network. Our proposed RPSP networks alleviate the limitations of previous approaches: they support continuous observations and actions, use a recurrent state tracker with a consistent initialization, and support end-to-end training after the initialization.
## 3. Predictive State Representations of Controlled Models

In this section, we give a brief introduction to predictive state representations, which constitute the state tracking (filtering) component of our model.¹ We provide more technical details in the appendix.

Given a history of observations and actions $a_1, o_1, a_2, o_2, \ldots, a_{t-1}, o_{t-1}$, a recursive filter computes a belief state $q_t$ using a recursive update equation $q_{t+1} = f(q_t, a_t, o_t)$. Given the state $q_t$, one can predict observations through a function $g(q_t, a_t) \equiv \mathbb{E}[o_t \mid q_t, a_t]$. In a recurrent neural network (Figure 1(a,b)), $q$ is latent and the function $g$ that connects states to outputs is unknown and has to be learned; in this case, the output to be predicted consists of observations when the RNN is used for prediction (see Figure 1(a,b)).

Predictive state models instead use a predictive representation of the state. That means $q_t$ is explicitly defined as the conditional distribution of future observations $o_{t:t+k-1}$ given future actions $a_{t:t+k-1}$² (e.g., in the discrete case, $q_t$ could be a vectorized conditional probability table). Predictive states are thus defined entirely in terms of observable features, with no latent variables involved. That means the mapping between the predictive state $q_t$ and the prediction of $o_t$ given $a_t$ can be fully known or simple to learn consistently (Hefny et al., 2015b; Sun et al., 2016). This is in contrast to RNNs, where this mapping is unknown and requires non-convex optimization to be learned.

Similar to an RNN, a PSR employs a recursive state update that consists of the following two steps:³

**State extension:** A linear map $W_{\text{ext}}$ is applied to $q_t$ to obtain an extended state $p_t$. This state defines a conditional distribution over an extended window of $k+1$ observations and actions. $W_{\text{ext}}$ is a parameter to be learned.

$$p_t = W_{\text{ext}}\, q_t \tag{1}$$

**Conditioning:** Given $a_t$ and $o_t$, and a known conditioning function $f_{\text{cond}}$:

$$q_{t+1} = f_{\text{cond}}(p_t, a_t, o_t). \tag{2}$$

Figure 1(c,d) depicts the PSR state update. The conditioning function $f_{\text{cond}}$ depends on the representation of $q_t$ and $p_t$. For example, in a discrete system, $q_t$ and $p_t$ could represent conditional probability tables and $f_{\text{cond}}$ amounts to applying Bayes rule. In continuous systems we can use Hilbert space embeddings of distributions (Boots et al., 2013), where $f_{\text{cond}}$ uses kernel Bayes rule (Fukumizu et al., 2013).

In this work, we use the RFFPSR model proposed in Hefny et al. (2018). Observation and action features are based on random Fourier features (RFFs) of an RBF kernel (Rahimi & Recht, 2008) projected into a lower-dimensional subspace using randomized PCA (Halko et al., 2011). We use $\phi$ to denote this feature function. The conditioning function $f_{\text{cond}}$ is kernel Bayes rule, and the observation function $g$ is a linear function of the state, $\mathbb{E}[o_t \mid q_t, a_t] = W_{\text{pred}}(q_t \otimes \phi(a_t))$. See Section 9.1 in the appendix for more details.

¹ We follow the predictive state controlled model formulation in Hefny et al. (2018). Alternative methods such as predictive state inference machines (Sun et al., 2016) could be contemplated.
² We condition on *intervention* by actions rather than *observing* actions. That means $q_t$ is independent of the policy that determines the actions; see Pearl (2009). The length $k$ depends on the observability of the system: a system is $k$-observable if maintaining the predictive state is equivalent to maintaining the distribution of the system's latent state.
³ See the appendix, Section 9, for more details.
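The update in equations (1)-(2), together with the observation predictor, can be summarized in a few lines. The following is a minimal numpy sketch: the dimensions, the identity feature map `phi`, and the linear stand-in `f_cond` are illustrative assumptions on our part; the RFFPSR model in the paper uses RFF features and kernel Bayes rule for conditioning.

```python
import numpy as np

# Hypothetical dimensions and randomly initialized parameters, for illustration only.
d_q, d_p, d_o, d_a = 10, 20, 3, 2                 # state, extended state, observation, action dims
rng = np.random.default_rng(0)
W_ext = rng.normal(size=(d_p, d_q))               # learned linear extension (Eq. 1)
W_pred = rng.normal(size=(d_o, d_q * d_a))        # learned observation predictor
W_cond = rng.normal(size=(d_q, d_p + d_a + d_o))  # stand-in for the conditioning map

def phi(a):
    """Action feature map (identity here; RFF features in the RFFPSR model)."""
    return a

def f_cond(p, a, o):
    """Stand-in conditioning function (Eq. 2); the RFFPSR uses kernel Bayes rule instead."""
    q_next = W_cond @ np.concatenate([p, phi(a), o])
    return q_next / np.linalg.norm(q_next)        # keep the state bounded

def filter_step(q, a, o):
    """One PSR filtering step: extend (Eq. 1), then condition on (a_t, o_t) (Eq. 2)."""
    p = W_ext @ q
    return f_cond(p, a, o)

def predict_obs(q, a):
    """E[o_t | q_t, a_t] = W_pred (q_t ⊗ φ(a_t))."""
    return W_pred @ np.outer(q, phi(a)).ravel()

# Roll the filter along a dummy trajectory.
q = rng.normal(size=d_q)
for t in range(5):
    a, o = rng.normal(size=d_a), rng.normal(size=d_o)
    o_hat = predict_obs(q, a)      # prediction before seeing o_t
    q = filter_step(q, a, o)       # belief update after observing (a_t, o_t)
```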
### 3.1. Learning predictive state representations

Learning PSRs is carried out in two steps: an initialization procedure using the method of moments and a local optimization procedure using gradient descent.

**Initialization:** The initialization procedure exploits the fact that $q_t$ and $p_t$ are represented in terms of observable quantities. Since $W_{\text{ext}}$ is linear, taking conditional expectations of (1) gives $\mathbb{E}[p_t \mid h_t] = W_{\text{ext}}\, \mathbb{E}[q_t \mid h_t]$, where $h_t \equiv h(a_{1:t-1}, o_{1:t-1})$ denotes a set of features extracted from previous observations and actions (typically from a fixed-length window ending at $t-1$). Because $q_t$ and $p_t$ are not hidden states, estimating the expectations on both sides can be done by solving a supervised regression subproblem. Given the predictions from this regression, solving for $W_{\text{ext}}$ then becomes another linear regression problem. We follow this two-stage regression proposed by Hefny et al. (2018).⁴ Once $W_{\text{ext}}$ is computed, we can perform filtering to obtain the predictive states $q_t$. We then use the estimated states to learn the mapping to predicted observations $W_{\text{pred}}$, which results in another regression subproblem; see Section 9.2 in the appendix for more details. In RFFPSR, we use linear regression for all subproblems (a reasonable choice with kernel-based features). This ensures that the two-stage regression procedure is free of local optima.

⁴ We use the joint stage-1 regression variant for initialization.

**Local Optimization:** Although the PSR initialization procedure is consistent, it is based on the method of moments and hence is not necessarily statistically efficient. Therefore it can benefit from local optimization. Downey et al. (2017) and Hefny et al. (2018) note that a PSR defines a recursive computation graph similar to that of an RNN, where we have

$$q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t, a_t, o_t), \qquad \mathbb{E}[o_t \mid q_t, a_t] = W_{\text{pred}}(q_t \otimes \phi(a_t)). \tag{3}$$

With a differentiable $f_{\text{cond}}$, the PSR can be trained using backpropagation through time to minimize prediction error. In a nutshell, a PSR effectively constitutes a special type of recurrent network where the state representation and update are chosen in a way that permits a consistent initialization, which is then followed by conventional backpropagation.

*Figure 1: Left: (a) computational graph of an RNN and a PSR, and the details of the state update function f for both (b) a simple RNN and (c) a PSR. Compared to an RNN, the observation function g is easier to learn in a PSR (see Section 3.1). Right: illustration of the PSR extension and conditioning steps: the predictive state $q_t = P(o_{t:t+k-1} \| a_{t:t+k-1})$ is extended to $p_t = P(o_{t:t+k} \| a_{t:t+k})$ over the extended future; conditioning on $(a_t, o_t)$ yields the shifted-future state $q_{t+1} = P(o_{t+1:t+k} \| a_{t+1:t+k})$.*
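As a rough illustration of the two-stage regression described above, the following numpy sketch assumes precomputed feature matrices for histories, futures and extended futures; it glosses over how future statistics are conditioned on future actions (handled by the joint stage-1 variant used in the paper) and uses ridge regression as a stand-in for each regression subproblem.

```python
import numpy as np

def ridge(X, Y, lam=1e-3):
    """Least-squares solve for W in Y ≈ W X (samples in columns)."""
    d = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

def two_stage_regression(H, Q, P, lam=1e-3):
    """
    H: (d_h, N) history features h_t
    Q: (d_q, N) future features whose conditional expectation given h_t is q_t
    P: (d_p, N) extended-future features whose conditional expectation given h_t is p_t
    """
    # Stage 1: estimate E[q_t | h_t] and E[p_t | h_t] by regressing on history features.
    W_qh = ridge(H, Q, lam)
    W_ph = ridge(H, P, lam)
    Q_hat, P_hat = W_qh @ H, W_ph @ H
    # Stage 2: since E[p_t | h_t] = W_ext E[q_t | h_t], solve for W_ext by regression.
    W_ext = ridge(Q_hat, P_hat, lam)
    return W_ext, Q_hat

# After filtering with W_ext to obtain states q_t, W_pred is obtained by one more
# regression from (q_t ⊗ φ(a_t)) onto o_t (not shown here).
```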
## 4. Recurrent Predictive State Policy (RPSP) Networks

We now introduce our proposed class of policies, Recurrent Predictive State Policies (RPSPs). We describe their components here, and in Section 5 we describe the learning algorithm. RPSPs consist of two fundamental components: a state tracking component, which models the state of the system and is able to predict future observations; and a reactive policy component, which maps states to actions, shown in Figure 2. The state tracking component is based on the PSR formulation described in Section 3.

*Figure 2: RPSP network. The predictive state is updated by a linear extension $W_{\text{ext}}$ followed by a non-linear conditioning $f_{\text{cond}}$. A linear predictor $W_{\text{pred}}$ is used to predict observations, which is used to regularize the training loss (see Section 5). A feed-forward reactive policy maps the predictive states $q_t$ to a distribution over actions.*

The reactive policy is a stochastic non-linear policy $\pi_{\text{re}}(a_t \mid q_t) \equiv p(a_t \mid q_t; \theta_{\text{re}})$, which maps a predictive state to a distribution over actions and is parametrized by $\theta_{\text{re}}$. Similar to Schulman et al. (2015), we assume a Gaussian distribution $\mathcal{N}(\mu_t, \Sigma)$, where

$$\mu_t = \varphi(q_t; \theta_\mu); \qquad \Sigma = \operatorname{diag}(\exp(\mathbf{r}))^2 \tag{4}$$

for a non-linear map $\varphi$ parametrized by $\theta_\mu$ (e.g., a feed-forward network), and a learnable vector $\mathbf{r}$. An RPSP is thus a stochastic recurrent policy with the recurrent part corresponding to a PSR. The parameters $\theta$ consist of two parts: the PSR parameters $\theta_{\text{PSR}} = \{q_0, W_{\text{ext}}, W_{\text{pred}}\}$ and the reactive policy parameters $\theta_{\text{re}} = \{\theta_\mu, \mathbf{r}\}$. In the following section, we describe how these parameters are learned.

**Algorithm 1** Recurrent Predictive State Policy network Optimization (RPSPO)
1:  Input: learning rate $\eta$.
2:  Sample initial trajectories $\{(o^i_t, a^i_t)_t\}_{i=1}^{M}$ from $\pi_{\text{exp}}$.
3:  Initialize PSR: $\theta^0_{\text{PSR}} = \{q_0, W_{\text{ext}}, W_{\text{pred}}\}$ via two-stage regression (Section 3).
4:  Initialize reactive policy $\theta^0_{\text{re}}$ randomly.
5:  for $n = 1 \ldots N_{\max}$ iterations do
6:      for $i = 1, \ldots, M$ (batch of $M$ trajectories from $\pi^{n-1}$) do
7:          Reset episode: $a^i_0$.
8:          for $t = 0 \ldots T$ (roll-in in each trajectory) do
9:              Get observation $o^i_t$ and reward $r^i_t$.
10:             Filter $q^i_{t+1} = f_t(q^i_t, a^i_t, o^i_t)$ (Eq. 3).
11:             Execute $a^i_{t+1} \sim \pi^{n-1}_{\text{re}}(\cdot \mid q^i_{t+1})$.
12:         end for
13:     end for
14:     Update $\theta$ using $D = \{\{o^i_t, a^i_t, r^i_t, q^i_t\}_{t=1}^{T}\}_{i=1}^{M}$: $\theta^n \leftarrow \text{UPDATE}(\theta^{n-1}, D, \eta)$, as in Section 5.
15: end for
16: Output: $\theta = (\theta_{\text{PSR}}, \theta_{\text{re}})$.

## 5. Learning RPSPs

As detailed in Algorithm 1, learning an RPSP is performed in two phases.⁵ In the first phase, we execute an exploration policy to collect a dataset that is used to initialize the PSR as described in Section 3.1. It is worth noting that this initialization procedure depends on observations rather than rewards, which can be particularly useful in environments where informative reward signals are infrequent. In the second phase, starting from the initial PSR and a random reactive policy, we iteratively collect trajectories using the current policy and use them to update the parameters of both the reactive policy $\theta_{\text{re}} = \{\theta_\mu, \mathbf{r}\}$ and the predictive model $\theta_{\text{PSR}} = \{q_0, W_{\text{ext}}, W_{\text{pred}}\}$, as depicted in Algorithm 1.

⁵ https://github.com/ahefnycmu/rpsp

Let $p(\tau \mid \theta)$ be the distribution over trajectories induced by the policy $\pi_\theta$. By updating the parameters, we seek to minimize the objective function in (5),

$$L(\theta) = \alpha_1 \ell_1(\theta) + \alpha_2 \ell_2(\theta) = -\alpha_1 J(\pi_\theta) + \alpha_2 \sum_{t=0}^{T} \mathbb{E}_{p(\tau \mid \theta)}\left[ \left\| W_{\text{pred}}(q_t \otimes \phi(a_t)) - o_t \right\|^2 \right], \tag{5}$$

which combines the negative expected return with the PSR prediction error.⁶ Optimizing the PSR parameters to maintain low prediction error can be thought of as a regularization scheme. The hyper-parameters $\alpha_1, \alpha_2 \in \mathbb{R}$ determine the importance of the expected return and the prediction error respectively; they are discussed in more detail in Section 5.3. Noting that an RPSP is a special type of recurrent network policy, it is possible to adapt policy gradient methods (Williams, 1992) to the joint loss in (5). In the following subsections, we propose different update variants.

⁶ We minimize the 1-step prediction error, as opposed to the general k-step future prediction error recommended by Hefny et al. (2018), to avoid biased estimates induced by non-causal statistical correlations (observations correlated with future actions) when performing on-policy updates with a non-blind policy.
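For concreteness, a batch estimate of the objective in (5) could be computed as in the sketch below. The trajectory container and field names are illustrative assumptions, not the interface of the released rpsp code; `predict_obs` stands for the PSR observation predictor $W_{\text{pred}}(q_t \otimes \phi(a_t))$.

```python
import numpy as np

def joint_loss(trajs, predict_obs, alpha1=1.0, alpha2=1.0, gamma=0.99):
    """Empirical estimate of Eq. (5) on a batch of trajectories.

    trajs: list of dicts with per-step arrays 'obs', 'acts', 'rews', 'states' (q_t).
    """
    returns, pred_errs = [], []
    for tr in trajs:
        T = len(tr["rews"])
        discounts = gamma ** np.arange(T)
        returns.append(np.sum(discounts * tr["rews"]))          # discounted return R(τ)
        err = [np.sum((predict_obs(q, a) - o) ** 2)             # 1-step prediction error
               for q, a, o in zip(tr["states"], tr["acts"], tr["obs"])]
        pred_errs.append(np.sum(err))
    # Negative expected return plus expected prediction error, weighted by alpha1, alpha2.
    return -alpha1 * np.mean(returns) + alpha2 * np.mean(pred_errs)
```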
### 5.1. Joint Variance Reduced Policy Gradient (VRPG)

In this variant, we use the REINFORCE method (Williams, 1992) to obtain a stochastic gradient of $J(\pi)$ from a batch of $M$ trajectories. Let $R(\tau) = \sum_{t=1}^{T} \gamma^{t-1} r_t$ be the cumulative discounted reward for a trajectory $\tau$ given a discount factor $\gamma \in [0, 1]$. REINFORCE uses the likelihood ratio trick $\nabla_\theta p(\tau \mid \theta) = p(\tau \mid \theta)\, \nabla_\theta \log p(\tau \mid \theta)$ to compute $\nabla_\theta J(\pi)$ as

$$\nabla_\theta J(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \theta)}\left[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid q_t) \right].$$

In practice, we use a variance-reducing variant of the policy gradient (Greensmith et al., 2001) given by

$$\nabla_\theta J(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \theta)}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid q_t)\, \big(R_t(\tau) - b_t\big) \right], \tag{6}$$

where we replace the cumulative trajectory reward $R(\tau)$ by a reward-to-go function $R_t(\tau) = \sum_{j=t}^{T} \gamma^{j-t} r_j$ computing the cumulative reward starting from $t$. To further reduce variance we use a baseline $b_t \approx \mathbb{E}_\theta[R_t(\tau) \mid a_{1:t-1}, o_{1:t}]$, which estimates the expected reward-to-go under the current policy. In our implementation, we assume $b_t = \mathbf{w}_b^\top q_t$ for a parameter vector $\mathbf{w}_b$ that is estimated using linear regression. Given a batch of $M$ trajectories, a stochastic gradient of $J(\pi)$ can be obtained by replacing the expectation in (6) with the empirical expectation over trajectories in the batch. A stochastic gradient of the prediction error can be obtained using backpropagation through time. With an estimate of both gradients, we can compute the gradient of (5) and update the parameters through gradient descent. For more details, see Algorithm 2 in the appendix.

### 5.2. Alternating Optimization

In this section, we describe a method that utilizes the recently proposed Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), an alternative to vanilla policy gradient methods that has shown superior performance in practice. It uses a natural gradient update and enforces a constraint that encourages small changes in the policy in each TRPO step. This constraint results in smoother changes of the policy parameters. Each TRPO update is an approximate solution to the constrained optimization problem in (7),

$$\theta_{n+1} = \arg\min_\theta \; \mathbb{E}_{\tau \sim p(\tau \mid \pi_n)}\left[ -\sum_{t=0}^{T} \frac{\pi_\theta(a_t \mid q_t)}{\pi_n(a_t \mid q_t)} \big(R_t(\tau) - b_t\big) \right] \;\; \text{s.t.} \;\; \mathbb{E}_{\tau \sim p(\tau \mid \pi_n)}\left[ \sum_{t=0}^{T} D_{\mathrm{KL}}\big(\pi_n(\cdot \mid q_t) \,\big\|\, \pi_\theta(\cdot \mid q_t)\big) \right] \le \epsilon, \tag{7}$$

where $\pi_n$ is the policy induced by $\theta_n$, and $R_t$ and $b_t$ are the reward-to-go and baseline functions defined in Section 5.1. While it is possible to extend TRPO to the joint loss in (5), we observed that TRPO tends to be computationally intensive with recurrent architectures. Instead, we resort to the following alternating optimization:⁷ in each iteration, we use TRPO to update the reactive policy parameters $\theta_{\text{re}}$, which involve only a feed-forward network. Then, we use a gradient step on (5), as described in Section 5.1, to update the PSR parameters $\theta_{\text{PSR}}$; see Algorithm 3 in the appendix.

⁷ We emphasize that both the VRPG and alternating optimization variants optimize the joint RL/prediction loss; they differ only in how they update the reactive policy parameters (which are independent of the prediction error).

### 5.3. Variance Normalization

It is difficult to make sense of the values of $\alpha_1, \alpha_2$, especially if the gradient magnitudes of their respective losses are not comparable. For this reason, we propose a more principled approach for finding the relative weights. We use $\alpha_1 = \bar{\alpha}_1$ and $\alpha_2 = a_2 \bar{\alpha}_2$, where $a_2$ is a user-given value, and $\bar{\alpha}_1$ and $\bar{\alpha}_2$ are dynamically adjusted to maintain the property that the gradient of each loss weighted by $\bar{\alpha}_i$ has unit (uncentered) variance, as in (8). In doing so, we maintain the variance of the gradient of each loss through exponential averaging and use it to adjust the weights:

$$v^{(n)}_{i,j} = (1-\beta)\, v^{(n-1)}_{i,j} + \beta \big(\nabla_{\theta_j} \ell^{(n)}_i\big)^2, \qquad \bar{\alpha}^{(n)}_i = \Big(\sum_{\theta_j \in \theta} v^{(n)}_{i,j}\Big)^{-1/2}. \tag{8}$$
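The weighting scheme in (8) amounts to keeping a running (uncentered) second moment of each loss gradient and using its inverse square root as the loss weight. The sketch below illustrates this mechanism; the class and variable names are our own, not taken from the released code.

```python
import numpy as np

class GradientVarianceNormalizer:
    """Exponential averaging of squared gradients, as in Eq. (8)."""

    def __init__(self, n_params, beta=0.1, eps=1e-8):
        self.v = np.zeros(n_params)   # running second moment v_{i,j} per parameter
        self.beta, self.eps = beta, eps

    def weight(self, grad):
        """grad: flattened gradient of one loss term l_i at the current iteration."""
        self.v = (1.0 - self.beta) * self.v + self.beta * grad ** 2
        return 1.0 / np.sqrt(self.v.sum() + self.eps)   # \bar{alpha}_i

# Usage: with one normalizer per loss, the combined update uses
# alpha1 = norm1.weight(grad_l1) and alpha2 = a2 * norm2.weight(grad_l2).
```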
## 6. Experiments

We evaluate the RPSP network's performance on a collection of reinforcement learning tasks using OpenAI Gym MuJoCo environments.⁸ We consider partially observable environments: only the angles of the agent's joints are visible to the network, without velocities.

⁸ https://gym.openai.com/envs#mujoco

**Proposed Models:** We consider an RPSP with a predictive component based on RFFPSR, as described in Sections 3 and 4. For the RFFPSR, we use 1000 random Fourier features on observation and action sequences followed by a PCA dimensionality reduction step to $d$ dimensions. We report the results for the best choice of $d \in \{10, 20, 30\}$. We initialize the RPSP with two-stage regression on a batch of $M_i$ initial trajectories (100 for Hopper, Walker and Cart-Pole, and 50 for Swimmer), equivalent to 10 extra iterations (5 for Swimmer). We then experiment with both joint VRPG optimization (RPSP-VRPG), described in Section 5.1, and alternating optimization (RPSP-Alt), described in Section 5.2. For RPSP-VRPG, we use the gradient normalization described in Section 5.3. Additionally, we consider an extended variation (+obs) that concatenates the predictive state with a window $w$ of previous observations as an extended form of predictive state, $\tilde{q}_t = [q_t, o_{t-w:t}]$. If PSR learning succeeded perfectly, this extra information would be unnecessary; however, we observe in practice that including observations helps the model learn faster and more stably. Later in the results section we report the RPSP variant that performs best. We provide a detailed comparison of all models in the appendix.

**Competing Models:** We compare our models to a finite memory model (FM) and gated recurrent units (GRU). The finite memory models are analogous to RPSP, but replace the predictive state with a window of past observations. We tried three variants, FM1, FM2 and FM5, with window sizes of 1, 2 and 5 respectively (FM1 ignores that the environment is partially observable). We compare to GRUs with 16-, 32-, 64- and 128-dimensional hidden states. We optimize network parameters using the RLLab⁹ implementation of TRPO with two different learning rates ($\eta = 10^{-2}, 10^{-3}$). In each model, we use a linear baseline for variance reduction, where the state of the model (i.e., the past observation window for FM, the latent state for GRU and the predictive state for RPSP) is used as the predictor variable.

⁹ https://github.com/openai/rllab

**Evaluation Setup:** We run each algorithm for a number of iterations based on the environment (see Figure 3). After each iteration, we compute the average return $R_{\text{iter}} = \frac{1}{M}\sum_{m=1}^{M}\sum_{j=1}^{T_m} r^m_j$ on a batch of $M$ trajectories, where $T_m$ is the length of the $m$-th trajectory. We repeat this process using 10 different random seeds and report the average and standard deviation of $R_{\text{iter}}$ for each iteration.
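The evaluation metric $R_{\text{iter}}$ is simply the per-batch average of undiscounted episode returns; a minimal sketch, with placeholder reward arrays, is shown below.

```python
import numpy as np

def average_return(reward_batch):
    """R_iter = (1/M) * sum_m sum_j r_j^m over a batch of M trajectories.

    reward_batch: list of 1-D arrays, one array of per-step rewards per trajectory.
    """
    return np.mean([np.sum(r) for r in reward_batch])

# Example: a batch of M = 3 trajectories of (possibly different) lengths T_m.
batch = [np.random.rand(200), np.random.rand(150), np.random.rand(200)]
print(average_return(batch))
```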
For each environment, we set the number of samples in the batch to 10000 and the maximum length of each episode to 200, 500, 1000 and 1000 for Cart-Pole, Swimmer, Hopper and Walker2d respectively.¹⁰ For RPSP, we found that a step size of $10^{-2}$ performs well for both VRPG and alternating optimization in all environments. The reactive policy contains one hidden layer of 16 nodes with ReLU activation. For all models, we report the results for the choice of hyper-parameters that resulted in the highest mean cumulative reward (area under the curve).

¹⁰ For example, for a 1000-step environment we use a batch of 10 trajectories, resulting in 10000 samples in the batch.

*Figure 3: Empirical average return over 10 epochs (bars indicate standard error). (a-d): Finite memory model with w = 2 (FM), GRUs, best performing RPSP with joint optimization (RPSP-VRPG) and best performing RPSP with alternating optimization (RPSP-Alt) on four environments. (e): RPSP variations: fixed PSR parameters (fix PSR), without prediction regularization (reactive PSR), random initialization (random PSR). (f-g): Comparison with GRU + prediction regularization (reg GRU). RPSP graphs are shifted to the right to reflect initialization trajectories. (h): Cumulative rewards (area under curve), shown in the table below.*

| Model | Swimmer | Hopper | Walker2d | Cart-Pole |
|---|---|---|---|---|
| FM | 41.6 ± 3.5 | 242.0 ± 5.1 | 285.1 ± 25.0 | 12.7 ± 0.6 |
| GRU | 22.0 ± 2.0 | 235.2 ± 9.8 | 204.5 ± 16.3 | 27.95 ± 2.3 |
| RPSP-Alt | 34.9 ± 1.1 | 307.4 ± 5.1 | 345.8 ± 12.6 | 22.9 ± 1.0 |
| RPSP-VRPG | 44.9 ± 2.8 | 305.0 ± 10.9 | 287.8 ± 21.1 | 23.8 ± 2.3 |
| reactive PSR | 44.9 ± 2.4 | 165.4 ± 14.0 | 184.9 ± 35.8 | 4.9 ± 1.0 |
| reg GRU | 28.9 ± 1.7 | 260.6 ± 3.6 | 327.7 ± 12.8 | 22.9 ± 0.3 |

## 7. Results and Discussion

**Performance over iterations:** Figure 3 shows the empirical average return vs. the amount of interaction with the environment (experience), measured in time steps. We observe that RPSP networks (especially RPSP-Alt) perform well in every environment, competing with or outperforming the top model in terms of learning speed and final reward, with the exception of Cart-Pole, where the gap to GRU is larger. We report the cumulative reward for all environments in Table 3(h). For all environments except Cart-Pole, some variant of RPSP is the best performing model. For Swimmer our best performing model is only statistically better than the FM model (t-test, p < 0.01), while for Hopper our best RPSP model performs statistically better than both the FM and GRU models (t-test, p < 0.01), and for Walker2d RPSP outperforms only the GRU baselines (t-test, p < 0.01). For Cart-Pole the top RPSP model performs better than the FM model (t-test, p < 0.01) and is not statistically significantly different from the GRU model. We also note that RPSP-Alt provides performance similar to the joint optimization (RPSP-VRPG), but converges faster.

**Effect of proposed contributions:** Our RPSP model is based on a number of components: (1) state tracking using a PSR, (2) consistent initialization using two-stage regression, (3) end-to-end training of the state tracker and policy, and (4) use of the observation prediction loss to regularize training.
We conducted a set of experiments to verify the benefit of each component.¹¹ In the first experiment, we test three variants of RPSP: one where the PSR is randomly initialized (random PSR), another where the PSR is fixed at its initial value and only the reactive policy is further updated (fix PSR), and a third where we train the RPSP network with initialization but without prediction loss regularization, i.e., we set $\alpha_2$ in (5) to 0 (reactive PSR). Figure 3(e) demonstrates that these variants are inferior to our model, showing the importance of two-stage initialization, end-to-end training and the observation prediction loss respectively.

¹¹ Due to space limitations, we report results on the Hopper environment; we report results for the other environments in the appendix.

In the second experiment, we replace the PSR with a GRU that is initialized using BPTT applied to the exploration data. This is analogous to the predictive state decoders proposed in Venkatraman et al. (2017), where an observation prediction loss is included when optimizing a GRU policy network (reg GRU).¹² Figure 3(f-g) shows that the GRU model is inferior to the PSR model, whose initialization procedure is consistent and does not suffer from local optima.

¹² We report results for the partially observable setting, which is different from the RL experiments in Venkatraman et al. (2017).

*Figure 4: Empirical average return over 10 trials with a batch of M = 10 trajectories of T = 1000 time steps for Hopper. (Left to right) Robustness to observation Gaussian noise σ = {0.1, 0.2, 0.3}, for the best RPSP with the alternating loss (Alt) and the finite memory model (FM2).*

**Effect of observation noise:** We also investigated the effect of observation noise on the RPSP model and the competitive FM baseline by applying Gaussian noise of increasing variance to the observations. Figure 4 shows that while FM was very competitive with RPSP in the noiseless case, RPSP has a clear advantage over FM in the case of mild noise. The performance gap vanishes under excessive noise.

## 8. Conclusion

We propose RPSP networks, combining ideas from predictive state representations and recurrent networks for reinforcement learning. We use PSR learning algorithms to provide a statistically consistent initialization of the state tracking component, and propose gradient-based methods to maximize the expected return while reducing prediction error. We compare RPSP against different baselines and empirically show the efficacy of the proposed approach in terms of speed of convergence and overall expected return.

One direction to investigate is how to develop an online, consistent and statistically efficient method to update the RFFPSR as a predictor in continuous environments. There has been a body of work on online learning of predictive state representations (Venkatraman et al., 2016; Boots & Gordon, 2011; Azizzadenesheli et al., 2016; Hamilton et al., 2014). To our knowledge, none of them is able to deal with continuous actions and make use of local optimization. We are also interested in applying off-policy methods and more elaborate exploration strategies.

## References

Azizzadenesheli, K., Lazaric, A., and Anandkumar, A. Reinforcement learning of POMDPs using spectral methods. CoRR, abs/1602.07764, 2016. URL http://arxiv.org/abs/1602.07764.

Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., and Zieba, K. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.

Boots, B. and Gordon, G. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI), 2011.

Boots, B. and Gordon, G. J.
Predictive state temporal difference learning. In Advances in Neural Information Processing Systems 23, pp. 271-279, 2010.

Boots, B., Siddiqi, S., and Gordon, G. Closing the learning-planning loop with predictive state representations. International Journal of Robotics Research, 30:954-956, 2011.

Boots, B., Gretton, A., and Gordon, G. J. Hilbert space embeddings of predictive state representations. In Proceedings of the 29th International Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734, 2014.

Downey, C., Hefny, A., Boots, B., Gordon, G. J., and Li, B. Predictive state recurrent neural networks. In Advances in Neural Information Processing Systems 30, pp. 6055-6066. Curran Associates, Inc., 2017.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1329-1338, 2016.

Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14(1):3753-3783, 2013.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. In Advances in Neural Information Processing Systems 14, pp. 1507-1514. MIT Press, 2001.

Haarnoja, T., Ajay, A., Levine, S., and Abbeel, P. Backprop KF: Learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems 29, pp. 4376-4384. Curran Associates, Inc., 2016.

Halko, N., Martinsson, P. G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 2011.

Hamilton, W., Fard, M. M., and Pineau, J. Efficient learning and planning with compressed predictive states. Journal of Machine Learning Research, 15(1):3395-3439, 2014.

Hausknecht, M. J. and Stone, P. Deep recurrent Q-learning for partially observable MDPs. CoRR, abs/1507.06527, 2015.

Hefny, A., Downey, C., and Gordon, G. J. Supervised learning for dynamical system learning. In NIPS, 2015a.

Hefny, A., Downey, C., and Gordon, G. J. Supervised learning for dynamical system learning. In Advances in Neural Information Processing Systems, pp. 1963-1971, 2015b.

Hefny, A., Downey, C., and Gordon, G. J. An efficient, expressive and local minima-free method for learning controlled dynamical systems. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17089.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Littman, M. L., Sutton, R. S., and Singh, S. Predictive representations of state.
In Advances in Neural Information Processing Systems 14, pp. 1555-1561. MIT Press, 2001.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop, 2013. URL http://arxiv.org/abs/1312.5602.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In NIPS, 2008.

Rosencrantz, M. and Gordon, G. Learning low dimensional predictive representations. In ICML, pp. 695-702, 2004.

Ross, S., Gordon, G. J., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 627-635, 2011.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1889-1897, 2015.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Singh, S., James, M. R., and Rudary, M. R. Predictive state representations: A new theory for modeling dynamical systems. In UAI, 2004.

Sun, W., Venkatraman, A., Boots, B., and Bagnell, J. A. Learning to filter with predictive state inference machines. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1197-1205, 2016.

Sutton, R., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (Proceedings of the 1999 conference), pp. 1057-1063. MIT Press, 2001.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

Szita, I. and Lőrincz, A. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936-2941, 2006.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems 29, pp. 2146-2154, 2016.

Venkatraman, A., Sun, W., Hebert, M., Bagnell, J. A., and Boots, B. Online instrumental variable regression with applications to online linear system identification. In AAAI, 2016.

Venkatraman, A., Rhinehart, N., Sun, W., Pinto, L., Boots, B., Kitani, K., and Bagnell, J. A. Predictive state decoders: Encoding the future into recurrent networks. In Advances in Neural Information Processing Systems (NIPS), 2017.

Wierstra, D., Förster, A., Peters, J., and Schmidhuber, J. Recurrent policy gradients. Logic Journal of the IGPL, 18(5):620-634, 2010.

Williams, R. J.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.