Forethought and Hindsight in Credit Assignment

Veronica Chelu (Mila, McGill University), Doina Precup (Mila, McGill University, DeepMind), Hado van Hasselt (DeepMind)

Abstract

We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. In particular, we work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We establish the relative merits, limitations and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from explicit environment-dynamics predictors to more abstract planner-aware models.

1 Introduction

Credit assignment, i.e. determining how to correctly associate delayed rewards with states or state-action pairs, is a crucial problem for reinforcement learning (RL) agents (Sutton and Barto, 2018). Model-based RL (MBRL) agents gradually learn a model of the rewards and transition dynamics through interaction with their environments and use the estimated model to find better policies or predictions (e.g., Sutton, 1990a; Peng and Williams, 1993; Moore and Atkeson, 1993; McMahan and Gordon, 2005; Sutton et al., 2012; Farahmand et al., 2017; Farahmand, 2018; Abachi et al., 2020; Wan et al., 2019; Silver et al., 2017; Schrittwieser et al., 2019; Deisenroth et al., 2015; Hester and Stone, 2013; Ha and Schmidhuber, 2018; Oh et al., 2017). However, the efficiency of MBRL depends on learning a model that is useful for its purpose. In this paper, we focus specifically on the use of models for value credit assignment.

Broadly, we refer to planning as any internal processing that an agent can perform, without additional experience, to improve its predictions and/or performance. Within this nomenclature, we define models as knowledge about the internal workings of the environment, which can be routinely re-used through planning. One way of using models is by forethought, or "trying things in your head" (Sutton and Barto, 1981), which requires learning to predict aspects of the future and planning forward, or in anticipation, to achieve goals. Dyna-style planning (Sutton, 1990a) chooses a (possibly hypothetical) state and action and predicts the resulting reward and next state, which are then used for credit assignment. The origins of the chosen state and action are referred to as search-control strategies. Lin (1992) proposed to use states actually experienced, and introduced the idea of replaying prior experience (Lin, 1992; Mnih et al., 2015). Combinations of these two approaches result in prioritized sweeping variants (Moore and Atkeson, 1993; Peng and Williams, 1993; McMahan and Gordon, 2005), which generalize the idea of replaying experience in backward order (Lin, 1992) by prioritizing states based on the potential improvement in the value function estimate upon re-evaluation. From high-priority states, forward models are used to perform additional value function updates, increasing computational efficiency (e.g., van Seijen and Sutton, 2013).
An investigation into search-control strategies (Pan et al., 2018) reveals the utility of additional prioritization for guiding learning towards relevant, causal or interesting states.

In this paper, we work to understand how different phenomena caused by the structure of an environment favor the use of forward or backward planning mechanisms for credit assignment. We define backward models as learning to predict potential predecessors of observed states, either through explicit predictors of the environment or via planner-aware models, where the latter account for how the planner performs credit assignment. Backward models are interesting from two standpoints: (i) they can be used to causally change predictions or behaviour in hindsight, thereby naturally prioritizing states where predictions need to be (re)-evaluated; (ii) modeling errors in backward models can sometimes be less detrimental, because updating misspecified imaginary states with real experience may be less problematic than the reverse (van Hasselt et al., 2019; Jafferjee et al., 2020). We hope that additional understanding of the mechanisms of backward planning paves the way for new, principled algorithms that use models to seamlessly integrate both forethought and hindsight (as has long been the case in traditional planning methods; LaValle, 2006).

The estimation and usage of models can be done in many ways (van Hasselt et al., 2019; Van Seijen and Sutton, 2015; Parr et al., 2008; Wan et al., 2019). The conventional approach is to learn explicit predictors of the environment which, if accurate enough, lead to good policies. However, no model is perfect. Model error depends on the choice of predictors and on whether the true environment dynamics can be represented. Planner-aware model learning suggests instead learning only the aspects of the environment relevant to the way in which the model is going to be used by the planner. This class of methods (Farahmand et al., 2017; Farahmand, 2018; Joseph et al., 2013; Silver et al., 2017; Oh et al., 2017; Farquhar et al., 2017; Luo et al., 2019; D'Oro et al., 2019; Schrittwieser et al., 2019; Abachi et al., 2020; Ayoub et al., 2020) incorporates knowledge about the value function or policy when learning the model. We describe a spectrum of methods for model estimation. On one end, we have environment predictors that rely on maximum likelihood estimation based on supervised learning. Towards the opposite end, constraints can be progressively relaxed by accounting for how planners use the models, ultimately leading to fully abstract models, i.e. learnable black boxes (Xu et al., 2020; Oh et al., 2020).

Contributions. We investigate the emergent properties of planning forward and backward with learned models of the world. We justify the use of backward models in identifying relevant states from which to recompute prediction errors, and we establish the available design choices with respect to what the model represents, how it is estimated, and how it is parametrized. We review these in the broader context of model estimation. Finally, we conduct an empirical study on illustrative prediction and control tasks, which builds intuition and provides evidence for our findings.
2 Background and Notation

We consider the usual RL setting (Sutton and Barto, 2018) in which an agent interacts with an environment, modelled as a (discounted) Markov decision process (MDP) (Puterman, 1994) $(\mathcal{S}, \mathcal{A}, P^\star, r^\star, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, state-transition distribution $P^\star : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ (where $\mathcal{P}(\mathcal{S})$ is the set of probability distributions on the state space and $P^\star(s'|s, a)$ denotes the probability of transitioning to state $s'$ from $s$ by choosing action $a$), reward function $r^\star : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and discount $\gamma \in (0, 1)$. A policy $\pi$ maps states to distributions over actions; $\pi(a|s)$ denotes the probability of choosing action $a$ in state $s$. A Markov reward process (MRP) is specified as $(\mathcal{S}, P^\star, r^\star, \gamma)$. The return at time $t$ is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. The value function maps each state $s \in \mathcal{S}$ to the expected value of the return: $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, A_n \sim \pi(\cdot|S_n), \forall n \geq t]$. Value-based methods can be used for control by approximating the optimal action-value function $q^\star(s, a)$, representing the expected return when following the optimal policy, conditioned on a state-action pair.

In general, the agent does not directly observe the true state of the environment $s$, and instead observes or constructs a feature vector $\mathbf{x}(s)$. The value function can then be approximated using a parametrized function $v_{\mathbf{w}}(s) \approx v_\pi(s)$, with $\mathbf{w} \in \mathbb{R}^d$ and $d$ the size of the feature representation. This estimate can be linear, $v_{\mathbf{w}}(s) = \mathbf{w}^\top \mathbf{x}(s)$, or an arbitrary non-linear function in the general case. As a shorthand, we use $\mathbf{x}_t = \mathbf{x}(s_t)$.

Usually, the true model $P^\star$ and reward function $r^\star$ are not known to the agent. Instead, the agent interacts with the environment to collect samples and updates its value prediction estimates:

$\mathbf{w}_{t+1} = \mathbf{w}_t + \underbrace{\alpha\, [Y_t - v_{\mathbf{w}_t}(S_t)]\, \nabla_{\mathbf{w}} v_{\mathbf{w}_t}(S_t)}_{\Delta \mathbf{w}_t}$,   (TD update) (1)

where $Y_t$ is an update target and $\alpha$ a step size. For instance, we could use Monte Carlo returns $Y_t = G_t$, or temporal difference (TD) errors (Sutton, 1988), $Y_t - v_{\mathbf{w}_t}(S_t) = \delta_t \doteq R_{t+1} + \gamma v_{\mathbf{w}_t}(S_{t+1}) - v_{\mathbf{w}_t}(S_t)$. For control, an optimal action-value function $q_{\mathbf{w}}$ can be learned using Q-learning (Watkins and Dayan, 1992) updates: $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, [R_{t+1} + \gamma \max_a q_{\mathbf{w}}(S_{t+1}, a) - q_{\mathbf{w}}(S_t, A_t)]\, \nabla_{\mathbf{w}} q_{\mathbf{w}_t}(S_t, A_t)$.

An MBRL agent learns an estimate $P$ of the true model $P^\star$ and $r$ of the reward function $r^\star$, a process called model learning. The agent can then employ a planning algorithm, which uses additional computation, without additional experience, to improve its predictions and/or behaviour. Usually, the Planner uses models that look forward in time and anticipate a future state and reward conditioned on their input; we refer to this as a forward Planner. A different option is a retrospective (backward) Planner, which uses models that look backward in time and predict a predecessor state and corresponding reward.

Conventional approaches to model learning focus on learning good predictors of the environment, and ignore how the Planner uses the model. Limited capacity and sampling noise can lead to model approximation errors, and the model can potentially pay attention to aspects of the environment irrelevant to the Planner (e.g., trying to predict a noisy TV). To mitigate this, value-aware model learning methods (Farahmand et al., 2017; Farahmand, 2018; Silver et al., 2017; Ayoub et al., 2020) attempt to find a model such that performing value-based planning on $v$ has a similar effect to applying the true environment model. Policy-aware model learning methods (Abachi et al., 2020) similarly look at the effect of planning on the policy, rather than on values. In both cases, this means the model can focus on the aspects most important for the associated planning algorithm.
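To make the update in equation (1) concrete, here is a minimal sketch (our own, not the authors' code) of one semi-gradient TD(0) update with a linear value function; the function and variable names are illustrative.

```python
import numpy as np

def td0_update(w, x_s, x_s_next, reward, gamma, alpha):
    """One semi-gradient TD(0) update for a linear value function v_w(s) = w^T x(s)."""
    v_s = w @ x_s             # current estimate v_w(S_t)
    v_next = w @ x_s_next     # bootstrap estimate v_w(S_{t+1})
    td_error = reward + gamma * v_next - v_s   # delta_t = R_{t+1} + gamma v(S_{t+1}) - v(S_t)
    return w + alpha * td_error * x_s          # gradient of v_w(S_t) w.r.t. w is x(S_t)

# illustrative usage with random features
rng = np.random.default_rng(0)
w = np.zeros(4)
w = td0_update(w, rng.normal(size=4), rng.normal(size=4), reward=1.0, gamma=0.9, alpha=0.1)
```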
3 Planning Backward in Time

Algorithm 1: Backward Planning
1: Input: policy $\pi$, number of planning steps $N$
2: $s \leftarrow$ env()
3: for each interaction $\{1, 2, \ldots, T\}$ do
4:   $a \sim \pi(s)$
5:   $r, \gamma, s' \leftarrow$ env($a$)
6:   $\bar P, \bar r \leftarrow$ model_learning_update($s, a, s'$)
7:   $v \leftarrow$ learning_update($s, a, r, \gamma, s'$)
8:   for each planning step $\{1, 2, \ldots, N\}$ do
9:     $\bar s \sim \bar P(\cdot \mid s)$, $\bar r \leftarrow \bar r(\bar s, s)$
10:    $v \leftarrow$ Planner($\bar s, \bar r, \gamma, s$)
11:  $s \leftarrow s'$

As a thought experiment, consider a simple model that looks forward or backward for one time-step to predict the next or the previous state. An agent takes action $a$ in $s$ and transitions to $s'$, experiencing a TD error that changes the value prediction for $s$. To propagate this information backward to a predecessor state $\bar s$ of $s$, forward models can face difficulties, because finding a good predecessor is nontrivial, and model misspecification can cause a damaging update, pushing the value prediction estimate of a real state further away from its true value. Dyna-style planning methods (Sutton, 1990a) perform credit assignment by planning forward from previously visited states (or hypothetical states). This requires additional search-control and prioritization mechanisms. Otherwise: (i) the sampled state might be unrelated to the current state whose estimate has recently been updated; (ii) if the model is poor, planning steps can update the value of a real state with an erroneous imagined transition. Backward models naturally sidestep these issues: (i) they can directly predict predecessor states $\bar s$ of a newly updated state $s$; (ii) if the planning update of the imagined state $\bar s$ solely uses $s$ as target for the update, a poor model will only damage the value prediction estimate of a fictitious state $\bar s$.

We start our analysis with a treatment of backward planning which, in contrast to forward planning, operates using models running backward in time. We may write $P^\star_\pi(s_{t+1}, a_t \mid s_t)$ in place of $P^\star_\pi(S_{t+1} = s_{t+1}, A_t = a_t \mid S_t = s_t)$ in the interest of space.

Assumptions. Throughout the paper, we make a stationarity assumption: for any policy $\pi$, the Markov chain induced by $\pi$, $P^\star_\pi(s_{t+1} \mid s_t) = \sum_{a \in \mathcal{A}} P^\star(s_{t+1} \mid s_t, a)\, \pi(a \mid s_t)$, is irreducible and aperiodic. We denote by $d_{\pi,t}(s)$ the probability of observing state $s$ at time $t$ when following $\pi$. Under the ergodicity assumption, each policy induces a unique stationary distribution of observed states, $d_\pi(s) = \lim_{t \to \infty} d_{\pi,t}(s)$, as well as a stationary joint state-action distribution, $d_\pi(s, a) = \pi(a|s)\, d_\pi(s)$.

Backward models. A backward transition model identifies predecessor states of its input state. In formalizing backward models, we highlight some interesting properties (for which we defer details to appendix A). To begin, backward models are tethered to a policy. Formally, we use $\bar P_{\pi,t}$ to refer to the dynamics of the time-reversed Markov chain induced by a policy $\pi$ at time-step $t$:

$\bar P_{\pi,t}(s_t, a_t \mid s_{t+1}) = d_{\pi,t+1}(s_{t+1})^{-1}\, d_{\pi,t}(s_t)\, \pi(a_t \mid s_t)\, P^\star(s_{t+1} \mid s_t, a_t)$,   (2)

with $\bar P_\pi(s_t, a_t \mid s_{t+1}) \doteq \lim_{t \to \infty} \bar P_{\pi,t}(s_t, a_t \mid s_{t+1})$. One might hope that action-conditioning would relieve this policy dependence. Alas, it does not. An action-conditioned backward model for policy $\pi$ is defined as:

$\bar P_\pi(s_t \mid s_{t+1}, a_t) = \frac{\pi(a_t \mid s_t)\, d_\pi(s_t)}{\bar\pi(a_t \mid s_{t+1})\, d_\pi(s_{t+1})}\, P^\star(s_{t+1} \mid s_t, a_t)$,   (3)

where $\bar\pi(a_t \mid s_{t+1})$ is the marginal probability of an action given the future state.
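For a known tabular chain, the time-reversed kernel in equation (2) can be computed exactly via Bayes' rule. The sketch below is our own illustration, not taken from the paper: it marginalizes over actions for simplicity (eq. (2) conditions jointly on state-action pairs), and all names are illustrative.

```python
import numpy as np

def backward_kernel(P_forward, d_stationary):
    """Time-reversed kernel of a Markov chain (cf. eq. 2, marginalized over actions):
    P_bar[s_next, s_prev] = d(s_prev) * P_forward[s_prev, s_next] / d(s_next)."""
    joint = d_stationary[:, None] * P_forward           # joint[s_prev, s_next] = d(s_prev) P(s_next|s_prev)
    return (joint / joint.sum(axis=0, keepdims=True)).T  # normalize each column, then transpose

# small irreducible, aperiodic 3-state chain: stationary distribution, then reversed kernel
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.2, 0.8],
              [0.7, 0.0, 0.3]])
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])   # eigenvector for eigenvalue 1
d = d / d.sum()
P_bar = backward_kernel(P, d)                      # P_bar[s_next, s_prev] = Pr(prev = s_prev | next = s_next)
assert np.allclose(P_bar.sum(axis=1), 1.0)
```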
Time-extended backward models. Policy-conditioned models come in many shapes and have the potential to be useful in reasoning over larger timescales. Specifically, we write $\bar P^{(n)}_\pi(\cdot \mid s)$ for the $n$-step backward transition model, i.e. the distribution over predecessor states from which, after following policy $\pi$ for $n$ steps, the agent arrives at state $s$. Similarly, we denote the associated $n$-step reward model as $\bar r^{(n)}_\pi(\bar s, s) = \mathbb{E}\big[\sum_{t=0}^{n-1} \gamma^t R_{t+1} \mid S_0 = \bar s, S_n = s, A_t \sim \pi(\cdot \mid S_t)\big]$. Other time-extended variants exist, such as λ-models (Sutton, 1995) or option models (Precup et al., 1998; Sutton et al., 1999), and could be learned counterfactually (Sutton et al., 2011) using an excursion formulation (Mahmood et al., 2015; Sutton et al., 2016; Zhang et al., 2020a,b; Gelada and Bellemare, 2019; Hallak and Mannor, 2017). We defer investigation of the off-policy regime to future work. We primarily center on the prediction setting, in which the goal is to evaluate a given $\pi$, and simplify notation by removing the policy subscript from models and value functions.

Backward models are a pair of reward and transition models, $(\bar P, \bar r)$ (single or multi-step, and policy-dependent). The reward model takes in two endpoint states and outputs the estimated reward. Depending on its class, the transition model can output a distribution over predecessor states, a sample predecessor, or an expectation over prior states of its input.

Backward planning. The hindsight planning mechanism we consider uses a backward model to identify the predecessor states $\bar s$ of a particular state $s$. The backward Planner projects backward in time and, from the projected states, performs forward-looking TD updates that end back in $s$. These corrections are used to re-evaluate the value predictions at states $\bar s$. Such updates attempt to do credit assignment counterfactually, by making parameter corrections in hindsight, given the new information gathered at the current step (the TD error of the transition $s \xrightarrow{a} s'$). Forward-view corrections can be reformulated as backward corrections under the backward Markov chain (appendix B). For instance, an $n$-step learning update from any state $s$ can be formulated as

$\Delta\mathbf{w} = \alpha\, \mathbb{E}\big[\big(Y^{(n)}(S_{t-n}, S_t) - v_{\mathbf{w}}(S_{t-n})\big)\, \nabla_{\mathbf{w}} v_{\mathbf{w}}(S_{t-n}) \mid S_t = s,\ S_{t-n} \sim \bar P^{(n)}(\cdot \mid S_t = s)\big]$,

where $Y^{(n)}(S_{t-n}, S_t) = \bar r^{(n)}(S_{t-n}, S_t) + \gamma^n v_{\mathbf{w}}(S_t)$ is the $n$-step update target. For simplicity, in the following we use single-step models, and henceforth drop the horizon reference from the notation. Algorithm 1 sketches the generic steps of hindsight planning with backward models in a full framework of simultaneous learning and planning.

Model estimation. The choice of model estimation instantiates the above algorithmic template. The most explicit way of computing $\bar P$ is by learning full probability distributions, i.e. estimating the backward distribution $\bar P(\cdot \mid s)$. Then, one can either (i) sample the model, $\bar s \sim \bar P(\cdot \mid s)$, and do a stochastic update (or many): $\mathbf{w} \leftarrow \mathbf{w} + \alpha\, \big(\bar r(\bar s, s) + \gamma v_{\mathbf{w}}(s) - v_{\mathbf{w}}(\bar s)\big)\, \nabla_{\mathbf{w}} v_{\mathbf{w}}(\bar s)$, or (ii) perform a distributional backward planning update for all $\bar s \in \mathcal{S}$, in proportion to the probability given by the backward distribution model: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\, \bar P(\bar s \mid s)\, \big(\bar r(\bar s, s) + \gamma v_{\mathbf{w}}(s) - v_{\mathbf{w}}(\bar s)\big)\, \nabla_{\mathbf{w}} v_{\mathbf{w}}(\bar s)$. In the general case, learning a full distribution model over the feature space is intractable. Alternatively, one can learn a backward generative model, sample predecessor features $\bar{\mathbf{x}} \sim \bar P(\cdot \mid \mathbf{x})$, and do one or more sample backward planning updates.
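The following is a minimal sketch of option (i) above, a sample-based backward planning update with linear features. It is our own illustration: the stand-in backward transition and reward models (`sample_predecessor`, `reward_model`) and the step size `alpha` are placeholders for whatever estimators the agent actually learns.

```python
import numpy as np

def backward_planning_step(w, s, sample_predecessor, reward_model, features, gamma, alpha):
    """Option (i): sample s_bar ~ P_bar(.|s) and move the value of the (possibly imagined)
    predecessor toward r_bar(s_bar, s) + gamma * v_w(s)."""
    s_bar = sample_predecessor(s)                        # s_bar ~ P_bar(. | s)
    x_bar, x = features(s_bar), features(s)
    target = reward_model(s_bar, s) + gamma * (w @ x)    # bootstrap on the freshly updated v_w(s)
    return w + alpha * (target - w @ x_bar) * x_bar      # only the predecessor's value is changed

# toy usage with hypothetical tabular models on 5 states
n_states, rng = 5, np.random.default_rng(0)
features = lambda s: np.eye(n_states)[s]                 # one-hot features
sample_predecessor = lambda s: rng.integers(n_states)    # stand-in for a learned backward model
reward_model = lambda s_bar, s: 0.0                      # stand-in backward reward model
w = np.zeros(n_states)
w = backward_planning_step(w, s=3, sample_predecessor=sample_predecessor,
                           reward_model=reward_model, features=features, gamma=0.9, alpha=0.1)
```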
One might hope that, in the linear setting, where the gradient has the special form $\nabla_{\mathbf{w}} v_{\mathbf{w}}(\bar{\mathbf{x}}) = \bar{\mathbf{x}}$, one could get away with learning backward expectation models over features and then performing an expected backward planning update. We find, however, that a direct counterpart of the forward expectation models is not a valid update, as it involves a product of two (possibly) dependent random variables (the TD error and the gradient of the value function evaluated at the predecessor features given by the model). However, learning an unusual type of model still results in valid parameter corrections: $\bar P_{\mathbf{x}}(\mathbf{x}) = \mathbb{E}[\bar{\mathbf{x}} \mid \mathbf{x}]$ is a backward expectation model, $\Sigma$ is a covariance matrix of the predecessor features, and $\bar r_{\boldsymbol{\theta}_r}$ is a vector reward model with parameters $\boldsymbol{\theta}_r$ (appendix C). Note that this model requires estimating three quantities.

There are several approaches to estimating $\bar P$, which can be characterized based on the constraints that they impose on the model. The standard approach is Maximum Likelihood Estimation (MLE) (appendix C): $\bar P_{\text{MLE}} = \arg\max_{\bar P \in \mathcal{P}} \sum_{i=1}^{n} \log \bar P(S_i \mid S'_i)$, where we use $\mathcal{P}$ to denote the model space and $\mathcal{D}_n = \{(S_i, A_i, R_i, S'_i)\}_{i=1}^{n}$ to represent data collected from interaction. Learning $\bar P$ by minimizing a negative-log loss or other probabilistic losses leads to a model that tries to estimate all aspects of the environment. Estimating the reward model $\bar r$ defaults to a regression problem. If, however, the true model does not belong to the model estimator's space and approximation errors exist, a planner-aware method can choose a model with minimum error with respect to the Planner's objective. Both forward and backward planning objectives for value-based methods try to find an approximation $v$ to $v_\pi$ by applying one step of a semi-gradient model-based TD update. A planner-aware model-learning objective is less constrained than the MLE objective, in that it only tries to ensure that replacing the true dynamics with the model is inconsequential for the internal mechanism of the Planner. In the extreme case, we note that one can potentially directly parametrize and estimate the expected parameter corrections or updates, thus learning a fully abstract model. Learning of this kind shadows the internal arrow of time of the model. The ultimate unconstrained objective could meta-learn the model, such that, after a model-learning update, the model would be useful for planning. We offer some directions for planner-aware model learning in appendix C and defer an in-depth investigation of such methods to future work.
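To give a feel for the two ends of this spectrum, the sketch below contrasts a maximum-likelihood criterion with a value-aware one on made-up 3-state dynamics and a fixed value function. It is only illustrative: the actual losses in the cited works differ (e.g., Farahmand et al. (2017) take a supremum over a value class), and all matrices here are invented for the example.

```python
import numpy as np

def nll_loss(P_hat, P_true):
    """Population negative log-likelihood (cross-entropy) of the learned kernel under the
    true dynamics: the MLE criterion tries to match every transition probability."""
    return -np.sum(P_true * np.log(P_hat + 1e-12)) / len(P_true)

def value_aware_loss(P_hat, P_true, v):
    """Penalize the model only where it changes the planner's one-step backup E[v(S')],
    an illustrative squared-error stand-in for value-aware model learning."""
    return np.mean(((P_hat - P_true) @ v) ** 2)

P_true = np.array([[0.5, 0.5, 0.0],
                   [0.0, 0.5, 0.5],
                   [0.5, 0.0, 0.5]])
v = np.array([0.0, 0.0, 1.0])          # the planner only cares about reaching state 2

P_a = np.array([[0.45, 0.45, 0.10],    # mildly wrong everywhere, including about state 2
                [0.00, 0.40, 0.60],
                [0.40, 0.10, 0.50]])
P_b = np.array([[0.9, 0.1, 0.0],       # badly wrong about states 0 and 1, exact about state 2
                [0.2, 0.3, 0.5],
                [0.3, 0.2, 0.5]])

print(nll_loss(P_a, P_true) < nll_loss(P_b, P_true))                         # True: MLE prefers model A
print(value_aware_loss(P_b, P_true, v) < value_aware_loss(P_a, P_true, v))   # True: value-aware prefers B
```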
4 Empirical Studies

Our empirical studies work to uncover the distinctions between planning in anticipation and in retrospect. With the aim of understanding the underlying properties of these approaches, we ask the following questions:

(i) How are the two planning algorithms distinct? When does it matter?

Figure 1: (Left, Center-Left) Complementary properties of forward and backward planning: backward models work well in channeling structures, with large fan-in and small fan-out, while forward models are better suited for broadcasting state formations. The y-axis shows the RMSVE, $\sqrt{\sum_s d_\pi(s)\,(v_{\mathbf{w}}(s) - v_\pi(s))^2}$. (Right, Center-Right) Inflection point: as we shift from channeling patterns (left) to broadcasting ones (right), the gain from one type of planning switches to the other, for both true and learned models. The y-axis shows the area under the curve (AUC) of the RMSVE (results are normalized by zero-centering and re-scaling by the range, max minus min).

To understand which structural attributes of the environment are important for the two mechanisms, and how such aspects interact with planning in online prediction settings, we use the following experimental setup.

Experimental setup. We explore the first question in a prediction setting using Markov reward processes whose states are organised as bipartite graphs with one (or more) sets of states (or levels), $\{x_i\}_{i \in [0:n_x]}$ and $\{y_j\}_{j \in [0:n_y]}$ (Fig. 2), where we vary $n_x$ and $n_y$ in our experiments. We additionally experiment with adding intermediary levels $\{z^l_k\}_{k \in [0:n^l_z],\, l \in [1:L]}$, where $L$ is the number of levels and $n^l_z$ is the size of level $l$. The states from a particular level transition only to states in the next level, thus establishing a particular flow and stationary structure of the Markov chain under study (a minimal code sketch of this layered construction is given after the observations below).

Figure 2: Random Chain: illustration of the Markov reward process used in the prediction experiments. The chain flows from left to right.

We refer to the number of predecessors/successors a state might have in the state space as fan-in/fan-out. The experiments are ablation studies of the effects of varying the fan-in ($n_x$), the fan-out ($n_y$) and the number of levels $L$ with their corresponding sizes $n^l_z$. For this investigation we performed two types of experiments: (i) on 3-level bipartite graphs, as illustrated in Fig. 1-Left, Center-Left; (ii) on 2-level bipartite graphs, as shown in Fig. 1-Center-Right, Right. For (i), the thumbnails depict the phenomenon of transitioning from a larger number of predecessors that funnel into a smaller number of successors, and vice-versa. The channeling pattern has the attributes $L = 1$, $n_x = 500$, $n^1_z = 50$, $n_y = 5$, which are the opposite of the broadcasting version: $L = 1$, $n_x = 5$, $n^1_z = 50$, $n_y = 500$. For (ii), the results reported are for $(n_x, n_y) \in \{(500, 5), (50, 5), (5, 5), (5, 50), (5, 500)\}$, where we labeled the x-axis with the simplified ratio.

Algorithms. For the purposes of this experiment we use backward planning for value prediction. We include complete pseudo-code for the backward planning algorithm used in this experiment in Algorithm 4, appendix D. For any transition $s \xrightarrow{a} s'$, we use the following reference frame to plan from: for backward models we use the current state $s'$ of a transition, whereas for forward models we use the previous state $s$ of a transition. The exact definition of reference frames is given in the next question we explore, corresponding to the next experiment.

Context & observations. Our studies identify an interesting phenomenon: the gain of the two planning mechanisms relies on two state attributes, which we call fan-in and fan-out. We illustrate this observation in the prediction setting presented above, depicted in the diagrams of Fig. 1 and detailed in appendix D. We observe that large fan-in and small fan-out is better suited for backward planning, since backward planning updates many prior states at once (due to the large fan-in) and, in these settings, due to the small fan-out, backward models propagate lower-variance updates (see Fig. 1-Left). Intuitively, when many trajectories end up with the same outcome, all prior states' values can be updated with the new information available at the current state. This pattern, which we call channeling, also abates in part the vulnerability of backward models in updating states in a sample-based manner (i.e. states from which we correct predictions use a single sample, instead of the whole expectation, as is the case for forward models). In contrast, forward models better fit a broadcasting formation (Fig. 1-Center-Left). A backward model for this regime would be closer to factual TD and less efficient. Its predicted past states would need updates from many different successor states to construct accurate predictions. As we shift from the pattern of large fan-in/small fan-out to the opposite end, we notice a shift in the performance of the two planning mechanisms (Fig. 1-Right, Center-Right).
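As a concrete reference for the channeling and broadcasting structures discussed above, here is a minimal sketch of a layered transition matrix with the level sizes used in experiment (i). The paper's exact generation procedure (including rewards and any randomized connectivity) is in its appendix D, which is not reproduced here; the uniform transitions and the wrap-around from the last level back to the first are our simplifying assumptions to keep the chain ergodic.

```python
import numpy as np

def layered_mrp(level_sizes):
    """Transition matrix of an MRP whose states are organised in levels: every state
    transitions uniformly to the states of the next level, and the last level wraps
    back to the first so the chain keeps 'flowing' left to right (cf. Fig. 2)."""
    n = sum(level_sizes)
    starts = np.cumsum([0] + list(level_sizes))
    P = np.zeros((n, n))
    for l in range(len(level_sizes)):
        nxt = (l + 1) % len(level_sizes)
        dst = np.arange(starts[nxt], starts[nxt + 1])
        for s in range(starts[l], starts[l + 1]):
            P[s, dst] = 1.0 / len(dst)   # uniform fan-out into the next level
    return P

P_channel = layered_mrp([500, 50, 5])    # channeling: large fan-in, small fan-out
P_broadcast = layered_mrp([5, 50, 500])  # broadcasting: small fan-in, large fan-out
assert np.allclose(P_channel.sum(axis=1), 1.0) and np.allclose(P_broadcast.sum(axis=1), 1.0)
```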
Implications. These results highlight one aspect of the problem central to the success of planning: the breadth of backward and forward projections. Namely, we find anticipation to be sensible when the future is wide-ranging and predictable, and hindsight to work well when new discoveries affect many prior beliefs with certainty and to a great extent. Concurrently, these insights lay the groundwork for the development of new planning algorithms that dynamically choose where to plan to and from, seamlessly blending forethought and hindsight.

(ii) Does it matter where the agent plans from? What is the effect of shifting the frame of reference used in planning?

Experimental setup. In this experiment, as well as the following one, we perform ablation studies on the discrete navigation task from Sutton and Barto (2018), illustrated in Fig. 4 (details in appendix D).

Algorithms. We operate in the control setting, for which we describe the algorithms we use, Online Forward-Dyna and Online Backward-Dyna (similar in nature to Sutton, 1990b; van Hasselt et al., 2019), in Algorithms 2 and 3, respectively (details in appendix D). In brief, both algorithms interlace additional steps of model learning and planning in-between steps of model-free Q-learning.¹

Context & observations. We now ask whether the frame of reference (the input state of the forward and backward Planner, respectively), from which the agent starts planning, matters, and if so, why. More precisely, consider a transition $s \xrightarrow{a} s'$ and note that we could use either $s$ or $s'$ as input to each planning algorithm. To show the effects of changing this frame of reference, we consider the control setting described at the beginning of the section and compare the action-value function variants that employ each of the proposed planning mechanisms, namely Online Forward-Dyna and Online Backward-Dyna (see appendix D for details).

Algorithm 2: Online Forward-Dyna: Learning, Acting & Forward Planning
1: Input: policy $\pi$, $n$
2: $s \leftarrow$ env()
3: for each interaction $\{1, 2, \ldots, T\}$ do
4:   $a \leftarrow \arg\max_a q(s, a)$
5:   $r, \gamma, s' \leftarrow$ env($a$)
6:   $P, \tilde r, \gamma \leftarrow$ model_learning_update($s, a, s'$)
7:   $q \leftarrow$ learning_update($s, a, r, \gamma, s'$)
8:   $s_{\text{ref}} \leftarrow$ planning_reference_state($s, s'$)
9:   for each $a \in \mathcal{A}$ do
10:    for each $s' \in \mathcal{S}$ do
11:      $y = \tilde r(s') + \gamma(s') \max_{a'} q(s', a')$
12:      $\Delta(s_{\text{ref}}, a) \leftarrow \Delta(s_{\text{ref}}, a) + P(s' \mid s_{\text{ref}}, a)\,(y - q(s_{\text{ref}}, a))$
13:    $q(s_{\text{ref}}, a) \leftarrow q(s_{\text{ref}}, a) + \Delta(s_{\text{ref}}, a)$
14:  $s \leftarrow s'$

Algorithm 3: Online Backward-Dyna: Learning, Acting & Backward Planning
1: Input: policy $\pi$, $n$
2: $s \leftarrow$ env()
3: for each interaction $\{1, 2, \ldots, T\}$ do
4:   $a \leftarrow \arg\max_a q(s, a)$
5:   $r, \gamma, s' \leftarrow$ env($a$)
6:   $\bar P, \tilde r, \gamma \leftarrow$ model_learning_update($s, a, s'$)
7:   $q \leftarrow$ learning_update($s, a, r, \gamma, s'$)
8:   $s_{\text{ref}} \leftarrow$ planning_reference_state($s, s'$)
9:   for each $\bar s \in \mathcal{S}$, $\bar a \in \mathcal{A}$ do
10:    $y = \tilde r(s_{\text{ref}}) + \gamma \max_{\bar a} q(s_{\text{ref}}, \bar a)$
11:    $\Delta(\bar s, \bar a) \leftarrow \bar P(\bar s, \bar a \mid s_{\text{ref}})\,(y - q(\bar s, \bar a))$
12:    $q(\bar s, \bar a) \leftarrow q(\bar s, \bar a) + \Delta(\bar s, \bar a)$
13:  $s \leftarrow s'$

We compare the pure model-based setting and the full learning framework (learning & planning). In the full learning framework, shown in Fig. 3-Left, planning backward from $s$ implies the use of new knowledge about the current prediction, as we bootstrap on the value at $s$ that has recently been updated, $\max_a q(s, a)$; in contrast, planning from $s'$ achieves a somewhat different effect: it complements model-free learning (as it bootstraps on $q(s', \cdot)$, which has not changed, but may still benefit from the reward information $\tilde r(s')$). Contrary to backward planning, the forward counterpart gains from being as anticipatory as possible, by planning from the current state $s'$.

Figure 3: Planning frame of reference. (Left) In the full learning setting (learning and planning), the agent is more effective when planning backward from $s$ and planning forward from $s'$. (Right) In the pure planning setting, both planning mechanisms assume the role of learning and gain more by processing exactly the opposite states of the full-learning case (Left), remaining in antithesis.
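To make the planning sweep of Algorithm 3 concrete, below is a minimal tabular sketch (our own rendering, not the authors' code). The arrays `P_bar` and `r_bar` stand in for the learned backward transition and reward models, and the step size `alpha` is our addition; Algorithm 3 applies the increment directly.

```python
import numpy as np

def backward_dyna_planning_step(q, s_ref, P_bar, r_bar, gamma, alpha):
    """One backward planning sweep of Algorithm 3: re-evaluate every (s_bar, a_bar) pair,
    weighted by how likely the backward model deems it a predecessor of s_ref."""
    n_states, n_actions = q.shape
    y = r_bar[s_ref] + gamma * q[s_ref].max()       # target bootstraps on the value at s_ref
    for s_bar in range(n_states):
        for a_bar in range(n_actions):
            delta = P_bar[s_bar, a_bar, s_ref] * (y - q[s_bar, a_bar])
            q[s_bar, a_bar] += alpha * delta        # update weighted by predecessor probability
    return q

# toy usage: 4 states, 2 actions, a uniform (hypothetical) backward model
n_states, n_actions = 4, 2
q = np.zeros((n_states, n_actions))
P_bar = np.full((n_states, n_actions, n_states), 1.0 / (n_states * n_actions))  # P_bar(s_bar, a_bar | s)
r_bar = np.ones(n_states)          # stand-in backward reward model evaluated at s_ref
q = backward_dyna_planning_step(q, s_ref=2, P_bar=P_bar, r_bar=r_bar, gamma=0.9, alpha=0.5)
```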
The effects are reversed in the pure planning setting (Fig. 3-Right). Specifically, the backward model cannot rely on model-free learning to re-evaluate predecessor predictions with values at $s$, since these are no longer changed by the learning process; it thus assumes that role. Simultaneously, backward planning is more efficient at state $s'$, since it benefits from the current additional transition $s \xrightarrow{a} s'$. Likewise, forward planning is more reliable in $s$ by the same argument, and also assumes the role of learning.

Implications. These results emphasize that both planning mechanisms work best when they complement model-free learning, if it is used, and both can take on its role, if it is not.

¹ N.B. Despite the tabular setting, learning is online and planning uses parametric models only in reference to the current transition. This is because we are interested in insights that transfer to more complex environments.

(iii) How does planning influence behaviour?

Figure 4: Information propagation in stochastic settings: backward models can propagate new information faster in stochastic reward settings and are more robust to randomness in the dynamics. Planning with the true forward model emphasizes the issues with forward planning (planning with the true backward model is omitted, as it depends on the constantly changing policy).

Experimental setup. This experiment is done in the same control setting described above.²

² For readers familiar with standard Dyna results, we note that our setting differs from the conventional Dyna setting, as do our results. Whereas classically, tabular Dyna agents assume access to the entire state space when deciding where to plan from, we deliberately do not make this assumption.

Context & observations. Our results provide evidence for the following observations: (i) errors in the backward model, caused by stochastic transitions, are less damaging for credit assignment (Fig. 4-Top-Right); (ii) backward planning accelerates the search for the optimal policy in the presence of stochastic rewards (Fig. 4-Bottom-Left and Bottom-Right); (iii) for extremely stochastic rewards, even backward models fail to capture the dynamics accurately enough (Fig. 4-Bottom-Right); (iv) model misspecification affects forward planning to a deeper extent.

Implications. These results emphasize the potential impact, in environments with high stochasticity, of a different pattern of reasoning, related to counterfactual learning. In particular, an agent can project back to a potential causal state and rethink its decisions after each new experience. More investigation of this idea would be useful.

5 Related Work

Backward models and prioritized sweeping (Moore and Atkeson, 1993; Peng and Williams, 1993; McMahan and Gordon, 2005; Sutton et al., 2012; van Seijen and Sutton, 2013) have been used within the Dyna framework (Sutton, 1990a,b), mostly in combination with heuristic strategies for ordering backups. Other search-control strategies have explored the use of non-parametric models for prioritizing experience (Schaul et al., 2016; Pan et al., 2018). The experiments of van Hasselt et al. (2019) with backward models for credit assignment motivated our study; we worked to understand these issues in depth, extending and complementing their results. Concurrently with our work, Jafferjee et al. (2020) also builds on the work of van Hasselt et al.
(2019), and investigates model misspecification for backward planning with linear models. Goyal et al. (2019) proposes predicting predecessor states on a trajectory, similarly to the ideas of episodic control (Lengyel and Dayan, 2007). Their approach uses imitation learning on a generative model's outputs to improve exploration by incentivizing the agent towards the high-value states on which the model was trained. In contrast, we aim to formalize and tease apart the fundamental properties of online hindsight planning. Schroecker et al. (2019) also applies generative predecessor models for imitation learning with policy gradient methods. Interestingly, research reveals that planning in the brain might also involve two processes that resemble forward and backward planning (Mattar and Daw, 2018). Other research into credit assignment with hindsight reasoning uses either a structural causal model to reason about the causes of events from trajectories (Buesing et al., 2019) or an inverse dynamics model conditioned on future outcomes to determine the relevance of past actions to particular outcomes (Harutyunyan et al., 2019b). The temporal value transport of Hung et al. (2019) uses an attention mechanism over an external memory to jump over irrelevant parts of trajectories and propagate credit backward in time. Ke et al. (2018) similarly attempts to propagate credit backward, yet their approach uses an RNN with learned associative connections to do so. Pitis (2019) addresses the problem of credit assignment by proposing a tabular version of cumulative predecessor representations. van Hasselt et al. (2020) expands this to the general case of expected eligibility traces. Satija et al. (2020) use backward value functions to encode constraints for solving constrained MDPs (CMDPs) with safe policy improvements. Though both our work and theirs employ some form of retrospective knowledge, the contents, purposes and uses differ. The concurrent work of Zhang et al. (2020b) also utilizes backward value functions, albeit for anomaly detection. Harutyunyan et al. (2019a) also connects model entropy and fan-in, although their aim is option discovery. Model learning objectives and strategies that ground model predictions in estimates other than environment observations have been considered in Joseph et al. (2013); Farahmand et al. (2017); Silver et al. (2017); Farahmand (2018); Oh et al. (2017); Farquhar et al. (2017); D'Oro et al. (2019); Luo et al. (2019); Schrittwieser et al. (2019); Abachi et al. (2020); Ayoub et al. (2020).

6 Closing and Future Work

We explored several questions pertaining to the nature, use and attributes of planning. We provided motivation, formalism and insight for hindsight planning, emphasizing the properties of backward models. In particular, we looked at how forward and backward planning exhibit complementary gains in opposite settings. Further, we highlighted a spectrum of model learning objectives that increasingly add more flexibility. Lastly, we performed ablation studies that reveal interesting properties about the nature of planning, its instruments (models), and the context in which they operate.
The key takeaways from our work are: (i) the problem dimension of the transition dynamics, resulting from the world dynamics and the agent's policy, is of great importance for the effectiveness of planning; we demonstrated that the two planning mechanisms exhibit complementary properties: if the future is broad and predictable, forethought is invaluable; if backward hypotheses are causal and less determined by chance, hindsight planning is effective; (ii) planning behaves differently in anticipation and in retrospect; (iii) the states we pick to plan from matter, and the best states to consider for forward and backward planning differ; (iv) planning is complementary to model-free updates, and should be aware of these updates; for instance, the best use of planning can depend on which states are being updated model-free; (v) backward planning can be favorable over forward planning in stochastic environments, by proposing a different pattern of reasoning, related to counterfactual learning, resulting in more efficient credit assignment.

Planning is a very broad topic, and much interesting work remains to be done; we name some directions that seem most exciting to us. The interaction of planning with channeling and broadcasting patterns raises interesting questions in relation to time-extended models, for instance in terms of choosing where to plan from and to (e.g., cf. Harutyunyan et al., 2019a). Planning backwards with time-extended models has received little attention in the literature, but appears promising. Equally, these insights pave the way for future work on developing new planning algorithms that take advantage of this property to seamlessly and optimally integrate both forethought and hindsight. Lastly, we conjecture a potential virtuous cycle of backward planning and generalization, in which the former can complement the latter, either by connecting areas of the latent space where generalization is poor or, ideally, by contributing to the construction of a better representation. Generalization, in turn, can broaden the effect of backward planning through abstraction.

Broader Impact

Our work deals with fundamental insights related to the nature of model-based reinforcement learning. The problem of building models of the world and planning with them to achieve desired objectives is of paramount importance for real-world applications of intelligent systems. However, in this work, we do not focus on applications, but instead look at the problem from a theoretical and investigative angle and largely treat it conceptually. As such, we consider this section not to be applicable to our setting.

References

Abachi, R., Ghavamzadeh, M., and Farahmand, A. (2020). Policy-aware model learning for policy gradient methods. CoRR, abs/2003.00030.
Ayoub, A., Jia, Z., Szepesvári, C., Wang, M., and Yang, L. (2020). Model-based reinforcement learning with value-targeted regression. In ICML.
Buesing, L., Weber, T., Zwols, Y., Heess, N., Racanière, S., Guez, A., and Lespiau, J. (2019). Woulda, coulda, shoulda: Counterfactually-guided policy search. In ICLR 2019. OpenReview.net.
Deisenroth, M. P., Fox, D., and Rasmussen, C. E. (2015). Gaussian processes for data-efficient learning in robotics and control. IEEE Trans. Pattern Anal. Mach. Intell., 37(2):408-423.
D'Oro, P., Metelli, A. M., Tirinzoni, A., Papini, M., and Restelli, M. (2019). Gradient-aware model-based policy search. CoRR, abs/1909.04115.
Farahmand, A. (2018). Iterative value-aware model learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 9090-9101.
Farahmand, A. M., Barreto, A., and Nikovski, D. (2017). Value-aware loss function for model-based reinforcement learning. In AISTATS 2017, volume 54 of Proceedings of Machine Learning Research, pages 1486-1494. PMLR.
Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. (2017). TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. CoRR, abs/1710.11417.
Gelada, C. and Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI.
Goyal, A., Brakel, P., Fedus, W., Singhal, S., Lillicrap, T. P., Levine, S., Larochelle, H., and Bengio, Y. (2019). Recall traces: Backtracking models for efficient reinforcement learning. In ICLR 2019. OpenReview.net.
Ha, D. and Schmidhuber, J. (2018). Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 2455-2467.
Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation. arXiv, abs/1702.07121.
Harutyunyan, A., Dabney, W., Borsa, D., Heess, N., Munos, R., and Precup, D. (2019a). The termination critic. In AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pages 2231-2240. PMLR.
Harutyunyan, A., Dabney, W., Mesnard, T., Azar, M. G., Piot, B., Heess, N., van Hasselt, H., Wayne, G., Singh, S., Precup, D., and Munos, R. (2019b). Hindsight credit assignment. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 12467-12476.
Hester, T. and Stone, P. (2013). TEXPLORE: Real-time sample-efficient reinforcement learning for robots. Mach. Learn., 90(3):385-429.
Hung, C.-C., Lillicrap, T. P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., and Wayne, G. (2019). Optimizing agent behavior over long time scales by transporting value. Nature Communications, 10.
Jafferjee, T., Imani, E., Talvitie, E., White, M., and Bowling, M. (2020). Hallucinating value: A pitfall of Dyna-style planning with imperfect environment models. CoRR, abs/2006.04363.
Joseph, J. M., Geramifard, A., Roberts, J. W., How, J. P., and Roy, N. (2013). Reinforcement learning with misspecified model classes. In 2013 IEEE International Conference on Robotics and Automation (ICRA), pages 939-946. IEEE.
Ke, N. R., Goyal, A., Bilaniuk, O., Binas, J., Mozer, M., Pal, C., and Bengio, Y. (2018). Sparse attentive backtracking: Temporal credit assignment through reminding. In NeurIPS.
LaValle, S. M. (2006). Planning Algorithms. Cambridge University Press, Cambridge, U.K. Available at http://planning.cs.uiuc.edu/.
Lengyel, M. and Dayan, P. (2007). Hippocampal contributions to control: The third way. In NIPS 2007, pages 889-896. Curran Associates Inc.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3):293-321.
Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. (2019). Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In ICLR 2019. OpenReview.net.
MacKay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge University Press.
Mahmood, A. R., Yu, H., White, M., and Sutton, R. S. (2015). Emphatic temporal-difference learning. arXiv, abs/1507.01569.
Mattar, M. G. and Daw, N. D. (2018). Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience, 21(11):1609-1617.
McMahan, H. B. and Gordon, G. J. (2005). Fast exact planning in Markov decision processes. In ICAPS 2005, pages 151-160. AAAI.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.
Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Mach. Learn., 13:103-130.
Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. (2010). Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Comput., 22(2):342-376.
Oh, J., Hessel, M., Czarnecki, W., Xu, Z., Hasselt, H. V., Singh, S., and Silver, D. (2020). Discovering reinforcement learning algorithms. arXiv, abs/2007.08794.
Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6118-6128.
Pan, Y., Zaheer, M., White, A., Patterson, A., and White, M. (2018). Organizing experience: A deeper look at replay mechanisms for sample-based planning in continuous state domains. CoRR, abs/1806.04624.
Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML 2008, pages 752-759. ACM.
Peng, J. and Williams, R. J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437-454.
Pitis, S. (2019). Source traces for temporal difference learning. CoRR, abs/1902.02907.
Precup, D., Sutton, R. S., and Singh, S. P. (1998). Theoretical results on reinforcement learning with temporally abstract options. In Machine Learning: ECML-98, pages 382-393.
Puterman, M. L. (1994). Markov Decision Processes. Wiley.
Satija, H., Amortila, P., and Pineau, J. (2020). Constrained Markov decision processes via backward value functions.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In ICLR 2016.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T. P., and Silver, D. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. CoRR, abs/1911.08265.
Schroecker, Y., Vecerík, M., and Scholz, J. (2019). Generative predecessor models for sample-efficient imitation learning. CoRR, abs/1904.01139.
Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D. P., Rabinowitz, N. C., Barreto, A., and Degris, T. (2017). The predictron: End-to-end learning and planning. In ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 3191-3199. PMLR.
Sutton, R. and Barto, A. G. (1981). An adaptive network that constructs and uses an internal model of its world.
Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112:181-211.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44.
Sutton, R. S. (1990a). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224. Morgan Kaufmann.
Sutton, R. S. (1990b). Integrated modeling and control based on reinforcement learning. In Advances in Neural Information Processing Systems 3 (NIPS 1990), pages 471-478. Morgan Kaufmann.
Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Proceedings of the Twelfth International Conference on Machine Learning, pages 531-539.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. arXiv, abs/1503.04269.
Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS 2011, pages 761-768.
Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. (2012). Dyna-style planning with linear function approximation and prioritized sweeping. CoRR, abs/1206.3285.
van Hasselt, H., Hessel, M., and Aslanides, J. (2019). When to use parametric models in reinforcement learning? CoRR, abs/1906.05243.
van Hasselt, H., Madjiheurem, S., Hessel, M., Silver, D., Barreto, A., and Borsa, D. (2020). Expected eligibility traces. arXiv, abs/2007.01839.
van Seijen, H. and Sutton, R. S. (2013). Planning by prioritized sweeping with small backups. In ICML 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 361-369. JMLR.org.
Van Seijen, H. and Sutton, R. S. (2015). A deeper look at planning as learning from replay. In ICML 2015, pages 2314-2322. JMLR.org.
Wan, Y., Zaheer, M., White, A., White, M., and Sutton, R. S. (2019). Planning with expectation models. CoRR, abs/1904.01191.
Watkins, C. J. C. H. and Dayan, P. (1992). Technical note: Q-learning. Mach. Learn., 8(3-4):279-292.
Xu, Z., Hasselt, H. V., Hessel, M., Oh, J., Singh, S., and Silver, D. (2020). Meta-gradient reinforcement learning with an objective discovered online. arXiv, abs/2007.08433.
Xu, Z., van Hasselt, H., and Silver, D. (2018). Meta-gradient reinforcement learning. CoRR, abs/1805.09801.
Zhang, S., Liu, B., Yao, H., and Whiteson, S. (2020a). Provably convergent two-timescale off-policy actor-critic with function approximation. arXiv.
Zhang, S., Veeriah, V., and Whiteson, S. (2020b). Learning retrospective knowledge with reverse reinforcement learning.