# Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

Lars Buesing, Théophane Weber, Yori Zwols, Sébastien Racanière, Arthur Guez, Jean-Baptiste Lespiau, Nicolas Heess
DeepMind
lbuesing@google.com

Published as a conference paper at ICLR 2019.

## ABSTRACT

Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e. actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling, which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradient can be interpreted as counterfactual methods.

## 1 INTRODUCTION

Imagine that a month ago Alice had two job offers from companies $a_1$ and $a_2$. She decided to join $a_1$ because of the larger salary, in spite of an awkward feeling during the job interview. Since then she has learned a lot about $a_1$ and recently received information about $a_2$ from a friend, prodding her now to imagine what would have happened had she joined $a_2$. Re-evaluating her decision in hindsight in this way, she concludes that she made a regrettable decision. She could and should have known that $a_2$ was the better choice, had she only interpreted the cues during the interview correctly... This example illustrates the everyday human capacity to reason about alternate, counterfactual outcomes of past experience, with the goal of "mining worlds that could have been" (Pearl & Mackenzie, 2018). Social psychologists theorize that such cognitive processes are beneficial for improving future decision making (Roese, 1997). In this paper we aim to leverage possible advantages of counterfactual reasoning for learning decision making in the reinforcement learning (RL) framework.

In spite of recent successes, learning policies with standard, model-free RL algorithms can be notoriously data inefficient. This issue can in principle be addressed by learning policies on data synthesized from a model. However, a mismatch between the model and the true environment, often unavoidable in practice, can cause this approach to fail (Talvitie, 2014), resulting in policies that do not generalize to the real environment (Jiang et al., 2015).
Motivated by the introductory example, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm: instead of relying on data synthesized from scratch by a model, policies are trained on model predictions of alternate outcomes of past experience from the true environment under counterfactual actions, i.e. actions that had not actually been taken, while everything else remains the same (Pearl, 2009). At the heart of CF-GPS are structural causal models (SCMs), which model the environment with two ingredients (Wright, 1920): 1) independent random variables, called scenarios here, summarize all aspects of the environment that cannot be influenced by the agent, e.g. the properties of the companies in Alice's job search example; 2) deterministic transition functions (also called causal mechanisms) take these scenarios, together with the agent's actions, as input and produce the predicted outcome. The central idea of CF-GPS is that, instead of running the agent on scenarios sampled de novo from a model, scenarios are inferred in hindsight from given off-policy data, and the agent is then evaluated and improved on these specific scenarios using given or learned causal mechanisms (Balke & Pearl, 1994). CF-GPS can be regarded as a meta-algorithm that extends a given model-based RL algorithm by grounding, or anchoring, model-based predictions in inferred scenarios. As a result, this approach explicitly allows trading off historical data against model bias. We show empirically, in a conceptually simple setting where unknown initial states are inferred in hindsight and re-used to evaluate counterfactual actions, that this can mitigate model mismatch. CF-GPS differs substantially from standard off-policy RL algorithms based on Importance Sampling (IS), where historical data is re-weighted with importance weights to evaluate or learn new policies (Precup, 2000). In contrast, CF-GPS explicitly reasons counterfactually about given off-policy data. Our main contributions are:

1. We formulate model-based RL in POMDPs in terms of structural causal models, thereby connecting concepts from reinforcement learning and causal inference.
2. We provide the first results, to the best of our knowledge, showing that counterfactual reasoning in structural causal models on off-policy data can facilitate solving non-trivial RL tasks.
3. We show that two previously proposed classes of RL algorithms, namely Guided Policy Search (Levine & Koltun, 2013) and Stochastic Value Gradient methods (Heess et al., 2015), can be interpreted as counterfactual methods, opening up possible generalizations.

The paper is structured as follows. We first give a self-contained, high-level recapitulation of structural causal models and counterfactual inference, as these are less widely known in the RL and generative model communities; in particular, we show how to model POMDPs with SCMs. Based on this exposition, we then consider the task of policy evaluation and discuss how we can leverage counterfactual inference in SCMs to improve over standard model-based methods. We then generalize this approach to the policy search setting, resulting in the CF-GPS algorithm. We close by highlighting connections to previously proposed algorithms and by discussing assumptions and limitations of the proposed method.

## 2 PRELIMINARIES

We denote random variables (RVs) with capital letters, e.g. X, and particular values with lower-case letters, e.g. x.
For a distribution P over a vector-valued random variable X, we denote the marginal over $Y \subseteq X$ by $P_Y$ (with density $p_Y$); however, we often omit the subscript if it is clear from the context, e.g. as in $Y \sim P$. We assume the episodic, Partially Observable Markov Decision Process (POMDP) setting with states $S_t$, actions $A_t$ and observations $O_t$, for $t = 1, \ldots, T$. For ease of notation, we assume that $O_t$ includes the reward $R_t$. The undiscounted return is denoted by $G = \sum_{t=1}^{T} R_t$. We consider stochastic policies $\pi(a_t \mid h_t)$ over actions conditioned on observation histories $H_t = (O_1, A_1, \ldots, A_{t-1}, O_t)$. We denote the resulting distribution over trajectories $\mathcal{T} = (S_1, O_1, A_1, \ldots, A_{T-1}, S_T, O_T)$ induced by running π in the environment by $\mathcal{T} \sim P_\pi$, and the corresponding density by $p_\pi(\tau)$.

### 2.1 STRUCTURAL CAUSAL MODELS

**Definition 1 (Structural causal model).** A structural causal model (SCM) M over $X = (X_1, \ldots, X_N)$ is given by a DAG G over the nodes X, independent noise RVs $U = (U_1, \ldots, U_N)$ with distributions $P_{U_i}$, and functions $f_1, \ldots, f_N$ such that $X_i = f_i(\mathrm{pa}_i, U_i)$, where $\mathrm{pa}_i \subseteq X$ are the parents of $X_i$ in G. An SCM entails a distribution P with density p over $(X, U)$.

Figure 1: Structural causal models (SCMs) model environments using random variables U (circles, "scenarios") that summarize immutable aspects, some of which are observed (grey), some not (white). These are fed into deterministic functions $f_i$ (black squares) that approximate causal mechanisms. Left: SCM for a contextual bandit with context $U_c$, action A, feedback O and scenario $U_o$. Right: SCM for a POMDP, with initial state $U_{s_1} = S_1$, states $S_t$ and histories $H_t$. The mechanism that generates the actions $A_t$ is the policy π.

We also refer to U as scenarios and to the $f_i$ as causal mechanisms. We give a (broad) definition of an intervention in an SCM. It also includes what are known as stochastic interventions or mechanism changes (Korb et al., 2004), which generalize atomic interventions (Pearl, 2009).

**Definition 2 (Intervention in SCM).** An intervention I in an SCM M consists of replacing some of the original functions $f_i(\mathrm{pa}_i, U_i)$ with other functions $f_i^I(\mathrm{pa}_i^I, U_i)$, where $\mathrm{pa}_i^I$ are the parents in a new DAG $G^I$. We denote the resulting SCM by $M^{do(I)}$, with distribution $P^{do(I)}$ and density $p^{do(I)}$.

**SCM representation of POMDPs** We can represent any given POMDP (under a policy π) by an SCM M over trajectories $\mathcal{T}$ in the following way. We express all conditional distributions, e.g. the transition kernel $P_{S_{t+1} \mid S_t, A_t}$, as deterministic functions with independent noise variables U, such as $S_{t+1} = f_{s_t}(S_t, A_t, U_{s_t})$. This is always possible using auto-regressive uniformization, see Lemma 2 in the appendix. The DAG G of the resulting SCM is shown in fig. 1. This procedure is closely related to the reparameterization trick for models with location-scale distributions (Kingma & Welling, 2013; Rezende et al., 2014). We denote the distribution over $\mathcal{T}$ entailed by the SCM by $\tilde{P}_\pi$ and its density by $\tilde{p}_\pi$ to highlight the role of π; note the difference to the true environment distribution $P_\pi$ with density $p_\pi$. Running a different policy µ instead of π in the environment can be expressed as an intervention $I(\pi \to \mu)$ consisting of replacing $A_t = f_\pi(H_t, U_{a_t})$ by $A_t = f_\mu(H_t, U_{a_t})$. We denote the resulting model distribution over trajectories by $\tilde{P}^{do(I(\pi \to \mu))} = \tilde{P}_\mu$ (analogously $\tilde{p}_\mu$).
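To make the SCM representation concrete, the following minimal Python sketch writes a toy one-dimensional "POMDP" as deterministic mechanisms driven by independent noise, and expresses a policy change as an intervention that swaps the action mechanism while the sampled scenario is held fixed. This is purely illustrative: the mechanisms, noise distributions and names (`f_s`, `f_o`, `f_pi`, `rollout`, etc.) are toy assumptions made here, not the models used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise ("scenario") distributions: all independent of the agent's actions.
def sample_noise(T):
    return {
        "u_s1": rng.normal(),          # initial-state scenario U_s1
        "u_s": rng.normal(size=T),     # transition noise U_st
        "u_o": rng.normal(size=T),     # observation noise U_ot
        "u_a": rng.uniform(size=T),    # action noise U_at
    }

# Deterministic causal mechanisms f_i (toy linear-Gaussian choices).
f_s1 = lambda u_s1: u_s1
f_s  = lambda s, a, u: 0.9 * s + a + 0.1 * u   # S_{t+1} = f_s(S_t, A_t, U_st)
f_o  = lambda s, u: s + 0.5 * u                # O_t = f_o(S_t, U_ot)
f_pi = lambda o, u: 1.0 if u < 0.5 else -1.0   # A_t = f_pi(H_t, U_at), a random policy

def rollout(policy, noise, T=5):
    """Run the SCM forward for T steps under a given action mechanism."""
    s = f_s1(noise["u_s1"])
    traj = []
    for t in range(T):
        o = f_o(s, noise["u_o"][t])
        a = policy(o, noise["u_a"][t])
        traj.append((s, o, a))
        s = f_s(s, a, noise["u_s"][t])
    return traj

noise = sample_noise(T=5)
traj_pi = rollout(f_pi, noise)        # behaviour policy
# Intervention I(pi -> mu): replace the action mechanism, keep all noise fixed.
f_mu = lambda o, u: 1.0               # a different (deterministic) policy
traj_mu = rollout(f_mu, noise)
```

Because the scenario `noise` is shared between the two rollouts, any difference between `traj_pi` and `traj_mu` is attributable to the intervention alone, which is exactly the property that the counterfactual reasoning below exploits.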
**Intuition** Here, we illustrate the main advantage of SCMs using the example of Alice's job choice from the introduction. We model it as a contextual bandit with feedback, shown in fig. 1 (left). Alice has some initial knowledge given by the context $U_c$ that is available to her before taking the action A of joining company $A = a_1$ or $A = a_2$. We model Alice's decision as $A = f_\pi(U_c, U_a)$, where $U_a$ captures potential indeterminacy in Alice's decision making. The outcome $O = f_o(A, U_c, U_o)$ also depends on the scenario $U_o$, which captures all relevant, unobserved and highly complex properties of the two companies, such as working conditions etc. Given this model, we can reason about alternate outcomes $f_o(a_1, u_c, u_o)$ and $f_o(a_2, u_c, u_o)$ for the same scenario $u_o$. This is not possible if we only model the outcome on the level of the conditional distribution $P_{O \mid A, U_c}$.

### 2.2 COUNTERFACTUAL INFERENCE IN SCMS

For an SCM over X, we define a counterfactual query as a triple $(\hat{x}_o, I, X_q)$ of observations $\hat{x}_o$ of some variables $X_o \subseteq X$, an intervention I, and query variables $X_q \subseteq X$. The semantics of the query are that, having observed $\hat{x}_o$, we want to infer what $X_q$ would have been had we performed intervention I, while "keeping everything else the same". Counterfactual inference (CFI) in SCMs answers the query in the following way (Balke & Pearl, 1994):

1. Infer the unobserved noise sources U conditioned on the observations $\hat{x}_o$, i.e. compute $p(U \mid \hat{x}_o)$ and replace the prior $p(U)$ with $p(U \mid \hat{x}_o)$. Denote the resulting SCM by $M_{\hat{x}_o}$.
2. Perform intervention I on $M_{\hat{x}_o}$. This yields $M^{do(I)}_{\hat{x}_o}$, which entails the counterfactual distribution $p^{do(I) \mid \hat{x}_o}(x)$. Return the marginal $p^{do(I) \mid \hat{x}_o}(x_q)$.

Note that our definition explicitly allows for partial observations $X_o \subseteq X$, in accordance with Pearl (2009). A sample-based version, denoted CFI, is presented in Algorithm 1. An interesting property of the counterfactual distribution $p^{do(I) \mid \hat{x}_o}$ is that marginalizing it over the observations $\hat{x}_o$ yields an unbiased estimator of the density of $X_q$ under intervention I.

**Lemma 1 (CFI for simulation).** Let observations $\hat{x}_o \sim p$ come from an SCM M with density p. Then the counterfactual density $p^{do(I) \mid \hat{x}_o}$ is an unbiased estimator of $p^{do(I)}$, i.e.
$$E_{\hat{x}_o \sim p}\big[p^{do(I) \mid \hat{x}_o}(x)\big] = p^{do(I)}(x).$$

The proof is straightforward and outlined in Appendix A. This lemma and the marginal independence of the $U_i$ lead to the following corollary; the proof is given in the appendix.

**Corollary 1 (Mixed counterfactual and prior simulation from an SCM).** Assume we have observations $\hat{x}_o \sim p$. We can simulate from M under any intervention I, i.e. obtain unbiased samples from $M^{do(I)}$, by first sampling values $u^{CF}$ for an arbitrary subset $U^{CF} \subseteq U$ from the posterior $p(u^{CF} \mid \hat{x}_o)$ and the remaining $U^{Prior} := U \setminus U^{CF}$ from the prior $p(u^{Prior})$, and then computing X from the combined noise $u = u^{CF} \cup u^{Prior}$.

The corollary essentially states that we can sample from the model $M^{do(I)}$ by sampling some of the $U_i$ from the prior and inferring the rest from data $\hat{x}_o$ (as long as the latter was also sampled from M). We will make use of this later for running a POMDP model on scenarios $U_{s_t}$ inferred from data while randomizing the action noise $U_{a_t}$. We note that the noise variables $U^{CF}$ from the posterior $P_{U^{CF} \mid \hat{x}_o}$ are no longer independent. Nevertheless, SCMs with non-independent noise distributions arising from counterfactual inference, denoted here by $M_{\hat{x}_o}$, are commonly considered in the literature (Peters et al., 2017).
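The two steps above can be spelled out in a few lines for a toy SCM. The sketch below is illustrative only (the mechanism `f`, its invertibility and all names are assumptions made here for brevity): abduction recovers the noise from the observed outcome, the intervention replaces the action, and the prediction pushes the inferred noise through the modified model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy SCM: X = f(A, U) with action A and scenario U ~ N(0, 1).
f = lambda a, u: 2.0 * a + u

# Observed episode: action a_hat was taken and outcome x_hat observed.
a_hat, u_true = 1.0, rng.normal()
x_hat = f(a_hat, u_true)

# Step 1 (abduction): infer the noise from the observation.  For this
# invertible additive mechanism the posterior p(U | x_hat, a_hat) is a
# point mass; in general one would sample from the posterior instead.
u_post = x_hat - 2.0 * a_hat

# Step 2 (intervention): replace the action mechanism, e.g. do(A = 0.0),
# and predict by pushing the inferred noise through the new mechanism.
a_cf = 0.0
x_cf = f(a_cf, u_post)          # "what X would have been had we done a_cf"

# Contrast: an interventional prediction re-samples U from the prior and
# ignores what was actually observed.
x_do = f(a_cf, rng.normal())
```

For invertible mechanisms the abduction step is deterministic; in general it requires sampling from the posterior over the noise, which is what the CFI procedure in Algorithm 1 below does.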
**Intuition** Returning to Alice's job example from the introduction, we give some intuition for counterfactual inference in SCMs. Given the concrete outcome $\hat{o}$, under the observed context $\hat{u}_c$ and having joined company $\hat{a} = a_1$, Alice can try to infer the underlying scenario $u_o \sim p(u_o \mid a_1, \hat{u}_c, \hat{o})$ that she experiences; this includes factors such as working conditions etc. She can then reason counterfactually about the outcome had she joined the other company, which is given by $f_o(a_2, \hat{u}_c, u_o)$. This can in principle enable her to make better decisions in the future in similar scenarios, by changing her policy $f_\pi(U_c, U_a)$ such that the action with the preferred outcome becomes more likely under $\hat{u}_c, u_o$. In particular, she can do so without having to use her (likely imperfect) prior model $p(U_o)$ over possible companies; she can instead use the counterfactual predictions discussed above to learn from her experience. We use this insight for counterfactual policy evaluation and search below.

## 3 OFF-POLICY EVALUATION: MODEL-FREE, MODEL-BASED AND COUNTERFACTUAL

To explain how counterfactual reasoning in SCMs can be used for policy search, we first consider the simpler problem of policy evaluation (PE) on off-policy data. The goal of off-policy PE is to determine the value of a policy π, i.e. its expected return $E_{p_\pi}[G]$, without running the policy itself. We assume that we have data $D = \{\hat{h}^i_T\}_{i=1,\ldots,N}$ consisting of logged episodes $\hat{h}^i_T = (\hat{o}^i_1, \hat{a}^i_1, \ldots, \hat{a}^i_{T-1}, \hat{o}^i_T)$ from running a behavior policy µ. A standard, model-free approach to PE is Importance Sampling (IS): we can estimate the policy's value as $\sum_i w_i \hat{G}_i$, where $\hat{G}_i$ is the empirical return of $\hat{h}^i_T$ and $w_i \propto p_\pi(\hat{h}^i_T) / p_\mu(\hat{h}^i_T)$ are importance weights. However, if the trajectory densities $p_\pi$ and $p_\mu$ are very different, this estimator has large variance. In the extreme case, IS can be useless if the support of $p_\mu$ does not contain that of $p_\pi$, irrespective of how much data from $p_\mu$ is available. If we have access to a model M, then we can evaluate the policy on synthetic data, i.e. we can estimate $E_{\tilde{p}_\pi}[G]$. This is called model-based policy evaluation (MB-PE). However, any bias in M propagates from $\tilde{p}_\pi$ to the estimate $E_{\tilde{p}_\pi}[G]$. In the following, we assume that M is an SCM and show that we can use counterfactual reasoning for off-policy evaluation (CF-PE).

**Algorithm 1** Counterfactual policy evaluation and search

    // Counterfactual inference (CFI)
    1:  procedure CFI(data x̂_o, SCM M, intervention I, query X_q)
    2:      û ∼ p(u | x̂_o)                   ▷ sample noise variables from the posterior
    3:      p(u) ← δ(u − û)                   ▷ replace the noise distribution in p with û
    4:      f_i ← f_i^I                       ▷ perform intervention I
    5:      return x_q ∼ p^do(I)(x_q | û)     ▷ simulate from the resulting model M^do(I)_x̂_o
    6:  end procedure
    // Counterfactual policy evaluation (CF-PE)
    7:  procedure CF-PE(SCM M, policy π, replay buffer D, number of samples N)
    8:      for i ∈ {1, ..., N} do
    9:          ĥ^i_T ∼ D                     ▷ sample from the replay buffer
    10:         g_i = CFI(ĥ^i_T, M, I(µ → π), G)   ▷ counterfactual evaluation of the return
    11:     end for
    12:     return (1/N) Σ_{i=1}^N g_i
    13: end procedure
    // Counterfactually-Guided Policy Search (CF-GPS)
    14: procedure CF-GPS(SCM M, initial policy π_0, number of trajectory samples N)
    15:     for k = 1, ... do
    16:         if sometimes then
    17:             µ ← π_k                   ▷ update the behavior policy
    18:         end if
    19:         for i = 1, ..., N do
    20:             ĥ^i_T ∼ p_µ               ▷ get off-policy data from the true environment
    21:             τ^i = CFI(ĥ^i_T, M, I(µ → λ), T)   ▷ counterfactual rollouts under the planner
    22:         end for
    23:         π_k ← policy improvement on trajectories τ^{1,...,N} using eqn. (1)
    24:     end for
    25: end procedure
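The sketch below mirrors the CFI and CF-PE procedures of Algorithm 1 for a toy scalar environment in which the only unobserved scenario is the initial state, so the abduction step is exact. All names are illustrative assumptions made here; the actual algorithm uses a learned SCM of PO-SOKOBAN and approximate posterior inference.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 10
f_s = lambda s, a: 0.9 * s + a           # deterministic transition mechanism
reward = lambda s, a: -abs(s)            # deterministic reward mechanism

def run_env(policy, s1, T=T):
    """Roll out a policy from initial scenario s1; return history and return G."""
    s, hist, g = s1, [], 0.0
    for _ in range(T):
        a = policy(s)
        g += reward(s, a)
        hist.append((s, a))
        s = f_s(s, a)
    return hist, g

mu = lambda s: float(rng.uniform(-1, 1))  # behavior policy (generates logged data)
pi = lambda s: -0.5 * s                   # policy to be evaluated

# Replay buffer D of off-policy episodes, each with an unknown initial scenario.
replay = [run_env(mu, rng.normal(scale=3.0))[0] for _ in range(100)]

def cfi_scenario(hist):
    """Abduction step: here the scenario (initial state) can be read off the
    history; in a POMDP it would be sampled from p(U_s1 | h_T)."""
    return hist[0][0]

def cf_pe(pi, replay):
    """Counterfactual policy evaluation: re-run pi on the inferred scenarios."""
    returns = [run_env(pi, cfi_scenario(h))[1] for h in replay]
    return float(np.mean(returns))

print("CF-PE estimate of E[G] under pi:", cf_pe(pi, replay))
```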
As the main result of this section, we argue that we expect CF-PE to be less biased than MB-PE, and we illustrate this point with experiments.

### 3.1 COUNTERFACTUAL OFF-POLICY EVALUATION

Naive MB-PE with an SCM M simply consists of sampling the scenarios $U \sim P_U$ from the prior, simulating a trajectory τ from the functions $f_i$, and computing its return. However, given data D from $p_\mu$, our discussion of counterfactual inference in SCMs suggests the following alternative strategy: assuming no model mismatch, i.e. $\tilde{p}_\mu = p_\mu$, we can regard the task of off-policy evaluation of π as a counterfactual query with data $\hat{h}^i_T$, intervention $I(\mu \to \pi)$ and query variable G. In other words, instead of sampling from the prior as in MB-PE, we are free to sample the scenarios from the posterior $u^i \sim \tilde{p}_\mu(\cdot \mid \hat{h}^i_T)$. The resulting algorithm is given in Algorithm 1. Lemma 1 guarantees that this yields an unbiased estimate:

**Corollary 2 (CF-PE is unbiased).** Assuming no model mismatch, CF-PE is unbiased.

Furthermore, Corollary 1 allows us to sample some of the noise variables from the prior instead of the posterior; e.g. we can randomize the counterfactual actions by re-sampling the action noise $U_a$.

**Motivation** When should one prefer CF-PE over the more straightforward MB-PE? Corollary 2 states that, for a perfect model, both yield the same answer in expectation. For imperfect models, however, these algorithms can differ substantially. MB-PE relies on purely synthetic data sampled from the noise distribution $p(U)$. In practice, this is usually approximated by a parametric density model, which can lead to under-fitting in the case of complex distributions. This is a well-known effect in generative models with latent variables: in spite of recent research progress, models of natural images, for example, are still unable to accurately capture the variability of the true data (Gregor et al., 2016). In contrast, CF-PE samples from the posterior $N^{-1}\sum_{i=1}^{N} \tilde{p}_\mu(U \mid \hat{h}^i_T)$, which has access to strictly more information than the prior $p(U)$ by taking into account the additional data $\hat{h}^i_T$. This semi-nonparametric distribution can help to de-bias the model by effectively winnowing out parts of the domain of U which do not correspond to any real data. We substantiate this intuition with experiments below; a concrete illustration of the difference between the prior and the posterior / counterfactual distribution is given in fig. 4 in the appendix and discussed in appendix D. We therefore expect CF-PE to outperform MB-PE when the transition and reward kernels $f_{s_t}$ are accurate models of the environment dynamics, but the marginal distribution $P_U$ over the noise sources is difficult to model.

### 3.2 EXPERIMENTS

**Environment** As an example, we use a partially observed variant of the SOKOBAN environment, which we call PO-SOKOBAN. The original SOKOBAN puzzle environment was described in detail by Racanière et al. (2017); we give a brief summary here. The agent is situated in a 10 × 10 grid world, and its five actions are moving to one of the four adjacent tiles and a NOOP. In our variant, the goal is to push all three boxes onto the three targets. As boxes cannot be pulled, many actions irreversibly result in unsolvable states. Episodes are of length T = 50; pushing a box onto a target yields a reward of 1, removing a box from a target yields −1, and solving a level results in an additional reward of 10. The state of the environment consists of a 10 × 10 matrix of categorical variables taking values in {0, ..., 6}, indicating whether the corresponding tile is empty, a wall, a box, a target, the agent, or a valid combination thereof (box+target and agent+target). To introduce partial observability, we define the observations as the state corrupted by i.i.d. noise that flips each categorical variable to the empty state with probability 0.9, independently for each tile and time step. Therefore, the state of the game is largely unobserved at any given time, and a successful agent has to integrate observations over tens of time steps.
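The corruption model is simple enough to state in a few lines; the sketch below is an illustrative stand-in (tile encoding, names and the random stand-in level are assumptions made here), not the environment code used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

EMPTY = 0          # categorical value for an empty tile (values 0..6)
P_FLIP = 0.9       # probability of flipping a tile to EMPTY in the observation

def corrupt(state, rng):
    """PO-SOKOBAN-style observation: each tile is independently replaced by
    EMPTY with probability P_FLIP, otherwise observed exactly."""
    mask = rng.random(state.shape) < P_FLIP
    obs = state.copy()
    obs[mask] = EMPTY
    return obs

state = rng.integers(0, 7, size=(10, 10))   # stand-in for a real level
obs = corrupt(state, rng)
print("fraction of tiles matching the true state:", np.mean(obs == state))
```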
Initial states $U_{s_1}$, also called levels, which are the scenarios in this environment, are generated randomly by a generator algorithm that guarantees solvability (i.e. all boxes can be pushed onto targets). The environment is visualized in fig. 3 in the appendix. Given the full state of PO-SOKOBAN, the transition kernel is deterministic and quite simple, as only the agent and potentially an adjacent box move. Inferring the belief state, i.e. the distribution over states given the history of observations and actions, can however range from trivial to very challenging, depending on the amount of available history. In the limit of a long observed history, every tile is eventually observed, and the belief state concentrates on a single state (the true state) that can be easily inferred. With limited observed history, however, inferring the posterior distribution over states (the belief state) is very complex. Consider e.g. the situation at the beginning of an episode (before pushing the first box). Only the first observation is available; however, we know that all PO-SOKOBAN levels are initially guaranteed to be solvable and therefore satisfy many combinatorial constraints reflecting that the agent is still able to push all boxes onto targets. Learning a compact parametric model of the initial-state distribution from empirical data is therefore difficult and likely results in a large mismatch between the learned model and the true environment.

**Results** To illustrate the potential advantages of CF-PE over MB-PE, we perform policy evaluation in the PO-SOKOBAN environment. We first generate a policy π that we wish to evaluate by training it with a previously proposed distributed RL algorithm (Espeholt et al., 2018). The policy is parameterized as a deep, recurrent neural network consisting of a 3-layer convolutional LSTM (Xingjian et al., 2015) with 32 channels per layer and kernel size 3. To further increase computational power, the LSTM ticks twice for each environment step. The output of the agent is a value function and a softmax yielding the probabilities of taking the five actions. In order to obtain an SCM of the environment, for the sake of simplicity, we assume that the ground-truth transition, observation and reward kernels are given. Therefore, the only parts of the model that we need to learn are the distribution $p(U_{s_1})$ of initial states $S_1 = U_{s_1}$ (for regular MB-PE) and the density $p(U_{s_1} \mid \hat{h}^i_t)$ for inferring levels in hindsight for CF-PE. We vary the amount of true data t that we condition this inference on, ranging from t = 0 (no real data, equivalent to MB-PE) to t = T = 50 (a full episode of real data is used to infer the initial state $U_{s_1}$). We train a separate model for each $t \in \{0, 5, 10, 20, 30, 40, 50\}$. To simplify model learning, both models were given access to the unobserved state during training, but not at test time.
The models are chosen to be powerful, multi-layer, generative DRAW models (Gregor et al., 2015) trained by approximate maximum likelihood learning (Kingma & Welling, 2013; Rezende et al., 2014). The models take as input the (potentially empty) data $\hat{h}^i_t$, summarized by a backward RNN (a standard convolutional LSTM model with 32 units). The model is shown in fig. 3 in the appendix, and additional details are given in appendix C. The data $\hat{h}^i_T$ was collected under a uniform random policy µ. For all policy evaluations, we use more than $10^5$ levels $u^i$ from the inferred model. In order to evaluate policies of different proficiency, we derive from the original (trained) π three policies $\pi_0, \pi_1, \pi_2$, ranging from almost perfect to almost random performance, by introducing additional stochasticity during action selection.

Figure 2: Experimental results on the PO-SOKOBAN environment. Left: Policy evaluation. The policy evaluation error decreases with the amount of off-policy data available (in #transitions per episode) for inferring the scenarios (levels) $U_{s_1}$ that are used for counterfactual evaluation. No data (data points on the very left) corresponds to standard model-based policy evaluation (MB-PE), yielding large errors, whereas counterfactual policy evaluation yields more accurate results. This holds for all three policies with different true performances. Right: Policy search. Counterfactually-Guided Policy Search (CF-GPS) outperforms a naive model-based RL (MB-PS) algorithm as well as a version of standard Guided Policy Search ("GPS-like") on PO-SOKOBAN.

The policy evaluation results are shown in fig. 2. We found that for t = 0, in spite of an extensive hyper-parameter search, the model $p(U_{s_1})$ was unable to accurately capture the marginal distribution of initial levels in PO-SOKOBAN. As argued above, a solvable level satisfies a large number of complex constraints that span the entire grid world, which are hard for a parametric model to capture. Empirically, we found that the model mismatch manifested itself in samples from $p(U_{s_1})$ not being well-formed, e.g. not solvable, and hence the performance of the policies $\pi_i$ is very different on these synthetic levels compared to levels sampled from p. However, inferring levels from fully observed episodes, i.e. from $p(U_{s_1} \mid \hat{h}^i_{50})$, was reliable, and running π on these inferred levels resulted in accurate policy evaluation. The figure also shows the trade-off between policy evaluation accuracy and the amount of off-policy data, for intermediate amounts of data $\hat{h}^i_t$. We also want to emphasize that in this setting, model-free policy evaluation by IS fails: the uniform behavior policy µ was too different from the $\pi_i$, resulting in a relative error greater than 0.8 for all three policies.

## 4 OFF-POLICY IMPROVEMENT: COUNTERFACTUALLY-GUIDED POLICY SEARCH

In the following we show how we can leverage the insights from counterfactual policy evaluation for policy search. We start from a particular model-based RL algorithm and discuss how it can be generalized into a counterfactual algorithm to increase its robustness to model mismatch. We chose this starting point to make a connection to the previously proposed Guided Policy Search algorithm (Levine & Koltun, 2013; Levine & Abbeel, 2014), but we think a larger class of model-based RL algorithms can be generalized in an analogous manner.

### 4.1 STARTING POINT: VANILLA MODEL-BASED RL WITH RETURN-WEIGHTED REGRESSION

We start from the following algorithm. We assume we have a model M of the environment with trajectory distribution $\tilde{p}_\pi$.
Our current policy estimate $\pi_k$ is improved at iteration k using return-weighted regression:
$$\pi_{k+1} = \arg\max_\pi \int \exp(G(\tau))\, \tilde{p}_{\pi_k}(\tau) \log \tilde{p}_\pi(\tau)\, d\tau,$$
where $G(\tau)$ is the return of trajectory τ. This policy improvement step can be motivated by the framework of RL as variational inference (Toussaint, 2009) and is equivalent to minimizing the KL divergence to a trajectory distribution proportional to $\exp(G)\,\tilde{p}_{\pi_k}$, which puts additional mass on high-return trajectories. Although not strictly necessary for our exposition, we also allow for a dedicated proposal distribution $\tilde{p}_\lambda(\tau)$ over trajectories, under a policy λ. We refer to λ as a planner to highlight that it could consist of a procedure that solves episodes starting from arbitrary full states $s_1$ sampled from the model, by repeatedly calling the model transition kernel, e.g. a search procedure such as MCTS (Browne et al., 2012) or an expert policy. Concretely, we optimize the following finite-sample objective:
$$\pi_{k+1} = \arg\max_\pi \sum_{i=1}^{N} \frac{\exp(G(\tau^i))\, \tilde{p}_{\pi_k}(\tau^i)}{\tilde{p}_\lambda(\tau^i)} \log \tilde{p}_\pi(\tau^i), \qquad \tau^i \sim \tilde{p}_\lambda. \quad (1)$$
We refer to this algorithm as model-based policy search (MB-PS). It is based on model rollouts $\tau^i$ spanning entire episodes. An alternative would be to consider model rollouts starting from states visited in the real environment (if available). Both versions can be augmented by counterfactual methods, but for the sake of simplicity we focus on the simpler MB-PS version detailed above (we also did not find significant experimental performance differences between the two versions).

### 4.2 INCORPORATING OFF-POLICY DATA: COUNTERFACTUALLY-GUIDED POLICY SEARCH

Now, we assume that the model M is an SCM. Based on our discussion of counterfactual policy evaluation, it is straightforward to generalize the MB-PS algorithm described above by anchoring the rollouts $\tau^i$ under the model $\tilde{p}_\lambda$ in off-policy data D: instead of sampling $\tau^i$ directly from the prior $\tilde{p}_\lambda$, we draw them from the counterfactual distribution $\tilde{p}_{\lambda \mid \hat{h}^i_T}$ with data $\hat{h}^i_T \in D$ from the replay buffer, i.e. instead of sampling the scenarios U from the prior, we infer them from the given data. Again invoking Lemma 1, this procedure is unbiased under no model mismatch. We term the resulting algorithm Counterfactually-Guided Policy Search (CF-GPS); it is summarized in Algorithm 1. The motivation for using CF-GPS over MB-PS is analogous to the advantage of CF-PE over MB-PE discussed in sec. 3.1. The policy π in CF-GPS is optimized on rollouts $\tau^i$ that are grounded in data $\hat{h}^i_T$ by sampling them from the counterfactual distribution $\tilde{p}_{\lambda \mid \hat{h}^i_T}$ instead of the prior $\tilde{p}_\lambda$. If this prior is difficult to model, we expect the counterfactual distribution to be more concentrated in regions where there is actual mass under the true environment $p_\lambda$.
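For concreteness, the sketch below spells out one update of the finite-sample objective in eqn. (1) for a hypothetical linear-Gaussian policy, where the weighted maximum-likelihood step has a closed form; this is an illustrative stand-in for the recurrent-network policy and gradient-based optimization actually used in the paper, and all names are ours. Note that in the ratio $\tilde{p}_{\pi_k}/\tilde{p}_\lambda$ the dynamics terms cancel, so only the action densities under $\pi_k$ and λ are needed; we also self-normalize the weights, a common practical choice.

```python
import numpy as np

rng = np.random.default_rng(4)

# Counterfactual rollouts tau_i = [(s_t, a_t), ...] with returns G_i, generated
# under a planner lambda; the action log-densities of each rollout under pi_k
# and under lambda are assumed to be available (both are known policies).
def improvement_step(rollouts, returns, logp_pik, logp_lam, sigma=1.0):
    """One return-weighted regression update (cf. eqn. 1) for a policy
    pi(a|s) = N(a; theta * s, sigma^2).  Returns the new theta."""
    # Self-normalised weights w_i ∝ exp(G_i) * p_{pi_k}(tau_i) / p_lambda(tau_i).
    logw = np.asarray(returns) + np.asarray(logp_pik) - np.asarray(logp_lam)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # argmax_theta sum_i w_i sum_t log N(a_t; theta * s_t, sigma^2)
    # is a weighted least-squares problem in theta.
    num = sum(wi * sum(s * a for s, a in tau) for wi, tau in zip(w, rollouts))
    den = sum(wi * sum(s * s for s, _ in tau) for wi, tau in zip(w, rollouts))
    return num / den

# Tiny synthetic example: 3 rollouts of length 2.
rollouts = [[(1.0, 0.5), (0.5, 0.2)],
            [(1.0, -0.8), (0.2, 0.1)],
            [(0.5, 0.4), (0.1, 0.0)]]
returns = [1.0, -2.0, 0.5]
theta_new = improvement_step(rollouts, returns, np.zeros(3), np.zeros(3))
```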
### 4.3 EXPERIMENTS

We evaluate CF-GPS on the PO-SOKOBAN environment, using a modified distributed actor-learner architecture based on Espeholt et al. (2018): multiple actors (here 64) collect real data $\hat{h}_T$ by running the behavior policy µ in the true environment p. As in many distributed RL settings, µ is chosen to be a copy of the policy π, often slightly outdated, so the data must be considered off-policy. The distribution $p(U_{s_1} \mid \hat{h}_T)$ over levels $U_{s_1}$ is inferred from the data $\hat{h}_T$ using the model M. We sample a scenario $U_{s_1}$ for each logged episode and simulate 10 counterfactual trajectories $\tau^1, \ldots, \tau^{10}$ under the planner λ for each such scenario. Here, for the sake of simplicity, instead of using search, the planner was assumed to be a mixture between π and a pre-trained expert policy $\lambda_e$, i.e. $\lambda = \beta \lambda_e + (1 - \beta)\pi$. The mixing coefficient β followed an exponentially decaying schedule with a time constant of $10^5$ episodes. The learner performs policy improvement on π using $\tau^1, \ldots, \tau^{10}$ according to eqn. (1). M was trained online, in the same way as in sec. 3.2. λ and π were parameterized by deep, recurrent neural networks with the architecture described in sec. 3.2. We compare CF-GPS with the vanilla MB-PS baseline described in sec. 4.1 (based on the same number of policy updates). MB-PS differs from CF-GPS only in having access to an unconditional model $p(U_{s_1})$ over initial states. We also consider a method which conditions the scenario model $p(U_{s_1} \mid o_1)$ on the very first observation $o_1$, which is available when taking the first action and therefore does not involve hindsight reasoning. This is more informed than MB-PS; however, due to the noise on the observations, the state is still mostly unobserved, making it very challenging to learn a good parametric model of the belief state $p(U_{s_1} \mid o_1)$. We refer to this algorithm as Guided Policy Search-like (GPS-like), as it roughly corresponds to the algorithm presented by Levine & Abbeel (2014), discussed in greater detail in sec. 5. Fig. 2 shows that CF-GPS outperforms these two baselines. As expected from the policy evaluation experiments, initial states sampled from the models for GPS-like and MB-PS are often not solvable, yielding inferior training data for the policy π. In CF-GPS, the levels are inferred in hindsight from $p(U_{s_1} \mid \hat{h}_T)$, yielding high-quality training data. For reference, we also show a policy trained by the model-free method of Espeholt et al. (2018) using the same amount of environment data. Not surprisingly, CF-GPS is able to make better use of the data than the model-free baseline, as it has access to the true transition and reward kernels (which were not given to the model-free method).

## 5 RELATED WORK

Bottou et al. (2013) provide an in-depth discussion of applying models to off-policy evaluation. However, their approach and related ones (Li et al., 2015; Jiang & Li, 2015; Swaminathan & Joachims, 2015; Nedelec et al., 2017; Atan et al., 2016; Thomas & Brunskill, 2016), such as doubly-robust estimators, rely on Importance Sampling (IS), also called the propensity score method. Although some of these algorithms are also termed counterfactual policy evaluation, they are not counterfactual in the sense used in this paper, where noise variables are inferred from logged data and reused to evaluate counterfactual actions. Hence, they are dogged by the high estimator variance common to IS, although recent work aims to address this (Munos et al., 2016; Guo et al., 2017). Model-based methods for off-policy evaluation have recently been improved to account for the distribution shift between the data-collecting policy and the policy to be evaluated (Johansson et al., 2016; Liu et al., 2018). Recently, Andrychowicz et al. (2017) proposed the Hindsight Experience Replay (HER) algorithm for learning a family of goal-directed policies. In HER, one observes an outcome in the true environment, which is kept fixed, and searches for the goal-directed policy that should have achieved this goal, in order to positively reinforce it. This algorithm is therefore complementary to CF-GPS, where we search over alternative outcomes for a given policy.
Our CF-GPS algorithm is inspired by, and extends, work presented by Abbeel et al. (2006) on a method for de-biasing weak models by estimating additive terms in the transition kernel to better match individual, real trajectories. The resulting model, which is a counterfactual distribution in the terminology used in our paper, is then used for model-based policy improvement. Our work generalizes this approach and highlights conceptual connections to causal reasoning. Furthermore, we discuss the connection of CF-GPS to two classes of RL algorithms in greater detail below.

**Guided Policy Search (GPS)** CF-GPS is closely related to GPS; in particular, we focus on GPS as presented by Levine & Abbeel (2014). Consider CF-GPS in the fully observed MDP setting where $O_t = S_t$. Furthermore, assume that the SCM M is structured as follows: let $S_{t+1} = f_s(S_t, A_t, U_{s_t})$ be a linear function in $(S_t, A_t)$ with coefficients given by $U_{s_t}$, and assume an i.i.d. Gaussian mixture model on $U_{s_t}$ for all t. As the states are fully observed, the inference step in the CFI procedure simplifies: we can infer the noise sources $\hat{u}_{s_t}$ (samples or MAP estimates), i.e. the unknown linear dynamics, from pairs of observed, true states $\hat{s}_t, \hat{s}_{t+1}$. Furthermore, assume that the reward is a quadratic function of the state. Then, the counterfactual distribution $\tilde{p}_\lambda(\tau \mid \hat{u})$ is a linear quadratic regulator (LQR) with time-varying coefficients $\hat{u}$. An appropriate choice for the planner λ is the optimal linear feedback policy for the given LQR, which can be computed exactly by dynamic programming.

**Observation 1.** In the MDP setting, CF-GPS with a linear SCM and a dynamic-programming planner λ for LQRs is equivalent to GPS.

Another perspective is that GPS is the counterfactual version of the MB-PS procedure from sec. 4.1:

**Observation 2.** In the MDP setting with a linear SCM and a dynamic-programming planner λ for LQRs, GPS is the counterfactual variant of the MB-PS procedure outlined above.

The fact that GPS is a successful algorithm in practice shows that grounding model-based search / rollouts in real, off-policy data, as afforded by counterfactual reasoning, massively improves the naive, prior-sample-based MB-PS algorithm. These considerations also suggest when we expect CF-GPS to be superior to regular GPS: if the uncertainty $U_{s_t}$ in the environment transition cannot be reliably identified from subsequent pairs of observations $\hat{o}_t, \hat{o}_{t+1}$ alone, we expect benefits from inferring $U_{s_t}$ from a larger context of observations, in the extreme case from the entire history $\hat{h}_T$ as described above.

**Stochastic Value Gradient methods** There are multiple interesting connections between CF-GPS and Stochastic Value Gradient (SVG) methods (Heess et al., 2015). In SVG, a policy π for an MDP is learned by gradient ascent on the expected return under a model $\tilde{p}$. Instead of using the score-function estimator, SVG relies on a reparameterization of the stochastic model and policy (Kingma & Welling, 2013; Rezende et al., 2014). We note that this reparameterization casts $\tilde{p}$ into an SCM. As in GPS, the noise sources $U_{s_t}$ are inferred from two subsequent observed states $\hat{s}_t, \hat{s}_{t+1}$ from the true environment, and the action noise $U_{a_t}$ is kept frozen. As pointed out in the GPS discussion, this procedure corresponds to the inference step of a counterfactual query. Given inferred values u for U, gradients $\nabla_\theta G$ of the return under the model are taken with respect to the policy parameters θ. We can loosely interpret these gradients as $2\dim(\theta)$ counterfactual policy evaluations of policies $\pi_{\theta \pm \delta\theta_i}$, where a single dimension i of the parameter vector θ is perturbed.
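The counterfactual step underlying SVG can be sketched as follows, under assumptions made here for illustration (additive transition noise, a deterministic toy policy, and invented names): the transition noise is abducted from consecutive observed states and held fixed while the trajectory is re-simulated under perturbed policy parameters. In SVG proper, the return of this re-simulation is differentiated with respect to the policy parameters by automatic differentiation rather than by the finite difference used below.

```python
import numpy as np

rng = np.random.default_rng(5)

f = lambda s, a: 0.9 * s + a          # known (or learned) deterministic part of the dynamics
policy = lambda s, theta: theta * s   # deterministic policy for simplicity

# One logged trajectory (s_1, a_1, s_2, ..., s_T) from the real environment.
theta_old, s, states, actions = 0.3, rng.normal(), [], []
for _ in range(10):
    a = policy(s, theta_old)
    states.append(s); actions.append(a)
    s = f(s, a) + 0.2 * rng.normal()  # true dynamics = model + additive noise
states.append(s)

# Abduction: infer the transition noise that explains each observed step.
u = [states[t + 1] - f(states[t], actions[t]) for t in range(10)]

def counterfactual_return(theta):
    """Re-simulate the same scenario (inferred noise u) under different parameters."""
    s, g = states[0], 0.0
    for t in range(10):
        a = policy(s, theta)
        g += -abs(s)                  # toy reward
        s = f(s, a) + u[t]            # reuse the inferred noise
    return g

# Finite-difference stand-in for the stochastic value gradient dG/dtheta.
eps = 1e-4
grad = (counterfactual_return(theta_old + eps) - counterfactual_return(theta_old - eps)) / (2 * eps)
```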
## 6 DISCUSSION

Simulating plausible synthetic experience de novo is a hard problem for many environments, often resulting in biases for model-based RL algorithms. The main takeaway from this work is that we can improve policy learning by evaluating counterfactual actions in concrete, past scenarios. Compared to only considering synthetic scenarios, this procedure mitigates model bias. However, it relies on some crucial assumptions that we want to briefly discuss here. The first assumption is that off-policy experience is available at all. In cases where such experience is too costly to acquire, we cannot use any of the proposed methods and have to rely exclusively on the simulator / model. We also assumed that there are no additional hidden confounders in the environment, and that the main challenge in modelling the environment is capturing the distribution of the noise sources p(U), whereas the transition and reward kernels given the noise are assumed to be easy to model. This seems a reasonable assumption in some environments, such as the partially observed grid world considered here, but not in all. Probably the most restrictive assumption is that we require the inference over the noise U given data $\hat{h}_T$ to be sufficiently accurate. We showed in our example that we could learn a parametric model of this distribution from privileged information, i.e. from joint samples $u, h_T$ from the true environment. However, imperfect inference over the scenario U could result e.g. in wrongly attributing a negative outcome to the agent's actions instead of to environmental factors. This could in turn result in overly optimistic predictions for counterfactual actions. Future research is needed to investigate whether learning a sufficiently strong SCM is possible without privileged information for interesting RL domains. If, however, we can trust the transition and reward kernels of the model, we can substantially improve model-based RL methods by counterfactual reasoning on off-policy data, as demonstrated in our experiments and by the success of Guided Policy Search and Stochastic Value Gradient methods.

## REFERENCES

Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1-8. ACM, 2006.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048-5058, 2017.

Onur Atan, William R Zame, Qiaojun Feng, and Mihaela van der Schaar. Constructing effective personalized policies using counterfactual inference from biased data sets with many features. arXiv preprint arXiv:1612.08082, 2016.

Alexander Balke and Judea Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, pp. 46-54. Morgan Kaufmann Publishers Inc., 1994.

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207-3260, 2013.
Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances in Neural Information Processing Systems, pp. 3549-3557, 2016.

Zhaohan Guo, Philip S Thomas, and Emma Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2492-2501, 2017.

Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944-2952, 2015.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.

Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181-1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pp. 3020-3029, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kevin B Korb, Lucas R Hope, Ann E Nicholson, and Karl Axnick. Varieties of causal intervention. In Pacific Rim International Conference on Artificial Intelligence, pp. 322-331. Springer, 2004.

Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071-1079, 2014.

Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pp. 1-9, 2013.

Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, pp. 929-934. ACM, 2015.

Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing MDPs for off-policy policy evaluation. arXiv preprint arXiv:1805.09044, 2018.

Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054-1062, 2016.

Thomas Nedelec, Nicolas Le Roux, and Vianney Perchet.
A comparative study of counterfactual estimators. arXiv preprint arXiv:1704.00773, 2017.

Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009. ISBN 052189560X, 9780521895606.

Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.

J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.

Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5690-5701, 2017.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Neal J Roese. Counterfactual thinking. Psychological Bulletin, 121(1):133, 1997.

Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pp. 814-823, 2015.

Erik Talvitie. Model regularization for stable sample rollouts. In UAI, pp. 780-789, 2014.

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139-2148, 2016.

Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1049-1056. ACM, 2009.

Sewall Wright. The relative importance of heredity and environment in determining the piebald pattern of guinea-pigs. Proceedings of the National Academy of Sciences, 6(6):320-332, 1920.

Shi Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802-810, 2015.

## A.1 PROOF OF LEMMA 1

Proof. We start from the fact that the density over the noise sources U remains the same under every intervention I, since the U are root nodes in G, i.e. $p^{do(I)}(u) = p(u)$. This leads to:
$$p^{do(I)}(x) = \int p^{do(I)}(x \mid u)\, p^{do(I)}(u)\, du = \int p^{do(I)}(x \mid u)\, p(u)\, du = \int p^{do(I)}(x \mid u) \int p(\hat{x}_o, u)\, d\hat{x}_o\, du$$
$$= \int\!\!\int p^{do(I)}(x \mid u)\, p(u \mid \hat{x}_o)\, p(\hat{x}_o)\, du\, d\hat{x}_o = E_{\hat{x}_o \sim p}\Big[\int p^{do(I)}(x \mid u)\, p(u \mid \hat{x}_o)\, du\Big] = E_{\hat{x}_o \sim p}\big[p^{do(I) \mid \hat{x}_o}(x)\big].$$

## A.2 PROOF OF COROLLARY 1

Proof. Given two sets $CF \subseteq \{1, \ldots, N\}$ and $Prior \subseteq \{1, \ldots, N\}$ with $CF \cap Prior = \emptyset$ and $CF \cup Prior = \{1, \ldots, N\}$, we define $u^{CF} = (u_n)_{n \in CF}$ and $u^{Prior} = (u_n)_{n \in Prior}$. By construction, the scenarios U are independent under the prior, i.e. $p(u) = \prod_{n=1}^{N} p(u_n)$. Therefore $u^{CF}$ and $u^{Prior}$ are independent, and we can write:
$$p(u) = \prod_{n \in CF} p(u_n) \prod_{n \in Prior} p(u_n) = p(u^{CF})\, p(u^{Prior}).$$
Following the arguments from Lemma 1, the averaged inference distribution is equal to the prior, $p(u) = E_{\hat{x}_o \sim p}[p(u \mid \hat{x}_o)]$. This also holds for any subset of the variables u, in particular for $p(u^{CF})$. Hence:
$$p(u) = p(u^{CF})\, p(u^{Prior}) = E_{\hat{x}_o \sim p}\big[p(u^{CF} \mid \hat{x}_o)\big]\, p(u^{Prior}).$$

## B DETAILS ON CASTING A POMDP INTO SCM FORM
**Lemma 2 (Auto-regressive uniformization, aka reparametrization).** Consider random variables $X_1, \ldots, X_N$ with joint distribution P. There exist functions $f_n$ for $n \in \{1, \ldots, N\}$ such that, with independent random variables $U_n \sim \mathrm{Uniform}[0, 1]$, the random variables $X'$ equal X in distribution, i.e. $X' \stackrel{d}{=} X$, where $X' = (X'_1, \ldots, X'_N)$ is defined by
$$X'_n := f_n(U_n, X'_1, \ldots, X'_{n-1}).$$
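To make Lemma 2 concrete, the following sketch performs auto-regressive uniformization for a small discrete joint distribution: each variable is obtained by passing an independent Uniform[0, 1] noise variable through the inverse CDF of its conditional given the previously generated variables. The example distribution and all names are illustrative assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(6)

# Joint distribution P over (X1, X2), each taking values in {0, 1}.
P = np.array([[0.1, 0.3],
              [0.4, 0.2]])          # P[x1, x2]

def f1(u1):
    """X1 = f1(U1): inverse CDF of the marginal P(X1)."""
    p1 = P.sum(axis=1)
    return int(np.searchsorted(np.cumsum(p1), u1))

def f2(u2, x1):
    """X2 = f2(U2, X1): inverse CDF of the conditional P(X2 | X1 = x1)."""
    p2 = P[x1] / P[x1].sum()
    return int(np.searchsorted(np.cumsum(p2), u2))

# Sampling X' from independent uniform noise reproduces the joint distribution.
counts = np.zeros((2, 2))
for _ in range(50000):
    u1, u2 = rng.uniform(size=2)
    x1 = f1(u1)
    counts[x1, f2(u2, x1)] += 1
print(counts / counts.sum())        # approximately equal to P
```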