# Bayes-Adaptive Simulation-based Search with Value Function Approximation

Arthur Guez¹,², Nicolas Heess², David Silver², Peter Dayan¹ (aguez@google.com)
¹Gatsby Unit, UCL   ²Google DeepMind

Abstract

Bayes-adaptive planning offers a principled solution to the exploration-exploitation trade-off under model uncertainty. It finds the optimal policy in belief space, which explicitly accounts for the expected effect on future rewards of reductions in uncertainty. However, the Bayes-adaptive solution is typically intractable in domains with large or continuous state spaces. We present a tractable method for approximating the Bayes-adaptive solution by combining simulation-based search with a novel value function approximation technique that generalises appropriately over belief space. Our method outperforms prior approaches in both discrete bandit tasks and simple continuous navigation and control tasks.

1 Introduction

A fundamental problem in sequential decision making is controlling an agent when the environmental dynamics are only partially known. In such circumstances, probabilistic models of the environment are used to capture the uncertainty of current knowledge given past data; they thus imply how exploring the environment can be expected to lead to new, exploitable, information. In the context of Bayesian model-based reinforcement learning (RL), Bayes-adaptive (BA) planning [8] solves the resulting exploration-exploitation trade-off by directly optimizing future expected discounted return in the joint space of states and beliefs about the environment (or, equivalently, interaction histories). Performing such optimization even approximately is computationally highly challenging; however, recent work has demonstrated that online planning by sample-based forward-search can be effective [22, 1, 12]. These algorithms estimate the value of future interactions by simulating trajectories while growing a search tree, taking model uncertainty into account.

However, one major limitation of Monte Carlo search algorithms in general is that, naïvely applied, they fail to generalize values between related states. In the BA case, a separate value is stored for each distinct path of possible interactions. Thus, the algorithms fail not only to generalize values between related paths, but also to reflect the fact that different histories can correspond to the same belief about the environment. As a result, the number of required simulations grows exponentially with search depth. Worse yet, except in very restricted scenarios, the lack of generalization renders MC search algorithms effectively inapplicable to BAMDPs with continuous state or action spaces.

In this paper, we propose a class of efficient simulation-based algorithms for Bayes-adaptive model-based RL which use function approximation to estimate the value of interaction histories during search. This enables generalization between different beliefs, states, and actions during planning, and therefore also works for continuous state spaces. To our knowledge this is the first broadly applicable MC search algorithm for continuous BAMDPs. Our algorithm builds on the success of a recent tree-based algorithm for discrete BAMDPs (BAMCP, [12]) and exploits value function approximation for generalization across interaction histories, as has been proposed for simulation-based search in MDPs [19].
As a crucial step towards this end, we develop a suitable parametric form for the value function estimates that can generalize appropriately across histories, using the importance sampling weights of posterior samples to compress histories into a finite-dimensional feature vector. As in BAMCP, we take advantage of root sampling [18, 12] to avoid expensive belief updates at every step of simulation, making the algorithm practical for a broad range of priors over environment dynamics. We also provide an interpretation of root sampling as an auxiliary variable sampling method. This leads to a new proof of its validity in general simulation-based settings, including BAMDPs with continuous state and action spaces, and a large class of algorithms that includes MC and TD updates. Empirically, we show that our approach requires considerably fewer simulations to find good policies than BAMCP in a (discrete) bandit task and two continuous control tasks with a Gaussian process prior over the dynamics [5, 6]. In the well-known pendulum swing-up task, our algorithm learns how to balance after just a few seconds of interaction.

Below, we first briefly review the Bayesian formulation of optimal decision making under model uncertainty (section 2; please see [8] for additional details). We then explain our algorithm (section 3) and present empirical evaluations in section 4. We conclude with a discussion, including related work (sections 5 and 6).

2 Background

A Markov Decision Process (MDP) is described as a tuple $M = \langle S, A, \mathcal{P}, R, \gamma \rangle$ with $S$ the set of states (which may be infinite), $A$ the discrete set of actions, $\mathcal{P}: S \times A \times S \to \mathbb{R}$ the state transition probability kernel, $R: S \times A \to \mathbb{R}$ the reward function, and $\gamma < 1$ the discount factor. The agent starts with a prior $P(\mathcal{P})$ over the dynamics, and maintains a posterior distribution $b_t(\mathcal{P}) = P(\mathcal{P} \mid h_t) \propto P(h_t \mid \mathcal{P})\, P(\mathcal{P})$, where $h_t$ denotes the history of states, actions, and rewards up to time $t$. The uncertainty about the dynamics of the model can be transformed into certainty about the current state inside an augmented state space $S^+ = H \times S$, where $H$ is the set of possible histories (the current state also being the suffix of the current history). The dynamics and rewards associated with this augmented state space are described by

$$\mathcal{P}^+(\langle h, s \rangle, a, \langle h a s', s' \rangle) = \int_{\mathcal{P}} \mathcal{P}(s, a, s')\, P(\mathcal{P} \mid h)\, \mathrm{d}\mathcal{P}, \qquad R^+(\langle h, s \rangle, a) = R(s, a). \tag{1}$$

Together, the 5-tuple $M^+ = \langle S^+, A, \mathcal{P}^+, R^+, \gamma \rangle$ forms the Bayes-Adaptive MDP (BAMDP) for the MDP problem $M$. Since the dynamics of the BAMDP are known, it can in principle be solved to obtain the optimal value function associated with each action:

$$Q^*(h_t, s_t, a) = \max_{\pi} \mathbb{E}_{\pi}\Big[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \,\Big|\, a_t = a\Big]; \qquad \pi^*(h_t, s_t) = \operatorname*{argmax}_{a} Q^*(h_t, s_t, a), \tag{2}$$

where $\pi: S^+ \times A \to [0, 1]$ is a policy over the augmented state space, from which the optimal action for each belief-state, $\pi^*(h_t, s_t)$, can readily be derived. Optimal actions in the BAMDP are executed greedily in the real MDP $M$, and constitute the best course of action (i.e., integrating exploration and exploitation) for a Bayesian agent with respect to its prior belief over $\mathcal{P}$.

3 Bayes-Adaptive simulation-based search

Our simulation-based search algorithm for the Bayes-adaptive setup combines efficient MC search via root sampling with value function approximation. We first explain its underlying idea, assuming a suitable function approximator exists, and provide a novel proof justifying the use of root sampling that also applies in continuous state-action BAMDPs. Finally, we explain how to model Q-values as a function of interaction histories.
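Before turning to the algorithm, a concrete special case may help fix ideas. The following minimal Python sketch (our own illustration, not code from the paper) instantiates Eq. (1) for a discrete MDP with an independent Dirichlet prior over each transition row: the BAMDP dynamics $\mathcal{P}^+$ then reduce to the posterior mean of the Dirichlet (normalized pseudo-counts), and drawing a full model from the posterior, as root sampling will require below, is a single Dirichlet draw per row. The class name and interface are assumptions chosen for the later sketches.

```python
import numpy as np

# Minimal sketch (not the paper's code): a discrete MDP with an independent
# Dirichlet prior over each row P(s, a, .) of the transition matrix.  Under
# this prior, the integral in Eq. (1) has a closed form: P+ is the posterior
# mean of the Dirichlet, i.e. normalized pseudo-counts.

class DirichletBelief:
    def __init__(self, n_states, n_actions, alpha0=1.0):
        # counts[s, a, s'] holds prior pseudo-counts plus observed transitions
        self.counts = np.full((n_states, n_actions, n_states), alpha0)

    def update(self, s, a, s_next):
        # Bayesian belief update after observing one transition (s, a, s')
        self.counts[s, a, s_next] += 1.0

    def bamdp_transition(self, s, a):
        # P+(<h, s>, a, <h a s', s'>) = E_{P ~ posterior}[P(s, a, s')]
        row = self.counts[s, a]
        return row / row.sum()

    def sample_model(self, rng):
        # Draw one full transition model P ~ P(P | h); used by root sampling
        return np.array([[rng.dirichlet(self.counts[s, a])
                          for a in range(self.counts.shape[1])]
                         for s in range(self.counts.shape[0])])
```

For instance, `DirichletBelief(2, 2).bamdp_transition(0, 0)` is uniform before any data and shifts towards observed transitions as `update` is called. For richer priors, such as the Gaussian process priors used in the experiments, the integral in Eq. (1) generally has no such closed form, which is part of the motivation for the sampling-based approach that follows.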
3.1 Algorithm

As in other forward-search planning algorithms for Bayesian model-based RL [22, 17, 1, 12], at each step $t$, which is associated with the current history $h_t$ (or belief) and state $s_t$, we plan online to find $\pi^*(h_t, s_t)$ by constructing an action-value function $Q(h, s, a)$. Such methods use simulation to build a search tree of belief-states, each of whose nodes corresponds to a single (future) history, and estimate optimal values for these nodes. However, existing algorithms only update the nodes that are directly traversed in each simulation. This is inefficient, as it fails to generalize across multiple histories corresponding either to exactly the same, or similar, beliefs. Instead, each such history must be traversed and updated separately.

Here, we use a more general simulation-based search that relies on function approximation, rather than a tree, to represent the values of possible simulated histories and states. This approach was originally suggested in the context of planning in large MDPs [19]; we extend it to the case of Bayes-Adaptive planning. The Q-value of a particular history, state, and action is represented as $Q(h, s, a; w)$, where $w$ is a vector of learnable parameters. Fixed-length simulations are run from the current belief-state $h_t, s_t$, and the parameter $w$ is updated online, during search, based on experience accumulated along these trajectories, using an incremental RL control algorithm (e.g., Monte-Carlo control, Q-learning). If the parametric form and features induce generalization between histories, then each forward simulation can affect the values of histories that are not directly experienced. This can considerably speed up planning, and enables continuous-state problems to be tackled. Note that a search tree would be a special case of the function approximation approach when the representation of states and histories is tabular.

Algorithm 1: Bayes-Adaptive simulation-based search with root sampling

    procedure Search(h_t, s_t)
        repeat
            P ~ P(P | h_t)
            Simulate(h_t, s_t, P, 0)
        until Timeout()
        return argmax_a Q(h_t, s_t, a; w)
    end procedure

    procedure Simulate(h, s, P, t)
        if t > T then return 0
        a ~ π_ε-greedy(Q(h, s, ·; w))
        s' ~ P(s, a, ·);  r ← R(s, a)
        R ← r + γ · Simulate(h a s', s', P, t + 1)
        w ← w − α (Q(h, s, a; w) − R) ∇_w Q(h, s, a; w)
        return R
    end procedure

In the context of Bayes-Adaptive planning, simulation-based search works by simulating a future trajectory $h_{t+T} = s_t a_t r_t s_{t+1} \ldots a_{t+T-1} r_{t+T-1} s_{t+T}$ of $T$ transitions (the planning horizon) starting from the current belief-state $h_t, s_t$. Actions are selected by following a fixed policy $\pi$, which is itself a function of the history, $a \sim \pi(h, \cdot)$. State transitions can be sampled according to the BAMDP dynamics, $s_t \sim \mathcal{P}^+(\langle h_{t-1}, s_{t-1} \rangle, a_{t-1}, \langle h_{t-1} a_{t-1} \cdot, \cdot \rangle)$. However, this can be computationally expensive, since belief updates must be applied at every step of the simulation. As an alternative, we use root sampling [18], which only samples the dynamics $\mathcal{P}_k \sim P(\mathcal{P} \mid h_t)$ once at the root for each simulation $k$, and then samples transitions according to $s_t \sim \mathcal{P}_k(s_{t-1}, a_{t-1}, \cdot)$; we provide justification for this approach in Section 3.2.¹ After the trajectory $h_{t+T}$ has been simulated in a given step, the Q-value is modified by updating $w$ based on the data in $h_{t+T}$.
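As a concrete (and deliberately simplified) reading of Algorithm 1, the Python sketch below implements Search/Simulate with root sampling, ε-greedy action selection, Monte-Carlo backups, and a linear value function $Q(h, s, a; w) = w^\top \phi(h, s, a)$. It assumes the `DirichletBelief` sketch above; `phi` is a placeholder for the history features developed later in the paper (based on importance sampling weights of posterior samples), and `R_fn`, `n_sims`, and the other names are our assumptions rather than the paper's interface.

```python
import numpy as np

def search(belief, h, s, phi, w, R_fn, n_actions, gamma=0.95,
           horizon=15, n_sims=1000, alpha=0.05, eps=0.1, rng=None):
    """Sketch of Algorithm 1: root sampling + MC control with a linear Q.

    phi(h, s, a) -> feature vector; Q(h, s, a; w) = w . phi(h, s, a).
    `belief`, `R_fn`, and `phi` stand in for the prior, the (known) reward
    function, and the history features; the history h is an opaque list of
    (action, next_state) events here, purely for illustration.
    """
    rng = rng or np.random.default_rng()

    def Q(h, s, a):
        return float(w @ phi(h, s, a))

    def simulate(h, s, P, t):
        if t >= horizon:
            return 0.0
        # epsilon-greedy action selection w.r.t. the current value estimates
        if rng.random() < eps:
            a = rng.integers(n_actions)
        else:
            a = int(np.argmax([Q(h, s, b) for b in range(n_actions)]))
        # root sampling: transitions use the model P drawn at the root
        s_next = rng.choice(len(P[s, a]), p=P[s, a])
        r = R_fn(s, a)
        ret = r + gamma * simulate(h + [(a, s_next)], s_next, P, t + 1)
        # Monte-Carlo backup, Eq. (3): for a linear Q the gradient is phi
        w[:] -= alpha * (Q(h, s, a) - ret) * phi(h, s, a)
        return ret

    for _ in range(n_sims):                 # "repeat ... until Timeout()"
        P = belief.sample_model(rng)        # P ~ P(P | h_t), sampled once
        simulate(list(h), s, P, 0)
    return int(np.argmax([Q(h, s, a) for a in range(n_actions)]))
```

Because the update is applied as each recursive call returns, the backup order matches the pseudocode: returns are accumulated from the end of the simulated trajectory back towards the root.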
Any incremental algorithm could be used, including SARSA, Q-learning, or gradient TD [20]; we use a simple scheme to minimize an appropriately weighted squared loss $\mathbb{E}\big[(Q(h_{t'}, s_{t'}, a_{t'}; w) - R_{t'})^2\big]$:

$$\Delta w = -\alpha \,\big(Q(h_{t'}, s_{t'}, a_{t'}; w) - R_{t'}\big)\, \nabla_w Q(h_{t'}, s_{t'}, a_{t'}; w), \tag{3}$$

where $\alpha$ is the learning rate and $R_{t'}$ denotes the discounted return obtained from history $h_{t'}$.² Algorithm 1 provides pseudo-code for this scheme; here we suggest using, as the fixed policy for a simulation, the ε-greedy policy $\pi_{\epsilon\text{-greedy}}$ based on some given Q-value. Other policies could be considered (e.g., the UCT policy for search trees), but they are not the main focus of this paper.

¹ For comparison, a version of the algorithm without root sampling is listed in the supplementary material.
² The loss is weighted according to the distribution of belief-states visited from the current state by executing π.

3.2 Analysis

In order to exploit general results on the convergence of classical RL algorithms for our simulation-based search, it is necessary to show that, starting from the current history, root sampling produces the appropriate distribution of rollouts. For the purpose of this section, a simulation-based search algorithm includes Algorithm 1 (with Monte-Carlo backups), but also incremental variants, as discussed above, or BAMCP.

Let $D^\pi_t$ be the rollout distribution function of forward-simulations that explicitly update the belief at each step (i.e., using $\mathcal{P}^+$): $D^\pi_t(h_{t+T})$ is the probability density that history $h_{t+T}$ is generated when running such a simulation from $h_t, s_t$, with $T$ the horizon of the simulation and $\pi$ an arbitrary history policy. Similarly, define $\tilde{D}^\pi_t(h_{t+T})$ as the probability density that history $h_{t+T}$ is generated when running forward-simulations with root sampling, as in Algorithm 1. The following lemma shows that these two rollout distributions are the same.

Lemma 1. $D^\pi_t(h_{t+T}) = \tilde{D}^\pi_t(h_{t+T})$ for all policies $\pi: H \times A \to [0, 1]$ and for all $h_{t+T} \in H$ of length $t + T$.

Proof. A similar result has been obtained for discrete state-action spaces as Lemma 1 in [12], using an induction step on the history length. Here we provide a more intuitive interpretation of root sampling as an auxiliary variable sampling scheme, which also applies directly to continuous spaces. We show the equivalence by rewriting the distribution of rollouts. The usual way of sampling histories in simulation-based search, with belief updates, is justified by factoring the density as follows:

$$p(h_{t+T} \mid h_t, \pi) = p(a_t s_{t+1} a_{t+1} s_{t+2} \ldots s_{t+T} \mid h_t, \pi) \tag{4}$$
$$= p(a_t \mid h_t, \pi)\, p(s_{t+1} \mid h_t, \pi, a_t)\, p(a_{t+1} \mid h_{t+1}, \pi) \cdots p(s_{t+T} \mid h_{t+T-1}, a_{t+T-1}, \pi) \tag{5}$$
$$= \prod_{t'=t}^{t+T-1} \pi(h_{t'}, a_{t'}) \prod_{t'=t}^{t+T-1} p(s_{t'+1} \mid h_{t'}, a_{t'})$$
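Lemma 1 can also be checked empirically on a toy problem. The following self-contained sketch (our illustration; the setup is not from the paper) compares, on a two-armed Bernoulli bandit with independent Beta(1, 1) priors and a uniformly random history policy, the distribution over length-2 histories obtained by (a) updating the belief and sampling from the posterior predictive at every step, versus (b) root sampling the arm probabilities once per rollout from the root belief. The two empirical distributions agree up to Monte Carlo noise, consistent with the auxiliary-variable interpretation of root sampling above: conditioning on the model drawn at the root and then marginalizing it out recovers the belief-update rollout density.

```python
import numpy as np
from collections import Counter

# Illustrative check of Lemma 1 on a 2-armed Bernoulli bandit with
# independent Beta(1, 1) priors and a uniform random history policy.
rng = np.random.default_rng(0)
T, n_rollouts = 2, 200_000

def rollout_belief_updates():
    alpha = np.ones((2, 2))  # per arm: [successes + 1, failures + 1]
    h = []
    for _ in range(T):
        a = int(rng.integers(2))
        p_success = alpha[a, 0] / alpha[a].sum()     # posterior predictive
        r = int(rng.random() < p_success)
        alpha[a, 0 if r else 1] += 1                 # belief update
        h.append((a, r))
    return tuple(h)

def rollout_root_sampling():
    theta = rng.beta(1.0, 1.0, size=2)               # sample model at root
    h = []
    for _ in range(T):
        a = int(rng.integers(2))
        r = int(rng.random() < theta[a])
        h.append((a, r))
    return tuple(h)

c1 = Counter(rollout_belief_updates() for _ in range(n_rollouts))
c2 = Counter(rollout_root_sampling() for _ in range(n_rollouts))
for hist in sorted(c1):
    print(hist, round(c1[hist] / n_rollouts, 3), round(c2[hist] / n_rollouts, 3))
```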