# Risk-Averse Bayes-Adaptive Reinforcement Learning

Marc Rigter, Oxford Robotics Institute, University of Oxford, mrigter@robots.ox.ac.uk
Bruno Lacerda, Oxford Robotics Institute, University of Oxford, bruno@robots.ox.ac.uk
Nick Hawes, Oxford Robotics Institute, University of Oxford, nickh@robots.ox.ac.uk

## Abstract

In this work, we address risk-averse Bayes-adaptive reinforcement learning. We pose the problem of optimising the conditional value at risk (CVaR) of the total return in Bayes-adaptive Markov decision processes (MDPs). We show that a policy optimising CVaR in this setting is risk-averse to both the epistemic uncertainty due to the prior distribution over MDPs, and the aleatoric uncertainty due to the inherent stochasticity of MDPs. We reformulate the problem as a two-player stochastic game and propose an approximate algorithm based on Monte Carlo tree search and Bayesian optimisation. Our experiments demonstrate that our approach significantly outperforms baseline approaches for this problem.

## 1 Introduction

In standard model-based reinforcement learning (RL), an agent interacts with an unknown environment to maximise the expected reward [42]. However, for any given episode the total reward received by the agent is uncertain. There are two sources of this uncertainty: epistemic (or parametric) uncertainty due to imperfect knowledge of the underlying MDP model, and aleatoric (or internal) uncertainty due to the inherent stochasticity of the underlying MDP [20]. In many domains, we wish to find policies which are risk-averse to both sources of uncertainty.

As an illustrative example, consider a navigation system for an automated taxi service. The system attempts to minimise travel duration subject to uncertainty from the traffic conditions, and from traffic lights and pedestrian crossings. For each new day of operation, the traffic conditions are initially unknown. This corresponds to epistemic uncertainty over the MDP model, and it is reduced as the agent collects experience and learns which roads are busy or not busy. Traffic lights and pedestrian crossings cause stochasticity in the transition durations, corresponding to aleatoric uncertainty, which remains even as the agent learns about the environment. A direct route, i.e. through the centre of the city, optimises the expected journey duration. However, occasionally this route incurs extremely long durations due to some combination of poor traffic conditions and getting stuck at traffic lights and pedestrian crossings. Thus, bad outcomes can be caused by a combination of both sources of uncertainty. These rare outcomes may result in unhappy customers for the taxi service. Therefore, we wish to be risk-averse and prioritise greater certainty of avoiding poor outcomes over expected performance. To ensure that the risk of a bad journey is avoided, the navigation system must consider both the epistemic and the aleatoric uncertainty.

In model-based Bayesian RL, a belief distribution over the underlying MDP is maintained. This quantifies the epistemic uncertainty over the underlying MDP given the transitions observed so far. The Bayesian RL problem can be reformulated as a planning problem in a Bayes-Adaptive MDP (BAMDP) with an augmented state space composed of the belief over the underlying MDP and the state in the underlying MDP [12].
In this work, we address risk-averse decision making in model-based Bayesian RL. Instead of optimising for expected value, we optimise a risk metric applied to the total return of the BAMDP. Specifically, we focus on conditional value at risk (CVaR) [34]. The return in a BAMDP is uncertain due to both the uncertain prior belief over the underlying MDP, and the inherent stochasticity of MDPs. By optimising the CVaR of the return in the BAMDP, our approach simultaneously addresses epistemic and aleatoric uncertainty under a single framework. This is in contrast with previous works, which have generally considered risk-aversion to either epistemic or aleatoric uncertainty.

We formulate CVaR optimisation in Bayesian RL as a stochastic game over the BAMDP against an adversarial environment. Solving this game is computationally challenging because the set of reachable augmented states grows exponentially, and the action space of the adversary is continuous. Our proposed algorithm uses Monte Carlo tree search (MCTS) to focus search on promising areas of the augmented state space. To deal with the continuous action space of the adversary, we use Bayesian optimisation to expand promising adversary actions. Our main contributions are:

- Addressing CVaR optimisation of the return in model-based Bayesian RL to simultaneously mitigate epistemic and aleatoric uncertainty.
- An algorithm based on MCTS and Bayesian optimisation for this problem.

To our knowledge, this is the first work to optimise a risk metric in model-based Bayesian RL to simultaneously address both epistemic and aleatoric uncertainty, and the first work to present an MCTS algorithm for CVaR optimisation in sequential decision making problems. Our empirical results show that our algorithm significantly outperforms baseline methods on two domains.

## 2 Related Work

Many existing works address aversion to risk stemming from either epistemic model uncertainty or aleatoric uncertainty. One of the most popular frameworks for addressing epistemic uncertainty is the Robust or Uncertain MDP [17, 26, 33, 47], which considers the worst-case expected value over a set of possible MDPs. Other works address epistemic uncertainty by assuming that a distribution over the underlying MDP is known. One approach is to find a policy which optimises the expected value in the worst k% of parameter realisations [5, 11]. Most related to our work is that of Sharma et al. [38], which optimises a risk metric in the Bayesian RL context where the prior is a discrete distribution over a finite number of MDPs. However, all of these existing works marginalise out the aleatoric uncertainty by considering the expected value under each possible MDP. In contrast, we optimise the CVaR of the return in the Bayesian RL setting to address both epistemic and aleatoric uncertainty.

Conditional value at risk (CVaR) [34] is a common coherent risk metric which has been applied to MDPs to address aleatoric uncertainty. For an introduction to coherent risk metrics, see [2]. In this paper, we consider static risk, where the risk metric is applied to the total return, rather than dynamic risk, where the risk metric is applied recursively [43]. For this static CVaR setting, approaches based on value iteration over an augmented state space have been proposed [3, 8, 32]. Other works propose policy gradient methods to find a locally optimal solution for the optimisation of CVaR [4, 6, 7, 31, 43, 45, 46], or of general coherent risk metrics [44].
These existing approaches all assume the agent has access to the true underlying MDP for training/planning, and optimise the CVaR in that single MDP. In our work, we address the model-based Bayesian RL setting where we only have access to a prior belief distribution over MDPs. We find Bayes-adaptive policies which are risk-averse to both the epistemic uncertainty over the underlying MDP and the aleatoric uncertainty due to stochastic transitions. To our knowledge, no existing work has addressed CVaR optimisation in the Bayes-adaptive RL setting to simultaneously address both forms of uncertainty.

Constrained MDPs impose a constraint on the expected value of a cost [1, 36]. Such constraints have been addressed in Bayesian RL [18]. However, this approach only constrains the expected cost, and it is unclear how to choose cost penalties to obtain the desired risk-averse behaviour. Monte Carlo tree search (MCTS) has been adapted to expected cost constraints [19]. However, to our knowledge, this is the first work to adapt MCTS to CVaR optimisation in sequential decision making problems. In the deep RL setting, Deep Q-Networks [24] have been modified to predict the uncertainty over the Q-values associated with epistemic and aleatoric uncertainty [9, 13] and to derive a risk-sensitive policy. In this work, we propose to use the Bayes-adaptive formulation of RL to model and address both uncertainty sources. Another related strand of work is robust adversarial reinforcement learning (RARL) [28, 30], which optimises robust policies by simultaneously training an adversary that perturbs the environment. In RARL it is unclear how much power to give to the adversary to achieve the desired level of robustness. Our approach ensures that the CVaR at the desired confidence level is optimised.

## 3 Preliminaries

Let $Z$ be a bounded-mean random variable, i.e. $\mathbb{E}[|Z|] < \infty$, on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with cumulative distribution function $F(z) = \mathbb{P}(Z \leq z)$. In this paper we interpret $Z$ as the total reward, or return, to be maximised. The value at risk at confidence level $\alpha \in (0, 1]$ is defined as $\mathrm{VaR}_\alpha(Z) = \min\{z \mid F(z) \geq \alpha\}$. The conditional value at risk at confidence level $\alpha$ is defined as

$$\mathrm{CVaR}_\alpha(Z) = \frac{1}{\alpha} \int_0^\alpha \mathrm{VaR}_\gamma(Z) \, d\gamma. \quad (1)$$

If $Z$ has a continuous distribution, $\mathrm{CVaR}_\alpha(Z)$ can be defined using the more intuitive expression $\mathrm{CVaR}_\alpha(Z) = \mathbb{E}\left[Z \mid Z \leq \mathrm{VaR}_\alpha(Z)\right]$. Thus, $\mathrm{CVaR}_\alpha(Z)$ may be interpreted as the expected value of the $\alpha$-portion of the left tail of the distribution of $Z$. CVaR may also be defined as the expected value under a perturbed distribution using its dual representation [34, 37]:

$$\mathrm{CVaR}_\alpha(Z) = \min_{\xi \in \mathcal{B}(\alpha, \mathbb{P})} \mathbb{E}_\xi[Z], \quad (2)$$

where $\mathbb{E}_\xi[Z]$ denotes the $\xi$-weighted expectation of $Z$, and the risk envelope $\mathcal{B}$ is given by

$$\mathcal{B}(\alpha, \mathbb{P}) = \left\{ \xi : \xi(\omega) \in \left[0, \tfrac{1}{\alpha}\right], \ \int_{\omega \in \Omega} \xi(\omega) \, \mathbb{P}(\omega) \, d\omega = 1 \right\}. \quad (3)$$

Therefore, the CVaR of a random variable $Z$ may be interpreted as the expectation of $Z$ under a worst-case perturbed distribution, $\xi \mathbb{P}$. The risk envelope is defined so that the probability density of any outcome can be increased by a factor of at most $1/\alpha$, whilst ensuring that the perturbed distribution is a valid probability distribution.
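These left-tail definitions are easy to check numerically. The following minimal NumPy sketch (ours, not from the paper) estimates $\mathrm{VaR}_\alpha$ and $\mathrm{CVaR}_\alpha$ from samples of a return $Z$, approximating the integral in Eq. 1 by the mean of the worst $\alpha$-fraction of samples.

```python
import numpy as np

def var_cvar(returns, alpha):
    """Empirical VaR and CVaR (left tail; returns are to be maximised) at level alpha."""
    z = np.sort(np.asarray(returns, dtype=float))
    n = len(z)
    # VaR_alpha(Z) = min{z : F(z) >= alpha}, using the empirical CDF.
    idx = int(np.ceil(alpha * n)) - 1
    var = z[idx]
    # CVaR_alpha(Z) ~= mean of the worst alpha-fraction of samples (Eq. 1).
    cvar = z[: idx + 1].mean()
    return var, cvar

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=3.0, size=100_000)
print(var_cvar(sample, alpha=0.05))  # CVaR_0.05 lies below VaR_0.05 under this left-tail convention
```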
### 3.1 CVaR Optimisation in Standard MDPs

A finite-horizon Markov decision process (MDP) is a tuple $M = (S, A, R, T, H, s_0)$, where $S$ and $A$ are finite state and action spaces, $R : S \times A \to \mathbb{R}$ is the reward function, $T : S \times A \times S \to [0, 1]$ is the transition function, $H$ is the horizon, and $s_0$ is the initial state. A history of $M$ is a sequence $h = s_0 a_0 s_1 a_1 \ldots a_{t-1} s_t$ such that $T(s_i, a_i, s_{i+1}) > 0$ for all $i$. We denote the set of all histories over $M$ as $\mathcal{H}_M$.

A history-dependent policy is a function $\pi : \mathcal{H}_M \to A$, and we write $\Pi^M_H$ to denote the set of all history-dependent policies for $M$. If $\pi$ only depends on the last state $s_t$ of $h$, we say that $\pi$ is Markovian, and we denote the set of all Markovian policies as $\Pi^M$. A policy $\pi$ induces a distribution $P^M_\pi$ over histories, and we define $G^M_\pi = \sum_{t=0}^{H} R(s_t, a_t)$ as the distribution over total returns of $M$ under $\pi$. Many existing works have addressed the problem of optimising the static CVaR of $G^M_\pi$, defined as follows.

**Problem 1 (CVaR optimisation in standard MDPs)** Let $M$ be an MDP. Find the optimal CVaR of the total return at confidence level $\alpha$:

$$\max_{\pi \in \Pi^M_H} \mathrm{CVaR}_\alpha\big(G^M_\pi\big). \quad (4)$$

Unlike for dynamic Markov risk measures [35], history-dependent policies may be required to optimally solve this static CVaR optimisation problem [3, 8, 43]. Previous works [8, 27] have shown that this problem can be interpreted either as being risk-averse to aleatoric uncertainty, or as being robust to epistemic uncertainty. The latter interpretation comes from the dual representation of CVaR (Eq. 2), where CVaR is expressed as the expected value under a worst-case perturbed probability distribution.

Methods based on dynamic programming have been proposed to solve Problem 1 [3, 8, 29]. In particular, Chow et al. [8] augment the state space with an additional continuous state factor, $y \in (0, 1]$, corresponding to the confidence level. The value function for the augmented state $(s, y)$ is defined by $V(s, y) = \max_{\pi \in \Pi^M_H} \mathrm{CVaR}_y\big(G^M_\pi\big)$, and can be computed using the following Bellman equation:

$$V(s, y) = \max_{a \in A} \left[ R(s, a) + \min_{\xi \in \mathcal{B}(y, T(s, a, \cdot))} \sum_{s' \in S} \xi(s') \, V\big(s', y\xi(s')\big) \, T(s, a, s') \right], \quad (5)$$

where $\xi$ represents an adversarial perturbation to the transition probabilities, and the additional state variable, $y$, ensures that the perturbation to the probability of any history through the MDP is at most $1/y$. Thus, $y$ can be thought of as keeping track of the budget to adversarially perturb the probabilities. Therefore, $V(s_0, \alpha)$ is the expected value under the adversarial probability distribution defined by Eqs. 2 and 3, and corresponds to the optimal CVaR in the MDP (Problem 1). To address the continuous state variable, $y$, [8] use dynamic programming with linear function approximation.
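To make the inner minimisation in Eq. 5 concrete, the sketch below (ours, and a deliberate simplification) solves it for the special case where the successor values are fixed numbers, i.e. ignoring the dependence of the continuation value on the updated budget $y\xi(s')$, as at the final stage of the recursion. Under that assumption the inner problem is a small linear program with a greedy closed-form solution: push as much probability mass as the budget allows onto the worst successors. In the full recursion, $V(s', y\xi(s'))$ must be evaluated at the perturbed budget, which [8] address with linear function approximation over $y$.

```python
import numpy as np

def worst_case_expectation(values, probs, y):
    """min_{xi in B(y, p)}  sum_s' xi(s') * V(s') * p(s')
    s.t. 0 <= xi(s') <= 1/y  and  sum_s' xi(s') p(s') = 1.
    Substituting w(s') = xi(s') p(s') gives a fractional-knapsack LP: assign as
    much mass as allowed (at most p(s')/y) to successors in ascending value order."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    order = np.argsort(values)              # worst successors first
    w = np.zeros_like(probs)
    remaining = 1.0
    for i in order:
        w[i] = min(probs[i] / y, remaining)
        remaining -= w[i]
        if remaining <= 1e-12:
            break
    xi = np.divide(w, probs, out=np.zeros_like(w), where=probs > 0)
    return float(np.dot(w, values)), xi

# Two successors: a bad outcome (value 0, prob 0.1) and a good one (value 10, prob 0.9).
# With budget y = 0.2 the adversary may inflate any probability by a factor of at most 5.
print(worst_case_expectation([0.0, 10.0], [0.1, 0.9], y=0.2))
```

Running the example, the adversary inflates the probability of the bad successor from 0.1 to 0.5, exactly the factor $1/y = 5$ permitted by the risk envelope, and the worst-case expectation drops from 9 to 5.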
### 3.2 Stochastic Games

In this paper, we formulate CVaR optimisation as a turn-based, two-player, zero-sum stochastic game (SG). An SG between an agent and an adversary is a generalisation of an MDP and can be defined using a similar tuple $G = (S, A, T, R, H, s_0)$. The elements of $G$ are interpreted as for MDPs, but extra structure is added. In particular, $S$ is partitioned into a set of agent states $S^{\text{agt}}$ and a set of adversary states $S^{\text{adv}}$. Similarly, $A$ is partitioned into a set of agent actions $A^{\text{agt}}$ and a set of adversary actions $A^{\text{adv}}$. The transition function is defined such that agent actions can only be executed in agent states, and adversary actions can only be executed in adversary states. We denote the set of Markovian agent policies mapping agent states to agent actions as $\Pi_G$, and the set of Markovian adversary policies, defined similarly, as $\Sigma_G$. A pair $(\pi, \sigma)$ of agent-adversary policies induces a probability distribution over histories, and we define $G^G_{(\pi,\sigma)} = \sum_{t=0}^{H} R(s_t, a_t)$ as the distribution over total returns of $G$ under $\pi$ and $\sigma$. In a zero-sum SG, the agent seeks to maximise the expected return, whilst the adversary seeks to minimise it:

$$\max_{\pi \in \Pi_G} \min_{\sigma \in \Sigma_G} \mathbb{E}\left[G^G_{(\pi,\sigma)}\right]. \quad (6)$$

### 3.3 Bayesian RL

Here we describe the Bayesian formulation of decision making in an unknown MDP [21, 12]. In Bayesian RL, the dynamics of the MDP $M$ are unknown, and we assume that its transition function, $T$, is a latent variable distributed according to a prior distribution $P(T \mid h_0)$, where $h_0$ represents the empty history in which no transitions have been observed. After observing a history $h = s_0 a_0 s_1 a_1 \ldots a_{t-1} s_t$, the posterior belief over $T$ is updated using Bayes' rule: $P(T \mid h) \propto P(h \mid T) \, P(T \mid h_0)$.

In the Bayes-Adaptive MDP (BAMDP) formulation of Bayesian RL [12], the uncertainty about the model dynamics is captured by augmenting the state space to include the history: $S^+ = S \times \mathcal{H}_M$. This captures the model uncertainty because the history is a sufficient statistic for the posterior belief over $T$ given the initial prior belief. The transition and reward functions for this augmented state space are

$$T^+\big((s, h), a, (s', has')\big) = \int_T T(s, a, s') \, P(T \mid h) \, dT, \quad (7)$$

$$R^+\big((s, h), a\big) = R(s, a). \quad (8)$$

The initial augmented state is $s^+_0 = (s_0, h_0)$. The tuple $M^+ = (S^+, A, T^+, R^+, H, s^+_0)$ defines the BAMDP. Typically, solutions to the Bayesian RL problem attempt to (approximately) find a Bayes-optimal policy, which maximises the expected sum of rewards under the prior belief over the transition function. In this work, we address risk-averse solutions to the Bayesian RL problem.
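As a concrete special case (a minimal sketch of ours, not the paper's C++ implementation), if the prior over each row $T(s, a, \cdot)$ is an independent Dirichlet distribution, as is standard for tabular BAMDPs, then the posterior after a history simply adds observed transition counts to the prior pseudo-counts, and the BAMDP transition function of Eq. 7 reduces to the posterior mean.

```python
import numpy as np

class DirichletBelief:
    """One Dirichlet belief per (state, action) pair over successor states."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        self.alpha = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        # Bayes' rule for the Dirichlet-categorical model: add one to the observed count.
        self.alpha[s, a, s_next] += 1.0

    def bamdp_transition(self, s, a):
        # Eq. 7: T+((s,h), a, .) = E_{T ~ P(.|h)}[T(s, a, .)], the posterior mean.
        return self.alpha[s, a] / self.alpha[s, a].sum()

belief = DirichletBelief(n_states=3, n_actions=2)
for s_next in [1, 1, 2]:                 # observe three transitions from (s=0, a=0)
    belief.update(0, 0, s_next)
print(belief.bamdp_transition(0, 0))     # posterior-mean transition probabilities
```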
## 4 CVaR Optimisation in Bayes-Adaptive RL

Many existing works address CVaR optimisation in standard MDPs. Here, we assume that we only have access to a prior distribution over MDPs, and optimise the CVaR in the corresponding BAMDP.

**Problem 2 (CVaR optimisation in Bayes-Adaptive MDPs)** Let $M^+$ be a Bayes-Adaptive MDP. Find the optimal CVaR at confidence level $\alpha$:

$$\max_{\pi \in \Pi^{M^+}} \mathrm{CVaR}_\alpha\big(G^{M^+}_\pi\big). \quad (9)$$

Posing CVaR optimisation over BAMDPs leads to two observations. First, unlike in Problem 1, we do not need to consider history-dependent policies, as the history is already embedded in the BAMDP state. Second, the solution to this problem corresponds to a policy which is risk-averse to both the epistemic uncertainty due to the uncertain belief over the MDP, and the aleatoric uncertainty due to the stochasticity of the underlying MDP. In the following, we elaborate on this second observation.

### 4.1 Motivation: Robustness to Epistemic and Aleatoric Uncertainty

The solution to Problem 2 is risk-averse to both the epistemic and the aleatoric uncertainty. Intuitively, this is because the return distribution in the BAMDP is influenced both by the prior distribution over models and by stochastic transitions. In this section, we introduce a novel alternative definition of CVaR in BAMDPs to formally illustrate this concept. Specifically, we show that CVaR in a BAMDP is equivalent to the expected value under an adversarial prior distribution over transition functions, together with adversarial perturbations to the transition probabilities in all possible transition functions. The perturbation budget of the adversary is split multiplicatively between perturbing the probability density of transition functions and perturbing the transition probabilities along any history. Therefore, the adversary has the power to minimise the return for the agent through a combination of modifying the prior so that bad transition functions are more likely (epistemic uncertainty), and making bad transitions within any given transition function more likely (aleatoric uncertainty).

We consider an adversarial setting, whereby an adversary is allowed to apply perturbations to modify the prior distribution over the transition function, as well as the transition probabilities within each possible transition function, subject to budget constraints. Define $\mathcal{T}$ to be the support of $P(T \mid h_0)$, and let $T_p \in \mathcal{T}$ be a plausible transition function. Note that, in general, $\mathcal{T}$ can be a continuous set. Consider a perturbation of the prior distribution over transition functions, $\delta : \mathcal{T} \to \mathbb{R}_{\geq 0}$, such that $\int_{\mathcal{T}} \delta(T_p) \, P(T_p \mid h_0) \, dT_p = 1$. Additionally, for $T_p \in \mathcal{T}$, consider a perturbation of $T_p$, $\xi_p : S \times A \times S \to \mathbb{R}_{\geq 0}$, such that $\sum_{s' \in S} \xi_p(s, a, s') \, T_p(s, a, s') = 1$ for all $s, a$. We denote the set of all possible perturbations of $T_p$ as $\Xi_p$, and the set of all perturbations over $\mathcal{T}$ as $\Xi = \times_{T_p \in \mathcal{T}} \Xi_p$. For a BAMDP $M^+$, $\delta : \mathcal{T} \to \mathbb{R}_{\geq 0}$, and $\xi \in \Xi$, we define $M^+_{\delta,\xi}$ as the BAMDP, obtained from $M^+$, with perturbed prior distribution $P_\delta(T_p \mid h_0) = \delta(T_p) \, P(T_p \mid h_0)$ and perturbed transition functions $T^\xi_p(s, a, s') = \xi_p(s, a, s') \, T_p(s, a, s')$.

We are now ready to state Prop. 1, which provides an interpretation of CVaR in BAMDPs as expected value under an adversarial prior distribution and adversarial transition probabilities. All propositions are proven in the supplementary material.

**Proposition 1 (CVaR in BAMDPs)** Let $M^+$ be a BAMDP. Then:

$$\mathrm{CVaR}_\alpha\big(G^{M^+}_\pi\big) = \min_{\delta : \mathcal{T} \to \mathbb{R}_{\geq 0}, \ \xi \in \Xi} \mathbb{E}\left[ G^{M^+_{\delta,\xi}}_\pi \right], \ \text{s.t.} \quad (10)$$

$$0 \leq \delta(T_p) \, \xi_p(s_0, a_0, s_1) \, \xi_p(s_1, a_1, s_2) \cdots \xi_p(s_{H-1}, a_{H-1}, s_H) \leq \frac{1}{\alpha} \qquad \forall T_p \in \mathcal{T}, \ \forall (s_0, a_0, s_1, a_1, \ldots, s_H) \in \mathcal{H}_H,$$

where $\mathcal{H}_H$ denotes the set of histories of length $H$. Prop. 1 shows that CVaR can be computed as the expected value under an adversarial prior distribution over transition functions, together with adversarial perturbations to the transition probabilities in all possible transition functions. Therefore, a policy which optimises the CVaR in a BAMDP must mitigate the risks due to both sources of uncertainty.

### 4.2 Bayes-Adaptive Stochastic Game Formulation

In Prop. 1, we formulated CVaR in BAMDPs as expected value under adversarial perturbations to the prior and to each possible MDP. However, this formulation is difficult to solve directly, as it requires optimising over the set of variables $\Xi$, which can have infinite dimension since the space of plausible transition functions may be infinite. Thus, we formulate CVaR optimisation in BAMDPs in a way that is more amenable to optimisation: an SG played on an augmented version of the BAMDP, in which the agent plays against an adversary that modifies the transition probabilities of the original BAMDP. Perturbing the BAMDP transition probabilities is equivalent to perturbing both the prior distribution and each transition function, as considered by Prop. 1. This extends work on CVaR optimisation in standard MDPs [8] to the Bayes-adaptive setting.

We begin by augmenting the state space of the BAMDP with an additional state factor, $y \in (0, 1]$, so that the augmented state comprises $(s, h, y) \in S^+_G = S \times \mathcal{H}_M \times (0, 1]$. The stochastic game proceeds as follows. Given an augmented agent state $(s, h, y) \in S^{+,\text{agt}}_G$, the agent applies an action $a \in A$ and receives reward $R^+_G((s, h, y), a) = R(s, a)$. The state then transitions to an adversary state $(s, ha, y) \in S^{+,\text{adv}}_G$. The adversary then chooses an action from a continuous action space, $\xi \in \Xi(s, a, h, y)$, to perturb the BAMDP transition probabilities, where

$$\Xi(s, a, h, y) = \Big\{ \xi : S \to \mathbb{R}_{\geq 0} \ \Big| \ 0 \leq \xi(s') \leq 1/y \ \ \forall s', \ \ \sum_{s' \in S} \xi(s') \, T^+\big((s, h), a, (s', has')\big) = 1 \Big\}. \quad (11)$$

In Eq. 11, the adversary perturbation is restricted to adhere to the budget defined by Eq. 3, so that the probability of any history in the BAMDP is perturbed by at most $1/y$. After the adversary chooses the perturbation action, the game transitions back to an agent state $(s', has', y\xi(s')) \in S^{+,\text{agt}}_G$ according to the transition function

$$T^+_G\big((s, ha, y), \xi, (s', has', y\xi(s'))\big) = \xi(s') \, T^+\big((s, h), a, (s', has')\big). \quad (12)$$
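The short sketch below (ours; the helper name is made up) simply checks a candidate perturbation against the two constraints of Eq. 11 and applies it, returning both the perturbed successor distribution and the multiplicatively updated budgets $y\,\xi(s')$ that are carried into the successor states by the game transition of Eq. 12.

```python
import numpy as np

def apply_perturbation(bamdp_probs, xi, y, tol=1e-9):
    """Validate xi against Eq. 11 and return the perturbed successor distribution
    together with the updated budget y * xi(s') passed on to each successor."""
    bamdp_probs, xi = np.asarray(bamdp_probs, float), np.asarray(xi, float)
    assert np.all(xi >= -tol) and np.all(xi <= 1.0 / y + tol), "xi outside [0, 1/y]"
    assert abs(np.dot(xi, bamdp_probs) - 1.0) < 1e-6, "perturbed probabilities must sum to 1"
    perturbed = xi * bamdp_probs          # adversary-perturbed transition distribution (Eq. 12)
    new_budgets = y * xi                  # budget carried to each successor state
    return perturbed, new_budgets

# BAMDP successor probabilities for (s, h, a); adversary budget y = 0.25, so xi <= 4.
probs = np.array([0.2, 0.8])
xi = np.array([2.5, 0.625])               # inflates the first successor, still sums to one
print(apply_perturbation(probs, xi, y=0.25))
```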
The initial augmented state of the stochastic game is $s^+_{0,G} = (s_0, h_0, \alpha) \in S^{+,\text{agt}}_G$. The tuple $G^+ = (S^+_G, A^+_G, T^+_G, R^+_G, H, s^+_{0,G})$ defines the Bayes-Adaptive CVaR Stochastic Game (BA-CVaR-SG), where the joint action space is $A^+_G = A \cup \Xi$. The maximin expected value equilibrium of the BA-CVaR-SG optimises the expected value of the BAMDP under perturbations to the probability distribution over paths by an adversary, subject to the constraint defined by Eq. 3. Therefore, Prop. 2 states that the maximin expected value equilibrium in the BA-CVaR-SG corresponds to the solution of CVaR optimisation in BAMDPs (Problem 2).

**Proposition 2 (Formulation of CVaR optimisation in BAMDPs as a stochastic game)**

$$\max_{\pi \in \Pi^{M^+}} \mathrm{CVaR}_\alpha\big(G^{M^+}_\pi\big) = \max_{\pi \in \Pi^{G^+}} \min_{\sigma \in \Sigma^{G^+}} \mathbb{E}\left[ G^{G^+}_{(\pi,\sigma)} \right]. \quad (13)$$

Prop. 2 directly extends Prop. 1 of [8] to BAMDPs. The BAMDP transition function is determined by both the prior distribution over transition functions and the transition functions themselves (Eq. 7). Therefore, perturbing the BAMDP transition function, as considered by Prop. 2, is equivalent to perturbing both the prior distribution and each transition function, as considered by Prop. 1.

### 4.3 Monte Carlo Tree Search Algorithm

Solving the BA-CVaR-SG is still computationally difficult for two primary reasons. First, the set of reachable augmented states grows exponentially with the horizon, due to the exponentially growing set of possible histories. This prohibits the use of the value iteration approach proposed by Chow et al. [8] for standard MDPs. Second, CVaR is considerably more difficult to optimise than expected value. In the BA-CVaR-SG formulation, this difficulty manifests in the continuous action space of the adversary, corresponding to possible perturbations of the BAMDP transition function. In this section we present an approximate algorithm, Risk-Averse Bayes-Adaptive Monte Carlo Planning (RA-BAMCP), summarised in Algorithm 1 below.

**Algorithm 1: Risk-Averse Bayes-Adaptive Monte Carlo Planning**

    function Search(G⁺):
        create root node v₀ with: s⁺_G(v₀) = s⁺_{0,G}, player(v₀) = agent
        while within computational budget:
            v_leaf ← TreePolicy(v₀)
            r̃ ← DefaultPolicy(v_leaf)
            UpdateNodes(v_leaf, r̃)
        v* ← argmax over children v′ of v₀ of Q̂(v′)
        v** ← argmin over children v′′ of v* of Q̂(v′′)
        return a(v*), ξ(v**)

    function TreePolicy(v):
        while v is non-terminal:
            if player(v) = agent:
                if v not fully expanded: return ExpandAgent(v)
                else: v ← BestChild(v)
            else if player(v) = adv:
                if (N(v))^τ ≥ |Ξ̃(v)|: return ExpandAdversary(v)
                else: v ← BestChild(v)
            else if player(v) = chance:
                v ← sample successor according to Eq. 12

    function UpdateNodes(v, r̃):
        while v is not null:
            N(v) += 1
            Q̂(v) += (r̃(v) − Q̂(v)) / N(v)
            v ← parent of v

    function ExpandAgent(v):
        choose a from the untried actions in A
        add child v′ to v with: s⁺_G(v′) = s⁺_G(v), a(v′) = a, player(v′) = adv, Ξ̃(v′) = ∅
        return v′

    function ExpandAdversary(v):
        if Ξ̃(v) = ∅: ξ_new ← RandomPerturbation(v)
        else: ξ_new ← BayesOptPerturbation(v)
        insert ξ_new into Ξ̃(v)
        add child v′ to v with: s⁺_G(v′) = s⁺_G(v), ξ(v′) = ξ_new, player(v′) = chance
        return v′

    function BestChild(v):
        if player(v) = agent: return argmax over children v′ of v of [ Q̂(v′) + c_mcts √(log N(v) / N(v′)) ]
        else: return argmin over children v′ of v of [ Q̂(v′) − c_mcts √(log N(v) / N(v′)) ]
Like BAMCP [16], which applies MCTS to BAMDPs, and applications of MCTS to two-player games such as Go [14], we use MCTS to handle the large state space by focusing search in promising areas. Unlike BAMCP, we perform search over a Bayes-adaptive stochastic game against an adversary which may perturb the probabilities, and we cannot use root sampling, as we need access to the BAMDP transition dynamics (Eq. 7) to compute the set of admissible adversary actions (Eq. 11). To handle the continuous action space of the adversary, we use progressive widening [10] paired with Bayesian optimisation [25, 22] to prioritise promising actions.

**RA-BAMCP Outline** Algorithm 1 proceeds using the standard components of MCTS: selecting nodes within the tree, expanding leaf nodes, performing rollouts using a default policy, and updating the statistics in the tree. The search tree consists of alternating layers of three types of nodes: agent decision nodes, adversary decision nodes, and chance nodes. At each new node $v$, the value estimate associated with that node, $\hat{Q}(v)$, and the visitation count, $N(v)$, are both initialised to zero. At agent (adversary) decision nodes we use an upper (lower) confidence bound to optimistically select actions within the tree.

At newly expanded chance nodes we compute $P(T \mid h)$, the posterior distribution over the transition function for the history leading up to that chance node. Using this posterior distribution, we can compute the transition function of the BA-CVaR-SG, $T^+_G$, at that chance node. When a chance node is reached during a simulation, the successor state is sampled using $T^+_G$.

At each adversary node, a discrete set of perturbation actions is available to the adversary, denoted $\tilde{\Xi}(v)$. We use progressive widening [10] to gradually grow this set by adding actions from the continuous set of possible actions, $\Xi(v)$. The rate at which new actions are expanded is controlled by the progressive widening parameter, $\tau$.

A naive implementation of progressive widening would simply sample new perturbation actions at random, which might require many expansions before finding a good action. To mitigate this issue we use Gaussian process (GP) Bayesian optimisation to select promising actions to expand in the tree [25, 22]. In our algorithm, the BayesOptPerturbation function performs Bayesian optimisation to return a promising perturbation action at a given adversary node, $v$. We train a GP with a Gaussian kernel. The input features are the perturbation actions already expanded at the node, $\xi \in \tilde{\Xi}(v)$; the number of input dimensions equals the number of successor states with non-zero probability under the BAMDP transition function. The training labels are the corresponding value estimates in the tree for each of these perturbation actions: $\hat{Q}(v')$, for all children $v'$ of $v$. The acquisition function for choosing the new perturbation action to expand is a lower confidence bound, which is optimistic for the minimising adversary: $\xi_{\text{new}} = \arg\min_\xi \left[\mu(\xi) - c_{\text{bo}}\, \sigma(\xi)\right]$, where $\mu(\xi)$ and $\sigma(\xi)$ are the mean and standard deviation of the GP posterior prediction, respectively.
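As an illustration of this selection step, the sketch below (ours, using scikit-learn rather than the paper's C++ implementation) fits a GP with an RBF (Gaussian) kernel to the perturbations already expanded at an adversary node and their tree value estimates, then evaluates the lower-confidence-bound acquisition on a list of candidate perturbations. In the actual algorithm the acquisition would be minimised over the continuous feasible set of Eq. 11 rather than over a fixed candidate list, and the kernel hyperparameters are those given in the paper's supplementary material, not the illustrative value used here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bayes_opt_perturbation(expanded_xis, q_estimates, candidate_xis, c_bo=2.0):
    """Fit a GP to (perturbation, value-estimate) pairs at an adversary node and
    return the candidate minimising the lower confidence bound mu - c_bo * sigma."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
    gp.fit(np.asarray(expanded_xis, float), np.asarray(q_estimates, float))
    mu, sigma = gp.predict(np.asarray(candidate_xis, float), return_std=True)
    return candidate_xis[int(np.argmin(mu - c_bo * sigma))]

# Adversary node with two reachable successors under BAMDP probabilities (0.2, 0.8),
# so every perturbation satisfies 0.2*xi_1 + 0.8*xi_2 = 1 (Eq. 11, with y <= 0.25).
expanded = [[1.0, 1.0], [2.0, 0.75]]      # perturbations already expanded at this node
q_hats = [30.0, 22.0]                     # their current tree value estimates Q-hat
candidates = [[0.5, 1.125], [3.0, 0.5], [4.0, 0.25]]
print(bayes_opt_perturbation(expanded, q_hats, candidates))
```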
Prop. 3 establishes that a near-optimal perturbation will eventually be expanded with high probability if $c_{\text{bo}}$ is set appropriately.

**Proposition 3** Let $\delta \in (0, 1)$. Provided that $c_{\text{bo}}$ is chosen appropriately (details in the appendix), as the number of perturbations expanded approaches $\infty$, a perturbation within any $\epsilon > 0$ of the optimal perturbation will be expanded by the Bayesian optimisation procedure with probability at least $1 - \delta$.

After each simulation, the value estimates at each node are updated using $\tilde{r}(v)$, the reward received from $v$ onwards during the simulation. Search continues by performing simulations until the computational budget is exhausted. At this point, the estimated best action for the agent and the adversary is returned. In our experiments we perform online planning: after executing an action and transitioning to a new state, we perform more simulations from the new root state.

## 5 Experiments

Code for the experiments is included in the supplementary material. All algorithms are implemented in C++ and Gurobi is used to solve the linear programs for the value iteration (VI) approaches. Computation times are reported for a 3.2 GHz Intel i7 processor with 64 GB of RAM.

No existing works address the problem formulation in this paper. Therefore, we adapt two methods for CVaR optimisation to the BAMDP setting and also compare against approaches for related problems to assess their performance in our setting. We compare the following approaches:

- **RA-BAMCP**: the approach presented in this paper.
- **CVaR BAMDP Policy Gradient (PG)**: applies the policy gradient approach to CVaR optimisation from [45] with the vectorised belief representation for BAMDPs proposed in [15] and linear function approximation. Further details on this approach can be found in the supplementary material.
- **CVaR VI BAMDP**: the approximate VI CVaR optimisation approach from [8] applied to the BAMDP. Due to the extremely poor scalability of this approach we only test it on the smaller of our domains.
- **CVaR VI Expected MDP (EMDP)**: the approximate VI approach from [8] applied to the expected MDP parameters according to the prior distribution over transition functions, $P(T \mid h_0)$.
- **BAMCP**: Bayes-Adaptive Monte Carlo Planning, which optimises the expected value [16]. This is equivalent to RA-BAMCP with the confidence level, $\alpha$, set to 1.

For RA-BAMCP and BAMCP, the MCTS exploration parameter was set to $c_{\text{mcts}} = 2$ based on empirical performance. For these methods, we performed 100,000 simulations for the initial step and 25,000 simulations per step thereafter. For both of these algorithms, the rollout policy used for the agent was the CVaR VI Expected MDP policy. For RA-BAMCP, the rollout policy for the adversary was random, the progressive widening parameter was set to $\tau = 0.2$, and the exploration parameter for the GP Bayesian optimisation was also set to $c_{\text{bo}} = 2$. Details of the GP hyperparameters are in the supplementary material. For CVaR VI Expected MDP and CVaR VI BAMDP, 20 log-spaced points were used to discretise the continuous state space for approximate VI, as described by [8]. For CVaR BAMDP Policy Gradient, we trained using $2 \times 10^6$ simulations; details on hyperparameter tuning are in the supplementary material.
**Betting Game Domain** We adapt this domain from the literature on conditional value at risk in Markov decision processes [3] to the Bayes-adaptive setting. The agent begins with $money = 10$. At each stage the agent may choose to place a bet from $bets = \{0, 1, 2, 5, 10\}$, provided that sufficient money is available. If the agent wins the game at that stage, an amount of money equal to the bet placed is received. If the agent loses the stage, the amount of money bet is lost. The reward is the total money the agent has after 6 stages of betting. In our Bayes-adaptive setting, the probability of winning the game is not known a priori. Instead, a prior distribution over the probability of winning the game is modelled using a beta distribution with parameters $\alpha = 10/11$, $\beta = 1/11$, indicating that the odds of the game are likely to be in favour of the agent.

Figure 1: Autonomous car navigation domain. Colours indicate the road types as follows - black: highway, red: main road, blue: street, green: lane.

**Autonomous Car Navigation Domain** An autonomous car must navigate along roads to a destination by choosing from actions $\{up, down, left, right\}$ (Figure 1). There are four types of roads, $\{highway, main\ road, street, lane\}$, with different transition duration characteristics. The transition distribution for each type of road is unknown. Instead, the prior belief over the transition duration distribution for each road type is modelled by a Dirichlet distribution; thus, the belief is represented by four Dirichlet distributions. Details of the Dirichlet prior used for each road type are included in the supplementary material. Upon traversing each section of road, the agent receives reward $-time$, where $time$ is the transition duration for traversing that section of road. Upon reaching the destination, the agent receives a reward of 80. The search horizon was set to $H = 10$.
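To make the two uncertainty sources in this domain concrete, here is a small sketch of ours (not the paper's C++ environment, and with a made-up fixed betting policy used only for illustration): the win probability is drawn once per episode from the Beta prior (epistemic uncertainty), and each bet outcome is then resolved stochastically (aleatoric uncertainty).

```python
import numpy as np

def simulate_betting_episode(rng, bet_policy, stages=6, start_money=10):
    """One episode of the Betting Game; returns the money held after all stages."""
    p_win = rng.beta(10 / 11, 1 / 11)    # unknown win probability, drawn from the prior
    money = start_money
    for _ in range(stages):
        bet = min(bet_policy(money), money)            # bets require sufficient money
        money += bet if rng.random() < p_win else -bet  # win: gain the bet; lose: lose it
    return money

def always_bet_two(money):
    # Hypothetical fixed policy for illustration: bet 2 whenever possible.
    return 2 if money >= 2 else 0

rng = np.random.default_rng(0)
returns = np.array([simulate_betting_episode(rng, always_bet_two) for _ in range(20_000)])
alpha = 0.2
worst = np.sort(returns)[: max(1, int(np.ceil(alpha * len(returns))))]
print(returns.mean(), worst.mean())      # expected value vs. empirical CVaR_0.2
```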
### 5.1 Results

We evaluated the return from each method over 2000 episodes. For each episode, the true underlying MDP was sampled from the prior distribution. Therefore, the return received for each evaluation episode is subject to both epistemic uncertainty due to the uncertain MDP model, and aleatoric uncertainty due to stochastic transitions. Results for this evaluation can be found in Table 1. The rows indicate the method and the confidence level, $\alpha$, that the method is set to optimise. The columns indicate the performance of each method for CVaR with $\alpha = 0.03$ and $\alpha = 0.2$, and for expected value ($\alpha = 1$), evaluated using the sample of 2000 episodes. Note that when evaluating the performance of each algorithm, we only care about the CVaR performance for the confidence level that the algorithm is set to optimise.

In the Betting Game domain, strong performance is attained by CVaR VI BAMDP; however, the computation time required for this approach is considerable. It was not possible to test CVaR VI BAMDP on Autonomous Car Navigation due to its poor scalability. RA-BAMCP attained near-optimal performance at both confidence levels using substantially less computation. CVaR VI Expected MDP performed well at the $\alpha = 0.2$ confidence level, but very poorly at $\alpha = 0.03$. This approach performs poorly at $\alpha = 0.03$ because, by using the expected value of the transition function, the probability of repeatedly losing the game is underestimated. BAMCP performed the best at optimising expected value, but less well than other approaches at optimising CVaR. CVaR BAMDP Policy Gradient performed strongly in the betting domain with $\alpha = 0.2$. However, it attained poor asymptotic performance with $\alpha = 0.03$. Conversely, this method performed well in the navigation domain with $\alpha = 0.03$, but poorly for $\alpha = 0.2$. These results indicate that the policy gradient approach is susceptible to local minima. Training curves for CVaR BAMDP Policy Gradient can be found in the supplementary material.

In the Autonomous Car Navigation domain, RA-BAMCP outperformed CVaR VI Expected MDP at both confidence levels. As expected, BAMCP again performed the best at optimising expected value, but not CVaR. The return histograms in Figure 2 show that while BAMCP optimises expected value, there is high uncertainty over the return value. As the confidence level $\alpha$ is decreased, applying RA-BAMCP decreases the likelihood of poor returns by shifting the left tail of the return distribution to the right, at the expense of worse expected performance. This demonstrates that RA-BAMCP is risk-averse to poor return values in the presence of both epistemic and aleatoric uncertainty.

(a) Betting Game domain

| Method | Time (s) | CVaR ($\alpha$=0.03) | CVaR ($\alpha$=0.2) | Expected Value |
|---|---|---|---|---|
| RA-BAMCP ($\alpha$=0.03) | 17.9 (0.2) | 9.98 (0.02) | 10.00 (0.002) | 10.00 (0.001) |
| CVaR VI BAMDP ($\alpha$=0.03) | 8620.5 | 10.00 (0.0) | 10.00 (0.0) | 10.00 (0.0) |
| CVaR VI EMDP ($\alpha$=0.03) | 61.7 | 0.00 (0.0) | 10.45 (0.33) | 15.72 (0.09) |
| CVaR BAMDP PG ($\alpha$=0.03) | 4639.3 | 2.90 (0.06) | 10.57 (0.33) | 27.34 (0.24) |
| RA-BAMCP ($\alpha$=0.2) | 78.9 (0.2) | 0.00 (0.0) | 20.09 (1.04) | 46.84 (0.37) |
| CVaR VI BAMDP ($\alpha$=0.2) | 8620.5 | 0.00 (0.0) | 20.77 (1.02) | 44.15 (0.33) |
| CVaR VI EMDP ($\alpha$=0.2) | 61.7 | 0.00 (0.0) | 19.33 (0.99) | 39.86 (0.30) |
| CVaR BAMDP PG ($\alpha$=0.2) | 4309.7 | 0.00 (0.0) | 19.85 (0.97) | 46.65 (0.37) |
| BAMCP | 13.5 (0.01) | 0.00 (0.0) | 17.07 (1.09) | 59.36 (0.52) |

(b) Autonomous Car Navigation domain

| Method | Time (s) | CVaR ($\alpha$=0.03) | CVaR ($\alpha$=0.2) | Expected Value |
|---|---|---|---|---|
| RA-BAMCP ($\alpha$=0.03) | 255.7 (0.2) | 23.51 (0.17) | 25.41 (0.06) | 29.56 (0.06) |
| CVaR VI EMDP ($\alpha$=0.03) | 96.6 | 20.65 (0.68) | 26.11 (0.16) | 30.13 (0.07) |
| CVaR BAMDP PG ($\alpha$=0.03) | 37232.6 | 23.92 (0.08) | 25.38 (0.05) | 28.97 (0.05) |
| RA-BAMCP ($\alpha$=0.2) | 205.1 (0.3) | 18.91 (0.44) | 27.76 (0.24) | 38.33 (0.15) |
| CVaR VI EMDP ($\alpha$=0.2) | 96.6 | 4.20 (0.54) | 20.42 (0.44) | 36.64 (0.21) |
| CVaR BAMDP PG ($\alpha$=0.2) | 36686.1 | 10.05 (0.74) | 24.70 (0.39) | 49.67 (0.36) |
| BAMCP | 74.4 (0.02) | 5.56 (0.60) | 21.46 (0.47) | 50.16 (0.38) |

Table 1: Results from evaluating the returns of each method over 2000 episodes. Brackets indicate standard errors. Time for RA-BAMCP and BAMCP indicates the average total computation time per episode to perform simulations. Time for CVaR VI BAMDP and CVaR VI Expected MDP indicates the time for VI to converge. Time for CVaR BAMDP PG indicates the total training time.

Figure 2: Histogram of returns received in the Autonomous Car Navigation domain. Vertical red lines indicate CVaR$_{0.03}$, CVaR$_{0.2}$, and the expected value of the return distributions.

Results in the supplementary material show that our approach using random action expansion rather than Bayesian optimisation attains worse performance for the same computation budget. This demonstrates that Bayesian optimisation for action expansion is critical to our approach.

## 6 Discussion of Limitations and Conclusion

In this work, we have addressed CVaR optimisation in the BAMDP formulation of model-based Bayesian RL. This is motivated by the aim of mitigating risk due to both epistemic uncertainty over the model, and aleatoric uncertainty due to environment stochasticity. Our experimental evaluation demonstrates that our algorithm, RA-BAMCP, outperforms baseline approaches for this problem.
One of the limitations of our work is that our approach cannot use root sampling [16]. Instead, the posterior distribution is computed at each node in the MCTS tree, prohibiting the use of complex priors. In future work, we wish to develop an MCTS algorithm for CVaR optimisation which can use root sampling. Additionally, due to the exponential growth of the BAMDP and the difficulty of optimising CVaR, the scalability of our approach is limited. In future work, we wish to utilise function approximation to improve the scalability of our approach to more complex problems. One direction could be using a neural network to guide search, in the style of AlphaGo [39]. Another direction could be to extend the policy gradient algorithm that we proposed in the experiments section to make use of deep learning.

## Acknowledgements and Funding Disclosure

The authors would like to thank Paul Duckworth for valuable feedback on an earlier draft of this work. This work was supported by a Programme Grant from the Engineering and Physical Sciences Research Council [EP/V000748/1], the Clarendon Fund at the University of Oxford, and a gift from Amazon Web Services.

## References

[1] Eitan Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.

[2] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

[3] Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.

[4] Vivek Borkar and Rahul Jain. Risk-constrained Markov decision processes. In 49th IEEE Conference on Decision and Control (CDC), pages 2664–2669. IEEE, 2010.

[5] Katherine Chen and Michael Bowling. Tractable objectives for robust policy optimization. In Advances in Neural Information Processing Systems, pages 2069–2077, 2012.

[6] Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509–3517, 2014.

[7] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.

[8] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.

[9] William R Clements, Bastien Van Delft, Benoît-Marie Robaglia, Reda Bahi Slaoui, and Sébastien Toth. Estimating risk and uncertainty in deep reinforcement learning. arXiv preprint arXiv:1905.09638, 2019.

[10] Adrien Couëtoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In International Conference on Learning and Intelligent Optimization, pages 433–445. Springer, 2011.

[11] Erick Delage and Shie Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.

[12] Michael O Duff. Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. 2003.

[13] Hannes Eriksson, Debabrota Basu, Mina Alibeigi, and Christos Dimitrakakis. Sentinel: Taming uncertainty with ensemble-based distributional reinforcement learning. arXiv preprint arXiv:2102.11075, 2021.

[14] Sylvain Gelly and David Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11):1856–1875, 2011.
[15] Arthur Guez, Nicolas Heess, David Silver, and Peter Dayan. Bayes-adaptive simulation-based search with value function approximation. In NIPS, volume 27, pages 451–459, 2014.

[16] Arthur Guez, David Silver, and Peter Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. Advances in Neural Information Processing Systems, 25:1025–1033, 2012.

[17] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[18] Jongmin Lee, Youngsoo Jang, Pascal Poupart, and Kee-Eung Kim. Constrained Bayesian reinforcement learning via approximate linear programming. In IJCAI, pages 2088–2095, 2017.

[19] Jongmin Lee, Geon-Hyeong Kim, Pascal Poupart, and Kee-Eung Kim. Monte-Carlo tree search for constrained POMDPs. In NeurIPS, pages 7934–7943, 2018.

[20] Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning, page 72, 2004.

[21] James John Martin. Bayesian Decision Problems and Markov Chains. Wiley, 1967.

[22] John Mern, Anil Yildiz, Zachary Sunberg, Tapan Mukerji, and Mykel J Kochenderfer. Bayesian optimized Monte Carlo planning. arXiv preprint arXiv:2010.03597, 2020.

[23] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(12), 2006.

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[25] Philippe Morere, Roman Marchant, and Fabio Ramos. Bayesian optimisation for solving continuous state-action-observation POMDPs. Advances in Neural Information Processing Systems (NIPS), 2016.

[26] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

[27] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. Advances in Neural Information Processing Systems, 25:233–241, 2012.

[28] Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019.

[29] Georg Ch Pflug and Alois Pichler. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41(2):682–699, 2016.

[30] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2817–2826, 2017.

[31] L.A. Prashanth. Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, pages 155–169. Springer, 2014.

[32] Marc Rigter, Paul Duckworth, Bruno Lacerda, and Nick Hawes. Lexicographic optimisation of conditional value at risk and expected value for risk-averse planning in MDPs. arXiv preprint arXiv:2110.12746, 2021.

[33] Marc Rigter, Bruno Lacerda, and Nick Hawes. Minimax regret optimisation for robust planning in uncertain Markov decision processes. Proceedings of the AAAI Conference on Artificial Intelligence, 35(13):11930–11938, May 2021.

[34] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[35] Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.

[36] Pedro Santana, Sylvie Thiébaux, and Brian Williams. RAO*: An algorithm for chance-constrained POMDPs. 2016.

[37] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.

[38] Apoorva Sharma, James Harrison, Matthew Tsao, and Marco Pavone. Robust and adaptive planning under model uncertainty. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 410–418, 2019.

[39] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[40] Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtěch Franc. The SHOGUN machine learning toolbox. The Journal of Machine Learning Research, 11:1799–1802, 2010.

[41] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 1015–1022, Madison, WI, USA, 2010. Omnipress.

[42] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[43] Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, pages 1468–1476, 2015.

[44] Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323–3338, 2016.

[45] Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2993–2999. AAAI Press, 2015.

[46] Yichuan Charlie Tang, Jian Zhang, and Ruslan Salakhutdinov. Worst cases policy gradients. arXiv preprint arXiv:1911.03618, 2019.

[47] Eric M Wolff, Ufuk Topcu, and Richard M Murray. Robust control of uncertain Markov decision processes with temporal logic specifications. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 3372–3379. IEEE, 2012.