# Bayesian Bellman Operators

Matthew Fellows, Kristian Hartikainen, Shimon Whiteson
Department of Computer Science, University of Oxford

We introduce a novel perspective on Bayesian reinforcement learning (RL); whereas existing approaches infer a posterior over the transition distribution or Q-function, we characterise the uncertainty in the Bellman operator. Our Bayesian Bellman operator (BBO) framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions. In this paper, we use BBO to provide a rigorous theoretical analysis of model-free Bayesian RL to better understand its relationship to established frequentist RL methodologies. We prove that Bayesian solutions are consistent with frequentist RL solutions, even when approximate inference is used, and derive conditions for which convergence properties hold. Empirically, we demonstrate that algorithms derived from the BBO framework have sophisticated deep exploration properties that enable them to solve continuous control tasks at which state-of-the-art regularised actor-critic algorithms fail catastrophically.

## 1 Introduction

A Bayesian approach to reinforcement learning (RL) characterises uncertainty in the Markov decision process (MDP) via a posterior [35, 78]. A great advantage of Bayesian RL is that it offers a natural and elegant solution to the exploration/exploitation problem, allowing the agent to explore to reduce uncertainty in the MDP, but only to the extent that exploratory actions lead to greater expected return; unlike in heuristic strategies such as $\epsilon$-greedy and Boltzmann sampling, the agent does not waste samples trying actions that it has already established are suboptimal, leading to greater sampling efficiency. Elementary decision theory shows that the only admissible decision rules are Bayesian [22], because a non-Bayesian decision can always be improved upon by a Bayesian agent [24]. In addition, pre-existing domain knowledge can be formally incorporated by specifying priors.

In model-free Bayesian RL, a posterior is inferred over the Q-function by treating samples from the MDP as stationary labels for Bayesian regression. A major theoretical issue with existing model-free Bayesian RL approaches is their reliance on bootstrapping using a Q-function approximator, as samples from the exact Q-function are impractical to obtain. This introduces error: the samples are no longer estimates of a Q-function, and their dependence on the approximation is not accounted for. It is unclear what posterior, if any, these methods are inferring and how it relates to the RL problem.

In this paper, we introduce Bayesian Bellman Operators (BBO), a novel model-free Bayesian RL framework that addresses this issue and facilitates a theoretical exposition of the relationship between model-free Bayesian and frequentist RL approaches. Using our framework, we demonstrate that, by bootstrapping, model-free Bayesian RL infers a posterior over Bellman operators. For our main contribution, we prove that the BBO posterior concentrates on the true Bellman operator (or the closest representation in our function space of Bellman operators). Hence a Bayesian method using the BBO posterior is consistent with the equivalent frequentist solution in the true MDP. We derive convergent gradient-based approaches for Bayesian policy evaluation and uncertainty estimation.
Remarkably, our consistency and convergence results still hold when approximate inference is used. Our framework is general and can recover empirically successful algorithms such as BootDQN+Prior [57]. We demonstrate that BootDQN+Prior's lagged target parameters, which are essential to its performance, arise from applying approximate inference to the BBO posterior. Lagged target parameters cannot be explained by existing model-free Bayesian RL theory. Using BBO, we extend BootDQN+Prior to continuous domains by developing an equivalent Bayesian actor-critic algorithm. Our algorithm can learn optimal policies in domains where state-of-the-art actor-critic algorithms like soft actor-critic [39] fail catastrophically due to their inability to properly explore.

## 2 Bayesian Reinforcement Learning

To aid the reader with notation, we provide a mathematical glossary in Appendix A.

### 2.1 Preliminaries

Formally, an RL problem is modelled as a Markov decision process (MDP) defined by the tuple $\langle \mathcal{S}, \mathcal{A}, r, P, P_0, \gamma \rangle$ [72, 60], where $\mathcal{S}$ is the set of states and $\mathcal{A}$ the set of available actions. At time $t$, an agent in state $s_t \in \mathcal{S}$ chooses an action $a_t \in \mathcal{A}$ according to the policy $a_t \sim \pi(\cdot|s_t)$. The agent transitions to a new state according to the state transition distribution $s_{t+1} \sim P(\cdot|s_t, a_t)$, which induces a scalar reward $r_t := r(s_{t+1}, a_t, s_t) \in \mathbb{R}$ with $\sup_{s',a,s} |r(s',a,s)| < \infty$. The initial state distribution for the agent is $s_0 \sim P_0$ and the state-action transition distribution is denoted as $P^\pi(s',a'|s,a)$, which satisfies $dP^\pi(s',a'|s,a) = d\pi(a'|s')\, dP(s'|s,a)$. As the agent interacts with the environment, it gathers a trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, s_2, \dots)$. We seek an optimal policy $\pi^\star \in \arg\max_\pi J^\pi$ that maximises the total expected discounted return $J^\pi := \mathbb{E}^\pi\left[\sum_{t=0}^\infty \gamma^t r_t\right]$, where $\mathbb{E}^\pi$ is the expectation over trajectories induced by $\pi$. The Q-function is the total expected discounted reward as a function of a state-action pair: $Q^\pi(s,a) := \mathbb{E}^\pi\left[\sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]$. Any Q-function satisfies the Bellman equation $B^\pi[Q^\pi] = Q^\pi$, where the Bellman operator is defined as:

$$B^\pi[Q^\pi](s,a) := \mathbb{E}_{P^\pi(s',a'|s,a)}\left[r(s',a,s) + \gamma Q^\pi(s',a')\right]. \tag{1}$$

### 2.2 Model-based vs Model-free Bayesian RL

Bayes-adaptive MDPs (BAMDPs) [27] are a framework for model-based Bayesian reinforcement learning where a posterior marginalises over the uncertainty in the unknown transition distribution and reward functions to derive a Bayesian MDP. BAMDP optimal policies are the gold standard, optimally balancing exploration with exploitation, but they require learning a model of the unknown transition distribution, which is typically challenging due to its high dimensionality and multi-modality [67]. Furthermore, planning in BAMDPs requires the calculation of high-dimensional integrals, which render the problem intractable. Even with approximation, most existing methods are restricted to small and discrete state-action spaces [6, 38]. One notable exception is VariBAD [82], which exploits a meta-learning setting to carry out approximate Bayesian inference. Unfortunately, this approximation sacrifices the BAMDP's theoretical properties and there are no convergence guarantees.
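Before turning to the model-free setting, the following minimal sketch makes Eq. (1) concrete: it evaluates the Bellman operator of a small, randomly generated tabular MDP under a fixed policy and checks numerically that $Q^\pi$ is its fixed point. The MDP, policy, and discount factor below are illustrative assumptions, not anything used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a, s'] rewards, pi[s, a] a policy.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions, n_states))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def bellman_operator(Q):
    """B^pi[Q](s, a) = E_{s' ~ P, a' ~ pi}[ r(s', a, s) + gamma * Q(s', a') ]  (Eq. 1)."""
    v_next = (pi * Q).sum(axis=1)                         # V(s') = sum_a' pi(a'|s') Q(s', a')
    return (P * (R + gamma * v_next[None, None, :])).sum(axis=2)

# Solve the Bellman equation B^pi[Q] = Q by fixed-point iteration (a gamma-contraction).
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = bellman_operator(Q)

print(np.max(np.abs(bellman_operator(Q) - Q)))            # ~0: Q^pi is the fixed point
```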
Existing model-free Bayesian RL approaches instead attempt to solve a Bayesian regression problem to infer a posterior predictive over a value function [78, 35]. Whilst foregoing the ability to separately model reward uncertainty and transition dynamics, modelling uncertainty in a value function avoids the difficulty of estimating high-dimensional conditional distributions and mimics a Bayesian regression problem, for which there are tractable approximate methods [44, 10, 47, 61, 33, 51]. These methods assume access to a dataset of $N$ samples, $\mathcal{D}_N := \{q_i\}_{i=1:N}$, from a distribution over the true Q-function at each state-action pair: $q_i \sim P_Q(\cdot|s_i, a_i)$. Each sample is an estimate of a point of the true Q-function, $q_i = Q^\pi(s_i, a_i) + \eta_i$, corrupted by noise $\eta_i$. By introducing a probabilistic model of this random process, the posterior over the Q-function, $P(Q^\pi|s, a, \mathcal{D}_N)$, can be inferred, which characterises the aleatoric uncertainty in the sample noise and the epistemic uncertainty in the model. Modelling aleatoric uncertainty is the goal of distributional RL [11]. In Bayesian RL we are more concerned with epistemic uncertainty, which can be reduced by exploration [57].

### 2.3 Theoretical Issues with Existing Approaches

Unfortunately, for most settings it is impractical to sample directly from the true Q-function. To obtain efficient algorithms, the samples $q_i$ are approximated using bootstrapping: a parametric function approximator $\hat{Q}_\omega : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, parametrised by $\omega \in \Omega$, is learnt as an approximation of the Q-function, $\hat{Q}_\omega \approx Q^\pi$, and a TD sample is then used in place of $q_i$. For example, a one-step TD estimate approximates the samples as $q_i \approx r_i + \gamma \hat{Q}_\omega(s'_i, a'_i)$, introducing an error that is dependent on $\omega$. Existing approaches do not account for this error's dependence on the function approximator. Samples are no longer noisy estimates of a point $Q^\pi(s_i, a_i)$, and the resulting posterior predictive is not $P(Q^\pi|s, a, \mathcal{D}_N)$, as it has dependence on $\hat{Q}_\omega$ through the dataset. This problem is made worse when a posterior is inferred over an optimal Q-function, as it is impossible to sample from the optimal policy a priori to obtain unbiased samples. This is a major theoretical issue that raises the following questions:

1. Do model-free Bayesian RL approaches that use bootstrapping still infer a posterior?
2. If it exists, how does this posterior relate to solving the RL problem?
3. What effect does approximate inference have on the solution?
4. Do methods that sample from an approximate posterior converge?

Contribution: Our primary contribution is to address these questions by introducing the BBO framework. In answer to Question 1, BBO shows that, by introducing bootstrapping, we actually infer a posterior over Bellman operators. We can use this posterior to marginalise over all Bellman operators to obtain a Bayesian Bellman operator. Our theoretical results provide answers to Questions 2-4, proving that the Bayesian Bellman operator can parametrise a TD fixed point as the number of samples $N \to \infty$ and is analogous to the projection operator used in convergent reinforcement learning. Our results hold even under posterior approximation. Although our contributions are primarily theoretical, many of the benefits afforded by Bayesian methods play a significant role in a wide range of real-world applications of RL where identifying decisions that are being made under high uncertainty is crucial. We discuss the impact of our work further in Appendix B.
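The sketch below illustrates the issue behind Questions 1-4: once bootstrapping is used, the regression "labels" $q_i \approx r_i + \gamma \hat{Q}_\omega(s'_i, a'_i)$ are themselves functions of the parameters being learnt, so they move whenever $\hat{Q}_\omega$ is updated and are not fixed samples of $Q^\pi(s_i, a_i)$. The transitions and the linear-in-features approximator are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.99

# Hypothetical batch of transitions with 4-dimensional state-action features.
feats      = rng.normal(size=(32, 4))   # features of (s_i, a_i)
next_feats = rng.normal(size=(32, 4))   # features of (s'_i, a'_i)
rewards    = rng.normal(size=32)

def q_hat(omega, x):
    """A linear-in-features approximator, Q_hat_omega(s, a) = omega^T x(s, a)."""
    return x @ omega

omega_a = rng.normal(size=4)
omega_b = omega_a + 0.1 * rng.normal(size=4)   # a slightly different parametrisation

# One-step TD "labels" q_i ~ r_i + gamma * Q_hat_omega(s'_i, a'_i):
targets_a = rewards + gamma * q_hat(omega_a, next_feats)
targets_b = rewards + gamma * q_hat(omega_b, next_feats)

# The labels change with omega, so they are not fixed regression targets for Q^pi:
print(np.max(np.abs(targets_a - targets_b)))   # non-zero
```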
## 3 Bayesian Bellman Operators

Detailed proofs and a discussion of assumptions for all theoretical results are found in Appendix C. To introduce the BBO framework, we consider the Bellman equation using a function approximator: $B^\pi[\hat{Q}_\omega] = \hat{Q}_\omega$. Using Eq. (1), we can write the Bellman operator for $\hat{Q}_\omega$ as an expectation of the empirical Bellman function $b_\omega$:

$$B^\pi[\hat{Q}_\omega](s,a) = \mathbb{E}_{P^\pi(s',a'|s,a)}\left[b_\omega(s',a',s,a)\right], \qquad b_\omega(s',a',s,a) := r(s',a,s) + \gamma \hat{Q}_\omega(s',a'). \tag{2}$$

When evaluating the Bellman operator to solve the Bellman equation, we can evaluate the function approximator $\hat{Q}_\omega(s,a)$, but we cannot evaluate $B^\pi[\hat{Q}_\omega](s,a)$ due to the uncertainty in the reward function and transition distribution. In BBO we capture this uncertainty by treating the empirical Bellman function as a transformation of variables $b_\omega(\cdot, \cdot, s, a) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ for each $(s,a)$. The transformed variable $b \in \mathbb{R}$ has a conditional distribution $P_B(b|s,a,\omega)$, which is the pushforward of $P^\pi(s',a'|s,a)$ under the transformation $b_\omega(\cdot,\cdot,s,a)$. For any $P_B$-integrable function $f : \mathbb{R} \to \mathbb{R}$, the pushforward satisfies:

$$\mathbb{E}_{P_B(b|s,a,\omega)}\left[f(b)\right] = \mathbb{E}_{P^\pi(s',a'|s,a)}\left[f \circ b_\omega(s',a',s,a)\right]. \tag{3}$$

As the pushforward $P_B(b|s,a,\omega)$ is a distribution over empirical Bellman functions, each sample $b \sim P_B(\cdot|s,a,\omega)$ is a noisy sample of the Bellman operator at a point: $b_i = B^\pi[\hat{Q}_\omega](s_i, a_i) + \eta_i$. To prove this, observe that taking expectations of $b$ recovers $B^\pi[\hat{Q}_\omega](s,a)$:

$$\mathbb{E}_{P_B(b|s,a,\omega)}[b] \underset{\text{Eq. (3)}}{=} \mathbb{E}_{P^\pi(s',a'|s,a)}\left[b_\omega(s',a',s,a)\right] \underset{\text{Eq. (2)}}{=} B^\pi[\hat{Q}_\omega](s,a).$$

Figure 1: Graphical model for BBO.

As the agent interacts with the environment, it obtains samples from the transition distribution, $s'_i \sim P(\cdot|s_i, a_i)$, and policy, $a'_i \sim \pi(\cdot|s'_i)$. From Eq. (3), a sample from the distribution $b_i \sim P_B(\cdot|s_i, a_i, \omega)$ is obtained from these state-action pairs by applying the empirical Bellman function: $b_i = r_i + \gamma \hat{Q}_\omega(s'_i, a'_i)$. As we discussed in Section 2.3, existing model-free Bayesian RL approaches incorrectly treat each $b_i$ as a sample from a distribution over the value function, $P_Q(Q^\pi|s,a)$. BBO corrects this by modelling the true conditional distribution that generates the data, $P_B(b|s,a,\omega)$. The graphical model for BBO is shown in Fig. 1.

To model $P_B(b|s,a,\omega)$, we assume a parametric conditional distribution $P(b|s,a,\phi)$ with model parameters $\phi \in \Phi$ and a conditional mean $\mathbb{E}_{P(b|s,a,\phi)}[b] = \hat{B}_\phi(s,a)$. It is also possible to specify a nonparametric model $P(b|s,a)$. The conditional mean of the distribution, $\hat{B}_\phi$, defines a function space of approximators that represents a space of Bellman operators, each indexed by $\phi \in \Phi$. The choice of $P(b|s,a,\phi)$ should therefore ensure that the space of approximate Bellman operators characterised by $\hat{B}_\phi$ is expressive enough to sufficiently represent the true Bellman operator. As we are not concerned with modelling the transition distribution in our model-free paradigm, we assume states are sampled either from an ergodic Markov chain or i.i.d. from a buffer. Off-policy samples can be corrected using importance sampling.

Assumption 1 (State Generating Distribution). Each state $s_i$ is drawn either i) i.i.d. from a distribution $\rho(s)$ with support over $\mathcal{S}$, or ii) from an ergodic Markov chain with stationary distribution $\rho(s)$ defined over a $\sigma$-algebra that is countably generated from $\mathcal{S}$.

We represent our pre-existing beliefs about the true Bellman operator by specifying a prior $P(\phi)$ with a density $p(\phi)$, which assigns mass over parametrisations of function approximators $\phi \in \Phi$ in accordance with how well we believe they represent $B^\pi[\hat{Q}_\omega]$. Given the prior and a dataset $\mathcal{D}^N_\omega := \{b_i, s_i, a_i\}_{i=1:N}$ of samples from the true distribution $P_B$, we infer the posterior $P(\phi|\mathcal{D}^N_\omega)$
using Bayes' rule, which has the density (see Appendix D.1 for a derivation using both state generating distributions of Assumption 1):

$$p(\phi|\mathcal{D}^N_\omega) = \frac{\prod_{i=1}^N p(b_i|s_i,a_i,\phi)\, p(\phi)}{\int_\Phi \prod_{i=1}^N p(b_i|s_i,a_i,\phi)\, dP(\phi)}. \tag{4}$$

To be able to make predictions, we infer the posterior predictive: $p(b|\mathcal{D}^N_\omega, s, a) := \int_\Phi p(b|s,a,\phi)\, dP(\phi|\mathcal{D}^N_\omega)$. Unlike existing approaches, our posterior density is a function of $\omega$, which correctly accounts for the dependence on $\hat{Q}_\omega$ in our data and the generating distribution $P_B(b|s,a,\omega)$. We highlight that it is possible to define a likelihood and prior that are functions of $\omega$ to encode any prior knowledge of how the underlying Bellman operator varies with $\omega$; however, this is not strictly necessary, as the posterior automatically accounts for this dependence due to its conditioning on $\mathcal{D}^N_\omega$. As we anticipate that most applications of BBO will seek to learn an optimal policy, and hence an optimal Q-function, a prior that incorporates any knowledge available about the optimal Bellman operator will speed learning and give the agent an advantage.

As our data depend on $\hat{Q}_\omega$, we must introduce a method of learning the correct Q-function approximator. As every Bellman operator characterises an MDP, the posterior predictive mean represents a Bayesian estimate of the true MDP, using the posterior to marginalise over all Bellman operators that our model can represent according to our uncertainty in their value:

$$B^\star_{\omega,N}(s,a) := \mathbb{E}_{P(b|\mathcal{D}^N_\omega, s, a)}[b] = \mathbb{E}_{P(\phi|\mathcal{D}^N_\omega)}\left[\hat{B}_\phi(s,a)\right]. \tag{5}$$

For this reason, we refer to the predictive mean $B^\star_{\omega,N}$ as the Bayesian Bellman operator, and our Q-function approximator should satisfy a Bellman equation using $B^\star_{\omega,N}$. Our objective is therefore to find $\omega^\star$ such that $\hat{Q}_{\omega^\star} = B^\star_{\omega^\star,N}$. A simple approach to learning $\omega^\star$ is to minimise the mean squared Bayesian Bellman error (MSBBE) between the posterior predictive and the function approximator:

$$\mathrm{MSBBE}_N(\omega) := \left\| \hat{Q}_\omega - B^\star_{\omega,N} \right\|^2_{\rho,\pi}. \tag{6}$$

Here the distribution on the 2-norm is $\rho(s)\pi(a|s)$, where recall $\rho(s)$ is defined in Assumption 1. Although the MSBBE has a similar form to a mean squared Bellman error with a Bayesian Bellman operator in place of the Bellman operator, our theoretical results in Section 3.1 show that its frequentist interpretation is closer to the mean squared projected Bellman error used by convergent TD algorithms [70]. We derive the MSBBE gradient in Appendix D.3:

$$\nabla_\omega \mathrm{MSBBE}_N(\omega) \propto \mathbb{E}_{\rho,\pi}\left[ \Big( \hat{Q}_\omega - \mathbb{E}_{P(\phi|\mathcal{D}^N_\omega)}\big[\hat{B}_\phi\big] \Big) \Big( \nabla_\omega \hat{Q}_\omega - \mathbb{E}_{P(\phi|\mathcal{D}^N_\omega)}\big[\hat{B}_\phi\, \nabla_\omega \log p(\phi|\mathcal{D}^N_\omega)\big] \Big) \right], \tag{7}$$

where the approximators are evaluated at $(s,a)$ under the outer expectation. If we can sample from the posterior, then unbiased estimates of $\nabla_\omega \mathrm{MSBBE}_N(\omega)$ can be obtained; hence minimising the MSBBE via a stochastic gradient descent algorithm is convergent if the standard Robbins-Monro conditions are satisfied [62]. When existing approaches are used, the posterior has no dependence on $\omega$ and the gradient $\nabla_\omega \log p(\phi|\mathcal{D}^N_\omega)$ is not accounted for, leading to gradient terms being dropped in the update. Stochastic gradient descent using these updates does not optimise any objective and so may not converge to any solution. The focus of our analysis in Section 4.1 is to extend convergent gradient methods for minimising the MSBBE to approximate inference techniques in situations where sampling from the posterior becomes intractable.

Minimising the MSBBE also avoids the double sampling problem encountered in frequentist RL, where, to minimise the mean squared Bellman error, two independent samples from $P(s'|s,a)$ are required to obtain unbiased gradient estimates [7]. In BBO, this issue is avoided by drawing two independent approximate Bellman operators $\hat{B}_{\phi_1}$ and $\hat{B}_{\phi_2}$ from the posterior, $\phi_1, \phi_2 \sim P(\cdot|\mathcal{D}^N_\omega)$, instead.
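As a rough illustration of Eqs. (5) and (6) and of the double-sampling remark above, the sketch below forms a plug-in Monte Carlo estimate of the MSBBE and an unbiased alternative that uses two independent posterior draws. The stand-in posterior sampler and the linear approximators are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: linear approximators over d-dimensional state-action features.
d, n_points = 4, 64
x = rng.normal(size=(n_points, d))          # features of (s, a) pairs drawn from rho(s) pi(a|s)
omega = rng.normal(size=d)                  # current Q-function parameters
q_omega = x @ omega                         # Q_hat_omega(s, a)

def sample_posterior():
    """Stand-in for a draw phi ~ P(.|D^N_omega); here a Gaussian around an arbitrary mean."""
    phi_mean = omega + 0.2 * np.ones(d)
    return phi_mean + 0.05 * rng.normal(size=d)

# Plug-in estimate of the Bayesian Bellman operator B*_{omega,N}(s,a) = E_phi[B_hat_phi(s,a)]:
b_star = np.mean([x @ sample_posterior() for _ in range(256)], axis=0)
msbbe_plugin = np.mean((q_omega - b_star) ** 2)

# Unbiased alternative with two independent posterior draws phi_1, phi_2 (no double sampling
# from the transition distribution is needed):
phi_1, phi_2 = sample_posterior(), sample_posterior()
msbbe_two_sample = np.mean((q_omega - x @ phi_1) * (q_omega - x @ phi_2))

print(msbbe_plugin, msbbe_two_sample)
```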
### 3.1 Consistency of the Posterior

To address Question 2, we develop a set of theoretical results to understand the posterior's relationship to the RL problem. We introduce some mild regularity assumptions on our choice of model:

Assumption 2 (Regularity of Model). i) $\hat{Q}_\omega$ is bounded and $(\Phi, d_\Phi)$ and $(\Omega, d_\Omega)$ are compact metric spaces; ii) $\hat{B}_\phi$ is Lipschitz in $\phi$, and $P(b|s,a,\phi)$ has finite variance and a density $p(b|s,a,\phi)$ which is Lipschitz in $\phi$ and bounded; and iii) $p(\phi) \propto \exp(-R(\phi))$, where $R(\phi)$ is bounded and Lipschitz.

Our main result is a Bernstein-von Mises-type theorem [49] applied to reinforcement learning. We prove that the posterior asymptotically converges to a Dirac delta distribution centred on the set of parameters that minimise the KL divergence between the true and model distributions, which are the frequentist maximum likelihood parameters:

$$\phi^\star_\omega := \arg\min_\phi \mathrm{KL}\left(P_B(b,s,a|\omega) \,\|\, P(b,s,a|\phi)\right) = \arg\min_\phi \mathbb{E}_{P_B(b,s,a|\omega)}\left[-\log p(b,s,a|\phi)\right], \tag{8}$$

where the expectation is taken with respect to the distribution that generates the data, $P_B(b,s,a|\omega)$, which satisfies $dP_B(b,s,a|\omega) = dP_B(b|s,a,\omega)\, d\pi(a|s)\, d\rho(s)$. We make a simplifying assumption that there is a single maximum likelihood parameter, which eases analysis and exposition of our results. We discuss the more general case where it does not hold in Appendix C.3.

Assumption 3 (Single Minimiser). The set of maximum likelihood parameters $\phi^\star_\omega$ defined in Eq. (8) exists and is a singleton.

Theorem 1. Under Assumptions 1-3, in the limit $N \to \infty$ the posterior concentrates weakly on $\phi^\star_\omega$: i) $P(\phi|\mathcal{D}^N_\omega) \Rightarrow \delta(\phi = \phi^\star_\omega)$ a.s.; ii) $B^\star_{\omega,N} \xrightarrow{\text{a.s.}} \hat{B}_{\phi^\star_\omega}$; and iii) $\mathrm{MSBBE}_N(\omega) \xrightarrow{\text{a.s.}} \| \hat{Q}_\omega - \hat{B}_{\phi^\star_\omega} \|^2_{\rho,\pi}$.

If our model can sufficiently represent the true conditional distribution, then $\mathrm{KL}(P_B(b,s,a|\omega) \,\|\, P(b,s,a|\phi^\star_\omega)) = 0 \implies P_B(b|s,a,\omega) = P(b|s,a,\phi^\star_\omega)$. Theorem 1 proves that the posterior concentrates on the frequentist solution $\phi^\star_\omega$ and hence the Bayesian Bellman operator converges to the true Bellman operator: $\hat{B}_{\phi^\star_\omega}(s,a) = \mathbb{E}_{P(b|s,a,\phi^\star_\omega)}[b] = \mathbb{E}_{P_B(b|s,a,\omega)}[b] = B^\pi[\hat{Q}_\omega](s,a)$. As every Bellman operator characterises an MDP, any Bayesian RL solution obtained using the BBO posterior, such as an optimal policy or value function, is consistent with the true RL solution. When the true distribution is not in the model class, $\hat{B}_{\phi^\star_\omega}$ converges to the closest representation of the true Bellman operator according to the parametrisation that maximises the likelihood $\mathbb{E}_{P_B(b,s,a|\omega)}[\log p(b,s,a|\phi)]$. This is analogous to frequentist convergent TD learning, where the function approximator converges to a parametrisation that minimises the projection of the Bellman operator into the model class [70, 71, 12]. We now make this relationship precise by considering a Gaussian model.

### 3.2 Gaussian BBO

To showcase the power of Theorem 1 and to provide a direct comparison to existing frequentist approaches, we consider the nonlinear Gaussian model $P(b|s,a,\phi) = \mathcal{N}(\hat{B}_\phi(s,a), \sigma^2)$ that is commonly used for Bayesian regression [55, 33]. The mean is a nonlinear function approximator that best represents the Bellman operator, $\hat{B}_\phi \approx B^\pi[\hat{Q}_\omega]$, and the model variance $\sigma^2 > 0$ represents the aleatoric uncertainty in our samples. Ignoring the log-normalisation constant $c_\text{norm}$, the negative log-posterior is an empirical mean squared error between the empirical Bellman samples and the model mean $\hat{B}_\phi(s_i, a_i)$, with additional regularisation due to the prior (see Appendix D.2 for a derivation):

$$-\log p(\phi|\mathcal{D}^N_\omega) = c_\text{norm} + \sum_{i=1}^N \frac{\left(b_i - \hat{B}_\phi(s_i, a_i)\right)^2}{2\sigma^2} + R(\phi), \qquad \phi^\star_\omega \in \arg\min_\phi \left\| \hat{B}_\phi - B^\pi[\hat{Q}_\omega] \right\|^2_{\rho,\pi}. \tag{9}$$
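Appendix E works through the linear Gaussian case in full; as a minimal sketch of the model in Eq. (9), assume linear approximators $\hat{B}_\phi(s,a) = \phi^\top x(s,a)$ and an isotropic Gaussian prior, so the posterior over $\phi$, and hence the Bayesian Bellman operator $B^\star_{\omega,N}$, are available in closed form. All numerical choices below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, sigma2, prior_prec = 0.99, 0.5, 1.0

# Hypothetical data: features of (s_i, a_i) and (s'_i, a'_i), rewards, and current omega.
d, n = 4, 200
x, x_next = rng.normal(size=(n, d)), rng.normal(size=(n, d))
rewards = rng.normal(size=n)
omega = rng.normal(size=d)

# Empirical Bellman samples b_i = r_i + gamma * Q_hat_omega(s'_i, a'_i)  (Eq. 2).
b = rewards + gamma * (x_next @ omega)

# Conjugate Gaussian posterior over phi for b | s, a, phi ~ N(phi^T x(s, a), sigma2)
# with prior phi ~ N(0, prior_prec^-1 I):
post_cov = np.linalg.inv(x.T @ x / sigma2 + prior_prec * np.eye(d))
post_mean = post_cov @ (x.T @ b / sigma2)

# Bayesian Bellman operator at each (s_i, a_i): B*_{omega,N}(s, a) = E[phi | D]^T x(s, a)  (Eq. 5),
# and a Monte Carlo estimate of the MSBBE over the batch (Eq. 6).
bayes_bellman = x @ post_mean
msbbe = np.mean((x @ omega - bayes_bellman) ** 2)
print(msbbe)
```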
Theorem 1 proves that in the limit $N \to \infty$, the effect of the prior diminishes and the Bayesian Bellman operator converges to this parametrisation: $B^\star_{\omega,N} \xrightarrow{\text{a.s.}} \hat{B}_{\phi^\star_\omega}$. As $\phi^\star_\omega$ is the set of parameters that minimise the mean squared error between the true Bellman operator and the approximator, $\hat{B}_{\phi^\star_\omega}$ is a projection of the true Bellman operator onto the space of functions represented by $\hat{B}_\phi$:

$$\hat{B}_{\phi^\star_\omega} = \Pi_{\hat{B}_\Phi} B^\pi[\hat{Q}_\omega] := \left\{ \hat{B}_{\phi'} : \phi' \in \arg\min_\phi \left\| \hat{B}_\phi - B^\pi[\hat{Q}_\omega] \right\|^2_{\rho,\pi} \right\}. \tag{10}$$

Finally, Theorem 1 proves that the MSBBE converges to the mean squared projected Bellman error: $\mathrm{MSBBE}_N(\omega) \xrightarrow{\text{a.s.}} \mathrm{MSPBE}(\omega) := \| \hat{Q}_\omega - \Pi_{\hat{B}_\Phi} B^\pi[\hat{Q}_\omega] \|^2_{\rho,\pi}$. By the definition of the projection operator in Eq. (10), a solution $\hat{Q}_\omega = \Pi_{\hat{B}_\Phi} B^\pi[\hat{Q}_\omega]$ is a TD fixed point; hence any asymptotic MSBBE minimiser parametrises a TD fixed point, should it exist. To further highlight the relationship between BBO and convergent TD algorithms that minimise the mean squared projected Bellman error, we explore the linear Gaussian regression model as a case study in Appendix E, allowing us to derive a regularised Bayesian TDC/GTD2 algorithm [71, 70].

## 4 Approximate BBO

We have demonstrated in Eq. (7) that if it is tractable to sample from the posterior, a simple convergent stochastic gradient descent algorithm can be used to minimise the MSBBE. We derive the gradient update for the linear Gaussian model as part of our case study in Appendix E. Unfortunately, models like linear Gaussians that have analytic posteriors are often too simple to accurately represent the Bellman operator for domains of practical interest in RL. We now extend our analysis to include approximate inference approaches.

### 4.1 Approximate Inference

To allow for more expressive nonlinear function approximators, for which the posterior normalisation is intractable, we introduce a tractable posterior approximation $q(\phi|\mathcal{D}^N_\omega)$. In this paper, we use randomised priors (RP) [57] for approximate inference. RP injects noise into the maximum a posteriori (MAP) estimate via a noise variable $\epsilon \in \mathcal{E}$ with distribution $P_\mathcal{E}(\epsilon)$, where the density $p_\mathcal{E}(\epsilon)$ has the same form as the prior. We provide a full exposition of RP for BBO in Appendix F, including derivations of our objectives. RP in practice uses ensembling: $L$ prior randomisations $\mathcal{E}_L := \{\epsilon_l\}_{l=1:L}$ are first drawn from $P_\mathcal{E}$. To use RP for BBO, we write the Q-function approximator as an ensemble of $L$ parameters $\Omega_L := \{\omega_l\}_{l=1:L}$, where $\hat{Q}_\omega = \frac{1}{L}\sum_{l=1}^L \hat{Q}_{\omega_l}$, and require an assumption about the prior and the function spaces used for approximators:

Assumption 4 (RP Function Spaces). i) $\hat{Q}_{\omega_l}$ and $\hat{B}_{\omega_l}$ share a function space, that is $\hat{Q}_{\omega'_l} = \hat{B}_{\omega'_l}$ for any $\omega'_l \in \Omega$, where $\Omega = \Phi \subset \mathbb{R}^n$ is compact and convex with a smooth boundary; ii) $\mathcal{E} \subseteq \mathbb{R}^n$ and $R(\phi - \epsilon)$ is defined for any $\phi \in \Phi$, $\epsilon \in \mathcal{E}$.

For each $l \in \{1:L\}$, a set of solutions to the prior-randomised MAP objective is found:

$$\phi^\star_l(\omega_l) \in \arg\min_\phi \mathcal{L}(\phi; \omega_l, \epsilon_l) := \arg\min_\phi \left( R(\phi - \epsilon_l) - \sum_{i=1}^N \log p(b_i|s_i,a_i,\phi) \right). \tag{11}$$

The RP solution $\phi^\star_l(\omega_l)$ has a dependence on $\omega_l$ that mirrors the BBO posterior's dependence on $\omega$. To construct the RP approximate posterior $q(\phi|\mathcal{D}^N_\omega)$, we average the set of perturbed MAP estimates over all ensembles: $q(\phi|\mathcal{D}^N_\omega) = \frac{1}{L}\sum_{l=1}^L \delta(\phi \in \phi^\star_l(\omega_l))$. To sample from the RP posterior, $\phi \sim q(\cdot|\mathcal{D}^N_\omega)$, we sample an ensemble uniformly, $l \sim \mathrm{Unif}(\{1:L\})$, and set $\phi = \phi^\star_l(\omega_l)$.
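A minimal sketch of RP inference for the Gaussian BBO model follows, assuming linear approximators so that each prior-randomised MAP problem in Eq. (11) reduces to a closed-form ridge regression towards its perturbation $\epsilon_l$. The ensemble size, prior precision, and synthetic data are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, sigma2, lam, L = 0.99, 0.5, 1.0, 10   # lam: prior precision, L: ensemble size

# Hypothetical transition data shared by all ensemble members.
d, n = 4, 200
x, x_next = rng.normal(size=(n, d)), rng.normal(size=(n, d))
rewards = rng.normal(size=n)

# Ensemble of target parameters omega_l and prior randomisations eps_l ~ P_E (same form as the prior).
omegas = [rng.normal(size=d) for _ in range(L)]
epsilons = [rng.normal(scale=1.0 / np.sqrt(lam), size=d) for _ in range(L)]

def rp_map(omega_l, eps_l):
    """phi*_l(omega_l) = argmin_phi sum_i (b_i - phi^T x_i)^2 / (2 sigma2) + lam/2 ||phi - eps_l||^2."""
    b = rewards + gamma * (x_next @ omega_l)              # Bellman samples depend on omega_l
    A = x.T @ x / sigma2 + lam * np.eye(d)
    return np.linalg.solve(A, x.T @ b / sigma2 + lam * eps_l)

phis = [rp_map(w, e) for w, e in zip(omegas, epsilons)]

# Sampling from the RP approximate posterior q(phi | D^N_omega): pick an ensemble member uniformly.
l = rng.integers(L)
phi_sample = phis[l]
print(phi_sample)
```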
Although BBO is compatible with any approximate inference technique, we justify our choice of RP by proving that it preserves the consistency results developed in Theorem 1:

Corollary 1.1. Under Assumptions 1-4, results i)-iii) of Theorem 1 hold with $P(\phi|\mathcal{D}^N_\omega)$ replaced by the RP approximate posterior $q(\phi|\mathcal{D}^N_\omega)$, both with and without ensembling.

In answer to Question 3, Corollary 1.1 shows that the difference between using the RP approximate posterior and the true posterior lies in their characterisation of uncertainty and not their asymptotic behaviour. Existing work shows that RP uncertainty estimates are conservative [59, 21], with strong empirical performance in RL [57, 58] for the Gaussian model that we study in this paper.

The RP approximate posterior $q(\phi|\mathcal{D}^N_\omega)$ depends on the ensemble of Q-function approximators $\hat{Q}_{\omega_l}$, and like in Section 3 we must learn an ensemble of optimal parametrisations $\omega^\star_l$. We substitute $q(\phi|\mathcal{D}^N_\omega)$ in place of the true posterior in Eqs. (5) and (6) to derive an ensembled RP MSBBE: $\mathrm{MSBBE}^{\mathrm{RP}}(\omega_l) := \| \hat{Q}_{\omega_l} - \hat{B}_{\phi^\star_l(\omega_l)} \|^2_{\rho,\pi}$. When a fixed point $\hat{Q}_{\omega_l} = \hat{B}_{\phi^\star_l(\omega_l)}$ exists, minimising $\mathrm{MSBBE}^{\mathrm{RP}}(\omega_l)$ is equivalent to finding $\omega^\star_l$ such that $\omega^\star_l \in \phi^\star_l(\omega^\star_l)$. To learn $\omega^\star_l$, we can instead minimise the simpler parameter objective

$$\omega^\star_l \in \arg\min_{\omega_l \in \Omega} U(\omega_l; \phi^\star_l) := \|\omega_l - \phi^\star_l(\omega_l)\|^2_2 \quad \text{such that} \quad \phi^\star_l(\omega_l) \in \arg\min_\phi \mathcal{L}(\phi; \omega_l, \epsilon_l), \tag{12}$$

which has the advantage that deterministic gradient updates can be obtained. $U(\omega_l; \phi^\star_l)$ can still provide an alternative auxiliary objective when a fixed point does not exist, as the convergence of algorithms minimising Eq. (12) does not depend on its existence, and it has the same solution as minimising $\mathrm{MSBBE}^{\mathrm{RP}}(\omega_l)$ for sufficiently smooth $\hat{B}_\phi$.

Solving the bi-level optimisation problem in Eq. (12) is NP-hard [8]. To tackle this problem, we introduce an ensemble of parameters $\Phi_L := \{\phi_l\}_{l=1:L}$ to track $\phi^\star_l(\omega_l)$ and propose a two-timescale gradient update for each $l \in \{1:L\}$ on the objectives in Eq. (12) with per-step complexity of $O(n)$:

$$\phi_l \leftarrow P_\Phi\left( \phi_l - \alpha_k \nabla_{\phi_l}\big( R(\phi_l - \epsilon_l) - \log p(b_i|s_i,a_i,\phi_l) \big) \right), \quad \text{(fast)} \tag{13}$$

$$\omega_l \leftarrow P_\Phi\left( \omega_l - \beta_k(\omega_l - \phi_l) \right), \quad \text{(slow)} \tag{14}$$

where $\alpha_k$ and $\beta_k$ are asymptotically faster and slower stepsizes respectively and $P_\Phi(\cdot) := \arg\min_{\omega \in \Phi} \| \cdot - \omega \|^2_2$ is a projection operator that projects its argument back into $\Phi$ if necessary. From a Bayesian perspective, we are concerned with characterising the uncertainty after a finite number of samples $N < \infty$, and hence $(b_i, s_i, a_i)$ should be drawn uniformly from the dataset $\mathcal{D}^N_{\omega_l}$ to form estimates of the summation in Eq. (11), which becomes intractable with large $N$. When compared to existing RL algorithms, sampling from $\mathcal{D}^N_{\omega_l}$ is analogous to sampling from a replay buffer [54]. A frequentist analysis of our updates is also possible by considering samples that are drawn online from the underlying data generating distribution, $(b_i, s_i, a_i) \sim P_B$, in the limit $N \to \infty$. We discuss this frequentist interpretation further in Appendix C.5.

To answer Question 4, we prove convergence of updates (13) and (14) using a straightforward application of two-timescale stochastic approximation [15, 14, 42] to BBO. Intuitively, two-timescale analysis proves that the faster timescale update (13) converges to an element in $\phi^\star_l(\omega_l)$ using standard martingale arguments, viewing the parameter $\omega_l$ as quasi-static as it behaves like a constant. Since the perturbations are relatively small, the separation of timescales then ensures that $\phi_l$ tracks $\phi^\star_l(\omega_l)$ whenever $\omega_l$ is updated in the slower timescale update (14), viewing the parameter $\phi_l$ as quasi-equilibrated [14]. We introduce the standard two-timescale regularity assumptions and derive the limiting ODEs of updates (13) and (14) in Appendix C.3:

Assumption 5 (Two-timescale Regularity).
i) $\nabla_{\phi_l} R(\phi_l - \epsilon_l)$ and $\nabla_{\phi_l} \log p(b_i|s_i,a_i,\phi_l)$ are Lipschitz in $\phi_l$, and $(b_i, s_i, a_i) \sim \mathrm{Unif}(\mathcal{D}^N_{\omega_l})$; ii) $\tilde{\phi}_l(\omega_l)$ and $\tilde{\omega}_l$ are locally asymptotically stable attractors of the limiting ODEs of updates (13) and (14) respectively, and $\tilde{\phi}_l(\omega_l)$ is Lipschitz in $\omega_l$; and iii) the stepsizes satisfy $\lim_{k\to\infty} \frac{\beta_k}{\alpha_k} = 0$, $\sum_{k=1}^\infty \alpha_k = \sum_{k=1}^\infty \beta_k = \infty$ and $\sum_{k=1}^\infty (\alpha_k^2 + \beta_k^2) < \infty$.

Theorem 2. If Assumptions 1 to 5 hold, $\phi_l$ and $\omega_l$ converge to $\tilde{\phi}_l(\tilde{\omega}_l)$ and $\tilde{\omega}_l$ almost surely.

As the $\omega_l$ are updated on a slower timescale, they lag the parameters $\phi_l$. When deriving a Bayesian actor-critic algorithm in Section 4.2, we demonstrate that these parameters share a similar role to a lagged critic. There is no Bayesian explanation for these parameters under existing approaches: when applying approximate inference to $P(Q^\pi|s,a,\mathcal{D}_N)$, the RP solution $\phi^\star_l$ has no dependence on $\omega_l$. Hence, minimising $U(\omega_l; \phi^\star_l)$ and the approximate MSBBE has an exact solution obtained by setting $\omega^\star_l = \phi^\star_l$. In this case, $\hat{Q}_{\omega^\star_l} = \hat{B}_{\phi^\star_l}$, meaning that existing approaches do not distinguish between the Q-function and Bellman operator approximators.

### 4.2 Bayesian Bellman Actor-Critic

Figure 2: Schematic of RP-BBAC.

BootDQN+Prior [57, 58] is a state-of-the-art Bayesian model-free algorithm with Thompson sampling [74] where, in principle, an optimal Q-function is drawn from a posterior over optimal Q-functions at the start of each episode. As BootDQN+Prior requires bootstrapping, it actually draws a sample from the Gaussian BBO posterior introduced in Section 3.2 using RP approximate inference with the empirical Bellman function $b_\omega(s', s, a) = r(s',a,s) + \gamma \max_{a'} \hat{Q}_\omega(s', a')$. This empirical Bellman function results from substituting an optimal policy, $\pi(a|s) = \delta(a \in \arg\max_{a'} \hat{Q}_\omega(s,a'))$, in Eq. (3). A variable $l$ is drawn uniformly and the optimal exploration policy $\pi^\star_l(a|s) = \delta(a \in \arg\max_{a'} \hat{B}_{\phi_l}(s,a'))$ is followed. BootDQN+Prior achieves what Osband et al. [58] call deep exploration, where exploration not only considers immediate information gain but also the consequences of an exploratory action on future learning. Due to its use of the $\arg\max$ operator, BootDQN+Prior is not appropriate for continuous action or large discrete action domains, as a nonlinear optimisation problem must be solved every time an action is sampled. We instead develop a randomised priors Bayesian Bellman actor-critic (RP-BBAC) to extend BootDQN+Prior to continuous domains. A schematic of RP-BBAC is shown in Fig. 2, which summarises Algorithm 1. Additional details are in Appendix G.

Comparison to existing actor-critics: Using a Gaussian model also allows a direct comparison to frequentist actor-critic algorithms [50]: as shown in Fig. 2, every ensemble $l \in \{1,\dots,L\}$ has its own exploratory actor $\pi_{\psi_l}$, critic $\hat{B}_{\phi_l}$ and target critic $\hat{Q}_{\omega_l}$. In BBAC, each critic is the solution to its unique $\epsilon_l$-randomised empirical MSBBE objective from Eq. (12): $\mathcal{L}_\text{critic}(\phi_l) := \frac{1}{N}\sum_{i=1}^N (b_i - \hat{B}_{\phi_l}(s_i, a_i))^2 + R(\phi_l - \epsilon_l)$. The target critic parameters $\omega_l$ for each Bellman sample $b_i = r_i + \gamma \hat{Q}_{\omega_l}(s'_i, a'_i)$ are updated on a slower timescale than the critic parameters, which mimics the updating of target critic parameters after a regular interval in frequentist approaches [54, 39]. We introduce an ensemble of parametric exploration policies $\pi_{\psi_l}(a|s)$ parametrised by a set of parameters $\Psi_L := \{\psi_l\}_{l=1:L}$. Each optimal exploration policy $\pi_{\psi^\star_l}(a|s)$ is parametrised by the solution to its own optimisation problem: $\psi^\star_l \in \arg\max_{\psi_l \in \Psi} \mathbb{E}_{\rho(s)\pi_{\psi_l}(a|s)}[\hat{B}_{\phi_l}(s,a)]$.
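The sketch below shows how one ensemble member's updates might look in practice, combining the fast critic update (Eq. 13), the slow lagged target update (Eq. 14), and the exploratory actor objective above. It is a schematic, not the authors' implementation: the network sizes, stepsizes, fixed-variance Gaussian policy, and the form of the prior perturbation are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, act_dim, gamma = 3, 1, 0.99
alpha, beta, lam = 1e-3, 1e-4, 1.0          # fast/slow stepsizes (beta << alpha) and prior precision

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

critic = mlp(state_dim + act_dim, 1)                      # B_hat_phi_l
target = mlp(state_dim + act_dim, 1)                      # Q_hat_omega_l (shared function space)
target.load_state_dict(critic.state_dict())
actor_mean = mlp(state_dim, act_dim)                      # exploratory actor pi_psi_l (Gaussian mean)
prior = [p.detach().clone() for p in mlp(state_dim + act_dim, 1).parameters()]  # eps_l, held fixed

critic_opt = torch.optim.SGD(critic.parameters(), lr=alpha)
actor_opt = torch.optim.SGD(actor_mean.parameters(), lr=alpha)

def update(s, a, r, s_next):
    # Fast timescale (Eq. 13): regress the critic onto b_i = r_i + gamma * Q_hat_omega_l(s', a'),
    # with the randomised-prior regulariser R(phi_l - eps_l).
    with torch.no_grad():
        a_next = actor_mean(s_next) + 0.1 * torch.randn_like(a)
        b = r + gamma * target(torch.cat([s_next, a_next], dim=-1))
    pred = critic(torch.cat([s, a], dim=-1))
    reg = sum(((p - e) ** 2).sum() for p, e in zip(critic.parameters(), prior))
    critic_loss = ((b - pred) ** 2).mean() + lam * reg / s.shape[0]
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Slow timescale (Eq. 14): the lagged target parameters omega_l track phi_l.
    with torch.no_grad():
        for w, p in zip(target.parameters(), critic.parameters()):
            w.add_(beta * (p - w))

    # Actor: ascend E_{rho, pi_psi_l}[B_hat_phi_l(s, a)] with a reparameterised Gaussian policy.
    a_pi = actor_mean(s) + 0.1 * torch.randn_like(a)
    actor_loss = -critic(torch.cat([s, a_pi], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call on a hypothetical minibatch drawn from the buffer D.
s, a = torch.randn(32, state_dim), torch.randn(32, act_dim)
r, s_next = torch.randn(32, 1), torch.randn(32, state_dim)
update(s, a, r, s_next)
```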
Unlike frequentist approaches, an exploratory actor is selected at the start of each episode in accordance with our current uncertainty in the MDP, characterised by the approximate RP posterior. Exploration is thus both deep and adaptive, as actions from an exploration policy are directed towards minimising epistemic uncertainty in the MDP, and the posterior variance reduces in accordance with Corollary 1.1 as more data is collected. BBAC's explicit specification of lagged target critics is unique to BBO and, as discussed in Section 4.1, corrects the theoretical issues raised by applying bootstrapping to existing model-free Bayesian RL theory, which does not account for the posterior's dependence on $\hat{Q}_\omega$. Finally, exploration policies may not perform well at test time, so we learn a behaviour policy $\pi_{\bar\psi}(a|s)$, parametrised by $\bar\psi \in \Psi$, from the data collected by our exploration policies using the ensemble of critics $\{\hat{B}_{\phi_l}\}_{l=1:L}$. Theoretically, this is the optimal policy for the Bayesian estimate of the true MDP, obtained by using the approximate posterior to marginalise over the ensemble of Bellman operators. We augment our behaviour policy objective with entropy regularisation, allowing us to combine the exploratory benefits of Thompson sampling with the faster convergence rates and algorithmic stability of regularised RL [77].

Algorithm 1 (RP-BBAC):
- Initialise $\Omega_L$, $\Phi_L$, $\Psi_L$, $\mathcal{E}_L$, and $\mathcal{D} \leftarrow \emptyset$; sample initial state $s \sim P_0$.
- While not converged:
  - Sample policy $\pi_{\psi_l}$ with $l \sim \mathrm{Unif}(\{1,\dots,L\})$.
  - For $n \in \{1,\dots,N_\text{env}\}$: sample action $a \sim \pi_{\psi_l}(\cdot|s)$, observe next state $s' \sim P(\cdot|s,a)$ and reward $r = r(s',a,s)$, and add $\{s,a,r,s'\}$ to $\mathcal{D}$.
  - $\Omega_L, \Phi_L, \Psi_L \leftarrow$ UPDATEPOSTERIOR; UPDATEBEHAVIOURALPOLICY.

## 5 Related Work

Existing model-free Bayesian RL approaches assume either a parametric Gaussian [34, 57, 32, 52, 58, 75] or a Gaussian process regression model [28, 29]. Value-based approaches use the empirical Bellman function $b_\omega(s', s, a) = r(s',a,s) + \gamma \max_{a'} \hat{Q}_\omega(s', a')$, whereas actor-critic approaches use the empirical Bellman function $b_\omega(s',a',s,a) = r(s',a,s) + \gamma \hat{Q}_\omega(s',a')$. In answering Questions 1-4, we have shown that existing methods that use bootstrapping inadvertently approximate the posterior predictive over Q-functions with the BBO posterior predictive: $P(Q^\pi|s,a,\mathcal{D}_N) \approx P(b|s,a,\mathcal{D}^N_\omega)$. These methods minimise an approximation of the MSBBE in which the Bayesian Bellman operator is treated as a supervised target, ignoring its dependence on $\omega$: gradient descent approaches drop gradient terms, and fitted approaches iteratively regress the Q-function approximator onto the Bayesian Bellman operator, $\hat{Q}_{\omega_{k+1}} \approx B^\star_{\omega_k,N}$. In both cases, the updates may not be a contraction mapping, for the same reasons as in non-Bayesian TD [76], and so it is not possible to prove general convergence. The additional Bayesian regularisation introduced from the prior can lead to convergence, but only in specific and restrictive cases [4, 5, 31, 17].

Approximate inference presents an additional problem for existing approaches: many existing methods naïvely apply approximate inference to the Bellman error, treating $B[Q^\pi](s,a)$ and $Q^\pi(s,a)$ as independent variables [32, 52, 75, 34]. This leads to poor uncertainty estimates, as the Bellman error cannot correctly propagate the uncertainty [56, 57]. Osband et al. [58] demonstrate that this can cause uncertainty estimates of $Q^\pi(s,a)$ at some $(s,a)$ to be zero and propose BootDQN+Prior as an alternative to achieve deep exploration. BBO does not suffer this issue as the posterior characterises the uncertainty in the Bellman operator directly.
In Section 4.2 we demonstrated that BootDQN+Prior, derived from BBO, specifies the use of target critics. Despite being essential to performance, there is no Bayesian explanation for target critics under existing model-free Bayesian RL theory, which posits that sampling a critic from $P(Q^\pi|s,a,\mathcal{D}_N)$ is sufficient.

## 6 Experiments

Figure 3: Tsitsiklis counterexample.

Convergent Nonlinear Policy Evaluation: To confirm our convergence and consistency results under approximation, we evaluate BBO in several nonlinear policy evaluation experiments that are constructed to present a convergence challenge for TD algorithms. We verify the convergence of nonlinear Gaussian BBO in the famous counterexample task of Tsitsiklis and Van Roy [76], in which the TD(0) algorithm is provably divergent. The results are presented in Fig. 3. As expected, TD(0) diverges, while BBO converges to the optimal solution faster than convergent frequentist nonlinear TDC and GTD2 [12]. We also consider three additional policy evaluation tasks commonly used to test convergence of nonlinear TD using neural network function approximators: 20-Link Pendulum [23], Puddle World [16], and Mountain Car [16]. Results are shown in Fig. 11 of Appendix H.3, from which we conclude that i) by ignoring the posterior's dependence on $\omega$, existing model-free Bayesian approaches are less stable and perform poorly in comparison to the gradient-based MSBBE minimisation approach in Eq. (7); ii) regularisation from a prior can improve performance of policy evaluation by aiding the optimisation landscape [26]; and iii) better solutions in terms of mean squared error can be found using BBO instead of the local linearisation approach of nonlinear TDC/GTD2 [12].

Figure 4: Continuous control with sparse reward. (a) Mountain Car; (b) Cartpole.

Exploration for Continuous Control: In many benchmark tasks for continuous RL, such as the locomotion tasks from the MuJoCo Gym suite [18], the environment reward is shaped to provide a smooth gradient towards successful task completion, and naïve Boltzmann dithering exploration strategies from regularised RL can provide a strong inductive bias. In practical real-world scenarios, dense rewards are difficult to specify by hand, especially when the task is learned from raw observations like images. Therefore, we consider a set of continuous control tasks with sparse rewards as continuous analogues of the discrete experiments used to test BootDQN+Prior [57]: MountainCarContinuous-v0 from the Gym benchmark suite and a slightly modified version of cartpole-swingup_sparse from the DeepMind Control Suite [73]. Both environments have a sparse reward signal and penalise the agent proportionally to the magnitude of executed actions. As the agent is always initialised in the same state, it has to deeply explore costly states in a directed manner for hundreds of steps until it reaches the rewarding region of the state space. We compare RP-BBAC with two variants of the state-of-the-art soft actor-critic: SAC, which is the exact algorithm presented in [40]; and SAC*, a tailored version which uses a single Q-function to avoid pessimistic underexploration [20] due to the use of the double-minimum-Q trick (see Appendix I for details). To understand the practical implications of our theoretical results, we also compare against BAC, a variant of RP-BBAC in which the target critic is identified with the critic, $\hat{Q}_{\omega^\star_l} = \hat{B}_{\phi^\star_l}$.
As we discussed in Section 4.1, BAC is the Bayesian actor-critic that results from applying RP approximate inference to the posterior over Q-functions used by existing model-free Bayesian approaches with bootstrapping. The results are shown in Fig. 4. Due to the lack of a smooth signal towards task completion, SAC consistently fails to solve the tasks and converges to always executing the 0-action due to the action cost term, while SAC* achieves the goal in one out of five seeds. RP-BBAC succeeds for all five seeds in both tasks. To understand why, we provide a state support analysis for MountainCarContinuous-v0 in Appendix I.1. The final plots are shown in Fig. 5 and confirm that the deep, adaptive exploration carried out by RP-BBAC leads agents to systematically explore regions of the state-action space with high uncertainty. The same analysis for SAC and SAC* confirms the inefficiency of the exploration typical of RL as inference: the agent repeatedly explores actions that lead to poor performance and rarely explores beyond its initial state. The state support analysis for BAC in Appendix I.1 confirms that, by using the posterior over Q-functions with bootstrapping, existing model-free Bayesian RL cannot accurately capture the uncertainty in the MDP. Initially, exploration is similar to RP-BBAC, but epistemic uncertainty estimates are unstable and cannot concentrate due to the convergence issues highlighted in this paper, preventing adaptive exploration.

Figure 5: State support for RP-BBAC (left) and SAC (right) in MountainCarContinuous-v0.

Our results in Fig. 4 demonstrate that the theoretical issues with existing approaches have negative empirical consequences, verifying that it is essential for Bayesian model-free RL algorithms with bootstrapping to sample from the BBO posterior, as BAC fails to solve both tasks where sampling from the correct posterior in RP-BBAC succeeds. In Appendix I.2, we also investigate RP-BBAC's sensitivity to randomised prior hyperparameters. The range of working hyperparameters is wide and easy to tune.

## 7 Conclusion

By introducing the BBO framework, we have addressed a major theoretical issue with model-free Bayesian RL by analysing the posterior that is inferred when bootstrapping is used. Our theoretical results proved consistency with frequentist RL and strong convergence properties, even under posterior approximation. We used BBO to extend BootDQN+Prior to continuous domains. Our experiments in environments where rewards are not hand-crafted to aid exploration demonstrate that sampling from the BBO posterior characterises uncertainty correctly and that algorithms derived from BBO can succeed where state-of-the-art algorithms fail catastrophically due to their lack of deep exploration. Future research could experiment with novel inference techniques, more complex priors and likelihoods, especially if they depend on $\omega$, or extend our convergence analysis to the actor-critic algorithm.

## Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713). Matthew Fellows and Kristian Hartikainen are funded by the EPSRC. The experiments were made possible by a generous equipment grant from NVIDIA. We would like to thank Piotr Miłoś, whose proof for a similar problem inspired our proof of Lemma 3.

## References

[1] Daron Acemoglu and Pascual Restrepo. Artificial intelligence, automation, and work.
In The Economics of Artificial Intelligence: An Agenda, pages 197 236. National Bureau of Economic Research, Inc, 2018. URL https://Econ Papers.repec.org/Re PEc:nbr: nberch:14027. B [2] Daron Acemoglu and Pascual Restrepo. Unpacking skill bias: Automation and new tasks. AEA Papers and Proceedings, 110:356 61, May 2020. doi: 10.1257/pandp.20201063. URL https://www.aeaweb.org/articles?id=10.1257/pandp.20201063. B [3] Donald Andrews. Generic uniform convergence. Econometric Theory, 8(2):241 257, 1992. 1 [4] András Antos, Rémi Munos, and Csaba Szepesvári. Fitted q-iteration in continuous action-space mdps. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS 07, page 9 16, Red Hook, NY, USA, 2007. Curran Associates Inc. ISBN 9781605603520. 5 [5] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89 129, 2008. doi: 10.1007/s10994-007-5038-2. 5 [6] John Asmuth and Michael Littman. Learning is planning: near bayes-optimal reinforcement learning via monte-carlo tree search. Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 19 26, 01 2011. 2.2 [7] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. Machine Learning-International Workshop Then Conference-, pages 30 37, July 1995. ISSN 00043702. doi: 10.1.1.48.3256. 3 [8] J. F. Bard. Some properties of the bilevel programming problem. J. Optim. Theory Appl., 68(2): 371 378, February 1991. ISSN 0022-3239. 4.1 [9] R.F. Bass. Real Analysis for Graduate Students, chapter 21. Createspace Ind Pub, 2013. ISBN 9781481869140. 1 [10] Matthew James Beal. Variational algorithms for approximate Bayesian inference. Ph D thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. 2.2 [11] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on rein- forcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 449 458, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. 2.2 [12] Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R. Maei, and Csaba Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1204 1212. Curran Associates, Inc., 2009. 3.1, 6, C.3, E.2, H.2.1, H.2.2, H.2.3 [13] Patrick Billingsley. Convergence of probability measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., New York, second edition, 1999. ISBN 0-471-19745-9. A Wiley-Interscience Publication. 3, 1, 1.1 [14] Vivek Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, 01 2008. ISBN 978-81-85931-85-2. doi: 10.1007/978-93-86279-38-5. 4.1, C.3, 2 [15] Vivek S. Borkar. Stochastic approximation with two time scales. Syst. Control Lett., 29(5): 291 294, February 1997. ISSN 0167-6911. doi: 10.1016/S0167-6911(97)90015-3. 4.1 [16] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. K. 
Leen, editors, Advances in Neural Information Processing Systems 7, pages 369 376. MIT Press, 1995. 6, [17] David Brandfonbrener and Joan Bruna. Geometric insights into the convergence of nonlinear td learning. In ICLR 2020, 2019. 5 [18] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. ar Xiv preprint ar Xiv:1606.01540, 2016. 6, 2, I.1 [19] Wesley Chung, Somjit Nath, Ajin Joseph, and Martha White. Two-timescale networks for nonlinear value function approximation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. URL https://openreview.net/forum?id=r Jle N20q K7. C.3 [20] Kamil Ciosek, Quan Vuong, Robert Loftin, and Katja Hofmann. Better exploration with optimistic actor-critic. ar Xiv preprint ar Xiv:1910.12807, 2019. 6, I [21] Kamil Ciosek, Vincent Fortuin, Ryota Tomioka, Katja Hofmann, and Richard Turner. Conser- vative uncertainty estimation by fitting prior networks. In Eighth International Conference on Learning Representations, April 2020. 4.1 [22] David Roxbee Cox and David Victor Hinkley. Theoretical statistics. Chapman and Hall, London, 1974. ISBN 0412124203. 1 [23] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15:809 883, 2014. 6, H.1, H.1.1, 1, 2, H.3.1, H.3.2, H.3.2 [24] Bruno de Finetti. La prévision : ses lois logiques, ses sources subjectives. Annales de l institut Henri Poincaré, 7(1):1 68, 1937. 1 [25] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653 664, 2017. B [26] Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1049 1058, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. 6 [27] Michael O Gordon Duff and Andrew Barto. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. Ph D thesis, University of Massachusetts Amherst, 2002. AAI3039353. 2.2 [28] Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets bellman: The gaussian process approach to temporal difference learning. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML 03, page 154 161, 2003. ISBN 1577351894. 5 [29] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML 05, page 201 208, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1595931805. doi: 10.1145/1102351.1102377. URL https://doi.org/10.1145/ 1102351.1102377. 5 [30] Matthew Fellows, Kamil Ciosek, and Shimon Whiteson. Fourier Policy Gradients. In ICML, [31] Yihao Feng, Lihong Li, and Qiang Liu. A kernel loss for solving the bellman equation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 15456 15467. Curran Associates, Inc., 2019. 
5 [32] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alexan- der Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In Proceedings of the International Conference on Representation Learning (ICLR 2018), Vancouver (Canada), 2018. 5 [33] Yarin Gal. Uncertainty in Deep Learning. Ph D thesis, University of Cambridge, 2016. 2.2, 3.2 [34] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, page 1050 1059. JMLR.org, 2016. 5 [35] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian Reinforcement Learning: A Survey. now, 2015. ISBN null. 1, 2.2 [36] W.R. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice, chapter Introduction to General State-Space Markov Chain Theory. Chapman & Hall/CRC Interdisciplinary Statistics. Taylor & Francis, 1995. ISBN 9780412055515. 1 [37] Adam Greenfield. Radical Technologies: The Design of Everyday Life. Verso, 2018. ISBN 1784780456. B [38] Arthur Guez, David Silver, and Peter Dayan. Scalable and efficient bayes-adaptive reinforcement learning based on monte-carlo tree search. Journal of Artificial Intelligence Research, 48:841 883, 10 2013. doi: 10.1613/jair.4117. 2.2 [39] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861 1870, Stockholmsmässan, Stockholm Sweden, 10 15 Jul 2018. PMLR. 1, 4.2, I [40] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. Co RR, abs/1812.05905, 2018. 6, I.1 [41] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2944 2952. Curran Associates, Inc., 2015. G [42] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626 6637. Curran Associates, Inc., 2017. 4.1, C.3, 2, C.5 [43] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem. 06 2017. B [44] Michael I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-60032-3. 2.2 [45] Prasenjit Karmakar and Shalabh Bhatnagar. Two time-scale stochastic approximation with controlled markov noise and off-policy temporal-difference learning. Math. Oper. Res., 43(1): 130 151, February 2018. ISSN 0364-765X. doi: 10.1287/moor.2017.0855. URL https: //doi.org/10.1287/moor.2017.0855. 2, C.5 [46] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
ar Xiv preprint ar Xiv:1412.6980, 2014. H.3.4 [47] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann Le Cun, editors, ICLR, 2014. 2.2, G [48] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey, [49] B.J.K. Kleijn and A.W. van der Vaart. The bernstein-von-mises theorem under misspecification. Electron. J. Statist., 6:354 381, 2012. doi: 10.1214/12-EJS675. 3.1 [50] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 1008 1014. MIT Press, 2000. 4.2 [51] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable pre- dictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402 6413. Curran Associates, Inc., 2017. 2.2 [52] Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and li Deng. Bbq-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. AAAI, 11 2018. 5 [53] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. 08 2019. B [54] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529 533, February 2015. ISSN 00280836. 4.1, 4.2 [55] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective, chapter 7. The MIT Press, 2012. ISBN 0262018020, 9780262018029. 3.2, E [56] Brendan O Donoghue, Ian Osband, Remi Munos, and Vlad Mnih. The uncertainty Bellman equation and exploration. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3839 3848, Stockholmsmässan, Stockholm Sweden, 10 15 Jul 2018. PMLR. 5 [57] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforce- ment learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8617 8629. Curran Associates, Inc., 2018. 1, 2.2, 4.1, 4.1, 4.2, 5, 6, F, G [58] Ian Osband, Benjamin Van Roy, Daniel J. Russo, and Zheng Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1 62, 2019. 4.1, 4.2, 5, F, I.2 [59] Tim Pearce, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. Uncertainty in neural networks: Bayesian ensembling. Ar Xiv Preprint, abs/1810.05546, 10 2019. 4.1, F [60] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779. 2.1 [61] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. Interna- tional Conference on Machine Learning, 37:1530 1538, 07 09 Jul 2015. 2.2 [62] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. 
The Annals of Mathematical Statistics, 22(3):400 407, 1951. doi: 10.1214/aoms/1177729586. URL https: //doi.org/10.1214/aoms/1177729586. 3, C.3 [63] R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis. Springer Verlag, Heidelberg, Berlin, New York, 1998. 1.1, 1.1 [64] Alexander Shapiro. On differentiability of metric projections in rn, 1: Boundary case. Proceed- ings of the American Mathematical Society, 99(1):123 128, 1987. ISSN 00029939, 10886826. URL http://www.jstor.org/stable/2046282. C.3 [65] Alexander Shapiro. Directional differentiability of metric projections onto moving sets at boundary points. Journal of Mathematical Analysis and Applications, 131(2):392 403, 1988. ISSN 0022-247X. doi: https://doi.org/10.1016/0022-247X(88)90213-2. URL https://www. sciencedirect.com/science/article/pii/0022247X88902132. 2, C.3 [66] Adam Smith and Janna Anderson. Ai, robotics, and the future of jobs. 2017. B [67] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98 111, 2013. doi: 10.1109/MSP.2013.2252713. 2.2 [68] Thomas Spooner, Rahul Savani, John Fearnley, and Andreas Koukorinis. Market making via reinforcement learning. In 17th International Conference on Autonomous Agents and Multiagent Systems, 07 2018. B [69] Nick Srnicek and Alex Williams. Inventing the future: postcapitalism and a world without work. Verso, 2015. ISBN 9781784780968. B [70] Richard S Sutton, Hamid R. Maei, and Csaba Szepesvári. A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1609 1616. Curran Associates, Inc., 2009. 3, 3.1, 3.2, E.2, E.2 [71] Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 09, pages 993 1000, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553501. 3.1, 3.2, E.2, E.2 [72] Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1 103, 2010. ISSN 1939-4608. doi: 10.2200/ S00268ED1V01Y201005AIM009. 2.1 [73] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. ar Xiv preprint ar Xiv:1801.00690, 2018. 6, I, I.2 [74] William R Thomson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 12 1933. ISSN 0006-3444. doi: 10.1093/biomet/25.3-4.285. 4.2 [75] Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. In Amir Globerson and Ricardo Silva, editors, UAI, page 156. AUAI Press, 2019. 5 [76] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674 690, May 1997. ISSN 2334-3303. doi: 10.1109/9.580874. 5, 6, 10, H.2, H.2.1 [77] Nino Vieillard, Tadashi Kozuno, B. Scherrer, O. Pietquin, Rémi Munos, and M. Geist. 
Leverage the average: an analysis of regularization in rl. Advances in Neural Information Processing Systems, 33, 2020. 4.2 [78] Nikos Vlassis, Mohammad Ghavamzadeh, Shie Mannor, and Pascal Poupart. Bayesian Reinforcement Learning, pages 359 386. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-27645-3. doi: 10.1007/978-3-642-27645-3_11. 1, 2.2 [79] David Williams. Probability with Martingales. Cambridge mathematical textbooks. Cambridge University Press, 1991. ISBN 978-0-521-40605-5. 1, 2 [80] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: a survey. arXiv preprint arXiv:1908.08796, 2019. B [81] Eduardo H. Zarantonello. Projections on convex sets in Hilbert space and spectral theory: Part I. Projections on convex sets; Part II. Spectral theory. In Eduardo H. Zarantonello, editor, Contributions to Nonlinear Functional Analysis, pages 237 424. Academic Press, 1971. ISBN 978-0-12-775850-3. doi: https://doi.org/10.1016/B978-0-12-775850-3.50013-3. URL https://www.sciencedirect.com/science/article/pii/B9780127758503500133. C.3 [82] Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: A very good method for bayes-adaptive deep RL via meta-learning. 8th International Conference on Learning Representations, ICLR 2020, Virtual Conference, Formerly Addis Ababa, Ethiopia, 2020. 2.2