# Deep Bayesian Quadrature Policy Optimization

Ravi Tej Akella¹, Kamyar Azizzadenesheli¹, Mohammad Ghavamzadeh², Anima Anandkumar³, and Yisong Yue³
¹Purdue University, ²Google Research, ³Caltech
{rakella,kamyar}@purdue.edu, ghavamza@google.com, {anima,yyue}@caltech.edu

We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample-efficient alternatives like Bayesian quadrature methods have received little attention due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with a significantly lower variance, (ii) a consistent improvement in the sample complexity and average return for several deep policy gradient algorithms, and (iii) an uncertainty estimate for the gradient that can be incorporated to further improve performance.

## 1 Introduction

Policy gradient (PG) is a reinforcement learning (RL) approach that directly optimizes the agent's policies by operating on the gradient of their expected return (Sutton et al. 2000; Baxter and Bartlett 2000). The use of deep neural networks for the policy class has recently produced a series of successes for PG methods (Lillicrap et al. 2015; Schulman et al. 2015) on high-dimensional continuous control benchmarks, such as MuJoCo (Todorov, Erez, and Tassa 2012). However, the derivation and analysis of these methods mainly rely on access to the expected return and its true gradient. In general, RL agents do not have access to the true gradient of the expected return, i.e., the gradient of an integration over returns; instead, they have access to its empirical estimate from sampled trajectories. Monte-Carlo (MC) sampling (Metropolis and Ulam 1949) is a widely used point-estimation method for approximating this integration (Williams 1992). However, MC estimation returns high-variance gradient estimates that are undesirably inaccurate, imposing a high sample-complexity requirement on PG methods (Rubinstein 1969; Ilyas et al. 2018).

An alternate approach for approximating integrals in probabilistic numerics is Bayesian quadrature (BQ) (O'Hagan 1991). Under mild regularity assumptions, BQ offers impressive empirical advances and strictly faster convergence rates (Kanagawa, Sriperumbudur, and Fukumizu 2016). Typically, the integrand in BQ is modeled using a Gaussian process (GP), such that the linearity of the integral operator provides a Gaussian posterior over the integral. In addition to a point estimate of the integral, BQ also quantifies the uncertainty in its estimation, a missing piece in MC methods. In RL, the BQ machinery can be used to obtain a Gaussian approximation of the PG integral, by placing a GP prior over the action-value function.
However, estimating the moments of this Gaussian approximation, i.e., the PG mean and covariance, requires integrating the GP's kernel function, which, in general, does not have an analytical solution. Interestingly, Ghavamzadeh and Engel (2007) showed that the PG integral can be solved analytically when the GP kernel is an additive combination of an arbitrary state kernel and a fixed Fisher kernel. While the authors demonstrate a superior performance of BQ over MC using a small policy network on simple environments, their approach does not scale to large non-linear policies (> 1000 trainable parameters) and high-dimensional domains.

**Contribution 1:** We propose deep Bayesian quadrature policy gradient (DBQPG), a BQ-based PG framework that extends Ghavamzadeh and Engel (2007) to high-dimensional settings, thus placing it in the context of contemporary deep PG algorithms. The proposed framework uses a GP to implicitly model the action-value function and, without explicitly constructing the action-value function, returns a Gaussian approximation of the policy gradient, represented by a mean vector (gradient estimate) and a covariance (gradient uncertainty). Consequently, this framework can be used with an explicit critic network to leverage the orthogonal benefits of actor-critic frameworks (Schulman et al. 2016); the GP critic in BQ is implicit in that it is used to solve the PG integral analytically rather than to explicitly predict Q-values.

The statistical efficiency of BQ, relative to MC estimation, depends on the compatibility between the GP kernel and the MDP's action-value function. To make DBQPG robust to a diverse range of target MDPs, we choose a base kernel capable of universal approximation (e.g., the RBF kernel) and augment its expressive power using deep kernel learning (Wilson et al. 2016). Empirically, DBQPG estimates gradients that are both much closer to the true gradient and of much lower variance, when compared to MC estimation. Moreover, DBQPG is a linear-time program that leverages recent advances in structured kernel interpolation (Wilson and Nickisch 2015) for computational efficiency and GPU acceleration for fast kernel learning (Gardner et al. 2018). Therefore, DBQPG can favorably substitute the MC estimation subroutine in a variety of PG algorithms. Specifically, we show that replacing the MC estimation subroutine with DBQPG provides a significant improvement in the sample complexity and average return of the vanilla PG (Sutton et al. 2000), natural policy gradient (NPG) (Kakade 2001), and trust region policy optimization (TRPO) (Schulman et al. 2015) algorithms on 7 diverse MuJoCo domains.

**Contribution 2:** We propose uncertainty aware PG (UAPG), a novel policy gradient method that utilizes the uncertainty in the gradient estimation for computing reliable policy updates. A majority of PG algorithms (Kakade 2001; Schulman et al. 2015) are derived assuming access to true gradients and therefore do not account for the stochasticity in gradient estimation due to a finite sample size. However, one can obtain more reliable policy updates by lowering the stepsize along the directions of high gradient uncertainty. UAPG captures this intuition by utilizing DBQPG's uncertainty to bring different components of a stochastic gradient estimate to the same scale.
UAPG does this by normalizing the stepsize of the gradient components by their respective uncertainties, returning a new gradient estimate with uniform uncertainty, i.e., with the new gradient covariance equal to the identity matrix. UAPG further provides a superior sample complexity and average return in comparison to DBQPG. Check the supporting resources for more details (full text with supplement: https://arxiv.org/abs/2006.15637; code: https://github.com/Akella17/Deep-Bayesian-Quadrature-Policy-Optimization).

## 2 Preliminaries

Consider a Markov decision process (MDP) $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \rho_0, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the transition kernel that maps each state-action pair to a distribution over the states, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\rho_0$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount factor. We denote by $\pi_\theta$ a stochastic parameterized policy with parameters $\theta \in \Theta$. The MDP controlled by the policy $\pi_\theta$ induces a Markov chain over state-action pairs $z = (s, a) \in \mathcal{Z} = \mathcal{S} \times \mathcal{A}$, with an initial density $\rho_0^{\pi_\theta}(z_0) = \pi_\theta(a_0|s_0)\rho_0(s_0)$ and a transition probability distribution $\mathcal{P}^{\pi_\theta}(z_t|z_{t-1}) = \pi_\theta(a_t|s_t)\mathcal{P}(s_t|z_{t-1})$. We use the standard definitions of the action-value function $Q^{\pi_\theta}$ and the expected return $J(\theta)$ under $\pi_\theta$:

$$Q^{\pi_\theta}(z_t) = \mathbb{E}\Big[\sum_{\tau=0}^{\infty} \gamma^{\tau} r(z_{t+\tau}) \,\Big|\, z_{t+\tau+1} \sim \mathcal{P}^{\pi_\theta}(\cdot|z_{t+\tau})\Big], \qquad J(\theta) = \mathbb{E}_{z_0 \sim \rho_0^{\pi_\theta}}\big[Q^{\pi_\theta}(z_0)\big]. \tag{1}$$

However, the gradient of $J(\theta)$ with respect to the policy parameters $\theta$ cannot be directly computed from this formulation. The policy gradient theorem (Sutton et al. 2000; Konda and Tsitsiklis 2000) provides an analytical expression for the gradient of the expected return $J(\theta)$:

$$\nabla_\theta J(\theta) = \int_{\mathcal{Z}} dz\, \rho^{\pi_\theta}(z)\, u(z)\, Q^{\pi_\theta}(z) = \mathbb{E}_{z \sim \rho^{\pi_\theta}}\big[u(z)\, Q^{\pi_\theta}(z)\big], \tag{2}$$

where $u(z) = \nabla_\theta \log \pi_\theta(a|s)$ is the score function and $\rho^{\pi_\theta}$ is the discounted state-action visitation frequency, defined as

$$\rho^{\pi_\theta}(z) = \sum_{t=0}^{\infty} \gamma^t\, \mathcal{P}_t^{\pi_\theta}(z), \quad \text{where} \quad \mathcal{P}_t^{\pi_\theta}(z_t) = \int_{\mathcal{Z}^t} dz_0 \ldots dz_{t-1}\, \rho_0^{\pi_\theta}(z_0) \prod_{\tau=1}^{t} \mathcal{P}^{\pi_\theta}(z_\tau|z_{\tau-1}). \tag{3}$$

## 3 Policy Gradient Evaluation

The exact evaluation of the PG integral in Eq. 2 is often intractable for MDPs with a large (or continuous) state or action space. We discuss two prominent approaches for approximating this integral using a finite set of samples $\{z_i\}_{i=1}^{n} \sim \rho^{\pi_\theta}$: (i) the Monte-Carlo method and (ii) Bayesian quadrature.

**Monte-Carlo (MC) method** (here, MC refers to the Monte-Carlo numerical integration method, not the TD(1), a.k.a. Monte-Carlo, Q-estimates) approximates the integral in Eq. 2 by the finite sum

$$L_\theta^{MC} = \frac{1}{n} \sum_{i=1}^{n} Q^{\pi_\theta}(z_i)\, u(z_i) = \frac{1}{n} U Q, \tag{4}$$

where $u(z)$ is a $|\Theta|$-dimensional vector ($|\Theta|$ is the number of policy parameters). For conciseness, the function evaluations at the sample locations $\{z_i\}_{i=1}^{n} \sim \rho^{\pi_\theta}$ are grouped into $U = [u(z_1), \ldots, u(z_n)]$, a $|\Theta| \times n$ matrix, and $Q = [Q^{\pi_\theta}(z_1), \ldots, Q^{\pi_\theta}(z_n)]$, an $n$-dimensional vector; $Q^{\pi_\theta}(z)$ is computed using the TD(1) (a.k.a. Monte-Carlo) method or an explicit critic module (different from BQ's implicit GP critic). The MC method returns the gradient mean evaluated at the sample locations, which, according to the central limit theorem (CLT), is an unbiased estimate of the true gradient. However, the CLT also suggests that MC estimates suffer from a slow convergence rate ($n^{-1/2}$) and high variance, necessitating a large sample size $n$ for reliable PG estimation (Ilyas et al. 2018). Yet, MC methods are more computationally efficient than their sample-efficient alternatives (e.g., BQ), making them ubiquitous in PG algorithms.
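As a concrete illustration of the MC estimator in Eq. 4, the following is a minimal, self-contained sketch for a toy linear-Gaussian policy; the policy class, the dimensions, and the randomly generated $Q$-values are illustrative assumptions rather than the paper's actual implementation.

```python
import torch

torch.manual_seed(0)

# Toy linear-Gaussian policy: a ~ N(W s, I), with policy parameters theta = vec(W).
state_dim, action_dim, n = 4, 2, 256
W = torch.zeros(action_dim, state_dim, requires_grad=True)

# Pretend these n state-action pairs and Q-value estimates were collected by rollouts.
states = torch.randn(n, state_dim)
actions = torch.randn(n, action_dim)
Q = torch.randn(n)  # stand-in for Q^{pi_theta}(z_i), e.g., TD(1) returns

def log_prob(s, a):
    # log N(a; W s, I) up to an additive constant (the constant has zero gradient)
    return -0.5 * ((a - s @ W.t()) ** 2).sum(dim=-1)

# Score functions u(z_i) = grad_theta log pi(a_i | s_i), stacked into U (|Theta| x n).
U = torch.stack(
    [torch.autograd.grad(log_prob(states[i], actions[i]), W)[0].reshape(-1) for i in range(n)],
    dim=1,
)

# Monte-Carlo policy gradient (Eq. 4): L^MC = (1/n) U Q.
L_MC = U @ Q / n
print(L_MC.shape)  # torch.Size([8]) == |Theta|
```

In practice, the same quantity is usually obtained in a single backward pass as the gradient of $\frac{1}{n}\sum_i Q_i \log \pi_\theta(a_i|s_i)$; the explicit matrix $U$ is shown here because it reappears in the BQ expressions below.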
**Bayesian quadrature (BQ)** (O'Hagan 1991) is an approach from probabilistic numerics (Hennig, Osborne, and Girolami 2015) that converts numerical integration into a Bayesian inference problem. The first step in BQ is to formulate a prior stochastic model over the integrand. This is done by placing a Gaussian process (GP) prior on the $Q^{\pi_\theta}$ function, i.e., a zero-mean GP, $\mathbb{E}[Q^{\pi_\theta}(z)] = 0$, with a covariance function $k(z_p, z_q) = \mathrm{Cov}[Q^{\pi_\theta}(z_p), Q^{\pi_\theta}(z_q)]$ and observation noise $\sigma$. Next, the GP prior on $Q^{\pi_\theta}$ is conditioned (Bayes rule) on the sampled data $\mathcal{D} = \{z_i\}_{i=1}^{n}$ to obtain the posterior moments $\mathbb{E}[Q^{\pi_\theta}(z)|\mathcal{D}]$ and $C_Q(z_1, z_2) = \mathrm{Cov}[Q^{\pi_\theta}(z_1), Q^{\pi_\theta}(z_2)|\mathcal{D}]$. Since the transformation from $Q^{\pi_\theta}(z)$ to $\nabla_\theta J(\theta)$ happens through a linear integral operator (Eq. 2), $\nabla_\theta J(\theta)$ also follows a Gaussian distribution:

$$L_\theta^{BQ} = \mathbb{E}[\nabla_\theta J(\theta)|\mathcal{D}] = \int_{\mathcal{Z}} dz\, \rho^{\pi_\theta}(z)\, u(z)\, \mathbb{E}[Q^{\pi_\theta}(z)|\mathcal{D}],$$
$$C_\theta^{BQ} = \mathrm{Cov}[\nabla_\theta J(\theta)|\mathcal{D}] = \int_{\mathcal{Z}^2} dz_1 dz_2\, \rho^{\pi_\theta}(z_1)\rho^{\pi_\theta}(z_2)\, u(z_1)\, C_Q(z_1, z_2)\, u(z_2)^\top, \tag{5}$$

where the mean vector $L_\theta^{BQ}$ is the PG estimate and the covariance $C_\theta^{BQ}$ is its uncertainty estimate. While the integrals in Eq. 5 are still intractable for an arbitrary GP kernel $k$, they have closed-form solutions when $k$ is the additive composition of a state kernel $k_s$ (arbitrary) and the Fisher kernel $k_f$ (indispensable) (Ghavamzadeh and Engel 2007):

$$k(z_1, z_2) = c_1 k_s(s_1, s_2) + c_2 k_f(z_1, z_2), \quad \text{with} \quad k_f(z_1, z_2) = u(z_1)^\top G^{-1} u(z_2), \tag{6}$$

where $c_1, c_2$ are hyperparameters (redundant ones, introduced simply to offer a better explanation of BQ-PG) and $G$ is the Fisher information matrix of $\pi_\theta$. Using the matrices

$$K_f = U^\top G^{-1} U, \qquad K = c_1 K_s + c_2 K_f, \qquad G = \mathbb{E}_{z \sim \rho^{\pi_\theta}}\big[u(z)\, u(z)^\top\big], \tag{7}$$

the PG mean $L_\theta^{BQ}$ and covariance $C_\theta^{BQ}$ can be computed analytically (the proof is deferred to the supplement, Sec. B):

$$L_\theta^{BQ} = c_2 U (K + \sigma^2 I)^{-1} Q, \qquad C_\theta^{BQ} = c_2 G - c_2^2\, U (K + \sigma^2 I)^{-1} U^\top. \tag{8}$$

Here, $Q$ is the same as in the MC method. Note that removing the Fisher kernel $k_f$ (setting $c_2 = 0$ in Eq. 8) causes $L_\theta^{BQ} = 0$ and $C_\theta^{BQ} = 0$. Further, Ghavamzadeh and Engel (2007) showed that BQ-PG $L_\theta^{BQ}$ reduces to MC-PG $L_\theta^{MC}$ when the state kernel $k_s$ is removed ($c_1 = 0$). To summarize, (i) the presence of the Fisher kernel ($c_2 \neq 0$) is essential for obtaining an analytical solution for BQ-PG, and (ii) with the Fisher kernel fixed, the choice of the state kernel alone determines the convergence rate of BQ-PG relative to the MC baseline (equivalently, BQ-PG with $c_1 = 0$).

## 4 Deep Bayesian Quadrature Policy Gradient

Here, we introduce the steps needed to obtain a practical BQ-PG method that (i) is more sample-efficient than the MC baseline, and (ii) easily scales to high-dimensional settings.

**Kernel Selection:** In the previous section, it was highlighted that the flexibility of kernel selection is limited to the choice of the state kernel $k_s$. To understand the role of $k_s$, we first highlight the implication of the kernel composition proposed in Ghavamzadeh and Engel (2007). Specifically, the additive composition of the state kernel $k_s$ and the Fisher kernel $k_f$ implicitly divides the $Q^{\pi_\theta}$ function into a state-value function and an advantage function, separately modeled by $k_s$ and $k_f$, respectively (supplement, Sec. C.1). Thus, removing the Fisher kernel $k_f$ ($c_2 = 0$) results in a non-existent advantage function, which explains $L_\theta^{BQ} = 0$ and $C_\theta^{BQ} = 0$. More interestingly, removing the state kernel $k_s$ ($c_1 = 0$) results in a non-existent state-value function, which also reduces $L_\theta^{BQ}$ to $L_\theta^{MC}$. In other words, MC-PG is a limiting case of BQ-PG in which the state-value function is suppressed to 0 (more details in the supplement, Sec. C.2). Thus, BQ-PG can offer more accurate gradient estimates than MC-PG, along with well-calibrated uncertainty estimates, whenever the state kernel $k_s$ is a better prior for the MDP's state-value function than the trivial $k_s = 0$.
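The closed-form expressions in Eq. 8 and the $c_1 = 0$ limiting case above can be checked directly on synthetic data. The sketch below is a dense, small-scale illustration (toy sizes, random score matrix and Q-values, and the simplifying assumption that $G$ is the empirical Fisher $\frac{1}{n} U U^\top$); it is not the scalable DBQPG implementation described in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n = 6, 200                    # |Theta| and sample size (toy sizes)
sigma2, c1, c2 = 1e-6, 1.0, 1.0

# Synthetic score matrix U (|Theta| x n), Q-values, and an RBF state-kernel Gram matrix K_s.
U = rng.standard_normal((n_params, n))
Q = rng.standard_normal(n)
S = rng.standard_normal((n, 3))         # toy state features
Ks = np.exp(-0.5 * np.sum((S[:, None] - S[None]) ** 2, axis=-1))

G = U @ U.T / n                         # empirical Fisher (assumed invertible here)
Kf = U.T @ np.linalg.solve(G, U)        # Fisher-kernel Gram matrix (Eq. 7)

def bq_pg(c1, c2):
    K = c1 * Ks + c2 * Kf
    A = np.linalg.solve(K + sigma2 * np.eye(n), np.column_stack([Q, U.T]))
    L = c2 * U @ A[:, 0]                 # BQ-PG mean (Eq. 8)
    C = c2 * G - c2 ** 2 * U @ A[:, 1:]  # BQ-PG covariance (Eq. 8)
    return L, C

L_bq, C_bq = bq_pg(c1, c2)              # gradient estimate and its uncertainty
L_mc = U @ Q / n                        # MC-PG mean (Eq. 4)

# Removing the state kernel (c1 = 0) collapses BQ-PG onto MC-PG
# (exactly so in the limit sigma2 -> 0).
L_bq_no_state, _ = bq_pg(0.0, c2)
print(np.allclose(L_bq_no_state, L_mc, atol=1e-4))   # True
```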
To make the choice of $k_s$ robust to a diverse range of target MDPs, we suggest (i) a base kernel capable of universal approximation (e.g., the RBF kernel), followed by (ii) increasing the base kernel's expressive power by using a deep neural network (DNN) to transform its inputs (Wilson et al. 2016). Deep kernels combine the non-parametric flexibility of GPs with the structural properties of NNs to obtain more expressive kernel functions than their base kernels. The kernel parameters $\phi$ (DNN parameters + base kernel hyperparameters) are tuned using the gradient of the GP's log marginal likelihood $J_{GP}$ (Rasmussen and Williams 2005):

$$J_{GP}(\phi|\mathcal{D}) \propto -\big(\log|K| + Q^\top K^{-1} Q\big), \qquad \nabla_\phi J_{GP} = Q^\top K^{-1} (\nabla_\phi K) K^{-1} Q - \mathrm{Tr}\big(K^{-1} \nabla_\phi K\big). \tag{9}$$

*Figure 1: (Left) Overview of the DBQPG method. (Right) Illustration of DBQPG and UAPG updates along the first two principal components (PCs) of the gradient covariance matrix.*

**Scaling BQ to large sample sizes $n$:** The complexity of estimating $L_\theta^{BQ}$ (Eq. 8) is largely determined by the matrix-inversion operation $(K + \sigma^2 I)^{-1}$, whose exact computation scales with $O(n^3)$ time and $O(n^2)$ space complexity. Our first step is to shift the focus from an expensive matrix-inversion operation to an approximate inverse matrix-vector multiplication (i-MVM). In particular, we use the conjugate gradient (CG) method to compute the i-MVM $(K + \sigma^2 I)^{-1} v$ within machine precision using $p \ll n$ iterations of the MVM operation $Kv = c_1 K_s v + c_2 K_f v$. While the CG method nearly reduces the time complexity by an order of $n$, the MVM operation with a dense matrix still incurs a prohibitive $O(n^2)$ computational cost. Fortunately, the matrix factorization of $K_f$ (Eq. 7) allows the Fisher kernel MVM $K_f v$ to be computed in linear time through automatic differentiation, without the explicit creation or storage of the $K_f$ matrix (more details in Sec. D.1 of the supplement). On the other hand, since the choice of $k_s$ is arbitrary, we deploy structured kernel interpolation (SKI) (Wilson and Nickisch 2015), a general kernel sparsification strategy, to efficiently compute the state kernel MVM $K_s v$. SKI uses a set of $m \ll n$ inducing points $\{\hat{s}_i\}_{i=1}^{m}$ to approximate $K_s$ with a rank-$m$ matrix $\hat{K}_s = W K_s^m W^\top$, where $K_s^m$ is an $m \times m$ Gram matrix with entries $K_{s(p,q)}^m = k_s(\hat{s}_p, \hat{s}_q)$, and $W$ is an $n \times m$ interpolation matrix whose entries depend on the relative placement of the sample points $\{s_i\}_{i=1}^{n}$ and the inducing points $\{\hat{s}_i\}_{i=1}^{m}$. In practice, a sparse $W$ matrix that follows local cubic interpolation (only 4 non-zero entries per row) provides a good approximation $\hat{K}_s$, and, more importantly, offers the $\hat{K}_s v$ MVM in $O(n + m^2)$ time and storage complexity. Further, the SKI framework also provides the flexibility to select the inducing point locations so as to exploit the structure of specialized GP kernels. For instance, one-dimensional stationary kernels, i.e., $k_s(x, y) = k_s(x - y)$, can additionally leverage the Toeplitz method (Turner 2010) by picking evenly-spaced inducing points. Since these matrices are constant along the diagonal, i.e., $K_{s(x,y)} = K_{s(x+1,y+1)}$, the Toeplitz method utilizes the fast Fourier transform to attain $O(n + m \log m)$ time and $O(n + m)$ storage for the MVM operation. Further, Toeplitz methods can be extended to multiple dimensions by assuming that the kernel decomposes as a sum of one-dimensional stationary kernels along each of the input dimensions.
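To summarize the scaling strategy above in code: the gradient mean in Eq. 8 only needs the product of $(K + \sigma^2 I)^{-1}$ with $Q$, which CG obtains from MVM calls alone. The sketch below is a minimal matrix-free CG routine; the dense toy matrix stands in for the structured $c_1 K_s v + c_2 K_f v$ products (SKI/Toeplitz for $K_s v$, automatic differentiation for $K_f v$), and all names and sizes are illustrative.

```python
import numpy as np

def conjugate_gradient(mvm, b, iters=200, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only a callable mvm(v) = A v."""
    x = np.zeros_like(b)
    r = b - mvm(x)                      # residual
    p = r.copy()                        # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = mvm(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy usage: an explicit PSD matrix stands in for c1*Ks + c2*Kf.
rng = np.random.default_rng(0)
n, sigma2 = 500, 1e-2
A = rng.standard_normal((n, n))
K = A @ A.T / n
Q = rng.standard_normal(n)

x = conjugate_gradient(lambda v: K @ v + sigma2 * v, Q)   # i-MVM: (K + sigma^2 I)^{-1} Q
print(np.linalg.norm(K @ x + sigma2 * x - Q))             # residual is small
```

In DBQPG, the number of CG iterations $p$ acts as a compute/accuracy knob, and each iteration costs only one structured state-kernel MVM and one autodiff-based Fisher-kernel MVM.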
Compared to conventional inducing point methods that operate with $m \ll n$ inducing points, choosing a base kernel that conforms with the Toeplitz method enables the realization of larger $m$ values, thereby providing a more accurate approximation of $K_s$.

**Practical DBQPG Algorithm:** The DBQPG algorithm is designed to compute the gradient from a batch of samples in parallel, leveraging the automatic differentiation framework and fast kernel computation methods, thus placing BQ-PG in the context of contemporary PG algorithms (see Fig. 1, left). In contrast, the original BQ-PG method (Ghavamzadeh and Engel 2007) was designed to process the samples sequentially (slow). Other major improvements in DBQPG include (i) replacing a traditional inducing-points method ($O(m^2 n + m^3)$ time and $O(mn + m^2)$ storage) with SKI, a more efficient alternative ($O(n + m^2)$ time and storage), and (ii) replacing a fixed base kernel for $k_s$ with the more expressive deep kernel, followed by learning the kernel parameters (Eq. 9). The performance gains from deep kernel learning and SKI are documented in the supplement, Sec. G.

For DBQPG, our default choice for the prior state covariance function $k_s$ is a deep RBF kernel, which comprises an RBF base kernel on top of a DNN feature extractor. Our choice of RBF as the base kernel is based on (i) its compelling theoretical properties, such as infinite basis expansion and universal function approximation (Micchelli, Xu, and Zhang 2006), and (ii) its compatibility with the Toeplitz method. Thus, the overall computational complexity of estimating $L_\theta^{BQ}$ (Eq. 8) is $O(p(n + Y m \log m))$ time and $O(n + Y m)$ storage ($Y$: state dimensionality; $p$: CG iterations; $m$: number of inducing points in the SKI+Toeplitz approximation for a deep RBF kernel; automatic differentiation for the Fisher kernel). Note that we intentionally left kernel learning out of the complexity analysis, since a naive implementation of the gradient-based optimization step (Eq. 9) incurs a cubic time complexity in the sample size. Our implementation relies on the black-box matrix-matrix multiplication (BBMM) feature (a batched version of the CG algorithm that effectively uses GPU hardware) offered by the GPyTorch library (Gardner et al. 2018), coupled with a SKI approximation over $K_s$. This combination offers a linear-time routine for kernel learning. In other words, DBQPG with kernel learning is a linear-time program that leverages GPU acceleration to efficiently estimate the gradient of a large policy network, with a few thousand parameters, on high-dimensional domains.

## 5 Uncertainty Aware Policy Gradient

We propose UAPG, a novel uncertainty aware PG method that utilizes the gradient uncertainty $C_\theta^{BQ}$ from DBQPG to provide more reliable policy updates. Most classical PG methods treat stochastic gradient estimates as the true gradient, without accounting for the uncertainty in their gradient components, thus occasionally taking large steps along directions of high uncertainty. UAPG provides more reliable policy updates by lowering the stepsize along the directions with high uncertainty. In particular, UAPG uses $C_\theta^{BQ}$ to normalize the components of $L_\theta^{BQ}$ by their respective uncertainties, bringing them all to the same scale. Thus, UAPG offers PG estimates with uniform uncertainty, i.e., their gradient covariance is the identity matrix (see Fig. 1, right). In theory, the UAPG update can be formulated as $\big(C_\theta^{BQ}\big)^{-\frac{1}{2}} L_\theta^{BQ}$.
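As a toy illustration of this idealized update, the inverse square root of the gradient covariance can be applied through an eigendecomposition; the quantities below are synthetic stand-ins for the DBQPG outputs, and the next paragraph describes the rank-limited version used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy number of policy parameters

# Stand-ins for the DBQPG outputs: a gradient estimate and a PSD covariance.
L_bq = rng.standard_normal(d)
B = rng.standard_normal((d, d))
C_bq = B @ B.T / d + 1e-3 * np.eye(d)

# Idealized UAPG update: (C^BQ)^{-1/2} L^BQ via an eigendecomposition of C^BQ.
nu, H = np.linalg.eigh(C_bq)
C_inv_sqrt = H @ np.diag(nu ** -0.5) @ H.T
L_uapg = C_inv_sqrt @ L_bq

# The transformed estimate has uniform uncertainty: C^{-1/2} C C^{-1/2} = I.
print(np.allclose(C_inv_sqrt @ C_bq @ C_inv_sqrt, np.eye(d)))   # True
```

Directions to which the covariance assigns large uncertainty are shrunk by the inverse square root, which is exactly the stepsize-dampening intuition described above.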
**Practical UAPG Algorithm:** Empirical $C_\theta^{BQ}$ estimates are often ill-conditioned matrices (their spectrum decays quickly) with a numerically unstable inversion. Since $C_\theta^{BQ}$ only provides a faithful estimate of the top few principal directions of uncertainty, the UAPG update is computed from a rank-$\delta$ singular value decomposition (SVD) approximation, $C_\theta^{BQ} \approx \nu_\delta I + \sum_{i=1}^{\delta} h_i (\nu_i - \nu_\delta) h_i^\top$, as follows:

$$L_\theta^{UAPG} = \nu_\delta^{-\frac{1}{2}} \Big[ I + \sum_{i=1}^{\delta} h_i \Big( \sqrt{\nu_\delta/\nu_i} - 1 \Big) h_i^\top \Big] L_\theta^{BQ}. \tag{10}$$

The principal components (PCs) $\{h_i\}_{i=1}^{\delta}$ denote the top $\delta$ directions of uncertainty and the singular values $\{\nu_i\}_{i=1}^{\delta}$ denote their corresponding magnitudes of uncertainty. The rank-$\delta$ decomposition of $C_\theta^{BQ}$ can be computed in linear time using the randomized SVD algorithm (Halko, Martinsson, and Tropp 2011). The UAPG update in Eq. 10 dampens the stepsize of the top $\delta$ directions of uncertainty, relative to the stepsize of the remaining gradient components. Thus, in comparison to DBQPG, UAPG lowers the risk of taking large steps along the directions of high uncertainty, providing reliable policy updates. For the natural gradient $L_\theta^{NBQ} = G^{-1} L_\theta^{BQ}$, the gradient uncertainty can be computed as follows:

$$C_\theta^{NBQ} = G^{-1} C_\theta^{BQ} G^{-1} = c_2 G^{-1} - c_2^2\, G^{-1} U \big(c_1 K_s + c_2 K_f + \sigma^2 I\big)^{-1} U^\top G^{-1} = c_2 \big(G + c_2 U (c_1 K_s + \sigma^2 I)^{-1} U^\top\big)^{-1}. \tag{11}$$

Since $C_\theta^{NBQ}$ is the inverse of an ill-conditioned matrix, we instead apply a low-rank approximation to $\big(C_\theta^{NBQ}\big)^{-1} \approx \nu_\delta I + \sum_{i=1}^{\delta} h_i (\nu_i - \nu_\delta) h_i^\top$ for the UAPG update of NPG:

$$L_\theta^{NUAPG} = \nu_\delta^{\frac{1}{2}} \Big[ I + \sum_{i=1}^{\delta} h_i \Big( \min\big(\sqrt{\nu_i/\nu_\delta},\, \epsilon\big) - 1 \Big) h_i^\top \Big] G^{-1} L_\theta^{BQ}, \tag{12}$$

where $\{h_i, \nu_i\}_{i=1}^{\delta}$ correspond to the top $\delta$ PCs of $\big(C_\theta^{NBQ}\big)^{-1}$ (equivalently, the bottom $\delta$ PCs of $C_\theta^{NBQ}$), and $\epsilon > 1$ is a hyperparameter. We replace $\sqrt{\nu_i/\nu_\delta}$ with $\min\big(\sqrt{\nu_i/\nu_\delta}, \epsilon\big)$ to avoid taking large steps along these directions solely on the basis of their uncertainty estimates. For $c_2 \ll 1$, it is interesting to note that (i) $C_\theta^{BQ} \approx c_2 G$ and $C_\theta^{NBQ} \approx c_2 G^{-1}$, which implies that the most uncertain gradient directions for vanilla PG approximately correspond to the most confident directions for NPG, and (ii) the ideal UAPG update for both vanilla PG and NPG converges along the $G^{-\frac{1}{2}} L_\theta^{BQ}$ direction. A more rigorous discussion of the relations between vanilla PG, NPG, and their UAPG updates can be found in the supplement, Sec. E.1.

## 6 Experiments

We study the behaviour of BQ-PG methods (Algorithm 1) on MuJoCo environments, using the mujoco-py library of OpenAI Gym (Brockman et al. 2016). In our experiments, we replace $Q$ with generalized advantage estimates (Schulman et al. 2016) computed using an explicit state-value critic network, i.e., $V^{\pi_\theta}(s)$ is approximated using a linear layer on top of the state feature extractor in Fig. 1 (left).

**Algorithm 1: BQ-PG Estimator Subroutine**
1. BQ-PG($\theta$, $n$), where $\theta$: policy parameters and $n$: sample size for PG estimation.
2. Collect $n$ state-action pairs (samples) by running the policy $\pi_\theta$ in the environment.
3. Compute the action-values $Q$ for these $n$ samples using MC rollouts (TD(1) estimate) or an explicit critic network.
4. Update the kernel parameters and the explicit critic's parameters using the GP MLL (Eq. 9) and the TD error, respectively.
5. Policy gradient estimation (DBQPG): $L_\theta = L_\theta^{BQ}$ (vanilla PG) or $L_\theta = G^{-1} L_\theta^{BQ}$ (natural PG).
6. (Optional) Compute $\{h_i, \nu_i\}_{i=1}^{\delta}$ using fast SVD over $C_\theta^{BQ}$ (Eq. 8, vanilla PG) or $\big(C_\theta^{NBQ}\big)^{-1}$ (Eq. 11, NPG).
7. (Optional) Uncertainty-based adjustment (UAPG): rescale $L_\theta$ using Eq. 10 (vanilla PG) or Eq. 12 (natural PG).
8. Return $L_\theta$.
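The optional UAPG adjustment in step 7 (Eq. 10, vanilla-PG case) can be sketched with a basic randomized range finder in the spirit of Halko, Martinsson, and Tropp (2011). The explicit toy covariance below is an illustrative stand-in: in DBQPG the operator would be applied through the same kernel MVM machinery as the gradient mean, so $C_\theta^{BQ}$ is never formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 50, 5                        # toy parameter count and approximation rank

# Stand-ins for the DBQPG outputs: gradient estimate L^BQ and PSD covariance C^BQ (Eq. 8).
L_bq = rng.standard_normal(d)
B = rng.standard_normal((d, d))
C_bq = B @ B.T / d

def top_directions(mvm, dim, rank, oversample=10, power_iters=2):
    """Randomized range finder for a symmetric PSD operator given only by mvm(V) = C V."""
    Y = mvm(rng.standard_normal((dim, rank + oversample)))
    for _ in range(power_iters):        # power iterations sharpen the leading subspace
        Y = mvm(np.linalg.qr(Y)[0])
    Qm = np.linalg.qr(Y)[0]
    nu, W = np.linalg.eigh(Qm.T @ mvm(Qm))    # small projected eigenproblem
    order = np.argsort(nu)[::-1][:rank]
    return Qm @ W[:, order], nu[order]        # top-`rank` directions h_i and values nu_i

H, nu = top_directions(lambda V: C_bq @ V, d, delta)
nu_delta = nu[-1]

# UAPG update (Eq. 10): dampen the top-delta uncertainty directions of L^BQ.
scale = np.sqrt(nu_delta / nu) - 1.0
L_uapg = (L_bq + H @ (scale * (H.T @ L_bq))) / np.sqrt(nu_delta)
print(L_uapg.shape)   # (50,)
```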
**Quality of Gradient Estimation:** Inspired by the experimental setup of Ilyas et al. (2018), we evaluate the quality of the PG estimates obtained via DBQPG and MC estimation using two metrics: (i) gradient accuracy, i.e., the average cosine similarity of the obtained gradient estimates with respect to the true gradient (estimated from $10^6$ state-action pairs), and (ii) the variance of the gradient estimates (normalized by the norm of the mean gradient for scale invariance). See Fig. 2. We observe that DBQPG provides more accurate gradient estimates with a considerably lower variance. Interestingly, DBQPG and MC estimates offer nearly the same quality of gradients at the start of training. However, as the training progresses and DBQPG learns kernel bases, we observe that DBQPG returns gradient estimates of superior quality. Moreover, as training progresses from 0 to 150 iterations, the gradient norms of both DBQPG and MC estimates drop by a factor of 3, while the unnormalized gradient variances increase 5-fold. The drop in the signal-to-noise ratio for gradient estimation explains the fall in accuracy over training time. Further, the wall-clock time of DBQPG and UAPG updates is comparable to MC-PG (Fig. 6 in the supplement). These results motivate substituting MC with BQ-based PG estimates in deep PG algorithms.

*Figure 2: An empirical analysis of the quality of policy gradient estimates as a function of the state-action sample size. The experiments are conducted at the 0th, 150th, 300th, and 450th iteration of the training phase of the DBQPG (vanilla PG) algorithm in the MuJoCo Swimmer-v2 environment. All results have been averaged over 25 repeated gradient measurements across 100 random runs. (a) The accuracy plot results are obtained w.r.t. the "true gradient," which is computed using MC estimates from $10^6$ state-action pairs. (b) The normalized variance is computed as the ratio of the trace of the empirical gradient covariance matrix (as in Zhao et al. (2011)) to the squared norm of the gradient mean.*

**Compatibility with Deep PG Algorithms:** We examine the compatibility of BQ-based methods with the following on-policy deep policy gradient algorithms: (i) vanilla policy gradient, (ii) natural policy gradient (NPG), and (iii) trust region policy optimization (TRPO), as shown in Fig. 3. In these experiments, only the MC estimation subroutine is replaced with BQ-based methods, keeping the rest of the algorithm unchanged. Note that for TRPO, we use the UAPG update of NPG to compute the step direction. We observe that DBQPG consistently outperforms MC estimation, both in final performance and in sample complexity, across all the deep PG algorithms. This observation resonates with our previous finding of the superior gradient quality of DBQPG estimates, and strongly advocates the use of DBQPG over MC for PG estimation. For UAPG, we observe a similar trend as for DBQPG. The advantage of UAPG estimates is more pronounced in the vanilla PG and NPG experiments, since the performance of these algorithms is highly sensitive to the choice of learning rate.
UAPG adjusts the stepsize of each gradient component based on its uncertainty, resulting in an update that is robust in the face of uncertainty, a better sample complexity, and a higher average return. We observe that UAPG performs at least as well as, if not considerably better than, DBQPG on most experiments. Since TRPO updates are less sensitive to the learning rate, the UAPG adjustment does not provide a significant improvement over DBQPG.

*Figure 3: The average performance (10 runs) of BQ-based methods and MC estimation in the vanilla PG, NPG, and TRPO frameworks across 4 MuJoCo environments (refer to Fig. 9 in the supplement for a comparison across all 7 MuJoCo environments).*

## 7 Related Work

The high sample complexity of MC methods has been a long-standing problem in the PG literature (Rubinstein 1969). Previous approaches that address this issue broadly focus on two aspects: (i) improving the quality of PG estimation using a value function approximation (Konda and Tsitsiklis 2003), or (ii) attaining faster convergence by robustly taking larger steps in the right direction. The former class of approaches trades a tolerable level of bias for a lower-variance PG estimator (Schulman et al. 2016; Reisinger, Stone, and Miikkulainen 2008). Following the latter research direction, Kakade (2001) and Kakade and Langford (2002) suggest replacing the vanilla PG with the natural policy gradient (NPG), the steepest descent direction in the policy distribution space. While NPG improves over vanilla PG methods in terms of sample complexity (Peters and Schaal 2008; Agarwal et al. 2019), it is just as vulnerable to catastrophic policy updates. Trust region policy optimization (TRPO) (Schulman et al. 2015) extends the NPG algorithm with a robust stepsize-selection mechanism that guarantees monotonic improvement for expected (true) policy updates. However, the practical TRPO algorithm loses its improvement guarantees for stochastic PG estimates, thereby necessitating a large sample size for computing reliable policy updates. The advantages of these approaches, in terms of sample efficiency, are orthogonal to the benefits of the DBQPG and UAPG methods.

Another line of research focuses on using Gaussian processes (GPs) to directly approximate the PG integral (Ghavamzadeh and Engel 2006). This work was followed by the Bayesian Actor-Critic (BAC) algorithm (Ghavamzadeh and Engel 2007), which exploits the MDP framework for improving the statistical efficiency of PG estimation. Like DBQPG, BAC is a BQ-based PG method that uses a GP to approximate the action-value function. However, BAC is an online algorithm that uses Gaussian process temporal difference (GPTD) (Engel, Mannor, and Meir 2005), a sequential kernel sparsification method, for learning the value function. BAC's sequential nature and its prohibitive $O(m^2 n + m^3)$ time and $O(mn + m^2)$ storage complexity ($m$ is the dictionary size, i.e., the number of inducing points) prevent it from scaling to large non-linear policies and high-dimensional continuous domains. Further, a recent extension of BAC (Ghavamzadeh, Engel, and Valko 2016) uses the uncertainty to adjust the learning rate of the policy parameters, $\big(I - \frac{1}{\nu} C_\theta^{BQ}\big) L_\theta^{BQ}$ (where $\nu$ is an upper bound on $C_\theta^{BQ}$), much like the UAPG method. While this update reduces the stepsize of more uncertain directions, it does not provide gradient estimates with uniform uncertainty, making it less robust to bad policy updates relative to UAPG.
Moreover, this method would not work for the NPG and TRPO algorithms, as their covariance is the inverse of an ill-conditioned matrix. Refer to the supplement, Sec. G, for a more detailed comparison of DBQPG and UAPG with prior BQ-PG works.

## 8 Discussion

We study the problem of estimating accurate policy gradient (PG) estimates from a finite number of samples. This problem becomes relevant in numerous RL applications where an agent needs to estimate the PG using samples gathered through interaction with the environment. Monte-Carlo (MC) methods are widely used for PG estimation despite offering high-variance gradient estimates and a slow convergence rate. We propose DBQPG, a high-dimensional generalization of Bayesian quadrature that, like the MC method, estimates the PG in linear time. We empirically study DBQPG and demonstrate its effectiveness over MC methods. We show that DBQPG provides more accurate gradient estimates, along with a significantly smaller variance in the gradient estimation. Next, we show that replacing the MC method with DBQPG in the gradient estimation subroutine of three deep PG algorithms, viz., vanilla PG, NPG, and TRPO, offers consistent gains in sample complexity and average return. These results strongly recommend the substitution of the MC estimator with DBQPG in deep PG algorithms.

To obtain reliable policy updates from stochastic gradient estimates, one needs to estimate the uncertainty in the gradient estimation, in addition to the gradient estimate itself. The proposed DBQPG method additionally provides this uncertainty along with the PG estimate. We propose UAPG, a PG method that uses DBQPG's uncertainty to normalize the gradient components by their uncertainties, returning a uniformly uncertain gradient estimate. Such a normalization lowers the stepsize of gradient components with high uncertainty, resulting in reliable updates with robust stepsizes. We show that UAPG further improves the sample complexity over DBQPG on the vanilla PG, NPG, and TRPO algorithms. Overall, our study shows that it is possible to scale Bayesian quadrature to high-dimensional settings, while its better gradient estimation and well-calibrated gradient uncertainty can significantly boost the performance of deep PG algorithms.

## Ethics Statement

When deploying deep policy gradient (PG) algorithms for learning control policies in physical systems, sample efficiency becomes an important design criterion. In the past, numerous works have focused on improving the sample efficiency of PG estimation through variance reduction, robust stepsize selection, etc. In this paper, we propose deep Bayesian quadrature policy gradient (DBQPG), a statistically efficient policy gradient estimator that offers orthogonal benefits for improving sample efficiency. In comparison to Monte-Carlo estimation, the default choice for PG estimation, DBQPG returns more accurate gradient estimates with much lower empirical variance. Since DBQPG is a general gradient estimation subroutine, it can directly replace Monte-Carlo estimation in most policy gradient algorithms, as already demonstrated in our paper. Therefore, we think that the DBQPG method directly benefits most policy gradient algorithms and is indirectly beneficial for several downstream reinforcement learning applications. We also propose uncertainty aware policy gradient (UAPG), a principled approach for incorporating the uncertainty in gradient estimation (also quantified by the DBQPG method) to obtain reliable PG estimates.
UAPG lowers the risk of catastrophic performance degradation from stochastic policy updates, and empirically performs at least as well as, if not better than, the DBQPG method. Hence, we believe that the UAPG method is particularly relevant to reinforcement learning applications with safety considerations, such as robotics.

## References

Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2019. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. arXiv preprint arXiv:1908.00261.

Azizzadenesheli, K.; and Anandkumar, A. 2018. Efficient Exploration through Bayesian Deep Q-Networks. arXiv preprint arXiv:1802.04412.

Baxter, J.; and Bartlett, P. L. 2000. Direct Gradient-Based Reinforcement Learning. In 2000 IEEE International Symposium on Circuits and Systems (ISCAS), volume 3, 271-274. IEEE.

Briol, F.-X.; Oates, C.; Girolami, M.; and Osborne, M. A. 2015. Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Advances in Neural Information Processing Systems, 1162-1170.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.

Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking Deep Reinforcement Learning for Continuous Control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 1329-1338.

Engel, Y.; Mannor, S.; and Meir, R. 2003. Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), 154-161.

Engel, Y.; Mannor, S.; and Meir, R. 2005. Reinforcement Learning with Gaussian Processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 201-208.

Gardner, J. R.; Pleiss, G.; Bindel, D.; Weinberger, K. Q.; and Wilson, A. G. 2018. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. In Advances in Neural Information Processing Systems, 7587-7597.

Ghavamzadeh, M.; and Engel, Y. 2006. Bayesian Policy Gradient Algorithms. In Advances in Neural Information Processing Systems, 457-464.

Ghavamzadeh, M.; and Engel, Y. 2007. Bayesian Actor-Critic Algorithms. In Proceedings of the 24th International Conference on Machine Learning (ICML), 297-304.

Ghavamzadeh, M.; Engel, Y.; and Valko, M. 2016. Bayesian Policy Gradient and Actor-Critic Algorithms. Journal of Machine Learning Research 17(66): 1-53.

Halko, N.; Martinsson, P.-G.; and Tropp, J. A. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review 53(2): 217-288.

Hennig, P.; Osborne, M. A.; and Girolami, M. 2015. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A 471(2179).
Ilyas, A.; Engstrom, L.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; and Madry, A. 2018. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? arXiv preprint arXiv:1811.02553.

Kakade, S. 2001. A Natural Policy Gradient. In Advances in Neural Information Processing Systems, 1531-1538.

Kakade, S.; and Langford, J. 2002. Approximately Optimal Approximate Reinforcement Learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), 267-274.

Kanagawa, M.; Sriperumbudur, B. K.; and Fukumizu, K. 2016. Convergence Guarantees for Kernel-Based Quadrature Rules in Misspecified Settings. In Advances in Neural Information Processing Systems, 3288-3296.

Kanagawa, M.; Sriperumbudur, B. K.; and Fukumizu, K. 2020. Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings. Foundations of Computational Mathematics 20(1): 155-194.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations (ICLR).

Konda, V. R.; and Tsitsiklis, J. N. 2000. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12, 1008-1014.

Konda, V. R.; and Tsitsiklis, J. N. 2003. On Actor-Critic Algorithms. SIAM Journal on Control and Optimization 42(4): 1143-1166.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971.

Metropolis, N.; and Ulam, S. 1949. The Monte Carlo Method. Journal of the American Statistical Association 44(247): 335-341.

Micchelli, C. A.; Xu, Y.; and Zhang, H. 2006. Universal Kernels. Journal of Machine Learning Research 7: 2651-2667.

O'Hagan, A. 1991. Bayes-Hermite Quadrature. Journal of Statistical Planning and Inference 29(3): 245-260.

Peters, J.; and Schaal, S. 2008. Reinforcement Learning of Motor Skills with Policy Gradients. Neural Networks 21(4): 682-697.

Rasmussen, C. E.; and Williams, C. K. I. 2005. Gaussian Processes for Machine Learning. The MIT Press.

Reisinger, J.; Stone, P.; and Miikkulainen, R. 2008. Online Kernel Selection for Bayesian Reinforcement Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML), 816-823.

Rubinstein, R. Y. 1969. Some Problems in Monte Carlo Optimization. Ph.D. thesis.

Saatci, Y. 2012. Scalable Inference for Structured Gaussian Process Models. University of Cambridge.

Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2016. High-Dimensional Continuous Control Using Generalized Advantage Estimation. In Proceedings of the International Conference on Learning Representations (ICLR).

Silverman, B. W. 1985. Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting. Journal of the Royal Statistical Society, Series B 47(1): 1-52.

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, 1057-1063.

Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A Physics Engine for Model-Based Control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026-5033. IEEE.

Townsend, J. 2017. A New Trick for Calculating Jacobian Vector Products. https://j-towns.github.io/2017/06/12/A-new-trick.html.

Turner, R. E. 2010. Statistical Models for Natural Sounds. Ph.D. thesis, Gatsby Computational Neuroscience Unit, UCL.

Williams, R. J. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning 8(3-4): 229-256.

Wilson, A. G.; Hu, Z.; Salakhutdinov, R.; and Xing, E. P. 2016. Deep Kernel Learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 370-378.

Wilson, A. G.; and Nickisch, H. 2015. Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning (ICML), 1775-1784.

Woodbury, M. A. 1950. Inverting Modified Matrices. Memorandum Report 42(106): 336.

Zhao, T.; Hachiya, H.; Niu, G.; and Sugiyama, M. 2011. Analysis and Improvement of Policy Gradient Estimation. In Advances in Neural Information Processing Systems, volume 24.