# Augmented Bayesian Policy Search

Published as a conference paper at ICLR 2024

Mahdi Kallel¹, Debabrota Basu², Riad Akrour², Carlo D'Eramo¹,³,⁴
¹Center for Artificial Intelligence and Data Science, University of Würzburg, Germany
²Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, France
³Department of Computer Science, TU Darmstadt, Germany
⁴Hessian Center for Artificial Intelligence (Hessian.ai), Germany

ABSTRACT

Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies. First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem, and thus neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS). Interestingly, this new mean function enhances the posterior gradient with the deterministic policy gradient, effectively bridging the gap between BO and policy gradient methods. The resulting algorithm combines the convenience of direct policy search with the scalability of reinforcement learning. We validate ABS on high-dimensional locomotion problems and demonstrate competitive performance compared to existing direct policy search schemes.

1 INTRODUCTION

The majority of the policy gradient literature in Reinforcement Learning (RL), from traditional methods (Sutton & Barto, 2018) to contemporary actor-critic strategies (Schulman et al., 2017; 2015; Haarnoja et al., 2018), employs stochastic policies for experience gathering. This is typically achieved either by modeling policies as probability distributions or by injecting noise into the actions of deterministic policies (Lillicrap et al., 2015; Fujimoto et al., 2018). Having an algorithm that explores effectively using only deterministic policies is preferable when these policies are deployed on physical systems such as robots. Indeed, deterministic policies can prevent the erratic and potentially damaging behavior of stochastic policies while being easier to implement and interpret.

To this end, Bayesian Optimization (BO) methods (Garnett, 2023) perform the search directly in the parameter space. BO has emerged as a powerful tool for the global optimization of black-box functions, demonstrating its effectiveness across diverse landscapes and practical applications, such as parameter tuning for machine learning algorithms (Turner et al., 2021; Cowen-Rivers et al., 2022), robotics (Calandra et al., 2016; Muratore et al., 2021), and RL (Bruno et al., 2013). The strength of BO lies in its two core components: (i) a probabilistic model of the objective function, which usually takes the form of a Gaussian Process (GP) prior, and (ii) a sampling procedure that exploits this model to identify informative samples. However, BO methods often struggle as the task dimensionality increases, since they require a prohibitive number of samples to build a global model.
Local Bayesian Optimization provides an intriguing solution to this challenge (Akrour et al., 2017; Eriksson et al., 2019; Wang et al., 2020; Müller et al., 2021; Nguyen et al., 2022). By focusing on specific regions within the search space, local BO improves the handling of high-dimensional spaces by promoting more targeted exploration and exploitation, thus reducing the number of evaluations needed to pinpoint optimal solutions. Consequently, there has been a recent upswing in efforts to scale BO to high-dimensional problems through the design of local schemes. However, when applied to high-dimensional RL problems, these schemes fall short, as they treat the policy search problem as a black-box problem and thus use only the information of the policy return. By doing so, they overlook the sequential nature of MDPs and discard potentially useful experience.

Correspondence to mahdi.kallel@uni-wuerzburg.de.

In this paper, we introduce a principled solution to this problem by building a novel RL-aware mean function to enhance local BO methods. We leverage the performance difference lemma to inject an action-value function into the GP prior of the objective function, thus effectively incorporating knowledge of past trajectories into our belief about the return of untested policies. We provide further theoretical grounding for our approach by deriving a new bound on the impact of altering deterministic policies on expected returns for Lipschitz MDPs. Then, we show that the posterior gradient yielded by our new mean function corresponds to the deterministic policy gradient, thus bridging the gap between BO approaches and policy gradient methods.

The primary contribution of this work is a novel mean function that enhances GPs with the action-value function. Additionally, we propose a fitness-aware adaptive scheme for aggregating multiple Q-function approximators. Integrating these components into the Maximum Probability of Descent (MPD) framework (Nguyen et al., 2022) leads to the Augmented Bayesian Search (ABS) algorithm. ABS effectively unifies policy gradient and BO methods, capitalizing on the scalability and sample-efficiency of RL, while also leveraging the principled exploration and practicality offered by BO methods. We provide empirical evidence of the effectiveness of our novel mean function, demonstrating that ABS surpasses previous BO schemes on high-dimensional MuJoCo locomotion problems (Todorov et al., 2012).

2 PRELIMINARIES

2.1 REINFORCEMENT LEARNING

We consider Markov Decision Processes (MDPs) (Sutton & Barto, 2018) with a continuous bounded state space $\mathcal{S} \subset \mathbb{R}^m$, a continuous action space $\mathcal{A} \subset \mathbb{R}^d$, stationary transition dynamics with conditional density $p(s_{t+1} \mid s_t, a_t)$ satisfying the Markov property, an initial state distribution $\iota$, a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and a discount factor $\gamma$. We denote by $\pi : \mathcal{S} \to \mathcal{A}$ a deterministic policy mapping states to actions. Throughout this manuscript, we use only deterministic policies. At each discrete time step $t$, from a given state $s_t \in \mathcal{S}$, the agent takes an action $a_t = \pi(s_t)$, receives a reward $r(s_t, a_t)$, and observes the new state of the environment $s'$ according to the dynamics $p(\cdot \mid s_t, a_t)$. We denote by $P(s \to s', t, \pi)$ the probability of being at state $s'$ after $t$ transitions following policy $\pi$ starting from $s$. We also denote by $\rho^{\pi}_{s}(s') \triangleq \sum_{t=0}^{\infty} \gamma^{t} P(s \to s', t, \pi)$ the improper discounted state visitation density of $s'$ starting from $s$.
By integrating over the initial state distribution $\iota$, we can deduce the improper discounted visitation measure $d^{\pi}(s') \triangleq \int_{\mathcal{S}} \rho^{\pi}_{s}(s') \, \iota(s) \, ds$. The action-value function $Q^{\pi}$ describes the expected return after taking action $a$ in state $s$ and thereafter following policy $\pi$, i.e., $Q^{\pi}(s, a) \triangleq \mathbb{E}_{s' \sim \rho^{\pi}_{s}, a' = \pi(s')} [r(s', a') \mid a_0 = a]$. Given the action-value function, we can derive the advantage function as $A^{\pi}(s, a) \triangleq Q^{\pi}(s, a) - Q^{\pi}(s, \pi(s))$. The goal of RL is to optimize a policy $\pi_\theta$, parameterized by $\theta$, with the goal of maximizing the expected discounted policy return $J(\pi_\theta) \triangleq \mathbb{E}_{s \sim d^{\pi_\theta}, a = \pi_\theta(s)} [r(s, a)] = \mathbb{E}_{s \sim \iota} [Q^{\pi_\theta}(s, \pi_\theta(s))]$. To this end, a popular class of RL methods, known as actor-critic, leverages an additional parametric approximation $Q^{\pi_\theta}_{\phi}$ of the action-value function (Schulman et al., 2015; Haarnoja et al., 2018). When restricting actor-critic methods to deterministic policies, a policy $\pi_\theta$ can be updated using the deterministic policy gradient, which changes its actions to maximize $Q^{\pi_\theta}$ (Silver et al., 2014):

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}} \left[ \nabla_a Q^{\pi_\theta}(s, a) \big|_{a = \pi_\theta(s)} \nabla_\theta \pi_\theta(s) \right]. \quad (1)$$

2.2 INFORMATION MAXIMIZING BAYESIAN OPTIMIZATION

Bayesian Optimization (BO) is a sequential method for the global optimization of black-box functions when sample-efficiency is paramount. It builds a probabilistic model of the objective function, often a Gaussian Process (GP), which is used by an acquisition function that guides the exploration of the parameter space. However, BO struggles with high-dimensional spaces, as it requires a prohibitive amount of data to build a global model of the objective function. To address these issues, local BO methods have been developed (Nguyen et al., 2022; Müller et al., 2021; Eriksson et al., 2019; Fröhlich et al., 2021). These methods constrain the search to a subspace of interest, circumventing the difficulty of modeling and finding promising candidates in high-dimensional problems.

A recent development in local BO methods is the introduction of Information Maximizing BO. First introduced by Müller et al. (2021), and subsequently refined by Nguyen et al. (2022), these methods rely on an acquisition function that seeks to maximize a local measure of information. By considering a GP belief about the objective function $J \sim \mathcal{GP}(m(x), K(x, x'))$ with a differentiable mean function $m$ and a twice-differentiable covariance function $K$, we have that the joint distribution between the GP and its derivative is still a GP (Rasmussen, 2003). Hence, by conditioning on a dataset of observations $(X, Y)$, the posterior distribution of the derivative at a point of interest $\theta$ takes the form of a Gaussian distribution $p(\nabla J(\theta) \mid \theta, X, Y) = \mathcal{N}(\mu_\theta, \Sigma_\theta)$, where

$$\mu_\theta = \nabla_\theta m(\theta) + \nabla_\theta K(\theta, X) K(X, X)^{-1} \big( Y - m(X) \big), \quad (2)$$

$$\Sigma_\theta = \nabla^2_\theta K(\theta, \theta) - \nabla_\theta K(\theta, X) K(X, X)^{-1} \nabla_\theta K(X, \theta). \quad (3)$$

In Nguyen et al. (2022), the measure of information is taken to be the probability of descent at a central point $\theta$. Subsequently, an acquisition function $\alpha(z \mid \theta, X, Y)$ is developed to guide the exploration towards an acquisition point $z$, such that adding $(z, y_z)$ to our observation dataset maximizes the probability of descent at $\theta$. The resulting BO algorithm, named MPD (Nguyen et al., 2022), alternates between two steps. Starting from a central point $\theta$, it uses the acquisition function $\alpha$ to sequentially query multiple acquisition points $z$ and observe their values $y_z$ until obtaining satisfying information about $\theta$. Then, MPD uses the better-estimated gradient $\mu_\theta$ to move along the most probable descent direction $\nu_\theta \propto \Sigma_\theta^{-1} \mu_\theta$ to a new point $\theta'$, and repeats this loop.
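To make Equations (2) and (3) concrete, the sketch below computes the gradient posterior and the most probable descent direction used by MPD on a toy objective. It is our own illustration, not the authors' implementation: it assumes an isotropic squared-exponential kernel, a zero prior mean, and NumPy rather than the JAX code released with the paper; all function names are ours.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=1.0):
    """Isotropic squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def grad_kernel(theta, X, sigma_f=1.0, ell=1.0):
    """Gradient of k(theta, x_i) w.r.t. theta, returned with shape (dim, n)."""
    k = se_kernel(theta[None, :], X, sigma_f, ell)[0]            # (n,)
    return (-(theta[None, :] - X) / ell**2 * k[:, None]).T       # (dim, n)

def gradient_posterior(theta, X, Y, sigma_f=1.0, ell=1.0, sigma_n=1e-2):
    """Posterior mean/covariance of grad J at theta (Eqs. 2-3 with a zero prior mean)."""
    K = se_kernel(X, X, sigma_f, ell) + sigma_n**2 * np.eye(len(X))
    dK = grad_kernel(theta, X, sigma_f, ell)                      # (dim, n)
    alpha = np.linalg.solve(K, Y)                                 # K^{-1}(Y - m(X)), m = 0
    mu = dK @ alpha
    # Cross second derivative of the SE kernel at (theta, theta) is (sigma_f/ell)^2 * I.
    hess = (sigma_f**2 / ell**2) * np.eye(len(theta))
    cov = hess - dK @ np.linalg.solve(K, dK.T)
    return mu, cov

# Toy usage: direction used by MPD to update the central point.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Y = -np.sum(X**2, axis=1)                  # noiseless toy objective
theta = np.full(5, 0.5)
mu, cov = gradient_posterior(theta, X, Y)
nu = np.linalg.solve(cov, mu)              # most probable descent direction, Sigma^{-1} mu
```

With a non-zero mean function $m$, the observations `Y` passed to the solver would be replaced by the residuals $Y - m(X)$ and the term $\nabla_\theta m(\theta)$ added to `mu`, exactly as Equation (2) prescribes; this is the hook that the advantage mean function of Section 3 exploits.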
3 REINFORCEMENT LEARNING-AWARE BAYESIAN OPTIMIZATION

The mean function of a GP determines the expected value at a given point. In interpolating regions, the posterior mean is largely influenced by observed data points due to significant correlation. Conversely, in extrapolating regions, the data's influence is minimal, causing the posterior mean to revert to the mean function. Despite its importance, this function has received little attention in the BO literature, especially in RL applications, where uninformative priors like constant functions are favored. In this section, we leverage the MDP properties of the problem to build a better mean function.

3.1 ON THE SMOOTHNESS OF DETERMINISTIC POLICY RETURNS

First, we develop a new bound on the impact of altering deterministic policies on expected returns in the particular case of Lipschitz MDPs (Pirotta et al., 2015). We assume that the implementation of a policy $\pi_\theta$ gives access to $J(\pi_\theta)$, $d^{\pi_\theta}$, and $Q^{\pi_\theta}$. Given an alternative policy $\pi_x$, our objective is to accurately estimate its return $J(\pi_x)$.¹ The performance difference lemma (Kakade & Langford, 2002) establishes the relation between the returns of deterministic policies and the advantage function:

$$J(\pi_x) - J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_x}} [A^{\pi_\theta}(s, \pi_x(s))] = \langle d^{\pi_x}(\cdot), A^{\pi_\theta}(\cdot, \pi_x(\cdot)) \rangle. \quad (4)$$

Direct application of this lemma to infer $J(\pi_x)$ is not feasible due to the need for $d^{\pi_x}$. However, by reordering terms, we can express the policy return as an estimable term plus an unknown residual:

$$J(\pi_x) = J(\pi_\theta) + \underbrace{\langle d^{\pi_\theta}(\cdot), A^{\pi_\theta}(\cdot, \pi_x(\cdot)) \rangle}_{\text{Estimate}} + \underbrace{\langle (d^{\pi_x} - d^{\pi_\theta})(\cdot), A^{\pi_\theta}(\cdot, \pi_x(\cdot)) \rangle}_{\text{Residual}}. \quad (5)$$

The following assumptions allow us to bound the residual term while using deterministic policies.

Assumption 3.1. An MDP is $(L_r, L_p)$-Lipschitz if for all $s, s' \in \mathcal{S}$ and $a, a' \in \mathcal{A}$:

$$|r(s, a) - r(s', a')| \leq L_r \big( d_{\mathcal{S}}(s, s') + d_{\mathcal{A}}(a, a') \big), \qquad W\big( p(\cdot|s, a), p(\cdot|s', a') \big) \leq L_p \big( d_{\mathcal{S}}(s, s') + d_{\mathcal{A}}(a, a') \big).$$

Assumption 3.2. A policy $\pi_x$ is $L_\pi$-Lipschitz if for all $s, s' \in \mathcal{S}$: $W(\pi_x(s), \pi_x(s')) \leq L_\pi \, d_{\mathcal{S}}(s, s')$.

Here, we denote by $d_{\mathcal{S}}$, $d_{\mathcal{A}}$ the distances in $\mathcal{S}$ and $\mathcal{A}$, respectively. In this work, we take these distances to be the Euclidean distance $\|\cdot\|$ and $W$ to be the Wasserstein-2 distance on the space of measures.

Theorem 3.3. For an $(L_r, L_p)$-Lipschitz MDP operating with deterministic $L_\pi$-Lipschitz policies, and $\gamma L_p(1 + L_\pi) < 1$, we bound the residual term for any policies $\pi_x$ and $\pi_\theta$ as

$$\big| \langle d^{\pi_x} - d^{\pi_\theta}, A^{\pi_\theta}(\cdot, \pi_x(\cdot)) \rangle \big| \leq C \sup_s \| \pi_x(s) - \pi_\theta(s) \|, \quad \text{where } C \triangleq \frac{2 \gamma L_\pi L_r (1 + L_\pi)}{(1 - \gamma L_p (1 + L_\pi))^2}. \quad (6)$$

The bound describes the worst-case value of the residual term. Theorem 3.3 extends the existing bounds on the residual term (Kakade & Langford, 2002; Schulman et al., 2015) to our setting, where both the policy and the dynamics are deterministic. Similarly, we extend this bound to the case of linear policies and bounded state spaces as

$$\big| \langle d^{\pi_x} - d^{\pi_\theta}, A^{\pi_\theta}(\cdot, \pi_x(\cdot)) \rangle \big| \leq C \sup_s \|s\| \, \|x - \theta\|. \quad (7)$$

¹The findings of this section are relevant to both parametric and non-parametric policies. Parametric notations are used for consistency throughout the paper.

3.2 ADVANTAGE MEAN FUNCTION

Now, we construct a GP that models the policy return $J$ over the space of parameters of deterministic linear policies. To solve the policy search problem, we can query an oracle to obtain a noisy evaluation $\hat{J}(\pi_x) = J(\pi_x) + \omega$ of any point $x$, an empirical estimate $\hat{d}^{\pi_x} \approx d^{\pi_x}$ of its discounted state distribution, and the transitions of the corresponding trajectory. In order to construct such a model, we first roll out a central point $\theta$ and obtain estimates $\hat{J}(\pi_\theta)$ and $\hat{d}^{\pi_\theta}$.
Then, we train a deep neural network parameterized by $\phi$ to estimate the action-value function, $\hat{Q}^{\pi_\theta}_{\phi} \approx Q^{\pi_\theta}$. For any policy $\pi_x$ parameterized by $x$, if $\sup_s \|s\| \, \|x - \theta\|$ is small enough, Theorem 3.3 and Equation (7) indicate that a good prior on its return $J(\pi_x)$ is an estimate of the first terms in Equation (5), i.e.,

$$\hat{m}_\phi(x) \triangleq \hat{J}(\pi_\theta) + \mathbb{E}_{s \sim \hat{d}^{\pi_\theta}} \left[ \hat{A}^{\pi_\theta}_\phi(s, \pi_x(s)) \right]. \quad (8)$$

Therefore, we advocate using $\hat{m}_\phi$ as the mean function of our GP, which we call the Advantage Mean Function. We emphasize that the advantage mean function $\hat{m}_\phi(x)$ for the policy return $J(\pi_x)$ depends on the current central point $\theta$. By using $\hat{m}_\phi$, we leave the GP to model the residual term of Equation (5), which can be viewed as a second-order term (Saleh et al., 2022), and the errors due to empirical estimates and approximations. This is in contrast to the typical constant mean function, which forces the GP to fit all the variations of the objective function. By plugging the advantage mean function into the posterior of the derivative $\mu_\theta$ (Equation (2)), we unravel an interesting property.

Corollary 3.3.1. Given the mean function $\hat{m}_\phi(\cdot)$, the mean of the gradient posterior at $\theta$ is

$$\mu_\theta = \mathbb{E}_{s \sim \hat{d}^{\pi_\theta}} \left[ \nabla_a \hat{Q}^{\pi_\theta}_\phi(s, a) \big|_{a = \pi_\theta(s)} \nabla_\theta \pi_\theta(s) \right] + \nabla_\theta K(\theta, X) K(X, X)^{-1} \big( \hat{J}(X) - \hat{m}_\phi(X) \big), \quad (9)$$

where $X \triangleq \{z_1, z_2, \dots\} \cup \{\theta_1, \theta_2, \dots\}$ denotes the set of policy parameters of the past observations, $\hat{J}(X) = \{\hat{J}(\pi_{x_1}), \hat{J}(\pi_{x_2}), \dots\}$ denotes the set of their noisy policy evaluations, and $\hat{m}_\phi(X) = \{\hat{m}_\phi(x_1), \hat{m}_\phi(x_2), \dots\}$ denotes the mean function estimates of their policy returns.

The mean of the gradient posterior, $\mu_\theta$, takes the form of the Deterministic Policy Gradient (Silver et al., 2014), corrected by a factor that is proportional to the alignment of the advantage mean function with the observed data points. Given that our objective function is the policy return, the mean of the gradient posterior aims for a closer correspondence with the actual deterministic policy gradient than the estimate obtained by using only the estimator $\hat{Q}^{\pi_\theta}_\phi$. In Figure 1, we demonstrate the behavior of the acquisition function of MPD (Nguyen et al., 2022). We observe its tendency to keep the acquisition points $z$ in the immediate vicinity of the central point $\theta$, effectively controlling $\|z - \theta\|$ and therefore the residual term of Equation (5). In conclusion, the proposed mean function $\hat{m}_\phi$ integrates both zero-order (return) and first-order (gradient) observations into the BO framework, as hypothesized in Müller et al. (2021), thereby seamlessly bridging the gap between policy gradient and Bayesian Optimization methods.

Figure 1: Behavior of the acquisition function of MPD (Nguyen et al., 2022) augmented with the advantage mean function (Equation (8)), shown at (a) t = 0: observed only θ (star); (b) t = 1: observed θ, z1 (white); (c) t = 2: observed θ, z1, z2 (white). The maximum of the acquisition function (blue dot) lies in the direction of the mean of the gradient posterior at θ (star), i.e., µθ (violet line). The posterior corrects the mean gradient ∇θ m̂ϕ (pink line) when the mean function m̂ϕ does not fit the observations.

4 ENHANCING Q-FUNCTION ESTIMATORS: EVALUATION & AGGREGATION

The mean function, depicted in Figure 1, steers the acquisition function in its search for points that maximize the ascent probability at the central point $\theta$. Equation (2) indicates that the same function influences the computation of the descent direction.
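Before turning to the quality of the Q-function approximator, here is a minimal sketch of how the advantage mean function of Equation (8) could be evaluated for linear policies. It is our own illustration under stated assumptions, not the authors' code: the quadratic critic and all names are placeholders, and the expectation over the discounted visitation is replaced by a plain average over sampled states.

```python
import numpy as np

def advantage_mean(x, theta, J_theta_hat, states, q_fn):
    """Advantage mean function m_hat(x) of Eq. (8) for linear policies pi_x(s) = x @ s.

    x, theta     : (action_dim, state_dim) policy parameter matrices.
    J_theta_hat  : float, noisy return estimate of the central policy pi_theta.
    states       : (n, state_dim) states sampled from the rollout of pi_theta.
    q_fn         : callable (states, actions) -> (n,) critic estimate of Q^{pi_theta}.
    """
    a_x = states @ x.T              # actions proposed by the candidate policy
    a_theta = states @ theta.T      # actions of the central policy
    advantage = q_fn(states, a_x) - q_fn(states, a_theta)
    return J_theta_hat + advantage.mean()

# Toy usage with a made-up quadratic critic (an assumption for illustration only).
rng = np.random.default_rng(1)
state_dim, action_dim = 8, 2
states = rng.normal(size=(256, state_dim))
theta = rng.normal(size=(action_dim, state_dim)) * 0.1
x = theta + rng.normal(size=theta.shape) * 0.01     # nearby candidate policy
q_fn = lambda s, a: -(a**2).sum(-1) - 0.1 * (s**2).sum(-1)
prior_return = advantage_mean(x, theta, J_theta_hat=-3.0, states=states, q_fn=q_fn)
```

In a full implementation, this value would be plugged in as the GP mean of each observation point, so that the GP only has to model the residual term of Equation (5).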
Consequently, the quality of the approximator $\hat{Q}^{\pi_\theta}_\phi$, embedded in the advantage mean function, is crucial for both exploration and exploitation. Therefore, it is desirable for $\hat{Q}^{\pi_\theta}_\phi$ to generalize beyond observed trajectories. In this section, we propose a criterion for estimating the quality of such approximators. Utilizing this criterion, we further develop an adaptive strategy for aggregating the predictions of an ensemble of approximators.

4.1 EVALUATING Q-FUNCTION ESTIMATORS

In the existing literature (Fujimoto et al., 2018; Van Hasselt et al., 2016), the quality of a parametric approximator $\hat{Q}^{\pi_\theta}_\phi$ is assessed by comparing its predictions to those of an empirical estimate $\hat{Q}^{\pi_\theta}$ generated by rolling out the policy $\pi_\theta$. The weakness of this scheme is that it only uses the trajectories collected by $\pi_\theta$, which cover only a subset of the state-action space. To tackle this limitation, we leverage the performance difference lemma (Equation (4)). This property indicates that, given an action-value function $Q^{\pi_\theta}$ and the corresponding policy return $J(\pi_\theta)$, we can perfectly recover the policy return $J(\pi_x)$ of any actor $\pi_x$ if $d^{\pi_x}$ is also known. In our setting, we roll out the policy corresponding to the central point $\theta$ to obtain an empirical estimate $\hat{J}(\pi_\theta)$ and an approximator $\hat{Q}^{\pi_\theta}_\phi$. Similarly, we roll out the policy corresponding to any point $x$ to obtain the sample estimates $\hat{J}(\pi_x)$ and $\hat{d}^{\pi_x}$. Thanks to the performance difference lemma (Equation (4)), we can use the error induced by $\hat{Q}^{\pi_\theta}_\phi$ as a measure of its quality:

$$\left( \hat{J}(\pi_\theta) + \mathbb{E}_{s \sim \hat{d}^{\pi_x}} \left[ \hat{A}^{\pi_\theta}_\phi(s, \pi_x(s)) \right] - \hat{J}(\pi_x) \right)^2 \triangleq \big( \tilde{m}_\phi(x) - \hat{J}(\pi_x) \big)^2. \quad (10)$$

We emphasize that $\hat{m}_\phi$ relies on $\hat{d}^{\pi_\theta}$, while $\tilde{m}_\phi$ leverages $\hat{d}^{\pi_x}$ to estimate the policy return $J(\pi_x)$. This method evaluates $\hat{Q}^{\pi_\theta}_\phi$ on trajectories from policies other than $\pi_\theta$, allowing all available trajectory data to be used for the approximator's evaluation. This provides a more comprehensive validation criterion, addressing the limitations of the existing assessment scheme.

4.2 ADAPTIVE AGGREGATION OF Q-FUNCTION ESTIMATORS

In our setting, we have access to the dataset $D = \{X, \hat{J}(X), \hat{d}^{\pi}(X)\}$ collected during the previous iterations of the algorithm. It consists of the observation points $X$, their sample policy returns $\hat{J}(X)$, and their empirical discounted state occupation measures $\hat{d}^{\pi}(X) \triangleq \{\hat{d}^{\pi_{x_1}}, \hat{d}^{\pi_{x_2}}, \dots\}$. Assuming a good approximator $\hat{Q}^{\pi_\theta}_\phi$, Equation (10) suggests that $\tilde{m}_\phi$ should serve as an effective predictor of $J(\pi_x)$.

Algorithm 1: Rollout for one point
Require: Policy parameters x, observation dataset D = {}, replay buffer B = {}
1: if x is central then
2:   Reset the optimizers of all critics
3:   Reset the weights of the least performing critic
4: end if
5: Collect $\hat{J}(\pi_x) = J(\pi_x) + \epsilon$, $\hat{d}^{\pi_x}$, and $\hat{T}(\pi_x) \triangleq \{s_t, a_t, r_t, s_{t+1}\}_{t=1,\dots,H}$
6: Update the training data $D \leftarrow D \cup (x, \hat{J}(\pi_x), \hat{d}^{\pi_x})$ and the replay buffer $B \leftarrow B \cup \hat{T}(\pi_x)$
7: Update the parameters of the critics $\hat{Q}^{\pi_\theta}_{\phi = \{\phi_1, \dots, \phi_n\}}$ on B
8: Compute the weights of every critic according to Equation (12) using D
9: Fit the GP parameters on D and $\hat{m}_\Phi$

Therefore, to evaluate the quality of $\hat{Q}^{\pi_\theta}_\phi$, we use the coefficient of determination of $\tilde{m}_\phi$ on the dataset $D$. This metric measures the percentage of variance in the data explained by our predictor:

$$\tilde{R}^2(\phi \mid \theta, D) \triangleq 1 - \frac{\sum_{x \in X} \big( \tilde{m}_\phi(x) - \hat{J}(\pi_x) \big)^2}{\sum_{x \in X} \big( \bar{J}(X) - \hat{J}(\pi_x) \big)^2}, \quad \text{where } \bar{J}(X) \triangleq \frac{1}{|X|} \sum_{x \in X} \hat{J}(\pi_x). \quad (11)$$

Estimating $Q^{\pi_\theta}(s, a)$ requires rolling out trajectories $\tau(s, a; \pi_\theta) \triangleq \{s, a, s', \pi_\theta(s'), s'', \pi_\theta(s''), \dots\}$.
These trajectories play the role of the training data used to learn a good approximation $\hat{Q}^{\pi_\theta}_\phi(s, a)$. Hence, we refer to $\tilde{R}^2(\phi \mid \theta, D)$ as the validation score, since the trajectories $\tau(s, \pi_x(s); \pi_\theta)$ for $s \sim \hat{d}^{\pi_x}$ used by $\tilde{m}_\phi$ are not present in our training dataset. Similarly, we can define $\hat{R}^2(\phi \mid \theta, D)$ for the predictor $\hat{m}_\phi$. The trajectories $\tau(s, \pi_x(s); \pi_\theta)$ for $s \sim \hat{d}^{\pi_\theta}$ used by $\hat{m}_\phi$ are different from the validation trajectories. Since $\hat{m}_\phi$ is used during inference, we refer to $\hat{R}^2(\phi \mid \theta, D)$ as the test score. We note that, while a perfect $Q^{\pi_\theta}$ estimator can have a perfect validation score, it might not achieve a perfect test score due to the residual term in Equation (5).

In our context, we deploy an ensemble of critics $\hat{Q}^{\pi_\theta}_{\{\phi_1, \dots, \phi_n\}}$. We adopt the Follow The Regularised Leader (FTRL) algorithm (Cesa-Bianchi & Lugosi, 2006) to construct an aggregated critic, $\hat{Q}^{\pi_\theta}_\Phi$, via softmax weighting of the predictions of each critic using its respective $\tilde{R}^2$ score:

$$\hat{Q}^{\pi_\theta}_\Phi(s, a) = \sum_{i=1}^{n} w_i \, \hat{Q}^{\pi_\theta}_{\phi_i}(s, a), \quad \text{where } w_i \triangleq \frac{\exp\big( \tilde{R}^2(\phi_i \mid \theta, D) \big)}{\sum_{j=1}^{n} \exp\big( \tilde{R}^2(\phi_j \mid \theta, D) \big)}. \quad (12)$$

We denote by $\hat{m}_\Phi$ the advantage mean function that uses the aggregate estimate $\hat{Q}^{\pi_\theta}_\Phi$. Additionally, we reset the optimizer of every critic when changing the central point $\theta$, because we are learning the Q-function of a new policy. We also reset the weights of the worst-performing critic in order to improve the overall performance by avoiding the primacy bias (Nikishin et al., 2022). This resetting also helps decorrelate the predictions of the critics, and thus reduces the variance of the predictions. We illustrate this step in Algorithm 1 and describe the full BO scheme in Algorithm 2.²

Algorithm 2: Augmented Bayesian Search (ABS)
Require: central point θ, number of iterations N, number of acquisition points M, stepsize δ, observation dataset D = {}, replay buffer B = {}
1: for i = 0, . . . , N do
2:   Rollout the central point θ using Algorithm 1
3:   for j = 0, . . . , M do
4:     Query for an acquisition point $z = \arg\max_z \alpha(z \mid \theta, D)$
5:     Rollout the acquisition point z using Algorithm 1
6:   end for
7:   Move in the direction of the most probable descent: $\theta \leftarrow \theta + \delta \, \Sigma_\theta^{-1} \mu_\theta$
8: end for

²The main BO loop of ABS can be seen as a simplified version of that of MPD, the only difference being that MPD performs multiple descent steps and uses a constant mean, whereas our method uses the advantage mean $\hat{m}_\phi$, which depends on the current central point θ, limiting us to one descent step.

5 EXPERIMENTAL ANALYSIS

In this experimental analysis, we aim to answer the following questions: (I) Is our advantage mean function a good prior on the policy return? (II) Does our advantage mean function improve the efficiency of BO methods? (III) What role does our adaptive aggregation play?

5.1 IMPLEMENTATION DETAILS

For our Q-function estimator, we leverage the state-of-the-art ensemble method of DroQ networks (Hiraoka et al., 2021) with default parameters. For our ensemble of critics, we use 5 distinct DroQ networks. After querying for an acquisition point z, we perform 5000 gradient steps on the Q-function. For our covariance function, we choose a squared exponential kernel

$$SE(x, x') = \sigma_f^2 \exp\left( - \sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{2 \sigma_i^2} \right) + \sigma_n^2 \, \delta_{x = x'},$$

where $\sigma_f$, $\sigma_n$, and $\sigma_{i \in \{1,\dots,d\}}$ are the signal variance, the observation noise, and the lengthscale of dimension $i$, respectively. We model the policy return as a GP over the parameters of deterministic linear policies, $J(x) \sim \mathcal{GP}(\hat{m}_\Phi(x), SE(x, x'))$.
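As an illustration of this covariance function, the following sketch (ours, not the paper's JAX implementation; the dimension count and the handling of the noise term are assumptions) implements the ARD squared-exponential kernel over flattened linear-policy parameters:

```python
import numpy as np

def se_ard(X1, X2, sigma_f, lengthscales, sigma_n=0.0):
    """ARD squared-exponential kernel, with the observation-noise term added on the diagonal.

    X1: (n, d), X2: (m, d), lengthscales: (d,) one lengthscale per parameter dimension.
    """
    diff = X1[:, None, :] - X2[None, :, :]                  # (n, m, d)
    sq = (diff / lengthscales) ** 2
    K = sigma_f**2 * np.exp(-0.5 * sq.sum(-1))
    if X1 is X2:                                            # noise only enters k(x, x)
        K = K + sigma_n**2 * np.eye(len(X1))
    return K

# Usage on linear-policy parameters flattened to vectors (102 parameters, as for Halfcheetah).
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 102))
K = se_ard(X, X, sigma_f=1.0, lengthscales=np.full(102, 0.01), sigma_n=0.05)
```

One lengthscale per parameter dimension is what makes the lengthscale hyperprior in Appendix A the only kernel hyperparameter that still needs to be chosen by hand.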
BO requires prior knowledge about the task in order to fix the hyperpriors of parameters such as the signal variance $\sigma_f$ and the observation noise $\sigma_n$, adding more hyperparameters to the problem. For our implementation of MPD and ABS, we rely on a heuristic to derive dynamic hyperpriors for $\sigma_f$ and $\sigma_n$ from the observation data. This leaves only the lengthscale hyperprior to set, alleviating the burden of tuning these parameters. We refer the reader to Appendix A for further details.

We validate our contributions on MuJoCo locomotion tasks (Todorov et al., 2012). We use a discount factor γ = 0.99 for all tasks, apart from Swimmer where it is γ = 0.995. For our experiments, we use the state normalization scheme employed in Mania et al. (2018); Nguyen et al. (2022), where the states are standardized using online estimates of their mean and variance. We implemented ABS using Python 3.10 and JAX 0.4.20, and we run all of our experiments on a cluster of NVIDIA A100 40 GB GPUs. Our code is available at https://github.com/kallel-mahdi/abs.

5.2 RESULTS AND ANALYSIS

Goodness of Fit of the Advantage Mean Function. In Figure 2, we show the evolution of the validation metric $\tilde{R}^2$ of the best critic in the ensemble, computed over all the samples collected in the last 3 outer-loop steps of Algorithm 2. We superpose on it the test metric $\hat{R}^2$ of the ensemble $\hat{Q}^{\pi_\theta}_\Phi$ over the acquisition points sampled during the same steps. This is done for 3 high-dimensional MuJoCo tasks (Todorov et al., 2012). We observe that the validation metric $\tilde{R}^2$ remains consistently positive, with the validation predictor $\tilde{m}_\Phi$ explaining on average approximately 50% of the variance of the observed data. The test metric $\hat{R}^2$ also manages to be positive quite frequently, while being sometimes negative, meaning that our mean function $\hat{m}_\Phi$ sometimes fails to explain the variance of the policy return for the acquisition points. However, this is not a big concern, as we only need the mean function to identify one good direction of descent at a time, not to be a perfect predictor for every point. As with a validation and a test error, we observe a positive correlation between both metrics, and the validation error acts as an upper bound on its test counterpart, as in the supervised setting.

Figure 2: Evolution of the validation and test scores ($\tilde{R}^2$ and $\hat{R}^2$) on some of the MuJoCo tasks (Halfcheetah, 102 dims; Walker2d, 102 dims; Ant, 216 dims). We plot the results of a single seed to facilitate the interpretation of our results. We provide the histograms and correlations of these distributions in the Appendix.

Figure 3: Evolution of the maximum discounted policy return of ARS, MPD, and ABS on the MuJoCo-v4 tasks (Inverted Pendulum, 4 dims; Swimmer, 16 dims; Hopper, 33 dims; Halfcheetah, 102 dims; Walker2d, 102 dims; Ant, 216 dims). We use 5 random seeds for every algorithm. We report the undiscounted returns in Figure 5 in the Appendix.
Performance of ABS in terms of Policy Return. We evaluate ABS against two baselines: MPD, the current state of the art among local BO methods (Nguyen et al., 2022), and Augmented Random Search (ARS) (Mania et al., 2018), an evolutionary algorithm that estimates the gradient of the noisy objective using random perturbations of the policy. In Figure 3, we see that ABS is competitive with MPD and ARS on low-dimensional tasks. Notably, ABS consistently outperforms MPD, demonstrating that the advantage mean function is helpful for local BO schemes. The relative performance of ABS improves on high-dimensional tasks, such as Walker2d and Ant, where the curse of dimensionality makes it harder for ARS and MPD to identify promising descent directions, whereas our mean function can focus the search on more promising regions of the parameter space.

Effectiveness of the Aggregation Scheme. We perform an ablation study to confirm the effectiveness of our adaptive scheme for aggregating critics in Figure 4. First, we observe that an ensemble of critics outperforms a single critic. We also notice that our strategy of adaptive aggregation (Equation (12)) outperforms average ensembling. This is especially true in the initial phases, when transition data is scarce and it is easy to have a poorly performing critic. Resetting the least performing critic improves the performance of the ensemble, and its effect is more pronounced when it is combined with our adaptive aggregation method. We explain this performance by the capability of our validation metric to detect when the recently reset critic performs poorly and to ignore its predictions. To summarize, ABS, which can be viewed as MPD augmented with the advantage mean function and the adaptive aggregation scheme, consistently outperforms the original MPD, which uses the uninformative constant mean function. The advantage mean function can explain a fair portion of the policy return of the observation points and hence represents a good prior. Our validation metric and the consequent aggregation scheme are effective in dynamically selecting good critics.

Figure 4: Ablation study on Swimmer (16 dims), Halfcheetah (102 dims), and Ant (216 dims): effect of the adaptive aggregation on the performance of ABS (variants SINGLE, AVERAGE, SOFTMAX, AVERAGE + R, SOFTMAX + R). Combining adaptive aggregation and resetting the worst critic outperforms all baselines.

6 RELATED WORK

Several prior works applied a Bayesian view to reinforcement learning and policy gradient methods (Ghavamzadeh et al., 2016). However, in Bayesian Optimization, the majority of works treat reinforcement learning as a black-box problem (Martinez-Cantin et al., 2007; Englert & Toussaint, 2016; Martinez-Cantin, 2018; Eriksson et al., 2019; Nguyen et al., 2022) and are oblivious to the MDP properties. As an exception, Wilson et al. (2014) propose a GP mean function leveraging a dynamics-model estimate as a prior for the policy return and develop a kernel function measuring similarity between policies using trajectory data. In the black-box setting, Akrour et al. (2017) use a sampling-based approach where the parameters of a normal distribution are updated using information-theoretic constraints leading to optima. Fröhlich et al. (2021) introduce a confidence-region BO scheme to constrain the search space to points with less uncertainty. Eriksson et al. (2019) introduce a trust-region BO scheme that uses a collection of simultaneous local optimization runs with independent probabilistic models. Müller et al. (2021) propose the Gradient Information BO algorithm (GIBO), which utilizes a probabilistic model of the objective function and its gradient to maximize information about the gradient posterior. Finally, Nguyen et al. (2022) improve on the information criterion of GIBO by querying for points that maximize the probability of descent, and then moving in the direction of most probable descent. In contrast, our work exploits the sequential nature of the task, which provides large practical gains, especially on higher-dimensional tasks.

In the direct policy search literature, one alternative to BO is the use of Particle Swarms (Hein et al., 2016; Hein, 2019). Closer to our work are Evolution Strategies (ES). They are gradient-free methods deploying a random process to iteratively generate candidate solutions. Then, they evaluate these solutions and bias the search in the direction of the best-scoring ones (Hansen, 2006; Salimans et al., 2017; Mania et al., 2018). Salimans et al. (2017) propose a variant of ES for optimizing the policy parameters, estimating a gradient via Gaussian smoothing of the objective function. Mania et al. (2018) build on the work of Salimans et al. (2017) and introduce a simple random search algorithm that is competitive with the state of the art while using only linear policies. Compared to BO, ES uses a less principled exploration strategy and, in comparison to our work, does not include the structure of the MDP in its search.

The deterministic policy gradient (Silver et al., 2014) and its practical variants (Lillicrap et al., 2015; Fujimoto et al., 2018) are also closely related to the context of our work. These methods apply the deterministic actor-critic algorithm to learn deep neural network policies. Kakade & Langford (2002) and Schulman et al. (2015) develop algorithms for learning monotonically improving policies. This class of algorithms relies on lower-bounding the difference in return between the current and the potential next policy, and on taking a step such that the policy is guaranteed to improve. Saleh et al. (2022) are the first to derive such bounds for deterministic policies; we specialize these bounds further to Lipschitz MDPs and leverage this result to design our advantage mean function.

7 CONCLUSION AND DISCUSSION

In this work, we have presented Augmented Bayesian Search (ABS), a method that leverages a new mean function for the Gaussian Process (GP) prior that is tailored to Reinforcement Learning (RL). We derive our method by resorting to the performance difference lemma to inject an action-value function of deterministic policies into the GP prior on the policy return. Our advantage mean function acts like a first-order Taylor expansion of the policy return, bridging the gap between BO and RL methods. We theoretically ground our approach by deriving a new bound on the impact of altering deterministic policies on the expected returns for Lipschitz MDPs. Moreover, we propose a novel adaptive scheme for aggregating multiple Q-function estimators. Empirically, we show that ABS scales to high-dimensional problems and establishes state-of-the-art performance for Bayesian Optimization (BO) methods on MuJoCo locomotion problems.
ABS can explore using deterministic policies and learn effectively in an episodic setting. It also allows trading off compute for learning speed, which is a highly desirable property in real-world applications. For this work, we were limited to linear policies due to computational considerations. A promising future direction is scaling BO methods to deep policies. To this end, viable solutions include random projection methods (Ziomek & Ammar, 2023) and kernel functions for the GP that are tailored to deterministic policies, close to works like Wilson et al. (2014); Ghavamzadeh et al. (2016).

ACKNOWLEDGMENTS

The authors thank Gandharv Patil and Tom Dupuis for their helpful comments and discussions. This work was funded by the German Federal Ministry of Education and Research (BMBF) (Project: 01IS22078). This work was also funded by Hessian.ai through the project "The Third Wave of Artificial Intelligence - 3AI" by the Ministry for Science and Arts of the state of Hessen. D. Basu acknowledges the Inria-Kyoto University Associate Team "RELIANT" and the ANR JCJC for the REPUBLIC project (ANR-22-CE23-0003-01) for supporting this work. The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project b187cb. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) 440719683.

AUTHOR CONTRIBUTIONS

R. Akrour proposed the idea of the advantage mean function, which was explored as part of the internship of M. Kallel at Inria Scool under the supervision of R. Akrour and D. Basu. After his internship, M. Kallel started his Ph.D. at the University of Würzburg under the supervision of C. D'Eramo, where he kept working on the project and finalized it. All the authors contributed to the writing of the paper.

REFERENCES

Riad Akrour, Dmitry Sorokin, Jan Peters, and Gerhard Neumann. Local Bayesian optimization of motor skills. 34th International Conference on Machine Learning, ICML 2017, 1:59-68, 2017.

Danilo Bruno, Sylvain Calinon, and Darwin Caldwell. Bayesian nonparametric multi-optima policy search in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 27, pp. 1374-1380, 2013.

Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian optimization for learning gaits under uncertainty: An experimental comparison on a dynamic bipedal walker. Annals of Mathematics and Artificial Intelligence, 76:5-23, 2016.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Alexander I. Cowen-Rivers, Wenlong Lyu, Rasul Tutunov, Zhi Wang, Antoine Grosnit, Ryan Rhys Griffiths, Alexandre Max Maraval, Hao Jianye, Jun Wang, Jan Peters, et al. HEBO: Pushing the limits of sample-efficient hyper-parameter optimisation. Journal of Artificial Intelligence Research, 74:1269-1349, 2022.

Peter Englert and Marc Toussaint. Combined optimization and reinforcement learning for manipulation skills. In Robotics: Science and Systems, volume 2016, 2016.

David Eriksson, Michael Pearce, Jacob Gardner, Ryan D. Turner, and Matthias Poloczek. Scalable global optimization via local Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019.

Lukas P. Fröhlich, Melanie N. Zeilinger, and Edgar D. Klenske. Cautious Bayesian optimization for efficient and scalable policy search. In Learning for Dynamics and Control, pp. 227-240. PMLR, 2021.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587-1596. PMLR, 2018.

Roman Garnett. Bayesian Optimization. Cambridge University Press, 2023.

Mohammad Ghavamzadeh, Yaakov Engel, and Michal Valko. Bayesian policy gradient and actor-critic algorithms. The Journal of Machine Learning Research, 17(1):2319-2371, 2016.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861-1870. PMLR, 2018.

Nikolaus Hansen. The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, pp. 75-102, 2006.

Daniel Hein. Interpretable Reinforcement Learning Policies by Evolutionary Computation. PhD thesis, Technische Universität München, 2019.

Daniel Hein, Alexander Hentschel, Thomas A. Runkler, and Steffen Udluft. Reinforcement learning with particle swarm optimization policy (PSO-P) in continuous state and action spaces. International Journal of Swarm Intelligence Research (IJSIR), 7(3):23-42, 2016.

Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshimasa Tsuruoka. Dropout Q-functions for doubly efficient reinforcement learning. arXiv preprint arXiv:2110.02034, 2021.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning, 2002.

Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, 2015.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

Ruben Martinez-Cantin. Funneled Bayesian optimization for design, tuning and control of autonomous systems. IEEE Transactions on Cybernetics, 49(4):1489-1500, 2018.

Ruben Martinez-Cantin, Nando de Freitas, Arnaud Doucet, and José A. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems, volume 3, pp. 321-328, 2007.

Sarah Müller, Alexander von Rohr, and Sebastian Trimpe. Local policy search with Bayesian optimization. Advances in Neural Information Processing Systems, 34:20708-20720, 2021.

Fabio Muratore, Christian Eilers, Michael Gienger, and Jan Peters. Data-efficient domain randomization with Bayesian optimization. IEEE Robotics and Automation Letters, 6(2):911-918, 2021.

Quan Nguyen, Kaiwen Wu, Jacob Gardner, and Roman Garnett. Local Bayesian optimization via maximizing probability of descent. Advances in Neural Information Processing Systems, 35:13190-13202, 2022.

Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pp. 16828-16847. PMLR, 2022.

Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100, 2015. doi: 10.1007/s10994-015-5484-1.

Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63-71. Springer, 2003.
Ehsan Saleh, Saba Ghaffari, Tim Bretl, and Matthew West. Truly deterministic policy optimization. Advances in Neural Information Processing Systems, 35:8469-8482, 2022.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897. PMLR, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387-395. PMLR, 2014.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.

Ryan Turner, David Eriksson, Michael McCourt, Juha Kiili, Eero Laaksonen, Zhen Xu, and Isabelle Guyon. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. In NeurIPS 2020 Competition and Demonstration Track, pp. 3-26. PMLR, 2021.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Cédric Villani et al. Optimal Transport: Old and New, volume 338. Springer, 2009.

Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for black-box optimization using Monte-Carlo tree search. Advances in Neural Information Processing Systems, 33:19511-19522, 2020.

Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. Journal of Machine Learning Research, 15:253-282, 2014. ISSN 1533-7928.

Juliusz Krzysztof Ziomek and Haitham Bou Ammar. Are random decompositions all we need in high dimensional Bayesian optimisation? In International Conference on Machine Learning, pp. 43347-43368. PMLR, 2023.

A IMPLEMENTATION DETAILS

| Task name | Learning rate | Lengthscale prior |
|---|---|---|
| Inverted Pendulum | {0.0025, 0.005} | {U(0.00125, 0.025), U(0.0025, 0.05)} |
| Swimmer | {0.0025, 0.005} | {U(0.00125, 0.025), U(0.0025, 0.05)} |
| Hopper | {0.0025, 0.005} | {U(0.00125, 0.025), U(0.0025, 0.05)} |
| Halfcheetah | {0.00125, 0.0025} | {U(0.000625, 0.0125), U(0.00125, 0.025)} |
| Walker2d | {0.00125, 0.0025} | {U(0.000625, 0.0125), U(0.00125, 0.025)} |
| Ant | {0.00125, 0.0025} | {U(0.000625, 0.0125), U(0.00125, 0.025)} |

Table 1: Hyperparameters used for the grid search for MPD and ABS. (*) For MPD we use only a learning rate of 0.01.

| Task name | Learning rate | Lengthscale prior | Nc | Nacq | Nmax |
|---|---|---|---|---|---|
| Inverted Pendulum | 0.005 | U(0.0025, 0.05) | 2 | 6 | 21 |
| Swimmer | 0.0025 | U(0.0025, 0.05) | 3 | 12 | 39 |
| Hopper | 0.0025 | U(0.0025, 0.05) | 3 | 16 | 51 |
| Halfcheetah | 0.0025 | U(0.00125, 0.025) | 4 | 20 | 63 |
| Walker2d | 0.0025 | U(0.000625, 0.0125) | 4 | 20 | 63 |
| Ant | 0.0025 | U(0.000625, 0.0125) | 5 | 24 | 75 |

Table 2: Hyperparameters used for ABS.
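The dynamic hyperprior heuristic for σf and σn described in the text below can be sketched as follows. This is a minimal illustration under our own assumptions; the function name, variable names, and toy numbers are ours and do not come from the released code.

```python
import numpy as np

def dynamic_hyperpriors(central_returns, mean_fn_values, observed_returns):
    """Heuristic hyperprior bounds for the noise and signal standard deviations.

    central_returns : returns of the Nc repeated rollouts of the central policy pi_theta.
    mean_fn_values  : advantage-mean-function predictions m_hat(x) for the observed points.
    observed_returns: noisy returns J_hat(pi_x) of the same observed points.
    """
    sigma_n = np.std(central_returns)                        # noise level from repeated central rollouts
    residuals = np.asarray(mean_fn_values) - np.asarray(observed_returns)
    sigma_f = np.std(residuals)                              # signal level from mean-function residuals
    noise_prior = (sigma_n / 3.0, 3.0 * sigma_n)             # U[sigma_n / 3, 3 * sigma_n]
    signal_prior = (sigma_f / 3.0, 3.0 * sigma_f)            # U[sigma_f / 3, 3 * sigma_f]
    return noise_prior, signal_prior

# Example with made-up numbers.
noise_prior, signal_prior = dynamic_hyperpriors(
    central_returns=[-3.1, -2.9, -3.0],
    mean_fn_values=[-2.8, -3.3, -3.1, -2.5],
    observed_returns=[-2.9, -3.0, -3.2, -2.7],
)
```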
We use a DroQ critic with the original hyperparameters (Hiraoka et al., 2021), which consists of two critics with layers of size (256, 256) with ReLU activations, using Layer Normalization and Dropout. For all our experiments, we use an ensemble of 5 critics. We leverage the fact that the discounted policy returns are normalized and apply a tanh layer to bound the predictions of every critic to [-1, 1]. We use 32 parallel optimizers to optimize both the parameters of the Gaussian Process and the acquisition function. We follow Müller et al. (2021) and only train our GPs on the last Nmax data points collected in the Bayesian Optimization loop. In all our experiments, we keep track of the data generated by the last 3 outer loops of Algorithm 2, thus Nmax = 3 (1 + Nacq), where Nacq is the number of acquisition points sampled in one outer loop. We also roll out the central point θ Nc times in order to obtain a better advantage mean function estimate and to alleviate the need for a signal prior, as described below.

The selection of hyperpriors for the signal variance σf and the noise variance σn requires domain knowledge about the task at hand and adds further hyperparameters to the BO problem. In the particular case of local BO, when we are in a fixed neighborhood, it is fair to say that the observation noise is roughly similar among all points. The same logic applies to the signal variance. With this intuition in mind, we develop a heuristic to dynamically fix the hyperpriors for σf and σn. Using the fact that we roll out the central point Nc times, we use a noise estimate $\hat{\sigma}_n^2 = \mathrm{Var}\big( \hat{J}_1(\pi_\theta), \dots, \hat{J}_{N_c}(\pi_\theta) \big)$, where $\hat{J}_i$ is the noisy policy return estimate from the i-th rollout of policy $\pi_\theta$. We then fix the hyperprior to be $U[\tfrac{1}{3}\hat{\sigma}_n, 3\hat{\sigma}_n]$. For the signal variance, we adopt a similar scheme by taking $\hat{\sigma}_f^2 = \mathrm{Var}\big( \{\hat{m}(x) - \hat{J}(\pi_x)\}_{x \in X} \big)$ and use the signal variance prior $U[\tfrac{1}{3}\hat{\sigma}_f, 3\hat{\sigma}_f]$. In the case of the constant mean used in MPD, this variance corresponds to the observation variance.

MPD (Nguyen et al., 2022) has the merit of being less learning-rate dependent than our method, being able to use a small learning rate and perform multiple descent steps. ABS cannot use the same scheme, as our advantage mean function depends on the central point θ. In our implementation of MPD, we follow the code provided by the authors, where we first normalize the gradient $g = \Sigma_\theta^{-1} \mu_\theta$ to have an L2 norm of 1 and use the same learning rate of 0.01 for all environments. Hence, for MPD we perform a grid search only over the lengthscale hyperprior. For ABS, we perform a search over the learning rate and the lengthscale prior, and report the best-performing set of hyperparameters (Table 1). For all the results obtained with MPD and ABS in this manuscript, we used these best sets of parameters (Tables 2 and 3). For ARS, we use the original parameters reported in the paper (Mania et al., 2018).

| Task name | Learning rate | Lengthscale prior | Nc | Nacq | Nmax |
|---|---|---|---|---|---|
| Inverted Pendulum | 0.01 | U(0.0025, 0.05) | 2 | 6 | 21 |
| Swimmer | 0.01 | U(0.0025, 0.05) | 3 | 12 | 39 |
| Hopper | 0.01 | U(0.0025, 0.025) | 3 | 16 | 51 |
| Halfcheetah | 0.01 | U(0.00125, 0.025) | 4 | 20 | 63 |
| Walker2d | 0.01 | U(0.000625, 0.0125) | 4 | 20 | 63 |
| Ant | 0.01 | U(0.000625, 0.0125) | 5 | 24 | 75 |

Table 3: Hyperparameters used for MPD.

B EXPERIMENTAL SECTION

B.1 UNDISCOUNTED RETURNS

Here, we plot the evolution of the maximum undiscounted policy return as a function of the number of episodes.
ABS, which can be seen as MPD augmented with the advantage mean function and the adaptive aggregation scheme, consistently outperforms MPD, which uses the constant mean function. Our algorithm substantially outperforms the ARS baseline on the Walker2d and Ant tasks.

Figure 5: Evolution of the maximum undiscounted policy return of ARS, MPD, and ABS on the MuJoCo-v4 tasks (Inverted Pendulum, 4 dims; Swimmer, 16 dims; Hopper, 33 dims; Halfcheetah, 102 dims; Walker2d, 102 dims; Ant, 216 dims). We use 5 random seeds for every algorithm.

B.2 VALIDATION AND TEST SCORES

Here, we plot the histograms of the distributions of the validation score $\tilde{R}^2$ and the test score $\hat{R}^2$ used in Figure 2 for better visualization. We observe that the validation metric $\tilde{R}^2$ remains consistently positive, with the best critic explaining on average approximately 50% of the variance of the observed data. The test metric $\hat{R}^2$ also manages to be positive quite frequently, while being sometimes negative.

Figure 6: Histogram and correlation of the validation and test scores for the MuJoCo-v4 tasks (Halfcheetah, correlation 0.153; Walker2d, correlation 0.212; Ant, correlation 0.204) for a run of ABS. The dashed vertical lines represent the medians of their respective distributions. We clip the scores to [-1, 1] for easier visualization.

C.1 PROOF OF THE POSTERIOR OF THE GRADIENT

Corollary C.0.1. Under the advantage mean function, the mean of the gradient posterior at θ is

$$\mu_\theta = \mathbb{E}_{s \sim \hat{d}^{\pi_\theta}} \left[ \nabla_a \hat{Q}^{\pi_\theta}_\phi(s, a) \big|_{a = \pi_\theta(s)} \nabla_\theta \pi_\theta(s) \right] + \nabla_\theta K(\theta, X) K(X, X)^{-1} \big( \hat{J}(X) - \hat{m}_\phi(X) \big). \quad (13)$$

Proof. We compute the gradient of the advantage mean function at the central point:

$$\nabla_\theta m_\phi(\theta) = \nabla_x m_\phi(x) \big|_{x=\theta}
= \nabla_x \Big[ J(\pi_\theta) + \mathbb{E}_{s \sim d^{\pi_\theta}} \big[ A^{\pi_\theta}(s, \pi_x(s)) \big] \Big]_{x=\theta}
\overset{(i)}{=} \nabla_x \, \mathbb{E}_{s \sim d^{\pi_\theta}} \Big[ Q^{\pi_\theta}(s, \pi_x(s)) - Q^{\pi_\theta}(s, \pi_\theta(s)) \Big]_{x=\theta}
\overset{(ii)}{=} \mathbb{E}_{s \sim d^{\pi_\theta}} \Big[ \nabla_x Q^{\pi_\theta}(s, \pi_x(s)) \big|_{x=\theta} \Big]
= \mathbb{E}_{s \sim d^{\pi_\theta}} \Big[ \nabla_a Q^{\pi_\theta}(s, a) \big|_{a = \pi_\theta(s)} \nabla_x \pi_x(s) \big|_{x=\theta} \Big].$$

Step (i) is due to the independence of $J(\pi_\theta)$ from $x$. Step (ii) is true because $Q^{\pi_\theta}(s, \pi_\theta(s))$ is independent of $x$. Noting that $\nabla_x \pi_x(s)|_{x=\theta} = \nabla_\theta \pi_\theta(s)$, we get the final result by injecting this equality into Equation (2).

C.2 LIPSCHITZ MDPS INDUCE DISTRIBUTIONALLY LIPSCHITZ MDPS

Assumption C.1 (Lipschitz MDP (Pirotta et al., 2015)). An MDP with a policy class Π is $(L_r, L_p, L_\pi)$-Lipschitz if for all $s, s' \in \mathcal{S}$ and $a, a' \in \mathcal{A}$:
- Lipschitzness of the reward: $|r(s, a) - r(s', a')| \leq L_r \big( d_{\mathcal{S}}(s, s') + d_{\mathcal{A}}(a, a') \big)$;
- Lipschitzness of the transition: $W\big( p(\cdot|s, a), p(\cdot|s', a') \big) \leq L_p \big( d_{\mathcal{S}}(s, s') + d_{\mathcal{A}}(a, a') \big)$;
- Lipschitzness of the policy: $W(\pi(s_1), \pi(s_2)) \leq L_\pi \, d_{\mathcal{S}}(s_1, s_2)$ for all $\pi \in \Pi$.

Assumption C.2 (Distributionally Lipschitz MDP (Saleh et al., 2022)). An MDP with a policy class Π is $(L_1, L_2)$-Lipschitz if for all $\pi_1, \pi_2 \in \Pi$ and $\mu_1, \mu_2 \in \mathcal{P}(\mathcal{S})$, the following hold:
- Lipschitzness w.r.t. the policy: $W(P(\mu, \pi_1), P(\mu, \pi_2)) \leq L_1 \sup_{s \in \mathcal{S}} W(\pi_1(s), \pi_2(s))$;
- Lipschitzness w.r.t. the state distribution: $W(P(\mu_1, \pi), P(\mu_2, \pi)) \leq L_2 \, W(\mu_1, \mu_2)$,

where $P$ denotes the generalization of the transition dynamics to take as input probability distributions over actions or states.

Assumption C.3. The state space $\mathcal{S} \subset \mathbb{R}^d$ is bounded.
Theorem C.4. Lipschitz MDPs (Assumption C.1) induce Distributionally Lipschitz MDPs (Assumption C.2) with Lipschitz rewards, such that $L_1 = L_\pi$ and $L_2 = L_p(1 + L_\pi)$.

Proof. We prove that Assumption C.1 (without Lipschitzness of the reward) implies Assumption C.2.

Part 1: Lipschitzness w.r.t. the policy.

$$W\big( P(\mu, \pi_1), P(\mu, \pi_2) \big) = W\left( \int_{\mathcal{S}} p(\cdot|s, \pi_1(s)) \, d\mu(s), \int_{\mathcal{S}} p(\cdot|s, \pi_2(s)) \, d\mu(s) \right)
\overset{(a)}{\leq} \int_{\mathcal{S}} W\big( p(\cdot|s, \pi_1(s)), p(\cdot|s, \pi_2(s)) \big) \, d\mu(s)
\leq \sup_{s \in \mathcal{S}} W\big( p(\cdot|s, \pi_1(s)), p(\cdot|s, \pi_2(s)) \big)
\leq L_\pi \sup_{s \in \mathcal{S}} W(\pi_1(s), \pi_2(s)).$$

Step (a) is due to Theorem 4.8 of Villani et al. (2009). This implies that $L_1 = L_\pi$.

Part 2: Lipschitzness w.r.t. the state distribution. Writing $\lambda(\cdot|s) \triangleq p(\cdot|s, \pi(s))$,

$$W\big( P(\mu_1, \pi), P(\mu_2, \pi) \big) = W\left( \int_{\mathcal{S}} \lambda(\cdot|s) \, d\mu_1(s), \int_{\mathcal{S}} \lambda(\cdot|s) \, d\mu_2(s) \right)
\overset{(b)}{=} \sup_{\|f\|_L \leq 1} \left( \int_{\mathcal{S}} \int_{\mathcal{S}} f(s') \lambda(s'|s) \mu_1(s) \, ds \, ds' - \int_{\mathcal{S}} \int_{\mathcal{S}} f(s') \lambda(s'|s) \mu_2(s) \, ds \, ds' \right)
\overset{(c)}{=} \sup_{\|f\|_L \leq 1} \int_{\mathcal{S}} \big( \mu_1(s) - \mu_2(s) \big) \underbrace{\int_{\mathcal{S}} f(s') \lambda(s'|s) \, ds'}_{g(s)} \, ds
\overset{(d)}{\leq} L_p(1 + L_\pi) \sup_{\|a\|_L \leq 1} \int_{\mathcal{S}} \big( \mu_1(s) - \mu_2(s) \big) \, a(s) \, ds
\leq L_p(1 + L_\pi) \, W(\mu_1, \mu_2).$$

Step (b) is due to Remark 6.5 in Villani et al. (2009). Step (c) is obtained by applying the Fubini-Tonelli theorem, as $\mathcal{S}$ is bounded. Step (d) holds because $g$ is $L_p(1 + L_\pi)$-Lipschitz (Lemma C.5) and $f$ is at most 1-Lipschitz. Thus, we conclude that $L_2 = L_p(1 + L_\pi)$.

Lemma C.5. If we define $g(s) \triangleq \int_{\mathcal{S}} f(s') \lambda(s'|s) \, ds'$, for a Lipschitz MDP (Assumption C.1) and $f$ such that $\|f\|_L \leq 1$, then $g$ is $L_p(1 + L_\pi)$-Lipschitz.

Proof.

$$|g(s_1) - g(s_2)| = \left| \int_{\mathcal{S}} f(s') \lambda(s'|s_1) \, ds' - \int_{\mathcal{S}} f(s') \lambda(s'|s_2) \, ds' \right|
\overset{(e)}{\leq} \sup_{\|h\|_L \leq 1} \left( \int_{\mathcal{S}} h(s') \lambda(s'|s_1) \, ds' - \int_{\mathcal{S}} h(s') \lambda(s'|s_2) \, ds' \right)
= W\big( \lambda(\cdot|s_1), \lambda(\cdot|s_2) \big) = W\big( p(\cdot|s_1, \pi(s_1)), p(\cdot|s_2, \pi(s_2)) \big)
\leq L_p \big( d_{\mathcal{S}}(s_1, s_2) + d_{\mathcal{A}}(\pi(s_1), \pi(s_2)) \big)
\leq L_p \big( d_{\mathcal{S}}(s_1, s_2) + L_\pi d_{\mathcal{S}}(s_1, s_2) \big) = L_p (1 + L_\pi) \, d_{\mathcal{S}}(s_1, s_2).$$

Step (e) holds because the Lipschitz constant of $f$ is less than or equal to 1, and we take the supremum over all such functions.

C.3 BOUND ON THE RESIDUAL TERM

Now, we use Theorem C.4 to bound the residual for Lipschitz MDPs.

Theorem C.6. For a Lipschitz MDP satisfying Assumption C.1 and $\gamma L_p(1 + L_\pi) < 1$, the residual term satisfies

$$\big| \langle d^{\pi_2} - d^{\pi_1}, A^{\pi_1}(\cdot, \pi_2(\cdot)) \rangle \big| \leq C \sup_{s \in \mathcal{S}} \| \pi_2(s) - \pi_1(s) \|, \quad (14)$$

for $C \triangleq \frac{2 \gamma L_\pi L_r (1 + L_\pi)}{(1 - \gamma L_p (1 + L_\pi))^2}$.

Proof.

Step 1: First, we bound the residual using the Wasserstein distance between the visitation measures and the Lipschitz constant of the advantage:

$$\langle d^{\pi_2} - d^{\pi_1}, A^{\pi_1}(\cdot, \pi_2(\cdot)) \rangle = \langle d^{\pi_2} - d^{\pi_1}, Q^{\pi_1}(\cdot, \pi_2(\cdot)) - Q^{\pi_1}(\cdot, \pi_1(\cdot)) \rangle \leq W(d^{\pi_2}, d^{\pi_1}) \, \sup_s \big\| \nabla_s \big[ Q^{\pi_1}(s, \pi_2(s)) - Q^{\pi_1}(s, \pi_1(s)) \big] \big\|. \quad (15)$$

We obtain the last inequality from the Kantorovich-Rubinstein inequality (Villani et al., 2009).

Step 2: Now, we bound the change in this gradient due to the change in policies. By the chain rule,

$$\nabla_s \big[ Q^{\pi_1}(s, \pi_2(s)) - Q^{\pi_1}(s, \pi_1(s)) \big] = \underbrace{\nabla_s Q^{\pi_1} \big|_{s, \pi_2(s)} - \nabla_s Q^{\pi_1} \big|_{s, \pi_1(s)}}_{\text{Term } a} + \underbrace{\nabla_s \pi_2(s) \, \nabla_a Q^{\pi_1} \big|_{s, \pi_2(s)} - \nabla_s \pi_1(s) \, \nabla_a Q^{\pi_1} \big|_{s, \pi_1(s)}}_{\text{Term } b}.$$

Step 3: Bounding Term a. Here, we leverage the property that if $f$ is $L$-Lipschitz, then $\|\nabla f\| \leq L$. The Q-function is $L_Q$-Lipschitz for a Lipschitz MDP, with $L_Q \leq \frac{L_r}{1 - \gamma L_p (1 + L_\pi)}$ for $\gamma L_p(1 + L_\pi) < 1$ (Pirotta et al., 2015). Thus, $\|\nabla Q^{\pi_1}\| \leq L_Q$, which bounds Term a by $2 L_Q$.

Step 4: Bounding Term b. For each of its two summands, $\| \nabla_s \pi_i(s) \, \nabla_a Q^{\pi_1} \| \leq \| \nabla_s \pi_i(s) \| \, \| \nabla_a Q^{\pi_1} \| \leq L_\pi L_Q$. The first inequality is a consequence of Hölder's inequality, while the second one is obtained from the Lipschitzness of $Q$ and $\pi$. Thus, Term b is bounded by $2 L_Q L_\pi$.

Step 5: Putting it all together.
Combining Equation (15) with the bounds on Term a and Term b yields

$$\big| \langle d^{\pi_2} - d^{\pi_1}, A^{\pi_1}(\cdot, \pi_2(\cdot)) \rangle \big| \leq W(d^{\pi_2}, d^{\pi_1}) \cdot 2 L_Q (1 + L_\pi)
\overset{(a)}{\leq} \frac{\gamma L_\pi}{1 - \gamma L_p (1 + L_\pi)} \sup_{s \in \mathcal{S}} W(\pi_1(s), \pi_2(s)) \cdot 2 L_Q (1 + L_\pi)
\leq \frac{2 \gamma L_\pi L_r (1 + L_\pi)}{(1 - \gamma L_p (1 + L_\pi))^2} \sup_s \| \pi_2(s) - \pi_1(s) \|,$$

where the last step uses $L_Q \leq \frac{L_r}{1 - \gamma L_p (1 + L_\pi)}$ and the fact that, for deterministic policies, $W(\pi_1(s), \pi_2(s)) = \| \pi_1(s) - \pi_2(s) \|$. Step (a) is derived by an application of Theorem B.4 in Saleh et al. (2022), when $\gamma L_p(1 + L_\pi) < 1$.