Gradient-Aware Model-Based Policy Search

Pierluca D'Oro, Alberto Maria Metelli, Andrea Tirinzoni, Matteo Papini, Marcello Restelli
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Piazza Leonardo da Vinci, 32, 20133, Milano, Italy
pierluca.doro@mail.polimi.it, {albertomaria.metelli, andrea.tirinzoni, matteo.papini, marcello.restelli}@polimi.it

Abstract

Traditional model-based reinforcement learning approaches learn a model of the environment dynamics without explicitly considering how it will be used by the agent. In the presence of misspecified model classes, this can lead to poor estimates, as some relevant available information is ignored. In this paper, we introduce a novel model-based policy search approach that exploits the knowledge of the current agent policy to learn an approximate transition model, focusing on the portions of the environment that are most relevant for policy improvement. We leverage a weighting scheme, derived from the minimization of the error on the model-based policy gradient estimator, in order to define a suitable objective function that is optimized for learning the approximate transition model. Then, we integrate this procedure into a batch policy improvement algorithm, named Gradient-Aware Model-based Policy Search (GAMPS), which iteratively learns a transition model and uses it, together with the collected trajectories, to compute the new policy parameters. Finally, we empirically validate GAMPS on benchmark domains, analyzing and discussing its properties.

1 Introduction

Model-Based Reinforcement Learning (MBRL, Sutton and Barto 2018; Nguyen-Tuong and Peters 2011) approaches use the interaction data collected in the environment to estimate its dynamics, with the main goal of improving the sample efficiency of Reinforcement Learning (RL, Sutton and Barto 2018) algorithms. However, modeling the dynamics of the environment in a thorough way can be extremely complex and, thus, require the use of very powerful model classes and considerable amounts of data, betraying the original goal of MBRL. Fortunately, in many interesting application domains (e.g., robotics), perfectly capturing the dynamics across the whole state-action space is not necessary for a model to be effectively used by a learning agent (Abbeel, Quigley, and Ng 2006; Nguyen-Tuong, Seeger, and Peters 2009; Levine and Abbeel 2014). Indeed, a wiser approach consists in using simpler model classes, whose estimation requires few interactions with the environment, and focusing their limited capacity on the most relevant parts of the environment. These parts could present a local dynamics that is inherently simpler than the global one, or at least easier to model using prior knowledge. The vast majority of MBRL methods employs a maximum-likelihood estimation process for learning the model (Deisenroth et al. 2013). Nonetheless, the relative importance of the different aspects of the dynamics greatly depends on the underlying decision problem, on the control approach, and, importantly, on the policy played by the agent.
Recent work (Farahmand, Barreto, and Nikovski 2017; Farahmand 2018) shows that, in the context of value-based reinforcement learning methods, it is possible to derive a decision-aware loss function for model learning that compares favorably against maximum likelihood. However, no equivalent exists for policy-based methods, which are often preferred in the case of continuous observation/action spaces. Moreover, previous work fails to incorporate the influence of the current agent's behavior when evaluating the relative importance of the different aspects of the world dynamics. For instance, suppose a learning agent acts deterministically in a certain region of the environment, possibly thanks to some prior knowledge, and has no interest in changing its behavior in that area; or that some regions of the state space are extremely unlikely to be reached by an agent following the current policy. There would be no benefit in approximating the corresponding aspects of the dynamics, since that knowledge cannot contribute to the agent's learning process. Therefore, with a limited model expressiveness, an approach for model learning that explicitly accounts for the current policy, and for how it will be improved, can outperform traditional maximum likelihood estimation. In this paper, motivated by these observations, we propose a model-based policy search (Deisenroth et al. 2013; Sutton et al. 2000) method that leverages awareness of the current agent's policy in the estimation of a forward model, which is then used to perform policy optimization. Unlike existing approaches, which typically ignore all the knowledge available on the running policy during model estimation, we incorporate it into a weighting scheme for the objective function used in model learning. We choose to focus our discussion on the batch setting (Lange, Gabel, and Riedmiller 2012), due to its particular real-world importance. Nonetheless, extensions to the interactive scenario can be easily derived. The contributions of this paper are theoretical, algorithmic and experimental. After introducing our notation and the required mathematical preliminaries (Section 2), we formalize the concept of Model-Value-based Gradient (MVG), an approximation of the policy gradient that combines real trajectories with a value function derived from an estimated model (Section 3). The MVG allows finding a compromise between the large variance of a Monte Carlo gradient estimate and the bias of a full model-based estimator. Contextually, we present a bound relating the bias of the MVG to the choice of the estimated transition model. In Section 4, we derive from this bound an optimization problem to be solved, using samples, to obtain a gradient-aware forward model. Then, we integrate it into a batch policy optimization algorithm, named Gradient-Aware Model-based Policy Search (GAMPS), that iteratively uses samples to learn the approximate forward model and to estimate the gradient used to perform the policy improvement step. After that, we present a finite-sample analysis for a single step of GAMPS (Section 5), which highlights the advantages of our approach when simple model classes are considered. Finally, after reviewing related work in the areas of model-based policy search and decision-aware MBRL (Section 6), we empirically validate GAMPS against model-based and model-free baselines, and discuss its peculiar features (Section 7).
The proofs of all the results presented in the paper are reported in the Appendix.[1]

2 Preliminaries

A discrete-time Markov Decision Process (MDP, Puterman 2014) is described by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, p, \mu, \gamma)$, where $\mathcal{S}$ is the space of possible states, $\mathcal{A}$ is the space of possible actions, $r(s,a)$ is the reward received by executing action $a$ in state $s$, $p(\cdot|s,a)$ is the transition model that provides the distribution of the next state when performing action $a$ in state $s$, $\mu$ is the distribution of the initial state and $\gamma \in [0,1)$ is a discount factor. When needed, we assume that $r$ is known, as common in domains where MBRL is employed (e.g., robotic learning (Deisenroth and Rasmussen 2011)), and that rewards are uniformly bounded by $|r(s,a)| \le R_{\max} < +\infty$. The behavior of an agent is described by a policy $\pi(\cdot|s)$ that provides the distribution over the action space for every state $s$. Given a state-action pair $(s,a)$, we define the action-value function (Sutton and Barto 2018), or Q-function, as
$$Q^{\pi,p}(s,a) = r(s,a) + \gamma \int_{\mathcal{S}} p(s'|s,a) \int_{\mathcal{A}} \pi(a'|s')\, Q^{\pi,p}(s',a')\, \mathrm{d}a'\, \mathrm{d}s',$$
and the state-value function, or V-function, as $V^{\pi,p}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^{\pi,p}(s,a)]$, where we made explicit the dependence on the policy $\pi$ and on the transition model $p$. The goal of the agent is to find an optimal policy $\pi^*$, i.e., a policy that maximizes the expected return $J^{\pi,p} = \mathbb{E}_{s_0 \sim \mu}[V^{\pi,p}(s_0)]$.

We consider a batch setting (Lange, Gabel, and Riedmiller 2012), in which the learning is performed on a previously collected dataset of $N$ independent trajectories $\mathcal{D} = \{\tau^i\}_{i=1}^N = \{(s^i_0, a^i_0, s^i_1, a^i_1, \ldots, s^i_{T_i-1}, a^i_{T_i-1}, s^i_{T_i})\}_{i=1}^N$, each composed of $T_i$ transitions, and further interactions with the environment are not allowed. The experience is generated by an agent that interacts with the environment following a known behavioral policy $\pi_b$. We are interested in learning a parameterized policy $\pi_\theta$ (for which we usually omit the parameter subscript in the notation) that belongs to a parametric space of stochastic differentiable policies $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta \subseteq \mathbb{R}^d\}$. In this case, the gradient of the expected return w.r.t. $\theta$ is provided by the policy gradient theorem (PGT) (Sutton et al. 2000; Sutton and Barto 2018):
$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma} \int_{\mathcal{S}} \int_{\mathcal{A}} \delta^{\pi,p}_\mu(s,a)\, \nabla_\theta \log \pi(a|s)\, Q^{\pi,p}(s,a)\, \mathrm{d}s\, \mathrm{d}a, \qquad (1)$$
where $\delta^{\pi,p}_\mu(s,a)$ is the $\gamma$-discounted state-action distribution (Sutton et al. 2000), defined as $\delta^{\pi,p}_\mu(s,a) = (1-\gamma) \sum_{t=0}^{+\infty} \gamma^t \Pr(s_t = s, a_t = a \mid \mathcal{M}, \pi)$. We call $\nabla_\theta \log \pi(a|s)$ the score of the policy $\pi$ when executing action $a$ in state $s$. Furthermore, we denote with $\delta^{\pi,p}_{s',a'}(s,a)$ the state-action distribution under policy $\pi$ and model $p$ when the environment is deterministically initialized by executing action $a'$ in state $s'$, and with $\zeta^{\pi,p}_\mu(\tau)$ the probability density function of a trajectory $\tau$. In batch policy optimization, the policy gradient is typically computed for a policy $\pi$ that is different from the policy $\pi_b$ having generated the data (off-policy estimation (Precup, Sutton, and Singh 2000)). To correct for the distribution mismatch, we employ importance sampling (Kahn and Marshall 1953; Owen 2013), re-weighting the transitions based on the probability of their being observed under the policy $\pi$. Namely, we define the importance weight relative to a sub-trajectory $\tau_{t':t''}$ of $\tau$, occurring from time $t'$ to $t''$, and to policies $\pi$ and $\pi_b$ as
$$\rho_{\pi/\pi_b}(\tau_{t':t''}) = \frac{\zeta^{\pi,p}_\mu(\tau_{t':t''})}{\zeta^{\pi_b,p}_\mu(\tau_{t':t''})} = \prod_{t=t'}^{t''} \frac{\pi(a_t|s_t)}{\pi_b(a_t|s_t)}.$$

[1] The full version of the paper, including the appendix, is available at https://arxiv.org/abs/1909.04115.
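For concreteness, the prefix importance weights $\rho_{\pi/\pi_b}(\tau_{0:t})$ used throughout the paper can be obtained incrementally as cumulative products of per-step likelihood ratios. The following is a minimal Python sketch (not part of the original paper); the callables `log_pi` and `log_pi_b` are hypothetical placeholders for the log-densities $\log\pi(a|s)$ and $\log\pi_b(a|s)$.

```python
import numpy as np

def prefix_importance_weights(states, actions, log_pi, log_pi_b):
    """Return [rho_{pi/pi_b}(tau_{0:0}), ..., rho_{pi/pi_b}(tau_{0:T-1})] for one
    trajectory, where rho(tau_{0:t}) = prod_{l=0}^{t} pi(a_l|s_l) / pi_b(a_l|s_l).
    Working in log-space avoids numerical under/overflow of the running products."""
    log_ratios = np.array([log_pi(s, a) - log_pi_b(s, a)
                           for s, a in zip(states, actions)])
    return np.exp(np.cumsum(log_ratios))
```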
3 Model-Value-based Gradient

The majority of model-based policy search approaches employ the learned forward model for generating rollouts, which are used to compute an improvement direction $\nabla_\theta J(\theta)$ either via likelihood-ratio methods or by propagating gradients through the model (Deisenroth and Rasmussen 2011). Differently from these methods, we consider an approximation of the gradient, named Model-Value-based Gradient (MVG), defined as follows.

Definition 3.1. Let $p$ be the transition model of a Markov Decision Process $\mathcal{M}$, $\Pi_\Theta$ a parametric space of stochastic differentiable policies, and $\mathcal{P}$ a class of transition models. Given $\pi \in \Pi_\Theta$ and $\widehat{p} \in \mathcal{P}$, the Model-Value-based Gradient (MVG) is defined as:
$$\nabla^{\mathrm{MVG}}_\theta J(\theta) = \frac{1}{1-\gamma} \int_{\mathcal{S}} \int_{\mathcal{A}} \delta^{\pi,p}_\mu(s,a)\, \nabla_\theta \log \pi(a|s)\, Q^{\pi,\widehat{p}}(s,a)\, \mathrm{d}s\, \mathrm{d}a. \qquad (2)$$

Thus, the MVG employs experience collected in the real environment $p$, i.e., sampling from $\delta^{\pi,p}_\mu(s,a)$, and uses the generative power of the estimated transition kernel $\widehat{p}$ only in the computation of an approximate state-action value function $Q^{\pi,\widehat{p}}$. In this way, it is possible to find a compromise between a full model-based estimator, in which the experience is directly generated from $\delta^{\pi,\widehat{p}}_\mu$ (Deisenroth and Rasmussen 2011; Deisenroth et al. 2013), and a Monte Carlo estimator (e.g., GPOMDP (Baxter and Bartlett 2001)), in which also the Q-function is computed from experience collected in the real environment. Therefore, the MVG limits the bias effect of $\widehat{p}$ to the Q-function approximation $Q^{\pi,\widehat{p}}$.[2] At the same time, it enjoys a smaller variance w.r.t. a Monte Carlo estimator, especially in an off-policy setting, as the Q-function is no longer estimated from samples but just approximated using $\widehat{p}$. Existing approaches can be interpreted as MVGs, for instance those based on model-based value expansion (Feinberg et al. 2018; Buckman et al. 2018), which use a fixed-horizon unrolling of an estimated forward model to obtain a better value function in an actor-critic setting.

A central question concerning Definition 3.1 is how the choice of $\widehat{p}$ affects the quality of the gradient approximation, i.e., how much bias an MVG introduces in the gradient approximation. To this end, we bound the approximation error by the expected KL-divergence between $p$ and $\widehat{p}$.

Theorem 3.2. Let $q \in [1, +\infty]$ and $\widehat{p} \in \mathcal{P}$. Then, the $L^q$-norm of the difference between the policy gradient $\nabla_\theta J(\theta)$ and the corresponding MVG $\nabla^{\mathrm{MVG}}_\theta J(\theta)$ can be upper bounded as:
$$\left\| \nabla_\theta J(\theta) - \nabla^{\mathrm{MVG}}_\theta J(\theta) \right\|_q \le \frac{\gamma \sqrt{2}\, Z R_{\max}}{(1-\gamma)^2} \sqrt{ \mathbb{E}_{(s,a) \sim \eta^{\pi,p}_\mu} \left[ D_{\mathrm{KL}}\!\left( p(\cdot|s,a) \,\|\, \widehat{p}(\cdot|s,a) \right) \right] },$$
where $\eta^{\pi,p}_\mu(s,a) = \int_{\mathcal{S}} \int_{\mathcal{A}} \nu^{\pi,p}_\mu(s',a')\, \delta^{\pi,p}_{s',a'}(s,a)\, \mathrm{d}s'\, \mathrm{d}a'$ is a probability distribution over $\mathcal{S} \times \mathcal{A}$, with $\nu^{\pi,p}_\mu(s',a') = \frac{1}{Z}\, \delta^{\pi,p}_\mu(s',a')\, \left\| \nabla_\theta \log \pi_\theta(a'|s') \right\|_q$ and $Z = \int_{\mathcal{S}} \int_{\mathcal{A}} \delta^{\pi,p}_\mu(s',a')\, \left\| \nabla_\theta \log \pi_\theta(a'|s') \right\|_q \mathrm{d}s'\, \mathrm{d}a'$, both not depending on $\widehat{p}$.[3]

Proof Sketch. Since $Q^{\pi,p}(s,a) = \frac{1}{1-\gamma} \int_{\mathcal{S}} \int_{\mathcal{A}} \delta^{\pi,p}_{s,a}(s',a')\, r(s',a')\, \mathrm{d}s'\, \mathrm{d}a'$, we bound the Q-function difference with $\big\| \delta^{\pi,p}_{s,a} - \delta^{\pi,\widehat{p}}_{s,a} \big\|_1$. The latter is upper bounded in terms of $\big\| p(\cdot|s',a') - \widehat{p}(\cdot|s',a') \big\|_1$. The result follows from Pinsker's inequality.

[2] It is worth noting that, when the environment dynamics can be approximated locally with a simple model or some prior knowledge on the environment is available, selecting a suitable approximator $\widehat{p}$ for the transition model is easier than choosing an appropriate function approximator for a critic in an actor-critic architecture.

[3] We need to assume that $Z > 0$ in order for $\eta^{\pi,p}_\mu$ to be well-defined. This is not a limitation: if $Z = 0$, then $\nabla_\theta J(\theta) = 0$ and there is no need to define $\eta^{\pi,p}_\mu$.

Similarly to what was noted for other forms of decision-aware MBRL (Farahmand, Barreto, and Nikovski 2017), a
looser bound, in which the expectation of the KL-divergence is taken under $\delta^{\pi,p}_\mu$, can be derived (see Appendix). This motivates the common maximum likelihood approach. However, our bound is tighter and clearly shows that not all collected transitions have the same relevance when learning a model that is used in estimating the MVG. Overall, the most important $(s,a)$ pairs are those that are likely to be reached by the policy when starting from high gradient-magnitude state-action pairs.

4 Gradient-Aware Model-based Policy Search

Inspired by Theorem 3.2, we propose a policy search algorithm that employs an MVG approximation, combining trajectories generated in the real environment with a model-based approximation of the Q-function obtained with the estimated transition model $\widehat{p}$. The algorithm, Gradient-Aware Model-based Policy Search (GAMPS), consists of three steps: learning the model $\widehat{p}$ (Section 4.1), computing the value function $Q^{\pi,\widehat{p}}$ (Section 4.2) and updating the policy using the estimated gradient $\widehat{\nabla}_\theta J(\theta)$ (Section 4.3).

4.1 Learning the Transition Model

To learn $\widehat{p}$, we aim at minimizing the bound in Theorem 3.2 over a class of transition models $\mathcal{P}$, using the trajectories $\mathcal{D}$ collected with $\zeta^{\pi_b,p}_\mu$. However, to estimate an expected value computed over $\eta^{\pi,p}_\mu$, as in Theorem 3.2, we face two problems. First, the policy mismatch between the behavioral policy $\pi_b$ used to collect $\mathcal{D}$ and the current agent's policy $\pi$; this can be easily addressed by using importance sampling. Second, given a policy $\pi$, we need to be able to compute expectations over $\eta^{\pi,p}_\mu$ using samples from $\zeta^{\pi,p}_\mu$. In other words, we need to reformulate the expectation over $\eta^{\pi,p}_\mu$ in terms of an expectation over trajectories. To this end, we provide the following general result.

Lemma 4.1. Let $\pi$ and $\pi_b$ be two policies such that $\pi \ll \pi_b$ ($\pi$ is absolutely continuous w.r.t. $\pi_b$). Let $f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^k$ be an arbitrary function defined over the state-action space. Then, it holds that:
$$\mathbb{E}_{(s,a) \sim \eta^{\pi,p}_\mu}[f(s,a)] = \frac{(1-\gamma)^2}{Z}\, \mathbb{E}_{\tau \sim \zeta^{\pi_b,p}_\mu} \left[ \sum_{t=0}^{+\infty} \gamma^t\, \rho_{\pi/\pi_b}(\tau_{0:t}) \left\| \sum_{l=0}^{t} \nabla_\theta \log \pi(a_l|s_l) \right\|_q f(s_t, a_t) \right].$$

To specialize Lemma 4.1 to our case, we just set $f(s,a) = D_{\mathrm{KL}}(p(\cdot|s,a) \,\|\, \widehat{p}(\cdot|s,a))$. Note that $Z$ does not depend on $\widehat{p}$ and thus can be ignored in the minimization procedure. Furthermore, minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the observed transitions. Putting everything together, we obtain the objective:
$$\widehat{p} \in \operatorname*{arg\,max}_{\widehat{p} \in \mathcal{P}} \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \omega^i_t \log \widehat{p}\left( s^i_{t+1} \,\middle|\, s^i_t, a^i_t \right), \quad \text{with} \quad \omega^i_t = \gamma^t\, \rho_{\pi/\pi_b}(\tau^i_{0:t}) \left\| \sum_{l=0}^{t} \nabla_\theta \log \pi(a^i_l|s^i_l) \right\|_q. \qquad (3)$$

The factors contained in the weight $\omega^i_t$ accomplish three goals in weighting the transitions. The discount factor $\gamma^t$ encodes that later transitions are exponentially less important in the gradient computation. The importance weight $\rho_{\pi/\pi_b}(\tau^i_{0:t})$ is larger for the transitions that are more likely to be generated by the current policy $\pi$. This incorporates a key consideration into model learning: since the running policy $\pi$ can be quite different from the policy $\pi_b$ that generated the data, which is typically very explorative (Deisenroth et al. 2013), an accurate approximation of the dynamics for the regions that are rarely reached by the current policy is not useful. Lastly, the factor $\left\| \sum_{l=0}^{t} \nabla_\theta \log \pi(a^i_l|s^i_l) \right\|_q$ favors those transitions that occur at the end of a sub-trajectory $\tau_{0:t}$ with a high cumulative score magnitude. This score accumulation resembles the expression of some model-free gradient estimators (Baxter and Bartlett 2001).
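As an illustration of this weighting scheme, the following minimal Python sketch (not from the paper) computes the weights $\omega^i_t$ of Equation (3) for a single trajectory and evaluates the $\omega$-weighted log-likelihood objective. The inputs `rho` (prefix importance weights, e.g., computed as in the sketch at the end of Section 2), `score` (a callable returning $\nabla_\theta \log \pi(a|s)$), and `model_log_prob` (a log-density of a candidate model in $\mathcal{P}$) are assumed to be available and are hypothetical names.

```python
import numpy as np

def gamps_model_weights(states, actions, rho, score, gamma, q=2):
    """Weights of Equation (3) for one trajectory:
    omega_t = gamma^t * rho_{pi/pi_b}(tau_{0:t}) * || sum_{l<=t} grad log pi(a_l|s_l) ||_q.
    `rho[t]` is the prefix importance weight; `score(s, a)` returns a 1-D array."""
    T = len(actions)
    scores = np.array([score(s, a) for s, a in zip(states[:T], actions)])
    cum_score_norm = np.linalg.norm(np.cumsum(scores, axis=0), ord=q, axis=1)
    return gamma ** np.arange(T) * np.asarray(rho) * cum_score_norm

def weighted_log_likelihood(model_log_prob, trajectories, all_weights):
    """Objective of Equation (3): omega-weighted log-likelihood of the observed
    transitions, averaged over the N trajectories. `model_log_prob(s_next, s, a)`
    is a hypothetical log-density of a candidate model; the estimated model is the
    member of the class P that maximizes this quantity."""
    total = 0.0
    for (states, actions), weights in zip(trajectories, all_weights):
        for t, w in enumerate(weights):
            total += w * model_log_prob(states[t + 1], states[t], actions[t])
    return total / len(trajectories)
```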
Intuitively, the magnitude of the score of a policy is related to its opportunity to be improved, i.e., the possibility to change the probability of actions. Our gradient-aware weighting scheme encourages a better approximation of the dynamics for states and actions found in trajectories that can potentially lead to the most significant improvements to the policy.

4.2 Computing the Value Function

The estimated transition model $\widehat{p}$ can be used to compute the action-value function $Q^{\pi,\widehat{p}}$ for any policy $\pi$. This amounts to evaluating the current policy using $\widehat{p}$ instead of the actual transition probability kernel $p$. In the case of finite MDPs, the evaluation can be performed either in closed form or in an iterative manner via dynamic programming (Bellman et al. 1954; Sutton and Barto 2018). For continuous MDPs, $Q^{\pi,\widehat{p}}$ cannot, in general, be represented exactly. A first approach consists of employing a function approximator $\widehat{Q} \in \mathcal{Q}$ and applying approximate dynamic programming (Bertsekas 1995). However, this method requires a proper choice of a functional space $\mathcal{Q}$ and the definition of the regression targets, which should be derived using the estimated model $\widehat{p}$ (Ernst, Geurts, and Wehenkel 2005; Riedmiller 2005), possibly introducing further bias. We instead encourage the use of $\widehat{p}$ as a generative model for the sole purpose of approximating $Q^{\pi,\widehat{p}}$. Recalling that we will use $\widehat{Q}$ to estimate the policy gradient from the available trajectories, we can just obtain a Monte Carlo approximation of $Q^{\pi,\widehat{p}}$ on the fly, in an unbiased way, averaging the return from a (possibly large) number $M$ of imaginary trajectories obtained from the estimated model $\widehat{p}$:
$$\widehat{Q}(s,a) = \frac{1}{M} \sum_{j=1}^{M} \sum_{t=0}^{T_j - 1} \gamma^t r(s^j_t, a^j_t), \qquad \tau^j \sim \zeta^{\pi,\widehat{p}}_{s,a}. \qquad (4)$$
This approach has the advantage of avoiding the harsh choice of an appropriate model class $\mathcal{Q}$ and the definition of the regression targets, while providing an unbiased estimate for the quantity of interest.

4.3 Estimating the Policy Gradient

After computing $Q^{\pi,\widehat{p}}$ (or some approximation $\widehat{Q}$), all the gathered information can be used to improve the policy $\pi$. As we are using a model-value-based gradient, the trajectories we will use have been previously collected in the real environment. Furthermore, the data have been generated by a possibly different policy $\pi_b$ and, to account for the difference in the distributions, we need importance sampling again. Therefore, by writing the sample version of Equation (2), we obtain:
$$\widehat{\nabla}_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \gamma^t\, \rho_{\pi/\pi_b}(\tau^i_{0:t})\, \nabla_\theta \log \pi(a^i_t|s^i_t)\, Q^{\pi,\widehat{p}}(s^i_t, a^i_t). \qquad (5)$$

For performing batch policy optimization, we repeat the three steps presented in this section using the data collected by the behavior policy $\pi_b$. At each iteration, we fit the model with the weights of the current policy, we employ it in the computation of the state-action value function, and we then improve the policy with one or more steps of gradient ascent. The overall procedure is summarized in Algorithm 1.

Algorithm 1 Gradient-Aware Model-based Policy Search
Input: trajectory dataset $\mathcal{D}$, behavior policy $\pi_b$, initial parameters $\theta_0$, step size schedule $(\alpha_k)_{k=0}^{K-1}$
1: for $k = 0, 1, \ldots, K-1$ do
2:   Learn $\widehat{p}$ (Section 4.1):
3:     $\omega^i_{t,k} \leftarrow \gamma^t\, \rho_{\pi_{\theta_k}/\pi_b}(\tau^i_{0:t}) \left\| \sum_{l=0}^{t} \nabla_\theta \log \pi_{\theta_k}(a^i_l|s^i_l) \right\|_q$
4:     $\widehat{p}_k \leftarrow \operatorname*{arg\,max}_{\widehat{p} \in \mathcal{P}} \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \omega^i_{t,k} \log \widehat{p}(s^i_{t+1}|s^i_t, a^i_t)$
5:   Compute $\widehat{Q}$ (Section 4.2):
6:     Generate $M$ trajectories for each $(s,a)$ using $\widehat{p}_k$
7:     $\widehat{Q}_k(s,a) = \frac{1}{M} \sum_{j=1}^{M} \sum_{t=0}^{T_j-1} \gamma^t r(s^j_t, a^j_t)$
8:   Improve the policy (Section 4.3):
9:     $\widehat{\nabla}_\theta J(\theta_k) \leftarrow \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i-1} \gamma^t\, \rho_{\pi_{\theta_k}/\pi_b}(\tau^i_{0:t})\, \nabla_\theta \log \pi_{\theta_k}(a^i_t|s^i_t)\, \widehat{Q}_k(s^i_t, a^i_t)$
10:    $\theta_{k+1} \leftarrow \theta_k + \alpha_k \widehat{\nabla}_\theta J(\theta_k)$
11: end for
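To make steps 5-10 of Algorithm 1 concrete, here is a minimal Python sketch (not from the paper) of the two estimators involved: the Monte Carlo approximation of $Q^{\pi,\widehat{p}}$ in Equation (4) and the gradient estimate of Equation (5). The callables `model_step`, `policy_sample`, `reward`, `score`, and `q_hat`, as well as the finite rollout horizon and the default value of $M$, are illustrative assumptions; in particular, truncating the imaginary rollouts at a finite horizon is a practical approximation of the infinite-horizon return.

```python
import numpy as np

def monte_carlo_q(model_step, policy_sample, reward, s, a, gamma, M=32, horizon=200):
    """Equation (4): estimate Q^{pi, p_hat}(s, a) by averaging the discounted return of
    M imaginary rollouts that start from (s, a) and then follow pi inside the learned
    model p_hat. `model_step(s, a)` samples s' ~ p_hat(.|s, a), `policy_sample(s)`
    samples a ~ pi(.|s), and `reward(s, a)` evaluates the (known) reward function."""
    returns = []
    for _ in range(M):
        ret, discount, s_t, a_t = 0.0, 1.0, s, a
        for _ in range(horizon):
            ret += discount * reward(s_t, a_t)
            s_t = model_step(s_t, a_t)
            a_t = policy_sample(s_t)
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))

def gamps_gradient(trajectories, rho_prefix, score, q_hat, gamma):
    """Equation (5): importance-sampled MVG estimate from real trajectories.
    `trajectories[i]` is a list of (s, a) pairs, `rho_prefix[i][t]` holds
    rho_{pi/pi_b}(tau^i_{0:t}), `score(s, a)` returns grad_theta log pi(a|s) as a
    1-D array and `q_hat(s, a)` returns the model-based Q-value estimate."""
    grads = []
    for i, traj in enumerate(trajectories):
        g = sum(gamma ** t * rho_prefix[i][t] * q_hat(s, a) * score(s, a)
                for t, (s, a) in enumerate(traj))
        grads.append(g)
    return np.mean(grads, axis=0)
```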
5 Theoretical Analysis

In this section, we provide a finite-sample bound for the gradient estimation of Equation (5), assuming access to the exact value of $Q^{\pi,\widehat{p}}$. This corresponds to the analysis of a single iteration of GAMPS. We first define the following functions. Let $\tau$ be a trajectory, $\pi \in \Pi_\Theta$ and $p \in \mathcal{P}$. We define $l_{\pi,p}(\tau) = \sum_{t=0}^{+\infty} \omega_t \log p(s_{t+1}|s_t, a_t)$ and $g_{\pi,p}(\tau) = \sum_{t=0}^{+\infty} \gamma^t\, \rho_{\pi/\pi_b}(\tau_{0:t})\, \nabla_\theta \log \pi(a_t|s_t)\, Q^{\pi,p}(s_t, a_t)$. To obtain our result, we need the following assumptions.

Assumption 1. The second moments of $l_{\pi,p}$ and $g_{\pi,p}$ are uniformly bounded over $\mathcal{P}$ and $\Pi_\Theta$. In this case, given a dataset $\mathcal{D} = \{\tau^i\}_{i=1}^N$, there exist two constants $c_1, c_2 < +\infty$ such that for all $p \in \mathcal{P}$ and $\pi \in \Pi_\Theta$:
$$\max\left\{ \mathbb{E}_{\tau \sim \zeta^{\pi_b,p}_\mu}\!\left[ l_{\pi,p}(\tau)^2 \right],\; \frac{1}{N} \sum_{i=1}^{N} l_{\pi,p}(\tau^i)^2 \right\} \le c_1^2, \qquad \max\left\{ \mathbb{E}_{\tau \sim \zeta^{\pi_b,p}_\mu}\!\left[ \left\| g_{\pi,p}(\tau) \right\|^2 \right],\; \frac{1}{N} \sum_{i=1}^{N} \left\| g_{\pi,p}(\tau^i) \right\|^2 \right\} \le R_{\max}^2 c_2^2.$$

Assumption 2. The pseudo-dimensions of the hypothesis spaces $\{l_{\pi,p} : p \in \mathcal{P}, \pi \in \Pi_\Theta\}$ and $\{g_{\pi,p} : p \in \mathcal{P}, \pi \in \Pi_\Theta\}$ are bounded by $v < +\infty$.

Assumption 1 requires that the overall effect of the importance weight $\rho_{\pi/\pi_b}$, the score $\nabla_\theta \log \pi$ and the approximating transition model $p$ preserves the finiteness of the second moment. Clearly, a sufficient (albeit often unrealistic) condition is requiring all these quantities to be uniformly bounded. Assumption 2 is necessary to state learning theory guarantees. We are now ready to present the main result, which employs the learning theory tools of Cortes, Greenberg, and Mohri (2013).

Theorem 5.1. Let $q \in [1, +\infty]$, $d$ be the dimensionality of $\Theta$ and $\widehat{p} \in \mathcal{P}$ be the maximizer of the objective function in Equation (3), obtained with $N > 0$ independent trajectories $\{\tau^i\}_{i=1}^N$. Under Assumptions 1 and 2, for any $\delta \in (0,1)$, with probability at least $1 - 4\delta$ it holds that:
$$\left\| \widehat{\nabla}_\theta J(\theta) - \nabla_\theta J(\theta) \right\|_q \le \underbrace{2 R_{\max}\, d^{\frac{1}{q}}\, c_2\, \epsilon + \gamma \sqrt{2}\, Z\, c_1\, \epsilon}_{\text{estimation error}} + \underbrace{\frac{\gamma \sqrt{2}\, Z R_{\max}}{(1-\gamma)^2} \sqrt{ \inf_{\widehat{p} \in \mathcal{P}}\, \mathbb{E}_{(s,a) \sim \eta^{\pi,p}_\mu}\!\left[ D_{\mathrm{KL}}\!\left( p(\cdot|s,a) \,\|\, \widehat{p}(\cdot|s,a) \right) \right] }}_{\text{approximation error}},$$
where $\epsilon$ depends on the pseudo-dimension $v$, on the number of trajectories $N$ and on the confidence $\delta$ through quantities of the form $v + \log \frac{8(d+1)}{\delta}$, together with a function $\Gamma(\cdot)$; their exact expressions are reported in the full version of the paper.

The theorem justifies the intuition behind the gradient estimation based on the MVG. A model $\widehat{p}$ is good when it achieves a reasonable trade-off between the errors in approximation and estimation.[4] In the case of scarce data (i.e., small $N$), it is convenient to choose a low-capacity model class $\mathcal{P}$ in order to reduce the error-enlarging effect of the pseudo-dimension $v$. However, this carries the risk of being unable to approximate the original model. Nonetheless, the approximation error depends on an expected value under $\eta^{\pi,p}_\mu$. Even a model class that would be highly misspecified w.r.t. an expectation computed under the state-action distribution $\delta^{\pi,p}_\mu$ can, perhaps surprisingly, lead to an accurate gradient estimation using our approach.

[4] It is worth noting that the estimation error vanishes as the number of collected trajectories $N$ grows.

6 Related Works

We now review prior work in MBRL, focusing on policy search methods and on approaches that include some level of awareness of the underlying control problem into model learning.

Policy Search with MBRL. The standard approach consists in using a maximum likelihood estimate of the environment dynamics to perform simulations (or imaginary rollouts), through which a policy can be improved without further, or with limited, interactions with the environment (Deisenroth et al. 2013). This approach has taken different forms, with the use of tabular models (Wang and Dietterich 2003), least-squares density estimation techniques (Tangkaratt et al. 2014) or, more recently, combinations of variational generative models and recurrent neural networks employed in world models based on mixture density networks (Ha and Schmidhuber 2018).
Several methods incorporate the model uncertainty into policy updates, by using Gaussian processes and moment matching approximations (Deisenroth and Rasmussen 2011), Bayesian neural networks (Gal, McAllister, and Rasmussen 2016) or ensembles of forward models (Chua et al. 2018; Kurutach et al. 2018; Janner et al. 2019; Buckman et al. 2018). The MBRL works that are particularly related to GAMPS are those employing estimated forward models that are accurate only locally (Abbeel, Quigley, and Ng 2006; Nguyen-Tuong, Seeger, and Peters 2009; Levine and Abbeel 2014), or those using a model-value-based gradient formulation (Abbeel, Quigley, and Ng 2006; Feinberg et al. 2018; Buckman et al. 2018; Heess et al. 2015), as described in Section 3.

Decision-aware MBRL. The observation that, under misspecified model classes, the dynamics of the environment must be captured foreseeing the final task to be performed led to the development of decision-aware approaches for model learning (Farahmand, Barreto, and Nikovski 2017). While one of the first examples was a financial application (Bengio 1997), the idea was introduced into MBRL (Joseph et al. 2013; Bansal et al. 2017) and the related adaptive optimal control literature (Piroddi and Spinelli 2003) by using actual evaluations of a control policy in the environment as a performance index for model learning. More similarly to our approach, but in the context of value-based methods, a theoretical framework called value-aware MBRL (Farahmand, Barreto, and Nikovski 2017) was proposed, in which the model is estimated by minimizing the expected error on the Bellman operator, explicitly considering its actual use in the control algorithm. Starting from this, further theoretical considerations and approaches have been proposed (Farahmand 2018; Asadi et al. 2018). Awareness of the final task to be performed has also been incorporated into stochastic dynamic programming (Donti, Amos, and Kolter 2017; Amos et al. 2018) and, albeit implicitly, into neural network-based works (Oh, Singh, and Lee 2017; Silver et al. 2017; Luo et al. 2019), in which value functions and models consistent with each other are learned.

7 Experiments

We now present an experimental evaluation of GAMPS, whose objective is two-fold: assessing the effect of our weighting scheme for model learning and comparing the performance in batch policy optimization of our algorithm against model-based and model-free policy search baselines.

7.1 Two-areas Gridworld

This experiment is meant to show how decision-awareness can be an effective tool to improve the accuracy of policy gradient estimates when using a forward model. The environment is a 5×5 gridworld, divided into two areas (lower and upper) with different dynamics: the effect of a movement action of the agent is reversed in one area w.r.t. the other. Once the agent gets to the lower area, it is not possible to go back to the upper one (Figure 1a).

Figure 1: (a) Gridworld representation: the goal state is G and the possible initial states are marked with $\mu$; the two areas with different dynamics are represented with different colors. (b) State-action distribution weights, i.e., the empirical $\delta^{\pi,p}_\mu(s,a)$. (c) Gradient-aware weights, i.e., the empirical $\eta^{\pi,p}_\mu(s,a)$. In (b) and (c), normalized values are shown; each grid represents every state of the environment for the two most representative actions.
We collect experience with a linear policy $\pi_b$ that is deterministic on the lower area and randomly initialized in the upper area, and which is also used as the initial policy for learning. The first goal of this experiment is to show that, with the use of gradient-awareness, even an extremely simple model class can be sufficiently expressive to provide an accurate estimate of the policy gradient. Hence, we use a forward model which, given the sole action executed by the agent (i.e., without knowledge of the current state), predicts the effect that it will cause on the position of the agent (i.e., up, down, left, right, stay). This model class cannot perfectly represent the whole environment dynamics at the same time, as the dynamics changes between the two areas. However, given the nature of the policy $\pi$, this is not necessary, since modeling only the upper area, which is indeed representable with our model, would be enough to perfectly improve the policy. Nonetheless, this useful information has no way of being captured using the usual maximum likelihood procedure, which, during model learning, weighs the transitions upon visitation only, regardless of the policy. To experimentally assess how our approach addresses this intuitive point, we generate 1000 trajectories running $\pi_b$ in the environment, and we first compare the maximum likelihood and the gradient-aware weighting factors, $\delta^{\pi,p}_\mu(s,a)$ and $\eta^{\pi,p}_\mu(s,a)$. The results (Figures 1b and 1c) show that our method is able, in a totally automatic way, to avoid assigning importance to the transitions in which the policy cannot be improved.

We further investigate the performance of GAMPS compared to batch learning with the maximum likelihood transition model (ML) and two classical model-free learning algorithms, REINFORCE (Williams 1992) and PGT (Sutton et al. 2000). To adapt the latter two to the batch setting, we employ importance sampling in the same way as described in Equation (5), but estimating the Q-function using the same trajectories (and importance sampling as well). The results obtained by collecting different numbers of trajectories and evaluating on the environment are shown in Figure 2. When the data are too scarce, all compared algorithms struggle to converge towards good policies, experiencing high variance.[5] It is worth noting that, with any amount of data we tested, the GAMPS learning curve is consistently above the others, showing superior performance even when considering the best iteration of all algorithms.

7.2 Continuous Control

To show that our algorithm is able to perform well also on continuous domains, we test its performance on a simulated minigolf environment (Lazaric, Restelli, and Bonarini 2008; Tirinzoni, Salvini, and Restelli 2019) and on the 3-link Swimmer robot control benchmark based on MuJoCo (Todorov, Erez, and Tassa 2012). In the minigolf game, the agent hits a ball using a flat-faced golf club (the putter) with the goal of reaching the hole in the minimum number of strokes. Given only the distance to the hole, the agent chooses, at each step, the angular velocity of the putter, which determines the next position of the ball. The episode terminates when the ball enters the hole, with reward 0, or when the agent overshoots, with reward -100. In all other cases, the reward is -1 and the agent can try other hits.
We further suppose that the minigolf course is divided into two areas, one twice as large as the other, with different terrains: the first, nearest to the hole and the largest, has the friction of a standard track; the second has very high friction, comparable to that of an area covered with sand. We use Gaussian policies that are linear in six radial basis function features. The model predicts the difference from the previous state by sampling from a Gaussian distribution with linearly parameterized mean and standard deviation. For the MuJoCo swimmer, the goal of the agent is to swim forward in a fluid. The policy is linear in the state features and the forward model is a 2-layer neural network with 32 hidden neurons and tanh activations.

We evaluate GAMPS against the same baselines employed for the previous experiment. We collect a dataset of 50 and 100 trajectories for minigolf and swimmer respectively, using an explorative policy, and then run the algorithms for 30 and 20 iterations. Results in terms of average return are reported in Figure 3, showing that GAMPS outperforms all the other algorithms both in terms of terminal and maximum performance. In the minigolf domain, it is less susceptible to overfitting compared to the baselines, and it obtains a final policy able to reach the hole most of the time. In the swimmer environment, where powerful models are used, GAMPS still shows superior performance, validating the insight provided by Theorem 3.2 even in this setting.

[5] In batch learning, performance degradation when the current policy $\pi$ becomes too dissimilar from the behavioral policy $\pi_b$ is natural, due to the variance of the importance weights. To avoid this effect, a stopping condition connected to the effective sample size (Owen 2013) can be employed.

Figure 2: Average return on the gridworld for 10, 50 and 100 collected trajectories. ML employs maximum-likelihood model estimation (20 runs, mean ± std).

Figure 3: Average return of GAMPS, ML, PGT and REINFORCE in (a) the minigolf domain, using 50 trajectories, and (b) the swimmer environment, using 100 trajectories (10 runs, mean ± std).

8 Discussion and Conclusions

In this paper, we presented GAMPS, a batch gradient-aware model-based policy search algorithm. GAMPS leverages the knowledge about the policy that is being optimized for learning the transition model, by giving more importance to the aspects of the dynamics that are more relevant for improving its performance. We derived GAMPS from the minimization of the bias of the model-value-based gradient, an approximation of the policy gradient that mixes trajectories collected in the real environment with a value function computed with the estimated model. Our theoretical analysis validates the intuition that, when dealing with low-capacity models, it is convenient to focus their representation capabilities on the portions of the environment that are most crucial for improving the policy. The empirical validation demonstrates that, even when extremely simple model classes are considered, GAMPS is able to outperform the baselines. The main limitations of GAMPS lie in the need to reuse all the samples for model learning at each iteration and in the compounding errors during rollouts. Future work could focus on mitigating these issues, as well as on adapting GAMPS to the interactive scenario, by leveraging existing on/off-policy approaches (Metelli et al.
2018), and on different ways to exploit gradient-aware model learning.

Acknowledgments

This work has been partially supported by the Italian MIUR PRIN 2017 Project ALGADIMAR "Algorithms, Games, and Digital Markets".

References

Abbeel, P.; Quigley, M.; and Ng, A. Y. 2006. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, 1-8. ACM.

Amos, B.; Jimenez, I.; Sacks, J.; Boots, B.; and Kolter, J. Z. 2018. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, 8289-8300.

Asadi, K.; Cater, E.; Misra, D.; and Littman, M. L. 2018. Equivalence between Wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265.

Bansal, S., et al. 2017. Goal-driven dynamics learning via Bayesian optimization. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), 5168-5173.

Baxter, J., and Bartlett, P. L. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15:319-350.

Bellman, R., et al. 1954. The theory of dynamic programming. Bulletin of the American Mathematical Society 60(6):503-515.

Bengio, Y. 1997. Using a financial training criterion rather than a prediction criterion. International Journal of Neural Systems.

Bertsekas, D. P. 1995. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA.

Buckman, J.; Hafner, D.; Tucker, G.; Brevdo, E.; and Lee, H. 2018. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, 8224-8234.

Chua, K.; Calandra, R.; McAllister, R.; and Levine, S. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 4754-4765.

Cortes, C.; Greenberg, S.; and Mohri, M. 2013. Relative deviation learning bounds and generalization with unbounded loss functions. arXiv preprint arXiv:1310.5796.

Deisenroth, M., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 465-472.

Deisenroth, M. P.; Neumann, G.; Peters, J.; et al. 2013. A survey on policy search for robotics. Foundations and Trends in Robotics 2(1-2):1-142.

Donti, P.; Amos, B.; and Kolter, J. Z. 2017. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, 5484-5494.

Ernst, D.; Geurts, P.; and Wehenkel, L. 2005. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6(Apr):503-556.

Farahmand, A.-m.; Barreto, A.; and Nikovski, D. 2017. Value-aware loss function for model-based reinforcement learning. In Artificial Intelligence and Statistics, 1486-1494.

Farahmand, A.-m. 2018. Iterative value-aware model learning. In Advances in Neural Information Processing Systems, 9072-9083.

Feinberg, V.; Wan, A.; Stoica, I.; Jordan, M. I.; Gonzalez, J. E.; and Levine, S. 2018. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.

Gal, Y.; McAllister, R.; and Rasmussen, C. E. 2016. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, volume 4.

Ha, D., and Schmidhuber, J. 2018. Recurrent world models facilitate policy evolution.
In Advances in Neural Information Processing Systems, 2450-2462.

Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2944-2952.

Janner, M.; Fu, J.; Zhang, M.; and Levine, S. 2019. When to trust your model: Model-based policy optimization. arXiv preprint arXiv:1906.08253.

Joseph, J. M.; Geramifard, A.; Roberts, J. W.; How, J. P.; and Roy, N. 2013. Reinforcement learning with misspecified model classes. In 2013 IEEE International Conference on Robotics and Automation, 939-946.

Kahn, H., and Marshall, A. W. 1953. Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America 1(5):263-278.

Kurutach, T.; Clavera, I.; Duan, Y.; Tamar, A.; and Abbeel, P. 2018. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations.

Lange, S.; Gabel, T.; and Riedmiller, M. 2012. Batch reinforcement learning. In Reinforcement Learning. Springer. 45-73.

Lazaric, A.; Restelli, M.; and Bonarini, A. 2008. Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. In Advances in Neural Information Processing Systems, 833-840.

Levine, S., and Abbeel, P. 2014. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 1071-1079.

Luo, Y.; Xu, H.; Li, Y.; Tian, Y.; Darrell, T.; and Ma, T. 2019. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations.

Metelli, A. M.; Papini, M.; Faccio, F.; and Restelli, M. 2018. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, 5442-5454.

Nguyen-Tuong, D., and Peters, J. 2011. Model learning for robot control: a survey. Cognitive Processing 12(4):319-340.

Nguyen-Tuong, D.; Seeger, M.; and Peters, J. 2009. Model learning with local Gaussian process regression. Advanced Robotics 23(15):2015-2034.

Oh, J.; Singh, S.; and Lee, H. 2017. Value prediction network. In Advances in Neural Information Processing Systems, 6118-6128.

Owen, A. B. 2013. Monte Carlo theory, methods and examples.

Piroddi, L., and Spinelli, W. 2003. An identification algorithm for polynomial NARX models based on simulation error minimization. International Journal of Control 76(17):1767-1781.

Precup, D.; Sutton, R. S.; and Singh, S. P. 2000. Eligibility traces for off-policy policy evaluation. In Langley, P., ed., Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), 759-766.

Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.

Riedmiller, M. 2005. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, 317-328. Springer.

Silver, D., et al. 2017. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 3191-3199. JMLR.org.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057-1063.

Tangkaratt, V.; Mori, S.; Zhao, T.; Morimoto, J.; and Sugiyama, M.
2014. Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation. Neural Networks 57:128-140.

Tirinzoni, A.; Salvini, M.; and Restelli, M. 2019. Transfer of samples in policy search via multiple importance sampling. In Proceedings of the 36th International Conference on Machine Learning, 6264-6274.

Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026-5033. IEEE.

Wang, X., and Dietterich, T. G. 2003. Model-based policy gradient reinforcement learning. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 776-783.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229-256.