# Variational Model-based Policy Optimization

Yinlam Chow¹, Brandon Cui², Moon Kyung Ryu¹ and Mohammad Ghavamzadeh¹
¹Google AI, ²Facebook AI Research
yinlamchow@google.com, bcui@fb.com, {mkryu, ghavamza}@google.com

Abstract

Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with those collected from interaction with the real system in order to alleviate the data efficiency problem in RL. However, designing such algorithms is often challenging because the bias in simulated data may overshadow the ease of data generation. A potential solution to this challenge is to jointly learn and improve model and policy using a universal objective function. In this paper, we leverage the connection between RL and probabilistic inference, and formulate such an objective function as a variational lower-bound of a log-likelihood. This allows us to use expectation maximization (EM) and iteratively fix a baseline policy and learn a variational distribution, consisting of a model and a policy (E-step), followed by improving the baseline policy given the learned variational distribution (M-step). We propose model-based and model-free policy iteration (actor-critic) style algorithms for the E-step and show how the variational distribution learned by them can be used to optimize the M-step in a fully model-based fashion. Our experiments on a number of continuous control tasks show that our model-based (E-step) algorithm, which we refer to as variational model-based policy optimization (VMBPO), is more sample-efficient and robust to hyper-parameter tuning than its model-free (E-step) counterpart. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms and show its sample efficiency and performance.

1 Introduction

Model-free reinforcement learning (RL) algorithms that learn a good policy without constructing an explicit model of the system's dynamics have shown promising results in complex simulated problems [Mnih et al., 2013; Mnih et al., 2015; Schulman et al., 2015; Haarnoja et al., 2018]. However, these methods are not sample-efficient, and thus not suitable for problems in which data collection is burdensome. Model-based RL algorithms address the data efficiency issue of the model-free methods by learning a model and combining model-generated data with those collected from interaction with the real system [Sutton, 1990; Janner et al., 2019]. However, designing model-based RL algorithms is often challenging because the bias in the model may affect the process of learning policies and result in worse asymptotic performance than the model-free counterparts. A potential solution to this challenge is to incorporate the policy/value optimization method in the process of learning the model. An ideal case here would be to have a universal objective function that is used to learn and improve model and policy jointly.

Casting RL as probabilistic inference has a long history (e.g., [Todorov, 2008; Toussaint, 2009; Kappen et al., 2012; Rawlik et al., 2013]). This formulation has the advantage that it allows powerful tools for approximate inference to be employed in RL. One such class of tools are variational techniques that have been successfully used in RL (e.g., [Neumann, 2011; Levine and Koltun, 2013; Abdolmaleki et al., 2018]).
Another formulation of RL with a strong connection to probabilistic inference is the formulation of policy search as an expectation maximization (EM) style algorithm (e.g., [Dayan and Hinton, 1997; Peters and Schaal, 2007; Peters et al., 2010; Chebotar et al., 2017; Abdolmaleki et al., 2018]). The main idea here is to write the expected return of a policy as a (pseudo)-likelihood function, and then, conditioning on success in maximizing the return, find the policy that most likely would have been taken. Another class of RL algorithms related to the inference formulation are entropy-regularized algorithms that add an entropy term to the reward and find the soft-max optimal policy (e.g., [Levine and Abbeel, 2014; Nachum et al., 2017; Haarnoja et al., 2018; Fellows et al., 2019]). For a comprehensive tutorial on RL as probabilistic inference, we refer readers to [Levine, 2018].

In this paper, we leverage the connection between RL and probabilistic inference, and formulate an objective function for jointly learning and improving model and policy as a variational lower-bound of a log-likelihood. This allows us to use EM: iteratively fix a baseline policy and learn a variational distribution, consisting of a model and a policy (E-step), followed by improving the baseline policy given the learned variational distribution (M-step). We propose model-based and model-free policy iteration (PI) style algorithms for the E-step and show how the variational distribution that they learn can be used to optimize the M-step only from model-generated samples. It is important to note that both algorithms are model-based and only differ in using model-based and model-free algorithms for the E-step. We call the algorithm that uses model-based PI for the E-step variational model-based policy optimization (VMBPO). Our experiments on a number of continuous control tasks show that VMBPO is more sample-efficient and robust to hyper-parameter tuning than its model-free counterpart. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms, including model-based policy optimization (MBPO) [Janner et al., 2019] and maximum a posteriori policy optimization (MPO) [Abdolmaleki et al., 2018], and show its sample efficiency and performance.

2 Preliminaries

We study the RL problem in which the agent's interaction with the environment is modeled as a Markov decision process (MDP) $\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, r, p, p_0 \rangle$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces; $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function; $p: \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$ is the transition kernel ($\Delta_{\mathcal{X}}$ is the set of probability distributions over $\mathcal{X}$); and $p_0 \in \Delta_{\mathcal{X}}$ is the initial state distribution. A stationary Markovian policy $\pi: \mathcal{X} \to \Delta_{\mathcal{A}}$ is a probabilistic mapping from states to actions. Each policy $\pi$ is evaluated by its expected return, i.e., $J(\pi) = \mathbb{E}\big[\sum_{t=0}^{T-1} r(x_t, a_t) \mid p_0, p, \pi\big]$, where $T$ is the (random) time of hitting a terminal state.¹ We denote by $\mathcal{X}^0$ the set of all terminal states. The agent's goal is to find a policy with maximum expected return, i.e., $\pi^* \in \arg\max_{\pi} J(\pi)$. We denote by $\xi = (x_0, a_0, \ldots, x_{T-1}, a_{T-1}, x_T)$ a system trajectory of length $T$, whose probability under a policy $\pi$ is defined as $p_\pi(\xi) = p_0(x_0) \prod_{t=0}^{T-1} \pi(a_t|x_t)\, p(x_{t+1}|x_t, a_t)$. Finally, we define $[T] := \{0, \ldots, T-1\}$.

¹ Similar to [Levine, 2018], our setting can be easily extended to infinite-horizon $\gamma$-discounted MDPs. This can be done by modifying the transition kernel, such that any action transitions the system to a terminal state with probability $1-\gamma$, and all standard transition probabilities are multiplied by $\gamma$.
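For illustration, the following minimal tabular sketch shows how the trajectory likelihood $p_\pi(\xi)$ and a Monte Carlo estimate of the expected return $J(\pi)$ defined above could be computed. The array names (`p0`, `p`, `pi`, `r`) are hypothetical placeholders for the MDP quantities, not an implementation used in our experiments.

```python
import numpy as np

# Hypothetical tabular MDP quantities:
#   p0[x]       : initial state distribution
#   p[x, a, y]  : transition kernel p(y | x, a)
#   pi[x, a]    : stationary Markovian policy pi(a | x)
#   r[x, a]     : reward function

def trajectory_log_prob(states, actions, p0, p, pi):
    """log p_pi(xi) = log p0(x0) + sum_t [log pi(a_t|x_t) + log p(x_{t+1}|x_t,a_t)]."""
    logp = np.log(p0[states[0]])
    for t, a in enumerate(actions):
        logp += np.log(pi[states[t], a]) + np.log(p[states[t], a, states[t + 1]])
    return logp

def monte_carlo_return(trajectories, r):
    """Estimate J(pi) = E[sum_t r(x_t, a_t)] from sampled (states, actions) pairs."""
    returns = [sum(r[x, a] for x, a in zip(states[:-1], actions))
               for states, actions in trajectories]
    return float(np.mean(returns))
```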
3 Policy Optimization as Inference

Policy search in RL can be formulated as a probabilistic inference problem (e.g., [Todorov, 2008; Toussaint, 2009; Kappen et al., 2012; Levine, 2018]). The goal in the conventional RL formulation is to find a policy whose generated trajectories maximize the expected return. In contrast, in the inference formulation, we start with a prior over trajectories and then estimate the posterior conditioned on a desired outcome, such as reaching a goal state. In this formulation, the notion of a desired (optimal) outcome is introduced via independent binary random variables $O_t,\ t \in [T]$, where $O_t = 1$ denotes that we acted optimally at time $t$. The likelihood of $O_t$, given the state $x_t$ and action $a_t$, is modeled as $p(O_t = 1 \mid x_t, a_t) = \exp\big(\eta\, r(x_t, a_t)\big)$, where $\eta > 0$ is a temperature parameter. This allows us to define the log-likelihood of $\pi$ being optimal as

$$\log p_\pi(O_{0:T-1} = 1) = \log \int_\xi p_\pi(O_{0:T-1} = 1, \xi) = \log \mathbb{E}_{\xi \sim p_\pi}\big[p(O_{0:T-1} = 1 \mid \xi)\big], \qquad (1)$$

where $p(O_{0:T-1} = 1 \mid \xi)$ is the likelihood of trajectory $\xi$ being optimal and is defined as

$$p(O_{0:T-1} = 1 \mid \xi) = \prod_{t=0}^{T-1} p(O_t = 1 \mid x_t, a_t) = \exp\Big(\eta \sum_{t=0}^{T-1} r(x_t, a_t)\Big). \qquad (2)$$

As a result, finding an optimal policy in this setting would be equivalent to maximizing the log-likelihood in (1), i.e., $\pi^*_{\text{soft}} \in \arg\max_\pi \log p_\pi(O_{0:T-1} = 1)$.

A potential advantage of formulating RL as an inference problem is the possibility of using a wide range of approximate inference algorithms, including variational methods. In variational inference, we approximate a distribution $p(\cdot)$ with a potentially simpler (e.g., tractable factored) distribution $q(\cdot)$ in order to make the whole inference process more tractable. If we approximate $p_\pi(\xi)$ with a variational distribution $q(\xi)$, we obtain the following variational lower-bound for the log-likelihood in (1):

$$\log p_\pi(O_{0:T-1} = 1) = \log \mathbb{E}_{\xi \sim p_\pi}\Big[\exp\Big(\eta \sum_{t=0}^{T-1} r(x_t, a_t)\Big)\Big] = \log \mathbb{E}_{\xi \sim q}\Big[\frac{p_\pi(\xi)}{q(\xi)} \exp\Big(\eta \sum_{t=0}^{T-1} r(x_t, a_t)\Big)\Big] \overset{(a)}{\geq} \mathbb{E}_{\xi \sim q}\Big[\log \frac{p_\pi(\xi)}{q(\xi)} + \eta \sum_{t=0}^{T-1} r(x_t, a_t)\Big] = \eta\, \mathbb{E}_q\Big[\sum_{t=0}^{T-1} r(x_t, a_t)\Big] - \mathrm{KL}(q \,\|\, p_\pi) := \mathcal{J}(q; \pi), \qquad (3)$$

where (a) follows from Jensen's inequality and $\mathcal{J}(q; \pi)$ is the evidence lower-bound (ELBO) of the log-likelihood function. A variety of algorithms have been proposed (e.g., [Peters and Schaal, 2007; Hachiya et al., 2009; Neumann, 2011; Levine and Koltun, 2013; Abdolmaleki et al., 2018; Fellows et al., 2019]) whose main idea is to approximate $\pi^*_{\text{soft}}$ by maximizing $\mathcal{J}(q; \pi)$ w.r.t. both $q$ and $\pi$. This often results in an EM-style algorithm in which we first fix $\pi$ and maximize $\mathcal{J}(\cdot; \pi)$ over $q$ (E-step), and then, given the $q$ obtained in the E-step, we maximize $\mathcal{J}(q; \cdot)$ over $\pi$ (M-step).

4 Variational Model-based Policy Optimization

In this section, we describe the ELBO objective function used by our algorithms, study the properties of the resulting optimization problem, and propose algorithms to solve it. We propose to use the variational distribution $q(\xi) = p_0(x_0) \prod_{t=0}^{T-1} q_c(a_t|x_t)\, q_d(x_{t+1}|x_t, a_t)$ to approximate $p_\pi(\xi)$. Note that $q$ has the same initial state distribution as $p_\pi$ (defined in Section 2), but a different control strategy (policy), $q_c$, and dynamics, $q_d$. Using this variational distribution, we may write the ELBO objective (3) as

$$\mathcal{J}(q; \pi) = \mathbb{E}_q\Big[\sum_{t=0}^{T-1} \eta\, r(x_t, a_t) - \log \frac{q_c(a_t|x_t)}{\pi(a_t|x_t)} - \log \frac{q_d(x_{t+1}|x_t, a_t)}{p(x_{t+1}|x_t, a_t)}\Big], \qquad (4)$$

where $\mathbb{E}_q[\cdot] := \mathbb{E}[\cdot \mid p_0, q_d, q_c]$.
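As a concrete illustration of (3)-(4), the sketch below estimates $\mathcal{J}(q;\pi)$ by Monte Carlo from trajectories sampled under the variational distribution $q$. The `log_*` callables are hypothetical stand-ins for the densities $q_c$, $q_d$, $\pi$, and $p$; this is only a sketch of the estimator, not the implementation used in our experiments.

```python
import numpy as np

def elbo_estimate(trajectories, eta, log_pi, log_qc, log_qd, log_p):
    """Monte Carlo estimate of J(q; pi) in Eq. (4).

    `trajectories` are sampled under q (actions from q_c, next states from q_d);
    each trajectory is a list of (x, a, r, x_next) transitions.
    """
    values = []
    for traj in trajectories:
        total = 0.0
        for (x, a, r, x_next) in traj:
            total += eta * r
            total -= log_qc(a, x) - log_pi(a, x)                  # policy KL term
            total -= log_qd(x_next, x, a) - log_p(x_next, x, a)   # dynamics KL term
        values.append(total)
    return float(np.mean(values))
```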
To maximize $\mathcal{J}(q; \pi)$ w.r.t. $q$ and $\pi$, we first fix $\pi$ and compute the variational distribution (E-step):

$$q^* = (q_c^*, q_d^*) \in \arg\max_{q_c,\, q_d}\ \mathbb{E}\Big[\sum_{t=0}^{T-1} \eta\, r(x_t, a_t) - \log \frac{q_c(a_t|x_t)}{\pi(a_t|x_t)} - \log \frac{q_d(x_{t+1}|x_t, a_t)}{p(x_{t+1}|x_t, a_t)} \ \Big|\ p_0, q_d, q_c\Big], \qquad (5)$$

and then optimize $\pi$ given $q^*$, i.e., $\arg\max_\pi \mathcal{J}(q^*; \pi)$ (M-step). Note that in (5), $q_c^*$ and $q_d^*$ are both functions of $\pi$, but we remove $\pi$ from the notation to keep it lighter.

Remark 1. In our formulation (our choice of the variational distribution $q$), the M-step is independent of the true dynamics, $p$, and thus can be implemented offline (using samples generated by the model $q_d$). Moreover, as we will see in Section 5, we also use the model, $q_d$, in the E-step. As discussed throughout the paper, using simulated samples (from $q_d$) and reducing the need for real samples (from $p$) is an important feature of our proposed model-based formulation and algorithms.

Remark 2 (relationship with MPO). There are similarities between our variational formulation and the one used in the maximum a posteriori policy optimization (MPO) algorithm [Abdolmaleki et al., 2018]. However, MPO sets its variational dynamics, $q_d$, to be the dynamics of the real system, $p$, which results in a model-free algorithm, while our approach is model-based, since we learn $q_d$ and use it to generate samples in both the E-step and M-step of our algorithms.

In the rest of this section, we study the E-step optimization (5) and propose algorithms to solve it.

4.1 Properties of the E-step Optimization

We start by defining two Bellman-like operators related to the E-step optimization (5). For any variational policy $q_c: \mathcal{X} \to \Delta_{\mathcal{A}}$ and any value function $V: \mathcal{X} \to \mathbb{R}$, such that $V(x) = 0,\ \forall x \in \mathcal{X}^0$, we define the $q_c$-induced operator, $T_{q_c}$, and the optimal operator, $T$, as

$$T_{q_c}[V](x) := \mathbb{E}_{a \sim q_c(\cdot|x)}\Big[\eta\, r(x, a) - \log \frac{q_c(a|x)}{\pi(a|x)} + \max_{q_d}\ \mathbb{E}_{x' \sim q_d(\cdot|x,a)}\Big[V(x') - \log \frac{q_d(x'|x,a)}{p(x'|x,a)}\Big]\Big], \qquad (6)$$

$$T[V](x) := \max_{q_c} T_{q_c}[V](x). \qquad (7)$$

We define the optimal value function of the E-step, $V_\pi$, as

$$V_\pi(x) := \mathbb{E}\Big[\sum_{t=0}^{T-1} \eta\, r(x_t, a_t) - \log \frac{q_c^*(a_t|x_t)}{\pi(a_t|x_t)} - \log \frac{q_d^*(x_{t+1}|x_t, a_t)}{p(x_{t+1}|x_t, a_t)} \ \Big|\ x_0 = x,\ q_d^*, q_c^*\Big]. \qquad (8)$$

For any value function $V$, we define its associated action-value function $Q: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ as

$$Q(x, a) := \eta\, r(x, a) + \log \mathbb{E}_{x' \sim p(\cdot|x,a)}\big[\exp\big(V(x')\big)\big]. \qquad (9)$$

The following lemmas, whose proofs are reported in Appendices A.1 and A.2, show properties of the operators $T_{q_c}$ and $T$, and their relation with the (E-step) optimal value function, $V_\pi$.

Lemma 1. The $q_c$-induced and optimal operators, defined by (6) and (7), can be rewritten as

$$T_{q_c}[V](x) = \mathbb{E}_{a \sim q_c(\cdot|x)}\Big[Q(x, a) - \log \frac{q_c(a|x)}{\pi(a|x)}\Big], \qquad (10)$$

$$T[V](x) = \log \mathbb{E}_{a \sim \pi(\cdot|x),\, x' \sim p(\cdot|x,a)}\big[\exp\big(\eta\, r(x, a) + V(x')\big)\big]. \qquad (11)$$

Lemma 2. The $q_c$-induced and optimal operators are monotonic and contractive. Moreover, the optimal value function $V_\pi$ is the unique fixed-point of $T$, i.e., $T[V_\pi](x) = V_\pi(x),\ \forall x \in \mathcal{X}$.

From the definition of the Q-function in (9) and Lemma 2, we prove (in Appendix A.3) the following proposition for the action-value function associated with the E-step optimal value function $V_\pi$.

Proposition 1. The E-step optimal value function $V_\pi$ and its associated action-value function $Q_\pi$, defined by (9), are related as $V_\pi(x) = \log \mathbb{E}_{a \sim \pi(\cdot|x)}\big[\exp\big(Q_\pi(x, a)\big)\big],\ \forall x \in \mathcal{X}$.

In the rest of this section, we show how to derive a closed-form expression for the variational distribution $q^* = (q_c^*, q_d^*)$. For any value function $V$, we define its corresponding variational dynamics, $q_d^V$, as the solution to the maximization problem in the definition of $T_{q_c}$ (see Eq. (6)), i.e.,
$$q_d^V(\cdot|x, a) \in \arg\max_{q_d}\ \mathbb{E}_{x' \sim q_d}\Big[V(x') - \log \frac{q_d(x'|x,a)}{p(x'|x,a)}\Big], \qquad (12)$$

and its corresponding variational policy $q_c^Q$ ($Q$ is the action-value function associated with $V$, defined by Eq. (9)), as the solution to the maximization problem in the definition of $T$ (see Eqs. (7) and (10)), i.e.,

$$q_c^Q(\cdot|x) \in \arg\max_{q_c}\ \mathbb{E}_{a \sim q_c(\cdot|x)}\Big[Q(x, a) - \log \frac{q_c(a|x)}{\pi(a|x)}\Big]. \qquad (13)$$

We now derive (proof in Appendix A.4) closed-form expressions for the variational distributions $q_d^V$ and $q_c^Q$.

Lemma 3. The variational dynamics and policy corresponding to a value function $V$ and its associated action-value function $Q$ can be written in closed form as

$$q_d^V(x'|x, a) = \frac{p(x'|x,a)\, \exp\big(V(x')\big)}{\mathbb{E}_{x'' \sim p(\cdot|x,a)}\big[\exp\big(V(x'')\big)\big]} = \frac{p(x'|x,a)\, \exp\big(V(x')\big)}{\exp\big(Q(x,a) - \eta\, r(x,a)\big)}, \quad \forall x, x' \in \mathcal{X},\ a \in \mathcal{A}, \qquad (14)$$

$$q_c^Q(a|x) = \frac{\pi(a|x)\, \exp\big(Q(x,a)\big)}{\mathbb{E}_{a' \sim \pi(\cdot|x)}\big[\exp\big(Q(x,a')\big)\big]}, \quad \forall x \in \mathcal{X},\ a \in \mathcal{A}. \qquad (15)$$

Equations (14) and (15) show that the variational dynamics, $q_d^V$, and policy, $q_c^Q$, can be seen as an exponential twisting of the dynamics $p$ and the policy $\pi$ with weights $V$ and $Q$, respectively. For the special case $V = V_\pi$ (the E-step optimal value function), these distributions can be written in closed form as

$$q_d^*(x'|x, a) = \frac{p(x'|x,a)\, \exp\big(V_\pi(x')\big)}{\exp\big(Q_\pi(x,a) - \eta\, r(x,a)\big)}, \qquad q_c^*(a|x) = \frac{\pi(a|x)\, \exp\big(Q_\pi(x,a)\big)}{\exp\big(V_\pi(x)\big)}, \qquad (16)$$

where the denominator of $q_c^*$ is obtained by applying Proposition 1, which replaces $\mathbb{E}_{a \sim \pi(\cdot|x)}\big[\exp\big(Q_\pi(x,a)\big)\big]$ with $\exp\big(V_\pi(x)\big)$.

4.2 Policy and Value Iteration for the E-step

Using the results of Section 4.1, we now propose model-based and model-free dynamic programming (DP) style algorithms, i.e., policy iteration (PI) and value iteration (VI), for solving the E-step problem (5). The model-based algorithms compute the variational dynamics, $q_d$, at each iteration, while the model-free counterparts compute $q_d$ only at the end (upon convergence). Having access to $q_d$ at each iteration has the advantage that we may generate samples from the model, $q_d$, when we implement the sample-based version (RL version) of these DP algorithms in Section 5. In the model-based PI algorithm, at each iteration $k$, given the current variational policy $q_c^{(k)}$, we perform:

Policy Evaluation: Compute the $q_c^{(k)}$-induced value function $V_{q_c^{(k)}}$ (the fixed-point of the operator $T_{q_c^{(k)}}$) by iteratively applying $T_{q_c^{(k)}}$ from (6), i.e., $V_{q_c^{(k)}}(x) = \lim_{n \to \infty} T^n_{q_c^{(k)}}[V](x),\ \forall x \in \mathcal{X}$, where the variational model $q_d$ in (6) is computed using (14) with $V = V^{(n)}$. We then compute the corresponding Q-function $Q_{q_c^{(k)}}$ using (9).

Policy Improvement: Update the variational distribution $q_c^{(k+1)}$ using (15) with $Q = Q_{q_c^{(k)}}$.²

Upon convergence, i.e., $q_c^{(\infty)} = q_c^*$, we compute $q_d^*$ from (14). The model-free PI algorithm is exactly the same, except that in its policy evaluation step, the $q_c^{(k)}$-induced operator $T_{q_c^{(k)}}$ is applied using (10) (without the variational dynamics $q_d$). In this case, the variational dynamics is computed only upon convergence, as $q_d^*$, using (14).

² When the number of actions is large, the denominator of (15) cannot be computed efficiently. In this case, we replace (15) in the policy improvement step of our PI algorithms with $q_c^{(k+1)} = \arg\min_{q_c} \mathrm{KL}(q_c \| q_c^Q)$, where $Q = Q_{q_c^{(k)}}$. We also prove the convergence of our PI algorithms with this update in Appendix A.5.

Lemma 4. The model-based and model-free PI algorithms converge to their optimal values, $q_c^*$ and $q_d^*$, defined by (5), i.e., $q_c^{(\infty)} = q_c^*$ and $q_d^{(\infty)} = q_d^*$ (proof in Appendix ??).

We can similarly derive model-based and model-free VI algorithms for the E-step. These algorithms start from an arbitrary value function $V$ and iteratively apply the optimal operator $T$, from (6) and (7) (model-based) or from (11) (model-free), until convergence, i.e., $V_\pi(x) = \lim_{n \to \infty} T^n[V](x),\ \forall x \in \mathcal{X}$. Given $V_\pi$, these algorithms first compute $Q_\pi$ from Proposition 1, and then compute $(q_c^*, q_d^*)$ using (16). From the properties of the optimal operator $T$ in Lemma 2, both the model-based and model-free VI algorithms converge to $q_c^*$ and $q_d^*$.
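To make the E-step DP recursions concrete, here is a small tabular sketch of the model-free VI variant: it iterates the optimal operator $T$ of Eq. (11), computes $Q$ from Eq. (9), and then forms the closed-form variational distributions of Eq. (16). It assumes a finite MDP given as NumPy arrays and the episodic setting of Section 2 (so the exponentiated values stay bounded); it is an illustration, not the implementation used in our experiments.

```python
import numpy as np

def estep_value_iteration(p, r, pi, eta, n_iters=500):
    """Model-free VI for the E-step: iterate T (Eq. 11), then build q_d*, q_c* (Eq. 16).

    p  : array [X, A, X], true dynamics p(x'|x,a)
    r  : array [X, A],    reward r(x,a)
    pi : array [X, A],    baseline policy pi(a|x)
    """
    n_states = p.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # T[V](x) = log E_{a~pi(.|x), x'~p(.|x,a)}[exp(eta*r(x,a) + V(x'))]   (Eq. 11)
        inner = np.exp(eta * r) * np.einsum("xay,y->xa", p, np.exp(V))
        V = np.log(np.einsum("xa,xa->x", pi, inner))
    # Q(x,a) = eta*r(x,a) + log E_{x'~p(.|x,a)}[exp(V(x'))]                   (Eq. 9)
    Q = eta * r + np.log(np.einsum("xay,y->xa", p, np.exp(V)))
    # Closed-form variational dynamics and policy                            (Eq. 16)
    q_d = p * np.exp(V)[None, None, :] / np.exp(Q - eta * r)[:, :, None]
    q_c = pi * np.exp(Q) / np.exp(V)[:, None]
    return V, Q, q_d, q_c
```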
In the rest of the paper, we focus on the PI approach, in particular the model-based one, and only report the details of the VI algorithms in Appendix B. In the next section, we show how the PI algorithms can be implemented and combined with a routine for solving the M-step, when the true MDP model $p$ is unknown (the RL setting) and the state and action spaces are large, requiring function approximation.

5 Variational Model-based RL Algorithm

In this section, we propose an RL algorithm, called variational model-based policy optimization (VMBPO). It is an EM-style algorithm based on the variational formulation proposed in Section 4. The E-step of VMBPO is the sample-based implementation of the model-based PI algorithm described in Section 4.2. We describe the E-step and M-step of VMBPO in detail in Sections 5.1 and 5.2, and report its pseudo-code in Algorithm 1 in Appendix C. VMBPO uses 8 neural networks to represent: the policy $\pi$, variational dynamics $q_d$, variational policy $q_c$, log-likelihood ratio $\nu = \log(q_d/p)$, value function $V$, action-value function $Q$, target value function $V'$, and target action-value function $Q'$, with parameters $\theta_\pi$, $\theta_d$, $\theta_c$, $\theta_\nu$, $\theta_v$, $\theta_q$, $\theta_{v'}$, and $\theta_{q'}$, respectively.

5.1 The E-step of VMBPO

At the beginning of the E-step, we generate a number of samples $(x, a, r, x')$ from the current baseline policy $\pi$, i.e., $a \sim \pi(\cdot|x)$ and $r = r(x, a)$, and add them to the buffer $D$. The E-step consists of four updates: 1) computing the variational dynamics $q_d$, 2) estimating the log-likelihood ratio $\log(q_d/p)$, 3) computing the $q_c$-induced value and action-value functions $V_{q_c}$ and $Q_{q_c}$ (critic update), and finally 4) computing the new variational policy $q_c$ (actor update). We describe the details of each step below.

Step 1 (computing $q_d$). We find $q_d$ as a solution to the optimization problem (12) for $V$ equal to the target value network $V'$. Since $q_d^V$ in (14) is the solution of (12), we compute $q_d$ by minimizing $\mathrm{KL}(q_d^V \| q_d)$, which results in the following forward KL loss (for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$):

$$\theta_d = \arg\min_\theta\ \mathrm{KL}\Big(p(\cdot|x,a)\, \exp\big(\eta\, r(x,a) + V'(\cdot;\theta_{v'}) - Q'(x,a;\theta_{q'})\big)\ \Big\|\ q_d(\cdot|x,a;\theta)\Big) \qquad (17)$$

$$\overset{(a)}{=} \arg\max_\theta\ \mathbb{E}_{x' \sim p(\cdot|x,a)}\Big[\exp\big(\eta\, r(x,a) + V'(x';\theta_{v'}) - Q'(x,a;\theta_{q'})\big)\, \log q_d(x'|x,a;\theta)\Big], \qquad (18)$$

where (a) is by removing the $\theta$-independent terms from (17). We update $\theta_d$ by taking several steps in the direction of the gradient of a sample average of the loss function (18), i.e.,

$$\theta_d = \arg\max_\theta\ \sum_{(x,a,r,x') \in D} \exp\big(\eta\, r + V'(x';\theta_{v'}) - Q'(x,a;\theta_{q'})\big)\, \log q_d(x'|x,a;\theta), \qquad (19)$$

where $(x, a, r, x')$ are randomly sampled from $D$. The intuition here is to focus on learning the dynamics model in the regions of the state-action space that have higher temporal difference (regions with higher anticipated future return).
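A minimal sketch of the weighted maximum-likelihood model update in Eq. (19) is shown below. Here `q_d`, `V_targ`, and `Q_targ` are hypothetical wrappers for the corresponding networks, and the exponential-TD weight is treated as a constant (no gradient), as in a standard target computation; this is an illustration of the loss structure, not our exact implementation.

```python
import torch

def model_loss(batch, q_d, V_targ, Q_targ, eta):
    """Negative of the weighted log-likelihood objective in Eq. (19)."""
    x, a, r, x_next = batch
    with torch.no_grad():
        # Exponential-TD weight: exp(eta*r + V'(x') - Q'(x, a))
        w = torch.exp(eta * r + V_targ(x_next).squeeze(-1) - Q_targ(x, a).squeeze(-1))
    # Maximize the weighted log-likelihood of observed next states under q_d
    return -(w * q_d.log_prob(x_next, x, a)).mean()
```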
Step 2 (computing $\log(q_d/p)$). Using the duality of the $f$-divergence [Nguyen et al., 2008] w.r.t. the reverse KL-divergence, the log-likelihood ratio $\log\big(q_d(\cdot|x,a;\theta_d)/p(\cdot|x,a)\big)$ is a solution to

$$\log \frac{q_d(x'|x,a;\theta_d)}{p(x'|x,a)} = \arg\max_{\nu:\, \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}}\ \mathbb{E}_{x' \sim q_d(\cdot|x,a;\theta_d)}\big[\nu(x'|x,a)\big] - \mathbb{E}_{x' \sim p(\cdot|x,a)}\big[\exp\big(\nu(x'|x,a)\big)\big], \qquad (20)$$

for all $x, x' \in \mathcal{X}$ and $a \in \mathcal{A}$. Note that the optimizer of (20) is unique almost surely (at $(x, a, x')$ with $p(x'|x,a) > 0$), because $q_d$ is absolutely continuous w.r.t. $p$ (see the definition of $q_d$ in Eq. (14)) and the objective function of (20) is strictly concave. The optimization problem (20) allows us to compute $\nu(\cdot|\cdot;\theta_\nu)$ as an approximation to the log-likelihood ratio $\log\big(q_d(\cdot;\theta_d)/p\big)$. We update $\theta_\nu$ by taking several steps in the direction of the gradient of a sample average of (20), i.e.,

$$\theta_\nu = \arg\max_\theta\ \sum_{(x,a,x') \in E} \nu(x'|x,a;\theta) - \sum_{(x,a,x') \in D} \exp\big(\nu(x'|x,a;\theta)\big), \qquad (21)$$

where $E$ is the set of samples for which $x'$ is drawn from the variational dynamics, i.e., $x' \sim q_d(\cdot|x,a)$. Here we first sample $(x, a, x')$ randomly from $D$ and use them in the second sum. Then, for all $(x, a)$ that have been sampled, we generate $x'$ from $q_d$ and use the resulting samples in the first sum.

Step 3 (critic update). To compute $V_{q_c}$ and its action-value function $Q_{q_c}$, we rewrite (6) with the maximizer $q_d$ from Step 1 and the log-likelihood ratio $\log(q_d/p)$ from Step 2:

$$T_{q_c}[V](x) = \mathbb{E}_{a \sim q_c(\cdot|x)}\Big[\eta\, r(x,a) - \log \frac{q_c(a|x)}{\pi(a|x)} + \mathbb{E}_{x' \sim q_d(\cdot|x,a;\theta_d)}\big[V'(x';\theta_{v'}) - \nu(x'|x,a;\theta_\nu)\big]\Big]. \qquad (22)$$

Since $T_{q_c}$ can be written as both (10) and (22), we compute the $q_c$-induced Q-function by setting the right-hand sides of these equations equal to each other, i.e., for all $x \in \mathcal{X}$ and $a \sim q_c(\cdot|x;\theta_c)$,

$$Q(x,a;\theta_q) = \eta\, r(x,a) + \mathbb{E}_{x' \sim q_d(\cdot|x,a;\theta_d)}\big[V'(x';\theta_{v'}) - \nu(x'|x,a;\theta_\nu)\big].$$

Since the expectation in (22) is w.r.t. the variational dynamics (model) $q_d$, we can estimate $Q_{q_c}$ with only samples generated from the model. We do this by taking several steps in the direction of the gradient of a sample average of the square loss obtained by setting both sides of the above equation equal, i.e.,

$$\theta_q = \arg\min_\theta\ \sum_{(x,a,r,x') \in E} \Big(Q(x,a;\theta) - \eta\, r - V'(x';\theta_{v'}) + \nu(x'|x,a;\theta_\nu)\Big)^2. \qquad (23)$$

Note that in (22), the actions are generated by $q_c$. Thus, in (23), we first randomly sample $x$, then sample $a$ from $q_c(\cdot|x;\theta_c)$, and finally draw $x'$ from $q_d(\cdot|x,a;\theta_d)$. If the reward function is known (chosen by the designer of the system), it is used to generate the reward signals $r = r(x,a)$ in (23); otherwise, a reward model has to be learned. After estimating $Q_{q_c}$, we approximate $V_{q_c}$, the fixed-point of $T_{q_c}$, using the form of $T_{q_c}$ in (10), i.e., $T_{q_c}[V](x) \approx V(x) \approx \mathbb{E}_{a \sim q_c(\cdot|x)}\big[Q(x,a;\theta_q) - \log \frac{q_c(a|x;\theta_c)}{\pi(a|x;\theta_\pi)}\big]$. This results in updating $V_{q_c}$ by taking several steps in the direction of the gradient of a sample average of the square loss obtained by setting the two sides of the above equation equal, i.e.,

$$\theta_v = \arg\min_\theta\ \sum_{(x,a)} \Big(V(x;\theta) - Q(x,a;\theta_q) + \log \frac{q_c(a|x;\theta_c)}{\pi(a|x;\theta_\pi)}\Big)^2, \qquad (24)$$

where $x$ is randomly sampled and $a \sim q_c(\cdot|x;\theta_c)$ (without sampling from the true environment).

Step 4 (actor update). We update the variational policy $q_c$ (policy improvement) by solving the optimization problem (13) for the $Q$ estimated by the critic in Step 3. Since the $q_c$ that optimizes (13) can be written as (15), we update it by minimizing $\mathrm{KL}(q_c \| q_c^Q)$. This results in the following reverse KL loss (for all $x \in \mathcal{X}$):

$$\theta_c = \arg\min_\theta\ \mathrm{KL}\Big(q_c(\cdot|x;\theta)\ \Big\|\ \pi(\cdot|x;\theta_\pi)\, \exp\big(Q(x,\cdot;\theta_q)\big)\Big) = \arg\min_\theta\ \mathbb{E}_{a \sim q_c}\Big[\log \frac{q_c(a|x;\theta)}{\pi(a|x;\theta_\pi)} - Q(x,a;\theta_q)\Big].$$

If we reparameterize $q_c$ using a transformation $a = f(x, \epsilon; \theta_c)$, where $\epsilon$ is Gaussian noise, we can update $\theta_c$ by taking several steps in the direction of the gradient of a sample average of the above loss, i.e.,

$$\theta_c = \arg\min_\theta\ \sum_{(x,\epsilon)} \Big[\log q_c\big(f(x,\epsilon;\theta)\,|\,x;\theta\big) - Q(x,a;\theta_q) - \log \pi(a|x;\theta_\pi)\Big], \quad \text{with } a = f(x,\epsilon;\theta). \qquad (25)$$

We can also compute $q_c$ as the closed-form solution to (15), as described in [Abdolmaleki et al., 2018]; they refer to this as a non-parametric representation of the variational distribution.
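The sketch below shows how the critic target of Eq. (23) and the reparameterized actor loss of Eq. (25) could be computed from a single model-generated transition. The names `q_c`, `q_d`, `nu`, `Q`, `V_targ`, `pi`, and `r_fn` are hypothetical wrappers around the corresponding networks and reward function; this is an illustration of the loss structure, not our exact implementation.

```python
import torch

def critic_and_actor_losses(x, q_c, q_d, nu, Q, V_targ, pi, r_fn, eta):
    """Sample-based versions of the critic loss (Eq. 23) and actor loss (Eq. 25),
    computed entirely from model-generated (synthetic) transitions."""
    a = q_c.rsample(x)                      # a = f(x, eps; theta_c), reparameterized
    x_next = q_d.sample(x, a)               # next state from the learned model q_d
    r = r_fn(x, a)                          # known (or separately learned) reward
    # Critic: Q(x,a) should match eta*r + V'(x') - nu(x'|x,a)
    with torch.no_grad():
        target = eta * r + V_targ(x_next).squeeze(-1) - nu(x_next, x, a).squeeze(-1)
    critic_loss = (Q(x, a.detach()).squeeze(-1) - target).pow(2).mean()
    # Actor: reverse-KL loss  log q_c(a|x) - log pi(a|x) - Q(x,a)
    actor_loss = (q_c.log_prob(a, x) - pi.log_prob(a, x) - Q(x, a).squeeze(-1)).mean()
    return critic_loss, actor_loss
```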
5.2 The M-step of VMBPO

As described in Section 4, the goal of the M-step is to improve the baseline policy $\pi$, given the variational model $q^* = (q_c^*, q_d^*)$ learned in the E-step, by solving the following optimization problem:

$$\pi^* \in \arg\max_{\pi \in \Pi}\ \mathcal{J}(q^*; \pi) := \mathbb{E}_{q^*}\Big[\sum_{t=0}^{T-1} \eta\, r(x_t, a_t) - \log \frac{q_c^*(a_t|x_t)}{\pi(a_t|x_t)} - \log \frac{q_d^*(x_{t+1}|x_t, a_t)}{p(x_{t+1}|x_t, a_t)}\Big]. \qquad (26)$$

A nice feature of (26) is that it can be solved using only the variational model $q^*$, without the need for samples from the true environment $p$. However, it is easy to see that if the policy space considered in the M-step, $\Pi$, contains the policy space used for $q_c$ in the E-step, then we can trivially solve the M-step by setting $\pi = q_c^*$. Although this is an option, it is more efficient in practice to solve a regularized version of (26). A practical way to regularize (26) is to make sure that the new baseline policy $\pi$ remains close to the old one, which results in the following optimization problem:

$$\theta_\pi \in \arg\max_\theta\ \mathbb{E}_{q^*}\Big[\sum_{t=0}^{T-1} \log \pi(a_t|x_t; \theta) - \lambda\, \mathrm{KL}\big(\pi(\cdot|x_t;\theta_\pi)\, \big\|\, \pi(\cdot|x_t;\theta)\big)\Big]. \qquad (27)$$

This is equivalent to the weighted MAP formulation used in the M-step of MPO [Abdolmaleki et al., 2018]. In MPO, they define a prior over the parameter $\theta$ and add it as $\log P(\theta)$ to the objective function of (26). Then, they set the prior $P(\theta)$ to a specific Gaussian and obtain an optimization problem similar to (27) (see Section 3.3 in [Abdolmaleki et al., 2018]). However, in the absence of a variational dynamics model (i.e., $q_d = p$), they need real samples to solve their optimization problem, while our model-based approach uses simulated samples.
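A minimal sketch of the regularized M-step update (27) is shown below. States are assumed to come from rollouts of the learned model $(q_c^*, q_d^*)$, so no real-environment samples are needed; `q_c_star`, `pi`, and `pi_old` are hypothetical wrappers for the variational policy, the new baseline policy, and a frozen copy of the old baseline policy, and the sketch is only one way the objective could be implemented.

```python
import torch

def mstep_loss(model_states, q_c_star, pi, pi_old, lam):
    """Negative of the regularized M-step objective in Eq. (27)."""
    x = model_states                       # states from rollouts of (q_c*, q_d*)
    with torch.no_grad():
        a = q_c_star.sample(x)             # actions favored by the variational policy
    # Weighted maximum-likelihood term: push pi toward the variational policy's actions
    nll = -pi.log_prob(a, x).mean()
    # Trust-region term: keep the new baseline policy close to the old one
    kl = torch.distributions.kl_divergence(pi_old.dist(x), pi.dist(x)).mean()
    return nll + lam * kl
```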
6 Experiments

To illustrate the effectiveness of VMBPO, we (i) compare it with several state-of-the-art RL methods, and (ii) evaluate the sample efficiency of MBRL via an ablation analysis.

Comparison with Baseline RL Algorithms. We compare VMBPO with five baselines: two popular model-free algorithms, MPO [Abdolmaleki et al., 2018] and SAC [Haarnoja et al., 2018], and three recent model-based algorithms, MBPO [Janner et al., 2019], PETS [Chua et al., 2019], and STEVE [Buckman et al., 2018]. We also compare VMBPO with its variant that uses a model-free PI algorithm to solve the E-step (see Section 4.2). We refer to this variant as VMBPO-MFE and describe it in detail in Appendix D. We evaluate all the algorithms on a classical control task, Pendulum, and five MuJoCo tasks: Hopper, Walker2D, HalfCheetah, Reacher, and Reacher7DoF. We use similar neural network architectures (for the dynamics model, value functions, and policies) for VMBPO and MBPO. A detailed description of the network architectures and hyper-parameters is reported in Appendix E. Since we use a parametric representation for $q_c$ in the E-step of VMBPO, as discussed in Section 5.2, we simply set $\pi = q_c^*$ in its M-step. We set the number of training steps to 400,000 for the difficult environments (Walker2D, HalfCheetah), to 150,000 for the medium one (Hopper), and to 50,000 for the simpler ones (Pendulum, Reacher, Reacher7DoF). We evaluate policy performance every 1,000 training steps. Each measurement is an average return over 5 episodes, generated with a separate random seed.

To illustrate the relative performance of the algorithms, we report the average return of VMBPO, VMBPO-MFE, and the baselines, with their best hyper-parameters, in Table 1 and Figure 1 (see Appendix E.1). The results show that VMBPO performs better than the baselines in most of the tasks, and usually converges faster even when the final performances are similar. The data-efficiency of VMBPO is mainly the result of using synthetic data generated by the learned model, and its extra performance can be attributed to jointly learning model and policy using a universal objective function. The results also show that VMBPO-MFE outperforms MPO in 4 out of the 6 domains. However, in some cases its learning curve degrades and results in poor final performance. This is partly due to the instability in critic learning caused by sample variance amplification in exponential-TD minimization (see Eq. 39 in Sec. D.1). A way to alleviate this issue is to add a temperature term $\tau$ to the exponential-TD update [Borkar, 2002], although tuning this hyper-parameter is often non-trivial.³

³ The variance is further amplified with a large $\tau$, while critic learning is hampered by a small $\tau$.

| Environment | VMBPO | MBPO | STEVE | PETS | VMBPO-MFE | SAC | MPO |
|---|---|---|---|---|---|---|---|
| Pendulum | -125.8 ± 73.7 | -126.0 ± 78.4 | -6385.3 ± 799.7 | -183.5 ± 1773.9 | -125.7 ± 130.1 | -124.7 ± 199.0 | -131.9 ± 315.9 |
| Hopper | 2897.4 ± 630. | 2403.1 ± 556. | 279.0 ± 237.1 | 94.5 ± 114.2 | 1368.7 ± 184.1 | 2020.8 ± 954.1 | 1509.7 ± 756.0 |
| Walker2D | 4226.1 ± 843.0 | 3883.3 ± 753.5 | 336.3 ± 196.3 | 93.5 ± 134.1 | 3334.5 ± 122.8 | 3026.4 ± 888.9 | 2889.4 ± 712.7 |
| HalfCheetah | 13120 ± 933.1 | 11877 ± 997.1 | 482.9 ± 596.9 | 13272.6 ± 4926.4 | 4647.3 ± 505.8 | 9080.3 ± 1625.1 | 4969.2 ± 623.7 |
| Reacher | -11.4 ± 27.0 | -12.6 ± 25.9 | – | -141.8 ± 355.7 | -55.5 ± 39.0 | -23.9 ± 23.8 | -75.9 ± 336.7 |
| Reacher7DoF | -13.8 ± 20.5 | -15.1 ± 98.8 | – | -45.6 ± 36.1 | -33.5 ± 49.6 | -27.4 ± 112.0 | -38.4 ± 53.8 |

Table 1: The mean ± standard deviation of the final returns with the best hyper-parameter configuration for each algorithm. VMBPO outperforms the other baselines. VMBPO-MFE performs better than MPO but is sometimes unstable.

To study the sensitivity of the algorithms w.r.t. the hyper-parameters, we report their performance averaged over all hyper-parameter/random-seed configurations in Table 2 and Figure 2 (see Appendix E.1). These results show that VMBPO is much more robust to changes in hyper-parameters than the other algorithms, with the best average performance over all the tasks.

| Environment | VMBPO | MBPO | VMBPO-MFE | SAC | MPO |
|---|---|---|---|---|---|
| Pendulum | -147.4 ± 94.1 | -146.8 ± 272.6 | -511.9 ± 384.4 | -146.8 ± 450.6 | -605.2 ± 389.6 |
| Hopper | 2292.5 ± 1256.0 | 1638.7 ± 881.5 | 485.4 ± 389.3 | 1262.2 ± 803.3 | 780.8 ± 629.6 |
| Walker2D | 3326.6 ± 1276.1 | 2977.8 ± 997.3 | 1447.1 ± 767.1 | 1341.6 ± 1092.6 | 1590.3 ± 860.7 |
| HalfCheetah | 10366.7 ± 3477.2 | 7586.1 ± 4814.1 | 2834.6 ± 1062.9 | 6312.0 ± 2299.7 | 3258.2 ± 970.1 |
| Reacher | -13.5 ± 38.7 | -17.5 ± 44.8 | -122.2 ± 507.0 | -77.2 ± 50.6 | -168.2 ± 477.1 |
| Reacher7DoF | -15.2 ± 66.4 | -17.2 ± 101.6 | -78.9 ± 439.1 | -114.2 ± 196.9 | -93.8 ± 426.9 |

Table 2: The mean ± standard deviation of the average of the final returns over all hyper-parameter configurations. VMBPO is much more robust to changes in hyper-parameters than the other baselines. We do not include PETS and STEVE because their hyper-parameters are directly adopted from their papers.

Figure 1: Performance of VMBPO with different numbers of samples generated from the dynamics model $q_d$.

Ablation Study. We study the dependence of the VMBPO performance on the number of samples generated from the dynamics model $q_d$. Here we only experiment with two tasks, Hopper and HalfCheetah, and with fewer training steps, 100,000.
At each step, we update the actor and critic using {128, 256, 512} synthetic samples. Figure 1 shows the learning performance averaged over all hyper-parameter/random-seed configurations and illustrates how synthetic data can help with policy learning. The results show that increasing the amount of synthetic data generally improves the policy convergence rate. In the early phases, when the model is inaccurate, sampling from it may slow down learning, while in the later phases, with an improved model, adding more synthetic data can lead to a more significant performance boost.

7 Conclusion

We formulated the problem of jointly learning and improving model and policy in RL as a variational lower-bound of a log-likelihood and proposed EM-style algorithms to solve it. Our algorithm, called variational model-based policy optimization (VMBPO), uses model-based policy iteration for solving the E-step. We compared our (E-step) model-based and model-free algorithms with each other, and with a number of state-of-the-art model-based (e.g., MBPO) and model-free (e.g., MPO) RL algorithms, and showed its sample efficiency and performance. We briefly discussed VMBPO algorithms in which the E-step is solved by value iteration methods. However, full implementation of these algorithms and studying their relationship with the existing methods requires more work, which we leave for the future. Other future directions are: 1) finding a more efficient implementation for VMBPO, and 2) using VMBPO-style algorithms to solve control problems from high-dimensional observations, by learning a low-dimensional latent space and a latent dynamics, and performing control there. This class of algorithms is referred to as learning controllable embedding [Levine et al., 2020].

References

[Abdolmaleki et al., 2018] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In Proceedings of the 6th International Conference on Learning Representations, 2018.
[Borkar, 2002] Vivek S. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294-311, 2002.
[Buckman et al., 2018] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224-8234, 2018.
[Chebotar et al., 2017] Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. In IEEE International Conference on Robotics and Automation, 2017.
[Chua et al., 2019] Kurtland Chua, Rowan McAllister, Roberto Calandra, and Sergey Levine. Unsupervised exploration with deep model-based reinforcement learning. In International Conference on Learning Representations (ICLR), 2019.
[Dayan and Hinton, 1997] Peter Dayan and Geoffrey E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271-278, 1997.
[Fellows et al., 2019] Matthew Fellows, Anuj Mahajan, Tim G. J. Rudner, and Shimon Whiteson. VIREL: A variational inference framework for reinforcement learning. In Advances in Neural Information Processing Systems, pages 7120-7134, 2019.
[Haarnoja et al., 2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pages 1861-1870, 2018.
[Hachiya et al., 2009] Hirotaka Hachiya, Jan Peters, and Masashi Sugiyama. Efficient sample reuse in EM-based policy search. In Proceedings of the European Conference on Machine Learning, 2009.
[Janner et al., 2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems 32, pages 12519-12530, 2019.
[Kappen et al., 2012] Hilbert J. Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159-182, 2012.
[Levine and Abbeel, 2014] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 2014.
[Levine and Koltun, 2013] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems, 2013.
[Levine et al., 2020] Nir Levine, Yinlam Chow, Rui Shu, Ang Li, Mohammad Ghavamzadeh, and Hung Bui. Prediction, consistency, curvature: Representation learning for locally-linear control. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[Levine, 2018] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv:1805.00909, 2018.
[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[Nachum et al., 2017] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775-2785, 2017.
[Neumann, 2011] Gerhard Neumann. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning, pages 817-824, 2011.
[Nguyen et al., 2008] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems, pages 1089-1096, 2008.
[Peters and Schaal, 2007] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[Peters et al., 2010] Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.
[Rawlik et al., 2013] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of Robotics: Science and Systems, 2013.
[Schulman et al., 2015] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889-1897, 2015.
[Sutton, 1990] Richard Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning, 1990.
[Todorov, 2008] Emanuel Todorov. General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, pages 4286-4292, 2008.
[Toussaint, 2009] Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th International Conference on Machine Learning, pages 1049-1056, 2009.