# Deterministic Value-Policy Gradients

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Qingpeng Cai (Alibaba Group), Ling Pan (IIIS, Tsinghua University), Pingzhong Tang (IIIS, Tsinghua University)
qingpeng.cqp@alibaba-inc.com, pl17@mails.tsinghua.edu.cn, kenshin@tsinghua.edu.cn
The first two authors contributed equally to this work.

## Abstract

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this theoretical guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which different numbers of rollout steps of the analytical gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms the other baselines.

## Introduction

(Silver et al. 2014) propose the deterministic policy gradient (DPG) algorithm, which aims to find an optimal deterministic policy that maximizes the expected long-term reward and lowers the variance when estimating the policy gradient, compared to stochastic policies (Sutton et al. 2000). (Lillicrap et al. 2015) further combine deep neural networks with DPG to improve the modeling capacity and propose the deep deterministic policy gradient (DDPG) algorithm. It is recognized that DDPG has been successful in robotic control tasks (Gu et al. 2016) and dynamic mechanism design (Tang 2017; Cai et al. 2018). Despite the effectiveness of DDPG in these tasks, it suffers from the high sample complexity problem (Schulman et al. 2015).

Deterministic value gradient methods (Werbos 1990; Nguyen and Widrow 1990; Jordan and Rumelhart 1992; Fairbank 2008) compute the policy gradient through back-propagation of the reward along a trajectory predicted by the learned model, which enables better sample efficiency. However, to the best of our knowledge, existing works on deterministic value gradient methods focus only on the finite-horizon setting, which is too myopic and can lead to large bias. Stochastic value gradient (SVG) methods (Heess et al. 2015) use the re-parameterization technique to optimize stochastic policies. Among the class of SVG algorithms, although SVG(1) studies infinite-horizon problems, it only uses a one-step rollout, which limits its efficiency. It also suffers from high variance due to the importance sampling ratio and the randomness of the policy.

In this paper, we study the setting with infinite horizon, where both state transitions and policies are deterministic. (Heess et al.
2015) gives recursive Bellman gradient equations of deterministic value gradients, but the gradient lacks a theoretical guarantee, as the DPG theorem does not hold in the deterministic transition case. We prove that the gradient indeed exists for a certain set of discount factors, and we then derive a closed form of the value gradients. However, the estimation of the deterministic value gradients is much more challenging. The difficulty of computing the gradient mainly comes from its dependency on the gradient of the value function with respect to the state. Such a computation may involve an infinite product of gradients of the transition function and may fail to converge. Thus, applying the Bellman gradient equation recursively may incur high instability.

To overcome these challenges, we use model-based approaches to predict the reward and transition functions. Based on the theoretical guarantee of the closed form of the value gradients in this setting, we propose a class of deterministic value gradient algorithms, DVG(k), with infinite horizon, where k denotes the number of rollout steps. For each choice of k, we use the rewards predicted by the model and the action-value at the (k+1)-th step to estimate the value gradients over the state, in order to reduce the instability of the gradient of the value function over the state. Different numbers of rollout steps maintain a trade-off between the accumulated model bias and the variance of the gradient over the state. The deterministic policy gradient estimator can be viewed as a special case of this class, i.e., it never uses the model to estimate the value gradients, and we refer to it as DVG(0).

As model-based approaches are more sample efficient than model-free algorithms (Li and Todorov 2004; Levine and Koltun 2013), and the model-based deterministic value gradients may incur model bias (Wahlström, Schön, and Deisenroth 2015), we consider an essential question: how can the model-based gradients and the model-free gradients be combined efficiently? We propose a temporal difference method to ensemble gradients with different rollout steps. The intuition is to ensemble different gradient estimators with geometrically decaying weights. Based on this estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. The algorithm updates the policy by stochastic gradient ascent with the ensembled value gradients of the policy, and the weight maintains a trade-off between sample efficiency and performance.

To sum up, the main contributions of the paper are as follows. First, we provide a theoretical guarantee for the existence of the deterministic value gradients in settings with infinite horizon. Second, we propose a novel algorithm that ensembles the deterministic value gradients and the deterministic policy gradients, called deterministic value-policy gradient (DVPG), which effectively combines model-free and model-based methods. DVPG reduces sample complexity and enables faster convergence and performance improvement. Finally, we conduct extensive experiments on standard benchmarks comparing with DDPG, DDPG with model-based rollouts, the stochastic value gradient algorithm SVG(1), and state-of-the-art stochastic policy gradient methods. Results confirm that DVPG significantly outperforms the other algorithms in terms of both sample efficiency and performance.

## Related Work

Model-based algorithms have been widely studied (Moldovan et al. 2015; Montgomery and Levine 2016; Ha and Schmidhuber 2018; Hafner et al. 2018; Chua et al.
2018) in recent years. Model-based methods allow for more efficient computations and faster convergence than model-free methods (Wang and Dietterich 2003; Li and Todorov 2004; Levine and Koltun 2013; Watter et al. 2015). There are two classes of model-based methods. One uses a learned model to perform imagination rollouts that accelerate learning: (Gu et al. 2016; Kurutach et al. 2018) generate synthetic samples with the learned model, and PILCO (Deisenroth and Rasmussen 2011) learns the transition model by Gaussian processes and applies policy improvement on analytic policy gradients. The other uses a learned model to obtain better estimates of action-value functions: the value prediction network (VPN) (Oh, Singh, and Lee 2017) uses the learned transition model to get a better target estimate, and (Feinberg et al. 2018; Buckman et al. 2018) combine different model-based value expansion functions by a TD(k) trick or stochastic distributions to improve the estimator of the action-value function. Different from previous model-based methods, we present a temporal difference method that ensembles model-based deterministic value gradients and model-free policy gradients. Our technique can be combined with both the imagination rollout technique and the model-based value expansion technique.

## Preliminaries

A Markov decision process (MDP) is a tuple $(S, A, p, r, \gamma, p_0)$, where $S$ and $A$ denote the set of states and actions respectively. $p(s_{t+1}|s_t, a_t)$ represents the conditional density from state $s_t$ to state $s_{t+1}$ under action $a_t$. The density of the initial state distribution is denoted by $p_0(s)$. At each time step $t$, the agent interacts with the environment with a deterministic policy $\mu_\theta$. We use $r(s_t, a_t)$ to represent the immediate reward, contributing to the discounted overall rewards from state $s_0$ following $\mu_\theta$, denoted by $J(\mu_\theta) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, \mu_\theta, s_0\right]$. Here, $\gamma \in [0, 1)$ is the discount factor. The Q-function of state $s_t$ and action $a_t$ under policy $\mu_\theta$ is denoted by $Q^{\mu_\theta}(s_t, a_t) = \mathbb{E}\left[\sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k) \,\middle|\, \mu_\theta, s_t, a_t\right]$. The corresponding value function of state $s_t$ under policy $\mu_\theta$ is denoted by $V^{\mu_\theta}(s_t) = Q^{\mu_\theta}(s_t, \mu_\theta(s_t))$. We denote the density at state $s'$ after $t$ time steps from state $s$ following the policy $\mu_\theta$ by $p(s, s', t, \mu_\theta)$, and the discounted state distribution by $\rho^{\mu_\theta}(s') = \int_S \sum_{t=1}^{\infty} \gamma^{t-1} p_0(s)\, p(s, s', t, \mu_\theta)\, ds$. The agent aims to find an optimal policy that maximizes $J(\mu_\theta)$.

## Deterministic Value Gradients

In this section, we study a setting of infinite horizon with deterministic state transitions, which poses challenges for the existence of deterministic value gradients. We first prove that, under proper conditions, the deterministic value gradient does exist. Based on this theoretical guarantee, we then propose a class of practical algorithms that roll out different numbers of steps. Finally, we discuss the difference and connection between our proposed algorithms and existing works.

The deterministic policy gradient (DPG) theorem (Silver et al. 2014) proves the existence of the deterministic policy gradient for MDPs that satisfy the regular condition, which requires the probability density of the next state, $p(s'|s, a)$, to be differentiable in $a$. In the proof of the DPG theorem, the existence of the gradient of the value function is proven first, i.e.,
$$\nabla_\theta V^{\mu_\theta}(s) = \int_S \sum_{t=0}^{\infty} \gamma^t p(s, s', t, \mu_\theta)\, \nabla_\theta \mu_\theta(s')\, \nabla_{a'} Q^{\mu_\theta}(s', a')\big|_{a'=\mu_\theta(s')}\, ds',$$
and then the gradient of the long-term rewards exists. Without this condition, the arguments in the proof of the DPG theorem do not work (please refer to the supplementary material at http://proceedings.mlr.press/v32/silver14supp.pdf), which poses challenges for cases where the differentiability is not satisfied.
Note that this condition does not hold in any case with deterministic transitions. Therefore, a new theoretical guarantee is needed to determine the existence of the gradient of $V^{\mu_\theta}(s)$ with respect to $\theta$ in the deterministic state transition case.

### Deterministic Value Gradient Theorem

We now analyze the gradient of a deterministic policy. Denote by $T(s, a)$ the next state given current state $s$ and action $a$. Without loss of generality, we assume that the transition function $T$ is continuous, differentiable in $s$ and $a$, and bounded. Note that the regular condition is not equivalent to this assumption. Consider a simple example with transition $T(s, a) = s + a$: the gradient of $p(s'|s, a)$ with respect to $a$ is infinite or does not exist, whereas the gradient of $T(s, a)$ with respect to $a$ exists. By definition,
$$\nabla_\theta V^{\mu_\theta}(s) = \nabla_\theta \left[ r(s, \mu_\theta(s)) + \gamma V^{\mu_\theta}(s')\big|_{s'=T(s,\mu_\theta(s))} \right] = \nabla_\theta r(s, \mu_\theta(s)) + \gamma \nabla_\theta V^{\mu_\theta}(s')\big|_{s'=T(s,\mu_\theta(s))} + \gamma \nabla_\theta T(s, \mu_\theta(s)) \nabla_{s'} V^{\mu_\theta}(s'). \quad (1)$$
Therefore, the key to the existence (and estimation) of the gradient of $V^{\mu_\theta}(s)$ with respect to $\theta$ is the existence (and estimation) of $\nabla_s V^{\mu_\theta}(s)$. In Theorem 1, we give a sufficient condition for the existence of $\nabla_s V^{\mu_\theta}(s)$.

**Theorem 1** For any policy $\mu_\theta$, the gradient of the value function with respect to the state, $\nabla_s V^{\mu_\theta}(s)$, exists under two assumptions:
- A.1: The set of states that the policy visits starting from any initial state $s$ is finite.
- A.2: For any initial state $s$, Assumption A.1 implies that there is a periodic loop of visited states. Let $(s_0, s_1, \ldots, s_k)$ denote the loop (note that $s_0$ is not equal to $s_k$) and let $A(s) = \gamma^{k+1} \prod_{i=0}^{k} \nabla_{s_i} T(s_i, \mu_\theta(s_i))$; then the power sum of $A(s)$, $\sum_{m=0}^{\infty} A^m(s)$, converges.

Proof 1. By definition,
$$V^{\mu_\theta}(s) = r(s, \mu_\theta(s)) + \gamma V^{\mu_\theta}(s')\big|_{s'=T(s,\mu_\theta(s))}. \quad (2)$$
Taking the gradient of Eq. (2), we obtain
$$\nabla_s V^{\mu_\theta}(s) = \nabla_s r(s, \mu_\theta(s)) + \gamma \nabla_s T(s, \mu_\theta(s)) \nabla_{s'} V^{\mu_\theta}(s')\big|_{s'=T(s,\mu_\theta(s))}. \quad (3)$$
Unrolling Eq. (3) for infinitely many steps, we get
$$\nabla_s V^{\mu_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) \nabla_{s_t} r(s_t, \mu_\theta(s_t)), \quad (4)$$
where $g(s, t, \mu_\theta) = \prod_{i=0}^{t-1} \nabla_{s_i} T(s_i, \mu_\theta(s_i))$, $s_0 = s$, and $s_i$ is the state after $i$ steps following policy $\mu_\theta$. With Assumption A.1, we rewrite Eq. (4) using the indicator function $I(s, s', t, \mu_\theta)$, which indicates whether $s'$ is reached after $t$ steps from the initial state $s$ following the policy $\mu_\theta$:
$$\nabla_s V^{\mu_\theta}(s) = \sum_{s' \in B(s,\theta)} \sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) I(s, s', t, \mu_\theta) \nabla_{s'} r(s', \mu_\theta(s')), \quad (5)$$
where $B(s, \theta)$ is the set of states the policy visits from $s$.

We now prove that for any $\mu_\theta$, $s$, and $s'$, the infinite sum of gradients $\sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) I(s, s', t, \mu_\theta)$ converges. For each state $s'$, there are three cases during the process from the initial state $s$ with infinitely many steps:
1. Never visited: $\sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) I(s, s', t, \mu_\theta) = 0$.
2. Visited once: let $t_{s'}$ denote the number of steps it takes to reach the state $s'$; then $\sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) I(s, s', t, \mu_\theta) = \gamma^{t_{s'}} g(s, t_{s'}, \mu_\theta)$.
3. Visited infinitely many times: let $t_1$ denote the number of steps it takes to reach $s'$ for the first time. The state $s'$ is then revisited every $k+1$ steps after the previous visit. By definition,
$$\sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) I(s, s', t, \mu_\theta) = \sum_{a=0}^{\infty} \gamma^{t_1} g(s, t_1, \mu_\theta) A^a(s). \quad (6)$$
By Assumption A.2, Eq. (6) converges.

By exchanging the order of the limit and the summation,
$$\nabla_s V^{\mu_\theta}(s) = \sum_{s' \in B(s,\theta)} \left( \sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) I(s, s', t, \mu_\theta) \right) \nabla_{s'} r(s', \mu_\theta(s')).$$
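To make the unrolled form in Eq. (4) concrete, the following toy sketch is entirely our own illustration and not from the paper: it evaluates the Jacobian-product series for deterministic linear dynamics with a linear policy (the names A, B, K and the finite-difference check are assumptions) and compares it with a numerical gradient. The example does not literally satisfy Assumption A.1, since its states never revisit a finite loop, but it shows how a contraction condition in the spirit of Assumption A.2 (here, $\gamma$ times the norm of the closed-loop Jacobian being below one) makes the series converge.

```python
import numpy as np

# Toy deterministic system: s' = A s + B a, policy a = K s, reward r(s, a) = -(s.s + a.a).
# grad_s V(s0) is computed by the unrolled form of Eq. (4):
#   sum_t gamma^t (ds_t/ds_0)^T grad_{s_t} r(s_t, mu(s_t)).
rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.95
A = 0.9 * np.eye(n_s) + 0.05 * rng.standard_normal((n_s, n_s))
B = 0.1 * rng.standard_normal((n_s, n_a))
K = 0.1 * rng.standard_normal((n_a, n_s))
M = A + B @ K                                   # closed-loop Jacobian ds_{t+1}/ds_t

def grad_r(s):
    # total gradient of r(s, K s) with respect to s
    return -2.0 * s - 2.0 * K.T @ (K @ s)

def grad_s_V(s0, horizon=2000):
    g, s, J = np.zeros(n_s), s0.copy(), np.eye(n_s)   # J accumulates ds_t/ds_0
    for t in range(horizon):
        g += gamma ** t * (J.T @ grad_r(s))
        s, J = M @ s, M @ J
    return g

def V(s, horizon=2000):
    total, st = 0.0, s.copy()
    for t in range(horizon):
        a = K @ st
        total += gamma ** t * (-(st @ st) - a @ a)
        st = M @ st
    return total

s0 = rng.standard_normal(n_s)
print("gamma * spectral radius:", gamma * max(abs(np.linalg.eigvals(M))))  # < 1 here
print("unrolled grad_s V:", grad_s_V(s0))
fd = np.array([(V(s0 + e) - V(s0 - e)) / 2e-5
               for e in 1e-5 * np.eye(n_s)])          # finite-difference check
print("finite differences:", fd)
```

When the printed contraction factor is below one, the unrolled series and the finite-difference estimate agree closely; making the dynamics more expansive (or $\gamma$ larger) breaks the convergence, which is exactly the failure mode the assumptions rule out.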
Assumption A.1 theoretically guarantees the existence of the stationary distribution of states. In fact, it holds on most continuous tasks, e.g., InvertedPendulum-v2 in MuJoCo. We directly test a deterministic policy with a 2-layer fully connected network on this environment for 10,000 episodes (we test different weights; the observation of a finite visited state set is very common across weights) and count the number of times each state is visited. After projecting the data into 2D space by t-SNE (Maaten and Hinton 2008), we obtain the state visitation density contour shown in Figure 1. We have two interesting findings: (1) the set of states visited by the policy is finite; (2) many states are visited multiple times, which justifies Assumption A.1.

Figure 1: State visitation density contour on InvertedPendulum-v2.

By the analysis of Assumption A.2, we get that for any policy and state, there exists a set of discount factors such that the gradient of the value function over the state exists, as illustrated in Corollary 1.

**Corollary 1** For any policy $\mu_\theta$ and any initial state $s$, let $(s_0, s_1, \ldots, s_k)$ denote the loop of states visited following the policy from the state, and let $C(s, \mu_\theta, k) = \prod_{i=0}^{k} \nabla_{s_i} T(s_i, \mu_\theta(s_i))$. The gradient of the value function with respect to the state, $\nabla_s V^{\mu_\theta}(s)$, exists if $\gamma^{k+1} \max\{\|C(s, \mu_\theta, k)\|, \|C(s, \mu_\theta, k)\|_1\} < 1$.

In Theorem 2, we show that the deterministic value gradients exist and obtain their closed form based on the analysis in Theorem 1.

**Theorem 2 (Deterministic Value Gradient Theorem)** For any policy $\mu_\theta$ and any MDP with deterministic state transitions, if Assumptions A.1 and A.2 hold, the value gradients exist, and
$$\nabla_\theta V^{\mu_\theta}(s) = \sum_{s' \in B(s,\theta)} \rho^{\mu_\theta}(s, s')\, \nabla_\theta \mu_\theta(s') \left( \nabla_{a'} r(s', a') + \gamma \nabla_{a'} T(s', a') \nabla_{s''} V^{\mu_\theta}(s'')\big|_{s''=T(s',a')} \right), \quad (7)$$
where $a'$ is the action the policy takes at state $s'$, and $\rho^{\mu_\theta}(s, s')$ is the discounted state distribution starting from state $s$ under the policy, defined as $\rho^{\mu_\theta}(s, s') = \sum_{t=1}^{\infty} \gamma^{t-1} I(s, s', t, \mu_\theta)$.

### Deterministic Value Gradient Algorithm

Value gradient methods estimate the gradient of the value function recursively (Fairbank and Alonso 2012):
$$\nabla_\theta V^{\mu_\theta}(s) = \nabla_\theta r(s, \mu_\theta(s)) + \gamma \nabla_\theta T(s, \mu_\theta(s)) \nabla_{s'} V^{\mu_\theta}(s') + \gamma \nabla_\theta V^{\mu_\theta}(s'), \quad (8)$$
$$\nabla_s V^{\mu_\theta}(s) = \nabla_s r(s, \mu_\theta(s)) + \gamma \nabla_s T(s, \mu_\theta(s)) \nabla_{s'} V^{\mu_\theta}(s')\big|_{s'=T(s,\mu_\theta(s))}. \quad (9)$$
In fact, there are two kinds of approaches for estimating the gradient of the value function over the state, i.e., infinite and finite. On the one hand, directly estimating the gradient of the value function over the state recursively by Eq. (9) infinitely many times is slow to converge. On the other hand, estimating the gradient with a finite horizon, as in traditional value gradient methods (Werbos 1990; Nguyen and Widrow 1990; Heess et al. 2015), may introduce a large bias in the gradient.

We set out to estimate the action-value function, denoted by $Q^w(s, a)$ with parameter $w$, and replace $\nabla_{s'} V^{\mu_\theta}(s')$ by $\nabla_{s'} Q^w(s', \mu_\theta(s'))$ in Eq. (8). In this way, we directly obtain a 1-step estimator of the value gradients,
$$G_1(\mu_\theta, s) = \nabla_\theta r(s, \mu_\theta(s)) + \gamma \nabla_\theta T(s, \mu_\theta(s)) \nabla_{s_1} Q^w(s_1, \mu_\theta(s_1)) + \gamma G_1(\mu_\theta, s_1), \quad (10)$$
where $s_1$ is the next state of $s$. This can be generalized to $k$ ($k \ge 2$) rollout steps. Let $s_i$ denote the state visited by the policy at the $i$-th step starting from the initial state $s_0$, and let $g(s, t, \mu_\theta, T) = \prod_{i=1}^{t-1} \nabla_{s_i} T(s_i, \mu_\theta(s_i))$. We choose to roll out $k-1$ steps to collect rewards, then replace $\nabla_{s_k} V^{\mu_\theta}(s_k)$ by $\nabla_{s_k} Q^w(s_k, \mu_\theta(s_k))$ in Eq. (9), and we get
$$L_k(\mu_\theta, s, r, T) = \sum_{t=1}^{k-1} \gamma^{t-1} g(s, t, \mu_\theta, T) \nabla_{s_t} r(s_t, \mu_\theta(s_t)) + \gamma^{k-1} g(s, k, \mu_\theta, T) \nabla_{s_k} Q^w(s_k, \mu_\theta(s_k)).$$
Replacing $\nabla_{s'} V^{\mu_\theta}(s')$ with $L_k(\mu_\theta, s, r, T)$ in Eq. (8), we get a $k$-step estimator of the value gradients:
$$G_k(\mu_\theta, s) = \nabla_\theta r(s, \mu_\theta(s)) + \gamma \nabla_\theta T(s, \mu_\theta(s)) L_k(\mu_\theta, s, r, T) + \gamma G_k(\mu_\theta, s_1). \quad (11)$$
It is easy to see that $G_k(\mu_\theta, s)$ and $G_1(\mu_\theta, s)$ are the same if we have the true reward and transition functions, which is generally not the case, as we need to learn the model in practical environments.
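To illustrate how $L_k$ can be computed in practice, here is a minimal PyTorch-style sketch (our own illustration under assumed module names `policy`, `reward_hat`, `trans_hat`, and `critic`; it is not the authors' code). It rolls the learned reward and transition models forward from the next state, closes the rollout with the critic $Q^w$, and lets automatic differentiation produce the Jacobian products $g(s, t, \mu_\theta, T)$ by backpropagating the predicted return to the state.

```python
import torch

def l_k_estimate(s1, policy, reward_hat, trans_hat, critic, k, gamma):
    """Sketch of L_k: a k-step, model-based estimate of grad_{s1} V^{mu_theta}(s1).

    Rolls the learned reward/transition models k-1 steps from the next state s1,
    closes with the critic Q^w at step k, and backpropagates the predicted return
    to s1; autograd realizes the Jacobian products g(s, t, mu_theta, T).
    """
    s1 = s1.detach().clone().requires_grad_(True)       # differentiate w.r.t. the state
    s_t, value = s1, torch.zeros(s1.shape[0], device=s1.device)
    for t in range(k - 1):                               # k-1 predicted reward terms
        a_t = policy(s_t)
        value = value + gamma ** t * reward_hat(s_t, a_t).squeeze(-1)
        s_t = trans_hat(s_t, a_t)
    value = value + gamma ** (k - 1) * critic(s_t, policy(s_t)).squeeze(-1)
    return torch.autograd.grad(value.sum(), s1)[0]       # batch of grad_{s1} estimates
```

With $k = 1$ the loop is skipped and the function reduces to the critic-only term of Eq. (10); larger $k$ pushes more of the estimate through the learned model, which is exactly the bias-variance trade-off discussed above.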
We estimate the recursive value gradient form in the same way as the implementation of the DDPG algorithm. Let $D_k(\mu_\theta, s, T, r)$ denote the value gradient at the sampled state $s$ with $k$ rollout steps, on the true transition function $T$ and reward function $r$, defined as
$$D_k(\mu_\theta, s, T, r) = \nabla_\theta r(s, \mu_\theta(s)) + \gamma \nabla_\theta T(s, \mu_\theta(s)) L_k(\mu_\theta, s, r, T). \quad (12)$$
To be specific, the recursive value gradient form $G_k(\mu_\theta, s_0)$ can be rewritten as $\sum_{j=0}^{\infty} \gamma^j D_k(\mu_\theta, s_j, T, r)$. Let $\rho(s_0, s)$ denote the discounted state distribution from state $s_0$; then $\sum_{j=0}^{\infty} \gamma^j D_k(\mu_\theta, s_j, T, r)$ can be rewritten as $\sum_{s} \rho(s_0, s) D_k(\mu_\theta, s, T, r)$.

We then propose the deterministic value gradient algorithm with infinite horizon, shown in Algorithm 1. As the model is not given, we choose to predict the reward function and the transition function. Given $n$ samples $(s_j, a_j, r_j, s_{j+1})$, for each choice of $k$, we employ the empirical state distribution to approximate the discounted state distribution. With the learned transition function $T'$ and reward function $r'$, we use $\frac{1}{n}\sum_{j} D_k(\mu_\theta, s_j, T', r')$ to update the current policy. We choose to use experience replay to compare with the DDPG algorithm fairly. Different choices of the number of rollout steps trade off between the variance and the bias: a larger number of steps means lower variance of the value gradients but larger bias due to the accumulated model error.

Algorithm 1: The DVG(k) algorithm
1: Initialize the reward network $r'$, transition network $T'$, critic network $Q$, actor network $\mu_\theta$, target networks $Q'$ and $\mu'$, and the experience replay buffer $B$
2: for episode $= 0, \ldots, N-1$ do
3: for $t = 1, \ldots, T$ do
4: Select an action according to the current policy and exploration noise
5: Execute action $a_t$, observe reward $r_t$ and new state $s_{t+1}$, and store transition $(s_t, a_t, r_t, s_{t+1})$ in $B$
6: Sample a random minibatch of $n$ transitions from $B$
7: Update the critic $Q$ by minimizing the TD error $\frac{1}{n}\sum_{j=1}^{n} \left(r_j + \gamma Q'(s_{j+1}, \mu_\theta(s_{j+1})) - Q(s_j, a_j)\right)^2$
8: Update the reward network $r'$ and the transition network $T'$ on the batch by minimizing the squared loss
9: Estimate the value gradients by $\frac{1}{n}\sum_{j} D_k(\mu_\theta, s_j, T', r')$ and perform a gradient update on the policy
10: Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\theta^{Q'}$; $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\theta^{\mu'}$
11: end for
12: end for

### The Difference between Infinite and Finite Horizon

In this section, we discuss the advantage of our proposed DVG algorithm over the finite-horizon variant and validate the effect on a continuous control task. The estimator of deterministic value gradients with a finite horizon, DVGF, is defined as (Fairbank and Alonso 2012):
$$F_k(\mu_\theta, s) = \nabla_\theta r(s, \mu_\theta(s)) + \gamma \nabla_\theta T(s, \mu_\theta(s)) \sum_{t=1}^{k} \gamma^{t-1} g(s, t, \mu_\theta, T) \nabla_{s_t} r(s_t, \mu_\theta(s_t)) + \gamma F_k(\mu_\theta, s_1).$$
Note that $F_k(\mu_\theta, s)$ does not take rewards after the $k$-th step into consideration. Therefore, given $n$ samples $\{(s_j, a_j, r_j, s_{j+1})\}$, DVGF uses the sample mean of $D'_k(\mu_\theta, s, T', r')$ to update the policy, where $D'_k(\mu_\theta, s, T', r')$ is defined as
$$D'_k(\mu_\theta, s, T', r') = \nabla_\theta r'(s, \mu_\theta(s)) + \gamma \nabla_\theta T'(s, \mu_\theta(s)) \sum_{t=1}^{k} \gamma^{t-1} g(s, t, \mu_\theta, T') \nabla_{s'_t} r'(s'_t, \mu_\theta(s'_t)).$$
We then test the two approaches on the environment HumanoidStandup-v2, where we choose the parameter $k$ to be 2 (for the choice of $k$, we test DVGF with steps ranging from 1 to 5 and choose the best-performing value for a fair comparison). As shown in Figure 2, DVG significantly outperforms DVGF, which validates our claim that considering only a finite horizon fails to achieve the same performance as the infinite horizon.

Figure 2: Comparisons of DVG and DVGF on HumanoidStandup-v2 (return vs. environment steps).
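For concreteness, the following is a minimal sketch of the policy update in line 9 of Algorithm 1 (our own illustration under assumed PyTorch module names; not the authors' implementation). It builds a scalar surrogate whose gradient with respect to the policy parameters approximates the sample mean of $D_k(\mu_\theta, s_j, T', r')$: only the first action depends on $\theta$, while later steps use a parameter-frozen copy of the policy so that backpropagation carries only the state-Jacobian chain of $L_k$.

```python
import copy
import torch

def dvg_k_update(batch_s, policy, policy_opt, reward_hat, trans_hat, critic, k, gamma):
    """Sketch of Algorithm 1, line 9: ascend (1/n) sum_j D_k(mu_theta, s_j, T', r').

    The surrogate is r'(s, mu(s)) + sum_{t=1}^{k-1} gamma^t r'(s_t, mu(s_t))
    + gamma^k Q^w(s_k, mu(s_k)), where only the first action depends on theta.
    """
    frozen = copy.deepcopy(policy)
    for p in frozen.parameters():
        p.requires_grad_(False)            # gradients still flow through the inputs

    a0 = policy(batch_s)                   # the only theta-dependent action
    ret = reward_hat(batch_s, a0).squeeze(-1)
    s_t = trans_hat(batch_s, a0)
    for t in range(1, k):
        a_t = frozen(s_t)
        ret = ret + gamma ** t * reward_hat(s_t, a_t).squeeze(-1)
        s_t = trans_hat(s_t, a_t)
    ret = ret + gamma ** k * critic(s_t, frozen(s_t)).squeeze(-1)

    loss = -ret.mean()                     # gradient ascent on the k-step value estimate
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```

In practice one would reuse a target policy network rather than deep-copying the policy on every update; the copy is only for clarity in this sketch.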
### Connection and Comparison of DVG and DDPG

By the proof of the DPG theorem in (Silver et al. 2014), Eq. (8) can be rewritten as
$$\nabla_\theta V^{\mu_\theta}(s) = \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s, a)\big|_{a=\mu_\theta(s)} + \gamma \nabla_\theta V^{\mu_\theta}(s_1). \quad (13)$$
The DDPG algorithm uses the gradient of the estimator of the Q-function over the action, $\nabla_a Q^w(s, a)$, to estimate $\nabla_a Q^{\mu_\theta}(s, a)$, i.e., $G_0(\mu_\theta, s) = \nabla_\theta \mu_\theta(s) \nabla_a Q^w(s, a)\big|_{a=\mu_\theta(s)} + \gamma G_0(\mu_\theta, s_1)$. The DDPG algorithm is a model-free algorithm that does not predict the reward or the transition, and it can be viewed as the DVG(0) algorithm.

We compare the DVG algorithm with different rollout steps $k$ and DDPG on a continuous control task in MuJoCo, Hopper-v2. From Figure 3, we see that DVG with any choice of the number of rollout steps is more sample efficient than DDPG, which validates the power of model-based techniques. DVG(1) outperforms DDPG and DVG with other numbers of rollout steps in terms of performance, as it trades off well between the bias and the variance of the value gradients. Note that with a larger number of steps, DVG(5) is not stable due to the propagated model error.

Figure 3: Comparisons of DVG with DDPG (Hopper-v2; return vs. environment steps).

## The DVPG Algorithm

As discussed above, the model-based DVG algorithm is more sample efficient than the model-free DDPG algorithm. However, it suffers from model bias, which results in performance loss. In this section, we consider ensembling these different gradient estimators for better performance. Motivated by the idea of the TD(λ) algorithm (Sutton and Barto 2018), which ensembles the TD(k) errors with a geometrically decaying weight $\lambda$, we propose a temporal-difference method to ensemble DVG with varying rollout steps and the model-free deterministic policy gradients. We define the temporal-difference deterministic value gradients as
$$G_{\lambda,t}(\mu_\theta, s) = (1-\lambda) \sum_{k=0}^{t} \lambda^k G_k(\mu_\theta, s),$$
where $t$ denotes the maximal number of rollout steps by the learned model. For the gradient update rule, we also apply sample-based methods: given $n$ samples $\{(s_j, a_j, r_j, s_{j+1})\}$, we use
$$\frac{1}{n}\sum_{j} \left( (1-\lambda)\, \nabla_\theta \mu_\theta(s_j) \nabla_a Q^w(s_j, a)\big|_{a=\mu_\theta(s_j)} + (1-\lambda) \sum_{k=1}^{t} \lambda^k D_k(\mu_\theta, s_j, T', r') \right) \quad (14)$$
to update the policy. Based on this ensembled deterministic value-policy gradient, we propose the deterministic value-policy gradient algorithm, shown in Algorithm 2 (the only difference between the DVG(k) algorithm and the DVPG algorithm is the update rule of the policy).

Algorithm 2: The DVPG algorithm
1: Initialize the weight $\lambda$ and the maximal number of rollout steps $t$
2: Initialize the reward network $r'$, transition network $T'$, critic network $Q$, actor network $\mu_\theta$, target networks $Q'$ and $\mu'$, and the experience replay buffer $B$
3: for episode $= 0, \ldots, N-1$ do
4: for $t = 1, \ldots, T$ do
5: Select an action according to the current policy and exploration noise
6: Execute action $a_t$, observe reward $r_t$ and new state $s_{t+1}$, and store transition $(s_t, a_t, r_t, s_{t+1})$ in $B$
7: Sample a random minibatch of $n$ transitions from $B$
8: Update the critic $Q$ by minimizing the TD error $\frac{1}{n}\sum_{j=1}^{n} \left(r_j + \gamma Q'(s_{j+1}, \mu_\theta(s_{j+1})) - Q(s_j, a_j)\right)^2$
9: Update the reward network $r'$ and the transition network $T'$ on the batch by minimizing the squared loss
10: Estimate the value gradients by Eq. (14) and perform a gradient update on the policy
11: Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\theta^{Q'}$; $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\theta^{\mu'}$
12: end for
13: end for

## Experimental Results

We design a series of experiments to evaluate DVG and DVPG. We investigate the following aspects: (1) What is the effect of the discount factor on DVG? (2) How sensitive is DVPG to the hyper-parameters? (3) How does DVPG compare with state-of-the-art methods?
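Throughout the experiments, the DVPG policy update follows Eq. (14) (line 10 of Algorithm 2). The sketch below is our own illustration of that ensembled update under assumed PyTorch module names (`policy`, `reward_hat`, `trans_hat`, `critic`); it mixes the model-free DPG surrogate (the $k = 0$ term) with the model-based $k$-step surrogates using weights $(1-\lambda)\lambda^k$, and it is not the authors' implementation.

```python
import copy
import torch

def dvpg_update(batch_s, policy, policy_opt, reward_hat, trans_hat, critic,
                max_steps, lam, gamma):
    """Sketch of the DVPG update of Eq. (14) (Algorithm 2, line 10)."""
    frozen = copy.deepcopy(policy)
    for p in frozen.parameters():
        p.requires_grad_(False)

    def k_step_surrogate(k):
        if k == 0:                                    # model-free DPG / DVG(0) term
            return critic(batch_s, policy(batch_s)).squeeze(-1)
        a0 = policy(batch_s)                          # theta-dependent first action
        ret = reward_hat(batch_s, a0).squeeze(-1)
        s_t = trans_hat(batch_s, a0)
        for t in range(1, k):
            a_t = frozen(s_t)
            ret = ret + gamma ** t * reward_hat(s_t, a_t).squeeze(-1)
            s_t = trans_hat(s_t, a_t)
        return ret + gamma ** k * critic(s_t, frozen(s_t)).squeeze(-1)

    surrogate = sum((1 - lam) * lam ** k * k_step_surrogate(k)
                    for k in range(max_steps + 1))
    policy_opt.zero_grad()
    (-surrogate.mean()).backward()
    policy_opt.step()
```

Smaller values of `lam` concentrate the update on the model-free term, which is consistent with the ablation below where $\lambda = 0.1$ works best.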
We evaluate DVPG on a number of continuous control benchmark tasks in OpenAI Gym based on the MuJoCo simulator. We compare DVPG with DDPG, DVG, DDPG with imagination rollouts (DDPG(model)), and SVG with a one-step rollout and experience replay (SVG(1)) in the main text. We also compared DVPG with methods using stochastic policies, e.g., ACKTR and TRPO, and DVPG outperforms both significantly; we omit these results due to space limitations. We plot the averaged episode rewards over 5 different random seeds against the number of real samples, and the shaded region represents the 75% confidence interval. We choose the same hyperparameters of the actor and critic networks for all algorithms. The prediction models of DVPG, DVG, and DDPG(model) are the same.

### The Effect of Discount Factors on DVG

From Eq. (9), we see that $\nabla_s V^{\mu_\theta}(s)$ is an infinite sum of gradient vectors. To study the effect of the discount factor on DVG, we train the algorithm with 2 rollout steps and different values of the discount factor on the environment InvertedPendulum-v2. As shown in Figure 5, 0.95 performs the best in terms of rewards and stability, 0.85 and 0.99 perform comparably, and the performance of 0.8 and 0.6 is inferior to the other values. This is because the computation of the gradient of the value function over the state may converge slowly if the discount factor is close to 1, while a smaller value of $\gamma$ may enable better convergence of $\nabla_s V^{\mu_\theta}(s)$. However, the sum of rewards discounted by a too small $\gamma$ is too myopic and fails to perform well. Here, 0.95 trades off well between stability and performance, which matches the expectation that there exists an optimal intermediate value of the discount factor.

Figure 5: The effect of discount factors ($\gamma \in \{0.6, 0.8, 0.85, 0.95, 0.99\}$) on InvertedPendulum-v2 (return vs. environment steps).

### Ablation Study of DVPG

We evaluate the effect of the weight of bootstrapping on DVPG with different values from 0.1 to 0.9, where the number of rollout steps is set to 4. From Figure 6, we see that the performance of DVPG decreases as the value of $\lambda$ increases, with 0.1 performing the best in terms of sample efficiency and performance. Thus, we choose the value of the weight to be 0.1 in all experiments.

Figure 6: The weight of bootstrapping ($\lambda \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$) for DVPG on HalfCheetah-v2 (return vs. environment steps).

We evaluate the effect of the number of rollout steps ranging from 1 to 5. Results in Figure 7 show that DVPG with different numbers of rollout steps all succeed in learning a good policy, with 1 rollout step performing the best. Indeed, the number of rollout steps trades off between the model error and the variance. There is an optimal number of rollout steps for each environment, which is the only parameter we tune.

Figure 4: Performance comparisons of DVPG, DVG, DDPG, DDPG(Model), and SVG(1) on MuJoCo environments (panels include HumanoidStandup-v2, HalfCheetah-v2, and Humanoid-v2; return vs. environment steps).
To summarize, for the number of rollout steps, 1 rollout step works the best on Humanoid-v2, Swimmer-v2, and HalfCheetah-v2, while 2 rollout steps perform the best on HumanoidStandup-v2, Hopper-v2, and Ant-v2. For fair comparisons, we choose the same number of rollout steps for both the DVG and the DVPG algorithms.

Figure 7: The number of rollout steps ($t \in \{1, 2, 3, 4, 5\}$) for DVPG on HalfCheetah-v2 (return vs. environment steps).

### Performance Comparisons

In this section we compare DVPG with the model-free baseline DDPG and with model-based baselines including DVG, DDPG(model), and SVG(1) on several continuous control tasks in MuJoCo. As shown in Figure 4, there are two classes of comparisons.

First, we compare DVPG with DDPG and DVG to validate the effect of the temporal difference technique that ensembles model-based and model-free deterministic value gradients. The DVG algorithm is more sample efficient than the other algorithms on HumanoidStandup-v2 and Hopper-v2. In terms of sample efficiency, DVPG outperforms DDPG as it trades off between the model-based deterministic value gradients and the model-free deterministic policy gradients. At the end of training, DVPG outperforms the other two algorithms significantly, which demonstrates the power of the temporal difference technique. In the other four environments, DVPG outperforms the other algorithms in terms of both sample efficiency and performance. The performance of DVG and DDPG on Swimmer-v2 and Ant-v2 is comparable, while DVG performs poorly on HalfCheetah-v2 and Humanoid-v2 due to model error.

Second, we compare DVPG with SVG(1) and DDPG with imagination rollouts. Results show that the DVPG algorithm significantly outperforms these two model-based algorithms in terms of sample efficiency and performance, especially in environments where the other model-based algorithms do not achieve better performance than the model-free DDPG algorithm. As for the SVG(1) algorithm, it fails to learn good policies on Ant-v2, which is also reported in (Kurutach et al. 2018).

## Conclusion

Due to the high sample complexity of the model-free DDPG algorithm and the high bias of deterministic value gradients with a finite horizon, we study deterministic value gradients with infinite horizon. We prove the existence of the deterministic value gradients for a certain set of discount factors in this infinite-horizon setting. Based on this theoretical guarantee, we propose the DVG algorithm with different numbers of rollout steps through the learned model. We then propose a temporal difference method that ensembles deterministic value gradients and deterministic policy gradients, called the DVPG algorithm, to trade off between the bias due to model error and the variance of the model-free policy gradients. We compare DVPG with other methods on several continuous control benchmarks. Results show that DVPG substantially outperforms the other baselines in terms of convergence and performance. For future work, it is promising to apply the temporal difference technique presented in this paper to other model-free algorithms and to combine it with other model-based techniques.

## Acknowledgments

The work by Ling Pan was supported in part by the National Natural Science Foundation of China Grant 61672316. Pingzhong Tang was supported in part by the National Natural Science Foundation of China Grant 61561146398 and a China Youth 1000-talent program.

## References
Buckman, J.; Hafner, D.; Tucker, G.; Brevdo, E.; and Lee, H. 2018. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, 8234–8244.
Cai, Q.; Filos-Ratsikas, A.; Tang, P.; and Zhang, Y. 2018. Reinforcement mechanism design for e-commerce. In Proceedings of the 2018 World Wide Web Conference, 1339–1348. International World Wide Web Conferences Steering Committee.
Chua, K.; Calandra, R.; McAllister, R.; and Levine, S. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 4754–4765.
Deisenroth, M., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 465–472.
Fairbank, M., and Alonso, E. 2012. Value-gradient learning. In Neural Networks (IJCNN), The 2012 International Joint Conference on, 1–8. IEEE.
Fairbank, M. 2008. Reinforcement learning by value gradients. arXiv preprint arXiv:0803.3539.
Feinberg, V.; Wan, A.; Stoica, I.; Jordan, M. I.; Gonzalez, J. E.; and Levine, S. 2018. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, 2829–2838.
Ha, D., and Schmidhuber, J. 2018. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, 2450–2462.
Hafner, D.; Lillicrap, T.; Fischer, I.; Villegas, R.; Ha, D.; Lee, H.; and Davidson, J. 2018. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2944–2952.
Jordan, M. I., and Rumelhart, D. E. 1992. Forward models: Supervised learning with a distal teacher. Cognitive Science 16(3):307–354.
Kurutach, T.; Clavera, I.; Duan, Y.; Tamar, A.; and Abbeel, P. 2018. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592.
Levine, S., and Koltun, V. 2013. Guided policy search. In International Conference on Machine Learning, 1–9.
Li, W., and Todorov, E. 2004. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), 222–229.
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
Moldovan, T. M.; Levine, S.; Jordan, M. I.; and Abbeel, P. 2015. Optimism-driven exploration for nonlinear systems. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, 3239–3246. IEEE.
Montgomery, W., and Levine, S. 2016. Guided policy search as approximate mirror descent. arXiv preprint arXiv:1607.04614.
Nguyen, D. H., and Widrow, B. 1990. Neural networks for self-learning control systems. IEEE Control Systems Magazine 10(3):18–23.
Oh, J.; Singh, S.; and Lee, H. 2017. Value prediction network. arXiv preprint arXiv:1707.03497.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889–1897.
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML).
Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
Tang, P. 2017. Reinforcement mechanism design. In IJCAI, volume 17, 26–30.
Wahlström, N.; Schön, T. B.; and Deisenroth, M. P. 2015. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251.
Wang, X., and Dietterich, T. G. 2003. Model-based policy gradient reinforcement learning. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 776–783.
Watter, M.; Springenberg, J.; Boedecker, J.; and Riedmiller, M. 2015. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, 2746–2754.
Werbos, P. J. 1990. A menu of designs for reinforcement learning over time. Neural Networks for Control, 67–95.