# Hessian Aided Policy Gradient

Zebang Shen¹, Hamed Hassani², Chao Mi¹, Hui Qian¹, Alejandro Ribeiro²

¹Zhejiang University, ²University of Pennsylvania. Correspondence to: Hui Qian, Alejandro Ribeiro.

*Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).*

## Abstract

Reducing the variance of estimators for the policy gradient has long been a focus of reinforcement learning research. While classic algorithms like REINFORCE find an $\epsilon$-approximate first-order stationary point in $O(1/\epsilon^4)$ random trajectory simulations, no provable improvement on this complexity has been made so far. This paper presents a Hessian aided policy gradient method with the first improved sample complexity of $O(1/\epsilon^3)$. While our method exploits information from the policy Hessian, it can be implemented in linear time with respect to the parameter dimension and is hence applicable to sophisticated DNN parameterizations. Simulations on standard tasks validate the efficiency of our method.

## 1. Introduction

A Markov Decision Process (MDP) is determined by time-varying system states, actions that affect the transition probabilities between states, and instantaneous rewards collected as a function of the state visited and the action taken (Puterman, 2014). Whenever the system visits a particular state, the agent chooses an action that is dictated by a (possibly stochastic) policy. As the agent moves from state to state, it collects an aggregate reward given by a discounted sum of the instantaneous rewards. The optimal policy of an MDP is the one that maximizes this aggregate reward. The focus of this paper is to find the optimal policy in model-free reinforcement learning (RL) problems, where transition probabilities and rewards are unknown and can only be estimated by probing the system through the execution of policy actions (Sutton and Barto, 2018).

Policy gradient methods and their variants constitute the set of tools most widely used for finding good policies in MDPs (Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001). Such methods directly find the optimal policy through the use of stochastic first-order differentials of the accumulated reward with respect to policy variations. Their popularity notwithstanding, they are known to have low sample efficiency, demanding hundreds of thousands of Monte Carlo simulations of full trajectories even when dealing with relatively simple MDPs. Large sample complexity is inherent to MDPs because evaluating an individual stochastic policy gradient requires running a full trajectory under the current policy. Such trajectories are composed of a large number of individual transitions, which not only makes them costly to simulate but also results in gradient estimates with large variances, as the randomness of individual transitions is compounded over the trajectory's time horizon (Williams, 1992; Sutton, 1984; Arulkumaran et al., 2017; Sutton and Barto, 2018). Besides the aforementioned MDP-specific challenges, policy gradient inherits the limitations of any stochastic gradient descent method used for optimizing a nonconvex function. This manifests in the need for $O(1/\epsilon^4)$ stochastic first-order queries to find an $\epsilon$-approximate first-order stationary point ($\epsilon$-FOSP) (Nesterov, 2013).
In the supervised learning literature this slow convergence is alleviated with variance reduction techniques, which exploit correlations between consecutive stochastic gradient estimators (Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Nguyen et al., 2017; Fang et al., 2018). Given the dramatic impact of variance reduction on supervised learning, we would expect it to be even more effective in reinforcement learning, where policy gradients are prone to larger variances. However, directly transplanting the variance reduction techniques tailored for supervised learning does not reduce the $O(1/\epsilon^4)$ sample complexity (Papini et al., 2018). This happens because variance reduction techniques make explicit use of the fact that the randomness in the objective function is oblivious to the argument, i.e., the randomness that affects the choice of a function does not depend on the optimization variables. This is not true for RL problems, in which the probabilities of random trajectories are affected by the policy. Consequently, when applying variance reduction techniques to these non-oblivious objectives, the improvements in convergence rates that are obtained in supervised learning do not materialize.

**Contributions.** This paper develops a Hessian-aided variance reduction method that is applicable to non-oblivious objectives. In particular, we give the first provable sample-efficiency improvement over stochastic gradient based policy-gradient type methods: we reduce the trajectory complexity from $O(1/\epsilon^4)$ to $O(1/\epsilon^3)$ to obtain an $\epsilon$-FOSP. To do so, we construct a novel, efficiently computable, and unbiased variance reduced gradient estimator for non-oblivious objectives by integrating carefully designed Hessian-vector products along the iterate path. Our estimator utilizes the stochastic approximation of the second-order policy differential without compromising the linear per-iteration computation complexity and is hence suitable for complex and high dimensional parameterizations. Additionally, existing RL techniques like actor-critic and GAE can be seamlessly integrated into our algorithm. We also provide extensive simulations on various reinforcement learning tasks to validate our analysis and illustrate the efficiency of our method.

**Notation.** Lowercase boldface $v$ denotes a vector and uppercase boldface $A$ denotes a matrix. We use $\|v\|$ to denote the Euclidean norm of the vector $v$ and $\|A\|$ to denote the spectral norm of the matrix $A$. $\mathbb{E}[X]$ and $\mathbb{V}[X]$ denote the expectation and variance of a random variable $X$.

## 2. Preliminaries

Consider a discrete time index $h \ge 0$ and a Markov system with time-varying states $s_h \in \mathcal{S}$ and actions $a_h \in \mathcal{A}$. The probability distribution of the initial state is $\rho(s_0)$, and the conditional probability of transitioning into $s_{h+1}$ given that we are in state $s_h$ and take action $a_h$ is $P(s_{h+1}\mid s_h, a_h)$. Actions are chosen according to a possibly random policy $\pi$, in which $\pi(a_h\mid s_h)$ is the distribution for taking action $a_h$ when observing state $s_h$. We assume that policies are parametrized by a vector $\theta \in \mathbb{R}^d$ and use $\pi_\theta$ as a shorthand for the conditional distribution $\pi(a_h\mid s_h;\theta)$ associated with $\theta$. For a given time horizon $H$ we define the trajectory $\tau := (s_1, a_1, \ldots, s_H, a_H)$ as the collection of state-action pairs experienced up until time $h = H$. Given the initial distribution $\rho(s_0)$, the transition kernel $P(s_{h+1}\mid s_h, a_h)$, and the Markov property of the system, it follows that the probability distribution over trajectories $\tau$ is

$$p(\tau; \pi_\theta) := \rho(s_0) \prod_{h=1}^{H} P(s_{h+1}\mid s_h, a_h)\, \pi_\theta(a_h\mid s_h). \tag{1}$$
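To make the trajectory distribution (1) concrete, here is a minimal Python sketch of sampling a single trajectory under a parameterized policy. The `env.reset()`, `env.step(a)`, `policy.sample(s, theta)`, and `policy.log_prob(s, a, theta)` calls are assumed interfaces introduced only for illustration; they are not defined in the paper. Note that only the policy factors in (1) depend on $\theta$, so the unknown dynamics terms drop out when differentiating with respect to $\theta$.

```python
import numpy as np

def sample_trajectory(env, policy, theta, H, rng=None):
    """Roll out one trajectory tau = (s_1, a_1, ..., s_H, a_H) under pi_theta.

    Returns the visited states, actions, rewards, and the policy part of
    log p(tau; pi_theta), i.e. sum_h log pi_theta(a_h | s_h); the remaining
    terms log rho(s_0) + sum_h log P(s_{h+1} | s_h, a_h) do not depend on theta.
    """
    rng = rng or np.random.default_rng()
    s = env.reset()
    states, actions, rewards = [], [], []
    log_prob_policy = 0.0
    for _ in range(H):
        a = policy.sample(s, theta, rng)              # a_h ~ pi_theta(. | s_h)
        log_prob_policy += policy.log_prob(s, a, theta)
        s_next, r, done = env.step(a)                 # s_{h+1} ~ P(. | s_h, a_h)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:                                      # episode may end before the horizon H
            break
    return states, actions, rewards, log_prob_policy
```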
Associated with a state-action pair we have a reward function $r(s_h, a_h)$. When following a trajectory $\tau := (s_1, a_1, \ldots, s_H, a_H)$, we consider the accumulated reward discounted by a geometric factor $\gamma$,

$$R(\tau) := \sum_{h=1}^{H} \gamma^h\, r(s_h, a_h). \tag{2}$$

Our goal is to find the policy parameter $\theta$ that maximizes the expected discounted trajectory reward

$$\max_{\theta\in\mathbb{R}^d}\; J(\theta) := \mathbb{E}_\tau[R(\tau)] = \int R(\tau)\, p(\tau;\pi_\theta)\, d\tau. \tag{3}$$

Unlike traditional supervised learning problems, the underlying distribution $p$ depends on the variable $\theta$ and hence varies throughout the optimization procedure. We refer to this property as non-obliviousness. Due to the non-convexity of (3), we are satisfied with finding an $\epsilon$-approximate First-Order Stationary Point ($\epsilon$-FOSP), denoted by $\theta_\epsilon$, such that

$$\|\nabla J(\theta_\epsilon)\| \le \epsilon. \tag{4}$$

As a standard tool to achieve this goal, the policy gradient, i.e., the first-order differential of the objective (3), can be expressed as

$$\nabla J(\theta) := \mathbb{E}_{\tau\sim p(\tau;\pi_\theta)}\Big[\sum_{h=1}^{H}\Psi_h(\tau)\,\nabla\log\pi_\theta(a_h\mid s_h)\Big], \tag{5}$$

where we denote $\Psi_h(\tau) := \sum_{i=h}^{H}\gamma^i r(s_i, a_i)$ for a fixed trajectory $\tau$. Let $\mathcal{M}$ be a set of random trajectories sampled according to the distribution $p(\cdot\,;\pi_\theta)$. An unbiased stochastic policy-gradient estimator can be constructed by

$$g(\theta;\mathcal{M}) := \frac{1}{|\mathcal{M}|}\sum_{\tau\in\mathcal{M}}\sum_{h=1}^{H}\Psi_h(\tau)\,\nabla\log\pi_\theta(a_h\mid s_h). \tag{6}$$

### 2.1. Variance Reduced Gradient Estimator

Most of the literature on variance reduction techniques focuses on the oblivious setting. For example, supervised learning is usually characterized through the following stochastic optimization framework

$$\max_{x\in\mathbb{R}^d}\; F(x) := \mathbb{E}_{z\sim q(z)}[f(x; z)], \tag{7}$$

where the samples $z \in \mathcal{Z}$ have distribution $q(z)$ and the functions $f(\cdot\,; z): \mathbb{R}^d \to \mathbb{R}$ are smooth, potentially nonconvex loss functions with respect to the sample $z$. Since the underlying distribution $q(z)$ is invariant to the variable $x$, we refer to (7) as an oblivious objective.

For oblivious objectives, the Variance Reduced Gradient Estimator has recently been proposed to reduce the sample complexity of vanilla stochastic gradient methods with provable guarantees (Reddi et al., 2016; Fang et al., 2018; Zhou et al., 2018). Let $\tilde x$ be some reference point and $\tilde h$ be an unbiased gradient estimator at $\tilde x$. Given $\tilde x$ and $\tilde h$, the variance-reduced gradient estimator at a point $x_t$, which we denote by $h^{vr}_t$, is of the form

$$h^{vr}_t := \tilde h + h(x_t;\mathcal{M}) - h(\tilde x;\mathcal{M}), \tag{8}$$

where $\mathcal{M}$ is a set of samples drawn from $q(z)$ and

$$h(x;\mathcal{M}) := \frac{1}{|\mathcal{M}|}\sum_{z\in\mathcal{M}} \nabla f(x; z). \tag{9}$$

Note that $h(x_t;\mathcal{M})$ and $h(\tilde x;\mathcal{M})$ use the same sample batch $\mathcal{M}$, which is critical for reducing the variance. The formulation (8) captures several variance-reduction techniques such as the seminal SVRG estimator (Johnson and Zhang, 2013) and the recent SPIDER estimator (Fang et al., 2018).

While it is tempting to generalize the above-mentioned variance reduction technique to reinforcement learning (i.e., the non-oblivious setting), there has been no provable success. For example, the Stochastic Variance Reduced Policy Gradient (SVRPG) method of (Papini et al., 2018) incorporates such a technique in its gradient estimator design. However, SVRPG still requires the same number of random trajectories as the vanilla stochastic policy-gradient type method, i.e., $O(1/\epsilon^4)$, to achieve an $\epsilon$-FOSP (4), even under relatively strong assumptions, as discussed in Appendix 7.1.
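As a concrete reference for the estimators above, the sketch below implements the reward-to-go weights $\Psi_h(\tau)$ of (5), the vanilla policy-gradient estimator $g(\theta;\mathcal{M})$ of (6), and the oblivious variance-reduced form (8)-(9). The `score(s, a, theta)` helper, which should return $\nabla_\theta\log\pi_\theta(a\mid s)$, is an assumed interface for illustration only; arrays are 0-indexed, with index $k$ standing for the paper's time step $k+1$.

```python
import numpy as np

def reward_to_go(rewards, gamma):
    """Psi_h(tau) = sum_{i=h}^{H} gamma^i r(s_i, a_i) from (5); index k in the
    array corresponds to time step k+1, hence the factor gamma**(k + 1)."""
    H = len(rewards)
    psi = np.zeros(H)
    running = 0.0
    for k in reversed(range(H)):
        running += gamma ** (k + 1) * rewards[k]
        psi[k] = running
    return psi

def policy_gradient(trajectories, score, theta, gamma):
    """Unbiased estimator g(theta; M) of grad J(theta), eq. (6).

    Each trajectory is a (states, actions, rewards) triple; `score(s, a, theta)`
    returns grad_theta log pi_theta(a | s) as a length-d array (assumed interface).
    """
    g = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        psi = reward_to_go(rewards, gamma)
        for k, (s, a) in enumerate(zip(states, actions)):
            g += psi[k] * score(s, a, theta)
    return g / len(trajectories)

def oblivious_vr_gradient(grad_f, x_t, x_ref, h_ref, batch):
    """SVRG/SPIDER-style estimator (8)-(9) for the oblivious problem (7):
    h_vr = h_ref + h(x_t; M) - h(x_ref; M), with the SAME minibatch M.
    In RL, evaluating the old-point term on a batch sampled at the current
    policy is what breaks unbiasedness (cf. Remark 3.2)."""
    h_t = np.mean([grad_f(x_t, z) for z in batch], axis=0)
    h_old = np.mean([grad_f(x_ref, z) for z in batch], axis=0)
    return h_ref + h_t - h_old
```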
## 3. Methodology

In this section, we derive our Hessian Aided Policy Gradient (HAPG) method for the non-oblivious non-convex objective (3). HAPG is the first algorithm to achieve provably better sample efficiency than policy-gradient type methods, improving the complexity from $O(1/\epsilon^4)$ to $O(1/\epsilon^3)$. Note that while our method conceptually utilizes curvature information, HAPG can be implemented in linear time in terms of the parameter dimension $d$, without explicitly computing the Hessian.

Suppose we are given a variable sequence $\{\theta_s\}_{s=0}^{t}$. Then the gradient at $\theta_t$ can be written in a path-integral form:

$$\nabla J(\theta_t) = \nabla J(\theta_0) + \sum_{s=1}^{t}\big[\nabla J(\theta_s) - \nabla J(\theta_{s-1})\big]. \tag{10}$$

Let $\Delta_s$ be an unbiased estimator of the gradient difference $\nabla J(\theta_s) - \nabla J(\theta_{s-1})$ and recall the policy gradient estimator defined in (6). We can recursively construct the following estimator of $\nabla J(\theta_t)$:

$$g_t := \begin{cases} g(\theta_t;\mathcal{M}_0) & \text{if } \operatorname{mod}(t,p)=0,\\ g_{t-1}+\Delta_t & \text{if } \operatorname{mod}(t,p)\neq 0,\end{cases} \tag{11}$$

where $\mathcal{M}_0$ is a mini-batch of trajectories sampled according to $p(\cdot\,;\pi_{\theta_t})$ and $p$ is a given epoch length. In words, after every $p$ iterations we directly estimate the gradient using a mini-batch $\mathcal{M}_0$ of stochastic trajectories, and in between we maintain an unbiased estimate by recursively adding the correction term $\Delta_t$ to the current estimate $g_{t-1}$. Having the estimator (11), we use $g_t$ to update $\theta_{t+1}$ in a normalized gradient ascent manner:

$$\theta_{t+1} := \theta_t + \eta_t\, \frac{g_t}{\|g_t\|}. \tag{12}$$

Our method is presented in Algorithm 1.

**Algorithm 1** Hessian Aided Policy Gradient (HAPG)

1. **for** $t = 0$ to $T$ **do**
2. $\quad$ **if** $\operatorname{mod}(t, p) = 0$ **then**
3. $\quad\quad$ Sample $|\mathcal{M}_0|$ trajectories to compute the minibatch stochastic policy gradient estimator at $\theta_t$: $\;g_t = \frac{1}{|\mathcal{M}_0|}\sum_{\tau\in\mathcal{M}_0}\sum_{h=1}^{H}\Psi_h(\tau)\,\nabla\log\pi_{\theta_t}(a_h\mid s_h)$;
4. $\quad$ **else**
5. $\quad\quad$ Sample $|\mathcal{M}|$ random tuples $(a, \tau(a))$ to construct the gradient difference estimator $\Delta_t$ using (15);
6. $\quad\quad$ $g_t = g_{t-1} + \Delta_t$;
7. $\quad$ **end if**
8. $\quad$ $\theta_{t+1} = \theta_t + \eta\, g_t / \|g_t\|$;
9. **end for**

We now focus on the construction of $\Delta_t$. From Taylor's expansion, the gradient difference can be written as

$$\nabla J(\theta_t) - \nabla J(\theta_{t-1}) = \int_0^1 \nabla^2 J(\theta(a))\, v\; da = \Big[\int_0^1 \nabla^2 J(\theta(a))\, da\Big]\, v, \tag{13}$$

where we denote $\theta(a) := a\,\theta_t + (1-a)\,\theta_{t-1}$ and $v := \theta_t - \theta_{t-1}$. Note that the integral in (13) is just the expectation $\mathbb{E}_a[\nabla^2 J(\theta(a))]$ over $a$ uniformly distributed on $[0,1]$, which admits the unbiased estimator $\nabla^2 J(\theta(\bar a))$ with $\bar a$ uniformly sampled from $[0,1]$. Therefore, if we have an unbiased estimator of $\nabla^2 J(\theta)$ for arbitrary $\theta$, we can construct $\Delta_t$. To do so, note that the policy Hessian $\nabla^2 J(\theta)$ can be expressed as (a detailed derivation is provided in the appendix)

$$\nabla^2 J(\theta) = \mathbb{E}_\tau\Big[\nabla\Phi(\theta;\tau)\,\nabla\log p(\tau;\pi_\theta)^{\top} + \nabla^2\Phi(\theta;\tau)\Big],$$

where the trajectory $\tau$ has distribution $p(\tau;\pi_\theta)$ and

$$\Phi(\theta;\tau) := \sum_{h=1}^{H}\Big(\sum_{i=h}^{H}\gamma^i r(s_i,a_i)\Big)\log\pi_\theta(a_h\mid s_h).$$

Hence, we can construct an unbiased estimator $\nabla^2(\theta;\tau)$ of $\nabla^2 J(\theta)$ by sampling $\tau$ according to $p(\tau;\pi_\theta)$ and letting

$$\nabla^2(\theta;\tau) := \nabla\Phi(\theta;\tau)\,\nabla\log p(\tau;\pi_\theta)^{\top} + \nabla^2\Phi(\theta;\tau). \tag{14}$$

Putting things together, let $\mathcal{M}$ be a mini-batch of random tuples $(a, \tau(a))$, where $a$ is uniformly distributed over $[0,1]$ and $\tau(a)$ is a trajectory sampled according to the distribution $p(\tau(a);\pi_{\theta(a)})$ (recall $\theta(a) := a\,\theta_t + (1-a)\,\theta_{t-1}$). Denote $v = \theta_t - \theta_{t-1}$. We construct $\Delta_t$ by

$$\Delta_t := \nabla^2(\theta;\mathcal{M})\, v, \tag{15}$$

where $\nabla^2(\theta;\mathcal{M})$ is a minibatch version of (14) defined by

$$\nabla^2(\theta;\mathcal{M}) := \frac{1}{|\mathcal{M}|}\sum_{(a,\tau(a))\in\mathcal{M}} \nabla^2(\theta(a);\tau(a)). \tag{16}$$

Let us re-emphasize that the estimator (15) can be computed in linear time in terms of the parameter dimension $d$. First note that computing $\Delta_t$ amounts to computing $|\mathcal{M}|$ matrix-vector products $\nabla^2(\theta;\tau)\, v$. From (14) we can write

$$\nabla^2(\theta;\tau)\, v = \big(\nabla\log p(\tau;\pi_\theta)^{\top} v\big)\,\nabla\Phi(\theta;\tau) + \nabla^2\Phi(\theta;\tau)\, v.$$

Clearly, the first term can be computed in time $O(Hd)$. Additionally, the second term is a Hessian-vector product and can be computed either with Pearlmutter's algorithm (Pearlmutter, 1994) or using finite differences (Wright and Nocedal, 1999) in $O(Hd)$ time, as discussed in Section 3.1.
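The sketch below assembles the correction term $\Delta_t$ of (15)-(16), reusing the hypothetical `score` interface and the `reward_to_go` helper from the earlier sketches. It replaces the exact Hessian-vector product $\nabla^2\Phi(\theta;\tau)\,v$ with the central finite difference of Section 3.1, so it is an approximate illustration under those assumed interfaces rather than the paper's exact estimator.

```python
import numpy as np

def grad_phi(traj, score, theta, gamma):
    """grad Phi(theta; tau) = sum_h Psi_h(tau) grad log pi_theta(a_h | s_h),
    i.e. the single-trajectory policy gradient g(theta; {tau})."""
    states, actions, rewards = traj
    psi = reward_to_go(rewards, gamma)                  # from the earlier sketch
    g = np.zeros_like(theta)
    for k, (s, a) in enumerate(zip(states, actions)):
        g += psi[k] * score(s, a, theta)
    return g

def hessian_correction(tuples, score, theta_t, theta_prev, gamma, fd_eps=1e-5):
    """Delta_t = (1/|M|) * sum over (a, tau(a)) of nabla^2(theta(a); tau(a)) v,
    following (14)-(16), with v = theta_t - theta_prev.

    `tuples` is a list of (a, traj) pairs where a ~ Uniform[0, 1] and traj is a
    trajectory sampled from p(.; pi_{theta(a)}), theta(a) = a*theta_t + (1-a)*theta_prev.
    """
    v = theta_t - theta_prev
    delta = np.zeros_like(theta_t)
    for a, traj in tuples:
        theta_a = a * theta_t + (1.0 - a) * theta_prev
        states, actions, _ = traj
        # First term of (14) applied to v: (grad log p(tau; pi_theta)^T v) * grad Phi,
        # where grad log p(tau; pi_theta) = sum_h grad log pi_theta(a_h | s_h).
        score_dot_v = sum(float(score(s, act, theta_a) @ v)
                          for s, act in zip(states, actions))
        term1 = score_dot_v * grad_phi(traj, score, theta_a, gamma)
        # Second term: the Hessian-vector product grad^2 Phi(theta; tau) v,
        # approximated by the central finite difference of Section 3.1.
        g_plus = grad_phi(traj, score, theta_a + fd_eps * v, gamma)
        g_minus = grad_phi(traj, score, theta_a - fd_eps * v, gamma)
        term2 = (g_plus - g_minus) / (2.0 * fd_eps)
        delta += term1 + term2
    return delta / len(tuples)
```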
**Remark 3.1.** The choice $\Psi_h(\tau) = \sum_{i=h}^{H}\gamma^i r(s_i,a_i)$ is made in order to derive a tight bound on the norms of the stochastic policy gradient and policy Hessian. In practice, we can incorporate a state-dependent baseline into $\Psi_h(\tau)$ to further improve the performance; see (27) in the experiment section. In that case, the $O(1/\epsilon^3)$ trajectory complexity still holds because the estimator $g$ in (6) remains unbiased.

**Remark 3.2.** The estimator (8) for the oblivious loss (7) shares the same spirit as (11), except that it estimates $\nabla F(x_t)$ and $\nabla F(x_{t-1})$ separately by $h(x_t;\mathcal{M})$ and $h(x_{t-1};\mathcal{M})$ (taking $\tilde x = x_{t-1}$). However, such separated estimation is biased in reinforcement learning: let $\mathcal{M}$ be a minibatch of trajectories sampled from $p(\cdot\,;\pi_{\theta_t})$ and recall the definition of $g(\theta;\mathcal{M})$ in (6). We have $\mathbb{E}[g(\theta_t;\mathcal{M})] = \nabla J(\theta_t)$ but $\mathbb{E}[g(\tilde\theta;\mathcal{M})] \neq \nabla J(\tilde\theta)$.

### 3.1. Finite Difference for Hessian-Vector Product

In this section, we briefly describe the finite difference method for computing a Hessian-vector product. Let $\varphi: \mathbb{R}^d \to \mathbb{R}$ be a twice differentiable function. Our goal is to compute, for arbitrary $\theta, v \in \mathbb{R}^d$, the product $\nabla^2\varphi(\theta)\, v$. For $\epsilon > 0$, define the operator

$$\xi_\epsilon(v;\varphi) := \frac{\nabla\varphi(\theta + \epsilon v) - \nabla\varphi(\theta - \epsilon v)}{2\epsilon}.$$

Assuming second-order smoothness of $\varphi$, i.e., $\|\nabla^2\varphi(x) - \nabla^2\varphi(y)\| \le L_2\,\|x - y\|$ for arbitrary $x, y \in \mathbb{R}^d$, we have the bound

$$\big\|\xi_\epsilon(v;\varphi) - \nabla^2\varphi(\theta)\, v\big\| \le L_2\,\|v\|^2\,\epsilon, \tag{17}$$

which can be made arbitrarily small by taking a sufficiently small $\epsilon$. Note that the complexity of computing $\xi_\epsilon(v;\varphi)$ is twice the complexity of evaluating the gradient of $\varphi(\cdot)$. Therefore, we can approximate $\nabla^2\Phi(\theta;\tau)\, v$ to arbitrary accuracy by employing $\xi_\epsilon(v;\Phi)$ within time $O(Hd)$.
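As a worked check of the finite-difference operator and its error bound (17), the snippet below evaluates $\xi_\epsilon(v;\varphi)$ on a toy quadratic $\varphi(\theta)=\tfrac12\theta^\top A\theta$, for which the exact Hessian-vector product is $Av$; the quadratic and its dimensions are made up purely for illustration.

```python
import numpy as np

def hvp_fd(grad, theta, v, eps=1e-5):
    """xi_eps(v; phi) = [grad phi(theta + eps*v) - grad phi(theta - eps*v)] / (2*eps),
    a finite-difference approximation of the Hessian-vector product; it costs
    two extra gradient evaluations, hence O(d) work on top of the gradient oracle."""
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2.0 * eps)

# Toy check: phi(theta) = 0.5 * theta^T A theta has grad phi = A theta and
# exact Hessian-vector product A v, so the finite-difference error is tiny.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = 0.5 * (A + A.T)                                   # symmetrize
theta, v = rng.standard_normal(5), rng.standard_normal(5)
approx = hvp_fd(lambda th: A @ th, theta, v)
print(np.max(np.abs(approx - A @ v)))                 # tiny: only round-off error remains
```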
## 4. Convergence Analysis

We prove the convergence of Algorithm 1 and analyze the trajectory complexity required to find an $\epsilon$-FOSP under the following assumptions.

**Assumption 4.1** (bounded reward). The instantaneous reward function is bounded, i.e., for all $a\in\mathcal{A}$ and $s\in\mathcal{S}$,

$$|r(s,a)| \le R. \tag{18}$$

**Assumption 4.2** (parameterization regularity). For any choice of the parameter $\theta$ and any state-action pair $(s,a)$, we have $\|\nabla\log\pi(a\mid s;\theta)\| \le G$ and $\|\nabla^2\log\pi(a\mid s;\theta)\| \le L$.

We note that Assumptions 4.1 and 4.2 are standard in the literature and are used in recent work such as (Papini et al., 2018). These two assumptions imply the following technical lemma, which characterizes the properties of the stochastic approximations $g(\theta;\{\tau\})$ in (6) and $\nabla^2(\theta;\tau)$ in (14) of the first- and second-order differentials of $J(\theta)$.

**Lemma 4.1** (properties of the stochastic differential estimators). Under Assumptions 4.1 and 4.2, we have for all $\theta$

$$\mathbb{E}_\tau\big\|g(\theta;\{\tau\}) - \nabla J(\theta)\big\|^2 \le \frac{G^2R^2}{(1-\gamma)^4} =: G_g^2, \qquad \mathbb{E}_\tau\big\|\nabla^2(\theta;\tau)\big\|^2 \le \frac{H^2G^4R^2 + L^2R^2}{(1-\gamma)^4} =: G_H^2, \tag{19}$$

where $\tau$ is a trajectory sampled according to $p(\tau;\pi_\theta)$.

*Proof.* **Bound for the stochastic first-order differential.** Recall the unbiasedness of $g(\theta;\{\tau\})$. Using $\mathbb{E}[(X-\mathbb{E}[X])^2] \le \mathbb{E}[X^2]$ for any random variable $X$, we have

$$\mathbb{E}_\tau\big\|g(\theta;\{\tau\}) - \nabla J(\theta)\big\|^2 \le \mathbb{E}_\tau\big\|g(\theta;\{\tau\})\big\|^2.$$

Using the definition of $g(\theta;\{\tau\})$,

$$\big\|g(\theta;\{\tau\})\big\| = \Big\|\sum_{h=1}^{H}\Psi_h(\tau)\,\nabla\log\pi_\theta(a_h\mid s_h)\Big\| \le \sum_{h=1}^{H}|\Psi_h(\tau)|\,\big\|\nabla\log\pi_\theta(a_h\mid s_h)\big\| \le G\sum_{h=1}^{H}|\Psi_h(\tau)|,$$

where we use the triangle inequality in the first inequality and Assumption 4.2 in the second. On the other hand, using the definition of $\Psi_h$, we derive

$$|\Psi_h(\tau)| = \Big|\sum_{i=h}^{H}\gamma^i r(s_i,a_i)\Big| \le \sum_{i=h}^{H}\gamma^i R \le \frac{R\,\gamma^h}{1-\gamma},$$

where the first inequality uses Assumption 4.1 and the second uses the geometric-series sum. Combining these two bounds, we obtain

$$\big\|g(\theta;\{\tau\})\big\| \le \frac{GR}{1-\gamma}\sum_{h=1}^{H}\gamma^h \le \frac{GR}{(1-\gamma)^2}.$$

Consequently, $\mathbb{E}_\tau\|g(\theta;\{\tau\}) - \nabla J(\theta)\|^2 \le G^2R^2/(1-\gamma)^4$.

**Bound for the stochastic second-order differential.** Recall the definition of $\nabla^2(\theta;\tau)$ in (14). Using $\|A+B\|^2 \le 2\|A\|^2 + 2\|B\|^2$, we bound

$$\mathbb{E}_\tau\big\|\nabla^2(\theta;\tau)\big\|^2 \le 2\,\mathbb{E}_\tau\Big[\big\|\nabla\Phi(\theta;\tau)\big\|^2\,\big\|\nabla\log p(\tau;\pi_\theta)\big\|^2\Big] + 2\,\mathbb{E}_\tau\big\|\nabla^2\Phi(\theta;\tau)\big\|^2.$$

To bound the first term, use $\nabla\Phi(\theta;\tau) = g(\theta;\{\tau\})$, so that $\|\nabla\Phi(\theta;\tau)\| = \|g(\theta;\{\tau\})\| \le GR/(1-\gamma)^2$. Additionally,

$$\big\|\nabla\log p(\tau;\pi_\theta)\big\| = \Big\|\sum_{h=1}^{H}\nabla\log\pi(a_h\mid s_h;\theta)\Big\| \le HG,$$

so that the first term is bounded by

$$\big\|\nabla\Phi(\theta;\tau)\big\|^2\,\big\|\nabla\log p(\tau;\pi_\theta)\big\|^2 \le \frac{H^2G^4R^2}{(1-\gamma)^4}.$$

The second term is bounded by

$$\big\|\nabla^2\Phi(\theta;\tau)\big\| = \Big\|\sum_{h=1}^{H}\Psi_h(\tau)\,\nabla^2\log\pi_\theta(a_h\mid s_h)\Big\| \le \sum_{h=1}^{H}|\Psi_h(\tau)|\,\big\|\nabla^2\log\pi(a_h\mid s_h;\theta)\big\|.$$

Using the bound on $|\Psi_h(\tau)|$ and Assumption 4.2, we arrive at

$$\big\|\nabla^2\Phi(\theta;\tau)\big\| \le \frac{LR}{1-\gamma}\sum_{h=1}^{H}\gamma^h \le \frac{LR}{(1-\gamma)^2}.$$

Combining these two bounds, we have $\mathbb{E}_\tau\|\nabla^2(\theta;\tau)\|^2 \le (H^2G^4R^2 + L^2R^2)/(1-\gamma)^4$. $\square$

Having Lemma 4.1, we are ready to bound the variance of the unbiased gradient estimator $g_t$.

**Lemma 4.2** (variance bound for the gradient estimator). Recall the definition of $g_t$ in (11). By setting $p = \frac{G_g}{G_H\epsilon}$, $|\mathcal{M}| = \frac{G_g}{G_H\epsilon}$, and $|\mathcal{M}_0| = \frac{G_g^2}{G_H^2\epsilon^2}$, we have

$$\mathbb{E}\big\|g_t - \nabla J(\theta_t)\big\|^2 \le 2\,G_H^2\,\epsilon^2. \tag{20}$$

*Proof.* For ease of presentation, we consider $t < p$; the general case is a direct extension. Recall the definition of $g_t$ and use its unbiasedness to obtain

$$\mathbb{E}\big\|g_t - \nabla J(\theta_t)\big\|^2 = \mathbb{E}\big\|\nabla^2(\theta;\mathcal{M})[\theta_t-\theta_{t-1}] + g_{t-1} - \nabla J(\theta_t)\big\|^2 = \mathbb{V}\big[\nabla^2(\theta;\mathcal{M})[\theta_t-\theta_{t-1}]\big] + \mathbb{E}\big\|g_{t-1} - \nabla J(\theta_{t-1})\big\|^2.$$

To bound the first term, use $\mathbb{V}\big[\frac{1}{n}\sum_{i=1}^{n}X_i\big] \le \frac{1}{n}\mathbb{E}[\|X_1\|^2]$ for i.i.d. random variables $\{X_i\}_{i=1}^{n}$ to obtain

$$\mathbb{V}\big[\nabla^2(\theta;\mathcal{M})[\theta_t-\theta_{t-1}]\big] \le \frac{1}{|\mathcal{M}|}\,\mathbb{E}\big\|\nabla^2(\theta(a);\tau(a))[\theta_t-\theta_{t-1}]\big\|^2 \le \frac{1}{|\mathcal{M}|}\,\mathbb{E}\Big[\big\|\nabla^2(\theta(a);\tau(a))\big\|^2\,\big\|\theta_t-\theta_{t-1}\big\|^2\Big] \le \frac{G_H^2\epsilon^2}{|\mathcal{M}|},$$

where the last step uses $\|\theta_t-\theta_{t-1}\| = \epsilon$, due to the normalized update (12) and the step-size choice $\eta_t = \epsilon$, together with Lemma 4.1. Consequently we have the recursion

$$\mathbb{E}\big\|g_t - \nabla J(\theta_t)\big\|^2 \le \frac{G_H^2\epsilon^2}{|\mathcal{M}|} + \mathbb{E}\big\|g_{t-1} - \nabla J(\theta_{t-1})\big\|^2.$$

Repeating this recursion $t$ times and using $t \le p$, we obtain

$$\mathbb{E}\big\|g_t - \nabla J(\theta_t)\big\|^2 \le \frac{t\,G_H^2\epsilon^2}{|\mathcal{M}|} + \mathbb{E}\big\|g_0 - \nabla J(\theta_0)\big\|^2 \le \frac{p\,G_H^2\epsilon^2}{|\mathcal{M}|} + \frac{G_g^2}{|\mathcal{M}_0|}.$$

Setting $p = \frac{G_g}{G_H\epsilon}$, $|\mathcal{M}| = \frac{G_g}{G_H\epsilon}$, and $|\mathcal{M}_0| = \frac{G_g^2}{G_H^2\epsilon^2}$, the result of the lemma follows. $\square$

**Theorem 4.1.** Recall the definitions of the parameters $G_g$ and $G_H$ in Lemma 4.1 and let $J^* := \sup_\theta J(\theta)$. Under Assumptions 4.1 and 4.2, and setting $p = \frac{G_g}{G_H\epsilon}$, $|\mathcal{M}| = \frac{G_g}{G_H\epsilon}$, $|\mathcal{M}_0| = \frac{G_g^2}{G_H^2\epsilon^2}$, and $T = \frac{2(J^* - J(\theta_0))}{G_H\epsilon^2}$ in Algorithm 1, we have

$$\mathbb{E}\big\|\nabla J(\theta_{\bar t})\big\| \le 4\,G_H\,\epsilon, \tag{21}$$

where $\bar t$ is uniformly sampled from $\{0,\ldots,T-1\}$. (Randomly selecting the output of the algorithm is standard in the non-convex optimization literature, e.g., Ghadimi and Lan, 2013.)

*Proof.* Lemma 4.1 implies that $J(\theta)$ is $G_H$-smooth, i.e.,

$$\big\|\nabla^2 J(\theta)\big\| = \big\|\mathbb{E}_\tau\,\nabla^2(\theta;\tau)\big\| \le \mathbb{E}_\tau\big\|\nabla^2(\theta;\tau)\big\| \le G_H. \tag{22}$$

We can thus write

$$J(\theta_{t+1}) \ge J(\theta_t) + \big\langle \nabla J(\theta_t),\, \theta_{t+1}-\theta_t\big\rangle - \frac{G_H}{2}\big\|\theta_{t+1}-\theta_t\big\|^2 = J(\theta_t) + \big\langle g_t,\, \theta_{t+1}-\theta_t\big\rangle - \frac{G_H}{2}\big\|\theta_{t+1}-\theta_t\big\|^2 + \big\langle \nabla J(\theta_t)-g_t,\, \theta_{t+1}-\theta_t\big\rangle.$$

Using $\|\theta_{t+1}-\theta_t\| = \epsilon$ and $\theta_{t+1}-\theta_t = \epsilon\, g_t/\|g_t\|$, we obtain

$$J(\theta_{t+1}) \ge J(\theta_t) + \epsilon\,\|g_t\| - \frac{G_H\epsilon^2}{2} + \Big\langle \nabla J(\theta_t)-g_t,\, \frac{\epsilon\, g_t}{\|g_t\|}\Big\rangle \ge J(\theta_t) + \epsilon\,\|g_t\| - G_H\epsilon^2 - \frac{1}{2G_H}\big\|\nabla J(\theta_t)-g_t\big\|^2,$$

where we use Young's inequality in the second inequality. Using the triangle inequality $\|g_t\| \ge \|\nabla J(\theta_t)\| - \|\nabla J(\theta_t)-g_t\|$, we have

$$J(\theta_{t+1}) \ge J(\theta_t) + \epsilon\big(\|\nabla J(\theta_t)\| - \|\nabla J(\theta_t)-g_t\|\big) - G_H\epsilon^2 - \frac{1}{2G_H}\big\|\nabla J(\theta_t)-g_t\big\|^2. \tag{23}$$

Taking expectations on both sides of (23) and rearranging terms,

$$\begin{aligned}\epsilon\,\mathbb{E}\big\|\nabla J(\theta_t)\big\| &\le \mathbb{E}[J(\theta_{t+1})] - \mathbb{E}[J(\theta_t)] + G_H\epsilon^2 + \frac{1}{2G_H}\mathbb{E}\big\|\nabla J(\theta_t)-g_t\big\|^2 + \epsilon\,\mathbb{E}\big\|\nabla J(\theta_t)-g_t\big\| \\ &\le \mathbb{E}[J(\theta_{t+1})] - \mathbb{E}[J(\theta_t)] + G_H\epsilon^2 + \frac{1}{2G_H}\mathbb{E}\big\|\nabla J(\theta_t)-g_t\big\|^2 + \epsilon\,\sqrt{\mathbb{E}\big\|\nabla J(\theta_t)-g_t\big\|^2} \\ &\le \mathbb{E}[J(\theta_{t+1})] - \mathbb{E}[J(\theta_t)] + 4\,G_H\epsilon^2, \end{aligned} \tag{24}$$

where we use Jensen's inequality in the second inequality and Lemma 4.2 in the third. Summing (24) from $t=0$ to $T-1$ and dividing both sides by $T\epsilon$ gives

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla J(\theta_t)\big\| \le \frac{\mathbb{E}[J(\theta_T)] - J(\theta_0)}{T\epsilon} + 4\,G_H\epsilon. \tag{25}$$

The result follows by taking $T = 2(J^* - J(\theta_0))/(G_H\epsilon^2)$ and $\bar t$ uniformly sampled from $\{0,\ldots,T-1\}$. $\square$

Having Theorem 4.1, the following corollary translates the convergence result into the overall trajectory complexity required to reach an $\epsilon$-FOSP.

**Corollary 4.1.** Under Assumptions 4.1 and 4.2, Algorithm 1 finds an $\epsilon$-FOSP with no more than $\frac{256\,G_g G_H\,(J^* - J(\theta_0))}{\epsilon^3}$ random trajectories.

*Proof.* Applying Theorem 4.1 with the accuracy $\bar\epsilon = \frac{\epsilon}{4G_H}$, we have $\mathbb{E}\|\nabla J(\theta_{\bar t})\| \le \epsilon$, and the amortized per-iteration trajectory complexity is $2|\mathcal{M}|$. Moreover, $T = \frac{32\,G_H(J^* - J(\theta_0))}{\epsilon^2}$, and hence the overall trajectory complexity is

$$2|\mathcal{M}|\cdot T = \frac{256\,G_g G_H\,(J^* - J(\theta_0))}{\epsilon^3}. \qquad\square$$
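As a worked illustration of the parameter choices in Theorem 4.1 and Corollary 4.1, the snippet below plugs made-up values of $G_g$, $G_H$, $J^*-J(\theta_0)$, and a target accuracy $\epsilon$ into the prescribed settings of $p$, $|\mathcal{M}|$, $|\mathcal{M}_0|$, and $T$, and confirms that the resulting trajectory count matches the $256\,G_g G_H (J^*-J(\theta_0))/\epsilon^3$ bound; the numerical constants are purely illustrative, not values from the paper.

```python
import math

def hapg_budget(G_g, G_H, gap, eps):
    """Parameter settings from Theorem 4.1, run at accuracy eps_bar = eps / (4 G_H)
    as in the proof of Corollary 4.1; `gap` stands for J* - J(theta_0)."""
    eps_bar = eps / (4.0 * G_H)
    p = G_g / (G_H * eps_bar)                  # epoch length
    M = G_g / (G_H * eps_bar)                  # inner batch of (a, tau(a)) tuples
    M0 = G_g ** 2 / (G_H ** 2 * eps_bar ** 2)  # outer trajectory batch
    T = 2.0 * gap / (G_H * eps_bar ** 2)       # number of iterations
    total = 2.0 * M * T                        # amortized 2|M| trajectories per iteration
    return p, M, M0, T, total

G_g, G_H, gap, eps = 10.0, 5.0, 100.0, 0.1     # illustrative constants only
p, M, M0, T, total = hapg_budget(G_g, G_H, gap, eps)
print(math.isclose(total, 256.0 * G_g * G_H * gap / eps ** 3))   # True, as in Corollary 4.1
```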
## 5. Related Work

### 5.1. MDP-Specific Variance

In policy-based, model-free reinforcement learning, the large variance of gradient estimates and the resulting high trajectory complexity have long been identified as key challenges. This is mainly due to the term $\Psi_h(\tau)$ in the policy gradient (5), whose randomness grows exponentially with respect to the horizon. Ideally, by setting $\Psi_h(\tau)$ to be the advantage function of the MDP under the policy $\pi_\theta$, i.e., the difference between the state-action value function and the state-value function, the estimator in (6) achieves the minimum possible variance. However, we generally do not have direct access to the advantage function and can only approximate it with estimates such as the discounted cumulative reward $\sum_{i=h}^{H}\gamma^i r(s_i,a_i)$.

A notable portion of the literature has focused on deriving better choices of $\Psi_h(\tau)$ in order to reduce the variance in estimating the advantage function; these works can be classified into two categories depending on whether or not bootstrapping is used to update the state-value function (Sutton and Barto, 2018). By estimating $\Psi_h(\tau)$ with a critic that uses bootstrapping, actor-critic type methods effectively drive down the variance at the cost of introducing bias into the gradient estimator (Konda and Tsitsiklis, 2000; Mnih et al., 2016). Alternatively, if we directly incorporate a baseline into $\Psi_h(\tau)$, the estimator (6) remains unbiased but potentially suffers from larger variance (Greensmith et al., 2004; Wu et al., 2018; Duan et al., 2016). The Generalized Advantage Estimation (GAE) proposed by (Schulman et al., 2016) incorporates the temporal-difference structure into the advantage function approximation and allows one to control the tradeoff between bias and variance. We emphasize that all these refined advantage estimators can be directly incorporated into our method by choosing $\Psi_h(\tau)$ accordingly.

### 5.2. Variance Reduced Gradient in RL

While variance reduced gradient estimators have been successful in oblivious supervised learning (see Section 2.1), their development in the non-oblivious reinforcement learning setting has been limited. Most existing works only apply such techniques to oblivious subproblems in RL rather than using them to estimate the gradient. (Du et al., 2017) consider estimating the state-value function under the current policy by minimizing the empirical mean squared projected Bellman error. This is equivalent to a finite-sum convex-concave saddle point problem where the underlying distribution is the variable-independent uniform distribution, and it can be solved by existing variance reduced variants (Palaniappan and Bach, 2016). In another work, Xu et al. (2017) use the SVRG estimator (Johnson and Zhang, 2013) to solve the trust region subproblem of TRPO (Schulman et al., 2015), which again is a finite-sum minimization problem with an oblivious uniform underlying distribution. Recently, Papini et al. (2018) proposed SVRPG, the first method that employs a variance-reduced type gradient estimator to directly optimize the non-oblivious loss (3).
Concretely, denoting the importance weight

$$w(\theta_t, \tilde\theta; \tau) := \frac{p(\tau;\pi_{\tilde\theta})}{p(\tau;\pi_{\theta_t})} = \prod_{h=1}^{H}\frac{\pi_{\tilde\theta}(a_h\mid s_h)}{\pi_{\theta_t}(a_h\mid s_h)}, \tag{26}$$

SVRPG constructs the gradient estimator

$$g_t^{vr} := \tilde g + g(\theta_t;\mathcal{M}) - \frac{1}{|\mathcal{M}|}\sum_{\tau\in\mathcal{M}} w(\theta_t, \tilde\theta; \tau)\, g(\tilde\theta;\{\tau\}),$$

where $\tilde\theta$ and $\tilde g$ are the reference point and its corresponding unbiased gradient estimator, respectively, and $\mathcal{M}$ is a mini-batch of trajectories sampled from $p(\cdot\,;\pi_{\theta_t})$. Note that $g_t^{vr}$ shares the same structure as $h_t^{vr}$ (see (8)) except for the correction term $w(\theta_t, \tilde\theta; \tau)$, which ensures the unbiasedness of $g_t^{vr}$ in the non-oblivious setting. However, by scrutinizing the convergence result, $O(1/\epsilon^4)$ random trajectories are still required to achieve an $\epsilon$-FOSP (4), which is the same as for the original policy-gradient type method. In the appendix, we briefly discuss why the trajectory complexity is not reduced. As presented in the previous section, HAPG directly estimates the gradient difference by sampling from the Hessian integral, instead of estimating the policy gradient at the two points $\theta_t$ and $\tilde\theta$ separately, which is the key to our improvement.

## 6. Experiments

In this section, we evaluate the performance of the proposed HAPG method on several standard reinforcement learning tasks. The REINFORCE method from (Sutton et al., 2000) is used as the baseline for comparison. We normalize the gradient estimator of REINFORCE in the same way as for HAPG (see the update rule (12)). The performance of HAPG and REINFORCE is tested on six continuous RL tasks, namely Cart Pole, Swimmer, 2d Walker, Reacher, Humanoid, and Humanoid Standup, where the latter five are commonly used Mujoco environments (Todorov et al., 2012). We use the environment implementations in the garage library (Duan et al., 2016). For all tasks, we use a deep Gaussian policy with the mean and variance parameterized by a fully-connected neural network. The number of network layers and hidden units, as well as the nonlinear activation functions, follow (Papini et al., 2018), with details given in the appendix. For a fair comparison, in each run we initialize HAPG and REINFORCE with a common random policy parameter. (We observe that the SVRPG method from (Papini et al., 2018) is extremely sensitive to the initialization policy; when initialized with a random policy, SVRPG usually diverges and is hence excluded from our comparison.) To limit the influence of randomness, each task is repeated 10 times and the performance is evaluated by averaging the mini-batch episode return with a 90% bootstrap confidence interval.

In terms of the sample complexity measurement, we use the number of system probes, i.e., the number of state transitions after taking an action according to the policy, instead of the number of trajectories. This quantity serves as a better criterion because different trajectories may contain different numbers of system probes: in many tasks, the environment returns a FAILURE flag (usually at the beginning of the training procedure) and the current episode terminates before reaching the maximum horizon.

In our experiments, we set the hyper-parameters such as the mini-batch sizes ($|\mathcal{M}_0|$ and $|\mathcal{M}|$), the epoch length $p$, and the step-size according to our analysis. Specifically, (1) the mini-batch size of REINFORCE is set to the same value as $|\mathcal{M}_0|$ in HAPG, which is obtained via grid search, (2) $|\mathcal{M}|$ and $p$ are set to satisfy $|\mathcal{M}|\cdot p = |\mathcal{M}_0|$, and (3) the step-sizes of HAPG and REINFORCE are both set to the small constant value 0.01. More details of the parameter choices and the URL of our code are given in the appendix.

While our analysis sets the function $\Psi_h(\tau)$ to be the discounted cumulative reward $\sum_{i=h}^{H}\gamma^i r(s_i,a_i)$, we can incorporate more sophisticated choices from the literature to obtain better performance. For simplicity, in our experiments we adopt the standard linear baseline from the garage library, which predicts a state-dependent baseline function $b:\mathcal{S}\to\mathbb{R}$, and we then use the GAE($\gamma$, 1) type advantage estimate

$$\Psi_h(\tau) = \sum_{i=h}^{H}\gamma^i r(s_i,a_i) - b(s_h), \tag{27}$$

where $b$ is updated with the previous empirical trajectories. Note that under this choice of $\Psi_h(\tau)$ the gradient estimator (6) remains unbiased and all of our theoretical guarantees carry over (with different parameters $G_g$ and $G_H$). In practice, replacing the baseline with a critic from an actor-critic type method may further improve the performance of HAPG; this will be our future work. We emphasize that both methods use the same baseline in our implementation.
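For concreteness, here is a minimal sketch of the baseline-adjusted advantage (27), reusing the `reward_to_go` helper and the 0-indexed convention from the earlier sketches; the linear baseline `b` below is an assumed stand-in for the garage library's linear feature baseline, not its actual API.

```python
import numpy as np

def advantage_with_baseline(states, rewards, b, gamma):
    """Psi_h(tau) = sum_{i=h}^{H} gamma^i r(s_i, a_i) - b(s_h), eq. (27).

    `b(s)` is any state-dependent baseline; because it does not depend on the
    action a_h, subtracting it keeps the estimator (6) unbiased.
    """
    psi = reward_to_go(rewards, gamma)          # from the earlier sketch
    return np.array([psi[k] - b(s) for k, s in enumerate(states)])

# Toy usage with a hypothetical linear baseline b(s) = w^T s (illustration only).
w = np.zeros(4)
adv = advantage_with_baseline([np.ones(4)] * 3, [1.0, 1.0, 1.0],
                              b=lambda s: float(w @ s), gamma=0.99)
```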
We present the results of the comparison in Figure 1.

*Figure 1. Comparison of the proposed HAPG method with the REINFORCE method on six tasks (horizontal axis: number of system probes; vertical axis: average episode return).*

In the Cart Pole experiment, we see that HAPG has only a small advantage over REINFORCE. This is because the Cart Pole task is relatively easy: from the hyper-parameter setting of REINFORCE, we see that a very small mini-batch of samples suffices to ensure the convergence of REINFORCE. Under such circumstances, our Hessian-aided technique offers limited benefit.

The advantage of HAPG over REINFORCE becomes more significant when the task is more difficult. Specifically, in the Mujoco tasks, HAPG converges to a parameter region with a high average reward using significantly fewer system probes than REINFORCE. This observation can be explained by the grid search over the hyper-parameter space: in these three tasks, REINFORCE requires a larger mini-batch size to obtain its best performance. In contrast, while the exterior mini-batch size of HAPG, i.e., $|\mathcal{M}_0|$, is set identical to the choice for REINFORCE, the interior mini-batch size $|\mathcal{M}|$ can be chosen much smaller without compromising the convergence of HAPG. This is because, when the step-size is small, our Hessian-aided scheme effectively reduces the variance with a small number of extra trajectories.

## Conclusion and Future Work

In this paper, we considered the problem of policy-based, model-free reinforcement learning. By introducing a novel variance-reduced gradient estimator, we proposed the Hessian Aided Policy Gradient (HAPG) method, which provides the first provable sample complexity improvement over the REINFORCE algorithm. HAPG incorporates curvature information from the policy Hessian without compromising the $O(d)$ per-iteration computation cost. Moreover, it can readily employ state-of-the-art techniques for more sophisticated advantage function estimation, which would result in superior performance.
While we directly use the estimated gradient as the ascent direction, methods like natural gradient (Kakade, 2002) or TRPO (Schulman et al., 2015) usually achieve better performance as they correct the gradient using the inverse Fisher information matrix. A possible future direction is to combine HAPG with natural gradient updates.

## Acknowledgment

This work is supported by NSF CPS-1837253, the Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ18F020002, and the National Natural Science Foundation of China (Grant Nos. 61672376, 61751209, and 61472347).

## References

- Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
- Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
- Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
- Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329-1338, 2016.
- Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687-697, 2018.
- Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
- Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 2004.
- Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
- Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531-1538, 2002.
- Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008-1014, 2000.
- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.
- Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
- Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416-1424, 2016.
- Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147-160, 1994.
- Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.
- Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.
- John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
- Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 1984.
- Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026-5033, 2012.
- Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
- Stephen Wright and Jorge Nocedal. Numerical Optimization. Springer, 1999.
- Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1tSsb-AW.
- Tianbing Xu, Qiang Liu, and Jian Peng. Stochastic variance reduction for policy gradient estimation. arXiv preprint arXiv:1710.06034, 2017.
- Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3925-3936, 2018.