# Trust Region Evolution Strategies

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Guoqing Liu, Li Zhao, Feidiao Yang, Jiang Bian, Tao Qin, Nenghai Yu, Tie-Yan Liu
University of Science and Technology of China; Microsoft Research; Institute of Computing Technology, Chinese Academy of Sciences
lgq1001@mail.ustc.edu.cn; {lizo, jiang.bian, taoqin, tyliu}@microsoft.com; yangfeidiao@ict.ac.cn

This work was done when the first author was on an internship with Microsoft Research.

## Abstract

Evolution Strategies (ES), a class of black-box optimization algorithms, has recently been demonstrated to be a viable alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. ES achieves fairly good performance on challenging reinforcement learning problems and is easier to scale in a distributed setting. However, standard ES algorithms perform only one gradient update per data sample, which is not very efficient. In this paper, with the goal of using sampled data more efficiently, we propose a novel iterative procedure that optimizes a surrogate objective function, enabling sampled data to be reused for multiple epochs of updates. We prove a monotonic improvement guarantee for this procedure. By making several approximations to the theoretically justified procedure, we further develop a practical algorithm called Trust Region Evolution Strategies (TRES). Our experiments demonstrate the effectiveness of TRES on a range of popular MuJoCo locomotion tasks in the OpenAI Gym, where it achieves better performance than the ES algorithm.

## Introduction

Developing agents that can accomplish challenging tasks in complex, uncertain environments is a primary goal of Reinforcement Learning (RL). The most popular paradigm for tackling such problems has been a class of MDP-based algorithms, such as DQN (Mnih et al. 2015), DDPG (Lillicrap et al. 2015), and TRPO (Schulman et al. 2015), which have achieved significant success when applied to a variety of domains, including Atari games (Mnih et al. 2015; Van Hasselt, Guez, and Silver 2016; Mnih et al. 2016), robot locomotion tasks (Schulman et al. 2015; 2017), and the game of Go (Silver et al. 2016; 2017).

Recently, Evolution Strategies (ES) (Salimans et al. 2017), a class of black-box optimization algorithms, has risen as a viable alternative to popular MDP-based RL techniques. With performance comparable to state-of-the-art policy gradient methods, ES has demonstrated its success in training policies on a variety of simulated robotics tasks in MuJoCo and on Atari games. With competitive performance and good parallelization, ES algorithms are receiving more and more attention as a scalable alternative to popular MDP-based RL techniques.

Figure 1: The red curve represents the true objective $J(\tilde{\theta})$, the green curve the local approximation $L_\theta(\tilde{\theta})$, and the blue curve the local approximation with a KL penalty, which is the surrogate objective function. The surrogate objective function forms a lower bound of the true objective, so optimizing this surrogate objective is guaranteed to improve the true objective.

Compared to MDP-based RL techniques, ES algorithms follow a different path, directly searching for an optimal policy in parameter space. Specifically, the ES algorithm in (Salimans et al.
2017) proposed to optimize the Gaussian smoothing of the objective function,

$$J(\theta) = \mathbb{E}_{X \sim N(\theta, \sigma^2 I)}[F(X)], \tag{1}$$

where $F(X)$ is the objective function representing the expected total reward. In each training iteration of this ES algorithm, two phases are executed sequentially: 1) sampling search directions $\epsilon_i$ from the Gaussian distribution $N(0, \sigma^2 I)$ and evaluating them by sampling data in the environment with the policy parameterized by $(\theta + \epsilon_i)$; 2) using the sampled data to estimate the gradient, which is then applied to update the parameter (a minimal code sketch of this baseline iteration is given at the end of this section).

Given this procedure, one vital direction for enhancing ES algorithms is to improve sample efficiency, which is not high in the current ES algorithm since the sampled data is used for only one gradient update. Sample efficiency essentially comprises two major aspects: one concerns how to sample better or more diverse search directions, and the other concerns how to make more efficient use of the data already sampled. Some recent works (Choromanski et al. 2018; Maheswaranathan et al. 2018) improve the sample efficiency of ES algorithms from the first aspect. However, to the best of our knowledge, no existing work explores the other perspective: given the sampled data, how can we make more efficient use of it?

To investigate this second aspect of sample efficiency, in this paper we propose a novel iterative procedure that introduces a surrogate objective whose optimization allows sampled data to be reused for multiple epochs of updates. In particular, we first introduce a local approximation $L_\theta(\tilde{\theta})$ to the Gaussian smoothing objective $J(\tilde{\theta})$ around the current parameter $\theta$ where we sample data, with the purpose of approximately optimizing $J(\tilde{\theta})$ by reusing data sampled from $\theta$ instead of sampling new data from $\tilde{\theta}$. However, this local approximation only works well when $\tilde{\theta}$ is close enough to the current parameter $\theta$, as shown in Figure 1, and it is not clear how large an update step we can take when optimizing the local approximation $L_\theta(\tilde{\theta})$. To address this issue, we propose a procedure that optimizes a surrogate objective function, namely the local approximation with a KL divergence penalty, which forms a lower bound of the true objective $J(\tilde{\theta})$. We prove a theoretical monotonic improvement guarantee for this procedure. Finally, we make a series of approximations to the theoretically justified procedure, yielding a practical algorithm, called Trust Region Evolution Strategies (TRES), which achieves better performance than the ES algorithm.

To summarize, our contributions are as follows:
- We make more efficient use of sampled data by optimizing a surrogate objective function for multiple epochs of updates, instead of just one gradient update.
- We prove a theoretical monotonic improvement guarantee for this optimization procedure.
- We develop the practical algorithm TRES after making several approximations.
- Experiments on five popular physical locomotion tasks in the OpenAI Gym demonstrate the effectiveness of TRES.
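To make the baseline concrete, here is a minimal sketch of one iteration of the standard ES update described above. It is not taken from the paper's code: the toy objective `sphere_reward` and all hyperparameter values are illustrative stand-ins, and the gradient estimate uses the vanilla estimator defined formally in the Background.

```python
import numpy as np

def sphere_reward(x):
    # Toy stand-in for the black-box objective F(x); in RL, evaluating F(x)
    # means rolling out the policy with parameters x and summing the rewards.
    return -float(np.sum(x ** 2))

def es_iteration(theta, rng, sigma=0.1, n_samples=50, lr=0.01):
    # Phase 1: sample search directions eps_i ~ N(0, sigma^2 I) and evaluate F(theta + eps_i).
    eps = sigma * rng.standard_normal((n_samples, theta.size))
    returns = np.array([sphere_reward(theta + e) for e in eps])
    # Phase 2: estimate the gradient of the Gaussian smoothing J(theta) in Eq. (1)
    # and take a single gradient ascent step; the samples are then discarded.
    grad = (returns[:, None] * eps).sum(axis=0) / (n_samples * sigma ** 2)
    return theta + lr * grad

rng = np.random.default_rng(0)
theta = np.zeros(5)
for _ in range(200):
    theta = es_iteration(theta, rng)   # each batch of samples is used for exactly one update
```

The key inefficiency TRES targets is visible in the last line: every batch of rollouts supports only a single parameter update.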
## Background

The goal of reinforcement learning (RL) is to find, through trial and error, a feedback policy that prescribes how an agent should optimally act in a dynamic, uncertain environment. ES addresses the RL problem by directly searching for an optimal policy in parameter space, cast as a maximization problem of the form

$$\max_{\theta \in \mathbb{R}^d} F(\theta), \tag{2}$$

where a family of deterministic policies mapping states to actions is parameterized by $\theta \in \mathbb{R}^d$, and the objective $F : \mathbb{R}^d \to \mathbb{R}$ measures the expected total reward of policy parameter $\theta$. However, when the environment is stochastic and implemented in a black-box physics simulator, the objective function $F$ is only accessible via noisy and expensive function evaluations, and may not be smooth. The gradient of $F$ therefore cannot be computed with a backpropagation-like algorithm, which means we cannot directly use standard gradient-based optimization methods to find a good solution for $\theta$. In order to both make the problem smooth and find a way to estimate its gradients, it is necessary to add noise. We thus turn to optimizing the Gaussian smoothing of the objective function,

$$\max_{\theta \in \mathbb{R}^d} J(\theta) = \mathbb{E}_{X \sim p_\theta(x)}[F(X)], \tag{3}$$

where $p_\theta(x)$ denotes the probability density function of the multivariate Gaussian distribution with diagonal covariance matrix, $N(\theta, \sigma^2 I)$. This process can be interpreted as adding a Gaussian blur to the original objective, which results in a smooth, differentiable objective $J(\theta)$ that can be solved with stochastic gradient ascent.

### Estimating Gradients of Gaussian Smoothing

The gradient of $J(\theta)$ is given by

$$\nabla J(\theta) = \frac{1}{\sigma^2}\, \mathbb{E}_{X \sim p_\theta(x)}\left[F(X)(X - \theta)\right]. \tag{4}$$

In practice, this gradient is intractable and must be estimated. The gradient $\nabla J(\theta)$ can be estimated via a variety of Monte Carlo estimators; in this paper, we consider two fundamental estimators.

**Vanilla ES gradient estimator.** The first estimator is derived via the Monte Carlo REINFORCE (Williams 1992) estimator, and coincides with a standard Monte Carlo estimator of the expectation appearing in Equation (4):

$$\hat{\nabla}^{V}_{N} J(\theta) = \frac{1}{N\sigma^2} \sum_{i=1}^{N} F(x_i)(x_i - \theta), \tag{5}$$

where $(x_i)_{i=1}^{N} \overset{\text{i.i.d.}}{\sim} N(\theta, \sigma^2 I)$ can be interpreted as search parameters sampled from the search distribution $N(\theta, \sigma^2 I)$. We refer to this as the vanilla ES gradient estimator.

**Antithetic ES gradient estimator.** Second, we consider the vanilla ES gradient estimator augmented with antithetic variables, as in (Salimans et al. 2017):

$$\hat{\nabla}^{A}_{N} J(\theta) = \frac{1}{2N\sigma^2} \sum_{i=1}^{N} \big(F(x_i) - F(2\theta - x_i)\big)(x_i - \theta), \tag{6}$$

where again $(x_i)_{i=1}^{N} \overset{\text{i.i.d.}}{\sim} N(\theta, \sigma^2 I)$. The two terms in the summand are contributed by the search parameter $x_i \sim N(\theta, \sigma^2 I)$ and its antithetic counterpart $(2\theta - x_i)$. We refer to this as the antithetic ES gradient estimator, owing to its use of antithetic Monte Carlo samples.

In summary, Evolution Strategies (ES) as introduced by (Salimans et al. 2017) optimizes the Gaussian smoothing objective directly with stochastic gradient ascent, using the antithetic ES gradient estimator, and each gradient update step of the ES algorithm can be divided into two sub-steps: 1) sampling search parameters $\{x_1, ..., x_N\}$ from the search distribution $N(\theta, \sigma^2 I)$ and evaluating them by doing rollouts in the environment; 2) using the sampled data $\{F(x_1), ..., F(x_N)\}$ to estimate the gradient and updating the current parameter with the estimated gradient. For each parameter update, ES needs to sample new data to estimate the gradient, which is not very efficient.
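For concreteness, the following small sketch (ours, not from the paper) computes the two estimators in Equations (5) and (6) on the same samples; the quadratic test function `F`, the noise level, the sample count, and the seed are all illustrative assumptions.

```python
import numpy as np

def F(x):
    # Arbitrary smooth stand-in for the black-box return F.
    return -float(np.sum((x - 1.0) ** 2))

def es_gradient_estimates(theta, sigma=0.1, n=100, seed=0):
    rng = np.random.default_rng(seed)
    xs = theta + sigma * rng.standard_normal((n, theta.size))   # x_i ~ N(theta, sigma^2 I)
    f_x = np.array([F(x) for x in xs])
    f_mirror = np.array([F(2.0 * theta - x) for x in xs])       # antithetic counterparts 2*theta - x_i
    # Vanilla ES gradient estimator, Eq. (5).
    g_vanilla = (f_x[:, None] * (xs - theta)).sum(axis=0) / (n * sigma ** 2)
    # Antithetic ES gradient estimator, Eq. (6): each sample is paired with its mirror image.
    g_antithetic = ((f_x - f_mirror)[:, None] * (xs - theta)).sum(axis=0) / (2 * n * sigma ** 2)
    return g_vanilla, g_antithetic

g_v, g_a = es_gradient_estimates(np.zeros(3))
# Across seeds, g_a typically fluctuates much less around the true gradient than g_v does.
```

For smooth objectives the antithetic pairing cancels much of the noise in the individual evaluations, which is why (Salimans et al. 2017) and this paper adopt it.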
## Methodology

In this section, we present a new algorithm, Trust Region Evolution Strategies (TRES), designed to make more efficient use of sampled data. We start by introducing a local approximation $L_\theta(\tilde{\theta})$ to the true objective $J(\tilde{\theta})$, with the aim of approximately optimizing $J(\tilde{\theta})$ by reusing data sampled from $\theta$ instead of sampling new data from $\tilde{\theta}$. We then show that the difference between the local approximation $L_\theta(\tilde{\theta})$ and the true objective $J(\tilde{\theta})$ is bounded by a KL divergence term $D^{\max}_{KL}(\theta, \tilde{\theta})$. Based on this bound, we propose to optimize a surrogate objective function, namely the local approximation with a KL divergence penalty, which is a lower bound of the true objective $J(\tilde{\theta})$ and thus provides explicit lower bounds on the improvement of $J(\tilde{\theta})$. After making a series of approximations to this theoretical procedure, we develop the practical TRES algorithm and propose a distributed implementation of it.

### Local Approximation of Gaussian Smoothing

In this subsection, we introduce a local approximation to the true objective $J(\tilde{\theta})$ around the current parameter $\theta$ where we sample data. We start from the difference of the true objective between a new parameter $\tilde{\theta}$ and the current parameter $\theta$:

$$J(\tilde{\theta}) - J(\theta) = \mathbb{E}_{X \sim p_{\tilde{\theta}}(x)}[F(X)] - \mathbb{E}_{X \sim p_{\theta}(x)}[F(X)]. \tag{7}$$

Since $X$ is a $d$-dimensional search parameter, we can write it in vector form $[X^1, ..., X^d]$, where the superscript denotes the corresponding dimension. Thus Equation (7) becomes

$$J(\tilde{\theta}) - J(\theta) = \mathbb{E}_{X^{1...d} \sim \tilde{\theta}}[F(X)] - \mathbb{E}_{X^{1...d} \sim \theta}[F(X)]. \tag{8}$$

For compactness, we use the notation $X^{1...i}$ for the first $i$ dimensions of $X$, namely $[X^1, ..., X^i]$, and $p_\theta(x^{1...i})$ for the marginal distribution of $X^{1...i}$, for $i \in \{1, ..., d\}$. We further compress $p_\theta(x^{1...i})$ down to $\theta$ in expectation subscripts. We decompose the difference as a telescoping sum of per-dimension differences:

$$J(\tilde{\theta}) - J(\theta) = \sum_{i=0}^{d-1} \left( \mathbb{E}_{X^{1...i+1} \sim \tilde{\theta},\, X^{i+2...d} \sim \theta}[F(X)] - \mathbb{E}_{X^{1...i} \sim \tilde{\theta},\, X^{i+1...d} \sim \theta}[F(X)] \right). \tag{9}$$

For convenience, we define the value function $V_\theta$ and the advantage function $A_\theta$ in our framework as

$$V_\theta(x^{1...i}) = \mathbb{E}_{X^{i+1...d} \sim \theta}\left[F(X) \mid X^{1...i} = x^{1...i}\right], \quad i = 0, ..., d,$$
$$A_\theta(x^{1...i+1}) = V_\theta(x^{1...i+1}) - V_\theta(x^{1...i}), \quad i = 0, ..., d-1. \tag{10}$$

Specifically, when $i = 0$, $V_\theta(x^{1...0}) = \mathbb{E}_{X^{1...d} \sim \theta}[F(X)] = J(\theta)$; when $i = d$, $V_\theta(x^{1...d}) = F(x^{1...d}) = F(x)$. In other words, the value function $V_\theta(x^{1...i})$ is the expectation of $F(X)$ when the first $i$ dimensions of $X$ are given and the remaining dimensions are sampled from $p_\theta(x^{i+1...d})$, and the advantage function $A_\theta(x^{1...i+1})$ is the difference in this expectation depending on whether the $(i+1)$-th dimension of $X$ is given or sampled from $p_\theta(x^{i+1})$. With this notation, Equation (9) can be rewritten as

$$J(\tilde{\theta}) = J(\theta) + \mathbb{E}_{X \sim \tilde{\theta}}\left[\sum_{i=0}^{d-1} \left(V_\theta(X^{1...i+1}) - V_\theta(X^{1...i})\right)\right] = J(\theta) + \mathbb{E}_{X \sim \tilde{\theta}}\left[\sum_{i=0}^{d-1} A_\theta(X^{1...i+1})\right]. \tag{11}$$

Expanding Equation (11) gives

$$J(\tilde{\theta}) = J(\theta) + \sum_{i=0}^{d-1} \mathbb{E}_{X \sim \tilde{\theta}}\left[A_\theta(X^{1...i+1})\right] = J(\theta) + \sum_{i=0}^{d-1} \sum_{x^{1...i}} p_{\tilde{\theta}}(x^{1...i}) \sum_{x^{i+1}} p_{\tilde{\theta}}(x^{i+1}) A_\theta(x^{1...i+1}). \tag{12}$$

The complex dependency of $p_{\tilde{\theta}}(x^{1...i})$ on $\tilde{\theta}$ makes Equation (12) difficult to optimize directly without new data sampled from $\tilde{\theta}$. To approximately optimize $J(\tilde{\theta})$ by reusing data sampled from $\theta$ instead, we introduce the following local approximation to $J(\tilde{\theta})$:

$$L_\theta(\tilde{\theta}) = J(\theta) + \sum_{i=0}^{d-1} \sum_{x^{1...i}} p_{\theta}(x^{1...i}) \sum_{x^{i+1}} p_{\tilde{\theta}}(x^{i+1}) A_\theta(x^{1...i+1}). \tag{13}$$

Although $L_\theta(\tilde{\theta})$ uses $p_\theta(x^{1...i})$ rather than $p_{\tilde{\theta}}(x^{1...i})$, it matches $J(\tilde{\theta})$ to zeroth and first order. That is, for any parameter value $\theta$,

$$L_\theta(\tilde{\theta})\big|_{\tilde{\theta}=\theta} = J(\tilde{\theta})\big|_{\tilde{\theta}=\theta}, \qquad \nabla_{\tilde{\theta}} L_\theta(\tilde{\theta})\big|_{\tilde{\theta}=\theta} = \nabla_{\tilde{\theta}} J(\tilde{\theta})\big|_{\tilde{\theta}=\theta}, \tag{14}$$

which implies that a sufficiently small step $\tilde{\theta} - \theta$ that improves $L_\theta(\tilde{\theta})$ will also improve $J(\tilde{\theta})$.
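To make the definitions in Equation (10) concrete, the following sketch (ours, not from the paper) estimates $V_\theta$ and $A_\theta$ by brute-force Monte Carlo: fix the first $i$ dimensions and resample the rest from $p_\theta$. The names `F`, `theta`, `sigma`, and the sample count are placeholders; as discussed later, the practical algorithm instead replaces the value function by an empirical estimate built from the rollout returns already collected at $\theta$.

```python
import numpy as np

def mc_value(F, theta, sigma, x_prefix, n_samples=1000, seed=0):
    # Monte Carlo estimate of V_theta(x^{1..i}) from Eq. (10): fix the first i dimensions
    # to x_prefix and sample the remaining ones from N(theta^{i+1..d}, sigma^2 I).
    rng = np.random.default_rng(seed)
    x_prefix = np.asarray(x_prefix, dtype=float)
    theta = np.asarray(theta, dtype=float)
    i, d = x_prefix.size, theta.size
    tail = theta[i:] + sigma * rng.standard_normal((n_samples, d - i))
    full = np.hstack([np.tile(x_prefix, (n_samples, 1)), tail])
    return float(np.mean([F(x) for x in full]))

def mc_advantage(F, theta, sigma, x_prefix_plus_one, **kwargs):
    # A_theta(x^{1..i+1}) = V_theta(x^{1..i+1}) - V_theta(x^{1..i}), Eq. (10).
    x_prefix_plus_one = np.asarray(x_prefix_plus_one, dtype=float)
    return (mc_value(F, theta, sigma, x_prefix_plus_one, **kwargs)
            - mc_value(F, theta, sigma, x_prefix_plus_one[:-1], **kwargs))
```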
### Monotonic Improvement Guarantee

In this subsection, we first derive a divergence bound on the difference between the local approximation $L_\theta(\tilde{\theta})$ and the true objective $J(\tilde{\theta})$. Based on this bound, we propose to optimize a surrogate objective function, namely the local approximation with a KL divergence penalty, with a theoretical monotonic improvement guarantee.

The particular distance measure we use here is the total variation divergence, defined by $D_{TV}(p \| q) = \frac{1}{2} \sum_i |p_i - q_i|$ for discrete probability distributions $p, q$. Define $D^{\max}_{TV}(\theta, \tilde{\theta})$ as

$$D^{\max}_{TV}(\theta, \tilde{\theta}) = \max_{1 \le i \le d} D_{TV}(\theta^i, \tilde{\theta}^i), \tag{15}$$

where, for simplicity, $D_{TV}(\theta^i, \tilde{\theta}^i)$ denotes $D_{TV}(N(\theta^i, \sigma^2), N(\tilde{\theta}^i, \sigma^2))$.

Theorem 1. Let $\alpha = D^{\max}_{TV}(\theta, \tilde{\theta})$. Then the following bound holds:

$$|J(\tilde{\theta}) - L_\theta(\tilde{\theta})| \le 2\epsilon d(d+1)\alpha^2, \quad \text{where } \epsilon = \max_{x^{1...i},\, 1 \le i \le d} |A_\theta(x^{1...i})|. \tag{16}$$

Theorem 1 provides a total variation bound on the difference between the true objective $J(\tilde{\theta})$ and the local approximation $L_\theta(\tilde{\theta})$; we defer its proof to the supplementary material. Using the relationship between the total variation divergence and the KL divergence, $D_{TV}(p \| q)^2 \le D_{KL}(p \| q)$, and defining $D^{\max}_{KL}(\theta, \tilde{\theta}) = \max_{1 \le i \le d} D_{KL}(\theta^i, \tilde{\theta}^i)$, where $D_{KL}(\theta^i, \tilde{\theta}^i)$ denotes $D_{KL}(N(\theta^i, \sigma^2), N(\tilde{\theta}^i, \sigma^2))$, a KL divergence bound follows directly from Theorem 1:

$$J(\tilde{\theta}) \ge L_\theta(\tilde{\theta}) - c\, D^{\max}_{KL}(\theta, \tilde{\theta}), \quad \text{where } c = 2\epsilon d(d+1). \tag{17}$$

Based on Equation (17), we define $M_\theta(\tilde{\theta}) = L_\theta(\tilde{\theta}) - c\, D^{\max}_{KL}(\theta, \tilde{\theta})$. Then

$$J(\tilde{\theta}) \ge M_\theta(\tilde{\theta}) \;\text{by Equation (17)}, \qquad J(\theta) = M_\theta(\theta) \;\text{by definition}, \qquad J(\tilde{\theta}) - J(\theta) \ge M_\theta(\tilde{\theta}) - M_\theta(\theta). \tag{18}$$

Thus, by maximizing $M_\theta(\tilde{\theta})$ at each iteration, we guarantee that the true objective $J(\tilde{\theta})$ is non-decreasing. This is a type of minorization-maximization (MM) algorithm (Hunter and Lange 2004); in MM terminology, $M_\theta(\tilde{\theta})$ is the surrogate objective function that minorizes $J(\tilde{\theta})$ with equality at $\theta$.

### Practical Algorithm

Having justified optimizing a surrogate objective function, we now describe how to derive a practical algorithm from these theoretical foundations.

**Constrained Optimization.** As indicated by Equation (18), by performing the following maximization we are guaranteed to improve the true objective $J(\tilde{\theta})$:

$$\max_{\tilde{\theta}} \left[ L_\theta(\tilde{\theta}) - c\, D^{\max}_{KL}(\theta, \tilde{\theta}) \right]. \tag{19}$$

In practice, since $c = 2\epsilon d(d+1)$, the penalty coefficient grows with the dimension of the policy parameter, and the resulting update steps would be very small. A natural way to take larger steps in a robust way is to place a constraint on $D^{\max}_{KL}(\theta, \tilde{\theta})$, i.e., a trust region constraint, so we transform Equation (19) into the following constrained optimization problem:

$$\max_{\tilde{\theta}} L_\theta(\tilde{\theta}) \quad \text{subject to} \quad D^{\max}_{KL}(\theta, \tilde{\theta}) \le \delta. \tag{20}$$
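Because every per-dimension search distribution is a Gaussian with the same standard deviation $\sigma$, the constraint in Equation (20) has a closed form: $D_{KL}(N(\theta^i, \sigma^2)\,\|\,N(\tilde{\theta}^i, \sigma^2)) = (\theta^i - \tilde{\theta}^i)^2 / (2\sigma^2)$. The sketch below (ours, not part of the paper) evaluates $D^{\max}_{KL}$ and, as one illustrative option, projects a candidate update back into the trust region; the paper itself moves to the clipped objective in the next subsection rather than using such a projection.

```python
import numpy as np

def max_kl(theta, theta_new, sigma):
    # For equal-variance Gaussians, KL(N(a, s^2) || N(b, s^2)) = (a - b)^2 / (2 s^2),
    # so D^max_KL(theta, theta_new) = max_i (theta^i - theta_new^i)^2 / (2 sigma^2).
    return float(np.max((theta - theta_new) ** 2) / (2.0 * sigma ** 2))

def clip_to_trust_region(theta, theta_new, sigma, delta):
    # D^max_KL <= delta is equivalent to |theta_new^i - theta^i| <= sigma * sqrt(2 delta)
    # in every dimension i; clip each coordinate back into that box if it steps outside.
    radius = sigma * np.sqrt(2.0 * delta)
    return theta + np.clip(theta_new - theta, -radius, radius)
```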
**Sample-Based Estimation.** We seek to solve the following optimization problem, obtained by expanding $L_\theta(\tilde{\theta})$ in Equation (20) and dropping the constant $J(\theta)$:

$$\max_{\tilde{\theta}} \sum_{i=0}^{d-1} \sum_{x^{1...i}} p_{\theta}(x^{1...i}) \sum_{x^{i+1}} p_{\tilde{\theta}}(x^{i+1}) A_\theta(x^{1...i+1}) \quad \text{subject to} \quad D^{\max}_{KL}(\theta, \tilde{\theta}) \le \delta. \tag{21}$$

We first replace the sum over the first $i+1$ dimensions of $x$ in this objective by an expectation:

$$\max_{\tilde{\theta}} \sum_{i=0}^{d-1} \mathbb{E}_{X^{1...i} \sim \theta,\, X^{i+1} \sim \tilde{\theta}}\left[A_\theta(X^{1...i+1})\right] \quad \text{subject to} \quad D^{\max}_{KL}(\theta, \tilde{\theta}) \le \delta. \tag{22}$$

Note that in Equation (22), $X^{i+1}$ still follows the distribution $p_{\tilde{\theta}}(X^{i+1})$ rather than $p_{\theta}(X^{i+1})$. We address this with an importance sampling estimator:

$$\max_{\tilde{\theta}} \sum_{i=0}^{d-1} \mathbb{E}_{X^{1...i+1} \sim \theta}\left[\frac{p_{\tilde{\theta}}(X^{i+1})}{p_{\theta}(X^{i+1})}\, A_\theta(X^{1...i+1})\right] \quad \text{subject to} \quad D^{\max}_{KL}(\theta, \tilde{\theta}) \le \delta. \tag{23}$$

Next we replace the advantage function $A_\theta(x^{1...i+1})$ by the value function $V_\theta(x^{1...i+1})$, which only changes the objective by a constant. Our optimization problem in Equation (23) is then exactly equivalent to the following one:

$$\max_{\tilde{\theta}} \sum_{i=0}^{d-1} \mathbb{E}_{X^{1...i+1} \sim \theta}\left[\frac{p_{\tilde{\theta}}(X^{i+1})}{p_{\theta}(X^{i+1})}\, V_\theta(X^{1...i+1})\right] \quad \text{subject to} \quad D^{\max}_{KL}(\theta, \tilde{\theta}) \le \delta. \tag{24}$$

All that remains is to replace the expectation by sample averages and the value function by an empirical estimate.

**Clipped Surrogate Objective.** The optimization in Equation (24) is a constrained optimization problem with a box constraint, which is complicated and hard to solve precisely. Inspired by PPO (Schulman et al. 2017), we propose a surrogate objective with a clipped probability ratio, which forms a lower bound of the local approximation $L_\theta(\tilde{\theta})$ and can be optimized via multiple epochs of gradient updates. Let $r_i(\tilde{\theta})$ denote the probability ratio of the $i$-th dimension, $r_i(\tilde{\theta}) = \frac{p_{\tilde{\theta}}(X^i)}{p_{\theta}(X^i)}$. The main objective $L^{CLIP}_\theta(\tilde{\theta})$ is

$$L^{CLIP}_\theta(\tilde{\theta}) = \mathbb{E}_{X \sim \theta}\left[\sum_{i=1}^{d} \min\left(l_i(\tilde{\theta}),\, l^{CLIP}_i(\tilde{\theta})\right)\right], \tag{25}$$
$$\text{where } l_i(\tilde{\theta}) = r_i(\tilde{\theta})\, V_\theta(X^{1...i}), \qquad l^{CLIP}_i(\tilde{\theta}) = \mathrm{clip}\!\left(r_i(\tilde{\theta}),\, 1-\lambda,\, 1+\lambda\right) V_\theta(X^{1...i}).$$

The motivation for this objective is as follows. The first term $l_i(\tilde{\theta})$ is the $i$-th dimension's contribution to the local approximation $L_\theta(\tilde{\theta})$; the second term $l^{CLIP}_i(\tilde{\theta})$ modifies $l_i(\tilde{\theta})$ by clipping the probability ratio, which removes the incentive for moving $r_i$ outside the interval $[1-\lambda, 1+\lambda]$ and thus constrains the update step during training. Finally, we take the minimum of the clipped and unclipped terms, so the final objective is a lower bound on the local approximation, which ensures that optimizing $L^{CLIP}_\theta(\tilde{\theta})$ improves the local approximation $L_\theta(\tilde{\theta})$. Although this objective is not exactly constrained to a trust region, we still call our algorithm Trust Region Evolution Strategies since in spirit it encourages the parameter to stay within a trust region.

**Distributed Implementation.** The resulting practical algorithm repeatedly executes two phases: 1) sampling search parameters $\{x_1, ..., x_N\}$ from the search distribution $N(\theta, \sigma^2 I)$ and evaluating them by doing rollouts in the environment, and 2) reusing the sampled data $\{F(x_1), ..., F(x_N)\}$ to perform $K$ epochs of gradient updates on the clipped surrogate objective. The evaluation of search parameters in the first phase scales well to many parallel workers, since the rollout of each search parameter is independent of the others. Following the efficient communication strategy introduced by (Salimans et al. 2017), we share the random seeds used to sample search parameters between workers, so each worker knows which search parameters the other workers used and only needs to communicate a single scalar to and from each other worker to agree on a parameter update. TRES thus requires extremely low bandwidth, in sharp contrast to policy gradient methods, which require workers to communicate entire gradients. We give the parallel version of TRES in Algorithm 1; a code sketch of the inner update loop follows below.
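Because each ratio $r_i(\tilde{\theta})$ is an exponential in $\tilde{\theta}^i$, the gradient of the clipped objective in Equation (25) can be written analytically. The sketch below (ours, not the paper's released code) performs the $K$ epochs of reuse from Algorithm 1 on a single worker; the empirical value estimates `V`, the learning rate, and the hyperparameter defaults are placeholders ($\lambda = 14$ and $K = 15$ simply match the best-performing values reported later in the experiments).

```python
import numpy as np

def clipped_surrogate_grad(theta_old, theta_new, xs, V, sigma, lam):
    """Gradient of the clipped objective L^CLIP (Eq. 25) w.r.t. theta_new.
    xs: (N, d) search parameters sampled from N(theta_old, sigma^2 I);
    V:  (N, d) empirical estimates of V_theta_old(x^{1..i}) for each sample and dimension."""
    # Per-dimension probability ratio r_i = p_new(x^i) / p_old(x^i) for equal-variance Gaussians.
    log_r = ((xs - theta_old) ** 2 - (xs - theta_new) ** 2) / (2.0 * sigma ** 2)
    r = np.exp(log_r)
    l_unclipped = r * V
    l_clipped = np.clip(r, 1.0 - lam, 1.0 + lam) * V
    # Gradient flows only where the unclipped branch attains the minimum;
    # when the clipped branch is strictly smaller, the clip is binding and the gradient is zero.
    active = (l_unclipped <= l_clipped).astype(float)
    dr_dtheta = r * (xs - theta_new) / sigma ** 2        # d r_i / d theta_new^i
    return (active * V * dr_dtheta).mean(axis=0)

def tres_epochs(theta, xs, V, sigma, lam=14.0, K=15, lr=0.01):
    # Reuse the same sampled data for K epochs of updates on the clipped objective.
    theta_new = theta.copy()
    for _ in range(K):
        theta_new = theta_new + lr * clipped_surrogate_grad(theta, theta_new, xs, V, sigma, lam)
    return theta_new
```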
Algorithm 1: Trust Region Evolution Strategies

```
1:  Input: noise standard deviation σ, clip factor λ, epoch number K, learning rate α
2:  Initialize: N workers with known random seeds, and initial policy parameters θ_0
3:  for each iteration t = 0, 1, 2, ... do
4:    for each worker i = 1, ..., N do
5:      Sample x_i ~ N(θ_t, σ²I)
6:      Compute rollout return F(x_i)
7:    end for
8:    Send all rollout returns to every worker
9:    for each worker i = 1, ..., N do
10:     Reconstruct all search parameters
11:     Compute value function estimates from the rollout returns
12:     Let θ_{t,1} = θ_t
13:     for each epoch j = 1, ..., K do
14:       Reuse the data sampled from θ_t
15:       θ_{t,j+1} = θ_{t,j} + α ∇L^CLIP_{θ_t}(θ_{t,j})
16:     end for
17:     Update the policy parameter via θ_{t+1} = θ_{t,K+1}
18:   end for
19: end for
```

## Experiments

To demonstrate the effectiveness of TRES, we conducted experiments on the continuous MuJoCo locomotion tasks from the OpenAI Gym (Brockman et al. 2016). These tasks have been widely studied in the reinforcement learning community (Duan et al. 2016; Gu et al. 2016; 2017; Rajeswaran et al. 2017a). In these tasks, the states of the robots are their generalized positions and velocities, and the controls are joint torques. The tasks are challenging due to under-actuation, high dimensionality, and non-smooth dynamics.

### Experimental Settings

Empirical results from (Rajeswaran et al. 2017b; Mania, Guy, and Recht 2018) show that linear policies are sufficiently expressive to capture diverse behaviors in MuJoCo; we therefore use a linear policy structure in all evaluated tasks. The policy parameter size is thus equal to the product of the state space dimension and the action space dimension of the environment. We initialize all policies to the zero matrix before training.

To reduce variance, we use the antithetic sampling (also known as mirrored sampling) technique, following (Salimans et al. 2017). For some tasks, such as Walker2d-v1 and Ant-v1, the default reward functions include a survival bonus, which rewards RL agents with a constant reward at each timestep as long as a terminal condition has not been reached. This can make the learned policies more likely to be only locally optimal. To mitigate this, we subtract the survival bonus from the output rewards, as in (Mania, Guy, and Recht 2018).

To remove the influence of outlier individuals among the sampled parameters and avoid falling into local optima, we perform fitness shaping (Sun et al. 2009; Wierstra et al. 2011; 2014; Salimans et al. 2017) by applying a rank transformation to the rollout returns before computing each parameter update. We also apply the state normalization trick, which is widely used in RL-related work (Schulman et al. 2015; 2017; Wu et al. 2017; Mania, Guy, and Recht 2018) and is similar to data whitening in regression tasks; intuitively, it ensures that policies put equal weight on the different components of the states. Our algorithms are built on the OpenAI evolution-strategies-starter implementation (Salimans et al. 2017). Each algorithm has been evaluated on 5 environments, with 6 random seeds each. For a fair comparison, we sample the six random seeds uniformly from the interval [0, 1000) and share them across all environments.
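The rank transformation used for fitness shaping is commonly implemented as centered ranks. The following sketch is one such variant, consistent with the behavior of rank-based shaping in ES implementations, though the exact details here are ours rather than taken from this paper's code.

```python
import numpy as np

def centered_ranks(returns):
    """Map rollout returns to centered ranks in [-0.5, 0.5], so that only the ordering
    of the returns (not their scale or outliers) influences the parameter update."""
    ranks = np.empty(len(returns), dtype=float)
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

# Example: an outlier return no longer dominates the update.
print(centered_ranks(np.array([1.0, 2.0, 1000.0, 3.0])))   # [-0.5, -0.1667, 0.5, 0.1667]
```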
### Main Results

To compare the performance of TRES with those of ES, TRPO, and PPO, Figure 2 shows the training curves of TRES and ES, and Table 1 reports the number of timesteps required to reach a prescribed reward threshold for TRES, ES, TRPO, and PPO. From the figure and the table, we can see that TRES outperforms ES across all tasks and provides results competitive with TRPO and PPO on most tasks.

Figure 2: Training curves of TRES (red) versus ES (blue) on the MuJoCo locomotion tasks. For each run, after every ten training iterations, we evaluate the current policy parameter via the average return of 100 independent rollouts. The curves are averaged over six random seeds, and the shaded region shows the standard deviation across these seeds.

| Environment | Threshold | TRPO timesteps | PPO timesteps | ES timesteps | TRES timesteps |
|---|---|---|---|---|---|
| Swimmer-v1 | 128 | 4.6e+6 | 4.4e+5 | 4.5e+5 | 2.4e+5 |
| HalfCheetah-v1 | 2385 | 2.6e+6 | 1.2e+6 | 1.7e+6 | 9.8e+5 |
| Walker2d-v1 | 2872 | 2.9e+6 | 1.9e+6 | 1.2e+7 | 3.6e+6 |
| Ant-v1 | 3000 | N/A | 1.0e+7 | 1.5e+7 | 1.3e+7 |
| InvertedDoublePendulum-v1 | 9104 | 4.4e+6 | 1.9e+6 | 8.8e+5 | 6.0e+5 |

Table 1: A comparison of TRES, ES, TRPO, and PPO on the MuJoCo locomotion tasks. For each task we show the number of timesteps required to reach a prescribed reward threshold, averaged over six random seeds.

These results indicate that TRES achieves robust and consistent performance across a variety of tasks. A particularly notable case is the Walker2d-v1 task, which is known for its difficulty and for easily trapping learners in local optima. On this task, TRES converges much faster than ES and achieves better final performance, without falling into the local optimum that ES gets stuck in. In Figure 2, we also find that the shaded region of TRES is smaller than that of ES on most tasks, which suggests that TRES is more robust across different random seeds. In particular, on the Swimmer-v1 task, ES exhibits very high variance during training, while TRES obtains a stable training process with low variance.

### Impact of hyper-parameters

There are two main hyper-parameters in TRES, λ and K. In this subsection, we conduct several experiments to investigate their impact.

**Impact of clip factor λ.** The clip factor λ is introduced to constrain the policy parameter update in each iteration. We conducted several experiments on the HalfCheetah-v1 task to study its impact, and plot the training curves for different λ in Figure 3. From Figure 3, we can see that λ ∈ [8.0, 20.0] improves sample efficiency consistently over the baseline, and λ = 14.0 performs best; either reducing or increasing λ from 14.0 hurts performance. Intuitively, too small a clip factor constrains the policy parameter update strictly, which may slow down training, while too large a clip factor leaves the update essentially unconstrained, which may make training unstable. A medium clip factor is therefore a better choice for a fast and stable training process, and these results match this intuition.

**Impact of epoch number K.** By reusing the sampled data, we perform K gradient steps on the clipped surrogate objective in each iteration. We conducted experiments on the HalfCheetah-v1 task to study the impact of the epoch number K. Intuitively, a larger epoch number yields a higher reuse frequency of the sampled data, but if the updated parameter moves too far from the current policy parameter where the data was sampled, the approximate gradient from the clipped surrogate objective is no longer accurate enough to improve the true objective. To investigate the balance between reuse frequency and gradient precision, we try different epoch numbers with λ fixed. Figure 4 shows the training curves for various settings of K. We observe that a small K does not make the best use of the sampled data, a large K causes a performance drop, and K = 15 gives the best performance.

Figure 3: Impact of λ on the HalfCheetah-v1 task.

Figure 4: Impact of K on the HalfCheetah-v1 task.
### Parallel version

Good parallelizability is one of the most important characteristics of the ES algorithm, allowing very fast training in a distributed computing environment, and TRES retains this valuable property. We show a time analysis of TRES and ES on the HalfCheetah-v1 task in Figure 5. From Figure 5(a), we can see that TRES matches ES in the time cost of the rollout process. From Figure 5(b), we can see that although TRES updates the parameter multiple times per iteration and is thus more complex than ES, the TRES curve almost coincides with the ES curve for the overall process. This makes sense because the time cost of the rollout process far exceeds that of the update process during actual training.

Figure 5: Time analysis for TRES and ES. (a) Rollout time cost. (b) Total time cost.

## Related Work

Evolution Strategies (ES) is an age-old black-box optimization method (Rechenberg 1971; Sun et al. 2009; Schaul, Glasmachers, and Schmidhuber 2011; Wierstra et al. 2011; 2014) that generates a descent direction via finite differences over randomly sampled directions. Recently, (Salimans et al. 2017; Mania, Guy, and Recht 2018) applied it to the reinforcement learning domain and empirically demonstrated that ES is comparable to state-of-the-art policy gradient algorithms, which has generated renewed interest in ES as a promising direction in reinforcement learning.

However, how to improve the sample efficiency of ES algorithms remains a challenging problem. Recent methods proposed for this purpose fall into two categories: (1) adapting the search distribution by encouraging exploration diversity, and (2) adapting the search distribution with the help of extra gradient information. In the first category, (Choromanski et al. 2018) propose to enforce orthogonality conditions on the Gaussian perturbations used for parameter exploration, and demonstrate theoretically and empirically that random orthogonal and Quasi Monte Carlo (QMC) finite difference directions are much more effective for parameter exploration than the random Gaussian directions used in (Salimans et al. 2017). (Conti et al. 2017) hybridize ES with existing algorithms that promote directed exploration, such as novelty search (NS) and quality diversity (QD), improving ES performance on sparse or deceptive deep RL tasks. In the second category, (Maheswaranathan et al. 2018) propose Guided Evolution Strategies, whose main idea is to use extra surrogate gradient information (directions that may be correlated with, but not necessarily identical to, the true gradient) along with ES random perturbations; specifically, they define a new search distribution for ES that is elongated along a guiding subspace spanned by the surrogate gradients. All of the above works focus on sampling better or more diverse search directions to improve the sample efficiency of ES algorithms. Quite differently, our approach focuses on how to fully utilize the sampled data, no matter how it is sampled.

## Conclusion and Future Work

In this paper, we have proposed a new algorithm, Trust Region Evolution Strategies (TRES), in the context of reinforcement learning. With the goal of making more efficient use of sampled data, we propose a novel iterative procedure that optimizes a surrogate objective function, with guaranteed monotonic improvement. After making a series of approximations to the theoretically justified procedure, we further develop the practical algorithm TRES.
Experiments on five popular MuJoCo locomotion tasks in the OpenAI Gym show that TRES is more sample-efficient than the ES algorithm. For future work, we plan to combine our method with prior works that focus on sampling better or more diverse search directions, which could substantially reduce the sample complexity of ES algorithms.

## Acknowledgments

This work was supported in part by the Natural Science Foundation of China under Grant U1636201.

## References

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Choromanski, K.; Rowland, M.; Sindhwani, V.; Turner, R. E.; and Weller, A. 2018. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning.

Conti, E.; Madhavan, V.; Such, F. P.; Lehman, J.; Stanley, K. O.; and Clune, J. 2017. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560.

Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.

Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep Q-learning with model-based acceleration. In Proceedings of The 33rd International Conference on Machine Learning, 2829-2838.

Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2017. Q-Prop: Sample-efficient policy gradient with an off-policy critic. ICLR.

Hunter, D. R., and Lange, K. 2004. A tutorial on MM algorithms. The American Statistician 58(1):30-37.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Maheswaranathan, N.; Metz, L.; Tucker, G.; and Sohl-Dickstein, J. 2018. Guided evolutionary strategies: escaping the curse of dimensionality in random search. arXiv preprint arXiv:1806.10230.

Mania, H.; Guy, A.; and Recht, B. 2018. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529-533.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928-1937.

Rajeswaran, A.; Kumar, V.; Gupta, A.; Vezzani, G.; Schulman, J.; Todorov, E.; and Levine, S. 2017a. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.

Rajeswaran, A.; Lowrey, K.; Todorov, E. V.; and Kakade, S. M. 2017b. Towards generalization and simplicity in continuous control. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 6550-6561.

Rechenberg, I. 1971. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Ph.D. Dissertation, Technical University of Berlin.

Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; and Sutskever, I. 2017.
Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.

Schaul, T.; Glasmachers, T.; and Schmidhuber, J. 2011. High dimensions and heavy tails for natural evolution strategies. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, 845-852. New York, NY, USA: ACM.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 1889-1897.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484-489.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Fan, H.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017. Mastering the game of Go without human knowledge. Nature 550(7676):354-359.

Sun, Y.; Wierstra, D.; Schaul, T.; and Schmidhuber, J. 2009. Stochastic search using the natural gradient. In International Conference on Machine Learning.

Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI, 2094-2100.

Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; and Schmidhuber, J. 2011. Natural evolution strategies. arXiv preprint arXiv:1106.4487.

Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; Peters, J.; and Schmidhuber, J. 2014. Natural evolution strategies. Journal of Machine Learning Research 15(1):949-980.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229-256.

Wu, Y.; Mansimov, E.; Grosse, R. B.; Liao, S.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, 5285-5294.