# Hindsight Trust Region Policy Optimization

Hanbo Zhang¹, Site Bai¹, Xuguang Lan¹, David Hsu² and Nanning Zheng¹
¹Xi'an Jiaotong University  ²National University of Singapore
{zhanghanbo163, best99317}@stu.xjtu.edu.cn, xglan@xjtu.edu.cn, dyhsu@comp.nus.edu.sg, nnzheng@xjtu.edu.cn
Corresponding author.

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has a convergence property similar to that of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves the learning performance for suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.¹

¹All detailed proofs and additional experiments can be found in the supplementary materials.

## 1 Introduction

Reinforcement Learning (RL) has been widely investigated to solve problems from complex strategic games [Mnih et al., 2015] to precise robotic control [Deisenroth et al., 2013]. However, current successful practice of RL in robotics relies heavily on careful and arduous reward shaping [Ng et al., 1999; Grzes, 2017]. Sparse reward, in which the agent is rewarded only upon reaching the desired goal, obviates designing a delicate reward mechanism. It also guarantees that the agent focuses on the intended task itself without any deviation. However, sparse reward diminishes the chance for the policy to converge, especially in the initial random exploration stage, since the agent can hardly get positive feedback.

Recently, several works have been devoted to sparse-reward RL. [Andrychowicz et al., 2017] proposes Hindsight Experience Replay (HER), which trains the agent with hindsight goals generated from the achieved states through the historical interactions. Such hindsight experience substantially alleviates the exploration problem caused by sparse-reward settings. [Rauber et al., 2019] proposes Hindsight Policy Gradient (HPG). It introduces hindsight to policy gradient, resulting in an advanced algorithm for RL with sparse reward. However, several drawbacks remain that hinder HPG's application in more cases. Firstly, as an extension to vanilla policy gradient, its performance level and sample efficiency remain limited. Secondly, it inherits the intrinsic high variance of policy gradient methods, and the combination with hindsight data further undermines learning stability.

In this paper, we propose Hindsight Trust Region Policy Optimization (HTRPO), a hindsight form of TRPO [Schulman et al., 2015a], an advanced RL algorithm with approximately monotonic policy improvements. We prove that HTRPO theoretically inherits the convergence property of TRPO, and significantly reduces the variance of policy improvement by introducing the Quadratic KL divergence Estimation (QKL) approach.
Moreover, to select hindsight goals that better assist the agent in reaching the original goals, we design a Hindsight Goal Filtering mechanism. We demonstrate that in a wide variety of sparse-reward tasks including Atari games and robotic control, HTRPO consistently outperforms TRPO and HPG in both performance and sample efficiency with commendable learning stability. We also provide a comprehensive comparison with HER, showing that HTRPO achieves much better performance in 6 out of 7 benchmarks. Besides, we conduct ablation studies to show that Quadratic KL divergence Estimation effectively lowers the variance and constrains the divergence, while Hindsight Goal Filtering brings the performance to a higher level, especially in more challenging tasks.

## 2 Preliminaries

### 2.1 RL Formulation and Notation

Consider the standard infinite-horizon reinforcement learning formulation defined by the tuple $(\mathcal{S}, \mathcal{A}, \pi, \rho_0, r, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ denote the sets of states and actions respectively. $\pi: \mathcal{S} \to P(\mathcal{A})$ is a policy mapping states to a distribution over actions. $\rho_0$ is the distribution of the initial state $s_0$. The reward function $r: \mathcal{S} \to \mathbb{R}$ defines the reward obtained from the environment, and $\gamma \in (0, 1)$ is a discount factor. In this paper, the policy is a differentiable function with respect to the parameter $\theta$. We follow the standard formalism of the state-action value function $Q(s, a)$, state value function $V(s)$ and advantage function $A(s, a)$ in [Sutton and Barto, 2018]. We also adopt the definition of the $\gamma$-discounted state visitation distribution $\rho_\theta(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t P(s_t = s)$ [Ho et al., 2016]. Correspondingly, the $\gamma$-discounted state-action visitation distribution [Ho et al., 2016], also known as the occupancy measure [Ho and Ermon, 2016], is defined as $\rho_\theta(s, a) = \rho_\theta(s) \cdot \pi_\theta(a|s)$.

### 2.2 Trust Region Policy Optimization

TRPO [Schulman et al., 2015a] is an iterative trust region method that effectively optimizes the policy by maximizing the per-iteration policy improvement. The optimization problem proposed in TRPO can be formalized as follows:

$$\max_\theta \ \mathbb{E}_{s,a\sim\rho_{\tilde\theta}(s,a)}\left[\frac{\pi_\theta(a|s)}{\pi_{\tilde\theta}(a|s)} A_{\tilde\theta}(s, a)\right] \tag{1}$$

$$\text{s.t.} \ \mathbb{E}_{s\sim\rho_{\tilde\theta}(s)}\left[D_{KL}\!\left(\pi_{\tilde\theta}(a|s)\,\|\,\pi_\theta(a|s)\right)\right] \le \epsilon \tag{2}$$

in which $\rho_{\tilde\theta}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s)$. $\theta$ denotes the parameter of the new policy while $\tilde\theta$ is that of the old one.

### 2.3 Hindsight Policy Gradient

HPG [Rauber et al., 2019] combines the idea of hindsight [Andrychowicz et al., 2017] with policy gradient methods. Though goal-conditioned reinforcement learning has been explored for a long time and actively investigated in recent works [Peters and Schaal, 2008; Schaul et al., 2015; Veeriah et al., 2018], HPG first extends the idea of hindsight to goal-conditioned policy gradient and shows that the policy gradient can be computed in expectation over all goals. The goal-conditioned policy gradient is derived as follows:

$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{g,\tau}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t|s_t, g)\, A_\theta(s_t, a_t, g)\right] \tag{3}$$

where $\tau \sim p_\theta(\tau|g)$. Then, by applying the hindsight formulation, HPG rewrites the goal-conditioned policy gradient with trajectories conditioned on an achieved goal $g'$ using importance sampling, so as to solve sparse-reward problems efficiently.
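To make eq. (3) concrete, the following minimal NumPy sketch estimates the goal-conditioned policy gradient from a batch of recorded trajectories. It is an illustrative sketch only: the dictionary keys (`grad_log_pi`, `advantage`) and the assumption that per-step score vectors and advantages are precomputed are ours, not part of HPG's reference implementation.

```python
import numpy as np

def goal_conditioned_pg(trajectories):
    """Monte-Carlo estimate of the goal-conditioned policy gradient (eq. 3).

    Each trajectory is a dict with:
      'grad_log_pi': array of shape (T, n_params), the per-step score
                     grad_theta log pi_theta(a_t | s_t, g),
      'advantage':   array of shape (T,), the per-step advantage A_theta(s_t, a_t, g).
    Both are assumed to be precomputed for the trajectory's goal g.
    """
    grads = []
    for traj in trajectories:
        score = traj['grad_log_pi']                      # (T, n_params)
        adv = traj['advantage']                          # (T,)
        grads.append((score * adv[:, None]).sum(axis=0)) # sum over time steps
    # The expectation over goals and trajectories is approximated by the batch mean.
    return np.mean(grads, axis=0)

# Toy usage: two trajectories of a 3-parameter policy.
rng = np.random.default_rng(0)
batch = [{'grad_log_pi': rng.normal(size=(5, 3)),
          'advantage': rng.normal(size=5)} for _ in range(2)]
print(goal_conditioned_pg(batch))
```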
## 3 Hindsight Trust Region Policy Optimization

In this section, we first introduce the Quadratic KL divergence Estimation (QKL) method, which efficiently reduces the variance of the KL estimation in TRPO and results in higher learning stability. With QKL, we show that TRPO maintains its monotonically-converging property. After that, we derive the hindsight form of TRPO, called the Hindsight Trust Region Policy Optimization algorithm, to tackle the severely off-policy hindsight data for better learning with sparse rewards. Specifically, the expected return and the KL divergence constraint are both modified to adapt to hindsight data with importance sampling. Benefiting from QKL, we can precisely estimate the KL divergence using hindsight data while keeping the variance at a reasonable level. Intuitively, HTRPO utilizes hindsight data to estimate the objective and the constraint, and iteratively finds the locally optimal policy to ensure approximately monotonic policy improvements.

### 3.1 TRPO with Quadratic KL Divergence

In TRPO, the KL divergence expectation under $\rho_{\tilde\theta}(s)$ is estimated by averaging values of the KL divergence conditioned on collected states. However, this method is no longer valid if the KL divergence cannot be analytically computed (e.g., for Gaussian Mixture Model policies) or if the state distribution changes (e.g., when using hindsight data instead of the collected ones). To solve this problem, we first transform the KL divergence into an expectation under the occupancy measure $\rho_{\tilde\theta}(s, a) = \rho_{\tilde\theta}(s) \cdot \pi_{\tilde\theta}(a|s)$. It can then be estimated using the collected state-action pairs $(s, a)$, no longer depending on the analytical form of the KL divergence. Such a formulation is also convenient for correcting the changed distribution over states and actions by importance sampling, which will be discussed in Section 3.2. However, it increases the estimation variance, causing instability in training. Therefore, by making use of another f-divergence, we propose QKL to approximate the KL divergence for variance reduction, and we show the effectiveness of this approximation both theoretically and practically.

Given two policies $\pi_{\tilde\theta}(a|s)$ and $\pi_\theta(a|s)$, the KL divergence over a state $s$ can be converted to a logarithmic form:

$$D_{KL}\!\left(\pi_{\tilde\theta}(a|s)\,\|\,\pi_\theta(a|s)\right) = \mathbb{E}_{a\sim\pi_{\tilde\theta}(a|s)}\left[\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)\right]$$

However, simply expanding the KL divergence into logarithmic form still leaves several problems unhandled. Firstly, such a formulation causes excessively high estimation variance. Secondly, such an estimate of the KL divergence can be negative. To overcome these two drawbacks, we propose Quadratic KL Divergence Estimation in Proposition 1 and prove that such an approximation reduces the estimation variance in Proposition 2 (detailed proofs can be found in Appendix A.1 and A.2):

Proposition 1. (Quadratic KL Divergence Estimation). For policies $\pi_{\tilde\theta}(a|s)$ and $\pi_\theta(a|s)$, and for $\eta = \pi_\theta(a|s) - \pi_{\tilde\theta}(a|s)$,

$$\mathbb{E}_{a\sim\pi_{\tilde\theta}(a|s)}\!\left[\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)\right] = \mathbb{E}_{a\sim\pi_{\tilde\theta}(a|s)}\!\left[\tfrac{1}{2}\left(\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)\right)^2\right] + \mathbb{E}_{a\sim\pi_{\tilde\theta}(a|s)}\!\left[O(\eta^3)\right] \tag{4}$$

Proposition 1 demonstrates that when $\tilde\theta$ and $\theta$ are of limited difference, the expectation of $\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)$ can be sufficiently estimated by the expectation of half its square. In fact, $\mathbb{E}_{a\sim\pi_{\tilde\theta}(a|s)}\!\left[\frac{1}{2}(\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s))^2\right]$ takes the form of an f-divergence with $f(x) = \frac{1}{2}x(\log x)^2$, which we call $D_{QKL}$ in this paper. Noticeably, $f(x)$ is convex only for $x \in (\frac{1}{e}, \infty)$, so strictly speaking it does not correspond to an f-divergence on all of $\mathbb{R}^+$. In our practice, however, $\frac{\pi_{\tilde\theta}(a|s)}{\pi_\theta(a|s)} > \frac{1}{e}$ holds, hence we can define a convex function on $\mathbb{R}^+$: $f(x) = \frac{1}{2}x(\log x)^2$ for $x \in (\frac{1}{e}, \infty)$ and $f(x) = \frac{1}{e} - \frac{x}{2}$ for $x \in (0, \frac{1}{e}]$, with the piece defined over $(0, \frac{1}{e}]$ left unused.

Proposition 2. (Variance of the Constraint Function). For policies $\pi_{\tilde\theta}(a|s)$ and $\pi_\theta(a|s)$, let $\mathrm{Var}$ denote the variance of a random variable. For any action $a \in \mathcal{A}$ and any state $s \in \mathcal{S}$, when $\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s) \in [-0.5, 0.5]$, then

$$\mathrm{Var}_{a\sim\pi_{\tilde\theta}(a|s)}\!\left[\left(\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)\right)^2\right] \le \mathrm{Var}_{a\sim\pi_{\tilde\theta}(a|s)}\!\left[\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)\right]. \tag{5}$$

Proposition 2 illustrates that there is a decrease from the variance of $\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)$ to the variance of its square. In fact, the closer $\tilde\theta$ and $\theta$ are, the more the variance decreases.
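As a numerical illustration of Propositions 1 and 2, the sketch below samples actions from an old Gaussian policy and compares the naive log-ratio estimator of the KL divergence with the quadratic (QKL) estimator. The Gaussian policies, the 0.1 mean shift, and the sample size are illustrative choices of ours, not the paper's setup; the point is that the QKL estimate is non-negative by construction and has a much smaller per-sample variance for a small policy change.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Old policy pi_old = N(0, 1), new policy pi_new = N(0.1, 1): a small update.
a = rng.normal(loc=0.0, scale=1.0, size=100_000)
log_old = norm.logpdf(a, loc=0.0, scale=1.0)
log_new = norm.logpdf(a, loc=0.1, scale=1.0)

delta = log_old - log_new          # per-sample log-ratio
kl_naive = delta                   # naive KL estimator (can be negative per sample)
kl_qkl = 0.5 * delta ** 2          # QKL estimator (always non-negative)

print("naive estimate :", kl_naive.mean(), " variance:", kl_naive.var())
print("QKL estimate   :", kl_qkl.mean(),   " variance:", kl_qkl.var())
print("analytic KL    :", 0.5 * 0.1 ** 2)  # KL between the two Gaussians
```

For this pair of Gaussians the analytic KL is 0.005; both estimators agree with it in expectation, while the per-sample variance drops by roughly two orders of magnitude for the QKL estimator.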
Next, we show that with the introduction of QKL, TRPO still maintains a similar convergence property.

Proposition 3. (Policy Improvement Guarantee). Given two policies $\pi_\theta$ and $\pi_{\tilde\theta}$, let $s^* = \arg\max_s D_{TV}\!\left(\pi_{\tilde\theta}(a|s), \pi_\theta(a|s)\right)$. If $\frac{D_{KL}(\pi_{\tilde\theta}(a|s^*),\,\pi_\theta(a|s^*))}{D_{QKL}(\pi_{\tilde\theta}(a|s^*),\,\pi_\theta(a|s^*))} \le 2\ln 2$, then

$$\eta(\pi_\theta) \ge L_{\pi_{\tilde\theta}}(\pi_\theta) - C\, D^{\max}_{QKL}\!\left(\pi_{\tilde\theta}(a|s), \pi_\theta(a|s)\right) \tag{6}$$

$$L_{\pi_{\tilde\theta}}(\pi_\theta) = \eta(\pi_{\tilde\theta}) + \mathbb{E}_{s\sim\rho_{\tilde\theta}(s),\, a\sim\pi_\theta(a|s)}\!\left[A_{\pi_{\tilde\theta}}(s, a)\right] \tag{7}$$

where $\eta(\pi_\theta) = \mathbb{E}\!\left[\sum_t \gamma^t r_t\right]$ is the expected return, $C = \frac{4\beta\gamma}{(1-\gamma)^2}$, $\beta = \max_{s,a}|A_{\pi_{\tilde\theta}}(s, a)|$, and $D_{TV}(p, q) = \frac{1}{2}\sum_i |p_i - q_i|$.

The proof and detailed analysis are given in Appendix B. Intuitively, Proposition 3 means that when two policies are not far from each other, the convergence property of TRPO also holds for the QKL constraint. As a result, with Proposition 3, we can derive a new but similar monotonically-converging algorithm as in TRPO, given in Appendix B.2. By taking a series of approximations as shown in Appendix B.2, the following policy optimization problem is derived, called QKL-TRPO: maximize the objective of eq. (1) subject to

$$\mathbb{E}_{s,a\sim\rho_{\tilde\theta}(s,a)}\!\left[\tfrac{1}{2}\left(\log \pi_{\tilde\theta}(a|s) - \log \pi_\theta(a|s)\right)^2\right] \le \epsilon \tag{8}$$

It is noteworthy that QKL-TRPO can be applied to policies that do not admit an analytic KL divergence (e.g., GMM policies). We also provide a simple analysis of QKL-TRPO compared with the original TRPO in Appendix G.1, which shows that QKL-TRPO is comparable with TRPO on a series of MuJoCo benchmarks.

### 3.2 Hindsight Formulation of QKL-TRPO

In this section, we derive the hindsight form of QKL-TRPO, called Hindsight Trust Region Policy Optimization (HTRPO), to efficiently tackle severely off-policy hindsight experience and sparse-reward RL problems. Starting from eq. (1), the objective can be written in the following variant form:

$$L_{\tilde\theta}(\theta) = \mathbb{E}_{\tau\sim p_{\tilde\theta}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t\, \frac{\pi_\theta(a_t|s_t)}{\pi_{\tilde\theta}(a_t|s_t)}\, A_{\tilde\theta}(s_t, a_t)\right] \tag{9}$$

The derivation of this variant form is shown explicitly in Appendix C.1 and in [Schulman et al., 2015a]. Given the expression above and similar to eq. (3), we consider the goal-conditioned objective function:

$$L_{\tilde\theta}(\theta) = \mathbb{E}_{g,\tau}\left[\sum_{t=0}^{\infty} \gamma^t\, \frac{\pi_\theta(a_t|s_t, g)}{\pi_{\tilde\theta}(a_t|s_t, g)}\, A_{\tilde\theta}(s_t, a_t, g)\right] \tag{10}$$

where $\tau \sim p_{\tilde\theta}(\tau|g)$. Though it may seem that eq. (10) enables off-policy learning, it can be used as the objective only when the policy $\pi_\theta$ is close to the old policy $\pi_{\tilde\theta}$, i.e., within the trust region. Using severely off-policy data like hindsight experience would make the learning process diverge. Therefore, importance sampling is integrated to correct the difference in the trajectory distribution caused by changing the goal. Based on eq. (10), the following proposition gives the hindsight objective function conditioned on some goal $g'$, with the distribution correction derived from importance sampling.

Proposition 4. (Hindsight Expected Return). For the original goal $g$ and hindsight goal $g'$, the objective function of HTRPO, $L_{\tilde\theta}(\theta)$, is given by:

$$L_{\tilde\theta}(\theta) = \mathbb{E}_{g',\tau}\left[\sum_{t=0}^{\infty}\left(\prod_{k=0}^{t}\frac{\pi_{\tilde\theta}(a_k|s_k, g')}{\pi_{\tilde\theta}(a_k|s_k, g)}\right)\gamma^t\, \frac{\pi_\theta(a_t|s_t, g')}{\pi_{\tilde\theta}(a_t|s_t, g')}\, A_{\tilde\theta}(s_t, a_t, g')\right] \tag{11}$$

in which $\tau \sim p_{\tilde\theta}(\tau|g)$ and $\tau = s_0, a_0, s_1, a_1, \dots, s_t, a_t$.

Appendix C.2 presents an explicit proof of how the hindsight-form objective function is derived from eq. (10).
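Read as a sample-based recipe, Proposition 4 says: for each hindsight trajectory, form a running product of old-policy ratios between the hindsight goal $g'$ and the original goal $g$, and apply it to the per-step discounted, ratio-weighted advantages. The sketch below computes this estimate for a single trajectory, assuming the per-step log-probabilities and advantages are already available as arrays; the function and argument names are ours, not the authors' implementation.

```python
import numpy as np

def htrpo_objective(log_old_g, log_old_gp, log_new_gp, adv_gp, gamma=0.98):
    """Estimate the hindsight surrogate objective of eq. (11) for one trajectory.

    log_old_g  : log pi_old(a_t | s_t, g),  original goal   (shape (T,))
    log_old_gp : log pi_old(a_t | s_t, g'), hindsight goal  (shape (T,))
    log_new_gp : log pi_new(a_t | s_t, g'), hindsight goal  (shape (T,))
    adv_gp     : A_old(s_t, a_t, g') recomputed under g'    (shape (T,))
    """
    T = len(adv_gp)
    # Hindsight importance weight: prod_{k<=t} pi_old(a_k|s_k,g') / pi_old(a_k|s_k,g).
    hindsight_w = np.exp(np.cumsum(log_old_gp - log_old_g))
    # Policy-update ratio pi_new(a_t|s_t,g') / pi_old(a_t|s_t,g').
    ratio = np.exp(log_new_gp - log_old_gp)
    discounts = gamma ** np.arange(T)
    return np.sum(hindsight_w * discounts * ratio * adv_gp)

# Toy usage with random numbers standing in for real log-probabilities.
rng = np.random.default_rng(1)
T = 6
print(htrpo_objective(rng.normal(size=T) - 1.0, rng.normal(size=T) - 1.0,
                      rng.normal(size=T) - 1.0, rng.normal(size=T)))
```

Averaging this quantity over a batch of relabeled trajectories approximates the outer expectation over $g'$ and $\tau$.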
In our practice, we introduce a baseline $V_\theta(s)$ for computing the advantage $A_\theta$. Though $A_\theta$ could be estimated by combining per-decision returns [Precup et al., 2000], due to their high variance we adopt the one-step TD method instead, i.e., $A_\theta(s, a) = r(s, a) + \gamma V_\theta(s') - V_\theta(s)$. Intuitively, eq. (11) provides a way to compute the expected return in terms of the advantage with hindsight experiences conditioned on new goals, which are generated from interactions directed by the old goals.

Next, we demonstrate that hindsight can also be introduced to the constraint function. The proof follows a methodology similar to that of Proposition 4, and is derived explicitly in Appendix C.3.

Proposition 5. (HTRPO Constraint Function). For the original goal $g$ and hindsight goal $g'$, the constraint between policy $\pi_{\tilde\theta}(a|s)$ and policy $\pi_\theta(a|s)$ is given by:

$$\mathbb{E}_{g',\tau}\left[\sum_{t=0}^{\infty}\left(\prod_{k=0}^{t}\frac{\pi_{\tilde\theta}(a_k|s_k, g')}{\pi_{\tilde\theta}(a_k|s_k, g)}\right)\gamma^t K_t\right] \le \epsilon' \tag{12}$$

in which $\tau \sim p_{\tilde\theta}(\tau|g)$, $\epsilon' = \frac{\epsilon}{1-\gamma}$, and $K_t = \frac{1}{2}\left(\log \pi_{\tilde\theta}(a_t|s_t, g') - \log \pi_\theta(a_t|s_t, g')\right)^2$.

Proposition 5 implies the practicality of using hindsight data conditioned on $g'$ to estimate the KL expectation. From all the illustration above, we give the final form of the optimization problem for HTRPO:

$$\max_\theta \ \mathbb{E}_{g',\tau}\left[\sum_{t=0}^{\infty}\left(\prod_{k=0}^{t}\frac{\pi_{\tilde\theta}(a_k|s_k, g')}{\pi_{\tilde\theta}(a_k|s_k, g)}\right)\gamma^t R_t\right] \tag{13}$$

$$\text{s.t.} \ \mathbb{E}_{g',\tau}\left[\sum_{t=0}^{\infty}\left(\prod_{k=0}^{t}\frac{\pi_{\tilde\theta}(a_k|s_k, g')}{\pi_{\tilde\theta}(a_k|s_k, g)}\right)\gamma^t K_t\right] \le \epsilon' \tag{14}$$

where $\tau \sim p_{\tilde\theta}(\tau|g)$, $R_t = \frac{\pi_\theta(a_t|s_t, g')}{\pi_{\tilde\theta}(a_t|s_t, g')} A_{\tilde\theta}(s_t, a_t, g')$ and $K_t = \frac{1}{2}\left(\log \pi_{\tilde\theta}(a_t|s_t, g') - \log \pi_\theta(a_t|s_t, g')\right)^2$.

The solving process for the HTRPO optimization problem is demonstrated explicitly in Appendix D.
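For completeness, here is a minimal sketch of how the constraint side of problem (13)-(14) and the one-step TD advantage described above could be estimated from the same per-step log-probabilities. The array names and the random stand-in inputs are illustrative assumptions of ours, not the authors' released code.

```python
import numpy as np

def qkl_constraint(log_old_g, log_old_gp, log_new_gp, gamma=0.98):
    """Hindsight estimate of the QKL constraint term of eq. (14) for one trajectory."""
    # Same cumulative hindsight weight as in the objective estimate.
    hindsight_w = np.exp(np.cumsum(log_old_gp - log_old_g))
    k_t = 0.5 * (log_old_gp - log_new_gp) ** 2   # K_t from Proposition 5
    discounts = gamma ** np.arange(len(k_t))
    return np.sum(hindsight_w * discounts * k_t)

def one_step_td_advantage(rewards, values, next_values, gamma=0.98):
    """A(s, a) = r(s, a) + gamma * V(s') - V(s), used in place of per-decision returns."""
    return rewards + gamma * next_values - values

# Toy usage with random stand-ins for log-probabilities, rewards and values.
rng = np.random.default_rng(2)
T = 6
print(qkl_constraint(rng.normal(size=T) - 1.0, rng.normal(size=T) - 1.0,
                     rng.normal(size=T) - 1.0))
print(one_step_td_advantage(rng.normal(size=T), rng.normal(size=T), rng.normal(size=T)))
```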
## 4 Hindsight Goal Filtering

Figure 1: Procedure of Hindsight Goal Filtering (selecting valid achieved goals by max-min distance).

In hindsight learning, the agent generalizes to reaching the original goal by first learning to reach the hindsight goals. Therefore, the selection of hindsight goals imposes a direct impact on the performance. If the hindsight goals are far from the original ones, the learned policy may not generalize well to the original goals. For example, in FetchPickAndPlace, the initial random policy rarely grasps the target successfully, which results in the hindsight goals mostly distributed on the table. Given that the original goals are up in the air, such a discrepancy can cause lower learning efficiency. In this section, we introduce a heuristic method called Hindsight Goal Filtering (HGF). Intuitively, HGF filters the most useful goals from the achieved ones instead of selecting them at random. Specifically, based on our analysis (eq. (13)), the performance improves if we reduce the distribution discrepancy between the original goals $g$ and the hindsight goals $g'$. Ideally, if the distribution of $g'$ matches that of $g$, the agent will reach $g$ after learning to reach $g'$. Therefore, we restrict the selected hindsight goals to distribute in the original goal space whenever possible, so as to cover the area of original goals. The main idea is shown in Figure 1 and the algorithm is summarized in Appendix E.1.

The input of HGF includes two parts: the achieved goal set $G_a$ and the original goal set $G_o$. At the beginning, especially for some complex tasks, $G_a$ may have only a small or even no overlap with $G_o$. In this situation, we encourage the agent to learn to reach the original goal region by selecting the nearest achieved goals as the hindsight goals. Once some achieved goals fall in the original goal region, they are considered valid achieved goals, and a subset of this intersection is sampled to cover the region as fully as possible. This subset is selected following the procedure in Figure 1. Note that the distance metric should be determined by the collected original goal distribution; in our experiments, we use the density-weighted Euclidean distance. Specifically, we initialize the hindsight goal set $G$ with a randomly sampled achieved goal. To make the goals distribute dispersedly, we use the Max-Min Distance as the measurement, which is the minimal distance between a new goal and the already selected ones. By maximizing this minimal distance, we ensure an overall large distance between the new goal and the rest.

HGF is related to Curriculum-guided HER (CHER) [Fang et al., 2019] to some extent. However, CHER is suited to transition-based RL and cannot be applied directly to episode-based policy gradient algorithms. The complete algorithm of HGF and HTRPO is presented in Appendix E.
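The max-min selection step of HGF can be sketched as a greedy procedure: start from one randomly sampled achieved goal, then repeatedly add the candidate whose minimal distance to the already selected goals is largest, so that the chosen hindsight goals spread over the original goal region. The sketch below uses plain Euclidean distance for simplicity, whereas the paper uses a density-weighted variant; the function name and arguments are illustrative.

```python
import numpy as np

def max_min_goal_filtering(achieved_goals, n_select, rng=None):
    """Greedy max-min selection of hindsight goals from an array of achieved goals.

    achieved_goals: array of shape (N, d), goals falling inside the original goal region.
    n_select:       number of hindsight goals to keep.
    """
    rng = rng or np.random.default_rng()
    goals = np.asarray(achieved_goals, dtype=float)
    selected = [goals[rng.integers(len(goals))]]          # random initial goal
    while len(selected) < min(n_select, len(goals)):
        # For every candidate, distance to its nearest already-selected goal.
        dists = np.linalg.norm(goals[:, None, :] - np.array(selected)[None, :, :], axis=-1)
        min_dists = dists.min(axis=1)
        # Pick the candidate that maximizes this minimal distance.
        selected.append(goals[int(np.argmax(min_dists))])
    return np.array(selected)

# Toy usage: 50 achieved goals in the plane, keep 5 well-spread ones.
rng = np.random.default_rng(3)
print(max_min_goal_filtering(rng.uniform(size=(50, 2)), 5, rng))
```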
## 5 Experiments

Our experiments aim to answer the following questions:

1. How does HTRPO compare to other methods over diversified tasks? (Section 5.2)
2. What are the main contributors to HTRPO? (Section 5.3)
3. How do key parameters affect the performance? (Section 5.4)

For 1), we show that HTRPO consistently outperforms both HPG and TRPO in success rate and sample efficiency in a wide variety of tasks, and achieves state-of-the-art performance among sparse-reward stochastic policy gradient methods. We also provide an in-depth comparison with HER in this part. For 2), we ablate the main components of HTRPO. The ablation study shows that QKL effectively reduces the variance and significantly improves the performance in all tasks, while HGF plays a crucial role in the improved performance on the more challenging tasks (e.g., FetchPickAndPlace). For 3), we vary the scale of the KL estimation constraint and the number of hindsight goals, and choose the best parameter settings.

### 5.1 Benchmark Settings

We implement HTRPO on a variety of sparse-reward tasks. Firstly, we test HTRPO on simple benchmarks established in previous work [Andrychowicz et al., 2017], including 4-to-100-Bit Flipping tasks. Secondly, we verify HTRPO's performance on Atari games like Ms. Pac-Man [Bellemare et al., 2013] with complex raw image input to demonstrate its generalization to convolutional neural network policies. Finally, we test HTRPO on simulated robot control tasks, namely Reach, Push, Slide and Pick And Place in the Fetch [Plappert et al., 2018] robot environment. As mentioned in [Plappert et al., 2018], it remains unexplored to what extent policy gradient methods trained with hindsight data can solve continuous control tasks. Since HTRPO is a natural candidate that can be applied to both discrete and continuous tasks, other than the discrete Fetch environments introduced in [Rauber et al., 2019], we also implement HTRPO in continuous environments including FetchReach, FetchPush, FetchSlide and FetchPickAndPlace. A glimpse of these environments is shown in Figure 2, and comprehensive introductions are included in Appendix F.1. Detailed settings of hyperparameters are listed in Appendix F.2.

Figure 2: Demonstration of experiment environments. (a) Bit Flipping. (b) Ms. Pac-Man.

All experiments are conducted on a platform with an NVIDIA GeForce GTX 1080Ti. We compare HTRPO with HPG [Rauber et al., 2019] and TRPO [Schulman et al., 2015a], which are chosen as the baseline algorithms. The reward setting used in our paper is purely sparse reward, i.e., while the task has not been finished, the agent receives 0 reward at each time step, and once the task is finished, the agent receives a high positive reward. Besides, TRPO is also implemented with dense rewards and the new KL estimation method proposed in Section 3.1. For a fair comparison, we also combine HPG with Hindsight Goal Filtering in our experiments. To demonstrate the performance level of HTRPO more comprehensively, we also compare HTRPO with the well-known HER algorithm. In all experiments, we directly count the accumulated time steps the agent takes while interacting with the environments throughout episodes and batches, and do not count the hindsight steps generated using hindsight goals.

### 5.2 Comparative Analysis

We evaluate HTRPO's performance in terms of success rate and sample efficiency, and test its generality over different tasks, including image-based Atari games and simulated robot control tasks. The results show HTRPO's consistent effectiveness and strong generality across different kinds of tasks and policies.

**Compare with Baselines.** The success rate curves for the trained policies are shown in Figure 3. We can conclude that HTRPO consistently outperforms all baselines, including different versions of TRPO and HPG, in most benchmarks, covering image-based Atari games (Ms. Pac-Man) and a variety of simulated robot control tasks with different control modes. This demonstrates that HTRPO generalizes well to different kinds of tasks and policies with high-dimensional inputs. Besides, the sample efficiency of HTRPO also exceeds that of HPG, as it reaches a higher average return in less time in most environments.

Figure 3: Success rate for benchmark environments. Top row: performance on discrete environments. Bottom row: performance on continuous environments. The solid lines represent the average evaluation over 10 trials and the shaded regions represent the corresponding standard deviation.

**Compare with HER.** We implement HER with DQN [Mnih et al., 2015] for discrete environments and DDPG [Lillicrap et al., 2015] for continuous environments, based on OpenAI baselines (https://github.com/openai/baselines). We found that HER cannot work well with the purely sparse reward, i.e., where the reward is available only upon reaching the goal. Thus, we also follow the reward setting in [Andrychowicz et al., 2017] to conduct HER experiments for reference (HER 1).

**Toy example.** To begin with, we test HTRPO on the 4-to-100-Bit Flipping tasks [Andrychowicz et al., 2017] as well as HER (Figure 4). The maximum number of training steps is $2\times10^6$. In all Bit Flipping tasks, HTRPO converges to nearly 100% success rate with much fewer time steps, while HER becomes much less data-efficient as the number of bits increases.

Figure 4: Performance of Bit Flipping.

**Benchmarks.** Table 1 shows the comparison over the benchmark environments. We can conclude that: 1) HER cannot work well with the purely sparse reward setting, and HTRPO significantly outperforms HER in 6 out of 7 benchmarks. 2) For discrete robot control tasks, HTRPO can learn a good policy while HER 1+DQN cannot work well. For continuous environments, HER 1 slightly outperforms HTRPO. In summary, HTRPO can be applied to both discrete and continuous tasks without any modification and achieves commendable performance compared to HER.

| Environment | HER | HER 1 | HTRPO |
|---|---|---|---|
| Ms. Pac-Man | 72±3 | 74±4 | 64±6 |
| FetchReach (discrete) | 53±41 | 100±0 | 100±0 |
| FetchPush (discrete) | 8±1 | 7±2 | 88±2 |
| FetchSlide (discrete) | 11±3 | 10±5 | 76±9 |
| FetchReach (continuous) | 18±3 | 100±0 | 100±0 |
| FetchPush (continuous) | 7±2 | 100±0 | 98±1 |
| FetchSlide (continuous) | 1±1 | 93±7 | 85±4 |

Table 1: Success rate comparison between HTRPO and HER (%). HER 1 means using the original -1-and-0 reward setting instead of the purely sparse reward that HTRPO uses, i.e., the agent receives a high reward only when it achieves the goal.

### 5.3 Ablation Studies

There are mainly two components in HTRPO, QKL and HGF, that affect the performance. Besides, we also investigate the impact of Weighted Importance Sampling (WIS), which is conducive to variance reduction.
To study the effect of reward settings, we additionally implement HTRPO with dense rewards. Selected results are shown in Figure 5 and the full ablation study is available in Appendix G.2. We can conclude that: 1) QKL plays a crucial role in the high performance of HTRPO by significantly reducing the estimation variance of the KL divergence; 2) HGF can enhance the performance of HTRPO to a higher level; 3) WIS is important since it significantly reduces the variance of importance sampling; 4) the dense-reward setting harms the performance, which has also been verified in [Plappert et al., 2018].

Figure 5: Ablation Experiments.

### 5.4 Hyperparameter Selection

We take continuous FetchPush as an example to study the impact of different KL estimation constraint scales and different numbers of hindsight goals.

**Different KL estimation constraint scales.** The KL estimation constraint, i.e., the max KL step, specifies the trust region, the range within which the agent searches for the next-step optimal policy. In the sense of controlling the scale at which the agent updates the policy per step, this parameter functions similarly to a learning step size. If set too low, say 5e-6 as shown in Figure 6, it inevitably slows down convergence. If set too high, the potentially large divergence between the new and old policies may violate the premise of some core parts of the HTRPO derivation, including Proposition 1 and the HTRPO solving process.

Figure 6: Max KL steps.

**Different numbers of hindsight goals.** From the results in Figure 7, it is straightforward that more hindsight goals lead to faster convergence. This phenomenon accords with the mechanism by which the hindsight methodology deals with sparse-reward scenarios, i.e., it augments the sample pool with substantial hindsight data rather than leaving it with few valid original trajectories. It is intuitive that the more hindsight data there is, the higher the sample efficiency HTRPO achieves. However, limited by the hardware resources, we need to trade off the number of sampled goals.

Figure 7: Goal numbers.

## 6 Conclusion

We proposed Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. We show that with the help of the proposed Quadratic KL divergence Estimation (QKL), HTRPO significantly reduces the variance of KL estimation and improves performance and learning stability. Moreover, we design a Hindsight Goal Filtering mechanism to narrow the discrepancy between the hindsight and original goal spaces, leading to better performance. Results on diversified benchmarks demonstrate the effectiveness of HTRPO. Since HTRPO is a natural candidate for both discrete and continuous tasks and the QKL constraint removes the need for an analytical KL form, it is promising for optimizing policies with non-Gaussian (e.g. GMM) or mixed (discrete+continuous) action spaces.
It also provides the possibility to tackle high-dimensional real-world problems and train robot control policies without arduous reward shaping. Besides, HGF can be integrated naturally into hindsight-goal exploration methods [Ren et al., 2019; Pitis et al., 2020], which should lead to higher performance.

## Acknowledgements

This work was supported in part by NSFC under grants No. 91748208, No. 62088102 and No. 61973246, the Shaanxi Project under grant No. 2018ZDCXLGY0607, and the program of the Ministry of Education. D. Hsu is supported by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG2-RP-2020-016).

## References

[Andrychowicz et al., 2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048-5058, 2017.

[Bellemare et al., 2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, June 2013.

[Deisenroth et al., 2013] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.

[Fang et al., 2019] Meng Fang, Tianyi Zhou, Yali Du, Lei Han, and Zhengyou Zhang. Curriculum-guided hindsight experience replay. In Advances in Neural Information Processing Systems, pages 12623-12634, 2019.

[Grzes, 2017] Marek Grzes. Reward shaping in episodic reinforcement learning. In International Joint Conference on Autonomous Agents and Multi-agent Systems, 2017.

[Ho and Ermon, 2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.

[Ho et al., 2016] Jonathan Ho, Jayesh Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pages 2760-2769, 2016.

[Lillicrap et al., 2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[Ng et al., 1999] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278-287, 1999.

[Peters and Schaal, 2008] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.

[Pitis et al., 2020] Silviu Pitis, Harris Chan, Stephen Zhao, Bradly Stadie, and Jimmy Ba. Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. In International Conference on Machine Learning, pages 7750-7761. PMLR, 2020.

[Plappert et al., 2018] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
[Precup et al., 2000] Doina Precup, Richard S Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In ICML'00: Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

[Rauber et al., 2019] Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In International Conference on Learning Representations, 2019.

[Ren et al., 2019] Zhizhou Ren, Kefan Dong, Yuan Zhou, Qiang Liu, and Jian Peng. Exploration via hindsight goal generation. In Advances in Neural Information Processing Systems, pages 13485-13496, 2019.

[Schaul et al., 2015] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312-1320, 2015.

[Schulman et al., 2015a] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

[Schulman et al., 2015b] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[Sutton and Barto, 2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 2018.

[Veeriah et al., 2018] Vivek Veeriah, Junhyuk Oh, and Satinder Singh. Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605, 2018.