# On Proximal Policy Optimization's Heavy-Tailed Gradients

Saurabh Garg¹, Joshua Zhanson², Emilio Parisotto¹, Adarsh Prasad¹, J. Zico Kolter², Zachary C. Lipton¹, Sivaraman Balakrishnan³, Ruslan Salakhutdinov¹, Pradeep Ravikumar¹

¹Machine Learning Department, Carnegie Mellon University. ²Computer Science Department, Carnegie Mellon University. ³Department of Statistics and Data Science, Carnegie Mellon University. Correspondence to: Saurabh Garg.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich (heavy-tailed) regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.

## 1. Introduction

As Deep Reinforcement Learning (DRL) methods have made strides on such diverse tasks as game playing and continuous control (Berner et al., 2019; Silver et al., 2017; Mnih et al., 2015), policy gradient methods (Williams, 1992; Sutton et al., 2000; Mnih et al., 2016) have risen as a popular alternative to dynamic programming approaches. Since Mnih et al. (2016)'s breakthrough results demonstrated the applicability of policy gradients in DRL, a number of popular variants have emerged (Schulman et al., 2017; Espeholt et al., 2018). Proximal Policy Optimization (PPO) (Schulman et al., 2017), one of the most popular policy gradient methods, introduced the clipped importance sampling update, an effective heuristic for off-policy learning. However, while its stated motivation for clipping draws upon trust-region enforcement, the updates in practice tend to deviate from such trust regions (Ilyas et al., 2018) and exhibit sensitivity to implementation details such as random seeds and hyperparameter choices (Engstrom et al., 2019). This brittleness characterizes not just PPO, but policy gradient methods more generally (Ilyas et al., 2018; Henderson et al., 2017; 2018; Islam et al., 2017), raising a broader concern about our understanding of these methods. In this work, we take a step towards understanding the workings of PPO, the most prominent and widely used deep policy gradient method.
Noting that the heuristics implemented in PPO are evocative of estimation techniques from robust statistics in outlier-rich and heavy-tailed settings, we conjecture that the heavy-tailed distribution of gradients is the main obstacle addressed by these heuristics. We perform a rigorous empirical study to confirm the existence of heavy-tailedness in PPO gradients and to investigate its causes and consequences.

Our first contribution is to analyze the role played by each component of the PPO objective in the heavy-tailedness of the gradients. We observe that as training proceeds, gradients of both the actor and the critic loss grow more heavy-tailed. Our findings show that during on-policy gradient steps the advantage estimates are the primary contributors to the heavy-tailed nature of the gradients. Moreover, as off-policyness increases during training (i.e., as the behavioral and actor policy diverge), the likelihood ratios that appear in the surrogate objective exacerbate the heavy-tailedness.

Second, we highlight the consequences of the heavy-tailedness of PPO's gradients. Empirically, we find that heavy-tailedness in likelihood ratios induced during off-policy training can be a significant factor causing optimization instability, leading to low average rewards. Moreover, we also show that removing heavy-tailedness in advantage estimates can enable agents to achieve superior performance. Subsequently, we demonstrate that the clipping heuristics present in standard PPO implementations (i.e., gradient clipping, actor objective clipping, and value loss clipping) significantly counteract the heavy-tailedness induced by off-policy training.

Finally, motivated by this analysis, we present an algorithm that uses Geometric Median-of-Means (GMOM), a high-dimensional robust aggregation method adapted from the statistics literature. Without using any of the objective clipping or gradient clipping heuristics implemented in PPO, the GMOM algorithm nearly matches PPO's performance on MuJoCo (Todorov et al., 2012) tasks, which strengthens our conjecture that heavy-tailedness is a critical concern facing policy gradient methods, and that the benefits of PPO's clipping heuristics come primarily from addressing this problem.

## 2. Preliminaries

We define a Markov Decision Process (MDP) as a tuple $(\mathcal{S}, \mathcal{A}, R, \gamma, P)$, where $\mathcal{S}$ represents the set of environment states, $\mathcal{A}$ represents the set of agent actions, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\gamma$ is the discount factor, and $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the state transition probability distribution. The goal in reinforcement learning is to learn a policy $\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ such that the expected cumulative discounted reward (known as the return) is maximized. Formally,

$$\pi^* := \arg\max_{\pi} \; \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\, s_{t+1} \sim P(\cdot \mid s_t, a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right].$$

Policy gradient methods directly parameterize the policy (also known as the actor network), i.e., they define a policy $\pi_\theta$, parameterized by $\theta$. Since directly optimizing the cumulative rewards can be challenging, modern policy gradient algorithms typically optimize a surrogate reward function which includes a likelihood ratio in order to re-use stale (off-policy) trajectories via importance sampling. For example, Schulman et al. (2015a) iteratively optimize:

$$\max_{\theta_t} \; \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{t-1}}} \left[ \frac{\pi_{\theta_t}(a_t \mid s_t)}{\pi_{\theta_{t-1}}(a_t \mid s_t)} \, A^{\pi_{\theta_{t-1}}}(s_t, a_t) \right], \qquad (1)$$

where $A^{\pi_{\theta_t}} = Q_{\theta_t}(s_t, a_t) - V_{\theta_t}(s_t)$.
Here, the Q-function $Q_{\theta_t}(s, a)$ is the expected discounted reward after taking action $a$ at state $s$ and following $\pi_{\theta_t}$ afterwards, and $V_{\theta_t}(s)$ is the value estimate (implemented with a critic network). However, the surrogate is indicative of the true reward function only when $\pi_{\theta_t}$ and $\pi_{\theta_{t-1}}$ are close in distribution. Different policy gradient methods (Schulman et al., 2015a; 2017; Kakade, 2002) attempt to enforce this closeness in different ways. Natural Policy Gradients (Kakade, 2002) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) utilize conservative policy iteration with an explicit divergence constraint, which provides provable lower-bound guarantees on the improvements of the parameterized policy. On the other hand, PPO (Schulman et al., 2017) implements a clipping heuristic on the likelihood ratio to avoid excessively large policy updates. Specifically, PPO optimizes the following objective:

$$\max_{\theta_t} \; \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{t-1}}} \Big[ \min\Big( \rho_t \, \hat{A}^{\pi_{\theta_{t-1}}}(s_t, a_t), \; \mathrm{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon)\, \hat{A}^{\pi_{\theta_{t-1}}}(s_t, a_t) \Big) \Big], \qquad (2)$$

where $\rho_t := \pi_{\theta_t}(a_t \mid s_t) / \pi_{\theta_{t-1}}(a_t \mid s_t)$ and $\mathrm{clip}(x, 1 - \epsilon, 1 + \epsilon)$ clips $x$ to stay between $1 - \epsilon$ and $1 + \epsilon$. We refer to $\rho_t$ as likelihood ratios. Due to the minimum with the unclipped surrogate reward, the PPO objective acts as a pessimistic bound on the true surrogate reward.

As in standard PPO implementations, we use Generalized Advantage Estimation (GAE) (Schulman et al., 2015b). Instead of fitting the value network via regression to target values (denoted by $V_{\mathrm{trg}}$), via

$$\min_{\theta_t} \; \mathbb{E}_{s_t \sim \pi_{\theta_{t-1}}} \left[ \big(V_{\theta_t}(s_t) - V_{\mathrm{trg}}(s_t)\big)^2 \right], \qquad (3)$$

standard implementations fit the value network with a PPO-like objective:

$$\min_{\theta_t} \; \mathbb{E}_{s_t \sim \pi_{\theta_{t-1}}} \Big[ \max\Big\{ \big(V_{\theta_t}(s_t) - V_{\mathrm{trg}}(s_t)\big)^2, \; \Big( \mathrm{clip}\big(V_{\theta_t}(s_t),\, V_{\theta_{t-1}}(s_t) - \epsilon,\, V_{\theta_{t-1}}(s_t) + \epsilon\big) - V_{\mathrm{trg}}(s_t) \Big)^2 \Big\} \Big], \qquad (4)$$

where $\epsilon$ is the same value used to clip probability ratios in PPO's loss function (Eq. 2).

PPO uses the following training procedure: at any iteration $t$, the agent creates a clone of the current policy $\pi_{\theta_t}$ which interacts with the environment to collect rollouts $B$ (i.e., state-action pairs $\{(s_i, a_i)\}_{i=1}^{N}$). Then the algorithm optimizes the policy $\pi_\theta$ and value function $V_\theta$ for a fixed $K$ gradient steps on the sampled data $B$. Since at every iteration the first gradient step is taken on the same policy from which the data was sampled, we refer to these gradient updates as on-policy steps. For the remaining $K - 1$ steps, since the sampling policy differs from the current agent, we refer to these updates as off-policy steps.

Throughout the paper, we consider a stripped-down variant of PPO (denoted PPO-NOCLIP) that consists of policy gradient with importance weighting, but has been simplified as follows: (i) no likelihood-ratio clipping, i.e., no objective function clipping (it directly maximizes Eq. 1); (ii) value network optimized via regression to target values (Eq. 3) without value function clipping; and (iii) no gradient clipping. Overall, PPO-NOCLIP uses the objective summarized in App. A. One may argue that since PPO-NOCLIP removes the clipping heuristic from PPO, the unconstrained maximization of Eq. 1 may lead to excessively large policy updates. In App. E, we empirically justify the use of Eq. 1 by showing that with the small learning rate used in our experiments (tuned hyperparameters in Table 1), PPO-NOCLIP maintains a KL-based trust region like PPO throughout training.
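For concreteness, the following is a minimal PyTorch-style sketch of the objectives above: the unclipped importance-weighted surrogate (Eq. 1) used by PPO-NOCLIP, the clipped surrogate (Eq. 2), and the clipped value loss (Eq. 4). Function and tensor names are illustrative rather than taken from any particular PPO implementation.

```python
# Minimal sketch of the PPO objectives; names are illustrative, not from a
# specific codebase. All inputs are per-sample 1-D tensors of equal length.
import torch


def ppo_noclip_actor_loss(log_prob_new, log_prob_old, advantages):
    """Unclipped importance-weighted surrogate (Eq. 1), as used by PPO-NOCLIP."""
    ratios = torch.exp(log_prob_new - log_prob_old)   # rho_t
    return -(ratios * advantages).mean()              # negate: maximize via a minimizer


def ppo_actor_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective (Eq. 2), returned as a loss to minimize."""
    ratios = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # pessimistic bound on the surrogate


def ppo_value_loss(values_new, values_old, value_targets, eps=0.2):
    """Clipped value loss (Eq. 4); dropping the max/clip recovers plain regression (Eq. 3)."""
    unclipped = (values_new - value_targets) ** 2
    values_clipped = values_old + (values_new - values_old).clamp(-eps, eps)
    clipped = (values_clipped - value_targets) ** 2
    return torch.max(unclipped, clipped).mean()
```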
### 2.1. Framework for estimating Heavy-Tailedness

We now formalize our setup for studying the distribution of gradients. Throughout the paper, we use the following definition of the heavy-tailed property:

**Definition 1** (Resnick (2007)). A non-negative random variable $w$ is called heavy-tailed if its tail probability $\bar{F}_w(t) := \mathbb{P}(w \ge t)$ is asymptotically equivalent to $t^{-\alpha^*}$ as $t \to \infty$ for some positive number $\alpha^*$.

Here $\alpha^*$ (known as the tail index of $w$) determines the heavy-tailedness. For a heavy-tailed distribution with index $\alpha^*$, its $\alpha$-th moment exists only if $\alpha < \alpha^*$, i.e., $\mathbb{E}[w^\alpha] < \infty$ iff $\alpha < \alpha^*$. A value of $\alpha^* = 1.0$ corresponds to a Cauchy distribution and $\alpha^* = \infty$ (i.e., all moments exist) corresponds to a Gaussian distribution. Intuitively, as $\alpha^*$ decreases, the central peak of the distribution gets higher, the valley before the central peak gets deeper, and the tails get heavier. In other words, the lower the tail index, the more heavy-tailed the distribution.

However, in the finite-sample setting, estimating the tail index is notoriously challenging (Simsekli et al., 2019; Danielsson et al., 2016; Hill, 1975). In this study, we explore three estimators as heuristic measures to understand heavy tails and non-Gaussianity of gradients (refer to App. B for details): (i) the alpha-index estimator, which measures the alpha-index for symmetric $\alpha$-stable distributions; (ii) the Anderson-Darling test (Anderson & Darling, 1954) on random projections of stochastic Gradient Noise (GN) to perform Gaussianity testing (Panigrahi et al., 2019). To our knowledge, the deep learning literature has only explored these two estimators for analyzing the heavy-tailed nature of gradients. Finally, in our work, we propose using (iii) kurtosis. To quantify the heavy-tailedness relative to a normal distribution, we measure the kurtosis (fourth standardized moment) of the gradient norms. Given samples $\{X_i\}_{i=1}^{N}$, the kurtosis $\kappa$ is given by

$$\kappa = \frac{\tfrac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})^4}{\big(\tfrac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})^2\big)^2},$$

where $\bar{X}$ is the empirical mean of the samples. With a slight abuse of notation, we use kurtosis to denote $\kappa^{1/4}$. In App. B, we show the behavior of kurtosis on finite samples from Gaussian and Pareto distributions. It is well known that for a Pareto distribution with shape $\alpha \ge 4$, the lower the tail index (shape parameter $\alpha$), the higher the kurtosis. For $\alpha < 4$, since the fourth moment is non-existent, the kurtosis is infinite. For a Gaussian distribution, the kurtosis value is approximately 1.31. In App. B, we discuss limitations of the $\alpha$-index estimator and the Anderson-Darling test when used as heuristics to understand heavy tails. Hence, in the main paper, we include results with kurtosis and relegate results with the other estimators to the appendix.
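For reference, the following is a small NumPy sketch of this $\kappa^{1/4}$ heuristic, together with an illustrative (assumed) sanity check on Gaussian and Pareto samples; it is not the implementation used for our experiments.

```python
# NumPy sketch of the kurtosis heuristic: the fourth root of the fourth
# standardized moment, computed over (e.g.) per-sample gradient norms.
import numpy as np


def kurtosis_heuristic(samples):
    """Return kappa**(1/4), where kappa is the fourth standardized moment."""
    x = np.asarray(samples, dtype=np.float64)
    centered = x - x.mean()
    kappa = np.mean(centered ** 4) / np.mean(centered ** 2) ** 2
    return kappa ** 0.25


# Illustrative check (assumed, not from the paper's code): a Gaussian gives
# roughly 3**(1/4) ~ 1.32, matching the dotted reference line in the figures,
# while Pareto samples with a small shape parameter give a much larger value
# (diverging with sample size once the shape parameter drops below 4).
rng = np.random.default_rng(0)
print(kurtosis_heuristic(rng.normal(size=100_000)))       # ~1.32
print(kurtosis_heuristic(rng.pareto(3.0, size=100_000)))  # much larger
```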
## 3. Heavy-Tailedness in Policy-Gradients: A Case Study on PPO

We now examine the distribution of gradients in PPO. To start, we examine the behavior of gradients at on-policy steps only. We fix the policy at the beginning of every training iteration and consider just the gradients for the first step (see App. D for details). As training proceeds, the gradients clearly become more heavy-tailed (Fig. 1(a)). To thoroughly understand this behavior and the contributing factors, we separately analyze the contributions from different components in the loss function. We also separate out the contributions coming from the actor and critic networks. To decouple the behavior of naive policy gradients from PPO optimizations, we consider a variant of PPO which we call PPO-NOCLIP, as described in Section 2. Recall that, in a nutshell, PPO-NOCLIP implements policy gradient with just importance sampling. In what follows, we perform a fine-grained analysis of PPO at on-policy iterations.

### 3.1. Heavy-tailedness in on-policy training

Given the trend of increasing heavy-tailedness in on-policy gradients, we first separately analyze the contributions of the actor and critic networks. On the gradients of both these component networks, we observe similar trends, with the heavy-tailedness in the actor gradients being marginally higher than in the critic network (Fig. 1).

Figure 1. Heavy-tailedness in PPO during on-policy iterations. All plots show mean kurtosis aggregated over 8 MuJoCo environments. For other estimators, see App. G. For individual environments with error bars, see App. I. An increase in kurtosis implies an increase in heavy-tailedness. The dotted line represents the kurtosis value for a Gaussian distribution. (a) Kurtosis vs. on-policy iterations for A2C and PPO. Evidently, as training proceeds, the gradients become more heavy-tailed for both methods. (b) Kurtosis vs. on-policy iterations for actor networks in PPO. (c) Kurtosis vs. on-policy iterations for critic networks in PPO. Both critic and actor gradients become more heavy-tailed as the agent is trained. Note that as the gradients become more heavy-tailed, we observe a corresponding increase of heavy-tailedness in the advantage estimates ($\hat{A}^{\pi_0}$). However, actor/$\hat{A}^{\pi_0}$ and critic/$\hat{A}^{\pi_0}$ (i.e., actor or critic gradient norm divided by advantage) remain light-tailed throughout training. In App. F, we perform ablation tests to highlight the reason for the heavy-tailed behavior of advantages.

Note that during on-policy steps, since the likelihood ratios are exactly 1, the gradient of the actor network is given by $\nabla_\theta \log(\pi_\theta(a_t \mid s_t))\, \hat{A}^{\pi_0}(s_t, a_t)$ and the gradient of the critic network is given by $\nabla_\theta V_\theta(s_t)\, \hat{A}^{\pi_0}(s_t, a_t)$, where $\pi_0$ is the behavioral policy. To explain the rising heavy-tailed behavior, we separately plot the advantages $\hat{A}^{\pi_0}$ and the advantage-divided gradients (i.e., $\nabla_\theta \log(\pi_\theta(a_t \mid s_t))$ and $\nabla_\theta V_\theta$). Strikingly, we observe that while the advantage-divided gradients are not heavy-tailed for either the value or the policy network, the heavy-tailedness in advantage estimates increases as training proceeds. This elucidates that during on-policy updates, outliers in advantage estimates are the only source of heavy-tailedness in the actor and critic networks.

To understand the reasons behind the observed behavior of the advantages, we plot value estimates as computed by the critic network and the discounted returns used to calculate advantages (Fig. 9 in App. F). We do not observe any discernible heavy-tailedness trends in value estimates, and only a slight increase in returns. However, remarkably, we notice a very similar course of increasing heavy-tailedness for negative advantages (whereas positive advantages remain light-tailed) as training proceeds. In App. F.3, we also provide evidence for this observation by showing the trends of increasing heavy-tailed behavior with histograms of $\log(|\hat{A}^{\pi_\theta}|)$, grouped by their sign, as training proceeds for one MuJoCo environment (HalfCheetah-v2). This observation highlights that, at least in MuJoCo control environments, there is a positive bias in the learned value estimate for actions with negative advantages. In addition, our experiments also suggest that the outliers in advantages (primarily in negative advantages) are the root cause of the observed heavy-tailed behavior in the actor and critic gradients.
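For concreteness, the sketch below (with assumed array names) applies the kurtosis heuristic to per-sample actor-gradient norms, to the advantage estimates, and to the advantage-divided gradient norms, mirroring the decomposition used in this section.

```python
# Diagnostic sketch for Sec. 3.1 under assumed inputs: per-sample actor
# gradient norms and advantage estimates collected at an on-policy step.
# Because the per-sample actor gradient is grad(log pi) * A, dividing its
# norm by |A| isolates the grad(log pi) part ("actor / advantage" in Fig. 1).
import numpy as np


def kurtosis_heuristic(x):
    x = np.asarray(x, dtype=np.float64)
    c = x - x.mean()
    return (np.mean(c ** 4) / np.mean(c ** 2) ** 2) ** 0.25


def on_policy_diagnostic(actor_grad_norms, advantages, eps=1e-8):
    """Compare heavy-tailedness of gradients, advantages, and their quotient."""
    actor_grad_norms = np.asarray(actor_grad_norms, dtype=np.float64)
    abs_adv = np.abs(np.asarray(advantages, dtype=np.float64))
    return {
        "actor": kurtosis_heuristic(actor_grad_norms),
        "advantages": kurtosis_heuristic(abs_adv),
        "actor/advantages": kurtosis_heuristic(actor_grad_norms / (abs_adv + eps)),
    }
```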
We also analyze the gradients of A2C (Mnih et al., 2016), an on-policy RL algorithm, and observe similar trends (Fig. 1(a)), but with a relatively smaller degree of heavy-tailedness. Although they start at a similar magnitude, the heavy-tailed nature escalates at a higher rate in PPO (in Appendix F.2, we show a corresponding trend in the heavy-tailedness of advantage estimates). This observation may lead us to ask: what is the cause of the heightened heavy-tailedness in PPO (when compared with A2C)? Next, we demonstrate that off-policy training can exacerbate the heavy-tailed behavior.

### 3.2. Off-policyness escalates heavy-tailedness in gradients

To analyze the gradients at off-policy steps, we perform the following experiment: at various stages of training (i.e., at initialization, 50% of maximum reward, and maximum reward), we fix the actor and the critic network at each gradient step during off-policy training and analyze the collected gradients (see App. D for details). First, in the early stages of training, as the off-policyness increases, the heavy-tailedness in gradients (both actor and critic) increases. However, unlike with on-policy steps, actor gradients are the major contributing factor to the overall heavy-tailedness of the gradient distribution. In other words, the increase in heavy-tailedness of actor gradients due to off-policy training is substantially greater than for critic gradients (Fig. 2). Moreover, the increase lessens in later stages of training as the agent approaches its peak performance.

Figure 2. Heavy-tailedness in PPO-NOCLIP during off-policy steps at various stages of training iterations in MuJoCo environments. All plots show mean kurtosis aggregated over 8 MuJoCo environments. Plots for other estimators can be found in App. G. We also show trends with these estimators (with error bars) on individual environments in App. I. An increase in kurtosis implies an increase in heavy-tailedness. The dotted line represents the kurtosis value for a Gaussian distribution. Note that the analysis is done with gradients taken on a fixed batch of data within a single iteration. As off-policyness increases, the actor gradients get substantially heavy-tailed. This trend is corroborated by the increase of heavy-tailedness in the ratios. Moreover, we consistently observe that the heavy-tailedness in actor/ratios stays constant. While the heavy-tailedness in the ratios increases substantially early in training, during later stages the increase tapers off. The overall increase across training iterations is due to the induced heavy-tailedness in the advantage estimates (cf. Sec. 3.1).

Now we turn our attention to explaining the possible causes of such a profound increase. The strong increase in heavy-tailedness of the actor gradients during off-policy training coincides with an increase of heavy-tailedness in the distribution of likelihood ratios $\rho$, given by $\pi_\theta(a_t \mid s_t)/\pi_0(a_t \mid s_t)$. The corresponding increase in heavy-tailedness in the ratios can be explained theoretically. In continuous control RL tasks, the actor network often implements the policy with a Gaussian distribution, where the policy parameters estimate the mean and the (diagonal) covariance. With a simple example, we highlight the heavy-tailed behavior of such likelihood ratios of Gaussian density functions. This example highlights how even a minor increase in the standard deviation of the distribution of the current policy (as compared to the behavior policy) can induce heavy tails.

**Example 1** (Wang et al., 2018). Assume $\pi_1(x) = \mathcal{N}(x; 0, \sigma_1^2)$ and $\pi_2(x) = \mathcal{N}(x; 0, \sigma_2^2)$. Let $\rho = \pi_1(x)/\pi_2(x)$ at a sample $x \sim \pi_2$. If $\sigma_1 \le \sigma_2$, then the likelihood ratio $\rho$ is bounded and its distribution is not heavy-tailed. However, when $\sigma_1 > \sigma_2$, then $\rho$ has a heavy-tailed distribution with tail index (Definition 1) $\alpha^* = \sigma_1^2/(\sigma_1^2 - \sigma_2^2)$.
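A small simulation of Example 1, with assumed values of $\sigma_1$ and $\sigma_2$, illustrates this effect:

```python
# Simulation of Example 1 with assumed sigma values: the likelihood ratio of
# two zero-mean Gaussians is bounded when sigma1 <= sigma2 but becomes
# heavy-tailed once sigma1 > sigma2.
import numpy as np


def ratio_kurtosis(sigma1, sigma2, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma2, size=n)                     # x ~ pi_2
    # rho = N(x; 0, sigma1^2) / N(x; 0, sigma2^2), written out explicitly
    rho = (sigma2 / sigma1) * np.exp(
        x ** 2 * (1.0 / (2.0 * sigma2 ** 2) - 1.0 / (2.0 * sigma1 ** 2)))
    c = rho - rho.mean()
    return (np.mean(c ** 4) / np.mean(c ** 2) ** 2) ** 0.25


print(ratio_kurtosis(0.9, 1.0))  # sigma1 <= sigma2: bounded ratio, small kurtosis
print(ratio_kurtosis(1.2, 1.0))  # tail index ~ 1.44/0.44 ~ 3.3 < 4: kurtosis blows up
```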
During off-policy training, to understand the heavy-tailedness of actor gradients beyond the contributions from the likelihood ratios, we inspect the actor gradients normalized by the likelihood ratios, i.e.,

$$\frac{\nabla_\theta \big(\pi_\theta(a_t \mid s_t)/\pi_0(a_t \mid s_t)\big)}{\pi_\theta(a_t \mid s_t)/\pi_0(a_t \mid s_t)}\, \hat{A}^{\pi_0}(s_t, a_t) = \nabla_\theta \log\big(\pi_\theta(a_t \mid s_t)\big)\, \hat{A}^{\pi_0}(s_t, a_t).$$

Note that this gradient expression is similar to the on-policy actor gradients. Since we observe an increasing trend in heavy-tailedness of the actor gradients even during on-policy training, one might ask: does the heavy-tailedness of these gradients increase during off-policy gradient updates? Recall that in PPO, we fix the value function at the beginning of off-policy training and pre-compute advantage estimates that are later used throughout training. Since the advantages were the primary factor dictating the increase during on-policy training, ideally we should not observe any increase in the heavy-tailed behavior. Confirming this hypothesis, we show that the heavy-tailedness in this quantity indeed stays constant during off-policy training (Fig. 2), i.e., $\nabla_\theta \log(\pi_\theta(a_t \mid s_t))\, \hat{A}^{\pi_0}(s_t, a_t)$ does not cause the increased heavy-tailed nature as long as $\pi_0$ is fixed.

Our findings from the off-policy analysis strongly suggest that when the behavioral policy is held fixed, heavy-tailedness in the importance ratios is the fundamental cause. In addition, in Sec. 3.1, we showed that when the importance ratios are 1 (i.e., the data on which the gradient step is taken is on-policy), advantages induce heavy-tailedness. With these two observations, we conclude that the scalars (either the likelihood ratios or the advantage estimates) in the objective are the primary causes of the underlying heavy-tailedness in the gradients.

## 4. How do Heavy-Tailed Policy-Gradients affect Training?

In the previous section, we investigated the root cause of the heavy-tailed behavior. The apparent heavy-tailed nature of PPO's gradients may lead us to ask: how do heavy-tailed gradients affect an agent's performance? In this section, we show that heavy-tailed gradients harm the performance of the underlying agent. Subsequently, we investigate PPO's heuristics and demonstrate how they compensate for the heavy-tailed nature of the gradient distribution.

### 4.1. Effect of heavy-tailedness in advantages

The analysis in Sec. 3.1 shows that the multiplicative advantage estimate in the PPO loss is a significant contributing factor to the observed heavy-tailedness. Motivated by this, we now study the impact of clipping advantages on the underlying agent. In particular, we clip negative advantages, which are the primary contributors to the induced heavy-tailedness. Depending on the observed heavy-tailedness, we tune a per-environment clipping threshold for advantages to maximize the performance of the agent trained with PPO. Intuitively, we expect that clipping should improve optimization and hence should lead to improved performance.
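The intervention can be sketched as follows; the clipping threshold shown is illustrative, not one of the per-environment tuned values.

```python
# Sketch of the Sec. 4.1 intervention: clamp negative advantages at a
# per-environment threshold, leaving positive advantages untouched.
import torch


def clip_negative_advantages(advantages, threshold):
    """Clamp advantages from below at -threshold (threshold > 0)."""
    return torch.clamp(advantages, min=-threshold)


# Example usage with an assumed (untuned) threshold:
adv = torch.tensor([0.7, -0.2, -5.3, 1.4, -12.0])
print(clip_negative_advantages(adv, threshold=3.0))  # large negative outliers clipped to -3.0
```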
Corroborating this intuition, we observe significant improvements (Fig. 3(c)). We also plot the trend of heavy-tailedness in the clipped advantage estimates during training. As we clip negative advantages below the obtained threshold, we observe that the induced heavy-tailedness stays constant throughout training (Fig. 3(a)). Our experiment unearths an intriguing finding: since the advantage estimates significantly contribute to the observed heavy-tailed behavior, we show that clipping outlier advantages stabilizes training and improves the agent's performance on 5 out of 8 MuJoCo tasks (per-environment rewards in App. J). While tuning a clipping threshold per environment may not be practical, the primary purpose of this study is to illustrate that heavy-tailedness in advantages can actually hurt the optimization process, and clipping advantages leads to improvements in the performance of the agent.

Figure 3. (a) Heavy-tailedness in PPO advantages with a per-environment tuned advantage clipping threshold and (b) heavy-tailedness in PPO-NOCLIP likelihood ratios as the degree of off-policyness is varied in MuJoCo environments. All plots show mean kurtosis aggregated over 8 MuJoCo environments. With advantages clipped at appropriate thresholds (tuned per environment), we observe that the heavy-tailedness in advantages remains almost constant with training. For (b), we plot kurtosis vs. the fraction of off-policy steps (i.e., the number of steps taken normalized by the total number of gradient steps in one epoch). As the number of off-policy epochs increases, the heavy-tailedness in ratios increases substantially. (c) Normalized rewards for PPO-AdvClip and for PPO-NOCLIP as the degree of off-policyness is varied (number of off-policy steps in parentheses). Normalized w.r.t. the max reward obtained with PPO (with all heuristics enabled) and the performance of a random agent. Evidently, as off-policy training increases, the max reward achieved drops. With advantage clipping (tuned per environment), we observe improved performance of the agent. (See App. J for reward curves on individual environments.)

### 4.2. Effect of heavy-tailedness in likelihood-ratios

In Sec. 3.2, we demonstrated the heavy-tailed behavior of gradients during off-policy training, which increases with off-policy gradient steps in PPO-NOCLIP. Moreover, we observe a corresponding increase in the heavy-tailedness of likelihood ratios. Motivated by this connection, we train agents with an increased number of off-policy gradient steps to understand the effect of the off-policy-induced heavy-tailedness on the performance of the agent. With PPO-NOCLIP, we train agents for 20 and 30 offline epochs (instead of 10 in Table 1) and analyze their performance. (Note that even with 20 and 30 offline epochs the agent maintains a KL-based trust region throughout training (Fig. 19 in App. J); beyond 30 offline epochs, successive policies often diverge, failing to maintain a KL-based trust region.) First, as expected, we observe an increase in heavy-tailedness in the likelihood ratios with escalated offline training (Fig. 3(b)). Moreover, the heavy-tailedness in advantages remains unaffected by an increase in the number of offline epochs (Fig. 20 in App. J), confirming that the observed behavior is primarily due to heightened heavy-tailedness in the likelihood ratios.
We conjecture that the induced heavy-tailedness can make the optimization process harder. Corroborating this hypothesis, we observe that as the number of offline epochs increases, the performance of the agent trained with PPO-NOCLIP deteriorates, and the training becomes unstable (Fig. 3(c)). The findings from this experiment clearly highlight issues due to the induced heavy-tailedness in likelihood ratios during off-policy training. While offline training enables sample-efficient training, restricting the number of off-policy epochs allows effective handling of optimization issues induced by the heavy-tailed nature, which go beyond just trust-region enforcement.

### 4.3. Explaining roles of various PPO objective optimizations

Motivated by the results from the previous sections, we now take a deeper look at the core idea of likelihood-ratio clipping and the auxiliary optimizations implemented in PPO, and examine how they affect the heavy-tailedness during training. First, we make a key observation: the PPO clipping heuristics are not triggered for the first gradient step taken (when a new batch of data is sampled). Rather, these heuristics may alter the loss only when the behavior policy differs from the policy that is being optimized. Hence, in order to understand the effects of the clipping heuristics, we perform the following analysis on the off-policy gradients of PPO-NOCLIP: at each update step of the agent trained with PPO-NOCLIP, we compute the gradients while progressively including optimizations from the standard PPO objective.

Figure 4. Heavy-tailedness in PPO-NOCLIP with PPO heuristics applied progressively during off-policy steps, with kurtosis aggregated across 8 MuJoCo environments. For other estimators, see App. G. The dotted line represents the kurtosis value for a Gaussian distribution. "-clip" denotes loss clipping on the corresponding network. "-gradclip" denotes both gradient clipping and loss clipping. An increase in kurtosis implies an increase in heavy-tailedness. As training progresses during off-policy steps, the increased heavy-tailedness in actor and critic gradients is mitigated by the PPO heuristics.

Our results demonstrate that both the likelihood-ratio clipping and the value-function clipping in the loss offset the enormous heavy-tailedness induced by off-policy training (Fig. 4). Recall that by clipping the likelihood ratios and the value function, the PPO objective is discarding samples (i.e., zeroing out their gradient contribution) during gradient aggregation. Since heavy-tailedness in the distribution of likelihood ratios is the central contributing factor during off-policy training, by truncating likelihood ratios $\rho_t$ which lie outside the $(1 - \epsilon, 1 + \epsilon)$ interval, PPO is primarily mitigating heavy-tailedness in actor gradients. Similarly, by rejecting samples from the value function loss which lie outside an $\epsilon$ boundary of a fixed target estimate, the heuristics alleviate the slight heavy-tailed nature induced in the critic network by off-policy training.
While the PPO heuristics alleviate the heavy-tailedness induced by off-policy training, these heuristics do not alter the heavy-tailed nature of the advantage estimates. Since none of these heuristics directly targets the outliers present in the advantage estimates, we believe that our findings can guide the development of fundamentally stable RL algorithms by targeting the outliers present in the advantage estimates (the primary cause of increasing heavy-tailedness throughout training).

## 5. Mitigating Heavy-Tailedness with Robust Gradient Estimation

Motivated by our analysis showing that the gradients in PPO-NOCLIP exhibit heavy-tailedness that increases during off-policy training, we propose an alternate method of gradient aggregation using the gradient estimation framework from Prasad et al. (2018) that is better suited to the heavy-tailed estimation paradigm than the sample mean. To support our hypothesis that the primary benefit of PPO's various clipping heuristics lies in mitigating this heavy-tailedness, we aim to show that, equipped with our robust estimator, PPO-NOCLIP can achieve comparable results to state-of-the-art PPO implementations, even with the clipping heuristics turned off.

We now consider robustifying PPO-NOCLIP (policy gradient with just importance sampling). Informally, for gradient distributions which do not enjoy Gaussian-like concentration, the empirical-expectation-based estimates of the gradient do not necessarily point in the right descent direction, leading to bad solutions. To this end, we leverage a robust mean aggregation technique called Geometric Median-of-Means (GMOM), due to Minsker et al. (2015). We first split the samples into non-overlapping subsamples and estimate the sample mean of each. The GMOM estimator is then given by the geometric median of the means of the subsamples. Formally, let $\{x_1, \ldots, x_n\} \subset \mathbb{R}^p$ be $n$ i.i.d. random variables sampled from a distribution $D$. The GMOM estimator for estimating the mean can be described as follows: partition the $n$ samples into $b$ blocks $B_1, \ldots, B_b$, each of size $\lfloor n/b \rfloor$. Compute the sample mean in each block, i.e., $\{\hat{\mu}_1, \ldots, \hat{\mu}_b\}$, where $\hat{\mu}_i = \sum_{x_j \in B_i} x_j / |B_i|$. Then the GMOM estimator $\hat{\mu}_{\mathrm{GMOM}}$ is given by the geometric median of $\{\hat{\mu}_1, \ldots, \hat{\mu}_b\}$, defined as

$$\hat{\mu}_{\mathrm{GMOM}} = \arg\min_{\mu} \sum_{i=1}^{b} \|\mu - \hat{\mu}_i\|_2.$$

We present the GMOM algorithm along with Weiszfeld's algorithm, used for computing the approximate geometric median, in App. C.

GMOM has been shown to have several favorable properties when used for statistical estimation in heavy-tailed settings. Intuitively, GMOM reduces the effect of outliers on a mean estimate by taking an intermediate mean over blocks of samples and then computing the geometric median of those block means. The robustness comes from the additional geometric-median step: a small number of samples with large norms will not affect a GMOM estimate as much as they would a sample mean.
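For concreteness, a plain NumPy sketch of the GMOM estimator and the Weiszfeld iteration is given below; it illustrates the estimator described above and is not the implementation used in our experiments.

```python
# GMOM sketch: split samples into b blocks, take each block's sample mean,
# then return the geometric median of the block means via Weiszfeld's
# algorithm. Plain NumPy illustration.
import numpy as np


def weiszfeld(points, n_iter=100, eps=1e-8):
    """Approximate geometric median of the rows of `points` (shape [b, d])."""
    median = points.mean(axis=0)                     # initialize at the mean
    for _ in range(n_iter):
        dists = np.linalg.norm(points - median, axis=1)
        weights = 1.0 / np.maximum(dists, eps)       # inverse-distance weights
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median


def gmom(samples, num_blocks):
    """Geometric median-of-means of `samples` (shape [n, d])."""
    blocks = np.array_split(samples, num_blocks)
    block_means = np.stack([block.mean(axis=0) for block in blocks])
    return weiszfeld(block_means)


# Toy check: GMOM is far less sensitive to a few huge outliers than the mean.
rng = np.random.default_rng(0)
grads = rng.normal(size=(1024, 16))
grads[:5] += 1000.0                                  # a handful of outlier "gradients"
print(np.linalg.norm(grads.mean(axis=0)))            # pulled far from zero
print(np.linalg.norm(gmom(grads, num_blocks=12)))    # stays close to the clean mean
```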
Formally, given $n$ samples from a heavy-tailed distribution, the GMOM estimate concentrates better around the true mean than the sample mean, as formalized in the following theorem:

**Theorem 1** (Minsker et al. (2015)). Suppose we are given $n$ samples $\{x_i\}_{i=1}^{n}$ from a distribution with mean $\mu$ and covariance $\Sigma$. Assume $\delta > 0$. Choose the number of blocks $b = 1 + \lfloor 3.5 \log(1/\delta) \rfloor$. Then, with probability at least $1 - \delta$,

$$\|\hat{\mu}_{\mathrm{GMOM}} - \mu\|_2 \lesssim \sqrt{\frac{\mathrm{trace}(\Sigma)\, \log(1/\delta)}{n}},$$

whereas the sample mean only satisfies

$$\Big\|\frac{1}{n}\sum_{i=1}^{n} x_i - \mu\Big\|_2 \lesssim \sqrt{\frac{\mathrm{trace}(\Sigma)}{n\,\delta}}.$$

When applying stochastic gradient descent or its variants in deep learning, one typically backpropagates the mean loss, avoiding computing per-sample gradients. However, computing GMOM requires per-sample gradients. Consequently, we propose a simple (but novel) variant of GMOM called BLOCK-GMOM which avoids the extra sample-size-dependent computational penalty of calculating sample-wise gradients. Notice that, by Theorem 1, the number of blocks required to compute GMOM with the high-probability guarantee is independent of the sample size. To achieve this, instead of calculating sample-wise gradients, we compute block-wise gradients by backpropagating the sample-mean aggregated loss for each block. Moreover, such an implementation not only increases efficiency but also allows incorporating adaptive optimizers for individual blocks. Algorithm 1 presents the overall BLOCK-GMOM.

**Algorithm 1: BLOCK-GMOM**
Input: samples $S = \{x_1, \ldots, x_n\}$, number of blocks $b$, model optimizer $O_G$, $b$ block optimizers $O_B$, network $f_\theta$, loss $\ell$.
1. Partition $S$ into $b$ blocks $B_1, \ldots, B_b$ of equal size.
2. For $i = 1, \ldots, b$: $\hat{\mu}_i = O_B^{(i)}\big(\sum_{x_j \in B_i} \nabla_\theta \ell(f_\theta, x_j) / |B_i|\big)$.
3. $\hat{\mu}_{\mathrm{GMOM}} = O_G\big(\mathrm{WEISZFELD}(\hat{\mu}_1, \ldots, \hat{\mu}_b)\big)$.
Output: gradient estimate $\hat{\mu}_{\mathrm{GMOM}}$.

### 5.1. Results on MuJoCo environments

We perform experiments on 8 MuJoCo (Todorov et al., 2012) control tasks. To use BLOCK-GMOM aggregation with PPO-NOCLIP, we extract actor-network and critic-network gradients at each step and run Algorithm 1 separately on both networks. For our experiments, we use SGD as $O_B$ and Adam as $O_G$, and refer to this variant of PPO-NOCLIP as ROBUST-PPO-NOCLIP. We compare the performances of PPO, PPO-NOCLIP, and ROBUST-PPO-NOCLIP, using hyperparameters that are tuned individually for each method but held fixed across all tasks (Table 1). For 7 tasks, we observe significant improvements with ROBUST-PPO-NOCLIP over PPO-NOCLIP and performance close to that achieved by PPO (with all clipping heuristics enabled) (Fig. 5). Although we do not observe improvements over PPO, we believe that this result corroborates our conjecture that PPO's heuristics primarily serve to offset the heavy-tailedness induced during training.

Figure 5. Normalized rewards for ROBUST-PPO-NOCLIP and PPO-NOCLIP. Normalized w.r.t. the max reward obtained with PPO (with all heuristics enabled) and the performance of a random agent. (See App. H for reward curves on individual environments.)

## 6. Related Work

Studying the behavior of SGD, Simsekli et al. (2019) questioned the Gaussianity of SGD noise, highlighting its heavy-tailed nature. Subsequently, there has been growing interest in understanding the nature of SGD noise in different deep learning tasks, with a specific focus on its influence on generalization performance versus induced optimization difficulties (Şimşekli et al., 2020; Zhang et al., 2019b; Panigrahi et al., 2019). In particular, Zhang et al. (2019b) studied the nature of stochastic gradients in natural language processing (e.g., BERT pretraining) and highlighted the effectiveness of adaptive methods (e.g., Adam and gradient clipping).
Some recent work has also made progress towards understanding the effectiveness of gradient clipping for convergence in the presence of heavy-tailed noise (Zhang et al., 2019b;a; Şimşekli et al., 2020). On the other hand, Simsekli et al. (2019) highlighted the benefits of heavy-tailed noise in achieving wider minima with better generalization, by analyzing SGD as an SDE driven by Lévy motion (whose increments are α-stable heavy-tailed noise).

On the RL side, Bubeck et al. (2013) studied the stochastic multi-armed bandit problem when the reward distribution is heavy-tailed. The authors designed a robust version of the classical Upper Confidence Bound algorithm by replacing the empirical average of observed rewards with robust estimates obtained via the univariate median-of-means estimator (Nemirovski & Yudin, 1983) on the observed sequence of rewards. Medina & Yang (2016) extended this approach to the problem of linear bandits under heavy-tailed noise. There is also a long line of work in deep RL which focuses on reducing the variance of stochastic policy gradients (Gu et al., 2016; Wu et al., 2018; Cheng et al., 2020). On the flip side, Chung et al. (2020) highlighted the beneficial impacts of the stochasticity of policy gradients on the optimization process. In simple MDPs, the authors showed that larger higher-order moments at a fixed variance lead to improved exploration. This aligns with the conjecture of Simsekli et al. (2019), in the context of supervised learning, that heavy-tailedness in gradients can improve generalization. Chung et al. (2020) thus pointed out the importance of a careful analysis of stochasticity in gradients to better understand policy gradient algorithms.

We consider our work a stepping stone towards analyzing stochastic gradients beyond just their variance. We hypothesize that in deep RL, where the optimization process is known to be brittle (Henderson et al., 2018; 2017; Engstrom et al., 2019; Ilyas et al., 2018), perhaps due to the flexibility of the neural representation, heavy-tailedness can cause heightened instability rather than help in efficient exploration. This perspective aligns with one line of work (Zhang et al., 2019b) where the authors demonstrate that heavy-tailedness can cause instability in the learning process in deep models. Indeed, with the ablation experiments in Sec. 4, we show that increasing heavy-tailedness in likelihood ratios hurts the agent's performance, and mitigating heavy-tailedness in advantage estimates improves the agent's performance.

## 7. Conclusion

In this paper, we empirically characterized PPO's gradients, demonstrating that they become more heavy-tailed as training proceeds. Our detailed analysis showed that at on-policy steps, the heavy-tailed nature of the gradients is primarily attributable to the multiplicative advantage estimates. On the other hand, we observed that during off-policy training, the heavy-tailedness of the likelihood ratios of the surrogate reward function exacerbates the observed heavy-tailedness. Subsequently, we examined issues due to the heavy-tailed nature of the gradients. We demonstrated that PPO's clipping heuristics primarily serve to offset the heavy-tailedness induced by off-policy training. With this motivation, we showed that a robust estimation technique can effectively replace all three of PPO's clipping heuristics: likelihood-ratio clipping, value loss clipping, and gradient clipping.
In future work, we plan to conduct a similar analysis of gradients for other RL algorithms such as deep Q-learning. Moreover, we believe that our findings on the heavy-tailed nature of advantage estimates can significantly impact algorithm development for policy gradient algorithms.

## Acknowledgements

We acknowledge the support of Lockheed Martin, DARPA via HR00112020006, and NSF via IIS-1909816 and OAC-1934584.

## References

Anderson, T. W. and Darling, D. A. A test of goodness of fit. Journal of the American Statistical Association, 49(268):765–769, 1954.

Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

Cheng, C.-A., Yan, X., and Boots, B. Trajectory-wise control variates for variance reduction in policy gradient methods. In Conference on Robot Learning, pp. 1379–1394. PMLR, 2020.

Chung, W., Thomas, V., Machado, M. C., and Roux, N. L. Beyond variance reduction: Understanding the true impact of baselines on policy optimization. arXiv preprint arXiv:2008.13773, 2020.

Danielsson, J., Ergun, L. M., de Haan, L., and de Vries, C. G. Tail index estimation: Quantile driven threshold selection. Available at SSRN 2717478, 2016.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations, 2019.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.

Henderson, P., Romoff, J., and Pineau, J. Where did my optimum go?: An empirical analysis of gradient descent optimization in policy gradient methods. arXiv preprint arXiv:1810.02525, 2018.

Hill, B. M. A simple general approach to inference about the tail of a distribution. The Annals of Statistics, pp. 1163–1174, 1975.

Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. A closer look at deep policy gradients. arXiv preprint arXiv:1811.02553, 2018.

Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133, 2017.

Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.

Medina, A. M. and Yang, S. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pp. 1642–1650, 2016.

Minsker, S. et al. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Mohammadi, M., Mohammadpour, A., and Ogata, H. On estimating the tail index and the spectral measure of multivariate α-stable distributions. Metrika, 78(5):549–561, 2015.

Nemirovski, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, 1983.

Panigrahi, A., Somani, R., Goyal, N., and Netrapalli, P. Non-Gaussianity of stochastic gradient noise. arXiv preprint arXiv:1910.09626, 2019.

Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Resnick, S. I. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Science & Business Media, 2007.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. A tail-index analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053, 2019.

Şimşekli, U., Zhu, L., Teh, Y. W., and Gürbüzbalaban, M. Fractional underdamped Langevin dynamics: Retargeting SGD with momentum under heavy-tailed gradient noise. arXiv preprint arXiv:2002.05685, 2020.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

Wang, D., Liu, H., and Liu, Q. Variational inference with tail-adaptive f-divergence. In Advances in Neural Information Processing Systems, pp. 5737–5747, 2018.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696.

Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.

Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=wXgk_iCiYGo.
Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019a.

Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S. J., Kumar, S., and Sra, S. Why Adam beats SGD for attention models. arXiv preprint arXiv:1912.03194, 2019b.