Phasic Policy Gradient

Karl Cobbe¹, Jacob Hilton¹, Oleg Klimov¹, John Schulman¹

¹OpenAI, San Francisco, CA, USA. Correspondence to: Karl Cobbe.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives, while using a shared network allows useful features to be shared. PPG achieves the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features. PPG also enables the value function to be optimized more aggressively, with a higher level of sample reuse. Compared to PPO, we find that PPG significantly improves sample efficiency on the challenging Procgen Benchmark.

1. Introduction

Model-free reinforcement learning (RL) has enjoyed remarkable success in recent years, achieving impressive results in diverse domains including Dota 2 (OpenAI et al., 2019b), StarCraft II (Vinyals et al., 2019), and robotic control (OpenAI et al., 2019a). Although policy gradient methods like PPO (Schulman et al., 2017), A3C (Mnih et al., 2016), and IMPALA (Espeholt et al., 2018) are behind some of the most high-profile results, many related algorithms have proposed a variety of policy objectives (Schulman et al., 2015a; Wu et al., 2017; Peng et al., 2019; Song et al., 2019; Lillicrap et al., 2015; Haarnoja et al., 2018). All of these algorithms fundamentally rely on the actor-critic framework, with two key quantities driving learning: the policy and the value function.

In practice, whether or not to share parameters between the policy and value function networks is an important implementation decision. There is a clear advantage to sharing parameters: features trained by each objective can be used to better optimize the other. However, there are also disadvantages. First, it is not clear how to appropriately balance the competing objectives of the policy and the value function. Any method that jointly optimizes these two objectives with the same network must assign a relative weight to each. Regardless of how well this hyperparameter is chosen, there is a risk that the optimization of one objective will interfere with the optimization of the other. Second, the use of a shared network all but requires the policy and value function objectives to be trained on the same data, and consequently with the same level of sample reuse. This is an artificial and undesirable restriction.

We address these problems with Phasic Policy Gradient (PPG), an algorithm which preserves feature sharing between the policy and value function while otherwise decoupling their training. PPG operates in two alternating phases: the first phase trains the policy, and the second phase distills useful features from the value function. More generally, PPG can be used to perform any auxiliary optimization alongside RL, though in this work we take value function error to be the sole auxiliary objective.
Using PPG, we highlight two important observations about on-policy actor-critic methods:

- Interference between policy and value function optimization can negatively impact performance when parameters are shared between the policy and value function networks.
- Value function optimization often tolerates a significantly higher level of sample reuse than policy optimization.

By mitigating the interference between the policy and value function objectives while still sharing representations, and by optimizing each with the appropriate level of sample reuse, PPG significantly improves sample efficiency.

2. Algorithm

In PPG, training proceeds in two alternating phases: the policy phase, followed by the auxiliary phase. During the policy phase, we train the agent with Proximal Policy Optimization (PPO) (Schulman et al., 2017). During the auxiliary phase, we distill features from the value function into the policy network, to improve training in future policy phases. Compared to PPO, the novel contribution of PPG is the inclusion of periodic auxiliary phases. We now describe each phase in more detail.

During the policy phase, we optimize the same objectives used in PPO, notably using disjoint networks to represent the policy and the value function (Figure 1). Specifically, we train the policy network using the clipped surrogate objective

$$L^{clip} = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ and $\hat{A}_t$ is an estimator of the advantage function at timestep $t$. We optimize $L^{clip} + \beta_S S[\pi]$, where $\beta_S$ is a constant and $S$ is an entropy bonus for the policy. To train the value function network, we optimize

$$L^{value} = \hat{\mathbb{E}}_t\left[\tfrac{1}{2}\left(V_{\theta_V}(s_t) - \hat{V}_t^{targ}\right)^2\right]$$

where $\hat{V}_t^{targ}$ are the value function targets. Both $\hat{A}_t$ and $\hat{V}_t^{targ}$ are computed with GAE (Schulman et al., 2015b).

Algorithm 1 PPG
  for phase = 1, 2, ... do
    Initialize empty buffer B
    for iteration = 1, 2, ..., N do
      Perform rollouts under current policy $\pi$
      Compute value function targets $\hat{V}_t^{targ}$ for each state $s_t$
      for epoch = 1, 2, ..., $E_\pi$ do
        Optimize $L^{clip} + \beta_S S[\pi]$ wrt $\theta_\pi$
      end for
      for epoch = 1, 2, ..., $E_V$ do
        Optimize $L^{value}$ wrt $\theta_V$
      end for
      Add all $(s_t, \hat{V}_t^{targ})$ to B
    end for
    Compute and store current policy $\pi_{old}(\cdot|s_t)$ for all states $s_t$ in B
    for epoch = 1, 2, ..., $E_{aux}$ do
      Optimize $L^{joint}$ wrt $\theta_\pi$, on all data in B
      Optimize $L^{value}$ wrt $\theta_V$, on all data in B
    end for
  end for

Figure 1. PPG uses disjoint policy and value networks to reduce interference between objectives. The policy network includes an auxiliary value head. We write $\theta$ instead of $\theta_\pi$ for simplicity, since there is only one policy head.

During the auxiliary phase, we optimize the policy network with a joint objective that includes an arbitrary auxiliary loss and a behavioral cloning loss:

$$L^{joint} = L^{aux} + \beta_{clone} \cdot \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\ \pi_\theta(\cdot|s_t)\right]\right]$$

where $\pi_{\theta_{old}}$ is the policy right before the auxiliary phase begins. That is, we optimize the auxiliary objective while otherwise preserving the original policy, with the hyperparameter $\beta_{clone}$ controlling this trade-off. In principle $L^{aux}$ could be any auxiliary objective. At present, we simply use the value function loss as the auxiliary objective, thereby sharing features between the policy and value function while minimizing distortions to the policy. Specifically, we define

$$L^{aux} = \tfrac{1}{2}\,\hat{\mathbb{E}}_t\left[\left(V_{\theta_\pi}(s_t) - \hat{V}_t^{targ}\right)^2\right]$$

where $V_{\theta_\pi}$ is an auxiliary value head of the policy network, shown in Figure 1. This auxiliary value head and the policy share all parameters except their final linear layers. The auxiliary value head is used purely to train representations for the policy; it has no other purpose in PPG. Note that the targets $\hat{V}_t^{targ}$ are the same targets computed during the policy phase. They remain fixed throughout the auxiliary phase.
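For concreteness, the following is a minimal PyTorch sketch of the three losses above, assuming a discrete action space and batched tensors of log-probabilities, advantages, and value targets. The function names, default argument values (e.g. the clip range), and the categorical parameterization are our own illustrative choices, not the released implementation; the entropy bonus $\beta_S S[\pi]$ is omitted for brevity.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def clipped_policy_loss(logp, logp_old, adv, clip_eps=0.2):
    # L^clip, negated so that minimizing it maximizes the clipped surrogate.
    ratio = torch.exp(logp - logp_old)                                # r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

def value_loss(v_pred, v_targ):
    # L^value: half mean-squared error against the GAE value targets.
    return 0.5 * ((v_pred - v_targ) ** 2).mean()

def joint_aux_loss(aux_v_pred, v_targ, logits, logits_old, beta_clone=1.0):
    # L^joint = L^aux + beta_clone * KL[pi_old, pi], where L^aux is the value
    # error of the auxiliary value head attached to the policy network.
    l_aux = 0.5 * ((aux_v_pred - v_targ) ** 2).mean()
    kl = kl_divergence(Categorical(logits=logits_old),
                       Categorical(logits=logits)).mean()
    return l_aux + beta_clone * kl
```

During the policy phase only the first two losses are used (on separate networks); during the auxiliary phase `joint_aux_loss` updates the policy network while `value_loss` continues to train the value network.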
During the auxiliary phase, we also take the opportunity to perform additional training on the value network by further optimizing $L^{value}$. Note that $L^{value}$ and $L^{joint}$ share no parameter dependencies, so we can optimize these objectives separately.

We briefly explain the role of each hyperparameter. $N$ controls the number of policy updates performed in each policy phase. $E_\pi$ and $E_V$ control the sample reuse for the policy and value function, respectively, during the policy phase. Although these are conventionally set to the same value, this is not a strict requirement in PPG. Note that $E_V$ influences the training of the true value function, not the auxiliary value function. $E_{aux}$ controls the sample reuse during the auxiliary phase, representing the number of epochs performed across all data in the replay buffer. It is usually by increasing $E_{aux}$, rather than $E_V$, that we increase sample reuse for value function training. For a detailed discussion of the relationship between $E_{aux}$ and $E_V$, see Appendix C. Default values for all hyperparameters can be found in Appendix A. Code for PPG can be found at https://github.com/openai/phasic-policy-gradient.

3. Experiments

We report results on the environments in Procgen Benchmark (Cobbe et al., 2019). This benchmark was designed to be highly diverse, and we expect improvements on this benchmark to transfer well to many other RL environments. In each Procgen environment, we train and evaluate agents on the full distribution of levels. Throughout all experiments, we use the hyperparameters found in Appendix A unless otherwise specified. When feasible, we compute and visualize the standard deviation across 3 separate runs. Each experiment required between 10 and 50 GPU-hours per run per environment, depending on hyperparameters.

3.1. Comparison to PPO

We begin by comparing our implementation of PPG to the highly tuned implementation of PPO from Cobbe et al. (2019). We note that this implementation of PPO uses a near-optimal level of sample reuse and a near-optimal relative weight for the value and policy losses, as determined by a hyperparameter sweep. Results are shown in Figure 2. We can see that PPG achieves significantly better sample efficiency than PPO in nearly every environment.

Figure 2. Sample efficiency of PPG compared to a PPO baseline. Mean and standard deviation shown across 3 runs.

We have noticed that the importance of representation sharing between the policy and value function does seem to vary between environments. While it is critical to share parameters between the policy and the value function in Procgen environments (see Appendix B), this is often unnecessary in environments with a lower-dimensional input space (Haarnoja et al., 2018). We conjecture that the high-dimensional input space in Procgen contributes to the importance of sharing representations between the policy and the value function. We therefore believe it is in environments such as these, particularly those with vision-based observations, that PPG is most likely to outperform PPO and other similar algorithms.

3.2. Policy Sample Reuse

In PPO, choosing the optimal level of sample reuse is not straightforward.
Increasing sample reuse in PPO implies performing both additional policy optimization and additional value function optimization. This leads to an undesirable confounding of effects, making it harder to analyze the impact of policy sample reuse alone. Empirically, we find that performing 3 epochs per rollout is best in PPO, given our other hyperparameter settings (see Appendix D).

In PPG, policy and value function training are decoupled, and we can train each with a different level of sample reuse. In order to better understand the impact of policy sample reuse, we vary the number of policy epochs ($E_\pi$) without changing the number of value function epochs ($E_V$). Results are shown in Figure 3.

Figure 3. Performance with varying levels of policy sample reuse.

As we can see, training with a single policy epoch is almost always optimal or near-optimal in PPG. This suggests that the PPO baseline benefits from greater sample reuse only because the extra epochs offer additional value function training. When value function and policy training are properly isolated, we see little benefit from training the policy beyond a single epoch. Of course, various hyperparameters will influence this result. If we use an artificially low learning rate, for instance, it will become advantageous to increase policy sample reuse. Our present conclusion is simply that when using well-tuned hyperparameters, performing a single policy epoch is near-optimal.

3.3. Value Sample Reuse

We now evaluate how performing additional epochs during the auxiliary phase impacts performance. We expect there to be a trade-off: using too many epochs runs the risk of overfitting to recent data, while using fewer epochs will lead to slower training. We vary the number of auxiliary epochs ($E_{aux}$) from 1 to 9 and report results in Figure 4.

Figure 4. Performance with varying levels of value function sample reuse.

We find that training with additional auxiliary epochs is generally beneficial, with performance tapering off around 6 auxiliary epochs. We note that training with additional auxiliary epochs offers two possible benefits. First, due to the optimization of $L^{joint}$, we may expect better-trained features to be shared with the policy. Second, due to the optimization of $L^{value}$, we may expect to train a more accurate value function, thereby reducing the variance of the policy gradient in future policy phases. In general, which benefit is more significant is likely to vary between environments. In Procgen environments, the feature sharing between the policy and value networks appears to play the more critical role. For a more detailed discussion of the relationship between these two objectives, see Appendix C.

3.4. Auxiliary Phase Frequency

We next investigate alternating between policy and auxiliary phases at different frequencies, controlled by the hyperparameter $N$. As described in Section 2, we perform each auxiliary phase after every $N$ policy updates. We vary this hyperparameter from 2 to 32 and report results in Figure 5.

Figure 5. Performance with varying auxiliary phase frequency.

It is clear that performance suffers when we perform auxiliary phases too frequently. We conjecture that each auxiliary phase interferes with policy optimization, and that performing frequent auxiliary phases exacerbates this effect. It is possible that future research will uncover more clever optimization techniques to mitigate this interference. For now, we conclude that relatively infrequent auxiliary phases are critical to success.
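To make the reuse-related hyperparameters concrete, the sketch below collects them into a single configuration object. The class and field names are our own, and the values are illustrative assumptions chosen to be consistent with the sweeps above (a single policy epoch, roughly 6 auxiliary epochs, and infrequent auxiliary phases); the paper's actual defaults are listed in its Appendix A.

```python
from dataclasses import dataclass

@dataclass
class PPGConfig:
    """Illustrative PPG hyperparameters; values are assumptions consistent with Sections 3.2-3.4."""
    n_policy_iters: int = 32   # N: policy iterations per policy phase (infrequent auxiliary phases, Sec. 3.4)
    policy_epochs: int = 1     # E_pi: a single policy epoch per rollout is near-optimal (Sec. 3.2)
    value_epochs: int = 1      # E_V: value epochs during the policy phase
    aux_epochs: int = 6        # E_aux: auxiliary epochs; gains taper off around 6 (Sec. 3.3)
    beta_clone: float = 1.0    # weight of the behavioral cloning KL term (assumed value)
```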
3.5. KL Penalty vs Clipping

As an alternative to clipping, Schulman et al. (2017) proposed using an adaptively weighted KL penalty. We now investigate the use of a KL penalty in PPG, but we instead choose to keep the relative weight of this penalty fixed. Specifically, we set the policy gradient loss (excluding the entropy bonus) to be

$$L^{KL} = \hat{\mathbb{E}}_t\left[-\hat{A}_t\,\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} + \beta\,\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t),\ \pi_\theta(\cdot|s_t)\right]\right]$$

where $\beta$ controls the weight of the KL penalty. After performing a hyperparameter sweep, we set $\beta$ to 1. Results are shown in Figure 6.

Figure 6. The impact of replacing the clipping objective ($L^{clip}$) with a fixed KL penalty objective ($L^{KL}$). Mean and standard deviation shown across 3 runs.

We find that a fixed KL penalty objective performs remarkably similarly to clipping when using PPG. We suspect that using clipping (or an adaptive KL penalty) is more important when rewards are poorly scaled. We avoid this concern by normalizing rewards so that discounted returns have approximately unit variance (see Appendix A for more details). In any case, we highlight the effectiveness of the KL penalty variant of PPG since $L^{KL}$ is arguably easier to analyze than $L^{clip}$, and since future work may wish to build upon either objective.

3.6. Single-Network PPG

By default, PPG comes with both an increased memory footprint and an increased wall-clock training time. Since we use disjoint policy and value function networks instead of a single unified network, we use approximately twice as many parameters as the PPO baseline. The increased parameter count leads to an approximate doubling in the computational cost of forward passes and backpropagation.

We can recover this cost and maintain most of the key benefits of PPG by using a single network that appropriately detaches the value function gradient. During the policy phase, we detach the value function gradient at the last layer shared between the policy and value heads, preventing the value function gradient from influencing shared parameters. During the auxiliary phase, we take the value function gradient with respect to all parameters, including shared parameters. This allows us to benefit from the representations learned by the value function, while still removing the interference during the policy phase.

Figure 7. A comparison between the default implementation of PPG, which trains two separate networks, and a single-network variant that mimics the same training dynamics by detaching the gradient when necessary. PPO shown for reference.

As shown in Figure 7, using PPG with this single shared network performs almost as well as PPG with a dual-network architecture. We were initially concerned that the value function might be unable to train well during the policy phase with the detached gradient, but in practice this does not appear to be a major problem. We believe this is because the value function can still train from the full gradient during the auxiliary phase. We note that our implementation of single-network PPG has a similar wall-clock training time to PPO, while the wall-clock training time for PPG is roughly twice as long.
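The gradient detaching described above can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the module name, layer sizes, and the `policy_phase` flag are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SingleNetworkPPGModel(nn.Module):
    """Shared trunk with a policy head and a value head (illustrative sketch)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor, policy_phase: bool):
        features = self.trunk(obs)
        logits = self.policy_head(features)
        # Policy phase: detach the shared features so the value loss cannot
        # reach the trunk. Auxiliary phase: let the value gradient flow through
        # all shared parameters.
        value_features = features.detach() if policy_phase else features
        value = self.value_head(value_features).squeeze(-1)
        return logits, value
```

With this arrangement, only the value head receives value-gradient updates during the policy phase, while the auxiliary phase lets the value loss also shape the shared trunk, mirroring the dual-network training dynamics.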
4. Related Work

Igl et al. (2020) recently proposed Iterative Relearning (ITER) to reduce the impact of non-stationarity during RL training. ITER and PPG share a striking similarity: both algorithms alternate between a standard RL phase and a distillation phase. However, the nature and purpose of the distillation phases differ. In ITER, the policy and value function teachers are periodically distilled into newly initialized student networks, in an effort to improve generalization. In PPG, the value function network is periodically distilled into the policy network, in an effort to improve sample efficiency.

Bejjani et al. (2021) used an objective similar to $L^{joint}$ to perform additional value function training with a shared network architecture, in order to train a more accurate value function baseline while minimizing interference with the policy. In PPG, we further emphasize the usefulness of this value function training as an auxiliary task, and we show that it is best to optimize this objective relatively infrequently, for the sake of stability.

Other prior work has considered the role the value function plays as an auxiliary task. Bellemare et al. (2019) investigate using value functions to train useful representations, specifically focusing on a special class of value functions called Adversarial Value Functions (AVFs). They find that AVFs provide a useful auxiliary objective in the four-room domain. Lyle et al. (2019) suggest that the benefits of distributional RL (Bellemare et al., 2017) can perhaps be attributed to the rich signal the value function distribution provides as an auxiliary task. We find that the representation learning performed by the value function is indeed critical in Procgen environments, although we consider only the value function of the current policy, and we do not model the full value distribution.

Off-policy algorithms like Soft Actor-Critic (SAC) (Haarnoja et al., 2018), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), and Actor-Critic with Experience Replay (ACER) (Wang et al., 2016) all employ replay buffers to improve sample efficiency via off-policy updates. PPG also utilizes a replay buffer, specifically when performing updates during the auxiliary phase. However, unlike these algorithms, PPG does not attempt to improve the policy from off-policy data. Rather, this replay buffer data is used only to better fit the value targets and to better train features for the policy. SAC also notably uses separate policy and value function networks, presumably, like PPG, to avoid interference between their respective objectives.

Although we use the clipped surrogate objective from PPO (Schulman et al., 2017) throughout this work, PPG is in principle compatible with the policy objectives from any actor-critic algorithm. Andrychowicz et al. (2020) recently performed a rigorous empirical comparison of many relevant algorithms in the on-policy setting. In particular, AWR (Peng et al., 2019) and V-MPO (Song et al., 2019) propose alternative policy objectives that move the current policy towards one which weights the likelihood of each action by the exponentiated advantage of that action. Such objectives could be used in PPG, in place of the PPO objective.

There are also several trust region methods, similar in spirit to PPO, that would be compatible with PPG. Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) proposed performing policy updates by optimizing a surrogate objective, whose gradient is the policy gradient estimator, subject to a constraint on the KL divergence between the original policy and the updated policy. Actor-Critic using Kronecker-Factored Trust Region (ACKTR) (Wu et al., 2017) uses Kronecker-factored approximated curvature (KFAC) to perform a similar trust region update, but with a computational cost comparable to SGD. Both methods could be used in the PPG framework.
5. Conclusion

The results in Sections 3.2 and 3.3 make it clear that the optimal level of sample reuse varies significantly between the policy and the value function. Training these two objectives with different levels of sample reuse is not possible in a conventional actor-critic framework using a shared network architecture. By decoupling policy and value function training, PPG is able to reap the benefits of additional value function training without significantly interfering with the policy.

To achieve this, PPG does introduce several new hyperparameters, which creates some additional complexity relative to previous algorithms. However, we consider this a relatively minor cost, and we note that the chosen hyperparameter values generalize well across all 16 Procgen environments.

By mitigating interference between the policy and the value function while still maintaining the benefits of shared representations, PPG significantly improves sample efficiency on the challenging Procgen Benchmark. Moreover, PPG establishes a framework for optimizing arbitrary auxiliary losses alongside RL training in a stable manner. We have focused on the value function error as the sole auxiliary loss in this work, but we consider it a compelling topic for future research to evaluate other auxiliary losses using PPG.

References

Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., et al. What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020.

Bejjani, W., Leonetti, M., and Dogar, M. R. Learning image-based receding horizon planning for manipulation in clutter. Robotics and Autonomous Systems, pp. 103730, 2021.

Bellemare, M., Dabney, W., Dadashi, R., Taiga, A. A., Castro, P. S., Le Roux, N., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4360–4371, 2019.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 449–458, 2017.

Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588, 2019.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. CoRR, abs/1802.01561, 2018.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Igl, M., Farquhar, G., Luketina, J., Boehmer, W., and Whiteson, S. The impact of non-stationarity on generalisation in deep reinforcement learning. arXiv preprint arXiv:2006.05826, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4504–4511, 2019.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W., and Zhang, L. Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019a.

OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019b.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., et al. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238, 2019.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5279–5288, 2017.