# Simple Policy Optimization

Zhengpeng Xie*¹, Qiang Zhang*¹², Fan Yang*³, Marco Hutter³, Renjing Xu¹

Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust region, backed by strong theoretical guarantees. However, its reliance on complex second-order optimization limits its practical efficiency. Proximal Policy Optimization (PPO) addresses this by simplifying TRPO's approach using ratio clipping, improving efficiency but sacrificing some theoretical robustness. This raises a natural question: Can we combine the strengths of both methods? In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm. By slightly modifying the policy loss used in PPO, SPO can achieve the best of both worlds. Our new objective improves upon ratio clipping, offering stronger theoretical properties and better constraining the probability ratio within the trust region. Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end. Code is available at Simple-Policy-Optimization.

*Equal contribution. ¹The Hong Kong University of Science and Technology (Guangzhou), ²Beijing Innovation Center of Humanoid Robotics, ³ETH Zurich. Correspondence to: Zhengpeng Xie.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1. Training performance in the Breakout environment (average return vs. timesteps for PPO and SPO with ResNet-50 and ResNet-101 encoders). SPO is a novel model-free algorithm capable of end-to-end training for extremely deep neural network architectures, positioning itself as a promising alternative to the well-known PPO algorithm.

1. Introduction

Deep Reinforcement Learning (DRL) has achieved great success in recent years, notably in games (Mnih et al., 2015; Silver et al., 2016; 2017; 2018; Vinyals et al., 2019), foundation model fine-tuning (Ouyang et al., 2022; Black et al., 2023), and robotic control (Makoviychuk et al., 2021; Rudin et al., 2022). Policy gradient (PG) methods (Sutton & Barto, 2018; Lehmann, 2024), a major paradigm in RL, have been widely adopted by the academic community. One main practical challenge of PG methods is to reduce the variance of the gradients while keeping the bias low (Sutton et al., 2000; Schulman et al., 2015b). In this context, a widely used technique is to add a baseline when sampling an estimate of the action-value function (Greensmith et al., 2004). Another challenge of PG methods is to estimate a proper step size for the policy update (Kakade & Langford, 2002; Schulman et al., 2015a). Given that the training data strongly depend on the current policy, a large step size may cause the policy's performance to collapse, whereas a small one may impair the sample efficiency of the algorithm. To address these challenges, Schulman et al. (2015a) proved that optimizing a certain surrogate objective guarantees policy improvement with non-trivial step sizes. The TRPO algorithm was subsequently derived through a series of approximations; it imposes a trust region constraint on the policy iterates, leading to monotonic policy improvement in theory.
However, given the complexity of second-order optimization, TRPO is highly inefficient and can be hard to extend to large-scale RL environments. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is designed to enforce comparable constraints on the difference between successive policies during training while using only first-order optimization. By clipping probability ratios that exceed a preset limit to a constant, PPO attempts to remove the high incentive for pushing the current policy away from the old one. It has been demonstrated that PPO can be effectively extended to large-scale complex control tasks (Ye et al., 2020; Makoviychuk et al., 2021).

Figure 2. (Left) The only difference between SPO and PPO is the policy loss, where $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\mathrm{old}}}(a_t|s_t)$ and $\epsilon$ is the probability ratio hyperparameter, making it simple and straightforward to implement SPO on top of high-quality PPO implementations. (Right) The optimization behavior of PPO and SPO (probability ratio vs. training epoch): each scatter point represents the probability ratio of a single data point at a specific training epoch, its color corresponds to its advantage, and the red line marks the probability ratio bound.

Despite its success, the optimization behavior of PPO remains insufficiently understood. Although PPO aims to constrain the probability ratio deviations between successive policies, it often fails to keep these ratios within bounds (Ilyas et al., 2018; Engstrom et al., 2020; Wang et al., 2020). In some tasks, the ratios can even escalate to values as high as 40 (Wang et al., 2020). Furthermore, studies have revealed that PPO's performance is highly dependent on code-level optimizations (Andrychowicz et al., 2021; Huang et al., 2022a). The implementation of PPO includes numerous code-level details that critically influence its effectiveness (Engstrom et al., 2020; Huang et al., 2022b).

In this paper, we propose a new model-free RL algorithm named Simple Policy Optimization (SPO), designed to bound probability ratios more effectively through a novel objective function. The key differences in optimization behavior between PPO and SPO are illustrated in Figure 2. Our main contributions are summarized as follows:

- We theoretically prove that optimizing a tighter performance lower bound over a Total Variation (TV) divergence constrained solution space results in more consistent policy improvement.
- To overcome PPO's limitation in constraining probability ratios, we propose a new objective function, leading to the proposed SPO algorithm.
- We benchmark various policy gradient algorithms across different environments, showing that SPO achieves competitive performance with a simple implementation, improved sample efficiency, and easier training of deeper policy networks.

2. Related Work

Since TRPO (Schulman et al., 2015a) theoretically demonstrated monotonic policy improvement, numerous studies have explored how to enforce trust region constraints efficiently, which are essential for ensuring robust policy improvement. For instance, the widely used PPO algorithm (Schulman et al., 2017) was the first to introduce the heuristic clipping technique, effectively avoiding computationally expensive second-order optimization.
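For reference, a minimal PyTorch-style sketch of this clipped surrogate loss (our illustration; the variable names are ours, and the formal objective appears in Section 3.3):

```python
import torch

def ppo_clip_loss(ratio: torch.Tensor, adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective of PPO-Clip."""
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The elementwise minimum removes the incentive to push the ratio
    # outside [1 - eps, 1 + eps] -- but it also zeros the gradient there.
    return -torch.min(unclipped, clipped).mean()
```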
This heuristic clipping technique has been widely used in various reinforcement learning algorithms (Queeney et al., 2021; Zhuang et al., 2023; Gan et al., 2024). However, empirical evidence from a wide range of studies demonstrates that ratio clipping fails to enforce trust region constraints effectively (Wang et al., 2020). To prevent aggressive policy updates, previous works have focused on designing adaptive learning rates based on the TV or KL divergence (Heess et al., 2017; Queeney et al., 2021; Rudin et al., 2022), which have been shown to effectively enhance the stability of PPO. On the other hand, code-level optimizations are crucial for the robust performance of PPO (Engstrom et al., 2020). High-quality implementations of PPO involve numerous code details (Huang et al., 2022a;b), making it challenging to accurately assess the core factors that truly affect the algorithm's performance.

In this work, we argue that the heuristic clipping technique cannot enforce trust region constraints (see Figure 2). During PPO's iterations, ratio clipping zeros the gradients of certain data points, which can leave no corrective gradients to prevent the policy from escaping the trust region, thus undermining the monotonic improvement guarantee. As a result, PPO requires additional code-level tuning, such as adaptive learning rates or early stopping strategies, to artificially prevent performance collapse. We reveal this inherent flaw of ratio clipping and propose promising alternatives.

3. Background

3.1. Reinforcement Learning

Online reinforcement learning is a mathematical framework for sequential decision-making, generally defined by a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, \rho_0, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition probability, $\rho_0 : \mathcal{S} \to [0, 1]$ is the initial state distribution, and $\gamma \in (0, 1)$ is the discount factor. Suppose that an agent interacts with the environment following policy $\pi$, i.e., $a_t \sim \pi(\cdot|s_t)$, and obtains a trajectory $\tau = (s_0, a_0, r_0, \dots, s_t, a_t, r_t, \dots)$, where $r_t = r(s_t, a_t)$. The goal of RL is to learn a policy that maximizes the objective

$$\eta(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad (1)$$

where $\mathbb{E}_{\tau \sim \pi}$ denotes the expectation over trajectories generated by the agent following policy $\pi$, i.e., $s_0 \sim \rho_0(\cdot)$, $a_t \sim \pi(\cdot|s_t)$, $r_t = r(s_t, a_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$. The action-value function and value function are defined as

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[\sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k})\right], \qquad V_\pi(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}\left[Q_\pi(s_t, a_t)\right].$$

Given $Q_\pi$ and $V_\pi$, the advantage function can be expressed as $A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)$.

3.2. Trust Region Policy Optimization

Classic policy gradient methods cannot reuse data and are highly sensitive to hyperparameters. To address these issues, in Trust Region Policy Optimization (TRPO), Schulman et al. (2015a) derived a lower bound for policy improvement. Before that, Kakade & Langford (2002) first proved the following policy performance difference theorem.

Theorem 3.1. (Kakade & Langford, 2002) Let $P(s_t = s \mid \pi)$ denote the probability that the $t$-th state equals $s$ in trajectories generated by the agent following policy $\pi$, and let $\rho_\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$ denote the normalized discounted visitation distribution. Given any two policies $\pi$ and $\tilde\pi$, their performance difference can be measured by

$$\eta(\tilde\pi) - \eta(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim \rho_{\tilde\pi}(\cdot),\, a \sim \tilde\pi(\cdot|s)}\left[A_\pi(s, a)\right], \qquad (2)$$

where $\eta(\pi)$ is defined as in (1).

The key insight is that the new policy $\tilde\pi$ will improve (or at least remain constant) as long as it has a nonnegative expected advantage at every state $s$.
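For intuition, the identity (2) can be verified with a short telescoping argument (a standard derivation sketch we add here; it is not spelled out in the original text). Since $A_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[r_t + \gamma V_\pi(s_{t+1})\right] - V_\pi(s_t)$,

$$\mathbb{E}_{\tau\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t A_\pi(s_t,a_t)\right] = \mathbb{E}_{\tau\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t + \sum_{t=0}^{\infty}\left(\gamma^{t+1}V_\pi(s_{t+1}) - \gamma^t V_\pi(s_t)\right)\right] = \eta(\tilde\pi) - \mathbb{E}_{s_0\sim\rho_0}\left[V_\pi(s_0)\right] = \eta(\tilde\pi) - \eta(\pi),$$

and rewriting the left-hand side as an expectation over the visitation distribution $\rho_{\tilde\pi}$ introduces the $\tfrac{1}{1-\gamma}$ factor, yielding (2).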
Then, the following performance improvement lower bound is given:

Theorem 3.2. (Achiam et al., 2017) Given any two policies $\pi$ and $\tilde\pi$, the following bound holds:

$$\eta(\tilde\pi) - \eta(\pi) \ge \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \tilde\pi(\cdot|s)}\left[A_\pi(s, a)\right] - \frac{2\xi\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[D_{\mathrm{TV}}(\tilde\pi \,\|\, \pi)[s]\right], \qquad (3)$$

where $\xi = \max_s \left|\mathbb{E}_{a \sim \tilde\pi(\cdot|s)}\left[A_\pi(s, a)\right]\right|$ and $D_{\mathrm{TV}}$ is the Total Variation (TV) divergence.

Using importance sampling on the action $a \sim \pi(\cdot|s)$ and applying Pinsker's inequality, we have

$$\eta(\tilde\pi) - \eta(\pi) \ge \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\frac{\tilde\pi(a|s)}{\pi(a|s)}\, A_\pi(s, a)\right] - \frac{2\xi\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(\pi \,\|\, \tilde\pi)[s]}\right]. \qquad (4)$$

At this point, the subscripts of the expectation in (2) have been replaced from $s \sim \rho_{\tilde\pi}(\cdot)$ and $a \sim \tilde\pi(\cdot|s)$ to $s \sim \rho_\pi(\cdot)$ and $a \sim \pi(\cdot|s)$, which means that we can now reuse the current data. In TRPO, the lower bound in (4) is indirectly optimized by solving the following constrained problem:

$$\max_\theta\ \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\mathrm{old}}}}\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}\, \hat{A}(s_t, a_t)\right], \quad \text{s.t.}\ \ \mathbb{E}\left[D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta)\right] \le \delta, \qquad (5)$$

where $\delta$ is a hyperparameter that limits the KL divergence between successive policies, $\hat{A}(s_t, a_t)$ is an estimate of the advantage function, and the objective is called the "surrogate objective".

3.3. Proximal Policy Optimization

Due to the necessity of solving the constrained optimization problem (5) at each update, TRPO is highly inefficient and can be challenging to apply to large-scale reinforcement learning tasks. Schulman et al. (2017) proposed a new objective, called the "clipped surrogate objective", and named the resulting algorithm Proximal Policy Optimization (PPO). PPO retains constraints similar to TRPO's but is much easier to implement and involves only first-order optimization. The clipped surrogate objective, also called PPO-Clip, adopts a ratio clipping function. Denoting $\hat{A}_t = \hat{A}(s_t, a_t)$, the objective of PPO-Clip can be expressed as

$$J^{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\mathrm{old}}}}\left\{\min\left[r_t(\theta)\,\hat{A}_t,\ \bar{r}_t(\theta)\,\hat{A}_t\right]\right\}, \qquad (6)$$

where

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}, \qquad \bar{r}_t(\theta) = \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right), \qquad (7)$$

with $\pi_{\theta_{\mathrm{old}}}$ and $\pi_\theta$ being the old and current policies. The gradient of PPO-Clip for a data point $(s_t, a_t)$ can be expressed as

$$\nabla_\theta J^{\mathrm{clip}}(\theta) = \begin{cases} \nabla_\theta r_t(\theta)\, \hat{A}_t, & \hat{A}_t > 0,\ r_t(\theta) \le 1+\epsilon; \\ \nabla_\theta r_t(\theta)\, \hat{A}_t, & \hat{A}_t < 0,\ r_t(\theta) \ge 1-\epsilon; \\ 0, & \text{otherwise}. \end{cases} \qquad (8)$$

In other words, PPO-Clip aims to remove the high incentive for pushing the current policy away from the old one. PPO-Clip has gained wide adoption in the academic community due to its simplicity and performance.

4. Methodology

PPO attempts to limit the differences between successive policies through ratio clipping. However, Wang et al. (2020) proved the following theorem:

Theorem 4.1. (Wang et al., 2020) Consider discrete action space tasks where $|\mathcal{A}| \ge 3$, or continuous action space tasks where the output of the policy $\pi_\theta$ follows a multivariate Gaussian distribution. Let $\Theta = \{\theta \mid 1-\epsilon \le r_t(\theta) \le 1+\epsilon\}$. Then $\sup_{\theta \in \Theta} D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta)[s_t] = +\infty$ in both the discrete and the continuous case.

Theorem 4.1 demonstrates that $D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta)[s_t]$ is not necessarily bounded even if the probability ratio $r_t(\theta)$ is bounded. However, this theorem considers only an extreme case involving a single data point, which is less typical than the batch processing used in training. On a broader scale, the heuristic clipping technique employed by PPO aims to bound the TV divergence for sufficiently large batch sizes (Queeney et al., 2021).
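A tiny numerical instance of Theorem 4.1 (a sketch we add for illustration; the distributions are contrived) shows how the KL divergence at a state can blow up while the sampled ratio stays inside the clipping range:

```python
import numpy as np

eps = 0.2
pi_old = np.array([1/3, 1/3, 1/3])  # three actions; suppose a_t = 0 was sampled
for delta in [1e-2, 1e-6, 1e-12]:
    # Clipping only constrains the ratio at the sampled action;
    # here it is exactly 1.0, well inside [1 - eps, 1 + eps].
    pi_new = np.array([1/3, 2/3 - delta, delta])
    kl = np.sum(pi_old * np.log(pi_old / pi_new))
    print(f"r_t = {pi_new[0] / pi_old[0]:.2f}, KL = {kl:.2f}")
# KL grows without bound as delta -> 0, even though r_t = 1.0 throughout.
```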
The relationship between bounded probability ratios and the TV divergence is formalized as

$$\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[D_{\mathrm{TV}}(\tilde\pi \,\|\, \pi)[s]\right] = \frac{1}{2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\left|\frac{\tilde\pi(a|s)}{\pi(a|s)} - 1\right|\right]. \qquad (9)$$

Then, the performance improvement lower bound (3) can be rewritten as

$$\eta(\tilde\pi) - \eta(\pi) \ge \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\frac{\tilde\pi(a|s)}{\pi(a|s)}\, A_\pi(s, a)\right] - \frac{\xi\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\left|\frac{\tilde\pi(a|s)}{\pi(a|s)} - 1\right|\right]. \qquad (10)$$

This explains why PPO attempts to limit the probability ratio, $|\tilde\pi(a|s)/\pi(a|s) - 1| \le \epsilon$, as this enforces a TV divergence trust region in expectation. Finally, we also find that PPO, which aims to bound the TV divergence, can offer a larger solution space than methods that incorporate the looser KL divergence as a constraint (e.g., TRPO). To illustrate this, we present the following proposition:

Proposition 4.2. Given the old policy $\pi$, define the solution spaces under the TV and KL divergence constraints as

$$\Omega^{\mathrm{TV}} = \left\{\tilde\pi \mid D_{\mathrm{TV}}(\tilde\pi \,\|\, \pi)[s] \le \delta_{\mathrm{TV}},\ \forall s \in \mathcal{S}\right\}, \qquad \Omega^{\mathrm{KL}} = \left\{\tilde\pi \mid D_{\mathrm{KL}}(\tilde\pi \,\|\, \pi)[s] \le \delta_{\mathrm{KL}},\ \forall s \in \mathcal{S}\right\}, \qquad (11)$$

where $\delta_{\mathrm{KL}} > 0$ is a predefined threshold. Let $\delta_{\mathrm{TV}} = \sqrt{\tfrac{1}{2}\delta_{\mathrm{KL}}}$; then $\Omega^{\mathrm{KL}} \subseteq \Omega^{\mathrm{TV}}$.

Proof. For any given $\delta_{\mathrm{KL}}$ and $\tilde\pi \in \Omega^{\mathrm{KL}}$, Pinsker's inequality gives $D_{\mathrm{TV}}(\tilde\pi \,\|\, \pi)[s] \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(\tilde\pi \,\|\, \pi)[s]} \le \sqrt{\tfrac{1}{2}\delta_{\mathrm{KL}}} = \delta_{\mathrm{TV}}$; therefore $\tilde\pi \in \Omega^{\mathrm{KL}} \Rightarrow \tilde\pi \in \Omega^{\mathrm{TV}}$, which means $\Omega^{\mathrm{KL}} \subseteq \Omega^{\mathrm{TV}}$, concluding the proof.

Additionally, the optimal solution to the lower bound over the TV divergence solution space $\Omega^{\mathrm{TV}}$ is expected to be superior. We now present the following theorem:

Theorem 4.3. Given the old policy $\pi$ and $\Omega^{\mathrm{TV}}$, $\Omega^{\mathrm{KL}}$ as in Proposition 4.2, let

$$L^{\mathrm{TV}}_\pi(\tilde\pi) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\frac{\tilde\pi(a|s)}{\pi(a|s)}\, A_\pi(s, a)\right] - \frac{2\xi\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[D_{\mathrm{TV}}(\tilde\pi \,\|\, \pi)[s]\right],$$

$$L^{\mathrm{KL}}_\pi(\tilde\pi) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\frac{\tilde\pi(a|s)}{\pi(a|s)}\, A_\pi(s, a)\right] - \frac{2\xi\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(\tilde\pi \,\|\, \pi)[s]}\right] \qquad (12)$$

be the lower bounds on performance improvement under the TV and KL divergences. Let $\delta_{\mathrm{TV}} = \sqrt{\tfrac{1}{2}\delta_{\mathrm{KL}}}$ and denote

$$\tilde\pi^*_{\mathrm{TV}} = \arg\max_{\tilde\pi \in \Omega^{\mathrm{TV}}} L^{\mathrm{TV}}_\pi(\tilde\pi), \qquad \tilde\pi^*_{\mathrm{KL}} = \arg\max_{\tilde\pi \in \Omega^{\mathrm{KL}}} L^{\mathrm{KL}}_\pi(\tilde\pi); \qquad (13)$$

then $L^{\mathrm{TV}}_\pi(\tilde\pi^*_{\mathrm{TV}}) \ge L^{\mathrm{KL}}_\pi(\tilde\pi^*_{\mathrm{KL}})$.

Proof. Since $\Omega^{\mathrm{KL}} \subseteq \Omega^{\mathrm{TV}}$, we have

$$\begin{aligned}
L^{\mathrm{TV}}_\pi(\tilde\pi^*_{\mathrm{TV}}) &\ge L^{\mathrm{TV}}_\pi(\tilde\pi^*_{\mathrm{KL}}) \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\frac{\tilde\pi^*_{\mathrm{KL}}(a|s)}{\pi(a|s)}\, A_\pi(s, a)\right] - \frac{2\xi\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[D_{\mathrm{TV}}(\tilde\pi^*_{\mathrm{KL}} \,\|\, \pi)[s]\right] \\
&\ge \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[\frac{\tilde\pi^*_{\mathrm{KL}}(a|s)}{\pi(a|s)}\, A_\pi(s, a)\right] - \frac{2\xi\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(\tilde\pi^*_{\mathrm{KL}} \,\|\, \pi)[s]}\right] \\
&= L^{\mathrm{KL}}_\pi(\tilde\pi^*_{\mathrm{KL}}),
\end{aligned} \qquad (14)$$

thus $L^{\mathrm{TV}}_\pi(\tilde\pi^*_{\mathrm{TV}}) \ge L^{\mathrm{KL}}_\pi(\tilde\pi^*_{\mathrm{KL}})$, concluding the proof.

Based on Proposition 4.2 and Theorem 4.3, we reach the following conclusion: optimizing the lower bound under TV divergence constraints offers a more effective solution space than under KL divergence constraints, leading to better policy improvement.

Algorithm 1 Simple Policy Optimization (SPO)
1: Initialize: policy and value networks $\pi_\theta$, $V_\phi$; hyperparameter $\epsilon$; value loss and policy entropy coefficients $c_1$, $c_2$
2: Output: optimal policy network $\pi_\theta$
3: while not converged do
4:   # Data collection
5:   Collect data $\mathcal{D} = \{(s_t, a_t, r_t)\}_{t=1}^{N}$ using the current policy network $\pi_\theta$
6:   # Snapshot the networks before updating
7:   $\pi_{\theta_{\mathrm{old}}} \leftarrow \pi_\theta$, $V_{\phi_{\mathrm{old}}} \leftarrow V_\phi$
8:   # Estimate the advantage $\hat{A}(s_t, a_t)$ based on $V_{\phi_{\mathrm{old}}}$
9:   Use the GAE technique (Schulman et al., 2015b) to estimate the advantage $\hat{A}(s_t, a_t)$
10:  # Estimate the return $\hat{R}_t$
11:  $\hat{R}_t \leftarrow V_{\phi_{\mathrm{old}}}(s_t) + \hat{A}(s_t, a_t)$
12:  for each training epoch do
13:    # Compute policy loss $L_p$ (this is the only difference between SPO and PPO)
14:    $L_p \leftarrow -\frac{1}{N}\sum_{t=1}^{N}\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}\, \hat{A}(s_t, a_t) - \frac{|\hat{A}(s_t, a_t)|}{2\epsilon}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)} - 1\right)^2\right]$
15:    # Compute policy entropy $L_e$ and value loss $L_v$
16:    $L_e \leftarrow \frac{1}{N}\sum_{t=1}^{N} H(\pi_\theta(\cdot|s_t))$, $L_v \leftarrow \frac{1}{2N}\sum_{t=1}^{N}\left[V_\phi(s_t) - \hat{R}_t\right]^2$
17:    # Compute total loss $L$
18:    $L \leftarrow L_p + c_1 L_v - c_2 L_e$
19:    # Update parameters $\theta$ and $\phi$ through backpropagation, with step sizes $\lambda_\theta$ and $\lambda_\phi$
20:    $\theta \leftarrow \theta - \lambda_\theta \nabla_\theta L$, $\phi \leftarrow \phi - \lambda_\phi \nabla_\phi L$
21:  end for
22: end while
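The policy loss in line 14 of Algorithm 1 translates directly into a few lines of code. Below is a minimal PyTorch sketch (our own illustration; the function name and signature are ours, not from the official implementation):

```python
import torch

def spo_policy_loss(log_prob: torch.Tensor,
                    log_prob_old: torch.Tensor,
                    adv: torch.Tensor,
                    eps: float = 0.2) -> torch.Tensor:
    """Negative SPO objective: r * A - |A| / (2 * eps) * (r - 1)^2."""
    ratio = torch.exp(log_prob - log_prob_old.detach())
    surrogate = ratio * adv
    # Quadratic penalty whose weight scales with |A|; the per-sample objective
    # is maximized exactly at r = 1 + sign(A) * eps (Theorem 5.2 below), so
    # every data point keeps a gradient pointing toward the trust region.
    penalty = adv.abs() / (2.0 * eps) * (ratio - 1.0).pow(2)
    return -(surrogate - penalty).mean()
```

In a PPO codebase this is a drop-in replacement for the clipped policy loss; the value loss and entropy terms are unchanged.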
As a result, to optimize the lower bound (10), we aim to solve the following constrained optimization problem:

$$\max_\theta\ \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta_{\mathrm{old}}}}\left[r_t(\theta)\,\hat{A}_t\right], \quad \text{s.t.}\ \ \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta_{\mathrm{old}}}}\left[|r_t(\theta) - 1|\right] \le \epsilon, \qquad (15)$$

where $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\mathrm{old}}}(a_t|s_t)$ and $\hat{A}_t = \hat{A}(s_t, a_t)$. PPO attempts to satisfy the constraint in (15) through ratio clipping, but this does not prevent excessive ratio deviations (as demonstrated in Figure 2). The underlying reason is that ratio clipping causes certain data points to stop contributing to the gradients. Over multiple iterations, this can lead to uncontrollable updates, as the absence of corrective gradients prevents the policy from recovering. To overcome this issue with ratio clipping, we propose the following objective:

$$J(\theta) = \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta_{\mathrm{old}}}}\left\{r_t(\theta)\,\hat{A}_t - \frac{|\hat{A}_t|}{2\epsilon}\left[r_t(\theta) - 1\right]^2\right\}. \qquad (16)$$

The details of this objective are discussed in the following section, and the pseudo-code is shown in Algorithm 1.

5. Theoretical Results

In this section, we provide some theoretical insights into the differences between PPO and SPO, demonstrating that SPO is more effective at constraining probability ratios.

5.1. Objective Class

We simplify notation by using $r$ and $A$ to denote the probability ratio and the advantage value. Based on the previous analysis, our goal is to find an objective function $f(r, A, \epsilon)$ such that, while optimizing the surrogate objective $rA$, the probability ratio is constrained by $|r - 1| \le \epsilon$. According to (15), for any given $A \ne 0$ and $\epsilon > 0$, we can write down the following desired optimization problem:

$$\max_r\ rA, \quad \text{s.t.}\ \ |r - 1| \le \epsilon. \qquad (17)$$

The objective is linear, so the optimal solution is $r^* = 1 + \mathrm{sign}(A)\,\epsilon$, where $\mathrm{sign}(\cdot)$ is the sign function. Motivated by this, we present the following definition:

Definition 5.1 ($\epsilon$-aligned). For any given $A \ne 0$ and $\epsilon > 0$, we say that the function $f(r, A, \epsilon)$ is $\epsilon$-aligned if it is differentiable and concave with respect to $r$, and attains its maximum at $r^* = 1 + \mathrm{sign}(A)\,\epsilon$.

The objectives of PPO in (6) and SPO in (16) can be expressed as

$$f_{\mathrm{ppo}} = \min\left[rA,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\,A\right], \qquad f_{\mathrm{spo}} = rA - \frac{|A|}{2\epsilon}(r-1)^2. \qquad (18)$$

It follows that $f_{\mathrm{ppo}}$ is not $\epsilon$-aligned, as $f_{\mathrm{ppo}}$ zeros the gradients in certain cases according to (8). For $f_{\mathrm{spo}}$, we have the following theorem:

Theorem 5.2. $f_{\mathrm{spo}}$ is $\epsilon$-aligned.

Proof. Clearly, $f_{\mathrm{spo}}$ is differentiable and concave with respect to $r$, since it is a quadratic polynomial in $r$ with negative leading coefficient. For any given $A \ne 0$ and $\epsilon > 0$, setting $\partial f_{\mathrm{spo}}(r, A, \epsilon)/\partial r = 0$ gives

$$A - \frac{|A|}{\epsilon}(r - 1) = 0, \qquad (19)$$

thus $r^* = 1 + \mathrm{sign}(A)\,\epsilon$ is the optimal solution for $f_{\mathrm{spo}}$.

Note that $f_{\mathrm{spo}}$ is not the only objective function satisfying the definition. It can be shown that the simple objective $f_{\mathrm{simple}} = -(r - 1 - \mathrm{sign}(A)\,\epsilon)^2$ is also $\epsilon$-aligned. We discuss the differences between the two in Section 6.3.

5.2. Analysis of the New Objective

We show that the optimization process of SPO bounds the probability ratio more effectively, as can be seen in Figure 3. The largest circular area in the figure represents the boundary on the probability ratio. The green circles represent data points with non-zero gradients during training, while the gray circles represent data points with zero gradients.

Figure 3. In PPO, certain data points exhibit zero gradients, while in SPO, all data points generate non-zero gradients that guide them towards the constraint boundary. (Legend: gradient direction; with gradient; without gradient.)

During the training process of PPO, data points that exceed the probability ratio bound cease to provide gradients. In contrast, all data points in SPO contribute gradients that guide the optimization towards the constraint boundary. As training progresses, PPO accumulates more gray circles that no longer provide gradients and may be influenced by harmful gradients from the green circles, which can push the gray circles even further from the constraint boundary. In contrast, the gradient directions of all data points in SPO point towards the constraint boundary. This indicates that SPO imposes stronger constraints on the probability ratio.
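The contrast in Figure 3 is easy to check numerically. The following sketch (ours; the values are illustrative) evaluates the gradients of $f_{\mathrm{ppo}}$ and $f_{\mathrm{spo}}$ at a data point that has already escaped the trust region:

```python
import torch

eps = 0.2
adv = torch.tensor(1.0)
r = torch.tensor(1.5, requires_grad=True)  # already beyond 1 + eps

f_ppo = torch.min(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv)
f_spo = r * adv - adv.abs() / (2 * eps) * (r - 1) ** 2

g_ppo, = torch.autograd.grad(f_ppo, r)
g_spo, = torch.autograd.grad(f_spo, r)
print(g_ppo.item())  # 0.0  -> a "gray circle": no force back toward the region
print(g_spo.item())  # -1.5 -> ascent on f_spo pushes r back toward 1 + eps
```

Gradient ascent on $f_{\mathrm{spo}}$ actively pulls the ratio back toward $1 + \epsilon$, whereas the PPO gradient simply vanishes.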
6. Experiments

We report results on the Atari 2600 (Bellemare et al., 2013; Machado et al., 2018) and MuJoCo (Todorov et al., 2012) benchmarks. In all our experiments, we use the RL library Gymnasium (Towers et al., 2024), which serves as a central abstraction to ensure broad interoperability between benchmark environments and training algorithms.

6.1. Comparing Algorithms

Our implementation of SPO is compared against PPO-Clip (Schulman et al., 2017), PPO-Penalty (Schulman et al., 2017), SPU (Vuong et al., 2018), PPO-RB (Wang et al., 2020), TR-PPO (Wang et al., 2020), TR-PPO-RB (Wang et al., 2020), and RPO (Gan et al., 2024) on the MuJoCo benchmark. We compute each algorithm's performance across ten separate runs with different random seeds. In addition, we emphasize that in all comparative experiments with identical settings for SPO and PPO, the only modification in SPO is replacing PPO's objective with (16); no further code-level tuning is applied to SPO, highlighting its simplicity and efficiency.

Due to the absence of human score baselines in MuJoCo (Todorov et al., 2012), we normalize the algorithms' performance across all environments using the training data of PPO-Clip; specifically,

$$\mathrm{normalized}(\mathrm{score}) = \frac{\mathrm{score} - \min}{\max - \min}, \qquad (20)$$

where $\max$ and $\min$ are the maximum and minimum validation returns of PPO-Clip during training, respectively. As suggested by Agarwal et al. (2021), we employ stratified bootstrap confidence intervals to assess the algorithms' confidence intervals and evaluate the composite metrics of SPO against the other baselines, as illustrated in Figure 4. SPO achieves the best performance across nearly all statistical metrics, which fully demonstrates its strong potential. For the Atari 2600 benchmark (Bellemare et al., 2013), the main results are presented in Appendices A and C.

Figure 4. Aggregate metrics on MuJoCo-v4 with 95% CIs based on 6 environments (median, IQM, mean, and optimality gap for PPO-Clip, PPO-Penalty, SPO, SPU, TR-PPO, PPO-RB, and TR-PPO-RB). We collected the returns of each algorithm over the last 1% of training steps across ten random seeds. Higher median, IQM, and mean scores and a lower optimality gap are better.
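As a sketch of this evaluation protocol (our illustration: the paper uses the stratified bootstrap of Agarwal et al. (2021), while this simplified version uses a plain percentile bootstrap over per-seed scores):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(score: np.ndarray, ppo_min: float, ppo_max: float) -> np.ndarray:
    # Eq. (20): min-max normalization against PPO-Clip's validation returns.
    return (score - ppo_min) / (ppo_max - ppo_min)

def bootstrap_ci(per_seed_scores: np.ndarray, n_boot: int = 2000, alpha: float = 0.05):
    """Percentile bootstrap CI of the mean over per-seed normalized scores."""
    means = [rng.choice(per_seed_scores, size=len(per_seed_scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```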
Figure 5. Training performance of PPO and SPO with different policy network depths (3 and 7 layers) on the MuJoCo benchmark (average return vs. timesteps; panels include HalfCheetah-v4, Humanoid-v4, HumanoidStandup-v4, and Walker2d-v4). The mean and standard deviation are shown across 5 random seeds.

Table 1. Average return of PPO and SPO over the last 10% of training steps across 5 separate runs with different random seeds, together with the maximum ratio deviation over the entire training process (↑ higher is better, ↓ lower is better).

| Environment | Metric | PPO (3 layers) | SPO (3 layers) | PPO (7 layers) | SPO (7 layers) |
|---|---|---|---|---|---|
| Ant-v4 | Average return (↑) | 5323.2 | 4911.3 | 1002.8 | 4672.5 |
| | Ratio deviation (↓) | 0.229 | 0.101 | 548.060 | 0.190 |
| HalfCheetah-v4 | Average return (↑) | 4550.2 | 3602.4 | 2242.3 | 5307.3 |
| | Ratio deviation (↓) | 0.225 | 0.086 | 1675.340 | 0.188 |
| Hopper-v4 | Average return (↑) | 1119.4 | 1480.3 | 975.9 | 1507.6 |
| | Ratio deviation (↓) | 0.164 | 0.067 | 113.178 | 0.194 |
| Humanoid-v4 | Average return (↑) | 795.1 | 2870.0 | 614.1 | 4769.9 |
| | Ratio deviation (↓) | 3689.957 | 0.179 | 2411.845 | 0.191 |
| HumanoidStandup-v4 | Average return (↑) | 143908.8 | 152378.7 | 92849.7 | 176928.9 |
| | Ratio deviation (↓) | 2547.499 | 0.182 | 4018.718 | 0.187 |
| Walker2d-v4 | Average return (↑) | 3352.3 | 2870.2 | 1110.9 | 3008.1 |
| | Ratio deviation (↓) | 0.170 | 0.070 | 998.101 | 0.157 |

6.2. Scaling the Policy Network

To investigate how scaling the policy network affects the sample efficiency of PPO and SPO on MuJoCo, we increased the number of policy network layers without altering the hyperparameters or other settings. The standard deviation of each algorithm's performance was computed and visualized across five separate runs with different random seeds. The results are shown in Figures 5 and 9 and Table 1, where the ratio deviation denotes the largest average ratio deviation of a batch over the entire training process, i.e., the maximum over training of $\frac{1}{|\mathcal{D}|}\sum_{(s_t,a_t) \in \mathcal{D}} |r_t(\theta) - 1|$. As the network deepens, the performance of PPO collapses in most environments, with uncontrollable probability ratio deviations. In contrast, SPO outperforms its shallow-network counterpart in almost all environments and constrains the probability ratio deviation effectively. Furthermore, SPO's statistical metrics generally outperform PPO's and demonstrate relative robustness to variations in network depth and mini-batch size.

Figure 6. Training performance of SPO using ResNet-18 as the encoder, compared to PPO and SPO using the default CNN (shown with the reference line), on Atari games including SpaceInvaders (average return and average ratio deviation vs. timesteps; curves for PPO-0.2 (ResNet-18), PPO-0.1 (ResNet-18), SPO-0.2 (ResNet-18), PPO-0.2 (CNN), and SPO-0.2 (CNN)). The mean and standard deviation are shown across 3 random seeds, and the red dashed line marks $\epsilon = 0.2$.

Figure 7. The optimization behavior of $f_{\mathrm{ppo}}$, $f_{\mathrm{spo}}$, and $f_{\mathrm{simple}}$ (surrogate objective and average ratio deviation vs. epoch).

We also trained ResNet-18 (He et al., 2016) as the encoder on the Atari 2600 benchmark; following Bhatt et al. (2019), who demonstrated that batch normalization is harmful to RL training, we removed batch normalization. The results are shown in Figure 6. As the network's capacity increases, the performance of SPO improves significantly. Moreover, SPO still maintains a tight probability ratio constraint, thereby benefiting from the theoretical lower bound (10). In contrast, it is challenging to train large neural networks with PPO, because the probability ratio cannot be controlled during training, even when employing a smaller $\epsilon = 0.1$.
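The ratio deviation reported in Table 1 and Figure 6 is straightforward to track during training; a sketch of the diagnostic (ours):

```python
import torch

def batch_ratio_deviation(ratio: torch.Tensor) -> float:
    # Average ratio deviation (1/|D|) * sum_t |r_t - 1| over one batch;
    # Table 1 reports the maximum of this quantity across all of training.
    return (ratio - 1.0).abs().mean().item()

# During training: max_dev = max(max_dev, batch_ratio_deviation(ratio.detach()))
```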
6.3. Constraining Ratio Deviation

To further investigate the optimization behavior of different objective functions that satisfy the $\epsilon$-aligned definition, we visualize the optimization process of $f_{\mathrm{ppo}}$, $f_{\mathrm{spo}}$, and $f_{\mathrm{simple}}$ from Section 5.1 on the same batch of advantage values initialized from a standard Gaussian distribution, as shown in Figure 7. We observe that while PPO achieves the best performance in optimizing the surrogate objective, it also leads to uncontrollable ratio deviations. In contrast, the two objectives that satisfy the $\epsilon$-aligned definition effectively constrain the ratio deviations during optimization. Furthermore, $f_{\mathrm{spo}}$ optimizes the surrogate objective better than $f_{\mathrm{simple}}$, while $f_{\mathrm{simple}}$ converges more quickly to the probability ratio boundary. This matches our expectations: the optimization of $f_{\mathrm{simple}}$ depends only on the sign of the advantage values, so it pushes each data point toward the constraint boundary with equal force, which uses the magnitude of the advantages less effectively than $f_{\mathrm{spo}}$ and therefore optimizes the surrogate objective less efficiently.

7. Conclusion

In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm that effectively combines the strengths of Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). SPO maintains the optimization within the trust region, benefiting from TRPO's theoretical guarantees while preserving the efficiency of PPO. Our experimental results demonstrate that SPO achieves competitive performance across various benchmarks with a simple implementation. Moreover, SPO simplifies the training of deep policy networks, addressing a key challenge faced by existing algorithms. These findings indicate that SPO is a promising approach for advancing model-free reinforcement learning. In future work, SPO holds potential for impactful applications in areas such as language models, robotic control, and financial modeling. With further research and refinement, we believe SPO will drive innovation and breakthroughs across these fields.

Acknowledgements

We would like to thank the anonymous ICLR and ICML reviewers for their insightful and constructive comments.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, pp. 22-31. PMLR, 2017.

Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., and Bellemare, M. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34:29304-29320, 2021.

Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., et al. What matters for on-policy deep actor-critic methods? A large-scale study. In International Conference on Learning Representations, 2021.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Bhatt, A., Argus, M., Amiranashvili, A., and Brox, T. CrossNorm: Normalization for off-policy TD reinforcement learning. arXiv preprint arXiv:1902.05605, 2019.

Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep policy gradients: A case study on PPO and TRPO. arXiv preprint arXiv:2005.12729, 2020.

Gan, Y., Yan, R., Wu, Z., and Xing, J. Reflective policy optimization. arXiv preprint arXiv:2406.03678, 2024.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S., et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

Huang, S., Dossa, R. F. J., Raffin, A., Kanervisto, A., and Wang, W. The 37 implementation details of proximal policy optimization. The ICLR Blog Track, 2022a.

Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., and Araújo, J. G. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1-18, 2022b.

Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Are deep policy gradient algorithms truly policy gradient algorithms? arXiv preprint arXiv:1811.02553, 2018.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267-274, 2002.

Lehmann, M. The definitive guide to policy gradients in deep reinforcement learning: Theory, algorithms and implementations. arXiv preprint arXiv:2401.13662, 2024.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523-562, 2018.

Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.

Queeney, J., Paschalidis, Y., and Cassandras, C. G. Generalized proximal policy optimization with sample reuse. Advances in Neural Information Processing Systems, 34:11909-11919, 2021.
Rudin, N., Hoeller, D., Reist, P., and Hutter, M. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pp. 91-100. PMLR, 2022.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897. PMLR, 2015a.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140-1144, 2018.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., Singh, S., and McAllester, D. Comparing policy-gradient algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 2000.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.

Towers, M., Kwiatkowski, A., Terry, J., Balis, J. U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354, 2019.

Vuong, Q., Zhang, Y., and Ross, K. W. Supervised policy update for deep reinforcement learning. arXiv preprint arXiv:1805.11706, 2018.

Wang, Y., He, H., and Tan, X. Truly proximal policy optimization. In Uncertainty in Artificial Intelligence, pp. 113-122. PMLR, 2020.

Ye, D., Liu, Z., Sun, M., Shi, B., Zhao, P., Wu, H., Yu, H., Yang, S., Wu, X., Guo, Q., et al. Mastering complex control in MOBA games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 6672-6679, 2020.

Zhuang, Z., Lei, K., Liu, J., Wang, D., and Guo, Y. Behavior proximal policy optimization. arXiv preprint arXiv:2302.11312, 2023.
A. Atari 2600

Figure 8. Final performance improvement (%) of SPO compared to PPO across 35 games in the Atari 2600 environment, using the default CNN as the encoder (one bar per game, from Hero to Zaxxon).

B. Hyperparameters

Table 2. Detailed hyperparameters used in SPO.

| Hyperparameter | Atari 2600 (Bellemare et al., 2013) | MuJoCo (Todorov et al., 2012) |
|---|---|---|
| Number of workers | 8 | 8 |
| Horizon | 128 | 256 |
| Learning rate | 0.00025 | 0.0003 |
| Learning rate decay | Linear | Linear |
| Optimizer | Adam | Adam |
| Total steps | 10M | 10M |
| Batch size | 1024 | 2048 |
| Update epochs | 4 | 10 |
| Mini-batches | 4 | 4 |
| Mini-batch size | 256 | 512 |
| GAE parameter λ | 0.95 | 0.95 |
| Discount factor γ | 0.99 | 0.99 |
| Value loss coefficient c1 | 0.5 | 0.5 |
| Entropy loss coefficient c2 | 0.01 | 0.0 |
| Probability ratio hyperparameter ϵ | 0.2 | 0.2 |

C. More Results

Figure 9. Aggregate metrics on MuJoCo-v4 with 95% CIs based on 6 environments, comparing PPO and SPO with different policy network layers (3 or 7) and mini-batch sizes (64 or 512) using the PPO-normalized score (median, IQM, mean, and optimality gap).

Figure 10. Training performance on Atari 2600 using ResNet-18 as the encoder, with a fixed learning rate of 0.0001 (games include Atlantis, BattleZone, ChopperCommand, CrazyClimber, DemonAttack, ElevatorAction, FishingDerby, JourneyEscape, KungFuMaster, MontezumaRevenge, NameThisGame, PrivateEye, RoadRunner, SpaceInvaders, StarGunner, VideoPinball, and WizardOfWor). The mean and standard deviation are shown across 3 random seeds.