# reflective_policy_optimization__9b406261.pdf Reflective Policy Optimization Yaozhong Gan * 1 Renye Yan * 1 Zhe Wu 1 Junliang Xing 1 On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that amalgamates past and future stateaction information for policy optimization. This approach empowers the agent for introspection, allowing modifications to its actions within the current state. Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO s feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. The source code of this work is available at https: //github.com/Edgargan/RPO. 1. Introduction On-policy reinforcement learning (RL) aims to learn an optimal mapping from a sequence of states to actions based on rewarding criteria acquired through trajectories generated by interacting with the underlying environment. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is one of the most typical algorithms in this category, owing to its simplicity and effectiveness. It has been successfully applied in various domains, including Atari games (Mnih et al., 2015), continuous control tasks (Dhariwal et al., 2017), and robot control (Lillicrap et al., 2016). However, existing algorithms optimize the policy based on a state-action pair and do not directly consider the impact of subsequent states and actions in the trajectory. This limitation inevitably gives rise to the sample inefficiency problem. In prior studies (Mnih et al., 2015; van Hasselt et al., 2016; *Equal contribution 1Qi Yuan Lab. Correspondence to: Junliang Xing . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). Schulman et al., 2015; 2017; Haarnoja et al., 2018; Silver et al., 2014; Fujimoto et al., 2018), the prevalent approach involves optimizing the policy using the value function of the current state. The value function potentially contains information about the subsequent data. However, a pertinent question emerges: Is optimizing a policy solely based on the value function the fastest (optimal) path to convergence? The answer is no, as this approach may overlook other crucial factors. To illustrate this answer, consider an environment with a cliff . If an agent takes an action leading to falling off the cliff under a state, it must learn to avoid the action and the associated state. Returning to this state is perilous and could potentially trigger the same action. Therefore, the agent must actively avoid this state to enhance safety. The preceding action leading to this state must also be avoided, anticipating the possibility of re-entering that state again. A similar scenario unfolds in a treasure environment, where an agent performs an action resulting in a large reward. Subsequent data imparts positive and negative insights into previous states and actions. Therefore, optimizing the previous action should incorporate information from subsequent state-action pairs, not relying solely on the value function. Intuitively, leveraging subsequent data directly can expedite algorithm convergence and enhance sample efficiency. 
Unfortunately, most existing algorithms lack this capability, which directly exploits the relationship between pairs of trajectory data for policy optimization. To address the above issues with better sample efficiency, we introduce a new on-policy algorithm that optimizes the current policy by explicitly considering the relationship between the previous and subsequent state-action pairs in the sampled trajectories. Specifically, the proposed algorithm evaluates the current state-action pair and the impact of the subsequent pair of trajectories. It provides a more comprehensive perspective than traditional value function-based policy optimization. This approach enables the optimized policy to adjust its actions based on positive and negative information from subsequent states, thereby effectively reflecting the policy. We thus name the proposed algorithm as Reflective Policy Optimization (RPO). The RPO algorithm, as proposed, directly focuses on policy optimization rather than solely on evaluating the value Reflective Policy Optimization function. This distinction separates it from multi-step reinforcement learning methods (De Asis et al., 2018; Duan & Wainwright, 2023; Hernandez-Garcia & Sutton, 2019). Our proposed algorithm takes a direct approach by incorporating previous and subsequent trajectory information for policy optimization, establishing more clearly theoretical properties. We also derive a novel policy improvement lower bound, illustrating that, in addition to ensuring the desirable property of monotonic performance improvement, our method effectively reduces the solution space of the optimized policy, thus significantly accelerating the algorithm s convergence procedure. Furthermore, our method improves sample efficiency. We incorporate our proposed algorithm with the PPO s clipping mechanism (Schulman et al., 2017) to provide a practical implementation. Following standard settings, we validate the effectiveness of our algorithm by utilizing an illustrative toy example, shedding light on the underlying working mechanism of RPO. Additionally, we showcase superior performance on widely recognized RL benchmarks, such as Mu Jo Co (Todorov et al., 2012) and Atari games (Brockman et al., 2016). 2. Preliminaries 2.1. Markov Decision Process Commonly, the reinforcement learning problem can be modeled as a Markov Decision Process (MDP), which is described by the tuple S, A, P, R, γ (Sutton & Barto, 1998). S and A are the state space and action space respectively. The function P(s |s, a) : S A S 7 [0, 1] is the transition probability function from state s to state s under action a. The function R(s, a) : S A 7 R is the reward function. And γ [0, 1) is the discount factor for longhorizon returns. In a state s, the agent performs an action a according to a stochastic policy π : S A 7 [0, 1] (satisfies P a π(a|s) = 1). The environment returns a reward R(s, a) and a new state s according to the transition function P(s |s, a). The agent interacts with the MDP to give a trajectory τ of states, actions, and rewards: s0, a0, R(s0, a0), , st, at, R(st, at), over S A R (Silver et al., 2014). Given a policy π, under a state st and a action at, the state-action value function and state-value function are defined as Qπ(st, at) = Eτ π[Gt|st, at], V π(st) = Eτ π[Gt|st], where Gt = P i=0 γi Rt+i is the discount return, and Rt = R(st, at). It is clear that V π(st) = Eat πQπ(st, at). Correspondingly, advantage function can be represented Aπ(s, a) = Qπ(s, a) V π(s). We know that P a π(a|s)Aπ(s, a) = 0. 
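To make these definitions concrete, the following minimal Python sketch (illustrative names, not the paper's code base) computes the discounted return G_t and a simple Monte Carlo estimate of the advantage A^π(s_t, a_t) ≈ G_t − V(s_t) along one sampled episode, given value predictions from some critic.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_i gamma^i * R_{t+i}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mc_advantages(rewards, values, gamma=0.99):
    """Monte Carlo advantage estimate: A(s_t, a_t) ~ G_t - V(s_t)."""
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)

# Example: a 4-step episode with value predictions from an arbitrary critic.
adv = mc_advantages([1.0, 0.0, 0.0, 5.0], [2.0, 1.5, 1.0, 4.0], gamma=0.9)
```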
Let ρπ be a normalized discount state visitation distribution, ρπ(s) = (1 γ) t=0 γt P(st = s|ρ0, π), where ρ0 is the initial state distribution (Kakade & Langford, 2002). Similarly, ρπ( |s, a) can be defined and denotes the conditional visitation distribution under state s and action a. And the normalized discount state-action visitation distribution can be represented ρπ(s, a) = ρπ(s)π(a|s). We make it clear from the context whether ρπ refers to the state or state-action distribution. The goal is to learn a policy that maximizes the expected total discounted reward η(π), defined η(π) = Eτ π i=0 γi R(si, ai) The following identity indicates that the distance between the policy performance of π and ˆπ is related to the advantage over π (Kakade & Langford, 2002): η(π) = η(ˆπ) + 1 1 γ Es,a ρπ Aˆπ(s, a) . (1) Some admirable algorithms obtain good properties by modifying the right-hand side of Eqn. (1), for example, Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015) optimizes the lower bound of policy improvement by replacing ρπ with ρˆπ under state s, and offers better theoretical properties, i.e. monotonic improvement of policy improvement. 3. The Generalized Surrogate Function In this section, we establish a recurrence form by providing the equation relationships before and after the replacement of TRPO. Further, we reach general conclusions by extending TRPO with subsequent state-action pairs. Lemma 3.1. Consider a current policy ˆπ, and any policies π, we have Es,a ρπAˆπ(s, a) Es ρˆπ,a πAˆπ(s, a) = γ 1 γ Es,a ρˆπ[π(a|s) ˆπ(a|s) 1]Es ,a ρπ( |s,a)Aˆπ(s , a ). The proof of this lemma is given in Appendix A.4. Note that from this lemma, the difference between the original formula and the replaced one is relevant to the normalized discount subsequent state-action visitation distribution ρπ( |s, a). By the boundary of the right-hand side of the equation, it is easy to obtain Theorem 1 of the paper (Schulman et al., 2015) and Theorem 1 of the paper (Achiam et al., 2017). From this lemma, we constructed a relationship between the current visitation distributions (s, a) ρπ( ) and the next (s , a ) ρπ( |s, a). Reflective Policy Optimization Theorem 3.2. Consider a current policy ˆπ, and any policies π, we have η(π) = η(ˆπ) + i=0 αi Li(π, ˆπ) + βk Gk(π, ˆπ), (2) Li(π, ˆπ)= E s0,a0 ρˆπ( ) si 1,ai 1 ρˆπ( |si 2,ai 2) t=0 (rt 1) li(π, ˆπ), Gk(π, ˆπ)= E s0,a0 ρˆπ( ) sk 1,ak 1 ρˆπ( |sk 2,ak 2) t=0 (rt 1) gk(π, ˆπ), li(π, ˆπ) = Esi ρˆπ( |si 1,ai 1),ai π( |si)Aˆπ(si, ai), gk(π, ˆπ) = Esk,ak ρπ( |sk 1,ak 1)Aˆπ(sk, ak), rt = π(at|st) ˆπ(at|st), αi = γi (1 γ)i+1 , βk = γk We define that L0(π, ˆπ) = Es0,a0 ρˆπ( )r0Aˆπ(s0, a0), G1(π, ˆπ)=Es0,a0 ρˆπ( );s1,a1 ρπ( |s0,a0)(r0 1)Aˆπ(s1, a1) and r0 = π(a0|s0) The proof of this theorem is given in Appendix A.5. This theorem gives a general form for the difference between the policy performance of π and ˆπ by finite sums. This equation accurately represents the general gap between the performance of π and ˆπ from the trajectory-based. It portrays that subsequent state-action pairs can also directly impact optimizing the current policy. We refer to Pk 1 i=0 αi Li(π, ˆπ) as the generalized surrogate objective function. A slight problem may exist if the generalized surrogate objective function is directly optimized. Consider L1(π, ˆπ) in Eqn. (2) as an example. We consider this function without delving into the specific form of the parameters. When the environment is unknown, it can only be optimized by sampling. 
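As an illustration of what such a sample-based estimate looks like (a minimal sketch with illustrative tensor names, not the authors' implementation), the i = 1 term of Eqn. (2) evaluated on one sampled pair of consecutive transitions has the form (r_0 − 1) r_1 A^π̂(s_1, a_1):

```python
import torch

def l1_single_sample(logp0_new, logp0_old, logp1_new, logp1_old, adv1):
    """Single-sample estimate of the L_1 term in Eqn. (2):
    (r_0 - 1) * r_1 * A(s_1, a_1), where r_t = pi(a_t|s_t) / pi_hat(a_t|s_t).
    Inputs are log-probability tensors under the new and old policies and the
    advantage of the subsequent pair under the old policy."""
    r0 = torch.exp(logp0_new - logp0_old)  # ratio at (s_0, a_0)
    r1 = torch.exp(logp1_new - logp1_old)  # ratio at (s_1, a_1)
    return (r0 - 1.0) * r1 * adv1
```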
Considering a special case, the function L1(π, ˆπ) is optimized by using a sample (s0, a0, s1, a1), i.e., L1(π, ˆπ) (r0 1)r1Aˆπ(s1, a1). If Aˆπ(s1, a1) < 0 and r0 1 < 0, it follows that (r0 1)r1Aˆπ(s1, a1) = [(r0 1)Aˆπ(s1, a1)]r1 > 0. This implies an increase in the probability of a1. However, when Aˆπ(s1, a1) < 0, we should decrease the probability of a1. It s a contradiction. Thus, the term 1 in r0 1 may be incorrectly misleading for policy optimization despite the soundness of the theory. This situation exists when the environment is unknown. Next, we measure the gap between the policy performance η(π) and Pk 1 i=0 αi Li(π, ˆπ). Corollary 3.3. According to the definition of Gk, we have |βk Gk(π, ˆπ)| γk (1 γ)k+2 ϵk+1Rmax, where ϵ π ˆπ 1 = maxs P a |π(a|s) ˆπ(a|s)| and Rmax maxs,a |R(s, a)|. The proof of the theorem is given in the Appendix A.6. Based on Theorem 3.2 and Corollary 3.3, a general lower bound exists for the policy performance of π. This theory makes good theoretical sense, which helps to understand the generalized surrogate function. For the case where k = 1, the l1 norm constraints are replaced by KL constraints. This outcome aligns with the lower bound observed in TRPO. 4. Reflective Policy Optimization Theoretically, the previous section gave a lower bound for the policy performance of π. Although the generalized surrogate function incorporates the current and subsequent state-action pairs of the trajectory, the inclusion of the term 1 in the function Li(π, ˆπ) introduces ambiguity regarding how the subsequent pairs influence the behavior of the policy at the current state, potentially yielding positive or adverse effects. In this section, we have made slight modifications to the generalized surrogate function Li(π, ˆπ) in Eqn. (2), defined ˆLi(π, ˆπ) = E s0,a0 ρˆπ( ) si,ai ρˆπ( |si 1,ai 1) t=0 rt Aˆπ(si, ai). (3) It is a natural modification that avoids the problems caused by the 1 item. The following theorem measures the relationship between the function ˆLi(π, ˆπ) and η(π). Theorem 4.1. Consider a current policy ˆπ, and any policies π, we have i=0 αi ˆLi(π, ˆπ) ˆCk(π, ˆπ), (4) ˆCk(π, ˆπ)=γRmax π ˆπ 1 i=1 αi+ γk Rmax (1 γ)k+2 π ˆπ 2 1, ˆLi(π, ˆπ) is defined in Eqn. (3) and αi = γi (1 γ)i+1 . We define that P0 i=1 αi = 0 for k = 1. The proof of this theorem is given in Appendix A.8. From Theorem 4.1, the first term of the generalized lower bound is referred to as the new generalized surrogate function, while the second term is known as the penalty term. Reflective Policy Optimization It is worth noting that TRPO (Schulman et al., 2015) is a special case of the generalized lower bound for k = 1. By optimizing the generalized lower bound, we can get a monotonically improving sequence of policies {πi} i=0, satisfy η(π0) η(π1) . Next, we intuitively analyze the new generalized surrogate function. The difference between the function Li(π, ˆπ) and ˆLi(π, ˆπ) is very small, involving the removal of the number 1 from the ratios product. However, their intended meanings are quite distinct. The function ˆLi(π, ˆπ) can directly utilize the information between the current and subsequent state-action pairs to optimize the current policy. Optimizing this function does not encounter the issues discussed in Section 3. With k = 2, we will explain the optimization procedure in detail. The function ˆL1(π, ˆπ) contains the ratio of the pair (s, a) and (s , a ). 
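As a minimal single-sample sketch of this modified term (illustrative names; the only change from the previous sketch is that the −1 is dropped, so the estimate becomes r_0 r_1 A^π̂(s_1, a_1)):

```python
import torch

def l1_hat_single_sample(logp0_new, logp0_old, logp1_new, logp1_old, adv1):
    """Single-sample estimate of the modified term in Eqn. (3) for i = 1:
    r_0 * r_1 * A(s_1, a_1), with r_t = pi(a_t|s_t) / pi_hat(a_t|s_t).
    The product of ratios ties the update at (s_0, a_0) directly to the
    advantage of the subsequent pair (s_1, a_1)."""
    r0 = torch.exp(logp0_new - logp0_old)
    r1 = torch.exp(logp1_new - logp1_old)
    return r0 * r1 * adv1
```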
If Aˆπ(s , a ) > 0, it indicates that the action a is deemed favorable, and its probability will be increased through algorithm optimization. Simultaneously, the state s is likely fine as well. To return to this state, we should increase the probability of the action a under state s. In contrast, if Aˆπ(s , a ) < 0, similar results will be obtained. The agent can reflect on its current behavior based on subsequent information. For ˆL0(π, ˆπ), the action s a probability can be optimizing using the advantage function Aˆπ(s, a). Therefore, optimizing the current action a will be influenced by both the current and subsequent advantage functions Aˆπ, taking them into account. In this way, the optimized policy will likely foster the agent s reflection, and we observe that optimizing the generalized surrogate function lacks this ability. By utilizing the same trajectory, the agent can acquire more information. Figure 1 depicts an experiment conducted in the Cliff Walking environment. The results in Figure 1 indicate that optimizing the new surrogate function reduces the number of falling off the Cliff and faster after reaching the goal G. In the experimental section, we explain this phenomenon in detail. Theorem 4.1 demonstrates that the generalized lower bound is optimized for any k. As k increases, the generalized lower bound is optimized using subsequent samples to learn implicit relationships of the current and subsequent states and actions data. However, is it suitable when k takes a large value? The answer is no. Let s look at the ˆLk(π, ˆπ) function individually. This objective function comprises the product of the k ratios and an advantage function. If the ratio is too high, it encounters the problem of high variance (Munos et al., 2016), which, in turn, impacts the algorithm s stability. Given this weakness, a very large value of k may not be practical. In the experimental section 5.1, we discuss the values of k and observe that as long as the agent leverages the relationship between previous and subsequent state-action pairs, it enables the agent to fall into the Cliff less often and to reach the goal G faster. The experimental results show similarity whether k = 2 or k = 3. Therefore, the main part of the following discussion is framed in terms of k = 2. The following theorem shows that the modified generalized surrogate function has another nice property except for the monotonicity. Theorem 4.2. For k = 2, defined two sets Ψ1 = µ | α0 ˆL0(µ, ˆπ) ˆC1(µ, ˆπ) 0, µ ˆπ 1 1 Ψ2 = n µ | α0 ˆL0(µ, ˆπ) + α1 ˆL1(µ, ˆπ) ˆC2(µ, ˆπ) 0, then we have The proof of the theorem is given in Appendix A.9. Note that when the old and new policies do not change much, the set Ψ1 is a solution space of TRPO, and the set Ψk corresponds to the solution space of the k-th generalized lower bound. The theorem 4.2 shows that the scale of the solution space of the policy is reduced when k = 2. Under certain conditions, the optimal policy π belongs to the set Ψ1 as well as to Ψ2. Reducing the solution space is possibly more efficient in finding a good policy, and therefore, it is intuitive that the algorithm s convergence procedure can be accelerated. Furthermore, it can improve sample efficiency. Similarly, we can define the solution space of k = 3, 4, and use the same way to get Ψ1 Ψ2 Ψ3 Ψ4 . Note that π is in those sets. It reveals the benefits of using current and subsequent states and actions of trajectory data to optimize the policy. This result provides a promising theoretical basis for our algorithm. 4.1. 
The Clipped Generalized Surrogate Objective

In the previous subsection, the generalized lower bound consisted of the generalized surrogate function and a penalty term. This lower bound could be optimized as in TRPO, using a linear approximation of the surrogate objective and a quadratic approximation of the penalty term. However, this requires computing the inverse of the matrix arising from the quadratic approximation, and the generalized lower bound additionally couples preceding and subsequent state-action pairs, so solving it exactly is impractical. Inspired by the PPO (Schulman et al., 2017) algorithm, we propose a new clipped surrogate objective based on Eqn. (4). When $k = 1$, for $\hat{L}_0(\pi, \hat{\pi})$, we use PPO's objective function:

$$\hat{L}^{clip}_0(\pi, \hat{\pi}) = \mathbb{E}_{(s,a)}\Big[\min\big(r(a|s)\, A^{\hat{\pi}}(s, a),\ \mathrm{clip}(r(a|s), 1-\epsilon, 1+\epsilon)\, A^{\hat{\pi}}(s, a)\big)\Big], \quad (5)$$

where $r(a|s) = \pi(a|s)/\hat{\pi}(a|s)$, $\epsilon$ is a hyperparameter, and we omit the distribution of the random variables $(s, a)$. When $k = 2$, for $\hat{L}_1(\pi, \hat{\pi})$, we slightly modify the clipping mechanism:

$$\hat{L}^{clip}_1(\pi, \hat{\pi}) = \mathbb{E}_{(s,a,s',a')}\Big[\min\big(r(a|s)\, r(a'|s')\, A^{\hat{\pi}}(s', a'),\ C(r, r')\, A^{\hat{\pi}}(s', a')\big)\Big], \quad (6)$$

where $C(r, r') = \mathrm{clip}(r(a|s), 1-\epsilon, 1+\epsilon)\cdot \mathrm{clip}(r(a'|s'), 1-\epsilon_1, 1+\epsilon_1)$, $r(a|s) = \pi(a|s)/\hat{\pi}(a|s)$, $r(a'|s') = \pi(a'|s')/\hat{\pi}(a'|s')$, $\epsilon$ and $\epsilon_1$ are hyperparameters, and we again omit the distribution of the random variables $(s, a, s', a')$.

In Eqn. (6), we clip each ratio separately rather than clipping the product as a whole. If the ratio $r(a|s)$ is large and the ratio $r(a'|s')$ is small, their product may still fall between $1-\epsilon$ and $1+\epsilon$; if only the product were clipped, the policy would continue to be optimized, and the result could improve or worsen in a way we cannot control. Clipping each ratio separately is therefore the more reasonable choice: it constrains the variance of the ratios and makes the training procedure more stable. In practice, we find that $\epsilon_1$ should not be too large; it is better to set it slightly smaller than or equal to $\epsilon$. We want the subsequent state-action information to assist in optimizing the current policy while avoiding abrupt changes between the old and new policies, which keeps the training procedure stable. For $k > 2$, the functions $\hat{L}_k(\pi, \hat{\pi})$ can be clipped with the same mechanism.

Therefore, for the generalized lower bound function, we present a more practical version of the algorithm. Combining Eqn. (5) and Eqn. (6), we obtain the Reflective Policy Optimization (RPO) algorithm, a practical variant of the generalized surrogate objective function:

$$\hat{L}(\pi, \pi_t) = \hat{L}^{clip}_0(\pi, \pi_t) + \beta\, \hat{L}^{clip}_1(\pi, \pi_t), \quad (7)$$

where $\hat{L}^{clip}_0(\pi, \pi_t)$ is defined in Eqn. (5), $\hat{L}^{clip}_1(\pi, \pi_t)$ is defined in Eqn. (6), and $\beta > 0$.

Algorithm 1 Reflective Policy Optimization (RPO)
Input: environment $E$, discount factor $\gamma$, batch size $n$, clipping parameters $\epsilon$ and $\epsilon_1$, learning rate $\alpha$, weighting parameter $\beta$. Initialize the policy network parameters $\theta$.
for $t = 0, 1, 2, \dots$ do
  Collect data: collect $n$ samples with $\pi_t$ on environment $E$.
  Estimate policy objective: estimate the on-policy advantage $A^{\pi_t}$ with GAE, then estimate the empirical objectives $\hat{L}^{clip}_0(\pi_\theta, \pi_t)$ and $\hat{L}^{clip}_1(\pi_\theta, \pi_t)$ from Eqn. (5) and Eqn. (6).
  Full objective: $\hat{L}(\pi_\theta) \leftarrow \hat{L}^{clip}_0(\pi_\theta, \pi_t) + \beta\, \hat{L}^{clip}_1(\pi_\theta, \pi_t)$.
  Update policy network: $\theta \leftarrow \theta + \alpha \nabla_\theta \hat{L}(\pi_\theta)$.
end for
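For concreteness, a minimal PyTorch-style sketch of Eqns. (5)-(7) with the separate clipping of the two ratios (illustrative tensor names; inputs are per-sample tensors for each current pair and its immediate successor pair; this is a sketch under those assumptions, not the released implementation):

```python
import torch

def rpo_loss(logp_new, logp_old, adv,                    # current pairs (s, a)
             logp_new_next, logp_old_next, adv_next,     # subsequent pairs (s', a')
             eps=0.2, eps1=0.1, beta=0.3):
    """Sketch of the RPO objective in Eqn. (7): the PPO clipped term (Eqn. 5)
    plus a beta-weighted two-step term in which each ratio is clipped
    separately (Eqn. 6). Returned negated so it can be minimized."""
    r = torch.exp(logp_new - logp_old)
    r_next = torch.exp(logp_new_next - logp_old_next)

    # Eqn. (5): standard PPO clipped surrogate on (s, a).
    l0 = torch.min(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv)

    # Eqn. (6): product of the two ratios times A(s', a'), with each ratio
    # clipped on its own rather than clipping the product as a whole.
    clipped_prod = torch.clamp(r, 1 - eps, 1 + eps) * \
                   torch.clamp(r_next, 1 - eps1, 1 + eps1)
    l1 = torch.min(r * r_next * adv_next, clipped_prod * adv_next)

    return -(l0 + beta * l1).mean()
```

In a loop like Algorithm 1, the old log-probabilities would be those recorded when the batch was collected with π_t, and the advantages would come from the GAE estimates of A^{π_t}.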
By choosing the parameter β, this parameter plays a role in weighting the use of subsequent state-action pair information. Eqn. (7) is the optimization objective function for the t-th update. This paper s optimization of the value function is the same as that of PPO. Algorithm 1 shows the detailed implementation pipeline. The RPO algorithm is divided into three steps in each iteration: collect samples, estimate policy objectives, and update the network. It can be seen that our proposed method is also an on-policy algorithm. Discuss with multi-step RL Multi-step reinforcement learning (RL) is a set of methods that aim to adjust the tradeoff of utilization between the knowledge of the current and future return. Recent advances in multi-step RL have demonstrated remarkable empirical success (Wu et al., 2023; Tang et al., 2022). This approach does not directly optimize the current policy but is based on the value function estimated in multiple steps. It is difficult to see directly what role multi-step RL plays in the policy optimization procedure. However, the approach proposed in this paper is viewed from a multi-step perspective: subsequent state-action pairs directly affect policy optimization. It has a direct effect on the actions of the agent and has better theoretical properties. Therefore, our proposed method is fundamentally different from traditional multi-step RL. Discuss with Tay PO The surrogate objective function of the Tay PO algorithm (Tang et al., 2020) is denoted as Li(π, ˆπ) in Eqn. (2). As discussed in Section 3, their algorithm includes a 1 term, which encounters the same problem. We have observed experimentally (refer to Figure 5 of the appendix) that the 1 term will severely damage the performance of their algorithm. Furthermore, we establish an equality relationship between η(π) and the generalized surrogate function. Compared to their method, we give a tighter lower bound (please refer to Appendix A.7). 5. Experiments To verify the effectiveness of the proposed RPO algorithm, we utilize several continuous and discrete environments from the Mu Jo Co (Todorov et al., 2012) and Atari games in Open AI Gym (Brockman et al., 2016) extensively adopted in previous works. We conducted experiments with the Mujoco environment based on code from (Queeney et al., 2021) and Atari games based on code from (Zhang, 2018). Reflective Policy Optimization S The Cliff G (a) Cliff Walking Cliff Number PPO RPO RPO-3 0 500 1000 1500 2000 Training Episode Episode Length PPO RPO RPO-3 Figure 1. (a) is a Cliff Walking environment. (b) represents the total number of times the agent fell into the Cliff during the training procedure. (c) represents the agent s steps to reach the goal G during the training procedure. RPO-3 means that when k = 3, the algorithm uses three ratios. 0.0 0.2 0.4 0.6 0.8 1.0 Timesteps (Million) Average Return TRPO PPO RPO (a) Half Cheetah 0.0 0.2 0.4 0.6 0.8 1.0 Timesteps (Million) Average Return 0.0 0.2 0.4 0.6 0.8 1.0 Timesteps (Million) Average Return (c) Walker2d 0.0 0.2 0.4 0.6 0.8 1.0 Timesteps (Million) Average Return (d) Swimmer 0 1 2 3 4 5 Timesteps (Million) Average Return (e) Humanoid 0.0 0.2 0.4 0.6 0.8 1.0 Timesteps (Million) Average Return (f) Reacher Figure 2. Learning curves on the Gym environments. Performance of RPO vs. PPO. The shaded region indicates the standard deviation of ten random seeds. The X-axis represents the timesteps in the environment. The Y-axis represents the average return. 5.1. 
Visual Validation Experiment

To demonstrate the effectiveness of the Reflective Mechanism of RPO, we conducted visual validation experiments in the Cliff Walking environment. Cliff Walking is a classic setting widely used for visualizing the behavior of reinforcement learning algorithms. As shown in Figure 1 (a), it is a grid world in which the agent starts from S and moves through several grid cells to reach the goal G while avoiding falling off the cliff. Figure 1 illustrates the overall performance of RPO and its baseline during training, focusing on the frequency of falling off the cliff and the number of interaction steps, which together validate the advantages of RPO's Reflective Mechanism. As shown in Figure 1 (b), RPO significantly reduces the frequency of falling off the cliff under equal iteration conditions. This attests to the efficiency of the Reflective Mechanism: it capitalizes on previous interaction experiences and substantially reduces the rate of poor decisions. Figure 1 (c) reveals that as the number of interactions increases, RPO markedly cuts down the interaction steps per episode, which further confirms the benefit of directly utilizing subsequent data, i.e., the ability to reflect on the action taken in a state and thereby obtain greater rewards.

In addition, we further discuss the selection of k. When k = 3, we construct the generalized surrogate objective function using a clipping mechanism similar to Eqn. (6). From Figure 1, the performance of k = 3 is only slightly better than that of k = 2. Considering simplicity and the complexity of policy optimization, the case of k = 2 is used in the main experiments below.

Figure 3. Panels (a) and (b) show the performance of RPO vs. RPO-3 (k = 3, i.e., three ratios) on HalfCheetah and Swimmer; panels (c) and (d) show the performance of RPO vs. RPO-clip(r1r2) (the two ratios are clipped together) on HalfCheetah and Swimmer.

Table 1. Mean return over ten random seeds for different methods on MuJoCo. The best results are highlighted in bold.

            HalfCheetah  Hopper  Reacher  Walker2d  Swimmer  Humanoid
    TRPO    2862         1996    -5.3     3198      135      5425
    PPO     2408         2795    -5.2     3643      134      5608
    ISPO    1205         1984    -6.7     2275      71       2736
    TayPO   -137         29      -66.5    5.7       -22      3717
    GePPO   3153         3170    -8.1     3690      157      4100
    OTRPO   3075         1597    -10.7    2435      122      1768
    RPO     3495         3262    -4.9     4025      157      5793

The successful implementation of the RPO mechanism is attributed to its comprehensive analysis of consecutive state-action pairs. RPO distinguishes itself from most existing algorithms by integrating the strengths of current and subsequent data: it efficiently exploits good experiences, adjusts based on bad ones, and more accurately accounts for the development of future states, a comprehensive feature that current peer algorithms do not have.

5.2. Main Experimental Analysis

The Cliff Walking environment above is especially conducive to showcasing RPO's Reflective Mechanism and serves as an auxiliary validation.
In this subsection, we conducted experiments in continuous and discrete action space environments to validate RPO s extensive effectiveness and universal adaptability in reinforcement learning scenarios. We conducted a detailed comparative analysis with six related algorithms in the field (TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), Ge PPO (Queeney et al., 2021), OTRPO (Meng et al., 2022), Tay Po (Tang et al., 2020) and ISPO (Tomczak et al., 2019)) in six major experimental environments of Mu Jo Co (Todorov et al., 2012). The results (as shown in Table 1 and Figure 5 of the appendix) indicate that RPO consistently outperforms in all environments. In the continuous environments, we use the same hyperparameters ϵ1 = 0.1 and β = 0.3 in all environments. From Figure 2 and Figure 5 of the appendix, RPO surpasses classic on-policy reinforcement learning algorithms PPO and TRPO not only in terms of average return but also in convergence speed. This improvement is attributed to RPO s incorporation of current and subsequent data strengths. RPO also exhibits significant advantages compared to the related off-policy algorithms OTRPO and Ge PPO. This shows that even if off-policy data are not used, better performance can be achieved by using only the data of the current policy itself. The Tay PO s performance is not good because this method faces the problems discussed in Section 3. In the discrete Atari environments, the results are averaged over three seeds during 50M timesteps. We run our experiments across three seeds with fair evaluation metrics. We use the same hyperparameters ϵ1 = 0.1 and β = 3.0 in all environments. The normalized improvement (N.I.) of the RPO final score a w.r.t. the final score of a baseline PPO b is a b b r , r is the random score (Vieillard et al., 2020). From Figure 4, we compute the N.I./PPO in each environment1. It demonstrates that RPO performs better than the PPO algorithm in most environments. We also calculated an average performance improvement of at least 70%. Please refer to the appendix for a detailed curve (refer to Figure 7) and table results (refer to Table 2). The exceptional performance of RPO is rooted in its unique reflective mechanism that facilitates the efficient utilization of both positive and negative experiences. By employing short trajectories composed of two consecutive states for learning and decision-making, a more profound reflection 1For the Venture environment, since the final performance of PPO and random are both zero, whereas our method attains a final performance of 659 points. The calculated N.I. = , and we clip the N.I. to 12 for graphical representation. Reflective Policy Optimization Space Invaders Star Gunner Wizard Of Wor Private Eye Journey Escape Fishing Derby Crazy Climber Road Runner Double Dunk Demon Attack Battle Zone Yars Revenge Name This Game Kung Fu Master Video Pinball Elevator Action Chopper Command Average improvement: 74.3%; Median improvement: 15.3%; Improved games: 41/54. Figure 4. Normalized Improvement of RPO vs. PPO in 54 Atari 2600 games. and utilization of experience is achieved. This approach has the following benefits: it enables the effective use of interaction experiences from adjacent states. Adopting this pair-wise state combination for short trajectory inputs minimizes the method s computational and storage overhead while maximizing the retention of relevance , promoting a deeper utilization of experience and Reflective mechanism. 
The analysis above shows that RPO exhibits significant advantages in various aspects, especially in convergence speed and average return, compared to other algorithms. These empirical findings underscore the efficiency and applicability of the RPO algorithm in both complex continuous and discrete action space environments. 5.3. Ablation Studies We conduct ablation experiments to evaluate several principal elements that might influence the RPO s performance. Initially, we conducted an ablation study on the number of states-action pairs, as shown in figures (a) and (b) of Figure 3. When k = 3, we fine-tuned the hyperparameters of the algorithm RPO-3 (three ratios). The results reveal that RPO3 s performance is slightly better than RPO s (two ratios), but the cost of fine-tuning the hyperparameters increases. Considering the number of parameters and the algorithm s simplicity, we think the ratios in RPO need not exceed two, as two states suffice for effective reflection. Secondly, we conducted ablation experiments on whether the two ratios were clipped together or not, and by comparing it with the RPO-clip(r1r2) algorithm (this means that the two ratios are clipped together), as shown in figure (c) and (d) of Figure 3. The RPO-clip(r1r2) may not achieve better performance in some environments, and it may be that two ratios clipped together will face instability. And the RPO method exhibits greater performance, as indicated by the results in Figure 3. Finally, we conducted ablation experiments on the clipping and weighting coefficients (see Figure 6 in the appendix). Adjusting these two clipping parameters aims to maintain policy stability and address the risk associated with current methods that solely apply clipping to the product. This approach can avoid abrupt changes between old and new policies. We observed that the smaller clipping values yield more significant improvements under a fixed weighting coefficient. Furthermore, our findings suggest that reducing the weighting coefficient alongside certain clipping levels positively influences the results within an acceptable range. It facilitates deep experiential learning from new short trajectories formed by preceding and succeeding states, promoting stable and enhanced performance and accelerating the algorithm s convergence rate. 6. Conclusive Remarks This paper proposes a simple on-policy algorithm called Reflective Policy Optimization (RPO). This method aims to combine before and after state and action information of the trajectory data to optimize the current policy, thus allowing the agent to reflect on and modify the action of the current state to some extent. Furthermore, theoretical analyses show that our proposed method, in addition to satisfying the desirable property of the monotonic improvement of policy performance, can effectively reduce the optimized policy s solution space, speeding up the algorithm s convergence procedure. We verify the feasibility and effectiveness of the proposed method by a toy example and achieve better performance on RL benchmarks. In future work, since RPO optimizes the policy by leveraging information from past and future pairs, it is straightforward to integrate it into other Actor-Critic algorithms or maximum entropy methods, further exploiting its advantages. We anticipate that RPO will play a good role in developing new reinforcement learning algorithms. Reflective Policy Optimization Acknowledgments We thank the anonymous ICML reviewers for their insightful and constructive comments. Dr. 
Junliang Xing was partly supported by the Natural Science Foundation of China under Grant No. 62222606 and 62076238. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International conference on machine learning, pp. 22 31, 2017. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Open AI Gym. ar Xiv preprint ar Xiv:1606.01540, 2016. De Asis, K., Hernandez-Garcia, J., Holland, G., and Sutton, R. Multi-step reinforcement learning: A unifying algorithm. In AAAI Conference on Artificial Intelligence, volume 32, 2018. Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. Open AI baselines. https://github. com/openai/baselines, 2017. Duan, Y. and Wainwright, M. J. A finite-sample analysis of multi-step temporal difference estimates. In Learning for Dynamics and Control Conference, pp. 612 624, 2023. Finner, H. A generalization of holder s inequality and some probability inequalities. The Annals of Probability, pp. 1893 1901, 1992. Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1582 1591, 2018. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1856 1865, 2018. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In AAAI conference on artificial intelligence, pp. 3207 3214, 2018. Hernandez-Garcia, J. F. and Sutton, R. S. Understanding multi-step deep reinforcement learning: A systematic study of the dqn target. ar Xiv preprint ar Xiv:1901.07510, 2019. Kakade, S. M. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pp. 267 274, 2002. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016. Meng, W., Zheng, Q., Shi, Y., and Pan, G. An off-policy trust region policy optimization method with monotonic improvement guarantee for deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 33(5):2223 2235, 2022. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. volume 518, pp. 529 533, 2015. Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1046 1054, 2016. Queeney, J., Paschalidis, Y., and Cassandras, C. G. Generalized proximal policy optimization with sample reuse. In Advances in Neural Information Processing Systems, pp. 
11909 11919, 2021. Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889 1897, 2015. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv: 1707.06347, 2017. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. A. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387 395, 2014. Sutton, R. S. and Barto, A. G. Reinforcement learning: an introduction. 1998. Reflective Policy Optimization Tang, Y., Valko, M., and Munos, R. Taylor expansion policy optimization. In International Conference on Machine Learning, pp. 9397 9406, 2020. Tang, Y., Munos, R., Rowland, M., Pires, B. A., Dabney, W., and Bellemare, M. G. The nature of temporal difference errors in multi-step distributional reinforcement learning. In Advances in Neural Information Processing Systems, 2022. Todorov, E., Erez, T., and Tassa, Y. Mu Jo Co: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. Tomczak, M. B., Kim, D., Vrancx, P., and Kim, K.-E. Policy optimization through approximate importance sampling. ar Xiv preprint ar Xiv:1910.03857, 2019. van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI Conference on Artificial Intelligence, pp. 2094 2100, 2016. Vieillard, N., Pietquin, O., and Geist, M. Munchausen reinforcement learning. In Advances in Neural Information Processing Systems, 2020. Wu, Z., Yu, C., Chen, C., Hao, J., and Zhuo, H. H. Models as agents: Optimizing multi-step predictions of interactive local models in model-based multi-agent reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 10435 10443, 2023. Zhang, S. Modularized implementation of deep rl algorithms in pytorch. https://github.com/ Shangtong Zhang/Deep RL, 2018. Reflective Policy Optimization Let s start with some useful lemmas. Lemma A.1. (Kakade & Langford, 2002) Consider any two policies ˆπ and π, we have η(π) η(ˆπ) = 1 1 γ Es,a ρπAˆπ(s, a). Corollary A.2. Consider any two policies ˆπ and π, we have V π(s0) V ˆπ(s0) = 1 1 γ Es,a ρπ( |s0)Aˆπ(s, a). Qπ(s0, a0) Qˆπ(s0, a0) = γ 1 γ Es,a ρπ( |s0,a0)Aˆπ(s, a). Proof. The first formula is simple, due to η(π) = Es0 ρ0V π(s0). Let s prove the second formula. Qπ(s0, a0) Qˆπ(s0, a0) =γEs P (s |s0,a0) V π(s ) V ˆπ(s ) = γ 1 γ Es P (s |s0,a0)Es,a ρπ( |s )Aˆπ(s, a) = γ 1 γ Es,a ρπ( |s0,a0)Aˆπ(s, a). Lemma A.3. (Tomczak et al., 2019) Consider any two policies ˆπ and π, we have η(π) η(ˆπ) = 1 1 γ Es ρˆπ,a πAˆπ(s, a) + 1 1 γ Es,a ρˆπ π(a|s) ˆπ(a|s) 1 Qπ(s, a) Qˆπ(s, a) . Lemma A.4. Consider a current policy ˆπ, and any policies π, we have Es,a ρπ( )Aˆπ(s, a) Es ρˆπ,a πAˆπ(s, a) = γ 1 γ E s,a ρˆπ( ) s ,a ρπ( |s,a) ˆπ(a|s) 1]Aˆπ(s , a ). Proof. From Lemma A.1 and A.3, we have Es,a ρπAˆπ(s, a) Es ρˆπ,a πAˆπ(s, a) =Es,a ρˆπ π(a|s) ˆπ(a|s) 1 Qπ(s, a) Qˆπ(s, a) . According to Corollary A.2, it is easy to get the conclusion. Theorem A.5. 
Consider a current policy ˆπ, and any policies π, we have η(π) = η(ˆπ) + i=0 αi Li(π, ˆπ) + βk Gk(π, ˆπ), Reflective Policy Optimization Li(π, ˆπ) = E s0,a0 ρˆπ( ) si 1,ai 1 ρˆπ( |si 2,ai 2) t=0 (rt 1) li(π, ˆπ), Gk(π, ˆπ) = E s0,a0 ρˆπ( ) sk 1,ak 1 ρˆπ( |sk 2,ak 2) t=0 (rt 1) gk(π, ˆπ), li(π, ˆπ) = Esi ρˆπ( |si 1,ai 1),ai π( |si)Aˆπ(si, ai), gk(π, ˆπ) = Esk,ak ρπ( |sk 1,ak 1)Aˆπ(sk, ak), rt = π(at|st) ˆπ(at|st), αi = γi (1 γ)i+1 , βk = γk Proof. From Lemma A.4, this formula creates a link between Es,a ρπ( )Aˆπ(s, a) and Es ,a ρπ( |s,a)Aˆπ(s , a ), resulting in a recursive relationship. According to Lemma A.3, and using recursive relationships, defined li(π, ˆπ) = Esi ρˆπ( |si 1,ai 1),ai π( |si)Aˆπ(si, ai), = 1 1 γ Es0 ρˆπ,a0 πAˆπ(s0, a0) + γ (1 γ)2 Es0,a0 ρˆπ[r0 1]Es1,a1 ρπ( |s0,a0)Aˆπ(s1, a1) = 1 1 γ l0(π, ˆπ) + γ (1 γ)2 Es0,a0 ρˆπ[r0 1] l1(π, ˆπ) + γ 1 γ Es1,a1 ρˆπ( |s0,a0)[r1 1]Es2,a2 ρπ( |s1,a1)Aˆπ(s2, a2) = 1 1 γ l0(π, ˆπ) + γ (1 γ)2 Es0,a0 ρˆπ[r0 1]l1(π, ˆπ) (1 γ)3 E s0,a0 ρˆπ( ) s1,a1 ρˆπ( |s0,a0) s2,a2 ρπ( |s1,a1) [r0 1][r1 1]Aˆπ(s2, a2) i=0 αi Li(π, ˆπ) + βk Gk(π, ˆπ). Corollary A.6. According to the definition of Gk, we have |βk Gk(π, ˆπ)| γk (1 γ)k+2 ϵk+1Rmax, where ϵ π ˆπ 1 = maxs P a |π(a|s) ˆπ(a|s)| and Rmax maxs,a |R(s, a)|. Reflective Policy Optimization Proof. According to the definition of Gk(π, ˆπ), and defined ϵ π ˆπ 1, we have |Gk(π, ˆπ)| ϵk |Esk,ak ρπ( |sk 1,ak 1)Aˆπ(sk, ak)| a (π ˆπ)Qˆπ(s, a)da| Combining with βk, we can get this conclusion. Corollary A.7. Compared with Theorem 2 of the paper (Tang et al., 2020), we give a tighter lower bound. Proof. From the paper (Tang et al., 2020), they give the gap between the policy performance of π and the general surrogate object ˆGk = 1 γ(1 γ) 1 γ 1 γ ϵ 1 γϵ Next, from Corollary A.6, we will prove that the following inequality holds (1 γ)k+2 ϵk+1Rmax < ˆGk. That is, we need to prove (1 γ)k+2 ϵk+1Rmax < 1 γ(1 γ) 1 γ 1 γ ϵ 1 γϵ After simplification, we get 1 1 γ < 1 1 γ γϵ. The inequality obviously holds. So, we give a tighter lower bound. Theorem A.8. Consider a current policy ˆπ, and any policies π, we have i=0 αi ˆLi(π, ˆπ) ˆCk(π, ˆπ), ˆLi(π, ˆπ) = E s0,a0 ρˆπ( ) si 1,ai 1 ρˆπ( |si 2,ai 2) si,ai ρˆπ( |si 1,ai 1) t=0 rt Aˆπ(si, ai), ˆCk(π, ˆπ) = γRmax π ˆπ 1 i=1 αi + γk Rmax (1 γ)k+2 π ˆπ 2 1, and αi = γi Reflective Policy Optimization Proof. For the definition of Li(π, ˆπ), we have = 1 1 γ Es0 ρˆπ,a0 πAˆπ(s0, a0) + γ (1 γ)2 Es0,a0 ρˆπ[r0 1]Es1,a1 ρπ( |s0,a0)Aˆπ(s1, a1) = 1 1 γ l0(π, ˆπ) γ (1 γ)2 Es0,a0 ρˆπEs1,a1 ρπ( |s0,a0)Aˆπ(s1, a1) + γ (1 γ)2 Es0,a0 ρˆπr0 l1(π, ˆπ) + γ 1 γ Es1,a1 ρˆπ( |s0,a0)[r1 1]Es2,a2 ρπ( |s1,a1)Aˆπ(s2, a2) = 1 1 γ l0(π, ˆπ) γ (1 γ)2 Es0,a0 ρˆπEs1,a1 ρπ( |s0,a0)Aˆπ(s1, a1) + γ (1 γ)2 Es0,a0 ρˆπr0l1(π, ˆπ) γ2 (1 γ)3 Es0,a0 ρˆπr0Es1,a1 ρˆπ( |s0,a0)Es2,a2 ρπ( |s1,a1)Aˆπ(s2, a2) (1 γ)3 Es0,a0 ρˆπr0Es1,a1 ρˆπ( |s0,a0)r1Es2,a2 ρπ( |s1,a1)Aˆπ(s2, a2) = 1 1 γ l0(π, ˆπ) + γ (1 γ)2 Es0,a0 ρˆπ[r0 1]l1(π, ˆπ) (1 γ)3 E s0,a0 ρˆπ( ) s1,a1 ρˆπ( |s0,a0) s2,a2 ρπ( |s1,a1) [r0 1][r1 1]Aˆπ(s2, a2) i=0 αi ˆLi(π, ˆπ) i=1 αi ˆHi(π, ˆπ) + βk ˆGk(π, ˆπ), ˆLi(π, ˆπ) = E s0,a0 ρˆπ( ) si 1,ai 1 ρˆπ( |si 2,ai 2) si,ai ρˆπ( |si 1,ai 1) t=0 rt Aˆπ(si, ai), ˆHi(π, ˆπ) = E s0,a0 ρˆπ( ) si 1,ai 1 ρˆπ( |si 2,ai 2) si,ai ρπ( |si 1,ai 1) t=0 rt Aˆπ(si, ai), ˆGk(π, ˆπ) = E s0,a0 ρˆπ( ) si 1,ai 1 ρˆπ( |si 2,ai 2) si,ai ρπ( |si 1,ai 1) t=0 rt (ri 1 1) Aˆπ(si, ai), and αi = γi (1 γ)i+1 , βk = γk It is easy to prove that the following inequality holds ˆHi(π, ˆπ) Rmax 1 γ π ˆπ 1, ˆGk(π, ˆπ) Rmax 1 γ π ˆπ 2 1. 
i=0 αi ˆLi(π, ˆπ) γRmax π ˆπ 1 i=1 αi γk Rmax (1 γ)k+2 π ˆπ 2 1. Reflective Policy Optimization Theorem A.9. Define two sets Ψ1 = µ | α0 ˆL0(µ, ˆπ) ˆC1(µ, ˆπ) 0, µ ˆπ 1 1 Ψ2 = µ | α0 ˆL0(µ, ˆπ) + α1 ˆL1(µ, ˆπ) ˆC2(µ, ˆπ) 0, µ ˆπ 1 1 then we have Proof. Let µ Ψ1, we have ˆL0(µ, ˆπ) γRmax (1 γ)2 µ ˆπ 2 1 0. (8) Below, we will show that µ may not be in the set Ψ2. For ˆL1(π, ˆπ), we can get ˆL1(π, ˆπ) = E s0 ρˆπ( ),a0 π( |s0) s1 ρˆπ( |s0,a0),a1 π( |s1) Aˆπ(s1, a1) (9) = E s0 ρˆπ( ),a0 ˆπ( |s0) s1 ρˆπ( |s0,a0),a1 π( |s1) Aˆπ(s1, a1) + E s0 ρˆπ( ),a0 π( |s0) s1 ρˆπ( |s0,a0),a1 π( |s1) E s0 ρˆπ( ),a0 ˆπ( |s0) s1 ρˆπ( |s0,a0),a1 π( |s1) Aˆπ(s1, a1) (10) E s1 ρˆπ( ),a1 π( |s1) Aˆπ(s1, a1) Rmax 1 γ π ˆπ 2 1. (11) The last inequality uses Es0 ρˆπ( ),a0 ˆπ( |s0)ρˆπ( |s0, a0) = ρˆπ( ) and H older s inequality (Finner, 1992). Combining with ˆL0(π, ˆπ) and ˆC2(π, ˆπ), we have ˆL0(π, ˆπ) + γ 1 γ ˆL1(π, ˆπ) γRmax (1 γ)2 π ˆπ 1 γ2Rmax (1 γ)3 π ˆπ 2 1 ˆL0(π, ˆπ) + γ 1 γ E s1 ρˆπ( ),a1 π( |s1) Aˆπ(s1, a1) Rmax 1 γ π ˆπ 2 1 (1 γ)2 π ˆπ 1 γ2Rmax (1 γ)3 π ˆπ 2 1 = 1 1 γ ˆL0(π, ˆπ) γRmax (1 γ)3 π ˆπ 2 1 γRmax (1 γ)2 π ˆπ 1. Combining with the inequality (8), we have ˆL0(µ, ˆπ) + γ 1 γ ˆL1(µ, ˆπ) γRmax (1 γ)2 µ ˆπ 1 γ2Rmax (1 γ)3 µ ˆπ 2 1 γRmax (1 γ)2 µ ˆπ 1. From the above inequality, it shows that µ may not be in set Ψ2. On the contrary, we will show that if µ Ψ2, then µ Ψ1. According to the Eqn. (9),we can get the upper bound of ˆL1(π, ˆπ): ˆL1(π, ˆπ) E s1 ρˆπ( ),a1 π( |s1) Aˆπ(s1, a1) + Rmax 1 γ π ˆπ 2 1. Reflective Policy Optimization If µ Ψ2, we have 0 ˆL0(µ, ˆπ) + γ 1 γ ˆL1(µ, ˆπ) γRmax (1 γ)2 µ ˆπ 1 γ2Rmax (1 γ)3 µ ˆπ 2 1 ˆL0(µ, ˆπ) + γ 1 γ E s1 ρˆπ( ),a1 µ( |s1) Aˆπ(s1, a1) + Rmax 1 γ µ ˆπ 2 1 (1 γ)2 µ ˆπ 1 γ2Rmax (1 γ)3 µ ˆπ 2 1 = 1 1 γ ˆL0(µ, ˆπ) γRmax (1 γ)2 µ ˆπ 1 + γ(1 2γ)Rmax (1 γ)3 µ ˆπ 2 1 = 1 1 γ ˆL0(µ, ˆπ) γRmax (1 γ)3 µ ˆπ 2 1 (1 γ)3 µ ˆπ 2 1 γRmax (1 γ)2 µ ˆπ 1 + γ(1 2γ)Rmax (1 γ)3 µ ˆπ 2 1. Next, we will show that the second line of the last equation in the above formula is less than or equal to 0. We have γRmax (1 γ)3 µ ˆπ 2 1 γRmax (1 γ)2 µ ˆπ 1 + γ(1 2γ)Rmax (1 γ)3 µ ˆπ 2 1 1 1 γ µ ˆπ 2 1 µ ˆπ 1 + 1 2γ 1 γ µ ˆπ 2 1 (1 γ)2 2 µ ˆπ 2 1 µ ˆπ 1 The last inequality holds because µ ˆπ 1 1 2. So, from the Eqn.(12), we have 0 1 1 γ ˆL0(µ, ˆπ) γRmax (1 γ)3 µ ˆπ 2 1. In summary, we have that Ψ2 Ψ1 holds. B. Additional experimental results To verify the effectiveness of the proposed RPO method, we select six continuous control tasks from the Mu Jo Co environments (Todorov et al., 2012) in Open AI Gym (Brockman et al., 2016). We conduct all the experiments mainly based on the code from (Queeney et al., 2021). The test procedures are averaged over ten test episodes across ten independent runs. For the experimental parameters, we use the default parameters from (Dhariwal et al., 2017; Henderson et al., 2018), for example, the discount factor is γ = 0.995, and we use the Adam optimizer (Kingma & Ba, 2015) throughout the training progress. The learning rate ϕ = 3e 4 except for Humanoid which is 1e 5. For RPO, the clipping parameters are ϵ = 0.2 and ϵ1 = 0.1, and the weighted parameter β = 0.3 on Mu Jo Co environments and do not fine-tune them. The RPO algorithm involves hyperparameters, but we have yet to extensively fine-tun them, from Table 3, we supplemented some experiments for k = 4. To verify the effectiveness of the proposed RPO method in discrete Atari environments, the code is based on (Zhang, 2018). We run our experiments across three seeds with fair evaluation metrics. 
We use the same hyperparameters ϵ1 = 0.1 and β = 3.0 in all environments and do not fine-tune them. Firstly, in the discrete state-action spaces of Atari, rewards are sparser than in the continuous state-action spaces of MuJoCo; therefore, to better utilize information from subsequent states, β should be larger than the MuJoCo setting. Secondly, we did not specifically select these hyperparameters but set them arbitrarily. Supplementary experiments on six randomly selected Atari games demonstrate this: the results in Table 4 show that RPO outperforms PPO under different parameter settings, indicating good robustness.

Figure 5. Learning curves on the Gym environments. Performance of RPO vs. PPO, TRPO, OTRPO, GePPO, ISPO and TayPO.

Figure 6. Ablations of the clipping parameter ϵ1 and the weighting parameter β on HalfCheetah, Reacher, Swimmer, and Walker2d. The top row shows performance with β fixed, and the bottom row shows performance with ϵ1 fixed.

Figure 7. Learning curves on the Atari games. Performance of RPO vs. PPO.

Table 2. Mean returns over three random seeds of the last 100 episodes for different methods in the Atari environments. The best results are highlighted in bold. The bottom row counts the number of games where each algorithm (or the human baseline) performed best.
Algorithm Random Human PPO RPO Alien 228 7128 2037 2201 Amidar 6 1720 1121 1148 Asterix 210 8503 12048 14904 Asteroids 719 47389 3409 31230 Atlantis 12850 29028 3311130 3347654 Bank Heist 14 753 1264 1257 Battle Zone 2360 37188 23629 29674 Berzerk 124 2630 1390 1500 Bowling 23 161 33 65 Boxing 0 12 98 99 Breakout 2 30 233 306 Carnival 380 4000 3578 4289 Centipede 2091 12017 4007 6361 Chopper Command 811 7388 1407 2394 Crazy Climber 10780 35829 111516 121374 Demon Attack 152 1971 157620 188806 Double Dunk -19 -16 -6 -4 Elevator Action 0 3000 10085 24473 Enduro 0 860 366 470 Fishing Derby -92 -39 38 46 Freeway 0 30 33 32 Frostbite 65 4335 3808 4455 Gopher 258 2412 2165 9489 Gravitar 173 3351 1301 1551 Hero 1027 30826 35477 32507 Ice Hockey -11 1 -3 -2 Jamesbond 29 303 2979 5751 Journey Escape -18000 -1000 -1008 -609 Kangaroo 52 3035 9173 10383 Krull 1598 2666 7593 8507 Kung Fu Master 258 22736 25088 34463 Ms Pacman 307 6952 2215 2969 Name This Game 2292 8049 5872 6925 Phoenix 761 7243 27074 43073 Pitfall -229 6464 -10 0 Pong -21 15 21 21 Pooyan 500 1000 2765 3542 Private Eye 25 69571 99 96 Qbert 164 13455 15410 16200 Riverraid 1338 17118 11093 14785 Road Runner 12 7845 41111 45982 Robotank 2 12 29 22 Seaquest 68 42055 2651 1548 Space Invaders 148 1669 2706 2445 Star Gunner 664 10250 66997 63297 Tennis -24 -8 -4 5 Time Pilot 3568 5229 14527 12012 Tutankham 11 168 208 176 Up NDown 533 11693 344972 494588 Venture 0 1188 0 659 Video Pinball 0 17668 71406 100495 Wizard Of Wor 564 4756 10347 9805 Yars Revenge 3093 54577 59645 75887 Zaxxon 32 9173 16258 14066 Best 0 18 11 26 Reflective Policy Optimization Table 3. Mean of return for different k in Mujoco. PPO RPO (k = 2) RPO (k = 3) RPO (k = 4) Half Cheetah 2408 3495 4239 4381 Swimmer 134 157 226 227 Table 4. Mean of return of three random seeds for different β in some Atari. The best results are highlighted in bold. PPO RPO (β = 1) RPO (β = 2) RPO (β = 3) RPO (β = 4) RPO (β = 5) Asterix 12048 15198 17138 14904 16054 13806 Breakout 233 272 303 306 357 360 Crazy Climber 111516 112760 118933 121374 114350 123890 Kangaroo 9173 12434 10611 10383 10239 11590 Phoenix 27074 28240 34910 43073 42200 42082 Qbert 15410 16634 16213 16200 16137 16958 Table 5. Hyperparameters for RPO on Mujoco tasks. Hyperparameter Value Discount rate γ 0.995 GAE parameter 0.97 Minibatches per epoch 32 Epochs per update 10 Optimizer Adam Learning rate ϕ 3e-4 Minimum batch size (n) 2048 First clipping parameter ϵ 0.2 Second clipping parameter ϵ1 0.1 Weighting parameter β 0.3 Table 6. Hyperparameters for RPO on Atari tasks. Hyperparameter Value Discount rate γ 0.99 GAE parameter 0.95 Number Workers 16 Epochs per update 4 Optimizer Adam Learning rate ϕ 2.5e-4 Rollout Length 256 First clipping parameter ϵ 0.1 Second clipping parameter ϵ1 0.1 Weighting parameter β 3.0