Published as a conference paper at ICLR 2024

ON STATIONARY POINT CONVERGENCE OF PPO-CLIP

Ruinan Jin, The Chinese University of Hong Kong, Shenzhen. jinruinan@cuhk.edu.cn
Shuai Li, Shanghai Jiao Tong University. shuaili8@sjtu.edu.cn
Baoxiang Wang (corresponding author), The Chinese University of Hong Kong, Shenzhen. bxiangwang@cuhk.edu.cn

ABSTRACT

Proximal policy optimization (PPO) has gained popularity in reinforcement learning (RL). Its PPO-Clip variant is one of the most frequently implemented algorithms and is one of the first-to-try algorithms in RL tasks. This variant uses a clipped surrogate objective function that is not typically found in other algorithms. Many works have demonstrated the practical performance of PPO-Clip, but the theoretical understanding of it is limited to specific settings. In this work, we provide a comprehensive analysis that shows the stationary point convergence of PPO-Clip and the convergence rate thereof. Our analysis is new and overcomes many challenges, including the non-smooth nature of the clip operator, the potentially unbounded score function, and the involvement of the ratio of two stochastic policies. Our results and techniques might shed new light on PPO-Clip.

1 INTRODUCTION

Proximal policy optimization (PPO) is one of the most popular algorithms in reinforcement learning (RL) (Schulman et al., 2017). For many RL tasks, one would try PPO as a first attempt, for its generality and stability across different environments as well as its ease of implementation. As of September 2023, searching for "PPO" on GitHub.com returns over 5,500 results. This demonstrates the enthusiasm for reproducing the algorithm, applying it to different problems, and extending it to many variants.

PPO stems from policy gradient, which was proposed by Williams (1992) in one of the seminal works in RL. The idea is to use the policy from a previous step as an off-policy reference, and to restrain the new policy to within a small discrepancy of the old policy. In this way, the policy does not suddenly deviate too much from its previous iterations. The idea is well compatible with modern deep learning techniques, such as advantage function estimation and neural policy parametrization.

There are two primary variants of PPO introduced by Schulman et al. (2017). For convenience, we call them PPO-Penalty and PPO-Clip. PPO-Penalty approximately solves a KL-constrained update, but penalizes the KL-divergence in the objective function instead of making it a hard constraint. The penalty coefficient is then automatically adapted over the course of training so that it scales appropriately. PPO-Clip has neither a KL-divergence term in the objective nor a hard constraint. Instead, it relies on a clipping operator in the objective function to remove the incentive for the new policy to move too far away from the old policy. This use of a clipped surrogate objective function is not typical in optimization algorithms. In this work, we focus on PPO-Clip.

In contrast to the empirical success of PPO-Clip, theoretical analysis of it is limited. Huang et al. (2021) and Yao et al. (2022) have made substantial attempts to discuss the convergence of PPO-Clip under the tabular and the over-parameterized neural network settings. The basic idea is to cast PPO-Clip as a hinge loss problem in classification. This analysis relies on the local convexity of the optimization objective, with which typical techniques for hinge loss apply.
Liu et al. (2019) employed the classic proof based on the neural tangent kernel (NTK) condition to establish the global convergence of over-parameterized neural networks. Their analysis is for a variant of PPO which no longer involves the ratio of the current policy and the old policy.

In fact, analyzing the original PPO-Clip involves many challenges. First, the algorithm uses the ratio of two stochastic policies, a unique structure for which no off-the-shelf recursive inequality is available. Second, the clip operator is non-smooth, which necessitates a divide-and-conquer treatment of probabilistic events. In comparison, treating the clip operation as a hinge loss does not provide more help when the problem is general and not locally convex. Third, the score function of the policy is unbounded; a typical example is the commonly used neural network policy. Without a bounded score function, strong conditions such as the gradient dominance lemma are no longer available, and existing techniques in the analysis of policy gradient do not apply. In addition, the unbounded score function could make the ratio of two policies arbitrarily large, even in the late stages of the optimization process.

In this work, we investigate the vanilla version of PPO-Clip with a minimal set of conditions. We formulate PPO-Clip as a general two-scale iterative process, where the old policy $\pi_{\mathrm{old}}$ is synchronized every $T$ steps ($T$ could be $1$, which makes it single-scale). We show that this process converges to a stationary point of the learning objective, up to a bias term that depends only on the biases incurred by the Markovian sampling and the advantage estimation. When the biases are zero, such as when Monte-Carlo methods are used, PPO-Clip converges almost surely. We further provide the convergence rate of the gradient norm in terms of average-iterate convergence and subsequence convergence. Technically, our analyses are based on key lemmas that characterize the probabilities of the clip events and thereby the recursion of the learning process. This new tool might offer insights into analyzing PPO and might be of independent interest.

2 PRELIMINARIES

2.1 MARKOV DECISION PROCESS

We consider a canonical discounted Markov decision process (MDP) setting with finite state and action spaces. In this work, an MDP is denoted as $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the finite state and action spaces, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ denotes the transition kernel, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. Let $s_0$ be the initial state. The goal of reinforcement learning is to learn a policy $\pi$ that takes $s \in \mathcal{S}$ and outputs a distribution $\pi(\cdot \mid s)$ over the action space, so as to maximize the expected cumulative discounted reward. For every policy $\pi$, define the action-value function $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ as
$$Q^\pi(s, a) := \mathbb{E}_{\pi, P}\Big[\sum_{h=0}^{\infty} \gamma^h r(s_h, a_h) \,\Big|\, s_0 = s,\ a_0 = a\Big].$$
Here, $\mathbb{E}_{\pi, P}(\cdot)$ denotes the expectation under subsequent actions and states that follow $\pi$ and $P$. When the context is clear we write this as $\mathbb{E}_{\pi}(\cdot)$ instead. Define the state-value function $V^\pi : \mathcal{S} \to \mathbb{R}$ as $V^\pi(s) := \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^\pi(s, a)\big]$ and the advantage function $A^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ as $A^\pi(s, a) := Q^\pi(s, a) - V^\pi(s)$, respectively. In this paper, we consider the case where the policy is parametrized by a parameter $\theta \in \mathbb{R}^d$. Define, as the objective function of our optimization,
$$V(\theta) := V^{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\Big[\sum_{h=0}^{\infty} \gamma^h r(s_h, a_h) \,\Big|\, s_0\Big].$$
Denote $V^* := \max_{\theta \in \mathbb{R}^d} V(\theta)$ as the global maximum of $V(\theta)$.
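To make the definitions above concrete, the following is a minimal sketch (ours, not from the paper) of Monte-Carlo estimation of the discounted return and the advantage from one sampled episode. The function names and the use of a given baseline estimate of $V^\pi(s_h)$ are illustrative assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_h = sum_{t >= h} gamma^{t-h} r_t for every step h of one episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for h in reversed(range(len(rewards))):
        running = rewards[h] + gamma * running
        G[h] = running
    return G

def monte_carlo_advantages(rewards, values, gamma):
    """A_h ~ G_h - V(s_h), with Monte-Carlo returns and a baseline estimate of V."""
    G = discounted_returns(np.asarray(rewards, dtype=float), gamma)
    return G - np.asarray(values, dtype=float)

# Example usage on a toy 4-step episode.
rewards = [1.0, 0.0, 0.5, 1.0]
values = [1.2, 0.8, 0.9, 0.7]   # assumed baseline estimates of V(s_h)
print(monte_carlo_advantages(rewards, values, gamma=0.9))
```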
2.2 POLICY GRADIENT

When both $\pi$ and $V$ are parametrized by $\theta$, it is natural to consider gradient ascent methods to optimize the objective function, i.e., $\theta_{n+1,1} = \theta_{n,1} + \epsilon_n \nabla V(\theta_{n,1})$, where $\epsilon_n > 0$ is the learning rate (step size). The gradient term $\nabla V(\theta)$ can be estimated via
$$\nabla V(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla \ln \pi_\theta(a|s)\, \hat{A}^{\pi_\theta}(s, a)\big],$$
where $\hat{A}^{\pi_\theta}$ is an estimation of the advantage function $A^{\pi_\theta}$ (Williams, 1992; Sutton & Barto, 2018). The policy gradient (PG) estimation is unbiased whenever $\hat{A}$ is an unbiased estimation of $A$, for example, the Monte-Carlo estimation.

Policy gradient methods have received a significant line of research and improvement, and together with the emergence of deep neural networks, have achieved empirical success in many areas (Konda & Tsitsiklis, 2000; Kakade, 2001; Schulman et al., 2015; 2017; Lillicrap et al., 2015). In theory, the global convergence of PG is investigated under the gradient dominance condition (Zhang et al., 2020; 2021; Yuan et al., 2020; Xu et al., 2020; Huang et al., 2020; Fazel et al., 2018; Fatkhullin et al., 2023) and the soft-max policy condition (Agarwal et al., 2021; Mei et al., 2020). Stationary point convergence of PG under general policies is proved by Zhang et al. (2020); Fazel et al. (2018). Wang et al. (2019) study the convergence of neural policy gradient methods. While our setting is quite different from PG, we obtain some insights from these lines of theoretical works.

2.3 PROXIMAL POLICY OPTIMIZATION WITH CLIPPED SURROGATE OBJECTIVE

The proximal policy optimization algorithm has different variants (Schulman et al., 2017; Huang et al., 2022; Hsu et al., 2020; Zhang et al., 2019; 2022). In this work, we are particularly interested in its variant with a clipped surrogate objective, which is introduced in the original PPO work. We denote this variant as PPO-Clip in our manuscript. The surrogate objective function is defined as
$$L^{\mathrm{CPI}}(\theta) = \mathbb{E}_{\pi_{\theta_{\mathrm{old}}}}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}\, \hat{A}^{\pi_{\theta_{\mathrm{old}}}}(s, a)\Big],$$
where $\pi_{\theta_{\mathrm{old}}}$ denotes the policy at a previous iteration and $\hat{A}^{\pi_{\theta_{\mathrm{old}}}}$ is an estimation of the advantage function $A^{\pi_{\theta_{\mathrm{old}}}}$. The surrogate objective function with the clip operator is defined as
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{\pi_{\theta_{\mathrm{old}}}}\Big[\min\Big\{\frac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}\, \hat{A}^{\pi_{\theta_{\mathrm{old}}}}(s, a),\ \mathrm{clip}\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}, 1 - \delta_0, 1 + \delta_0\Big)\, \hat{A}^{\pi_{\theta_{\mathrm{old}}}}(s, a)\Big\}\Big].$$
The first term in the minimum operator is $L^{\mathrm{CPI}}$. The second term, $\mathrm{clip}\big(\frac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}, 1 - \delta_0, 1 + \delta_0\big)\, \hat{A}^{\pi_{\theta_{\mathrm{old}}}}(s, a)$, modifies the surrogate objective by clipping the probability ratio to the interval $[1 - \delta_0, 1 + \delta_0]$, for some absolute constant $\delta_0$. PPO-Clip takes the minimum of the clipped and unclipped objectives, so the final objective is a lower bound on the unclipped objective.

In the original PPO work, PPO-Clip takes $\pi_{\theta_{\mathrm{old}}}$ to be the policy used in the immediately previous iteration. In the current iteration, the algorithm aims to find a parameter $\theta$ that maximizes the clipped surrogate objective, using some estimation $\hat{\mathbb{E}}_{\pi_{\theta_{\mathrm{old}}}}$ of the expectation $\mathbb{E}_{\pi_{\theta_{\mathrm{old}}}}$ over the Markovian sampling. Namely,
$$\mathrm{maximize}_{\theta \in \mathbb{R}^d}\ \ \hat{\mathbb{E}}_{\pi_{\theta_{\mathrm{old}}}}\Big[\min\Big\{\frac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}\, \hat{A}^{\pi_{\theta_{\mathrm{old}}}}(s, a),\ \mathrm{clip}\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}, 1 - \delta_0, 1 + \delta_0\Big)\, \hat{A}^{\pi_{\theta_{\mathrm{old}}}}(s, a)\Big\}\Big]. \tag{1}$$
This maximization alone could involve multiple steps of updates on $\theta$, forming a double-loop structure of the PPO algorithm. The off-policy nature of the surrogate objective specifically fits this type of double-loop structure. In practice, such a structure is implemented simply by running the optimizer multiple times (e.g., optimizer.step() in PyTorch) for each policy gradient update.
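To illustrate the clipped surrogate objective and the double-loop structure described above, here is a minimal PyTorch-style sketch (ours, not the authors' implementation). The `policy.log_prob` interface, the tensor names, and the choice $\delta_0 = 0.2$ are illustrative assumptions.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, delta0=0.2):
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed from log-probabilities.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - delta0, 1.0 + delta0) * advantages
    # Negative sign: the optimizer minimizes, while PPO-Clip maximizes the surrogate.
    return -torch.mean(torch.min(unclipped, clipped))

def inner_maximization(policy, optimizer, states, actions, old_log_probs, advantages, steps=10):
    # Multiple optimizer steps on the same batch, with pi_theta_old held fixed,
    # implement the inner maximization of Equation (1).
    for _ in range(steps):
        new_log_probs = policy.log_prob(states, actions)  # assumed interface
        loss = ppo_clip_loss(new_log_probs, old_log_probs, advantages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```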
3 SETTING AND RESULTS

In this section, we aim to prove the global convergence of PPO-Clip to a stationary point. Namely, the gradient of the value function $V(\theta)$ should converge to zero whenever the estimation bias is zero (Theorems 3.1 and 3.2). We further extend this result to the convergence rate of average-iterate convergence and subsequence convergence (Theorem 3.3).

3.1 FORMULATION

We are now ready to formulate the PPO-Clip variant that we investigate in this paper. In this work, we specifically implement the optimization process within one step of policy gradient, namely Equation (1), as $T$ steps of gradient ascent. The algorithm therefore runs in a two-scale manner, where the index $n$ of the policy gradient iteration will be considered asymptotically large, and the index $k$ of the maximization in (1) is considered to be within $1, \ldots, T$, for some small constant $T$. Notice that all our results hold for $T = 1$ as well. We formulate PPO-Clip as
$$\theta_{n,k+1} = \theta_{n,k} + \epsilon_{n,k}\, \nabla_{\theta_{n,k}} \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\min\Big\{\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a),\ \mathrm{clip}\Big(\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}, 1 - \delta_0, 1 + \delta_0\Big)\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\}\Big] \quad (k = 1, 2, \ldots, T-1),$$
$$\theta_{n+1,1} = \theta_{n,T}, \tag{2}$$
where $\hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}$ is an estimate of $\mathbb{E}_{\pi_{\theta_{n,1}}}$ and $\hat{A}^{\pi_\theta}(s, a)$ is the estimated value of the advantage function.

As we have mentioned in the preliminary section on PPO, the value of $A$ and the expectation under Markovian sampling $\mathbb{E}_{\pi_{\theta_{n,1}}}$ need to be estimated. Both estimations can be obtained in an unbiased way using Monte-Carlo sampling, provided that the sampling goes through until the termination of the episode. Practically, PPO is rarely implemented in a pure Monte-Carlo way. Methods such as the actor-critic algorithm use biased estimations to trade for other properties such as sample complexity and generalization. Nevertheless, this bias can be captured by a sampling bias term. Define the functionals
$$\mathbb{E}'_{\pi_\theta}(f) := \mathbb{E}_{\substack{s_0 \sim \rho,\ a_h \sim \pi_\theta(\cdot|s_h) \\ s_{h+1} \sim P(\cdot|s_h, a_h)}}\big[f(s, a)\, A^{\pi_\theta}(s, a)\big], \qquad \hat{\mathbb{E}}'_{\pi_\theta}(f) := \hat{\mathbb{E}}_{\substack{s_0 \sim \rho,\ a_h \sim \pi_\theta(\cdot|s_h) \\ s_{h+1} \sim P(\cdot|s_h, a_h)}}\big[f(s, a)\, \hat{A}^{\pi_\theta}(s, a)\big].$$
If $\hat{A}^{\pi_\theta}$ is bounded, which we will assume in Assumption 3.2, both of the functionals are bounded linear functionals on the functions $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$. Therefore, $(\mathbb{E}'_{\pi_\theta} - \hat{\mathbb{E}}'_{\pi_\theta})(\cdot)$ is also a bounded linear functional. The norm of this functional then exists and we denote it as
$$\big\|(\mathbb{E}'_{\pi_\theta} - \hat{\mathbb{E}}'_{\pi_\theta})(\cdot)\big\| = \max_{\|f\| = 1} \big|(\mathbb{E}'_{\pi_\theta} - \hat{\mathbb{E}}'_{\pi_\theta})(f)\big|, \quad \text{where } \|f\| := \max_{(s,a) \in \mathcal{S} \times \mathcal{A}} |f(s, a)|.$$
Subsequently, denote the upper bound of the norm of this functional as
$$\big\|(\mathbb{E}'_{\pi_\theta} - \hat{\mathbb{E}}'_{\pi_\theta})(\cdot)\big\| \le \phi_n. \tag{3}$$
When Monte-Carlo sampling is used to estimate the expectation $\mathbb{E}$ of the Markovian sampling and the advantage function $A$, the term $\phi_n = 0$. In this work, we will provide general results that depend on the value of $\phi_n$. To facilitate the analysis, define the $\sigma$-filtration
$$\mathcal{F}_1 := \sigma(\theta_{1,1}), \qquad \mathcal{F}_n := \sigma\big(\sigma(\theta_{n-1,1}), \xi_{n-1}\big) \quad (n \ge 2),$$
where the random variable $\{\xi_n\}$ represents the stochasticity in the sampling of $\hat{\mathbb{E}}$ and the estimation $\hat{A}$ in step $n$ of the outer iterate.
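The two-scale structure of Equation (2) can be summarized by the following minimal sketch (ours, not the authors' code): the reference parameter $\theta_{n,1}$ is frozen at the start of each outer iteration, $T - 1$ inner gradient-ascent steps are taken on the clipped surrogate, and the next outer iterate is synchronized as $\theta_{n+1,1} = \theta_{n,T}$. The gradient oracle `surrogate_grad` and the diminishing schedule $\epsilon_{n,k} = c/n$ are illustrative assumptions.

```python
import numpy as np

def ppo_clip_two_scale(theta0, surrogate_grad, T=5, outer_iters=1000, c=0.1):
    """Two-scale iteration of Equation (2).

    surrogate_grad(theta, theta_ref) is an assumed oracle returning a stochastic
    gradient of the clipped surrogate objective at theta, with pi_{theta_ref}
    playing the role of the old policy pi_{theta_{n,1}}.
    """
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, outer_iters + 1):
        theta_ref = theta.copy()        # theta_{n,1}: frozen reference for this outer step
        for k in range(1, T):           # k = 1, ..., T-1 inner gradient-ascent steps
            eps_nk = c / n              # an example of a diminishing step size epsilon_{n,k}
            theta = theta + eps_nk * surrogate_grad(theta, theta_ref)
        # Synchronization theta_{n+1,1} = theta_{n,T} happens implicitly: the next
        # outer iteration freezes the current theta as its reference.
    return theta
```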
3.2 ASSUMPTIONS

We first present the assumption regarding the policy parameterization of $\pi_\theta(a|s)$. This assumption can be realized by setting the policy to be within certain function classes such as neural networks.

Assumption 3.1. There is a constant $L$ such that for any $s, a$, the policy $\pi_\theta(a|s)$ is $L$-smooth with respect to $\theta$, i.e., $\|\nabla \pi_{\theta_1}(a|s) - \nabla \pi_{\theta_2}(a|s)\| \le L \|\theta_1 - \theta_2\|$ for all $\theta_1, \theta_2$.

In comparison, a stronger assumption was used in some previous works on policy gradient (Zhang et al., 2020; 2021; Yuan et al., 2020; Xu et al., 2020; Huang et al., 2020). In that assumption, the score function $\nabla \ln \pi_\theta(a|s)$ has to be bounded and Lipschitz continuous, i.e., $\|\nabla \ln \pi_{\theta_1}(a|s) - \nabla \ln \pi_{\theta_2}(a|s)\| \le M_1 \|\theta_1 - \theta_2\|$ for all $s, a$ and some $M_1$. This assumption does not hold for neural networks in general. As a counterexample, let the action space be $\{0, 1\}$ and let $\theta = (x, y)$. Let
$$\pi_{x,y}(0|s) = \frac{e^{x^{\top} y s}}{1 + e^{x^{\top} y s}}, \qquad \pi_{x,y}(1|s) = \frac{1}{1 + e^{x^{\top} y s}} \quad (\forall s).$$
When $\|\theta\| \to +\infty$, it is evident that $\|\nabla \ln \pi_{x,y}(a|s)\| \to +\infty$ in this case. Nevertheless, our conditions in Assumption 3.1 hold under this example.

Our analyses also need a uniform upper bound on the reward function for each state and action. This condition is very commonly used in the field of reinforcement learning (Zhang et al., 2021; 2020; Yuan et al., 2020; Xu et al., 2020; Agarwal et al., 2021).

Assumption 3.2. There exists an upper bound $\hat{r} > 0$ such that for any $s, a$, the reward is bounded by $|r(s, a)| \le \hat{r}$.

It follows immediately from this assumption that $|A^\pi(s, a)| \le R_{\max} := 2\hat{r}/(1 - \gamma)$. In this manuscript, whenever $A$ is estimated (denoted as $\hat{A}$), we truncate the value of that estimate such that $|\hat{A}^\pi(s, a)| \le R_{\max} = 2\hat{r}/(1 - \gamma)$.

Now we describe the conditions on the learning rate of PPO-Clip. In this work, we require the learning rate to satisfy the Robbins-Monro condition in order to prove the global convergence of PPO-Clip to a stationary point. The condition agrees with many commonly used learning rates of PPO in practice.

Assumption 3.3. The learning rate $\{\epsilon_{n,k}\}_{n=1,k=1}^{+\infty, T-1}$ satisfies $\sum_{n=1}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k} = +\infty$ and $\sum_{n=1}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k}^2 < +\infty$.

In the case where we only need to prove subsequence convergence or average-iterate mean-square convergence, Assumption 3.3 can be relaxed into Assumption 3.4.

Assumption 3.4. The learning rate $\{\epsilon_{n,k}\}_{n=1,k=1}^{+\infty, T-1}$ satisfies $\sum_{n=1}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k} = +\infty$ and $\sum_{k=1}^{T-1} \epsilon_{n,k} \to 0$ as $n \to +\infty$.

Under Assumption 3.3, we can use the classical martingale method to prove convergence, which yields strong last-iterate convergence. However, this approach is not applicable under Assumption 3.4. Under Assumption 3.4, we are unable to construct a supermartingale from the value of $V(\theta_n)$. Instead, we decompose $\{\nabla V(\theta_n)\}$ into multiple sub-processes using first entrance times. A respective supermartingale is then constructed on each sub-process, and the compounding effect of these sub-processes is analyzed. Assumptions on the learning rate can of course be satisfied by our choice of the learning rate. We remark that both assumptions accommodate a wide range of learning rate choices used in practice in reinforcement learning.

3.3 RESULTS

In investigating the convergence of stochastic optimization, it is common to construct a recursive inequality for the optimization objective. Taking policy gradient as an example, one may construct a recursive inequality using Taylor's expansion as
$$\mathbb{E}\big[V^* - V(\theta_{n+1}) \,\big|\, \mathcal{F}_n\big] \le V^* - V(\theta_n) - \epsilon_n\, \hat{\nabla} V(\theta_n)^{\top} \nabla V(\theta_n) + \frac{\tilde{L} \epsilon_n^2}{2} \big\|\hat{\nabla} V(\theta_n)\big\|^2,$$
where $\epsilon_n$ is the learning rate, $\hat{\nabla} V$ is an estimation of $\nabla V$, $V^*$ is the maximum value of the value function $V(\theta)$, and $\tilde{L}$ represents the Lipschitz coefficient of $\nabla V(\theta)$.

Due to the difference between policy gradient and PPO-Clip, the update rule now is not a direct estimate of the policy gradient $\nabla V(\theta)$. In this case, to obtain a recursive inequality similar to the one above, we estimate the error between the PPO-Clip update direction and $\nabla V(\theta)$. The aim is to bound this error term and mitigate its effect in the subsequent analyses. However, it is not straightforward to estimate this error, as one needs to characterize the distribution of the random variable $\pi_\theta(a|s)/\pi_{\theta_{\mathrm{old}}}(a|s)$, which was not available in the literature to the best of our knowledge. From this perspective, the analysis of PPO-Clip might be more challenging than the analysis of policy gradient.
To be specific, our idea is to prove that the event $B := \{\pi_\theta(a|s)/\pi_{\theta_{\mathrm{old}}}(a|s) \in (0, 1 - \delta_0) \cup (1 + \delta_0, +\infty)\}$ is a subset of an event $C := \{\sqrt{\pi_{\theta_{\mathrm{old}}}(a|s)} \le k_\theta\}$, for some scale $k_\theta$. Then we relax the set of $(s, a)$ pairs that satisfy the clip condition to those that satisfy the event $C$. Subsequently, conditioning on the event $C$, we can estimate the error terms (denoted as $X$ and $Y$ in the analysis) accurately. This idea yields the following lemma.

Lemma 3.1. If Assumptions 3.1 and 3.2 hold, then for the sequence $\{\theta_{n,k}\}_{n=1,k=1}^{+\infty, T}$ formed by PPO-Clip (2), there is
$$V^* - V(\theta_{n+1,1}) \le \big(V^* - V(\theta_{n,1})\big) - \sum_{k=1}^{T-1} \epsilon_{n,k}\, \big\|\nabla V(\theta_{n,1})\big\|^2 + \sqrt{1 + \delta_0}\, R_{\max}^2 \sqrt{|\mathcal{A}|}\; \phi_n \sum_{k=1}^{T-1} \epsilon_{n,k} + O\Big(\sum_{k=1}^{T-1} \epsilon_{n,k}^2\Big) + \zeta_n,$$
where
$$\tilde{L} := \frac{|\mathcal{A}| R_{\max} L}{(1 - \gamma)^2} + \frac{2(1 + \gamma) |\mathcal{A}| R_{\max} L}{(1 - \gamma)^3}$$
is the Lipschitz coefficient of $\nabla V(\theta)$, and $\{\zeta_n, \mathcal{F}_{n+1}\}_{n=1}^{+\infty}$ is a martingale-difference sequence.

This lemma is our key lemma. It derives a recursive inequality that is similar to that of policy gradient. We now provide the insights behind this lemma, while deferring the full proof to the appendix.

Step 1: We first simplify the gradient in PPO using a divide-and-conquer approach. Let $B^{(s,a)}_{n,k} := \big\{1 - \delta_0 < \pi_{\theta_{n,k}}(a|s)/\pi_{\theta_{n,1}}(a|s) < 1 + \delta_0\big\}$ be the event that the $(s, a)$ pair does not trigger the clip operation. We obtain the following equations:
$$\nabla_{\theta_{n,k}} \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\min\Big\{\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a),\ \mathrm{clip}\Big(\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}, 1 - \delta_0, 1 + \delta_0\Big)\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\}\Big] = \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[1_{B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big]$$
$$= \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big] - \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[1_{\Omega/B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big] + \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[1_{B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,k}}(a|s) - \nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big]$$
$$=: \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big] + X_{n,k} + Y_{n,k}.$$
Here, $X_{n,k}$ and $Y_{n,k}$ are defined as the last two terms, i.e., the error terms, on the right-hand side of the second equality.

Step 2: Note that the first term of the last equation is the policy gradient of the value function at $\theta_{n,1}$, which can be handled effectively. Therefore, we aim to bound $X_{n,k}$ and $Y_{n,k}$ to ensure that the difference between the PPO update direction and the policy gradient of the value function at $\theta_{n,1}$ is not significant. For $X_{n,k}$, we observe $\Omega/B_{n,k} \subseteq \Omega/C_{n,k,\lambda_0}$ (see the appendix for its proof), where $C^{(s,a)}_{n,k,\lambda}$ (defined precisely in Appendix A.2) is the event that $\sqrt{\pi_{\theta_{n-1,1}}(a|s)}$ exceeds $(\sqrt{2L} + \lambda)$ times the accumulated step-size bound of Lemma A.2 at the previous outer iterate, measured through $\hat{\mathbb{E}}_{\pi_{\theta_{n-1,1}}}\big[\sqrt{2L(1+\delta_0)}\,|\hat{A}^{\pi_{\theta_{n-1,1}}}(s,a)| / \sqrt{\pi_{\theta_{n-1,1}}(a|s)}\big]$, and
$$\lambda_0 := \frac{\sqrt{2L}}{\min\{-\ln(1 - \delta_0),\ \ln(1 + \delta_0)\}}.$$
As such, the indicator function $1_{\Omega/B_{n,k}}$ can be bounded by $1_{\Omega/C_{n,k,\lambda_0}}$. By further noticing the $L$-smooth condition, we have $\|\nabla \pi_{\theta_{n,k}}(a|s)\| \le \sqrt{2L\, \pi_{\theta_{n,k}}(a|s)}$ (see Lemma A.1 for a proof), which leads to
$$\big\|\mathbb{E}\big[X_{n,k} \,\big|\, \mathcal{F}_{n-1}\big]\big\| \le \sqrt{2L}\, \sqrt{1 + \delta_0}\, R_{\max}^3\, |\mathcal{A}|\ \mathbb{E}_{\pi_{\theta_{n-1,1}}}\!\Big[\frac{1_{\Omega/C^{(s,a)}_{n,k,\lambda_0}}}{\sqrt{\pi_{\theta_{n-1,1}}(a|s)}}\Big],$$
where $R_{\max}$ is the bound defined after Assumption 3.2. For $Y_{n,k}$, our main objective is to bound $\nabla \pi_{\theta_{n,k}}(a|s) - \nabla \pi_{\theta_{n,1}}(a|s)$. According to the $L$-smooth condition, we have $\|\nabla \pi_{\theta_{n,k}}(a|s) - \nabla \pi_{\theta_{n,1}}(a|s)\| \le L \|\theta_{n,k} - \theta_{n,1}\|$. It therefore suffices to control $\|\theta_{n,k} - \theta_{n,1}\|$. In fact, we could obtain
$$\|\theta_{n,k+1} - \theta_{n,k}\| \le \epsilon_{n,k}\, \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\frac{\sqrt{2L(1 + \delta_0)}}{\sqrt{\pi_{\theta_{n,1}}(a|s)}}\, \big|\hat{A}^{\pi_{\theta_{n,1}}}(s, a)\big|\right],$$
and the proof of this inequality is presented in Lemma A.2. The bound on $Y_{n,k}$ is then
$$\big\|\mathbb{E}\big[Y_{n,k} \,\big|\, \mathcal{F}_{n-1}\big]\big\| \le L\, \sqrt{2L(1 + \delta_0)}\, R_{\max}^3\, |\mathcal{A}|\ \mathbb{E}_{\pi_{\theta_{n-1,1}}}\!\Big[\frac{1}{\sqrt{\pi_{\theta_{n-1,1}}(a|s)}}\Big] \sum_{t=1}^{k-1} \epsilon_{n,t}.$$

Step 3: In this step, we construct a recursive inequality for the value function $V(\theta_{n,1})$. Using Taylor's expansion, we obtain
$$V^* - V(\theta_{n+1,1}) \le \big(V^* - V(\theta_{n,1})\big) - \nabla V(\theta_{n-1,1})^{\top}(\theta_{n+1,1} - \theta_{n,1}) - \big(\nabla V(\theta_{n,1}) - \nabla V(\theta_{n-1,1})\big)^{\top}(\theta_{n+1,1} - \theta_{n,1}) + \tilde{L}\, \|\theta_{n+1,1} - \theta_{n,1}\|^2,$$
where $\tilde{L} := \frac{|\mathcal{A}| R_{\max} L}{(1 - \gamma)^2} + \frac{2(1 + \gamma) |\mathcal{A}| R_{\max} L}{(1 - \gamma)^3}$ represents the Lipschitz coefficient of $\nabla V(\theta)$. By substituting the bounds on $X_{n,k}$ and $Y_{n,k}$ from the previous step into this inequality, we arrive at the recursion stated in Lemma 3.1.

Armed with the lemma, we are now ready to discuss the convergence of the PPO-Clip algorithm.
Our first theorem asserts its global convergence to a stationary point, up to the estimation bias $\phi_n$ involved in the Markovian sampling and the advantage estimation. The idea of the proof is to discuss the magnitude of $\|\nabla V(\theta_{n,1})\|^2$. Let $\phi'_n := \phi_n + O\big(\sum_{k=1}^{T-1} \epsilon_{n,k}\big)$. When $\|\nabla V(\theta_{n,1})\|^2 > 8\tilde{L}\sqrt{|\mathcal{A}|}\, \phi'_n$, the fixed bias term $2\sqrt{|\mathcal{A}|}\, \tilde{L}\, \phi'_n \sum_{k=1}^{T-1} \epsilon_{n,k}$ can be immediately offset by $\sum_{k=1}^{T-1} \epsilon_{n,k}\, \|\nabla V(\theta_{n,1})\|^2$. When $\|\nabla V(\theta_{n,1})\|^2 \le 8\tilde{L}\sqrt{|\mathcal{A}|}\, \phi'_n$, the desired result is straightforwardly satisfied.

Theorem 3.1. If Assumptions 3.1, 3.2, and 3.4 hold, then for the sequence $\{\theta_{n,k}\}_{n=1,k=1}^{+\infty, T}$ formed by PPO-Clip (2), there is
$$\liminf_{n \to +\infty} \big\|\nabla V(\theta_{n,1})\big\|^2 \le 8\tilde{L}\sqrt{|\mathcal{A}|}\, \limsup_{n \to +\infty} \phi_n \quad a.s.$$

Proof. Let $\phi'_n := \phi_n + O\big(\sum_{k=1}^{T-1} \epsilon_{n,k}\big)$. We construct the following events:
$$A_{i,n} := \bigcap_{j=i}^{n} \Big\{\big\|\nabla V(\theta_{j,1})\big\|^2 > \max\Big\{8\tilde{L}\sqrt{|\mathcal{A}|}\, \sup_{t \ge j} \phi'_t,\ \delta\Big\}\Big\}, \qquad A_{i,i} := \Big\{\big\|\nabla V(\theta_{i,1})\big\|^2 > \max\Big\{8\tilde{L}\sqrt{|\mathcal{A}|}\, \sup_{t \ge i} \phi'_t,\ \delta\Big\}\Big\},$$
$$A_{i,+\infty} := \bigcap_{j=i}^{+\infty} \Big\{\big\|\nabla V(\theta_{j,1})\big\|^2 > \max\Big\{8\tilde{L}\sqrt{|\mathcal{A}|}\, \sup_{t \ge j} \phi'_t,\ \delta\Big\}\Big\},$$
where $\hat{V} := V^* - V$ and $\delta$ is a positive constant. Asymptotically, we can obtain the existence of an $N_0$ such that for $n > N_0$, there is
$$\mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2 > 8\sqrt{1 + \delta_0}\, R_{\max}^2\, O\Big(\sum_{k=1}^{T-1} \epsilon_{n,k}\Big) \sqrt{\mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2}.$$
Note that here $N_0$ only depends on the parameters $\delta$, $R_{\max}$, $|\mathcal{A}|$, $L$ and the learning rate $\epsilon_{n,k}$ itself. Then, through Equation (14) in the proof of Lemma 3.1, when $i > N_0$, $n \ge i$, and $X \in \mathcal{F}_i$, there is
$$1_{X \cap A_{i,n}}\, \hat{V}(\theta_{n+1,1}) \le 1_{X \cap A_{i,n}}\, \hat{V}(\theta_{n,1}) - \frac{1}{2} \sum_{k=1}^{T-1} \epsilon_{n,k}\, 1_{X \cap A_{i,n}}\, \big\|\nabla V(\theta_{n,1})\big\|^2 + 1_{X \cap A_{i,n}}\, \zeta_n.$$
Notice $1_{X \cap A_{i,n}}\, \hat{V}(\theta_{n+1,1}) \ge 1_{X \cap A_{i,n+1}}\, \hat{V}(\theta_{n+1,1})$. Then,
$$1_{X \cap A_{i,n+1}}\, \hat{V}(\theta_{n+1,1}) \le 1_{X \cap A_{i,n}}\, \hat{V}(\theta_{n,1}) - \frac{1}{2} \sum_{k=1}^{T-1} \epsilon_{n,k}\, 1_{X \cap A_{i,n}}\, \big\|\nabla V(\theta_{n,1})\big\|^2 + 1_{X \cap A_{i,n}}\, \zeta_n.$$
By summing up the above expression from $i$ to $t$ and taking expectations, we obtain $\sum_{n=i}^{+\infty} \big(\sum_{k=1}^{T-1} \epsilon_{n,k}\big)\, \mathbb{E}(1_{X \cap A_{i,n}}) < +\infty$. Combining the condition $\sum_{n=i}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k} = +\infty$ and the fact that the sequence $\{\mathbb{E}(1_{X \cap A_{i,n}})\}_{n=i}^{+\infty}$ has a limit, we have $\lim_{n \to +\infty} \mathbb{E}(1_{X \cap A_{i,n}}) = 0$, which implies $\mathbb{E}(1_{X \cap A_{i,+\infty}}) = 0$. According to the arbitrariness of $\delta$, $i$ and $X$, we get $\liminf_{n \to +\infty} \|\nabla V(\theta_{n,1})\|^2 \le 8\tilde{L}\sqrt{|\mathcal{A}|}\, \limsup_{n \to +\infty} \phi_n$ a.s.

This theorem essentially guarantees that, along a subsequence, PPO-Clip makes the gradient norm of the value function converge almost surely to an $O(\limsup_{n \to +\infty} \phi_n)$ neighborhood of zero. In practice, the size of this neighborhood is determined by the advantage network. The more accurate the advantage function estimation, the smaller this neighborhood. When the advantage function is estimated unbiasedly, for example through Monte-Carlo sampling, the convergence is exact. In fact, when $\phi_n = 0$, we further improve this result by replacing the strictly weaker Assumption 3.4 with Assumption 3.3.

Theorem 3.2. If Assumptions 3.1, 3.2, and 3.3 hold and there exists a $t > 0$ such that $\phi_n = 0$ for all $n > t$, then for the sequence $\{\theta_{n,k}\}_{n=1,k=1}^{+\infty, T}$ formed by PPO-Clip (2), there are
$$\lim_{n \to +\infty} \nabla V(\theta_n) = 0 \quad a.s., \qquad \text{and} \qquad \lim_{n \to +\infty} \mathbb{E}\big\|\nabla V(\theta_n)\big\|^2 = 0.$$

This theorem indicates that PPO-Clip has the same convergence property as the best currently available results for policy gradient (Zhang et al., 2020). The proof of this theorem is by Lemma 3.1 and classic martingale methods in stochastic approximation (Robbins & Monro, 1951). We defer the proof to Appendix A.3.
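For concreteness, here is a standard example (ours, not from the paper) of a learning rate schedule satisfying the Robbins-Monro condition of Assumption 3.3, so that Theorem 3.2 applies when the sampling and advantage estimates are unbiased. Take $\epsilon_{n,k} = c/n$ for every inner index $k$ and some constant $c > 0$. Then
$$\sum_{n=1}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k} = (T-1)\, c \sum_{n=1}^{+\infty} \frac{1}{n} = +\infty, \qquad \sum_{n=1}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k}^2 = (T-1)\, c^2 \sum_{n=1}^{+\infty} \frac{1}{n^2} = (T-1)\, c^2\, \frac{\pi^2}{6} < +\infty.$$
More generally, $\epsilon_{n,k} = c/n^{\alpha}$ with $\alpha \in (1/2, 1]$ satisfies both conditions of Assumption 3.3, and hence also the weaker Assumption 3.4, since $\sum_{k=1}^{T-1} \epsilon_{n,k} = (T-1)c/n^{\alpha} \to 0$.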
Now we proceed to present the results regarding the convergence rate. Unlike existing research on the convergence of the policy gradient algorithm, which acquires last-iterate convergence rates (Zhang et al., 2020; 2021; Yuan et al., 2020; Xu et al., 2020; Huang et al., 2020), i.e., bounds of the form $\mathbb{E}\|\nabla V(\theta_n)\|^2 = O(\cdot)$, we provide the convergence rate in terms of the average iterate, i.e., $\frac{1}{\sum_{k=1}^{n} \epsilon_k} \sum_{k=1}^{n} \epsilon_k\, \mathbb{E}\|\nabla V(\theta_k)\|^2 = O(\cdot)$. We remark that this difference is due to the conditions of the analysis. In general, for stochastic gradient descent algorithms, in order to provide a convergence rate in the last-iterate sense, it is necessary to establish a local or global inequality that connects $V$ and $\nabla V$. In the study of SGD-type algorithms, examples of such inequalities are the P-L condition and the K-L condition. In the study of the PG algorithm, assumptions like bounded score functions, the Fisher non-degeneracy assumption, and transfer-error-bounded conditions are often used, which then imply the gradient dominance condition, i.e., $V^* - V(\theta) \le \|\nabla V(\theta)\| + \epsilon_{\mathrm{bias}}$. This condition is similar to the K-L condition. As we have not imposed any such additional conditions, the convergence rate in terms of average-iterate convergence is the best we could obtain.

Theorem 3.3. If Assumptions 3.1, 3.2, and 3.4 hold, then for the sequence $\{\theta_{n,k}\}_{n=1,k=1}^{+\infty, T}$ formed by PPO-Clip (2), there is
$$\min_{t=1,2,\ldots,n} \mathbb{E}\big\|\nabla V(\theta_{t,1})\big\|^2 = O\Big(\frac{\sum_{t=1}^{n} \epsilon_t^2}{\sum_{t=1}^{n} \epsilon_t}\Big) + O\Big(\limsup_{n \to +\infty} \phi_n\Big),$$
and
$$\frac{1}{\sum_{t=1}^{n} \epsilon_t} \sum_{t=1}^{n} \epsilon_t\, \mathbb{E}\big\|\nabla V(\theta_{t,1})\big\|^2 = O\Big(\frac{\sum_{t=1}^{n} \epsilon_t^2}{\sum_{t=1}^{n} \epsilon_t}\Big) + O\Big(\limsup_{n \to +\infty} \phi_n\Big),$$
where $\epsilon_t := \sum_{k=1}^{T-1} \epsilon_{t,k}$.

Proof. By Lemma 3.1, we have
$$\mathbb{E}\big(V^* - V(\theta_{n+1,1})\big) \le \mathbb{E}\big(V^* - V(\theta_{n,1})\big) - \epsilon_n\, \mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2 + \sqrt{1 + \delta_0}\, R_{\max}^2 \sqrt{|\mathcal{A}|}\; \phi_n\, \epsilon_n + O(\epsilon_n^2),$$
where $\epsilon_n := \sum_{k=1}^{T-1} \epsilon_{n,k}$. Summing over $n$ and rearranging, we obtain
$$\min_{t=1,2,\ldots,n} \mathbb{E}\big\|\nabla V(\theta_{t,1})\big\|^2 \le \frac{1}{\sum_{t=1}^{n} \epsilon_t} \sum_{t=1}^{n} \epsilon_t\, \mathbb{E}\big\|\nabla V(\theta_{t,1})\big\|^2 \le O\Big(\frac{\sum_{t=1}^{n} \epsilon_t^2}{\sum_{t=1}^{n} \epsilon_t}\Big) + O\Big(\limsup_{n \to +\infty} \phi_n\Big).$$
The convergence rates in the senses of best-iterate convergence and average-iterate convergence follow, respectively.

Our analyses might provide some insight that PPO could be more efficient than PG in some scenarios. To compare them, we set the learning rate in PPO-Clip as $\epsilon_{n,k} = \epsilon_n$ ($1 \le k \le T$), and the learning rate in PG as $\epsilon_n$. The convergence rate of PPO-Clip is $O\big(T \sum_{t=1}^{n} \epsilon_t^2 / \sum_{t=1}^{n} \epsilon_t\big) + O(\limsup_{n \to +\infty} \phi_n)$, while the convergence rate of PG is $O\big(\sum_{t=1}^{n} \epsilon_t^2 / \sum_{t=1}^{n} \epsilon_t\big) + O(\limsup_{n \to +\infty} \phi_n)$. One could observe that both convergence rates are of the same order, up to a factor of $T$, which is the number of off-policy update steps using $\pi_{\mathrm{old}}$. In this way, while PG requires continuous sampling to obtain on-policy data, PPO-Clip uses one sample batch for $T$ steps. Improving sample efficiency is critical in many reinforcement learning tasks, and when sampling is the bottleneck, PPO-Clip could have an edge over PG.

4 CONCLUSION AND FUTURE WORKS

We investigate PPO-Clip, one of the most popular RL algorithms yet to be fully understood in theory. We show that PPO-Clip converges to stationary points up to a bias term, and when the biases in Markovian sampling and in advantage estimation are zero, this bias term is also zero. We further prove the convergence rate in the senses of average-iterate convergence and subsequence convergence. We use a minimal set of conditions in our analysis. The conditions on step sizes and policy parametrization can be satisfied by the choices of step sizes and policy class. The condition of a bounded reward function is rather common and mild in RL analysis. The relaxation of conditions incurs several technical challenges to overcome, which motivates us to introduce our key lemma, Lemma 3.1.
Our lemma derives the recursive property of the optimization process under the presence of the clipped ratio of two stochastic policies, and could be of independent interest for the analysis of other optimization algorithms. It remains unknown to us whether the optimality of PPO-Clip could be deduced under the current conditions, but it would be important for future works to further characterize the optimality of the stationary points that PPO-Clip converges to. Another direction is to investigate other conditions on the policy that are general enough to include neural networks while providing improved results on convergence and optimality.

ACKNOWLEDGEMENT

Ruinan Jin and Baoxiang Wang are partially supported by the National Natural Science Foundation of China (62106213, 72150002, 72394361) and the Shenzhen Science and Technology Program (RCBS20210609104356063, JCYJ20210324120011032). The work is partially supported by the Guangdong Provincial Key Laboratory of Big Data Computing of The Chinese University of Hong Kong, Shenzhen.

REFERENCES

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431-4506, 2021.

Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, and Niao He. Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies. arXiv preprint arXiv:2302.01734, 2023.

Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pp. 1467-1476. PMLR, 2018.

Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020.

Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Momentum-based policy gradient methods. In International Conference on Machine Learning, pp. 4422-4433. PMLR, 2020.

Nai-Chieh Huang, Ping-Chun Hsieh, Kuo-Hao Ho, Hsuan-Yu Yao, Kai-Chun Hu, Liang-Chun Ouyang, and I Wu. Neural PPO-Clip attains global optimality: A hinge loss perspective. arXiv preprint arXiv:2110.13799, 2021.

Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, and Weixun Wang. The 37 implementation details of proximal policy optimization. In ICLR Blog Track, 2022. URL https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/.

Ruinan Jin, Yu Xing, and Xingkang He. On the convergence of mSGD and AdaGrad for stochastic optimization. In International Conference on Learning Representations, 2022.

Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.

Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008-1014, 2000.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy. Advances in Neural Information Processing Systems, 32:10565-10576, 2019.

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pp. 6820-6829. PMLR, 2020.
Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897. PMLR, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, pp. 541-551. PMLR, 2020.

Hsuan-Yu Yao, Ping-Chun Hsieh, Kuo-Hao Ho, Kai-Chun Hu, Liang-Chun Ouyang, and I-Chen Wu. Hinge policy optimization: Rethinking policy improvement and reinterpreting PPO, 2022. URL https://openreview.net/forum?id=gex-2G2bLdh.

Huizhuo Yuan, Xiangru Lian, Ji Liu, and Yuren Zhou. Stochastic recursive momentum for policy gradient methods. arXiv preprint arXiv:2003.04302, 2020.

Junwei Zhang, Zhenghao Zhang, Shuai Han, and Shuai Lü. Proximal policy optimization via enhanced exploration efficiency. Information Sciences, 609:750-765, 2022.

Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi Wang, et al. On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228-2240, 2021.

Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586-3612, 2020.

Zhenyu Zhang, Xiangfeng Luo, Tong Liu, Shaorong Xie, Jianshu Wang, Wei Wang, Yang Li, and Yan Peng. Proximal policy optimization with mixed distributed training. In International Conference on Tools with Artificial Intelligence, pp. 1452-1456. IEEE, 2019.

A.1 AUXILIARY LEMMAS

Lemma A.1. If Assumption 3.1 holds, we have $\|\nabla \pi_\theta(a|s)\| \le \sqrt{2L \min\{\pi_\theta(a|s),\ 1 - \pi_\theta(a|s)\}}$ for all $s, a$.

Proof. By applying Lemma 10 of Jin et al. (2022), we get
$$\|\nabla \pi_\theta(a|s)\| \le \sqrt{2L\big(\pi_\theta(a|s) - \inf_{\theta' \in \mathbb{R}^d} \pi_{\theta'}(a|s)\big)} \le \sqrt{2L\, \pi_\theta(a|s)}. \tag{4}$$
Then we construct a function $f^{(s,a)}(\theta) := 1 - \pi_\theta(a|s)$. By its definition we have $\nabla f^{(s,a)}(\theta) = -\nabla \pi_\theta(a|s)$. By Assumption 3.1, we have for any $\theta_1, \theta_2$ that
$$\big\|\nabla f^{(s,a)}(\theta_1) - \nabla f^{(s,a)}(\theta_2)\big\| = \big\|\nabla \pi_{\theta_1}(a|s) - \nabla \pi_{\theta_2}(a|s)\big\| \le L \|\theta_1 - \theta_2\|.$$
Similarly, we obtain
$$\big\|\nabla f^{(s,a)}(\theta)\big\| \le \sqrt{2L\big(f^{(s,a)}(\theta) - \inf_{\theta' \in \mathbb{R}^d} f^{(s,a)}(\theta')\big)} \le \sqrt{2L\, f^{(s,a)}(\theta)},$$
that is,
$$\|\nabla \pi_\theta(a|s)\| \le \sqrt{2L\big(1 - \pi_\theta(a|s)\big)}. \tag{5}$$
Combining Equation (4) and Equation (5) yields the desired inequality.

Lemma A.2. If Assumption 3.1 holds, then for the sequence $\{\theta_{n,k}\}_{n=1,k=1}^{+\infty, T}$ formed by PPO-Clip (2), there is, for all $n > 0$ and $0 < k < T$,
$$\|\theta_{n,k+1} - \theta_{n,k}\| \le \epsilon_{n,k}\, \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\frac{\sqrt{2L(1 + \delta_0)}}{\sqrt{\pi_{\theta_{n,1}}(a|s)}}\, \big|\hat{A}^{\pi_{\theta_{n,1}}}(s, a)\big|\right].$$

Proof. By the update process of PPO-Clip in Equation (2), we have
$$\theta_{n,k+1} - \theta_{n,k} = \epsilon_{n,k}\, \nabla_{\theta_{n,k}} \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\min\Big\{\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a),\ \mathrm{clip}\Big(\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}, 1 - \delta_0, 1 + \delta_0\Big)\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\}\Big]. \tag{6}$$
Define the event $B^{(s,a)}_{n,k} := \big\{1 - \delta_0 < \pi_{\theta_{n,k}}(a|s)/\pi_{\theta_{n,1}}(a|s) < 1 + \delta_0\big\}$ and its indicator function as $1_{B^{(s,a)}_{n,k}}$.
Then,
$$\|\theta_{n,k+1} - \theta_{n,k}\| = \epsilon_{n,k}\, \Big\|\hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[1_{B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big]\Big\| \le \epsilon_{n,k}\, \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\Big\|1_{B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\|\right].$$
By Lemma A.1, $\|\nabla \pi_{\theta_{n,k}}(a|s)\| \le \sqrt{2L\, \pi_{\theta_{n,k}}(a|s)}$, and on $B^{(s,a)}_{n,k}$ we have $\pi_{\theta_{n,k}}(a|s) \le (1 + \delta_0)\, \pi_{\theta_{n,1}}(a|s)$. This implies
$$\|\theta_{n,k+1} - \theta_{n,k}\| \le \epsilon_{n,k}\, \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\frac{\sqrt{2L(1 + \delta_0)}}{\sqrt{\pi_{\theta_{n,1}}(a|s)}}\, \big|\hat{A}^{\pi_{\theta_{n,1}}}(s, a)\big|\right].$$
The lemma follows.

A.2 PROOF OF LEMMA 3.1

Proof. We continue to use the events $B^{(s,a)}_{n,k}$ and the indicator functions $1_{B^{(s,a)}_{n,k}}$ defined in the proof of Lemma A.2. Using these notations, we rewrite the gradient term in PPO-Clip as
$$\nabla_{\theta_{n,k}} \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\min\Big\{\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a),\ \mathrm{clip}\Big(\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}, 1 - \delta_0, 1 + \delta_0\Big)\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\}\Big] = \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[1_{B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big]$$
$$= \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big] - \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[1_{\Omega/B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big] + \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[1_{B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,k}}(a|s) - \nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big]$$
$$=: \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big] + X_{n,k} + Y_{n,k}. \tag{7}$$
Inspecting the norm of the term $X_{n,k}$ in Equation (7) yields
$$\|X_{n,k}\| \le \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\Big\|1_{\Omega/B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\|\right]. \tag{8}$$
Now we aim to estimate the indicator $1_{\Omega/B^{(s,a)}_{n,k}}$. Through the mean value theorem and Lemma A.1, there exists a point $\tilde{\theta}_{n,k}$ between $\theta_{n,k}$ and $\theta_{n,1}$ such that
$$e^{-\frac{\sqrt{2L}}{\sqrt{\pi_{\tilde{\theta}_{n,k}}(a|s)}} \|\theta_{n,k} - \theta_{n,1}\|} \le \frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)} = e^{\ln \pi_{\theta_{n,k}}(a|s) - \ln \pi_{\theta_{n,1}}(a|s)} \le e^{\frac{\|\nabla \pi_{\tilde{\theta}_{n,k}}(a|s)\|}{\pi_{\tilde{\theta}_{n,k}}(a|s)} \|\theta_{n,k} - \theta_{n,1}\|} \le e^{\frac{\sqrt{2L}}{\sqrt{\pi_{\tilde{\theta}_{n,k}}(a|s)}} \|\theta_{n,k} - \theta_{n,1}\|}. \tag{9}$$
We construct another event
$$C^{(s,a)}_{n,k,\lambda} := \left\{\sqrt{\pi_{\theta_{n-1,1}}(a|s)} > \big(\sqrt{2L} + \lambda\big) \sum_{t=1}^{T-1} \epsilon_{n-1,t}\; \hat{\mathbb{E}}_{\pi_{\theta_{n-1,1}}}\!\left[\frac{\sqrt{2L(1 + \delta_0)}}{\sqrt{\pi_{\theta_{n-1,1}}(a|s)}}\, \big|\hat{A}^{\pi_{\theta_{n-1,1}}}(s, a)\big|\right]\right\},$$
where $\lambda > 0$ is a coefficient to be determined. It is worth noting that the event $C^{(s,a)}_{n,k,\lambda}$ belongs to the $\sigma$-field $\mathcal{F}_{n-1}$, which is used in the subsequent proof. Then,
$$\sqrt{\pi_{\tilde{\theta}_{n,k}}(a|s)} = \sqrt{\pi_{\theta_{n,1}}(a|s)} + \Big(\sqrt{\pi_{\tilde{\theta}_{n,k}}(a|s)} - \sqrt{\pi_{\theta_{n,1}}(a|s)}\Big) \ge \sqrt{\pi_{\theta_{n,1}}(a|s)} - \Big|\sqrt{\pi_{\tilde{\theta}_{n,k}}(a|s)} - \sqrt{\pi_{\theta_{n,1}}(a|s)}\Big|.$$
By inspecting the function $\sqrt{\pi_\theta(a|s)}$, we obtain $\nabla \sqrt{\pi_\theta(a|s)} = \frac{\nabla \pi_\theta(a|s)}{2\sqrt{\pi_\theta(a|s)}}$, whose norm is at most $\sqrt{2L}$ by Lemma A.1. That means
$$\Big|\sqrt{\pi_{\tilde{\theta}_{n,k}}(a|s)} - \sqrt{\pi_{\theta_{n,1}}(a|s)}\Big| \le \sqrt{2L}\, \|\tilde{\theta}_{n,k} - \theta_{n,1}\| \le \sqrt{2L}\, \|\theta_{n,k} - \theta_{n,1}\|.$$
Therefore, whenever $C^{(s,a)}_{n,k,\lambda}$ happens, through Lemma A.2 there is
$$\sqrt{\pi_{\tilde{\theta}_{n,k}}(a|s)} \ge \sqrt{\pi_{\theta_{n,1}}(a|s)} - \sqrt{2L}\, \|\theta_{n,k} - \theta_{n,1}\| \ge \sqrt{\pi_{\theta_{n-1,1}}(a|s)} - \sqrt{2L}\, \|\theta_{n,1} - \theta_{n-1,1}\| - \sqrt{2L}\, \|\theta_{n,k} - \theta_{n,1}\| \ge \lambda\, \|\theta_{n,k} - \theta_{n,1}\|.$$
Substituting the above inequality into Equation (9), we have that whenever $C^{(s,a)}_{n,k,\lambda}$ happens, there is
$$\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)} \in \Big(e^{-\frac{\sqrt{2L}}{\lambda}},\ e^{\frac{\sqrt{2L}}{\lambda}}\Big).$$
Now we choose
$$\lambda = \lambda_0 := \frac{\sqrt{2L}}{\min\{-\ln(1 - \delta_0),\ \ln(1 + \delta_0)\}},$$
which implies $\frac{\pi_{\theta_{n,k}}(a|s)}{\pi_{\theta_{n,1}}(a|s)} \in \big(e^{-\sqrt{2L}/\lambda_0},\ e^{\sqrt{2L}/\lambda_0}\big) \subseteq (1 - \delta_0, 1 + \delta_0)$, which then implies that $B^{(s,a)}_{n,k}$ happens. As $C^{(s,a)}_{n,k,\lambda_0} \subseteq B^{(s,a)}_{n,k}$, we have $\Omega/B^{(s,a)}_{n,k} \subseteq \Omega/C^{(s,a)}_{n,k,\lambda_0}$ and $1_{\Omega/B^{(s,a)}_{n,k}} \le 1_{\Omega/C^{(s,a)}_{n,k,\lambda_0}}$. Substituting this into Equation (8) yields
$$\|X_{n,k}\| \le \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\Big\|1_{\Omega/C^{(s,a)}_{n,k,\lambda_0}} \frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\|\right].$$
We now compute the conditional expectation of $X_{n,k}$ given $\mathcal{F}_{n-1}$. Noting that $1_{\Omega/C^{(s,a)}_{n,k,\lambda_0}} \in \mathcal{F}_{n-1}$, $\|\nabla \pi_{\theta_{n,1}}(a|s)\|/\pi_{\theta_{n,1}}(a|s) \le \sqrt{2L}/\sqrt{\pi_{\theta_{n,1}}(a|s)}$, and $|\hat{A}^{\pi_{\theta_{n,1}}}(s, a)| \le R_{\max}$, we obtain
$$\big\|\mathbb{E}\big[X_{n,k} \,\big|\, \mathcal{F}_{n-1}\big]\big\| \le \sqrt{2L}\, \sqrt{1 + \delta_0}\, R_{\max}^3\, |\mathcal{A}|\ \mathbb{E}_{\pi_{\theta_{n-1,1}}}\!\Big[\frac{1_{\Omega/C^{(s,a)}_{n,k,\lambda_0}}}{\sqrt{\pi_{\theta_{n-1,1}}(a|s)}}\Big]. \tag{10}$$
Now we inspect the norm of the term $Y_{n,k}$ in Equation (7). By Assumption 3.1 and Lemma A.2, we have
$$\|Y_{n,k}\| \le \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\Big\|1_{B^{(s,a)}_{n,k}} \frac{\nabla \pi_{\theta_{n,k}}(a|s) - \nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big\|\right] \le \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\!\left[\frac{L\, \|\theta_{n,k} - \theta_{n,1}\|}{\pi_{\theta_{n,1}}(a|s)}\, \big|\hat{A}^{\pi_{\theta_{n,1}}}(s, a)\big|\right],$$
with $\|\theta_{n,k} - \theta_{n,1}\| \le \sum_{t=1}^{k-1} \epsilon_{n,t}\, \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\big[\sqrt{2L(1+\delta_0)}\, |\hat{A}^{\pi_{\theta_{n,1}}}(s, a)| / \sqrt{\pi_{\theta_{n,1}}(a|s)}\big]$. Taking the conditional expectation then gives
$$\big\|\mathbb{E}\big[Y_{n,k} \,\big|\, \mathcal{F}_{n-1}\big]\big\| \le L\, \sqrt{2L(1 + \delta_0)}\, R_{\max}^3\, |\mathcal{A}|\ \mathbb{E}_{\pi_{\theta_{n-1,1}}}\!\Big[\frac{1}{\sqrt{\pi_{\theta_{n-1,1}}(a|s)}}\Big] \sum_{t=1}^{k-1} \epsilon_{n,t}. \tag{11}$$
By the Lipschitz continuity of $\nabla \pi$, for any $\theta_1, \theta_2 \in \mathbb{R}^d$, there is
$$\|\nabla V(\theta_1) - \nabla V(\theta_2)\| \le \Big(\frac{|\mathcal{A}| R_{\max} L}{(1 - \gamma)^2} + \frac{2(1 + \gamma) |\mathcal{A}| R_{\max} L}{(1 - \gamma)^3}\Big) \|\theta_1 - \theta_2\|.$$
For notational convenience, we denote this coefficient as $\tilde{L} := \frac{|\mathcal{A}| R_{\max} L}{(1 - \gamma)^2} + \frac{2(1 + \gamma) |\mathcal{A}| R_{\max} L}{(1 - \gamma)^3}$.
Then the discrete difference $V^* - V(\theta_{n+1,1}) - (V^* - V(\theta_{n,1}))$ of $V(\theta_{n,1})$ can be expanded as
$$V^* - V(\theta_{n+1,1}) - \big(V^* - V(\theta_{n,1})\big) \le -\nabla V(\theta_{n,1})^{\top}(\theta_{n+1,1} - \theta_{n,1}) + \tilde{L}\, \|\theta_{n+1,1} - \theta_{n,1}\|^2$$
$$= -\nabla V(\theta_{n-1,1})^{\top}(\theta_{n+1,1} - \theta_{n,1}) - \big(\nabla V(\theta_{n,1}) - \nabla V(\theta_{n-1,1})\big)^{\top}(\theta_{n+1,1} - \theta_{n,1}) + \tilde{L}\, \|\theta_{n+1,1} - \theta_{n,1}\|^2. \tag{12}$$
We substitute Equation (2) and Equation (7) into Equation (12) and obtain
$$V^* - V(\theta_{n+1,1}) - \big(V^* - V(\theta_{n,1})\big) \le -\nabla V(\theta_{n-1,1})^{\top} \sum_{k=1}^{T-1} \epsilon_{n,k}\, \hat{\mathbb{E}}_{\pi_{\theta_{n,1}}}\Big[\frac{\nabla \pi_{\theta_{n,1}}(a|s)}{\pi_{\theta_{n,1}}(a|s)}\, \hat{A}^{\pi_{\theta_{n,1}}}(s, a)\Big] - \nabla V(\theta_{n-1,1})^{\top} \sum_{k=1}^{T-1} \epsilon_{n,k} X_{n,k}$$
$$\quad - \nabla V(\theta_{n-1,1})^{\top} \sum_{k=1}^{T-1} \epsilon_{n,k} Y_{n,k} + \tilde{L}\, \|\theta_{n,1} - \theta_{n-1,1}\|\, \|\theta_{n+1,1} - \theta_{n,1}\| + \tilde{L}\, \|\theta_{n+1,1} - \theta_{n,1}\|^2. \tag{13}$$
Then substituting Equation (10) and Equation (11) into Equation (13) yields
$$V^* - V(\theta_{n+1,1}) \le \big(V^* - V(\theta_{n,1})\big) - \sum_{k=1}^{T-1} \epsilon_{n,k}\, \big\|\nabla V(\theta_{n,1})\big\|^2 + \sqrt{1 + \delta_0}\, R_{\max}^2 \sqrt{|\mathcal{A}|}\; \phi_n \sum_{k=1}^{T-1} \epsilon_{n,k} + O\Big(\sum_{k=1}^{T-1} \epsilon_{n,k}^2\Big) + \zeta_n. \tag{14}$$
The lemma follows.

A.3 PROOF OF THEOREM 3.2

Proof. Through Lemma 3.1, we know that for $n > t$, there is
$$\mathbb{E}\big(V^* - V(\theta_{n+1,1})\big) \le \mathbb{E}\big(V^* - V(\theta_{n,1})\big) - \sum_{k=1}^{T-1} \epsilon_{n,k}\, \mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2 + \sqrt{1 + \delta_0}\, R_{\max}^2\, O\Big(\sum_{k=1}^{T-1} \epsilon_{n,k}^2\Big) \sqrt{\mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2}. \tag{15}$$
By inspecting the asymptotic order, we conclude the existence of an $N_0$ such that for $n > N_0$, we have
$$\sum_{k=1}^{T-1} \epsilon_{n,k}\, \mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2 > 8\sqrt{1 + \delta_0}\, R_{\max}^2\, O\Big(\sum_{k=1}^{T-1} \epsilon_{n,k}^2\Big) \sqrt{\mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2}.$$
Note that here $N_0$ depends only on the parameters $\delta_0$, $R_{\max}$, $|\mathcal{A}|$, $L$ and the learning rate $\epsilon_{n,k}$. Substituting the above inequality into (15), we know that when $n > i_0 := \max\{N_0, t\}$, there is
$$\mathbb{E}\big(V^* - V(\theta_{n+1,1})\big) \le \mathbb{E}\big(V^* - V(\theta_{n,1})\big) - \frac{1}{2} \sum_{k=1}^{T-1} \epsilon_{n,k}\, \mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2. \tag{16}$$
Summing Equation (16) from $i_0$ to $t$, and noticing the condition $\sum_{n=1}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k}^2 < +\infty$, we obtain
$$\frac{1}{2} \sum_{n=i_0}^{+\infty} \Big(\sum_{k=1}^{T-1} \epsilon_{n,k}\Big) \mathbb{E}\big\|\nabla V(\theta_{n,1})\big\|^2 < +\infty, \quad \text{which implies} \quad \sum_{n=i_0}^{+\infty} \Big(\sum_{k=1}^{T-1} \epsilon_{n,k}\Big) \big\|\nabla V(\theta_{n,1})\big\|^2 < +\infty \quad a.s. \tag{17}$$
The condition $\sum_{n=1}^{+\infty} \sum_{k=1}^{T-1} \epsilon_{n,k} = +\infty$ implies $\liminf_{n \to +\infty} \|\nabla V(\theta_{n,1})\| = 0$ a.s., which means that for any $\delta > 0$, there exists a subsequence $\{q_n\}_{n=1}^{+\infty}$ such that $\|\nabla V(\theta_{q_n,1})\| < \delta$. We find all the boundary points in the sequence $\{q_n\}_{n=1}^{+\infty}$ and denote them as $\{p_n\}_{n=1}^{+\infty}$. They satisfy
$$\|\nabla V(\theta_{p_{2k-1},1})\| \le \delta, \quad \|\nabla V(\theta_{p_{2k-1}+1,1})\| > \delta, \quad \|\nabla V(\theta_{p_{2k}-1,1})\| > \delta, \quad \|\nabla V(\theta_{p_{2k},1})\| \le \delta.$$
If $\{p_n\}_{n=1}^{+\infty}$ contains at most finitely many elements, there exists an $n_0 > 0$ such that for any $n > n_0$, $\|\nabla V(\theta_{n,1})\| < \delta$. If $\{p_n\}_{n=1}^{+\infty}$ has infinitely many elements, then by Equation (17) the accumulated updates within each excursion vanish, so there exists an $n_1 > 0$ such that for any $p_{2k-1} > n_1$,
$$\sup_{p_{2k-1} < n \le p_{2k}} \big\|\nabla V(\theta_{n,1}) - \nabla V(\theta_{p_{2k-1},1})\big\| < \delta.$$
Subsequently, for any $n > \max\{i_0, n_1\}$ with $p_{2k-1} < n \le p_{2k}$, there is
$$\|\nabla V(\theta_{n,1})\| \le \|\nabla V(\theta_{p_{2k-1},1})\| + \sup_{p_{2k-1} < m \le p_{2k}} \big\|\nabla V(\theta_{m,1}) - \nabla V(\theta_{p_{2k-1},1})\big\| < 2\delta.$$
Combining the two cases, for any $n > \max\{i_0, n_0, n_1\}$, there is $\|\nabla V(\theta_{n,1})\| < 2\delta$. By the arbitrariness of $\delta$, we have $\lim_{n \to +\infty} \nabla V(\theta_{n,1}) = 0$ a.s. Moreover, for any $\theta_{n,1}$, we have $\|\nabla V(\theta_{n,1})\|^2 \le 2|\mathcal{A}| L (1 + \delta_0) R_{\max}^2$. Through Lebesgue's dominated convergence theorem, we obtain $\lim_{n \to +\infty} \mathbb{E}\|\nabla V(\theta_{n,1})\|^2 = 0$, as desired.