# Clipped Action Policy Gradient

Yasuhiro Fujita and Shin-ichi Maeda, Preferred Networks, Inc., Japan. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, PMLR 80, 2018.

Many continuous control tasks have bounded action spaces. When policy gradient methods are applied to such tasks, out-of-bound actions need to be clipped before execution, while policies are usually optimized as if the actions were not clipped. We propose a policy gradient estimator that exploits the knowledge of actions being clipped to reduce the variance in estimation. We prove that our estimator, named clipped action policy gradient (CAPG), is unbiased and achieves lower variance than the conventional estimator that ignores action bounds. Experimental results demonstrate that CAPG generally outperforms the conventional estimator, indicating that it is a better policy gradient estimator for continuous control tasks. The source code is available at https://github.com/pfnet-research/capg.

## 1. Introduction

Reinforcement learning (RL) has achieved remarkable success in recent years in a wide range of challenging tasks, such as games (Mnih et al., 2015; Silver et al., 2016; 2017), robotic manipulation (Levine et al., 2016), and locomotion (Schulman et al., 2015; 2017; Heess et al., 2017), with the help of deep neural networks. Policy gradient methods are among the most successful model-free RL algorithms (Mnih et al., 2016; Schulman et al., 2015; 2017; Gu et al., 2017b). They are particularly suitable for continuous control tasks, i.e., environments with continuous action spaces, because they directly improve policies that represent continuous distributions of actions so as to maximize expected returns.

For continuous control tasks, policies are typically represented by Gaussian distributions conditioned on current and past observations. Although Gaussian policies have unbounded support, continuous control tasks often have bounded sets of actions that can be executed (Duan et al., 2016; Brockman et al., 2016; Tassa et al., 2018). For example, when controlling the torques of motors, effective torque values are physically constrained. Policies with unbounded support, such as Gaussian policies, are usually applied to such tasks by clipping sampled actions into their bounds (Duan et al., 2016; Dhariwal et al., 2017), and policy gradients for these policies are estimated as if actions were not clipped (Chou et al., 2017).

In this study, we demonstrate that we can improve policy gradient methods by exploiting the knowledge of actions being clipped. We prove that the variance of policy gradient estimates can be strictly reduced under mild assumptions that hold for popular policy representations such as Gaussian policies with diagonal covariance matrices. Our proposed algorithm, named clipped action policy gradient (CAPG), is an alternative unbiased policy gradient estimator with lower variance than the conventional estimator. Our experimental results on MuJoCo-simulated continuous control benchmark problems (Todorov et al., 2012; Brockman et al., 2016) show that CAPG can improve the performance of existing policy gradient-based deep RL algorithms.
## 2. Preliminaries

We consider a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{U}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a set of possible states, $\mathcal{U}$ is a set of possible actions, $P$ is a state-transition probability distribution, $r : \mathcal{S} \times \mathcal{U} \to \mathbb{R}$ is a reward function, $\rho_0$ is a distribution of the initial state $s_0$, and $\gamma \in (0, 1]$ is a discount factor. A probability distribution of actions conditioned on a state is referred to as a policy. The probability density function (PDF) of a policy is denoted by $\pi$. RL algorithms aim to find a policy that maximizes the expected cumulative discounted reward from initial states,

$$\eta(\pi) = \mathbb{E}_{s_0, u_0, \ldots}\Big[\sum_t \gamma^t r(s_t, u_t) \,\Big|\, \pi\Big],$$

where $\mathbb{E}_{s_0, u_0, \ldots}[\,\cdot \mid \pi]$ denotes an expected value with respect to a state-action sequence $s_0 \sim \rho_0(\cdot)$, $u_0 \sim \pi(\cdot \mid s_0)$, $s_1 \sim P(\cdot \mid s_0, u_0)$, $u_1 \sim \pi(\cdot \mid s_1), \ldots$. The state-action value function of a policy $\pi$ is defined as

$$Q^{\pi}(s, u) = \mathbb{E}_{s_1, u_1, \ldots}\Big[\sum_t \gamma^t r(s_t, u_t) \,\Big|\, s_0 = s,\, u_0 = u,\, \pi\Big].$$

One way to find $\pi^* = \operatorname{argmax}_\pi \eta(\pi)$ is to adjust the parameters $\theta$ of a parameterized policy $\pi_\theta$ by following the gradient $\nabla_\theta \eta(\pi_\theta)$, which is referred to as the policy gradient. The policy gradient theorem (Sutton et al., 1999) states that

$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_s\big[\mathbb{E}_u[Q^{\pi_\theta}(s, u)\,\psi(s, u) \mid s]\big],$$

where $\psi(s, u) = \nabla_\theta \log \pi_\theta(u \mid s)$, $\mathbb{E}_u[\,\cdot \mid s]$ denotes a conditional expected value with respect to $\pi_\theta(\cdot \mid s)$, and $\mathbb{E}_s[\,\cdot\,]$ denotes an (improper) expected value with respect to the (improper) discounted state distribution $\rho^{\pi_\theta}(\cdot)$, which is defined as $\sum_t \gamma^t \int \rho_0(s_0)\, p(s_t = s \mid s_0, \pi)\, ds_0$.

In practice, the policy gradient is often estimated from a finite number of samples $\{(s^{(i)}, u^{(i)}) \mid u^{(i)} \sim \pi_\theta(\cdot \mid s^{(i)}),\ i = 1, \ldots, N\}$ as

$$\frac{1}{N} \sum_i Q^{\pi_\theta}(s^{(i)}, u^{(i)})\,\psi(s^{(i)}, u^{(i)}). \qquad (1)$$

RL algorithms that rely on this estimate are referred to as policy gradient methods. While this estimate is unbiased, its variance is typically high, which is considered a crucial problem of policy gradient methods. We address this problem by estimating $\nabla_\theta \eta(\pi_\theta)$ in an unbiased manner with lower variance than (1). (When $\theta$ is not a scalar, we consider the variance of the gradient with respect to each element of $\theta$ throughout the paper.) To this end, we derive a random variable $Y$ such that $\mathbb{V}[Y] \le \mathbb{V}[X]$ and $\mathbb{E}[Y] = \mathbb{E}[X]$, where $X = Q^{\pi_\theta}(s, u)\,\psi(s, u)$. Because $\mathbb{E}[X] = \mathbb{E}_s[\mathbb{E}_u[X \mid s]]$ and $\mathbb{V}[X] = \mathbb{V}_s[\mathbb{E}_u[X \mid s]] + \mathbb{E}_s[\mathbb{V}_u[X \mid s]]$, it is sufficient to show that

$$\mathbb{E}_u[Y \mid s] = \mathbb{E}_u[X \mid s], \qquad (2)$$

$$\mathbb{V}_u[Y \mid s] \le \mathbb{V}_u[X \mid s] \qquad (3)$$

for all $s$. For notational simplicity, $\mathbb{E}_u[\,\cdot \mid s]$ and $\mathbb{V}_u[\,\cdot \mid s]$ are written as $\mathbb{E}_u[\,\cdot\,]$ and $\mathbb{V}_u[\,\cdot\,]$ below, respectively.

The exact value of $Q^{\pi_\theta}(s, u)$ is usually not available and needs to be estimated. It is often estimated from the rewards observed after executing $u$ at $s$, sometimes combined with function approximation to balance bias and variance (Schulman et al., 2016; Mnih et al., 2016), but this is possible only for the action $u$ that is actually executed at $s$. Our algorithm assumes that estimates of $Q^{\pi_\theta}(s, u)$ are available only for such $(s, u)$ pairs, and is thus applicable to such cases.

## 3. Clipped Action Policy Gradient

We consider the case where any action $u \in \mathbb{R}^d$ ($d \ge 1$) chosen by an agent is clipped by the environment into a range $[\alpha, \beta] \subset \mathbb{R}^d$. That is, the state-transition PDF and the reward function satisfy

$$P(s' \mid s, u) = P(s' \mid s, \operatorname{clip}(u, \alpha, \beta)), \qquad (4)$$

$$r(s, u) = r(s, \operatorname{clip}(u, \alpha, \beta)), \qquad (5)$$

respectively. The clip function is defined as $\operatorname{clip}(u, \alpha, \beta) = \max(\min(u, \beta), \alpha)$, where $\max$ and $\min$ are computed elementwise when $u$ is a vector, i.e., $d \ge 2$. Each of $\alpha$ and $\beta$ can be a constant or a function of $s$. The case where the reward function depends on actions before clipping is discussed in Section 3.4.
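To make the setup concrete, here is a minimal NumPy sketch, not taken from the paper's released code, of the conventional estimator (1) for a diagonal Gaussian policy whose actions are clipped only at execution time, as in (4) and (5). The helper names (`gaussian_score`, `vanilla_pg_estimate`, `estimate_q`) are hypothetical, and for simplicity the per-state mean and log standard deviation are treated directly as the policy parameters $\theta$.

```python
import numpy as np

def gaussian_score(u, mean, log_std):
    """Gradient of log N(u; mean, exp(log_std)^2) w.r.t. (mean, log_std)."""
    std = np.exp(log_std)
    z = (u - mean) / std
    return np.concatenate([z / std,        # d log pi / d mean
                           z ** 2 - 1.0])  # d log pi / d log_std

def vanilla_pg_estimate(states, mean, log_std, alpha, beta, estimate_q, rng):
    """Monte Carlo estimate of (1) with clipping applied only at execution.

    The score psi(s, u) is evaluated at the *unclipped* sample u, exactly as
    the conventional estimator does; only the executed action is clipped.
    """
    grads = []
    for s in states:
        u = mean + np.exp(log_std) * rng.standard_normal(mean.shape)
        u_exec = np.clip(u, alpha, beta)  # what the environment receives, Eqs. (4)-(5)
        q_hat = estimate_q(s, u_exec)     # e.g., an empirical return from executing u_exec at s
        grads.append(q_hat * gaussian_score(u, mean, log_std))
    return np.mean(grads, axis=0)
```

In practice the mean and log standard deviation would be produced by a neural network and the score backpropagated through it; the sketch only illustrates that the conventional estimator evaluates $\psi$ at the unclipped sample $u$ even though the environment only ever sees $\operatorname{clip}(u, \alpha, \beta)$.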
Before explaining our algorithm, let us characterize the class of policies we consider in this study.

**Definition 3.1** (compatible PDF). Let $p_\theta(u)$ be a PDF of $u \in \mathbb{R}$ that has a parameter $\theta$. If $p_\theta(u)$ is differentiable with respect to $\theta$ and allows the exchange of derivative and integral as $\int_{-\infty}^{\alpha} \nabla_\theta p_\theta(u)\, du = \nabla_\theta \int_{-\infty}^{\alpha} p_\theta(u)\, du$ and $\int_{\beta}^{\infty} \nabla_\theta p_\theta(u)\, du = \nabla_\theta \int_{\beta}^{\infty} p_\theta(u)\, du$, we call $p_\theta(u)$ a compatible PDF. If $p_\theta(u \mid s)$ is a conditional PDF that satisfies these conditions, we call it a compatible conditional PDF.

### 3.1. Scalar actions

First, we derive an unbiased and lower-variance estimator of the policy gradient for scalar actions, i.e., $d = 1$. The case of vector actions is covered later in Section 3.2. From (4) and (5), the state-action value function satisfies

$$Q^{\pi_\theta}(s, u) = Q^{\pi_\theta}(s, \operatorname{clip}(u, \alpha, \beta)) = \begin{cases} Q^{\pi_\theta}(s, \alpha) & \text{if } u \le \alpha, \\ Q^{\pi_\theta}(s, u) & \text{if } \alpha < u < \beta, \\ Q^{\pi_\theta}(s, \beta) & \text{if } \beta \le u. \end{cases} \qquad (6)$$

Let $X$ be a random variable that depends on $u$, and let $1_{f(u)}$ be an indicator function that takes the value 1 when $u$ satisfies the condition $f(u)$ and 0 otherwise. Because $X = 1_{u \le \alpha} X + 1_{\alpha < u < \beta} X + 1_{\beta \le u} X$, the conditional expectation $\mathbb{E}_u[X]$ decomposes into the contributions of the three regions in (6).
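As a sanity check on Definition 3.1, the short script below (our own illustration, not from the paper) verifies numerically that a one-dimensional Gaussian density is compatible in this sense: exchanging derivative and integral over the lower tail $(-\infty, \alpha]$ means that $\mathbb{E}_u[1_{u \le \alpha}\, \partial_\mu \log p_\theta(u)]$ equals $\partial_\mu \Phi((\alpha - \mu)/\sigma) = -\phi((\alpha - \mu)/\sigma)/\sigma$. The concrete values of $\mu$, $\sigma$, and $\alpha$ are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, alpha = 0.3, 1.2, -1.0   # illustrative Gaussian parameters and lower bound

# Left-hand side: E_u[ 1_{u <= alpha} * d/dmu log p_theta(u) ] by Monte Carlo.
u = mu + sigma * rng.standard_normal(2_000_000)
score_mu = (u - mu) / sigma ** 2
mc_estimate = np.mean((u <= alpha) * score_mu)

# Right-hand side: d/dmu P(u <= alpha) = d/dmu Phi((alpha - mu)/sigma)
#                = -phi((alpha - mu)/sigma) / sigma, in closed form.
closed_form = -norm.pdf((alpha - mu) / sigma) / sigma

print(mc_estimate, closed_form)  # the two values agree up to Monte Carlo error
```

The same closed-form gradient exists for the upper tail $[\beta, \infty)$ by symmetry, which is why Gaussian policies fall within the class of compatible PDFs considered here.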