# Stochastic Variance-Reduced Policy Gradient

Matteo Papini*1, Damiano Binaghi*1, Giuseppe Canonaco*1, Matteo Pirotta2, Marcello Restelli1

*Equal contribution. 1Politecnico di Milano, Milano, Italy. 2Inria, Lille, France. Correspondence to: Matteo Papini.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

In this paper, we propose a novel reinforcement-learning algorithm consisting of a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs.

1. Introduction

On a very general level, artificial intelligence addresses the problem of an agent that must select the right actions to solve a task. The approach of Reinforcement Learning (RL) (Sutton & Barto, 1998) is to learn the best actions by direct interaction with the environment and evaluation of the performance in the form of a reward signal. This makes RL fundamentally different from Supervised Learning (SL), where correct actions are explicitly prescribed by a human teacher (e.g., for classification, in the form of class labels). However, the two approaches share many challenges and tools. The problem of estimating a model from samples, which is at the core of SL, is equally fundamental in RL, whether we choose to model the environment, a value function, or directly a policy defining the agent's behaviour. Furthermore, when the tasks are characterized by large or continuous state-action spaces, RL needs the powerful function approximators (e.g., neural networks) that are the main subject of study of SL.

In a typical SL setting, a performance function J(θ) has to be optimized w.r.t. the model parameters θ. The set of data available for training is often a subset of all the cases of interest, which may even be infinite, leading to the optimization of finite sums that approximate the expected performance over an unknown data distribution. When generalization to the complete dataset is not taken into consideration, we talk about Empirical Risk Minimization (ERM). Even in this case, stochastic optimization is often used for reasons of efficiency. The idea of stochastic gradient (SG) ascent (Nesterov, 2013) is to iteratively focus on a random subset of the available data to obtain an approximate improvement direction. At the level of a single iteration, this can be much less expensive than taking into account all the data. However, the sub-sampling of data is a source of variance that can potentially compromise convergence, so that per-iteration efficiency and convergence rate must be traded off through a proper handling of meta-parameters.
Variance-reduced gradient algorithms such as SAG (Roux et al., 2012), SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014a) offer better ways of solving this trade-off, with significant results both in theory and in practice. Although designed explicitly for ERM, these algorithms address a problem that affects more general machine learning settings. In RL, stochastic optimization is rarely a matter of choice, since data must be actively sampled by interacting with an initially unknown environment. In this scenario, limiting the variance of the estimates is a necessity that cannot be avoided, which makes variance-reduced algorithms very interesting. Among RL approaches, policy gradient (Sutton et al., 2000) is the one that bears the closest similarity to SL solutions. The fundamental principle of these methods is to optimize a parametric policy through stochastic gradient ascent. Compared to other applications of SG, the cost of collecting samples can be very high since it requires interacting with the environment. This makes SVRG-like methods potentially much more efficient than, e.g., batch learning.

Unfortunately, RL has a series of difficulties that are not present in ERM. First, in SL the objective can often be designed to be strongly concave (we aim to maximize). This is not the case for RL, so we have to deal with non-concave objective functions. Then, as mentioned before, the dataset is not initially available and may even be infinite, which makes approximations unavoidable. This rules out SAG and SAGA because of their storage requirements, which leaves SVRG as the most promising choice. Finally, the distribution used to sample data is not under the direct control of the algorithm designer, but is a function of policy parameters that change over time as the policy is optimized, which is a form of non-stationarity. SVRG has been used in RL as an efficient technique for optimizing the per-iteration problem in Trust-Region Policy Optimization (Xu et al., 2017) or for policy evaluation (Du et al., 2017). In both cases, the optimization problems faced resemble the SL scenario and are not affected by all the previously mentioned issues.

After providing background on policy gradient and SVRG in Section 2, we propose SVRPG, a variant of SVRG for the policy gradient framework, addressing all the difficulties mentioned above (see Section 3). In Section 4 we provide convergence guarantees for our algorithm, and we show a convergence rate that has an O(1/T) dependence on the number T of iterations. In Section 5.2 we suggest how to set the meta-parameters of SVRPG, while in Section 5.3 we discuss some practical variants of the algorithm. Finally, in Section 7 we empirically evaluate the performance of our method on popular continuous RL tasks.

2. Preliminaries

In this section, we provide the essential background on policy gradient methods and stochastic variance-reduced gradient methods for finite-sum optimization.

2.1. Policy Gradient

A Reinforcement Learning task (Sutton & Barto, 1998) can be modelled with a discrete-time continuous Markov Decision Process (MDP) M = {S, A, P, R, γ, ρ}, where S is a continuous state space; A is a continuous action space; P is a Markovian transition model, where P(s′|s, a) defines the transition density from state s to s′ under action a; R is the reward function, where R(s, a) ∈ [−R, R] is the expected reward for state-action pair (s, a); γ ∈ [0, 1) is the discount factor; and ρ is the initial state distribution.
The agent's behaviour is modelled as a policy π, where π(·|s) is the density distribution over A in state s. We consider episodic MDPs with effective horizon H.¹ In this setting, we can limit our attention to trajectories of length H. A trajectory τ is a sequence of states and actions (s_0, a_0, s_1, a_1, ..., s_{H−1}, a_{H−1}) observed by following a stationary policy, where s_0 ∼ ρ. We denote with p(τ|π) the density distribution induced by policy π on the set T of all possible trajectories (see Appendix A for the definition), and with R(τ) the total discounted reward provided by trajectory τ:

$R(\tau) = \sum_{t=0}^{H-1} \gamma^t R(s_t, a_t).$

Policies can be ranked based on their expected total reward: $J(\pi) = \mathbb{E}_{\tau \sim p(\cdot|\pi)}\left[R(\tau) \mid M\right]$. Solving an MDP M means finding $\pi^* \in \arg\max_{\pi}\{J(\pi)\}$.

¹The episode duration is a random variable, but the optimal policy can reach the target state (i.e., the absorbing state) in fewer than H steps. This is not to be confused with a finite-horizon problem, where the optimal policy is non-stationary.

Policy gradient methods restrict the search for the best performing policy to a class of parametrized policies $\Pi_\theta = \{\pi_\theta : \theta \in \mathbb{R}^d\}$, with the only constraint that π_θ is differentiable w.r.t. θ. For the sake of brevity, we will denote the performance of a parametric policy by J(θ) and the probability of a trajectory τ by p(τ|θ) (on some occasions, p(τ|θ) will be replaced by p_θ(τ) for the sake of readability). The search for a locally optimal policy is performed through gradient ascent, where the policy gradient is (Sutton et al., 2000; Peters & Schaal, 2008a):

$\nabla J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right].$  (1)

Notice that the distribution defining the gradient is induced by the current policy. This aspect introduces a non-stationarity in the sampling process. Since the underlying distribution changes over time, it is necessary to resample at each update or to use weighting techniques such as importance sampling. Here, we consider the online learning scenario, where trajectories are sampled by interacting with the environment at each policy change. In this setting, stochastic gradient ascent is typically employed. At each iteration k > 0, a batch $D_N^k = \{\tau_i\}_{i=0}^{N-1}$ of N > 0 trajectories is collected using policy $\pi_{\theta_k}$. The policy is then updated as $\theta_{k+1} = \theta_k + \alpha \widehat{\nabla}_N J(\theta_k)$, where α is a step size and $\widehat{\nabla}_N J(\theta)$ is an estimate of Eq. (1) using $D_N^k$. The most common policy gradient estimators (e.g., REINFORCE (Williams, 1992) and G(PO)MDP (Baxter & Bartlett, 2001)) can be expressed as follows:

$\widehat{\nabla}_N J(\theta) = \frac{1}{N} \sum_{i=0}^{N-1} g(\tau_i|\theta), \qquad \tau_i \in D_N^k,$  (2)

where g(τ_i|θ) is an estimate of $\nabla_\theta \log p_\theta(\tau_i)\, R(\tau_i)$. Although the REINFORCE definition is simpler than the G(PO)MDP one, the latter is usually preferred due to its lower variance. We refer the reader to Appendix A for details and a formal definition of g.

The main limitation of plain policy gradient is the high variance of these estimators. The naïve approach of increasing the batch size is not an option in RL due to the high cost of collecting samples, i.e., of interacting with the environment. For this reason, the literature has focused on the introduction of baselines (i.e., functions b : S × A → R) aiming to reduce the variance (e.g., Williams, 1992; Peters & Schaal, 2008a; Thomas & Brunskill, 2017; Wu et al., 2018); see Appendix A for a formal definition of b. These baselines are usually designed to minimize the variance of the gradient estimate, but even they need to be estimated from data, which partially reduces their effectiveness.
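To make Eqs. (1)–(2) concrete, the following minimal sketch estimates the REINFORCE gradient from a batch of trajectories. It is an illustration, not the paper's implementation: the linear-Gaussian policy with a scalar action, the function names, and the trajectory layout (lists of states, actions, rewards) are assumptions made for this example.

```python
import numpy as np

def score_log_policy(theta, s, a, sigma=1.0):
    # Gradient of log pi_theta(a|s) for an illustrative linear-Gaussian
    # policy with scalar action: a ~ N(theta^T s, sigma^2).
    s = np.asarray(s, dtype=float)
    mean = theta @ s
    return (a - mean) / sigma**2 * s

def reinforce_gradient(theta, trajectories, gamma):
    # Eq. (2): average of g(tau_i|theta) over the batch, with the
    # REINFORCE choice g(tau|theta) = grad log p_theta(tau) * R(tau).
    estimates = []
    for states, actions, rewards in trajectories:
        score = sum(score_log_policy(theta, s, a)            # grad log p_theta(tau)
                    for s, a in zip(states, actions))
        ret = sum(gamma**t * r for t, r in enumerate(rewards))  # R(tau)
        estimates.append(score * ret)
    return np.mean(estimates, axis=0)
```

G(PO)MDP follows the same pattern but pairs each discounted reward only with the score terms of the preceding time steps, which is what gives it a lower variance than REINFORCE.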
On the other hand, there has been a surge of recent interest in variance-reduction techniques for gradient optimization in supervised learning (SL). Although these techniques have mainly been derived for finite-sum problems, we will show in Section 3 how they can be used in RL. In particular, we will show that the proposed SVRPG algorithm can take the best of both worlds (i.e., SL and RL), since it can be plugged into a policy gradient estimate using baselines. The aim of the next section is to describe variance-reduction techniques for finite-sum problems. In particular, we will present the SVRG algorithm that is at the core of this work.

2.2. Stochastic Variance-Reduced Gradient

Finite-sum optimization is the problem of maximizing an objective function f(θ) which can be decomposed into the sum or average of a finite number of functions z_i(θ):

$f(\theta) = \frac{1}{N} \sum_{i=1}^{N} z_i(\theta).$

This kind of optimization is very common in machine learning, where each z_i may correspond to a data sample x_i from a dataset D_N of size N (i.e., z_i(θ) = z(x_i|θ)). A common requirement is that z must be smooth and concave in θ.² Under this hypothesis, full gradient (FG) ascent (Cauchy, 1847) with a constant step size achieves a linear convergence rate in the number T of iterations (i.e., parameter updates) (Nesterov, 2013). However, each iteration requires N gradient computations, which can be too expensive for large values of N. Stochastic Gradient (SG) ascent (e.g., Robbins & Monro, 1951; Bottou & Le Cun, 2004) overcomes this problem by sampling a single sample x_i per iteration, but a vanishing step size is required to control the variance introduced by sampling. As a consequence, the lower per-iteration cost is paid with a worse, sub-linear convergence rate (Nemirovskii et al., 1983). Starting from SAG, a series of variations on SG have been proposed to achieve a better trade-off between convergence speed and cost per iteration: e.g., SAG (Roux et al., 2012), SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014a), Finito (Defazio et al., 2014b), and MISO (Mairal, 2015). The common idea is to reuse past gradient computations to reduce the variance of the current estimate. In particular, Stochastic Variance-Reduced Gradient (SVRG) is often preferred to other similar methods for its limited storage requirements, which is a significant advantage when deep and/or wide neural networks are employed.

²Note that we are considering a maximization problem instead of the classical minimization one.

Algorithm 1 SVRG
Input: a dataset D_N, number of epochs S, epoch size m, step size α, initial parameter $\theta_m^0 := \tilde\theta^0$
for s = 0 to S−1 do
  $\theta_0^{s+1} := \tilde\theta^s = \theta_m^s$
  $\tilde\mu = \nabla f(\tilde\theta^s)$
  for t = 0 to m−1 do
    $x \sim \mathcal{U}(D_N)$
    $v_t^{s+1} = \tilde\mu + \nabla z(x|\theta_t^{s+1}) - \nabla z(x|\tilde\theta^s)$
    $\theta_{t+1}^{s+1} = \theta_t^{s+1} + \alpha v_t^{s+1}$
  end for
end for
Concave case: return $\theta_m^S$
Non-concave case: return $\theta_t^{s+1}$ with (s, t) picked uniformly at random from $[0, S-1] \times [0, m-1]$

The idea of SVRG (Algorithm 1) is to alternate full and stochastic gradient updates. Every m = O(N) iterations, a snapshot $\tilde\theta$ of the current parameter is saved together with its full gradient $\nabla f(\tilde\theta) = \frac{1}{N}\sum_i \nabla z(x_i|\tilde\theta)$. Between snapshots, the parameter is updated with $\widetilde\nabla f(\theta)$, a gradient estimate corrected using the stochastic gradient. For any t ∈ {0, ..., m−1}:

$\widetilde\nabla f(\theta_t) := v_t = \nabla f(\tilde\theta) + \nabla z(x|\theta_t) - \nabla z(x|\tilde\theta),$  (3)

where x is sampled uniformly at random from D_N (i.e., $x \sim \mathcal{U}(D_N)$). Note that t = 0 corresponds to a FG step (i.e., $\widetilde\nabla f(\theta_0) = \nabla f(\tilde\theta)$) since $\theta_0 := \tilde\theta$. The corrected gradient $\widetilde\nabla f(\theta)$ is an unbiased estimate of $\nabla f(\theta)$, and it is able to control the variance introduced by sampling even with a fixed step size, achieving a linear convergence rate without resorting to a plain full gradient.
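A compact sketch of Algorithm 1 and Eq. (3) in the finite-sum setting follows. It is a sketch under our own assumptions: `grad_z(x, theta)` stands for ∇z(x|θ) and is supplied by the caller, and the concave-case return (the last iterate) is used.

```python
import numpy as np

def svrg_ascent(data, grad_z, theta0, n_epochs, m, alpha, seed=0):
    # SVRG ascent (Algorithm 1): a full gradient at each snapshot,
    # corrected stochastic gradients (Eq. (3)) in between.
    rng = np.random.default_rng(seed)
    theta_snap = np.asarray(theta0, dtype=float)
    for _ in range(n_epochs):
        # Full gradient at the snapshot parameter.
        mu = np.mean([grad_z(x, theta_snap) for x in data], axis=0)
        theta = theta_snap
        for _ in range(m):
            x = data[rng.integers(len(data))]                   # x ~ U(D_N)
            v = mu + grad_z(x, theta) - grad_z(x, theta_snap)   # Eq. (3)
            theta = theta + alpha * v                           # ascent step
        theta_snap = theta
    return theta_snap  # concave case: return the last iterate
```

In the non-concave case, one would instead store the intermediate iterates and return one of them uniformly at random, as prescribed by Algorithm 1.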
More recently, some extensions of variance-reduction algorithms to non-concave objectives have been proposed (e.g., Allen-Zhu & Hazan, 2016; Reddi et al., 2016a;b). In this scenario, f is typically required to be L-smooth, i.e., $\|\nabla f(\theta') - \nabla f(\theta)\|_2 \le L \|\theta' - \theta\|_2$ for all $\theta, \theta' \in \mathbb{R}^n$ and for some Lipschitz constant L. Under this hypothesis, the convergence rate of SG is $O(1/\sqrt{T})$ (Ghadimi & Lan, 2013), i.e., $T = O(1/\epsilon^2)$ iterations are required to get $\|\nabla f(\theta)\|_2^2 \le \epsilon$. Again, SVRG achieves the same rate as FG (Reddi et al., 2016a), which is O(1/T) in this case (Nesterov, 2013). The only additional requirement is to select the returned parameter uniformly at random among all the iterates $\theta_k$ instead of simply setting it to the final value (k being the iteration index).

3. SVRG in Reinforcement Learning

In online RL problems, the usual approach is to tune the batch size of SG to find the optimal trade-off between variance and speed. Recall that, compared to SL, the samples are not fixed in advance but need to be collected at each policy change. Since this operation may be costly, we would like to minimize the number of interactions with the environment. For these reasons, we would like to apply SVRG to RL problems in order to limit the variance introduced by sampling trajectories, which would ultimately lead to faster convergence. However, a direct application of SVRG to RL is not possible due to the following issues:

- Non-concavity: the objective function J(θ) is typically non-concave.
- Infinite dataset: the RL optimization cannot be expressed as a finite-sum problem. The objective function is an expected value over the trajectory density p_θ(τ) of the total discounted reward, for which we would need an infinite dataset.
- Non-stationarity: the distribution of the samples changes over time. In particular, the value of the policy parameter θ influences the sampling process.

To deal with non-concavity, we require J(θ) to be L-smooth, which is a reasonable assumption for common policy classes such as Gaussian³ and softmax (e.g., Furmston & Barber, 2012; Pirotta et al., 2015). Because of the infinite dataset, we can only rely on an estimate of the full gradient. Harikandeh et al. (2015) analysed this scenario under the assumption that z is concave, showing that SVRG is robust to an inexact computation of the full gradient. In particular, it is still possible to recover the original convergence rate if the error decreases at an appropriate rate. Bietti & Mairal (2017) performed a similar analysis for MISO. In Section 4 we will show how the estimation accuracy impacts the convergence results with a non-concave objective. Finally, the non-stationarity of the optimization problem introduces a bias into the SVRG estimator in Eq. (3). To overcome this limitation, we employ importance weighting (e.g., Rubinstein, 1981; Precup, 2000) to correct for the distribution shift.

We can now introduce Stochastic Variance-Reduced Policy Gradient (SVRPG) for a generic policy gradient estimator g. Pseudo-code is provided in Algorithm 2. The overall structure is the same as in Algorithm 1, but the snapshot gradient is not exact and the gradient estimate used between snapshots is corrected using importance weighting:⁴

$\widetilde\nabla J(\theta_t) = \widehat\nabla_N J(\tilde\theta) + g(\tau|\theta_t) - \omega(\tau|\theta_t, \tilde\theta)\, g(\tau|\tilde\theta), \qquad \text{for any } t \in \{0, \ldots, m-1\},$
where $\widehat\nabla_N J(\tilde\theta)$ is as in Eq. (2), with D_N sampled using the snapshot policy $\pi_{\tilde\theta}$, τ is sampled from the current policy $\pi_{\theta_t}$, and $\omega(\tau|\theta_t, \tilde\theta) = \frac{p(\tau|\tilde\theta)}{p(\tau|\theta_t)}$ is an importance weight from $\pi_{\theta_t}$ to the snapshot policy $\pi_{\tilde\theta}$. Similarly to SVRG, we have that $\theta_0 := \tilde\theta$, so the first update is a FG step.

³See Appendix C for more details on the Gaussian policy case.
⁴Note that g can be any unbiased estimator, with or without a baseline. The unbiasedness is required for the theoretical results (e.g., Appendix A).

Algorithm 2 SVRPG
Input: number of epochs S, epoch size m, step size α, batch size N, mini-batch size B, gradient estimator g, initial parameter $\theta_m^0 := \tilde\theta^0 := \theta_0$
for s = 0 to S−1 do
  $\theta_0^{s+1} := \tilde\theta^s = \theta_m^s$
  Sample N trajectories $\{\tau_j\}$ from $p(\cdot|\tilde\theta^s)$
  $\tilde\mu = \widehat\nabla_N J(\tilde\theta^s)$ (see Eq. (2))
  for t = 0 to m−1 do
    Sample B trajectories $\{\tau_i\}$ from $p(\cdot|\theta_t^{s+1})$
    $c_t^{s+1} = \frac{1}{B}\sum_{i=0}^{B-1}\left[ g(\tau_i|\theta_t^{s+1}) - \omega(\tau_i|\theta_t^{s+1}, \tilde\theta^s)\, g(\tau_i|\tilde\theta^s) \right]$
    $v_t^{s+1} = \tilde\mu + c_t^{s+1}$
    $\theta_{t+1}^{s+1} = \theta_t^{s+1} + \alpha v_t^{s+1}$
  end for
end for
return $\theta_A := \theta_t^{s+1}$ with (s, t) picked uniformly at random from $[0, S-1] \times [0, m-1]$

Our update is still fundamentally on-policy since the weighting concerns only the correction term. However, this partial off-policyness represents an additional source of variance. This is a well-known issue of importance sampling (e.g., Thomas et al., 2015). To mitigate it, we use mini-batches of trajectories of size B ≪ N to average the correction, i.e.,

$\widetilde\nabla J(\theta_t) := v_t = \widehat\nabla_N J(\tilde\theta) + \frac{1}{B} \sum_{i=0}^{B-1} \left[ g(\tau_i|\theta_t) - \omega(\tau_i|\theta_t, \tilde\theta)\, g(\tau_i|\tilde\theta) \right].$  (4)

It is worth noting that the full gradient and the correction term have the same expected value: $\mathbb{E}_{\tau_i \sim p(\cdot|\theta_t)}\left[ \frac{1}{B} \sum_{i=0}^{B-1} \omega(\tau_i|\theta_t, \tilde\theta)\, g(\tau_i|\tilde\theta) \right] = \nabla J(\tilde\theta)$.⁵ This property will be used to prove Lemma 3.1. The use of mini-batches is also common practice in SVRG since it can yield a performance improvement even in the supervised case (Harikandeh et al., 2015; Konečný et al., 2016). It is easy to show that the SVRPG estimator has the following desirable properties:

Lemma 3.1. Let $\widehat\nabla_N J(\theta)$ be an unbiased estimator of (1) and let $\theta^* \in \arg\max_\theta \{J(\theta)\}$. Then, the SVRG estimate in (4) is unbiased,

$\mathbb{E}\left[ \widetilde\nabla J(\theta) \right] = \nabla J(\theta),$  (5)

and, regardless of the mini-batch size B:⁶

$\mathrm{Var}\left[ \widetilde\nabla J(\theta^*) \right] = \mathrm{Var}\left[ \widehat\nabla_N J(\theta^*) \right].$  (6)

⁵The reader can refer to Appendix A for off-policy gradients and variants of REINFORCE and G(PO)MDP.
⁶For any vector x, we use Var[x] to denote the trace of the covariance matrix, i.e., $\mathrm{Tr}\left( \mathbb{E}\left[ (x - \mathbb{E}[x])(x - \mathbb{E}[x])^T \right] \right)$.

The previous results hold for both REINFORCE and G(PO)MDP. In particular, the latter result suggests that an SVRG-like algorithm using $\widetilde\nabla J(\theta)$ can achieve faster convergence, by performing many more parameter updates with the same data without introducing additional variance (at least asymptotically). Note that the randomized return value of Algorithm 2 does not affect online learning at all, but will be used as a theoretical tool in the next section.
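The following sketch shows how the corrected estimate in Eq. (4) could be assembled from a mini-batch, together with the per-trajectory importance weight ω(τ|θ_t, θ̃) = p(τ|θ̃)/p(τ|θ_t), in which the initial-state and transition densities cancel so that only per-step policy ratios remain. The function names and trajectory layout are assumptions made for illustration, and `g` stands for any unbiased per-trajectory estimator such as REINFORCE or G(PO)MDP.

```python
import numpy as np

def importance_weight(traj, log_pi, theta_t, theta_snap):
    # omega(tau | theta_t, theta_snap) = p(tau|theta_snap) / p(tau|theta_t).
    # Transition densities cancel in the ratio, leaving a product of
    # per-step policy ratios, computed here in log space for stability.
    states, actions, _ = traj
    log_w = sum(log_pi(theta_snap, s, a) - log_pi(theta_t, s, a)
                for s, a in zip(states, actions))
    return np.exp(log_w)

def svrpg_estimate(mu_snapshot, minibatch, g, log_pi, theta_t, theta_snap):
    # Eq. (4): snapshot gradient estimate plus the importance-weighted
    # correction averaged over B trajectories drawn from pi_{theta_t}.
    corrections = []
    for tau in minibatch:
        w = importance_weight(tau, log_pi, theta_t, theta_snap)
        corrections.append(g(tau, theta_t) - w * g(tau, theta_snap))
    return mu_snapshot + np.mean(corrections, axis=0)
```

With `theta_t` equal to `theta_snap`, every weight equals one and the correction vanishes, recovering the full-gradient step discussed above.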
4. Convergence Guarantees of SVRPG

In this section, we state the convergence guarantees for SVRPG with the REINFORCE or G(PO)MDP gradient estimator. We mainly leverage the recent analysis of non-concave SVRG (Reddi et al., 2016a; Allen-Zhu & Hazan, 2016). Each of the three challenges presented at the beginning of Section 3 can potentially prevent convergence, so we need additional assumptions. In Appendix C we show how Gaussian policies satisfy these assumptions.

1) Non-concavity. A common assumption, in this case, is to assume the objective function to be L-smooth. However, in RL we can consider the following assumption, which is sufficient for the L-smoothness of the objective (see Lemma B.2).

Assumption 4.1 (On policy derivatives). For each state-action pair (s, a), any value of θ, and all parameter components i, j, there exist constants 0 ≤ G, F < ∞ such that:

$\left| \frac{\partial}{\partial \theta_i} \log \pi_\theta(a|s) \right| \le G, \qquad \left| \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log \pi_\theta(a|s) \right| \le F.$

2) FG Approximation. Since we cannot compute an exact full gradient, we require the variance of the estimator to be bounded. This assumption is similar in spirit to the one in (Harikandeh et al., 2015).

Assumption 4.2 (On the variance of the gradient estimator). There is a constant V < ∞ such that, for any policy π_θ: $\mathrm{Var}\left[ g(\cdot|\theta) \right] \le V$.

3) Non-stationarity. Similarly to what is done in SL (Cortes et al., 2010), we require the variance of the importance weight to be bounded.

Assumption 4.3 (On the variance of importance weights). There is a constant W < ∞ such that, for each pair of policies encountered in Algorithm 2 and for each trajectory,

$\mathrm{Var}\left[ \omega(\tau|\theta_1, \theta_2) \right] \le W, \qquad \forall \theta_1, \theta_2 \in \mathbb{R}^d,\ \tau \sim p(\cdot|\theta_1).$

Differently from Assumptions 4.1 and 4.2, Assumption 4.3 must be enforced by a proper handling of the epoch size m. We can now state the convergence guarantees for SVRPG.

Theorem 4.4 (Convergence of the SVRPG algorithm). Assume the REINFORCE or the G(PO)MDP gradient estimator is used in SVRPG (see Equation (4)). Under Assumptions 4.1, 4.2 and 4.3, the parameter vector θ_A returned by Algorithm 2 after T = mS iterations has, for some positive constants ψ, ζ, ξ and for a proper choice of the step size α and the epoch size m, the following property:

$\mathbb{E}\left[ \left\| \nabla J(\theta_A) \right\|_2^2 \right] \le \frac{J(\theta^*) - J(\theta_0)}{\psi T} + \frac{\zeta}{N} + \frac{\xi}{B},$

where θ* is a global optimum and ψ, ζ, ξ depend only on G, F, V, W, α and m.

Refer to Appendix B for a detailed proof involving the definition of the constants and the meta-parameter constraints. By analysing the upper bound in Theorem 4.4 we observe that: I) the O(1/T) term is coherent with results on non-concave SVRG (e.g., Reddi et al., 2016a); II) the O(1/N) term is due to the FG approximation and is analogous to the one in (Harikandeh et al., 2015); III) the O(1/B) term is due to importance weighting. To achieve asymptotic convergence, the batch size N and the mini-batch size B should increase over time. In practice, it is enough to choose N and B large enough to make the second and the third terms negligible, i.e., to mitigate the variance introduced by the FG approximation and importance sampling, respectively. Once the last two terms can be neglected, the number of trajectories needed to achieve $\|\nabla J(\theta)\|_2^2 \le \epsilon$ is $O\left(\frac{B + N/m}{\epsilon}\right)$. In this sense, an advantage over batch gradient ascent can be achieved with properly selected meta-parameters. In Section 5.2 we propose a joint selection of the step size α and the epoch size m. Finally, from the return statement of Algorithm 2, it is worth noting that J(θ_A) can be seen as the average performance of all the policies tried by the algorithm. This is particularly meaningful in the context of the online learning setting that we consider in this paper.

5. Remarks on SVRPG

The convergence guarantees presented in the previous section come with requirements on the meta-parameters (i.e., α and m) that may be too conservative for practical applications. Here we provide a practical and automatic way to choose the step size α and the number of sub-iterations m performed between snapshots. Additionally, we provide a variant of SVRPG exploiting a variance-reduction technique for importance weights.
Despite lacking theoretical guarantees, we will show in Section 7 that this method can outperform the baseline SVRPG (Algorithm 2).

5.1. Full Gradient Update

As noted in Section 3, the update performed at the beginning of each epoch is equivalent to a full-gradient update. In our setting, where collecting samples is particularly expensive, the B trajectories collected using the snapshot policy $\pi_{\tilde\theta^s}$ feel like a waste of data (the correction term $\sum_i \left[ g(\tau_i) - \omega(\tau_i)\, g(\tau_i) \right] = 0$ since $\theta_0 = \tilde\theta$). In practice, we just perform an approximate full gradient update using the N trajectories sampled to compute $\widehat\nabla_N J(\tilde\theta^s)$, i.e.,

$\theta_1^{s+1} = \tilde\theta^s + \alpha \widehat\nabla_N J(\tilde\theta^s),$
$\theta_{t+1}^{s+1} = \theta_t^{s+1} + \alpha \widetilde\nabla J(\theta_t^{s+1}) \qquad \text{for } t = 1, \ldots, m-1.$

In the following, we will always use this practical variant.

5.2. Meta-Parameter Selection

The step size α is crucial to balance variance reduction and efficiency, while the epoch length m influences the variance introduced by the importance weights. Low values of m are associated with small variance but increase the frequency of snapshot points (which means many FG computations). High values of m may move the policy $\pi_{\theta_t}$ far away from the snapshot policy $\pi_{\tilde\theta}$, causing large variance in the importance weights. We will jointly set the two meta-parameters.

Adaptive step size. A standard way to deal with noisy gradients is to use adaptive strategies to compute the step size. ADAptive Moment estimation (ADAM) (Kingma & Ba, 2014) stabilizes the parameter update by computing learning rates for each parameter based on an incremental estimate of the gradient variance. Due to this feature, we would like to incorporate ADAM into the structure of the SVRPG update. Recall that SVRPG performs two different updates of the parameters θ: I) the FG update at the snapshot; II) the corrected gradient update in the sub-iterations. Given this structure, we suggest using two separate ADAM estimators:

$\theta_1^{s+1} = \tilde\theta^s + \alpha_s^{FG}\, \widehat\nabla_N J(\tilde\theta^s),$
$\theta_{t+1}^{s+1} = \theta_t^{s+1} + \alpha_{s+1,t}^{SI}\, \widetilde\nabla J(\theta_t^{s+1}) \qquad \text{for } t = 1, \ldots, m-1,$

where $\alpha_s^{FG}$ is associated with the snapshot and $\alpha_{s+1,t}^{SI}$ with the sub-iterations (see Appendix D for details). By doing so, we decouple the contribution of the variance due to the approximate FG from the one introduced by the sub-iterations. Note that these two terms have different orders of magnitude since they are estimated with a different number of trajectories (B ≪ N) and the estimator in the snapshot does not require importance weights. The use of two ADAM estimators allows us to capture and exploit this property.

Adaptive epoch length. It is easy to imagine that a predefined schedule (e.g., m fixed in advance or changed with a policy-independent process) may perform poorly due to the high variability of the updates. In particular, given a fixed number of sub-iterations m, the variance of the updates in the sub-iterations depends on the snapshot policy and the sampled trajectories. Since the ADAM estimate partly captures such variability, we propose to take a new snapshot (i.e., interrupt the sub-iterations) whenever the step size $\alpha^{SI}$ proposed by ADAM for the sub-iterations is smaller than the one for the FG (i.e., $\alpha^{FG}$). If the latter condition is verified, it amounts to saying that the noise in the corrected gradient has overcome the information of the FG. Formally, the stopping condition is:

if $\frac{\alpha_{s+1,t}^{SI}}{B} < \frac{\alpha_s^{FG}}{N}$ then take a snapshot,

where we have introduced N and B to take into account the trajectory efficiency (i.e., a trajectory-weighted advantage). The fewer the trajectories used to update the policy, the better. Including the batch sizes in the stopping condition allows us to optimize the trade-off between the quality of the updates and the cost of performing them.
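A minimal sketch of one SVRPG epoch with these practical variants follows. It is only an illustration of the scheme above: the Adam wrapper objects (`adam_fg`, `adam_si`) and their `step_size` method are hypothetical stand-ins for the two ADAM estimators, `corrected_grad` is assumed to implement Eq. (4), and the trajectory-weighted stopping test reflects our reading of the condition stated above.

```python
def svrpg_epoch(theta_snap, grad_snap, adam_fg, adam_si, corrected_grad,
                m_max, B, N):
    # One epoch of the practical SVRPG variant (Sections 5.1-5.2):
    # the snapshot batch itself provides the first (approximate FG) update,
    # and the epoch is interrupted early when the Adam step size of the
    # sub-iterations, weighted by trajectory cost, falls below the FG one.
    alpha_fg = adam_fg.step_size(grad_snap)      # hypothetical Adam wrapper
    theta = theta_snap + alpha_fg * grad_snap    # Section 5.1: reuse the N snapshot trajectories
    for _ in range(1, m_max):
        v = corrected_grad(theta, theta_snap, grad_snap)   # Eq. (4), B fresh trajectories
        alpha_si = adam_si.step_size(v)
        if alpha_si / B < alpha_fg / N:          # adaptive epoch length (Section 5.2)
            break                                # noise has overcome the FG information
        theta = theta + alpha_si * v
    return theta                                 # becomes the next snapshot parameter
```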
5.3. Normalized Importance Sampling

As mentioned in Section 5.2, importance weights are an additional source of variance. A standard way to cope with this issue is self-normalization (e.g., Precup, 2000; Owen, 2013). This technique can reduce the variance of the importance weights at the cost of introducing some bias (Owen, 2013, Chapter 9). Whether the trade-off is advantageous depends on the specific task. Introducing self-normalization in the context of our algorithm, we switch from Eq. (4) to:

$\widetilde\nabla J(\theta_t) = \widehat\nabla_N J(\tilde\theta) + \frac{1}{B} \sum_{i=0}^{B-1} g(\tau_i|\theta_t) - \frac{1}{\Omega} \sum_{i=0}^{B-1} \omega(\tau_i|\theta_t, \tilde\theta)\, g(\tau_i|\tilde\theta),$

where $\Omega = \sum_{i=0}^{B-1} \omega(\tau_i|\theta_t, \tilde\theta)$. In Section 7 we show that self-normalization can provide a performance improvement.

6. Related Work

Despite the considerable interest they have received in SL, variance-reduced gradient approaches have attracted little attention in the RL community. As far as we know, there are just two applications of SVRG in RL. The first approach (Du et al., 2017) aims to exploit SVRG for policy evaluation. The policy evaluation problem is more straightforward than the one faced in this paper (a control problem). In particular, since the goal is to evaluate just the performance of a predefined policy, the optimization problem is stationary. The setting considered in that paper is policy evaluation by minimizing the empirical mean squared projected Bellman error (MSPBE) with a linear approximation of the value function. Du et al. (2017) showed that this problem can be equivalently reformulated as a convex-concave saddle-point problem characterized by a finite-sum structure. This problem can be solved using a variant of SVRG (Palaniappan & Bach, 2016) for which convergence guarantees have been provided. The second approach (Xu et al., 2017) uses SVRG as a practical method to solve the optimization problem faced by Trust Region Policy Optimization (TRPO) at each iteration. This is just a direct application of SVRG to a problem having a finite-sum structure, since no specific structure of the RL problem is exploited. It is worth mentioning that, for practical reasons, the authors proposed to use a Newton conjugate gradient method with SVRG.

In the recent past, there has been a surge of studies investigating variance-reduction techniques for policy gradient methods. The specific structure of the policy gradient allows incorporating a baseline (i.e., a function b : S × A → R) without affecting the unbiasedness of the gradient (e.g., Williams, 1992; Weaver & Tao, 2001; Peters & Schaal, 2008b; Thomas & Brunskill, 2017; Wu et al., 2018). Although the baseline can be arbitrarily selected, the literature often refers to the optimal baseline as the one minimizing the variance of the estimate. Nevertheless, even the baseline needs to be estimated from data. This fact may partially reduce its effectiveness by introducing variance. Even if these approaches share the same goal as SVRG, they are substantially different. In particular, the proposed SVRPG does not make explicit use of the structure of the policy gradient framework, and it is independent of the underlying gradient estimate (i.e., with or without a baseline). This suggests that it would be possible to integrate an ad-hoc SVRPG baseline to further reduce the variance of the estimate. Since this paper is about the applicability of the SVRG technique to RL, we consider this topic as future work.
Additionally, the experiments show that SVRPG has an advantage over G(PO)MDP even when the baseline is used (see the Half-Cheetah domain in Section 7). Concerning importance weighting techniques, RL has made extensive use of them for off-policy problems (e.g., Precup, 2000; Thomas et al., 2015). However, as mentioned before, SVRPG cannot be compared to such methods since it is in all respects an on-policy algorithm. Here, importance weighting is just a statistical tool used to preserve the unbiasedness of the corrected gradient.

7. Experiments

In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well-known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). We consider G(PO)MDP since it has a smaller variance than REINFORCE. For our algorithm, we use a batch size N = 100, a mini-batch size B = 10, and the jointly adaptive step size α and epoch length m proposed in Section 5.2. Since the aim of this comparison is to show the improvement that SVRG-flavored variance reduction brings to SG in the policy gradient framework, we set the batch size of the baseline policy gradient algorithm to B. In this sense, we measure the improvement yielded by computing snapshot gradients and using them to adjust parameter updates. Since we evaluate on-line performance over the number of sampled trajectories, the cost of computing such snapshot gradients is automatically taken into consideration. To make the comparison fair, we also use Adam in the baseline PG algorithm, which we will denote simply as G(PO)MDP in the following. In all the experiments, we use deep Gaussian policies with adaptive standard deviation (details on the network architecture in Appendix E). Each experiment is run 10 times with a random policy initialization and seed, but this initialization is shared among the algorithms under comparison. The length of the experiment, i.e., the total number of trajectories, is fixed for each task. Performance is evaluated by using test trajectories on a subset of the policies considered during the learning process. We provide average performance with 90% bootstrap confidence intervals. Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based.⁷ More details on meta-parameters and exhaustive task descriptions are provided in Appendix E.

Figure 1a compares SVRPG with G(PO)MDP on a continuous variant of the classical Cart-pole task, which is a 2D balancing task. Despite using more trajectories on average for each parameter update, our algorithm shows faster convergence, which can be ascribed to the better quality of updates due to variance reduction. The Swimmer task is a 3D continuous-control locomotion task. This task is more difficult than Cart-pole. In particular, the longer horizon and the more complex dynamics can have a dangerous impact on the variance of the importance weights. In this case, the self-normalization technique proposed in Section 5.3 brings an improvement (even if not statistically significant), as shown in Figure 1b. Figure 1c shows self-normalized SVRPG against G(PO)MDP. Our algorithm outperforms G(PO)MDP for almost the entire learning process. Also here, we note an increase of speed in early iterations, and, toward the end of the learning process, the improvement becomes statistically significant.

Preliminary results on actor-critic. Another variance-reduction technique in policy gradient consists of using baselines or critics.
This tool is orthogonal to the methods described in this paper, and the theoretical results of Section 4 are general in this sense. In the experiments described so far, we compared against the so-called actor-only G(PO)MDP, i.e., without the baseline. To move towards a more general understanding of the variance issue in policy gradient, we also test SVRPG in an actor-critic scenario. To do so, we consider the more challenging MuJoCo (Todorov et al., 2012) Half-Cheetah task, a 3D locomotion task that has a larger state-action space than Swimmer. Figure 1d compares self-normalized SVRPG and G(PO)MDP on Half-Cheetah, using the critic suggested in (Duan et al., 2016) for both algorithms. Results are promising, showing that a combination of the baseline usage and SVRG-like variance reduction can yield an improvement that the two techniques alone are not able to achieve. Moreover, SVRPG presents a noticeably lower variance. The performance of actor-critic G(PO)MDP⁸ on Half-Cheetah is coherent with the one reported in (Duan et al., 2016). Other results are not comparable since we did not use the critic.

Figure 1: Comparison of on-line performance over sampled trajectories, with 90% confidence intervals (x-axis: trajectories; y-axis: average return). (a) SVRPG vs G(PO)MDP on Cart-pole. (b) Self-Normalized SVRPG vs SVRPG on Swimmer. (c) Self-Normalized SVRPG vs G(PO)MDP on Swimmer. (d) Self-Normalized SVRPG vs G(PO)MDP on Half-Cheetah.

⁷Code available at github.com/Dam930/rllab.
⁸Duan et al. (2016) report results on REINFORCE. However, inspection of the rllab code and documentation reveals that it is actually PGT (Sutton et al., 2000), which is equivalent to G(PO)MDP (as shown by Peters & Schaal, 2008b). Using the name REINFORCE in a general way is inaccurate, but widespread.

8. Conclusion

In this paper, we introduced SVRPG, a variant of SVRG designed explicitly for RL problems. The control problem considered in the paper has a series of difficulties that are not common in SL. Among them, non-concavity and approximate estimates of the FG have been analysed independently in SL (e.g., Allen-Zhu & Hazan, 2016; Reddi et al., 2016a; Harikandeh et al., 2015) but never combined. Nevertheless, the main issue in RL is the non-stationarity of the sampling process, since the distribution underlying the objective function is policy-dependent. We have shown that by exploiting importance weighting techniques, it is possible to overcome this issue and preserve the unbiasedness of the corrected gradient. We have additionally shown that, under mild assumptions that are often verified in RL applications, it is possible to derive convergence guarantees for SVRPG. Finally, we have empirically shown that practical variants of the theoretical SVRPG version can outperform classical actor-only approaches on benchmark tasks. Preliminary results support the effectiveness of SVRPG also with a commonly used baseline for the policy gradient. Despite that, we believe that it will be possible to derive a baseline designed explicitly for SVRPG to exploit the RL structure and the SVRG idea jointly.
Another possible improvement would be to employ the natural gradient (Kakade, 2002) to better control the effects of parameter updates on the variance of importance weights. Future work should also focus on making the batch sizes N and B adaptive, as suggested in (Papini et al., 2017).

Acknowledgments

This research was supported in part by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and French National Research Agency (ANR) under project ExTra-Learn (n. ANR-14-CE24-001001).

References

Allen-Zhu, Zeyuan and Hazan, Elad. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pp. 699–707, 2016.

Baxter, Jonathan and Bartlett, Peter L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

Bietti, Alberto and Mairal, Julien. Stochastic optimization with variance reduction for infinite datasets with finite sum structure. In Advances in Neural Information Processing Systems, pp. 1622–1632, 2017.

Bottou, Léon and Le Cun, Yann. Large scale online learning. In Advances in Neural Information Processing Systems, pp. 217–224, 2004.

Cauchy, Augustin. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.

Cortes, Corinna, Mansour, Yishay, and Mohri, Mehryar. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pp. 442–450, 2010.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014a.

Defazio, Aaron, Domke, Justin, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pp. 1125–1133, 2014b.

Du, Simon S., Chen, Jianshu, Li, Lihong, Xiao, Lin, and Zhou, Dengyong. Stochastic variance reduction methods for policy evaluation. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 1049–1058. PMLR, 2017.

Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.

Furmston, Thomas and Barber, David. A unifying perspective of parametric policy search methods for Markov decision processes. In NIPS, pp. 2726–2734, 2012.

Ghadimi, Saeed and Lan, Guanghui. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Harikandeh, Reza, Ahmed, Mohamed Osama, Virani, Alim, Schmidt, Mark, Konečný, Jakub, and Sallinen, Scott. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pp. 2251–2259, 2015.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Jurčíček, Filip. Reinforcement learning for spoken dialogue systems using off-policy natural gradient method. In Spoken Language Technology Workshop (SLT), 2012 IEEE, pp. 7–12. IEEE, 2012.

Kakade, Sham M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.

Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Konečný, Jakub, Liu, Jie, Richtárik, Peter, and Takáč, Martin. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.

Mairal, Julien. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

Nemirovskii, Arkadii, Yudin, David Borisovich, and Dawson, Edgar Ronald. Problem complexity and method efficiency in optimization. 1983.

Nesterov, Yurii. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

Owen, Art B. Monte Carlo theory, methods and examples. 2013.

Palaniappan, Balamurugan and Bach, Francis. Stochastic variance reduction methods for saddle-point problems. In NIPS, pp. 1408–1416, 2016.

Papini, Matteo, Pirotta, Matteo, and Restelli, Marcello. Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems, pp. 3594–3603, 2017.

Peters, Jan and Schaal, Stefan. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008a.

Peters, Jan and Schaal, Stefan. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008b.

Pirotta, Matteo, Restelli, Marcello, and Bascetta, Luca. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.

Pirotta, Matteo, Restelli, Marcello, and Bascetta, Luca. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2):255–283, 2015. ISSN 1573-0565. doi: 10.1007/s10994-015-5484-1.

Precup, Doina. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, p. 80, 2000.

Reddi, Sashank J, Hefny, Ahmed, Sra, Suvrit, Póczos, Barnabás, and Smola, Alex. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323, 2016a.

Reddi, Sashank J, Sra, Suvrit, Póczos, Barnabás, and Smola, Alex. Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159, 2016b.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Roux, Nicolas L, Schmidt, Mark, and Bach, Francis R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.

Rubinstein, Reuven Y. Simulation and the Monte Carlo method. Technical report, 1981.

Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.

Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Thomas, Philip S. and Brunskill, Emma. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. CoRR, abs/1706.06643, 2017.

Thomas, Philip S, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High-confidence off-policy evaluation. In AAAI, pp. 3000–3006, 2015.

Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
Weaver, Lex and Tao, Nigel. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 538–545. Morgan Kaufmann Publishers Inc., 2001.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Wu, Cathy, Rajeswaran, Aravind, Duan, Yan, Kumar, Vikash, Bayen, Alexandre M, Kakade, Sham, Mordatch, Igor, and Abbeel, Pieter. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018. Accepted as oral presentation.

Xu, Tianbing, Liu, Qiang, and Peng, Jian. Stochastic variance reduction for policy gradient estimation. CoRR, abs/1710.06034, 2017.

Zhao, Tingting, Hachiya, Hirotaka, Niu, Gang, and Sugiyama, Masashi. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pp. 262–270, 2011.