# OFFER: Off-Environment Reinforcement Learning

Kamil Ciosek, Shimon Whiteson
Department of Computer Science, University of Oxford, United Kingdom
{kamil.ciosek,shimon.whiteson}@cs.ox.ac.uk

## Abstract

Policy gradient methods have been widely applied in reinforcement learning. For reasons of safety and cost, learning is often conducted using a simulator. However, learning in simulation does not traditionally exploit the opportunity to improve learning by adjusting certain environment variables: state features that are randomly determined by the environment in a physical setting but controllable in a simulator. Exploiting environment variables is crucial in domains containing significant rare events (SREs), e.g., unusual wind conditions that can crash a helicopter, which are rarely observed under random sampling but have a considerable impact on expected return. We propose off-environment reinforcement learning (OFFER), which addresses such cases by simultaneously optimising the policy and a proposal distribution over environment variables. We prove that OFFER converges to a locally optimal policy and show experimentally that it learns better and faster than a policy gradient baseline.

## Introduction

When applying reinforcement learning (RL) to physical systems, a major issue is the cost and risk of running trials, e.g., when learning a control policy for a robot. Hence, learning is often performed using a simulator, to which off-line RL (i.e., sample-based planning) can be applied. Although this is cheaper and safer than physical trials, the computational cost of each trial can still be considerable. It is therefore important to develop algorithms that minimise this cost.

Policy gradient methods (Sutton et al. 2000) are popular in such settings as they cope well with continuous action spaces, which often occur in physical systems such as robots. However, existing policy gradient methods do not exploit an important opportunity presented by simulators: the chance to adjust certain environment variables, i.e., state features that are (stochastically) determined by the environment in a physical setting but can be freely set in a simulator. For example, if we learn to fly a helicopter under varying wind conditions (Koppejan and Whiteson 2011), we cannot control the wind in physical trials but can easily do so in simulation.

A conventional application of policy gradient methods to such settings is not robust to significant rare events (SREs), i.e., it fails any time there are rare events that substantially affect expected performance. For example, some rare wind conditions may increase the risk of crashing the helicopter. Since crashes are so catastrophic, avoiding them is key to maximising expected performance, even though the wind conditions contributing to a crash occur only rarely. In such cases, the conventional approach does not see such rare events often enough to learn an appropriate response.

In this paper, we propose a new policy gradient method called off-environment reinforcement learning (OFFER) that aims to address this deficiency. The main idea is to couple the primary optimisation of the policy with the secondary optimisation of a proposal distribution governing the environment variables.
Since environment variables can be controlled in simulation, we are free to sample from the proposal distribution when generating data for the primary optimisation. Thanks to importance sampling, the primary optimisation can retain unbiased gradient estimates. Just as off-policy RL learns about a target policy while using a behaviour policy, off-environment RL learns about a target environment (the true distribution over environment variables) while generating data from another environment (the proposal distribution). By learning a proposal distribution that minimises the variance in the gradient estimate used during primary optimisation, OFFER can automatically discover and focus on the SREs that any robust policy must address.

We show that OFFER is guaranteed to converge to a locally optimal policy. In addition, we present empirical results on several tasks showing that it greatly outperforms existing policy gradient methods in the presence of significant rare events.

## Related Work

Our approach is related to existing work on adaptive importance sampling (Ahamed, Borkar, and Juneja 2006; Frank, Mannor, and Precup 2008; Desai and Glynn 2001), which also seeks to optimise proposal distributions. However, most such work focuses on Markov reward processes, i.e., the evaluation of a fixed policy. By contrast, our work considers how to learn a good proposal distribution for a policy that is itself changing.

As in our work, Frank, Mannor, and Precup (2008) consider a full Markov decision process and aim to learn using a proposal distribution that takes rare events into account. However, they assume prior knowledge of what the significant rare events are, as well as access to a simulator with environment variables that directly control the probability of such rare events. By contrast, our work seeks to automatically discover significant rare events and the proposal distributions that generate them. For example, we may know that a helicopter becomes unstable in a tornado but not know a priori (1) whether such an event is important for distinguishing between policies, (2) how to make our simulator generate a tornado if it is significant, or (3) what a good proposal distribution for learning is. Our approach is designed to work without such prior knowledge.

Paul et al. (2016) consider a similar setting to ours but in the context of Bayesian optimisation, in which case environment variables must be marginalised out using Bayesian quadrature. Here, we consider a policy gradient approach, which is typically more effective in high-dimensional tasks.

Variance reduction in policy gradient methods has also been well studied. Most work focuses on subtracting an appropriate baseline (Peters and Schaal 2006; Williams 1992). Some work also considers importance sampling for variance reduction (Glynn 1990; Nakayama, Goyal, and Glynn 1994) but uses only hard-coded proposal distributions, whereas our approach learns them automatically.

Our work is also related to safe reinforcement learning (García and Fernández 2015), which also aims to identify risky states. Many such methods aim to optimise a risk-averse objective (Bertsekas and Rhodes 1971; Heger 1994; Malfaz and Salichs 2011), whereas we aim to more efficiently optimise a risk-neutral objective (expected return).
Other methods aim to constrain exploration to avoid risky states (Gehring and Precup 2013), whereas we learn in a safe simulator and thus seek proposal distributions that visit such states more often, if they are significant to expected return.

## Background

We begin by formalising the problem setting and reviewing existing methods.

### Problem Setting

We model the decision-making task as a Markov decision process (MDP), in which taking an action $a_t \in A$ in state $s_t \in S$ at time $t$ generates a reward whose expected value is $r(s_t, a_t)$ and a transition to a next state $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$. We assume access to an MDP simulator in the form of a trajectory model that generates sequences of samples:

$$\tau = (s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, \ldots, s_N, a_N, r_N),$$

where $s_1$ is sampled from the distribution over initial states $p_1(s_1)$ and each action $a_t$ is sampled from a stochastic policy $\pi_\theta(a_t \mid \phi(s_t))$ parameterised by a vector $\theta$; $\phi(s_t)$ is a function mapping $s_t$ to a vector of real-valued features. In the sequel, we write $\pi_\theta(a_t \mid s_t)$ for brevity. We assume $\pi_\theta$ is a twice differentiable function of $\theta$. $T$ is the set of all possible trajectories, i.e., $T = \{\tau : \exists\theta.\ p_{\pi_\theta}(\tau) > 0\}$, where

$$p_{\pi_\theta}(\tau) = p_1(s_1) \prod_{t=1}^{N} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t).$$

In addition, we assume access to a vector of environment variables $\psi$ (e.g., coefficients of friction, wind velocities) that affect state transition probabilities. Note that, while we can control $\psi$ when running the simulator, the policy we ultimately deploy cannot. In this paper, we model environment variables by supposing that states in the simulator are sampled, not from $p_1(s_1)$ and $p(s_{t+1} \mid s_t, a_t)$, but from proposal distributions $f^\psi_1(s_1)$ and $f^\psi(s_{t+1} \mid s_t, a_t)$, which are parameterised by $\psi$. We can set $\psi$ such that $p_1 = f^\psi_1$ and $p = f^\psi$ and thus sample trajectories from the original MDP, or we can set it otherwise and thus sample trajectories from altered transition dynamics.

The goal in this setting is to find a $\theta$ that maximises the total expected return:

$$J_\theta = \int_S \rho^\pi(s) \int_A \pi_\theta(a \mid s)\, r(s, a)\, da\, ds,$$

where $\rho^\pi(s')$ is an improper distribution, i.e.,

$$\rho^\pi(s') = \int_S \sum_{n=1}^{\infty} \gamma^{n-1}\, p_1(s)\, p(s \to s', n, \pi)\, ds,$$

with $\gamma \in (0, 1)$ a discount factor, and $p(s \to s', n, \pi)$ is the probability of reaching $s'$ from $s$ in $n$ steps under $\pi$, obtained by integrating $\prod_t p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$ over the trajectories in $\mathrm{Traj}(n, s, s')$, the set of all possible trajectories of length $n$ beginning with $s$ and ending with $s'$. In the next section, we propose an algorithm that learns both $\theta$ and $\psi$ in parallel.

### Existing Methods

Policy gradient (PG) methods (Sutton et al. 2000; Silver et al. 2014; Degris, Pilarski, and Sutton 2012; Deisenroth et al. 2013) use gradient ascent to directly optimise $\theta$. PG methods are appealing in many settings because they cope well with continuous action spaces. Sutton et al. (2000) showed that the gradient of $J_\theta$ with respect to $\theta$ can be written as:

$$\nabla_\theta J_\theta = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \underbrace{(Q^\pi(s, a) - b(s))}_{u(s, a)}\big], \quad (1)$$

where $b(s)$ is a baseline function. Given a trajectory $\tau$ sampled from $p_1(s_1)$ and $p(s_{t+1} \mid s_t, a_t)$, we can estimate the gradient as:

$$\hat{\nabla}_\theta J_\theta = \sum_{t=1}^{N} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{u}_t(s_t, a_t), \quad (2)$$

where $\hat{u}_t(s_t, a_t)$ is an estimate of $u_t(s_t, a_t)$. Setting $\hat{u}_t(s_t, a_t) = \sum_{i \geq t} \gamma^{i-t} r_i - b$, where $b$ is a baseline (Peters and Schaal 2006), yields the REINFORCE method (Williams 1992), summarised in Algorithm 1.

**Algorithm 1** CRITIC-REINFORCE($\tau$)
1: Trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_N, a_N, r_N)$
2: $b_i \leftarrow$ compute baselines
3: for $i = 1 \ldots N$ do
4:   $\hat{u}(s_i, a_i) \leftarrow \big(\sum_{t=i}^{N} \gamma^{t-i} r_t\big) - b_i$
5: end for
6: return $\hat{u}$
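As an illustration, the critic in Algorithm 1 can be sketched in a few lines of Python; the function name is ours and the mean return-to-go baseline is just one simple choice among many.

```python
import numpy as np

def critic_reinforce(rewards, gamma=0.99, baseline=None):
    """Monte Carlo critic in the spirit of Algorithm 1:
    discounted return-to-go minus a baseline, per time step."""
    N = len(rewards)
    u_hat = np.zeros(N)
    running = 0.0
    # Accumulate discounted returns-to-go backwards through the trajectory.
    for i in reversed(range(N)):
        running = rewards[i] + gamma * running
        u_hat[i] = running
    # Default baseline: the mean return-to-go (an illustrative choice).
    b = u_hat.mean() if baseline is None else baseline
    return u_hat - b

# Example: a short trajectory with a large penalty at the end.
print(critic_reinforce([-1.0, -1.0, -100.0], gamma=0.9))
```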
Setting $\hat{u}_t(s_t, a_t) = r_t + \gamma \phi(s_{t+1})^\top w - \phi(s_t)^\top w$, i.e., the TD error computed using TD($\lambda$) policy evaluation (Sutton 1988), yields an actor-critic method (Degris, Pilarski, and Sutton 2012), summarised in Algorithm 2.

**Algorithm 2** CRITIC-AC($\tau$)
1: Trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_N, a_N, r_N)$
2: ▷ TD($\lambda$) learning algorithm
3: $\hat{u}(s_1, a_1) \leftarrow r_1 + \gamma \phi(s_2)^\top w - \phi(s_1)^\top w$
4: for $i = 2 \ldots N$ do
5:   $\delta_i \leftarrow r_i + \gamma \phi(s_i)^\top w - \phi(s_{i-1})^\top w$
6:   $\hat{u}(s_i, a_i) \leftarrow \delta_i$
7:   $e \leftarrow \gamma \lambda e + \phi(s_{i-1})$
8:   $w \leftarrow w + \beta \delta_i e$
9: end for
10: return $\hat{u}$

Algorithm 2 provides an unbiased estimator for $Q(s, a) - V(s)$ (Bhatnagar et al. 2009, Lemma 3), where $V$ is the value function of the current policy; it thus estimates $u(s, a)$ with $b(s) = V(s)$. Convergence of PG using Algorithm 2 is guaranteed if the critic's representation is compatible with the policy representation (Sutton et al. 2000).

## Off-Environment Reinforcement Learning

In this section, we propose off-environment reinforcement learning (OFFER) for coping with significant rare events by exploiting environment variables. The main idea is to interleave two different optimisation steps. The primary optimisation step performs an importance-weighted policy gradient update, adjusting $\theta$ to improve the policy's expected return. The secondary optimisation step performs a gradient descent step on the proposal distribution, adjusting $\psi$ to reduce the variance of the gradient estimate used during primary optimisation. For example, in the helicopter domain, OFFER can optimise the control policy while simultaneously optimising the wind conditions used to evaluate that policy. Algorithm 3 summarises OFFER; the rest of this section provides the details of primary and secondary optimisation.

**Algorithm 3** OFFER()
1: while not converged do
2:   $\tau \leftarrow$ sample trajectory using $\pi_\theta$ and $f^\psi$
3:   $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_N, a_N, r_N)$
4:   $\Delta\theta \leftarrow$ PRIMARY-OPTIMISATION($\tau, \theta, \psi$)
5:   $\theta \leftarrow \theta + \Delta\theta$
6:   $\Delta\psi \leftarrow$ SECONDARY-OPTIMISATION($\tau, \psi$)
7:   $\psi \leftarrow \psi + \Delta\psi$
8: end while

### Primary Optimisation

The primary optimisation performs a policy gradient update that adjusts $\theta$ to improve expected return. We first derive the gradient and then discuss an appropriate update rule. To properly estimate the gradient, we must modify (2) to take into account the fact that trajectories are sampled from $f$ rather than $p$. To do so, we simply apply importance sampling to the policy gradient update (Glynn 1990):

$$\nabla_\theta J_\theta = \mathbb{E}_{\tau \sim p_\pi(\tau)}\Big[\sum_{t=1}^{N} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\Big] = \mathbb{E}_{\tau \sim f_\pi(\tau)}\Big[\frac{p_\pi(\tau)}{f_\pi(\tau)} \sum_{t=1}^{N} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\Big], \quad (3)$$

where $f_\pi(\tau) = f_1(s_1) \prod_{t=1}^{N} f(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$. Given a concrete sampled trajectory $\tau$, we can then estimate the gradient as

$$\hat{\nabla}_\theta J_\theta = \frac{p_1(s_1)}{f_1(s_1)} \prod_{t=1}^{N} \frac{p(s_{t+1} \mid s_t, a_t)}{f(s_{t+1} \mid s_t, a_t)} \sum_{t=1}^{N} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t). \quad (4)$$
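Concretely, the estimate in (4) can be assembled from per-step quantities as in the sketch below, with the critic output $\hat{u}$ standing in for $Q^\pi$; the array layout and function signature are one possible convention, not prescribed by the method.

```python
import numpy as np

def importance_weighted_gradient(p_probs, f_probs, grad_log_pi, u_hat, gamma=0.99):
    """Single-trajectory estimate of the importance-weighted policy gradient (4).

    p_probs, f_probs: per-step probabilities of the sampled states under the
        original dynamics p and the proposal f (initial state included);
    grad_log_pi: array of shape (N, d) with the score grad_theta log pi(a_t|s_t);
    u_hat: critic outputs from Algorithm 1 or 2, standing in for Q^pi.
    """
    # Trajectory importance weight p(tau)/f(tau); the policy terms cancel.
    w = np.prod(np.asarray(p_probs) / np.asarray(f_probs))
    N = grad_log_pi.shape[0]
    discounts = gamma ** np.arange(1, N + 1)
    # Discounted, importance-weighted sum of score times critic estimate.
    return w * np.sum(discounts[:, None] * grad_log_pi * np.asarray(u_hat)[:, None], axis=0)
```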
We now consider how to use (4) to update $\theta$. A naive approach would be to use the standard policy gradient update $\Delta\theta = \alpha \hat{\nabla}_\theta J_\theta$. However, doing so is problematic in any setting characterised by significant rare events. To see why, first note that the magnitude of the update depends on the magnitude of the gradient, which in turn depends on the magnitude of the rewards. Now consider two MDPs that are identical save that in one MDP the rewards have been multiplied by a large constant. Clearly, for a given $\theta$, the magnitude of $\Delta\theta$ should be the same in both MDPs. However, since $\hat{\nabla}_\theta J_\theta$ will be different, the two MDPs will require different values of the learning rate $\alpha$. We now show that, in the presence of SREs, the same problem can arise within a single MDP. We need the following definitions.

Definition 1. A partition $P = \{E_1, E_2, \ldots, E_{|P|}\}$ divides the set of possible trajectories $T$ into events $E_i$.

For example, a simple partition $P = \{E_{NE}, E_{RE}\}$ involves just normal and rare events.

Definition 2. A policy feature $\theta_i$ is salient with respect to an event $E$ under partition $P$ if $\exists \tau \in E.\ \{\hat{\nabla}_\theta J_\theta(\tau)\}_i \neq 0$.

Note that different policy features can be salient for different subsets of events, e.g., $\theta_i$ might be salient only for $E_{RE}$ while $\theta_j$ might be salient for both $E_{RE}$ and $E_{NE}$. Now suppose that different events have different reward scales, e.g., trajectories in $E_{RE}$ trigger rewards of a larger order of magnitude than those in $E_{NE}$. Then, different policy features will need different learning rates, just as in the example with two MDPs. Consequently, the ability to adaptively set the learning rate separately for each policy feature becomes an essential, not merely desirable, characteristic of the learning algorithm in our setting.

To meet this requirement, we employ a stochastic variant of Newton's method (Furmston and Barber 2012; Spall 2000). In our setting, the Hessian $H$ equals $\nabla_\theta \nabla_\theta^\top J_\theta$. This can be estimated from a sampled trajectory, yielding $\hat{H}(\tau)$. Furthermore, we can maintain a mean of the Hessians estimated over time: $\hat{H}_M(\tau_1, \ldots, \tau_N) = \frac{1}{N}\sum_{i=1}^{N} \hat{H}(\tau_i)$. The gradient update is then:

$$\Delta\theta = \alpha\, \mathrm{diag}\big(\hat{H}_M(\tau_1, \ldots, \tau_{n+1})\big)^{-1}\, \hat{\nabla}_\theta J_\theta(\tau_{n+1}).$$

To avoid the prohibitive cost of inverting the full Hessian, only its diagonal is used, which corresponds to performing Newton's method on each coordinate separately. As a result, the computational complexity is the same as that of vanilla gradient descent. Algorithm 4 summarises the resulting primary optimisation algorithm, which calls either Algorithm 1 or 2 as a subroutine. The diagonal variant of Newton's method that Algorithm 4 implements can be seen as an automatic way of setting learning rates for vanilla gradient descent (Furmston and Barber 2012). Consequently, it addresses the problem described above: the estimate of the $i$-th entry of the vector $\hat{\nabla}_\theta J_\theta$ and the $i$-th diagonal entry of $\hat{H}_M$ both scale with $\hat{u}$ (the output of Algorithms 1 and 2), which scales with the rewards. Hence, if some policy features are salient for different subsets of events, then the effects on the update cancel in line 11 of Algorithm 4.

**Algorithm 4** PRIMARY-OPTIMISATION($\tau, \theta, \psi$)
1: Trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_N, a_N, r_N)$
2: ▷ Critic
3: $\hat{u} \leftarrow$ CRITIC-REINFORCE($\tau$) or CRITIC-AC($\tau$)
4:
5: ▷ Actor
6: $w \leftarrow \frac{p_1(s_1)}{f_1(s_1)} \prod_{t=1}^{N} \frac{p(s_{t+1} \mid s_t, a_t)}{f(s_{t+1} \mid s_t, a_t)}$ ▷ importance sampling
7: $\hat{\nabla}_\theta J_\theta \leftarrow w \sum_{i=1}^{N} \gamma^i \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{u}_i$
8: $\hat{H} \leftarrow w \sum_{i=1}^{N} \gamma^i\, \mathrm{diag}\big(\nabla_\theta \nabla_\theta^\top \log \pi_\theta(a_i \mid s_i)\big)\, \hat{u}_i$
9: $\hat{H}_M \leftarrow \frac{i-1}{i} \hat{H}_M + \frac{1}{i} \hat{H}$
10:
11: return $\alpha_i\, \hat{H}_M^{-1}\, \hat{\nabla}_\theta J_\theta$

We use Newton's method because it has already been shown to work well with policy gradient methods (Furmston and Barber 2012). However, other adaptive learning rate approaches such as ADAM (Kingma and Ba 2014) could also be considered. Nevertheless, as we discuss below, using ADAM would complicate the secondary optimisation.
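The per-coordinate scaling in line 11 of Algorithm 4 can be maintained incrementally, e.g., as in the sketch below. Taking the magnitude of the running Hessian diagonal and adding a small epsilon are practical safeguards of this sketch; the analysis below assumes the diagonal entries of $\hat{H}_M$ are positive.

```python
import numpy as np

class DiagonalNewtonUpdater:
    """Per-coordinate (diagonal Newton) scaling of the policy gradient,
    in the spirit of line 11 of Algorithm 4."""

    def __init__(self, dim, alpha=0.1, eps=1e-8):
        self.alpha = alpha
        self.eps = eps                 # guard against division by zero (our addition)
        self.h_mean = np.zeros(dim)    # running mean of Hessian-diagonal estimates
        self.count = 0

    def step(self, grad_hat, hess_diag_hat):
        # Incremental mean over the per-trajectory Hessian-diagonal estimates.
        self.count += 1
        self.h_mean += (hess_diag_hat - self.h_mean) / self.count
        # Each gradient coordinate gets its own effective learning rate,
        # so features salient only for large-penalty rare events are rescaled.
        return self.alpha * grad_hat / (np.abs(self.h_mean) + self.eps)
```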
### Secondary Optimisation

The goal of the secondary optimisation is to find a proposal distribution that best facilitates the primary optimisation. To this end, we propose to minimise the variance of the gradient estimate followed during primary optimisation. Such variance is known to be a key contributor to slow learning in policy gradient methods (Peters and Schaal 2006; Williams 1992; Glynn 1990; Nakayama, Goyal, and Glynn 1994). By minimising this variance, we expect OFFER to discover proposal distributions that generate significant rare events more often than in the original task. Since such events contribute substantially to expected return, doing so makes it easier to estimate the gradient of the expected return.

We start by defining the covariance:

$$C = \mathrm{Cov}_{\tau \sim f^\psi}\Big[\frac{1}{f^\psi(\tau)}\, p(\tau) \sum_{t=1}^{N} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, u_t\Big] = \mathrm{Cov}_{\tau \sim f^\psi}\Big[\frac{1}{f^\psi(\tau)}\, h(\tau)\Big], \quad (5)$$

where $h(\tau) = p(\tau) \sum_{t=1}^{N} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, u_t$. Since we are interested in minimising our uncertainty about each of the partial derivatives that make up the gradient, rather than the correlations between them, we consider only the diagonal of $C$. Furthermore, since we have no a priori reason to think one partial derivative is more important than another, we define the trace of $C$ as our objective:

$$\min_\psi\ \mathrm{trace}(C). \quad (6)$$

Note that, if we treat $\hat{H}_M$ as a constant (which is approximately true for large $N$), then $\Delta\theta$, as computed by Newton's method, is a linear function of $\hat{\nabla}_\theta J_\theta$. Hence, minimising the covariance of $\hat{\nabla}_\theta J_\theta$ also minimises the covariance of $\Delta\theta$. By contrast, in ADAM, $\Delta\theta$ is a nonlinear function of $\hat{\nabla}_\theta J_\theta$ that divides by the square root of the second moment. Hence, using ADAM for primary optimisation would necessitate a more complex secondary objective.

To solve (6) by gradient descent, we must evaluate the gradient of the trace with respect to $\psi$. Writing the covariance in (5) as $\mathbb{E}[XX^\top] - \mathbb{E}[X]\mathbb{E}[X]^\top$ with $X = h(\tau)/f^\psi(\tau)$, the gradient of the second term is zero because

$$\nabla_\psi\, \mathbb{E}_{\tau \sim f^\psi}\Big[\frac{h_i(\tau)}{f^\psi(\tau)}\Big] = \nabla_\psi \int h_i(\tau)\, d\tau = 0.$$

Hence, we focus on the gradient of the trace of the first term:

$$\frac{\partial}{\partial \psi_j} \sum_i \int \frac{h_i(\tau)^2}{f^\psi(\tau)}\, d\tau = -\sum_i \int \frac{1}{f^\psi(\tau)^2}\, \frac{\partial f^\psi(\tau)}{\partial \psi_j}\, h_i(\tau)^2\, d\tau = -\sum_i \int \frac{1}{f^\psi(\tau)^3}\, \frac{\partial f^\psi(\tau)}{\partial \psi_j}\, h_i(\tau)^2\, f^\psi(\tau)\, d\tau.$$

Consequently:

$$\nabla_{\psi_j} \mathrm{trace}(C) = -\mathbb{E}_{\tau \sim f^\psi}\Big[\sum_i \frac{1}{f^\psi(\tau)^3}\, \frac{\partial f^\psi(\tau)}{\partial \psi_j}\, h_i(\tau)^2\Big],$$

which can be estimated from a trajectory $\tau$ as follows:

$$\hat{\nabla}_{\psi_j} \mathrm{trace}(C) = -\sum_i \frac{1}{f^\psi(\tau)^3}\, \frac{\partial f^\psi(\tau)}{\partial \psi_j}\, \hat{h}_i(\tau)^2, \quad (7)$$

where $\hat{h}_i(\tau) = p(\tau) \sum_{t=1}^{N} \gamma^t \{\nabla_\theta \log \pi_\theta(a_t \mid s_t)\}_i\, \hat{u}_t$. Unlike the primary optimisation, the secondary optimisation can be performed using any variant of stochastic gradient descent. Since our estimate of the gradient (7) relies on only one trajectory, we employ ADAM, which smooths estimates across previous trajectories. Algorithm 5 summarises the secondary optimisation.

**Algorithm 5** SECONDARY-OPTIMISATION($\tau, \psi$)
1: Trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_N, a_N, r_N)$
2: ▷ ADAM algorithm
3: $\hat{\nabla}_\psi \mathrm{trace}(C) \leftarrow -\sum_i \frac{1}{f^\psi(\tau)^3}\, \nabla_\psi f^\psi(\tau)\, \hat{h}_i(\tau)^2$
4: $m \leftarrow \beta_1 m + (1 - \beta_1)\, \hat{\nabla}_\psi \mathrm{trace}(C)$
5: ▷ $\odot$ is an element-wise product
6: $v \leftarrow \beta_2 v + (1 - \beta_2)\, \hat{\nabla}_\psi \mathrm{trace}(C) \odot \hat{\nabla}_\psi \mathrm{trace}(C)$
7: $\hat{m} \leftarrow m / (1 - \beta_1^i)$
8: $\hat{v} \leftarrow v / (1 - \beta_2^i)$
9:
10: ▷ division and square root are element-wise
11: return $-\alpha\, \hat{m} / \sqrt{\hat{v}}$
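A single step of Algorithm 5 can be sketched as follows, assuming the caller supplies $f^\psi(\tau)$, $\nabla_\psi f^\psi(\tau)$, and $\hat{h}(\tau)$ for the sampled trajectory; the interface and the epsilon term are our own choices for the sketch. In practice these densities are products over many time steps and are best accumulated in log space to avoid numerical underflow.

```python
import numpy as np

class SecondaryOptimiser:
    """ADAM-smoothed descent on trace(C), one step per sampled trajectory,
    in the spirit of Algorithm 5."""

    def __init__(self, dim, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m = np.zeros(dim)   # first moment
        self.v = np.zeros(dim)   # second moment
        self.t = 0

    def step(self, f_tau, grad_psi_f_tau, h_hat):
        # Single-trajectory estimate of grad_psi trace(C), as in (7).
        g = -(np.sum(np.asarray(h_hat) ** 2) / f_tau ** 3) * np.asarray(grad_psi_f_tau)
        # Standard ADAM moment updates with bias correction.
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Descent step; Algorithm 3 then applies psi <- psi + delta_psi.
        return -self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)
```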
## Convergence Guarantee

In this section, we show that OFFER has convergence guarantees similar to those of existing policy gradient methods (Peters and Schaal 2008). In particular, we use existing results to establish a corollary showing that $\theta$ converges to a local optimum of $J$.

Corollary 1. If the following assumptions hold:

1. $f$ is non-zero everywhere;
2. $J$ is convex over some restricted domain $B$ containing the local optimum $\theta^\star$;
3. All iterates $\theta^{(1)}, \ldots$ belong to $B$;
4. Robbins-Monro conditions (Spall 2000, Condition C.1') hold for the learning rate $\alpha_i$;
5. At each iteration $k$, there exist $\delta, \rho > 0$ such that $\mathbb{E}\big[\|\hat{\nabla}_\theta J_\theta^{(k)}\|^{2+\delta}\big] \leq \rho$;
6. $J$ has a uniformly bounded third derivative in $B$;
7. At each iteration, $\hat{H}_M^{(k)}$ is diagonal with positive entries;¹
8. At each iteration $k$, there exist $\delta, \rho > 0$ such that $\mathbb{E}\big[\|(\hat{H}_M^{(k)})^{-1}\|^{2+\delta}\big] \leq \rho$;
9. If we are using a critic, then it is compatible (Sutton et al. 2000) with the policy, i.e., $\nabla_\theta u(s, a) = \nabla_\theta \pi(s, a)\, \frac{1}{\pi(s, a)}$;

then OFFER (Algorithm 3) converges almost surely to the local optimum $\theta^\star$. Moreover, $\psi$ converges to a local minimum of the optimisation problem in (6).

¹ Existing techniques (Spall 2000) can be used to show consistency for non-diagonal Hessians as well. In this work, we opt for the diagonal Hessian for efficiency reasons.

Proof. First, we show $\mathbb{E}[\hat{\nabla}_\theta J_\theta] = \nabla_\theta J_\theta$. For the REINFORCE-based critic, this follows from the derivation of (3), i.e., importance sampling does not introduce bias (Owen 2013, Chapter 9). For the actor-critic, the same reasoning together with Assumption 9 implies that the gradient involving the approximate critic is the same in expectation as the true gradient (Sutton et al. 2000, Theorem 2). Since this condition holds at each iteration, even though $\psi$ and the distribution with respect to which the expectation is taken change, we satisfy condition C.0 from (Spall 2000), which requires that we use a sequence of unbiased estimators. Moreover, Assumption 7 implies that, at each iteration $k$, for each coordinate $i$, $\mathrm{sgn}(\{\nabla_\theta J_\theta(\theta_k)\}_i) = \mathrm{sgn}(\{\hat{H}_M^{-1} \nabla_\theta J_\theta(\theta_k)\}_i)$. Furthermore, Assumption 2 implies that the iterates are, of necessity, bounded. Together, these facts imply the lemma Sufficient Conditions for C.5 and C.7 from (Spall 2000), which shows that certain conditions relating to the Hessian estimate are satisfied. Given this lemma, we can instantiate the main theorem (Spall 2000, Theorem 1b-2SG), which says that a general second-order stochastic optimisation scheme, of which the primary optimisation of Algorithm 4 is a special case, converges given a set of conditions that we have either just shown or which follow directly from our assumptions. The secondary optimisation converges because eventual convergence of $\theta$ renders the secondary optimisation problem stationary, at which point ADAM's convergence guarantee applies (Kingma and Ba 2014, Theorem 4.1).

Note that Corollary 1 holds for any choice of $f^\psi$, not just the ones learned by OFFER: a poor choice of $\psi$ will only slow convergence, not prevent it.

## Experiments

In this section, we empirically compare OFFER to a policy gradient baseline (primary optimisation only) on variants of the mountain car task, as well as a simulated robot arm. We describe the experimental setup only briefly; complete details are in the supplementary material. Both domains have penalties rather than rewards, so lower is better on all plots. All results are averaged over 48 runs.

### Mountain Car

We modified the mountain car benchmark task by adding a third state feature (besides position and velocity): a boolean variable indicating whether a rare event occurs in the given episode. With probability 0.01, this boolean is true, in which case the landscape is shifted to the right, changing when the agent should go forwards or backwards, and penalties are multiplied by 1000. In our setup, $\theta$ consists of the parameters of a standard tile-coding function approximator, and $\psi$ is a single scalar that determines the probability of a rare event.
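For concreteness, one possible parameterisation of such a scalar is a sigmoid-Bernoulli proposal over the rare-event indicator, sketched below; the exact parameterisation used in our experiments is not specified here, so this is purely illustrative. The sketch also shows the importance weight and the derivative of the proposal density consumed by the primary and secondary optimisations.

```python
import numpy as np

def sample_rare_event(psi, p_true=0.01, rng=np.random):
    """Hypothetical sigmoid-Bernoulli proposal over the rare-event indicator.

    Returns the sampled indicator, the importance weight p(b)/f_psi(b) used by
    the primary optimisation, the proposal probability f_psi(b), and the
    derivative d f_psi(b) / d psi needed by (7)."""
    q = 1.0 / (1.0 + np.exp(-psi))        # proposal probability of a rare event
    rare = rng.random() < q               # sample the environment variable
    f = q if rare else 1.0 - q            # proposal probability of the drawn value
    p = p_true if rare else 1.0 - p_true  # probability under the original MDP
    df_dpsi = q * (1.0 - q) * (1.0 if rare else -1.0)
    return rare, p / f, f, df_dpsi
```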
We first consider an unaliased setting (Figure 1a), in which the agent has separate features for normal and rare events, i.e., the tile coding is duplicated, with each feature activated only for normal or rare events, not both. Hence, the agent can in principle learn effective policies for both events, but must learn a proposal distribution that allocates data appropriately between them. The lines in the middle of the plot show the average performance of OFFER (blue) compared to the policy gradient baseline (black). OFFER substantially improves performance (note the log scale on the y-axis). In fact, the baseline shows negligible learning, confirming that optimising the proposal distribution can be key to successful learning when significant rare events are present. Note that, unlike in previous work (Frank, Mannor, and Precup 2008), OFFER discovered a good proposal distribution on its own.

The lines at the top and bottom of Figure 1a show the performance of the same learning runs split out into just rare events and just normal events, respectively, i.e., the middle lines are weighted averages of the top and bottom lines. These lines show that the baseline algorithm is actually learning, but its progress is limited to the normal events that dominate its data but contribute only negligibly to expected return. By contrast, OFFER learns a proposal distribution that samples rare events with a probability of approximately 0.8, enabling it to efficiently learn how to handle such events. It learns more slowly on the normal events but has correctly concluded that doing so is worth the tradeoff.

Next, we consider an aliased setting (Figure 1b) in which the actor uses the same features in $\theta$ for both normal and rare events, i.e., it cannot directly observe a rare event or condition its behaviour on it. Unlike in the unaliased setting, it cannot learn separate policies for the two events but must learn a single policy that balances between the two, with the correct balance depending critically on both the probability and relative scale of penalties for rare events. Here, OFFER again performs much better. It learns a proposal distribution assigning a probability of about 0.9 to rare events, enabling it to more quickly learn to cope with such events. Furthermore, though initially slower, it quickly catches up to the baseline even on the normal events. Because the two events are somewhat similar, knowledge gained in the rare events transfers to the normal events via the shared features.

Finally, we consider a variant of mountain car (Figure 1c) with a more complex representation of $\psi$, which is now a vector of four features, yielding a bigger challenge for secondary optimisation. Two features control the probability of trigger events, with the rare event occurring only if both are triggered. The other two features control the probability of slippery and jittery conditions, which affect the task but not as dramatically as a rare event. The results show that OFFER again outperforms the baseline, and can thus effectively optimise the proposal distribution even when doing so requires setting multiple features correctly. The learned proposal distribution gives high values to the trigger events to maximise the probability of rare events, while the probabilities of slippery and jittery conditions vary per run.

Figure 1: Average per-episode penalty on the original MDP (lower is better) for OFFER (blue) and baseline PG (black). (a) Unaliased Mountain Car; (b) Aliased Mountain Car; (c) 4-Featured Mountain Car; (d) Robotic Arm.

### Robotic Arm

We also applied OFFER to a simulated robotic-arm control task (Figure 1d) that was recently proposed specifically to model significant rare events (Paul et al. 2016). An off-the-shelf kinematics simulator (Cully et al. 2015) models an arm with three joints, each of which can be set by the agent, whose goal is to get the end effector to a specific location. The task is episodic, and in each step the agent moves each joint by at most 30% of the allowed movement range. The penalty the agent receives is proportional to the distance of the end effector from the target location. Moreover, there is a wall near the arm, and the agent incurs a large penalty if the arm hits it. Usually, the arm is far from the wall (normal event) but occasionally it is near to it (rare event).
OFFER learns a safe policy that avoids the wall even when it is nearby, whereas the baseline is unable to solve the task.

## Conclusions & Future Work

We proposed off-environment reinforcement learning (OFFER), which speeds convergence of policy gradient methods by optimising the proposal distribution from which environment variables are set. We prove that OFFER converges to a locally optimal policy. Furthermore, empirical results in multiple tasks show that OFFER learns better and faster than a policy gradient baseline in settings that are characterised by significant rare events. In future, we plan to apply OFFER to deterministic policy gradient methods (Silver et al. 2014) as well as to the generalised helicopter hovering task (Koppejan and Whiteson 2011), which is known to be characterised by significant rare events.

## Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713).

## References

Ahamed, T. I.; Borkar, V. S.; and Juneja, S. 2006. Adaptive importance sampling technique for Markov chains using stochastic approximation. Operations Research 54(3):489-504.

Bertsekas, D. P., and Rhodes, I. B. 1971. On the minimax feedback control of uncertain dynamic systems. In Decision and Control, 1971 IEEE Conference on, 451-455. IEEE.

Bhatnagar, S.; Sutton, R.; Ghavamzadeh, M.; and Lee, M. 2009. Natural actor-critic algorithms. Automatica.

Cully, A.; Clune, J.; Tarapore, D.; and Mouret, J.-B. 2015. Robots that can adapt like animals. Nature 521(7553):503-507.

Degris, T.; Pilarski, P. M.; and Sutton, R. S. 2012. Model-free reinforcement learning with continuous action in practice. In American Control Conference (ACC), 2012, 2177-2182. IEEE.

Deisenroth, M. P.; Neumann, G.; Peters, J.; et al. 2013. A survey on policy search for robotics. Foundations and Trends in Robotics 2(1-2):1-142.

Desai, P. Y., and Glynn, P. W. 2001. Simulation in optimization and optimization in simulation: a Markov chain perspective on adaptive Monte Carlo algorithms. In Proceedings of the 33rd Conference on Winter Simulation, 379-384. IEEE Computer Society.

Frank, J.; Mannor, S.; and Precup, D. 2008. Reinforcement learning in the presence of rare events. In Proceedings of the 25th International Conference on Machine Learning, 336-343. ACM.

Furmston, T., and Barber, D. 2012. A unifying perspective of parametric policy search methods for Markov decision processes. In Advances in Neural Information Processing Systems, 2717-2725.

García, J., and Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16:1437-1480.

Gehring, C., and Precup, D. 2013. Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, 1037-1044. International Foundation for Autonomous Agents and Multiagent Systems.

Glynn, P. W. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33(10):75-84.

Heger, M. 1994. Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, 105-111.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Koppejan, R., and Whiteson, S. 2011. Neuroevolutionary reinforcement learning for generalized control of simulated helicopters. Evolutionary Intelligence 4(4):219-241.
Malfaz, M., and Salichs, M. A. 2011. Learning to avoid risky actions. Cybernetics and Systems 42(8):636-658.

Nakayama, M. K.; Goyal, A.; and Glynn, P. W. 1994. Likelihood ratio sensitivity analysis for Markovian models of highly dependable systems. Operations Research 42(1):137-157.

Owen, A. B. 2013. Monte Carlo Theory, Methods and Examples.

Paul, S.; Ciosek, K.; Osborne, M. A.; and Whiteson, S. 2016. Alternating optimisation and quadrature for robust reinforcement learning. CoRR abs/1605.07496.

Peters, J., and Schaal, S. 2006. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, 2219-2225. IEEE.

Peters, J., and Schaal, S. 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks 21(4):682-697.

Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of The 31st International Conference on Machine Learning, 387-395.

Spall, J. C. 2000. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control 45(10):1839-1853.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057-1063.

Sutton, R. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.

Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229-256.