# The Option-Critic Architecture

Pierre-Luc Bacon, Jean Harb, Doina Precup
Reasoning and Learning Lab, School of Computer Science, McGill University
{pbacon, jharb, dprecup}@cs.mcgill.ca

Temporal abstraction is key to scaling up learning and planning in reinforcement learning. While planning with temporally extended actions is well understood, creating such abstractions autonomously from data has remained challenging. We tackle this problem in the framework of options [Sutton, Precup & Singh, 1999; Precup, 2000]. We derive policy gradient theorems for options and propose a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals. Experimental results in both discrete and continuous environments showcase the flexibility and efficiency of the framework.

## Introduction

Temporal abstraction allows representing knowledge about courses of action that take place at different time scales. In reinforcement learning, options (Sutton, Precup, and Singh 1999; Precup 2000) provide a framework for defining such courses of action and for seamlessly learning and planning with them. Discovering temporal abstractions autonomously has been the subject of extensive research efforts in the last 15 years (McGovern and Barto 2001; Stolle and Precup 2002; Menache, Mannor, and Shimkin 2002; Şimşek and Barto 2009; Silver and Ciosek 2012), but approaches that can be used naturally with continuous state and/or action spaces have only recently started to become feasible (Konidaris et al. 2011; Niekum 2013; Mann, Mannor, and Precup 2015; Mankowitz, Mann, and Mannor 2016; Kulkarni et al. 2016; Vezhnevets et al. 2016; Daniel et al. 2016).

The majority of the existing work has focused on finding subgoals (useful states that an agent should reach) and subsequently learning policies to achieve them. This idea has led to interesting methods, but ones that are also difficult to scale up given their combinatorial flavor. Additionally, learning policies associated with subgoals can be expensive in terms of data and computation time; in the worst case, it can be as expensive as solving the entire task.

We present an alternative view, which blurs the line between the problem of discovering options and that of learning options. Based on the policy gradient theorem (Sutton et al. 2000), we derive new results which enable a gradual learning process of the intra-option policies and termination functions, simultaneously with the policy over them. This approach works naturally with both linear and non-linear function approximators, under discrete or continuous state and action spaces. Existing methods for learning options are considerably slower when learning from a single task: much of the benefit comes from re-using the learned options in similar tasks. In contrast, we show that our approach is capable of successfully learning options within a single task without incurring any slowdown and while still providing benefits for transfer learning.

We start by reviewing background related to the two main ingredients of our work: policy gradient methods and options. We then describe the core ideas of our approach: the intra-option policy and termination gradient theorems.
Additional technical details are included in the appendix. We present experimental results showing that our approach learns meaningful temporally extended behaviors in an effective manner. As opposed to other methods, we only need to specify the number of desired options; it is not necessary to have subgoals, extra rewards, demonstrations, multiple problems or any other special accommodations (however, the approach can take advantage of pseudo-reward functions if desired). To our knowledge, this is the first end-to-end approach for learning options that scales to very large domains at comparable efficiency.

## Preliminaries and Notation

A Markov Decision Process consists of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $P : \mathcal{S} \times \mathcal{A} \to (\mathcal{S} \to [0, 1])$ and a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. For convenience, we develop our ideas assuming discrete state and action sets. However, our results extend to continuous spaces using usual measure-theoretic assumptions (some of our empirical results are in continuous tasks). A (Markovian stationary) policy is a probability distribution over actions conditioned on states, $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$. In discounted problems, the value function of a policy $\pi$ is defined as the expected return $V_\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s \right]$ and its action-value function as $Q_\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right]$, where $\gamma \in [0, 1)$ is the discount factor. A policy $\pi$ is greedy with respect to a given action-value function $Q$ if $\pi(s, a) > 0$ iff $a = \operatorname{argmax}_{a'} Q(s, a')$. In a discrete MDP, there is at least one optimal policy which is greedy with respect to its own action-value function.

Policy gradient methods (Sutton et al. 2000; Konda and Tsitsiklis 2000) address the problem of finding a good policy by performing stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies, $\pi_\theta$. The policy gradient theorem (Sutton et al. 2000) provides expressions for the gradient of the average reward and discounted reward objectives with respect to $\theta$. In the discounted setting, the objective is defined with respect to a designated start state (or distribution) $s_0$: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 \right]$. The policy gradient theorem shows that:

$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_{s} \mu_{\pi_\theta}(s \mid s_0) \sum_{a} \frac{\partial \pi_\theta(a \mid s)}{\partial \theta}\, Q_{\pi_\theta}(s, a),$$

where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$ is a discounted weighting of the states along the trajectories starting from $s_0$. In practice, the policy gradient is estimated from samples along the on-policy stationary distribution. Thomas (2014) showed that neglecting the discount factor in this stationary distribution makes the usual policy gradient estimator biased. However, correcting for this discrepancy also reduces data efficiency. For simplicity, we build on the framework of (Sutton et al. 2000) and discuss how to extend our results according to (Thomas 2014).
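As a concrete reference point for the policy gradient theorem above, the following is a minimal NumPy sketch (not from the paper) of a REINFORCE-style likelihood-ratio estimator for a tabular softmax policy; the episode format, learning rate, and function names are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """One REINFORCE update for a tabular softmax policy (illustrative sketch).

    theta:   array (n_states, n_actions) of policy logits.
    episode: list of (state, action, reward) triples from one rollout.
    For a softmax policy, d log pi(a|s) / d theta[s] = one_hot(a) - pi(.|s).
    The gamma^t factor matches the discounted state weighting mu in the theorem.
    """
    returns, G = [], 0.0
    for (_, _, r) in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    discount = 1.0
    for (s, a, _), G in zip(episode, returns):
        pi_s = softmax(theta[s])
        grad_log = -pi_s
        grad_log[a] += 1.0
        theta[s] += lr * discount * G * grad_log
        discount *= gamma
    return theta
```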
The options framework (Sutton, Precup, and Singh 1999; Precup 2000) formalizes the idea of temporally extended actions. A Markovian option $\omega \in \Omega$ is a triple $(\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$ in which $\mathcal{I}_\omega \subseteq \mathcal{S}$ is an initiation set, $\pi_\omega$ is an intra-option policy, and $\beta_\omega : \mathcal{S} \to [0, 1]$ is a termination function. We also assume that $\forall s \in \mathcal{S}, \forall \omega \in \Omega : s \in \mathcal{I}_\omega$ (i.e., all options are available everywhere), an assumption made in the majority of option discovery algorithms. We will discuss how to dispense with this assumption in the final section.

Sutton, Precup, and Singh (1999) and Precup (2000) show that an MDP endowed with a set of options becomes a Semi-Markov Decision Process (Puterman 1994, chapter 11), which has a corresponding optimal value function over options $V_\Omega(s)$ and option-value function $Q_\Omega(s, \omega)$. Learning and planning algorithms for MDPs have their counterparts in this setting. However, the existence of the underlying MDP offers the possibility of learning about many different options in parallel: this is the idea of intra-option learning, which we leverage in our work.

## Learning Options

We adopt a continual perspective on the problem of learning options. At any time, we would like to distill all of the available experience into every component of our system: value function and policy over options, intra-option policies and termination functions. To achieve this goal, we focus on learning option policies and termination functions, assuming they are represented using differentiable parameterized function approximators.

We consider the call-and-return option execution model, in which an agent picks option $\omega$ according to its policy over options $\pi_\Omega$, then follows the intra-option policy $\pi_\omega$ until termination (as dictated by $\beta_\omega$), at which point this procedure is repeated. Let $\pi_{\omega,\theta}$ denote the intra-option policy of option $\omega$ parametrized by $\theta$, and $\beta_{\omega,\vartheta}$ the termination function of $\omega$ parametrized by $\vartheta$. We present two new results for learning options, obtained using as blueprint the policy gradient theorem (Sutton et al. 2000). Both results are derived under the assumption that the goal is to learn options that maximize the expected return in the current task. However, if one wanted to add extra information to the objective function, this could readily be done so long as it comes in the form of an additive differentiable function.

Suppose we aim to optimize directly the discounted return, expected over all the trajectories starting at a designated state $s_0$ and option $\omega_0$; then $\rho(\Omega, \theta, \vartheta, s_0, \omega_0) = \mathbb{E}_{\Omega,\theta,\vartheta} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0, \omega_0 \right]$. Note that this return depends on the policy over options, as well as on the parameters of the option policies and termination functions. We will take gradients of this objective with respect to $\theta$ and $\vartheta$. In order to do this, we will manipulate equations similar to those used in intra-option learning (Sutton, Precup, and Singh 1999, section 8). Specifically, the definition of the option-value function can be written as:

$$Q_\Omega(s, \omega) = \sum_{a} \pi_{\omega,\theta}(a \mid s)\, Q_U(s, \omega, a), \tag{1}$$

where $Q_U : \mathcal{S} \times \Omega \times \mathcal{A} \to \mathbb{R}$ is the value of executing an action in the context of a state-option pair:

$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s'). \tag{2}$$

Note that the $(s, \omega)$ pairs lead to an augmented state space, cf. (Levy and Shimkin 2011). However, we will not work explicitly with this space; it is used only to simplify the derivation. The function $U : \Omega \times \mathcal{S} \to \mathbb{R}$ is called the option-value function upon arrival (Sutton, Precup, and Singh 1999, equation 20). The value of executing $\omega$ upon entering a state $s'$ is given by:

$$U(\omega, s') = (1 - \beta_{\omega,\vartheta}(s'))\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s'). \tag{3}$$

Note that $Q_U$ and $U$ both depend on $\theta$ and $\vartheta$, but we do not include these in the notation for clarity.
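In the tabular case, equations (1)-(3) can be written out directly as array operations. The sketch below is illustrative only and not the authors' code; the array shapes, the function names, and the uniform stand-in for the policy over options are assumptions.

```python
import numpy as np

def option_values(pi, beta, Q_U):
    """Compute Q_Omega, V_Omega and U from tabular components (Eqs. 1 and 3).

    pi:   (n_options, n_states, n_actions) intra-option policies pi_{omega,theta}
    beta: (n_options, n_states)            termination probabilities beta_{omega,vartheta}
    Q_U:  (n_states, n_options, n_actions) value of executing a in (s, omega)
    A uniform policy over options stands in for pi_Omega, purely for illustration.
    """
    # Eq. (1): Q_Omega[s, omega] = sum_a pi[omega, s, a] * Q_U[s, omega, a]
    Q_Omega = np.einsum('osa,soa->so', pi, Q_U)
    V_Omega = Q_Omega.mean(axis=1)                      # uniform pi_Omega (assumption)
    # Eq. (3): U[s', omega] = (1 - beta) * Q_Omega[s', omega] + beta * V_Omega[s']
    U = (1.0 - beta.T) * Q_Omega + beta.T * V_Omega[:, None]
    return Q_Omega, V_Omega, U

def q_u(r, P, U, gamma=0.99):
    """Eq. (2): Q_U[s, omega, a] = r[s, a] + gamma * sum_{s'} P[s, a, s'] * U[s', omega].

    r: (n_states, n_actions), P: (n_states, n_actions, n_states), U: (n_states, n_options).
    Returns an array of shape (n_states, n_options, n_actions)."""
    return r[:, None, :] + gamma * np.einsum('san,no->soa', P, U)
```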
The last ingredient required to derive policy gradients is the Markov chain along which the performance measure is estimated. The natural approach is to consider the chain defined in the augmented state space, because state-option pairs now play the role of regular states in a usual Markov chain. If option $\omega_t$ has been initiated or is executing at time $t$ in state $s_t$, then the probability of transitioning to $(s_{t+1}, \omega_{t+1})$ in one step is:

$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_{a} \pi_{\omega_t,\theta}(a \mid s_t)\, P(s_{t+1} \mid s_t, a) \Big( (1 - \beta_{\omega_t,\vartheta}(s_{t+1}))\, \mathbf{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\vartheta}(s_{t+1})\, \pi_\Omega(\omega_{t+1} \mid s_{t+1}) \Big). \tag{4}$$

Clearly, the process given by (4) is homogeneous. Under mild conditions, and with options available everywhere, it is in fact ergodic, and a unique stationary distribution over state-option pairs exists.

We will now compute the gradient of the expected discounted return with respect to the parameters $\theta$ of the intra-option policies, assuming that they are stochastic and differentiable. From (1, 2), it follows that:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \theta} = \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \sum_{a} \pi_{\omega,\theta}(a \mid s) \sum_{s'} \gamma P(s' \mid s, a)\, \frac{\partial U(\omega, s')}{\partial \theta}.$$

We can further expand the right hand side using (3) and (4), which yields the following theorem:

**Theorem 1 (Intra-Option Policy Gradient Theorem).** Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$, the gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is:

$$\sum_{s, \omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a),$$

where $\mu_\Omega(s, \omega \mid s_0, \omega_0)$ is a discounted weighting of state-option pairs along trajectories starting from $(s_0, \omega_0)$: $\mu_\Omega(s, \omega \mid s_0, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, \omega_t = \omega \mid s_0, \omega_0)$.

The proof is in the appendix. This gradient describes the effect of a local change at the primitive level on the global expected discounted return. In contrast, subgoal or pseudo-reward methods assume the objective of an option is simply to optimize its own reward function, ignoring how a proposed change would propagate in the overall objective.

We now turn our attention to computing gradients for the termination functions, assumed this time to be stochastic and differentiable in $\vartheta$. From (1, 2, 3), we have:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \vartheta} = \sum_{a} \pi_{\omega,\theta}(a \mid s) \sum_{s'} \gamma P(s' \mid s, a)\, \frac{\partial U(\omega, s')}{\partial \vartheta}.$$

Hence, the key quantity is the gradient of $U$. This is a natural consequence of the call-and-return execution, in which the goodness of termination functions can only be evaluated upon entering the next state. The relevant gradient can be further expanded as:

$$\frac{\partial U(\omega, s')}{\partial \vartheta} = -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega) + \sum_{s''} \sum_{\omega'} \gamma P(s'', \omega' \mid s', \omega)\, \frac{\partial U(\omega', s'')}{\partial \vartheta},$$

where $A_\Omega$ is the advantage function (Baird 1993) over options, $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$. Expanding $\frac{\partial U(\omega', s'')}{\partial \vartheta}$ recursively leads to a similar form as in Theorem 1, but where the weighting of state-option pairs is now according to a Markov chain shifted by one time step, $\mu_\Omega(s_{t+1}, \omega_t \mid s_t, \omega_{t-1})$ (details are in the appendix).

**Theorem 2 (Termination Gradient Theorem).** Given a set of Markov options with stochastic termination functions differentiable in their parameters $\vartheta$, the gradient of the expected discounted return objective with respect to $\vartheta$ and the initial condition $(s_1, \omega_0)$ is:

$$-\sum_{s', \omega} \mu_\Omega(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega),$$

where $\mu_\Omega(s', \omega \mid s_1, \omega_0)$ is a discounted weighting of state-option pairs from $(s_1, \omega_0)$: $\mu_\Omega(s, \omega \mid s_1, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s, \omega_t = \omega \mid s_1, \omega_0)$.

The advantage function often appears in policy gradient methods (Sutton et al. 2000) when forming a baseline to reduce the variance in the gradient estimates. Its presence in that context has to do mostly with algorithm design. It is interesting that in our case it follows as a direct consequence of the derivation and gives the theorem an intuitive interpretation: when the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative and it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using $\pi_\Omega$. A similar idea also underlies the interrupting execution model of options (Sutton, Precup, and Singh 1999), in which termination is forced whenever the value of $Q_\Omega(s', \omega)$ for the current option $\omega$ is less than $V_\Omega(s')$. Mann, Mankowitz, and Mannor (2014) recently studied interrupting options through the lens of an interrupting Bellman operator in a value-iteration setting. The termination gradient theorem can be interpreted as providing a gradient-based interrupting Bellman operator.
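In practice, Theorems 1 and 2 suggest simple stochastic updates when the intra-option policies are Boltzmann (softmax) and the terminations are sigmoid, as used in the experiments below. The following NumPy sketch is illustrative only; the linear-in-features parametrization, function names, and the optional margin term are assumptions rather than the authors' code.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def intra_option_policy_step(theta, phi, a, q_u, lr):
    """Theorem 1 (stochastic form): theta += lr * d log pi_{omega,theta}(a|s)/d theta * Q_U(s, omega, a).

    theta: (n_actions, n_features) softmax weights of the current option.
    phi:   (n_features,) state features; a: sampled action; q_u: critic estimate."""
    pi = softmax(theta @ phi)
    grad_log = -np.outer(pi, phi)          # d log pi(a|s)/d theta for a softmax policy
    grad_log[a] += phi
    theta += lr * q_u * grad_log
    return theta

def termination_step(vartheta, phi_next, q_omega, v_omega, lr, xi=0.0):
    """Theorem 2 (stochastic form): vartheta -= lr * d beta_{omega,vartheta}(s')/d vartheta * A_Omega(s', omega).

    vartheta: (n_features,) sigmoid termination weights of the current option.
    xi is an optional margin added to the advantage (used later in the ALE experiments)."""
    beta = 1.0 / (1.0 + np.exp(-(vartheta @ phi_next)))
    advantage = q_omega - v_omega + xi
    grad_beta = beta * (1.0 - beta) * phi_next          # d beta / d vartheta for a sigmoid
    vartheta -= lr * grad_beta * advantage
    return vartheta
```

These two steps correspond to the "options improvement" part of Algorithm 1 below.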
## Algorithms and Architecture

Figure 1: Diagram of the option-critic architecture (environment, critic with TD error, policy over options). The option execution model is depicted by a switch over the contacts. A new option is selected according to $\pi_\Omega$ only when the current option terminates.

Based on Theorems 1 and 2, we can now design a stochastic gradient descent algorithm for learning options. Using a two-timescale framework (Konda and Tsitsiklis 2000), we propose to learn the values at a fast timescale while updating the intra-option policies and termination functions at a slower rate. We refer to the resulting system as an option-critic architecture, in reference to the actor-critic architectures (Sutton 1984). The intra-option policies, termination functions and policy over options belong to the actor part of the system, while the critic consists of $Q_U$ and $A_\Omega$. The option-critic architecture does not prescribe how to obtain $\pi_\Omega$, since a variety of existing approaches would apply: using policy gradient methods at the SMDP level, with a planner over the option models, or using temporal difference updates. If $\pi_\Omega$ is the greedy policy over options, it follows from (2) that the corresponding one-step off-policy update target $g_t^{(1)}$ is:

$$g_t^{(1)} = r_{t+1} + \gamma \Big( (1 - \beta_{\omega_t,\vartheta}(s_{t+1})) \sum_{a} \pi_{\omega_t,\theta}(a \mid s_{t+1})\, Q_U(s_{t+1}, \omega_t, a) + \beta_{\omega_t,\vartheta}(s_{t+1}) \max_{\omega'} \sum_{a} \pi_{\omega',\theta}(a \mid s_{t+1})\, Q_U(s_{t+1}, \omega', a) \Big),$$

which is also the update target of the intra-option Q-learning algorithm of (Sutton, Precup, and Singh 1999). A prototypical implementation of option-critic which uses intra-option Q-learning is shown in Algorithm 1. The tabular setting is assumed only for clarity of presentation. We write $\alpha$, $\alpha_\theta$ and $\alpha_\vartheta$ for the learning rates of the critic, intra-option policies and termination functions respectively.

Algorithm 1: Option-critic with tabular intra-option Q-learning

    s ← s₀
    Choose ω according to an ε-soft policy over options πΩ(s)
    repeat
        Choose a according to πω,θ(a | s)
        Take action a in s, observe s′, r

        1. Options evaluation:
            δ ← r − QU(s, ω, a)
            if s′ is non-terminal then
                δ ← δ + γ (1 − βω,ϑ(s′)) QΩ(s′, ω) + γ βω,ϑ(s′) maxω̄ QΩ(s′, ω̄)
            QU(s, ω, a) ← QU(s, ω, a) + α δ

        2. Options improvement:
            θ ← θ + αθ (∂ log πω,θ(a | s) / ∂θ) QU(s, ω, a)
            ϑ ← ϑ − αϑ (∂ βω,ϑ(s′) / ∂ϑ) (QΩ(s′, ω) − VΩ(s′))

        if βω,ϑ terminates in s′ then
            choose new ω according to ε-soft(πΩ(s′))
        s ← s′
    until s′ is terminal

Learning $Q_U$ in addition to $Q_\Omega$ is computationally wasteful, both in terms of the number of parameters and of samples. A practical solution is to only learn $Q_\Omega$ and derive an estimate of $Q_U$ from it. Because $Q_U$ is an expectation over next states, $Q_U(s, \omega, a) = \mathbb{E}_{s' \sim P} \left[ r(s, a) + \gamma U(\omega, s') \mid s, \omega, a \right]$, it follows that $g_t^{(1)}$ is an appropriate estimator. We chose this approach for our experiment with deep neural networks in the Arcade Learning Environment.
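A minimal, runnable Python sketch of Algorithm 1 might look as follows. The environment interface (reset/step), the learning rates, and the use of the greedy value as a stand-in for $V_\Omega$ under the ε-soft policy over options are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def option_critic_tabular(env, n_states, n_actions, n_options=4,
                          gamma=0.99, alpha=0.5, alpha_theta=0.25,
                          alpha_vartheta=0.25, epsilon=0.1,
                          n_episodes=1000, seed=0):
    """Tabular option-critic in the spirit of Algorithm 1. `env` is assumed to
    expose reset() -> s and step(a) -> (s_next, r, done) with integer indices."""
    rng = np.random.default_rng(seed)
    Q_U = np.zeros((n_states, n_options, n_actions))    # critic Q_U(s, omega, a)
    theta = np.zeros((n_options, n_states, n_actions))  # intra-option policy logits
    vartheta = np.zeros((n_options, n_states))          # termination logits

    def pi(omega, s):                                   # Boltzmann intra-option policy
        z = theta[omega, s] - theta[omega, s].max()
        e = np.exp(z)
        return e / e.sum()

    def beta(omega, s):                                 # sigmoid termination
        return 1.0 / (1.0 + np.exp(-vartheta[omega, s]))

    def q_omega(s):                                     # Eq. (1), for every option
        return np.array([pi(o, s) @ Q_U[s, o] for o in range(n_options)])

    def choose_option(s):                               # epsilon-greedy policy over options
        if rng.random() < epsilon:
            return int(rng.integers(n_options))
        return int(np.argmax(q_omega(s)))

    for _ in range(n_episodes):
        s = env.reset()
        omega = choose_option(s)
        done = False
        while not done:
            a = rng.choice(n_actions, p=pi(omega, s))
            s_next, r, done = env.step(a)

            # 1. Options evaluation (intra-option Q-learning target)
            target = r
            if not done:
                q_next = q_omega(s_next)
                target += gamma * ((1 - beta(omega, s_next)) * q_next[omega]
                                   + beta(omega, s_next) * q_next.max())
            Q_U[s, omega, a] += alpha * (target - Q_U[s, omega, a])

            # 2. Options improvement
            probs = pi(omega, s)
            grad_log = -probs
            grad_log[a] += 1.0                          # d log pi / d theta[omega, s]
            theta[omega, s] += alpha_theta * grad_log * Q_U[s, omega, a]

            if not done:
                q_next = q_omega(s_next)
                # greedy value used as V_Omega (a simplification of the eps-soft value)
                advantage = q_next[omega] - q_next.max()
                b = beta(omega, s_next)
                vartheta[omega, s_next] -= alpha_vartheta * b * (1 - b) * advantage
                if rng.random() < b:                    # termination: pick a new option
                    omega = choose_option(s_next)
            s = s_next
    return Q_U, theta, vartheta
```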
## Experiments

We first consider a navigation task in the four-rooms domain (Sutton, Precup, and Singh 1999). Our goal is to evaluate the ability of a set of options learned fully autonomously to recover from a sudden change in the environment. Sutton, Precup, and Singh (1999) presented a similar experiment for a set of pre-specified options; the options in our results have not been specified a priori. Initially the goal is located in the east doorway and the initial state is drawn uniformly from all the other cells. After 1000 episodes, the goal moves to a random location in the lower right room. Primitive movements can fail with probability 1/3, in which case the agent transitions randomly to one of the empty adjacent cells. The discount factor was 0.99, and the reward was +1 at the goal and 0 otherwise. We chose to parametrize the intra-option policies with Boltzmann distributions and the terminations with sigmoid functions. The policy over options was learned using intra-option Q-learning. We also implemented a primitive actor-critic agent (denoted AC-PG) using a Boltzmann policy, and compared option-critic to a primitive SARSA agent using Boltzmann exploration and no eligibility traces. For all Boltzmann policies, we set the temperature parameter to 0.001. All the weights were initialized to zero.

Figure 2: After 1000 episodes, the goal location in the four-rooms domain is moved randomly. Option-critic (OC, with 4 or 8 options) recovers faster than the primitive actor-critic (AC-PG) and SARSA(0). Each line is averaged over 350 runs.

As can be seen in Figure 2, when the goal suddenly changes, the option-critic agent recovers faster. Furthermore, the initial set of options is learned from scratch at a rate comparable to primitive methods. Despite the simplicity of the domain, we are not aware of other methods which could have solved this task without incurring a cost much larger than when using primitive actions alone (McGovern and Barto 2001; Şimşek and Barto 2009).

Figure 3: Termination probabilities for the option-critic agent learning with 4 options. The darkest color represents the walls in the environment, while lighter colors encode higher termination probabilities.

In the two temporally extended settings, with 4 options and with 8 options, termination events are more likely to occur near the doorways (Figure 3), agreeing with the intuition that they would be good subgoals. As opposed to (Sutton, Precup, and Singh 1999), we did not encode this knowledge ourselves but simply let the agents find options that would maximize the expected discounted return.

### Pinball Domain

Figure 4: Pinball: sample trajectory of the solution found after 250 episodes of training using 4 options. All options (color-coded) are used by the policy over options in successful trajectories. The initial state is in the top left corner and the goal is in the bottom right one (red circle).

In the Pinball domain (Konidaris and Barto 2009), a ball must be guided through a maze of arbitrarily shaped polygons to a designated target location. The state space is continuous over the position and velocity of the ball in the x-y plane. At every step, the agent must choose among five discrete primitive actions: move the ball faster or slower, in the vertical or horizontal direction, or take the null action. Collisions with obstacles are elastic and can be used to the advantage of the agent. In this domain, a drag coefficient of 0.995 effectively stops ball movements after a finite number of steps when the null action is chosen repeatedly.
Each thrust action incurs a penalty of 5 while taking no action costs 1. The episode terminates with a +10000 reward when the agent reaches the target. We interrupted any episode taking more than 10000 steps and set the discount factor to 0.99. We used intra-option Q-learning in the critic, with linear function approximation over Fourier bases (Konidaris et al. 2011) of order 3. We experimented with 2, 3 and 4 options. We used Boltzmann policies for the intra-option policies and linear-sigmoid functions for the termination functions. The learning rates were set to 0.01 for the critic and 0.001 for both the intra-option and termination gradients. We used an ε-greedy policy over options with ε = 0.01.

Figure 5: Learning curves (undiscounted return per episode) in the Pinball domain with 2, 3 and 4 options.

In (Konidaris and Barto 2009), an option can only be used and updated after a gestation period of 10 episodes. As learning is fully integrated in option-critic, by 40 episodes a near-optimal set of options had already been learned in all settings. From a qualitative point of view, the options exhibit temporal extension and specialization (Figure 4). We also observed that across many successful trajectories the red option would consistently be used in the vicinity of the goal.

### Arcade Learning Environment

We applied the option-critic architecture in the Arcade Learning Environment (ALE) (Bellemare et al. 2013), using a deep neural network to approximate the critic and represent the intra-option policies and termination functions. We used the same configuration as (Mnih et al. 2013) for the first three convolutional layers of the network: 32 convolutional filters of size 8 × 8 with a stride of 4 in the first layer, 64 filters of size 4 × 4 with a stride of 2 in the second, and 64 filters of size 3 × 3 with a stride of 1 in the third layer. We then fed the output of the third layer into a dense shared layer of 512 neurons, as depicted in Figure 6. We fixed the learning rate for the intra-option policies and termination gradients to 0.00025 and used RMSProp for the critic.

Figure 6: Deep neural network architecture. A concatenation of the last 4 images is fed through the convolutional layers, producing a dense representation shared across intra-option policies, termination functions and the policy over options.

We represented the intra-option policies as linear-softmax functions of the fourth (dense) layer, so as to output a probability distribution over actions conditioned on the current observation. The termination functions were similarly defined using sigmoid functions, with one output neuron per termination. The critic network was trained using intra-option Q-learning with experience replay. Option policies and terminations were updated on-line. We used an ε-greedy policy over options with ε = 0.05 during the test phase (Mnih et al. 2013).
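For concreteness, the shared network of Figure 6, with per-option intra-option policy heads and sigmoid termination heads, could be sketched as follows in PyTorch. The 84×84×4 input size, ReLU activations, and exact head shapes are assumptions not stated explicitly above; this is an illustrative sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionCriticNet(nn.Module):
    """Sketch of the shared representation in Figure 6 with option-critic heads."""

    def __init__(self, n_actions, n_options=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),          # shared dense layer
        )
        self.q_omega = nn.Linear(512, n_options)            # critic Q_Omega(s, .)
        self.terminations = nn.Linear(512, n_options)       # sigmoid termination heads
        self.policies = nn.Linear(512, n_options * n_actions)  # linear-softmax heads
        self.n_options, self.n_actions = n_options, n_actions

    def forward(self, frames, option):
        """frames: (batch, 4, 84, 84) stacked frames; option: (batch,) option indices."""
        x = self.features(frames.float() / 255.0)
        q = self.q_omega(x)                                  # (batch, n_options)
        beta = torch.sigmoid(self.terminations(x))           # (batch, n_options)
        logits = self.policies(x).view(-1, self.n_options, self.n_actions)
        pi = F.softmax(logits[torch.arange(x.size(0)), option], dim=-1)
        return q, beta, pi
```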
As a consequence of optimizing for the return, the termination gradient tends to shrink options over time. This is expected, since in theory primitive actions are sufficient for solving any MDP. We tackled this issue by adding a small ξ = 0.01 term to the advantage function used by the termination gradient: $A_\Omega(s, \omega) + \xi = Q_\Omega(s, \omega) - V_\Omega(s) + \xi$. This term has a regularization effect, by imposing a ξ-margin between the value estimate of an option and that of the optimal one reflected in $V_\Omega$. It makes the advantage function positive if the value of an option is near the optimal one, thereby stretching it. A similar regularizer was proposed in (Mann, Mankowitz, and Mannor 2014).

As in (Mnih et al. 2016), we observed that the intra-option policies would quickly become deterministic. This problem seems to pertain to the use of policy gradient methods with deep neural networks in general, and not to option-critic itself. We applied the regularizer prescribed by (Mnih et al. 2016), penalizing low-entropy intra-option policies.

Figure 7: Seaquest: using a baseline in the gradient estimators improves the distribution over actions in the intra-option policies, making them less deterministic. Each column represents one of the options learned in Seaquest. The vertical axis spans the 18 primitive actions of ALE, and the empirical action frequencies are coded by intensity. Panels: 8 options with no baseline, 8 options with a baseline, and 2 options with a baseline.

Finally, the baseline $Q_\Omega$ was added to the intra-option policy gradient estimator to reduce its variance. This change provided substantial improvements (Harb 2016) in the quality of the intra-option policy distributions and the overall agent performance, as shown in Figure 7.

We evaluated option-critic in Asterix, Ms. Pacman, Seaquest and Zaxxon. For comparison, we allowed the system to learn for the same number of episodes as (Mnih et al. 2013) and fixed the parameters to the same values in all four domains. Despite having more parameters to learn, option-critic was capable of learning options that would achieve the goal in all games, from the ground up, within 200 episodes (Figure 8). In Asterix, Seaquest and Zaxxon, option-critic surpassed the performance of the original DQN architecture based on primitive actions. The eight options learned in each game are learned fully end-to-end, in tandem with the feature representation, with no prior specification of a subgoal or pseudo-reward structure.

The solution found by option-critic was easy to interpret in the game of Seaquest when learning with only two options. We found that each option specialized in a behavior sequence which would include either the up or the down button. Figure 9 shows a typical transition from one option to the other, first going upward with option 0 and then switching to option 1 downward. Options with a similar structure were also found in this game by (Krishnamurthy et al. 2016), using an option discovery algorithm based on graph partitioning.

## Related Work

As option discovery has received a lot of attention recently, we now discuss in more detail the place of our approach with respect to others. Comanici and Precup (2010) used a gradient-based approach for improving only the termination function of semi-Markov options; termination was modeled by a logistic distribution over a cumulative measure of the features observed since initiation. Levy and Shimkin (2011) also built on policy gradient methods, by constructing explicitly the augmented state space and treating stopping events as additional control actions. In contrast, we do not need to construct this (very large) space directly. Silver and Ciosek (2012) dynamically chained options into longer temporal sequences by relying on compositionality properties. Earlier work on linear options (Sorg and Singh 2010) also used compositionality to plan using linear expectation models for options. Our approach also relies on the Bellman equations and compositionality, but in conjunction with policy gradient methods.
Several very recent papers also attempt to formulate option discovery as an optimization problem with solutions that are compatible with function approximation. Daniel et al. (2016) learn return-optimizing options by treating the termination functions as hidden variables and using expectation-maximization to learn them. Vezhnevets et al. (2016) consider the problem of learning options that have open-loop intra-option policies, also called macro-actions. As in classical planning, action sequences that are more frequent are cached. A mapping from states to action sequences is learned along with a commitment module, which triggers re-planning when necessary. In contrast, we use closed-loop policies throughout, which are reactive to state information and can provide better solutions. Mankowitz, Mann, and Mannor (2016) propose a gradient-based option learning algorithm, assuming a particular structure for the initiation sets and termination functions. Under this framework, exactly one option is active in any partition of the state space. Kulkarni et al. (2016) use the DQN framework to implement a gradient-based option learner, which uses intrinsic rewards to learn the internal policies of options and extrinsic rewards to learn the policy over options. As opposed to our framework, descriptions of the subgoals are given as inputs to the option learners. Option-critic is conceptually general and does not require intrinsic motivation for learning the options.

Figure 8: Learning curves in the Arcade Learning Environment for (a) Asterix, (b) Ms. Pacman, (c) Seaquest and (d) Zaxxon. Curves show the testing score per epoch, its moving average (10), and the DQN baseline. The same set of parameters was used across all four games: 8 options, 0.01 termination regularization, 0.01 entropy regularization, and a baseline for the intra-option policy gradients.

Figure 9: Up/down specialization in the solution found by option-critic when learning with 2 options in Seaquest. The top bar shows a trajectory in the game, with white representing a segment during which option 1 was active and black for option 2.

## Discussion

We developed a general gradient-based approach for learning simultaneously the intra-option policies and termination functions, as well as the policy over options, in order to optimize a performance objective for the task at hand. Our ALE experiments demonstrate successful end-to-end learning of options in the presence of nonlinear function approximation. As noted, our approach only requires specifying the number of options. However, if one wanted to use additional pseudo-rewards, the option-critic framework would easily accommodate them. In this case, the internal policy and termination function gradients would simply need to be taken with respect to the pseudo-rewards instead of the task reward. A simple instance of this idea, which we used in some of the experiments, is to use additional rewards to encourage options that are indeed temporally extended, by adding a penalty whenever a switching event occurs. Our approach can work seamlessly with any other heuristic for biasing the set of options towards some desirable property (e.g. compositionality or sparsity), as long as it can be expressed as an additive reward structure. However, as seen in the results, such biasing is not necessary to produce good results.
The option-critic architecture relies on the policy gradient theorem and, as discussed in (Thomas 2014), the gradient estimators can be biased in the discounted case. By introducing factors of the form $\gamma^t \prod_{i=1}^{t} (1 - \beta_i)$ in our updates (Thomas 2014, eq. (3)), it would be possible to obtain unbiased estimates. However, we do not recommend this approach, since the sample complexity of the unbiased estimators is generally too high and the biased estimators performed well in our experiments.

Perhaps the biggest remaining limitation of our work is the assumption that all options apply everywhere. In the case of function approximation, a natural extension to initiation sets is to use a classifier over features, or some other form of function approximation. As a result, determining which options are allowed may have a cost similar to evaluating a policy over options (unlike in the tabular setting, where options with sparse initiation sets lead to faster decisions). This is akin to eligibility traces, which are more expensive than using no trace in the tabular case, but have the same complexity with function approximation. If initiation sets are to be learned, the main constraint that needs to be added is that the options and the policy over them lead to an ergodic chain in the augmented state-option space. This can be expressed as a flow condition that links initiation sets with terminations. The precise description of this condition, as well as sparsity regularization for initiation sets, is left for future work.

## Acknowledgements

The authors gratefully acknowledge financial support for this work by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Québec - Nature et technologies (FRQNT).

## Appendix

### Augmented Process

If $\omega_t$ has been initiated or is executing at time $t$, then the discounted probability of transitioning to $(s_{t+1}, \omega_{t+1})$ is:

$$P_\gamma^{(1)}(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_{a} \pi_{\omega_t}(a \mid s_t)\, \gamma P(s_{t+1} \mid s_t, a) \Big( (1 - \beta_{\omega_t}(s_{t+1}))\, \mathbf{1}_{\omega_{t+1} = \omega_t} + \beta_{\omega_t}(s_{t+1})\, \pi_\Omega(\omega_{t+1} \mid s_{t+1}) \Big).$$

When conditioning the process from $(s_t, \omega_{t-1})$, the discounted probability of transitioning to $(s_{t+1}, \omega_t)$ is:

$$P_\gamma^{(1)}(s_{t+1}, \omega_t \mid s_t, \omega_{t-1}) = \Big( (1 - \beta_{\omega_{t-1}}(s_t))\, \mathbf{1}_{\omega_t = \omega_{t-1}} + \beta_{\omega_{t-1}}(s_t)\, \pi_\Omega(\omega_t \mid s_t) \Big) \sum_{a} \pi_{\omega_t}(a \mid s_t)\, \gamma P(s_{t+1} \mid s_t, a).$$

More generally, the $k$-step discounted probabilities can be expressed recursively as follows:

$$P_\gamma^{(k)}(s_{t+k}, \omega_{t+k} \mid s_t, \omega_t) = \sum_{s_{t+1}} \sum_{\omega_{t+1}} P_\gamma^{(1)}(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t)\, P_\gamma^{(k-1)}(s_{t+k}, \omega_{t+k} \mid s_{t+1}, \omega_{t+1}),$$

$$P_\gamma^{(k)}(s_{t+k}, \omega_{t+k-1} \mid s_t, \omega_{t-1}) = \sum_{s_{t+1}} \sum_{\omega_t} P_\gamma^{(1)}(s_{t+1}, \omega_t \mid s_t, \omega_{t-1})\, P_\gamma^{(k-1)}(s_{t+k}, \omega_{t+k-1} \mid s_{t+1}, \omega_t).$$

### Proof of the Intra-Option Policy Gradient Theorem

Taking the gradient of the option-value function (1):

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \theta} = \sum_{a} \left( \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \pi_{\omega,\theta}(a \mid s)\, \frac{\partial Q_U(s, \omega, a)}{\partial \theta} \right) = \sum_{a} \left( \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \pi_{\omega,\theta}(a \mid s) \sum_{s'} \gamma P(s' \mid s, a)\, \frac{\partial U(\omega, s')}{\partial \theta} \right), \tag{6}$$

with

$$\frac{\partial U(\omega, s')}{\partial \theta} = (1 - \beta_{\omega,\vartheta}(s'))\, \frac{\partial Q_\Omega(s', \omega)}{\partial \theta} + \beta_{\omega,\vartheta}(s')\, \frac{\partial V_\Omega(s')}{\partial \theta} = (1 - \beta_{\omega,\vartheta}(s'))\, \frac{\partial Q_\Omega(s', \omega)}{\partial \theta} + \beta_{\omega,\vartheta}(s') \sum_{\omega'} \pi_\Omega(\omega' \mid s')\, \frac{\partial Q_\Omega(s', \omega')}{\partial \theta} = \sum_{\omega'} \Big( (1 - \beta_{\omega,\vartheta}(s'))\, \mathbf{1}_{\omega' = \omega} + \beta_{\omega,\vartheta}(s')\, \pi_\Omega(\omega' \mid s') \Big) \frac{\partial Q_\Omega(s', \omega')}{\partial \theta}, \tag{7}$$

where (7) follows from the assumption that $\theta$ only appears in the intra-option policies. Substituting (7) into (6) yields a recursion which, using the previous remarks about the augmented process, can be transformed into:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \theta} = \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \sum_{a} \pi_{\omega,\theta}(a \mid s) \sum_{s'} \gamma P(s' \mid s, a) \sum_{\omega'} \Big( \beta_{\omega,\vartheta}(s')\, \pi_\Omega(\omega' \mid s') + (1 - \beta_{\omega,\vartheta}(s'))\, \mathbf{1}_{\omega' = \omega} \Big) \frac{\partial Q_\Omega(s', \omega')}{\partial \theta}$$

$$= \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \sum_{s'} \sum_{\omega'} P_\gamma^{(1)}(s', \omega' \mid s, \omega)\, \frac{\partial Q_\Omega(s', \omega')}{\partial \theta}$$

$$= \sum_{k=0}^{\infty} \sum_{s', \omega'} P_\gamma^{(k)}(s', \omega' \mid s, \omega) \sum_{a} \frac{\partial \pi_{\omega',\theta}(a \mid s')}{\partial \theta}\, Q_U(s', \omega', a).$$

The gradient of the expected discounted return with respect to $\theta$ is then:

$$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s, \omega} \sum_{k=0}^{\infty} P_\gamma^{(k)}(s, \omega \mid s_0, \omega_0) \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) = \sum_{s, \omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a).$$
### Proof of the Termination Gradient Theorem

The expected sum of discounted rewards starting from $(s_1, \omega_0)$ is given by $U(\omega_0, s_1)$. We start by expanding $U$ as follows:

$$U(\omega, s') = (1 - \beta_{\omega,\vartheta}(s'))\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s') = (1 - \beta_{\omega,\vartheta}(s')) \sum_{a} \pi_{\omega,\theta}(a \mid s') \Big( r(s', a) + \sum_{s''} \gamma P(s'' \mid s', a)\, U(\omega, s'') \Big) + \beta_{\omega,\vartheta}(s') \sum_{\omega'} \pi_\Omega(\omega' \mid s') \sum_{a} \pi_{\omega',\theta}(a \mid s') \Big( r(s', a) + \sum_{s''} \gamma P(s'' \mid s', a)\, U(\omega', s'') \Big).$$

The gradient of $U$ is then:

$$\frac{\partial U(\omega, s')}{\partial \vartheta} = \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} \underbrace{\big( V_\Omega(s') - Q_\Omega(s', \omega) \big)}_{-A_\Omega(s', \omega)} + (1 - \beta_{\omega,\vartheta}(s')) \sum_{a} \pi_{\omega,\theta}(a \mid s') \sum_{s''} \gamma P(s'' \mid s', a)\, \frac{\partial U(\omega, s'')}{\partial \vartheta} + \beta_{\omega,\vartheta}(s') \sum_{\omega'} \pi_\Omega(\omega' \mid s') \sum_{a} \pi_{\omega',\theta}(a \mid s') \sum_{s''} \gamma P(s'' \mid s', a)\, \frac{\partial U(\omega', s'')}{\partial \vartheta}.$$

Using the structure of the augmented process:

$$\frac{\partial U(\omega, s')}{\partial \vartheta} = -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega) + \sum_{s''} \sum_{\omega'} P_\gamma^{(1)}(s'', \omega' \mid s', \omega)\, \frac{\partial U(\omega', s'')}{\partial \vartheta} = -\sum_{k=0}^{\infty} \sum_{s'', \omega'} P_\gamma^{(k)}(s'', \omega' \mid s', \omega)\, \frac{\partial \beta_{\omega',\vartheta}(s'')}{\partial \vartheta}\, A_\Omega(s'', \omega').$$

We finally obtain:

$$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{k=0}^{\infty} \sum_{s', \omega} P_\gamma^{(k)}(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega) = -\sum_{\omega, s'} \mu_\Omega(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega).$$

## References

Baird, L. C. 1993. Advantage updating. Technical Report WL-TR-93-1146, Wright Laboratory.

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253-279.

Comanici, G., and Precup, D. 2010. Optimal policy switching algorithms for reinforcement learning. In AAMAS, 709-714.

Şimşek, Ö., and Barto, A. G. 2009. Skill characterization based on betweenness. In NIPS 21, 1497-1504.

Daniel, C.; van Hoof, H.; Peters, J.; and Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning, Special Issue 104(2):337-357.

Harb, J. 2016. Learning options in deep reinforcement learning. Master's thesis, McGill University.

Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In NIPS 12, 1008-1014.

Konidaris, G., and Barto, A. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS 22, 1015-1023.

Konidaris, G.; Kuindersma, S.; Grupen, R. A.; and Barto, A. G. 2011. Autonomous skill acquisition on a mobile manipulator. In AAAI.

Krishnamurthy, R.; Lakshminarayanan, A. S.; Kumar, P.; and Ravindran, B. 2016. Hierarchical reinforcement learning using spatio-temporal abstractions and deep neural networks. CoRR abs/1605.05359.

Kulkarni, T.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In NIPS 29.

Levy, K. Y., and Shimkin, N. 2011. Unified inter and intra options learning using policy gradient methods. In EWRL, 153-164.

Mankowitz, D. J.; Mann, T. A.; and Mannor, S. 2016. Adaptive skills, adaptive partitions (ASAP). In NIPS 29.

Mann, T. A.; Mankowitz, D. J.; and Mannor, S. 2014. Time-regularized interrupting options (TRIO). In ICML, 1350-1358.

Mann, T. A.; Mannor, S.; and Precup, D. 2015. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research 53:375-438.

McGovern, A., and Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 361-368.

Menache, I.; Mannor, S.; and Shimkin, N. 2002. Q-Cut - dynamic discovery of sub-goals in reinforcement learning. In ECML, 295-306.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML.

Niekum, S. 2013. Semantically Grounded Learning from Unstructured Demonstrations. Ph.D. Dissertation, University of Massachusetts, Amherst.
Precup, D. 2000. Temporal abstraction in reinforcement learning. Ph.D. Dissertation, University of Massachusetts, Amherst.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Silver, D., and Ciosek, K. 2012. Compositional planning using optimal option models. In ICML.

Sorg, J., and Singh, S. P. 2010. Linear options. In AAMAS, 31-38.

Stolle, M., and Precup, D. 2002. Learning options in reinforcement learning. In Abstraction, Reformulation and Approximation, 5th International Symposium (SARA) Proceedings, 212-223.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In NIPS 12, 1057-1063.

Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181-211.

Sutton, R. S. 1984. Temporal Credit Assignment in Reinforcement Learning. Ph.D. Dissertation.

Thomas, P. 2014. Bias in natural actor-critic algorithms. In ICML, 441-448.

Vezhnevets, A. S.; Mnih, V.; Agapiou, J.; Osindero, S.; Graves, A.; Vinyals, O.; and Kavukcuoglu, K. 2016. Strategic attentive writer for learning macro-actions. In NIPS 29.