# Natural Option Critic

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Saket Tiwari, College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003, sakettiwari@umass.edu

Philip S. Thomas, College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003, pthomas@cs.umass.edu

## Abstract

The recently proposed option-critic architecture (Bacon, Harb, and Precup 2017) provides a stochastic policy gradient approach to hierarchical reinforcement learning. Specifically, it provides a way to estimate the gradient of the expected discounted return with respect to parameters that define a finite number of temporally extended actions, called options. In this paper we show how the option-critic architecture can be extended to estimate the natural gradient (Amari 1998) of the expected discounted return. To this end, the central questions that we consider in this paper are: 1) what is the definition of the natural gradient in this context, 2) what is the Fisher information matrix associated with an option's parameterized policy, 3) what is the Fisher information matrix associated with an option's parameterized termination function, and 4) how can a compatible function approximation approach be leveraged to obtain natural gradient estimates for both the parameterized policy and the parameterized termination functions of an option, with per-time-step time and space complexity linear in the total number of parameters. Based on answers to these questions we introduce the natural option critic algorithm. Experimental results showcase improvement over the vanilla gradient approach.

## Introduction

Hierarchical reinforcement learning methods enable agents to tackle challenging problems by identifying reusable skills (temporally extended actions) that simplify the task. For example, a robot agent that tries to learn to play chess by reasoning solely at the level of how much current to give to its actuators every 20 ms will struggle to correlate obtained rewards with their true underlying cause. However, if this same agent first learns skills to move its arm, grasp a chess piece, and move a chess piece, then the task of learning to play chess (leveraging these skills) becomes tractable.

Several mathematical frameworks for hierarchical reinforcement learning have been proposed, including hierarchies of machines (Parr and Russell 1998), MAXQ (Dietterich 2000), and the options framework (Sutton, Precup, and Singh 1999). However, none of these frameworks provides a practical mechanism for skill discovery: determining what skills will be useful for an agent to learn. Although skill discovery methods have been proposed, they tend to be heuristic, in that they find skills with properties that intuitively might make for good skills for some problems, but which do not follow directly from the primary objective of optimizing the expected discounted return (Machado, Bellemare, and Bowling 2017; Simsek and Barto 2008; Thrun and Schwartz 1995; Konidaris and Barto 2009). The option-critic architecture (Bacon, Harb, and Precup 2017) stands out from other attempts at developing a general framework for skill discovery in that it searches for the skills that directly optimize the expected discounted return.
Specifically, the option critic uses the aforementioned options framework, wherein a skill is called an option, and it proposes parameterizing all aspects of an option and then performing stochastic gradient descent on the expected discounted return with respect to these parameters. The key insight that enables the option-critic architecture is a set of theorems that give expressions for the gradient of the expected discounted return with respect to the different parameters of an option.

One limitation of the option critic is that it uses ordinary (stochastic) gradient descent. In this paper we show how the option critic can be extended to use natural gradient descent (Amari 1998), which exploits the underlying structure of the option-parameter space to produce a more informed update direction. The primary contributions of this work are theoretical: we define the natural gradients associated with the option critic, derive the Fisher information matrices associated with an option's parameterized policy and termination function, and show how the natural gradients can be estimated with per-time-step time and space complexity linear in the total number of parameters. This is achieved by means of compatible function approximations. We also analyze the performance of the natural gradient descent based approach on various learning tasks.

## Preliminaries and Notation

A reinforcement learning (RL) agent interacts with an environment, modeled as a Markov decision process (MDP), over a sequence of time steps $t \in \{0, 1, 2, \ldots\}$. A finite MDP is a tuple $(S, A, P, R, d_0, \gamma)$. $S$ is the finite set of possible states of the environment, and $S_t$ is the state of the environment at time $t$. $A$ is the finite set of possible actions the agent can take, and $A_t$ is the action taken by the agent at time $t$. $P : S \times A \times S \to [0, 1]$ is the transition function: $P(s, a, s') = \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$, for all $t$; that is, $P(s, a, s')$ is the probability of transitioning to state $s'$ given that the agent takes action $a$ in state $s$. $R_t$ denotes the reward at time $t$. $R$ is the reward function, $R : S \times A \to \mathbb{R}$, where $R(s, a) = \mathbf{E}[R_t \mid S_t=s, A_t=a]$, i.e., the expected reward the agent receives given that it took action $a$ in state $s$. We say that a process has ended when the environment enters a terminal state, meaning that for a terminal state $s$, $P(s, a, s') = 0$ and $R(s, a) = 0$ for all $s' \in S \setminus \{s\}$ and $a \in A$. The process ends after $T$ steps and we call $T$ the horizon. We say the process is infinite horizon when there does not exist a finite $T$. $d_0$ is the initial state distribution, i.e., $d_0(s) = \Pr(S_0=s)$. The parameter $\gamma \in [0, 1]$ scales how the rewards are discounted over time. When a terminal state is reached, time is reset to $t = 0$ and a new initial state is sampled from $d_0$.

A policy, $\pi : S \times A \to [0, 1]$, represents the agent's decision-making system: $\pi(s, a) = \Pr(A_t=a \mid S_t=s)$. Given a policy, $\pi$, and an MDP, $(S, A, P, R, d_0, \gamma)$, an episode, $H$, is a sequence of the states of the environment, the actions taken by the agent, and the rewards observed, from the initial state, $S_0$, to the terminal state, $S_T$, i.e., $H = (S_0, A_0, R_0, S_1, A_1, R_1, \ldots, S_T, A_T, R_T)$. We also define the path that an agent takes to be a sequence of states and actions, i.e., a history without rewards: $X = (S_0, A_0, S_1, A_1, \ldots, S_T, A_T)$. The path $X$ is a random variable taking values in the set of all possible paths, $\mathcal{X}$. The return of an episode $H$ is the discounted sum of all rewards, $g(H) = \sum_{t=0}^{T} \gamma^t R_t$. We call $v_\pi$ the value function for the policy $\pi$, $v_\pi : S \to \mathbb{R}$, where $v_\pi(s) = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t \mid S_0=s, \pi]$.
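As a concrete illustration of the return and value function just defined, the following minimal Python sketch computes the discounted return $g(H)$ of a completed episode and a Monte-Carlo estimate of $v_\pi(s_0)$. The `sample_episode` helper, which rolls out the policy from a fixed start state and returns the reward sequence, is a hypothetical stand-in and not part of the paper.

```python
def discounted_return(rewards, gamma):
    """g(H) = sum_{t=0}^{T} gamma^t * R_t for one completed episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def monte_carlo_value(sample_episode, gamma, n_episodes=1000):
    """Estimate v_pi(s0) by averaging the returns of episodes started in s0.

    `sample_episode()` is a hypothetical helper that rolls out the policy pi
    from the fixed start state and returns the reward list (R_0, ..., R_T).
    """
    returns = [discounted_return(sample_episode(), gamma) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```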
We call $q_\pi$ the action-value function associated with policy $\pi$, $q_\pi : S \times A \to \mathbb{R}$, where $q_\pi(s, a) = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t \mid S_0=s, A_0=a, \pi]$.

### Policy Gradient Framework

The policy gradient framework (Sutton et al. 1999; Konda and Tsitsiklis 2000) assumes that the policy $\pi$, parameterized by $\theta$, is differentiable. The objective function, $\rho$, is defined with respect to a start state $s_0$: $\rho(\theta) = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t \mid d_0, \theta]$. The agent learns by updating the parameters $\theta$ approximately in proportion to the gradient $\partial\rho/\partial\theta$, i.e., $\Delta\theta \approx \alpha\, \partial\rho/\partial\theta$, where $\alpha$ is the learning rate (LR): a scalar hyperparameter.

### Option-Critic Framework

The options framework (Sutton, Precup, and Singh 1999) formalizes the notion of temporal abstraction by introducing options. An option, $o$, from a set of options, $O$, is a generalization of a primitive action. The intra-option policy $\pi_o : S \times A \to [0, 1]$ represents the agent's decision making while executing option $o$: $\pi_o(s, a) = \Pr(A_t=a \mid S_t=s, O_t=o)$. Like a primitive action, the agent begins executing an option in a state $S_t$, and the option terminates in another state $S_{t+\tau}$, where $\tau$ is the duration for which the agent executes the option. While in option $o$, from state $S_t$ to $S_{t+\tau}$, the agent follows the policy $\pi_o$. Option $o$ terminates stochastically in state $s$ according to a distribution $\beta_o$. The framework restricts where an option can be initiated by defining an initiation state set, $I_o$, for option $o$. The option $o$ is initiated in a state $s \in I_o$ based on $\pi_O(s)$, which is a policy over options defined as $\pi_O : S \times O \to [0, 1]$. An initiation state set $I_o$, an intra-option policy $\pi_o$, and a termination function $\beta_o : S \to [0, 1]$ comprise an option $o$. It is commonly assumed that all options are available everywhere, and we therefore dispense with the notion of an initiation set.

The option-critic framework makes all options available everywhere and introduces policy-gradient theorems within the options framework. The option active at time step $t$ is $O_t$. The intra-option policies ($\pi_o$) and termination functions ($\beta_o$) are represented using differentiable functions parameterized by $\theta$ and $\vartheta$, respectively. The goal is to optimize the expected discounted return starting at state $s_0$ and option $o_0$. We redefine the objective function, $\rho$, for the option-critic setting: $\rho(O, \theta, \vartheta, s, o) = \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid O, \theta, \vartheta, S_0=s, O_0=o]$. Equations similar to those in the policy gradient framework (Sutton et al. 1999) are manipulated to derive gradients of the objective with respect to $\theta$ and $\vartheta$ in the option-critic framework.

The analogous state value function is $v_{\pi_O} : S \to \mathbb{R}$, where $v_{\pi_O}(s) = \mathbf{E}[\sum_t \gamma^t R_t \mid S_0=s]$; $v_{\pi_O}(s)$ is the value of a state $s$, within the options framework, with the option set $O$ and the policy over options $\pi_O$. The option-value function is $q_{\pi_O} : S \times O \to \mathbb{R}$, where $q_{\pi_O}(s, o) = \mathbf{E}[\sum_t \gamma^t R_t \mid S_0=s, O_0=o]$; $q_{\pi_O}(s, o)$ is the value of state $s$ when option $o$ is active, with the option set $O$. The state-option-action value function is $q_U : S \times O \times A \to \mathbb{R}$, where $q_U(s, o, a) = \mathbf{E}[\sum_t \gamma^t R_t \mid S_0=s, O_0=o, A_0=a]$; $q_U(s, o, a)$ is the value of executing action $a$ in the context of the state-option pair $(s, o)$. The option-value function upon arrival is $u : O \times S \to \mathbb{R}$, where $u(o, s') = \mathbf{E}[\sum_t \gamma^t R_t \mid S_1=s', O_0=o]$; $u(o, s')$ is the value of option $o$ being active upon the agent entering state $s'$. Bacon, Harb, and Precup (2017) observe a consequence of these definitions: $u(o, s') = (1 - \beta_o(s'))\,q_{\pi_O}(s', o) + \beta_o(s')\,v_{\pi_O}(s')$.
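Before stating the gradient theorems below, it may help to make the execution model concrete. The sketch below implements the call-and-return semantics described above: the policy over options picks an option, the intra-option policy selects primitive actions, and the termination function decides when control returns to the policy over options. The environment interface (`env.reset`, `env.step`) and the dictionaries of policies and terminations are hypothetical stand-ins, not an API defined by the paper.

```python
import random

def run_episode(env, pi_O, pi, beta, gamma):
    """Roll out one episode under the options framework (call-and-return).

    pi_O(s)    -> option o        (policy over options)
    pi[o](s)   -> action a        (intra-option policy of option o)
    beta[o](s) -> float in [0,1]  (termination probability of o in state s)
    All of these, and `env`, are hypothetical stand-ins for this sketch.
    """
    s, done = env.reset(), False
    o = pi_O(s)                              # choose the initial option
    g, t = 0.0, 0
    while not done:
        a = pi[o](s)                         # act with the intra-option policy
        s, r, done = env.step(a)
        g += (gamma ** t) * r                # accumulate the discounted return
        t += 1
        if random.random() < beta[o](s):     # option terminates in the new state
            o = pi_O(s)                      # pick the next option
    return g
```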
The main results presented by Bacon, Harb, and Precup (2017) are the intra-option policy gradient theorem and the termination gradient theorem. The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0, o_0)$ is:

$$\frac{\partial q_{\pi_O}(s_0, o_0)}{\partial \theta} = \sum_{s,o} \mu_O(s, o) \sum_{a} \frac{\partial \pi_o(s, a, \theta)}{\partial \theta}\, q_U(s, o, a),$$

where $\mu_O(s, o)$ is the discounted weighting of the state-option pair $(s, o)$ along trajectories starting from $(s_0, o_0)$, defined by $\mu_O(s, o) = \sum_{t=0}^{\infty} \gamma^t \Pr(S_t=s, O_t=o \mid s_0, o_0)$. The gradient of the expected discounted return with respect to $\vartheta$ and initial condition $(s_1, o_0)$ is:

$$\frac{\partial u(o_0, s_1)}{\partial \vartheta} = -\sum_{s',o} \mu_O(s', o)\, \frac{\partial \beta_o(s', \vartheta)}{\partial \vartheta}\, a_O(s', o),$$

where $a_O : S \times O \to \mathbb{R}$ is the advantage function over options, such that $a_O(s', o) = q_{\pi_O}(s', o) - v_{\pi_O}(s')$. Here, $\mu_O(s', o)$ is the discounted weighting of the state-option pair $(s', o)$ from $(s_1, o_0)$, i.e., according to a Markov chain shifted by one time step, defined by $\mu_O(s', o) = \sum_{t=0}^{\infty} \gamma^t \Pr(S_{t+1}=s', O_t=o \mid s_1, o_0)$. The agent learns by updating the parameters $\theta$ and $\vartheta$ in directions approximately proportional to $\partial q_{\pi_O}(s_0, o_0)/\partial\theta$ and $\partial u(o_0, s_1)/\partial\vartheta$, respectively. That is, it learns with the updates $\Delta\theta \approx \alpha_\theta\, \partial q_{\pi_O}(s_0, o_0)/\partial\theta$ and $\Delta\vartheta \approx \alpha_\vartheta\, \partial u(o_0, s_1)/\partial\vartheta$, where $\alpha_\theta$ and $\alpha_\vartheta$ are the learning rates for $\theta$ and $\vartheta$, respectively.

## Natural Actor Critic

Natural gradient descent (Amari 1998) exploits the underlying structure of the parameter space when defining the direction of steepest descent. It does so by defining the inner product $\langle x, y \rangle_\theta$ in the parameter space as:

$$\langle x, y \rangle_\theta = x^\top G_\theta\, y, \qquad (1)$$

where $G_\theta$ is called the metric tensor. Although the choice of $G_\theta$ remains open under certain conditions (Thomas et al. 2016), we choose the Fisher information matrix, as is common practice. The Fisher information matrix of a distribution over a random variable $X$, parameterized by policy parameters $\theta$ that lie on a Riemannian manifold (Rao 1945; Amari 1985), is

$$(G_\theta)_{i,j} = \mathbf{E}\!\left[\frac{\partial \ln \Pr(X; \theta)}{\partial \theta_i}\, \frac{\partial \ln \Pr(X; \theta)}{\partial \theta_j}\right],$$

where the expectation is over the distribution $\Pr(X)$, and $(G_\theta)_{i,j}$ denotes the matrix whose $(i, j)$th element is the expression on the right-hand side; we use this notation for matrices throughout the paper.

Kakade (2001) makes the assumption that every policy, $\pi$, is ergodic and irreducible, and therefore has a well-defined stationary distribution over states. Under this assumption, Kakade (2001) introduces the use of the natural gradient for optimizing the expected reward over the parameters $\theta$ of the policy $\pi$, defined by $\rho(\theta) = \sum_{s,a} d_\pi(s)\,\pi(s, a, \theta)\,R(s, a)$. The natural gradient of the objective function, $\rho$, is defined as:

$$\tilde\nabla \rho(\theta) = G_\theta^{-1}\, \nabla \rho(\theta). \qquad (2)$$

The derivation of a closed-form expression for $G_\theta$ over the parameter space of the policy $\pi$, parameterized by $\theta$, is non-trivial, as demonstrated for the limiting matrix of the infinite-horizon problem in reinforcement learning (Bagnell and Schneider 2003). For a weight vector $w$, let $\hat q_w$ be an approximation of the state-action value function $q_\pi(s, a)$ of the form

$$\hat q_w(s, a) = w^\top \nabla_\theta \ln \pi(s, a, \theta).$$

The mean squared error $\epsilon(w, \theta)$, for a weight vector $w$ and a given policy parameterized by $\theta$, is defined as

$$\epsilon(w, \theta) = \sum_{s,a} d_\pi(s)\,\pi(s, a, \theta)\,\big(\hat q_w(s, a) - q_\pi(s, a)\big)^2,$$

where $d_\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(S_t=s \mid \pi)$ is the discounted weighting of state $s$ in the infinite-horizon problem. The weights $d_\pi(s)$ normalize to the stationary distribution over states under policy $\pi$ in the undiscounted setting, where the MDP terminates at every time step $t$ with probability $1-\gamma$. Theorem 1, as introduced by Kakade (2001), states that the $w$ that minimizes the mean squared error, $\epsilon(w, \theta)$, is equal to the natural gradient as defined in (2). Kakade (2001) also demonstrates how the natural policy gradient behaves under a re-scaling of the parameters, and that the natural gradient weights the components of $\tilde\nabla \rho(\theta)$ uniformly, instead of weighting them by $d_\pi(s)$.
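To make this compatible function approximation result concrete, the following sketch fits $w$ by ordinary least squares on the score features $\nabla_\theta \ln\pi(s, a, \theta)$ and then steps the policy parameters along $w$, which, per the result above, serves as an estimate of the natural gradient direction. The `score_fn` helper and the sampled advantage estimates are hypothetical stand-ins; this is a sketch under those assumptions, not the paper's algorithm.

```python
import numpy as np

def natural_policy_gradient_step(states, actions, advantages, score_fn, theta, alpha):
    """One natural-gradient update via compatible function approximation.

    Fit q_hat_w(s, a) = w^T grad_theta log pi(s, a, theta) to sampled advantage
    estimates by least squares; the minimizing w estimates the natural gradient,
    so the policy update is simply theta + alpha * w.

    `score_fn(s, a, theta)` is a hypothetical helper returning
    grad_theta log pi(s, a, theta) as a 1-D NumPy array.
    """
    Phi = np.stack([score_fn(s, a, theta) for s, a in zip(states, actions)])
    w, *_ = np.linalg.lstsq(Phi, np.asarray(advantages, dtype=float), rcond=None)
    return theta + alpha * w
```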
We also point out that the natural gradient is invariant to local re-parameterization of the model (Pascanu and Bengio 2013) and can be used in online learning (Degris, Pilarski, and Sutton 2012). Natural gradients for reinforcement learning (Peters and Schaal 2008; Bhatnagar et al. 2009; Degris, Pilarski, and Sutton 2012), as well as more recent work on deep neural networks (Desjardins et al. 2015; Pascanu and Bengio 2013; Thomas, Dann, and Brunskill 2018; Sun and Nielsen 2017), have been shown to be effective for learning. The option-critic architecture uses the vanilla gradient to learn temporal abstractions and internal policies, which can be less data-efficient than the natural gradient (Amari 1998). The natural gradient also overcomes the difficulty posed by the plateau phenomenon (Amari 2016).

We derive the metric tensors for the parameters in the option-critic architecture. Computing the complete Fisher information matrix or its inverse is expensive. We therefore use a block-diagonal estimate of the Fisher information matrix, as has been applied in the past to reinforcement learning (Thomas 2011) and to neural networks (Roux, Manzagol, and Bengio 2008; Kurita 1992; Martens 2010; Pascanu and Bengio 2013; Martens and Grosse 2015). Specifically, we estimate $G_\theta$ and $G_\vartheta$ separately, where $\theta$ and $\vartheta$ are the parameters of the intra-option policies and the option termination functions. These are then combined into a $(|\theta| + |\vartheta|) \times (|\theta| + |\vartheta|)$ estimate of the complete Fisher information matrix of the parameter space, where $|\theta|$ and $|\vartheta|$ denote the sizes of the vectors. We also provide theoretical justification for the resulting algorithm, which is inspired by the incremental natural actor-critic (INAC) algorithm (Bhatnagar et al. 2007) and its extension to include eligibility traces (Morimura, Uchibe, and Doya 2005; Thomas 2014).

## Start State Fisher Information Matrix Over Intra-Option Path Manifold

We define a path $X$ in the options framework for the infinite-horizon problem as a sequence of state-option-action tuples: $X = (S_0, O_0, A_0, S_1, O_1, A_1, \ldots)$. We use $\mathcal{X}$ to denote the set of all paths. We introduce the function $g : \mathcal{X} \to \mathbb{R}$, called the expected return over a path, where $g(x) = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t \mid x]$ is the expected return given the path $x$. The goal of a reinforcement learning problem, in the context of the option-critic architecture, is to maximize the discounted return, $\rho(O, \theta, \vartheta, s_0, o_0)$. This goal can be re-written as maximizing $J(\theta, s_0, o_0) = \sum_x \Pr(x; \theta)\,g(x)$, where the summation is over all $x \in \mathcal{X}$ starting from $(s_0, o_0)$ and the intra-option policies are parameterized by $\theta$.

To optimize the objective $J$, we define it over a Riemannian space $\Theta$, with $\theta \in \Theta$. In the Riemannian space the inner product is defined as in (1). The direction of steepest ascent of $J(\theta)$ in the Riemannian space, $\Theta$, is given by $G_\theta^{-1}\,\partial J(\theta)/\partial\theta$ (Amari 1998) (see Equation 2). In this section we use $\partial_i$ to denote $\partial/\partial\theta_i$ and $\langle f(X)\rangle_{\Pr(X)}$ to denote the expected value of $f$ with respect to the distribution $\Pr(X)$. We obtain an alternative form of the Fisher information matrix, which is a well-known result (De Groot 1970) (for details see the appendix):

$$(G_\theta)_{i,j} = -\big\langle \partial_i \partial_j \ln \Pr(X; \theta) \big\rangle_{\Pr(X;\theta)}. \qquad (3)$$
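As a quick numerical sanity check of identity (3), the sketch below verifies, for a small categorical softmax model rather than the option-critic path distribution, that the score outer-product form $\mathbf{E}[\nabla\ln p\,\nabla\ln p^\top]$ and the negative-expected-Hessian form of the Fisher information matrix coincide.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fisher_outer_product(theta):
    """G = E[grad log p(x) grad log p(x)^T] for a categorical softmax model."""
    p = softmax(theta)
    G = np.zeros((len(theta), len(theta)))
    for x in range(len(theta)):
        score = np.eye(len(theta))[x] - p      # grad_theta log p(x; theta)
        G += p[x] * np.outer(score, score)
    return G

def fisher_neg_hessian(theta):
    """G = -E[hessian of log p(x)]; for softmax the Hessian is -(diag(p) - p p^T)."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.3, -1.2, 0.5])
assert np.allclose(fisher_outer_product(theta), fisher_neg_hessian(theta))
```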
### Fisher Information Matrix Over Intra-Option Path Manifold

In Theorem 1 we show that the Fisher information matrix over the paths, $\mathcal{X}$, truncated to terminate at time step $T$, converges as $T \to \infty$ to the Fisher information matrix over the intra-option policies, $\pi_o$. This gives an expression for the Fisher information matrix over the set of paths, $\mathcal{X}$, and simplifies computation of the natural gradient when maximizing the objective $J(\theta, s_0, o_0)$. We use $G^T_\theta$ to denote the $T$-step finite-horizon Fisher information matrix, i.e., the Fisher information matrix obtained if the problem were truncated to terminate at step $T$. We normalize the metric by the total path length $T$ (Bagnell and Schneider 2003) to obtain a convergent metric.

**Theorem 1 (Infinite Horizon Intra-Option Matrix).** Let $G^T_\theta$ be the $T$-step finite-horizon Fisher information matrix and $G_\theta^{\mu_O(s,o)}$ be the Fisher information matrix of the intra-option policies under the stationary distribution over states, actions, and options, $\pi_o(s, a, \theta)\,\mu_O(s, o)$. Then:

$$\lim_{T\to\infty} \frac{1}{T}\, G^T_\theta = G_\theta^{\mu_O(s,o)}.$$

*Proof.* See the appendix (supplementary materials).

### Compatible Function Approximation For Intra-Option Path Manifold

We subtract the option-value function, $q_{\pi_O}$, from the state-option-action value function, $q_U$, and treat it as a baseline to reduce variance in the gradient estimate of the expected discounted return. The baseline can be a function of both state and action in special circumstances, but none of those apply here (Thomas and Brunskill 2017). So, we define the state-option-action advantage function $a_U : S \times O \times A \to \mathbb{R}$, where $a_U(s, o, a) = q_U(s, o, a) - q_{\pi_O}(s, o)$ is the advantage of the agent taking action $a$ in state $s$ in the context of option $o$. Here, $a_U$ is approximated by some compatible function approximator $f^{\pi_o}_\eta$. For a vector $\eta$ and parameters $\theta$ we define:

$$f^{\pi_o}_\eta(s, a) = \eta^\top \nabla_\theta \ln \pi_o(s, a, \theta). \qquad (4)$$

The $\eta$ that is a local minimum of the squared error $\epsilon(\eta, \theta)$,

$$\epsilon(\eta, \theta) = \sum_{s,o,a} \mu_O(s, o)\,\pi_o(s, a, \theta)\,\big(f^{\pi_o}_\eta(s, a) - a_U(s, o, a)\big)^2,$$

is equal to the natural gradient of the objective, $\rho$, with respect to $\theta$ (the complete derivation is in the appendix):

$$\tilde\nabla_\theta\, q_{\pi_O}(s_0, o_0) = G_\theta^{-1}\, \nabla_\theta\, q_{\pi_O}(s_0, o_0).$$

Thus, for a sensible (Kakade 2001) function approximation, as in (4), the natural gradient of the expected discounted return in the option-critic framework is given by the weights of the linear function approximation.

## Start State Fisher Information Matrix Over State-Option Transition Path Manifold

We derive the Fisher information matrix for the parameters $\vartheta$ over the state-option transition path manifold. We define $X'$ as a path of state-option transitions in the option-critic architecture. More specifically, we define $X' = (O_0, S_1, O_1, S_2, O_2, S_3, \ldots)$ to be a path of state-option pairs shifted by one time step. We define $\mathcal{X}'$ to be the set of all state-option transition paths. Similar to the previous section, we define the expected return over state-option transitions $g' : \mathcal{X}' \to \mathbb{R}$, where $g'(x') = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t \mid x']$ is the expected return given the state-option transition path $x'$. The goal can be re-written as maximizing $J'(\vartheta, s_1, o_0) = \sum_{x'} \Pr(x')\,g'(x')$, where the summation is over all $x' \in \mathcal{X}'$ starting from $(s_1, o_0)$ and the terminations are parameterized by $\vartheta$. To optimize $J'$ we define it over a Riemannian space $\Theta'$ with $\vartheta \in \Theta'$ and the inner product defined as in (1), as in the previous section. The direction of steepest ascent in the Riemannian space, $\Theta'$, is the natural gradient. In this section, we use $\partial_i$ to denote $\partial/\partial\vartheta_i$ and $\langle f(X')\rangle_{\Pr(X')}$ to denote the expected value of $f(X')$ with respect to the distribution $\Pr(X')$. Equation (3) implies that the Fisher information matrix can be written as:

$$(G_\vartheta)_{i,j} = -\big\langle \partial_i \partial_j \ln \Pr(X'; \vartheta) \big\rangle_{\Pr(X';\vartheta)}.$$
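For intuition, one way to approximate $G_\vartheta$ from data is a Monte-Carlo average of score outer products over sampled state-option transition paths, using the (equivalent) outer-product form of the Fisher information matrix. The `grad_log_option_transition` helper, returning the gradient of the per-step option-transition log-probability, is a hypothetical stand-in; this is only a sketch under that assumption.

```python
import numpy as np

def estimate_G_vartheta(paths, grad_log_option_transition, vartheta, T):
    """Monte-Carlo estimate of (1/T) G^T_vartheta from sampled state-option paths.

    Each path is a sequence of (o_t, s_next, o_next) tuples. The hypothetical
    helper `grad_log_option_transition(o, s_next, o_next, vartheta)` returns
    grad_vartheta log Pr(O_{t+1}=o_next | O_t=o, S_{t+1}=s_next; vartheta),
    i.e. the gradient of log[(1 - beta_o(s'))1[o'=o] + beta_o(s') pi_O(s', o')].
    """
    n = len(vartheta)
    G = np.zeros((n, n))
    for path in paths:
        score = np.zeros(n)
        for (o, s_next, o_next) in path[:T]:
            score += grad_log_option_transition(o, s_next, o_next, vartheta)
        G += np.outer(score, score)        # one sample of the path-score outer product
    return G / (len(paths) * T)
```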
### Fisher Information Matrix Over State-Option Transition Path Manifold

In Theorem 2 we show that the Fisher information matrix over the paths, $\mathcal{X}'$, truncated to terminate at time step $T$, converges as $T \to \infty$ to an expression in terms of the terminations and the policy over options under the stationary distribution of states and options. This gives an expression for the Fisher information matrix over the set of paths, $\mathcal{X}'$, and simplifies computation of the natural gradient when maximizing the objective $J'(\vartheta, s_1, o_0)$.

**Theorem 2 (Infinite Horizon State-Option Transition Matrix).** Let $G^T_\vartheta$ be the $T$-step finite-horizon Fisher information matrix and let $\mu_O(s', o)$ be the stationary distribution over state-option pairs $(s', o)$. Then:

$$\lim_{T\to\infty} \frac{1}{T}\big(G^T_\vartheta\big)_{i,j} = \Big\langle \partial_i \ln \beta_o(s', \vartheta)\; \partial_j \ln\!\big(1 - \beta_o(s', \vartheta) + \beta_o(s', \vartheta)\,\pi_O(s', o)\big) \Big\rangle_{\mu_O(s',o)}.$$

*Proof.* See the appendix (supplementary materials).

### Compatible Function Approximation For State-Option Transition Path Manifold

We define the advantage function of the continued option as $a_O : S \times O \to \mathbb{R}$, where $a_O(s', o) = u(o, s') - q_{\pi_O}(s', o)$ is the advantage of the option $o$ being active while exiting $s'$, given that option $o$ is active when the agent enters $s'$. We consider termination improvement when $a_O$ is approximated by some compatible function approximator $h^{\beta_o}_\phi$. For a vector $\phi$ and parameters $\vartheta$ we define:

$$h^{\beta_o}_\phi(s') = \phi^\top \nabla_\vartheta \ln\!\big(1 - \beta_o(s', \vartheta) + \pi_O(s', o)\,\beta_o(s', \vartheta)\big). \qquad (5)$$

We define the squared error $\epsilon(\phi, \vartheta)$ associated with the vector $\phi$ as:

$$\epsilon(\phi, \vartheta) = \sum_{s',o} \mu_O(s', o)\, L(O_{t+1}{=}o \mid O_t{=}o, S_{t+1}{=}s'; \vartheta)\,\big(h^{\beta_o}_\phi(s') - a_O(s', o)\big)^2,$$

where $L(O_{t+1}{=}o \mid O_t{=}o, S_{t+1}{=}s'; \vartheta)$ is the likelihood ratio of option $o$ being active while exiting $s'$, given that option $o$ is active when the agent enters $s'$. It is defined as follows:

$$L(O_{t+1}{=}o \mid O_t{=}o, S_{t+1}{=}s'; \vartheta) = \frac{\Pr(O_{t+1}{=}o \mid O_t{=}o, S_{t+1}{=}s'; \vartheta)}{\Pr(O_{t+1}{\neq}o \mid O_t{=}o, S_{t+1}{=}s'; \vartheta)} = \frac{\beta_o(s', \vartheta)}{1 - \beta_o(s', \vartheta)}.$$

We assume, throughout the paper, that the denominator is not 0. The $\phi$ that is a local minimum of $\epsilon(\phi, \vartheta)$ satisfies (the complete derivation is in the appendix):

$$\tilde\nabla_\vartheta\, u(o_0, s_1) = G_\vartheta^{-1}\, \nabla_\vartheta\, u(o_0, s_1).$$

Therefore, for an approximation of the continued state-option value function, as in (5), the natural gradient of the expected discounted return is the negative of the weights of the linear function approximation.
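To ground this result, the sketch below fits $\phi$ by likelihood-ratio-weighted least squares against sampled estimates of the advantage of the continued option and then steps $\vartheta$ in the direction $-\phi$, mirroring the statement above. The `feature_fn` and `likelihood_ratio_fn` helpers and the sampled advantage estimates are hypothetical stand-ins, so this is a sketch under those assumptions rather than the paper's exact procedure.

```python
import numpy as np

def natural_termination_gradient_step(samples, feature_fn, likelihood_ratio_fn,
                                      vartheta, alpha):
    """One natural-gradient step for the termination parameters.

    Fit h_phi(s') = phi^T grad_vartheta log(1 - beta_o(s') + beta_o(s') pi_O(s', o))
    to sampled advantage-of-continuation estimates by weighted least squares, with
    likelihood-ratio weights beta/(1 - beta), then move vartheta along -phi.

    `samples` is a list of (s_next, o, advantage_estimate) tuples; `feature_fn`
    and `likelihood_ratio_fn` are hypothetical helpers for this sketch.
    """
    Phi = np.stack([feature_fn(s, o, vartheta) for s, o, _ in samples])
    targets = np.array([a for _, _, a in samples], dtype=float)
    weights = np.array([likelihood_ratio_fn(s, o, vartheta) for s, o, _ in samples])
    root_w = np.sqrt(weights)
    # Weighted least squares: scale rows of the design matrix and targets by sqrt(w).
    phi, *_ = np.linalg.lstsq(root_w[:, None] * Phi, root_w * targets, rcond=None)
    return vartheta - alpha * phi
```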
## Incremental Natural Option Critic Algorithm

We introduce an algorithm inspired by the incremental natural actor critic introduced by Degris, Pilarski, and Sutton (2012), who in turn built on the theoretical work of Bhatnagar et al. (2007). The algorithm learns the parameters of approximations of the state-option-action advantage function, $a_U$, and the advantage function of the continued option, $a_O$, incrementally, by taking steps in the direction that reduces the errors $\epsilon(\eta, \theta)$ and $\epsilon(\phi, \vartheta)$. It performs stochastic gradient descent using the gradients $\partial\epsilon(\eta, \theta)/\partial\eta$ and $\partial\epsilon(\phi, \vartheta)/\partial\phi$. Learning the parameters $\eta$ and $\phi$ leads to natural-gradient-based updates for $\theta$ and $\vartheta$. We introduce the hyperparameters $\alpha_\eta$, $\alpha_\phi$, and $\lambda$, which are the learning rate for $\eta$, the learning rate for $\phi$, and the eligibility trace parameter for both $\eta$ and $\phi$, respectively. The algorithm learns the policy over options, $\pi_O$, using intra-option Q-learning (Sutton, Precup, and Singh 1999), as in previous work (Bacon, Harb, and Precup 2017). The algorithm uses TD-error-style updates to learn $\theta$ and $\vartheta$.

Analogous to the consistent estimates used by Bhatnagar et al. (2007), we say that a consistent estimate of the state-option value function, $\hat q_{\pi_O}$, satisfies $\mathbf{E}[\hat q_{\pi_O}(s_t, o_t) \mid s_t, o_t, \pi_O, \pi_{o_t}, \beta_{o_t}] = q_{\pi_O}(s_t, o_t)$. Similarly, a consistent estimate of the value function upon arrival, $\hat u$, satisfies $\mathbf{E}[\hat u(o_t, s_{t+1}) \mid o_t, s_{t+1}, \pi_O, \pi_{o_t}, \beta_{o_t}] = u(o_t, s_{t+1})$. We define the TD error for the intra-option policies at time step $t$ to be $\delta^U_t = r_t + \gamma\,\hat u(o_t, s_{t+1}) - \hat q_{\pi_O}(s_t, o_t)$. A consistent estimate of the state value function, $\hat v_{\pi_O}$, satisfies $\mathbf{E}[\hat v_{\pi_O}(s_t) \mid s_t, \pi_O, \pi_{o_t}, \beta_{o_t}] = v_{\pi_O}(s_t)$. We define the TD error at time step $t$ for the terminations to be $\delta^O_t = r_t + \gamma\,\hat v_{\pi_O}(s_{t+1}) - \hat v_{\pi_O}(s_t)$. We provide Lemmas 1 and 2 to show that $\delta^U_t$ and $\delta^O_t$ are consistent estimates of $a_U$ and $a_O$.

**Lemma 1.** Given intra-option policies $\pi_o$ for all $o \in O$, a policy over options $\pi_O$, and terminations $\beta_o$ for all $o \in O$, then: $\mathbf{E}[\delta^U_t \mid s_t, a_t, o_t, \pi_{o_t}, \pi_O, \beta_{o_t}] = a_U(s_t, o_t, a_t)$.

**Lemma 2.** Under the precondition $o_t = o_{t-1}$, and given intra-option policies $\pi_o$ for all $o \in O$, a policy over options $\pi_O$, and terminations $\beta_o$ for all $o \in O$, then: $\mathbf{E}[\delta^O_t \mid s_t, o_t, o_t = o_{t-1}, \pi_{o_t}, \pi_O] = a_O(s_t, o_{t-1})$.

The proofs are in the appendix (supplementary materials). Using these lemmas and theorems we introduce Algorithm 1 (INOC). We provide details on how we arrive at the updates to the parameters $\eta$ and $\phi$ in the appendix. The precondition $o_t = o_{t-1}$ might lead to fewer updates to the parameters of the terminations. The option evaluation part of the algorithm is the same as in previous work (Bacon, Harb, and Precup 2017).

**Algorithm 1: Incremental Natural Option-Critic Algorithm (INOC)**

1: $s_0 \sim d_0$ and choose $o$ using $\pi_O$
2: while not in a terminal state do
3:   Select action $a_t$ according to $\pi_{o_t}$
4:   Take action $a_t$; observe $s_{t+1}$, $r_t$
5:   $e_\eta \leftarrow \lambda e_\eta + \nabla_\theta \ln \pi_{o_t}(s_t, a_t, \theta)$
6:   $\delta^U_t \leftarrow r_t + \gamma\, u(o_t, s_{t+1}) - q_{\pi_O}(s_t, o_t)$
7:   $\text{temp} \leftarrow \nabla_\theta \ln \pi_{o_t}(s_t, a_t, \theta)$
8:   $\eta \leftarrow \eta + \alpha_\eta\, \delta^U_t\, e_\eta - \alpha_\eta\, \text{temp}\, \text{temp}^\top \eta$
9:   $\theta \leftarrow \theta + \alpha_\theta\, \eta / \lVert\eta\rVert^2$
10:  if $o_t$ is the same as $o_{t-1}$ then
11:    $e_\phi \leftarrow \lambda e_\phi + \nabla_\vartheta \ln \beta_{o_{t-1}}(s_t, \vartheta)$
12:    $\delta^O_t \leftarrow r_t + \gamma\, v_{\pi_O}(s_{t+1}) - v_{\pi_O}(s_t)$
13:    $\text{temp} \leftarrow \nabla_\vartheta \ln \beta_{o_{t-1}}(s_t, \vartheta)$
14:    $\phi \leftarrow \phi + \alpha_\phi\, \beta_{o_{t-1}}(s_t, \vartheta)\, \delta^O_t\, e_\phi + \alpha_\phi\, \text{temp}\, \text{temp}^\top \phi$
15:    $\vartheta \leftarrow \vartheta - \alpha_\vartheta\, \phi / \lVert\phi\rVert^2$
16:  end if
17:  if $o_t$ should terminate in $s_{t+1}$ according to $\beta_{o_t}$ then
18:    Choose $o_{t+1}$ (the next option) according to $\pi_O$ and reset $\eta$, $\phi$, $e_\eta$, $e_\phi$
19:  end if
20: end while
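For readers who prefer code, here is a minimal Python rendering of the intra-option-policy portion of Algorithm 1 (lines 5–9), assuming the score vector and the TD error are supplied by the surrounding agent code; it is a sketch of those update rules, not a complete implementation of INOC.

```python
import numpy as np

def inoc_intra_option_update(score, delta_U, e_eta, eta, theta,
                             lam, alpha_eta, alpha_theta):
    """One-step rendering of lines 5-9 of Algorithm 1 (intra-option policy part).

    `score` is grad_theta log pi_{o_t}(s_t, a_t, theta) and `delta_U` is the TD
    error r_t + gamma * u_hat(o_t, s_{t+1}) - q_hat(s_t, o_t); both are assumed
    to be computed by the surrounding (hypothetical) agent code.
    """
    e_eta = lam * e_eta + score                    # line 5: decay and accumulate the trace
    # line 8: eta <- eta + a_eta * delta_U * e_eta - a_eta * (score score^T) eta
    eta = eta + alpha_eta * delta_U * e_eta - alpha_eta * score * np.dot(score, eta)
    # line 9: natural-gradient step, theta <- theta + a_theta * eta / ||eta||^2
    theta = theta + alpha_theta * eta / np.dot(eta, eta)
    return e_eta, eta, theta
```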
## Experiments

We examine the performance of the natural option critic in three different types of domains: a simple two-state MDP, a domain with linear state representations, and a domain with neural network state representations, and we compare it to the option critic. In all cases we use sigmoid terminations and linear-softmax intra-option policies, as in previous work (Bacon, Harb, and Precup 2017).

**MDP Setup:** We design an MDP to demonstrate the uniform weighting of the components of the natural termination gradient, $\tilde\nabla_\vartheta\, u(o_0, s_1)$, as opposed to weighting by $\mu_O(s, o)$. Note that the effectiveness of the natural policy gradient has been demonstrated sufficiently in past work (Kakade 2001; Bagnell and Schneider 2003; Degris, Pilarski, and Sutton 2012). We define a simple two-state MDP as in Figure 1. The initial state distribution is $d_0(s_1) = 0.8$ and $d_0(s_2) = 0.2$. The transitions are deterministic. The rewards for the self-loops into $s_1$ and $s_2$ are 1 and 2, respectively. The episode terminates after 30 steps. We use an $\epsilon$-greedy policy over options, $\pi_O$. We consider a scenario with two options, $o_1$ and $o_2$, each of which assigns probability 0.9 to actions $a_1$ and $a_2$, respectively, regardless of the state. This gives us options as abstractions over individual actions.

[Figure 1: Simple deterministic MDP with two states and two actions.]

[Figure 2: Average reward. INOC reaches the maximum while OC is stuck in a plateau. Results averaged over 200 runs of 2000 episodes.]

We initialize the terminations, $\beta_o$, and the option-value function, $q_{\pi_O}(s, o)$, such that they are biased towards the greedy action, $a_1$, in state $s_1$ via the selection of option $o_1$. Specifically, we set $\beta_{o_1}(s_1) = 0.1$ and $\beta_{o_1}(s_2) = 0.1$; this way the setup is biased towards a higher probability of $\mu_O(s_1, o_1)$. This presents the agent with the challenge of learning the more optimal action of transitioning to state $s_2$, despite the higher probability $\mu_O(s_1, o_1)$ and the self-loop reward of $s_1$. We set the learning rate for the intra-option policies, $\alpha_\theta$, to be negligible, as our goal is to demonstrate the efficacy of the natural termination gradient. As can be seen from Figure 2, the natural option critic overcomes the plateau and converges to the optimal average reward much faster than the option critic. The option critic is initially stuck in the greedy self-loop action; this is due to the weighting by $\mu_O(s, o)$. The natural option critic, in contrast, begins learning early on and achieves the optimal average reward.

**Four Rooms:** The four rooms domain (Sutton, Precup, and Singh 1999) is a particularly favorable case for demonstrating the use of options. We use the same number of options, 4, as in previous work (Bacon, Harb, and Precup 2017). The result (Figure 3) indicates that the natural option critic converges faster.

[Figure 3: Four rooms with $\alpha_\theta = \alpha_\vartheta = 0.0025$, $\alpha_\eta = 0.5$, $\alpha_\phi = 0.75$, $\lambda = 0.5$, and critic LR 0.5, averaged over 350 runs.]

**Arcade Learning Environment:** We compare the natural option critic with the option-critic framework on the Arcade Learning Environment (Bellemare et al. 2013). To showcase the improvement over the option-critic architecture we use the same configuration for all layers as in previous work (Bacon, Harb, and Precup 2017), which in turn uses the same configuration as the first 3 convolutional layers of the network introduced by Mnih et al. (2013). The critic network was trained, as in previous work (Bacon, Harb, and Precup 2017), using experience replay (Mnih et al. 2013) and RMSProp. As in previous work (Bacon, Harb, and Precup 2017), we apply the regularizer prescribed by Mnih et al. (2016) to penalize low-entropy policies. We use an on-policy estimate of the policy over options, $\pi_O$, which is used in the computation of the natural gradient with respect to the termination parameters. We compare the two approaches, option critic and natural option critic, by evaluating them on the games Asterisk, Seaquest, and Zaxxon (Bacon, Harb, and Precup 2017). For the comparison we train over the same number of frames per epoch as Bacon, Harb, and Precup (2017), run the same number of trials, and use the same number of options: 8. We present the results in Figure 4. More importantly, we use the same hyperparameters for the learning rates and entropy regularization as in previous work, to merit a fair comparison. We obtain improvements over the option-critic architecture (OC) for Asterisk and Zaxxon. We also note that we were unable to reproduce the results for Seaquest for the option critic; given the same set of hyperparameters, we observe that the option critic performs better there. We explain the issue with the termination updates, and its effect on the return, for Seaquest in the appendix. For Zaxxon and Asterisk we see that NOC breaks out of the plateau much earlier than the option critic. Note that the value network, for approximating $q_U$, is learned using the vanilla gradient.
[Figure 4: Moving average of 10 returns for a single trial on the Arcade Learning Environment, with $\alpha_\theta = \alpha_\vartheta = 0.0025$, $\alpha_\eta = \alpha_\phi = 0.75$, and $\lambda = 0.5$. Panels: (a) Asterisk, (b) Seaquest.]

## Discussion

We have introduced a natural-gradient-based approach for learning intra-option policies and terminations within the option-critic framework that is linear in the number of parameters. More importantly, we have furnished instructive proofs for deriving the Fisher information matrices over path manifolds and a corresponding function-approximation-based approach that minimizes mean squared errors. We have also introduced an algorithm that uses consistent estimates of the advantage functions and learns the natural gradient by learning the coefficients of the corresponding linear function approximators. The results showcase performance improvements over previous work. The proofs for the finite-horizon metrics are very similar to the ones provided by Bagnell and Schneider (2003). We also demonstrate the effectiveness of the natural option critic in three distinct domains. As discussed by Thomas (2014), we could obtain a truly unbiased estimate for our updates, but doing so may not be practical (Thomas 2014). The limitations that apply to the option-critic framework, other than the use of the vanilla gradient, apply here as well.

We use a block-diagonal estimate of the Fisher information matrix. The complete Fisher information matrix for the option-critic framework over path manifolds, $G_{\theta,\vartheta}$, is defined over the path random variable $X$ of state-option-action tuples, where $G_\theta$ and $G_\vartheta$ are the Fisher information matrices for the intra-option path manifold and the state-option transition manifold, respectively. Computing the complete Fisher information matrix and its inverse is expensive, and a compatible-function-approximation-based approach is needed to obtain a natural gradient estimate with space complexity linear in the number of parameters. Although our approach has added benefits, it is limited by fewer updates of the termination parameters. Work is required to develop better estimates of the advantage functions. More experimental work, e.g., applications to other domains, can further help in understanding the efficacy of natural gradients in the context of the option-critic framework.

## References

Amari, S. 1985. Differential-geometrical methods in statistics. Lecture Notes in Statistics 28. Springer-Verlag.

Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2):251–276.

Amari, S.-I. 2016. Information Geometry and Its Applications. Springer.

Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In AAAI.

Bagnell, J. A., and Schneider, J. 2003. Covariant policy search. In IJCAI.

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. H. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253–279.

Bhatnagar, S.; Sutton, R. S.; Ghavamzadeh, M.; and Lee, M. 2007. Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 105–112.

Bhatnagar, S.; Sutton, R. S.; Ghavamzadeh, M.; and Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11):2471–2482.

Degris, T.; Pilarski, P. M.; and Sutton, R. S. 2012. Model-free reinforcement learning with continuous action in practice.

De Groot, M. 1970. Optimal Statistical Decisions. Wiley Classics Library. Wiley.

Desjardins, G.; Simonyan, K.; Pascanu, R.; et al. 2015. Natural neural networks. In Advances in Neural Information Processing Systems, 2071–2079.
Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13(1):227–303.

Kakade, S. 2001. A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS 2001), 1531–1538.

Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In NIPS 2000, 1008–1014.

Konidaris, G., and Barto, A. G. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, 1015–1023.

Kurita, T. 1992. Iterative weighted least squares algorithms for neural network classifiers. New Generation Computing 12:375–394.

Machado, M. C.; Bellemare, M. G.; and Bowling, M. 2017. A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, 2295–2304.

Martens, J., and Grosse, R. B. 2015. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML.

Martens, J. 2010. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, 735–742.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML.

Morimura, T.; Uchibe, E.; and Doya, K. 2005. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces.

Parr, R., and Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1043–1049.

Pascanu, R., and Bengio, Y. 2013. Revisiting natural gradient for deep networks.

Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71:1180–1190.

Rao, C. R. 1945. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 81–91.

Roux, N. L.; Manzagol, P.; and Bengio, Y. 2008. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20, 849–856.

Simsek, O., and Barto, A. G. 2008. Skill characterization based on betweenness. In NIPS.

Sun, K., and Nielsen, F. 2017. Relative Fisher information and natural gradient for learning large modular models. In ICML.

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (NIPS 1999), 1057–1063.

Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–211.

Thomas, P. S., and Brunskill, E. 2017. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. CoRR abs/1706.06643.
Thomas, P.; Silva, B. C.; Dann, C.; and Brunskill, E. 2016. Energetic natural gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, 2887–2895.

Thomas, P. S.; Dann, C.; and Brunskill, E. 2018. Decoupling learning rules from representations. In ICML.

Thomas, P. S. 2011. Policy gradient coagent networks. In Advances in Neural Information Processing Systems 24, 1944–1952.

Thomas, P. 2014. Bias in natural actor-critic algorithms. In ICML.

Thrun, S., and Schwartz, A. 1995. Finding structure in reinforcement learning. In Advances in Neural Information Processing Systems 7, 385–392.