# Learning with Options that Terminate Off-Policy

Anna Harutyunyan (Vrije Universiteit Brussel, Brussels, Belgium), Peter Vrancx (PROWLER.io, Cambridge, England), Pierre-Luc Bacon (McGill University, Montreal, Canada), Doina Precup (McGill University, Montreal, Canada), Ann Nowé (Vrije Universiteit Brussel, Brussels, Belgium)

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides the option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal and cannot express the primitive optimal policy well, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just as is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.

1 Introduction

Abstraction is essential for scaling up learning, and there has been a renewed interest in methods that extract or leverage it (Vezhnevets et al. 2016; Kulkarni et al. 2016; Tessler et al. 2017). The options framework (Sutton, Precup, and Singh 1999) is the standard for modeling temporal abstraction in reinforcement learning. The temporal aspect of an option is determined by its termination condition β, which roughly determines its length. Learning and planning with longer options is known to be more efficient (Mann, Mannor, and Precup 2015). This is partly due to an option having similar properties to the familiar multi-step λ-returns,¹ which are known to yield faster convergence (Bertsekas and Tsitsiklis 1996). The key qualitative difference between β and λ, however, is that β directly affects the solution rather than, like λ, just the rate of convergence. If β is not trivial, this couples the quality of the solution with the quality of the options at hand. This can be especially restrictive if the options are not perfect, which of course is likely. Indeed, when a set of options is given, we can show that the more they terminate, the closer to optimal the resulting policy is at the primitive-action level. This poses a challenge: on the one hand, we wish for the options to be long, to yield fast convergence and meaningful exploration; on the other hand, if these options are not ideal, the more we commit to them, the poorer the quality of our solution. Interrupting suboptimal options is one way of addressing this (Sutton, Precup, and Singh 1999), but, like cutting traces in off-policy learning, it may prevent us from following a coherent policy for more than a couple of steps.

¹ The similarity is particularly relevant since a full β-model (Sutton 1995) is at the basis of both paradigms.
To this end, we propose to terminate options off-policy; that is, to decouple the behavior termination condition that the options execute with from the target termination condition that is to be factored into the solution. The behavior terminations then become solely a means of influencing convergence speed. We describe a new algorithm, Q(β), that achieves this by leveraging connections to several old and new multi-step off-policy temporal difference algorithms.

The paper is organized as follows. After introducing the relevant background, we derive the fixed point of the classical call-and-return option operator. Using its shape for intuition, we then incrementally build up to our proposed off-policy termination option operator, starting from the classical intra-option equations. We analyze the convergence properties of this operator for both policy evaluation and control, and state the corresponding online algorithm Q(β). Finally, we validate Q(β) empirically, and show that it can (1) learn to evaluate a task w.r.t. options terminating off-policy, and (2) learn an optimal solution from suboptimal options quicker than the alternatives.

2 Framework and notation

We assume the standard reinforcement learning setting (Sutton and Barto 2017) of an MDP $M = (S, A, p, r, \gamma)$, where $S$ is the set of states, $A$ the set of discrete actions, $p : S \times A \times S \to [0,1]$ the transition function that specifies the environment dynamics, with $p(s'|s,a)$ (or $p^a_{ss'}$ for short) denoting the probability of transitioning to state $s'$ upon taking action $a$ in $s$, $r : S \times A \to [-r_{\max}, r_{\max}]$ the reward function, and $\gamma$ the scalar discount factor. A policy is a probabilistic mapping from states to actions. For a policy $\pi$ and the MDP transition model $p$, let the matrix $p^\pi$ denote the dynamics of the induced Markov chain, $p^\pi(s,s') = \sum_{a \in A} \pi(a|s)\, p(s'|s,a)$, and $r^\pi$ the reward expected in each state under $\pi$: $r^\pi(s) = \sum_{a \in A} \pi(a|s)\, r(s,a)$. Consider a Q-function $q$ as a mapping $S \times A \to \mathbb{R}$, and define the one-step transition operator over Q-functions as:

$$(P^\pi q)(s,a) \overset{\text{def}}{=} \sum_{s' \in S} \sum_{a' \in A} p(s'|s,a)\, \pi(a'|s')\, q(s',a'). \qquad (1)$$

Using operator notation, define the Q-function corresponding to the value of policy $\pi$ as:

$$q^\pi = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t r = r + \gamma P^\pi q^\pi = (I - \gamma P^\pi)^{-1} r,$$

the recursive form of which defines the Bellman equation for the policy $\pi$ (Bellman 1957). The corresponding one-step Bellman operator can be applied to any Q-function:

$$T^\pi q(s,a) \overset{\text{def}}{=} r(s,a) + \gamma (P^\pi q)(s,a), \qquad (2)$$

and its repeated application is guaranteed to produce its fixed point $q^\pi$ (Puterman 1994). The policy evaluation setting is concerned with estimating this quantity for a given policy $\pi$. The Bellman optimality operator introduces maximization over the set of policies: $T q \overset{\text{def}}{=} r + \gamma \max_\pi P^\pi q$, and its repeated application is guaranteed to produce its fixed point $q^* \overset{\text{def}}{=} \max_\pi q^\pi = q^{\pi^*}$, the value of the optimal policy $\pi^*$. The control setting is concerned with finding this value. A policy is greedy w.r.t. a Q-function if in each state it picks the action of maximum value. The optimal policy $\pi^*$ is greedy w.r.t. the optimal Q-function $q^*$.

Throughout, we write $\mathbf{1}$ and $\mathbf{0}$ for the vectors with all components equal to 1 or 0, respectively. For a policy $\pi_b$, we use $\mathbb{E}_{\pi_b}[\cdot]$ as a shorthand for the expectation $\mathbb{E}_{S_{1:\infty}, A_{1:\infty} | \pi_b}[\cdot]$ w.r.t. the trajectory $S_0 = s, A_0 = a, S_1, A_1, \ldots$, with $A_t$ drawn according to $\pi_b(\cdot|S_t)$ and $S_{t+1}$ drawn according to $p(\cdot|S_t, A_t)$. We occasionally use $R_{t+1}$ to denote $r(S_t, A_t)$, and write arguments in subscripts (i.e., use $x(s)$ and $x_s$ interchangeably).
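To make the operator notation concrete, the following is a minimal NumPy sketch (ours, not from the paper) that builds the state-action transition operator $P^\pi$ of Eq. (1) for a small randomly generated MDP and solves for $q^\pi = (I - \gamma P^\pi)^{-1} r$ directly; all quantities in it are made up for illustration.

```python
import numpy as np

# A minimal sketch: build P^pi of Eq. (1) as an (SA x SA) matrix for a small
# randomly generated MDP and solve q^pi = (I - gamma * P^pi)^{-1} r directly.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

p = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p[s, a, s']: transition probabilities
r = rng.uniform(-1.0, 1.0, size=(nS, nA))       # r[s, a]: rewards
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, a]: policy probabilities

# (P^pi q)(s, a) = sum_{s', a'} p(s'|s, a) pi(a'|s') q(s', a')
P_pi = np.einsum('kas,sb->kasb', p, pi).reshape(nS * nA, nS * nA)

# Fixed point of the Bellman operator T^pi q = r + gamma * P^pi q
q_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r.reshape(-1))
print(q_pi.reshape(nS, nA))
```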
In the learning setting, algorithms typically update with a sample of the residual $T^\pi q - q$:

$$\delta_t = R_{t+1} + \gamma q(S_{t+1}, A_{t+1}) - q(S_t, A_t),$$

which is referred to as a temporal difference (TD) error.

2.1 Multi-step off-policy TD learning

One need not consider only single-step operators, and may apply $T^\pi$ and $T$ repeatedly. The λ-operator is a particularly flexible mixture of such multi-step operators:

$$T^\pi_\lambda q = (1-\lambda) \sum_{n=0}^{\infty} \lambda^n (T^\pi)^{n+1} q = q + (I - \lambda\gamma P^\pi)^{-1}(T^\pi q - q).$$

In the learning setting, this corresponds to considering multi-step returns, rather than one-step samples, for updating the estimate, and λ trades off the bias of bootstrapping with an approximate estimate against the variance of sampling multiple steps. A high intermediate value of λ is typically best in practice (Kearns and Singh 2000).

Off-policy learning is the setting where the behavior and target policies are decoupled, that is, $\pi_b \neq \pi$. Multi-step methods pose a challenge when considered off-policy, and advances have recently been made in this direction (Munos et al. 2016; Mahmood, Yu, and Sutton 2017). To this end, Munos et al. (2016) unified several off-policy return-based algorithms under a common umbrella:

$$\mathcal{R}^{\pi,\pi_b}_c q(s,a) \overset{\text{def}}{=} q(s,a) + \mathbb{E}_{\pi_b}\Big[\sum_{t \ge 0} \gamma^t \Big(\prod_{i=1}^{t} c_i\Big)\, \delta^\pi_t\Big], \qquad (3)$$
$$\delta^\pi_t = R_{t+1} + \gamma\, \mathbb{E}_\pi q(S_{t+1}, \cdot) - q(S_t, A_t),$$

where $\mathbb{E}_\pi q(s, \cdot) \overset{\text{def}}{=} \sum_{a \in A} \pi(a|s)\, q(s,a)$, and the trace $c_i$ is a state-action coefficient whose specific shapes correspond to different algorithms. In particular, Tree-Backup(λ) sets $c_i = \lambda \pi(A_i|S_i)$ (Precup, Sutton, and Singh 2000), while Retrace(λ) sets $c_i = \lambda \min\big(1, \frac{\pi(A_i|S_i)}{\pi_b(A_i|S_i)}\big)$ (Munos et al. 2016).

Unifying algorithm. Recently, Asis et al. (2017) formulated an algorithm that encompasses the ones discussed so far even more generally. Instead of the binary taxonomy of on- and off-policy algorithms, the authors introduce a parameter σ that smoothly transitions between the two. This algorithm can also be expressed via Eq. (3), with:

$$\delta^{\sigma,\pi}_t = R_{t+1} + \gamma\big(\sigma\, q(S_{t+1}, A_{t+1}) + (1-\sigma)\,\mathbb{E}_\pi q(S_{t+1}, \cdot)\big) - q(S_t, A_t), \qquad c_i = \lambda\big((1-\sigma)\,\pi(A_i|S_i) + \sigma\big).$$

In particular, σ = 1 corresponds to the on-policy SARSA(λ) algorithm, while σ = 0 corresponds to Tree-Backup(λ).

2.2 Options

An option $o$ is a tuple $(I_o, \zeta_o, \pi_o)$, with $I_o \subseteq S$ the initiation set, from which an option may start, $\zeta_o : S \to [0,1]$ the probabilistic termination condition, and $\pi_o$ the option policy with which it navigates through the environment. Just like the MDP reward and transition models $r$ and $p$, options can be seen to induce semi-MDP (Puterman 1994) models $R$ and $P$ as follows (Sutton, Precup, and Singh 1999):

$$P^o_{ss'} \overset{\text{def}}{=} \mathbb{E}_{D: s \to s'|o}\big[\gamma^D\big] = \gamma\, p^{\pi_o}_{ss'}\beta_{s'} + \gamma \sum_{s''} p^{\pi_o}_{ss''}\big(1-\beta_{s''}\big)\, P^o_{s''s'},$$
$$R^o_s \overset{\text{def}}{=} \mathbb{E}_{D: s|o}\Big[\sum_{i=1}^{D} \gamma^{i-1} r^{\pi_o}(S_{t+i}) \,\Big|\, S_t = s\Big] = r^{\pi_o}_s + \gamma \sum_{s'} p^{\pi_o}_{ss'}\big(1-\beta_{s'}\big)\, R^o_{s'}, \qquad (4)$$

where $\mathbb{E}_{D: s|o}[\cdot]$ and $\mathbb{E}_{D: s \to s'|o}[\cdot]$ are expectations of the option duration $D$ from state $s$, and of the travel time between states $s$ and $s'$, respectively, w.r.t. the option dynamics $p^{\pi_o}$ and the termination condition $\beta_o$.

In the call-and-return model of option execution, an option is run until completion (according to its termination condition), and only then is a new option choice made (Precup, Sutton, and Singh 1998). This suggests the following state-option analogues of the state-action transition operator from Eq. (1) and the Bellman operator from Eq. (2). For a policy over options μ:

$$(P^\mu_{\mathcal{O}} q)(s,o) \overset{\text{def}}{=} \sum_{s'} P^o(s,s') \sum_{o'} \mu(o'|s')\, q(s',o'), \qquad (5)$$
$$T^\mu_{\mathcal{O}} q(s,o) \overset{\text{def}}{=} R^o_s + (P^\mu_{\mathcal{O}} q)(s,o). \qquad (6)$$

It will also be relevant to consider the marginal flat policy κ over primitive actions:

$$\kappa(a|s) \overset{\text{def}}{=} \sum_{o} \mu(o|s)\, \pi_o(a|s). \qquad (7)$$

For simplicity, we assume that options can initiate anywhere: $I_o = S$ for all $o \in \mathcal{O}$.
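The recursions in Eq. (4) are linear in $P^o$ and $R^o$, so in a tabular problem they can be solved directly. Below is a small NumPy sketch (our own illustration, with made-up dynamics) that computes both models for a single option in matrix form.

```python
import numpy as np

# Sketch: the semi-MDP option models of Eq. (4) solved in matrix form for a single
# option on a small made-up MDP. The recursions are linear, so
#   P^o = (I - gamma * P_pio * diag(1 - beta))^{-1} * gamma * P_pio * diag(beta),
#   R^o = (I - gamma * P_pio * diag(1 - beta))^{-1} * r_pio.
rng = np.random.default_rng(1)
nS, gamma = 5, 0.95

P_pio = rng.dirichlet(np.ones(nS), size=nS)   # p^{pi_o}[s, s']: dynamics under the option policy
r_pio = rng.uniform(-1.0, 1.0, size=nS)       # r^{pi_o}[s]: expected one-step reward under pi_o
beta = rng.uniform(0.0, 1.0, size=nS)         # beta_o[s]: termination probability in each state

cont = np.eye(nS) - gamma * P_pio * (1.0 - beta)   # broadcasting scales column s' by (1 - beta[s'])
P_o = np.linalg.solve(cont, gamma * P_pio * beta)  # discounted state-to-state option model
R_o = np.linalg.solve(cont, r_pio)                 # expected discounted reward until termination
print(P_o.sum(axis=1))                             # each row sums to E[gamma^D] <= 1
print(R_o)
```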
Finally, in the main departure from the standard framework, we distinguish between the target termination condition β that is to be estimated, and a behavior termination condition ζ ("zeta") with which the options actually terminate.

3 The call-and-return operator

Before proceeding to describe our key idea and algorithm, let us quantify the role of β in the target solution of planning with options. To do this, we derive this solution at the primitive-action resolution. Let ν be an arbitrary policy over options, and $c : S \times \mathcal{O} \to [0,1]$ a coefficient function. Consider the following transition operator and its corresponding Bellman operator:

$$(P^{c\nu} q)(s,o) \overset{\text{def}}{=} \sum_{s' \in S} p^{\pi_o}_{ss'}\, c(s',o) \sum_{o'} \nu(o'|s')\, q(s',o'), \qquad T^{c\nu} q \overset{\text{def}}{=} c\, r^\pi + \gamma P^{c\nu} q,$$

where $r^\pi$ is the $|S||\mathcal{O}|$-vector of $r^{\pi_o}$ for all options, and $c\, r^\pi$ denotes the componentwise product. Note that $T^{c\nu}$, like $T^\pi$, is a one-step operator, whereas $T^\mu_{\mathcal{O}}$ is an option-level operator. The transition operator $P^{c\nu}$ in particular defines the following operators, corresponding to option continuation and termination, respectively:

$$(P^{(1-\beta)\iota} q)(s,o) \overset{\text{def}}{=} \sum_{s' \in S} p^{\pi_o}_{ss'}\big(1 - \beta_o(s')\big)\, q(s',o),$$
$$(P^{\beta\mu} q)(s,o) \overset{\text{def}}{=} \sum_{s' \in S} p^{\pi_o}_{ss'}\, \beta_o(s') \sum_{o'} \mu(o'|s')\, q(s',o').$$

That is, ι ("iota") is the policy over options that maintains the current (argument) option. Using these operators, we can express $P^\mu_{\mathcal{O}}$ from Eq. (5) and the reward model from Eq. (4) concisely for all state-option pairs:

$$P^\mu_{\mathcal{O}} q = (I - \gamma P^{(1-\beta)\iota})^{-1} \gamma P^{\beta\mu} q, \qquad R = (I - \gamma P^{(1-\beta)\iota})^{-1} r^\pi,$$

and rewrite the option-level Bellman operator from Eq. (6):

$$T^\mu_{\mathcal{O}} q = (I - \gamma P^{(1-\beta)\iota})^{-1}\big(r^\pi + \gamma P^{\beta\mu} q\big). \qquad (8)$$

The following proposition derives the fixed point of $T^\mu_{\mathcal{O}}$ in terms of the one-step operators $P^{(1-\beta)\iota}$ and $P^{\beta\mu}$.

Proposition 1. The fixed point of the operator $T^\mu_{\mathcal{O}}$ is the same as the fixed point of the operator $T^{(1-\beta)\iota} + T^{\beta\mu}$, and writes:

$$q^{\mu,\iota}_\beta = \big(I - \gamma(P^{\beta\mu} - P^{\beta\iota}) - \gamma P^{\mathbf{1}\iota}\big)^{-1} r^\pi. \qquad (9)$$

Thus, the termination scheme directly affects the convergence limit. In one extreme, if β = 0, options never terminate, and we have the fixed point of $T^{\mathbf{1}\iota}$: $q^{\mu,\iota}_0(s,o) = \mathbb{E}_{\pi_o} q^{\pi_o}(s,\cdot)$, the value of the option $o$. In the other extreme, β = 1, the options terminate at every step, and we have the fixed point of $T^{\mathbf{1}\mu}$, which can be shown (Theorem 1 in (Bacon and Precup 2016)) to correspond to the value of the marginal policy κ from Eq. (7):

$$q^{\mu,\iota}_1(s,o) = \mathbb{E}_\kappa q^\kappa(s,\cdot), \quad \forall o \in \mathcal{O}. \qquad (10)$$
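As a sanity check on Proposition 1, the following NumPy sketch (ours, on a made-up two-option MDP) builds the one-step state-option operators explicitly, computes the closed form (9), and verifies that it is indeed a fixed point of the call-and-return operator in Eq. (8).

```python
import numpy as np

# Sketch: a numerical check of Proposition 1 / Eq. (9) on a tiny made-up MDP with
# two options; all quantities here are illustrative, not the paper's experiments.
rng = np.random.default_rng(2)
nS, nO, gamma = 4, 2, 0.9

P_pio = rng.dirichlet(np.ones(nS), size=(nO, nS))  # P_pio[o, s, s']: dynamics under option o's policy
r_pi = rng.uniform(-1.0, 1.0, size=(nO, nS))       # r_pi[o, s]: expected reward under option o's policy
beta = rng.uniform(0.0, 1.0, size=(nO, nS))        # beta[o, s']: target termination probabilities
mu = rng.dirichlet(np.ones(nO), size=nS)           # mu[s, o]: policy over options

idx = lambda s, o: s * nO + o                      # flatten (state, option) pairs
P_cont, P_bmu, P_biota, P_1iota = (np.zeros((nS * nO, nS * nO)) for _ in range(4))
for s in range(nS):
    for o in range(nO):
        for s2 in range(nS):
            p = P_pio[o, s, s2]
            P_1iota[idx(s, o), idx(s2, o)] = p                       # P^{1 iota}
            P_cont[idx(s, o), idx(s2, o)] = p * (1 - beta[o, s2])    # P^{(1-beta) iota}
            P_biota[idx(s, o), idx(s2, o)] = p * beta[o, s2]         # P^{beta iota}
            for o2 in range(nO):
                P_bmu[idx(s, o), idx(s2, o2)] = p * beta[o, s2] * mu[s2, o2]   # P^{beta mu}

r = np.array([r_pi[o, s] for s in range(nS) for o in range(nO)])
I = np.eye(nS * nO)

q9 = np.linalg.solve(I - gamma * (P_bmu - P_biota) - gamma * P_1iota, r)       # Eq. (9)
T_O = lambda q: np.linalg.solve(I - gamma * P_cont, r + gamma * P_bmu @ q)     # Eq. (8)
print(np.allclose(T_O(q9), q9))                                                # expected: True
```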
4 Off-policy option termination

We would like to decouple the target termination condition β that factors into the solution from the behavior termination condition ζ that governs for how long the options are followed. Apart from the theoretical appeal of the freedom that this allows, a key motivation is the fact that, on the one hand, just like with multi-step returns, the less the options terminate, the faster the convergence, but on the other hand, the more they terminate, the better the control solution (as we show formally in the next section). Being able to decouple the two allows one to achieve the best of both worlds, and is exactly what we propose to do in this paper.

The critical insight in our approach is the off-policy-ness at the primitive-action level that is introduced by the discrepancy between the policies μ (which picks a new option) and ι (which maintains the current option) in Eq. (9). The degree of this off-policy-ness is modulated exactly by the termination condition β, and is usually implicit. We propose to leverage multi-step off-policy learning and to impose a desired degree explicitly, independently of the behavior terminations ζ.² In one extreme, we can learn the marginal policy κ directly; in the other, we can learn the value of the current option. The two extremes are traded off via β. The algorithm we propose is analogous to the unifying algorithm Q(σ), in which σ modulates the degree of off-policy-ness (Asis et al. 2017).

To highlight the parallels between learning with options and learning off-policy from multi-step returns, this section incrementally builds the technical intuition towards the proposed off-policy termination operator, starting from the classical intra-option equations. The next section analyzes the convergence of this operator.

² Technically this should be referred to as off-termination learning, since the difference is indeed in the termination conditions, of which the induced policies are a consequence: the behavior policy ι takes the current option w.p. 1, while the target policy is the actual policy over options μ.

4.1 From one-step intra-option learning to General Q(λ)

To begin, let us first re-derive the target from Proposition 1, starting from the familiar intra-option equations (Sutton and Precup 1998). At this point, assume that the options execute w.r.t. some behavior condition ζ. Letting Δ denote the update on the estimated Q-function, we have at time t, for the current option o:

$$\Delta q(S_t, o) \propto r(S_t, A_t) + \gamma\, \tilde{q}(S_{t+1}, o) - q(S_t, o), \qquad (11)$$
$$\tilde{q}(s,o) = \big(1 - \zeta_o(s)\big)\, q(s,o) + \zeta_o(s)\, \mathbb{E}_\mu q(s,\cdot),$$

where, as before, we write $\mathbb{E}_\mu q(s,\cdot) \overset{\text{def}}{=} \sum_o \mu(o|s)\, q(s,o)$. Notice that this is exactly a sample of the one-step update corresponding to $T^{(1-\zeta)\iota} + T^{\zeta\mu}$. In fact, if we roll it out over multiple steps and take an expectation, we obtain Eq. (8) for the behavior ζ exactly:

$$q(s,o) = \mathbb{E}_{\pi_o}\Big[\sum_{t \ge 0} \gamma^t \prod_{i=1}^{t}\big(1 - \zeta_o(S_i)\big)\, \big(r(S_t, A_t) + \gamma\, \zeta_o(S_{t+1})\, \mathbb{E}_\mu q(S_{t+1}, \cdot)\big)\Big], \qquad (12)$$

from which some simple algebra yields:

$$\Delta q(s,o) = \mathbb{E}_{\pi_o}\Big[r(S_0, A_0) + \gamma\, \mathbb{E}_\mu q(S_1, \cdot) - q(s,o) + \sum_{t \ge 1} \gamma^t \prod_{i=1}^{t}\big(1 - \zeta_o(S_i)\big)\, \delta^{\mu\mu}_t\Big], \qquad (13)$$
$$\delta^{\mu\mu}_t = r(S_t, A_t) + \gamma\, \mathbb{E}_\mu q(S_{t+1}, \cdot) - \mathbb{E}_\mu q(S_t, \cdot).$$

This is the same (expected) update as Peng's Q(λ) for greedy policies (Peng and Williams 1996), or General Q(λ) (i.e., off-policy Expected SARSA(λ) (Van Seijen et al. 2009)) for arbitrary policies, with $1 - \zeta_o(S_i)$ being the state-option analogue of λ from those algorithms. The fixed point of both of those algorithms is given in Proposition 1 of (Harutyunyan et al. 2016), and is indeed analogous to that given in Proposition 1 of this paper.

4.2 General Q(λ) to Tree-Backup(λ)

The update (13) converges to the fixed point from Proposition 1, which is an on-/off-policy mixture regulated by the behavior terminations ζ. We wish to reweigh this mixture by the desired target terminations β. Let us now begin introducing the off-policy machinery that will allow us to do so. It will be useful to cast the multi-step intra-option update (13) in the form of the general off-policy operator (3). This can be done by replacing the second expectation in the TD error with the point estimate $q(S_t, o)$, which introduces off-policy corrections:

$$\delta^{\mu}_t = r(S_t, A_t) + \gamma\, \mathbb{E}_\mu q(S_{t+1}, \cdot) - q(S_t, o).$$

This update converges to the off-policy fixed point of $T^{\mathbf{1}\mu}$ only if μ and ι are close (Harutyunyan et al. 2016), which, in our particular case of the indicator behavior policy ι, is not a very interesting scenario. If we further augment the trace $1 - \zeta_o(S_i)$ with the policy probability coefficient $\mu(o|S_i)$, we obtain option-level Tree-Backup(λ) (Precup, Sutton, and Singh 2000), whose target policy is μ and behavior policy is ι, with $1 - \zeta_o(S_i)$ being the state-option analogue of λ:

$$\Delta q(s,o) = \mathbb{E}_{\pi_o}\Big[\sum_{t \ge 0} \gamma^t \prod_{i=1}^{t} c^o_i\, \delta^{\mu}_t \,\Big|\, S_0 = s\Big], \qquad c^o_i = \mu(o|S_i)\big(1 - \zeta_o(S_i)\big). \qquad (14)$$

From the convergence guarantees of Tree-Backup, we know that this update converges to the fixed point of $T^{\mathbf{1}\mu}$, which in turn corresponds to $q^\kappa$ (Eq. (10)).
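To make the role of the trace concrete, here is a small tabular sketch (ours, not the paper's code) of the expected update in Eq. (14), estimated from a single sampled trajectory of the current option; the array layouts (q[s, o], mu[s, o], zeta[o, s]) are assumptions for illustration.

```python
import numpy as np

# Sketch: the option-level Tree-Backup-style update of Eq. (14) estimated from one
# sampled trajectory of the current option o.
def tree_backup_option_update(q, o, states, rewards, mu, zeta, gamma):
    """Return an estimate of Delta q(states[0], o) with trace c_i = mu(o|S_i)(1 - zeta_o(S_i)).

    states  -- S_0, ..., S_T with T = len(rewards); rewards[t] = r(S_t, A_t)
    """
    delta_q, trace, discount = 0.0, 1.0, 1.0
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        # delta^mu_t: bootstrap with E_mu q(S_{t+1}, .) but subtract the point estimate q(S_t, o)
        td = rewards[t] + gamma * np.dot(mu[s_next], q[s_next]) - q[s, o]
        delta_q += discount * trace * td
        discount *= gamma
        trace *= mu[s_next, o] * (1.0 - zeta[o, s_next])   # contributes c_{t+1} to the next term
    return delta_q
```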
4.3 The off-policy termination operator

We are now ready to present the operator underlying the Q(β) algorithm, which is the main contribution of this paper. Eq. (14) can be considered a special case in which we correct all of the off-policy-ness, thus implicitly assuming β = 1. To obtain the general case, we need to split the target in each TD error into an off- and an on-policy term:

$$\mathcal{R}^\mu_\beta q(s,o) = q(s,o) + \sum_{t \ge 0} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t} c^o_i\, \delta^{\beta,\mu}_t\Big], \qquad (15)$$
$$\delta^{\beta,\mu}_t = r(S_t, A_t) + \gamma\, \tilde{q}(S_{t+1}, o) - q(S_t, o),$$
$$\tilde{q}(s,o) = \big(1 - \beta_o(s)\big)\, q(s,o) + \beta_o(s)\, \mathbb{E}_\mu q(s,\cdot),$$
$$c^o_i = \big(1 - \zeta_o(S_i)\big)\big(1 - \beta_o(S_i) + \beta_o(S_i)\,\mu(o|S_i)\big).$$

Inspecting these updates, it is clear that the parameter β controls the degree of "partial off-policy-ness": in the next state, with weight $1 - \beta_o(S_{t+1})$ the option continues and the update considers the on-policy current-option value $q(S_{t+1}, o)$, while with weight $\beta_o(S_{t+1})$ the option terminates and the update considers the off-policy value w.r.t. the policy over options, $\sum_{o'} \mu(o'|S_{t+1})\, q(S_{t+1}, o')$. Note that these coefficients appear in the correction terms $c^o_i$ as well, since, for the algorithm to converge to the correct on-/off-policy mixed target, the $\beta_o(S_{t+1})$-weighted off-policy portion needs to be corrected. Finally, note that in the learning setting the second factor of $c^o_i$ is computed explicitly, while $1 - \zeta_o(S_i)$ is sampled from the current option.

Algorithm 1 presents the forward view of the algorithm underlying this expected operator, for the general case of an evolving sequence of policies $(\mu_k)_{k \in \mathbb{N}}$. This algorithm is very similar to the recently formalized Q(σ) (Asis et al. 2017; Sutton and Barto 2017), with β being a state-option generalization of 1 − σ. In Q(σ), the parameter σ controls the degree of off-policy-ness: σ = 1 corresponds to (on-policy) SARSA, and σ = 0 to (off-policy) Tree-Backup. Analogously, Q(β) with β = 0 learns the on-policy value of the current option policy $\pi_o$ (i.e., of the policy ι), and Q(β) with β = 1 learns the off-policy value of the marginal policy κ (i.e., of the policy μ). The behavior terminations ζ, on the other hand, have a role analogous to that of the eligibility-trace parameter λ.

Algorithm 1: Q(β)
Given: option set $\mathcal{O}$, target termination function β, initial Q-function $q_0$, step-sizes $(\alpha_k)_{k \in \mathbb{N}}$, start state $s_0$
1: $S_0 \leftarrow s_0$
2: for $k = 0, 1, \ldots$ do
3:   Sample an option $o$ from $\mu_k(\cdot|S_0)$
4:   Sample the return $R_1, S_1, R_2, \ldots, S_{D_k}$ from $\pi_o$; $D_k$ is determined by sampling $1 - \zeta_o(S_i)$
5:   for $t = 0, 1, \ldots, D_k - 1$ do
6:     $\delta^{\beta,\mu_k}_t = R_{t+1} + \gamma\, \tilde{q}^{\mu_k}(S_{t+1}, o) - q(S_t, o)$, where $\tilde{q}^{\mu_k}(s,o) \overset{\text{def}}{=} (1 - \beta_o(s))\, q(s,o) + \beta_o(s)\, \mathbb{E}_{\mu_k} q(s,\cdot)$
7:     $c^o_j = 1 - \beta_o(S_j) + \beta_o(S_j)\, \mu_k(o|S_j)$
8:     $\Delta_t = \sum_{i=t}^{D_k - 1} \gamma^{i-t} \Big(\prod_{j=t+1}^{i} c^o_j\Big)\, \delta^{\beta,\mu_k}_i$
9:     $q_{k+1}(S_t, o) \leftarrow q_k(S_t, o) + \alpha_k \Delta_t$
10:  end for
11:  $S_0 \leftarrow S_{D_k}$
12: end for
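The following is a minimal tabular Python sketch (ours, not the authors' implementation) of Algorithm 1's inner loop: given one trajectory obtained by executing the current option under its behavior terminations ζ, it computes the forward-view corrections Δ_t and applies them. The array layouts (q[s, o], mu[s, o], beta[o, s]) are assumptions for illustration.

```python
import numpy as np

# Sketch of Algorithm 1's inner loop (forward-view Q(beta)) for one sampled option
# trajectory. The trajectory already reflects the behavior terminations zeta, so only
# the beta-dependent factor of the trace appears explicitly.
def q_beta_update(q, o, states, rewards, mu, beta, gamma, alpha):
    """Update q[S_t, o] for every state visited while executing option o.

    states  -- S_0, ..., S_{D_k}; rewards[t] = R_{t+1}
    """
    D = len(rewards)
    # delta_t = R_{t+1} + gamma * q_tilde(S_{t+1}, o) - q(S_t, o)
    q_tilde = lambda s: (1 - beta[o, s]) * q[s, o] + beta[o, s] * np.dot(mu[s], q[s])
    deltas = [rewards[t] + gamma * q_tilde(states[t + 1]) - q[states[t], o] for t in range(D)]
    # c_j = 1 - beta_o(S_j) + beta_o(S_j) * mu(o|S_j); the (1 - zeta) factor is implicit in sampling
    c = [1 - beta[o, states[j]] + beta[o, states[j]] * mu[states[j], o] for j in range(D)]

    corrections = []
    for t in range(D):
        delta_t, trace = 0.0, 1.0
        for i in range(t, D):
            delta_t += (gamma ** (i - t)) * trace * deltas[i]   # gamma^{i-t} * prod_{j=t+1..i} c_j * delta_i
            if i + 1 < D:
                trace *= c[i + 1]
        corrections.append((states[t], delta_t))
    for s, d in corrections:   # all Delta_t are computed from q_k before any update is applied
        q[s, o] += alpha * d
```

With β = 0 the TD error above reduces to the on-policy intra-option error for the current option, and with β = 1 it becomes the Tree-Backup-style error of Eq. (14), matching the two extremes discussed in the text.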
4.4 Relationship with intra-option learning

The off-policy-ness discussed so far is different from that of the more familiar off-policy intra-option setting. The intra-option learning algorithm suggests applying the update (11) to all options $o$ consistent with the experience stream $S_1, A_1, \ldots$ (Sutton and Precup 1998). When considering multiple steps, this amounts to introducing importance sampling into Eq. (12). That is, given a behavior option $b$, the trace coefficient $1 - \zeta_o(S_i)$ from Eq. (12) becomes

$$c^{ob}_i = \frac{\pi_o(A_i|S_i)}{\pi_b(A_i|S_i)}\big(1 - \zeta_o(S_i)\big), \qquad (16)$$

and hence, writing $\zeta^o_{t+1}$ for $\zeta_o(S_{t+1})$, the value of option $o$ from Eq. (12) writes:

$$q(s,o) = \mathbb{E}_{\pi_b}\Big[\sum_{t \ge 0} \gamma^t \prod_{i=1}^{t} c^{ob}_i\, \big(R_{t+1} + \gamma\, \zeta^o_{t+1}\, \mathbb{E}_\mu q(S_{t+1}, \cdot)\big)\Big].$$

Now notice that there are two sources of off-policy-ness in these formulas. One is $\pi_b$ vs. $\pi_o$, the contrast between option policies; the other is in the target, ι vs. μ itself, as discussed before. Indeed, if we write the above in the form of Eq. (13), we get a different correction:

$$\Delta q(s,o) = r^{\pi_o}(s) + \gamma\, \mathbb{E}_\mu q(S_1, \cdot) - q(s,o) + \mathbb{E}_{\pi_b}\Big[\sum_{t \ge 1} \gamma^t \prod_{i=1}^{t-1} c^{ob}_i\, \big(1 - \zeta_o(S_t)\big)\, \delta^{ob}_t\Big],$$
$$\delta^{ob}_t = \frac{\pi_o(A_t|S_t)}{\pi_b(A_t|S_t)}\big[R_{t+1} + \gamma\, \mathbb{E}_\mu q(S_{t+1}, \cdot)\big] - \mathbb{E}_\mu q(S_t, \cdot).$$

Since the corrections for the two sources of off-policy-ness are orthogonal, it should be possible to combine them. We leave this for future work.

5 Analysis

In this section we analyze the convergence behavior of the off-policy termination operator in both the policy evaluation and the control setting, and show that learning about shorter target options off-policy is generally asymptotically more efficient than learning about them on-policy. We then consider the relationship between solution quality and option duration in control, and show that shorter options generally yield better solutions.

5.1 Policy evaluation

We prove that the evaluation operator $\mathcal{R}^\mu_\beta$ is contractive around the appropriate fixed point, and that its contraction factor is smaller than that of the respective on-policy operator, provided the behavior options are longer than the target ones.

Theorem 1 (Policy evaluation). The operator $\mathcal{R}^\mu_\beta$ defined in Eq. (15) has a unique fixed point $q^{\mu,\iota}_\beta$, as defined in Eq. (9). Furthermore, if for each state $S_i \in S$ and option $o \in \mathcal{O}$ we have $c^o_i \le (1 - \beta_o(S_i)) + \beta_o(S_i)\,\mu(o|S_i)$, then for any Q-function $q$:

$$\big|\mathcal{R}^\mu_\beta q(s,o) - q^{\mu,\iota}_\beta(s,o)\big| \le \eta(s,o)\, \big\|q - q^{\mu,\iota}_\beta\big\|, \qquad \text{where } \eta(s,o) \overset{\text{def}}{=} 1 - (1-\gamma)\, \mathbb{E}_{\pi_o}\Big[\sum_{t \ge 0} \gamma^t \prod_{i=1}^{t} c^o_i\Big] \le \gamma.$$

Proof. The proof is analogous to that of Theorem 1 from (Munos et al. 2016) and is given in the appendix.

The contraction coefficient η controls the convergence speed of this operator: the smaller η (and the larger $\prod_{i=1}^t c^o_i$), the fewer iterations are needed to converge, but the larger the computational expense when planning, or the variance when learning (Bertsekas and Ioffe 1996; Munos et al. 2016). Since $c^o_i \le 1$ and options terminate eventually, the variance is less significant here, and generally larger $c^o_i$ yield faster convergence. In our case, since a behavior option is assumed (i.e., the $1 - \zeta_o(S_i)$ factor in Eq. (15) is fixed), the additional β-term in $c^o_i$ can only reduce the existing trace. However, since we are interested in learning about a different target, we ought to compare these traces to the setting in which that same target is learnt on-policy. The following corollary derives a sufficient condition for Q(β) to maintain larger traces than its on-policy counterpart.

Corollary 1. If for all states $s$ and options $o$ we have

$$\beta_o(s) \ge 1 - \frac{\mu(o|s)\big(1 - \zeta_o(s)\big)}{\mu(o|s)\big(1 - \zeta_o(s)\big) + \zeta_o(s)},$$

then the iteration (15) converges faster than if $\beta_o$ were used on-policy, in place of $\zeta_o$, in iteration (12). In particular, if $\zeta_o(s) = 0$, any $\beta_o(s) > 0$ satisfies this, irrespective of μ. If μ is deterministic, this holds for all $\beta_o(s) > \zeta_o(s)$ for the chosen $o$.

In general, the intuition here is that it is easier to learn about shorter options from longer option traces than vice versa.
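To see what the bound in Corollary 1 amounts to numerically, here is a small Python sketch (ours) that computes the threshold on β for a few illustrative values of ζ and μ(o|s), and checks that at the threshold the off-policy trace of Eq. (15) matches the corresponding on-policy trace 1 − β.

```python
# Sketch: the Corollary 1 threshold on beta. At the threshold, the Q(beta) trace
# (1 - beta + beta*mu)(1 - zeta) equals the on-policy trace (1 - beta); above it,
# the off-policy trace is larger. Values below are illustrative.
def beta_threshold(zeta, mu_o):
    return zeta / (mu_o * (1.0 - zeta) + zeta) if (zeta or mu_o) else 0.0

for zeta in (0.0, 0.1, 0.5):
    for mu_o in (0.25, 1.0):
        b = beta_threshold(zeta, mu_o)
        off_trace = (1 - b + b * mu_o) * (1 - zeta)
        on_trace = 1 - b
        print(f"zeta={zeta:.2f}  mu(o|s)={mu_o:.2f}  beta >= {b:.3f}  traces: {off_trace:.3f} vs {on_trace:.3f}")
```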
5.2 Control

Let us now formulate the control analogue of Theorem 1. Its proof is a simpler version of that of Theorem 2 from (Munos et al. 2016), and we omit it here.

Theorem 2 (Control). Consider a sequence of policies over options $(\mu_k)_{k \in \mathbb{N}}$ that are greedy w.r.t. the sequence of estimates $(q_k)_{k \in \mathbb{N}}$, and consider the update $q_{k+1} = \mathcal{R}^{\mu_k}_\beta q_k$, where the operator $\mathcal{R}^{\mu_k}_\beta$ is defined by Eq. (15) for the k-th policy $\mu_k$. Let $T^{\mu,\iota}_\beta q = T^{(1-\beta)\iota} q + T^{\beta\mu} q$, and let $q^* = \max_\mu q^{\mu,\iota}_\beta$. Suppose that $T^{\mu_0,\iota}_\beta q_0 \ge q_0$.³ Then for any $k \ge 0$:

$$\|q_{k+1} - q^*\| \le \gamma\, \|q_k - q^*\|.$$

It follows that $q_k \to q^*$ as $k \to \infty$.

³ This can be attained by pessimistic initialization: $q_0 = -r_{\max}/(1-\gamma)$.

We do not give a proof of online convergence at this time, but verify it empirically in our experiments.

5.3 Option duration and solution quality

Our convergence results show that learning about shorter options off-policy is more efficient than on-policy. Here, we motivate why one would want to learn about shorter options in the first place. In particular, we show that, given a set of options, the more they terminate, the better the resulting control solution at the primitive-action resolution. Intuitively, this is because upon termination the learner picks the current best option, whereas during option execution the target value includes the potentially suboptimal current option (Proposition 1). The following theorem (proof in the appendix) formalizes this intuition, and may be of independent interest.

Theorem 3 (The more options terminate, the better the solution). Given a set of options $\mathcal{O}$ and a greedy policy over options μ, let $\beta \ge \zeta$ be two termination conditions for the options in $\mathcal{O}$. Then: $q^{\mu,\iota}_\beta \ge q^{\mu,\iota}_\zeta$.

Note that this result refers to the target solution. During learning, the more decisions there are to make, the more potential there is for error. As such, in reasonably complex tasks we expect the performance to obey a tradeoff on β.

6 Experiments

Finally, let us evaluate our algorithm empirically. We aim to illustrate the following claims: (1) the learning speed improves as ζ gets smaller (behavior options get longer); (2) the control performance improves as β gets larger (target options get shorter); and (3) Q(β) converges with off-policy terminations.

For simplicity, we assume that options terminate deterministically in a set of goal states. We hence reduce β and ζ to single parameters that determine the likelihood of terminating before reaching the goal. The "plain" variant refers to the on-policy intra-option update from Eq. (12). Complete experimental details are provided in Appendix E.

Figure 1: Prediction error on the 19-state chain task. Each variant is an average over 10 seeds. Left: sum error for each ζ–β combination. Q(β) always gets more efficient as ζ decreases (the options get longer). Right: example learning curves. The lines corresponding to β = 0.1 are outside the axes bounds. The shaded region covers the standard deviation.

Figure 2: The 19-state random walk task. The agent starts in the middle. Transitions are deterministic, and the task terminates at either end.

6.1 Policy evaluation

First, we show that Q(β) learns the correct values on the 19-state random walk task (Fig. 2). There are two options: one leads all the way to the left, the other to the right. The policy over options is uniform. The task is to estimate the value function w.r.t. target terminations β. The results are given in Figure 1. Q(β) is able to learn the correct values (up to an irreducible exploration-related error). As expected, Q(β) gets more efficient as the behavior options get longer (ζ gets smaller). The opposite is true for the plain algorithm, since, without interrupting sufficiently, it does not have a chance to update the intermediate states with anything other than the option policy. There is a small inflection point in the performance of the plain algorithm at the on-policy value of ζ.
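For concreteness, the following sketch (ours) wires a simple version of the chain task to the q_beta_update function sketched after Algorithm 1. The reward placement (+1 at the right end only), the particular β and ζ values, and the use of a single absorbing index for both chain ends are assumptions, not the paper's exact settings (see Appendix E.1 for those).

```python
import numpy as np

# Sketch: Q(beta) policy evaluation on a simplified 19-state random walk (Fig. 2),
# reusing the q_beta_update function sketched after Algorithm 1.
nS, nO, gamma, alpha = 19, 2, 0.99, 0.2
TERM = nS                                   # one absorbing index for both chain ends

q = np.zeros((nS + 1, nO))                  # the terminal row stays zero; it is only bootstrapped from
mu = np.full((nS + 1, nO), 0.5)             # uniform policy over options
beta = np.full((nO, nS + 1), 0.5)           # target terminations (illustrative value)
zeta = np.full((nO, nS + 1), 0.1)           # behavior terminations (long behavior options)

def run_option(s, o, rng):
    """Execute option o (0: always left, 1: always right) until zeta-termination or the walk ends."""
    states, rewards = [s], []
    while True:
        nxt = s + (1 if o == 1 else -1)
        if nxt < 0 or nxt >= nS:                       # walked off either end
            rewards.append(1.0 if nxt >= nS else 0.0)  # assumed: +1 at the right end only
            states.append(TERM)
            return states, rewards, True
        rewards.append(0.0)
        states.append(nxt)
        s = nxt
        if rng.random() < zeta[o, s]:                  # behavior termination inside the chain
            return states, rewards, False

rng = np.random.default_rng(0)
for episode in range(2000):
    s, done = nS // 2, False
    while not done:
        o = rng.choice(nO, p=mu[s])
        states, rewards, done = run_option(s, o, rng)
        q_beta_update(q, o, states, rewards, mu, beta, gamma, alpha)
        s = states[-1]
```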
6.2 Control

To demonstrate the benefit of decoupled off- and on-policy terminations, we compare our algorithm with the plain on-policy variant (labelled onpolicy-plain) that uses ζ = β during both learning and evaluation. In order to demonstrate that the learning target plays a role, we also compare it with the plain algorithm that uses β at evaluation only (labelled offpolicy-plain). The behavior termination is ζ = 0, unless specified otherwise.

Figure 3: Control performance. (a) Cliffwalk. Each variant is evaluated on 5 seeds for 10 runs each. Left: average performance per value of β over all seeds. Right: learning curves for the best seeds per variant. Notice how Q(β) is the only variant that escapes the plateau of the suboptimal policy. (b) Pinball. Each variant is evaluated on 20 independent runs. Top row: influence of β and ζ on Q(β); performance improves as ζ gets smaller. Overall, intermediate target β-s are best, but β = 1 reaches slightly better performance at the end of learning. Bottom row: comparison between the variants for select values of β.

Figure 4: Left: the modified Cliffwalk task. Shaded regions are cliffs. Right: the Pinball domain configuration used. The red ball must be moved to the blue hole. Each black diamond indicates an option landmark.

Modified Cliffwalk. We illustrate the benefits of off-policy termination on a modified Cliffwalk example (Fig. 4, left). The agent starts in a position inside an n × n grid, with the goal of reaching a corner where a positive reward is given. The step reward is zero, but there are small cliffs along the border that are not fatal, but induce a penalty. We have four options, one for each cardinal direction, that take the agent all the way to the corresponding border (and cliff). Thus, while these options are able to reach the goal in an optimal number of steps, they are unable to express the optimal policy, which moves only inside the grid so as to not encounter cliffs. To ensure adequate exploration, we consider ε_opt-soft option policies, as well as the usual ε-greedy policy over options during learning (but not during evaluation). The results are plotted in Fig. 3a. Q(β) outperforms the alternatives in all cases, and its performance improves with larger β. Note that it is the only variant able to surpass the value of the suboptimal upper bound of the naive approach.

Pinball. We finally evaluate our algorithm on a variation of the Pinball domain (Konidaris and Barto 2009). Here, a small ball must be maneuvered through a set of obstacles into a hole. Observations consist of four continuous variables describing the ball's x and y positions and velocities. There are five primitive actions: the first four apply a small force to either the x or y velocity, and the final action leaves all velocities unchanged. There is a step penalty and a final reward. We define a set of landmark options (Mann, Mannor, and Precup 2015) that move the ball near a target location on the board. The agent can initiate and terminate each option within some initiation and termination distances from the respective landmark. To illustrate the benefits of terminating suboptimal options, the landmarks were placed such that the paths from the start to the goal via the landmarks are suboptimal (Fig. 4, right). The results are plotted in Fig. 3b. As expected, the performance of Q(β) drastically improves with longer behavior options. The influence of β is more subtle: intermediate β-s perform best overall, but β = 1 reaches slightly better eventual performance. In a comparison across variants, Q(β) outperforms the on-policy variant that learns with ζ = β. However, in this domain, the off-policy variant (that learns with ζ = 0 but evaluates with β) performs comparably to Q(β).
This may be due in part to the use of function approximation, which allows the plain update to generalize meaningfully within the option trajectory, and in part to the noisy nature of Pinball, in which there are many optimal policies of similar values. Since, as we have seen, Q(β) is the only variant to learn accurate values, we expect it to stand out more in settings where the reward scheme is more intricate.

7 Related work

Much of the related work has already been discussed throughout; we mention a few more relevant works below.

Mann, Mankowitz, and Mannor (2014) propose an algorithm for multi-step option interruption that stems from the same motivation of mitigating poor-quality options. In order to avoid the resulting options being too short, they introduce a time-regularization term. Our approach bypasses the need for this through interrupting off-policy and the ability to explicitly specify target terminations. Ryan (2002) also considers the problem of termination improvement, or interrupting options when they are no longer relevant, in a hybrid decision-theoretic and classical planning setting. The intuitions in that work are particularly aligned with the ones here: in particular, the tradeoff between efficiency and optimality is hinted at, and it is even suggested that "persistence [with an option] is a kind of off-policy exploration", at the behavior level. In this work, we formalize and exploit this intuition. Yu and Bertsekas (2012) consider general λ-operators with state-dependent λ. In the case of options, $1 - \zeta_o$ takes the role of a state-option-dependent λ. White (2017) proposes to consider $1 - \zeta_o$ as part of a transition-based discount instead.

8 Discussion

We propose decoupling the behavior and target termination conditions, just as is done with policies in off-policy learning. We formulate an algorithm for learning target terminations off-policy, analyze its expected convergence, and validate it empirically, confirming the theoretical intuition that learning about shorter options from longer options is beneficial both computationally and qualitatively. More generally, we cast learning with options into a common framework with well-studied multi-step off-policy temporal difference learning, which allows us to carry over existing results with ease.

Learning longer options from shorter options. We have assumed here that the options are given, but may not express the optimal policy well. This scenario applies when the options describe simple rules of thumb, or are transferred from a different task. If the options are not given, but learnt end-to-end, our wish typically is to distill meaningful behavior into them. Instead, however, the result often reduces to degenerate options (Bacon, Harb, and Precup 2017; Mann, Mankowitz, and Mannor 2014). Being able to impose longer durations on the target off-policy may mitigate this, though it should be noted that our convergence results suggest that learning may not be as efficient then.

Action-level importance sampling. The option-policy term in the trace is indifferent to the action choice. Thus, if $\mu(o|S_i)$ is small, $c^o_i$ will be small, even if the taken action is consistent with the option policy $\pi_o$. It would be interesting to replace this term with an importance-sampling ratio at the primitive-action level, such as $\frac{\kappa(A_i|S_i)}{\pi_o(A_i|S_i)}$, which corresponds to another multi-step off-policy algorithm, Retrace(λ) (Munos et al. 2016). Another direction towards this goal is to incorporate the intra-option correction from Eq. (16).
Online convergence. Proving online convergence of Q(β) remains an open problem. Supported by its reliable empirical behavior, we hypothesize that it holds under reasonable conditions, and plan to investigate it further in the future.

Future work. A natural direction for future work is to learn many β-s in parallel, as is done in classical off-policy learning (Sutton et al. 2011), or to learn a new kind⁴ of universal option model using the ideas from (Schaul et al. 2015). It is also worthwhile to extend the convergence results to approximate state spaces, perhaps similarly to (Touati et al. 2017).

⁴ I.e., not one indifferent to the reward, as in (Szepesvári et al. 2014), but one indifferent to the termination scheme.

References

Asis, K. D.; Hernandez-Garcia, J. F.; Holland, G. Z.; and Sutton, R. S. 2017. Multi-step reinforcement learning: A unifying algorithm. CoRR abs/1703.01327.
Bacon, P.-L., and Precup, D. 2016. A matrix splitting perspective on planning with options. CoRR abs/1612.00916.
Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Proceedings of AAAI, 1726-1734.
Bellman, R. 1957. Dynamic Programming. Princeton, NJ, USA: Princeton University Press.
Bertsekas, D. P., and Ioffe, S. 1996. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, MIT.
Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.
Harutyunyan, A.; Bellemare, M. G.; Stepleton, T.; and Munos, R. 2016. Q(λ) with off-policy corrections. In Proceedings of the Conference on Algorithmic Learning Theory, 305-320.
Kearns, M. J., and Singh, S. P. 2000. Bias-variance error bounds for temporal difference updates. In COLT, 142-147.
Konidaris, G., and Barto, A. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, 1015-1023.
Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675-3683.
Mahmood, A. R.; Yu, H.; and Sutton, R. S. 2017. Multi-step off-policy learning without importance sampling ratios. CoRR abs/1702.03006.
Mann, T.; Mankowitz, D.; and Mannor, S. 2014. Time-regularized interrupting options (TRIO). In Proceedings of the International Conference on Machine Learning, 1350-1358.
Mann, T. A.; Mannor, S.; and Precup, D. 2015. Approximate value iteration with temporally extended actions. JAIR 53:375-438.
Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 1046-1054.
Peng, J., and Williams, R. J. 1996. Incremental multi-step Q-learning. Machine Learning 22(1-3):283-290.
Precup, D.; Sutton, R. S.; and Singh, S. 1998. Theoretical results on reinforcement learning with temporally abstract options. In Proceedings of the European Conference on Machine Learning, 382-393.
Precup, D.; Sutton, R. S.; and Singh, S. 2000. Eligibility traces for off-policy policy evaluation. In Proceedings of the International Conference on Machine Learning, 759-766.
Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition.
Ryan, M. R. K. 2002. Using abstract models of behaviours to automatically generate reinforcement learning hierarchies. In Proceedings of the International Conference on Machine Learning, 522-529.
Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal value function approximators. In Proceedings of the International Conference on Machine Learning (ICML-15), 1312-1320.
Sutton, R. S., and Barto, A. G. 2017. Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
Sutton, R. S., and Precup, D. 1998. Intra-option learning about temporally abstract actions. In Proceedings of the International Conference on Machine Learning, 556-564.
Sutton, R.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, AAMAS, 761-768.
Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181-211.
Sutton, R. S. 1995. TD models: Modeling the world at a mixture of time scales. In Proceedings of the Twelfth International Conference on Machine Learning, 531-539.
Szepesvári, C.; Sutton, R. S.; Modayil, J.; Bhatnagar, S.; et al. 2014. Universal option models. In Advances in Neural Information Processing Systems, 990-998.
Tessler, C.; Givony, S.; Zahavy, T.; Mankowitz, D. J.; and Mannor, S. 2017. A deep hierarchical approach to lifelong learning in Minecraft. In Proceedings of AAAI, 1553-1561.
Touati, A.; Bacon, P.-L.; Precup, D.; and Vincent, P. 2017. Convergent tree-backup and retrace with function approximation. CoRR abs/1705.09322.
Van Seijen, H.; Van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of Expected SARSA. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177-184.
Vezhnevets, A.; Mnih, V.; Osindero, S.; Graves, A.; Vinyals, O.; Agapiou, J.; and Kavukcuoglu, K. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems 29, 3486-3494.
White, M. 2017. Unifying task specification in reinforcement learning. In Proceedings of the International Conference on Machine Learning, volume 70, 3742-3750.
Yu, H., and Bertsekas, D. P. 2012. Weighted Bellman equations and their applications in approximate dynamic programming. Technical Report LIDS-P-2876, MIT.

A Proof of Proposition 1

Proof. From Eq. (8), writing $q$ for $q^{\mu,\iota}_\beta$ to reduce clutter:

$$q = (I - \gamma P^{(1-\beta)\iota})^{-1}(r^\pi + \gamma P^{\beta\mu} q)$$
$$\iff q - \gamma P^{(1-\beta)\iota} q = r^\pi + \gamma P^{\beta\mu} q$$
$$\iff q = r^\pi + \gamma P^{\beta\mu} q + \gamma P^{(1-\beta)\iota} q = r^\pi + \gamma P^{\beta\mu} q + \gamma P^{\mathbf{1}\iota} q - \gamma P^{\beta\iota} q = r^\pi + \gamma (P^{\beta\mu} - P^{\beta\iota}) q + \gamma P^{\mathbf{1}\iota} q.$$

Solving for $q$ yields the result.

B Proof of Theorem 1

Proof. Write $\mathcal{R}$ for $\mathcal{R}^\mu_\beta$. The fact that the fixed point is the desired one is clear from Eq. (15) and Proposition 1. Recall that:

$$c^o_i = \big(1 - \beta_o(S_i) + \beta_o(S_i)\,\mu(o|S_i)\big)\big(1 - \zeta_o(S_i)\big), \qquad \tilde{q}(s,o) = (1 - \beta_o(s))\, q(s,o) + \beta_o(s)\, \mathbb{E}_\mu q(s,\cdot).$$

Now, let us derive the contraction result. Eq. (15) can be written:

$$\mathcal{R} q(s,o) = \sum_{t \ge 0} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t} c^o_i\; T^{c^o}_t\Big], \qquad T^{c^o}_t \overset{\text{def}}{=} r(S_t, A_t) + \gamma\big(\tilde{q}(S_{t+1}, o) - c^o_{t+1}\, q(S_{t+1}, o)\big).$$

Let $\Delta q(s,o) \overset{\text{def}}{=} q(s,o) - q^{\mu,\iota}_\beta(s,o)$, and let $\Delta\tilde{q}(s,o)$ be defined analogously. Since $\mathcal{R} q^{\mu,\iota}_\beta = q^{\mu,\iota}_\beta$, and after shifting the sum index forward, we have:

$$\mathcal{R} q(s,o) - q^{\mu,\iota}_\beta(s,o) = \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t-1} c^o_i\; \Delta T^{c^o}_t\Big],$$
$$\Delta T^{c^o}_t = (1 - \beta^o_t)\,\Delta q(S_t, o) + \beta^o_t\, \mathbb{E}_\mu \Delta q(S_t, \cdot) - c^o_t\, \Delta q(S_t, o) = (1 - \beta^o_t - c^o_t)\,\Delta q(S_t, o) + \beta^o_t \sum_{b \in \mathcal{O}} \mu(b|S_t)\,\Delta q(S_t, b) = \sum_{b \in \mathcal{O}} \Big(\mathbb{I}_{b=o}\,(1 - \beta^b_t - c^b_t) + \beta^o_t\, \mu(b|S_t)\Big)\,\Delta q(S_t, b),$$

where we write $\mathbb{I}_{b=o}$ for the indicator function.
Since $c^o_t \le (1 - \beta^o_t) + \beta^o_t\,\mu(o|S_t)$ and $\beta^o_t\,\mu(b|S_t) \ge 0$ for $b \ne o$, we have a linear combination of the $\Delta q(S_t, b)$ weighted by non-negative coefficients $w_{y,b}$, defined as:

$$w_{y,b} = \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t-1} c^o_i\, \Big(\mathbb{I}_{b=o}(1 - \beta^b_t - c^b_t) + \beta^o_t\,\mu(b|S_t)\Big)\, \mathbb{I}_{S_t = y}\Big],$$

whose sum is:

$$\sum_{y,b} w_{y,b} = \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t-1} c^o_i \sum_{b \in \mathcal{O}}\Big(\mathbb{I}_{b=o}(1 - \beta^b_t - c^b_t) + \beta^o_t\,\mu(b|S_t)\Big)\Big] = \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t-1} c^o_i\, \Big((1 - \beta^o_t - c^o_t) + \beta^o_t \sum_{b \in \mathcal{O}} \mu(b|S_t)\Big)\Big] = \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t-1} c^o_i\, \big((1 - \beta^o_t - c^o_t) + \beta^o_t\big)\Big] = \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{\pi_o}\Big[\prod_{i=1}^{t-1} c^o_i\, (1 - c^o_t)\Big] = \gamma C - (C - 1) \le \gamma,$$

where $C = \mathbb{E}_{\pi_o}\big[\sum_{t \ge 0} \gamma^t \prod_{i=1}^{t} c^o_i\big]$, and the last inequality is due to $C \ge 1$. It follows that $\mathcal{R}$ is a γ-contraction around $q^{\mu,\iota}_\beta$.

C Proof of Corollary 1

Proof. We would like the off-policy trace $c^o_s = (1 - \beta^o_s + \beta^o_s\,\mu(o|s))(1 - \zeta^o_s)$ to be larger than the equivalent on-policy trace $1 - \beta^o_s$:

$$1 - \beta^o_s \le \big(1 - \beta^o_s + \beta^o_s\,\mu(o|s)\big)\big(1 - \zeta^o_s\big)$$
$$\iff (1 - \beta^o_s)\,\zeta^o_s \le \beta^o_s\,\mu(o|s)\big(1 - \zeta^o_s\big)$$
$$\iff \zeta^o_s \le \beta^o_s\big(\mu(o|s)(1 - \zeta^o_s) + \zeta^o_s\big)$$
$$\iff \beta^o_s \ge \frac{\zeta^o_s}{\mu(o|s)(1 - \zeta^o_s) + \zeta^o_s} = 1 - \frac{\mu(o|s)(1 - \zeta^o_s)}{\mu(o|s)(1 - \zeta^o_s) + \zeta^o_s}.$$

Thus, if $\beta^o$ obeys this bound, it is more beneficial, from the point of view of convergence speed, to learn it off- rather than on-policy.

D Proof of Theorem 3

Proof. First notice that Proposition 1 rewrites:

$$q = \big(I - \gamma(P^{\beta\mu} - P^{\beta\iota}) - \gamma P^{\mathbf{1}\iota}\big)^{-1} r^\pi = \big(I - \gamma P^{\beta\mu} - \gamma P^{(1-\beta)\iota}\big)^{-1} r^\pi = \sum_{t \ge 0} \gamma^t \big(P^{\beta\mu} + P^{(1-\beta)\iota}\big)^t r^\pi. \qquad (17)$$

Let $\Delta^\beta_\zeta \overset{\text{def}}{=} q^{\mu,\iota}_\beta - q^{\mu,\iota}_\zeta$. From (17), we have:

$$\Delta^\beta_\zeta = \sum_{t \ge 1} \gamma^t \Big(\big(P^{\beta\mu} + P^{(1-\beta)\iota}\big)^t - \big(P^{\zeta\mu} + P^{(1-\zeta)\iota}\big)^t\Big) r^\pi = \big(P^{\beta\mu} + P^{(1-\beta)\iota} - P^{\zeta\mu} - P^{(1-\zeta)\iota}\big) \sum_{t \ge 1} \gamma^t A_t\, r^\pi,$$

where $A_t = \big(P^{\beta\mu} + P^{(1-\beta)\iota} - P^{\zeta\mu} - P^{(1-\zeta)\iota}\big)^{-1}\Big(\big(P^{\beta\mu} + P^{(1-\beta)\iota}\big)^t - \big(P^{\zeta\mu} + P^{(1-\zeta)\iota}\big)^t\Big)$. Let $f = \sum_{t \ge 1} \gamma^t A_t\, r^\pi$, which is a Q-function. Then simple manipulations yield:

$$\Delta^\beta_\zeta = q^{\mu,\iota}_\beta - q^{\mu,\iota}_\zeta = \big(P^{(\beta-\zeta)\mu} + P^{((1-\beta)-(1-\zeta))\iota}\big) f = \big(P^{(\beta-\zeta)\mu} - P^{(\beta-\zeta)\iota}\big) f = \big(P^{\mathbf{1}\mu} - P^{\mathbf{1}\iota}\big)(\beta - \zeta) f \ge \mathbf{0},$$

since the operators $P^{\mathbf{1}\mu}$ and $P^{\mathbf{1}\iota}$ are monotone, $\beta \ge \zeta$, and $P^{\mathbf{1}\mu} f = \max_\nu P^{\mathbf{1}\nu} f \ge P^{\mathbf{1}\iota} f$ for any Q-function $f$.

E Experimental details

The setting is as follows: an option $o$ is picked according to μ, and a trajectory $s, R_1, S_2, R_2, \ldots, R_{D-1}, S_D$ is generated according to $\pi_o$ and $\zeta_o$. Then, for each state $S_i$ in the trajectory, $q(S_i, o)$ is updated according to the considered algorithm.

E.1 19-chain

The termination conditions β and ζ are evaluated over the range {0.1, 0.5, 0.8, 1}, with the first value kept positive to ensure adequate state visitation. The step-sizes are set via a linear search over α ∈ {0.1, 0.2, 0.3, 0.4}. The discount factor is γ = 0.99.

E.2 Modified Cliffwalk

The reward scheme is $r_{\text{goal}} = 10$ and $r_{\text{cliff}} = -2$, and the grid size is n = 10. We set ε = 0.1 and ε_opt = 0.3, and determine the step-size for each variant from a linear search over α ∈ {0.1, 0.2, 0.3, 0.4}. The behavior termination is ζ = 0, and the target β is evaluated over the range {0, 0.5, 0.8, 1}. The discount factor is γ = 0.99.

E.3 Pinball

The reward is −1 on every step, except the final step, which receives a reward of 10000. We use an initiation distance of 0.3 and a termination distance of 0.03. The state-option value function was approximated using tile coding with 16 tilings of 10 × 10. All algorithms used a learning rate α = 0.01, a discount γ = 0.99, and exploration rates ε = 0.05 and ε_opt = 0.01 during learning. The target β is evaluated over the range {0, 0.3, 0.5, 0.8, 1}.