Average-Reward Learning and Planning with Options

Yi Wan, Abhishek Naik, Richard S. Sutton
{wan6, anaik1, rsutton}@ualberta.ca
University of Alberta, Amii, and DeepMind
Edmonton, Canada

Abstract

We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.

1 Introduction

Reinforcement learning (RL) is a formalism of trial-and-error learning in which an agent interacts with an environment to learn a behavioral strategy that maximizes a notion of reward. In many problems of interest, a learning agent may need to predict the consequences of its actions over multiple levels of temporal abstraction. The options framework provides a way of defining courses of action over extended time scales, and of learning, planning, and representing knowledge with them (Sutton, Precup, & Singh 1999, Sutton & Barto 2018).

The options framework was originally proposed within the discounted formulation of RL, in which the agent tries to maximize the expected discounted return from each state. We extend the options framework from the discounted formulation to the average-reward formulation, in which the goal is to find a policy that maximizes the rate of reward. The average-reward formulation is of interest because, once genuine function approximation is introduced, there is no longer a well-defined discounted formulation of the continuing RL problem (see Sutton & Barto 2018, Section 10.4; Naik et al. 2019). If we want to take advantage of options in acting, learning, and planning in the continuing (non-episodic) RL setting, then we must extend options to the average-reward formulation.

Given a Markov decision process (MDP) and a fixed set of options, learning and planning algorithms can be divided into two classes. The first class consists of inter-option algorithms, which enable an agent to learn or plan with options instead of primitive actions. Given an option, the learning and planning updates for this option in these algorithms occur only after the option's actual or simulated execution. Algorithms in this class are also called semi-MDP (SMDP) algorithms because, given an MDP, the decision process that selects among a set of options, executing each to termination, is an SMDP (Sutton et al. 1999). The second class consists of algorithms in which learning or planning updates occur after each state-action transition within an option's execution; these are called intra-option algorithms. From a single state-action transition, these algorithms can learn or plan to improve the values or policies of all options that may generate that transition, and are therefore potentially more efficient than SMDP algorithms. Several inter-option (SMDP) learning algorithms have been proposed for the average-reward formulation (see, e.g., Das et al. 1999, Gosavi 2004, Vien & Chung 2008).
To the best of our knowledge, Gosavi's (2004) algorithm is the only proven-convergent off-policy inter-option learning algorithm. However, its convergence proof requires the underlying SMDP to have a special state that is recurrent under all stationary policies. Recently, Wan, Naik, and Sutton (2021) proposed Differential Q-learning, an off-policy control learning algorithm for average-reward MDPs that is proved to converge without requiring any special state. We extend this algorithm and its convergence proof from primitive actions to options and highlight some challenges we faced in developing inter-option Differential Q-learning. For planning, we propose inter-option Differential Q-planning, which is the first convergent incremental (sample-based) planning algorithm. The existing proven-convergent inter-option planning algorithms (e.g., Schweitzer 1971, Puterman 1994, Li & Cao 2010) are not incremental because they perform a full sweep over states for each planning step.

Additionally, the literature lacks intra-option learning and planning algorithms within the average-reward formulation, for both values and models. We fill this gap by proposing such algorithms in the average-reward formulation and provide their convergence results. These algorithms are stochastic-approximation algorithms solving the average-reward intra-option value and model equations, which are also introduced in this paper for the first time.

Sutton et al. (1999) also introduced an algorithm to improve an agent's behavior given estimated option values. Instead of letting an option execute to termination, this algorithm involves potentially interrupting an option's execution to check if starting a new option might yield a better expected outcome. If so, then the currently-executing option is terminated, and the new option is executed. Our final contribution involves extending this notion of an interruption algorithm from the discounted to the average-reward formulation.

2 Problem Setting

We formalize an agent's interaction with its environment by a finite Markov decision process (MDP) $M$ and a finite set of options $\mathcal{O}$. The MDP is defined by the tuple $M \doteq (\mathcal{S}, \mathcal{A}, \mathcal{R}, p)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{R}$ is a set of rewards, and $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the dynamics of the environment. Each option $o \in \mathcal{O}$ has two components: the option's policy $\pi_o : \mathcal{A} \times \mathcal{S} \to [0, 1]$ and the option's termination probabilities $\beta_o : \mathcal{S} \to [0, 1]$. For simplicity, for any $s \in \mathcal{S}$ and $o \in \mathcal{O}$, we use $\pi(a \mid s, o)$ to denote $\pi_o(a, s)$ and $\beta(s, o)$ to denote $\beta_o(s)$. Sutton et al.'s (1999) options additionally have an initiation set that consists of the states at which the option can be initiated. To simplify the presentation in this paper, we allow all options to be initiated in all states of the state space; the algorithms and theoretical results can easily be extended to incorporate initiation from specific states.

In the continuing (non-episodic) setting, the agent-environment interactions go on forever without any resets. If an option $o$ is initiated at time $t$ in state $S_t$, then the action $A_t$ is chosen according to the option's policy $\pi(\cdot \mid S_t, o)$. The agent then observes the next state $S_{t+1}$ and reward $R_{t+1}$ according to $p$. The option terminates at $S_{t+1}$ with probability $\beta(S_{t+1}, o)$ or continues with action $A_{t+1}$ chosen according to $\pi(\cdot \mid S_{t+1}, o)$. It then possibly terminates in $S_{t+2}$ according to $\beta(S_{t+2}, o)$, and so on.
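To make these execution semantics concrete, here is a minimal tabular sketch of an option and of running it to termination. It is an illustration of the setting above, not part of the paper's algorithms; the `env.step` helper is a hypothetical environment interface that samples $(s', r)$ from $p$.

```python
import numpy as np

class Option:
    """A tabular option: a policy pi[s, a] and termination probabilities beta[s]."""
    def __init__(self, pi, beta):
        self.pi = pi      # pi[s, a]: probability of taking action a in state s
        self.beta = beta  # beta[s]:  probability of terminating in state s

def execute_option(env, option, s, rng):
    """Run `option` from state s until it terminates.

    Returns the cumulative reward, the terminating state, and the option's
    length (the quantities written R-hat, S-hat', and L-hat below).
    """
    cumulative_reward, length = 0, 0.0
    while True:
        a = rng.choice(len(option.pi[s]), p=option.pi[s])  # follow the option's policy
        s, r = env.step(s, a)                              # sample (s', r) from p
        cumulative_reward += r
        length += 1
        if rng.random() < option.beta[s]:                  # terminate at the new state?
            return cumulative_reward, s, length
```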
At an option termination, one way to govern an agent's behavior is to choose a new option according to a hierarchical policy $\mu_b : \mathcal{S} \times \mathcal{O} \to [0, 1]$. In this case, when an option terminates at time $t$, the next option is selected stochastically according to $\mu_b(\cdot \mid S_t)$. The option initiates at $S_t$ and terminates at $S_{t+K}$, where $K$ is a random variable denoting the number of time steps the option executed. At $S_{t+K}$, a new option is again chosen according to $\mu_b(\cdot \mid S_{t+K})$, and so on. We use the notation $O_t$ to denote whatever option is being executed at time step $t$; note that $O_t$ remains the same for as many steps as the option executes. Also note that actions are a special case of options: every action $a$ is an option $o$ that terminates after exactly one step ($\beta(s, o) = 1, \forall s$) and whose policy is to pick $a$ in every state ($\pi(a \mid s, o) = 1, \forall s$).

Let $T_n$ denote the time step at which the $(n{-}1)$th option terminates and the $n$th option is chosen. Denote the $n$th option by $\hat{O}_n \doteq O_{T_n}$, its starting state by $\hat{S}_n \doteq S_{T_n}$, the cumulative reward during its execution by $\hat{R}_n \doteq \sum_{t=T_n+1}^{T_{n+1}} R_t$, the state it terminates in by $\hat{S}_{n+1} \doteq S_{T_{n+1}}$, and its length by $\hat{L}_n \doteq T_{n+1} - T_n$. Note that every option's length is a random variable taking values among the positive integers. The option's transition probabilities are then defined as $\hat{p}(s', r, l \mid s, o) \doteq \Pr(\hat{S}_{n+1} = s', \hat{R}_n = r, \hat{L}_n = l \mid \hat{S}_n = s, \hat{O}_n = o)$. Throughout the paper, we assume that the expected execution time of every option starting from any state is finite.

An MDP $M$ and a set of options $\mathcal{O}$ result in an SMDP $\hat{M} = (\mathcal{S}, \mathcal{O}, \hat{\mathcal{L}}, \hat{\mathcal{R}}, \hat{p})$, where $\hat{\mathcal{L}}$ is the set of all possible option lengths and $\hat{\mathcal{R}}$ is the set of all possible cumulative rewards of options. For this SMDP, the reward rate of a policy $\mu$ given a starting state $s$ and option $o$ can be defined as $r_C(\mu)(s, o) \doteq \lim_{t \to \infty} \mathbb{E}_\mu\big[\sum_{i=1}^{t} R_i \mid S_0 = s, O_0 = o\big] / t$. Alternatively, at the level of option transitions, $r(\mu)(s, o) \doteq \lim_{n \to \infty} \mathbb{E}_\mu\big[\sum_{i=0}^{n} \hat{R}_i \mid \hat{S}_0 = s, \hat{O}_0 = o\big] \big/ \mathbb{E}_\mu\big[\sum_{i=0}^{n} \hat{L}_i \mid \hat{S}_0 = s, \hat{O}_0 = o\big]$. Both limits exist and are equivalent (Puterman's (1994) Propositions 11.4.1 and 11.4.7) under the following assumption:

Assumption 1. The Markov chain induced by any stationary policy in the MDP $(\mathcal{S}, \mathcal{O}, \hat{\mathcal{R}}, p')$ is unichain, where $p'(s', r \mid s, o) \doteq \sum_{l} \hat{p}(s', r, l \mid s, o)$ for all $s', r, s, o$.

Note: In a unichain MDP, there can be states that occur only a finite number of times in a single stream of experience. In other words, these states are transient under all stationary policies. Thus, their values cannot be correctly estimated by any learning algorithm. However, this inaccurate value estimation is not a problem because the decisions made in these transient states do not affect the reward rate. We refer to the non-transient states as recurrent states and denote their set by $\mathcal{S}^r \subseteq \mathcal{S}$.

Under Assumption 1, the reward rate does not depend on the starting state-option pair and hence we can denote it by just $r(\mu)$. The optimal reward rate can then be defined as $r^* \doteq \sup_{\mu \in \Pi} r(\mu)$, where $\Pi$ denotes the set of all policies. The differential option-value function of a policy $\mu$ is defined for all $s \in \mathcal{S}, o \in \mathcal{O}$ as $q_\mu(s, o) \doteq \mathbb{E}_\mu\big[R_{t+1} - r(\mu) + R_{t+2} - r(\mu) + \cdots \mid S_t = s, O_t = o\big]$. The evaluation and optimality equations for SMDPs, as given by Puterman (1994), are:

$$q(s, o) = \sum_{s', r, l} \hat{p}(s', r, l \mid s, o) \Big[ r - \bar{r}\, l + \sum_{o'} \mu(o' \mid s')\, q(s', o') \Big], \qquad (1)$$

$$q(s, o) = \sum_{s', r, l} \hat{p}(s', r, l \mid s, o) \Big[ r - \bar{r}\, l + \max_{o'} q(s', o') \Big], \qquad (2)$$

where $q$ and $\bar{r}$ denote estimates of the option-value function and the reward rate respectively.
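As a sanity check, note that when every option is a primitive action, so that $\hat{L}_n \equiv 1$ and $\hat{p}(s', r, 1 \mid s, a) = p(s', r \mid s, a)$, equation (2) reduces to the familiar average-reward Bellman optimality equation for MDPs:

$$q(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r - \bar{r} + \max_{a'} q(s', a') \Big].$$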
If Assumption 1 holds, the SMDP Bellman equations have a unique solution for $\bar{r}$ ($r(\mu)$ for evaluation and $r^*$ for control) and a solution for $q$ that is unique only up to an additive constant (Schweitzer & Federgruen 1978). Given an MDP and a set of options, the goal of the prediction problem is, for a given policy $\mu$, to find the reward rate $r(\mu)$ and the differential value function (possibly with some constant offset). The goal of the control problem is to find a policy that achieves the optimal reward rate $r^*$.

3 Inter-Option Learning and Planning Algorithms

In this section, we present our inter-option learning and planning, prediction and control algorithms, which extend Wan et al.'s (2021) differential learning and planning algorithms for average-reward MDPs from actions to options. We begin with the control learning algorithm and then move on to the prediction and planning algorithms.

Consider Wan et al.'s (2021) control learning algorithm, Differential Q-learning:

$$Q_{t+1}(S_t, A_t) \doteq Q_t(S_t, A_t) + \alpha_t \delta_t, \qquad \bar{R}_{t+1} \doteq \bar{R}_t + \eta \alpha_t \delta_t,$$

where $Q$ is a vector of size $|\mathcal{S} \times \mathcal{A}|$ that approximates a solution of $q$ in the Bellman optimality equation for MDPs, $\bar{R}$ is a scalar estimate of the optimal reward rate, $\alpha_t$ is a step-size sequence, $\eta$ is a positive constant, and $\delta_t$ is the temporal-difference (TD) error: $\delta_t \doteq R_{t+1} - \bar{R}_t + \max_a Q_t(S_{t+1}, a) - Q_t(S_t, A_t)$.

The most straightforward inter-option extension of Differential Q-learning is:

$$Q_{n+1}(\hat{S}_n, \hat{O}_n) \doteq Q_n(\hat{S}_n, \hat{O}_n) + \alpha_n \delta_n, \qquad (3)$$
$$\bar{R}_{n+1} \doteq \bar{R}_n + \eta \alpha_n \delta_n, \qquad (4)$$

where $Q$ is a vector of size $|\mathcal{S} \times \mathcal{O}|$ that approximates a solution of $q$ in (2), $\bar{R}$ is a scalar estimate of $r^*$, $\alpha_n$ is a step-size sequence, and $\delta_n$ is the TD error:

$$\delta_n \doteq \hat{R}_n - \hat{L}_n \bar{R}_n + \max_{o'} Q_n(\hat{S}_{n+1}, o') - Q_n(\hat{S}_n, \hat{O}_n). \qquad (5)$$

Such an algorithm is prone to instability because the sampled option length $\hat{L}_n$ can be quite large, and any error in the reward-rate estimate $\bar{R}_n$ gets multiplied by the potentially-large option length. Using small step sizes might make the updates relatively stable, but at the cost of slowing down learning for options of shorter lengths. This could make the choice of step size quite critical, especially when the range of the options' lengths is large and unknown.

Alternatively, inspired by Schweitzer (1971), we propose scaling the updates by the estimated length of the option being executed:

$$Q_{n+1}(\hat{S}_n, \hat{O}_n) \doteq Q_n(\hat{S}_n, \hat{O}_n) + \alpha_n \delta_n / L_n(\hat{S}_n, \hat{O}_n), \qquad (6)$$
$$\bar{R}_{n+1} \doteq \bar{R}_n + \eta \alpha_n \delta_n / L_n(\hat{S}_n, \hat{O}_n), \qquad (7)$$

where $\alpha_n$ is a step-size sequence and $L_n(\cdot, \cdot)$ comes from an additional vector of estimates $L : \mathcal{S} \times \mathcal{O} \to \mathbb{R}$ that approximates the expected lengths of state-option pairs, updated from experience by:

$$L_{n+1}(\hat{S}_n, \hat{O}_n) \doteq L_n(\hat{S}_n, \hat{O}_n) + \beta_n \big(\hat{L}_n - L_n(\hat{S}_n, \hat{O}_n)\big), \qquad (8)$$

where $\beta_n$ is another step-size sequence. The TD error $\delta_n$ in (6) and (7) is

$$\delta_n \doteq \hat{R}_n - L_n(\hat{S}_n, \hat{O}_n) \bar{R}_n + \max_{o'} Q_n(\hat{S}_{n+1}, o') - Q_n(\hat{S}_n, \hat{O}_n), \qquad (9)$$

which differs from (5) in that the estimated expected option length $L_n(\hat{S}_n, \hat{O}_n)$ is used instead of the sampled option length $\hat{L}_n$. Equations (6-9) make up our inter-option Differential Q-learning algorithm.

Similarly, our prediction learning algorithm, called inter-option Differential Q-evaluation, also has update rules (6-8), but with the TD error:

$$\delta_n \doteq \hat{R}_n - L_n(\hat{S}_n, \hat{O}_n) \bar{R}_n + \sum_{o'} \mu(o' \mid \hat{S}_{n+1}) Q_n(\hat{S}_{n+1}, o') - Q_n(\hat{S}_n, \hat{O}_n). \qquad (10)$$
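The following is a minimal tabular sketch of the inter-option Differential Q-learning updates (6)-(9). It assumes an ε-greedy behavior policy over options, an `env` object with `num_states` and `reset`, and an `execute_option` helper like the one sketched in Section 2; these interface names are illustrative assumptions, not the paper's pseudocode.

```python
import numpy as np

def inter_option_differential_q_learning(env, options, num_decisions,
                                          alpha=0.125, beta=0.5, eta=0.1,
                                          epsilon=0.1, seed=0):
    """Sketch of updates (6)-(9). `execute_option(env, option, s, rng)` returns
    the cumulative reward, terminating state, and sampled length of an option."""
    rng = np.random.default_rng(seed)
    n_s, n_o = env.num_states, len(options)
    Q = np.zeros((n_s, n_o))   # option-value estimates
    L = np.ones((n_s, n_o))    # expected-option-length estimates (initialized to 1)
    R_bar = 0.0                # reward-rate estimate
    s = env.reset()
    for _ in range(num_decisions):
        # epsilon-greedy selection among options (behavior policy)
        o = rng.integers(n_o) if rng.random() < epsilon else int(np.argmax(Q[s]))
        R_hat, s_next, L_hat = execute_option(env, options[o], s, rng)

        # TD error (9): uses the *estimated* expected length L[s, o], not L_hat
        delta = R_hat - L[s, o] * R_bar + np.max(Q[s_next]) - Q[s, o]
        Q[s, o] += alpha * delta / L[s, o]       # update (6)
        R_bar   += eta * alpha * delta / L[s, o] # update (7)
        L[s, o] += beta * (L_hat - L[s, o])      # update (8)
        s = s_next
    return Q, R_bar
```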
Theorem 1 (Convergence of inter-option algorithms; informal). If Assumption 1 holds, the step sizes are decreased appropriately, all state-option pairs $(s, o) \in \mathcal{S} \times \mathcal{O}$ are visited an infinite number of times, and the relative visitation frequency between any two pairs is finite, then:

1. inter-option Differential Q-learning (6-9) converges almost surely: $\bar{R}_n$ to $r^*$, $Q_n(s, o)$ to a solution of $q(s, o)$ in (2) for all $s \in \mathcal{S}^r, o \in \mathcal{O}$, and $r(\mu_n)$ to $r^*$, where $\mu_n$ is a greedy policy w.r.t. $Q_n$;

2. inter-option Differential Q-evaluation (6-8, 10) converges almost surely: $\bar{R}_n$ to $r(\mu)$ and $Q_n(s, o)$ to a solution of $q(s, o)$ in (1) for all $s \in \mathcal{S}^r, o \in \mathcal{O}$.

Figure 1: A continuing variant of the Four-Room domain where the task is to repeatedly go from the yellow start state to one of the three green goal states. There is one goal state per experiment, chosen to demonstrate particular aspects of the proposed algorithms. Also shown is an option policy to go to the upper hallway cell; more details in-text.

The convergence proofs for the inter-option (as well as the subsequent intra-option) algorithms are based on a result that generalizes Wan et al.'s (2021) and Abounadi et al.'s (2001) proof techniques from primitive actions to options. We present this result in Appendix A.1, and the formal theorem statements and proofs in Appendix A.2.

Remark: It is important for the scaling factor in the algorithm to be the expected option length $L_n(\hat{S}_n, \hat{O}_n)$ and not the sampled option length $\hat{L}_n$. Scaling the updates by the expected option lengths ensures that the fixed points of the updates are the solutions of (2); this is not guaranteed to be true when using the sampled option length. We discuss this in more detail in Appendix C.1.

The inter-option planning algorithms for prediction and control are similar to the learning algorithms except that they use simulated experience generated by a (given or learned) model instead of real experience. In addition, they have only two update rules, (6) and (7), not (8), because the model provides the expected length of a given option from a given state (see Section 5 for a complete specification of option models). The planning algorithms and their convergence results are presented in Appendix A.2.

Empirical Evaluation. We tested our inter-option Differential Q-learning with Gosavi's (2004) algorithm as a baseline in a variant of Sutton et al.'s (1999) Four-Room domain (shown in Figure 1). The agent starts in the yellow cell. The goal states are indicated by green cells. Every experiment in this paper uses only one of the green cells as a goal state; the other two are treated as empty cells. There are four primitive actions: moving up, down, left, and right. The agent receives a reward of +1 when it moves into the goal cell, and 0 otherwise. In addition to the four primitive actions, the agent has eight options that take it from a given room to the hallways adjoining the room. The arrows in Figure 1 illustrate the policy of one of the eight options. For this option, the policy in the empty cells (not marked with arrows) is to pick uniformly randomly among the four primitive actions. The termination probability is 0 for all the cells with arrows and 1 for the empty cells. The other seven options are defined in a similar way. Denote the set of primitive actions by $\mathcal{A}$ and the set of hallway options by $\mathcal{H}$. Including the primitive actions, the agent has 12 options in total.

In the first experiment, we tested inter-option Differential Q-learning with three different sets of options, $\mathcal{O} \in \{\mathcal{A}, \mathcal{H}, \mathcal{A}+\mathcal{H}\}$. The task was to reach the green cell G1, which the agent can achieve with a combination of options and primitive actions. The shortest path to G1 from the starting state takes 16 time steps, hence the best possible reward rate for this task is 1/16 = 0.0625.
The agent used an ε-greedy policy with ε = 0.1. For each of the two step sizes $\alpha_n$ and $\beta_n$, we tested five choices: $2^{-x}$, $x \in \{1, 3, 5, 7, 9\}$. In addition, we tested five choices of $\eta$: $10^{-x}$, $x \in \{0, 1, 2, 3, 4\}$. $Q$ and $\bar{R}$ were initialized to 0, and $L$ to 1. Each parameter setting was run for 200,000 steps and repeated 30 times. The left subfigure of Figure 2 shows a typical learning curve for each of the three sets of options, with $\alpha = 2^{-3}$, $\beta = 2^{-1}$, and $\eta = 10^{-1}$. The parameter study for $\mathcal{O} = \mathcal{A}+\mathcal{H}$ w.r.t. $\alpha$ and $\eta$, with $\beta = 2^{-1}$, is presented in the right subfigure of Figure 2. The metric is the average reward obtained over the entire training period. Complete parameter studies for all three sets of options are presented in Appendix B.1.

The learning curves in the left panel of Figure 2 show that the agent achieved a relatively stable reward rate after 100,000 steps in all three cases. Using just primitive actions $\mathcal{A}$, the learning curve rises the slowest, indicating that hallway options indeed help the agent reach the goal faster. But solely using the hallway options $\mathcal{H}$ is not very useful in the long run because the goal G1 is not a hallway state. Note that because of the ε-greedy behavior policy, the learning curves do not reach the optimal reward rate of 0.0625. These observations mirror those of Sutton et al. (1999) in the discounted formulation. The sensitivity curves of inter-option Differential Q-learning (right panel of Figure 2) indicate that, in this Four-Room domain, the algorithm was not sensitive to the parameter $\eta$, performed well for a wide range of step sizes $\alpha$, and showed low variance across different runs. We also found that the algorithm was not sensitive to $\beta$ either; this parameter study is also presented in Appendix B.1.

Figure 2: Plots showing some learning curves and the parameter study of inter-option Differential Q-learning on the continuing Four-Room domain when the goal was to go to G1. Left: A point on the solid line denotes the reward rate over the last 1000 time steps and the shaded region indicates one standard error. The behavior using the three different sets of options was as expected. Right: Sensitivity of performance to $\alpha$ and $\eta$ when using $\mathcal{O} = \mathcal{A}+\mathcal{H}$ and $\beta = 2^{-1}$. The x-axis denotes the step size $\alpha$; the y-axis denotes the reward rate averaged over all 200,000 steps of training, reflecting the rate of learning. The error bars denote one standard error. The algorithm's rate of learning varied little over a broad range of $\eta$.

Figure 3: Parameter studies showing that the baseline algorithm's (Gosavi 2004) rate of learning is relatively more sensitive to the choices of its two parameters compared to our inter-option Differential Q-learning. The experimental setting and the plot axes are the same as in Figure 2's caption.

We also tested Gosavi's (2004) algorithm as a baseline. We chose not to compare the algorithms proposed in this paper with Sutton et al.'s (1999) discounted versions because the discounted and average-reward problem formulations are different; comparing the performance of their respective solution methods would be inappropriate and difficult to interpret. We have proposed new solution methods for the average-reward formulation, hence Gosavi's (2004) algorithm is the most appropriate baseline. Recall that it is the only proven-convergent average-reward SMDP off-policy control learning algorithm prior to our work.
It estimates the reward rate by tracking an estimate $C$ of the cumulative reward obtained by the options and dividing it by another estimate $T$ that tracks the cumulative length of the options. If the $n$th option executed is a greedy choice, then these estimates are updated using:

$$C_{n+1} \doteq C_n + \beta_n(\hat{R}_n - C_n), \qquad T_{n+1} \doteq T_n + \beta_n(\hat{L}_n - T_n), \qquad \bar{R}_{n+1} \doteq C_{n+1} / T_{n+1}.$$

When $\hat{O}_n$ is not greedy, $\bar{R}_{n+1} = \bar{R}_n$. The option-value function is updated with (3), with $\delta_n$ as defined in (5). $\alpha_n$ and $\beta_n$ are two step-size sequences. The sensitivity of this algorithm with $\mathcal{O} = \mathcal{A}+\mathcal{H}$ is shown in Figure 3. Compared to inter-option Differential Q-learning, this baseline has one fewer parameter, but its performance was found to be more sensitive to the values of both its step-size parameters. In addition, the error bars were generally larger, suggesting that the variance across different runs was also higher.

To conclude, our experiments with the continuing Four-Room domain show that our inter-option Differential Q-learning indeed finds the optimal policy given a set of options, in accordance with Theorem 1. In addition, its performance seems more robust to the choices of parameters compared to the baseline. Experiments with the prediction algorithm, inter-option Differential Q-evaluation, are presented in Appendix B.4.

4 Intra-Option Value Learning and Planning Algorithms

In this section, we introduce intra-option value learning and planning algorithms. Their objectives are the same as those of the inter-option value learning algorithms. As mentioned earlier, intra-option algorithms learn from every transition $S_t, A_t, R_{t+1}, S_{t+1}$ during the execution of a given option $O_t$. Moreover, intra-option algorithms also make updates for every option $o \in \mathcal{O}$, including ones that may potentially never be executed.

To develop our algorithms, we first establish the intra-option evaluation and optimality equations in the average-reward case. The general form of the intra-option Bellman equation is:

$$q(s, o) = \sum_a \pi(a \mid s, o) \sum_{s', r} p(s', r \mid s, a) \Big[ r - \bar{r} + u_q(s', o) \Big], \qquad (11)$$

where $q \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{O}|}$ and $\bar{r} \in \mathbb{R}$ are free variables. The optimality and evaluation equations use $u_q = u_q^*$ and $u_q = u_q^\mu$ respectively, defined for all $s' \in \mathcal{S}, o \in \mathcal{O}$ as:

$$u_q^*(s', o) \doteq \big(1 - \beta(s', o)\big)\, q(s', o) + \beta(s', o) \max_{o'} q(s', o'), \qquad (12)$$

$$u_q^\mu(s', o) \doteq \big(1 - \beta(s', o)\big)\, q(s', o) + \beta(s', o) \sum_{o'} \mu(o' \mid s')\, q(s', o'). \qquad (13)$$

Intuitively, the $u_q$ term accounts for the two possibilities of the option terminating or continuing in the next state. These equations generalize the average-reward Bellman equations given by Puterman (1994). The following theorem characterizes the solutions of the intra-option Bellman equations.

Theorem 2 (Solutions to intra-option Bellman equations). If Assumption 1 holds, then:

1. a) there exist an $\bar{r} \in \mathbb{R}$ and a $q \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{O}|}$ for which (11) and (12) hold; b) the solution for $\bar{r}$ is unique and equal to $r^*$, and, letting $q_1$ be one solution for $q$, the solutions for $q$ form the set $\{q_1 + c e \mid c \in \mathbb{R}\}$, where $e$ is the all-ones vector of size $|\mathcal{S}| \times |\mathcal{O}|$; c) a greedy policy w.r.t. a solution for $q$ achieves the optimal reward rate $r^*$.

2. a) there exist an $\bar{r} \in \mathbb{R}$ and a $q \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{O}|}$ for which (11) and (13) hold; b) the solution for $\bar{r}$ is unique and equal to $r(\mu)$, and the solutions for $q$ form the set $\{q_\mu + c e \mid c \in \mathbb{R}\}$.

The proof extends those of Corollary 8.2.7 and Theorems 8.4.3 and 8.4.4 of Puterman (1994) and is presented in Appendix A.3. Our intra-option control and prediction algorithms are stochastic approximation algorithms solving the intra-option optimality and evaluation equations respectively.
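To make the role of $u_q$ concrete, the following is a minimal sketch of the two continuation terms (12) and (13), assuming the tabular `Option` representation sketched in Section 2 and an array `mu[s, o]` for the policy over options; the intra-option TD errors in the next section plug these terms into their updates.

```python
import numpy as np

def u_star(Q, option, o, s_next):
    """Continuation term of the optimality equation (12): with probability
    1 - beta the option continues and keeps its own value; with probability
    beta it terminates and the best option's value is used."""
    b = option.beta[s_next]
    return (1 - b) * Q[s_next, o] + b * np.max(Q[s_next])

def u_mu(Q, option, o, mu, s_next):
    """Continuation term of the evaluation equation (13): upon termination,
    the next option is drawn from the policy over options mu(. | s_next)."""
    b = option.beta[s_next]
    return (1 - b) * Q[s_next, o] + b * np.dot(mu[s_next], Q[s_next])
```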
Both algorithms maintain a vector of estimates $Q(s, o)$ and a scalar estimate $\bar{R}$, just like our inter-option algorithms. However, unlike the inter-option algorithms, the intra-option algorithms need not maintain an estimate of the option lengths ($L$) because they make updates after every transition. Our control algorithm, called intra-option Differential Q-learning, updates the estimates $Q$ and $\bar{R}$ by:

$$Q_{t+1}(S_t, o) \doteq Q_t(S_t, o) + \alpha_t \rho_t(o) \delta_t(o), \quad \forall o \in \mathcal{O}, \qquad (14)$$
$$\bar{R}_{t+1} \doteq \bar{R}_t + \eta \alpha_t \sum_{o \in \mathcal{O}} \rho_t(o) \delta_t(o), \qquad (15)$$

where $\alpha_t$ is a step-size sequence, $\rho_t(o) \doteq \frac{\pi(A_t \mid S_t, o)}{\pi(A_t \mid S_t, O_t)}$ is the importance sampling ratio, and:

$$\delta_t(o) \doteq R_{t+1} - \bar{R}_t + u^*_{Q_t}(S_{t+1}, o) - Q_t(S_t, o). \qquad (16)$$

Our prediction algorithm, called intra-option Differential Q-evaluation, also updates $Q$ and $\bar{R}$ by (14) and (15), but with the TD error:

$$\delta_t(o) \doteq R_{t+1} - \bar{R}_t + u^\mu_{Q_t}(S_{t+1}, o) - Q_t(S_t, o). \qquad (17)$$

Theorem 3 (Convergence of intra-option algorithms; informal). Under the conditions of Theorem 1:

1. the intra-option Differential Q-learning algorithm (14-16) converges almost surely: $\bar{R}_t$ to $r^*$, $Q_t(s, o)$ to a solution of $q(s, o)$ in (11) and (12) for all $s \in \mathcal{S}^r, o \in \mathcal{O}$, and $r(\mu_t)$ to $r^*$, where $\mu_t$ is a greedy policy w.r.t. $Q_t$;

2. the intra-option Differential Q-evaluation algorithm (14, 15, 17) converges almost surely: $\bar{R}_t$ to $r(\mu)$ and $Q_t(s, o)$ to a solution of $q(s, o)$ in (11) and (13) for all $s \in \mathcal{S}^r, o \in \mathcal{O}$.

Figure 4: Learning curve showing that the greedy policy corresponding to the hallway options' option-value function achieves the optimal reward rate on the continuing Four-Room domain. The value function was learned via intra-option Differential Q-learning using a behavior policy consisting only of primitive actions; the hallway options were never executed.

Remark: The intra-option learning methods introduced in this section can be used with options having stochastic policies. This is possible because of the importance sampling ratios described above. Sutton et al.'s (1999) discounted intra-option learning methods were restricted to options having deterministic policies.

Again, the intra-option value planning algorithms are similar to the learning algorithms except that they use simulated experience generated by a given or learned model instead of real experience. The planning algorithms and their convergence results are presented in Appendix A.4.
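Below is a minimal tabular sketch of intra-option Differential Q-learning, updates (14)-(16). The behavior-policy interface `behavior(s, rng)`, returning the executed action and its probability (which plays the role of $\pi(A_t \mid S_t, O_t)$ in the ratio $\rho$), and the `env` interface are illustrative assumptions; the continuation term is the $u^*$ of (12), as in the helper sketched above.

```python
import numpy as np

def intra_option_differential_q_learning(env, options, behavior, num_steps,
                                          alpha=0.125, eta=0.1, seed=0):
    """Sketch of updates (14)-(16). `options` is a list of Option objects
    (pi[s, a], beta[s]); `behavior(s, rng)` returns (a, prob_of_a)."""
    rng = np.random.default_rng(seed)
    n_s, n_o = env.num_states, len(options)
    Q = np.zeros((n_s, n_o))
    R_bar = 0.0
    s = env.reset()
    for _ in range(num_steps):
        a, b_prob = behavior(s, rng)
        s_next, r = env.step(s, a)
        Q_old, R_old = Q.copy(), R_bar      # all options updated from the same estimates
        for o, option in enumerate(options):
            rho = option.pi[s, a] / b_prob  # importance sampling ratio rho_t(o)
            b = option.beta[s_next]
            u = (1 - b) * Q_old[s_next, o] + b * np.max(Q_old[s_next])  # u* of (12)
            delta = r - R_old + u - Q_old[s, o]   # TD error (16)
            Q[s, o] += alpha * rho * delta        # update (14)
            R_bar   += eta * alpha * rho * delta  # accumulates the sum in (15)
        s = s_next
    return Q, R_bar
```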
Empirical Evaluation. We conducted another experiment in the Four-Room domain to show that intra-option Differential Q-learning can learn the values of the hallway options $\mathcal{H}$ using only primitive actions $\mathcal{A}$. As mentioned earlier, there are no baseline intra-option average-reward algorithms, so this is a proof-of-concept experiment. The goal state for this experiment was G2, which can be reached using the option that leads to the lower hallway. The optimal reward rate in this case is $1/14 \approx 0.0714$ with both $\mathcal{O} = \mathcal{H}$ and $\mathcal{O} = \mathcal{A}$. We applied intra-option Differential Q-learning using a behavior policy that chose the four primitive actions with equal probability in all states. This choice of behavior policy and goal G2 tests whether the intra-option algorithm can arrive at a policy consisting exclusively of options while interacting with the environment using only primitive actions. Each parameter setting was run for 200,000 steps and repeated 30 times. For evaluation, we saved the learned option-value function after every 1000 steps and computed the average reward of the corresponding greedy policy over 1000 steps. Figure 4 shows the learning curve of this average reward across the 30 independent runs for parameters $\alpha = 0.125$ and $\eta = 0.1$.

The agent indeed succeeds in learning the option-value function corresponding to the hallway options using a behavior policy consisting only of primitive actions. The parameter study of intra-option Differential Q-learning is presented in Appendix B.2. Experiments with the prediction algorithm, intra-option Differential Q-evaluation, are presented in Appendix B.4.

5 Intra-Option Model Learning and Planning Algorithms

In this section, we first describe option models within the average-reward formulation. We then introduce an algorithm to learn such models in an intra-option fashion. This option-model learning algorithm can be combined with the planning algorithms from the previous section to obtain a complete model-based average-reward options algorithm that learns option models and plans with them (we present this combined algorithm in Appendix C.2).

The average-reward option model is similar to the discounted option model but with key distinctions. Sutton et al.'s (1999) discounted option model has two parts: the dynamics part and the reward part. Given a state and an option, the dynamics part predicts the discounted occupancy of states upon termination, and the reward part predicts the expected (discounted) sum of rewards during the execution of the option. In the average-reward setting, apart from the dynamics and the reward parts, an option model has a third part, the duration part, which predicts the expected duration of the option's execution. In addition, the dynamics part predicts the state distribution upon termination without discounting, and the reward part predicts the undiscounted cumulative reward during the execution of the option.

Formally, the dynamics part approximates $m_p(s' \mid s, o) \doteq \sum_{r, l} \hat{p}(s', r, l \mid s, o)$, the probability that option $o$ terminates in state $s'$ when starting from state $s$. The reward part approximates $m_r(s, o) \doteq \sum_{s', r, l} \hat{p}(s', r, l \mid s, o)\, r$, the expected cumulative reward during the execution of option $o$ when starting from state $s$. Finally, the duration part approximates $m_l(s, o) \doteq \sum_{s', r, l} \hat{p}(s', r, l \mid s, o)\, l$, the expected duration of option $o$ when starting from state $s$.

We now present a set of recursive equations that are key to our model-learning algorithms. These equations extend the discounted Bellman equations for option models (Sutton et al. 1999) to the average-reward formulation:

$$m_p(x \mid s, o) = \sum_a \pi(a \mid s, o) \sum_{s', r} p(s', r \mid s, a) \Big[ \beta(s', o)\, \mathbb{I}(x = s') + \big(1 - \beta(s', o)\big)\, m_p(x \mid s', o) \Big], \qquad (18)$$

$$m_r(s, o) = \sum_a \pi(a \mid s, o) \sum_{s', r} p(s', r \mid s, a) \Big[ r + \big(1 - \beta(s', o)\big)\, m_r(s', o) \Big], \qquad (19)$$

$$m_l(s, o) = \sum_a \pi(a \mid s, o) \sum_{s', r} p(s', r \mid s, a) \Big[ 1 + \big(1 - \beta(s', o)\big)\, m_l(s', o) \Big]. \qquad (20)$$

The first equation differs from the other two because, in the latter, the total reward and length of option $o$ are incremented irrespective of whether the option terminates in $s'$ or not. The following theorem shows that $(m_p, m_r, m_l)$ is the unique solution of (18-20) and therefore the models can be obtained by solving these equations (see Appendix A.5 for the proof).

Theorem 4 (Solutions to Bellman equations for option models). There exist unique $\hat{m}_p \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{O}| \times |\mathcal{S}|}$, $\hat{m}_r \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{O}|}$, and $\hat{m}_l \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{O}|}$ for which (18), (19), and (20) hold. Further, if $\hat{m}_p, \hat{m}_r, \hat{m}_l$ satisfy (18), (19), and (20), then $\hat{m}_p = m_p$, $\hat{m}_r = m_r$, and $\hat{m}_l = m_l$.
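As an illustration of Theorem 4 (not an algorithm from the paper), when the MDP dynamics $p$ are known the option model can be computed by successive approximation of (18)-(20); the iteration converges because the expected option durations are assumed finite. A minimal sketch, assuming arrays `trans[s, a, s']` for $\sum_r p(s', r \mid s, a)$ and `r_exp[s, a]` for the expected one-step reward:

```python
import numpy as np

def option_model_by_dp(trans, r_exp, option, iters=1000):
    """Successive approximation of (18)-(20) for one option (pi[s, a], beta[s]).
    Returns (m_p, m_r, m_l): termination distribution, expected cumulative
    reward, and expected duration of the option from every state."""
    n_s = trans.shape[0]
    p_pi = np.einsum('sa,sat->st', option.pi, trans)  # Pr(s' | s, follow option's policy)
    r_pi = np.einsum('sa,sa->s', option.pi, r_exp)    # expected one-step reward under the option
    cont = p_pi * (1 - option.beta)[None, :]          # move to s' and the option continues
    term = p_pi * option.beta[None, :]                # move to s' and the option terminates

    m_p = np.zeros((n_s, n_s))   # m_p[s, x]
    m_r = np.zeros(n_s)
    m_l = np.zeros(n_s)
    for _ in range(iters):
        m_p = term + cont @ m_p  # (18)
        m_r = r_pi + cont @ m_r  # (19)
        m_l = 1.0 + cont @ m_l   # (20)
    return m_p, m_r, m_l
```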
Our intra-option model-learning algorithm solves the above recursive equations using the following TD-like update rules for each option $o$:

$$M^p_{t+1}(x \mid S_t, o) \doteq M^p_t(x \mid S_t, o) + \alpha_t \rho_t(o) \Big[ \beta(S_{t+1}, o)\, \mathbb{I}(S_{t+1} = x) + \big(1 - \beta(S_{t+1}, o)\big) M^p_t(x \mid S_{t+1}, o) - M^p_t(x \mid S_t, o) \Big], \quad \forall x \in \mathcal{S}, \qquad (21)$$

$$M^r_{t+1}(S_t, o) \doteq M^r_t(S_t, o) + \alpha_t \rho_t(o) \Big[ R_{t+1} + \big(1 - \beta(S_{t+1}, o)\big) M^r_t(S_{t+1}, o) - M^r_t(S_t, o) \Big], \qquad (22)$$

$$M^l_{t+1}(S_t, o) \doteq M^l_t(S_t, o) + \alpha_t \rho_t(o) \Big[ 1 + \big(1 - \beta(S_{t+1}, o)\big) M^l_t(S_{t+1}, o) - M^l_t(S_t, o) \Big], \qquad (23)$$

where $M^p$ is a $|\mathcal{S}| \times |\mathcal{O}| \times |\mathcal{S}|$-sized vector of estimates, $M^r$ and $M^l$ are both $|\mathcal{S}| \times |\mathcal{O}|$-sized vectors of estimates, and $\alpha_t$ is a sequence of step sizes. Standard stochastic approximation results can be applied to show the algorithm's convergence (see Appendix A.6 for details).

Theorem 5 (Convergence of the intra-option model-learning algorithm; informal). If the step sizes are set appropriately and all state-option pairs are updated an infinite number of times, then intra-option model learning (21-23) converges almost surely: $M^p_t$ to $m_p$, $M^r_t$ to $m_r$, and $M^l_t$ to $m_l$.

Our intra-option model-learning algorithm (21-23) can also be applied to simulated one-step transitions generated by a given action model, resulting in a planning algorithm that produces an estimated option model. The planning algorithm and its convergence result are presented in Appendix A.6.

6 Interruption to Improve Policy Over Options

In all the algorithms we considered so far, the policy over options selects an option, executes the option's policy until termination, and then selects a new option. Sutton et al. (1999) showed that the policy over options can be improved by allowing the interruption of an option midway through its execution in order to start a seemingly better option. We now show that this interruption result applies to average-reward options as well (see Appendix A.7 for the proof).

Theorem 6 (Interruption). For any MDP, any set of options $\mathcal{O}$, and any policy $\mu : \mathcal{S} \times \mathcal{O} \to [0, 1]$, define a new set of options $\mathcal{O}'$ with a one-to-one mapping between the two option sets as follows: for every $o = (\pi, \beta) \in \mathcal{O}$, define a corresponding $o' = (\pi, \beta') \in \mathcal{O}'$, where $\beta' = \beta$ except that, for any state $s$ in which $q_\mu(s, o) < v_\mu(s)$, $\beta'(s, o) = 1$ (where $v_\mu(s) \doteq \sum_o \mu(o \mid s)\, q_\mu(s, o)$). Let the interrupted policy $\mu'$ be such that for all $s \in \mathcal{S}$ and all $o' \in \mathcal{O}'$, $\mu'(s, o') = \mu(s, o)$, where $o$ is the option in $\mathcal{O}$ corresponding to $o'$. Then:

1. the new policy over options $\mu'$ is not worse than the old one $\mu$, i.e., $r(\mu') \geq r(\mu)$;

2. if there exists a state $s \in \mathcal{S}$ from which there is a non-zero probability of encountering an interruption upon initiating $\mu'$ in $s$, then $r(\mu') > r(\mu)$.

Figure 5: Learning curves showing that executing options with interruptions can achieve a higher reward rate than executing options till termination in the domain described in the adjoining text.

In short, the above theorem shows that interrupting options produces behavior whose reward rate is at least as high as, and under the stated condition strictly higher than, that of executing options to termination. Note that interruption is only applicable with intra-option algorithms; inter-option algorithms need complete option transitions.
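A minimal sketch of the interruption condition of Theorem 6, given learned option values `Q` and a policy over options `mu[s, o]`; the function name and tabular interface are illustrative assumptions.

```python
import numpy as np

def should_interrupt(Q, mu, s, o):
    """Interruption condition of Theorem 6: terminate option o at state s if
    q(s, o) < v(s), where v(s) = sum_o' mu(o' | s) q(s, o').
    For a greedy policy over options this reduces to Q[s, o] < max_o' Q[s, o']."""
    v = np.dot(mu[s], Q[s])
    return Q[s, o] < v
```

In execution, this check is applied at every state visited during an option, in addition to the option's own termination probability; when it fires, the current option is terminated and a new option is selected.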
Empirical Evaluation. We tested the intra-option Differential Q-learning algorithm with and without interruption in the Four-Room domain. We set the goal as G3 and allowed the agent to choose and learn only from the set of all hallway options $\mathcal{H}$. With just hallway options and without interruption, the best strategy is to first move to the lower hallway and then try to reach the goal by following options that pick random actions in the states near the hallway and the goal. With interruption, the agent can first move to the left hallway and then take the option that moves it to the lower hallway, terminating that option when other options have higher option values. This termination is most likely to occur in the cell just above G3. The agent then needs fewer steps in expectation to reach the goal.

Figure 5 shows learning curves for intra-option Differential Q-learning with and without interruptions on this problem. Each parameter setting was run for 400,000 steps and repeated 30 times. The learning curves shown correspond to $\alpha = 0.125$ and $\eta = 0.1$. As expected, the agent achieved a higher reward rate by using interruptions. The parameter study of the interruption algorithm, along with the rest of the experimental details, is presented in Appendix B.3.

7 Conclusions, Limitations, and Future Work

In this paper, we extended the learning and planning algorithms of the options framework, originally proposed by Sutton et al. (1999) for discounted-reward MDPs, to average-reward MDPs. The inter-option learning algorithm presented in this paper is more general than previous work in that its convergence proof does not require the existence of any special state in the MDP. We also derived the intra-option Bellman equations for average-reward MDPs and used them to propose the first intra-option learning algorithms for average-reward MDPs. Finally, we extended the interruption algorithm and its related theory from the discounted to the average-reward setting. Our experiments on a continuing version of the classic Four-Room domain demonstrate the efficacy of the proposed algorithms. We believe that our contributions will enable widespread use of options in the average-reward setting.

We now briefly comment on the novelty of our theoretical and algorithmic contributions. Our primary theoretical contribution is to generalize Wan et al.'s (2021) proof techniques to obtain a unified convergence proof for actions and options; the same proof techniques then apply to both the inter- and intra-option algorithms. Our primary algorithmic contribution is the scaling of the updates by option lengths in the inter-option algorithms. Without this scaling, the algorithms are unstable and prone to divergence. Furthermore, we showed that the correct way of scaling involves estimated option lengths, not sampled option lengths.

The most immediate line of future work involves extending these ideas from the tabular case to the general case of function approximation, starting with linear function approximation. One way to incorporate function approximation is to extend the algorithms presented in this paper to linear options (Sorg & Singh 2010, Yao et al. 2014), perhaps by building on Zhang et al.'s (2021) work. Using the results developed in this paper, we also foresee extending more ideas from the discounted formulation involving function approximation, such as Bacon et al.'s (2017) option-critic architecture, to the average-reward formulation.

This paper assumes that a fixed set of options is provided and the agent then learns or plans using them. One of the most important challenges in the options framework is the discovery of options. We think the discovery problem is orthogonal to the problem formulation.
Hence, another line of future work is to extend existing option-discovery algorithms developed for the discounted formulation to the average-reward formulation (e.g., the algorithms of McGovern & Barto 2001, Menache et al. 2002, Şimşek & Barto 2004, Singh et al. 2004, van Dijk & Polani 2011, Machado et al. 2017). Relatively more work might be required to extend approaches that couple the problems of option discovery and learning (e.g., Gregor et al. 2016, Eysenbach et al. 2018, Achiam et al. 2018, Veeriah et al. 2021). Another limitation of this paper is that it deals with learning and planning separately; we also need combined methods that learn models as well as plan with them, and we discuss some ideas in Appendix C. Finally, we would like to get more empirical experience with the algorithms proposed in this paper, both in pedagogical tabular problems and in challenging large-scale problems. Nevertheless, we believe this paper makes novel contributions that are significant for the use of temporal abstractions in average-reward reinforcement learning.

Acknowledgements

The authors were supported by DeepMind, Amii, and CIFAR. The authors wish to thank Huizhen Yu for extensive discussions on several related works, and Benjamin Van Roy, Csaba Szepesvári, Dale Schuurmans, Martha White, and the anonymous reviewers for valuable feedback.

References

Abounadi, J., Bertsekas, D., & Borkar, V. S. (2001). Learning Algorithms for Markov Decision Processes with Average Cost. SIAM Journal on Control and Optimization.

Achiam, J., Edwards, H., Amodei, D., & Abbeel, P. (2018). Variational Option Discovery Algorithms. arXiv:1807.10299.

Almezel, S., Ansari, Q. H., & Khamsi, M. A. (2014). Topics in Fixed Point Theory (Vol. 5). Springer.

Bacon, P.-L., Harb, J., & Precup, D. (2017). The Option-Critic Architecture. AAAI Conference on Artificial Intelligence.

Borkar, V. S. (1998). Asynchronous Stochastic Approximations. SIAM Journal on Control and Optimization.

Borkar, V. S. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint. Springer.

Borkar, V. S., & Soumyanatha, K. (1997). An Analog Scheme for Fixed Point Computation. I. Theory. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications.

Brunskill, E., & Li, L. (2014). PAC-inspired Option Discovery in Lifelong Reinforcement Learning. International Conference on Machine Learning.

Das, T. K., Gosavi, A., Mahadevan, S., & Marchalleck, N. (1999). Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning. Management Science.

Eysenbach, B., Gupta, A., Ibarz, J., & Levine, S. (2018). Diversity is All You Need: Learning Skills without a Reward Function. arXiv:1802.06070.

Fruit, R., & Lazaric, A. (2017). Exploration-Exploitation in MDPs with Options. Artificial Intelligence and Statistics.

Gosavi, A. (2004). Reinforcement Learning for Long-run Average Cost. European Journal of Operational Research.

Gregor, K., Rezende, D. J., & Wierstra, D. (2016). Variational Intrinsic Control. arXiv:1611.07507.

Li, Y., & Cao, F. (2010). RVI Reinforcement Learning for Semi-Markov Decision Processes with Average Reward. IEEE World Congress on Intelligent Control and Automation.

Machado, M. C., Bellemare, M. G., & Bowling, M. (2017). A Laplacian Framework for Option Discovery in Reinforcement Learning. International Conference on Machine Learning.

McGovern, A., & Barto, A. G. (2001). Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density. International Conference on Machine Learning.
Menache, I., Mannor, S., & Shimkin, N. (2002). Q-Cut - Dynamic Discovery of Sub-goals in Reinforcement Learning. European Conference on Machine Learning.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Schweitzer, P. J. (1971). Iterative Solution of the Functional Equations of Undiscounted Markov Renewal Programming. Journal of Mathematical Analysis and Applications.

Schweitzer, P. J., & Federgruen, A. (1978). The Functional Equations of Undiscounted Markov Renewal Programming. Mathematics of Operations Research.

Şimşek, Ö., & Barto, A. G. (2004). Using Relative Novelty to Identify Useful Temporal Abstractions in Reinforcement Learning. International Conference on Machine Learning.

Singh, S., Barto, A. G., & Chentanez, N. (2004). Intrinsically Motivated Reinforcement Learning. Advances in Neural Information Processing Systems.

Sorg, J., & Singh, S. (2010). Linear Options. International Conference on Autonomous Agents and Multiagent Systems.

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Tsitsiklis, J. N. (1994). Asynchronous Stochastic Approximation and Q-learning. Machine Learning.

van Dijk, S. G., & Polani, D. (2011). Grounding Subgoals in Information Transitions. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

Veeriah, V., Zahavy, T., Hessel, M., Xu, Z., Oh, J., Kemaev, I., van Hasselt, H., Silver, D., & Singh, S. (2021). Discovery of Options via Meta-Learned Subgoals. arXiv:2102.06741.

Vien, N. A., & Chung, T. (2008). Policy Gradient Semi-Markov Decision Process. IEEE International Conference on Tools with Artificial Intelligence.

Wan, Y., Naik, A., & Sutton, R. S. (2021). Learning and Planning in Average-Reward Markov Decision Processes. International Conference on Machine Learning.

Yao, H., Szepesvári, C., Sutton, R. S., Modayil, J., & Bhatnagar, S. (2014). Universal Option Models. Advances in Neural Information Processing Systems.

Zhang, S., Wan, Y., Sutton, R. S., & Whiteson, S. (2021). Average-Reward Off-Policy Policy Evaluation with Function Approximation. International Conference on Machine Learning.