# Per-Decision Option Discounting

Anna Harutyunyan¹ ², Peter Vrancx³ ², Philippe Hamel¹, Ann Nowé², Doina Precup¹

¹DeepMind, London, UK. ²Vrije Universiteit Brussel, Brussels, Belgium. ³PROWLER.io, Cambridge, UK. Correspondence to: Anna Harutyunyan.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

In order to solve complex problems, an agent must be able to reason over a sufficiently long horizon. Temporal abstraction, commonly modeled through options, offers the ability to reason at many timescales, but the horizon length is still determined by the discount factor of the underlying Markov Decision Process. We propose a modification to the options framework that naturally scales the agent's horizon with option length. We show that the proposed option-step discount controls a bias-variance trade-off, with larger discounts (counter-intuitively) leading to less estimation variance.

## 1. Introduction

Reinforcement learning agents have to solve the problem of reasoning about actions that improve long-term performance. This objective can be formulated either in the discounted setting (in which the agent optimizes the expectation of the discounted sum of rewards) or in the average-reward setting (in which the agent optimizes the average reward received per step over an infinite amount of time) (Schwartz, 1993; Bertsekas & Tsitsiklis, 1996; Mahadevan, 1996). Intuitively, if agents are to have a finite lifetime (but of unknown duration), the discounted reward framework is more appropriate. Indeed, an average-reward agent is content to wait arbitrarily long before collecting rewards, so long as the average over the infinite horizon is favorable. Discounting, on the other hand, causes the agent to focus on collecting rewards early on, which is the preferred behavior if there is a chance of termination. Even in cases where average-reward performance is the desired target, it can be approximated arbitrarily well by using a discount that approaches one (Tsitsiklis & Van Roy, 2002). We will hence focus on the discounted setting in this paper.

The discount factor $\gamma$ itself is usually treated as something in between a mathematical convenience and a meaningful time-horizon parameter. Indeed, the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ corresponds roughly to a time horizon of $\frac{1}{1-\gamma}$ steps, and the convergence speed of learning and planning algorithms is strongly related to this quantity (Bertsekas & Tsitsiklis, 1996; Jiang et al., 2015). While we wish for our agents to reason over long temporal horizons, this coupling makes it challenging for that to be efficient, and fairly low discounts are favored in practice.

Temporal abstraction is often considered key for reasoning over long temporal horizons (Sutton et al., 1999). Indeed, options induce reward and transition models that act as higher-order analogues of those of the primitive-action Markov Decision Process. However, the discount factor remains an integral part of these models: there are in fact clear parallels between options and multi-step temporal difference methods (Bacon & Precup, 2016; Harutyunyan et al., 2018). As such, in the classical options framework, an option that takes one hundred steps incurs a discount of $\gamma^{100}$ at termination, and the information about its outcome can be significantly washed out.
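To make the washout concrete, the following quick check (an illustrative calculation, not taken from the paper) compares the nominal horizon $\frac{1}{1-\gamma}$ with the discount a hundred-step option incurs at termination:

```python
# Illustrative only: effective horizon vs. the discount applied after a 100-step option.
for gamma in (0.9, 0.95, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)   # rough effective horizon, in primitive steps
    washout = gamma ** 100          # factor applied to the option's outcome at termination
    print(f"gamma={gamma}: horizon ~ {horizon:.0f} steps, gamma^100 = {washout:.3g}")
```

Even at $\gamma = 0.99$, whose nominal horizon is 100 steps, the outcome of such an option arrives attenuated by a factor of roughly 0.37.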
So, regardless of how sophisticated the options are, the agent remains at the mercy of the step-discount, and its high-level plan is unlikely to benefit from the foresight brought by the options. The temporal abstraction in the behavior offers little additional temporal representation power in the model or the value function. The aim of this work is to overcome this limitation. In particular:

- We generalize the options framework to better express temporal abstraction. Namely, we (1) decouple the step-discounts in the reward and transition models, and (2) introduce per-decision discounting that augments the transition model irrespective of duration. This simple generalization allows options to extend the agent's horizon, a property which can be thought of as "time dilation".
- We analyze the properties of planning with the new framework, and devise novel bias-variance bounds that apply to the classical options framework as a special case. Notably, we show that larger discounts in the transition model can reduce the variance of estimating values, which is contrary to the familiar intuitions about e.g. multi-step returns.
- We verify the shape of the bounds empirically on a classical task. Our results imply that in addition to extending the agent's horizon, time dilation can be a tool for better estimation of value functions.

The paper is organized as follows. After providing background and relevant notation, in Section 3 we motivate and in Section 4 formally introduce time dilation. In Section 5 we begin our analysis by deriving the equivalent step-discounted problem and providing a convergence result. Section 6 analyzes the bias-variance trade-off of the new framework in the approximate dynamic programming setting. All proofs from these sections are in the appendix. Finally, Section 7 provides supporting experimental results.

## 2. Background

A Markov Decision Process (MDP) is a tuple $M = (S, A, p, r)$, where $S$ is the set of states and $A$ the set of discrete actions; $p : S \times A \times S \to [0, 1]$ represents the environment dynamics, where $p(s'|s, a)$ is the probability of transitioning to state $s'$ when $a$ is taken in $s$; and $r : S \times A \to [-r_{max}, r_{max}]$ is the reward function. A policy $\pi : S \times A \to [0, 1]$ induces a reward Markov chain given by $p_\pi(s'|s) = \sum_{a \in A} \pi(a|s) p(s'|s, a)$ and $r_\pi(s) = \sum_{a \in A} \pi(a|s) r(s, a)$.

The general goal of a reinforcement learning (RL) agent is to find a policy that optimizes a cumulative measure of reward (Sutton & Barto, 2017). One standard way to define such a criterion is by using a scalar discount factor $\gamma \in (0, 1]$, which depreciates rewards received further in the future. More formally, the agent will aim to optimize:

$$\mathbb{E}\Big[\sum_{i=0}^{\infty} \gamma^i r_\pi(S_{t+i}) \,\Big|\, S_t = s\Big].$$

Let $q : S \times A \to \mathbb{R}$ be a generic action-value (or Q-) function and define the one-step transition operator:

$$P^\pi q(s, a) \overset{\text{def}}{=} \sum_{s'} p(s'|s, a) \sum_{a' \in A} \pi(a'|s') q(s', a').$$

Using operator notation, the value of policy $\pi$ under a discount $\gamma$ is given by:

$$q^\pi_\gamma \overset{\text{def}}{=} \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t r = r + \gamma P^\pi q^\pi_\gamma = (I - \gamma P^\pi)^{-1} r,$$

where the inverse always exists if $\gamma < 1$. The corresponding one-step Bellman operator (Bellman, 1957) can be applied to any $q$:

$$T^\pi q(s, a) \overset{\text{def}}{=} r(s, a) + \gamma P^\pi q(s, a),$$

and its repeated applications are guaranteed to produce its fixed point $q^\pi_\gamma$ (Puterman, 1994). Policy evaluation is concerned with estimating this quantity for a fixed $\pi$, while in control we seek the optimal policy $\pi^*_\gamma$, whose value is $q^*_\gamma = \max_\pi q^\pi_\gamma$. The state value function averages $q^\pi_\gamma$ w.r.t. the policy: $v^\pi_\gamma(s) = \sum_a \pi(a|s) q^\pi_\gamma(s, a)$, and $v^*_\gamma = \max_\pi v^\pi_\gamma$.
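As a concrete illustration of the operator notation above, the following sketch evaluates a fixed policy in a small tabular MDP by solving $q^\pi_\gamma = (I - \gamma P^\pi)^{-1} r$ directly; the transition matrix, rewards, and policy are made-up toy quantities, not anything from the paper.

```python
import numpy as np

def policy_evaluation(p, r, pi, gamma):
    """Evaluate a fixed policy: q_gamma^pi = (I - gamma P^pi)^{-1} r.

    p:  transitions, shape [S, A, S'] with p[s, a, s'] = p(s'|s, a)
    r:  rewards, shape [S, A]
    pi: policy, shape [S, A] with pi[s, a] = pi(a|s)
    """
    S, A, _ = p.shape
    # One-step operator P^pi as an (S*A) x (S*A) matrix:
    # (P^pi q)(s, a) = sum_{s'} p(s'|s, a) sum_{a'} pi(a'|s') q(s', a')
    P_pi = np.einsum("ijk,kl->ijkl", p, pi).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    return q.reshape(S, A)

# Tiny made-up 2-state, 2-action MDP for demonstration.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [0.5, 0.0]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
print(policy_evaluation(p, r, pi, gamma=0.95))
```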
The target of learning is often intended to be w.r.t. some very long horizon, whose corresponding discount factor we denote by $\gamma_{eval}$. Using a large discount factor can be inefficient during value function learning, and can pose problems during planning with approximate models (e.g. Jiang et al., 2015; Lehnert et al., 2018). Hence, in practice, the $\gamma$ used during learning is often treated as a parameter, with the hope that it will lead to finding a good policy w.r.t. $\gamma_{eval}$. Similarly to the notions from Jiang et al. (2015) and Lehnert et al. (2018), we say that a discount $\gamma$ is able to represent a policy w.r.t. $\gamma' > \gamma$ if $\pi^*_\gamma = \pi^*_{\gamma'}$.

The average-reward formulation aims to find the policy that optimizes the average reward received per step over the infinite horizon. Tsitsiklis & Van Roy (2002) showed that both the expected and the transient behavior of the average-reward criterion can be approximated arbitrarily well by a large enough $\gamma$.

An option $o$ is a tuple $(I_o, \beta_o, \pi_o)$, with $I_o \subseteq S$ the initiation set,¹ $\pi_o$ the internal option policy, and $\beta_o : S \to [0, 1]$ the termination condition, with $\beta_o(s)$ denoting the probability of option $o$ terminating upon arriving in $s$ (Sutton et al., 1999). Given a discount factor $\gamma$, an option $o$ has an associated semi-MDP (Puterman, 1994) model given by:

$$P^o_\gamma(s'|s) \overset{\text{def}}{=} \mathbb{E}_{D: s \to s'|o}\big[\gamma^D \mathbb{I}_{S_{t+D} = s'} \mid S_t = s\big] = \gamma\Big(p_{\pi_o}(s'|s)\beta_o(s') + \sum_{s''} p_{\pi_o}(s''|s)(1 - \beta_o(s'')) P^o_\gamma(s'|s'')\Big),$$

$$R^o_\gamma(s) \overset{\text{def}}{=} \mathbb{E}_{D: s|o}\Big[\sum_{i=0}^{D-1} \gamma^i r_{\pi_o}(S_{t+i}) \,\Big|\, S_t = s\Big] = r_{\pi_o}(s) + \gamma \sum_{s'} p_{\pi_o}(s'|s)(1 - \beta_o(s')) R^o_\gamma(s'),$$

where $\mathbb{E}_{D: s|o}[\cdot]$ and $\mathbb{E}_{D: s \to s'|o}[\cdot]$ are expectations over the option $o$'s duration $D$ from state $s$, and over the travel time between state $s$ and the state $s'$ in which $o$ terminates, respectively, and where we index relevant quantities with the appropriate discount factor to make the dependence clear. For a finite set of options, we denote the $|S| \times |O|$ matrix collecting all $R^o_\gamma$ by $R_\gamma$.

In the usual call-and-return model of option execution, an option is run until completion (according to its termination condition), upon which a new option choice is made (Precup et al., 1998). This suggests the following analogues of $P^\pi$ and $T^\pi$ for a policy over options $\mu$ and discount factor $\gamma$:

$$P^\mu_\gamma q(s, o) \overset{\text{def}}{=} \sum_{s'} P^o_\gamma(s'|s) \sum_{o'} \mu(o'|s') q(s', o'), \qquad T^\mu_\gamma q(s, o) \overset{\text{def}}{=} R^o_\gamma(s) + P^\mu_\gamma q(s, o). \tag{1}$$

Finally, we will write $\|\cdot\|$ to denote the $L_\infty$ norm.

¹ For simplicity, we assume that $I_o = S$, $\forall o \in O$.
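The recursions above translate directly into two linear solves. The sketch below computes $R^o_\gamma$ and $P^o_\gamma$ for a single option from its internal-policy quantities; the arrays are assumed to be given, and the toy numbers at the end are placeholders.

```python
import numpy as np

def option_models(p_pi, r_pi, beta, gamma):
    """Classical option models, solved from their recursive definitions.

    p_pi: [S, S] matrix with p_pi[s, s'] = p_{pi_o}(s'|s)
    r_pi: [S] vector of expected one-step rewards under pi_o
    beta: [S] vector of termination probabilities beta_o(s)
    Returns R_o (shape [S]) and P_o (shape [S, S]).
    """
    S = p_pi.shape[0]
    cont = p_pi * (1.0 - beta)[None, :]   # transition and continue: p_pi(s'|s)(1 - beta(s'))
    stop = p_pi * beta[None, :]           # transition and terminate: p_pi(s'|s) beta(s')
    # R_o = r_pi + gamma * cont @ R_o   and   P_o = gamma * (stop + cont @ P_o)
    R_o = np.linalg.solve(np.eye(S) - gamma * cont, r_pi)
    P_o = np.linalg.solve(np.eye(S) - gamma * cont, gamma * stop)
    return R_o, P_o

# Toy 3-state chain: the option drifts right and terminates in the last state.
p_pi = np.array([[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]])
R_o, P_o = option_models(p_pi, r_pi=np.array([0.0, 0.0, 1.0]),
                         beta=np.array([0.0, 0.0, 1.0]), gamma=0.95)
```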
## 3. Motivation

In order to further understand the goal of our approach, consider an agent placed in a deterministic environment needing to choose between a closer, worse reward of $z$ and a farther, better reward of $Z > z$, with all other rewards being zero. The rewarding states $s_z$ and $s_Z$ could be terminal goal states, or could in fact allow the agent to continue on some future path. Now, consider the role of the discount factor $\gamma$ in this decision. In order for the agent to pick the higher reward, it would need to be the case that $\gamma^K Z > \gamma^k z$, where $K$ and $k$ are the distances from $s_Z$ and $s_z$ to the agent's location. Thus, there is a minimum value of $\gamma$, namely $\gamma^{K-k} > z/Z$, required for the agent to display foresight (see Figure 1 for an illustration).

The average-reward framework can fix this problem if the environment is a continuing task. However, if the goal states are absorbing (the agent stays in these states forever once it reaches them), then in fact all policies have an average reward of 0 and there is no useful signal for the agent to optimize. This illustrates that considering average rewards does not handle the trade-off between the two subgoals correctly in all cases.

Figure 1. The value of the actions that lead towards the two goals plotted against the timestep distance towards these goals. The lines plot the values of the left and right actions in the states to the left and to the right of the agent, respectively, and the colors correspond to two different discounts. The preferred action in the current state thus depends on the discount: the red discounting scheme of $\gamma = 0.95$ is too short-sighted to prefer the correct goal $s_Z$. Note that for any discount $\gamma < 1$, the distances $k$ and $K$ can be proportionally increased (to $k + H$ and $K + H$ for some $H < \infty$) for $\gamma$ to be insufficient to capture the correct ordering.

Now, consider the same task, but with the choice being between two options (instead of primitive actions), going to each of the respective goals. Ideally, scaling the primitive time step should allow the agent to consistently exhibit foresight and choose the larger reward (since we can assume the primitive time step was a somewhat arbitrary choice anyway). However, the agent will only keep its preference if $P^{o_1}_\gamma(s_Z|s_0) Z > P^{o_2}_\gamma(s_z|s_0) z$, which still entirely depends on $\gamma$.² Hence, the policy over options remains tied critically to the magnitude of the discount factor applied at the primitive time step.

² To see this, note that $\gamma^{d_{max}} \le P^o_\gamma \le \gamma^{d_{min}}$, where $d_{min}$ and $d_{max}$ denote the minimum and maximum duration of an option. If $\gamma^{d_{max}} < \left(\frac{z}{Z}\right)^{\frac{1}{K-k}}$, the agent will flip its choice of option.

## 4. Options with Time Dilation

The example from the previous section highlights the fact that the option transition model is responsible for the effective discount. This leads us to propose two modifications to the classical transition model, aimed at ensuring that options provide real temporal abstraction, which diminishes or eliminates completely the dependence on the primitive-action time step:

1. We allow time dilation in the transitions by using a transition step-discount $\gamma^o_p$ that is larger than the reward step-discount $\gamma^o_r$, thus weakening the dependence on option duration.
2. We add a per-decision, state-dependent discount $\gamma^o_d$, thus reinforcing option-level reasoning over primitive-action policies.

The inner $\gamma^o_r$ has exactly the same interpretation as before, but locally for each option. When reasoning at the higher level, the precise duration of each option matters less, and so $\gamma^o_p > \gamma^o_r$; but the number of decision steps among options becomes relevant, and so $\gamma^o_d$ need not be 1. We denote the set of option discounts by $\Gamma^o = \{\gamma^o_r, \gamma^o_p, \gamma^o_d\}$. The new option transition model is then given by:

$$P^o_\Gamma(s'|s) \overset{\text{def}}{=} \gamma^o_d(s, s')\, \mathbb{E}_{D: s \to s'|o}\big[(\gamma^o_p)^D \mathbb{I}_{S_{t+D} = s'} \mid S_t = s\big].$$

Clearly, $P^o_\Gamma = P^o_\gamma$, and it is consistent with $R^o_{\gamma_r}$, if $\gamma^o_p = \gamma^o_r = \gamma$ and $\gamma^o_d = I$.

Figure 2 plots a simple instance of the coefficients induced by this manner of discounting on a fixed stream of rewards. The two-timescale spiky structure is due to the discrepancy between $\gamma^o_p$ and $\gamma^o_r$: the returns within an option are still discounted with $\gamma^o_r$, but upon termination, $\gamma^o_p$ is invoked.

Figure 2. The discounting coefficient applied to the reward at time $t$ under time dilation, for various constant values of $\Gamma$ and $\gamma_p$, and for random option durations (drawn from a Poisson distribution with $\lambda = 10$). The spikes occur because the reward model (still discounted with $\gamma_r$) controls the discounting inside the option, and the transition model that at option decision points.
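The spiky, two-timescale coefficient pattern can be reproduced in a few lines. The sketch below is a toy illustration (the Poisson duration parameter mirrors the figure; the discount values are arbitrary): it computes the factor multiplying the reward at each primitive step when options of random duration are executed back to back under $\gamma_r$, $\gamma_p$ and a constant $\gamma_d$.

```python
import numpy as np

def dilation_coefficients(durations, gamma_r, gamma_p, gamma_d):
    """Coefficient applied to the reward at each primitive step when options of the
    given durations are executed back to back under time dilation: within an option,
    rewards are discounted with gamma_r from the option's start; each completed option
    multiplies the running factor by gamma_d * gamma_p**duration (not gamma_r**duration)."""
    coeffs, outer = [], 1.0
    for D in durations:
        for i in range(D):
            coeffs.append(outer * gamma_r ** i)   # intra-option discounting
        outer *= gamma_d * gamma_p ** D           # per-decision + transition discount
    return np.array(coeffs)

rng = np.random.default_rng(0)
durations = 1 + rng.poisson(10, size=8)           # random option lengths, as in Figure 2
print(dilation_coefficients(durations, gamma_r=0.9, gamma_p=0.99, gamma_d=0.95))
```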
We will show that $\gamma^o_p$ imposes a bias-variance trade-off on the complexity of estimating the transition model and the value function, where the bias is w.r.t. a consistent discounting scheme $\gamma^o_p = \gamma^o_r$. In the extreme, if $\gamma^o_p = 1$, all of the variance of a given transition $s, s'$ associated with the random variable $D$ that determines the number of steps from $s$ to $s'$ is removed. This comes at the cost of introducing bias, in terms of the difference between $\gamma^o_r$ and $\gamma^o_p$. The additional option-level discount $\gamma^o_d$ can help reduce this bias. In fact, we show that there is a value of $\gamma^o_d$ for which the bias is zero, which occurs when $\gamma^o_d$ captures $P^o$ in a certain sense.

The new option model allows one to effectively redefine the primitive resolution of the agent, simply by considering $\gamma^o_p = 1$. This in turn provides options with the power to represent policies over horizons that would otherwise be too large to capture with a fixed step-discount. Indeed, consider the example from above but with $\gamma^o_p = 1$. Then $P^o_\Gamma = \gamma^o_d$ becomes independent of the step-discount, and able to represent the same policy over options regardless of their length. This is a key motivation of our approach.

For notational simplicity, in the following we will take $\Gamma^o$ to be the same for all options, and denote the relevant discounts simply by $\gamma_r$, $\gamma_p$ and $\gamma_d$. All of the results trivially apply in the general case. Furthermore, instead of the full $\gamma_d$ matrix, it is more practical to consider a diagonal $\gamma_d$ that specifies an option-level discount that depends only on the arrival state $s'$, regardless of where the option started.³ This form of $\gamma^o_d$ lets us rewrite $P^o_\Gamma$ in an intuitive way:

$$P^o_\Gamma(s'|s) = \gamma_d(s')\beta_o(s')\Big(\gamma_p\, p_{\pi_o}(s'|s) + \gamma_p^2 \sum_{s''} p_{\pi_o}(s''|s)(1 - \beta_o(s''))\, p_{\pi_o}(s'|s'') + \dots\Big).$$

³ The same can be achieved by considering a full row-wise constant matrix, but the diagonal form is more efficient and commonly used, see e.g. (White, 2017; Yu & Bertsekas, 2012).

This equation makes it evident that $\gamma_d$ controls the discounting over the agent's decision points (i.e. where the agent chooses an option), while $\gamma_p$ is the intra-option discount. We return to the full matrix view briefly in Section 6.2.

## 5. Convergence Analysis

In this section we derive an equivalence from the model we propose to a step-discounted setting with consistent options (i.e. ones with $\gamma_p = \gamma_r$), and use it to prove expected convergence under mild conditions. Our analysis, both here and in the next section, is for policy evaluation given a fixed policy over options $\mu$, but our experiments test the control setting and illustrate the insights found in the theory. The new option operators that we will work with are given by:

$$P^\mu_\Gamma q(s, o) = \sum_{s'} P^o_\Gamma(s'|s) \sum_{o'} \mu(o'|s') q(s', o'), \qquad T^\mu_\Gamma q(s, o) \overset{\text{def}}{=} R^o_{\gamma_r}(s) + P^\mu_\Gamma q(s, o), \tag{2}$$

with $R^o_{\gamma_r}$ defined as before. We can show that $T^\mu_\Gamma$ is a contraction (i.e. converges asymptotically), so long as a terminating state with a sub-unitary discount is reachable.

**Assumption 5.1.** Options are finite: $d_{max} < \infty$.

This assumption is standard, and can be attained by stopping options after some maximum number of steps, since all option-theory results hold for such semi-Markov terminations (Sutton et al., 1999). We additionally require that there is a chance for an option to terminate in a state whose discount is less than one.

**Assumption 5.2.** For all $o \in O$ and $s \in S$, there exists $s'$ reachable by $\pi_o$ s.t. $\beta_o(s') > 0$ and $\gamma^o_d(s') < 1$, or $\gamma_p < 1$.

The following theorem proves that $T^\mu_\Gamma$ is a contraction, and derives the equivalent problem with a modified reward model, termination scheme, and a generalized step-discount.
**Theorem 1.** If Assumptions 5.1 and 5.2 hold, the operator $T^\mu_\Gamma$ from Eq. (2) is a contraction. The fixed point of $T^\mu_\Gamma$ is equivalent to that of a $\kappa$-discounted options operator $T^\mu_\kappa$ from Eq. (1) with

$$\kappa(s, o, s') = \gamma_p\big(1 - \beta_o(s')(1 - \gamma_d(s'))\big) \le \gamma_p,$$

w.r.t. modified reward and termination functions:

$$z_{\pi_o} = (I - \gamma_p\, p_{(1-\beta)\pi_o})(I - \gamma_r\, p_{(1-\beta)\pi_o})^{-1} r_{\pi_o}, \qquad \zeta_o(s') = \frac{\gamma_d(s')\beta_o(s')}{\gamma_d(s')\beta_o(s') + 1 - \beta_o(s')},$$

where $p_{(1-\beta)\pi_o}(s'|s) \overset{\text{def}}{=} (1 - \beta_o(s'))\, p_{\pi_o}(s'|s)$.

The proof is in Appendix A. This theorem implies that the step discount is controlled by $\gamma_d$ and $\gamma_p$, but also by $\beta_o$. This is appropriate, since $\beta_o$ controls the inner timescale of an option. For example, if $\gamma_p = 1$ and $\gamma_d(s') = 0$ for some $s'$, the discount at $s'$ is $1 - \beta_o(s')$ exactly. On the other hand, the value of $\gamma_d$ directly impacts the new, implicitly induced termination scheme. For example, if $\gamma_d(s') = 0$ and no bootstrapping occurs, then $\zeta$ accounts for it by not permitting any termination. In general, any $\gamma_d(s') < 1$ implies $\zeta_o(s') < \beta_o(s')$, and hence the corresponding induced options are longer. This highlights the fact that our approach yields more effective temporal abstraction.
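To illustrate the equivalence, the sketch below computes the induced step-discount $\kappa$ and the modified termination $\zeta$ of Theorem 1 for given $\beta_o$, $\gamma_d$ and $\gamma_p$; the numeric inputs are placeholders chosen only to show the mapping (e.g. that $\gamma_p = 1$, $\gamma_d(s') = 0$ gives $\kappa = 1 - \beta_o(s')$ and $\zeta_o(s') = 0$).

```python
import numpy as np

def induced_discount_and_termination(beta, gamma_d, gamma_p):
    """Theorem 1: kappa(s, o, s') = gamma_p * (1 - beta_o(s') * (1 - gamma_d(s'))) and
    zeta_o(s') = gamma_d(s') beta_o(s') / (gamma_d(s') beta_o(s') + 1 - beta_o(s'))."""
    beta = np.asarray(beta, dtype=float)
    gamma_d = np.asarray(gamma_d, dtype=float)
    kappa = gamma_p * (1.0 - beta * (1.0 - gamma_d))
    zeta = gamma_d * beta / (gamma_d * beta + 1.0 - beta)
    return kappa, zeta

# Example: three arrival states with different termination probabilities and discounts.
print(induced_discount_and_termination(beta=[0.2, 0.5, 1.0],
                                       gamma_d=[0.0, 0.9, 0.9],
                                       gamma_p=1.0))
```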
## 6. The Bias-Variance Tradeoff in the Option Transition Model

We will now analyze the computational effects of time dilation. We will show that a larger discount in the transition model can reduce estimation variance, at the cost of introducing bias, as compared to the consistent discounting scheme. The inter-option discount $\gamma_d$ then helps control this bias, with a particular shape of $\gamma_d$ removing it altogether. More specifically, we will show that varying $\gamma$ in $P^\mu_\gamma$ from Eq. (1), and more generally replacing $P^\mu_{\gamma_r}$ with $P^\mu_\Gamma$ from Eq. (2), induces a novel bias-variance tradeoff on the approximation error when $P^\mu_\Gamma$ is estimated from samples.

Let $q^\mu_{\gamma_r} = (I - P^\mu_{\gamma_r})^{-1} R_{\gamma_r}$, $q^\mu_\Gamma = (I - P^\mu_\Gamma)^{-1} R_{\gamma_r}$, and $q^\mu_{\hat\Gamma} = (I - \hat P^\mu_\Gamma)^{-1} R_{\gamma_r}$, where $\hat P^\mu_\Gamma$ is the approximate transition model estimated from samples. The approximation loss has the following form:

$$\mathcal{E} = \|q^\mu_{\hat\Gamma} - q^\mu_{\gamma_r}\| = \|q^\mu_{\hat\Gamma} - q^\mu_\Gamma + q^\mu_\Gamma - q^\mu_{\gamma_r}\| \le \underbrace{\|q^\mu_{\hat\Gamma} - q^\mu_\Gamma\|}_{\mathcal{E}_{estim}} + \underbrace{\|q^\mu_\Gamma - q^\mu_{\gamma_r}\|}_{\mathcal{E}_{targ}}. \tag{3}$$

The first term $\mathcal{E}_{estim}$ is the estimation error that contains the variance, while the second term $\mathcal{E}_{targ}$ is the bias in the targets. We analyze them separately below.

### 6.1. Variance

It is widely known that larger discounts, like larger eligibility traces, incur more variance (Jiang et al., 2015; Petrik & Scherrer, 2009; Kearns & Singh, 2000). In the case of options, somewhat counter-intuitively, larger transition discounts $\gamma_p$ may incur less estimation variance, when they are sufficiently large. This becomes evident when considering $\gamma_p = 1$, for which the variance in $\gamma_p^D$ due to the random length of the trajectory is entirely removed. The reason this seems at odds with our knowledge of the properties of e.g. $\lambda$-returns is that the variance incurred by random option duration is not present there. The following result formalizes this intuition and hints at the type of problems in which it is particularly relevant.

**Lemma 1.** Let $d_{min}$ and $d_{max}$ be the minimum and maximum option durations across the option set $O$. Let $F_o$ denote the set of possible terminating states of an option $o$. Let each $P^o_\Gamma$ be estimated from $n$ i.i.d. samples, and let $R^o_{\gamma_r}$ be given. Then, for any policy $\mu$, with probability $1 - \delta$:

$$\mathcal{E}_{estim} = \|q^\mu_{\hat\Gamma} - q^\mu_\Gamma\| \le \frac{(\gamma_p^{d_{min}} - \gamma_p^{d_{max}})\, w + \gamma_p^{d_{min}}\, v}{1 - \|\gamma_d\|\, \gamma_p^{d_{min}}} \sqrt{\frac{1}{2n} \log \frac{2|S||O|}{\delta}}, \tag{4}$$

where

$$v = \max_{o \in O} \Big( \max_{s \in F_o} v^\mu_\Gamma(s) - \min_{s \in F_o} v^\mu_\Gamma(s) \Big)$$

is the maximum variation of value in terminating states, and $w = \max_{o \in O} \min_{s \in F_o} \gamma_d(s)\, v^\mu_\Gamma(s)$.

The lemma is proven in Appendix B. Intuitively, $v$ is a measure of the variability of an option's qualitative outcome, while $w$ is a bound on that outcome together with its associated discount. The shape of the bound depends on the relationship between these two quantities. In particular, if $v$, the variation in the values of final states, is not too large, then $\mathcal{E}_{estim}$ is proportional to $\gamma_p^{d_{min}} - \gamma_p^{d_{max}}$, and monotonically decreases with $\gamma_p$ when $\gamma_p$ is large (see Figure 3 for example shapes). We will observe this behavior empirically on a control task. This class of options corresponds to the typical goal-directed options that terminate in a handful of similar states. On the other hand, if $v$ is large, heavier discounting is needed to account for the difference, and accounting for option duration becomes more important.

Figure 3. The numerator of Eq. (4) plotted against $\gamma_p$ for $v = 0$, $w = 1$, $d_{max} = 10$ and different values of $d_{min}$. We see that there is a decrease in variance near $\gamma_p = 1$. Note that the lower values of $\gamma_p$ that correspond to the other low-variance region may not be sufficient to represent complex policies.

### 6.2. Bias

Now, let us turn to the error $\mathcal{E}_{targ}$ incurred by the discrepancy in the targets.

**Lemma 2.** Let $d_{min}$ and $d_{max}$ be the minimum and maximum option durations across the option set $O$. Let $\mu$ be a policy over options and consider the difference in the value of $\mu$ w.r.t. the option models $\{(R^o_{\gamma_r}, P^o_{\gamma_r})\}_{o \in O}$ and $\{(R^o_{\gamma_r}, P^o_\Gamma)\}_{o \in O}$. We have:

$$\mathcal{E}_{targ} = \|q^\mu_\Gamma - q^\mu_{\gamma_r}\| \le \frac{r_{max}\big[(\gamma_p - \gamma_r)(\gamma_p^{d_{min}} + 1) + \gamma_p(1 - \|\gamma_d\|)\big]}{(1 - \gamma_r)^2 (1 - \|\gamma_d\|\, \gamma_p^{d_{min}})}.$$

The lemma is proven in Appendix C. Consider the second factor in the numerator of the bound. It is composed of two terms, one that reflects the difference between $\gamma_p$ and $\gamma_r$, and another additive term that has to do with the inter-option discount $\gamma_d$. If $\gamma_d = 1$, and there is no inter-option discounting, this term vanishes, and the error reduces to that incurred by the difference in the discounts. Otherwise, there is some bias introduced by $\gamma_d \ne 1$, and some bias introduced by $\gamma_p \ne \gamma_r$. Even though the worst-case bound is additive, these biases can sometimes be in "opposite directions", and reduce the overall error when compared to either one in isolation. In fact, there is a value of $\gamma_d$ that reduces the bias all the way to zero, even if $\gamma_p \ne \gamma_r$. The following proposition derives a sufficient condition for this (proof in Appendix D).

**Proposition 1.** If $\gamma^o_d(s, s') = \frac{P^o_{\gamma_r}(s'|s)}{P^o_{\gamma_p}(s'|s)}$ for all $s, s' \in S$, $o \in O$, there is no bias in the value function, for any $\gamma_p$. That is, $q^\mu_\Gamma = q^\mu_{\gamma_r}$ for any policy $\mu$.

This result suggests an interpretation of $\gamma^o_d$ as a particular importance-sampling ratio of the two option models.⁴ While it is unlikely that one can obtain it exactly, even an approximate $\gamma^o_d$ may help balance the bias (Munos et al., 2016). We leave a precise characterization of the general case for the future.

⁴ Note that in order for the form of $\gamma^o_d$ from this proposition to hold, $\gamma^o_d$ must be a full (rather than diagonal) matrix, whose value is closely related to that of the option transition model, e.g. if $\gamma_p = 1$, $\gamma^o_d = P^o_{\gamma_r}$ exactly.
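A minimal sketch of the correction suggested by Proposition 1, assuming the two option transition models are available as dense matrices (e.g. computed with the `option_models` helper sketched earlier); the guard against division by zero is our own addition, for transitions the option never makes.

```python
import numpy as np

def bias_correcting_gamma_d(P_o_gamma_r, P_o_gamma_p, eps=1e-12):
    """Proposition 1: gamma_d^o(s, s') = P^o_{gamma_r}(s'|s) / P^o_{gamma_p}(s'|s).
    This is a full |S| x |S| matrix; where the denominator is (numerically) zero,
    the option never terminates there and the entry is irrelevant."""
    ratio = np.zeros_like(P_o_gamma_r)
    mask = P_o_gamma_p > eps
    ratio[mask] = P_o_gamma_r[mask] / P_o_gamma_p[mask]
    return ratio
```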
Finally, from Eq. (3) and Lemmas 1 and 2, we have our result:

**Theorem 2.** Let $d_{min}$ and $d_{max}$ be the minimum and maximum option durations across the option set $O$. Let $F_o$ denote the set of possible terminating states of an option $o$. Let $\mu$ be a policy over options, and let each $P^o_\Gamma$ be estimated from $n$ i.i.d. samples. Then, with probability $1 - \delta$, the error in the estimate $q^\mu_{\hat\Gamma}$ is bounded by:

$$\mathcal{E} = \|q^\mu_{\hat\Gamma} - q^\mu_{\gamma_r}\| \le \underbrace{\frac{(\gamma_p^{d_{min}} - \gamma_p^{d_{max}})\, w + \gamma_p^{d_{min}}\, v}{1 - \|\gamma_d\|\, \gamma_p^{d_{min}}} \sqrt{\frac{1}{2n} \log \frac{2|S||O|}{\delta}}}_{\text{variance}} + \underbrace{\frac{r_{max}\big[(\gamma_p - \gamma_r)(\gamma_p^{d_{min}} + 1) + \gamma_p(1 - \|\gamma_d\|)\big]}{(1 - \gamma_r)^2 (1 - \|\gamma_d\|\, \gamma_p^{d_{min}})}}_{\text{bias}},$$

where

$$v = \max_{o \in O} \Big( \max_{s \in F_o} v^\mu_\Gamma(s) - \min_{s \in F_o} v^\mu_\Gamma(s) \Big)$$

is the maximum variation of value in terminating states, and $w = \max_{o \in O} \min_{s \in F_o} \gamma_d(s)\, v^\mu_\Gamma(s)$.

## 7. Experiments

We illustrate empirically the key ideas of this paper:

1. the bias-variance tradeoff obtained in Theorem 2; and
2. the ability of time dilation to extend the agent's horizon and preserve far-sighted policies, irrespective of the size of the environment.

Our approximate planning setting is similar to that described by Jiang et al. (2015). Following that work, and since the reward model is unaffected by our proposal, we do not estimate the reward model in these experiments, and instead use its true value (which can be computed exactly in these tasks). Finally, in Section 7.3, we evaluate the learning performance on an illustrative task with characteristic properties.

Figure 4. The domains used in our experiments. Left: Four Rooms. The agent starts in the top-left room, and aims to navigate to G1 via options that navigate to hallways. The option policies are $\epsilon$-soft and extremely noisy with $\epsilon = 0.5$. Right: Growing Gridworld. The agent's task is to get from the start state S to the goal G. There is another, distractor goal g with a smaller reward. Bottom: Chain. The goal is to collect as much reward as possible in a limited time, by avoiding the distractor reward on the left, and pursuing the sequence of smaller rewards instead.

### 7.1. Bias-Variance

We investigate whether the analytical bias-variance tradeoff can be observed in practice in the control setting on the classical Four Rooms domain (Sutton et al., 1999). Here, the agent aims to navigate to a goal location via options that navigate from inside each room to its hallways (see Fig. 4, left). The reward is zero everywhere, except at the goal, where it is 10. To evaluate the effects of varied option duration, we add $\epsilon$-noise to the typically deterministic option policies. That is, an option takes an action recommended by its original $\pi_o$ w.p. $1 - \epsilon$, and a random action w.p. $\epsilon$. To obtain a clear picture, we consider a very noisy case of $\epsilon = 0.5$.

For each option $o$, and for each state $s \in I_o$, we sample $N$ trajectories to obtain an estimate $\hat P^o_\Gamma$ of $P^o_\Gamma$. We then perform policy iteration w.r.t. the approximate models $\hat P^o_\Gamma$ and the true reward models $R^o_{\gamma_r}$ to obtain the approximate optimal policy $\pi_{\hat\Gamma}$. We then report the certainty equivalence (CE) loss $\frac{1}{|S|} \sum_{s \in S} \big(v^*(s) - v_{\hat\Gamma}(s)\big)$ for the value of this $\pi_{\hat\Gamma}$. See Fig. 5 for the results.

Figure 5. The certainty equivalence loss as a function of $\gamma_p$, for different values of $\gamma_d$ (lower is better). The reward model is known, the transition model is estimated from $N$ samples, and $v_{\hat\Gamma}$ is obtained from solving the corresponding MDP. Average of 100 independent runs. Notice the similarity with Fig. 3, which diminishes as $N$ increases, since the effects of the variance then diminish, while the large error for small $\gamma$-s is due to a large bias. Note the log scale, where the value at 1.0 is biased to be finite.
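The estimation step of this experiment can be sketched as follows. The rollout interface `sample_option(s, o)` and the toy generative process standing in for the Four Rooms options are hypothetical placeholders; only the Monte-Carlo estimator of $P^o_\Gamma$ reflects the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_dilated_model(sample_option, n_states, options, n, gamma_p, gamma_d):
    """Monte-Carlo estimate of P^o_Gamma(s'|s) = gamma_d(s') E[gamma_p^D 1{S_{t+D} = s'}]
    from n rollouts of each option o from each state s."""
    P_hat = {o: np.zeros((n_states, n_states)) for o in options}
    for o in options:
        for s in range(n_states):
            for _ in range(n):
                s_term, D = sample_option(s, o)      # one execution: termination state, duration
                P_hat[o][s, s_term] += gamma_d[s_term] * gamma_p ** D / n
    return P_hat

# Hypothetical stand-in for option execution: each option heads to a fixed terminal
# state and takes a random (geometric) number of steps to get there.
def sample_option(s, o):
    terminal = {0: 3, 1: 7}
    return terminal[o], int(rng.geometric(0.2))

gamma_d = np.full(8, 0.95)
P_hat = estimate_dilated_model(sample_option, n_states=8, options=[0, 1],
                               n=50, gamma_p=1.0, gamma_d=gamma_d)
```

Policy iteration is then run against the estimated $\hat P^o_\Gamma$ and the true $R^o_{\gamma_r}$, exactly as in the tabular semi-MDP case.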
Notice how the loss curves closely mimic the bound on the variance term from Lemma 1 for reasonably high $\gamma_p$, while the bias term dominates the performance at low $\gamma_p$. Note that because options terminate at exactly one state (the hallway), $v$ is zero, and the variance is entirely eliminated at $\gamma_p = 1$.

### 7.2. Horizon Invariance

Recall the scenario described in Section 3. We simulate an experiment that mimics this scenario and observe numerically that our intuitions hold in practice. In particular, we consider a simple Growing Gridworld task (Fig. 4, right). There are two terminal states: g with a smaller reward (of 1) and G with a larger reward (of 2). The preference of the agent between them is entirely determined by its discount factor $\gamma_p$. As before, we estimate $\hat P^o_\Gamma$ from $N$ samples, and obtain $\pi_{\hat\Gamma}$ by policy iteration. We take the value of $N$ to be 2 here. For the estimation to be less trivial, we consider $\epsilon$-soft option policies, as described above, with $\epsilon = 0.05$. We then consider both the value $v^{\pi_{\hat\Gamma}}_{\gamma_{eval}}$ of the optimal policy $\pi_{\hat\Gamma}$ w.r.t. the approximate model, and the value $v^{\pi_\Gamma}_{\gamma_{eval}}$ of the optimal policy $\pi_\Gamma$ w.r.t. the true model $P^o_\Gamma$, both evaluated with a very high $\gamma_{eval} = 1 - 10^{-8}$. We compare two variants: one with $\gamma_p < 1$, $\gamma_d = 1$ (corresponding to the classical option model), and the other with $\gamma_p = 1$, $\gamma_d < 1$ (exploiting time dilation). The reward model is computed with the same value of $\gamma_r < 1$ in both cases.

Figure 6 reports the certainty equivalence gain $\frac{1}{|S|} \sum_s v(s)$ for both the exact and approximate optimal values of these variants. The same pattern is induced in both cases, and the values of the optimal policies diminish as the size of the grid gets larger. Time dilation, on the other hand, allows the options to maintain the same performance regardless of the size of the grid.

Figure 6. The certainty equivalence gain $\frac{1}{|S|} \sum_s v(s)$ as a function of the grid size (higher is better). The value function $v$ is the value w.r.t. a high $\gamma_{eval}$ of the optimal policies w.r.t. (left) the exact model $P^o_\Gamma$ and (right) the approximate model $\hat P^o_\Gamma$. The shaded area denotes standard deviation. In both cases the performance of the variant with $\gamma_p < 1$ deteriorates with the size of the grid, while the variant with $\gamma_p = 1$, $\gamma_d < 1$ is indifferent to the size of the grid. Note that this pattern is irrespective of the chosen value of $\gamma_p$ and would occur at some grid size for any $\gamma_p$.

### 7.3. Learning Performance

We now study the learning performance of an agent in relation to time dilation. We use a simple illustrative task imbued with the realistic properties that motivated our proposal: the chain given in Fig. 4 (bottom). The agent must choose between greedily pursuing a large reward $R$ or a sequence of smaller rewards whose sum is larger than $R$. The setting is continuing, but there is a small chance of death which does not allow for collecting all of the rewards. There is a variable number of steps between the rewards, and each reward is only collectible once. The agent has two options, one for each movement direction, which terminate when a reward is reached. At each primitive time step, there is zero-mean reward noise with parameterized standard deviation $p_n$, and a stickiness probability $p_s$, with which the agent stays in place.

We use Q-learning over options with call-and-return execution. When an option terminates after $D$ steps, it receives a $\gamma_r$-discounted $D$-step return, and bootstraps with a discount of $\gamma_d \gamma_p^D$.
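The update used here can be written compactly as below; this is a minimal tabular sketch of the option-level Q-learning target, with the environment loop, option execution, and exploration left out, and with $\gamma_d$ taken as a scalar.

```python
import numpy as np

def option_q_update(Q, s, o, rewards, s_next, gamma_r, gamma_p, gamma_d, alpha):
    """One call-and-return Q-learning update over options.

    `rewards` holds the D primitive rewards observed while option o ran from s
    until it terminated in s_next.  The target is the gamma_r-discounted D-step
    return, bootstrapped with a discount of gamma_d * gamma_p**D."""
    D = len(rewards)
    G = sum(gamma_r ** i * r for i, r in enumerate(rewards))
    target = G + gamma_d * gamma_p ** D * np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return Q
```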
Figure 7. Top: cumulative performance (total sum of undiscounted reward collected during training) for all of the $\gamma_p, \gamma_r$ settings, plotted with linear interpolation. Bottom: learning curves for each setting showing the mean undiscounted return per episode (with colors corresponding to the heatmap above).

Fig. 7 summarizes the results across a range of $\gamma_r$ and $\gamma_p$ for two settings of the task: noise-free, and with noise as described above.⁵ Qualitatively, there are three levels of performance: convergence to the optimal policy $\pi_r$ (red), convergence to the suboptimal policy $\pi_R$ (blue), and slower convergence to $\pi_r$ (green). In the noise-free case, we observe a stark effect of the importance of the $\gamma_r, \gamma_p$ relationship. When $\gamma_p > \gamma_r$ (i.e. with time dilation), the agent is able to quickly discover $\pi_r$, even for very low values of $\gamma_r$. Reward noise has the effect of yielding intermediate performance for more settings (i.e. a larger green region). In this more challenging case, settings with intermediate values of $\gamma_r$ and high values of $\gamma_p$ yield the best performance, showcasing the effectiveness of time dilation.

⁵ The per-decision $\gamma_d$ in this task does not affect the results qualitatively, and we report results with $\gamma_d = 1$.

## 8. Related Work

The analysis we provided is closely related to that by Jiang et al. (2015), and to earlier results along the same lines by Petrik & Scherrer (2009). In both works, the authors consider the tradeoff between estimation variance and target bias in the quality of the approximate planning solution, due to using a lower discount factor. Jiang et al. (2015) show that it is beneficial to use a lower discount when the number of samples used to estimate the transition model is small. These implications carry over to $\gamma_d$ in the context of options, while $\gamma_p$ controls a more subtle tradeoff that has to do with the random duration of options. Petrik & Scherrer (2009) focus on the bias aspect, and show that in some problems the bias due to using a lower discount can be lower than predicted by the worst case, especially in problems whose rewards are sparse. It would be interesting to identify a similar structure for options.

General transition-based discounting is introduced by White (2017), where it is proposed to use the discount as a formalism for specifying reinforcement learning tasks, and shown that each option then represents a task, since the termination condition together with the step discount induces a transition discount. We alter the option transition model, and hence incur option-transition discounts.

## 9. Discussion

We proposed to provide options with autonomy over their own timescale, by introducing time dilation into the option transition model. We analyzed the bias and variance incurred by doing so, and verified the analytical predictions empirically. These insights are immediately applicable to any setting using the options framework.

While our experiments are in the control setting, the analysis applies to a fixed policy $\mu$. In order to extend Lemma 1 to apply to all policies, we need to consider the relationship of the number of optimal policies under a given model, $|M_\Gamma|$, to $\gamma_d$ and $\gamma_p$. Jiang et al. (2015) give an interesting interpretation of the discount as a policy-complexity control parameter, and show that this quantity grows monotonically with $\gamma$. This analysis is trickier with options, due to there being two parameters, $\gamma_d$ and $\gamma_p$, controlling the discount.

This work offers a formal foundation, from which there are many possible future directions.
Learning the appropriate discount parameters for a given set of options, or even using them as a way to discover options, is the obvious next step. The non-homogeneous discounting structure may act similarly to a deliberation cost (Harb et al., 2018) and encourage non-trivial option discovery. On the other hand, the horizon-invariance property lends itself well to the transfer setting, which is where hierarchical methods are most appropriate. Namely, because our proposal removes the strict dependence on the primitive timescale, it may be possible to successively transfer option models and policies learned on smaller instances of a task to larger instances of the same task, as a form of scaffolding.

## Acknowledgments

The authors thank Hado van Hasselt, Adam White, Will Dabney, Gheorghe Comanici and Andre Barreto for their feedback on earlier manuscripts of this paper, and Mohammad Azar for helping with one of the bounds.

## References

Bacon, P.-L. and Precup, D. A matrix splitting perspective on planning with options. arXiv preprint arXiv:1612.00916, 2016.

Bellman, R. Dynamic Programming. Princeton University Press, 1957.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

Harutyunyan, A., Vrancx, P., Bacon, P.-L., Precup, D., and Nowé, A. Learning with options that terminate off-policy. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-15), pp. 1181-1189, 2015.

Kearns, M. J. and Singh, S. P. Bias-variance error bounds for temporal difference updates. In Proceedings of the 13th International Conference on Computational Learning Theory (COLT-00), pp. 142-147, 2000.

Lehnert, L., Laroche, R., and van Seijen, H. On value function representation of long horizon problems. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1-3):159-195, 1996.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1046-1054, 2016.

Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems 21, pp. 1265-1272, 2009.

Precup, D., Sutton, R. S., and Singh, S. Theoretical results on reinforcement learning with temporally abstract options. In Proceedings of the 10th European Conference on Machine Learning (ECML-98), pp. 382-393, 1998.

Puterman, M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.

Schwartz, A. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the 10th International Conference on Machine Learning (ICML-93), 1993.

Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2017.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
Tsitsiklis, J. N. and Van Roy, B. On average versus discounted reward temporal-difference learning. Machine Learning, 49:179-191, 2002.

White, M. Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML-17), 2017.

Yu, H. and Bertsekas, D. P. Weighted Bellman equations and their applications in approximate dynamic programming. Technical report, 2012.