# Safe and efficient off-policy reinforcement learning

Rémi Munos (munos@google.com), Google DeepMind; Thomas Stepleton (stepleton@google.com), Google DeepMind; Anna Harutyunyan (anna.harutyunyan@vub.ac.be), Vrije Universiteit Brussel; Marc G. Bellemare (bellemare@google.com), Google DeepMind

## Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it *safely* uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is *efficient* as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins's Q(λ), which has been an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.

## 1 Introduction

One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function? Return-based methods (where "return" refers to the sum of discounted rewards $\sum_t \gamma^t r_t$) offer some advantages over value bootstrap methods: they are better behaved when combined with function approximation, and quickly propagate the fruits of exploration (Sutton, 1996). On the other hand, value bootstrap methods are more readily applied to off-policy data, a common use case. In this paper we show that learning from returns need not be at cross-purposes with off-policy learning.

We start from the recent work of Harutyunyan et al. (2016), who show that naive off-policy policy evaluation, without correcting for the off-policyness of a trajectory, still converges to the desired $Q^\pi$ value function provided the behaviour policy $\mu$ and target policy $\pi$ are not too far apart (the maximum allowed distance depends on the $\lambda$ parameter). Their $Q^\pi(\lambda)$ algorithm learns from trajectories generated by $\mu$ simply by summing discounted off-policy corrected rewards at each time step. Unfortunately, the assumption that $\mu$ and $\pi$ are close is restrictive, as well as difficult to uphold in the control case, where the target policy is greedy with respect to the current Q-function. In that sense this algorithm is not *safe*: it does not handle the case of arbitrary "off-policyness". Alternatively, the Tree-backup (TB(λ)) algorithm (Precup et al., 2000) tolerates arbitrary target/behaviour discrepancies by scaling information (here called *traces*) from future temporal differences by the product of target policy probabilities. TB(λ) is not *efficient* in the near on-policy case (similar $\mu$ and $\pi$), though, as traces may be cut prematurely, blocking learning from full returns.

In this work, we express several off-policy, return-based algorithms in a common form. From this we derive an improved algorithm, Retrace(λ), which is both safe and efficient, enjoying convergence guarantees for off-policy policy evaluation and, more importantly, for the control setting.
Retrace(λ) can learn from full returns retrieved from past policy data, as in the context of experience replay (Lin, 1993), which has returned to favour with advances in deep reinforcement learning (Mnih et al., 2015; Schaul et al., 2016). Off-policy learning is also desirable for exploration, since it allows the agent to deviate from the target policy currently under evaluation. To the best of our knowledge, this is the first online return-based off-policy control algorithm which does not require the GLIE (Greedy in the Limit with Infinite Exploration) assumption (Singh et al., 2000). In addition, we provide as a corollary the first proof of convergence of Watkins's Q(λ) (see, e.g., Watkins, 1989; Sutton and Barto, 1998). Finally, we illustrate the significance of Retrace(λ) in a deep learning setting by applying it to the suite of Atari 2600 games provided by the Arcade Learning Environment (Bellemare et al., 2013).

We consider an agent interacting with a Markov Decision Process $(\mathcal{X}, \mathcal{A}, \gamma, P, r)$. Here $\mathcal{X}$ is a finite state space, $\mathcal{A}$ the action space, $\gamma \in [0,1)$ the discount factor, $P$ the transition function mapping state-action pairs $(x,a) \in \mathcal{X} \times \mathcal{A}$ to distributions over $\mathcal{X}$, and $r : \mathcal{X} \times \mathcal{A} \to [-R_{\max}, R_{\max}]$ is the reward function. For notational simplicity we will consider a finite action space, but the case of infinite, possibly continuous, action spaces can be handled by the Retrace(λ) algorithm as well. A policy $\pi$ is a mapping from $\mathcal{X}$ to a distribution over $\mathcal{A}$. A Q-function $Q$ maps each state-action pair $(x,a)$ to a value in $\mathbb{R}$; in particular, the reward $r$ is a Q-function. For a policy $\pi$ we define the operator $P^\pi$:
$$(P^\pi Q)(x,a) := \sum_{x' \in \mathcal{X}}\sum_{a' \in \mathcal{A}} P(x'|x,a)\,\pi(a'|x')\,Q(x',a').$$
The value function for a policy $\pi$, $Q^\pi$, describes the expected discounted sum of rewards associated with following $\pi$ from a given state-action pair. Using operator notation, we write this as
$$Q^\pi := \sum_{t \ge 0}\gamma^t (P^\pi)^t r. \tag{1}$$
The Bellman operator $T^\pi$ for a policy $\pi$ is defined as $T^\pi Q := r + \gamma P^\pi Q$, and its fixed point is $Q^\pi$, i.e. $T^\pi Q^\pi = Q^\pi = (I - \gamma P^\pi)^{-1} r$. The Bellman optimality operator introduces a maximization over the set of policies:
$$\mathcal{T} Q := r + \gamma \max_\pi P^\pi Q. \tag{2}$$
Its fixed point is $Q^*$, the unique optimal value function (Puterman, 1994). It is this quantity that we will seek to obtain when we talk about the "control setting".

**Return-based operators:** The λ-return extension (Sutton, 1988) of the Bellman operators considers exponentially weighted sums of $n$-step returns:
$$T_\lambda^\pi Q := (1-\lambda)\sum_{n \ge 0}\lambda^n (T^\pi)^{n+1} Q = Q + (I - \lambda\gamma P^\pi)^{-1}(T^\pi Q - Q),$$
where $T^\pi Q - Q$ is the Bellman residual of $Q$ for policy $\pi$. Examination of the above shows that $Q^\pi$ is also the fixed point of $T_\lambda^\pi$. At one extreme ($\lambda = 0$) we have the Bellman operator $T_{\lambda=0}^\pi Q = T^\pi Q$, while at the other ($\lambda = 1$) we have the policy evaluation operator $T_{\lambda=1}^\pi Q = Q^\pi$, which can be estimated using Monte Carlo methods (Sutton and Barto, 1998). Intermediate values of $\lambda$ trade off estimation bias with sample variance (Kearns and Singh, 2000).

We seek to evaluate a target policy $\pi$ using trajectories drawn from a behaviour policy $\mu$. If $\pi = \mu$, we are on-policy; otherwise, we are off-policy. We will consider trajectories of the form
$$x_0 = x,\ a_0 = a,\ r_0,\ x_1, a_1, r_1,\ x_2, a_2, r_2, \ldots$$
with $a_t \sim \mu(\cdot|x_t)$, $r_t = r(x_t, a_t)$ and $x_{t+1} \sim P(\cdot|x_t, a_t)$. We denote by $\mathcal{F}_t$ this sequence up to time $t$, and write $\mathbb{E}_\mu$ for the expectation with respect to both $\mu$ and the MDP transition probabilities. Throughout, we write $\|\cdot\|$ for the supremum norm.
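The operator notation above is compact, and it can help to see it instantiated. The following minimal NumPy sketch (ours, not from the paper; all sizes and names are illustrative assumptions) builds a small random MDP, implements $P^\pi$ and $T^\pi$, and checks that $Q^\pi = (I - \gamma P^\pi)^{-1} r$ is indeed the fixed point of $T^\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nA, gamma = 4, 3, 0.9

# Toy MDP: P[x, a] is a distribution over next states, r[x, a] a reward in [-1, 1].
P = rng.dirichlet(np.ones(nX), size=(nX, nA))           # shape (nX, nA, nX)
r = rng.uniform(-1.0, 1.0, size=(nX, nA))

# A fixed target policy pi(a|x), here chosen at random.
pi = rng.dirichlet(np.ones(nA), size=nX)                 # shape (nX, nA)

def P_pi(Q):
    """(P^pi Q)(x, a) = sum_{x', a'} P(x'|x, a) pi(a'|x') Q(x', a')."""
    V = (pi * Q).sum(axis=1)                             # V(x') = E_pi Q(x', .)
    return P @ V                                         # shape (nX, nA)

def T_pi(Q):
    """Bellman operator: T^pi Q = r + gamma P^pi Q."""
    return r + gamma * P_pi(Q)

# Closed form Q^pi = (I - gamma P^pi)^{-1} r, with P^pi as an (nX*nA x nX*nA) matrix.
P_pi_mat = (P[:, :, :, None] * pi[None, None, :, :]).reshape(nX * nA, nX * nA)
Q_pi = np.linalg.solve(np.eye(nX * nA) - gamma * P_pi_mat, r.reshape(-1)).reshape(nX, nA)

# Q^pi is the fixed point of T^pi.
assert np.allclose(T_pi(Q_pi), Q_pi)
```

Representing $P^\pi$ as an explicit $|\mathcal{X}||\mathcal{A}| \times |\mathcal{X}||\mathcal{A}|$ matrix is only practical for tiny MDPs, but it makes the fixed-point identities of this section easy to verify.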
## 2 Off-Policy Algorithms

We are interested in two related off-policy learning problems. In the *policy evaluation* setting, we are given a fixed policy $\pi$ whose value $Q^\pi$ we wish to estimate from sample trajectories drawn from a behaviour policy $\mu$. In the *control* setting, we consider a sequence of policies that depend on our own sequence of Q-functions (such as ε-greedy policies), and seek to approximate $Q^*$.

The general operator that we consider for comparing several return-based off-policy algorithms is
$$\mathcal{R}Q(x,a) := Q(x,a) + \mathbb{E}_\mu\Big[\sum_{t \ge 0}\gamma^t\Big(\prod_{s=1}^{t}c_s\Big)\big(r_t + \gamma\,\mathbb{E}_\pi Q(x_{t+1},\cdot) - Q(x_t,a_t)\big)\Big], \tag{3}$$
for some non-negative coefficients $(c_s)$, where we write $\mathbb{E}_\pi Q(x,\cdot) := \sum_a \pi(a|x)Q(x,a)$ and define $\prod_{s=1}^{t}c_s = 1$ when $t = 0$. By extension of the idea of eligibility traces (Sutton and Barto, 1998), we informally call the coefficients $(c_s)$ the *traces* of the operator.

**Importance sampling (IS):** $c_s = \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$. Importance sampling is the simplest way to correct for the discrepancy between $\mu$ and $\pi$ when learning from off-policy returns (Precup et al., 2000, 2001; Geist and Scherrer, 2014). The off-policy correction uses the product of the likelihood ratios between $\pi$ and $\mu$. Notice that $\mathcal{R}Q$ defined in (3) with this choice of $(c_s)$ yields $Q^\pi$ for any $Q$. For $Q = 0$ we recover the basic IS estimate $\sum_{t \ge 0}\gamma^t\big(\prod_{s=1}^{t}c_s\big)r_t$; thus (3) can be seen as a variance reduction technique (with a baseline $Q$). It is well known that IS estimates can suffer from large, possibly infinite, variance (mainly due to the variance of the product $\frac{\pi(a_1|x_1)}{\mu(a_1|x_1)}\cdots\frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}$), which has motivated further variance reduction techniques such as in (Mahmood and Sutton, 2015; Mahmood et al., 2015; Hallak et al., 2015).

**Off-policy $Q^\pi(\lambda)$ and $Q^*(\lambda)$:** $c_s = \lambda$. A recent alternative proposed by Harutyunyan et al. (2016) introduces an off-policy correction based on a Q-baseline (instead of correcting the probability of the sample path as in IS). This approach, called $Q^\pi(\lambda)$ and $Q^*(\lambda)$ for policy evaluation and control, respectively, corresponds to the choice $c_s = \lambda$. It offers the advantage of avoiding the blow-up of the variance of the product of ratios encountered with IS. Interestingly, this operator contracts around $Q^\pi$ provided that $\mu$ and $\pi$ are sufficiently close to each other. Defining $\varepsilon := \max_x\|\pi(\cdot|x) - \mu(\cdot|x)\|_1$ as the level of "off-policyness", the authors prove that the operator defined by (3) with $c_s = \lambda$ is a contraction mapping around $Q^\pi$ for $\lambda < \frac{1-\gamma}{\gamma\varepsilon}$, and around $Q^*$ for the worst case of $\lambda < \frac{1-\gamma}{2\gamma}$. Unfortunately, $Q^\pi(\lambda)$ requires knowledge of $\varepsilon$, and the condition for $Q^*(\lambda)$ is very conservative. Neither $Q^\pi(\lambda)$ nor $Q^*(\lambda)$ is safe, as they do not guarantee convergence for arbitrary $\pi$ and $\mu$.

**Tree-backup, TB(λ):** $c_s = \lambda\,\pi(a_s|x_s)$. The TB(λ) algorithm of Precup et al. (2000) corrects for the target/behaviour discrepancy by multiplying each term of the sum by the product of target policy probabilities. The corresponding operator defines a contraction mapping for any policies $\pi$ and $\mu$, which makes it a safe algorithm. However, this algorithm is not efficient in the near on-policy case (where $\mu$ and $\pi$ are similar), as it unnecessarily cuts the traces, preventing it from making use of full returns: indeed, we need not discount stochastic on-policy transitions (as shown by Harutyunyan et al.'s results about $Q^\pi(\lambda)$).

**Retrace(λ):** $c_s = \lambda\min\big(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\big)$. Our contribution is an algorithm, Retrace(λ), that takes the best of the three previous algorithms. Retrace(λ) uses an importance sampling ratio truncated at 1. Compared to IS, it does not suffer from the variance explosion of the product of IS ratios.
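To make the four choices concrete, here is a short Python sketch (our own illustration, not the authors' code) that computes a sample-based estimate of the target defined by the operator (3) for one trajectory generated under $\mu$; the trace coefficient $c_s$ is selected by name. It assumes a tabular $Q$ and known action probabilities for $\pi$ and $\mu$.

```python
import numpy as np

def corrected_return_target(traj, Q, pi, mu, gamma, lam, method="retrace"):
    """Sample-based estimate of the general operator (3) for one trajectory.

    traj   : list of transitions (x_t, a_t, r_t, x_next) generated under mu.
    Q      : tabular estimates, Q[x, a].
    pi, mu : action probabilities pi[x, a], mu[x, a] of the target / behaviour policies.
    method : which trace coefficient c_s to use (see Table 1).
    """
    x0, a0 = traj[0][0], traj[0][1]
    target = Q[x0, a0]
    trace = 1.0                                  # running product c_1 ... c_t (= 1 at t = 0)
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:                                # the trace coefficients apply for s >= 1
            ratio = pi[x, a] / mu[x, a]
            if method == "is":                   # importance sampling
                c = ratio
            elif method == "qpi":                # Q^pi(lambda)
                c = lam
            elif method == "tb":                 # Tree-backup TB(lambda)
                c = lam * pi[x, a]
            elif method == "retrace":            # Retrace(lambda)
                c = lam * min(1.0, ratio)
            else:
                raise ValueError(method)
            trace *= c
        td = r + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]
        target += (gamma ** t) * trace * td
    return target
```

Averaging such per-trajectory targets approximates the expectation in (3); the online algorithms of Section 3.3 achieve the same effect incrementally.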
Now, similarly to $Q^\pi(\lambda)$ and unlike TB(λ), Retrace(λ) does not cut the traces in the on-policy case, making it possible to benefit from the full returns. In the off-policy case, the traces are safely cut, similarly to TB(λ). In particular, $\min\big(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\big) \ge \pi(a_s|x_s)$: Retrace(λ) does not cut the traces as much as TB(λ).

In the subsequent sections, we will show the following:

- For any traces $0 \le c_s \le \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$ (thus including the Retrace(λ) operator), the return-based operator (3) is a γ-contraction around $Q^\pi$, for arbitrary policies $\mu$ and $\pi$.
- In the control case (where $\pi$ is replaced by a sequence of increasingly greedy policies), the online Retrace(λ) algorithm converges a.s. to $Q^*$, without requiring the GLIE assumption.
- As a corollary, Watkins's Q(λ) converges a.s. to $Q^*$.

| Algorithm | Definition of $c_s$ | Estimation variance | Guaranteed convergence* | Uses full returns (near on-policy) |
|---|---|---|---|---|
| Importance sampling | $\frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}$ | High | for any $\pi$, $\mu$ | yes |
| $Q^\pi(\lambda)$ | $\lambda$ | Low | only for $\pi$ close to $\mu$ | yes |
| TB($\lambda$) | $\lambda\,\pi(a_s \mid x_s)$ | Low | for any $\pi$, $\mu$ | no |
| Retrace($\lambda$) | $\lambda\min\big(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\big)$ | Low | for any $\pi$, $\mu$ | yes |

Table 1: Properties of several algorithms defined in terms of the general operator given in (3). *Guaranteed convergence of the expected operator $\mathcal{R}$.

## 3 Analysis of Retrace(λ)

We will in turn analyze both the off-policy policy evaluation and the control settings. We will show that $\mathcal{R}$ is a contraction mapping in both settings (under a mild additional assumption for the control case).

### 3.1 Policy Evaluation

Consider a fixed target policy $\pi$. For ease of exposition we consider a fixed behaviour policy $\mu$, noting that our result extends to the setting of sequences of behaviour policies $(\mu_k : k \in \mathbb{N})$. Our first result states the γ-contraction of the operator (3) defined by any set of non-negative coefficients $c_s = c_s(a_s, \mathcal{F}_s)$ (written this way to emphasize that $c_s$ can be a function of the whole history $\mathcal{F}_s$) under the assumption that $0 \le c_s \le \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$.

**Theorem 1.** *The operator $\mathcal{R}$ defined by (3) has a unique fixed point $Q^\pi$. Furthermore, if for each $a_s \in \mathcal{A}$ and each history $\mathcal{F}_s$ we have $c_s = c_s(a_s, \mathcal{F}_s) \in \big[0, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\big]$, then for any Q-function $Q$,*
$$\|\mathcal{R}Q - Q^\pi\| \le \gamma\,\|Q - Q^\pi\|.$$

The following lemma will be useful in proving Theorem 1 (proof in the appendix).

**Lemma 1.** *The difference between $\mathcal{R}Q$ and its fixed point $Q^\pi$ is*
$$\mathcal{R}Q(x,a) - Q^\pi(x,a) = \mathbb{E}_\mu\Big[\sum_{t \ge 1}\gamma^t\Big(\prod_{i=1}^{t-1}c_i\Big)\Big(\mathbb{E}_\pi[(Q - Q^\pi)(x_t,\cdot)] - c_t\,(Q - Q^\pi)(x_t,a_t)\Big)\Big].$$

*Proof (Theorem 1).* The fact that $Q^\pi$ is the fixed point of the operator $\mathcal{R}$ is obvious from (3), since
$$\mathbb{E}_{x_{t+1}\sim P(\cdot|x_t,a_t)}\big[r_t + \gamma\,\mathbb{E}_\pi Q^\pi(x_{t+1},\cdot) - Q^\pi(x_t,a_t)\big] = (T^\pi Q^\pi - Q^\pi)(x_t,a_t) = 0,$$
because $Q^\pi$ is the fixed point of $T^\pi$. Now, from Lemma 1, and defining $\Delta Q := Q - Q^\pi$, we have
$$\begin{aligned}
\mathcal{R}Q(x,a) - Q^\pi(x,a) &= \sum_{t \ge 1}\gamma^t\,\mathbb{E}_{x_{1:t},a_{1:t}}\Big[\Big(\prod_{i=1}^{t-1}c_i\Big)\big(\mathbb{E}_\pi\Delta Q(x_t,\cdot) - c_t\,\Delta Q(x_t,a_t)\big)\Big] \\
&= \sum_{t \ge 1}\gamma^t\,\mathbb{E}_{x_{1:t},a_{1:t-1}}\Big[\Big(\prod_{i=1}^{t-1}c_i\Big)\big(\mathbb{E}_\pi\Delta Q(x_t,\cdot) - \mathbb{E}_{a_t}[c_t(a_t,\mathcal{F}_t)\,\Delta Q(x_t,a_t)\,|\,\mathcal{F}_t]\big)\Big] \\
&= \sum_{t \ge 1}\gamma^t\,\mathbb{E}_{x_{1:t},a_{1:t-1}}\Big[\Big(\prod_{i=1}^{t-1}c_i\Big)\sum_b\big(\pi(b|x_t) - \mu(b|x_t)\,c_t(b,\mathcal{F}_t)\big)\,\Delta Q(x_t,b)\Big].
\end{aligned}$$
Now, since $\pi(b|x_t) - \mu(b|x_t)\,c_t(b,\mathcal{F}_t) \ge 0$, we have that $\mathcal{R}Q(x,a) - Q^\pi(x,a) = \sum_{y,b} w_{y,b}\,\Delta Q(y,b)$, i.e. a linear combination of the $\Delta Q(y,b)$ weighted by the non-negative coefficients
$$w_{y,b} := \sum_{t \ge 1}\gamma^t\,\mathbb{E}_{x_{1:t},a_{1:t-1}}\Big[\Big(\prod_{i=1}^{t-1}c_i\Big)\big(\pi(b|x_t) - \mu(b|x_t)\,c_t(b,\mathcal{F}_t)\big)\,\mathbb{I}\{x_t = y\}\Big].$$
The sum of those coefficients is
$$\begin{aligned}
\sum_{y,b} w_{y,b} &= \sum_{t \ge 1}\gamma^t\,\mathbb{E}_{x_{1:t},a_{1:t-1}}\Big[\Big(\prod_{i=1}^{t-1}c_i\Big)\sum_b\big(\pi(b|x_t) - \mu(b|x_t)\,c_t(b,\mathcal{F}_t)\big)\Big] \\
&= \sum_{t \ge 1}\gamma^t\,\mathbb{E}_{x_{1:t},a_{1:t-1}}\Big[\Big(\prod_{i=1}^{t-1}c_i\Big)\,\mathbb{E}_{a_t}[1 - c_t(a_t,\mathcal{F}_t)\,|\,\mathcal{F}_t]\Big] \\
&= \sum_{t \ge 1}\gamma^t\,\mathbb{E}_{x_{1:t},a_{1:t}}\Big[\prod_{i=1}^{t-1}c_i - \prod_{i=1}^{t}c_i\Big] = \gamma C - (C - 1),
\end{aligned}$$
where $C := \mathbb{E}_\mu\big[\sum_{t \ge 0}\gamma^t\prod_{i=1}^{t}c_i\big]$. Since $C \ge 1$, we have that $\sum_{y,b} w_{y,b} \le \gamma$. Thus $\mathcal{R}Q(x,a) - Q^\pi(x,a)$ is a sub-convex combination of the $\Delta Q(y,b)$, weighted by non-negative coefficients $w_{y,b}$ which sum to at most $\gamma$; thus $\mathcal{R}$ is a γ-contraction mapping around $Q^\pi$.
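Theorem 1 is also easy to check numerically when the trace is Markovian, since the expected operator then has the matrix form $\mathcal{R}Q = Q + (I - \gamma P^{c\mu})^{-1}(T^\pi Q - Q)$ used in Sections 3.2 and 3.3. The sketch below (ours, for illustration only; the MDP is random and tiny) verifies the γ-contraction around $Q^\pi$ for the Retrace(λ) traces.

```python
import numpy as np

rng = np.random.default_rng(1)
nX, nA, gamma, lam = 5, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(nX), size=(nX, nA))          # P[x, a, x']
r = rng.uniform(-1.0, 1.0, size=(nX, nA))
pi = rng.dirichlet(np.ones(nA), size=nX)                # target policy
mu = rng.dirichlet(np.ones(nA), size=nX)                # behaviour policy
c = lam * np.minimum(1.0, pi / mu)                      # Retrace traces c(a, x)

def op_matrix(weights):
    """(nX*nA) x (nX*nA) matrix with entries P(x'|x, a) * weights[x', a']."""
    return (P[:, :, :, None] * weights[None, None, :, :]).reshape(nX * nA, nX * nA)

P_pi = op_matrix(pi)                                    # P^pi
P_cmu = op_matrix(mu * c)                               # P^{c mu} (sub-stochastic)
I = np.eye(nX * nA)
Q_pi = np.linalg.solve(I - gamma * P_pi, r.reshape(-1))

def retrace(Q):
    """Expected Retrace operator: R Q = Q + (I - gamma P^{c mu})^{-1} (T^pi Q - Q)."""
    bellman_residual = r.reshape(-1) + gamma * P_pi @ Q - Q
    return Q + np.linalg.solve(I - gamma * P_cmu, bellman_residual)

Q = rng.normal(size=nX * nA)                            # arbitrary starting Q
ratio = np.max(np.abs(retrace(Q) - Q_pi)) / np.max(np.abs(Q - Q_pi))
assert ratio <= gamma + 1e-12                           # gamma-contraction around Q^pi
```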
**Remark 1.** Notice that the coefficient $C$ in the proof of Theorem 1 depends on $(x,a)$. If we write $\eta(x,a) := 1 - (1-\gamma)\,\mathbb{E}_\mu\big[\sum_{t \ge 0}\gamma^t\prod_{s=1}^{t}c_s\big]$, then we have shown that
$$|\mathcal{R}Q(x,a) - Q^\pi(x,a)| \le \eta(x,a)\,\|Q - Q^\pi\|.$$
Thus $\eta(x,a) \in [0,\gamma]$ is an $(x,a)$-specific contraction coefficient, which equals $\gamma$ when $c_1 = 0$ (the trace is cut immediately) and can be close to zero when learning from full returns ($\mathbb{E}_\mu[c_t]$ close to 1 for all $t$).

### 3.2 Control

In the control setting, the single target policy $\pi$ is replaced by a sequence of policies $(\pi_k)$ which depend on $(Q_k)$. While most prior work has focused on strictly greedy policies, here we consider the larger class of *increasingly greedy* sequences. We now make this notion precise.

**Definition 1.** We say that a sequence of policies $(\pi_k : k \in \mathbb{N})$ is increasingly greedy w.r.t. a sequence $(Q_k : k \in \mathbb{N})$ of Q-functions if the following property holds for all $k$:
$$P^{\pi_{k+1}}Q_{k+1} \ge P^{\pi_k}Q_{k+1}.$$

Intuitively, this means that each $\pi_{k+1}$ is at least as greedy as the previous policy $\pi_k$ for $Q_{k+1}$. Many natural sequences of policies are increasingly greedy, including $\varepsilon_k$-greedy policies (with non-increasing $\varepsilon_k$) and softmax policies (with non-increasing temperature). See the proofs in the appendix.

We will assume that $c_s = c_s(a_s, \mathcal{F}_s) = c(a_s, x_s)$ is *Markovian*, in the sense that it depends on $x_s, a_s$ (as well as the policies $\pi$ and $\mu$) only, and not on the full past history. This allows us to define the (sub-)probability transition operator
$$(P^{c\mu}Q)(x,a) := \sum_{x'}\sum_{a'} P(x'|x,a)\,\mu(a'|x')\,c(a',x')\,Q(x',a').$$
Finally, as an additional requirement for convergence in the control case, we assume that $Q_0$ satisfies $T^{\pi_0}Q_0 \ge Q_0$ (this can be achieved by the pessimistic initialization $Q_0 = -R_{\max}/(1-\gamma)$).

**Theorem 2.** *Consider an arbitrary sequence of behaviour policies $(\mu_k)$ (which may depend on $(Q_k)$) and a sequence of target policies $(\pi_k)$ that are increasingly greedy w.r.t. the sequence $(Q_k)$ defined by*
$$Q_{k+1} = \mathcal{R}_k Q_k,$$
*where the return operator $\mathcal{R}_k$ is defined by (3) for $\pi_k$ and $\mu_k$ and a Markovian $c_s = c(a_s, x_s) \in \big[0, \frac{\pi_k(a_s|x_s)}{\mu_k(a_s|x_s)}\big]$. Assume the target policies $\pi_k$ are $\varepsilon_k$-away from the greedy policies w.r.t. $Q_k$, in the sense that $T^{\pi_k}Q_k \ge \mathcal{T}Q_k - \varepsilon_k\|Q_k\|\,e$, where $e$ is the vector with 1-components. Further suppose that $T^{\pi_0}Q_0 \ge Q_0$. Then for any $k \ge 0$,*
$$\|Q_{k+1} - Q^*\| \le \gamma\,\|Q_k - Q^*\| + \varepsilon_k\,\|Q_k\|.$$
*In consequence, if $\varepsilon_k \to 0$, then $Q_k \to Q^*$.*

*Sketch of proof* (the full proof is in the appendix). Using $P^{c\mu_k}$, the Retrace(λ) operator rewrites as
$$\mathcal{R}_k Q = Q + \sum_{t \ge 0}\gamma^t (P^{c\mu_k})^t (T^{\pi_k}Q - Q) = Q + (I - \gamma P^{c\mu_k})^{-1}(T^{\pi_k}Q - Q).$$
We now lower- and upper-bound the term $Q_{k+1} - Q^*$.

*Upper bound on $Q_{k+1} - Q^*$.* We prove that $Q_{k+1} - Q^* \le A_k(Q_k - Q^*)$ with $A_k := \gamma(I - \gamma P^{c\mu_k})^{-1}\big[P^{\pi_k} - P^{c\mu_k}\big]$. Since $c_t \in \big[0, \frac{\pi_k(a_t|x_t)}{\mu_k(a_t|x_t)}\big]$, we deduce that $A_k$ has non-negative elements whose sum over each row is at most $\gamma$. Thus
$$Q_{k+1} - Q^* \le \gamma\,\|Q_k - Q^*\|\,e. \tag{4}$$

*Lower bound on $Q_{k+1} - Q^*$.* Using the fact that $T^{\pi_k}Q_k \ge T^{\pi^*}Q_k - \varepsilon_k\|Q_k\|\,e$, we have
$$\begin{aligned}
Q_{k+1} - Q^* &\ge Q_{k+1} - T^{\pi_k}Q_k + \gamma P^{\pi^*}(Q_k - Q^*) - \varepsilon_k\|Q_k\|\,e \\
&= \gamma P^{c\mu_k}(I - \gamma P^{c\mu_k})^{-1}(T^{\pi_k}Q_k - Q_k) + \gamma P^{\pi^*}(Q_k - Q^*) - \varepsilon_k\|Q_k\|\,e.
\end{aligned} \tag{5}$$

*Lower bound on $T^{\pi_k}Q_k - Q_k$.* Since the sequence $(\pi_k)$ is increasingly greedy w.r.t. $(Q_k)$, we have
$$T^{\pi_{k+1}}Q_{k+1} - Q_{k+1} \ge T^{\pi_k}Q_{k+1} - Q_{k+1} = r + (\gamma P^{\pi_k} - I)\mathcal{R}_k Q_k = B_k(T^{\pi_k}Q_k - Q_k), \tag{6}$$
where $B_k := \gamma\big[P^{\pi_k} - P^{c\mu_k}\big](I - \gamma P^{c\mu_k})^{-1}$. Since $P^{\pi_k} - P^{c\mu_k}$ and $(I - \gamma P^{c\mu_k})^{-1}$ are non-negative matrices, so is $B_k$. Thus
$$T^{\pi_k}Q_k - Q_k \ge B_{k-1}B_{k-2}\cdots B_0\,(T^{\pi_0}Q_0 - Q_0) \ge 0,$$
since we assumed $T^{\pi_0}Q_0 - Q_0 \ge 0$. Thus (5) implies that $Q_{k+1} - Q^* \ge \gamma P^{\pi^*}(Q_k - Q^*) - \varepsilon_k\|Q_k\|\,e$. Combining the above with (4), we deduce $\|Q_{k+1} - Q^*\| \le \gamma\|Q_k - Q^*\| + \varepsilon_k\|Q_k\|$. When $\varepsilon_k \to 0$, we further deduce that the $Q_k$ are bounded, and thus $Q_k \to Q^*$.
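Because each $\mathcal{R}_k$ has the closed matrix form above, the iteration $Q_{k+1} = \mathcal{R}_k Q_k$ of Theorem 2 can be simulated exactly on a small tabular MDP. The sketch below (our illustration; the toy MDP, the fixed behaviour policy, and the schedule $\varepsilon_k = 2^{-k}$ are arbitrary choices, not from the paper) uses $\varepsilon_k$-greedy target policies and a pessimistic initialization, and approaches $Q^*$ as the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
nX, nA, gamma, lam = 5, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(nX), size=(nX, nA))            # P[x, a, x']
r = rng.uniform(-1.0, 1.0, size=(nX, nA))

def op_matrix(weights):
    """(P^w Q)(x, a) = sum_{x', a'} P(x'|x, a) weights[x', a'] Q(x', a') as a matrix."""
    return (P[:, :, :, None] * weights[None, None, :, :]).reshape(nX * nA, nX * nA)

def eps_greedy(Q, eps):
    """eps-greedy policy probabilities pi[x, a] w.r.t. a flattened Q-vector."""
    pi = np.full((nX, nA), eps / nA)
    pi[np.arange(nX), Q.reshape(nX, nA).argmax(axis=1)] += 1 - eps
    return pi

# Reference Q* by value iteration on the flattened Q-vector.
Q_star = np.zeros(nX * nA)
for _ in range(2000):
    Q_star = r.reshape(-1) + gamma * (P @ Q_star.reshape(nX, nA).max(axis=1)).reshape(-1)

Q = np.full(nX * nA, -1.0 / (1 - gamma))                  # pessimistic init: T^{pi_0} Q_0 >= Q_0
mu = rng.dirichlet(np.ones(nA), size=nX)                  # fixed, arbitrary behaviour policy
for k in range(200):
    eps_k = 0.5 ** k                                      # target policies become greedy (eps_k -> 0)
    pi_k = eps_greedy(Q, eps_k)
    c = lam * np.minimum(1.0, pi_k / mu)                  # Retrace traces
    P_pi, P_cmu = op_matrix(pi_k), op_matrix(mu * c)
    residual = r.reshape(-1) + gamma * P_pi @ Q - Q       # T^{pi_k} Q_k - Q_k
    Q = Q + np.linalg.solve(np.eye(nX * nA) - gamma * P_cmu, residual)   # Q_{k+1} = R_k Q_k

print(np.max(np.abs(Q - Q_star)))                         # close to zero, as Theorem 2 predicts
```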
### 3.3 Online algorithms

So far we have analyzed the contraction properties of the expected $\mathcal{R}$ operators. We now describe online algorithms which can learn from sample trajectories. We analyze the algorithms in the *every-visit* form (Sutton and Barto, 1998), which is the more practical generalization of the first-visit form. In this section we only consider the Retrace(λ) algorithm, defined with the coefficient $c = \lambda\min(1, \pi/\mu)$. For that $c$, let us rewrite the operator $P^{c\mu}$ as $\lambda P^{\pi\wedge\mu}$, where
$$(P^{\pi\wedge\mu}Q)(x,a) := \sum_{y}P(y|x,a)\sum_{b}\min\big(\pi(b|y), \mu(b|y)\big)\,Q(y,b),$$
and write the Retrace operator as $\mathcal{R}Q = Q + (I - \lambda\gamma P^{\pi\wedge\mu})^{-1}(T^\pi Q - Q)$. We focus on the control case, noting that a similar (and simpler) result can be derived for policy evaluation.

**Theorem 3.** *Consider a sequence of sample trajectories, with the $k$-th trajectory $x_0, a_0, r_0, x_1, a_1, r_1, \ldots$ generated by following $\mu_k$: $a_t \sim \mu_k(\cdot|x_t)$. For each $(x,a)$ along this trajectory, with $s$ being the time of first occurrence of $(x,a)$, update*
$$Q_{k+1}(x,a) \leftarrow Q_k(x,a) + \alpha_k\sum_{t \ge s}\delta_t^{\pi_k}\sum_{j=s}^{t}\gamma^{t-j}\Big(\prod_{i=j+1}^{t}c_i\Big)\mathbb{I}\{x_j, a_j = x, a\}, \tag{7}$$
*where $\delta_t^{\pi_k} := r_t + \gamma\,\mathbb{E}_{\pi_k}Q_k(x_{t+1},\cdot) - Q_k(x_t,a_t)$ and $\alpha_k = \alpha_k(x_s, a_s)$. We consider the Retrace(λ) algorithm where $c_i = \lambda\min\big(1, \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)}\big)$. Assume that $(\pi_k)$ are increasingly greedy w.r.t. $(Q_k)$ and are each $\varepsilon_k$-away from the greedy policies $(\pi_{Q_k})$, i.e. $\max_x\|\pi_k(\cdot|x) - \pi_{Q_k}(\cdot|x)\|_1 \le \varepsilon_k$, with $\varepsilon_k \to 0$. Assume that $P^{\pi_k}$ and $P^{\pi_k\wedge\mu_k}$ asymptotically commute: $\lim_k\|P^{\pi_k}P^{\pi_k\wedge\mu_k} - P^{\pi_k\wedge\mu_k}P^{\pi_k}\| = 0$. Assume further that (1) all states and actions are visited infinitely often: $\sum_{t \ge 0}\mathbb{P}\{x_t, a_t = x, a\} \ge D > 0$; (2) the sample trajectories are finite, in terms of the second moment of their lengths $T_k$: $\mathbb{E}_{\mu_k}[T_k^2] < \infty$; (3) the stepsizes obey the usual Robbins-Monro conditions. Then $Q_k \to Q^*$ a.s.*

The proof extends similar convergence proofs of TD(λ) by Bertsekas and Tsitsiklis (1996) and of optimistic policy iteration by Tsitsiklis (2003), and is provided in the appendix. Notice that, compared to Theorem 2, we do not assume that $T^{\pi_0}Q_0 - Q_0 \ge 0$ here. However, we make the additional (rather technical) assumption that $P^{\pi_k}$ and $P^{\pi_k\wedge\mu_k}$ commute at the limit. This is satisfied, for example, when the probability assigned by the behaviour policy $\mu_k(\cdot|x)$ to the greedy action $\pi_{Q_k}(x)$ is independent of $x$. Examples include ε-greedy policies, or more generally mixtures between the greedy policy $\pi_{Q_k}$ and an arbitrary distribution $\mu$ (see Lemma 5 in the appendix for the proof):
$$\mu_k(a|x) = \varepsilon\,\frac{\mu(a|x)}{1 - \mu(\pi_{Q_k}(x)|x)}\,\mathbb{I}\{a \ne \pi_{Q_k}(x)\} + (1-\varepsilon)\,\mathbb{I}\{a = \pi_{Q_k}(x)\}. \tag{8}$$
Notice that the mixture coefficient $\varepsilon$ need not go to 0.

## 4 Discussion of the results

### 4.1 Choice of the trace coefficients $c_s$

Theorems 1 and 2 ensure convergence to $Q^\pi$ and $Q^*$ for any trace coefficient $c_s \in \big[0, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\big]$. However, to make the best choice of $c_s$, we need to consider the speed of convergence, which depends on both (1) the variance of the online estimate, which indicates how many online updates are required in a single iteration of $\mathcal{R}$, and (2) the contraction coefficient of $\mathcal{R}$.

**Variance:** The variance of the estimate strongly depends on the variance of the product trace $(c_1\cdots c_t)$, which is not an easy quantity to control in general, as the $(c_s)$ are usually not independent. However, assuming independence and stationarity of the $(c_s)$, the variance $\mathbb{V}\big[\sum_t \gamma^t c_1\cdots c_t\big]$ is at least $\sum_t \gamma^{2t}\,\mathbb{V}(c)^t$, which is finite only if $\mathbb{V}(c) < 1/\gamma^2$. Thus, an important requirement for a numerically stable algorithm is for $\mathbb{V}(c)$ to be as small as possible, and certainly no more than $1/\gamma^2$. This rules out importance sampling (for which $c = \frac{\pi(a|x)}{\mu(a|x)}$ and $\mathbb{V}(c\,|\,x) = \sum_a \mu(a|x)\big(\frac{\pi(a|x)}{\mu(a|x)} - 1\big)^2$, which may be larger than $1/\gamma^2$ for some $\pi$ and $\mu$), and is the reason we choose $c \le 1$.
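The variance argument is easy to see empirically. The short Monte Carlo sketch below (ours; a single dummy state and random policies, purely for illustration) compares the variance of the product trace $c_1\cdots c_t$ under importance sampling with its variance under the truncated Retrace traces.

```python
import numpy as np

rng = np.random.default_rng(3)
nA, horizon, n_samples, lam = 4, 20, 100_000, 1.0
pi = rng.dirichlet(np.ones(nA))          # target policy over actions (single dummy state)
mu = rng.dirichlet(np.ones(nA))          # behaviour policy

actions = rng.choice(nA, size=(n_samples, horizon), p=mu)          # action sequences drawn from mu
ratios = pi[actions] / mu[actions]

is_trace = np.cumprod(ratios, axis=1)                               # IS: product of likelihood ratios
retrace_trace = np.cumprod(lam * np.minimum(1.0, ratios), axis=1)   # Retrace: ratios truncated at 1

print("variance of c_1...c_t at t =", horizon)
print("  importance sampling:", is_trace[:, -1].var())
print("  Retrace(1):         ", retrace_trace[:, -1].var())
```

With the truncation, the product trace never exceeds 1, so its variance stays bounded; the raw IS product routinely spans many orders of magnitude.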
**Contraction speed:** The contraction coefficient $\eta \in [0,\gamma]$ of $\mathcal{R}$ (see Remark 1) depends on how much the traces have been cut, and should be as small as possible (since it takes $\log(1/\epsilon)/\log(1/\eta)$ iterations of $\mathcal{R}$ to obtain an $\epsilon$-approximation). It is smallest when the traces are not cut at all: if $c_s = 1$ for all $s$, $\mathcal{R}$ is the policy evaluation operator, which produces $Q^\pi$ in a single iteration. Indeed, when the traces are cut, we do not benefit from learning from full returns (in the extreme, $c_1 = 0$ and $\mathcal{R}$ reduces to the one-step Bellman operator with $\eta = \gamma$).

A reasonable trade-off between low variance (when the $c_s$ are small) and high contraction speed (when the $c_s$ are large) is given by Retrace(λ), for which we provide convergence of the online algorithm.

If we relax the assumption that the trace is Markovian (in which case only the result for policy evaluation has been proven so far), we could trade off a low trace at some time for a possibly larger-than-1 trace at another time, as long as their product is less than 1. A possible choice could be
$$c_s = \lambda\min\Big(\frac{1}{c_1\cdots c_{s-1}},\ \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\Big). \tag{9}$$

### 4.2 Other topics of discussion

**No GLIE assumption.** The crucial point of Theorem 2 is that convergence to $Q^*$ occurs for arbitrary behaviour policies. Thus the online result in Theorem 3 does not require the behaviour policies to become greedy in the limit with infinite exploration (i.e. the GLIE assumption, Singh et al., 2000). We believe Theorem 3 provides the first convergence result to $Q^*$ for a λ-return (with λ > 0) algorithm that does not require this (hard to satisfy) assumption.

**Proof of Watkins's Q(λ).** As a corollary of Theorem 3, when selecting our target policies $\pi_k$ to be greedy w.r.t. $Q_k$ (i.e. $\varepsilon_k = 0$), we deduce that Watkins's Q(λ) (e.g., Watkins, 1989; Sutton and Barto, 1998) converges a.s. to $Q^*$ (under the assumption that $\mu_k$ commutes asymptotically with the greedy policies, which is satisfied, e.g., for $\mu_k$ defined by (8)). We believe this is the first such proof.

**Increasingly greedy policies.** The assumption that the sequence of target policies $(\pi_k)$ is increasingly greedy w.r.t. the sequence $(Q_k)$ is more general than just considering greedy policies w.r.t. $(Q_k)$ (which is Watkins's Q(λ)), and leads to more efficient algorithms. Indeed, using non-greedy target policies $\pi_k$ may speed up convergence, as the traces are not cut as frequently. Of course, in order to converge to $Q^*$, we eventually need the target policies (and not the behaviour policies, as mentioned above) to become greedy in the limit (i.e. $\varepsilon_k \to 0$ as defined in Theorem 2).

**Comparison to $Q^\pi(\lambda)$.** Unlike Retrace(λ), $Q^\pi(\lambda)$ does not need to know the behaviour policy $\mu$. However, it fails to converge when $\mu$ is far from $\pi$. Retrace(λ) uses its knowledge of $\mu$ (for the chosen actions) to cut the traces and safely handle arbitrary policies $\pi$ and $\mu$.

**Comparison to TB(λ).** Similarly to $Q^\pi(\lambda)$, TB(λ) does not need knowledge of the behaviour policy $\mu$. But as a consequence, TB(λ) is not able to benefit from possible near on-policy situations, cutting traces unnecessarily when $\pi$ and $\mu$ are close.

*[Figure 1: Inter-algorithm score distribution for λ-return (λ = 1) variants and Q-Learning (λ = 0).]*

**Estimating the behaviour policy.** In the case where $\mu$ is unknown, it is reasonable to build an estimate $\hat\mu$ from observed samples and use $\hat\mu$ instead of $\mu$ in the definition of the trace coefficients $c_s$. This may actually even lead to a better estimate, as analyzed by Li et al. (2015).
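A minimal sketch of one way to do this (our own illustration; the count-based estimator with add-one smoothing is an assumption of ours, not the estimator analyzed by Li et al. (2015)):

```python
import numpy as np
from collections import defaultdict

class BehaviourEstimate:
    """Count-based estimate mu_hat(a|x) with add-one (Laplace) smoothing."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.counts = defaultdict(lambda: np.zeros(n_actions))

    def update(self, x, a):
        self.counts[x][a] += 1

    def prob(self, x, a):
        c = self.counts[x]
        return (c[a] + 1.0) / (c.sum() + self.n_actions)

def retrace_trace(pi, mu_hat, x, a, lam=1.0):
    """Trace c_s = lam * min(1, pi(a|x) / mu_hat(a|x)) using the estimated behaviour policy."""
    return lam * min(1.0, pi[x, a] / mu_hat.prob(x, a))
```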
**Continuous action space.** Let us mention that Theorems 1 and 2 extend to the case of (measurable) continuous or infinite action spaces. The trace coefficients will make use of the densities $\min(1, d\pi/d\mu)$ instead of the probabilities $\min(1, \pi/\mu)$. This is not possible with TB(λ).

Open questions include: (1) removing the technical assumption that $P^{\pi_k}$ and $P^{\pi_k\wedge\mu_k}$ asymptotically commute; (2) relaxing the Markov assumption in the control case in order to allow trace coefficients $c_s$ of the form (9).

## 5 Experimental Results

To validate our theoretical results, we employ Retrace(λ) in an experience replay (Lin, 1993) setting, where sample transitions are stored within a large but bounded replay memory and subsequently replayed as if they were new experience. Naturally, older data in the memory is usually drawn from a policy which differs from the current policy, offering an excellent point of comparison for the algorithms presented in Section 2. Our agent adapts the DQN architecture of Mnih et al. (2015) to replay short sequences from the memory (details in the appendix) instead of single transitions. The Q-function target update for a sample sequence $x_t, a_t, r_t, \ldots, x_{t+k}$ is
$$\Delta Q(x_t, a_t) = \sum_{s=t}^{t+k-1}\gamma^{s-t}\Big(\prod_{i=t+1}^{s}c_i\Big)\big(r(x_s,a_s) + \gamma\,\mathbb{E}_\pi Q(x_{s+1},\cdot) - Q(x_s,a_s)\big).$$

We compare our algorithms' performance on 60 different Atari 2600 games in the Arcade Learning Environment (Bellemare et al., 2013) using Bellemare et al.'s inter-algorithm score distribution. Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best score for a particular game, within the set of algorithms under comparison. If $g \in \{1, \ldots, 60\}$ is a game and $z_{g,a}$ the inter-algorithm score on $g$ for algorithm $a$, then the score distribution function is $f(x) := |\{g : z_{g,a} \ge x\}|/60$. Roughly, a strictly higher curve corresponds to a better algorithm.

Across values of λ, λ = 1 performs best, save for $Q^*(\lambda)$, where λ = 0.5 obtains slightly superior performance. However, $Q^*(\lambda)$ is highly sensitive to the choice of λ (see Figure 1, left, and Table 2 in the appendix). Both Retrace(λ) and TB(λ) achieve dramatically higher performance than Q-Learning early on and maintain their advantage throughout. Compared to TB(λ), Retrace(λ) offers a narrower but still marked advantage, being the best performer on 30 games; TB(λ) claims 15 of the remainder. Per-game details are given in the appendix.

**Conclusion.** Retrace(λ) can be seen as an algorithm that automatically, efficiently, and safely adjusts the length of the return to the degree of "off-policyness" of any available data.

**Acknowledgments.** The authors thank Daan Wierstra, Nicolas Heess, Hado van Hasselt, Ziyu Wang, David Silver, Audrunas Gruslys, Georg Ostrovski, Hubert Soyer, and others at Google DeepMind for their very useful feedback on this work.

## References

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.

Geist, M. and Scherrer, B. (2014). Off-policy learning with eligibility traces: A survey. The Journal of Machine Learning Research, 15(1):289–333.

Hallak, A., Tamar, A., Munos, R., and Mannor, S. (2015). Generalized emphatic temporal difference learning: Bias-variance analysis. arXiv:1509.05172.

Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016). Q(λ) with off-policy corrections.
Kearns, M. J. and Singh, S. P. (2000). Bias-variance error bounds for temporal difference updates. In Conference on Computational Learning Theory, pages 142–147.

Li, L., Munos, R., and Szepesvári, C. (2015). Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS).

Lin, L. (1993). Scaling up reinforcement learning for robot control. In Machine Learning: Proceedings of the Tenth International Conference, pages 182–189.

Mahmood, A. R. and Sutton, R. S. (2015). Off-policy learning based on weighted importance sampling with linear computational complexity. In Conference on Uncertainty in Artificial Intelligence.

Mahmood, A. R., Yu, H., White, M., and Sutton, R. S. (2015). Emphatic temporal-difference learning. arXiv:1507.01569.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pages 417–424.

Precup, D., Sutton, R. S., and Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In International Conference on Learning Representations.

Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308.

Sutton, R. and Barto, A. (1998). Reinforcement learning: An introduction. Cambridge Univ Press.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8.

Tsitsiklis, J. N. (2003). On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3:59–72.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK.