# offpolicy_actorcritic_with_emphatic_weightings__940a73e2.pdf

Journal of Machine Learning Research 24 (2023) 1-63 Submitted 11/21; Revised 7/22; Published 3/23

Oﬀ-Policy Actor-Critic with Emphatic Weightings

Eric Graves graves@ualberta.ca Ehsan Imani imani@ualberta.ca Raksha Kumaraswamy kumarasw@ualberta.ca Martha White whitem@ualberta.ca Reinforcement Learning and Artiﬁcial Intelligence Laboratory Department of Computing Science, University of Alberta Edmonton, Alberta, Canada T6G 2E8

Editor: Joelle Pineau

A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simpliﬁed form for the gradient. The oﬀ-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit oﬀ-policy policy gradient theorem. In this work, we unify these objectives into one oﬀ-policy objective, and provide a policy gradient theorem for this uniﬁed objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semigradient) oﬀ-policy actor-critic methods particularly Oﬀ-Policy Actor-Critic (OﬀPAC) and Deterministic Policy Gradient (DPG) converge to the wrong solution whereas ACE ﬁnds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoﬀs made by each gradient approximation. We ﬁnd that by approximating the emphatic weightings directly, ACE performs as well as or better than OﬀPAC in all settings tested. Keywords: oﬀ-policy learning, policy gradient, actor-critic, reinforcement learning

1. Introduction

Policy gradient methods are a general class of algorithms for learning optimal policies for both the on and oﬀ-policy settings. In policy gradient methods, a parameterized policy is improved using gradient ascent (Williams, 1992), with seminal work in actor-critic algorithms (Witten, 1977; Barto et al., 1983) and many techniques since proposed to reduce variance of the estimates of this gradient (Konda and Tsitsiklis, 2000; Weaver and Tao, 2001; Greensmith et al., 2004; Peters et al., 2005; Bhatnagar et al., 2007, 2009; Grondman et al., 2012; Gu et al., 2017b). These algorithms rely on a fundamental theoretical result: the policy gradient theorem. This theorem (Sutton et al., 1999a; Marbach and Tsitsiklis, 2001) simpliﬁes estimation of the gradient, which would otherwise require diﬃcult-to-estimate gradients with respect to the stationary distribution of the policy and potentially of the action-values.

c 2023 Eric Graves, Ehsan Imani, Raksha Kumaraswamy, and Martha White.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1350.html.

Graves, Imani, Kumaraswamy, and White

These gradients can be sampled on-policy or oﬀ-policy. On-policy methods are limited to learning about the agent s current policy: the policy must be executed to obtain a sample of the gradient. Conversely, in oﬀ-policy learning the agent can learn about many policies that are diﬀerent from the policy being executed. Sampling these gradients is most straightforward in the on-policy setting, and so most policy gradient methods are on-policy. Oﬀ-policy methods, however, are critical in an online setting, where an agent generates a single stream of interaction with its environment. Oﬀ-policy methods can learn from data generated by older versions of a policy, known as experience replay, a critical factor in the recent success of deep reinforcement learning (Lin, 1992; Mnih et al., 2015; Schaul et al., 2016). They also enable learning from other forms of suboptimal data, including data generated by human demonstration, non-learning controllers, and even random behaviour. Oﬀ-policy methods also enable learning about the optimal policy while executing an exploratory policy (Watkins and Dayan, 1992), thereby addressing the exploration-exploitation tradeoﬀ. Finally, oﬀ-policy methods allow an agent to learn about many diﬀerent policies at once, forming the basis for a predictive understanding of an agent s environment (Sutton et al., 2011; White, 2015) and enabling the learning of options (Sutton et al., 1999b; Precup, 2000; Klissarov and Precup, 2021). With options, an agent can determine optimal (short) behaviours from its current state. Oﬀ-policy policy gradient methods have been developed, particularly in recent years where the need for data eﬃciency and decorrelated samples in deep reinforcement learning require the use of experience replay and so oﬀ-policy learning. This work began with the Oﬀ-Policy Actor-Critic algorithm (OﬀPAC) (Degris et al., 2012a), where an oﬀ-policy policy gradient theorem was provided that parallels the on-policy policy gradient theorem, but only for tabular policy representations.1 This motivated further development, including a recent actor-critic algorithm proven to converge when the critic uses linear function approximation (Maei, 2018), as well as several methods using the approximate oﬀ-policy gradient such as Deterministic Policy Gradient (DPG) (Silver et al., 2014; Lillicrap et al., 2015), Actor-Critic with Experience Replay (ACER) (Wang et al., 2016), and Interpolated Policy Gradient (IPG) (Gu et al., 2017a). However, it remained an open question whether the foundational theorem that underlies these algorithms can be generalized beyond tabular representations. This question was resolved with the development of an oﬀ-policy policy gradient theorem, for general policy parametrization (Imani et al., 2018). The key insight is that the gradient can be simpliﬁed if the gradient in each state is weighted with an emphatic weighting. This result, combined with previous methods for incrementally estimating emphatic weightings (Yu, 2015; Sutton et al., 2016), allowed for the design of a new oﬀ-policy actor-critic algorithm, called Actor-Critic with Emphatic Weightings (ACE). Afterwards, an algorithm was proposed that directly estimates the emphatic weightings using a gradient temporal diﬀerence update, with an associated proof of convergence using the standard two-timescale analysis (Zhang et al., 2020). However, ACE and the underlying oﬀ-policy policy gradient theorem do not obviously resolve the dilemma of computing the oﬀ-policy gradient, because ACE and previously OﬀPAC were introduced under what is called the excursions objective. This objective weights states by the visitation distribution of the behaviour policy. This is sensible in the

1. See B. Errata in Degris et al. (2012b) that the theorem only applies to tabular policy representations.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

parallel oﬀ-policy setting, where many policies are learned in parallel from one stream of experience. The agent reasons about the outcomes of those policies when taking excursions from its current behaviour. In the episodic setting, however, weighting the states by the visitation distribution of the behaviour policy is not appropriate. Instead, an ideal episodic objective should be weighted by the start-state distribution, and this objective should be optimized from oﬀ-policy data without changing this weighting. Potentially because of this mismatch, most recent methods have pursued strategies that directly correct state-weightings,2 to obtain gradients of either the episodic objective (Liu et al., 2020b) or a ﬁnite-horizon objective (Kallus and Uehara, 2020). This approach involves estimating the distribution of state visitation under the behaviour policy and the discounted state visitation under the target policy to obtain an importance sampling ratio to reweight the updates. One disadvantage to this approach is that the state visitation distribution must be estimated, and these estimates can be extremely biased both by the initialization and by the limitations of the parametric function class used for the estimates. In this work, we revisit estimating the oﬀ-policy gradient with emphatic weightings. First, we propose a unifying objective for the episodic setting and the excursions objective. This objective facilitates the design of a single, unifying oﬀ-policy algorithm pertinent to both settings. Then, we extend the oﬀ-policy policy gradient theorem both to this objective and to objectives that incorporate entropy regularization. We further show that the onpolicy episodic objective can also be seen as a special case, through the use of interest functions to specify interest in the set of start states. We then highlight the diﬀerence to previous oﬀ-policy approaches including OﬀPAC and DPG, which use a biased gradient that we call the semi-gradient because it omits a component of the gradient. We prove that in a simple three state counterexample, with two states aliased, the stationary point under semi-gradients is suboptimal, unlike ACE which converges to the optimal solution. Though such state-aliasing is not pathological, it does seem to contradict the success of methods built on OﬀPAC in the literature. We show that, under a condition called the strong growth condition (Schmidt and Roux, 2013), which is obtained if there is suﬃcient representational capacity, the semi-gradient solution is equivalent to that of ACE. Further, we highlight that even outside this setting, in some cases the semi-gradient can be seen as using the sound update underlying ACE, but with a diﬀerent state weighting. These two insights help explain why OﬀPAC has been surprisingly eﬀective and suggest promising directions for a lower-variance version of ACE. Finally, we discuss several improvements to the algorithm, including using direct estimates of the emphatic weightings to reduce variance, incorporating experience replay, and balancing between the higher-variance full gradient update and the lower-variance semigradient update. We empirically investigate variants of ACE in several benchmark environments, and ﬁnd that those with direct estimates of the emphatic weighting consistently perform as well as or better than OﬀPAC. Remark on Contributions: This paper builds on our conference paper (Imani et al., 2018). The conference paper presented a policy gradient theorem for one oﬀ-policy objective

2. A notable exception is an approach that uses a nonparametric Bellman equation essentially using nonparametric estimates of the environment model (Tosatto et al., 2020). This approach allows direct estimation of the gradient by taking derivatives through this nonparametric Bellman equation.

Graves, Imani, Kumaraswamy, and White

function, and a counterexample showing that OﬀPAC and DPG can converge to the wrong solution, with experiments limited to the simple three-state counterexample. An important limitation of that work is that it only applies to the excursions objective, which does not obviously relate to the standard policy gradient objective. Further, the work primarily investigated the true gradient, and did not provide much insight into how to practically use the algorithm. This paper builds on the conference paper in several substantive ways.

1. We ﬁrst clarify the confusion surrounding objectives by providing a more general objective that includes both the standard objective and the excursions objective. We provide all the derivations for this more general objective (their Theorems 1 and 2 and Proposition 1, which are our Theorems 2 and 3 and Proposition 6). The result is nearly identical, but reiterated with the more general deﬁnition, and slightly diﬀerent terminology. We also highlight that we can recover a sound on-policy algorithm using interest functions. 2. We prove that the semi-gradient from OﬀPAC has suboptimal solutions on the counterexample. The conference paper only showed this empirically. To obtain this result, we needed to extend the derivation to include entropy regularization, so that the optimal policy is stochastic and its parameters are bounded. Then we proved that the stationary points for OﬀPAC on the counterexample do not include the optimal point. 3. We provide insight into why semi-gradient methods can perform well, by showing that in some cases the weighting does not aﬀect the solution and that locally we can characterize the semi-gradient as a gradient update with a diﬀerent state weighting. 4. We provide several algorithmic improvements to ACE, particularly by directly estimating the emphatic weightings to give a lower variance update. In our experiments, this algorithm outperforms or performs comparably to all others, including OﬀPAC. 5. We conduct a more thorough empirical investigation of ACE in more settings, including two classic benchmarks and a new partially-observable image-based domain. The experiments are aimed at deeply investigating the role of diﬀerent components, including (a) the objective used for learning versus evaluation, (b) the algorithm for estimating the emphatic weightings, and (c) the algorithm used for the critic.

2. Problem Formulation

Throughout the paper, we use bolded symbols to represent vectors and matrices, and uppercase italicized symbols to represent random variables. We consider a Markov decision process (S, A, P, r), where S denotes the set of states, A denotes the set of actions, P : S A (S) denotes the one-step state transition dynamics, and r : S A S R denotes the transition-based reward function. At each timestep t = 1, 2, . . ., the agent selects an action At according to its behaviour policy µ, where µ : S (A). The environment responds by transitioning into a new state St+1 according to P, and emits a scalar reward Rt+1 = r(St, At, St+1). The discounted sum of future rewards given actions are selected according to some target policy π is called the return, and deﬁned as:

Gt def= Rt+1 + γt+1Rt+2 + γt+1γt+2Rt+3 + . . . (1)

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

= Rt+1 + γt+1Gt+1.

We use transition-based discounting γ : S A S [0, 1], as it facilitates specifying diﬀerent tasks (termination) on top of the same MDP (White, 2017). This generality is useful in our setting, where we may be interested in learning multiple (option) policies, oﬀ-policy. The state value function for π and γ is deﬁned as

vπ(s) def= Eπ[Gt|St = s] s S, (2)

where under discrete states and actions this corresponds to

a A π(a|s) X

s S P(s |s, a)[r(s, a, s ) + γ(s, a, s )vπ(s )] s S

and under continuous states and actions this corresponds to

S P(s |s, a)[r(s, a, s ) + γ(s, a, s )vπ(s )] da ds s S.

When possible, we will opt for the more generic expectation notation, to address both the ﬁnite and continuous cases. In oﬀ-policy control, the agent s goal is to learn a target policy π while following the behaviour policy µ. The target policy πθ is a diﬀerentiable function of a weight vector θ Rnθ, nθ N. The goal of the agent is to learn a policy that maximizes total reward. Depending on the setting, however, this goal is speciﬁed with diﬀerent objectives, as we discuss in the next section. Throughout, we assume that the limiting distribution dµ(s) under µ exists, where

dµ(s) def= lim t P(St = s|s0, µ) (3)

and P(St = s|s0, µ) is the probability or density that St = s when starting in state s0 and executing µ. Similarly, dπ denotes the limiting distribution3 of states under π. Note that this limiting distribution exists in either the episodic or continuing settings (White, 2017; Huang, 2020). One way to see this is that in the episodic setting under transition-based discounting, there are no explicit terminal states. Rather, the agent transitions between the state right before termination, back to a start state, as it would for a continuing problem. The transition-based discount truncates returns, to specify that the goal of the agent from each state is to maximize return for the episode.

3. The Weighted-Excursions Objective for Oﬀ-Policy Control

In this section we introduce the weighted-excursions objective that uniﬁes two objectives commonly considered in oﬀ-policy control. The objective encompasses the standard onpolicy episodic objective as well as the excursions objective that allows option policies to be

3. This term is sometimes overloaded to mean the discounted state visitation starting from a single start state s0: P t=0 γt P(St = s|s0, π). This overloading comes from the deﬁnition given in the original policy gradient theorem (Sutton et al., 1999a). We only use it to mean state visitation (limiting distribution).

Graves, Imani, Kumaraswamy, and White

learned oﬀ-policy with diﬀerent termination conditions. Throughout, we will deﬁne both the ﬁnite state and continuous state versions. The standard episodic objective is

J0(θ) def= X

s S d0(s)vπθ(s) or J0(θ) def= Z

S d0(s)vπθ(s) ds (4)

where d0 (S) is the start state distribution. This objective can be optimized in either the on-policy or oﬀ-policy settings, with diﬀerent updates to account for the fact that data is gathered according to µ in the oﬀ-policy setting. Another objective that has been considered in the oﬀ-policy setting is the excursions objective (Degris et al., 2012b)

Jexc(θ) def= X

s S dµ(s)vπθ(s), (5)

where the state weighting is determined by the behaviour policy s limiting state distribution dµ. This objective assumes that the target policy πθ will be executed as an excursion from the current behaviour or that the agent will use the target policy in planning under the current state visitation. In the options framework (Sutton et al., 1999b), for example, the agent learns the optimal policies for a variety of diﬀerent options, each with diﬀerent termination conditions. The behaviour policy itself µ could be learning in a continuing environment, even though the objective for the option policy πθ is episodic, in that it has termination conditions. The excursions objective, however, is not necessarily appropriate for learning an alternative optimal policy. For example, consider an episodic setting where a reasonable behaviour policy is currently being used, and we would like to learn an improved policy. The improved policy will be executed from the set of start states, not from the visitation distribution under the behaviour policy. In this setting, we would like to optimize the standard episodic objective, using oﬀ-policy data generated by the current behaviour policy. These two settings can be uniﬁed by considering the weighted-excursions objective

Jµ(θ) def= X

s S dµ(s)i(s)vπθ(s) or Jµ(θ) def= Z

S dµ(s)i(s)vπθ(s) ds, (6)

where the weighting i : S [0, ) represents the interest in a state (Sutton et al., 2016). This objective reduces to the episodic (start-state) formulation by setting i(s) = d0(s)/dµ(s). This choice is not just hypothetical; we will show how our algorithm designed for the excursions objective can also be used for the start-state formulation by using an easy-to-compute setting of the interest i(s) for each s S. This generalization goes beyond unifying the previous two objectives, and also naturally allows us to reweight states based on their importance for the target policy. For example, when learning an option, the interest function could be set to 1 in states from which the policy will be executed (i.e., the initiation set), and zero elsewhere. More generally, we could consider the generic objective P s S d(s)vπθ(s) for any state weighting d. Then the weighted-excursions objective is simply an instance of this more generic objective, with d = dµ(s)i(s). The reason that we opt for the weighted-excursions

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

objective is because it is suﬃciently general to incorporate our two settings of interest but suﬃciently restricted to facilitate development of a practical oﬀ-policy algorithm. In fact, the idea of interest comes from the same work on emphatic weightings (Sutton et al., 2016) on which we build our algorithm. This work also provides insight into how to easily incorporate such interest functions in the oﬀ-policy setting. Remark: In this work, we focus on episodic objectives those with termination but it is worthwhile to contrast this to the objective used for continuing problems. For continuing problems, there is a compelling case for the average reward objective

Jave(θ) def= X

s S dπθ(s)rπθ(s) or Jave(θ) def= Z

S dπθ(s)rπθ(s) ds (7)

for rπ(s) = Eπ[r(s, A, S )]. Optimizing Jave is equivalent to optimizing P s S dπ(s)vπ(s) for a constant 0 γ < 1 (Sutton and Barto, 2018). Intuitively, the agent seeks to maximize state values while shifting its state distribution towards high-valued states. A sharp distinction has been previously made between alternative life objectives and excursions objectives in policy evaluation (Patterson et al., 2022). In that work, alternative life objectives use the state-weighting dπ that speciﬁes: learn the best value function as if data had been gathered under the target policy π. Such a weighting is obtained by using prior corrections namely reweighting entire trajectories with products of importance sampling ratios (Precup et al., 2000; Meuleau et al., 2000) or estimating dπ/dµ (Hallak and Mannor, 2017; Liu et al., 2018; Gelada and Bellemare, 2019). The excursions objective, on the other hand, uses a weighting of dµ. This distinction is not clearly relevant to control. Instead, typically on-policy average reward algorithms are developed for the continuing objective which has weighting dπ. The few oﬀ-policy algorithms designed for the continuing, average reward setting directly estimate dπ and dµ (Liu et al., 2019). The episodic objective for control does not include dπ, and so is diﬀerent from the alternative life objective in policy evaluation. In the remainder of this paper we develop an algorithm for optimizing the weightedexcursions objective in Equation (6) from oﬀ-policy data.

4. An Oﬀ-Policy Policy Gradient Theorem using Emphatic Weightings

In this section, we introduce the oﬀ-policy policy gradient theorem for the discrete and continuous settings as well as for deterministic policies. We additionally compare this gradient to the gradient underlying OﬀPAC, and provide a generalized gradient that interpolates between this gradient called the semi-gradient and the true gradient.

4.1 The Oﬀ-Policy Policy Gradient Theorem

The policy gradient theorem with function approximation in the on-policy setting was a seminal result (Sutton et al., 1999a, Theorem 1). The ﬁrst attempt to extend the policy gradient theorem to the oﬀ-policy excursion objective was limited to the setting where the policy is tabular (Degris et al., 2012b, Theorem 2).4 In this section, we show that the policy

4. Note that the statement in the paper is stronger, but in an errata published by the authors, they highlight an error in the proof. Consequently, the result is only correct for the tabular setting.

Graves, Imani, Kumaraswamy, and White

gradient theorem does hold in the oﬀ-policy setting for the weighted excursions objective, for arbitrary smooth policy parametrizations. The resulting gradient resembles the standard policy gradient, but with emphatic weightings to reweight the states.

Deﬁnition 1 (Emphatic Weighting) The emphatic weighting m : S [0, ) for a behaviour policy µ and target policy π can be recursively deﬁned as

m(s ) def= dµ(s )i(s ) + E[γ(S, A, S )m(S) | S = s ], (8)

where the distribution over S, A from next state S is under P and π.

Under ﬁnite states and actions, this expectation is P s S P a A π(a|s)P(s |s, a)γ(s, a, s )m(s). Under continuous states and actions, it is R

A π(a|s)P(s |s, a)γ(s, a, s )m(s) da ds. If a deterministic policy π : S A is used, then it is R

S P(s |s, π(s))γ(s, π(s), s )m(s) ds. Notice that these emphatic weightings involve states leading into s ; bootstrapping is back-in-time, rather than forward. The emphatic weighting reﬂects the relative importance of a state s , based on its own interest and weighting the ﬁrst term dµ(s )i(s ) as well as the discounted importance of states that lead into s . For value estimation, this is sensible because it reﬂects how much the agent bootstraps oﬀof the values in s , and hence how much it relies on accurate estimates in s . For policy gradients, we have a related role: the importance of a state is due both to its own value, as well as how much other states that lead into it depend on the value obtained from that state.

Theorem 2 (Oﬀ-Policy Policy Gradient Theorem) For ﬁnite states and actions,

θ qπ(s, a) = X

s S m(s)g(s, θ) (9)

g(s, θ) def= Eπθ

log π(A|S; θ)

θ qπ(S, A) S = s (10)

is the gradient for a given state, rewritten using the log-likelihood ratio. For continuous states, and either continuous or discrete actions,

S m(s)g(s, θ) ds. (9 for continuous states)

Proof We ﬁrst start with the ﬁnite state setting. The proof relies on the vector form of the emphatic weighting m def= i (I Pπ,γ) 1,

where the vector i R|S| has entries i(s) def= dµ(s)i(s) and Pπ,γ R|S| |S| is the matrix with entries Pπ,γ(s, s ) def= P a A π(a|s; θ)P(s |s, a)γ(s, a, s ). First notice that

θ = P s S i(s)vπ(s)

s S i(s) vπ(s)

Therefore, to compute the gradient of Jµ, we need to compute the gradient of the value function with respect to the policy parameters. A recursive form of the gradient of the

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

value function can be derived, as we show below. Before starting, for simplicity of notation we will use

θ qπ(s, a),

where g : S Rd. Now let us compute the gradient of the value function:

a π(a|s; θ)qπ(s, a)

θ qπ(s, a) + X

a π(a|s; θ) qπ(s, a)

a π(a|s; θ) P s P(s |s, a)(r(s, a, s ) + γ(s, a, s )vπ(s ))

a π(a|s; θ) X

s P(s |s, a)γ(s, a, s ) vπ(s )

We can simplify this more easily using vector form. Let vπ R|S| d be the matrix of gradients (with respect to the policy parameters θ) of vπ for each state s, and G R|S| d

the matrix where each row corresponding to state s is the vector g(s). Then

vπ = G + Pπ,γ vπ = vπ = (I Pπ,γ) 1G. (12)

Therefore, we obtain

s S i(s) vπ(s)

θ = i vπ = i (I Pπ,γ) 1G

θ qπ(s, a).

The proof is similar for the continuous case; we include it in Appendix B.

We additionally prove the deterministic policy gradient objective, for a deterministic policy, π : S A. The objective remains the same, but the space of possible policies is constrained, resulting in a slightly diﬀerent gradient.

Theorem 3 (Deterministic Oﬀ-policy Policy Gradient Theorem)

S m(s) π(s; θ)

a=π(s;θ) ds (13)

where m(s ) = dµ(s )i(s ) + Z

S P(s |s, π(s; θ))γ(s, π(s; θ), s )m(s) ds.

The proof is presented in Appendix A.

Graves, Imani, Kumaraswamy, and White

4.2 Connection to OﬀPAC and the Semi-gradient

The oﬀ-policy policy gradient above contrasts with the one used by OﬀPAC

s S dµ(s)Eπθ

log π(a|s; θ)

θ qπ(S, A) S = s . (14)

The only diﬀerence is in the state weighting dµ. This small diﬀerence, however, is key for getting the correct gradient. In fact, the OﬀPAC gradient can be considered a semi-gradient, because it drops a part of the gradient when using the product rule:

θJexc(θ) = X

s S dµ(s) θ X

a πθ(a|s)qπ(s, a)

s S dµ(s) X

a [qπ(s, a) θπθ(a|s) + πθ(a|s) θqπ(s, a)] .

The second term with θqπ(s, a) is not obviously easy to compute, because it requires reasoning about how changes to the policy parameters aﬀect the action-values. OﬀPAC opted to drop this term, and justify why this approximation was acceptable. The resulting semi-gradient has proven to be surprisingly eﬀective, though we provide several examples in this work where it performs poorly. Note that using this correct weighting is only critical due to state aliasing. If the policy representation is tabular, for example, the weighting on the gradients does not aﬀect the asymptotic solution as long as it is non-zero.5 The gradient P s S m(s)g(θ, s) = 0 if and only if g(θ, s) = 0, because the gradient vector has an independent entry for each state. Therefore, even if the state weighting in the gradient is changed to some other weighting d, then we still get P s S d(s)g(θ, s) = 0. This is the reason that the semi-gradient underlying OﬀPAC converges to a stationary point of this objective in the tabular setting. More generally, for any suﬃciently powerful function approximator that can obtain g(θ, s) = 0 for all states, this result holds. In optimization, this property has been termed the strong growth condition (Schmidt and Roux, 2013). We state this simple result for the irrelevance of the weighting formally in the following proposition, both to highlight it and because to the best of our knowledge it has not been formally stated.

Proposition 4 (Irrelevance of Weighting under the Strong Growth Condition) If g(θ, s) = 0 for all states, then θ is a stationary point under any state weighting d : S R.

This result states that with a function approximator that can perfectly maximize value in each state, the choice of state weighting in the gradient computation is not relevant. Both the oﬀ-policy gradient with emphatic weighting, and the semi-gradient with weighting dµ, can converge to this stationary point. Alternatively, we can consider the more generic emphatic weighting that lets us sweep between the gradient and the semi-gradient. The emphatic weightings were introduced for temporal diﬀerence (TD) learning with eligibility traces (Mahmood et al., 2015). The weightings depend on the eligibility trace parameter, λ. For λ = 1, the weightings eﬀectively

5. Note that the weighting on the gradients could still aﬀect the rate of convergence (Laroche and Tachet des Combes, 2021).

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

become 1 everywhere implicitly resulting in state weightings of dµ because there is no bootstrapping. For λ = 0, the weightings are exactly as deﬁned above.

Deﬁnition 5 (Generalized Emphatic Weighting) For η [0, 1], the generalized emphatic weighting mη : S [0, ) for a behaviour policy µ and target policy π is deﬁned as mη(s ) def= (1 η)dµ(s )i(s ) + ηm(s ) (15)

For η = 1, mη(s ) = m(s ).

Note that in the original work, m(s ) was called the follow-on weighting, and mη(s ) the emphatic weighting. Because mη is a strict generalization of m, we simply call m an emphatic weighting as well.6

The parameter η can be interpreted as trading oﬀbias and variance in the emphatic weightings. Notice that for η = 0, and assuming an interest of 1 everywhere, we get that mη(s) = dµ(s). Such a choice would give exactly the semi-gradient weighting. As we increase η, we interpolate between the semi-gradient and the full gradient. For η = 1, we get mη(s) = m(s) and so we have the full gradient. For η = 0, therefore, we obtain a biased gradient estimate, but the emphatic weightings themselves are easy to estimate they are myopic estimates of interest which could signiﬁcantly reduce variance when estimating the gradient. Selecting η between 0 and 1 could provide a reasonable balance, obtaining a nearly unbiased gradient to enable convergence to a valid stationary point but potentially reducing some variability when estimating the emphatic weighting. We will develop our algorithm for the general emphatic weighting, as this will give us more ﬂexibility in the weighting, and will also allow us to incorporate OﬀPAC as one extreme (biased) version of the approach.

4.3 Summary

In this section we proved the oﬀ-policy policy gradient theorem for the weighted-excursions objective, for both stochastic (Theorem 2) and deterministic (Theorem 3) policies. These gradients are similar to the standard policy gradient, but with states weighted by emphatic weightings. We contrasted this to OﬀPAC, the (unsound) algorithm on which many oﬀpolicy policy gradient algorithms are built, highlighting that the only distinction is in this state weighting. We argued that this state weighting does not always aﬀect the solution (Proposition 4), potentially partially explaining why OﬀPAC often performs well especially in deep RL with large neural networks. Finally, we show how we can use generalized emphatic weightings, with a parameter η (Deﬁnition 5), that allows us to interpolate between the sound, but harder-to-estimate emphatic weighting (at η = 1) and the unsound, but easy-to-obtain weighting used in OﬀPAC (at η = 0).

6. Note that the original emphatic weightings (Sutton et al., 2016) use λ = 1 η. This is because their emphatic weightings are designed to balance bias introduced from using λ for estimating value functions: larger λ means the emphatic weighting plays less of a role. For this setting, we want larger η to correspond to the full emphatic weighting (the unbiased emphatic weighting), and smaller η to correspond to a more biased estimate, to better match the typical meaning of such trace parameters.

Graves, Imani, Kumaraswamy, and White

5. Actor-Critic with Emphatic Weightings

In this section, we develop an incremental actor-critic algorithm with emphatic weightings that uses the above oﬀ-policy policy gradient theorem. To perform a gradient ascent update on the policy parameters, the goal is to obtain a sample of the gradient in Equation 9. As discussed above, the only diﬀerence to the semi-gradient used by OﬀPAC, in Equation 14, is in the weighting of the states: the true gradient uses m(s) and the semi-gradient uses dµ(s). Therefore, we can use standard solutions developed for other actor-critic algorithms to obtain a sample of g(s, θ). The key diﬃculty is in estimating m(s) to reweight this gradient.

5.1 Sampling the Gradient from a State

Before addressing this key diﬃculty, we provide a brief reminder on how to obtain a sample of g(s, θ), with more details in Appendix C. The simplest approach to compute the gradient for a given state is to use what is sometimes called the all-actions gradient P a π(a|s;θ)

θ qπ(s, a). To avoid summing over all actions, we instead obtain an unbiased sample of the gradient under the action taken by the behaviour policy A µ(s, ): ρ(s, A; θ) log π(a|s;θ)

θ qπ(s, A) where the importance sampling ratio ρ(s, a; θ) def= π(a|s;θ)

µ(a|s) corrects the distribution over actions. The estimated gradient has more variance when we only use a sampled action rather than all actions. The standard strategy to reduce this variance is to subtract a baseline, such as an estimate of the value function v(s). The resulting update is

ρ(s, a; θ) log π(a|s; θ)

θ [qπ(s, a) v(s)].

In practice, it is diﬃcult to obtain qπ(s, a) to compute this gradient exactly. Instead, we obtain (a likely biased) sample of the advantage, δt qπ(St, At) v(st). An unbiased but high variance estimate uses a sample of the return Gt from (st, at), and uses δt = Gt v(st). Then E[δt|St = s, At = a] = qπ(s, a) v(s). On the other extreme, we can directly approximate the action-values, ˆq(s, a). In-between, we can use n-step returns, or λ-returns, to balance between using observed rewards and value estimates. In this work, we opt for the simplest choice: n = 1 with δt = Rt+1 + γt+1v(St+1) v(St). Finally, we need to estimate the values themselves. Any value approximation algorithm can be used to estimate v, such as TD, gradient TD or even emphatic TD. Given we already compute the emphasis weighting, it is straightforward to use emphatic TD. We investigate both a gradient TD critic and an emphatic TD critic in the experiments.

5.2 Estimating the Gradient with Emphatic Weightings

The previous section outlined how to sample the gradient for a given state. We now need to ensure that these updates across states are weighted by the emphatic weighting. Intuitively, if we could estimate mη(s) and dµ(s) for all s, then we could simply premultiply the update with mη(s)/dµ(s). However, estimating the state distribution is nontrivial, and actually unnecessary. In this section, we ﬁrst explain how to obtain a Monte Carlo estimate of this pre-multiplier, and then how to directly estimate it with function approximation.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

5.2.1 An Unbiased Estimate of the Gradient with Emphatic Weightings

We can rely on the original work introducing the emphatic weightings to obtain a Monte Carlo estimate with the update

Mt = (1 η)it + ηFt Ft = γtρt 1Ft 1 + it (16)

where F0 = 0 and it is the interest observed in St, with i(s) = E[it|St = s]. Putting this together with the previous section, the gradient can be sampled by (1) sampling states according to dµ, (2) taking actions according to µ, (3) obtaining an estimate δt of the advantage qπ(s, a) v(s), and then (4) weighting the update with Mt, to get

θ θ + αρt Mt log π(At|St; θ)

Note that the all-actions gradient update with this emphatic weighting would be

θ θ + αMt X

b A π(b|St; θ) log π(b|St; θ)

θ qπ(St, b),

but as before we will assume that we use one sampled action. We prove that our update is an unbiased estimate of the gradient for a ﬁxed policy in Proposition 6. We assume access to an unbiased estimate δt of the advantage for this result, though in practice δt is likely to have some bias due to using approximate values.

Proposition 6 For a ﬁxed policy π, with the conditions on the MDP from (Yu, 2015) and assuming unbiased δt, namely that E[δt|St = s, At = a] = qπ(s, a) v(s),

ESt dµ,At µ[ρt Mtδt θ log π(At|St; θ)] = X

s S mη(s) X

θ qπ(s, a).

Proof Emphatic traces Mt have been shown to provide an unbiased estimate of the true emphatic weighting for policy evaluation in Emphatic TD. We use the emphatic weighting diﬀerently, but can rely on the proof from (Sutton et al., 2016) to ensure that (a) dµ(s) limt Eµ[Mt|St = s] = mη(s). Note also that, given St, the next action does not impact the expectation: (b) Eµ[Mt|St = s] = Eµ[Mt|St = s, At]. Using these equalities, we obtain

ESt dµ,At µ[ρt Mtδt θ log π(At|St; θ)]

s S dµ(s)Eµ[ρt Mtδt θ log π(At|St; θ)|St = s]

s S dµ(s)Eµ h Eµ[ρt Mtδt θ log π(At|St; θ)|St = s, At] i law of total expectation

s S dµ(s)Eµ h Eµ[Mt|St = s, At] Eµ[ρtδt θ log π(At|St; θ)|St = s, At] i

s S dµ(s)Eµ[Mt|St = s]Eµ h Eµ[ρtδt θ log π(At|St; θ)|St = s, At] i using (b)

Graves, Imani, Kumaraswamy, and White

s S mη(s)Eµ[ρtδt θ log π(At|St; θ)|St = s] using (a)

s S mη(s) X

θ qπ(s, a),

where the last line follows from the fact that

Eµ[ρtδt θ log π(At|St; θ)|St = s]

a µ(a|s)ρ(s, a) θ log π(a|s; θ) π(a|s; θ)

θ Eµ[δt|St = s, At = a]

a µ(a|s)ρ(s, a) θ log π(a|s; θ) π(a|s; θ)

θ [qπ(s, a) v(s)] by assumption

θ qπ(s, a).

5.2.2 A Biased Estimate of the Gradient with Emphatic Weightings

For a ﬁxed policy, the emphatic trace provides an unbiased way to reweight the gradient. However, this is no longer true when the target policy is updated online in an actor-critic algorithm like ACE; the emphatic trace will contain importance sampling ratios for older versions of the policy. If the target policy changes substantially during the learning process as one would hope the older importance sampling ratios could bias the emphatic trace s estimates. On the other hand, a constant discount rate γ < 1 would give exponentially less weight to older importance sampling ratios in the emphatic trace, potentially mitigating the bias. We empirically investigate the impact of this type of bias in section 9.3. As a Monte Carlo estimator of the emphatic weightings, the emphatic trace can also yield high-variance estimates. Furthermore, the product of importance sampling ratios in the emphatic trace can lead to even higher-variance estimates, which can slow learning (Precup et al., 2001; Liu et al., 2020a; Ghiassian et al., 2018). As such, several algorithms have been proposed that use parametrized functions to estimate the factors required for reweighting the updates (Hallak and Mannor, 2017; Gelada and Bellemare, 2019; Liu et al., 2018; Zhang et al., 2019). We can similarly approximate the emphatic weighting directly. For our setting, this involves estimating Eµ[Mt|St = s] = mη(s)/dµ(s). Because Mt is a function of Ft, we directly approximate Eµ[Ft|St = s] by learning a parametrized function fφ(s) and use it in place of Ft to compute Mt. Notice that the resulting fφ(s) Eµ[Ft|St = s] = m(s)/dµ(s), and so (1 η)i(s) + ηfφ(s) Eµ[Mt|St = s] = mη(s)/dµ(s). Because we sample s dµ, weighting the update with an estimate of mη(s)/dµ(s) is eﬀectively an importance sampling ratio that ensures the updates are weighted according to mη(s). The direct approximation relies on the recursive equation for the emphatic weighting that allows for a temporal-diﬀerence update. Unlike the usual TD update, this update bootstraps oﬀthe estimate from the previous step rather than the next step, because the emphatic weightings accumulate interest back in time, leading into the state. Recall the

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

recursive equation

m(s ) = dµ(s )i(s ) + X

S,A π(a|s; θ)p(s |s, a)γ(s, a, s )m(s).

This equation is a Bellman equation, that speciﬁes the emphatic values for state s in terms of the immediate interest and the states leading into s. We approximate fφ(s) m(s)/dµ(s), rather than m(s). The recursive formula for an f(s) that equals m(s)/dµ(s) is

f(s ) = i(s ) + 1 dµ(s ) X

S,A π(a|s; θ)p(s |s, a)γ(s, a, s )dµ(s)f(s ),

where it + ρt 1γtfφ(St 1) is a sample of this target, as we show in the next proposition.

Proposition 7 Under the conditions stated by Sutton et al. (2016),

lim t Eµ[it + ρt 1γtfφ(St 1)|St = s ] = i(s ) + 1 dµ(s ) X

s,a π(a|s; θ)p(s |s, a)γ(s, a, s )dµ(s)fφ(s).

Proof The proof follows the strategy showing the unbiasedness of Ft in Sutton et al. (2016).

lim t Eµ[it + ρt 1γtfφ(St 1)|St = s ]

= i(s ) + lim t Eµ[ρt 1γtfφ(St 1)|St = s ]

= i(s ) + X

s,a lim t Pr{St 1 = s, At 1 = a|St = s }π(a|s; θ)

µ(a|s) γ(s, a, s )fφ(s)

= i(s ) + X

dµ(s)µ(a|s)p(s |s, a)

dµ(s ) π(a|s; θ)

µ(a|s) γ(s, a, s )fφ(s) Bayes rule

= i(s ) + 1 dµ(s ) X

s,a π(a|s; θ)p(s |s, a)γ(s, a, s )dµ(s)fφ(s)

The utility of this result is that it provides a way to sample the target for this Bellman equation. The following is a sample update whose target is equal to the above in expectation, when sampling s dµ:

φ φ + βt it + ρt 1γtfφ(St 1) fφ(St) φfφ(St), (17)

where βt is a step-size. The update aims to minimize the diﬀerence between the current estimate of the emphatic weightings and the target, using a standard semi-gradient TD update. We could alternatively use a gradient TD (Sutton et al., 2009; Sutton and Barto, 2018) update

φ φ + βt it + ρt 1γtfφ(St 1) fφ(St) φfφ(St) γth(St 1) φfφ(St 1) , (18)

where the auxiliary function h(s) provides an estimate of Eµ[it+ρt 1γtfφ(St 1) fφ(St)|St = s] using a regression update.

Graves, Imani, Kumaraswamy, and White

Once we use an approximation to the emphatic weighting, we may introduce bias into the gradient estimate. As an extreme case, imagine the function is restricted to predict the same value for all states. Using these estimates means replacing Mt in the actor updates with a constant factor that can be subsumed in the step-size and therefore following the semi-gradient updates in Equation 14. However, the advantage is that these lower-variance estimates do not have to be computed online, unlike Mt. The function fφ can be trained by sampling transitions from a replay buﬀer (Lin, 1993; Mnih et al., 2015) and applying the update in Equation 17.

5.3 Summary

In this section, we showed how to operationalize the oﬀ-policy policy gradient theorem, by discussing how to sample the gradient. The gradient involves a standard actor-critic update from a state s, but additionally weighted by the emphatic weighting mη(s). We leverage existing results for sampling the emphatic weighting in Section 5.2.1, and proved that using this unbiased emphatic weighting in our update results in an unbiased update (Proposition 6). These sampled emphatic weightings, however, can be high variance and become biased if the policy changes on each step. We develop a parameterized estimator in Section 5.2.2, by developing a recursive formula for fφ(s) m(s)/dµ(s) in Proposition 7. Though a similar form was used to show unbiasedness of a sampled weighting in the original work on emphasis (Sutton et al., 2016), they did not recognize nor leverage this recursive form to develop a TD algorithm to learn a parameterized estimator. We need this fφ(s), instead of directly estimating m(s), because states are sampled s dµ, so weighting by m(s) would be incorrect. We summarize the ﬁnal algorithm in Algorithm 1, in Section 7. This pseudocode is provided later, after incorporating entropy regularization and showing how to use the algorithm for episodic problems. The algorithm shows both how to use the emphatic trace and directly-estimated emphatic weightings.

6. Incorporating Entropy Regularization

In this section, we discuss how to extend the oﬀ-policy policy gradient theorem to incorporate entropy regularization. We focus on discrete states and actions in the main body, and include results for continuous states and actions in the appendix. Entropy regularization is commonly used with policy gradient methods, as it encodes a preference for more stochasticity in the policy. Recall that at each state, the policy induces a probability distribution over actions whose entropy is deﬁned as

H π(.|s) def= X

a A π(a|s) log π(a|s).

This value captures the stochasticity in the policy. A uniform random policy will maximize entropy, while the entropy of a nearly deterministic policy will be a large negative value. Entropy regularization is believed to promote exploration, improve the loss surface, and promote faster convergence rates. For exploration, entropy regularization can help make the policy more stochastic and diversify the set of states and actions that the agent learns about (Williams and Peng, 1991; Haarnoja et al., 2018). To ﬁnd a good policy, the agent

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

needs accurate estimates of values of diﬀerent actions in diﬀerent parts of the state space. A greedy policy only takes actions that are deemed optimal at the current point which results in learning only about a limited number of trajectories. Entropy regularization has also shown to help policy optimization by modifying the landscape. The resulting objective is smoother, which allows the use of larger step-sizes and also reduces the chance of getting stuck in bad local maxima (Ahmed et al., 2019). Finally, the use of entropy regularization facilitates convergence analysis to stationary points. Mei et al. (2020) showed that entropy regularization can ensure existence of stationary points and also improve the rate of convergence in tabular policy optimization. The idea is similar to convex optimization, where ℓ2 regularization makes the loss function strongly convex and improves the convergence rate from sublinear to linear. In Appendix E, we extend the proof of existence of a stationary point for entropy regularized policy optimization to state aggregation. Therefore, in addition to extending our algorithm to allow for entropy regularization, we also use this extension to formally prove a counterexample for the semi-gradient updates. Extending the results to entropy regularization is relatively straightforward, as it relies on the already developed machinery of soft action-values that include entropy in the rewards. Entropy regularization augments each reward with a sample of the entropy at the current state so that the return is modiﬁed to

Gt def= (Rt+1 τ log π(At|St)) + γt+1(Rt+2 τ log π(At+1|St+1)) + . . .

= Rt+1 τ log π(At|St) + γt+1 Gt+1,

where τ is a parameter that controls the amount of regularization. Entropy regularized state values and action values can be deﬁned as (Geist et al., 2019)

vπ(s) def= Eπ[ Gt|St = s] s S

qπ(s, a) def= Eπ[Rt+1 + γt+1 Gt+1|St = s, At = a] s S and a A

s S P(s |s, a)[r(s, a, s ) + γ(s, a, s ) vπ(s )] a A, s S.

Here, vπ(s) = Eπ[ qπ(S, A)|S = s] + τH(π( |s)) for entropy H(π( |s)) = Eπ[ log π(A|s)]. The entropy-regularized objective function simply uses these entropy regularized values,

Jµ(θ) def= X

s S dµ(s)i(s) vπθ(s). (19)

The oﬀ-policy policy gradient theorem can be extended to the entropy regularized objective. The result looks a little diﬀerent, because the relationship between the soft values and soft action-values is a little diﬀerent than in the unregularized setting. Notice that τ log π(a|s) is not included in the ﬁrst reward for qπ(s, a), because this ﬁrst action is given not random and entropy is an expectation over all actions. Further, log π(a|s) can be arbitrarily large, though the entropy itself remains nicely bounded. For this reason, we do not deﬁne the soft action-values to be Eπ[ Gt|St = s, At = a], as we typically would for the unregularized setting. Nonetheless, deriving the policy gradient theorem for these soft action-values uses exactly the same steps. Further, we do recover the unregularized formulation when τ = 0.

Graves, Imani, Kumaraswamy, and White

Theorem 8 (Oﬀ-policy Policy Gradient Theorem for Entropy Regularization)

θ qπ(s, a) + τ H(π( |s; θ))

θ [ qπ(s, a) τ log π(a|s; θ))]

The proof is in Appendix D, as is the extension to continuous states, continuous actions, and other regularizers. This theorem shows that we can use a similar update as in the unregularized setting, simply by adding τ log π(a|s; θ)). Given the true soft action-values qπ(s, a), an update with an unbiased sample of the gradient is

θ θ + αρt Mt log π(At|St; θ)

θ [ qπ(St, At) τ log π(At|St; θ))] .

In practice, we subtract a baseline and use an estimate δt of the advantage qπ(s, a) τ log π(a|s; θ) vπ(s), by only directly estimating the soft values v(s) vπ(s). The 1-step return sample is again the TD error, but with entropy regularization added to the rewards: δt = Rt+1 τ log π(At|St) + γt+1v(St+1) v(St).

7. Formulating Episodic Problems as a Special Case

The goal in an episodic problem is to maximize the expected episodic return. Mismatch between the algorithm s objective and this goal, even in the on-policy setting, can lead the agent to suboptimal policies (Thomas, 2014; Nota and Thomas, 2020). This section describes how we can use the interest function to maximize the right objective. The objective function in Equation 4 that weights state values by the start state distribution is equal to expected episodic return. With limited function approximation resources this weighting matters. If the parametrized function is incapable of representing a policy that maximizes all state values, the agent has to settle for lower values in some states to maximize the values of more important ones. For example, if the weighting is dµ, the agent may incorrectly prioritize states that appear less often at the start of an episode and more frequently at other points in the behaviour policy s trajectories, and fail to optimize the episode return. The interest function in Equation 6 allows us to focus the parametrized function s resources on states of interest. Recall that we assume that the agent observes interest it in state St, with i(s) = E[it | St = s]. So far, we have not used the additional ﬂexibility that the interest itself can be random, but for this uniﬁcation we will use this property. If we could set i(St) = d0(St)/dµ(St) then the objective function would correspond to the episodic objective. However, neither dµ nor d0 is available to the agent, and we would like to avoid estimating those distributions. Fortunately, we can show that if the signal is set to one at the beginning of an episode and set to zero thereafter, its expectation in the limit of time will be proportional to the correct ratio. Under transition-based discounting, the agent is informed that an episode has begun whenever it receives a discount factor of zero: namely, that the transition (St, At, St+1)

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

resulted in termination. Therefore, we can obtain this result by allowing for the interest to be a function of the whole transition.

Proposition 9 If the signal it is set to

( 1, if γt = 0 (i.e. the beginning of an episode) 0, otherwise (20)

then its expected value under the behaviour policy is

Eµ[it | St = s] d0(s)

Eµ[it | St = s] = Pr(γt = 0 | St = s)

= Pr(St = s | γt = 0) Pr(γt = 0)

= d0(s) Pr(γt = 0)

where we used the fact that Pr(St = s | γt = 0) = d0(s) and Pr(St = s) = dµ(s).

The constant term is the same for all states, Pr(γt = 0). It simply reﬂects the probability of termination under the behaviour policy. This constant term does not change the relative ordering between policies, since all weight is still on the start states. Therefore, the resulting objective is equivalent to the episodic objective. We can relate the resulting update to the on-policy and oﬀ-policy updates used for the episodic setting. In the on-policy setting, if we use the interest function given by Equation 20 and a constant discount factor γ during the episode, then the update reduces to the unbiased episodic actor-critic algorithm of Sutton and Barto (2018), originally proposed by Thomas (2014). Sutton et al. (1999a) proved the policy gradient theorem for the episodic setting

s S dπ(s) X

θ qπ(s, a) for dπ(s) =

t=0 γt Pr(s; t, π),

where Pr(s; t, π) is the probability of going from a start state to state s in t steps under π, the weighting in the gradient dπ is the discounted state visit distribution, and the agent should discount updates that occur later in the episode to account for this weighting. An algorithm that does not discount later updates and thus samples according to dπ results in a biased update. To ﬁx this problem, Thomas (2014) proposed the unbiased actor-critic algorithm with the updates rules

θ θ + αIδ θ log π(A|s; θ)

Graves, Imani, Kumaraswamy, and White

Since I is initialized to 1 and updated after the update to θ, I will be 1 during the ﬁrst weight update, and will decay by γ each timestep thereafter. The on-policy ACE update rule with ρ = 1 and η = 1 is

Ft γt Ft 1 + it for F0 = 0

θ θ + αFtδ θ log π(A|s; θ).

Because it = 1 on the ﬁrst time step and F is initialized to 0, F will be 1 on the ﬁrst weight update and will decay by γ each time step thereafter. From this inspection it is clear that the ACE update reduces to the unbiased actor-critic update in the on-policy episodic setting. The ﬁnal ACE algorithm with all the discussed techniques is in Algorithm 1.

Algorithm 1 Online Actor Critic with Emphatic weightings (ACE)

Initialize weights for actor θ and critic w For emphatic trace, initialize F0 = 0; for directly-estimated emphatic weightings, initialize weights φ for approximate emphatic weightings fφ Suggested (default) settings of parameters: η = 0.1, τ = 0.01 Obtain initial feature vector x0 and set i0 1 repeat

Choose an action at according to µ( |xt) Observe reward rt+1, next state vector xt+1 and γt+1 For episodic setting: Set it+1 = 1 if a new episode has begun; else it+1 = 0 For excursions setting: If no preferences between states, set it+1 = 1 rt+1 rt+1 τ log π(at|xt; θ) ρt π(at|xt;θ)

µ(at|xt) Update (entropy-regularized) critic vw if using emphatic trace then

Mt (1 η)it + ηFt Ft+1 ρtγt+1Ft + it+1 else

Mt (1 η)it + ηfφ(xt) φ φ + βt it+1 + ρtγt+1fφ(xt) fφ(xt+1) φfφ(xt+1)

δt rt+1 + γt+1vw(st+1) vw(st) ψ θ log π(b|x; θ) θ θ + αtρt Mtδtψ until agent done interaction with environment

8. Stationary Points under the Semi-gradient and Gradient Updates

In this section, we provide more insight into the diﬀerence between the stationary points of the weighted excursions objective obtained using the true gradient in Equation 9 versus those obtain with the semi-gradient update underlying OﬀPAC, in Equation 14. We ﬁrst

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

provide a counterexample showing that all the stationary points for the semi-gradient have poor performance. In fact, initializing at the optimal solution, and then updating with the semi-gradient, moves the solution away to one of these highly suboptimal stationary points. We then discuss that the semi-gradient update can actually be seen as a full gradient update, albeit with a diﬀerent implicit state weighting. This connection illuminates why the semi-gradient does so poorly on the counterexample, but also potentially sheds light on why OﬀPAC has generally performed reasonably in practice.

8.1 A Counter-example for the Semi-Gradient Update

Recall that, in the derivation of the policy gradient theorem, the product rule breaks down the gradient of the value function into the two terms shown in Equation 11. The policy gradient theorem in OﬀPAC only considers the ﬁrst term, i.e. the gradient of the policy given a ﬁxed value function. The approximated gradient is

s S dµ(s) X

θ qπ(s, a).

The approximation above weights the states by the behaviour policy s state distribution instead of emphatic weightings. To see the diﬀerence between these weightings, consider the MDP in Figure 1a. For the actor, s0 has a feature vector [1, 0], and the feature vector for both s1 and s2 is [0, 1]. This aliased representation forces the actor to take a similar action in s1 and s2. The behaviour policy takes actions a0 and a1 with probabilities 0.25 and 0.75 in all non-terminal states, so that s0, s1, and s2 will have probabilities 0.5, 0.125, and 0.375 under dµ. The target policy is initialized to take a0 and a1 with probabilities 0.9 and 0.1 in all non-terminal states, which is close to optimal. We trained two actors on this MDP, one with semi-gradient updates and one with gradient updates. The actors are initialized to the target policy above and the updates use exact values rather than critic estimates. States and actions are sampled from the behaviour policy. As shown in Figures 1b and 1c, while both methods start close to the highest attainable value of the objective function (that of a deterministic policy that takes a0 everywhere), semi-gradient updates move the actor towards a suboptimal policy and reduce the objective function along the way. Gradient updates, however, increase the probability of taking a0 in all states and increase the value of the objective function. The problem with semi-gradient updates boils down to the weighting. In an expected semi-gradient update, each state tries to increase the probability of the action with the highest action-value. There will be a conﬂict between the aliased states s1 and s2 because their highest-valued actions diﬀer. If the states are weighted by dµ in the expected update, s1 will appear insigniﬁcant to the actor, and the update will increase the probability of a1 in the aliased states. The ratio between qπ(s1, a0) and qπ(s2, a1) is not enough to counterbalance this weighting. However, s1 has an importance that the semi-gradient update overlooks. Taking a suboptimal action at s1 will also reduce q(s0, a0), and after many updates the actor will gradually prefer to take a1 in s0. Eventually, the target policy will be to take a1 at all states, which has a lower value under the objective function than the initial target policy.

Graves, Imani, Kumaraswamy, and White

a0, 0 a1, 0

0nlx3p2PRWv By We O4Q+czx8b HY4t</latexit>a0, 2 a1, 0 a0, 0

yj AMZz AGWC4ghrc QR0a QOERnu EV3jzlv Xjv3se8dc XLZ47g D7z PHxsgji0=</latexit>a1, 1

(a) Counterexample

(b) Learning curves

(c) Optimal action probability

Figure 1: (a) A counterexample that demonstrates the suboptimal behaviour of semigradient updates. The semi-gradients converge for the tabular setting (Degris et al., 2012b), but not necessarily under function approximation such as with the state aliasing in this MDP. The start state is denoted s0 and the terminal state is denoted t. States s1 and s2 are aliased to the actor. The interest i(s) is set to one for all states. (b) Learning curves comparing semi-gradient updates and gradient updates, averaged over 30 runs with negligible standard error bars. The actor has a softmax output on a linear transformation of features and is trained with a step-size of 0.1 (though results were similar across all the stepsizes tested). The dashed line shows the highest attainable objective function under the aliased representation. (c) The probability of taking the optimal action (a0) in the aliased states.

This experiment highlights why the weight of a state should depend not only on its own share of dµ, but also on its predecessors. The behaviour policy s state distribution is not the proper deciding factor in the competition between s1 and s2, even when optimizing the excursions objective. Proposition 10 formalizes the problem with semi-gradient updates by showing that, under any τ > 0, semi-gradient updates will not converge to a stationary point of the objective function in the counterexample. The requirement τ > 0 is only needed to ensure existence of a stationary point; this result holds for τ arbitrarily close to zero. The proof is presented in Appendix E.

Proposition 10 For any τ > 0 where τ = τi 0.2779 in the three-state counterexample, semi-gradient updates do not converge to a stationary point of the objective function.

8.2 The Semi-Gradient as a True Gradient with a Diﬀerent Weighting

In this section, we show that the semi-gradient can locally be seen as a gradient update, with a diﬀerent state weighting in the objective. Locally for the current weights θt with corresponding policy π, that state weighting is d = dµ(I Pπ,γ), with objective Jd(θ) def= P s S d(s)vπθ(s). The resulting d may not be a distribution. In fact, it may not even be positive! If it is negative, the objective tells the agent to minimize the value for that state. This is precisely what occurs in the above counterexample.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

First, to see why this is the implicit state weighting, notice that for the semi-gradient update to be a gradient update, the weighting dµ used in the update has to correspond to an emphatic weighting md for some state weighting d in the objective. In other words, dµ(s) = md(s) where md = d(I Pπ,γ) 1. This requirement implies that d = md(I Pπ,γ). Therefore, by using the weighting dµ in the expected gradient, the semi-gradient update locally around θt can be seen as a gradient update on Jd. This weighting d actually changes as θt changes. In fact, we know that the semi-gradient update cannot be seen as the gradient of a ﬁxed objective function, from our counterexample for OﬀPAC and the result for the on-policy setting showing that omitting the discount factor results in an update that is not a gradient (Nota and Thomas, 2020). However, even if this interpretation is only valid locally at each update, such a negative weighting can be problematic. We can compute the implicit weighting d, in our counter-example. Let b = µ(a0|s0) and p = π(a0|s0). Then we know that dµ(s0) = 0.5, dµ(s1) = 0.5b = 0.125 and dµ(s0) = 0.5(1 b) = 0.375. Further, we know that m(s0) = dµ(s0) and

m(s1) = dµ(s1) + π(a0|s0)m(s0) = bdµ(s0) + pm(s0) = dµ(s0)(b + p) = 0.5(b + p)

m(s2) = dµ(s2) + (1 p)m(s0) = dµ(s0)((1 b) + (1 p)) = 0.5(2 b p).

The implicit d for the semi-gradient update, locally around this π, has m(s) = dµ(s) where m(s0) = d(s0) and so d(s0) = dµ(s0) = 0.5 and

m(s1) = d(s1) + p m(s0) = d(s1) + pdµ(s0) = d(s1) + 0.5p

m(s2) = d(s2) + (1 p) m(s0) = d(s2) + (1 p)dµ(s0) = d(s2) + 0.5(1 p).

Using m(s1) = dµ(s1) = 0.5b and m(s2) = dµ(s2) = 0.5(1 b), we get that

d(s1) = m(s1) 0.5p = 0.5(b p)

d(s2) = m(s2) 0.5(1 p) = 0.5((1 b) (1 p)) = 0.5(p b).

If b > p, then d(s2) < 0; if b < p, then d(s1) < 0; and if b = p then both are zero. In our counterexample, we set b < p, making the update move away from increasing the value in s1, namely preferring to increase the value in s2 and decrease the value in s1. The iterative updates maintain the condition b < p, even as the policy changes which changes p so the implicit weighting systematically causes convergence to a suboptimal policy. We note that the implicit weighting is independent of the representation. However, we know that the semi-gradient update converges to a stationary point of the excursions objective for the tabular setting. This might seem odd, given that the implicit weighting is negative for a state in this counterexample. However, this is not contradictory. Recall in Section 2 we discussed that in the tabular setting, the condition for the stationary point is that the gradient must be zero at every state, independently. Summing up zeros, even when weighting those zeros with a negative weighting, still results in zero. There has been other work that has noted that optimizing under a diﬀerent state weighting can still be eﬀective. Proposition 6 of Ghosh et al. (2020) shows that a lower bound can be optimized, using data generated under the behaviour policy. The action-values of the behaviour policy is used, and only the log likelihood terms of the target policy are diﬀerentiated. This lower bound, however, is only an approximation to the true objective.

Graves, Imani, Kumaraswamy, and White

Our result diﬀers, in that it highlights that equivalent solutions can be obtained, even under diﬀerent state weightings. This view suggests a direction to reduce variance in the ACE updates: consider appropriate implicit weightings d, that allow for low variance emphatic updates. One potentially promising approach is to only consider emphatic weightings a small number of steps backin-time. Another is to err on the side of smaller η in the emphatic trace in ACE, knowing that the implicit weighting d may remain reasonable for a broad range of η.

9. Experiments: Studying Properties of ACE

In this section we investigate key properties of the ACE algorithm empirically.7 Speciﬁcally, we study the eﬀects of the trade-oﬀparameter and the choice of estimator for the emphatic weightings on the learned policy, as these choices are central to the ACE algorithm. First, we revisit the simple counterexample from Section 8.1 to examine how the trade-oﬀparameter can aﬀect the learned policy. Next, we illustrate some issues with using the emphatic trace in the ACE algorithm by testing it on a modiﬁed version of the counterexample. We then move to two classic control environments to study the two estimators discussed in Section 5.2 and their eﬀects on the learned policy. Finally, we test several variants of ACE on a challenging environment designed to illustrate the issues associated with both estimators. Please see Table 1 in Appendix F for a description of each of the algorithms compared.

9.1 The Impact of the Trade-OﬀParameter

The parameter η can be interpreted as trading oﬀbias and variance. For η = 0, the bias can be signiﬁcant, as shown in the previous section. A natural question to ask is how the bias changes as η ranges from 0 to 1. To answer this question, we repeated the experiment in Section 8.1, but this time with η taking values in {0, 0.25, 0.5, 0.75, 1} and the actor s step-size taking values in {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1}, with the best-performing step-size (by total area under the learning curve, averaged over 30 runs) for each value of η used in Figures 2b and 2c. To highlight the rate of learning, the actor s policy was initialized to take each action with equal probability. The results are displayed in Figure 2 with shaded regions depicting standard error. Figure 2a shows the performance of ACE with each value of η over the range of step-sizes tested. ACE performed well for a range of step-sizes, and even small values of η signiﬁcantly improved over η = 0 (OﬀPAC). However, the performance of ACE with η > 0 was more sensitive to the choice of step-size, with performance decreasing as the step-size grew large, while the performance of η = 0 was lower overall but remained steady with larger step-sizes. Figure 2b plots the learning curves for ACE with each value of η. For η = 0 (OﬀPAC), the algorithm decreased the objective function during learning to get to a suboptimal ﬁxed point, while η > 0 always improved the objective function relative to the starting point. For a surprisingly small choice of η = 0.5, the actor converged to the optimal solution, and even η = 0.25 produced a much more reasonable solution than η = 0.

7. Code is available at: https://github.com/gravesec/actor-critic-with-emphatic-weightings.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

Figure 2c shows the probability of taking the optimal action in the aliased states. The optimal policy is to take a0 in the aliased states with probability 1. ACE with η = 0 (OﬀPAC) quickly converged to the suboptimal solution of choosing the best action for s2 instead of s1. Even with η just a bit higher than 0, convergence is to a more reasonable solution, choosing the optimal action the majority of the time.

(a) Actor step-size sensitivity

(b) Learning curves

(c) Optimal action probability

Figure 2: Performance of ACE with diﬀerent values of η in the counterexample.

To determine whether the η parameter has similar eﬀects when ACE is used with a learned critic, we repeated the previous experiment but this time using value estimates from a critic trained with GTD(λ) (Maei, 2011). We checked all combinations of the following critic step-sizes: {10 5, 10 4, 10 3, 10 2, 10 1, 100}, the following critic trace decay rates: {0, 0.5, 1.0}, and the following actor step-sizes: {10 10, 10 8, 10 6, 10 4, 10 2}, and used the best-performing combination (by area under the learning curve) for each value of η in Figures 3b and 3c. We averaged the results over 10 runs, and plotted the standard error as shaded regions. Figure 3 shows that, as before, even relatively small values of η can achieve close to the optimal solution. However, η = 0 (OﬀPAC) still ﬁnds a suboptimal policy. Overall, the outcomes are similar to the previous experiment, although noisier due to the use of a learned critic rather than the true value function.

(a) Actor step-size sensitivity

(b) Learning curves

(c) Optimal action probability

Figure 3: Performance of ACE with a GTD(λ) critic and diﬀerent values of η in the counterexample.

So far we have only considered ACE with a discrete action policy parameterization. However, an appealing property of actor-critic algorithms is their ability to naturally handle continuous action spaces. To determine if the ﬁndings above generalize to continuous action policy parameterizations, we created an environment similar to Figure 1a, but with one

Graves, Imani, Kumaraswamy, and White

continuous unbounded action. Taking action with value a at s0 will result in a transition to s1 with probability 1 σ(a) and a transition to s2 with probability σ(a), where σ denotes the logistic sigmoid function. For all actions from s0, the reward is zero. From s1 and s2, the agent can only transition to the terminal state, with reward 2σ( a) and σ(a) respectively. The behaviour policy takes actions drawn from a Gaussian distribution with mean 1.0 and variance 1.0. Because the environment has continuous actions, we can include both stochastic and deterministic policies, and so can include DPG in the comparison. DPG is built on the semi-gradient, like OﬀPAC (Silver et al., 2014). We include True-DPG with Emphatic weightings (True-DPGE), which uses the true emphatic weightings rather than estimated ones to avoid the issue of estimating the emphatic weightings for a deterministic target policy, and focus the investigation on whether DPG converges to a suboptimal solution in this setting. Estimation of the emphatic weightings for a deterministic target policy is left for future work. The stochastic actor in ACE has a linear output unit and a softplus output unit to represent the mean and the standard deviation of a Gaussian distribution. All actors are initialized with zero weights. Figure 4 summarizes the results. The ﬁrst observation is that DPG demonstrates suboptimal behaviour similar to OﬀPAC. As training goes on, DPG prefers to take positive actions in all states, because s2 is updated more often. This problem goes away in True-DPGE. The emphatic weightings emphasize updates in s1 and, thus, the actor gradually prefers negative actions and surpasses DPG in performance. Similarly, True-ACE learns to take negative actions but, being a stochastic policy, it cannot achieve True-DPGE s performance on this domain. ACE with diﬀerent η values, however, cannot outperform DPG, and this result suggests that an alternative to importance sampling ratios is needed to eﬀectively extend ACE to continuous actions.

(a) Stepsize sensitivity

(b) Learning curves

(c) Mean action

Figure 4: Performance of ACE with diﬀerent values of η, True-ACE, DPG, and True-DPGE on the continuous action MDP. The results are averaged over 30 runs. For continuous actions, the methods have even more diﬃculty getting to the optimal solutions, given by True-DPGE and True-ACE, though the action selection graphs suggest that ACE for higher η is staying nearer the optimal action selection than ACE(0) and DPG.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

9.2 Challenges in Estimating the Emphatic Weightings

Up to this point we have been using the emphatic trace originally proposed by Sutton et al. (2016) to estimate the emphatic weightings. As discussed in Section 5, there can be multiple sources of inaccuracy from using this Monte Carlo estimate in an actor-critic framework. However, it is unclear how the issues aﬀecting the emphatic trace manifest in practice, and hence whether introducing a parametrized function to directly estimate the emphatic weightings is worth the added bias especially if the representation is poor. To study the eﬀects of this choice of estimator on the ACE algorithm, we conducted a series of experiments starting with a modiﬁed version of the counterexample in Figure 1a, moving to two classic control environments, and ending with a challenging environment designed to illustrate the issues associated with both estimators.

Ig Yaxt KSRz9ed ERi Nj Jl Fg Oy OKI7Pszc T/v G6K4ZWf CZWky BVb LAp TSTAms7/JQGj OUE4so Uw Leyth I6op Q5t Oy Ybg Lb/8l7TOqt5Ft XZXq9Sv8zi Kc ATHc Aoe XEIdbq EBTWAwh Cd4g Vd HOs/Om/O+a C04+cwh/ILz8Q0KCI2m</latexit>s3

+Esb Il DZmpvycy Gmk9jg Lb GVEz1Ive VPz P6Qmv PYz Lp PUo GTz RWEqi In J9G/S5wq ZEWNLKFPc3kr Yk Cr Kj E2n ZEPw Fl9e Js2zqnd ZPb8/r9Ru8ji Kc ATHc Aoe XEN7q AODWAwg Gd4h Td HOC/Ou/Mxby04+cwh/IHz+QMNEI2o</latexit>s5

m QMNa2FJK5+nsio5Exkyiwn RHFk Vn2Zu J/Xjf F8Nr Ph Ep S5Iot Fo Wp JBi T2d9k IDRn KCe WUKa Fv ZWw Ed WUo U2n ZEPwl9e Ja2Lqler Xt5f Vuo3e Rx FOIFTOAc Prq AOd9CAJj AYwj O8wpsjn Rfn3fl Yt Bacf OY/s D5/AEOl I2p</latexit>s6

m QMNa2FJK5+nsio5Exkyiwn RHFk Vn2Zu J/Xjf F8Nr Ph Ep S5Iot Fo Wp JBi T2d9k IDRn KCe WUKa Fv ZWw Ed WUo U2n ZEPwl9e Ja2Lqnd Vvby/r NRv8ji Kc AKnc A4e1KAOd9CAJj AYwj O8wpsjn Rfn3fl Yt Bacf OY/s D5/AEQGI2q</latexit>s7

CEsb Il DZmrvycy Gmk9i QLb GVEz0sve TPz P6Ymr Pk Zl0lq ULFoj AVx MRk9jc Zc IXMi Ikl Cluby Vs RBVlxq ZTsi F4y+vkt ZF1buq Xt5f Vuo3e Rx FOIFTOAc Prq EOd9CAJj AYwj O8wpsjn Bfn3fl Yt Bacf OY/s D5/AERn I2r</latexit>s8

RJGCtb0p CZ+nsio5HW4yiwn RE1Q73o Tc X/v E5qwis/4z JDUo2Xx Smgpi YTP8mfa6QGTG2h DLF7a2EDamiz Nh0Sj YEb/Hl Zd I8q3o X1f P780rt Jo+j CEdw DKfgw SXU4A7q0AGA3i GV3hzh Piv Dsf89a Ck8cwh84nz8TI2s</latexit>s9

a0, 0 a0, 0

a0, 0 a0, 0 a0, 0

Dt Cq Obu Vk SHRBNq XUBF4K/+PIya V5U/Mt K9b5art3kc RTg GE7g DHy4ghrc QR0a QOERnu EV3jzlv Xjv3se8dc XLZ47g D7z PHxg Vjis=</latexit>a0, 0

a0, 0 a1, 0

a1, 0 a1, 0

4h Gd4h Tek0At6Rx/z1h WUzxz BH6DPHxmcjiw=</latexit>a1, 0

a1, 0 a1, 0

Figure 5: An 11-state MDP that makes estimating the emphatic weightings with the emphatic trace more diﬃcult.

The ﬁrst environment, shown in Figure 5, is an extended version of the counterexample with two long chains before the aliased states. Like the original counterexample, the behaviour policy takes a0 with probability 0.25 and a1 with probability 0.75 in all nonterminal states, and the interest i(s) is set to 1 for all states. Each state is represented by a unique feature vector except states s9 and s10, which are aliased and appear the same to the actor. The addition of the new states makes trajectories considerably longer, which may exacerbate the issues with the emphatic trace. We repeated the experiment from Figure 2 on the long counterexample. The following actor step-sizes were tested: {5 10 5, 10 4, 2 10 4, 5 10 4, 10 3, 2 10 3, 5 10 3, 10 2}, with the best-performing value (by total area under the learning curve, averaged over 10 runs) for each value of η used in Figures 6b and 6c. The actor was again initialized to take each action with equal probability, and the true state values were again used in the updates in order to isolate the eﬀects of the emphatic trace. We also trained an actor called True-ACE that uses the true emphatic weightings for the current target policy and behaviour policy, computed at each timestep. The performance of True-ACE is included here for the sake of comparison, as computing the exact emphatic weightings is not generally possible in an unknown environment. The results in Figure 6 show that, even though performance improves as η is increased, there is a signiﬁcant gap between ACE with η = 1 and True-ACE. Unlike Figure 2, the methods have more diﬃculty reaching the optimal solution, although ACE with larger η does still ﬁnd a signiﬁcantly better solution than η = 0. Additionally, values of η greater

Graves, Imani, Kumaraswamy, and White

than 0 result in high-variance estimates of the emphatic weightings, which lead to more variable performance, as shown by the larger shaded regions representing standard error. These results overall show the inaccuracies pointed out in Section 5 indeed disturb the updates in long trajectories.

(a) Actor step-size sensitivity

(b) Learning curves

(c) Optimal action probability

Figure 6: Performance of ACE with diﬀerent values of η on the 11-state MDP.

9.3 Estimating the Emphatic Weightings in Classic Control Environments

The results of the previous experiments suggest that using the emphatic trace to estimate the emphatic weightings can signiﬁcantly impact the performance of ACE. To better understand how the issues with the emphatic trace aﬀect the learning process beyond our counterexample, we tested several variants of ACE on oﬀ-policy versions of two classic control environments: Puddle World (Degris et al., 2012b) and Mountain Car (Moore, 1990). As an oﬀ-policy Monte Carlo estimator of the emphatic weightings, the emphatic trace can yield extremely high-variance estimates which can interfere with learning (Ghiassian et al., 2018). To see how the variance of the emphatic trace aﬀects ACE, we tested a variant called ACE-direct that uses the one-step temporal-diﬀerence update from Section 5.2.2 to estimate the emphatic weightings. Temporal-diﬀerence methods often have lower variance than Monte Carlo methods at the cost of introducing bias which makes using a TD-style update to estimate the emphatic weightings an appealing alternative to the emphatic trace (Sutton and Barto, 2018). As discussed in Section 5.2.2, the emphatic trace can also be biased when used in an actor-critic algorithm where the target policy is changing. To determine how detrimental this bias is to the performance of ACE, we tested a variant called ACE-ideal where all importance sampling ratios in the emphatic trace are re-computed at each time step using the current target policy, yielding an unbiased Monte Carlo estimate of the emphatic weighting for the current state. ACE-ideal is not a practical algorithm, as the computation and memory required grow with the number of time steps, but it allows us to isolate the eﬀects of the bias introduced by using the emphatic trace with a changing target policy. Other baselines for comparison included OﬀPAC, and ACE using the emphatic trace (referred to as ACE-trace). To determine the eﬀects of diﬀerent choices of critic, we included versions of each algorithm using an ETD critic (Sutton et al., 2016) and a TDRC critic (Ghiassian et al., 2020). We also included versions of each algorithm using the uniform interest function (i.e., i = 1 for all time steps), as well as the episodic interest function from Section 7 with the exception of OﬀPAC. OﬀPAC scales policy updates using only the

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

interest for the current time step, which when combined with the episodic interest function leads to a single policy update on the ﬁrst time step followed by no updates thereafter. To get a clear picture of the performance of each algorithm, we conducted a grid search on the free parameters of each algorithm. For the step size parameter of the actor, critic, and the direct estimator of emphatic weightings, we tested values of the form 1

2 i where i ranged from 0 to 15. For the trace decay rate of the critic, we tested values of the form 1 1

2 j where j ranged from 0 to 6. The discount factor was .95. We sought to establish the performance of each variant for η = 1, as they all reduce to OﬀPAC when η = 0. Each combination of parameters for each algorithm was run on 5 diﬀerent trajectories generated by a ﬁxed behaviour policy interacting with the environment for 100,000 time steps. The learned policies were saved every 1,000 time steps and evaluated 50 times using both the episodic and excursions objective functions from Section 3. For the episodic objective function, the policies were evaluated by creating 50 diﬀerent instances of the environment and executing the target policy from the starting state until termination or 1000 time steps had elapsed. The excursions objective function was evaluated similarly, but with the environment s starting state drawn from the behaviour policy s steady state distribution, chosen by running the behaviour policy for 50,000 time steps and saving every thousandth state. The results were averaged, and the best-performing combinations (by area under the learning curve for the appropriate objective function) were re-run on enough diﬀerent trajectories to reduce the standard error to an acceptable level (100 runs for Puddle World, 30 runs for Mountain Car). The initial parameter sweep was intended to ﬁnd good parameter settings for each method at a reasonable computational cost, and not necessarily the absolute best performance.

9.3.1 Puddle World

The Puddle World environment is a 2-dimensional continuous gridworld containing a start location, goal location, and puddles through which it is costly to move. While it ﬁrst appeared in Boyan and Moore (1995), we use the version from Degris et al. (2012b) that includes an additional puddle near the goal state which makes the task more diﬃcult.8 The behaviour policy took the North, East, South, and West actions with probabilities .45, .45, .05, and .05 respectively. The observations were tile coded with a fairly low-resolution tile coder (4 tilings of 2 2 tiles plus a bias unit) to generate feature vectors with a large degree of generalization. Figure 7 presents the results for the Puddle World environment. The left-hand column shows learning curves, while the right-hand column contains sensitivity analyses for the actor s step size. The ﬁrst four plots (ﬁgures 7a, 7b, 7c, and 7d) show results when using ETD as the critic, while the last four plots (ﬁgures 7e, 7f, 7g, and 7h) show results for a TDRC critic. The ﬁrst and third rows show results using the excursions objective function, and the second and fourth rows show results for the episodic objective function. Comparing the performance of each algorithm under the excursions and episodic objective function, we can see that the algorithms all performed better in the excursions setting than the episodic setting. This is due to the excursions setting being easier than the episodic setting for the classic control problems in this section. In the excursions setting, the agent

8. Please see Degris et al. (2012b) for a picture of the Puddle World environment.

Graves, Imani, Kumaraswamy, and White

(a) Excursions learning curves (ETD)

(b) Excursions step-size sensitivity (ETD)

(c) Episodic learning curves (ETD)

(d) Episodic step-size sensitivity (ETD)

(e) Excursions learning curves (TDRC)

(f) Excursions step-size sensitivity (TDRC)

(g) Episodic learning curves (TDRC)

(h) Episodic step-size sensitivity (TDRC)

Figure 7: Results on Puddle World. Shaded regions are 95% conﬁdence intervals.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

starts from the behaviour policy s state distribution, which in the problems studied is closer to the goal state than the start state. This is not necessarily true for all problems, however, and in general there is no strict ordering of objectives in terms of diﬃculty. ACE variants that used the episodic interest function (dashed lines) generally performed much worse than their counterparts using the uniform interest function (solid lines), with the exception of ACE-direct. This is consistent with the discussion in Section 7 of Thomas (2014) although they deal with the on-policy case which states scaling policy updates by an accumulating product of discount factors hurts the sample eﬃciency of the algorithm. However, the oﬀ-policy case can exacerbate this issue as the accumulating product includes both discount factors and importance sampling ratios. The fact that the direct method of estimating emphatic weightings allowed ACE to perform well when using the episodic interest function is intriguing and merits further study. The choice of critic aﬀected the performance of the algorithms to diﬀering degrees. Replacing an ETD critic with a TDRC critic improved the performance of each of the algorithms, with episodic performance showing the greatest improvement (compare ﬁgures 7c and 7g), perhaps due to the excursions objective being easier for this environment as previously mentioned. ACE-direct with uniform interest performed well regardless of the choice of critic, but improved substantially in the episodic setting when using a TDRC critic. Using a TDRC critic instead of an ETD critic also improved the sensitivity of the algorithms to the actor s step size, allowing a slightly wider range of step sizes to perform well for most methods (compare Figure 7b to 7f, and Figure 7d to 7h). When comparing ACE-trace (orange) to ACE-ideal (blue) where the emphatic trace s importance sampling ratios were computed using the current policy, resulting in an unbiased Monte Carlo estimate of the emphatic weightings we can see that correcting the bias introduced by using the emphatic trace with a changing policy did not improve performance signiﬁcantly in all but one situation. The one exception was when performance was evaluated with the excursions objective function (ﬁgures 7a and 7e). When using an ETD critic (Figure 7a), ACE-ideal outperformed ACE-trace early in learning before deteriorating to a similar level of performance by the end of the experiment. Conversely, ACE-ideal outperformed ACE-trace throughout the experiment when using a TDRC critic (Figure 7e), although the diﬀerence was barely signiﬁcant. In all other situations, the performance of ACE-ideal was not signiﬁcantly diﬀerent from the performance of ACE-trace. Overall, ACE-direct performed better than or equal to ACE-trace, ACE-ideal, and OﬀPAC across all combinations of critic, interest function, and objective function. However, it diverged for some actor step sizes when used with an ETD critic (ﬁgures 7b and 7d). This could be due to the use of a semi-gradient update rule (analogous to oﬀ-policy TD(0) which is not guaranteed to converge when used with function approximation) for the direct method of estimating emphatic weightings. To resolve this issue, one could use the gradient-based update rule in equation 18 of Section 5.2.2. We chose to use the semi-gradient update to keep both the explanation and experiments simple, as there were already a large number of concepts and algorithmic parameters involved.

Graves, Imani, Kumaraswamy, and White

9.3.2 Off-policy Mountain Car

In the original Mountain Car environment, the agent attempts to drive an underpowered car out of a valley (Moore, 1990).9 In the oﬀ-policy version, the agent learns from experience generated by a ﬁxed behaviour policy, in this case the uniform random policy. The observations were tile coded using 8 tilings of 4 4 tiles plus a bias unit. ACE-ideal was omitted from the Mountain Car experiments due to computational constraints; the uniform random behaviour policy rarely completed an episode, which resulted in extremely long trajectories that prevented ACE-ideal from running in a reasonable amount of time. Figure 8 contains the results for the Oﬀ-policy Mountain Car environment. The lefthand column shows learning curves, while the right-hand column shows sensitivity analyses for the actor s step size. The ﬁrst four plots (ﬁgures 8a, 8b, 8c, and 8d) show results when using ETD as the critic, while the last four plots (ﬁgures 8e, 8f, 8g, and 8h) show results for a TDRC critic. The ﬁrst and third rows show results using the excursions objective function, and the second and fourth rows show results for the episodic objective function. Across all choices of critic and objective function, algorithms trained using the episodic interest function (dashed lines) performed much worse than their counterparts trained using the uniform interest function (solid lines). This ﬁnding is consistent with the results from Puddle World, although much more pronounced. In this speciﬁc experiment, the extremely long trajectories generated by the uniform random behaviour policy caused the emphatic trace (and hence the policy updates) of the algorithms that used episodic interest to quickly shrink to the point of irrelevance, whereas in the Puddle World experiment the behaviour policy encounters the goal state more often, limiting trajectory length and shrinking the updates more slowly. Nevertheless, using the direct method to estimate the emphatic weightings allowed ACE to improve the performance of the policy in some cases, whereas using the emphatic trace prevented the policy from meaningfully improving. Again, the choice of critic played a role in the performance of all the algorithms to varying degrees. Using a TDRC critic instead of an ETD critic improved the performance of each of the algorithms, with OﬀPAC seeing the greatest improvement, followed by ACEtrace. ACE-direct again performed well regardless of the choice of critic. Using a TDRC critic improved the sensitivity of each of the algorithms to the actor s step size, allowing a wider range of step sizes to perform well (compare ﬁgures 8b and 8f, and 8d and 8h). Comparing ACE-direct (red) to ACE-trace (orange) reveals several advantages of using the direct method to estimate the emphatic weightings. The direct method allowed ACE to learn faster, with less variance, and ﬁnd a better-performing target policy by the end of the experiment, regardless of the choice of critic or objective function used for evaluation. In addition, the direct method allowed a wider range of actor step sizes to perform well, with performance increasing and decreasing more smoothly and predictably than when using the emphatic trace. Finally, the choice of critic had little eﬀect on the performance of ACEdirect, whereas ACE-trace performed better with a TDRC critic than with an ETD critic. Overall, ACE with the direct method performed better than or equal to OﬀPAC (green) and ACE-trace for all critics and objective functions, but was less sensitive to the actor step size and choice of critic than OﬀPAC.

9. Please see Degris et al. (2012b) for a picture of the Mountain Car environment.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

(a) Excursions learning curves (ETD)

(b) Excursions step-size sensitivity (ETD)

(c) Episodic learning curves (ETD)

(d) Episodic step-size sensitivity (ETD)

(e) Excursions learning curves (TDRC)

(f) Excursions step-size sensitivity (TDRC)

(g) Episodic learning curves (TDRC)

(h) Episodic step-size sensitivity (TDRC)

Figure 8: Results on Oﬀ-policy Mountain Car. Shaded regions are 95% conﬁdence intervals.

Graves, Imani, Kumaraswamy, and White

9.4 The Virtual Oﬃce Environment

The classic control environments considered in the previous section are well-known environments for testing general oﬀ-policy algorithms, but were not designed to probe the weaknesses of the speciﬁc algorithms being studied. Hence, both semi-gradient updates and the direct method of estimating emphatic weightings perform well, and the tradeoﬀs made in the design of each algorithm are not obvious. In this section, we introduce an environment designed to highlight issues with both semi-gradient updates and the direct method of estimating emphatic weightings. The environment is based on Dung et al. (2007) s Virtual Oﬃce, a grid world consisting of a hallway and two cubicles that are identical except for the goal locations. To illustrate the problem with semi-gradient updates, two key properties of the counterexample from Section 8.1 must be incorporated: the optimal actions in the two identical rooms must be diﬀerent, and the behaviour policy must visit the suboptimal room more often. To accomplish the ﬁrst goal, we introduced hidden goal states in the north-east and south-east corners of the two rooms, and assigned rewards of 1 and 0 respectively to the goal states of the north-east room and rewards of 0 and .5 respectively to the goal states of the south-east room. To encourage the behaviour policy to visit the suboptimal room (i.e., the south-east room) more often, we ﬁxed the starting state to the left-most, middle-most state and used a behaviour policy that favoured the South and East actions, taking the North, East, South, and West actions with probabilities .2, .4, .3, and .1 respectively. While Dung et al. (2007) s Virtual Oﬃce contains some partial observability in the form of the two identical rooms, to fully illustrate the tradeoﬀmade by the Markov assumption used in the derivation of the direct method, we limited the agent s observations to a 3 3 grid in front of the agent. The RGB colour values of each square in the agent s view were used as observations, with the agent unable to see through walls or perceive the goal states. The ﬁnal environment was implemented using Minigrid (Chevalier-Boisvert et al., 2018), and is depicted in Figure 9. The agent (red triangle) must navigate to one of the terminal states (the north-east and south-east squares of the green rooms) to obtain the associated reward. The agent observes the RGB colour values of the 3 3 grid of squares in front of it (the lighter blue squares), and selects the North, East, South, or West action in response. For this experiment we followed the methodology detailed in the previous section, with minor changes. First, we restricted the critic trace decay rate to be 0, as eligibility traces can alleviate some of the issues caused by partial observability (Loch and Singh, 1998). Second, we ran each combination of parameters for 30 runs of 200,000 time steps each. The learned policies were saved every 5,000 time steps and evaluated 50 times using both the episodic and excursions objective functions. The best-performing parameter settings were then re-run 100 times and averaged. Figure 10 contains the results of our experiment on the Virtual Oﬃce environment. The left-hand column shows learning curves, while the right-hand column shows sensitivity analyses for the actor s step size. The ﬁrst four plots (ﬁgures 10a, 10b, 10c, and 10d) show results when using ETD as the critic, while the last four plots (ﬁgures 10e, 10f, 10g, and 10h) show results for a TDRC critic. The ﬁrst and third rows show results for the excursions objective function, and the second and fourth rows show the episodic objective function.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

Figure 9: The Virtual Oﬃce environment.

Interestingly, algorithms that used the episodic interest function (dashed lines) performed better than or equal to their counterparts that used the uniform interest function (solid lines) for all combinations of objective function and critic, unlike the previous experiments on Mountain Car and Puddle World. In addition, the algorithms that used the episodic interest function worked well for a larger range of step sizes, making it easier to ﬁnd a value that performs well. This could be due to the partially observable nature of the environment; many of the agent s observations are similar, which could cause the emphatic weightings with a uniform interest function to quickly become very large, necessitating a very small step size to compensate. As in the previous experiments, the choice of critic played an important role in the performance of all the algorithms, although this time with mixed results. Using a TDRC critic instead of an ETD critic improved the performance of OﬀPAC and ACE-direct with uniform interest, but had very little impact on ACE-trace with uniform interest and ACEdirect with episodic interest, and actually reduced the performance of ACE-trace with episodic interest (compare ﬁgures 10a and 10e, and 10c and 10g). Similarly, using a TDRC critic had mixed eﬀects on the sensitivity of each of the algorithms to the actor s step size. It allowed a wider range of step sizes to perform well for some algorithms, but also led to a narrower range of well-performing step sizes for some algorithms (compare ﬁgures 10b and 10f, and 10d and 10h). ACE-direct with the episodic interest function performed better than or equal to the other methods across most combinations of critic and objective function, with the possible exception of the episodic objective function. With a TDRC critic, OﬀPAC learned more quickly than the other methods before eventually being surpassed by both ACE-direct variants. However, OﬀPAC was able to close the gap near the end of the experiment, with the ﬁnal learned policy of OﬀPAC performing comparably to the ACE-direct variants. With an ETD critic, ACE-trace with episodic interest performed surprisingly well, with the ﬁnal policy possibly outperforming ACE-direct with episodic interest. However, the variance of the policy learned by ACE-trace was much larger than the other methods, rendering the diﬀerence in performance not statistically signiﬁcant. Conversely, ACE-direct with episodic interest learned a low-variance policy on each of the settings tested, performing as well as or better than the other algorithms.

Graves, Imani, Kumaraswamy, and White

(a) Excursions learning curves (ETD)

(b) Excursions step-size sensitivity (ETD)

(c) Episodic learning curves (ETD)

(d) Episodic step-size sensitivity (ETD)

(e) Excursions learning curves (TDRC)

(f) Excursions step-size sensitivity (TDRC)

(g) Episodic learning curves (TDRC)

(h) Episodic step-size sensitivity (TDRC)

Figure 10: Results on the Virtual Oﬃce. Shaded regions are 95% conﬁdence intervals.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

9.5 Summary

In this section we investigated several key properties of the ACE algorithm empirically. First we studied how the parameter η trades oﬀbias and variance in the counterexample from Section 8.1 and a continuous action version, ﬁnding that even small values of η (i.e., closer to semi-gradient updates) greatly improved the quality of the ﬁnal learned policy, both when using the true value function and when using a learned critic. Next we tested the ability of the emphatic trace to accurately estimate the emphatic weightings on an extended version of the counterexample, showing that while higher values of η improved performance, a signiﬁcant gap remained between ACE with η = 1 and ACE with the true emphatic weightings. Then we moved to two classic control domains and tested several variants of ACE to determine how the choice of estimator for the emphatic weightings (and associated bias and variance properties) and the choice of critic aﬀects the learning process. We found that the emphatic trace s extreme variance greatly reduced the performance of ACE with its bias playing somewhat less of a role and replacing the emphatic trace with the direct method from Section 5.2.2 resulted in ACE performing as well as or better than the other variants. The TDRC critic was the better critic overall, but ACE with the direct method of estimating emphatic weightings was insensitive to the choice of critic and performed well regardless of whether ETD or TDRC was used. Finally, we compared the performance of several variants of ACE on a new environment designed to highlight the weaknesses of each of the variants, ﬁnding that the episodic variant of ACE with the direct method of estimating emphatic weightings performed as well as or better than the other variants in almost all the situations tested, suggesting it should be the default method of estimating the emphatic weightings.

10. Conclusion

In this paper we introduced a generalized objective for learning policies and proved the oﬀpolicy policy gradient theorem using emphatic weightings. Using this theorem, we derived an oﬀ-policy actor-critic algorithm that follows the gradient of the objective function, as opposed to previous methods like OﬀPAC and DPG that follow an approximate semi-gradient. We designed a simple MDP to highlight that stationary points for semi-gradients can be highly suboptimal, with semi-gradient updates moving away from the optimal solution. We also show, though, that the semi-gradient methods can produce reasonable solutions, either because of suﬃciently rich policy parameterizations or because in some cases they can be seen as using the true gradient update with a diﬀerent state weighting. We leverage these insights to improve the practicality of the method empirically. We design Actor-Critic with Emphatic Weightings (ACE) to have a tuneable parameter η that sweeps between the full gradient (at η = 1) and the semi-gradient (at η = 0), often obtaining the best performance with a relatively small but non-zero η (typically η = 0.1). We additionally reduce variance by directly estimating emphatic weightings with function approximation rather than using the emphatic trace. Empirically, the direct method of estimating the emphatic weightings performed better than the emphatic trace across all objective functions and environments tested. In addition, ACE with the direct method was less sensitive to the actor step size and choice of critic than OﬀPAC. Overall, ACE with the

Graves, Imani, Kumaraswamy, and White

direct method performed better than or equal to OﬀPAC in all situations, suggesting that incorporating (low-variance) state reweightings might be a reasonable direction for future policy gradient methods.

Acknowledgments

We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chair program, and the Alberta Machine Intelligence Institute (Amii). This research was enabled in part by support provided by Sci Net and the Digital Research Alliance of Canada.

Appendix A. Proof of Deterministic Oﬀ-Policy Policy Gradient Theorem

A.1 Assumptions

We make the following assumptions on the MDP:

Assumption 1 P(s |s, a), r(s, a, s ), γ(s, a, s ), π(s; θ) and their derivatives are continuous in all variables s, a, s , θ.

Assumption 2 S is a compact set in Rd, and A is a compact set in R.

Assumption 3 The policy π and discount γ are such that the inverse kernel of δ(s, s ) Pπ,γ(s, s ) exists, where Pπ,γ(s, s ) = R

A π(a|s)γ(s, a, s )P(s |s, a)da.

Under Assumption 1, vπθ(s) and vπθ (s)

θ are continuous functions of θ and s. Together,

Assumptions 1 and 2 imply that vπθ (s)

are bounded

functions of s, which allows us to switch the order of integration and diﬀerentiation, and the order of multiple integrations.

A.2 Proof of Theorem 3

Proof We start by deriving a recursive form for the gradient of the value function with respect to the policy parameters:

θqπ(s, π(s; θ))

S P(s |s, π(s; θ)) r(s, π(s; θ), s ) + γ(s, π(s; θ), s )vπ(s ) ds

P(s |s, π(s; θ)) r(s, π(s; θ), s ) + γ(s, π(s; θ), s )vπ(s ) ds , (21)

where in Equation 21 we used the Leibniz integral rule to switch the order of integration and diﬀerentiation. We proceed with the derivation using the product rule:

θP(s |s, π(s; θ)) r(s, π(s; θ), s ) + γ(s, π(s; θ), s )vπ(s )

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

+ P(s |s, π(s; θ))

r(s, π(s; θ), s ) + γ(s, π(s; θ), s )vπ(s ) ds

θ P(s |s, a)

r(s, π(s; θ), s ) + γ(s, π(s; θ), s )vπ(s ) + P(s |s, π(s; θ))

θr(s, π(s; θ), s ) +

θγ(s, π(s; θ), s )vπ(s ) + γ(s, π(s; θ), s )

θ P(s |s, a)

r(s, π(s; θ), s ) + γ(s, π(s; θ), s )vπ(s )

+ P(s |s, π(s; θ))

θ r(s, a, s )

a=π(s;θ) + π(s; θ)

θ γ(s, a, s )

a=π(s;θ) vπ(s )

S P(s |s, π(s; θ))γ(s, π(s; θ), s )

r(s, π(s; θ), s ) + γ(s, π(s; θ), s )vπ(s )

+ P(s |s, π(s; θ))

r(s, a, s ) + γ(s, a, s )vπ(s ) a=π(s;θ)

S P(s |s, π(s; θ))γ(s, π(s; θ), s )

P(s |s, a) r(s, a, s ) + γ(s, a, s )vπ(s ) a=π(s;θ) ds

S P(s |s, π(s; θ))γ(s, π(s; θ), s ) vπ(s )

a=π(s;θ) + Z

S P(s |s, π(s; θ))γ(s, π(s; θ), s ) vπ(s )

θ ds . (22)

For simplicity of notation, we will write Equation 22 as

θ = g(s) + Z

S Pπ,γ(s, s ) vπ(s )

θ ds , (23)

since P(s |s, π(s; θ))γ(s, π(s; θ), s ) is a function of s and s for a ﬁxed deterministic policy. Then we can write vπ(s)

θ as an integral transform using the delta function:

S δ(s, s ) vπ(s )

θ ds . (24)

Plugging Equation 24 into the left-hand side of Equation 23, we obtain: Z

S δ(s, s ) vπ(s )

θ ds = g(s) + Z

S Pπ,γ(s, s ) vπ(s )

δ(s, s ) Pπ,γ(s, s ) vπ(s )

θ ds = g(s)

S k(s, s )g(s ) ds , (25)

Graves, Imani, Kumaraswamy, and White

where k(s, s ) is the inverse kernel of δ(s, s ) Pπ,γ(s, s ). Now, using the continuous version of the weighted excursions objective deﬁned in Equation 6, we have:

S dµ(s)i(s)vπ(s) ds

θ dµ(s)i(s) ds

S k(s, s )g(s ) ds dµ(s)i(s) ds

S k(s, s )dµ(s)i(s) ds g(s ) ds , (26)

where in Equation 26 we used Fubini s theorem to switch the order of integration. Now, we convert the recursive deﬁnition of emphatic weightings for deterministic policies over continuous state-action spaces into a non-recursive form. Recall the deﬁnition:

m(s ) = dµ(s )i(s ) + Z

S Pπ,γ(s, s )m(s) ds. (27)

Again, we can write m(s ) as an integral transform using the delta function:

S δ(s, s )m(s) ds. (28)

Plugging Equation 28 into the left-hand side of Equation 27, we obtain: Z

S δ(s, s )m(s) ds = dµ(s )i(s ) + Z

S Pπ,γ(s, s )m(s) ds

S (δ(s, s ) Pπ,γ(s, s ))m(s) ds = dµ(s )i(s )

= m(s ) = Z

S k(s, s )dµ(s)i(s) ds, (29)

where k(s, s ) is again the inverse kernel of δ(s, s ) Pπ,γ(s, s ). Plugging Equation 29 into Equation 26 yields

S m(s )g(s ) ds

S m(s) π(s; θ)

a=π(s;θ) ds.

Appendix B. Continuous State Oﬀ-Policy Policy Gradient Theorem

Given the Deterministic Oﬀ-Policy Policy Gradient Theorem in Appendix A, we can now provide a proof for the continuous-state version of the Oﬀ-Policy Policy Gradient Theorem provided in Theorem 2.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

Proof The gradient of the objective in the continuous-states case is

S i(s)vπ(s) ds

S i(s) vπ(s)

Therefore, again to compute the gradient of Jµ we need to compute the gradient of the value function with respect to the policy parameters. A recursive form of the gradient of the value function can be derived, as we show below. Before starting, for simplicity of notation, we will again use

θ qπ(s, a),

where g : S Rd. Now let us compute the gradient of the value function:

a π(s, a; θ)qπ(s, a)

θ qπ(s, a) + X

a π(s, a; θ) qπ(s, a)

a π(s, a; θ) R

S P(s |s, a)(r(s, a, s ) + γ(s, a, s )vπ(s )) ds

a π(s, a; θ) Z

S P(s |s, a)γ(s, a, s ) vπ(s )

a π(s, a; θ)P(s |s, a)γ(s, a, s ) vπ(s )

θ ds . (31)

For simplicity of notation we will write Equation 31 as

θ = g(s) + Z

S Pπ,γ(s, s ) vπ(s )

θ ds , (32)

where Pπ,γ(s, s ) = P a π(s, a; θ)P(s |s, a)γ(s, a, s ). Again, as in the deterministic oﬀ-policy policy gradient theorem, we can write vπ(s)

θ as an integral transform using the delta function: vπ(s)

S δ(s, s ) vπ(s )

θ ds . (33)

Plugging Equation 33 into the left-hand side of Equation 32, we obtain: Z

S δ(s, s ) vπ(s )

θ ds = g(s) + Z

S Pπ,γ(s, s ) vπ(s )

δ(s, s ) Pπ,γ(s, s ) vπ(s )

θ ds = g(s)

S k(s, s )g(s ) ds . (34)

Graves, Imani, Kumaraswamy, and White

Plugging this gradient from Equation 34 back into Equation 30, we obtain

S i(s) vπ(s)

S k(s, s )g(s ) ds ds (35)

S k(s, s )i(s) ds g(s ) ds (36)

S m(s )g(s ) ds

S m(s)g(s) ds,

where in Equation 35 we used Fubini s theorem to switch the order of integration, and in Equation 36 we use the non-recursive version of emphasis as shown in Equation 29.

Appendix C. Standard Approaches for Sampling the Gradient from a State

In this section we overview the standard approach to sample the gradient from a given state, that is used in both ACE and OﬀPAC. This section reiterates known results, but is included here for completeness.

C.1 Sampling the Gradient for a Given State

For a given state, we can use the same strategies to sample the gradient as OﬀPAC, and many other actor-critic algorithms. We can use the log-likelihood approach, and incorporate baselines. For continuous actions, we can also use other strategies like the reparameterization trick. For concreteness, we explicitly explain how we sample this gradient here, using a value baseline and the log-likelihood approach, but emphasize that this can be replaced with other choices. Under ﬁnite states, the simplest approach to compute the gradient for a given state is to use what is sometimes called the all-actions gradient

θ qπ(s, a).

If there are many actions or if the action space is continuous, this gradient is not an appropriate choice. Instead, we can obtain a sample estimate of this gradient using the log-likelihood trick: X

θ qπ(s, a) = X

a π(a|s; θ) log π(a|s; θ)

θ qπ(s, a),

which follows from the fact that the derivative of log y is 1 y. Above we called this log likelihood form g(s, θ). Now we can use one action sampled according to π to obtain

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

an unbiased estimate of the all-actions gradient. In the oﬀ-policy setting, however, we do not use π to select actions. So, additionally, we can incorporate importance sampling ρ(s, a; θ) def= π(a|s;θ)

µ(a|s) to adjust for the fact that the actions are taken according to behaviour policy µ:

θ qπ(s, a) = X

a µ(a|s) 1 µ(a|s)π(a|s; θ) log π(a|s; θ)

a µ(a|s)ρ(s, a; θ) log π(a|s; θ)

θ qπ(s, a).

We can obtain an unbiased sample of the gradient by sampling an action A µ(s, ) and using g(A) def= ρ(s, A; θ) log π(a|s;θ)

θ qπ(s, A). The estimated gradient has more variance when we only use a sampled action rather than all actions. The standard strategy to reduce this variance is to subtract a baseline. The baseline is a function of state, such as an estimate of the value function v(s), with modiﬁed update

ρ(s, a; θ) log π(a|s; θ)

θ [qπ(s, a) v(s)].

It is straightforward to show that the expected value of this update is the same, i.e. that

a µ(a|s)ρ(s, a; θ) log π(a|s; θ)

θ [qπ(s, a) v(s)] = X

a µ(a|s)ρ(s, a; θ) log π(a|s; θ)

θ qπ(s, a).

Note that this is not the minimal variance baseline (Dick, 2015), but it nonetheless is one of the most commonly chosen in practice due to its eﬃcacy and simplicity.

C.2 Estimating the Advantage

In practice, it is diﬃcult to obtain qπ(s, a) to compute this gradient exactly. Instead, we obtain (a likely biased) sample of the advantage, δt qπ(St, At) v(st). An unbiased but high variance estimate uses a sample of the return Gt from (st, at), and uses δt = Gt v(st). Then E[δt|St = s, At = a] = qπ(s, a) v(s). On the other extreme, we can directly approximate the action-values, ˆq(s, a). In-between, we can use n-step returns, or λ-returns, to balance between using observed rewards and value estimates. The simplest choice is to use a 1-step return: r(s, a) + γ(s, a, St+1)v(St+1) is used as an approximation of qπ(s, a). The value function v is typically called the critic, and plays both the role of estimating the return as well as that of a baseline. The advantage is set to δt = Rt+1 + γt+1v(St+1) v(St), which is simply the TD error. More generally, however, we can also use multi-step returns. For an n-step return, with data sampled on-policy, with γt+1:t+n def= Qn j=1 γt+j as short-hand for the product of discounts, we would use

δt = Rt+1 + γt+1Rt+2 + . . . γt+1:t+n 1Rt+n + γt+1:t+nv(St+n) v(St).

With oﬀ-policy data, we would need to incorporate importance sampling ratios

δt = Rt+1 + ρt+1γt+1Rt+2 + . . . ρt+1:t+n 1γt+1:t+n 1Rt+n + ρt+1:t+nγt+1:t+nv(St+n) v(St),

Graves, Imani, Kumaraswamy, and White

where ρt+1:t+n def= Qn j=1 ρt+j. For suﬃciently large n, we recover the sampled return and so obtain an unbiased but likely high-variance estimate of the advantage. An interim n likely provides a reasonable choice between bias and variance, between n = 0 with a direct estimation of qπ(s, a), and large n that gives an unbiased sample of the return. We can also average across multiple n, and obtain λ-returns (Sutton and Barto, 2018). For larger λ, more weight is put on the n-step returns with larger n, and for λ = 1 we once again recover an unbiased sample of the return. Such an approach was used in OﬀPAC, which requires storing traces of gradients of policy parameters (see Algorithm 1 in Degris et al., 2012b). Note that this λ-return is used for the policy update. The update to the value estimate (critic) itself can use a diﬀerent λ, as it stores a separate trace for the gradient of the value function. The update with eligibility traces is only sound under linear function approximation for the values and linear function approximation for actionpreferences. In this work, therefore, we opt for the simplest choice: n = 1 with δt = Rt+1 + γt+1v(St+1) v(St). Finally, we need to estimate the values themselves. Any value approximation algorithm can be used to estimate v, such as TD, gradient TD or even emphatic TD. Given we already compute the emphasis weighting, it is straightforward to use emphatic TD. We investigate both a gradient TD critic and an emphatic TD critic in the experiments.

Appendix D. Proof of Theorem 8

D.1 Entropy-regularized OPPG theorem with discrete states and actions

Proof The proof for this theorem is mostly similar to the one for Theorem 2. First, since the interest function does not depend on the parameters,

θ = P s S i(s) vπ(s)

s S i(s) vπ(s)

Recall that vπ(s) = P a π(a|s; θ) qπ(s, a) + τH(π( |s; θ)). We deﬁne the following notation which helps with the recursive expression of the gradient:

θ qπ(s, a) + τ

θH(π( |s; θ)),

where g : S Rd. The gradient of the entropy regularized value function is then

a π(a|s; θ) qπ(s, a) + τH(π( |s; θ))

θ qπ(s, a) + X

a π(a|s; θ) qπ(s, a)

θH(π( |s; θ)) (37)

a π(a|s; θ) P s p(s |s, a)(r(s, a, s ) + γ(s, a, s ) vπ(s ))

a π(a|s; θ) X

s p(s |s, a)γ(s, a, s ) vπ(s )

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

So the gradient of the regularized value function can be written in vector form similar to the unregularized version. Let vπ R|S| d be the matrix of gradients of vπ for each state s; and G R|S| d the matrix where each row corresponds to state s in the vector g(s). Then

vπ = G + Pπ,γ vπ (38)

= vπ = (I Pπ,γ) 1 G.

Therefore, we obtain

s S i(s) vπ(s)

θ = i vπ = i (I Pπ,γ) 1 G = m G

θ qπ(s, a) + τ

θH(π( |s; θ))

We can further simplify this gradient by explicitly computing the gradient of the negative of the entropy:

-H(π( |s; θ))

θ log π(a|s; θ) + X

a π(a|s; θ) log π(a|s; θ)

θ log π(a|s; θ) + X

θ log π(a|s; θ),

where the last line follows from the fact that P a π(a|s;θ)

θ = θ P a π(a|s; θ) = θ1 = 0.

This form looks a bit diﬀerent than the (on-policy) policy gradient update given by (Geist et al., 2019, Theorem 6). We can actually get a similar form by rewriting the inner term above. For a speciﬁc s, using chain rule on the entropy we obtain

θ qπ(s, a) + τ

θH(π( |s; θ)) = X

θ qπ(s, a) + τ π(a|s; θ)

θ H(π( |s; θ))

qπ(s, a) + τ H(π( |s; θ))

D.2 Regularized OPPG theorem with continuous states and actions

θ qπ(s, a) da + τ

θΩ(π( |s; θ)) ds

Proof We extend the result from the previous section to the continuous state and action setting using a similar strategy to the one we used in Appendix B to extend Theorem 2

Graves, Imani, Kumaraswamy, and White

to the continuous state setting. In addition, we opt for a generic regularizer Ω(π( |s; θ)) : (A) R, which slightly modiﬁes the deﬁnition of the regularized value functions to be

A π(a|s; θ) qπ(s, a) da + τΩ(π( |s; θ)) s S

qπ(s, a) = Z

S P(s |s, a) r(s, a, s ) + γ(s, a, s ) vπ(s ) ds s S, a A.

We use the following notation to simplify the derivation:

θ qπ(s, a) da + τ

θΩ(π( |s; θ)).

Starting with the continuous objective function from equation (6) and using the Leibniz integral rule to diﬀerentiate under the integral (which applies because we assumed the state space is a compact set in Assumption 2 from Appendix A.1):

S i(s) vπ(s) ds = Z

S i(s) vπ(s)

Next, we ﬁnd a recursive expression for the gradient of the regularized value function:

A π(a|s; θ) qπ(s, a) da + τΩ(π( |s; θ))

θ qπ(s, a) da + Z

A π(a|s; θ) qπ(s, a)

θΩ(π( |s; θ))

A π(a|s; θ) qπ(s, a)

A π(a|s; θ)

S P(s |s, a) r(s, a, s ) + γ(s, a, s ) vπ(s ) ds da

A π(a|s; θ) Z

S P(s |s, a)γ(s, a, s ) vπ(s )

A π(a|s; θ)P(s |s, a)γ(s, a, s ) da vπ(s )

where in the last step we used Fubini s Theorem to switch the order of integrals (which we can do because the absolute value of the integral is ﬁnite due to Assumptions 1 and 2). Again we will use the shorthand Pπ,γ(s, s ) = R

A π(a|s; θ)P(s |s, a)γ(s, a, s ) da to simplify the derivation. Next we write vπ(s)

θ as the integral transform

S δ(s, s ) vπ(s )

plug it into the previous equation, and solve to get Z

S δ(s, s ) vπ(s )

θ ds = g(s) + Z

S Pπ,γ(s, s ) vπ(s )

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

δ(s, s ) Pπ,γ(s, s ) vπ(s )

θ ds = g(s)

S k(s, s ) g(s ) ds ,

where k(s, s ) is the inverse kernel of δ(s, s ) Pπ,γ(s, s ). Then plugging the above into Equation (39), interchanging integrals using Fubini s theorem, and simplifying gives

S k(s, s ) g(s ) ds ds

S i(s)k(s, s ) ds g(s ) ds

S m(s ) g(s ) ds

S m(s) g(s) ds

θ qπ(s, a) da + τ

θΩ(π( |s; θ)) ds.

Appendix E. Proof of Proposition 10

This section proves Proposition 10. First, a result by Mei et al. (2020) for tabular domains is extended to state aggregation and transition dependent discounting (which includes the three-state counterexample) to show that entropy-regularized policy gradient converges to a point where the gradient is zero. Then, it is shown that such point is not a stationary point for semi-gradient updates.

E.1 Entropy-regularized PG under State Aggregation

We assume the policy is a softmax policy and additionally speciﬁcally characterize the gradient under state aggregation. This speciﬁc characterization facilitates showing that the solution to the objective lies on the interior of the simplex, and so that a stationary point exists. We deﬁne alias(s) as the set of state that share their representation with s including s itself. We additionally deﬁned Srep S the set of representative states, one for each bin in the aggregation. For example, in the three-state counterexample alias(s0) = {s0} and alias(s1) = alias(s2) = {s1, s2}. We simply need to choose one state in each aliased set, giving Srep = {s0, s1}. For a parameter set θ, the policy is a softmax transform deﬁned as follows. For a state s with representative state s, i.e., s alias(s), we have

π(s , a; θ) = exp θ(s, a)

P a exp θ(s, a ) . (40)

Graves, Imani, Kumaraswamy, and White

Using again the three-state counterexample, θ has four components: θ(s0, a0) and θ(s0, a1) specify the policy for s0, and θ(s1, a0) and θ(s1, a1) specify the policy for both s1 and s2 since these two states are aliased. The softmax policy has a simple well-known gradient, which we can explicitly write for state-aggregation in the following lemma for easy reference in later proofs.

Lemma 12 Assume the policy uses a softmax distribution, as in Equation 40. For any state s Srep, for any s alias(s),

π(s , a ; θ)

θ(s, a) = π(s , a; θ)[1 π(s , a; θ)] if a = a π(s , a; θ)π(s , a ; θ) if a = a (41)

and, across all actions, this can be compactly written as

π(s , a ; θ)

θ(s, ) = diag(π(s, .; θ)) π(s, .; θ)π(s, .; θ) .

Proof For explicit step to obtain this derivation, notice ﬁrst that

log π(s , a; θ) = θ(s, a) log X

a exp(θ(s, a ))

π(s , a ; θ)

θ(s, a) = π(s , a ; θ) log π(s , a; θ)

= π(s , a ; θ) [θ(s, a ) log P a exp(θ(s, a ))] θ(s, a)

= π(s , a ; θ) θ(s, a )

θ(s, a) log P a exp(θ(s, a )) θ(s, a)

If a = a, then

θ(s, a) log P a exp(θ(s, a )) θ(s, a) = 1 1 P a exp(θ(s, a ) P a exp(θ(s, a ))

= 1 1 P a exp(θ(s, a )) exp(θ(s, a)

= 1 π(a|s; θ)

= 1 π(s , a; θ),

where the last step follows from the fact that π(s , a; θ) = π(a|s; θ) under state aggregation. If a = a, then

θ(s, a) log P a exp(θ(s, a )) θ(s, a ) = 0 π(a|s; θ) = π(s , a; θ).

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

The gradient ascent update using the entropy-regularized objective is

θt+1 = θt + η Jµ(θt)

for stepsize η > 0. For an appropriately small stepsize η, this will converge to a stationary point if one exists, and otherwise move to a point on the simplex, where θt(s, a) for an action a in state s. The standard condition on η is that η < 1/L for Lipschitz constant L of Jµ, which is a common assumption for policy gradient methods. In the next section, we show that this update converges to a stationary on the interior of the policy simplex. We provide one more result, that is useful for this characterization. Lemma 13 breaks down the gradient into its components. Mei et al. (2020), in their Lemma 10, proved this result for entropy regularized policy gradients, assuming tabular policies in the on-policy setting. We extend it to state aggregation, with our oﬀ-policy gradient. Their presentation has an extra term τ log π(a|s; θ). This discrepancy is just a diﬀerence in notation. In our formulation, entropy regularized action values are deﬁned to contain the entropy term so that the connection between the regularized and unregularized policy gradient is clearer.

Lemma 13 For each state s Srep and action a

Jµ(θ) θ(s, a) = X

s alias(s) m(s )π(s , a; θ) qπ(s , a) vπ(s ) .

Proof According to Theorem 8

θ qπ(s, a).

Notice that only for states s alias(s), and any a , we have π(s ,a ;θ)

θ(s,a) = 0. We can write partial derivatives w.r.t. the parameters of each state and action, using Equation 41:

Jµ(θ) θ(s, a) = X

s alias(s) m(s ) X

π(s , a ; θ)

θ(s, a) qπ(s , a )

s alias(s) m(s )

π(s , a; θ)(1 π(s , a; θ)) qπ(s , a)

a =a π(s , a; θ)π(s , a ; θ) qπ(s , a )

s alias(s) m(s )π(s , a; θ)

qπ(s , a) X

a π(s , a ; θ) qπ(s , a )

s alias(s) m(s )π(s , a; θ) qπ(s , a) vπ(s ) .

Graves, Imani, Kumaraswamy, and White

E.2 Existence of stationary points for the entropy-regularized PG objective

We are now ready to prove that a stationary point exists for τ > 0. Mei et al. (2020) (Lemma 17) proved that, under tabular representation and softmax parameterization, entropy regularized policy gradient updates converge to a point where the gradient is zero. We extend their proof to state aggregation and transition-dependent discounting. We additionally avoid assuming that the rewards are non-negative.

Assumption 4 (Lipschitz continuity) Jµ is Lipschitz continuous with Lipschitz constant L.

Assumption 5 (Bounded reward) |r(s, a)| Rmax.

Assumption 6 (Bounded expected sum of future discounting) There exists C R such that for any s S, Eπ[γt+1 + γt+1γt+2 + . . . |St = s] C. This assumption is satisﬁed if π is proper, or if γt+i < 1 after a bounded number of steps i N.

We ﬁrst prove that the state values are bounded and show that the action-values are lower bounded by the scaled log of the probabilities. We use this to show the main result in Proposition 15.

Lemma 14 Under Assumptions 5 and 6, for any given policy parameters θ,

vπ(s) CRmax + Cτ log na qπ(s, a) vπ(s) τ log π(a|s; θ) (2C + 1)Rmax Cτ log na.

Proof First notice that for any distribution p over actions

Entropy(p) def= X

a A p(a) log p(a) X

where na is the number of actions. The inequality follows from the fact that the uniform distribution has the highest entropy. We use this inequality to bound Eπ[ log π(At|St; θ)|St = s] = P a π(a|s; θ) log π(a|s; θ), which is the entropy of π(s, ; θ). Using this bound on entropy, and the facts that the entropy is nonnegative and γt+i 1 for all i N, we obtain

vπ(s) = Eπ[Gt|St = s] τEπ[log π(At|St; θ) + γt+1 log π(St+1, At+1; θ) + . . . |St = s]

CRmax + τEπ[ log π(At|St; θ)|St = s] + τEπ[γt+1( log π(St+1, At+1; θ))|St = s] + . . .

CRmax + τ log na + τ log na Eπ[γt+1|St = s] + τ log na Eπ[γt+2|St = s] + . . .

CRmax + Cτ log na.

Finally, we can lower bound the advantage function, using vπ(s) Eπ[Gt|St = s] CRmax and

qπ(s, a) = Eπ[Rt+1 + γt+1 vπ(St+1)|St = s, At = a] τ log π(a|s; θ)

Rmax CRmax τ log π(a|s; θ),

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

qπ(s, a) vπ(s) Rmax CRmax τ log π(a|s; θ) vπ(s)

Rmax CRmax τ log π(a|s; θ) (CRmax + Cτ log na)

= τ log π(a|s; θ) (2C + 1)Rmax Cτ log na.

Proposition 15 Under Assumptions 4, 5 and 6, the entropy-regularized policy update in Equation 42 with τ > 0 and η < 1/L converges to a point with zero gradient.

Proof Let {θt} t=1 be the trajectory of parameters under the gradient ascent update, where θt θ . This trajectory either converges to ﬁnite θ that provides a policy on the interior of the policy simplex, or it pushes the weights for a subset of actions to inﬁnity to converge to a point on the simplex, where certain actions have zero probability. The gradient is only zero for the solution on the interior, and so to prove the result we simply need to show that this process converges to the interior. We show this is true for every parameter θt(s, a) in the vector θt, where s Srep and a A. Deﬁne A0(s) and A+(s) as the sets of actions with zero and nonzero probability under π(s, .; θ ). We use a proof by contradiction to show that A0(s) = for any s as long as τ > 0. Suppose there exist s and a0 such that a0 A0(s). Note that π(s , a; θ) = π(s, a; θ) for all s alias(s). We know that π(s, a0; θt) 0 and log π(s, a0; θt) as t . Therefore there exists t0 > 0 such that for all s alias(s) and t t0

log π(s , a0; θt) (2C + 1)Rmax + Cτ log na

According to Lemma 13, for all t t0

Jµ(θt) θt(s, a0) = X

s alias(s) m(s )π(s , a0; θt) qπ(s , a0) vπ(s ) And applying Lemma 14

s alias(s) m(s )π(s , a0; θt) τ log π(a|s; θ) (2C + 1)Rmax Cτ log na

s alias(s) m(s )π(s , a0; θt) τ (2C + 1)Rmax + Cτ log na

τ ((2C + 1)Rmax + Cτ log na)

So θt(s, a0) is non-decreasing after t0, because we only add this non-negative gradient to θt(s, a0). This means that θ (s, a0) is lower bounded by a constant c:

exp(θ (s, a0)) > ec > 0.

Next, we bound | P a exp θ (s, a) |. Notice that the sum of gradient components over all actions at a state is zero: X

Jµ(θt) θt(s, a) = X

s alias(s) m(s ) X

a π(s , a; θt)[ qπ(s , a) vπ(s )]

Graves, Imani, Kumaraswamy, and White

s alias(s) m(s ) vπ(s ) vπ(s ) = 0. (44)

Jµ(θt) θt(s, a) = X

Jµ(θt) θt(s, a0) + X

Jµ(θt) θt(s, a+)

we get that for all t > t0,

Jµ(θt) θt(s, a+) = X

Jµ(θt) θt(s, a) X

Jµ(θt) θt(s, a0)

Jµ(θt) θt(s, a0) 0,

where the last line follows from Equation 43. This means that P a+ A+(s) θt(s, a+) is nonincreasing after t0. For a+ A+(s), we know that there exists some constant that lower bounds θt(s, a+) for all t. Otherwise, a+ A+(s) would not have non-zero probability under π(s, .; θ ). Therefore, non-increasing P a+ A+(s) θt(s, a+) cannot be due to some θt(s, a+) approaching while others approach . Instead, each θt(s, a+) must also be bounded above. This means there exists some b+ > 0 such that exp( b+) < exp(θt(s, a+)) exp(b+) for every a+ A+(s). For all a0 A0, we also know that there exists b0 such that exp(θt(s, a0)) < b0 for all t > t0. Otherwise, if exp(θt(s, a0)) approaches inﬁnity, then a0 would not be in A0. Putting this together with the above, and knowing there is at least one a+ A+(s), X

a exp θt(s, a) na exp(b+) + b0

a exp θt(s, a) exp(b+) > 0.

Finally, this gives

π(s, a0; θt) = exp θt(s, a0)

P a exp θt(s, a)

c na exp(b+) + b0 .

Taking the limit as t , this lower bound remains positive, implying that π(s, a0; θ ) > 0, which is a contradiction. Therefore, no such a0 can exist and so A0(s) = for all s S. Since the policy converges to the interior of the probability simplex and the objective function is Lipschitz continuous, the gradient has to be zero at the point of convergence (Mei et al., 2020) (Lemma 17).

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

E.3 Proof for Three-State Counterexample

In the previous section, we showed that a stationary point for the entropy-regularized PG objective exists. We now show that in the three-state counterexample that such a point is not a stationary point under semi-gradient updates.

Lemma 16 A point with zero gradient is not a stationary point under semi-gradient updates in the three-state counterexample.

Proof Suppose θ is a point with zero gradient, i.e.:

Every component of the gradient vector is zero, thus

Jµ(θ) θ(s1, .) = X

s alias(s1) m(s) X

θ(s, .) qπ(s, a) = 0.

Since taking any action from s1 or s2 results in termination, the following holds for s {s1, s2} and any action a:

qπ(s, a) = X

s P(s |s, a) r(s, a, s ) τ log π(a|s; θ) + γ(s, a, s ) vπ(s )

s P(s |s, a) r(s, a, s ) τ log π(a|s; θ) + γ(s, a, s ) vπ(s )

=r(s, a) τ log π(a|s; θ) Recall that r(s, a) def= X

s P(s |s, a)r(s, a, s ).

States s1 and s2 have the same representation, meaning that π(s1,.;θ)

θ(s1,.) = π(s2,.;θ)

θ(s2,.) and π(s1, .; θ) = π(s2, .; θ). So

Jµ(θ) θ(s1, .) = X

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) [r(s, a) τ log π(s1, a; θ)]

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) r(s, a)

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ).

For simplicity of notation, deﬁne

M(s1) def= X

s alias(s1) m(s)

D(s1) def= X

s alias(s1) dµ(s),

Graves, Imani, Kumaraswamy, and White

letting us write the above as

Jµ(θ) θ(s1, .) = X

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) r(s, a) M(s1) X

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ).

Similarly, the semi-gradient update to θ(s1, .) becomes

s alias(s1) dµ(s) X

π(s1, a; θ)

θ(s1, .) r(s, a) D(s1) X

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ).

We show that, even though θ is a stationary point under the true gradient update, this semi-gradient under θ is not zero. Notice ﬁrst that

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ)

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ) + D(s1)

M(s1) Jµ(θ) θ(s1, .) because Jµ(θ)

θ(s1, .) = 0

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ)

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) r(s, a) M(s1) X

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ)

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ) + D(s1)

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) r(s, a)

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ)

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) r(s, a)

s alias(s1) dµ(s) X

π(s1, a; θ)

θ(s1, .) r(s, a) D(s1) X

π(s1, a; θ)

θ(s1, .) τ log π(s1, a; θ) (45)

s alias(s1) dµ(s) X

π(s1, a; θ)

θ(s1, .) r(s, a) D(s1)

s alias(s1) m(s) X

π(s1, a; θ)

θ(s1, .) r(s, a)

s alias(s1) δ(s) X

π(s1, a; θ)

θ(s1, .) r(s, a) where δ(s) def= dµ(s)

P s alias(s1) dµ(s ) P s alias(s1) m(s ) m(s)

=δ(s1) π(s1, a0; θ)

θ(s1, .) 2 + δ(s2) π(s1, a1; θ)

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

because r(s1, a0) = 2, r(s2, a1) = 1, r(s1, a1) = r(s2, a0) = 0

=2 δ(s1) δ(s2) π(s1, a0; θ)

θ(s1, .) . as π(s1, a1; θ)

θ(s1, .) = π(s1, a0; θ)

θ(s1, .) by Lemma 12

The second factor π(s1,a0;θ)

θ(s1,.) is nonzero due to softmax parametrization. It therefore suﬃces to show that the ﬁrst term 2 δ(s1) δ(s2) is not zero. In each episode in the counterexample, regardless of the policy, the agent spends one step in s0 and one step in either s1 or s2. Therefore

dµ(s0) = 0.5, dµ(s1) + dµ(s2) = 0.5.

Unfolding the emphatic weightings gives

m(s0) = dµ(s0) = 0.5

m(s1) = dµ(s1) + dµ(s0)π(s0, a0; θ) = dµ(s1) + 0.5π(s0, a0; θ)

m(s2) = dµ(s2) + dµ(s0)π(s0, a1; θ) = dµ(s2) + 0.5π(s0, a1; θ)

= 0.5 dµ(s1) + 0.5(1 π(s0, a0; θ)) = 1 dµ(s1) 0.5π(s0, a0; θ).

Plugging these values into δ(s) results in

δ(s) def= dµ(s)

P s alias(s1) dµ(s ) P s alias(s1) m(s ) m(s)

= dµ(s) dµ(s1) + dµ(s2)

m(s1) + m(s2) m(s)

= dµ(s) 0.5

1 m(s) = dµ(s) 0.5m(s).

δ(s1) = dµ(s1) 0.5m(s1) = 0.5dµ(s1) 0.25π(s0, a0; θ)

δ(s2) = dµ(s2) 0.5m(s2) = 0.5 dµ(s1) 0.5 + 0.5dµ(s1) + 0.25π(s0, a0; θ)

= 0.5dµ(s1) + 0.25π(s0, a0; θ),

and we get that

2δ(s1) δ(s2) = dµ(s1) 0.5π(s0, a0; θ) + 0.5dµ(s1) 0.25π(s0, a0; θ)

= 1.5dµ(s1) 0.75π(s0, a0; θ)

= 0.75 µ(s0, a0) π(s0, a0; θ) . because dµ(s1) = 0.5µ(s0, a0)

As long as π(s0, a0; θ) = µ(s0, a0), the semi-gradient is not zero. For example, as in Section 8.1, we can choose µ(s0, a0) = 0.25. Now we show that π(s0, a0; θ) = 0.25 for a stationary point under the true gradient. Let us ﬁrst write the partial derivative w.r.t. the ﬁrst parameter given that π(s0, a0; θ) = 0.25:

Jµ(θ) θ(s0, a0) = X

s alias(s0) m(s) X

π(a|s; θ) θ(s0, a0) qπ(s, a)

Graves, Imani, Kumaraswamy, and White

= m(s0) π(s0, a0; θ)

θ(s0, a0) qπ(s0, a0) + π(s0, a1; θ)

θ(s0, a0) qπ(s0, a1)

= 0.5 π(s0, a0; θ)(1 π(s0, a0; θ)) qπ(s0, a0)

π(s0, a1; θ)π(s0, a0; θ) qπ(s0, a1)

= 0.5 0.25 0.75 qπ(s0, a0) qπ(s0, a1) .

For this derivative to be zero, qπ(s0, a0) qπ(s0, a1) must be zero.

qπ(s0, a0) qπ(s0, a1)

= τ log π(s0, a0; θ) + π(s1, a0, θ) qπ(s1, a0) + π(s1, a1, θ) qπ(s1, a1)

+ τ log π(s0, a1; θ) π(s2, a0, θ) qπ(s2, a0) π(s2, a1, θ) qπ(s2, a1)

= τ log 0.25 + π(s1, a0, θ)(2 τ log π(s1, a0, θ)) + π(s1, a1, θ)( τ log π(s1, a1, θ))

+ τ log 0.75 π(s2, a0, θ)( τ log π(s2, a0, θ)) π(s2, a1, θ)(1 τ log π(s2, a1, θ))

Recall that π(s2, , θ) = π(s1, , θ) and π(s1, a1, θ) = 1 π(s1, a0, θ). Then

qπ(s0, a0) qπ(s0, a1)

= τ log 0.25 + π(s1, a0, θ)(2 τ log π(s1, a0, θ)) + π(s1, a1, θ)( τ log π(s1, a1, θ))

+ τ log 0.75 π(s1, a0, θ)( τ log π(s1, a0, θ)) π(s1, a1, θ)(1 τ log π(s1, a1, θ))

= τ(log 0.75 log 0.25) + 2π(s1, a0, θ) π(s1, a1, θ)

= τ(log 3) + 3π(s1, a0, θ) 1, (46)

because log 0.75 log 0.25 = log(0.75/0.25) = log 3. For small values of τ, π(s1, a0, θ) should be close to 1/3 to make θ a stationary point. Now let us write the partial derivative w.r.t. θ(s1, a0), again assuming π(s0, a0; θ) = 0.25 and by noting that s1 and s2 are aliased:

Jµ(θ) θ(s1, a0) = X

s alias(s1) m(s) X

π(a|s; θ) θ(s1, a0) qπ(s, a)

= m(s1) π(s1, a0; θ)

θ(s1, a0) qπ(s1, a0) + π(s1, a1; θ)

θ(s1, a0) qπ(s1, a1)

+ m(s2) π(s2, a0; θ)

θ(s1, a0) qπ(s2, a0) + π(s2, a1; θ)

θ(s1, a0) qπ(s2, a1)

= m(s1) π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s1, a0)

π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s1, a1)

+ m(s2) π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s2, a0)

π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s2, a1)

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

= m(s1) π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s1, a0)

π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s1, a1)

+ m(s2) π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s2, a0)

π(s1, a0; θ)(1 π(s1, a0; θ)) qπ(s2, a1)

= π(s1, a0; θ)(1 π(s1, a0; θ)) m(s1) qπ(s1, a0) qπ(s1, a1) + m(s2) qπ(s2, a0) qπ(s2, a1) .

Making the partial derivative zero requires either π(s1, a0; θ), 1 π(s1, a0; θ), or the contents of the brackets to be zero. The ﬁrst two are incompatible with the requirement for making Equation 46 zero so we will continue with the third one. Note that

m(s1) = dµ(s1) + dµ(s0)π(s0, a0; θ) = 0.125 + 0.5 0.25 = 0.25,

m(s2) = dµ(s2) + dµ(s0)π(s0, a1; θ) = 0.375 + 0.5 0.75 = 0.75,

and the factor in the brackets becomes

m(s1) qπ(s1, a0) qπ(s1, a1) + m(s2) qπ(s2, a0) qπ(s2, a1)

= 0.25 2 τ log π(s1, a0, θ) + τ log π(s1, a1, θ)

+ 0.75 τ log π(s2, a0, θ) 1 + τ log π(s2, a1, θ)

= 0.25 + τ log π(s1, a0, θ) + log π(s1, a1, θ)

= 0.25 + τ log(1 π(s1, a0, θ)) log π(s1, a0, θ)

= 0.25 + τ log 1 π(s1, a0, θ)

π(s1, a0, θ) .

Making this zero is also at odds with the requirement for Equation 46. To see why, let s consider both conditions, where we use p = π(s1, a0, θ) to simplify notation.

τ log 3 + 3p 1 = 0

0.25 + τ log 1 p

The ﬁrst equation requires p = (1 τ log 3)/3. The second equation requires p = (exp(0.25/τ)+ 1) 1. These two equations intersect when τi 0.2779, but otherwise do not equal each other, meaning we cannot satisfy both of these criteria. Therefore, for any τ = τi, a stationary point θ under the true gradient cannot have π(s0, a0, θ) = µ(s0, a0) = 0.25 and thus cannot be a stationary point under semi-gradient.

This counterexample shows one environment where the sets of stationary points under the true gradient is disjoint from the set of stationary points under the semi-gradient.

Graves, Imani, Kumaraswamy, and White

ACE(η) Actor-Critic with Emphatic weightings. η interpolates between semi-gradient updates (η = 0) and gradient updates (η = 1). OﬀPAC Oﬀ-Policy Actor-Critic (Degris et al., 2012b). Equivalent to ACE(0). GTD(λ) Gradient Temporal-Diﬀerence learning (Sutton et al., 2009). λ is the decay rate of the eligibility trace vector. ETD(λ) Emphatic Temporal-Diﬀerence learning (Sutton et al., 2016). λ is the decay rate of the eligibility trace vector. TDRC(λ) Temporal-Diﬀerence learning with Regularized Corrections (Ghiassian et al., 2020; Patterson et al., 2022). λ is the decay rate of the eligibility trace vector. DPG Deterministic Policy Gradient (Silver et al., 2014). Uses a deterministic policy parameterization and semi-gradient updates. True-DPGE A version of DPG using true gradient updates (i.e., scaling the update by the true emphatic weightings). True-ACE A version of ACE using the true emphatic weightings. ACE-direct A version of ACE that uses the direct method from Section 5.2.2 to estimate the emphatic weightings. ACE-trace A version of ACE that uses the emphatic trace from Sutton et al. (2016) to estimate the emphatic weightings. ACE-ideal A version of ACE that re-computes the emphatic trace on each time step using the current policy to remove the inﬂuence of old versions of the policy.

Table 1: Descriptions of the algorithms used in the experiments.

Appendix F. Descriptions of Algorithms in Section 9

Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, 2019.

Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve diﬃcult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5), 1983.

Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Incremental natural actor critic algorithms. In Conference on Neural Information Processing Systems, 2007.

Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor critic algorithms. Automatica, 45(11), 2009.

Justin Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. Conference on Neural Information Processing Systems, pages 369 376, 1995.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Open AI Gym. ar Xiv preprint ar Xiv:1606.01540, 2016.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for Open AI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Thomas Degris, Patrick M. Pilarski, and Richard S. Sutton. Model free reinforcement learning with continuous action in practice. In American Control Conference, 2012a.

Thomas Degris, Martha White, and Richard S. Sutton. Oﬀ policy actor critic. In International Conference on Machine Learning, 2012b.

Travis B. Dick. Policy gradient reinforcement learning without regret. Master s thesis, University of Alberta, 2015.

Le Tien Dung, Takashi Komeda, and Motoki Takagi. Reinforcement learning in non Markovian environments using automatic discovery of subgoals. In Society of Instrument and Control Engineers Annual Conference, pages 2601 2605. IEEE, 2007.

Charles R. Harris et al. Array programming with Num Py. Nature, 585(7825):357 362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/ s41586-020-2649-2.

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. In International Conference on Machine Learning, pages 2160 2169, 2019.

Carles Gelada and Marc G. Bellemare. Oﬀ-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI Conference on Artiﬁcial Intelligence, volume 33, 2019.

Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, and Adam White. Online oﬀ-policy prediction. ar Xiv:1811.02597, 2018.

Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, and Martha White. Gradient temporal-diﬀerence learning with regularized corrections. In International Conference on Machine Learning, pages 3524 3534. PMLR, 2020.

Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux. An operator view of policy gradient methods. In Conference on Neural Information Processing Systems, 2020.

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5, 2004.

Ivo Grondman, Lucian Busoniu, Gabriel A.D. Lopes, and Robert Babuska. A survey of actor critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6), 2012.

Shixiang Gu, Tim Lillicrap, Richard E. Turner, Zoubin Ghahramani, Bernhard Sch olkopf, and Sergey Levine. Interpolated policy gradient: Merging on policy and oﬀ policy gradient estimation for deep reinforcement learning. In Conference on Neural Information Processing Systems, 2017a.

Graves, Imani, Kumaraswamy, and White

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q prop: Sample eﬃcient policy gradient with an oﬀ policy critic. In International Conference on Learning Representations, 2017b.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Oﬀpolicy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Assaf Hallak and Shie Mannor. Consistent on-line oﬀ-policy evaluation. In International Conference on Machine Learning, volume 70, 2017.

Bojun Huang. Steady state analysis of episodic reinforcement learning. In Conference on Neural Information Processing Systems, 2020.

Ehsan Imani, Eric Graves, and Martha White. An oﬀ policy policy gradient theorem using emphatic weightings. In Conference on Neural Information Processing Systems, 2018.

Nathan Kallus and Masatoshi Uehara. Statistically eﬃcient oﬀ policy policy gradients. In International Conference on Machine Learning, 2020.

Andreas Kirsch. Mdp environments for the Open AI Gym. Technical report, 2017. URL https://arxiv.org/abs/1709.09069.

Martin Klissarov and Doina Precup. Flexible option learning. Conference on Neural Information Processing Systems, pages 4632 4646, 2021.

Vijay R. Konda and John N. Tsitsiklis. Actor critic algorithms. In Conference on Neural Information Processing Systems, 2000.

Romain Laroche and Remi Tachet des Combes. Dr jekyll & mr hyde: the strange case of oﬀ-policy policy updates. Conference on Neural Information Processing Systems, 2021.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2015.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4), 1992.

Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon University, 1993.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Inﬁnite-horizon oﬀ-policy estimation. In Conference on Neural Information Processing Systems, 2018.

Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Oﬀ-policy policy gradient with stationary distribution correction. In Conference on Uncertainty in Artiﬁcial Intelligence, 2019.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in oﬀ-policy evaluation via conditional importance sampling. In International Conference on Machine Learning, pages 6184 6193. PMLR, 2020a.

Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Oﬀ policy policy gradient with stationary distribution correction. In Conference on Uncertainty in Artiﬁcial Intelligence, 2020b.

John Loch and Satinder P. Singh. Using eligibility traces to ﬁnd the best memoryless policy in partially observable Markov decision processes. In International Conference on Machine Learning, pages 323 331, 1998.

Hamid Maei. Gradient Temporal Diﬀerence Learning Algorithms. Ph D thesis, University of Alberta, 2011.

Hamid Reza Maei. Convergent actor critic algorithms under oﬀ-policy training and function approximation. ar Xiv:1802.07842, 2018.

A. Rupam Mahmood, Huizhen Yu, Martha White, and Richard S. Sutton. Emphatic temporal-diﬀerence learning. ar Xiv:1507.01569, 2015.

Peter Marbach and John N. Tsitsiklis. Simulation based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2), 2001.

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, 2020.

Nicolas Meuleau, Leonid Peshkin, Leslie P. Kaelbling, and Kee-Eung Kim. Oﬀ policy policy search. Technical report, MIT Artiﬁcial Intelligence Laboratory, 2000.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human level control through deep reinforcement learning. Nature, 518(7540), 2015.

Andrew William Moore. Eﬃcient memory-based learning for robot control. Ph D thesis, University of Cambridge, 1990.

Chris Nota and Philip S. Thomas. Is the policy gradient a gradient? In International Conference on Autonomous Agents and Multiagent Systems, 2020.

Andrew Patterson, Adam White, and Martha White. A generalized projected bellman error for oﬀ-policy value estimation in reinforcement learning. The Journal of Machine Learning Research, 2022.

Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Natural actor-critic. In European Conference on Machine Learning, 2005.

Doina Precup. Temporal Abstraction In Reinforcement Learning. Ph D thesis, University of Massachusetts Amherst, 2000.

Graves, Imani, Kumaraswamy, and White

Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility Traces for Oﬀ Policy Policy Evaluation. In International Conference on Machine Learning, 2000.

Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Oﬀ policy temporal diﬀerence learning with function approximation. International Conference on Machine Learning, 2001.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.

Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. ar Xiv:1308.6370, 2013.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd edition, 2018.

Richard S. Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Conference on Neural Information Processing Systems, 1999a.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artiﬁcial intelligence, 112 (1-2), 1999b.

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesv ari, and Eric Wiewiora. Fast gradient-descent methods for temporal diﬀerence learning with linear function approximation. In International Conference on Machine Learning, 2009.

Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: A scalable real time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multi Agent Systems, 2011.

Richard S. Sutton, A. Rupam Mahmood, and Martha White. An emphatic approach to the problem of oﬀ policy temporal diﬀerence learning. The Journal of Machine Learning Research, 17, 2016.

Philip Thomas. Bias in natural actor critic algorithms. In International Conference on Machine Learning, 2014.

Samuele Tosatto, Joao Carvalho, Hany Abdulsamad, and Jan Peters. A nonparametric oﬀ policy policy gradient. In International Conference on Artiﬁcial Intelligence and Statistics, 2020.

OFF-POLICY ACTOR-CRITIC WITH EMPHATIC WEIGHTINGS

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, R emi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample eﬃcient actor critic with experience replay. In International Conference on Learning Representations, 2016.

Christopher J.C.H. Watkins and Peter Dayan. Q learning. Machine Learning, 8(3-4), 1992.

Lex Weaver and Nigel Tao. The optimal reward baseline for gradient based reinforcement learning. In Conference on Uncertainty in Artiﬁcial Intelligence, 2001.

Adam White. Developing A Predictive Approach To Knowledge. Ph D thesis, University of Alberta, 2015.

Martha White. Unifying task speciﬁcation in reinforcement learning. In International Conference on Machine Learning, 2017.

Ronald J. Williams. Simple statistical gradient following algorithms for connectionist reinforcement learning. Machine Learning, 8, 1992.

Ronald J. Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3), 1991.

Ian H. Witten. An adaptive optimal controller for discrete time Markov environments. Information and Control, 34(4), 1977.

Huizhen Yu. On convergence of emphatic temporal diﬀerence learning. In Conference on Learning Theory, 2015.

Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Generalized oﬀ-policy actorcritic. In Conference on Neural Information Processing Systems, 2019.

Shangtong Zhang, Bo Liu, Hengshuai Yao, and Shimon Whiteson. Provably convergent two timescale oﬀ-policy actor critic with function approximation. In International Conference on Machine Learning, 2020.