# From Past to Future: Rethinking Eligibility Traces

Dhawal Gupta¹*, Scott M. Jordan², Shreyas Chaudhari¹, Bo Liu³, Philip S. Thomas¹, Bruno Castro da Silva¹
¹University of Massachusetts, ²University of Alberta, ³Amazon
{dgupta, schaudhari, pthomas, bsilva}@cs.umass.edu, sjordan@ualberta.ca, boliucs@amazon.com

In this paper, we introduce a fresh perspective on the challenges of credit assignment and policy evaluation. First, we delve into the nuances of eligibility traces and explore instances where their updates may result in unexpected credit assignment to preceding states. From this investigation emerges the concept of a novel value function, which we refer to as the bidirectional value function. Unlike traditional state value functions, bidirectional value functions account for both future expected returns (rewards anticipated from the current state onward) and past expected returns (cumulative rewards from the episode's start to the present). We derive principled update equations to learn this value function and, through experimentation, demonstrate its efficacy in enhancing the process of policy evaluation. In particular, our results indicate that the proposed learning approach can, in certain challenging contexts, perform policy evaluation more rapidly than TD(λ), a method that learns forward value functions, $v_\pi$, directly. Overall, our findings present a new perspective on eligibility traces and the potential advantages associated with the novel value function it inspires, especially for policy evaluation.

## Introduction

Reinforcement Learning (RL) offers a robust framework for tackling complex sequential decision-making problems. The growing relevance of RL in diverse applications, from controlling nuclear reactors (Radaideh et al. 2021; Park et al. 2022) to guiding atmospheric balloons (Bellemare et al. 2020) and optimizing data centers (Li et al. 2019), underscores the need for more efficient solutions. Central to these solutions is addressing the credit assignment problem, which involves determining which actions most contributed to a particular outcome, a key challenge in sequential decision-making. The outcome of actions in RL may be delayed, posing a challenge in correctly attributing credit to the most relevant prior states and actions. This often requires a large number of samples or interactions with the environment. In real-world settings, this might be a constraint, as system interactions are often risky or costly. Minimizing the sample and computational complexity of existing algorithms not only addresses these challenges but also facilitates the wider adoption of RL in various future applications.

Figure 1: Implementation of the backward view via TD(λ), adapted from Sutton and Barto (2018). Arrow sizes denote the magnitude of updates: (a) Expected update direction of past state values under the backward view, or TD(λ). (b) Potential misalignment in updates due to reliance on outdated gradient memory for past states (especially when using non-linear function approximation), which deviates from (a).

Addressing the credit assignment problem given a temporal sequence of states and actions has been, and continues to be, an active area of research. A key concept in addressing this challenge is the family of Temporal Difference (TD) methods (Sutton 1984).
In the context of TD methods, the backward view (Sutton and Barto 2018) offers an intuitive approach under which the agent aims to adjust the value of previous states to align with recent observations. As depicted in Figure 1(a), observing a positive outcome at a given time step, $t$, prompts the agent to increase the value estimates of all preceding states. This adjustment incorporates a level of discounting, accounting for the diminishing influence of distant preceding states on the observed outcome. This approach assumes that every prior state should share credit for the resultant outcome. One advantage of the backward view, discussed later, is that it can be efficiently implemented online and incrementally.

The TD(λ) method is a widely adopted algorithmic implementation of the backward view (Sutton 1984). Initially introduced for settings involving linear parameterizations of the value function, it has become standard in policy evaluation RL problems. However, as the complexity of RL problems increases with novel applications, there has been a shift towards adopting neural networks as function approximators, primarily due to their flexibility to represent many functions. Yet, this shift is not without challenges. In our work, we underscore one particular issue: when deploying TD(λ) with nonlinear function approximators, particular settings might cause it to update the value of previous states in ways inconsistent with the standard expectations regarding its behavior, and that run counter to the core intuition underlying the backward view (Figure 1(a)). Figure 2 illustrates such a scenario, where after observing a positive outcome, the values of some prior states are decreased rather than increased. The direction of these updates is therefore contrary to the intended or expected one.

In a later section, we delve deeper into why this issue with TD(λ) arises. Intuitively, the problem lies in how TD(λ) relies on an eligibility trace vector: a short-term memory of the gradients of previous states' value functions. This trace vector is essentially a moving average of gradients of previously visited states, and it is used by TD(λ) to update the value of multiple states simultaneously. However, as the agent continuously updates state values online, such average gradients can become outdated. In the policy evaluation setting with non-linear function approximation, the gradient memory vector maintained by TD(λ) can thus become stale, leading to state value updates that are misaligned with the intended behavior of the backward view. As a result, past states might receive updates that do not align with our intentions or expectations. Importantly, this issue does not occur under linear function approximation: in the linear setting, trace vectors are composed of fixed state feature vectors.

The contributions of this paper are as follows:
- We present a novel perspective under which eligibility traces may be investigated, highlighting specific scenarios that may lead to unexpected credit assignments to prior states.
- Stemming from our exploration of eligibility traces, we introduce bidirectional value functions. This novel type of value function captures both future and past expected returns, thus offering a broader perspective than traditional state value functions.
- We formulate principled update equations tailored for learning bidirectional value functions while emphasizing their applicability in practical scenarios.
- Through empirical analyses, we illustrate how effectively bidirectional value functions can be used in policy evaluation. Our results suggest that the proposed methods can outperform the traditional TD(λ) technique, especially in settings involving complex non-linear approximators.

## Background & Motivation

### Notation and Introduction to RL

Reinforcement learning is a framework for modeling sequential decision-making problems in which an agent interacts with the environment and learns to improve its decisions based on its previous actions and the rewards it receives. The most common way to model such problems is as a Markov Decision Process (MDP), defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, d_0, \gamma)$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $P(S_{t+1}=s' \mid S_t=s, A_t=a)$ is a transition function describing the probability of transitioning to state $s'$ given that the agent was in state $s$ and executed action $a$; $R(S_t, A_t)$ is a bounded reward function; $d_0(S_0)$ is the starting state distribution; and $\gamma$ is the discount factor. For ease of discussion, we also consider the case where state features may be used, e.g., to perform value function approximation. In this case, we assume a domain-dependent function $x : \mathcal{S} \to \mathbb{R}^d$ mapping states to corresponding $d$-dimensional feature representations.

The agent behaves according to a policy denoted by $\pi$. In particular, while interacting with the environment and observing its current state at time $t$, $S_t$, the agent stochastically selects an action $A_t$ according to its policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the probability simplex over the actions; i.e., $A_t \sim \pi(\cdot \mid S_t)$. After executing the action, the agent observes a reward value $R(S_t, A_t)$ and transitions to the next state, $S_{t+1}$. The goal of an RL agent is to find the policy $\pi^\star$ that maximizes the expected discounted sum of the rewards generated by it:
$$\pi^\star \in \arg\max_\pi \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)\right].$$

### Policy Evaluation

The search for $\pi^\star$ is often made easier by being able to predict the future returns of a given policy; this is known as the policy evaluation problem. The return at time $t$ is defined as $G_t := R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$, where $R_t := R(S_t, A_t)$. We define the value function for a state $s$ as the expected return observed by an agent starting from state $s$ and following policy $\pi$, i.e., $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Estimating $v_\pi$ (i.e., performing policy evaluation) is an important subroutine required to perform policy improvement. It is often done in settings where the value function is approximated as a parametric function with parameters $\theta$, $v_\theta(s) \approx v_\pi(s)$, where the weights are updated through an iterative process, i.e., $\theta_{t+1} = \theta_t + \Delta\theta_t$. A common way of determining the update term, $\Delta\theta_t$, is by assuming an update rule of the form
$$\Delta\theta_t = \alpha\big(G_t - v_{\theta_t}(S_t)\big)\frac{\partial v_\theta(S_t)}{\partial \theta},$$
where $\alpha$ is a step-size parameter. The two simplest algorithms to learn the value function are the Monte Carlo (MC) algorithm and the Temporal Difference (TD) learning algorithm. In the MC algorithm, updates are as follows:
$$\Delta\theta_t = \alpha\big(R_t + \gamma G_{t+1} - v_{\theta_t}(S_t)\big)\frac{\partial v_\theta(S_t)}{\partial \theta},$$
where determining $G_{t+1}$ requires that the agent wait until the end of an episode. TD learning algorithms replace the $G_{t+1}$ term with the agent's current approximation or estimate of the return at the next step; i.e.,
$$\Delta\theta_t = \alpha\big(R_t + \gamma v_{\theta_t}(S_{t+1}) - v_{\theta_t}(S_t)\big)\left.\frac{\partial v_\theta(S_t)}{\partial \theta}\right|_{\theta=\theta_t}, \tag{1}$$
where $R_t + \gamma v_{\theta_t}(S_{t+1}) - v_{\theta_t}(S_t)$ is known as the TD error and denoted as $\delta_t$. An important characteristic of the latter update is its online nature: the agent does not need to wait until the episode ends, since the update does not rely on future variables. Thus, updates can be performed after each time step.
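To make the update in (1) concrete, the following is a minimal sketch (not from the paper) of semi-gradient TD(0) with a generic differentiable value approximator. The environment and policy interfaces (`env.reset()`, `env.step(a)`, `policy(s)`) and the `v`/`grad_v` helpers are assumptions for illustration only.

```python
import numpy as np

def td0_policy_evaluation(env, policy, v, grad_v, theta, alpha=0.05, gamma=0.99,
                          num_episodes=100):
    """Semi-gradient TD(0), cf. Eq. (1).

    v(theta, s)      -> scalar value estimate of state s
    grad_v(theta, s) -> gradient of v w.r.t. theta, evaluated at state s
    env / policy     -> hypothetical interfaces: env.reset() -> s,
                        env.step(a) -> (s_next, r, done), policy(s) -> a
    """
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD error: delta_t = R_t + gamma * v(S_{t+1}) - v(S_t); bootstrap is 0 at terminal states.
            bootstrap = 0.0 if done else v(theta, s_next)
            delta = r + gamma * bootstrap - v(theta, s)
            # Semi-gradient step: theta <- theta + alpha * delta_t * dv(S_t)/dtheta
            theta = theta + alpha * delta * grad_v(theta, s)
            s = s_next
    return theta
```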
Figure 2: (a) A simple 2-state MDP. States are annotated with corresponding feature representations and values according to the approximator at time $t=0$. (b) Functional form of the value function and respective parameters at $t=0$. (c) On the left, a surface depicting the value of state $s_1$ for different parameter values ($\theta_0$ and $\theta_1$). After an update from $t=0$ to $t=1$, we can see (middle) the update direction given by TD(λ) at $t=1$ (i.e., $\delta_1 z_1$). On the right, we see that TD(λ)'s update direction is obtuse (and hence not aligned) with the direction based on the gradient of the current value function (i.e., $\delta_1 \left.\frac{\partial v_\theta(s_1)}{\partial \theta}\right|_{\theta=\theta_1}$).

The family of λ-return algorithms is an intermediary between Monte Carlo (MC) and one-step TD(0) algorithms, using a smoothing parameter, λ. TD(λ), which implements the backward view of the λ-return algorithm, relies only on historical information and can perform value function updates online. It accomplishes this by maintaining a trace vector $e_t$ (known as the eligibility trace) encoding a short-term memory summarizing the history of the trajectory up to time $t$. This trace vector is then used to assign credit to the previous states visited by the agent based on the currently observed reward. In particular, the update term at time $t$ is $\Delta\theta_t = \alpha\,\delta_t e_t$, where
$$e_t := \sum_{i=0}^{t} (\lambda\gamma)^{t-i} \left.\frac{\partial v_\theta(S_i)}{\partial \theta}\right|_{\theta=\theta_i}. \tag{2}$$
Note that, when at state $S_t$, $e_t$ can be computed recursively as a running average of the value function gradients evaluated at the states visited during the current episode:
$$e_t = \gamma\lambda\, e_{t-1} + \left.\frac{\partial v_\theta(S_t)}{\partial \theta}\right|_{\theta=\theta_t}, \tag{3}$$
where $e_{-1} := 0$; i.e., the trace is set to 0 at the start of each episode. Note that eligibility traces, as defined above, can be used to perform credit assignment over a single episode. To perform credit assignment over multiple possible episodes, van Hasselt et al. (2020) introduced an expected variant of traces, $z(s)$, corresponding to a type of average trace over all possible trajectories ending at state $s$:
$$z(s) := \mathbb{E}[e_t \mid S_t = s]. \tag{4}$$

### Misconception with Eligibility Traces

In TD(λ), the backward view uses the term $\frac{\partial v_\theta(S_i)}{\partial \theta}$ as a mechanism to determine how the value of a previously encountered state, $S_i$, will be updated. For instance, given an observed TD error at time $t$, $\delta_t$, we may adjust the value of the state $S_{t-3}$ by using the stored derivatives from time $t-3$, i.e., $\left.\frac{\partial v_\theta(S_{t-3})}{\partial \theta}\right|_{\theta=\theta_{t-3}}$. Weight updates in TD(λ) are implemented by maintaining a moving average of the gradients of the value functions of previously encountered states (see Eq. (2)). In particular, this weighted average determines how the values of multiple past states will be updated given the currently observed TD error. Notice, however, that this moving average aggregates information, say, about the gradient of the value of state $S_{t-3}$ according to the value function at that time; i.e., it aggregates information about the gradient $\left.\frac{\partial v_\theta(S_{t-3})}{\partial \theta}\right|_{\theta=\theta_{t-3}}$. This gradient, however, represents the direction in which weights should be updated (based on the currently observed outcome, $\delta_t$) to adjust the old, no-longer-in-use value function, $v_{\theta_{t-3}}$, not the current value function, $v_{\theta_t}$. The correct update direction based on the intuition in Figure 1(a), by contrast, should be the one given by the gradient of the current value function: $\left.\frac{\partial v_\theta(S_{t-3})}{\partial \theta}\right|_{\theta=\theta_t}$.
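For reference, here is a minimal sketch (under the same assumed interfaces as the TD(0) sketch above) of online TD(λ) with an accumulating trace, as in Eqs. (2)–(3). The comments mark the step where, under non-linear approximation, the trace retains gradients evaluated at old parameter values.

```python
import numpy as np

def td_lambda(env, policy, v, grad_v, theta, alpha=0.05, gamma=0.99, lam=0.9,
              num_episodes=100):
    """Online TD(lambda) with accumulating traces, cf. Eqs. (2)-(3)."""
    for _ in range(num_episodes):
        s, done = env.reset(), False
        e = np.zeros_like(theta)            # e_{-1} = 0: trace reset at episode start
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            bootstrap = 0.0 if done else v(theta, s_next)
            delta = r + gamma * bootstrap - v(theta, s)
            # Eq. (3): the trace folds in the gradient at the *current* parameters,
            # but the gradients of earlier states inside e were taken at now-outdated parameters.
            e = gamma * lam * e + grad_v(theta, s)
            # Eq. (2) update: all (decayed) stored gradients receive credit delta_t.
            theta = theta + alpha * delta * e
            s = s_next
    return theta
```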
This indicates that the direction chosen by TD(λ) to update the value of previously encountered states may not align with the correct direction according to the value function's current parameters, $\theta_t$, due to the use of outdated gradients. Mathematically, this discrepancy can be represented as (for $i > 0$):
$$\left.\frac{\partial v_\theta(S_{t-i})}{\partial \theta}\right|_{\theta=\theta_{t-i}} \neq \left.\frac{\partial v_\theta(S_{t-i})}{\partial \theta}\right|_{\theta=\theta_t}.$$
Figure 2 depicts a simple example highlighting this problem, i.e., the fact that TD(λ) uses outdated gradients when performing updates. In this figure, we can see that the effective update direction of TD(λ) points in the direction opposite to the intended/correct one. We discuss the learning performance of both types of updates in Appendix A. Conceptually, TD(λ) computes traces by combining previous derivatives of the value function to adjust the value of the state at time $i$, based on the TD error at time $t$, for $i < t$.

Notably, this misconception was not present when TD(λ) was introduced (Sutton 1984). In its original form, the update was tailored to linear functions. In this case, the update is a function of the feature representation of each encountered state $S_i$, since $\frac{\partial v_\theta(S_i)}{\partial \theta} = x(S_i)$. The trace is then a moving average of features observed in previous steps; update issues are averted since the feature vector of any given state remains the same, independently of changes to the value function.

Recall that equation (2) corresponds to the TD(λ) update and that it highlights how it uses outdated gradients. A minor adjustment to this equation provides the desired update: $\Delta\theta_t = \alpha\,\delta_t \bar{e}_t$, where
$$\bar{e}_t = \sum_{i=0}^{t} (\lambda\gamma)^{t-i} \left.\frac{\partial v_\theta(S_i)}{\partial \theta}\right|_{\theta=\theta_t}. \tag{5}$$
The key distinction is in using the latest/current weights, $\theta_t$, during the trace calculation for prior states. For linear function approximations, both (2) and (5) produce identical updates. A key advantage of the original update (2) is that it can be implemented online (in particular, via Eq. (3)). This requires only a constant computational and memory cost per update step. In contrast, if we were to use Eq. (5) to perform online updates, it would require computing the derivative for every state encountered up to time $t$. This makes the complexity of each update directly proportional to the episode's duration thus far. This computation becomes impractical for longer episodes. Furthermore, this approach negates the principle of incremental updates, as the computational cost per step increases with episode length.

Let us adapt the expected trace formulation (4) to our Eq. (5). We start with an expected trace vector $\bar{z}(s) := \mathbb{E}[\bar{e}_t \mid S_t = s]$.¹ By substituting the value of $\bar{e}_t$ into this expression and simplifying it, we obtain:
$$\mathbb{E}[\bar{e}_t \mid S_t = s] = \mathbb{E}\left[\sum_{i=0}^{t} (\lambda\gamma)^{t-i} \left.\frac{\partial v_\theta(S_i)}{\partial \theta}\right|_{\theta=\theta_t} \,\middle|\, S_t = s\right] \overset{(a)}{=} \frac{\partial}{\partial \theta}\, \mathbb{E}\left[\sum_{i=0}^{t} (\lambda\gamma)^{t-i} v_\theta(S_i) \,\middle|\, S_t = s\right] \overset{(b)}{=} \left.\frac{\partial f(\theta, s)}{\partial \theta}\right|_{\theta=\theta_t}, \tag{6}$$
where $f(\theta, s) := \mathbb{E}\left[\sum_{i=0}^{t} (\lambda\gamma)^{t-i} v_\theta(S_i) \,\middle|\, S_t = s\right]$. Step (a) uses the linearity of expectation and gradients, and in (b) we define the inner term as $f$. Notice that the resulting trace is the gradient of a function defined as the weighted sum of value approximations over different time steps.

¹ Note that van Hasselt et al. (2020) do not distinguish between $e_t$ and $\bar{e}_t$. This difference is often overlooked with traces, and one contribution of our work is to point out this difference.
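To make the computational trade-off explicit, here is a hypothetical sketch (an illustration under the same assumed `grad_v` helper as above, not an implementation from the paper) contrasting the recursive trace of Eq. (3) with the "refreshed" trace $\bar{e}_t$ of Eq. (5), which recomputes every stored gradient at the current parameters and therefore costs $O(t)$ gradient evaluations per step.

```python
import numpy as np

def refreshed_trace(grad_v, theta_t, visited_states, gamma, lam):
    """Eq. (5): recompute all gradients at the *current* parameters theta_t.
    Cost grows linearly with the number of states visited so far."""
    t = len(visited_states) - 1
    e_bar = np.zeros_like(theta_t)
    for i, s_i in enumerate(visited_states):
        e_bar += (lam * gamma) ** (t - i) * grad_v(theta_t, s_i)
    return e_bar

def recursive_trace(e_prev, grad_v, theta_t, s_t, gamma, lam):
    """Eq. (3): constant cost per step, but the older gradients stored inside
    e_prev were evaluated at older parameter values."""
    return gamma * lam * e_prev + grad_v(theta_t, s_t)
```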
## Methodology

In the previous section, we showed that the expected trace update, when applied to Eq. (5), is the gradient of a function composed of the expected discounted sum of the values of states as approximated by $v_\theta$. Let us consider what Eq. (6) would be at the point of convergence, $\theta^\star$, such that $v_{\theta^\star} = v_\pi$. At convergence, $f(\theta^\star, s)$ is
$$f(\theta^\star, s) = \mathbb{E}\left[\sum_{i=0}^{t} (\lambda\gamma)^{t-i} v_{\theta^\star}(S_i) \,\middle|\, S_t = s\right] = \mathbb{E}\left[\sum_{i=0}^{t} (\lambda\gamma)^{t-i} v_\pi(S_i) \,\middle|\, S_t = s\right].$$
Expanding the definition of $f$ at $\theta^\star$ leads us to Lemma 1. Prior to delving into the lemma, let us define another quantity, $\overleftarrow{G}_t := \sum_{i=1}^{t} (\lambda\gamma)^i R_{t-i}$. This represents the discounted sum of rewards collected up to the current time step, where discounting is applied in the reverse direction with a factor of $\lambda\gamma$.

Lemma 1. The discounted sum of the value function at the expected sequence of states in trajectories reaching state $s$ at time $t$ is
$$\mathbb{E}\left[\sum_{i=0}^{t} (\lambda\gamma)^{t-i} v_\pi(S_i) \,\middle|\, S_t = s\right] = \frac{1}{1-\gamma^2\lambda}\, \mathbb{E}\left[G_t + \overleftarrow{G}_t - \gamma(\lambda\gamma)^{t+1} G_0 \,\middle|\, S_t = s\right].$$

Proof. Appendix C.

In the equation above, the first two terms are the future and past discounted returns, as defined earlier. The third term, $G_0$, represents the return of the entire trajectory conditioned on the agent visiting state $s$ at time $t$; it decays by a factor proportional to $\lambda\gamma$ as the episode progresses. Similar to how $v_\pi$ is defined as the expectation of $G_t$, we can define another type of value function (akin to $v_\pi$) representing the expected return observed by the agent up to the current time:
$$\overleftarrow{v}_\pi(s) := \mathbb{E}_\pi\left[\sum_{i=1}^{t} (\lambda\gamma)^i R_{t-i} \,\middle|\, S_t = s\right]. \tag{7}$$
We call this value function, $\overleftarrow{v}$, the backward value function, with the arrow indicating the temporal direction of the prediction.² This value represents the discounted sum of rewards the agent has received until now, where rewards earlier in the trajectory are more heavily discounted. The discount factor for this value function is $\lambda\gamma$, as opposed to the standard discount factor, $\gamma$, used in the forward value function, $\overrightarrow{v}$. Henceforth, we use $\overrightarrow{v}$, $v_\pi$, and $v$ interchangeably.

² For brevity, we often drop $\pi$ from value functions, using $v$ to represent the corresponding approximations of these functions.

Let us now define another value function: the sum of the backward and forward value functions. This function combines the expected return to go and the expected return accumulated in getting to a state $s$. Simply put, it is the summation of $\overleftarrow{v}$ and $\overrightarrow{v}$ at a given state:
$$\overleftrightarrow{v}_\pi(s_t) := \overleftarrow{v}_\pi(s_t) + \overrightarrow{v}_\pi(s_t) = \mathbb{E}_\pi\left[\sum_{i=1}^{t} (\gamma\lambda)^i R_{t-i} + \sum_{i=0}^{\infty} \gamma^i R_{t+i} \,\middle|\, S_t = s_t\right].$$
We refer to this value function as the bidirectional value function and denote it as $\overleftrightarrow{v}$ to indicate the directions in which it predicts returns. Notice that we dropped the $\mathbb{E}\left[\gamma(\lambda\gamma)^{t+1} G_0 \mid S_t = s\right]$ term when defining this value function because, in the limit of $t \to \infty$, the influence of the starting state vanishes, since $\lim_{t\to\infty} (\lambda\gamma)^{t+1} = 0$. Notice that $\overleftrightarrow{v}$ represents (up to a constant scaling factor) the total discounted return received by the agent throughout an entire trajectory passing through a specific state.

In this work, we wish to further investigate the properties of the value functions $\overleftrightarrow{v}$ and $\overleftarrow{v}$, and whether learning $\overleftrightarrow{v}$ and $\overleftarrow{v}$ can facilitate the identification of the forward value function, $v_\pi$. More precisely, in the next sections we:
- investigate formal properties of these value functions in terms of learnability and convergence;
- design principled and practical learning rules based on online stochastic sampling;
- evaluate whether learning $\overleftrightarrow{v}$ may result in a more efficient policy evaluation process, particularly in settings where standard methods may struggle.
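As a quick numerical illustration of these definitions (a hypothetical sketch, not taken from the paper), the forward return $G_t$, the backward return $\overleftarrow{G}_t$, and their sum, a sample of the quantity estimated by the bidirectional value function, can be computed directly from a recorded reward sequence:

```python
def forward_return(rewards, t, gamma):
    """G_t = sum_{i>=0} gamma^i R_{t+i}, computed over a finite recorded trajectory."""
    return sum(gamma ** i * r for i, r in enumerate(rewards[t:]))

def backward_return(rewards, t, gamma, lam):
    """Backward return: sum_{i=1}^{t} (lam*gamma)^i R_{t-i}."""
    return sum((lam * gamma) ** i * rewards[t - i] for i in range(1, t + 1))

rewards = [1.0, 0.0, -2.0, 3.0, 0.5]   # R_0, ..., R_4 of one (hypothetical) trajectory
gamma, lam, t = 0.9, 0.8, 3
G_t = forward_return(rewards, t, gamma)
G_back_t = backward_return(rewards, t, gamma, lam)
bidirectional_sample = G_back_t + G_t
print(G_t, G_back_t, bidirectional_sample)
```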
## Bellman Equations and Theory

In this section, we show that the two newly introduced value functions ($\overleftrightarrow{v}$ and $\overleftarrow{v}$) have corresponding variants of Bellman equations. Previous work (Zhang, Veeriah, and Whiteson 2020) has shown that $\overleftarrow{v}$ admits a recursive form (i.e., a Bellman equation), but our work is the first to present a Bellman equation for $\overleftrightarrow{v}$. We also prove that these Bellman equations, when used to define corresponding Bellman operators, are contraction mappings; thus, by applying such Bellman updates, we are guaranteed to converge to the corresponding value functions in the tabular setting.

First, we present the standard Bellman equation for the forward value function, $\overrightarrow{v}$:
$$\overrightarrow{v}_\pi(s_t) = r_\pi(s_t) + \gamma \sum_{s_{t+1}} \overrightarrow{P}_\pi(s_{t+1} \mid s_t)\, \overrightarrow{v}_\pi(s_{t+1}),$$
wherein we define $r_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a\mid s) R(s, a)$ and $\overrightarrow{P}_\pi(s'\mid s) = \sum_{a\in\mathcal{A}} \pi(a\mid s) P(s'\mid s, a)$. Similarly, we can show that the Bellman equation for the backward value function can be written as
$$\overleftarrow{v}_\pi(s_t) = \lambda\gamma\, \overleftarrow{r}_\pi(s_t) + \lambda\gamma \sum_{s_{t-1}} \overleftarrow{P}_\pi(s_{t-1} \mid s_t)\, \overleftarrow{v}_\pi(s_{t-1}),$$
where $\overleftarrow{P}_\pi(s_{t-1} \mid s_t)$ and $\overleftarrow{r}_\pi(s_t)$ are the backward-looking transition and reward functions. The expression above can be proved by applying the recursive definition of $\overleftarrow{v}$ to (7). Appendix B presents further details about these definitions, and Appendix D shows the complete derivation of the above Bellman equation.

Theorem 2. Given the Bellman equations for $\overrightarrow{v}$ and $\overleftarrow{v}$, the Bellman equation for $\overleftrightarrow{v}$ is
$$\overleftrightarrow{v}_\pi(s_t) = \frac{1}{1+\gamma^2\lambda}\left( (1-\gamma^2\lambda)\, r_\pi(s_t) + \gamma \sum_{s_{t+1}} \overrightarrow{P}_\pi(s_{t+1}\mid s_t)\, \overleftrightarrow{v}_\pi(s_{t+1}) + \lambda\gamma \sum_{s_{t-1}} \overleftarrow{P}_\pi(s_{t-1}\mid s_t)\, \overleftrightarrow{v}_\pi(s_{t-1}) \right).$$

Proof. We provide a proof sketch here; the complete proof is in Appendix E. To prove this result, we first recall that, by definition, $\overleftrightarrow{v}$ is the sum of $\overleftarrow{v}$ and $\overrightarrow{v}$. We then expand these terms into their corresponding Bellman equations. Further simplification of terms leads to the above Bellman equation.

An important point to note in the above equation is that the value of $\overleftrightarrow{v}_\pi(s_t)$ bootstraps from the value of the previous state, $\overleftrightarrow{v}_\pi(s_{t-1})$, as well as the value of the next state, $\overleftrightarrow{v}_\pi(s_{t+1})$, which makes sense since this value function looks into the past as well as the future. Another observation regarding this equation is the normalization by a factor of $\frac{1}{1+\gamma^2\lambda}$, which can be viewed as a way to normalize the effect of summing overlapping returns from two bootstrapped state values.

Using the Bellman equations for $\overleftarrow{v}_\pi$ and $\overleftrightarrow{v}_\pi$, we now define their corresponding Bellman operators:
$$\overleftarrow{T}(\overleftarrow{v}_i)(s) := \lambda\gamma\, \overleftarrow{r}_\pi(s) + \lambda\gamma \sum_{s'} \overleftarrow{P}_\pi(s'\mid s)\, \overleftarrow{v}_i(s'), \tag{8}$$
$$\overleftrightarrow{T}(\overleftrightarrow{v}_i)(s') := \frac{1}{1+\gamma^2\lambda}\left( (1-\gamma^2\lambda)\, r_\pi(s') + \gamma \sum_{s''} \overrightarrow{P}_\pi(s''\mid s')\, \overleftrightarrow{v}_i(s'') + \lambda\gamma \sum_{s} \overleftarrow{P}_\pi(s\mid s')\, \overleftrightarrow{v}_i(s) \right). \tag{9}$$

Assumption 1. The Markov chain induced by $\pi$ is ergodic.

Theorem 3. Under Assumption 1, the limit $\lim_{t\to\infty} \mathbb{E}\left[\overleftarrow{G}_t \mid S_t = s\right]$ exists and the operator defined in (8) is a contraction mapping; hence, repeatedly applying it leads to convergence to $\overleftarrow{v}_\pi$.

Proof. The proof is provided in Appendix F.

Theorem 4. Under Assumption 1, the limit $\lim_{t\to\infty} \mathbb{E}\left[G_t + \overleftarrow{G}_t \mid S_t = s\right]$ exists and the operator defined in (9) is a contraction mapping; hence, repeatedly applying it leads to convergence to $\overleftrightarrow{v}_\pi$.

Proof. The proof is provided in Appendix F.
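To illustrate how the contraction result can be used in the tabular setting, here is a hypothetical sketch that repeatedly applies the backward operator of Eq. (8) until it reaches a fixed point. It makes two assumptions not confirmed by the main text: rewards depend only on the state, and the backward transition model is the usual time reversal of the ergodic chain, $\overleftarrow{P}_\pi(s'\mid s) = d_\pi(s')\,\overrightarrow{P}_\pi(s\mid s')/d_\pi(s)$ (the paper's exact definitions are in its Appendix B).

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of the forward transition matrix P (rows sum to 1)."""
    evals, evecs = np.linalg.eig(P.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return d / d.sum()

def backward_bellman_fixed_point(P, r, gamma, lam, tol=1e-10):
    """Iterate Eq. (8) to convergence on a tabular, ergodic MDP.
    P[s, s'] = forward transition probability; r[s] = expected reward in state s
    (state-only rewards: an assumption made for this sketch)."""
    d = stationary_distribution(P)
    # Time-reversed chain: P_back[s, s'] = Pr(previous state = s' | current state = s).
    P_back = (P * d[:, None]).T / d[:, None]
    r_back = P_back @ r          # expected previous reward under the state-only assumption
    v_back = np.zeros(len(r))
    while True:
        v_new = lam * gamma * (r_back + P_back @ v_back)   # Eq. (8); contraction factor lam*gamma < 1
        if np.max(np.abs(v_new - v_back)) < tol:
            return v_new
        v_back = v_new

# Example: a 3-state ergodic chain induced by some fixed policy (values are illustrative).
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
r = np.array([1.0, -1.0, 0.5])
print(backward_bellman_fixed_point(P, r, gamma=0.99, lam=0.8))
```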
## Online and Incremental Update Equations

In the previous section, we introduced the Bellman operators for the two value functions and proved that these operators are contractions. We would now like to derive update equations allowing agents to learn such value functions from stochastic samples. Similar to how the TD(0) update rule is motivated by its corresponding Bellman operator, we can define similar update rules for $\overleftrightarrow{v}$ and $\overleftarrow{v}$. Let us parameterize our value functions as follows: $\overrightarrow{v}_\theta$, $\overleftarrow{v}_\phi$, and $\overleftrightarrow{v}_\psi$. The TD(0) equivalents of the update equations for the parameters of $\overleftarrow{v}_\phi$ and $\overleftrightarrow{v}_\psi$ are presented below. Updates for $\psi$ and $\phi$:
$$\Delta\psi_t = \alpha\, \overleftrightarrow{\delta}_t \left.\frac{\partial \overleftrightarrow{v}_\psi(S_t)}{\partial \psi}\right|_{\psi=\psi_t}, \qquad \Delta\phi_t = \alpha\, \overleftarrow{\delta}_t \left.\frac{\partial \overleftarrow{v}_\phi(S_t)}{\partial \phi}\right|_{\phi=\phi_t}, \tag{10}$$
where the corresponding TD errors are defined as $\overleftrightarrow{\delta}_t := \frac{1}{1+\gamma^2\lambda}\big( (1-\gamma^2\lambda) R_t + \gamma\, \overleftrightarrow{v}_{\psi_t}(S_{t+1}) + \gamma\lambda\, \overleftrightarrow{v}_{\psi_t}(S_{t-1}) \big) - \overleftrightarrow{v}_{\psi_t}(S_t)$ and $\overleftarrow{\delta}_t := \lambda\gamma R_{t-1} + \lambda\gamma\, \overleftarrow{v}_{\phi_t}(S_{t-1}) - \overleftarrow{v}_{\phi_t}(S_t)$. The above update equations (as is the case with standard TD methods) can be computed online and incrementally; i.e., they incur a fixed computation cost per step and thus allow us to distribute compute evenly throughout an agent's trajectory. Note that when implementing such updates, we can use a scalar value to store the backward return, $\overleftarrow{G}_t = \lambda\gamma\big(\overleftarrow{G}_{t-1} + R_{t-1}\big)$, and obtain an online version of the Monte Carlo update for $\overleftarrow{v}$:
$$\Delta\phi_t = \alpha\big(\overleftarrow{G}_t - \overleftarrow{v}_{\phi_t}(S_t)\big) \left.\frac{\partial \overleftarrow{v}_\phi(S_t)}{\partial \phi}\right|_{\phi=\phi_t}, \tag{11}$$
where we define $\overleftarrow{G}_0 = 0$ and start updating $\overleftarrow{v}$ at time step $t = 1$. In Appendix G, we present several update equation variants that rely on other types of value functions.

## Experiments

We investigate three questions:
- RQ1: Can we jointly parameterize all three value functions such that learning each of them individually helps to learn the other two?
- RQ2: Can learning such value functions facilitate/accelerate the process of evaluating forward value functions compared to standard techniques like TD(λ)?
- RQ3: What is the influence of λ on the method's performance?

### Parameterization (RQ1)

We want to identify a parameterization for our value functions such that training $\overleftrightarrow{v}$ helps learn $\overrightarrow{v}$, i.e., the value function we care about. We can leverage the mathematical property that $\overleftrightarrow{v} = \overleftarrow{v} + \overrightarrow{v}$ to learn $\overleftrightarrow{v}$ such that these functions are interdependent and allow the others to be inferred. Figure 3 shows a possible way to parameterize the value functions. In this case, we parameterize $\overleftrightarrow{v}$ as the sum of the other two heads of a single-layer neural network. In particular, $\overleftrightarrow{v} = \overrightarrow{v} + \overleftarrow{v}$, and so $\theta = \{w_1, w_2\}$, $\phi = \{w_1, w_3\}$, and $\psi = \{w_1, w_2, w_3\}$. We refer to this parameterization as BiTD-FR. We can similarly parameterize the forward value function with all the weights, as shown in Appendix H (Figure 7b), wherein $\overrightarrow{v} = \overleftrightarrow{v} - \overleftarrow{v}$. In this case, the parameterization becomes $\theta = \{w_1, w_2, w_3\}$, $\phi = \{w_1, w_3\}$, and $\psi = \{w_1, w_2\}$; i.e., we make use of all the weights to parameterize $\overrightarrow{v}$. We refer to this variant as BiTD-BiR. Similarly, for completeness, we define a third parameterization as $\overleftarrow{v} = \overleftrightarrow{v} - \overrightarrow{v}$, wherein $\theta = \{w_1, w_2\}$, $\phi = \{w_1, w_2, w_3\}$, and $\psi = \{w_1, w_3\}$. We call this parameterization BiTD-FBi.

Figure 3: Parameterizing the three value functions: we parameterize $\overleftrightarrow{v}$ as the sum of the other two value functions.

Figure 4: Chain domain, showing the per-state feature vectors, the rewards, and the analytically derived values $v_\pi(s)$.

Note that we have only shown a single-layer neural network, but $w_1$ can be replaced with any deep neural network and the parameterization remains valid. Our formulation simply imposes more structure on our value function representation. Training this network is also straightforward, since we can estimate all three value functions with a single forward pass. We can also compute the losses of all three heads with their respective TD/MC updates, as shown in (1), (10), and (11); a sketch of this joint parameterization and its per-step updates is given below.
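The following is a minimal PyTorch-style sketch of the BiTD-FR idea (an illustration under assumptions, not the authors' code): a shared layer $w_1$ with two heads $w_2$ and $w_3$, with $\overleftrightarrow{v}$ formed as their sum, trained from the TD errors of Eqs. (1) and (10) and the Monte Carlo update of Eq. (11). Summing the three squared errors with equal weight is an assumption made here for simplicity.

```python
import torch
import torch.nn as nn

class BiTDFR(nn.Module):
    """BiTD-FR: shared layer w1, forward head w2, backward head w3; v_bi = v_fwd + v_bwd."""
    def __init__(self, num_features, hidden=9):
        super().__init__()
        self.w1 = nn.Sequential(nn.Linear(num_features, hidden), nn.ReLU())
        self.w2 = nn.Linear(hidden, 1)   # forward head  -> v_fwd   (theta = {w1, w2})
        self.w3 = nn.Linear(hidden, 1)   # backward head -> v_bwd   (phi   = {w1, w3})

    def forward(self, x):
        h = self.w1(x)
        v_fwd = self.w2(h).squeeze(-1)
        v_bwd = self.w3(h).squeeze(-1)
        return v_fwd, v_bwd, v_fwd + v_bwd   # bidirectional head uses psi = {w1, w2, w3}

def bitd_step(net, opt, x_prev, x_t, x_next, r_prev, r_t, g_back, gamma, lam):
    """One online update combining Eqs. (1), (10), and (11); assumes t >= 1 and
    non-terminal S_{t+1}. g_back carries the running backward return."""
    v_fwd_t, v_bwd_t, v_bi_t = net(x_t)
    with torch.no_grad():                     # bootstrap targets are treated as constants
        v_fwd_next, _, v_bi_next = net(x_next)
        _, _, v_bi_prev = net(x_prev)
    g_back = lam * gamma * (g_back + r_prev)  # backward-return recursion used by Eq. (11)
    delta_fwd = r_t + gamma * v_fwd_next - v_fwd_t
    delta_bi = ((1 - gamma**2 * lam) * r_t + gamma * v_bi_next
                + gamma * lam * v_bi_prev) / (1 + gamma**2 * lam) - v_bi_t
    delta_bwd_mc = g_back - v_bwd_t
    # Squared TD/MC errors with detached targets yield the semi-gradient updates.
    loss = delta_fwd.pow(2) + delta_bi.pow(2) + delta_bwd_mc.pow(2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return g_back
```

A typical (hypothetical) usage would be `net = BiTDFR(num_features=3)`, `opt = torch.optim.SGD(net.parameters(), lr=0.05)`, then calling `bitd_step` once per transition with `g_back` initialized to 0 at the start of each episode.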
### Policy Evaluation (RQ2 & RQ3)

We study the utility of using these value functions in a standard prediction task. One issue that might arise in value function approximation occurs when trying to approximate non-smooth value functions, i.e., value functions that may change abruptly with respect to the input features. In RL, this implies that the values of spatially similar states may differ vastly. For our prediction problem, we consider a chain domain with 9 states (Sutton and Barto 2018), as depicted in Figure 4. The initial state is drawn from a uniform distribution over the state space. The agent can only take two actions (go left and go right), and the ending states of the chain are terminal. We use a feature representation (motivated by Boyan's chain (Boyan 2002)) such that the values of nearby states are forced to generalize over multiple features, a property commonly observed in continuous-state RL problems. To simulate a highly irregular value function, we define a reward function that fluctuates between -5 and +5 between consecutive states. We evaluate each TD learning algorithm (along with the Monte Carlo variant for learning $\overleftarrow{v}$) in terms of its ability to approximate the value function of a uniform random policy, $\pi(\text{left} \mid \cdot) = \pi(\text{right} \mid \cdot) = 0.5$, under a discount factor of $\gamma = 0.99$. The analytically derived value function is shown in Figure 4 alongside the feature representation used for each state; a sketch of this kind of chain construction appears at the end of this subsection.

In our experiments, we sweep over multiple values of the learning rate ($\alpha$) and λ. We use the same learning rate for all value function heads. Each run corresponds to 50K training/environment steps, and we average the loss over 100 seeds. We used a single-layer neural network with 9 units in the hidden layer and ReLU as the non-linearity. Figure 5 depicts the results of this experiment with hyperparameters optimized for Area Under the Curve (AUC).

Figure 5: Experimental results for prediction in the random chain domain. The y-axis shows the MSTDE of the forward value function. (top) Best-performing parameter settings for BiTD-FR, BiTD-BiR, BiTD-FBi, and standard TD(λ). (bottom) Comparison of all BiTD variants and TD(λ) for different values of λ; notice that any value λ > 0 is detrimental to TD(λ)'s performance, but can aid policy evaluation under the proposed framework.

(RQ2) From Fig. 5 (top) we can see that all variants of BiTD achieve a lower MSTDE than TD(λ) for a given number of samples. (RQ3) To investigate the role of λ in the efficiency of policy evaluation, we analyze Fig. 5 (bottom). From this figure, we can see that TD(λ)'s performance deteriorates strictly as the value of λ increases and that it performs best for λ = 0. We also notice that BiTD-FR performs similarly to TD for λ = 0, but its performance is better for intermediate values of λ (with the best performance at λ = 0.4). Furthermore, among the different BiTD methods, the ones that directly approximate $\overrightarrow{v}$ (BiTD-FR and BiTD-FBi) seem to perform better than the ones that approximate it indirectly, like BiTD-BiR. Appendix H provides detailed plots with standard error bars (as well as the sensitivity of the methods with respect to $\alpha$ and λ) to better understand how $\alpha$ may affect the methods.
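For concreteness, here is a hypothetical sketch of a Boyan-style chain of the kind described above. The alternating ±5 rewards follow the text, while the specific overlapping features, the reward-on-entry convention, and the restriction of start states to non-terminal states are illustrative assumptions (the paper's exact representation is shown in Figure 4).

```python
import numpy as np

class RandomChain:
    """A 9-state chain with terminal ends, alternating +/-5 rewards, and overlapping features.
    The feature scheme below is an illustrative stand-in, not the paper's exact representation."""
    def __init__(self, n_states=9, n_features=3, seed=0):
        self.n = n_states
        self.rng = np.random.default_rng(seed)
        # Overlapping (Boyan-style) features: each state activates two neighboring components.
        self.features = np.zeros((n_states, n_features))
        for s in range(n_states):
            pos = s * (n_features - 1) / (n_states - 1)
            lo = int(np.floor(pos))
            hi = min(lo + 1, n_features - 1)
            self.features[s, lo] = 1.0 - (pos - lo)
            self.features[s, hi] += pos - lo
        # Rewards alternate between +5 and -5 along the chain (assumed: reward on entering a state).
        self.rewards = np.array([5.0 if s % 2 == 0 else -5.0 for s in range(n_states)])

    def reset(self):
        self.s = int(self.rng.integers(1, self.n - 1))   # uniform over non-terminal states (assumption)
        return self.features[self.s]

    def step(self, action):                              # action in {-1 (left), +1 (right)}
        self.s += action
        done = self.s <= 0 or self.s >= self.n - 1       # chain ends are terminal
        return self.features[self.s], self.rewards[self.s], done
```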
## Literature Review

The successful application of eligibility traces has historically been associated with linear function approximation (Sutton and Barto 2018). The update rules for eligibility traces, explicitly designed for linear value functions, encounter ambiguities when extended to nonlinear settings. The RL community's increasing use of non-linear function approximation (due to the rise of deep RL) led to the wider use of experience replay, making eligibility traces hard to deploy properly. Nonetheless, several works have tried to adapt traces for deep RL (Tesauro 1992; Elfwing, Uchibe, and Doya 2018). Traces have found some utility in methods such as advantage estimation (Schulman et al. 2017). One interesting interpretation, proposed by van Hasselt et al. (2020), applies expected traces to the penultimate layer in neural networks while maintaining the running trace for the rest of the network. Traces have also been modified to be combined with experience replay (Daley and Amato 2018).

Backward TD learning offers a new perspective on RL by integrating hindsight into the credit assignment process. Unlike traditional forward-view TD learning, which defines updates based on expected future rewards from present decisions, backward TD works retrospectively, estimating present values from future outcomes. This shift to a backward view has spurred significant advancements. Chelu, Precup, and van Hasselt (2020) underscored the pivotal roles of both foresight and hindsight in RL, illustrating their combined efficacy in algorithmic enhancement. Wang et al. (2021) leveraged backward TD for offline RL, demonstrating its potential in settings where data collection proves challenging. Further, the efficacy of backward TD learning methods in imitation learning tasks was highlighted by Park and Wong (2022). Zhang, Veeriah, and Whiteson (2020) reinforced the need for retrospective knowledge in RL, underscoring the significance of backward TD methods.

One may consider learning $\overleftrightarrow{v}$ and $\overleftarrow{v}$ as auxiliary tasks. Unlike standard learning scenarios where auxiliary tasks may not directly align with the value function prediction objective, in our case, learning $\overleftrightarrow{v}$ and $\overleftarrow{v}$ results in auxiliary tasks that directly complement and align with the primary goal of learning the forward value function. Auxiliary tasks, when incorporated into RL, often act as catalysts, refining the primary task's learning dynamics. Such tasks span various functionalities: next-state prediction, reward forecasting, representation learning, and policy refinement, to name a few (Lin et al. 2019; Rafiee et al. 2022). The Unsupervised Auxiliary Tasks framework, introduced by Jaderberg et al. (2017), demonstrates how auxiliary tasks can enhance feature representations, benefiting both primary and auxiliary tasks.

## Conclusion and Future Work

In this work, we unveiled an inconsistency resulting from the combination of eligibility traces and non-linear value function approximators. Through a deeper investigation, we derived a new type of value function, a bidirectional value function, and presented principled update rules and convergence guarantees. We also introduced online update equations for stochastic sample-based learning methods. Empirical results suggest that this new value function might surpass traditional eligibility traces in specific settings, for various values of λ, offering a novel perspective on policy evaluation. Future directions include, e.g., extending our on-policy algorithms to off-policy prediction. Another promising direction relates to exploring analogous functions in the context of control rather than prediction.
We would also like to investigate the hypothesis that bidirectional value functions (akin to average reward policy evaluation methods, which model complete trajectories) may be a first step towards unifying the discounted and average-reward RL settings.

## Acknowledgments

We thank Yash Chandak for his valuable feedback. We are also immensely grateful to the three anonymous reviewers and the meta-reviewer for their insights and feedback.

## References

Bellemare, M. G.; Candido, S.; Castro, P. S.; Gong, J.; Machado, M. C.; Moitra, S.; Ponda, S. S.; and Wang, Z. 2020. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836): 77–82.

Boyan, J. A. 2002. Technical Update: Least-Squares Temporal Difference Learning. Mach. Learn., 49(2-3): 233–246.

Chelu, V.; Precup, D.; and van Hasselt, H. P. 2020. Forethought and Hindsight in Credit Assignment. In Advances in Neural Information Processing Systems.

Daley, B.; and Amato, C. 2018. Efficient Eligibility Traces for Deep Reinforcement Learning. CoRR, abs/1810.09967.

Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks.

Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2017. Reinforcement Learning with Unsupervised Auxiliary Tasks. In 5th International Conference on Learning Representations, ICLR 2017.

Li, Y.; Wen, Y.; Tao, D.; and Guan, K. 2019. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Transactions on Cybernetics, 50(5): 2002–2013.

Lin, X.; Baweja, H.; Kantor, G.; and Held, D. 2019. Adaptive auxiliary task weighting for reinforcement learning. Advances in Neural Information Processing Systems, 32.

Park, J.; Kim, T.; Seong, S.; and Koo, S. 2022. Control automation in the heat-up mode of a nuclear power plant using reinforcement learning. Progress in Nuclear Energy, 145: 104107.

Park, J. Y.; and Wong, L. 2022. Robust Imitation of a Few Demonstrations with a Backwards Model. Advances in Neural Information Processing Systems, 35: 19759–19772.

Radaideh, M. I.; Wolverton, I.; Joseph, J.; Tusar, J. J.; Otgonbaatar, U.; Roy, N.; Forget, B.; and Shirvan, K. 2021. Physics-informed reinforcement learning optimization of nuclear assembly design. Nuclear Engineering and Design, 372: 110966.

Rafiee, B.; Jin, J.; Luo, J.; and White, A. 2022. What makes useful auxiliary tasks in reinforcement learning: investigating the effect of the target policy. arXiv preprint arXiv:2204.00565.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. CoRR.

Sutton, R. S. 1984. Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction.

Tesauro, G. 1992. Practical Issues in Temporal Difference Learning. Mach. Learn.

van Hasselt, H.; Madjiheurem, S.; Hessel, M.; Silver, D.; Barreto, A.; and Borsa, D. 2020. Expected Eligibility Traces. CoRR.

Wang, J.; Li, W.; Jiang, H.; Zhu, G.; Li, S.; and Zhang, C. 2021. Offline reinforcement learning with reverse model-based imagination. Advances in Neural Information Processing Systems, 34: 29420–29432.

Zhang, S.; Veeriah, V.; and Whiteson, S. 2020. Learning Retrospective Knowledge with Reverse Reinforcement Learning. In Advances in Neural Information Processing Systems.