# A State Representation for Diminishing Rewards

Ted Moskovitz (Gatsby Unit, UCL, ted@gatsby.ucl.ac.uk), Samo Hromadka (Gatsby Unit, UCL, samo.hromadka.21@ucl.ac.uk), Ahmed Touati (Meta, atouati@meta.com), Diana Borsa (Google DeepMind, borsa@google.com), Maneesh Sahani (Gatsby Unit, UCL, maneesh@gatsby.ucl.ac.uk)

A common setting in multitask reinforcement learning (RL) demands that an agent rapidly adapt to various stationary reward functions randomly sampled from a fixed distribution. In such situations, the successor representation (SR) is a popular framework which supports rapid policy evaluation by decoupling a policy's expected discounted, cumulative state occupancies from a specific reward function. However, in the natural world, sequential tasks are rarely independent, and instead reflect shifting priorities based on the availability and subjective perception of rewarding stimuli. Reflecting this disjunction, in this paper we study the phenomenon of diminishing marginal utility and introduce a novel state representation, the λ representation (λR), which, surprisingly, is required for policy evaluation in this setting and which generalizes the SR as well as several other state representations from the literature. We establish the λR's formal properties and examine its normative advantages in the context of machine learning, as well as its usefulness for studying natural behaviors, particularly foraging.

1 Introduction

[Figure 1.1: Diminishing rewards. Cumulative reward as a function of stimulus presentations under constant vs. diminishing utility.]

The second ice cream cone rarely tastes as good as the first, and once all the most accessible brambles have been picked, the same investment of effort yields less fruit. In everyday life, the availability of stimuli and our enjoyment of them are sensitive to our past interactions with them. Thus, to evaluate different courses of action and act accordingly, we might expect our brains to form representations sensitive to the non-stationarity of rewards. Evidence in fields from behavioral economics [1, 2] to neuroscience [3] supports this hypothesis. Surprisingly, however, most of reinforcement learning (RL) takes place under the assumptions of the Markov Decision Process [MDP; 4], where rewards and optimal decision-making remain stationary. In this paper, we seek to bridge this gap by studying the phenomenon of diminishing marginal utility [5] in the context of RL.

Diminishing marginal utility (DMU) is the subjective phenomenon by which repeated exposure to a rewarding stimulus reduces the perceived utility one experiences. While DMU is thought to have its roots in the maintenance of homeostatic equilibrium (too much ice cream can result in a stomach ache), it also manifests itself in domains in which the collected rewards are abstract, such as economics ($10 vs. $0 is perceived as a bigger difference in value than $1,010 vs. $1,000), where it is closely related to risk aversion [6, 7]. While DMU is well-studied in other fields, relatively few RL studies have explored diminishing reward functions [8, 9], and, to our knowledge, none contains a formal analysis of DMU within RL. Here, we seek to characterize both its importance and the challenge it poses for current RL approaches (Section 3).
Surprisingly, we find that evaluating policies under diminishing rewards requires agents to learn a novel state representation which we term the λ representation (λR, Section 4). The λR generalizes several state representations from the RL literature: the successor representation [SR; 10], the first-occupancy representation [FR; 11], and the forward-backward representation [FBR; 12], adapting them for non-stationary environments. Interestingly, despite the non-stationarity of the underlying reward functions, we show that the λR still admits a Bellman recursion, allowing for efficient computation via dynamic programming (or approximate dynamic programming), and prove its convergence. We demonstrate the scalability of the λR to large and continuous state spaces (Section 4.1), show that it supports policy evaluation, improvement, and composition (Section 5), and show that the behavior it induces is consistent with optimal foraging theory (Section 6).

2 Preliminaries

Standard RL. In standard RL, the goal of the agent is to act within its environment so as to maximize its discounted cumulative reward for some task $T$ [13]. Typically, $T$ is modeled as a discounted MDP, $T = (\mathcal{S}, \mathcal{A}, p, r, \gamma, \mu_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the set of available actions, $p : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ is the transition kernel (where $\mathcal{P}(\mathcal{S})$ is the space of probability distributions over $\mathcal{S}$), $r : \mathcal{S} \to \mathbb{R}$ is the reward function, $\gamma \in [0, 1)$ is a discount factor, and $\mu_0 \in \mathcal{P}(\mathcal{S})$ is the distribution over initial states. Note that the reward function is also frequently defined over state-action pairs $(s, a)$ or triples $(s, a, s')$, but to simplify notation we mostly consider state-based rewards. All of the following analysis, unless otherwise noted, can easily be extended to the $(s, a)$- or $(s, a, s')$-dependent cases. The goal of the agent is to maximize its expected return, or discounted cumulative reward $\sum_t \gamma^t r(s_t)$. The role of the discount factor is twofold: from a theoretical standpoint, it ensures that this sum is finite for bounded rewards, and it induces myopia in the agent's behavior. To simplify notation, we will frequently write $r(s_t) \equiv r_t$ and let $\mathbf{r} \in \mathbb{R}^{|\mathcal{S}|}$ denote the vector of rewards for each state. The agent acts according to a stationary policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$. For finite MDPs, we can describe the expected transition probabilities under $\pi$ using an $|\mathcal{S}| \times |\mathcal{S}|$ matrix $P^\pi$ such that $P^\pi_{s,s'} = p^\pi(s'|s) \triangleq \sum_a p(s'|s, a)\pi(a|s)$. Given $\pi$ and a reward function $r$, the expected return associated with taking action $a$ in state $s$ is

$$Q^\pi_r(s, a) = \mathbb{E}_\pi\Big[\sum_{k=0}^\infty \gamma^k r_{t+k} \,\Big|\, s_t = s, a_t = a\Big] = \mathbb{E}_{s' \sim p^\pi(\cdot|s)}\big[r_t + \gamma Q^\pi_r(s', \pi(s'))\big]. \quad (2.1)$$

The $Q^\pi_r$ are called the state-action values or simply the Q-values of $\pi$. The expectation $\mathbb{E}_\pi[\cdot]$ is taken with respect to the randomness of both the policy and the transition dynamics. For simplicity of notation, from here onwards we will write expectations of the form $\mathbb{E}_\pi[\cdot \mid s_t = s, a_t = a]$ as $\mathbb{E}_\pi[\cdot \mid s_t, a_t]$. The recursive form in Eq. (2.1) is called the Bellman equation, and it makes the process of estimating $Q^\pi_r$ — termed policy evaluation — tractable via dynamic programming. In particular, successive applications of the Bellman operator $\mathcal{T}^\pi Q \triangleq r + \gamma P^\pi Q$ are guaranteed to converge to the true value function $Q^\pi$ for any initial real-valued $|\mathcal{S}| \times |\mathcal{A}|$ matrix $Q$. Once a policy has been evaluated, policy improvement identifies a new policy $\pi'$ such that $Q^{\pi'}_r(s, a) \ge Q^\pi_r(s, a)$ for all $(s, a)$. Helpfully, such a policy can be defined as $\pi'(s) \in \operatorname{argmax}_a Q^\pi_r(s, a)$.
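To make the preliminaries concrete, the following is a minimal NumPy sketch of tabular policy evaluation: repeated application of the Bellman operator $\mathcal{T}^\pi Q = r + \gamma P^\pi Q$ until convergence. The random MDP, uniform policy, and tolerance below are illustrative placeholders rather than any environment used in the paper.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma=0.9, tol=1e-8):
    """Tabular policy evaluation: iterate the Bellman operator
    (T^pi Q)(s, a) = r(s) + gamma * sum_{s'} p(s'|s, a) V(s'),
    with V(s) = sum_a pi(a|s) Q(s, a).

    P  : (S, A, S) transition probabilities p(s'|s, a)
    r  : (S,) state-based reward vector
    pi : (S, A) policy probabilities pi(a|s)
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    while True:
        V = (pi * Q).sum(axis=1)              # V(s) under pi
        Q_new = r[:, None] + gamma * (P @ V)  # (T^pi Q)(s, a)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

# Tiny random MDP, purely for illustration.
rng = np.random.default_rng(0)
S, A = 5, 2
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
pi = np.full((S, A), 1.0 / A)                 # uniform random policy
r = rng.random(S)
print(np.round(evaluate_policy(P, r, pi), 3))
```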
Value Decomposition. Often, it can be useful for an agent to evaluate the policies it has learned on new reward functions. In order to do so efficiently, it can make use of the successor representation [SR; 10], which decomposes the value function as follows:

$$V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{k\ge0}\gamma^k r(s_{t+k})\Big] = \mathbf{r}^\top\,\mathbb{E}_\pi\Big[\sum_{k\ge0}\gamma^k \mathbf{1}(s_{t+k})\Big] = \mathbf{r}^\top M^\pi(s, \cdot),$$

where $\mathbf{1}(s_t)$ is a one-hot representation of state $s_t$ and $M^\pi$, the SR, is an $|\mathcal{S}| \times |\mathcal{S}|$ matrix such that

$$M^\pi(s, s') \triangleq \mathbb{E}_\pi\Big[\sum_{k\ge0}\gamma^k \mathbf{1}(s_{t+k} = s')\Big] = \mathbb{E}_\pi\big[\mathbf{1}(s_t = s') + \gamma M^\pi(s_{t+1}, s')\big].$$

Because the SR satisfies a Bellman recursion, it can be learned in a similar manner to the value function, with the added benefit that if the transition kernel $p$ is fixed, it can be used to instantly evaluate a policy on a task determined by any reward vector $\mathbf{r}$. The SR can also be conditioned on actions, $M^\pi(s, a, s')$, in which case multiplication by $\mathbf{r}$ produces the Q-values of $\pi$. The SR was originally motivated by the idea that a representation of state in the brain should be dependent on the similarity of future paths through the environment, and there is evidence that SR-like representations are present in the hippocampus [14]. A representation closely related to the SR is the first-occupancy representation [FR; 11], which modifies the SR by only counting the first visit to each state:

$$F^\pi(s, s') \triangleq \mathbb{E}_\pi\Big[\sum_{k\ge0}\gamma^k \mathbf{1}\big(s_{t+k} = s',\ s' \notin \{s_t, \dots, s_{t+k-1}\}\big)\Big].$$

The FR can be used in the same way as the SR, with the difference that the values it computes are predicated on ephemeral rewards — those that are consumed or vanish after the first visit to a state.

Policy Composition. If the agent has access to a set of policies $\Pi = \{\pi\}$ and their associated SRs (or FRs) and is faced with a new task determined by reward vector $\mathbf{r}$, it can instantly evaluate these policies by simply multiplying: $\{Q^\pi(s, a) = \mathbf{r}^\top M^\pi(s, a, \cdot)\}_{\pi \in \Pi}$. This process is termed generalized policy evaluation [GPE; 15]. Similarly, generalized policy improvement (GPI) is defined as the identification of a new policy $\pi'$ such that $Q^{\pi'}(s, a) \ge \sup_{\pi \in \Pi} Q^\pi(s, a)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Combining both provides a way for an agent to efficiently compose its policies $\Pi$ — that is, to combine them to produce a new policy without additional learning. This process, which we refer to as GPE+GPI, produces the following policy, which is guaranteed to perform at least as well as any policy in $\Pi$ [16]:

$$\pi'(s) \in \operatorname{argmax}_{a \in \mathcal{A}} \max_{\pi \in \Pi} \mathbf{r}^\top M^\pi(s, a, \cdot). \quad (2.5)$$

3 Diminishing Marginal Utility

Problem Statement. Motivated by DMU's importance in decision-making, our goal is to understand RL in the context of the following class of non-Markovian reward functions:

$$r_\lambda(s, t) = \lambda(s)^{n(s,t)}\,\bar r(s), \quad \lambda(s) \in [0, 1], \quad (3.1)$$

where $n(s, t) \in \mathbb{N}$ is the agent's visit count at $s$ up to time $t$ and $\bar r(s)$ describes the reward at the first visit to $s$. $\lambda(s)$ therefore encodes the extent to which reward diminishes after each visit to $s$. Note that for $\lambda(s) = \lambda = 1$ we recover the usual stationary reward given by $\bar r$, and so this family of rewards strictly generalizes the stationary Markovian rewards typically used in RL.
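As a concrete illustration of the reward class in Eq. (3.1), the sketch below (with a hypothetical two-state example) accumulates the discounted return of a fixed trajectory while tracking visit counts, so each revisit to a rewarding state is worth a factor of λ less than the previous one.

```python
import numpy as np

def diminishing_reward(s, visit_counts, r_bar, lam):
    """Eq. (3.1): reward collected at state s after n prior visits,
    r_lambda(s, t) = lam(s)**n(s, t) * r_bar(s)."""
    return (lam[s] ** visit_counts[s]) * r_bar[s]

def rollout_return(states, r_bar, lam, gamma=0.9):
    """Discounted return of a fixed state sequence under diminishing rewards.
    Visit counts are updated as the trajectory is consumed, so revisiting a
    rewarding state yields geometrically less reward each time."""
    visit_counts = np.zeros_like(r_bar, dtype=int)
    G = 0.0
    for k, s in enumerate(states):
        G += gamma ** k * diminishing_reward(s, visit_counts, r_bar, lam)
        visit_counts[s] += 1
    return G

# Two states; state 1 is rewarding but its reward diminishes with lam = 0.5.
r_bar = np.array([0.0, 1.0])
lam = np.array([1.0, 0.5])
print(rollout_return([0, 1, 1, 1], r_bar, lam))  # each revisit of state 1 is worth half as much
```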
DMU is Challenging. An immediate question when considering reward functions of this form is whether or not we can still define a Bellman equation over the resulting value function. If this is the case, standard RL approaches still apply. However, the following result shows otherwise.

Lemma 3.1 (Impossibility; Informal). Given a reward function of the form Eq. (3.1), it is impossible to define a Bellman equation solely using the resulting value function and immediate reward.

We provide a more precise statement, along with proofs for all theoretical results, in Appendix B. This result means that we can't write the value function corresponding to rewards of the form Eq. (3.1) recursively only in terms of rewards and value in a manner analogous to Eq. (2.1). Nonetheless, we found that it is in fact possible to derive a recursive relationship in this setting, but only by positing a novel state representation that generalizes the SR and the FR, which we term the λ representation (λR). In the following sections, we define the λR, establish its formal properties, and demonstrate its necessity for RL problems with diminishing rewards.

4 The λ Representation

A Representation for DMU. We now derive the λR by decomposing the value function for rewards of the form Eq. (3.1) and show that it admits a Bellman recursion. To simplify notation, we use a single λ for all states, but the results below readily apply to non-uniform λ. We have

$$V^\pi_\lambda(s) = \mathbb{E}_\pi\Big[\sum_{k=0}^\infty \gamma^k r_\lambda(s_{t+k}, k)\,\Big|\,s_t = s\Big] = \bar{\mathbf r}^\top\,\mathbb{E}_\pi\Big[\sum_{k=0}^\infty \gamma^k \lambda^{n_t(s_{t+k},k)}\mathbf{1}(s_{t+k})\,\Big|\,s_t = s\Big] \triangleq \bar{\mathbf r}^\top \Phi^\pi_\lambda(s),$$

where we call $\Phi^\pi_\lambda(s)$ the λ representation (λR), and $n_t(s, k) \triangleq \sum_{j=0}^{k-1}\mathbf{1}(s_{t+j} = s)$ is the number of times state $s$ is visited from time $t$ up to but not including time $t + k$. Formally:

Definition 4.1 (λR). For an MDP with finite $\mathcal{S}$ and $\lambda \in [0, 1]^{|\mathcal{S}|}$, the λ representation is given by $\Phi^\pi_\lambda$ such that

$$\Phi^\pi_\lambda(s, s') \triangleq \mathbb{E}_\pi\Big[\sum_{k=0}^\infty \lambda(s')^{n_t(s',k)}\gamma^k \mathbf{1}(s_{t+k} = s')\,\Big|\,s_t = s\Big],$$

where $n_t(s, k) \triangleq \sum_{j=0}^{k-1}\mathbf{1}(s_{t+j} = s)$ is the number of times state $s$ is visited from time $t$ until time $t + k - 1$.

We can immediately see that for λ = 0, the λR recovers the FR (we take $0^0 = 1$), and for λ = 1, it recovers the SR. For λ ∈ (0, 1), the λR interpolates between the two, with higher values of λ reflecting greater persistence of reward in a given state or state-action pair and lower values of λ reflecting more ephemeral rewards. The λR admits a recursive relationship:

$$
\begin{aligned}
\Phi^\pi_\lambda(s, s') &= \mathbb{E}_\pi\Big[\sum_{k=0}^\infty \lambda^{n_t(s',k)}\gamma^k \mathbf{1}(s_{t+k} = s')\,\Big|\,s_t = s\Big] \\
&\overset{(i)}{=} \mathbb{E}_\pi\Big[\mathbf{1}(s_t = s') + \lambda^{n_t(s',1)}\gamma \sum_{k=1}^\infty \lambda^{n_{t+1}(s',k-1)}\gamma^{k-1}\mathbf{1}(s_{t+k} = s')\,\Big|\,s_t = s\Big] \\
&= \mathbf{1}(s_t = s')\big(1 + \gamma\lambda\,\mathbb{E}_{s_{t+1}\sim p^\pi}\Phi^\pi_\lambda(s_{t+1}, s')\big) + \gamma\big(1 - \mathbf{1}(s_t = s')\big)\,\mathbb{E}_{s_{t+1}\sim p^\pi}\Phi^\pi_\lambda(s_{t+1}, s'), \quad (4.3)
\end{aligned}
$$

where (i) follows from $n_t(s', k) = n_t(s', 1) + n_{t+1}(s', k-1)$. A more detailed derivation is provided in Appendix A. Thus, we can define a tractable Bellman operator:

Definition 4.2 (λR Operator). Let $\Phi \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ be an arbitrary real-valued matrix, and let $\mathcal{G}^\pi_\lambda$ denote the λR Bellman operator for π, such that

$$\mathcal{G}^\pi_\lambda\Phi \triangleq I \odot \big(\mathbf{1}\mathbf{1}^\top + \gamma\lambda P^\pi\Phi\big) + \gamma\big(\mathbf{1}\mathbf{1}^\top - I\big) \odot P^\pi\Phi, \quad (4.4)$$

where $\odot$ denotes elementwise multiplication and $I$ is the $|\mathcal{S}| \times |\mathcal{S}|$ identity matrix. In particular, for a stationary policy π, $\mathcal{G}^\pi_\lambda\Phi^\pi_\lambda = \Phi^\pi_\lambda$.

The following result establishes that successive applications of $\mathcal{G}^\pi_\lambda$ converge to the λR.

Proposition 4.1 (Convergence). Under the conditions assumed above, set $\Phi^{(0)} = (1-\lambda)I$. For $k = 1, 2, \dots$, suppose that $\Phi^{(k+1)} = \mathcal{G}^\pi_\lambda\Phi^{(k)}$. Then

$$\big|\big(\Phi^{(k)} - \Phi^\pi_\lambda\big)_{s,s'}\big| \le \gamma^k\,\frac{\mathbf{1}(s = s') + \big(1 - \mathbf{1}(s = s')\big)\gamma}{1 - \lambda\gamma}.$$

While such analysis is fairly standard in MDP theory, it is noteworthy that the analysis extends to this case despite Lemma 3.1. That is, for an ethologically relevant class of non-Markovian reward functions [17], it is possible to define a Markovian Bellman operator and prove that repeated applications of it converge to the desired representation. Furthermore, unlike in the stationary reward case, where decomposing the value function in terms of the reward and the SR is optional for prediction or control, in this setting the structure of the problem requires that this representation be learned.
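The operator in Eq. (4.4) translates directly into a dynamic-programming loop. The sketch below is an illustrative NumPy implementation on a small random transition matrix (not one of the paper's environments); setting λ = 1 or λ = 0 recovers the SR and FR recursions, respectively, as special cases.

```python
import numpy as np

def lambda_representation(P_pi, lam, gamma=0.95, tol=1e-8):
    """Compute the lambda-representation Phi^pi_lambda by iterating the
    operator of Eq. (4.4):
        G Phi = I . (1 1^T + gamma * lam * P_pi Phi)
              + gamma * (1 1^T - I) . (P_pi Phi),
    where '.' is elementwise multiplication and P_pi is the |S| x |S|
    state-to-state transition matrix under the policy."""
    S = P_pi.shape[0]
    I = np.eye(S)
    ones = np.ones((S, S))
    Phi = np.zeros((S, S))
    while True:
        PPhi = P_pi @ Phi
        Phi_new = I * (ones + gamma * lam * PPhi) + gamma * (ones - I) * PPhi
        if np.max(np.abs(Phi_new - Phi)) < tol:
            return Phi_new
        Phi = Phi_new

# Random 4-state chain under some fixed policy (illustrative only).
rng = np.random.default_rng(1)
P_pi = rng.random((4, 4))
P_pi /= P_pi.sum(axis=-1, keepdims=True)
SR = lambda_representation(P_pi, lam=1.0)   # recovers the successor representation
FR = lambda_representation(P_pi, lam=0.0)   # recovers the first-occupancy representation
mid = lambda_representation(P_pi, lam=0.5)  # interpolates between the two
print(np.round(mid, 2))
```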
To get an intuition for the λR, consider the simple gridworld presented in Fig. 4.1, where we visualize the λR for varying values of λ. For λ = 0, the representation recovers the FR, encoding the expected discount at first occupancy of each state, while as λ increases, effective occupancy is accumulated accordingly at states which are revisited. We offer a more detailed visualization in Fig. F.2.

[Figure 4.1: The λR interpolates between the FR and the SR. We visualize the λRs of the bottom-left state for the depicted policy for λ ∈ {0.0, 0.5, 1.0}.]

4.1 Continuous State Spaces

When the state space is large or continuous, it becomes impossible to use a tabular representation, and we must turn to function approximation. There are several ways to approach this, each with its own advantages and drawbacks.

Feature-Based Representations. For the SR and the FR, compatibility with function approximation is most commonly achieved by simply replacing the indicator functions in their definitions with a base feature function $\varphi : \mathcal{S} \to \mathbb{R}^D$ to create successor features [SFs; 15, 16] and first-occupancy features [FFs; 11], respectively. The intuition in this case is that $\varphi$ for the SR and the FR is just a one-hot encoding of the state (or state-action pair), and so for cases when $|\mathcal{S}|$ is too large, we can replace it with a compressed representation. That is, we have the following definition.

Definition 4.3 (SFs). Let $\varphi : \mathcal{S} \to \mathbb{R}^D$ be a base feature function. Then, the successor feature (SF) representation $\phi_1 : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^D$ is defined as $\phi^\pi_1(s, a) \triangleq \mathbb{E}_\pi\big[\sum_{k=0}^\infty \gamma^k \varphi(s_{t+k}) \mid s_t, a_t\big]$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.

One key fact to note, however, is that due to this compression, all notion of state occupancy is lost, and these representations instead measure feature accumulation. For any feasible λ, then, it is most natural to define these representations using their recursive forms:

Definition 4.4 (λF). For $\lambda \in [0, 1]$ and bounded base features $\varphi : \mathcal{S} \to [0, 1]^D$, the λ-feature (λF) representation of state $s$ is given by $\phi^\pi_\lambda$ such that

$$\phi^\pi_\lambda(s) \triangleq \varphi(s) \odot \big(1 + \gamma\lambda\,\mathbb{E}_{s'\sim p^\pi(\cdot|s)}\phi^\pi_\lambda(s')\big) + \gamma\big(1 - \varphi(s)\big) \odot \mathbb{E}_{s'\sim p^\pi(\cdot|s)}\phi^\pi_\lambda(s'). \quad (4.5)$$

In order to maintain their usefulness for policy evaluation, the main requirement on the base features is that the reward should lie in their span. That is, a given feature function $\varphi$ is most useful for an associated set of reward functions $\mathcal{R}$, given by

$$\mathcal{R} = \{r \mid \exists\,\mathbf{w} \in \mathbb{R}^D \text{ s.t. } r(s, a) = \mathbf{w}^\top\varphi(s, a)\ \forall (s, a) \in \mathcal{S} \times \mathcal{A}\}. \quad (4.6)$$

However, Barreto et al. [18] demonstrate that good performance can still be achieved for an arbitrary reward function as long as it is sufficiently close to some $r \in \mathcal{R}$.

Set-Theoretic Formulations. As noted above, computing expectations of accumulated abstract features is unsatisfying because it requires that we lose the occupancy-based interpretation of these representations. It also restricts the agent to reward functions which lie within $\mathcal{R}$. An alternative approach to extending the SR to continuous MDPs that avoids this issue is the successor measure [SM; 19], which treats the distribution of future states as a measure over $\mathcal{S}$:

$$M^\pi(s, a, X) \triangleq \sum_{k=0}^\infty \gamma^k\,\mathbb{P}(s_{t+k} \in X \mid s_t = s, a_t = a, \pi), \quad X \subseteq \mathcal{S} \text{ measurable}, \quad (4.7)$$

which can be expressed in the discrete case as $M^\pi = I + \gamma P^\pi M^\pi$. In the continuous case, matrices are replaced by their corresponding measures. Note that SFs can be recovered by integrating: $\phi^\pi_1(s, a) = \int_{s'} M^\pi(s, a, ds')\,\varphi(s')$. We can define an analogous object for the λR as follows:

$$\Phi^\pi_\lambda(s, X) \triangleq \sum_{k=0}^\infty \lambda^{n_t(X,k)}\gamma^k\,\mathbb{P}(s_{t+k} \in X \mid s_t = s, \pi), \quad (4.8)$$

where $n_t(X, k) \triangleq \sum_{j=0}^{k-1}\delta_{s_{t+j}}(X)$. However, this is not a measure because it fails to satisfy additivity for λ < 1, i.e., for measurable disjoint sets $A, B \subseteq \mathcal{S}$, $\Phi^\pi_\lambda(s, A \cup B) < \Phi^\pi_\lambda(s, A) + \Phi^\pi_\lambda(s, B)$ (Lemma B.4). For this reason, we call Eq. (4.8) the λ operator (λO). We then minimize the following squared Bellman error loss for $\Phi^\pi_\lambda$ (dropping sub/superscripts for concision):

$$\mathcal{L}(\Phi) = \mathbb{E}_{s_t,s_{t+1}\sim\rho,\ s'\sim\mu}\Big[\big(\phi(s_t, s') - \gamma\bar\phi(s_{t+1}, s')\big)^2\Big] - 2\,\mathbb{E}_{s_t\sim\rho}\big[\phi(s_t, s_t)\big] + 2\gamma(1-\lambda)\,\mathbb{E}_{s_t,s_{t+1}\sim\rho}\big[\mu(s_t)\,\phi(s_t, s_t)\,\bar\phi(s_{t+1}, s_t)\big], \quad (4.9)$$

where $\rho$ and $\mu$ are densities over $\mathcal{S}$ and $\Phi^\pi_\lambda(s) = \phi^\pi_\lambda(s)\,\mathrm{diag}(\mu)$ in the discrete case, with $\phi^\pi_\lambda$ parametrized by a neural network. $\bar\phi$ indicates a stop-gradient operation, i.e., a target network. A detailed derivation and discussion are given in Appendix H. While $\rho$ can be any training distribution of transitions we can sample from, we require an analytic expression for $\mu$. Eq. (4.9) recovers the SM loss of Touati and Ollivier [12] when λ = 1.
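Before turning to experiments, here is a minimal sketch of how the λF recursion in Definition 4.4 can be used as a sampled TD target for a tabular table of feature vectors (one row per state). The base features and transitions below are placeholders for illustration only, not the Laplacian features used later in Section 5.1.

```python
import numpy as np

def lambda_feature_td_update(phi_lam, varphi, s, s_next, lam=0.5, gamma=0.95, alpha=0.1):
    """One TD update toward the lambda-feature recursion of Eq. (4.5):
        phi_lam(s) = varphi(s) . (1 + gamma*lam*phi_lam(s'))
                   + gamma * (1 - varphi(s)) . phi_lam(s'),
    for a tabular phi_lam (one D-dimensional row per state) and bounded base
    features varphi(s) in [0, 1]^D."""
    target = varphi[s] * (1.0 + gamma * lam * phi_lam[s_next]) \
        + gamma * (1.0 - varphi[s]) * phi_lam[s_next]
    phi_lam[s] += alpha * (target - phi_lam[s])
    return phi_lam

# Illustrative setup: 6 states, 3-dimensional base features in [0, 1].
rng = np.random.default_rng(2)
n_states, D = 6, 3
varphi = rng.random((n_states, D))
phi_lam = np.zeros((n_states, D))
# Sweep over a made-up trajectory of (s, s') transitions.
for s, s_next in [(0, 1), (1, 2), (2, 2), (2, 3), (3, 0)]:
    phi_lam = lambda_feature_td_update(phi_lam, varphi, s, s_next)
print(np.round(phi_lam, 3))
```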
5 Policy Evaluation, Learning, and Composition under DMU

In the following sections, we experimentally validate the formal properties of the λR and explore its usefulness for solving RL problems with DMU. The majority of our experiments center on navigation tasks, as we believe this is the most natural setting for studying behavior under diminishing rewards. However, in Appendix I we also explore the potential for the λR's use in other areas, such as continuous control, even when rewards do not diminish. There is also the inherent question of whether the agent has access to λ. In a naturalistic context, λ can be seen as an internal variable that the agent likely knows, especially if the agent has experienced the related stimulus before. Therefore, in subsequent experiments, treating λ as a "given" can be taken to imply the agent has prior experience with the relevant stimulus. Further details for all experiments can be found in Appendix E.

5.1 Policy Evaluation

In Section 4, we showed that in order to perform policy evaluation under DMU, an agent is required to learn the λR. In our first experimental setting, we verify this analysis empirically for the policy depicted in Fig. 4.1, with a rewarded location at the state indicated by a "g" in Fig. 5.1 and λ_true = 0.5. We then compare the performance of agents using different values of λ across dynamic programming (DP), tabular TD learning, and λF TD learning with Laplacian features. For the latter two, we use a linear function approximator with a one-hot encoding of the state as the base feature function. We then compute the Q-values using the λR with λ ∈ {0.5, 1.0} (with λ = 1 corresponding to the SR) and compare the resulting value estimates to the true Q-values. Consistent with our theoretical analysis, Fig. 5.1 shows that the λR with λ = λ_true is required to produce accurate value estimates.

[Figure 5.1: The λR is required for accurate policy evaluation. Policy evaluation of the policy depicted in Fig. 4.1 using dynamic programming, tabular TD learning, and λF TD learning produces the most accurate value estimates when using the λR with λ = λ_true. Results are averaged over three random seeds. Shading indicates standard error.]
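Section 5.1's tabular TD baseline amounts to bootstrapping the recursion in Eq. (4.3) from sampled transitions. The sketch below shows one such update for an action-conditioned λR together with the Q-value readout $Q_\lambda(s,a) = \bar{\mathbf r}^\top \Phi_\lambda(s,a,\cdot)$; the step size, discount, and transitions are illustrative assumptions rather than the exact experimental settings (see Appendix E.1 for those).

```python
import numpy as np

def lambda_r_td_update(Phi, s, a, s_next, a_next, lam=0.5, gamma=0.95, alpha=0.1):
    """One TD update for an action-conditioned tabular lambda-R,
    Phi[s, a] in R^{|S|}, bootstrapping the recursion of Eq. (4.3):
      target = 1(s) . (1 + gamma*lam*Phi[s', a']) + gamma*(1 - 1(s)) . Phi[s', a']."""
    one_hot = np.zeros(Phi.shape[-1])
    one_hot[s] = 1.0
    target = one_hot * (1.0 + gamma * lam * Phi[s_next, a_next]) \
        + gamma * (1.0 - one_hot) * Phi[s_next, a_next]
    Phi[s, a] += alpha * (target - Phi[s, a])
    return Phi

def q_values(Phi, r_bar, s):
    """Read out Q_lambda(s, a) = r_bar^T Phi[s, a, :] for every action."""
    return Phi[s] @ r_bar

# Illustrative: 4 states, 2 actions, reward at state 3.
n_states, n_actions = 4, 2
Phi = np.zeros((n_states, n_actions, n_states))
r_bar = np.array([0.0, 0.0, 0.0, 1.0])
for (s, a, s_next, a_next) in [(0, 1, 1, 1), (1, 1, 2, 0), (2, 0, 3, 0), (3, 0, 3, 0)]:
    Phi = lambda_r_td_update(Phi, s, a, s_next, a_next)
print(np.round(q_values(Phi, r_bar, s=2), 3))
```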
5.2 Policy Learning

To demonstrate that the λR is useful in supporting policy improvement under diminishing rewards, we implemented modified forms of Q-learning [20] (which we term Qλ-learning) and advantage actor-critic [A2C; 21] and applied them to the Two Rooms domain from the NeuroNav benchmark task set [22] (see Fig. 5.2). For a transition $(s_t, a_t, s_{t+1})$, we define the following operator:

$$\mathcal{T}_\lambda\Phi \triangleq \mathbf{1}(s_t) \odot \big(1 + \lambda\gamma\,\Phi(s_{t+1}, a_{t+1})\big) + \gamma\big(1 - \mathbf{1}(s_t)\big) \odot \Phi(s_{t+1}, a_{t+1}), \quad (5.1)$$

where $a_{t+1} = \operatorname{argmax}_a Q_\lambda(s_{t+1}, a) = \operatorname{argmax}_a \bar{\mathbf r}^\top\Phi(s_{t+1}, a)$. This is an improvement operator for $Q_\lambda$. The results in Fig. 5.2 show that Qλ-learning outperforms standard Q-learning (λ = 1) for diminishing rewards, and that the correct λ produces the best performance. To implement A2C with a λR critic, we modified the standard TD target in a similar manner as follows:

$$\mathcal{T}_\lambda V(s_t) = r(s_t) + \gamma\big(V(s_{t+1}) + (\lambda - 1)\,\mathbf{w}^\top(\varphi(s_t) \odot \phi_\lambda(s_{t+1}))\big), \quad (5.2)$$

where $\varphi$ were one-hot state encodings, and the policy, value function, and λF were output heads of a shared LSTM network [23]. Note this target is equivalent to Definition 4.4 multiplied by the reward (derivation in Appendix E). Fig. 5.2 shows again that correct value targets lead to improved performance. Videos of agent behavior can be found at lambdarepresentation.github.io.

[Figure 5.2: The λR is required for strong performance. We apply a tabular Q-learning-style algorithm and a deep actor-critic algorithm to policy optimization in the Two Rooms domain. The blue locations indicate reward, and the black triangle shows the agent's position. Results are averaged over three random seeds. Shading indicates standard error.]

5.3 Policy Composition

As we've seen, DMU problems of this form have an interesting property wherein solving one task requires the computation of a representation which on its own is task-agnostic. In the same way that the SR and FR facilitate generalization across reward functions, the λR facilitates generalization across diminishing reward functions with different initial reward vectors $\bar{\mathbf r}$. The following result shows that there is a benefit to having the correct λ for a given resource.

Theorem 5.1 (GPI). Let $\{M_j\}_{j=1}^n \subseteq \mathcal{M}$ and $M \in \mathcal{M}$ be a set of tasks in an environment $\mathcal{M}$ and let $Q^{\pi^*_j}$ denote the action-value function of an optimal policy of $M_j$ when executed in $M$. Assume that the agent uses diminishing rate $\hat\lambda$ that may differ from the true environment diminishing rate $\lambda$. Given estimates $\tilde Q^{\pi_j}$ such that $\|Q^{\pi^*_j} - \tilde Q^{\pi_j}\|_\infty \le \epsilon$ for all $j$, define

$$\pi(s) \in \operatorname{argmax}_a \max_j \tilde Q^{\pi_j}(s, a).$$

Then

$$Q^\pi(s, a) \ge \max_j Q^{\pi^*_j}(s, a) - \frac{1}{1-\gamma}\Big(2\epsilon + |\lambda - \hat\lambda|\,\|r\|_\infty + \gamma(1-\lambda)\,r(s, a)\Big).$$

Note that for λ = 1, we recover the original GPI bound due to Barreto et al. [18], with an additional term quantifying the error accrued if incorrectly assuming λ < 1.
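Given action-conditioned λRs for a set of cached policies, the GPE+GPI rule analogous to Eq. (2.5) reduces to a max over policies and actions of $\bar{\mathbf r}^\top \Phi^{\pi_j}_\lambda(s, a, \cdot)$. The sketch below shows this selection step with randomly filled arrays standing in for learned λRs; in a DMU task the reward vector passed in would itself shrink as states are visited, so the selection is re-done at every step.

```python
import numpy as np

def gpe_gpi_action(Phis, r_bar, s):
    """Generalized policy evaluation + improvement with lambda-Rs:
    evaluate every cached policy's action-conditioned lambda-R against the
    current reward vector and act greedily over both policies and actions,
        a = argmax_a max_j  r_bar^T Phi_j[s, a].

    Phis  : (n_policies, S, A, S) array of action-conditioned lambda-Rs
    r_bar : (S,) current reward vector
    """
    Q = Phis[:, s] @ r_bar                 # (n_policies, A) Q-values at state s
    return int(np.argmax(Q.max(axis=0)))   # best action under the best policy

# Illustrative shapes only: 4 cached policies in a 10-state, 4-action world.
rng = np.random.default_rng(3)
Phis = rng.random((4, 10, 4, 10))
r_bar = np.zeros(10)
r_bar[7] = 1.0                             # reward at state 7, diminishing over visits
print(gpe_gpi_action(Phis, r_bar, s=2))
```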
Tabular Navigation. We can see this result reflected empirically in Fig. 5.3, where we consider the following experimental set-up in the classic Four Rooms domain [24]. The agent is assumed to be given, or to have previously acquired, four policies $\{\pi_0, \pi_1, \pi_2, \pi_3\}$, individually optimized to reach rewards located in each of the four rooms of the environment. There are three reward locations $\{g_0, g_1, g_2\}$ scattered across the rooms, each with its own initial reward and all with λ = 0.5. At the beginning of each episode, an initial state $s_0$ is sampled uniformly from the set of available states. An episode terminates either when the maximum reward remaining in any of the goal states is less than 0.1 or when the maximum number of steps H = 40 is reached (when λ = 1, the latter is the only applicable condition). For each of the four policies, we learn λRs with λ ∈ {0.0, 0.5, 1.0} using dynamic programming and record the returns obtained while performing GPE+GPI with each of these representations over 50 episodes. Bellman error curves for the λRs are shown in Fig. F.3, and demonstrate that convergence is faster for lower λ. In the left panel of Fig. 5.3, we can indeed see that using the correct λ (0.5) nets the highest returns. Example trajectories for each agent λ are shown in the remaining panels.

[Figure 5.3: Tabular GPI. (Left) Average returns obtained by agents performing GPE+GPI using λRs with λ ∈ {0.0, 0.5, 1.0} over 50 episodes. Error bars indicate standard error. (Right) Sample trajectories. Agents with λ set too high overstay in rewarding states, and those with λ set too low leave too early.]

Pixel-Based Navigation. We verified that the previous result holds at larger scale by repeating the experiment in a partially observed version of Four Rooms in which the agent receives 128 × 128 RGB egocentric observations of the environment (Fig. 5.4, left) with H = 50. In this case, the agent learns λFs for each policy for λ ∈ {0.0, 0.5, 1.0}, where each λF is parameterized by a feedforward convolutional network with the seven previous frames stacked to account for the partial observability. The base features were Laplacian eigenfunctions normalized to [0, 1], which were shown by Touati et al. [25] to perform the best of all base features for SFs across a range of environments, including navigation.

[Figure 5.4: Pixel-Based GPI. Performance is strongest for agents using the correct λ = 0.5. PCA on the learned features in each underlying environment state shows that the λFs capture the value-conditioned structure of the environment.]

6 Understanding Natural Behavior

Naturalistic environments often exhibit diminishing rewards and give insight into animal behavior. The problem of foraging in an environment with multiple diminishing food patches (i.e., reward states) is of interest in behavioral science [26–28]. The cornerstone of foraging theory is the marginal value theorem [MVT; 28, 29], which states that the optimal time to leave a patch is when the patch's reward rate matches the average reward rate of the environment. However, the MVT does not describe which patch to move to once an agent leaves its current patch. We show that the λO recovers MVT-like behavior in discrete environments and improves upon the MVT by not only predicting when agents should leave rewarding patches, but also where they should go. Moreover, we provide a scheme for learning λ alongside the λO using feedback from the environment.

[Figure 6.1: λO trained via FB. a) λ values of two agents in Four Rooms, one which learns λ and one which does not. b) Performance of the two agents from (a). Learning λ improves performance. c) Reward structure and starting state of the asymmetric environment. d) Trajectory of an agent with λ = 1. The optimal strategy is to reach the high-reward state and exploit it ad infinitum. e) Trajectory of an agent with λ = 0.1. The optimal strategy is to exploit each reward state for a few time steps before moving to the next reward state.]

To learn the λO, we take inspiration from the FBR [12] and use the following parametrization: $\Phi^{\pi_z}_\lambda(s, a, s') = F(s, a, z)^\top B(s')\mu(s')$ and $\pi_z(s) = \operatorname{argmax}_a F(s, a, z)^\top z$, where $\mu$ is a density with full support on $\mathcal{S}$, e.g., uniform. We then optimize using the loss in Eq. (4.9) under this parameterization (details in Appendix E). Given a reward function $r : \mathcal{S} \to \mathbb{R}$ at evaluation time, the agent acts according to $\operatorname{argmax}_a F(s, a, z_R)^\top z_R$, where $z_R = \mathbb{E}_{s\sim\mu}[r(s)B(s)]$. Because the environment is non-stationary, $z_R$ has to be re-computed at every time step.
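As a rough sketch of the FB-style readout described above (with stand-in lookup tables playing the role of the learned $F$ and $B$ networks — an assumption for illustration only), the task embedding $z_R = \mathbb{E}_{s\sim\mu}[r(s)B(s)]$ is recomputed from the current reward vector at each step and the agent acts greedily with respect to $F(s, a, z_R)^\top z_R$.

```python
import numpy as np

def task_embedding(r, B, mu):
    """z_R = E_{s ~ mu}[ r(s) B(s) ] for a discrete state space, matching the
    FB-style readout described in the text (B maps states to R^d)."""
    return sum(mu[s] * r[s] * B(s) for s in range(len(r)))

def act(F, B, r, mu, s, actions):
    """Greedy action argmax_a F(s, a, z_R)^T z_R. Because rewards diminish,
    the current reward vector r changes over time, so z_R is recomputed from
    the latest r at every step (as noted in the text)."""
    z_R = task_embedding(r, B, mu)
    scores = [F(s, a, z_R) @ z_R for a in actions]
    return actions[int(np.argmax(scores))]

# Stand-in forward/backward maps with embedding dimension d = 8 (illustrative).
d, S, A = 8, 10, 4
rng = np.random.default_rng(4)
B_table = rng.standard_normal((S, d))
F_table = rng.standard_normal((S, A, d))
B = lambda s: B_table[s]
F = lambda s, a, z: F_table[s, a]   # a learned F would also condition on z
mu = np.full(S, 1.0 / S)            # uniform density over states
r = np.zeros(S)
r[3] = 1.0
print(act(F, B, r, mu, s=0, actions=list(range(A))))
```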
To emulate a more realistic foraging task, the agent learns λ by minimizing the loss in Eq. (H.1) in parallel with the loss

$$\mathcal{L}(\lambda) = \mathbb{E}_{s_t,s_{t+1}\sim\rho}\Big[\mathbf{1}(s_t = s_{t+1})\,\big(r(s_{t+1}) - \lambda r(s_t)\big)^2\Big],$$

which provably recovers the correct value of λ provided that ρ is sufficiently exploratory. In Appendix H.1 we provide experiments showing that using an incorrect value of λ leads to poor performance on tabular tasks. In Fig. 6.1 we show that the agent learns the correct value of λ, increasing its performance. We illustrate the behavior of the λO in an asymmetric environment that has one large reward state on one side and many small reward states (with higher total reward) on the other. Different values of λ lead to very different optimal foraging strategies, which the λO recovers while exhibiting MVT-like behavior (see Appendix H.2 for a more detailed analysis). Our hope is that the λR may provide a framework for new theoretical studies of foraging behavior and possibly mechanisms for posing new hypotheses. For example, an overly large λ may lead to overstaying in depleted patches, a frequently observed phenomenon [30].

7 Conclusion

Limitations. Despite its advantages, there are several drawbacks to the representation which are a direct result of the challenge of the DMU setting. First, the λR is only useful for transfer across diminishing reward functions when the value of λ at each state is consistent across tasks. In natural settings, this is fairly reasonable, as λ can be thought of as encoding the type of the resource available at each state (i.e., each resource has its own associated decay rate). Second, as noted in Section 4.1, the λR does not admit a measure-theoretic formulation, which makes it challenging to define a principled, occupancy-based version compatible with continuous state spaces. Third, the λR is a prospective representation, and so while it is used to correctly evaluate a policy's future return under DMU, it is not inherently memory-based, and so it performs this evaluation as if the agent hasn't visited locations with diminishing reward before. Additional mechanisms (i.e., recurrence or frame-stacking) are necessary to account for previous visits. Finally, the λR is dependent on an episodic task setting for rewards to reset, as otherwise the agent would eventually consume all reward in the environment. An even more natural reward structure would include a mechanism for reward replenishment in addition to depletion. We describe several such candidates in Appendix J, but leave a more thorough investigation of this issue to future work.

In this work, we aimed to lay the groundwork for understanding policy evaluation, learning, and composition under diminishing rewards. To solve such problems, we introduced and showed the necessity of the λ representation, which generalizes the SR and FR. We demonstrated its usefulness for rapid policy evaluation and, by extension, composition, as well as control. We believe the λR represents a useful step in the development of state representations for naturalistic environments.

References

[1] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–292, 1979.
[2] Matthew Rabin. Risk aversion and expected-utility theory: a calibration theorem. Econometrica, 68(5):1281–1292, 2000. doi: 10.2307/2999450.
[3] Alex Pine, Ben Seymour, Jonathan P Roiser, Peter Bossaerts, Karl J Friston, H Valerie Curran, and Raymond J Dolan. Encoding of marginal utility across time in the human brain. Journal of Neuroscience, 29(30):9575–9581, 2009.
[4] Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley and Sons, 2010. [5] Hermann Heinrich Gossen and Rudolph C Blitz. The laws of human relations and the rules of human action derived therefrom. Mit Press Cambridge, MA, 1983. [6] Kenneth J Arrow. The theory of risk-bearing: small and great risks. Journal of risk and uncertainty, 12:103 111, 1996. [7] John W Pratt. Risk aversion in the small and in the large. In Uncertainty in economics, pages 59 79. Elsevier, 1978. [8] Nathan Wispinski, Andrew Butcher, Kory Wallace Mathewson, Craig S Chapman, Matthew Botvinick, and Patrick M. Pilarski. Adaptive patch foraging in deep reinforcement learning agents. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https: //openreview.net/forum?id=a0T3n OP9s B. [9] Sergey Shuvaev, Sarah Starosta, Duda Kvitsiani, Adam Kepecs, and Alexei Koulakov. Rlearning in actor-critic model offers a biologically relevant mechanism for sequential decisionmaking. Advances in neural information processing systems, 33:18872 18882, 2020. [10] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613 624, 1993. doi: 10.1162/neco.1993.5.4.613. [11] Ted Moskovitz, Spencer R Wilson, and Maneesh Sahani. A first-occupancy representation for reinforcement learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JBAZe2y N6Ub. [12] Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 34:13 23, 2021. [13] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd. html. [14] Kimberly L Stachenfeld, Matthew M Botvinick, and Samuel J Gershman. The hippocampus as a predictive map. Nature neuroscience, 20(11):1643 1653, 2017. [15] Andre Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48):30079 30087, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1907370117. URL https: //www.pnas.org/content/117/48/30079. [16] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. Advances in neural information processing systems, 30, 2017. [17] Maor Gaon and Ronen Brafman. Reinforcement learning with non-markovian rewards. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 3980 3987, 2020. [18] André Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Žídek, and Rémi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement, 2019. [19] Léonard Blier and Yann Ollivier. The description length of deep learning models. Advances in Neural Information Processing Systems, 31, 2018. [20] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8:279 292, 1992. [21] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928 1937. PMLR, 2016. [22] Arthur Juliani, Samuel Barnett, Brandon Davis, Margaret Sereno, and Ida Momennejad. 
Neuronav: a library for neurally-plausible reinforcement learning. ar Xiv preprint ar Xiv:2206.03312, 2022. [23] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9 (8):1735 1780, 1997. [24] Richard S. Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112 (1):181 211, 1999. URL https://www.sciencedirect.com/science/article/pii/ S0004370299000521. [25] Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MYEap_Oc QI. [26] Benjamin Y Hayden, John M Pearson, and Michael L Platt. Neuronal basis of sequential foraging decisions in a patchy environment. Nature neuroscience, 14(7):933 939, 2011. [27] Tehrim Yoon, Robert B Geary, Alaa A Ahmed, and Reza Shadmehr. Control of movement vigor and decision making during foraging. Proceedings of the National Academy of Sciences, 115 (44):E10476 E10485, 2018. [28] Anthony S Gabay and Matthew AJ Apps. Foraging optimally in social neuroscience: computations and methodological considerations. Social Cognitive and Affective Neuroscience, 16(8): 782 794, 2021. [29] Eric L. Charnov. Optimal foraging, the marginal value theorem. Theoretical Population Biology, 9(2):129 136, 1976. ISSN 0040-5809. [30] Peter Nonacs. State dependent behavior and the marginal value theorem. Behavioral Ecology, 12(1):71 83, 2001. [31] Ruosong Wang, Hanrui Zhang, Devendra Singh Chaplot, Denis Garagi c, and Ruslan Salakhutdinov. Planning with submodular objective functions. ar Xiv preprint ar Xiv:2010.11863, 2020. [32] Manish Prajapat, Mojmír Mutn y, Melanie N Zeilinger, and Andreas Krause. Submodular reinforcement learning. ar Xiv preprint ar Xiv:2307.13372, 2023. [33] Tom Zahavy, Brendan O Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is enough for convex mdps, 2021. [34] Eitan Altman. Constrained Markov decision processes. Routledge, 2021. [35] Ted Moskovitz, Brendan O Donoghue, Vivek Veeriah, Sebastian Flennerhag, Satinder Singh, and Tom Zahavy. Reload: Reinforcement learning with optimistic ascent-descent for last-iterate convergence in constrained mdps. In International Conference on Machine Learning, pages 25303 25336. PMLR, 2023. [36] Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina Mc Kinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, et al. Behaviour suite for reinforcement learning. ar Xiv preprint ar Xiv:1908.03568, 2019. [37] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ar Xiv preprint ar Xiv:1607.06450, 2016. [38] Martin Riedmiller. Neural fitted q iteration first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005: 16th European Conference on Machine Learning, Porto, Portugal, October 3-7, 2005. Proceedings 16, pages 317 328. Springer, 2005. [39] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. Advances in neural information processing systems, 5, 1992. [40] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 2018. [41] Ted Moskovitz, Michael Arbel, Ferenc Huszar, and Arthur Gretton. 
Efficient wasserstein natural gradients for reinforcement learning, 2020. [42] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. [43] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016. [44] Ted Moskovitz, Jack Parker-Holder, Aldo Pacchiano, Michael Arbel, and Michael Jordan. Tactical optimism and pessimism for deep reinforcement learning. Advances in Neural Information Processing Systems, 34:12849 12863, 2021. [45] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006. Appendix: A State Representation for Diminishing Rewards GIFs of navigation agents can be found at lambdarepresentation.github.io and in the supplementary material. A Derivation of λR Recursion We provide a step-by-step derivation of the λR recursion in Eq. (4.3): Φπ λ(s, s ) = E k=0 λnt(s ,k)γk1(st+k = s ) st = s 1(st = s ) + k=1 λnt(s ,k)γk1(st+k = s ) st = s 1(st = s ) + λnt(s ,1)γ k=1 λnt+1(s ,k)γk 11(st+k = s ) st = s 1(st = s ) + 1(st = s )λγ k=1 λnt+1(s ,k)γk 11(st+k = s ) + γ(1 1(st = s )) k=1 λnt+1(s ,k)γk 11(st+k = s ) st = s = 1(st = s ) + 1(st = s )λγEst+1 pπΦπ λ(st+1, s ) + γ(1 1(st = s ))Est+1 pπΦπ λ(st+1, s ) = 1(st = s )(1 + γλEst+1 pπΦπ λ(st+1, s )) + γ(1 1(st = s ))Est+1 pπΦπ λ(st+1, s ), (A.1) where (i) is because nt(s , k) = nt(s , 1) + nt+1(s , k) and (ii) is because λnt(s ,1) = λ1(st=s ) = 1(st = s )λ + (1 1(st = s )). B Theoretical Analysis Here, we provide proofs for the theoretical results in the main text. Lemma 3.1 (Impossibility; Informal). Given a reward function of the form Eq. (3.1), it is impossible to define a Bellman equation solely using the resulting value function and immediate reward. Before providing the formal statement and proof for Lemma 3.1, we introduce a definition for a Bellman operator. Definition B.1 (Bellman Operator). A Bellman operator is a contractive operator R|S| R|S| that depends solely on r, one-step expectations under pπ, and learning hyperparameters (in our case γ and λ). We can now give a formal statement of Lemma 3.1: Lemma B.1 (Impossibility; Formal). Assume that |S| > 1. Then, there does not exist a Bellman operator T with fixed point V π. Proof. Assume for a contradiction that T is a Bellman operator. By the Banach fixed-point theorem, V π must be the unique fixed point of T. Hence, TV π must take on the following form (see the proof of Lemma 3.1 in Appendix B): for s S, (TV π)(s) = r(s) + γEs pπ( |s)V π(s ) + γ(λ 1) r(s)Es pπ( |s)Φπ λ(s , s). For the assumption to hold, there must exist a function f such that, for any s S, Es pπ( |s)Φπ λ(s , s) = Epπ( |s)f( r, Vπ, γ, λ, s). Now, by definition, s S Φπ λ(s, s ) r(s ). r is a vector in R|S|, so as long as S > 1, r is non-trivial. Fix any r, Vπ, s. Pick a vector w r \ {0} and define Φπ λ(s, s ) := Φπ λ(s, s ) + w(s ) for any s, s S. Note that X s S Φπ λ(s, s ) r(s ) = X s S Φπ λ(s, s ) r(s ) + X s S w(s ) r(s ) = V π(s), as w r. However, Es pπ( |s) Φπ λ(s , s) = X s S pπ(s |s)Φπ λ(s , s) + X s S pπ(s |s)w(s). The final term evaluates to w(s). Because w = 0, there must exist some s such that w(s) = 0. For this s, we have a single input ( r, Vπ, γ, λ, s) to f that corresponds to two distinct outputs: Es pπ( |s)Φπ λ(s , s) and Es pπ( |s) Φπ λ(s , s). Hence, f is a one-to-many mapping: for fixed input, there is more than one output. 
Therefore, f is not a function, yielding a contradiction. The following establishes Gπ λ as a contraction. Lemma B.2 (Contraction). Let Gπ λ be the operator as defined in Definition 4.2 for some stationary policy π. Then for any two matrices Φ, Φ R|S| |S|, |Gπ λΦ(s, s ) Gπ λΦ (s, s )| γ|Φ(s, s ) Φ (s, s )|. Proof. We have |(Gπ λΦ Gπ λΦ )s,s | = |(I (11T + γλP πΦ) + γ(11T I) P πΦ I (11T + γλP πΦ ) γ(11T I) P πΦ )s,s | = |(I γλP π(Φ Φ ) + γ(11T I) P π(Φ Φ ))s,s | = |((I λ11T + 11T I) γP π(Φ Φ ))s,s | (i) |(γP π(Φ Φ ))s,s | = γ|(P π(Φ Φ ))s,s | γ|(Φ Φ )s,s |, where (i) comes from using λ 1 and simplifying. Note that we can actually get a tighter contraction factor of λγ for s = s . Given this contractive property, we can prove its convergence with the use of the following lemma. Lemma B.3 (Max λR). The maximum possible value of Φπ λ(s, s ) is 1(s = s ) + (1 1(s = s ))γ Proof. For s = s , Φπ λ(s, s) = 1 + λγEst+1 pπ( |st)Φπ λ(st+1, s). This is just the standard SR recursion with discount factor λγ, so the maximum is k=0 (λγ)k = 1 1 λγ . (B.1) For s = s , 1(st = s ) = 0, so Φπ λ(s, s ) = γEst+1 pπ( |st)Φπ λ(st+1, s ). Observe that Φπ λ(s, s) Φπ λ(s, s ) for s = s, so the maximum is attained for st+1 = s . We can then use the result for s = s to get Φπ λ(s, s ) = γ 1 1 λγ Combining Eq. (B.1) and Eq. (B.2) yields the desired result. Proposition 4.1 (Convergence). Under the conditions assumed above, set Φ(0) = (1 λ)I. For k = 1, 2, . . . , suppose that Φ(k+1) = Gπ λΦ(k). Then |(Φ(k) Φπ λ)s,s | γk+1 Proof. Using the notation Xs,s = X(s, s ) for a matrix X: |(Φ(k) Φπ λ)s,s | = |(Gk λΦ(0) Gk λΦπ λ)s,s | = |(Gk λΦ(0) Φπ λ)s,s | (i) γk|(Φ(0) Φπ λ)s,s | (ii) = γkΦπ λ(s, s ) where (i) is due to Lemma B.2, (ii) is because Φ(0)(s, s ) = 0 for s = s , and (iii) is due to Lemma B.3. Lemma B.4 (Subadditivity). For any s S, policy π, λ [0, 1), and disjoint measurable sets A, B S, Φπ λ(s, A B) < Φπ λ(s, A) + Φπ λ(s, B). Proof. Note that for disjoint sets A, B, we have nt(A B, k) = nt(A, k)+nt(B, k). Hence, conditioned on some policy π and st = s, λnt(A B,k)P(st+k A B) = λnt(A,k)λnt(B,k)P(st+k A) + λnt(A,k)λnt(B,k)P(st+k B) λnt(A,k)P(st+k A) + λnt(B,k)P(st+k B), where the first line follows from P(st+k A B) = P(st+k A) + P(st+k B). Equality holds over all A, B, t, k if and only if λ = 1. Summing over k yields the result. B.1 Proof of Theorem 5.1 We first prove two results, which rely throughout on the fact that Φλ(s, a, s ) 1 1 λγ for all s, a, s , which follows from Lemma B.3. For simplicity, we also assume throughout that all rewards are non-negative, but this assumption can easily be dropped by taking absolute values of rewards. The proofs presented here borrow ideas from those of [16]. Lemma B.5. Let {Mj}n j=1 M and M M be a set of tasks in an environment M with diminishing rate λ and let Qπ j denote the action-value function of an optimal policy of Mj when executed in M. Given estimates Qπj such that Qπ j Qπj ϵ for all j, define π(s) argmax a max j Qπj(s, a). Qπ(s, a) max j Qπ j (s, a) 1 1 γ 2ϵ + γ(1 λ)r(s, a) where r denotes the reward function of M. Proof. Define Qmax(s, a) := maxj Qπj(s, a) and Qmax(s, a) := maxj Qπ j (s, a). Let T ν denote the Bellman operator of a policy ν in task M. 
For all (s, a) S A and T π i Qmax(s, a) = r(s, a) + γ X s p(s |s, a) (λ 1)ri(s, a)Φπ(s , π(s ), s) + Qmax(s , π(s )) = r(s, a) + γ X s p(s |s, a) (λ 1)ri(s, a)Φπ(s , π(s ), s) + max b Qmax(s , b) r(s, a) + γ X s p(s |s, a) (λ 1)ri(s, a)Φπ(s , π(s ), s) + max b Qmax(s , b) γϵ r(s, a) + γ X s p(s |s, a) (λ 1)ri(s, a)Φπ(s , π(s ), s) + Qmax(s , π j(s )) γϵ r(s, a) + γ X s p(s |s, a) (λ 1)ri(s, a)Φπ(s , π(s ), s) + Q π j i (s , π j(s )) γϵ = r(s, a) + γ X s p(s |s, a) (λ 1)ri(s, a)Φπ j (s , π j(s ), s) + Q π j i (s , π j(s )) γϵ + γ(λ 1)r(s, a) X s p(s |s, a) Φπ(s , π(s ), s) Φπ j (s , π j(s ), s) T π j i Q π j i (s, a) γϵ γ(1 λ)r(s, a) = Q π j i (s, a) γϵ γ(1 λ)r(s, a) This holds for any j, so T π Qmax(s, a) max j Q π j i (s, a) γϵ γ(1 λ)r(s, a) = Qmax(s, a) γϵ γ(1 λ)r(s, a) Qmax(s, a) ϵ γϵ γ(1 λ)r(s, a) Next, note that for any c R, T π( Qmax(s, a) + c) = T π Qmax(s, a) + γ X s p(s |s, a)c = T π Qmax(s, a) + γc. Putting everything together, and using the fact that T ν is monotonic and contractive, Qπ i (s, a) = lim k (T π)k Qmax(s, a) " Qmax(s, a) ϵ(1 + γ) γ(1 λ)r(s, a) Qmax(s, a) 1 1 γ ϵ(1 + γ) γ(1 λ)r(s, a) Qmax(s, a) ϵ 1 1 γ ϵ(1 + γ) γ(1 λ)r(s, a) Qπ j (s, a) 1 1 γ 2ϵ + γ(1 λ)r(s, a) This holds for every j, hence the result. Lemma B.6. Let ν be any policy, λ, ˆλ [0, 1], and Qλ denote a value function with respecting to diminishing rate λ. Then, Qν λ Qν ˆλ |λ ˆλ| r Proof. The proof follows from the definition of Q: for every (s, a) S A, |Qν λ(s, a) Qν ˆλ(s, a)| = k=0 γk λnt(st+k,k) ˆλnt(st+k,k) r(st+k) st = s, at = a k=0 γk λnt(st+k,k) ˆλnt(st+k,k) r(st+k) st = s, at = a k=0 γkr(st+k) λ ˆλ nt(st+k,k) 1 X j=0 λnt(st+k,k) 1 jˆλj st = s, at = a k=0 γkr(st+k) st = s, at = a Theorem 5.1 (GPI). Let {Mj}n j=1 M and M M be a set of tasks in an environment M and let Qπ j denote the action-value function of an optimal policy of Mj when executed in M. Assume that the agent uses diminishing rate ˆλ that may differ from the true environment diminishing rate λ. Given estimates Qπj such that Qπ j Qπj ϵ for all j, define π(s) argmax a max j Qπj(s, a). Qπ(s, a) max j Qπ j (s, a) 1 1 γ 2ϵ + |λ ˆλ| r + γ(1 λ)r(s, a) Proof. Let Qλ denote a value function with respect to diminishing constant λ. We wish to bound max j Q π j ˆλ (s, a) Qπ λ(s, a), i.e., the value of the GPI policy with respect to the true λ compared to the maximum value of the constituent policies π j used for GPI, which were used assuming ˆλ. By the triangle inequality, max j Q π j ˆλ (s, a) Qπ λ(s, a) max j Q π j λ (s, a) Qπ λ(s, a) + | max j Q π j λ (s, a) max j Q π j ˆλ (s, a)| max j Q π j λ (s, a) Qπ λ(s, a) | {z } (1) + max j |Q π j λ (s, a) Q π j ˆλ (s, a)| | {z } (2) We bound (1) by Lemma B.5 and (2) by Lemma B.6 to get the result. B.2 An Extension of Theorem 5.1 Inspired by [18], we prove an extension of Theorem 5.1: Theorem B.1. Let M M be a task in an environment M with true diminishing constant λ. Suppose we perform GPI assuming a diminishing constant ˆλ: Let {Mj}n j=1 and Mi be tasks in M and let Q π j i denote the actionvalue function of an optimal policy of Mj when executed in Mi. Given estimates Q πj i such that Q π j i Q πj i ϵ for all j, define π(s) argmaxa maxj Q πj i (s, a). Let Qπ ˆλ and Qπ λ denote the action-value functions of π and the M-optimal policy π when executed in M, respectively. Then, Qπ λ Qπ ˆλ 2 1 γ 2|λ ˆλ| r + ϵ + r ri + min j ri rj where C is a positive constant not depending on λ: C = γ 2 r ri + 2 minj ri rj + min ( r , ri ) + min ( ri , r1 , . . . 
, rn ) Note that when λ = 1, we recover Proposition 1 of [18] with an additional term quantifying error incurred by ˆλ = λ. The proof relies on two other technical lemmas, presented below. Qπ Q π i i r ri 1 γ + γ(1 λ)min ( r , ri ) + r ri (1 γ)(1 λγ) . Proof. Define i := Qπ Q π i i . For any (s, a) S A, |Qπ (s, a) Q π i i (s, a)| = r(s, a) + γ X s p(s |s, a) (λ 1)r(s, a)Φπ (s , π (s ), s) + Qπ (s , π (s ) ri(s, a) γ X s p(s |s, a) (λ 1)ri(s, a)Φπ i (s , π i (s ), s) + Q π i i (s , π i (s ) |r(s, a) ri(s, a)| + γ X s p(s |s, a)|Qπ (s, a) Q π i i (s, a)| s p(s |s, a) r(s, a)Φπ (s , π (s ), s) ri(s, a)Φπ i (s , π i (s ), s) r ri + γ i + γ(1 λ) rΦπ riΦπ i . The third term decomposes as rΦπ riΦπ i rΦπ rΦπ i + rΦπ i riΦπ i We could equivalently use the following decomposition: rΦπ riΦπ i rΦπ riΦπ + riΦπ riΦπ i rΦπ riΦπ i min ( r , ri ) + r ri The inequalities above hold for all s, a and so i r ri + γ i + γ(1 λ)min ( r , ri ) + r ri 1 γ + γ(1 λ)min ( r , ri ) + r ri (1 γ)(1 λγ) . Hence the result. Lemma B.8. For any policy π, Qπ i Qπ r ri 1 γ + γ(1 λ) r ri (1 γ)(1 λγ). Proof. Write i := Qπ i Qπ . Proceeding as in the previous lemma, for all (s, a) S A, we have |Qπ i (s, a) Qπ(s, a)| = ri(s, a) + γ X s p(s |s, a) ((λ 1)ri(s, a)Φπ(s , π(s ), s) + Qπ i (s , π(s ))) r(s, a) γ X s p(s |s, a) ((λ 1)r(s, a)Φπ(s , π(s ), s) + Qπ(s , π(s ))) |r(s, a) ri(s, a)| + γ X s p(s |s, a)(1 λ)|r(s, a) ri(s, a)|Φπ(s , π(s ), s) s p(s |s, a)|Qπ i (s , π(s )) Qπ(s , π(s ))| r ri + γ(1 λ) r ri 1 1 λγ + γ i = i r ri + γ(1 λ) r ri 1 γ + γ(1 λ) r ri (1 γ)(1 λγ) . Finally, we prove Theorem B.1: Proof of Theorem B.1. By the triangle inequality, Qπ λ Qπ ˆλ Qπ λ Qπ λ + Qπ λ Qπ ˆλ . By Lemma B.6, the second term is bounded above by The first term decomposes as follows (dropping the λ subscript on all action-value functions for clarity): Qπ Qπ Qπ Q π i i | {z } (1) + Q π i i Qπ i | {z } (2) + Qπ i Qπ | {z } (3) Applying Lemma B.5 to (2) (but with respect to Mi rather than M), we have that for any j, Q π i i (s, a) Qπ i (s, a) Q π i i (s, a) Q π j i (s, a) + 1 1 γ 2ϵ + γ(1 λ)ri(s, a) = Q π i i Qπ i Q π i i Q π j j | {z } (2.1) + Q π j j Q π j i | {z } (2.2) 2ϵ + γ(1 λ) ri We bound (2.1) using Lemma B.7 and (2.2) using Lemma B.8 (but with respect to Mj rather than M): Q π i i Q π j j + Q π j j Q π j i 2 ri rj 1 γ + γ(1 λ)min ( ri , rj ) + 2 ri rj (1 γ)(1 λγ) . We then apply Lemma B.7 to (1) and Lemma B.8 to (3) to get the result. C An nth Occupancy Representation To generalize the first occupancy representation to account for reward functions of this type, it s natural to consider an Nth occupancy representation that is, one which accumulates value only for the first N occupancies of one state s starting from another state s: Definition C.1 (NR). For an MDP with finite S, the Nth-occupancy representation (NR) for a policy π is given by F π [0, N]|S| |S| such that Φπ (N)(s, s ) Eπ k=0 γt+k1 (st+k = s , # ({j | st+j = s , j [0, k 1]}) < N) st Intuitively, such a representation sums the first N (discounted) occupancies of s from time t to t + k starting from st = s. We can also note that Φπ (1) is simply the FR and Φ(0)(s, s ) = 0 s, s . As with the FR and the SR, we can derive a recursive relationship for the NR: Φπ (N)(s, s ) = 1(st = s )(1 + γEΦπ (N 1)(st+1, s )) + γ(1 1(st = s ))EΦπ (N)(st+1, s ), (C.2) where the expectation is wrt pπ(st+1|st). Once again, we can confirm that this is consistent with the FR by noting that for N = 1, the NR recursion recovers the FR recursion. 
Crucially, we also recover the SR recursion in the limit as N : lim N7 Φπ (N)(s, s ) = 1(st = s )(1 + γEΦπ ( )(st+1, s )) + γ(1 1(st = s ))EΦπ ( )(st+1, s ) = 1(st = s ) + γEΦπ ( )(st+1, s ). This is consistent with the intuition that the SR accumulates every (discounted) state occupancy in a potentially infinite time horizon of experience. While Definition C.1 admits a recursive form which is consistent with our intuition, Eq. (C.2) reveals an inconvenient intractability: the Bellman target for Φπ (N) requires the availability of Φπ (N 1). This is a challenge, because it means that if we d like to learn any NR for finite N > 1, the agent also must learn and store Φπ (1), . . . Φπ (N 1). Given these challenges, the question of how to learn a tractable general occupancy representation remains. From a neuroscientific perspective, a fixed depletion amount is also inconsistent with both behavioral observations and neural imaging [3], which indicate instead that utility disappears at a fixed rate in proportion to the current remaining utility, rather than in proportion to the original utility. We address these theoretical and practical issues in the next section. D Additional Related Work Another important and relevant sub-field of reinforcement learning is work which studies non-stationary rewards. Perhaps most relevant, the setting of DMU can be seen as a special case of submodular planning and reward structures [31, 32]. [31] focus specifically on planning and not the form of diminishment we study, while [32] is a concurrent work which focuses on the general class of submodular reward problems and introduces a policy-based, REINFORCE-like method which is necessarily on-policy. In contrast, we focus on a particular sub-class of problems especially relevant to natural behavior and introduce a family of approaches which exploit this reward structure, are value-based (and which can be used to modify the critic in policy-based methods), and are compatible with off-policy learning. Other important areas include convex [33] and constrained [34, 35] MDPs. In these cases, non-stationarity is introduced by way of a primal-dual formulation of distinct problem classes into min-max games. E Further Experimental Details E.1 Policy Evaluation We perform policy evaluation for the policy shown in Fig. 4.1 on the 6 6 gridworld shown. The discount factor γ was set to 0.9 for all experiments, which were run for H = 10 steps per episode. The error metric was the mean squared error: Qerror 1 |S||A| s,a (Qπ(s, a) ˆQ(s, a))2, (E.1) where Qπ is the ground truth Q-values and ˆQ is the estimate. Transitions are deterministic. For the dynamic programming result, we learned the λR using Eq. (4.3) for λ {0.5, 1.0} and then measured the resulting values by multiplying the resulting λR by the associated reward vector r { 1, 0, 1}36, which was 1 in all wall states and +1 at the reward state g. We compared the results to the ground truth values. Dynamic programming was run until the maximum Bellman error across state-action pairs reduced below 5e-2. For the tabular TD learning result, we ran the policy for three episodes starting from every available (non-wall) state in the environment, and learned the λR for λ {0.5, 1.0} as above, but using the online TD update: Φλ(st, at) Φλ(st, at) + αδt, δt = 1(st) (1 + γλΦλ(st+1, at+1)) + γ(1 1(st)) Φλ(st+1, at+1) Φλ(st, at), where at+1 π( | st+1). The learned Q-values were then computed in the same way as the dynamic programming case and compared to the ground truth. 
For the λF result, we first learned Laplacian eigenfunction base features as described in [25] from a uniform exploration policy and normalized them to the range [0, 1]. We parameterized the base feature network as a 2-layer MLP with Re LU activations and 16 units in the hidden layer. We then used the base features to learn the λFs as in the tabular case, but with the λF network parameterized as a three-layer MLP with 16 units in each of the hidden layers and Re LU activations. All networks were optimized using Adam with a learning rate of 3e-4. The tabular and neural network experiments were repeated for three random seeds, the former was run for 1,500 episodes and the latter for 2,000. E.2 Policy Learning We ran the experiments for Fig. 5.2 in a version of the Two Rooms environment from the Neuro Nav benchmark [22] with reward modified to decay with a specified Algorithm 1: Online Tabular Qλ-Learning Update 1: Require: Current λR-values Φ(t) λ R|S| |A| |S|, current reward vector r(t), observed (st, at, st+1) tuple 2: Compute Qλ-values: Q(t) λ (Φ(t) λ )Tr(t) 3: Select greedy action: at+1 argmaxa A Q(t) λ (st+1, a) 4: Update Φλ: Φ(t+1) λ (st, at) Φ(t) λ (st, at) + αδ(t), where δ(t) = 1(st) (1 + γλΦ(t) λ (st+1, at+1)) + γ(1 1(st)) Φ(t) λ (st+1, at+1) Φ(t) λ (st, at). 5: Return updated Φ(t+1) λ λtrue = 0.5 and discount factor γ = 0.95. The initial rewards in the top right goal and the lower room goal locations were 5 and the top left goal had initial reward 10. The observations in the neural network experiment were one-hot state indicators. The tabular Qλ experiments run the algorithm in Algorithm 1 for 500 episodes for λ {0.0, 0.5, 1.0}, with λtrue set to 0.5, repeated for three random seeds. Experiments used a constant step size α = 0.1. There were five possible actions: up, right, down, left, and stay. The recurrent A2C agents were based on the implementation from the BSuite library [36] and were run for 7,500 episodes of maximum length H = 100 with γ = 0.99 using the Adam optimizer with learning rate 3e-4. The experiment was repeated for three random seeds. The RNN was an LSTM with 128 hidden units and three output heads: one for the policy, one for the value function, and one for the λF. The base features were one-hot representations of the current state, 121-dimensional in this case. E.3 Tabular GPI The agent is assumed to be given or have previously acquired four policies {π0, π1, π2, π3} individually optimized to reach rewards located in each of the four rooms of the environment. There are three reward locations {g0, g1, g2} scattered across the rooms, each with its own initial reward r = [5, 10, 5] and all with λ = 0.5. At the beginning of each episode, an initial state s0 is sampled uniformly from the set of available states. An episode terminates either when the maximum reward remaining in any of the goal states is less than 0.1 or when the maximum number of steps H = 40 is reached. Empty states carry a reward of 0, encountering a wall gives a reward of 1, and the discount factor is set to γ = 0.97. For each of the four policies, we learn λRs with λ equal to 0, 0.5, and 1.0 using standard dynamic programming (Bellman error curves plotted in Fig. F.3), and record the returns obtained while performing GPE+GPI with each of these representations over the course of 50 episodes. Bellman error curves for the λRs are In the left panel of Fig. 5.3, we can indeed see that using the correct λ (0.5) nets the highest returns. 
Example trajectories for each λR are shown in the remaining panels of Fig. 5.3.

E.4 Pixel-Based GPI

In this case, the base policies Π were identical to those used in the tabular GPI experiments. First, we collected a dataset consisting of 340 observation trajectories $(o_0, o_1, \dots, o_{H-1}) \in \mathcal{O}^H$ with H = 19 from each policy, totalling 6,460 observations. Raw observations were 128 × 128 × 3 and were converted to grayscale. The previous seven observations were stacked and used to train a Laplacian eigenfunction base feature network in the same way as [25]. For observations fewer than seven steps from the start of an episode, the remaining frames were filled in as all-black observations (i.e., zeros). The network consisted of four convolutional layers with 32 3 × 3 filters with strides (2, 2, 2, 1), each followed by a ReLU nonlinearity. This was then flattened and passed through a LayerNorm layer [37] and a tanh nonlinearity before three fully connected layers, the first two with 64 units each and ReLU nonlinearities and the final, output layer with 50 units. The output was L2-normalized as in [25]. This network $\phi : \mathcal{O}^7 \to \mathbb{R}^D$ (with D = 50) was trained on the stacked observations for 10 epochs using the Adam optimizer with learning rate 1e-4 and batch size B = 64. To perform policy evaluation, the resulting features, evaluated on the dataset of stacked observations, were collected into their own dataset of $(s_t, a_t, s_{t+1}, a_{t+1})$ tuples, where $s_t \triangleq o_{t-6:t}$. The states were normalized to lie between 0 and 1, and a vector w was fit to the actual associated rewards via linear regression on the complete dataset. The λF network was then trained using a form of neural fitted Q-iteration [FQI; 38] modified for policy evaluation with λFs (Algorithm 2). The architecture for the λF network was identical to the base feature network, with the exception that the hidden size of the fully connected layers was 128 and the output dimension was D|A| = 250. FQI was run for K = 20 outer-loop iterations, with each inner-loop supervised learning problem run for L = 100 epochs on the current dataset. Supervised learning was done using Adam with learning rate 3e-4 and batch size B = 64. Given the trained networks, GPI proceeded as in the tabular case, i.e.,
$$a_t = \arg\max_{a\in\mathcal{A}}\max_{\pi\in\Pi} w^\top\varphi^\pi_\theta(s_t, a). \qquad (E.2)$$
50 episodes were run from random starting locations for H = 50 steps and the returns were measured. Learning curves for the base features and for λF fitting are shown in Fig. E.1. The λF curve measures the mean squared error as in Eq. (E.1). The feature visualizations were created by performing PCA to reduce the average λF representations for observations at each state in the environment to 2D. Each point in the scatter plot represents the reduced representation on the xy plane, and is colored according to the λ-conditioned value of the underlying state.

Figure E.1: Learning curves for λF policy evaluation (left: base feature learning over epochs; right: FQI iterations). Results are averaged over three runs, with shading indicating one unit of standard error.

E.5 Continuous Control: λ-SAC

See Appendix I for details.

E.6 Learning the λO with FB

Training the λO with the FB parameterization proceeds in much the same way as in [12], but adjusted for a different norm and a non-Markovian environment. We summarize the learning procedure in Algorithm 3.
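Algorithm 2 (listed next) constructs regression targets for the λF network. As a rough illustration, a single Stage-1 target can be computed as in the following sketch (variable names are ours; the products are elementwise, mirroring the tabular recursion):

```python
import numpy as np

def fqi_lambda_target(phi_s, psi_next, w, lam, gamma):
    """Regression target for fitted Q_lambda-iteration (sketch of a single Stage-1 entry).
    phi_s:    base features of the current state, shape [D], assumed to lie in [0, 1]
    psi_next: current lambda-F estimate at the successor pair (s', a'), shape [D]
    w:        reward feature weights, shape [D]
    """
    target_features = phi_s * (1.0 + lam * gamma * psi_next) + gamma * (1.0 - phi_s) * psi_next
    return float(w @ target_features)
```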
Algorithm 2: Fitted Qλ-Iteration
1: Require: dataset of base features $\{\phi(s) \in \mathbb{R}^D\}_{s\in\mathcal{S}}$, decay rate λ, discount factor γ, reward feature vector $w \in \mathbb{R}^D$, batch size B, learning rate α
2: Initialize λF $\varphi_\theta$ parameters $\theta^{(1)}$ (we drop the subscript λ and superscript π for concision)
3: for k = 1, . . . , K do
4:   // Stage 1: Construct dataset
5:   $\mathcal{D} \leftarrow \emptyset$
6:   for $(s, a) \in \mathcal{S}\times\mathcal{A}$ do
7:     for $(s', a') \in \mathcal{S}\times\mathcal{A}$ do
8:       $\mathcal{D} \leftarrow \mathcal{D} \cup \{((s, a),\, y)\}$, where $y = w^\top\big[\phi(s)\odot(1 + \lambda\gamma\,\varphi_{\theta^{(k)}}(s', a')) + \gamma(1 - \phi(s))\odot\varphi_{\theta^{(k)}}(s', a')\big]$
9:     end for
10:  end for
11:  // Stage 2: Supervised learning
12:  Randomly initialize $\theta_0$
13:  for ℓ = 1, . . . , L do
14:    Randomly shuffle $\mathcal{D}$
15:    for minibatches $\{((s_b, a_b), y_b)\}_{b=1}^B \subset \mathcal{D}$ do
16:      $\theta_\ell \leftarrow \theta_{\ell-1} - \alpha\nabla_\theta \frac{1}{2B}\sum_{b=1}^B\big(y_b - w^\top\varphi_{\theta_{\ell-1}}(s_b, a_b)\big)^2$
17:    end for
18:  end for
19:  $\theta^{(k+1)} \leftarrow \theta_L$
20: end for

The loss function $\mathcal{L}$ is derived in Appendix H, with the addition of the following regularizer:
$$\mathbb{E}_{s\sim\rho}\big\|B_\omega(s)B_\omega(s)^\top - I\big\|^2.$$
This regularizer encourages B to be approximately orthonormal, which promotes identifiability of $F_\theta$ and $B_\omega$ [12].

Algorithm 3: λO FB Learning
1: Require: probability distribution ν over $\mathbb{R}^d$, randomly initialized networks $F_\theta$, $B_\omega$, learning rate η, mini-batch size B, number of episodes E, number of epochs M, number of time steps per episode T, number of gradient steps N, regularization coefficient β, Polyak coefficient α, initial diminishing constant λ, discount factor γ, exploratory-policy greediness ϵ, temperature τ
2: // Stage 1: Unsupervised learning phase
3: $\mathcal{D} \leftarrow \emptyset$
4: for epoch m = 1, . . . , M do
5:   for episode i = 1, . . . , E do
6:     Sample $z \sim \nu$
7:     Observe initial state $s_1$
8:     for t = 1, . . . , T do
9:       Select $a_t$ ϵ-greedily with respect to $F_\theta(s_t, a, z)^\top z$
10:      Observe reward $r_t(s_t)$ and next state $s_{t+1}$
11:      $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r_t(s_t), s_{t+1})\}$
12:    end for
13:  end for
14:  for n = 1, . . . , N do
15:    Sample a minibatch $\{(s_j, a_j, r_j(s_j), s_{j+1})\}_{j\in J} \subset \mathcal{D}$ of size |J| = B
16:    Sample a minibatch $\{\tilde{s}_j\}_{j\in J} \subset \mathcal{D}$ of size |J| = B
17:    Sample a minibatch $\{s'_j\}_{j\in J}$ i.i.d. from µ of size |J| = B
18:    Sample a minibatch $\{z_j\}_{j\in J}$ i.i.d. from ν of size |J| = B
19:    For every $j \in J$, set $\pi_{z_j}(\cdot\mid s_{j+1}) = \mathrm{softmax}\big(F_{\bar\theta}(s_{j+1}, \cdot, z_j)^\top z_j/\tau\big)$
20:    Compute the loss
$$\mathcal{L}(\theta, \omega) = \frac{1}{2B^2}\sum_{j,k\in J}\Big(F_\theta(s_j, a_j, z_j)^\top B_\omega(s'_k) - \gamma\sum_{a\in\mathcal{A}}\pi_{z_j}(a\mid s_{j+1})F_{\bar\theta}(s_{j+1}, a, z_j)^\top B_{\bar\omega}(s'_k)\Big)^2 - \frac{1}{B}\sum_{j\in J}F_\theta(s_j, a_j, z_j)^\top B_\omega(s_j) + \frac{\gamma(1-\lambda)}{B}\sum_{j\in J}\mu(s_j)\,F_\theta(s_j, a_j, z_j)^\top B_\omega(s_j)\sum_{a\in\mathcal{A}}\pi_{z_j}(a\mid s_{j+1})F_{\bar\theta}(s_{j+1}, a, z_j)^\top B_{\bar\omega}(s_j) + \beta\,\mathcal{L}_{\text{ortho}}(\omega),$$
where $\bar\theta, \bar\omega$ denote target-network parameters and $\mathcal{L}_{\text{ortho}}(\omega)$ is the minibatch estimate (over $\{s_j\}_{j\in J}$ and $\{\tilde{s}_k\}_{k\in J}$) of the orthonormality regularizer above
21:    Update θ and ω via one step of Adam on $\mathcal{L}$
22:    Sample a minibatch $\{(s_j, r_j(s_j), s_{j+1}, r_{j+1}(s_{j+1}))\}_{j\in J}$ of size |J| = B from $\mathcal{D}$
23:    $\mathcal{L}_\lambda(\lambda) \leftarrow \frac{1}{2B}\sum_{j\in J}\mathbb{1}(s_{j+1} = s_j)\big(r_{j+1}(s_{j+1}) - \lambda r_j(s_j)\big)^2$
24:    Update λ via one step of Adam on $\mathcal{L}_\lambda$
25:  end for
26:  $\bar\theta \leftarrow \alpha\bar\theta + (1 - \alpha)\theta$
27:  $\bar\omega \leftarrow \alpha\bar\omega + (1 - \alpha)\omega$
28: end for
29: // Stage 2: Exploitation phase for a single episode with initial reward $r_0(s)$
30: $z_R \leftarrow \sum_{s\in\mathcal{S}}\mu(s)\,r_0(s)\,B_\omega(s)$
31: Observe initial state $s_1$
32: for t = 1, . . . , T do
33:   $a_t \leftarrow \arg\max_{a\in\mathcal{A}} F_\theta(s_t, a, z_R)^\top z_R$
34:   Observe reward $r_t(s)$ and next state $s_{t+1}$
35:   $z_R \leftarrow \sum_{s\in\mathcal{S}}\mu(s)\,r_t(s)\,B_\omega(s)$
36: end for

F Additional Results

See surrounding sections.

Figure F.1: A simple grid and several policies.

Figure F.3: Convergence of dynamic programming (Bellman error vs. iteration) on Four Rooms with and without stochastic transitions (transition noise 0.0 and 0.2), for λ ∈ {0.0, 0.25, 0.5, 0.75, 1.0}.

Figure F.4: GPI with noisy transitions in Four Rooms for λ ∈ {0.0, 0.5, 1.0}.

To verify that performance was maintained even with stochastic transitions, we added a 20% probability that a given action would result in a random transition to a neighboring state. Results are consistent with Fig. 5.3, indicating that having the correct value of λ produces better performance.
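As an aside, the λ-update in Algorithm 3 (lines 23-24) is a one-parameter least-squares problem over self-transitions, so its minimizer can also be computed in closed form. The sketch below shows this closed-form estimate (our own helper for illustration only; the algorithm itself takes Adam steps on $\mathcal{L}_\lambda$):

```python
import numpy as np

def estimate_lambda(rewards_t, rewards_tp1, same_state):
    """Closed-form least-squares estimate of the diminishing rate lambda (sketch).
    Minimizes the sum over self-transitions of (r_{t+1} - lambda * r_t)^2, cf. L_lambda in Algorithm 3.
    rewards_t, rewards_tp1: rewards at consecutive steps, shape [N]
    same_state:             boolean mask, True where s_{t+1} == s_t, shape [N]
    """
    r_t = rewards_t[same_state]
    r_tp1 = rewards_tp1[same_state]
    denom = (r_t ** 2).sum()
    # If no informative self-transitions were observed, fall back to no diminishment.
    return float((r_t * r_tp1).sum() / denom) if denom > 0 else 1.0
```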
G Advantage of the Correct λ

Importantly, for GPE using the λR to work in this setting, the agent must either learn or be provided with the updated reward vector $r_\lambda$ after each step/encounter with a rewarded state. This is because the λR is forward-looking: it measures the (diminished) expected occupancies of states in the future, without an explicit mechanism for remembering previous visits. For simplicity, we provide this vector to the agent at each step; if we view such a multitask agent simply as a module carrying out the directives of a higher-level module or policy within a hierarchical framework (as in, e.g., Feudal RL [39]), the explicit provision of reward information is not unrealistic. Regardless, a natural question in this case is whether there is actually any value in using the λR with the correct value of λ in this setting: if the agent is provided with the correct reward vector, then wouldn't policy evaluation work with any λR?

Figure F.2: Visualizing the SR, the λR, and the FR. We can see that $\Phi^\pi_1$ is equivalent to the SR and $\Phi^\pi_0$ is equivalent to the FR, with intermediate values of λ providing a smooth transition between the two.

To see that this is not the case, consider the three-state toy MDP shown in Fig. G.1, where $r(s_1) = 10$, $r(s_2) = 6$, $r(s_0) = 0$, $\lambda(s_1) = 0$, $\lambda(s_2) = 1.0$, and γ = 0.99. At time t = 0, the agent starts in $s_0$. Performing policy evaluation with $\lambda(s_1) = \lambda(s_2) = 1$ (i.e., with the SR) would lead the agent to go left to $s_1$. However, the reward would then disappear, and policy evaluation on the second step would lead it to move right to $s_0$ and then $s_2$, where it would stay for the remainder of the episode. In contrast, performing PI with the correct values of λ would lead the agent to go right to $s_2$ and stay there. In the first two timesteps, the first policy nets a total reward of 10 + 0 = 10, while the second policy nets 6 + 5.94 = 11.94. (The remaining decisions are identical between the two policies.) This is a clear example of the benefit of having the correct λ, as incorrect value estimation leads to suboptimal decisions even when the correct reward vector/function is provided at each step.

Figure G.1: A 3-state toy environment.

H The λ Operator

To learn the λO, we would like to define $\Phi^\pi_\lambda(s_t, ds') \triangleq \varphi^\pi_\lambda(s_t, s')\,\mu(ds')$ for some base measure µ. However, this would lead to a contradiction:
$$\Phi^\pi_\lambda(s, A\cup B) = \int_A \varphi^\pi_\lambda(s, s')\,\mu(ds') + \int_B \varphi^\pi_\lambda(s, s')\,\mu(ds') = \Phi^\pi_\lambda(s, A) + \Phi^\pi_\lambda(s, B)$$
for all disjoint A, B, contradicting Lemma B.4. For now, we describe how to learn the λO for discrete $\mathcal{S}$, in which case we have $\Phi^\pi_\lambda(s, s') = \varphi^\pi_\lambda(s, s')\mu(s')$, i.e., by learning φ we learn a weighted version of Φ. We define the following norm, inspired by Touati et al. [25]:
$$\|\Phi^\pi_\lambda\|^2_{\rho,\mu} \triangleq \mathbb{E}_{s\sim\rho,\,s'\sim\mu}\left[\left(\frac{\Phi^\pi_\lambda(s, s')}{\mu(s')}\right)^2\right],$$
where µ is any density on $\mathcal{S}$. In the case of finite $\mathcal{S}$, we let µ be the uniform density. We then minimize the Bellman error for $\Phi^\pi_\lambda$ with respect to $\|\cdot\|^2_{\rho,\mu}$ (dropping the sub- and superscripts on Φ and φ for clarity):
$$\mathcal{L}(\Phi) = \big\|\varphi\mu - \big(I\odot(\mathbf{1}\mathbf{1}^\top + \lambda\gamma P^\pi\varphi\mu) + \gamma(\mathbf{1}\mathbf{1}^\top - I)\odot P^\pi\varphi\mu\big)\big\|^2_{\rho,\mu}.$$
Expanding the square, using $\Phi = \varphi\mu$, and treating the next-state terms as fixed (semi-gradient) targets $\bar\varphi$, all terms that do not depend on φ can be collected into an additive constant c, giving
$$\mathcal{L}(\Phi) = \mathbb{E}_{s_t,s_{t+1}\sim\rho,\,s'\sim\mu}\Big[\big(\varphi(s_t, s') - \gamma\bar\varphi(s_{t+1}, s')\big)^2\Big] - 2\,\mathbb{E}_{s_t\sim\rho}\big[\varphi(s_t, s_t)\big] + 2\gamma(1-\lambda)\,\mathbb{E}_{s_t,s_{t+1}\sim\rho}\big[\mu(s_t)\,\varphi(s_t, s_t)\,\bar\varphi(s_{t+1}, s_t)\big] + c.$$
Note that we recover the SM loss when λ = 1.
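To make the sample-based loss above concrete, a minimal PyTorch sketch is given below, written under the semi-gradient convention just described (the tensor names and batching scheme are our own illustrative assumptions, not the paper's implementation):

```python
import torch

def lambda_operator_loss(phi_vals, phi_diag, phi_next_vals, phi_next_diag, mu_st, gamma, lam):
    """Sample-based loss for the discrete lambda-operator (sketch).
    phi_vals:      phi(s_t, s') for s' ~ mu,                   shape [B]
    phi_diag:      phi(s_t, s_t),                              shape [B]
    phi_next_vals: target copy phi_bar(s_{t+1}, s'),           shape [B]
    phi_next_diag: target copy phi_bar(s_{t+1}, s_t),          shape [B]
    mu_st:         mu(s_t), e.g. 1/|S| for a uniform density,  shape [B]
    """
    td_term = (phi_vals - gamma * phi_next_vals.detach()).pow(2).mean()
    occupancy_term = -2.0 * phi_diag.mean()
    revisit_correction = 2.0 * gamma * (1.0 - lam) * (mu_st * phi_diag * phi_next_diag.detach()).mean()
    return td_term + occupancy_term + revisit_correction
```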
An interesting further interpretation is that when the agent can never return to its previous state (i.e., $\bar\varphi(s_{t+1}, s_t) = 0$), we also recover the SM loss, regardless of λ. In this way, the above loss appears to correct for repeated state visits so that the measure only reflects the first visit. Even though the λO is not a measure, we can extend the above loss to the continuous case, pretending as though we could take the Radon-Nikodym derivative $\Phi(s, ds')/\mu(ds')$. With the FB parameterization $\varphi^{\pi_z}(s, a, s') \approx F(s, a, z)^\top B(s')$, the loss becomes
$$\mathcal{L}(\Phi) = \mathbb{E}_{s_t,a_t,s_{t+1}\sim\rho,\,s'\sim\mu}\Big[\big(F(s_t, a_t, z)^\top B(s') - \gamma\,\bar F(s_{t+1}, \pi_z(s_{t+1}), z)^\top\bar B(s')\big)^2\Big] - 2\,\mathbb{E}_{s_t,a_t\sim\rho}\big[F(s_t, a_t, z)^\top B(s_t)\big] + 2\gamma(1-\lambda)\,\mathbb{E}_{s_t,a_t,s_{t+1}\sim\rho}\big[\mu(s_t)\,F(s_t, a_t, z)^\top B(s_t)\,\bar F(s_{t+1}, \pi_z(s_{t+1}), z)^\top\bar B(s_t)\big],$$
where $\bar F$ and $\bar B$ denote target copies.

H.1 Experimental Results with the FB Parameterization

To show that knowing the correct value of λ leads to improved performance, we trained the λO with the FB parameterization on the Four Rooms task of Fig. 5.3, but with each episode initialized at a random start state and with two random goal states. Average per-epoch reward is shown in Fig. H.1. We tested performance with $\lambda_{\text{true}}, \lambda_{\text{agent}} \in \{0.5, 1.0\}$, where $\lambda_{\text{true}}$ denotes the true environment diminishing rate and $\lambda_{\text{agent}}$ denotes the diminishing rate that the agent uses. For the purpose of illustration, we do not allow the agent to learn λ. We see in Fig. H.1 that using the correct λ leads to significantly increased performance. In particular, the left plot shows that assuming λ = 1, i.e., using the SR, in a diminishing environment can lead to highly suboptimal performance. Hyperparameters used are given in Table 1 (notation as in Algorithm 3).

Hyperparameter   Value
M                100
E                100
N                25
B                128
T                50
γ                0.99
α                0.95
η                0.001
τ                200
ϵ                1
Table 1: λO-FB hyperparameters.

Figure H.1: Performance of the λO-FB with two values of λ (y-axis: average expert per-epoch reward). Results averaged over six seeds and 10 episodes per seed. Error bars indicate standard error.

H.2 λO and the Marginal Value Theorem

To study whether the agent's behavior is similar to that predicted by the MVT, we use a very simple task with a constant starting state and vary the distance between rewards (see Fig. H.2(a)). When an agent is in a reward state, we define an MVT-optimal leaving time as follows (similar to that of [8], but accounting for the non-stationarity of the reward). Let $\bar{R}$ denote the average per-episode reward received by a trained agent, $r(s_t)$ the reward received at time t in a given episode, $R_t = \sum_{u=0}^{t} r(s_u)$ the total reward received until time t in the episode, and T the episode length. Then, on average, the agent should leave its current reward state at time t if the next reward that it would receive by staying in $s_t$, i.e., $\lambda r(s_t)$, is less than $(\bar{R} - R_t)/(T - t)$. In other words, the agent should leave a reward state when its incoming reward falls below the diminished average per-step reward of the environment. We compute $\bar{R}$ by averaging the reward received by a trained agent over many episodes. Previous studies have trained agents that assume stationary reward to perform foraging tasks, even though the reward in these tasks is non-stationary. These agents can still achieve good performance and MVT-like behavior [8].

Figure H.2: Analysis of MVT-like behavior of the λO-FB. a) Three environments with equal start state and structure but different distances between reward states. b) Difference between the agent's leave times and MVT-predicted leave times for γ = 0.99, with discounting taken into account; the agent on average behaves similarly to the discounted MVT. c) Difference between the agent's leave times and MVT-predicted leave times for γ = 1.0, i.e., with no discounting taken into account; the agent on average behaves similarly to the MVT. Results for (b) and (c) are averaged over three seeds. Error bars indicate standard error.
However, because these agents target the standard RL objective $\mathbb{E}\big[\sum_{k=0}^{\infty}\gamma^k r(s_{t+k}) \mid s_t = s\big]$, which requires γ < 1 for convergence, optimal behavior is recovered only with respect to the discounted MVT, in which $\bar{R}$ (and, in our case, $R_t$) weights rewards by powers of γ [8]. In Fig. H.2(b-c) we perform a similar analysis to that of [8] and show that, on average over multiple distances between rewards, the λO-FB performs similarly to the discounted MVT for γ = 0.99 and to the standard MVT for γ = 1.0. An advantage of the λO is that it is finite for γ = 1.0 provided that λ < 1. Hence, as opposed to previous work, we can recover the standard MVT without the need to adjust for discounting. Hyperparameters used are given in Table 1 (notation as in Algorithm 3).

I Mitigating Value Overestimation

One well-known challenge in deep RL is that the use of function approximation to compute values is prone to overestimation. Standard approaches to mitigate this issue typically do so by using two value functions and either taking the minimum $\min_{i\in\{1,2\}} Q^\pi_i(s, a)$ to form the Bellman target for a given (s, a) pair [40] or combining them in other ways [41]. However, creating multiple networks is expensive in both computation and memory. Instead, we hypothesized that it might be possible to address this issue by using λ-based values. To test this idea, we modified the Soft Actor-Critic [SAC; 42] algorithm to compute λF-based values by augmenting the soft value target $\mathcal{T}_{\text{soft}}Q = r_t + \gamma\,\mathbb{E}\,V_{\text{soft}}(s_{t+1})$, where $V_{\text{soft}}(s_{t+1})$ is given by
$$\mathbb{E}_{a_{t+1}\sim\pi(\cdot\mid s_{t+1})}\Big[Q(s_{t+1}, a_{t+1}) + (\lambda - 1)\,w^\top\big(\phi(s_t, a_t)\odot\varphi_\lambda(s_{t+1}, a_{t+1})\big) - \alpha\log\pi(a_{t+1}\mid s_{t+1})\Big].$$
A derivation as well as pseudocode for the modified loss is provided in Appendix E.5. Observe that for λ = 1 we recover the standard SAC value target, corresponding to an assumed stationary reward. We apply this modified SAC algorithm, which we term λ-SAC, to feature-based MuJoCo continuous control tasks within OpenAI Gym [43]. We found that concatenating the raw state and action observations, $\phi_t = [s_t, a_t]$, and normalizing them to [0, 1] makes an effective regressor for the reward. That is, we compute base features as
$$\phi^b_t = \frac{\phi^b_t - \min_b \phi^b_t}{\max_b \phi^b_t - \min_b \phi^b_t},$$
where b indexes $(s_t, a_t)$ pairs within a batch. Let $X \in [0, 1]^{B\times D}$ be the concatenated matrix of features for a batch, where $D = \dim(\mathcal{S}) + \dim(\mathcal{A})$. Then $w_t = (X^\top X)^{-1}X^\top r$, where here r denotes the vector of rewards from the batch. In addition to using a fixed λ value, ideally we would like an agent to adaptively update λ to achieve the best balance of optimism and pessimism in its value estimates. Following [44], we frame this decision as a multi-armed bandit problem, discretizing λ into three possible values {0, 0.5, 1.0} representing the arms of the bandit. At the start of each episode, a value of λ is sampled from these arms and used in the value function update. The probability of each arm is updated using the Exponentially Weighted Average Forecasting algorithm [45], which modulates the probabilities in proportion to a feedback score. As in [44], we use the difference in cumulative (undiscounted) reward between the current episode ℓ and the previous one, $R_\ell - R_{\ell-1}$, as this feedback signal; that is, the probability of selecting a given value of λ increases if performance is improving and decreases if it is declining.
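A schematic sketch of this bandit-style λ selection is given below. It uses a plain exponential-weights update as one concrete instantiation; the paper's implementation follows the Exponentially Weighted Average Forecasting algorithm of [45] as configured in [44], so the exact update rule and constants here are illustrative assumptions only:

```python
import numpy as np

class LambdaBandit:
    """Exponential-weights selection over a discrete set of lambda values (schematic sketch)."""

    def __init__(self, arms=(0.0, 0.5, 1.0), lr=0.1):
        self.arms = np.array(arms)
        self.log_weights = np.zeros(len(arms))
        self.lr = lr
        self.last_arm = None

    def sample_lambda(self):
        # Arm probabilities are a softmax over the accumulated weights.
        logits = self.log_weights - self.log_weights.max()
        probs = np.exp(logits) / np.exp(logits).sum()
        self.last_arm = np.random.choice(len(self.arms), p=probs)
        return float(self.arms[self.last_arm])

    def update(self, feedback):
        # feedback = R_l - R_{l-1}: change in episodic return relative to the previous episode.
        self.log_weights[self.last_arm] += self.lr * feedback
```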
We use identical settings for the bandit algorithm as in [44]. We call this variant λ-SAC. We plot the results for SAC with two critics (as is standard), SAC with one critic, SAC with a single critic trained with λF-based values (x-SAC denotes SAC trained with a fixed λ = x), and λ-SAC, trained on the HalfCheetah-v2 and Hopper-v2 MuJoCo environments. All experiments were repeated over eight random seeds. HalfCheetah-v2 was found by [44] to support optimistic value estimates, in that it was possible to perform well even without pessimism to reduce overestimation. Consistent with this, we found that single-critic SAC matched the performance of standard SAC, as did 1-SAC (which amounts to training a standard value function with the auxiliary task of SF prediction). Fixing lower values of λ performed poorly, indicating that over-pessimism is harmful in this environment. However, λ-SAC eventually manages to learn to set λ = 1 and matches the final performance of the best fixed algorithms. Similarly, in [44] it was observed that strong performance in Hopper-v2 was associated with pessimistic value estimates. Consistent with this, λ-SAC learns to select lower values of λ, again matching the performance of SAC while requiring only one critic and significantly reducing the required FLOPs (Fig. I.2). We consider these results to be very preliminary, and hope to perform more experiments on other environments. We also believe λ-SAC could be improved by using the difference between the current episode's total reward and the average of the total rewards from previous episodes, $R_\ell - \frac{1}{\ell-1}\sum_{i=1}^{\ell-1} R_i$, as a more stable feedback signal for the bandit. There is also non-stationarity in the base features due to the per-batch normalization, which could likely also be improved. Hyperparameters are described in Table 2.

Hyperparameter                             Value
Collection Steps                           1000
Random Action Steps                        10000
Network Hidden Layers                      256:256
Learning Rate                              3 × 10⁻⁴
Optimizer                                  Adam
Replay Buffer Size                         1 × 10⁶
Action Limit                               [−1, 1]
Exponential Moving Avg. Parameters         5 × 10⁻³
(Critic Update : Environment Step) Ratio   1
(Policy Update : Environment Step) Ratio   1
Has Target Policy?                         No
Expected Entropy Target                    −dim(A)
Policy Log-Variance Limits                 [−20, 2]
Bandit Learning Rate*                      0.1
λ Options*                                 {0, 0.5, 1.0}
Table 2: Hyperparameters for the SAC MuJoCo experiments. *Only applicable to λ-SAC.

Figure I.1: Test reward over 1M timesteps on HalfCheetah-v2 and Hopper-v2 for SAC (2 critics), SAC (1 critic), the fixed-λ variants, and λ-SAC. λ-SAC (1 critic) matches the performance of SAC (2 critics) by adapting λ online.

Figure I.2: Final test reward vs. FLOPs on HalfCheetah-v2 and Hopper-v2. λ-SAC matches performance while saving in computational cost.

J Replenishing Rewards

We list below a few candidate reward replenishment schemes, which are visualized in Fig. J.1.

Time elapsed rewards. $r(s, t) = \lambda^{n(s,t)/m(s,t)}\, r(s)$, where $m(s, t)$ is the time elapsed since the last visit to state s: $m(s, t) \triangleq t - \max\{j \mid s_j = s,\ 0 \le j \le t-1\}$. Due to the max term in $m(s, t)$, the corresponding representation
$$\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k\lambda^{n(s',t+k)/m(s',t+k)}\,\mathbb{1}(s_{t+k} = s')\;\middle|\; s_t = s\right]$$
does not admit a closed-form recursion. However, we empirically tested a version of this type of reward with Qλ-learning in the Two Rooms environment, modified so that the exponent on λ is $n(s, t)/(0.1\, m(s, t))$.
This was done so that reward replenishes at a slow rate, reducing the deviation from the standard diminishing setting. Episode length was capped at H = 100. All other settings are identical to the Qλ experiment described in Appendix E. Results are presented in Fig. J.2, and a GIF is included in the supplementary material.

Eligibility trace rewards. Here rewards are diminished according to an eligibility trace of past visits which itself decays at rate $\lambda_r$:
$$r(s, t) = \Big(1 - (1 - \lambda_d)\sum_{j=0}^{t}\lambda_r^{\,t-j}\,\mathbb{1}(s_j = s)\Big)\, r(s),$$
where $\lambda_d, \lambda_r \in [0, 1]$ are diminishment and replenishment constants, respectively. Denoting the corresponding representation by $\Omega^\pi$, i.e.,
$$\Omega^\pi(s, s') = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k\Big(1 - (1 - \lambda_d)\sum_{j=0}^{k}\lambda_r^{\,k-j}\,\mathbb{1}(s_{t+j} = s')\Big)\mathbb{1}(s_{t+k} = s')\;\middle|\; s_t = s\right],$$
we obtain the following recursion:
$$\Omega^\pi(s, s') = \mathbb{1}(s = s')\big(\lambda_d - \gamma\lambda_r(1 - \lambda_d)\,\mathbb{E}_{s_{t+1}\sim p^\pi(\cdot|s)}M^\pi_{\gamma\lambda_r}(s_{t+1}, s')\big) + \gamma\,\mathbb{E}_{s_{t+1}\sim p^\pi(\cdot|s)}\Omega^\pi(s_{t+1}, s'),$$
where $M^\pi_{\gamma\lambda_r}$ denotes the successor representation of π with discount factor $\gamma\lambda_r$. This representation could be learned by alternating TD learning between $\Omega^\pi$ and $M^\pi_{\gamma\lambda_r}$. We leave this to future work.

Total time rewards. $r(s, t) = \lambda_d^{\,n(s,t)}\lambda_r^{\,n(s,t)-t}\, r(s)$, where $\lambda_d, \lambda_r \in [0, 1]$ are diminishment and replenishment constants, respectively. The corresponding representation is
$$P^\pi(s, s') = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k\lambda_d^{\,n_t(s',k)}\lambda_r^{\,n_t(s',k)-k}\,\mathbb{1}(s_{t+k} = s')\;\middle|\; s_t = s\right],$$
which satisfies the following recursion:
$$P^\pi(s, s') = \mathbb{1}(s = s') + \gamma\big(\lambda_d\mathbb{1}(s = s') + 1 - \mathbb{1}(s = s')\big)\big(\lambda_r^{-1}(1 - \mathbb{1}(s = s')) + \mathbb{1}(s = s')\big)\,\mathbb{E}_{s_{t+1}\sim p^\pi(\cdot|s)}P^\pi(s_{t+1}, s').$$
While neither the reward nor the representation is guaranteed to be finite, $P^\pi$ could be learned via TD updates capped at a suitable upper bound.

Figure J.1: Visualizing three different replenishment schemes. For all schemes, r(s) = 1 and visits to s occur at t = 2, 5. (Left) The time elapsed reward with λ = 0.5; (Middle) the eligibility trace reward with λr = λd = 0.5; (Right) the total time reward with λd = 0.5, λr = 0.9.

Figure J.2: Performance of Qλ-learning on Two Rooms with (time elapsed) replenishing rewards for λ ∈ {0.5, 1.0}. Return is averaged over five runs, with shading indicating one unit of standard error.

We now briefly discuss the interaction between the temporal discount factor γ commonly used in RL and the diminishing utility rate λ. The key distinction between the two is that all rewards decay in value at every time step with respect to γ, regardless of whether a state is visited or not. With λ, however, decay is specific to each state (or (s, a) pair) and occurs only when the agent visits that state. Thus, γ decays reward in a global manner which is independent of the agent's behavior, while λ decays reward in a local manner which is dependent on the agent's behavior. In combination, they have the beneficial effect of accelerating convergence in dynamic programming (Fig. F.3). This suggests the potential for using higher discount factors in practice: paired with a decay factor λ, similar (or faster) convergence rates can be obtained even as agents act over a longer effective temporal horizon.

L Compute Resources

The λF-based experiments shown were run on a single NVIDIA GeForce GTX 1080 GPU. The recurrent A2C experiments took roughly 30 minutes, base feature learning for policy composition took approximately 45 minutes, λF learning for policy composition took approximately 10 hours, and the SAC experiments took approximately 8 hours per run. The λF training required roughly 30GB of memory due to the size of the dataset. All experiments in Section 6 and Appendix H were run on a single RTX5000 GPU, and each training and evaluation run took about 30 minutes.
All other experiments were run on a 2020 MacBook Air laptop with a 1.1 GHz Quad-Core Intel Core i5 CPU and took less than one hour to train.

M Broader Impact Statement

We consider this work to be primarily of a theoretical nature, pertaining to sequential decision-making chiefly in the context of natural intelligence. While it may have applications for more efficient training of artificial RL agents, it is hard to predict its long-term societal impacts.