# On the Expressivity of Markov Reward

David Abel (DeepMind, dmabel@deepmind.com), Will Dabney (DeepMind, wdabney@deepmind.com), Anna Harutyunyan (DeepMind, harutyunyan@deepmind.com), Mark K. Ho (Department of Computer Science, Princeton University, mho@princeton.edu), Michael L. Littman (Department of Computer Science, Brown University, mlittman@cs.brown.edu), Doina Precup (DeepMind, doinap@deepmind.com), Satinder Singh (DeepMind, baveja@deepmind.com)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract. Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of task that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.

## 1 Introduction

How are we to use algorithms for reinforcement learning (RL) to solve problems of relevance in the world? Reward plays a significant role as a general-purpose signal: For any desired behavior, task, or other characteristic of agency, there must exist a reward signal that can incentivize an agent to learn to realize these desires. Indeed, the expressivity of reward is taken as a backdrop assumption that frames RL, sometimes called the reward hypothesis: "...all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)" [53, 29, 6]. In this paper, we establish first steps toward a systematic study of the reward hypothesis by examining the expressivity of reward as a signal. We proceed in three steps.

Figure 1: Alice, Bob, and the artifacts of task definition (blue) and task expression (purple).

1. An Account of Task. As rewards encode tasks, goals, or desires, we first ask, what is a task? We frame our study around a thought experiment (Figure 1) involving the interactions between a designer, Alice, and a learning agent, Bob, drawing inspiration from Ackley and Littman [2], Sorg [50], and Singh et al. [46]. In this thought experiment, we draw a distinction between how Alice thinks of a task (TASKQ) and the means by which Alice incentivizes Bob to pursue this task (EXPRESSIONQ). This distinction allows us to analyze the expressivity of reward as an answer to the latter question, conditioned on how we answer the former. Concretely, we study three answers to the TASKQ in the context of finite Markov Decision Processes (MDPs): A task is either (1) a set of acceptable behaviors (policies), (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Further detail and motivation for these task types is provided in Section 3, but broadly they can be viewed as generalizations of typical notions of task such as a choice of goal or optimal behavior.
Given these three answers to the TASKQ, we then examine the expressivity of reward.

2. Expressivity of Markov Reward. The core of our study asks whether there are tasks Alice would like to convey (as captured by the answers to the TASKQ) that admit no characterization in terms of a Markov reward function. Our emphasis on Markov reward functions, as opposed to arbitrary history-based reward functions, is motivated by several factors. First, disciplines such as computer science, psychology, biology, and economics typically rely on a notion of reward as a numerical proxy for the immediate worth of states of affairs (such as the financial cost of buying a solar panel or the fitness benefits of a phenotype). Given an appropriate way to describe states of affairs, Markov reward functions can represent immediate worth in an intuitive manner that also allows for reasoning about combinations, sequences, or re-occurrences of such states of affairs. Second, it is not clear that general history-based rewards are a reasonable target for learning, as they suffer from the curse of dimensionality in the length of the history. Lastly, Markov reward functions are the standard in RL. A rigorous analysis of which tasks they can and cannot convey may provide guidance into when it is necessary to draw on alternative formulations of a problem. Given our focus on Markov rewards, we treat a reward function as accurately expressing a task just when the value function it induces in an environment adheres to the constraints of a given task.

3. Main Results. We find that, for all three task types, there are environment-task pairs for which there is no Markov reward function that realizes the task (Theorem 4.1). In light of this finding, we design polynomial-time algorithms that can determine, for any given task and environment, whether a reward function exists in the environment that captures the task (Theorem 4.3). When such a reward function does exist, the algorithms also return it. Finally, we conduct simple experiments with these procedures to provide empirical insight into the expressivity of reward (Section 5). Collectively, our results demonstrate that there are tasks that cannot be expressed by Markov reward in a rigorous sense, but we can efficiently construct such reward functions when they do exist (and determine when they do not). We take these findings to shed light on the nature of reward maximization as a principle, and highlight many pathways for further investigation.

## 2 Background

RL defines the problem facing an agent that learns to improve its behavior over time by interacting with its environment. We make the typical assumption that the RL problem is well modeled by an agent interacting with a finite Markov Decision Process (MDP), defined by the tuple (S, A, R, T, γ, s_0). An MDP gives rise to deterministic behavioral policies, π : S → A, and the value, V^π : S → ℝ, and action-value, Q^π : S × A → ℝ, functions that measure their quality. We will refer to a Controlled Markov Process (CMP) as an MDP without a reward function, which we denote E for "environment". We assume that all reward functions are deterministic, and may be a function of either states, state-action pairs, or state-action-state triples, but not history. Henceforth, we simply use "reward function" to refer to a deterministic Markov reward function for brevity, but note that more sophisticated settings beyond MDPs and deterministic Markov reward functions are important directions for future work.
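To make these objects concrete, the following is a minimal sketch (our own illustration in Python/NumPy; the paper does not prescribe any implementation, and the state-action reward representation and the function name `start_state_value` are our assumptions) of how the start-state value V^π(s_0) of a deterministic policy can be computed in a finite MDP by solving the Bellman linear system.

```python
import numpy as np

def start_state_value(T, R, policy, gamma, s0):
    """Start-state value V^pi(s0) of a deterministic policy in a finite MDP.

    T:      transition tensor, shape (S, A, S), T[s, a, s'] = Pr(s' | s, a)
    R:      state-action reward matrix, shape (S, A)
    policy: integer array of length S, policy[s] = action chosen in state s
    gamma:  discount factor in [0, 1)
    s0:     start-state index
    """
    S = T.shape[0]
    P_pi = T[np.arange(S), policy]   # (S, S) transition matrix under pi
    r_pi = R[np.arange(S), policy]   # (S,)   immediate reward under pi
    # Solve the Bellman system (I - gamma * P_pi) v = r_pi for the value vector.
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return v[s0]
```

This start-state quantity is the object that the realization conditions introduced in Section 3 are phrased in terms of.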
For more on MDPs or RL, see the books by Puterman [41] and Sutton and Barto [54], respectively.

### 2.1 Other Perspectives on Reward

We here briefly summarize relevant literature that provides distinct perspectives on reward.

Two Roles of Reward. As Sorg [50] identifies (Chapter 2), reward can both define the task the agent learns to solve, and define the bread crumbs that allow agents to efficiently learn to solve the task. This distinction has been raised elsewhere [2, 46, 47], and is similar to the extrinsic-intrinsic reward divide [45, 66]. Tools such as reward design [34, 51] or reward shaping [36] focus on offering more efficient learning in a variety of environments, so as to avoid issues of sparsity and long-term credit assignment. We concentrate primarily on reward's capacity to express a task, and defer learning dynamics to an (important) stage of future work.

Discounts, Expectations, and Rationality. Another important facet of reward is how it is used in producing behavior. The classical view offered by the Bellman equation (and the reward hypothesis) is that the quantity of interest to maximize is expected, discounted, cumulative reward. Yet it is possible to disentangle reward from the expectation [5], to attend only to ordinal [60] or maximal rewards [26], or to adopt different forms of discounting [61, 11]. In this work, we take the standard view that agents will seek to maximize value for a particular discount factor γ, but recognize that there are interesting directions beyond these commitments, such as inspecting the limits of reward in constrained MDPs as studied by Szepesvári [56]. We also note the particular importance of work by Pitis [40], who examines the relationship between classical decision theory [59] and MDPs by incorporating additional axioms that account for stochastic processes with discounting [24, 35, 48, 49]. Drawing inspiration from Pitis [40] and Sunehag and Hutter [52], we foresee valuable pathways for future work that further makes contact between RL and various axioms of rationality.

Preferences. In place of numerical rewards, preferences of different kinds may be used to evaluate an agent's behaviors, drawing from the literature on preference learning [25] and ordinal dynamic programming [8, 35, 48]. This premise gives rise to preference-based reinforcement learning (PbRL), in which an agent interacts with a CMP and receives evaluative signals in the form of preferences over states, actions, or trajectories. This kind of feedback inspires and closely parallels the task types we propose in this work. A comprehensive survey of PbRL by Wirth et al. [64] identifies critical differences in this setup from traditional RL, categorizes recent algorithmic approaches, and highlights important open questions. Recent work focuses on analysing the sample efficiency of such methods [65, 38], with close connections to learning from human feedback in real time [23, 32, 7].

Teaching and Inverse RL. The inverse RL (IRL) and apprenticeship learning literature examines the problem of learning directly from behavior [37, 1]. The classical problem of IRL is to identify which reward function (often up to an equivalence class) a given demonstrator is optimizing. We emphasize the relevance of two approaches: First, work by Syed et al. [55], who first illustrate the applicability of linear programming [22] to apprenticeship learning; and second, work by Amin et al. [4], who examine the repeated form of IRL.
The methods of IRL have recently been expanded to include variations such as cooperative IRL [14] and assistive learning [43], which offer different perspectives on how to frame interactive learning problems.

Reward Misspecification. Reward is also notoriously hard to specify. As pointed out by Littman et al. [30], putting a meaningful dollar figure on scuffing a wall or dropping a clean fork is challenging. Along these lines, Hadfield-Menell et al. [16] identify cases in which well-intentioned designers create reward functions that produce unintended behavior [39]. MacGlashan et al. [33] find that human-provided rewards tend to depend on a learning agent's entire policy, rather than just the current state. Further, work by Hadfield-Menell et al. [15] and Kumar et al. [27] suggests that there are problems with reward as a learning mechanism due to misspecification and reward tampering [10]. These problems have given rise to approaches to reward learning, in which a reward function is inferred from some evidence such as behavior or comparisons thereof [20].

Other Notions of Task. As a final note, we highlight alternative approaches to task specification. Building on the Free Energy Principle [13, 12], Hafner et al. [17] consider a variety of task types in terms of minimization of distance to a desired target distribution [3]. Alternatively, Littman et al. [30] and Li et al. [28] propose variations of linear temporal logic (LTL) as a mechanism for specifying a task to RL agents, with related literature extending LTL to the multi-task [58] and multi-agent [18] settings, or using reward machines for capturing task structure [19]. Jothimurugan et al. [21] take a similar approach and propose a task specification language for RL based on logical formulas that evaluate whether trajectories satisfy the task, similar in spirit to the logical task compositions framework developed by Tasse et al. [57]. Many of these notions of task are more general than those we consider. A natural direction for future work broadens our analysis to include these kinds of task.

## 3 An Account of Reward's Expressivity: The TASKQ and EXPRESSIONQ

Consider an onlooker, Alice, and an earnest learning agent, Bob, engaged in the interaction pictured in Figure 1. Suppose that Alice has a particular task in mind that she would like Bob to learn to solve, and that Alice constructs a reward function to incentivize Bob to pursue this task. Here, Alice is playing the role of "all of what we mean by goals and purposes" for Bob to pursue, with Bob playing the role of the standard reward-maximizing RL agent.

Two Questions About Task. To give us leverage to study the expressivity of reward, it is useful to draw a distinction between two stages of this process: 1) Alice thinks of a task that she would like Bob to learn to solve, and 2) Alice creates a reward function (and perhaps chooses γ) that conveys the chosen task to Bob. We inspect these two separately, framed by the following two questions. The first we call the task-definition question (TASKQ), which asks: What is a task? The second we call the task-expression question (EXPRESSIONQ), which asks: Which learning signal can be used as a mechanism for expressing any task to Bob?

Reward Answers the EXPRESSIONQ. We suggest that it may be useful to treat reward as an answer to the EXPRESSIONQ rather than the TASKQ.
On this view, reward is treated as an expressive language for incentivizing reward-maximizing agents: Alice may attempt to translate any task into a reward function that incentivizes Bob to pursue the task, no matter which environment Bob inhabits, which task Alice has chosen, or how she has represented the task to herself. Indeed, it might be the case that Alice's knowledge of the task far exceeds Bob's representational or perceptual capacity. Alice may know every detail of the environment and define the task based on this holistic vantage, while Bob must learn to solve the task through interaction alone, relying only on a restricted class of functions for modeling and decision making. Under this view, we can assess the expressivity of reward as an answer to the EXPRESSIONQ conditioned on how we answer the TASKQ. For example, if the TASKQ is answered in terms of natural language descriptions of desired states of affairs, then reward may fail to convey the chosen task due to the apparent mismatch in abstraction between natural language and reward (though some work has studied such a proposal [31, 62]).

### 3.1 Answers to the TASKQ: What is a Task?

In RL, tasks are often associated with a choice of goal, reward function (R), reward-discount pair (R, γ), or perhaps a choice of optimal policy (alongside those task types surveyed previously, such as LTL). However, it is unclear whether these constructs capture the entirety of what we mean by "task". For example, consider the Russell and Norvig [42] grid world: a 4×3 grid with one wall, one terminal fire state, and one terminal goal state (pictured with a particular reward function in Figure 4a). In such an environment, how might we think about tasks? A standard view is that the task is to reach the goal as quickly as possible. This account, however, fails to distinguish between the non-optimal behaviors, such as the costly behavior of the agent moving directly into the fire and the neutral behavior of the agent spending its existence in the start state. Indeed, characterizing a task in terms of a choice of π* or goal fails to capture these distinctions. Our view is that a suitably rich account of task should allow for the characterization of this sort of preference, offering the flexibility to scale from specifying only the desirable behavior (or outcomes) to an arbitrary ordering over behaviors (or outcomes). In light of these considerations, we propose three answers to the TASKQ that can convey general preferences over behavior or outcome: 1) a set of acceptable policies, 2) a partial ordering over policies, or 3) a partial ordering over trajectories. We adopt these three as they can capture many kinds of task while also allowing a great deal of flexibility in the level of detail of the specification.

| Name | Notation | Generalizes | Constraints Induced by T |
| --- | --- | --- | --- |
| SOAP | Π_G | task-as-π* | equal: V^{πg}(s_0) = V^{πg'}(s_0) > V^{πb}(s_0), ∀ πg, πg' ∈ Π_G, πb ∈ Π_B; range: V^{πg}(s_0) > V^{πb}(s_0), ∀ πg ∈ Π_G, πb ∈ Π_B |
| PO | L_Π | SOAP | (π1 ▷ π2) ∈ L_Π ⟹ V^{π1}(s_0) ▷ V^{π2}(s_0) |
| TO | L_{τ,N} | task-as-goal | (τ1 ▷ τ2) ∈ L_{τ,N} ⟹ G(τ1; s_0) ▷ G(τ2; s_0) |

Table 1: A summary of the three proposed task types. We further list the constraints that determine whether a reward function realizes each task type in an MDP, where we take ▷ to be one of <, >, or =, and G is the discounted return of the trajectory.

### 3.2 SOAPs, POs, and TOs

(SOAP) Set Of Acceptable Policies. A classical view of the equivalence of two reward functions is based on the optimal policies they induce. For instance, Ng et al.
[36] develop potential-based reward shaping by inspecting which shaped reward signals will ensure that the optimal policy is unchanged. Extrapolating, it is natural to say that for any environment E, two reward functions are equivalent if the optimal policies they induce in E are the same. In this way, a task is viewed as a choice of optimal policy. As discussed in the grid world example above, this notion of task fails to allow for the specification of the quality of other behaviors. For this reason, we generalize task-as-optimal-policy to a set of acceptable policies, defined as follows.

Definition 3.1. A set of acceptable policies (SOAP) is a non-empty subset of the deterministic policies, Π_G ⊆ Π, with Π the set of all deterministic mappings from S to A for a given E.

With one task type defined, it is important to address what it means for a reward function to properly realize, express, or capture a task in a given environment. We offer the following account.

Definition 3.2. A reward function is said to realize a task T in an environment E just when the start-state value (or return) induced by the reward function exactly adheres to the constraints of T.

Precise conditions for the realization of each task type are provided alongside each task definition, with a summary presented in column four of Table 1. For SOAPs, we take the start-state value V^π(s_0) to be the mechanism by which a reward function realizes a SOAP. That is, for a given E and Π_G, a reward function R is said to realize Π_G in E when the start-state value function is optimal for all good policies, and strictly higher than the start-state value of all other policies. It is clear that SOAP strictly generalizes a task in terms of a choice of optimal policy, as captured by the SOAP Π_G = {π*}. We note that there are two natural ways for a reward function to realize a SOAP: First, each πg ∈ Π_G has optimal start-state value and all other policies are sub-optimal. We call this type equal-SOAP, or just SOAP for brevity. Alternatively, we might only require that the acceptable policies are each near-optimal, but are allowed to differ in start-state value so long as they are all better than every bad policy πb ∈ Π_B. That is, in this second kind, there exists an ϵ ≥ 0 such that every πg ∈ Π_G is ϵ-optimal in start-state value, V*(s_0) − V^{πg}(s_0) ≤ ϵ, while all other policies are worse. We call this second realization condition range-SOAP. We note that the range realization generalizes the equal one: Every equal-SOAP is a range-SOAP (by letting ϵ = 0). However, there exist range-SOAPs that are expressible by Markov rewards that are not realizable as an equal-SOAP. We illustrate this fact with the following proposition. All proofs are presented in Appendix B.

Proposition 3.1. There exists a CMP, E, and choice of Π_G such that Π_G can be realized under the range-SOAP criterion, but cannot be realized under the equal-SOAP criterion.

One such CMP is pictured in Figure 2b. Consider the SOAP Π_G = {π11, π12, π21}: Under the equal-SOAP criterion, if each of these three policies is made optimal, any reward function will also make π22 (the only bad policy) optimal as well. In contrast, for the range criterion, we can choose a reward function that assigns lower reward to a2 than to a1 in both states. In general, we take the equal-SOAP realization as canonical, as it is naturally subsumed by our next task type.

(PO) Partial Ordering on Policies. Next, we suppose that Alice chooses a partial ordering on the deterministic policy space.
That is, Alice might identify some great policies, some good ones, and some bad policies to strictly avoid, and remain indifferent to the rest. POs strictly generalize equal-SOAPs, as any such SOAP is a special choice of PO with only two equivalence classes. We offer the following definition of a PO.

Definition 3.3. A policy order (PO) of the deterministic policies Π is a partial order, denoted L_Π.

As with SOAPs, we take the start-state value V^π(s_0) induced by a reward function R as the mechanism by which policies are ordered. That is, given E and L_Π, we say that a reward function R realizes L_Π in E if and only if the resulting MDP, M = (E, R), produces a start-state value function that orders Π according to L_Π.

(TO) Partial Ordering on Trajectories. A natural generalization of goal specification enriches a notion of task to include the details of how a goal is satisfied; that is, for Alice to relay some preference over trajectory space [63], as is done in preference-based RL [64]. Concretely, we suppose Alice specifies a partial ordering on length-N trajectories of (s, a) pairs, defined as follows.

Definition 3.4. A trajectory ordering (TO) of length N ∈ ℕ is a partial ordering L_{τ,N}, with each trajectory τ consisting of N state-action pairs, (s_0, a_0), . . . , (s_{N−1}, a_{N−1}), with s_0 the start state.

As with PO, we say that a reward function realizes a trajectory ordering L_{τ,N} if the ordering determined by each trajectory's cumulative discounted N-step return from s_0, denoted G(τ; s_0), matches that of the given L_{τ,N}. We note that trajectory orderings can generalize goal-based tasks at the expense of a larger specification. For instance, a TO can convey the task, "Safely reach the goal in less than thirty steps, or just get to the subgoal in less than twenty steps."

Recap. We propose to assess the expressivity of reward by first answering the TASKQ in terms of SOAPs, POs, or TOs, as summarized by Table 1. We say that a task T is realized in an environment E under reward function R if the start-state value function (or return) produced by R imposes the constraints specified by T, and are interested in whether reward can always realize a given task in any choice of E. We make a number of assumptions along the way, including: (1) reward functions are Markov and deterministic, (2) policies of interest are deterministic, (3) the environment is a finite CMP, (4) γ is part of the environment, (5) we ignore reward's role in shaping the learning process, and (6) start-state value or return is the appropriate mechanism to determine if a reward function realizes a given task. Relaxation of these assumptions is a critical direction for future work.

## 4 Analysis: The Expressivity of Markov Reward

With our definitions and objectives in place, we now present our main results.

### 4.1 Express SOAPs, POs, and TOs

We first ask whether reward can always realize a given SOAP, PO, or TO, for an arbitrary E. Our first result states that the answer is no: there are tasks that cannot be realized by any reward function.

Theorem 4.1. For each of SOAP, PO, and TO, there exist (E, T) pairs for which no Markov reward function realizes T in E.

Thus, reward is incapable of capturing certain tasks. What tasks are they, precisely? Intuitively, inexpressible tasks involve policies or trajectories that must be correlated in value in an MDP. That is, if two policies are nearly identical in behavior, it is unlikely that reward can capture the PO that places them at opposite ends of the ordering.
A simple example is the "always move the same direction" task in a grid world, with state defined as an (x, y) pair. The SOAP Π_G = {π↑, π↓, π←, π→} (one constant-direction policy per direction) conveys this task, but no Markov reward function can make these policies strictly higher in value than all others.

Example: Inexpressible SOAPs. Observe the two CMPs pictured in Figure 2, depicting two kinds of inexpressible SOAPs. On the left, we consider the SOAP Π_G = {π21}, containing only the policy that executes a2 in the left state (s0) and a1 in the right (s1). This SOAP is inexpressible through reward, but only because reward cannot distinguish the start-state value of π21 and π22, since the policies differ only in an unreachable state. This is reminiscent of Axiom 5 from Pitis [40], which explicitly excludes preferences of this sort. On the right, we find a more interesting case: The chosen SOAP is similar to the XOR function, Π_G = {π12, π21}. Here, the task requires that the agent choose each action in exactly one state. However, there cannot exist a reward function that makes only these policies optimal, as by consequence, both policies π11 and π22 must be optimal as well.

Figure 2: Two CMPs in which there is a SOAP that is not expressible under any Markov reward function. (a) Steady State Case: Π_G = {π21} is not realizable, as π21 cannot be made better than π22 because s1 is never reached. (b) Entailment Case: the XOR-like SOAP, Π_G = {π12, π21}, is not realizable; to make these two policies optimal, it is entailed that π22 and π11 must be optimal, too.

Next, we show that Theorem 4.1 is not limited to a particular choice of transition function or γ.

Proposition 4.2. There exist choices of E_{−T} = (S, A, γ, s_0) or E_{−γ} = (S, A, T, s_0), together with a task T, such that there is no (T, R) pair that realizes T in E_{−T}, and no (R, γ) pair that realizes T in E_{−γ}.

This result suggests that the scope of Theorem 4.1 is actually quite broad: even if the transition function or γ are taken as part of the reward specification, there are tasks that cannot be expressed. We suspect there are ways to give a precise characterization of all inexpressible tasks from an axiomatic perspective, which we hope to study in future work.

### 4.2 Constructive Algorithms: Task to Reward

We now analyze how to determine whether an appropriate reward function can be constructed for any (E, T) pair. We pose a general form of the reward-design problem [34, 51, 9] as follows.

Definition 4.1. The REWARDDESIGN problem is: Given E = (S, A, T, γ, s_0) and a task T, output a reward function R_Alice that ensures T is realized in M = (E, R_Alice).

Indeed, for all three task types, there is an efficient algorithm for solving the reward-design problem.

Theorem 4.3. The REWARDDESIGN problem can be solved in polynomial time, for any finite E and any T, so long as reward functions with infinitely many outputs are considered.

Therefore, for any choice of finite CMP, E, and a SOAP, PO, or TO, we can find a reward function that perfectly realizes the task in the given environment, if such a reward function exists. Each of the three algorithms is based on forming a linear program that matches the constraints of the given task type, which is why reward functions with infinitely many outputs are required. Pseudo-code for SOAP-based reward design is presented in Algorithm 1.
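As a concrete companion to Algorithm 1, the sketch below shows how its two key subroutines (computing discounted expected state-visitation distributions and solving the resulting linear program) might be implemented. This is our own illustration, not the authors' code: it assumes Python with NumPy/SciPy, restricts attention to state-based rewards bounded in [−1, 1], and uses helper names (`visitation`, `fringe`, `soap_reward_design`) that do not appear in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def visitation(T, policy, gamma, s0):
    """Discounted expected state-visitation distribution from s0 under a
    deterministic policy: rho = (I - gamma * P_pi^T)^{-1} e_{s0}."""
    S = T.shape[0]
    P_pi = T[np.arange(S), policy]          # (S, S) transition matrix under pi
    e = np.zeros(S)
    e[s0] = 1.0
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, e)

def fringe(good_policies, num_actions):
    """Policies that differ from some acceptable policy in exactly one state."""
    good = {tuple(p) for p in good_policies}
    out = set()
    for p in good:
        for s in range(len(p)):
            for a in range(num_actions):
                q = list(p)
                q[s] = a
                if tuple(q) not in good:
                    out.add(tuple(q))
    return [np.array(q) for q in sorted(out)]

def soap_reward_design(T, gamma, s0, good_policies, tol=1e-8):
    """Equal-SOAP reward design as a linear program over state-based rewards
    R(s) in [-1, 1]; returns a reward vector realizing the SOAP, or None.
    Assumes the SOAP is a strict subset of the deterministic policies."""
    S, A, _ = T.shape
    rho_g = [visitation(T, p, gamma, s0) for p in good_policies]
    rho_f = [visitation(T, p, gamma, s0) for p in fringe(good_policies, A)]
    # Decision variables: [R(s_1), ..., R(s_S), eps]; objective: maximize eps.
    c = np.zeros(S + 1)
    c[-1] = -1.0
    # Equality constraints: every good policy matches the start-state value of the first.
    A_eq = [np.append(rho_g[0] - rho, 0.0) for rho in rho_g[1:]]
    # Inequality constraints: every fringe policy is worse by at least eps.
    A_ub = [np.append(rho - rho_g[0], 1.0) for rho in rho_f]
    bounds = [(-1.0, 1.0)] * S + [(0.0, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.zeros(len(A_ub)),
                  A_eq=np.array(A_eq) if A_eq else None,
                  b_eq=np.zeros(len(A_eq)) if A_eq else None,
                  bounds=bounds, method="highs")
    if res.success and res.x[-1] > tol:
        return res.x[:S]   # a realizing state-based reward function
    return None
```

If this sketch is faithful to Algorithm 1, running it on the CMP of Figure 2b with Π_G = {π12, π21} should report that no realizing reward function exists, consistent with Theorem 4.1.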
Intuitively, the algorithms compute the discounted expected state-visitation distribution for a collection of policies; in the case of SOAP, for instance, these policies include Π_G and what we call the "fringe", the set of policies that differ from a πg ∈ Π_G by exactly one action. Then, we use these distributions to describe linear inequality constraints ensuring that the start-state values of the good policies are better than those of the fringe.

Algorithm 1: SOAP Reward Design

```
INPUT:  E = (S, A, T, γ, s_0), Π_G
OUTPUT: R, or ⊥

 1: Π_fringe = compute_fringe(Π_G)
 2: for π_{g,i} ∈ Π_G do                          // Compute state-visitation distributions.
 3:     ρ_{g,i} = compute_exp_visit(π_{g,i}, E)
 4: for π_{f,i} ∈ Π_fringe do
 5:     ρ_{f,i} = compute_exp_visit(π_{f,i}, E)
 6: C_eq = {}                                      // Make equality constraints.
 7: for π_{g,i} ∈ Π_G do
 8:     C_eq.add(ρ_{g,0}(s_0) · X = ρ_{g,i}(s_0) · X)
 9: C_ineq = {}                                    // Make inequality constraints.
10: for π_{f,j} ∈ Π_fringe do
11:     C_ineq.add(ρ_{f,j}(s_0) · X + ϵ ≤ ρ_{g,0}(s_0) · X)
12: R_out, ϵ_out = linear_programming(objective = max ϵ, constraints = C_ineq ∪ C_eq)   // Solve LP.
13: if ϵ_out > 0 then return R_out                 // Check if successful.
14: else return ⊥
```

As highlighted by Theorem 4.1, there are SOAPs, POs, and TOs that are not realizable. Thus, it is important to determine how the algorithms mentioned in Theorem 4.3 will handle such cases. Our next corollary illustrates that the desirable outcome is achieved: For any E and T, the algorithms will output a reward function that realizes T in E, or output ⊥ when no such function exists.

Corollary 4.4. For any task T and environment E, deciding whether T is expressible in E is solvable in polynomial time.

Together, Theorem 4.1 and Theorem 4.3 constitute our main results: There are environment-task pairs in which Markov reward cannot express the chosen task, for each of SOAPs, POs, and TOs. However, there are efficient algorithms for deciding whether a task is expressible, and for constructing the realizing reward function when it exists. We will study the use of one of these algorithms in Section 5, but first attend to other aspects of reward's expressivity.

### 4.3 Other Aspects of Reward's Expressivity

We next briefly summarize other considerations about the expressivity of reward. As noted, Theorem 4.3 requires the use of a reward function that can produce infinitely many outputs. Our next result proves this requirement is strict for efficient reward design.

Theorem 4.5. A variant of the REWARDDESIGN problem with finite reward outputs is NP-hard.

We provide further details about the precise problem studied in Appendix B. Beyond reward functions with finitely many outputs, we are also interested in extensions of our results to multiple environments. We next present a positive result indicating our algorithms can extend to the case where Alice would like to design a reward function for a single task across multiple environments.

Proposition 4.6. For any SOAP, PO, or TO, given a finite set of CMPs, {E_1, . . . , E_n}, with shared state-action space, there exists a polynomial-time algorithm that outputs one reward function that realizes the task (when possible) in all CMPs in the set.

A natural follow-up question to the above result asks whether task realization is closed under a set of CMPs. Our next result answers this question in the negative.

Theorem 4.7. Task realization is not closed under sets of CMPs with shared state-action space. That is, there exist choices of T and {E_1, . . . , E_n} such that T is realizable in each E_i independently, but there is no single reward function that realizes T in all E_i simultaneously.
Intuitively, this shows that Alice must know precisely which environment Bob will inhabit if she is to design an appropriate reward function. Otherwise, her uncertainty over E may prevent her from designing a realizing reward function. We foresee iterative extensions of our algorithms in which Alice and Bob can react to one another, drawing inspiration from repeated IRL by Amin et al. [4].

## 5 Experiments

We next conduct experiments to shed further light on the findings of our analysis. Our focus is on SOAPs, though we anticipate the insights extend to POs and TOs as well with little complication. In the first experiment, we study the fraction of SOAPs that are expressible in small CMPs as we vary aspects of the environment or task (Figure 3). In the second, we use one algorithm from Theorem 4.3 to design a reward function, and contrast learning curves under a SOAP-designed reward function compared to standard rewards. Full details about the experiments are found in Appendix C.

Figure 3: The approximate fraction of SOAPs that are expressible by reward in CMPs with a handful of states and actions, with 95% confidence intervals. In each plot, we vary a different parameter of the environment or task to illustrate how this change impacts the expressivity of reward, showing both equal (color) and range (grey) realization of SOAP. Panels: (a) Vary Num. Actions; (b) Vary Num. States; (c) Vary γ; (d) Vary SOAP Size; (e) Vary Entropy of T(·|s, a); (f) Vary the Spread of Π_G.

SOAP Expressivity. First, we estimate the fraction of SOAPs that are expressible in small environments. For each data point, we sample 200 random SOAPs and run Algorithm 1, described by Theorem 4.3, to determine whether each SOAP is realizable in the given CMP. We ask this question for both the equal (color) variant of SOAP realization and the range (grey) variant. We inspect SOAP expressivity as we vary six different characteristics of E or Π_G: the number of actions, the number of states, the discount γ, the number of good policies in each SOAP, the Shannon entropy of T at each (s, a) pair, and the spread of each SOAP. The spread approximates average edit distance among policies in Π_G, determined by randomly permuting actions of a reference policy by a coin weighted according to the value on the x-axis. We use the same set of CMPs for each experiment, up to any deviations explicitly made by the varied parameter (such as γ or entropy). Unless otherwise stated, each CMP has four states and three actions, with a fixed but randomly chosen transition function.

Results are presented in Figure 3. We find that our theory is borne out in a number of ways. First, as Theorem 4.1 suggests, we find SOAP expressivity is strictly less than one in nearly all cases. This is evidence that inexpressible tasks are not only found in manufactured corner cases, but rather that expressivity is a spectrum. We further observe, as predicted by Proposition 3.1, clear separation between the expressivity of range-SOAP (grey) vs.
equal-SOAP (color); there are many cases where we can find a reward function that makes the good policies near-optimal and better than the bad, but cannot make those good policies all exactly optimal. Additionally, several trends emerge as we vary the parameter of environment or task, though we note that such trends are likely specific to the choice of CMP and may not hold in general. Perhaps the most striking trend is in Figure 3f, which shows a decrease in expressivity as the SOAPs become more spread out. This is quite sensible: A more spread-out SOAP is likely to lead to more entailments of the kind discussed in Figure 2b.

Learning with SOAP-designed Rewards. Next, we contrast the learning performance of Q-learning under a SOAP-designed reward function (visualized in Figure 4a) with that of the regular goal-based reward in the Russell and Norvig [42] grid world. In this domain, there is a 0.35 slip probability such that, on a slip event, the agent randomly applies one of the two orthogonal action effects. The regular goal-based reward function provides +1 when the agent enters the terminal flag cell, and −1 when the agent enters the terminal fire cell. The bottom-left state is the start state, and the black cell is an impassable wall.

Figure 4: A SOAP-designed reward function (left, "Grid World SOAP Reward") and the resulting learning curves (right, "Grid World Learning") for Q-learning compared to the traditional reward function for the Russell and Norvig [42] grid world. Each series presents average performance over 50 runs of the experiment with 95% confidence intervals.

Results are presented in Figure 4. On the right, we present a particular kind of learning curve contrasting the performance of Q-learning with the SOAP reward (blue) and regular reward (green). The y-axis measures, at the end of each episode, the average (inverse) minimum edit distance between Q-learning's greedy policy and any policy in the SOAP. Thus, when the series reaches 1.0, Q-learning's greedy policy is identical to one of the two SOAP policies. We first find that Q-learning is able to quickly learn a πg ∈ Π_G under the designed reward function. We further observe that the typical reward does not induce a perfect match in policy at convergence: the green curve hovers slightly below the blue, indicating that the default reward function is incentivizing different policies to be optimal. This is entirely sensible, as the two SOAP policies are extremely cautious around the fire; they choose the orthogonal (and thus, safe) action in fire-adjacent states, relying on slip probability to progress. Lastly, as expected given the amount of knowledge contained in the SOAP, the SOAP reward function allows Q-learning to rapidly identify a good policy compared to the typical reward.

## 6 Conclusion

We have here investigated the expressivity of Markov reward, framed around three new accounts of task. Our main results show that there exist choices of task and environment in which Markov reward cannot express the chosen task, but there are efficient algorithms that decide whether a task is expressible and construct a reward function that captures the task when such a function exists. We conclude with an empirical examination of our analysis, corroborating the findings of our theory. We take these to be first steps toward understanding the full scope of the reward hypothesis.
There are many routes forward. A key direction moves beyond the task types we study here, and relaxes our core assumptions: the environment might not be a finite CMP, Alice may not know the environment precisely, reward may be a function of history, or Alice may not know how Bob represents state. Along similar lines, a critical direction incorporates how reward impacts Bob's learning dynamics rather than start-state value. Further, we note the potential relevance to the recent reward-is-enough hypothesis proposed by Silver et al. [44]; we foresee pathways to extend our analysis to examine this newer hypothesis, too. For instance, in future work, it is important to assess whether reward is capable of inducing the right kinds of attributes of cognition, not just behavior.

## Acknowledgments and Disclosure of Funding

The authors would like to thank André Barreto, Diana Borsa, Michael Bowling, Wilka Carvalho, Brian Christian, Jess Hamrick, Steven Hansen, Zac Kenton, Ramana Kumar, Katrina McKinney, Rémi Munos, Matt Overlan, Hado van Hasselt, and Ben Van Roy for helpful discussions. We would also like to thank the anonymous reviewers for their thoughtful feedback, and Brendan O'Donoghue for catching a typo in the appendix. Michael Littman was supported in part by funding from DARPA L2M, ONR MURI, NSF FMitF, and NSF RI.

## References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.
[2] David Ackley and Michael L. Littman. Interactions between learning and evolution. Artificial Life II, 1992.
[3] Sundararaman Akshay, Nathalie Bertrand, Serge Haddad, and Loic Helouet. The steady-state control problem for Markov decision processes. In Proceedings of the International Conference on Quantitative Evaluation of Systems, 2013.
[4] Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
[5] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
[6] Brian Christian. The Alignment Problem: Machine Learning and Human Values, pages 130–131. Atlantic Books, 2021.
[7] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
[8] Gerard Debreu. Representation of a preference ordering by a numerical function. Decision Processes, 3:159–165, 1954.
[9] Daniel Dewey. Reinforcement learning and the reward engineering principle. In Proceedings of the AAAI Spring Symposium Series, 2014.
[10] Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. Reinforcement learning with a corrupted reward channel. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.
[11] William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865, 2019.
[12] Karl J. Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[13] Karl J. Friston, Jean Daunizeau, and Stefan J. Kiebel. Reinforcement learning or active inference? PLoS ONE, 4(7):e6421, 2009.
[14] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning.
In Advances in Neural Information Processing Systems, 2016.
[15] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.
[16] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, 2017.
[17] Danijar Hafner, Pedro A. Ortega, Jimmy Ba, Thomas Parr, Karl J. Friston, and Nicolas Heess. Action and perception as divergence minimization. arXiv preprint arXiv:2009.01791, 2020.
[18] Lewis Hammond, Alessandro Abate, Julian Gutierrez, and Michael Wooldridge. Multi-agent reinforcement learning with temporal logic specifications. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2021.
[19] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.
[20] Hong Jun Jeon, Smitha Milli, and Anca Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. In Advances in Neural Information Processing Systems, 2020.
[21] Kishor Jothimurugan, Rajeev Alur, and Osbert Bastani. A composable specification language for reinforcement learning tasks. In Advances in Neural Information Processing Systems, 2020.
[22] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the Annual ACM Symposium on Theory of Computing, 1984.
[23] W. Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the International Conference on Knowledge Capture, 2009.
[24] Tjalling C. Koopmans. Stationary ordinal utility and impatience. Econometrica: Journal of the Econometric Society, pages 287–309, 1960.
[25] David Kreps. Notes on the Theory of Choice. Westview Press, 1988.
[26] Sai Krishna Gottipati, Yashaswi Pathak, Rohan Nuttall, Raviteja Chunduru, Ahmed Touati, Sriram Ganapathi Subramanian, Matthew E. Taylor, and Sarath Chandar. Maximum reward formulation in reinforcement learning. In NeurIPS Workshop on Deep Reinforcement Learning, 2020.
[27] Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, and Shane Legg. REALab: An embedded perspective on tampering. arXiv preprint arXiv:2011.08820, 2020.
[28] Xiao Li, Cristian-Ioan Vasile, and Calin Belta. Reinforcement learning with temporal logic rewards. In Proceedings of the International Conference on Intelligent Robots and Systems, 2017.
[29] Michael L. Littman. The reward hypothesis, 2017. URL https://www.coursera.org/lecture/fundamentals-of-reinforcement-learning/michael-littman-the-reward-hypothesis-q6x0e.
[30] Michael L. Littman, Ufuk Topcu, Jie Fu, Charles Isbell, Min Wen, and James MacGlashan. Environment-independent task specifications via GLTL. arXiv preprint arXiv:1704.04341, 2017.
[31] James MacGlashan, Monica Babes-Vroman, Marie desJardins, Michael L. Littman, Smaranda Muresan, Shawn Squire, Stefanie Tellex, Dilip Arumugam, and Lei Yang. Grounding English commands to reward functions. In Proceedings of Robotics: Science and Systems, 2015.
[32] James MacGlashan, Michael L. Littman, David L. Roberts, Robert Loftin, Bei Peng, and Matthew E. Taylor. Convergent actor-critic by humans.
In Proceedings of the International Conference on Intelligent Robots and Systems, 2016.
[33] James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, and Michael L. Littman. Interactive learning from policy-dependent human feedback. In Proceedings of the International Conference on Machine Learning. PMLR, 2017.
[34] Maja J. Mataric. Reward functions for accelerated learning. In Proceedings of the International Conference on Machine Learning, 1994.
[35] L. G. Mitten. Preference order dynamic programming. Management Science, 21(1):43–46, 1974.
[36] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning, 1999.
[37] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2000.
[38] Ellen Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel Burdick. Dueling posterior sampling for preference-based reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2020.
[39] Pedro A. Ortega, Vishal Maini, and the DeepMind Safety Team. Building safe artificial intelligence: specification, robustness, and assurance, 2018. URL https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1.
[40] Silviu Pitis. Rethinking the discount factor in reinforcement learning: A decision theoretic approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[41] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[42] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1994. ISBN 0-13-103805-2.
[43] Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D. Dennis, Pieter Abbeel, Anca Dragan, and Stuart Russell. Benefits of assistance over reward learning, 2021. URL https://openreview.net/forum?id=DFIoGDZejIB.
[44] David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough. Artificial Intelligence, page 103535, 2021.
[45] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. Technical report, University of Massachusetts at Amherst, Department of Computer Science, 2005.
[46] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, 2009.
[47] Satinder Singh, Richard L. Lewis, Jonathan Sorg, Andrew G. Barto, and Akram Helou. On separating agent designer goals from agent goals: Breaking the preferences-parameters confound, 2010.
[48] Matthew J. Sobel. Ordinal dynamic programming. Management Science, 21(9):967–975, 1975.
[49] Matthew J. Sobel. Discounting axioms imply risk neutrality. Annals of Operations Research, 208(1):417–432, 2013.
[50] Jonathan Sorg. The Optimal Reward Problem: Designing Effective Reward for Bounded Agents. PhD thesis, University of Michigan, 2011.
[51] Jonathan Sorg, Richard L. Lewis, and Satinder Singh. Reward design via online gradient ascent. Advances in Neural Information Processing Systems, 2010.
[52] Peter Sunehag and Marcus Hutter. Axioms for rational reinforcement learning. In Proceedings of the International Conference on Algorithmic Learning Theory, 2011.
[53] Richard S. Sutton. The reward hypothesis, 2004. URL http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html.
[54] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[55] Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the International Conference on Machine Learning, 2008.
[56] Csaba Szepesvári. Constrained MDPs and the reward hypothesis, 2020. URL http://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html.
[57] Geraud Nangue Tasse, Steven James, and Benjamin Rosman. A Boolean task algebra for reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
[58] Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2018.
[59] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1953.
[60] Paul Weng. Markov decision processes with ordinal rewards: Reference point-based preferences. In Proceedings of the International Conference on Automated Planning and Scheduling, 2011.
[61] Martha White. Unifying task specification in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
[62] Edward C. Williams, Nakul Gopalan, Mine Rhee, and Stefanie Tellex. Learning to parse natural language to grounded reward functions with weak supervision. In Proceedings of the International Conference on Robotics and Automation, 2018.
[63] Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems, 2012.
[64] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
[65] Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, and Artur Dubrawski. Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33, 2020.
[66] Zeyu Zheng, Junhyuk Oh, Matteo Hessel, Zhongwen Xu, Manuel Kroiss, Hado van Hasselt, David Silver, and Satinder Singh. What can learned intrinsic rewards capture? In Proceedings of the International Conference on Machine Learning, 2020.