# On the Expressivity of Markov Reward

David Abel (DeepMind, dmabel@deepmind.com), Will Dabney (DeepMind, wdabney@deepmind.com), Anna Harutyunyan (DeepMind, harutyunyan@deepmind.com), Mark K. Ho (Department of Computer Science, Princeton University, mho@princeton.edu), Michael L. Littman (Department of Computer Science, Brown University, mlittman@cs.brown.edu), Doina Precup (DeepMind, doinap@deepmind.com), Satinder Singh (DeepMind, baveja@deepmind.com)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract. Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of task that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.

## 1 Introduction

How are we to use algorithms for reinforcement learning (RL) to solve problems of relevance in the world? Reward plays a significant role as a general-purpose signal: For any desired behavior, task, or other characteristic of agency, there must exist a reward signal that can incentivize an agent to learn to realize these desires. Indeed, the expressivity of reward is taken as a backdrop assumption that frames RL, sometimes called the reward hypothesis: "...all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)" [53, 29, 6]. In this paper, we establish first steps toward a systematic study of the reward hypothesis by examining the expressivity of reward as a signal. We proceed in three steps.

Figure 1: Alice, Bob, and the artifacts of task definition (blue) and task expression (purple).

1. An Account of Task. As rewards encode tasks, goals, or desires, we first ask, what is a task? We frame our study around a thought experiment (Figure 1) involving the interactions between a designer, Alice, and a learning agent, Bob, drawing inspiration from Ackley and Littman [2], Sorg [50], and Singh et al. [46]. In this thought experiment, we draw a distinction between how Alice thinks of a task (TASKQ) and the means by which Alice incentivizes Bob to pursue this task (EXPRESSIONQ). This distinction allows us to analyze the expressivity of reward as an answer to the latter question, conditioned on how we answer the former. Concretely, we study three answers to the TASKQ in the context of finite Markov Decision Processes (MDPs): A task is either (1) a set of acceptable behaviors (policies), (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Further detail and motivation for these task types is provided in Section 3, but broadly they can be viewed as generalizations of typical notions of task such as a choice of goal or optimal behavior.
Given these three answers to the TASKQ, we then examine the expressivity of reward.

2. Expressivity of Markov Reward. The core of our study asks whether there are tasks Alice would like to convey (as captured by the answers to the TASKQ) that admit no characterization in terms of a Markov reward function. Our emphasis on Markov reward functions, as opposed to arbitrary history-based reward functions, is motivated by several factors. First, disciplines such as computer science, psychology, biology, and economics typically rely on a notion of reward as a numerical proxy for the immediate worth of states of affairs (such as the financial cost of buying a solar panel or the fitness benefits of a phenotype). Given an appropriate way to describe states of affairs, Markov reward functions can represent immediate worth in an intuitive manner that also allows for reasoning about combinations, sequences, or re-occurrences of such states of affairs. Second, it is not clear that general history-based rewards are a reasonable target for learning, as they suffer from the curse of dimensionality in the length of the history. Lastly, Markov reward functions are the standard in RL. A rigorous analysis of which tasks they can and cannot convey may provide guidance into when it is necessary to draw on alternative formulations of a problem. Given our focus on Markov rewards, we treat a reward function as accurately expressing a task just when the value function it induces in an environment adheres to the constraints of a given task.

3. Main Results. We find that, for all three task types, there are environment-task pairs for which there is no Markov reward function that realizes the task (Theorem 4.1). In light of this finding, we design polynomial-time algorithms that can determine, for any given task and environment, whether a reward function exists in the environment that captures the task (Theorem 4.3). When such a reward function does exist, the algorithms also return it. Finally, we conduct simple experiments with these procedures to provide empirical insight into the expressivity of reward (Section 5). Collectively, our results demonstrate that there are tasks that cannot be expressed by Markov reward in a rigorous sense, but we can efficiently construct such reward functions when they do exist (and determine when they do not). We take these findings to shed light on the nature of reward maximization as a principle, and highlight many pathways for further investigation.

## 2 Background

RL defines the problem facing an agent that learns to improve its behavior over time by interacting with its environment. We make the typical assumption that the RL problem is well modeled by an agent interacting with a finite Markov Decision Process (MDP), defined by the tuple (S, A, R, T, γ, s_0). An MDP gives rise to deterministic behavioral policies, π : S → A, and the value, V^π : S → ℝ, and action-value, Q^π : S × A → ℝ, functions that measure their quality. We will refer to a Controlled Markov Process (CMP) as an MDP without a reward function, which we denote E for "environment". We assume that all reward functions are deterministic, and may be a function of either states, state-action pairs, or state-action-state triples, but not history. Henceforth, we simply use "reward function" to refer to a deterministic Markov reward function for brevity, but note that more sophisticated settings beyond MDPs and deterministic Markov reward functions are important directions for future work.
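To make these objects concrete, the following is a minimal sketch (our own illustration in Python/NumPy; the paper does not prescribe any implementation, and the state-action reward representation and the function name `start_state_value` are our assumptions) of how the start-state value V^π(s_0) of a deterministic policy can be computed in a finite MDP by solving the Bellman linear system.

```python
import numpy as np

def start_state_value(T, R, policy, gamma, s0):
    """Start-state value V^pi(s0) of a deterministic policy in a finite MDP.

    T:      transition tensor, shape (S, A, S), T[s, a, s'] = Pr(s' | s, a)
    R:      state-action reward matrix, shape (S, A)
    policy: integer array of length S, policy[s] = action chosen in state s
    gamma:  discount factor in [0, 1)
    s0:     start-state index
    """
    S = T.shape[0]
    P_pi = T[np.arange(S), policy]   # (S, S) transition matrix under pi
    r_pi = R[np.arange(S), policy]   # (S,)   immediate reward under pi
    # Solve the Bellman system (I - gamma * P_pi) v = r_pi for the value vector.
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return v[s0]
```

This start-state quantity is the object that the realization conditions introduced in Section 3 are phrased in terms of.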
For more on MDPs or RL, see the books by Puterman [41] and Sutton and Barto [54], respectively.

### 2.1 Other Perspectives on Reward

We here briefly summarize relevant literature that provides distinct perspectives on reward.

Two Roles of Reward. As Sorg [50] identifies (Chapter 2), reward can both define the task the agent learns to solve, and define the bread crumbs that allow agents to efficiently learn to solve the task. This distinction has been raised elsewhere [2, 46, 47], and is similar to the extrinsic-intrinsic reward divide [45, 66]. Tools such as reward design [34, 51] or reward shaping [36] focus on offering more efficient learning in a variety of environments, so as to avoid issues of sparsity and long-term credit assignment. We concentrate primarily on reward's capacity to express a task, and defer learning dynamics to an (important) stage of future work.

Discounts, Expectations, and Rationality. Another important facet of reward is how it is used in producing behavior. The classical view offered by the Bellman equation (and the reward hypothesis) is that the quantity of interest to maximize is expected, discounted, cumulative reward. Yet it is possible to disentangle reward from the expectation [5], to attend only to ordinal [60] or maximal rewards [26], or to adopt different forms of discounting [61, 11]. In this work, we take the standard view that agents will seek to maximize value for a particular discount factor γ, but recognize that there are interesting directions beyond these commitments, such as inspecting the limits of reward in constrained MDPs as studied by Szepesvári [56]. We also note the particular importance of work by Pitis [40], who examines the relationship between classical decision theory [59] and MDPs by incorporating additional axioms that account for stochastic processes with discounting [24, 35, 48, 49]. Drawing inspiration from Pitis [40] and Sunehag and Hutter [52], we foresee valuable pathways for future work that further makes contact between RL and various axioms of rationality.

Preferences. In place of numerical rewards, preferences of different kinds may be used to evaluate an agent's behaviors, drawing from the literature on preference learning [25] and ordinal dynamic programming [8, 35, 48]. This premise gives rise to preference-based reinforcement learning (PbRL), in which an agent interacts with a CMP and receives evaluative signals in the form of preferences over states, actions, or trajectories. This kind of feedback inspires and closely parallels the task types we propose in this work. A comprehensive survey of PbRL by Wirth et al. [64] identifies critical differences in this setup from traditional RL, categorizes recent algorithmic approaches, and highlights important open questions. Recent work focuses on analysing the sample efficiency of such methods [65, 38], with close connections to learning from human feedback in real time [23, 32, 7].

Teaching and Inverse RL. The inverse RL (IRL) and apprenticeship learning literature examines the problem of learning directly from behavior [37, 1]. The classical problem of IRL is to identify which reward function (often up to an equivalence class) a given demonstrator is optimizing. We emphasize the relevance of two approaches: First, work by Syed et al. [55], who first illustrate the applicability of linear programming [22] to apprenticeship learning; and second, work by Amin et al. [4], who examine the repeated form of IRL.
The methods of IRL have recently been expanded to include variations such as cooperative IRL [14] and assistive learning [43], which offer different perspectives on how to frame interactive learning problems.

Reward Misspecification. Reward is also notoriously hard to specify. As pointed out by Littman et al. [30], putting a meaningful dollar figure on scuffing a wall or dropping a clean fork is challenging. Along these lines, Hadfield-Menell et al. [16] identify cases in which well-intentioned designers create reward functions that produce unintended behavior [39]. MacGlashan et al. [33] find that human-provided rewards tend to depend on a learning agent's entire policy, rather than just the current state. Further, work by Hadfield-Menell et al. [15] and Kumar et al. [27] suggests that there are problems with reward as a learning mechanism due to misspecification and reward tampering [10]. These problems have given rise to approaches to reward learning, in which a reward function is inferred from some evidence such as behavior or comparisons thereof [20].

Other Notions of Task. As a final note, we highlight alternative approaches to task specification. Building on the Free Energy Principle [13, 12], Hafner et al. [17] consider a variety of task types in terms of minimization of distance to a desired target distribution [3]. Alternatively, Littman et al. [30] and Li et al. [28] propose variations of linear temporal logic (LTL) as a mechanism for specifying a task to RL agents, with related literature extending LTL to the multi-task [58] and multi-agent [18] settings, or using reward machines for capturing task structure [19]. Jothimurugan et al. [21] take a similar approach and propose a task specification language for RL based on logical formulas that evaluate whether trajectories satisfy the task, similar in spirit to the logical task compositions framework developed by Tasse et al. [57]. Many of these notions of task are more general than those we consider. A natural direction for future work broadens our analysis to include these kinds of task.

## 3 An Account of Reward's Expressivity: The TASKQ and EXPRESSIONQ

Consider an onlooker, Alice, and an earnest learning agent, Bob, engaged in the interaction pictured in Figure 1. Suppose that Alice has a particular task in mind that she would like Bob to learn to solve, and that Alice constructs a reward function to incentivize Bob to pursue this task. Here, Alice is playing the role of "all of what we mean by goals and purposes" for Bob to pursue, with Bob playing the role of the standard reward-maximizing RL agent.

Two Questions About Task. To give us leverage to study the expressivity of reward, it is useful to draw a distinction between two stages of this process: 1) Alice thinks of a task that she would like Bob to learn to solve, and 2) Alice creates a reward function (and perhaps chooses γ) that conveys the chosen task to Bob. We inspect these two separately, framed by the following two questions. The first we call the task-definition question (TASKQ), which asks: What is a task? The second we call the task-expression question (EXPRESSIONQ), which asks: Which learning signal can be used as a mechanism for expressing any task to Bob?

Reward Answers the EXPRESSIONQ. We suggest that it may be useful to treat reward as an answer to the EXPRESSIONQ rather than the TASKQ.
On this view, reward is treated as an expressive language for incentivizing reward-maximizing agents: Alice may attempt to translate any task into a reward function that incentivizes Bob to pursue the task, no matter which environment Bob inhabits, which task Alice has chosen, or how she has represented the task to herself. Indeed, it might be the case that Alice's knowledge of the task far exceeds Bob's representational or perceptual capacity. Alice may know every detail of the environment and define the task based on this holistic vantage, while Bob must learn to solve the task through interaction alone, relying only on a restricted class of functions for modeling and decision making. Under this view, we can assess the expressivity of reward as an answer to the EXPRESSIONQ conditioned on how we answer the TASKQ. For example, if the TASKQ is answered in terms of natural language descriptions of desired states of affairs, then reward may fail to convey the chosen task due to the apparent mismatch in abstraction between natural language and reward (though some work has studied such a proposal [31, 62]).

### 3.1 Answers to the TASKQ: What is a Task?

In RL, tasks are often associated with a choice of goal, reward function (R), reward-discount pair (R, γ), or perhaps a choice of optimal policy (alongside those task types surveyed previously, such as LTL). However, it is unclear whether these constructs capture the entirety of what we mean by "task". For example, consider the Russell and Norvig [42] grid world: a 4×3 grid with one wall, one terminal fire state, and one terminal goal state (pictured with a particular reward function in Figure 4a). In such an environment, how might we think about tasks? A standard view is that the task is to reach the goal as quickly as possible. This account, however, fails to distinguish between the non-optimal behaviors, such as the costly behavior of the agent moving directly into the fire and the neutral behavior of the agent spending its existence in the start state. Indeed, characterizing a task in terms of a choice of π* or goal fails to capture these distinctions. Our view is that a suitably rich account of task should allow for the characterization of this sort of preference, offering the flexibility to scale from specifying only the desirable behavior (or outcomes) to an arbitrary ordering over behaviors (or outcomes). In light of these considerations, we propose three answers to the TASKQ that can convey general preferences over behavior or outcome: 1) a set of acceptable policies, 2) a partial ordering over policies, or 3) a partial ordering over trajectories. We adopt these three as they can capture many kinds of task while also allowing a great deal of flexibility in the level of detail of the specification.

| Name | Notation | Generalizes | Constraints Induced by T |
| --- | --- | --- | --- |
| SOAP | Π_G | task-as-π* | equal: V^{πg}(s_0) = V^{πg'}(s_0) > V^{πb}(s_0), ∀ πg, πg' ∈ Π_G, πb ∈ Π_B; range: V^{πg}(s_0) > V^{πb}(s_0), ∀ πg ∈ Π_G, πb ∈ Π_B |
| PO | L_Π | SOAP | (π1 ▷ π2) ∈ L_Π ⟹ V^{π1}(s_0) ▷ V^{π2}(s_0) |
| TO | L_{τ,N} | task-as-goal | (τ1 ▷ τ2) ∈ L_{τ,N} ⟹ G(τ1; s_0) ▷ G(τ2; s_0) |

Table 1: A summary of the three proposed task types. We further list the constraints that determine whether a reward function realizes each task type in an MDP, where we take ▷ to be one of <, >, or =, and G is the discounted return of the trajectory.

### 3.2 SOAPs, POs, and TOs

(SOAP) Set Of Acceptable Policies. A classical view of the equivalence of two reward functions is based on the optimal policies they induce. For instance, Ng et al.
[36] develop potential-based reward shaping by inspecting which shaped reward signals will ensure that the optimal policy is unchanged. Extrapolating, it is natural to say that for any environment E, two reward functions are equivalent if the optimal policies they induce in E are the same. In this way, a task is viewed as a choice of optimal policy. As discussed in the grid world example above, this notion of task fails to allow for the specification of the quality of other behaviors. For this reason, we generalize task-as-optimal-policy to a set of acceptable policies, defined as follows.

Definition 3.1. A set of acceptable policies (SOAP) is a non-empty subset of the deterministic policies, Π_G ⊆ Π, with Π the set of all deterministic mappings from S to A for a given E.

With one task type defined, it is important to address what it means for a reward function to properly realize, express, or capture a task in a given environment. We offer the following account.

Definition 3.2. A reward function is said to realize a task T in an environment E just when the start-state value (or return) induced by the reward function exactly adheres to the constraints of T.

Precise conditions for the realization of each task type are provided alongside each task definition, with a summary presented in column four of Table 1. For SOAPs, we take the start-state value V^π(s_0) to be the mechanism by which a reward function realizes a SOAP. That is, for a given E and Π_G, a reward function R is said to realize Π_G in E when the start-state value function is optimal for all good policies, and strictly higher than the start-state value of all other policies. It is clear that SOAP strictly generalizes a task in terms of a choice of optimal policy, as captured by the SOAP Π_G = {π*}. We note that there are two natural ways for a reward function to realize a SOAP: First, each πg ∈ Π_G has optimal start-state value and all other policies are sub-optimal. We call this type equal-SOAP, or just SOAP for brevity. Alternatively, we might only require that the acceptable policies are each near-optimal, but are allowed to differ in start-state value so long as they are all better than every bad policy πb ∈ Π_B. That is, in this second kind, there exists an ϵ ≥ 0 such that every πg ∈ Π_G is ϵ-optimal in start-state value, V*(s_0) − V^{πg}(s_0) ≤ ϵ, while all other policies are worse. We call this second realization condition range-SOAP. We note that the range realization generalizes the equal one: Every equal-SOAP is a range-SOAP (by letting ϵ = 0). However, there exist range-SOAPs that are expressible by Markov rewards that are not realizable as an equal-SOAP. We illustrate this fact with the following proposition. All proofs are presented in Appendix B.

Proposition 3.1. There exists a CMP, E, and choice of Π_G such that Π_G can be realized under the range-SOAP criterion, but cannot be realized under the equal-SOAP criterion.

One such CMP is pictured in Figure 2b. Consider the SOAP Π_G = {π11, π12, π21}: Under the equal-SOAP criterion, if each of these three policies is made optimal, any reward function will also make π22 (the only bad policy) optimal as well. In contrast, for the range criterion, we can choose a reward function that assigns lower reward to a2 than to a1 in both states. In general, we take the equal-SOAP realization as canonical, as it is naturally subsumed by our next task type.

(PO) Partial Ordering on Policies. Next, we suppose that Alice chooses a partial ordering on the deterministic policy space.
That is, Alice might identify some great policies, some good ones, and some bad policies to strictly avoid, and remain indifferent to the rest. POs strictly generalize equal-SOAPs, as any such SOAP is a special choice of PO with only two equivalence classes. We offer the following definition of a PO.

Definition 3.3. A policy order (PO) of the deterministic policies Π is a partial order, denoted L_Π.

As with SOAPs, we take the start-state value V^π(s_0) induced by a reward function R as the mechanism by which policies are ordered. That is, given E and L_Π, we say that a reward function R realizes L_Π in E if and only if the resulting MDP, M = (E, R), produces a start-state value function that orders Π according to L_Π.

(TO) Partial Ordering on Trajectories. A natural generalization of goal specification enriches a notion of task to include the details of how a goal is satisfied; that is, for Alice to relay some preference over trajectory space [63], as is done in preference-based RL [64]. Concretely, we suppose Alice specifies a partial ordering on length-N trajectories of (s, a) pairs, defined as follows.

Definition 3.4. A trajectory ordering (TO) of length N ∈ ℕ is a partial ordering L_{τ,N}, with each trajectory τ consisting of N state-action pairs, (s_0, a_0), . . . , (s_{N−1}, a_{N−1}), with s_0 the start state.

As with PO, we say that a reward function realizes a trajectory ordering L_{τ,N} if the ordering determined by each trajectory's cumulative discounted N-step return from s_0, denoted G(τ; s_0), matches that of the given L_{τ,N}. We note that trajectory orderings can generalize goal-based tasks at the expense of a larger specification. For instance, a TO can convey the task, "Safely reach the goal in less than thirty steps, or just get to the subgoal in less than twenty steps."

Recap. We propose to assess the expressivity of reward by first answering the TASKQ in terms of SOAPs, POs, or TOs, as summarized by Table 1. We say that a task T is realized in an environment E under reward function R if the start-state value function (or return) produced by R imposes the constraints specified by T, and are interested in whether reward can always realize a given task in any choice of E. We make a number of assumptions along the way, including: (1) reward functions are Markov and deterministic, (2) policies of interest are deterministic, (3) the environment is a finite CMP, (4) γ is part of the environment, (5) we ignore reward's role in shaping the learning process, and (6) start-state value or return is the appropriate mechanism to determine if a reward function realizes a given task. Relaxation of these assumptions is a critical direction for future work.

## 4 Analysis: The Expressivity of Markov Reward

With our definitions and objectives in place, we now present our main results.

### 4.1 Express SOAPs, POs, and TOs

We first ask whether reward can always realize a given SOAP, PO, or TO, for an arbitrary E. Our first result states that the answer is no: there are tasks that cannot be realized by any reward function.

Theorem 4.1. For each of SOAP, PO, and TO, there exist (E, T) pairs for which no Markov reward function realizes T in E.

Thus, reward is incapable of capturing certain tasks. What tasks are they, precisely? Intuitively, inexpressible tasks involve policies or trajectories that must be correlated in value in an MDP. That is, if two policies are nearly identical in behavior, it is unlikely that reward can capture the PO that places them at opposite ends of the ordering.
A simple example is the "always move the same direction" task in a grid world, with state defined as an (x, y) pair. The SOAP Π_G = {π↑, π↓, π←, π→} (one constant-direction policy per direction) conveys this task, but no Markov reward function can make these policies strictly higher in value than all others.

Example: Inexpressible SOAPs. Observe the two CMPs pictured in Figure 2, depicting two kinds of inexpressible SOAPs. On the left, we consider the SOAP Π_G = {π21}, containing only the policy that executes a2 in the left state (s0) and a1 in the right (s1). This SOAP is inexpressible through reward, but only because reward cannot distinguish the start-state value of π21 and π22, since the policies differ only in an unreachable state. This is reminiscent of Axiom 5 from Pitis [40], which explicitly excludes preferences of this sort. On the right, we find a more interesting case: The chosen SOAP is similar to the XOR function, Π_G = {π12, π21}. Here, the task requires that the agent choose each action in exactly one state. However, there cannot exist a reward function that makes only these policies optimal, as by consequence, both policies π11 and π22 must be optimal as well.

Figure 2: Two CMPs in which there is a SOAP that is not expressible under any Markov reward function. (a) Steady State Case: Π_G = {π21} is not realizable, as π21 cannot be made better than π22 because s1 is never reached. (b) Entailment Case: the XOR-like SOAP, Π_G = {π12, π21}, is not realizable; to make these two policies optimal, it is entailed that π22 and π11 must be optimal, too.

Next, we show that Theorem 4.1 is not limited to a particular choice of transition function or γ.

Proposition 4.2. There exist choices of E_{−T} = (S, A, γ, s_0) or E_{−γ} = (S, A, T, s_0), together with a task T, such that there is no (T, R) pair that realizes T in E_{−T}, and no (R, γ) pair that realizes T in E_{−γ}.

This result suggests that the scope of Theorem 4.1 is actually quite broad: even if the transition function or γ are taken as part of the reward specification, there are tasks that cannot be expressed. We suspect there are ways to give a precise characterization of all inexpressible tasks from an axiomatic perspective, which we hope to study in future work.

### 4.2 Constructive Algorithms: Task to Reward

We now analyze how to determine whether an appropriate reward function can be constructed for any (E, T) pair. We pose a general form of the reward-design problem [34, 51, 9] as follows.

Definition 4.1. The REWARDDESIGN problem is: Given E = (S, A, T, γ, s_0) and a task T, output a reward function R_Alice that ensures T is realized in M = (E, R_Alice).

Indeed, for all three task types, there is an efficient algorithm for solving the reward-design problem.

Theorem 4.3. The REWARDDESIGN problem can be solved in polynomial time, for any finite E and any T, so long as reward functions with infinitely many outputs are considered.

Therefore, for any choice of finite CMP, E, and a SOAP, PO, or TO, we can find a reward function that perfectly realizes the task in the given environment, if such a reward function exists. Each of the three algorithms is based on forming a linear program that matches the constraints of the given task type, which is why reward functions with infinitely many outputs are required. Pseudo-code for SOAP-based reward design is presented in Algorithm 1.
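As a concrete companion to Algorithm 1, the sketch below shows how its two key subroutines (computing discounted expected state-visitation distributions and solving the resulting linear program) might be implemented. This is our own illustration, not the authors' code: it assumes Python with NumPy/SciPy, restricts attention to state-based rewards bounded in [−1, 1], and uses helper names (`visitation`, `fringe`, `soap_reward_design`) that do not appear in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def visitation(T, policy, gamma, s0):
    """Discounted expected state-visitation distribution from s0 under a
    deterministic policy: rho = (I - gamma * P_pi^T)^{-1} e_{s0}."""
    S = T.shape[0]
    P_pi = T[np.arange(S), policy]          # (S, S) transition matrix under pi
    e = np.zeros(S)
    e[s0] = 1.0
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, e)

def fringe(good_policies, num_actions):
    """Policies that differ from some acceptable policy in exactly one state."""
    good = {tuple(p) for p in good_policies}
    out = set()
    for p in good:
        for s in range(len(p)):
            for a in range(num_actions):
                q = list(p)
                q[s] = a
                if tuple(q) not in good:
                    out.add(tuple(q))
    return [np.array(q) for q in sorted(out)]

def soap_reward_design(T, gamma, s0, good_policies, tol=1e-8):
    """Equal-SOAP reward design as a linear program over state-based rewards
    R(s) in [-1, 1]; returns a reward vector realizing the SOAP, or None.
    Assumes the SOAP is a strict subset of the deterministic policies."""
    S, A, _ = T.shape
    rho_g = [visitation(T, p, gamma, s0) for p in good_policies]
    rho_f = [visitation(T, p, gamma, s0) for p in fringe(good_policies, A)]
    # Decision variables: [R(s_1), ..., R(s_S), eps]; objective: maximize eps.
    c = np.zeros(S + 1)
    c[-1] = -1.0
    # Equality constraints: every good policy matches the start-state value of the first.
    A_eq = [np.append(rho_g[0] - rho, 0.0) for rho in rho_g[1:]]
    # Inequality constraints: every fringe policy is worse by at least eps.
    A_ub = [np.append(rho - rho_g[0], 1.0) for rho in rho_f]
    bounds = [(-1.0, 1.0)] * S + [(0.0, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.zeros(len(A_ub)),
                  A_eq=np.array(A_eq) if A_eq else None,
                  b_eq=np.zeros(len(A_eq)) if A_eq else None,
                  bounds=bounds, method="highs")
    if res.success and res.x[-1] > tol:
        return res.x[:S]   # a realizing state-based reward function
    return None
```

If this sketch is faithful to Algorithm 1, running it on the CMP of Figure 2b with Π_G = {π12, π21} should report that no realizing reward function exists, consistent with Theorem 4.1.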
Intuitively, the algorithms compute the discounted expected state-visitation distribution for a collection of policies; in the case of SOAP, for instance, these policies include Π_G and what we call the "fringe", the set of policies that differ from a πg ∈ Π_G by exactly one action. Then, we use these distributions to describe linear inequality constraints ensuring that the start-state values of the good policies are better than those of the fringe.

Algorithm 1: SOAP Reward Design

```
INPUT:  E = (S, A, T, γ, s_0), Π_G
OUTPUT: R, or ⊥

 1: Π_fringe = compute_fringe(Π_G)
 2: for π_{g,i} ∈ Π_G do                          // Compute state-visitation distributions.
 3:     ρ_{g,i} = compute_exp_visit(π_{g,i}, E)
 4: for π_{f,i} ∈ Π_fringe do
 5:     ρ_{f,i} = compute_exp_visit(π_{f,i}, E)
 6: C_eq = {}                                      // Make equality constraints.
 7: for π_{g,i} ∈ Π_G do
 8:     C_eq.add(ρ_{g,0}(s_0) · X = ρ_{g,i}(s_0) · X)
 9: C_ineq = {}                                    // Make inequality constraints.
10: for π_{f,j} ∈ Π_fringe do
11:     C_ineq.add(ρ_{f,j}(s_0) · X + ϵ ≤ ρ_{g,0}(s_0) · X)
12: R_out, ϵ_out = linear_programming(objective = max ϵ, constraints = C_ineq ∪ C_eq)   // Solve LP.
13: if ϵ_out > 0 then return R_out                 // Check if successful.
14: else return ⊥
```

As highlighted by Theorem 4.1, there are SOAPs, POs, and TOs that are not realizable. Thus, it is important to determine how the algorithms mentioned in Theorem 4.3 will handle such cases. Our next corollary illustrates that the desirable outcome is achieved: For any E and T, the algorithms will output a reward function that realizes T in E, or output ⊥ when no such function exists.

Corollary 4.4. For any task T and environment E, deciding whether T is expressible in E is solvable in polynomial time.

Together, Theorem 4.1 and Theorem 4.3 constitute our main results: There are environment-task pairs in which Markov reward cannot express the chosen task, for each of SOAPs, POs, and TOs. However, there are efficient algorithms for deciding whether a task is expressible, and for constructing the realizing reward function when it exists. We will study the use of one of these algorithms in Section 5, but first attend to other aspects of reward's expressivity.

### 4.3 Other Aspects of Reward's Expressivity

We next briefly summarize other considerations about the expressivity of reward. As noted, Theorem 4.3 requires the use of a reward function that can produce infinitely many outputs. Our next result proves this requirement is strict for efficient reward design.

Theorem 4.5. A variant of the REWARDDESIGN problem with finite reward outputs is NP-hard.

We provide further details about the precise problem studied in Appendix B. Beyond reward functions with finitely many outputs, we are also interested in extensions of our results to multiple environments. We next present a positive result indicating our algorithms can extend to the case where Alice would like to design a reward function for a single task across multiple environments.

Proposition 4.6. For any SOAP, PO, or TO, given a finite set of CMPs, {E_1, . . . , E_n}, with shared state-action space, there exists a polynomial-time algorithm that outputs one reward function that realizes the task (when possible) in all CMPs in the set.

A natural follow-up question to the above result asks whether task realization is closed under a set of CMPs. Our next result answers this question in the negative.

Theorem 4.7. Task realization is not closed under sets of CMPs with shared state-action space. That is, there exist choices of T and {E_1, . . . , E_n} such that T is realizable in each E_i independently, but there is no single reward function that realizes T in all E_i simultaneously.
Intuitively, this shows that Alice must know precisely which environment Bob will inhabit if she is to design an appropriate reward function. Otherwise, her uncertainty over E may prevent her from designing a realizing reward function. We foresee iterative extensions of our algorithms in which Alice and Bob can react to one another, drawing inspiration from repeated IRL by Amin et al. [4].

## 5 Experiments

We next conduct experiments to shed further light on the findings of our analysis. Our focus is on SOAPs, though we anticipate the insights extend to POs and TOs as well with little complication. In the first experiment, we study the fraction of SOAPs that are expressible in small CMPs as we vary aspects of the environment or task (Figure 3). In the second, we use one algorithm from Theorem 4.3 to design a reward function, and contrast learning curves under a SOAP-designed reward function compared to standard rewards. Full details about the experiments are found in Appendix C.

Figure 3: The approximate fraction of SOAPs that are expressible by reward in CMPs with a handful of states and actions, with 95% confidence intervals. In each plot, we vary a different parameter of the environment or task to illustrate how this change impacts the expressivity of reward, showing both equal (color) and range (grey) realization of SOAP. Panels: (a) Vary Num. Actions; (b) Vary Num. States; (c) Vary γ; (d) Vary SOAP Size; (e) Vary Entropy of T(·|s, a); (f) Vary the Spread of Π_G.

SOAP Expressivity. First, we estimate the fraction of SOAPs that are expressible in small environments. For each data point, we sample 200 random SOAPs and run Algorithm 1, described by Theorem 4.3, to determine whether each SOAP is realizable in the given CMP. We ask this question for both the equal (color) variant of SOAP realization and the range (grey) variant. We inspect SOAP expressivity as we vary six different characteristics of E or Π_G: the number of actions, the number of states, the discount γ, the number of good policies in each SOAP, the Shannon entropy of T at each (s, a) pair, and the spread of each SOAP. The spread approximates average edit distance among policies in Π_G, determined by randomly permuting actions of a reference policy by a coin weighted according to the value on the x-axis. We use the same set of CMPs for each experiment, up to any deviations explicitly made by the varied parameter (such as γ or entropy). Unless otherwise stated, each CMP has four states and three actions, with a fixed but randomly chosen transition function.

Results are presented in Figure 3. We find that our theory is borne out in a number of ways. First, as Theorem 4.1 suggests, we find SOAP expressivity is strictly less than one in nearly all cases. This is evidence that inexpressible tasks are not only found in manufactured corner cases, but rather that expressivity is a spectrum. We further observe, as predicted by Proposition 3.1, clear separation between the expressivity of range-SOAP (grey) vs.
equal-SOAP (color); there are many cases where we can find a reward function that makes the good policies near-optimal and better than the bad, but cannot make those good policies all exactly optimal. Additionally, several trends emerge as we vary the parameter of environment or task, though we note that such trends are likely specific to the choice of CMP and may not hold in general. Perhaps the most striking trend is in Figure 3f, which shows a decrease in expressivity as the SOAPs become more spread out. This is quite sensible: A more spread-out SOAP is likely to lead to more entailments of the kind discussed in Figure 2b.

Learning with SOAP-designed Rewards. Next, we contrast the learning performance of Q-learning under a SOAP-designed reward function (visualized in Figure 4a) with that of the regular goal-based reward in the Russell and Norvig [42] grid world. In this domain, there is a 0.35 slip probability such that, on a slip event, the agent randomly applies one of the two orthogonal action effects. The regular goal-based reward function provides +1 when the agent enters the terminal flag cell, and −1 when the agent enters the terminal fire cell. The bottom-left state is the start state, and the black cell is an impassable wall.

Figure 4: A SOAP-designed reward function (left, "Grid World SOAP Reward") and the resulting learning curves (right, "Grid World Learning") for Q-learning compared to the traditional reward function for the Russell and Norvig [42] grid world. Each series presents average performance over 50 runs of the experiment with 95% confidence intervals.

Results are presented in Figure 4. On the right, we present a particular kind of learning curve contrasting the performance of Q-learning with the SOAP reward (blue) and regular reward (green). The y-axis measures, at the end of each episode, the average (inverse) minimum edit distance between Q-learning's greedy policy and any policy in the SOAP. Thus, when the series reaches 1.0, Q-learning's greedy policy is identical to one of the two SOAP policies. We first find that Q-learning is able to quickly learn a πg ∈ Π_G under the designed reward function. We further observe that the typical reward does not induce a perfect match in policy at convergence: the green curve hovers slightly below the blue, indicating that the default reward function is incentivizing different policies to be optimal. This is entirely sensible, as the two SOAP policies are extremely cautious around the fire; they choose the orthogonal (and thus, safe) action in fire-adjacent states, relying on slip probability to progress. Lastly, as expected given the amount of knowledge contained in the SOAP, the SOAP reward function allows Q-learning to rapidly identify a good policy compared to the typical reward.

## 6 Conclusion

We have here investigated the expressivity of Markov reward, framed around three new accounts of task. Our main results show that there exist choices of task and environment in which Markov reward cannot express the chosen task, but there are efficient algorithms that decide whether a task is expressible and construct a reward function that captures the task when such a function exists. We conclude with an empirical examination of our analysis, corroborating the findings of our theory. We take these to be first steps toward understanding the full scope of the reward hypothesis.
There are many routes forward. A key direction moves beyond the task types we study here, and relaxes our core assumptions: the environment might not be a finite CMP, Alice may not know the environment precisely, reward may be a function of history, or Alice may not know how Bob represents state. Along similar lines, a critical direction incorporates how reward impacts Bob's learning dynamics rather than start-state value. Further, we note the potential relevance to the recent reward-is-enough hypothesis proposed by Silver et al. [44]; we foresee pathways to extend our analysis to examine this newer hypothesis, too. For instance, in future work, it is important to assess whether reward is capable of inducing the right kinds of attributes of cognition, not just behavior.

## Acknowledgments and Disclosure of Funding

The authors would like to thank André Barreto, Diana Borsa, Michael Bowling, Wilka Carvalho, Brian Christian, Jess Hamrick, Steven Hansen, Zac Kenton, Ramana Kumar, Katrina McKinney, Rémi Munos, Matt Overlan, Hado van Hasselt, and Ben Van Roy for helpful discussions. We would also like to thank the anonymous reviewers for their thoughtful feedback, and Brendan O'Donoghue for catching a typo in the appendix. Michael Littman was supported in part by funding from DARPA L2M, ONR MURI, NSF FMitF, and NSF RI.

## References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.
[2] David Ackley and Michael L. Littman. Interactions between learning and evolution. Artificial Life II, 1992.
[3] Sundararaman Akshay, Nathalie Bertrand, Serge Haddad, and Loic Helouet. The steady-state control problem for Markov decision processes. In Proceedings of the International Conference on Quantitative Evaluation of Systems, 2013.
[4] Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
[5] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
[6] Brian Christian. The Alignment Problem: Machine Learning and Human Values, pages 130–131. Atlantic Books, 2021.
[7] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
[8] Gerard Debreu. Representation of a preference ordering by a numerical function. Decision Processes, 3:159–165, 1954.
[9] Daniel Dewey. Reinforcement learning and the reward engineering principle. In Proceedings of the AAAI Spring Symposium Series, 2014.
[10] Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. Reinforcement learning with a corrupted reward channel. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.
[11] William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865, 2019.
[12] Karl J. Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[13] Karl J. Friston, Jean Daunizeau, and Stefan J. Kiebel. Reinforcement learning or active inference? PLoS ONE, 4(7):e6421, 2009.
[14] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning.
In Advances in Neural Information Processing Systems, 2016.
[15] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.
[16] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, 2017.
[17] Danijar Hafner, Pedro A. Ortega, Jimmy Ba, Thomas Parr, Karl J. Friston, and Nicolas Heess. Action and perception as divergence minimization. arXiv preprint arXiv:2009.01791, 2020.
[18] Lewis Hammond, Alessandro Abate, Julian Gutierrez, and Michael Wooldridge. Multi-agent reinforcement learning with temporal logic specifications. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2021.
[19] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.
[20] Hong Jun Jeon, Smitha Milli, and Anca Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. In Advances in Neural Information Processing Systems, 2020.
[21] Kishor Jothimurugan, Rajeev Alur, and Osbert Bastani. A composable specification language for reinforcement learning tasks. In Advances in Neural Information Processing Systems, 2020.
[22] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the Annual ACM Symposium on Theory of Computing, 1984.
[23] W. Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the International Conference on Knowledge Capture, 2009.
[24] Tjalling C. Koopmans. Stationary ordinal utility and impatience. Econometrica: Journal of the Econometric Society, pages 287–309, 1960.
[25] David Kreps. Notes on the Theory of Choice. Westview Press, 1988.
[26] Sai Krishna Gottipati, Yashaswi Pathak, Rohan Nuttall, Raviteja Chunduru, Ahmed Touati, Sriram Ganapathi Subramanian, Matthew E. Taylor, and Sarath Chandar. Maximum reward formulation in reinforcement learning. In NeurIPS Workshop on Deep Reinforcement Learning, 2020.
[27] Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, and Shane Legg. REALab: An embedded perspective on tampering. arXiv preprint arXiv:2011.08820, 2020.
[28] Xiao Li, Cristian-Ioan Vasile, and Calin Belta. Reinforcement learning with temporal logic rewards. In Proceedings of the International Conference on Intelligent Robots and Systems, 2017.
[29] Michael L. Littman. The reward hypothesis, 2017. URL https://www.coursera.org/lecture/fundamentals-of-reinforcement-learning/michael-littman-the-reward-hypothesis-q6x0e.
[30] Michael L. Littman, Ufuk Topcu, Jie Fu, Charles Isbell, Min Wen, and James MacGlashan. Environment-independent task specifications via GLTL. arXiv preprint arXiv:1704.04341, 2017.
[31] James MacGlashan, Monica Babes-Vroman, Marie desJardins, Michael L. Littman, Smaranda Muresan, Shawn Squire, Stefanie Tellex, Dilip Arumugam, and Lei Yang. Grounding English commands to reward functions. In Proceedings of Robotics: Science and Systems, 2015.
[32] James MacGlashan, Michael L. Littman, David L. Roberts, Robert Loftin, Bei Peng, and Matthew E. Taylor. Convergent actor-critic by humans.
In Proceedings of the International Conference on Intelligent Robots and Systems, 2016.
[33] James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, and Michael L. Littman. Interactive learning from policy-dependent human feedback. In Proceedings of the International Conference on Machine Learning. PMLR, 2017.
[34] Maja J. Mataric. Reward functions for accelerated learning. In Proceedings of the International Conference on Machine Learning, 1994.
[35] L. G. Mitten. Preference order dynamic programming. Management Science, 21(1):43–46, 1974.
[36] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning, 1999.
[37] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2000.
[38] Ellen Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel Burdick. Dueling posterior sampling for preference-based reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2020.
[39] Pedro A. Ortega, Vishal Maini, and the DeepMind Safety Team. Building safe artificial intelligence: specification, robustness, and assurance, 2018. URL https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1.
[40] Silviu Pitis. Rethinking the discount factor in reinforcement learning: A decision theoretic approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[41] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[42] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1994. ISBN 0-13-103805-2.
[43] Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D. Dennis, Pieter Abbeel, Anca Dragan, and Stuart Russell. Benefits of assistance over reward learning, 2021. URL https://openreview.net/forum?id=DFIoGDZejIB.
[44] David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough. Artificial Intelligence, page 103535, 2021.
[45] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. Technical report, University of Massachusetts at Amherst, Department of Computer Science, 2005.
[46] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, 2009.
[47] Satinder Singh, Richard L. Lewis, Jonathan Sorg, Andrew G. Barto, and Akram Helou. On separating agent designer goals from agent goals: Breaking the preferences-parameters confound, 2010.
[48] Matthew J. Sobel. Ordinal dynamic programming. Management Science, 21(9):967–975, 1975.
[49] Matthew J. Sobel. Discounting axioms imply risk neutrality. Annals of Operations Research, 208(1):417–432, 2013.
[50] Jonathan Sorg. The Optimal Reward Problem: Designing Effective Reward for Bounded Agents. PhD thesis, University of Michigan, 2011.
[51] Jonathan Sorg, Richard L. Lewis, and Satinder Singh. Reward design via online gradient ascent. Advances in Neural Information Processing Systems, 2010.
[52] Peter Sunehag and Marcus Hutter. Axioms for rational reinforcement learning. In Proceedings of the International Conference on Algorithmic Learning Theory, 2011.
[53] Richard S. Sutton. The reward hypothesis, 2004. URL http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html.
[54] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[55] Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the International Conference on Machine Learning, 2008.
[56] Csaba Szepesvári. Constrained MDPs and the reward hypothesis, 2020. URL http://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html.
[57] Geraud Nangue Tasse, Steven James, and Benjamin Rosman. A Boolean task algebra for reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
[58] Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2018.
[59] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1953.
[60] Paul Weng. Markov decision processes with ordinal rewards: Reference point-based preferences. In Proceedings of the International Conference on Automated Planning and Scheduling, 2011.
[61] Martha White. Unifying task specification in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
[62] Edward C. Williams, Nakul Gopalan, Mine Rhee, and Stefanie Tellex. Learning to parse natural language to grounded reward functions with weak supervision. In Proceedings of the International Conference on Robotics and Automation, 2018.
[63] Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems, 2012.
[64] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
[65] Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, and Artur Dubrawski. Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33, 2020.
[66] Zeyu Zheng, Junhyuk Oh, Matteo Hessel, Zhongwen Xu, Manuel Kroiss, Hado van Hasselt, David Silver, and Satinder Singh. What can learned intrinsic rewards capture? In Proceedings of the International Conference on Machine Learning, 2020.