# Path-Specific Objectives for Safer Agent Incentives

Sebastian Farquhar (1,2), Ryan Carey (1), Tom Everitt (2)
(1) University of Oxford, (2) DeepMind

We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with delicate parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return that is not mediated by the delicate parts of state, using causal influence diagram analysis. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.

## 1 Introduction

Artificial agents can have unsafe incentives to influence parts of their environments in unintended ways. For example, content recommendation systems can achieve good performance by manipulating their users to develop more predictable preferences instead of catering to their tastes directly (Russell 2019). These incentives can be instrumental: indirectly achieving what the system's designers asked for but did not intend (Everitt et al. 2021a). In these cases, it is hard to just pick a better reward function. Imagine the unenviable task of writing down which user preferences are desirable! We would rather ensure that the agent has no systematic incentive to manipulate people's preferences at all, as opposed to an agent with an incentive to encourage its users to have the "right" kind of preference. In this setting, the users' preferences are what we call delicate state: a part of the environment which is hard to define a reward for and vulnerable to deliberate manipulation.

Even when part of the state-space is delicate, other parts might be tractable. For example, we might know exactly how we want to price the media-bandwidth consumption driven by our recommendations. This paper provides a framework for designing agents to act safely in environments where parts of the state-space are delicate and other parts are not, under assumptions which permit the estimation of causal effects. We show how to train agents in a way that removes incentives to control delicate state. This can be interpreted as creating an agent which does not intend to affect the delicate state (Halpern and Kleiman-Weiner 2018), which is distinct from both trying not to influence that state and trying to keep it constant (Turner, Ratzlaff, and Tadepalli 2020; Krakovna et al. 2020).

We use causal influence diagrams (CIDs), which can formally express instrumental control incentives (Everitt et al. 2021a). We show that one can remove the instrumental control incentive over delicate state by training agents to maximize the path-specific causal effect (Pearl 2001) of their actions on the reward along paths which are not mediated by the delicate state. Moreover, we show how a diverse set of previous proposals for safe agent design can be motivated by these principles. In this way, we unify and generalize approaches from topics such as reward tampering (Uesato et al. 2020; Everitt et al. 2021b), online reward learning (Armstrong et al. 2020), and auto-induced distributional shift (Krueger, Maharaj, and Leike 2020).
At the same time, we show how these methods depend on assumptions about the state-space which have not previously been acknowledged. We highlight the opportunities and dangers of these approaches empirically in a content recommendation environment from Krueger, Maharaj, and Leike (2020). Our main contributions are:

- We formalize the problem of delicate state as a complement to reward specification (§2);
- We propose path-specific objectives (§5);
- We show this generalizes and unifies prior work (§6).

## 2 The Problem of Delicate State

Delicate state is a tool for framing safe agent design. When a state is subtle and manipulable we call it delicate:

**Subtle** Hard to specify a reward for.

**Manipulable** Vulnerable to motivated action: intentional actions can have bad outcomes.

Jointly, these are dangerous: it is hard to say what we want for the state, and it is easy for influence on the state to have bad consequences. A person's political beliefs might be an example of such a state. The current toolbox for safe agent design mostly tries to attack subtlety directly by finding a better way to specify the reward. This is not our approach; instead, we aim to remove any incentive for the agent to control the delicate part of state-space.

If a part of state-space is delicate, then having a control incentive over it is dangerous. But in order for removing that incentive to lead to safe outcomes, a third condition is needed:

**Stable** Robust against unmotivated action: side-effects are unlikely to be bad.

Stability entails that an agent with no systematic incentive to influence the state, but which still influences it and may produce side-effects, is safe. As a metaphor for a system which is both manipulable and stable, consider a puzzle box: apply the right pressure to the right spots and it comes apart easily, but you can fumble randomly, or even use it as a mallet, and it will not open. We might hypothesise that a person's political beliefs are relatively stable: after all, most people are able to think critically and independently in the presence of influences in many directions.

We define delicate state within the context of a factored Markov decision process (MDP) characterized by a transition function, a reward function, an action-space, and, unlike standard MDPs, a state-space factored into a robust state s ∈ S and a delicate state z ∈ Z, such that the overall state is {s, z} ∈ S × Z. The transition function therefore maps s, z, and an action a ∈ A onto the succeeding state s′, z′, and the reward function maps s, s′, z, z′, a onto a reward r ∈ R.
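To make this factorization concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of a delicate factored MDP in Python. The class, the toy dynamics, and the reward are assumptions made purely for illustration; note that the reward never references z directly, yet an instrumental incentive over z still arises because z feeds into the future robust state.

```python
from typing import Tuple

# Illustrative state/action types: the robust state s, the delicate state z,
# and the action a are plain ints here; in practice each could be structured.
S = int
Z = int
A = int


class DelicateFactoredMDP:
    """Sketch of an MDP whose state is factored into robust and delicate parts.

    The transition maps (s, z, a) -> (s', z') and the reward maps
    (s, s', z, z', a) -> r, mirroring the definition above.
    """

    def transition(self, s: S, z: Z, a: A) -> Tuple[S, Z]:
        # Toy dynamics: the delicate state feeds into the next robust state,
        # and the action can nudge the delicate state.
        s_next = s + a + z
        z_next = z + (1 if a > 0 else 0)
        return s_next, z_next

    def reward(self, s: S, s_next: S, z: Z, z_next: Z, a: A) -> float:
        # The reward only references the robust state, yet an agent that
        # maximizes return still benefits from pushing z upward, because z
        # raises future values of s.
        return float(s_next)


# Usage: roll the toy MDP forward one step.
mdp = DelicateFactoredMDP()
s, z, a = 0, 0, 1
s_next, z_next = mdp.transition(s, z, a)
print(mdp.reward(s, s_next, z, z_next, a))
```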
**Subtlety.** We consider five cases that make a state subtle, and thereby potentially delicate. In each, it is not enough to simply pick a reward function that does not explicitly depend on z, because instrumental incentives emerge when Z and S interact.

**Not Ordered** There might not be a well-defined ethical ranking of different values of Z, or it might be unethical to codify a ranking. For example, it may be unethical for a system to systematically influence users' beliefs, preferences, and political views (Burr, Cristianini, and Ladyman 2018). Content recommender systems often interact with subtle human states (Kramer, Guillory, and Hancock 2014).

**Vague** Even if an ordering is possible, we may not trust our agent-designers to describe it. Reward modelling (Leike et al. 2018) or alternative work on reward specification (Christiano et al. 2017) seeks to attack this source of subtlety directly, while our approach tries to side-step it.

**Unenforceable** Even with a well-specified reward, we may be unable to enforce it. For example, if Z is the physical implementation of the reward function, then a modified Z might no longer punish the agent for having changed it (Amodei et al. 2016; Everitt et al. 2021b).

**Illegal** The law might ban a well-specified and enforceable reward. For example, if Z is the market price of an asset, deliberately influencing it may be market manipulation.

**Structural** We might choose not to reward based on Z in order to construct an ecosystem of agents. For example, Z might be a performance measure of our agent which is used by another agent. Alternatively, the system might have deliberately demarcated roles, much as judges may be asked to apply the law as it stands, ignoring political consequences.

**Manipulability and Stability.** A manipulable state is one where deliberate or intentional actions can easily bring about harm. We adopt a notion of intentionality built on incentives, assuming that an agent which has an incentive over Z and influences Z does so "intentionally", following Halpern and Kleiman-Weiner (2018). This approach, described more formally below, has the advantage of being agnostic to the specific implementation of the agent (models, algorithms, etc.). We contrast this with instability, where non-deliberate actions can easily bring about harm.

We can draw parallels to safety and security in cybersecurity. A secure system is one that is robust to malicious actors (not manipulable), while a safe system is robust to natural behaviour (stable). For example, making a user manually type "Delete my repo" improves safety, because deletion is unlikely to happen unintentionally, while doing nothing to improve security. Requiring a secret password instead would improve both safety and security. Our approach is most applicable in settings that are safe (in this sense) but not secure, of which user-preference manipulation and reward tampering are archetypal.

One can show that a system is not stable by demonstrating a natural behaviour that produces bad outcomes. Proving that a system is stable is an open challenge. The problem of stability is related to the problem identified by Armstrong and Gorman (2021): delicate states whose random variables have mutual information with the utility might be systematically influenced even without an incentive present. We consider stability further in §9, alongside other limitations of our method, highlighting that these are previously unacknowledged limitations of a number of related methods.
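To make the manipulable-but-stable distinction concrete, here is a toy numerical sketch (our illustration, not from the paper) of the puzzle-box property: a delicate quantity that barely moves under random behaviour but moves a great deal under behaviour that deliberately targets it. All names and numbers are assumptions made for illustration.

```python
import random

random.seed(0)

# Toy "puzzle box": one targeted action moves the delicate quantity z a lot,
# while every other action barely moves it.
ACTIONS = list(range(100))
TARGETED_ACTION = 42


def delta_z(a: int) -> float:
    # "The right pressure on the right spot" versus fumbling at random.
    return 5.0 if a == TARGETED_ACTION else random.gauss(0.0, 0.05)


# Unmotivated behaviour: a random policy's typical influence on z is small,
# which is what the stability condition asks for.
random_shift = sum(abs(delta_z(random.choice(ACTIONS))) for _ in range(10_000)) / 10_000

# Motivated behaviour: an agent with an incentive over z simply picks the
# action that moves z the most, which is what manipulability warns about.
targeted_shift = max(abs(delta_z(a)) for a in ACTIONS)

print(f"mean |dz| under random actions:    {random_shift:.3f}")
print(f"max  |dz| under a targeted action: {targeted_shift:.3f}")
```

The large gap between the two printed quantities is the puzzle-box property: hard to open by fumbling, easy to open on purpose.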
## 3 Background on Causal Influence Diagrams

Causal influence diagrams (CIDs) combine ideas from influence diagrams (Howard and Matheson 2005; Lauritzen and Nilsson 2001) and causality (Pearl 2009) and can be used to identify incentives (Everitt et al. 2021a). They are particularly well-suited to the analysis of delicate state because they explicitly represent the causal interactions of agents, rewards, and different parts of state, while also formalizing graphical criteria for the presence of incentives to control certain parts of state. In this section we provide the background on causal models and CIDs needed to formally develop the delicate state setting. We return to a broader review of prior work in §8.

Throughout this paper we adopt the convention that upper case denotes random variables and lower case their realizations. We also elide whether random variables are singletons or sets, noting that a set of random variables is identical to a set-valued random variable.

Restating (using the original numbering) the definition of the CID itself:

**Definition E3** (Everitt et al. 2021a). A causal influence diagram (CID) is a directed acyclic graph G whose vertex set V is partitioned into structure nodes X, action nodes A, and utility nodes U.

Intuitively, this is a graph where some nodes represent the agent's decision and others its goals; the rest are structure nodes. Note that what we call action nodes are sometimes called decision nodes. Arrows into action nodes are called information links.

[Figure 1: Causal influence diagrams in a setting with delicate state. The blue square is the decision and the yellow diamond is the utility; black arrows show causal influence, and dashed red circles show an instrumental control incentive (ICI). (a) CID with ICI on S and Z: the agent has an ICI over {S, Z}. (b) CID removing the ICI on Z: Z does not influence R, so there is no ICI on Z.]

The relationships between the nodes are defined by structural functions in a SCIM:

**Definition E4** (Everitt et al. 2021a). A structural causal influence model (SCIM) is a tuple M = (G, E, F, P) where:

- G is a CID with finite-domain variables V whose utility variables have real-valued domains, dom(U) ⊂ R for U ∈ U. We say that M is compatible with G.
- E = {E_V}_{V ∈ V} is a set of finite-domain exogenous variables, one for each element of V.
- F = {f_V}_{V ∈ V \ A} is a set of structural functions f_V : dom(Pa_V ∪ {E_V}) → dom(V) that specify how each non-decision variable depends on its parents in G and its exogenous variable.
- P is a Markovian probability distribution over E (i.e., all elements are mutually independent).

Intuitively, this describes how the variables at the nodes change with each other, and incorporates chance. The notation Pa_X denotes the parents of X in G. The goal of the agent is to select a policy π : dom(Pa_A) → dom(A) for each action node A so that the expected sum of the utility nodes is maximized.

We also use structural causal models (SCMs) for some of our analysis. While SCMs are logically more fundamental than SCIMs, for the purpose of this paper we can think of them as SCIMs without action nodes. That is, an SCM is a SCIM where all nodes have been assigned structural functions. In particular, imputing a policy π to a SCIM M turns the SCIM into the SCM M_π. SCMs are fully developed by Pearl (2009), who also formalises the intervention notation do(X = x) to mean intervening to set the random variable X to x. Formally, do(X = x) replaces the structural function f_X with the constant function X = x. The potential response Y_x denotes Y under the intervention do(X = x). Somewhat abusing notation, we write Y_π for the variable Y in M_π, and Y_{π,x} for this variable under the intervention do(X = x). Potential responses can be nested, allowing expressions such as Z_{Y_x}, which should be interpreted as Z_y where y = Y_x.

CIDs have been used to define instrumental control incentives, which formalise the intuitive notion of which variables the agent wants to influence:

**Definition E17** (Everitt et al. 2021a). There is an instrumental control incentive (ICI) on a variable X ∈ V, in a SCIM M with a single-decision CID, decision context pa_A, and total return U given by the sum of the utility nodes, if for all optimal policies π* there is some a ∈ dom(A) such that

$$\mathbb{E}_{\pi^*}\!\left[U_{X_a} \mid \mathrm{pa}_A\right] \;\neq\; \mathbb{E}_{\pi^*}\!\left[U \mid \mathrm{pa}_A\right]. \qquad (1)$$

Here, U_{X_a} is the utility in the nested potential response where X is as it would have been had A been a. Intuitively, this says that the agent has an ICI over X if it could achieve a utility different from that of the optimal policy, were it also able to independently set X.
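To make the do-notation and the check in Definition E17 concrete, the following is a minimal sketch under simplifying assumptions (a single binary decision, deterministic structural functions, no exogenous noise). The toy graph, with edges A → X, A → U, and optionally X → U, and all function names are our own illustration rather than the paper's.

```python
from typing import Optional

# Minimal single-decision SCIM sketch: structure node X, decision A, utility U.

ACTIONS = [0, 1]


def f_X(a: int) -> int:
    # Structure node X depends on the decision A.
    return a


def f_U(a: int, x: int, x_influences_u: bool) -> float:
    # Utility node U; the flag toggles whether the edge X -> U exists.
    return 2.0 * a + (1.0 * x if x_influences_u else 0.0)


def total_return(a: int, x_influences_u: bool, do_x: Optional[int] = None) -> float:
    # do(X = x) replaces the structural function f_X with a constant.
    x = f_X(a) if do_x is None else do_x
    return f_U(a, x, x_influences_u)


def has_ici_on_X(x_influences_u: bool) -> bool:
    # Optimal policy: pick the decision with the highest return.
    a_star = max(ACTIONS, key=lambda a: total_return(a, x_influences_u))
    u_opt = total_return(a_star, x_influences_u)
    # Nested potential response U_{X_a}: keep A at a_star, but set X to the
    # value it would have taken had A been a.
    for a in ACTIONS:
        u_nested = total_return(a_star, x_influences_u, do_x=f_X(a))
        if u_nested != u_opt:
            return True  # Setting X independently changes the utility: ICI.
    return False


print(has_ici_on_X(x_influences_u=True))   # True: there is a path A -> X -> U
print(has_ici_on_X(x_influences_u=False))  # False: X does not influence U
```

The two printed cases anticipate the graphical criterion stated next: the ICI appears exactly when a directed path from A to the utility passes through X.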
Finally, to diagnose the presence or absence of ICIs over the delicate state, we can use a graphical criterion:

**Theorem E18** (Everitt et al. 2021a). A single-decision CID G admits an ICI over X ∈ V if and only if G has a directed path from A to U via X, i.e. a directed path A → X → U.

To review these concepts, examine Fig. 1, where we contrast a CID which admits an ICI over Z (Fig. 1a) with one which does not (Fig. 1b).

## 4 General Delicate MDP CID

Applying these tools to §2, we construct a general CID for a factored MDP with delicate and robust state.

**Definition 4.1.** A delicate T-step MDP is a factored MDP (Boutilier, Dearden, and Goldszmidt 2000) whose state is factored into delicate state, Z_t, and robust state, S_t, at each timestep.

We can describe a delicate MDP with a CID containing random variables Z_t, S_t, A_t, and R_t for 0 ≤ t ≤ T. Here, the A_t are action nodes and the R_t are utility nodes (discounting can be introduced by scaling these); all other nodes are chance nodes. Variables depend only on the most recent timestep. The decision node A_t can observe Z_t, S_t, and R_t. The resulting CID is shown in Fig. 2a.

[Figure 2: (a) G, a general delicate MDP CID, with an ICI on {Z_t | 1 < t < T}. (b) G′, with no ICI on {Z_t | 1 < t < T}: removing the paths from A_0 to Z_t removes the ICI.]

Special cases of this graph can remove influence arrows. For example, work on reward tampering often assumes that the reward function specification (here modelled as Z_t) cannot directly influence the rest of the state (here modelled as S_{t+1}) (Everitt et al. 2021b).

## 5 Path-specific Objectives

We show how to train an agent in a way that removes instrumental control incentives (ICIs) over the delicate state even though the environment actually has unsafe incentives. This can be interpreted as creating an agent that does not intend to use the delicate state (Halpern and Kleiman-Weiner 2018).

To understand the causal effect that a variable X has on a variable Y along an edge-subgraph G′ of an SCM M, Pearl (2001, Definition 8) defines path-specific causal effects. Informally, the path-specific effect along G′ compares the outcome of Y under a default outcome x′ for X with the value that Y takes under a different outcome x, when the effect of the new value is propagated only along G′. Formally, restating their definition with our notation:

**Definition P8** (Pearl 2001). Let G be the causal graph associated with causal model M, and let G′ be an edge-subgraph of G containing the paths selected for effect analysis. The G′-specific effect of x on Y (relative to reference x′) is defined
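To build intuition for the informal description above, here is a minimal sketch of a G′-specific effect in a three-variable deterministic SCM, with X → Z → Y and a direct edge X → Y, where G′ keeps only the direct edge. The structural functions and names are illustrative assumptions, not the paper's; Z plays the role of the delicate mediator.

```python
# Path-specific effect sketch in a tiny deterministic SCM:
#   X -> Z -> Y   and   X -> Y.
# The edge-subgraph G' keeps only the direct edge X -> Y, so the change from
# the reference x_ref to x is NOT propagated through the mediator Z.

def f_Z(x: float) -> float:
    # The mediator (standing in for delicate state) responds to X.
    return 3.0 * x


def f_Y(x: float, z: float) -> float:
    # The outcome depends on X directly and on the mediator Z.
    return x + 2.0 * z


def total_effect(x: float, x_ref: float) -> float:
    # Ordinary causal effect: the change in X propagates along every path.
    return f_Y(x, f_Z(x)) - f_Y(x_ref, f_Z(x_ref))


def path_specific_effect(x: float, x_ref: float) -> float:
    # G'-specific effect with G' = {X -> Y}: along edges outside G'
    # (here X -> Z), X behaves as if it were still the reference x_ref.
    return f_Y(x, f_Z(x_ref)) - f_Y(x_ref, f_Z(x_ref))


print(total_effect(1.0, 0.0))          # 7.0: includes the mediated path
print(path_specific_effect(1.0, 0.0))  # 1.0: only the direct path X -> Y
```

Training an agent to maximise an analogue of path_specific_effect rather than total_effect is the sense in which a path-specific objective removes the incentive to act through the delicate mediator.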