# Agent Incentives: A Causal Perspective

Tom Everitt,*1 Ryan Carey,*2 Eric D. Langlois,*1,3,4 Pedro A. Ortega,1 Shane Legg1
1DeepMind, 2University of Oxford, 3University of Toronto, 4Vector Institute
tomeveritt@google.com, ry.duff@gmail.com, edl@cs.toronto.edu, pedroortega@google.com
*Equal contribution. Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

We present a framework for analysing agent incentives using causal influence diagrams. We establish that a well-known criterion for value of information is complete. We propose a new graphical criterion for value of control, establishing its soundness and completeness. We also introduce two new concepts for incentive analysis: response incentives indicate which changes in the environment affect an optimal decision, while instrumental control incentives establish whether an agent can influence its utility via a variable X. For both new concepts, we provide sound and complete graphical criteria. We show by example how these results can help with evaluating the safety and fairness of an AI system.

## Introduction

A recurring question in AI research is how to choose an objective that induces safe and fair behaviour (O'Neil 2016; Russell 2019). In a given setup, will an optimal policy depend on a sensitive attribute, or seek to influence an important variable? For example, consider the following two incentive design problems, to which we will return throughout the paper:

Example 1 (Grade prediction). To decide which applicants to admit, a university uses a model to predict the grades of new students. The university would like the system to predict accurately, without treating students differently based on their gender or race (see Figure 1a).

Example 2 (Content recommendation). An AI algorithm has the task of recommending a series of posts to a user. The designers want the algorithm to present content adapted to each user's interests to optimize clicks. However, they do not want the algorithm to use polarising content to manipulate the user into clicking more predictably (Figure 1b).

### Contributions

This paper provides a common language for incentive analysis, based on influence diagrams (Howard 1990) and causal models (Pearl 2009). Traditionally, influence diagrams have been used to help decision-makers make better decisions. Here, we invert the perspective, and use the diagrams to understand and predict the behaviour of machine learning systems trained to optimize an objective in a given environment. To facilitate this analysis, we prove a number of relevant theorems and introduce two new concepts:

Value of Information (VoI): First defined by Howard (1966); a graphical criterion for detecting positive VoI in influence diagrams was proposed and proven sound by Fagiuoli and Zaffalon (1998), Lauritzen and Nilsson (2001), and Shachter (2016). Here we offer the first correct completeness proof, showing that the graphical criterion is unique and cannot be further improved upon.

Value of Control (VoC): Defined by Shachter (1986), Matheson (1990), and Shachter and Heckerman (2010); an incomplete graphical criterion was discussed by Shachter (1986). Here we provide a complete graphical criterion, along with both soundness and completeness proofs.

Instrumental Control Incentive (ICI): We propose a refinement of VoC to nodes the agent can influence with its decision. Conceptually, this is a hybrid of VoC and responsiveness (Shachter 2016).
We offer a formal definition of instrumental control incentives based on nested counterfactuals, and establish a sound and complete graphical criterion.

Response Incentive (RI): Which changes in the environment does an optimal policy respond to? This is a central problem in fairness and AI safety (e.g. Kusner et al. 2017; Hadfield-Menell et al. 2017). Again, we give a formal definition, and a sound and complete graphical criterion.

Our analysis focuses on influence diagrams with a single decision. This single-decision setting is adequate to model supervised learning, (contextual) bandits, and the choice of a policy in an MDP. Previous work has also discussed ways to transform a multi-decision setting into a single-decision setting by imputing policies to later decisions (Shachter 2016).

### Applicability

This paper combines material from two preprints (Everitt et al. 2019c; Carey et al. 2020). Since the release of these preprints, the unified language of causal influence diagrams has already aided in the understanding of incentive problems such as an agent's redirectability, ambition, tendency to tamper with reward, and other properties (Armstrong et al. 2020; Holtman 2020; Cohen, Vellambi, and Hutter 2020; Everitt et al. 2019a,b; Langlois and Everitt 2021).

Figure 1 [(a) Fairness example: grade prediction; (b) Safety example: content recommendation; node types: structure node, decision node, utility node]: Two examples of decision problems represented as causal influence diagrams. In (a), a predictor at a hypothetical university aims to estimate a student's grade, using as inputs their gender and the high school they attended. We ask whether the predictor is incentivised to behave in a discriminatory manner with respect to the students' gender and race. In this hypothetical cohort of students, performance is assumed to be a function of the quality of the high-school education they received. A student's high school is assumed to be impacted by their race, and can affect the quality of their education. Gender, however, is assumed not to have an effect. In (b), the goal of a content recommendation system is to choose posts that will maximise the user's click rate. However, the system's designers prefer the system not to manipulate the user's opinions in order to obtain more clicks.

To analyse agents' incentives, we will need a graphical framework with the causal properties of a structural causal model and the node categories of an influence diagram. This section will define such a model after reviewing structural causal models and influence diagrams.

## Structural Causal Models

Structural causal models (SCMs; Pearl 2009) are a type of causal model where all randomness is consigned to exogenous variables, while deterministic structural functions relate the endogenous variables to each other and to the exogenous ones. As demonstrated by Pearl (2009), this structural approach has significant benefits over traditional causal Bayesian networks for analysing (nested) counterfactuals and individual-level effects.

Definition 1 (Structural causal model; Pearl 2009, Chapter 7). A structural causal model (with independent errors) is a tuple $\langle \mathbf{E}, \mathbf{V}, \mathcal{F}, P \rangle$, where $\mathbf{E}$ is a set of exogenous variables; $\mathbf{V}$ is a set of endogenous variables; and $\mathcal{F} = \{f^V\}_{V \in \mathbf{V}}$ is a collection of functions, one for each $V$.
Each function $f^V : \mathrm{dom}(\mathbf{Pa}^V \cup \{E^V\}) \to \mathrm{dom}(V)$ specifies the value of $V$ in terms of the values of the corresponding exogenous variable $E^V$ and endogenous parents $\mathbf{Pa}^V \subseteq \mathbf{V}$, where these functional dependencies are acyclic. The domain of a variable $V$ is $\mathrm{dom}(V)$, and for a set of variables, $\mathrm{dom}(\mathbf{W}) := \times_{W \in \mathbf{W}} \mathrm{dom}(W)$. The uncertainty is encoded through a probability distribution $P(\varepsilon)$ such that the exogenous variables are mutually independent.

For example, Figure 2b shows an SCM that models how posts ($D$) can influence a user's opinion ($O$) and clicks ($U$).

The exogenous variables $\mathbf{E}$ of an SCM represent factors that are not modelled. For any value $\mathbf{E} = \varepsilon$ of the exogenous variables, the value of any set of variables $\mathbf{W} \subseteq \mathbf{V}$ is given by recursive application of the structural functions $\mathcal{F}$ and is denoted by $\mathbf{W}(\varepsilon)$. Together with the distribution $P(\varepsilon)$ over exogenous variables, this induces a joint distribution
$$\Pr(\mathbf{W} = \mathbf{w}) = \sum_{\{\varepsilon \,\mid\, \mathbf{W}(\varepsilon) = \mathbf{w}\}} P(\varepsilon).$$

SCMs model causal interventions that set variables to particular values. These are defined via submodels:

Definition 2 (Submodel; Pearl 2009, Chapter 7). Let $\mathcal{M} = \langle \mathbf{E}, \mathbf{V}, \mathcal{F}, P \rangle$ be an SCM, $\mathbf{X}$ a set of variables in $\mathbf{V}$, and $\mathbf{x}$ a particular realization of $\mathbf{X}$. The submodel $\mathcal{M}_{\mathbf{x}}$ represents the effects of an intervention $\mathrm{do}(\mathbf{X} = \mathbf{x})$, and is formally defined as the SCM $\langle \mathbf{E}, \mathbf{V}, \mathcal{F}_{\mathbf{x}}, P \rangle$ where $\mathcal{F}_{\mathbf{x}} = \{f^V \mid V \notin \mathbf{X}\} \cup \{\mathbf{X} = \mathbf{x}\}$. That is to say, the original functional relationships of $X \in \mathbf{X}$ are replaced with the constant functions $X = x$.

More generally, a soft intervention on a variable $X$ in an SCM $\mathcal{M}$ replaces $f^X$ with a function $g^X : \mathrm{dom}(\mathbf{Pa}^X \cup \{E^X\}) \to \mathrm{dom}(X)$ (Eberhardt and Scheines 2007; Tian and Pearl 2001). The probability distribution $\Pr(\mathbf{W}_{g^X})$ on any $\mathbf{W} \subseteq \mathbf{V}$ is defined as the value of $\Pr(\mathbf{W})$ in the submodel $\mathcal{M}_{g^X}$, where $\mathcal{M}_{g^X}$ is $\mathcal{M}$ modified by replacing $f^X$ with $g^X$.

If $W$ is a variable in an SCM $\mathcal{M}$, then $W_{\mathbf{x}}$ refers to the same variable in the submodel $\mathcal{M}_{\mathbf{x}}$ and is called a potential response variable. In Figure 2b, the random variable $O$ represents user opinion under default circumstances, while $O_d$ in Figure 2c represents the user's opinion given an intervention $\mathrm{do}(D = d)$ on the content posted. Note also how the intervention on $D$ severs the link from $E^D$ to $D$ in Figure 2c, as the intervention on $D$ overrides the causal effect from $D$'s parents. Throughout this paper we use subscripts to indicate submodels or interventions, and superscripts for indexing.

More elaborate hypotheticals can be described with a nested counterfactual, in which the intervention is itself a potential response variable. In Figure 2c, the click probability $U$ depends on both the chosen posts $D$ and the user opinion $O$, which is in turn also influenced by $D$. The nested potential response variable $U_{O_d}$, defined by $U_{O_d}(\varepsilon) := U_o(\varepsilon)$ where $o = O_d(\varepsilon)$, represents the probability that a user clicks on a default post $D$ given that their opinion has been influenced by a hypothetical post $d$. In other words, the effect of the intervention $\mathrm{do}(D = d)$ is propagated to $U$ only through $O$.

## Causal Influence Diagrams

Influence diagrams are graphical models with special decision and utility nodes, developed to model decision-making problems (Howard 1990; Lauritzen and Nilsson 2001). Influence diagrams do not in general have causal semantics, although some causal structure can be inferred (Heckerman and Shachter 1995).
We will assume that the edges of the influence diagram reflect the causal structure of the environment, so we use the term "causal influence diagram".

Figure 2 [(a) a SCIM; (b) the induced SCM: Posts $D = \pi(E^D)$, Opinion $O = f^O(D, E^O)$, Clicks $U = f^U(D, O, E^U)$; (c) SCM with nested counterfactual: Posts $d = \text{apolitical}$, Opinion $O_d = f^O(d, E^O)$, Clicks $U_{O_d} = f^U(D, O_d, E^U)$; node types: exogenous, structural, intervened, decision, utility]: An example of a SCIM and interventions. In the SCIM, either political or apolitical posts $D$ are displayed. These affect the user's opinion $O$. $D$ and $O$ influence the user's clicks $U$ (a). Given a policy, the SCIM becomes an SCM (b). Interventions and counterfactuals may be defined in terms of this SCM. For example, the nested counterfactual $U_{O_d}$ represents the number of clicks if the user has the opinions that they would arrive at after viewing apolitical content (c).

Definition 3 (Causal influence diagram). A causal influence diagram (CID) is a directed acyclic graph $\mathcal{G}$ where the vertex set $\mathbf{V}$ is partitioned into structure nodes $\mathbf{X}$, decision nodes $\mathbf{D}$, and utility nodes $\mathbf{U}$. Utility nodes have no children.

We use $\mathbf{Pa}^V$ and $\mathbf{Desc}^V$ to denote the parents and descendants of a node $V \in \mathbf{V}$. The parents of the decision, $\mathbf{Pa}^D$, are also called observations. An edge from node $V$ to node $Y$ is denoted $V \to Y$. Edges into decisions are called information links, as they indicate what information is available at the time of the decision. A directed path (of length at least zero) is denoted $V \dashrightarrow Y$. For sets of variables, $\mathbf{V} \dashrightarrow \mathbf{Y}$ means that $V \dashrightarrow Y$ holds for some $V \in \mathbf{V}$, $Y \in \mathbf{Y}$.

## Structural Causal Influence Models

For our new incentive concepts, we define a hybrid of the influence diagram and the SCM. Such a model, originally proposed by Dawid (2002), has structure and utility nodes with associated functions, exogenous variables with an associated probability distribution, and decision nodes without any function at all, until one is selected by an agent.[1] This can be formalised as the structural causal influence model (SCIM, pronounced "skim").

Definition 4 (Structural causal influence model). A structural causal influence model (SCIM) is a tuple $\mathcal{M} = \langle \mathcal{G}, \mathbf{E}, \mathcal{F}, P \rangle$ where:

- $\mathcal{G}$ is a CID with finite-domain variables $\mathbf{V}$ (partitioned into $\mathbf{X}$, $\mathbf{D}$, and $\mathbf{U}$) where utility variable domains are a subset of $\mathbb{R}$. We say that $\mathcal{M}$ is compatible with $\mathcal{G}$.
- $\mathbf{E} = \{E^V\}_{V \in \mathbf{V}}$ is a set of finite-domain exogenous variables, one for each endogenous variable.
- $\mathcal{F} = \{f^V\}_{V \in \mathbf{V} \setminus \mathbf{D}}$ is a set of structural functions $f^V : \mathrm{dom}(\mathbf{Pa}^V \cup \{E^V\}) \to \mathrm{dom}(V)$ that specify how each non-decision endogenous variable depends on its parents in $\mathcal{G}$ and its associated exogenous variable.
- $P$ is a probability distribution for $\mathbf{E}$ such that the individual exogenous variables $E^V$ are mutually independent.

[1] Dawid called this a "functional influence diagram". We favour the term SCIM, because the corresponding term SCM is more prevalent than "functional model".

We will restrict our attention to single-decision settings with $\mathbf{D} = \{D\}$. An example of such a SCIM for the content recommendation example is shown in Figure 2a. In single-decision SCIMs, the decision-making task is to maximize expected utility by selecting a decision $d \in \mathrm{dom}(D)$ based on the observations $\mathbf{Pa}^D$. More formally, the task is to select a structural function for $D$ in the form of a policy $\pi : \mathrm{dom}(\mathbf{Pa}^D \cup \{E^D\}) \to \mathrm{dom}(D)$. The exogenous variable $E^D$ provides randomness to allow the policy to be a stochastic function of its endogenous parents $\mathbf{Pa}^D$.
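To make the SCIM of Figure 2 concrete, here is a minimal Python sketch of our own (not the authors' code): it encodes illustrative structural functions and exogenous distributions for the content-recommendation example (the specific domains and probabilities are assumptions made for the sake of the example), completes the model with a policy to obtain the SCM of Figure 2b, and evaluates both the default expected clicks and the nested counterfactual of Figure 2c by enumerating exogenous settings.

```python
import itertools

# Exogenous variables and their distributions (hypothetical numbers).
# E_D randomises the policy, E_O the user's opinion, E_U the click noise.
P_E = {
    "E_D": {0: 0.5, 1: 0.5},
    "E_O": {0: 0.7, 1: 0.3},
    "E_U": {0: 0.9, 1: 0.1},
}

# Structural functions of the SCIM (every endogenous node except the decision D).
def f_O(d, e_o):          # opinion: political posts polarise unless e_o resists
    return "polarised" if (d == "political" and e_o == 0) else "neutral"

def f_U(d, o, e_u):       # clicks: polarised users and apolitical posts get clicks
    if e_u == 1:
        return 0
    return 1 if (o == "polarised" or d == "apolitical") else 0

# A policy turns the SCIM into an SCM by supplying the missing function for D.
def pi(e_d):
    return "political" if e_d == 0 else "apolitical"

def expectation(value_fn):
    """Sum over all exogenous settings eps, weighting by P(eps)."""
    total = 0.0
    for e_d, e_o, e_u in itertools.product(P_E["E_D"], P_E["E_O"], P_E["E_U"]):
        p = P_E["E_D"][e_d] * P_E["E_O"][e_o] * P_E["E_U"][e_u]
        total += p * value_fn(e_d, e_o, e_u)
    return total

def U(e_d, e_o, e_u):
    """Default clicks U(eps) under the policy pi."""
    d = pi(e_d)
    return f_U(d, f_O(d, e_o), e_u)

def U_O_d(e_d, e_o, e_u, d_hyp="apolitical"):
    """Nested counterfactual U_{O_d}(eps): opinion is set to O_d(eps),
    while D keeps its default value pi(e_d)."""
    o_d = f_O(d_hyp, e_o)            # O_d(eps): opinion under do(D = d_hyp)
    return f_U(pi(e_d), o_d, e_u)    # default D, counterfactual opinion

print("E[U]       =", expectation(U))
print("E[U_{O_d}] =", expectation(U_O_d))
```

Because the intervened opinion ignores the actually chosen posts while the decision keeps its default value, the gap between the two printed expectations is exactly the effect of the decision that propagates to the clicks only through the opinion.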
The specification of a policy turns a SCIM $\mathcal{M}$ into an SCM $\mathcal{M}_\pi := \langle \mathbf{E}, \mathbf{V}, \mathcal{F} \cup \{\pi\}, P \rangle$; see Figure 2b. With the resulting SCM, the standard definitions of causal interventions apply. Note that what determines whether a node is observed or not at the time of decision-making is whether the node is a parent of the decision. Commonly, some structure nodes represent latent variables that are unobserved. We use $\Pr_\pi$ and $\mathbb{E}_\pi$ to denote probabilities and expectations with respect to $\mathcal{M}_\pi$. For a set of variables $\mathbf{X}$ not in $\mathbf{Desc}^D$, $\Pr_\pi(\mathbf{x})$ is independent of $\pi$ and we simply write $\Pr(\mathbf{x})$. An optimal policy for a SCIM is defined as any policy $\pi$ that maximises $\mathbb{E}_\pi[\mathcal{U}]$, where $\mathcal{U} := \sum_{U \in \mathbf{U}} U$. A potential response $\mathcal{U}_{\mathbf{x}}$ is defined as $\mathcal{U}_{\mathbf{x}} := \sum_{U \in \mathbf{U}} U_{\mathbf{x}}$.

## Materiality

Next, we review a characterization of which observations are material for optimal performance, as this will be a fundamental building block for most of our theory.[2]

[2] In contrast to subsequent sections, the results in this section and the VoI section do not require the influence diagrams to be causal.

Definition 5 (Materiality; Shachter 2016). For any given SCIM $\mathcal{M}$, let $V(\mathcal{M}) = \max_\pi \mathbb{E}_\pi[\mathcal{U}]$ be the maximum attainable utility in $\mathcal{M}$, and let $\mathcal{M}_{X \not\to D}$ be $\mathcal{M}$ modified by removing any information link $X \to D$. The observation $X \in \mathbf{Pa}^D$ is material if $V(\mathcal{M}_{X \not\to D}) < V(\mathcal{M})$.

Nodes may often be identified as immaterial based on the graphical structure alone (Fagiuoli and Zaffalon 1998; Lauritzen and Nilsson 2001; Shachter 2016). The graphical criterion uses the notion of d-separation.

Definition 6 (d-separation; Verma and Pearl 1988). A path $p$ is said to be d-separated by a set of nodes $\mathbf{Z}$ if and only if:
1. $p$ contains a collider $X \to W \leftarrow Y$ such that the middle node $W$ is not in $\mathbf{Z}$ and no descendants of $W$ are in $\mathbf{Z}$, or
2. $p$ contains a chain $X \to W \to Y$ or fork $X \leftarrow W \to Y$ where $W$ is in $\mathbf{Z}$, or
3. one or both of the endpoints of $p$ are in $\mathbf{Z}$.

A set $\mathbf{Z}$ is said to d-separate $\mathbf{X}$ from $\mathbf{Y}$, written $(\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z})$, if and only if $\mathbf{Z}$ d-separates every path from a node in $\mathbf{X}$ to a node in $\mathbf{Y}$. Sets that are not d-separated are called d-connected.

According to the graphical criterion of Fagiuoli and Zaffalon (1998), an observation cannot provide useful information if it is d-separated from utility, conditional on the other observations. This condition is called nonrequisiteness.

Definition 7 (Nonrequisite observation; Lauritzen and Nilsson 2001). Let $\mathbf{U}^D := \mathbf{U} \cap \mathbf{Desc}^D$ be the utility nodes downstream of $D$. An observation $X \in \mathbf{Pa}^D$ in a single-decision CID $\mathcal{G}$ is nonrequisite if
$$X \perp \mathbf{U}^D \mid \mathbf{Pa}^D \cup \{D\} \setminus \{X\}. \quad (1)$$
In this case, the edge $X \to D$ is also called nonrequisite. Otherwise $X$ and $X \to D$ are requisite.

For example, in Figure 3a, high school is a requisite observation while gender is not.

## Value of Information

Materiality can be generalized to nodes that are not observed, to assess which variables a decision-maker would benefit from knowing before making a decision, i.e. which variables have VoI (Howard 1966; Matheson 1990). To assess VoI for a variable $X$, we first make $X$ an observation by adding a link $X \to D$, and then test whether $X$ is material in the updated model (Shachter 2016).

Definition 8 (Value of information). A node $X \in \mathbf{V} \setminus \mathbf{Desc}^D$ in a single-decision SCIM $\mathcal{M}$ has VoI if it is material in the model $\mathcal{M}_{X \to D}$ obtained by adding the edge $X \to D$ to $\mathcal{M}$. A CID $\mathcal{G}$ admits VoI for $X$ if $X$ has VoI in a SCIM $\mathcal{M}$ compatible with $\mathcal{G}$.

Since Definition 8 adds an information link, it can only be applied to non-descendants of the decision, lest cycles be created in the graph. Fortunately, the structural functions need not be adapted for the added link, since there is no structural function associated with $D$.
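The requisiteness condition (1), and the VoI test obtained by first adding an information link, are purely graphical, so they can be checked mechanically. Below is an illustrative sketch of our own encoding of the grade-prediction graph of Figures 1a and 3a; the node names, the edge list, and the helpers `is_requisite` and `admits_voi` are assumptions rather than the paper's code, and we rely on NetworkX's d-separation test (named `nx.d_separated` in older releases and `nx.is_d_separator` in newer ones).

```python
import networkx as nx

# Use whichever d-separation function this NetworkX version provides.
d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated

# Grade-prediction CID from Figure 1a / 3a (our reading of the diagram).
G = nx.DiGraph([
    ("Race", "HighSchool"),
    ("HighSchool", "Education"),
    ("HighSchool", "GradePrediction"),    # information link
    ("Gender", "GradePrediction"),        # information link
    ("Education", "Grade"),
    ("Grade", "Accuracy"),
    ("GradePrediction", "Accuracy"),
])
decision, utilities = "GradePrediction", {"Accuracy"}

def is_requisite(graph, obs, decision, utilities):
    """Definition 7: obs is requisite iff it is d-connected to the utility
    nodes downstream of the decision, given the other observations and D."""
    downstream_utils = utilities & nx.descendants(graph, decision)
    if not downstream_utils:
        return False
    cond = (set(graph.predecessors(decision)) | {decision}) - {obs}
    return not d_sep(graph, {obs}, downstream_utils, cond)

def admits_voi(graph, node, decision, utilities):
    """VoI test: add the edge node -> D, then check requisiteness there."""
    if node == decision or node in nx.descendants(graph, decision):
        raise ValueError("VoI is only defined for non-descendants of D")
    G2 = graph.copy()
    G2.add_edge(node, decision)
    return is_requisite(G2, node, decision, utilities)

print(is_requisite(G, "HighSchool", decision, utilities))  # True  (requisite)
print(is_requisite(G, "Gender", decision, utilities))      # False (nonrequisite)
print(admits_voi(G, "Education", decision, utilities))     # True
print(admits_voi(G, "Race", decision, utilities))          # False
```

Under this encoding, high school is requisite while gender is not, and race has no VoI, matching the discussion of Figure 3a in the text.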
We prove that the graphical criterion of Definition 7 is tight for both materiality and VoI, in that it identifies every zero-VoI node that can be identified from the graphical structure (in a single-decision setting).

Theorem 9 (Value of information criterion). A single-decision CID $\mathcal{G}$ admits VoI for $X \in \mathbf{V} \setminus \mathbf{Desc}^D$ if and only if $X$ is a requisite observation in $\mathcal{G}_{X \to D}$, the graph obtained by adding $X \to D$ to $\mathcal{G}$.

The soundness direction (i.e. the "only if" direction) follows from d-separation (Fagiuoli and Zaffalon 1998; Lauritzen and Nilsson 2001; Shachter 2016). In contrast, the completeness direction does not follow from the completeness property of d-separation. The d-connectedness of $X$ to $\mathbf{U}$ implies that $\mathbf{U}$ may be conditionally dependent on $X$. It does not imply, however, that the expectation of $\mathcal{U}$, or the utility attainable under an optimal policy, will change. Instead, our proof in the supplementary material[3] constructs a SCIM such that $X$ is material. This differs from a previous attempt by Nielsen and Jensen (1999), as discussed in Related Work.

We apply the graphical criterion to the grade prediction example in Figure 3a. One can see that the predictor has an incentive to use the incoming student's high school but not their gender. This makes intuitive sense, given that gender provides no information useful for predicting the university grade in this example.

## Response Incentives

There are two ways to understand a material observation. One is that it provides useful information. From this perspective, a natural generalisation is VoI, as described in the previous section. An alternative perspective is that a material observation is one that influences optimal decisions. Under this interpretation, the natural generalisation is the set of all (observed and unobserved) variables that influence the decision. We say that these variables have a response incentive.[4]

Definition 10 (Response incentive). Let $\mathcal{M}$ be a single-decision SCIM. A policy $\pi$ responds to a variable $X \in \mathbf{X}$ if there exists some intervention $\mathrm{do}(X = x)$ and some setting $\mathbf{E} = \varepsilon$ such that $D_x(\varepsilon) \neq D(\varepsilon)$. The variable $X$ has a response incentive if all optimal policies respond to $X$. A CID admits a response incentive on $X$ if it is compatible with a SCIM that has a response incentive on $X$.

For a response incentive on $X$ to be possible, there must be: i) a directed path $X \dashrightarrow D$, and ii) an incentive for $D$ to use information from that path. For example, in Figure 3a, gender has a directed path to the decision, but it does not provide any information about the likely grade, so there is no response incentive. The graphical criterion for RI builds on a modified graph with nonrequisite information links removed.

Definition 11 (Minimal reduction; Lauritzen and Nilsson 2001). The minimal reduction $\mathcal{G}^{\min}$ of a single-decision CID $\mathcal{G}$ is the result of removing from $\mathcal{G}$ all information links from nonrequisite observations.

The presence (or absence) of a path $X \dashrightarrow D$ in the minimal reduction tells us whether a response incentive can occur.

Theorem 12 (Response incentive criterion). A single-decision CID $\mathcal{G}$ admits a response incentive on $X \in \mathbf{X}$ if and only if the minimal reduction $\mathcal{G}^{\min}$ has a directed path $X \dashrightarrow D$.
[3] Available at https://arxiv.org/abs/2102.01685
[4] The term "responsiveness" (Heckerman and Shachter 1995; Shachter 2016) has a related but not identical meaning: it refers to whether a decision $D$ affects a variable $X$, rather than whether $X$ affects $D$.

Figure 3 [(a) Admits a response incentive on race; (b) Admits no response incentive on race]: In (a), the admissible incentives of the grade prediction example from Figure 1a are shown, including a response incentive on race. In (b), the predictor no longer has access to the students' high school, and hence there can no longer be any response incentive on race.

Proof. The "if" (completeness) direction is proved in the supplementary material. For the soundness direction, assume that for $\mathcal{G}$, the minimal reduction $\mathcal{G}^{\min}$ does not contain a directed path $X \dashrightarrow D$. Let $\mathcal{M} = \langle \mathcal{G}, \mathbf{E}, \mathcal{F}, P \rangle$ be any SCIM compatible with $\mathcal{G}$. Let $\mathcal{M}^{\min} = \langle \mathcal{G}^{\min}, \mathbf{E}, \mathcal{F}, P \rangle$ be $\mathcal{M}$, but with the minimal reduction $\mathcal{G}^{\min}$. By Lemma 25 in the supplementary material, there exists a $\mathcal{G}^{\min}$-respecting policy $\tilde\pi$ that is optimal in $\mathcal{M}$. In $\mathcal{M}^{\min}_{\tilde\pi}$, $X$ is causally irrelevant for $D$, so $D(\varepsilon) = D_x(\varepsilon)$. Furthermore, $\mathcal{M}_{\tilde\pi}$ and $\mathcal{M}^{\min}_{\tilde\pi}$ are the same SCM, with the functions $\mathcal{F} \cup \{\tilde\pi\}$. So $D(\varepsilon) = D_x(\varepsilon)$ also in $\mathcal{M}_{\tilde\pi}$, which means that there is an optimal policy in $\mathcal{M}$ that does not respond to interventions on $X$, for any $\varepsilon$.

The intuition behind the proof is that an optimal decision only responds to effects that propagate to one of its requisite observations. For the completeness direction, we show in the supplementary material that if $X \dashrightarrow D$ is present in the minimal reduction $\mathcal{G}^{\min}$, then we can select a SCIM $\mathcal{M}$ compatible with $\mathcal{G}$ such that $D$ receives useful information along that path, which any optimal policy must respond to.

In a safety setting, it may be desirable for an AI system to have an incentive to respond to its shutdown button, so that when asked to shut down, it does so (Hadfield-Menell et al. 2017). In a fairness setting, on the other hand, a response incentive may be a cause for concern, as illustrated next.

### Incentivised unfairness

Response incentives are closely related to counterfactual fairness (Kusner et al. 2017; Kilbertus et al. 2017). A prediction, or more generally a decision, is considered counterfactually unfair if a change to a sensitive attribute like race or gender would change the decision.

Definition 13 (Counterfactual fairness; Kusner et al. 2017). A policy $\pi$ is counterfactually fair with respect to a sensitive attribute $A$ if
$$\Pr_\pi\left(D_{a'} = d \mid \mathbf{pa}^D, a\right) = \Pr_\pi\left(D = d \mid \mathbf{pa}^D, a\right)$$
for every decision $d \in \mathrm{dom}(D)$, every context $\mathbf{pa}^D \in \mathrm{dom}(\mathbf{Pa}^D)$, and every pair of attributes $a, a' \in \mathrm{dom}(A)$ with $\Pr(\mathbf{pa}^D, a) > 0$.

A response incentive on a sensitive attribute indicates that counterfactual unfairness is incentivised, as it implies that all optimal policies are counterfactually unfair:

Theorem 14 (Counterfactual fairness and response incentives). In a single-decision SCIM $\mathcal{M}$ with a sensitive attribute $A \in \mathbf{X}$, all optimal policies $\pi^*$ are counterfactually unfair with respect to $A$ if and only if $A$ has a response incentive.

The proof is given in the supplementary material. A response incentive on a sensitive attribute means that counterfactual unfairness is not just possible, but incentivised. As a result, it has a more restrictive graphical criterion.
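Continuing the earlier sketch (and still under the same assumptions about node names and edges; `G`, `decision`, `utilities`, and `is_requisite` refer to the previous code block), the minimal reduction of Definition 11 and the response-incentive criterion of Theorem 12 amount to a few more lines, here applied to illustrative encodings of Figures 3a and 3b.

```python
import networkx as nx

def minimal_reduction(graph, decision, utilities):
    """Definition 11: drop information links from nonrequisite observations."""
    G_min = graph.copy()
    for obs in list(graph.predecessors(decision)):
        if not is_requisite(graph, obs, decision, utilities):
            G_min.remove_edge(obs, decision)
    return G_min

def admits_response_incentive(graph, node, decision, utilities):
    """Theorem 12: a directed path node -->...--> D in the minimal reduction."""
    G_min = minimal_reduction(graph, decision, utilities)
    return nx.has_path(G_min, node, decision)

# Figure 3a: the predictor observes gender and high school.
fig3a = G  # the graph defined in the previous sketch
# Figure 3b: the high-school information link is removed.
fig3b = G.copy()
fig3b.remove_edge("HighSchool", "GradePrediction")

for name, graph in [("Fig 3a", fig3a), ("Fig 3b", fig3b)]:
    for attr in ["Race", "Gender"]:
        print(name, attr,
              admits_response_incentive(graph, attr, decision, utilities))
# Expected: Fig 3a admits a response incentive on race (via high school) but
# not on gender; Fig 3b admits a response incentive on neither attribute.
```

By Theorem 14, the admitted response incentive on race in the first graph is exactly the case in which counterfactual unfairness is incentivised rather than merely possible.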
The graphical criterion for counterfactual fairness states that a decision can only be counterfactually unfair with respect to a sensitive attribute if that attribute is an ancestor of the decision (Kusner et al. 2017, Lemma 1). For example, in the grade prediction example of Figure 3a, it is possible for a predictor to be counterfactually unfair with respect to either gender or race, because both are ancestors of the decision. The response incentive criterion can tell us in which case counterfactual unfairness is actually incentivised. In this example, the minimal reduction includes the edge from high school to predicted grade, and hence the directed path from race to predicted grade. However, it excludes the edge from gender to predicted grade. This means that the agent is incentivised to be counterfactually unfair with respect to race but not to gender.

Based on this, how should the system be redesigned? According to the response incentive criterion, the most important change is to remove the path from race to predicted grade in the minimal reduction. This can be done by removing the agent's access to high school. This change is implemented in Figure 3b, where there is no response incentive on either sensitive variable.

Value of information is also related to fairness. For a sensitive variable that is not a parent of the decision, positive VoI means that if the predictor gained access to its value, then the predictor would use it. For example, if in Figure 3b an edge is added from race to predicted grade, then unfair behaviour will result. In practice, such access can result from unanticipated correlations between the sensitive attribute and parents of the decision, rather than the system being given direct access to the attribute. Analysing VoI may help detect such problems at an early stage. However, VoI is less closely related to counterfactual fairness than response incentives. In particular, race lacks VoI in Figure 3a, but counterfactual unfairness is incentivised. On the other hand, Figure 3b admits positive VoI for race, but counterfactual unfairness is not incentivised.

The incentive approach is not restricted to counterfactual fairness. For any fairness definition, one could assess whether that kind of unfairness is incentivised by checking whether it is present under all optimal policies.

## Value of Control

A variable has VoC if a decision-maker could benefit from setting its value (Shachter 1986; Matheson 1990; Shachter and Heckerman 2010). Concretely, we ask whether the attainable utility can be increased by letting the agent decide the structural function for the variable.

Definition 15 (Value of control). In a single-decision SCIM $\mathcal{M}$, a non-decision node $X$ has positive value of control if
$$\max_\pi \mathbb{E}_\pi[\mathcal{U}] < \max_{\pi, g^X} \mathbb{E}_\pi[\mathcal{U}_{g^X}]$$
where $g^X : \mathrm{dom}(\mathbf{Pa}^X \cup \{E^X\}) \to \mathrm{dom}(X)$ is a soft intervention at $X$, i.e. a new structural function for $X$ that respects the graph. A CID $\mathcal{G}$ admits positive value of control for $X$ if there exists a SCIM $\mathcal{M}$ compatible with $\mathcal{G}$ where $X$ has positive value of control.

This can be deduced from the graph, using again the minimal reduction (Definition 11) to rule out effects through observations that an optimal policy can ignore.

Theorem 16 (Value of control criterion). A single-decision CID $\mathcal{G}$ admits positive value of control for a node $X \in \mathbf{V} \setminus \{D\}$ if and only if there is a directed path $X \dashrightarrow \mathbf{U}$ in the minimal reduction $\mathcal{G}^{\min}$.

Proof. The "if" (completeness) direction is proved in the supplementary material. The proof of "only if" (soundness) is as follows.
Let $\mathcal{M} = \langle \mathcal{G}, \mathbf{E}, \mathcal{F}, P \rangle$ be a single-decision SCIM. Let $\mathcal{M}_{g^X}$ be $\mathcal{M}$, but with the structural function $f^X$ replaced with $g^X$. Let $\mathcal{M}^{\min}$ and $\mathcal{M}^{\min}_{g^X}$ be the same SCIMs, respectively, but replacing each graph with the minimal reduction $\mathcal{G}^{\min}$. Recall that $\mathbb{E}_\pi[\mathcal{U}_{g^X}]$ is defined by applying the soft intervention $g^X$ to the (policy-completed) SCM $\mathcal{M}_\pi$. However, this is equivalent to applying the policy $\pi$ to the modified SCIM $\mathcal{M}_{g^X}$, as the resulting SCMs are identical. Since $\mathcal{M}_{g^X}$ is a SCIM, Lemma 25 in the supplementary material can be applied to find a $\mathcal{G}^{\min}$-respecting optimal policy $\tilde\pi$ for $\mathcal{M}_{g^X}$. Consider now the expected utility under an arbitrary intervention $g^X$ for a policy $\tilde\pi$ optimal for $\mathcal{M}_{g^X}$:

$\mathbb{E}_\pi[\mathcal{U}_{g^X}]$ in $\mathcal{M}$
$= \mathbb{E}_\pi[\mathcal{U}]$ in $\mathcal{M}_{g^X}$ (by SCM equivalence)
$\le \mathbb{E}_{\tilde\pi}[\mathcal{U}]$ in $\mathcal{M}_{g^X}$ (by Lemma 25)
$= \mathbb{E}_{\tilde\pi}[\mathcal{U}]$ in $\mathcal{M}^{\min}_{g^X}$ (since $\tilde\pi$ is $\mathcal{G}^{\min}$-respecting)
$= \mathbb{E}_{\tilde\pi}[\mathcal{U}]$ in $\mathcal{M}^{\min}$ (by Lemma 23)
$= \mathbb{E}_{\tilde\pi}[\mathcal{U}]$ in $\mathcal{M}$ (only increasing the policy set)
$\le \max_{\pi'} \mathbb{E}_{\pi'}[\mathcal{U}]$ in $\mathcal{M}$ (max dominates all elements).

This shows that $X$ must lack value of control.

The proof of the completeness direction in the supplementary material establishes that if such a path exists, then a SCIM can be selected where the intervention on $X$ can either directly control $\mathcal{U}$ or increase the useful information available at $D$.

To apply this criterion to the content recommendation example (Figure 4a), we first obtain the minimal reduction, which is identical to the original graph. Since all non-decision nodes are upstream of the utility in the minimal reduction, they all admit positive VoC. Notably, this includes nodes like original user opinions and model of user opinions that the decision has no ability to control according to the graphical structure. In the next section, we propose instrumental control incentives, which incorporate the agent's limitations.

## Instrumental Control Incentive

Would an agent use its decision to control a variable $X$? This question has two parts: whether $X$ is useful to control (VoC), and whether $X$ is possible to control (responsiveness). As described in the previous section, VoC uses $\mathcal{U}_{g^X}$ to consider the utility attainable from arbitrary control of $X$. Meanwhile, $X_d$ describes the way $X$ can be controlled by $D$. These notions can be combined with a nested counterfactual $\mathcal{U}_{X_d}$, which expresses the effect that $D$ can have on $\mathcal{U}$ by controlling $X$.

Definition 17 (Instrumental control incentive). In a single-decision SCIM $\mathcal{M}$, there is an instrumental control incentive on a variable $X$ in decision context $\mathbf{pa}^D$ if, for all optimal policies $\pi^*$,
$$\mathbb{E}_{\pi^*}\left[\mathcal{U}_{X_d} \mid \mathbf{pa}^D\right] \neq \mathbb{E}_{\pi^*}\left[\mathcal{U} \mid \mathbf{pa}^D\right] \quad (2)$$
for some decision $d \in \mathrm{dom}(D)$.

Conceptually, an instrumental control incentive can be interpreted as follows: if the agent got to choose $D$ to influence $X$ independently of how $D$ influences other aspects of the environment, would that choice matter? We call it an instrumental control incentive, as the control of $X$ is a tool for achieving utility (cf. instrumental goals; Omohundro 2008; Bostrom 2014). ICIs do not consider side-effects of the optimal policy: for instance, it may be that all optimal policies affect $X$ in a particular way, even if $X$ is not an ancestor of any utility node; in such cases, no ICI is present. Finally, in Pearl's (2001) terminology, an instrumental control incentive corresponds to a natural indirect effect from $D$ to $\mathcal{U}$ via $X$ in $\mathcal{M}_{\pi^*}$, for all optimal policies $\pi^*$.

A CID $\mathcal{G}$ admits an instrumental control incentive on $X$ if $\mathcal{G}$ is compatible with a SCIM $\mathcal{M}$ with an instrumental control incentive on $X$ for some decision context $\mathbf{pa}^D$. The following theorem gives a sound and complete graphical criterion for which CIDs admit instrumental control incentives.
Theorem 18 (Instrumental control incentive criterion). A single-decision CID $\mathcal{G}$ admits an instrumental control incentive on $X \in \mathbf{V}$ if and only if $\mathcal{G}$ has a directed path from the decision $D$ to a utility node $U \in \mathbf{U}$ that passes through $X$, i.e. a directed path $D \dashrightarrow X \dashrightarrow U$.

Proof. Completeness (the "if" direction) is proved in the supplementary material. The proof of soundness is as follows. Let $\mathcal{M}$ be any SCIM compatible with $\mathcal{G}$ and $\pi$ any policy for $\mathcal{M}$. We consider variables in the SCM $\mathcal{M}_\pi$. If there is no directed path $D \dashrightarrow X \dashrightarrow U$ in $\mathcal{G}$, then either there is no directed path $D \dashrightarrow X$ or no directed path $X \dashrightarrow U$. If there is no path $D \dashrightarrow X$, then $X_d(\varepsilon) = X(\varepsilon)$ for any setting $\varepsilon \in \mathrm{dom}(\mathbf{E})$ and decision $d$ (Lemma 20 in the supplementary material). Therefore, $U(\varepsilon) = U_{X_d}(\varepsilon)$. Similarly, if there is no path $X \dashrightarrow U$, then $U(\varepsilon) = U_x(\varepsilon)$ for every setting $\varepsilon \in \mathrm{dom}(\mathbf{E})$, $x \in \mathrm{dom}(X)$, and $U \in \mathbf{U}$, so $U(\varepsilon) = U_{X_d}(\varepsilon)$. In either case, $\mathbb{E}_\pi[\mathcal{U} \mid \mathbf{pa}^D] = \mathbb{E}_\pi[\mathcal{U}_{X_d} \mid \mathbf{pa}^D]$ and there is no instrumental control incentive on $X$.

The logic behind the soundness proof above is that if there is no path from $D$ to $X$ to $U$, then $D$ cannot have any effect on $U$ via $X$. For the completeness direction proved in the supplementary material, we show how to construct a SCIM so that $\mathcal{U}_{X_d}$ differs from the non-intervened $\mathcal{U}$ for any diagram with a path $D \dashrightarrow X \dashrightarrow U$.

Figure 4 [(a) Admits an instrumental control incentive on user opinion; (b) Admits no instrumental control incentive on user opinion, with the utility replaced by predicted clicks; VoC and ICI markers shown in the diagram]: In (a), the content recommendation example from Figure 1b is shown to admit an instrumental control incentive on user opinion. This is avoided in (b) with a change to the objective.

Let us apply this criterion to the content recommendation example in Figure 4a. The only nodes $X$ in this graph that lie on a path $D \dashrightarrow X \dashrightarrow U$ are clicks and influenced user opinions. Since influenced user opinions has an instrumental control incentive, the agent may seek to influence that variable in order to attain utility. For example, it may be easier to predict what content a more emotional user will click on, and therefore a recommender may achieve a higher click rate by introducing posts that induce strong emotions.

How could we instead design the agent to maximise clicks without manipulating the user's opinions (i.e. without an instrumental control incentive on influenced user opinions)? As shown in Figure 4b, we could redesign the system so that instead of being rewarded for the true click rate, it is rewarded for the clicks it would be predicted to have, based on a separately trained model of the user's preferences. An agent trained in this way would view any modification of user opinions as irrelevant for improving its performance; however, it would still have an instrumental control incentive for predicted clicks, so it would still deliver desired content. To avoid undesirable behaviour in practice, the click prediction must truly predict whether the original user would click the content, rather than baking in the effect of changes to the user's opinion from reading earlier posts. This could be accomplished, for instance, by training a model to predict how many clicks each post would receive if it were offered individually.

This dynamic is related to concerns about the long-term safety of AI systems. For example, Russell (2019) has hypothesised that an advanced AI system would seek to manipulate its objective function (or human overseer) to obtain reward. This can be understood as an instrumental control incentive on the objective function (or the overseer's behaviour). A better understanding of incentives could therefore be relevant for designing safe systems in both the short and long term.
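Value of control (Theorem 16) and instrumental control incentives (Theorem 18) likewise reduce to directed-path checks, the former in the minimal reduction and the latter in the graph itself. The sketch below applies both to our own illustrative encodings of the Figure 4a and 4b graphs; the edge lists are assumptions read off the paper's description, and `minimal_reduction` comes from the response-incentive sketch above.

```python
import networkx as nx

def admits_voc(graph, node, decision, utilities):
    """Theorem 16: a directed path node -->...--> some utility node
    in the minimal reduction."""
    G_min = minimal_reduction(graph, decision, utilities)
    return any(nx.has_path(G_min, node, u) for u in utilities)

def admits_ici(graph, node, decision, utilities):
    """Theorem 18: a directed path D -->...--> node -->...--> some utility
    node in the graph itself."""
    return nx.has_path(graph, decision, node) and any(
        nx.has_path(graph, node, u) for u in utilities)

# Figure 4a (our reading of the diagram): clicks are the utility.
fig4a = nx.DiGraph([
    ("OriginalOpinions", "ModelOfOpinions"),
    ("ModelOfOpinions", "PostsToShow"),         # information link
    ("OriginalOpinions", "InfluencedOpinions"),
    ("PostsToShow", "InfluencedOpinions"),
    ("InfluencedOpinions", "Clicks"),
    ("PostsToShow", "Clicks"),
])
# Figure 4b (our reading): the utility is predicted clicks from the opinion model.
fig4b = nx.DiGraph([
    ("OriginalOpinions", "ModelOfOpinions"),
    ("ModelOfOpinions", "PostsToShow"),
    ("ModelOfOpinions", "PredictedClicks"),
    ("PostsToShow", "PredictedClicks"),
    ("OriginalOpinions", "InfluencedOpinions"),
    ("PostsToShow", "InfluencedOpinions"),
])

print(admits_voc(fig4a, "OriginalOpinions", "PostsToShow", {"Clicks"}))        # True
print(admits_ici(fig4a, "InfluencedOpinions", "PostsToShow", {"Clicks"}))      # True
print(admits_ici(fig4b, "InfluencedOpinions", "PostsToShow",
                 {"PredictedClicks"}))                                         # False
print(admits_ici(fig4b, "PredictedClicks", "PostsToShow",
                 {"PredictedClicks"}))                                         # True
```

Under these encodings, original user opinions admit VoC but not an ICI in Figure 4a, influenced user opinions admit an ICI in Figure 4a but not in Figure 4b, and predicted clicks retain an ICI in Figure 4b, mirroring the discussion in the text.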
## Related Work

Causal influence diagrams. Jern and Kemp (2011) and Kleiman-Weiner et al. (2015) define influence diagrams with causal edges, and similarly use them to model decision-making by rational agents (although they are less formal than us, and focus on human decision-making). An informal precursor of the SCIM that also used structural functions (as opposed to conditional probability distributions) was the functional influence diagram (Dawid 2002). The most similar alternative model is the Howard canonical form influence diagram (Howard 1990; Heckerman and Shachter 1995). However, this only permits counterfactual reasoning downstream of decisions, which is inadequate for defining the response incentive. Similarly, the causality property for influence diagrams introduced by Heckerman and Shachter (1994) and Shachter and Heckerman (2010) only constrains the relationships to be partially causal downstream of the decision (though adding new decision-node parents to all nodes makes the diagram fully causal). Appendix A in the supplementary material shows by example why the stronger causality property is necessary for most of our incentive concepts. An open-source Python implementation of CIDs has recently been developed[5] (Fox et al. 2021).

Value of information and control. Theorems 9 and 16 for value of information and value of control build on previous work. The concepts were first introduced by Howard (1966) and Shachter (1986), respectively. The VoI soundness proof follows previous proofs (Shachter 1998; Lauritzen and Nilsson 2001), while the VoI completeness proof is most similar to an attempted proof by Nielsen and Jensen (1999). They propose the criterion $X \not\perp \mathbf{U}^D \mid \mathbf{Pa}^D$ for requisite nodes, which differs from (1) in the conditioned set. Taken literally,[6] their criterion is unsound for requisite nodes and positive VoI. For example, in Figure 3a, high school is d-separated from accuracy given $\mathbf{Pa}^D$, so their criterion would fail to detect that high school is requisite and admits VoI.[7][8]

To have positive VoC, it is known that a node must be an ancestor of a value node (Shachter 1986), but the authors know of no more specific criterion. The concept of a relevant node introduced by Nielsen and Jensen (1999) also bears some resemblance to VoC. The relation of the current technical results to prior work is summarised in Table S1 in the Appendix.

[5] https://github.com/causalincentives/pycid
[6] Definition 6 defines d-separation for potentially overlapping sets.
[7] Furthermore, to prove that nodes meeting the d-connectedness property are requisite, Nielsen and Jensen claim that "X is [requisite] for D if Pr(dom(U) | D, Pa D) is a function of X and U is a utility function relevant for D". However, U being a function of X only proves that U is conditionally dependent on X, not that it changes the expected utility, or is requisite or material. Additional argumentation is needed to show that conditioning on X can actually change the expected utility; our proof provides such an argument.
[8] Since a preprint of this paper was placed online (Everitt et al. 2019c), this completeness result was independently discovered by Zhang, Kumor, and Bareinboim (2020, Thm. 2) and Lee and Bareinboim (2020, Thm. 1). Theorem 2 in the latter also provides a criterion for material observations in a multi-decision setting.
Instrumental control incentives. Kleiman-Weiner et al. (2015) use (causal) influence diagrams to define a notion of intention that captures which nodes an optimal policy seeks to influence. Intention is conceptually similar to instrumental control incentives and uses hypothetical node deletions to ask which nodes the agent intends to control. Their concept is more refined than ICI in the sense that it includes only the nodes that determine optimal policy behaviour, but the definition is not properly formalized and it is not clear that it can be applied to all influence diagram structures.

AI fairness. Another application of this work is to evaluate when an AI system is incentivised to behave unfairly, on some definition of fairness. Response incentives address this question for counterfactual fairness (Kusner et al. 2017; Kilbertus et al. 2017). An incentive criterion corresponding to path-specific effects (Zhang, Wu, and Wu 2017; Nabi and Shpitser 2018) is deferred to future work. Nabi, Malinsky, and Shpitser (2019) have shown how a policy may be chosen subject to path-specific effect constraints. However, they assume recall of all past events, whereas the response incentive criterion applies to any CID.

Mechanism design. The aim of mechanism design is to understand how objectives and environments can be designed in order to shape the behavior of rational agents (e.g. Nisan et al. 2007, Part II). At this high level, mechanism design is closely related to the incentive design results we have developed in this paper. In practice, the strands of research look rather different. The core challenge of mechanism design is that agents have private information or preferences. As we take the perspective of an agent designer, private information is only relevant for us to the extent that some types of agents or objectives may be harder to implement than others. Instead, our core challenge comes from causal relationships in agent environments, a consideration of little interest to most of mechanism design.

## Discussion and Conclusion

We have proved sound and complete graphical criteria for two existing concepts (VoI and VoC) and two new concepts: the response incentive and the instrumental control incentive. The results have all focused on the (causal) structure of the interaction between agent and environment. This is both a strength and a weakness. On the one hand, it means that formal conclusions can be made about a system's incentives even when details about the quantitative relationships between variables are unknown. On the other hand, it also means that these results will not help with subtler comparisons, such as the relative strength of different incentives. It also means that the causal relationships between variables must be known. This challenge is common to causal models in general. In the context of incentive design, it is partially alleviated by the fact that causal relationships often follow directly from the design choices for an agent and its objective. Finally, causal diagrams struggle to express dynamically changing causal relationships. While important to be aware of, these limitations do not prevent causal influence diagrams from providing a clear, useful, and unified perspective on agent incentives. The framework has seen applications ranging from value learning (Armstrong et al.
2020; Holtman 2020) and interruptibility (Langlois and Everitt 2021) to conservatism (Cohen, Vellambi, and Hutter 2020), modeling of agent frameworks (Everitt et al. 2019b), and reward tampering (Everitt et al. 2019a). Through such applications, we hope that the incentive analysis described in this paper will ultimately contribute to more fair and safe AI systems.

## Acknowledgements

Thanks to Michael Cohen, Ramana Kumar, Chris van Merwijk, Carolyn Ashurst, Lewis Hamilton, James Fox, Michiel Bakker, Silvia Chiappa, and Koen Holtman for their invaluable feedback. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number CGSD3-534795-2019]. Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [numéro de référence CGSD3-534795-2019].

## References

Armstrong, S.; Orseau, L.; Leike, J.; and Legg, S. 2020. Pitfalls in learning a reward function online. In International Joint Conference on Artificial Intelligence (IJCAI).
Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Carey, R.; Langlois, E.; Everitt, T.; and Legg, S. 2020. The Incentives that Shape Behaviour. In SafeAI Workshop at AAAI.
Cohen, M. K.; Vellambi, B. N.; and Hutter, M. 2020. Asymptotically Unambitious Artificial General Intelligence. In AAAI Conference on Artificial Intelligence.
Dawid, A. P. 2002. Influence diagrams for causal modelling and inference. International Statistical Review.
Eberhardt, F.; and Scheines, R. 2007. Interventions and causal inference. Philosophy of Science.
Everitt, T.; Hutter, M.; Kumar, R.; and Krakovna, V. 2019a. Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. CoRR.
Everitt, T.; Kumar, R.; Krakovna, V.; and Legg, S. 2019b. Modeling AGI Safety Frameworks with Causal Influence Diagrams. In Workshop on Artificial Intelligence Safety, volume 2419 of CEUR Workshop Proceedings.
Everitt, T.; Ortega, P. A.; Barnes, E.; and Legg, S. 2019c. Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings. CoRR.
Fagiuoli, E.; and Zaffalon, M. 1998. A note about redundancy in influence diagrams. International Journal of Approximate Reasoning.
Fox, J.; Hammond, L.; Everitt, T.; Abate, A.; and Wooldridge, M. 2021. Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice. In AAMAS.
Hadfield-Menell, D.; Dragan, A.; Abbeel, P.; and Russell, S. J. 2017. The Off-Switch Game. In International Joint Conference on Artificial Intelligence (IJCAI).
Heckerman, D.; and Shachter, R. 1994. A Decision-Based View of Causality. In Uncertainty in Artificial Intelligence (UAI), 302–310.
Heckerman, D.; and Shachter, R. D. 1995. Decision-Theoretic Foundations for Causal Reasoning. Journal of Artificial Intelligence Research 3: 405–430. doi:10.1613/jair.202.
Holtman, K. 2020. AGI Agent Safety by Iteratively Improving the Utility Function. In International Conference on Artificial General Intelligence.
Howard, R. A. 1966. Information Value Theory. IEEE Transactions on Systems Science and Cybernetics.
Howard, R. A. 1990. From influence to relevance to knowledge. In Influence Diagrams, Belief Nets and Decision Analysis.
Jern, A.; and Kemp, C. 2011. Capturing mental state reasoning with influence diagrams. In Proceedings of the 2011 Cognitive Science Conference, 2498–2503.
Kilbertus, N.; Rojas-Carulla, M.; Parascandolo, G.; Hardt, M.; Janzing, D.; and Schölkopf, B.
2017. Avoiding Discrimination through Causal Reasoning. In Advances in Neural Information Processing Systems, 656–666.
Kleiman-Weiner, M.; Gerstenberg, T.; Levine, S.; and Tenenbaum, J. B. 2015. Inference of intention and permissibility in moral decision making. In Proceedings of the 37th Annual Conference of the Cognitive Science Society, 1123–1128.
Kusner, M. J.; Loftus, J. R.; Russell, C.; and Silva, R. 2017. Counterfactual Fairness. In Advances in Neural Information Processing Systems.
Langlois, E.; and Everitt, T. 2021. How RL Agents Behave when their Actions are Modified. In AAAI.
Lauritzen, S. L.; and Nilsson, D. 2001. Representing and Solving Decision Problems with Limited Information. Management Science.
Lee, S.; and Bareinboim, E. 2020. Characterizing optimal mixed policies: Where to intervene and what to observe. In Advances in Neural Information Processing Systems 33.
Matheson, J. E. 1990. Using influence diagrams to value information and control. In Oliver, R. M.; and Smith, J. Q., eds., Influence Diagrams, Belief Nets, and Decision Analysis. Wiley and Sons.
Nabi, R.; Malinsky, D.; and Shpitser, I. 2019. Learning optimal fair policies. Proceedings of Machine Learning Research.
Nabi, R.; and Shpitser, I. 2018. Fair Inference on Outcomes. In AAAI Conference on Artificial Intelligence.
Nielsen, T. D.; and Jensen, F. V. 1999. Welldefined Decision Scenarios. In Uncertainty in Artificial Intelligence (UAI).
Nisan, N.; Roughgarden, T.; Tardos, E.; and Vazirani, V. V., eds. 2007. Algorithmic Game Theory. Cambridge University Press.
Omohundro, S. M. 2008. The Basic AI Drives. In Wang, P.; Goertzel, B.; and Franklin, S., eds., Artificial General Intelligence, volume 171. IOS Press.
O'Neil, C. 2016. Weapons of Math Destruction. Crown Books.
Pearl, J. 2001. Direct and Indirect Effects. In Uncertainty in Artificial Intelligence (UAI).
Pearl, J. 2009. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition. ISBN 9780521895606.
Russell, S. J. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
Shachter, R. D. 1986. Evaluating Influence Diagrams. Operations Research.
Shachter, R. D. 1998. Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams). In Uncertainty in Artificial Intelligence (UAI).
Shachter, R. D. 2016. Decisions and Dependence in Influence Diagrams. In International Conference on Probabilistic Graphical Models.
Shachter, R. D.; and Heckerman, D. 2010. Pearl Causality and the Value of Control. In Dechter, R.; Geffner, H.; and Halpern, J. Y., eds., Heuristics, Probability and Causality: A Tribute to Judea Pearl. College Publications.
Tian, J.; and Pearl, J. 2001. Causal discovery from changes. In Uncertainty in Artificial Intelligence (UAI).
Verma, T.; and Pearl, J. 1988. Causal Networks: Semantics and Expressiveness. In Uncertainty in Artificial Intelligence (UAI).
Zhang, J.; Kumor, D.; and Bareinboim, E. 2020. Causal imitation learning with unobserved confounders. In Advances in Neural Information Processing Systems 33.
Zhang, L.; Wu, Y.; and Wu, X. 2017. A causal framework for discovering and removing direct and indirect discrimination. In IJCAI International Joint Conference on Artificial Intelligence.