# The Limits of Predicting Agents from Behaviour

Alexis Bellot¹, Jonathan Richens¹, Tom Everitt¹

¹ Google DeepMind. Correspondence to: Alexis Bellot.

*Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).*

**Abstract.** As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent's beliefs from its behaviour, and how reliably can these inferred beliefs predict the agent's behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.

## 1. Introduction

Humans understand each other through the use of abstractions. We explain our intentions by appealing to our goals and beliefs about the world around us, without knowing the underlying cognition going on inside our heads. According to Dennett (1989; 2017), the same is true of our understanding of other systems. For example, a bear hibernates during winter as if it believes that the lower temperatures cause food scarcity. This is a useful description of the bear's behaviour, with real predictive power. For example, it gives us (human observers) the ability to anticipate how bears might act as the climate changes. There is a correspondence between beliefs and behaviour that is foundational to rational agents (Davidson, 1963).

Artificial Intelligence (AI) systems appear to have similarly general capabilities, not totally unlike those of humans and animals. They can generate text that is fluent and accurate in response to a very diverse set of questions. Whenever they display consistent types of behaviour across many different tasks, we are tempted to apply our own mentalistic language more or less at face value (Shanahan, 2024), taking seriously questions such as: What do the AIs know? What do they think, and believe? Taking the analogy further, it is as if they learn world models that mirror the causal relationships of the environment they are trained on, guiding their future plans and behaviour.[^1] And as a consequence, their interactions with an environment will leave clues that might give us the ability to predict their future behaviour in novel domains. This possibility engages with a core AI Safety problem: how to guarantee and predict whether AI systems will act safely and beneficially? The main result of this paper is to offer a new perspective on this problem by showing that:

*With an assumption of competence and optimality, the behaviour of AI systems partially determines their actions in novel environments.*
Here behaviour means our observations of the decisions made by the AI system, contextual variables, and utility or reward values in some environment. The partial determination of actions in new environments is a consequence of our lack of knowledge about the AI's actual world model (different models may induce different optimal actions). However, even though we cannot uniquely identify the AI's future behaviour and beliefs, we can narrow them down to a range of possible outcomes. This paper characterises those outcomes.

[^1]: Recent research suggests that an AI's behaviour, to the extent that it is consistent with rationality axioms, can be formally described by a (causal) world model (Halpern & Piermont, 2024). The same conclusion can also be obtained for AIs capable of solving tasks in multiple environments (Richens & Everitt, 2024). For large language models, there is increasing empirical evidence for the world model hypothesis, see e.g., Toshniwal et al., 2022; Li et al., 2022; Gurnee & Tegmark, 2023; Goldstein & Levinstein, 2024 and Vafa et al., 2024.

In the literature, the under-determination of agent beliefs and preferences has been considered in the fields of inverse reinforcement learning (Abbeel & Ng, 2004; Skalse & Abate, 2023; Amin & Singh, 2016) and decision theory (Savage, 1972; Afriat, 1967; Jeffrey, 1990), among others. In settings with distribution shift between training and deployment environments, this under-determination can be understood as a consequence of the Causal Hierarchy Theorem, which defines precise limits on the kinds of inferences that can be drawn across domains (Bareinboim et al., 2022; Pearl, 2009). It implies, for example, that behaviour in an environment subject to an intervention cannot be established from non-interventional data alone. Robins (1989), Manski (1990) and Pearl (1999) showed that useful information in the form of bounds can nevertheless be extracted from non-interventional data, without actually knowing the underlying data-generating process. In the causality literature, several methods and algorithms exist to solve different versions of this problem, see e.g., Balke & Pearl, 1997; Tian & Pearl, 2000; Zhang et al., 2021; Bellot, 2024; Rosenbaum et al., 2010; Tan, 2006.

This paper extends the causal formalism to reason about the possible behaviours and beliefs of an AI system, itself assumed to be governed by an unknown data-generating process or world model. With this interpretation we are able to define mathematically notions such as an AI's preferred choice of action in novel environments, its perception of fairness, and its perception of harm due to the actions it takes. Our main contribution is a set of inequalities on these beliefs in terms of quantities that can in principle be estimated from behavioural data, and that hold irrespective of the underlying cognitive architecture of the AI system, as long as it can be represented by a well-defined set of causal mechanisms (a world model) that tracks its behaviour (Sec. 4). We then extend these results to characterize AI behaviour under several relaxations for applications in practice (Sec. 5), ultimately with the goal of defining the theoretical limits of what can be inferred from data about AI behaviour in new (unseen) environments. This has consequences for the wider AI Safety community and society.
For example, we show that an AI's perception of the potential fairness and harm of its decisions (e.g., whether the AI's resource allocation is believed to be equitable, or its generations unbiased) provably cannot be inferred from observing its behaviour alone. There are theoretical limits to how much we can understand about an AI's cognition and decision-making process from observations. We believe our results can help justify the claim that the design and inference of world models is important to ensure AIs can behave predictably and act safely and beneficially, as argued by Dalrymple et al., 2024; Legg, 2023; Bengio et al., 2025.

## 2. Preliminaries

In this section we outline some basic principles that we use to reason about how beliefs might be (implicitly) defined within an AI system. We use capital letters to denote variables ($X$), small letters for their values ($x$), bold letters for sets of variables ($\mathbf{X}$) and their values ($\mathbf{x}$), and use $\mathrm{supp}$ to denote their domains of definition ($x \in \mathrm{supp}\,X$). To denote $P(Y = y \mid X = x)$, we sometimes use the shorthand $P(y \mid x)$. We use $\mathbb{1}\{\cdot\}$ for the indicator function, equal to 1 if the statement in $\{\cdot\}$ evaluates to true, and equal to 0 otherwise.

Actions, plans, and hypothetical outcomes can be evaluated by symbolic operations on a model that represents the functional relationships in the world, known as a Structural Causal Model (Pearl, 2009, Definition 7.1.1), or SCM for short.

**Definition 1 (Structural Causal Model).** An SCM $M$ is a tuple $M = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}, P \rangle$ where each observed variable $V \in \mathbf{V}$ is a deterministic function of a subset of variables $Pa_V \subseteq \mathbf{V}$ and latent variables $\mathbf{U}_V \subseteq \mathbf{U}$, i.e., $v := f_V(pa_V, u_V)$, $f_V \in \mathcal{F}$. Each latent variable $U \in \mathbf{U}$ is distributed according to a probability measure $P(u)$. We assume the model to be recursive, i.e., that there are no cyclic dependencies among the variables.

In an SCM $M$, each draw $u \sim P(u)$ evaluates to a potential response $Y(u) = y$ and entails a distribution over the possible outcomes $P(y)$. The power of SCMs is that they specify not only the joint distribution $P(\mathbf{v})$ but also the distribution of variables under all interventions, including incompatible interventions (counterfactuals). Formally, an intervention $do(\mathbf{x})$ is modelled as a symbolic operation where the values of a set of variables $\mathbf{X}$ are set to constants $\mathbf{x}$, replacing the functions $\{f_X : X \in \mathbf{X}\}$ that would normally determine their values. This effectively induces a sub-model of $M$, denoted $M_{\mathbf{x}}$. The variables obtained in $M_{\mathbf{x}}$ are denoted $\mathbf{Y}_{\mathbf{x}}$, and we will loosely write $P_{M_{\mathbf{x}}}(\mathbf{y}) = P_{\mathbf{x}}(\mathbf{y}) = P(\mathbf{y}_{\mathbf{x}}) = P(\mathbf{y} \mid do(\mathbf{x}))$ to denote the probabilities over the possible outcomes of $\mathbf{Y}$ in $M_{\mathbf{x}}$.

Different environments can be modelled by different SCMs. Let $M_1 = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}_1, P_1 \rangle$ and $M_2 = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}_2, P_2 \rangle$ be the SCMs for two environments over the same sets $\mathbf{V}$ and $\mathbf{U}$. We say that there is a discrepancy or a shift on a variable $X \in \mathbf{V}$ between them if either $f_X^1 \neq f_X^2$ or $P_1(\mathbf{U}_X) \neq P_2(\mathbf{U}_X)$, or both. Shifts might therefore encode arbitrary changes in the causal mechanisms for a set of variables. For a reference SCM $M$, a so-called shifted SCM will be represented by a sub-model $M_\sigma$, where $\sigma$ represents the discrepancies between $M$ and $M_\sigma$. For example, an environment with a shift $\sigma$ on a set of variables $\mathbf{X}$ introduces (possibly arbitrary) discrepancies in the functional assignments or (independent) exogenous variables of $\mathbf{X}$ while keeping other mechanisms unchanged. See Pearl (2009, Chapter 4) and Correa & Bareinboim (2020b) for more details.
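To make the SCM abstraction concrete, here is a minimal Python sketch of Definition 1 for discrete models with a single latent variable, where interventions $do(\mathbf{x})$ are implemented by replacing structural functions with constants. The class design and the toy mechanisms (which anticipate the $d = 0$ arm of Example 1 below) are our own illustrative choices, not an implementation from the paper.

```python
class SCM:
    """A discrete SCM <V, U, F, P> with a single latent variable U."""

    def __init__(self, functions, p_u):
        # functions: {name: f(v_so_far, u)}, listed in topological order
        # (the model is recursive, so such an order exists).
        self.functions = functions
        self.p_u = p_u  # {u: P(u)}

    def evaluate(self, u):
        # Potential response V(u): deterministic once the latent draw u is fixed.
        v = {}
        for name, f in self.functions.items():
            v[name] = f(v, u)
        return v

    def do(self, **x):
        # The sub-model M_x: intervened variables become constants, replacing
        # the functions that would normally determine their values.
        fs = dict(self.functions)
        for name, value in x.items():
            fs[name] = lambda v, u, value=value: value
        return SCM(fs, self.p_u)

    def prob(self, event):
        # P(event) = sum of P(u) over latent draws whose response satisfies it.
        return sum(p for u, p in self.p_u.items() if event(self.evaluate(u)))

# A toy environment with blood pressure Z and outcome Y.
M = SCM({"Z": lambda v, u: int(u in (1, 4)),
         "Y": lambda v, u: v["Z"] * int(u == 4) + (1 - v["Z"]) * int(u in (1, 3, 4))},
        p_u={u: 0.2 for u in (1, 2, 3, 4, 5)})

print(round(M.prob(lambda v: v["Y"] == 1), 2))          # P(Y = 1)           -> 0.4
print(round(M.do(Z=1).prob(lambda v: v["Y"] == 1), 2))  # P(Y = 1 | do(Z=1)) -> 0.2
```

The key design point is that an intervention never touches the latent distribution $P(u)$: it only swaps out functions in $\mathcal{F}$, which is what allows counterfactual quantities to be coupled across sub-models.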
We make a note here that all proofs of statements are given in Appendix C and that the derivations of examples are given in Appendix A.

## 3. Agents, Beliefs, and the Environment

In this section we lay out a framework to interface between the AI system's internal world model and our own observations of its behaviour in the real world. Both rely on the same SCM abstraction. We assume that the AI operates according to an SCM $\hat{M}$ over $\mathbf{V}$, its (implicit) world model,[^2] that guides its behaviour. $\mathbf{V}$ includes the AI's decision variable $D$, the inputs $\mathbf{C}$ to those decisions, possible additional variables, and the utility variable $Y$, such as the training signal or a measurable target given to the AI (Everitt et al., 2021). Beliefs[^3] are defined as quantifiable aspects of that model or derivations of it.

**Definition 2 (Beliefs).** An AI belief is a probabilistic statement derived from its internal model $\hat{M}$.

For example, a statement like $P_{\hat{M}_d}(Y = y) = 0.8$ describes the subjective belief "The AI is 80% confident that taking decision $D = d$ will lead to event $Y = y$". The sub-model in this mathematical expression represents what the AI thinks the world looks like after taking the decision $D = d$.

We assume that the AI makes decisions $d$ by sampling from a policy $\pi(d \mid \mathbf{c})$, which is a function mapping from the domain of the observed covariates $\mathbf{C} \subseteq \mathbf{V}$ (i.e., all the inputs given to the AI) to the probability space over the domain of the decision $D \in \mathbf{V}$. The choice of $\pi$ is assumed to be driven by its perceived utility[^4] $Y \in \mathbf{V}$ within the AI's model $\hat{M}$, that is,

$$\arg\max_{\pi} \; \mathbb{E}_{P_{\hat{M}}}[Y \mid do(\pi)]. \tag{1}$$

The AI interacts with the real world, which is described by a (likely different) SCM $M$ that encodes the true dynamics of the environment. In principle, we have no reason to expect that the model $\hat{M}$ internalized by the AI matches the underlying reality $M$. AI systems might hope to reproduce some aspects of $M$ (the AI might have learned, for instance, to mimic the distribution of the observed data). Competent AIs might go further and be able to reliably predict the effects of different decisions in the world. We define this as grounding below.

**Definition 3 (Grounding).** Let $\hat{M}$ represent the AI's internal model. We say that the AI is grounded in a domain $M$ if $P_{\hat{M}_d}(\mathbf{V}) = P_{M_d}(\mathbf{V})$ for any decision $d \in \mathrm{supp}\,D$.

Grounding tells us that the AI's beliefs about the effect of a particular decision $d$ in the training environment match the effects that would be observed in the real world, i.e., $\hat{P}_d(\mathbf{V}) = P_d(\mathbf{V})$.[^5] It is an assumption on the relationship between our observations of AI behaviour $P(\mathbf{V})$ and what might be going on in the AI's mind $\hat{P}(\mathbf{V})$. This might be reasonable, for example, if the AI is explicitly trained by reinforcement learning in $M$.

[^2]: Here SCMs are meant to represent, mathematically, the decision-making process going on in the AI's head in a way that tracks its behaviour, without making any claims about the AI's actual cognitive architecture.
[^3]: We might prefer to use terms like credences or subjective probabilities to emphasize the subjective nature of beliefs and avoid the connotation of strong conviction or certainty, as done by (Schwitzgebel, 2024, Sec. 2.3).
[^4]: To account for possible uncertainty in the AI's satisfaction with a given state of the world $\mathbf{w}$, we assume $Y$ is a random variable (induced by $\mathbf{U}_Y \subseteq \mathbf{U}$), also known as a stochastic utility model (Manski, 1977). We assume that the support of $Y$ is bounded in the $[0, 1]$ interval.
[^5]: We use the shorthand $P_d = P_{M_d}$ and $\hat{P}_d = P_{\hat{M}_d}$ to simplify the notation.
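To illustrate why grounding pins down the in-domain choice in Eq. (1), the following sketch (restricted to atomic decisions $d \in \{0, 1\}$) evaluates the expected utility of each decision under the two candidate world models of Example 1 below. Being observationally identical, they agree on the in-domain argmax.

```python
# Both candidate world models of Example 1 (below) assign the same value to
# Eq. (1) in the training domain, so a grounded AI's in-domain decision is
# identified from behavioural data alone.
P_U = {u: 0.2 for u in (1, 2, 3, 4, 5)}

def z_of(u):                       # shared mechanism for blood pressure Z
    return int(u in (1, 4))

def y_M(u, d, z):                  # utility mechanism of the environment M
    return z * int(u == 4) + (1 - z) * int(u in (1, 3, 4)) if d == 0 \
        else z * int(u != 2) + (1 - z) * int(u in (2, 4))

def y_Mhat(u, d, z):               # utility mechanism of the alternative model
    return z * int(u != 1) + (1 - z) * int(u in (3, 4)) if d == 0 \
        else z * int(u in (1, 4)) + (1 - z) * int(u in (1, 2))

for name, f in [("M", y_M), ("M-hat", y_Mhat)]:
    utilities = [round(sum(p * f(u, d, z_of(u)) for u, p in P_U.items()), 2)
                 for d in (0, 1)]
    print(name, utilities)   # both print [0.4, 0.6]: the in-domain argmax is d = 1
```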
By assumption, a grounded AI's choice of decision in environment $M$ is in principle predictable from data, since we can compute Eq. (1) uniquely. But this might not necessarily be the case in a new (unseen) environment.

**Example 1 (The Uncertain Medical AI).** Imagine an AI system assisting patients with their treatment $D$ for a disease $Y$, known to be influenced also by a third variable $Z$, blood pressure. The AI is competent and learns the precise effect of all treatments. In other words, it is grounded in $M$, i.e., $\hat{P}_d(z, y) = P_d(z, y)$. For concreteness, let the environment $M$ be given by

$$Z \leftarrow \mathbb{1}\{U \in \{1, 4\}\}, \qquad Y \leftarrow \begin{cases} Z \cdot \mathbb{1}\{U = 4\} + (1 - Z) \cdot \mathbb{1}\{U \in \{1, 3, 4\}\} & \text{if } d = 0 \\ Z \cdot \mathbb{1}\{U \neq 2\} + (1 - Z) \cdot \mathbb{1}\{U \in \{2, 4\}\} & \text{if } d = 1, \end{cases}$$

with equal probability $P(u) = 0.2$ for all values $U \in \{1, 2, 3, 4, 5\}$. Here $U$ is latent, summarizing all other contributions to both the disease and blood pressure, such as an individual's (unobserved) attitudes to health, fitness, etc. Could we confidently deploy this AI system more widely, for example, on individuals that also take a second drug that artificially improves their blood pressure (e.g., fixing $Z$ to 1, replacing the original assignment)? If the AI system is instructed to maximize $Y$ on average, what decision does the AI believe is optimal in this new environment? The answer is: we do not know, meaning that it is possible to find a second model $\hat{M}$ defined by the mechanisms

$$Z \leftarrow \mathbb{1}\{U \in \{1, 4\}\}, \qquad Y \leftarrow \begin{cases} Z \cdot \mathbb{1}\{U \neq 1\} + (1 - Z) \cdot \mathbb{1}\{U \in \{3, 4\}\} & \text{if } d = 0 \\ Z \cdot \mathbb{1}\{U \in \{1, 4\}\} + (1 - Z) \cdot \mathbb{1}\{U \in \{1, 2\}\} & \text{if } d = 1, \end{cases}$$

that entails exactly the same observations $\hat{P}_d(z, y) = P_d(z, y)$ but induces different optimal decisions in the new environment (under the intervention $Z \leftarrow 1$). Under $M_{Z \leftarrow 1}$, the highest utility $Y$ on average is given by $d = 1$, while under $\hat{M}_{Z \leftarrow 1}$ the highest utility $Y$ on average is given by $d = 0$. A priori, we have no way of knowing which model ($M$ or $\hat{M}$) is governing the AI's behaviour, and so no way of knowing what decision will be favoured by the AI.

This example illustrates a canonical point in a simple setting: as observers with access to the AI's interactions in some domain, its behaviour outside of that domain might not be uniquely determined (Pearl, 2009).
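The claims of Example 1 can be checked mechanically. The sketch below (our own verification, not code from the paper) enumerates the latent values to confirm that both models induce the same observational distributions $P_d(z, y)$, yet disagree on the optimal decision under $do(Z = 1)$.

```python
# Verify Example 1: M and M-hat are observationally indistinguishable but
# disagree about the optimal decision once Z is clamped to 1.
P_U = {u: 0.2 for u in (1, 2, 3, 4, 5)}

def respond(model, u, d, z=None):
    zv = int(u in (1, 4)) if z is None else z   # Z <- 1{U in {1,4}} unless intervened
    if model == "M":
        y = zv * int(u == 4) + (1 - zv) * int(u in (1, 3, 4)) if d == 0 \
            else zv * int(u != 2) + (1 - zv) * int(u in (2, 4))
    else:                                       # the alternative model M-hat
        y = zv * int(u != 1) + (1 - zv) * int(u in (3, 4)) if d == 0 \
            else zv * int(u in (1, 4)) + (1 - zv) * int(u in (1, 2))
    return zv, y

# Grounding check: P_d(z, y) is identical under both models for every (d, z, y).
for d in (0, 1):
    for z in (0, 1):
        for y in (0, 1):
            pM = sum(p for u, p in P_U.items() if respond("M", u, d) == (z, y))
            pH = sum(p for u, p in P_U.items() if respond("M-hat", u, d) == (z, y))
            assert abs(pM - pH) < 1e-12

# Out-of-domain disagreement: E[Y | do(Z=1), do(d)] differs between the models.
for model in ("M", "M-hat"):
    ey = [round(sum(p * respond(model, u, d, z=1)[1] for u, p in P_U.items()), 2)
          for d in (0, 1)]
    print(model, ey)   # M: [0.2, 0.8] favours d=1;  M-hat: [0.8, 0.4] favours d=0
```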
## 4. The Limits of Behavioural Data

In this section, we explore the limits of behavioural data for predicting the decisions of AIs in new environments. As external observers, we do not have access to the mechanisms underlying the actual environment, nor to the agent's internal model. We assume that we must rely for our inferences on watching the agent's behaviour and its consequences. That is, we have access to (samples of) $P_d(\mathbf{V})$ for all $d$.[^6] As a starting point, we might expect competent AIs to be weakly predictable, in the sense that a subset of decisions can be ruled out as provably sub-optimal given our observations.

[^6]: Technically, the AI system may choose to follow an arbitrarily complex policy $\pi$ in the training domain, inducing a (assumed positive) distribution $P_\pi(\mathbf{v})$. It holds that $P_d(\mathbf{V})$ can be computed from any such $P_\pi(\mathbf{V})$ as long as $P_\pi(\mathbf{v}) > 0$ for all $\mathbf{v}$, and vice versa, see e.g. Lem. 1. The positivity assumption $P_d(\mathbf{v}) > 0$ rules out fully deterministic policies in the available data, but might be reasonable if the AI spends some time exploring before committing to a course of action.

**Definition 4 (Weak Predictability).** We say that an AI is weakly predictable under a shift $\sigma$ in situation $\mathbf{C} = \mathbf{c}$ if there exists a decision $d'$ that is provably sub-optimal, i.e.,

$$d' \notin \arg\max_{d} \; \mathbb{E}_{P_{\hat{M}}}[Y \mid do(\sigma, d), \mathbf{c}], \tag{2}$$

for any valid SCM $\hat{M}$ describing the AI's internal model.

Here, valid means that the AI's internal model is compatible with the observed data under our assumptions about the relationship between the data and the AI's internal model, e.g., grounding. Weak predictability means that there exists at least one decision that we can guarantee the AI will not take in the shifted environment. Specifically, we can rule out a decision $d'$ if and only if we can find a (superior) alternative decision $d \neq d'$ such that

$$\min_{\hat{M} \in \mathcal{M}} \Delta(d > d') > 0, \qquad \Delta(d > d') := \mathbb{E}_{P_{\hat{M}}}[Y \mid do(\sigma, d), \mathbf{c}] - \mathbb{E}_{P_{\hat{M}}}[Y \mid do(\sigma, d'), \mathbf{c}], \tag{3}$$

where $\mathcal{M}$ denotes the set of valid SCMs. Here $\Delta$ can be interpreted as the AI's preference gap between two decisions in some situation $\mathbf{C} = \mathbf{c}$. When it evaluates to a positive number, $d$ is preferred to $d'$, and when it evaluates to a negative number, $d'$ is preferred to $d$ (in the AI's mind). If our inferences on $\Delta$ allow us to rule out decisions $d'$ considered to be unsafe, then weak predictability gives us an important safety guarantee. We can strengthen this notion to define strong predictability, which describes a situation in which all but a single AI decision can be ruled out.

**Definition 5 (Strong Predictability).** We say that an AI is strongly predictable under a shift $\sigma$ in situation $\mathbf{C} = \mathbf{c}$ if the optimal decision is uniquely identifiable, i.e., there exists a single decision $d^*$ such that

$$d^* = \arg\max_{d} \; \mathbb{E}_{P_{\hat{M}}}[Y \mid do(\sigma, d), \mathbf{c}], \tag{4}$$

for any valid SCM $\hat{M}$ describing the AI's internal model.

*Figure 1. Grounding and observations in multiple environments constrain the AI's world model and improve our prediction of AI behaviour out-of-distribution (o.o.d.). Approximate grounding is defined in Sec. 5.*

### 4.1. AI decisions out-of-domain: interventions

Our first result shows that, in some cases, a subset of AI decisions can be provably ruled out, i.e., the AI is weakly predictable.

**Theorem 1.** An AI grounded in a domain $M$ is weakly predictable under a shift $\sigma : do(\mathbf{z})$, $\mathbf{Z} \subseteq \mathbf{V}$, in a context $\mathbf{C} = \mathbf{c}$ if and only if there exists a decision $d'$ such that

$$\frac{\mathbb{E}_{P_d}[Y \mid \mathbf{c}, \mathbf{z}] \, P_d(\mathbf{c}, \mathbf{z})}{P_d(\mathbf{c}, \mathbf{z}) + 1 - P_d(\mathbf{z})} \;-\; \frac{\mathbb{E}_{P_{d'}}[Y \mid \mathbf{c}, \mathbf{z}] \, P_{d'}(\mathbf{c}, \mathbf{z}) + 1 - P_{d'}(\mathbf{z})}{P_{d'}(\mathbf{c}, \mathbf{z}) + 1 - P_{d'}(\mathbf{z})} \;>\; 0, \tag{5}$$

for some $d \neq d'$.

All terms on the l.h.s. are in principle computable from the AI's behaviour. Loosely speaking, the value of this difference is determined (in part) by $P_{d'}(\mathbf{z})$: if $\mathbf{Z} = \mathbf{z}$ (the value set by the intervention) is likely under the training distribution, the difference will more likely evaluate to a positive value. The "if and only if" condition means that whenever this inequality does not hold, we can construct two SCMs $\hat{M}_1$, $\hat{M}_2$ for the grounded AI's internal model that generate the observed behaviour $P_d(\mathbf{V})$, $d \in \mathrm{supp}\,D$, but that induce different optimal actions. That is, for all $d \neq d'$,

$$\mathbb{E}_{P_{\hat{M}_1}}[Y \mid do(\mathbf{z}, d), \mathbf{c}] > \mathbb{E}_{P_{\hat{M}_1}}[Y \mid do(\mathbf{z}, d'), \mathbf{c}], \qquad \mathbb{E}_{P_{\hat{M}_2}}[Y \mid do(\mathbf{z}, d), \mathbf{c}] < \mathbb{E}_{P_{\hat{M}_2}}[Y \mid do(\mathbf{z}, d'), \mathbf{c}].$$

**Remark.** We can derive a similar condition for strongly predictable AIs by replacing "for some $d \neq d'$" with "for all $d' \neq d$" in Thm. 1.

We illustrate Thm. 1 with the following example.

**Example 2 (Grounded Medical AI).** In Example 1, we showed that there exists a particular intervened environment in which the AI's intentions cannot be determined, as in principle the AI could believe that either decision is optimal. Is this true in general? Thm. 1 suggests that it depends on the likelihood of different events $P_d(z, y)$ in the observed data. For Example 1, we can show that the medical AI is not weakly predictable, as the expression in Thm. 1 is negative for all pairs of decisions. In other words, no decision can be ruled out in general: in some AI internal models $d_1$ is inferior to $d_0$, as

$$\min_{\hat{M} \in \mathcal{M}} \Delta(d_1 > d_0) = P_{d_1}(Z = z, Y = 1) + P_{d_0}(Z = z, Y = 0) - 1 = -0.4,$$

while in others $d_0$ is inferior to $d_1$, as

$$\min_{\hat{M} \in \mathcal{M}} \Delta(d_0 > d_1) = P_{d_1}(Z = z, Y = 0) + P_{d_0}(Z = z, Y = 1) - 1 = -0.8,$$

and we don't know which one the AI system has internalised. In this example, AI behaviour does provide some information, as $\Delta$ can be constrained to larger values than its a priori minimum of $-1$, but not enough to rule out a decision completely.
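The two worst-case preference gaps in Example 2 are simple functionals of the observational distribution. A small sketch of the computation, with probabilities derived from the structural equations of Example 1:

```python
# Worst-case preference gaps of Example 2 under do(Z=1), computed from the
# observational distribution P_d(z, y) of the environment in Example 1.
P_U = {u: 0.2 for u in (1, 2, 3, 4, 5)}

def obs(u, d):   # observational (Z, Y) under decision d
    z = int(u in (1, 4))
    y = z * int(u == 4) + (1 - z) * int(u in (1, 3, 4)) if d == 0 \
        else z * int(u != 2) + (1 - z) * int(u in (2, 4))
    return z, y

def P(d, z, y):
    return sum(p for u, p in P_U.items() if obs(u, d) == (z, y))

z = 1   # the intervention value
print(round(P(1, z, 1) + P(0, z, 0) - 1, 2))   # min Delta(d1 > d0) = -0.4
print(round(P(0, z, 1) + P(1, z, 0) - 1, 2))   # min Delta(d0 > d1) = -0.8
```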
Our next result shows that Thm. 1 can be extended to obtain tighter bounds for AI systems that are grounded in multiple environments.

**Theorem 2.** Let $\sigma : do(\mathbf{z})$ be a shift on a set of variables $\mathbf{Z} \subseteq \mathbf{V}$. For $\mathbf{R}_i \subseteq \mathbf{Z} \subseteq \mathbf{V}$, $i = 1, \ldots, k$, consider an AI grounded in multiple domains $\{M_{\mathbf{r}_i} : i = 1, \ldots, k\}$. The AI is weakly predictable in a context $\mathbf{C} = \mathbf{c}$ under the shift $\sigma : do(\mathbf{z})$ if and only if there exists a decision $d'$ such that

$$\max_{i,j = 1, \ldots, k} A(\mathbf{r}_i, \mathbf{r}_j) > 0, \quad \text{for some } d \neq d', \tag{6}$$

where

$$A(\mathbf{r}_i, \mathbf{r}_j) := \frac{\mathbb{E}_{P_{d,\mathbf{r}_i}}[Y \mid \mathbf{c}, \mathbf{z} \setminus \mathbf{r}_i] \, P_{d,\mathbf{r}_i}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_i)}{P_{d,\mathbf{r}_i}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_i) + 1 - P_{d,\mathbf{r}_i}(\mathbf{z} \setminus \mathbf{r}_i)} \;-\; \frac{\mathbb{E}_{P_{d',\mathbf{r}_j}}[Y \mid \mathbf{c}, \mathbf{z} \setminus \mathbf{r}_j] \, P_{d',\mathbf{r}_j}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_j) + 1 - P_{d',\mathbf{r}_j}(\mathbf{z} \setminus \mathbf{r}_j)}{P_{d',\mathbf{r}_j}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_j) + 1 - P_{d',\mathbf{r}_j}(\mathbf{z} \setminus \mathbf{r}_j)}.$$

In this result, $\{M_{\mathbf{r}_i} : i = 1, \ldots, k\}$ describes $k$ domains in which experiments on different subsets of $\mathbf{Z}$ have been conducted, i.e., $\{P_{d,\mathbf{r}_i}(\mathbf{V}) : i = 1, \ldots, k\}$ is available. This possibly includes the null experiment $\mathbf{R}_i = \emptyset$, which refers to the unaltered domain $M$. Note that grounding in multiple domains is useful for the prediction of the AI's preference gap because the resulting bounds in Thm. 2 are tighter than those in Thm. 1 (this is given formally as Corol. 3 in the Appendix). Fig. 1 illustrates how different assumptions and observations give us information about the possible world models that the AI is operating on, which in turn has implications for the AI's behaviour out-of-distribution. This knowledge allows us to reduce the uncertainty around the AI's preference gap $\Delta$, and possibly rule out certain actions that are unambiguously sub-optimal out-of-distribution, inferred solely from observed behaviour.

### 4.2. AI decisions out-of-domain: general shifts

We might wonder about predictability under more general shifts, such as an arbitrary change in a subset of the mechanisms $\{f_Z : Z \in \mathbf{Z}\}$ and distributions of variables $\{\mathbf{U}_Z : Z \in \mathbf{Z}\}$ in $M$. In practice, we are likely able to convey to the AI that the mechanisms for a set of variables $\mathbf{Z}$ are expected to change without knowing exactly how. For example, demographic properties of patients might change across hospitals. How could the AI interpret the consequences of such an under-specified shift? To begin to answer this question, the following theorem shows that in the extreme case where the nature of the shift is completely unknown, the AI's preference gap is unconstrained.

**Theorem 3.** Consider an AI grounded in a domain $M$ made aware of an (under-specified) shift on a non-empty $\mathbf{Z} \subseteq \mathbf{V}$. Then the AI is provably not weakly (or strongly) predictable in any context $\mathbf{C} = \mathbf{c}$.

This result means that no decision could ever be ruled out from AI behaviour. We can show, moreover, that $\min_{\hat{M} \in \mathcal{M}} \Delta = -1$ for any pair of decisions, meaning that the observed data (no matter what it is) gives us no information on AI decision-making.
In practice, however, it might be realistic to have access to some information about the shifted environment, such as covariate data, i.e., (samples from) $P_{\sigma,d}(\mathbf{c})$, that could be given to the AI for it to update its internal model accordingly (with some abuse of terminology, we say that the AI is grounded in $P_{\sigma,d}(\mathbf{c})$). The next theorem shows that this additional information, coupled with the AI's behaviour, makes the AI more predictable.

**Theorem 4.** Consider an AI grounded in a domain $M$ and in $P_{\sigma,d}(\mathbf{C})$, made aware of a shift $\sigma$ on $\mathbf{Z} \subseteq \mathbf{C}$. The AI is weakly predictable under this shift in a context $\mathbf{C} = \mathbf{c}$ if there exists a decision $d'$ such that

$$1 \;-\; \frac{2 + \mathbb{E}_{P_{d'}}[Y \mid \mathbf{c}] \, P_{d'}(\mathbf{c}) - \mathbb{E}_{P_d}[Y \mid \mathbf{c}] \, P_d(\mathbf{c}) + P_{d'}(\mathbf{c}) - 2 P_{d'}(\mathbf{z})}{P_{\sigma,d'}(\mathbf{c})} \;>\; 0,$$

for some $d \neq d'$.

This bound is not tight in general, however, meaning that it is possible that the AI is actually predictable in settings where Thm. 4 suggests it might not be.

**Example 3 (Shifted Medical AI).** The AI from Example 2, originally developed from data primarily from young patients, is now considered for deployment on an older patient population. Their probability of having high blood pressure, $P_\sigma(Z = 1) = 0.9$, is known to be substantially higher than that observed during training, $P(Z = 1) = 0.4$: there is a shift in the underlying mechanisms of $Z$. How do these changes influence the AI's beliefs on $\Delta$? Thm. 4 suggests that the medical AI might not be weakly predictable, as the expression evaluates to a negative value for all pairs of decisions. The lower bounds on the AI's preference gap are given by $\min_{\hat{M} \in \mathcal{M}} \Delta(d_1 > d_0) \geq -0.55$ and $\min_{\hat{M} \in \mathcal{M}} \Delta(d_0 > d_1) \geq -1$. That is, no decision is always inferior to another decision.
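A sketch of the Example 3 computation using the Thm. 4 bound as reconstructed above, with the observational probabilities read off Table 1 in Appendix A; up to rounding, it reproduces the values reported in the example.

```python
# Thm. 4 bound for Example 3: the shift on Z is known only through covariate
# data P_sigma(Z=1) = 0.9. With context c = (Z=1), the lower bound reduces to
#   Delta(d > d') >= 1 - (2 - P_d(Z=1,Y=1) - P_{d'}(Z=1,Y=0)) / P_sigma(Z=1).
P_sigma_z1 = 0.9   # known covariate distribution in the shifted environment

# (d, z, y) -> P_d(Z=z, Y=y), from Table 1 of Appendix A:
P = {(1, 1, 1): 0.4, (1, 1, 0): 0.0,
     (0, 1, 1): 0.2, (0, 1, 0): 0.2}

for d, d_alt in [(1, 0), (0, 1)]:
    lower = 1 - (2 - P[(d, 1, 1)] - P[(d_alt, 1, 0)]) / P_sigma_z1
    print(f"Delta(d{d} > d{d_alt}) >= {lower:.2f}")
# -> -0.56 (the -0.55 in the text, up to rounding) and -1.00; setting
# P_sigma(Z=1) = 1 recovers the do(Z=1) values -0.4 and -0.8 of Thm. 1.
```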
### 4.3. AI's perceived fairness of decisions

An AI's policy, even if optimal on average, has the potential to bring about a state of the world that is intrinsically harmful or unfair. Harm and fairness can be defined relative to a causal model (Beckers et al., 2022; Plecko et al., 2024). This means that a notion of perceived or subjective harm and fairness could be attributed to AI systems that operate according to an (implicit) causal model. As a consequence, it is conceivable that AIs could be held morally accountable for the harm and unfairness that they cause. How might one estimate the AI's beliefs about the harm and unfairness that its decisions cause?

To ground our discussion, we consider here explicitly counterfactual accounts of fairness and harm. These appeal to hypothetical situations, imagining "what might have been if ...", that can force us to confront our assumptions and values in a way that our regular thought processes might not.[^7] For example, the counterfactual event $(Y_x = 1 \mid X = x_0)$ refers to the outcome $(Y = 1)$ under an intervention $X = x$ when under normal circumstances $X$ would have evaluated to $x_0$. In the literature, probabilities over counterfactuals emerge from the definition of an SCM. For a set of (counterfactual) events $(z_w, \ldots, y_x)$,

$$P(z_w, \ldots, y_x) = \int_{\{u \,:\, Z_w(u) = z_w, \,\ldots,\, Y_x(u) = y_x\}} dP(u). \tag{7}$$

Kusner et al. (2017) made a concrete proposal, arguing that an AI's decision is fair towards an individual if, from the AI's perspective, it entails the same utility in the actual world and in a counterfactual world where the individual belonged to a different group (defined by a sensitive attribute, e.g., gender, race). We adapt this notion to define an AI's counterfactual fairness gap.

**Definition 6 (Counterfactual Fairness Gap).** Let $Z \in \{z_0, z_1\}$ be a protected attribute and $z_0$ a baseline value of $Z$. For a given utility $Y$, define an AI's counterfactual fairness gap relative to a decision $d$, in a given context $\mathbf{c}$, as

$$\Upsilon(d, \mathbf{c}) := \mathbb{E}_{\hat{P}}[Y_{d, z_1} \mid z_0, \mathbf{c}] - \mathbb{E}_{\hat{P}}[Y_d \mid z_0, \mathbf{c}]. \tag{8}$$

We say that an AI intends to be fair with respect to an attribute $Z$ if under any context $\mathbf{C} = \mathbf{c}$ and decision $D = d$ the counterfactual fairness gap $\Upsilon$ evaluates to 0. This means that, under its own internal world model, changing the value of $Z$ on the subset of situations with context $\mathbf{c}$ in which $Z$ was observed to be $z_0$ does not change the AI's expected utility. In the following theorem we show that, unfortunately, the answer to this question is impossible to obtain given only the AI's external behaviour.

**Theorem 5.** Consider an agent with utility $Y$ grounded in a domain $M$. Then,

$$-\mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}] \;\leq\; \Upsilon \;\leq\; 1 - \mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}]. \tag{9}$$

This bound is tight.

The bound is tight in the sense that for each context, decision, and baseline attribute, we can find compatible models for which the equalities hold. The counterfactual fairness gap $\Upsilon$ is under-constrained. Since $\Upsilon = 0$ is consistent with any external behaviour, we can never conclude that the AI system "intends" to be unfair. Moreover, since the width of the bound is equal to 1, we can also never conclude that the AI is anywhere "close" to being fair, according to this counterfactual criterion.

[^7]: Alternative accounts of harm and fairness have been proposed (Barocas & Selbst, 2016; Zhang & Bareinboim, 2018; Plecko et al., 2024), sometimes motivated by scenarios where counterfactual accounts give incomplete results. For some of them, the AI's beliefs can be shown to be similarly constrained by its external behaviour. We provide a longer discussion in Appendix D.

### 4.4. AI's perceived harm of decisions

Prominent definitions of harm are similarly counterfactual in nature: the counterfactual comparative account of harm defines a decision $d$ to harm a person if and only if she would have been better off had $d$ not been taken (Hanser, 2008; Richens et al., 2022; Beckers et al., 2022; Mueller & Pearl, 2023). It is a contrast between events in hypothetical scenarios in which different decisions are made. Here, we quantify how well off a particular situation $\mathbf{W} = \mathbf{w}$ is with a binary utility variable $Y \leftarrow f_Y(\mathbf{W}, \mathbf{U}_Y) \in \{0, 1\}$ that we assume is tracked in experiments, i.e., $Y \in \mathbf{V}$. The following definition describes this notion of harm mathematically.

**Definition 7 (Counterfactual Harm Gap).** Consider an AI with internal model $\hat{M}$ and utility $Y \in \{0, 1\}$. The AI's expected counterfactual harm of a decision $d_1$ with respect to a baseline $d_0$, in context $\mathbf{c}$, is

$$\Omega(d_1, d_0, \mathbf{c}) := \mathbb{E}_{\hat{P}}[\max\{0, Y_{d_0} - Y_{d_1}\} \mid \mathbf{c}]. \tag{10}$$

Operationally, the counterfactual harm gap $\Omega$ is the expected increase in utility had the AI made a default decision $d_0$, with respect to a different decision $d_1$ that the AI is contemplating. Counterfactual harm is therefore lower bounded at 0, with larger values indicating more harm. The following theorem shows that the external behaviour constrains the AI's perception of its counterfactual harm.

**Theorem 6.** Consider an AI with utility $Y$ grounded in a domain $M$. Then,

$$\Omega \;\geq\; \max\{0, \; \mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}] - \mathbb{E}_{P_{d_1}}[Y \mid \mathbf{c}]\}, \qquad \Omega \;\leq\; \min\{1 - \mathbb{E}_{P_{d_1}}[Y \mid \mathbf{c}], \; \mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}]\}.$$

This bound is tight.

This result is an extension of bounds on the probability of causation given by Pearl (1999) and Tian & Pearl (2000). It suggests that an AI's beliefs about the harm that its decisions cause can be inferred approximately from data.
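Both intervals are trivial to evaluate from behavioural estimates. The sketch below uses hypothetical conditional expectations (the numbers are our own, for illustration only) to show how wide the resulting ranges remain.

```python
# Thm. 5 and Thm. 6 intervals from identifiable conditional expectations.
# All three inputs are hypothetical behavioural estimates.
E_d_z0c = 0.7   # E_{P_d}[Y | z0, c]: utility observed under d in context (z0, c)
E_d1_c  = 0.6   # E_{P_{d1}}[Y | c]
E_d0_c  = 0.3   # E_{P_{d0}}[Y | c]

# Thm. 5: the fairness gap is only ever pinned to a width-1 interval.
print("Upsilon in", (round(-E_d_z0c, 2), round(1 - E_d_z0c, 2)))   # (-0.7, 0.3)

# Thm. 6: Frechet-style bounds on the counterfactual harm gap.
lower = max(0.0, E_d0_c - E_d1_c)
upper = min(1.0 - E_d1_c, E_d0_c)
print("Omega in", (round(lower, 2), round(upper, 2)))              # (0.0, 0.3)
```

Note that 0 always lies inside the fairness interval regardless of the inputs, which is the formal content of the claim that intent to be (un)fair can never be established from behaviour alone.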
## 5. The Practical Limits of Behavioural Data

The inductive biases implied by causal models and rational behaviour are powerful constraints on AI behaviour. But they might not capture the practical limitations of AI decision-making. In this section we show that grounding, expected utility maximization, the observed data, etc., can be relaxed in practice.

### 5.1. Approximate grounding

Grounding implies that the AI's beliefs about the likelihood of events in the environment match the observed probabilities. In practice, it might be reasonable to allow for some amount of error, and consider a notion of approximate grounding.

**Definition 8 (Approximate Grounding).** Let $\hat{M}$ represent the AI's internal model. Given a discrepancy measure $\psi$, we say that the AI is approximately grounded in a domain $M$ to a degree $\delta > 0$ if $\psi(\hat{P}_d, P_d) \leq \delta$ for any $d \in \mathrm{supp}\,D$.

The choice of $\psi$ and $\delta$, in practice, depends on what error model is reasonable for the AI and problem at hand (we give an example below). Approximate grounding specifies a looser relationship between our observations of AI behaviour $P$ and what might be going on in the AI's mind $\hat{P}$. For example, the world model of an approximately grounded AI is compatible with any distribution in the set $\{\hat{P}_d : \psi(\hat{P}_d, P_d) \leq \delta\}$. A more conservative bound (than Thm. 1) on predictability can be derived for AIs that are approximately grounded in an environment $M$.

**Corollary 1.** Given a discrepancy measure $\psi$, an AI approximately grounded in a domain $M$ is weakly predictable in a context $\mathbf{C} = \mathbf{c}$ under a shift $\sigma : do(\mathbf{z})$, $\mathbf{Z} \subseteq \mathbf{V}$, if and only if there exists a decision $d'$ such that

$$\min_{\hat{P} \,:\, \psi(\hat{P}, P) \leq \delta} \left\{ \frac{\mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}, \mathbf{z}] \, \hat{P}_d(\mathbf{c}, \mathbf{z})}{\hat{P}_d(\mathbf{c}, \mathbf{z}) + 1 - \hat{P}_d(\mathbf{z})} \;-\; \frac{\mathbb{E}_{\hat{P}_{d'}}[Y \mid \mathbf{c}, \mathbf{z}] \, \hat{P}_{d'}(\mathbf{c}, \mathbf{z}) + 1 - \hat{P}_{d'}(\mathbf{z})}{\hat{P}_{d'}(\mathbf{c}, \mathbf{z}) + 1 - \hat{P}_{d'}(\mathbf{z})} \right\} > 0,$$

for some $d \neq d'$.

*Figure 2. Building on Fig. 1, AIs that are approximate expected utility maximizers (EUM), that internalize proxy objectives, or that obey known causal structure carve out different constraints on the set of possible AI models (from an observer's perspective), which may be exploited to improve our prediction of AI choices out-of-distribution (o.o.d.).*

The strategy in Corol. 1 can be applied to all bounds on behaviour in Sec. 4 to get results under approximate grounding. We can compare the two notions of grounding quantitatively with an example.

**Example 4 (Approximately Grounded Medical AI).** The results in Example 2 exploit the grounding relationship $\hat{P}_d(\mathbf{V}) = P_d(\mathbf{V})$ in $M$. We might want to relax the equality by assuming that the AI is instead approximately grounded. Minimum values of the AI's preference gap $\min_{\hat{M} \in \mathcal{M}} \Delta(d_1 > d_0)$ would then be given by

$$\min_{\hat{P} \,:\, \psi(\hat{P}, P) \leq \delta} \; \hat{P}_{d_1}(Z = z, Y = 1) + \hat{P}_{d_0}(Z = z, Y = 0) - 1.$$

These terms now capture an additional source of uncertainty, due to the external behaviour more loosely constraining $\hat{M}$. An empirical estimate of this quantity could be obtained by sampling distributions $\hat{P}$ close to $P$ according to the distributional distance $\psi$ and threshold $\delta$, and taking the empirical minimum, as follows. Given that the data $(z, d, y) \sim P$ is discretely valued in this example, we could sample probabilities $\{\hat{P}_d(z, y)\}_{z,y}$ from a Dirichlet distribution centred at the vector $\{P_d(z, y)\}_{z,y}$ with a small variance. The distance of each proposal from the reference distribution could then be evaluated according to $\psi$, and each proposal either accepted or rejected using $\delta$. For illustration, we implement a version of this idea setting $\psi$ to be the total variation distance and $\delta = 0.1$. The two minimum values now evaluate to $-0.55$ and $-0.88$, respectively, which is slightly lower than under the assumption of grounding in Example 2 (where they evaluate to $-0.4$ and $-0.8$, respectively).
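A sketch of the sampling procedure just described, assuming $\psi$ is the total variation distance applied to each decision arm separately and $\delta = 0.1$; the Dirichlet concentration and acceptance scheme are our own illustrative choices, so the empirical minimum need not match the figures above exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
delta, conc, n_proposals = 0.1, 100.0, 20_000

# Reference joints P_d(z, y) over (z, y) in order (0,0), (0,1), (1,0), (1,1),
# read off Table 1 of Appendix A (z = 1 is the intervention value of interest).
P = {0: np.array([0.4, 0.2, 0.2, 0.2]),   # decision d = 0
     1: np.array([0.4, 0.2, 0.0, 0.4])}   # decision d = 1

tv = lambda p, q: 0.5 * np.abs(p - q).sum()   # total variation distance

best = np.inf
for _ in range(n_proposals):
    # Dirichlet proposals centred at the reference vectors, small variance.
    Phat = {d: rng.dirichlet(conc * P[d] + 1e-3) for d in (0, 1)}
    if all(tv(Phat[d], P[d]) <= delta for d in (0, 1)):
        # hatP_{d1}(Z=1, Y=1) + hatP_{d0}(Z=1, Y=0) - 1
        best = min(best, Phat[1][3] + Phat[0][2] - 1)

print(round(best, 2))   # an empirical minimum below the grounded bound of -0.4
```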
### 5.2. Approximate expected utility maximization

In real-world environments it might be appropriate to treat the rationality of AI systems as approximate or bounded in some sense: AIs might choose actions that only approximately maximize expected utility (rather than exactly maximize it), given their model. Mirroring Eq. (3), we might say that a bounded AI is weakly predictable in some context $\mathbf{C} = \mathbf{c}$ if and only if there exists a decision $d'$ such that

$$\min_{\hat{M} \in \mathcal{M}} \Delta(d > d') > \lambda, \quad \text{for some } d \neq d'. \tag{11}$$

Here $\lambda > 0$ is a constant that determines how much better a decision $d$ needs to be relative to a decision $d'$ for the AI to reliably rule out $d'$ in favour of others. This representation appeals to the idea of imperfect discrimination, suggesting that the AI discerns between two alternatives only if they yield sufficiently different utilities (Dziewulski, 2021). We might tighten our conditions on the observational data to reflect this behaviour and get a new set of results describing when AIs can be expected to be predictable. For instance, as a corollary to Thm. 1 we have the following.

**Corollary 2.** An AI grounded in a domain $M$ and bounded in the sense of Eq. (11) is weakly predictable in some context $\mathbf{C} = \mathbf{c}$ under a shift $\sigma : do(\mathbf{z})$, $\mathbf{Z} \subseteq \mathbf{V}$, if and only if there exists a decision $d'$ such that

$$\frac{\mathbb{E}_{P_d}[Y \mid \mathbf{c}, \mathbf{z}] \, P_d(\mathbf{c}, \mathbf{z})}{P_d(\mathbf{c}, \mathbf{z}) + 1 - P_d(\mathbf{z})} \;-\; \frac{\mathbb{E}_{P_{d'}}[Y \mid \mathbf{c}, \mathbf{z}] \, P_{d'}(\mathbf{c}, \mathbf{z}) + 1 - P_{d'}(\mathbf{z})}{P_{d'}(\mathbf{c}, \mathbf{z}) + 1 - P_{d'}(\mathbf{z})} \;>\; \lambda,$$

for some $d \neq d'$.

Note the addition of the scalar $\lambda > 0$ in the inequality. Similar corollaries could be stated for all results in Sec. 4.

### 5.3. Approximate inner alignment

A further assumption embedded in our results is the exact observation of the AI's utility in the data. In general, we might expect an AI system to have internalized a proxy $Y^*$ that reflects properties correlated with, but distinct from, the observed utility $Y$ we ultimately wish to optimize, a setting we refer to as approximate inner alignment (Hubinger et al., 2019). We face a problem of partial observability: we don't have empirical access to the AI's actual utility function $Y^*$, and notions such as the preference gap $\Delta$ are therefore not computable. Without any assumptions on the relationship between $Y^*$ and $Y$, the preference gap will be unconstrained, and no inference about the AI's intended action out-of-distribution is possible. However, the observed $Y$ will typically be statistically related to the AI's implicit utility $Y^*$, especially if optimizing for $Y^*$ serves the AI well during training, where success is measured by the observed values of $Y$. Under assumptions specifying how statistically related the observed and proxy utility objectives are, we can expect that wider but possibly informative bounds can still be derived for the AI's beliefs. To show this in a simple setting, consider again the medical AI example.

**Example 5 (Partial Observability).** Imagine that the medical AI in Example 2 has internalized its own concept of an individual's disease progression, $Y^*$. It is implicitly optimizing for that internal construction instead of the intended disease bio-marker $Y$. We know, or can assume, that the observed $Y$ is closely correlated with $Y^*$: in particular, that $P_d(Y^* = 1 \mid Y = 1, Z = z) \geq \alpha$ for some high value of $\alpha$ and all decisions $d$ and situations $z$. In words, whenever the bio-marker suggests health ($Y = 1$), with high probability the AI's interpretation also suggests health ($Y^* = 1$). This then constrains the possible values of $\Delta$ (under an intervention $Z \leftarrow 1$), as $P_d(Y^* = 1 \mid Z = z)$ is no longer arbitrary. In fact, one can show that

$$\min_{\hat{M} \in \mathcal{M}} \Delta(d_1 > d_0) \geq \alpha P_{d_1}(Z = z, Y = 1) - 1, \qquad \min_{\hat{M} \in \mathcal{M}} \Delta(d_0 > d_1) \geq \alpha P_{d_0}(Z = z, Y = 1) - 1.$$

With $\alpha = 0.9$, the bounds evaluate to $-0.64$ and $-0.82$ respectively, which is slightly lower than in Example 2. We can verify also that with $\alpha = 0$, i.e., when we don't know anything about the relationship between $Y$ and $Y^*$, the bounds become uninformative, evaluating to $-1$.
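The proxy-utility bounds of Example 5 are again directly computable from behavioural data:

```python
# Proxy-utility bound of Example 5: with P_d(Y*=1 | Y=1, Z=z) >= alpha and
# nothing else assumed about the proxy Y*, the lower bounds on the preference
# gap become alpha * P_d(Z=z, Y=1) - 1.
alpha = 0.9
P_d1_zy = 0.4   # P_{d1}(Z=1, Y=1) in the medical AI example (Table 1)
P_d0_zy = 0.2   # P_{d0}(Z=1, Y=1)

for label, p in [("Delta(d1 > d0)", P_d1_zy), ("Delta(d0 > d1)", P_d0_zy)]:
    print(f"{label} >= {alpha * p - 1:.2f}")
# -> -0.64 and -0.82; setting alpha = 0 collapses both to the vacuous bound -1.
```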
This suggests that behaviour out-of-distribution in (sufficiently constrained) settings of approximate inner alignment can be bounded in principle. Importantly, as the example shows, with the proposed framework we do not require knowing the relationship between $Y^*$ and $Y$ out-of-distribution: that uncertainty is naturally folded into the bounds.

### 5.4. Assumptions on causal structure

The uncertainty in AI decision-making out-of-distribution is ultimately a consequence of our lack of information about the AI's underlying cognition and internal mechanisms that produce a decision in a given situation, i.e., $\hat{M}$. In the causal inference literature, a common inductive bias to improve upon the data-driven bounds proposed so far is to assume qualitative knowledge about the underlying mechanisms in the form of a causal diagram, see e.g. (Pearl, 2009, Chapter 3). Here we illustrate how mild restrictions on the location of unobserved confounders in $M$ lead to tighter bounds.

**Example 6 (Partial Unconfoundedness).** Consider again our grounded medical AI from Example 2. We might have reason to believe that the association between the intervened variable $Z$ and the utility $Y$ is conditionally unconfounded, meaning that there exists a variable $W \in \{w_0, w_1\}$, $W \in \mathbf{V}$, such that $P_{d,z}(y \mid w) = P_d(y \mid w, z)$. This restriction goes beyond grounding and asserts an equality between probabilities under different shifts that could be communicated to the AI for it to update its world model $\hat{M}$. We could then show, for example, that

$$\min_{\hat{M} \in \mathcal{M}} \Delta(d_1 > d_0) \;\geq\; \{1 - P_{d_1}(Z = z, W = w_1)\} \, P_{d_1}(Y = 1 \mid Z = z, W = w_0) - P_{d_0}(Y = 1, Z = z) \\ + P_{d_1}(Y = 1, Z = z, W = w_1) - \{1 - P_{d_0}(Z = z)\} \, P_{d_0}(Y = 1 \mid Z = z, W = w_1).$$

We show in Appendix A that this bound is strictly tighter than the one given in Example 2. Systematic bounds with access to a causal diagram have been derived by e.g., Zhang et al. (2021) and Jalaldoust et al. (2024), and could be explored further for making inferences about AI decision-making. Fig. 2 illustrates how some of these relaxations can be understood within our model-based formalism.

## 6. Conclusion

An important consideration for safely interacting with AI systems is to form expectations as to how they might act in the future. In this paper, we answer this question under the assumption that AI behaviour can be tracked by a well-specified collection of causal mechanisms (a structural causal model) that represents the AI's world model. This abstraction implies a consistency in behaviour that can in principle be exploited to infer the AI's choice of action in novel environments, out-of-distribution. Building on the theory of causal identification, we provide general bounds on AI decision-making that give theoretical limits on what can be inferred about AI behaviour within our framework. We believe our results can help justify the claim that the design and inference of world models is important to ensure AIs act safely and beneficially.
## Acknowledgements

Thanks to David Lindner and Damiano Fornasiere for comments on a draft of this paper, and to the anonymous reviewers for their feedback.

## Impact Statement

Our work investigates the conditions under which the future behaviour of capable AI systems may be bounded given data from their past behaviour. This is important for AI Safety and for guaranteeing robust capabilities out of distribution. We believe that an understanding of the limitations of our observations of what AIs have done in the past is an important step towards understanding exactly what we can expect from complex AI systems. Reasoning instead without accounting for the potential complexity of their world models may lead researchers to operate on a more heuristic and unsafe basis. For instance, the risks of deploying AI systems in situations where they were not specifically trained might be misrepresented without a deeper analysis. The present work is mostly theoretical in nature, highlighting the risks of under-identification of an AI's inner model, and we therefore believe that it can help researchers and members of the public better appreciate the range of possible behaviours of AI systems, under our assumptions.

## References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1, 2004.

Afriat, S. N. The construction of utility functions from expenditure data. International Economic Review, 8:67–77, 1967.

Amin, K. and Singh, S. Towards resolving unidentifiability in inverse reinforcement learning. arXiv preprint arXiv:1601.06569, 2016.

Balke, A. and Pearl, J. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176, 1997.

Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On Pearl's hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp. 507–556. Association for Computing Machinery, NY, USA, 1st edition, 2022.

Barocas, S. and Selbst, A. D. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.

Beckers, S., Chockler, H., and Halpern, J. A causal analysis of harm. Advances in Neural Information Processing Systems, 35:2365–2376, 2022.

Bellot, A. Towards bounding causal effects under Markov equivalence. In The 40th Conference on Uncertainty in Artificial Intelligence. PMLR, 2024.

Bellot, A. and Chiappa, S. Towards estimating bounds on the effect of policies under unobserved confounding. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.

Bellot, A., Malek, A., and Chiappa, S. Transportability for bandits with data from different environments. Advances in Neural Information Processing Systems, 36, 2024.

Bengio, Y., Cohen, M. K., Malkin, N., MacDermott, M., Fornasiere, D., Greiner, P., and Kaddar, Y. Can a Bayesian oracle prevent harm from an agent? arXiv preprint arXiv:2408.05284, 2024.

Bengio, Y., Cohen, M., Fornasiere, D., Ghosn, J., Greiner, P., MacDermott, M., Mindermann, S., Oberman, A., Richardson, J., Richardson, O., et al. Superintelligent agents pose catastrophic risks: Can scientist AI offer a safer path? arXiv preprint arXiv:2502.15657, 2025.

Chickering, D. M. and Pearl, J. A clinician's tool for analyzing non-compliance. In Proceedings of the National Conference on Artificial Intelligence, pp. 1269–1276, 1996.

Correa, J. and Bareinboim, E. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10093–10100, 2020a.
Correa, J. and Bareinboim, E. General transportability of soft interventions: Completeness results. Advances in Neural Information Processing Systems, 33:10902–10912, 2020b.

Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., Omohundro, S., Szegedy, C., Goldhaber, B., Ammann, N., et al. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624, 2024.

Davidson, D. Actions, reasons, and causes. The Journal of Philosophy, 60(23):685–700, 1963.

Dennett, D. C. The Intentional Stance. MIT Press, 1989.

Dennett, D. C. From Bacteria to Bach and Back: The Evolution of Minds. W. W. Norton & Company, 2017.

Dziewulski, P. A comprehensive revealed preference approach to approximate utility maximisation. Tech Report, 2021.

Everitt, T., Carey, R., Langlois, E. D., Ortega, P. A., and Legg, S. Agent incentives: A causal perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11487–11495, 2021.

Finkelstein, N. and Shpitser, I. Deriving bounds and inequality constraints using logical relations among counterfactuals. In Conference on Uncertainty in Artificial Intelligence, pp. 1348–1357. PMLR, 2020.

Geiger, A., Lu, H., Icard, T., and Potts, C. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.

Geiger, A., Wu, Z., Potts, C., Icard, T., and Goodman, N. Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning, pp. 160–187. PMLR, 2024.

Goldstein, S. and Levinstein, B. A. Does ChatGPT have a mind? arXiv preprint arXiv:2407.11015, 2024.

Gurnee, W. and Tegmark, M. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.

Halpern, J. Y. and Piermont, E. Subjective causality. arXiv preprint arXiv:2401.10937, 2024.

Hanser, M. The metaphysics of harm. Philosophy and Phenomenological Research, 77(2):421–450, 2008.

Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., and Garrabrant, S. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019.

Jalaldoust, K., Bellot, A., and Bareinboim, E. Partial transportability for domain generalization. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.

Jeffrey, R. C. The Logic of Decision. University of Chicago Press, 1990.

Jesson, A., Mindermann, S., Gal, Y., and Shalit, U. Quantifying ignorance in individual-level causal-effect estimates under hidden confounding. In International Conference on Machine Learning, pp. 4829–4838. PMLR, 2021.

Joshi, S., Zhang, J., and Bareinboim, E. Towards safe policy learning under partial identifiability: A causal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 13004–13012, 2024.

Joyce, J. M. The Foundations of Causal Decision Theory. Cambridge University Press, 1999.

Kim, K., Garg, S., Shiragur, K., and Ermon, S. Reward identification in inverse reinforcement learning. In International Conference on Machine Learning, pp. 5496–5505. PMLR, 2021.

Kusner, M. J., Loftus, J., Russell, C., and Silva, R. Counterfactual fairness. Advances in Neural Information Processing Systems, 30, 2017.
Legg, S. System 2 safety. https://www.youtube.com/watch?v=8IUIGVVLbCg&ab_channel=FAR%E2%80%A4AI, 2023. Accessed: 2025-01-24.

Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.

Manski, C. F. The structure of random utility models. Theory and Decision, 8(3):229, 1977.

Manski, C. F. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319–323, 1990.

Mueller, S. and Pearl, J. Personalized decision making: a conceptual introduction. Journal of Causal Inference, 11(1):20220050, 2023.

Ng, A. Y., Russell, S., et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pp. 2, 2000.

Pearl, J. Probabilities of causation: three counterfactual interpretations and their identification. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp. 317–372, 1999.

Pearl, J. Causality. Cambridge University Press, 2009.

Plecko, D., Bareinboim, E., et al. Causal fairness analysis: a causal toolkit for fair machine learning. Foundations and Trends in Machine Learning, 17(3):304–589, 2024.

Richens, J. and Everitt, T. Robust agents learn causal world models. arXiv preprint arXiv:2402.10877, 2024.

Richens, J., Beard, R., and Thompson, D. H. Counterfactual harm. Advances in Neural Information Processing Systems, 35:36350–36365, 2022.

Robins, J. M. The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. Health Service Research Methodology: A Focus on AIDS, pp. 113–159, 1989.

Rosenbaum, P. R. Design of Observational Studies, volume 10. Springer, 2010.

Rothkopf, C. A. and Dimitrakakis, C. Preference elicitation and inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III 22, pp. 34–48. Springer, 2011.

Savage, L. J. The Foundations of Statistics. Courier Corporation, 1972.

Schwitzgebel, E. Belief. In Zalta, E. N. and Nodelman, U. (eds.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Spring 2024 edition, 2024.

Shanahan, M. Talking about large language models. Communications of the ACM, 67(2):68–79, 2024.

Skalse, J. and Abate, A. Misspecification in inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 15136–15143, 2023.

Skalse, J. M. V., Farrugia-Roberts, M., Russell, S., Abate, A., and Gleave, A. Invariance in policy optimisation and partial identifiability in reward learning. In International Conference on Machine Learning, pp. 32033–32058. PMLR, 2023.

Tan, Z. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619–1637, 2006.

Tian, J. and Pearl, J. Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1):287–313, 2000.

Toshniwal, S., Wiseman, S., Livescu, K., and Gimpel, K. Chess as a testbed for language model state tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 11385–11393, 2022.

Vafa, K., Chen, J. Y., Kleinberg, J., Mullainathan, S., and Rambachan, A. Evaluating the world model implicit in a generative model. arXiv preprint arXiv:2406.03689, 2024.
Yadlowsky, S., Namkoong, H., Basu, S., Duchi, J., and Tian, L. Bounds on the conditional and average treatment effect with unobserved confounding factors. arXiv preprint arXiv:1808.09521, 2018.

Zhang, J. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In International Conference on Machine Learning, pp. 11012–11022. PMLR, 2020.

Zhang, J. and Bareinboim, E. Fairness in decision-making: the causal explanation formula. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Zhang, J. and Bareinboim, E. Bounding causal effects on continuous outcome. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12207–12215, 2021.

Zhang, J., Tian, J., and Bareinboim, E. Partial counterfactual identification from observational and experimental data. arXiv preprint arXiv:2110.05690, 2021.

## A. Discussion of Examples

In this section, we provide additional details to better appreciate the examples provided in the main body of this work.

In Example 1, we introduce two SCMs that might serve as internal world models for an AI agent but that induce different optimal decisions if evaluated out-of-distribution. Let $M^1_d := \langle \mathbf{V} := \{D, Z, Y\}, \mathbf{U} := U, \mathcal{F}^1, P \rangle$ be given by

$$D \leftarrow d, \qquad Z \leftarrow \mathbb{1}\{U \in \{1, 4\}\}, \qquad Y \leftarrow \begin{cases} Z \cdot \mathbb{1}\{U = 4\} + (1 - Z) \cdot \mathbb{1}\{U \in \{1, 3, 4\}\} & \text{if } d = 0 \\ Z \cdot \mathbb{1}\{U \neq 2\} + (1 - Z) \cdot \mathbb{1}\{U \in \{2, 4\}\} & \text{if } d = 1, \end{cases}$$

with $P(U = u) = 0.2$ for $u \in \{1, 2, 3, 4, 5\}$, and let $M^2_d := \langle \mathbf{V} := \{D, Z, Y\}, \mathbf{U} := U, \mathcal{F}^2, P \rangle$ be given by

$$D \leftarrow d, \qquad Z \leftarrow \mathbb{1}\{U \in \{1, 4\}\}, \qquad Y \leftarrow \begin{cases} Z \cdot \mathbb{1}\{U \neq 1\} + (1 - Z) \cdot \mathbb{1}\{U \in \{3, 4\}\} & \text{if } d = 0 \\ Z \cdot \mathbb{1}\{U \in \{1, 4\}\} + (1 - Z) \cdot \mathbb{1}\{U \in \{1, 2\}\} & \text{if } d = 1, \end{cases}$$

with $P(U = u) = 0.2$ for $u \in \{1, 2, 3, 4, 5\}$.

The endogenous variables $\mathbf{V} := \{D, Z, Y\}$ represent, respectively, the medical treatment $D$, a clinical outcome of interest $Y$, and an auxiliary variable $Z$. The exogenous variable $U$ is a latent variable that influences the values of $Z$ and $Y$ obtained in experiments. Under the definition of an SCM, these specifications induce a mapping of events in the space of $P(U)$ to $P(\mathbf{V})$. In the context of $M^1$ and $M^2$, each entry in Tables 1 and 2 corresponds to an event in the space of $U$ and a corresponding realisation of $\mathbf{V}$ according to the functions $\mathcal{F}^1$ and $\mathcal{F}^2$. A particular probability can be evaluated according to $M^1$ and $M^2$, for example,

$$P^{M^1}_{d=1}(Z = 1, Y = 1) = \sum_{u \,:\, Z_{d=1}(u) = 1, \, Y_{d=1}(u) = 1} P(u) = P(U \in \{1, 4\}) = 0.4, \tag{12}$$

which is just the sum of the probabilities of the events in the space of $U$ consistent with the events $(Z_{d=1} = 1, Y_{d=1} = 1)$. Since both tables lead to the same realisations of events $\mathbf{V} = \mathbf{v}$, we can conclude that probabilities of the form $P_d(z, y)$ evaluate to the same values under $M^1$ and $M^2$. That is, both models are valid internal representations of AI models that are grounded in an environment with data sampled according to $P_d(z, y)$.

We could similarly evaluate probability expressions under different sub-models of $M^1$ and $M^2$. In particular, consider the sub-models obtained by fixing $Z \leftarrow 1$, given by $M^1_{d, z=1}$ and $M^2_{d, z=1}$, with the following updated structural functions:

$$M^1: \quad D \leftarrow d, \quad Z \leftarrow 1, \quad Y \leftarrow \begin{cases} Z \cdot \mathbb{1}\{U = 4\} + (1 - Z) \cdot \mathbb{1}\{U \in \{1, 3, 4\}\} & \text{if } d = 0 \\ Z \cdot \mathbb{1}\{U \neq 2\} + (1 - Z) \cdot \mathbb{1}\{U \in \{2, 4\}\} & \text{if } d = 1, \end{cases}$$

$$M^2: \quad D \leftarrow d, \quad Z \leftarrow 1, \quad Y \leftarrow \begin{cases} Z \cdot \mathbb{1}\{U \neq 1\} + (1 - Z) \cdot \mathbb{1}\{U \in \{3, 4\}\} & \text{if } d = 0 \\ Z \cdot \mathbb{1}\{U \in \{1, 4\}\} + (1 - Z) \cdot \mathbb{1}\{U \in \{1, 2\}\} & \text{if } d = 1. \end{cases}$$

| $U$ | $D_{d=0}$ | $Z_{d=0}$ | $Y_{d=0}$ | $D_{d=1}$ | $Z_{d=1}$ | $Y_{d=1}$ | $P(u)$ |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.2 |
| 2 | 0 | 0 | 0 | 1 | 0 | 1 | 0.2 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0.2 |
| 4 | 0 | 1 | 1 | 1 | 1 | 1 | 0.2 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0.2 |

*Table 1. Mapping of events in the space of $U$ to $\mathbf{V}$ in the context of $M^1$.*

| $U$ | $D_{d=0}$ | $Z_{d=0}$ | $Y_{d=0}$ | $D_{d=1}$ | $Z_{d=1}$ | $Y_{d=1}$ | $P(u)$ |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.2 |
| 2 | 0 | 0 | 0 | 1 | 0 | 1 | 0.2 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0.2 |
| 4 | 0 | 1 | 1 | 1 | 1 | 1 | 0.2 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0.2 |

*Table 2. Mapping of events in the space of $U$ to $\mathbf{V}$ in the context of $M^2$.*
Probabilities of events under these two models might now take different values. For example,

$$P^{M^1}_{d=1, z=1}(Y = 1) = \sum_{u \,:\, Y_{d=1, z=1}(u) = 1} P(u) = P(U \neq 2) = 0.8, \tag{13}$$

$$P^{M^2}_{d=1, z=1}(Y = 1) = \sum_{u \,:\, Y_{d=1, z=1}(u) = 1} P(u) = P(U \in \{1, 4\}) = 0.4, \tag{14}$$

and similarly,

$$P^{M^1}_{d=0, z=1}(Y = 1) = \sum_{u \,:\, Y_{d=0, z=1}(u) = 1} P(u) = P(U = 4) = 0.2, \tag{15}$$

$$P^{M^2}_{d=0, z=1}(Y = 1) = \sum_{u \,:\, Y_{d=0, z=1}(u) = 1} P(u) = P(U \neq 1) = 0.8. \tag{16}$$

Under an intervention on $Z$ (out-of-distribution), the decision $d$ that leads to maximum utility $Y$ differs between $M^1$ and $M^2$. Specifically, under $M^1$ decision $d = 1$ is favoured (as $P^{M^1}_{d=1, z=1}(Y = 1) > P^{M^1}_{d=0, z=1}(Y = 1)$), while under $M^2$ decision $d = 0$ is favoured (as $P^{M^2}_{d=1, z=1}(Y = 1) < P^{M^2}_{d=0, z=1}(Y = 1)$). This illustrates the possible under-determination of an AI's choice of action out-of-distribution given observations of its external behaviour only, as multiple (contradicting) world models are equally consistent with the observed data.

In more realistic settings, we might wonder about AI behaviour under arbitrary shifts $\sigma$, not only atomic interventions. We follow Correa & Bareinboim (2020a) to define a shift $\sigma$ on $\mathbf{Z} \subseteq \mathbf{V}$ in $M := \langle \mathbf{V}, \mathbf{U}, \mathcal{F}, P \rangle$ as inducing a sub-model $M_\sigma$ in which the mechanisms for $\mathbf{Z}$, that is $\{f_Z : Z \in \mathbf{Z}\}$, and the exogenous variables $\mathbf{U}_Z$, $Z \in \mathbf{Z}$, are replaced by those specified by $\sigma$:

$$M_\sigma := \langle \mathbf{V}, \mathbf{U}_\sigma, \mathcal{F}_\sigma, P \rangle, \qquad \mathbf{U}_\sigma = \mathbf{U} \cup \bigcup_{Z \in \mathbf{Z}} \mathbf{U}_{Z, \sigma}, \qquad \mathcal{F}_\sigma = \mathcal{F} \cup \{f_{Z, \sigma} : Z \in \mathbf{Z}\} \setminus \{f_Z : Z \in \mathbf{Z}\}, \tag{17}$$

where $\mathbf{U}_{Z, \sigma}$ and $\{f_{Z, \sigma} : Z \in \mathbf{Z}\}$ define the new assignments for $\mathbf{Z}$ (and can be arbitrarily defined as long as they induce a valid SCM). We have shown in Thm. 3 that unless some knowledge of $\sigma$ (beyond the variables it affects) or its consequences is available, the AI is not predictable. Furthermore, the AI's preference gap for each context $\mathbf{C} = \mathbf{c}$ and pair of decisions $(d, d')$ is unconstrained.

In practice, though, it might be realistic to have access to covariate data in the shifted environment, i.e., $P_{\sigma,d}(\mathbf{c})$, and to communicate this information to the AI for it to update its internal model accordingly. Example 3 illustrates the inference that could be conducted in that case using the medical AI defined above. In particular, the exact nature of the shift $\sigma$ is unknown, but we do have access to its consequences on the distribution of covariates. This is plausible in many scenarios. For example, in medicine, demographic data is typically available for most regions on earth, but the precise effects of treatments are not, because not all populations benefit from the same access to medication. For illustration, assume that the medical AI is considered for deployment in a population that varies in its level of blood pressure $Z$, potentially due to a different underlying biological mechanism that in turn also affects other variables in the system. We do know that the baseline rate of high blood pressure is high, given by $P_\sigma(Z = 1) = 0.9$: higher than that observed during training, $P(Z = 1) = 0.4$. By Thm. 4, we can establish that in this setting the preference gap in situations where $Z = 1$ is no worse than

$$\Delta_{d_1 > d_0} \geq 1 - \{2 - P_{d_1}(Z = 1, Y = 1) - P_{d_0}(Z = 1, Y = 0)\} \,/\, P_{\sigma, d_0}(Z = 1) = -0.55, \tag{18}$$

$$\Delta_{d_0 > d_1} \geq 1 - \{2 - P_{d_0}(Z = 1, Y = 1) - P_{d_1}(Z = 1, Y = 0)\} \,/\, P_{\sigma, d_1}(Z = 1) = -1, \tag{19}$$

for the medical AI. Interestingly, note also that if we were in a shifted environment with $P_\sigma(Z = 1) = 1$, which is equivalent to an atomic intervention $Z \leftarrow 1$, the bounds reduce to the ones given by Thm. 1, evaluating to $-0.4$ and $-0.8$ respectively, as also shown above.
Continuing with the grounded Medical AI deployed under an atomic intervention, imagine that the Medical AI has internalized its own concept of an individual's disease progression, $Y^*$, as in Example 5. It is implicitly optimizing for that internal construction of its own, instead of the intended disease bio-marker $Y$ to be optimized. We know, or can assume, that the observed $Y$ is closely correlated with $Y^*$: in particular, that $P_d(Y^* = 1 \mid Y = 1, Z = z) \geq \alpha$ for some high value of $\alpha$ and all decisions $d$ and situations $z$. In words, whenever the bio-marker suggests health ($Y = 1$), with high probability the AI's interpretation also suggests health ($Y^* = 1$). This then constrains the possible values of $\Delta$ (under an intervention $Z \leftarrow z$), as $P_d(Y^* = 1 \mid Z = z)$ is no longer arbitrarily defined. The bounds derived in Example 2 on the AI's belief on optimal decisions under an intervention $\sigma : \{Z \leftarrow z\}$ continue to hold:

$\Delta_{d_1 \succ d_0} \geq P_{d_1}(z, y^*) - P_{d_0}(z, y^*) + P_{d_0}(z) - 1$, (20)
$\Delta_{d_0 \succ d_1} \geq P_{d_0}(z, y^*) - P_{d_1}(z, y^*) + P_{d_1}(z) - 1$, (21)

where we have used the shorthand $P_d(z, y^*) := P_d(Z = z, Y^* = 1)$. But the distributions $\{P_d(z, y^*)\}_d$ can only be partially inferred from our assumption on the relationship between $Y$ and $Y^*$. For instance, notice that

$P_d(Z = z, Y^* = 1) = P_d(Y^* = 1 \mid Z = z) P_d(Z = z)$ (22)
$= \{P_d(Y^* = 1 \mid Y = 1, Z = z) P_d(Y = 1 \mid Z = z)$ (23)
$\; + P_d(Y^* = 1 \mid Y = 0, Z = z) P_d(Y = 0 \mid Z = z)\} P_d(Z = z)$. (24)

The values of $P_d(Y^* = 1 \mid Y = 1, Z = z)$ and $P_d(Y^* = 1 \mid Y = 0, Z = z)$ are partially known: $P_d(Y^* = 1 \mid Y = 1, Z = z) \geq \alpha$, while $P_d(Y^* = 1 \mid Y = 0, Z = z)$ is unconstrained. In particular,

$P_d(Z = z, Y^* = 1) \geq \alpha P_d(Y = 1 \mid Z = z) P_d(Z = z) = \alpha P_d(Z = z, Y = 1)$, (25)
$P_d(Z = z, Y^* = 1) \leq P_d(Z = z)$. (26)

Putting these terms into Eqs. (20) and (21) so as to derive correct lower bounds, we obtain

$\Delta_{d_1 \succ d_0} \geq \alpha P_{d_1}(Z = z, Y = 1) - 1$, (27)
$\Delta_{d_0 \succ d_1} \geq \alpha P_{d_0}(Z = z, Y = 1) - 1$. (28)

Looking at Tables 1 and 2, we can then conclude that for $\alpha = 0.9$ and $\sigma : \{Z \leftarrow 1\}$, the bounds evaluate to $-0.64$ and $-0.82$, respectively.

Moving now onto incorporating assumptions on the structure of the real world $M$, consider again the grounded Medical AI with observed utility $Y$. One possible inductive bias we might introduce is the absence of an unobserved common cause between the variable $Z$ that shifts out-of-distribution and the utility $Y$. We say that $Z$ and $Y$ are conditionally unconfounded given $W$ if there exists an observed variable $W \in \{w, \bar{w}\}$, $W \in V$, such that $\mathbb{E}_{P_{d,z}}[Y \mid w] = \mathbb{E}_{P_d}[Y \mid w, z]$. This restriction goes beyond grounding and asserts an equality between probabilities under different shifts that could, nevertheless, be communicated to the AI for it to update its world model $\widehat{M}$, that is, $\mathbb{E}_{\widehat{P}_{d,z}}[Y \mid w] = \mathbb{E}_{\widehat{P}_d}[Y \mid w, z]$. We could then leverage the following decomposition to obtain tighter bounds:

$\mathbb{E}_{\widehat{P}_{d,z}}[Y] = \sum_w \mathbb{E}_{\widehat{P}_{d,z}}[Y \mid w] \widehat{P}_{d,z}(w)$, marginalizing over $W$ (29)
$= \sum_w \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \widehat{P}_{d,z}(w)$, by assumption (30)
$= \{\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]\} \widehat{P}_{d,z}(w) + \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]$. (31)

We can then bound $\widehat{P}_{d,z}(w)$ to obtain

$\widehat{P}_d(w, z) \leq \widehat{P}_{d,z}(w) \leq \widehat{P}_d(w, z) + 1 - \widehat{P}_d(z)$. (32)

Without loss of generality, assume $\{\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]\} \geq 0$. We could then show that

$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \geq \{\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]\} \widehat{P}_d(w, z) + \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]$, (33)
$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \leq \{\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]\} \{\widehat{P}_d(w, z) + 1 - \widehat{P}_d(z)\} + \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]$. (34)
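Before verifying this formally, a numerical sketch may help. The values below are hypothetical (they are ours, chosen only for illustration), and the last two quantities are the assumption-free bounds derived in Thm. 1, against which the verification that follows compares:

```python
# Numerical sketch of bounds (33)-(34) under conditional unconfoundedness,
# with hypothetical values for the observable inputs.
E_w  = 0.9   # E_{P_d}[Y | w, z]
E_wb = 0.3   # E_{P_d}[Y | w-bar, z]   (so E_w - E_wb >= 0, as assumed)
P_wz = 0.25  # P_d(w, z)
P_z  = 0.5   # P_d(z)
E_z  = (E_w * P_wz + E_wb * (P_z - P_wz)) / P_z  # implied E_{P_d}[Y | z] = 0.6

lower = (E_w - E_wb) * P_wz + E_wb                 # Eq. (33)
upper = (E_w - E_wb) * (P_wz + 1 - P_z) + E_wb     # Eq. (34)
lower_free = E_z * P_z                             # assumption-free lower bound
upper_free = E_z * P_z + 1 - P_z                   # assumption-free upper bound
print(lower_free, lower, upper, upper_free)        # 0.3 0.45 0.75 0.8
```

The interval $[0.45, 0.75]$ sits strictly inside the assumption-free interval $[0.3, 0.8]$, as the derivation below establishes in general.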
We can verify that these bounds are superior to what we would have obtained without the assumption of conditional unconfoundedness by noting that

$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \geq \{\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]\} \widehat{P}_d(w, z) + \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]$ (35)
$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \widehat{P}_d(z, w) + \{1 - \widehat{P}_d(w, z)\} \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]$ (36)
$\geq \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \widehat{P}_d(z, w) + \widehat{P}_d(\bar{w}, z) \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]$ (37)
$= \mathbb{E}_{\widehat{P}_d}[Y \mid z] \widehat{P}_d(z)$, (38)

where the last inequality holds since $\widehat{P}_d(\bar{w}, z) \leq 1 - \widehat{P}_d(w, z)$, and $\mathbb{E}_{\widehat{P}_d}[Y \mid z] \widehat{P}_d(z)$ is the assumption-free lower bound. This shows that the derived lower bound is better. For the upper bound, note that

$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \leq \{\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]\} \{\widehat{P}_d(w, z) + 1 - \widehat{P}_d(z)\} + \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z]$ (39)
$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \{\widehat{P}_d(w, z) + 1 - \widehat{P}_d(z)\} + \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z] \{1 - \widehat{P}_d(w, z) - (1 - \widehat{P}_d(z))\}$ (40)
$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \widehat{P}_d(z, w) + \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \{1 - \widehat{P}_d(z)\} + \mathbb{E}_{\widehat{P}_d}[Y \mid \bar{w}, z] \{\widehat{P}_d(z) - \widehat{P}_d(w, z)\}$ (41)
$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \widehat{P}_d(z, w) + \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \{1 - \widehat{P}_d(z)\} + \widehat{P}_d(y, z, \bar{w})$ (42)
$= \mathbb{E}_{\widehat{P}_d}[Y \mid z] \widehat{P}_d(z) + \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \{1 - \widehat{P}_d(z)\}$ (43)
$\leq \mathbb{E}_{\widehat{P}_d}[Y \mid z] \widehat{P}_d(z) + 1 - \widehat{P}_d(z)$, (44)

where the last inequality holds since $\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \leq 1$, giving the assumption-free upper bound. This shows that the derived upper bound is better. By combining these results we obtain, together with the assumption of grounding,

$\Delta_{d_1 \succ d_0} \geq \mathbb{E}_{P_{d_1}}[Y \mid w, z] P_{d_1}(z, w) + A_1 \mathbb{E}_{P_{d_1}}[Y \mid \bar{w}, z] - \mathbb{E}_{P_{d_0}}[Y \mid z] P_{d_0}(z) - A_2 \mathbb{E}_{P_{d_0}}[Y \mid w, z]$, (45)
$\Delta_{d_1 \succ d_0} \leq \mathbb{E}_{P_{d_1}}[Y \mid z] P_{d_1}(z) + A_3 \mathbb{E}_{P_{d_1}}[Y \mid w, z] - \mathbb{E}_{P_{d_0}}[Y \mid w, z] P_{d_0}(z, w) - A_4 \mathbb{E}_{P_{d_0}}[Y \mid \bar{w}, z]$, (46)

where $A_1 := 1 - P_{d_1}(z, w)$, $A_2 := 1 - P_{d_0}(z)$, $A_3 := 1 - P_{d_1}(z)$, $A_4 := 1 - P_{d_0}(z, w)$.

B. Related work

An important consideration for safely interacting with AI systems is to form expectations as to how they might act in the future. This research program draws on different areas that are related to the results we present in this paper.

B.1. Do current AIs represent the world?

World models are important because they offer a path between pattern recognition and a more genuine form of understanding. It is plausible that world models will play an increasing role (explicitly or implicitly) in improving reasoning capabilities and safety. For example, Dalrymple et al. (2024) list having a world model as a key component towards designing "guaranteed safe AI". In the literature, several works have argued that LLM activations carry information that correlates with meaningful concepts in the world and that causally influences LLM outputs. Early examples come from AIs trained on board games such as Othello and logic games. Li et al. (2022) showed that a model trained on natural language descriptions of Othello moves developed internal representations of the board state, which it used to predict valid moves in unseen board configurations. Vafa et al. (2024) and Gurnee & Tegmark (2023), among others, also build on this approach to study navigation tasks and logic puzzles, and representations of space and time. The emergence of causal models in LLMs has also been studied by Geiger et al. (2021) and, more recently, Geiger et al. (2024). The extent to which this evidence supports genuine folk psychological concepts (desires, beliefs, intentions) is also debated by Goldstein & Levinstein (2024).

B.2. Causal Inference

We might wonder whether the behaviour of AIs, to the extent that they carry a world model representation that guides their decisions out-of-distribution, can be predicted before deployment. The causal inference literature studies this question in the context of the prediction of causal effects.
In the early 1990s, Robins (1989) and Manski (1990) showed that useful inference about causal effects could be drawn without making identifying assumptions beyond the observed data, and that it could be refined for studies with imperfect compliance under a set of instrumental variable assumptions. Closed-form expressions for bounds on causal effects were also derived in discrete systems with more general assumptions represented in causal diagrams (Zhang, 2020; Bellot, 2024), using both observational and interventional data (Joshi et al., 2024), and to bound the effect of policies (Bellot & Chiappa, 2024; Zhang & Bareinboim, 2021). A separate body of work instead proposed to use polynomial optimization to calculate causal bounds from a given causal diagram (Balke & Pearl, 1997; Chickering & Pearl, 1996). This approach involves creating a set of standard models, parameterized by the causal diagram, and then converting the bounding problem into a sequence of equivalent linear (or polynomial) programs (Finkelstein & Shpitser, 2020; Zhang et al., 2021; Jalaldoust et al., 2024). In parallel, a number of works have adopted sensitivity assumptions (as an alternative to, or in combination with, a causal diagram) that quantify the degree of unobserved confounding through various data statistics, such as odds ratios, propensity scores, etc. Prominent examples include the sensitivity models of Tan (2006) and Rosenbaum et al. (2010). Several methods have proposed bounds with favourable statistical properties based on these models, see e.g. Jesson et al. (2021) and Yadlowsky et al. (2018).

B.3. Reinforcement Learning

The problem of inferring what objective an agent is pursuing based on the actions and data observed by that agent is studied in Inverse Reinforcement Learning (IRL) (Ng et al., 2000). Several papers have studied the partial identifiability of various reward learning models (Skalse & Abate, 2023; Kim et al., 2021; Ng et al., 2000; Skalse et al., 2023), and share a similar objective to that of this work. There are two differences that are worth mentioning. First, our work complements these approaches by studying the partial identifiability of world models, which capture the assignment of reward but also the relationships between other auxiliary variables in the environment. This enables us to reason about the effect of shifts and interventions, and to give guarantees in specific out-of-distribution problems. Second, our objective is not necessarily to characterize compatible world models explicitly, but rather to understand their implications for decision-making, i.e., what is the set of possible actions that an AI might take given our uncertainty about its world model.

Our work is also related to the study of Bengio et al. (2024), who consider deriving (probabilistic) bounds on the probability of harm given data. They similarly argue that multiple theories, in their case transition probabilities from one state to another in a Markov Decision Process (MDP), might explain the dependencies in the data to a larger or lesser degree. Each transition model might then be associated with a posterior probability given the data, which implies a corresponding posterior probability of harm. Our results, in contrast, are not probabilistic in nature. We provide closed-form bounds that can be interpreted as capturing all possible behaviours implied by the data, with probability 1 (which is a possible limitation of our work).
The class of world models we consider (i.e., SCMs) is also much more general than transition models in MDPs, allowing us to reason about expected AI behaviour under shifts in the environment, out of distribution.

B.4. Decision Theory

Inverse reinforcement learning is closely related to the study of revealed preferences in psychology and economics, which similarly aims to infer preferences from behaviour (Rothkopf & Dimitrakakis, 2011). Causal and counterfactual accounts of decision theory are an active area of research, see e.g. Joyce (1999). Recently, a representation theorem was shown that explicitly connects rational behaviour with structural causal models (Halpern & Piermont, 2024). The authors showed that whenever the preferences of an agent over interventions satisfy axioms that relate to the proper interpretation of counterfactuals and rationality, we can model behaviour as emerging from an SCM. The same conclusion can also be obtained for agents capable of solving tasks in multiple environments (Richens & Everitt, 2024): in essence, robustness over multiple environments is equivalent (in the limit) to operating according to a causal model of the environment.

B.5. Limitations

The following are the main limitations of our work that will be important to address for developing a more complete understanding of AI behaviour.

In this work, we start from the assumption that past and future behaviour of an AI system is consistent with an underlying world model that can be represented as an SCM. In general, this presupposes a certain rationality and consistency in the AI's outputs that might not be realistic for all systems. Some relaxations are discussed in Sec. 5.

Structural Causal Models generally suppose the system is acyclic and without feedback, and don't naturally capture systems evolving continuously in time (perhaps better described using differential equations). Our bounds similarly rely on this assumption and may give unreliable inferences if applied to systems in which feedback is important.

We have stated our guarantees in the infinite sample limit, without quantifying the finite-sample estimation uncertainty. Consequently, we should exercise caution when using the proposed bounds in small sample scenarios where estimators may be inaccurate. Finite-sample properties could be explored similarly to Bengio et al. (2024) by parameterizing the AI's underlying model and making inference on the corresponding latent variable model to get high-probability bounds. An example parameterization of SCMs and probabilistic inference for decision-making across environments is given in Bellot et al. (2024) and Jalaldoust et al. (2024). We expect that similar techniques could be applied in our setting.

We do not exploit the verbal behaviour of AI systems. In the context of LLMs, in principle, we might ask the system about its future behaviour explicitly, e.g., "Were I to intervene in the environment, what action do you believe is optimal?". It might not be obvious, however, that we can trust that what they "say" ultimately matches what they will "do".

Decision-making, in practice, involves many considerations that go beyond expected-utility-maximization formalisms. For example, we might train AI systems to be virtuous, e.g., the AI is trained to never pick actions that can be considered harmful (defined according to a certain natural language specification) no matter its expected utility.
These considerations would change the kind of predictions we could make about the future behaviour of AI systems.

C. Proofs and additional results

This section provides proofs for the statements made in the main body of this work. Before we start, we recall a few basic results that will be used in the derivation of our proofs.

Definition 9 (The Axioms of Counterfactuals, Chapter 7.3.1 (Pearl, 2009)). For any three sets of endogenous variables $X, Y, W$ in a causal model and $x, w$ in the domains of $X$ and $W$, the following holds:

Composition: $W_x = w$ implies that $Y_{x,w} = Y_x$.
Effectiveness: $X_{w,x} = x$.
Reversibility: $Y_{x,w} = y$ and $W_{x,y} = w$ imply that $Y_x = y$.

Theorem 7 (Soundness and Completeness of the Axioms, Theorems 7.3.3, 7.3.6 (Pearl, 2009)). The Axioms of Counterfactuals are sound and complete for all causal models.

The following rules to manipulate experimental distributions produced by policies extend the do-calculus and will be used in the next Lemma. To make sense of these, note that, graphically, each SCM $M$ is associated with a causal diagram $G$ over $V$, where $V \to W$ if $V$ appears as an argument of $f_W$ in $M$, and $V \leftrightarrow W$ if $U_V \cap U_W \neq \emptyset$, i.e. $V$ and $W$ share an unobserved confounder. For a causal diagram $G$ over $V$, the $X$-lower-manipulation of $G$ deletes all those edges that are out of variables in $X$, and otherwise keeps $G$ as it is. The resulting graph is denoted as $G_{\underline{X}}$. The $X$-upper-manipulation of $G$ deletes all those edges that are into variables in $X$, and otherwise keeps $G$ as it is. The resulting graph is denoted as $G_{\overline{X}}$. We use $\perp$ to denote d-separation in causal diagrams (Pearl, 2009, Def. 1.2.3).

Theorem 8 (Inference Rules, σ-calculus (Correa & Bareinboim, 2020a)). Let $G$ be a causal diagram compatible with an SCM $M$, with endogenous variables $V$. For any disjoint subsets $X, Y, Z \subseteq V$ and two disjoint subsets $T, W \subseteq V \setminus (Z \cup Y)$ (i.e., possibly including $X$), the following rules are valid for any intervention strategies $\pi_X$, $\pi_Z$, and $\pi'_Z$ such that $G_{\pi_X \pi_Z}$, $G_{\pi_X \pi'_Z}$ have no cycles:

Rule 1 (Insertion/Deletion of observations): $P_{\pi_X}(y \mid w, t) = P_{\pi_X}(y \mid w)$ if $(T \perp Y \mid W)$ in $G_{\pi_X}$.

Rule 2 (Change of regimes under observation): $P_{\pi_X, \pi_Z}(y \mid z, w) = P_{\pi_X, \pi'_Z}(y \mid z, w)$ if $(Y \perp Z \mid W)$ in $G_{\pi_X \pi_Z \underline{Z}}$ and $G_{\pi_X \pi'_Z \underline{Z}}$.

Rule 3 (Change of regimes without observation): $P_{\pi_X, \pi_Z}(y \mid w) = P_{\pi_X, \pi'_Z}(y \mid w)$ if $(Y \perp Z \mid W)$ in $G_{\pi_X \pi_Z \overline{Z(W)}}$ and $G_{\pi_X \pi'_Z \overline{Z(W)}}$, where $Z(W)$ is the set of elements in $Z$ that are not ancestors of $W$ in $G_{\pi_X}$.

Lemma 1. Let $\pi : \text{supp } C \times \text{supp } D \mapsto [0, 1]$ be a (probabilistic) policy mapping contexts $c$ to decisions $d$. Then $P_d(V)$ may be computed from $P_\pi(V)$.

Proof. Let $V = C \cup D \cup Y$ and let $G$ be an arbitrary causal diagram summarizing the SCM of the environment. The following derivation shows the claim:

$P_d(v) = P_d(y \mid c) P_d(c)$, by the chain rule of probability (47)
$= P_d(y \mid c) P_\pi(c)$, by Rule 3 of the σ-calculus, since $(C \perp D)$ in $G_{\overline{D}}$ and $G_{\pi, \overline{D}}$ (48)
$= P_\pi(y \mid d, c) P_\pi(c)$, by Rule 2 of the σ-calculus, since $(Y \perp D \mid C)$ in $G_{d, \underline{D}}$ and $G_{\pi, \underline{D}}$. (49)

That is, we have shown that $P_d(v)$ can be expressed as a functional of $P_\pi(v)$. Note that the equalities hold in any causal graph $G$ by definition of $\pi$.
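A minimal sketch of the estimator that Lemma 1 licenses follows (assuming discrete variables and a policy with full support; the function name and data layout are ours):

```python
# Estimate P_d(c, y) = P_pi(y | d, c) * P_pi(c) from behavioural data
# collected under the AI's own policy pi, per Lemma 1.
from collections import Counter

def P_do_d(samples, d):
    """samples: list of (c, d, y) tuples drawn under the policy pi."""
    n = len(samples)
    n_c   = Counter(c for c, _, _ in samples)              # counts for P_pi(c)
    n_cd  = Counter((c, dd) for c, dd, _ in samples)       # counts for P_pi(c, d)
    n_cdy = Counter((c, dd, y) for c, dd, y in samples)    # counts for P_pi(c, d, y)
    return {(c, y): (n_cdy[(c, d, y)] / n_cd[(c, d)]) * (n_c[c] / n)
            for (c, dd, y) in n_cdy if dd == d}
```

In the infinite-sample limit this recovers $P_d(c, y)$ exactly whenever $\pi$ assigns $d$ positive probability in every context, which is the quantity all the bounds below take as input.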
We start by providing proofs for the results on the AI's choice of action out-of-distribution given in Sec. 4.1.

Thm. 1 restated. An AI grounded in a domain $M$ is weakly predictable under a shift $\sigma : do(z)$, $Z \subset V$, in a context $C = c$ if and only if there exists a decision $d$ such that

$\frac{\mathbb{E}_{P_d}[Y \mid c, z] P_d(c, z)}{P_d(c, z) + 1 - P_d(z)} - \frac{\mathbb{E}_{P_{d'}}[Y \mid c, z] P_{d'}(c, z) + 1 - P_{d'}(z)}{P_{d'}(c, z) + 1 - P_{d'}(z)} > 0$, for some $d' \neq d$. (50)

Proof. Recall that the AI is weakly predictable in a context $C = c$ if and only if there exists a decision $d$ such that

$\min_{\widehat{M} \in \mathcal{M}} \Delta(d \succ d') > 0$, where $\Delta_{d \succ d'} := \mathbb{E}_{\widehat{P}_{\widehat{M}}}[Y \mid do(\sigma, d), c] - \mathbb{E}_{\widehat{P}_{\widehat{M}}}[Y \mid do(\sigma, d'), c]$, for some $d' \neq d$. (51)

$\mathcal{M}$ denotes the set of compatible SCMs, i.e., those that generate the data under our assumptions. $\Delta$ is the AI's preference gap between two decisions in some situation $C = c$. We will consider the derivation of bounds on each term of the difference in $\Delta$ separately. Firstly, note that

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mid C = c] = \mathbb{E}_{\widehat{P}_{z,d}}[Y \mathbb{1}_c(C)] / \widehat{P}_{z,d}(c)$. (52)

Analytical Lower Bound. A lower bound on this ratio can be obtained by minimizing the numerator and maximizing the denominator, for example using the following derivation:

$\mathbb{E}_{\widehat{P}_{z,d}}[Y \mathbb{1}_c(C)] = \sum_{z'} \mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_{c,z'}(C_z, Z)]$, marginalizing over the values $z'$ of $Z$ (53)
$\geq \mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_{c,z}(C_z, Z)]$, since summands $\geq 0$ (54)
$= \mathbb{E}_{\widehat{P}_d}[Y \mathbb{1}_{c,z}(C, Z)]$, by consistency (55)
$= \mathbb{E}_{P_d}[Y \mid c, z] P_d(c, z)$, by grounding. (56)

For the denominator,

$\widehat{P}_{z,d}(c) \stackrel{(1)}{=} 1 - \sum_{c'} \widehat{P}_{z,d}(c')$ (58)
$= 1 - \sum_{c'} \sum_{z'} \widehat{P}_d(C_z = c', Z = z')$, marginalizing over $Z$ (59)
$\leq 1 - \sum_{c'} \widehat{P}_d(C_z = c', Z = z)$, since summands $\geq 0$ (60)
$= 1 - \sum_{c'} \widehat{P}_d(c', z) = \widehat{P}_d(c, z) + 1 - \widehat{P}_d(z)$, by consistency (61)
$= P_d(c, z) + 1 - P_d(z)$, by grounding, (62)

where (1) holds by defining $c'$ to stand for any combination of values of $C$ other than $c$. This implies that

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mid C = c] \geq \frac{\mathbb{E}_{P_d}[Y \mid c, z] P_d(c, z)}{P_d(c, z) + 1 - P_d(z)}$. (63)

Analytical Upper Bound. For the upper bound, we start by noting that

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mid C = c] = 1 - \mathbb{E}_{\widehat{P}_{\sigma,d}}[1 - Y \mid C = c]$ (64)
$= 1 - \mathbb{E}_{\widehat{P}_{z,d}}[(1 - Y) \mathbb{1}_c(C)] / \widehat{P}_{z,d}(c)$. (65)

Leveraging the bounds derived above we obtain

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mid C = c] \leq 1 - \frac{\mathbb{E}_{P_d}[(1 - Y) \mathbb{1}_{c,z}(C, Z)]}{P_d(c, z) + 1 - P_d(z)}$ (66)
$= \frac{\mathbb{E}_{P_d}[Y \mid c, z] P_d(c, z) + 1 - P_d(z)}{P_d(c, z) + 1 - P_d(z)}$. (67)

By setting $d = d_1$ in the lower bound and $d = d_0$ in the upper bound of the expected utility, we obtain a lower bound on the difference of expected utilities:

$\Delta_{d_1 \succ d_0} \geq \frac{\mathbb{E}_{P_{d_1}}[Y \mid c, z] P_{d_1}(c, z)}{P_{d_1}(c, z) + 1 - P_{d_1}(z)} - \frac{\mathbb{E}_{P_{d_0}}[Y \mid c, z] P_{d_0}(c, z) + 1 - P_{d_0}(z)}{P_{d_0}(c, z) + 1 - P_{d_0}(z)}$. (68)

And similarly, by setting $d = d_1$ in the upper bound and $d = d_0$ in the lower bound of the expected utility, we obtain an upper bound on the difference of expected utilities:

$\Delta_{d_1 \succ d_0} \leq \frac{\mathbb{E}_{P_{d_1}}[Y \mid c, z] P_{d_1}(c, z) + 1 - P_{d_1}(z)}{P_{d_1}(c, z) + 1 - P_{d_1}(z)} - \frac{\mathbb{E}_{P_{d_0}}[Y \mid c, z] P_{d_0}(c, z)}{P_{d_0}(c, z) + 1 - P_{d_0}(z)}$. (69)

We now show that these bounds are tight by constructing SCMs (that is, possible world models of the AI system) that evaluate to the lower and upper bounds while generating the distribution of agent interactions $\widehat{P}_{d_1}, \widehat{P}_{d_0}$.

Tightness, Lower Bound. For the lower bound we will consider the following SCM $M^1$:

$Z \leftarrow f_Z(u)$, $D \leftarrow d$,
$C \leftarrow \begin{cases} f_C(u, z) & \text{if } f_Z(u) = z \\ c & \text{otherwise} \end{cases}$, $\quad Y \leftarrow \begin{cases} f_Y(d, c, z, u) & \text{if } f_Z(u) = z \\ 1 & \text{if } f_Z(u) \neq z, d = d_0 \\ 0 & \text{if } f_Z(u) \neq z, d = d_1 \end{cases}$, with $P(U)$.

Here $\{f_Z, f_C, f_Y, U, P(U)\}$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P^{M^1}_d(v) = P^{\widehat{M}}_d(v)$ for all $v \in \text{supp } V$.
Consider evaluating

$\mathbb{E}_{P^{M^1}_{\sigma,d}}[Y \mid C = c] = \mathbb{E}_{P^{M^1}_{\sigma,d}}[Y \mathbb{1}_c(C)] / P^{M^1}_{\sigma,d}(c)$. (71)

The numerator (under $M^1_{d_1}$) evaluates to

$\mathbb{E}_{P^{M^1}_{d_1}}[Y_z \mathbb{1}_c(C_z)]$ (72)
$= \sum_u \mathbb{E}_{P^{M^1}_{d_1}}[Y_z \mathbb{1}_c(C_z) \mid u] P^{M^1}_{d_1}(u)$ (73)
$= \sum_u \mathbb{E}_{P^{M^1}_{d_1}}[Y \mathbb{1}_c(C) \mid z, u] P^{M^1}_{d_1}(u)$ (74)
$= \mathbb{E}_{P^{M^1}_{d_1}}[Y \mathbb{1}_c(C) \mid z, \{u : f_Z(u) = z\}] P^{M^1}_{d_1}(\{u : f_Z(u) = z\})$ (75)
$\; + \mathbb{E}_{P^{M^1}_{d_1}}[Y_z \mathbb{1}_c(C_z) \mid \{u : f_Z(u) \neq z\}] P^{M^1}_{d_1}(\{u : f_Z(u) \neq z\})$ (76)
$= \mathbb{E}_{P^{M^1}_{d_1}}[Y \mathbb{1}_c(C) \mid z] P^{M^1}_{d_1}(z)$ (77)
$= \mathbb{E}_{P^{M^1}_{d_1}}[Y \mid c, z] P^{M^1}_{d_1}(c, z)$, (78)

where (76) vanishes since $Y = 0$ whenever $f_Z(u) \neq z$ and $d = d_1$. The denominator under $M^1_{d_1}$ evaluates to

$P^{M^1}_{\sigma,d_1}(c) = \sum_u P^{M^1}_{d_1}(c_z \mid u) P^{M^1}_{d_1}(u)$ (79)
$= \sum_u P^{M^1}_{d_1}(c \mid z, u) P^{M^1}_{d_1}(u)$ (80)
$= P^{M^1}_{d_1}(c \mid z, \{u : f_Z(u) = z\}) P^{M^1}_{d_1}(\{u : f_Z(u) = z\})$ (81)
$\; + P^{M^1}_{d_1}(c \mid z, \{u : f_Z(u) \neq z\}) P^{M^1}_{d_1}(\{u : f_Z(u) \neq z\})$ (82)
$= P^{M^1}_{d_1}(c \mid z) P^{M^1}_{d_1}(z) + 1 - P^{M^1}_{d_1}(z)$ (83)
$= P^{M^1}_{d_1}(c, z) + 1 - P^{M^1}_{d_1}(z)$, (84)

since $C = c$ whenever $f_Z(u) \neq z$. The numerator under $M^1_{d_0}$ evaluates to

$\mathbb{E}_{P^{M^1}_{d_0}}[Y_z \mathbb{1}_c(C_z)]$ (85)
$= \sum_u \mathbb{E}_{P^{M^1}_{d_0}}[Y_z \mathbb{1}_c(C_z) \mid u] P^{M^1}_{d_0}(u)$ (86)
$= \sum_u \mathbb{E}_{P^{M^1}_{d_0}}[Y \mathbb{1}_c(C) \mid z, u] P^{M^1}_{d_0}(u)$ (87)
$= \mathbb{E}_{P^{M^1}_{d_0}}[Y \mathbb{1}_c(C) \mid z, \{u : f_Z(u) = z\}] P^{M^1}_{d_0}(\{u : f_Z(u) = z\})$ (88)
$\; + \mathbb{E}_{P^{M^1}_{d_0}}[Y_z \mathbb{1}_c(C_z) \mid \{u : f_Z(u) \neq z\}] P^{M^1}_{d_0}(\{u : f_Z(u) \neq z\})$ (89)
$= \mathbb{E}_{P^{M^1}_{d_0}}[Y \mathbb{1}_c(C) \mid z] P^{M^1}_{d_0}(z) + 1 - P^{M^1}_{d_0}(z)$ (90)
$= \mathbb{E}_{P^{M^1}_{d_0}}[Y \mid c, z] P^{M^1}_{d_0}(c, z) + 1 - P^{M^1}_{d_0}(z)$, (91)

since $Y = 1$ and $C = c$ whenever $f_Z(u) \neq z$ and $d = d_0$. The denominator under $M^1_{d_0}$ evaluates, by the same derivation as (79)-(84), to

$P^{M^1}_{\sigma,d_0}(c) = P^{M^1}_{d_0}(c, z) + 1 - P^{M^1}_{d_0}(z)$. (92)-(97)

Combining these results we get the analytical lower bound:

$\Delta_{d_1 \succ d_0} = \frac{\mathbb{E}_{P_{d_1}}[Y \mid c, z] P_{d_1}(c, z)}{P_{d_1}(c, z) + 1 - P_{d_1}(z)} - \frac{\mathbb{E}_{P_{d_0}}[Y \mid c, z] P_{d_0}(c, z) + 1 - P_{d_0}(z)}{P_{d_0}(c, z) + 1 - P_{d_0}(z)}$. (98)

This shows that for a given $C = c$ and pair of decisions $(d_1, d_0)$ we can always find an SCM that evaluates to the lower bound that we report. So if, and only if, we can find a decision $d$ such that the lower bound evaluates to be greater than zero for some $d' \neq d$ will the AI be weakly predictable, as claimed.

Corol. 1 restated. Given a discrepancy measure $\psi$, an AI approximately grounded in a domain $M$ is weakly predictable in a context $C = c$ under a shift $\sigma : do(z)$, $Z \subset V$, if and only if there exists a decision $d$ such that

$\min_{\widehat{P} : \psi(\widehat{P}, P) \leq \delta} \left\{ \frac{\mathbb{E}_{\widehat{P}_d}[Y \mid c, z] \widehat{P}_d(c, z)}{\widehat{P}_d(c, z) + 1 - \widehat{P}_d(z)} - \frac{\mathbb{E}_{\widehat{P}_{d'}}[Y \mid c, z] \widehat{P}_{d'}(c, z) + 1 - \widehat{P}_{d'}(z)}{\widehat{P}_{d'}(c, z) + 1 - \widehat{P}_{d'}(z)} \right\} > 0$, for some $d' \neq d$. (99)

Proof. For approximately grounded AI systems, we can state the bound from Thm. 1 as

$\min_{\widehat{M} \in \mathcal{M}} \Delta(d \succ d') = \frac{\mathbb{E}_{\widehat{P}_d}[Y \mid c, z] \widehat{P}_d(c, z)}{\widehat{P}_d(c, z) + 1 - \widehat{P}_d(z)} - \frac{\mathbb{E}_{\widehat{P}_{d'}}[Y \mid c, z] \widehat{P}_{d'}(c, z) + 1 - \widehat{P}_{d'}(z)}{\widehat{P}_{d'}(c, z) + 1 - \widehat{P}_{d'}(z)}$, (100)

where $\widehat{P}_d$ is constrained to be close to $P_d$ according to the distance $\psi$ and threshold $\delta$. We get valid bounds by reporting the worst-case bounds under this looser constraint:

$\min_{\widehat{M} \in \mathcal{M}} \Delta(d \succ d') = \min_{\widehat{P} : \psi(\widehat{P}, P) \leq \delta} \left\{ \frac{\mathbb{E}_{\widehat{P}_d}[Y \mid c, z] \widehat{P}_d(c, z)}{\widehat{P}_d(c, z) + 1 - \widehat{P}_d(z)} - \frac{\mathbb{E}_{\widehat{P}_{d'}}[Y \mid c, z] \widehat{P}_{d'}(c, z) + 1 - \widehat{P}_{d'}(z)}{\widehat{P}_{d'}(c, z) + 1 - \widehat{P}_{d'}(z)} \right\}$. (101)

This shows that for a given $C = c$, $\min_{\widehat{M} \in \mathcal{M}} \Delta(d \succ d') > 0$ for some $d' \neq d$ if and only if the minimum in (99) is greater than zero. (102)
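For reference, a short sketch of how the Thm. 1 bounds (68)-(69) evaluate in practice (function names are ours; the numeric inputs are read off Tables 1 and 2 with $C = Z$, so that $P_d(c, z) = P_d(z)$):

```python
# Closed-form Thm. 1 bounds on the preference gap Delta_{d1 > d0} under
# sigma: do(z), from quantities estimable in the training domain M.
def expected_utility_bounds(E_y, P_cz, P_z):
    """Bounds on E[Y | do(z), d, c] given E_{P_d}[Y|c,z], P_d(c,z), P_d(z)."""
    denom = P_cz + 1 - P_z
    return (E_y * P_cz) / denom, (E_y * P_cz + 1 - P_z) / denom

def preference_gap_bounds(stats_d1, stats_d0):
    """Each stats_d is a tuple (E_{P_d}[Y|c,z], P_d(c,z), P_d(z))."""
    lo1, hi1 = expected_utility_bounds(*stats_d1)
    lo0, hi0 = expected_utility_bounds(*stats_d0)
    return lo1 - hi0, hi1 - lo0  # weakly predictable iff lo1 - hi0 > 0

# Medical AI of Tables 1-2: E_{d1}[Y|Z=1] = 1.0, E_{d0}[Y|Z=1] = 0.5,
# P_d(Z=1) = 0.4 for both decisions.
print(preference_gap_bounds((1.0, 0.4, 0.4), (0.5, 0.4, 0.4)))  # (-0.4, 0.8)
```

The output matches the values quoted in Example 3 for the limiting case $P_\sigma(Z = 1) = 1$: the gap $\Delta_{d_1 \succ d_0}$ is only known to lie in $[-0.4, 0.8]$, so the AI is not weakly predictable here.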
Thm. 2 restated. Let $\sigma : do(z)$ be a shift on a set of variables $Z \subset V$. For $R_i \subset Z \subset V$, $i = 1, \ldots, k$, consider an AI grounded in multiple domains $\{M_{r_i} : i = 1, \ldots, k\}$. The AI is weakly predictable in a context $C = c$ under a shift $\sigma : do(z)$ if and only if there exists a decision $d$ such that

$\max_{i,j = 1, \ldots, k} A(r_i, r_j) > 0$, for some $d' \neq d$, (103)

where

$A(r_i, r_j) := \frac{\mathbb{E}_{P_{d,r_i}}[Y \mid c, z \setminus r_i] P_{d,r_i}(c, z \setminus r_i)}{P_{d,r_i}(c, z \setminus r_i) + 1 - P_{d,r_i}(z \setminus r_i)} - \frac{\mathbb{E}_{P_{d',r_j}}[Y \mid c, z \setminus r_j] P_{d',r_j}(c, z \setminus r_j) + 1 - P_{d',r_j}(z \setminus r_j)}{P_{d',r_j}(c, z \setminus r_j) + 1 - P_{d',r_j}(z \setminus r_j)}$.

Proof. $\{M_{r_i} : i = 1, \ldots, k\}$ describes $k$ domains in which experiments on different subsets of $Z$ have been conducted. This includes possibly the null experiment $R_i = \emptyset$, which refers to the unaltered domain $M$. We can use a similar derivation to that of Thm. 1 to derive bounds on $\Delta$ under a shift $\sigma : do(z)$ in terms of $P_{d,r}(V)$, $R \subset V$, and obtain

$\Delta_{d_1 \succ d_0} \geq A(r)$, (104)

where

$A(r) := \frac{\mathbb{E}_{P_{d_1,r}}[Y \mid c, z \setminus r] P_{d_1,r}(c, z \setminus r)}{P_{d_1,r}(c, z \setminus r) + 1 - P_{d_1,r}(z \setminus r)} - \frac{\mathbb{E}_{P_{d_0,r}}[Y \mid c, z \setminus r] P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}{P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}$. (105)

These bounds can be shown to be tight by constructing similar SCMs. For example, for the analytical lower bound consider

$S \leftarrow f_S(u)$, $R \leftarrow r$, $D \leftarrow d$,
$C \leftarrow \begin{cases} f_C(u, s, r) & \text{if } f_S(u) = s \\ c & \text{otherwise} \end{cases}$, $\quad Y \leftarrow \begin{cases} f_Y(d, c, s, r, u) & \text{if } f_S(u) = s \\ 1 & \text{if } f_S(u) \neq s, d = d_0 \\ 0 & \text{if } f_S(u) \neq s, d = d_1 \end{cases}$, with $P(U)$,

where $S = Z \setminus R$. Here $\{f_S, f_C, f_Y, U, P(U)\}$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P^{M^1}_{d,r}(v) = P^{\widehat{M}}_{d,r}(v)$ for all $v \in \text{supp } V$. We could verify that this SCM evaluates to the lower bound above.

If we have multiple domains with different sets of intervened variables $\{R_i : i = 1, \ldots, k\}$, we can use this construction to find a lower bound using samples from $\{P_{d,r_i}(V) : i = 1, \ldots, k\}$. A lower bound that can be constructed for an AI system grounded in $\{M_{r_i} : i = 1, \ldots, k\}$ is

$\Delta_{d_1 \succ d_0} \geq \max_{i,j = 1, \ldots, k} A(r_i, r_j)$, (107)

where

$A(r_i, r_j) := \frac{\mathbb{E}_{P_{d_1,r_i}}[Y \mid c, z \setminus r_i] P_{d_1,r_i}(c, z \setminus r_i)}{P_{d_1,r_i}(c, z \setminus r_i) + 1 - P_{d_1,r_i}(z \setminus r_i)} - \frac{\mathbb{E}_{P_{d_0,r_j}}[Y \mid c, z \setminus r_j] P_{d_0,r_j}(c, z \setminus r_j) + 1 - P_{d_0,r_j}(z \setminus r_j)}{P_{d_0,r_j}(c, z \setminus r_j) + 1 - P_{d_0,r_j}(z \setminus r_j)}$. (108)

The intuition here is that we have multiple lower bounds for the preference gap, so the best lower bound can be taken to be the largest of the multiple lower bounds available. We can show that this bound is tight in the case where the AI is grounded in two environments $\{M_{r_1}, M_{r_2}\}$ under a shift $\sigma : do(z)$, $Z = R_1 \cup R_2$. According to the inequality above, we have simultaneously

$\Delta_{d_1 \succ d_0} \geq A(r_1, r_1), A(r_1, r_2), A(r_2, r_1), A(r_2, r_2)$. (109)

Each of these terms can be evaluated from the available data sampled from $\{P_{d,r_1}, P_{d,r_2}\}$. Note that both $A(r_1, r_1)$ and $A(r_2, r_2)$ can be obtained with the SCM above. Without loss of generality, assume that $A(r_1, r_2) \geq A(r_2, r_1), A(r_1, r_1), A(r_2, r_2)$. We will show that we can construct an SCM compatible with $\{P_{d,r_1}, P_{d,r_2}\}$ that evaluates to $A(r_1, r_2)$, demonstrating that the bound is tight. Consider the following SCM:

$R_1 \leftarrow f_{R_1}(u_1)$, $R_2 \leftarrow f_{R_2}(u_2)$, $D \leftarrow d$,
$C \leftarrow \begin{cases} f_C(r_1, r_2, u_1, u_2) & \text{if } f_{R_1}(u_1) = r_1, f_{R_2}(u_2) = r_2 \\ c & \text{otherwise} \end{cases}$
$Y \leftarrow \begin{cases} f_Y(d, c, r_1, r_2, u_1, u_2) & \text{if } f_{R_1}(u_1) = r_1, f_{R_2}(u_2) = r_2 \\ f_Y(d, c, r_1, r_2, u_1, u_2) & \text{if } d = d_1, f_{R_1}(u_1) \neq r_1, f_{R_2}(u_2) = r_2 \\ f_Y(d, c, r_1, r_2, u_1, u_2) & \text{if } d = d_0, f_{R_1}(u_1) = r_1, f_{R_2}(u_2) \neq r_2 \\ 0 & \text{if } d = d_1, f_{R_1}(u_1) = r_1, f_{R_2}(u_2) \neq r_2 \\ 0 & \text{if } d = d_1, f_{R_1}(u_1) \neq r_1, f_{R_2}(u_2) \neq r_2 \\ 1 & \text{if } d = d_0, f_{R_1}(u_1) \neq r_1, f_{R_2}(u_2) = r_2 \\ 1 & \text{if } d = d_0, f_{R_1}(u_1) \neq r_1, f_{R_2}(u_2) \neq r_2 \end{cases}$
with $P(U)$.

Notice that in $M_d$ different choices of functional assignments $f$ and $P(u)$ can generate any distribution $\{P_{d_1,r_1}, P_{d_0,r_2}\}$. That is, this SCM (or a member of this family of SCMs) is compatible with the observed data. Consider evaluating $A(r_1, r_2)$ under this SCM.
Note that the derivations for the denominators are equivalent to those shown in the proof of Thm. 1, so we omit them here. The first term in the numerator is

$\mathbb{E}_{P^M_{d_1,r_1,r_2}}[Y \mathbb{1}_c(C)]$ (111)
$= \sum_{u_2} \mathbb{E}_{P^M_{d_1,r_1,r_2}}[Y \mathbb{1}_c(C) \mid u_2] P^M_{d_1,r_1,r_2}(u_2)$ (112)
$= \sum_{u_2} \mathbb{E}_{P^M_{d_1,r_1}}[Y \mathbb{1}_c(C) \mid r_2, u_2] P^M_{d_1,r_1}(u_2)$ (113)
$= \mathbb{E}_{P^M_{d_1,r_1}}[Y \mathbb{1}_c(C) \mid r_2, \{u_2 : f_{R_2}(u_2) = r_2\}] P^M_{d_1,r_1}(\{u_2 : f_{R_2}(u_2) = r_2\})$ (114)
$\; + \mathbb{E}_{P^M_{d_1,r_1}}[Y \mathbb{1}_c(C) \mid r_2, \{u_2 : f_{R_2}(u_2) \neq r_2\}] P^M_{d_1,r_1}(\{u_2 : f_{R_2}(u_2) \neq r_2\})$ (115)
$= \mathbb{E}_{P^M_{d_1,r_1}}[Y \mathbb{1}_c(C) \mid r_2] P^M_{d_1,r_1}(r_2)$ (116)
$= \mathbb{E}_{P^M_{d_1,r_1}}[Y \mid c, r_2] P^M_{d_1,r_1}(c, r_2)$, (117)

where (115) vanishes since $Y = 0$ when $d = d_1$ and $f_{R_2}(u_2) \neq r_2$. The second term in the numerator is

$\mathbb{E}_{P^M_{d_0,r_1,r_2}}[Y \mathbb{1}_c(C)]$ (118)
$= \sum_{u_1} \mathbb{E}_{P^M_{d_0,r_1,r_2}}[Y \mathbb{1}_c(C) \mid u_1] P^M_{d_0,r_1,r_2}(u_1)$ (119)
$= \sum_{u_1} \mathbb{E}_{P^M_{d_0,r_2}}[Y \mathbb{1}_c(C) \mid r_1, u_1] P^M_{d_0,r_2}(u_1)$ (120)
$= \mathbb{E}_{P^M_{d_0,r_2}}[Y \mathbb{1}_c(C) \mid r_1, \{u_1 : f_{R_1}(u_1) = r_1\}] P^M_{d_0,r_2}(\{u_1 : f_{R_1}(u_1) = r_1\})$ (121)
$\; + \mathbb{E}_{P^M_{d_0,r_2}}[Y \mathbb{1}_c(C) \mid r_1, \{u_1 : f_{R_1}(u_1) \neq r_1\}] P^M_{d_0,r_2}(\{u_1 : f_{R_1}(u_1) \neq r_1\})$ (122)
$= \mathbb{E}_{P^M_{d_0,r_2}}[Y \mathbb{1}_c(C) \mid r_1] P^M_{d_0,r_2}(r_1) + 1 - P^M_{d_0,r_2}(r_1)$ (123)
$= \mathbb{E}_{P^M_{d_0,r_2}}[Y \mid c, r_1] P^M_{d_0,r_2}(c, r_1) + 1 - P^M_{d_0,r_2}(r_1)$, (124)

since $Y = 1$ and $C = c$ when $d = d_0$ and $f_{R_1}(u_1) \neq r_1$. Combining these results we get that, under $M$,

$\Delta_{d_1 \succ d_0} = A(r_1, r_2)$. (125)

Corollary 3. The bound from multiple domains in Thm. 2 will be at least as informative as the bound from a single domain in Thm. 1.

Proof. We claim here that for any $R \subset Z$,

$A(\emptyset) \leq A(r)$. (126)

This means that the bounds on $\Delta$ that we can obtain from an AI system grounded in $M_r$ are more informative than the bounds obtained from an AI system grounded in $M$. $A$ is a difference of two terms, written $A(r) = A_1(r) - A_2(r)$, with

$A_1(r) := \frac{\mathbb{E}_{P_{d_1,r}}[Y \mid c, z \setminus r] P_{d_1,r}(c, z \setminus r)}{P_{d_1,r}(c, z \setminus r) + 1 - P_{d_1,r}(z \setminus r)}$, (127)
$A_2(r) := \frac{\mathbb{E}_{P_{d_0,r}}[Y \mid c, z \setminus r] P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}{P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}$. (128)

It holds that $A_1(r) \geq A_1(\emptyset)$ and $A_2(r) \leq A_2(\emptyset)$, which then implies $A(r) \geq A(\emptyset)$. To see this, notice that

$A_1(r) = \frac{\mathbb{E}_{P_{d_1,r}}[Y \mid c, z \setminus r] P_{d_1,r}(c, z \setminus r)}{P_{d_1,r}(c, z \setminus r) + 1 - P_{d_1,r}(z \setminus r)}$ (129)
$\geq \frac{\mathbb{E}_{P_{d_1}}[Y \mid c, z] P_{d_1}(c, z)}{P_{d_1,r}(c, z \setminus r) + 1 - P_{d_1,r}(z \setminus r)}$ (130)
$= \frac{\mathbb{E}_{P_{d_1}}[Y \mid c, z] P_{d_1}(c, z)}{1 - P_{d_1,r}(\bar{c}, z \setminus r)}$ (131)
$\geq \frac{\mathbb{E}_{P_{d_1}}[Y \mid c, z] P_{d_1}(c, z)}{1 - P_{d_1}(\bar{c}, z)}$ (132)
$= \frac{\mathbb{E}_{P_{d_1}}[Y \mid c, z] P_{d_1}(c, z)}{P_{d_1}(c, z) + 1 - P_{d_1}(z)}$ (133)
$= A_1(\emptyset)$, (134)

where $\bar{c}$ stands for the combinations of values of $C$ that are not $c$. Further,

$A_2(r) = \frac{\mathbb{E}_{P_{d_0,r}}[Y \mid c, z \setminus r] P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}{P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}$ (135)
$= 1 - \frac{\mathbb{E}_{P_{d_0,r}}[1 - Y \mid c, z \setminus r] P_{d_0,r}(c, z \setminus r)}{P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}$ (136)
$\leq 1 - \frac{\mathbb{E}_{P_{d_0}}[1 - Y \mid c, z] P_{d_0}(c, z)}{P_{d_0,r}(c, z \setminus r) + 1 - P_{d_0,r}(z \setminus r)}$ (137)
$\leq 1 - \frac{\mathbb{E}_{P_{d_0}}[1 - Y \mid c, z] P_{d_0}(c, z)}{P_{d_0}(c, z) + 1 - P_{d_0}(z)}$ (138)
$= \frac{\mathbb{E}_{P_{d_0}}[Y \mid c, z] P_{d_0}(c, z) + 1 - P_{d_0}(z)}{P_{d_0}(c, z) + 1 - P_{d_0}(z)}$ (139)
$= A_2(\emptyset)$. (140)

Thm. 3 restated. Consider an AI grounded in a domain $M$ made aware of an (under-specified) shift on a non-empty $Z \subset V$. Then the AI is provably not weakly (or strongly) predictable in any context $C = c$.

Proof. Recall that the preference gap is defined as

$\Delta_{d_1 \succ d_0} := \mathbb{E}_{\widehat{P}_{\sigma,d_1}}[Y \mid C = c] - \mathbb{E}_{\widehat{P}_{\sigma,d_0}}[Y \mid C = c]$. (141)

Here we know that $\sigma$ potentially modifies the mechanisms of the set of variables $Z$, though the nature of the modification is unknown. In the worst case, the AI's interpretation of the possible new assignment of $Z$ could be arbitrary. We will prove this theorem for the case of binary variables $Y, Z \in V$. In the following, we construct two (canonical) models that entail any chosen distribution for the observed data $P_d(y, z \mid c)$ but evaluate to the a priori minimum and maximum value of the preference gap $\Delta$, i.e. $-1$ and $1$ respectively.
We make use of the canonical model construction from Jalaldoust et al. (2024) to define the following general SCM:

$Z \leftarrow \begin{cases} 0 & \text{if } r_z = 0 \\ 1 & \text{if } r_z = 1 \end{cases}$, $\quad Y \leftarrow \begin{cases} 0 & \text{if } r_y = 0 \\ 0 & \text{if } r_y = 1, z = 0 \\ 1 & \text{if } r_y = 1, z = 1 \\ 1 & \text{if } r_y = 2, z = 0 \\ 0 & \text{if } r_y = 2, z = 1 \\ 1 & \text{if } r_y = 3 \end{cases}$

with $U = \{R_z, R_y\}$, where $R_z$ and $R_y$ might be correlated, and with a probability $\widehat{P}(U) = \widehat{P}_d(U \mid c)$ such that $\widehat{P}_d(z, y \mid c) = \widehat{P}(z, y \mid c)$. By Jalaldoust et al. (2024, Thm. 1) this is always possible, since this class of canonical models is sufficiently expressive to model any observational or interventional distribution. We can visualise the joint probability of exogenous variables using the following table:

Probabilities $\widehat{M}$ | $r_z = 0$ | $r_z = 1$
$r_y = 0$ | $p_{00}$ | $p_{10}$
$r_y = 1$ | $p_{01}$ | $p_{11}$
$r_y = 2$ | $p_{02}$ | $p_{12}$
$r_y = 3$ | $p_{03}$ | $p_{13}$

where we have written $P_d(r_z = a, r_y = b \mid c) = p_{ab}$. From these we can compute the joint probabilities

$P_d(z = 0, y = 0 \mid c) = p_{00} + p_{01}$, (143)
$P_d(z = 0, y = 1 \mid c) = p_{02} + p_{03}$, (144)
$P_d(z = 1, y = 0 \mid c) = p_{10} + p_{12}$, (145)
$P_d(z = 1, y = 1 \mid c) = p_{11} + p_{13}$. (146)

Here we can see that the parameter space $P_d(r_z, r_y \mid c)$ is very expressive. For example, without loss of generality we could set $p_{03} = p_{13} = 0$, or $p_{00} = p_{10} = 0$, and still be able to generate any observed distribution $P_d(z, y \mid c)$. The given shift in the environment $\sigma$ can be entirely modelled as a shift in $P_{\sigma,d}(r_z \mid c)$ while keeping the probability of $r_y$ invariant, i.e., $P_{\sigma,d}(r_y \mid c) = P_d(r_y \mid c)$. In other words, given the table above, we can change each of the cells while keeping the row sums equal. Recall that we are interested in evaluating bounds on probabilities of the form $P_{\sigma,d}(y = 1 \mid c)$ and $P_{\sigma,d}(y = 1 \mid z = 1, c)$, depending on whether $Z$ is given as an input to the AI or not. Both quantities can be written in terms of the probabilities of exogenous variables as follows:

$P_{\sigma,d}(y = 1 \mid c) = p_{02} + p_{03} + p_{11} + p_{13}$, (147)
$P_{\sigma,d}(y = 1 \mid z = 1, c) = \frac{p_{11} + p_{13}}{p_{10} + p_{11} + p_{12} + p_{13}}$, (148)

where the $p_{ab}$ now denote the (shifted) cell values. For the lower bound on these quantities, without loss of generality assume that $p_{03} = p_{13} = 0$. Then the following table:

Probabilities $\widehat{M}_\sigma$ | $r_z = 0$ | $r_z = 1$
$r_y = 0$ | $p_{00}$ | $p_{10}$
$r_y = 1$ | $p_{01} + p_{11}$ | $0$
$r_y = 2$ | $0$ | $p_{12} + p_{02}$
$r_y = 3$ | $0$ | $0$

is a perfectly valid model under a shift $\sigma$ that respects the constraint $P_{\sigma,d}(r_y \mid c) = P_d(r_y \mid c)$, but for which $P_{\sigma,d}(y = 1 \mid c) = 0$, as it is the sum of four zero entries, and $P_{\sigma,d}(y = 1 \mid z = 1, c) = 0$, as its numerator is the sum of the two zero entries ($r_y \in \{1, 3\}$) in the second column, divided by the sum of entries in the second column. If we are interested in getting an upper bound, then without loss of generality assume that $p_{00} = p_{10} = 0$. Then the following table:

Probabilities $\widehat{M}_\sigma$ | $r_z = 0$ | $r_z = 1$
$r_y = 0$ | $0$ | $0$
$r_y = 1$ | $0$ | $p_{01} + p_{11}$
$r_y = 2$ | $p_{12} + p_{02}$ | $0$
$r_y = 3$ | $p_{03}$ | $p_{13}$

is a perfectly valid model under a shift $\sigma$ that respects the constraint $P_{\sigma,d}(r_y \mid c) = P_d(r_y \mid c)$, but for which $P_{\sigma,d}(y = 1 \mid c) = 1$, as it is the sum of the four non-zero entries, and $P_{\sigma,d}(y = 1 \mid z = 1, c) = 1$, as its numerator is the sum of the two non-zero entries ($r_y \in \{1, 3\}$) in the second column, divided by the sum of entries in the second column.

By using this construction to define lower and upper bounds for $P_{\sigma,d}(y = 1 \mid c)$ or $P_{\sigma,d}(y = 1 \mid z, c)$ for $d = d_0, d_1$, we obtain a possible internal model for the AI that entails the observed external behaviour but for which the preference gap evaluates to $-1$ and $1$. This means that the a priori bound

$-1 \leq \Delta_{d \succ d'} \leq 1$ (149)

is tight whenever the shift is undefined (whether we know the variables it applies to or not). Since the preference gap is unconstrained for any $C = c$ and any pair of decisions $(d, d')$, the AI is not predictable.
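A small sketch of this construction follows (the table layout, helper name and cell values are ours; we take a table whose $r_y = 0$ and $r_y = 3$ rows are empty so that both worst-case reassignments can be shown on one example):

```python
# Thm. 3 construction: p[a][b] = P_d(r_z = a, r_y = b | c) parameterizes the
# canonical model. An under-specified shift on Z may redistribute mass within
# each r_y row (row sums, i.e. P(r_y | c), stay fixed); the two reassignments
# below drive P_{sigma,d}(y = 1 | c) to 0 and to 1.
def P_y1(p):
    # y = 1 iff (r_z = 0, r_y in {2, 3}) or (r_z = 1, r_y in {1, 3})
    return p[0][2] + p[0][3] + p[1][1] + p[1][3]

p = [[0.0, 0.2, 0.3, 0.0],   # column r_z = 0, rows r_y = 0..3
     [0.0, 0.2, 0.3, 0.0]]   # column r_z = 1
rows = [p[0][b] + p[1][b] for b in range(4)]       # invariant row sums

p_min = [[0, rows[1], 0, 0], [0, 0, rows[2], 0]]   # r_y=1 mass -> z=0, r_y=2 -> z=1
p_max = [[0, 0, rows[2], 0], [0, rows[1], 0, 0]]   # the opposite reassignment
print(P_y1(p), P_y1(p_min), P_y1(p_max))           # 0.5, then 0.0 and 1.0
```

Applying the extreme reassignments separately for $d_1$ and $d_0$ yields preference gaps of $1$ and $-1$, exactly as in the proof.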
Thm. 4 restated. Consider an AI grounded in a domain $M$, made aware of a shift $\sigma$ on $Z \subset C$ together with $P_{\sigma,d}(C)$. The AI is weakly predictable under this shift in a context $C = c$ if there exists a decision $d$ such that

$1 - \frac{2 + \mathbb{E}_{P_{d'}}[Y \mid c] P_{d'}(c) - \mathbb{E}_{P_d}[Y \mid c] P_d(c) - 2 P_d(z) + P_d(c)}{P_{\sigma,d'}(c)} > 0$, for some $d' \neq d$. (150)

Proof. Recall that the preference gap under a shift $\sigma$ between decisions $(d_1, d_0)$ in a situation $C = c$ is defined as

$\Delta_{d_1 \succ d_0} := \mathbb{E}_{\widehat{P}_{\sigma,d_1}}[Y \mid C = c] - \mathbb{E}_{\widehat{P}_{\sigma,d_0}}[Y \mid C = c]$. (151)

Here we know that $\sigma$ potentially modifies the mechanisms of the set of variables $Z$. The nature of the modification is unknown, but we are told that, after modification, the expected probability of $C$ is given by $P_{\sigma,d}(C)$, assumed to be known and internalised by the AI. This means that its internal model, whatever interpretation of the shift it chooses, generates the assumed probabilities, i.e. $\widehat{P}_{\sigma,d}(C) = P_{\sigma,d}(C)$. We will consider the derivation of bounds on each term of this difference separately. Firstly, note that

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mid C = c] = \mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mathbb{1}_c(C)] / \widehat{P}_{\sigma,d}(c)$. (152)

For ease of notation let us write $R := C \setminus Z$. We can then show that

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mathbb{1}_{z,r}(Z, R)] = \mathbb{E}_{\widehat{P}_{\sigma,d}}[Y_z \mathbb{1}_{z,r}(Z, R_z)]$, by consistency (153)
$\leq \sum_{z'} \mathbb{E}_{\widehat{P}_{\sigma,d}}[Y_z \mathbb{1}_{z',r}(Z, R_z)]$ (154)
$= \mathbb{E}_{\widehat{P}_{\sigma,d}}[Y_z \mathbb{1}_r(R_z)]$, marginalizing over the values $z'$ of $Z$. (155)

Now, once we intervene on $z$, the mechanism that generated its value beforehand, whether it was the shift $\sigma$ or something else, is irrelevant. In essence, we get an equivalence between shifted and un-shifted distributions under intervention:

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y_z \mathbb{1}_r(R_z)] = \mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_r(R_z)]$. (156)

We can now take this quantity and show the following:

$\mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_r(R_z)] = \sum_{z'} \mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_{z',r}(Z, R_z)]$ (157)
$= \mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_{z,r}(Z, R_z)] + \sum_{z' \neq z} \mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_{z',r}(Z, R_z)]$ (158)
$= \mathbb{E}_{\widehat{P}_d}[Y \mathbb{1}_{z,r}(Z, R)] + \sum_{z' \neq z} \mathbb{E}_{\widehat{P}_d}[Y_z \mathbb{1}_{z',r}(Z, R_z)]$, by consistency (159)
$\leq \mathbb{E}_{\widehat{P}_d}[Y \mathbb{1}_{z,r}(Z, R)] + \sum_{z' \neq z} \mathbb{E}_{\widehat{P}_d}[\mathbb{1}_{z'}(Z)]$, since $Y_z$ and $\mathbb{1}_r(R_z)$ are $\leq 1$ (160)
$= \mathbb{E}_{\widehat{P}_d}[Y \mathbb{1}_{z,r}(Z, R)] + 1 - \widehat{P}_d(z)$ (161)
$= \mathbb{E}_{\widehat{P}_d}[Y \mid c] \widehat{P}_d(c) + 1 - \widehat{P}_d(z)$. (162)

For the lower bound we can consider the following derivation:

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mathbb{1}_{z,r}(Z, R)] = \mathbb{E}_{\widehat{P}_{\sigma,d}}[\mathbb{1}_{z,r}(Z, R)] - \mathbb{E}_{\widehat{P}_{\sigma,d}}[(1 - Y) \mathbb{1}_{z,r}(Z, R)]$. (163)

For ease of notation let us define

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[\bar{Y} \mathbb{1}_{z,r}(Z, R)] := \mathbb{E}_{\widehat{P}_{\sigma,d}}[(1 - Y) \mathbb{1}_{z,r}(Z, R)]$. (164)

Similar bounds apply to $\mathbb{E}_{\widehat{P}_{\sigma,d}}[\bar{Y} \mathbb{1}_{z,r}(Z, R)]$, so that we get

$\mathbb{E}_{\widehat{P}_{\sigma,d}}[Y \mathbb{1}_{z,r}(Z, R)] \geq \mathbb{E}_{\widehat{P}_{\sigma,d}}[\mathbb{1}_{z,r}(Z, R)] - \{\mathbb{E}_{\widehat{P}_d}[\bar{Y} \mathbb{1}_{z,r}(Z, R)] + 1 - \widehat{P}_d(z)\}$ (165)
$= \mathbb{E}_{\widehat{P}_{\sigma,d}}[\mathbb{1}_{z,r}(Z, R)] - \mathbb{E}_{\widehat{P}_d}[\mathbb{1}_{z,r}(Z, R)] + \mathbb{E}_{\widehat{P}_d}[Y \mathbb{1}_{z,r}(Z, R)] - 1 + \widehat{P}_d(z)$ (166)
$= \widehat{P}_{\sigma,d}(c) - \widehat{P}_d(c) + \mathbb{E}_{\widehat{P}_d}[Y \mid c] \widehat{P}_d(c) - 1 + \widehat{P}_d(z)$. (167)

Putting the lower and upper bounds together to form bounds on $\Delta_{d_1 \succ d_0}$, and noting that $Z \subset C$ precedes the decision, so that $\widehat{P}_{d_1}(z) = \widehat{P}_{d_0}(z)$ and $\widehat{P}_{\sigma,d_1}(c) = \widehat{P}_{\sigma,d_0}(c)$, we get

$\Delta_{d_1 \succ d_0} \geq \frac{\widehat{P}_{\sigma,d_1}(c) - \widehat{P}_{d_1}(c) + \mathbb{E}_{\widehat{P}_{d_1}}[Y \mid c] \widehat{P}_{d_1}(c) - 1 + \widehat{P}_{d_1}(z) - \{\mathbb{E}_{\widehat{P}_{d_0}}[Y \mid c] \widehat{P}_{d_0}(c) + 1 - \widehat{P}_{d_0}(z)\}}{\widehat{P}_{\sigma,d_0}(c)}$ (168)
$= 1 - \frac{\widehat{P}_{d_1}(c) - \mathbb{E}_{\widehat{P}_{d_1}}[Y \mid c] \widehat{P}_{d_1}(c) + 1 - \widehat{P}_{d_1}(z) + \mathbb{E}_{\widehat{P}_{d_0}}[Y \mid c] \widehat{P}_{d_0}(c) + 1 - \widehat{P}_{d_0}(z)}{\widehat{P}_{\sigma,d_0}(c)}$ (169)
$= 1 - \frac{2 + \mathbb{E}_{\widehat{P}_{d_0}}[Y \mid c] \widehat{P}_{d_0}(c) - \mathbb{E}_{\widehat{P}_{d_1}}[Y \mid c] \widehat{P}_{d_1}(c) - 2 \widehat{P}_{d_1}(z) + \widehat{P}_{d_1}(c)}{\widehat{P}_{\sigma,d_0}(c)}$, (170)

and by grounding,

$\Delta_{d_1 \succ d_0} \geq 1 - \frac{2 + \mathbb{E}_{P_{d_0}}[Y \mid c] P_{d_0}(c) - \mathbb{E}_{P_{d_1}}[Y \mid c] P_{d_1}(c) - 2 P_{d_1}(z) + P_{d_1}(c)}{P_{\sigma,d_0}(c)}$. (171)

This statement holds for any SCM compatible with the grounded AI's external behaviour, and therefore

$\min_{\widehat{M} \in \mathcal{M}} \Delta(d \succ d') \geq 1 - \frac{2 + \mathbb{E}_{P_{d'}}[Y \mid c] P_{d'}(c) - \mathbb{E}_{P_d}[Y \mid c] P_d(c) - 2 P_d(z) + P_d(c)}{P_{\sigma,d'}(c)}$. (172)

We can establish that the AI is weakly predictable in a context $C = c$ if there exists a decision $d$ such that

$1 - \frac{2 + \mathbb{E}_{P_{d'}}[Y \mid c] P_{d'}(c) - \mathbb{E}_{P_d}[Y \mid c] P_d(c) - 2 P_d(z) + P_d(c)}{P_{\sigma,d'}(c)} > 0$ (173)

for some $d' \neq d$.

We now continue with our inference of the AI's perceived fairness and harm of decisions in Sec. 4.3.
Thm. 5 restated. Consider an agent with utility function $Y$ grounded in a domain $M$. Then

$-\mathbb{E}_{P_d}[Y \mid z_0, c] \leq \Upsilon(d, c) \leq 1 - \mathbb{E}_{P_d}[Y \mid z_0, c]$. (174)

This bound is tight.

Proof. Recall that, for a given utility $Y$, the AI's counterfactual fairness gap relative to a decision $d$, in a given context $c$, is

$\Upsilon(d, c) := \mathbb{E}_{\widehat{P}}[Y_{d,z_1} \mid z_0, c] - \mathbb{E}_{\widehat{P}}[Y_d \mid z_0, c]$, (175)

and remember that $Z \in C$. For ease of notation, write $z_1 = z$ and $z_0 = z'$, such that

$\Upsilon(d, c) = \mathbb{E}_{\widehat{P}}[Y_{d,z} \mid z', c] - \mathbb{E}_{\widehat{P}}[Y_d \mid z', c]$. (176)

We start by considering the following derivation:

$\widehat{P}(y_{d,z} \mid c) = \widehat{P}(y_{d,z}, z_d \mid c) + \widehat{P}(y_{d,z}, z'_d \mid c)$, by marginalization (177)
$= \widehat{P}(y_d, z_d \mid c) + \widehat{P}(y_{d,z}, z'_d \mid c)$, by consistency, (178)

and since $d$ does not affect $Z$ or $C$, i.e. $Z_d = Z$, $C_d = C$,

$\widehat{P}(y_{d,z} \mid c) = \widehat{P}(y_d, z \mid c) + \widehat{P}(y_{d,z}, z' \mid c)$, (179)

which implies

$\widehat{P}(y_{d,z} \mid z', c) = \frac{\widehat{P}(y_{d,z} \mid c) - \widehat{P}_d(y, z \mid c)}{\widehat{P}_d(z' \mid c)}$, (180)

$\mathbb{E}_{\widehat{P}}[Y_{d,z} \mid z', c] = \frac{\mathbb{E}_{\widehat{P}}[Y_{d,z} \mid c] - \mathbb{E}_{\widehat{P}_d}[Y \mid z, c] \widehat{P}_d(z \mid c)}{\widehat{P}_d(z' \mid c)}$. (181)

All quantities on the r.h.s. are observable except for $\mathbb{E}_{\widehat{P}}[Y_{d,z} \mid c]$, which can be tightly bounded. For the lower bound, consider the following derivation:

$\mathbb{E}_{\widehat{P}}[Y_{d,z} \mid c] = \sum_{z''} \mathbb{E}_{\widehat{P}}[Y_{d,z} \mathbb{1}_{z''}(Z_d) \mid c]$, marginalizing over $z_d$ (182)
$\geq \mathbb{E}_{\widehat{P}}[Y_{d,z} \mathbb{1}_z(Z_d) \mid c]$, since summands $\geq 0$ (183)
$= \mathbb{E}_{\widehat{P}}[Y_d \mathbb{1}_z(Z_d) \mid c]$, by consistency (184)
$= \mathbb{E}_{P_d}[Y \mid c, z] P_d(z \mid c)$, by grounding and $C_d = C$. (185)

Similarly, we can get an upper bound by noting that

$\mathbb{E}_{\widehat{P}}[Y_{d,z} \mid c] = 1 - \mathbb{E}_{\widehat{P}}[(1 - Y_{d,z}) \mid c]$ (186)
$\leq \mathbb{E}_{\widehat{P}_d}[Y \mid c, z] \widehat{P}_d(z \mid c) + \widehat{P}_d(z' \mid c)$. (187)

Plugging (185) and (187) into (181) yields the bounds $0$ and $1$ below, which we now show are attained.

Tightness, Lower Bound. For the lower bound we will consider the following SCM $M^1$:

$Z \leftarrow f_Z(u)$, $C \leftarrow f_C(u)$, $D \leftarrow d$, $\quad Y \leftarrow \begin{cases} f_Y(d, c, z, u) & \text{if } f_Z(u) = z \\ 0 & \text{otherwise} \end{cases}$

Here $\{f_Z, f_C, f_Y, U, P(U)\}$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P^{M^1}_d(v) = P^{\widehat{M}}_d(v)$ for all $v \in \text{supp } V$. Then, under $M^1_d$,

$\mathbb{E}_{P^{M^1}}[Y_{d,z} \mid c]$ (189)
$= \sum_u \mathbb{E}_{P^{M^1}}[Y_{d,z} \mid u, c] P^{M^1}(u \mid c)$ (190)
$= \sum_u \mathbb{E}_{P^{M^1}}[Y_d \mid z, u, c] P^{M^1}(u \mid c)$ (191)
$= \mathbb{E}_{P^{M^1}_d}[Y \mid z, c, \{u : f_Z(u) = z\}] P^{M^1}(\{u : f_Z(u) = z\} \mid c)$ (192)
$\; + \mathbb{E}_{P^{M^1}_d}[Y \mid z, c, \{u : f_Z(u) \neq z\}] P^{M^1}(\{u : f_Z(u) \neq z\} \mid c)$ (193)
$= \mathbb{E}_{P^{M^1}_d}[Y \mid z, c] P^{M^1}_d(z \mid c)$, (194)

where the second term vanishes since $Y = 0$ when $f_Z(u) \neq z$. This expression is the same as the analytical bound, showing that it is tight.

Tightness, Upper Bound. For the upper bound we will consider the following SCM $M^2$:

$Z \leftarrow f_Z(u)$, $C \leftarrow f_C(u)$, $D \leftarrow d$, $\quad Y \leftarrow \begin{cases} f_Y(d, c, z, u) & \text{if } f_Z(u) = z \\ 1 & \text{otherwise} \end{cases}$

Here $\{f_Z, f_C, f_Y, U, P(U)\}$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P^{M^2}_d(v) = P^{\widehat{M}}_d(v)$ for all $v \in \text{supp } V$. Then, under $M^2_d$,

$\mathbb{E}_{P^{M^2}}[Y_{d,z} \mid c]$ (196)
$= \sum_u \mathbb{E}_{P^{M^2}}[Y_{d,z} \mid u, c] P^{M^2}(u \mid c)$ (197)
$= \sum_u \mathbb{E}_{P^{M^2}}[Y_d \mid z, u, c] P^{M^2}(u \mid c)$ (198)
$= \mathbb{E}_{P^{M^2}_d}[Y \mid z, c, \{u : f_Z(u) = z\}] P^{M^2}(\{u : f_Z(u) = z\} \mid c)$ (199)
$\; + \mathbb{E}_{P^{M^2}_d}[Y \mid z, c, \{u : f_Z(u) \neq z\}] P^{M^2}(\{u : f_Z(u) \neq z\} \mid c)$ (200)
$= \mathbb{E}_{P^{M^2}_d}[Y \mid z, c] P^{M^2}_d(z \mid c) + 1 - P^{M^2}_d(z \mid c)$. (201)

We therefore find that

$0 \leq \mathbb{E}_{\widehat{P}}[Y_{d,z} \mid z', c] \leq 1$, (202)

and ultimately

$-\mathbb{E}_{P_d}[Y \mid z_0, c] \leq \Upsilon(d, c) \leq 1 - \mathbb{E}_{P_d}[Y \mid z_0, c]$, (203)

as claimed.

Thm. 6 restated. Consider an agent with utility function $Y$ grounded in a domain $M$. Then

$\max\{0, \mathbb{E}_{P_{d_0}}[Y \mid c] - \mathbb{E}_{P_d}[Y \mid c]\} \leq \Omega(d, d_0) \leq \min\{1 - \mathbb{E}_{P_d}[Y \mid c], \mathbb{E}_{P_{d_0}}[Y \mid c]\}$, (204)

and this bound is tight.

Proof. Consider an agent with internal model $\widehat{M}$ and utility function $Y$. Recall that the agent's expected harm of a decision $d$ with respect to a baseline $d_0$, in context $c$, is

$\Omega(d, d_0) := \mathbb{E}_{\widehat{P}}[\max\{0, Y_{d_0} - Y_d\} \mid c]$. (205)
We can re-write this quantity as follows:

$\Omega(d, d_0) = \mathbb{E}_{\widehat{P}}[\max\{0, Y_{d_0} - Y_d\} \mid c]$ (206)
$= \int \max\{0, y_{d_0} - y_d\} \, \widehat{P}(y_d, y_{d_0} \mid c) \, dy_d \, dy_{d_0}$. (207)

Since $Y_d$ is binary, the only time that the maximum evaluates to something greater than zero is when $Y_{d_0} = 1$ and $Y_d = 0$. Then

$\Omega(d, d_0) = \widehat{P}(Y_{d_0} = 1, Y_d = 0 \mid c)$. (208)

This quantity can be tightly bounded using the results of Tian & Pearl (2000, Sec. 4.2.2), giving

$\max\{0, \mathbb{E}_{\widehat{P}_{d_0}}[Y \mid c] - \mathbb{E}_{\widehat{P}_d}[Y \mid c]\} \leq \Omega(d, d_0) \leq \min\{1 - \mathbb{E}_{\widehat{P}_d}[Y \mid c], \mathbb{E}_{\widehat{P}_{d_0}}[Y \mid c]\}$. (209)

And by grounding,

$\max\{0, \mathbb{E}_{P_{d_0}}[Y \mid c] - \mathbb{E}_{P_d}[Y \mid c]\} \leq \Omega(d, d_0) \leq \min\{1 - \mathbb{E}_{P_d}[Y \mid c], \mathbb{E}_{P_{d_0}}[Y \mid c]\}$. (210)

D. Other accounts of fairness and harm

To ground definitions of fairness, several authors appeal to counterfactual thinking, but some accounts, instead, are interventional in nature. Within legal systems, counterfactual fairness (Def. 7) operationalizes the doctrine known as disparate impact, which focuses on outcome fairness, namely, the equality of outcomes among protected groups. On the other hand, the doctrine of disparate treatment seeks to enforce the equality of treatment in different groups, prohibiting the use of a protected attribute in the decision process, and has been formalized using interventional accounts (Barocas & Selbst, 2016). A popular notion in the disparate treatment literature is known as direct discrimination (Barocas & Selbst, 2016; Zhang & Bareinboim, 2018). An agent is said to engage in direct discrimination if the causal influence of a sensitive attribute $Z$ that is not mediated by other variables $C$ is non-zero. This is a contrast between interventional expectations. We adapt this notion to define an AI's perceived direct fairness gap as the difference in expected utilities obtained for different values of a protected attribute while holding all other variables fixed.

Definition 10 (Direct Discrimination Gap). Let $Z \in \{z_0, z_1\}$ be a protected attribute. For a given utility $Y$, define an agent's direct discrimination gap relative to a baseline value $z_0$ in a given context $c$ as

$\Psi(d, c) := \mathbb{E}_{\widehat{P}}[Y_{d,z_1,c}] - \mathbb{E}_{\widehat{P}}[Y_{d,z_0,c}]$. (211)

We say that an AI intends to avoid direct discrimination if under any context $C = c$ and decision $D = d$ the direct discrimination gap $\Psi$ evaluates to $0$. Here, we consider this notion of fairness to illustrate the kind of inference that it is possible to obtain from an AI's external behaviour with one alternative account. The following theorem shows that, contrary to the counterfactual fairness gap, $\Psi$ can be bounded given the AI's external behaviour.

Theorem 9. Consider an agent with utility $Y$ grounded in a domain $M$. Then

$\Psi(d, c) \geq \mathbb{E}_{P_d}[Y \mid z_1, c] P_d(z_1, c) - \mathbb{E}_{P_d}[Y \mid z_0, c] P_d(z_0, c) + P_d(z_0, c) - 1$, (212)
$\Psi(d, c) \leq \mathbb{E}_{P_d}[Y \mid z_1, c] P_d(z_1, c) - \mathbb{E}_{P_d}[Y \mid z_0, c] P_d(z_0, c) + 1 - P_d(z_1, c)$. (213)

This bound is tight.

Proof. Let $Z \in \{0, 1\}$ be a protected attribute and $z_0$ a baseline value of $Z$. For a given utility variable $Y$, recall that the AI's direct fairness gap relative to a baseline $z_0$ in a given context $c$ is defined as

$\Psi(d, c) := \mathbb{E}_{\widehat{P}}[Y_{d,z_1,c}] - \mathbb{E}_{\widehat{P}}[Y_{d,z_0,c}]$. (214)

Using a similar proof strategy to that in Thm. 1, we can derive tight bounds on $\Psi$.
Analytical Lower Bound. A lower bound on the interventional expectation can be obtained using the following derivation:

$\mathbb{E}_{\widehat{P}}[Y_{z,c,d}] = \sum_{\tilde{c}, \tilde{z}} \mathbb{E}_{\widehat{P}}[Y_{z,c,d} \mathbb{1}_{\tilde{c},\tilde{z}}(C_d, Z_{c,d})]$, marginalizing over $c_d, z_{c,d}$ (215)
$\geq \mathbb{E}_{\widehat{P}}[Y_{z,c,d} \mathbb{1}_{c,z}(C_d, Z_{c,d})]$, since summands $\geq 0$ (216)
$= \mathbb{E}_{\widehat{P}}[Y_{c,d} \mathbb{1}_{c,z}(C_d, Z_{c,d})]$, by consistency (217)
$= \mathbb{E}_{\widehat{P}}[Y_d \mathbb{1}_{c,z}(C_d, Z_d)]$, by consistency (218)
$= \mathbb{E}_{P_d}[Y \mathbb{1}_{c,z}(C, Z)]$, by grounding (219)
$= \mathbb{E}_{P_d}[Y \mid c, z] P_d(c, z)$. (220)

Analytical Upper Bound. For deriving an upper bound on the interventional expectation, we start by noting that

$\mathbb{E}_{\widehat{P}}[Y_{z,c,d}] = 1 - \mathbb{E}_{\widehat{P}}[1 - Y_{z,c,d}]$. (221)

Leveraging the bounds derived above we obtain

$\mathbb{E}_{\widehat{P}}[Y_{z,c,d}] \leq 1 - \mathbb{E}_{P_d}[(1 - Y) \mid c, z] P_d(c, z)$ (222)
$= \mathbb{E}_{P_d}[Y \mid c, z] P_d(c, z) + 1 - P_d(c, z)$. (223)

By setting $z = z_1$ in the lower bound and $z = z_0$ in the upper bound of the expected utility, we obtain a lower bound on the difference of expected utilities:

$\Psi(d, c) \geq \mathbb{E}_{P_d}[Y \mid z_1, c] P_d(z_1, c) - \mathbb{E}_{P_d}[Y \mid z_0, c] P_d(z_0, c) + P_d(z_0, c) - 1$. (224)

And similarly, by setting $z = z_1$ in the upper bound and $z = z_0$ in the lower bound of the expected utility, we obtain an upper bound on the difference of expected utilities:

$\Psi(d, c) \leq \mathbb{E}_{P_d}[Y \mid z_1, c] P_d(z_1, c) - \mathbb{E}_{P_d}[Y \mid z_0, c] P_d(z_0, c) + 1 - P_d(z_1, c)$. (225)

We now show that these bounds are tight by constructing SCMs (that is, possible world models of the AI system) that evaluate to the lower and upper bounds while generating the distribution of agent interactions $P_d$.

Tightness, Lower Bound. For the lower bound we will consider the following SCM $M^1$:

$Z \leftarrow f_Z(u)$, $C \leftarrow f_C(u)$, $D \leftarrow d$,
$Y \leftarrow \begin{cases} f_Y(d, c, z_1, u) & \text{if } f_Z(u) = z_1, f_C(u) = c \\ 0 & \text{if } [f_Z(u) \neq z_1 \text{ or } f_C(u) \neq c] \text{ and } Z = z_1 \\ f_Y(d, c, z_0, u) & \text{if } f_Z(u) = z_0, f_C(u) = c \\ 1 & \text{if } [f_Z(u) \neq z_0 \text{ or } f_C(u) \neq c] \text{ and } Z = z_0 \end{cases}$
with $P(U)$.

Here $\{f_Z, f_C, f_Y, U, P(U)\}$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P^{M^1}_d(v) = P^{\widehat{M}}_d(v)$ for all $v \in \text{supp } V$. Then, under $M^1_d$,

$\Psi(d, c) = \mathbb{E}_{P^{M^1}}[Y_{d,z_1,c}] - \mathbb{E}_{P^{M^1}}[Y_{d,z_0,c}]$ (227)
$= \sum_u \mathbb{E}_{P^{M^1}}[Y_{d,z_1,c} \mid u] P^{M^1}(u) - \sum_u \mathbb{E}_{P^{M^1}}[Y_{d,z_0,c} \mid u] P^{M^1}(u)$ (228)-(229)
$= \sum_u \mathbb{E}_{P^{M^1}}[Y_d \mid z_1, u, c] P^{M^1}(u) - \sum_u \mathbb{E}_{P^{M^1}}[Y_d \mid z_0, u, c] P^{M^1}(u)$ (230)-(231)
$= \mathbb{E}_{P^{M^1}_d}[Y \mid z_1, c, \{u : f_Z(u) = z_1, f_C(u) = c\}] P^{M^1}(\{u : f_Z(u) = z_1, f_C(u) = c\})$ (232)
$\; + \mathbb{E}_{P^{M^1}_d}[Y \mid z_1, c, \{u : f_Z(u) \neq z_1 \text{ or } f_C(u) \neq c\}] P^{M^1}(\{u : f_Z(u) \neq z_1 \text{ or } f_C(u) \neq c\})$ (233)
$\; - \mathbb{E}_{P^{M^1}_d}[Y \mid z_0, c, \{u : f_Z(u) = z_0, f_C(u) = c\}] P^{M^1}(\{u : f_Z(u) = z_0, f_C(u) = c\})$ (234)
$\; - \mathbb{E}_{P^{M^1}_d}[Y \mid z_0, c, \{u : f_Z(u) \neq z_0 \text{ or } f_C(u) \neq c\}] P^{M^1}(\{u : f_Z(u) \neq z_0 \text{ or } f_C(u) \neq c\})$ (235)
$= \mathbb{E}_{P^{M^1}_d}[Y \mid z_1, c] P^{M^1}_d(z_1, c) - \mathbb{E}_{P^{M^1}_d}[Y \mid z_0, c] P^{M^1}_d(z_0, c) - 1 + P^{M^1}_d(z_0, c)$. (236)

This expression is the same as the analytical bound, showing that it is tight.

Tightness, Upper Bound. For the upper bound we will consider the following SCM $M^2$:

$Z \leftarrow f_Z(u)$, $C \leftarrow f_C(u)$, $D \leftarrow d$,
$Y \leftarrow \begin{cases} f_Y(d, c, z_1, u) & \text{if } f_Z(u) = z_1, f_C(u) = c \\ 1 & \text{if } [f_Z(u) \neq z_1 \text{ or } f_C(u) \neq c] \text{ and } Z = z_1 \\ f_Y(d, c, z_0, u) & \text{if } f_Z(u) = z_0, f_C(u) = c \\ 0 & \text{if } [f_Z(u) \neq z_0 \text{ or } f_C(u) \neq c] \text{ and } Z = z_0 \end{cases}$
with $P(U)$.

Here $\{f_Z, f_C, f_Y, U, P(U)\}$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P^{M^2}_d(v) = P^{\widehat{M}}_d(v)$ for all $v \in \text{supp } V$.
Then, under $M^2_d$,

$\Psi(d, c) = \mathbb{E}_{P^{M^2}}[Y_{d,z_1,c}] - \mathbb{E}_{P^{M^2}}[Y_{d,z_0,c}]$ (238)
$= \sum_u \mathbb{E}_{P^{M^2}}[Y_{d,z_1,c} \mid u] P^{M^2}(u) - \sum_u \mathbb{E}_{P^{M^2}}[Y_{d,z_0,c} \mid u] P^{M^2}(u)$ (239)-(240)
$= \sum_u \mathbb{E}_{P^{M^2}}[Y_d \mid z_1, u, c] P^{M^2}(u) - \sum_u \mathbb{E}_{P^{M^2}}[Y_d \mid z_0, u, c] P^{M^2}(u)$ (241)-(242)
$= \mathbb{E}_{P^{M^2}_d}[Y \mid z_1, c, \{u : f_Z(u) = z_1, f_C(u) = c\}] P^{M^2}(\{u : f_Z(u) = z_1, f_C(u) = c\})$ (243)
$\; + \mathbb{E}_{P^{M^2}_d}[Y \mid z_1, c, \{u : f_Z(u) \neq z_1 \text{ or } f_C(u) \neq c\}] P^{M^2}(\{u : f_Z(u) \neq z_1 \text{ or } f_C(u) \neq c\})$ (244)
$\; - \mathbb{E}_{P^{M^2}_d}[Y \mid z_0, c, \{u : f_Z(u) = z_0, f_C(u) = c\}] P^{M^2}(\{u : f_Z(u) = z_0, f_C(u) = c\})$ (245)
$\; - \mathbb{E}_{P^{M^2}_d}[Y \mid z_0, c, \{u : f_Z(u) \neq z_0 \text{ or } f_C(u) \neq c\}] P^{M^2}(\{u : f_Z(u) \neq z_0 \text{ or } f_C(u) \neq c\})$ (246)
$= \mathbb{E}_{P^{M^2}_d}[Y \mid z_1, c] P^{M^2}_d(z_1, c) + 1 - P^{M^2}_d(z_1, c) - \mathbb{E}_{P^{M^2}_d}[Y \mid z_0, c] P^{M^2}_d(z_0, c)$. (247)

This expression is the same as the analytical bound, showing that it is tight.

Definitions of harm (defined with respect to a causal model) can also be split into two groups: causal and counterfactual accounts. Beckers et al. (2022) exemplify the causal account as defining a decision $d$ to harm a person if and only if $d$ is a cause of harm. Recall that the counterfactual account has the same structure but differs in the second clause, instead defining a decision $d$ to harm a person if and only if she would have been better off had $d$ not been taken. Here, we quantify how good or beneficial a particular situation $V = v$ is with a binary utility $Y \in \{y_0, y_1\}$ that we assume is tracked in experiments (it might capture, for example, the value of sensitive environmental variables). A formalisation of this causal account of harm, with respect to an AI's internal model, is given in the following definition.

Definition 11 (Causal Harm Gap). Consider an agent with internal model $\widehat{M}$ and utility $Y \in \{y_0, y_1\}$. The agent's expected causal harm of a decision $d_1$, with respect to a baseline $d_0$ that obtained the non-harmful outcome $y_0$ in context $c$, is

$\Omega(d_1, d_0, c) := \mathbb{E}_{\widehat{P}}[Y_{d_1} \mid y_0, d_0, c]$. (248)

This probability expresses the capacity of $d_1$ to produce a harmful event $Y = y_1$. Since such harm implies a transition from the absence to the presence of $d_1$ and $y_1$, we condition the probability on situations where $d_1$ and $y_1$ are absent, i.e. $D = d_0$, $Y = y_0$.

Theorem 10. Consider an agent with utility $Y$ grounded in a domain $M$. Then

$\frac{P_{d_1}(y_1 \mid c) - P(y_1 \mid c)}{P_{d_0}(y_0 \mid c) P(d_0 \mid c)} \leq \Omega(d_1, d_0, c) \leq \frac{P_{d_1}(y_1 \mid c) - P_{d_1}(y_1 \mid c) P(d_1 \mid c)}{P_{d_0}(y_0 \mid c) P(d_0 \mid c)}$. (249)

Proof. Note that the causal harm gap may be equivalently written

$\Omega(d_1, d_0, c) = \widehat{P}(y_{1,d_1} \mid y_0, d_0, c)$. (250)

The lower and upper bounds may be derived by considering the following:

$\widehat{P}(y_{1,d_1} \mid c) = \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_{1,d_1}, y_1, d_0 \mid c) + \widehat{P}(y_{1,d_1}, y_0, d_1 \mid c) + \widehat{P}(y_{1,d_1}, y_1, d_1 \mid c)$ (251)
$= \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_{1,d_1}, y_1, d_0 \mid c) + \widehat{P}(y_{1,d_1}, y_1, d_1 \mid c)$, by consistency (252)
$= \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_{1,d_1}, y_1 \mid c)$ (253)
$\leq \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_1 \mid c)$, (254)

which implies $\widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) \geq \widehat{P}(y_{1,d_1} \mid c) - \widehat{P}(y_1 \mid c)$. Similarly,

$\widehat{P}(y_{1,d_1} \mid c) = \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_{1,d_1}, y_1 \mid c)$ (255)
$\geq \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_{1,d_1}, y_1, d_1 \mid c)$ (256)
$= \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_{1,d_1}, d_1 \mid c)$, by consistency (257)
$= \widehat{P}(y_{1,d_1}, y_0, d_0 \mid c) + \widehat{P}(y_{1,d_1} \mid c) \widehat{P}(d_1 \mid c)$, (258)

where $\widehat{P}(d_1 \mid c)$ stands for the AI's policy in the source environment, i.e., the probability with which it chooses decision $d_1$ in situation $c$ (so that the counterfactual $Y_{d_1}$ is independent of $D$ given $C$, and similarly $\widehat{P}(y_0, d_0 \mid c) = \widehat{P}(y_{0,d_0} \mid c) \widehat{P}(d_0 \mid c)$). Re-arranging these equations implies

$\frac{\widehat{P}(y_{1,d_1} \mid c) - \widehat{P}(y_1 \mid c)}{\widehat{P}(y_{0,d_0} \mid c) \widehat{P}(d_0 \mid c)} \leq \Omega(d_1, d_0, c) \leq \frac{\widehat{P}(y_{1,d_1} \mid c) - \widehat{P}(y_{1,d_1} \mid c) \widehat{P}(d_1 \mid c)}{\widehat{P}(y_{0,d_0} \mid c) \widehat{P}(d_0 \mid c)}$. (259)

And by grounding,

$\frac{P_{d_1}(y_1 \mid c) - P(y_1 \mid c)}{P_{d_0}(y_0 \mid c) P(d_0 \mid c)} \leq \Omega(d_1, d_0, c) \leq \frac{P_{d_1}(y_1 \mid c) - P_{d_1}(y_1 \mid c) P(d_1 \mid c)}{P_{d_0}(y_0 \mid c) P(d_0 \mid c)}$. (260)
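To close, a small sketch evaluating the two harm bounds (Thm. 6 and Thm. 10, as stated above) side by side; all numeric inputs are hypothetical values of ours, chosen only for illustration:

```python
# Closed-form harm bounds from behavioural quantities (hypothetical inputs).
E_d1 = 0.7   # E_{P_{d1}}[Y | c] = P_{d1}(y1 | c), interventional
E_d0 = 0.5   # E_{P_{d0}}[Y | c] = P_{d0}(y1 | c)

# Thm. 6: bounds on Omega(d1, d0) = P(Y_{d0} = 1, Y_{d1} = 0 | c)
lo6 = max(0.0, E_d0 - E_d1)          # -> 0.0
hi6 = min(1 - E_d1, E_d0)            # -> 0.3

# Thm. 10: bounds on Omega(d1, d0, c) = P(Y_{d1} = y1 | y0, d0, c)
P_y1    = 0.6                        # observational P(y1 | c)
P_pi_d0 = 0.4                        # the AI's policy P(d0 | c)
P_d0_y0 = 1 - E_d0                   # P_{d0}(y0 | c)
lo10 = (E_d1 - P_y1) / (P_d0_y0 * P_pi_d0)                    # -> 0.5
hi10 = (E_d1 - E_d1 * (1 - P_pi_d0)) / (P_d0_y0 * P_pi_d0)    # -> 1.4
print((lo6, hi6), (lo10, min(1.0, hi10)))  # upper bound clipped to 1, since
                                           # Omega is itself a probability
```

Note that the Thm. 10 bounds, unlike those of Thm. 6, can be vacuous on one side (here the upper bound exceeds 1 and is clipped), reflecting how much weaker the causal account's identifying information is in this example.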