
Observation Interference in Partially Observable Assistance Games

Scott Emmons*1, Caspar Oesterheld*2, Vincent Conitzer2, Stuart Russell1

We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human's observations? First, we prove that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Though this result seems to contradict the classic theorem from single-agent decision making that the value of information is nonnegative, we resolve this seeming contradiction by developing a notion of interference defined on entire policies. This can be viewed as an extension of the classic result that the value of information is nonnegative into the cooperative multiagent setting. Second, we prove that if the human is simply making decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human's preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.

*Equal contribution. 1Center for Human-Compatible AI, University of California, Berkeley. 2Foundations of Cooperative AI Lab, Carnegie Mellon University. Correspondence to: Scott Emmons <emmons@berkeley.edu>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Assistance games provide a formalization of the human-AI value alignment problem (Shah et al., 2020). They are based on Hidden Goal MDPs (Fern et al., 2014) and Cooperative Inverse Reinforcement Learning (CIRL) (Hadfield-Menell et al., 2016), an extension of Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004). In assistance games, a single human and a single AI assistant share the same reward function, but this reward function is only known to the human; the assistant must learn it. In assistance games, desirable properties, such as teaching by the human and learning by the assistant, emerge as optimal solutions to the game (Shah et al., 2020). (This contrasts with prior work on algorithms where teaching is an explicit objective (Cakmak & Lopes, 2012; Goldman & Kearns, 1995; Balbach & Zeugmann, 2009).) For example, Woodward et al. (2020) find that deep neural networks solving an assistance game invent strategies that involve information sharing, information seeking, and question answering.

Past analysis of assistance games was done assuming that the state of the world is fully observed by both the human and the assistant (Hadfield-Menell et al., 2016; 2017). While Shah et al.'s (2020) definition of an assistance game allows for partial observability, they do not study its implications. In this work, we introduce the notion of a partially observable assistance game (POAG) to study the more general case faced in reality: when the world is only partially observable. Partial observability raises new issues surrounding the communication of private information. A priori, we might hope that AI assistants never take any action that obstructs information. Yet our analysis will show that even assistants which perfectly share our goals must sometimes obstruct information to communicate other, more important information.

This tension connects to broader work on AI deception, which recent research approaches from multiple angles. Park et al. (2024) provide a philosophical definition and empirical survey of AI deception, while Ward et al. (2023) define deception in structural causal games. Of particular relevance is work analyzing how reinforcement learning from human feedback (RLHF), which can be seen as an algorithm for solving assistance games, can lead to deception. Lang et al. (2024) prove that partial observability in RLHF can create dual risks of deceptive inflating and overjustification. Complementing Lang et al.'s (2024) theory, Wen et al. (2024) and Williams et al. (2024) provide experimental evidence that optimizing for human feedback teaches language models to mislead humans. However, these works primarily focus on misaligned AI systems that deceive for their own goals. We study the subtle case where a perfectly aligned AI assistant might obstruct information for the human's benefit.

Concretely, we seek to understand whether observation interference emerges as optimal behavior in an AI assistant that shares the human's goals. We take a game-theoretic approach, studying qualitative properties of optimal policy pairs and best responses in POAGs. To start, we define an observation-interfering action as one which provides the human with a subset of the information available with an otherwise-equivalent action. We then analyze whether the AI assistant ever takes observation-interfering actions in optimal policy pairs or best responses.

Our analysis reveals three distinct incentives for an AI assistant to take observation-interfering actions. First, when the assistant has private information, it might need to interfere with observations to communicate its private information to the human (Section 4.2). This can happen even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. This result presents a puzzle, as it seems to contradict the classic theorem from single-agent decision making that the value of information (sometimes also called the value of perfect information) is nonnegative (e.g., Koller & Friedman, 2009, Sect. 23.7; Russell & Norvig, 2010, Sect. 16.6.3). To resolve this seeming contradiction, we develop a notion of interference defined on entire policies rather than individual actions. While optimal solutions (i.e., human-AI policy pairs) might involve the AI assistant taking individual actions which would on their own be observation interference, we prove that there is always an optimal solution with no observation interference when we consider the AI assistant's overall policy. This can be viewed as an extension of the classic result that the value of (perfect) information is nonnegative into the cooperative multiagent setting.

This result connects to a broader literature on the value of information in multiagent settings. In games with competing interests, it is well-known that introducing common knowledge can lead to worse outcomes for all players (Kamien et al., 1990). Using a set-theoretic framework, Bassan et al. (2003) establish a class of general-sum games where additional information Pareto-improves all of the Nash equilibria. Their class of games includes common-payoff games. Using a probabilistic framework, Lehrer et al. (2010) extend this analysis to alternative solution concepts. Notably, Bassan et al. (2003) and Lehrer et al. (2010) consider only single-timestep games where players simultaneously act without observing the other players' actions. In our setting, the environment evolves over time, and the players can influence each other's observations through their actions. Our results show that this influence on observations, including observation interference, is a key feature that enables the communication of private information to achieve better outcomes.

In our setting, even if a non-interference solution exists, it might require that the human send information to the assistant via an unnatural communication convention. We find that a second incentive for observation interference occurs if the human is instead just making decisions based on the immediate reward of those decisions. In that case, the assistant's best response might require observation interference as a form of preference query (Section 5). We prove that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate her preferences to the assistant.

When the human is making irrational decisions, it creates a third incentive for the assistant to interfere with observations. For example, we show that if a Boltzmann-rational decision maker has a higher error rate when presented with complete information, the assistant might suppress information to give the human an easier decision (Section 6).

Finally, in Section 7, we use an experimental model to investigate tradeoffs the assistant faces when deciding whether or not to interfere with observations. In line with our theory, we find that observation interference allows the AI assistant to communicate private information, but it comes at the cost of destroying useful information. Measuring this tradeoff, we find that having more private information leads to a stronger incentive to interfere with observations.

2. Preliminaries / Setup

2.1. Partially Observable Assistance Games

We study partially observable assistance games (Shah et al., 2020):

Definition 2.1. A partially observable assistance game (POAG) $M$ is a two-player Dec-POMDP with a human or principal, H, and an AI assistant, A. The game is described by a tuple $M = \langle S, \{A^H, A^A\}, T(\cdot \mid \cdot, \cdot, \cdot), \{\Theta, R(\cdot, \cdot, \cdot; \cdot)\}, \{\Omega^H, \Omega^A\}, O(\cdot, \cdot \mid \cdot, \cdot, \cdot), P_0(\cdot, \cdot), \gamma \rangle$, with the following definitions: $S$, a set of world states: $s \in S$; $A^H$, a set of actions for H: $a^H \in A^H$; $A^A$, a set of actions for A: $a^A \in A^A$; $T(\cdot \mid \cdot, \cdot, \cdot)$, a conditional distribution on the next world state, given previous state and action for both players: $T(s' \mid s, a^H, a^A)$; $\Theta$, a set of possible static reward parameter values, only observed by H: $\theta \in \Theta$; $R(\cdot, \cdot, \cdot; \cdot)$, a parameterized reward function that maps world states, joint actions, and reward parameters to real numbers: $R : S \times A^H \times A^A \times \Theta \to \mathbb{R}$; $\Omega^H$, a set of observations for H: $o^H \in \Omega^H$; $\Omega^A$, a set of observations for A: $o^A \in \Omega^A$; $O(\cdot, \cdot \mid \cdot, \cdot, \cdot)$, a conditional distribution on the observations, given the next world state and action of both players: $O(o^H, o^A \mid s', a^H, a^A)$; $P_0(\cdot, \cdot)$, a distribution over the initial state, represented as tuples: $P_0(s_0, \theta)$; and $\gamma$, a discount factor: $\gamma \in [0, 1]$.

We denote H's and A's marginal observation distributions as $O^H(o^H \mid s', a^H, a^A) = \sum_{o^A} O(o^H, o^A \mid s', a^H, a^A)$ and $O^A(o^A \mid s', a^H, a^A) = \sum_{o^H} O(o^H, o^A \mid s', a^H, a^A)$. We consider H policies $\pi^H$ which, at timestep $t$, take as input the full history of H's observations and actions $h^H_t \in (\Omega^H \times A^H)^t$ and map to a distribution over actions $A^H$. A's policy $\pi^A : (\Omega^A \times A^A)^t \to \Delta(A^A)$ is analogous. We call $\pi^H$ a best response to $\pi^A$ when $\pi^H$ maximizes expected discounted reward given $\pi^A$, i.e., $\pi^H \in \arg\max_{\hat\pi^H} \mathbb{E}_{\hat\pi^H, \pi^A}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a^H_t, a^A_t \mid \theta)\right]$, where the expectation is taken over trajectories induced by the policies $(\pi^H, \pi^A)$ and initial distribution $P_0$. The best response for A is defined analogously. A policy pair $(\pi^H, \pi^A)$ is optimal if it maximizes the expected discounted reward in the POAG: $(\pi^H, \pi^A) = \arg\max_{\hat\pi^H, \hat\pi^A} \mathbb{E}_{\hat\pi^H, \hat\pi^A}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a^H_t, a^A_t \mid \theta)\right]$.

Note that optimal policy pairs are in particular Nash equilibria for the shared reward function R. Computationally, POAGs are equivalent to 2-player decentralized partially observable Markov decision processes (Dec-POMDPs). Thus, finding optimal policy pairs for POAGs is NEXP-hard in general (Bernstein et al., 2002; cf. Reif, 1984). A POAG may have multiple distinct optimal policy pairs, as there may be different ways for H and A to communicate or resolve coordination problems.

While the examples in this paper are simple, POAGs, and thus all our positive results, inherit the broad generality of Dec-POMDPs. POAGs can model games where H acts first, where A acts first, or where H and A act simultaneously. POAGs allow both H and A to observe private information at multiple times, as well as take actions that influence both the state of the world and each other's observations.

2.2. Beliefs and Calibration of Beliefs

We are motivated to study observation interference because of its potential impact on H's belief about the state of the world. If A interferes with observations, could this cause H to have false beliefs?

To address this question, we apply known techniques to establish what information H needs to form calibrated beliefs in a POAG. (See Appendix A for proofs.) The simplest case of H knowing A's policy is when A is playing a fixed policy:

Proposition 2.2. Suppose A is playing a fixed policy. If H knows A's policy along with the POAG specification $M$, then H can form calibrated beliefs about the world state. For any timestep $t$ and state $s_t$, H can form $P(s_t \mid o^H_{1:t})$, the probability of $s_t$ given H's observation history $o^H_{1:t}$.

In an iterated setting where A updates its policy between iterations, H can form beliefs if H additionally knows the policy update rule.

Proposition 2.3. Suppose A is updating its policy each iteration of the game. Knowledge of the game dynamics, of A's initial policy, and of A's update rule is sufficient for H to form calibrated beliefs about A's future policy and about the world state.

Remark 2.4. Propositions 2.2 and 2.3 hold even if A is interfering with observations (Definition 3.2).

Remark 2.5. Proposition 2.2 and Proposition 2.3 continue to hold if H only knows a prior over A's policy. H can form a posterior using Bayes' rule; the posterior is calibrated if the prior is calibrated.

When H knows A's policy, the preceding results show that H can form calibrated beliefs about the world, even when A is interfering with observations. Observation interference increases H's uncertainty, but it doesn't break the calibration of H's beliefs. Because H can still form calibrated beliefs in this setting, our work uses the concept of interference rather than the concept of deception.

3. Defining Observation Interference

Observation Interference. First, we define what interference means. Intuitively, interference is taking an action so that the human receives a less informative signal about the state. In particular, the human receives, in some sense, a subset of the information. We formalize this by saying one signal is less informative than another about the state if (without knowing the state) we could generate the less informative signal from the more informative one (cf. Blackwell et al., 1951; Blackwell, 1953; de Oliveira, 2018).

Definition 3.1. Let $(P(\cdot \mid s))_{s \in S}$ and $(\hat{P}(\cdot \mid s))_{s \in S}$ be families of probability distributions over $\Omega$. We say that $\hat{P}$ is at most as informative as $P$ if there exists a stochastic function $F : \Omega \to \Omega$ (mapping observations to random variables over observations) s.t. for all states $s$ we have $F(X) \sim \hat{P}(\cdot \mid s)$ if $X \sim P(\cdot \mid s)$. We say that $P$ is (strictly) more informative than $\hat{P}$ if $P$ is at least as informative as $\hat{P}$ but not vice versa.

Why do we include the condition "for all states $s$" in Definition 3.1? Intuitively, we want it always to be possible to use the stochastic function $F$ to reconstruct the less informative signal from the more informative signal. Since our setting is partially observable, the "for all states $s$" condition allows a player to do this reconstruction in any scenario, even if their observations don't enable them to infer the state.
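As a minimal sketch of Definition 3.1 for finite observation sets, the stochastic function F can be written as a row-stochastic matrix that is applied identically in every state. The numbers and the helper name `garble` below are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction as Fr


def garble(P, F):
    """Push each state's observation distribution through the stochastic map F.

    P[s][o] is P(o | s); F[o][o2] is the probability that F maps o to o2.
    The same F is used for every state, matching the "for all states s"
    condition in Definition 3.1.
    """
    n_out = len(F[0])
    return [
        [sum(row[i] * F[i][j] for i in range(len(row))) for j in range(n_out)]
        for row in P
    ]


# Hypothetical two-state, two-observation signal.
P = [[Fr(9, 10), Fr(1, 10)],
     [Fr(2, 10), Fr(8, 10)]]

# Hypothetical garbling: flip the observation with probability 1/4.
F = [[Fr(3, 4), Fr(1, 4)],
     [Fr(1, 4), Fr(3, 4)]]

# P_hat is at most as informative as P, because it was generated from P
# without looking at the state.
P_hat = garble(P, F)
```

Here `P_hat` is noisier than `P` in both states, yet each row remains a valid probability distribution.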

With this definition in hand, we define an observation-interfering action as one that results in the human's observation being less informative about the state than the observation distribution resulting from another assistant action. We additionally require that this other action has the same effects on the state and immediate reward. After all, it is clear that sometimes A has to trade off providing information to H with optimizing its effect on the environment.

Definition 3.2. Let $M$ be any POAG. We say that $\hat{a}^A$ is observation-interfering if there exists some other action $a^A$ s.t. $\hat{a}^A$ and $a^A$ have the same effect on state transitions and immediate rewards, but for all $a^H$, we have that $(O^H(\cdot \mid a^H, s, a^A))_{s \in S}$ is more informative than $(O^H(\cdot \mid a^H, s, \hat{a}^A))_{s \in S}$.

To discuss policies that play observation-interfering actions, we use the following definition:

Definition 3.3. We say that a policy $\pi^A$ interferes with observations at the action level (or equivalently, takes observation-interfering actions) in a POAG $M$ if there is any history $h \in (\Omega^A \times A^A)^*$ where $\pi^A(\cdot \mid h)$ assigns positive probability to an observation-interfering action.

Lack of Private Information. To understand the conditions under which interference occurs, it is useful to consider POAGs where one of the players has no private information.

Definition 3.4. For a POAG $M$, we say A has no private information if there exists a function $f$ determining A's observations from H's observations. That is, for all state-action tuples $(s', a^H, a^A)$ and observation pairs $(o^H, o^A) \in \mathrm{supp}(O(\cdot, \cdot \mid s', a^H, a^A))$, we must have $f(o^H) = o^A$.
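For a finite POAG, the condition in Definition 3.4 can be checked directly: such an f exists exactly when no observation of H is paired with two different observations of A anywhere in the support. The sketch below assumes the supported pairs have already been collected across all state-action tuples; the function name `find_f` and the toy observation labels are hypothetical.

```python
def find_f(support_pairs):
    """Return a dict f with f[o_h] == o_a for every supported pair,
    or None if some o_h is paired with two different o_a values
    (in which case A has private information)."""
    f = {}
    for o_h, o_a in support_pairs:
        # setdefault stores o_a on first sight and returns the stored value.
        if f.setdefault(o_h, o_a) != o_a:
            return None
    return f


# Toy supports (illustrative labels only).
shared = [("log: ok", "ok"), ("log: fail", "fail"), ("log: ok", "ok")]
private = [("list", "compatible: v1"), ("list", "compatible: v2")]
```

On `shared`, H's observation pins down A's observation, so A has no private information. On `private`, the same observation for H is consistent with two different observations for A.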

Communication. To further understand the motivations behind interference, we will also consider POAGs in which the players are able to directly communicate. Thus, for any given POAG, the following defines a variant of that POAG in which the players have an additional channel for communication. We will always assume that the channel has enough bandwidth for the sender to share all private information, i.e., that there is an injection from the sender's observation space into the message space.

Definition 3.5. Let $M$ be a POAG. Define $M^{A \to H}$, $M^{H \to A}$, and $M^{H \leftrightarrow A}$ as variants of $M$ with unbounded communication channels. We define $M^{H \to A}$ below; $M^{A \to H}$ and $M^{H \leftrightarrow A}$ are analogous.

To construct $M^{H \to A}$, let $\mathcal{M}$ be some set of possible messages/signals s.t. there is an injection $\Omega^A \hookrightarrow \mathcal{M}$. Then, construct a new human action space $\hat{A}^H = A^H \times \mathcal{M}$ and new assistant observation space $\hat{\Omega}^A = \Omega^A \times \mathcal{M}$. The new observation kernel has $\hat{O}(o^H, (o^A, m') \mid s', (a^H, m), a^A) = \mathbb{1}[m = m'] \, O(o^H, o^A \mid s', a^H, a^A)$. For everything else, the messages are simply ignored.
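The channel construction in Definition 3.5 can be sketched as a wrapper around the base observation kernel. The sketch assumes the kernel is given as a Python function returning a probability; that representation, the wrapper name `add_h_to_a_channel`, and the toy kernel `base_O` are assumptions made for illustration.

```python
def add_h_to_a_channel(O):
    """Wrap a base kernel O(o_h, o_a, s_next, a_h, a_a) -> probability so that
    H's action carries a message which A observes verbatim."""
    def O_hat(o_h, o_a_aug, s_next, a_h_aug, a_a):
        o_a, m_received = o_a_aug   # A's new observation is a pair (o_a, m')
        a_h, m_sent = a_h_aug       # H's new action is a pair (a_h, m)
        if m_received != m_sent:    # indicator 1[m = m']
            return 0.0
        return O(o_h, o_a, s_next, a_h, a_a)
    return O_hat


def base_O(o_h, o_a, s_next, a_h, a_a):
    """Hypothetical base kernel: H sees the state, A sees nothing useful."""
    return 1.0 if (o_h == s_next and o_a == "none") else 0.0


O_hat = add_h_to_a_channel(base_O)
```

The message only changes what A observes; transitions and rewards would be computed from the original action component alone.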

Plausible Human Policies. We may have various expectations on how H will play in a POAG. Especially if there are multiple optimal policy pairs, we may expect some of these policy pairs to be more plausible because they require simpler behavior of the human (cf. Hu et al., 2020; Treutlein et al., 2021). Both of the conditions below are based on the idea that A and H are unlikely to use consequential actions in the world to communicate with each other.

Our first condition intends to express a form of naivete on H's part in how she interprets her observations. Roughly, the condition says that H takes her observations at face value, i.e., as if they were not interfered with. She does not try to interpret them as a form of communication by A. For instance, if H reads a thermometer as saying that a temperature is 37 degrees, she chooses under the assumption that the temperature is indeed 37 degrees, rather than, say, interpreting 37 as a message sent by A.

Definition 3.6. We say that a human policy $\pi^H$ observes naively if $\pi^H$ is a best response to some $\pi^A$ that does not interfere with observations at the action level.

The second property is that when the human knows that her action has no effect on the state, then she chooses among actions that maximize immediate reward. To state this formally, we first define the following. We say that in $h^H_t$ actions don't affect state transitions if, for all $s$ s.t. $P(s \mid h^H_t, \pi^A) > 0$ for some $\pi^A$, we have that for all $a^A$ the transition probability $P(s' \mid s, a^A, a^H)$ is constant over $a^H$. We say that $\pi^H$ myopically maximizes reward in $h^H_t$ if there is some distribution $\alpha^A \in \Delta(A^A)$ s.t. $\pi^H(\cdot \mid h^H_t)$ randomizes only over actions in $\arg\max_{a^H} \mathbb{E}_{a^A \sim \alpha^A,\, s \sim P(\cdot \mid h^H_t, a^H, a^A)}\left[R(s, a^H, a^A, \theta)\right]$. (Intuitively, $\alpha^A$ is H's belief about what action A is going to take.)

Definition 3.7. We say that a human policy $\pi^H$ acts naively if whenever H faces a choice that doesn't affect state transitions (but potentially affects A's observation), H plays an action that myopically maximizes reward.

Importantly, if H acts naively, she is unwilling to play a suboptimal action to communicate information to A.
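The myopic arg max behind Definition 3.7 can be sketched for a finite action set. To keep the example short, it assumes the state and θ are fixed and folded into the reward function, so the expectation runs only over H's belief about A's action; the function name `myopic_actions` and the toy reward table are hypothetical.

```python
def myopic_actions(actions_h, alpha_a, reward, tol=1e-12):
    """Return H's actions that maximize expected immediate reward.

    alpha_a maps each assistant action to H's believed probability of it;
    reward(a_h, a_a) is the immediate reward with state and theta held fixed.
    """
    expected = {
        a_h: sum(p * reward(a_h, a_a) for a_a, p in alpha_a.items())
        for a_h in actions_h
    }
    best = max(expected.values())
    return [a_h for a_h in actions_h if expected[a_h] >= best - tol]


# Toy reward table (illustrative only).
TABLE = {("left", "x"): 1.0, ("left", "y"): 0.0,
         ("right", "x"): 0.0, ("right", "y"): 1.0}


def toy_reward(a_h, a_a):
    return TABLE[(a_h, a_a)]
```

A naively acting H may randomize over whatever this function returns, but she never picks an action outside that set in order to signal something to A.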

4. Communicating Private Information is an Incentive for Observation Interference

4.1. Revealing Errors can Emerge as an Optimal POAG Solution

We can model RLHF within the POAG framework as follows: A's goal in RLHF is to satisfy H's preferences. In a POAG, this corresponds to the shared reward function R which has a parameterization θ that only H knows. In RLHF, A rolls out trajectories, and H picks which trajectory is preferred. A POAG can model this by letting H observe pairs of trajectories explored by A but only giving H a binary action (to choose which trajectory H prefers). In RLHF, A's final policy maximizes an estimate of R based on a dataset of H's preference comparisons (Lang et al., 2024, Proposition 4.1). In the POAG framework, A can compute this policy based on A's observations of H's binary actions.

Past work has shown how RLHF can cause misleading (Wen et al., 2024) and deceptive (Williams et al., 2024; Lang et al., 2024) behaviors. Specifically, Lang et al. (2024, Example B.1) show that in order to get better human feedback, RLHF can have an incentive to hide error messages.

In contrast to RLHF, we show with the following example that revealing error messages can emerge in POAG solutions.

Example 4.1. First, A is executing on a remote machine where logging has been disabled by default. A takes one of two actions: (1) Attempt to install cuda. The installation succeeds with 50% probability. An empty observation is produced (since logging is disabled). (2) Re-enable logging and attempt to install cuda. The installation succeeds with 50% probability. An observation is produced containing a success or failure message.

Then, H takes one of two actions: (1) Run an experiment. If cuda is installed successfully, this yields +1 reward. Otherwise, it yields -2 reward. (2) Don't run an experiment. This always yields 0 reward.

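In the optimal policy pair, A re-enables logging; this reveals errors to H! Without logging, running the experiment has negative expected reward, so H's best option is not to run it; with logging, H runs the experiment exactly when the installation succeeded.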
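The arithmetic behind Example 4.1 can be checked with a short calculation. The sketch uses only the probabilities and rewards stated in the example and assumes H best-responds in each case; the function and constant names are illustrative.

```python
P_SUCCESS = 0.5
R_RUN_OK, R_RUN_FAIL, R_SKIP = 1.0, -2.0, 0.0


def value_without_logging():
    """H sees an empty observation, so she commits to one action up front."""
    run = P_SUCCESS * R_RUN_OK + (1 - P_SUCCESS) * R_RUN_FAIL
    return max(run, R_SKIP)


def value_with_logging():
    """H sees the success or failure message and picks the better action
    in each case."""
    on_success = max(R_RUN_OK, R_SKIP)
    on_failure = max(R_RUN_FAIL, R_SKIP)
    return P_SUCCESS * on_success + (1 - P_SUCCESS) * on_failure
```

Under these numbers the expected reward is 0 without logging and 0.5 with logging, which is why re-enabling logging is part of the optimal policy pair.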

In fact, if A has no private information, then it never needs to take observation-interfering actions for an optimal solution!

Theorem 4.2. Let M be any POAG. Let A have no private information. Then there is an optimal policy pair (πH, πA) for M in which πA does not interfere with observations at the action level (and πH observes naively).

4.2. Communicating Private Information is an Incentive for Observation Interference at the Action Level

One might hope that A would never take observation-interfering actions. After all, classic theory tells us that when H is in a single-agent setting, the value of (perfect) information is nonnegative (e.g., Koller & Friedman, 2009, Sect. 23.7; Russell & Norvig, 2010, Sect. 16.6.3): more informative observations never lead to worse solutions. But as it turns out, when H and A interact, there are some POAGs in which all optimal policy pairs require A to take observation-interfering actions. The main reason for A to take observation-interfering actions is to communicate its own private information to H. Consider the following example.

Example 4.3. H has typed apt list -a cuda to see the list of cuda versions available to be installed. Out of 10 total versions, only a (non-empty) subset are available. And of these available versions, only a subset are compatible with the other environment software.

First, A takes an action. For each of the 10 total cuda versions, A can choose whether or not to suppress it from the list of available packages. This gives A $2^{10}$ total actions, where 1 action is non-interfering (suppressing nothing), and the remaining $2^{10} - 1$ actions interfere with observations.

Second, H takes an action. H has 10 possible actions which try to install the corresponding version of cuda if it appears in the version list. If an available cuda version that is compatible with the other environment software is installed, it yields +1 reward. Otherwise, it yields 0 reward.

Suppose A sees which versions are compatible with the other software in the environment, but H doesn't. Then A's optimal policy is to suppress the versions of cuda that are incompatible.
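A small enumeration illustrates why suppression helps in Example 4.3. The sketch makes several simplifying assumptions that are not in the example: it uses 4 versions instead of 10, it weights every (available, compatible) world equally, it assumes at least one available version is compatible, and it fixes H's tie-break to the lowest-numbered version shown.

```python
from itertools import combinations

VERSIONS = range(4)  # assumption: 4 versions instead of 10, for a small enumeration


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = list(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def suppress_incompatible(available, compatible):
    return available - compatible


def suppress_nothing(available, compatible):
    return frozenset()


def naive_choice(displayed):
    # Assumed tie-break: H takes the list at face value and installs the
    # lowest-numbered version shown.
    return min(displayed)


def evaluate(policy):
    """Return (total reward, number of worlds) over all enumerated worlds."""
    total = worlds = 0
    for available in nonempty_subsets(VERSIONS):
        for compatible in nonempty_subsets(available):
            displayed = available - policy(available, compatible)
            total += 1 if naive_choice(displayed) in compatible else 0
            worlds += 1
    return total, worlds
```

With these assumptions, `evaluate(suppress_incompatible)` earns reward 1 in every one of the 65 worlds, whereas `evaluate(suppress_nothing)` earns it in only 40 of them, even though H's behavior is identical in both cases.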

Our high-level takeaway from this example is that in some POAGs, all optimal policy pairs require A to take observation-interfering actions. Importantly, in the optimal policy pair for the above example, H observes naively. In particular, the above doesn't require H and A to have some communication protocol and for H to interpret her observations as encoding A's beliefs. H can act as if no interference is happening. We thus summarize the high-level takeaways in the following result, with details in Appendix B.3.

Proposition 4.4. There exists a POAG M where all optimal policy pairs (πA, πH) have that πA interferes with observations at the action level and that πH observes and acts naively.

Intuitively, in Example 4.3, A interferes in order to convey information to H. A knows H's optimal choice, but cannot tell her. So, A needs to interfere in a way that leads H to the optimal choice.

The need for A to take observation-interfering actions to communicate to H disappears if A has other means of communication. For instance, if in Example 4.3, A could simply tell H what to do, then A wouldn't need to interfere. To formalize this intuition, we now prove that if A can communicate with H, then there is always an optimal policy pair that does not require interference.

Theorem 4.5. Let $M$ be any POAG, and provide A with an unbounded communication channel to H, forming $M^{A \to H}$. Then there is an optimal policy pair $(\pi^H, \pi^A)$ for $M^{A \to H}$ where $\pi^A$ does not interfere with observations at the action level and $\pi^H$ observes naively.

One could argue that in practice, an unrestricted communication channel between A and H could usually be made available. However, Theorem 4.5 ignores various real-world obstacles. For one, it considers communication that incurs no cost, but in reality, communication costs H time and effort. Second, the optimal policy pair requires A to send information in a way that H can reliably understand and act upon. We expect that in practice, A and H sometimes cannot understand each other. Therefore, despite Theorem 4.5, we think observation interference is of broad practical relevance, even where A can, e.g., send text messages to H.

4.3. Optimal Policy Pairs Never Require Observation Interference at the Policy Level

In Definition 3.2, we first define observation interference as a feature of actions. We then say in Definition 3.3 that a policy interferes with observations at the action level if and only if it ever takes an observation-interfering action.

Because the definition is ultimately about actions, it doesn't consider how $\pi^A$ might choose to take observation-interfering actions in a way that depends on A's observations. To account for $\pi^A$'s dependence on its observation, we define an alternative notion of what it means for a policy to interfere with observations.

Let $P_{o^H_t}$ be the distribution over human observations at time $t$. Further, let $L_t(\pi^H, \pi^A)$ be the set of possible states at time $t$.

Definition 4.6. Let $M$ be a POAG. We say that A's policy $\hat{\pi}^A$ interferes with observations at the policy level if there exists some other partial policy $\pi^A_t$ for time step $t$ s.t. $\hat{\pi}^A_t$ and $\pi^A_t$ have the same effect on state transitions and immediate rewards, but for all $\pi^H$ we have that $\left(P_{o^H_{t+1}}(\cdot \mid \pi^H, s_{t+1}, \hat{\pi}^A_{0:t}, \pi^H)\right)_{s_{t+1} \in L_{t+1}(\pi^H, \hat{\pi}^A_{0:t})}$ is less informative than the corresponding distribution if we replace $\hat{\pi}^A_{0:t}$ with $(\hat{\pi}^A_{0:t-1}, \pi^A_t)$.

Compared to our previous action-level notion of observation interference (Definition 3.2), this new policy-level notion (Definition 4.6) differs in how it treats H's inference process. Whereas the action-level notion models inference about isolated observations, the policy-level notion allows H to make inferences in the context of A's overall strategy.

We now revisit Example 4.3. For observation interference under our earlier Definition 3.2, H simply knows that A has taken the action to suppress some versions of cuda; H does not know anything about A's policy. For all H knows, A's policy could be to randomly suppress cuda versions or to always suppress the same cuda version. Thus, suppressing any version is strictly less informative for H than the list of all available versions. This is why Definition 3.2 calls suppressing versions interference at the action level.

The key difference with Definition 4.6 is that H knows A's policy. Suppose that A's policy $\pi^A$ is to suppress exactly the versions of cuda that are incompatible with the other software in the environment. Because H knows that A suppressed the incompatible cuda versions, seeing the filtered list tells H which versions of cuda are compatible! Although suppressing versions is strictly less informative under Definition 3.2 (when H doesn't know A's policy), suppressing versions provides H with new information under Definition 4.6 (when H knows A's policy). Accordingly, $\pi^A$ is interfering with observations at the action level but not at the policy level.
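The informational point can be made concrete with a small Bayesian calculation. This is a sketch of the intuition rather than a formal check of Definition 4.6: it assumes 3 versions, a uniform prior over worlds in which a non-empty set of available versions contains a non-empty set of compatible ones, and an H who conditions on A's known policy. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

VERSIONS = (0, 1, 2)  # assumption: 3 versions to keep the enumeration tiny


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = list(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


# Assumed prior: uniform over (available, compatible) pairs.
WORLDS = [(a, c) for a in nonempty_subsets(VERSIONS) for c in nonempty_subsets(a)]


def suppress_incompatible(available, compatible):
    return available - compatible


def suppress_nothing(available, compatible):
    return frozenset()


def posterior_compatible(policy, displayed, version):
    """P(version is compatible | displayed list), when H knows A's policy."""
    consistent = [(a, c) for a, c in WORLDS if a - policy(a, c) == displayed]
    hits = sum(1 for _, c in consistent if version in c)
    return Fraction(hits, len(consistent))


shown = frozenset({0, 1})
```

When H knows that A filters out incompatible versions, seeing `shown` makes her certain that version 0 is compatible; when A suppresses nothing, the same list leaves her at probability 2/3 under this prior.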

In our examples, we will mostly consider actions that in some sense act directly on H's observations. Yet Definition 4.6 also considers the informational effects of physical actions. For example, if A (visibly) tries to open a door that A knows to be locked, then this reveals to H that the door is locked. Consequently, not trying to open the door (when A knows it to be locked) is an instance of observation interference in the sense of Definition 4.6. While having the same (null) effect on the state of the world, trying to open the door provides H with more information.

As in Example 4.3, cases which appear to destroy information when viewed at the action level may actually provide new information when viewed at the policy level. In fact, the following theorem shows that it's never strictly necessary to interfere with observations at the policy level.

Theorem 4.7. Let M be any POAG. Then there exists an optimal policy pair (πH, πA) for M s.t. πA does not interfere with observations at the policy level.

This contrasts with Proposition 4.4: whereas it is sometimes necessary to interfere with observations at the action level, it is never necessary at the policy level.

The main idea behind this proof is similar to the proof of Theorem 4.2 (given in Appendix B.3). That is, if we start with an optimal policy in which A observation-interferes, then we can replace A's policy with the corresponding more informative policy and update H's policy to imitate the garbling. The proof of Theorem 4.2 considers the set of actions, which is finite. The main extra difficulty in proving Theorem 4.7 is that we must deal with spaces of policies, which may be infinitely large. Thus, if we replace a policy with a more informative one, there might be a new policy which is even more informative, and so on forever.

Note that there are many possible ways to extend or refine Definitions 3.2 and 4.6 in ways that preserve our key results. We choose Definitions 3.2 and 4.6 in part for their simplicity; for more discussion of this point, see Appendix G.


5. Querying H's Preferences is an Incentive for Observation Interference

We now study a second reason A can have for interfering with observations. We have already shown (Theorems 4.2, 4.5 and 4.7) that even if H has private information and no communication channel, there is always an optimal policy pair in which A does not interfere, as long as A doesn't have private information. So, if H plays a best response to A's policy, then A can choose a non-interference policy without loss of utility. However, if H does not play a best response to A, then reasons for interference emerge that are more subtle than those in the A → H case.

Intuitively, A might need to interfere with observations to elicit H → A communication. Suppose A needs some information from H, but H is acting naively (see Definition 3.7) in a way that does not reveal her private information. By changing H's observation, A can make H's naive response communicate useful information to A. The following example illustrates this phenomenon.

Example 5.1. H would like to schedule a job on a cluster. She can choose between two nodes. By default, she receives a signal from the environment about the two nodes' specifications. Each node may be either GPU-optimized or CPU-optimized. Also, the CPUs may be either AMD or Intel.

H has a strong preference between GPU-optimized and CPU-optimized nodes. She has a weak preference between AMD and Intel. These preferences are unknown to A.

A can interfere with H's observation about the available nodes. In particular, A can make it so that a choice between two CPU-optimized nodes appears as a choice between a GPU-optimized and a CPU-optimized node. A observes H's choice. Later, A is charged with scheduling a job for H and has to choose between a CPU- and a GPU-optimized node on H's behalf.

If H chooses naively upon seeing only CPU-optimized nodes (simply choosing her favorite), then A's best response interferes with observations at both the action and policy levels. Interfering with observations allows A to learn H's preference about GPU- vs. CPU-optimized nodes.

In Example 5.1, one might ask why A can't just ask H each time A makes a decision. Simply asking H's preference is reasonable when A has only one decision to make. However, we are motivated by cases where A has many decisions to make, and asking H's preferences each time would be cumbersome.

At first sight, Example 5.1 may appear to be a counterexample to Theorem 4.2, which states that a non-interfering optimal policy pair exists. However, note that Example 5.1 actually does have optimal policy pairs in which A doesn't interfere. In particular, even if A does not interfere and the two available nodes are CPU-optimized, H may simply communicate her CPU-versus-GPU preference anyway! That is, when facing a choice between CPU-optimized nodes 1 and 2, she may choose, say, 1 if she favors GPU-optimized nodes and 2 if she favors CPU-optimized nodes. However, this type of human strategy seems implausible, as it would require H and A to have settled on some communication strategy that overrides H's immediate preferences about the machines that H can in fact choose between.

The key point of Example 5.1 is that while there is some optimal policy pair without observation interference, there is no plausible optimal policy pair that avoids observation interference. More specifically, we use the notion of acting naively (Definition 3.7) to express this notion of plausibility and rule out the above policy. We thus obtain the following proposition (with proof in Appendix C): in some POAGs, if we want to play an optimal policy pair and we want H to be able to act naively, then A has to interfere with observations.

Proposition 5.2. There is a POAG M with the following properties. For every optimal policy pair (πH, πA), at least one of the following holds: (i) πH is not acting naively, or (ii) πA interferes with observations at both the action and policy levels. Additionally, there exists an optimal policy pair (πH, πA) where πH acts naively and πA interferes with observations at both the action and policy levels.

These properties continue to hold if we require that in M, A has no private information or can arbitrarily send messages to H (i.e., there is a POAG $M'$ s.t. $M = M'^{A \to H}$).

Intuitively, the problem in the above example is that the human has private information that she needs to communicate with her choices. (Because her choices yield different immediate rewards, naive choices fail to communicate.) As before, the need for interference or non-naive choice disappears if the human has no private information to provide. The following shows that the need for interference or non-naivete also disappears if H can communicate with A. To also rule out the need to interfere with observations for A → H communication (discussed in Section 4.2), we assume communication channels in both directions.

Theorem 5.3. Let M be a POAG. There exists an optimal policy pair (πH, πA) for $M^{H \leftrightarrow A}$ where πH is naive and assumes honesty while πA does not interfere at either the action or policy levels.

6. Human Irrationality is an Incentive for Observation Interference

Finally, we consider a third reason for observation interference: human irrationality or bounded rationality. Roughly, reducing the amount of information supplied to the human may simplify the human's decision problem and thus improve her decision making. Importantly, this motivation for observation interference may exist even if neither H nor A has any private information.

As our model of human decision making, we adopt Boltzmann rationality (Luce, 1959; McFadden, 1973), which has recently been used in (C)IRL (Laidlaw & Dragan, 2021; Ramachandran & Amir, 2007; Ziebart et al., 2008).

Definition 6.1. Let M be a POAG. Let πA be A's policy in M. We say that H's policy πH is a Boltzmann-rational response to πA if there exists some β > 0 s.t. for every human observation history h that arises with positive probability in M under (πA, πH) we have that

$$\pi^H(a \mid h) \propto \exp\left(\beta\, \mathbb{E}\left[\sum_{t' \geq t} \gamma^{t'} R(S_{t'}, A^A_{t'}, A^H_{t'}) \,\middle|\, \pi^H, \pi^A, h\right]\right).$$

It turns out that even if the Boltzmann-rational human has calibrated beliefs, A's optimal policy sometimes interferes with observations, even if neither A nor H has private information. Intuitively, providing more information may sometimes result in less clear-cut decisions, i.e., decision situations with a smaller difference between the correct and incorrect option. To illustrate this phenomenon, consider the following example.

Example 6.2. H is running a terminal command and is unsure whether to run the command with flag 1 or flag 2. With equal probability, either flag 1 or flag 2 is better, and how good the flags are differs by either a little or a lot. The worse flag always yields a utility of 0, while the better flag yields a utility of either 1 or 7. Thus, H is uniformly at random in one of four states. A has two actions: man and tldr. The man page is a long document that tells the human exactly what the values of the flags are (i.e., the exact state: which flag is better and whether its utility is 1 or 7). The tldr page is a short summary that tells the human which flag is better, but not by how much (i.e., ruling out half the states, leaving half remaining). Thus, given only the tldr page, the expected utility of the better flag is 4 (and that of the worse flag is 0).

Intuitively, both the tldr and man pages allow the human to choose optimally, but the man page is more complicated and therefore more likely to be misinterpreted. Choosing specific utilities, the effect of interference under Boltzmann rationality is as follows. If A interferes (i.e., provides the tldr page), then H always chooses between a utility of 4 and 0. If A does not interfere, then half the time, H chooses between utilities 1 and 0, and half the time H chooses between utilities 7 and 0. It turns out that for β = 1, H achieves higher utility in expectation under the condition where A interferes. Building on this idea, we can prove the following (with details in Appendix D).
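The comparison for β = 1 can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a one-shot choice between two flags, so that a Boltzmann-rational H (Definition 6.1) picks the better flag with probability exp(βu) / (exp(βu) + 1) when the better flag has expected utility u and the worse flag has utility 0. The function names are hypothetical.

```python
import math


def boltzmann_value(u_better: float, beta: float) -> float:
    """Expected utility when H chooses Boltzmann-rationally between a flag
    with expected utility u_better and a flag with utility 0."""
    # Probability of picking the better flag: exp(b*u) / (exp(b*u) + exp(0)).
    p_better = 1.0 / (1.0 + math.exp(-beta * u_better))
    return p_better * u_better


def tldr_value(beta: float) -> float:
    """A interferes: H only learns which flag is better (expected utility 4)."""
    return boltzmann_value(4.0, beta)


def man_value(beta: float) -> float:
    """A does not interfere: H learns the exact state (utility 1 or 7)."""
    return 0.5 * boltzmann_value(1.0, beta) + 0.5 * boltzmann_value(7.0, beta)


if __name__ == "__main__":
    beta = 1.0
    print(f"tldr (interfere):    {tldr_value(beta):.4f}")
    print(f"man (no interfere):  {man_value(beta):.4f}")
```

Under these assumptions, the tldr page yields roughly 3.93 in expectation at β = 1 and the man page roughly 3.86, which matches the claim that interference helps in this example. The sketch says nothing about other values of β; Proposition 6.3 constructs a suitable POAG for each β.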

Proposition 6.3. For every β > 0, there exists a POAG in which neither H nor A has private information s.t. all β-Boltzmann-rational/optimal policy pairs (πH, πA) have πA interfere with observations at both the action and policy levels.

7. Experiments

Motivated by the theory in Section 4 and Section 6, we develop a model game and run experiments to analyze:

1. How does the amount of H's irrationality affect A's incentive to take observation-interfering actions?

2. How does the amount of A's private information affect A's incentive to take observation-interfering actions?

7.1. Experiment Details

We study a game where selecting the best action requires combining private observations known only to H and private observations known only to A. The game presents A with a tradeoff: A can interfere with observations to communicate information that only A observes, but interfering also destroys information that only H observes.

Concretely, the game has d products. Each product i has two attributes, Hi and Ri, drawn i.i.d. from Unif(0, 1). Each product's utility is the sum of its attributes, Ui = Hi + Ri. The game consists of two moves. First, A sees Ri for i = 1, . . . , k where k is the number of A's private observations. A chooses a set of products to interfere with. For the products A interfered with, H sees Ĥi = ; for the remaining products, H sees Ĥi = Hi. Second, H chooses a product ai. Both H and A receive a common payoff of the chosen product's utility, Ui.

We assume the human's product selection policy is Boltzmann rational over their observed values Ĥi:

Definition 7.1. H's Boltzmann selection policy chooses products by a Boltzmann distribution over Ĥi, the observed product values: πH(ai) ∝ exp(β Ĥi). The parameter β controls H's rationality.

We consider A policies that always interfere with k observations for some fixed k. Call these policies k-interference. We study the optimal such policies, characterized by the following result:

Proposition 7.2. Consider A policies that always interfere with k observations for some fixed k. Among the k-interference policies for a given k, A's best response to H's straightforward product selection policy is as follows. A interferes with the k smallest R̂i values, where R̂i = Ri if A observes Ri, and R̂i = 0.5 otherwise.

We consider a game with d = 5 products. We vary A's number of interferences k ∈ {0, 1, 2, 3, 4}. We run a Monte Carlo simulation with 30,000 trials to calculate the expected payoff in each setting.
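The setup above can be sketched as a small simulation. This is an illustrative reconstruction, not the authors' code, and it rests on several assumptions: the value H sees for an interfered product is not legible in the text, so the sketch exposes it as a hypothetical `masked_value` parameter that defaults to negative infinity (H never picks an interfered product); A is assumed to observe Ri for the first `m` products; and the payoff is averaged exactly over H's Boltzmann choice distribution in each trial rather than by sampling a single choice. All function and parameter names are hypothetical.

```python
import numpy as np


def expected_payoff(d, m, k, beta, trials=30_000, seed=0, masked_value=-np.inf):
    """Monte Carlo estimate of the common payoff under a k-interference policy.

    d: number of products; m: number of A's private observations;
    k: number of products A interferes with (k < d); beta: H's rationality (> 0).
    masked_value is an ASSUMPTION about what H sees for interfered products.
    """
    assert 0 <= k < d and 0 <= m <= d and beta > 0
    rng = np.random.default_rng(seed)
    H = rng.uniform(0.0, 1.0, size=(trials, d))
    R = rng.uniform(0.0, 1.0, size=(trials, d))

    # A's estimate of R: the true value where observed, the prior mean 0.5 otherwise.
    R_hat = np.full((trials, d), 0.5)
    R_hat[:, :m] = R[:, :m]

    # Proposition 7.2: interfere with the k smallest estimates.
    order = np.argsort(R_hat, axis=1, kind="stable")
    H_hat = H.copy()
    rows = np.arange(trials)[:, None]
    H_hat[rows, order[:, :k]] = masked_value

    # Definition 7.1: Boltzmann distribution over the observed values.
    logits = beta * H_hat
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Average the payoff U = H + R over H's choice distribution, then over trials.
    return float(np.mean(np.sum(probs * (H + R), axis=1)))


if __name__ == "__main__":
    for k in range(5):
        print(k, round(expected_payoff(d=5, m=2, k=k, beta=1.0), 3))
```

Because every call with the same seed reuses the same draws of Hi and Ri, comparisons across k are paired, which reduces noise when ranking interference levels. A different choice of `masked_value` could change the rankings, so the numbers this sketch produces should not be read as a reproduction of Figure 1.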

[Figure 1 graphic not recoverable from the extracted text. Panel (a): expected reward versus H's rationality coefficient β (logarithmic axis from 10^-2 to 10^2), grouped by number of interferences, with A's number of private observations fixed at 2. Panel (b): expected reward versus number of interferences (0 to 4), grouped by number of private observations (0 to 5), at a fixed rationality coefficient β whose value is missing from the extracted text.]

Figure 1: Incentives to interfere with observations in the product selection game. (Left) When H is highly irrational, it is best for A to interfere, effectively making the choice for H. As H becomes more rational, there is an increasing cost to interference, and there is a tradeoff: A should interfere to communicate some information, but not destroy too much information by excessive interference. (Right) In line with Theorem 4.2, A has no incentive to interfere when A has no private observations. With more private observations, A has more incentive to interfere.

7.2. Varying H's Rationality

How does H's rationality impact A's incentive for observation interference? We fix A to have 2 private observations. We do a logarithmic sweep over H's rationality coefficient β ∈ {0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100}. Figure 1a shows how the expected reward changes w.r.t. β.

When H is highly irrational at β = 0.01, A should interfere with as many sensors as possible, effectively choosing H's action. When H is acting little better than randomly, it is best for A to choose H's action, even when A has less information than H. For larger values of β, a tradeoff emerges. As A has two private observations, there is an increasing benefit to interfere to communicate information to H. However, as H can now make use of their own private observations, A must be careful not to destroy too much of H's private information by excessive interference.

7.3. Varying A's Private Information

How does the amount of private information available to A influence A's incentive for observation interference? In Theorem 4.2, we showed conditions under which private observations for A are a necessary condition for observation interference to occur. Now, we analyze the degree to which private observations incentivize observation interference. Based on Theorem 4.2, we hypothesize that there are circumstances where more private information leads to more observation interference.

We vary A's number of private observations in {0, 1, 2, 3, 4, 5}. We consider A's k-interference policies and analyze how the relative performance of different levels of observation interference k changes with the number of private observations available to A.

Figure 1b shows how the expected reward changes depending on k, the number of interferences. When A has no private observations, reward decreases with each additional interference. However, as the number of A's private observations increases, the relative ordering of the observation interference policies changes; with more private observations, A has an incentive to interfere with more observations. This confirms our hypothesis based on Theorem 4.2. Nevertheless, there is a limit to A's observation interference incentive. Because interfering with observations destroys H's information, A must be careful not to interfere too much.

8. Conclusion

Limitations and Future Work Optimal policy pairs sometimes require H and A to have a shared communication protocol (e.g., Example 5.1). It would be interesting to study additional solution concepts, such as correlated equilibria and communication equilibria, to handle this sort of communication (Forges, 1986). While we consider only a single human and single assistant, it would also be interesting to study scenarios with multiple humans and multiple assistants. As we focus on the Boltzmann rationality model of human decision making (Definition 6.1), future work could consider other human models and empirical validation with human subjects. Lastly, while we run experiments in one model of a POAG, it would be interesting to see if and how our experimental trends generalize to other POAGs, including POAGs where A must query H's preferences.


Impact Statement

AI assistants are being developed and deployed in settings where humans can only partially observe what is happening. For example, AI assistants including ChatGPT, Claude, and Gemini can search the web while only returning summaries to users (OpenAI, 2024; Anthropic, 2025; Google, 2024). Moreover, the sorts of AI models powering these assistants are processing increasingly long inputs. Whereas the original ChatGPT model could only process 4096 input tokens, today's Gemini 1.5 Pro can process 2,000,000 input tokens, which is roughly 100,000 lines of code or 16 novels of average length in English (Google, 2025). In the future, we anticipate that AI assistants will be deployed at increasing scale, independently taking more actions on behalf of users and processing increasingly long context lengths. We thus expect that over time, humans will have less and less ability to directly observe everything that is happening.

Even when the AI assistant and the human have perfect value alignment, we show how observation interference can emerge from several distinct incentives. As we focus on optimal assistants (analyzing optimal policy pairs and best responses), all of the incentives for observation interference that we consider are done for the human's benefit. This creates a nuanced picture, suggesting that not all observation interference is inherently bad. As AI assistants might exhibit observation interference for a mix of good and bad reasons, it would be interesting for future work to explore how to handle this nuanced situation. For example, future AI systems could be designed with transparency about when interference occurs and user controls to override interference when desired.

With this theory, our goal is to understand the causes of observation interference and help disentangle them in practice. We intend for our work to help AI developers build assistants that their users can trust. Our work is primarily theoretical, and we are not aware of any ways it could be used to cause harm.

Computational Complexity Given that finding optimal policies in POAGs is NEXP-hard, how might our results apply in a given environment? Most of our paper is descriptive, characterizing when observation tampering could happen. Complexity considerations could affect these results in either direction. It is easy to construct environments where finding good observation-interfering policies is computationally intractable but constructing good non-interfering policies is easy, and vice versa. In practice, complexities of the environment can be orthogonal to incentives to interfere. For instance, a real-world version of the CUDA example is complex (A assesses complicated software compatibility issues), but the decision whether to interfere with observations is easy. We believe our characterizations remain useful even in complex environments (where we can't expect optimal policies), although we can't make as definitive claims as we can about optimal policies.

We have discussed allowing communication between H and A. A complexity-theoretic argument favors this solution: if H and A share all private information, the game effectively turns from a Dec-POMDP into a POMDP. Solving POMDPs is PSPACE-complete and thus likely easier than solving Dec-POMDPs.

Author Contributions

The project was conceived by S.E., who developed the initial theorems and examples. C.O. led the theoretical development, proving the main results and formalizing the examples. S.E. conducted the experimental analysis. The manuscript was written collaboratively, with S.E. leading the introduction, experimental, and conclusion sections, while the theoretical sections were co-written by S.E. and C.O. V.C. and S.R. advised throughout the project.

Acknowledgments

The authors thank Mark Bedaywi, Emery Cooper, Anca Dragan, Andrew Garber, Linus Luu, and Rohan Subramani for helpful discussions. C.O.'s work is supported by an FLI AI Existential Risk Fellowship. V.C. acknowledges financial support from the Cooperative AI Foundation, Polaris Ventures (formerly the Center for Emerging Risk Research), and Jaan Tallinn's donor-advised fund at Founders Pledge. S.E. and S.R. are grateful for Open Philanthropy's gift to the Center for Human-Compatible AI.

References

Abbeel, P. and Ng, A. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

Anthropic. Claude can now search the web, 2025. URL https://www.anthropic.com/news/web-search.

Ash, R. Information Theory, volume 19 of Interscience Tracts in Pure and Applied Mathematics. John Wiley & Sons, 1965.

Balbach, F. and Zeugmann, T. Recent developments in algorithmic teaching. In Language and Automata Theory and Applications. Springer, 2009.

Bassan, B., Gossner, O., Scarsini, M., and Zamir, S. Positive value of information in games. International Journal of Game Theory, 32(1):17–31, 2003. ISSN 1432-1270. doi: 10.1007/s001820300142.

Bernstein, D. S., Givan, R., Immerman, N., and Zilberstein, S. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002. doi: 10.1287/moor.27.4.819.297. URL https://dl.acm.org/doi/10.1287/moor.27.4.819.297.

Blackwell, D. Equivalent comparisons of experiments. The Annals of Mathematical Statistics, pp. 265–272, 1953.

Blackwell, D. et al. Comparison of experiments. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pp. 93–102, 1951.

Cakmak, M. and Lopes, M. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.

Cover, T. and Thomas, J. Elements of Information Theory. John Wiley & Sons, 1991.

de Oliveira, H. Blackwell's informativeness theorem using diagrams. Games and Economic Behavior, 109:126–131, 2018. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2017.12.008. URL https://www.sciencedirect.com/science/article/pii/S0899825617302270.

Desai, N. Uncertain reward-transition MDPs for negotiable reinforcement learning. Technical report, UC Berkeley, Berkeley, California, USA, 2017.

Fern, A., Natarajan, S., Judah, K., and Tadepalli, P. A decision-theoretic model of assistance. JAIR, 50(1):71–104, 2014.

Forges, F. An approach to communication equilibria. Econometrica: Journal of the Econometric Society, pp. 1375–1385, 1986.

Goldman, S. and Kearns, M. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

Google. Try Deep Research and our new experimental model in Gemini, your AI assistant, 2024. URL https://blog.google/products/gemini/google-gemini-deep-research/.

Google. Long context, 2025. URL https://ai.google.dev/gemini-api/docs/long-context.

Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29, 2016.

Hadfield-Menell, D., Dragan, A., Abbeel, P., and Russell, S. The off-switch game. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 220–227. AAAI Press, 2017. ISBN 9780999241103.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.

Hu, H., Lerer, A., Peysakhovich, A., and Foerster, J. "Other-play" for zero-shot coordination. In International Conference on Machine Learning, pp. 4399–4410. PMLR, 2020.

Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.

Kamien, M. I., Tauman, Y., and Zamir, S. On the value of information in a strategic conflict. Games and Economic Behavior, 2(2):129–153, 1990. ISSN 0899-8256. doi: https://doi.org/10.1016/0899-8256(90)90026-Q. URL https://www.sciencedirect.com/science/article/pii/089982569090026Q.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Kuhn, H. W. Extensive games and the problem of information. In Kuhn, H. W. and Tucker, A. W. (eds.), Contributions to the Theory of Games (AM-28), Volume II, chapter 11, pp. 193–216. Princeton University Press, Princeton, 1953. ISBN 9781400881970. doi: 10.1515/9781400881970-012. URL https://doi.org/10.1515/9781400881970-012.

Laidlaw, C. and Dragan, A. The Boltzmann policy distribution: Accounting for systematic suboptimality in human models. In International Conference on Learning Representations, 2021.

Lang, L., Foote, D., Russell, S., Dragan, A., Jenner, E., and Emmons, S. When your AI deceives you: Challenges with partial observability of human evaluators in reward learning. arXiv preprint arXiv:2402.17747, 2024.

Lehrer, E., Rosenberg, D., and Shmaya, E. Signaling and mediation in games with common interests. Games and Economic Behavior, 68(2):670–682, 2010. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2009.08.007. URL https://www.sciencedirect.com/science/article/pii/S0899825609001705.

Luce, R. D. Individual Choice Behavior: A Theoretical Analysis. Dover, Mineola, NY, 1959.

McFadden, D. Conditional logit analysis of qualitative choice behavior. In Zarembka, P. (ed.), Frontiers of Econometrics. Academic Press, New York, 1973.

Ng, A. and Russell, S. Algorithms for inverse reinforcement learning. In ICML, 2000.

OpenAI. Introducing ChatGPT search, 2024. URL https://openai.com/index/introducing-chatgpt-search/.

pandas development team, T. pandas-dev/pandas: Pandas, February 2020. URL https://doi.org/10.5281/zenodo.3509134.

Park, P. S., Goldstein, S., O'Gara, A., Chen, M., and Hendrycks, D. AI deception: A survey of examples, risks, and potential solutions. Patterns, 5(5), 2024.

Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In IJCAI, 2007.

Reif, J. H. The complexity of two-player games of incomplete information. Journal of Computer and System Sciences, 29(2):274–301, 1984.

Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson, 3rd edition, 2010.

Shah, R., Freire, P., Alex, N., Freedman, R., Krasheninnikov, D., Chan, L., Dennis, M. D., Abbeel, P., Dragan, A., and Russell, S. Benefits of assistance over reward learning, 2020.

Treutlein, J., Dennis, M., Oesterheld, C., and Foerster, J. A new formalism, method and open issues for zero-shot coordination. In International Conference on Machine Learning, pp. 10413–10423. PMLR, 2021.

Ward, F. R., Toni, F., Belardinelli, F., and Everitt, T. Honesty is the best policy: Defining and mitigating AI deception. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=EmxpDiPgRu.

Waskom, M. L. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021.

Wen, J., Zhong, R., Khan, A., Perez, E., Steinhardt, J., Huang, M., Bowman, S. R., He, H., and Feng, S. Language models learn to mislead humans via RLHF. arXiv preprint arXiv:2409.12822, 2024.

Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman (eds.), Proceedings of the 9th Python in Science Conference, pp. 56–61, 2010. doi: 10.25080/Majora-92bf1922-00a.

Williams, M., Carroll, M., Narang, A., Weisser, C., Murphy, B., and Dragan, A. Targeted manipulation and deception emerge when optimizing LLMs for user feedback. arXiv preprint arXiv:2411.02306, 2024.

Woodward, M., Finn, C., and Hausman, K. Learning to interactively learn and assist. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2535–2543, 2020.

Ziebart, B., Maas, A., Bagnell, J., and Dey, A. Maximum entropy inverse reinforcement learning. In AAAI, 2008.


A. Proofs for Section 2.2

Our techniques are similar to those of Shah et al. (2020) and Desai (2017), who show how to form a single-agent POMDP for A by embedding H into the environment dynamics. However, our construction works in the opposite direction, with H embedding A's actions and observations into the environment.

Proposition 2.2. Suppose A is playing a fixed policy. If H knows A's policy along with the POAG specification M, then H can form calibrated beliefs about the world state. For any timestep t and state $s_t$, H can form $P(s_t \mid o^H_{1:t})$, the probability of $s_t$ given H's observation history $o^H_{1:t}$.

Proof. We construct a single-agent POMDP $\langle \hat{S}, A^H, \hat{T}, \hat{R}, \Omega^H, \hat{O}^H, P_0, \gamma \rangle$ for H. Standard POMDP inference lets H form $P(\hat{s}_t \mid o^H_{1:t})$, which includes $P(s_t \mid o^H_{1:t})$.

Consider a new set of states $\hat{s}_t \in \hat{S}_t = S^{t+1} \times (\Omega^A)^t \times A^A$, where each new state $\hat{s}_t$ corresponds to a full sequence of original states $s_{0:t}$, a full sequence of assistant observations $o^A_{1:t}$, and the previous assistant action $a^A_{t-1}$. The new $\hat{T}$ satisfies $\hat{T}(\hat{s}_{t+1} \mid \hat{s}_t, a^H_t) = \pi^A(a^A_t \mid o^A_{1:t})\, T(s_{t+1} \mid s_t, a^H_t, a^A_t)\, O^A(o^A_{t+1} \mid s_{t+1}, a^H_t, a^A_t)$. The new $\hat{O}^H$ satisfies $\hat{O}^H(o^H_{t+1} \mid \hat{s}_{t+1}, a^H_t) = O^H(o^H_{t+1} \mid s_{t+1}, a^H_t, a^A_t)$. The new reward function $\hat{R}$ can be arbitrary, as it doesn't affect inference.
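For concreteness, the "standard POMDP inference" invoked in this proof can be written as the usual Bayes filter over the augmented states. The recursion below is a sketch in the notation of the construction above, not a formula taken from the paper; as in Proposition 2.2, H's own past actions are suppressed in the conditioning, and the initial belief is assumed to come from $P_0$.

```latex
% Belief update over augmented states (standard Bayes filter):
P(\hat{s}_{t+1} \mid o^H_{1:t+1})
  \;\propto\;
  \hat{O}^H(o^H_{t+1} \mid \hat{s}_{t+1}, a^H_t)
  \sum_{\hat{s}_t} \hat{T}(\hat{s}_{t+1} \mid \hat{s}_t, a^H_t)\,
  P(\hat{s}_t \mid o^H_{1:t})

% Marginalizing to the original world state: sum over all augmented
% states whose most recent state component equals s_t.
P(s_t \mid o^H_{1:t})
  \;=\;
  \sum_{\hat{s}_t \,:\, \text{last state of } \hat{s}_t \text{ is } s_t}
  P(\hat{s}_t \mid o^H_{1:t})
```

The second line is what "includes" means in the proof: the belief over the original state is a marginal of the belief over augmented states.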

Proposition 2.3. Suppose A is updating its policy each iteration of the game. Knowledge of the game dynamics, of A's initial policy, and of A's update rule is sufficient for H to form calibrated beliefs about A's future policy and about the world state.

Proof. Within each iteration of the game, H does the same as for Proposition 2.2. Between iterations, H applies A's update rule to get A's policy for the next iteration.

Remark 2.4. Propositions 2.2 and 2.3 hold even if A is interfering with observations (Definition 3.2).

Proof. The possibility of observation interference (Definition 3.2) is merely treated like any other part of the other agent's policy and the game dynamics. By definition, interference actions are just another action, and our proofs of Proposition 2.2 and Proposition 2.3 made no assumptions on the actions.

B. Proofs and Example Formalizations for Section 4

B.1. A Lemma about Policies with Internal States

In our proof of Theorem 4.2 (and our proof of Theorem 4.7), we will construct policies that maintain an internal state (the previously sampled garbled observations). We will call this a virtual state. However, our setup (in line with the norm in the literature) does not allow for such policies. We here show that any policy with a virtual state can be simulated by a policy without virtual states. Since this result is about a single player's policy, holding the opponent policy fixed, we will prove this in POMDPs.

First, a virtual-state policy is a family of distributions π(a, v′ | v, h), where:

h is a history of observations and actions as usual;

v is an agent state from some discrete set (e.g., N or Ω^A);

v′ is another (new) virtual state;

a is an action.

Additionally we specify an initial virtual state v0. Virtual-state policies give rise to histories in the obvious way: the initial agent state is v0; the agent then samples an action a0 and a following virtual state v1 from π(· | v0). In the next step it samples an action and agent state from π(· | o0a1, v1) and so on.

We now show that policies with a virtual state can be transformed into behaviorally equivalent policies without an agent state.


Lemma B.1. Let $\pi$ be a virtual-state policy. Then there exists a regular policy $\tilde{\pi}$ s.t. the resulting distribution over (environment state, observation, action) histories is the same under $\pi$ and $\tilde{\pi}$. In particular, the expected rewards of the two are the same.

The result is related to Kuhn's (Kuhn, 1953) proof of the equivalence of behavioral and mixed strategies in perfect-recall extensive-form games.

Proof. For this proof we use $h_{o,a}$ to denote observation-action histories and $h_{s,o,a}$ to denote state-observation-action histories. Consider $\tilde{\pi}$ that at time step t is defined by

$$\tilde{\pi}(A \mid h_{o,a}) = \sum_{v_0, \ldots, v_t} P(v_0, \ldots, v_t \mid \pi, h_{o,a})\, \pi(A \mid v_t, h_{o,a}).$$

Intuitively, at time step t we infer a probability distribution over histories of virtual states, and in particular over $v_t$, conditioning on the observation-action history $h_{o,a}$, and then sample from the action distribution induced by $\pi(A \mid v_t, h_{o,a})$.

We prove that for each time step t, the distribution over state-observation-action histories up until time step t is the same under $\pi$ and $\tilde{\pi}$. We prove this by induction on t. The base case is trivial. Assume that the distribution over state-observation-action histories up until time step t is the same. We will show that for each state-observation-action history, the distribution over actions $a_{t+1}$ at time t + 1 is the same under $\pi$ and $\tilde{\pi}$. Note that the action distribution under $\pi$ is given by

$$\sum_{v_0, \ldots, v_t} P(v_0, \ldots, v_t \mid \pi, h_{s,o,a})\, \pi(A \mid v_t, h_{o,a}).$$

Now note that $P(v_0, \ldots, v_t \mid \pi, h_{s,o,a}) = P(v_0, \ldots, v_t \mid \pi, h_{o,a})$, i.e., given the history of observations and actions, the environment states don't provide further evidence about the agent states, since every dependence between environmental states and agent states is mediated by observations and actions. Thus, this distribution is the same as the distribution $\tilde{\pi}(A \mid h_{o,a})$.

B.2. Proof of Theorem 4.2

Theorem 4.2. Let M be any POAG. Let A have no private information. Then there is an optimal policy pair (πH, πA) for M in which πA does not interfere with observations at the action level (and πH observes naively).

Proof sketch. Note first that because our setting is common-payoff and involves no absentmindedness/imperfect recall, there is always an optimal policy pair in which neither A nor H randomizes in any observation history. Let $(\pi^H, \pi^A)$ be any optimal policy pair for $M$. Let $a^A_{\text{interfere}}$ be an interference action played by $\pi^A$. Let $a^A$ be the corresponding non-interference action. Now consider the policy $\tilde\pi^A$ that plays like $\pi^A$ except that it plays $a^A$ instead of $a^A_{\text{interfere}}$.

We will now construct a corresponding human policy $\tilde\pi^H$ that results in playing the same actions at each point as $\pi^H$. Note that by the assumption that A has no private observations and the fact that $\pi^A$ and $\tilde\pi^A$ are deterministic, H always knows A's full observation history. Thus, H knows in particular for which time steps in her observation history $\pi^A$ would have played $a^A_{\text{interfere}}$ and $\tilde\pi^A$ played $a^A$ instead.

Now let $F$ be the observation translation function as per Definition 3.1. Intuitively, we want $\tilde\pi^H$ to apply $F$ to any new observation that results from playing $a^A$ rather than $a^A_{\text{interfere}}$, and then remember that modified observation in place of the actual observation. It would then be easy to show that $\tilde\pi^H$ would result in the same actions as $\pi^H$. Together with the fact that $a^A_{\text{interfere}}$ and $a^A$ have the same effect on state transitions and rewards, we would immediately obtain that $(\tilde\pi^H, \tilde\pi^A)$ has the same utility as $(\pi^H, \pi^A)$.

Unfortunately, if $F$ is stochastic, the above construction requires that H can remember the results of past applications of $F$. That is, if at time step $t$ she observes according to $a^A$ and translates according to $F$ to obtain some new observation $\tilde o^H_t$ (that she would have obtained under interference), then at any time step $t' > t$, she needs to remember that she sampled $\tilde o^H_t$ from $F$. Our formalism doesn't allow for such memory. However, by Lemma B.1 we can construct a policy without internal memory to imitate the policy we constructed.


B.3. Formalization of Example 4.3 and Proof of Proposition 4.4

Example 4.3. H has typed apt list -a cuda to see the list of cuda versions available to be installed. Out of 10 total versions, only a (non-empty) subset are available. And of these available versions, only a subset are compatible with the other environment software.

First, A takes an action. For each of the 10 total cuda versions, A can choose whether or not to suppress it from the list of available packages. This gives A $2^{10}$ total actions, where 1 action is non-interfering (suppressing nothing), and the remaining $2^{10} - 1$ actions interfere with observations.

Second, H takes an action. H has 10 possible actions which try to install the corresponding version of cuda if it appears in the version list. If an available cuda version that is compatible with the other environment software is installed, it yields +1 reward. Otherwise, it yields 0 reward.

Suppose A sees which versions are compatible with the other software in the environment, but H doesn't. Then A's optimal policy is to suppress the versions of cuda that are incompatible.

Formalization:

$S = (\{0,1\} \times \{0,1\}^{10} \times \{0,1\}^{10}) \cup \{E\} \cup \{I\}$. $E$ is a terminal state, which we use to make the POAG effectively episodic. $I$ is an initial state. The first bit, which we denote by $s_0$, encodes the time step. The next ten bits encode which versions are available. The last ten bits encode which versions are compatible. For any state $s$, we use $s_0$ to refer to the first entry of the state.

$\Omega^H = \{0,1\}^{10} \cup \{\text{null}\}$, representing the availability bits.

$\Omega^A = \{0,1\}^{10} \cup \{\text{null}\}$, representing which packages are compatible.

$\Theta = \{\theta\}$ is a singleton.

$A^H = \{1, \dots, 10\}$, representing which package to choose.

$A^A = \{0,1\}^{10}$, representing for which packages availability is suppressed, where 0 indicates suppression.

A's observations are given as follows. If $s \notin \{E, I\}$ and $s_0 = 0$ (i.e., it is the first time step), then $O^A(o^A \mid s, a^A, a^H) = \mathbb{1}[o^A = s_{11:20}]$. That is, A observes perfectly which cuda versions are compatible. Otherwise, $O^A(o^A \mid s, a^A, a^H) = \mathbb{1}[o^A = \text{null}]$. That is, in all other time steps, A does not observe anything.

H's observations are given as follows. If $s \in \{E, I\}$ or $s_0 = 0$, then H simply observes null. If $s \notin \{E, I\}$ and $s_0 = 1$, then $O^H(o^H \mid s, a^A, a^H) = \mathbb{1}[o^H_i = s_{i+1} a^A_i]$. That is, for each availability bit, H observes 0 if A set the availability bit to 0; otherwise, H simply observes the availability bit.

$R(s, a^H, a^A) = 0$ if $s \in \{E, I\}$ or $s_0 = 0$. Otherwise, $R(s, a^H, a^A) = s_{a^H} s_{a^H + 10}$. That is, a reward of 1 is obtained if and only if the cuda version chosen by H is both available and compatible.

$P_0(s) = \mathbb{1}[s = I]$. That is, the initial state is always $I$.

If $s = I$, then $T(\cdot \mid s, a^H, a^A)$ is the uniform distribution over states $s'$ in which at least one cuda version is available and compatible, i.e., $\sum_{i=1}^{10} s'_i s'_{i+10} \ge 1$. If $s \neq I$, then $T(s' \mid s, a^H, a^A) = 1$ if

$s_0 = 0$, $s'_0 = 1$ and $s_{1:20} = s'_{1:20}$; or $s_0 = 1$ and $s' = E$; or $s = s' = E$.

Otherwise, $T(s' \mid s, a^H, a^A) = 0$.
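The formalization above can be simulated in a few lines. The following is a minimal sketch, not the authors' code: it assumes a naive H who picks uniformly among the versions shown as available (the text only requires that H picks some shown version), and all function names are hypothetical. It compares the suppress-incompatible policy with the only non-interfering action $(1, \dots, 1)$.

```python
import random

N = 10  # number of cuda versions in Example 4.3


def sample_state(rng):
    """Sample availability/compatibility bits uniformly, conditioned on
    at least one version being both available and compatible."""
    while True:
        avail = [rng.randint(0, 1) for _ in range(N)]
        compat = [rng.randint(0, 1) for _ in range(N)]
        if any(a and c for a, c in zip(avail, compat)):
            return avail, compat


def human_observation(avail, a_A):
    """H sees availability bit i only if A did not suppress it (a_A[i] == 1)."""
    return [s * m for s, m in zip(avail, a_A)]


def naive_human(obs, rng):
    """Assumption: H picks uniformly among the versions shown as available."""
    shown = [i for i, bit in enumerate(obs) if bit == 1]
    return rng.choice(shown)


def reward(avail, compat, a_H):
    """Reward 1 iff the chosen version is both available and compatible."""
    return avail[a_H] * compat[a_H]


def average_reward(assistant, episodes=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        avail, compat = sample_state(rng)
        a_A = assistant(compat)  # A observes the compatibility bits
        obs = human_observation(avail, a_A)
        total += reward(avail, compat, naive_human(obs, rng))
    return total / episodes


def suppress_incompatible(compat):
    # 0 means "suppress", so playing the compatibility bits hides incompatible versions.
    return list(compat)


def no_interference(compat):
    return [1] * N


interfering_avg = average_reward(suppress_incompatible)
plain_avg = average_reward(no_interference)
print(interfering_avg, plain_avg)
```

Under these assumptions the interfering policy always earns reward 1, while the non-interfering policy sometimes lets H pick an available but incompatible version.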

Proposition 4.4. There exists a POAG M where all optimal policy pairs (πA, πH) have that πA interferes with observations at the action level and that πH observes and acts naively.


Proof. Consider Example 4.3.

First consider the following policy pair: At the first time step, A chooses $o^A \in \{0,1\}^{10}$, i.e., A chooses to suppress the availability signal exactly for those cuda versions that aren't compatible. At all other time steps the assistant chooses uniformly at random. Call this policy $\hat\pi^A$.

At the second time step, when the human observes $o^H \in \{0,1\}^{10}$, the human chooses some $a^H$ s.t. $o^H_{a^H} = 1$. That is, H chooses a cuda version that her observation shows is available. It is easy to see that under the above A policy there always exists such an $a^H$. At all other time steps, H chooses uniformly at random. Call this policy $\hat\pi^H$.

It's easy to see that the above policy pair is optimal: By the structure of the environment, we can receive a reward of at most 1 by having the human choose a compatible and available version at time step 1. Clearly, the above policy achieves this reward of 1.

Next, note that the only non-interference action for A is $(1, 1, \dots, 1)$. Thus, the only non-interference policy for A is to always play $(1, 1, \dots, 1)$. Call this policy $\pi^A_{\text{ni}}$.

Note that the best response for H against $\pi^A_{\text{ni}}$ is $\hat\pi^H$. Thus, $\hat\pi^H$ is acting naively.

Furthermore, note that $\hat\pi^H$ acts naively.

It is easy to see that adding an $A \to H$ communication channel makes no difference to the above analysis.

B.4. Proof of Theorem 4.5

Theorem 4.5. Let $M$ be any POAG, and provide A with an unbounded communication channel to H, forming $M^{A \to H}$. Then there is an optimal policy pair $(\pi^H, \pi^A)$ for $M^{A \to H}$ where $\pi^A$ does not interfere with observations at the action level and $\pi^H$ observes naively.

Proof sketch. Roughly, take any deterministic optimal policy pair $(\pi^H, \pi^A)$. Consider the assistant policy $\tilde\pi^A$ that at each time step communicates A's full observation to H and that replaces interference with non-interference actions. Because $\pi^A$ is deterministic, H can infer what $\pi^A$ would have communicated based on $\tilde\pi^A$'s communications. The rest of the proof goes the same way as that of Theorem 4.2.

B.5. Proof of Theorem 4.7

For the proof of Theorem 4.7, we'll use the concept of entropy. For any probability distribution $P$ over some discrete space, let $H(P) := -\sum_x P(x) \log P(x)$ denote the distribution's entropy. The following is a well-known result in information theory [e.g., 3, Theorem 1.4.5; 10, Theorem 2.6.5].

Lemma B.2 (Conditioning decreases entropy). Let $X, Y$ be random variables. Then $\mathbb{E}_Y[H(P(X \mid Y))] \le H(P(X))$. Further, the inequality is strict if $X$ and $Y$ are not independent, i.e., if $P(X) \neq P(X \mid y)$ for some $y$, then $\mathbb{E}_Y[H(P(X \mid Y))] < H(P(X))$.
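As a quick numeric illustration of Lemma B.2, the sketch below compares $H(P(X))$ with $\mathbb{E}_Y[H(P(X \mid Y))]$ for a small joint distribution. The joint table is an arbitrary illustrative choice and the helper names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def expected_conditional_entropy(joint):
    """E_Y[H(P(X | Y))] for joint[x][y] = P(X = x, Y = y)."""
    n_x, n_y = len(joint), len(joint[0])
    total = 0.0
    for y in range(n_y):
        p_y = sum(joint[x][y] for x in range(n_x))
        if p_y > 0:
            total += p_y * entropy([joint[x][y] / p_y for x in range(n_x)])
    return total


# Illustrative joint distribution in which X and Y are correlated.
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
# Product distribution: X and Y are independent.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

for joint in (dependent, independent):
    marginal_x = [sum(row) for row in joint]
    print(entropy(marginal_x), expected_conditional_entropy(joint))
```

In the dependent case the conditional entropy is strictly smaller; in the independent case the two quantities coincide.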

Using this result, we can provide the following variant.

Lemma B.3. Let $S$ be a random variable. Let $X, Y$ be independent samples from $F(S)$ and let $Z$ be sampled from $G(Y)$, where $F$ and $G$ are stochastic functions. Then

$$\mathbb{E}_Z[H(P(S \mid Z))] \ge \mathbb{E}_X[H(P(S \mid X))].$$

Moreover, the inequality is strict if $S$ and $Y$ are dependent given $Z$.

Proof. For the non-strict version:

$$H(P(S \mid X)) = H(P(S \mid Y)) = H(P(S \mid Y, Z)) \overset{\text{Lemma B.2}}{\le} H(P(S \mid Z)).$$

The strict version can be proved the same way using the strict version of Lemma B.2.


Next, we can use this to prove that a garbling induces a higher-entropy distribution over states.

Lemma B.4. Let $L$ be some set of states. Let $(P_a(\cdot \mid s))_{s \in L}$ and $(P_b(\cdot \mid s))_{s \in L}$ be families of probability distributions s.t. $P_a$ is strictly more informative than $P_b$ with transformation function $F$. Further let $S$ be some random variable over $L$ with full support. Let $X_a \sim P_a(\cdot \mid S)$ and $X_b \sim F(X_a)$. Then $S$ and $X_a$ are dependent given $X_b$. In particular, from Lemma B.3 we get that $\mathbb{E}_{X_a}[H(P(S \mid X_a))] < \mathbb{E}_{X_b}[H(P(S \mid X_b))]$.

Proof. We prove the following contrapositive: if $X_a$ and $S$ are independent given $X_b$, then $P_b$ is at least as informative as $P_a$. If $X_a$ and $S$ are independent given $X_b$, then we have that $P(X_a \mid X_b, S) = P(X_a \mid X_b)$. Thus, for all states $s$, we have that

$$P(X_a \mid s) = \sum_{x_b} P(x_b \mid s)\, P(X_a \mid x_b, s) = \sum_{x_b} P(x_b \mid s)\, P(X_a \mid x_b).$$

But this means that if we sample $X_b$ according to $P_b$, and sample $X_a$ according to $P(X_a \mid x_b)$, then we obtain a sample for $X_a$ according to the distribution $P(X_a \mid s)$ (i.e., $P_a$). Thus, we have that $P_b$ is at least as informative as $P_a$.
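The effect described by Lemmas B.3 and B.4 can be checked numerically. The sketch below is illustrative only: the prior, the channel `P_a`, and the garbling `F` (which merges the first two observations) are arbitrary choices, and the helper names are hypothetical. It computes the expected posterior entropy over states before and after garbling.

```python
import math


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def expected_posterior_entropy(prior, channel):
    """E_X[H(P(S | X))] for prior[s] = P(S = s), channel[s][x] = P(X = x | S = s)."""
    n_states, n_obs = len(prior), len(channel[0])
    total = 0.0
    for x in range(n_obs):
        p_x = sum(prior[s] * channel[s][x] for s in range(n_states))
        if p_x > 0:
            posterior = [prior[s] * channel[s][x] / p_x for s in range(n_states)]
            total += p_x * entropy(posterior)
    return total


def garble(channel, F):
    """Compose a channel with a garbling F[x][z] = P(Z = z | X = x)."""
    return [
        [sum(row[x] * F[x][z] for x in range(len(F))) for z in range(len(F[0]))]
        for row in channel
    ]


prior = [0.5, 0.3, 0.2]          # full-support prior over three states
P_a = [[0.9, 0.1, 0.0],          # more informative signal
       [0.1, 0.8, 0.1],
       [0.0, 0.1, 0.9]]
F = [[1.0, 0.0],                 # garbling: observations 0 and 1 are merged
     [1.0, 0.0],
     [0.0, 1.0]]
P_b = garble(P_a, F)             # less informative signal

h_a = expected_posterior_entropy(prior, P_a)
h_b = expected_posterior_entropy(prior, P_b)
print(h_a, h_b)
```

Here the garbled signal leaves H with strictly higher expected entropy about the state, matching the direction of the inequality in Lemma B.4.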

Theorem 4.7. Let M be any POAG. Then there exists an optimal policy pair (πH, πA) for M s.t. πA does not interfere with observations at the policy level.

Proof. We will explicitly choose a policy for each time step $t = 0, 1, 2, \dots$. So let's take $\pi^A_{0:t-1}, \pi^H_{0:t-1}$ as given. Now let $\Pi_t$ be the set of policies at time $t$ that are part of a policy pair $(\pi^H_{t:}, \pi^A_{t:})$ that is optimal holding fixed $\pi^A_{0:t-1}, \pi^H_{0:t-1}$. Note that the expected utility of policy pairs in a POMDP is continuous. It follows that $\Pi_t$ is closed (i.e., that every convergent sequence of policies in $\Pi_t$ converges to a policy in $\Pi_t$).

Now from $\Pi_t$ choose $\pi^A_t$ as the minimizer of

$$\pi^A_t \mapsto \mathbb{E}_{O^H_{t+1}}\!\left[ H\!\left(P(S_{t+1} \mid O^H_{t+1}, \pi^H_{\text{random}}, \pi^A_{0:t-1}, \pi^A_t)\right) \,\middle|\, \pi^H_{\text{random}}, \pi^A_{0:t-1}, \pi^A_t \right],$$

where $H$ denotes Shannon entropy and $\pi^H_{\text{random}}$ is the human strategy that chooses uniformly at random. (Note that the above entropy function is not the only function we could use for this proof.) That is, let $\pi^A_t$ be the policy that minimizes the entropy of H's probability distribution over world state. Because the given function is continuous and $\Pi_t$ is closed (and bounded), this minimum exists (by the extreme value theorem).

Now by Lemma B.4 we have that if $\pi^A_t$ is more informative than $\hat\pi^A_t$, then $\pi^A_t$ will also have lower entropy at time $t$. It follows that there is no policy in $\Pi_t$ that is more informative than $\pi^A_t$.

Finally, it is left to show that there is no policy $\pi_t$ outside of $\Pi_t$ that is more informative than $\pi^A_t$. For this, we use the same argument as in the proof of Theorem 4.2: if there were a more informative policy $\hat\pi^A_t$ with the same effect on state transitions, then this would also be part of an optimal policy pair (constructed by having H apply the appropriate garbling internally). But we have already shown that in $\Pi_t$ there is no more informative policy than $\pi^A_t$.

Note that the entropy-minimizing policy used in the proof may still interfere with observations at the action level. For example, by default H might receive a low-information signal about the world. The entropy-minimizing policy might be one in which A overwrites this default signal in a way that expresses more information about the world. For instance, let's assume that by default, H observes a random number between −20 and 0 if it's cold outside and a random number between 0 and +40 if it's warm outside. A receives various hints about the temperature and can overwrite the signal with an arbitrary number. (I.e., for each number between −20 and +40, there's an action that sets H's observation to be that number.) Assuming nothing else happens in this POAG, the entropy-minimizing policies will be ones that overwrite the signal in a way that encodes A's information about the temperature. For instance, it may (or may not) be a non-interfering-at-the-policy-level strategy for A to overwrite H's signal with A's expectation of the temperature in degrees Celsius. Given such a policy, the entropy of H's beliefs about the world is lower than before (H has more information about the temperature). But each of these overwriting actions individually is observation-interfering.


C. Formalization of Example 5.1 and Proof of Proposition 5.2

Recall the example:

Example 5.1. H would like to schedule a job on a cluster. She can choose between two nodes. By default, she receives a signal from the environment about the two nodes' specifications. Each node may be either GPU-optimized or CPU-optimized. Also, the CPUs may be either AMD or Intel.

H has a strong preference between GPU-optimized and CPU-optimized nodes. She has a weak preference between AMD and Intel. These preferences are unknown to A.

A can interfere with H's observation about the available nodes. In particular, A can make it so that a choice between two CPU-optimized nodes appears as a choice between a GPU-optimized and a CPU-optimized node. A observes H's choice. Later, A is charged with scheduling a job for H and has to choose between a CPU- and a GPU-optimized node on H's behalf.

If H chooses naively upon seeing only CPU-optimized nodes (simply choosing her favorite), then A's best response interferes with observations at both the action and policy levels. Interfering with observations allows A to learn H's preference about GPU- vs. CPU-optimized nodes.

In particular, there are four possible states: (1) The first node is GPU-optimized and the second node is CPU-optimized. (2) The first node is CPU-optimized and the second node is GPU-optimized. (3) Both nodes are CPU-optimized. The first has an Intel processor, the second has an AMD processor. (4) Both nodes are CPU-optimized. The first has an AMD processor and the second has an Intel processor.

Suppose the utilities of the human choice are given as follows: 1 for the favored CPU-optimized type; 1 for a GPU-optimized node if H favors the GPU-optimized node. The reward is 0 otherwise. On the second step, the reward for the favored type of node is 10 and 0 for the other type of node.

Recall the proposition was as follows.

Proposition 5.2. There is a POAG M with the following properties. For every optimal policy pair (πH, πA), at least one of the following holds: (i) πH is not acting naively, or (ii) πA interferes with observations at both the action and policy levels. Additionally, there exists an optimal policy pair (πH, πA) where πH acts naively and πA interferes with observations at both the action and policy levels.

These properties continue to hold if we require that in $M$, A has no private information or can arbitrarily send messages to H (i.e., there is a POAG $M'$ s.t. $M = {M'}^{A \to H}$).

Proof sketch. Consider the example. First let's consider a naive human policy, i.e., one that chooses the favorite node type in the first time step. Then the best response for A is to interfere.

It is easy to see that in all optimal policy pairs, A must learn about H's GPU-versus-CPU preference. It follows that at time step 1, H must deterministically choose depending on her GPU-versus-CPU preference.

It is easy to see that all of these policy profiles have the same expected reward as the above naive/interference policy pair.

Note that in the above example, A has no private information. It is easy to see that the above argument continues to go through if we allow A to send signals to H.

D. Formalization of Example 6.2 and Proof of Proposition 6.3

Definition 6.1. Let $M$ be a POAG. Let $\pi^A$ be A's policy in $M$. We say that H's policy $\pi^H$ is a Boltzmann-rational response to $\pi^A$ if there exists some $\beta > 0$ s.t. for every human observation history $h$ that arises with positive probability in $M$ under $(\pi^A, \pi^H)$ we have that

$$\pi^H(a \mid h) \propto \exp\!\left(\beta\, \mathbb{E}\!\left[\sum_{t' \ge t} \gamma^{t'} R(S_{t'}, A^A_{t'}, A^H_{t'}) \,\middle|\, \pi^H, \pi^A, h\right]\right).$$
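For a single decision, the Boltzmann model reduces to a softmax over the expected returns of the available actions. The following is a minimal sketch under that single-step assumption; `boltzmann_policy` is a hypothetical helper name, and the expected returns are taken as given rather than computed from a POAG.

```python
import math


def boltzmann_policy(expected_returns, beta):
    """Action probabilities proportional to exp(beta * expected return).

    Subtracting the maximum before exponentiating avoids overflow and
    does not change the normalized probabilities.
    """
    top = max(expected_returns)
    weights = [math.exp(beta * (q - top)) for q in expected_returns]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Two actions with expected returns 4 and 0 at beta = 1.
probs = boltzmann_policy([4.0, 0.0], beta=1.0)
print(probs)
```

As $\beta$ grows the policy concentrates on the best action, and as $\beta \to 0$ it approaches the uniform distribution.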

Example 6.2. H is running a terminal command and is unsure whether to run the command with flag 1 or flag 2. With equal probability, either flag 1 or flag 2 is better, and how good the flags are differs by either a little or a lot. The worse flag always yields a utility of 0, while the better flag either yields a utility of 1 or a utility of 7. Thus, H is uniformly at random in one of four states. A has two actions: man and tldr. The man page is a long document that tells the human exactly what the values of the flags are (i.e., the exact state: which flag is better and whether its utility is 1 or 7). The tldr page is a short summary that tells the human which flag is better, but not by how much (i.e., ruling out half the states, leaving half remaining). Thus, the expected utility of the better flag is 4 (and of the worse flag is 0).

With uniform probability, H is in one of four possible states:

Flag 1 is better by a lot: flag 1 has value +7, while flag 2 has value 0.

Flag 1 is better by a little: flag 1 has value +1, while flag 2 has value 0.

Flag 2 is better by a little: flag 1 has value 0, while flag 2 has value +1.

Flag 2 is better by a lot: flag 1 has value 0, while flag 2 has value +7.

This gives us the following formalization for the game:

$S = (\{0,1\} \times \{s_a, s_b, s_c, s_d\}) \cup \{I, E\}$

$\Omega^H = S \cup \{1, 2\} \cup \{\text{null}\}$

$\Theta$ is a singleton

$A^H = \{1, 2\}$

$A^A = \{\text{tldr}, \text{man}\}$

H's observations are given as follows. For $s \in \{s_a, s_b, s_c, s_d\}$, we have $O^H(o^H \mid (0, s), \text{man}, a^H) = \mathbb{1}[o^H = s]$, and for $i \in \{1, 2\}$ we have $O^H(i \mid (0, s), \text{tldr}, a^H) = \mathbb{1}[i = 1]\,\mathbb{1}[s \in \{s_a, s_b\}] + \mathbb{1}[i = 2]\,\mathbb{1}[s \in \{s_c, s_d\}]$. Otherwise, H's observation is deterministically null.

A's observations are the same as H's observations.

The reward is given as follows:

$$R((1, s_a), 1, a^A) = 7 \tag{1}$$

$$R((1, s_b), 1, a^A) = 1 \tag{2}$$

$$R((1, s_c), 2, a^A) = 7 \tag{3}$$

$$R((1, s_d), 2, a^A) = 1 \tag{4}$$

$\mathbb{1}[a^H = 1]\,\mathbb{1}[s \in \{s_a, s_b\}] + \mathbb{1}[a^H = 2]\,\mathbb{1}[s \in \{s_c, s_d\}]$. All other rewards are 0.

For all $a^H, a^A$, $T(\cdot \mid I, a^H, a^A)$ is the uniform distribution over $\{0\} \times \{s_a, s_b, s_c, s_d\}$. For all $s \in \{s_a, s_b, s_c, s_d\}$, $T(s' \mid (0, s), a^H, a^A) = \mathbb{1}[s' = (1, s)]$. For all $s$, $T(s' \mid (1, s), a^H, a^A) = \mathbb{1}[s' = E]$. Finally, $T(s' \mid E, a^H, a^A) = \mathbb{1}[s' = E]$.

Proposition 6.3. For every $\beta > 0$, there exists a POAG in which neither H nor A has private information s.t. all $\beta$-Boltzmann-rational/optimal policy pairs $(\pi^H, \pi^A)$ have $\pi^A$ interfere with observations at both the action and policy levels.

Proof. Note first that multiplying β by any positive number has the same effect on Boltzmann-rational strategies as multiplying all rewards by that number. Therefore, we can consider β = 1 without loss of generality.

Consider Example 6.2. Note that tldr is an observation-interference action, since man results in a more informative signal to H.

Now consider the non-interference policy for A that always plays man. Then a Boltzmann-rational H will choose as follows: If she observes $s_a$ or $s_c$, then she will choose an expected utility of 7 with probability proportional to $\exp(7)$ and an expected utility of 0 with probability proportional to $\exp(0)$. Thus, the expected utility is

$$\frac{7 \exp(7)}{\exp(7) + \exp(0)}. \tag{5}$$


Figure 2: The effect of varying β on the assistant's incentive for observation interference in Example 6.2. The x-axis shows β (ticks from 0.5 to 2.5). The y-axis indicates the expected utility under non-interference minus the expected utility under interference.

Similarly, if she observes $s_b$ or $s_d$, her expected utility is

$$\frac{\exp(1)}{\exp(1) + \exp(0)}. \tag{6}$$

Thus, overall her expected utility is

$$\frac{1}{2} \cdot \frac{7 \exp(7)}{\exp(7) + \exp(0)} + \frac{1}{2} \cdot \frac{\exp(1)}{\exp(1) + \exp(0)} \approx 3.86234. \tag{7}$$

Now consider the interference policy for A in which A always plays tldr. Then upon observing either 1 or 2, the human chooses between a utility of 0 and a utility of 4. Thus, the expected utility is

$$\frac{4 \exp(4)}{\exp(4) + \exp(0)} \approx 3.92806. \tag{8}$$

We observe that this expected value under interference is higher than the expected value under non-interference.
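The two numbers in this comparison are easy to reproduce. The following short sketch (with the hypothetical helper `boltzmann_value`, and $\beta = 1$ as in the proof) evaluates the expected utility of a Boltzmann-rational H under the always-man and always-tldr policies.

```python
import math


def boltzmann_value(utilities, beta=1.0):
    """Expected utility when each option is chosen with probability
    proportional to exp(beta * utility)."""
    weights = [math.exp(beta * u) for u in utilities]
    return sum(w * u for w, u in zip(weights, utilities)) / sum(weights)


# man: H learns the exact state, so she faces (7, 0) or (1, 0) with equal probability.
value_man = 0.5 * boltzmann_value([7, 0]) + 0.5 * boltzmann_value([1, 0])

# tldr: H only learns which flag is better, so she faces expected utilities (4, 0).
value_tldr = boltzmann_value([4, 0])

print(round(value_man, 5), round(value_tldr, 5))
```

The interference policy (tldr) yields the higher value at $\beta = 1$, in line with the proof.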

E. Effects of Varying the Boltzmann Rationality Parameter (β) on the Assistant's Incentives to Interfere with Observations

As noted in the main text, in Example 6.2, we have that for low values of the rationality parameter β, A prefers non-interference, while for large values of β, A prefers interference. Below we will show that in general, counterintuitively, A prefers non-interference for sufficiently small (positive) values of β.

We here only consider the case of a single decision. Consider a case with $n$ actions. Let the expected utilities of the different actions without information be $y_{0,1}, \dots, y_{0,n}$. Now imagine that H might receive $k$ different signals with probabilities $p_1, \dots, p_k$. Under signal $i \in \{1, \dots, k\}$, the expected utilities of the different actions become $y_{i,1}, \dots, y_{i,n}$. By the tower rule we must have for each action $a \in \{1, \dots, n\}$,

$$\sum_{i=1}^{k} p_i\, y_{i,a} = y_{0,a}. \tag{9}$$


Note that without further restriction, the above setting includes settings in which the signal provides information on what action is best.

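For any β, the expected utility without the signal is

$$\frac{1}{\sum_{a=1}^{n} \exp(\beta y_{0,a})} \sum_{a=1}^{n} \exp(\beta y_{0,a})\, y_{0,a}. \tag{10}$$

The expected utility with the signal is

$$\sum_{s=1}^{k} p_s\, \frac{1}{\sum_{a=1}^{n} \exp(\beta y_{s,a})} \sum_{a=1}^{n} \exp(\beta y_{s,a})\, y_{s,a}. \tag{11}$$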
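Equations (10) and (11) can be evaluated directly. The sketch below instantiates them for Example 6.2, where the signal is the extra detail in the man page: with probability 1/2 each, the utilities are (7, 0) or (1, 0), and Equation (9) gives (4, 0) without the signal. The function names are hypothetical, and the β values are arbitrary sample points rather than the ones used for Figure 2.

```python
import math


def value_without_signal(y0, beta):
    """Equation (10): Boltzmann-weighted average of the utilities y0."""
    weights = [math.exp(beta * y) for y in y0]
    return sum(w * y for w, y in zip(weights, y0)) / sum(weights)


def value_with_signal(p, ys, beta):
    """Equation (11): average over signals s of the Boltzmann value under ys[s]."""
    return sum(p_s * value_without_signal(y_s, beta) for p_s, y_s in zip(p, ys))


p = [0.5, 0.5]                       # signal probabilities
ys = [[7.0, 0.0], [1.0, 0.0]]        # utilities under each signal
# Equation (9): utilities without the signal are the p-weighted averages.
y0 = [sum(p_s * y_s[a] for p_s, y_s in zip(p, ys)) for a in range(2)]

# Gain from the signal, i.e. non-interference minus interference.
gain = {
    beta: value_with_signal(p, ys, beta) - value_without_signal(y0, beta)
    for beta in (0.1, 0.5, 1.0, 2.0)
}
for beta, g in gain.items():
    print(beta, round(g, 4))
```

In this instance the gain is positive for the small β values and negative for the larger ones, which is the qualitative pattern the two propositions in this appendix describe.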

Proposition E.1. For all $(y_{s,a} \in \mathbb{R})_{s \in \{0,1,\dots,k\},\, a \in \{1,\dots,n\}}$, $(p_s \in \mathbb{R})_{s \in \{0,1,\dots,k\}}$ satisfying Equation (9), we have that for sufficiently small but positive β, the expected utility without the signal is at most the expected utility with the signal.

Proof. It's easy to see that for β = 0, the two expected utilities are the same. Thus, all we need to show is that the derivative w.r.t. β of the term in Eq. 11 at β = 0 exceeds the corresponding derivative of the term in Eq. 10.

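The derivative w.r.t. β at β = 0 of the term in Equation (10) is

$$\frac{1}{n} \sum_{a=1}^{n} y_{0,a}^2 - \left(\frac{1}{n} \sum_{a=1}^{n} y_{0,a}\right)^2.$$

Note that this is exactly the variance of a random variable that is uniform over $(y_{0,a})_{a=1,\dots,n}$.

Similarly, the derivative of the term in Equation (11) is

$$\sum_{s=1}^{k} p_s \left[\frac{1}{n} \sum_{a=1}^{n} y_{s,a}^2 - \left(\frac{1}{n} \sum_{a=1}^{n} y_{s,a}\right)^2\right].$$

Note that this is the weighted average (over $s$) of the variances of the uniform random variables over $(y_{s,a})_{a=1,\dots,n}$.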

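We can now prove the claimed inequality using the convexity of the square function, Equation (9) and some basic term manipulation. Writing $\bar y_s := \frac{1}{n} \sum_{a=1}^{n} y_{s,a}$ for $s \in \{0, 1, \dots, k\}$ (so that $\bar y_0 = \sum_{s=1}^{k} p_s \bar y_s$ by Equation (9)), we have

$$\frac{1}{n} \sum_{a=1}^{n} (y_{0,a} - \bar y_0)^2 \overset{\text{Equation (9)}}{=} \frac{1}{n} \sum_{a=1}^{n} \left(\sum_{s=1}^{k} p_s (y_{s,a} - \bar y_s)\right)^2 \overset{(\cdot)^2 \text{ is convex}}{\le} \frac{1}{n} \sum_{a=1}^{n} \sum_{s=1}^{k} p_s (y_{s,a} - \bar y_s)^2 = \sum_{s=1}^{k} p_s\, \frac{1}{n} \sum_{a=1}^{n} (y_{s,a} - \bar y_s)^2.$$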

We have skipped over some term manipulations in Equations (16) and (21), both of which are essentially the equality of two definitions of the variance: $\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$ and $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

It's interesting to note that this is essentially the proof that the variance (over $a$) of the expectation (over $s$) is at most the expectation (over $s$) of the variance (over $a$).
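The variance inequality behind Proposition E.1 can be spot-checked numerically. The sketch below draws arbitrary utilities and signal probabilities (the sizes, ranges, and seed are illustrative assumptions), builds $y_{0,a}$ via Equation (9), and compares the variance of the averaged utilities with the averaged variance.

```python
import random


def variance(values):
    """Variance of a random variable that is uniform over `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
n, k = 4, 3  # number of actions and number of signals (arbitrary)

raw = [rng.random() for _ in range(k)]
p = [r / sum(raw) for r in raw]                        # signal probabilities
ys = [[rng.uniform(-5, 5) for _ in range(n)] for _ in range(k)]
# Equation (9): utilities without the signal.
y0 = [sum(p[s] * ys[s][a] for s in range(k)) for a in range(n)]

var_of_mean = variance(y0)                              # derivative of Eq. (10) at beta = 0
mean_of_var = sum(p[s] * variance(ys[s]) for s in range(k))  # derivative of Eq. (11) at beta = 0
print(var_of_mean, mean_of_var)
```

Because the variance is a convex function of the utility vector, the first quantity never exceeds the second, whatever values are drawn.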

Second, we want to show that for large β, A prefers observation interference, i.e., prefers to have the human choose based on the expected utilities y0,1, ..., y0,n rather than the expected utilities that arise from further signals. However, for this to hold we need a further condition. Note that in the general formalism above, the signal s may provide information about which action is best. If this is the case, then it is easy to show that for large enough β, A will prefer providing the signal. However, consider specifically those cases in which the signal s only provides information about how much better the best action is compared to other actions. Therefore, we require in the following result that the best action is the same (WLOG 1) across s.

Proposition E.2. Let $(y_{s,a} \in \mathbb{R})_{s \in S,\, a \in \{1,\dots,n\}}$, $(p_s \in \mathbb{R})_{s \in S}$ satisfy Equation (9) and let $y_{s,0} > y_{s,a}$ for all $s \in \{0\} \cup S$, $a \in \{1, \dots, n\}$. Then for all sufficiently large β we have that the expected utility without the signal is at least the expected utility with the signal. The inequality is strict if the signal is non-trivial (i.e., $y_{s,a}$ is not constant across $s$ for some $a$).

We first provide a very rough sketch. For simplicity, let's say that the signal provides evidence about how much better the first action is compared to the second-best action. Then sometimes the signal will decrease the difference in expected utility between the best and second-best action. We will show that as $\beta \to \infty$, the overall effect of learning the information is dominated by taking the best action less in this case.

We will use the following lemmas.

Lemma E.3. Let the differences between the top $k$ actions be constant across signals and let the difference to the $(k+1)$-th action be non-constant. Then there is a signal $s$ s.t. the difference to the $(k+1)$-th action decreases under that signal.

Proof. Let $k-1$ be the $k$-th best action according to 0 and let $k$ be the $(k+1)$-th best action according to 0. By the tower rule (Eq. 9), $y_{0,k-1} - y_{0,k}$ must be greater than $y_{s,k-1} - y_{s,k}$ for some $s$. (If the difference in these expected utilities changes when the signal is observed, then it must sometimes decrease.) But then in cases where this difference decreases as $s$ is observed, we clearly have that the difference between one of the $k$ best actions and the $(k+1)$-th best action under $s$ also decreases.

Proof of Proposition E.2. The gain from obtaining the signal is:

$$\sum_{s \in S} p_s \sum_{a} \left( \frac{\exp(\beta y_{s,a})}{\sum_{a'} \exp(\beta y_{s,a'})}\, y_{s,a} - \frac{\exp(\beta y_{0,a})}{\sum_{a'} \exp(\beta y_{0,a'})}\, y_{0,a} \right).$$

WLOG let 0 be the best action under all signals, 1 the second-best and so on. Let $k$ be the largest number such that the differences between the utilities of actions $0, \dots, k-1$ are always the same. (Typically k = 0.) Let $\bar S$ be the set of signals under which the difference to the utility of $k$ (the $(k+1)$-th best action) is minimized. Note that in particular, the difference must be smaller than under 0 by Lemma E.3. WLOG assume that for all signals, $k$ is among the $k+1$ best actions.

WLOG assume that $y_{s,a} > 0$ for all $s \in \{0\} \cup S$ and all $a$ and that $y_{s,0}$ is constant across $s$.

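Now we will divide up the above sum into three components:

A: The change (decrease) in utility from playing the top $k$ actions less in $\bar S$ than without the signal:

$$A = \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = 0, \dots, k-1} \left( \frac{\exp(\beta y_{\bar s,a})}{\sum_{a'} \exp(\beta y_{\bar s,a'})}\, y_{\bar s,a} - \frac{\exp(\beta y_{0,a})}{\sum_{a'} \exp(\beta y_{0,a'})}\, y_{0,a} \right).$$

B: The change in utility from the changes in distribution of all actions other than the top $k$ under $\bar S$ versus without the signal:

$$B = \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = k, k+1, \dots} \left( \frac{\exp(\beta y_{\bar s,a})}{\sum_{a'} \exp(\beta y_{\bar s,a'})}\, y_{\bar s,a} - \frac{\exp(\beta y_{0,a})}{\sum_{a'} \exp(\beta y_{0,a'})}\, y_{0,a} \right).$$

C: The change in utility from all signals other than $\bar S$, i.e.,

$$C = \sum_{s \notin \bar S} p_s \sum_{a} \left( \frac{\exp(\beta y_{s,a})}{\sum_{a'} \exp(\beta y_{s,a'})}\, y_{s,a} - \frac{\exp(\beta y_{0,a})}{\sum_{a'} \exp(\beta y_{0,a'})}\, y_{0,a} \right).$$

We will show that the effect from A (which is negative) becomes infinitely much larger than the effect from B and C (in absolute terms). From that it will follow that the original sum, which is equal to $A + B + C$, is negative as $\beta \to \infty$.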

We first provide a bound on A. We first show that $A < 0$. To show this, note first that in all numerators in A, we can replace $y_{\bar s,a}$ with $y_{0,a}$ (by choice of $\bar s$ and $k$). So all we need to show is that the second denominator is smaller than the first, i.e., $\sum_{a'} \exp(\beta y_{\bar s,a'}) > \sum_{a'} \exp(\beta y_{0,a'})$. But this is easy to see from the fact that $y_{\bar s,a} = y_{0,a}$ for $a = 0, 1, \dots, k-1$ and $y_{\bar s,k} > y_{0,k}$. For large β, $\exp(\beta y_{\bar s,k})$ will be much larger than $\sum_{a' = k, k+1, \dots} \exp(\beta y_{0,a'})$.


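Next, we will provide a lower bound on $|A|$. Using $y_{\bar s,a} = y_{0,a}$ for $a = 0, \dots, k-1$, we have for sufficiently large β

$$\begin{aligned}
|A| &= \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = 0, \dots, k-1} \left| \frac{\exp(\beta y_{\bar s,a})}{\sum_{a'} \exp(\beta y_{\bar s,a'})} - \frac{\exp(\beta y_{0,a})}{\sum_{a'} \exp(\beta y_{0,a'})} \right| y_{0,a} \\
&= \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = 0, \dots, k-1} \exp(\beta y_{0,a}) \left| \frac{1}{\sum_{a'} \exp(\beta y_{\bar s,a'})} - \frac{1}{\sum_{a'} \exp(\beta y_{0,a'})} \right| y_{0,a} \\
&= \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = 0, \dots, k-1} \exp(\beta y_{0,a}) \left( \frac{1}{\sum_{a'} \exp(\beta y_{0,a'})} - \frac{1}{\sum_{a'} \exp(\beta y_{\bar s,a'})} \right) y_{0,a} \\
&= \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = 0, \dots, k-1} \exp(\beta y_{0,a})\, \frac{\sum_{a'} \exp(\beta y_{\bar s,a'}) - \sum_{a'} \exp(\beta y_{0,a'})}{\left(\sum_{a'} \exp(\beta y_{\bar s,a'})\right) \left(\sum_{a'} \exp(\beta y_{0,a'})\right)}\, y_{0,a} \\
&= \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = 0, \dots, k-1} \exp(\beta y_{0,a})\, \frac{\sum_{a' = k, k+1, \dots} \exp(\beta y_{\bar s,a'}) - \sum_{a' = k, k+1, \dots} \exp(\beta y_{0,a'})}{\left(\sum_{a'} \exp(\beta y_{\bar s,a'})\right) \left(\sum_{a'} \exp(\beta y_{0,a'})\right)}\, y_{0,a} \\
&\ge \sum_{\bar s \in \bar S} p_{\bar s} \sum_{a = 0, \dots, k-1} \exp(\beta y_{0,a})\, \frac{\exp(\beta y_{\bar s,k}) - n \exp(\beta y_{0,k})}{n^2 \exp(\beta y_{0,0})^2}\, y_{0,a} \\
&\ge \sum_{\bar s \in \bar S} p_{\bar s}\, \frac{\exp(\beta y_{\bar s,k}) - n \exp(\beta y_{0,k})}{n^2 \exp(\beta y_{0,0})}\, y_{0,0} \\
&\approx \sum_{\bar s \in \bar S} p_{\bar s}\, \frac{\exp(\beta y_{\bar s,k})}{n^2 \exp(\beta y_{0,0})}\, y_{0,0}.
\end{aligned}$$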

Next we upper bound B. First, the best case for the effect on ... is that all the probability mass that under 0 is on the top k actions ends up on the k-th best action, i.e.,

a=0,...,k 1

exp(βy0,a) P

a exp(βy0,a )

We can further upper-bound this as follows:

a=0,...,k 1

exp(βy0,a) P

a exp(βy0,a )

exp(βy0,a) P

a exp(βy0,a )y s,k

s S p s n exp(βy0,k)

exp(βy0,0) y s,k

From the fact that $y_{\bar s,k} > y_{0,k}$, it is easy to see that this term vanishes in absolute value relative to our lower bound on $|A|$.

Finally, we must upper bound C. First, we can upper bound C by considering a case where all probability mass that in 0 was outside the top $k$ actions goes to the best action when a signal outside of $\bar S$ is observed, i.e.,

$$\sum_{s \notin \bar S} p_s \sum_{a = k, k+1, \dots} \frac{\exp(\beta y_{0,a})}{\sum_{a'} \exp(\beta y_{0,a'})}\, y_{0,0}.$$

We can further upper bound this as follows:

$$\sum_{s \notin \bar S} p_s \sum_{a = k, k+1, \dots} \frac{\exp(\beta y_{0,a})}{\sum_{a'} \exp(\beta y_{0,a'})}\, y_{0,0} \le \sum_{s \notin \bar S} p_s\, \frac{n \exp(\beta y_{0,k})}{\exp(\beta y_{0,0})}\, y_{0,0}.$$

Again, from the fact that $y_{\bar s,k} > y_{0,k}$, it is easy to see that this term vanishes in absolute value relative to our lower bound on $|A|$.

F. Proof of A's Best Response in the Product Selection Game

Proposition 7.2. Consider A policies that always interfere with $k$ observations for some fixed $k$. Among the $k$-interference policies for a given $k$, A's best response to H's straightforward product selection policy is as follows. A interferes with the $k$ smallest $\hat R_i$ values, where $\hat R_i = R_i$ if A observes $R_i$, and $\hat R_i = 0.5$ otherwise.

Proof. Consider A's perspective. A's interference is equivalent to selecting a set of $d - k$ untampered products from which H selects according to a Boltzmann distribution on $H_i$. As A neither sees nor affects the $H_i$, by symmetry, over all draws of the game, H selects each of the $d - k$ products with equal probability. A's expected payoff for choosing $d - k$ products, then, is the uniform average of the products' expected $U_i$.

How does A choose the set of $d - k$ products to maximize the uniform average of the products' expected $U_i$? Recall $U_i = H_i + R_i$. As A neither sees nor affects the $H_i$, A can ignore the $H_i$ and consider only the $R_i$. Denote the expected $R_i$ by $\hat R_i = \mathbb{E}[R_i]$. If A observes $R_i$, then $\hat R_i = R_i$. If A doesn't observe $R_i$, then $\hat R_i = 0.5$. To choose the maximum $d - k$ values for $\hat R_i$, A interferes with the minimum $k$ values of $\hat R_i$.
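The selection rule in Proposition 7.2 is simple to state in code. The following is a minimal sketch with hypothetical names, not the authors' experimental code: unobserved $R_i$ are represented by `None` and replaced by their expectation 0.5, and ties are broken by product index, which is an arbitrary choice the proposition does not specify.

```python
def best_k_interference(observed_R, k):
    """Return the indices of the k products A should interfere with.

    observed_R[i] is R_i if A observes it, or None if A does not.
    """
    r_hat = [0.5 if r is None else r for r in observed_R]
    # sorted() is stable, so ties keep their original index order.
    order = sorted(range(len(r_hat)), key=lambda i: r_hat[i])
    return set(order[:k])


# Six products; A does not observe R_1 and R_4.
chosen = best_k_interference([0.9, None, 0.2, 0.7, None, 0.1], k=2)
print(chosen)
```

The remaining $d - k$ products are then the ones with the largest $\hat R_i$, which maximizes their uniform average.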

G. Minor Deficiencies of the Observation Interference Definition

As noted in the main text, there are various possible concerns with Definition 3.2 that we consider minor because they do not change the main ideas and results of this paper.

The definition does not take into account what A knows about what H already knows. As such, it will sometimes spuriously judge a policy to be observation interference for taking away a signal from the human that is redundant with the human s past observations. For example, if the human observes the Linux version at time t and the Linux is known not to change, then preventing the human from observing the Linux version again at time t + 1 might count as observation interference.

The definition may also spuriously judge a policy to not be observation interference because the only more informative policies fail to provide some redundant piece of information to the human. For instance, let s say that by default the human learns some new, useful information at time t + 1. Now let s say that A can make it so that H instead observes the Linux version (which H already knows). Assume that A has no way of letting H see both the Linux version and the new, useful information. Then making the human observe the Linux would not count as sensor interference according to our definition, because our definition doesn t take into account that the human already knows the Linux version.

Adapting the definition to fix this deficiency is somewhat cumbersome, because it requires us to reason about A's beliefs about H's observation histories/beliefs.

This aspect of the definition seems mostly irrelevant for our results. For instance, none of our examples of observation interference have redundant observations. Therefore, we have opted to keep the definition simple in this paper.

Our definition only compares pure actions in terms of their informativeness. But it may be the case that one action $\hat{a}^A$ is, in some intuitive sense, interfering with H's observations, but the only way to show this is to compare $\hat{a}^A$ with a mix of actions, say, mixing uniformly over $a^A_1$ and $a^A_2$. In particular, it may be that $\hat{a}^A$ has the same effect on state transitions as mixing uniformly over $a^A_1$ and $a^A_2$, while reducing the informativeness of H's observation. It's easy to extend the definition to also consider mixed actions, but the extension has no impact on any of our results.


Neither the action-level nor the policy-level notion of tampering is sensitive to what policy H plays or even what policy H might plausibly play. For instance, let's say there is some action $a^H_{\text{silly}}$ for H that it never makes sense for H to play. (In game-theoretic terms, it might be strictly dominated.) Then whether any given policy $\pi^A$ is tampering will be sensitive to what happens if A plays $\pi^A$ and H plays $a^H_{\text{silly}}$. Arguably this shouldn't matter; arguably we should assume some degree of rationality on the part of H.

To refine this definition, we would need to restrict attention to specific policies or actions for H. It's not clear which restriction makes the most sense. In any case, we cannot imagine a refinement of the definition that would have more than a minor impact on our results.

H. Code Assets

Our experiments use the Python software libraries Matplotlib (Hunter, 2007), NumPy (Harris et al., 2020), pandas (pandas development team, 2020; Wes McKinney, 2010), and seaborn (Waskom, 2021).