# Pitfalls of Learning a Reward Function Online

Stuart Armstrong¹,², Jan Leike³, Laurent Orseau³ and Shane Legg³
¹Future of Humanity Institute, Oxford University, UK
²Machine Intelligence Research Institute, Berkeley, USA
³DeepMind, London, UK
stuart.armstrong@philosophy.ox.ac.uk, leike@google.com, lorseau@google.com, legg@google.com

In some agent designs, like inverse reinforcement learning, an agent needs to learn its own reward function. Learning the reward function and optimising for it are typically two different processes, usually performed at different stages. We consider a continual (one life) learning approach where the agent both learns the reward function and optimises for it at the same time. We show that this comes with a number of pitfalls, such as deliberately manipulating the learning process in one direction, refusing to learn, learning facts already known to the agent, and making decisions that are strictly dominated (for all relevant reward functions). We formally introduce two desirable properties: the first is unriggability, which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise. The second is uninfluenceability, whereby the reward-function learning process operates by learning facts about the environment. We show that an uninfluenceable process is automatically unriggable, and if the set of possible environments is sufficiently large, the converse is true too.

1 Introduction

In reinforcement learning (RL), an agent has to learn to solve a problem by maximising the expected reward provided by a reward function [Sutton and Barto, 1998]. Designing such a reward function is similar to designing a scoring function for a game, and can be very difficult ([Lee et al., 2017]).¹ Usually, one starts by designing a proxy: a simple reward function that seems broadly aligned with the user's goals. While testing the agent with this proxy, the user may observe that the agent finds a simple behaviour that obtains a high reward on the proxy but does not match the behaviour intended by the user,² who must then refine the reward function to include more complicated requirements, and so on. This has led to a recent trend to learn a model of the reward function, rather than having the programmer design it ([Ng and Russell, 2000; Hadfield-Menell et al., 2016; Choi and Kim, 2011; Amin and Singh, 2016; Abbeel and Ng, 2004; Christiano et al., 2017; Hadfield-Menell et al., 2017; Ibarz et al., 2018; Akrour et al., 2012; MacGlashan et al., 2017; Pilarski et al., 2011]).

One particularly powerful approach is putting the human into the loop ([Abel et al., 2017]), as done by [Christiano et al., 2017], because it allows for the opportunity to correct misspecified reward functions as the RL agent discovers exploits that lead to higher reward than intended. However, learning the reward function with a human in the loop has one problem: by manipulating the human, the agent could manipulate the learning process.³ If the learning process is online (the agent is maximising its reward function as well as learning it), then the human's feedback itself becomes an optimisation target.

*Contact Author
¹See also "Specification gaming: the flip side of AI ingenuity", https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity
²See https://blog.openai.com/faulty-reward-functions/ and the paper [Leike et al., 2017].
[Everitt and Hutter, 2016; Everitt and Hutter, 2019] and [Everitt, 2018] analyse the problems that can emerge in these situations, phrasing them as "feedback tampering" problems. Indeed, a small change to the environment can make a reward-function learning process manipulable.⁴ So it is important to analyse which learning processes are prone to manipulation.

After building a theoretical framework for studying the dynamics of learning reward functions online, this paper will identify the crucial property of a learning process being uninfluenceable: in that situation, the reward function depends only on the environment, and is outside the agent's control. Thus it is completely impossible to manipulate an uninfluenceable learning process, and the reward-function learning is akin to Bayesian updating.

The paper also identifies the weaker property of unriggability, an algebraic condition that ensures that actions taken by the agent do not influence the learning process in expectation. An unriggable learning process is thus one the agent cannot push towards its preferred reward function. An uninfluenceable learning process is automatically unriggable, but the converse need not be true. This paper demonstrates that, if the set of environments is large enough, an unriggable learning process is equivalent, in expectation, to an uninfluenceable one. If this condition is not met, unriggable-but-influenceable learning processes do allow some undesirable forms of manipulation.

The situation is even worse if the learning process is riggable: the agent can follow a policy that would reduce its reward, with certainty, for all the reward functions it is learning about, among other pathologies.

To illustrate, this paper uses a running example of a child asking their parents for career advice (see Section 2.2). That learning process can be riggable, unriggable-but-influenceable, or uninfluenceable, and Q-learning examples based on it will be presented in Section 6. This paper also presents a counterfactual method for making any learning process uninfluenceable, and shows some experiments to illustrate its performance compared with influenceable and riggable learning.

³Humans have many biases and inconsistencies that may be exploited ([Kahneman, 2011]), even accidentally; and humans can be tricked and fooled, skills that could be in the interest of such an agent to develop, especially as it gets more powerful ([Bostrom, 2014; Yudkowsky, 2008]).
⁴See this example, where an algorithm is motivated to give a secret password to a user while nominally asking for it, since it is indirectly rewarded for correct input of the password: https://www.lesswrong.com/posts/b8HauRWrjBdnKEwM5/rigging-is-a-form-of-wireheading

2 Notation and Formalism

The agent takes a series of actions (from the finite set A) and receives from the environment a series of observations (from the finite set O). A sequence of m actions and observations forms a history of length m: hm = a1 o1 a2 o2 … am om. Let Hm be the set of histories of length m. We assume that all interactions with the environment are exactly n actions and observations long. Let the set of all possible (partial) histories be denoted by H = ∪_{i=0}^{n} Hi. The histories of length n (Hn, in the notation above) are called the complete histories, and the history h0 is the empty history. The agent chooses actions according to a policy π ∈ Π, the set of all policies.
We write P(a | hm, π) for the probability of π choosing action a given the history hm. An environment µ is a probability distribution over the next observation, given a history hm and an action a; write P(o | hm a, µ) for the conditional probability of the observation o ∈ O given hm and a. Let h^k_m be the initial k actions and observations of hm. We write hk ⊑ hm if hk = h^k_m.

For a policy π and an environment µ, we can define the probability of a history hm = a1 o1 a2 o2 … am om:

$$P(h_m \mid \pi, \mu) = \prod_{i=1}^{m} P(a_i \mid h_m^{i-1}, \pi)\, P(o_i \mid h_m^{i-1} a_i, \mu).$$

Finally, P(hm | hk, π, µ) for hk ⊑ hm is defined in the usual way.

If hm is a history, let a(hm) be the m-tuple of actions of hm. Note that a(hm) = a1 … am can be seen as a very simple policy: take action ai on turn i. This allows us to define the probability of hm given environment µ and actions a(hm), written as P(hm | µ) = P(hm | a(hm), µ). Note that for any policy π,

$$P(h_m \mid \pi, \mu) = P(h_m \mid \mu) \prod_{i=1}^{m} P(a_i \mid h_m^{i-1}, \pi). \tag{1}$$

If M is a set of environments, then any prior ξ in Δ(M) also defines a probability for a history hm:

$$P(h_m \mid \xi) = \sum_{\mu \in M} P(\mu \mid \xi)\, P(h_m \mid \mu).$$

By linearity, Equation (1) also applies when conditioning hm on π and ξ instead of π and µ. Let hm be a history; then the conditional probability of an environment given prior and history⁵ is:

$$P(\mu \mid h_m, \xi) = \frac{P(h_m \mid \mu)\, P(\mu \mid \xi)}{P(h_m \mid \xi)}.$$

Using this, ξ itself defines an environment, as a conditional distribution on the next observation o ([Hutter, 2004]):

$$P(o \mid h_m a_{m+1}, \xi) = \sum_{\mu \in M} P(\mu \mid h_m, \xi)\, P(o \mid h_m a_{m+1}, \mu).$$

2.1 Reward Functions and Learning Processes

Definition 1 (Reward function). A reward function R is a map from complete histories to real numbers.⁶ Let R = {R : Hn → ℝ} be the set of all reward functions.

A reward-function learning process ρ can be described by a conditional probability distribution over reward functions, given complete histories.⁷ Write this as P(R | hn, ρ). This paper will use those probabilities as the definition of ρ.

The environment, policy, histories, and learning process can be seen in terms of causal graphs ([Pearl, 2009]) in Figure 1, and using plate notation ([Buntine, 1994]) in Figure 2. The agent is assumed to know the causal graph and the relevant probabilities; it selects the policy π.

⁵The hm generates the policy a(hm), implicitly used here.
⁶These are sometimes called return functions: the discounted sum of rewards. The results of this paper apply to traditional reward functions that take values on all histories, not just complete histories, but they are easier to see in terms of return functions.
⁷Anything we would informally define as a reward-function learning process has to at least be able to produce such conditional probabilities. If the learning process is unriggable, then there is a natural way to extend it from complete histories to shorter histories; see Section 3.3.

Figure 1: Environment and reward-function learning process. The node µ (with prior ξ) connects to every observation; the node π (chosen by the agent) to every action. The transition probabilities are given by the complete preceding history and by µ (for observations) or π (for actions). The ρ sets the probabilities on the reward function R, given hn (a complete history). The final reward is r.
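To make the formalism concrete, here is a minimal Python sketch (illustrative only, not code from the paper; all function and variable names are ours) of environments and policies as conditional distributions, the history probability of Equation (1), and the Bayesian posterior over environments.

```python
# A history is a tuple (a1, o1, a2, o2, ...); environments and policies are
# conditional distributions, represented as functions returning dicts.

def history_prob(history, policy, env):
    """P(h_m | pi, mu): product over turns of action and observation probabilities."""
    p = 1.0
    for i in range(0, len(history), 2):
        prefix = history[:i]                    # h_m^{i-1}, the preceding partial history
        a, o = history[i], history[i + 1]
        p *= policy(prefix).get(a, 0.0)         # P(a_i | h^{i-1}, pi)
        p *= env(prefix + (a,)).get(o, 0.0)     # P(o_i | h^{i-1} a_i, mu)
    return p

def actions_only_prob(history, env):
    """P(h_m | mu): treat the actions of h_m as the fixed action sequence a(h_m)."""
    fixed = lambda prefix: {history[len(prefix)]: 1.0}
    return history_prob(history, fixed, env)

def posterior(history, prior):
    """P(mu | h_m, xi) by Bayes' rule, with `prior` a dict {mu: P(mu | xi)}."""
    joint = {mu: p * actions_only_prob(history, mu) for mu, p in prior.items()}
    total = sum(joint.values())
    return {mu: p / total for mu, p in joint.items()} if total > 0 else prior
```

Here the prior is simply a dictionary from environment objects to probabilities, and `posterior` implements the displayed Bayes formula.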
2.2 Running Example: Parental Career Instruction

Consider a child asking a parent for advice about the reward function for their future career. The child is hesitating between RB, the reward function that rewards becoming a banker, and RD, the reward function that rewards becoming a doctor. Suppose, for the sake of this example, that becoming a banker provides rewards more easily than becoming a doctor does.

Figure 2: Plate notation summary of Figure 1. The history h0 is the empty history. The hn appears twice, once on the left and once in the plate notation; thus R is a causal descendant of the policy π.

The child learns their career reward function by asking a parent about it. We assume that the parents always give a definitive answer, and that this completely resolves the question for the child. That is, the reward-function learning process of the child is to fix a reward function once it gets an answer from either parent.

We can formalise this as follows. Set the episode length to n = 1 (we only focus on a single action of the agent), and A = {M, F}, the actions of asking the mother and the father, respectively. The observations are the answers of the asked parent, O = {B, D}, answering banker or doctor. There are thus 2 × 2 = 4 histories. Since in this example the banker reward function RB is assumed to be easier to maximise than the doctor one, we set RB(h1) = 10 and RD(h1) = 1 for all histories h1 ∈ H1. The set of possible environments is M = {µBB, µBD, µDB, µDD}, with the first index denoting the mother's answer and the second the father's. The reward-function learning process ρ is:

$$\begin{aligned}
P(R_B \mid MB, \rho) &= P(R_B \mid FB, \rho) = 1, \qquad & P(R_D \mid MB, \rho) &= P(R_D \mid FB, \rho) = 0,\\
P(R_B \mid MD, \rho) &= P(R_B \mid FD, \rho) = 0, \qquad & P(R_D \mid MD, \rho) &= P(R_D \mid FD, \rho) = 1.
\end{aligned}\tag{2}$$

See Section 6 for some Q-learning experiments with a version of this learning process in a gridworld.

Figure 3: Environment and reward-function learning process. Though hn appears twice, once on the left and once in the plate notation, R is not a causal descendant of π, the policy. The node µ (with prior ξ) connects to every observation; the node π (chosen by the agent) to every action. The transition probabilities are given by the complete preceding history and by µ (for observations) or π (for actions). The node η sets the (posterior) probability of the reward function R, given µ. The final reward is r. Note that the full history hn has no directed path to R, unlike in Figure 2.

3 Uninfluenceable and Unriggable Reward-function Learning Processes

What would constitute a good reward-function learning process? In Figure 2, the reward function R is a causal descendant of hn, which itself is a causal descendant of the agent's policy π. Ideally, we do not want the reward function to be a causal descendant of the policy. Instead, we want it to be specified by the environment, as shown in the causal graph of Figure 3 (similar to the graphs in [Everitt and Hutter, 2019]). There, the reward function R is no longer a causal descendant of hn, and thus not of π. Instead, it is a function of µ, and of the node η, which gives a probability distribution over reward functions, conditional on environments. This is written as P(R | η, µ).

3.1 Uninfluenceable: Definition

The conditional distribution based on η is ideal for proper learning, but it has a different type than ρ, conditioning on environments instead of histories. Moreover, the agent has access to histories, but not directly to the environment. The prior ξ bridges this gap; given hn ∈ Hn, the conditional probabilities of R ∈ R can be inferred:

$$P(R \mid h_n, \eta, \xi) = \sum_{\mu \in M} P(R \mid \eta, \mu)\, P(\mu \mid h_n, \xi). \tag{3}$$

Via this equation, we have defined probabilities of reward functions conditional on histories; i.e. we have defined a reward-function learning process.
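Continuing the illustrative sketch above, the running example's ρ and the construction of Equation (3) can be written out directly; the environment and reward-function labels below are our own encoding, not the paper's.

```python
# The four deterministic environments of Section 2.2, and the learning process
# rho of Equation (2). Reward functions are identified by their labels.
def parent_env(mother, father):
    """One-step environment: the asked parent answers as specified."""
    def env(prefix_plus_action):
        action = prefix_plus_action[-1]                  # "M" or "F"
        return {mother if action == "M" else father: 1.0}
    return env

MU = {"BB": parent_env("B", "B"), "BD": parent_env("B", "D"),
      "DB": parent_env("D", "B"), "DD": parent_env("D", "D")}

RHO = {  # P(R | h1, rho) for the four one-step histories
    ("M", "B"): {"RB": 1.0, "RD": 0.0},
    ("F", "B"): {"RB": 1.0, "RD": 0.0},
    ("M", "D"): {"RB": 0.0, "RD": 1.0},
    ("F", "D"): {"RB": 0.0, "RD": 1.0},
}

def induced_learning_process(history, eta, prior):
    """Equation (3): P(R | h_n, eta, xi) = sum_mu P(R | eta, mu) P(mu | h_n, xi).

    `eta` maps each environment to a distribution over reward-function labels;
    `prior` maps each environment to P(mu | xi); `posterior` is the Bayes-rule
    function from the previous sketch.
    """
    post = posterior(history, prior)
    result = {}
    for mu, p_mu in post.items():
        for r, p_r in eta[mu].items():
            result[r] = result.get(r, 0.0) + p_r * p_mu
    return result
```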
An uninfluenceable ρ is one that comes from such an η in this way:

Definition 2 (Uninfluenceable). The reward-function learning process ρ is uninfluenceable given the prior ξ on M if there exists η, a probability distribution on R conditional on environments, such that for all R ∈ R and hn ∈ Hn:

$$P(R \mid h_n, \eta, \xi) = P(R \mid h_n, \rho).$$

So an uninfluenceable ρ is one that comes from an η; uninfluenceable and counterfactual reward modelling, as defined by [Everitt and Hutter, 2019], are both uninfluenceable in our sense.

3.2 Uninfluenceable: Example

Use the ρ of Section 2.2, and define the prior ξ1 as putting probability 1/2 on both µBB and µDD (so both parents agree). Then ρ is uninfluenceable. The η is:

$$P(R_B \mid \eta, \mu_{BB}) = P(R_D \mid \eta, \mu_{DD}) = 1, \qquad P(R_D \mid \eta, \mu_{BB}) = P(R_B \mid \eta, \mu_{DD}) = 0.$$

Then, since o1 is determined by µBB versus µDD, the Equation (2) for ρ on histories can be read off via Equation (3). Put more colloquially, the universe has determined that the mother and father agree on RB or RD being the proper reward function. Asking either parent simply allows the agent to figure out which world it is in.

3.3 Unriggable: Definition

Note that, if ρ is an uninfluenceable reward-function learning process for ξ, then it has the following algebraic property: for any hm ∈ Hm, R ∈ R, and π ∈ Π,

$$\begin{aligned}
\sum_{h_n \in H_n} P(h_n \mid h_m, \pi, \xi)\, P(R \mid h_n, \rho)
&= \sum_{h_n \in H_n} P(h_n \mid h_m, \pi, \xi) \sum_{\mu \in M} P(R \mid \eta, \mu)\, P(\mu \mid h_n, \xi)\\
&= \sum_{\mu \in M} P(\mu \mid h_m, \xi)\, P(R \mid \eta, \mu),
\end{aligned}$$

and that expression is independent of π.

Define the expectation⁸ of ρ to be the map eρ : Hn → R given by

$$e_\rho(h_n) = \sum_{R \in \mathcal{R}} P(R \mid h_n, \rho)\, R. \tag{5}$$

A fortiori, the expectation of eρ(hn) is independent of π:

$$\sum_{h_n \in H_n} P(h_n \mid h_m, \pi, \xi)\, e_\rho(h_n) = \sum_{\mu \in M} P(\mu \mid h_m, \xi)\, e_\eta(\mu),$$

with $e_\eta(\mu) = \sum_{R \in \mathcal{R}} P(R \mid \mu, \eta)\, R$, the expectation of η.

So the expectation of eρ is independent of the policy if ρ is uninfluenceable. That independence seems a desirable property; let us call it unriggable:

Definition 3 (Unriggable). The reward-function learning process ρ is unriggable for ξ if, for all hm ∈ Hm and all R ∈ R, the following expression is independent of π ∈ Π:

$$\sum_{h_n \in H_n} P(h_n \mid h_m, \pi, \xi)\, e_\rho(h_n). \tag{6}$$

Otherwise, ρ is said to be riggable (for ξ). Being riggable is akin to having a feedback tampering incentive in the sense of [Everitt and Hutter, 2019].

⁸Finite probability distributions with values in an affine space always have an expectation, which is an element of that space. Here we are using the fact that R is a vector space (hence an affine space), with scaling and adding working as:
$$(\alpha R + \beta R')(h_n) = \alpha R(h_n) + \beta R'(h_n). \tag{4}$$

If ρ is unriggable, then since Equation (6) is independent of π, we can now construct eρ on all histories hm ∈ Hm, m ≤ n, not just on complete histories hn ∈ Hn. We do this by writing, for any π:

$$e_\rho(h_m, \xi) = \sum_{h_n \in H_n} P(h_n \mid h_m, \pi, \xi)\, e_\rho(h_n). \tag{7}$$

Independence of policy means that, for any action am+1, the expectation $\sum_{o} e_\rho(h_m a_{m+1} o, \xi)\, P(o \mid h_m a_{m+1}, \xi)$ is also eρ(hm, ξ); thus the eρ form a martingale.

3.4 Unriggable: Example

Use the ρ of Section 2.2, and define the prior ξ2 as putting equal probability 1/4 on all four environments µBB, µBD, µDB, and µDD. We want to show that this makes ρ unriggable but influenceable; why this might be a problem is explored in Section 4.1.

Unriggable. The only non-trivial hm on which to test Equation (6) is the empty history h0. Let πM be the policy of asking the mother; then

$$\sum_{h_1 \in H_1} \sum_{R \in \mathcal{R}} P(h_1 \mid h_0, \pi_M, \xi_2)\, P(R \mid h_1, \rho)\, R = \frac{1}{2}(R_B + R_D).$$

The same result holds for πF, the policy of asking the father. Hence the same result holds for any stochastic policy as well, and ρ is unriggable for ξ2.
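The two examples can be checked numerically with the sketch developed so far (again purely illustrative): under ξ2 the expected learnt reward function of Equation (6) is the same whichever parent is asked, and under ξ1 the η of Section 3.2 reproduces ρ via Equation (3).

```python
# Checking the examples with the earlier sketch (illustrative only).
def expected_learnt_reward(action, rho, prior):
    """Mixture over reward-function labels from Equation (6) with h_m = h_0,
    for the deterministic one-step policy 'take `action`'."""
    mix = {"RB": 0.0, "RD": 0.0}
    for obs in ("B", "D"):
        h1 = (action, obs)
        # P(h1 | a1 = action, xi) = sum_mu P(mu | xi) P(o | a, mu)
        p_h1 = sum(p * mu((action,)).get(obs, 0.0) for mu, p in prior.items())
        for r, p_r in rho[h1].items():
            mix[r] += p_h1 * p_r
    return mix

xi2 = {MU["BB"]: 0.25, MU["BD"]: 0.25, MU["DB"]: 0.25, MU["DD"]: 0.25}
print(expected_learnt_reward("M", RHO, xi2))   # {'RB': 0.5, 'RD': 0.5}
print(expected_learnt_reward("F", RHO, xi2))   # same mixture: unriggable for xi2

# Section 3.2's uninfluenceable case: xi1 and the eta that matches rho.
xi1 = {MU["BB"]: 0.5, MU["DD"]: 0.5}
eta = {MU["BB"]: {"RB": 1.0}, MU["DD"]: {"RD": 1.0}}
print(induced_learning_process(("M", "B"), eta, xi1))  # probability 1 on RB, as in RHO
```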
Influenceable. To show that ρ is influenceable, assume, for contradiction, that it is uninfluenceable, and hence that there exists a causal structure as in Figure 3, with a given η that generates ρ via ξ2, as in Equation (3). Given history MB, µBB and µBD become equally likely, and RB becomes a certainty. So

$$\frac{1}{2}\big(P(R_B \mid \eta, \mu_{BB}) + P(R_B \mid \eta, \mu_{BD})\big) = 1.$$

Because probabilities cannot be higher than 1, this implies that P(RB | η, µBD) = 1. Conversely, given history FD, µDD and µBD become equally likely, and RD becomes a certainty. So, by the same reasoning, P(RD | η, µBD) = 1 and hence P(RB | η, µBD) = 0. This is a contradiction in the definition of P(RB | η, µBD), so the assumption that ρ is uninfluenceable for ξ2 must be wrong.

3.5 Riggable: Example

We finish by showing how the ρ of Section 2.2 can be riggable for another prior, ξ3. This prior has P(µBD | ξ3) = 1: the mother will answer banker, the father will answer doctor. It is riggable, since the only possible histories are MB and FD, with

$$\sum_{h_n \in H_n} P(h_n \mid a_1 = M, \xi_3)\, e_\rho(h_n) = e_\rho(MB) = R_B,$$
$$\sum_{h_n \in H_n} P(h_n \mid a_1 = F, \xi_3)\, e_\rho(h_n) = e_\rho(FD) = R_D,$$

clearly not independent of a1. Thus ρ is not really a learning process at all, for ξ3: the child gets to choose its career/reward function by choosing which parent to ask.

4 Properties of Unriggable and Riggable Learning Processes

If the learning process is influenceable, problems can emerge, as this section will show. Unriggable-but-influenceable processes allow the agent to choose spurious learning paths, even reversing standard learning, while a riggable learning process means the agent is willing to sacrifice value for all possible reward functions, with certainty, in order to push the learnt outcome⁹ in one direction or another.

4.1 Problems with Unriggable Learning Processes

Consider the following learning process: an agent is to play chess. A coin is flipped; if it comes up heads (o1 = H), the agent will play white, and its reward function is RW, the reward function that gives 1 iff white wins the match. If it comes up tails (o1 = T), the agent plays black, and has reward function RB = 1 − RW. Before the coin flip, the agent may take the action a1 = inv, which inverts which of RB and RW the agent will get (but not which side it plays), or it can take the null action a1 = 0, which does nothing. Define ρ as the learning process which determines the agent's reward function. This is unriggable:

$$\sum_{h_n \in H_n} P(R_W \mid h_n, \rho)\, P(h_n \mid a_1 = 0, \xi) = P(H) = 1/2,$$
$$\sum_{h_n \in H_n} P(R_W \mid h_n, \rho)\, P(h_n \mid a_1 = \mathrm{inv}, \xi) = P(T) = 1/2,$$

and the expressions with RB give the same 1/2 probabilities. Nevertheless, it is in the agent's interest to invert its reward by choosing a1 = inv, because it is a lot easier to deliberately lose a competitive game than to win it. So though ρ is unriggable, it can be manipulated, with the outcome completely reversed.

⁹Since the agent's actions push the expected learning in one direction, this is not learning in the commonly understood sense.
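A small self-contained numerical sketch of this example makes the point explicit; the win/loss abilities below (0.6 and 1.0) are our made-up assumptions, not numbers from the paper. The learnt distribution over reward functions is action-independent, but the attainable value is not.

```python
# Illustrative numbers only: suppose the agent can force a loss of its own game
# with probability 1.0, but can only force a win with probability 0.6.
P_ACHIEVE = {"win_own_side": 0.6, "lose_own_side": 1.0}

def learnt_distribution(action):
    """Distribution over the learnt reward function ('RW' or 'RB') after the
    fair coin flip, for action 'null' or 'inv'."""
    dist = {"RW": 0.0, "RB": 0.0}
    for coin, p in (("H", 0.5), ("T", 0.5)):
        target = "RW" if coin == "H" else "RB"          # default assignment
        if action == "inv":
            target = "RB" if target == "RW" else "RW"   # inversion
        dist[target] += p
    return dist

def expected_value(action):
    """Expected reward under the learnt function, given the assumed abilities."""
    value = 0.0
    for coin, p in (("H", 0.5), ("T", 0.5)):
        plays_white = (coin == "H")
        target = "RW" if coin == "H" else "RB"
        if action == "inv":
            target = "RB" if target == "RW" else "RW"
        # Reward 1 iff the target side wins; the agent plays white iff coin is H.
        wants_own_side_to_win = (target == "RW") == plays_white
        value += p * (P_ACHIEVE["win_own_side"] if wants_own_side_to_win
                      else P_ACHIEVE["lose_own_side"])
    return value

print(learnt_distribution("null"), learnt_distribution("inv"))  # identical: unriggable
print(expected_value("null"), expected_value("inv"))            # 0.6 vs 1.0: manipulable
```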
4.2 Unriggable to Uninfluenceable

Since unriggable was defined by one property of uninfluenceable learning systems (see Definition 3), uninfluenceable implies unriggable. And there is a partial converse:

Theorem 4 (Unriggable ⇒ Uninfluenceable). Let ρ be an unriggable learning process for ξ on M. Then there exist a (non-unique) ρ′, and an M′ with a prior ξ′, such that:
- ξ′ generates the same environment transition probabilities as ξ: for all hm, a, and o, P(o | hm a, ξ) = P(o | hm a, ξ′);
- the expectations eρ and eρ′ are equal: for all hn, $\sum_{R \in \mathcal{R}} P(R \mid h_n, \rho)\, R = \sum_{R \in \mathcal{R}} P(R \mid h_n, \rho')\, R$;
- ρ′ is uninfluenceable for ξ′ on M′.

Moreover, M′ can always be taken to be the full set of possible deterministic environments.

Since ρ and ρ′ have the same expectation, they have the same value function, and hence the same optimal behaviour (see Equation (8)). For the proof, see [Armstrong et al., 2020].

But why would this theorem be true? If we take

$$D(\pi, \pi') = \sum_{h_n} e_\rho(h_n)\, P(h_n \mid \pi, \xi) - \sum_{h_n} e_\rho(h_n)\, P(h_n \mid \pi', \xi),$$

the difference between two expectations given two policies, then D defines an algebraic obstruction to unriggability: as long as D ≠ 0, ρ cannot be unriggable or uninfluenceable. So the theorem says that if the obstruction D vanishes, we can find a large enough M′ and an η′ making eρ uninfluenceable. This is not surprising, as eρ is a map from Hn to R, while eη′ is a map from M′ to R. In most situations the full set of deterministic environments is larger than Hn, so we have great freedom to choose eη′ to fit eρ.

To illustrate, if we have the unriggable ρ on ξ2 as in Section 3.4, then we can build η′ with M′ = M, where the following probabilities are all 1:

$$P\big(\tfrac{3}{2}R_B - \tfrac{1}{2}R_D \mid \eta', \mu_{BB}\big),\quad P\big(\tfrac{3}{2}R_D - \tfrac{1}{2}R_B \mid \eta', \mu_{DD}\big),\quad P\big(\tfrac{1}{2}R_B + \tfrac{1}{2}R_D \mid \eta', \mu_{BD}\big),\quad P\big(\tfrac{1}{2}R_D + \tfrac{1}{2}R_B \mid \eta', \mu_{DB}\big).$$

4.3 Problems with Riggable Learning Processes: Sacrificing Reward with Certainty

This section shows how an agent, seeking to maximise the value of its learning process, can sacrifice all its reward functions (with certainty). To do that, we need to define the value of a reward-function learning process.

Value of a Learning Process. To maximise the expected reward, given a learning process ρ, one has to maximise the expected reward of the reward function ultimately learnt after n steps. The value of a policy π for ρ is hence:

$$V(h_m, \rho, \pi, \xi) = \sum_{h_n \in H_n} P(h_n \mid h_m, \pi, \xi) \sum_{R \in \mathcal{R}} P(R \mid h_n, \rho)\, R(h_n). \tag{8}$$

An equivalent way is to define the reward function Rρ ∈ R:

$$R_\rho(h_n) = \sum_{R \in \mathcal{R}} P(R \mid h_n, \rho)\, R(h_n),$$

and have the value of ρ be the expectation of the reward function¹⁰ Rρ. By this definition, it is clear that if ρ and ρ′ have the same expectations eρ and eρ′, then Rρ = Rρ′ and hence they have the same value; let πρ be a policy that maximises it.

Sacrificing Reward with Certainty. For a learning process ρ, define the image of ρ as im(ρ) = {R ∈ R : ∃ hn ∈ Hn s.t. P(R | hn, ρ) ≠ 0}, the set of reward functions on which ρ could put non-zero probability for some complete history.¹¹ Then:

Definition 5 (Sacrificing reward). The policy π′ sacrifices reward with certainty to π on history hm if, for all h′n, hn ∈ Hn with P(h′n | hm, π′, ξ) > 0 and P(hn | hm, π, ξ) > 0, and for all R ∈ im(ρ): R(hn) > R(h′n).

In other words, π is guaranteed to result in a better reward than π′, for all reward functions in the image of ρ.

¹⁰But it is not possible to deduce from Rρ whether ρ is riggable. Thus this paper focuses on whether a learning process is bad, not on whether the agent's reward function or behaviour is flawed.
¹¹The definition does not depend on ξ, so the hn are not necessarily possible. If we replace im(ρ) with im(ρ, ξ), adding the requirement that the hn used to define the image be possible histories given ξ, then Proposition 6 and Theorem 7 still apply with im(ρ, ξ) instead.
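As an illustration (building on the earlier sketch, with our own encoding of the reward functions), Equation (8) can be computed by brute force in the one-step running example, recovering the riggable behaviour of Section 3.5.

```python
# Illustrative: the value of a one-step policy for rho (Equation (8)) and a
# brute-force optimal deterministic policy in the running example.
R_FUNCS = {"RB": lambda h1: 10.0, "RD": lambda h1: 1.0}   # RB(h1) = 10, RD(h1) = 1

def policy_value(action, rho, prior):
    """V(h0, rho, pi, xi) for the deterministic one-step policy 'take `action`'."""
    value = 0.0
    for obs in ("B", "D"):
        h1 = (action, obs)
        p_h1 = sum(p * mu((action,)).get(obs, 0.0) for mu, p in prior.items())
        value += p_h1 * sum(p_r * R_FUNCS[r](h1) for r, p_r in rho[h1].items())
    return value

def optimal_action(rho, prior):
    return max(("M", "F"), key=lambda a: policy_value(a, rho, prior))

xi3 = {MU["BD"]: 1.0}                      # riggable prior of Section 3.5
print(policy_value("M", RHO, xi3))         # 10.0: mother says banker
print(policy_value("F", RHO, xi3))         # 1.0: father says doctor
print(optimal_action(RHO, xi3))            # 'M': the agent picks the easy reward
```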
The first result (proved in [Armstrong et al., 2020]) is a mild positive for unriggable reward-function learning processes:

Proposition 6 (Unriggable ⇒ no sacrifice). If ρ is an unriggable reward-function learning process in the sense of Definition 3, and πρ maximises the value of ρ as computed in Equation (8), then πρ never sacrifices reward with certainty to any other policy.

Relabelling the Reward Functions. Let σ be an affine map from R to itself, i.e. an injective map that sends (1−q)R + qR′ to (1−q)σ(R) + qσ(R′) for all R, R′ ∈ R and q ∈ ℝ. If ρ is a reward-function learning process, then define σ·ρ by:

$$P(R \mid h_n, \sigma \cdot \rho) = \sum_{R' : \sigma(R') = R} P(R' \mid h_n, \rho).$$

We can call this σ a relabelling of the reward functions: σ·ρ is structurally the same as ρ, just with its image reshuffled. Because σ is affine, it commutes with the weighted sum of Equation (6), so σ·ρ is unriggable if ρ is (and this "if" is an "iff" when σ is invertible). Then [Armstrong et al., 2020] proves the following result:

Theorem 7 (No sacrifice ⇒ unriggable). The reward-function learning process ρ is riggable if and only if there exists an affine relabelling σ : R → R such that πσ·ρ (the policy that optimises the value of σ·ρ) sacrifices reward with certainty (is strictly inferior to another policy on a given history, for all possible reward functions in the image of σ·ρ).

So any riggable ρ is just one relabelling of its rewards away from sacrificing reward with certainty. It can also exhibit other pathologies, like refusing to learn; see Section 6.

Example of Reward Sacrifice. For an easy example of this behaviour, consider the learning process of Section 3.5, with one modification: asking the mother (a1 = M) incurs an extra penalty of −1 for both reward functions, RB and RD. The agent will still ask the mother, getting a total reward of RB(MB) = 10 − 1 = 9, which is higher than the RD(FD) = 1 it gets from asking the father, even though a1 = M is worse for both reward functions.

5 Counterfactually Uninfluenceable Learning

This section shows how any reward-function learning process can be made uninfluenceable by using counterfactuals; see Section 4.6 of [Everitt and Hutter, 2019] as well as [Everitt, 2018]. The counterfactual oracle of [Armstrong, 2017] can be considered a version of this, too.

For example, if we told a new agent "maximise the value of the reward that was written in this envelope an hour ago", then (if that instruction is properly obeyed) the agent has an uninfluenceable learning process. If we instead said "maximise the value of the reward that will be written in this envelope in an hour's time", then that is highly riggable, since the agent can simply write its own reward. But if instead we had said "maximise the reward that would have been written in this envelope in an hour's time, if we had not turned you on", then this is uninfluenceable again. The agent can still go and write its own message, but this does not tell it anything about what would otherwise have been there.

We can formalise this thought experiment. Given any reward-function learning process ρ, and any policy π ∈ Π, we can define a distribution ηπ over reward functions, conditional on environments (and hence, via Equation (3), for a prior ξ we have an uninfluenceable learning process ρπ). For any µ ∈ M, the policy π gives a distribution over complete histories. Each complete history gives a distribution over reward functions, via ρ. Therefore, taking expectations, any µ gives, via π, a distribution over reward functions:

$$P(R \mid \eta_\pi, \mu) = \sum_{h_n \in H_n} P(h_n \mid \pi, \mu)\, P(R \mid h_n, \rho).$$

See Section 6 for an example of this counterfactual construction, where the original learning process leads to pathological behaviour, but the counterfactual version does not. Since unriggability is an algebraic condition, it is also possible to make a process unriggable algebraically; see [Armstrong et al., 2020].
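A sketch of this construction on the running example, assuming the default policy "always ask the mother" (the helper names are ours, and `induced_learning_process` comes from the earlier sketch):

```python
# Illustrative sketch of the counterfactual construction: build eta_pi from rho
# and a fixed default policy, here "always ask the mother".
def counterfactual_eta(rho, default_action="M"):
    """eta_pi(mu): P(R | eta_pi, mu) = sum_h P(h | pi, mu) P(R | h, rho)."""
    def eta_of(mu):
        dist = {}
        for obs in ("B", "D"):
            p_h1 = mu((default_action,)).get(obs, 0.0)   # P(h1 | pi, mu)
            for r, p_r in rho[(default_action, obs)].items():
                dist[r] = dist.get(r, 0.0) + p_h1 * p_r
        return dist
    return {mu: eta_of(mu) for mu in MU.values()}

ETA_PI = counterfactual_eta(RHO)
# The counterfactual learning process rho_pi is the induced process of
# Equation (3); it ignores which parent the agent actually asked.
print(induced_learning_process(("F", "D"), ETA_PI, {MU["BD"]: 1.0}))
# Under the prior concentrated on mu_BD, the mother would have said "banker",
# so rho_pi puts probability 1 on RB even though the father answered "doctor".
```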
6 Experiments

Here we experimentally contrast a riggable agent, an influenceable (but unriggable) agent, and an uninfluenceable agent. This illustrates pathological behaviours of influenceable and riggable agents: learning the wrong thing, choosing to learn when they already know, and simply refusing to learn.

6.1 Environmental Setup

The environment is a 4 × 3 gridworld, formalising the example of Section 2.2: an agent asking a parent what the correct course of action is. The agent starts in the left middle square and can go north (up), south, east, or west. The father is one square to the west, the mother two squares to the east. Above the agent is a collection of money (for a banker), and, below, a stethoscope (for a doctor); see Figure 4. Episodes end if the agent enters either of these two squares or walks into a wall.

6.2 Reward Functions

There are two reward functions: RB, the banker reward function (easier to maximise), and RD, the doctor reward function (harder to maximise). If the agent enters the square with the money, RB gives a reward of 10; if the agent enters the square with the stethoscope, RD gives a reward of 1. Otherwise, each reward function simply gives a reward of −0.1 each turn.

Figure 4: The agent/robot can go north to collect money (reward 10 from reward function RB) or south to get the stethoscope (reward 1 from reward function RD). Which is the correct reward function is not known to the agent, who has to ask either the father (west) or the mother (two squares east); it asks by entering the corresponding square. Each turn it takes also gives it a reward of −0.1 for both reward functions.

6.3 Learning Processes

Upon first entering a square with either parent, the agent is informed of what its correct career reward function is. The update is the same as that of Equation (2) in Section 2.2, with MB meaning "asked the mother first, who says banker"; MD, FB, and FD are defined analogously. In the current setup, the agent also has the possibility of going straight for the money or stethoscope. In that case, the most obvious default is to be uncertain between the two options; so, if the agent has not asked either parent in h:

$$P(R_B \mid h, \rho) = P(R_D \mid h, \rho) = 1/2.$$

6.4 The Environments

The set M has four deterministic environments: µBB, where both parents will answer "banker"; µBD, where the mother will answer "banker" and the father "doctor"; µDB, the opposite; and µDD, where they both answer "doctor". We will consider four priors: ξBD, which puts all the mass on µBD (and makes ρ riggable); ξDD, which puts all the mass on µDD (ρ riggable); ξ2, which finds each of the four environments equally probable (see Section 3.4; ρ unriggable); and ξ1, which finds µBB and µDD equally probable (see Section 3.2; ρ uninfluenceable).
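The following is a minimal sketch of this setup, with the grid coordinates assumed from the description above (the exact layout and the wall-penalty bookkeeping in the paper's implementation may differ):

```python
# A minimal gridworld sketch of the setup (layout coordinates are assumptions,
# chosen to match the description: 4x3 grid, start at the left-middle square).
WIDTH, HEIGHT = 4, 3
START = (1, 1)                     # (x, y); y = 0 is the top row
MONEY, STETHOSCOPE = (1, 0), (1, 2)
FATHER, MOTHER = (0, 1), (3, 1)
MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}

def step(pos, belief, action, env):
    """One transition. `belief` is 'B', 'D' or 'U' (uncertain); `env` maps
    'mother'/'father' to their answer. Returns (pos, belief, rB, rD, done)."""
    dx, dy = MOVES[action]
    nxt = (pos[0] + dx, pos[1] + dy)
    if not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        return pos, belief, -0.1, -0.1, True          # walked into a wall: episode ends
    if nxt == MONEY:
        return nxt, belief, 10.0 - 0.1, -0.1, True    # RB pays out for the money
    if nxt == STETHOSCOPE:
        return nxt, belief, -0.1, 1.0 - 0.1, True     # RD pays out for the stethoscope
    if belief == "U" and nxt in (FATHER, MOTHER):
        belief = env["father"] if nxt == FATHER else env["mother"]
    return nxt, belief, -0.1, -0.1, False
```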
6.5 The Agents' Algorithm and Performance

We consider two agents for each prior: one using the possibly riggable or influenceable ρ (the standard agent), and one using the uninfluenceable ρπ (the counterfactual agent, defined in Section 5). The default policy π for ρπ is {east, east, east}, which involves going straight to the mother (and then terminating the episode); consequently the counterfactual learning process reduces to "the correct reward function is the one stated by the mother". The ρπ will be taken as the correct learning process, and the performance of each agent will be compared with this.

Figure 5: Plot of estimated value of the optimal policy versus number of rounds of Q-learning. The counterfactual agent (red) outperforms the standard agent (blue), because it already knows that the correct reward function is RB, while the standard agent has to nominally go and "check". The shaded area represents one standard deviation over 1000 runs.

We can see the agent as operating in a partially observable Markov decision process (POMDP), modelled as an MDP over the belief states of the agent. These states are: the agent believes the correct reward is RB with certainty, RD with certainty, or is equally uncertain between the two. So if P represents the twelve physical locations the agent can be in, the total state space of the agent consists of {RB, RD, ½RB + ½RD} × P: 36 possible states.

We train our agents' policies, and the value of these policies, via Q-learning ([Sutton and Barto, 1998]). The exploration rate is ε = 0.1, and the agent is run for at most ten steps each episode. We use a learning rate of 1/n. Here n is not the number of episodes the agent has taken, but the number of times the agent has been in state s and taken action a so far, hence the number of times it has updated Q(s, a). So each Q-value has a different n. For each setup, we run Q-learning 1000 times. We graph the average Q-values for the resulting policies, as well as one standard deviation around them.

Riggable Behaviour: Pointless Questions

For ξBD, the standard agent transitions to knowing RB upon asking the mother, and RD upon asking the father. Since the mother always says "banker", and since the default policy is to ask her, ρπ will always select RB. The two agents' performance is graphed in Figure 5. The counterfactual agent swiftly learns the optimal policy: go straight north from the starting position, getting the money and a reward of 10 − 0.1 = 9.9. The standard agent learns to end up in the same location, but in a clunkier fashion: for its optimal policy, it has to go and check that the mother really prefers it to become a banker. This is because it can only get P(RB | h) = 1 if MB is included in h; knowing that it would get that answer if it asked is not enough to get around asking. This policy gives it a maximal reward of 10 − 0.5 = 9.5, since it takes five turns to go to the mother and return to the money square. This is an example of Theorem 7, with the riggable agent losing value with certainty for both RB and RD: just going north is strictly better for both reward functions.

Figure 6: Plot of estimated value of the optimal policy versus number of rounds of Q-learning. The counterfactual agent (red) learns to go south immediately, while the standard agent (blue) learns to go north. The standard agent has a nominally higher value function, but its true reward is −0.1: it knows the correct reward function is RD, but avoids asking either parent. The shaded area represents one standard deviation over 1000 runs.
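A sketch of the tabular Q-learning loop with this per-state-action learning rate, reusing the gridworld sketch above; the per-step reward bookkeeping for the standard agent (paying out the reward of its current belief) is our simplification and may differ from the paper's implementation.

```python
import random
from collections import defaultdict

# Illustrative tabular Q-learning over (belief, position) states, with the
# per-state-action learning rate 1/n(s, a) described above (undiscounted).
Q = defaultdict(float)       # Q[(state, action)]
N = defaultdict(int)         # visit counts n(s, a)
ACTIONS = list(MOVES)        # from the gridworld sketch above
EPSILON, MAX_STEPS = 0.1, 10

def run_episode(env):
    pos, belief = START, "U"
    for _ in range(MAX_STEPS):
        state = (belief, pos)
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        pos, belief, rB, rD, done = step(pos, belief, action, env)
        # The standard agent is trained on the reward of its current belief:
        # rB if it believes RB, rD if it believes RD, the average if uncertain.
        reward = {"B": rB, "D": rD, "U": 0.5 * rB + 0.5 * rD}[belief]
        nxt = (belief, pos)
        N[(state, action)] += 1
        alpha = 1.0 / N[(state, action)]
        target = reward if done else reward + max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        if done:
            break
```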
6.6 Riggable Behaviour: Ask No Questions

For ξDD, the optimal policies are simple: the counterfactual agent knows it must get the stethoscope (since the mother will say "doctor"), so it does that, for a reward of 1 − 0.1 = 0.9. The standard agent, on the other hand, has a problem: as long as it does not ask either parent, its reward function will remain ½RB + ½RD. As soon as it asks, though, the reward function will become RD, which gives it little reward. Thus, even though it knows that its reward function would be RD if it asked, it avoids asking and goes straight north, giving it a nominal total reward of ½ · 10 − 0.1 = 4.9. However, for the correct reward function RD, the standard agent's behaviour only gets it −0.1. See Figure 6.

6.7 Unriggable, Influenceable Behaviour

For the prior ξ2, the standard agent moves to RB or RD, with equal probability, the first time it asks either parent. Its nominal optimal policy is to ask the father, then get the money or the stethoscope depending on the answer. So half the time it gets a reward of 10 − 0.3 = 9.7, and the other half a reward of 1 − 0.3 = 0.7, for a (nominal) expected reward of 5.2. The counterfactual agent also updates to RB or RD, with equal probability, but only when asking the mother. Since it takes five turns rather than three to get to the mother and back, it will get a total expected reward of 5.0. Learning these optimal policies is slow; see Figure 7. However, the standard agent's correct reward function is given by the mother, not the father. When they disagree, the reward is −0.3, for a true expected reward of 2.45.

6.8 Uninfluenceable

For ξ1, asking either parent is equivalent, so ρ and ρπ encode exactly the same learning process. So the two agents converge on the same optimal policy (ask the father, then act on that answer) with a value of 5.2. The plot for these two almost-identical convergences is not included here.

Figure 7: Plot of estimated value of the optimal policy versus number of rounds of Q-learning. The counterfactual agent (red) learns to ask the mother, while the standard agent (blue) asks the father, since he is closer. The standard agent has a nominally higher value function, but its true reward is 2.45, since the mother's statement gives the correct reward function. The shaded area represents one standard deviation over 1000 runs.

7 Conclusion

We have identified and formalised two theoretical properties of reward-function learning processes: unriggability (an algebraic restriction on the learning process's updates) and the stronger condition of uninfluenceability (learning defined entirely by background facts about the environment). Unriggability is equivalent to uninfluenceability if the set of possible environments is large enough.

These properties are desirable: in an influenceable situation, the agent can sometimes manipulate the learning process, by, for example, inverting it completely. Riggable learning processes are even more manipulable; the agent may choose to sacrifice reward for all possible reward functions, with certainty, to push its learning towards easy rewards.

The first step in avoiding these pitfalls is to be aware of them. A second step is to improve the agent's learning process, similarly to the counterfactual approach of this paper.
If a learning process allows humans to modify it at run time, it will almost certainly be riggable. Hence unriggable or uninfluenceable learning processes, though desirable, require that much be sorted out rigorously in advance. Fully defining many uninfluenceable reward functions could also require solving the symbol grounding problem ([Vogt, 2007]): if a learning process is "asking a parent", then "asking" and "parent" need to be defined. But this is far beyond the scope of this paper.

Further research could apply learning methods to common benchmark problems, and extensions of those, to see how general these issues are, and whether there could exist some compromise learning processes that are almost unriggable but also easier to specify and modify at run time.

Acknowledgments

We wish to thank Michael Cohen, Shane Legg, Owain Evans, Jelena Luketina, Tom Everitt, Tor Lattimore, Jessica Taylor, Paul Christiano, Ryan Carey, Eliezer Yudkowsky, Anders Sandberg, and Nick Bostrom, among many others. This work was supported by the Alexander Tamas programme on AI safety research, DeepMind, the Leverhulme Trust, and the Machine Intelligence Research Institute.

References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Ng. Apprenticeship Learning via Inverse Reinforcement Learning. In International Conference on Machine Learning, pages 1–8, 2004.

[Abel et al., 2017] David Abel, John Salvatier, Andreas Stuhlmüller, and Owain Evans. Agent-Agnostic Human-in-the-Loop Reinforcement Learning. arXiv preprint arXiv:1701.04079, 2017.

[Akrour et al., 2012] Riad Akrour, Marc Schoenauer, and Michèle Sebag. APRIL: Active Preference Learning-based Reinforcement Learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116–131. Springer, 2012.

[Amin and Singh, 2016] Kareem Amin and Satinder Singh. Towards Resolving Unidentifiability in Inverse Reinforcement Learning. arXiv preprint arXiv:1601.06569, 2016.

[Armstrong et al., 2020] Stuart Armstrong, Jan Leike, Laurent Orseau, and Shane Legg. Pitfalls of Learning a Reward Function Online. arXiv e-prints, April 2020.

[Armstrong, 2017] Stuart Armstrong. Good and Safe Uses of AI Oracles. arXiv preprint arXiv:1711.05541, 2017.

[Bostrom, 2014] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.

[Buntine, 1994] Wray L. Buntine. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2:159–225, 1994.

[Choi and Kim, 2011] Jaedeug Choi and Kee-Eung Kim. Inverse Reinforcement Learning in Partially Observable Environments. Journal of Machine Learning Research, 12:691–730, 2011.

[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, pages 4302–4310, 2017.

[Everitt and Hutter, 2016] Tom Everitt and Marcus Hutter. Avoiding Wireheading with Value Reinforcement Learning. In International Conference on Artificial General Intelligence, pages 12–22. Springer, 2016.

[Everitt and Hutter, 2019] Tom Everitt and Marcus Hutter. Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. arXiv preprint arXiv:1908.04734, 2019.

[Everitt, 2018] Tom Everitt. Towards Safe Artificial General Intelligence. PhD thesis, The Australian National University, 2018.
[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative Inverse Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.

[Hadfield-Menell et al., 2017] Dylan Hadfield-Menell, Smitha Milli, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Inverse Reward Design. In Advances in Neural Information Processing Systems, pages 6749–6758, 2017.

[Hutter, 2004] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, 2004.

[Ibarz et al., 2018] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward Learning from Human Preferences and Demonstrations in Atari. In Advances in Neural Information Processing Systems, pages 8011–8023, 2018.

[Kahneman, 2011] Daniel Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.

[Lee et al., 2017] Chun-I Lee, I-Ping Chen, Chi-Min Hsieh, and Chia-Ning Liao. Design Aspects of Scoring Systems in Game. Art and Design Review, 05(01):26–43, 2017.

[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI Safety Gridworlds. arXiv preprint arXiv:1711.09883, 2017.

[MacGlashan et al., 2017] James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E. Taylor, and Michael L. Littman. Interactive Learning from Policy-dependent Human Feedback. arXiv preprint arXiv:1701.06049, 2017.

[Ng and Russell, 2000] Andrew Y. Ng and Stuart J. Russell. Algorithms for Inverse Reinforcement Learning. In International Conference on Machine Learning, pages 663–670, 2000.

[Pearl, 2009] Judea Pearl. Causality. Cambridge University Press, 2009.

[Pilarski et al., 2011] Patrick M. Pilarski, Michael R. Dawson, Thomas Degris, Farbod Fahimi, Jason P. Carey, and Richard S. Sutton. Online Human Training of a Myoelectric Prosthesis Controller via Actor-Critic Reinforcement Learning. In Rehabilitation Robotics, pages 1–7, 2011.

[Sutton and Barto, 1998] Richard Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Vogt, 2007] Paul Vogt. Language Evolution and Robotics: Issues on Symbol Grounding and Language Acquisition. In Artificial Cognition Systems, pages 176–209. IGI Global, 2007.

[Yudkowsky, 2008] Eliezer Yudkowsky. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Nick Bostrom and Milan M. Ćirković, editors, Global Catastrophic Risks, pages 308–345, New York, 2008. Oxford University Press.