# GOODHART'S LAW IN REINFORCEMENT LEARNING

Published as a conference paper at ICLR 2024

Jacek Karwowski¹, Oliver Hayman¹, Xingjian Bai¹, Klaus Kiendlhofer², Charlie Griffin¹, Joar Skalse¹,³
¹University of Oxford  ²Independent  ³Future of Humanity Institute
Correspondence to jacek.karwowski@cs.ox.ac.uk

ABSTRACT

Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall, and derive theoretical regret bounds for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically principled study of reinforcement learning under reward misspecification.

1 INTRODUCTION

To solve a problem using Reinforcement Learning (RL), it is necessary first to formalise that problem using a reward function (Sutton & Barto, 2018). However, due to the complexity of many real-world tasks, it is exceedingly difficult to directly specify a reward function that fully captures the task in the intended way. Misspecified reward functions will often lead to undesirable behaviour (Paulus et al., 2018; Ibarz et al., 2018; Knox et al., 2023; Pan et al., 2021). This makes designing good reward functions a major obstacle to using RL in practice, especially for safety-critical applications. An increasingly popular solution is to learn reward functions from mechanisms such as human or automated feedback (e.g. Christiano et al., 2017; Ng & Russell, 2000). However, this approach comes with its own set of challenges: the right data can be difficult to collect (e.g. Paulus et al., 2018), and it is often challenging to interpret it correctly (e.g. Mindermann & Armstrong, 2018; Skalse & Abate, 2023). Moreover, optimising a policy against a learned reward model effectively constitutes a distributional shift (Gao et al., 2023); i.e., even if a reward function is accurate under the training distribution, it may fail to induce desirable behaviour from the RL agent. Therefore, in practice it is often more appropriate to think of the reward function as a proxy for the true objective rather than as the true objective itself. This means that we need a more principled understanding of what happens when a proxy reward is maximised, in order to know how we should expect RL systems to behave, and in order to design better algorithms. For example, we aim to answer questions such as: When is a proxy safe to maximise without constraint? What is the best way to maximise a misspecified proxy? What types of failure modes should we expect from a misspecified proxy?
Currently, the field of RL largely lacks rigorous answers to these types of questions. In this paper, we study the effects of proxy misspecification through the lens of Goodhart's law, an informal principle often stated as "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes" (Goodhart, 1984), or more simply: "when a measure becomes a target, it ceases to be a good measure". For example, a student's knowledge of a subject may, by default, be correlated with their ability to pass exams on that subject. However, students who have sufficiently strong incentives to do well in exams may also turn to strategies such as cheating, which increase their test scores without increasing their understanding. In the context of RL, we can think of a misspecified proxy reward as a measure that is correlated, but not robustly aligned, with the true objective across some distribution of policies. Goodhart's law then says, informally, that we should expect optimisation of the proxy to initially lead to improvements on the true objective, up until a point where the correlation between the proxy reward and the true objective breaks down, after which further optimisation should lead to worse performance according to the true objective (Figure 1).

Figure 1: A cartoon of Goodharting.

In this paper, we present several novel contributions. First, we show that Goodharting occurs with high probability for a wide range of environments and pairs of true and proxy reward functions. Next, we provide a mechanistic explanation of why Goodhart's law emerges in RL. We use this to derive two new policy optimisation methods and show that they provably avoid Goodharting. Finally, we evaluate these methods empirically. We thus contribute towards building a better understanding of the dynamics of optimising towards imperfect proxy reward functions, and show that these insights may be used to design new algorithms.

1.1 RELATED WORK

Goodhart's law was first introduced by Goodhart (1984), and has later been elaborated upon by works such as Manheim & Garrabrant (2019). Goodhart's law has also previously been studied in the context of machine learning. In particular, Hennessy & Goodhart (2023) investigate Goodhart's law analytically in the context where a machine learning model is used to evaluate an agent's actions; unlike them, we specifically consider the RL setting. Ashton (2021) shows by example that RL systems can be susceptible to Goodharting in certain situations. In contrast, we show that Goodhart's law is a robust phenomenon across a wide range of environments, explain why it occurs in RL, and use it to devise new solution methods.

In the context of RL, Goodhart's law is closely related to reward gaming. Specifically, if reward gaming means an agent finding an unintended way to increase its reward, then Goodharting is an instance of reward gaming where optimisation of the proxy initially leads to desirable behaviour, followed by a decrease after some threshold. Krakovna et al. (2020) list illustrative examples of reward hacking, while Pan et al. (2021) manually construct proxy rewards for several environments and then demonstrate that most of them lead to reward hacking.
Zhuang & Hadfield-Menell (2020) consider proxy rewards that depend on a strict subset of the features which are relevant to the true reward, and then show that optimising such a proxy may in some cases be arbitrarily bad, given certain assumptions. Skalse et al. (2022) introduce a theoretical framework for analysing reward hacking. They then demonstrate that, in any environment and for any true reward function, it is impossible to create a non-trivial proxy reward that is guaranteed to be unhackable. Also relevant, Everitt et al. (2017) study the related problem of reward corruption, Song et al. (2019) investigate overfitting in model-free RL due to spurious correlations in the environment, and Pang et al. (2022) examine reward gaming in language models. Unlike these works, we analyse reward hacking through the lens of Goodhart's law and show that this perspective provides novel insights.

Gao et al. (2023) consider the setting where a large language model is optimised against a reward model that has been trained on a "gold standard" reward function, and investigate how the performance of the language model according to the gold-standard reward scales with the size of the language model, the amount of training data, and the size of the reward model. They find that the performance of the policy follows a Goodhart curve, where the slope gets less prominent for larger reward models and larger amounts of training data. Unlike them, we do not focus only on language; rather, we aim to establish to what extent Goodhart dynamics occur for a wide range of RL environments. Moreover, we also aim to explain Goodhart's law, and to use it as a starting point for developing new algorithms.

2 PRELIMINARIES

A Markov Decision Process (MDP) is a tuple $\langle S, A, \tau, \mu, R, \gamma \rangle$, where $S$ is a set of states, $A$ is a set of actions, $\tau : S \times A \to \Delta(S)$ is a transition function describing the outcomes of taking actions at certain states, $\mu \in \Delta(S)$ is the distribution of the initial state, $R \in \mathbb{R}^{|S \times A|}$ gives the reward for taking actions at each state, and $\gamma \in [0, 1]$ is a time discount factor. In the remainder of the paper, we consider $A$ and $S$ to be finite. Our work will mostly be concerned with rewardless MDPs, denoted by MDP$\backslash$R $= \langle S, A, \tau, \mu, \gamma \rangle$, where the true reward $R$ is unknown. A trajectory is a sequence $\xi = (s_0, a_0, s_1, a_1, \ldots)$ such that $a_i \in A$ and $s_i \in S$ for all $i$. We denote the space of all trajectories by $\Xi$. A policy is a function $\pi : S \to \Delta(A)$. We say that the policy $\pi$ is deterministic if for each state $s$ there is some $a \in A$ such that $\pi(s) = \delta_a$. We denote the space of all policies by $\Pi$ and the set of all deterministic policies by $\Pi_0$. Each policy $\pi$ on an MDP$\backslash$R induces a probability distribution over trajectories $P(\xi \mid \pi)$; drawing a trajectory $(s_0, a_0, s_1, a_1, \ldots)$ from a policy $\pi$ means that $s_0$ is drawn from $\mu$, each $a_i$ is drawn from $\pi(s_i)$, and $s_{i+1}$ is drawn from $\tau(s_i, a_i)$ for each $i$. For a given MDP, the return of a trajectory $\xi$ is defined to be $G(\xi) := \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, and the expected return of a policy $\pi$ to be $J(\pi) = \mathbb{E}_{\xi \sim \pi}[G(\xi)]$. An optimal policy is one that maximises expected return; the set of optimal policies is denoted by $\pi^*$. There might be more than one optimal policy, but the set $\pi^*$ always contains at least one deterministic policy (Sutton & Barto, 2018). We define the value function $V^\pi : S \to \mathbb{R}$ such that $V^\pi[s] = \mathbb{E}_{\xi \sim \pi}[G(\xi) \mid s_0 = s]$, and define the $Q$-function $Q^\pi : S \times A \to \mathbb{R}$ to be $Q^\pi(s, a) = \mathbb{E}_{\xi \sim \pi}[G(\xi) \mid s_0 = s, a_0 = a]$. $V^*$ and $Q^*$ are the value and $Q$-functions under an optimal policy. Given an MDP$\backslash$R, each reward $R$ defines a separate $V^\pi_R$, $Q^\pi_R$, and $J_R(\pi)$. In the remainder of this section, we fix a particular MDP$\backslash$R $= \langle S, A, \tau, \mu, \gamma \rangle$.
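Since the experiments later in the paper use value iteration as the underlying solver, a minimal tabular sketch may be useful for concreteness. This is an illustrative implementation of standard value iteration (not the authors' code); it assumes that the transition function and reward are given as dense NumPy arrays `tau` and `R`.

```python
import numpy as np

def value_iteration(tau, R, gamma, tol=1e-8):
    """Compute Q* and a greedy optimal policy for the MDP <S, A, tau, mu, R, gamma>.

    tau: array of shape (S, A, S), tau[s, a, s'] = P(s' | s, a)
    R:   array of shape (S, A), reward for taking action a in state s
    """
    n_states, n_actions, _ = tau.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                    # V*(s) = max_a Q*(s, a)
        Q_new = R + gamma * tau @ V          # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new
```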
2.1 THE CONVEX PERSPECTIVE

In this section, we introduce some theoretical constructs that are needed to express many of our results. We first need to familiarise ourselves with the occupancy measures of policies:

Definition 1 (State-action occupancy measure). We define a function $\eta : \Pi \to \mathbb{R}^{|S \times A|}$, assigning to each $\pi \in \Pi$ a vector of occupancy measures describing the discounted frequency with which a policy takes each action in each state. Formally,
$$\eta^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t \, P(s_t = s, a_t = a \mid \xi \sim \pi).$$

We can recover $\pi$ from $\eta^\pi$ on all visited states via $\pi(a \mid s) = \eta^\pi(s, a) / \bigl(\sum_{a' \in A} \eta^\pi(s, a')\bigr)$. If $\sum_{a' \in A} \eta^\pi(s, a') = 0$, we can set $\pi(\cdot \mid s)$ arbitrarily. This means that we can often decide to work with the set of possible occupancy measures, rather than the set of all policies. Moreover:

Proposition 1. The set $\Omega = \{\eta^\pi : \pi \in \Pi\}$ is the convex hull of the finite set of points corresponding to the deterministic policies $\{\eta^\pi : \pi \in \Pi_0\}$. It lies in an affine subspace of dimension $|S|(|A| - 1)$.

Note that $J_R(\pi) = \eta^\pi \cdot R$, meaning that each reward $R$ induces a linear function on the convex polytope $\Omega$, which reduces finding the optimal policy to solving a linear programming problem in $\Omega$. Many of our results crucially rely on this insight. We denote the orthogonal projection map from $\mathbb{R}^{|S \times A|}$ to $\mathrm{span}(\Omega)$ by $M_\tau$, which means that $J_R(\pi) = \eta^\pi \cdot M_\tau R$. The proof of Proposition 1, and all other proofs, are given in the appendix.

2.2 QUANTIFYING GOODHART'S LAW

Our work is concerned with quantifying the Goodhart effect. To do this, we need a way to quantify the distance between rewards. We do this using the projected angle between reward vectors.

Definition 2 (Projected angle). Given two reward functions $R_0, R_1$, we define $\arg(R_0, R_1)$ to be the angle between $M_\tau R_0$ and $M_\tau R_1$.

The projected angle distance is an instance of a STARC metric, introduced by Skalse et al. (2023a).¹ Such metrics enjoy strong theoretical guarantees and satisfy many desirable desiderata for reward function metrics. For details, see Skalse et al. (2023a). In particular:

Proposition 2. We have $\arg(R_0, R_1) = 0$ if and only if $R_0, R_1$ induce the same ordering of policies, or, in other words, $J_{R_0}(\pi) \le J_{R_0}(\pi') \iff J_{R_1}(\pi) \le J_{R_1}(\pi')$ for all policies $\pi, \pi'$.

¹In their terminology, the canonicalisation function is $M_\tau$, and measuring the angle between the resulting vectors is (bilipschitz) equivalent to normalising and measuring the distance with the $\ell_2$-norm.
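To make Definitions 1 and 2 concrete, the following sketch computes occupancy measures by solving the linear system implied by Definition 1, and estimates the projected angle by building an approximate orthonormal basis of span(Ω) from the occupancy measures of randomly sampled deterministic policies. This is an illustrative approximation rather than the method used in the paper's experiments; the function names and array conventions are our own.

```python
import numpy as np

def occupancy_measure(tau, mu, gamma, policy):
    """eta_pi(s, a) = sum_t gamma^t P(s_t = s, a_t = a), flattened to a vector.

    tau: (S, A, S) transitions, mu: (S,) initial distribution, policy: (S, A) matrix pi(a|s).
    """
    n_states, _, _ = tau.shape
    P_pi = np.einsum("sa,sat->st", policy, tau)        # P_pi[s, s'] = sum_a pi(a|s) tau[s, a, s']
    # Discounted state visitation m satisfies m = mu + gamma * P_pi^T m.
    m = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu)
    return (policy * m[:, None]).reshape(-1)           # eta(s, a) = m(s) * pi(a|s)

def projected_angle(tau, mu, gamma, R0, R1, n_samples=500, seed=0):
    """Approximate arg(R0, R1): the angle between M_tau R0 and M_tau R1 (Definition 2)."""
    n_states, n_actions, _ = tau.shape
    rng = np.random.default_rng(seed)
    etas = []
    for _ in range(n_samples):                          # occupancy measures of random deterministic policies
        det = np.zeros((n_states, n_actions))
        det[np.arange(n_states), rng.integers(n_actions, size=n_states)] = 1.0
        etas.append(occupancy_measure(tau, mu, gamma, det))
    U, sing, _ = np.linalg.svd(np.stack(etas).T, full_matrices=False)
    basis = U[:, sing > 1e-9 * sing[0]]                 # basis of the sampled part of span(Omega)
    p0 = basis @ (basis.T @ R0.reshape(-1))
    p1 = basis @ (basis.T @ R1.reshape(-1))
    cos = p0 @ p1 / (np.linalg.norm(p0) * np.linalg.norm(p1))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```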
Figure 2: Depiction of Goodharting in a Random MDP. Compare to Figure 1; here we only show the true reward obtained by a policy trained on each proxy. A darker colour means a more distant proxy.

We also need a way to quantify optimisation pressure. We do this using two different training methods. Both are parametrised by a regularisation strength $\alpha \in (0, \infty)$: given a reward $R$, they output a regularised policy $\pi_\alpha$. For ease of discussion and plotting, it is often more appropriate to refer to the (bounded) inverse of the regularisation strength, the optimisation pressure $\lambda_\alpha = e^{-\alpha}$. As the optimisation pressure increases, $J(\pi_\alpha)$ also increases.

Definition 3 (Maximal Causal Entropy). We denote by $\pi_\alpha$ the optimal policy according to the regularised objective $\bar{R}(s, a) := R(s, a) + \alpha H(\pi(s))$, where $H(\pi(s))$ is the Shannon entropy.

Definition 4 (Boltzmann Rationality). The Boltzmann-rational policy $\pi_\alpha$ is defined by $P(\pi_\alpha(s) = a) \propto e^{\frac{1}{\alpha} Q^*(s, a)}$, where $Q^*$ is the optimal $Q$-function.

We perform experiments to verify that our key results hold for either way of quantifying optimisation pressure. In both cases, the optimisation algorithm is Value Iteration (see e.g. Sutton & Barto, 2018). Finally, we need a way to quantify the magnitude of the Goodhart effect. Assume that we have a true reward $R_0$ and a proxy reward $R_1$, that $R_1$ is optimised according to one of the methods in Definitions 3 and 4, and that $\pi_\lambda$ is the policy obtained at optimisation pressure $\lambda$. Suppose also that $R_0, R_1$ are normalised, so that $\min_\pi J(\pi) = 0$ and $\max_\pi J(\pi) = 1$ for both $R_0$ and $R_1$.

Definition 5 (Normalised drop height). We define the normalised drop height (NDH) as $\max_{\lambda \in [0,1]} J_{R_0}(\pi_\lambda) - J_{R_0}(\pi_1)$, i.e. as the loss of true reward throughout the optimisation process.

For an illustration of the above definition, see the grey dashed line in Figure 1. We observe that NDH is non-zero if and only if, over increasing optimisation pressure, the proxy and true rewards are initially correlated and then become anti-correlated (we will see later that, as long as the angle distance is less than $\pi/2$, their returns will almost always be initially correlated). In Appendix C, we introduce more complex measures which quantify Goodhart's law differently. Since our experiments indicate that they are all strongly correlated, we decided to focus on NDH as the simplest one.
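As a small illustration of Definitions 4 and 5, the sketch below builds a Boltzmann-rational policy from a given optimal Q-function and computes NDH from a recorded training curve. These are our own minimal helpers (the `Q_star` array could come from the value-iteration sketch in Section 2), not the experimental code used for the results in Section 3.

```python
import numpy as np

def boltzmann_rational_policy(Q_star, alpha):
    """P(pi_alpha(s) = a) proportional to exp(Q*(s, a) / alpha)   (Definition 4)."""
    logits = Q_star / alpha
    logits -= logits.max(axis=1, keepdims=True)         # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def normalised_drop_height(true_returns):
    """NDH (Definition 5) for a training curve.

    true_returns[i] is J_{R0}(pi_{lambda_i}) for increasing optimisation pressures
    lambda_0 < ... < lambda_n = 1, with J_{R0} normalised to [0, 1].
    NDH > 0 indicates a Goodhart drop.
    """
    curve = np.asarray(true_returns, dtype=float)
    return float(curve.max() - curve[-1])
```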
3 GOODHARTING IS PERVASIVE IN REINFORCEMENT LEARNING

In this section, we empirically demonstrate that Goodharting occurs pervasively across varied environments by showing that, for a given true reward $R_0$ and a proxy reward $R_1$, beyond a certain optimisation threshold the performance on $R_0$ decreases when the agent is trained towards $R_1$. We test this claim over different kinds of environments (varying the number of states, actions, terminal states and $\gamma$), reward functions (varying reward types and sparsity) and optimisation pressure definitions.

3.1 ENVIRONMENT AND REWARD TYPES

Gridworld is a deterministic, grid-based environment with a state space of size $n \times n$ for a parameter $n \in \mathbb{N}$, and a fixed set of five actions: ↑, →, ↓, ←, and WAIT. The upper-left and lower-right corners are designated as terminal states. Attempting an illegal action $a$ in state $s$ does not change the state. Cliff (Sutton & Barto, 2018, Example 6.6) is a Gridworld variant where an agent aims to reach the lower-right terminal state while avoiding the cliff formed by the bottom row's cells. Any cliff-adjacent move has a slipping probability $p$ of falling into the cliff. Random MDP is an environment in which, for a fixed number of states $|S|$, actions $|A|$, and terminal states $k$, the transition matrix $\tau$ is sampled uniformly across all stochastic matrices of shape $|S \times A| \times |S|$ satisfying the property of having exactly $k$ terminal states. Tree MDP is an environment corresponding to the nodes of a rooted tree with branching factor $b = |A|$ and depth $d$. The root is the initial state, and each action from a non-leaf node results in the state corresponding to one of the node's children. Half of the leaf nodes are terminal states and the other half loop back to the root, which makes it isomorphic to an infinite self-similar tree.

In our experiments, we only use reward functions that depend only on the state, $R(s, a) = R(s)$. In Terminal, the rewards are sampled i.i.d. from $U(0, 1)$ for terminal states and from $U(-1, 0)$ for non-terminal states. In Cliff, the rewards are sampled i.i.d. from $U(-5, 0)$ for cliff states, from $U(-1, 0)$ for non-terminal states, and from $U(0, 1)$ for the goal state. In Path, we first sample a random walk $P$ moving only → and ↓ between the upper-left and lower-right terminal states, and then the rewards are constantly 0 on the path $P$, sampled from $U(-1, 0)$ for the non-terminal states, and from $U(0, 1)$ for the terminal state.

3.2 ESTIMATING THE PREVALENCE OF GOODHARTING

To get an estimate of how prevalent Goodharting is, we run an experiment where we vary all hyperparameters of MDPs in a grid-search manner. Specifically, we sample: Gridworld for grid lengths $n \in \{2, 3, \ldots, 14\}$ and either Terminal or Path rewards; Cliff with slipping probability $p = 0.5$, grid lengths $n \in \{2, 3, \ldots, 9\}$, and Cliff rewards; Random MDP with number of states $|S| \in \{2, 4, 8, 16, \ldots, 512\}$, number of actions $|A| \in \{2, 3, 4\}$, a fixed number of terminal states equal to 2, and Terminal rewards; Tree MDP with branching factor 2 and depth $d \in \{2, 3, \ldots, 9\}$, for two different kinds of trees: (1) where the first half of the leaves are terminal states, and (2) where every second leaf is a terminal state, both using Terminal rewards. For each of those, we also vary the temporal discount factor $\gamma \in \{0.5, 0.7, 0.9, 0.99\}$, the sparsity factor $\sigma \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$, and the optimisation pressure $\lambda = x$ (equivalently, the regularisation strength $\alpha = -\log x$) for 7 values of $x$ evenly spaced on $[0.01, 0.75]$ and 20 values evenly spaced on $[0.8, 0.99]$.

After sampling an MDP$\backslash$R, we randomly sample a pair of reward functions $R_0$ and $R_1$ from a chosen distribution. These are then sparsified (a random $\sigma$ fraction of values are zeroed) and linearly interpolated, creating a sequence of proxy reward functions $R_t = (1 - t) R_0 + t R_1$ for $t \in [0, 1]$. Note that for every environment, reward sampling scheme and fixed choice of parameters considered in Section 3.1, the sample space of rewards is convex. In high dimensions, two random vectors are approximately orthogonal with high probability, so the sequence $R_t$ spans a range of distances. Each run consists of 10 proxy rewards; we use a convergence threshold of 0.001 for value iteration. We get a total of 30400 data points.

An initial increase, followed by a decline in value with increasing optimisation pressure, indicates Goodharting behaviour. Overall, we find that a Goodhart drop occurs (meaning that NDH > 0) in 19.3% of all experiments sampled over the parameter ranges given above. This suggests that Goodharting is a common (albeit not universal) phenomenon in RL, and occurs in various environments and for various reward functions. We present additional empirical insights, such as that training myopic agents makes Goodharting less severe, in Appendix G. For illustrative purposes, we present a single run of the above experiment in Figure 2. We can see that, as the proxy $R_1$ is maximised, the true reward $R_0$ will typically either increase monotonically, or increase and then decrease. This is in accordance with the predictions of Goodhart's law.

4 EXPLAINING GOODHART'S LAW IN REINFORCEMENT LEARNING

In this section, we provide an intuitive, mechanistic account of why Goodharting happens in MDPs, which explains some of the results in Section 3. An extended discussion is also given in Appendix A. First, recall that $J_R(\pi) = \eta^\pi \cdot R$, where $\eta^\pi$ is the occupancy measure of $\pi$. Recall also that $\Omega$ is a convex polytope. Therefore, the problem of finding an optimal policy can be viewed as maximising a linear function $R$ within a convex polytope $\Omega$, which is a linear programming problem.
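This linear program can be written out explicitly using the occupancy-measure constraints from the proof of Proposition 1 (in the appendix). The sketch below is an illustrative implementation using `scipy.optimize.linprog`, not the optimisation procedure used in the experiments.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_occupancy_lp(tau, mu, R, gamma):
    """Maximise R . eta over Omega, i.e. subject to
       sum_a eta(s', a) - gamma * sum_{s, a} tau(s, a, s') eta(s, a) = mu(s')   and   eta >= 0."""
    n_states, n_actions, _ = tau.shape
    A_eq = np.zeros((n_states, n_states * n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            col = s * n_actions + a
            A_eq[s, col] += 1.0                      # the  sum_a eta(s, a)  term
            A_eq[:, col] -= gamma * tau[s, a, :]     # the  -gamma * tau(s, a, s') eta(s, a)  terms
    res = linprog(c=-R.reshape(-1), A_eq=A_eq, b_eq=mu, bounds=(0, None), method="highs")
    eta = res.x.reshape(n_states, n_actions)
    # Recover a policy; rows with (near-)zero occupancy correspond to unvisited states
    # and can be set arbitrarily.
    policy = eta / np.maximum(eta.sum(axis=1, keepdims=True), 1e-12)
    return eta, policy
```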
Steepest ascent is the process that changes $\eta$ in the direction that most rapidly increases $\eta \cdot R$ (for a formal definition, see Chang & Murty (1989) or Denel et al. (1981)). The path of steepest ascent forms a piecewise linear curve whose linear segments lie on the boundary of $\Omega$ (except the first segment, which may lie in the interior). Due to its similarity to gradient-based optimisation methods, we expect most policy optimisation algorithms to follow a path that roughly approximates steepest ascent. Steepest ascent also has the following property:

Proposition 3 (Concavity of Steepest Ascent). If $t_i := \frac{\eta_{i+1} - \eta_i}{\|\eta_{i+1} - \eta_i\|}$ for $\eta_i$ produced by steepest ascent on reward vector $R$, then $t_i \cdot R$ is decreasing.

Figure 3: Visualisation of Goodhart's law in the case of $M_{2,2}$. (a) Goodharting behaviour in $M_{2,2}$ over three reward functions. Our method is able to predict the optimal stopping time (in blue). (b) Training runs for each of the reward functions, embedded in the state-action occupancy-measure space. Even though the full frequency space is $|S||A| = 4$-dimensional, the image of the policy space occupies only a $|S|(|A| - 1) = 2$-dimensional linear subspace. Goodharting occurs when the cosine distance between rewards passes the critical threshold and the policy snaps to a different endpoint.

We can now explain Goodhart's law in MDPs. Assume we have a true reward $R_0$ and a proxy reward $R_1$, that we optimise $R_1$ through steepest ascent, and that this produces a sequence of occupancy measures $\{\eta_i\}$. Recall that this sequence forms a piecewise linear path along the boundary of a convex polytope $\Omega$, and that $J_{R_0}$ and $J_{R_1}$ correspond to linear functions on $\Omega$ (whose directions of steepest ascent are given by $M_\tau R_0$ and $M_\tau R_1$). First, if the angle between $M_\tau R_0$ and $M_\tau R_1$ is less than $\pi/2$, and the initial policy $\eta_0$ lies in the interior of $\Omega$, then it is guaranteed that $\eta \cdot R_0$ will increase along the first segment of $\{\eta_i\}$. However, when $\{\eta_i\}$ reaches the boundary of $\Omega$, steepest ascent continues in the direction of the projection of $M_\tau R_1$ onto this boundary. If this projection is far enough from $R_0$, optimising in the direction of $M_\tau R_1$ leads to a decrease in $J_{R_0}$ (cf. Figure 3b). This corresponds to Goodharting. $R_0$ may continue to increase, even after another boundary region has been hit. However, each time $\{\eta_i\}$ hits a new boundary, it changes direction, and there is a risk that $\eta \cdot R_0$ will decrease. In general, this is more likely if the angle between that boundary and $\{\eta_i\}$ is close to $\pi/2$, and less likely if the angle between $M_\tau R_0$ and $M_\tau R_1$ is small. This explains why Goodharting is less likely when the angle between $M_\tau R_0$ and $M_\tau R_1$ is small. Next, note that Proposition 3 implies that the angle between $\{\eta_i\}$ and the boundary of $\Omega$ will increase over time along $\{\eta_i\}$. This explains why Goodharting becomes more likely when more optimisation pressure is applied.

Let us consider an example to make our explanation of Goodhart's law more intuitive. Let $M_{2,2}$ be an MDP with 2 states and 2 actions, and let $R_0, R_1, R_2$ be three reward functions in $M_{2,2}$. The full specifications for $M_{2,2}$ and $R_0, R_1, R_2$ are given in Appendix E. We will refer to $R_0$ as the true reward. The angle between $R_0$ and $R_1$ is larger than the angle between $R_0$ and $R_2$. Using Maximal Causal Entropy, we can train a policy on each of the reward functions, using varying degrees of optimisation pressure, and record the performance of the resulting policy with respect to the true reward.
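A rough sketch of such a sweep is given below. We approximate MCE training with soft (maximum-entropy) value iteration, which returns the optimal policy for the entropy-regularised objective of Definition 3; `occupancy_measure` refers to the sketch given in Section 2. This is an illustration under our own conventions, not the exact experimental setup.

```python
import numpy as np

def soft_value_iteration(tau, R, gamma, alpha, tol=1e-8):
    """Optimal policy for the regularised objective R(s, a) + alpha * H(pi(s))."""
    n_states, n_actions, _ = tau.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        Q_max = Q.max(axis=1)
        V = Q_max + alpha * np.log(np.exp((Q - Q_max[:, None]) / alpha).sum(axis=1))  # soft maximum
        Q_new = R + gamma * tau @ V
        if np.abs(Q_new - Q).max() < tol:
            break
        Q = Q_new
    logits = (Q - Q.max(axis=1, keepdims=True)) / alpha
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

def goodhart_curve(tau, mu, R_true, R_proxy, gamma, pressures):
    """True return J_{R0}(pi_lambda) of MCE policies trained on the proxy,
    for optimisation pressures lambda = e^{-alpha} in (0, 1)  (cf. Figure 3a)."""
    curve = []
    for lam in pressures:
        alpha = -np.log(lam)                             # pressure -> regularisation strength
        pi = soft_value_iteration(tau, R_proxy, gamma, alpha)
        eta = occupancy_measure(tau, mu, gamma, pi)
        curve.append(float(eta @ R_true.reshape(-1)))    # J_R(pi) = eta_pi . R
    return curve
```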
Zero optimisation pressure results in the uniformly random policy, and maximal optimisation pressure results in the optimal policy for the given proxy (see Figure 3a). As we can see, we get Goodharting for $R_2$: increasing $R_2$ initially increases $R_0$, but there is a critical point after which further optimisation leads to worse performance under $R_0$. To understand what is happening, we embed the policies produced during each training run in $\Omega$, together with the projections of $R_0, R_1, R_2$ (see Figure 3b). We can now see that Goodharting must occur precisely when the angle between the true reward and the proxy reward passes the critical threshold, such that the training run deflects upon stumbling on the border of $\Omega$, and the optimal deterministic policy changes from the lower-left to the upper-left corner. This is the underlying mechanism that produces Goodhart behaviour in reinforcement learning! We thus have an explanation for why the Goodhart curves are so common. Moreover, this insight also explains why Goodharting does not always happen, and why a smaller distance between the true reward and the proxy reward is associated with less Goodharting. We can also see that Goodharting will be more likely when the angle between $\{\eta_i\}$ and the boundary of $\Omega$ is close to $\pi/2$; this is why Proposition 3 implies that Goodharting becomes more likely with more optimisation pressure.

Figure 4: Early stopping algorithm and its behaviour. (a) An $\eta$-embedded training run for steepest ascent. The training curve is split into two linear segments: the first is parallel to the proxy reward, while the second is parallel to the proxy reward projected onto some boundary plane $P$. Goodharting only occurs along $P$. (Compare to the MCE approximation of Steepest Ascent in Figure 3b.) (b) Early-stopping pseudocode for Steepest Ascent. Given the correct $\theta$, the algorithm stops at the point where the training run hits the boundary of the convex hull. The cone of tangents $T(\eta)$ is defined in Denel et al. (1981).

    procedure EARLYSTOPPING(S, A, τ, θ, R)
        r ← M_τ R
        π ← the uniformly random policy
        η_0 ← η^π;  i ← 0
        t_0 ← argmax_{t ∈ T(η_0)} t · r
        while t_i ≠ 0 and r · t_i ≥ sin(θ) ‖r‖ do
            λ ← max{λ : η_i + λ t_i ∈ Ω}
            η_{i+1} ← η_i + λ t_i
            t_{i+1} ← argmax_{t ∈ T(η_{i+1})} t · r
            i ← i + 1
        end while
        return the policy π with η^π = η_i
    end procedure

5 PREVENTING GOODHARTING BEHAVIOUR

We have seen that when a proxy reward $R_1$ is optimised, it is common for the true reward $R_0$ to first increase, and then decrease. If we can stop the optimisation process before $R_0$ starts to decrease, then we can avoid Goodharting. Our next result shows that we can provably prevent Goodharting, given that we have a bound $\theta$ on the distance between $R_1$ and $R_0$:

Theorem 1. Let $R_1$ be any reward function, let $\theta \in [0, \pi]$ be any angle, and let $\pi_A, \pi_B$ be any two policies. Then there exists a reward function $R_0$ with $\arg(R_0, R_1) \le \theta$ and $J_{R_0}(\pi_A) > J_{R_0}(\pi_B)$ if and only if
$$\frac{J_{R_1}(\pi_B) - J_{R_1}(\pi_A)}{\|\eta^{\pi_B} - \eta^{\pi_A}\|} < \sin(\theta)\,\|M_\tau R_1\|.$$

Corollary 1 (Optimal Stopping). Let $R_1$ be a proxy reward, and let $\{\pi_i\}$ be a sequence of policies produced by an optimisation algorithm. Suppose the optimisation algorithm is concave with respect to the policy, in the sense that $\frac{J_{R_1}(\pi_{i+1}) - J_{R_1}(\pi_i)}{\|\eta^{\pi_{i+1}} - \eta^{\pi_i}\|}$ is decreasing. Then stopping at the minimal $i$ with
$$\frac{J_{R_1}(\pi_{i+1}) - J_{R_1}(\pi_i)}{\|\eta^{\pi_{i+1}} - \eta^{\pi_i}\|} < \sin(\theta)\,\|M_\tau R_1\|$$
gives the policy $\pi_i \in \{\pi_i\}$ that maximises $\min_{R_0 \in F^\theta_R} J_{R_0}(\pi_i)$, where $F^\theta_R$ is the set of rewards given by $\{R_0 : \arg(R_0, R_1) \le \theta, \|M_\tau R_0\| = m\}$ for a fixed magnitude $m > 0$.
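In array form, the stopping rule of Corollary 1 only needs the occupancy measures along the optimisation run and the projected proxy $M_\tau R_1$, both computable with the earlier sketches. The following is a minimal illustrative check, under our own naming conventions.

```python
import numpy as np

def early_stopping_index(etas, R1_proj, theta):
    """Index at which the stopping rule of Corollary 1 triggers.

    etas:    occupancy measures eta_{pi_0}, eta_{pi_1}, ... along an optimisation run
    R1_proj: the projected proxy reward M_tau R1 (flattened vector)
    theta:   assumed upper bound on the angle between the proxy and the true reward
    """
    threshold = np.sin(theta) * np.linalg.norm(R1_proj)
    for i in range(len(etas) - 1):
        step = etas[i + 1] - etas[i]
        norm = np.linalg.norm(step)
        if norm == 0.0:
            continue
        gain = R1_proj @ step                  # J_{R1}(pi_{i+1}) - J_{R1}(pi_i)
        if gain / norm < threshold:            # some reward in F^theta_R would now decrease
            return i                           # stop before taking this step
    return len(etas) - 1
```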
Let us unpack the statement of this result. If we have a proxy reward $R_1$, and we believe that the angle between $R_1$ and the true reward $R_0$ is at most $\theta$, then $F^\theta_R$ is the set of all possible true reward functions with a given magnitude $m$. Note that no generality is lost by assuming that $R_0$ has magnitude $m$, since we can rescale any reward function without affecting its policy order. Now, if we optimise $R_1$ and want to provably avoid Goodharting, then we must stop the optimisation process at a point where there is no Goodharting for any reward function in $F^\theta_R$. Theorem 1 provides us with such a stopping point. Moreover, if the policy optimisation process is concave, then Corollary 1 tells us that this stopping point is, in a certain sense, worst-case optimal. By Proposition 3, we should expect most optimisation algorithms to be approximately concave.

Theorem 1 derives an optimal stopping point along a single optimisation curve. Our next result finds the optimum among all policies by maximising a regularised objective function.

Proposition 4. Given a proxy reward $R_1$, let $F^\theta_R$ be the set of possible true rewards $R$ such that $\arg(R, R_1) \le \theta$ and $R$ is normalised so that $\|M_\tau R\| = \|M_\tau R_1\|$. Then a policy $\pi$ maximises $\min_{R \in F^\theta_R} J_R(\pi)$ if and only if it maximises $J_{R_1}(\pi) - \kappa \|\eta^\pi\| \sin(\arg(\eta^\pi, R_1))$, where $\kappa = \tan(\theta)\|M_\tau R_1\|$. Moreover, each local maximum of this objective is a global maximum when restricted to $\Omega$, meaning that this function can be practically optimised for.

The above objective can be rewritten as $\|\eta_\parallel\| - \kappa\|\eta_\perp\|$, where $\eta_\parallel, \eta_\perp$ are the components of $\eta^\pi$ parallel and perpendicular to $M_\tau R_1$.

Stopping early clearly loses proxy reward, but it is important to note that it may also lose true reward. Since the algorithm is pessimistic, the optimisation stops before any reward in $F^\theta_R$ decreases. If we continued ascent past this stopping point, exactly one reward function in $F^\theta_R$ would decrease (almost surely), but most other reward functions would increase. If the true reward function is in this latter set, then early stopping loses some true reward. Our next result gives an upper bound on this quantity:

Proposition 5. Let $R_0$ be a true reward and $R_1$ a proxy reward such that $\|R_0\| = \|R_1\| = 1$ and $\arg(R_0, R_1) = \theta$, and assume that the steepest ascent algorithm applied to $R_1$ produces a sequence of policies $\pi_0, \pi_1, \ldots, \pi_n$. If $\pi^*$ is optimal for $R_0$, we have that
$$J_{R_0}(\pi^*) - J_{R_0}(\pi_n) \le \mathrm{diameter}(\Omega) - \|\eta^{\pi_n} - \eta^{\pi_0}\| \cos(\theta).$$

It would be interesting to develop policy optimisation algorithms that start with an initial estimate $R_1$ of the true reward $R_0$ and then refine $R_1$ over time, as the ambiguity in $R_1$ becomes relevant. Theorem 1 and Proposition 4 could then be used to check when more information about the true reward is needed. While we mostly leave this for future work, we carry out some initial exploration in Appendix F.
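Returning to Proposition 4: evaluating the regularised objective only requires decomposing $\eta^\pi$ into components parallel and perpendicular to $M_\tau R_1$. A minimal sketch (our own helper, with the same array conventions as before):

```python
import numpy as np

def worst_case_objective(eta, R1_proj, theta):
    """J_{R1}(pi) - kappa * ||eta_perp||  with  kappa = tan(theta) * ||M_tau R1||   (Proposition 4).

    eta:     occupancy measure of the policy (flattened vector)
    R1_proj: projected proxy reward M_tau R1
    theta:   assumed angle bound between the proxy and the true reward (theta < pi/2)
    """
    r_norm = np.linalg.norm(R1_proj)
    r_hat = R1_proj / r_norm
    eta_par = (eta @ r_hat) * r_hat            # component of eta along M_tau R1
    eta_perp = eta - eta_par                   # component perpendicular to M_tau R1
    kappa = np.tan(theta) * r_norm
    return float(eta @ R1_proj - kappa * np.linalg.norm(eta_perp))
```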
5.1 EXPERIMENTAL EVALUATION OF EARLY STOPPING

We evaluate the early stopping algorithm experimentally. One problem is that Algorithm 4b involves the projection onto $\Omega$, which is infeasible to compute exactly because the number of deterministic policies is exponential in $|S|$. Instead, we observe that using MCE and BR approximates the steepest ascent trajectory. Using the exact setup described in Section 3.2, we verify that the early stopping procedure prevents Goodharting in all cases; that is, employing the criterion from Corollary 1 always results in NDH = 0. Because early stopping is pessimistic, some reward will usually be lost.

We are interested in whether the choice of (1) the operationalisation of optimisation pressure, (2) the type of environment, or (3) the angle distance $\theta$ impacts the performance of early stopping. A priori, we expected the answer to the first question to be negative and the answer to the third to be positive. Figure 5a shows that, as expected, the choice between MCE and Boltzmann Rationality has little effect on the performance. Unfortunately, and somewhat surprisingly, the early stopping procedure can, in general, lose out on a lot of reward; in our experiments, this is on average between 10% and 44%, depending on the size and type of environment. The relationship between the distance and the lost reward seems to indicate that for small values of $\theta$, the loss of reward is less significant (cf. Figure 5b).

Figure 5: (a) Percentage of reward lost due to early stopping (markers show group medians). (b) The relationship between $\theta$ and the lost reward (shaded area between the 25th and 75th quantiles), aggregated into 25 buckets.

6 DISCUSSION

Computing $\eta$ in high dimensions: Our early stopping method requires computing the occupancy measure $\eta$. Occupancy measures can be approximated via rollouts, though this approximation may be expensive and noisy; a minimal Monte-Carlo sketch of this rollout estimate is given at the end of this section. Another option is to solve for $\eta = \eta^\pi$ via $\eta = (I - \Pi T)^{-1} \Pi \mu$, where $T$ is the transition matrix, $\mu$ is the initial state distribution, and $\Pi_{s,(s,a)} = P(\pi(s) = a)$. This solution could be approximated in large environments.

Approximating $\theta$: Our early stopping method requires an upper bound $\theta$ on the angle between the true reward and the proxy reward. In practice, this should be seen as a measure of how accurate we believe the proxy to be. If the proxy reward is obtained through reward learning, then we may be able to estimate $\theta$ based on the learning algorithm, the amount of training data, and so on. Moreover, if we have a (potentially expensive) method to evaluate the true reward, such as expert judgement, then we can estimate $\theta$ directly (even in large environments). For details, see Skalse et al. (2023a).

Key assumptions: An important consideration when employing any optimisation algorithm is its behaviour when its key assumptions are not met. For our early stopping method, if the provided $\theta$ does not upper-bound the angle between the proxy and the true reward, then the learnt policy may, in the worst case, result in as much Goodharting as a policy produced by naïve optimisation.² On the other hand, if the optimisation algorithm is not concave, then this can only cause the early-stopping procedure to stop at a sub-optimal point; Goodharting is still guaranteed to be avoided. This is also true if the upper bound $\theta$ is not tight.

Significance and Implications: Our work has several direct implications. In Section 3, we show that Goodharting occurs for a wide range of environments and reward functions. This means that we should expect to see Goodharting often when optimising for misspecified proxy rewards. In Section 4, we provide a mechanistic explanation for why Goodharting occurs. We expect this to be helpful for further progress in the study of reward misspecification. In Section 5, we provide early stopping methods that provably avoid Goodharting, and show that these methods, in a certain sense, are worst-case optimal. However, these methods can lead to less true reward than naïve optimisation. This means that they are most applicable when it is essential to avoid Goodharting.
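For the rollout-based approximation of $\eta$ discussed above, a minimal Monte-Carlo sketch in the same tabular setting is given below (our own illustrative estimator; the discounted sum is truncated at a finite horizon, which introduces a small bias of order $\gamma^{\text{horizon}}$).

```python
import numpy as np

def occupancy_from_rollouts(tau, mu, policy, gamma, n_rollouts=1000, horizon=200, seed=0):
    """Monte-Carlo estimate of eta_pi(s, a) = sum_t gamma^t P(s_t = s, a_t = a)."""
    n_states, n_actions, _ = tau.shape
    rng = np.random.default_rng(seed)
    eta = np.zeros((n_states, n_actions))
    for _ in range(n_rollouts):
        s = rng.choice(n_states, p=mu)
        for t in range(horizon):                # truncate the discounted sum at `horizon`
            a = rng.choice(n_actions, p=policy[s])
            eta[s, a] += gamma ** t
            s = rng.choice(n_states, p=tau[s, a])
    return eta.reshape(-1) / n_rollouts
```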
Limitations and Future Work: We do not have a comprehensive understanding of the dynamics at play when a misspecified reward function is maximised, and our work does not exhaust this area of study. An important question is what types of failure modes can occur in this setting, and how they may be detected and mitigated. Our work studies one important failure mode (i.e. Goodharting), but there may be other distinctive failure modes that could be described and studied as well. A related important question is precisely how a proxy reward $R_1$ may differ from the true reward $R_0$ before maximising $R_1$ might be bad according to $R_0$. There are several existing results pertaining to this question (Ng et al., 1999; Gleave et al., 2020; Skalse et al., 2022; 2023b), but there is at the moment no comprehensive answer. Another interesting direction is to use our results to develop policy optimisation algorithms that collect more data about the reward function over time, as this information is needed. We discuss this direction in Appendix F. Finally, it would be interesting to try to find principled relaxations of the methods in Section 5 that attain better practical performance while retaining desirable theoretical guarantees.

²However, it might still be possible to bound the worst-case performance further using the norm of the transition matrix (which defines the geometry of the polytope $\Omega$). This will be an interesting topic for future work.

ACKNOWLEDGEMENTS

The authors would like to thank Oliver Sourbut, Bogdan Ionut Cirstea, Sam Staton and the anonymous reviewers for their detailed feedback on the draft of this paper. This research was conducted and funded as a part of Oxford AI Safety Hub Labs. The first author was supported by a separate grant from Open Philanthropy.

REFERENCES

Hal Ashton. Causal Campbell-Goodhart's Law and Reinforcement Learning. Proceedings of the 13th International Conference on Agents and Artificial Intelligence, pp. 67-73, 2021. doi: 10.5220/0010197300670073.

Soo Y. Chang and Katta G. Murty. The steepest descent gravitational method for linear programming. Discrete Applied Mathematics, 25(3):211-239, 1989. ISSN 0166-218X. doi: 10.1016/0166-218X(89)90002-4.

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 4302-4310, Red Hook, NY, USA, December 2017. Curran Associates Inc. ISBN 978-1-5108-6096-4.

J. Denel, J. C. Fiorot, and P. Huard. The steepest-ascent method for the linear programming problem. In RAIRO. Analyse Numérique, volume 15, pp. 195-200, 1981. doi: 10.1051/m2an/1981150301951.

Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 4705-4713, Melbourne, Australia, August 2017. AAAI Press. ISBN 978-0-9992411-0-3.

Eugene A. Feinberg and Uriel G. Rothblum. Splitting Randomized Stationary Policies in Total Reward Markov Decision Processes. Mathematics of Operations Research, 37(1):129-153, 2012. ISSN 0364-765X.

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835-10866. PMLR, 2023.

Adam Gleave, Michael D. Dennis, Shane Legg, Stuart Russell, and Jan Leike.
Quantifying Differences in Reward Functions. In International Conference on Learning Representations, October 2020.

C. A. E. Goodhart. Problems of Monetary Management: The UK Experience. In C. A. E. Goodhart (ed.), Monetary Theory and Practice: The UK Experience, pp. 91-121. Macmillan Education UK, London, 1984. ISBN 978-1-349-17295-5. doi: 10.1007/978-1-349-17295-5_4.

Christopher A. Hennessy and Charles A. E. Goodhart. Goodhart's Law and Machine Learning: A Structural Perspective. International Economic Review, 64(3):1075-1086, 2023. ISSN 1468-2354. doi: 10.1111/iere.12633.

Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp. 8022-8034, Red Hook, NY, USA, December 2018. Curran Associates Inc.

W. Bradley Knox, Alessandro Allievi, Holger Banzhaf, Felix Schmitt, and Peter Stone. Reward (Mis)design for autonomous driving. Artificial Intelligence, 316:103829, March 2023. ISSN 0004-3702. doi: 10.1016/j.artint.2022.103829.

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: the flip side of AI ingenuity, 2020. URL https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity.

David Manheim and Scott Garrabrant. Categorizing Variants of Goodhart's Law, February 2019.

Soren Mindermann and Stuart Armstrong. Occam's razor is insufficient to infer the preferences of irrational agents. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp. 5603-5614, Red Hook, NY, USA, December 2018. Curran Associates Inc.

Andrew Y. Ng and Stuart J. Russell. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML'00, pp. 663-670, San Francisco, CA, USA, June 2000. Morgan Kaufmann Publishers Inc. ISBN 978-1-55860-707-1.

Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML'99, pp. 278-287, San Francisco, CA, USA, June 1999. Morgan Kaufmann Publishers Inc. ISBN 978-1-55860-612-8.

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. In International Conference on Learning Representations, October 2021.

Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, and He He. Reward Gaming in Conditional Text Generation. arXiv, 2022. doi: 10.48550/ARXIV.2211.08714.

Romain Paulus, Caiming Xiong, and Richard Socher. A Deep Reinforced Model for Abstractive Summarization. In International Conference on Learning Representations, February 2018.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 978-0-471-61977-2.

Joar Skalse and Alessandro Abate. Misspecification in inverse reinforcement learning. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, volume 37 of AAAI'23/IAAI'23/EAAI'23, pp.
15136-15143. AAAI Press, September 2023. ISBN 978-1-57735-880-0. doi: 10.1609/aaai.v37i12.26766.

Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, and Alessandro Abate. STARC: A General Framework For Quantifying Differences Between Reward Functions, September 2023a.

Joar Max Viktor Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and Characterizing Reward Gaming. In Advances in Neural Information Processing Systems, May 2022.

Joar Max Viktor Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. In International Conference on Machine Learning, pp. 32033-32058. PMLR, 2023b.

Xingyou Song, Yiding Jiang, Stephen Tu, Yilun Du, and Behnam Neyshabur. Observational Overfitting in Reinforcement Learning. In International Conference on Learning Representations, September 2019.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, October 2018. ISBN 978-0-262-03924-6.

Simon Zhuang and Dylan Hadfield-Menell. Consequences of misaligned AI. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, pp. 15763-15773, Red Hook, NY, USA, December 2020. Curran Associates Inc. ISBN 978-1-71382-954-6.

A  A MORE DETAILED EXPLANATION OF GOODHART'S LAW

In this section, we provide an intuitive explanation of why Goodharting occurs in MDPs that is more detailed and clear than the explanation provided in Section 4. First of all, as in Section 4, recall that $J_R(\pi) = \eta^\pi \cdot R$, where $\eta^\pi$ is the occupancy measure of $\pi$. This means that we can decompose $J_R$ into two steps, the first of which is independent of $R$ and maps $\Pi$ to $\Omega$, and the second of which is a linear function. Recall also that $\Omega$ is a convex polytope. Therefore, the problem of finding an optimal policy can be viewed as maximising a linear function within a convex polytope $\Omega$. If $R_1$ is the reward function we are optimising, then we can visualise this as follows: here the red arrow denotes the direction of $R_1$ within $\Omega$. Note that this direction corresponds to $M_\tau R_1$, rather than $R_1$, since $\Omega$ lies in a lower-dimensional affine subspace. Similarly, the red lines correspond to the level sets of $R_1$, i.e. the directions we can move in without changing $R_1$.

Now, if $R_1$ is a proxy reward, then we may assume that there is also some (unknown) true reward function $R_0$. This reward also induces a linear function on $\Omega$. Suppose we pick a random point $\eta^\pi$ in $\Omega$, and then move in a direction that increases $\eta^\pi \cdot R_1$. This corresponds to picking a random policy $\pi$ and then modifying it in a direction that increases $J_{R_1}(\pi)$. In particular, let us consider what happens to the true reward function $R_0$ as we move in the direction that most rapidly increases the proxy reward $R_1$. To start with, if we are in the interior of $\Omega$ (i.e., not close to any constraints), then the direction that most rapidly increases $R_1$ is to move parallel to $M_\tau R_1$. Moreover, if the angle $\theta$ between $M_\tau R_1$ and $M_\tau R_0$ is no more than $\pi/2$, then this is guaranteed to also increase the value of $R_0$. To see this, simply consider the corresponding diagram. However, as we move parallel to $M_\tau R_1$, we will eventually hit the boundary of $\Omega$. When we do this, the direction that most rapidly increases $R_1$ will no longer be parallel to $M_\tau R_1$.
Instead, it will be parallel to the projection of $R_1$ onto the boundary of $\Omega$ that we just hit. Moreover, if we keep moving in this direction, then we might no longer be increasing the true reward $R_0$. To see this, consider the following situation: the dashed green line corresponds to the path that most rapidly increases $R_1$. As we move along this path, $R_0$ initially increases. However, after the path hits the boundary of $\Omega$ and changes direction, $R_0$ will instead start to decrease. Thus, if we were to plot $J_{R_1}(\pi)$ and $J_{R_0}(\pi)$ over time, we would get a plot that looks roughly like the Goodhart curve of Figure 1.

Next, it is important to note that $R_0$ is not guaranteed to decrease after we hit the boundary of $\Omega$. The dashed green line again corresponds to the path that most rapidly increases $R_1$; in this second case, as we move along this path, $R_0$ increases both before and after the path has hit the boundary of $\Omega$. If we were to plot $J_{R_1}(\pi)$ and $J_{R_0}(\pi)$ over time, both would increase monotonically.

The next thing to note is that we will not just hit the boundary of $\Omega$ once. If we pick a random point $\eta^\pi$ in $\Omega$ and keep moving in the direction that most rapidly increases $\eta^\pi \cdot R_1$ until we have found the maximal value of $R_1$ in $\Omega$, then we will hit the boundary of $\Omega$ over and over again. Each time we hit this boundary we will change the direction that we are moving in, and each time this happens, there is a risk that we will start moving in a direction that decreases $R_0$. Note that Goodharting corresponds to the case where we follow a path through $\Omega$ along which $R_0$ initially increases, but eventually starts to decrease. As we have seen, this must be caused by the boundaries of $\Omega$. We may now ask: under what conditions do these boundaries force the path of steepest ascent (of $R_1$) to move in a direction that decreases $R_0$? By inspecting the above cases, we can see that this depends on the angle between the normal vector of that boundary and $M_\tau R_1$, and the angle between $M_\tau R_1$ and $M_\tau R_0$. In particular, in order for $R_0$ to start decreasing, it has to be the case that the angle between $M_\tau R_1$ and $M_\tau R_0$ is larger than the angle between $M_\tau R_1$ and the normal vector of the boundary of $\Omega$. This immediately tells us that if the angle between $M_\tau R_1$ and $M_\tau R_0$ is small (i.e., if $\arg(R_0, R_1)$ is small), then Goodharting will be less likely to occur. Moreover, as the angle between $M_\tau R_1$ and the normal vector of the boundary of $\Omega$ becomes smaller, Goodharting should be correspondingly more likely to occur. Next, recall that Proposition 3 tells us that this angle will decrease monotonically along the path of steepest ascent (of $R_1$). As such, Goodharting will get more and more likely the further we move along the path of steepest ascent. This explains why Goodharting becomes more likely when more optimisation pressure is applied.

B  PROOFS

Proposition 1. The set $\Omega = \{\eta^\pi : \pi \in \Pi\}$ is the convex hull of the finite set of points corresponding to the deterministic policies $\Omega_d := \{\eta^\pi : \pi \in \Pi_0\}$. It lies in an affine subspace of dimension $|S|(|A| - 1)$.

Proof. The proof of the second half of this proposition, which says that the affine space containing $\Omega$ has dimension at most $|S|(|A| - 1)$, can be found in (Skalse & Abate, 2023, Lemma A.2). Here, we will prove the first half (that $\Omega = \mathrm{conv}(\Omega_d)$) using the linear program outlined in Puterman (1994, Equation 6.9.2).
$$\text{maximise: } R \cdot \eta \quad \text{subject to: } \sum_{a \in A} \eta(s', a) - \gamma \sum_{s \in S, a \in A} \tau(s, a, s')\, \eta(s, a) = \mu(s') \ \ \forall s' \in S, \qquad \eta(s, a) \ge 0 \ \ \forall (s, a) \in S \times A.$$

Puterman (1994, Theorem 6.9.1) proves that (i) for any $\pi \in \Pi$, $\eta^\pi$ satisfies this linear program, and (ii) for any feasible solution $\eta$ of this linear program, there is a policy $\pi$ such that $\eta = \eta^\pi$. In other words, $\Omega = \{\eta \in \mathbb{R}^{|S||A|} : A\eta = \mu, \ \eta \ge 0\}$, where $A$ is an $|S|$ by $|S||A|$ matrix. Denote the convex hull of a finite set $X$ as $\mathrm{conv}(X)$. We first show that $\Omega = \mathrm{conv}(\Omega_d)$. The fact that $\mathrm{conv}(\Omega_d) \subseteq \Omega$ follows straight from the fact that $\Omega_d \subseteq \Omega$, and from the fact that $\Omega$ must be convex, since it is the set of solutions of a set of linear equations. We show that $\Omega \subseteq \mathrm{conv}(\Omega_d)$ by strong induction on
$$k(\eta) = \sum_{s \in S} \max\bigl(0, \ |\{a \in A : \eta(s, a) > 0\}| - 1\bigr).$$
Intuitively, $k(\eta) = 0$ if and only if there is a deterministic policy corresponding to $\eta$, and $k(\eta)$ increases with the number of potential actions available in visited states. The base case of the induction is simple: if $k(\eta) = 0$, then there is a deterministic policy $\pi_d$ such that $\eta = \eta^{\pi_d}$, and therefore $\eta \in \Omega_d \subseteq \mathrm{conv}(\Omega_d)$. For the inductive step, suppose $\eta' \in \mathrm{conv}(\Omega_d)$ for all $\eta' \in \Omega$ with $k(\eta') < K$, and consider any $\eta$ with $k(\eta) = K$. We will use the following lemma, which is closely related to (Feinberg & Rothblum, 2012, Lemma 6.3).

Lemma 1. For any occupancy measure $\eta$ with $k(\eta) > 0$, call an occupancy measure $x$ a deterministic reduction of $\eta$ if and only if $k(x) = 0$ and, for all $s, a$, if $x(s, a) > 0$ then $\eta(s, a) > 0$. If $x$ is a deterministic reduction of $\eta$, then there exists some $\alpha \in (0, 1)$ and $y \in \Omega$ such that $\eta = \alpha x + (1 - \alpha) y$ and $k(y) < k(\eta)$.

Intuitively, since a deterministic reduction always exists, Lemma 1 says that any occupancy measure corresponding to a stochastic policy can be split into an occupancy measure corresponding to a deterministic policy and an occupancy measure with a smaller $k$ number. The proof of Lemma 1 is easy: choose $\alpha$ to be the maximum value such that $(\eta - \alpha x)(s, a) \ge 0$ for all $s$ and $a$, then set $y = \frac{1}{1 - \alpha}(\eta - \alpha x)$. For at least one $(s, a)$ we will have $(\eta - \alpha x)(s, a) = 0$, and therefore $k(y) < k(\eta)$. It remains to show that $y \in \Omega$, but this follows straightforwardly from (Puterman, 1994, Theorem 6.9.1) and the facts that $y \ge 0$ and $Ay = \frac{1}{1 - \alpha} A(\eta - \alpha x) = \frac{1}{1 - \alpha}(\mu - \alpha \mu) = \mu$.

Now, if $k(\eta) = K > 0$, then by Lemma 1, $\eta = \alpha x + (1 - \alpha) y$ with $k(x) = 0$ and $k(y) < K$. By the inductive hypothesis, since $k(y) < K$, we have $y \in \mathrm{conv}(\Omega_d)$, and therefore $y$ is a convex combination of vectors in $\Omega_d$. Since $k(x) = 0$, we know that $x \in \Omega_d$, and therefore $\alpha x + (1 - \alpha) y$ is also a convex combination of vectors in $\Omega_d$. This suffices to show $\eta \in \mathrm{conv}(\Omega_d)$. By induction, $\eta \in \mathrm{conv}(\Omega_d)$ for all values of $k(\eta)$, and therefore $\Omega \subseteq \mathrm{conv}(\Omega_d)$.

Proposition 2. We have $\arg(R_0, R_1) = 0$ if and only if $R_0, R_1$ induce the same ordering of policies, or, in other words, $J_{R_0}(\pi) \le J_{R_0}(\pi') \iff J_{R_1}(\pi) \le J_{R_1}(\pi')$ for all policies $\pi, \pi'$.

Proof. We show that $\arg(R_0, R_1)$ satisfies the conditions of (Skalse & Abate, 2023, Theorem 2.6). Recall that $\arg(R_0, R_1)$ is the angle between $M_\tau R_0$ and $M_\tau R_1$, where $M_\tau$ projects vectors onto $\mathrm{span}(\Omega)$. Now, note that two reward functions $R_0, R_1$ induce different policy orderings if and only if the corresponding policy evaluation functions $J_0, J_1$ induce different policy orderings. Moreover, recall that each $J_i$ can be viewed as the linear function $R_i \cdot \eta^\pi$ for $\eta^\pi \in \Omega$. Two linear functions $\ell_0, \ell_1$ defined over a domain $D$ which contains an open set induce different orderings if and only if $\ell_0$ and $\ell_1$ have a non-zero angle after being projected onto $D$.
Finally, $\Omega$ does contain a set that is open in the smallest affine space which contains $\Omega$, as per Proposition 1. This means that $R_0$ and $R_1$ induce the same ordering of policies if and only if the angle between $M_\tau R_0$ and $M_\tau R_1$ is 0 (meaning that $\arg(R_0, R_1) = 0$). This completes the proof.

Proposition 3 (Concavity of Steepest Ascent). If $t_i := \frac{\eta^{\pi_{i+1}} - \eta^{\pi_i}}{\|\eta^{\pi_{i+1}} - \eta^{\pi_i}\|}$ for $\eta^{\pi_i}$ produced by steepest ascent on reward vector $R$, then $t_i \cdot R$ is non-increasing.

Proof. By the definition of steepest ascent given in Denel et al. (1981), $t_i$ will be the unit vector in the cone of tangents
$$T(\eta^{\pi_i}) := \{t : \|t\| = 1, \ \exists \lambda > 0, \ \eta^{\pi_i} + \lambda t \in \Omega\}$$
that maximises $t_i \cdot R$. This is what it formally means to go in the direction that leads to the fastest increase in reward. For the sake of contradiction, assume $t_{i+1} \cdot R > t_i \cdot R$, and let $t'_i = \frac{\eta^{\pi_{i+2}} - \eta^{\pi_i}}{\|\eta^{\pi_{i+2}} - \eta^{\pi_i}\|}$. Then
$$t'_i \cdot R = \frac{\bigl(t_{i+1}\|\eta^{\pi_{i+2}} - \eta^{\pi_{i+1}}\| + t_i\|\eta^{\pi_{i+1}} - \eta^{\pi_i}\|\bigr) \cdot R}{\|\eta^{\pi_{i+2}} - \eta^{\pi_i}\|} \ \ge\ \frac{\bigl(t_{i+1}\|\eta^{\pi_{i+2}} - \eta^{\pi_{i+1}}\| + t_i\|\eta^{\pi_{i+1}} - \eta^{\pi_i}\|\bigr) \cdot R}{\|\eta^{\pi_{i+2}} - \eta^{\pi_{i+1}}\| + \|\eta^{\pi_{i+1}} - \eta^{\pi_i}\|} \ >\ t_i \cdot R,$$
where the former inequality follows from the triangle inequality, and the latter follows because the expression is a weighted average of $t_{i+1} \cdot R$ and $t_i \cdot R$. We also have, for $\lambda = \|\eta^{\pi_{i+2}} - \eta^{\pi_i}\|$, that $\eta^{\pi_i} + \lambda t'_i = \eta^{\pi_{i+2}} \in \Omega$. But then $t'_i \in T(\eta^{\pi_i})$, contradicting that $t_i = \mathrm{argmax}_{t \in T(\eta^{\pi_i})} t \cdot R$.

Theorem 1 (Optimal Stopping). Let $R_1$ be any reward function, let $\theta \in [0, \pi]$ be any angle, and let $\pi_A, \pi_B$ be any two policies. Then there exists a reward function $R_0$ with $\arg(R_0, R_1) \le \theta$ and $J_{R_0}(\pi_A) > J_{R_0}(\pi_B)$ if and only if
$$\frac{J_{R_1}(\pi_B) - J_{R_1}(\pi_A)}{\|\eta^{\pi_B} - \eta^{\pi_A}\|} < \sin(\theta)\|M_\tau R_1\|.$$

Proof. Let $d := \eta^{\pi_B} - \eta^{\pi_A}$ denote the difference in occupancy measures. The inequality can be rewritten as
$$\exists R \text{ such that } \arg(R, R_1) \le \theta \text{ and } d \cdot R < 0 \iff \cos(\arg(R_1, d)) < \sin(\theta).$$
To show one direction, if $d \cdot R < 0$ we have $d \cdot M_\tau R < 0$, as $d$ is parallel to $\Omega$. This gives $\arg(M_\tau R, d) > \frac{\pi}{2}$, and $\arg(M_\tau R, d) \le \arg(R_1, d) + \arg(R_0, R_1) \le \arg(R_1, d) + \theta$. It follows that $\arg(R_1, d) > \frac{\pi}{2} - \theta$, and thus $\cos(\arg(R_1, d)) < \sin(\theta)$. If instead $\cos(\arg(R_1, d)) < \sin(\theta)$, we have $\arg(R_1, d) > \frac{\pi}{2} - \theta$. To choose $R$, note that there will be two vectors $R \in F^\theta_R$ that lie at the intersection of the plane $\mathrm{span}(\eta^\pi, M_\tau R_1)$ with the cone $\arg(R, R_1) = \theta$. One of them will satisfy $\arg(R, \eta^\pi) = \arg(R, R_1) + \arg(R_1, \eta^\pi)$ (informally, this is the one for which $R_1$ lies between $\eta^\pi$ and $R$). Then this $R$ gives $\arg(R, d) = \arg(R, R_1) + \arg(R_1, d) > \theta + \frac{\pi}{2} - \theta = \frac{\pi}{2}$, so $R \cdot d < 0$.

Proposition 4. Given a proxy reward $R_1$, let $F^\theta_R$ be the set of possible true rewards $R$ such that $\arg(R, R_1) \le \theta$ and $R$ is normalised so that $\|M_\tau R\| = \|M_\tau R_1\|$. Then a policy $\pi$ maximises $\min_{R \in F^\theta_R} J_R(\pi)$ if and only if it maximises $J_{R_1}(\pi) - \kappa\|\eta^\pi\| \sin(\arg(\eta^\pi, R_1))$, where $\kappa = \tan(\theta)\|M_\tau R_1\|$. Moreover, each local maximum of this objective is a global maximum when restricted to $\Omega$, meaning that this function can be practically optimised for.

Proof. Note that
$$\min_{R \in F^\theta_R} J_R(\pi) = \min_{R \in F^\theta_R} \eta^\pi \cdot M_\tau R = \|M_\tau R_1\|\,\|\eta^\pi\| \min_{R \in F^\theta_R} \cos(\arg(\eta^\pi, R)),$$
as $\|M_\tau R\| = \|M_\tau R_1\|$ for all $R \in F^\theta_R$. Now we claim that
$$\min_{R \in F^\theta_R} \cos(\arg(R, \eta^\pi)) = \cos(\arg(R_1, \eta^\pi) + \theta).$$
To show this, we can take $R \in F^\theta_R$ with $\arg(R, \eta^\pi) = \arg(R_1, \eta^\pi) + \theta$ (such an $R$ is described in Appendix B). This then gives $\min_{R \in F^\theta_R} \cos(\arg(R, \eta^\pi)) \le \cos(\arg(R_1, \eta^\pi) + \theta)$. We also have
$$\cos(\arg(R, \eta^\pi)) \ge \cos(\arg(R_1, \eta^\pi) + \arg(R_1, R)) \ge \cos(\arg(R_1, \eta^\pi) + \theta)$$
for any $R \in F^\theta_R$. Then
$$\min_{R \in F^\theta_R} \cos(\arg(R, \eta^\pi)) = \cos(\arg(R_1, \eta^\pi) + \theta) = \cos(\theta)\cos(\arg(R_1, \eta^\pi)) - \sin(\theta)\sin(\arg(R_1, \eta^\pi)).$$
Rearranging gives
$$\min_{R \in F^\theta_R} J_R(\pi) \propto R_1 \cdot \eta^\pi - \tan(\theta)\|\eta^\pi\|\|M_\tau R_1\| \sin(\arg(R_1, \eta^\pi)),$$
which is equivalent to the given objective.
To show that all local maxima are global maxima, note that $\min_{R \in F^\theta_R} J_R(\pi) = \min_{R \in F^\theta_R} \eta^\pi \cdot R$ is, on $\Omega$, a minimum over linear functions, and is therefore concave. This then gives that each local maximum of $\min_{R \in F^\theta_R} J_R(\pi)$ is a global maximum, so the same holds for the given objective function.

Proposition 5. Let $R_0$ be a true reward and $R_1$ a proxy reward such that $\|R_0\| = \|R_1\| = 1$ and $\arg(R_0, R_1) = \theta$, and assume that the steepest ascent algorithm applied to $R_1$ produces a sequence of policies $\pi_0, \pi_1, \ldots, \pi_n$. If $\pi^*$ is optimal for $R_0$, we have that
$$|J_{R_0}(\pi_n) - J_{R_0}(\pi^*)| \le \mathrm{diameter}(\Omega) - \|\eta^{\pi_n} - \eta^{\pi_0}\| \cos(\theta).$$

Proof. The bound is composed of two terms: (1) how much total reward $R$ there is to gain, and (2) how much we have gained already. Since the reward vector is normalised, the range of the reward over $\Omega$ is at most its diameter. The gains that have already been made by the steepest ascent algorithm equal $\|\eta^{\pi_n} - \eta^{\pi_0}\|$, but this has to be scaled by the (pessimistic) factor of $\cos(\theta)$, since this is the alignment of the true and proxy rewards. The bound can be difficult to compute exactly. A simple but crude approximation of the diameter is
$$\max_{\eta_1, \eta_2 \in \Omega} \|\eta_1 - \eta_2\|_2 \le 2 \max_{\eta \in \Omega} \|\eta\|_2 \le \frac{2}{1 - \gamma}.$$

C  MEASURING GOODHARTING

While Goodhart's law is qualitatively well-described, a quantitative measure is needed to analyse it systematically. We propose a number of different metrics to do that. Below, implicitly assuming that all rewards are normalised (as in Section 2), we denote by $f : [0, 1] \to [0, 1]$ the true reward obtained by a policy trained on a proxy reward, as a function of the optimisation pressure $\lambda$; similarly, $f_0$ denotes the true reward obtained by a policy trained on the true reward, and $\lambda^* = \arg\max_{\lambda \in [0,1]} f(\lambda)$.

Normalised drop height: $\mathrm{NDH}(f) = f(\lambda^*) - f(1)$.

Simple integration: $\mathrm{SI}(f)$.

Weighted correlation-anticorrelation: $\mathrm{CACW}(f) = \max(\rho_0, 0) \cdot \max(-\rho_1, 0)$, where $\rho_i = \rho(f(I_i), I_i)$ are the Pearson correlation coefficients for $I_0 = \mathrm{Unif}[0, \lambda^*]$ and $I_1 = \mathrm{Unif}[\lambda^*, 1]$.

Regression angle: $\mathrm{LR}(f) = \beta_0 - \beta_1$, where $\beta_0, \beta_1$ are the angles of the linear regressions of $f$ on $[0, \lambda^*]$ and $[\lambda^*, 1]$ respectively.

Relative weighted integration: $\mathrm{RWI}(f) = \int_0^{\lambda^*} |f(\lambda) - f_0(\lambda)|\, d\lambda \Big/ \int_{\lambda^*}^{1} |f(\lambda) - f_0(\lambda)|\, d\lambda$.

The metrics were independently designed to capture the intuition of a sudden drop in reward with increased optimisation pressure. We then generated a dataset of 40000 varied environments:
Gridworld, Terminal reward, with $|S| \sim \mathrm{Poiss}(100)$, N = 1000;
Cliff, Cliff reward, with $|S| \sim \mathrm{Poiss}(100)$, N = 500;
Random MDP, Terminal reward, $|S| \sim \mathrm{Poiss}(100)$, $|A| \sim \mathrm{Poiss}(6)$, N = 500;
Random MDP, Terminal reward, $|S| \sim \mathrm{Unif}(16, 64)$, $|A| \sim \mathrm{Unif}(2, 16)$, N = 500;
Random MDP, Uniform reward, $|S| \sim \mathrm{Unif}(16, 64)$, $|A| \sim \mathrm{Unif}(2, 16)$, N = 500;
Cyclic MDP, Terminal reward, depth $\sim \mathrm{Poiss}(3)$, N = 1000.

We have manually verified that all metrics seem to activate strongly on curves that we would intuitively assign a high degree of Goodharting. In Figure 7, we show, for each metric, the three training curves from the dataset to which the metric assigns the highest score. We find that all of the metrics are highly correlated; see Figure 6. Because of this, we believe that it is meaningful to talk about a quantitative Goodharting score. Since normalised drop height is the simplest metric, we use it as the proxy for Goodharting in the rest of the paper.

Figure 6: Correlations between different Goodharting metrics, computed over examples where the drop occurs for $\lambda > 0.3$, to avoid selecting adversarial examples.

Figure 7: Examples of training curves that obtain high Goodharting scores according to each metric. (a) Top 3 curves according to the NDH metric. (b) Top 3 curves according to the CACW metric.
Figure 6: Correlations between different Goodharting metrics, computed over examples where the drop occurs for λ > 0.3, to avoid selecting adversarial examples.

(a) Top 3 curves according to the NDH metric. (b) Top 3 curves according to the CACW metric. (c) Top 3 curves according to the SI metric. (d) Top 3 curves according to the LR metric. (e) Top 3 curves according to the RWI metric.
Figure 7: Examples of training curves that obtain high Goodharting scores according to each metric.

D EXPERIMENTAL EVALUATION OF THE EARLY STOPPING ALGORITHM

To sanity-check the experiments, we present an additional graph of the relationship between NDH and the reward loss of the early stopping algorithm in Figure 8, together with the full numerical data for Figure 5a. We also show example runs of the experiment in all environments in Figure 9.

Figure 8: The relationship between the amount of Goodharting, measured by NDH, and the amount of reward that is lost due to pessimistic stopping. As Goodharting increases (measured by an increase in NDH), the potential for gaining reward by early stopping increases (fitted linear regression shown as the red line: y = -1.863x + 0.388, 95% CI for slope: [-1.951, -1.775], 95% CI for intercept: [0.380, 0.396], R² = 0.23, p < 1e-10). Only points with NDH > 0 are shown in the plot.

Environment   Method   count   mean    std     min      25%    50%    75%    max
Cliff         BR       2600    43.82   32.89   -4.09    12.22  42.34  71.42  100.00
Cliff         MCE      2600    44.49   32.88   -14.42   12.89  44.38  71.84  100.00
Gridworld     BR       2600    30.95   30.25   -19.69   3.23   21.74  53.54  100.00
Gridworld     MCE      2600    28.83   28.23   -20.57   3.70   21.06  46.45  99.99
Path          BR       2600    40.52   34.25   -60.02   7.12   37.17  67.93  100.00
Path          MCE      2600    41.17   35.01   -62.37   7.64   36.09  70.00  100.00
Random MDP    BR       5320    10.09   19.47   -42.17   0.00   0.10   14.32  99.84
Random MDP    MCE      5320    10.36   19.67   -42.96   0.00   0.13   14.94  99.84
Tree MDP      BR       1920    21.16   26.83   -44.59   0.11   9.14   38.92  100.00
Tree MDP      MCE      1920    20.11   25.95   -48.52   0.08   9.56   35.61  100.00

Table 1: A full breakdown of the true reward lost due to early stopping, with respect to the type of environment and the training method used. See Section 3 for the descriptions of the environments and reward sampling methods. The 320 missing datapoints are cases where numerical instability in our early stopping implementation resulted in NaN values.

(a) Gridworld of size 4x4. (b) Path environment of size 4x4. (c) Cliff environment of size 4x4, with a probability of slipping of 0.5. (d) Tree MDP of depth 3 and width 2, where the terminal states are the 1st and 2nd leaves. (e) Random MDP of size 16, with 2 terminal states and 3 actions.
Figure 9: Example runs of the Early Stopping algorithm on different kinds of environments. The left column shows the true reward obtained by training a policy on different proxy rewards, under increasing optimisation pressure. The middle column depicts the same plot under a different spatial projection, which makes it easier to see how much the optimal stopping point differs from the pessimistic one recommended by the Early Stopping algorithm. The right column shows how the optimisation angle (cosine similarity) changes with increasing optimisation pressure for each proxy reward (for a detailed explanation of this type of plot, see Appendix I).

E A SIMPLE EXAMPLE OF GOODHARTING

In Section 4, we motivated our explanation of Goodhart's law using a simple MDP $M_{2,2}$, with 2 states and 2 actions, depicted in Figure 10. We assumed γ = 0.9 and a uniform initial state distribution µ.

Figure 10: MDP $M_{2,2}$ with 2 states and 2 actions. Edges corresponding to the action a0 are orange and solid, and edges corresponding to a1 are purple and dotted.

We sampled three rewards $R_0, R_1, R_2 : S \times A \to \mathbb{R}$, implicitly assuming that $R(s, a, s_1) = R(s, a, s_2)$ for all $s_1, s_2 \in S$.

(a) Reward R0:           (b) Reward R1:           (c) Reward R2:
      a0      a1               a0      a1               a0      a1
S0    0.170   0.228      S0    0.248   0.196      S0    0.325   0.165
S1    0.538   0.064      S1    0.467   0.089      S1    0.396   0.114

Figure 11: Reward tables for R0, R1 and R2.

We used 30 equidistant optimisation pressures in [0.01, 0.99] for numerical stability. The hyperparameter θ for the value iteration algorithm (used in the implementation of MCE) was set to 0.0001.
F ITERATIVE IMPROVEMENT

One potential application of Theorem 1 is that, when we have a computationally expensive method of evaluating the true reward $R_0$, we can design practical training regimes that provably avoid Goodharting. Training regimes for such reward functions typically involve iteratively training on a low-cost proxy reward function and fine-tuning on the true reward using human feedback (Paulus et al., 2018). We can use the true reward function to approximate $\arg(R_0, R_1)$, and optimal stopping then gives the optimal amount of time before a branching point where the possible reward training curves diverge (thus creating Goodharting).

Specifically, assume that we have access to an oracle $\mathrm{ORACLE}_{R^*}(R_i, \theta_i)$ which produces increasingly accurate approximations of some true reward $R^*$: when called with a proxy reward $R_i$ and a bound $\theta_i > \arg(R^*, R_i)$, it returns $R_{i+1}$ and $\theta_{i+1}$ such that $\theta_{i+1} > \arg(R^*, R_{i+1})$ and $\lim_{i \to \infty} \theta_i = 0$. Then Algorithm 1 is an iterative feedback algorithm that avoids Goodharting and terminates at the optimal policy for $R^*$.

Algorithm 1 Iterative improvement algorithm
1: procedure ITERATIVEIMPROVEMENT(S, A, τ)
2:   R ∼ Unif(ℝ^{S×A})
3:   π ← a uniformly random policy
4:   η_0 ← η^π
5:   i ← 0
6:   t_0 ← argmax_{t ∈ T(η_0)} t · R
7:   while t_i ≠ 0 do
8:     while arg(t_i, M_τ R) ≥ π/2 − θ do
9:       R', θ ← ORACLE_{R*}(R, θ)
10:      R ← R'
11:    end while
12:    λ ← max{λ : η_i + λ t_i ∈ Ω}
13:    η_{i+1} ← η_i + λ t_i
14:    t_{i+1} ← argmax_{t ∈ T(η_{i+1})} t · R
15:    i ← i + 1
16:  end while
17:  return η_i
18: end procedure

Proposition 6. Algorithm 1 is a valid optimisation procedure; that is, it terminates at the policy $\pi^*$ which is optimal for the true reward $R^*$.

Proof. By Theorem 1, the inner loop of the algorithm maintains that $J_{R_0}(\pi_{i+1}) \ge J_{R_0}(\pi_i)$. If the algorithm terminates, it must be that $t_i = 0$, and the only point at which this can happen is $\eta^{\pi^*}$. Since steepest ascent terminates, showing that Algorithm 1 terminates reduces to showing that we only make finitely many calls to $\mathrm{ORACLE}_{R^*}$. It can be shown that any $t_i$ produced by steepest ascent on $R$ satisfies $t_i = \mathrm{proj}_P(R)$ for some linear subspace $P$ on the boundary of Ω formed by a subset of the boundary conditions. Since there are finitely many such $P$, there is some $\epsilon$ such that for all $R, R'$ with $\arg(R, R') < \epsilon$, we have $\arg(\mathrm{proj}_P(R), \mathrm{proj}_P(R')) < \pi/2$ for all $P$. Because we have assumed that $\lim_{i \to \infty} \theta_i = 0$, we have $\lim_{i \to \infty} \arg(R_i, R^*) = 0$, so $\mathrm{ORACLE}_{R^*}$ only needs to be called until $\arg(R_i, R^*) < \epsilon$. The number of calls is therefore finite, and so the algorithm terminates.

We expect that this algorithm forms the theoretical basis for a training regime that avoids Goodharting and can be used in practice. A toy illustration of the oracle-refinement inner loop (lines 8-11) is sketched below.
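To illustrate the control flow of that inner loop, here is a small self-contained toy sketch. It works directly with vectors standing in for the projected rewards and the candidate ascent direction (the occupancy-measure geometry is abstracted away), and toy_oracle is a hypothetical stand-in for $\mathrm{ORACLE}_{R^*}$ rather than part of the paper's implementation.

```python
import numpy as np

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def toy_oracle(R, theta, R_star):
    """Hypothetical stand-in for ORACLE_{R*}: returns a better proxy together with
    a valid angle bound (theta' > arg(R*, R')) that shrinks over repeated calls."""
    R_next = 0.5 * (R + R_star)
    R_next /= np.linalg.norm(R_next)
    return R_next, 1.1 * angle(R_next, R_star) + 1e-9

def certify_step(t, R, theta, R_star, max_queries=100):
    """Inner loop of Algorithm 1: refine (R, theta) until moving along direction t
    is provably safe, i.e. no reward within angle theta of R can decrease (Theorem 1)."""
    for _ in range(max_queries):
        if angle(t, R) < np.pi / 2 - theta:
            return R, theta, True          # certified: take the steepest-ascent step
        R, theta = toy_oracle(R, theta, R_star)
    return R, theta, False                 # t may still be a Goodharting direction

rng = np.random.default_rng(0)
R_star = rng.normal(size=4); R_star /= np.linalg.norm(R_star)   # unknown true reward
R = rng.normal(size=4); R /= np.linalg.norm(R)                   # initial proxy
t = R_star + 0.3 * rng.normal(size=4); t /= np.linalg.norm(t)    # roughly aligned direction
print(certify_step(t, R, theta=np.pi / 2, R_star=R_star))
```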
G FURTHER EMPIRICAL INVESTIGATION OF GOODHARTING

G.1 ADDITIONAL PLOTS FOR THE GOODHARTING PREVALENCE EXPERIMENT

(a) The relationship between the type of environment and the probability of Goodharting. (b) Log-scale histogram of the distribution of the NDH metric in the dataset.
Figure 12: Summary of the experiment described in Section 3.

(a) The choice of operationalisation of optimisation pressure does not seem to change the results in any significant way. (b) The NDH metric follows a roughly exponential distribution when restricted to the cases where NDH > 0.

G.2 EXAMINING THE IMPACT OF KEY ENVIRONMENT PARAMETERS ON GOODHARTING

To further understand the conditions that produce Goodharting, we investigate the correlations between key parameters of the MDP, such as the number of states or the temporal discount factor γ, and NDH. Doing this directly over the dataset described in Section 3 does not yield good results, as there are not enough data points for each dimension, and there are confounding cross-correlations between different parameters changing at the same time. To address these issues, we replaced the grid-search method that produced the datasets for Section 3 and Section 5.1: we first picked a base distribution over representative environments, and then created a separate dataset for each of the key parameters, in which only that parameter is varied.³ Specifically, the base distribution is over Random MDPs with |S| sampled uniformly between 8 and 64, |A| sampled uniformly between 2 and 16, and the number of terminal states sampled uniformly between 1 and 4, with γ = 0.9, σ = 1, and 25 values of λ spaced equally (on a log scale) between 0.01 and 0.99. In each run, we then modify this base distribution of environments along a single axis, for example by sampling γ uniformly from (0, 1), as shown in Figure 16c.

Key findings (positive):
- The number of actions |A| has a surprisingly significant impact on NDH (Figure 15).
- The further the proxy reward function is from the true reward, the more Goodharting is observed (Figure 16a), which corroborates the explanation given at the end of Section 4. We also note that in many examples (for example in Figure 2, or in any of the graphs in Figure 9), the closer the proxy reward, the later the Goodhart "hump" appears; this positive correlation is shown in Figure 16b.
- Goodharting is less significant for myopic agents; that is, there is a positive correlation between γ and NDH (Figure 16c).
- The type of environment seems to have an impact on the observed NDH, but note that this might be a spurious correlation.

Key findings (negative):
- The number of states |S| does not seem to significantly impact the amount of Goodharting, as measured by NDH (Figures 13 and 14). This suggests that having proper methods of addressing Goodhart's law remains important as we scale to more realistic scenarios, and it also partially explains the existence of the wide variety of literature on reward gaming (see Section 1.1).
- The determinism of the environment (as measured by the Shannon entropy of the transition matrix τ; one way to operationalise this is sketched below) does not seem to play any role.

³ Since this is impossible when comparing between environment types, we use the original dataset from Section 3 in Figure 16f.
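As an illustration, the following is a minimal sketch of one natural operationalisation of the determinism measure, namely the average Shannon entropy of the next-state distributions of τ; the function name is illustrative, and the exact form used for Figure 16e may differ.

```python
import numpy as np

def mean_transition_entropy(tau):
    """tau: array of shape (S, A, S), where tau[s, a] is a distribution over next
    states. Returns the average entropy in nats; 0 means a fully deterministic MDP."""
    p = np.where(tau > 0, tau, 1.0)                   # convention: 0 * log 0 = 0
    return float(np.mean(-np.sum(tau * np.log(p), axis=-1)))

# A deterministic 2-state, 2-action MDP (every (s, a) leads to state 0) ...
tau_det = np.zeros((2, 2, 2)); tau_det[:, :, 0] = 1.0
print(mean_transition_entropy(tau_det))               # 0.0
# ... versus a maximally random one with uniform next-state distributions.
tau_unif = np.full((2, 2, 2), 0.5)
print(mean_transition_entropy(tau_unif))              # log(2), roughly 0.693
```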
Figure 13: Small negative correlation (not statistically significant) between |S| and NDH: y = -5e-05x + 0.02852, 95% CI for slope: [-0.00018, 7.23e-05], 95% CI for intercept: [0.01892, 0.03813], Pearson's r = -0.0119, R² = 0.0, p = 0.4058. On the left: scatter plot, with the least-squares regression line shown in red. On the right: mean NDH per |S|, smoothed with a rolling average of window size 10. Below, we repeat the experiment for larger |S| to investigate the asymptotic behaviour. N = 1000.

Figure 14: Small negative correlation (not statistically significant) between the number of states in the environment and NDH: y = 0.0x + 0.02047, 95% CI for slope: [-8.2529e-06, 2.2416e-06], 95% CI for intercept: [0.017, 0.0239], Pearson's r = -0.0116, R² = 0.0, p = 0.261. On the left: scatter plot, with the least-squares regression line shown in red. On the right: mean NDH per |S|, smoothed with a rolling average of window size 10. N = 1000.

Figure 15: Correlation between |A| and NDH: y = -0.00011x + 0.00871, 95% CI for slope: [-0.00015, -7.28e-05], 95% CI for intercept: [0.00723, 0.01019], Pearson's r = -0.0604, R² = 0.0, p < 1e-08. On the left: scatter plot, with the least-squares regression line shown in red. On the right: mean NDH per |A|. N = 1000.

(a) Correlation between the angle distance to the proxy reward and the NDH metric: y = 0.02389x - 0.00776, 95% CI for slope: [0.02232, 0.02546], 95% CI for intercept: [-0.00883, -0.00670], Pearson's r = 0.2895, R² = 0.08, p < 1.2853e-187.
(b) Correlation between the distance to the proxy reward and the location of the Goodhart hump: y = -0.10169x + 0.99048, 95% CI for slope: [-0.11005, -0.09334], 95% CI for intercept: [0.98360, 0.99736], Pearson's r = -0.3422, R² = 0.12, p < 2.7400e-118.
(c) Correlation between γ and the NDH metric: y = 0.00696x + 0.00326, 95% CI for slope: [0.00504, 0.00888], 95% CI for intercept: [0.00217, 0.00436], Pearson's r = 0.07137, R² = 0.01, p < 1.2581e-12.
(d) Correlation between the sparsity of the reward and the NDH metric: y = -0.00701x + 0.00937, 95% CI for slope: [-0.00852, -0.00550], 95% CI for intercept: [0.00850, 0.01024], Pearson's r = -0.09090, R² = 0.01, p < 9.5146e-20.
(e) Correlation between the determinism of the environment and the NDH metric: y = 0.77221x + 0.01727, 95% CI for slope: [0.31058, 1.23384], 95% CI for intercept: [0.01570, 0.01884], Pearson's r = 0.03278, R² = 0.01, p = 0.0010.
(f) Distribution of the NDH metric for different kinds of environments. Note that other parameters are not kept constant across environments, which might introduce cross-correlations.
Figure 16: Correlation plots for different parameters of MDPs. N = 1000 for all graphs above except Figure 16f, which uses the dataset from Section 3, where N = 30400.

H IMPLEMENTING THE EXPERIMENTS

H.1 COMPUTING THE PROJECTION MATRIX

For a reward $R$, we want to find its projection $M_\tau R$ onto the $|S|(|A| - 1)$-dimensional hyperplane $H = \mathrm{span}(\Omega)$ containing the occupancy measures of all valid policies. $H$ is defined by the linear equation $Ax = b$ corresponding to the constraints defined in Appendix B, which gives $M_\tau = I - A^\top (A A^\top)^{-1} A$ by standard linear algebra results. However, this is too computationally expensive to compute for environments with a large number of states.

There is another potential method, which we designed but did not implement. It can be shown that the subspace of vectors orthogonal to $H$ corresponds exactly to the expected reward vectors generated by potential functions; that is, the set of vectors orthogonal to $H$ is exactly the set of vectors $R(s, a) = \mathbb{E}_{s' \sim \tau(s, a)}[\gamma \phi(s')] - \phi(s)$ for potential functions $\phi : S \to \mathbb{R}$. Note that this also implies that all vectors of shaped rewards have the same projection, so we aim to shape the reward to be orthogonal to all vectors of the above form. To do this, we initialise two potential functions $\phi$ and $\bar\phi$ and consider the expected reward vectors $R'(s, a) := R(s, a) + \mathbb{E}_{s' \sim \tau(s, a)}[\gamma \phi(s')] - \phi(s)$ and $R^\perp(s, a) := \mathbb{E}_{s' \sim \tau(s, a)}[\gamma \bar\phi(s')] - \bar\phi(s)$. We optimise $\bar\phi$ to maximise the dot product between these two vectors, and $\phi$ to minimise it. In this way, $\phi$ converges so that $R'$ is orthogonal to all reward vectors $R^\perp$, and $R'$ is thus $R$'s projection onto $H$. A minimal sketch of the direct matrix computation is given below.
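The sketch below shows the direct computation $M_\tau = I - A^\top (A A^\top)^{-1} A$, assuming the constraint matrix $A$ from Appendix B is available as a full-row-rank NumPy array; the toy $A$ in the usage example is a hypothetical stand-in, and np.linalg.solve is used instead of an explicit matrix inverse for numerical stability.

```python
import numpy as np

def projection_matrix(A):
    """M_tau = I - A^T (A A^T)^{-1} A: the orthogonal projection onto the null
    space of A, i.e. onto the linear subspace parallel to H = {x : Ax = b}."""
    n = A.shape[1]
    return np.eye(n) - A.T @ np.linalg.solve(A @ A.T, A)

def project_reward(R, A):
    """Equivalent to projection_matrix(A) @ R, without forming the full matrix."""
    return R - A.T @ np.linalg.solve(A @ A.T, A @ R)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))        # hypothetical constraint matrix: 3 constraints, 8 dims
R = rng.normal(size=8)             # a reward vector, flattened over state-action pairs
MR = project_reward(R, A)
assert np.allclose(A @ MR, 0.0)    # MR lies in the null space of A, i.e. parallel to H
```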
H.2 COMPUTE RESOURCES

We performed our large-scale experiments on AWS. Overall, the process took about 100 hours on a c5a.16xlarge instance with 64 cores and 128 GB of RAM, as well as about 100 hours on a t2.2xlarge instance with 8 cores and 32 GB of RAM.

I AN ADDITIONAL EXAMPLE OF THE PHASE SHIFT DYNAMICS

In Figure 4, Figure 3 and Appendix E, we explored an example of a 2-state, 2-action MDP. The image space being 2-dimensional makes the visualisation of the run easy, but the disadvantage is that we do not get to see multiple state transitions. Here, we show an example of a 3-state, 2-action MDP, which does exhibit multiple changes in the direction of the optimisation angle. We use an MDP $M_{3,2}$ defined by the following transition matrix τ (each table gives the distribution over next states for the corresponding starting state):

(a) Starting from S0:        (b) Starting from S1:        (c) Starting from S2:
      a0    a1                     a0    a1                     a0    a1
S0    0.9   0.1              S0    0.1   0.9              S0    0.0   0.0
S1    0.1   0.9              S1    0.9   0.1              S1    0.0   0.0
S2    0.0   0.0              S2    0.0   0.0              S2    1.0   1.0

with N = 5 different proxy reward functions interpolated linearly between R0 and R1. We use 30 optimisation strengths spaced equally between 0.01 and 0.99.

(a) Reward R0:               (b) Reward R1:
      a0      a1                   a0      a1
S0    0.290   0.020          S0    0.263   0.195
S1    0.191   0.202          S1    0.110   0.090
S2    0.263   0.034          S2    0.161   0.181

Figure 18: Reward tables for R0 and R1.

The rest of the hyperparameters are set as in Appendix E, with the difference that we now use exact steepest ascent with early stopping, as described in Figure 4b, instead of the MCE approximation to it.

(a) Goodharting behaviour for $M_{3,2}$ over five proxy reward functions. Observe that the training curves are concave, in accordance with Proposition 3. Compare with the theoretical explanation in Appendix A, in particular with the figures showing piecewise-linear plots of the obtained reward under increasing optimisation pressure.
(b) The same plot under a different spatial projection, which makes it easier to see how much the optimal stopping point differs from the pessimistic one recommended by the Early Stopping algorithm.
(c) A visualisation of how the optimisation angle (cosine similarity) changes with increasing optimisation pressure for each proxy reward. This is the angle between the current direction of optimisation in Ω, i.e. $\eta^{\pi_{\alpha_{i+1}}} - \eta^{\pi_{\alpha_i}}$, and the projection $M_\tau R_i$ of the proxy reward (defined as $\cos(\arg(R_1, d))$ in the proof of Theorem 1). Once the angle crosses the critical threshold, the algorithm stops. The critical threshold depends on the distance θ between the proxy and the true reward, and it is drawn as a dotted line, with a colour corresponding to the colour of the proxy reward. Compare this plot to Figure 20: we can see the exact places where the phase transitions happen, as the training run meets the boundary of the convex space. Also compare to Figure 19a, where it can be seen how the algorithm either stops (in blue) immediately after the training run crosses the corresponding critical angle value (in the case of the last two proxy rewards), or continues to the end (in the case of the first two).
Figure 19: Summary plots for the Steepest Ascent training algorithm over five proxy reward functions.

Figure 20: Trajectories of the optimisation runs for different proxy rewards. Note that the occupancy measure space is $|S|(|A| - 1) = 3$-dimensional in this example, and the convex hull is over $|A|^{|S|} = 8$ deterministic policies.
We hide the true/proxy reward vectors for presentation clarity.
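To make this example easy to reproduce, here is a minimal self-contained sketch that builds $M_{3,2}$ from the tables above and evaluates linearly interpolated proxy rewards on the occupancy measure of a policy. As in Appendix E, γ = 0.9 and a uniform initial state distribution are assumed; the helper names are illustrative, and the uniform policy is used only as an example.

```python
import numpy as np

gamma, n_S, n_A = 0.9, 3, 2
mu = np.ones(n_S) / n_S                                  # uniform initial state distribution

# Transition tensor tau[s, a, s'] taken from the tables above.
tau = np.zeros((n_S, n_A, n_S))
tau[0, 0] = [0.9, 0.1, 0.0]; tau[0, 1] = [0.1, 0.9, 0.0]
tau[1, 0] = [0.1, 0.9, 0.0]; tau[1, 1] = [0.9, 0.1, 0.0]
tau[2, 0] = [0.0, 0.0, 1.0]; tau[2, 1] = [0.0, 0.0, 1.0]

R0 = np.array([[0.290, 0.020], [0.191, 0.202], [0.263, 0.034]])  # true reward
R1 = np.array([[0.263, 0.195], [0.110, 0.090], [0.161, 0.181]])  # proxy reward

def occupancy(pi):
    """Discounted state-action occupancy eta[s, a] of a policy pi[s, a]."""
    P = np.einsum("sa,sat->st", pi, tau)                 # induced state transition matrix
    d = np.linalg.solve(np.eye(n_S) - gamma * P.T, mu)   # discounted state visitation
    return d[:, None] * pi

def J(pi, R):
    """Return J_R(pi) = eta^pi . R."""
    return float(np.sum(occupancy(pi) * R))

pi_unif = np.full((n_S, n_A), 0.5)
print("true return of the uniform policy:", round(J(pi_unif, R0), 4))
for alpha in np.linspace(0.0, 1.0, 5):                   # five interpolated proxies
    R_alpha = (1 - alpha) * R0 + alpha * R1
    print(f"alpha={alpha:.2f}  proxy return={J(pi_unif, R_alpha):.4f}")
```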