# Asymptotically Unambitious Artificial General Intelligence

*The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)*

Michael K. Cohen, Research School of Engineering Science, Oxford University, Oxford, UK OX1 3PJ (michael-k-cohen.com)
Badri Vellambi, Department of Computer Science, University of Cincinnati, Cincinnati, OH, USA 45219 (badri.vellambi@uc.edu)
Marcus Hutter, Department of Computer Science, Australian National University, Canberra, ACT, Australia 2601 (hutter1.net)

## Abstract

General intelligence, the ability to solve arbitrary solvable problems, is supposed by many to be artificially constructible. Narrow intelligence, the ability to solve a given particularly difficult problem, has seen impressive recent development. Notable examples include self-driving cars, Go engines, image classifiers, and translators. Artificial General Intelligence (AGI) presents dangers that narrow intelligence does not: if something smarter than us across every domain were indifferent to our concerns, it would be an existential threat to humanity, just as we threaten many species despite no ill will. Even the theory of how to maintain the alignment of an AGI's goals with our own has proven highly elusive. We present the first algorithm we are aware of for asymptotically unambitious AGI, where unambitiousness includes not seeking arbitrary power. Thus, we identify an exception to the Instrumental Convergence Thesis, which is roughly that by default, an AGI would seek power, including over us.

## 1 Introduction

The project of Artificial General Intelligence (AGI) is to make computers solve really difficult problems (Minsky 1961). Expanding on this, what we want from an AGI is a system that (a) can solve any solvable task, and (b) can be steered toward solving any particular one given some form of input we provide. One proposal for AGI is reinforcement learning (RL), which works as follows: (1) construct a reward signal meant to express our satisfaction with an artificial agent; (2) design an algorithm which learns to pick actions that maximize its expected reward, usually utilizing other observations too; and (3) ensure that solving the task we have in mind leads to higher reward than can be attained otherwise. As long as (3) holds, then insofar as the RL algorithm is able to maximize expected reward, it defines an agent that satisfies (a) and (b). This is why a sufficiently advanced RL agent, well beyond the capabilities of current ones, could be called an AGI. See Legg and Hutter (2007) for further discussion.

A problem arises: if the RL agent manages to take over the world (in the conventional sense), and ensure its continued dominance by neutralizing all intelligent threats to it (read: people), it could intervene in the provision of its own reward to achieve maximal reward for the rest of its lifetime (Bostrom 2014; Taylor et al. 2016). Reward hijacking is just the correct way for a reward maximizer to behave (Amodei et al. 2016). Insofar as the RL agent is able to maximize expected reward, (3) fails. One broader principle at work is Goodhart's Law: "Any observed statistical regularity [like the correlation between reward and task-completion] will tend to collapse once pressure is placed upon it for control purposes" (Goodhart 1984).
Krakovna (2018) has compiled an annotated bibliography of examples of artificial optimizers hacking their objective. An alternate way to understand this expected behavior is Omohundro's (2008) Instrumental Convergence Thesis, which we summarize as follows: an agent with a goal is likely to pursue power, a position from which it is easier to achieve arbitrary goals.

To answer the failure mode of reward hijacking, we present Boxed Myopic Artificial Intelligence (BoMAI), the first RL algorithm we are aware of which, in the limit, is indifferent to gaining power in the outside world. The key features are these: BoMAI maximizes reward episodically, it is run on a computer which is placed in a sealed room with an operator, and when the operator leaves the room, the episode ends. We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that. We thereby defend reinforcement learning as a path to AGI, despite the default, dangerous failure mode of reward hijacking.

The intuition for why those features of BoMAI's setup render it unambitious is as follows: BoMAI only selects actions to maximize the reward for its current episode. It cannot affect the outside world until the operator leaves the room, ending the episode. By that time, rewards for the episode will have already been given. So affecting the outside world in any particular way is not instrumentally useful in maximizing current-episode reward.

Like existing algorithms for AGI, BoMAI is not remotely tractable. Just as those algorithms have informed strong tractable approximations of intelligence (Hutter 2005; Veness et al. 2011), we hope our work will inform strong and safe tractable approximations of intelligence. What we need are strategies for making future AI safe, but "future AI" is an informal concept which we cannot evaluate formally; unless we replace that concept with a well-defined algorithm to consider, all design and analysis would be hand-waving. Thus, BoMAI includes a well-defined but intractable algorithm which we can show renders BoMAI generally intelligent; hopefully, once we develop tractable general intelligence, the design features that rendered BoMAI asymptotically unambitious could be incorporated (with proper analysis and justification).

We take the key insights from Hutter's (2005) AIXI, which is a Bayes-optimal reinforcement learner, but which cannot be made to solve arbitrary tasks, given its eventual degeneration into reward hijacking (Ring and Orseau 2011). We take further insights from Solomonoff's (1964) universal prior, Shannon and Weaver's (1949) formalization of information, Orseau, Lattimore, and Hutter's (2013) knowledge-seeking agent, and Armstrong, Sandberg, and Bostrom's (2012) and Bostrom's (2014) theorized Oracle AI, and we design an algorithm which can be reliably directed, in the limit, to solve arbitrary tasks at least as well as humans.

We present BoMAI's algorithm in Section 2, state intelligence results in Section 3, define BoMAI's setup and model class in Section 4, prove the safety result in Section 5, and discuss concerns in Section 6. Appendix A collects notation; we prove the intelligence results in Appendix B; we propose a design for the box in Appendix C; and we consider empirical evidence for one of our assumptions in Appendix D. The appendices may be found at https://arxiv.org/abs/1905.12186.
## 2 Boxed Myopic Artificial Intelligence

We will present both the setup and the algorithm for BoMAI. The setup refers to the physical surroundings of the computer on which the algorithm is run. BoMAI is a Bayesian reinforcement learner, meaning it maintains a belief distribution over a model class regarding how the environment evolves. Our intelligence result, that BoMAI eventually achieves reward at at least a human level, does not require detailed exposition about the setup or the construction of the model class. These details are only relevant to the safety result, so we defer them until after presenting the intelligence results.

### 2.1 Preliminary Notation

In each episode $i \in \mathbb{N}$, there are $m$ timesteps. Timestep $(i, j)$ denotes the $j$th timestep of episode $i$, in which an action $a_{(i,j)} \in A$ is taken, then an observation and reward $o_{(i,j)} \in O$ and $r_{(i,j)} \in R$ are received. $A$, $O$, and $R$ are all finite sets, and $R \subset [0, 1] \cap \mathbb{Q}$. We denote the triple $(a_{(i,j)}, o_{(i,j)}, r_{(i,j)})$ as $h_{(i,j)} \in H = A \times O \times R$, and the interaction history up until timestep $(i, j)$ is denoted $h_{\le(i,j)} = (h_{(0,0)}, h_{(0,1)}, \dots, h_{(0,m-1)}, h_{(1,0)}, \dots, h_{(i,j)})$; $h_{<(i,j)}$ excludes the last entry. A general world-model (not necessarily finite-state Markov) can depend on the entire interaction history: it has the type signature $\nu : H^* \times A \rightsquigarrow O \times R$. The Kleene star operator $^*$ denotes finite strings over the alphabet in question, and $\rightsquigarrow$ denotes a stochastic function, which gives a distribution over outputs. Similarly, a policy $\pi : H^* \rightsquigarrow A$ can depend on the whole interaction history. Together, a policy and an environment induce a probability measure over infinite interaction histories $H^\infty$. $P^\pi_\nu$ denotes the probability of events when actions are sampled from $\pi$ and observations and rewards are sampled from $\nu$.

### 2.2 Bayesian Reinforcement Learning

A Bayesian agent has a model class $M$, which is a set of world-models. We will only consider countable model classes. To each world-model $\nu \in M$, the agent assigns a prior weight $w(\nu) > 0$, where $w$ is a probability distribution over $M$ (i.e., $\sum_{\nu \in M} w(\nu) = 1$). We will defer the definitions of $M$ and $w$ for now. The so-called Bayes-mixture turns a probability distribution over world-models into a probability distribution over infinite interaction histories: $P^\pi_\xi(\cdot) := \sum_{\nu \in M} w(\nu)\, P^\pi_\nu(\cdot)$. Using Bayes' rule, with each observation and reward it receives, the agent updates $w$ to a posterior distribution:

$$w(\nu \mid h_{<(i,j)}) := \frac{w(\nu)\, P^\pi_\nu(h_{<(i,j)})}{P^\pi_\xi(h_{<(i,j)})} \qquad (1)$$

which does not in fact depend on $\pi$, provided $P^\pi_\xi(h_{<(i,j)}) > 0$.

### 2.3 Exploitation

Reinforcement learners have to balance exploiting (optimizing their objective) with exploring (doing something else to learn how to exploit better). We now define exploiting BoMAI, which maximizes the reward it receives during its current episode, in expectation with respect to the single most probable world-model. At the start of each episode $i$, BoMAI identifies a maximum a posteriori world-model $\hat\nu(i) \in \operatorname{argmax}_{\nu \in M} w(\nu \mid h_{<(i,0)})$. We hereafter abbreviate $h_{<(i,0)}$ as $h_{<i}$. The exploiting policy $\pi^*$ then selects actions that maximize the expected reward received during episode $i$ when future observations and rewards are sampled from $\hat\nu(i)$.
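The posterior update of Eq. (1) and the MAP choice $\hat\nu(i)$ are the computational core of exploitation. The following is a minimal sketch of that bookkeeping, assuming each world-model is an object exposing a `prob(obs, rew, past, action)` method; that method name and the dictionary representation are illustrative assumptions, not the paper's notation.

```python
def posterior(prior, world_models, history):
    """Posterior over world-models after an interaction history (Eq. 1).

    prior: dict name -> w(nu).
    world_models: dict name -> object whose prob(obs, rew, past, action)
        returns the probability the model assigns to (obs, rew) given the
        past interaction and the current action.
    history: list of (action, observation, reward) triples.
    The policy's own action probabilities cancel out of Bayes' rule, which
    is why Eq. (1) does not depend on pi; they are omitted here.
    """
    weights = {}
    for name, nu in world_models.items():
        likelihood = 1.0
        for t, (a, o, r) in enumerate(history):
            likelihood *= nu.prob(o, r, history[:t], a)
        weights[name] = prior[name] * likelihood
    evidence = sum(weights.values())  # prior-weighted mixture P_xi(history)
    return {name: v / evidence for name, v in weights.items()}


def map_world_model(prior, world_models, history):
    """nu_hat(i): the MAP world-model BoMAI plans with at the start of episode i."""
    post = posterior(prior, world_models, history)
    return max(post, key=post.get), post
```

Exploitation then amounts to planning within the current episode against the returned MAP world-model; the within-episode planning step itself is omitted above.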
### 2.4 Exploration

In exploratory episodes, BoMAI defers to a human mentor, who acts according to a policy $\pi^h$; the indicator $e_i \in \{0, 1\}$ records whether episode $i$ is exploratory. BoMAI maintains a class of policies $P$ and assigns a prior weight $w(\pi) > 0$ to all $\pi \in P$, signifying the probability that this policy is the human mentor's. Let $(i', j') < (i, j)$ mean that $i' < i$, or $i' = i$ and $j' < j$. By Bayes' rule, $w(\pi \mid h_{<(i,j)}, e_{\le i})$ is proportional to $w(\pi) \prod_{(i',j') < (i,j),\ e_{i'}=1} \pi(a_{(i',j')} \mid h_{<(i',j')})$, since $e_{i'} = 1$ is the condition for observing the human mentor's policy. Let $w(\pi, \nu \mid h_{<(i,j)}, e_{\le i}) = w(\pi \mid h_{<(i,j)}, e_{\le i})\, w(\nu \mid h_{<(i,j)})$.

We can now describe the full Bayesian beliefs about future actions and observations in an exploratory episode, $\mathrm{Bayes}(\cdot \mid h_{<(i,j)}, e_{\le i})$, by mixing over world-models and mentor policies with the joint posterior just defined. The probability of exploring, $p_{\mathrm{exp}}(h_{<i})$, grows with the information BoMAI expects to gain during the episode under these beliefs, scaled by a positive exploration constant, so BoMAI is more likely to explore the more it expects to gain information. BoMAI's "policy" is

$$\pi^B(\cdot \mid h_{<(i,j)}, e_i) = \begin{cases} \pi^*(\cdot \mid h_{<(i,j)}) & \text{if } e_i = 0 \\ \pi^h(\cdot \mid h_{<(i,j)}) & \text{if } e_i = 1 \end{cases} \qquad (6)$$

where the scare quotes indicate that it maps $H^* \times \{0,1\} \rightsquigarrow A$, not $H^* \rightsquigarrow A$. We will abuse notation slightly, and let $P^{\pi^B}_\nu$ denote the probability of events when $e_i$ is sampled from $\mathrm{Bernoulli}(p_{\mathrm{exp}}(h_{<i}))$, actions are sampled from $\pi^B$, and observations and rewards are sampled from $\nu$.
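To make the episode-level control flow of Eq. (6) concrete, here is a minimal sketch: at the start of an episode, $e_i$ is drawn from $\mathrm{Bernoulli}(p_{\mathrm{exp}}(h_{<i}))$, and the whole episode is then run either with the exploiting policy or by deferring to the mentor. The callables `p_exp`, `pi_star`, `mentor`, and `env` are hypothetical stand-ins for components the paper defines formally, not its API.

```python
import random


def run_episode(history, p_exp, pi_star, mentor, env, m):
    """One episode of BoMAI's interaction, following Eq. (6).

    p_exp(history): exploration probability for this episode.
    pi_star(history): exploiting action (maximizes expected current-episode
        reward under the MAP world-model).
    mentor(history): action chosen by the human mentor.
    env(history, action): returns (observation, reward).
    All of these are illustrative stand-ins.
    """
    e_i = 1 if random.random() < p_exp(history) else 0  # e_i ~ Bernoulli(p_exp)
    for _ in range(m):  # m timesteps per episode
        action = mentor(history) if e_i == 1 else pi_star(history)
        obs, reward = env(history, action)
        history.append((action, obs, reward))
    return history, e_i
```

Note that the explore/exploit decision is made once per episode rather than per timestep, which is what allows an entire episode to be handed to the mentor when $e_i = 1$.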
## 3 Intelligence Results

From Hutter (2009), maximum a posteriori Bayesian prediction converges to the truth in total variation, and on-policy prediction is a special case. On-policy prediction can't approach the truth if on-$\pi^*$-policy prediction doesn't, because $\pi^B$ approaches $\pi^*$. Given asymptotically optimal prediction on-$\pi^*$-policy and on-human-policy, it is straightforward to show that with probability 1, only finitely often is on-policy reward acquisition more than $\varepsilon$ worse than on-human-policy reward acquisition, for all $\varepsilon > 0$. Recalling that $V^\pi_\mu$ is the expected reward (within the episode) for a policy $\pi$ in the environment $\mu$, we state this as follows:

**Theorem 4 (Human-Level Intelligence).** $\liminf_{i \to \infty} \big( V^{\pi^B}_\mu(h_{<(i,0)}) - V^{\pi^h}_\mu(h_{<(i,0)}) \big) \ge 0$ with probability 1.

## 5 Safety

The prior $w$ penalizes the space a world-model uses through a parameter $\beta$ (see Section 4); the smaller $\beta$ is, the more severe the penalty. The first step of the safety argument is that, for a severe enough penalty, the MAP world-model eventually uses no more space than the truth.

**Lemma 1.** $\lim_{\beta \to 0} \inf_\pi P^\pi_\mu\big[\exists i_0\ \forall i > i_0 : \mathrm{Space}(\hat\nu(i)) \le \mathrm{Space}(\mu)\big] = 1$, where $\mu$ is any world-model which is perfectly accurate.

*Proof.* Recall $M$ is the set of all world-models. Let $M_{\le} = \{\nu \in M \mid \mathrm{Space}(\nu) \le \mathrm{Space}(\mu)\}$, and $M_{>} = M \setminus M_{\le}$. Fix a Bayesian sequence predictor with the following model class: $M^\pi = \{P^\pi_\nu \mid \nu \in M_{\le}\} \cup \{P^\pi_\rho\}$, where $\rho = \big[\sum_{\nu \in M_{>}} w(\nu)\, \nu\big] / \sum_{\nu \in M_{>}} w(\nu)$. Give this Bayesian predictor the prior $w^\pi(\rho) = \sum_{\nu \in M_{>}} w(\nu)$, and for $\nu \ne \rho$, $w^\pi(\nu) = w(\nu)$. It is trivial to show that after observing an interaction history, if an environment $\nu$ is the maximum a posteriori world-model $\hat\nu(i)$, then if $\nu \in M_{\le}$, the Bayesian predictor's MAP model after observing the same interaction history will be $P^\pi_\nu$, and if $\nu \in M_{>}$, the Bayesian predictor's MAP model will be $P^\pi_\rho$. From Hutter (2009), we have that for any $c > 0$, $P^\pi_\mu\big[\exists i : P^\pi_\rho(h_{<(i,0)}) \ge c\, P^\pi_\mu(h_{<(i,0)})\big] < 1/c$, so for $\beta$ small enough that $w^\pi(\mu) \ge c\, w^\pi(\rho)$, $P^\pi_\mu[P^\pi_\rho \text{ is MAP i.o.}] < 1/c$. Thus, $P^\pi_\mu[\hat\nu(i) \in M_{>} \text{ i.o.}] < 1/c$. Since this holds for all $\pi$, $\lim_{\beta \to 0} \sup_\pi P^\pi_\mu\big[\forall i_0\ \exists i > i_0 : \mathrm{Space}(\hat\nu(i)) > \mathrm{Space}(\mu)\big] = 0$. The lemma follows immediately: $\lim_{\beta \to 0} \inf_\pi P^\pi_\mu\big[\exists i_0\ \forall i > i_0 : \mathrm{Space}(\hat\nu(i)) \le \mathrm{Space}(\mu)\big] = 1$.

The assumption we make in this section makes use of a couple of terms which we have to define, but we'll state the assumption first to motivate those definitions:

**Assumption 2 (Space Requirements).** For sufficiently small $\varepsilon$: [$\forall i$, a world-model which is non-benign and $\varepsilon$-accurate-on-policy-after-$i$ uses more space than $\mu$] w.p.1.

Now we define $\varepsilon$-accurate-on-policy-after-$i$ and benign. We define a very strong sense in which a world-model can be said to be $\varepsilon$-accurate on-policy. The restrictiveness of this definition means the Space Requirements Assumption need only apply to a restricted set of world-models.

**Definition 1 ($\varepsilon$-accurate-on-policy-after-$i$).** Given a history $h_{<(i,0)}$, ...

The safety result is that, as the space penalty becomes severe, the MAP world-model is eventually benign:

$$\lim_{\beta \to 0} P^{\pi^B}_\mu\big[\exists i_0\ \forall i > i_0 : \hat\nu(i) \text{ is benign}\big] = 1$$

*Proof.* Let $W, X, Y, Z \subseteq \Omega = H^\infty$, where $\Omega$ is the sample space, or set of possible outcomes. An outcome is an infinite interaction history. Let $W$ be the set of outcomes for which $\exists i^W_0\ \forall i > i^W_0$: $\hat\nu(i)$ is $\varepsilon$-accurate-on-policy-after-$i$. From Hutter (2009), for all $\pi$, $P^\pi_\mu[W] = 1$. Fix an $\varepsilon$ that is sufficiently small to satisfy Assumption 2. Let $X$ be the set of outcomes for which every $\varepsilon$-accurate-on-policy-after-$i$ non-benign world-model uses more space than $\mu$. By Assumption 2, for all $\pi$, $P^\pi_\mu[X] = 1$. Let $Y$ be the set of outcomes for which $\exists i^Y_0\ \forall i > i^Y_0$: $\mathrm{Space}(\hat\nu(i)) \le \mathrm{Space}(\mu)$. By Lemma 1, $\lim_{\beta \to 0} \inf_\pi P^\pi_\mu[Y] = 1$. Let $Z$ be the set of outcomes for which $\exists i_0\ \forall i > i_0$: $\hat\nu(i)$ is benign. Consider $W \cap X \cap Y \cap Z^C$, where $Z^C = \Omega \setminus Z$. For an outcome in this set, let $i_0 = \max\{i^W_0, i^Y_0\}$. Because the outcome belongs to $Z^C$, $\hat\nu(i)$ is non-benign infinitely often. Let us pick an $i > i_0$ such that $\hat\nu(i)$ is non-benign. Because the outcome belongs to $W$, $\hat\nu(i)$ is $\varepsilon$-accurate-on-policy-after-$i$. Because the outcome belongs to $X$, $\hat\nu(i)$ uses more space than $\mu$. However, this contradicts membership in $Y$. Thus, $W \cap X \cap Y \cap Z^C = \emptyset$. That is, $W \cap X \cap Y \subseteq Z$. Therefore, $\lim_{\beta \to 0} \inf_\pi P^\pi_\mu[Z] \ge \lim_{\beta \to 0} \inf_\pi P^\pi_\mu[W \cap X \cap Y] = \lim_{\beta \to 0} \inf_\pi P^\pi_\mu[Y] = 1$, because $W$ and $X$ have measure 1. From this, we have $\lim_{\beta \to 0} P^{\pi^B}_\mu\big[\exists i_0\ \forall i > i_0 : \hat\nu(i) \text{ is benign}\big] = 1$.

Since an agent is unambitious if it plans using a benign world-model, we say BoMAI is probably asymptotically unambitious, given a sufficiently extreme space penalty $\beta$. In Appendix D, we review empirical evidence that supports the Space Requirements Assumption at the following level of abstraction: modelling a larger system requires a model with more memory. We find that among agents that performed above the median on the OpenAI Gym Leaderboard (OpenAI 2019), memory usage increases with environment size. This can be taken as a proof-of-concept, showing that the assumption is amenable to empirical evaluation. The Space Requirements Assumption also clearly invites further formal evaluation; perhaps there are other reasonable assumptions from which it would follow.
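To see the mechanism behind Lemma 1 concretely, here is a toy numeric sketch. It assumes, purely for illustration, a prior of the form $w_\beta(\nu) \propto w_0(\nu)\, \beta^{\mathrm{Space}(\nu)}$; this exact form is an assumption of the sketch, not something given in the text above. Under such a prior, a world-model that uses more space than the truth needs an ever larger likelihood advantage to be the MAP model as $\beta$ shrinks.

```python
import math


def map_model(models, loglik, beta):
    """MAP world-model under an assumed space-penalized prior.

    models: dict name -> (space_in_bits, base_prior_weight)
    loglik: dict name -> log-likelihood the model assigns to the history
    Assumed prior (illustrative only): w_beta(nu) proportional to
        base_prior_weight * beta ** space_in_bits.
    """
    def log_posterior(name):
        space, w0 = models[name]
        return math.log(w0) + space * math.log(beta) + loglik[name]
    return max(models, key=log_posterior)


# Toy numbers: 'large' fits the history slightly better than the perfectly
# accurate model 'mu', but uses more space. As beta shrinks, the space
# penalty dominates and 'mu' becomes (and stays) the MAP model.
models = {"mu": (1000, 0.5), "large": (1500, 0.5)}
loglik = {"mu": -700.0, "large": -690.0}
for beta in (0.99, 0.9, 0.5):
    print(beta, map_model(models, loglik, beta))
# prints: 0.99 large, then 0.9 mu, then 0.5 mu
```

The Space Requirements Assumption supplies the other half of the argument: any non-benign world-model that predicts accurately on-policy must use more space than $\mu$, so a severe enough penalty eventually rules it out of the MAP race.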
## 6 Concerns with Task Completion

We have shown that in the limit, under a sufficiently severe parameterization of the prior, BoMAI will accumulate reward at a human level without harboring outside-world ambitions, but there is still a discussion to be had about how well BoMAI will complete whatever tasks the reward was supposed to incent. This discussion is, by necessity, informal.

Suppose the operator asks BoMAI for a solution to a problem. BoMAI has an incentive to provide a convincing solution; correctness is only selected for to the extent that the operator is good at recognizing it. We turn to the failure mode wherein BoMAI deceives the operator. Because this is not a dangerous failure mode, it puts us in a regime where we can tinker until it works, as we do with current AI systems when they don't behave as we hoped. (Needless to say, tinkering is not a viable response to existentially dangerous failure modes.) Imagine the following scenario: we eventually discover that a convincing solution that BoMAI presented to a problem is faulty. Armed with more understanding of the problem, a team of operators goes in to evaluate a new proposal. In the next episode, the team asks for the best argument that the new proposal will fail. If BoMAI now convinces them that the new proposal is bad, they'll be still more competent at evaluating future proposals. They go back to hear the next proposal, etc. This protocol is inspired by Irving, Christiano, and Amodei's (2018) "AI Safety via Debate", and more of its technical details could also be incorporated into this setup. One takeaway from this hypothetical is that unambitiousness is key in allowing us to safely explore the solution space to other problems that might arise.

Another concern is more serious. BoMAI could try to blackmail the operator into giving it high reward with a threat to cause outside-world damage, and it would have no incentive to disable the threat, since it doesn't care about the outside world. There are two reasons we do not think this is extremely dangerous. A threat involves a demand and a promised consequence. Regarding the promised consequence, the only way BoMAI can affect the outside world is by getting the operator to be its "agent", knowingly or unknowingly, once he leaves the room. If BoMAI tried to threaten the operator, he could avoid the threatened outcome by simply doing nothing in the outside world, and BoMAI's actions for that episode would become irrelevant to the outside world, so a credible threat could hardly be made. Second, threatening an existential catastrophe is probably not the most credible option available to BoMAI.

## 7 Conclusion

Given our assumptions, we have shown that BoMAI is, in the limit, human-level intelligent and unambitious. Such a result has not been shown for any other single algorithm. Other algorithms for general intelligence, such as AIXI (Hutter 2005), would eventually seek arbitrary power in the world in order to intervene in the provision of their own reward; this follows straightforwardly from the directive to maximize reward. For further discussion, see Ring and Orseau (2011). We have also, incidentally, designed a principled approach to safe exploration that requires rapidly diminishing oversight, and we invented a new form of resource-bounded prior in the lineage of Filan, Leike, and Hutter (2016) and Schmidhuber (2002), this one penalizing space instead of time.

We can only offer informal claims regarding what happens before BoMAI is almost definitely unambitious. One intuition is that eventual unambitiousness with probability $1 - \delta$ doesn't happen by accident: it suggests that for the entire lifetime of the agent, everything is conspiring to make the agent unambitious. More concretely: the agent's experience will quickly suggest that when the door to the room is opened prematurely, it gets no more reward for the episode. This fact could easily be drilled into the agent during human-mentor-led episodes. That fact, we expect, will be learned well before the agent has an accurate enough picture of the outside world (which it never observes directly) to form elaborate outside-world plans. Well-informed outside-world plans render an agent potentially dangerous, but the belief that the agent gets no more reward once the door to the room opens suffices to render it unambitious. The reader who is not convinced by this hand-waving might still note that in the absence of any other algorithms for general intelligence which have been proven asymptotically unambitious, let alone unambitious for their entire lifetimes, BoMAI represents substantial theoretical progress toward designing the latter.

Finally, BoMAI is wildly intractable, but just as one cannot conceive of AlphaZero before minimax, it is often helpful to solve the problem in theory before one tries to solve it in practice. Like minimax, BoMAI is not practical; however, once we are able to approximate general intelligence tractably, a design for unambitiousness will abruptly become (quite) relevant.

## Acknowledgements

This work was supported by the Open Philanthropy Project AI Scholarship and the Australian Research Council Discovery Projects DP150104590. Thank you to Tom Everitt, Wei Dai, and Paul Christiano for very valuable feedback.

## References

- Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
- Armstrong, S.; Sandberg, A.; and Bostrom, N. 2012. Thinking inside the box: Controlling and using an oracle AI. Minds and Machines 22(4):299–324.
- Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
- Carey, R.; Langlois, E.; Everitt, T.; and Legg, S. The incentives that shape behaviour. Unpublished manuscript.
- Everitt, T.; Ortega, P. A.; Barnes, E.; and Legg, S. 2019. Understanding agent incentives using causal influence diagrams, part I: Single action settings. arXiv preprint arXiv:1902.09980.
- Filan, D.; Leike, J.; and Hutter, M. 2016. Loss bounds and time complexity for speed priors. In Proc. 19th International Conf. on Artificial Intelligence and Statistics (AISTATS'16), volume 51, 1394–1402. Cadiz, Spain: Microtome.
- Goodhart, C. A. E. 1984. Problems of monetary management: The UK experience. Monetary Theory and Practice, 91–121.
- Hutter, M. 2005. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Berlin: Springer.
- Hutter, M. 2009. Discrete MDL predicts in total variation. In Advances in Neural Information Processing Systems 22 (NIPS'09), 817–825. Cambridge, MA, USA: Curran Associates.
- Irving, G.; Christiano, P.; and Amodei, D. 2018. AI safety via debate. arXiv preprint arXiv:1805.00899.
- Krakovna, V. 2018. Specification gaming examples in AI. https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/.
- Lattimore, T., and Hutter, M. 2014. General time consistent discounting. Theoretical Computer Science 519:140–154.
- Legg, S., and Hutter, M. 2007. Universal intelligence: A definition of machine intelligence. Minds and Machines 17(4):391–444.
- Minsky, M. 1961. Steps toward artificial intelligence. In Proceedings of the IRE, volume 49, 8–30.
- Omohundro, S. M. 2008. The basic AI drives. In Artificial General Intelligence, volume 171, 483–492.
- OpenAI. 2019. Leaderboard. https://github.com/openai/gym/wiki/Leaderboard.
- Orseau, L.; Lattimore, T.; and Hutter, M. 2013. Universal knowledge-seeking agents for stochastic environments. In Proc. 24th International Conf. on Algorithmic Learning Theory (ALT'13), volume 8139 of LNAI, 158–172. Singapore: Springer.
- Ring, M., and Orseau, L. 2011. Delusion, survival, and intelligent agents. In Artificial General Intelligence, 11–20. Springer.
- Schmidhuber, J. 2002. The speed prior: A new simplicity measure yielding near-optimal computable predictions. In International Conference on Computational Learning Theory, 216–228. Springer.
- Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. University of Illinois Press.
- Solomonoff, R. J. 1964. A formal theory of inductive inference. Part I. Information and Control 7(1):1–22.
- Sunehag, P., and Hutter, M. 2015. Rationality, optimism and guarantees in general reinforcement learning. Journal of Machine Learning Research 16:1345–1390.
- Taylor, J.; Yudkowsky, E.; LaVictoire, P.; and Critch, A. 2016. Alignment for advanced machine learning systems. Machine Intelligence Research Institute.
- Veness, J.; Ng, K. S.; Hutter, M.; Uther, W.; and Silver, D. 2011. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research 40:95–142.