# Self-Predictive Universal AI

Elliot Catt, Jordi Grau-Moya, Marcus Hutter, Matthew Aitchison, Tim Genewein, Grégoire Delétang, Kevin Li Wenliang, Joel Veness
Google DeepMind
ecatt@google.com

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Reinforcement Learning (RL) algorithms typically utilize learning and/or planning techniques to derive effective policies. Integrating both approaches has proven highly successful in addressing complex sequential decision-making challenges, as evidenced by algorithms such as AlphaZero and MuZero, which consolidate the planning process into a parametric search-policy. AIXI, the universal Bayes-optimal agent, leverages planning through comprehensive search as its primary means to find an optimal policy. Here we define an alternative universal Bayesian agent, which we call Self-AIXI, that, in contrast to AIXI, maximally exploits learning to obtain good policies. It does so by self-predicting its own stream of action data, which is generated, similarly to other TD(0) agents, by taking an action-maximization step over the current on-policy (universal mixture-policy) Q-value estimates. We prove that Self-AIXI converges to AIXI and that it inherits a series of properties such as maximal Legg-Hutter intelligence and the self-optimizing property.

1 Introduction

Reinforcement Learning (RL) [1] algorithms exploit learning, planning¹, or their combination, to obtain good policies from experience. Pure learning consists of using real experience to improve a policy via a (parametric) model, possibly representing an explicit policy model and/or the Q-values [2-6]. In a sense, learning stores the computational effort of policy improvement in the parameters, which makes it a computationally efficient approach when the policy needs to be reused later on. In contrast, pure planning finds good policies via simulated experience, using an environment model and a randomized (or exhaustive) search policy [1, 7, 8]. In unknown or stochastic environments, one must re-plan after receiving each new observation, thus discarding all computational effort from the previous step; this makes pure planning wasteful.

¹We use the terms planning and search interchangeably.

Using both planning and learning is a good way to improve performance and efficiency, as demonstrated by modern high-performing RL algorithms such as MuZero [9-12]. These algorithms distill the planning effort back into the parametric search-policy by training it to predict the good actions obtained from planning. In a way, these agents are self-predicting their own policy improvements. Although empirically successful and widely used, this distillation [13] or self-prediction² process is motivated in a purely heuristic way, without much theoretical understanding of its optimality conditions.

²Policy distillation usually refers to the process of amortizing one or several policies into another policy model. We view self-prediction as a type of distillation where a search-policy is consolidated into another model.
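To make the distillation and self-prediction idea above concrete, the following toy sketch (not taken from the paper; the two-state environment, the tabular representation, the epsilon-greedy exploration and all names are our own illustrative assumptions) trains a simple count-based policy model to predict the greedy action derived from TD(0)-style Q-value estimates, so that the distilled policy can later be queried without any search:

```python
# Illustrative sketch (not from the paper): a generic self-prediction /
# distillation loop in the spirit described above. Q-value estimates are
# learned with TD(0)-style updates, and a separate (count-based) policy model
# is trained to predict the greedy action derived from the current Q-values.
import random
from collections import defaultdict

ACTIONS = (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def toy_env_step(state, action):
    """Hypothetical two-state environment: reward 1 for matching the state."""
    reward = 1.0 if action == state else 0.0
    return random.choice((0, 1)), reward

q = defaultdict(float)                        # Q-estimates, keyed by (state, action)
policy_counts = defaultdict(lambda: [1, 1])   # crude policy model: action counts per state

state = 0
for _ in range(20_000):
    # Single maximization step over the current Q-estimates.
    greedy = max(ACTIONS, key=lambda a: q[(state, a)])
    # Self-prediction / distillation: the policy model is trained on the
    # agent's own stream of greedy actions.
    policy_counts[state][greedy] += 1
    # Act (with a little exploration) and update the Q-estimates with TD(0).
    action = greedy if random.random() > EPSILON else random.choice(ACTIONS)
    next_state, reward = toy_env_step(state, action)
    target = reward + GAMMA * max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += ALPHA * (target - q[(state, action)])
    state = next_state

# The distilled policy can now be queried directly, without any maximization.
for s in (0, 1):
    total = sum(policy_counts[s])
    print(f"state {s}: distilled action probabilities "
          f"{[c / total for c in policy_counts[s]]}")
```

The point of the sketch is only the loop structure: planning is reduced to a single maximization step, and its result is amortized into a policy model trained on the agent's own action stream.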
The AIXI agent [14, 15] is a theoretical universal Bayes-optimal agent obtained through pure planning, without relying on distilling the search effort as described above. AIXI learns an environment model via a Solomonoff predictor [16, 17] and uses it for exhaustive (computationally intractable) planning. Thus, although it uses learning for the environment model, we say that AIXI adopts a pure planning approach in the context of policy generation. Two desirable properties of Solomonoff prediction are universality, obtained by considering a huge hypothesis class containing all computable environments, and fast convergence to the true environment statistics [18, 19]. Unfortunately, universality makes Solomonoff prediction incomputable, a property that AIXI also inherits. Nevertheless, the AIXI agent is considered the most powerful agent and the gold standard for decision-making in unknown general environments.

An existing gap in the literature, however, is the lack of an alternative universal agent that maximally exploits learning and distillation instead of the more wasteful planning approach used by AIXI. Understanding this distillation process from a theoretical standpoint can inspire new directions in approximating the AIXI agent, as done in [20], which is based on a pure planning approach.

The central aim of this paper is to define an agent, which we name Self-AIXI, that maximally exploits self-prediction (or distillation) instead of planning, and to prove that this agent is optimal in the sense that it converges to the gold-standard AIXI agent. We define self-prediction as the process of predicting the stream of action data generated by the agent itself. Self-AIXI generates action data in a similar fashion to other temporal-difference TD(0) [1] algorithms, in which there is only a single planning step when choosing an action, i.e. $\arg\max_{a_t} \hat{Q}(h_t, a_t)$ for history $h_t$, action $a_t$ and current Q-value estimates $\hat{Q}$. The difference is that Self-AIXI performs exact Bayesian inference over the policy space by maintaining a universal mixture $\zeta$ over policies (in addition to $\xi$, the universal mixture over environments), and consumes action data generated by maximizing the Self-AIXI policy, i.e. $\arg\max_a Q^\zeta_\xi(h_t, a)$.

In summary, our contributions are:

- We define Self-AIXI, a novel universal agent based on self-prediction.
- We prove that Self-AIXI's Q-values converge to AIXI's Q-values asymptotically.
- We show that Self-AIXI attains maximal Legg-Hutter intelligence [21] and inherits the self-optimizing property [22].

These results provide compelling evidence that self-prediction can effectively serve as a robust alternative to traditional planning methods.

2 Background

2.1 General Reinforcement Learning

Let $\mathcal{A}$, $\mathcal{O}$ and $\mathcal{R}$ denote the (finite) action, observation and reward sets respectively. Let the set of percepts be defined as $\mathcal{E} := \mathcal{O} \times \mathcal{R}$, and let $\Delta X$ denote the set of distributions over $X$. Define the set of histories $\mathcal{H} := (\mathcal{A} \times \mathcal{E})^*$. A policy $\pi : \mathcal{H} \to \Delta\mathcal{A}$ is a distribution over actions given a history. An environment $\mu : \mathcal{H} \times \mathcal{A} \to \Delta\mathcal{E}$ is a distribution over percepts given a history and an action.

We assign $\xi$ a prior weight $w_0 > 0$ and renormalize the remaining weights, $w'_i = (1 - w_0) w_i$; the resulting mixture over $\mathcal{M}' := \{\xi\} \cup \mathcal{M}$ is then $\xi$ itself. We can apply the same logic to the policy class $\mathcal{P}$ and policy prior $\omega(\pi)$ as we just did for $\mathcal{M}$ and $w(\mu)$; similarly, we need $\pi_S \in \mathcal{P}$. We leave the full investigation of policy classes to future work. A full discussion of the choice of class $\mathcal{M}$ and prior $w(\nu)$ is beyond the scope of this work [14, 27].

3.2 Experimental Results

While our work is mainly theoretical, we also conducted experiments (see Appendix B) comparing self-prediction, using a Self-AIXI approximation, against the pure planning approach, using an AIXI approximation, with Context Tree Weighting as the predictor and Monte-Carlo Tree Search for the Q-value estimates. In short, the Self-AIXI approximation outperforms the AIXI approximation in three environments and performs equally in the remaining two.
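As a concrete illustration of the prior-reweighting step above (a sketch under our own reading of it, not the paper's code; the finite coin-flip environment class, the uniform prior and all function names are assumptions made for this toy example), the following snippet builds a small Bayesian mixture $\xi$ over a finite class $\mathcal{M}$, adds $\xi$ itself with weight $w_0$ while scaling the remaining weights by $(1 - w_0)$, and checks that the mixture's predictions are unchanged; it also shows the standard posterior update $w(\nu \mid e) \propto w(\nu)\,\nu(e)$:

```python
# Illustrative sketch (not the paper's implementation): a Bayesian mixture xi
# over a small finite environment class M of one-step percept models, plus the
# prior-reweighting step described above: adding xi itself with prior weight w0
# and scaling the other weights by (1 - w0) leaves the mixture unchanged.

def make_env(p):
    """A toy 'environment': probability p of emitting percept 1, else percept 0."""
    return lambda percept: p if percept == 1 else 1.0 - p

M = [make_env(0.2), make_env(0.5), make_env(0.8)]   # hypothetical class M
w = [1.0 / len(M)] * len(M)                         # uniform prior w(nu)

def mixture(percept, envs, weights):
    """xi(e) = sum_nu w(nu) * nu(e)."""
    return sum(wi * nu(percept) for wi, nu in zip(weights, envs))

def xi(percept):
    return mixture(percept, M, w)

# Prior reweighting: M' := {xi} union M, with w'(xi) = w0 and w'_i = (1 - w0) * w_i.
w0 = 0.3
M_prime = [xi] + M
w_prime = [w0] + [(1.0 - w0) * wi for wi in w]

def xi_prime(percept):
    return mixture(percept, M_prime, w_prime)

def posterior(percept, envs, weights):
    """Bayesian update: w(nu | e) is proportional to w(nu) * nu(e)."""
    unnorm = [wi * nu(percept) for wi, nu in zip(weights, envs)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

print(xi(1), xi_prime(1))      # identical: the reweighted mixture equals xi
print(posterior(1, M, w))      # posterior favours the nu with nu(1) = 0.8
```

In the paper the class is of course far richer (all computable environments) and the mixture is the universal one; the snippet only makes the bookkeeping of the reweighting explicit.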
4 Theoretical Results

In this section we demonstrate the intelligence of Self-AIXI by showing that it asymptotically converges to AIXI in expectation. We specifically chose this criterion over other types of optimality, since Cesàro and almost-sure asymptotic optimality (converging to the optimal policy) are too strong [28-30] and are not satisfied by AIXI; in fact, agents which satisfy them also die with certainty [31]. The goal is to show that Self-AIXI eventually performs as well as the most intelligent agent, and AIXI is the most intelligent agent [21]. We do this by proving an asymptotic convergence result to the Bayes-optimal policy in all environments. To summarize the results of this section:

- Theorem 18: for all $\mu$, $V^{\pi_S}_\xi \to V^*_\xi$ in $\mu^{\pi_S}$-expectation.
- Theorem 21: for all $\mu$, $V^{\pi_S}_\mu \to V^{\pi^*_\xi}_\mu$ in $\mu^{\pi_S}$-expectation.
- Theorem 22: $V^{\pi_S}_\mu \to V^*_\mu$ $\mu^{\pi_S}$-almost surely, under similar conditions to the AIXI self-optimizing result from Theorem 7.

These results successively broaden in scope, with the second being a more general form of the first and the third further generalizing the second, holding for all $\mu$ instead of just $\xi$. All proofs not found in this section can be found in the appendix.

4.1 Expected AIXI-like behavior

To start, we present a useful lemma about the environment mixture $\xi$:

Lemma 13 (Convergence of $\xi$ to $\mu$ in TV). For any environment $\mu \in \mathcal{M}$ and policy $\pi$ we have $D(\xi^\pi, \mu^\pi \mid h_{<t}) \to 0$ as $t \to \infty$ with $\mu^\pi$-probability 1.

Let $\beta > 0$ and $h \in (\mathcal{A} \times \mathcal{E})^t$. Now Lemma 14 and the above imply that there exists a $t_0$ such that for all $t \ge t_0$,
$$0 \le E^{\pi_S}_\xi\Big[\max_a Q^\zeta_\xi(h, a) - V^\zeta_\xi(h)\Big] \le E^{\pi_S}_\xi\, D(\xi^{\pi_S}, \xi^\zeta \mid h) < \beta .$$
Taking the expectation of $0 \le V^*_\xi - V^{\pi_S}_\xi \le V^*_\xi - V^\zeta_\xi$ (which follows from $V^*_\xi \ge V^{\pi_S}_\xi \ge V^\zeta_\xi$), Lemma 15 implies
$$0 \le E^{\pi_S}_\xi\big[V^*_\xi(h) - V^{\pi_S}_\xi(h)\big] \le E^{\pi_S}_\xi\big[V^*_\xi - V^\zeta_\xi\big] \le \beta/(1 - \gamma(1 + \alpha)) .$$
Since $\beta > 0$ was arbitrary, we get $E^{\pi_S}_\xi\big[V^*_\xi(h) - V^{\pi_S}_\xi(h)\big] \to 0$.

The previous theorem also holds in expectation over the true environment $\mu$, instead of in expectation w.r.t. $\xi$, but first we need a small lemma:

Lemma 17 ($E^\pi_\xi \to 0$ implies $E^\pi_\mu \to 0$). If $\pi$ is such that $E^\pi_\xi\big[V^*_\xi(h_{<t}) - V^\pi_\xi(h_{<t})\big] \to 0$, then also $E^\pi_\mu\big[V^*_\xi(h_{<t}) - V^\pi_\xi(h_{<t})\big] \to 0$.
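The proof of Lemma 17 is not contained in this excerpt. For intuition, a standard argument for statements of this form, sketched here under the usual dominance property of the Bayes mixture ($\xi \ge w(\mu)\,\mu$ for every $\mu \in \mathcal{M}$, hence $\xi^\pi \ge w(\mu)\,\mu^\pi$), runs as follows; the conditioning conventions and the choice of the nonnegative quantity $X_t$ are our assumptions, not the paper's:

```latex
% Sketch only: a generic dominance argument for "expectation under xi -> 0
% implies expectation under mu -> 0", assuming xi^pi(h) >= w(mu) mu^pi(h)
% and a nonnegative quantity X_t.
\begin{align*}
X_t(h_{<t}) &:= V^*_\xi(h_{<t}) - V^\pi_\xi(h_{<t}) \;\ge\; 0, \\
E^\pi_\mu\big[X_t\big]
  &= \sum_{h_{<t}} \mu^\pi(h_{<t})\, X_t(h_{<t})
   \;\le\; \frac{1}{w(\mu)} \sum_{h_{<t}} \xi^\pi(h_{<t})\, X_t(h_{<t})
   \;=\; \frac{1}{w(\mu)}\, E^\pi_\xi\big[X_t\big], \\
\text{so } E^\pi_\xi[X_t] \to 0 \;&\Longrightarrow\; E^\pi_\mu[X_t] \to 0,
  \quad \text{since } w(\mu) > 0 .
\end{align*}
```

The only facts used are that $X_t$ is nonnegative and that the true environment $\mu$ carries positive prior weight $w(\mu)$ in the mixture.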