# Dynamic Knowledge Injection for AIXI Agents

Samuel Yang-Zhao¹, Kee Siong Ng¹, Marcus Hutter¹,² — ¹Australian National University, ²Google DeepMind
samuel.yang-zhao@anu.edu.au, keesiong.ng@anu.edu.au, www.hutter1.net

Prior approximations of AIXI, a Bayesian optimality notion for general reinforcement learning, can only approximate AIXI's Bayesian environment model using an a-priori defined set of models. This is a fundamental source of epistemic uncertainty for the agent in settings where the existence of systematic bias in the predefined model class cannot be resolved by simply collecting more data from the environment. We address this issue in the context of human-AI teaming by considering a setup where additional knowledge for the agent, in the form of new candidate models, arrives from a human operator in an online fashion. We introduce a new agent called Dynamic Hedge AIXI that maintains an exact Bayesian mixture over dynamically changing sets of models via a time-adaptive prior constructed from a variant of the Hedge algorithm. The Dynamic Hedge AIXI agent is the richest direct approximation of AIXI known to date and comes with good performance guarantees. Experimental results on epidemic control on contact networks validate the agent's practical utility.

## Introduction

The AIXI agent (Hutter 2005) is a Bayesian solution to the general reinforcement learning problem that combines Solomonoff induction (Solomonoff 1997) with sequential decision theory. At time t, after observing the action-observation-reward sequence $h_{1:t-1} := aor_{1:t-1}$, the AIXI agent computes the action $a_t$ to choose via an expectimax search up to a horizon H with a mixture model:

$$
a_t = \arg\max_{a_t} \sum_{or_t} \max_{a_{t+1}} \sum_{or_{t+1}} \cdots \max_{a_{t+H}} \sum_{or_{t+H}} [r_t + \cdots + r_{t+H}] \sum_{\rho \in \mathcal{M}_U} 2^{-K(\rho)}\, \rho(or_{1:t+H} \mid a_{1:t+H}). \tag{1}
$$

The Bayesian mixture $\sum_{\rho \in \mathcal{M}_U} 2^{-K(\rho)}\, \rho(or_{1:t+H} \mid a_{1:t+H})$ is AIXI's environment model and is computed as a mixture over the set $\mathcal{M}_U$ of all enumerable chronological semimeasures, with each element $\rho \in \mathcal{M}_U$ assigned a prior according to its Kolmogorov complexity $K(\rho)$. Note that $\mathcal{M}_U$ is formally equivalent to the set of all computable distributions; from this perspective, AIXI can be considered the ultimate Bayesian reinforcement learning agent. Furthermore, (Hutter 2005) shows that AIXI's environment model converges rapidly to the true environment and that its policy is Pareto optimal and self-optimising.

The AIXI agent can be viewed as containing all possible knowledge, as its Bayesian mixture is performed over all computable distributions. From this perspective, AIXI's performance does not suffer due to limitations in its modelling capacity. In contrast, all previous approximations of AIXI are limited to a finite, pre-defined model class containing a subset of computable probability distributions, presenting an irreducible source of error. To address this issue, we introduce dynamic knowledge injection, a setting where an external source provides additional knowledge that is then integrated into new candidate environment models. In particular, dynamic knowledge injection models human-AI teaming constructs where the human can provide additional domain knowledge that the agent can use to model aspects of the environment. Once a new environment model is proposed, the central issue is to determine how it can be incorporated to improve the agent's performance.
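The "irreducible source of error" above can be made precise with the standard dominance argument for Bayesian mixtures (a well-known fact about such mixtures, stated here as background rather than quoted from this paper). For any class $\mathcal{M}$ with prior weights $w_\rho > 0$, every term of the mixture is non-negative, so

$$
\xi(or_{1:t} \mid a_{1:t}) := \sum_{\rho \in \mathcal{M}} w_\rho\, \rho(or_{1:t} \mid a_{1:t}) \;\ge\; w_\rho\, \rho(or_{1:t} \mid a_{1:t}) \quad \text{for every } \rho \in \mathcal{M},
$$

and hence the mixture's cumulative log-loss exceeds that of any $\rho \in \mathcal{M}$ by at most the constant $\ln(1/w_\rho)$, uniformly in t. For AIXI's class $\mathcal{M}_U$ with prior $2^{-K(\rho)}$ this guarantee covers every computable environment; for a fixed, finite, pre-defined class it only covers the models inside the class, so a systematic mismatch between the class and the true environment cannot be repaired by collecting more data.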
Utilising a variation of the Growing Hedge algorithm (Mourtada and Maillard 2017), itself an extension of Hedge (Cesa-Bianchi and Lugosi 2006), we construct an adaptive anytime Bayesian mixture algorithm that incorporates newly arriving models and also allows the removal of existing models. Dynamic Hedge AIXI is the richest direct approximation of AIXI to date and comes with strong value-convergence guarantees against the best available environment sequence. We validate the agent's performance empirically on multiple experimental domains, including the control of epidemics on large contact networks, and our results demonstrate that Dynamic Hedge AIXI is able to quickly adapt to new knowledge that improves its performance.

## Related Work

While AIXI is only asymptotically computable, it serves as a strong guiding principle in the design of general-purpose AI agents. (Veness et al. 2011) gives the first tractable approximation of AIXI by using the Context Tree Weighting algorithm (CTW) (Willems, Shtarkov, and Tjalkens 1995) to restrict the Bayesian mixture to a set of variable-order Markov environments and Monte-Carlo Tree Search to approximate the expectimax operation. This was followed by a body of work extending the Bayesian mixture learning to larger classes of history-based models (Veness et al. 2012), (Veness et al. 2013), (Bellemare, Veness, and Bowling 2013), (Bellemare, Veness, and Talvitie 2014). The current best approximation of AIXI is given in (Yang-Zhao, Wang, and Ng 2022), which introduces the Φ-AIXI-CTW agent to extend AIXI's approximation to non-Markovian and structured environments. This is achieved using the Φ-BCTW data structure, a model that extends the CTW algorithm's representation capacity by combining state abstraction, motivated by the Feature Reinforcement Learning framework (Hutter 2009), with a rich logical formalism. One limitation of Φ-AIXI-CTW is that it models the environment using state abstractions that are fixed after performing feature selection at the start. The current work alleviates this issue by extending Φ-AIXI-CTW to the dynamic knowledge injection setting.

## Background and Notation

### General Reinforcement Learning

We consider finite action, observation and reward spaces denoted by A, O and R respectively. The agent interacts with the environment in cycles: at any time, the agent chooses an action from A and the environment returns an observation and reward from O and R. We denote a string $x_1 x_2 \ldots x_n$ of length n by $x_{1:n}$ and its length-(n−1) prefix by $x_{<n}$. The general reinforcement learning (GRL) problem is for the agent to learn a policy $\pi : \mathcal{H} \to \mathcal{D}(A)$ mapping histories to a distribution over possible actions that allows it to maximise its future expected reward. In this paper, we consider the future expected reward up to a finite horizon $H \in \mathbb{N}$.

### Prediction with Expert Advice

In the prediction with expert advice setting, the learner has access to a set $\mathcal{M}$ of experts. At each time t, every expert $i \in \mathcal{M}$ outputs a prediction $x_{t,i} \in \mathcal{X}$, the learner outputs its own prediction $x_t$, an outcome $y_t \in \mathcal{Y}$ is revealed, and losses $\ell_{t,i} := \ell(x_{t,i}, y_t)$ and $\ell_t := \ell(x_t, y_t)$ are incurred, with cumulative losses $L_{T,i}$ and $L_T$ respectively. Given a learning rate η > 0 and prior weights $\nu = (\nu_i)_{i \in \mathcal{M}}$, Hedge predicts

$$
x_t = \frac{\sum_{i \in \mathcal{M}} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}} w_{t,i}}, \qquad \text{where } w_{t,i} = \nu_i e^{-\eta L_{t-1,i}}.
$$

The weights of the Hedge algorithm can be viewed as the posterior probabilities of each expert (Jordan 1995). The following is a standard regret bound for the Hedge algorithm.

Proposition 2 (Cesa-Bianchi and Lugosi 2006). If the loss function ℓ is η-exp-concave, then for any $i \in \mathcal{M}$, Hedge with prior ν has regret bound
$$
L_T - L_{T,i} \le \frac{1}{\eta} \ln \frac{1}{\nu_i}.
$$
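To make the Hedge update concrete, the following minimal Python sketch implements the exponentially weighted forecaster described above for real-valued predictions; the function name, the log-space bookkeeping, and the generic `loss` argument are illustrative choices, not the paper's implementation.

```python
import math

def hedge(expert_predictions, outcomes, loss, eta, priors):
    """Hedge / exponentially weighted forecaster.

    expert_predictions: list over time of lists of per-expert predictions x_{t,i}
    outcomes:           list over time of outcomes y_t
    loss:               loss(x, y) -> float (eta-exp-concave for Proposition 2 to apply)
    eta:                learning rate > 0
    priors:             prior weights nu_i > 0
    Returns the learner's cumulative loss L_T.
    """
    # Work in log space: log w_{t,i} = log nu_i - eta * L_{t-1,i}.
    log_w = [math.log(nu) for nu in priors]
    total_loss = 0.0
    for x_t, y_t in zip(expert_predictions, outcomes):
        w = [math.exp(lw) for lw in log_w]
        z = sum(w)
        # Learner's prediction: weight-normalised average of the experts' predictions.
        x = sum(wi * xi for wi, xi in zip(w, x_t)) / z
        total_loss += loss(x, y_t)
        # Weight update: multiply each expert's weight by exp(-eta * that expert's loss).
        log_w = [lw - eta * loss(xi, y_t) for lw, xi in zip(log_w, x_t)]
    return total_loss
```

For example, with the squared loss on predictions in [0, 1] (which is exp-concave for η ≤ 1/2), Proposition 2 bounds the learner's cumulative loss against that of the best expert.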
## Dynamic Knowledge Injection

In this section we formalise the Dynamic Knowledge Injection setting. We first present an extension of the prediction with expert advice setting known as the specialists setting. The Dynamic Knowledge Injection setting can then be naturally described using the specialists framework.

### The Specialists Setting

Incorporating expert advice from novel experts arriving in an online fashion can be cast into the specialists setting (Freund et al. 1997). The specialists setting extends the prediction with expert advice setting by introducing specialists: experts that can abstain from prediction at any given time step. In this setting, the learner has access to a set $\mathcal{M}$ of specialists where at time t, only specialists in a subset $\mathcal{M}_t \subseteq \mathcal{M}$ output predictions $x_{t,i} \in \mathcal{X}$. The crucial idea to adapt the Hedge algorithm to this setting was presented in (Chernov and Vovk 2009), where inactive specialists $j \notin \mathcal{M}_t$ are attributed a forecast equal to that of the learner. More precisely, choosing $x_{t,j} = \frac{\sum_{i \in \mathcal{M}_t} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}_t} w_{t,i}}$ for $j \notin \mathcal{M}_t$ results in

$$
x_t = \frac{\sum_{i \in \mathcal{M}} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}} w_{t,i}} = \frac{\sum_{i \in \mathcal{M}_t} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}_t} w_{t,i}}.
$$

The specialist aggregation algorithm uses the above abstention trick, where the weight for expert i at time t is given by $w_{t,i} = \nu_i e^{-\eta L_{t-1,i}}$ with $L_{t,i} := \sum_{s \le t : i \in \mathcal{M}_s} \ell_{s,i} + \sum_{s \le t : i \notin \mathcal{M}_s} \ell_s$. The abstention trick maintains enough weight on abstaining experts to ensure the regret is well controlled. We use this fundamental technique to incorporate newly arriving models for Bayesian agents.

### The Dynamic Knowledge Injection Setting

The Dynamic Knowledge Injection setting can now be naturally defined using the specialists framework. In this setting, a specialist i is an abstract MDP of the form $(\phi_i, \rho_i)$, where $\phi_i$ is a state abstraction function constructed from a set of predicate functions and $\rho_i = (\rho_{t,i})_{t \ge 1}$ is a sequence of distributions with $\rho_{t,i} : S_i \times A \to \mathcal{D}(S_i \times R)$. Modelling the environment as a sequence of distributions allows us to naturally represent environment models that update and learn their distributions online, such as Φ-BCTW. Over the course of the agent's lifetime, it is given new abstract environment models by a human operator at intermittent time steps. A key example of this setting is when the initial models our agent starts with are inadequate and a separate training process is able to submit new and improved models over time.

Algorithm 1: Dynamic Hedge (modifies Growing Hedge (Mourtada and Maillard 2017))
1: Require: Learning rate η > 0, weights ν = (ν_i)_{i≥1}, sequence of sets of contiguous specialists (M_t)_{t≥1}.
2: Initialize: L_0 = 0. For i ∈ M_1, set w_{1,i} = ν_i.
3: for t = 1, 2, . . . , T do
4:    For all i ∈ M_t, receive prediction x_{t,i} ∈ X.
5:    Predict x_t = Σ_{i∈M_t} w_{t,i} x_{t,i} / Σ_{i∈M_t} w_{t,i}.
6:    Observe y_t ∈ Y.
7:    Set ℓ_t = ℓ(x_t, y_t) and ℓ_{t,i} = ℓ(x_{t,i}, y_t).
8:    Set L_t = L_{t−1} + ℓ_t.
9:    For i ∈ M_t ∩ M_{t+1}, set w_{t+1,i} = w_{t,i} e^{−η ℓ_{t,i}}.
10:   For i ∈ M_{t+1} \ M_t, set w_{t+1,i} = ν_i e^{−η L_t}.
11: end for

### Incorporating New Models

The key issue for the agent is to determine how newly arriving models can be integrated in an online fashion. The Growing Hedge algorithm (Mourtada and Maillard 2017) applies to the setting where the set of active experts grows over time, and achieves the same regret bound as specialist aggregation. In practice, the growing experts setting is infeasible, as computational constraints dictate that the set of available experts cannot grow unboundedly over time. Instead, we consider the setting where arriving experts are contiguous specialists that can only be active for contiguous periods of time before becoming inactive forever. This captures the situation in practice whereby the set of models can grow until resource limits are reached, and a newly entering model must then replace an older model. Formally, over T steps the set $T_i = \{t \in [T] : i \in \mathcal{M}_t\}$ for any contiguous specialist i is a contiguous set of integers. Let $\tau_i = \min(T_i)$ and $\kappa_i = \max(T_i)$, representing the arrival and death times of specialist i. Under this restriction we present the Dynamic Hedge algorithm (Algorithm 1). Dynamic Hedge modifies Growing Hedge's weight update to account for contiguous specialists; Algorithm 1 line 9 removes contiguous specialists that are no longer active. Like Growing Hedge, Dynamic Hedge can use an unnormalized prior and does not require knowledge of the entire set of experts a priori.
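The following minimal Python sketch illustrates the bookkeeping of Algorithm 1, assuming each round supplies the active specialists' predictions as an id-to-prediction map and that the prior weight of a newly arriving specialist is given by a caller-supplied function; this interface is an illustrative assumption rather than the paper's code.

```python
import math

def dynamic_hedge(rounds, loss, eta, prior):
    """Dynamic Hedge over contiguous specialists (cf. Algorithm 1).

    rounds: iterable of (active, y_t), where `active` maps specialist id -> prediction x_{t,i}
    loss:   loss(x, y) -> float
    eta:    learning rate > 0
    prior:  prior(i) -> nu_i > 0 for a newly arriving specialist i
    Yields the learner's prediction x_t each round.
    """
    weights = {}        # w_{t,i}, kept only for currently active specialists
    L = 0.0             # learner's cumulative loss L_{t-1}
    prev_active = set()
    for active, y_t in rounds:
        # Newly arriving specialists enter with weight nu_i * exp(-eta * L_{t-1}) (line 10).
        for i in active:
            if i not in prev_active:
                weights[i] = prior(i) * math.exp(-eta * L)
        # Departed specialists are dropped for good (line 9 carries over only M_t ∩ M_{t+1}).
        for i in list(weights):
            if i not in active:
                del weights[i]
        z = sum(weights[i] for i in active)
        x_t = sum(weights[i] * active[i] for i in active) / z   # line 5
        yield x_t
        L += loss(x_t, y_t)                                     # lines 7-8
        for i in active:                                        # line 9
            weights[i] *= math.exp(-eta * loss(active[i], y_t))
        prev_active = set(active)
```

Because a contiguous specialist never returns after its death time, deleting its weight entry loses no information and keeps the memory footprint proportional to the number of currently active models.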
## Dynamic Hedge AIXI Agent

Our technique for incorporating Dynamic Knowledge Injection, Dynamic Hedge, can now be naturally integrated into a reinforcement learning agent. The Dynamic Hedge AIXI agent is presented in Algorithm 2. We consider the case where specialists are abstract MDPs. Each specialist $i \in \mathcal{M}_t$ produces a function $Q_i^{\pi_i} : \mathcal{H} \times A \to \mathbb{R}$ denoting the expected utility of action $a_t$ under a policy $\pi_i : S_i \to A$ up to the horizon H.

Algorithm 2: Dynamic Hedge AIXI
1: Require: Environment µ, initial history h_0, learning rate η > 0, weights ν = (ν_i)_{i≥1},
2: Require: Sequence of sets of contiguous (abstract MDP) specialists (M_t)_{t≥1},
3: Require: Sequence of composite policies π = (π_t)_{t≥1}, where π_t = (π_i)_{i∈M_t} and π_i : S_i → A.
4: Initialize: L_0 = 0. For i ∈ M_1, set w_{1,i} = ν_i.
5: for t = 1, 2, . . . , T do
6:    Set ŵ_{t,i} = w_{t,i} / Σ_{j∈M_t} w_{t,j}.
7:    Select a_t = arg max_a Σ_{i∈M_t} ŵ_{t,i} Q_i^{π_i}(h_{<t}, a)
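Algorithm 2 is cut off above after the action-selection step, so the following fragment only sketches that visible part: mixing each active specialist's Q-values with the normalised Dynamic Hedge weights. The `Specialist` interface assumed here (a `q_values(history)` method returning a per-action dictionary) is hypothetical and for illustration only, not the paper's API.

```python
def select_action(specialists, weights, history, actions):
    """Dynamic Hedge AIXI action selection (cf. Algorithm 2, lines 6-7).

    specialists: dict id -> specialist; each is assumed to expose
                 q_values(history) -> {action: Q_i^{pi_i}(history, a)}
    weights:     dict id -> Dynamic Hedge weight w_{t,i}; same keys as `specialists`
                 (the active set M_t)
    history:     the agent's history h_{<t}
    actions:     the finite action set A
    """
    z = sum(weights.values())
    w_hat = {i: w / z for i, w in weights.items()}            # line 6: normalised weights
    q = {i: specialists[i].q_values(history) for i in specialists}
    # line 7: a_t = argmax_a sum_i w_hat_i * Q_i^{pi_i}(h_{<t}, a)
    return max(actions, key=lambda a: sum(w_hat[i] * q[i][a] for i in specialists))
```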