# Dynamic Knowledge Injection for AIXI Agents

Samuel Yang-Zhao¹, Kee Siong Ng¹, Marcus Hutter¹,² — ¹Australian National University, ²Google DeepMind
samuel.yang-zhao@anu.edu.au, keesiong.ng@anu.edu.au, www.hutter1.net

Prior approximations of AIXI, a Bayesian optimality notion for general reinforcement learning, can only approximate AIXI's Bayesian environment model using an a-priori defined set of models. This is a fundamental source of epistemic uncertainty for the agent in settings where the existence of systematic bias in the predefined model class cannot be resolved by simply collecting more data from the environment. We address this issue in the context of human-AI teaming by considering a setup where additional knowledge for the agent, in the form of new candidate models, arrives from a human operator in an online fashion. We introduce a new agent called Dynamic Hedge AIXI that maintains an exact Bayesian mixture over dynamically changing sets of models via a time-adaptive prior constructed from a variant of the Hedge algorithm. The Dynamic Hedge AIXI agent is the richest direct approximation of AIXI known to date and comes with good performance guarantees. Experimental results on epidemic control on contact networks validate the agent's practical utility.

## Introduction

The AIXI agent (Hutter 2005) is a Bayesian solution to the general reinforcement learning problem that combines Solomonoff induction (Solomonoff 1997) with sequential decision theory. At time t, after observing the action-observation-reward sequence $h_{1:t-1} := aor_{1:t-1}$, the AIXI agent computes the action $a_t$ to choose via an expectimax search up to a horizon H with a mixture model:

$$
a_t = \arg\max_{a_t} \sum_{or_t} \max_{a_{t+1}} \sum_{or_{t+1}} \cdots \max_{a_{t+H}} \sum_{or_{t+H}} [r_t + \cdots + r_{t+H}] \sum_{\rho \in \mathcal{M}_U} 2^{-K(\rho)}\, \rho(or_{1:t+H} \mid a_{1:t+H}). \tag{1}
$$

The Bayesian mixture $\sum_{\rho \in \mathcal{M}_U} 2^{-K(\rho)}\, \rho(or_{1:t+H} \mid a_{1:t+H})$ is AIXI's environment model and is computed as a mixture over the set $\mathcal{M}_U$ of all enumerable chronological semimeasures, with each element $\rho \in \mathcal{M}_U$ assigned a prior according to its Kolmogorov complexity $K(\rho)$. Note that $\mathcal{M}_U$ is formally equivalent to the set of all computable distributions; from this perspective, AIXI can be considered the ultimate Bayesian reinforcement learning agent. Furthermore, (Hutter 2005) shows that AIXI's environment model converges rapidly to the true environment and that its policy is Pareto optimal and self-optimising.

The AIXI agent can be viewed as containing all possible knowledge, as its Bayesian mixture is performed over all computable distributions. From this perspective, AIXI's performance does not suffer due to limitations in its modelling capacity. In contrast, all previous approximations of AIXI are limited to a finite, pre-defined model class containing a subset of computable probability distributions, presenting an irreducible source of error. To address this issue, we introduce dynamic knowledge injection, a setting where an external source provides additional knowledge that is then integrated into new candidate environment models. In particular, dynamic knowledge injection models human-AI teaming constructs where the human can provide additional domain knowledge that the agent can use to model aspects of the environment. Once a new environment model is proposed, the central issue is to determine how it can be incorporated to improve the agent's performance.
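The "irreducible source of error" above can be made precise with the standard dominance argument for Bayesian mixtures (a well-known fact about such mixtures, stated here as background rather than quoted from this paper). For any class $\mathcal{M}$ with prior weights $w_\rho > 0$, every term of the mixture is non-negative, so

$$
\xi(or_{1:t} \mid a_{1:t}) := \sum_{\rho \in \mathcal{M}} w_\rho\, \rho(or_{1:t} \mid a_{1:t}) \;\ge\; w_\rho\, \rho(or_{1:t} \mid a_{1:t}) \quad \text{for every } \rho \in \mathcal{M},
$$

and hence the mixture's cumulative log-loss exceeds that of any $\rho \in \mathcal{M}$ by at most the constant $\ln(1/w_\rho)$, uniformly in t. For AIXI's class $\mathcal{M}_U$ with prior $2^{-K(\rho)}$ this guarantee covers every computable environment; for a fixed, finite, pre-defined class it only covers the models inside the class, so a systematic mismatch between the class and the true environment cannot be repaired by collecting more data.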
Utilising a variation of the Growing Hedge algorithm (Mourtada and Maillard 2017), itself an extension of Hedge (Cesa-Bianchi and Lugosi 2006), we construct an adaptive anytime Bayesian mixture algorithm that incorporates newly arriving models and also allows the removal of existing models. Dynamic Hedge AIXI is the richest direct approximation of AIXI to date and comes with strong value-convergence guarantees against the best available environment sequence. We validate the agent's performance empirically on multiple experimental domains, including the control of epidemics on large contact networks, and our results demonstrate that Dynamic Hedge AIXI is able to quickly adapt to new knowledge that improves its performance.

## Related Work

While AIXI is only asymptotically computable, it serves as a strong guiding principle in the design of general-purpose AI agents. (Veness et al. 2011) gives the first tractable approximation of AIXI by using the Context Tree Weighting algorithm (CTW) (Willems, Shtarkov, and Tjalkens 1995) to restrict the Bayesian mixture to a set of variable-order Markov environments and Monte-Carlo Tree Search to approximate the expectimax operation. This was followed by a body of work extending the Bayesian mixture learning to larger classes of history-based models (Veness et al. 2012), (Veness et al. 2013), (Bellemare, Veness, and Bowling 2013), (Bellemare, Veness, and Talvitie 2014). The current best approximation of AIXI is given in (Yang-Zhao, Wang, and Ng 2022), which introduces the Φ-AIXI-CTW agent to extend AIXI's approximation to non-Markovian and structured environments. This is achieved using the Φ-BCTW data structure, a model that extends the CTW algorithm's representation capacity by combining state abstraction, motivated by the Feature Reinforcement Learning framework (Hutter 2009), with a rich logical formalism. One limitation of Φ-AIXI-CTW is that it models the environment using state abstractions that are fixed after performing feature selection at the start. The current work alleviates this issue by extending Φ-AIXI-CTW to the dynamic knowledge injection setting.

## Background and Notation

### General Reinforcement Learning

We consider finite action, observation and reward spaces denoted by A, O and R respectively. The agent interacts with the environment in cycles: at any time, the agent chooses an action from A and the environment returns an observation and reward from O and R. We denote a string $x_1 x_2 \ldots x_n$ of length n by $x_{1:n}$ and its length-(n−1) prefix by $x_{<n}$. The general reinforcement learning (GRL) problem is for the agent to learn a policy $\pi : \mathcal{H} \to \mathcal{D}(A)$ mapping histories to a distribution over possible actions that allows it to maximise its future expected reward. In this paper, we consider the future expected reward up to a finite horizon $H \in \mathbb{N}$.

### Prediction with Expert Advice

In the prediction with expert advice setting, the learner has access to a set $\mathcal{M}$ of experts. At each time t, every expert $i \in \mathcal{M}$ outputs a prediction $x_{t,i} \in \mathcal{X}$, the learner outputs its own prediction $x_t$, an outcome $y_t \in \mathcal{Y}$ is revealed, and losses $\ell_{t,i} := \ell(x_{t,i}, y_t)$ and $\ell_t := \ell(x_t, y_t)$ are incurred, with cumulative losses $L_{T,i}$ and $L_T$ respectively. Given a learning rate η > 0 and prior weights $\nu = (\nu_i)_{i \in \mathcal{M}}$, Hedge predicts

$$
x_t = \frac{\sum_{i \in \mathcal{M}} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}} w_{t,i}}, \qquad \text{where } w_{t,i} = \nu_i e^{-\eta L_{t-1,i}}.
$$

The weights of the Hedge algorithm can be viewed as the posterior probabilities of each expert (Jordan 1995). The following is a standard regret bound for the Hedge algorithm.

Proposition 2 (Cesa-Bianchi and Lugosi 2006). If the loss function ℓ is η-exp-concave, then for any $i \in \mathcal{M}$, Hedge with prior ν has regret bound
$$
L_T - L_{T,i} \le \frac{1}{\eta} \ln \frac{1}{\nu_i}.
$$
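To make the Hedge update concrete, the following minimal Python sketch implements the exponentially weighted forecaster described above for real-valued predictions; the function name, the log-space bookkeeping, and the generic `loss` argument are illustrative choices, not the paper's implementation.

```python
import math

def hedge(expert_predictions, outcomes, loss, eta, priors):
    """Hedge / exponentially weighted forecaster.

    expert_predictions: list over time of lists of per-expert predictions x_{t,i}
    outcomes:           list over time of outcomes y_t
    loss:               loss(x, y) -> float (eta-exp-concave for Proposition 2 to apply)
    eta:                learning rate > 0
    priors:             prior weights nu_i > 0
    Returns the learner's cumulative loss L_T.
    """
    # Work in log space: log w_{t,i} = log nu_i - eta * L_{t-1,i}.
    log_w = [math.log(nu) for nu in priors]
    total_loss = 0.0
    for x_t, y_t in zip(expert_predictions, outcomes):
        w = [math.exp(lw) for lw in log_w]
        z = sum(w)
        # Learner's prediction: weight-normalised average of the experts' predictions.
        x = sum(wi * xi for wi, xi in zip(w, x_t)) / z
        total_loss += loss(x, y_t)
        # Weight update: multiply each expert's weight by exp(-eta * that expert's loss).
        log_w = [lw - eta * loss(xi, y_t) for lw, xi in zip(log_w, x_t)]
    return total_loss
```

For example, with the squared loss on predictions in [0, 1] (which is exp-concave for η ≤ 1/2), Proposition 2 bounds the learner's cumulative loss against that of the best expert.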
## Dynamic Knowledge Injection

In this section we formalise the Dynamic Knowledge Injection setting. We first present an extension of the prediction with expert advice setting known as the specialists setting. The Dynamic Knowledge Injection setting can then be naturally described using the specialists framework.

### The Specialists Setting

Incorporating expert advice from novel experts arriving in an online fashion can be cast into the specialists setting (Freund et al. 1997). The specialists setting extends the prediction with expert advice setting by introducing specialists: experts that can abstain from prediction at any given time step. In this setting, the learner has access to a set $\mathcal{M}$ of specialists where at time t, only specialists in a subset $\mathcal{M}_t \subseteq \mathcal{M}$ output predictions $x_{t,i} \in \mathcal{X}$. The crucial idea to adapt the Hedge algorithm to this setting was presented in (Chernov and Vovk 2009), where inactive specialists $j \notin \mathcal{M}_t$ are attributed a forecast equal to that of the learner. More precisely, choosing $x_{t,j} = \frac{\sum_{i \in \mathcal{M}_t} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}_t} w_{t,i}}$ for $j \notin \mathcal{M}_t$ results in

$$
x_t = \frac{\sum_{i \in \mathcal{M}} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}} w_{t,i}} = \frac{\sum_{i \in \mathcal{M}_t} w_{t,i}\, x_{t,i}}{\sum_{i \in \mathcal{M}_t} w_{t,i}}.
$$

The specialist aggregation algorithm uses the above abstention trick, where the weight for expert i at time t is given by $w_{t,i} = \nu_i e^{-\eta L_{t-1,i}}$ with $L_{t,i} := \sum_{s \le t : i \in \mathcal{M}_s} \ell_{s,i} + \sum_{s \le t : i \notin \mathcal{M}_s} \ell_s$. The abstention trick maintains enough weight on abstaining experts to ensure the regret is well controlled. We use this fundamental technique to incorporate newly arriving models for Bayesian agents.

### The Dynamic Knowledge Injection Setting

The Dynamic Knowledge Injection setting can now be naturally defined using the specialists framework. In this setting, a specialist i is an abstract MDP of the form $(\phi_i, \rho_i)$, where $\phi_i$ is a state abstraction function constructed from a set of predicate functions and $\rho_i = (\rho_{t,i})_{t \ge 1}$ is a sequence of distributions with $\rho_{t,i} : S_i \times A \to \mathcal{D}(S_i \times R)$. Modelling the environment as a sequence of distributions allows us to naturally represent environment models that update and learn their distributions online, such as Φ-BCTW. Over the course of the agent's lifetime, it is given new abstract environment models by a human operator at intermittent time steps. A key example of this setting is when the initial models our agent starts with are inadequate and a separate training process is able to submit new and improved models over time.

Algorithm 1: Dynamic Hedge (modifies Growing Hedge (Mourtada and Maillard 2017))
1: Require: Learning rate η > 0, weights ν = (ν_i)_{i≥1}, sequence of sets of contiguous specialists (M_t)_{t≥1}.
2: Initialize: L_0 = 0. For i ∈ M_1, set w_{1,i} = ν_i.
3: for t = 1, 2, . . . , T do
4:    For all i ∈ M_t, receive prediction x_{t,i} ∈ X.
5:    Predict x_t = Σ_{i∈M_t} w_{t,i} x_{t,i} / Σ_{i∈M_t} w_{t,i}.
6:    Observe y_t ∈ Y.
7:    Set ℓ_t = ℓ(x_t, y_t) and ℓ_{t,i} = ℓ(x_{t,i}, y_t).
8:    Set L_t = L_{t−1} + ℓ_t.
9:    For i ∈ M_t ∩ M_{t+1}, set w_{t+1,i} = w_{t,i} e^{−η ℓ_{t,i}}.
10:   For i ∈ M_{t+1} \ M_t, set w_{t+1,i} = ν_i e^{−η L_t}.
11: end for

### Incorporating New Models

The key issue for the agent is to determine how newly arriving models can be integrated in an online fashion. The Growing Hedge algorithm (Mourtada and Maillard 2017) applies to the setting where the set of active experts grows over time, and achieves the same regret bound as specialist aggregation. In practice, the growing experts setting is infeasible, as computational constraints dictate that the set of available experts cannot grow unboundedly over time. Instead, we consider the setting where arriving experts are contiguous specialists that can only be active for contiguous periods of time before becoming inactive forever. This captures the situation in practice whereby the set of models can grow until resource limits are reached, and a newly entering model must then replace an older model. Formally, over T steps the set $T_i = \{t \in [T] : i \in \mathcal{M}_t\}$ for any contiguous specialist i is a contiguous set of integers. Let $\tau_i = \min(T_i)$ and $\kappa_i = \max(T_i)$, representing the arrival and death times of specialist i. Under this restriction we present the Dynamic Hedge algorithm (Algorithm 1). Dynamic Hedge modifies Growing Hedge's weight update to account for contiguous specialists; Algorithm 1 line 9 removes contiguous specialists that are no longer active. Like Growing Hedge, Dynamic Hedge can use an unnormalized prior and does not require knowledge of the entire set of experts a priori.
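The following minimal Python sketch illustrates the bookkeeping of Algorithm 1, assuming each round supplies the active specialists' predictions as an id-to-prediction map and that the prior weight of a newly arriving specialist is given by a caller-supplied function; this interface is an illustrative assumption rather than the paper's code.

```python
import math

def dynamic_hedge(rounds, loss, eta, prior):
    """Dynamic Hedge over contiguous specialists (cf. Algorithm 1).

    rounds: iterable of (active, y_t), where `active` maps specialist id -> prediction x_{t,i}
    loss:   loss(x, y) -> float
    eta:    learning rate > 0
    prior:  prior(i) -> nu_i > 0 for a newly arriving specialist i
    Yields the learner's prediction x_t each round.
    """
    weights = {}        # w_{t,i}, kept only for currently active specialists
    L = 0.0             # learner's cumulative loss L_{t-1}
    prev_active = set()
    for active, y_t in rounds:
        # Newly arriving specialists enter with weight nu_i * exp(-eta * L_{t-1}) (line 10).
        for i in active:
            if i not in prev_active:
                weights[i] = prior(i) * math.exp(-eta * L)
        # Departed specialists are dropped for good (line 9 carries over only M_t ∩ M_{t+1}).
        for i in list(weights):
            if i not in active:
                del weights[i]
        z = sum(weights[i] for i in active)
        x_t = sum(weights[i] * active[i] for i in active) / z   # line 5
        yield x_t
        L += loss(x_t, y_t)                                     # lines 7-8
        for i in active:                                        # line 9
            weights[i] *= math.exp(-eta * loss(active[i], y_t))
        prev_active = set(active)
```

Because a contiguous specialist never returns after its death time, deleting its weight entry loses no information and keeps the memory footprint proportional to the number of currently active models.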
## Dynamic Hedge AIXI Agent

Our technique for incorporating Dynamic Knowledge Injection, Dynamic Hedge, can now be naturally integrated into a reinforcement learning agent. The Dynamic Hedge AIXI agent is presented in Algorithm 2. We consider the case where specialists are abstract MDPs. Each specialist $i \in \mathcal{M}_t$ produces a function $Q_i^{\pi_i} : \mathcal{H} \times A \to \mathbb{R}$ denoting the expected utility of action $a_t$ under a policy $\pi_i : S_i \to A$ up to the horizon H.

Algorithm 2: Dynamic Hedge AIXI
1: Require: Environment µ, initial history h_0, learning rate η > 0, weights ν = (ν_i)_{i≥1},
2: Require: Sequence of sets of contiguous (abstract MDP) specialists (M_t)_{t≥1},
3: Require: Sequence of composite policies π = (π_t)_{t≥1}, where π_t = (π_i)_{i∈M_t} and π_i : S_i → A.
4: Initialize: L_0 = 0. For i ∈ M_1, set w_{1,i} = ν_i.
5: for t = 1, 2, . . . , T do
6:    Set ŵ_{t,i} = w_{t,i} / Σ_{j∈M_t} w_{t,j}.
7:    Select a_t = arg max_a Σ_{i∈M_t} ŵ_{t,i} Q_i^{π_i}(h_{<t}, a)
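Algorithm 2 is cut off above after the action-selection step, so the following fragment only sketches that visible part: mixing each active specialist's Q-values with the normalised Dynamic Hedge weights. The `Specialist` interface assumed here (a `q_values(history)` method returning a per-action dictionary) is hypothetical and for illustration only, not the paper's API.

```python
def select_action(specialists, weights, history, actions):
    """Dynamic Hedge AIXI action selection (cf. Algorithm 2, lines 6-7).

    specialists: dict id -> specialist; each is assumed to expose
                 q_values(history) -> {action: Q_i^{pi_i}(history, a)}
    weights:     dict id -> Dynamic Hedge weight w_{t,i}; same keys as `specialists`
                 (the active set M_t)
    history:     the agent's history h_{<t}
    actions:     the finite action set A
    """
    z = sum(weights.values())
    w_hat = {i: w / z for i, w in weights.items()}            # line 6: normalised weights
    q = {i: specialists[i].q_values(history) for i in specialists}
    # line 7: a_t = argmax_a sum_i w_hat_i * Q_i^{pi_i}(h_{<t}, a)
    return max(actions, key=lambda a: sum(w_hat[i] * q[i][a] for i in specialists))
```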