A Definition of Continual Reinforcement Learning

David Abel (dmabel@google.com), André Barreto (andrebarreto@google.com), Benjamin Van Roy (benvanroy@google.com), Doina Precup (doinap@google.com), Hado van Hasselt (hado@google.com), Satinder Singh (baveja@google.com), Google DeepMind

In a standard view of the reinforcement learning problem, an agent's goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning. Despite the importance of continual reinforcement learning, the community lacks a simple definition of the problem that highlights its commitments and makes its primary concepts precise and clear. To this end, this paper is dedicated to carefully defining the continual reinforcement learning problem. We formalize the notion of agents that "never stop learning" through a new mathematical language for analyzing and cataloging agents. Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents. We provide two motivating examples, illustrating that traditional views of multi-task reinforcement learning and continual supervised learning are special cases of our definition. Collectively, these definitions and perspectives formalize many intuitive concepts at the heart of learning, and open new research pathways surrounding continual learning agents.

1 Introduction

In The Challenge of Reinforcement Learning, Sutton states: "Part of the appeal of reinforcement learning is that it is in a sense the whole AI problem in a microcosm" [56]. Indeed, the problem facing an agent that learns to make better decisions from experience is at the heart of the study of Artificial Intelligence (AI). Yet, when we study the reinforcement learning (RL) problem, it is typical to restrict our focus in a number of ways. For instance, we often suppose that a complete description of the state of the environment is available to the agent, or that the interaction stream is subdivided into episodes. Beyond these standard restrictions, however, there is another significant assumption that constrains the usual framing of RL: We tend to concentrate on agents that learn to solve problems, rather than agents that learn forever. For example, consider an agent learning to play Go: Once the agent has discovered how to master the game, the task is complete, and the agent's learning can stop. This view of learning is often embedded in the standard formulation of RL, in which an agent interacts with a Markovian environment with the goal of efficiently identifying an optimal policy, at which point learning can cease. But what if this is not the best way to model the RL problem? That is, instead of viewing learning as finding a solution, we can instead think of it as endless adaptation. This suggests study of the continual reinforcement learning (CRL) problem [47, 48, 25, 27], as first explored in the thesis by
Ring [46], with close ties to supervised never-ending learning [10, 39, 43] and continual learning [47, 48, 26, 54, 41, 42, 49, 22, 30, 45, 4]. Despite the prominence of CRL, the community lacks a clean, general definition of this problem. It is critical to develop such a definition to promote research on CRL from a clear conceptual foundation, and to guide us in understanding and designing continual learning agents. To these ends, this paper is dedicated to carefully defining the CRL problem. Our definition is summarized as follows:

The CRL Problem (Informal). An RL problem is an instance of CRL if the best agents never stop learning.

The core of our definition is framed around two new insights that formalize the notion of agents that "never stop learning": (i) we can understand every agent as implicitly searching over a set of history-based policies (Theorem 3.1), and (ii) every agent will either continue this search forever, or eventually stop (Remark 3.2). We make these two insights rigorous through a pair of logical operators on agents, which we call generates and reaches, that provide a new mathematical language for characterizing agents. Using these tools, we then define CRL as any RL problem in which all of the best agents never stop their implicit search. We provide two motivating examples of CRL, illustrating that traditional multi-task RL and continual supervised learning are special cases of our definition. We further identify necessary properties of CRL (Theorem 4.1) and the new operators (Theorem 4.2, Theorem 4.3). Collectively, these definitions and insights formalize many intuitive concepts at the heart of continual learning, and open new research pathways surrounding continual learning agents.

2 Preliminaries

We first introduce key concepts and notation. Our conventions are inspired by Ring [46], the recent work by Dong et al. [16] and Lu et al. [32], as well as the literature on general RL by Hutter [23, 24], Lattimore [28], Leike [29], Cohen et al. [12], and Majeed [36].

Notation. We let capital calligraphic letters denote sets (풳), lower case letters denote constants and functions (푥), italic capital letters denote random variables (푋), and blackboard capitals denote the natural and real numbers (ℕ, ℝ, ℕ0 = ℕ ∪ {0}). Additionally, we let Δ(풳) denote the probability simplex over the set 풳. That is, the function 푝: 풳 × 풴 → Δ(풵) expresses a probability mass function 푝(· | 푥, 푦) over 풵, for each 푥 ∈ 풳 and 푦 ∈ 풴. Lastly, we use ¬ to denote logical negation, and we use ∀푥 ∈ 풳 and ∃푥 ∈ 풳 to express the universal and existential quantifiers over a set 풳.

2.1 Agents and Environments

We begin by defining environments, agents, and related artifacts.

Definition 2.1. An agent-environment interface is a pair (풜, 풪) of countable sets 풜 and 풪 where |풜| ≥ 2 and |풪| ≥ 1. We refer to elements of 풜 as actions, denoted 푎, and elements of 풪 as observations, denoted 표.

Histories define the possible interactions between an agent and an environment that share an interface.

Definition 2.2. The histories with respect to interface (풜, 풪) are the set of sequences of action-observation pairs,

ℋ = ⋃∞푡=0 (풜 × 풪)푡. (2.1)

We refer to an individual element of ℋ as a history, denoted ℎ, and we let ℎℎ′ express the history resulting from the concatenation of any two histories ℎ, ℎ′ ∈ ℋ. Furthermore, the set of histories of length 푡 ∈ ℕ0 is defined as ℋ푡 = (풜 × 풪)푡, and we use ℎ푡 ∈ ℋ푡 to refer to a history containing 푡 action-observation pairs, ℎ푡 = 푎0표1 . . . 푎푡−1표푡, with ℎ0 the empty history.
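As a concrete, informal illustration of the interface and history definitions above, the following Python sketch represents histories as tuples of action-observation pairs over a toy interface. The action and observation names are illustrative assumptions, not part of the formalism.

```python
from itertools import product

# A minimal sketch of Definitions 2.1-2.2, assuming a toy interface with
# two actions and two observations (names are illustrative only).
ACTIONS = ("a1", "a2")
OBSERVATIONS = ("o1", "o2")

def histories_of_length(t):
    """All histories in H_t = (A x O)^t, as tuples of (action, observation) pairs."""
    pairs = tuple(product(ACTIONS, OBSERVATIONS))
    return [tuple(h) for h in product(pairs, repeat=t)]

h0 = ()                                  # the empty history h_0
h2 = (("a1", "o2"), ("a2", "o1"))        # a history of length 2
h3 = h2 + (("a1", "o1"),)                # concatenation hh' is tuple concatenation

print(len(histories_of_length(2)))       # |H_2| = (|A| * |O|)^2 = 16
```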
An environment is then a function, drawn from the set of all environments ℰ, that produces observations given a history.

Definition 2.3. An environment with respect to interface (풜, 풪) is a function 푒: ℋ × 풜 → Δ(풪).

This model of environments is general in that it can capture Markovian environments such as Markov decision processes (MDPs, Puterman, 2014) and partially observable MDPs (Cassandra et al., 1994), as well as both episodic and non-episodic settings. We next define an agent as follows.

Definition 2.4. An agent with respect to interface (풜, 풪) is a function 휆: ℋ → Δ(풜). We let Λ̄ denote the set of all agents, and let Λ denote any non-empty subset of Λ̄.

This treatment of an agent captures the mathematical way experience gives rise to behavior, as in agent functions from work by Russell and Subramanian [50]. This is in contrast to a mechanistic account of agency as proposed by Dong et al. [16] and Sutton [58]. Further, note that Definition 2.4 is precisely a history-based policy; we embrace the view that there is no real distinction between an agent and a policy, and will refer to all such functions as agents unless otherwise indicated.

2.2 Realizable Histories

We will be especially interested in the histories that occur with non-zero probability as a result of the interaction between a particular agent and environment.

Definition 2.5. The realizable histories of a given agent-environment pair, (휆, 푒), define the set of histories of any length that can occur with non-zero probability from the interaction of 휆 and 푒,

ℋ휆,푒 = ⋃∞푡=0 { ℎ푡 ∈ ℋ푡 : ∏푡−1푘=0 푒(표푘+1 | ℎ푘, 푎푘) 휆(푎푘 | ℎ푘) > 0 }. (2.2)

Given a realizable history ℎ, we will refer to the realizable history suffixes, ℎ′, which, when concatenated with ℎ, produce a realizable history ℎℎ′ ∈ ℋ휆,푒.

Definition 2.6. The realizable history suffixes of a given (휆, 푒) pair, relative to a history prefix ℎ ∈ ℋ휆,푒, define the set of histories that, when concatenated with prefix ℎ, remain realizable,

ℋ휆,푒ℎ = ℋℎ = {ℎ′ ∈ ℋ : ℎℎ′ ∈ ℋ휆,푒}. (2.3)

We abbreviate ℋ휆,푒 to ℋ, and ℋ휆,푒ℎ to ℋℎ, where 휆 and 푒 are omitted for brevity.

2.3 Reward, Performance, and the RL Problem

Supported by the arguments of Bowling et al. [7], we assume that all of the relevant goals or purposes of an agent are captured by a deterministic reward function (in line with the reward hypothesis [57]).

Definition 2.7. We call 푟: 풜 × 풪 → ℝ a reward function.

We remain agnostic to how the reward function is implemented; it could be a function inside of the agent, or the reward function's output could be a special scalar in each observation. Such commitments do not impact our framing. When we refer to an environment we will implicitly mean that a reward function has been selected as well. We remain agnostic to how reward is aggregated to determine performance, and instead adopt the function 푣 defined as follows.

Definition 2.8. The performance, 푣: ℋ × Λ̄ × ℰ → [vmin, vmax], is a bounded function for fixed constants vmin, vmax ∈ ℝ.

The function 푣(휆, 푒 | ℎ) expresses some statistic of the received future random rewards produced by the interaction between 휆 and 푒 following history ℎ, where we use 푣(휆, 푒) as shorthand for 푣(휆, 푒 | ℎ0). While we accommodate any 푣 that satisfies the above definition, it may be useful to think of specific choices of 푣(휆, 푒 | ℎ푡), such as the average reward,

푣(휆, 푒 | ℎ푡) = lim inf푘→∞ (1/푘) E휆,푒[푅푡 + . . . + 푅푡+푘 | 퐻푡 = ℎ푡], (2.4)

where E휆,푒[· | 퐻푡 = ℎ푡] denotes expectation over the stochastic process induced by 휆 and 푒 following history ℎ푡. Or, we might consider performance based on the expected discounted reward,
푣(휆, 푒 | ℎ푡) = E휆,푒[푅푡 + 훾푅푡+1 + . . . | 퐻푡 = ℎ푡], where 훾 ∈ [0, 1) is a discount factor.

The above components give rise to a simple definition of the RL problem.

Definition 2.9. An instance of the RL problem is defined by a tuple (푒, 푣, Λ) as follows:

Λ* = arg max휆∈Λ 푣(휆, 푒). (2.5)

This captures the RL problem facing an agent designer that would like to identify an optimal agent (휆* ∈ Λ*) with respect to the performance (푣), among the available agents (Λ), in a particular environment (푒). We note that a simple extension of this definition of the RL problem might instead consider a set of environments (or similar alternatives).

3 Agent Operators: Generates and Reaches

We next introduce two new insights about agents, and the logical operators that formalize them:

1. Theorem 3.1: Every agent can be understood as searching over another set of agents.
2. Remark 3.2: Every agent will either continue their search forever, or eventually stop.

We make these insights precise by introducing a pair of logical operators on agents: (1) a set of agents generates (Definition 3.4) another set of agents, and (2) a given agent reaches (Definition 3.5) an agent set. Together, these operators enable us to define learning as the implicit search process captured by the first insight, and continual learning as the process of continuing this search indefinitely.

3.1 Operator 1: An Agent Basis Generates an Agent Set.

The first operator is based on two complementary intuitions. From the first perspective, an agent can be understood as searching over a space of representable action-selection strategies. For instance, in an MDP, agents can be interpreted as searching over the space of policies (that is, the space of stochastic mappings from the MDP's state to action). It turns out this insight can be extended to any agent and any environment. The second complementary intuition notes that, as agent designers, we often first identify the space of representable action-selection strategies of interest. Then, it is natural to design agents that search through this space. For instance, in designing an agent to interact with an MDP, we might be interested in policies representable by a neural network of a certain size and architecture. When we design agents, we then consider all agents (choices of loss function, optimizer, memory, and so on) that search through the space of assignments of weights to this particular neural network using standard methods like gradient descent. We codify these intuitions in the following definitions.

Definition 3.1. An agent basis (or simply, a basis), ΛB ⊆ Λ̄, is any non-empty subset of Λ̄.

Notice that an agent basis is a choice of agent set, Λ. We explicitly call out a basis with distinct notation (ΛB) as it serves an important role in the discussion that follows. For example, we next introduce learning rules as functions that switch between elements of an agent basis for each history.

Definition 3.2. A learning rule over an agent basis ΛB is a function, 휎: ℋ → ΛB, that selects a base agent for each history. We let Σ̄ denote the set of all learning rules over ΛB, and let Σ denote any non-empty subset of Σ̄.

A learning rule is a mechanism for switching between the available base agents following each new experience. Notice that learning rules are deterministic; while a simple extension captures the stochastic case, we will see by Theorem 3.1 that the above is sufficiently general in a certain sense. We use 휎(ℎ)(ℎ) to refer to the action distribution selected by the agent 휆 = 휎(ℎ) at any history ℎ.

Definition 3.3.
Let Σ be a set of learning rules over some basis ΛB, and let 푒 be an environment. We say that a set Λ is Σ-generated by ΛB in 푒, denoted ΛB ⇝푒Σ Λ, if and only if

∀휆 ∈ Λ ∃휎 ∈ Σ ∀ℎ ∈ ℋ휆,푒 : 휆(ℎ) = 휎(ℎ)(ℎ). (3.1)

Thus, any choice of Σ together with a basis ΛB induces a family of agent sets whose elements can be understood as switching between the basis according to the rules prescribed by Σ. We then say that a basis generates an agent set in an environment if there exists a set of learning rules that switches between the basis elements to produce the agent set.

Definition 3.4. We say a basis ΛB generates Λ in 푒, denoted ΛB ⇝푒 Λ, if and only if

∃Σ ⊆ Σ̄ : ΛB ⇝푒Σ Λ. (3.2)

Intuitively, an agent basis ΛB generates another agent set Λ just when the agents in Λ can be understood as switching between the base agents. It is in this sense that we can understand agents as searching through a basis: an agent is just a particular sequence of history-conditioned switches over a basis. For instance, let us return to the example of a neural network: The agent basis might represent a specific multilayer perceptron, where each element of this basis is an assignment to the network's weights. The learning rules are different mechanisms that choose the next set of weights in response to experience (such as gradient descent). Together, the agent basis and the learning rules generate the set of agents that search over choices of weights in reaction to experience. We present a cartoon visual of the generates operator in Figure 1(a).

Figure 1: A visual of the generates (left) and sometimes reaches (right) operators. (a) Generates (ΛB ⇝푒 Λ): An agent basis, ΛB, comprised of three base agents depicted by the triangle, circle, and square, generates a set Λ containing agents that can each be understood as switching between the base agents in the realizable histories of environment 푒. (b) Sometimes Reaches (휆1 →푒 ΛB): On the right, we visualize 휆1 ∈ Λ generated by ΛB (from the figure on the left) to illustrate the concept of sometimes reaches. That is, the agent's choice of action distribution at each history can be understood as switching between the three basis elements, and there is at least one history for which the agent stops switching; here, we show the agent settling on the choice of the blue triangle and never switching again.

Now, using the generates operator, we revisit and formalize the central insight of this section: Every agent can be understood as implicitly searching over an agent basis. We take this implicit search process to be the behavioral signature of learning.

Theorem 3.1. For any agent-environment pair (휆, 푒), there exist infinitely many choices of a basis, ΛB, such that both (1) 휆 ∉ ΛB, and (2) ΛB ⇝푒 {휆}.

Due to space constraints, all proofs are deferred to Appendix B. We require that 휆 ∉ ΛB to ensure that the relevant bases are non-trivial generators of {휆}. This theorem tells us that no matter the choice of agent or environment, we can view the agent as a series of history-conditioned switches between basis elements. In this sense, we can understand the agent as if¹ it were carrying out a search over the elements of some ΛB. We emphasize that there are infinitely many choices of such a basis to illustrate that there are many plausible interpretations of an agent's behavior; we return to this point throughout the paper.

¹ We use "as if" in the sense of the positive economists, such as Friedman [19].

3.2 Operator 2: An Agent Reaches a Basis.

Our second operator reflects properties of an agent's limiting behavior in relation to a basis.
Given an agent and a basis that the agent searches through, what happens to the agent's search process in the limit: does the agent keep switching between elements of the basis, or does it eventually stop? For example, in an MDP, many agents of interest eventually stop their search on a choice of a fixed policy. We formally define this notion in terms of an agent reaching a basis according to two modalities: an agent (i) sometimes or (ii) never reaches a basis.

Definition 3.5. We say agent 휆 ∈ Λ̄ sometimes reaches ΛB in 푒, denoted 휆 →푒 ΛB, if and only if

∃ℎ ∈ ℋ휆,푒 ∃휆B ∈ ΛB ∀ℎ′ ∈ ℋ휆,푒ℎ : 휆(ℎℎ′) = 휆B(ℎℎ′). (3.3)

That is, for at least one realizable history, there is some base agent (휆B) that produces the same action distribution as 휆 forever after. This indicates that the agent can be understood as if it has stopped its search over the basis. We present a visual of sometimes reaches in Figure 1(b). By contrast, we say an agent never reaches a basis just when it never becomes equivalent to a base agent.

Definition 3.6. We say agent 휆 ∈ Λ̄ never reaches ΛB in 푒, denoted 휆 ↛푒 ΛB, if and only if ¬(휆 →푒 ΛB).

The reaches operators formalize the intuition that, since every agent can be interpreted as if it were searching over a basis, every agent will either (1) sometimes, or (2) never stop this search. Since (1) and (2) are simply negations of each other, we can now plainly state this fact as follows.

Remark 3.2. For any agent-environment pair (휆, 푒) and any choice of basis ΛB such that ΛB ⇝푒 {휆}, exactly one of the following two properties must be satisfied: (1) 휆 →푒 ΛB, (2) 휆 ↛푒 ΛB.

Thus, by Theorem 3.1, every agent can be thought of as implicitly searching over an agent basis, and by Remark 3.2, every agent will either (1) sometimes, or (2) never stop this search. We take this implicit search process to be the behavioral signature of learning, and will later exploit this perspective to define a continual learning agent as one that continues its search forever (Definition 4.1). Our analysis in Section 4.4 further elucidates basic properties of both the generates and reaches operators, and Figure 1 presents a cartoon visualizing the intuition behind each of the operators. We summarize all definitions and notation in a table in Appendix A.

Considerations on the Operators. Naturally, we can design many variations of both ⇝푒 and →푒. For instance, we might be interested in a variant of reaches in which an agent becomes 휖-close to any of the basis elements, rather than require exact behavioral equivalence. Concretely, we highlight four axes of variation that can modify the definitions of the operators. We state these varieties for reaches, but similar modifications can be made to the generates operator, too:

1. Realizability. An agent reaches a basis (i) in all histories (and thus, all environments), or (ii) in the histories realizable by a given (휆, 푒) pair.
2. History Length. An agent reaches a basis over (i) infinite, or (ii) finite length histories.
3. Probability. An agent reaches a basis (i) with probability one, or (ii) with high probability.
4. Equality or Approximation. An agent reaches a basis by becoming (i) equivalent to a base agent, or (ii) sufficiently similar to a base agent.

Rather than define all of these variations precisely for both operators (though we do explore some in Appendix C), we acknowledge their existence, and simply note that the formal definitions of these variants follow naturally.
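To build intuition for the two operators, the following Python sketch checks finite-horizon surrogates of generates (Definition 3.4) and sometimes reaches (Definition 3.5) for deterministic agents over a toy interface. It quantifies over all histories up to a fixed horizon rather than the realizable histories of a particular (휆, 푒) pair, and the agents, actions, and horizon are illustrative assumptions.

```python
from itertools import product

ACTIONS = ("a", "b")
OBSERVATIONS = ("o",)          # a single observation keeps the history space small

def all_histories(horizon):
    """Enumerate all histories of length 0..horizon as tuples of (action, observation) pairs."""
    pairs = tuple(product(ACTIONS, OBSERVATIONS))
    for t in range(horizon + 1):
        yield from (tuple(h) for h in product(pairs, repeat=t))

# An agent basis of two constant base agents.
base_a = lambda h: "a"
base_b = lambda h: "b"
BASIS = (base_a, base_b)

# Two agents the basis generates: one keeps switching, one settles on base_a.
switcher = lambda h: "a" if len(h) % 2 == 0 else "b"
settler  = lambda h: "b" if len(h) < 2 else "a"

def generated_by(agent, basis, horizon=4):
    """Finite-horizon surrogate for Definition 3.4: at every history, some base agent matches."""
    return all(any(lam(h) == agent(h) for lam in basis) for h in all_histories(horizon))

def sometimes_reaches(agent, basis, horizon=4):
    """Finite-horizon surrogate for Definition 3.5: some history after which the agent
    agrees with a single base agent on every continuation we can enumerate."""
    return any(
        all(lam(h + s) == agent(h + s) for s in all_histories(horizon))
        for h in all_histories(horizon)
        for lam in basis
    )

print(generated_by(switcher, BASIS), generated_by(settler, BASIS))   # True True
print(sometimes_reaches(settler, BASIS))                             # True
print(sometimes_reaches(switcher, BASIS))                            # False: it never settles
```

In this sketch, the agent that keeps alternating is generated by the basis but never settles on a single base agent, while the agent that eventually plays a constant action does settle, mirroring the distinction drawn by Remark 3.2.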
4 Continual Reinforcement Learning

We now provide a precise definition of CRL. The definition formalizes the intuition that CRL captures settings in which the best agents do not converge; they continue their implicit search over an agent basis indefinitely.

4.1 Definition: Continual RL

We first define continual learning agents using the generates and never reaches operators as follows.

Definition 4.1. An agent 휆 is a continual learning agent in 푒 relative to ΛB if and only if the basis generates the agent (ΛB ⇝푒 {휆}) and the agent never reaches the basis (휆 ↛푒 ΛB).

This means that an agent is a continual learner in an environment relative to ΛB if the agent's search over ΛB continues forever. Notice that an agent might be considered a continual learner with respect to one basis but not another; we explore this fact more in Section 4.4. Then, using these tools, we formally define CRL as follows.

Definition 4.2. Consider an RL problem (푒, 푣, Λ). Let ΛB ⊂ Λ be a basis such that ΛB ⇝푒 Λ, and let Λ* = arg max휆∈Λ 푣(휆, 푒). We say (푒, 푣, Λ, ΛB) defines a CRL problem if

∀휆* ∈ Λ* : 휆* ↛푒 ΛB.

Said differently, an RL problem is an instance of CRL just when all of the best agents are continual learning agents relative to basis ΛB. This problem encourages a significant departure from how we tend to think about designing agents: Given a basis, rather than try to build agents that can solve problems by identifying a fixed high-quality element of the basis, we would like to design agents that continue to update their behavior indefinitely in light of their experience.

4.2 CRL Examples

We next detail two examples of CRL to provide further intuition.

Q-Learning in Switching MDPs. First we consider a simple instance of CRL based on the standard multi-task view of MDPs. In this setting, the agent repeatedly samples an MDP to interact with from a fixed but unknown distribution [64, 9, 2, 25, 20]. In particular, we make use of the switching MDP environment from Luketina et al. [33]. The environment 푒 consists of a collection of 푛 underlying MDPs, 푚1, . . . , 푚푛, with a shared action space and environment-state space. We refer to this environment-state space using observations, 표 ∈ 풪. At each step, the environment has a fixed positive probability of 0.001 of switching the underlying MDP, which yields different transition and reward functions until the next switch. The agent only observes each environment state 표 ∈ 풪, which does not reveal the identity of the active MDP. The rewards of each underlying MDP are structured so that each MDP has a unique optimal policy. We assume 푣 is defined as the average reward, and the basis is the set of 휖-greedy policies over all 푄(표, 푎) functions, for fixed 휖 = 0.15. Consequently, the set of agents we generate, ΛB ⇝푒 Λ, consists of all agents that switch between these 휖-greedy policies. Now that the components (푒, 푣, Λ, ΛB) have been defined, we can see that this is indeed an instance of CRL: None of the base agents can be optimal, as the moment that the environment switches its underlying MDP, we know that any previously optimal policy will no longer be optimal in the next MDP following the switch. Therefore, any agent that converges (in that it reaches the basis ΛB) cannot be optimal either, for the same reason. We conclude that all optimal agents in Λ are continual learning agents relative to the basis ΛB.
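As a rough, self-contained sketch of this kind of setting, the following Python snippet contrasts a constant step size with an annealed one in a toy single-state, two-action switching environment. The environment, constants, and reward scale are illustrative assumptions, and the setup is much simpler than the grid worlds used in the experiment described next.

```python
import random

SWITCH_PROB, EPSILON, STEPS = 0.001, 0.15, 200_000

def run(constant_alpha):
    random.seed(0)
    q = [0.0, 0.0]
    counts = [0, 0]
    good_action = 0                        # which action currently pays reward 1
    total = 0.0
    for _ in range(STEPS):
        if random.random() < SWITCH_PROB:  # the underlying task switches
            good_action = 1 - good_action
        if random.random() < EPSILON:      # epsilon-greedy action selection
            a = random.randrange(2)
        else:
            a = max(range(2), key=lambda i: q[i])
        r = 1.0 if a == good_action else 0.0
        total += r
        counts[a] += 1
        # Constant alpha keeps adapting; 1/N(a) anneals toward zero and freezes learning.
        alpha = 0.1 if constant_alpha else 1.0 / counts[a]
        q[a] += alpha * (r - q[a])
    return total / STEPS

print("continual (alpha=0.1):", run(True))
print("convergent (annealed):", run(False))
```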
We present a visual of this domain in Figure 2(a), and conduct a simple experiment contrasting the performance of 휖-greedy continual Q-learning (blue) that uses a constant step-size parameter of 훼 = 0.1, with a convergent Q-learning (green) that anneals its step-size parameter over time to zero. Both use 휖 = 0.15, and we set the number of underlying MDPs to 푛 = 10. We present the average reward with 95% confidence intervals, averaged over 250 runs, in Figure 2(b). Since both variants of Q-learning can be viewed as searching over ΛB, the annealing variant that stops its search will under-perform compared to the continual approach. These results support the unsurprising conclusion that it is better to continue searching over the basis rather than converge in this setting.

Figure 2: (a) Switching MDP visual; (b) switching MDP results. A visual of a grid world instance of the switching MDPs problem (left) [33], and results from an experiment contrasting continual and convergent Q-learning (right). The environment pictured contains 푛 distinct MDPs. Each underlying MDP shares the same state space and action space, but varies in transition and reward functions, as indicated by the changing walls and rewarding locations (stars, circles, and fire). The results pictured on the right contrast continual Q-learning (with 훼 = 0.1) against traditional Q-learning that anneals its step-size parameter to zero over time.

Continual Supervised Learning. Second, we illustrate the power of our CRL definition to capture continual supervised learning. We adopt the problem setting studied by Mai et al. [35]. Let 풳 denote a set of objects to be labeled, each belonging to one of 푘 ∈ ℕ classes. The observation space 풪 consists of pairs, 표푡 = (푥푡, 푦푡), where 푥푡 ∈ 풳 and 푦푡 ∈ 풴. Here, each 푥푡 is an input object to be classified and 푦푡 is the label for the previous input 푥푡−1. Thus, 풪 = 풳 × 풴. We assume by convention that the initial label 푦0 is irrelevant and can be ignored. The agent will observe a sequence of object-label pairs, (푥0, 푦0), (푥1, 푦1), . . ., and the action space is a choice of label, 풜 = {푎1, . . . , 푎푘}, where |풴| = 푘. The reward for each history ℎ푡 is +1 if the agent's most recently predicted label is correct for the previous input, and −1 otherwise:

푟(푎푡−1, 표푡) = 푟(푎푡−1, 푦푡) = +1 if 푎푡−1 = 푦푡, and −1 otherwise. (4.1)

Concretely, the continual learning setting studied by Mai et al. [35] supposes the learner will receive samples from a sequence of probability distributions, 푑0, 푑1, . . ., each supported over 풳 × 풴. The (푥, 푦) ∈ 풳 × 풴 pairs experienced by the learner are determined by the sequence of distributions. We capture this distributional shift in an environment 푒 that shifts its probability distribution over 풪 depending on the history to match the sequence, 푑0, 푑1, . . .. Now, is this an instance of CRL? To answer this question precisely, we need to select a (Λ, ΛB) pair. We adopt the basis ΛB = {휆B : 푥 ↦ 푦푖, 푦푖 ∈ 풴} that contains each classifier that maps each object to each possible label. By the universal set of learning rules Σ̄, this basis generates the set of all agents that search over classifiers. Now, our definition says the above is an instance of CRL just when every optimal agent never stops switching between classifiers, rather than stopping its search on a fixed classifier. Consequently, if there is an optimal classifier in ΛB, then this will not be an instance of CRL.
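To make the construction concrete, here is a minimal sketch of such a continual supervised-learning environment in the formalism above, assuming a toy stream with two inputs and two labels whose labelling rule flips periodically. The flip period and all names are illustrative, standing in for the sequence of distributions 푑0, 푑1, . . ..

```python
import random

INPUTS, LABELS, PHASE = ("x0", "x1"), ("y0", "y1"), 1000   # illustrative constants

def environment(history, action):
    """e : H x A -> Delta(O). For brevity, returns a sample from e(. | h, a)
    rather than the distribution: o_{t+1} = (x_{t+1}, y_t), where y_t is the
    correct label of the previous input x_t under the currently active rule."""
    t = len(history)
    flipped = (t // PHASE) % 2 == 1            # which labelling rule (d_k) is active
    prev_x = history[-1][1][0] if history else None
    if prev_x is None:
        y = random.choice(LABELS)              # the initial label y_0 is ignored by convention
    else:
        y = LABELS[(INPUTS.index(prev_x) + flipped) % 2]
    x_next = random.choice(INPUTS)
    return (x_next, y)

def reward(action, observation):
    """r(a_{t-1}, o_t) = +1 if the predicted label matches y_t, else -1."""
    _, y = observation
    return 1.0 if action == y else -1.0
```

The periodic flip plays the role of the distributional shift discussed next: with it, no single base classifier stays correct forever.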
If, however, the environment imposes enough distributional shift (changing labels, adding mass to new elements, and so on), then the only optimal agents will be those that always switch among the base classifiers, in which case the setting is an instance of CRL.

4.3 Relationship to Other Views on Continual Learning

The spirit of continual learning has been an important part of machine learning research for decades, often appearing under the name of lifelong learning [63, 62, 53, 55, 51, 3, 4] or never-ending learning [39, 43], with close ties to transfer learning [61, 60], meta-learning [52, 17], as well as online learning and non-stationarity [5, 40, 13, 6, 31]. In a similar vein, the phrase "continuing tasks" is used in the classic RL textbook [59] to refer explicitly to cases when the interaction between agent and environment is not subdivided into episodes. Continual reinforcement learning was first posed in the thesis by Ring [46]. In later work [47, 48], Ring proposes a formal definition of the continual reinforcement learning problem. The emphasis of Ring's proposal is on the generality of the environment: rather than assume that agents of interest will interact with an MDP, Ring suggests studying the unconstrained case in which an agent must maximize performance while only receiving a stream of observations as input. The environment or reward function, in this sense, may change over time or may be arbitrarily complex. This proposal is similar in spirit to general RL, studied by Hutter [24], Lattimore [28], Leike [29], and others [12, 37, 36], in which an agent interacts with an unconstrained environment. General RL inspires many aspects of our conception of CRL; for instance, our emphasis on history-dependence rather than environment-state comes directly from general RL. More recently, Khetarpal et al. [25] provide a comprehensive survey of the continual reinforcement learning literature. We encourage readers to explore this survey for a detailed history of the subject.² In the survey, Khetarpal et al. propose a definition of the CRL problem that emphasizes the non-stationarity of the underlying process. In particular, in Khetarpal et al.'s definition, an agent interacts with a POMDP in which each of the individual components of the POMDP, such as the state space or reward function, is allowed to vary with time. We note that, as the environment model we study (Definition 2.3) is a function of history, it can capture time-indexed non-stationarity. In this sense, the same generality proposed by Khetarpal et al. and Ring is embraced and retained by our definition, but we add further precision to what is meant by continual learning by centering around a mathematical definition of continual learning agents (Definition 4.1).

² For other surveys, see the recent survey on continual robot learning by Lesort et al. [30], a survey on continual learning with neural networks by Parisi et al. [42], a survey on transfer learning in RL by Taylor and Stone [60], and a survey on continual image classification by Mai et al. [35].

4.4 Properties of CRL

Our formalism is intended to be a jumping-off point for new lines of thinking around agents and continual learning. We defer much of our analysis and proofs to the appendix, and here focus on highlighting necessary properties of CRL.

Theorem 4.1. Every instance of CRL (푒, 푣, Λ, ΛB) necessarily satisfies the following properties:

1. If Λ* ∪ ΛB ⊂ Λ, then there exists a Λ′B such that (1) Λ′B ⇝푒 Λ, and (2) (푒, 푣, Λ, Λ′B) is not an instance of CRL.
2.
No element of ΛB is optimal: ΛB ∩ Λ* = ∅.
3. If |Λ| is finite, there exists an agent set, Λ′, such that |Λ′| < |Λ| and Λ′ ⇝푒 Λ.
4. If |Λ| is infinite, there exists an agent set, Λ′, such that Λ′ ⊂ Λ and Λ′ ⇝푒 Λ.

This theorem tells us several things. The first point of the theorem has peculiar implications. We see that as we change a single element (the basis ΛB) of the tuple (푒, 푣, Λ, ΛB), the resulting problem can change from CRL to not CRL. By similar reasoning, an agent that is said to be a continual learning agent according to Definition 4.1 may not be a continual learner with respect to some other basis. We discuss this point further in the next paragraph. Point (2.) notes that no optimal strategy exists within the basis; instead, to be optimal, an agent must switch between basis elements indefinitely. As discussed previously, this fact encourages a departure in how we think about the RL problem: rather than focus on agents that can identify a single, fixed solution to a problem, CRL instead emphasizes designing agents that are effective at updating their behavior indefinitely. Points (3.) and (4.) show that Λ cannot be minimal. That is, there are necessarily some redundancies in the design space of the agents in CRL; this is expected, since we are always focusing on agents that search over the same agent basis. Lastly, it is worth calling attention to the fact that in the definition of CRL, we assume ΛB ⊂ Λ; this suggests that in CRL, the agent basis is necessarily limited in some way. Consequently, the design space of agents Λ is also limited in terms of what agents it can represent at any particular point in time. This limitation may come about due to a computational or memory budget, or by making use of a constrained set of learning rules. This suggests a deep connection between bounded agents and the nature of continual learning, as explored further by Kumar et al. [27]. While these four points give an initial character of the CRL problem, we note that further exploration of the properties of CRL is an important direction for future work.

Canonical Agent Bases. It is worth pausing and reflecting on the concept of an agent basis. As presented, the basis is an arbitrary choice of a set of agents; consequently, point (1.) of Theorem 4.1 may stand out as peculiar. From this perspective, it is reasonable to ask whether the fact that our definition of CRL is basis-dependent renders it vacuous. We argue that this is not the case for two reasons. First, we conjecture that any definition of continual learning that involves concepts like learning and convergence will have to sit on top of some reference object whose choice is arbitrary. Second, and more important, even though the mathematical construction allows for an easy change of basis, in practice the choice of basis is constrained by considerations like the availability of computational resources. It is often the case that the domain or problem of interest provides obvious choices of bases, or imposes constraints that force us as designers to restrict attention to a space of plausible bases or learning rules. For example, as discussed earlier, a choice of neural network architecture might comprise a basis: any assignment of weights is an element of the basis, and the learning rule 휎 is a mechanism for updating the active element of the basis (the parameters) in light of experience. In this case, the number of parameters of the network is constrained by what we can actually build, and the learning rule needs to be suitably efficient and well-behaved.
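The following Python sketch makes this picture concrete under simple assumptions: the basis is the set of weight vectors of a small linear model, and a gradient-descent learning rule selects the next basis element after each observation. The architecture, loss, and step size are illustrative, and, for brevity, the current weights stand in for the information the learning rule 휎 would recover from the full history.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_agent(weights):
    """Each weight vector is one element of the basis: a fixed classifier."""
    return lambda x: float(x @ weights > 0)

def sigma(history, weights, step_size=0.1):
    """A learning rule: given the latest experience, pick the next basis element
    (weight vector) by one step of gradient descent on squared error."""
    if not history:
        return weights
    x, y = history[-1]                         # most recent labelled example
    prediction = x @ weights
    gradient = (prediction - y) * x            # d/dw of 0.5 * (x.w - y)^2
    return weights - step_size * gradient

# Interaction: the active basis element changes after each observation.
weights = rng.normal(size=3)
history = []
for _ in range(5):
    x = rng.normal(size=3)
    y = float(x[0] > 0)                        # a toy labelling rule
    history.append((x, y))
    weights = sigma(history, weights)          # switch to a new base agent
    agent = base_agent(weights)
```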
We might again think of the learning rule 휎 as gradient descent, rather than a rule that can search through the basis in an unconstrained way. In this sense, the basis is not arbitrary. We as designers choose a class of functions to act as the relevant representations of behavior, often limited by resource constraints on memory or compute. Then, we use specific learning rules that have been carefully designed to react to experience in a desirable way; for instance, stochastic gradient descent updates the current choice of basis element in the direction that would most improve performance. For these reasons, the choice of basis is not arbitrary, but instead reflects the ingredients involved in the design of agents as well as the constraints necessarily imposed by the environment.

4.5 Properties of Generates and Reaches

Lastly, we summarize some of the basic properties of generates and reaches. Further analysis of generates, reaches, and their variations is provided in Appendix C.

Theorem 4.2. The following properties hold of the generates operator:

1. Generates is transitive: For any triple (Λ1, Λ2, Λ3) and 푒 ∈ ℰ, if Λ1 ⇝푒 Λ2 and Λ2 ⇝푒 Λ3, then Λ1 ⇝푒 Λ3.
2. Generates is not commutative: there exists a pair (Λ1, Λ2) and 푒 ∈ ℰ such that Λ1 ⇝푒 Λ2, but ¬(Λ2 ⇝푒 Λ1).
3. For all Λ and every pair of agent bases (Λ1B, Λ2B) such that Λ1B ⊆ Λ2B, if Λ1B ⇝푒 Λ, then Λ2B ⇝푒 Λ.
4. For all Λ and 푒 ∈ ℰ, Λ ⇝푒 Λ.
5. The decision problem, "Given (푒, ΛB, Λ), output True iff ΛB ⇝푒 Λ," is undecidable.

The fact that generates is transitive suggests that the basic tools of an agent set paired with a set of learning rules might be likened to an algebraic structure. We can draw a symmetry between an agent basis and the basis of a vector space: A vector space is comprised of all linear combinations of the basis, whereas Λ is comprised of all valid switches (according to the learning rules) between the base agents. However, the fact that generates is not commutative (by point 2.) raises a natural question: are there choices of learning rules under which generates is commutative? We suggest that a useful direction for future work can further explore an algebraic perspective on agents. We find many similar properties hold of reaches.

Theorem 4.3. The following properties hold of the reaches operator:

1. →푒 and ↛푒 are not transitive.
2. Sometimes reaches is not commutative: there exists a pair (Λ1, Λ2) and 푒 ∈ ℰ such that ∀휆1 ∈ Λ1 : 휆1 →푒 Λ2, but ∃휆2 ∈ Λ2 : 휆2 ↛푒 Λ1.
3. For all pairs (Λ, 푒), if 휆 ∈ Λ, then 휆 →푒 Λ.
4. Every agent satisfies 휆 →푒 Λ̄ in every environment.
5. The decision problem, "Given (푒, 휆, Λ), output True iff 휆 →푒 Λ," is undecidable.

Many of these properties resemble those in Theorem 4.2. For instance, point (5.) shows that deciding whether a given agent sometimes reaches a basis in an environment is undecidable. We anticipate that the majority of decision problems related to determining properties of arbitrary agent sets interacting with unconstrained environments will be undecidable, though it is still worth making these arguments carefully. Moreover, there may be interesting special cases in which these decision problems are decidable (and perhaps, efficiently so). We suggest that identifying these special cases and fleshing out their corresponding efficient algorithms is an interesting direction for future work.

5 Discussion

In this paper, we carefully develop a simple mathematical definition of the continual RL problem.
We take this problem to be of central importance to AI as a field, and hope that these tools and perspectives can serve as an opportunity to think about CRL and its related artifacts more carefully. Our proposal is framed around two new insights about agents: (i) every agent can be understood as though it were searching over an agent basis (Theorem 3.1), and (ii) every agent, in the limit, will either sometimes or never stop this search (Remark 3.2). These two insights are formalized through the generates and reaches operators, which provide a rich toolkit for understanding agents in a new way; for example, we find straightforward definitions of a continual learning agent (Definition 4.1) and learning rules (Definition 3.2). We anticipate that further study of these operators and different families of learning rules can directly inform the design of new learning algorithms; for instance, we might characterize the family of continual learning rules that are guaranteed to yield continual learning agents, and use this to guide the design of principled continual learning agents (in the spirit of continual backprop by Dohare et al. [14]). In future work, we intend to further explore connections between our formalism of continual learning and some of the phenomena at the heart of recent empirical continual learning studies, such as plasticity loss [34, 1, 15], in-context learning [8], and catastrophic forgetting [38, 18, 21, 26]. More generally, we hope that our definitions, analysis, and perspectives can help the community to think about continual reinforcement learning in a new light.

Acknowledgements

The authors are grateful to Michael Bowling, Clare Lyle, Razvan Pascanu, and Georgios Piliouras for comments on a draft of the paper, as well as the anonymous NeurIPS reviewers that provided valuable feedback on the paper. The authors would further like to thank all of the 2023 Barbados RL Workshop participants and Elliot Catt, Will Dabney, Sebastian Flennerhag, András György, Steven Hansen, Anna Harutyunyan, Mark Ho, Joe Marino, Joseph Modayil, Rémi Munos, Evgenii Nikishin, Brendan O'Donoghue, Matt Overlan, Mark Rowland, Tom Schaul, Yannick Schroecker, Rich Sutton, Yunhao Tang, Shantanu Thakoor, and Zheng Wen for inspirational conversations.

References

[1] Zaheer Abbas, Rosie Zhao, Joseph Modayil, Adam White, and Marlos C Machado. Loss of plasticity in continual deep reinforcement learning. arXiv preprint arXiv:2303.07507, 2023.
[2] David Abel, Yuu Jinnai, Yue Guo, George Konidaris, and Michael L. Littman. Policy and value transfer in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.
[3] Haitham Bou Ammar, Rasul Tutunov, and Eric Eaton. Safe policy search for lifelong reinforcement learning with sublinear regret. In Proceedings of the International Conference on Machine Learning, 2015.
[4] Megan M Baker, Alexander New, Mario Aguilar-Simon, Ziad Al-Halah, Sébastien MR Arnold, Ese Ben-Iwhiwhu, Andrew P Brna, Ethan Brooks, Ryan C Brown, Zachary Daniels, et al. A domain-agnostic approach for characterization of lifelong learning systems. Neural Networks, 160:274-296, 2023.
[5] Peter L Bartlett. Learning with a slowly changing distribution. In Proceedings of the Annual Workshop on Computational Learning Theory, 1992.
[6] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems, 2014.
[7] Michael Bowling, John D. Martin, David Abel, and Will Dabney.
Settling the reward hypothesis. In Proceedings of the International Conference on Machine Learning, 2023.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.
[9] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2014.
[10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2010.
[11] Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the AAAI Conference on Artificial Intelligence, 1994.
[12] Michael K Cohen, Elliot Catt, and Marcus Hutter. A strongly asymptotically optimal agent in general environments. arXiv preprint arXiv:1903.01021, 2019.
[13] Travis Dick, András György, and Csaba Szepesvari. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the International Conference on Machine Learning, 2014.
[14] Shibhansh Dohare, Richard S Sutton, and A Rupam Mahmood. Continual backprop: Stochastic gradient descent with persistent randomness. arXiv preprint arXiv:2108.06325, 2021.
[15] Shibhansh Dohare, Juan Hernandez-Garcia, Parash Rahman, Richard Sutton, and A Rupam Mahmood. Loss of plasticity in deep continual learning. arXiv preprint arXiv:2306.13812, 2023.
[16] Shi Dong, Benjamin Van Roy, and Zhengyuan Zhou. Simple agent, complex environment: Efficient reinforcement learning with agent states. Journal of Machine Learning Research, 23(255):1-54, 2022.
[17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, 2017.
[18] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, 1999.
[19] Milton Friedman. Essays in Positive Economics. University of Chicago Press, 1953.
[20] Haotian Fu, Shangqun Yu, Michael Littman, and George Konidaris. Model-based lifelong reinforcement learning with Bayesian exploration. Advances in Neural Information Processing Systems, 2022.
[21] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[22] Raia Hadsell, Dushyant Rao, Andrei A Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 24(12):1028-1040, 2020.
[23] Marcus Hutter. A theory of universal artificial intelligence based on algorithmic complexity. arXiv preprint cs/0004001, 2000.
[24] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer Science & Business Media, 2004.
[25] Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards continual reinforcement learning: A review and perspectives. Journal of Artificial Intelligence Research, 75:1401-1476, 2022.
[26] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.
[27] Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Yueyang Liu, and Benjamin Van Roy. Continual learning as computationally constrained reinforcement learning. arXiv preprint arXiv:2307.04345, 2023.
[28] Tor Lattimore. Theory of general reinforcement learning. PhD thesis, The Australian National University, 2014.
[29] Jan Leike. Nonparametric general reinforcement learning. PhD thesis, The Australian National University, 2016.
[30] Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Davide Maltoni, David Filliat, and Natalia Díaz-Rodríguez. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion, 58:52-68, 2020.
[31] Yueyang Liu, Benjamin Van Roy, and Kuang Xu. A definition of non-stationary bandits. arXiv preprint arXiv:2302.12202, 2023.
[32] Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, and Zheng Wen. Reinforcement learning, bit by bit. Foundations and Trends in Machine Learning, 16(6):733-865, 2023. ISSN 1935-8237.
[33] Jelena Luketina, Sebastian Flennerhag, Yannick Schroecker, David Abel, Tom Zahavy, and Satinder Singh. Meta-gradients in non-stationary environments. In Proceedings of the Conference on Lifelong Learning Agents, 2022.
[34] Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. In Proceedings of the International Conference on Machine Learning, 2023.
[35] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28-51, 2022.
[36] Sultan J Majeed. Abstractions of general reinforcement learning. PhD thesis, The Australian National University, 2021.
[37] Sultan Javed Majeed and Marcus Hutter. Performance guarantees for homomorphisms beyond Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[38] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109-165. Elsevier, 1989.
[39] Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi, Matt Gardner, Bryan Kisiel, et al. Never-ending learning. Communications of the ACM, 61(5):103-115, 2018.
[40] Claire Monteleoni and Tommi Jaakkola. Online learning of non-stationary sequences. Advances in Neural Information Processing Systems, 16, 2003.
[41] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. 2018.
[42] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54-71, 2019.
[43] Emmanouil Antonios Platanios, Abulhair Saparov, and Tom Mitchell. Jelly bean world: A testbed for never-ending learning. arXiv preprint arXiv:2002.06306, 2020.
[44] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[45] Matthew Riemer, Sharath Chandra Raparthy, Ignacio Cases, Gopeshh Subbaraj, Maximilian Puelma Touzel, and Irina Rish. Continual learning in environments with polynomial mixing times. Advances in Neural Information Processing Systems, 2022.
[46] Mark B Ring. Continual learning in reinforcement environments. PhD thesis, The University of Texas at Austin, 1994.
[47] Mark B Ring. Child: A first step towards continual learning. Machine Learning, 28(1):77-104, 1997.
[48] Mark B Ring. Toward a formal framework for continual learning. In NeurIPS Workshop on Inductive Transfer, 2005.
[49] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 2019.
[50] Stuart J Russell and Devika Subramanian. Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2:575-609, 1994.
[51] Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. In Proceedings of the International Conference on Machine Learning, 2013.
[52] Tom Schaul and Jürgen Schmidhuber. Metalearning. Scholarpedia, 5(6):4650, 2010.
[53] Jürgen Schmidhuber, Jieyu Zhao, and Nicol N Schraudolph. Reinforcement learning with self-modifying policies. In Learning to Learn, pages 293-309. Springer, 1998.
[54] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Proceedings of the International Conference on Machine Learning, 2018.
[55] Daniel L Silver. Machine lifelong learning: Challenges and benefits for artificial general intelligence. In Proceedings of the Conference on Artificial General Intelligence, 2011.
[56] Richard S Sutton. Introduction: The challenge of reinforcement learning. In Reinforcement Learning, pages 1-3. Springer, 1992.
[57] Richard S Sutton. The reward hypothesis, 2004. URL http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html.
[58] Richard S Sutton. The quest for a common model of the intelligent decision maker. arXiv preprint arXiv:2202.13252, 2022.
[59] Richard S Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[60] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633-1685, 2009.
[61] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? Advances in Neural Information Processing Systems, 1995.
[62] Sebastian Thrun. Lifelong learning algorithms. Learning to Learn, 8:181-209, 1998.
[63] Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25-46, 1995.
[64] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the International Conference on Machine Learning, 2007.

A Notation

We first provide a table summarizing all relevant notation.
Notation: meaning (and definition, where applicable).

풜 : Actions
풪 : Observations
ℋ푡 : Length-푡 histories; ℋ푡 = (풜 × 풪)푡
ℋ : All histories; ℋ = ⋃∞푡=0 ℋ푡
ℎ : A history; ℎ ∈ ℋ
ℎℎ′ : History concatenation
ℎ푡 : Length-푡 history; ℎ푡 ∈ ℋ푡
ℋ휆,푒 : Realizable histories (abbreviated ℋ when 휆, 푒 are clear); ℋ휆,푒 = ⋃∞푡=0 { ℎ푡 ∈ ℋ푡 : ∏푡−1푘=0 푒(표푘+1 | ℎ푘, 푎푘)휆(푎푘 | ℎ푘) > 0 }
ℋ휆,푒ℎ : Realizable history suffixes (abbreviated ℋℎ); ℋ휆,푒ℎ = {ℎ′ ∈ ℋ : ℎℎ′ ∈ ℋ휆,푒}
푒 : Environment; 푒: ℋ × 풜 → Δ(풪)
ℰ : Set of all environments
휆 : Agent; 휆: ℋ → Δ(풜)
Λ̄ : Set of all agents
Λ : Set of agents; Λ ⊆ Λ̄
ΛB : Agent basis; ΛB ⊆ Λ̄
푟 : Reward function; 푟: 풜 × 풪 → ℝ
푣 : Performance; 푣: ℋ × Λ̄ × ℰ → [vmin, vmax]
휎 : Learning rule; 휎: ℋ → ΛB
Σ̄ : Set of all learning rules
Σ : Set of learning rules; Σ ⊆ Σ̄
ΛB ⇝푒Σ Λ : Σ-generates; ∀휆 ∈ Λ ∃휎 ∈ Σ ∀ℎ ∈ ℋ휆,푒 : 휆(ℎ) = 휎(ℎ)(ℎ)
ΛB ⇝푒 Λ : Generates; ∃Σ ⊆ Σ̄ : ΛB ⇝푒Σ Λ
ΛB ⊨Σ Λ : Universally Σ-generates; ∀휆 ∈ Λ ∃휎 ∈ Σ ∀ℎ ∈ ℋ : 휆(ℎ) = 휎(ℎ)(ℎ)
ΛB ⊨ Λ : Universally generates; ∃Σ ⊆ Σ̄ : ΛB ⊨Σ Λ
휆 →푒 ΛB : Sometimes reaches; ∃ℎ ∈ ℋ휆,푒 ∃휆B ∈ ΛB ∀ℎ′ ∈ ℋ휆,푒ℎ : 휆(ℎℎ′) = 휆B(ℎℎ′)
휆 ↛푒 ΛB : Never reaches; ¬(휆 →푒 ΛB)
휆 ↠푒 ΛB : Always reaches; ∀ℎ ∈ ℋ휆,푒 ∃푡 ∈ ℕ0 ∃ℎ′ ∈ ℋ푡 with ℎℎ′ ∈ ℋ휆,푒, ∃휆B ∈ ΛB ∀ℎ′′ ∈ ℋ휆,푒ℎℎ′ : 휆(ℎℎ′ℎ′′) = 휆B(ℎℎ′ℎ′′)

Table 1: A summary of notation.

B Proofs of Presented Results

We next provide proofs of each result from the paper. Our proofs make use of some extra notation: we use ⇒ as logical implication, and we use 풫(풳) to denote the power set of any set 풳. Lastly, we use 풜 ⊆ 풳 and 풜 ⊂ 풳 as shorthand for 풜 ∈ 풫(풳) and 풜 ∈ 풫(풳) with 풜 ≠ 풳, respectively.

B.1 Section 3 Proofs

Our first result is from Section 3 of the paper.

Theorem 3.1. For any pair (휆, 푒), there exist infinitely many choices of a basis, ΛB, such that both (1) 휆 ∉ ΛB, and (2) ΛB ⇝푒 {휆}.

Proof of Theorem 3.1. Choose a fixed but arbitrary pair (휆, 푒). Then, enumerate the realizable histories, ℋ휆,푒, and let ℎ1 denote the first element of this enumeration, ℎ2 the second, and so on. We design a constructive procedure for a basis that, when repeatedly applied, induces an infinite enumeration of bases that satisfy the desired two properties. The basis produced by the 푘-th application of this procedure contains 푘+1 agents, where each agent is distinct from 휆 but agrees with 휆 on all but one out of every 푘+1 elements of the history sequence ℎ1, ℎ2, . . ..

For the first (푘 = 1) basis, we construct two agents. The first, 휆1B, chooses the same action distribution as 휆 on each even-numbered history: 휆1B(ℎ푖) = 휆(ℎ푖) for 푖 even. This agent chooses a different action distribution on the odd-numbered histories: 휆1B(ℎ푖+1) ≠ 휆(ℎ푖+1), for 푖 any even natural number. The second agent, 휆2B, does the opposite of 휆1B: on each odd-numbered history ℎ푖+1, 휆2B(ℎ푖+1) = 휆(ℎ푖+1), but on every even-numbered history, 휆2B(ℎ푖) ≠ 휆(ℎ푖). Observe first that by construction, 휆 ≠ 휆1B and 휆 ≠ 휆2B, since there exist histories where they choose different action distributions. Next, observe that the basis, ΛB = {휆1B, 휆2B}, generates {휆} in 푒 through the following set of learning rules, Σ: given any realizable history, ℎ ∈ ℋ휆,푒, check whether the history has an even- or odd-numbered index in the enumeration. If even, choose 휆1B; if odd, choose 휆2B. More generally, this procedure can be applied for any 푘:

Λ푘B = {휆1B, . . . , 휆푘+1B}, with 휆푖B(ℎ) = 휆̄(ℎ) if [ℎ] == 푖, and 휆(ℎ) otherwise, (B.1)

where we use the notation [ℎ] == 푖 to express the logical predicate asserting that the index of ℎ in the enumeration ℎ1, ℎ2, . . ., taken modulo 푘+1, is equal to 푖. Further, 휆̄(ℎ) simply refers to any choice of action distribution that is unequal to 휆(ℎ). Thus, for all natural numbers 푘 ≥ 2, we can construct a new basis consisting of 푘 base agents that generates {휆} in 푒, but does not contain the agent itself. This completes the argument.

B.2 Section 4 Proofs

We next present the proofs of results from Section 4.
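Before turning to those proofs, the following Python sketch illustrates the 푘 = 1 construction from the proof of Theorem 3.1 above, assuming deterministic agents. The index of each history in the enumeration is passed explicitly for brevity (in the proof it is recovered from the enumeration itself), and all names are illustrative.

```python
def make_basis_and_rule(agent, deviate):
    """Return two base agents, neither equal to `agent`, plus a learning rule
    that switches between them so as to reproduce `agent` exactly."""
    def base_1(index, history):
        # agrees with the agent on even-indexed histories, deviates on odd ones
        return agent(history) if index % 2 == 0 else deviate(history)

    def base_2(index, history):
        # the mirror image: agrees on odd-indexed histories, deviates on even ones
        return agent(history) if index % 2 == 1 else deviate(history)

    def sigma(index):
        # pick whichever base agent agrees with the agent at this history
        return base_1 if index % 2 == 0 else base_2

    return base_1, base_2, sigma

agent   = lambda h: "a"                  # a toy agent
deviate = lambda h: "b"                  # any action distribution unequal to agent(h)
base_1, base_2, sigma = make_basis_and_rule(agent, deviate)

for index, history in enumerate(["h1", "h2", "h3", "h4"], start=1):
    chosen = sigma(index)
    assert chosen(index, history) == agent(history)   # the basis generates {agent}
```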
B.2.1 Theorem 4.1: Properties of CRL

We begin with Theorem 4.1, which establishes basic properties of CRL.

Theorem 4.1. Every instance of CRL (푒, 푣, Λ, ΛB) satisfies the following properties:

1. If Λ* ∪ ΛB ⊂ Λ, there exists a Λ′B such that (1) Λ′B ⇝푒 Λ, and (2) (푒, 푣, Λ, Λ′B) is not an instance of CRL.
2. No element of ΛB is optimal: ΛB ∩ Λ* = ∅.
3. If |Λ| is finite, there exists an agent set, Λ′, such that |Λ′| < |Λ| and Λ′ ⇝푒 Λ.
4. If |Λ| is infinite, there exists an agent set, Λ′, such that Λ′ ⊂ Λ and Λ′ ⇝푒 Λ.

We prove this result in the form of three lemmas, corresponding to each of the four points of the theorem (with the third lemma, Lemma B.3, covering both points 3. and 4.). Some of the lemmas make use of properties of generates and reaches that we establish later in Appendix C.

Lemma B.1. For all instances of CRL (푒, 푣, Λ, ΛB), if Λ* ∪ ΛB ⊂ Λ, then there exists a choice Λ′B such that (1) Λ′B ⇝푒 Λ, and (2) (푒, 푣, Λ, Λ′B) is not an instance of CRL.

Proof of Lemma B.1. Recall that a tuple (푒, 푣, Λ, ΛB) is CRL just when all of the optimal agents Λ* do not reach the basis. Then, the result holds as a straightforward consequence of two facts. First, we can always construct a new basis containing all of the optimal agents, Λ′B = ΛB ∪ Λ*. Notice that Λ′B still generates Λ by property three of Theorem 4.2. Further, since both ΛB and Λ* are each subsets of Λ, and by assumption Λ* ∪ ΛB ⊂ Λ (so there is at least one sub-optimal agent that is not in the basis), it follows that Λ′B ⊂ Λ. Second, by Proposition C.15, we know that every element 휆′B ∈ Λ′B will always reach the basis, 휆′B ↠푒 Λ′B. Therefore, in the tuple (푒, 푣, Λ, Λ′B), each of the optimal agents will reach the basis, and therefore this is not an instance of CRL.

Lemma B.2. No element of ΛB is optimal: ΛB ∩ Λ* = ∅.

Proof of Lemma B.2. The lemma follows as a combination of two facts. First, recall that, by definition of CRL, each optimal agent 휆* ∈ Λ* satisfies 휆* ↛푒 ΛB. Second, note that by Lemma B.11, we know that each 휆B ∈ ΛB satisfies 휆B →푒 ΛB. Therefore, since sometimes reaches (→푒) and never reaches (↛푒) are negations of one another, we conclude that no basis element can be optimal.

Before stating the next lemma, we note that points (3.) and (4.) of Theorem 4.1 are simply expansions of the definition of a minimal agent set, which we define precisely in Definition C.4 and Definition C.5.

Lemma B.3. For any instance of CRL, Λ is not minimal.

Proof of Lemma B.3. We show that Λ cannot be minimal by considering the cases where the rank (Definition C.3) of Λ is finite and infinite separately.

(Finite Rank Λ.) If rank(Λ) is finite and minimal, then it follows immediately that there is no agent set of smaller rank that generates Λ. By consequence, since ΛB ⊂ Λ and ΛB ⇝푒 Λ, we conclude that Λ cannot be minimal.

(Infinite Rank Λ.) If rank(Λ) is infinite and minimal, then there is no proper subset of Λ that uniformly generates Λ, by definition. By consequence, since ΛB ⊂ Λ and ΛB ⇝푒 Λ, we conclude that Λ cannot be minimal.

This completes the argument of both cases, and we conclude that for any instance of CRL, Λ is not minimal.

B.2.2 Theorem 4.2: Properties of Generates

Next, we prove basic properties of generates.

Theorem 4.2. The following properties hold of the generates operator:

1. Generates is transitive: For any triple (Λ1, Λ2, Λ3) and 푒 ∈ ℰ, if Λ1 ⇝푒 Λ2 and Λ2 ⇝푒 Λ3, then Λ1 ⇝푒 Λ3.
2. Generates is not commutative: there exists a pair (Λ1, Λ2) and 푒 ∈ ℰ such that Λ1 ⇝푒 Λ2, but ¬(Λ2 ⇝푒 Λ1).
3. For all Λ and every pair of agent bases (Λ1B, Λ2B) such that Λ1B ⊆ Λ2B, if Λ1B ⇝푒 Λ, then Λ2B ⇝푒 Λ.
4.
For all Λ and 푒 ℰ, Λ 푒Λ. 5. The decision problem, Given (푒, ΛB, Λ), output True iff ΛB 푒Λ, is undecidable. The proof of this theorem is spread across the next five lemmas below. The fact that generates is transitive suggests that the basic tools of an agent set paired with a set of learning rules might be likened to an algebraic structure. We can draw a symmetry between an agent basis and the basis of a vector space: A vector space is comprised of all linear combinations of the basis, whereas Λ is comprised of all valid switches (according to the learning rules) between the base agents. However, the fact that generates is not commutative (by point 2.) raises a natural question: are there choices of learning rules under which generates is commutative? An interesting direction for future work is to explore this style of algebraic analysis on agents. Lemma B.4. Generates is transitive: For any triple (Λ1, Λ2, Λ3) and 푒 ℰ, if Λ1 푒Λ2 and Λ2 푒Λ3, then Λ1 푒Λ3. Proof of Lemma B.4. Assume Λ1 푒Λ2 and Λ2 푒Λ3. Then, by Proposition C.4 and the definition of the generates operator, we know that 휆2 Λ2 휎1 Σ1 ℎ ℋ휆2(ℎ) = 휎1(ℎ)(ℎ), (B.2) 휆3 Λ3 휎2 Σ2 ℎ ℋ휆3(ℎ) = 휎2(ℎ)(ℎ), (B.3) where Σ1 and Σ2 express the set of all learning rules over Λ1 and Λ2 respectively. By definition of a learning rule, 휎, we rewrite the above as follows, 휆2 Λ2 ℎ ℋ 휆1 Λ1 휆2(ℎ) = 휆1(ℎ), (B.4) 휆3 Λ3 ℎ ℋ 휆2 Λ2 휆3(ℎ) = 휆2(ℎ). (B.5) Then, consider a fixed but arbitrary 휆3 Λ3. We construct a learning rule defined over Λ1 as 휎1 : ℋ Λ1 that induces an equivalent agent as follows. For each realizable history, ℎ ℋ, by Equation B.5 we know that there is an 휆2 such that 휆3(ℎ) = 휆2(ℎ), and by Equation B.4, there is an 휆1 such that 휆2(ℎ) = 휆1(ℎ). Then, set 휎1 : ℎ 휆1 such that 휆1(ℎ) = 휆2(ℎ) = 휆3(ℎ) Since ℎand 휆3 were chosen arbitrarily, we conclude that 휆3 Λ3 ℎ ℋ 휆1 Λ1 휆3(ℎ) = 휆1(ℎ). But, by the definition of Σ, this means there exists a learning rule such that 휆3 Λ3 휎1 Σ1 ℎ ℋ휆3(ℎ) = 휎1(ℎ)(ℎ). This is exactly the definition of Σ-generation, and by Proposition C.4, we conclude Λ1 푒 Lemma B.5. Generates is not commutative: there exists a pair (Λ1, Λ2) and 푒 ℰsuch that Λ1 푒Λ2, but (Λ2 푒Λ1). Proof of Lemma B.5. The result follows from a simple counterexample: consider the pair Λ1 = {휆푖: ℎ 푎1}, Λ2 = {휆푖: ℎ 푎1, 휆푗: ℎ 푎2}. Note that since 휆푖is in both sets, and Λ1 is a singleton, we know that Λ2 푒Λ1 in any environment. But, by Proposition C.6, we know that Λ1 cannot generate Λ2. Lemma B.6. For all Λ and pair of agent bases (Λ1 B, Λ2 B) such that Λ1 B Λ2 B, if Λ1 B 푒Λ, then Λ2 B 푒Λ. Proof of Lemma B.6. The result follows as a natural consequence of the definition of generates. Recall that Λ1 B 푒Λ just when, Σ1 Σ Λ1 B 푒 Σ1 Σ 휎1 Σ1 ℎ ℋ휆(ℎ) = 휆휎1(ℎ) B (ℎ), (B.7) where again 휆휎1(ℎ) B Λ1 B is the base agent chosen by 휎1(ℎ). We use superscripts Σ1 and 휎1 to signify that 휎1 is defined relative to Λ1 B, that is, 휎1 : ℋ Λ1 B Σ1. But, since Λ1 B Λ2 B, we can define Σ2 = Σ1 and ensure that Λ2 B 푒 Σ2 Λ, since the agent basis Λ1 B was already sufficient to generate Λ. Therefore, we conclude that Λ2 B 푒Λ. Lemma B.7. For all Λ and 푒 ℰ, Λ 푒Λ. Proof of Lemma B.7. This is a direct consequence of Proposition C.18. Lemma B.8. The decision problem, AGENTSGENERATE, Given (푒, ΛB, Λ), output True iff ΛB 푒Λ, is undecidable. Proof of Lemma B.8. We proceed as is typical of such results by reducing AGENTSGENERATE from the Halting Problem. In particular, let 푚be a fixed but arbitrary Turing Machine, and 푤be a fixed but arbitrary input to be given to machine 푚. 
Then, HALT defines the decision problem that outputs True iff 푚 halts on input 푤. We construct an oracle for AGENTSGENERATE that can decide HALT as follows. Let (풜, 풪) be an interface where the observation space is comprised of all configurations of machine 푚. Then, we consider a deterministic environment 푒 that simply produces the next configuration of 푚 when run on input 푤, based on the current tape contents, the state of 푚, and the location of the tape head. Note that all three of these elements are contained in a Turing Machine's configuration, and that a single configuration indicates whether the Turing Machine is in a halting state or not. Now, let the action space 풜 consist of two actions, {푎no-op, 푎halt}. On execution of 푎no-op, the environment moves to the next configuration. On execution of 푎halt, the machine halts. That is, we restrict ourselves to the singleton agent set Λ containing the agent 휆∗ that outputs 푎halt directly following the machine entering a halting configuration, and 푎no-op otherwise:
휆∗ : ℎ푎표 ↦ 푎halt if 표 is a halting configuration, and 푎no-op otherwise, with Λ = {휆∗}.
Using these ingredients, we take any instance of HALT, (푚, 푤), and consider the singleton agent basis Λ1B = {휆no-op : ℎ ↦ 푎no-op}. We make one query to our AGENTSGENERATE oracle, and ask whether Λ1B ⇝푒 Λ. If the answer is True, then on the histories realizable by the pair (휆∗, 푒) the single agent in Λ never emits the 푎halt action, and thus 푚 does not halt on 푤. If the answer is False, then there are realizable histories in 푒 in which 푚 halts on 푤. We thus use the oracle's response directly to decide the given instance of HALT.
B.2.3 Theorem 4.3: Properties of Reaches
We find that many similar properties hold for reaches.
Theorem 4.3. The following properties hold of the reaches operator:
1. Sometimes reaches (⤳푒) and never reaches (¬⤳푒) are not transitive.
2. Sometimes reaches is not commutative: there exists a pair (Λ1, Λ2) and 푒 ∈ ℰ such that every 휆1 ∈ Λ1 satisfies 휆1 ⤳푒 Λ2, but some 휆2 ∈ Λ2 satisfies ¬(휆2 ⤳푒 Λ1).
3. For all pairs (Λ, 푒), if 휆 ∈ Λ, then 휆 ⤳푒 Λ.
4. Every agent always reaches the set of all agents: 휆 ⇥푒 Λ̄ for every 휆 ∈ Λ̄ and every 푒 ∈ ℰ.
5. The decision problem, Given (푒, 휆, Λ), output True iff 휆 ⤳푒 Λ, is undecidable.
Again, we prove this result through five lemmas that correspond to each of the above properties. Many of these properties resemble those in Theorem 4.2. For instance, point (5.) shows that deciding whether a given agent sometimes reaches a basis in an environment is undecidable. We anticipate that the majority of decision problems related to determining properties of arbitrary agent sets will be undecidable, though it is still worth making these arguments carefully. Moreover, there may be interesting special cases in which these decision problems are decidable (and perhaps, efficiently so). Identifying these special cases and their corresponding efficient algorithms is another interesting direction for future work.
Lemma B.9. Sometimes reaches and never reaches are not transitive.
Proof of Lemma B.9. We construct two counterexamples, one for each of sometimes reaches (⤳푒) and never reaches (¬⤳푒).
Counterexample: Sometimes Reaches. We begin with a tuple (푒, Λ1, Λ2, Λ3) such that both every 휆1 ∈ Λ1 satisfies 휆1 ⤳푒 Λ2, and every 휆2 ∈ Λ2 satisfies 휆2 ⤳푒 Λ3. We will show that there is an agent, 휆′1 ∈ Λ1, such that ¬(휆′1 ⤳푒 Λ3), thus illustrating that sometimes reaches is not guaranteed to be transitive. The basic idea is that sometimes reaches only requires an agent to stop its search on one realizable history. So, 휆′1 ⤳푒 Λ2 might hold in virtue of some history ℎ, but each 휆2 ∈ Λ2 might only reach Λ3 on an entirely different history. As a result, reaching Λ2 is not enough to ensure the agent also reaches Λ3.
In more detail, the agent sets of the counterexample are as follows. Let 풜 = {푎1, 푎2} and 풪 = {표1, 표2}. Let Λ2 be the set of all agents that, after ten timesteps, always take 푎2. The agent 휆′1 is simple: it always takes 푎1, except on one realizable history, ℎ∗ (and all of the realizable successors of ℎ∗, ℋ휆,푒ℎ∗), where it switches to taking 푎2 after ten timesteps. Clearly 휆′1 ⤳푒 Λ2, since after ten timesteps we know there will be some 휆2 such that 휆′1(ℎ∗ℎ′) = 휆2(ℎ∗ℎ′) for all realizable history suffixes ℎ′. Now, by assumption, we know that 휆2 ⤳푒 Λ3. This ensures there is a single realizable history ℎ such that there is a 휆3 with 휆2(ℎℎ′) = 휆3(ℎℎ′) for every realizable suffix ℎ′. To finish the counterexample, we simply note that this realizable ℎ can be different from ℎ∗ and all of its successors. For example, ℎ∗ might be the history containing only 표1 for the first ten timesteps, while ℎ could be the history containing only 표2 for the first ten timesteps. Thus, this 휆′1 never reaches Λ3, and we conclude the counterexample.
Counterexample: Never Reaches. The instance for never reaches is simple: let 풜 = {푎1, 푎2, 푎3}, and Λ1 = Λ3. Suppose all agents in Λ1 (and thus Λ3) only choose actions 푎1 and 푎3. Let Λ2 be a singleton, Λ2 = {휆2}, with 휆2 : ℎ ↦ 푎2. Clearly, every 휆1 ∈ Λ1 never reaches Λ2, since none of them ever chooses 푎2. Similarly, 휆2 never reaches Λ3, since no agent in Λ3 chooses 푎2. However, by Proposition C.15 and the assumption that Λ1 = Λ3, we know every 휆1 ∈ Λ1 satisfies 휆1 ⤳푒 Λ3. This directly violates transitivity.
This completes the argument for both counterexamples, and we conclude.
Lemma B.10. Sometimes reaches is not commutative: there exists a pair (Λ1, Λ2) and 푒 ∈ ℰ such that every 휆1 ∈ Λ1 satisfies 휆1 ⤳푒 Λ2, but some 휆2 ∈ Λ2 satisfies ¬(휆2 ⤳푒 Λ1).
Proof of Lemma B.10. The result holds as a straightforward consequence of the following counterexample. Consider the pair of agent sets Λ1 = {휆푖 : ℎ ↦ 푎1} and Λ2 = {휆푖 : ℎ ↦ 푎1, 휆푗 : ℎ ↦ 푎2}. Since 휆푖 is in both sets, and Λ1 is a singleton, we know that every agent in Λ1 satisfies 휆푖 ⤳푒 Λ2 in any environment by Lemma B.11. But clearly 휆푗 never reaches Λ1, since no agent in Λ1 ever chooses 푎2.
Lemma B.11. For all pairs (Λ, 푒), if 휆 ∈ Λ, then 휆 ⤳푒 Λ.
Proof of Lemma B.11. The claim is straightforward, as any 휆 ∈ Λ is equivalent to itself in behavior on all histories.
Lemma B.12. Every agent always reaches the set of all agents: 휆 ⇥푒 Λ̄ in every environment.
Proof of Lemma B.12. This is again a direct consequence of Proposition C.18.
Lemma B.13. The decision problem, AGENTREACHES, Given (푒, 휆, Λ), output True iff 휆 ⤳푒 Λ, is undecidable.
Proof of Lemma B.13. We again proceed by reduction from the Halting Problem. In particular, let 푚 be a fixed but arbitrary Turing Machine, and 푤 a fixed but arbitrary input to be given to machine 푚. Then, HALT defines the decision problem that outputs True iff 푚 halts on input 푤. We construct an oracle for AGENTREACHES that can decide HALT as follows. Consider the same observation space used in the proof of Lemma B.8: let 풪 be comprised of all configurations of machine 푚, so that sequences of observations are simply the evolution of Turing Machines processing possible inputs. We consider an action space 풜 = {푎halted, 푎not yet}, where agents simply report whether the history so far contains a halting configuration. Then, we consider a deterministic environment 푒 that simply produces the next configuration of 푚 when run on input 푤, based on the current tape contents, the state of 푚, and the location of the tape head. Note again that all three of these elements are contained in a Turing Machine's configuration.
Using these ingredients, we take any instance of HALT, (푚, 푤), and build the singleton agent set ΛB containing only the agent 휆halted : ℎ 푎halted that always reports the machine as having halted. We then consider whether the agent that outputs 푎not yet indefinitely until 푚 reports halting, at which point the agent switches to 푎halted. We make one query to our AGENTREACHES oracle, and ask: 휆 푒ΛB. If it is True, then the branching agent eventually becomes equivalent to 휆halted in that they both indefinitely output 푎halted on at least one realizable history. Since 푒is deterministic, we know this equivalence holds across all histories. If the query reports False, then there is no future in 푒in which 푚 halts on 푤, otherwise the agent would become equivalent to 휆halted. We thus use the oracle s response directly to decide the given instance of HALT. C Additional Analysis Finally, we present a variety of additional results about agents and the generates and reaches operators. C.1 Additional Analysis: Generates We first highlight simple properties of the generates operator. Many of our results build around the notion of uniform generation, a variant of the generates operator in which a basis generates an agent set in every environment. We define this operator precisely as follows. Definition C.1. Let Σ be a set of learning rules over some basis ΛB. We say that a set Λ is uniformly 횺-generated by ΛB, denoted ΛB |=Σ Λ, if and only if 휆 Λ 휎 Σ ℎ ℋ휆(ℎ) = 휎(ℎ)(ℎ). (C.1) Definition C.2. We say a basis ΛB uniformly generates Λ, denoted ΛB |= Λ, if and only if Σ Σ ΛB |=Σ Λ. (C.2) We will first show that uniform generation entails generation in a particular environment. As a consequence, when we prove that certain properties hold of uniform generation, we can typically also conclude that the properties hold for generation as well, though there is some subtlety as to when exactly this implication will allow results about |= to apply directly to 푒. Proposition C.1. For any (ΛB, Λ) pair, if ΛB |= Λ, then for all 푒 ℰ, ΛB 푒Λ. Proof of Proposition C.1. Recall that in the definition of uniform generation, ΛB |= Λ, we require, Σ Σ 휆 Λ 휎 Σ ℎ ℋ휆(ℎ) = 휎(ℎ)(ℎ). (C.3) Now, contrast this with generates with respect to a specific environment 푒, Σ Σ 휆 Λ 휎 Σ ℎ ℋ휆(ℎ) = 휎(ℎ)(ℎ). (C.4) The only difference in the definitions is that the set of histories quantified over is ℋin the former, and ℋ= ℋ휆,푒in the latter. Since ℋ ℋfor any choice of environment 푒, we can conclude that when Equation C.3, it is also the case that Equation C.4 holds, too. Therefore, ΛB |= Λ ΛB 푒Λ for any 푒. We next show that the subset relation implies generation. Proposition C.2. Any pair of agent sets (Λsmall, Λbig) such that Λsmall Λbig satisfies Λbig |= Λsmall. (C.5) Proof of Proposition C.2. The result follows from the combination of two facts. First, that all agent sets generate themselves. That is, for arbitrary Λ, we know that Λ 푒Λ, since the trivial set of learning rules, Σtr = {휎푖: ℎ 휆푖, 휆푖 Λ}, (C.6) that never switches between agents is sufficient to generate the agent set. Second, observe that removing an agent from the generated set has no effect on the generates operator. That is, let Λ = Λ \ 휆, for fixed but arbitrary 휆 Λ. We see that Λ 푒Λ , since Σtr is sufficient to generate Λ , too. By inducting over all removals of agents from Λ, we reach our conclusion. Next, we establish properties about the sets of learning rules that correspond to the generates operator. Proposition C.3. For any (ΛB, Σ, Λ) such that ΛB |=Σ Λ, it holds that |Λ| |Σ|. 
(C.7) Proof of Proposition C.3. We proceed toward contradiction, and assume |Λ| > |Σ|. Then, there is at least one learning rule 휎 Σ that corresponds to two or more distinct agents in Λ. Call this element 휎 , and without loss of generality let 휆1 and 휆2 be two distinct agents that are each generated by 휎 in the sense that, 휆1(ℎ) = 휎 (ℎ)(ℎ), 휆2(ℎ) = 휎 (ℎ)(ℎ), (C.8) for every ℎ ℋ. But, by the distinctness of 휆1 and 휆2, there must exist a history ℎin which 휆1(ℎ) 휆2(ℎ). We now arrive at a contradiction as such a history cannot exist: By Equation C.8, we know that 휆1(ℎ) = 휎 (ℎ)(ℎ) = 휆2(ℎ) for all ℎ. We see that the universal learning rules, Σ, is the strongest in the following sense. Proposition C.4. For any basis ΛB and agent set Λ, exactly one of the two following properties hold: 1. The agent basis ΛB uniformly generates Λ under the set of all learning rules: ΛB |=Σ Λ. 2. There is no set of learning rules for which the basis Σ-uniformly generates the agent set: Σ Σ ΛB |=Σ Λ. Proof of Proposition C.4. The proof follows from the law of excluded middle. That is, for any set of learning rules Σ, either it generates Λ or it does not. If it does generate Λ, by Lemma B.6 so does Σ. By consequence, if Σ does not generate Λ, neither do any of its subsets. Furthermore, uniform generation is also transitive. Theorem C.5. Uniform generates is transitive: For any triple (Λ1, Λ2, Λ3), if Λ1 |= Λ2 and Λ2 |= Λ3, then Λ1 |= Λ3. Proof of Theorem C.5. Assume Λ1 |= Λ2 and Λ2 |= Λ3. Then, by Proposition C.4 and the definition of the uniform generates operator, we know that 휆2 Λ2 휎1 Σ1 ℎ ℋ휆2(ℎ) = 휎1(ℎ)(ℎ), (C.9) 휆3 Λ3 휎2 Σ2 ℎ ℋ휆3(ℎ) = 휎2(ℎ)(ℎ), (C.10) where Σ1 and Σ2 express the set of all learning rules over Λ1 and Λ2 respectively. By definition of a learning rule, 휎, we rewrite the above as follows, 휆2 Λ2 ℎ ℋ 휆1 Λ1 휆2(ℎ) = 휆1(ℎ), (C.11) 휆3 Λ3 ℎ ℋ 휆2 Λ2 휆3(ℎ) = 휆2(ℎ). (C.12) Then, consider a fixed but arbitrary 휆3 Λ3. We construct a learning rule defined over Λ1 as 휎1 : ℋ Λ1 that induces an equivalent agent as follows. For each history, ℎ ℋ, by Equation C.12 we know that there is an 휆2 such that 휆3(ℎ) = 휆2(ℎ), and by Equation C.11, there is an 휆1 such that 휆2(ℎ) = 휆1(ℎ). Then, set 휎1 : ℎ 휆1 such that 휆1(ℎ) = 휆2(ℎ) = 휆3(ℎ). Since ℎand 휆3 were chosen arbitrarily, we conclude that 휆3 Λ3 ℎ ℋ 휆1 Λ1 휆3(ℎ) = 휆1(ℎ). But, by the definition of Σ, this means there exists a learning rule such that 휆3 Λ3 휎1 Σ1 ℎ ℋ휆3(ℎ) = 휎1(ℎ)(ℎ). This is exactly the definition of Σ-uniform generation, and by Proposition C.4, we conclude Λ1 |= Λ3. Next, we show that a singleton basis only generates itself. Proposition C.6. Any singleton basis, ΛB = {휆}, only uniformly generates itself. Proof of Proposition C.6. Note that generates requires switching between base agents. With only a single agent, there cannot be any switching, and thus, the only agent that can be described as switching amongst the elements of the singleton set ΛB = {휆} is the set itself. C.1.1 Rank and Minimal Bases As discussed in the paper, one natural reaction to the concept of an agent basis is to ask how we can justify different choices of a basis. And, if we cannot, then perhaps the concept of an agent basis is disruptive, rather than illuminating. In the main text, we suggest that in many situations, the choice of basis is made by the constraints imposed by the problem, such as the available memory. However, there are some objective properties of different bases that can help us to evaluate possible choices of a suitable basis. 
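One such objective property, expressive power over a finite horizon, can even be checked mechanically. The sketch below (Python) verifies that a two-agent basis of constant agents generates an alternating agent by switching, while a singleton basis generates only itself, in line with Proposition C.6. The function and agent names are illustrative, the horizon truncation is a simplifying assumption, and the check implements the history-by-history matching condition behind Proposition C.4 rather than any construction given in the paper.

```python
from itertools import product

ACTIONS = ("a0", "a1")
OBS = ("o0",)

def histories(horizon):
    """All histories (tuples of (action, observation) pairs) up to `horizon`."""
    out = [()]
    for t in range(1, horizon + 1):
        out += [tuple(p) for p in product(product(ACTIONS, OBS), repeat=t)]
    return out

# Two constant agents and an agent that alternates with the parity of the history length.
lam0 = lambda h: {"a0": 1.0}
lam1 = lambda h: {"a1": 1.0}
lam_alt = lambda h: {"a0": 1.0} if len(h) % 2 == 0 else {"a1": 1.0}

def generates_up_to(basis, agent, horizon):
    """Finite-horizon surrogate for 'basis generates {agent}': at every history some
    base agent must match the agent's action distribution, so that a learning rule
    can switch to it there."""
    return all(any(b(h) == agent(h) for b in basis) for h in histories(horizon))

if __name__ == "__main__":
    assert generates_up_to([lam0, lam1], lam_alt, horizon=4)   # switching between the
    assert generates_up_to([lam0, lam1], lam0, horizon=4)      # constants is enough
    assert not generates_up_to([lam0], lam_alt, horizon=4)     # a singleton basis only
    assert generates_up_to([lam0], lam0, horizon=4)            # generates itself
    print("the two-agent basis is strictly more expressive than either singleton")
```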
For instance, some bases are minimal in the sense that they cannot be made smaller while still retaining the same expressive power (that is, while generating the same agent sets). Identifying such minimal sets may be useful, as it is likely that there is good reason to consider only the most compressed agent bases. To make these intuitions concrete, we introduce the rank of an agent set. Definition C.3. The rank of an agent set, rank(Λ), is the size of the smallest agent basis that uniformly generates it: rank(Λ) = min ΛB Λ |ΛB| s.t. ΛB |= Λ. (C.13) For example, the agent set, Λ = {휆0 : ℎ 푎0, (C.14) 휆2 : ℎ 푎0 |ℎ| mod 2 = 0, 푎1 |ℎ| mod 2 = 1, has rank(Λ) = 2, since the basis, ΛB = {휆0 B : ℎ 푎0, 휆1 B : ℎ 푎1}, uniformly generates Λ, and there is no size-one basis that uniformly generates Λ by Proposition C.6. Using the notion of an agent set s rank, we now introduce the concept of a minimal basis. We suggest that minimal bases are particular important, as they contain no redundancy with respect to their expressive power. Concretely, we define a minimal basis in two slightly different ways depending on whether the basis has finite or infinite rank. In the finite case, we say a basis is minimal if there is no basis of lower rank that generates it. Definition C.4. An agent basis ΛB with finite rank is said to be minimal just when there is no smaller basis that generates it, Λ B Λ Λ B |= ΛB rank(Λ B) rank(ΛB). (C.15) In the infinite case, as all infinite rank bases will have the same effective size, we instead consider a notion of minimiality based on whether any elements can be removed from the basis without changing its expressive power. Definition C.5. An agent basis ΛB with infinite rank is said to be minimal just when no proper subset of ΛB uniformly generates ΛB. Λ B ΛB Λ B |= ΛB Λ B = ΛB. (C.16) Notably, this way of looking at minimal bases will also apply to finite rank agent bases as a direct consequence of the definition of a minimal finite rank basis. However, we still provide both definitions, as a finite rank basis may not contain a subset that generates it, but there may exist a lower rank basis that generates it. Corollary C.7. As a Corollary of Proposition C.2 and Definition C.4, for any minimal agent basis ΛB, there is no proper subset of ΛB that generates ΛB. Regardless of whether an agent basis has finite or infinite rank, we say the basis is a minimal basis of an agent set Λ just when the basis uniformly generates Λ and the basis is minimal. Definition C.6. For any Λ, a minimal basis of 횲is any basis ΛB that is both (1) minimal, and (2) ΛB |= Λ. A natural question arises as to whether the minimal basis of any agent set Λ is unique. We answer this question in the negative. Proposition C.8. The minimal basis of a set of agents is not necessarily unique. Proof of Proposition C.8. To prove the claim, we construct an instance of an agent set with two distinct minimal bases. Let 풜= {푎0, 푎1}, and 풪= {표0}. We consider the agent set containing four agents. The first two map every history to 푎0 and 푎1, respectively, while the second two alternate between 푎0 and 푎1 depending on whether the history is of odd or even length: Λ = {휆0 : ℎ 푎0, (C.17) 휆2 : ℎ 푎0 |ℎ| mod 2 = 0, 푎1 |ℎ| mod 2 = 1, 휆3 : ℎ 푎0 |ℎ| mod 2 = 1, 푎1 |ℎ| mod 2 = 0, Note that there are two distinct subsets that each universally generate Λ: Λ0,1 B = {휆0 B, 휆1 B}, Λ2,3 B = {휆2 B, 휆3 B}. 
(C.18) Next notice that there cannot be a singleton basis by Proposition C.6, and thus, both Λ0,1 B and Λ2,3 B satisfy (1) |Λ0,1 B | = |Λ2,3 B | = rank(Λ), and (2) both Λ0,1 B |= Λ, Λ2,3 B |= Λ. Beyond the lack of redundancy of a basis, we may also be interested in their expressive power. For instance, if we compare two minimal bases, Λ1 B and Λ2 B, how might we justify which to use? To address this question, we consider another desirable property of a basis: universality. Definition C.7. An agent basis ΛB is universal if ΛB |= Λ. Clearly, it might be desirable to work with a universal basis, as doing so ensures that the set of agents we consider in our design space is as rich as possible. We next show that there is at least one natural basis that is both minimal and universal. Proposition C.9. The basis, Λ B = {휆: 풪 Δ(풜)}, (C.19) is a minimal universal basis: 1. Λ B |= Λ: The basis uniformly generates the set of all agents. 2. Λ B is minimal. Proof of Proposition C.9. We prove each property separately. 1. Λ B |= Λ First, we show that the basis is universal: Λ B |= Λ. Recall that this amounts to showing that, 휆 Λ ℎ ℋ 휆 Λ B 휆(ℎ) = 휆 (ℎ). (C.20) Let 휆 Λ and ℎ ℋbe fixed but arbitrary. Now, let us label the action distribution produced by 휆(ℎ) as 푝휆(ℎ). Let 표refer to the last observation contained in ℎ(or if ℎ= ℎ0 = ). Now, construct the agent 휆 B : 표 푝휆(ℎ). By construction of Λ B, this agent is guaranteed to be a member of Λ B, and furthermore, we know that 휆 B produces the same output as 휆on ℎ. Since both 휆and ℎwere chosen arbitrarily, the construction will work for any choice of 휆and ℎ, and we conclude that at every history, there exists a basis agent Λ B Λ B that produces the same probability distribution over actions as any given agent. Thus, the first property holds. 2. Λ B is minimal. Second, we show that Λ B is a minimal basis of Λ. Recall that since rank(Λ B) = , the definition of a minimal basis means that: ΛB Λ B ΛB |= Λ ΛB = Λ B. (C.21) To do so, fix an arbitrary proper subset of ΛB 풫(Λ B). Notice that since ΛB is a proper subset, there exists a non-empty set ΛB such that, ΛB ΛB = Λ B. Now, we show that ΛB cannot uniformly generate Λ by constructing an agent from ΛB. In particular, consider the first element of ΛB, which, by construction of Λ B, is some mapping from 풪to a choice of probability distribution over 풜. Let us refer to this agent s output probability distribution over actions as 푝. Notice that there cannot exist an agent in Λ B that chooses 푝, otherwise ΛB would not be a proper subset of Λ B. Notice further that in the set of all agents, there are infinitely many agents that output 푝in at least one history. We conclude that ΛB cannot uniformly generate Λ, as it does not contain any base element that produces 푝. The set ΛB was chosen arbitrarily, and thus the claim holds for any proper subset of Λ B, and we conclude. This completes the proof of both statements. Corollary C.10. As a direct consequence of Proposition C.9, every universal basis has infinite rank. C.1.2 Orthogonal and Parallel Agent Sets Drawing inspiration from vector spaces, we introduce notions of orthogonal and parallel agent bases according to the agent sets they generate. Definition C.8. A pair of agent bases (Λ1 B, Λ2 B) are orthogonal if any pair (Λ1, Λ2) they each uniformly generate Λ1 B |= Λ1, Λ2 B |= Λ2, (C.22) satisfy Λ1 Λ2 = . 
(C.23) Naturally this definition can be modified to account for environment-relative generation ( 푒), or to be defined with respect to a particular set of learning rules in which case two bases are orthogonal with respect to the learning rule set just when they generate different agent sets under the given learning rules. As with the variants of the two operators, we believe the details of such formalisms are easy to produce. A few properties hold of any pair of orthogonal bases. Proposition C.11. If two bases Λ1 B, Λ2 B are orthogonal, then the following properties hold: 1. Λ1 B Λ2 B = . 2. Neither Λ1 B nor Λ2 B are universal. Proof of Proposition C.11. We prove each property independently. 1. Λ1 B Λ2 B = We proceed toward contradiction. That is, suppose that both Λ1 B is orthogonal to Λ2 B, and that Λ1 B Λ2 B . Then, by the latter property, there is at least one agent that is an element of both bases. Call this agent 휆 B. It follows that the set Λ B = {휆 B} is a subset of both Λ1 B and Λ2 B. By Proposition C.2, it follows that Λ1 B |= Λ B and Λ2 B |= Λ B. But this contradicts the fact that Λ1 B is orthogonal to Λ2 B, and so we conclude. 2. Neither Λ1 B nor Λ2 B are universal. We again proceed toward contradiction. Suppose without loss of generality that Λ1 B is universal. Then, we know Λ1 B |= Λ. Now, we consider two cases: either Λ2 B generates some non-empty set, Λ2, or it does not generate any sets. If it generates a set Λ2, then we arrive at a contradiction as Λ2 Λ , which violates the definition of orthogonal bases. If if does not generate a set, this violates the definition of a basis, as any basis is by construction non-empty, and we know that containing even a single element is sufficient to generate at least one agent set by Proposition C.6. Therefore, in either of the two cases, we arrive at a contradiction, and thus conclude the argument. This concludes the proof of each statement. Corollary C.12. For any non-universal agent basis ΛB, there exists an orthogonal agent basis, Λ B. Conversely, two agent bases Λ1 B, Λ2 B are parallel just when they generate the same agent sets. Definition C.9. A pair of agent bases (Λ1 B, Λ2 B) are parallel if for every agent set Λ, Λ1 B |= Λ if and only if Λ2 B |= Λ. Proposition C.13. If two bases Λ1 B, Λ2 B are parallel, then the following properties hold: 1. Both Λ1 B |= Λ2 B and Λ2 B |= Λ1 B. 2. rank(Λ1 B) = rank(Λ2 B). 3. Λ1 B is universal if and only if Λ2 B is universal. Proof of Proposition C.13. We prove each property separately. 1. Both Λ1 B |= Λ2 B and Λ2 B |= Λ1 B. The claim follows directly from the definition of parallel bases. An agent set Λ is uniformly generated by Λ1 B if and only if it is uniformly generated by Λ2 B. Since by Proposition C.2 we know both Λ1 B |= Λ1 B and Λ2 B |= Λ2 B, we conclude that both Λ1 B |= Λ2 B and Λ2 B |= Λ1 B. 2. rank(Λ1 B) = rank(Λ2 B). Recall that the definition of rank refers to the size of the smallest basis that uniformly generates it, rank(Λ) = min ΛB Λ |ΛB| s.t. ΛB |= Λ. Now, note that by property (1.) of the proposition, both sets uniformly generate each other. Therefore, we know that rank(Λ1 B) min |Λ1 B|, |Λ2 B| , rank(Λ2 B) min |Λ1 B|, |Λ2 B| , since the smallest set that generates each basis is no larger than the basis itself, or the other basis. 3. Λ1 B is universal if and only if Λ2 B is universal. 
The claim again follows by combining the definitions of universality and parallel bases. If Λ1B is universal, then by definition of parallel bases, Λ2B must uniformly generate all the same agent sets, including the set of all agents Λ̄, and therefore Λ2B is universal, too. Conversely, if Λ1B is not universal, then it does not uniformly generate Λ̄, and by the definition of parallel bases we conclude that Λ2B does not uniformly generate Λ̄ either. Both directions hold for either labeling of the two bases, without loss of generality, and we conclude.
This completes the argument for each property, and we conclude.
C.2 Analysis: Reaches
We now establish further properties of the reaches operator. Several of these results are based on a third modality of the reaches operator, always reaches, in which an agent eventually reaches an agent basis along every history realizable in a given environment. We define this precisely as follows.
Definition C.10. We say agent 휆 ∈ Λ̄ always reaches ΛB in 푒, denoted 휆 ⇥푒 ΛB, if and only if
∀ℎ ∈ ℋ휆,푒 ∃푡 ∈ N0 ∀ℎ′ ∈ ℋ푡:ℎ ∃휆B ∈ ΛB ∀ℎ′′ ∈ ℋ휆,푒ℎℎ′ : 휆(ℎℎ′ℎ′′) = 휆B(ℎℎ′ℎ′′). (C.24)
The nesting of the quantifiers allows the agent to become equivalent to different base behaviors depending on the evolution of the interaction stream. For example, in an environment that flips a coin to determine whether 푎heads or 푎tails is optimal, the agent 휆 might output 푎heads indefinitely if the coin lands heads, but 푎tails otherwise. In this case, such an agent will still always reach the basis ΛB = {휆1B : ℎ ↦ 푎heads, 휆2B : ℎ ↦ 푎tails}. Notice that we here make use of the notation ℋ푡:ℎ, which refers to all realizable history suffixes of ℎ of length 푡 or greater, defined precisely as
ℋ푡:ℎ = {ℎ′ ∈ ℋ휆,푒ℎ : |ℎ′| ≥ 푡}. (C.25)
We first show that always reaches implies sometimes reaches.
Proposition C.14. If 휆 ⇥푒 ΛB, then 휆 ⤳푒 ΛB.
Proof of Proposition C.14. Assume 휆 ⇥푒 ΛB. That is, expanding the definition of always reaches,
∀ℎ ∈ ℋ휆,푒 ∃푡 ∈ N0 ∀ℎ′ ∈ ℋ푡:ℎ ∃휆B ∈ ΛB ∀ℎ′′ ∈ ℋ휆,푒ℎℎ′ : 휆(ℎℎ′ℎ′′) = 휆B(ℎℎ′ℎ′′). (C.26)
Further recall that the definition of sometimes reaches, 휆 ⤳푒 ΛB, is as follows:
∃ℎ ∈ ℋ휆,푒 ∃휆B ∈ ΛB ∀ℎ′ ∈ ℋ휆,푒ℎ : 휆(ℎℎ′) = 휆B(ℎℎ′). (C.27)
The claim then follows quite naturally: pick any realizable history ℎ ∈ ℋ휆,푒. By Equation C.26, there is a time 푡 and a realizable suffix ℎ′ ∈ ℋ푡:ℎ for which some 휆B ∈ ΛB satisfies 휆(ℎℎ′ℎ′′) = 휆B(ℎℎ′ℎ′′) for every ℎ′′ ∈ ℋ휆,푒ℎℎ′. Since ℎ′ ∈ ℋ푡:ℎ, the concatenation ℎℎ′ is itself a realizable history, and it witnesses the existential in Equation C.27 with the same choice of 휆B. But this is exactly the definition of sometimes reaches, and therefore we conclude the argument.
Next, we highlight the fact that every agent in an agent set also reaches that set.
Proposition C.15. For any agent set Λ, it holds that 휆 ⤳푒 Λ for every 휆 ∈ Λ.
Proof of Proposition C.15. The proposition is straightforward, as any 휆 ∈ Λ is equivalent to itself in behavior on all histories.
Corollary C.16. As a corollary of Proposition C.15, any pair of agent sets (Λsmall, Λbig) with Λsmall ⊆ Λbig satisfies 휆 ⤳푒 Λbig for every 휆 ∈ Λsmall. (C.28)
We further show that, unlike sometimes and never reaches, always reaches is transitive.
Proposition C.17. Always reaches is transitive.
Proof of Proposition C.17. We assume that both every 휆1 ∈ Λ1 satisfies 휆1 ⇥푒 Λ2 and every 휆2 ∈ Λ2 satisfies 휆2 ⇥푒 Λ3, and show that it must follow that every 휆1 ∈ Λ1 satisfies 휆1 ⇥푒 Λ3. To do so, pick a fixed but arbitrary 휆1 ∈ Λ1, and expand 휆1 ⇥푒 Λ2:
∀ℎ ∈ ℋ휆1,푒 ∃푡 ∈ N0 ∀ℎ′ ∈ ℋ푡:ℎ ∃휆2 ∈ Λ2 ∀ℎ′′ ∈ ℋ휆1,푒ℎℎ′ : 휆1(ℎℎ′ℎ′′) = 휆2(ℎℎ′ℎ′′).
Now, consider any realizable history of the form ℎℎ′ℎ′′ as above, and let 휆2 be the corresponding agent that produces the same action distribution as 휆1 from ℎℎ′ onward. Since 휆2 ⇥푒 Λ3, there must exist some time 푡̂ such that every realizable suffix of ℎℎ′ of length at least 푡̂ is followed, on all further realizable suffixes, by some 휆3 ∈ Λ3 producing the same action distribution as 휆2.
But then, taking the later of the two times 푡 and 푡̂, there exists a time 푡′ that ensures every 휆2 ∈ Λ2 to which 휆1 has become equivalent has a corresponding 휆3 ∈ Λ3 with the same action distribution at all subsequent realizable histories:
∀ℎ ∈ ℋ휆1,푒 ∃푡′ ∈ N0 ∀ℎ′ ∈ ℋ푡′:ℎ ∃휆2 ∈ Λ2 ∀ℎ′′ ∈ ℋ휆1,푒ℎℎ′ : 휆1(ℎℎ′ℎ′′) = 휆2(ℎℎ′ℎ′′) = 휆3(ℎℎ′ℎ′′) for some 휆3 ∈ Λ3.
Thus, rewriting,
∀ℎ ∈ ℋ휆1,푒 ∃푡′ ∈ N0 ∀ℎ′ ∈ ℋ푡′:ℎ ∃휆3 ∈ Λ3 ∀ℎ′′ ∈ ℋ휆1,푒ℎℎ′ : 휆1(ℎℎ′ℎ′′) = 휆3(ℎℎ′ℎ′′).
But this is precisely the definition of always reaches, and thus we conclude.
Next, we show two basic properties of the set of all agents, Λ̄: it uniformly generates all agent sets, and it is always reached by all agents.
Proposition C.18. For any 푒 ∈ ℰ, the set of all agents Λ̄ (i) uniformly generates all other agent sets, and (ii) is always reached by all agents:
(i) ∀Λ ⊆ Λ̄ : Λ̄ ⊨ Λ, (ii) ∀휆 ∈ Λ̄ : 휆 ⇥푒 Λ̄.
Proof of Proposition C.18.
(i) ∀Λ ⊆ Λ̄ : Λ̄ ⊨ Λ. The property holds as a straightforward consequence of Proposition C.2: since any agent set Λ is a subset of Λ̄, it follows that Λ̄ ⊨ Λ.
(ii) ∀휆 ∈ Λ̄ : 휆 ⇥푒 Λ̄. The property holds as a straightforward consequence of the argument for Proposition C.15: every agent 휆 is a member of Λ̄ and is equivalent to itself in behavior on all histories, and so it always reaches Λ̄.
This concludes the argument of both statements.
C.3 Figure: Set Relations in CRL
Finally, in Figure 3 we present a visual depiction of the set relations in CRL between an agent basis ΛB, an agent set Λ it generates, and the three agent sets corresponding to those agents in Λ that (i) sometimes, (ii) never, or (iii) always reach the basis. First, we highlight that we visualize ΛB as a subset of Λ since we require ΛB ⊆ Λ in CRL (Definition 4.2). However, there can exist triples (ΛB, Λ, 푒) such that ΛB ⇝푒 Λ but ΛB is not a subset of Λ. Such cases are slightly peculiar, since they mean that the basis contains agents that cannot be expressed by the agent set Λ. Such cases are not in line with our definition of CRL, so we instead opt to visualize ΛB as a subset of Λ. Next, notice that the basis is a subset of both the set of agents that always reach the basis and the set of agents that sometimes reach the basis; this follows directly from the combination of Proposition C.14 and point (3.) of Theorem 4.3. By similar reasoning from Proposition C.14, we know that the set of agents that always reach ΛB is a subset of the set of agents that sometimes reach the basis. Further, since sometimes and never reaches are negations of one another (Remark 3.2), observe that the two sets are disjoint and together comprise the entirety of Λ. Lastly, we know that the set of optimal agents, Λ∗, contains only agents that never reach the basis, and thus Λ∗ is disjoint from ΛB and from the set of agents that sometimes reach ΛB.
Figure 3: A depiction of the division of a set of agents Λ relative to a basis ΛB through the reaches operator in CRL.
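The partition depicted in Figure 3 can also be explored computationally in a truncated setting. The sketch below (Python) classifies three hand-written agents against a two-agent basis as sometimes, never, or always reaching it, mirroring the three regions of the figure; never reaches is simply the negation of the sometimes check. All names, the single-observation deterministic environment, and the finite-horizon surrogates for the reaches modalities are illustrative assumptions for this sketch, not constructions from the paper.

```python
OBS = "o"                    # a single deterministic observation
ACTIONS = ("a0", "a1")
HORIZON = 6                  # enumerate realizable histories up to this length
CUTOFF = 1                   # time after which 'always reaches' agents should have settled

def realizable(agent, horizon):
    """Histories (tuples of (action, obs) pairs) with positive probability under
    `agent` in the deterministic single-observation environment."""
    frontier, out = [()], [()]
    for _ in range(horizon):
        nxt = []
        for h in frontier:
            for a, p in agent(h).items():
                if p > 0.0:
                    nxt.append(h + ((a, OBS),))
        out += nxt
        frontier = nxt
    return out

def agrees_from(agent, base, root, hists):
    """True if `agent` and `base` give the same distribution at `root` and at every
    enumerated realizable history extending it."""
    return all(agent(h) == base(h) for h in hists if h[: len(root)] == root)

def sometimes_reaches(agent, basis, horizon):
    hists = realizable(agent, horizon)
    roots = [h for h in hists if len(h) <= horizon - 2]   # leave room for continuations
    return any(any(agrees_from(agent, b, h, hists) for b in basis) for h in roots)

def always_reaches(agent, basis, cutoff, horizon):
    hists = realizable(agent, horizon)
    roots = [h for h in hists if len(h) == cutoff]
    return all(any(agrees_from(agent, b, h, hists) for b in basis) for h in roots)

# A two-agent basis of constant agents, and three agents to classify against it.
base0 = lambda h: {"a0": 1.0}
base1 = lambda h: {"a1": 1.0}
BASIS = (base0, base1)

def committing(h):
    """Random first action, then commit to it forever: always (hence sometimes) reaches."""
    return {"a0": 0.5, "a1": 0.5} if not h else {h[0][0]: 1.0}

def alternating(h):
    """Alternates a0, a1, a0, ...: never reaches the constant basis."""
    return {ACTIONS[len(h) % 2]: 1.0}

def one_branch(h):
    """Commits only when the first action was a0, alternates otherwise:
    sometimes, but not always, reaches the basis."""
    if not h:
        return {"a0": 0.5, "a1": 0.5}
    return {"a0": 1.0} if h[0][0] == "a0" else {ACTIONS[len(h) % 2]: 1.0}

if __name__ == "__main__":
    for agent in (committing, alternating, one_branch):
        print(agent.__name__,
              "sometimes:", sometimes_reaches(agent, BASIS, HORIZON),
              "always:", always_reaches(agent, BASIS, CUTOFF, HORIZON))
```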