# Fully General Online Imitation Learning

Journal of Machine Learning Research 23 (2022) 1-30. Submitted 6/21; Published 9/22.

Michael K. Cohen (michael.cohen@eng.ox.ac.uk), Department of Engineering Science, University of Oxford; Future of Humanity Institute, Oxford, UK OX1 3PJ

Marcus Hutter (marcus.hutter@anu.edu.au), DeepMind; Department of Computer Science, Australian National University, Acton, ACT, Australia 2601

Neel Nanda (neelnanda27@gmail.com), Independent

Editor: Joelle Pineau

©2022 Michael K. Cohen, Marcus Hutter, and Neel Nanda. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0618.html.

Abstract

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that restart, existing work provides formal guidance on how to imitate so that events unfold similarly, but outside that setting, no formal guidance exists. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes, and we allow our imitator to learn online from the demonstrator. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency. If any such event qualifies as "dangerous", our imitator would have the notable distinction of being relatively "safe".

Keywords: Bayesian Sequence Prediction, Imitation Learning, Active Learning, General Environments

## 1. Introduction

Supervised learning of independent and identically distributed data is often practiced in two phases: training and deployment. This separation makes less sense if the learner's predictions affect the distribution of future contexts for prediction, since the deployment phase could lose all resemblance to the training phase. When a program's output changes its future percepts, we often call its output "actions". Supervised learning in that regime is commonly called "imitation learning", where labels are the actions of a demonstrator (Syed and Schapire, 2010). Our agent, acting in a general environment that responds to its actions, tries to pick actions according to the same distribution as a demonstrator. Even in imitation learning, where it is understood that actions can change the distribution of contexts that the agent will face, it is common to separate a training phase from a deployment phase. This assumes away the possibility that the distribution of contexts will shift significantly upon deployment and render the training data increasingly irrelevant. Here, we present an online imitation learner that is robust to this possibility. The obvious downside is that the training never ends. The agent can always make queries for more data, but importantly, it does this with diminishing probability. It transitions smoothly from a mostly-training phase to a mostly-deployed phase. Our agent also handles totally general stochastic environments (environments serve new contexts for the agent to act in) and totally general stochastic demonstrator policies. No finite-state Markov-style stationarity assumption is required for either.
The lack of assumptions about the environment is a mundane point, because imitation learners don't have to learn the dynamics of the environment, but the lack of assumptions on the prediction target (the demonstrator's policy) makes these results highly non-trivial. The only assumption is that the demonstrator's policy belongs to some known countable class of possibilities. Moreover, stochasticity makes single-elimination-style learning (Gold, 1967) impossible. For demonstrator policies this general, we present formal results that are unthinkable in the train-then-deploy paradigm. The ℓ1 distance between the imitator and demonstrator policies converges to 0 in mean cube, when conditioned on a high-probability event (Theorem 3). And Theorem 2 shows that the event has high probability. Conditioned on the same high-probability event, we bound the KL divergence from imitator to demonstrator (Theorem 4), and we upper bound the probability of an arbitrary event under the imitator's policy, given a low probability of occurrence under the demonstrator's policy (Theorem 5). Instead of having a finite training phase, our agent's query probability converges to 0 in mean cube (Theorem 1). Without Theorems 1 and 2, the remaining theorems would be uninteresting; they would be easily fulfilled by an imitator that always queried the demonstrator, or they would apply only rarely. Our imitator maintains a posterior over demonstrator models. At each timestep, it takes the top few demonstrator models in the posterior, in a way that depends on a scalar parameter α. Then, for each action, it considers the minimum over those models of the probability that the demonstrator picks that action. The imitator samples an action according to those probabilities, and if no action is sampled (since model disagreement makes the probabilities sum to less than 1), it defers to the demonstrator.
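The action-selection rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact construction: demonstrator models are represented as functions from (action, history) to probability, and the rule of keeping models with posterior weight at least α is a simplified reading of the α-dependent "top few models" set. The names `top_models`, `imitator_distribution`, and `act_or_query` are ours.

```python
def top_models(posterior, alpha):
    # Illustrative "top few models" rule: keep the models whose posterior
    # weight is at least alpha, so at most 1/alpha models survive.
    return [m for m in posterior if posterior[m] >= alpha]

def imitator_distribution(posterior, models, history, actions, alpha):
    # For each action, take the minimum probability assigned by any top
    # model; the shortfall from 1 is the probability of querying.
    top = top_models(posterior, alpha)
    probs = {a: min(models[m](a, history) for m in top) for a in actions}
    return probs, 1.0 - sum(probs.values())

def act_or_query(posterior, models, history, actions, alpha, rng):
    # Sample an action from the conservative (sub-)distribution; if no
    # action is sampled, defer to the demonstrator.
    probs, _ = imitator_distribution(posterior, models, history, actions, alpha)
    r, cum = rng.random(), 0.0
    for a in actions:
        cum += probs[a]
        if r < cum:
            return a        # imitator acts
    return None             # query the demonstrator
```

With two models that disagree, the pointwise minimum deliberately underestimates each action's probability, and the leftover mass becomes the query probability.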
We review theoretical developments in imitation learning in Section 2, define our formal setting in Section 3, define our imitation learner in Section 4, and illustrate it with a toy example in Section 5. We state key formal results in Section 6, and we outline our proof technique and introduce necessary notation in Section 7. Section 8 presents lemmas and intermediate results, and Section 9 presents proofs and proof ideas of our key results, but most of the proofs appear in Appendix B. Appendix A collects notation and definitions.

## 2. Related Work

Recall that a key difficulty of imitation learning over supervised learning is the removal of a standard i.i.d. assumption. However, all existing formal work in imitation learning studies repeated finite episodes of length T; even though the dynamics are not i.i.d. from timestep to timestep within an episode, the agent learns from a sequence of episodes that are, as a whole, independent and identically distributed. Thus, the scope of existing formal work is limited to environments that "restart". A driving agent that gets housed in a new car every time it crashes (or gets hopelessly lost) enjoys a restarting environment, whereas a driving agent with only one car to burn does not. If we can accurately simulate a non-restarting environment, then training the imitator in simulation (using existing formal methods) could indeed prepare it to act in a non-restarting one. The viability of this approach depends on the environment; many environments simply cannot be simulated with enough accuracy. For example, consider imitating a sales rep at a software company, interfacing with potential clients over email. For a real potential client, a relationship cannot be rebooted, and no simulation could anticipate the many diverse needs of clients. In the context of restarting environments, Syed and Schapire (2010) reduce the problem of predicting a demonstrator's behavior to i.i.d. classification.
The only assumption about the demonstrator is that the value of its policy as a function of state is arbitrarily well approximated by the value of a deterministic policy, which is only slightly weaker than assuming the demonstrator is deterministic itself. They make no assumptions about the environment, other than that we can access identical copies of it repeatedly. They show that if a classifier guessing the demonstrator's actions has an error rate of ε, then the value of the imitator's policy that uses the classifier is within O(√ε) of the demonstrator's. Judah et al. (2014) improve the label complexity of Syed and Schapire's (2010) reduction by actively deciding when to query the demonstrator, instead of simply observing N full episodes before acting. Making the same assumptions as that paper, and also assuming a realizable hypothesis class with a finite VC dimension, they attempt to reduce the number of queries before the agent can act for a whole episode on its own with an error rate less than ε. Letting T be the length of an episode, compared to Syed and Schapire's (2010) O(T^3/ε) labels, they achieve O(T log(T^3/ε)). Ross and Bagnell (2010) also reduce the problem to classification. In a trivial reduction, the imitator observes the demonstrator act from the distribution of states induced by the demonstrator policy. In this reduction, if the classifier has an error rate of ε per action on the demonstrator's state distribution, the error rate of the imitator on its own distribution is at most T^2 ε, where T is again the length of the episode. Their main contribution is a cleverer training regime for the classifier that reduces this bound to Tε in environments with approximate recoverability. Ross et al. (2011) reduce imitation learning to something else: a no-regret online learner, for which the average error rate over its lifetime approaches 0, even with a potentially changing loss function.
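To make the gap between these two label complexities concrete, here is a back-of-the-envelope comparison of the scalings above; constants are dropped and the example values of T and ε are purely illustrative.

```python
import math

def passive_labels(T, eps):
    # Syed and Schapire (2010): O(T^3 / eps) labels, constants dropped.
    return T**3 / eps

def active_labels(T, eps):
    # Judah et al. (2014): O(T * log(T^3 / eps)) labels, constants dropped.
    return T * math.log(T**3 / eps)

# For an episode length T = 100 and target error eps = 0.01, the active
# scheme's scaling is smaller by several orders of magnitude.
```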
With access to an online learner with average regret O(1/N), where N is the number of predictions, they construct an imitation learner with regret of the same order. Unlike Syed and Schapire (2010) and Judah et al. (2014), they make no assumption that the demonstrator is arbitrarily well-approximated by a deterministic policy. Unlike Judah et al. (2014), they do not assume a realizable hypothesis class with a finite VC dimension. And unlike Ross and Bagnell (2010) (for their main contribution), they do not assume approximate recoverability. They do still assume that we can repeatedly access identical copies of the environment, and the loss function used for their measurement of regret must be bounded. To achieve a regret of order O(1/N) with probability at least 1 − δ, they require O(T^2 log(1/δ)) observations of the demonstrator. There is a great deal of empirical study of imitation learning, given the practical applications, which Hussein et al. (2017) review. We call a few specific experiments to the reader's attention, since they resemble our work in taking an active approach to querying, with an eye to risk aversion, not just label efficiency; they find it works. First, Brown et al. (2018, 2020) consider a context where the imitator can, at any time, ask the demonstrator how it would act in any of finitely many states. These imitators focus on the states to which they assign higher value at risk. Those papers and the following all show strong label efficiency alongside limited loss. Zhang and Cho (2017) assume some method of predicting the error of an imitator in the process of learning, and they query for help when it is above some threshold. Otherwise, their imitator follows Ross et al.'s (2011) construction. In their paper, the function that predicts the imitator's error is learned from hand-picked features of a dataset. Menda et al.
(2019) query much more extensively, but like Zhang and Cho (2017), they don't always act on the demonstrator's suggestion, in order to sample a more diverse set of states. Unlike Zhang and Cho (2017), they do act on it when the imitator's action deviates enough from the demonstrator's (given some hand-designed distance metric over the action space). They also defer to the demonstrator when there is sufficient disagreement among an ensemble of imitators. They find their imitator is more robust. Hoque et al. (2021) note that in many contexts, it is more convenient for the demonstrator to be queried a few times successively, rather than spread out over a long time. They modify Zhang and Cho's (2017) approach: the imitator starts querying when the estimated error exceeds the same threshold, but it continues querying until the error returns below a lower threshold. At the cost of more total queries, it requires fewer query-periods. Like the formal work, all these experiments regard environments that restart. Adjacent to pure imitation learning (trying to pick the same actions as a demonstrator would), there is also work on trying to act in pursuit of the same goals as a demonstrator (which must be inferred), or on matching only some outcomes of the demonstrator policy, like the expectation of some given set of features. For a review of some work in this area, see Adams et al. (2022).

## 3. Preliminaries

Let a_t ∈ A and o_t ∈ O be the action and observation at timestep t ∈ ℕ. Let q_t ∈ {0, 1} denote whether the imitator (q_t = 0) or demonstrator (q_t = 1) selects a_t. Let H = {0, 1} × A × O, let h_t = (q_t, a_t, o_t) ∈ H, and let h_{<t} denote the interaction history before timestep t, with h_{<0} the empty history. Let Π be a countable class of policies containing the demonstrator's true policy π^d, and let w(π) > 0 be a prior weight assigned to each π ∈ Π, such that Σ_{π∈Π} w(π) = 1. This represents the imitator's initial belief distribution over the demonstrator's policy. For convenience, let Π only contain policies which assign zero probability to q_t = 0, since demonstrator models may as well be convinced that the demonstrator is picking the action.
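A minimal sketch of the objects just defined, with histories as (q, a, o) tuples and demonstrator models as functions from (action, history) to probability. The helper name `bayes_posterior` is ours, and updating only on queried steps is one natural reading of the setting: only the demonstrator's own actions carry direct evidence about its policy.

```python
def bayes_posterior(prior, models, history):
    # Multiply each model's prior weight by the likelihood it assigns to
    # the actions the demonstrator actually chose (steps with q == 1),
    # then normalize. The imitator's own steps (q == 0) are skipped.
    post = dict(prior)
    for t, (q, a, o) in enumerate(history):
        if q == 1:
            for name, pi in models.items():
                post[name] *= pi(a, history[:t])
    z = sum(post.values())
    return {name: weight / z for name, weight in post.items()}
```

After one queried step on which the demonstrator chose some action, a model that assigned that action probability 0.8 gains posterior weight over one that assigned it 0.2.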
Example 1 ((Linear-Time) Computable Policies) The requirement that Π be countable is not restrictive in theory. Suppose Π is the set of programs that compute a policy (in linear time). These can be easily enumerated, and the prior w can be set to 2^{-(program length)} (Kraft, 1949; Hutter, 2005).

Given the near absence of constraints, the choice of model class might pique philosophical interest. There are multiple logics with differing powers that we could plausibly use to represent programs, including programs higher in the arithmetic hierarchy. In general, the choice of programming language would change programs' relative lengths, and there are no clear desiderata when choosing a language. So Example 1 does not appear to offer an approach to solving the Problem of Priors (Talbott, 2016). The option to restrict to linear-time programs is a marginally more practical possibility that might escape most philosophical discussions.

## 4. Imitation

Let w(π | h_{<t}) be the posterior weight of the model π given the history h_{<t}, obtained by Bayesian updating of the prior w on the demonstrator's observed actions. Given a parameter α ∈ (0, 1), let Π^α_{h_{<t}} := {π ∈ Π : w(π | h_{<t}) ≥ α} be the set of top models in the posterior; it contains at most 1/α models. The imitator assigns to each action a the probability min_{π ∈ Π^α_{h_{<t}}} π(a | h_{<t}), and with the remaining probability, denoted θ_q(h_{<t}), it queries the demonstrator and lets the demonstrator act.

The difference from standard Bayesian sequence prediction is that there, all observations are informative about the true measure, whereas the imitator rarely sees the demonstrator act. To prove Theorem 6, we show that a posterior-weighted mixture over the class converges; if the mixture converges, then each constituent with non-negligible posterior weight must as well. This posterior-weighted mixture is called ρ^stat_n. We define it here alongside other estimators that will be used in the proof of ρ^stat_n's convergence. Restricting the mixture to models with posterior weight at least α ensures the weights in the weighted average aren't too small, and that we only need to consider the top 1/α models.

## 9. Key Proofs

We now prove our bound on the query probability, we define fairness, and we prove our bound on the probabilities of bad events.

Theorem 1 (Limited Querying) The imitator's query probability θ_q(h_{<t}) converges to 0 in mean cube.
First we show that z_t := w(π^d | h_{<t}), the posterior weight of the true demonstrator policy, is the prior weight w(π^d) multiplied by the likelihoods π^d(a_k | h_{<k}) of the observed demonstrator actions and normalized by the corresponding mixture likelihood. The probability that z_t ever falls below α is then at most α w(π^d)^{-1} (50), which implies P_{π^i_α μ}(∀t : π^d ∈ Π^α_{h_{<t}}) ≥ 1 − α w(π^d)^{-1}.
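The mechanism behind Theorem 1 can be watched in a toy simulation; this is an illustration under our simplified α-threshold reading of Π^α, with a made-up three-model class and arbitrary constants, not an artifact of the paper. As posterior weight concentrates on the true model, models drop out of Π^α, the per-action minima rise, and the query probability falls.

```python
import random

rng = random.Random(1)

# Toy class of three memoryless demonstrator models over actions {0, 1};
# the true demonstrator is models[0]. All values here are illustrative.
models = [
    lambda a: [0.8, 0.2][a],
    lambda a: [0.5, 0.5][a],
    lambda a: [0.2, 0.8][a],
]
posterior = [1.0 / 3] * 3
alpha = 0.05

def query_prob(posterior, alpha):
    # 1 minus the summed per-action minima over the models whose
    # posterior weight is at least alpha.
    top = [i for i, w in enumerate(posterior) if w >= alpha]
    return 1.0 - sum(min(models[i](a) for i in top) for a in (0, 1))

initial_q = query_prob(posterior, alpha)
for _ in range(200):
    if rng.random() < query_prob(posterior, alpha):
        # Query: the demonstrator acts, and the posterior updates on it.
        a = 0 if rng.random() < models[0](0) else 1
        posterior = [w * models[i](a) for i, w in enumerate(posterior)]
        z = sum(posterior)
        posterior = [w / z for w in posterior]
final_q = query_prob(posterior, alpha)
```

In a typical run the posterior concentrates on the true model and the query probability drops from 0.6 toward 0; since shrinking Π^α can only raise the per-action minima, the query probability never exceeds its initial value here.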