# Fully General Online Imitation Learning

Journal of Machine Learning Research 23 (2022) 1-30. Submitted 6/21; Published 9/22.

Michael K. Cohen (michael.cohen@eng.ox.ac.uk), Department of Engineering Science, University of Oxford; Future of Humanity Institute, Oxford, UK OX1 3PJ

Marcus Hutter (marcus.hutter@anu.edu.au), DeepMind; Department of Computer Science, Australian National University, Acton, ACT, Australia 2601

Neel Nanda (neelnanda27@gmail.com), Independent

Editor: Joelle Pineau

©2022 Michael K. Cohen, Marcus Hutter, and Neel Nanda. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0618.html.

Abstract

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that restart, existing work provides formal guidance on how to imitate so that events unfold similarly, but outside that setting, no formal guidance exists. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes, and we allow our imitator to learn online from the demonstrator. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency. If any such event qualifies as "dangerous", our imitator would have the notable distinction of being relatively "safe".

Keywords: Bayesian Sequence Prediction, Imitation Learning, Active Learning, General Environments

## 1. Introduction

Supervised learning of independent and identically distributed data is often practiced in two phases: training and deployment. This separation makes less sense if the learner's predictions affect the distribution of future contexts for prediction, since the deployment phase could lose all resemblance to the training phase. When a program's output changes its future percepts, we often call its output "actions". Supervised learning in that regime is commonly called "imitation learning", where labels are the actions of a demonstrator (Syed and Schapire, 2010). Our agent, acting in a general environment that responds to its actions, tries to pick actions according to the same distribution as a demonstrator. Even in imitation learning, where it is understood that actions can change the distribution of contexts that the agent will face, it is common to separate a training phase from a deployment phase. This assumes away the possibility that the distribution of contexts will shift significantly upon deployment and render the training data increasingly irrelevant. Here, we present an online imitation learner that is robust to this possibility. The obvious downside is that the training never ends. The agent can always make queries for more data, but importantly, it does this with diminishing probability. It transitions smoothly from a mostly-training phase to a mostly-deployed phase. Our agent also handles totally general stochastic environments (environments serve new contexts for the agent to act in) and totally general stochastic demonstrator policies. No finite-state Markov-style stationarity assumption is required for either.
The lack of assumptions about the environment is a mundane point, because imitation learners don't have to learn the dynamics of the environment, but the lack of assumptions on the prediction target (the demonstrator's policy) makes these results highly non-trivial. The only assumption is that the demonstrator's policy belongs to some known countable class of possibilities. Moreover, stochasticity makes single-elimination-style learning (Gold, 1967) impossible. For demonstrator policies this general, we present formal results that are unthinkable in the train-then-deploy paradigm. The ℓ1 distance between the imitator and demonstrator policies converges to 0 in mean cube, when conditioned on a high-probability event (Theorem 3). And Theorem 2 shows that the event has high probability. Conditioned on the same high-probability event, we bound the KL divergence from imitator to demonstrator (Theorem 4), and we upper bound the probability of an arbitrary event under the imitator's policy, given a low probability of occurrence under the demonstrator's policy (Theorem 5). Instead of having a finite training phase, our agent's query probability converges to 0 in mean cube (Theorem 1). Without Theorems 1 and 2, the remaining theorems would be uninteresting; they would be easily fulfilled by an imitator that always queried the demonstrator, or they would apply only rarely. Our imitator maintains a posterior over demonstrator models. At each timestep, it takes the top few demonstrator models in the posterior, in a way that depends on a scalar parameter α. Then, for each action, it considers the minimum over those models of the probability that the demonstrator picks that action. The imitator samples an action according to those probabilities, and if no action is sampled (since model disagreement makes the probabilities sum to less than 1), it defers to the demonstrator.
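The action-selection rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact construction: demonstrator models are represented as functions from (action, history) to probability, and the rule of keeping models with posterior weight at least α is a simplified reading of the α-dependent "top few models" set. The names `top_models`, `imitator_distribution`, and `act_or_query` are ours.

```python
def top_models(posterior, alpha):
    # Illustrative "top few models" rule: keep the models whose posterior
    # weight is at least alpha, so at most 1/alpha models survive.
    return [m for m in posterior if posterior[m] >= alpha]

def imitator_distribution(posterior, models, history, actions, alpha):
    # For each action, take the minimum probability assigned by any top
    # model; the shortfall from 1 is the probability of querying.
    top = top_models(posterior, alpha)
    probs = {a: min(models[m](a, history) for m in top) for a in actions}
    return probs, 1.0 - sum(probs.values())

def act_or_query(posterior, models, history, actions, alpha, rng):
    # Sample an action from the conservative (sub-)distribution; if no
    # action is sampled, defer to the demonstrator.
    probs, _ = imitator_distribution(posterior, models, history, actions, alpha)
    r, cum = rng.random(), 0.0
    for a in actions:
        cum += probs[a]
        if r < cum:
            return a        # imitator acts
    return None             # query the demonstrator
```

With two models that disagree, the pointwise minimum deliberately underestimates each action's probability, and the leftover mass becomes the query probability.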
We review theoretical developments in imitation learning in Section 2, define our formal setting in Section 3, define our imitation learner in Section 4, and illustrate it with a toy example in Section 5. We state key formal results in Section 6, and we outline our proof technique and introduce necessary notation in Section 7. Section 8 presents lemmas and intermediate results, and Section 9 presents proofs and proof ideas of our key results, but most of the proofs appear in Appendix B. Appendix A collects notation and definitions.

## 2. Related Work

Recall that a key difficulty of imitation learning over supervised learning is the removal of a standard i.i.d. assumption. However, all existing formal work in imitation learning studies repeated finite episodes of length T; even though the dynamics are not i.i.d. from timestep to timestep within an episode, the agent learns from a sequence of episodes that are, as a whole, independent and identically distributed. Thus, the scope of existing formal work is limited to environments that "restart". A driving agent that gets housed in a new car every time it crashes (or gets hopelessly lost) enjoys a restarting environment, whereas a driving agent with only one car to burn does not. If we can accurately simulate a non-restarting environment, then training the imitator in simulation (using existing formal methods) could indeed prepare it to act in a non-restarting one. The viability of this approach depends on the environment; many environments simply cannot be simulated with enough accuracy. For example, consider imitating a sales rep at a software company, interfacing with potential clients over email. For a real potential client, a relationship cannot be rebooted, and no simulation could anticipate the many diverse needs of clients. In the context of restarting environments, Syed and Schapire (2010) reduce the problem of predicting a demonstrator's behavior to i.i.d. classification.
The only assumption about the demonstrator is that the value of its policy as a function of state is arbitrarily well approximated by the value of a deterministic policy, which is only slightly weaker than assuming the demonstrator is deterministic itself. They make no assumptions about the environment, other than that we can access identical copies of it repeatedly. They show that if a classifier guessing the demonstrator's actions has an error rate of ε, then the value of the imitator's policy that uses the classifier is within O(√ε) of the demonstrator's. Judah et al. (2014) improve the label complexity of Syed and Schapire's (2010) reduction by actively deciding when to query the demonstrator, instead of simply observing N full episodes before acting. Making the same assumptions as that paper, and also assuming a realizable hypothesis class with a finite VC dimension, they attempt to reduce the number of queries before the agent can act for a whole episode on its own with an error rate less than ε. Letting T be the length of an episode, compared to Syed and Schapire's (2010) O(T^3/ε) labels, they achieve O(T log(T^3/ε)). Ross and Bagnell (2010) also reduce the problem to classification. In a trivial reduction, the imitator observes the demonstrator act from the distribution of states induced by the demonstrator policy. In this reduction, if the classifier has an error rate of ε per action on the demonstrator's state distribution, the error rate of the imitator on its own distribution is at most T^2 ε, where T is again the length of the episode. Their main contribution is a cleverer training regime for the classifier that reduces this bound to Tε in environments with approximate recoverability. Ross et al. (2011) reduce imitation learning to something else: a no-regret online learner, for which the average error rate over its lifetime approaches 0, even with a potentially changing loss function.
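To make the gap between these two label complexities concrete, here is a back-of-the-envelope comparison of the scalings above; constants are dropped and the example values of T and ε are purely illustrative.

```python
import math

def passive_labels(T, eps):
    # Syed and Schapire (2010): O(T^3 / eps) labels, constants dropped.
    return T**3 / eps

def active_labels(T, eps):
    # Judah et al. (2014): O(T * log(T^3 / eps)) labels, constants dropped.
    return T * math.log(T**3 / eps)

# For an episode length T = 100 and target error eps = 0.01, the active
# scheme's scaling is smaller by several orders of magnitude.
```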
With access to an online learner with average regret O(1/N), where N is the number of predictions, they construct an imitation learner with regret of the same order. Unlike Syed and Schapire (2010) and Judah et al. (2014), they make no assumption that the demonstrator is arbitrarily well-approximated by a deterministic policy. Unlike Judah et al. (2014), they do not assume a realizable hypothesis class with a finite VC dimension. And unlike Ross and Bagnell (2010) (for their main contribution), they do not assume approximate recoverability. They do still assume that we can repeatedly access identical copies of the environment, and the loss function used for their measurement of regret must be bounded. To achieve a regret of order O(1/N) with probability at least 1 − δ, they require O(T^2 log(1/δ)) observations of the demonstrator. There is a great deal of empirical study of imitation learning, given the practical applications, which Hussein et al. (2017) review. We call a few specific experiments to the reader's attention, since they resemble our work in taking an active approach to querying, with an eye to risk aversion, not just label efficiency; they find it works. First, Brown et al. (2018, 2020) consider a context where the imitator can, at any time, ask the demonstrator how it would act in any of finitely many states. These imitators focus on the states to which they assign higher value at risk. Those papers and the following all show strong label efficiency alongside limited loss. Zhang and Cho (2017) assume some method of predicting the error of an imitator in the process of learning, and they query for help when it is above some threshold. Otherwise, their imitator follows Ross et al.'s (2011) construction. In their paper, the function that predicts the imitator's error is learned from hand-picked features of a dataset. Menda et al.
(2019) query much more extensively, but like Zhang and Cho (2017), they don't always act on the demonstrator's suggestion, in order to sample a more diverse set of states. Unlike Zhang and Cho (2017), they do act on it when the imitator's action deviates enough from the demonstrator's (given some hand-designed distance metric over the action space). They also defer to the demonstrator when there is sufficient disagreement among an ensemble of imitators. They find their imitator is more robust. Hoque et al. (2021) note that in many contexts, it is more convenient for the demonstrator to be queried a few times successively, rather than spread out over a long time. They modify Zhang and Cho's (2017) approach: the imitator starts querying when the estimated error exceeds the same threshold, but it continues querying until the error returns below a lower threshold. At the cost of more total queries, it requires fewer query-periods. Like the formal work, all these experiments regard environments that restart. Adjacent to pure imitation learning (trying to pick the same actions as a demonstrator would), there is also work on trying to act in pursuit of the same goals as a demonstrator (which must be inferred), or on matching only some outcomes of the demonstrator policy, like the expectation of some given set of features. For a review of some work in this area, see Adams et al. (2022).

## 3. Preliminaries

Let a_t ∈ A and o_t ∈ O be the action and observation at timestep t ∈ ℕ. Let q_t ∈ {0, 1} denote whether the imitator (q_t = 0) or demonstrator (q_t = 1) selects a_t. Let H = {0, 1} × A × O, let h_t = (q_t, a_t, o_t) ∈ H, and let h_{<t} denote the interaction history before timestep t, with h_{<0} the empty history. Let Π be a countable class of policies containing the demonstrator's true policy π^d, and let w(π) > 0 be a prior weight assigned to each π ∈ Π, such that Σ_{π∈Π} w(π) = 1. This represents the imitator's initial belief distribution over the demonstrator's policy. For convenience, let Π only contain policies which assign zero probability to q_t = 0, since demonstrator models may as well be convinced that the demonstrator is picking the action.
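A minimal sketch of the objects just defined, with histories as (q, a, o) tuples and demonstrator models as functions from (action, history) to probability. The helper name `bayes_posterior` is ours, and updating only on queried steps is one natural reading of the setting: only the demonstrator's own actions carry direct evidence about its policy.

```python
def bayes_posterior(prior, models, history):
    # Multiply each model's prior weight by the likelihood it assigns to
    # the actions the demonstrator actually chose (steps with q == 1),
    # then normalize. The imitator's own steps (q == 0) are skipped.
    post = dict(prior)
    for t, (q, a, o) in enumerate(history):
        if q == 1:
            for name, pi in models.items():
                post[name] *= pi(a, history[:t])
    z = sum(post.values())
    return {name: weight / z for name, weight in post.items()}
```

After one queried step on which the demonstrator chose some action, a model that assigned that action probability 0.8 gains posterior weight over one that assigned it 0.2.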
Example 1 ((Linear-Time) Computable Policies) The requirement that Π be countable is not restrictive in theory. Suppose Π is the set of programs that compute a policy (in linear time). These can be easily enumerated, and the prior w can be set to 2^{-(program length)} (Kraft, 1949; Hutter, 2005).

Given the near absence of constraints, the choice of model class might pique philosophical interest. There are multiple logics with differing powers that we could plausibly use to represent programs, including programs higher in the arithmetic hierarchy. In general, the choice of programming language would change programs' relative lengths, and there are no clear desiderata when choosing a language. So Example 1 does not appear to offer an approach to solving the Problem of Priors (Talbott, 2016). The option to restrict to linear-time programs is a marginally more practical possibility that might escape most philosophical discussions.

## 4. Imitation

Let w(π | h_{<t}) be the posterior weight of the model π given the history h_{<t}, obtained by Bayesian updating of the prior w on the demonstrator's observed actions. Given a parameter α ∈ (0, 1), let Π^α_{h_{<t}} := {π ∈ Π : w(π | h_{<t}) ≥ α} be the set of top models in the posterior; it contains at most 1/α models. The imitator assigns to each action a the probability min_{π ∈ Π^α_{h_{<t}}} π(a | h_{<t}), and with the remaining probability, denoted θ_q(h_{<t}), it queries the demonstrator and lets the demonstrator act.

The difference from standard Bayesian sequence prediction is that there, all observations are informative about the true measure, whereas the imitator rarely sees the demonstrator act. To prove Theorem 6, we show that a posterior-weighted mixture over the class converges; if the mixture converges, then each constituent with non-negligible posterior weight must as well. This posterior-weighted mixture is called ρ^stat_n. We define it here alongside other estimators that will be used in the proof of ρ^stat_n's convergence. Restricting the mixture to models with posterior weight at least α ensures the weights in the weighted average aren't too small, and that we only need to consider the top 1/α models.

## 9. Key Proofs

We now prove our bound on the query probability, we define fairness, and we prove our bound on the probabilities of bad events.

Theorem 1 (Limited Querying) The imitator's query probability θ_q(h_{<t}) converges to 0 in mean cube.
First we show that z_t := w(π^d | h_{<t}), the posterior weight of the true demonstrator policy, is the prior weight w(π^d) multiplied by the likelihoods π^d(a_k | h_{<k}) of the observed demonstrator actions and normalized by the corresponding mixture likelihood. The probability that z_t ever falls below α is then at most α w(π^d)^{-1} (50), which implies P_{π^i_α μ}(∀t : π^d ∈ Π^α_{h_{<t}}) ≥ 1 − α w(π^d)^{-1}.
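The mechanism behind Theorem 1 can be watched in a toy simulation; this is an illustration under our simplified α-threshold reading of Π^α, with a made-up three-model class and arbitrary constants, not an artifact of the paper. As posterior weight concentrates on the true model, models drop out of Π^α, the per-action minima rise, and the query probability falls.

```python
import random

rng = random.Random(1)

# Toy class of three memoryless demonstrator models over actions {0, 1};
# the true demonstrator is models[0]. All values here are illustrative.
models = [
    lambda a: [0.8, 0.2][a],
    lambda a: [0.5, 0.5][a],
    lambda a: [0.2, 0.8][a],
]
posterior = [1.0 / 3] * 3
alpha = 0.05

def query_prob(posterior, alpha):
    # 1 minus the summed per-action minima over the models whose
    # posterior weight is at least alpha.
    top = [i for i, w in enumerate(posterior) if w >= alpha]
    return 1.0 - sum(min(models[i](a) for i in top) for a in (0, 1))

initial_q = query_prob(posterior, alpha)
for _ in range(200):
    if rng.random() < query_prob(posterior, alpha):
        # Query: the demonstrator acts, and the posterior updates on it.
        a = 0 if rng.random() < models[0](0) else 1
        posterior = [w * models[i](a) for i, w in enumerate(posterior)]
        z = sum(posterior)
        posterior = [w / z for w in posterior]
final_q = query_prob(posterior, alpha)
```

In a typical run the posterior concentrates on the true model and the query probability drops from 0.6 toward 0; since shrinking Π^α can only raise the per-action minima, the query probability never exceeds its initial value here.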