# On Efficient Online Imitation Learning via Classification

Yichen Li, University of Arizona, yichenl@arizona.edu
Chicheng Zhang, University of Arizona, chichengz@cs.arizona.edu

Abstract. Imitation learning (IL) is a general learning paradigm for tackling sequential decision-making problems. Interactive imitation learning, where learners can interactively query for expert demonstrations, has been shown to achieve provably superior sample efficiency guarantees compared with its offline counterpart or reinforcement learning. In this work, we study classification-based online imitation learning (abbrev. COIL) and the fundamental feasibility of designing oracle-efficient regret-minimization algorithms in this setting, with a focus on the general nonrealizable case. We make the following contributions: (1) we show that in the COIL problem, any proper online learning algorithm cannot guarantee a sublinear regret in general; (2) we propose LOGGER, an improper online learning algorithmic framework, that reduces COIL to online linear optimization by utilizing a new definition of mixed policy class; (3) we design two oracle-efficient algorithms within the LOGGER framework that enjoy different sample and interaction round complexity tradeoffs, and conduct finite-sample analyses to show their improvements over naive behavior cloning; (4) we show that under standard complexity-theoretic assumptions, efficient dynamic regret minimization is infeasible in the LOGGER framework. Our work puts classification-based online imitation learning, an important IL setup, on a firmer foundation.

1 Introduction

Imitation learning (IL), also known as learning from expert demonstrations [47, 44], is a general paradigm for training intelligent behavior for sequential decision-making tasks. IL has been successfully deployed in many applications, such as autonomous driving [47, 45], robot arm control [69], game playing [60], and sequence prediction [23, 8]. It is now well known that with the help of a demonstrating expert, an IL agent can bypass the exploration challenges of reinforcement learning, achieving a much lower sample requirement than reinforcement learning agents [63]. Two major IL paradigms have been studied in the literature: offline and interactive. In offline IL [1, 64, 74, 30], the learner receives a set of expert demonstrations ahead of time; in contrast, in interactive IL [23, 56, 55, 32], the learner has the ability to interactively query the expert for demonstrations on states at its disposal, allowing expert feedback to be provided in a targeted manner. In both settings, the goal of the learner is to output a policy π̂ that competes with the expert's policy π_E, by consuming as few resources (e.g. expert annotations) as possible. Between these two, interactive IL is known to be able to achieve superior policy performance than its offline counterpart under certain favorable assumptions on the expert policy and the environment, in that learning agents can use interaction to address the compounding error challenge [54, 49]. Despite the recent progress on the fundamental limits of interactive imitation learning in the realizable setting [63, 50, 49], the statistical and computational limits of interactive imitation learning in the general nonrealizable setting remain open.
One promising and influential algorithmic framework for studying and analyzing interactive IL in the nonrealizable setting is DAGGER (Data Aggregation) [56], whose key insight is to reduce interactive IL to regret minimization in online learning [59]. Specifically, it constructs an N-round online learning game, where at every round n ∈ [N], the learner outputs some policy π_n from a policy class B, and incurs a loss F_n(π_n); the loss F_n is carefully constructed so that the learner's instantaneous loss F_n(π_n) characterizes the current policy π_n's competitiveness compared to the expert π_E. A representative example of F_n(π) is the expected disagreement between π and π_E on the state occupancy distribution induced by π_n, used in the original DAGGER paper [56], which can be expressed as the expected zero-one loss of π on a distribution of classification examples; we call such a setting Classification-based Online Imitation Learning (abbrev. COIL). The DAGGER reduction framework has spurred an active line of research on IL [e.g. 55, 63, 15, 18, 17, 38]: it enables conversions from stochastic online optimization algorithms with static or dynamic regret guarantees to IL algorithms with different output policy suboptimality guarantees, allowing the research community to directly translate new results in online learning to those in IL.

Perhaps surprisingly, from a fundamental perspective, rigorous design of efficient regret minimization algorithms for COIL has been largely overlooked by the prior literature. Specifically, many works assume a fixed parameterization of policies in B, and assume that the F_n(π)'s are convex in π's underlying parameters to allow for no-regret online convex optimization [e.g. 56, 63, 15]. Although natural, this viewpoint has some issues: (1) in DAGGER's reduction, the learner uses finite-sample approximations of F_n(π), which are often discontinuous in π's underlying parameters (e.g. for a linear-classifier policy π_θ(s) = sign(⟨θ, s⟩), its zero-one loss on a state, I(π_θ(s) ≠ π_E(s)), is discontinuous in θ), making stochastic gradient-based methods inapplicable. Convex surrogate loss functions have been proposed as a popular workaround [56], but it is well known that in the nonrealizable setting, even for the special case of supervised learning, minimizing convex surrogate losses can result in very different models compared to minimizing the original zero-one classification losses [7]; (2) it makes the usage of policy classes with complex parameterization (e.g. rule-based policies such as decision trees) difficult, as convexity is hard to establish for such classes.

Overview of our results. In this paper, we bridge the above-mentioned gaps by studying the fundamental feasibility of designing efficient regret minimization algorithms for COIL, putting the study of statistical and computational limits of interactive imitation learning in the general nonrealizable setting on a firmer foundation. Our first result is that, analogous to Cover's impossibility result in online classification [20], in the COIL setting, any proper online learning algorithm (that outputs a sequence of policies {π_n}_{n=1}^N from the original class B) cannot guarantee sublinear regret in general (Section 3.1). This negative result motivates the design of improper learning algorithms for regret minimization.
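To make issue (1) above concrete, here is a minimal, self-contained sketch (with hypothetical numbers, not taken from the paper) showing that the zero-one loss of a linear-classifier policy π_θ(s) = sign(⟨θ, s⟩) is a step function of θ, so it provides no useful gradient signal:

```python
import numpy as np

# One state s and the expert's action on it; pi_theta(s) = sign(<theta, s>).
s = np.array([1.0, -0.5])
expert_action = +1

def zero_one_loss(theta):
    """Zero-one loss I(pi_theta(s) != pi_E(s)) on the single example (s, expert_action)."""
    action = 1 if np.dot(theta, s) >= 0 else -1
    return float(action != expert_action)

# Sweeping one coordinate of theta: the loss is piecewise constant, jumping
# between 0 and 1 at the decision boundary, so its gradient is 0 almost everywhere.
for t in np.linspace(-1.0, 1.0, 9):
    print(f"theta = ({t:+.2f}, 1.00)  ->  zero-one loss = {zero_one_loss(np.array([t, 1.0]))}")
```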
To this end, we propose to choose policies from a mixed policy class Π_B, and provide an algorithmic framework, LOGGER, that reduces COIL to online linear optimization. In a nutshell, LOGGER uses a natural parameterization of Π_B that allows F_n(π) to be expressed as a linear function of the underlying parameters of π ∈ Π_B. We show that any online linear optimization algorithm with (high-probability) regret guarantees can be plugged into LOGGER to obtain an algorithm for COIL with policy suboptimality guarantees (Section 3.2).

Next, enabled by the LOGGER framework, we design computationally efficient algorithms for static regret minimization. Assuming access to an offline cost-sensitive classification (CSC) oracle O, and a set of unlabeled separator examples for B [66, 24], we design LOGGER-M, a sample- and computationally-efficient algorithm. With O(1/ε²) interaction rounds and O(1/ε²) expert annotations, LOGGER-M enjoys a per-round static regret of ε (Section 4.1). Underlying LOGGER-M is a delicate utilization of the connection between Follow-the-Perturbed-Leader and Follow-the-Regularized-Leader, two well-known online learning algorithm families, first observed by [2]. Moreover, by exploiting the predictability of the COIL problem [51, 18, 17], we design an efficient algorithm, LOGGER-ME, that enjoys a per-round static regret of ε with O(1/ε) interaction rounds and O(1/ε²) expert annotations (Section 4.2). Its reduced number of interaction rounds can enable a more practical deployment of IL agents, especially when interactive expert annotations come in batches or with delays.

Finally, we study efficient dynamic regret minimization in the LOGGER framework (Section 5). We show that this is unlikely to be feasible: under a standard complexity-theoretic assumption, no oracle-efficient algorithm can output policies in Π_B with sublinear dynamic regret. Due to space constraints, we discuss key related works throughout the paper, and defer additional related works to Appendix A.

2 Preliminaries

Basic definitions. Define [n] := {1, . . . , n}. Define the indicator function I(·) such that I(E) = 1 if condition E is true, and I(E) = 0 otherwise. We use Δ(W) to denote the set of probability distributions over a finite set W, and use Onehot(w, W) ∈ Δ(W) to denote the delta mass on w ∈ W. For a finite W, we will oftentimes treat u ∈ R^{|W|} (e.g. u ∈ Δ(W)) as a |W|-dimensional vector; for w ∈ W, denote by u[w] the w-th coordinate of u. We abuse the notation {·} to also denote multisets.

Episodic MDPs. We study imitation learning in episodic Markov decision processes (MDPs). An episodic MDP M is a tuple (S, A, H, c, ρ, P), where S is a finite state space (that can be exponentially large), A is a finite action set, H ∈ N₊ is the episode length, c : S × A → [0, 1] is the cost function, and ρ ∈ Δ(S) is the initial state distribution. Also, P = {P_t}_{t=1}^{H−1} denotes M's transition dynamics, with P_t : S × A → Δ(S) being the transition probability at step t. Throughout, we use S and A to denote |S| and |A|, respectively. Without loss of generality, we assume that M is layered: S can be partitioned into H disjoint sets {S_t}_{t=1}^H; the initial distribution ρ is supported on S_1, and the transition distribution P_t(· | s, a) is supported on S_{t+1} for all t, s, a. For a state s ∈ S, define Step(s) as the step t such that s ∈ S_t. A learning agent interacts with M for one episode using the following protocol: for every step t ∈ [H], it observes a state s_t ∈ S_t, takes an action a_t ∈ A, incurs cost c(s_t, a_t), and transitions to the next state s_{t+1} ∼ P_t(· | s_t, a_t), except at the last step, when the episode ends.
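As an illustration of this interaction protocol, the following is a minimal sketch of one episode of interaction with a layered episodic MDP. The tabular container `M` and the fields it exposes are hypothetical stand-ins for the objects (S, A, H, c, ρ, P) defined above, not an interface from the paper.

```python
import numpy as np

def rollout_episode(M, policy, rng):
    """Roll out `policy` for one episode in a layered episodic MDP M (sketch).
    M is assumed to expose: M.H, M.actions, M.rho (dict state -> prob over S_1),
    M.cost(s, a), and M.transition(t, s, a) returning a dict over S_{t+1}.
    `policy(s)` returns a probability vector over M.actions."""
    states, actions, total_cost = [], [], 0.0
    s = rng.choice(list(M.rho.keys()), p=list(M.rho.values()))   # s_1 ~ rho
    for t in range(1, M.H + 1):
        a = rng.choice(M.actions, p=policy(s))                   # a_t ~ pi(.|s_t)
        states.append(s); actions.append(a)
        total_cost += M.cost(s, a)                               # incur c(s_t, a_t)
        if t < M.H:                                              # episode ends after step H
            nxt = M.transition(t, s, a)                          # P_t(.|s_t, a_t)
            s = rng.choice(list(nxt.keys()), p=list(nxt.values()))
    return states, actions, total_cost
```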
Given a stationary policy π : S → Δ(A), we use π(·|s) to denote the action distribution of π on s. Denote by E_π and P_π the expectation and probability over executing (i.e. rolling out) policy π in M. Given policy π, its state occupancy distribution at step t is defined as d_π^t(·) := P_π(s_t = ·); its average state occupancy distribution is denoted as d_π := (1/H) Σ_{t=1}^H d_π^t. Let J(π) := E_π[ Σ_{t=1}^H c(s_t, a_t) ] = H · E_{s∼d_π} E_{a∼π(·|s)}[c(s, a)] denote the expected cumulative cost of π over an episode. For policy π, we denote its value function V_π(s) := E[ Σ_{t=Step(s)}^H c(s_t, a_t) | s, π ] and action-value function Q_π(s, a) := c(s, a) + E[ Σ_{t=Step(s)+1}^H c(s_t, a_t) | s, a, π ]; in words, they are the expected costs of rolling out π starting from s and from (s, a), respectively. For policy π, define its advantage function as A_π(s, a) := Q_π(s, a) − V_π(s), which measures the expected performance difference caused by a one-step deviation of π that takes action a at state s. Also, we define the recoverability constant, capturing the ability of π to recover from deviations when rolled out in M:

Definition 1 (µ-recoverability). A (MDP, policy) pair (M, π) is said to be µ-recoverable if, for all s ∈ S and a ∈ A, |A_π(s, a)| ≤ µ.¹

Footnote 1: The µ-recoverability definition here is slightly different from the original one in [56], in that it also requires that A_π(s, a) ≥ −µ; we can drop this assumption with a slightly worse sample complexity analysis.

Interactive IL. We study interactive imitation learning [23, 54], where the learner has access to a stationary deterministic demonstrating expert π_E and would like to learn a policy with low expected cost. Throughout, we assume that (M, π_E) is µ-recoverable for some µ ≤ H that can possibly be as large as H. At each interaction round, the learner interacts with M for a few episodes, obtaining trajectories of the form τ = (s_1, a_1, s_2, a_2, . . . , s_H, a_H), and queries the expert for feedback on some of the states. Specifically, given a state s, the feedback given by the expert is of the form (ζ_E(s, a))_{a∈A} ∈ R^A. Two notable examples are: (1) direct expert annotation [56], i.e. given state s, the expert provides demonstration π_E(s), and we use it to construct an A-dimensional feedback (ζ_E(s, a))_{a∈A} = (µ · I(a ≠ π_E(s)))_{a∈A}; (2) estimates of value functions based on expert rollouts [55], i.e. (ζ_E(s, a))_{a∈A} = (A_E(s, a))_{a∈A} := (A_{π_E}(s, a))_{a∈A}.²

Footnote 2: Strictly speaking, in AggreVaTe and its variants [55, 63], the learner requests expert rollouts to obtain unbiased estimators of A_E(s, a); our sample complexity analysis can also be adapted to this setting.

Throughout, we assume that ζ_E(s, a) satisfies: for all s ∈ S and a ∈ A, A_E(s, a) ≤ ζ_E(s, a) ≤ µ · I(a ≠ π_E(s)); this is satisfied by the two examples above. For a stationary policy π, define its imitation loss as L(π) := E_{s∼d_π} E_{a∼π(·|s)}[ζ_E(s, a)]. By the performance difference lemma [34] (see Lemma 55 in Appendix H), J(π) − J(π_E) ≤ H · L(π), implying that if π has a small imitation loss, its expected cost is competitive with that of π_E. In light of this connection, in interactive IL the learner would like to obtain a policy π̂ with low L(π̂). Subject to this, throughout the paper, we consider optimizing two measures of data efficiency:
- Sample complexity: the total number of expert annotations ζ_E requested. A smaller sample complexity reduces the total cost of expert annotations (which often take human effort).
- Interaction round complexity: the total number of adaptive interaction rounds. A small number of interaction rounds enables more parallelized annotations within an interaction round, and mitigates issues of annotation delay [72, 68].

The DAGGER reduction framework for interactive IL. The DAGGER framework reduces minimizing L(π) to no-regret online learning [56, 55]. It constructs an N-round online learning game, where at every round n ∈ [N], the learner outputs some policy π_n, which induces a loss function F_n(π) := E_{s∼d_{π_n}} E_{a∼π(·|s)}[ζ_E(s, a)]. Its key insight is that, by the definition of {F_n}_{n=1}^N, minimizing the online learning cumulative loss Σ_{n=1}^N F_n(π_n) is equivalent to minimizing the cumulative imitation losses of the π_n's, i.e. Σ_{n=1}^N L(π_n). Research efforts in online learning [59, 43] have mainly focused on the design of algorithms that output {π_n}_{n=1}^N with small static regret SReg_N(B) or dynamic regret DReg_N(B) against some benchmark policy class B, formally:

SReg_N(B) := Σ_{n=1}^N F_n(π_n) − min_{π∈B} Σ_{n=1}^N F_n(π),    DReg_N(B) := Σ_{n=1}^N ( F_n(π_n) − min_{π∈B} F_n(π) ).

Assuming that the learner chooses policies {π_n}_{n=1}^N from another stationary policy class B_0 (which may or may not be B), the following proposition shows that static and dynamic regret guarantees in the induced online learning game can be converted to policy suboptimality guarantees:

Proposition 2 (e.g. [15]). For any N ∈ N₊ and any online learner that outputs {π_n}_{n=1}^N ∈ B_0^N, define

Bias(B, B_0, N) := max_{{υ_n}_{n=1}^N ∈ B_0^N} min_{π∈B} E_{s∼d̄_N} E_{a∼π(·|s)}[ I(a ≠ π_E(s)) ],  where d̄_N := (1/N) Σ_{n=1}^N d_{υ_n}.

Then, choosing π̂ uniformly at random from {π_n}_{n=1}^N guarantees:

E[ J(π̂) − J(π_E) ] ≤ H · min( µ · Bias(B, B_0, N) + E[SReg_N(B)] / N ,  µ · Bias(B, B_0, 1) + E[DReg_N(B)] / N ).

In the above proposition, Bias(B, B_0, N) takes the worst-case mixture of state occupancy distributions {d_{υ_n}}_{n=1}^N induced by N policies from B_0, and measures the expected disagreement between π_E and its best approximating policy in B. Informally, it measures the approximation error of the benchmark class B: it is always nonnegative, and in the special case of π_E ∈ B (which we call the realizable case), Bias(B, B_0, N) = 0. Proposition 2 gives two ways to obtain a competitive imitation policy: (1) choose (B, B_0) with a small Bias(B, B_0, N) and achieve a low static regret; (2) choose (B, B_0) with a small Bias(B, B_0, 1) and achieve a low dynamic regret. Although achieving low dynamic regret can be significantly more challenging than achieving low static regret, minimizing dynamic regret has the advantage that its approximation error term Bias(B, B_0, 1) is smaller than the static regret formulation's counterpart Bias(B, B_0, N).

Classification-based Online Imitation Learning (COIL). As we consider a finite action space A, a policy π can be equivalently viewed as a (possibly randomized) multiclass classifier. A cost-sensitive classification (CSC) example is defined to be a pair (x, c), where x ∈ S is its feature part and c ∈ R^A is its cost part. Under the DAGGER reduction framework, the loss at iteration n, F_n(π) = E_{s∼d_{π_n}} E_{a∼π(·|s)}[ζ_E(s, a)], can be viewed as the expected cost of policy π on a distribution of cost-sensitive examples D^E_{π_n} (formally, E_{(s,c)∼D^E_{π_n}} E_{a∼π(·|s)}[c(a)]), where a sample (s, c) is drawn from D^E_π by first rolling out π and drawing s ∼ d_π, and then querying the expert on s to obtain (ζ_E(s, a))_{a∈A} as its associated c.

Protocol 1: Classification-based Online Imitation Learning (COIL)
for n = 1, . . . , N do
  Learner outputs policy π_n.
  Loss function F_n(π) := E_{s∼d_{π_n}} E_{a∼π(·|s)}[ζ_E(s, a)].
  Learner draws samples from D^E_{π_n} to obtain information about the loss F_n, via interacting with M and querying the expert for annotation ζ_E.
end for
Goal of learner: minimize Σ_{n=1}^N F_n(π_n) = Σ_{n=1}^N L(π_n).
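To make the sampling step of Protocol 1 concrete, here is a minimal sketch of one COIL round under assumed interfaces: `rollout_episode` is the rollout sketch given earlier, and `expert_feedback(s)` is a hypothetical callable returning the vector (ζ_E(s, a))_{a∈A} as a dict from actions to costs.

```python
import numpy as np

def coil_round(M, policy, expert_feedback, num_samples, rng):
    """One round of Protocol 1 (sketch): sample states from d_{pi_n} by rolling out
    the current policy and picking a uniformly random step of each trajectory, then
    query the expert on each sampled state. Returns the dataset D_n of CSC examples
    (s, c) with c = (zeta_E(s, a))_{a in A}, a finite-sample view of F_n."""
    dataset = []
    for _ in range(num_samples):
        states, _, _ = rollout_episode(M, policy, rng)   # one trajectory of length H
        s = states[rng.integers(len(states))]            # s ~ d_pi (uniform over steps)
        c = expert_feedback(s)                           # expert annotation on s
        dataset.append((s, c))
    return dataset
```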
The learner can obtain a finite-sample approximation to F_n(π) by interacting with M and the expert to draw samples from D^E_{π_n}. We will focus on designing efficient regret-minimizing algorithms in the COIL setting; see Protocol 1 for a summary. In addition to data efficiency, we also consider the design of imitation learning algorithms with computational efficiency guarantees. To this end, following a sequence of empirically and theoretically successful works on oracle-efficient learning [66, 25, 5, 21], we assume access to the benchmark policy class B, a collection of B stationary and deterministic policies h : S → A, via the following computational oracle:

Definition 3 (CSC oracle). A CSC oracle O for policy class B is such that: given any input multiset of cost-sensitive examples D = {(x_1, c_1), . . . , (x_K, c_K)} ∈ (S × R^A)^K, it outputs the policy in B that has the smallest empirical cost, formally, O(D) := argmin_{h∈B} E_{(x,c)∼D}[c(h(x))], where we slightly abuse notation and use D to also denote the uniform distribution over it.

We measure an algorithm's computational efficiency by its number of calls to the oracle O.

3 LOGGER: reducing COIL to online linear optimization

In this section, we introduce our main algorithmic framework, LOGGER (an abbreviation for Linear lOss aGGrEgation), for designing regret-minimizing algorithms in the COIL setting. Section 3.1 motivates our approach by showing that natural proper-learning-based approaches fail to achieve sublinear regret in general; Section 3.2 introduces our approach of performing improper learning using a carefully-defined mixed policy class, via a reduction to online linear optimization.

3.1 Can we achieve sublinear regret using proper learning?

A natural idea for minimizing regret is proper learning: at round n, the learner chooses some policy π_n (possibly at random) from B_0 = B, our benchmark policy class, using some online learning algorithm, based on the information collected in the first n − 1 rounds; the learner then collects information on F_n via rollouts of π_n and expert annotations, and continues to the next iteration. While this approach has demonstrated sharp online regret guarantees in classical online cost-sensitive classification settings [40, 27], perhaps subtly, we show in the following theorem that this approach is insufficient to guarantee sublinear regret in the COIL setting.

Theorem 4. Suppose the expert's feedback ζ_E(s, a) is of the form µ · I(a ≠ π_E(s)) or A_E(s, a). Then, for any H ≥ 3, there exists an MDP M of episode length H, a deterministic expert policy π_E, and a benchmark policy class B, such that for any learner that sequentially (and possibly at random) generates a sequence of policies {π_n}_{n=1}^N ∈ B^N, its static regret satisfies SReg_N(B) = Ω(N).

The proof of the theorem can be found in Appendix D.1. Its key insight is that, distinct from the classical online CSC setting, in COIL the loss at round n, F_n, depends on the policy π_n chosen at that round, making standard regret minimization results in online classification [40, 27] inapplicable. In more detail, we construct a fixed MDP that acts adversarially to policies in B, such that any π_n ∈ B has F_n(π_n) ≥ (H−1)/H, whereas min_{π∈B} Σ_{n=1}^N F_n(π) ≤ N/2. Our theorem is similar in spirit to Cover's impossibility result in online classification [20], which shows that an adversary that adapts to the randomness of the learner at each round can force the learner to suffer linear regret.
3.2 A new hypothesis class and the LOGGER algorithmic framework

To sidestep the impossibility result in Theorem 4, we apply the convexification-by-randomization technique from online convex optimization [59] and perform improper learning on a mixed policy class, defined below.

Definition 5 (Mixed policy class). Given policy class B, define its induced mixed policy class

Π_B := { π_u : π_u(·|s) := Σ_{h∈B} u[h] · h(·|s),  u ∈ Δ(B) }.

We slightly abuse notation and use h(·|s) ∈ Δ(A) to denote the delta mass on h(s) ∈ A.³

Footnote 3: [63, Theorem 5.3] proposes to perform no-regret learning using another definition of nonstationary mixed policy class. We identify a technical issue with this approach, and defer a detailed discussion to Appendix D.2.

At a cursory glance, choosing a policy from Π_B seems equivalent to choosing some policy at random from B, which also falls into the failure mode of proper learning (Theorem 4). We remark that this is not true: rolling out a policy π_u ∈ Π_B is instead equivalent to drawing a new policy from B in an i.i.d. fashion at every step of the episode. As we will see next, the usage of Π_B enables the design of COIL algorithms with sublinear regret.

Our key observation is that, with the learner outputting policies from the mixed policy class Π_B, online regret minimization in IL becomes an online linear optimization problem. Recall that in online IL, the loss at round n is F_n(π) = E_{s∼d_{π_n}} E_{a∼π(·|s)}[ζ_E(s, a)]. By choosing π_n = π_{u_n} ∈ Π_B and π = π_u ∈ Π_B for u_n, u ∈ Δ(B), F_n(π_u) can be viewed as a linear function of u:

F_n(π_u) = Σ_{h∈B} u[h] · E_{s∼d_{π_n}} E_{a∼h(·|s)}[ζ_E(s, a)] = ⟨θ(u_n), u⟩,  where θ(v) := ( E_{s∼d_{π_v}} E_{a∼h(·|s)}[ζ_E(s, a)] )_{h∈B}.

We have

Σ_{n=1}^N ⟨θ(u_n), u_n⟩ = Σ_{n=1}^N F_n(π_n),   min_{u∈Δ(B)} Σ_{n=1}^N ⟨θ(u_n), u⟩ = min_{u∈Δ(B)} Σ_{n=1}^N F_n(π_u) = min_{π∈B} Σ_{n=1}^N F_n(π),

and therefore, minimizing the static regret SReg_N(B) is equivalent to minimizing the static regret in the online linear optimization problem with losses { u ↦ ⟨θ(u_n), u⟩ }_{n=1}^N.

This motivates LOGGER (Algorithm 2), our main algorithmic framework. Given as input an online linear optimization algorithm OLOA and a sample size K, LOGGER outputs a policy sequence {π_n}_{n=1}^N.

Algorithm 2: LOGGER: reducing COIL to online linear optimization
1: Input: MDP M, expert feedback ζ_E, sample size per iteration K, online linear optimization algorithm OLOA with decision set Δ(B).
2: for n = 1, 2, . . . , N do
3:   Choose u_n ← OLOA({g_i}_{i=1}^{n−1}), which induces policy π_n := π_{u_n}.
4:   Draw K examples D_n = {(s, c)} i.i.d. from D^E_{π_n}, via interaction with M and expert ζ_E.
5:   D_n induces g_n := ( E_{(s,c)∼D_n} E_{a∼h(·|s)}[c(a)] )_{h∈B}, an unbiased estimator of θ(u_n).
6: end for

Specifically, at round n, LOGGER calls OLOA to perform online linear optimization with respect to the linear losses {u ↦ ⟨g_i, u⟩}_{i=1}^{n−1} and obtains u_n ∈ Δ(B), which corresponds to a policy π_n ∈ Π_B (line 3); here, for every i, g_i is an unbiased estimator of θ(u_i). It then rolls out π_n in M for K times to obtain K samples i.i.d. from d_{π_n}, queries the expert on each sampled s to obtain (ζ_E(s, a))_{a∈A} as its associated c, and constructs the dataset D_n = {(s, c)} (line 4). Finally, LOGGER computes the empirical loss of policies on D_n, i.e. g_n = ( E_{(s,c)∼D_n} E_{a∼h(·|s)}[c(a)] )_{h∈B} (line 5).

Comparison to prior works. [16] considers a general online convex optimization formulation for online IL, dubbed "continuous online learning"; our loss function ⟨θ(u), ·⟩ can be viewed as an instantiation of the loss function f_u(·) therein. However, their regret minimization results assume that f_u(·) is strongly convex, which does not cover our COIL setting where f_u(·) is linear.

Define LReg_N := Σ_{n=1}^N ⟨g_n, u_n⟩ − min_{u∈Δ(B)} Σ_{n=1}^N ⟨g_n, u⟩ as OLOA's static regret with respect to {⟨g_n, ·⟩}_{n=1}^N.
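As a concrete illustration of line 5 of Algorithm 2, the sketch below computes the empirical linear loss vector g_n from a dataset D_n for a small explicit class B, and evaluates ⟨g_n, u⟩ for a mixture weight u; the toy policies and costs are hypothetical.

```python
import numpy as np

def empirical_linear_loss(policy_class, dataset):
    """g_n[h] = E_{(s,c)~D_n} E_{a~h(.|s)}[c(a)]: the average cost each deterministic
    policy h in B would incur on the CSC examples in D_n (c is a dict action -> cost)."""
    g = np.zeros(len(policy_class))
    for j, h in enumerate(policy_class):
        g[j] = np.mean([c[h(s)] for (s, c) in dataset])
    return g

# Toy usage: B has two constant policies over actions {0, 1}; D_n has two examples.
B = [lambda s: 0, lambda s: 1]
D_n = [("s1", {0: 0.2, 1: 0.8}), ("s2", {0: 0.9, 1: 0.1})]
g_n = empirical_linear_loss(B, D_n)
u = np.array([0.5, 0.5])            # a mixture weight in Delta(B)
print(g_n, np.dot(g_n, u))          # <g_n, u> is the empirical loss of pi_u
```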
We have the following proposition that links LReg_N to SReg_N(B), the static regret of {π_n}_{n=1}^N in the online IL problem.

Proposition 6. For any δ ∈ (0, 1], suppose LOGGER uses some OLOA that outputs {u_n}_{n=1}^N ∈ Δ(B)^N such that, with probability at least 1 − δ/3, LReg_N ≤ Reg(N). Then, with probability at least 1 − δ, its output policies {π_n}_{n=1}^N satisfy SReg_N(B) ≤ Reg(N) + O( µ √(N ln(1/δ)) ).

Proposition 6 shows that Algorithm 2 is a regret-preserving reduction from online IL to online linear optimization over Δ(B), a B-dimensional probability simplex. The latter is well known as the prediction-with-expert-advice problem [27] (abbrev. expert problem), where algorithms with different guarantees abound, such as Follow the Regularized Leader (FTRL), Hedge [27] and its adaptive and optimistic variants [e.g. 61, Section 1], many of which have optimal worst-case regret bounds Reg(N) = O(√(N ln(B))). Instantiating Algorithm 2 with OLOA set to one of these algorithms, we obtain a family of online IL algorithms with expected regret of order O(µ √(N ln(B))).

Although satisfying from a statistical efficiency perspective, such online IL algorithms suffer from computational inefficiency: they require explicit calculation of g_n and maintenance of u_n, which are B-dimensional (dense) vectors whose entries need to be updated separately. For instance, when Hedge is chosen as OLOA, u_n[h] ∝ exp( −η Σ_{i=1}^{n−1} g_i[h] ) for all h ∈ B, which naively requires O(B) time per round to maintain. To address this computational efficiency issue, in the next section we exploit the cost-sensitive classification nature of the COIL problem to design sublinear-regret algorithms that use implicit representations of the g_n's, namely the D_n's, and that enjoy oracle-efficiency guarantees.
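For concreteness, the following is a minimal sketch of the Hedge update just described, computing u_n from the accumulated loss vectors g_1, . . . , g_{n−1}; the explicit O(B)-sized weight vector it maintains is exactly the bottleneck that the oracle-efficient approach of the next section avoids. The toy loss vectors are hypothetical.

```python
import numpy as np

def hedge_weights(loss_vectors, eta):
    """Hedge / FTRL-with-entropy update (sketch): u_n[h] proportional to
    exp(-eta * sum_{i<n} g_i[h]). Maintaining this explicitly costs O(B) per round."""
    cumulative = np.sum(loss_vectors, axis=0)       # sum_{i=1}^{n-1} g_i
    logits = -eta * cumulative
    w = np.exp(logits - np.max(logits))             # numerically stabilized softmax
    return w / w.sum()

# Toy usage over a class of B = 3 policies, with two hypothetical loss vectors.
g_history = [np.array([0.3, 0.7, 0.5]), np.array([0.2, 0.9, 0.4])]
u_3 = hedge_weights(g_history, eta=1.0)
print(u_3)   # a distribution in Delta(B) favoring the low-cumulative-loss policy
```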
4 Efficient algorithms with static regret guarantees

Using the LOGGER framework, in this section we propose two oracle-efficient COIL algorithms that have sublinear static regret guarantees against the policy class B, in Subsections 4.1 and 4.2 respectively.

4.1 LOGGER-M: an efficient algorithm with O(√N) static regret

The LOGGER reduction framework calls for a computationally and statistically efficient OLOA, which, if devised, yields a computationally and statistically efficient online imitation learner. However, when viewed as a general adversarial online learning problem, computational hardness results [29] suggest that, even with access to the classification oracle O, a prohibitive Ω(√B) time complexity is necessary for sublinear regret. Therefore, in subsequent sections, we adopt an assumption on B from [66], which, to the best of our knowledge, is the weakest known assumption that allows the design of oracle-efficient online CSC algorithms in the adversarial setting:

Assumption 1 (Small separator set). There exists a set X ⊆ S (called the separator set) such that for every pair of distinct policies h, h′ ∈ B, there exists x ∈ X such that h(x) ≠ h′(x). Denote X := |X|.

Technical challenges. Even under the small separator set assumption, the design of low-regret oracle-efficient algorithms for imitation learning still remains nontrivial. A naive application of existing oracle-efficient online CSC algorithms, such as Contextual Follow the Perturbed Leader (CFTPL) [66, 24], still falls into the failure mode of proper learning (Theorem 4 in Section 3.1), where an Ω(N) regret lower bound is unavoidable in the worst case. This is in sharp contrast to the classical online CSC setting, where CFTPL enjoys an O(√N) regret [66, 24]. To recap, at round n, CFTPL first constructs a random set of hallucinated cost-sensitive examples Z based on the separator set X; it subsequently calls the CSC oracle O on the union of Z and the accumulated dataset ∪_{i=1}^{n−1} D_i to obtain a policy π_n ∈ B. CFTPL achieves computational efficiency by operating on the D_n's, an implicit representation of the g_n's, the linear losses of the underlying OLO problem.

Our approach. The above difficulty motivates the need for a new algorithmic approach to efficient classification-based online IL. In view of Section 3.2's observation that FTRL approaches enjoy sublinear regret, we ask: is it possible to perform FTRL in an oracle-efficient manner? A positive answer would simultaneously address the computational and statistical challenges of COIL. We answer this question in the affirmative, by utilizing a connection between FTRL and FTPL first observed in [2]: an in-expectation version of FTPL can be viewed as an FTRL algorithm. Using this observation, we design Algorithm 3, namely Mixed CFTPL (abbrev. MFTPL), which mimics FTRL by approximating the in-expectation version of CFTPL in an oracle-efficient manner. Similar to CFTPL, MFTPL keeps g_n implicitly in D_n and calls the CSC oracle. Different from CFTPL, MFTPL runs the oracle-call step of CFTPL T times and outputs the uniform mixture of the T resulting policies. We refer to T as Algorithm 3's sparsification parameter, due to the algorithm's resemblance to Maurey's sparsification [46].

Algorithm 3: MFTPL: an oracle-efficient approximation of FTRL
1: Input: linear losses {g_i}_{i=1}^{n−1} represented by datasets {D_i}_{i=1}^{n−1}, each of size K (s.t. g_i[h] = E_{(s,c)∼D_i}[c(h(s))] for all h ∈ B), separator set X, learning rate η, sparsification parameter T.
2: for j = 1, 2, . . . , T do
3:   Draw (ℓ_{x,j}(a))_{a∈A} ∼ N(0, I_A) i.i.d. for each x ∈ X.
4:   Define Z_j := { (x, (K/η) · ℓ_{x,j}) : x ∈ X }.
5:   Compute u_{n,j} ← Onehot( O( (∪_{i=1}^{n−1} D_i) ∪ Z_j ), B ).
6: end for
7: return u_n ← (1/T) Σ_{j=1}^T u_{n,j}.

Specifically, at each iteration j ∈ [T], MFTPL first draws (ℓ_{x,j}(a))_{a∈A} ∼ N(0, I_A) i.i.d. for each x in the separator set X, where I_A is the identity matrix of dimension A (line 3). It then constructs a perturbation set of cost-sensitive examples Z_j = { (x, (K/η) · ℓ_{x,j}) : x ∈ X } that contains each x in the separator set with (K/η) · ℓ_{x,j} as its associated cost c, where K accounts for the adjustment to the dataset size and η accounts for FTRL's learning rate (line 4). It then calls the oracle O on the datasets ∪_{i=1}^{n−1} D_i accumulated so far, together with the perturbation examples Z_j, to obtain an empirical cost minimizer h ∈ B. This h is represented by a one-hot vector u_{n,j} ∈ Δ(B) that has weight 1 on the h-th coordinate and 0 elsewhere (line 5). Finally, after T iterations, MFTPL returns the mean value u_n = (1/T) Σ_{j=1}^T u_{n,j} (line 7).
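The following is a minimal sketch of Algorithm 3 under assumed interfaces: `csc_oracle` is a hypothetical brute-force implementation of the oracle O of Definition 3 (returning the empirical cost minimizer in B on a multiset of (x, c) pairs, with c a dict from actions to costs), and the policy class is a small explicit list rather than anything from the paper.

```python
import numpy as np

def csc_oracle(policy_class, dataset):
    """Brute-force CSC oracle O (Definition 3) for a small finite class B (sketch)."""
    def empirical_cost(h):
        return sum(c[h(x)] for (x, c) in dataset) / max(len(dataset), 1)
    return min(policy_class, key=empirical_cost)

def mftpl(datasets, separator, actions, policy_class, eta, T, K, rng):
    """MFTPL (Algorithm 3) sketch: T oracle calls on the accumulated data plus freshly
    perturbed separator examples; returns the empirical mixture u_n in Delta(B)."""
    accumulated = [pair for D in datasets for pair in D]           # union of D_1..D_{n-1}
    u = np.zeros(len(policy_class))
    for _ in range(T):
        # Gaussian perturbation costs on each separator example (lines 3-4).
        Z = [(x, {a: (K / eta) * rng.normal() for a in actions}) for x in separator]
        h = csc_oracle(policy_class, accumulated + Z)              # line 5
        u[policy_class.index(h)] += 1.0 / T                        # average of one-hot vectors
    return u

# Toy usage: two constant policies, one separator state, actions {0, 1}.
rng = np.random.default_rng(0)
B = [lambda s: 0, lambda s: 1]
D_1 = [("s1", {0: 0.2, 1: 0.8})]
print(mftpl([D_1], separator=["x1"], actions=[0, 1], policy_class=B,
            eta=1.0, T=20, K=1, rng=rng))
```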
MFTPL guarantees the following:

Lemma 7. There exists some strongly convex function R : Δ(B) → R such that the following holds. Suppose MFTPL receives datasets {D_i}_{i=1}^{n−1}, separator set X, learning rate η, and sparsification parameter T. Then, for any δ ∈ (0, 1], with probability at least 1 − δ, MFTPL makes T calls to the cost-sensitive oracle O and outputs u_n ∈ Δ(B) such that, for all s ∈ S,

‖π_{u_n}(·|s) − π_{u*_n}(·|s)‖₁ ≤ √( (2A ln(S) + ln(2/δ)) / T ),  with u*_n := argmin_{u∈Δ(B)} ( η Σ_{i=1}^{n−1} ⟨g_i, u⟩ + R(u) ).

Therefore, by setting T = Ω(A ln(S)), the policy π_n induced by MFTPL's output u_n closely mimics π_{u*_n}, the policy induced by the FTRL output u*_n. The mild ln(·) dependence on S makes the lemma useful in large-state-space settings. Specifically, a naive attempt to show Lemma 7 is to establish the ℓ₁ closeness of u_n and u*_n, given that T·u_n ∼ Multinomial(T, u*_n). This unavoidably carries an impractical concentration factor of O(√(B/T)), as the bound requires T to be Ω(B) to be non-vacuous. We circumvent this challenge by directly showing the closeness of the action distributions π_{u_n}(·|s) and π_{u*_n}(·|s) for all states. We defer the full version of the lemma, including an explicit form of R, to Appendix E.1.

Lemma 8. For any δ ∈ (0, 1], MFTPL, if called for N rounds with an appropriately chosen learning rate η and sparsification parameter T = N ln(2NS/δ) / √(X³ ln(B)), outputs a sequence {u_n}_{n=1}^N such that, with probability at least 1 − δ,

LReg_N ≤ O( µ √(NA ln(1/δ)) · (X³ ln(B))^{1/4} ).

Composing LOGGER with MFTPL, we obtain an efficient online IL algorithm, LOGGER-M. Its regret guarantees immediately follow from combining Lemma 8 with Proposition 6:

Theorem 9. For any δ ∈ (0, 1], LOGGER-M, with K = 1 and MFTPL setting its parameters as in Lemma 8, is such that: (1) with probability at least 1 − δ, its output {π_n}_{n=1}^N satisfies SReg_N(B) ≤ O( µ √(NA ln(1/δ)) · (X³ ln(B))^{1/4} ); (2) it queries N annotations from the expert; (3) it calls the CSC oracle O for N² ln(6NS/δ) / √(X³ ln(B)) times.

4.2 LOGGER-ME: an efficient algorithm with O(1) static regret

Although LOGGER-M is oracle-efficient, it is unclear whether its O(√N) regret guarantee is optimal. As a lower regret can translate into lower sample and interaction round complexity guarantees, it is desirable to design algorithms with regret as low as possible. A key observation from prior works [e.g. 15, 16, 38] is that online IL is a predictable online learning problem [19, 51], and is thus not completely adversarial. This opens up possibilities to bypass the O(√N) worst-case regret barrier. Specifically, in the LOGGER framework, the coefficient of the linear loss at round n, θ(u_n), depends continuously on u_n; more concretely, we can show:

Lemma 10. For u, v ∈ Δ(B),  ‖θ(u) − θ(v)‖_∞ ≤ µH · max_{s∈S} ‖π_u(·|s) − π_v(·|s)‖₁ ≤ µH · ‖u − v‖₁.

This property and its variants, termed distributional continuity, have been utilized in many online IL algorithms to achieve sharper regret guarantees. These works additionally exploit strong convexity of the loss functions F_n(π) [15, 16, 38], or use some external predictive model that can predict F_n(π) well [18, 17]. Unfortunately, in our COIL setting, neither is the loss function F_n(π_u) = ⟨θ(u_n), u⟩ strongly convex in the policy parameter u, nor do we have access to an external predictive model, rendering these approaches inapplicable. We get around these challenges and design an oracle-efficient algorithm, namely MFTPL-EG (where EG stands for extra-gradient), with O(1) regret for the online linear optimization problem, which, when composed with the LOGGER framework, yields the LOGGER-ME algorithm with O(1) regret in the COIL setting. MFTPL-EG is largely inspired by the predictor-corrector framework for policy optimization [17] and extragradient methods in smooth optimization [42, 33]; its details can be found in Appendix E.2. Its key insight is that, although we do not have a predictive model for θ(u_n), we can use an extra round of interaction with M and expert annotations to obtain a good estimate of it. Based on this, we derive an online linear optimization regret guarantee for MFTPL-EG, deferred to Appendix E.2. This immediately implies the following guarantee of LOGGER-ME:
Theorem 11. For any δ ∈ (0, 1], LOGGER-ME, with K and MFTPL-EG's parameters set appropriately, is such that: (1) with probability at least 1 − δ, its output {π_n}_{n=1}^N satisfies SReg_N(B) ≤ O( µHA √(X³ ln(B)) ); (2) it queries O(N² ln(NB/δ)) annotations from the expert; (3) it calls the CSC oracle O for O( N³ ln(NS/δ) / (µHA √(X³ ln(B))) ) times.

Discussion and comparison. We now compare the guarantees of LOGGER-M, LOGGER-ME, and the baseline of behavior cloning, where the learner simply draws i.i.d. examples from D^E_{π_E} and performs empirical risk minimization over B to learn a policy π̂. All algorithms' output policy suboptimality guarantees have the following decomposition:

E[ J(π̂) − J(π_E) ] ≤ ApproxErr + EstimErr,    (2)

where ApproxErr measures the approximation error of the policy class B with respect to the expert policy π_E, and EstimErr is an estimation error term that vanishes as the number of expert annotation examples and iterations increases. By Proposition 2, for LOGGER-M and LOGGER-ME, their ApproxErr terms are both µH · Bias(B, Π_B, N). For their EstimErr = H · SReg_N(B) / N, we use Theorems 9 and 11 to calculate the minimum total numbers of interaction rounds I(ε), expert annotations A(ε), and oracle calls C(ε) so that EstimErr is at most ε with probability at least 1 − δ. As presented in Table 1, LOGGER-ME has the same sample complexity order as LOGGER-M, but has a much lower interaction round complexity (µH²/ε vs. µ²H²/ε²). On the other hand, by standard ERM analysis [58] and the conversion from supervised learning guarantees to imitation learning guarantees ([65, 34]), behavior cloning on B using K samples outputs a policy π̂ such that Equation (2) holds with ApproxErr = H² · Bias(B, {π_E}, 1) and EstimErr = H² √(2 ln(2B/δ)/K) (see Appendix E.3 for a detailed derivation), where Bias(B, {π_E}, 1) = min_{h∈B} E_{s∼d_{π_E}}[I(h(s) ≠ π_E(s))]. We also summarize behavior cloning's performance guarantees in Table 1. Compared with the two interactive IL algorithms above, behavior cloning requires only one interaction round and one call to the oracle O; however, its ApproxErr has a larger coefficient on the optimal classification loss (H² vs. µH), and it needs more expert annotations (H⁴/ε² vs. µ²H²/ε²) to achieve estimation error smaller than ε with probability at least 1 − δ.

| Algorithm | ApproxErr | I(ε) | A(ε) | C(ε) |
|---|---|---|---|---|
| LOGGER-M | µH · Bias(B, Π_B, N) | O(µ²H²/ε²) | O(µ²H²/ε²) | O(µ⁴H⁴/ε⁴) |
| LOGGER-ME | µH · Bias(B, Π_B, N) | O(µH²/ε) | O(µ²H²/ε²) | O(µ²H⁵/ε³) |
| Behavior cloning | H² · Bias(B, {π_E}, 1) | 1 | O(H⁴/ε²) | 1 |

Table 1: A comparison between our algorithms and the behavior cloning baseline, in terms of approximation error and the numbers of interaction rounds I(ε), expert annotations A(ε), and oracle calls C(ε) needed for the estimation error to be at most ε with probability 1 − δ. Here O(·) hides dependences on ln(µH/ε), X, A, ln(S), ln(B), ln(µ), ln(1/δ). See Appendix E.3 for the full version of the table.
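As a point of reference for the baseline row of Table 1, here is a minimal sketch of behavior cloning as empirical risk minimization of the zero-one loss over a small explicit class B; the toy data are hypothetical.

```python
def behavior_cloning(policy_class, expert_dataset):
    """Behavior cloning (sketch): ERM of the zero-one loss over B on i.i.d.
    expert-labeled states. `expert_dataset` is a list of (state, expert_action)
    pairs with states drawn from d_{pi_E}."""
    def empirical_01_loss(h):
        return sum(h(s) != a_star for (s, a_star) in expert_dataset) / len(expert_dataset)
    return min(policy_class, key=empirical_01_loss)

# Toy usage: the class of two constant policies over actions {0, 1}.
B = [lambda s: 0, lambda s: 1]
D_E = [("s1", 0), ("s2", 0), ("s3", 1)]
pi_hat = behavior_cloning(B, D_E)
print(pi_hat("s1"))   # the constant-0 policy attains the smaller empirical loss
```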
5 Computational hardness of sublinear dynamic regret guarantees

Finally, we study dynamic regret minimization for COIL in the LOGGER framework. Although in the abstract continuous online learning setup dynamic regret minimization has recently been shown to be computationally hard [16], given the peculiar linear loss structure of the LOGGER framework, the computational tractability of dynamic regret minimization within this framework still remains open. We fill this gap by showing that, under a standard complexity-theoretic assumption (that PPAD-complete problems do not admit randomized polynomial-time algorithms), there do not exist polynomial-time algorithms that achieve sublinear dynamic regret in COIL. Specifically, we have:

Theorem 12. Fix γ > 0. If there exists a COIL algorithm such that for any M and expert π_E, it interacts with M, the CSC oracle O, and expert feedback ζ_E(s, a) = A_E(s, a), and outputs a sequence {π_n}_{n=1}^N ∈ Π_B^N such that, with probability at least 1/2, DReg_N(B) ≤ O(poly(S, A, B) · N^{1−γ}), in poly(N, S, A, B) time, then all problems in PPAD are solvable in randomized polynomial time.

The key insight behind our proof of Theorem 12 is that achieving a sublinear dynamic regret in the COIL setup is at least as hard as finding an approximate Nash equilibrium in a two-player general-sum game, a well-known PPAD-complete problem [13]. To establish a reduction from a two-player general-sum game to a COIL problem, we carefully construct a tree-structured MDP whose leaf states consist of two major groups: one group has costs encoding the two players' payoffs, and the other group has a large constant cost, ensuring that any policy in Π_B with small dynamic regret encodes near-optimal strategies for both players. We refer the readers to Appendix F for details.

6 Conclusion

In this work, we investigate the fundamental statistical and computational limits of classification-based online imitation learning (COIL). On the positive side, we propose the LOGGER framework that enables the design of oracle-efficient and regret-efficient COIL algorithms with different sample and interaction round complexity tradeoffs, outperforming the behavior cloning baseline. On the negative side, we establish impossibility results for sublinear static regret using proper learning in the COIL setting. We also show the computational hardness of sublinear dynamic regret guarantees in the LOGGER framework. Looking forward, it would be interesting to investigate the optimality of our sample complexity and interaction round complexity guarantees; we also speculate that it is possible to relax the small separator set assumption on B by utilizing very recent results on smoothed online learning [11, 28]. Finally, we are also interested in empirically evaluating our algorithms.

Acknowledgments and Disclosure of Funding. We thank the anonymous reviewers for their constructive comments. We thank Kwang-Sung Jun, Ryn Gray, Jason Pacheco, and members of the University of Arizona machine learning reading group for helpful discussions. We thank Wen Sun for helpful communications regarding [63, Theorem 5.3] and for pointing us to an updated version [6, Section 15.5]. We thank Weijing Wang for helping with illustrative figures. This work is supported by startup funding from the University of Arizona.

References

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004. [2] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In Conference on Learning Theory, pages 807–823. PMLR, 2014. [3] Jacob D Abernethy, Young Hun Jung, Chansoo Lee, Audra McMillan, and Ambuj Tewari. Online learning via the differential privacy lens. Advances in Neural Information Processing Systems, 32, 2019. [4] Jacob D Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. Advances in Neural Information Processing Systems, 28, 2015. [5] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.
[6] Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms, 2020. [7] Shai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, 2012. [8] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems, 28, 2015. [9] Dimitri P Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218 231, 1973. [10] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations research, 63(5):1227 1244, 2015. [11] Adam Block, Yuval Dagan, Noah Golowich, and Alexander Rakhlin. Smoothed online learning is as easy as statistical learning. ar Xiv preprint ar Xiv:2202.04690, 2022. [12] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. [13] Xi Chen and Xiaotie Deng. Settling the complexity of two-player nash equilibrium. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 06), pages 261 272. IEEE, 2006. [14] Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player nash equilibria. Journal of the ACM (JACM), 56(3):1 57, 2009. [15] Ching-An Cheng and Byron Boots. Convergence of value aggregation for imitation learning. In International Conference on Artificial Intelligence and Statistics, pages 1801 1809. PMLR, 2018. [16] Ching-An Cheng, Jonathan Lee, Ken Goldberg, and Byron Boots. Online learning with continuous variations: Dynamic regret and reductions. In International Conference on Artificial Intelligence and Statistics, pages 2218 2228. PMLR, 2020. [17] Ching-An Cheng, Xinyan Yan, Nathan Ratliff, and Byron Boots. Predictor-corrector policy optimization. In International Conference on Machine Learning, pages 1151 1161. PMLR, 2019. [18] Ching-An Cheng, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Accelerating imitation learning with predictive models. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3187 3196. PMLR, 2019. [19] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, pages 6 1. JMLR Workshop and Conference Proceedings, 2012. [20] Thomas M Cover. Behavior of sequential predictors of binary sequences. Technical report, STANFORD UNIV CALIF STANFORD ELECTRONICS LABS, 1966. [21] Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On oracle-efficient pac rl with rich observations. Advances in neural information processing systems, 31, 2018. [22] Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. SIAM Journal on Computing, 39(1):195 259, 2009. [23] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 75(3):297 325, 2009. [24] Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. Journal of the ACM (JACM), 67(5):1 57, 2020. [25] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. 
Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 169 178, 2011. [26] Francisco Facchinei and Jong-Shi Pang. Finite-dimensional variational inequalities and complementarity problems. Springer Science & Business Media, 2007. [27] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119 139, 1997. [28] Nika Haghtalab, Yanjun Han, Abhishek Shetty, and Kunhe Yang. Oracle-efficient online learning for beyond worst-case adversaries. ar Xiv preprint ar Xiv:2202.08549, 2022. [29] Elad Hazan and Tomer Koren. The computational power of optimization in online learning. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 128 141, 2016. [30] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016. [31] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398 406. PMLR, 2015. [32] Kshitij Judah, Alan P Fern, Thomas G Dietterich, and Prasad Tadepalli. Active imitation learning: Formal and practical reductions to iid learning. Journal of Machine Learning Research, 15(120):4105 4143, 2014. [33] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17 58, 2011. [34] Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002. [35] Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13(1):1865 1890, 2012. [36] Liyiming Ke, Sanjiban Choudhury, Matt Barnes, Wen Sun, Gilwoo Lee, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. In International Workshop on the Algorithmic Foundations of Robotics, pages 313 329. Springer, 2020. [37] Hoang Le, Andrew Kang, Yisong Yue, and Peter Carr. Smooth imitation learning for online sequence prediction. In International Conference on Machine Learning, pages 680 688. PMLR, 2016. [38] Jonathan N Lee, Michael Laskey, Ajay Kumar Tanwani, Anil Aswani, and Ken Goldberg. Dynamic regret convergence analysis and an adaptive regularization algorithm for on-policy robot imitation learning. The International Journal of Robotics Research, 40(10-11):1284 1305, 2021. [39] Carlton E Lemke and Joseph T Howson, Jr. Equilibrium points of bimatrix games. Journal of the Society for industrial and Applied Mathematics, 12(2):413 423, 1964. [40] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212 261, 1994. [41] H Brendan Mc Mahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117 3166, 2017. [42] Arkadi Nemirovski. Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229 251, 2004. [43] Francesco Orabona. A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213, 2019. [44] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. 
An algorithmic perspective on imitation learning. Foundations and Trends R in Robotics, 7(1-2):1 179, 2018. [45] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan, Evangelos A Theodorou, and Byron Boots. Imitation learning for agile autonomous driving. The International Journal of Robotics Research, 39(2-3):286 302, 2020. [46] Gilles Pisier. Remarques sur un résultat non publié de b. maurey. Séminaire d Analyse fonctionnelle (dit" Maurey-Schwartz"), pages 1 12, 1981. [47] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988. [48] Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. ar Xiv preprint ar Xiv:2001.11595, 2020. [49] Nived Rajaraman, Yanjun Han, Lin Yang, Jingbo Liu, Jiantao Jiao, and Kannan Ramchandran. On the value of interaction and function approximation in imitation learning. Advances in Neural Information Processing Systems, 34, 2021. [50] Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning. Advances in Neural Information Processing Systems, 33:2914 2924, 2020. [51] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993 1019. PMLR, 2013. [52] Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. In International Conference on Machine Learning, pages 1977 1985. PMLR, 2016. [53] R Tyrrell Rockafellar. Convex analysis, volume 18. Princeton university press, 1970. [54] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661 668. JMLR Workshop and Conference Proceedings, 2010. [55] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. ar Xiv preprint ar Xiv:1406.5979, 2014. [56] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627 635, 2011. [57] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. Hebrew University, 2007. [58] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [59] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107 194, 2011. [60] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. [61] Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, pages 1593 1601. PMLR, 2014. [62] Wen Sun, Anirudh Vemula, Byron Boots, and Drew Bagnell. Provably efficient imitation learning from observation alone. In International conference on machine learning, pages 6036 6045. PMLR, 2019. [63] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. 
Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, pages 3309–3318. PMLR, 2017. [64] Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. Advances in neural information processing systems, 20, 2007. [65] Umar Syed and Robert E Schapire. A reduction from apprenticeship learning to classification. Advances in neural information processing systems, 23, 2010. [66] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, pages 2159–2168. PMLR, 2016. [67] Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E Schapire. Improved regret bounds for oracle-based adversarial contextual bandits. Advances in Neural Information Processing Systems, 29, 2016. [68] Tianhao Wang, Si Chen, and Ruoxi Jia. One-round active learning. arXiv preprint arXiv:2104.11843, 2021. [69] Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robust imitation of diverse behaviors. Advances in Neural Information Processing Systems, 30, 2017. [70] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003. [71] Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments. Advances in Neural Information Processing Systems, 33:15737–15749, 2020. [72] Liu Yang and Jaime Carbonell. Buy-in-bulk active learning. Advances in neural information processing systems, 26, 2013. [73] Constantin Zalinescu. Convex analysis in general vector spaces. World scientific, 2002. [74] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.