# MobILE: Model-Based Imitation Learning From Observation Alone

Rahul Kidambi (Amazon Search & AI, Berkeley CA 94704, rk773@cornell.edu), Jonathan D. Chang (CS Department, Cornell University, Ithaca NY 14853, jdc396@cornell.edu), Wen Sun (CS Department, Cornell University, Ithaca NY 14853, ws455@cornell.edu)

This paper studies Imitation Learning from Observations alone (ILFO), where the learner is presented with expert demonstrations that consist only of the states visited by an expert (without access to the actions taken by the expert). We present a provably efficient model-based framework, MobILE, to solve the ILFO problem. MobILE carefully trades off strategic exploration against imitation; this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution-matching imitation learning (IL) framework. We provide a unified analysis for MobILE and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well-studied notions of structural complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by presenting an exponential sample complexity separation between IL and ILFO. We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE. Code for implementing the MobILE framework is available at https://github.com/rahulkidambi/MobILE-NeurIPS2021.

## 1 Introduction

This paper considers Imitation Learning from Observation Alone (ILFO). In ILFO, the learner is presented with sequences of states encountered by the expert, without access to the actions taken by the expert, so approaches based on a reduction to supervised learning (e.g., Behavior Cloning (BC) [49], DAgger [50]) are not applicable. ILFO is more general and has potential for applications where the learner and expert have different action spaces, such as sim-to-real [56, 14]. Recently, [59] reduced the ILFO problem to a sequence of one-step distribution matching problems, which results in a non-stationary policy. This approach, however, is sample inefficient for longer-horizon tasks since the algorithm does not reuse previously collected samples when solving the current sub-problem. Another line of work considers model-based methods that infer the expert's actions with either an inverse dynamics [63] or a forward dynamics [16] model; the recovered actions are then fed into an IL approach like BC to output the final policy. These works rely on stronger assumptions that are only satisfied for Markov Decision Processes (MDPs) with injective transition dynamics [68]; we return to this in the related works section.

(Work initiated when RK was a post-doc at Cornell University; work done outside Amazon. 35th Conference on Neural Information Processing Systems (NeurIPS 2021), virtual.)

Figure 1: Expert-performance-normalized scores of ILFO algorithms averaged across 5 seeds in environments with discrete action spaces (Reacher-v2) and continuous action spaces (Hopper-v2 and Walker2d-v2).

We introduce MobILE (Model-based Imitation Learning and Exploring), a model-based framework for solving the ILFO problem. In contrast to existing model-based efforts, MobILE learns the forward transition dynamics model, a quantity that is well defined for any MDP.
Importantly, MobILE combines strategic exploration with imitation by interleaving a model-learning step with a bonus-based, optimistic distribution-matching step, a perspective that, to the best of our knowledge, has not been considered in imitation learning. MobILE automatically trades off exploration and imitation: it simultaneously explores to collect data to refine the model and imitates the expert wherever the learned model is accurate and certain. At a high level, our theoretical results and experimental studies demonstrate that systematic exploration is beneficial for solving ILFO reliably and efficiently, and that optimism is a theoretically sound and practically effective approach for strategic exploration in ILFO (see Figure 1 for comparisons with other ILFO algorithms). This paper extends the realm of partial-information problems (e.g., reinforcement learning and bandits) where optimism has been shown to be crucial in obtaining strong performance, both in theory (e.g., E3 [30], UCB [3]) and in practice (e.g., RND [10]). This paper proves that incorporating optimism into the min-max IL framework [69, 22, 59] is beneficial for both the theoretical foundations and the empirical performance of ILFO.

Our Contributions: We present MobILE (Algorithm 1), a provably efficient, model-based framework for ILFO that offers competitive results on benchmark gym tasks. MobILE can be instantiated with various implementation choices owing to its modular design. This paper's contributions are:

1. The MobILE framework combines ideas from model-based learning, optimism for exploration, and adversarial imitation learning. MobILE achieves global optimality with near-optimal regret bounds for classes of MDP dynamics that satisfy certain well-studied notions of complexity. The key idea of MobILE is to use optimism to trade off imitation and exploration.
2. We show an exponential sample complexity gap between ILFO and classic IL, where one has access to the expert's actions. This indicates that ILFO is fundamentally harder than IL. Our lower bound for ILFO also indicates that to achieve near-optimal regret, one needs to perform systematic exploration rather than random or no exploration, both of which incur sub-optimal regret.
3. We instantiate MobILE with a model ensemble of neural networks and a disagreement-based bonus. We present experimental results on benchmark OpenAI Gym tasks, indicating MobILE compares favorably to or outperforms existing approaches. Ablation studies indicate that optimism indeed significantly improves performance in practice.

### 1.1 Related Works

Imitation Learning (IL) is considered through the lens of two types of approaches: (a) behavior cloning (BC) [45], which casts IL as a reduction to supervised or full-information online learning [49, 50], or (b) (adversarial) inverse RL [40, 1, 69, 17, 22, 29, 18], which involves minimizing various distribution divergences to solve the IL problem, either with the transition dynamics known (e.g., [69]) or unknown (e.g., [22]). MobILE does not assume knowledge of the transition dynamics, is model-based, and operates without access to the expert's actions. For Imitation Learning from Observation Alone (ILFO), [59] presents a model-free approach, FAIL, that outputs a non-stationary policy by reducing the ILFO problem to a sequence of min-max problems, one per time step.
While theoretically sound, this approach cannot share data across different time steps and thus is not data efficient for long-horizon problems. Moreover, FAIL in theory only handles discrete actions. In contrast, our paper learns a stationary policy using model-based approaches, reuses data across all time steps, and extends to continuous action spaces. Another line of work [63, 16, 66] relies on learning an estimate of the expert's actions, often through an inverse dynamics model $P^e(a \mid s, s')$. Unfortunately, an inverse dynamics model is not well defined in many benign problem instances. For instance, [68, remark 1, section 9.3] presents an example showing that inverse dynamics is not well defined except when the MDP dynamics are injective (i.e., no two actions can lead to the same next state from the current state; note that even deterministic transition dynamics do not imply injectivity). Furthermore, ILPO [16] applies only to MDPs with deterministic transition dynamics and discrete actions. MobILE, on the other hand, learns the forward dynamics model, which is always unique and well defined for both deterministic and stochastic transitions, and works with discrete and continuous actions. Another line of work in ILFO relies on hand-crafted cost functions that may encode task-specific knowledge [44, 4, 53]; the performance of the resulting policies depends on the quality of the engineered cost functions. In contrast, MobILE does not require cost function engineering.

Model-based RL has seen several advances [61, 36, 13], including ones based on deep learning (e.g., [34, 19, 38, 24, 37, 65]). Given MobILE's modularity, these advances in model-based RL can be translated into improved algorithms for the ILFO problem. MobILE bears parallels to provably efficient model-based RL approaches including E3 [31, 27], R-MAX [7], UCRL [23], UCBVI [5], Linear MDP [67], LC3 [25], and Witness rank [58], which utilize optimism-based approaches to trade off exploration and exploitation. Our work utilizes optimism to trade off exploration and imitation.

We consider an episodic finite-horizon MDP $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, P^\star, H, c, s_0\}$, where $\mathcal{S}, \mathcal{A}$ are the state and action spaces, $P^\star : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ is the MDP's transition kernel, $H$ is the horizon, $s_0$ is a fixed initial state (our results generalize to a distribution over initial states), and $c$ is the state-dependent cost function $c : \mathcal{S} \mapsto [0, 1]$. Our results extend to the setting where $c : \mathcal{S} \times \mathcal{S} \mapsto [0, 1]$, i.e., the ground-truth cost $c(s, s')$ depends on state and next-state pairs; for simplicity of analysis, we focus on $c : \mathcal{S} \mapsto [0, 1]$ (footnote 2). We denote by $d^\pi_{P} \in \Delta(\mathcal{S} \times \mathcal{A})$ the average state-action distribution of policy $\pi$ under the transition kernel $P$, i.e., $d^\pi_P(s, a) := \frac{1}{H} \sum_{t=1}^{H} \Pr(s_t = s, a_t = a \mid s_0, \pi, P)$, where $\Pr(s_t = s, a_t = a \mid s_0, \pi, P)$ is the probability of reaching $(s, a)$ at time step $t$ starting from $s_0$ by following $\pi$ under transition kernel $P$. We abuse notation and write $s \sim d^\pi_P$ to denote a state $s$ sampled from the state-wise distribution obtained by marginalizing the action out of $d^\pi_P(s, a)$, i.e., $d^\pi_P(s) := \frac{1}{H} \sum_{t=1}^{H} \Pr(s_t = s \mid s_0, \pi, P)$. For a given cost function $f : \mathcal{S} \mapsto [0, 1]$, $V^\pi_{P; f}$ denotes the expected total cost of $\pi$ under transition $P$ and cost function $f$. As in the IL setting, in ILFO the ground-truth cost $c$ is unknown. Instead, we can query the expert, denoted $\pi^e : \mathcal{S} \mapsto \Delta(\mathcal{A})$. Note that the expert $\pi^e$ could be stochastic and does not have to be the optimal policy.
The expert, when queried, provides state-only demonstrations $\tau = \{s_0, s_1, \ldots, s_H\}$, where $s_{t+1} \sim P^\star(\cdot \mid s_t, a_t)$ and $a_t \sim \pi^e(\cdot \mid s_t)$. The goal is to leverage the expert's state-only demonstrations to learn a policy $\pi$ that performs as well as $\pi^e$ in terms of optimizing the ground-truth cost $c$, using a number of expert samples and online samples that is polynomial in problem parameters such as the horizon and the underlying MDP's complexity measures (see Section 4 for precise examples). We track the progress of any (randomized) algorithm by measuring the (expected) regret incurred by a policy $\pi$, defined as $\mathbb{E}[V^\pi] - V^{\pi^e}$, as a function of the number of online interactions used by the algorithm to compute $\pi$.

### 2.1 Function Approximation Setup

Since the ground-truth cost $c$ is unknown, we utilize a function class of discriminators $\mathcal{F} \subset \mathcal{S} \mapsto [0, 1]$ to define costs that can then be used by a planning algorithm (e.g., NPG [26]) for distribution matching with expert states. If the ground-truth cost $c$ depends on $(s, s')$, we use discriminators $\mathcal{F} \subset \mathcal{S} \times \mathcal{S} \mapsto [0, 1]$. Furthermore, we use a model class $\mathcal{P} \subset \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ to capture the ground-truth transition $P^\star$. For the theoretical results in the paper, we assume realizability:

Assumption 1. Assume $\mathcal{F}$ and $\mathcal{P}$ capture the ground-truth cost and transition, i.e., $c \in \mathcal{F}$ and $P^\star \in \mathcal{P}$.

We use the Integral Probability Metric (IPM) with $\mathcal{F}$ as our divergence measure. Note that if $c \in \mathcal{F}$ and $c : \mathcal{S} \mapsto [0, 1]$, then the IPM, defined as $\max_{f \in \mathcal{F}} \mathbb{E}_{s \sim d^\pi} f(s) - \mathbb{E}_{s \sim d^{\pi^e}} f(s)$, directly upper bounds the sub-optimality gap $V^\pi - V^{\pi^e}$, where $V^\pi$ is the expected total cost of $\pi$ under cost function $c$. This justifies why minimizing the IPM between two state distributions suffices [22, 59]. Similarly, if $c$ depends on $(s, s')$, we can simply minimize the IPM between two state-next-state distributions, i.e., $\max_{f} \mathbb{E}_{(s, s') \sim d^\pi} f(s, s') - \mathbb{E}_{(s, s') \sim d^{\pi^e}} f(s, s')$, where the discriminators now take $(s, s')$ as input (footnote 3). To permit generalization, we require $\mathcal{P}$ to have bounded complexity. For analytical simplicity, we assume $\mathcal{F}$ is discrete (but possibly exponentially large), and we require the sample complexity of any PAC algorithm to scale polynomially with its complexity $\ln(|\mathcal{F}|)$. The $\ln|\mathcal{F}|$ complexity can be replaced by conventional complexity measures such as Rademacher complexity or covering numbers for continuous $\mathcal{F}$ (e.g., $\mathcal{F}$ being a Reproducing Kernel Hilbert Space).

Footnote 2: Without any additional assumptions, in ILFO, learning to optimize an action-dependent cost $c(s, a)$ (or $c(s, a, s')$) is not possible. For example, if there are two sequences of actions that generate the same sequence of states, then without seeing the expert's preference over actions, we do not know which actions to commit to.

Footnote 3: We slightly abuse notation and denote by $d^\pi$ the average state-next-state distribution of $\pi$, i.e., $d^\pi(s, s') := d^\pi(s) \int_a \pi(a \mid s)\, P^\star(s' \mid s, a)\, da$.

Algorithm 1 MobILE: The framework of Model-based Imitation Learning and Exploring for ILFO
1: Require: IPM class $\mathcal{F}$, dynamics model class $\mathcal{P}$, policy class $\Pi$, bonus function class $\mathcal{B}$, expert dataset $\mathcal{D}^e = \{s^e_i\}_{i=1}^N$.
2: Initialize policy $\pi_0 \in \Pi$, replay buffer $\mathcal{D}_{-1} = \emptyset$.
3: for $t = 0, \ldots, T-1$ do
4: Execute $\pi_t$ in the true environment $P^\star$ to get samples $\tau_t = \{s_k, a_k\}_{k=0}^{H-1} \cup \{s_H\}$. Append to the replay buffer: $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \tau_t$.
5: Update model and bonus: $\widehat{P}_{t+1} : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ and $b_{t+1} : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^+$ using buffer $\mathcal{D}_t$.
6: Optimistic model-based min-max IL: obtain $\pi_{t+1}$ by solving equation (1) with $\widehat{P}_{t+1}$, $b_{t+1}$, $\mathcal{D}^e$.
7: end for
8: Return $\pi_T$.
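As a reading aid, a minimal Python sketch of this loop is shown below; the component callables (`fit_model`, `fit_bonus`, `solve_minmax_il`) and the gym-style environment interface are illustrative placeholders for the steps described in Section 3, not the released implementation.

```python
# Minimal sketch of the MobILE loop (Algorithm 1). The three components are
# passed in as callables so any concrete instantiation can be plugged in.
def mobile(env, expert_states, policy, fit_model, fit_bonus, solve_minmax_il,
           num_iters, horizon):
    replay_buffer = []                                # stores (s, a, s') triples
    for t in range(num_iters):
        # Line 4: roll out the current policy in the real environment.
        state = env.reset()
        for _ in range(horizon):
            action = policy(state)
            next_state, _, done, _ = env.step(action)
            replay_buffer.append((state, action, next_state))
            state = next_state
            if done:
                break
        # Line 5: refit the dynamics model and the exploration bonus.
        model = fit_model(replay_buffer)
        bonus = fit_bonus(model, replay_buffer)
        # Line 6: optimistic model-based min-max IL, i.e., solve equation (1)
        # inside the learned model with the optimistic cost f(s) - b(s, a).
        policy = solve_minmax_il(model, bonus, expert_states, policy)
    return policy
```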
## 3 Algorithm

We introduce MobILE (Algorithm 1) for the ILFO problem. MobILE utilizes (a) a function class $\mathcal{F}$ for Integral Probability Metric (IPM) based distribution matching, (b) a transition dynamics model class $\mathcal{P}$ for model learning, (c) a bonus parameterization $\mathcal{B}$ for exploration, and (d) a policy class $\Pi$ for policy optimization. At every iteration, MobILE (Algorithm 1) performs the following steps:

1. Dynamics model learning: execute the current policy in the environment to obtain state-action-next-state triples $(s, a, s')$, which are appended to the buffer $\mathcal{D}$, and fit a transition model $\widehat{P}$ on $\mathcal{D}$.
2. Bonus design: design a bonus that incentivizes exploration where the learned dynamics model is uncertain, i.e., the bonus $b(s, a)$ is large at state-action pairs where $\widehat{P}(\cdot \mid s, a)$ is uncertain in estimating $P^\star(\cdot \mid s, a)$, and small where $\widehat{P}(\cdot \mid s, a)$ is certain.
3. Imitation-exploration tradeoff: given discriminators $\mathcal{F}$, model $\widehat{P}$, bonus $b$, and expert dataset $\mathcal{D}^e$, perform distribution matching by solving the model-based IPM objective with bonus:

$$\pi_{t+1} \in \arg\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} \; L(\pi, f; \widehat{P}, b, \mathcal{D}^e) := \mathbb{E}_{(s, a) \sim d^\pi_{\widehat{P}}}\big[ f(s) - b(s, a) \big] - \mathbb{E}_{s \sim \mathcal{D}^e}\big[ f(s) \big], \tag{1}$$

where $\mathbb{E}_{s \sim \mathcal{D}^e} f(s) := \sum_{s \in \mathcal{D}^e} f(s) / |\mathcal{D}^e|$. Intuitively, the bonus cancels out the discriminator's power in parts of the state space where the dynamics model $\widehat{P}$ is not accurate, thus giving MobILE freedom to explore. We first explain MobILE's components and then discuss MobILE's key property, which is to trade off exploration and imitation.

### 3.1 Components of MobILE

This section details MobILE's components.

Dynamics model learning: For the model-fitting step in line 5, we assume we obtain a calibrated model in the sense that $\| \widehat{P}_t(\cdot \mid s, a) - P^\star(\cdot \mid s, a) \|_1 \le \sigma_t(s, a)$ for all $s, a$, for some uncertainty measure $\sigma_t(s, a)$, similar to model-based RL works, e.g., [12]. We discuss ways to estimate $\sigma_t(s, a)$ in the bonus estimation below. There are many examples (discussed in Section 4) that permit efficient estimation of these quantities, including tabular MDPs, the Kernelized Nonlinear Regulator, and nonparametric models such as Gaussian Processes. For a general function class $\mathcal{G} \subset \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$, one can learn $\widehat{g}_t$ by solving a regression problem, i.e.,

$$\widehat{g}_t = \arg\min_{g \in \mathcal{G}} \sum_{(s, a, s') \in \mathcal{D}_t} \| g(s, a) - s' \|_2^2, \tag{2}$$

and setting $\widehat{P}_t(\cdot \mid s, a) = \mathcal{N}\big(\widehat{g}_t(s, a), \sigma^2 I\big)$, where $\sigma$ is the standard deviation of the error induced by $\widehat{g}_t$. In practice, such parameterizations have been employed in several RL settings with $\mathcal{G}$ being a multi-layer perceptron (MLP) based function class (e.g., [48]). In Section 4, we also connect this with prior work in the provable model-based RL literature.

Bonus: We utilize bonuses to incentivize the policy to efficiently explore unknown parts of the state space for improved model learning (and hence better distribution matching). With the uncertainty measure $\sigma_t(s, a)$ obtained from calibrated model fitting, we can simply set the bonus $b_t(s, a) = O(H \sigma_t(s, a))$. How do we obtain $\sigma_t(s, a)$ in practice? For a general class $\mathcal{G}$, given the least-squares solution $\widehat{g}_t$, we can define a version space $\mathcal{G}_t$ as $\mathcal{G}_t = \big\{ g \in \mathcal{G} : \sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \| g(s^i_h, a^i_h) - \widehat{g}_t(s^i_h, a^i_h) \|_2^2 \le z_t \big\}$, with $z_t$ a hyperparameter. The version space $\mathcal{G}_t$ is an ensemble of functions $g \in \mathcal{G}$ whose training error on $\mathcal{D}_t$ is almost as small as that of the least-squares solution $\widehat{g}_t$; in other words, $\mathcal{G}_t$ contains functions that agree on the training set $\mathcal{D}_t$. The uncertainty measure at $(s, a)$ is then the maximum disagreement among models in $\mathcal{G}_t$, with $\sigma_t(s, a) \propto \sup_{f_1, f_2 \in \mathcal{G}_t} \| f_1(s, a) - f_2(s, a) \|_2$. Since all $g \in \mathcal{G}_t$ agree on $\mathcal{D}_t$, a large $\sigma_t(s, a)$ indicates that $(s, a)$ is novel. See Example 3 for more theoretical details. Empirically, disagreement among an ensemble [41, 6, 11, 43, 37] is used to design bonuses that incentivize exploration. We utilize a neural network ensemble, where each model is trained on $\mathcal{D}_t$ (via SGD on the squared loss in Eq. 2) with a different initialization. This approximates the version space $\mathcal{G}_t$, and the bonus is set as a function of the maximum disagreement among the ensemble's predictions.
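As an illustration of this ensemble approximation, here is a minimal sketch that trains two independently initialized regression models on the replay buffer and uses their prediction gap as the uncertainty estimate. The use of scikit-learn's `MLPRegressor`, the two-model ensemble, the network sizes, and the scaling are illustrative choices, not the paper's exact implementation (Section 5 describes the parameterization actually used).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_bonus(replay_buffer, horizon, bonus_scale=1.0):
    """Approximate the version space with a two-model ensemble and return a
    disagreement-based bonus b(s, a); sizes and scaling are illustrative."""
    states, actions, next_states = map(np.asarray, zip(*replay_buffer))
    inputs = np.concatenate([states, actions], axis=1)
    # Two identically specified models, differing only in random initialization,
    # each fit to the same squared-loss regression problem (Eq. 2).
    ensemble = [MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                             random_state=seed).fit(inputs, next_states)
                for seed in (0, 1)]

    def bonus(state, action):
        x = np.concatenate([state, action])[None, :]
        preds = [m.predict(x)[0] for m in ensemble]
        # The two-model prediction gap stands in for the maximum disagreement
        # sup_{g1, g2 in G_t} ||g1(s, a) - g2(s, a)||_2.
        disagreement = np.linalg.norm(preds[0] - preds[1])
        return bonus_scale * horizon * disagreement   # b_t = O(H * sigma_t)

    return bonus
```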
Optimistic model-based min-max IL: For model-based imitation (line 6), MobILE takes the current model $\widehat{P}_t$ and the discriminators $\mathcal{F}$ as inputs and performs policy search to minimize the divergence defined by $\widehat{P}_t$ and $\mathcal{F}$: $d_t(\pi, \pi^e) := \max_{f \in \mathcal{F}} \big[ \mathbb{E}_{(s, a) \sim d^\pi_{\widehat{P}_t}}\big(f(s) - b_t(s, a)\big) - \mathbb{E}_{s \sim d^{\pi^e}} f(s) \big]$. Note that, for a fixed $\pi$, the $\arg\max_{f \in \mathcal{F}}$ is identical with or without the bonus term, since $\mathbb{E}_{(s, a) \sim d^\pi_{\widehat{P}_t}} b_t(s, a)$ is independent of $f$. In our implementation, we use the Maximum Mean Discrepancy (MMD) with a Radial Basis Function (RBF) kernel to model the discriminators $\mathcal{F}$ (footnote 4). We compute $\arg\min_\pi d_t(\pi, \pi^e)$ by iteratively (1) computing the argmax discriminator $f$ given the current $\pi$, and (2) using policy gradient methods (e.g., TRPO) to update $\pi$ inside $\widehat{P}_t$ with $f - b_t$ as the cost. Specifically, to find $\pi_t$ (line 6), we iterate between the following two steps:

1. Cost update: $\hat{f} = \arg\max_{f \in \mathcal{F}} \; \mathbb{E}_{s \sim d^{\hat{\pi}}_{\widehat{P}_t}} f(s) - \mathbb{E}_{s \sim \mathcal{D}^e} f(s)$,
2. PG step: $\hat{\pi} \leftarrow \hat{\pi} - \eta \nabla_\pi V^{\hat{\pi}}_{\widehat{P}_t;\, \hat{f} - b_t}$,

where the PG step uses the learned dynamics model $\widehat{P}_t$ and the optimistic IPM cost $\hat{f}(s) - b_t(s, a)$. Note that for MMD, the cost update step has a closed-form solution.

Footnote 4: For MMD with kernel $k$, $\mathcal{F} = \{ w^\top \phi(s, a) \mid \|w\|_2 \le 1 \}$, where $\phi$ satisfies $\langle \phi(s, a), \phi(s', a') \rangle = k\big((s, a), (s', a')\big)$.
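For concreteness, a minimal sketch of this alternation is given below; `fit_discriminator`, `policy_gradient_step`, and the model's `sample_next_state` interface are assumed placeholders standing in for the MMD closed form, the TRPO update, and the learned Gaussian model used in practice.

```python
def solve_minmax_il(model, bonus, expert_states, policy, start_state,
                    fit_discriminator, policy_gradient_step,
                    num_inner_iters, horizon):
    """Sketch of the optimistic model-based min-max IL step (equation (1))."""
    for _ in range(num_inner_iters):
        # Roll out the current policy inside the *learned* model to obtain
        # samples approximating d^pi_{P_hat}; sample_next_state is an assumed
        # interface of the fitted dynamics model.
        states, state = [start_state], start_state
        for _ in range(horizon):
            state = model.sample_next_state(state, policy(state))
            states.append(state)
        # (1) Cost update: pick the discriminator f maximizing
        #     E_{s ~ d^pi_{P_hat}}[f(s)] - E_{s ~ D^e}[f(s)]
        #     (the bonus does not affect this argmax; closed form for MMD).
        f = fit_discriminator(states, expert_states)
        # (2) PG step: one policy-gradient (e.g., TRPO) update inside the
        #     learned model with the optimistic cost f(s) - b(s, a).
        cost = lambda s, a: f(s) - bonus(s, a)
        policy = policy_gradient_step(policy, model, cost, horizon)
    return policy
```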
### 3.2 Exploration and Imitation Tradeoff

MobILE performs an automatic trade-off between exploration and imitation. More specifically, the bonus is designed to take high values in parts of the state space that have not been visited, and low values in parts that have been frequently visited by the sequence of learned policies so far. Thus, by incorporating the bonus into the discriminator $f \in \mathcal{F}$ (i.e., $\tilde{f}(s, a) = f(s) - b_t(s, a)$), we diminish the power of the discriminator $f$ at novel state-action regions, which relaxes the state-matching constraint there (the bonus cancels the penalty from the discriminators) so that exploration is encouraged. For well-explored states, we force the learner's states to match the expert's using the full power of the discriminators. Our work uses optimism (via coupling the bonus and the discriminators) to carefully balance imitation and exploration.

This section presents a general theorem for MobILE that uses the notion of information gain [57], and then specializes this result to common classes of stochastic MDPs such as discrete (tabular) MDPs, the Kernelized Nonlinear Regulator [28], and general function classes with bounded Eluder dimension [51]. Recall that Algorithm 1 generates one state-action trajectory $\tau^t := \{s^t_h, a^t_h\}_{h=0}^{H}$ at iteration $t$ and estimates the model $\widehat{P}_t$ based on $\mathcal{D}_t = \{\tau^0, \ldots, \tau^{t-1}\}$. We present our theorem under the assumption that model fitting gives us a model $\widehat{P}$ and a confidence interval on the model's predictions.

Assumption 2 (Calibrated Model). For all iterations $t \in \mathbb{N}$, with probability at least $1 - \delta$, we have a model $\widehat{P}_t$ and an associated uncertainty measure $\sigma_t : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^+$ such that for all $(s, a) \in \mathcal{S} \times \mathcal{A}$ (footnote 5), $\| \widehat{P}_t(\cdot \mid s, a) - P^\star(\cdot \mid s, a) \|_1 \le \min\{ \sigma_t(s, a), 2 \}$.

Footnote 5: The uncertainty measure $\sigma_t(s, a)$ depends on the input failure probability $\delta$, which we drop here for notational simplicity. When we introduce specific examples, we are explicit about the dependence on $\delta$, which is usually of order $\ln(1/\delta)$.

Assumption 2 has featured in prior works (e.g., [12]) to prove regret bounds in model-based RL. Below we give examples that satisfy this assumption.

Example 1 (Discrete MDPs). Given $\mathcal{D}_t$, denote by $N(s, a)$ the number of times $(s, a)$ appears in $\mathcal{D}_t$, and by $N(s, a, s')$ the number of times $(s, a, s')$ appears in $\mathcal{D}_t$. We can set $\widehat{P}_t(s' \mid s, a) = N(s, a, s') / N(s, a)$ for all $s, a, s'$, and $\sigma_t(s, a) = \widetilde{O}\big( \sqrt{S / N(s, a)} \big)$.

Example 2 (KNRs [28]). For the Kernelized Nonlinear Regulator, $P^\star(\cdot \mid s, a) = \mathcal{N}\big( W^\star \phi(s, a), \sigma^2 I \big)$, where the feature mapping satisfies $\phi(s, a) \in \mathbb{R}^d$ and $\|\phi(s, a)\|_2 \le 1$ for all $s, a$ (footnote 6). We can learn $\widehat{P}_t$ via kernel ridge regression, i.e., $\widehat{g}_t(s, a) = \widehat{W}_t \phi(s, a)$ with $\widehat{W}_t = \arg\min_W \sum_{(s, a, s') \in \mathcal{D}_t} \| W \phi(s, a) - s' \|_2^2 + \lambda \| W \|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm. The uncertainty measure is $\sigma_t(s, a) = \frac{\beta_t}{\sigma} \| \phi(s, a) \|_{\Sigma_t^{-1}}$, with $\beta_t = \big\{ 2\lambda \| W^\star \|_2^2 + 8\sigma^2 \big[ d_s \ln(5) + 2 \ln(t^2/\delta) + \ln(4) + \ln\big( \det(\Sigma_t) / \det(\lambda I) \big) \big] \big\}^{1/2}$ and $\Sigma_t = \sum_{k=0}^{t-1} \sum_{h=0}^{H-1} \phi(s^k_h, a^k_h) \phi(s^k_h, a^k_h)^\top + \lambda I$ for $\lambda > 0$. See Proposition 12 for more details. Similar to the RKHS case, Gaussian processes (GPs) offer a calibrated model [57]. GPs yield regret bounds similar to the RKHS setting, so we do not discuss them and instead refer readers to [12].

Footnote 6: The covariance matrix can be generalized to any PSD matrix with bounded condition number.

Example 3 (General class $\mathcal{G}$). In this case, assume $P^\star(\cdot \mid s, a) = \mathcal{N}(g^\star(s, a), \sigma^2 I)$ with $g^\star \in \mathcal{G}$. Assume $\mathcal{G}$ is discrete (but possibly exponentially large, with complexity measure $\ln(|\mathcal{G}|)$) and $\sup_{g \in \mathcal{G}, s, a} \| g(s, a) \|_2 \le G \in \mathbb{R}^+$. Suppose the model-learning step is done by least squares: $\widehat{g}_t = \arg\min_{g \in \mathcal{G}} \sum_{k=0}^{t-1} \sum_{h=0}^{H-1} \| g(s^k_h, a^k_h) - s^k_{h+1} \|_2^2$. Compute a version space $\mathcal{G}_t = \big\{ g \in \mathcal{G} : \sum_{k=0}^{t-1} \sum_{h=0}^{H-1} \| g(s^k_h, a^k_h) - \widehat{g}_t(s^k_h, a^k_h) \|_2^2 \le z_t \big\}$ with $z_t = 2\sigma^2 G^2 \ln(2 t^2 |\mathcal{G}| / \delta)$, and use it for the uncertainty computation: set $\sigma_t(s, a) = \frac{1}{\sigma} \max_{g_1, g_2 \in \mathcal{G}_t} \| g_1(s, a) - g_2(s, a) \|_2$, i.e., the maximum disagreement between any two functions in the version space $\mathcal{G}_t$. Refer to Proposition 14 for more details.

The maximum disagreement above motivates our practical implementation, where we use an ensemble of neural networks to approximate the version space and use the maximum disagreement among the models' predictions as the bonus. We refer readers to Section 6 for more details.
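As a concrete instance of Example 1, the sketch below builds the empirical transition kernel from visit counts and returns the count-based uncertainty $\sigma_t(s, a) \approx \sqrt{S / N(s, a)}$; the constant factor (here 1), the cap at 2, and the uniform default for unvisited pairs are illustrative choices rather than the exact constants in the analysis.

```python
import numpy as np

def tabular_calibrated_model(transitions, num_states, num_actions):
    """Example 1: empirical transition kernel and count-based uncertainty for
    a discrete MDP; transitions is a list of (s, a, s') index triples."""
    counts = np.zeros((num_states, num_actions, num_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=-1)                      # N(s, a)
    # Empirical model P_hat(s'|s, a) = N(s, a, s') / N(s, a); unvisited (s, a)
    # pairs default to a uniform distribution (an illustrative choice).
    p_hat = np.where(visits[..., None] > 0,
                     counts / np.maximum(visits[..., None], 1),
                     1.0 / num_states)
    # Uncertainty sigma_t(s, a) ~ sqrt(S / N(s, a)), capped at 2 as in the
    # calibration condition of Assumption 2; unvisited pairs get the cap.
    sigma = np.minimum(np.sqrt(num_states / np.maximum(visits, 1)), 2.0)
    sigma = np.where(visits > 0, sigma, 2.0)
    return p_hat, sigma
```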
### 4.1 Regret Bound

We bound the regret with a quantity called the information gain $I_T$ (up to constant scaling factors) [57]:

$$I_T := \max_{\mathrm{Alg}} \, \mathbb{E}_{\mathrm{Alg}} \left[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\big\{ \sigma_t^2(s^t_h, a^t_h), 1 \big\} \right],$$

where $\mathrm{Alg}$ is any adaptive algorithm (thus including Algorithm 1) that maps the history before iteration $t$ to some policy $\pi_t \in \Pi$. After the main theorem, we give concrete examples for $I_T$, where we show that $I_T$ has an extremely mild (i.e., logarithmic) growth rate with respect to $T$. Denote by $V^\pi$ the expected total cost of $\pi$ under the true cost function $c$ and the real dynamics $P^\star$.

Theorem 3 (Main result). Assume model learning is calibrated (i.e., Assumption 2 holds for all $t$) and Assumption 1 holds. In Algorithm 1, set the bonus $b_t(s, a) := H \min\{ \sigma_t(s, a), 2 \}$. Then there exists a set of parameters such that, after running Algorithm 1 for $T$ iterations, $\mathbb{E}\big[ \min_{t \in \{0, \ldots, T-1\}} V^{\pi_t} \big] - V^{\pi^e}$ is bounded by an error term controlled by the information gain $I_T$ and the number of expert samples $N$; the explicit bound and its proof are given in Appendix A.

This theorem indicates that as long as $I_T$ grows sublinearly, i.e., $o(T)$, we find a policy that is at least as good as the expert policy as $T$ and $N$ approach infinity. For any discrete MDP, KNR [28], Gaussian Process model [57], and general $\mathcal{G}$ with bounded Eluder dimension [52, 42], we can show that the growth rate of $I_T$ with respect to $T$ is mild.

Corollary 4 (Discrete MDP). For discrete MDPs, $I_T = \widetilde{O}(H S^2 A)$, where $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$; plugging this into Theorem 3 gives the corresponding regret bound. Note that Corollary 4 (proof in Appendix A.1) holds for any MDP (not just injective MDPs) and any stochastic expert policy. The dependence on $A$ and $T$ is tight (see the lower bound in Section 4.2). We now specialize Theorem 3 to continuous MDPs.

Corollary 5 (KNRs (Example 2)). For simplicity, consider the finite-dimensional setting $\phi : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^d$. We can show that $I_T = \widetilde{O}(Hd + H d d_s + H d^2)$ (see Proposition 13 for details), where $d$ is the dimension of the feature $\phi(s, a)$ and $d_s$ is the dimension of the state space. Combined with Theorem 3 (footnote 7), this yields a regret bound scaling with $H^3$, $d d_s$, and $d^2$.

Footnote 7: We use $\widetilde{O}$ to suppress logarithmic terms except $\ln(|\mathcal{G}|)$ and $\ln(|\mathcal{F}|)$, which represent the complexity of $\mathcal{G}$ and $\mathcal{F}$.

Corollary 6 (General $\mathcal{G}$ with bounded Eluder dimension (Example 3)). For general $\mathcal{G}$, assume that $\mathcal{G}$ has Eluder dimension $d_E(\epsilon)$ (Definition 3 in [42]) and denote $d_E = d_E(1/TH)$. The information gain is upper bounded as $I_T = O\big( H d_E + d_E \ln(T^3 H |\mathcal{G}|) \ln(TH) \big)$ (see Proposition 16); combined with Theorem 3, this yields a regret bound scaling with $d_E \ln(TH|\mathcal{G}|)$. Thus, as long as $\mathcal{G}$ has bounded complexity in terms of the Eluder dimension [52, 42], MobILE with the maximum-disagreement-based optimism leads to near-optimal guarantees.

### 4.2 Exploration in ILFO and the Exponential Gap between IL and ILFO

To show the benefit of strategic exploration over random exploration in ILFO, we present a novel reduction of the ILFO problem to a bandit optimization problem, for which strategic exploration is known to be necessary [9] for optimal bounds while random exploration is suboptimal; this indicates the benefit of strategic exploration for solving ILFO efficiently. The reduction also demonstrates an exponential gap in sample complexity between ILFO and classic IL with access to expert actions. We leave the details of the reduction framework to Appendix A.4. The reduction allows us to derive the following lower bound for any ILFO algorithm.

Theorem 7. There exists an MDP with number of actions $A \ge 2$ such that, even with infinitely many expert demonstrations, any ILFO algorithm must incur expected cumulative regret $\Omega(\sqrt{AT})$.

Specifically, we rely on a reduction showing that solving ILFO, even with infinite expert data, is at least as hard as solving a multi-armed bandit (MAB) problem in which the optimal arm's mean reward is known, which itself incurs the same worst-case $\sqrt{AT}$ cumulative regret bound as the classic MAB setting. For MAB, it is known that random exploration such as $\epsilon$-greedy incurs suboptimal regret $O(T^{2/3})$. Thus, to achieve the optimal $\sqrt{T}$ rate, one needs to leverage strategic exploration (e.g., optimism). Methods such as BC for IL have sample complexity that scales as $\mathrm{poly}\,\ln(A)$; e.g., [2, Theorem 14.3, Chapter 14] shows that for tabular MDPs, BC learns a policy whose performance is $O\big( H^2 \sqrt{S \ln(A) / N} \big)$ away from the expert's performance (here $S$ is the number of states in the tabular MDP).
Similarly, in the interactive IL setting, DAgger [50] can also achieve $\mathrm{poly}\,\ln(A)$ dependence in its sample complexity. The exponential gap in the sample complexity dependence on $A$ between IL and ILFO formalizes the additional difficulty encountered by learning algorithms in ILFO.

## 5 Practical Instantiation of MobILE

We present a brief practical instantiation of MobILE's components, with details in Appendix C.

Dynamics model learning: We employ Gaussian dynamics models parameterized by an MLP [48, 32], i.e., $\widehat{P}(s, a) := \mathcal{N}(h_\theta(s, a), \sigma^2 I)$, where $h_\theta(s, a) = s + \sigma_{\Delta s} \cdot \mathrm{MLP}_\theta(s_c, a_c)$. Here $\theta$ are the MLP's trainable parameters, $s_c = (s - \mu_s)/\sigma_s$ and $a_c = (a - \mu_a)/\sigma_a$, with $\mu_s, \mu_a$ (and $\sigma_s, \sigma_a$) being the means (and standard deviations) of the states and actions in the replay buffer $\mathcal{D}$. For $(s, a, s') \in \mathcal{D}$, $\Delta s = s' - s$ and $\sigma_{\Delta s}$ is the standard deviation of the state differences $\Delta s$ in $\mathcal{D}$. We use SGD with momentum [60] to train the parameters $\theta$ of the MLP.

Discriminator parameterization: We utilize MMD as our choice of IPM and define the discriminator as $f(s) = w^\top \psi(s)$, where $\psi(s)$ are Random Fourier Features [46].

Bonus parameterization: We utilize the discrepancy between the predictions of a pair of dynamics models $h_{\theta_1}(s, a)$ and $h_{\theta_2}(s, a)$ to design the bonus. Empirically, we found that using more than two models in the ensemble offered little to no improvement. Denote the disagreement at any $(s, a)$ as $\delta(s, a) = \| h_{\theta_1}(s, a) - h_{\theta_2}(s, a) \|_2$, and let $\delta_{\mathcal{D}} = \max_{(s, a) \in \mathcal{D}} \delta(s, a)$ be the maximum discrepancy over the replay buffer $\mathcal{D}$. We set the bonus to $b(s, a) = \lambda \min\big( \delta(s, a) / \delta_{\mathcal{D}}, 1 \big)$, where $\lambda > 0$ is a tunable parameter.

PG oracle: We use TRPO [54] to perform incremental policy optimization inside the learned model.
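To illustrate these parameterizations, the sketch below shows (i) the normalized residual prediction $h_\theta(s, a) = s + \sigma_{\Delta s} \cdot \mathrm{MLP}_\theta(s_c, a_c)$ wrapped around a generic `mlp` callable, and (ii) a random-Fourier-feature map for the MMD discriminator. The feature dimension, bandwidth, and the `mlp` interface are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def make_dynamics_head(mlp, buffer_states, buffer_actions, buffer_next_states):
    """Wrap a trainable mlp(s_c, a_c) into h_theta(s, a) = s + sigma_ds * mlp(...),
    using replay-buffer statistics for normalization (Section 5)."""
    mu_s, sd_s = buffer_states.mean(0), buffer_states.std(0) + 1e-8
    mu_a, sd_a = buffer_actions.mean(0), buffer_actions.std(0) + 1e-8
    sd_ds = (buffer_next_states - buffer_states).std(0) + 1e-8

    def h(state, action):
        s_c, a_c = (state - mu_s) / sd_s, (action - mu_a) / sd_a
        return state + sd_ds * mlp(s_c, a_c)      # mean of N(h(s, a), sigma^2 I)
    return h

def make_rff_features(state_dim, num_features=128, bandwidth=1.0, seed=0):
    """Random Fourier features psi(s) approximating an RBF kernel; the MMD
    discriminator is then f(s) = w^T psi(s) with ||w||_2 <= 1."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / bandwidth, size=(num_features, state_dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def psi(state):
        return np.sqrt(2.0 / num_features) * np.cos(omega @ state + phase)
    return psi
```

With $\psi$ in hand, the closed-form MMD cost update from Section 3.1 amounts to choosing $w$ proportional to the difference of feature means, $\mathbb{E}_{s \sim d^\pi_{\widehat{P}}} \psi(s) - \mathbb{E}_{s \sim \mathcal{D}^e} \psi(s)$, normalized to unit norm.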
## 6 Experiments

This section seeks to answer the following questions: (1) How does MobILE compare against other benchmark algorithms? (2) How does optimism impact sample efficiency and final performance? (3) How does increasing the number of expert samples impact the quality of the policy output by MobILE?

We consider tasks from OpenAI Gym [8] simulated with MuJoCo [62]: CartPole-v1, Reacher-v2, Swimmer-v2, Hopper-v2, and Walker2d-v2. We train an expert for each task using TRPO [54] until we obtain an expert policy of average value approximately 460, -10, 38, 3000, and 2000 respectively. We set up Swimmer-v2, Hopper-v2, and Walker2d-v2 similarly to prior model-based RL works [33, 39, 38, 48, 32]. We compare MobILE against the following algorithms: Behavior Cloning (BC), GAIL [22], BC-O [63], ILPO [16] (for environments with discrete actions), and GAIFO [64]. Recall that BC and GAIL utilize both expert states and actions, information that is not available in ILFO; this makes BC and GAIL idealistic targets for ILFO methods like MobILE to compare against. As reported by Torabi et al. [63], BC outperforms BC-O in all benchmark results. Moreover, our results indicate MobILE outperforms GAIL and GAIFO in terms of sample efficiency. With a reasonable amount of parameter tuning, BC serves as a very strong baseline and nearly solves the deterministic MuJoCo environments. We use the code released by the authors for BC-O and ILPO. For GAIL we use an open-source implementation [21], and for GAIFO we modify the GAIL implementation as described by its authors. We present our results through (a) learning curves obtained by averaging the progress of each algorithm across 5 seeds, and (b) a bar plot showing expert-normalized scores averaged across 5 seeds using the best performing policy obtained with each seed. The normalized score is the ratio of a policy's score to the expert's score (so that the expert has a normalized score of 1). For Reacher-v2, since the expert policy has a negative score, we add a constant before normalization. More details can be found in Appendix C.

Figure 2: Comparing MobILE (red) against BC (orange), BC-O (green), GAIL (purple), GAIFO (periwinkle), and ILPO (green olive) on CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), and Walker2d-v2 (10 traj.); each panel plots return against the number of online samples. The learning curves are obtained by averaging all algorithms over 5 seeds. MobILE outperforms BC-O and GAIL and matches BC's behavior despite not having access to expert actions. The bar plot (bottom-right) presents the best performing policy output by each algorithm, averaged across 5 seeds. MobILE clearly outperforms BC-O, GAIFO, and ILPO while matching the behavior of IL algorithms like BC/GAIL, which use expert actions.

### 6.1 Benchmarking MobILE on the MuJoCo Suite

Figure 2 compares MobILE with BC, BC-O, GAIL, GAIFO, and ILPO. MobILE consistently matches or exceeds BC/GAIL's performance despite BC/GAIL having access to the actions taken by the expert and MobILE functioning without expert action information. MobILE also consistently improves upon the behavior of ILFO methods such as BC-O, ILPO, and GAIFO. We see that BC does remarkably well in these benchmarks owing to determinism in the transition dynamics; in the appendix, we consider a variant of the CartPole environment with stochastic dynamics. Our results suggest that BC struggles with stochasticity in the dynamics and fails to solve that task, while MobILE continues to solve it reliably. Also note that we utilize 10 expert trajectories for all environments except Swimmer-v2, where all algorithms (including MobILE) exhibit high variance; we include a learning curve for Swimmer-v2 with 10 expert trajectories in the appendix. The bar plot in Figure 2 shows that, within the sample budget shown in the learning curves, MobILE (being a model-based algorithm) presents superior performance in terms of matching the expert, indicating it is more sample efficient than GAIFO and GAIL (both model-free methods), ILPO, and BC-O.

### 6.2 Importance of the Optimistic MDP Construction

Figure 3 presents results obtained by running MobILE with and without optimism. In the absence of optimism, the algorithm either tends to be sample inefficient in achieving expert performance or completely fails to solve the problem.

Figure 3: Learning curves obtained by running MobILE with (red) and without (green) optimism on CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), and Walker2d-v2 (10 traj.). Without optimism, the algorithm learns slowly or does not match the expert, whereas with optimism, MobILE shows improved behavior by automatically trading off exploration and imitation.
Note that without optimism, the algorithm is not explicitly incentivized to explore; it explores only implicitly through the noise induced by sampling actions. This, however, is not sufficient to solve the problem efficiently. In contrast, MobILE with optimism presents improved behavior and, in most cases, solves the environments with fewer online interactions.

### 6.3 Varying the Number of Expert Samples

Table 1: Expert-normalized score and standard deviation of the policy output by MobILE when varying the number of expert trajectories, denoted E1 and E2 (trajectory counts in parentheses).

| Environment | E1 | E2 | Expert |
| --- | --- | --- | --- |
| CartPole-v1 | 1.07 ± 0.15 (5) | 1.14 ± 0 (10) | 1 ± 0.25 |
| Reacher-v2 | 1.01 ± 0.05 (10) | 0.997 ± 0.055 (20) | 1 ± 0.11 |
| Swimmer-v2 | 1.54 ± 1.1 (10) | 1.25 ± 0.15 (40) | 1 ± 0.05 |
| Hopper-v2 | 1.11 ± 0.064 (10) | 1.16 ± 0.03 (40) | 1 ± 0.16 |
| Walker2d-v2 | 0.975 ± 0.12 (10) | 0.94 ± 0.038 (50) | 1 ± 0.25 |

Table 1 shows the impact of increasing the number of samples drawn from the expert policy on solving the ILFO problem. The main takeaway is that increasing the number of expert samples helps MobILE solve the problem more reliably (i.e., with less variance).

## 7 Conclusions

This paper introduces MobILE, a model-based ILFO approach that is applicable to MDPs with stochastic dynamics and continuous action spaces. MobILE trades off exploration and imitation, and this perspective is shown to be important for solving ILFO efficiently both in theory and in practice. Future work includes exploring other means of learning dynamics models, performing strategic exploration, and extending MobILE to problems with rich observation spaces (e.g., videos). By not even needing actions to imitate, ILFO algorithms can capitalize on the large amounts of video data available online. Moreover, in ILFO, the learner is successful if it learns to imitate the expert; an expert policy designed by bad actors can therefore lead to new policies that imitate it and negatively influence society. With this perspective in mind, any expert policy must be thoroughly vetted to ensure that ILFO algorithms, including MobILE, are employed in ways that benefit society.

## Acknowledgements

Rahul Kidambi acknowledges funding from NSF TRIPODS Award CCF 1740822 at Cornell University. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

## References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML. ACM, 2004.
[2] Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep., 2019.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
[4] Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching YouTube. In NeurIPS, pages 2935-2945, 2018.
[5] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263-272, 2017.
[6] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. In ITA, pages 1-9. IEEE, 2018.
[7] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213-231, 2001.
[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[9] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and non-stochastic multi-armed bandit problems. Found. Trends Mach. Learn., 5(1):1-122, 2012.
[10] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
[11] Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation. In ICLR. OpenReview.net, 2019.
[12] Sebastian Curi, Felix Berkenkamp, and Andreas Krause. Efficient model-based reinforcement learning through optimistic policy search and planning. arXiv preprint arXiv:2006.08684, 2020.
[13] Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465-472, 2011.
[14] Siddharth Desai, Ishan Durugkar, Haresh Karnan, Garrett Warnell, Josiah Hanna, and Peter Stone. An imitation from observation approach to transfer learning with dynamics mismatch. Advances in Neural Information Processing Systems, 33, 2020.
[15] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
[16] Ashley D. Edwards, Himanshu Sahni, Yannick Schroecker, and Charles L. Isbell Jr. Imitating latent policies from observation. In ICML, 2019.
[17] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016.
[18] Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pages 1259-1277. PMLR, 2020.
[19] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration, 2016.
[20] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018.
[21] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable Baselines. https://github.com/hill-a/stable-baselines, 2018.
[22] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. CoRR, abs/1606.03476, 2016.
[23] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600, 2010.
[24] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. CoRR, abs/1906.08253, 2019.
[25] Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466, 2020.
[26] Sham M. Kakade. A natural policy gradient. In NIPS, pages 1531-1538, 2001.
[27] Sham M. Kakade, Michael J. Kearns, and John Langford. Exploration in metric state spaces. In ICML, 2003.
[28] Sham M. Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control. In NeurIPS, 2020.
[29] Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. arXiv preprint arXiv:1905.12888, 2019.
[30] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.
[31] Michael Kearns and Satinder Singh. Near optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.
[32] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. CoRR, abs/2005.05951, 2020.
[33] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In ICLR. OpenReview.net, 2018.
[34] Thomas Lampe and Martin A. Riedmiller. Approximate model-assisted neural fitted Q-iteration. In IJCNN, pages 2698-2704. IEEE, 2014.
[35] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[36] Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO, pages 222-229, 2004.
[37] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan Online, Learn Offline: Efficient learning and exploration via model-based control. In International Conference on Learning Representations (ICLR), 2019.
[38] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
[39] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In IEEE International Conference on Robotics and Automation, pages 7559-7566, 2018.
[40] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proc. ICML, pages 663-670, 2000.
[41] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. CoRR, abs/1806.03335, 2018.
[42] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the Eluder dimension. In Advances in Neural Information Processing Systems, pages 1466-1474, 2014.
[43] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In ICML, pages 5062-5071, 2019.
[44] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graphics, 2018.
[45] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Technical report, CMU, 1989.
[46] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2008.
[47] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards generalization and simplicity in continuous control. In NIPS, 2017.
[48] Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. arXiv, abs/2004.07804, 2020.
[49] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and D. Mike Titterington, editors, AISTATS, JMLR Proceedings, pages 661-668, 2010.
[50] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, pages 627-635, 2011.
[51] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In NIPS, pages 2256-2264, 2013.
[52] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.
[53] Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learning with videos: Combining offline observations with interaction. CoRR, abs/2011.06507, 2020.
[54] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
[55] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[56] Yuda Song, Aditi Mavalankar, Wen Sun, and Sicun Gao. Provably efficient model-based policy adaptation. In International Conference on Machine Learning, pages 9088-9098. PMLR, 2020.
[57] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
[58] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898-2933. PMLR, 2019.
[59] Wen Sun, Anirudh Vemula, Byron Boots, and Drew Bagnell. Provably efficient imitation learning from observation alone. In ICML, volume 97. PMLR, 2019.
[60] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, volume 28, 2013.
[61] R. S. Sutton. First results with Dyna, an integrated architecture for learning, planning, and reacting. In Neural Networks for Control, pages 179-189. The MIT Press: Cambridge, MA, USA, 1990.
[62] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE International Conference on Intelligent Robots and Systems, pages 5026-5033, 2012.
[63] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In IJCAI, pages 4950-4957, 2018.
[64] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018.
[65] Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
[66] Chao Yang, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu, Junzhou Huang, and Chuang Gan. Imitation learning from observations by minimizing inverse dynamics disagreement. In NeurIPS, 2019.
[67] Lin F. Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
[68] Zhuangdi Zhu, Kaixiang Lin, Bo Dai, and Jiayu Zhou. Off-policy imitation learning from observations. In NeurIPS, 2020.
[69] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438. Chicago, IL, USA, 2008.