Provably Efficient Imitation Learning from Observation Alone

Wen Sun 1, Anirudh Vemula 1, Byron Boots 2, J. Andrew Bagnell 3

1 Robotics Institute, Carnegie Mellon University, USA. 2 College of Computing, Georgia Institute of Technology, USA. 3 Aurora Innovation, USA. Correspondence to: Wen Sun.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We study Imitation Learning (IL) from Observations alone (ILFO) in large-scale MDPs. While most IL algorithms rely on an expert to directly provide actions to the learner, in this setting the expert only supplies sequences of observations. We design a new model-free algorithm for ILFO, Forward Adversarial Imitation Learning (FAIL), which learns a sequence of time-dependent policies by minimizing an Integral Probability Metric between the observation distributions of the expert policy and the learner. FAIL is the first provably efficient algorithm in the ILFO setting: it learns a near-optimal policy with a number of samples that is polynomial in all relevant parameters but independent of the number of unique observations. The resulting theory extends the domain of provably sample-efficient learning algorithms beyond existing results, which typically only consider tabular reinforcement learning settings or settings that require access to a near-optimal reset distribution. We also demonstrate the efficacy of FAIL on multiple OpenAI Gym control tasks.

1. Introduction

Imitation Learning (IL) is a sample-efficient approach to policy optimization (Ross et al., 2011; Ross & Bagnell, 2014; Sun et al., 2017) that has been extensively used in real applications, including Natural Language Processing (Daumé III et al., 2009; Chang et al., 2015b;a), game playing (Silver et al., 2016; Hester et al., 2017), system identification (Venkatraman et al., 2015; Sun et al., 2016), and robotics control tasks (Pan et al., 2018). Most previous IL work considers settings where an expert can directly provide action signals to the learner. In these settings, a general strategy is to directly learn a policy that maps from state to action via supervised learning approaches (e.g., DAgger (Ross et al., 2011), AggreVaTe (Ross & Bagnell, 2014), Behaviour Cloning (Syed & Schapire, 2010)). Another popular strategy is to learn a policy by minimizing some divergence between the policy's state-action distribution and the expert's state-action distribution. Popular divergences include the forward KL divergence (i.e., Behaviour Cloning) and the Jensen-Shannon divergence (e.g., GAIL (Ho & Ermon, 2016)).

Here, we consider a more challenging IL setting, where the expert's demonstrations consist only of observations, no action or reward signals are available to the learner, and no reset is allowed (e.g., a robot learns a task by just watching an expert performing the task). We call this setting Imitation Learning from Observations alone (ILFO). In this setting, without access to expert actions, approaches like DAgger, AggreVaTe, GAIL, and Behaviour Cloning by definition cannot work. Although several model-based approaches have recently been proposed that learn an inverse model predicting the actions taken by an expert from successive observations (Torabi et al., 2018; Edwards et al., 2018), these approaches can suffer from covariate shift (Ross et al., 2011).
While we wish to train a predictor that can infer the expert's actions accurately under the expert's observation distribution, we do not have access to actions generated by the expert conditioned on the expert's observations (see Section 6 for a more detailed discussion). An alternative strategy is to handcraft cost functions that use some distance metric to penalize deviation from the expert's trajectories (e.g., Liu et al. (2018); Peng et al. (2018)), and then optimize the handcrafted cost with Reinforcement Learning (RL). These methods typically involve hand-designed cost functions that sometimes require prior task-specific knowledge (Peng et al., 2018). The quality of the learned policy is therefore completely dependent on the hand-designed cost, which could be widely different from the true cost. Ideally, we would like to learn a policy that minimizes the unknown true cost function of the underlying MDP.

In this work, we explicitly consider learning near-optimal policies in a sample- and computationally efficient manner. Specifically, we focus on large-scale MDPs where the number of unique observations is extremely large (e.g., high-dimensional observations such as raw-pixel images). Such large-scale MDP settings immediately exclude most existing sample-efficient RL algorithms, which are often designed for small tabular MDPs: their sample complexities have a polynomial dependency on the number of observations and hence cannot scale well. To solve large-scale MDPs, we need to design algorithms that leverage function approximation for generalization. Specifically, we are interested in algorithms with the following three properties: (1) near-optimal performance guarantees, i.e., we want to search for a policy whose performance is close to the expert's in terms of the expected total cost of the underlying MDP (and not a hand-designed cost function); (2) sample efficiency: we require sample complexity that scales polynomially with respect to all relevant parameters (e.g., horizon, number of actions, statistical complexity of function approximators) except the cardinality of the observation space, hence excluding PAC RL algorithms designed for small tabular MDPs; (3) computational efficiency: we rely on the notion of oracle efficiency (Agarwal et al., 2014) and require the number of efficient oracle calls to scale polynomially, thereby excluding recently proposed algorithms for Contextual Decision Processes that are not computationally efficient (Jiang et al., 2016; Sun et al., 2018). To the best of our knowledge, the desiderata above require designing new algorithms.

With access to the expert's trajectories of observations, we introduce a model-free algorithm, called Forward Adversarial Imitation Learning (FAIL), that decomposes ILFO into $H$ independent two-player min-max games, where $H$ is the horizon length. We aim to learn a sequence of time-dependent policies from $h = 1$ to $H$, where at any time step $h$, the policy $\pi_h$ is learned such that the generated observation distribution at time step $h+1$, conditioned on $\pi_1, \ldots, \pi_{h-1}$ being fixed, matches the expert's observation distribution at time step $h+1$ in terms of an Integral Probability Metric (IPM) (Müller, 1997). IPMs are a family of divergences that can be understood as using a set of discriminators to distinguish two distributions (e.g., the Wasserstein distance is one such special instance).
We analyze the sample complexity of FAIL and show that FAIL can learn a near-optimal policy with a sample complexity that does not explicitly depend on the cardinality of the observation space, but only on complexity measures of the policy class and the discriminator class. Hence FAIL satisfies the three properties above. The resulting theory extends the domain of provably sample-efficient learning algorithms beyond existing results, which typically only consider tabular reinforcement learning settings (e.g., Dann & Brunskill (2015)) or settings that require access to a near-optimal reset distribution (e.g., Kakade & Langford (2002); Bagnell et al. (2004); Munos & Szepesvári (2008)). We also demonstrate that learning under ILFO can be exponentially more sample efficient than pure RL. We further study FAIL under three specific settings: (1) Lipschitz continuous MDPs, (2) interactive ILFO, where one can query the expert at any time during training, and (3) state abstraction. Finally, we demonstrate the efficacy of FAIL on multiple continuous control tasks.

2. Preliminaries

We consider an episodic finite-horizon decision process $(\{\mathcal{X}_h\}_{h=1}^H, \mathcal{A}, c, H, \rho, P)$, where $\mathcal{X}_h$ for $h \in [H]$ is a time-dependent observation space,¹ $\mathcal{A}$ is a discrete action space with $|\mathcal{A}| = K \in \mathbb{N}^+$, and $H$ is the horizon. We assume the cost function $c : \mathcal{X}_H \to \mathbb{R}$ is only defined at the last time step $H$ (e.g., sparse cost), $\rho \in \Delta(\mathcal{X}_1)$ is the initial observation distribution, and $P$ is the transition model, i.e., $P : \mathcal{X}_h \times \mathcal{A} \to \Delta(\mathcal{X}_{h+1})$ for $h \in [H-1]$. Note that here we assume the cost function only depends on observations. We assume that $\mathcal{X}_h$ for all $h \in [H]$ is discrete but $|\mathcal{X}_h|$ is extremely large, so any sample complexity with a polynomial dependency on $|\mathcal{X}_h|$ should be considered sample inefficient, i.e., one cannot afford to visit every unique observation. We assume the cost is bounded, i.e., for any sequence of observations, $c(x_H) \in [0, 1]$ (e.g., a zero-one loss at the end of each episode).

For a time-dependent policy $\pi = \{\pi_1, \ldots, \pi_H\}$ with $\pi_h : \mathcal{X}_h \to \Delta(\mathcal{A})$, the value function $V^\pi_h : \mathcal{X}_h \to [0,1]$ is defined as $V^\pi_h(x_h) = \mathbb{E}\left[c(x_H) \mid a_i \sim \pi_i(\cdot|x_i),\; x_{i+1} \sim P_{x_i, a_i}\right]$, the state-action value function is $Q^\pi_h(x_h, a_h) = \mathbb{E}_{x_{h+1} \sim P_{x_h, a_h}}\left[V^\pi_{h+1}(x_{h+1})\right]$, and $V^\pi_H(x) = c(x)$. We denote by $\mu^\pi_h$ the distribution over $\mathcal{X}_h$ at time step $h$ induced by following $\pi$. Given $H$ policy classes $\{\Pi_1, \ldots, \Pi_H\}$, the goal is to learn $\pi = \{\pi_1, \ldots, \pi_H\}$ with $\pi_h \in \Pi_h$ that minimizes the expected cost: $J(\pi) = \mathbb{E}\left[c(x_H) \mid a_h \sim \pi_h(\cdot|x_h),\; x_{h+1} \sim P(\cdot|x_h, a_h)\right]$.

Denote $\mathcal{F}_h \subseteq \{f : \mathcal{X}_h \to \mathbb{R}\}$ for $h \in [H]$. We define the Bellman operator $\Gamma_h$ associated with the expert $\pi^\star_h$ at time step $h$ as $\Gamma_h : \mathcal{F}_{h+1} \to \{f : \mathcal{X}_h \to \mathbb{R}\}$, where for any $x_h \in \mathcal{X}_h$ and $f \in \mathcal{F}_{h+1}$,

$(\Gamma_h f)(x_h) \triangleq \mathbb{E}_{a_h \sim \pi^\star_h(\cdot|x_h),\, x_{h+1} \sim P_{x_h, a_h}}\left[f(x_{h+1})\right].$

Notation. For a function $f : \mathcal{X} \to \mathbb{R}$, we denote $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$, and $\|f\|_L$ is the Lipschitz constant $\|f\|_L = \sup_{x_1 \neq x_2} (f(x_1) - f(x_2))/d(x_1, x_2)$, with $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$ being the metric on the space $\mathcal{X}$.² We consider a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ defined by a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to [0,1]$: $\mathcal{H}$ is the span of $\{k(x, \cdot) : x \in \mathcal{X}\}$, and we have $f(x) = \langle f, k(x, \cdot) \rangle$ with $\langle k(x_1, \cdot), k(x_2, \cdot) \rangle \triangleq k(x_1, x_2)$.

¹ We use the term observation throughout the paper instead of the term state, as one would normally use in defining MDPs, for the purpose of sharply distinguishing our setting from tabular MDPs where $\mathcal{X}$ has a very small number of states.
² A metric $d$ (e.g., Euclidean distance) satisfies the following conditions: $d(x,y) \ge 0$, $d(x,y) = 0$ iff $x = y$, $d(x,y) = d(y,x)$, and $d$ satisfies the triangle inequality.
For any $f \in \mathcal{H}$, we define $\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle$. We denote by $U(\mathcal{A})$ the uniform distribution over the action set $\mathcal{A}$. For $N \in \mathbb{N}^+$, we denote $[N] \triangleq \{1, 2, \ldots, N\}$.

Integral Probability Metrics (IPMs) (Müller, 1997) are a family of distance measures on distributions: given two distributions $P_1$ and $P_2$ over $\mathcal{X}$ and a symmetric class $\mathcal{F}$ of real-valued functions $f : \mathcal{X} \to \mathbb{R}$ (i.e., $\forall f \in \mathcal{F}$ we have $-f \in \mathcal{F}$), the IPM is defined as:

$d_{\mathcal{F}}(P_1, P_2) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim P_1}[f(x)] - \mathbb{E}_{x \sim P_2}[f(x)] \right). \quad (1)$

By choosing different classes of functions $\mathcal{F}$, various popular distances can be obtained. For instance, the IPM with $\mathcal{F} = \{f : \|f\|_\infty \le 1\}$ recovers the Total Variation distance, the IPM with $\mathcal{F} = \{f : \|f\|_L \le 1\}$ recovers the Wasserstein distance, and the IPM with $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$ for an RKHS $\mathcal{H}$ gives the Maximum Mean Discrepancy (MMD).

2.1. Assumptions

We first assume access to a Cost-Sensitive oracle and to a Linear Programming oracle.

Cost-Sensitive Oracle. The Cost-Sensitive (CS) oracle takes a class of classifiers $\Pi \subseteq \{\pi : \mathcal{X} \to \Delta(\mathcal{A})\}$ and a dataset of pairs of features $x$ and cost vectors $c \in \mathbb{R}^K$, i.e., $\mathcal{D} = \{x^i, c^i\}_{i=1}^N$, as inputs, and outputs a classifier that minimizes the average expected classification cost: $\min_{\pi \in \Pi} \sum_{i=1}^N \pi(\cdot|x^i)^\top c^i / N$. Efficient cost-sensitive classifiers exist (e.g., Beygelzimer et al. (2005)) and are widely used in sequential decision-making tasks (e.g., Agarwal et al. (2014); Chang et al. (2015b)).

Linear Programming Oracle. A Linear Programming (LP) oracle takes a class of functions $\mathcal{G}$ as input and optimizes a linear functional with respect to $g \in \mathcal{G}$: $\max_{g \in \mathcal{G}} \sum_{i=1}^n \alpha_i g(x^i)$. When $\mathcal{G}$ is an RKHS ball with bounded norm, the linear functional becomes $\max_{g : \|g\|_{\mathcal{H}} \le c} \langle g, \sum_{i=1}^n \alpha_i \phi(x^i) \rangle$, from which one can obtain a closed-form solution. Another example is when $\mathcal{G}$ consists of all functions with bounded Lipschitz constant, i.e., $\mathcal{G} = \{g : \|g\|_L \le c\}$ for $c \in \mathbb{R}^+$: Sriperumbudur et al. (2012) showed that $\max_{g \in \mathcal{G}} \sum_{i=1}^n \alpha_i g(x^i)$ can be solved by Linear Programming with $n$ many constraints, one for each pair $(\alpha_i, x^i)$. In Appendix E, we provide a new reduction to Linear Programming for $\mathcal{G} = \{g : \|g\|_L \le c_1, \|g\|_\infty \le c_2\}$ for $c_1, c_2 \in \mathbb{R}^+$.

The second assumption concerns the richness of the function classes. We consider time-dependent policy classes $\Pi_h$ and discriminator classes $\mathcal{F}_h$ for $h \in [H]$, and we assume realizability:

Assumption 2.1 (Realizability and Capacity of Function Classes). We assume $\Pi_h$ contains $\pi^\star_h$ and $\mathcal{F}_h$ contains $V^\star_h$, i.e., $\pi^\star_h \in \Pi_h$ and $V^\star_h \in \mathcal{F}_h$ for all $h \in [H]$. We further assume that, for all $h$, $\mathcal{F}_h$ is symmetric, and $\Pi_h$ and $\mathcal{F}_h$ are finite in size.

Note that we assume $\Pi_h$ and $\mathcal{F}_h$ are discrete (but possibly extremely large) for simplicity of analysis. As we will show later, our bounds scale only logarithmically with respect to the sizes of the function classes.
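The CS oracle above is treated as a black box. For intuition only, the following is a minimal sketch of one way such an oracle could be approximated for a linear softmax policy class using plain gradient descent on the average expected cost; the function name and the optimizer choice are our own illustrative assumptions and this is not the reduction of Beygelzimer et al. (2005).

```python
import numpy as np

def cs_oracle(X, C, lr=0.1, iters=500):
    """Approximate cost-sensitive oracle (illustrative sketch).

    X: (N, d) array of features x^i.
    C: (N, K) array of cost vectors c^i.
    Returns weights W of a linear softmax policy pi(a|x) = softmax(W x)
    that approximately minimizes (1/N) * sum_i pi(.|x^i)^T c^i.
    """
    N, d = X.shape
    K = C.shape[1]
    W = np.zeros((K, d))
    for _ in range(iters):
        logits = X @ W.T                        # (N, K)
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)       # pi(.|x^i), shape (N, K)
        expected_cost = (P * C).sum(axis=1)     # pi(.|x^i)^T c^i per example
        # Gradient of the expected cost w.r.t. the logits of example i is
        # p_k * (c_k - pi(.|x^i)^T c^i).
        G = P * (C - expected_cost[:, None])    # (N, K)
        W -= lr * (G.T @ X) / N
    return W
```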
3. Algorithm

Our algorithm, Forward Adversarial Imitation Learning (FAIL), aims to learn a sequence of policies $\pi = \{\pi_1, \ldots, \pi_H\}$ such that its value $J(\pi)$ is close to $J(\pi^\star)$. Note that $J(\pi) \approx J(\pi^\star)$ does not necessarily mean that the state distribution of $\pi$ is close to that of $\pi^\star$; FAIL learns a sequence of policies with this stronger distribution-matching property in a forward training manner. The algorithm learns $\pi_i$ starting from $i = 1$. When learning $\pi_i$, the algorithm fixes $\{\pi_1, \ldots, \pi_{i-1}\}$ and solves a min-max game to compute $\pi_i$; it then proceeds to time step $i+1$. At a high level, FAIL decomposes a sequential learning problem into $H$ many independent two-player min-max games, where each game can be solved efficiently via no-regret online learning. Below, we first consider how to learn $\pi_h$ conditioned on $\{\pi_1, \ldots, \pi_{h-1}\}$ being fixed. We then present FAIL by chaining the $H$ min-max games together.

3.1. Learning a One-Step Policy via a Min-Max Game

Throughout this section, we assume $\{\pi_1, \ldots, \pi_{h-1}\}$ have already been learned and are fixed. The sequence of policies $\{\pi_1, \ldots, \pi_{h-1}\}$, for $h \ge 2$, determines a distribution $\nu_h \in \Delta(\mathcal{X}_h)$ over the observation space $\mathcal{X}_h$ at time step $h$. Likewise, the expert policy $\pi^\star$ naturally induces a sequence of observation distributions $\mu^\star_h \in \Delta(\mathcal{X}_h)$ for $h \in [H]$. The problem we consider in this section is to learn a policy $\pi_h \in \Pi_h$ such that the resulting observation distribution from $\{\pi_1, \ldots, \pi_{h-1}, \pi_h\}$ at time step $h+1$ is close to the expert's observation distribution $\mu^\star_{h+1}$ at time step $h+1$.

We consider the following IPM minimization problem. Given the distribution $\nu_h \in \Delta(\mathcal{X}_h)$ and a policy $\pi \in \Pi_h$, by the Markov property the observation distribution at time step $h+1$ conditioned on $\nu_h$ and $\pi$ is $\nu^\pi_{h+1}(x) \triangleq \sum_{x_h, a_h} \nu_h(x_h)\,\pi(a_h|x_h)\,P(x|x_h, a_h)$ for any $x \in \mathcal{X}_{h+1}$. Recall that the expert observation distribution at time step $h+1$ is denoted $\mu^\star_{h+1}$. The IPM with $\mathcal{F}_{h+1}$ between $\nu^\pi_{h+1}$ and $\mu^\star_{h+1}$ is defined as:

$d_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1}) = \sup_{f \in \mathcal{F}_{h+1}} \left( \mathbb{E}_{x \sim \nu^\pi_{h+1}}[f(x)] - \mathbb{E}_{x \sim \mu^\star_{h+1}}[f(x)] \right).$

Note that $d_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1})$ is parameterized by $\pi$, and our goal is to minimize $d_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1})$ with respect to $\pi$ over $\Pi_h$. However, $d_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1})$ is not directly measurable, as we do not have access to $\mu^\star_{h+1}$ but only to samples from $\mu^\star_{h+1}$. To estimate $d_{\mathcal{F}_{h+1}}$, we draw a dataset $\mathcal{D} = \{x^i_h, a^i_h, x^i_{h+1}\}_{i=1}^N$ such that $x^i_h \sim \nu_h$, $a^i_h \sim U(\mathcal{A})$, and $x^i_{h+1} \sim P(\cdot|x^i_h, a^i_h)$. Together with the observation set from the expert, $\mathcal{D}^\star = \{\tilde{x}^i_{h+1}\}_{i=1}^{N'}$ with $\tilde{x}^i_{h+1} \sim \mu^\star_{h+1}$, we form the following empirical estimate of $d_{\mathcal{F}_{h+1}}$ for any $\pi$ via importance weighting (recall $a^i_h \sim U(\mathcal{A})$):

$\hat{d}_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1}) = \max_{f \in \mathcal{F}_{h+1}} \left[ \frac{1}{N} \sum_{i=1}^N \frac{\pi(a^i_h|x^i_h)}{1/K} f(x^i_{h+1}) - \frac{1}{N'} \sum_{i=1}^{N'} f(\tilde{x}^i_{h+1}) \right],$

where $K = |\mathcal{A}|$ and the importance weight $K\,\pi(a^i_h|x^i_h)$ accounts for the fact that we draw actions uniformly from $\mathcal{A}$ but want to evaluate $\pi$. Although, due to the max operator, $\hat{d}_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1})$ is not an unbiased estimate of $d_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1})$, in Appendix A we show that $\hat{d}_{\mathcal{F}_{h+1}}$ is indeed a good approximation of $d_{\mathcal{F}_{h+1}}$ via an application of the standard Bernstein inequality and a union bound over $\mathcal{F}_{h+1}$. Hence we can approximately minimize $\hat{d}_{\mathcal{F}_{h+1}}$ with respect to $\pi$: $\min_{\pi \in \Pi_h} \hat{d}_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1})$, resulting in a two-player min-max game. Intuitively, we can think of $\pi$ as a generator such that, conditioned on $\nu_h$, it generates next-step samples $x_{h+1}$ that are similar to the expert samples from $\mu^\star_{h+1}$ by fooling the discriminators $\mathcal{F}_{h+1}$. The above formulation is a two-player game with the utility function for $\pi$ and $f$ defined as:

$u(\pi, f) \triangleq \frac{1}{N} \sum_{i=1}^N K\,\pi(a^i_h|x^i_h)\, f(x^i_{h+1}) - \frac{1}{N'} \sum_{i=1}^{N'} f(\tilde{x}^i_{h+1}).$

Algorithm 1 Min-Max Game $(\mathcal{D}^\star, \mathcal{D}, \Pi, \mathcal{F}, T)$
1: Initialize $\pi^1 \in \Pi$
2: for $n = 1$ to $T$ do
3:   $f^n = \arg\max_{f \in \mathcal{F}} u(\pi^n, f)$   (LP Oracle)
4:   $u^n = u(\pi^n, f^n)$
5:   $\pi^{n+1} = \arg\min_{\pi \in \Pi} \sum_{t=1}^n u(\pi, f^t) + \phi(\pi)$   (Regularized CS Oracle)
6: end for
7: Output: $\pi^{n^\star}$ with $n^\star = \arg\min_{n \in [T]} u^n$

Algorithm 1 solves the min-max game $\min_\pi \max_f u(\pi, f)$ using no-regret online updates for both $f$ and $\pi$. At iteration $n$, the $f$-player plays the best response $f^n = \arg\max_f u(\pi^n, f)$ (Line 3), and the $\pi$-player plays Follow-the-Regularized-Leader (FTRL) (Shalev-Shwartz, 2012), i.e., $\pi^{n+1} = \arg\min_{\pi \in \Pi} \sum_{t=1}^n u(\pi, f^t) + \phi(\pi)$ with $\phi$ a convex regularizer (Line 5). Other no-regret online learning algorithms (e.g., replacing FTRL by incremental online learning algorithms such as OGD (Zinkevich, 2003)) can also be used to approximately solve the above min-max formulation. At the end of Algorithm 1, we output, among all computed policies $\{\pi^i\}_{i=1}^T$, the policy $\pi^{n^\star}$ for which $\hat{d}_{\mathcal{F}_{h+1}}(\pi^{n^\star} \mid \nu_h, \mu^\star_{h+1})$ is minimized (Line 7).
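Since Assumption 2.1 takes $\Pi_h$ and $\mathcal{F}_{h+1}$ to be finite, the empirical game can in principle be solved by enumeration; the minimal sketch below does exactly that for the utility $u(\pi, f)$ above, while Algorithm 1 replaces the enumeration with oracle calls when the classes are too large to enumerate. The array layout and function names are our own illustrative assumptions.

```python
import numpy as np

def utility_matrix(pi_probs, f_learner, f_expert, K):
    """u(pi_j, f_m) for every candidate policy pi_j and discriminator f_m.

    pi_probs:  (J, N)  pi_j(a^i_h | x^i_h) on the learner's transitions,
               where a^i_h was drawn uniformly from the K actions.
    f_learner: (M, N)  f_m(x^i_{h+1}) on the learner's next observations.
    f_expert:  (M, N') f_m(x~^i_{h+1}) on the expert's observations.
    """
    N = pi_probs.shape[1]
    learner_term = K * (pi_probs @ f_learner.T) / N          # (J, M)
    expert_term = f_expert.mean(axis=1)                      # (M,)
    return learner_term - expert_term[None, :]

def solve_min_max(pi_probs, f_learner, f_expert, K):
    """Pick the policy minimizing the empirical IPM, max_f u(pi, f)."""
    U = utility_matrix(pi_probs, f_learner, f_expert, K)
    d_hat = U.max(axis=1)            # empirical IPM of each candidate policy
    best = int(d_hat.argmin())
    return best, d_hat[best]
```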
Regarding the computational efficiency of Algorithm 1, the best-response computation for $f$ in Line 3 can be performed with one call to the LP oracle, while FTRL on $\pi$ can be implemented with one call to the regularized CS oracle. Regarding the statistical performance, we have the following theorem:

Theorem 3.1. Given $\epsilon \in (0,1]$ and $\delta \in (0,1]$, set $T = \Theta(K/\epsilon^2)$ and $N = N' = \Theta\!\left(K \log(|\Pi_h||\mathcal{F}_{h+1}|/\delta)/\epsilon^2\right)$. Then Algorithm 1 outputs $\hat{\pi}$ such that, with probability at least $1-\delta$,

$d_{\mathcal{F}_{h+1}}(\hat{\pi} \mid \nu_h, \mu^\star_{h+1}) \le \min_{\pi' \in \Pi_h} d_{\mathcal{F}_{h+1}}(\pi' \mid \nu_h, \mu^\star_{h+1}) + O(\epsilon).$

The proof of the above theorem is included in Appendix A; it combines the standard min-max theorem with a uniform convergence analysis. The theorem essentially shows that Algorithm 1 finds a policy whose resulting IPM is close, up to an $\epsilon$ error, to the smallest possible IPM one could achieve if one had access to $d_{\mathcal{F}_{h+1}}(\pi \mid \nu_h, \mu^\star_{h+1})$ directly. Intuitively, Theorem 3.1 says that if $\nu_h$, the observation distribution resulting from the fixed policies $\{\pi_1, \ldots, \pi_{h-1}\}$, is similar to $\mu^\star_h$, then we are guaranteed to learn a policy $\pi$ such that the new sequence of policies $\{\pi_1, \ldots, \pi_{h-1}, \pi\}$ generates a new distribution $\nu_{h+1}$ that is close to $\mu^\star_{h+1}$ in terms of the IPM with $\mathcal{F}_{h+1}$. The algorithm introduced below is based on this intuition.

3.2. Forward Adversarial Imitation Learning

Theorem 3.1 indicates that, conditioned on $\{\pi_1, \ldots, \pi_{h-1}\}$ being fixed, Algorithm 1 finds a policy $\pi \in \Pi_h$ that approximately minimizes the divergence, measured by the IPM with $\mathcal{F}_{h+1}$, between the observation distribution $\nu_{h+1}$ resulting from $\{\pi_1, \ldots, \pi_{h-1}, \pi\}$ and the corresponding expert distribution $\mu^\star_{h+1}$. With Algorithm 1 as the building block, we now introduce our model-free algorithm, Forward Adversarial Imitation Learning (FAIL), in Algorithm 2. Algorithm 2 integrates Algorithm 1 into the Forward Training framework (Ross & Bagnell, 2010) by decomposing the sequential learning problem into $H$ many independent distribution-matching problems, each of which is solved using Algorithm 1 independently.

Algorithm 2 FAIL$(\{\Pi_h\}_h, \{\mathcal{F}_h\}_h, \epsilon, n, n', T)$
1: Set $\pi = \emptyset$
2: for $h = 1$ to $H-1$ do
3:   Extract the expert's data at $h+1$: $\mathcal{D}^\star = \{\tilde{x}^i_{h+1}\}_{i=1}^{n'}$
4:   $\mathcal{D} = \emptyset$
5:   for $i = 1$ to $n$ do
6:     Reset $x^i_1 \sim \rho$
7:     Execute $\pi = \{\pi_1, \ldots, \pi_{h-1}\}$ to generate $x^i_h$
8:     Execute $a^i_h \sim U(\mathcal{A})$ to generate $x^i_{h+1}$, and add $(x^i_h, a^i_h, x^i_{h+1})$ to $\mathcal{D}$
9:   end for
10:  Set $\pi_h$ to be the return of Algorithm 1 with inputs $\mathcal{D}^\star, \mathcal{D}, \Pi_h, \mathcal{F}_{h+1}, T$
11:  Append $\pi_h$ to $\pi$
12: end for

At every time step $h$, FAIL assumes that $\pi_1, \ldots, \pi_{h-1}$ have been correctly learned, in the sense that the resulting observation distribution $\nu_h$ from $\{\pi_1, \ldots, \pi_{h-1}\}$ is close to the expert's $\mu^\star_h$. Therefore, FAIL only needs to focus on learning $\pi_h$ correctly conditioned on $\{\pi_1, \ldots, \pi_{h-1}\}$ being fixed, so that $\nu_{h+1}$ is close to $\mu^\star_{h+1}$ in terms of the IPM with $\mathcal{F}_{h+1}$. Intuitively, when one has a strong class of discriminators and the two-player game at each time step is solved near-optimally, then by induction from $h = 1$ to $H$, FAIL should learn a sequence of policies such that $\nu_h$ is close to $\mu^\star_h$ for all $h \in [H]$ (for the base case, we simply have $\nu_1 = \mu^\star_1$).
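To make the forward decomposition of Algorithm 2 concrete, here is a minimal sketch of its outer loop. It assumes a classic Gym-style environment whose `reset()` returns an observation and whose `step()` returns a 4-tuple, a per-step array of expert observations `expert_obs`, policies represented as callables mapping an observation to an action, and a `solve_min_max_game` routine standing in for Algorithm 1; these names and the omission of early-termination handling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fail(env, expert_obs, policy_classes, n, T, solve_min_max_game):
    """Sketch of the FAIL outer loop (Algorithm 2).

    expert_obs[t]     : array of expert observations at time step t (0-indexed).
    policy_classes[t] : candidate policy class Pi_{t+1}.
    """
    H = len(expert_obs)
    K = env.action_space.n
    learned = []                                 # pi_1, ..., pi_{h-1}
    for h in range(H - 1):
        D = []                                   # (x_h, a_h, x_{h+1}) triples
        for _ in range(n):
            x = env.reset()                      # x_1 ~ rho
            for pi in learned:                   # roll in the fixed prefix
                x, _, _, _ = env.step(pi(x))
            a = np.random.randint(K)             # a_h ~ U(A)
            x_next, _, _, _ = env.step(a)
            D.append((x, a, x_next))
        # Algorithm 1: match the expert's observation distribution at h+1.
        pi_h = solve_min_max_game(D, expert_obs[h + 1], policy_classes[h], T)
        learned.append(pi_h)
    return learned
```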
3.3. Analysis of Algorithm 2

The performance of FAIL crucially depends on the capacity of the discriminators. Intuitively, discriminators that are too strong cause overfitting (unless one has an extremely large number of samples). Conversely, discriminators that are too weak will not be able to distinguish $\nu_h$ from $\mu^\star_h$. This dilemma was already studied in the Generative Adversarial Network (GAN) literature by Arora et al. (2017). Below we study this tradeoff explicitly in IL. To quantify the power of the discriminator class $\mathcal{F}_h$ for all $h$, we use the inherent Bellman Error (IBE) with respect to $\pi^\star$:

$\Delta_{be} \triangleq \max_{g \in \mathcal{F}_{h+1}} \min_{f \in \mathcal{F}_h} \|f - \Gamma_h g\|_\infty. \quad (4)$

The inherent Bellman error is commonly used in the approximate value iteration literature (Munos, 2005; Munos & Szepesvári, 2008; Lazaric et al., 2016) and the policy evaluation literature (Sutton, 1988). It measures the worst possible projection error when projecting $\Gamma_h g$ onto the function space $\mathcal{F}_h$. Intuitively, increasing the capacity of $\mathcal{F}_h$ reduces $\Delta_{be}$. Using a restricted function class $\mathcal{F}$ potentially introduces a nonzero $\Delta_{be}$; hence one may be tempted to set $\mathcal{F}_h$ to be an infinitely powerful discriminator class, such as the class of all bounded functions $\{f : \|f\|_\infty \le c\}$ (recall that the IPM becomes total variation in this case). However, using $\mathcal{F}_h \triangleq \{f : \|f\|_\infty \le c\}$ makes efficient learning impossible. The following theorem excludes the possibility of sample efficiency with the discriminator class $\{f : \|f\|_\infty \le c\}$.

Theorem 3.2 (Infinite Capacity $\mathcal{F}$ does not generalize). There exists an MDP with $H = 2$, a policy set $\Pi = \{\pi, \pi'\}$, and an expert policy $\pi^\star$ with $\pi = \pi^\star$ (i.e., $\pi^\star$ is realizable), such that for datasets $\mathcal{D}^\star = \{\tilde{x}^i_2\}_{i=1}^M$ with $\tilde{x}^i_2 \sim \mu^\star_2$, $\mathcal{D} = \{x^i_2\}_{i=1}^M$ with $x^i_2 \sim \mu^\pi_2$, and $\mathcal{D}' = \{x'^i_2\}_{i=1}^M$ with $x'^i_2 \sim \mu^{\pi'}_2$, as long as $M = O(\log(|\mathcal{X}|))$, we must have:

$\lim_{|\mathcal{X}| \to \infty} P(\mathcal{D}^\star \cap \mathcal{D} = \emptyset) = 1, \qquad \lim_{|\mathcal{X}| \to \infty} P(\mathcal{D}^\star \cap \mathcal{D}' = \emptyset) = 1.$

Namely, denoting by $\hat{\mathcal{D}}$ the empirical distribution of a dataset $\mathcal{D}$ (assigning probability $1/|\mathcal{D}|$ to each sample), we have:

$\lim_{|\mathcal{X}| \to \infty} \|\hat{\mathcal{D}}^\star - \hat{\mathcal{D}}\|_1 = 2, \qquad \lim_{|\mathcal{X}| \to \infty} \|\hat{\mathcal{D}}^\star - \hat{\mathcal{D}}'\|_1 = 2.$

The above theorem shows that, by just looking at the samples generated from $\pi$ and $\pi'$ and comparing them to the samples generated from the expert policy $\pi^\star$ using $\{f : \|f\|_\infty \le c\}$ (the IPM becomes total variation here), we cannot distinguish $\pi$ from $\pi'$: both look equally dissimilar to $\pi^\star$, since none of the three datasets overlap with each other, and the TV distances between the empirical distributions become constant, unless the sample size scales as $\mathrm{poly}(|\mathcal{X}|)$. Theorem 3.2 suggests that one should explicitly regularize the discriminator class so that it has finite capacity (e.g., bounded VC dimension or Rademacher complexity). Restricted discriminator classes are widely used in practice as well, for example in learning generative models (i.e., Wasserstein GANs (Arjovsky et al., 2017)).

Denote $|\Pi| = \max_h |\Pi_h|$ and $|\mathcal{F}| = \max_h |\mathcal{F}_h|$. The following theorem shows that the performance of the learned time-dependent policies is close to the expert's performance:

Theorem 3.3 (Sample Complexity of FAIL). Under Assumption 2.1, for any $\epsilon, \delta \in (0,1]$, set $T = \Theta(K/\epsilon^2)$ and $n = n' = \Theta\!\left(K \log(|\Pi||\mathcal{F}|H/\delta)/\epsilon^2\right)$. Then, with probability at least $1-\delta$, FAIL (Algorithm 2) outputs $\hat{\pi}$ such that $J(\hat{\pi}) - J(\pi^\star) \le H^2\!\left(\epsilon + \Delta'_{be}\right)$, using at most $\tilde{O}\!\left(HK \log(|\Pi||\mathcal{F}|H/\delta)/\epsilon^2\right)$ ³ many trajectories, where the average inherent Bellman error is

$\Delta'_{be} \triangleq \max_h \max_{g \in \mathcal{F}_{h+1}} \min_{f \in \mathcal{F}_h} \mathbb{E}_{x \sim (\nu_h + \mu^\star_h)/2}\left[\,|f(x) - (\Gamma_h g)(x)|\,\right].$

Note that the average inherent Bellman error $\Delta'_{be}$ defined above is averaged over the state distribution of the learned policy and the state distribution of the expert, and is therefore guaranteed to be no larger than the classic inherent Bellman error used in the RL literature (i.e., (4)), which uses the infinity norm over $\mathcal{X}$. The proof of Theorem 3.3 is included in Appendix C. Regarding the computational complexity of Algorithm 2, it requires a polynomial number of calls (with respect to the parameters $H$, $K$, $1/\epsilon$) to the efficient oracles (the regularized CS oracle and the LP oracle).

³ In $\tilde{O}$, we drop log terms that do not depend on $|\Pi|$ or $|\mathcal{F}|$. In $\Theta$, we drop constants that do not depend on $H, K, |\mathcal{X}|, |\Pi|, |\mathcal{F}|, 1/\epsilon, 1/\delta$. Details can be found in the Appendix.
Since our analysis only uses uniform convergence arguments and standard concentration inequalities, the extension to continuous $\Pi$ and $\mathcal{F}$ with complexity measures such as VC dimension, Rademacher complexity, or covering number is standard. We give an example in Section 5.1.

4. The Gap Between ILFO and RL

To quantify the gap between RL and ILFO, below we present an exponential separation between ILFO and RL in terms of the sample complexity of learning a near-optimal policy. We assume that the expert policy is optimal.

Proposition 4.1 (Exponential Separation Between RL and ILFO). Fix $H \in \mathbb{N}^+$ and $\epsilon \in (0, 1/8)$. There exists a family of MDPs with deterministic dynamics, horizon $H$, $2^H - 1$ many states, and two actions, such that for any RL algorithm, the probability of outputting a policy $\hat{\pi}$ with $J(\hat{\pi}) \le J(\pi^\star) + \epsilon$ after collecting $T$ trajectories is at most $2/3$ for all $T \le O(2^H/\epsilon^2)$. On the other hand, for the same MDP family, given one trajectory of observations $\{\tilde{x}_h\}_{h=1}^H$ from the expert policy $\pi^\star$, there exists an algorithm that deterministically outputs $\pi^\star$ after collecting $O(H)$ trajectories.

Proposition 4.1 shows that access to the expert's trajectories of observations allows us to efficiently solve some MDPs that are otherwise provably intractable for any RL algorithm (i.e., requiring exponentially many trajectories to find a near-optimal policy). This kind of exponential gap was previously studied in the interactive imitation learning setting, where the expert also provides action signals (Sun et al., 2017) and one can query the expert's action at any time step during training. To the best of our knowledge, this is the first exponential gap in terms of sample efficiency between ILFO and RL. Note that our theorem applies to a specific family of purposefully designed MDPs, which is standard for proving information-theoretic lower bounds.

5. Case Study

In this section, we study three settings in which the inherent Bellman error vanishes even under a restricted discriminator class: (1) Lipschitz continuous MDPs (e.g., (Kakade et al., 2003)), (2) interactive imitation learning from observation, where the expert is available to query during training, and (3) state abstraction.

5.1. Lipschitz Continuous MDPs

We consider a setting where the cost function, the dynamics, and $\pi^\star_h$ are Lipschitz continuous in the metric space $(\mathcal{X}, d)$:

$\|P(\cdot|x, a) - P(\cdot|x', a)\|_1 \le L_P\, d(x, x'), \qquad \|\pi^\star_h(\cdot|x) - \pi^\star_h(\cdot|x')\|_1 \le L_\pi\, d(x, x'),$

for the known metric $d$ and Lipschitz constants $L_P$, $L_\pi$. Under this condition, the Bellman operator with respect to $\pi^\star$ is Lipschitz continuous in the metric space $(\mathcal{X}, d)$:

$|\Gamma_h f(x_1) - \Gamma_h f(x_2)| \le \|f\|_\infty (L_P + L_\pi)\, d(x_1, x_2),$

where we applied Hölder's inequality. Hence, we can design the function class $\mathcal{F}_h$ for all $h \in [H]$ as follows:

$\mathcal{F}_h = \{f : \|f\|_L \le L_P + L_\pi,\; \|f\|_\infty \le 1\}, \quad (5)$

which gives $\Delta_{be} = 0$ and $V^\star_h \in \mathcal{F}_h$ due to the assumption on the cost function. Namely, $\mathcal{F}_h$ is the class of functions with bounded Lipschitz constant and bounded value. This class of functions is widely used in practice for learning generative models (e.g., Wasserstein GANs). Note that this setting was also studied in (Munos & Szepesvári, 2008) for the Fitted Value Iteration algorithm.

Denote $L \triangleq L_P + L_\pi$. For $\mathcal{F} = \{f : \|f\|_L \le L, \|f\|_\infty \le 1\}$, we show that the empirical IPM $\sup_{f \in \mathcal{F}} \left[\sum_{i=1}^N f(x^i)/N - \sum_{i=1}^{N'} f(x'^i)/N'\right]$ can be evaluated by reducing it to Linear Programming; the details are deferred to Appendix E.
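The following is a minimal sketch of such an LP evaluation, under our own assumptions (Euclidean metric, scipy's LP solver), in the spirit of Sriperumbudur et al. (2012) and Appendix E rather than a verbatim transcription of it: the optimization variables are the values of $f$ at the pooled sample points, the Lipschitz and boundedness conditions become linear constraints, and the empirical IPM is a linear objective.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def empirical_ipm_lipschitz(X, Y, L):
    """sup over {f : ||f||_L <= L, ||f||_inf <= 1} of mean_X f - mean_Y f.

    X: (N, d) learner samples, Y: (N', d) expert samples (Euclidean metric).
    Only the values of f at the pooled points matter, so they are the LP
    variables; Lipschitz constraints are imposed on every pair of points.
    """
    Z = np.vstack([X, Y])
    m, N, Np = len(Z), len(X), len(Y)
    # Maximize (1/N) sum_{i<N} f_i - (1/N') sum_{i>=N} f_i; linprog minimizes.
    c = np.concatenate([-np.ones(N) / N, np.ones(Np) / Np])
    D = cdist(Z, Z)                               # pairwise distances
    rows, b = [], []
    for i in range(m):
        for j in range(i + 1, m):
            r = np.zeros(m)
            r[i], r[j] = 1.0, -1.0
            rows.append(r)
            b.append(L * D[i, j])                 # f_i - f_j <= L * d(z_i, z_j)
            rows.append(-r)
            b.append(L * D[i, j])                 # f_j - f_i <= L * d(z_i, z_j)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(b),
                  bounds=[(-1.0, 1.0)] * m, method="highs")
    return -res.fun
```

Dropping the Lipschitz rows (i.e., letting $L \to \infty$) collapses the program to the total-variation case that Theorem 3.2 warns against, which is one way to see why the Lipschitz constraint is what restores generalization here.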
Regarding the generalization ability, note that our function class $\mathcal{F}$ is a subset of all functions with bounded Lipschitz constant, i.e., $\mathcal{F} \subseteq \{f : \|f\|_L \le L\}$. The Rademacher complexity of the bounded-Lipschitz function class grows at a rate of $O(N^{-1/\mathrm{cov}(\mathcal{X})})$ (e.g., see (Luxburg & Bousquet, 2004; Sriperumbudur et al., 2012)), with $\mathrm{cov}(\mathcal{X})$ being the covering dimension of the metric space $(\mathcal{X}, d)$.⁴ Extending Theorem 3.3 to Lipschitz continuous MDPs, we obtain the following corollary.

Corollary 5.1 (Sample Complexity of FAIL for Lipschitz Continuous MDPs). With the above setup for Lipschitz continuous MDPs and $\mathcal{F}_h$ for $h \in [H]$ as in (5), given $\epsilon, \delta \in (0,1]$, set $T = \Theta(K/\epsilon^2)$ and $n = n' = \Theta\!\left(K(LK)^{\mathrm{cov}(\mathcal{X})} \log(|\Pi|/\delta)/\epsilon^{2+\mathrm{cov}(\mathcal{X})}\right)$. Then, with probability at least $1-\delta$, FAIL (Algorithm 2) outputs a policy $\hat{\pi}$ with $J(\hat{\pi}) - J(\pi^\star) \le O(H^2\epsilon)$, using at most $\tilde{O}\!\left(HK(KL)^{\mathrm{cov}(\mathcal{X})} \log(|\Pi|/\delta)/\epsilon^{2+\mathrm{cov}(\mathcal{X})}\right)$ many trajectories.

The proof of the above corollary is deferred to Appendix F; it uses a standard covering-number argument over $\mathcal{F}_h$ with the norm $\|\cdot\|_\infty$. Note that we get rid of the IBE here, and hence as the number of samples grows, FAIL approaches global optimality. Although the bound has an exponential dependency on the covering dimension, the covering dimension $\mathrm{cov}(\mathcal{X})$ depends entirely on the underlying metric space $(\mathcal{X}, d)$ and could be much smaller than the ambient dimension of $\mathcal{X}$. The above corollary also serves as an example of how Theorem 3.3 can be extended to settings where $\mathcal{F}$ contains infinitely many functions but has bounded statistical complexity (similar techniques apply to $\Pi$ as well).

⁴ The covering dimension is defined as $\mathrm{cov}(\mathcal{X}) \triangleq \inf\{d > 0 : \mathcal{N}_\epsilon(\mathcal{X}) \le \epsilon^{-d},\; \forall \epsilon > 0\}$, where $\mathcal{N}_\epsilon(\mathcal{X})$ is the size of the minimum $\epsilon$-net of the metric space $(\mathcal{X}, d)$.

5.2. Interactive Imitation Learning from Observations

We can avoid the IBE in an interactive learning setting, where we assume that we can query the expert during training. However, different from previous interactive imitation learning settings such as AggreVaTe, LOLS (Ross & Bagnell, 2014; Chang et al., 2015b), and DAgger (Ross et al., 2011), here we do not assume that the expert provides actions or cost signals. Given any observation $x$, we simply ask the expert to take over for just one step and observe the observation at the next step, i.e., $x' \sim P(\cdot|x, a)$ with $a \sim \pi^\star(\cdot|x)$. Note that, compared to the non-interactive setting, the interactive setting assumes much stronger access to the expert. In this setting, we can use an arbitrary class of discriminators with bounded complexity. Due to space limits, we defer the detailed description of the interactive version of FAIL (Algorithm 5) to Appendix G. The following theorem states that we can avoid the IBE:

Theorem 5.2. Under Assumption 2.1 and given access to an interactive expert, for any $\epsilon \in (0,1]$ and $\delta \in (0,1]$, set $T = \Theta(K/\epsilon^2)$ and $n = \Theta\!\left(K \log(|\Pi||\mathcal{F}|H/\delta)/\epsilon^2\right)$. Then, with probability at least $1-\delta$, Algorithm 5 outputs a policy $\hat{\pi}$ such that $J(\hat{\pi}) - J(\pi^\star) \le H\epsilon$, using at most $\tilde{O}\!\left(HK \log(|\Pi||\mathcal{F}|H/\delta)/\epsilon^2\right)$ many trajectories.

Compared to the non-interactive setting, we get rid of the IBE, at the cost of a much stronger expert.

5.3. State Abstraction

Denote by $\phi : \mathcal{X} \to \mathcal{S}$ an abstraction that maps $\mathcal{X}$ to a discrete set $\mathcal{S}$. We assume that the abstraction satisfies the bisimulation property (Givan et al., 2003): for any $x, x' \in \mathcal{X}$ with $\phi(x) = \phi(x')$, we have, for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$:

$c(x) = c(x'), \qquad \pi^\star_h(a|x) = \pi^\star_h(a|x'), \qquad \sum_{x'' : \phi(x'') = s} P(x''|x, a) = \sum_{x'' : \phi(x'') = s} P(x''|x', a). \quad (6)$

In this case, one can show that $V^\star$ is piecewise constant under the partition induced by $\phi$, i.e., $V^\star(x) = V^\star(x')$ if $\phi(x) = \phi(x')$. Leveraging the abstraction $\phi$, we can then design

$\mathcal{F}_h = \{f : \|f\|_\infty \le 1,\; f(x) = f(x')\ \ \forall x, x' \text{ s.t. } \phi(x) = \phi(x')\}.$

Namely, $\mathcal{F}_h$ contains piecewise constant functions with bounded values.
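For this discriminator class the supremum in the IPM has a simple closed form: the optimal discriminator assigns $+1$ to abstract states where the learner's empirical frequency exceeds the expert's and $-1$ otherwise, so the empirical IPM reduces to an $\ell_1$ distance between the abstracted empirical distributions. A minimal sketch, with `phi` a hypothetical abstraction function:

```python
import numpy as np
from collections import Counter

def abstracted_ipm(learner_obs, expert_obs, phi):
    """IPM over {f : ||f||_inf <= 1, f piecewise constant under phi}.

    Equals sum_s |P_hat(phi(x) = s) - Q_hat(phi(x) = s)|, since the optimal
    discriminator is +1 where the learner over-visits an abstract state
    and -1 where it under-visits.
    """
    p = Counter(phi(x) for x in learner_obs)
    q = Counter(phi(x) for x in expert_obs)
    states = set(p) | set(q)
    return sum(abs(p[s] / len(learner_obs) - q[s] / len(expert_obs))
               for s in states)
```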
Under this setting, we can show that the inherent Bellman error is zero as well (see Proposition 9 of Chen & Jiang (2019)). Also, $\sup_{f \in \mathcal{F}_h} u(\pi, f)$ can again be computed via LP, and $\mathcal{F}_h$ has Rademacher complexity scaling as $O(\sqrt{|\mathcal{S}|/N})$, with $N$ being the number of samples. Details are in Appendix I.

6. Discussion on Related Work

Some previous works use the idea of learning an inverse model to predict actions (or latent causes) (Nair et al., 2017; Torabi et al., 2018) from two successive observations, and then use the learned inverse model to generate actions from the expert's observation demonstrations. With the inferred actions, the problem reduces to standard imitation learning. We note here that learning an inverse model on its own is ill-defined. Specifically, simply by Bayes' rule, the inverse model $P(a|x_h, x_{h+1})$, i.e., the probability that action $a$ was executed at $x_h$ such that the system generated $x_{h+1}$, satisfies $P(a|x_h, x_{h+1}) \propto P(x_{h+1}|x_h, a)\,P(a|x_h)$; that is, an inverse model $P(a|x_h, x_{h+1})$ explicitly depends on the action-generation policy $P(a|x_h)$. Unlike $P(x_{h+1}|x_h, a)$, without the policy $P(a|x_h)$ the inverse model by itself is ill-defined. This means that if one wants to learn an inverse model that predicts the expert's actions along trajectories of observations generated by the expert, one would need to learn an inverse model $P^\star(a|x_h, x_{h+1})$ such that $P^\star(a|x_h, x_{h+1}) \propto P(x_{h+1}|x_h, a)\,\pi^\star_h(a|x_h)$, which indicates that one needs to collect actions from $\pi^\star_h$. An inverse model makes sense when the dynamics are deterministic and bijective. Hence, relying on learning such an inverse model $P^\star(a|x, x')$ will not provide performance guarantees in general, unless we have actions collected from $\pi^\star$.

7. Simulation

We test FAIL on three simulated tasks from OpenAI Gym (Brockman et al., 2016): Swimmer, Reacher, and the Fetch Robot Reach task (FetchReach). For Swimmer we set $H$ to 100, while for Reacher and FetchReach, $H$ is 50 by default. The Swimmer task has dense reward (i.e., reward at every time step). For Reacher, we try both dense reward and sparse reward (i.e., success if the arm reaches the goal within a threshold). FetchReach is a sparse-reward task. As our algorithm is presented for discrete action spaces, for all three tasks we discretize the action space by discretizing each dimension into 5 values and applying a categorical distribution independently to each dimension.⁵ ⁶ For the baseline, we modify GAIL (Ho & Ermon, 2016), a model-free IL algorithm, based on the implementation from OpenAI Baselines, to make it work for ILFO: we delete the action inputs to GAIL's discriminators. The modified version can hence be understood as using RL (the OpenAI implementation uses TRPO (Schulman et al., 2015)) to minimize the divergence between the learner's average state distribution and the expert's average state distribution.

⁵ I.e., $\pi(a|x) = \prod_{i=1}^d \pi_i(a[i]|x)$, with $a[i]$ standing for the $i$-th action dimension. Note that common implementations for continuous control often assume such a factorization across action dimensions, as the covariance matrix of the Gaussian policy is often diagonal. See the comments in Appendix J for extending FAIL to the continuous control setting in practice.
⁶ Implementation and scripts for reproducing results can be found at https://github.com/wensun/Imitation-Learning-from-Observation

[Figure 1. Performance of expert, FAIL, and GAIL (without actions) on dense reward tasks (Swimmer and Reacher). Panels: (a) Swimmer (1m), (b) Reacher (1m), (c) Swimmer (25), (d) Reacher (25). For (a) and (b), we fix the number of training samples while varying the number of expert demonstrations (6, 12, 25). For (c) and (d), we fix the number of expert trajectories while varying the number of training samples.]
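As a concrete illustration of the per-dimension action discretization described above, here is a minimal sketch; the grid layout and the sampling helper are our own illustrative assumptions, not necessarily the paper's exact implementation.

```python
import numpy as np

def make_action_grid(low, high, bins=5):
    """Per-dimension grid of candidate values for a continuous (Box) action space."""
    return np.stack([np.linspace(l, h, bins) for l, h in zip(low, high)])

def sample_discretized_action(probs, grid, rng=np.random):
    """Sample a continuous action from independent per-dimension categoricals.

    probs: (d, bins) array, probs[i] = pi_i(.|x) over the bins of dimension i,
    so that pi(a|x) = prod_i pi_i(a[i]|x) as in the factored policy.
    """
    idx = [rng.choice(len(p), p=p) for p in probs]
    return np.array([grid[i, j] for i, j in enumerate(idx)])
```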
For the FAIL implementation, we use MMD as a special case of the IPM, with an RBF kernel whose bandwidth is set by the common median trick (e.g., (Fukumizu et al., 2009)) without any further tuning. All policies are parameterized by one-layer neural networks. Instead of FTRL, we use ADAM as an incremental no-regret learner, with all default parameters (e.g., learning rate) (Kingma & Ba, 2014). The total number of iterations $T$ in Algorithm 1 is set to 1000 without any further tuning. Throughout the experiments, we stick to default hyperparameters in order to best reflect the algorithmic contribution of FAIL. All results below are averaged over ten random trials with seeds randomly generated in $[1, 10^6]$. The experts are trained via a reinforcement learning algorithm (TRPO (Schulman et al., 2015)) with multiple millions of samples until convergence.

Figure 1 shows the comparison of the expert, FAIL, and GAIL on the two dense reward tasks with different numbers of expert demonstrations, under a fixed total number of training samples (one million). We report means and standard errors in Figure 1. We observe that GAIL outperforms FAIL on Swimmer in some configurations, while FAIL outperforms GAIL on Reacher (dense reward) in all configurations. Figure 2 shows the comparison of the expert, FAIL, and GAIL on the two sparse reward settings. We observe that FAIL significantly outperforms GAIL on these two sparse reward tasks. For sparse reward, what really matters is reaching the target at the end: FAIL achieves this by matching the expert's and the learner's state distributions one by one at every time step until the end, while GAIL (without actions) loses the sense of ordering by focusing on the average state distributions.

[Figure 2. Performance of expert, FAIL, and GAIL (without actions) on two sparse control tasks (Reacher Sparse and FetchReach). Panels: (a) Reacher Sparse (1m), (b) FetchReach (1m). We fix the number of training samples while varying the number of expert demonstrations (6, 12, 25).]

The above experiments also indicate that FAIL in general works well for shorter horizons (e.g., H = 50 for Reacher and FetchReach), while it shows much less improvement over GAIL on the longer-horizon task. We believe this is due to the nature of FAIL, which has to learn a sequence of time-dependent policies along the entire horizon H; a longer horizon requires a larger number of samples. A method like GAIL learns a single stationary policy from all training data and is hence less sensitive to increasing horizon. We leave extending FAIL to learning a single stationary policy as future work.
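For completeness, here is a minimal sketch of the MMD computation with an RBF kernel and the median-bandwidth heuristic mentioned above; the choice of the (biased) V-statistic estimator and the exact bandwidth convention are our own assumptions, not necessarily the paper's implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_mmd(X, Y):
    """Squared MMD between samples X (N, d) and Y (N', d) with an RBF kernel
    whose bandwidth is the median pairwise distance (the 'median trick')."""
    Z = np.vstack([X, Y])
    D = cdist(Z, Z)
    sigma = np.median(D[D > 0])                  # median-heuristic bandwidth
    K = np.exp(-D ** 2 / (2 * sigma ** 2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```

In FAIL's utility $u(\pi, f)$, the learner-side terms would additionally carry the importance weights $K\,\pi(a^i_h|x^i_h)$ from the empirical IPM estimate; the sketch omits them for clarity.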
8. Conclusion and Future Work

We study the Imitation Learning from Observation (ILFO) setting and propose an algorithm, Forward Adversarial Imitation Learning (FAIL), that achieves both sample efficiency and computational efficiency. FAIL decomposes the sequential learning task into independent two-player min-max games, each of which is solved via general no-regret online learning. In addition to the algorithmic contribution, we present the first exponential gap in terms of sample complexity between ILFO and RL, demonstrating the potential benefit of the expert's observations. A key observation is that one should explicitly regularize the class of discriminators to achieve sample efficiency, and design the discriminators to decrease the inherent Bellman error. Experimentally, while GAIL can be used to solve the ILFO problem by removing the action inputs to the discriminators, FAIL works just as well on problems with dense reward. Our analysis of FAIL provides the first strong theoretical guarantee for solving ILFO, and FAIL significantly outperforms GAIL on sparse-reward MDPs, which are common in practice.

Acknowledgement

WS is supported in part by Office of Naval Research contract N000141512365. WS thanks Nan Jiang and Akshay Krishnamurthy for valuable discussions. We thank the first anonymous reviewer for carefully reviewing the proofs.

References

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638-1646, 2014.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.

Bagnell, J. A., Kakade, S. M., Schneider, J. G., and Ng, A. Y. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pp. 831-838, 2004.

Beygelzimer, A., Dani, V., Hayes, T., Langford, J., and Zadrozny, B. Error limiting reductions between classification tasks. In Proceedings of the 22nd International Conference on Machine Learning, pp. 49-56. ACM, 2005.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Chang, K.-W., He, H., Daumé III, H., and Langford, J. Learning to search for dependencies. arXiv preprint arXiv:1503.05615, 2015a.

Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daumé III, H., and Langford, J. Learning to search better than your teacher. In ICML, 2015b.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.

Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818-2826, 2015.

Daumé III, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine Learning, 2009.

Edwards, A. D., Sahni, H., Schroeker, Y., and Isbell, C. L. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.

Fukumizu, K., Gretton, A., Lanckriet, G. R., Schölkopf, B., and Sriperumbudur, B. K. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, pp. 1750-1758, 2009.

Givan, R., Dean, T., and Greig, M. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163-223, 2003.

Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., et al. Deep Q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In NIPS, 2016.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. arXiv preprint arXiv:1610.09512, 2016.
Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, 2002.

Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306-312, 2003.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krishnamurthy, A., Agarwal, A., and Langford, J. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pp. 1840-1848, 2016.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583-612, 2016.

Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118-1125. IEEE, 2018.

Luxburg, U. v. and Bousquet, O. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669-695, 2004.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

Müller, K., Smola, A., and Rätsch, G. Predicting time series with support vector machines. Artificial Neural Networks (ICANN'97), 1327:999-1004, 1997. doi: 10.1007/BFb0020283. URL http://link.springer.com/chapter/10.1007/BFb0020283.

Munos, R. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, volume 20, pp. 1006, 2005.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815-857, 2008.

Nair, A., Chen, D., Agrawal, P., Isola, P., Abbeel, P., Malik, J., and Levine, S. Combining self-supervised learning and imitation for vision-based rope manipulation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2146-2153. IEEE, 2017.

Pan, Y., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E., and Boots, B. Agile autonomous driving using end-to-end deep imitation learning. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, 2018.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.

Ross, S. and Bagnell, J. A. Efficient reductions for imitation learning. In AISTATS, pp. 661-668, 2010.

Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.

Ross, S., Gordon, G. J., and Bagnell, J. A. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In ICML, pp. 1889-1897, 2015.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.

Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550-1599, 2012.
Sun, W., Venkatraman, A., Boots, B., and Bagnell, J. A. Learning to filter with predictive state inference machines. In ICML, 2016.

Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.

Sutton, R. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.

Syed, U. and Schapire, R. E. A reduction from apprenticeship learning to classification. In Advances in Neural Information Processing Systems, pp. 2253-2261, 2010.

Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In AAAI, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.