Imitation with Neural Density Models

Kuno Kim¹, Akshat Jindal¹, Yang Song¹, Jiaming Song¹, Yanan Sui², Stefano Ermon¹
¹Department of Computer Science, Stanford University
²NELN, School of Aerospace Engineering, Tsinghua University
Contact: khkim@cs.stanford.edu
35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We propose a new framework for Imitation Learning (IL) via density estimation of the expert's occupancy measure followed by Maximum Occupancy Entropy Reinforcement Learning (RL) using the density as a reward. Our approach maximizes a non-adversarial, model-free RL objective that provably lower bounds the negative reverse Kullback-Leibler divergence between the occupancy measures of the expert and imitator. We present a practical IL algorithm, Neural Density Imitation (NDI), which obtains state-of-the-art demonstration efficiency on benchmark control tasks.

1 Introduction

Imitation Learning (IL) algorithms aim to learn optimal behavior by mimicking expert demonstrations. Perhaps the simplest IL method is Behavioral Cloning (BC) (Pomerleau, 1991), which ignores the dynamics of the underlying Markov Decision Process (MDP) that generated the demonstrations and treats IL as a supervised learning problem of predicting optimal actions given states. Prior work showed that even if the learned policy incurs a small BC loss, the worst-case performance gap between the expert and imitator grows quadratically with the number of decision steps (Ross & Bagnell, 2010; Ross et al., 2011a). The crux of their argument is that policies that are "close" as measured by BC loss can induce disastrously different distributions over states when deployed in the environment. One family of solutions to mitigating such compounding errors is Interactive IL (Guo et al., 2014; Ross et al., 2011b, 2013), which involves running the imitator's policy and collecting corrective actions from an interactive expert. However, interactive expert queries are expensive and seldom available. Another family of approaches (Fu et al., 2017; Ho & Ermon, 2016; Ke et al., 2020; Kim & Park, 2018; Kostrikov et al., 2020; Wang et al., 2017) that has gained much traction is to directly minimize a statistical distance between the state-action distributions induced by the policies of the expert and imitator, i.e. the occupancy measures $\rho_E$ and $\rho_\pi$. As $\rho_\pi$ is an implicit distribution induced by the policy and the environment (we assume only samples can be taken from the environment dynamics and that its density is unknown), distribution matching with $\rho_\pi$ typically requires likelihood-free methods involving sampling. Sampling from $\rho_\pi$ entails running the imitator policy in the environment, which is not required by BC. While distribution matching IL requires additional access to an environment simulator, it has been shown to drastically improve demonstration efficiency, i.e. the number of demonstrations needed to succeed at IL (Ho & Ermon, 2016). A wide suite of distribution matching IL algorithms use adversarial methods to match $\rho_\pi$ and $\rho_E$, which requires alternating between reward (discriminator) and policy (generator) updates (Fu et al., 2017; Ho & Ermon, 2016; Ke et al., 2020; Kim et al., 2019; Kostrikov et al., 2020). A key drawback of such Adversarial Imitation Learning (AIL) methods is that they inherit the instability of alternating min-max optimization (Miyato et al., 2018; Salimans et al., 2016), which is generally not guaranteed to converge (Jin et al., 2019). Furthermore, this instability is exacerbated in the IL setting, where generator updates involve high-variance policy optimization, and leads to sub-optimal demonstration efficiency.
To alleviate this instability, several works (Brantley et al., 2020; Reddy et al., 2017; Wang et al., 2019) have proposed to do RL with fixed heuristic rewards. Wang et al. (2019), for example, use a heuristic reward that estimates the support of $\rho_E$, which discourages the imitator from visiting out-of-support states. While having the merit of simplicity, these approaches have no guarantee of recovering the true expert policy. In this work, we propose a new framework for IL via obtaining a density estimate $q_\phi$ of the expert's occupancy measure $\rho_E$ followed by Maximum Occupancy Entropy Reinforcement Learning (MaxOccEntRL) (Islam et al., 2019; Lee et al., 2019). In the MaxOccEntRL step, the density estimate $q_\phi$ is used as a fixed reward for RL and the occupancy entropy $H(\rho_\pi)$ is simultaneously maximized, leading to the objective $\max_\pi \mathbb{E}_{\rho_\pi}[\log q_\phi(s, a)] + H(\rho_\pi)$. Intuitively, our approach encourages the imitator to visit high-density state-action pairs under $\rho_E$ while maximally exploring the state-action space. There are two main challenges to this approach. First, we require accurate density estimation of $\rho_E$, which is particularly challenging when the state-action space is high dimensional and the number of expert demonstrations is limited. Second, in contrast to Maximum Entropy RL (MaxEntRL), MaxOccEntRL requires maximizing the entropy of an implicit density $\rho_\pi$. We address the former challenge by leveraging advances in density estimation (Du & Mordatch, 2018; Germain et al., 2015; Song et al., 2019). For the latter challenge, we derive a non-adversarial, model-free RL objective that provably maximizes a lower bound on the occupancy entropy. As a byproduct, we also obtain a model-free RL objective that lower bounds the negative reverse Kullback-Leibler (KL) divergence between $\rho_\pi$ and $\rho_E$. The contribution of our work is to introduce a novel family of distribution matching IL algorithms, named Neural Density Imitation (NDI), that (1) optimizes a principled lower bound on the additive inverse of the reverse KL divergence, thereby avoiding adversarial optimization, and (2) advances the state of the art in demonstration efficiency for IL.

2 Imitation Learning via density estimation

We model an agent's decision making process as a discounted infinite-horizon Markov Decision Process (MDP) $\mathcal{M} = (S, A, P, P_0, r, \gamma)$. Here $S, A$ are the state and action spaces, $P : S \times A \to \Delta(S)$ is the transition dynamics, where $\Delta(S)$ is the set of probability measures on $S$, $P_0 : S \to \mathbb{R}$ is an initial state distribution, $r : S \times A \to \mathbb{R}$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor. A parameterized policy $\pi : S \to \Delta(A)$ distills the agent's decision making rule, and $\{s_t, a_t\}_{t=0}^{\infty}$ is the stochastic process realized by sampling an initial state $s_0 \sim P_0(s)$ and then running $\pi$ in the environment, i.e. $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$. We denote by $p_{\pi, t:t+k}$ the joint distribution of the states $\{s_t, s_{t+1}, ..., s_{t+k}\}$, where setting $k = 0$, i.e. $p_{\pi, t}$, recovers the marginal of $s_t$. The (unnormalized) occupancy measure of $\pi$ is defined as $\rho_\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t p_{\pi, t}(s)\,\pi(a|s)$. Intuitively, $\rho_\pi(s, a)$ quantifies the frequency of visiting the state-action pair $(s, a)$ when running $\pi$ for a long time, with more emphasis on earlier states. We denote policy performance as $J(\pi, \tilde{r}) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t \tilde{r}(s_t, a_t)] = \mathbb{E}_{(s,a) \sim \rho_\pi}[\tilde{r}(s, a)]$, where $\tilde{r}$ is a (potentially) augmented reward function and $\mathbb{E}$ denotes the generalized expectation operator extended to non-normalized densities $\hat{p} : X \to \mathbb{R}_+$ and functions $f : X \to Y$, so that $\mathbb{E}_{\hat{p}}[f(x)] = \sum_x \hat{p}(x) f(x)$.
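As a concrete illustration of the $\gamma$-discounted occupancy measure and the generalized expectation $\mathbb{E}_{\rho_\pi}$, the following sketch (ours, not from the paper) estimates $\mathbb{E}_{(s,a) \sim \rho_\pi}[f(s, a)]$ from sampled rollouts by weighting each visited state-action pair by $\gamma^t$; the rollout format and function names are hypothetical.

```python
def occupancy_expectation(rollouts, f, gamma=0.99):
    """Monte-Carlo estimate of E_{rho_pi}[f(s, a)] = sum_t gamma^t E[f(s_t, a_t)].

    `rollouts` is a list of trajectories, each a list of (state, action) pairs
    collected by running the policy pi in the environment (hypothetical format).
    Note that rho_pi is unnormalized: its total mass is 1 / (1 - gamma).
    """
    total = 0.0
    for trajectory in rollouts:
        for t, (s, a) in enumerate(trajectory):
            total += gamma ** t * f(s, a)
    return total / len(rollouts)

# Toy usage with two fake 3-step rollouts of scalar states/actions.
rollouts = [[(0.0, 1.0), (0.5, -1.0), (1.0, 0.0)],
            [(0.1, 0.5), (0.4, 0.2), (0.9, -0.3)]]
print(occupancy_expectation(rollouts, lambda s, a: s + a, gamma=0.9))
```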
The choice of $\tilde{r}$ depends on the RL framework. In standard RL we simply have $\tilde{r} = r$, while in Maximum Entropy RL (MaxEntRL) (Haarnoja et al., 2017) we have $\tilde{r}(s, a) = r(s, a) - \log \pi(a|s)$. We denote the entropy of $\rho_\pi(s, a)$ as $H(\rho_\pi) = \mathbb{E}_{\rho_\pi}[-\log \rho_\pi(s, a)]$ and overload notation to denote the $\gamma$-discounted causal entropy of the policy $\pi$ as $H(\pi) = \mathbb{E}_\pi[-\sum_{t=0}^{\infty} \gamma^t \log \pi(a_t|s_t)] = \mathbb{E}_{\rho_\pi}[-\log \pi(a|s)]$. Note that we use a generalized notion of entropy whose domain is extended to non-normalized densities. We can then define the Maximum Occupancy Entropy RL (MaxOccEntRL) (Islam et al., 2019; Lee et al., 2019) objective as $\max_\pi J(\pi, \tilde{r} = r) + H(\rho_\pi)$. Note the key difference between MaxOccEntRL and MaxEntRL: entropy regularization is applied to the occupancy measure instead of the policy, i.e. it seeks state diversity instead of action diversity. We will later show, in Section 2.2, that a lower bound on this objective reduces to a completely model-free RL objective with an augmented reward $\tilde{r}$.

Let $\pi_E, \pi$ denote an expert and imitator policy, respectively. Given only demonstrations $D = \{(s, a)_i\}_{i=1}^{k} \sim \rho_E$ of state-action pairs sampled from the expert, Imitation Learning (IL) aims to learn a policy that matches the expert, i.e. $\pi = \pi_E$. Formally, IL can be recast as a distribution matching problem (Ho & Ermon, 2016; Ke et al., 2020) between the occupancy measures $\rho_\pi$ and $\rho_E$:

$$\text{maximize}_\pi \; -d(\rho_\pi, \rho_E) \quad (1)$$

where $d(\hat{p}, \hat{q})$ is a generalized statistical distance defined on the extended domain of (potentially) non-normalized probability densities $\hat{p}(x), \hat{q}(x)$ with the same normalization factor $Z > 0$, i.e. $\int_x \hat{p}(x)/Z = \int_x \hat{q}(x)/Z = 1$. For $\rho_\pi$ and $\rho_E$, we have $Z = \frac{1}{1-\gamma}$. As we are only able to take samples from the transition kernel and its density is unknown, $\rho_\pi$ is an implicit distribution. Thus, optimizing Eq. 1 typically requires likelihood-free approaches leveraging samples from $\rho_\pi$, i.e. running $\pi$ in the environment. Current state-of-the-art IL approaches use likelihood-free adversarial methods to approximately optimize Eq. 1 for various choices of $d$, such as reverse Kullback-Leibler (KL) divergence (Fu et al., 2017; Kostrikov et al., 2020) and Jensen-Shannon (JS) divergence (Ho & Ermon, 2016). However, adversarial methods are known to suffer from optimization instability, which is exacerbated in the IL setting where one step in the alternating optimization involves RL. We instead derive a non-adversarial objective for IL. In this work, we choose $d$ to be the (generalized) reverse KL divergence and leave derivations for alternate choices of $d$ to future work:

$$-D_{KL}(\rho_\pi \,\|\, \rho_E) = \mathbb{E}_{\rho_\pi}[\log \rho_E(s, a) - \log \rho_\pi(s, a)] = J(\pi, \tilde{r} = \log \rho_E) + H(\rho_\pi) \quad (2)$$

We see that maximizing the negative reverse KL with respect to $\pi$ is equivalent to MaxOccEntRL with $\log \rho_E$ as the fixed reward. Intuitively, this objective drives $\pi$ to visit states that are most likely under $\rho_E$ while maximally spreading out probability mass, so that if two state-action pairs are equally likely, the policy visits both. A minimal code sketch of this density-as-reward reduction is given after the two challenges below. There are two main challenges associated with this approach, which we address in the following sections.

1. $\log \rho_E$ is unknown and must be estimated from the demonstrations $D$. Density estimation remains a challenging problem, especially when there is a limited number of samples and the data is high dimensional (Liu et al., 2007). Note that simply extracting the conditional $\pi_E(a|s)$ from an estimate of the joint $\rho_E(s, a)$ is an alternate way to do BC and does not resolve the compounding error problem (Ross et al., 2011a).

2. $H(\rho_\pi)$ is hard to maximize because $\rho_\pi$ is an implicit density. This challenge is similar to the difficulty of entropy-regularizing generators (Belghazi et al., 2018; Dieng et al., 2019; Mohamed & Lakshminarayanan, 2016) for Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and most existing approaches (Dieng et al., 2019; Lee et al., 2019) use adversarial optimization.
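To make the reduction in Eq. (2) concrete, here is a minimal sketch (ours) of wrapping a learned density estimate of $\rho_E$ into the fixed RL reward $\tilde{r}(s, a) = \log q_\phi(s, a)$; the `log_prob(s, a)` interface and the clipping range are assumptions, and the occupancy-entropy term $H(\rho_\pi)$ is handled separately by the RL step described later.

```python
import numpy as np

class DensityReward:
    """Turns a learned density estimate q_phi of the expert occupancy measure
    into a fixed reward r(s, a) = log q_phi(s, a), following Eq. (2).

    `density_model` is assumed to expose a (hypothetical) `log_prob(s, a)`
    method. The occupancy-entropy bonus H(rho_pi) is not handled here; it is
    added by the MaxOccEntRL step.
    """

    def __init__(self, density_model, clip_range=(-20.0, 20.0)):
        self.density_model = density_model
        # Clipping is our own safeguard against extreme log-density values.
        self.clip_range = clip_range

    def __call__(self, state, action):
        log_q = self.density_model.log_prob(state, action)
        return float(np.clip(log_q, *self.clip_range))
```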
2.1 Estimating the expert occupancy measure

We seek to learn a parameterized density model $q_\phi(s, a)$ of $\rho_E$ from samples. We consider two canonical families of density models: autoregressive models and energy-based models (EBMs). Autoregressive models (Germain et al., 2015; Papamakarios et al., 2017): an autoregressive model $q_\phi(x)$ for $x = (s, a)$ learns a factorized distribution of the form $q_\phi(x) = \prod_i q_{\phi_i}(x_i \mid x_{<i})$.

In Eq. 11, $\lambda, \lambda_f > 0$ are weights introduced to control the influence of the occupancy entropy regularization. In practice, Eq. 11 can be maximized using any RL algorithm by simply setting the reward function to be $\tilde{r}$ from Eq. 11. In this work, we use Soft Actor-Critic (SAC) (Haarnoja et al., 2018). Note that SAC already includes a policy entropy bonus, so we do not separately include one. For our critic $f$, we fix it to be a normalized RBF kernel for simplicity,

$$f(s_{t+1}, s_t) = \log \frac{e^{-\|s_{t+1} - s_t\|_2^2}}{\mathbb{E}_{q_t, q_{t+1}}\!\left[e^{-\|s_{t+1} - s_t\|_2^2}\right]} + 1 \quad (12)$$

but future works could explore learning the critic to match the optimal critic. While simple, our choice of $f$ emulates two important properties of the optimal critic $f^*(x, y) = \log \frac{p(x|y)}{p(x)} + 1$: (1) it follows the same "form" of a log-density ratio plus a constant, and (2) consecutive states sampled from the joint, i.e. $s_t, s_{t+1} \sim p_{\pi, t:t+1}$, have high value under our $f$ since they are likely to be close to each other under smooth dynamics, while samples from the marginals, $s_t \sim q_t, s_{t+1} \sim q_{t+1}$, are likely to have lower value under $f$ since they can be arbitrarily different states. To estimate the expectations with respect to $q_t, q_{t+1}$ in Eq. 8, we simply take samples of previously visited states at times $t, t+1$ from the replay buffer.

Table 1: Comparison between different families of distribution matching IL algorithms.

| IL method | Learned models | Relation between f-divergence and optimized objective | Objective type |
|---|---|---|---|
| Support | policy $\pi$, support estimator $f$ | neither upper nor lower bound | max |
| Adversarial | policy $\pi$, discriminator $D$ | tight upper bound | min max |
| NDI (ours) | policy $\pi$, critic $f$, density $q_\phi$ | loose lower bound | max max |

4 Trade-offs between Distribution Matching IL algorithms

Adversarial Imitation Learning (AIL) methods find a policy that maximizes an upper bound to the additive inverse of an f-divergence between the expert and imitator occupancies (Ghasemipour et al., 2019; Ke et al., 2020). For example, if the f-divergence is reverse KL, then for any $D : S \times A \to \mathbb{R}$,

$$-D_{KL}(\rho_\pi \,\|\, \rho_E) \le \mathbb{E}_{\rho_E}[e^{D(s,a)}] - \mathbb{E}_{\rho_\pi}[D(s, a)]$$

where the bound is tight at $D(s, a) = \log \frac{\rho_\pi(s,a)}{\rho_E(s,a)} + C$ for any constant $C$. AIL alternates between a $D$ update on $\mathbb{E}_{\rho_E}[e^{D(s,a)}] - \mathbb{E}_{\rho_\pi}[D(s, a)]$ and a $\pi$ update on $\mathbb{E}_{\rho_\pi}[D(s, a)]$. The discriminator update step in AIL minimizes the upper bound with respect to $D$, tightening the estimate of the reverse KL, and the policy update step maximizes the tightened bound. We thus see that by using an upper bound, AIL inevitably ends up with alternating min-max optimization, where policy and discriminator updates act in opposing directions. The key issue with such adversarial optimization lies not in coordinate descent itself, but in its application to a min-max objective, which is widely known to give rise to optimization instability (Salimans et al., 2016).
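To make the alternating structure above concrete, here is a small schematic (ours, not the authors' implementation) that only evaluates the two objectives on sampled batches; `D` is any callable scoring state-action pairs, the batch format is a placeholder, and the actual min/max directions are as described in the text.

```python
import numpy as np

def ail_upper_bound(D, expert_batch, imitator_batch):
    """Variational upper bound on -KL(rho_pi || rho_E) from the AIL discussion:
    E_{rho_E}[exp(D(s, a))] - E_{rho_pi}[D(s, a)].
    The discriminator update tightens this bound with respect to D."""
    expert_term = np.mean([np.exp(D(s, a)) for s, a in expert_batch])
    imitator_term = np.mean([D(s, a) for s, a in imitator_batch])
    return expert_term - imitator_term

def ail_policy_term(D, imitator_batch):
    """The pi-dependent term E_{rho_pi}[D(s, a)] that the policy update acts on,
    i.e. D plays the role of a reward for the RL (generator) step."""
    return np.mean([D(s, a) for s, a in imitator_batch])
```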
The key insight of NDI is to instead derive an objective that lower bounds the additive inverse of the reverse KL. Recall from Eq. 9 that NDI maximizes the lower bound with the SAELBO $H_f(\rho_\pi)$:

$$\max_\pi \; -D_{KL}(\rho_\pi \,\|\, \rho_E) \ge \max_\pi \; J(\pi, \tilde{r} = \log \rho_E) + H_f(\rho_\pi)$$

Unlike the AIL upper bound, this lower bound is not tight. With critic $f$ updates, NDI alternates between an $f$ update on $I_{NWJ}(s_{t+1}; s_t \mid \pi)$ and a $\pi$ update on

$$J(\pi, \tilde{r} = \log \rho_E) + (1 + \gamma)H(\pi) + \gamma\, I_{NWJ}(s_{t+1}; s_t \mid \pi).$$

The critic update step in NDI maximizes the lower bound with respect to $f$, tightening the estimate of the reverse KL, and the policy update step maximizes the tightened bound. In other words, for AIL the policy $\pi$ and discriminator $D$ seek to push the upper bound in opposing directions, while in NDI the policy $\pi$ and critic $f$ push the lower bound in the same direction. Unlike AIL, NDI does not perform alternating min-max but instead alternating max-max! While NDI enjoys non-adversarial optimization, it comes at the cost of having to use a non-tight lower bound on the occupancy divergence. On the other hand, AIL optimizes a tight upper bound at the cost of unstable alternating min-max optimization. Support matching IL algorithms also avoid min-max, but their objective is neither an upper nor a lower bound on the occupancy divergence. Table 1 summarizes the trade-offs between different families of algorithms for distribution matching IL.

5 Related Works

Prior literature on Imitation Learning (IL) in the absence of an interactive expert revolves around Behavioral Cloning (BC) (Pomerleau, 1991; Wu et al., 2019), distribution matching IL (Ghasemipour et al., 2019; Ho & Ermon, 2016; Ke et al., 2020; Kim et al., 2019; Kostrikov et al., 2020; Song et al., 2018), and Inverse Reinforcement Learning (Brown et al., 2019; Fu et al., 2017; Uchibe, 2018). Many approaches in the latter category minimize statistical divergences using adversarial methods to solve a min-max optimization problem, alternating between reward (discriminator) and policy (generator) updates. ValueDICE (Kostrikov et al., 2020), a more recently proposed adversarial IL approach, reformulates the reverse KL divergence into a completely off-policy objective, thereby greatly reducing the number of environment interactions. A key issue with such Adversarial Imitation Learning (AIL) approaches is optimization instability (Jin et al., 2019; Miyato et al., 2018). Recent works have sought to avoid adversarial optimization by instead performing RL with a heuristic reward function that estimates the support of the expert occupancy measure. Random Expert Distillation (RED) (Wang et al., 2019) and Disagreement-regularized IL (Brantley et al., 2020) are two representative approaches in this family. A key limitation of these approaches is that support estimation is insufficient to recover the expert policy, and thus they require an additional behavioral cloning step. Unlike AIL, we maximize a non-adversarial RL objective, and unlike heuristic reward approaches, our objective provably lower bounds the negative reverse KL divergence between the occupancy measures of the expert and imitator.

Density estimation with deep neural networks is an active research area, and much progress has been made towards modeling high-dimensional structured data like images and audio. Most successful approaches parameterize a normalized probability model and estimate it with maximum likelihood, e.g., autoregressive models (Germain et al., 2015; Uria et al., 2013, 2016; van den Oord et al., 2016) and normalizing flow models (Dinh et al., 2014, 2016; Kingma & Dhariwal, 2018).
Some other methods explore estimating non-normalized probability models with MCMC (Du & Mordatch, 2019; Yu et al., 2020) or training with alternative statistical divergences such as score matching (Hyvärinen, 2005; Song & Ermon, 2019; Song et al., 2019) and noise contrastive estimation (Gao et al., 2019; Gutmann & Hyvärinen, 2010). Related to MaxOccEntRL, recent works (Hazan et al., 2019; Islam et al., 2019; Lee et al., 2019) on exploration in RL have investigated state-marginal occupancy entropy maximization. To do so, Hazan et al. (2019) require access to a robust planning oracle, while Lee et al. (2019) use fictitious play (Brown, 1951), an alternative adversarial algorithm that is guaranteed to converge. Unlike these works, our approach maximizes the SAELBO, which requires neither a planning oracle nor min-max optimization and is trivial to implement with existing RL algorithms.

6 Experiments

Environment: Following prior work, we run experiments on benchmark MuJoCo (Brockman et al., 2016; Todorov et al., 2012) tasks: Hopper (11, 3), HalfCheetah (17, 6), Walker (17, 6), Ant (111, 8), and Humanoid (376, 17), where the (observation, action) dimensions are noted in parentheses.

Pipeline: We train expert policies using SAC (Haarnoja et al., 2018). All of our results are averaged across five random seeds, where for each seed we randomly sample a trajectory from an expert, perform density estimation, and then run MaxOccEntRL. Performance for each seed is averaged across 50 trajectories. For each seed we save the best imitator as measured by our augmented reward $\tilde{r}$ from Eq. 11 and report its performance with respect to the ground truth reward. We don't perform sparse subsampling of the data as in Ho & Ermon (2016), since real-world demonstration data typically isn't subsampled to such an extent and using full trajectories was sufficient to compare performance.

Architecture: We experiment with two variants of our method, NDI+MADE and NDI+EBM, where the only difference lies in the density model. Across all experiments, our density model $q_\phi$ is a two-layer MLP with 256 hidden units. For hyperparameters related to the MaxOccEntRL step, $\lambda = 0.2$ is fixed and for $\lambda_f$ see Section 6.3. For full details on architecture see Appendix B. A sketch of how the learned density and the critic-based bonus combine into the reward used for the RL step is given after Table 2 below.

Baselines: We compare our method against the following baselines: (1) Behavioral Cloning (BC) (Pomerleau, 1991): learns a policy via direct supervised learning on $D$. (2) Random Expert Distillation (RED) (Wang et al., 2019): estimates the support of the expert policy using a predictor and target network (Burda et al., 2018), followed by RL using this heuristic reward. (3) Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016): on-policy adversarial IL method which alternates reward and policy updates. (4) ValueDICE (Kostrikov et al., 2020): current state-of-the-art adversarial IL method that works off-policy. See Appendix B for baseline implementation details.

Table 2: Task performance when provided with one demonstration. NDI (orange rows) outperforms all baselines on all tasks. See Appendix C.2 for results with varying numbers of demonstrations.

| | HOPPER | HALF-CHEETAH | WALKER | ANT | HUMANOID |
|---|---|---|---|---|---|
| RANDOM | 14 ± 8 | 282 ± 80 | 1 ± 5 | 70 ± 111 | 123 ± 35 |
| BC | 1432 ± 382 | 2674 ± 633 | 1691 ± 1008 | 1425 ± 812 | 353 ± 171 |
| RED | 448 ± 516 | 383 ± 819 | 309 ± 193 | 910 ± 175 | 242 ± 67 |
| GAIL | 3261 ± 533 | 3017 ± 531 | 3957 ± 253 | 2299 ± 519 | 204 ± 67 |
| VALUEDICE | 2749 ± 571 | 3456 ± 401 | 3342 ± 1514 | 1016 ± 313 | 364 ± 50 |
| NDI+MADE | 3288 ± 94 | 4119 ± 71 | 4518 ± 127 | 555 ± 311 | 6088 ± 689 |
| NDI+EBM | 3458 ± 210 | 4511 ± 569 | 5061 ± 135 | 4293 ± 431 | 5305 ± 555 |
| EXPERT | 3567 ± 4 | 4142 ± 132 | 5006 ± 472 | 4362 ± 827 | 5417 ± 2286 |
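As referenced in the Architecture paragraph, here is a compact sketch (ours, under stated assumptions) of how the pieces of the MaxOccEntRL reward can be assembled: a fitted density model exposing $\log q_\phi(s, a)$, the fixed normalized-RBF critic of Eq. (12) with its marginal expectation estimated from replay-buffer states, and a $\lambda_f$-weighted combination. The exact weighting in Eq. (11) is not reproduced here, the helper names are hypothetical, and SAC's own entropy bonus is assumed to supply the policy-entropy term.

```python
import numpy as np

def rbf_critic(replay_states_t, replay_states_tp1):
    """Sketch of the fixed normalized-RBF critic of Eq. (12):
    f(s_{t+1}, s_t) = log( exp(-||s_{t+1} - s_t||^2)
                          / E_{q_t, q_{t+1}}[exp(-||s_{t+1} - s_t||^2)] ) + 1.
    The denominator expectation over the marginals q_t, q_{t+1} is estimated
    from previously visited states sampled from the replay buffer."""
    denom = np.mean([np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2))
                     for x in replay_states_tp1 for y in replay_states_t])

    def f(s_next, s):
        num = np.exp(-np.sum((np.asarray(s_next) - np.asarray(s)) ** 2))
        return float(np.log(num / (denom + 1e-12) + 1e-12) + 1.0)

    return f

def ndi_reward(log_q, critic_f, lam_f=0.005):
    """Hypothetical assembly of the augmented reward handed to SAC for the
    MaxOccEntRL step: the learned expert log-density plus a lam_f-weighted
    critic bonus (the exact weighting of Eq. (11) may differ)."""
    def reward(s, a, s_next):
        return log_q(s, a) + lam_f * critic_f(s_next, s)
    return reward
```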
6.1 Task Performance

Table 2 compares the ground truth reward acquired by agents trained with various IL algorithms when one demonstration is provided by the expert (see Appendix C.2 for performance comparisons with varying numbers of demonstrations). NDI+EBM achieves expert-level performance on all MuJoCo benchmarks when provided with one demonstration and outperforms all baselines on all benchmarks. NDI+MADE achieves expert-level performance on 4/5 tasks but fails on Ant. We found spurious modes in the density learned by MADE for Ant, and the RL algorithm was converging to these local maxima. We found that baselines are commonly unable to solve Humanoid (the most difficult task considered) with one demonstration. RED is unable to perform well on any of the tasks without BC pretraining as done in Wang et al. (2019). For fair comparison with methods that do not use pretraining, we also do not use pretraining for RED. See Appendix C.4 for results with a BC pretraining step added to all algorithms. GAIL and ValueDICE perform comparably with each other, both outperforming behavioral cloning. We note that these results are somewhat unsurprising given that ValueDICE (Kostrikov et al., 2020) did not claim to improve demonstration efficiency over GAIL (Ho & Ermon, 2016), but rather focused on reducing the number of environment interactions. Both methods notably under-perform the expert on Ant-v3 and Humanoid-v3, which have the largest state-action spaces. Although minimizing the number of environment interactions was not a targeted goal of this work, we found that NDI requires roughly an order of magnitude fewer environment interactions than GAIL. Please see Appendix C.5 for full environment sample complexity comparisons.

6.2 Density Evaluation

In this section, we examine the learned density model $q_\phi$ for NDI+EBM and show that it correlates strongly with the true MuJoCo rewards, which are linear functions of forward velocity. We randomly sample test states $s$ and multiple test actions $a_s$ per test state, both from a uniform distribution with boundaries at the minimum/maximum state-action values in the demonstration set. We then visualize the log marginal $\log q_\phi(s) = \log \sum_{a_s} q_\phi(s, a_s)$ projected onto two state dimensions: one corresponding to the forward velocity of the robot and the other a random selection, e.g. the knee joint angle. Each point in Figure 1 corresponds to a projection of a sampled test state $s$, and the colors scale with the value of $\log q_\phi(s)$. For all environments besides Humanoid, we found that the density estimate positively correlates with velocity even on uniformly drawn state-actions which were not contained in the demonstrations. We found similar correlations for Humanoid on states in the demonstration set. Intuitively, a good density estimate should indeed have such correlations, since the true expert occupancy measure should positively correlate with forward velocity due to the expert attempting to consistently maintain high velocity.

Figure 1: Learned density visualization. We randomly sample test states $s$ and multiple test actions $a_s$ per test state, both from a uniform distribution, then visualize the log marginal $\log q_\phi(s) = \log \sum_{a_s} q_\phi(s, a_s)$ projected onto two state dimensions: one corresponding to forward velocity and the other a random selection. Much like the true reward function in MuJoCo environments, we found that the log marginal positively correlates with forward velocity on 4/5 tasks.
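The log-marginal visualized in Section 6.2 and Figure 1 can be approximated by summing the learned density over sampled actions; the following sketch (ours) assumes a `log_q(state, action)` callable and uniform action bounds taken from the demonstration set, and uses a log-sum-exp for numerical stability.

```python
import numpy as np

def estimate_log_marginal(log_q, state, action_low, action_high,
                          n_actions=128, seed=0):
    """Estimate log q_phi(s) = log sum_{a_s} q_phi(s, a_s) over test actions a_s
    drawn uniformly within the demonstration action bounds (as in Section 6.2).
    `log_q(state, action)` is a hypothetical interface returning log q_phi(s, a)."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(action_low, action_high,
                          size=(n_actions, len(action_low)))
    log_qs = np.array([log_q(state, a) for a in actions])
    # log-sum-exp; a constant offset does not change Figure 1's color ordering.
    m = log_qs.max()
    return float(m + np.log(np.sum(np.exp(log_qs - m))))
```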
6.3 Ablation studies

As intuited in Section 2.2, maximizing the SAELBO can be more effective for occupancy entropy maximization than solely maximizing policy entropy (see Appendix C.1 for experiments that support this). This is because in discrete state-spaces the SAELBO $H_f(\rho_\pi)$ is a tighter lower bound on the occupancy entropy $H(\rho_\pi)$ than the policy entropy $H(\pi)$, i.e. $H(\rho_\pi) \ge H_f(\rho_\pi) \ge H(\pi)$, and in continuous state-spaces, where Assumption 1 holds, the SAELBO is still a lower bound while the policy entropy alone is neither a lower nor an upper bound on the occupancy entropy. As a consequence, we found that SAELBO maximization ($\lambda_f > 0$) leads to better occupancy distribution matching than sole policy entropy maximization ($\lambda_f = 0$). Table 3 shows the effect of varying $\lambda_f$ on task performance (reward) and imitation performance (KL), i.e. the similarity between $\pi, \pi_E$ measured as $\mathbb{E}_{s \sim \pi}[D_{KL}(\pi(\cdot|s) \,\|\, \pi_E(\cdot|s))]$.

Table 3: Effect of varying the MI reward weight $\lambda_f$ on (1) task performance of NDI-EBM (top row of each setting) and (2) imitation performance of NDI-EBM (bottom row of each setting), measured as the average KL divergence between $\pi, \pi_E$ on states $s$ sampled by running $\pi$ in the true environment, i.e. $\mathbb{E}_{s \sim \pi}[D_{KL}(\pi(\cdot|s) \,\|\, \pi_E(\cdot|s))]$, normalized by the average $D_{KL}$ between the random and expert policies. $D_{KL}(\pi \,\|\, \pi_E)$ can be computed analytically since $\pi, \pi_E$ are conditional Gaussians. The density model $q_\phi$ is trained with one demonstration. Setting $\lambda_f$ too large hurts task performance while setting it too small is suboptimal for matching the expert occupancy. A middle point of $\lambda_f = 0.005$ achieves a balance between the two metrics.

| | | HOPPER | HALF-CHEETAH | WALKER | ANT | HUMANOID |
|---|---|---|---|---|---|---|
| $\lambda_f = 0$ | REWARD | 3576 ± 154 | 5658 ± 698 | 5231 ± 122 | 4214 ± 444 | 5809 ± 591 |
| | KL | 0.13 ± 0.09 | 0.35 ± 0.12 | 0.31 ± 0.08 | 0.58 ± 0.09 | 0.55 ± 0.21 |
| $\lambda_f = 0.0001$ | REWARD | 3506 ± 188 | 5697 ± 805 | 5171 ± 157 | 4158 ± 523 | 5752 ± 632 |
| | KL | 0.15 ± 0.05 | 0.32 ± 0.15 | 0.25 ± 0.04 | 0.51 ± 0.05 | 0.41 ± 0.18 |
| $\lambda_f = 0.005$ | REWARD | 3458 ± 210 | 4511 ± 569 | 5061 ± 135 | 4293 ± 431 | 5305 ± 555 |
| | KL | 0.11 ± 0.02 | 0.17 ± 0.09 | 0.22 ± 0.14 | 0.32 ± 0.12 | 0.12 ± 0.14 |
| $\lambda_f = 0.1$ | REWARD | 1057 ± 29 | 103 ± 59 | 2710 ± 501 | 1021 ± 21 | 142 ± 50 |
| | KL | 0.78 ± 0.13 | 1.41 ± 0.51 | 0.41 ± 0.11 | 2.41 ± 1.41 | 0.89 ± 0.21 |
| EXPERT | REWARD | 3567 ± 4 | 4142 ± 132 | 5006 ± 472 | 4362 ± 827 | 5417 ± 2286 |

Setting $\lambda_f$ too large ($\geq 0.1$) hurts both task and imitation performance as the MI reward $r_f$ dominates the RL objective. Setting it too small ($\leq 0.0001$), i.e. only maximizing policy entropy $H(\pi)$, turns out to benefit task performance, sometimes enabling the imitator to outperform the expert by concentrating most of its trajectory probability mass on the mode of the expert's trajectory distribution. However, the boosted task performance comes at the cost of suboptimal imitation performance, e.g. the imitator cheetah running faster than the expert. We found that a middle point of $\lambda_f = 0.005$ simultaneously achieves expert-level task performance and good imitation performance. In summary, these results show that SAELBO $H_f$ maximization ($\lambda_f > 0$) improves distribution matching between $\rho_\pi, \rho_E$ over policy entropy $H(\pi)$ maximization ($\lambda_f = 0$), but distribution matching may not be ideal for task performance maximization, e.g. in apprenticeship learning settings. See Appendix C.1, C.3 for extended ablation studies.

7 Discussion and Outlook

This work's main contribution is a new, principled framework for IL and an algorithm that obtains state-of-the-art demonstration efficiency. One future direction is to apply NDI to harder visual IL tasks, for which AIL is known to perform poorly.
While the focus of this work is to improve demonstration efficiency, another important IL performance metric is environment sample complexity. Future works could explore combining off-policy RL or model-based RL with NDI to improve on this front. Finally, there is a rich space of questions to answer regarding the effectiveness of the SAELBO reward $r_f$. We posit, for example, that in video game environments $r_f$ may be crucial for success, since state-action entropy maximization has been shown to be far more effective than policy entropy maximization (Burda et al., 2018). Furthermore, one could improve the tightness of the SAELBO by incorporating negative samples (Van den Oord et al., 2018) and learning the critic function $f$ so that it is close to the optimal critic.

Acknowledgements

This research was supported in part by NSF (1651565, 1522054, 1733686), ONR (N000141912145), AFOSR (FA95501910024), ARO (W911NF-21-1-0125), TRI, and a Sloan Fellowship.

References

Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
Brantley, K., Sun, W., and Henaff, M. Disagreement-regularized imitation learning. 2020.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Brown, D., Goo, W., and Niekum, S. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. arXiv preprint arXiv:1907.03976, 2019.
Brown, G. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 1951.
Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
Dieng, A., Ruiz, F., Blei, D. M., and Titsias, M. K. Prescribed generative adversarial networks. arXiv preprint arXiv:1910.04302, 2019.
Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
Du, Y. and Mordatch, I. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2018.
Du, Y. and Mordatch, I. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
Gao, R., Nijkamp, E., Kingma, D. P., Xu, Z., Dai, A. M., and Wu, Y. N. Flow contrastive estimation of energy-based models. arXiv preprint arXiv:1912.00589, 2019.
Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. arXiv preprint arXiv:1502.03509, 2015.
Ghasemipour, S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. arXiv preprint arXiv:1911.02256, 2019.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.
Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297-304, 2010.
Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
Hazan, E., Kakade, S., Singh, K., and Soest, A. V. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2019.
Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. Stable Baselines. https://github.com/hill-a/stable-baselines, 2018.
Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.
Huszár, F. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695-709, 2005.
Islam, R., Seraj, R., Bacon, P.-L., and Precup, D. Entropy regularization with discounted future state distribution in policy gradient methods. arXiv preprint arXiv:1912.05104, 2019.
Jin, C., Netrapalli, P., and Jordan, M. I. What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, 2019.
Ke, L., Barnes, M., Sun, W., Lee, G., Choudhury, S., and Srinivasa, S. Imitation learning as f-divergence minimization. arXiv preprint arXiv:1905.12888, 2020.
Kim, K., Gu, Y., Song, J., Zhao, S., and Ermon, S. Domain adaptive imitation learning. arXiv preprint arXiv:1910.00105, 2019.
Kim, K.-E. and Park, H. S. Imitation learning via kernel mean embedding. AAAI, 2018.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215-10224, 2018.
Kostrikov, I., Nachum, O., and Tompson, J. Imitation learning via off-policy distribution matching.
Lee, L., Eysenbach, B., Parisotto, E., Xing, E., Levine, S., and Salakhutdinov, R. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019.
Liu, H., Lafferty, J., and Wasserman, L. Sparse nonparametric density estimation in high dimensions using the rodeo. In Proceedings of Machine Learning Research, volume 2, pp. 283-290, San Juan, Puerto Rico, 2007. PMLR.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
Mohamed, S. and Lakshminarayanan, B. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. arXiv preprint arXiv:0809.0853, 2010.
Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. arXiv preprint arXiv:1705.07057, 2017.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. arXiv preprint arXiv:1705.05363, 2017.
Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88-97, 1991.
Reddy, S., Dragan, A. D., and Levine, S. SQIL: Imitation learning via reinforcement learning with sparse rewards. arXiv preprint arXiv:1905.11108, 2019.
Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 661-668, 2010.
Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627-635, 2011a.
Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627-635, 2011b.
Ross, S., Melik-Barkhudarov, N., Shankar, K. S., Wendel, A., Dey, D., Bagnell, A. J., and Hebert, M. Learning monocular reactive UAV control in cluttered natural environments. International Conference on Robotics and Automation (ICRA), 2013.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
Song, J., Ren, H., Sadigh, D., and Ermon, S. Multi-agent generative adversarial imitation learning.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, 2019.
Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088, 2019.
Todorov, E. Convex and analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo. IEEE International Conference on Robotics and Automation (ICRA), 2014.
Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012. doi: 10.1109/IROS.2012.6386109.
Uchibe, E. Model-free deep inverse reinforcement learning by logistic regression. Neural Processing Letters, 47:891-905, 2018.
Uria, B., Murray, I., and Larochelle, H. RNADE: The real-valued neural autoregressive density-estimator. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 2175-2183. Curran Associates, Inc., 2013.
Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184-7220, 2016.
van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
Van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Wang, R., Ciliberto, C., Amadori, P., and Demiris, Y. Random expert distillation: Imitation learning via expert policy support estimation. 2019.
Wang, Z., Merel, J., Reed, S., Wayne, G., Freitas, N. d., and Heess, N. Robust imitation of diverse behaviors. arXiv preprint arXiv:1707.02747, 2017.
Wu, A., Piergiovanni, A., and Ryoo, M. S. Model-based behavioral cloning with future image similarity learning. arXiv preprint arXiv:1910.03157, 2019.
Yu, L., Song, Y., Song, J., and Ermon, S. Training deep energy-based models with f-divergence minimization. arXiv preprint arXiv:2003.03463, 2020.