# iqlearn_inverse_softq_learning_for_imitation__b28479f6.pdf

IQ-Learn: Inverse soft-Q Learning for Imitation

Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, Stefano Ermon
Stanford University
{divgarg, shuvamc, cundy, tsong, ermon}@stanford.edu

Abstract

In many sequential decision-making problems (e.g., robotics control, game playing, sequential prediction), human or expert data is available containing useful information about the task. However, imitation learning (IL) from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics. Behavioral cloning is a simple method that is widely used due to its ease of implementation and stable convergence, but it doesn't utilize any information about the environment's dynamics. Many existing methods that exploit dynamics information are difficult to train in practice, due to an adversarial optimization process over reward and policy approximators or biased, high-variance gradient estimators. We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function, implicitly representing both reward and policy. On standard benchmarks, the implicitly learned rewards show a high positive correlation with the ground-truth rewards, illustrating that our method can also be used for inverse reinforcement learning (IRL). Our method, Inverse soft-Q learning (IQ-Learn), obtains state-of-the-art results in offline and online imitation learning settings, significantly outperforming existing methods both in the number of required environment interactions and in scalability to high-dimensional spaces, often by more than 3x.

1 Introduction

Imitation of an expert has long been recognized as a powerful approach for sequential decision-making [29, 1], with applications as diverse as healthcare [39], autonomous driving [41], and playing complex strategic games [8]. In the imitation learning (IL) setting, we are given a set of expert trajectories, with the goal of learning a policy which induces behavior similar to the expert's. The learner has no access to the reward and no explicit knowledge of the dynamics. The simple behavioural cloning [34] approach maximizes the probability of the expert's actions under the learned policy, treating the IL problem as a supervised learning problem. While this can work well in simple environments and with large quantities of data, it ignores the sequential nature of the decision-making problem, and small errors can quickly compound when the learned policy departs from the states observed under the expert.

A natural way of introducing the environment dynamics is by framing the IL problem as an Inverse RL (IRL) problem: aiming to learn a reward function under which the expert's trajectory is optimal, and from which the learned imitation policy can be trained [1]. This framing has inspired several approaches which use rewards either explicitly or implicitly to incorporate dynamics while learning an imitation policy [17, 10, 33, 22]. However, these dynamics-aware methods are typically hard to put into practice due to unstable learning, which can be sensitive to hyperparameter choice or minor implementation details [21].

In this work, we introduce a dynamics-aware imitation learning method which has stable, non-adversarial training, allowing us to achieve state-of-the-art performance on imitation learning benchmarks.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Table 1: A comparison of various algorithms for imitation learning.
Convergence guarantees refers to whether a proof is given that the algorithm converges to the correct policy with sufficient data. We consider an algorithm directly optimized if it consists of an optimization algorithm (such as gradient descent) applied to the parameters of a single function.

Properties: Dynamics | Non-Adversarial | Convergence | Non-restrictive | Direct Optimization

Online:
  Max Margin IRL [29, 1]: ✓ ✓ ✓
  Max Entropy IRL [43]: ✓ ✓ ✓
  GAIL/AIRL [17, 10]: ✓ ✓ ✓
  ASAF [4]: ✓ ✓ ✓ ✓
  SQIL [33]: ✓ ✓ ✓
  Ours (Online): ✓ ✓ ✓ ✓ ✓

Offline:
  Max Margin IRL [24, 20]: ✓ ✓ ✓
  Max Likelihood IRL [18]: ✓ ✓ ✓
  Max Entropy IRL [16]: ✓ ✓ ✓
  ValueDICE [22]: ✓
  Behavioral Cloning [34]: ✓ ✓ ✓
  Regularized BC [30]: ✓ ✓ ✓ ✓
  EDM [19]: ✓ ✓ ✓ ✓
  Ours (Offline): ✓ ✓ ✓ ✓ ✓

Our key insight is that much of the difficulty with previous IL methods arises from the IRL-motivated representation of the IL problem as a min-max problem over reward and policy [17, 1]. This introduces a requirement to separately model the reward and policy, and to train these two functions jointly, often in an adversarial fashion. Drawing on connections between RL and energy-based models [13, 14], we propose learning a single model for the Q-value. The Q-value then implicitly defines both a reward and a policy function. This turns a difficult min-max problem over policy and reward functions into a simpler minimization problem over a single function, the Q-value. Since our problem has a one-to-one correspondence with the min-max problem studied in adversarial IL [17], we maintain the generality and guarantees of these previous approaches, resulting in a meaningful reward that may be used for inverse reinforcement learning. Furthermore, our method may be used to minimize a variety of statistical divergences between the expert and learned policy. We show that we recover several previously-described approaches as special cases of particular divergences, such as the regularized behavioural cloning of [30] and the conservative Q-learning of [23]. In our experiments, we find that our method is performant even with very sparse data - surpassing prior methods using one expert demonstration in the completely offline setting - and can scale to complex image-based tasks like Atari, reaching expert performance. Moreover, our learnt rewards are highly predictive of the original environment rewards.

Concretely, our contributions are as follows:
- We present a modified Q-learning update rule for imitation learning that can be implemented on top of soft-Q learning or soft actor-critic (SAC) algorithms in fewer than 15 lines of code.
- We introduce a simple framework to minimize a wide range of statistical distances - Integral Probability Metrics (IPMs) and f-divergences - between the expert and learned distributions.
- We empirically show state-of-the-art results in a variety of imitation learning settings: online and offline IL. On the complex Atari suite, we outperform prior methods by 3-7x while requiring 3x fewer environment steps.
- We characterize our learnt rewards and show a high positive correlation with the ground-truth rewards, justifying the use of our method for Inverse Reinforcement Learning.

2 Background

Preliminaries. We consider environments represented as a Markov decision process (MDP), defined by a tuple (S, A, p_0, P, r, γ). S and A represent the state and action spaces, p_0 and P(s'|s, a) represent the initial state distribution and the dynamics, r(s, a) represents the reward function, and γ ∈ (0, 1) represents the discount factor.
R^{S×A} = {x : S×A → R} will denote the set of all functions on the state-action space, and \bar{R} will denote the extended real numbers R ∪ {∞}. Sections 3 and 4 work with finite state and action spaces S and A, but our algorithms and experiments later in the paper use continuous environments. Π is the set of all stationary stochastic policies that take actions in A given states in S. We work in the γ-discounted infinite-horizon setting, and we use an expectation with respect to a policy π ∈ Π to denote an expectation with respect to the trajectory it generates: E_π[r(s, a)] ≜ E[Σ_{t=0}^∞ γ^t r(s_t, a_t)], where s_0 ~ p_0, a_t ~ π(·|s_t), and s_{t+1} ~ P(·|s_t, a_t) for t ≥ 0. For a policy π ∈ Π, we define its occupancy measure ρ_π : S×A → R as ρ_π(s, a) = π(a|s) Σ_{t=0}^∞ γ^t P(s_t = s | π). We refer to the expert policy as π_E and its occupancy measure as ρ_E. In practice, π_E is unknown and we have access to a sampled dataset of demonstrations. For brevity, we write ρ for the occupancy measure ρ_π of a learnt policy π.

Soft Q-functions. For a reward r ∈ R^{S×A} and π ∈ Π, the soft Bellman operator B^π : R^{S×A} → R^{S×A} is defined as

    (B^π Q)(s, a) = r(s, a) + γ E_{s'~P(·|s,a)}[V^π(s')],   with   V^π(s) = E_{a~π(·|s)}[Q(s, a) − log π(a|s)].

The soft Bellman operator is contractive [13] and defines a unique soft Q-function for r, given by Q^π = B^π Q^π.

Max Entropy Reinforcement Learning. For a given reward function r ∈ R^{S×A}, maximum entropy RL [14, 5] aims to learn a policy that maximizes the expected cumulative discounted reward along with the entropy in each state: max_{π∈Π} E_π[r(s, a)] + H(π), where H(π) ≜ E_π[−log π(a|s)] is the discounted causal entropy of the policy π. The optimal policy satisfies [42, 5]:

    π*(a|s) = (1/Z_s) exp(Q(s, a)),    (1)

where Q is the soft Q-function and Z_s is the normalization factor given by Σ_{a'} exp(Q(s, a')). Q satisfies the soft Bellman equation:

    Q(s, a) = r(s, a) + γ E_{s'~P(·|s,a)}[log Σ_{a'} exp(Q(s', a'))].    (2)

In continuous action spaces, Z_s becomes intractable, and soft actor-critic methods like SAC [13] can be used to learn an explicit policy.

Max Entropy Inverse Reinforcement Learning. Given demonstrations sampled using the policy π_E, maximum entropy Inverse RL aims to recover a reward function in a family of functions R that rationalizes the expert behavior, by solving the optimization problem

    max_{r∈R} min_{π∈Π} E_{ρ_E}[r(s, a)] − (E_π[r(s, a)] + H(π)),

where the expected reward of π_E is empirically approximated. It looks for a reward function that assigns high reward to the expert policy and low reward to other policies, while searching for the best policy for the reward function in an inner loop. The Inverse RL objective can be reformulated in terms of occupancy measures, and with a convex reward regularizer ψ : R^{S×A} → \bar{R} [17]:

    max_{r∈R} min_{π∈Π} L(π, r) = E_{ρ_E}[r(s, a)] − E_ρ[r(s, a)] − H(π) − ψ(r).    (3)

In general, we can exchange the max and min, resulting in an objective that minimizes the statistical distance, parameterized by ψ, between the expert and the policy [17]:

    min_{π∈Π} max_{r∈R} L(π, r) = min_{π∈Π} d_ψ(ρ, ρ_E) − H(π),    (4)

with d_ψ ≜ ψ*(ρ − ρ_E), where ψ* is the convex conjugate of ψ.

3 Inverse soft Q-learning (IQ-Learn) Framework

A naive solution to the IRL problem in Eq. 3 involves (1) an outer loop learning rewards and (2) executing RL in an inner loop to find an optimal policy for them. However, we know that this optimal policy can be obtained analytically in terms of soft Q-functions (Eq. 1). Interestingly, as we will show later, the rewards can also be represented in terms of Q (Eq. 2). Together, these observations suggest it might be possible to directly solve the IRL problem by optimizing only over the Q-function.
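To make these definitions concrete, the following minimal tabular sketch (PyTorch; the function names and the toy MDP are ours for illustration, not the paper's released code) computes the soft value V*(s) = log Σ_a exp Q(s, a), the energy-based policy of Eq. 1, and the reward implied by a Q-function obtained by rearranging the soft Bellman equation (Eq. 2):

```python
# Minimal tabular sketch (illustrative, not the paper's code) of the soft value,
# the energy-based optimal policy, and the reward implied by a Q-function.
# Shapes: Q is [S, A]; P is [S, A, S] (transition probabilities).
import torch

def soft_value(Q):                       # V*(s) = log sum_a exp Q(s, a)
    return torch.logsumexp(Q, dim=1)

def softmax_policy(Q):                   # pi_Q(a|s) = exp Q(s, a) / Z_s   (Eq. 1)
    return torch.softmax(Q, dim=1)

def implied_reward(Q, P, gamma):
    # Rearranging Eq. 2: r(s, a) = Q(s, a) - gamma * E_{s'~P(.|s,a)}[V*(s')]
    expected_v_next = torch.einsum('san,n->sa', P, soft_value(Q))
    return Q - gamma * expected_v_next

# Tiny random MDP to exercise the definitions.
S, A, gamma = 4, 3, 0.99
Q = torch.randn(S, A)
P = torch.softmax(torch.randn(S, A, S), dim=-1)

pi = softmax_policy(Q)
r = implied_reward(Q, P, gamma)
print(pi.sum(dim=1))   # each row sums to 1: a valid policy
print(r.shape)         # one implied reward per (s, a) pair
```

Section 3 generalizes this reward-to-Q correspondence to arbitrary policies via the inverse soft Bellman operator defined next.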
To motivate the search for an imitation learning algorithm that depends only on the Q-function, we characterize the space of Q-functions and policies obtained using Inverse RL. We study π ∈ Π, r ∈ R, and Q-functions Q ∈ Ω, where R = Ω = R^{S×A}. We assume Π is convex and compact and that π_E ∈ Π [Footnote 1: The full policy class satisfies all these assumptions]. We define V^π(s) = E_{a~π(·|s)}[Q(s, a) − log π(a|s)]. We start with an analysis developed in [17]: the regularized IRL objective L(π, r) given by Eq. 3 is concave in the policy and convex in the rewards, and it has a unique saddle point where it is optimized.

To characterize the Q-functions, it is useful to transform the optimization problem over rewards into a problem over Q-functions. We can get a one-to-one correspondence between r and Q. Define the inverse soft Bellman operator T^π : R^{S×A} → R^{S×A} such that

    (T^π Q)(s, a) = Q(s, a) − γ E_{s'~P(·|s,a)}[V^π(s')].

Lemma 3.1. The inverse soft Bellman operator T^π is bijective, and for any r, (T^π)^{-1} r is the unique fixed point of the contraction B^π.

The proof of this lemma is in Appendix A.1. For a policy π, we are thus justified in changing between rewards and their corresponding soft Q-functions. We can freely transform functions from the reward-policy space R × Π to the Q-policy space Ω × Π, giving us the following lemma:

Lemma 3.2. If L(π, r) = E_{ρ_E}[r(s, a)] − E_ρ[r(s, a)] − H(π) − ψ(r) and J(π, Q) = E_{ρ_E}[(T^π Q)(s, a)] − E_ρ[(T^π Q)(s, a)] − H(π) − ψ(T^π Q), then for all policies π ∈ Π, L(π, r) = J(π, (T^π)^{-1} r) for all r ∈ R, and J(π, Q) = L(π, T^π Q) for all Q ∈ Ω.

Lemmas 3.1 and 3.2 allow us to adapt the Inverse RL objective L(π, r) to learning Q through J(π, Q). Simplifying our new objective (using Lemma A.3 in the Appendix):

    J(π, Q) = E_{(s,a)~ρ_E}[Q(s, a) − γ E_{s'~P(·|s,a)}[V^π(s')]] − (1−γ) E_{s_0~p_0}[V^π(s_0)] − ψ(T^π Q).    (5)

We are now ready to study J(π, Q), the Inverse RL optimization problem in the Q-policy space. As the regularizer ψ depends on both Q and π, a general analysis over all functions in R^{S×A} becomes too difficult. We restrict ourselves to regularizers induced by a convex function g : R → \bar{R}, such that

    ψ_g(r) = E_{ρ_E}[g(r(s, a))].    (6)

This allows us to simplify our analysis to the set of all real functions while retaining generality [Footnote 2: Averaging over the expert occupancy allows ψ to adjust to arbitrary experts and accommodate multimodality]. We further motivate this choice in Section 4.

Proposition 3.3. In the Q-policy space, there exists a unique saddle point (π*, Q*) that optimizes J, i.e. Q* = argmax_{Q∈Ω} min_{π∈Π} J(π, Q) and π* = argmin_{π∈Π} max_{Q∈Ω} J(π, Q). Furthermore, π* and r* = T^{π*} Q* are the solution to the Inverse RL objective L(π, r). Thus we have max_{Q∈Ω} min_{π∈Π} J(π, Q) = max_{r∈R} min_{π∈Π} L(π, r).

This tells us that, even after transforming to Q-functions, we have retained the saddle-point property of the original IRL objective, and optimizing J(π, Q) recovers this saddle point. In the Q-policy space, we can get an additional property:

Proposition 3.4. For a fixed Q, π_Q = argmin_{π∈Π} J(π, Q) is the solution to max-entropy RL with rewards r = T^{π_Q} Q. Thus, the policies π_Q form a manifold in the Q-policy space, satisfying

    π_Q(a|s) = (1/Z_s) exp(Q(s, a)),

with normalization factor Z_s = Σ_a exp(Q(s, a)), and π_Q defined as the policy corresponding to Q.

Propositions 3.3 and 3.4 tell us that if we know Q, then the inner optimization problem over the policy is trivial and is obtained in closed form. Thus, we can recover an objective that only requires learning Q:

    max_{Q∈Ω} min_{π∈Π} J(π, Q) = max_{Q∈Ω} J(π_Q, Q).    (7)

Furthermore, we have:

Proposition 3.5. Let J*(Q) = J(π_Q, Q). Then J* is concave in Q.

Thus, this new optimization objective is well-behaved and is maximized only at the saddle point.
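To spell out the step this relies on, substituting the closed-form policy π_Q of Proposition 3.4 into J collapses V^π to the soft value, giving the single-function objective that Eq. 9 later specializes. The following is a worked restatement of Eq. 5 and Eq. 7 (not a new result):

```latex
% Substituting the energy-based policy \pi_Q (Prop. 3.4) into V^\pi collapses it
% to the soft value function:
\pi_Q(a \mid s) = \frac{\exp Q(s,a)}{\sum_{a'} \exp Q(s,a')}, \qquad
V^{\pi_Q}(s) = \mathbb{E}_{a \sim \pi_Q(\cdot \mid s)}\big[Q(s,a) - \log \pi_Q(a \mid s)\big]
             = \log \sum_{a} \exp Q(s,a) =: V^*(s).
% Plugging \pi_Q into Eq. 5 gives the single-function objective of Eq. 7, which
% Eq. 9 later specializes to the \phi-based regularizers of Sec. 4.1:
\mathcal{J}^*(Q) := \mathcal{J}(\pi_Q, Q)
  = \mathbb{E}_{(s,a) \sim \rho_E}\!\Big[Q(s,a) - \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\, V^*(s')\Big]
  - (1-\gamma)\, \mathbb{E}_{s_0 \sim p_0}\!\big[V^*(s_0)\big]
  - \psi\big(\mathcal{T}^{\pi_Q} Q\big).
```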
Figure 1: Properties of the IRL objective in reward-policy space and Q-policy space. (The panels show the concave/convex structure of L(π, r) with its saddle point (π*, r*), and of J(π, Q) with its saddle point (π*, Q*) and the manifold of policy minima.)

In Appendix C, we expand on our analysis and characterize the behavior for different choices of the regularizer ψ, while giving proofs of all our propositions. Figure 1 summarizes the properties of the IRL objective: there exists an optimal policy manifold depending on Q, allowing optimization along it (using J*) to converge to the saddle point. We further present an analysis of IL methods that learn Q-functions, like SQIL [33] and ValueDICE [22], and find subtle fallacies affecting their learning. Note that although the same analysis holds in the reward-policy space, the optimal policy manifold there depends on Q, which is not trivially known, unlike in the Q-policy space.

4 Inverse soft-Q Learning Algorithm

In this section, we develop our inverse soft-Q learning (IQ-Learn) algorithm, such that it recovers the optimal soft Q-function for an MDP from a given expert distribution. We start by learning energy-based models for the policy, similar to soft Q-learning, and later learn an explicit policy, similar to actor-critic methods.

4.1 General Inverse RL Objective

For designing a practical algorithm using regularizers of the form ψ_g (from Eq. 6), we define g using a concave function φ : R → R, such that

    g(x) = x − φ(x)  if x ∈ R_ψ,   and  +∞  otherwise,

with the rewards constrained to lie in the family R_ψ. For this choice of ψ, the Inverse RL objective L(π, r) takes the form of Eq. 4 with the distance measure

    d_ψ(ρ, ρ_E) = max_{r∈R_ψ} E_{ρ_E}[φ(r(s, a))] − E_ρ[r(s, a)].    (8)

This forms a general learning objective that allows the use of a wide range of statistical distances, including Integral Probability Metrics (IPMs) and f-divergences (see Appendix B) [Footnote 3: We recover IPMs when using the identity φ and a restricted reward family R_ψ].

4.2 Choice of Statistical Distances

While choosing a practical regularizer, it can be useful to obtain certain properties of the reward functions we recover. Some natural, desirable properties are: having rewards bounded in a range, learning smooth reward functions, or enforcing a norm penalty. In fact, we find these properties correspond to the Total Variation distance, the Wasserstein-1 distance, and the χ²-divergence, respectively. The regularizers and the induced statistical distances are summarized in Table 2.

Table 2: Enforced reward property, corresponding regularizer ψ, and induced statistical distance d_ψ (with R_max, K, α ∈ R+).

    Reward property     ψ(r)                                 d_ψ
    Bounded range       0 if |r| ≤ R_max, +∞ otherwise       2 R_max · TV(ρ, ρ_E)
    Smoothness          0 if ||r||_Lip ≤ K, +∞ otherwise     K · W1(ρ, ρ_E)
    L2 penalization     α r²                                 (1/4α) · χ²(ρ, ρ_E)

We find that these choices of regularizer [Footnote 4: The additional scalar terms scale the entropy regularization strength and can be ignored in practice] work very well in our experiments. In Appendix B, we further give a table of the well-known f-divergences, the corresponding φ, and the learnt reward estimators, along with an ablation over different divergences. Compared to χ², we find that other f-divergences like Jensen-Shannon give similar performance but are not as readily interpretable.

4.3 Inverse soft-Q update (Discrete control)

Optimization along the optimal policy manifold gives the concave objective (Prop. 3.5):

    max_{Q∈Ω} J*(Q) = E_{ρ_E}[φ(Q(s, a) − γ E_{s'~P(·|s,a)}[V*(s')])] − (1−γ) E_{s_0~p_0}[V*(s_0)],    (9)

with V*(s) = log Σ_a exp(Q(s, a)). For each Q, we get a corresponding reward r(s, a) = Q(s, a) − γ E_{s'~P(·|s,a)}[log Σ_{a'} exp(Q(s', a'))]. This correspondence is unique (Lemma A.1 in the Appendix), and every update step can be seen as finding a better reward for IRL.
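As an illustration of Eq. 9, here is a minimal sketch of the resulting loss for discrete control. The assumptions are ours rather than the paper's released code: PyTorch, the χ² choice φ(x) = x − x²/4 (used later in Sec. 5.3), and a hypothetical `q_net` mapping a batch of states to per-action Q-values:

```python
# Minimal sketch of the discrete-control objective in Eq. 9 (not the released code).
# Assumptions: discrete actions, phi(x) = x - x^2/4 (chi^2 choice), and a network
# `q_net` mapping observations to [batch, num_actions] Q-values.
import torch

def soft_v(q_net, obs):
    return torch.logsumexp(q_net(obs), dim=1)          # V*(s) = log sum_a exp Q(s, a)

def phi(x):                                            # concave phi inducing chi^2
    return x - 0.25 * x**2

def iq_loss_discrete(q_net, expert_obs, expert_act, expert_next_obs, init_obs,
                     gamma=0.99, done=None):
    q_sa = q_net(expert_obs).gather(1, expert_act.long().unsqueeze(1)).squeeze(1)
    v_next = soft_v(q_net, expert_next_obs)            # single-sample estimate of E_{s'}[V*(s')]
    if done is not None:                               # mask bootstrapping at terminal states
        v_next = v_next * (1.0 - done.float())
    # First term of Eq. 9: E_{rho_E}[ phi(Q(s,a) - gamma V*(s')) ]
    expert_term = phi(q_sa - gamma * v_next).mean()
    # Second term of Eq. 9: (1 - gamma) E_{p_0}[ V*(s_0) ]
    value_term = (1.0 - gamma) * soft_v(q_net, init_obs).mean()
    return -(expert_term - value_term)                 # minimize -J*(Q) with gradient descent
```

In practice the paper replaces the (1−γ)E_{p_0}[V*] term with the telescoped estimate of Section 5.1; a sketch of that variant follows Algorithm 1 below.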
Note that estimating V*(s) exactly is only possible in discrete action spaces. Our objective forms a variant of soft-Q learning: learning the optimal Q-function given an expert distribution.

4.4 Inverse soft actor-critic update (Continuous control)

In continuous action spaces, it might not be possible to exactly obtain the optimal policy π_Q, which forms an energy-based model of the Q-function, and we instead use an explicit policy π to approximate π_Q. For any policy π, we have the objective (from Eq. 5):

    J(π, Q) = E_{ρ_E}[φ(Q(s, a) − γ E_{s'~P(·|s,a)}[V^π(s')])] − (1−γ) E_{s_0~p_0}[V^π(s_0)].    (10)

For a fixed Q, the soft actor-critic (SAC) update

    max_π E_{s~D, a~π(·|s)}[Q(s, a) − log π(a|s)]

brings π closer to π_Q while always minimizing Eq. 10 (Lemma A.4 in the Appendix). Here D is the distribution of previously sampled states, i.e. a replay buffer. Thus, we obtain the modified actor-critic update rule to learn Q-functions from the expert distribution:

1. For a fixed π, optimize Q by maximizing J(π, Q).
2. For a fixed Q, apply the SAC update to optimize π towards π_Q.

This differs from ValueDICE [22], where the actor is updated adversarially and the objective may not always converge (Appendix C).

5 Practical Algorithm

Pseudocode in Algorithm 1 shows our Q-learning and actor-critic variants; the difference from conventional RL algorithms lies in the objective for the Q-function (we optimize −J to use gradient descent). We can implement our algorithm, IQ-Learn, in 15 lines of code on top of standard implementations of (soft) DQN [14] for discrete control or soft actor-critic (SAC) [13] for continuous control, with a change to the objective for the Q-function. Default hyperparameters from [14, 13] work well, except for tuning the entropy regularization. Target networks were helpful for continuous control. We elaborate on the details in Appendix D.

Algorithm 1: Inverse soft Q-Learning (both variants)
1: Initialize Q-function Q_θ, and optionally a policy π_φ
2: for step t in {1...N} do
3:   Train the Q-function using the objective from Eq. 9:
       θ_{t+1} ← θ_t − α_Q ∇_θ[−J(θ)]
     (use V* for Q-learning and V^{π_φ} for actor-critic)
4:   (only with actor-critic) Improve the policy π_φ with a SAC-style actor update:
       φ_{t+1} ← φ_t + α_π ∇_φ E_{s~D, a~π_φ(·|s)}[Q(s, a) − log π_φ(a|s)]
5: end for

Algorithm 2: Recover policy and reward
1: Given a trained Q-function Q_θ, and optionally a trained policy π_φ
2: Recover the policy π: (Q-learning) π := (1/Z) exp Q_θ; (actor-critic) π := π_φ
3: For state s, action a, and s' ~ P(·|s, a):
4:   Recover the reward r(s, a, s') = Q_θ(s, a) − γ V(s')

5.1 Training methodology

Corollary 2.1 in Appendix A states that E_{(s,a)~μ}[V^π(s) − γ E_{s'~P(·|s,a)}[V^π(s')]] = (1−γ) E_{s_0~p_0}[V^π(s_0)], where μ is the occupancy measure of any policy. We use this to stabilize training instead of using Eq. 9 directly.

Online: Instead of directly estimating E_{p_0}[V(s_0)] in our algorithm, we can sample (s, a, s') from a replay buffer and get a single-sample estimate E_{(s,a,s')~replay}[V(s) − γV(s')]. This removes the issue where we only optimize Q on the initial states, which results in overfitting of V(s_0), and it improves the stability of convergence in our experiments. We find that sampling half from the policy buffer and half from the expert distribution gives the best performance. Note that this makes our learning online, requiring environment interactions.

Offline: Although E_{p_0}[V(s_0)] can be estimated offline, we still observe an overfitting issue. Instead of requiring policy samples, we use only expert samples to estimate E_{(s,a,s')~expert}[V(s) − γV(s')], which sufficiently approximates the term.
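Putting Algorithm 1 and the sampling scheme above together, a sketch of one actor-critic training step with the χ² objective is given below. The interfaces (`critic`, `actor`, the mixed expert/replay batch) are placeholders for standard SAC components and are our assumptions, not the authors' released implementation:

```python
# Sketch of one IQ-Learn actor-critic step (chi^2 objective, online scheme of Sec. 5.1).
# Assumptions: critic(obs, act) returns Q(s,a); actor(obs) returns a torch distribution
# with rsample()/log_prob(); `batch` mixes expert and replay transitions, with the
# boolean mask `is_expert` marking the expert half. Names are illustrative.
import torch

def soft_v(critic, actor, obs):
    # Single-sample estimate of V^pi(s) = E_{a~pi}[Q(s, a) - log pi(a|s)]
    dist = actor(obs)
    act = dist.rsample()
    return critic(obs, act) - dist.log_prob(act)

def iq_critic_loss(critic, actor, batch, gamma=0.99):
    obs, act, next_obs, done, is_expert = batch
    q = critic(obs, act)
    v = soft_v(critic, actor, obs)
    v_next = soft_v(critic, actor, next_obs) * (1.0 - done)
    y = q - gamma * v_next                              # single-sample (T^pi Q)(s, a)
    # Expert term of Eq. 10 with phi(x) = x - x^2/4 (chi^2), expert half only.
    expert_term = (y[is_expert] - 0.25 * y[is_expert] ** 2).mean()
    # Telescoped value term (Sec. 5.1): E_{(s,a,s')~replay+expert}[V(s) - gamma V(s')]
    value_term = (v - gamma * v_next).mean()
    return -(expert_term - value_term)                  # gradient descent on -J

def sac_actor_loss(critic, actor, obs):
    # Standard SAC actor update: max_pi E[Q(s, a) - log pi(a|s)]
    dist = actor(obs)
    act = dist.rsample()
    return (dist.log_prob(act) - critic(obs, act)).mean()
```

The critic loss implements −J(π, Q) from Eq. 10 with the value term telescoped as in Section 5.1 (half expert, half policy samples), while the actor loss is the unchanged SAC update of Section 4.4.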
This methodology gives us state-of-the-art results for offline IL.

5.2 Recovering rewards

Instead of the conventional reward function r(s, a) on state-action pairs, our algorithm allows recovering rewards for each transition (s, a, s') using the learnt Q-values, as follows:

    r(s, a, s') = Q(s, a) − γ V*(s').    (11)

Now, E_{s'~P(·|s,a)}[Q(s, a) − γ V*(s')] = Q(s, a) − γ E_{s'~P(·|s,a)}[V*(s')] = (T^π Q)(s, a). This is just the reward function r(s, a) we want. So by marginalizing over next states, our expression correctly recovers the reward over state-action pairs; thus, Eq. 11 gives the reward over transitions. Our rewards require s', which can be sampled from the environment or obtained using a dynamics model.

5.3 Implementation of Statistical Distances

Implementing the TV and W1 distances is fairly straightforward, and we give details in Appendix B. For the χ²-divergence, we note that it corresponds to φ(x) = x − (1/4)x². Substituting into Eq. 9, we get

    max_{Q∈Ω} E_{ρ_E}[Q(s, a) − γ E_{s'~P(·|s,a)}[V*(s')]] − (1−γ) E_{p_0}[V*(s_0)] − (1/4) E_{ρ_E}[(Q(s, a) − γ E_{s'~P(·|s,a)}[V*(s')])²].

In a fully offline setting, this can be further simplified (using the offline methodology in Sec. 5.1) to

    min_{Q∈Ω} E_{ρ_E}[V*(s) − Q(s, a)] + (1/4) E_{ρ_E}[(Q(s, a) − γ E_{s'~P(·|s,a)}[V*(s')])²].    (12)

Interestingly, this is the same as the Q-learning objective in CQL [23], a state-of-the-art method for offline RL (using zero rewards), and it shares similarities with regularized behavior cloning [33] [Footnote 5: The simplification to get Eq. 12 is not applicable in the online IL setting, where our method differs].

5.4 Learning state-only reward functions

Previous works like AIRL [10] propose learning rewards that are a function of the state only, and claim that reward functions of this form generalize between different MDPs. We find our method can predict state-only rewards by using the policy and expert state-marginals, with a modification to Eq. 9:

    max_{Q∈Ω} J*(Q) = E_{s~ρ_E(s)}[E_{a~π(·|s)}[φ(Q(s, a) − γ E_{s'~P(·|s,a)}[V*(s')])]] − (1−γ) E_{p_0}[V*(s_0)].

Interestingly, this objective no longer depends on the expert actions and can be used for IL from observations only. For the sake of brevity, we expand on this in Section 1 of Appendix A.

6 Related Work

Classical IL: Imitation learning has a long history, with early works using supervised learning to match a policy's actions to those of the expert [15, 35]. A significant advance was made with the formulation of IL as the composition of RL and IRL [29, 1, 43], recovering the expert's policy by inferring the expert's reward function and then finding the policy which maximizes reward under this reward function. These early approaches required a hand-designed featurization of the MDP, limiting their applicability to complex MDPs. In this setting, early approaches [9, 31] noted a formal equivalence between IRL and IL using an inverse Bellman operator similar to our own.

Online IL: More recent work aims to leverage the power of modern machine learning approaches to learn good featurizations and extend IL to complex settings. Recent work generally falls into one of two settings: online or offline. In the online setting, the IL algorithm is able to interact with the environment to obtain dynamics information. GAIL [17] takes the nested RL/IRL formulation of earlier work, optimizing over all reward functions with a convex regularizer. This results in the objective in Eq. 3, with a max-min adversarial problem similar to a GAN [11]. A variety of further work has built on this adversarial approach [21, 10, 3]. A separate line of work aims to simplify the problem in Eq. 3 by using a fixed r or ψ.
In SQIL [33], r is chosen to be the 1-0 indicator on the expert demonstrations, while ASAF [4] takes the GAN approach and uses a discriminator (with a role similar to r) of fixed form, consisting of a ratio of expert and learner densities. AdRIL [38] is a recent extension of SQIL, additionally assigning a decaying negative reward to previous policy rollouts.

Offline IL: In the offline setting, the learner has no access to the environment. The simple behavioural cloning (BC) [34] approach is offline, but it doesn't use any dynamics information. ValueDICE [22] is a dynamics-aware offline approach with an objective somewhat similar to ours, motivated by the minimization of a variational representation of the KL-divergence between the expert and learner policies. ValueDICE requires adversarial optimization to learn the policy and Q-functions, with a biased gradient estimator for training. We show a way to recover an unbiased gradient estimate for the KL-divergence in Appendix C. The O-NAIL algorithm [2] builds on ValueDICE and combines it with an SAC update, obtaining a method similar to our algorithm described in Section 4.4, with the specific choice of reverse KL-divergence as the relevant statistical distance. The EDM method [19] incorporates dynamics by learning an explicit energy-based model for the expert state occupancy, although some of its theoretical details have been called into question (see [37] for details). The recent AVRIL approach [6] uses a variational method to solve a probabilistic formulation of IL, finding a posterior distribution over r and π. Illustrating the potential benefits of alternative distances for IL, the PWIL algorithm [7] gives a non-adversarial procedure to minimize the Wasserstein distance between expert and learned occupancies. That approach is specific to the primal form of the W1-distance, while our method (when used with the Wasserstein distance) targets the dual form.

7 Experiments

7.1 Experimental Setup

We compare IQ-Learn ("IQ") to prior work on a diverse collection of RL tasks and environments, ranging from low-dimensional control tasks (CartPole, Acrobot, LunarLander) to more challenging continuous-control MuJoCo tasks (HalfCheetah, Hopper, Walker, and Ant). Furthermore, we test on the visually challenging Atari suite with high-dimensional image inputs. We compare on offline IL, with no access to the environment while training, and online IL, with environment access. We show results with W1 and χ² as our statistical distances, as we found them more effective than the TV distance. In all cases, we train until convergence and average over multiple seeds. Hyperparameter settings and training details are given in Appendix D.

7.2 Benchmarks

Offline IL: We compare to the state-of-the-art IL methods EDM and AVRIL, following the same experimental setting as [6]. Furthermore, we compare with ValueDICE, which also learns Q-functions, albeit with drawbacks such as adversarial optimization. We also experimented with SQIL, but found that it was not competitive in the offline setting. Finally, we use BC as an additional IL baseline.

Online IL: We use MuJoCo and Atari environments and compare against state-of-the-art online IL methods: ValueDICE, SQIL, and GAIL. We only show results for χ², as W1 was harder to stabilize on complex environments [Footnote 6: χ² and W1 can be used together to still give a convex regularization, which is more stable]. Using target updates stabilizes the Q-learning on MuJoCo. For brevity, further online IL results are shown in Appendix D.
7.3 Results

Figure 2: Offline IL results. We plot the average environment returns vs. the number of expert trajectories.

Offline IL: We present results on the three offline control tasks in Figure 2. On all tasks, IQ strongly outperforms the prior works we compare to, in both performance and sample efficiency. Using just one expert trajectory, we achieve expert performance on Acrobot and reach near-expert performance on CartPole.

Table 3: MuJoCo results. We show our performance on MuJoCo control tasks using a single expert trajectory.

    Task          GAIL     DAC      ValueDICE   IQ (Ours)   Expert
    Hopper        3252.5   3305.1   3312.1      3546.4      3532.7
    Half-Cheetah  3080.0   4080.6   3835.6      5076.6      5098.3
    Walker        4013.7   4107.9   3842.6      5134.0      5274.5
    Ant           2299.1   1437.5   1806.3      4362.9      4700.0
    Humanoid      232.6    380.5    644.5       5227.1      5312.8

MuJoCo Control: We present our results on the MuJoCo tasks using a single expert demonstration in Table 3. IQ achieves expert-level performance on all the tasks while outperforming prior methods like ValueDICE and GAIL. We did not find SQIL competitive in this setting, and skip it for brevity.

Atari: We present our results on Atari using 20 expert demonstrations in Figure 3. We reach expert performance on Space Invaders while being near expert on Pong and Breakout. Compared to prior methods like SQIL, IQ obtains a 3-7x normalized score [Footnote 7: normalizing rewards obtained from random behavior to 0 and expert to 1] and converges in 300k steps, being 3x faster than Q-learning based RL methods, which take more than 1M steps to converge. Other popular methods like GAIL and ValueDICE perform near random even with 1M environment steps.

Figure 3: Atari results. We show the returns vs. the number of environment steps. (Averaged over 5 seeds)

7.4 Recovered Rewards

IQ has the added benefit of recovering rewards and can be used for IRL. On the Hopper task, our learned rewards have a Pearson correlation of 0.99 with the true rewards. In Figure 4, we visualize our recovered rewards in a simple grid environment. We elaborate on the details in Appendix D.

Figure 4: Reward visualization. We use a discrete grid-world environment with 5 possible actions: up, down, left, right, stay. The agent starts in a random state. (With 30 expert demos)

8 Discussion and Outlook

We present a new principled framework for learning soft Q-functions for IL and recovering the optimal policy and the reward, building on past works in IRL [43]. Our algorithm, IQ-Learn, outperforms prior methods with very sparse expert data and scales to complex image-based environments. We also recover rewards highly correlated with the actual rewards. It has applications in autonomous driving and complex decision-making, but proper considerations need to be taken into account to ensure safety and reduce uncertainty before any deployment. Finally, human or expert data can have errors that can propagate. A limitation of our method is that our recovered rewards depend on the environment dynamics, preventing trivial use in reward transfer settings. One direction of future work could be to learn a reward model from the trained soft-Q model to make the rewards explicit.

9 Acknowledgements

We thank Kuno Kim and John Schulman for helpful discussions. We also thank Ian Goodfellow, as some initial motivations for this work were developed during an internship with him.

10 Funding Transparency

This research was supported in part by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-12145), AFOSR (FA9550-19-1-0024) and FLI.

References
[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.
[2] Oleg Arenz and Gerhard Neumann. Non-adversarial imitation learning and its connections to adversarial methods. arXiv preprint arXiv:2008.03525, 2020.
[3] Nir Baram, Oron Anschel, and Shie Mannor. Model-based adversarial imitation learning. stat, 1050:7, 2016.
[4] Paul Barde, Julien Roy, Wonseok Jeon, Joelle Pineau, Christopher Pal, and Derek Nowrouzezahrai. Adversarial soft advantage fitting: Imitation learning without policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[5] M. Bloem and N. Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. In 53rd IEEE Conference on Decision and Control, pages 4911-4916, 2014.
[6] Alex J. Chan and Mihaela van der Schaar. Scalable Bayesian inverse reinforcement learning, 2021.
[7] Robert Dadashi, Léonard Hussenot, Matthieu Geist, and Olivier Pietquin. Primal Wasserstein imitation learning. In ICLR 2021 - Ninth International Conference on Learning Representations, 2021.
[8] DeepMind AlphaStar. Mastering the real-time strategy game StarCraft II, 2019.
[9] Krishnamurthy Dvijotham and Emanuel Todorov. Inverse optimal control with linearly-solvable MDPs. In ICML, 2010.
[10] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkHywl-A-.
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[12] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
[13] T. Haarnoja, Aurick Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
[14] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. 2017.
[15] G. Hayes. A robot controller using learning by imitation. In Proc. 2nd Int. Symposium on Intelligent Robotic Systems, LIFTA-IMAG, Grenoble, France, 1994.
[16] Michael Herman, Tobias Gindele, Jörg Wagner, Felix Schmitt, and Wolfram Burgard. Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
[17] Jonathan Ho and S. Ermon. Generative adversarial imitation learning. In NIPS, 2016.
[18] Vinamra Jain, Prashant Doshi, and Bikramjit Banerjee. Model-free IRL using maximum likelihood estimation. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
[19] Daniel Jarrett, Ioana Bica, and Mihaela van der Schaar. Strictly batch imitation learning by energy-based distribution matching. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[20] Edouard Klein, Matthieu Geist, and Olivier Pietquin. Batch, off-policy and model-free apprenticeship learning. In European Workshop on Reinforcement Learning (EWRL), 2011.
[21] Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations, 2018.
[22] Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyg-JC4FDr.
[23] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. 2020. URL https://arxiv.org/abs/2006.04779.
[24] Donghun Lee, Srivatsan Srinivasan, and Finale Doshi-Velez. Truly batch apprenticeship learning with deep successor features. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.
[25] Mario Lucic, Karol Kurach, Marcin Michalski, S. Gelly, and O. Bousquet. Are GANs created equal? A large-scale study. In NeurIPS, 2018.
[26] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv, abs/1802.05957, 2018.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. arXiv, abs/1312.5602, 2013.
[28] Ofir Nachum, Yinlam Chow, B. Dai, and L. Li. DualDICE: Efficient estimation of off-policy stationary distribution corrections. 2019.
[29] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.
[30] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted and reward-regularized classification for apprenticeship learning. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2014.
[31] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Bridging the gap between imitation learning and inverse reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 28(8):1814-1826, 2016.
[32] Antonin Raffin. RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo.
[33] Siddharth Reddy, A. Dragan, and S. Levine. SQIL: Imitation learning via reinforcement learning with sparse rewards. arXiv: Learning, 2020.
[34] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[35] Claude Sammut, Scott Hurst, Dana Kedzier, and Donald Michie. Learning to fly. In Proceedings of the Ninth Conference on Machine Learning, pages 385-393. Elsevier, 1992.
[36] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171-176, 1958. doi: pjm/1103040253.
[37] Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. A critique of strictly batch imitation learning. arXiv:2110.02063 [cs], October 2021.
[38] Gokul Swamy, Sanjiban Choudhury, Zhiwei Steven Wu, and J. Andrew Bagnell. Of moments and matching: Trade-offs and treatments in imitation learning. arXiv preprint arXiv:2103.03236, 2021.
[39] Lu Wang, Wenchao Yu, Xiaofeng He, Wei Cheng, Martin Renqiang Ren, Wei Wang, Bo Zong, Haifeng Chen, and Hongyuan Zha. Adversarial cooperative imitation learning for dynamic treatment regimes. In Proceedings of The Web Conference 2020, pages 1785-1795, 2020.
[40] Lantao Yu, Tianhe Yu, Chelsea Finn, and Stefano Ermon. Meta-inverse reinforcement learning with probabilistic context variables. In NeurIPS, 2019.
[41] Jinyun Zhou, Rui Wang, Xu Liu, Yifei Jiang, Shu Jiang, Jiaming Tao, Jinghao Miao, and Shiyu Song. Exploring imitation learning for autonomous driving with feedback synthesizer and differentiable rasterization. arXiv preprint arXiv:2103.01882, 2021.
[42] Brian D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
[43] Brian D. Ziebart, Andrew L. Maas, J. Bagnell, and A. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.