# REGULARIZED INVERSE REINFORCEMENT LEARNING

Published as a conference paper at ICLR 2021

Wonseok Jeon1,2, Chen-Yang Su1,2, Paul Barde1,2, Thang Doan1,2, Derek Nowrouzezahrai1,2, Joelle Pineau1,2,3
1Mila, Quebec AI Institute  2School of Computer Science, McGill University  3Facebook AI Research

ABSTRACT

Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior by acquiring reward functions that explain the expert's decisions. Regularized IRL applies strongly convex regularizers to the learner's policy in order to avoid the expert's behavior being rationalized by arbitrary constant rewards, also known as degenerate solutions. Current methods are restricted to the maximum-entropy IRL framework, limiting them to Shannon-entropy regularizers, and the solutions they propose are intractable in practice. We propose tractable solutions, and practical methods to obtain them, for regularized IRL. We present theoretical backing for our proposed IRL method's applicability to both discrete and continuous controls, and empirically validate its performance on a variety of tasks.

1 INTRODUCTION

Reinforcement learning (RL) has been successfully applied to many challenging domains, including games (Mnih et al., 2015; 2016) and robot control (Schulman et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). Advanced RL methods often employ policy regularization motivated by, e.g., boosting exploration (Haarnoja et al., 2018) or safe policy improvement (Schulman et al., 2015). While Shannon entropy is often used as a policy regularizer (Ziebart et al., 2008), Geist et al. (2019) recently proposed a theoretical foundation of regularized Markov decision processes (MDPs), a framework that uses strongly convex functions as policy regularizers. One crucial advantage is that the optimal policy is shown to exist uniquely, whereas multiple optimal policies may exist in the absence of policy regularization.

Meanwhile, since RL requires a given or known reward function (which can often involve non-trivial reward engineering), Inverse Reinforcement Learning (IRL) (Russell, 1998; Ng et al., 2000), the problem of acquiring a reward function that promotes expert-like behavior, is more generally adopted in practical scenarios such as robotic manipulation (Finn et al., 2016b), autonomous driving (Sharifzadeh et al., 2016; Wu et al., 2020) and clinical motion analysis (Li et al., 2018). In these scenarios, defining a reward function beforehand is particularly challenging and IRL is simply more pragmatic. However, IRL in unregularized MDPs suffers from the issue of degeneracy, where any constant function can rationalize the expert's behavior (Ng et al., 2000). Fortunately, Geist et al. (2019) show that IRL in regularized MDPs, or regularized IRL, does not admit such degenerate solutions, due to the uniqueness of the optimal policy in regularized MDPs. Despite this, no tractable solutions of regularized IRL other than maximum-Shannon-entropy IRL (MaxEnt IRL) (Ziebart et al., 2008; Ziebart, 2010; Ho & Ermon, 2016; Finn et al., 2016a; Fu et al., 2018) have been proposed. Geist et al. (2019) introduced solutions for regularized IRL, but these are generally intractable since they require a closed-form relation between the policy and the optimal value function, as well as knowledge of the model dynamics. Furthermore, practical algorithms for solving regularized IRL problems have not yet been proposed.
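To make the degeneracy issue concrete, the following toy NumPy sketch (with variable names of our own) evaluates several arbitrary policies in a small random MDP under a constant reward c: every policy attains the same return c/(1 − γ), so a constant reward trivially rationalizes any expert.

```python
import numpy as np

# Tiny random MDP: with a constant reward, every policy achieves the same return,
# so a constant reward "explains" any expert -- the degenerate IRL solution discussed above.
rng = np.random.default_rng(0)
nS, nA, gamma, c = 3, 2, 0.9, 1.0
P0 = rng.dirichlet(np.ones(nS))                  # initial state distribution
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s'] = P(s' | s, a)
r = np.full((nS, nA), c)                         # constant reward

def unregularized_return(pi):
    r_pi = (pi * r).sum(axis=1)                       # expected reward per state
    P_pi = np.einsum('sa,sat->st', pi, P)             # state transition matrix under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return P0 @ V

for _ in range(3):
    pi = rng.dirichlet(np.ones(nA), size=nS)     # an arbitrary policy
    print(unregularized_return(pi))              # always c / (1 - gamma) = 10.0
```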
We summarize our contributions as follows. Unlike the solutions in Geist et al. (2019), we propose tractable solutions for regularized IRL problems that can be derived from the policy regularizer and its gradient in discrete control problems (Section 3.1). We additionally show that our solutions are tractable for Tsallis entropy regularization with multi-variate Gaussian policies in continuous control problems (Section 3.2). We devise Regularized Adversarial Inverse Reinforcement Learning (RAIRL), a practical sample-based method for policy imitation and reward learning in regularized MDPs, which generalizes adversarial IRL (AIRL, Fu et al. (2018)) (Section 4). Finally, we empirically validate RAIRL on both discrete and continuous control tasks, evaluating it via episodic scores and from a divergence-minimization perspective (Ke et al., 2019; Ghasemipour et al., 2019; Dadashi et al., 2020) (Section 5).

2 PRELIMINARIES

Notation For finite sets X and Y, Y^X is the set of functions from X to Y. Δ_X (Δ_X^Y) is the set of (conditional) probability distributions over X (conditioned on Y). In particular, for a conditional probability p_{X|Y} ∈ Δ_X^Y, we write p_{X|Y}(·|y) ∈ Δ_X for y ∈ Y. R is the set of real numbers. For functions f_1, f_2 ∈ R^X, we define ⟨f_1, f_2⟩_X := Σ_{x∈X} f_1(x) f_2(x).

Regularized Markov Decision Processes and Reinforcement Learning We consider sequential decision-making problems where an agent sequentially chooses its action after observing the state of the environment, and the environment in turn emits a reward along with a state transition. Such an interaction between the agent and the environment is modeled as an infinite-horizon Markov Decision Process (MDP) M_r := ⟨S, A, P_0, P, r, γ⟩ together with the agent's policy π ∈ Δ_A^S. The terms within the MDP are defined as follows: S is a finite state space, A is a finite action space, P_0 ∈ Δ_S is an initial state distribution, P ∈ Δ_S^{S×A} is a state transition probability, r ∈ R^{S×A} is a reward function, and γ ∈ [0, 1) is the discount factor. We also define an MDP without reward as M := ⟨S, A, P_0, P, γ⟩. The normalized state-action visitation distribution d_π ∈ Δ_{S×A} associated with π is defined as the expected discounted state-action visitation of π, i.e., d_π(s, a) := (1 − γ) E_π[Σ_{i=0}^∞ γ^i I{s_i = s, a_i = a}], where the subscript π on E means that a trajectory (s_0, a_0, s_1, a_1, ...) is randomly generated from M and π, and I{·} is an indicator function. Note that d_π satisfies the transposed Bellman recurrence (Boularias & Chaib-Draa, 2010; Zhang et al., 2019):

d_π(s, a) = (1 − γ) P_0(s) π(a|s) + γ π(a|s) Σ_{s̄,ā} P(s | s̄, ā) d_π(s̄, ā).

We consider RL in regularized MDPs (Geist et al., 2019), where the policy is optimized with a causal policy regularizer. Mathematically, for an MDP M_r and a strongly convex function Ω : Δ_A → R, the objective in regularized MDPs is to seek π that maximizes the expected discounted sum of rewards, or return in short, with policy regularizer Ω:

argmax_{π ∈ Δ_A^S} J_Ω(r, π) := E_π[ Σ_{i=0}^∞ γ^i { r(s_i, a_i) − Ω(π(·|s_i)) } ] = (1/(1 − γ)) E_{(s,a)∼d_π}[ r(s, a) − Ω(π(·|s)) ].   (1)

It turns out that the optimal solution of Eq.(1) is unique (Geist et al., 2019), whereas multiple optimal policies may exist in unregularized MDPs (see Appendix A for a detailed explanation). In later work (Yang et al., 2019), Ω(p) = λ E_{a∼p}[φ(p(a))], p ∈ Δ_A, was considered for λ > 0 and φ : (0, 1] → R satisfying some mild conditions.
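As a concrete illustration of these definitions, the toy sketch below (hypothetical names, not the authors' code) builds a small random MDP, solves the transposed Bellman recurrence for d_π as a linear system, and checks that d_π sums to one and that (1/(1 − γ))⟨d_π, r⟩_{S×A} matches the usual return obtained from policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P0 = rng.dirichlet(np.ones(nS))                    # initial state distribution
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a, s'] = P(s' | s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)           # pi[s, a] = pi(a | s)
r = rng.normal(size=(nS, nA))

# Transposed Bellman recurrence d = (1 - gamma) * b + gamma * M d, solved as a linear system,
# where b(s, a) = P0(s) pi(a|s) and M[(s,a),(s_bar,a_bar)] = pi(a|s) P(s | s_bar, a_bar).
b = (P0[:, None] * pi).reshape(-1)
M = np.einsum('sa,tbs->satb', pi, P).reshape(nS * nA, nS * nA)
d = np.linalg.solve(np.eye(nS * nA) - gamma * M, (1 - gamma) * b).reshape(nS, nA)

# Sanity checks: d is a distribution, and (1/(1-gamma)) <d, r> equals the return P0^T V^pi.
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
print(d.sum(), d.reshape(-1) @ r.reshape(-1) / (1 - gamma), P0 @ V)
```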
For example, RL with Shannon entropy regularization (Haarnoja et al., 2018) can be recovered by φ(x) = log x, while RL with Tsallis entropy regularization (Lee et al., 2020) can be recovered from φ(x) = k q 1(1 xq 1) for k > 0, q > 1. The optimal policy π for Eq.(1) with Ωfrom Yang et al. (2019) is shown to be π (a|s) = max gφ µ (s) Q (s, a) Q (s, a) = r(s, a) + γEs P ( |s,a)V (s ), V (s) = µ (s) λ X a A π (a|s)2φ (π (a|s)), (3) where φ (x) = xφ(x), gφ is an inverse function of f φ for fφ(x) := xφ(x), x (0, 1], and µ is a normalization term such that P a A π (a|s) = 1. Note that we still need to find out µ to acquire a closed-form relation between optimal policy π and value function Q . However, such relations Published as a conference paper at ICLR 2021 have not been discovered except for Shannon-entropy regularization (Haarnoja et al., 2018) and specific instances (q = 1, 2, ) of Tsallis-entropy regularization (Lee et al., 2019) to the best of our knowledge. Inverse Reinforcement Learning Given a set of demonstrations from an expert policy πE, IRL (Russell, 1998; Ng et al., 2000) is the problem of seeking a reward function from which we can recover πE through RL. However, IRL in unregularized MDPs has been shown to be an ill-defined problem since (1) any constant reward function can rationalize every expert and (2) multiple rewards meet the criteria of being a solution (Ng et al., 2000). Maximum entropy IRL (Max Ent IRL) (Ziebart et al., 2008; Ziebart, 2010) is capable of solving the first issue by seeking a reward function that maximizes the expert s return along with Shannon entropy of expert policy. Mathematically, for the RL objective JΩin Eq.(1) and Ω= H for negative Shannon entropy H(p) = Ea p[ log p(a)] (Ho & Ermon, 2016), the objective of Max Ent IRL is Max Ent IRL(πE) := arg max r RS A J H(r, πE) max π A S J H(r, π) Another commonly used IRL method is Adversarial Inverse Reinforcement Learning (AIRL) (Fu et al., 2018) which involves generative adversarial training (Goodfellow et al., 2014; Ho & Ermon, 2016) to acquire a solution of Max Ent IRL. AIRL considers the structured discriminator (Finn et al., 2016a) D(s, a) = σ(r(s, a) log π(a|s)) = er(s,a) er(s,a)+π(a|s) for σ(x) := 1/(1 + e x) and iteratively optimizes the following objective: max r RS A E(s,a) dπE [log Dr,π(s, a)] + E(s,a) dπ [log(1 Dr,π(s, a))] , max π A S E(s,a) dπ [log Dr,π(s, a) log(1 Dr,π(s, a))] = max π A S E(s,a) dπ [r(s, a) log π(a|s)] . (5) It turns out that AIRL minimizes the divergence between visitation distributions dπ and dπE by solving minπ A S KL(dπ||dπE ) for Kullback-Leibler (KL) divergence KL( || ) (Ghasemipour et al., 2019) . 3 INVERSE REINFORCEMENT LEARNING IN REGULARIZED MDPS In this section, we propose the solution of IRL in regularized MDPs regularized IRL and relevant properties in Section 3.1. We then discuss a specific instance of our proposed solution in Section 3.2 where Tsallis entropy regularizers and multi-variate Gaussian policies are used in continuous action spaces. 3.1 SOLUTIONS OF REGULARIZED IRL We consider regularized IRL that generalizes Max Ent IRL in Eq.(4) to IRL with a class of strongly convex policy regularizers: IRLΩ(πE) := arg max r RS A JΩ(r, πE) max π A S JΩ(r, π) For any strongly convex policy regularizer Ω, regularized IRL does not suffer from degenerate solutions since there is a unique optimal policy in any regularized MDP (Geist et al., 2019). While Geist et al. 
(2019) proposed solutions of regularized IRL, those solutions are generally intractable (see Appendix F.1 for a detailed explanation). In the following lemma, we propose tractable solutions that only require the evaluation of the policy regularizer Ω and its gradient ∇Ω, which are more manageable in practice. Our solution is motivated by finding a reward function that converts regularized RL into an equivalent divergence-minimization problem between π and π_E:

Lemma 1. For a policy regularizer Ω : Δ_A → R, let us define

t(s, a; π) := Ω'(s, a; π) − E_{a'∼π(·|s)}[Ω'(s, a'; π)] + Ω(π(·|s))   (7)

for Ω'(s, ·; π) := ∇Ω(π(·|s)) := [∇_p Ω(p)]_{p=π(·|s)} ∈ R^A, s ∈ S. Then, t(s, a; π_E) for an expert policy π_E is a solution of regularized IRL with Ω.

Proof. (Abbreviated; see Appendix B for the full version.) With r(s, a) = t(s, a; π_E), the RL objective in Eq.(1) becomes equivalent to the problem of minimizing the discounted sum of Bregman divergences between π and π_E,

argmin_{π ∈ Δ_A^S} E_π[ Σ_{i=0}^∞ γ^i D^A_Ω(π(·|s_i) || π_E(·|s_i)) ],   (8)

where D^A_Ω is the Bregman divergence (Bregman, 1967) defined by D^A_Ω(p_1||p_2) = Ω(p_1) − Ω(p_2) − ⟨∇Ω(p_2), p_1 − p_2⟩_A for p_1, p_2 ∈ Δ_A. Due to the non-negativity of the Bregman divergence, π = π_E is a solution of Eq.(8), and it is unique since Eq.(1) has a unique solution for arbitrary reward functions (Geist et al., 2019).

In particular, for any policy regularizer Ω represented as an expectation over the policy (Yang et al., 2019), Lemma 1 reduces to the following solution in Corollary 1:

Corollary 1. For Ω(p) = λ E_{a∼p}[φ(p(a))] with p ∈ Δ_A (Yang et al., 2019), Eq.(7) becomes

t(s, a; π) = λ ( f'_φ(π(a|s)) − E_{a'∼π(·|s)}[ f'_φ(π(a'|s)) − φ(π(a'|s)) ] )   (9)

for f'_φ(x) := ∂_x (x φ(x)).

The proof is in Appendix C. Throughout the paper, we refer to the expectation E_{a'∼π(·|s)}[ f'_φ(π(a'|s)) − φ(π(a'|s)) ] as the reward baseline. Note that for continuous control tasks with Ω(p) = λ E_{a∼p}[φ(p(a))], we can obtain the same form of the reward as in Eq.(9) (the proof is in Appendix D). Although the reward baseline is generally intractable in continuous control tasks, we derive a tractable reward baseline for a special case (see Section 3.2). Additionally, when λ = 1 and φ(x) = log x, it can be shown that t(s, a; π) = log π(a|s) for both discrete and continuous control problems, which was used as a reward objective in previous work (Fu et al., 2018), and that the Bregman divergence in Eq.(8) becomes the KL divergence KL(π(·|s) || π_E(·|s)).

Additional solutions of regularized IRL can be obtained by shaping t(s, a; π_E), as stated in the following lemma:

Lemma 2 (Potential-based reward shaping). Let π* be the solution of Eq.(1) in a regularized MDP M_r with a regularizer Ω : Δ_A → R and a reward function r ∈ R^{S×A}. Then, for Φ ∈ R^S, using either r(s, a) + γΦ(s') − Φ(s) or r(s, a) + γ E_{s'∼P(·|s,a)}[Φ(s')] − Φ(s) as a reward does not change the solution of Eq.(1).

The proof is in Appendix E. With Lemma 1 and Lemma 2, we establish sufficient conditions for rewards to be solutions of the IRL problem. However, the necessary condition, i.e., that this set of solutions contains the only possible solutions for an arbitrary MDP, is not proved (Ng et al., 1999). In the following lemma, we examine how the proposed solution relates to the normalized state-action visitation distribution, which can be discussed from the distribution-matching perspective on imitation learning problems (Ho & Ermon, 2016; Fu et al., 2018; Ke et al., 2019; Ghasemipour et al., 2019): Lemma 3.
Given the policy regularizer Ω, let us define Ω(d) := E(s,a) d[Ω( πd( |s))] for an arbitrary normalized state-action visitation distribution d S A and the policy πd(a|s) := d(s,a) P a A d(s,a ) induced by d. Then, Eq.(7) is equal to t(s, a; πd) = [ Ω(d)](s, a). (10) When Ω(d) is strictly convex and a solution t(s, a; πE) = Ω(dπE ) of IRL in Eq.(10) is used, the RL objective in Eq.(1) is equal to arg min π A S DS A Ω (dπ||dπE ), where DS A Ω is the Bregman divergence among visitation distributions defined by DS A Ω (d1||d2) = Ω(d1) Ω(d2) Ω(d2), d1 d2 for visitation distributions d1 and d2. The proof is in Appendix G. Published as a conference paper at ICLR 2021 Note that the strict convexity of Ωis required for DS A Ω to become a valid Bregman divergence. Although the strict convexity of a policy regularizer Ωdoes not guarantee the assumption on the strict convexity of Ω, it has been shown to be true for Shannon entropy regularizer ( Ω= H of Lemma 3.1 in Ho & Ermon (2016)) and Tsallis entropy regularizer with its constants k = 1 2, q = 2 ( Ω= W of Theorem 3 in Lee et al. (2018)). 3.2 IRL WITH TSALLIS ENTROPY REGULARIZATION AND GAUSSIAN POLICIES For continuous controls, multi-variate Gaussian policies are often used in practice (Schulman et al., 2015; 2017) and we consider IRL problems with those policies in this subsection. In particular, we consider IRL with Tsallis entropy regularizer Ω(p) = T k q (p) = Ea p[ k q 1(1 p(a)q 1)] (Lee et al., 2018; Yang et al., 2019; Lee et al., 2020) for a multi-variate Gaussian policy π( |s) = N (µ(s),Σ(s)) with µ(s) = [µ1(s), ..., µd(s)]T , Σ(s) = diag{(σ1(s))2, ..., (σd(s))2}. In such a case, we can obtain tractable forms of the following quantities: Tsallis entropy. The tractable form of Tsallis entropy for a multi-variate Gaussian policy is T k q (π( |s)) = k(1 e(1 q)Rq(π( |s))) q 1 , Rq(π( |s)) = 2πσi(s)) log q 2(1 q) for Renyi entropy Rq. Its derivation is given in Appendix I. Reward baseline. The reward baseline term Ea π( |s)[f φ(π(a|s)) φ(π(a|s))] in Corollary 1 is generally intractable except for either discrete control problems or Shannon entropy regularization where the reward baseline is equal to 1. Interestingly, as long as the tractable form of Tsallis entropy can be derived, that of the corresponding reward baseline can also be derived since the reward baseline satisfies Ea π( |s)[f φ(π(a|s)) φ(π(a|s))] = (q 1)Ea π( |s)[φ(π(a|s)] k = (q 1)T k q (π( |s)) k. Here, the first equality holds with f φ(x) = k q 1(1 qxq 1) = qφ(x) k for Tsallis entropy regularization. Bregman divergence associated with Tsallis entropy. For two different multivariate Gaussian policies, we derive the tractable form of the Bregman divergence (associated with Tsallis entropy) between two policies. The resultant divergence has a complicated form, so we leave its derivation in Appendix I.3. Figure 1: Bregman divergence DΩ(π||πE) associated with Tsallis entropy (Ω(p) = T 1 q (p)) between two uni-variate Gaussian distributions π = N(µ, σ2) and πE = N(0, (e 3)2) (green point in each subplot). In each subplot, we normalized the Bregman divergence so that the maximum value becomes 1. Note that for q = 1, DΩ(π||πE) becomes the KL divergence KL(π||πE). For deeper understanding of Tsallis entropy and its associated Bregman divergence in regularized MDPs, we consider an example in Figure 1. We first assume that both learning agents and experts policies follow uni-variate Gaussian distributions π = N(µ, σ2) and πE = N(0, (e 3)2), respectively. 
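The closed forms above are easy to check numerically. The following minimal sketch (assuming k = 1, a diagonal Gaussian policy, and illustrative variable names of our own) computes the Tsallis entropy through the Rényi entropy R_q and compares it against a Monte Carlo estimate of E_{a∼π}[(1 − π(a)^{q−1})/(q − 1)]; closed forms of this kind are what the Figure 1 evaluation below relies on.

```python
import numpy as np

def tsallis_entropy_gaussian(sigmas, q, k=1.0):
    """Closed-form T_q^k of a diagonal Gaussian via its Renyi entropy R_q (Section 3.2)."""
    sigmas = np.asarray(sigmas)
    d = sigmas.size
    renyi = np.sum(np.log(np.sqrt(2 * np.pi) * sigmas)) - d * np.log(q) / (2 * (1 - q))
    return k * (1.0 - np.exp((1.0 - q) * renyi)) / (q - 1.0)

def gaussian_pdf(x, mu, sigmas):
    z = (x - mu) / sigmas
    return np.exp(-0.5 * np.sum(z**2, axis=-1)) / np.prod(np.sqrt(2 * np.pi) * sigmas)

# Monte Carlo check of T_q^k = E_{a~pi}[ k/(q-1) * (1 - pi(a)^(q-1)) ] for q = 1.5, k = 1.
rng = np.random.default_rng(0)
mu, sigmas, q = np.zeros(2), np.array([0.5, 2.0]), 1.5
samples = mu + sigmas * rng.standard_normal((200_000, 2))
mc = np.mean((1.0 - gaussian_pdf(samples, mu, sigmas) ** (q - 1.0)) / (q - 1.0))
print(tsallis_entropy_gaussian(sigmas, q), mc)   # the two values should agree closely
```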
We then evaluate the Bregman divergence in Figure 1 by using its tractable form and varying q from 1.0, which corresponds to the KL divergence, to 2.0. We observe that the constant q of the Tsallis entropy affects the sensitivity of the associated Bregman divergence w.r.t. the mean and standard deviation of the learning agent's policy π. Specifically, as q increases, the size of the valley (the relatively red region in Figure 1) across the µ-axis and log σ-axis decreases. This suggests that for larger q, minimizing the Bregman divergence requires matching the means and variances of π and π_E more tightly.

Algorithm 1: Regularized Adversarial Inverse Reinforcement Learning (RAIRL)
1: Input: a set D_E of expert demonstrations generated by an expert policy π_E, a reward approximator r_θ, a policy π_ψ for neural network parameters θ and ψ
2: for each iteration do
3:   Sample rollout trajectories by using the learner's policy π_ψ.
4:   Optimize θ with the discriminator D_{r_θ,π_ψ} and the learning objective in Eq.(11).
5:   Optimize ψ with Regularized Actor-Critic by using r_θ as a reward function.
6: end for
7: Output: π_ψ ≈ π_E, r_θ(s, a) ≈ t(s, a; π_E) (a solution of the IRL problem in Lemma 1).

4 ALGORITHMIC CONSIDERATION

Based on the solution of regularized IRL from the previous section, we now develop a practical IRL algorithm. In particular, to recover the reward function t(s, a; π_E) in Lemma 1, we design an adversarial training objective as follows. Motivated by AIRL (Fu et al., 2018), we consider the following structured discriminator associated with π, r and t in Lemma 1:

D_{r,π}(s, a) = σ(r(s, a) − t(s, a; π)),  σ(z) = 1/(1 + e^{−z}), z ∈ R.

Note that we recover the discriminator of AIRL in Eq.(5) when t(s, a; π) = log π(a|s) (φ(x) = log x and λ = 1). We then consider the following optimization objective for the discriminator, which is the same as that of AIRL:

t̂(s, a; π) := argmax_{r ∈ R^{S×A}} E_{(s,a)∼d_{π_E}}[log D_{r,π}(s, a)] + E_{(s,a)∼d_π}[log(1 − D_{r,π}(s, a))].   (11)

Since the function x ↦ a log σ(x) + b log(1 − σ(x)) attains its maximum at σ(x) = a/(a + b), or equivalently at x = log(a/b) (Goodfellow et al., 2014; Mescheder et al., 2017), it can be shown that

t̂(s, a; π) = t(s, a; π) + log( d_{π_E}(s, a) / d_π(s, a) ).   (12)

When π = π_E in Eq.(12), we have t̂(s, a; π_E) = t(s, a; π_E) since d_π = d_{π_E}, which means the maximizer t̂ becomes the solution of IRL once the agent successfully imitates the expert policy π_E. To do so, we consider the following iterative algorithm. Assuming that we find the optimal reward approximator t̂(s, a; π^(i)) in Eq.(12) for the policy π^(i) of the i-th iteration, we obtain the policy π^(i+1) by optimizing the following objective with gradient ascent:

maximize_{π ∈ Δ_A^S} E_{(s,a)∼d_π}[ t̂(s, a; π^(i)) − Ω(π(·|s)) ].   (13)

The expectation in Eq.(13) can be decomposed into the following two terms:

E_{(s,a)∼d_π}[ t̂(s, a; π^(i)) − Ω(π(·|s)) ] = E_{(s,a)∼d_π}[ t(s, a; π^(i)) − Ω(π(·|s)) ] − KL(d_π || d_{π_E}) = − E_{(s,a)∼d_π}[ D^A_Ω(π(·|s) || π^(i)(·|s)) ] − KL(d_π || d_{π_E}),   (14)

where we denote the Bregman-divergence term by (I) and the KL term by (II), and the second equality follows since Lemma 1 tells us that t(s, a; π^(i)) is a reward function that makes π^(i) an optimal policy in the Ω-regularized MDP. Minimizing term (II) in Eq.(14) makes π^(i+1) close to π_E, while minimizing term (I) can be regarded as conservative policy optimization around the policy π^(i) (Schulman et al., 2015).
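In practice, the maximization in Eq.(11) is a logistic-regression update on the logit r_θ(s, a) − t(s, a; π). The following PyTorch sketch (module and function names are ours, not the released implementation) shows one way to write that update; with Shannon regularization and λ = 1 we have t(s, a; π) = log π(a|s), and the loss reduces to that of AIRL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Non-structured reward model r_theta(s, a): a plain MLP, as in AIRL."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def rairl_discriminator_loss(r_net, t_fn, expert_batch, agent_batch):
    """Eq.(11): maximize E_expert[log D] + E_agent[log(1 - D)] with D = sigmoid(r - t),
    implemented as minimizing the equivalent binary cross-entropy on the logit r - t."""
    s_e, a_e = expert_batch
    s_p, a_p = agent_batch
    logit_e = r_net(s_e, a_e) - t_fn(s_e, a_e)    # expert transitions get label 1
    logit_p = r_net(s_p, a_p) - t_fn(s_p, a_p)    # agent transitions get label 0
    return (F.binary_cross_entropy_with_logits(logit_e, torch.ones_like(logit_e))
            + F.binary_cross_entropy_with_logits(logit_p, torch.zeros_like(logit_p)))

# With Shannon regularization (lambda = 1), t(s, a; pi) = log pi(a|s), recovering AIRL:
# t_fn = lambda s, a: policy.log_prob(s, a)   # hypothetical policy object with a log_prob method
```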
In practice, we parameterize our reward and policy approximations with neural networks and train them using an off-policy Regularized Actor-Critic (RAC) (Yang et al., 2019) as described in Algorithm 1.Below, we evaluate our Regularized Adversarial Inverse Reinforcement Learning (RAIRL) approach across various scenarios. Published as a conference paper at ICLR 2021 5 EXPERIMENTS We summarize the experimental setup as follows. In our experiments, we consider Ω(p) = λEa p[φ(p(a)] with the following regularizers from Yang et al. (2019): (1) Shannon entropy (φ(x) = log x), (2) Tsallis entropy regularizer (φ(x) = k q 1(1 xq 1)), (3) exp regularizer (φ(x) = e ex), (4) cos regularizer (φ(x) = cos( π 2 x)), (5) sin regularizer (φ(x) = 1 sin π 2 x). The above regularizers were chosen since other regularizers have not been empirically validated to the best of our knowledge. We chose these regularizers to make our empirical analysis more tractable. In addition, we model the reward approximator of RAIRL as a neural network with either one of the following models: (1) Non-structured model (NSM) a simple feed-forward neural network that outputs real values used in AIRL (Fu et al., 2018) and (2) Density-based model (DBM) a model using a neural network for π (softmax for discrete controls and multi-variate Gaussian model for continuous controls) of the solution in Eq.(1) (See Appendix J.2 for a detailed explanation). For the RL algorithm of RAIRL, we implement Regularized Actor Critic (RAC) (Yang et al., 2019) on top of the SAC implementation from Rlpyt (Stooke & Abbeel, 2019). Other settings are summarized in Appendix J. For all experiments, we use 5 runs and report 95% confidence intervals. 5.1 EXPERIMENT 1: MULTI-ARMED BANDIT (DISCRETE ACTION) We consider a 4-armed bandit environment as shown in Figure 2 (left). An expert policy πE is assumed to be either dense (with probability 0.1, 0.2, 0.3, 0.4 for a = 0, 1, 2, 3) or sparse (with probability 0, 0, 1/3, 2/3 for a = 0, 1, 2, 3). For those experts, we use RAIRL with actions sampled from πE and compare learned rewards with the ground truth reward t(s, a; πE) in Lemma 1. When πE is dense, RAIRL successfully acquires the ground truth rewards irrespective of the reward model choices. When sparse πE is used, however, RAIRL with a non-structured model (RAIRL-NSM) fails to recover the rewards for a = 0, 1 where πE(a) = 0 due to the lack of samples at the end of the imitation. On the other hand, RAIRL with a density-based model (RAIRL-DBM) can recover the correct rewards due to the softmax layer which maintains the sum over the outputs equal to 1. Therefore, we argue that using DBM is necessary for correct reward acquisition since a set of demonstrations is generally sparse. In the following experiment, we show that the choice of reward models indeed affects the performance of rewards. Figure 2: Expert policy (Left) and reward learned by RAIRL with different types of policy regularizers (Right) in Multi-armed Bandit. Either one of dense (Top row) or sparse (Bottom row) expert policies πE is considered. 5.2 EXPERIMENT 2: BERMUDA WORLD (CONTINUOUS OBSERVATION, DISCRETE ACTION) We consider an environment with a 2-dimensional continuous state space as described in Figure 3. At each episode, the learning agent is initialized uniformly on the x-axis between 5 and 5, and there are 8 possible actions an angle in { π, 3 4π} that determines the direction of movement. 
An expert in Bermuda World considers 3 target positions ( 5, 10), (0, 10), (5, 10) and behaves stochastically. We state how we mathematically define the expert policy πE in Appendix J.3. During RAIRL s training (Figure 3, Top row), we use 1000 demonstrations sampled from the expert and periodically measure mean Bregman divergence, i.e., for DA Ω(p1||p2) = Ea p1[f φ(p2(a)) Published as a conference paper at ICLR 2021 φ(p1(a))] Ea p2[f φ(p2(a)) φ(p2(a))], i=1 DA Ω(π( |si)||πE( |si)). Here, the states s1, ..., s N come from 30 evaluation trajectories that are stochastically sampled from the agent s policy π which is fixed during evaluation in a separate evaluation environment. During the evaluation of learned reward (Figure 3, Bottom row), we train randomly initialized agents with RAC and rewards acquired from RAIRL s training and check if the mean Bregman divergence is properly minimized. We measure the mean Bregman divergence as was done in RAIRL s training. RAIRL-DBM is shown to minimize the target divergence more effectively compared to RAIRL-NSM during reward evaluation, although both achieve comparable performances during RAIRL s training. Moreover, we substitute λ with 1, 5, 10 and observe that learning with λ larger than 1 returns better rewards only λ = 1 was considered in AIRL (Fu et al., 2018). Note that in all cases, the minimum divergence achieved by RAIRL is comparable with that of behavioral cloning (BC). This is because BC performs sufficiently well when many demonstrations are given. We think the divergence of BC may be the near-optimal divergence that can be achieved with our policy neural network model. Figure 3: Mean Bregman divergence during training (Top row) and the divergence during reward evaluation (Bottom row) in Bermuda World. In each column, different policy regularizers and their respective target divergences are considered. The results are reported after normalization with the divergence of uniform random policy, and that of behavioral cloning (BC) is reported for comparison. 5.3 EXPERIMENT 3: MUJOCO (CONTINUOUS OBSERVATION AND ACTION) We validate RAIRL on Mu Jo Co continuous control tasks (Hopper-v2, Walker-v2, Half Cheetah-v2, Ant-v2) as follows. We assume multivariate-Gaussian policies (with diagonal covariance matrices) for both learner s policy π and expert policy πE. Instead of tanh-squashed policy in Soft-Actor Critic (Haarnoja et al., 2018), we use hyperbolized environments where tanh is regarded as a part of the environment with additional engineering on the policy networks (See Appendix J.4 for details). We use 100 demonstrations stochastically sampled from πE to validate RAIRL. In Mu Jo Co experiments, we focus on a set of Tsallis entropy regularizer (Ω= T 1 q ) with q = 1, 1.5, 2 where Tsallis entropy becomes Shannon entropy for q = 1. We then exploit the tractable quantities for multi-variate Gaussian distributions in Section 3.2 to stabilize RAIRL and check its performance in terms of its mean Bregman divergence similar to the previous experiment. Note that since both π and πE are multi-variate Gaussian and can be evaluated, we can evaluate the individual Bregman divergence DA Ω(π( |s)||πE( |s)) on s by using the derivation in Appendix I.3. The performances during RAIRL s training are described as follows. We report π with both an episodic score (Figure 4) and mean Bregman divergences with respect to three types of Tsallis entropies (Figure 5) Ω= T 1 q with q = 1, 1.5, 2. 
Note that the objective of RAIRL with Ω= T 1 q is to minimize the corresponding mean Bregman divergence with q = q. In Figure 4, both RAIRLDBM and RAIRL-NSM are shown to achieve the expert performance, irrespective of q, in Hopperv2, Walker-v2, and Half Cheetah-v2. In contrast, RAIRL in Ant-v2 fails to achieve the expert s performance within 2,000,000 steps and RAIRL-NSM highly outperforms RAIRL-DBM in our Published as a conference paper at ICLR 2021 setting. Although the episodic scores are comparable for all methods in Hopper-v2, Walker-v2, and Half Cheetah-v2, respective divergences are shown to be highly different from one another as shown in Figure 5. RAIRL with q = 2 in most cases achieves the minimum mean Bregman divergence (for all three divergences with q = 1, 1.5, 2), whereas RAIRL with q = 1 which corresponds to AIRL (Fu et al., 2018) achieves the maximum divergence in most cases. This result is in alignment with our intuition from Section 3.2; as q increases, minimizing the Bregman divergence requires much tighter matching between π and πE. Unfortunately, while evaluating the acquired reward RAC with a randomly initialized agent and acquired reward the target divergence is not properly decreased in continuous controls. We believe this is because π is a probability density function in continuous controls and causes large variance during training, while π is a mass function and is well-bounded in discrete control problems. Figure 4: Averaged episodic score of RAIRL s training in Mu Jo Co environments. RAIRL with T 1 q regularizer with q = 1, 1.5, 2 is considered. Figure 5: Bregman divergences with Tsallis entropy T 1 q with q = 1, 1.5, 2 during RAIRL s training in Mu Jo Co environments. We consider RAIRL with Tsallis entropy regularizer T 1 q with q = 1, 1.5, 2. 6 DISCUSSION AND FUTURE WORKS We consider the problem of IRL in regularized MDPs (Geist et al., 2019), assuming a class of strongly convex policy regularizers. We theoretically derive its solution (a set of reward functions) and show that learning with these rewards is equivalent to a specific instance of imitation learning i.e., one that minimizes the Bregman divergence associated with policy regularizers. We propose RAIRL a practical sampled-based IRL algorithm in regularized MDPs and evaluate its applicability on policy imitation (for discrete and continuous controls) and reward acquisition (for discrete control). Finally, recent advances in imitation learning and IRL are built from the perspective of regarding imitation learning as statistical divergence minimization problems (Ke et al., 2019; Ghasemipour et al., 2019). Although Bregman divergence is known to cover various divergences, it does not include some divergence families such as f-divergence (Csiszár, 1963; Amari, 2009). Therefore, we believe that considering RL with policy regularization different from Geist et al. (2019) and its inverse problem is a possible way of finding the links between imitation learning and various statistical distances. Published as a conference paper at ICLR 2021 ACKNOWLEDGMENTS We would like to thank Daewoo Kim at Waymo for his valuable comments on this work. Shun-Ichi Amari. α-divergence is unique, belonging to both f-divergence and bregman divergence classes. IEEE Transactions on Information Theory, 55(11):4925 4931, 2009. Abdeslam Boularias and Brahim Chaib-Draa. Bootstrapping apprenticeship learning. In Advances in Neural Information Processing Systems (Neur IPS), pp. 289 297, 2010. Lev M Bregman. 
The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200 217, 1967. Imre Csiszár. Eine informationstheoretische ungleichung und ihre anwendung auf den beweis der ergodizitat von markoffschen ketten. Magyar. Tud. Akad. Mat. Kutató Int. Közl, 8:85 108, 1963. Robert Dadashi, Léonard Hussenot, Matthieu Geist, and Olivier Pietquin. Primal wasserstein imitation learning. ar Xiv preprint ar Xiv:2006.04678, 2020. Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. ar Xiv preprint ar Xiv:1611.03852, 2016a. Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 49 58, 2016b. Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adverserial inverse reinforcement learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actorcritic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1582 1591, 2018. Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 2160 2169, 2019. Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Proceedings of the 3rd Conference on Robot Learning (Co RL), 2019. Ian Goodfellow, Jean Pouget Abadie, Mehdi Mirza, Bing Xu, David Warde Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (Neur IPS), pp. 2672 2680, 2014. Xin Guo, Johnny Hong, and Nan Yang. Ambiguity set and learning via Bregman and Wasserstein. ar Xiv preprint ar Xiv:1705.08056, 2017. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861 1870, 2018. Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (Neur IPS), pp. 4565 4573, 2016. Lee K Jones and Charles L Byrne. General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis. IEEE Transactions on Information Theory, 36(1):23 30, 1990. Published as a conference paper at ICLR 2021 Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. ar Xiv preprint ar Xiv:1905.12888, 2019. Kyungjae Lee, Sungjoon Choi, and Songhwai Oh. Maximum causal tsallis entropy imitation learning. In Advances in Neural Information Processing Systems (Neur IPS), pp. 4403 4413, 2018. Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, and Songhwai Oh. Tsallis reinforcement learning: A unified framework for maximum entropy reinforcement learning. ar Xiv preprint ar Xiv:1902.00137, 2019. 
Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, Mineui Hong, Jaein Kim, Yong-Lae Park, and Songhwai Oh. Generalized Tsallis entropy reinforcement learning and its application to soft mobile robots. In Robotics: Science and Systems (RSS), 2020. Kun Li, Mrinal Rath, and Joel W Burdick. Inverse reinforcement learning via function approximation for clinical motion analysis. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 610 617. IEEE, 2018. Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2391 2400, 2017. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529 533, 2015. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928 1937, 2016. Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), pp. 278 287, 1999. Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 663 670, 2000. Frank Nielsen and Richard Nock. On Rényi and Tsallis entropies and divergences for exponential families. ar Xiv preprint ar Xiv:1105.3259, 2011. Stuart Russell. Learning agents for uncertain environments. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 101 103, 1998. John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1889 1897, 2015. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Sahand Sharifzadeh, Ioannis Chiotellis, Rudolph Triebel, and Daniel Cremers. Learning to drive using inverse reinforcement learning and deep Q-networks. ar Xiv preprint ar Xiv:1612.03653, 2016. Adam Stooke and Pieter Abbeel. rlpyt: A research code base for deep reinforcement learning in pytorch. ar Xiv preprint ar Xiv:1909.01500, 2019. Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 1032 1039, 2008. Zheng Wu, Liting Sun, Wei Zhan, Chenyu Yang, and Masayoshi Tomizuka. Efficient sampling-based maximum entropy inverse reinforcement learning with application to autonomous driving. IEEE Robotics and Automation Letters, 5(4):5355 5362, 2020. Published as a conference paper at ICLR 2021 Wenhao Yang, Xiang Li, and Zhihua Zhang. A regularized approach to sparse optimal policy in reinforcement learning. In Advances in Neural Information Processing Systems (Neur IPS), pp. 5938 5948, 2019. Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. Gen Dice: Generalized offline estimation of stationary values. 
In International Conference on Learning Representations, 2019. Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010. Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, volume 8, pp. 1433 1438, 2008. Published as a conference paper at ICLR 2021 A BELLMAN OPERATORS, VALUE FUNCTIONS IN REGULARIZED MDPS Let a policy regularizer Ω: A R be strongly convex, and define the convex conjugate of Ωis Ω : RA R as Ω (Q(s, )) = max π( |s) A π( |s), Q(s, ) A Ω(π( |s)), Q RS A, s S. (15) Then, Bellman operators, equations and value functions in regularied MDPs are defined as follows. Definition 1 (Regularized Bellman Operators). For V RS, let us define Q(s, a) = r(s, a) + γEs P ( |s,a)[V (s )]. The regularized Bellman evaluation operator is defined as [T πV ](s) := π( |s), Q(s, ) A Ω(π( |s)), s S, and T πV = V is called the regularized Bellman equation. Also, the regularized Bellman optimality operator is defined as [T V ](s) := max π( |s) A[T πV ](s) = Ω (Q(s, )), s S, and T V = V is called the regularized Bellman optimality equation. Definition 2 (Regularized value functions). The unique fixed point V π of the operator T π is called the state value function, and Qπ(s, a) = r(s, a) + γEs P ( |s,a)[V π(s )] is called the state-action value function. Also, the unique fixed point V of the operator T is called the optimal state value function, and Q (s, a) = r(s, a) + γEs P ( |s,a)[V (s )] is called the optimal state-action value function. It should be noted that Proposition 1 in Geist et al. (2019) tells us Ω (Q(s, )) is a policy that uniquely maximizes Eq.(15). For example, when Ω(π( |s)) = P a π( |s) log π(a|s) (negative Shan- non entropy), Ω (Q(s, )) is a softmax policy, i.e., Ω (Q(s, )) = exp(Q(s, )) P a A exp(Q(s,a )). Due to this property, the optimal policy π of a regularized MDP is uniquely found for the optimal state-action value function Q which is also uniquely defined as the fixed point of T . B PROOF OF LEMMA 1 Let us define πs = π( |s). For r(s, a) = t(s, a; πE), the RL objective Eq.(1) satisfies i=0 γi {r(si, ai) Ω(πsi)} i=0 γi {t(s, a; πE) Ω(πsi)} i=0 γi n Ω (si, ai; πE) Ea π si E Ω (si, a; πE) + Ω(πsi E ) Ω(πsi) o# i=0 γi n Ea πsiΩ (si, a; πE) Ea π si E Ω (si, a; πE) + Ω(πsi E ) Ω(πsi) o# i=0 γi Ω(πsi E ), πsi A Ω(πsi E ), πsi E A + Ω(πsi E ) Ω(πsi) # i=0 γi Ω(πsi E ) Ω(πsi) + Ω(πsi E ), πsi A Ω(πsi E ), πsi E A # i=0 γi Ω(πsi) Ω(πsi E ) Ω(πsi E ), πsi A + Ω(πsi E ), πsi E A # i=0 γi Ω(πsi) Ω(πsi E ) Ω(πsi E ), πsi πsi E A # i=0 γi DA Ω(πsi||πsi E )) Published as a conference paper at ICLR 2021 where (i) follows from the assumption r(s, a) = t(s, a; πE) in Lemma 1, (ii) follows from the definition of t(s, a; π) in Eq.(7), (iii) follows since taking the inner expectation first does not change the overall expectation, (iv) follows from the definition of Ω in Lemma 1 and P a A p(a)[ Ω(p)](a) = Ω(p), p A, and (v) follows from the definition of Bregman divergence, i.e., DA Ω(p1||p2) = Ω(p1) Ω(p2) Ω(p2), p1 p2 A. Due to the non-negativity of DA Ω, Eq.(16) is less than or equal to zero which can be achieved when π = πE. C PROOF OF COROLLARY 1 Let a {1, ..., |A|} and πa = π(a) for simplicity. For Ω(π) = λEa πφ(πa) = λ X a A πaφ(πa) = λ X with fφ(x) = xφ(x), we have Ω(π) = λ π1,...,π|A| X a A fφ(πa) = λ[f φ(π1), ..., f φ(π|A|)]T for f φ(x) = x(xφ(x)). 
Therefore, for πs = π( |s) we have t(s, a; π) = [ Ω(πs)](a) Ω(πs), πs A + Ω(πs) = λf φ(πs a) a A πs a ( λf φ(πs a )) a A πs a φ(πs a ) a A πs a f φ(πs a ) a A πs a φ(πs a ) f φ(πs a) X a A πs a f φ(πs a ) φ(πs a ) ) = λ f φ(πs a) Ea πs f φ(πs a ) φ(πs a ) . D PROOF OF OPTIMAL REWARDS ON CONTINUOUS CONTROLS Note that for two continuous distributions P1 and P2 having probability density functions p1(x) and p2(x), respectively, the Bregman divergence can be defined as (Guo et al., 2017; Jones & Byrne, 1990) DX ω (P1||P2) := Z X {ω(p1(x)) ω(p2(x)) ω (p2(x))(p1(x) p2(x))} dx, (17) where ω (x) := xω(x) and the divergence is measured point-wisely on x X. Let us assume A ω(π(a))da (18) for a probability density function π on A and define t(s, a; π) := ω (πs(a)) Z A [πs(a )ω (πs(a )) ω(πs(a ))] da . (19) Published as a conference paper at ICLR 2021 for πs = π( |s). For r(s, a) = t(s, a; πE), the RL objective in Eq.(1) satisfies i=0 γi {r(si, ai) Ω(πsi)} i=0 γi {t(si, ai; πE) Ω(πsi)} i=0 γi ω (πsi E (ai)) Z πsi E (a)ω (πsi E (a)) ω(πsi E (a)) da Z A ω(πsi(a))da # A πsi(a)ω (πsi E (a))da Z πsi E (a)ω (πsi E (a)) ω(πsi E (a)) + ω(πsi(a)) da # πsi(a)ω (πsi E (a)) πsi E (a)ω (πsi E (a)) ω(πsi E (a)) + ω(πsi(a)) da ω(πsi E (a)) ω(πsi(a)) + πsi(a)ω (πsi E (a)) πsi E (a)ω (πsi E (a)) da ω(πsi(a)) ω(πsi E (a)) πsi(a)ω (πsi E (a)) + πsi E (a)ω (πsi E (a)) da ω(πsi(a)) ω(πsi E (a)) ω (πsi E (a)) πsi(a) πsi E (a) da i=0 γi DA ω (πsi||πsi E )) where (i) follows from r(s, a) = t(s, a; πE), (ii) follows from Eq.(18) and Eq.(19), and (iii) follows from the definition of Bregman divergence in Eq.(17). Due to the non-negativity of Dω, Eq.(24) is less than or equal to zero which can be achieved when π = πE. Also, π = πE is a unique solution since Eq.(1) has a unique solution for arbitrary reward functions. E PROOF OF LEMMA 2 Since Lemma 2 was mentioned but not proved in Geist et al. (2019), we derive the rigorous proof for Lemma 2 in this subsection. Note that we follow the proof idea in Ng et al. (1999). First, let us assume an MDP Mr with a reward r and corresponding optimal policy π . From Definition 1, the optimal state-value function V ,r and its corresponding state-action value function Q ,r(s, a) := r(s, a) + γEs P ( |s,a)[V ,r(s )] should satisfy the regularized Bellman optimality equation V ,r(s) = T ,r V ,r(s) = Ω (Q ,r(s, a)) = max π( |s) A π( |s), Q ,r(s, a) A Ω(π( |s)) = max π( |s) A π( |s), r(s, a) + γEs P ( |s,a)[V ,r(s )] A Ω(π( |s)), (21) where we explicitize the dependencies on r. Also, the optimal policy π is a unique maximizer for the above maximization. Now, let us consider the shaped rewards r(s, a)+Φ(s ) Φ(s) and r(s, a)+Es P ( |s,a)Φ(s ) Φ(s). Please note that for both rewards, the expectation over s for given s, a is equal to r(s, a) = r(s, a) + Es P ( |s,a)Φ(s ) Φ(s), Published as a conference paper at ICLR 2021 and thus, it is sufficient to only consider the optimality for r. By subtracting Φ(s) from the regularized optimality Bellman equation for r in Eq.(21), we have V ,r(s) Φ(s) = max π( |s) A π( |s), r(s, a) + γEs P ( |s,a)Φ(s ) Φ(s) + γEs P ( |s,a)[V ,r(s ) Φ(s )] A = max π( |s) A π( |s), r(s, a) + γEs P ( |s,a)[V ,r(s ) Φ(s )] A Ω(π( |s)) = [T , r(V ,r Φ)](s). That is, V ,r Φ is the fixed point of the regularized Bellman optimality operator T , r associated with the shaped reward r. Also, a maximizer for the above maximization is π since subtracting Φ(s) from Eq.(21) does not change the unique maximizer π . F COMPARISON BETWEEN OUR SOLUTION AND EXISTING SOLUTION IN GEIST ET AL. 
(2019) F.1 ISSUES WITH THE SOLUTIONS IN GEIST ET AL. (2019) In Proposition 5 of Geist et al. (2019), a solution of regularized IRL is given, and we rewrite the relevant theorem in this subsection. Let us consider a regularized IRL problem, where πE A S is an expert policy. Assuming that both the model (dynamics, discount factor and regularizer) and the expert policy are known, Geist et al. (2019) proposed a solution of regularized IRL as follows: Lemma 4 (A solution of regularized IRL from Geist et al. (2019)). Let QE RS A be a function such that πE( |s) = Ω (QE(s, )), s S. Also, define r E(s, a) := QE(s, a) γEs P ( |s,a)[Ω (QE(s , ))] = QE(s, a) γEs P ( |s,a)[ πE( |s ), QE(s , ) A Ω(πE( |s ))]. Then, in the Ω-regularized MDP with the reward r E, πE is an optimal policy. Proof. Although the brief idea of the proof is given in Geist et al. (2019), we rewrite the proof in a more rigorous way as follows. Let us define VE(s) = πE( |s), QE(s, ) A Ω(πE( |s)). Then, r E(s, a) = QE(s, a) γEs P ( |s,a)[VE(s )], and thus, QE(s, a) = r E(s, a) + γEs P ( |s,a)[VE(s )]. By using this and regularized Bellman optimality operator, we have [T VE](s) (i)= Ω (QE(s, )) (ii) = max π( |s) A π( |s), QE(s, ) A Ω(π( |s)) (iii) = πE( |s), QE(s, ) A Ω(πE( |s)) = VE(s), where (i) and (ii) follow from Definition 1, and (iii) follows since πE is a unique maximizer. Thus, VE is the fixed point of T , and πE( |s) = Ω (QE(s, )), s S, becomes an optimal policy. For example, when negative Shannon entropy is used as a regularizer, we can get r E(s, a) = log πE(a|s) by setting QE(s, a) = log πE(a|s). However, a solution proposed in Lemma 4 has two crucial issues: Issue 1. It requires the knowledge on the model dynamics, which is generally intractable. Issue 2. We need to figure out QE that satisfies πE( |s) = Ω (QE(s, )), which comes from the relationship between the optimal value function and optimal policy (Geist et al., 2019). In the following subsection, we show how our solution in Lemma 1 is related to the solution from Geist et al. (2019) in Lemma 4. F.2 RELATION BETWEEN THE SOLUTION OF GEIST ET AL. (2019) AND OUR TRACTABLE SOLUTION Let us consider the expert policy πE and functions QE and r E satisfying the conditions in Lemma 4. From Lemma 2, a regularized MDP with the shaped reward r E(s, a) := r E(s, a) + γEs P ( |s,a)Φ(s ) Φ(s) Published as a conference paper at ICLR 2021 for Φ RS has its optimal policy as πE. Since Φ can be arbitrarily chosen, let us assume Φ(s) = Ω (QE(s, )). Then, we have = r E(s, a) + γEs P ( |s,a)Φ(s ) Φ(s) = QE(s, a) γEs P ( |s,a)[Ω (QE(s , ))] + γEs P ( |s,a)Ω (QE(s , )) Ω (QE(s, )) = QE(s, a) Ω (QE(s, )). (22) Note that the reward in Eq.(22) does not require the knowledge on the model dynamics, which addresses Issue 1 in Appendix F.1. Also, by using VE(s) = Ω (QE(s, )) in the proof of Lemma 4, the reward in Eq.(22) can be written as r E(s, a) = QE(s, a) VE(s), which means r E(s, a) is an advantage function for the optimal policy πE in the Ω-regularized MDP. However, we still have Issue 2 in Appendix F.1 since Ω in Eq.(22) is generally intractable (Geist et al., 2019), which prevents us from finding the appropriate QE(s, a). 
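For the Shannon case, where Ω*(Q(s, ·)) = log Σ_a exp Q(s, a) is known in closed form, the quantities in Lemma 4 and Eq.(22) can be checked numerically; the short sketch below (with illustrative names) verifies that Q_E = log π_E satisfies π_E(·|s) = ∇Ω*(Q_E(s, ·)) and yields r_E(s, a) = log π_E(a|s), the example given above.

```python
import numpy as np
from scipy.special import logsumexp, softmax

# Shannon case: Omega(p) = sum_a p(a) log p(a), so Omega*(Q) = logsumexp(Q) and
# grad Omega*(Q) = softmax(Q). Choosing Q_E(s, .) = log pi_E(.|s) satisfies Lemma 4's condition.
rng = np.random.default_rng(0)
pi_E = rng.dirichlet(np.ones(4), size=3)          # expert policy over 4 actions in 3 states
Q_E = np.log(pi_E)

print(np.allclose(softmax(Q_E, axis=1), pi_E))    # True: pi_E = grad Omega*(Q_E)
print(np.allclose(logsumexp(Q_E, axis=1), 0.0))   # True: Omega*(Q_E(s, .)) = 0 for every s
# Hence r_E(s, a) = Q_E(s, a) - gamma * E_{s'}[Omega*(Q_E(s', .))] = log pi_E(a|s), and the
# shaped reward of Eq.(22), Q_E(s, a) - Omega*(Q_E(s, .)), is also log pi_E(a|s).
```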
Interestingly, we show that for all s S and QE(s, ) = Ω(πE( |s)), Ω (QE(s, )) = arg max π( |s) A π( |s), QE(s, ) A Ω(π( |s)) = arg max π( |s) A π( |s), Ω(πE( |s)) A Ω(π( |s)) = arg min π( |s) A Ω(π( |s)) Ω(πE( |s)), π( |s) A = arg min π( |s) A Ω(π( |s)) Ω(πE( |s)) Ω(πE( |s)), π( |s) πE( |s) A = arg min π( |s) A DA Ω(π( |s)||πE( |s)) = πE( |s), where the last equality holds since the Bregman divergence DA Ω(π( |s)||πE( |s)) is greater than or equal to zero and its lowest value is achieved when π( |s) = πE( |s). This means that when QE(s, ) = Ω(πE( |s)) is used, the condition πE( |s) = Ω (QE(s, )) in Lemma 4 is satisfied without knowing the tractable form of Ω or Ω . Thus, Issue 2 in Appendix F.1 is addressed; we instead require the knowledge on the gradient Ωof the policy regularizer Ω, which is practically more tractable. Finally, by substituting QE(s, ) = Ω (s, ; πE) for Ω (s, ; π) := Ω(π( |s)), s S, to Eq.(22), we have r E(s, a) = Ω (s, a; πE) Ω (Ω (s, ; πE)) = Ω (s, a; πE) { πE( |s), Ω (s, ; πE) A Ω(πE( |s))} = Ω (s, a; πE) Ea πE ( |s)[Ω (s, a ; πE)] + Ω(πE( |s)) = t(s, a; πE), where t(s, a; πE) is our proposed solution in Lemma 1. G PROOF OF LEMMA 3 RL objective in Regularized MDPs w.r.t. normalized visitation distributions. For a reward function r RS A and a strongly convex function Ω: A R, the RL objetive JΩ(r, π) in Eq.(1) is equivalent to arg max π J Ω(r, dπ) := r, dπ S A Ω(dπ), (23) where for a set D of normalized visitation distributions (Syed et al., 2008) a d(s , a ) = (1 γ)P0(s ) + γ X s,a P(s |s, a)d(s, a), s S Published as a conference paper at ICLR 2021 we define Ω(d) := E(s,a) d[Ω( πd( |s))] and πd( |s) := d(s, ) P a d(s,a ) A S for d D and use πdπ( |s) = π( |s) for all s S. For Ω: D R, its convex conjugate Ω is Ω (r) : = max d D J Ω(r, d) = max d D r, d S A Ω(d) (i) = max π A S r, dπ S A Ω(dπ) = max π A S s,a dπ(s, a) [r(s, a) Ω(π(a|s))] = (1 γ) max π A S JΩ(r, π), (24) where (i) follows from using the one-to-one correspondence between policies and visitation distributions (Syed et al., 2008; Ho & Ermon, 2016). Note that Eq.(24) is equal to the optimal discounted average return in regularized MDPs. IRL objective in regularized MDPs w.r.t. normalized visitation distributions. By using the RL objective in Eq.(23), we can rewrite the IRL objective in Eq.(6) w.r.t. the normalized visitation distributions as the maximization of the following objective over r RS A: JΩ(r, πE) max π A S JΩ(r, π) = J Ω(r, dπE ) max d D J Ω(r, d) n J Ω(r, dπE ) J Ω(r, d) o n r, dπE S A Ω(dπE ) r, d S A Ω(d) o n Ω(d) Ω(dπE ) r, d dπE S A o . (25) Note that if Ω(d) is well-defined and r = Ω(dπE ) for any strictly convex Ω, Eq.(25) is equal to n Ω(d) Ω(dπE ) Ω(dπE ), d dπE S A o = min d D DS A Ω (d||dπE ), where the equality comes from the definition of Bregman divergence. Proof of t(s, a; πd) = [ Ω(d)](s, a). For simpler notation, we use matrix-vector notation for the proof when discrete state and action spaces S = {1, ..., |S|} and A = {1, ..., |A|} are considered. For a normalized visitation distribution d D, let us define ds a := d(s, a), s S, a A, ds := [ds 1, ..., ds |A|]T RA, s S, D := [d1, ...,d|S|]T = d1 1 d1 |A| ... ... ... d|S| 1 d|S| |A| π(x) := x 1 T Ax = 1 P x1, ..., x|A| T RA,x := x1, ..., x|A| T RA, where 1 A = [1, ..., 1]T RA is an |A|-dimensional all-one vector. By using these notations, the original Ωcan be rewritten as s,a ds aΩ( π(ds)) = X s S 1 T AdsΩ( π(ds)). Published as a conference paper at ICLR 2021 The gradient of Ωw.r.t. 
D (using denominator-layout notation) is D Ω(D) = Ω(D) d1 , ..., Ω(D) where each element of D Ω(D) satisfies ds 1 , ..., Ω(D) s S 1 T AdsΩ( π(ds)) = Ω( π(ds))1 A + 1 T Ads Ω( π(ds)) = Ω( π(ds))1 A + 1 T Ads π(ds) ds Ω( π(ds)) π(ds) . (26) ds = π1(ds) ds , ..., π|A|(ds) ds a 1 T Ads = ds a ds (1 T Ads) 1 + ds a (1 T Ads) 1 Note that each element of πa(ds) ds satisfies ds a = ds a ds a (1 T Ads) 1 + ds a (1 T Ads) 1 ds a = I{a = a }(1 T Ads) 1 ds a(1 T Ads) 2 = I{a = a }(1 T Ads) 1 πa(ds)(1 T Ads) 1 = (1 T Ads) 1 [I{a = a } πa(ds)] , ds = (1 T Ads) 1 I A A 1 A[ π(ds)]T . (27) By substituting Eq.(27) into Eq.(26), we have ds = Ω( π(ds))1 A + 1 T Ads π(ds) ds Ω( π(ds)) = Ω( π(ds))1 A + 1 T Ads (1 T Ads) 1 I A A 1 A[ π(ds)]T Ω( π(ds)) = Ω( π(ds))1 A + I A A 1 A[ π(ds)]T Ω( π(ds)) = Ω( π(ds))1 A + Ω( π(ds)) π(ds) [ π(ds)]T Ω( π(ds)) = Ω( π(ds)) π(ds) [ π(ds)]T Ω( π(ds)) π(ds) 1 A + Ω( π(ds))1 A. (28) If we use the function notation, Eq.(28) can be written as [ Ω(d)](s, a) = Ω( πd( |s))(a) Ea πd( |s) [ Ω( πd( |s))(a )] + Ω( πd( |s)) = t(s, a; πd) for t of Eq.(7) in Lemma 1. Published as a conference paper at ICLR 2021 H DERIVATION OF BREGMAN-DIVERGENCE-BASED MEASURE IN CONTINUOUS CONTROLS In Eq.(17), the Bregman divergence in the control task is defined as DA ω (P1||P2) := Z X {ω(p1(x)) ω(p2(x)) ω (p2(x))(p1(x) p2(x))} dx. (29) Note that we consider Ω(p) = R X ω(p(x))dx = R X [ fφ(p(x))] dx for fφ(x) = xφ(x), which makes Eq.(29) equal to Z p1(x)φ(p1(x)) + p2(x)φ(p2(x)) + f φ(p2(x))(p1(x) p2(x)) dx X p1(x) f φ(p2(x)) φ(p1(x)) dx Z X p2(x) f φ(p2(x)) φ(p2(x)) dx = Ex p1 f φ(p2(x)) φ(p1(x)) Ex p2 f φ(p2(x)) φ(p2(x)) . Thus, by considering a learning agent s policy πs = π( |s), expert policy πs E = πE( |s), and the objective in Eq.(8) characterized by the Bregman divergence, we can think of the following measure between expert and agent policies: Es dπ DA Ω(πs||πs = Es dπ h Ea πs f φ(πs E(a)) φ(πs(a)) Ea πs E f φ(πs E(a)) i . (30) I TSALLIS ENTROPY AND ASSOCIATED BREGMAN DIVERGENCE AMONG MULTI-VARIATE GAUSSIAN DISTRIBUTIONS Based on the derivation in Nielsen & Nock (2011), we derive the Tsallis entropy and associated Bremgan divergence as follows. We first consider the distributions in the exponential family exp ( θ, t(x) F(θ) + k(x)) . (31) Note that for t(x) = x xx T 4θT 1 θ 1 2 θ1 + 1 2 log | πθ 1 2 | = 1 2µT Σ 1µ + 1 2 log(2π)d|Σ|, we can recover the multi-variate Gaussian distribution (Nielsen & Nock, 2011): exp( θ, t(x) F(θ) + k(x)) (32) = exp µT Σ 1x 1 2tr(Σ 1xx T ) 1 2 log(2π)d|Σ| (33) = 1 (2π)d/2|Σ|1/2 exp µT Σ 1x 1 2x T Σ 1x 1 2µT Σ 1µ (34) = 1 (2π)d/2|Σ|1/2 exp 1 2(x µ)T Σ 1(x µ) . (35) For two distributions with k(x) = 0, π(x) = exp( θ, t(x) F(θ)), ˆπ(x) = exp( ˆθ, t(x) F(ˆθ)) Published as a conference paper at ICLR 2021 that share t, F, and k, it can be shown that I(π, ˆπ; α, β) = Z π(x)αˆπ(x)βdx = exp F(αθ + βˆθ) αF(θ) βF(ˆθ) Z π(x)αˆπ(x)βdx = Z exp α θ, t(x) αF(θ) + β ˆθ, t(x) βF(ˆθ) dx = Z exp αθ + βˆθ, t(x) F(αθ + βˆθ) exp F(αθ + βˆθ) αF(θ) βF(ˆθ) dx = exp F(αθ + βˆθ) αF(θ) βF(ˆθ) Z exp αθ + βˆθ, t(x) F(αθ + βˆθ) dx = exp F(αθ + βˆθ) αF(θ) βF(ˆθ) . 
I.1 TSALLIS ENTROPY For φ(x) = k q 1(1 xq 1) and k = 1, the Tsallis entropy of π can be written as Tq(π) := Ex πφ(x) = Z π(x)1 π(x)q 1 = 1 R π(x)qdx q 1 = 1 q 1 (1 I(π, π; q, 0)) = 1 exp (F(qθ) q F(θ)) If π is a multivariate Gaussian distribution, we have 2µT Σ 1µ + 1 2 log(2π)d|Σ| 1 2µT Σ 1µ + q 2 log(2π)d|Σ|, F(qθ) q F(θ) = 1 q 2 log(2π)d|Σ| 1 2 log 2π + 1 2 log |Σ| d log q For Σ = diag{σ2 1, ..., σ2 d}, we have F(qθ) q F(θ) = (1 q) d 2 log 2π + 1 2 log |Σ| d log q ( d 2 log 2π + 1 i=1 σ2 i d log q 2 + log σi log q 2(1 q) Published as a conference paper at ICLR 2021 I.2 TRACTABLE FORM OF BASELINE For φ(x) = k q 1(1 xq 1), we have f φ(x) = k q 1(1 qxq 1) = k q 1(q qxq 1 (q 1)) = qk q 1(1 xq 1) k Therefore, the baseline can be rewritten as Ex π[ f φ(x) + φ(x)] = Ex π[k qφ(x) + φ(x)] = (1 q)Tq(π) + k. For a multivariate Gaussian distribution π, the tractable form of Ex π[ f φ(x)+φ(x)] can be derived by using that of Tsallis entropy Tq(π) of π. I.3 BREGMAN DIVERGENCE WITH TSALLIS ENTROPY REGULARIZATION In Eq.(30), we consider the following form of the Bregman divergence: Z π(x){f φ(ˆπ(x)) φ(π(x))}dx Z ˆπ(x){f φ(ˆπ(x)) φ(ˆπ(x))}dx. For φ(x) = k q 1(1 xq 1), f φ(x) = k q 1(1 qxq 1) = qφ(x) k, and k = 1, the above form is equal to Z π(x) 1 qˆπ(x)q 1 dx Tq(π) (q 1)Tq(ˆπ) + 1 = 1 q 1 q q 1 Z π(x)ˆπ(x)q 1dx Tq(π) (q 1)Tq(ˆπ) + 1 = q q 1 q q 1 Z π(x)ˆπ(x)q 1dx Tq(π) (q 1)Tq(ˆπ). For multivariate Gaussians π(x) = N(x; µ, Σ), µ = [ν1, ..., νd]T , Σ = diag(σ2 1, ..., σ2 d), ˆπ(x) = N(x; ˆµ, ˆΣ), ˆµ = [ˆν1, ..., ˆνd]T , ˆΣ = diag(ˆσ2 1, ..., ˆσ2 d), we have Z π(x)ˆπ(x)q 1dx = I(π, ˆπ; 1, q 1) = exp F(θ ) F(θ) (q 1)F(ˆθ) , ˆθ = ˆΣ 1ˆµ 1 θ = θ + (q 1)ˆθ = Σ 1µ + (q 1)ˆΣ 1ˆµ 1 2(Σ 1 + (q 1)ˆΣ 1) σ2 1 + (q 1) ˆν1 ˆσ2 1 , ..., νd σ2 d + (q 1) ˆνd σ2 1 + (q 1) 1 ˆσ2 1 , ..., 1 σ2 d + (q 1) 1 Published as a conference paper at ICLR 2021 2µT Σ 1µ + 1 2 log(2π)d|Σ| = ν2 i 2σ2 i + log 2π 2 ˆµT ˆΣ 1ˆµ + 1 2 log(2π)d|ˆΣ| = ˆν2 i 2ˆσ2 i + log 2π 2 + log ˆσi F(θ + (q 1)ˆθ) = 1 4(θ 1)T (θ 2) 1θ 1 + 1 2 log | π(θ 2) 1| νi σ2 i + (q 1) ˆνi 1 σ2 i + (q 1) 1 ˆσ2 i + log 2π 2 + log 1 1 σ2 i + (q 1) 1 Published as a conference paper at ICLR 2021 J EXPERIMENT SETTING J.1 POLICY REGULARIZERS IN EXPERIMENTS Table 1: Policy regularizers φ and their corresponding fφ (Yang et al., 2019). reg. type. condition φ(x) f φ(x) Shannon - log x log x 1 Tsallis k > 0, q > 1 k q 1 (1 xq 1) k q 1 (1 qxq 1) Exp k 0, q 1 q xkqx q xkqx(k + 1 + x log q) Cos 0 < θ π/2 cos(θx) cos(θ) cos(θ) + cos(θx) θx sin(θx) Sin 0 < θ π/2 sin(θ) sin(θx) sin(θ) sin(θx) θx cos(θx) J.2 DENSITY-BASED MODEL By exploiting the knowledge on the reward in Corollary 1 f φ(π(a|s)) Ea π( |s)[f φ(π(a |s)) φ(π(a |s))], we consider the density-based model (DBM) which is defined by rθ(s, a) f φ(πθ1(a|s)) + Bθ2(s) for θ = (θ1, θ2). Here, f φ is a function that can be known priorly, and πθ1 is a neural network which is defined separately from the policy neural network πψ. The DBM for discrete control problems are depicted in Figure 6 (Left). The model outputs rewards over all actions in parallel, where softmax is used for πθ1( |s) and f φ is elementwisely applied to those softmax outputs followed by elementwisely adding Bθ2(s). For continuous control (Figure 6, Right), we use the network architecture similar to that in discrete control, where a multivariate Gaussian distribution is used instead of a softmax layer. 
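A minimal PyTorch sketch of the discrete-control DBM described above is given below; layer sizes and names are ours, and the Shannon case f'_φ(x) = log x + 1 (with λ = 1) is used as the default regularizer.

```python
import torch
import torch.nn as nn

class DensityBasedReward(nn.Module):
    """Sketch of the density-based model (DBM) for discrete control: a softmax head pi_theta1
    and a state-dependent bias B_theta2, combined as f'_phi(pi_theta1(a|s)) + B_theta2(s)."""
    def __init__(self, obs_dim, n_actions, hidden=64,
                 fprime=lambda p: torch.log(p) + 1.0):   # Shannon case: f'_phi(x) = log x + 1
        super().__init__()
        self.pi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))
        self.b = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.fprime = fprime

    def forward(self, s):
        # Rewards for all actions in parallel: softmax output -> f'_phi elementwise -> add B(s).
        return self.fprime(self.pi(s)) + self.b(s)       # shape: (batch, n_actions)

rewards = DensityBasedReward(obs_dim=2, n_actions=4)(torch.randn(8, 2))
```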
Figure 6: Density-based model for discrete (Left) and continuous control (Right) J.3 EXPERT IN BERMUDA WORLD ENVIRONMENT We assume a stochastic expert defined by πE(a|s) = P3 t=1(d(t)) 1I{a = Proj(θ(t))} P3 t=1(d(t)) 1 , θ(t) = arctan2( y(t) y, x(t) x), d(t) = s(t) s 4 2 + ϵ, t = 1, 2, 3, Published as a conference paper at ICLR 2021 for s = (x, y), s(1) = ( x(1), y(1)) = ( 5, 10), s(2) = ( x(2), y(2)) = (0, 10), s(3) = ( x(3), y(3)) = (5, 10), ϵ = 10 4 and an operator Proj(θ) : R A that maps θ to the closest angle in A. In Figure 7, we depicted the expert policy. Figure 7: Visualization of the expert policy J.4 MUJOCO EXPERIMENT SETTING Instead of directly using Mu Jo Co environments with tanh-squashed policies proposed in Soft Actor Critic (SAC) (Haarnoja et al., 2018), we move tanh to a part of the environment named hyperbolized environments in short and assume Gaussian policies. Specifically, after an action a is sampled from the policies, we pass tanh(a) to the environment. We then consider multi-variate Gaussian policy π( |s) = N (µ(s),Σ(s)) with µ(s) = [µ1(s), ..., µd(s)]T , Σ(s) = diag{(σ1(s))2, ..., (σd(s))2}, where arctanh(0.99) µi(s) arctanh(0.99), log(0.01) log σi(s) log(2) for all i = 1, ..., d. Instead of using clipping, we use tanh-activated outputs and scale them to fit in the above ranges, which empirically improves the performance. Also, instead of using potential-based reward shaping used in AIRL (Fu et al., 2018), we update the moving mean of intermediate reward values and update the value network with mean-subtracted rewards so that the value network gets approximately mean-zero reward to stabilize the RL part of RAIRL. Note that this is motivated by Lemma 2 from which we can guarantee that any constant shift of reward functions does not change optimality. Published as a conference paper at ICLR 2021 J.5 HYPERPARAMETERS Table 2, Table 3 and Table 4 list the parameters used in our Bandit, Bermuda World, and Mu Jo Co experiments, respectively. Table 2: Hyperparameters for Bandit environments. Hyper-parameter Bandit Batch size 500 Initial exploration steps 10,000 Replay size 500,000 Target update rate (τ) 0.0005 Learning rate 0.0005 λ 5 q (Tsallis entropy T k q ) 2.0 k (Tsallis entropy T k q ) 1.0 Number of trajectories 1,000 Reward learning rate 0.0005 Steps per update 50 Total environment steps 500,000 Table 3: Hyperparameters for Bermuda World environment. Hyper-parameter Bermuda World Batch size 500 Initial exploration steps 10,000 Replay size 500,000 Target update rate (τ) 0.0005 Learning rate 0.0005 q (Tsallis entropy T k q ) 2.0 k (Tsallis entropy T k q ) 1.0 Number of trajectories 1,000 Reward learning rate 0.0005 (For evaluation) λ 1 (For evaluation) Learning rate 0.001 (For evaluation) Target update rate (τ) 0.0005 Steps per update 50 Number of steps 500,000 Table 4: Hyperparameters for Mu Jo Co environments. Hyper-parameter Hopper Walker2d Half Cheetah Ant Batch size 256 256 256 256 Initial exploration steps 10,000 10,000 10,000 10,000 Replay size 1,000,000 1,000,000 1,000,000 1,000,000 Target update rate (τ) 0.005 0.005 0.005 0.005 Learning rate 0.001 0.001 0.001 0.001 λ 0.0001 0.000001 0.0001 0.000001 k (Tsallis entropy T k q ) 1.0 1.0 1.0 1.0 Number of trajectories 100 100 100 100 Reward learning rate 0.001 0.001 0.001 0.001 Steps per update 1 1 1 1 Number of steps 1,000,000 1,000,000 1,000,000 2,000,000
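For concreteness, the hyperbolized environments of Appendix J.4 can be sketched as a standard Gym action wrapper; the class below is our illustration, not the released code.

```python
import numpy as np
import gym

class HyperbolizedEnv(gym.ActionWrapper):
    """Appendix J.4: tanh is treated as part of the environment, so the agent keeps an
    unsquashed Gaussian policy and the wrapper squashes the action before stepping."""
    def action(self, a):
        return np.tanh(a)

# Usage (requires MuJoCo): env = HyperbolizedEnv(gym.make("Hopper-v2"))
```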