# Multi-Agent Adversarial Inverse Reinforcement Learning

Lantao Yu, Jiaming Song, Stefano Ermon
Department of Computer Science, Stanford University, Stanford, CA 94305, USA

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

## Abstract

Reinforcement learning agents are prone to undesired behaviors due to reward mis-specification. Finding a set of reward functions to properly guide agent behaviors is particularly challenging in multi-agent scenarios. Inverse reinforcement learning provides a framework to automatically acquire suitable reward functions from expert demonstrations. Its extension to multi-agent settings, however, is difficult due to the more complex notions of rational behaviors. In this paper, we propose MA-AIRL, a new framework for multi-agent inverse reinforcement learning, which is effective and scalable for Markov games with high-dimensional state-action space and unknown dynamics. We derive our algorithm based on a new solution concept and maximum pseudolikelihood estimation within an adversarial reward learning framework. In the experiments, we demonstrate that MA-AIRL can recover reward functions that are highly correlated with the ground truth ones, and significantly outperforms prior methods in terms of policy imitation.

## 1. Introduction

Reinforcement learning (RL) is a general and powerful framework for decision making under uncertainty. Recent advances in deep learning have enabled a variety of RL applications such as games (Silver et al., 2016; Mnih et al., 2015), robotics (Gu et al., 2017; Levine et al., 2016), automated machine learning (Zoph & Le, 2016) and generative modeling (Yu et al., 2017). RL algorithms are also showing promise in multi-agent systems, where multiple agents interact with each other, such as multi-player games (Peng et al., 2017), social interactions (Leibo et al., 2017) and multi-robot control systems (Matignon et al., 2012).

However, the success of RL crucially depends on careful reward design (Amodei et al., 2016). As reinforcement learning agents are prone to undesired behaviors due to reward mis-specification (Amodei & Clark, 2016), designing suitable reward functions can be challenging in many real-world applications (Hadfield-Menell et al., 2017). In multi-agent systems, since different agents may have completely different goals and state-action representations, hand-tuning reward functions becomes increasingly challenging as we take more agents into consideration.

Imitation learning presents a direct approach to programming agents with expert demonstrations, where agents learn to produce behaviors similar to the demonstrations. However, imitation learning algorithms, such as behavior cloning (Pomerleau, 1991) and generative adversarial imitation learning (Ho & Ermon, 2016; Ho et al., 2016), typically sidestep the problem of inferring an explicit representation for the underlying reward functions. Because the reward function is often considered the most succinct, robust and transferable representation of a task (Abbeel & Ng, 2004; Fu et al., 2017), it is important to consider the problem of inferring reward functions from expert demonstrations, which we refer to as inverse reinforcement learning (IRL).
IRL can offer many advantages compared to direct policy imitation, such as analyzing and debugging an imitation learning algorithm, inferring agents' intentions and re-optimizing rewards in new environments (Ng et al., 2000). However, IRL is ill-defined, as many policies can be optimal for a given reward and many reward functions can explain a set of demonstrations. Maximum entropy inverse reinforcement learning (MaxEnt IRL) (Ziebart et al., 2008) provides a general probabilistic framework to resolve this ambiguity by finding the trajectory distribution with maximum entropy that matches the reward expectation of the experts. As MaxEnt IRL requires solving an integral over all possible trajectories to compute the partition function, it is only suitable for small-scale problems with known dynamics. Adversarial IRL (Finn et al., 2016a; Fu et al., 2017) scales MaxEnt IRL to large and continuous problems by drawing an analogy between a sampling-based approximation of MaxEnt IRL and Generative Adversarial Networks (Goodfellow et al., 2014) with a particular discriminator structure. However, the approach is restricted to single-agent settings.

In this paper, we consider the IRL problem in multi-agent environments with high-dimensional continuous state-action space and unknown dynamics. Generalizing MaxEnt IRL and Adversarial IRL to multi-agent systems is challenging. Since each agent's optimal policy depends on the other agents' policies, the notion of optimality, central to Markov decision processes, has to be replaced by an appropriate equilibrium solution concept. Nash equilibrium (Hu et al., 1998) is the most popular solution concept for multi-agent RL, where each agent's policy is the best response to the others'. However, Nash equilibrium is incompatible with MaxEnt RL in the sense that it assumes the agents never take sub-optimal actions. Thus imitation learning and inverse reinforcement learning methods based on Nash equilibrium or correlated equilibrium (Aumann, 1974) might lack the ability to handle irrational (or computationally bounded) experts.

In this paper, inspired by logistic quantal response equilibrium (McKelvey & Palfrey, 1995; 1998) and Gibbs sampling (Hastings, 1970), we propose a new solution concept termed logistic stochastic best response equilibrium (LSBRE), which allows us to characterize the trajectory distribution induced by parameterized reward functions and to handle the bounded rationality of expert demonstrations in a principled manner. Specifically, by uncovering the close relationship between LSBRE and MaxEnt RL, and bridging the optimization of joint likelihood and conditional likelihood with maximum pseudolikelihood estimation, we propose Multi-Agent Adversarial Inverse Reinforcement Learning (MA-AIRL), a novel MaxEnt IRL framework for Markov games. MA-AIRL is effective and scalable to large, high-dimensional Markov games with unknown dynamics, which are not amenable to previous methods relying on tabular representations and linear or quadratic programming (Natarajan et al., 2010; Waugh et al., 2013; Lin et al., 2014; 2018). We experimentally demonstrate that MA-AIRL is able to recover reward functions that are highly correlated with the ground truth rewards, while simultaneously learning policies that significantly outperform state-of-the-art multi-agent imitation learning algorithms (Song et al., 2018) in mixed cooperative and competitive tasks (Lowe et al., 2017).

## 2. Preliminaries
### 2.1. Markov Games

Markov games (Littman, 1994) are generalizations of Markov decision processes (MDPs) to the case of $N$ interacting agents. A Markov game $(S, \mathbf{A}, P, \eta, \mathbf{r})$ is defined via a set of states $S$ and $N$ sets of actions $\{A_i\}_{i=1}^N$. The function $P : S \times A_1 \times \cdots \times A_N \to \mathcal{P}(S)$ describes the (stochastic) transition process between states, where $\mathcal{P}(S)$ denotes the set of probability distributions over the set $S$. Given that we are in state $s_t$ at time $t$ and the agents take actions $(a_1, \ldots, a_N)$, the state transitions to $s_{t+1}$ with probability $P(s_{t+1}|s_t, a_1, \ldots, a_N)$. Each agent $i$ obtains a (bounded) reward given by a function $r_i : S \times A_1 \times \cdots \times A_N \to \mathbb{R}$. The function $\eta \in \mathcal{P}(S)$ specifies the distribution of the initial state. We use bold variables without subscript $i$ to denote the concatenation of all variables for all agents (e.g., $\boldsymbol{\pi}$ denotes the joint policy, $\mathbf{r}$ denotes all rewards and $\mathbf{a}$ denotes the actions of all agents in a multi-agent setting). We use subscript $-i$ to denote all agents except agent $i$. For example, $(a_i, a_{-i})$ represents $(a_1, \ldots, a_N)$, the actions of all $N$ agents. The objective of each agent $i$ is to maximize its own expected return (i.e., the expected sum of discounted rewards) $\mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=1}^{T} \gamma^t r_{i,t}\right]$, where $\gamma$ is the discount factor and $r_{i,t}$ is the reward received $t$ steps into the future. Each agent achieves its own objective by selecting actions through a stochastic policy $\pi_i : S \to \mathcal{P}(A_i)$. Depending on the context, the policies can be Markovian (i.e., depend only on the state) or require additional coordination signals. For each agent $i$, we further define the expected return for a state-action pair as:

$$\mathrm{ExpRet}_i^{\pi_i, \pi_{-i}}(s_t, \mathbf{a}_t) = \mathbb{E}_{s_{t+1:T}, \mathbf{a}_{t+1:T}}\left[\sum_{l \ge t} \gamma^{l-t} r_i(s_l, \mathbf{a}_l) \,\middle|\, s_t, \mathbf{a}_t, \boldsymbol{\pi}\right]$$

### 2.2. Solution Concepts for Markov Games

A correlated equilibrium (CE) for a Markov game (Ziebart et al., 2011) is a joint strategy profile where no agent can achieve higher expected reward by unilaterally changing its own policy. CE, first introduced by Aumann (1974; 1987), is a more general solution concept than the well-known Nash equilibrium (NE) (Hu et al., 1998), which further requires the agents' actions in each state to be independent, i.e., $\boldsymbol{\pi}(\mathbf{a}|s) = \prod_{i=1}^N \pi_i(a_i|s)$. It has been shown that many decentralized, adaptive strategies will converge to CE instead of a more restrictive equilibrium such as NE (Gordon et al., 2008; Hart & Mas-Colell, 2000). To take bounded rationality into consideration, McKelvey & Palfrey (1995; 1998) further propose logistic quantal response equilibrium (LQRE) as a stochastic generalization of NE and CE.

**Definition 1.** A logistic quantal response equilibrium for a Markov game corresponds to any strategy profile satisfying a set of constraints, where for each state and action, the constraint is given by:

$$\pi_i(a_i|s) = \frac{\exp\left(\lambda\, \mathrm{ExpRet}_i^{\boldsymbol{\pi}}(s, a_i, a_{-i})\right)}{\sum_{a_i'} \exp\left(\lambda\, \mathrm{ExpRet}_i^{\boldsymbol{\pi}}(s, a_i', a_{-i})\right)}$$

Intuitively, in LQRE, agents choose actions with higher expected return with higher probability.

### 2.3. Learning from Expert Demonstrations

Suppose we do not have access to the ground truth reward signal $\mathbf{r}$, but have demonstrations $\mathcal{D}$ provided by an expert ($N$ expert agents in Markov games). $\mathcal{D}$ is a set of trajectories $\{\tau_j\}_{j=1}^M$, where $\tau_j = \{(s_j^t, \mathbf{a}_j^t)\}_{t=1}^T$ is an expert trajectory collected by sampling $s_1 \sim \eta(s)$, $\mathbf{a}_t \sim \boldsymbol{\pi}_E(\mathbf{a}_t|s_t)$, $s_{t+1} \sim P(s_{t+1}|s_t, \mathbf{a}_t)$. $\mathcal{D}$ contains the entire supervision to the learning algorithm, i.e., we assume we cannot ask for additional interactions with the experts during training.
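For concreteness, the following is a minimal sketch (not code from the paper) of how such demonstration trajectories could be collected, assuming a hypothetical gym-style multi-agent environment with `reset()`/`step()` methods and per-agent stochastic expert policies.

```python
import numpy as np

def collect_demonstrations(env, expert_policies, num_episodes=200, horizon=50):
    """Roll out joint expert policies in a Markov game to build D = {tau_j}.

    Assumed (hypothetical) interfaces: env.reset() returns an initial state
    s_1 ~ eta(s); env.step(actions) returns (next_state, rewards, done);
    expert_policies[i](state) samples an action for agent i from pi_E_i(.|s).
    """
    demonstrations = []
    for _ in range(num_episodes):
        state = env.reset()                                  # s_1 ~ eta(s)
        trajectory = []
        for _ in range(horizon):
            # each agent samples its action from its own stochastic policy
            actions = [pi(state) for pi in expert_policies]  # a_t ~ pi_E(.|s_t)
            next_state, rewards, done = env.step(actions)    # s_{t+1} ~ P(.|s_t, a_t)
            trajectory.append((state, actions))
            state = next_state
            if done:
                break
        demonstrations.append(trajectory)
    return demonstrations
```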
Given $\mathcal{D}$, imitation learning (IL) aims to directly learn policies that behave similarly to these demonstrations, whereas inverse reinforcement learning (IRL) (Russell, 1998; Ng et al., 2000) seeks to infer the underlying reward functions which induce the expert policies. The MaxEnt IRL framework (Ziebart et al., 2008) aims to recover a reward function that rationalizes the expert behaviors with the least commitment, denoted as $\mathrm{IRL}(\pi_E)$:

$$\mathrm{IRL}(\pi_E) = \arg\max_{r \in \mathbb{R}^{S \times A}} \mathbb{E}_{\pi_E}[r(s, a)] - \mathrm{RL}(r)$$

$$\mathrm{RL}(r) = \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_{\pi}[r(s, a)]$$

where $H(\pi) = \mathbb{E}_{\pi}[-\log \pi(a|s)]$ is the policy entropy. However, MaxEnt IRL is generally considered less efficient and scalable than direct imitation, as we need to solve a forward RL problem in the inner loop. In the context of imitation learning, Ho & Ermon (2016) proposed to use generative adversarial training (Goodfellow et al., 2014) to learn the policies characterized by $\mathrm{RL} \circ \mathrm{IRL}(\pi_E)$ directly, leading to the Generative Adversarial Imitation Learning (GAIL) algorithm:

$$\min_{\theta} \max_{\omega} \mathbb{E}_{\pi_E}[\log D_\omega(s, a)] + \mathbb{E}_{\pi_\theta}[\log(1 - D_\omega(s, a))]$$

where $D_\omega$ is a discriminator that classifies expert and policy trajectories, and $\pi_\theta$ is the parameterized policy that tries to maximize its score under $D_\omega$. According to Goodfellow et al. (2014), with infinite data and infinite computation, at optimality, the distribution of generated state-action pairs should exactly match the distribution of demonstrated state-action pairs under the GAIL objective. The downside to this approach, however, is that we bypass the intermediate step of recovering rewards. Specifically, note that we cannot extract reward functions from the discriminator, as $D_\omega(s, a)$ will converge to 0.5 for all $(s, a)$ pairs.

### 2.4. Adversarial Inverse Reinforcement Learning

Besides resolving the ambiguity that many optimal policies can explain a set of demonstrations, another advantage of MaxEnt IRL is that it can be interpreted as solving the following maximum likelihood estimation (MLE) problem:

$$p_\omega(\tau) \propto \eta(s_1) \prod_{t=1}^{T} P(s_{t+1}|s_t, a_t) \exp\left(\sum_{t=1}^{T} r_\omega(s_t, a_t)\right) \tag{1}$$

$$\max_{\omega} \mathbb{E}_{\pi_E}[\log p_\omega(\tau)] = \max_{\omega} \mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^{T} r_\omega(s_t, a_t)\right] - \log Z_\omega$$

Here, $\omega$ are the parameters of the reward function and $Z_\omega$ is the partition function, i.e., an integral over all possible trajectories consistent with the environment dynamics. $Z_\omega$ is intractable to compute when the state-action spaces are large or continuous, and the environment dynamics are unknown. Combining Guided Cost Learning (GCL) (Finn et al., 2016b) and generative adversarial training, Finn et al. (2016a) and Fu et al. (2017) proposed the adversarial IRL framework as an efficient sampling-based approximation to MaxEnt IRL, where the discriminator takes a particular form:

$$D_\omega(s, a) = \frac{\exp(f_\omega(s, a))}{\exp(f_\omega(s, a)) + q(a|s)}$$

where $f_\omega(s, a)$ is the learned function, $q(a|s)$ is the probability of the adaptive sampler pre-computed as an input to the discriminator, and the policy is trained to maximize $\log D - \log(1 - D)$. To alleviate the reward shaping ambiguity (Ng et al., 1999), where many reward functions can explain an optimal policy, Fu et al. (2017) further restricted $f$ to a reward estimator $g_\omega$ and a potential shaping function $h_\phi$:

$$f_{\omega,\phi}(s, a, s') = g_\omega(s, a) + \gamma h_\phi(s') - h_\phi(s)$$

It has been shown that under suitable assumptions, $g_\omega$ and $h_\phi$ will recover the true reward and value function up to a constant.
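The particular discriminator structure above can be illustrated with a small sketch (not the authors' implementation); `f_value` and `log_q` are assumed scalar inputs holding $f_\omega(s,a)$ and $\log q(a|s)$ for a single state-action pair.

```python
import numpy as np

def airl_discriminator(f_value, log_q):
    """D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + q(a|s)), computed in log space.

    f_value: learned function f_w(s, a)
    log_q:   log-probability log q(a|s) of the adaptive sampler
    Returns D(s, a) in (0, 1).
    """
    # log D = f - logsumexp(f, log q); a numerically stable form of the ratio
    m = max(f_value, log_q)
    log_denominator = m + np.log(np.exp(f_value - m) + np.exp(log_q - m))
    return np.exp(f_value - log_denominator)

def airl_policy_reward(f_value, log_q):
    """Policy training signal log D - log(1 - D), which simplifies to f - log q."""
    return f_value - log_q
```

Note that the policy training signal simplifies to $f_\omega(s, a) - \log q(a|s)$, so the adaptive sampler is effectively performing entropy-regularized RL on the learned $f_\omega$.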
### 3.1. Logistic Stochastic Best Response Equilibrium

To extend MaxEnt IRL to Markov games, we need to be able to characterize the trajectory distribution induced by a set of (parameterized) reward functions $\{r_i(s, \mathbf{a})\}_{i=1}^N$ (analogous to Equation (1)). However, existing optimality notions introduced in Section 2.2 do not explicitly define a tractable joint strategy profile that we can use to maximize the likelihood of expert demonstrations (as a function of the rewards); they do so implicitly as the solution to a set of constraints. Motivated by Gibbs sampling (Hastings, 1970), dependency networks (Heckerman et al., 2000), best-response dynamics (Nisan et al., 2011; Gandhi, 2012) and LQRE, we propose a new solution concept that allows us to characterize rational (joint) policies induced from a set of reward functions. Intuitively, our solution concept corresponds to the result of repeatedly applying a stochastic (entropy-regularized) best response mechanism, where each agent (in turn) attempts to optimize its actions while keeping the other agents' actions fixed.

To begin with, let us first consider a stateless single-shot normal-form game with $N$ players and a reward function $r_i : A_1 \times \cdots \times A_N \to \mathbb{R}$ for each player $i$. We consider the following Markov chain over $(A_1 \times \cdots \times A_N)$, where the state of the Markov chain at step $k$ is denoted $\mathbf{z}^{(k)} = (z_1, \ldots, z_N)^{(k)}$, with each random variable $z_i^{(k)}$ taking values in $A_i$. The transition kernel of the Markov chain is defined by the following equations:

$$z_i^{(k+1)} \sim P_i\left(a_i \,\middle|\, a_{-i} = z_{-i}^{(k)}\right) = \frac{\exp\left(\lambda\, r_i(a_i, z_{-i}^{(k)})\right)}{\sum_{a_i'} \exp\left(\lambda\, r_i(a_i', z_{-i}^{(k)})\right)} \tag{2}$$

and each agent $i$ is updated in scan order. Given all other players' actions $z_{-i}^{(k)}$, the $i$-th player picks an action proportionally to $\exp(\lambda r_i(a_i, z_{-i}^{(k)}))$, where $\lambda > 0$ is a parameter that controls the level of rationality of the agents. For $\lambda \to 0$, the agent will select actions uniformly at random, while for $\lambda \to \infty$, the agent will select actions greedily (best response). Because the Markov chain is ergodic, it admits a unique stationary distribution, which we denote $\boldsymbol{\pi}(\mathbf{a})$. Interpreting this stationary distribution over $(A_1 \times \cdots \times A_N)$ as a policy, we call this stationary joint policy a logistic stochastic best response equilibrium for normal-form games.

Now let us generalize the solution concept to Markov games. For each agent $i$, let $\{\pi_i^t\}_{t=1}^T$ denote a set of time-dependent policies. First we define the state-action value function for each agent $i$, starting from the base case:

$$Q_i^{\boldsymbol{\pi}}(s_T, a_i^T, a_{-i}^T) = r_i(s_T, a_i^T, a_{-i}^T)$$

then we recursively define:

$$Q_i^{\pi^{t+1:T}}(s_t, a_i^t, a_{-i}^t) = r_i(s_t, a_i^t, a_{-i}^t) + \mathbb{E}_{s_{t+1} \sim P(\cdot|s_t, \mathbf{a}_t)}\left[\mathcal{H}\left(\pi_i^{t+1}(\cdot|s_{t+1})\right) + \mathbb{E}_{\mathbf{a}_{t+1} \sim \pi^{t+1}(\cdot|s_{t+1})}\left[Q_i^{\pi^{t+2:T}}(s_{t+1}, \mathbf{a}_{t+1})\right]\right]$$

which generalizes the standard state-action value function in single-agent RL ($a_{-i} = \emptyset$ when $N = 1$).

**Definition 2.** Given a Markov game with horizon $T$, the logistic stochastic best response equilibrium (LSBRE) is a sequence of $T$ stochastic policies $\{\pi^t\}_{t=1}^T$ constructed by the following process. Consider $T$ Markov chains over $(A_1 \times \cdots \times A_N)^{|S|}$, where the state of the $t$-th Markov chain at step $k$ is $\{z_i^{t,(k)} : S \to A_i\}_{i=1}^N$, with each random variable $z_i^{t,(k)}(s)$ taking values in $A_i$. For $t \in [T, \ldots, 1]$, we recursively define the stationary joint distribution $\pi^t(\mathbf{a}|s)$ of the $t$-th Markov chain in terms of $\{\pi^\ell\}_{\ell=t+1}^T$ as follows. For $s_t \in S$ and $i \in [1, \ldots, N]$, we update the state of the Markov chain as:

$$z_i^{t,(k+1)}(s_t) \sim P_i^t\left(a_i^t \,\middle|\, a_{-i}^t = z_{-i}^{t,(k)}(s_t), s_t\right) = \frac{\exp\left(\lambda\, Q_i^{\pi^{t+1:T}}\left(s_t, a_i^t, z_{-i}^{t,(k)}(s_t)\right)\right)}{\sum_{a_i'} \exp\left(\lambda\, Q_i^{\pi^{t+1:T}}\left(s_t, a_i', z_{-i}^{t,(k)}(s_t)\right)\right)} \tag{3}$$

where the parameter $\lambda \in \mathbb{R}^+$ controls the level of rationality of the agents, and $\{P_i^t\}_{i=1}^N$ specifies a set of conditional distributions. LSBRE for a Markov game is the sequence of $T$ joint stochastic policies $\{\pi^t\}_{t=1}^T$.
Each joint policy $\pi^t : S \to \mathcal{P}(A_1 \times \cdots \times A_N)$ is given by:

$$\pi^t(a_1, \ldots, a_N|s_t) = \mathbb{P}\left(\bigcap_{i}\left\{z_i^{t,(\infty)}(s_t) = a_i\right\}\right) \tag{4}$$

where the probability is taken with respect to the unique stationary distribution of the $t$-th Markov chain.

When the set of conditionals in Equation (2) are compatible (in the sense that each conditional can be inferred from the same joint distribution (Arnold & Press, 1989)), the above process corresponds to a Gibbs sampler, which will converge to a stationary joint distribution $\boldsymbol{\pi}(\mathbf{a})$ whose conditional distributions are consistent with the ones used during sampling, namely Equation (2). This is the case, for example, if the agents are cooperative, i.e., they share the same reward function. In general, $\boldsymbol{\pi}(\mathbf{a})$ is the distribution specified by the dependency network (Heckerman et al., 2000) defined via the conditionals in Equation (2). The same argument can be made for the Markov chains in Definition 2 with respect to the conditionals in Equation (3). When the sets of conditionals in Equations (2) and (3) are incompatible, the procedure is called a pseudo Gibbs sampler. As discussed in the literature on dependency networks (Heckerman et al., 2000; Chen et al., 2011; Chen & Ip, 2015), when the conditionals are learned from a sufficiently large dataset, the pseudo Gibbs sampler asymptotically works well in the sense that the conditionals of the stationary joint distribution are nearly consistent with the conditionals used during sampling. Under some conditions, theoretical bounds on the approximation can be obtained (Heckerman et al., 2000).

### 3.2. Trajectory Distributions Induced by LSBRE

Following (Fu et al., 2017; Levine, 2018), without loss of generality, in the remainder of this paper we consider the case where $\lambda = 1$. First, we note that there is a connection between the notion of LSBRE and maximum causal entropy reinforcement learning (Ziebart, 2010). Specifically, we can characterize the trajectory distribution induced by LSBRE policies with an energy-based formulation, where the probability of a trajectory increases exponentially as the sum of rewards increases. Formally, with LSBRE policies, the probability of generating a certain trajectory can be characterized by the following theorem:

**Theorem 1.** Given a joint policy $\{\pi^t(\mathbf{a}_t|s_t)\}_{t=1}^T$ specified by LSBRE, for each agent $i$, let $\{\pi_{-i}^t(a_{-i}^t|s_t)\}_{t=1}^T$ denote the other agents' marginal distributions and $\{\pi_i^t(a_i^t|a_{-i}^t, s_t)\}_{t=1}^T$ denote agent $i$'s conditional distributions, both obtained from the LSBRE joint policies. Then the LSBRE conditional distributions are the optimal solution to the following optimization problem:

$$\min_{\hat{\pi}^{1:T}} D_{\mathrm{KL}}\left(\hat{p}(\tau) \,\|\, p(\tau)\right) \tag{5}$$

$$\hat{p}(\tau) = \eta(s_1) \prod_{t=1}^{T} P(s_{t+1}|s_t, \mathbf{a}_t)\, \pi_{-i}^t(a_{-i}^t|s_t)\, \hat{\pi}_i^t(a_i^t|a_{-i}^t, s_t)$$

$$p(\tau) \propto \eta(s_1) \prod_{t=1}^{T} P(s_{t+1}|s_t, \mathbf{a}_t)\, \pi_{-i}^t(a_{-i}^t|s_t)\, \exp\left(\sum_{t=1}^{T} r_i(s_t, a_i^t, a_{-i}^t)\right) \tag{6}$$

*Proof.* See Appendix A.1.

Intuitively, for single-shot normal-form games, the above statement follows directly from the definition in Equation (2). For Markov games, similar to the process introduced in Definition 2, we can employ a dynamic programming algorithm to find the conditional policies which minimize Equation (5). Specifically, we first construct the base case of $t = T$ as a normal-form game, then recursively construct the conditional policy for each time step $t$, based on the policies from $t+1$ to $T$ that have already been constructed. It can be shown that the constructed optimal policy, which minimizes the KL divergence between its trajectory distribution and the trajectory distribution defined in Equation (6), corresponds to the set of conditional policies in LSBRE.
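To make the stochastic best-response construction concrete, the following is a minimal sketch (assumed payoff tensors, not the authors' code) of the sampler in Equation (2) for a normal-form game; the Markov-game case in Equation (3) is analogous, with the payoffs replaced by the soft state-action values defined above.

```python
import numpy as np

def lsbre_sample_normal_form(rewards, num_agents, num_actions, lam=1.0,
                             num_sweeps=1000, rng=None):
    """(Pseudo) Gibbs sampler for Equation (2) in a normal-form game.

    rewards[i] is assumed to be an ndarray of shape (num_actions,) * num_agents
    giving agent i's payoff for each joint action. Returns one joint action
    approximately drawn from the LSBRE joint policy.
    """
    rng = rng or np.random.default_rng()
    # initialize each agent's action uniformly at random
    z = [int(rng.integers(num_actions)) for _ in range(num_agents)]
    for _ in range(num_sweeps):
        for i in range(num_agents):                # update agents in scan order
            logits = np.empty(num_actions)
            for a_i in range(num_actions):
                joint = tuple(z[:i] + [a_i] + z[i + 1:])      # (a_i, z_{-i})
                logits[a_i] = lam * rewards[i][joint]
            probs = np.exp(logits - logits.max())             # softmax over a_i
            probs /= probs.sum()
            z[i] = int(rng.choice(num_actions, p=probs))      # z_i^{(k+1)} ~ P_i
    return tuple(z)
```

Drawing many such samples (or reading off the empirical frequencies of the chain after burn-in) approximates the stationary joint policy $\boldsymbol{\pi}(\mathbf{a})$.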
### 3.3. Multi-Agent Adversarial IRL

In the remainder of this paper, we assume that the expert policies form a unique LSBRE under some unknown (parameterized) reward functions, according to Definition 2. By adopting LSBRE as the optimality notion, we are able to rationalize the demonstrations by maximizing the likelihood of the expert trajectories with respect to the LSBRE stationary distribution, which is in turn induced by the $\boldsymbol{\omega}$-parameterized reward functions $\{r_i(s, \mathbf{a}; \omega_i)\}_{i=1}^N$. The probability of a trajectory $\tau = \{s_t, \mathbf{a}_t\}_{t=1}^T$ generated by LSBRE policies in a Markov game is defined by the following generative process:

$$p(\tau) = \eta(s_1) \prod_{t=1}^{T} \pi^t(\mathbf{a}_t|s_t; \boldsymbol{\omega}) \prod_{t=1}^{T} P(s_{t+1}|s_t, \mathbf{a}_t) \tag{7}$$

where $\pi^t(\mathbf{a}_t|s_t; \boldsymbol{\omega})$ are the unique stationary joint distributions for the LSBRE induced by $\{r_i(s, \mathbf{a}; \omega_i)\}_{i=1}^N$. The initial state distribution $\eta(s_1)$ and transition dynamics $P(s_{t+1}|s_t, \mathbf{a}_t)$ are specified by the Markov game. As mentioned in Section 2.4, the MaxEnt IRL framework interprets finding suitable reward functions as maximum likelihood over the expert trajectories in the distribution defined in Equation (7), which can be reduced to:

$$\max_{\boldsymbol{\omega}} \mathbb{E}_{\tau \sim \boldsymbol{\pi}_E}\left[\sum_{t=1}^{T} \log \pi^t(\mathbf{a}_t|s_t; \boldsymbol{\omega})\right] \tag{8}$$

since the initial state distribution and transition dynamics do not depend on the parameterized rewards. Note that $\pi^t(\mathbf{a}_t|s_t)$ in Equation (8) is the joint policy defined in Equation (4), whose conditional distributions are given by Equation (3). From Section 3.1, we know that given a set of $\boldsymbol{\omega}$-parameterized reward functions, we are able to characterize the conditional policies $\{\pi_i^t(a_i^t|a_{-i}^t, s_t)\}_{t=1}^T$ for each agent $i$. However, direct optimization over the joint MLE objective in Equation (8) is intractable, as we cannot obtain a closed form for the stationary joint policy. Fortunately, we are able to construct an asymptotically consistent estimator by approximating the joint likelihood $\pi^t(\mathbf{a}_t|s_t)$ with a product of the conditionals $\prod_{i=1}^{N} \pi_i^t(a_i^t|a_{-i}^t, s_t)$, which is termed a pseudolikelihood (Besag, 1975). With the asymptotic consistency property of maximum pseudolikelihood estimation (Besag, 1975; Lehmann & Casella, 2006), we have the following theorem:

**Theorem 2.** Let demonstrations $\tau_1, \ldots, \tau_M$ be independent and identically distributed (sampled from the LSBRE induced by some unknown reward functions), and suppose that for all $t \in [1, \ldots, T]$ and $a_i^t \in A_i$, $\pi_i^t(a_i^t|a_{-i}^t, s_t; \omega_i)$ is differentiable with respect to $\omega_i$. Then, with probability tending to 1 as $M \to \infty$, the equation

$$\frac{\partial}{\partial \boldsymbol{\omega}} \sum_{m=1}^{M} \sum_{t=1}^{T} \sum_{i=1}^{N} \log \pi_i^t\left(a_i^{m,t} \,\middle|\, a_{-i}^{m,t}, s^{m,t}; \omega_i\right) = 0 \tag{9}$$

has a root $\hat{\boldsymbol{\omega}}_M$ such that $\hat{\boldsymbol{\omega}}_M$ tends to the maximizer of the joint likelihood in Equation (8).

*Proof.* See Appendix A.2.

Theorem 2 bridges the gap between optimizing the joint likelihood and each conditional likelihood. Now we are able to maximize the objective in Equation (8) as:

$$\max_{\boldsymbol{\omega}} \mathbb{E}_{\tau \sim \boldsymbol{\pi}_E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \log \pi_i^t\left(a_i^t \,\middle|\, a_{-i}^t, s_t; \omega_i\right)\right] \tag{10}$$

To optimize the maximum pseudolikelihood objective in Equation (10), we can instead optimize the following surrogate loss, which is a variational approximation to the pseudolikelihood objective (from Theorem 1):

$$\max_{\boldsymbol{\omega}} \mathbb{E}_{\tau \sim \boldsymbol{\pi}_E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} r_i(s_t, \mathbf{a}_t; \omega_i)\right] - \sum_{i=1}^{N} \log Z_{\omega_i}$$

where $Z_{\omega_i}$ is the partition function of the distribution in Equation (6). It is generally intractable to exactly compute and optimize the partition function $Z_{\omega}$, which involves an integral over all trajectories.
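As a sketch of how this quantity can be estimated (an illustrative identity, not an equation from the paper), importance sampling with a proposal distribution $q_\theta$ over trajectories gives

$$Z_{\omega_i} = \mathbb{E}_{\tau \sim q_\theta}\left[\frac{p_0(\tau)\, \exp\left(\sum_{t=1}^{T} r_i(s_t, \mathbf{a}_t; \omega_i)\right)}{q_\theta(\tau)}\right] \approx \frac{1}{K} \sum_{k=1}^{K} \frac{p_0(\tau_k)\, \exp\left(\sum_{t=1}^{T} r_i(s_t^k, \mathbf{a}_t^k; \omega_i)\right)}{q_\theta(\tau_k)}, \qquad \tau_k \sim q_\theta$$

where $p_0(\tau)$ denotes the reward-independent factors of Equation (6) (the initial state distribution, the dynamics, and the other agents' policies). The variance of this estimate is small when $q_\theta$ is close to the target distribution, which is what motivates training an adaptive sampler.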
Similar to GCL (Finn et al., 2016b) and single-agent AIRL (Fu et al., 2017), we employ importance sampling to estimate the partition function with adaptive samplers $q_\theta$. Now we are ready to introduce our practical Multi-Agent Adversarial IRL (MA-AIRL) framework, where we train the $\boldsymbol{\omega}$-parameterized discriminators as:

$$\max_{\boldsymbol{\omega}} \mathbb{E}_{\boldsymbol{\pi}_E}\left[\sum_{i=1}^{N} \log \frac{\exp(f_{\omega_i}(s, \mathbf{a}))}{\exp(f_{\omega_i}(s, \mathbf{a})) + q_{\theta_i}(a_i|s)}\right] + \mathbb{E}_{\boldsymbol{\pi}_\theta}\left[\sum_{i=1}^{N} \log \frac{q_{\theta_i}(a_i|s)}{\exp(f_{\omega_i}(s, \mathbf{a})) + q_{\theta_i}(a_i|s)}\right] \tag{11}$$

and we train the $\boldsymbol{\theta}$-parameterized generators as:

$$\max_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\pi}_\theta}\left[\sum_{i=1}^{N} \log D_{\omega_i}(s, \mathbf{a}) - \log(1 - D_{\omega_i}(s, \mathbf{a}))\right] = \mathbb{E}_{\boldsymbol{\pi}_\theta}\left[\sum_{i=1}^{N} f_{\omega_i}(s, \mathbf{a}) - \log q_{\theta_i}(a_i|s)\right] \tag{12}$$

Specifically, for each agent $i$, we have a discriminator with the particular structure $\frac{\exp(f_\omega)}{\exp(f_\omega) + q_\theta}$ for binary classification, and a generator as an adaptive importance sampler for estimating the partition function. Intuitively, $q_\theta$ is trained to minimize the KL divergence between its trajectory distribution and that induced by the reward functions, in order to reduce the variance of importance sampling, while $f_\omega$ in the discriminator is trained to estimate the reward function. At optimality, $f_\omega$ will approximate the advantage function for the expert policy and $q_\theta$ will approximate the expert policy.

**Algorithm 1** Multi-Agent Adversarial IRL

**Input:** Expert trajectories $\mathcal{D}_E = \{\tau_j^E\}$; Markov game as a black box with parameters $(N, S, \mathbf{A}, \eta, P, \gamma)$.

1. Initialize the parameters of the policies $q$, reward estimators $g$ and potential functions $h$ with $\theta$, $\omega$, $\phi$.
2. **repeat**
3. Sample trajectories $\mathcal{D}_\pi = \{\tau_j\}$ from $\boldsymbol{\pi}$: $s_1 \sim \eta(s)$, $\mathbf{a}_t \sim \boldsymbol{\pi}(\mathbf{a}_t|s_t)$, $s_{t+1} \sim P(s_{t+1}|s_t, \mathbf{a}_t)$.
4. Sample state-action pairs $\mathcal{X}_\pi$, $\mathcal{X}_E$ from $\mathcal{D}_\pi$, $\mathcal{D}_E$.
5. **for** $i = 1, \ldots, N$: update $\omega_i$, $\phi_i$ to increase the objective in Eq. (11): $\mathbb{E}_{\mathcal{X}_E}[\log D(s, a_i, s')] + \mathbb{E}_{\mathcal{X}_\pi}[\log(1 - D(s, a_i, s'))]$.
6. **for** $i = 1, \ldots, N$: update the reward estimates $\hat{r}_i(s, a_i, s')$ with $g_{\omega_i}(s, a_i)$ or $\left(\log D(s, a_i, s') - \log(1 - D(s, a_i, s'))\right)$; update $\theta_i$ with respect to $\hat{r}_i(s, a_i, s')$.
7. **until** convergence

**Output:** Learned policies $\boldsymbol{\pi}_\theta$ and reward functions $g_\omega$.

### 3.4. Solving Reward Ambiguity in Multi-Agent IRL

For single-agent reinforcement learning, Ng et al. (1999) show that for any state-only potential function $\Phi : S \to \mathbb{R}$, potential-based reward shaping, defined as:

$$r'(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma \Phi(s_{t+1}) - \Phi(s_t)$$

is a necessary and sufficient condition to guarantee invariance of the optimal policy in both finite- and infinite-horizon MDPs. In other words, given a set of expert demonstrations, there is a class of reward functions, all of which can explain the demonstrated expert behaviors. Thus, without further assumptions, it would be impossible to identify the ground-truth reward that induces the expert policy within this class. Similar issues also exist when we consider multi-agent scenarios. Devlin & Kudenko (2011) show that in multi-agent systems, using the same reward shaping for one or more agents will not alter the set of Nash equilibria. It is possible to extend this result to other solution concepts such as CE and LSBRE. For example, in the case of LSBRE, after specifying the level of rationality $\lambda$, for any $\pi_i \neq \pi_i^{\mathrm{LSBRE}}$ we have:

$$\mathbb{E}_{\pi_i^{\mathrm{LSBRE}}, \pi_{-i}^{\mathrm{LSBRE}}}[r_i(s, \mathbf{a})] \ge \mathbb{E}_{\pi_i, \pi_{-i}^{\mathrm{LSBRE}}}[r_i(s, \mathbf{a})] \tag{13}$$

since each individual LSBRE conditional policy is the optimal solution to the corresponding entropy-regularized RL problem (see Appendix A.1). It can also be shown that any policy that satisfies the inequality in Equation (13) will still satisfy the inequality after reward shaping (Devlin & Kudenko, 2011).
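To see why such shaping invariance holds (a standard argument sketched here for completeness, not a derivation taken from the paper), note that along any trajectory the shaping terms telescope:

$$\sum_{t=1}^{T} \gamma^t\left[r_i(s_t, \mathbf{a}_t, s_{t+1}) + \gamma \Phi(s_{t+1}) - \Phi(s_t)\right] = \sum_{t=1}^{T} \gamma^t\, r_i(s_t, \mathbf{a}_t, s_{t+1}) + \gamma^{T+1} \Phi(s_{T+1}) - \gamma \Phi(s_1)$$

The initial-state term does not depend on agent $i$'s policy, and the terminal term vanishes for infinite-horizon discounted problems with bounded $\Phi$ (or when episodes end in a fixed absorbing state), so shaping adds the same offset to both sides of Equation (13) in expectation and the inequality is preserved.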
To mitigate the reward shaping effect and recover reward functions with higher linear correlation to the ground-truth reward, as in (Fu et al., 2017), we further assume that the functions $f_{\omega_i}$ in Equation (11) have a specific structure:

$$f_{\omega_i, \phi_i}(s_t, \mathbf{a}_t, s_{t+1}) = g_{\omega_i}(s_t, \mathbf{a}_t) + \gamma h_{\phi_i}(s_{t+1}) - h_{\phi_i}(s_t)$$

where $g_\omega$ is a reward estimator and $h_\phi$ is a potential function. We summarize the MA-AIRL training procedure in Algorithm 1.

## 4. Related Work

A vast number of methods and paradigms have been proposed for single-agent imitation learning and inverse reinforcement learning. However, multi-agent scenarios are less commonly investigated, and most existing works assume specific reward structures. These include fully cooperative games (Barrett et al., 2017; Le et al., 2017; Šošić et al., 2017; Bogert & Doshi, 2014), two-player zero-sum games (Lin et al., 2014), and rewards as linear combinations of pre-specified features (Reddy et al., 2012; Waugh et al., 2013). Recently, Song et al. (2018) proposed MA-GAIL, a multi-agent extension of GAIL which works on general Markov games. While both MA-AIRL and MA-GAIL are based on adversarial training, the methods are inherently different. MA-GAIL is based on the notion of Nash equilibrium, and is motivated via a specific choice of Lagrange multipliers for a constrained optimization problem. MA-AIRL, on the other hand, is derived from MaxEnt RL and LSBRE, and aims to obtain an MLE solution for the joint trajectories; we connect this with a set of conditionals via pseudolikelihood, which are then solved with the adversarial reward learning framework. From a reward learning perspective, the discriminators' outputs in MA-GAIL will converge to an uninformative uniform distribution, while MA-AIRL allows us to recover reward functions from the optimal discriminators.

## 5. Experiments

We seek to answer the following questions via empirical evaluation: (1) Can MA-AIRL efficiently recover the expert policies for each individual agent from the expert demonstrations (policy imitation)? (2) Can MA-AIRL effectively recover the underlying reward functions, for which the expert policies form an LSBRE (reward recovery)?

**Task Description.** To answer these questions, we evaluate our MA-AIRL algorithm on a series of simulated particle environments (Lowe et al., 2017). Specifically, we consider the following scenarios: cooperative navigation, where three agents cooperate through physical actions to reach three landmarks; cooperative communication, where two agents, a speaker and a listener, cooperate to navigate to a particular landmark; and competitive keep-away, where one agent tries to reach a target landmark, while an adversary, without knowing the target a priori, tries to infer the target from the agent's behaviors and prevent it from reaching the goal through physical interactions.

In our experiments, for generality, the learning algorithms will not leverage any prior knowledge on the types of interactions (cooperative or competitive). Thus, for all the tasks described above, the learning algorithms will take a decentralized form and we will not utilize additional reward regularization, besides penalizing the $\ell_2$ norm of the reward parameters to mitigate overfitting (Ziebart, 2010; Kalakrishnan et al., 2013).

**Training Procedure.** In the simulated environments, we have access to the ground-truth reward functions, which enables us to accurately evaluate the quality of both the recovered policies and the recovered reward functions.
We use a multi-agent version of ACKTR (Wu et al., 2017; Song et al., 2018), an efficient model-free policy gradient algorithm, for training the experts as well as the adaptive samplers in MA-AIRL. The supervision signals for the experts come from the ground-truth rewards, while the reward signals for the adaptive samplers come from the discriminators. Specifically, we first obtain expert policies induced by the ground-truth rewards, then we use them to generate demonstrations, from which the learning algorithms try to recover the policies as well as the underlying reward functions. We compare MA-AIRL against the state-of-the-art multi-agent imitation learning algorithm, MA-GAIL (Song et al., 2018), which is a generalization of GAIL to Markov games. Following (Li et al., 2017; Song et al., 2018), we use behavior cloning to pretrain MA-AIRL and MA-GAIL to reduce the sample complexity of exploration, and we use 200 episodes of expert demonstrations, each with 50 time steps, which is close to the number of time steps used in (Ho & Ermon, 2016). The codebase for this work can be found at https://github.com/ermongroup/MA-AIRL.

### 5.1. Policy Imitation

Although MA-GAIL achieved superior performance compared with behavior cloning (Song et al., 2018), it only aims to recover policies via distribution matching. Moreover, the training signal for the policy becomes less informative as training progresses; according to Goodfellow et al. (2014), with infinite data and computational resources the discriminator outputs will converge to 0.5 for all state-action pairs, which could potentially hinder the robustness of the policy towards the end of training. To empirically verify our claims, we compare the quality of the learned policies in terms of the expected return received by each agent. In the cooperative environment, we directly use the ground-truth rewards from the environment as the oracle metric, since all agents share the same reward. In the competitive environment, we follow the evaluation procedure in (Song et al., 2018), where we place the experts and learned policies in the same environment. A learned policy is considered better if it receives a higher expected return while its opponent receives a lower expected return. The results for the cooperative and competitive environments are shown in Tables 1 and 2 respectively. MA-AIRL consistently performs better than MA-GAIL in terms of the received reward in all the considered environments, suggesting superior capabilities in imitating the experts.

Table 1. Expected returns in cooperative tasks. Mean and variance are taken across different random seeds used to train the policies.

| Algorithm | Nav. ExpRet | Comm. ExpRet |
|---|---|---|
| Expert | -43.195 (2.659) | -12.712 (1.613) |
| Random | -391.314 (10.092) | -125.825 (3.4906) |
| MA-GAIL | -52.810 (2.981) | -12.811 (1.604) |
| MA-AIRL | -47.515 (2.549) | -12.727 (1.557) |

Table 2. Expected returns of the agents in the competitive task. Agent #1 represents the agent trying to reach the target and Agent #2 represents the adversary. Mean and variance are taken across different random seeds.

| Agent #1 | Agent #2 | Agent #1 ExpRet |
|---|---|---|
| Expert | Expert | -6.804 (0.316) |
| MA-GAIL | Expert | -6.978 (0.305) |
| MA-AIRL | Expert | -6.785 (0.312) |
| Expert | MA-GAIL | -6.919 (0.298) |
| Expert | MA-AIRL | -7.367 (0.311) |

### 5.2. Reward Recovery

The second question we seek to answer concerns the reward recovery problem, as in inverse reinforcement learning: is the algorithm able to recover the ground-truth reward functions with expert demonstrations being the only source of supervision?
To answer this question, we evaluate the statistical correlations between the ground-truth rewards (which the learning algorithms have no access to) and the inferred rewards for the same state-action pairs. Specifically, we consider two types of statistical correlation: Pearson's correlation coefficient (PCC), which measures the linear correlation between two random variables, and Spearman's rank correlation coefficient (SCC), which measures the statistical dependence between the rankings of two random variables. A higher SCC suggests that two reward functions have a stronger monotonic relationship, and a higher PCC suggests a stronger linear correlation. For each trajectory, we compare the ground-truth return from the environment with the supervision signals from the discriminators, which correspond to $g_\omega$ in MA-AIRL and $\log(D_\omega)$ in MA-GAIL.

Tables 3 and 4 provide the SCC and PCC statistics for the cooperative and competitive environments respectively.

Table 3. Statistical correlations between the learned reward functions and the ground-truth rewards in cooperative tasks. Mean and variance are taken across $N$ independently learned reward functions for $N$ agents.

| Task | Metric | MA-GAIL | MA-AIRL |
|---|---|---|---|
| Nav. | SCC | 0.792 (0.085) | 0.934 (0.015) |
| Nav. | PCC | 0.556 (0.081) | 0.882 (0.028) |
| Comm. | SCC | 0.879 (0.059) | 0.936 (0.080) |
| Comm. | PCC | 0.612 (0.093) | 0.848 (0.099) |

Table 4. Statistical correlations between the learned reward functions and the ground-truth rewards in the competitive task.

| Metric | MA-GAIL | MA-AIRL |
|---|---|---|
| SCC #1 | 0.424 | 0.534 |
| SCC #2 | 0.653 | 0.907 |
| Average SCC | 0.538 | 0.721 |
| PCC #1 | 0.497 | 0.720 |
| PCC #2 | 0.392 | 0.667 |
| Average PCC | 0.445 | 0.694 |

In the cooperative case, compared to MA-GAIL, MA-AIRL achieves a much higher PCC and SCC, which could facilitate policy learning. The statistical correlations between the reward signals gathered from the discriminators for each agent are also quite high, suggesting that although we do not reveal that the agents are cooperative, MA-AIRL is able to discover high correlations between the agents' reward functions. In the competitive case, the reward functions learned by MA-AIRL also significantly outperform those learned by MA-GAIL in terms of SCC and PCC statistics.

In Figure 1, we further show the evolution of the PCC statistics with respect to training time steps for MA-GAIL and MA-AIRL. The reward functions recovered by MA-GAIL initially have a high correlation with the ground truth, yet that correlation dramatically decreases as training continues, whereas the functions learned by MA-AIRL maintain a high correlation throughout training, which is in line with the theoretical analysis that in MA-GAIL, reward signals from the discriminators become less informative towards convergence.

*Figure 1. PCC with respect to the training epochs in cooperative navigation, with MA-AIRL (blue) and MA-GAIL (orange).*
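The correlation metrics reported above can be computed per agent with standard statistics routines; the following is a minimal sketch with assumed array inputs (not the paper's evaluation code).

```python
import numpy as np
from scipy import stats

def reward_correlations(gt_returns, learned_returns):
    """Compute PCC and SCC between ground-truth and recovered per-trajectory returns.

    gt_returns, learned_returns: 1-D arrays of equal length, where entry j holds the
    ground-truth return and the discriminator-derived return (e.g., summed g_w
    along trajectory j) for one agent.
    """
    gt = np.asarray(gt_returns, dtype=float)
    learned = np.asarray(learned_returns, dtype=float)
    pcc, _ = stats.pearsonr(gt, learned)    # linear correlation
    scc, _ = stats.spearmanr(gt, learned)   # rank (monotonic) correlation
    return pcc, scc
```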
## 6. Discussion and Future Work

We propose MA-AIRL, the first multi-agent MaxEnt IRL framework that is effective and scalable to Markov games with high-dimensional state-action spaces and unknown dynamics. We derive our algorithm based on a solution concept termed LSBRE, and we employ maximum pseudolikelihood estimation to achieve tractability. Experimental results demonstrate that MA-AIRL is able to imitate expert behaviors in high-dimensional complex environments, as well as learn reward functions that are highly correlated with the ground-truth rewards. An exciting avenue for future work is to include reward regularization to mitigate overfitting and to leverage prior knowledge of the task structure.