# Multi-Agent Generative Adversarial Imitation Learning

Jiaming Song (Stanford University, tsong@cs.stanford.edu)
Hongyu Ren (Stanford University, hyren@cs.stanford.edu)
Dorsa Sadigh (Stanford University, dorsa@cs.stanford.edu)
Stefano Ermon (Stanford University, ermon@cs.stanford.edu)

**Abstract.** Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments. We propose a new framework for multi-agent imitation learning in general Markov games, building upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.

## 1 Introduction

Reinforcement learning (RL) methods are becoming increasingly successful at optimizing reward signals in complex, high-dimensional environments [1]. A key limitation of RL, however, is the difficulty of designing suitable reward functions for complex and not well-specified tasks [2, 3]. If the reward function does not cover all important aspects of the task, the agent can easily learn undesirable behaviors [4]. This problem is further exacerbated in multi-agent scenarios, such as multiplayer games [5], multi-robot control [6], and social interactions [7]; in these cases, agents do not necessarily share the same reward function and might even have conflicting rewards.

Imitation learning methods address these problems via expert demonstrations [8-11]: the agent directly learns desirable behaviors by imitating an expert. Notably, inverse reinforcement learning (IRL) frameworks assume that the expert is (approximately) optimizing an underlying reward function and attempt to recover a reward function that rationalizes the demonstrations; an agent policy is subsequently learned through RL [12, 13]. Unfortunately, this paradigm is not suitable for general multi-agent settings, due to the environment being non-stationary from the perspective of any individual agent [14] and the existence of multiple equilibrium solutions [15]. The optimal policy of one agent can depend on the policies of other agents, and vice versa, so there can exist multiple solutions in which each agent's policy is the optimal response to the others'.

In this paper, we propose a new framework for multi-agent imitation learning: provided with demonstrations of a set of experts interacting with each other in the same environment, we aim to learn multiple parametrized policies that imitate the behavior of each expert respectively. Using the framework of Markov games, we integrate multi-agent RL with a suitable extension of multi-agent inverse RL. The resulting procedure strictly generalizes Generative Adversarial Imitation Learning (GAIL, [16]) in the single-agent case. Imitation learning in our setting corresponds to a two-player game between a generator and a discriminator. The generator controls the policies of all the agents in a distributed way, and the discriminator contains a classifier for each agent that is trained to distinguish that agent's behavior from that of the corresponding expert.
Upon training, the behaviors produced by the policies should be indistinguishable from the training data. We can incorporate prior knowledge into the discriminators, including the presence of cooperative or competitive agents. In addition, we propose a novel multi-agent natural policy gradient algorithm that addresses the issue of high-variance gradient estimates commonly observed in reinforcement learning [14, 17]. Empirical results demonstrate that our method can imitate complex behaviors in high-dimensional environments, such as particle environments and cooperative robotic control tasks, with multiple cooperative or competitive agents; the imitated behaviors are close to the expert behaviors with respect to the true reward functions, which the agents do not have access to during training.

## 2 Preliminaries

### 2.1 Markov games

We consider an extension of Markov decision processes (MDPs) called Markov games [18]. A Markov game (MG) for $N$ agents is defined via a set of states $\mathcal{S}$ and $N$ sets of actions $\{\mathcal{A}_i\}_{i=1}^{N}$. The function $P : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathcal{P}(\mathcal{S})$ describes the (stochastic) transition process between states, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the set $\mathcal{S}$. Given that we are in state $s_t$ at time $t$, the agents take actions $(a_1, \ldots, a_N)$ and the state transitions to $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_1, \ldots, a_N)$. Each agent $i$ obtains a (bounded) reward given by a function $r_i : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$. Each agent $i$ aims to maximize its own total expected return $R_i = \sum_{t=0}^{\infty} \gamma^t r_{i,t}$, where $\gamma$ is the discount factor, by selecting actions through a (stationary and Markovian) stochastic policy $\pi_i : \mathcal{S} \times \mathcal{A}_i \to [0, 1]$. The initial states are determined by a distribution $\eta : \mathcal{S} \to [0, 1]$.

The joint policy is defined as $\pi(a \mid s) = \prod_{i=1}^{N} \pi_i(a_i \mid s)$, where we use bold variables without subscript $i$ to denote the concatenation of all variables for all agents (e.g., $\pi$ denotes the joint policy $\prod_{i=1}^{N} \pi_i$ in a multi-agent setting, $r$ denotes all rewards, and $a$ denotes the actions of all agents). We use expectation with respect to a policy $\pi$ to denote an expectation with respect to the trajectories it generates, and use subscript $-i$ to denote all agents except for $i$. For example, $(a_i, a_{-i})$ represents $(a_1, \ldots, a_N)$, the actions of all $N$ agents.

### 2.2 Reinforcement learning and Nash equilibrium

In reinforcement learning (RL), the goal of each agent is to maximize its total expected return $\mathbb{E}_\pi[r(s, a)]$ given access to the reward signal $r$. In single-agent RL, an optimal Markovian policy exists, but the optimal policy might not be unique (e.g., all policies are optimal for an identically zero reward; see [19], Chapter 3.8). An entropy regularizer can be introduced to resolve this ambiguity. The optimal policy is found via the following RL procedure:

$$\mathrm{RL}(r) = \arg\max_{\pi \in \Pi}\; H(\pi) + \mathbb{E}_\pi[r(s, a)], \tag{1}$$

where $H(\pi)$ is the $\gamma$-discounted causal entropy [20] of policy $\pi \in \Pi$.

**Definition 1** ($\gamma$-discounted causal entropy). The $\gamma$-discounted causal entropy of a policy $\pi$ is defined as

$$H(\pi) \triangleq \mathbb{E}_\pi[-\log \pi(a \mid s)] = \mathbb{E}_{s_t, a_t \sim \pi}\left[-\sum_{t=0}^{\infty} \gamma^t \log \pi(a_t \mid s_t)\right].$$

The addition of $H(\pi)$ in (1) resolves this ambiguity: the policy with both the highest reward and the highest entropy (in the rest of the paper, we use the term entropy to denote the $\gamma$-discounted causal entropy of a policy) is unique because the entropy function is strictly concave with respect to $\pi$. In Markov games, however, the optimal policy of an agent depends on the other agents' policies. One approach is to use an equilibrium solution concept, such as Nash equilibrium [15].
Informally, a set of policies $\{\pi_i\}_{i=1}^{N}$ is a Nash equilibrium if no agent can achieve a higher reward by unilaterally changing its policy, i.e.,

$$\forall i \in [N],\ \forall \hat{\pi}_i \neq \pi_i:\quad \mathbb{E}_{\pi_i, \pi_{-i}}[r_i] \geq \mathbb{E}_{\hat{\pi}_i, \pi_{-i}}[r_i].$$

The process of finding a Nash equilibrium can be defined as a constrained optimization problem ([21], Theorem 3.7.2):

$$\min_{\pi \in \Pi,\, v \in \mathbb{R}^{\mathcal{S} \times N}} f_r(\pi, v) = \sum_{i=1}^{N} \sum_{s \in \mathcal{S}} \left( v_i(s) - \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}[q_i(s, a_i)] \right)$$
$$\text{s.t.} \quad v_i(s) \geq q_i(s, a_i), \qquad q_i(s, a_i) \triangleq \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot \mid s)}\left[ r_i(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, v_i(s') \right] \quad \forall i \in [N],\ s \in \mathcal{S},\ a_i \in \mathcal{A}_i, \tag{3}$$

where $a \triangleq (a_i, a_{-i}) \triangleq (a_1, \ldots, a_N)$, $v \triangleq [v_1; \ldots; v_N]$, and the joint action $a$ includes actions $a_{-i}$ sampled from $\pi_{-i}$ together with $a_i$. Intuitively, $v$ can be thought of as a value function and $q$ represents the Q-function that corresponds to $v$. The constraints enforce the Nash equilibrium condition: when the constraints are satisfied, $v_i(s) - q_i(s, a_i)$ is non-negative for every $i \in [N]$. Hence $f_r(\pi, v)$ is always non-negative for a feasible $(\pi, v)$. Moreover, this objective has a global minimum of zero if a Nash equilibrium exists, and $\pi$ forms a Nash equilibrium if and only if $f_r(\pi, v)$ reaches zero while being a feasible solution ([22], Theorem 2.4).

### 2.3 Inverse reinforcement learning

Suppose we do not have access to the reward signal $r$, but have demonstrations $\mathcal{D}$ provided by an expert ($N$ expert agents in Markov games). Imitation learning aims to learn policies that behave similarly to these demonstrations. In Markov games, we assume all experts/players operate in the same environment, and the demonstrations $\mathcal{D} = \{(s_j, a_j)\}_{j=1}^{M}$ are collected by sampling $s_0 \sim \eta(s)$, $a_t \sim \pi_E(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$; we assume knowledge of $N$, $\gamma$, $\mathcal{S}$, $\mathcal{A}$, as well as access to the transition process $T$ and $\eta$ as black boxes. We further assume that once we obtain $\mathcal{D}$, we cannot ask for additional expert interactions with the environment (unlike in DAgger [23] or CIRL [24]).

Let us first consider imitation in Markov decision processes (a special case of Markov games) and the framework of single-agent maximum entropy IRL [8, 16], where the goal is to recover a reward function $r$ that rationalizes the expert behavior $\pi_E$:

$$\mathrm{IRL}(\pi_E) = \arg\max_{r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} \mathbb{E}_{\pi_E}[r(s, a)] - \left( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_\pi[r(s, a)] \right).$$

In practice, expectations with respect to $\pi_E$ are evaluated using samples from $\mathcal{D}$. The IRL objective is ill-defined [12, 10], and there are often multiple valid solutions when we consider all $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. To resolve this ambiguity, [16] introduce a convex reward function regularizer $\psi : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}$, which can be used, for example, to restrict rewards to be linear in a pre-determined set of features [16]:

$$\mathrm{IRL}_\psi(\pi_E) = \arg\max_{r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} -\psi(r) + \mathbb{E}_{\pi_E}[r(s, a)] - \left( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_\pi[r(s, a)] \right). \tag{4}$$

### 2.4 Imitation by matching occupancy measures

[16] interprets the imitation learning problem as matching two occupancy measures, i.e., the distributions over states and actions encountered when navigating the environment with a policy. Formally, for a policy $\pi$, the occupancy measure is defined as $\rho_\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$. [16] draws a connection between IRL and occupancy measure matching, showing that the former is a dual of the latter:

**Proposition 1** (Proposition 3.1 in [16]). $\mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E})$.

Here $\psi^*(x) = \sup_y x^\top y - \psi(y)$ is the convex conjugate of $\psi$, which can be interpreted as a measure of similarity between the occupancy measures of the expert's policy and the agent's policy.
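Since $\psi^*$ compares occupancy measures, it is helpful to see how $\rho_\pi$ can be estimated from data. Below is a minimal Monte Carlo sketch for a tabular setting, assuming trajectories sampled by executing the policy; the function name, the `(state, action)` trajectory format, and the `n_states`/`n_actions` arguments are hypothetical illustrations, not part of the paper's implementation.

```python
import numpy as np

def estimate_occupancy_measure(trajectories, n_states, n_actions, gamma=0.99):
    """Monte Carlo estimate of rho_pi(s, a) = E[ sum_t gamma^t 1{s_t = s, a_t = a} ].

    `trajectories` is a list of rollouts, each a list of (state, action) index
    pairs sampled by running the policy in the environment (hypothetical format).
    """
    rho = np.zeros((n_states, n_actions))
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            rho[s, a] += gamma ** t          # discounted visitation count
    return rho / len(trajectories)            # average over sampled rollouts

# Occupancy measures estimated for the agent and for the expert demonstrations
# can then be compared via a convex divergence, which is the role psi* plays
# in Proposition 1.
```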
One instance, $\psi = \psi_{GA}$, gives rise to the Generative Adversarial Imitation Learning (GAIL) method:

$$\psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) = \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}} \mathbb{E}_{\pi_E}[\log D(s, a)] + \mathbb{E}_\pi[\log(1 - D(s, a))]. \tag{5}$$

The resulting imitation learning method from Proposition 1 involves a discriminator (a classifier $D$) competing with a generator (a policy $\pi$). The discriminator attempts to distinguish real vs. synthetic trajectories (produced by $\pi$) by optimizing (5). The generator, on the other hand, aims to perform optimally under the reward function defined by the discriminator, thus fooling the discriminator with synthetic trajectories that are difficult to distinguish from the expert ones.

## 3 Generalizing IRL to Markov games

Extending imitation learning to multi-agent settings is difficult because there are multiple rewards (one for each agent) and the notion of optimality is complicated by the need to consider an equilibrium solution [15]. We use $\mathrm{MARL}(r)$ to denote the set of (stationary and Markovian) policies that form a Nash equilibrium under $r$ and have the maximum $\gamma$-discounted causal entropy (among all equilibria):

$$\mathrm{MARL}(r) = \arg\min_{\pi \in \Pi,\, v \in \mathbb{R}^{\mathcal{S} \times N}} f_r(\pi, v) - H(\pi) \quad \text{s.t.} \quad v_i(s) \geq q_i(s, a_i) \quad \forall i \in [N],\ s \in \mathcal{S},\ a_i \in \mathcal{A}_i, \tag{6}$$

where $q$ is defined as in Equation (3). Our goal is to define a suitable inverse operator MAIRL, in analogy to IRL in Equation (4), which chooses a reward that creates a margin between the expert and every other policy. However, the constraints in the Nash equilibrium optimization (Equation (6)) can make this challenging. To that end, we derive an equivalent Lagrangian formulation of (6), where we move the constraints into the objective function, so that we can define a margin between the expected rewards of two sets of policies that captures their difference.

### 3.1 Equivalent constraints via temporal difference learning

Intuitively, the Nash equilibrium constraints imply that no agent $i$ can improve $\pi_i$ via 1-step temporal difference learning; if the condition in Equation (3) is not satisfied for some $v_i$, $q_i$, and $(s, a_i)$, this would suggest that we can update the policy of agent $i$ and its value function. Based on this notion, we can derive equivalent versions of the constraints corresponding to $t$-step temporal difference (TD) learning.

**Theorem 1.** For a certain policy $\pi$ and reward $r$, let $\hat{v}_i(s; \pi, r)$ be the unique solution to the Bellman equation:

$$\hat{v}_i(s; \pi, r) = \mathbb{E}_{a \sim \pi}\left[ r_i(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \hat{v}_i(s'; \pi, r) \right].$$

Denote by $\hat{q}^{(t)}_i(\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}, a^{(t)}_i; \pi, r)$ the discounted expected return for the $i$-th agent conditioned on visiting the trajectory $\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}$ in the first $t-1$ steps and choosing action $a^{(t)}_i$ at the $t$-th step, when the other agents use policy $\pi_{-i}$:

$$\hat{q}^{(t)}_i(\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}, a^{(t)}_i; \pi, r) = \sum_{j=0}^{t-1} \gamma^j r_i(s^{(j)}, a^{(j)}) + \gamma^t\, \mathbb{E}_{a^{(t)}_{-i} \sim \pi_{-i}}\left[ r_i(s^{(t)}, a^{(t)}) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s^{(t)}, a^{(t)})\, \hat{v}_i(s'; \pi, r) \right].$$

Then $\pi$ is a Nash equilibrium if and only if, for all $t \in \mathbb{N}_+$, $i \in [N]$, $j \in [t]$, $s^{(j)} \in \mathcal{S}$, $a^{(j)}_i \in \mathcal{A}_i$:

$$\hat{v}_i(s^{(0)}; \pi, r) \geq \mathbb{E}_{a_{-i} \sim \pi_{-i}}\left[ \hat{q}^{(t)}_i(\{s^{(j)}, a^{(j)}\}_{j=0}^{t-1}, s^{(t)}, a^{(t)}_i; \pi, r) \right] \triangleq Q^{(t)}_i(\{s^{(j)}, a^{(j)}_i\}_{j=0}^{t}; \pi, r). \tag{7}$$

Intuitively, Theorem 1 states that if we replace the 1-step constraints with $(t+1)$-step constraints, we obtain the same solution as $\mathrm{MARL}(r)$, since $(t+1)$-step TD updates (over one agent at a time) are still stationary with respect to a Nash equilibrium solution. The constraints can thus be unrolled for $t$ steps and rewritten as $\hat{v}_i(s^{(0)}) \geq Q^{(t)}_i(\{s^{(j)}, a^{(j)}_i\}_{j=0}^{t}; \pi, r)$ (corresponding to Equation (7)).
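To make the unrolled constraint concrete, the sampled $t$-step quantity inside Theorem 1 can be sketched as follows. This is a minimal sketch under simplifying assumptions: the helper name and its inputs are hypothetical, and the bootstrapped `last_state_value` stands in for the bracketed expectation over the other agents' actions and the next state.

```python
import numpy as np

def t_step_return(rewards_i, last_state_value, gamma=0.99):
    """Sampled version of the unrolled quantity in Theorem 1:
    sum_{j < t} gamma^j r_i(s_j, a_j) + gamma^t * (bootstrapped value at step t).

    `rewards_i` holds agent i's rewards along a sampled trajectory prefix of
    length t, and `last_state_value` is an estimate of the bracketed term
    r_i(s_t, a_t) + gamma * E[v_i(s_{t+1})] (both hypothetical inputs).
    """
    t = len(rewards_i)
    discounted = sum(gamma ** j * r for j, r in enumerate(rewards_i))
    return discounted + gamma ** t * last_state_value

# The Nash equilibrium constraint of Theorem 1 then reads, in expectation over
# trajectories generated with the other agents' policies:
#     v_i(s_0) >= E[ t_step_return(...) ]
# i.e. no agent can gain by deviating for t steps and then reverting to pi_i.
```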
### 3.2 Multi-agent inverse reinforcement learning

We are now ready to construct the Lagrangian dual of the primal in Equation (6), using the equivalent formulation from Theorem 1. The first observation is that for any policy $\pi$, $f_r(\pi, \hat{v}) = 0$ when $\hat{v}$ is defined as in Theorem 1 (see Lemma 1 in the appendix). Therefore, we only need to consider the unrolled constraints from Theorem 1, obtaining the following dual problem:

$$\max_{\lambda \geq 0} \min_{\pi} L^{(t+1)}_r(\pi, \lambda) = \sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}_i(t)} \lambda(\tau_i) \left( Q^{(t)}_i(\tau_i; \pi, r) - \hat{v}_i(s^{(0)}; \pi, r) \right), \tag{8}$$

where $\mathcal{T}_i(t)$ is the set of all length-$t$ trajectories of the form $\{s^{(j)}, a^{(j)}_i\}_{j=0}^{t}$ with $s^{(0)}$ as the initial state, $\lambda$ is a vector of $N \cdot |\mathcal{T}_i(t)|$ Lagrange multipliers, and $\hat{v}$ is defined as in Theorem 1. This dual formulation is a sum over agents and trajectories, which uniquely corresponds to the constraints in Equation (7). In the following theorem, we show that for a specific choice of $\lambda$ we can recover the difference of the sums of expected rewards between two policies, a performance gap similar to the one used in single-agent IRL in Equation (4). This amounts to relaxing the primal problem.

**Theorem 2.** For any two policies $\pi^*$ and $\pi$, let

$$\lambda^*_\pi(\tau_i) = \eta(s^{(0)})\, \pi_i(a^{(0)}_i \mid s^{(0)}) \prod_{j=1}^{t} \pi_i(a^{(j)}_i \mid s^{(j)}) \sum_{a^{(j-1)}_{-i}} P(s^{(j)} \mid s^{(j-1)}, a^{(j-1)})\, \pi^*_{-i}(a^{(j-1)}_{-i} \mid s^{(j-1)})$$

be the probability of generating the sequence $\tau_i$ using policy $\pi_i$ and $\pi^*_{-i}$. Then

$$\lim_{t \to \infty} L^{(t+1)}_r(\pi^*, \lambda^*_\pi) = \sum_{i=1}^{N} \mathbb{E}_{\pi_i, \pi^*_{-i}}[r_i(s, a)] - \sum_{i=1}^{N} \mathbb{E}_{\pi^*_i, \pi^*_{-i}}[r_i(s, a)], \tag{9}$$

where $L^{(t+1)}_r(\pi^*, \lambda^*_\pi)$ corresponds to the dual function in which the multipliers are the probabilities of generating their respective trajectories of length $t$.

We provide a proof in Appendix A.3. Intuitively, the $\lambda^*_\pi(\tau_i)$ weights correspond to the probability of generating trajectory $\tau_i$ when the policy is $\pi_i$ for agent $i$ and $\pi^*_{-i}$ for the other agents. As $t \to \infty$, the first term on the left-hand side of Equation (9), $\sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}_i(t)} \lambda(\tau_i) Q^{(t)}_i(\tau_i)$, converges to the expected total reward $\mathbb{E}_{\pi_i, \pi^*_{-i}}[r_i]$, which is the first term on the right-hand side. The marginal of $\lambda$ over the initial state is the initial state distribution, so the second term on the left-hand side, $\sum_{s} \hat{v}_i(s)\, \eta(s)$, converges to $\mathbb{E}_{\pi^*_i, \pi^*_{-i}}[r_i]$, which is the second term on the right-hand side. Thus, the two sides of Equation (9) coincide as $t \to \infty$. We can also view the right-hand side of Equation (9) as the case where the policies $\pi^*_{-i}$ are part of the environment.

Theorem 2 motivates the following definition of multi-agent IRL with regularizer $\psi$:

$$\mathrm{MAIRL}_\psi(\pi_E) = \arg\max_{r} -\psi(r) + \sum_{i=1}^{N} \mathbb{E}_{\pi_E}[r_i] - \sum_{i=1}^{N} \max_{\pi_i}\left( \beta H_i(\pi_i) + \mathbb{E}_{\pi_i, \pi_{E_{-i}}}[r_i] \right),$$

where $H_i(\pi_i) = \mathbb{E}_{\pi_i, \pi_{E_{-i}}}[-\log \pi_i(a_i \mid s)]$ is the discounted causal entropy of policy $\pi_i$ when the other agents follow $\pi_{E_{-i}}$, and $\beta$ is a hyper-parameter controlling the strength of the entropy regularization term as in [16]. This formulation is a strict generalization of the single-agent IRL in [16].

**Corollary 2.1.** If $N = 1$ and $\beta = 1$, then $\mathrm{MAIRL}_\psi(\pi_E) = \mathrm{IRL}_\psi(\pi_E)$.

Furthermore, if the regularization $\psi$ is additively separable and, for each agent $i$, $\pi_{E_i}$ is the unique optimal response to the other experts $\pi_{E_{-i}}$, we obtain the following:

**Theorem 3.** Assume that $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$, that $\psi_i$ is convex for each $i \in [N]$, and that $\mathrm{MARL}(r)$ has a unique solution² for all $r \in \mathrm{MAIRL}_\psi(\pi_E)$. Then

$$\mathrm{MARL} \circ \mathrm{MAIRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} \sum_{i=1}^{N} -\beta H_i(\pi_i) + \psi^*_i(\rho_{\pi_i, E_{-i}} - \rho_{\pi_E}), \tag{10}$$

where $\pi_{i, E_{-i}}$ denotes $\pi_i$ for agent $i$ and $\pi_{E_{-i}}$ for the other agents.

²The set of Nash equilibria is not always convex, so we have to assume $\mathrm{MARL}(r)$ returns a unique solution.
The above theorem suggests that $\psi$-regularized multi-agent inverse reinforcement learning is seeking, for each agent $i$, a policy $\pi_i$ whose occupancy measure $\rho_{\pi_i, E_{-i}}$ is close to the expert's occupancy measure $\rho_{\pi_E}$ (in which $\pi_i$ is replaced by the expert $\pi_{E_i}$), as measured by the convex function $\psi^*_i$. However, we do not assume access to the expert policy $\pi_E$ during training, so it is not possible to obtain $\rho_{\pi_i, E_{-i}}$. Therefore, we consider an alternative approach where we match the occupancy measures $\rho_{\pi_E}$ and $\rho_\pi$. We obtain our practical algorithm by selecting an adversarial reward function regularizer and removing the effect of the entropy regularizers.

**Proposition 2.** If $\beta = 0$ and $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$, where

$$\psi_i(r_i) = \begin{cases} \mathbb{E}_{\pi_E}[g(r_i)] & \text{if } r_i > 0 \\ +\infty & \text{otherwise,} \end{cases} \qquad g(x) = \begin{cases} x - \log(1 - e^{-x}) & \text{if } x > 0 \\ +\infty & \text{otherwise,} \end{cases}$$

then

$$\arg\min_{\pi} \sum_{i=1}^{N} \psi^*_i(\rho_{\pi_i, \pi_{E_{-i}}} - \rho_{\pi_E}) = \arg\min_{\pi} \sum_{i=1}^{N} \psi^*_i(\rho_{\pi_i, \pi_{-i}} - \rho_{\pi_E}), \tag{11}$$

and both are equal to $\pi_E$.

Theorem 3 and Proposition 2 highlight the differences from the single-agent scenario. In Theorem 3, we make the assumption that $\mathrm{MARL}(r)$ has a unique solution, which is always true in the single-agent case due to the convexity of the space of optimal policies. In Proposition 2, we remove the entropy regularizer because the causal entropy of $\pi_i$ may depend on the policies of the other agents. Specifically, the entropy on the left-hand side of Equation (11) conditions on $\pi_{E_{-i}}$, whereas the entropy on the right-hand side conditions on $\pi_{-i}$ (both would disappear in the single-agent case).

## 4 Practical multi-agent imitation learning

Despite recent successes in deep RL, it is notoriously hard to train policies with RL algorithms because of high-variance gradient estimates. This is further exacerbated in Markov games, since an agent's optimal policy depends on the other agents [14, 17]. In this section, we address these problems and propose practical algorithms for multi-agent imitation.

### 4.1 Multi-agent generative adversarial imitation learning

We select the $\psi_i$ from Proposition 2 as our reward function regularizer; this corresponds to the two-player game introduced in Generative Adversarial Imitation Learning (GAIL, [16]). For each agent $i$, we have a discriminator (denoted $D_{\omega_i}$) mapping state-action pairs to scores, optimized to discriminate expert demonstrations from behaviors produced by $\pi_i$. Implicitly, $D_{\omega_i}$ plays the role of a reward function for the generator, which in turn attempts to train the agent to maximize its reward, thus fooling the discriminator. We optimize the following objective:

$$\min_\theta \max_\omega \; \mathbb{E}_{\pi_\theta}\left[\sum_{i=1}^{N} \log D_{\omega_i}(s, a_i)\right] + \mathbb{E}_{\pi_E}\left[\sum_{i=1}^{N} \log(1 - D_{\omega_i}(s, a_i))\right]. \tag{12}$$

We update $\pi_\theta$ through reinforcement learning, where we also use a baseline $V_\phi$ to reduce variance. We outline the resulting algorithm, Multi-Agent GAIL (MAGAIL), in Appendix B.

We can augment the reward regularizer $\psi(r)$ using an indicator $y(r)$ denoting whether $r$ fits our prior knowledge; the augmented reward regularizer $\hat{\psi} : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R} \cup \{\infty\}$ is then

$$\hat{\psi}(r) = \begin{cases} \psi(r) & \text{if } y(r) = 1 \\ +\infty & \text{if } y(r) = 0. \end{cases}$$

We introduce three types of $y(r)$ for common settings.

**Centralized.** The easiest case is to assume that the agents are fully cooperative, i.e., they share the same reward function. Here $y(r) = \mathbb{I}(r_1 = r_2 = \cdots = r_N)$ and $\psi(r) = \psi_{GA}(r)$. One could argue this corresponds to the GAIL case, where the RL procedure operates on multiple agents (a joint policy).

**Decentralized.** We make no prior assumptions about the correlation between the rewards. Here $y(r) = \mathbb{I}(r_i \in \mathbb{R}^{\mathcal{O}_i \times \mathcal{A}_i})$ and $\psi_i(r_i) = \psi_{GA}(r_i)$. This corresponds to one discriminator for each agent, which discriminates the trajectories as observed by agent $i$.
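Equation (12) decomposes into one binary-classification objective per agent. The following minimal numpy sketch illustrates these per-agent discriminator losses under the decentralized prior; the function and its inputs (arrays of discriminator logits on policy-generated and expert state-action pairs) are hypothetical stand-ins, not the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decentralized_discriminator_losses(policy_logits, expert_logits):
    """Per-agent discriminator objectives in the spirit of Equation (12).

    `policy_logits[i]` / `expert_logits[i]` are hypothetical arrays of
    discriminator scores for D_{omega_i} evaluated on (s, a_i) pairs sampled
    from the current policies and from the expert demonstrations, respectively.
    """
    losses = []
    for logit_pi, logit_e in zip(policy_logits, expert_logits):
        d_pi, d_e = sigmoid(logit_pi), sigmoid(logit_e)
        # each discriminator maximizes E_pi[log D] + E_expert[log(1 - D)],
        # so we return the negated objective as a loss to minimize
        losses.append(-(np.mean(np.log(d_pi)) + np.mean(np.log(1.0 - d_e))))
    return losses  # one loss per agent; the discriminators share no parameters

# The policies are then updated with RL against rewards derived from their own
# discriminator scores, which corresponds to the outer minimization over theta
# in Equation (12).
```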
However, these discriminators are not learned independently, as they interact indirectly via the environment.

*Figure 1: Different MAGAIL algorithms obtained with different priors on the reward structure: (a) Centralized (Cooperative), (b) Decentralized (Mixed), (c) Zero-sum (Competitive). The discriminator tries to assign higher rewards to the top row and lower rewards to the bottom row. In the centralized and decentralized settings, the policy interacts with the environment $T(s_{t+1} \mid s_t, a_t)$ to match the expert rewards. In the zero-sum setting, the policy does not interact with the environment; expert and policy trajectories are paired together as input to the discriminator.*

**Zero-sum.** Assume there are two agents that receive opposite rewards, so $r_1 = -r_2$. As such, $\psi$ is no longer additively separable. Nevertheless, an adversarial training procedure can be designed using the following fact:

$$v(\pi_{E_1}, \pi_2) \geq v(\pi_{E_1}, \pi_{E_2}) \geq v(\pi_1, \pi_{E_2}), \tag{13}$$

where $v(\pi_1, \pi_2) = \mathbb{E}_{\pi_1, \pi_2}[r_1(s, a)]$ is the expected outcome for agent 1 and is modeled by the discriminator. The discriminator can then try to maximize $v$ for trajectories from $(\pi_{E_1}, \pi_2)$ and minimize $v$ for trajectories from $(\pi_1, \pi_{E_2})$, according to Equation (13).

These three settings are summarized in Figure 1.

### 4.2 Multi-agent actor-critic with Kronecker factors

To optimize over the generator parameters $\theta$ in Equation (12), we wish to use an algorithm for multi-agent RL that has good sample efficiency in practice. Our algorithm, which we refer to as Multi-agent Actor-Critic with Kronecker factors (MACK), is based on Actor-Critic with Kronecker-factored Trust Region (ACKTR, [25-27]), a state-of-the-art natural policy gradient [28, 29] method in deep RL. MACK uses the framework of centralized training with decentralized execution [17]: policies are trained with additional information to reduce variance, but this information is not used at execution time. We let the advantage function of every agent be a function of all agents' observations and actions:

$$A^{\pi_i}_{\phi_i}(s_t, a_t) = \sum_{j=0}^{k-1} \gamma^j r_i(s_{t+j}, a_{t+j}) + \gamma^k V^{\pi_i}_{\phi_i}(s_{t+k}, a_{-i,t+k}) - V^{\pi_i}_{\phi_i}(s_t, a_{-i,t}), \tag{14}$$

where $V^{\pi_i}_{\phi_i}(s, a_{-i})$ is the baseline for agent $i$, utilizing the additional information $a_{-i}$ for variance reduction. We use (approximate) natural policy gradients to update both $\theta$ and $\phi$, but without trust regions for scheduling the learning rate; we use a linearly decaying learning rate schedule instead.

MACK has some notable differences from Multi-Agent Deep Deterministic Policy Gradient (MADDPG, [14]). On the one hand, MACK does not assume knowledge of the other agents' policies, nor does it try to infer them; the value estimator merely collects experience from the other agents (and treats them as black boxes). On the other hand, MACK does not require gradient estimators such as Gumbel-softmax [30, 31] to optimize over discrete actions, which is necessary for DDPG [32].

## 5 Experiments

We evaluate the performance of the centralized, decentralized, and zero-sum versions of MAGAIL in two types of environments. One is a particle environment, which allows for complex interactions and behaviors; the other is a control task, where multiple agents try to cooperate and move a plank forward. We collect results by averaging over 5 random seeds. Our implementation is based on OpenAI baselines [33]; please refer to Appendix C for implementation details.³

³Code for reproducing the experiments is available at https://github.com/ermongroup/multiagent-gail.
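As a concrete illustration of the MACK estimator in Equation (14), a simplified per-agent $k$-step advantage computation might look as follows; the array-based inputs and the truncation at the end of the rollout are simplifying assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def k_step_advantage(rewards_i, baselines_i, k=5, gamma=0.99):
    """Simplified k-step advantage estimate in the spirit of Equation (14).

    `rewards_i[t]` is agent i's (discriminator-derived) reward at step t, and
    `baselines_i[t]` is the centralized baseline V_{phi_i}(s_t, a_{-i,t})
    evaluated along the same rollout (hypothetical array inputs).
    """
    T = len(rewards_i)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(k, T - t)                 # truncate near the end of the rollout
        ret = sum(gamma ** j * rewards_i[t + j] for j in range(horizon))
        if t + horizon < T:                     # bootstrap with the centralized baseline
            ret += gamma ** horizon * baselines_i[t + horizon]
        adv[t] = ret - baselines_i[t]
    return adv
```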
*Figure 2: Average true reward on the cooperative tasks (Cooperative Communication and Cooperative Navigation) as a function of the number of expert demonstrations (100 to 400), for Expert, Random, BC, GAIL, Centralized, and Decentralized. Performance of experts and random policies is normalized to one and zero, respectively. We use an inverse log scale for better comparison.*

We compare our methods (centralized, decentralized, and zero-sum MAGAIL) with two baselines. The first is behavior cloning (BC), which learns a maximum likelihood estimate of $a_i$ given each state $s$ and does not require actions from the other agents. The second is a GAIL IRL baseline that operates on each agent separately: for each agent, we first pretrain the other agents with BC and then train that agent with GAIL; we then gather the trained GAIL policies from all the agents and evaluate their joint performance.

### 5.1 Particle environments

We first consider the particle environment proposed in [14], which consists of several agents and landmarks. We consider two cooperative environments and two competitive ones. All environments have an underlying true reward function that allows us to evaluate the performance of the learned agents. The environments include:

**Cooperative Communication.** Two agents must cooperate to reach one of three colored landmarks. One agent ("speaker") knows the goal but cannot move, so it must convey the message to the other agent ("listener"), which moves but does not observe the goal.

**Cooperative Navigation.** Three agents must cooperate through physical actions to reach three landmarks; ideally, each agent should cover a single landmark.

**Keep-Away.** Two agents have contradictory goals: agent 1 tries to reach one of the two targeted landmarks, while agent 2 (the adversary) tries to keep agent 1 from reaching its target. The adversary does not observe the target, so it must act based on agent 1's actions.

**Predator-Prey.** Three slower, cooperating adversaries must chase the faster agent in a randomly generated environment with obstacles; the adversaries are rewarded for touching the agent, while the agent is penalized.

For the cooperative tasks, we use an analytic expression defining the expert policy; for the competitive tasks, we use MACK to train expert policies based on the true underlying rewards (using larger policy and value networks than the ones we use for imitation). We then use the expert policies to simulate trajectories $\mathcal{D}$ and perform imitation learning on $\mathcal{D}$ as demonstrations, assuming the underlying rewards are unknown. Following [34], we pretrain our multi-agent GAIL methods and the GAIL baseline using behavior cloning as initialization, to reduce the sample complexity of exploration. We consider 100 to 400 episodes of expert demonstrations, each with 50 timesteps, which is close to the number of timesteps used for the control tasks in [16]. Moreover, we randomly sample the starting positions of the agents and landmarks in each episode, so our policies have to learn to generalize when they encounter new settings.

#### 5.1.1 Cooperative tasks

We evaluate performance in the cooperative tasks via the average expected reward obtained by all the agents in an episode. In this environment, the starting state is randomly initialized, so generalization is crucial. We do not consider the zero-sum case, since it violates the cooperative nature of the task. We display the performance of centralized and decentralized MAGAIL, GAIL, and BC in Figure 2.
Naturally, the performance of BC and MAGAIL increases with more expert demonstrations. MAGAIL performs consistently better than BC in all settings; interestingly, in the cooperative communication task, centralized MAGAIL is able to achieve expert-level performance with only 200 demonstrations, whereas BC fails to come close even with 400 trajectories. Moreover, centralized MAGAIL performs slightly better than decentralized MAGAIL due to the better prior, but decentralized MAGAIL still learns a highly correlated reward between the two agents.

*Table 1: Average agent rewards in competitive tasks. We compare behavior cloning (BC), GAIL (G), Centralized (C), Decentralized (D), and Zero-Sum (ZS) methods; higher or lower rewards are preferable depending on the agent vs. adversary role.*

**Predator-Prey**

| Agent     | BC     | BC     | BC     | BC     | BC     | G      | C      | D      | ZS    |
|-----------|--------|--------|--------|--------|--------|--------|--------|--------|-------|
| Adversary | BC     | G      | C      | D      | ZS     | BC     | BC     | BC     | BC    |
| Rewards   | -93.20 | -93.71 | -93.75 | -95.22 | -95.48 | -90.55 | -91.36 | -85.00 | -89.4 |

**Keep-Away**

| Agent     | BC    | BC    | BC    | BC    | BC    | G     | C     | D     | ZS    |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Adversary | BC    | G     | C     | D     | ZS    | BC    | BC    | BC    | BC    |
| Rewards   | 24.22 | 24.04 | 23.28 | 23.56 | 23.19 | 26.22 | 26.61 | 28.73 | 27.80 |

#### 5.1.2 Competitive tasks

We consider all three types of multi-agent GAIL (centralized, decentralized, zero-sum) and BC in both competitive tasks. Since there are two opposing sides, it is hard to measure performance directly. Therefore, we compare by letting (agents trained by) BC play against (adversaries trained by) the other methods, and vice versa. From Table 1, decentralized and zero-sum MAGAIL often perform better than centralized MAGAIL and BC, which suggests that the selection of a suitable prior $\hat{\psi}$ is important for good empirical performance.

### 5.2 Cooperative control

In some cases we are presented with sub-optimal expert demonstrations because the environment has changed; we consider this case in a cooperative control task [35], in which $N$ bipedal walkers cooperate to move a long plank forward; the agents have an incentive to collaborate, since the plank is much longer than any of the agents. The expert demonstrates its policy in an environment with no bumps on the ground and heavy weights, while we perform imitation in a new environment with bumps and lighter weights (so one is likely to use too much force). Agents trained with BC tend to act more aggressively and fail, whereas agents trained with centralized MAGAIL can adapt to the new environment. With 10 (imperfect) expert demonstrations, BC agents fail 39.8% of the time (with a reward of 1.26), while centralized MAGAIL agents fail only 26.2% of the time (with a reward of 26.57). We show videos of the respective policies in the supplementary material.

## 6 Discussion

There is a vast literature on single-agent imitation learning [36]. Behavior cloning (BC) learns the policy through supervised learning [37]. Inverse reinforcement learning (IRL) assumes the expert policy optimizes some unknown reward, recovers that reward, and learns the policy through reinforcement learning (RL). BC does not require knowledge of the transition probabilities or access to the environment, but suffers from compounding errors and covariate shift [38, 23].

Most existing work in multi-agent imitation learning assumes the agents have very specific reward structures. The most common case is fully cooperative agents [39], where the challenges mainly lie in other factors, such as unknown role assignments [40], scalability to swarm systems [41], and agents with partial observations [42].
In non-cooperative settings, [43] consider the case of IRL for two-player zero-sum games and cast the IRL problem as Bayesian inference, while [44] assume agents are non-cooperative but the reward function is a linear combination of pre-specified features. Our work is the first to propose a general multi-agent IRL framework that combines state-of-the-art multi-agent reinforcement learning methods [14, 17] and implicit generative models such as generative adversarial networks [45]. Experimental results demonstrate that it is able to imitate complex behaviors in high-dimensional environments with both cooperative and adversarial interactions. An interesting future direction is to explore new paradigms for learning from experts, such as allowing the expert to participate in the agent's learning process [24].

### Acknowledgements

This work was supported by the Toyota Research Institute and the Future of Life Institute. The authors would like to thank Lantao Yu for discussions over the implementation.

## References

[1] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, arXiv preprint arXiv:1802.01561, 2018.

[2] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, Inverse reward design, in Advances in Neural Information Processing Systems, pp. 6768-6777, 2017.

[3] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, Concrete problems in AI safety, arXiv preprint arXiv:1606.06565, 2016.

[4] D. Amodei and J. Clark, Faulty reward functions in the wild, 2016.

[5] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang, Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games, arXiv preprint arXiv:1703.10069, 2017.

[6] L. Matignon, L. Jeanpierre, A.-I. Mouaddib, et al., Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes, in AAAI, 2012.

[7] J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel, Multi-agent reinforcement learning in sequential social dilemmas, in Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464-473, International Foundation for Autonomous Agents and Multiagent Systems, 2017.

[8] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, in AAAI, vol. 8, pp. 1433-1438, Chicago, IL, USA, 2008.

[9] P. Englert and M. Toussaint, Inverse KKT: learning cost functions of manipulation tasks from demonstrations, in Proceedings of the International Symposium of Robotics Research, 2015.

[10] C. Finn, S. Levine, and P. Abbeel, Guided cost learning: Deep inverse optimal control via policy optimization, in International Conference on Machine Learning, pp. 49-58, 2016.

[11] B. Stadie, P. Abbeel, and I. Sutskever, Third person imitation learning, in ICLR, 2017.

[12] A. Y. Ng, S. J. Russell, et al., Algorithms for inverse reinforcement learning, in ICML, pp. 663-670, 2000.

[13] P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, in Proceedings of the Twenty-First International Conference on Machine Learning, p. 1, ACM, 2004.

[14] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, Multi-agent actor-critic for mixed cooperative-competitive environments, arXiv preprint arXiv:1706.02275, 2017.

[15] J. Hu, M. P. Wellman, et al., Multiagent reinforcement learning: theoretical framework and an algorithm, in ICML, vol. 98, pp. 242-250, Citeseer, 1998.
[16] J. Ho and S. Ermon, Generative adversarial imitation learning, in Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.

[17] J. Foerster, Y. Assael, N. de Freitas, and S. Whiteson, Learning to communicate with deep multi-agent reinforcement learning, in Advances in Neural Information Processing Systems, pp. 2137-2145, 2016.

[18] M. L. Littman, Markov games as a framework for multi-agent reinforcement learning, in Proceedings of the Eleventh International Conference on Machine Learning, vol. 157, pp. 157-163, 1994.

[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge, 1998.

[20] M. Bloem and N. Bambos, Infinite time horizon maximum causal entropy inverse reinforcement learning, in Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 4911-4916, IEEE, 2014.

[21] J. Filar and K. Vrieze, Competitive Markov Decision Processes. Springer Science & Business Media, 2012.

[22] H. Prasad and S. Bhatnagar, A study of gradient descent schemes for general-sum stochastic games, arXiv preprint arXiv:1507.00093, 2015.

[23] S. Ross, G. J. Gordon, and D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, in AISTATS, p. 6, 2011.

[24] D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan, Cooperative inverse reinforcement learning, in Advances in Neural Information Processing Systems, pp. 3909-3917, 2016.

[25] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, in Advances in Neural Information Processing Systems, pp. 5285-5294, 2017.

[26] Y. Song, J. Song, and S. Ermon, Accelerating natural gradient with higher-order invariance, in International Conference on Machine Learning (ICML), 2018.

[27] Y. Song, R. Shu, N. Kushman, and S. Ermon, Constructing unrestricted adversarial examples with generative models, arXiv preprint arXiv:1805.07894, 2018.

[28] S.-I. Amari, Natural gradient works efficiently in learning, Neural Computation, vol. 10, no. 2, pp. 251-276, 1998.

[29] S. M. Kakade, A natural policy gradient, in Advances in Neural Information Processing Systems, pp. 1531-1538, 2002.

[30] E. Jang, S. Gu, and B. Poole, Categorical reparameterization with Gumbel-softmax, arXiv preprint arXiv:1611.01144, 2016.

[31] C. J. Maddison, A. Mnih, and Y. W. Teh, The Concrete distribution: A continuous relaxation of discrete random variables, arXiv preprint arXiv:1611.00712, 2016.

[32] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, Continuous control with deep reinforcement learning, arXiv preprint arXiv:1509.02971, 2015.

[33] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, OpenAI Baselines. https://github.com/openai/baselines, 2017.

[34] Y. Li, J. Song, and S. Ermon, InfoGAIL: Interpretable imitation learning from visual demonstrations, arXiv preprint arXiv:1703.08840, 2017.

[35] J. K. Gupta and M. Egorov, Multi-agent deep reinforcement learning environment. https://github.com/sisl/madrl, 2017.

[36] J. A. Bagnell, An invitation to imitation, tech. rep., Carnegie Mellon University, Robotics Institute, Pittsburgh, PA, 2015.

[37] D. A. Pomerleau, Efficient training of artificial neural networks for autonomous navigation, Neural Computation, vol. 3, no. 1, pp. 88-97, 1991.
[38] S. Ross and D. Bagnell, Efficient reductions for imitation learning, in AISTATS, pp. 3-5, 2010.

[39] S. Barrett, A. Rosenfeld, S. Kraus, and P. Stone, Making friends on the fly: Cooperating with new teammates, Artificial Intelligence, vol. 242, pp. 132-171, 2017.

[40] H. M. Le, Y. Yue, and P. Carr, Coordinated multi-agent imitation learning, arXiv preprint arXiv:1703.03121, 2017.

[41] A. Šošić, W. R. KhudaBukhsh, A. M. Zoubir, and H. Koeppl, Inverse reinforcement learning in swarm systems, stat, vol. 1050, p. 17, 2016.

[42] K. Bogert and P. Doshi, Multi-robot inverse reinforcement learning under occlusion with interactions, in Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 173-180, International Foundation for Autonomous Agents and Multiagent Systems, 2014.

[43] X. Lin, P. A. Beling, and R. Cogill, Multi-agent inverse reinforcement learning for zero-sum games, arXiv preprint arXiv:1403.6508, 2014.

[44] T. S. Reddy, V. Gopikrishna, G. Zaruba, and M. Huber, Inverse reinforcement learning for decentralized non-cooperative multiagent systems, in Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, pp. 1930-1935, IEEE, 2012.

[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

[46] J. Martens and R. Grosse, Optimizing neural networks with Kronecker-factored approximate curvature, in International Conference on Machine Learning, pp. 2408-2417, 2015.

[47] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, High-dimensional continuous control using generalized advantage estimation, arXiv preprint arXiv:1506.02438, 2015.