# Coordinated Multi-Agent Imitation Learning

Hoang M. Le, Yisong Yue (California Institute of Technology, Pasadena, CA); Peter Carr (Disney Research, Pittsburgh, PA); Patrick Lucey (STATS LLC, Chicago, IL)

Abstract

We study the problem of imitation learning from demonstrations of multiple coordinating agents. One key challenge in this setting is that learning a good model of coordination can be difficult, since coordination is often implicit in the demonstrations and must be inferred as a latent variable. We propose a joint approach that simultaneously learns a latent coordination model along with the individual policies. In particular, our method integrates unsupervised structure learning with conventional imitation learning. We illustrate the power of our approach on a difficult problem of learning multiple policies for fine-grained behavior modeling in team sports, where different players occupy different roles in the coordinated team strategy. We show that having a coordination model to infer the roles of players yields substantially improved imitation loss compared to conventional baselines.

1. Introduction

The areas of multi-agent planning and control have witnessed a recent wave of strong interest due to the practical desire to deal with complex real-world problems, such as smart-grid control, autonomous vehicle planning, and managing teams of robots for emergency response, among others. From the learning perspective, (cooperative) multi-agent learning is not a new area of research (Stone & Veloso, 2000; Panait & Luke, 2005). However, compared to the progress in conventional supervised learning and single-agent reinforcement learning, the successes of multi-agent learning have remained relatively modest. Most notably, multi-agent learning suffers from extremely high dimensionality of both the state and action spaces, as well as a relative lack of data sources and experimental testbeds.

Figure 1. Our motivating example of learning coordinating behavior policies for team sports from tracking data. Red is the attacking team, blue is the defending team, and yellow is the ball.

The growing availability of data sources for coordinated multi-agent behavior, such as sports tracking data (Bialkowski et al., 2014), now enables the possibility of learning multi-agent policies from demonstrations, also known as multi-agent imitation learning. One particularly interesting aspect of domains such as team sports is that the agents must coordinate. For example, in the professional soccer setting depicted in Figure 1, different players must coordinate to assume different roles (e.g., defend left field). However, the roles and the role assignment mechanism are unobserved in the demonstrations. Furthermore, the role of a player may change during the same play sequence. In the control community, this issue is known as index-free multi-agent control (Kingston & Egerstedt, 2010).

Motivated by these challenges, we study the problem of imitation learning for multiple coordinating agents from demonstrations. Many realistic multi-agent settings require coordination among collaborative agents to achieve some common goal (Guestrin et al., 2002; Kok et al., 2003).
Beyond team sports, other examples include learning policies for game AI, controlling teams of multiple robots, and modeling collective animal behavior. As discussed above, we are interested in settings where agents have access to the outcomes of actions from other agents, but the coordination mechanism is neither clearly defined nor observed, which makes the full state only partially observable.

We propose a semi-supervised learning framework that integrates and builds upon conventional imitation learning and unsupervised, or latent, structure learning. The latent structure model encodes a coordination mechanism, which approximates the implicit coordination in the demonstration data. In order to make learning tractable, we develop an alternating optimization method that enables integrated and efficient training of both the individual policies and the latent structure model. For learning individual policies, we extend reduction-based single-agent imitation learning approaches to the multi-agent domain, utilizing powerful black-box supervised techniques such as deep learning as base routines. For latent structure learning, we develop a stochastic variational inference approach.

We demonstrate the effectiveness of our method in two settings. The first is a synthetic experiment based on the popular predator-prey game. The second is a challenging task of learning multiple policies for team defense in professional soccer, using a large training set of play sequences as illustrated in Figure 1 (data at http://www.stats.com/data-science/; video results at http://hoangminhle.github.io). We show that learning a good latent structure to encode implicit coordination yields significantly superior imitation performance compared to conventional baselines. To the best of our knowledge, this is the first time an imitation learning approach has been applied to jointly learn cooperative multi-agent policies at large scale.

2. Problem Formulation

In coordinated multi-agent imitation learning, we have K agents acting in coordination to achieve a common goal (or sequence of goals). The training data D consists of multiple demonstrations of the K agents. Importantly, we assume the identity (or indexing) of the K experts may change from one demonstration to another. Each (unstructured) set of demonstrations is denoted by U = {U_1, ..., U_K}, where U_k = {u_{t,k}}_{t=1}^T is the sequence of actions taken by agent k over time steps t. Note that each set of demonstrations can have a different sequence length T. Let C = {c_t}_{t=1}^T be the context associated with each demonstration sequence.

Policy Learning. Our ultimate goal is to learn a (largely) decentralized policy, but for clarity we first present the problem of learning a fully centralized multi-agent policy. Following the notation of (Ross et al., 2011), let π(s) = a denote the joint policy that maps the joint state s = [s_1, ..., s_K] of all K agents into K actions a = [a_1, ..., a_K]. The goal is to minimize the imitation loss

$L_{\text{imitation}} = \mathbb{E}_{s \sim d_{\pi}}\left[\ell(\pi(s))\right],$

where d_π denotes the distribution of states experienced by the joint policy π and ℓ is the imitation loss defined over the demonstrations (e.g., squared loss for deterministic policies, or cross entropy for stochastic policies).

The decentralized setting decomposes the joint policy π into K policies [π_1, ..., π_K], each tailored to a specific agent index or role. (It is straightforward to extend our formulation to settings where multiple agents can occupy the same role, and where not all roles are occupied across all execution sequences.) The loss function is then

$L_{\text{imitation}} = \sum_{k=1}^{K} \mathbb{E}_{s_k \sim d_{\pi_k}}\left[\ell(\pi_k(s_k))\right].$
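To make the decentralized objective concrete, the following is a minimal sketch of how the per-role imitation loss could be estimated empirically, assuming deterministic policies and squared loss. The function and variable names are our own illustration and are not part of the paper.

```python
import numpy as np

def decentralized_imitation_loss(policies, rollout_states, expert_actions):
    """Empirical estimate of L = sum_k E_{s_k ~ d_{pi_k}} [ l(pi_k(s_k)) ],
    using squared loss for deterministic policies (a sketch; names assumed).

    policies:       list of K callables, pi_k(state) -> action vector
    rollout_states: list of K arrays (T, d_s), states visited when executing pi_k
    expert_actions: list of K arrays (T, d_a), demonstrated actions for role k
    """
    loss = 0.0
    for pi_k, S_k, A_k in zip(policies, rollout_states, expert_actions):
        pred = np.stack([pi_k(s) for s in S_k])           # pi_k(s_k) along the roll-out
        loss += np.mean(np.sum((pred - A_k) ** 2, axis=-1))
    return loss
```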
Black-Box Policy Classes. In order to leverage powerful black-box policy classes such as random forests and deep learning, we take a learning reduction approach to training π. One consequence is that the state space representation s = [s_1, ..., s_K] must be consistently indexed, e.g., agent k in one instance must correspond to agent k in another instance. This requirement applies to both centralized and decentralized policy learning, and is often implicitly assumed in prior work on multi-agent learning. A highly related issue arises in distributed control of index-free coordinating robots, e.g., to maintain a defined formation (Kloder & Hutchinson, 2006; Kingston & Egerstedt, 2010).

Motivating Example: Soccer Domain. Consider the task of imitating professional soccer players, where the training data includes play sequences from different teams and games. The context C corresponds to the behavior of the opposing team and the ball. The data includes multiple sequences of K-sets of trajectories U = {U_1, U_2, ..., U_K}, where the actual identity of the player generating U_k may change from one demonstration to the next. One important challenge for black-box policy learning is constructing an indexing mechanism over the agents that yields a consistent state representation. For example, the same index should correspond to the left defender in all instances. Otherwise, the inputs to the policy will be inconsistent, making learning difficult if not impossible. Note that, barring extensive annotations or heuristic rule-based definitions, it is unnatural to quantitatively define what makes a player a "left defender". In addition, even if we had a way to define who the left defender is, he may not stay in the same role during the same sequence.

Role-based Indexing. We address index-free policy learning via role learning and role-based index assignment. To motivate our notion of role, let us first consider the simplest indexing mechanism: one could equate role with agent identity. However, the data often comes from various sequences, with heterogeneous identities and teams of agents. Thus, instead of learning identity-specific policies, it is more natural and data-efficient to learn a policy per role. However, a key challenge in learning such policies directly is that the roles are undefined, unobserved, and can change dynamically within the same sequence. We thus view learning the coordination, via role assignment, as an unsupervised structured prediction problem.

Figure 2. Alternating stochastic optimization training scheme for our semi-supervised structure regularization model.

Coordination via Structured Role Assignment. Instead of handcrafting the definition of roles, we learn the roles in an unsupervised fashion, without attaching any semantic labels to them. At the same time, role transitions should obey certain structural regularity, due to coordination. This motivates using graphical models to represent the coordination structure.

Coordinated Policy Learning. We formulate the indexing mechanism as an assignment function A that maps the unstructured set U and a probabilistic structured model q to an indexed set of trajectories rearranged from U, i.e., A : {U_1, ..., U_K} × q ↦ [A_1, ..., A_K], where the set {A_1, ..., A_K} = {U_1, ..., U_K}.
We view q as a latent variable model that infers the role assignments for each set of demonstrations. Thus, q drives the indexing mechanism A so that state vectors can be consistently constructed to facilitate optimizing the imitation loss. We employ entropy regularization, augmenting the imitation loss with a penalty on low-entropy role assignments (Grandvalet et al., 2004; Dudik et al., 2004), yielding our overall objective:

$\min_{\pi_1, \dots, \pi_K, \mathcal{A}} \; \sum_{k=1}^{K} \mathbb{E}_{s_k \sim d_{\pi_k}}\left[\ell(\pi_k(s_k)) \mid \mathcal{A}, D\right] - \lambda H(\mathcal{A} \mid D) \qquad (1)$

where both the imitation loss and the entropy are measured with respect to the state distribution induced by the policies, and D is the training data. This objective can also be seen as maximizing the mutual information between the latent structure and the observed trajectories (Krause et al., 2010).

3. Learning Approach

Optimizing (1) is challenging for two reasons. First, beyond the challenges inherited from single-agent settings, multi-agent imitation learning must account for multiple simultaneously learning agents, which is known to cause non-stationarity in multi-agent reinforcement learning (Busoniu et al., 2008). Second, the latent role assignment model, which forms the basis for coordination, depends on the actions of the learning policies, which in turn depend on the structured role assignment.

Algorithm 1 Coordinated Multi-Agent Imitation Learning
Input: Multiple unstructured trajectory sets U = {U_1, ..., U_K} with U_k = {u_{t,k}}_{t=1}^T and context C = {c_t}_{t=1}^T
Input: Graphical model q with global/local parameters θ and z
Input: Initialized policies π_k, k = 1, ..., K
Input: Step size sequence ρ_n, n = 1, 2, ...
1: repeat
2:   [A_1, ..., A_K] ← Assign({U_1, ..., U_K} | q(θ, z))
3:   [π_1, ..., π_K] ← Learn([A_1, ..., A_K, C])
4:   Roll out π_1, ..., π_K to obtain Â_1, ..., Â_K
5:   A_k ← Â_k for all k (alternatively: A_k ← Â_k with probability η, for η → 1)
6:   q(θ, z) ← Learn_Structure({A_1, ..., A_K, C, θ, ρ_n})
7: until no improvement on validation set
Output: K policies π_1, π_2, ..., π_K

We propose an alternating optimization approach to solving (1), summarized in Figure 2. The main idea is to integrate imitation learning with unsupervised structure learning by taking turns to (i) optimize the imitation policies while fixing the structured model (minimizing imitation loss), and (ii) re-train the latent structure model and reassign roles while fixing the learned policies (maximizing role assignment entropy). The alternating nature allows us to circumvent the circular dependency between policy learning and latent structure learning. Furthermore, for (i) we develop a stable multi-agent learning reduction approach.

3.1. Approach Outline

Algorithm 1 outlines our framework. We assume the latent structure model for computing role assignments is formulated as a graphical model. The multi-agent policy training procedure Learn uses a reduction approach, and can leverage powerful off-the-shelf supervised learning tools such as deep neural networks (Hochreiter & Schmidhuber, 1997). The structure learning (Learn_Structure) and role assignment (Assign) components are based on graphical model training and inference. For efficient training, we employ alternating stochastic optimization (Hoffman et al., 2013; Johnson & Willsky, 2014; Beal, 2003) on the same mini-batches. Note that batch training can be deployed similarly, as illustrated by one of our experiments. We interleave the three components described above into a complete learning algorithm.
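As a reading aid, here is a minimal Python sketch of the alternating loop in Algorithm 1. All components (assign, learn, learn_structure, rollout, improved_on_validation) are assumed callables standing in for Algorithms 2-4 and the validation check; none of these names come from the paper.

```python
def coordinated_imitation(U, C, q, policies, assign, learn, learn_structure,
                          rollout, step_sizes, improved_on_validation):
    """Alternating optimization sketch of Algorithm 1 (all callables assumed).

    U: unordered trajectory sets {U_1, ..., U_K};  C: context;  q: latent role model.
    """
    for rho in step_sizes:
        A = assign(U, q)                         # line 2: role-based index assignment (Alg. 4)
        policies = learn(policies, A, C)         # line 3: joint imitation learning (Alg. 2)
        A = [rollout(pi, C) for pi in policies]  # lines 4-5: replace data with roll-outs
        q = learn_structure(A, C, q, rho)        # line 6: update latent structure (Alg. 3)
        if not improved_on_validation(policies):
            break                                # line 7: stop when validation stalls
    return policies, q
```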
Given an initially unstructured set of training data, an initialized set of policies, and prior parameters of the structure model, Algorithm 1 performs alternating structure optimization on each mini-batch (of size 1 in Algorithm 1).

Line 2: Role assignment is performed on the trajectories {A_1, ..., A_K} by running the inference procedure (Algorithm 4). The result is an ordered set [A_1, ..., A_K], where trajectory A_k corresponds to policy π_k.

Lines 3-5: Each policy π_k is updated using joint multi-agent training on the ordered set [A_1, ..., A_K, C] (Algorithm 2). The updated models are then executed to yield a rolled-out set of trajectories, which replaces the previous set of trajectories {A_k}.

Line 6: The parameters of the latent structured model are updated from the rolled-out trajectories (Algorithm 3).

The algorithm optionally includes a mixing step on line 5, where the rolled-out trajectories replace the training trajectories with increasing probability approaching 1, similar to scheduled sampling (Bengio et al., 2015); this may help stabilize learning in the early phase of the algorithm. In our main experiment, we did not notice a performance gain from this option.

3.2. Joint Multi-Agent Imitation Learning

In this section we describe the Learn procedure for multi-agent imitation learning in line 3 of Algorithm 1. As background, for single-agent imitation learning, reduction-based methods operate by iteratively collecting a new data set D_n at each round n of training, consisting of state-action pairs (s_t, a*_t), where a*_t is some optimal or demonstrated action given state s_t. A new policy can be formed by (i) combining a policy learned from this data set D_n with the previously learned policy π (Daumé III et al., 2009), or (ii) learning a new policy π directly from the data set formed by aggregating D_1, ..., D_n (Ross et al., 2011). Other variants exist, although we do not discuss them here.

The intuition behind the iterative reduction approach is to prevent a mismatch between the training and prediction distributions due to sequential cascading errors (also called covariate shift). The main idea is to use the learned policy's own predictions in the construction of subsequent states, thus simulating test-time conditions during training. This mechanism enables the agent to learn a policy that is robust to its own mistakes. Reduction-based methods also accommodate any black-box supervised training subroutine. We focus on using expressive function classes such as Long Short-Term Memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) as the policy class. (Note that conventional training of LSTMs does not address the cascading error problem: while LSTMs are very good at sequence-to-sequence prediction tasks, they cannot naturally deal with the drift of the input state distribution caused by feeding action outputs back into the dynamical system (Bengio et al., 2015).)

Algorithm 2 outlines the Learn procedure for stable multi-agent imitation learning. Assume we are given consistently indexed demonstrations A = [A_1, ..., A_K], where each A_k = {a_{t,k}}_{t=1}^T corresponds to the actions of policy π_k. Let the corresponding expert action be a*_{t,k}. To lighten notation, we denote the per-agent state vector by s_{t,k} = φ_k([a_{t,1}, ..., a_{t,k}, ..., a_{t,K}, c_t]). (More generally, the state vector s_{t,k} of policy π_k at time t can be constructed as s_{t,k} = [φ_k([a_{1:t,1}, c_{1:t}]), ..., φ_k([a_{1:t,K}, c_{1:t}])].)
Algorithm 2 Joint Multi-Agent Imitation Learning: Learn(A_1, A_2, ..., A_K, C)
Input: Ordered actions A_k = {a_{t,k}}_{t=1}^T for all k, context {c_t}_{t=1}^T
Input: Initialized policies π_1, ..., π_K
Input: Base routine Train(S, A) mapping states to actions
1: Set increasing prediction horizon j ∈ {1, ..., T}
2: for t = 0, j, 2j, ..., T do
3:   for i = 0, 1, ..., j − 1 do
4:     Roll out â_{t+i,k} = π_k(ŝ_{t+i−1,k}) for each agent k
5:     Cross-update for each policy k ∈ {1, ..., K}: ŝ_{t+i,k} = φ_k([â_{t+i,1}, ..., â_{t+i,k}, ..., â_{t+i,K}, c_{t+i}])
6:   end for
7:   Policy update for each agent k: π_k ← Train({(ŝ_{t+i,k}, a*_{t+i+1,k})}_{i=0}^{j})
8: end for
Output: K updated policies π_1, π_2, ..., π_K

Algorithm 2 employs a roll-out horizon j, which divides the entire trajectory into T/j segments. The following happens for every segment. We iteratively perform a roll-out at each time step i for all K policies (line 4) to obtain the actions {â_{i,k}}. Each policy simultaneously updates its state ŝ_{i,k} using the predictions from all other policies (line 5). At the end of the current segment, all policies are updated using the error signal from the deviation between the predicted actions â_{i,k} and the expert actions a*_{i,k} for all i along the sub-segment (line 7). After the policy updates, training moves on to the next j-length sub-segment, using the freshly updated policies for subsequent roll-outs. The iteration proceeds until the end of the sequence is reached. In the outer loop, the roll-out horizon j is incremented.

Two key insights behind our approach are as follows. First, in addition to the training-prediction mismatch issue of single-agent learning, each agent's predictions must also be robust to imperfect predictions from the other agents. This non-stationarity issue also arises in multi-agent reinforcement learning (Busoniu et al., 2008) when agents learn simultaneously. We therefore perform joint training by cross-updating each agent's state using the previous predictions of the other agents.

Second, many single-agent imitation learning algorithms assume the presence of a dynamic oracle to provide one-step corrections a*_t along the rolled-out trajectories. In practice, dynamic oracle feedback is very expensive to obtain, and some recent work has attempted to relax this requirement (Le et al., 2016; Ho & Ermon, 2016). Without dynamic oracles, the rolled-out trajectory can deviate significantly from the demonstrated trajectories when the prediction horizon j is large (close to T), leading to training instability. Thus j is gradually increased, allowing the policies to slowly learn to make good sequential predictions over longer horizons.

For efficient training, we focus on stochastic optimization, which can invoke the base routine Train multiple times and thus naturally accommodates a varying j. Note that batch-training alternatives to Algorithm 2 can employ similar training schemes, with similar theoretical guarantees lifted to the multi-agent case. The Appendix shows how to use DAgger (Ross et al., 2011) within Algorithm 2, which we used for our synthetic experiment.
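The following Python sketch illustrates the cross-updating and segment-wise updates of Algorithm 2 for vector-valued actions. The shared feature map phi, the sklearn-style predict/fit interface, and the choice to restart each segment from the expert's actions are our simplifying assumptions, not details specified by the paper.

```python
import numpy as np

def joint_multiagent_learn(policies, demos, context, horizons, fit):
    """Sketch of Algorithm 2: joint training with cross-updated states.

    policies: list of K models with .predict(states) -> actions (sklearn-style)
    demos:    (T, K, d_a) expert actions, consistently indexed by role
    context:  (T, d_c) context features
    horizons: increasing roll-out lengths j, e.g. [1, 2, 5, 10]
    fit:      black-box Train routine, fit(model, states, targets) -> model
    """
    T, K, _ = demos.shape

    def phi(actions, ctx):
        # simple shared feature map: everyone's previous action plus context
        # (the paper allows a separate feature map phi_k per policy)
        return np.concatenate([actions.ravel(), ctx])

    for j in horizons:                               # outer loop: grow the horizon
        acts = demos[0].copy()                       # assumption: start from expert actions
        S = [[] for _ in range(K)]
        Y = [[] for _ in range(K)]
        for t in range(1, T):
            state = phi(acts, context[t - 1])
            # cross-update: each agent's state is built from ALL agents' roll-outs
            acts = np.stack([policies[k].predict(state[None])[0] for k in range(K)])
            for k in range(K):
                S[k].append(state)
                Y[k].append(demos[t, k])             # expert correction a*_{t,k}
            if t % j == 0 or t == T - 1:             # end of a j-length segment
                for k in range(K):
                    policies[k] = fit(policies[k], np.stack(S[k]), np.stack(Y[k]))
                S = [[] for _ in range(K)]
                Y = [[] for _ in range(K)]
                acts = demos[t].copy()               # assumption: reset segment to expert
    return policies
```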
3.3. Coordination Structure Learning

The coordination mechanism is based on a latent structured model that governs the role assignment. The training and inference procedures address two main issues:

Learn_Structure: unsupervised learning of a probabilistic role assignment model q.

Assign: how q informs the indexing mechanism so that unstructured trajectories can be mapped to structured trajectories amenable to Algorithm 2.

Given an arbitrarily ordered set of trajectories U = {U_1, ..., U_K, C}, let the coordination mechanism underlying each such U be governed by a true but unknown model p with global parameters θ. We suppress the agent/policy subscript and consider a generic featurized trajectory x_t = [u_t, c_t] for all t. Let the latent role sequence for the same agent be z = z_{1:T}. At any time t, each agent acts according to a latent role z_t ∼ Categorical({1, 2, ..., K}); these are the local parameters of the structured model.

Ideally, role and index assignment could be obtained by calculating the true posterior p(z | x, θ), which is often intractable. We instead aim to approximate p(z | x, θ) by a simpler distribution q using techniques from stochastic variational inference (Hoffman et al., 2013), which allows for efficient stochastic training on mini-batches and integrates naturally with our imitation learning subroutine. In variational inference, posterior approximation is cast as optimization over a simpler model class Q, searching for parameters θ and z that maximize the evidence lower bound (ELBO) L:

$\mathcal{L}(q(z, \theta)) := \mathbb{E}_q[\ln p(z, \theta, x)] - \mathbb{E}_q[\ln q(z, \theta)] \le \ln p(x).$

Maximizing L is equivalent to finding q ∈ Q that minimizes the KL divergence KL(q(z, θ | x) || p(z, θ | x)). We focus on the structured mean-field variational family, which factorizes q as q(z, θ) = q(z) q(θ). This factorization breaks the dependency between θ and z, but not between the individual latent states z_t, unlike variational inference for i.i.d. data (Kingma & Welling, 2013).

3.3.1. Training to Learn Model Parameters

The procedure for learning the parameters of our structured model is summarized in Algorithm 3.

Algorithm 3 Structure Learning: Learn_Structure({U_1, ..., U_K, C, θ, ρ}) ↦ q(θ, z)
Input: X_k = {x_{t,k}}_{t=1}^T = {[u_{t,k}, c_t]} for all t, k; X = {X_k}_{k=1}^K
Input: Graphical model parameters θ, step size ρ
1: Local update: compute q(z) via message passing while fixing θ (see Appendix for derivations)
2: Global parameter update via natural gradient ascent: θ ← (1 − ρ) θ + ρ (θ_prior + b · E_{q(z)}[t(z, x)])
Output: Updated model q(θ, z) = q(θ) q(z)

Parameter learning proceeds via alternating updates over the factors q(θ) and q(z), keeping one factor fixed while updating the other. Stochastic variational inference performs such updates efficiently in mini-batches. We slightly abuse notation and overload θ for the natural parameters of the global parameter θ in the exponential family. Assuming the usual conjugacy in the exponential family, the stochastic natural gradient takes a convenient form (line 2 of Algorithm 3; derivation in the Appendix), where t(z, x) is the vector of sufficient statistics and b is a vector of scaling factors adjusting for the relative size of the mini-batches. The global update assumes the optimal local update q(z) has been computed.

Fixing the global parameters, the local updates are based on message passing over the underlying graphical model. The exact mathematical derivation depends on the specific graph structure. The simplest scenario is to assume independence among the z_t's, which resembles naive Bayes. In our experiments, we instead focus on Hidden Markov Models in order to capture first-order dependencies in role transitions over play sequences. In that case, line 1 of Algorithm 3 resembles running the forward-backward algorithm to compute the update q(z). The forward-backward algorithm in the local update step takes O(K^2 T) time for a chain of length T with K hidden states. For completeness, the derivation of parameter learning for HMMs is included in the Appendix.
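To make the two steps of Algorithm 3 concrete for the HMM case, here is a sketch of the local update (forward-backward message passing producing q(z)) and of the global natural-gradient step for a Dirichlet posterior over transition rows. The specific parameterization (discrete transitions with Dirichlet priors, precomputed emission log-likelihoods) is our simplifying assumption, not the paper's exact model.

```python
import numpy as np
from scipy.special import logsumexp

def local_update(log_pi0, log_A, log_lik):
    """Forward-backward pass: q(z_t) and pairwise q(z_t, z_{t+1}) for one sequence.

    log_pi0: (K,) initial log-probabilities; log_A: (K, K) transition log-probs;
    log_lik: (T, K) emission log-likelihoods log q(x_t | z_t = k).
    """
    T, K = log_lik.shape
    log_alpha = np.zeros((T, K))
    log_beta = np.zeros((T, K))
    log_alpha[0] = log_pi0 + log_lik[0]
    for t in range(1, T):
        log_alpha[t] = log_lik[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + log_lik[t + 1] + log_beta[t + 1], axis=1)
    log_gamma = log_alpha + log_beta
    gamma = np.exp(log_gamma - logsumexp(log_gamma, axis=1, keepdims=True))  # q(z_t)
    xi = np.zeros((T - 1, K, K))                       # q(z_t = i, z_{t+1} = j)
    for t in range(T - 1):
        m = (log_alpha[t][:, None] + log_A
             + log_lik[t + 1][None, :] + log_beta[t + 1][None, :])
        xi[t] = np.exp(m - logsumexp(m))
    return gamma, xi

def global_update(trans_post, trans_prior, xi, rho, batch_scale):
    """Stochastic natural-gradient step on the Dirichlet transition posterior:
    theta <- (1 - rho) * theta + rho * (theta_prior + b * E_q[t(z, x)]),
    where the expected sufficient statistics are the expected transition counts."""
    expected_counts = batch_scale * xi.sum(axis=0)     # (K, K)
    return (1.0 - rho) * trans_post + rho * (trans_prior + expected_counts)
```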
3.3.2. Inference for Role and Index Assignment

We can perform two types of inference with a learned q.

Role inference. Compute the most likely role sequence {z_{t,k}}_{t=1}^T ∈ {1, ..., K}^T, e.g., using the Viterbi algorithm (or dynamic programming-based forward message passing for general graph structures). This most likely role sequence for agent k, which is a low-dimensional representation of the coordination mechanism, can be used to augment the contextual features {c_t}_{t=1}^T for each agent's policy training.

Role-based index assignment. Transform the unstructured set U into an ordered set of trajectories A to facilitate the imitation learning step. This is the more important task for the overall approach. The intuitive goal of the indexing mechanism is to provide a consistent mapping from agent trajectories to policies. Assume, for notational convenience, that we want index k assigned to a unique agent who is most likely assuming role k. Our inference technique rests on the well-known Linear Assignment Problem (Papadimitriou & Steiglitz, 1982), which is solved optimally via the Kuhn-Munkres algorithm. Specifically, construct the cost matrix M as

$M = M_1 \odot M_2 \qquad (2)$

$M_1(k, k') = q(\{x_{t,k}\} \mid z_{t,k} = k') = \prod_{t=1}^{T} q(x_{t,k} \mid z_{t,k} = k'),$

$M_2(k, k') = \ln q(\{x_{t,k}\} \mid z_{t,k} = k') = \sum_{t=1}^{T} \ln q(x_{t,k} \mid z_{t,k} = k'),$

where k = 1, ..., K, k' = 1, ..., K, ⊙ is the Hadamard product, and the matrices M_1, M_2 take advantage of the Markov property of the graphical model. Solving the linear assignment problem for cost matrix M, we obtain the matching A from roles to indices such that the total cost per agent is minimized. From here, we rearrange the unordered set {U_1, ..., U_K} into the ordered sequence [A_1, ..., A_K] = [U_{A(1)}, ..., U_{A(K)}] according to the minimum-cost mapping. To see why this index assignment procedure increases the entropy term in the original objective (1), notice that

$H(\mathcal{A} \mid D) \propto -\sum_{k=1}^{K} P(k)\, q(\mathcal{A}(A_k) = k) \log q(\mathcal{A}(A_k) = k) \propto -\sum_{k=1}^{K} M(k, \mathcal{A}(k)),$

where we assume each latent role k has equal probability. The right-hand side increases as a result of the linear assignment and the consequent role assignment procedure. Our inference procedure for role assignment is summarized in Algorithm 4.

Algorithm 4 Multi-Agent Role Assignment: Assign({U_1, ..., U_K} | q) ↦ [A_1, ..., A_K]
Input: Approximate inference model q; unordered trajectories U = {U_k}_{k=1}^K
1: Calculate the cost matrix M ∈ R^{K×K} per equation (2)
2: A ← Min_Cost_Assignment(M)
Output: A_k = U_{A(k)} for all k = 1, 2, ..., K
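A sketch of Algorithm 4 using an off-the-shelf Hungarian-algorithm solver is shown below. It assumes the per-trajectory, per-role log-likelihoods log q({x_{t,k}} | z = k') have already been computed from the learned model q; the function names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_roles(log_lik):
    """Role-based index assignment (Algorithm 4 sketch).

    log_lik[k, k_prime] = sum_t log q(x_{t,k} | z_{t,k} = k_prime), i.e. the
    log-likelihood of unordered trajectory U_k under latent role k_prime.
    Builds the cost matrix M = M1 ⊙ M2 of equation (2) and solves the linear
    assignment problem (Kuhn-Munkres) for a minimum-cost matching.
    """
    M = np.exp(log_lik) * log_lik                  # entries q * ln q (all <= 0)
    traj_idx, role_idx = linear_sum_assignment(M)  # minimize total assignment cost
    role_to_traj = np.empty_like(role_idx)
    role_to_traj[role_idx] = traj_idx              # which trajectory fills each role
    return role_to_traj

# usage: reorder the unordered set U into [A_1, ..., A_K] = [U_A(1), ..., U_A(K)]
# ordered = [U[i] for i in assign_roles(log_lik_matrix)]
```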
4. Experiments

We present empirical results from two settings. The first is a synthetic setting based on predator-prey, where the goal is to imitate a coordinating team of predators. The second is a large-scale imitation learning setting based on player trajectories from professional soccer games, where the goal is to imitate defensive team play.

4.1. Predator-Prey Domain

Setting. The predator-prey problem, also frequently called the Pursuit Domain (Benda, 1985), is a popular setting for multi-agent reinforcement learning. The traditional setup has four predators and one prey, positioned on a grid board. At each time step, each agent has five moves: N, S, E, W, or no move. The world is toroidal: agents can move off one end of the board and come back on the other end. Agents move simultaneously, but two agents cannot occupy the same position, and collisions are avoided by assigning a random move priority to the agents at each time step. The predators can capture the prey only if the prey is surrounded by all four predators. The goal of the predators is to capture the prey as fast as possible, which necessarily requires coordination.

Data. The demonstration data is collected from 1000 game instances, where four experts, indexed 1 to 4, are prescribed consistent and coordinated roles, as illustrated in the capture state of Figure 3. In other words, agent 1 always attempts to capture the prey on the right-hand side, which allows for one fixed role for each expert throughout the game. However, this particular role assignment is hidden from the imitation learning task. Each expert is exhaustively trained using value iteration (Sutton & Barto, 1998) in the reinforcement learning setting, with a reward of 1 if the agent is in the position next to the prey prescribed by its role, and 0 otherwise. A separate set of 100 games was collected for evaluation. A game is terminated after 50 time steps if the predators fail to capture the prey. In the test set, the experts fail to capture the prey in 2% of the games, and on average take 18.3 steps to capture the prey.

Experiment Setup. For this experiment, we use the batch version of Algorithm 1 (see Appendix) to learn to imitate the experts using only demonstrations. Each policy is represented by a random forest of 20 trees, trained over 10 iterations. The expert correction for each rolled-out state is collected via value iteration; the experts thus act as dynamic oracles, which results in a multi-agent training setting analogous to DAgger (Ross et al., 2011). We compare two versions of multi-agent imitation learning.

Coordinated Training. We use our algorithm, with the latent structure model represented by a discrete Hidden Markov Model with binomial emissions. We use Algorithm 4 to maximize the role consistency of the dynamic oracles across different games.

Unstructured Training. An arbitrary role is assigned to each dynamic oracle for each game, i.e., the agent index is meaningless.

In both versions, training uses the same data aggregation scheme, and batch training uses the same random forest configuration.

Figure 4. Comparing performance in Predator-Prey between our approach and unstructured multi-agent imitation learning, as a function of the number of training rounds. Our approach demonstrates both significantly lower failure rates and lower average time to success (for successful trials).

Results. Figure 4 compares the test performance of our method against unstructured multi-agent imitation learning. Our method quickly approaches expert performance (an average of 22 steps with an 8% failure rate in the last iteration), whereas the unstructured baseline did not improve beyond the first iteration (an average of 42 steps with a 70% failure rate). Note that we even gave the unstructured baseline an advantage over our method, by forcing the prey to move last, after all predators make their decisions (effectively making the prey slower). Without this advantage, the unstructured policies fail to capture the prey almost 100% of the time. If the same restriction is applied to the policies obtained from our method, performance is on par with the experts (100% success rate, with a similar number of steps taken).
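The batch training setup described above could look roughly like the following sketch, with scikit-learn random forests and DAgger-style data aggregation. The environment interface (env_rollout, oracle_action) is entirely hypothetical; the paper only specifies 20-tree forests, 10 iterations, and value-iteration oracles.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_predators(env_rollout, oracle_action, n_rounds=10, n_agents=4):
    """DAgger-style batch training for the predator policies (a sketch).

    env_rollout(policies) -> list of per-step joint state vectors; passing None
                             is assumed to roll out the experts themselves.
    oracle_action(state, k) -> the value-iteration expert's move for agent k.
    """
    data = [([], []) for _ in range(n_agents)]       # aggregated (states, actions)
    policies = [RandomForestClassifier(n_estimators=20) for _ in range(n_agents)]
    for rnd in range(n_rounds):
        states = env_rollout(None if rnd == 0 else policies)
        for s in states:
            for k in range(n_agents):
                data[k][0].append(s)
                data[k][1].append(oracle_action(s, k))   # dynamic-oracle correction
        for k in range(n_agents):                        # retrain on aggregated data
            policies[k].fit(np.stack(data[k][0]), np.array(data[k][1]))
    return policies
```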
4.2. Multi-Agent Imitation Learning for Soccer

Setting. Soccer is a popular domain for multi-agent learning. RoboCup, the robotic and simulation soccer platform, is perhaps the most popular testbed for multi-agent reinforcement learning research to date (Stone, 2016). The success of MARL there has been limited, however, due to the extremely high dimensionality of the problem. In this experiment, we aim to learn multi-agent policies for team soccer defense, based on tracking data from real-life professional soccer (Bialkowski et al., 2014).

Data. We use tracking data from 45 games of real professional soccer from a recent European league. The data was chunked into sequences with one team attacking and the other defending. Our goal is to learn up to 10 policies for team defense (11 players per team, minus the goalkeeper). The training data consists of 7500 sets of trajectories A = {A_1, ..., A_10}, where A_k = {a_{t,k}}_{t=1}^T is the sequence of positions of one defensive player, and C is the context consisting of the opponents and the ball. Overall, there are about 1.3 million frames at 10 frames per second. The average sequence length is 176 steps, and the maximum is 1480.

Experiment Setup. Each policy is represented by a recurrent neural network (LSTM) with two hidden layers of 512 units each. As LSTMs generally require fixed-length input sequences, we further chunk each trajectory into sub-sequences of length 50, with an overlapping window of 25 time steps. The joint multi-agent imitation learning procedure follows Algorithm 2 closely. In this setup, without access to dynamic oracles for imitation learning in the style of SEARN (Daumé III et al., 2009) and DAgger (Ross et al., 2011), we gradually increase the horizon of the rolled-out trajectories from 1 to 10 steps of look-ahead. Empirically, this has the effect of stabilizing the policy networks early in training, and limits the cascading errors caused by rolling out to longer horizons. The structured model component is learned via stochastic variational inference on a continuous HMM, where the per-state emission distribution is a mixture of Gaussians. Training and inference operate on the same mini-batches used for joint policy learning.

We compare against two variations. The first employs a centralized policy that aggregates the state vectors of all decentralized learners and produces the actions of all players, i.e., a multi-task policy. The centralized approach generally requires more model parameters, but is potentially much more accurate. The second variation does not employ joint multi-agent training: we modify Algorithm 2 so that states are not cross-updated between agents, and each role is trained conditioned on the ground truth of the other agents.
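As an illustration of the per-role policy networks described above, here is a PyTorch sketch with two LSTM layers of 512 units operating on length-50 windows with a 25-step overlap. The paper does not specify the framework or the output head, so those details are our assumptions.

```python
import torch
import torch.nn as nn

class RolePolicy(nn.Module):
    """Per-role policy sketch: a 2-layer, 512-unit LSTM mapping a window of
    state features to predicted player positions (assumed 2-D actions)."""
    def __init__(self, state_dim, action_dim=2, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, states):            # states: (batch, 50, state_dim)
        h, _ = self.lstm(states)
        return self.head(h)               # (batch, 50, action_dim)

def windows(sequence, length=50, stride=25):
    """Chunk a trajectory into fixed-length, overlapping training windows."""
    return [sequence[i:i + length] for i in range(0, len(sequence) - length + 1, stride)]
```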
Figure 5. Experimental results on the soccer domain. We see that using coordination substantially improves the imitation loss, and that the decentralized policy is comparable to the centralized one.

Results. Figure 5 shows the results. Our coordinated learning approach substantially outperforms conventional imitation learning without structured coordination. The imitation loss measures the average distance between roll-outs and ground truth in meters (note that the typical size of a soccer field is 110 × 70 meters). As expected, the average loss increases with longer sequences, due to cascading errors. However, this error scales sub-linearly with the length of the horizon, even though the policies were trained on sequences of length 50. Note also that the performance difference between decentralized and centralized policies is insignificant compared to the gap between coordinated and unstructured policies, further highlighting the benefits of structured coordination in multi-agent settings. The loss of the single-network, non-joint training scheme is very large and thus omitted from Figure 5 (see the Appendix).

Visualizations. Imitation loss, of course, is not a full reflection of the quality of the learned policies. Unlike predator-prey, a long-term reward signal is not available, so we rely on visual inspection as part of the evaluation. Figure 6 overlays the policy predictions on top of the actual game sequence from Figure 1. Additional test examples are included in our supplemental video (http://hoangminhle.github.io). We note that the learned policies are qualitatively similar to the ground-truth demonstrations, and can be useful for applications such as counterfactual replay analysis (Le et al., 2017).

Figure 6. Result of 10 coordinated imitation policies, corresponding to Figure 1. White shows the rolled-out imitation policies.

Figure 7 displays the Gaussian components of the underlying HMM. The components correspond to the dominant modes of the assigned roles. Unlike the predator-prey domain, roles can be switched during a sequence of play. See the Appendix for more details on role swap frequency.

Figure 7. Components of the role distributions, corresponding to a popular formation arrangement in professional soccer.

5. Other Related Work

The problem of multi-agent imitation learning has not been widely considered, perhaps with the exception of (Chernova & Veloso, 2007), which focused on very different applications and technical challenges (i.e., learning a model of a joint task by collecting samples from direct interaction with teleoperating human teachers). The learning algorithm there requires the learner to collect enough data points from human teachers for confident classification of the task, and it is not clear how well that approach would translate to other domains.

Index-free policy learning is generally difficult for black-box machine learning techniques. Some recent work has called attention to the importance of order in learning when inputs or outputs are sets (Vinyals et al., 2015), motivated by classic algorithmic and geometric problems such as learning to sort a set of numbers, or finding the convex hull of a set of points, where no clear indexing mechanism exists. Other permutation-invariant approaches include those for standard classification (Shivaswamy & Jebara, 2006).

6. Limitations and Future Work

In principle, the training and inference of the latent structure model can accommodate different types of graphical models. However, the exact procedure varies depending on the graph structure. It would be interesting to find domains that can benefit from more general graphical models. Another possible direction is to develop fully end-to-end differentiable training methods that can accommodate our index-free policy learning formulation, especially deep learning-based methods that could provide a computational speed-up compared to traditional graphical model inference. One potential issue with the end-to-end approach is the need to depart from a learning-reductions style approach. Although we addressed learning from demonstrations in this paper, the proposed framework can also be employed for generative modeling, or for more efficient structured exploration in reinforcement learning.
Along that line, our proposed method could serve as a useful component of general reinforcement learning, especially in multi-agent settings where traditional exploration-based approaches such as Q-learning prove computationally intractable.

Acknowledgement. This work was funded in part by NSF Awards #1564330 & #1637598, JPL PDF IAMS100224, a Bloomberg Data Science Research Grant, and a gift from Northrop Grumman.

References

Beal, Matthew James. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of London, United Kingdom, 2003.

Benda, Miroslav. On optimal cooperation of knowledge sources. Technical Report BCS-G2010-28, 1985.

Bengio, Samy, Vinyals, Oriol, Jaitly, Navdeep, and Shazeer, Noam. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171-1179, 2015.

Bialkowski, Alina, Lucey, Patrick, Carr, Peter, Yue, Yisong, and Matthews, Iain. "Win at home and draw away": Automatic formation analysis highlighting the differences in home and away team behaviors. In MIT Sloan Sports Analytics Conference (SSAC), 2014.

Blei, David M, Kucukelbir, Alp, and McAuliffe, Jon D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.

Busoniu, Lucian, Babuska, Robert, and De Schutter, Bart. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156, 2008.

Chernova, Sonia and Veloso, Manuela. Multiagent collaborative task learning through imitation. In Proceedings of the Fourth International Symposium on Imitation in Animals and Artifacts, pp. 74-79, 2007.

Daumé III, Hal, Langford, John, and Marcu, Daniel. Search-based structured prediction. Machine Learning, 75(3):297-325, 2009.

Dudik, Miroslav, Phillips, Steven J, and Schapire, Robert E. Performance guarantees for regularized maximum entropy density estimation. In International Conference on Computational Learning Theory, pp. 472-486. Springer, 2004.

Grandvalet, Yves, Bengio, Yoshua, et al. Semi-supervised learning by entropy minimization. In NIPS, volume 17, pp. 529-536, 2004.

Guestrin, Carlos, Lagoudakis, Michail, and Parr, Ronald. Coordinated reinforcement learning. In ICML, volume 2, pp. 227-234, 2002.

Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Hoffman, Matt and Blei, David. Structured stochastic variational inference. CoRR, abs/1404.4114, 2014.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John William. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Johnson, Matthew James and Willsky, Alan S. Stochastic variational inference for Bayesian time series models. In ICML, pp. 1854-1862, 2014.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingston, Peter and Egerstedt, Magnus. Index-free multi-agent systems: An Eulerian approach. IFAC Proceedings Volumes, 43(19):215-220, 2010.

Kloder, Stephen and Hutchinson, Seth. Path planning for permutation-invariant multirobot formations. IEEE Transactions on Robotics, 22(4):650-665, 2006.

Kok, Jelle R, Spaan, Matthijs TJ, Vlassis, Nikos, et al. Multi-robot decision making using coordination graphs. In Proceedings of the 11th International Conference on Advanced Robotics (ICAR), volume 3, pp. 1124-1129, 2003.
Krause, Andreas, Perona, Pietro, and Gomes, Ryan G. Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems, pp. 775-783, 2010.

Le, Hoang, Kang, Andrew, Yue, Yisong, and Carr, Peter. Smooth imitation learning for online sequence prediction. In Proceedings of the 33rd International Conference on Machine Learning, pp. 680-688, 2016.

Le, Hoang M., Carr, Peter, Yue, Yisong, and Lucey, Patrick. Data-driven ghosting using deep imitation learning. In MIT Sloan Sports Analytics Conference (SSAC), 2017.

Panait, Liviu and Luke, Sean. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387-434, 2005.

Papadimitriou, Christos H and Steiglitz, Kenneth. Combinatorial Optimization: Algorithms and Complexity. Courier Corporation, 1982.

Ross, Stephane, Gordon, Geoff, and Bagnell, J. Andrew. A reduction of imitation learning and structured prediction to no-regret online learning. In Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Shivaswamy, Pannagadatta K and Jebara, Tony. Permutation invariant SVMs. In Proceedings of the 23rd International Conference on Machine Learning, pp. 817-824. ACM, 2006.

Stone, Peter. What's hot at RoboCup. In The AAAI Conference on Artificial Intelligence (AAAI), 2016.

Stone, Peter and Veloso, Manuela. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345-383, 2000.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Vinyals, Oriol, Bengio, Samy, and Kudlur, Manjunath. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.