# Learning Policy Representations in Multiagent Systems

Aditya Grover 1  Maruan Al-Shedivat 2  Jayesh K. Gupta 1  Yura Burda 3  Harrison Edwards 3

1 Stanford University  2 Carnegie Mellon University  3 OpenAI. Correspondence to: Aditya Grover.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Modeling agent behavior is central to understanding the emergence of complex phenomena in multiagent systems. Prior work in agent modeling has largely been task-specific and driven by hand-engineering domain-specific prior knowledge. We propose a general learning framework for modeling agent behavior in any multiagent system using only a handful of interaction data. Our framework casts agent modeling as a representation learning problem. Consequently, we construct a novel objective inspired by imitation learning and agent identification and design an algorithm for unsupervised learning of representations of agent policies. We demonstrate empirically the utility of the proposed framework in (i) a challenging high-dimensional competitive environment for continuous control and (ii) a cooperative environment for communication, on supervised predictive tasks, unsupervised clustering, and policy optimization using deep reinforcement learning.

1. Introduction

Intelligent agents rarely act in isolation in the real world and often seek to achieve their goals through interaction with other agents. Such interactions give rise to rich, complex behaviors formalized as per-agent policies in a multiagent system (Ferber, 1999; Wooldridge, 2009). Depending on the underlying motivations of the agents, interactions could be directed towards achieving a shared goal in a collaborative setting, opposing another agent in a competitive setting, or be a mixture of these in a setting where agents collaborate in teams to compete against other teams. Learning useful representations of the policies of agents based on their interactions is an important step towards characterization of agent behavior and, more generally, inference and reasoning in multiagent systems.

In this work, we propose an unsupervised encoder-decoder framework for learning continuous representations of agent policies given access to only a few episodes of interaction. For any given agent, the representation function is an encoder that learns a mapping from an interaction (i.e., one or more episodes of observation and action pairs involving the agent) to a continuous embedding vector. Using such embeddings, we condition a policy network (decoder) and train it simultaneously with the encoder to imitate other interactions involving the same (or a coupled) agent. Additionally, we can explicitly discriminate between the embeddings corresponding to different agents using triplet losses.

For the embeddings to be useful, the representation function should generalize to both unseen interactions and unseen agents for novel downstream tasks. Generalization is well-understood in the context of supervised learning, where a good model is expected to attain similar train and test performance. For multiagent systems, we consider a notion of generalization based on agent-interaction graphs. An agent-interaction graph provides an abstraction for distinguishing the agents (nodes) and interactions (edges) observed during training, validation, and testing.
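To make the graph abstraction concrete, below is a minimal Python sketch of an agent-interaction graph in which each edge stores the interaction episodes between a pair of agents together with a train/validation/test tag. The class and method names are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

class AgentInteractionGraph:
    """Agents are nodes; each undirected edge holds the episodes observed
    between a pair of agents, tagged with the split it belongs to."""

    def __init__(self):
        self.agents = set()                 # nodes
        self.edges = defaultdict(list)      # (i, j) -> list of episodes
        self.split = {}                     # (i, j) -> "train" | "val" | "test"

    def add_interaction(self, i, j, episodes, split="train"):
        key = tuple(sorted((i, j)))
        self.agents.update(key)
        self.edges[key].extend(episodes)
        self.split[key] = split

    def training_edges(self):
        return {k: v for k, v in self.edges.items() if self.split[k] == "train"}

# Weak generalization: a held-out edge between agents seen during training.
# Strong generalization: held-out edges that also introduce unseen agents.
g = AgentInteractionGraph()
g.add_interaction("Alice", "Charlie", episodes=[], split="train")
g.add_interaction("Alice", "Bob", episodes=[], split="test")    # weak
g.add_interaction("Bob", "Davis", episodes=[], split="test")    # strong
```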
Our framework is agnostic to the nature of interactions in multiagent systems, and hence broadly applicable to competitive and cooperative environments. In particular, we consider two multiagent environments: (i) a competitive continuous control environment, RoboSumo (Al-Shedivat et al., 2018), and (ii) a ParticleWorld environment of cooperative communication where agents collaborate to achieve a common goal (Mordatch & Abbeel, 2018).

For evaluation, we show how representations learned by our framework are effective for downstream tasks that include clustering of agent policies (unsupervised), classification such as win or loss outcomes in competitive systems (supervised), and policy optimization (reinforcement). In the case of policy optimization, we show how these representations can serve as privileged information for better training of agent policies. In RoboSumo, we train agent policies that can condition on the opponent's representation and achieve superior win rates much more quickly as compared to an equally expressive baseline policy with the same number of parameters. In ParticleWorld, we train speakers that can communicate more effectively with a much wider range of listeners given knowledge of their representations.

2. Preliminaries

In this section, we present the necessary background and notation relevant to the problem setting of this work.

Markov games. We use the classical framework of Markov games (Littman, 1994) to represent multiagent systems. A Markov game extends the general formulation of partially observable Markov decision processes (POMDP) to the multiagent setting. In a Markov game, we are given a set of $n$ agents on a state space $S$ with action spaces $A_1, A_2, \ldots, A_n$ and observation spaces $O_1, O_2, \ldots, O_n$, respectively. At every time step $t$, an agent $i$ receives an observation $o_i^{(t)} \in O_i$ and executes an action $a_i^{(t)} \in A_i$ based on a stochastic policy $\pi^{(i)}: O_i \times A_i \to [0, 1]$. Based on the executed action, the agent receives a reward $r_i^{(t)}: S \times A_i \to \mathbb{R}$ and the next observation $o_i^{(t+1)}$. The state dynamics are determined by a transition function $T: S \times A_1 \times \cdots \times A_n \to S$. The agent policies are trained to maximize their own expected reward $r_i = \sum_{t=1}^{H} r_i^{(t)}$ over a time horizon $H$.

Extended Markov games. In this work, we are interested in interactions that involve not all but only a subset of agents. For this purpose, we generalize Markov games as follows. First, we augment the action space of each agent with a NO-OP (i.e., no action). Then, we introduce a problem parameter, $2 \le k \le n$, with the following semantics. During every rollout of the Markov game, all but $k$ agents deterministically execute the NO-OP operator while the $k$ agents execute actions as per the policies defined on the original observation and action spaces. Accordingly, we assume that each agent receives rewards only in the interaction episode it participates in. Informally, the extension allows for multiagent systems where all agents do not necessarily have to participate simultaneously in an interaction. For instance, this allows us to consider one-vs-one multiagent tournaments where only two players participate in any given match.

To further introduce the notation, consider a multiagent system as a generalized Markov game. We denote the set of agent policies with $\mathcal{P} = \{\pi^{(i)}\}_{i=1}^n$ and interaction episodes with $\mathcal{E} = \{E_{M_j}\}_{j=1}^m$, where $M_j \subseteq \{1, 2, \ldots, n\}$, $|M_j| = k$ is the set of $k$ agents participating in episode $E_{M_j}$.
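As a concrete illustration of the extended Markov game, the sketch below rolls out a single interaction episode in which only a subset of agents act and every other agent emits a NO-OP. The environment interface (`env_step`) and the data layout are assumptions made for illustration only.

```python
NO_OP = None  # augmented "no action" for non-participating agents

def rollout(env_step, policies, participants, horizon):
    """Roll out one interaction episode E_M in the extended Markov game.

    `policies` maps agent id -> callable obs -> action; `participants` is the
    set M of k agents that act, while everyone else executes NO_OP. `env_step`
    is a hypothetical transition: joint_action -> per-agent observations.
    """
    episode = {i: [] for i in participants}   # per-agent (obs, action) pairs
    obs = {i: None for i in policies}          # initial observations (placeholder)
    for _ in range(horizon):
        joint_action = {
            i: (policies[i](obs[i]) if i in participants else NO_OP)
            for i in policies
        }
        for i in participants:
            episode[i].append((obs[i], joint_action[i]))
        obs = env_step(joint_action)
    return episode

# With k = 2, interaction data can be indexed by unordered agent pairs:
# E[(i, j)] is a list of episodes, and E[(i, j)][m][i] gives agent i's
# (observation, action) pairs in the m-th episode, i.e., the data E_ij^(i).
```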
To simplify presentation for the rest of the paper, we assume $k = 2$ and, consequently, denote the set of interaction episodes between agents $i$ and $j$ as $E_{ij}$. A single episode, $e_{ij} \in E_{ij}$, consists of a sequence of observations and actions for the specified time horizon, $H$.

Imitation learning. Our approach to learning policy representations relies on behavioral cloning (Pomerleau, 1991), a type of imitation learning where we train a mapping from observations to actions in a supervised manner. Although there exist other imitation learning algorithms (e.g., inverse reinforcement learning, Abbeel & Ng, 2004), our framework is largely agnostic to the choice of the algorithm, and we restrict our presentation to behavioral cloning, leaving other imitation learning paradigms to future work.

3. Learning framework

The dominant paradigm for unsupervised representation learning is to optimize the parameters of a representation function that can best explain or generate the observed data. For instance, the skip-gram objective used for language and graph data learns representations of words and nodes predictive of representations of surrounding context (Mikolov et al., 2013; Grover & Leskovec, 2016). Similarly, autoencoding objectives, often used for image data, learn representations that can reconstruct the input (Bengio et al., 2009).

In this work, we wish to learn a representation function that maps episode(s) from an agent policy, $\pi^{(i)} \in \Pi$, to a real-valued vector embedding, where $\Pi$ is a class of representable policies. That is, we optimize for the parameters $\theta$ of a function $f_\theta: \mathcal{E} \to \mathbb{R}^d$, where $\mathcal{E}$ denotes the space of episodes corresponding to a policy and $d$ is the dimension of the embedding. Here, we have assumed the agent policies are black-boxes, i.e., we can only access them based on interaction episodes with other agents in a Markov game. Hence, for every agent $i$, we wish to learn policies using $E_i = \cup_j E_{ij}^{(i)}$. Here, $E_{ij}^{(i)}$ refers to the episode data for interactions between agents $i$ and $j$, but consisting of only the observation and action pairs of agent $i$. For a multiagent system, we propose the following auxiliary tasks for learning a good representation of an agent's policy:

1. Generative representations. The representation should be useful for simulating the agent's policy.
2. Discriminative representations. The representation should be able to distinguish the agent's policy from the policies of other agents.

Accordingly, we now propose generative and discriminative objectives for representation learning in multiagent systems.

3.1. Generative representations via imitation learning

Imitation learning does not require direct access to the reward signal, making it an attractive task for unsupervised representation learning. Formally, we are interested in learning a policy $\pi_\phi^{(i)}: S \times A \to [0, 1]$ for an agent $i$ given access to observation and action pairs from interaction episode(s) involving the agent. For behavioral cloning, we maximize the following (negative) cross-entropy objective:

$$\mathbb{E}_{e}\Big[\sum_{\langle o, a \rangle \sim e} \log \pi_\phi^{(i)}(a \mid o)\Big]$$

where the expectation is over interaction episodes $e$ of agent $i$ and the optimization is over the parameters $\phi$.

Algorithm 1: Learn Policy Embedding Function ($f_\theta$)

input: $\{E_i\}_{i=1}^n$ interaction episodes, $\lambda$ hyperparameter
1:  Initialize θ and φ
2:  for i = 1, 2, ..., n do
3:      Sample a positive episode e_+ ∼ E_i
4:      Sample a reference episode e_* ∼ E_i \ e_+
5:      Compute Im_loss ← −Σ_{⟨o,a⟩ ∼ e_+} log π_{φ,θ}(a | o, e_*)
6:      for j = 1, 2, ..., n do
7:          if j ≠ i then
8:              Sample a negative episode e_- ∼ E_j
9:              Compute Id_loss ← d_θ(e_+, e_-, e_*)
10:             Set Loss ← Im_loss + λ · Id_loss
11:             Update θ and φ to minimize Loss
12:         end if
13:     end for
14: end for
output: θ
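Below is a minimal PyTorch sketch of Algorithm 1, assuming discrete actions, episodes stored as tensors, and the per-pair MLP encoder with average pooling described later in Section 5. All module names, dimensions, and the data format are illustrative assumptions rather than the authors' implementation.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodeEncoder(nn.Module):
    """f_theta: embeds an episode by averaging per-step MLP features."""
    def __init__(self, obs_dim, act_dim, embed_dim=100, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim))

    def forward(self, obs, act):            # obs: [T, obs_dim], act: [T, act_dim]
        return self.net(torch.cat([obs, act], dim=-1)).mean(dim=0)

class ConditionalPolicy(nn.Module):
    """pi_{phi,theta}: action distribution conditioned on obs and an embedding."""
    def __init__(self, obs_dim, n_actions, embed_dim=100, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def log_prob(self, obs, act_idx, emb):  # obs: [T, obs_dim], act_idx: [T] (long)
        logits = self.net(torch.cat([obs, emb.expand(obs.size(0), -1)], dim=-1))
        return F.log_softmax(logits, dim=-1).gather(1, act_idx.unsqueeze(1)).sum()

def triplet_loss(ref, pos, neg):
    """Eq. (2): squared-softmax triplet loss on embeddings."""
    return (1 + torch.exp(torch.norm(ref - neg) - torch.norm(ref - pos))) ** -2

def train_step(encoder, policy, optimizer, episodes, lam=0.1):
    """One pass over agents following Algorithm 1. `episodes[i]` is a list of
    (obs, act_onehot, act_idx) tuples for agent i (illustrative format)."""
    n = len(episodes)
    for i in range(n):
        e_pos, e_ref = random.sample(episodes[i], 2)
        for j in range(n):
            if j == i:
                continue
            e_neg = random.choice(episodes[j])
            emb_ref = encoder(e_ref[0], e_ref[1])     # reference embedding
            emb_pos = encoder(e_pos[0], e_pos[1])     # positive embedding
            emb_neg = encoder(e_neg[0], e_neg[1])     # negative embedding
            im_loss = -policy.log_prob(e_pos[0], e_pos[2], emb_ref)
            id_loss = triplet_loss(emb_ref, emb_pos, emb_neg)
            loss = im_loss + lam * id_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Usage (shapes are illustrative):
# enc = EpisodeEncoder(obs_dim=10, act_dim=4)
# pol = ConditionalPolicy(obs_dim=10, n_actions=4)
# opt = torch.optim.Adam(list(enc.parameters()) + list(pol.parameters()), lr=1e-3)
```

The per-pair inner loop mirrors the pseudocode literally; in practice one would batch these updates rather than stepping the optimizer once per (i, j) pair.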
Learning individual policies for every agent can be computationally and statistically prohibitive for large-scale multiagent systems, especially when the number of interaction episodes per agent is small. Moreover, it precludes generalization across the behaviors of such agents. On the other hand, learning a single policy for all agents increases sample efficiency but comes at the cost of reduced modeling flexibility in simulating diverse agent behaviors. We offset this dichotomy by learning a single conditional policy network. To do so, we first specify a representation function, $f_\theta: \mathcal{E} \to \mathbb{R}^d$, with parameters $\theta$, where $\mathcal{E}$ represents the space of episodes. We use this embedding to condition the policy network. Formally, the policy network is denoted by $\pi_{\phi,\theta}: S \times A \times \mathcal{E} \to [0, 1]$, and $\phi$ are the parameters for the function mapping the agent observation and embedding to a distribution over the agent's actions. The parameters $\theta$ and $\phi$ for the conditional policy network are learned jointly by maximizing the following objective:

$$\sum_{i=1}^n \mathbb{E}_{e_1 \sim E_i,\, e_2 \sim E_i \setminus e_1} \Big[ \sum_{\langle o, a \rangle \sim e_1} \log \pi_{\phi,\theta}(a \mid o, e_2) \Big] \qquad (1)$$

For every agent, the objective function samples two distinct episodes $e_1$ and $e_2$. The observation and action pairs from $e_2$ are used to learn an embedding $f_\theta(e_2)$ that conditions the policy network trained on observation and action pairs from $e_1$. The conditional policy network shares statistical strength through a common set of parameters for the policy network and the representation function across all agents.

3.2. Discriminative representations via identification

An intuitive requirement for any representation function learned for a multiagent system is that the embeddings should reflect characteristics of an agent's behavior that distinguish it from other agents. To do so in an unsupervised manner, we propose an objective for agent identification based on the triplet loss directly in the space of embeddings. To learn a representation for agent $i$ based on interaction episodes, we use the representation function $f_\theta$ to compute three sets of embeddings: (i) a positive embedding for an episode $e_+ \in E_i$ involving agent $i$, (ii) a negative embedding for an episode $e_- \in E_j$ involving a random agent $j \neq i$, and (iii) a reference embedding for an episode $e_* \in E_i$, again involving agent $i$ but different from $e_+$. Given these embeddings, we define the triplet loss:

$$d_\theta(e_+, e_-, e_*) = \big(1 + \exp\{\lVert r_e - n_e \rVert_2 - \lVert r_e - p_e \rVert_2\}\big)^{-2} \qquad (2)$$

where $p_e = f_\theta(e_+)$, $n_e = f_\theta(e_-)$, and $r_e = f_\theta(e_*)$. Intuitively, the loss encourages the positive embedding to be closer to the reference embedding than the negative embedding, which makes the embeddings of the same agent tend to cluster together and be further away from embeddings of other agents. We note that various other notions of distance can also be used. The one presented above corresponds to a squared softmax objective (Hoffer & Ailon, 2015).

3.3. Hybrid generative-discriminative representations

Conditional imitation learning encourages $f_\theta$ to learn representations that can learn and simulate the entire policy of the agents, and agent identification incentivizes representations that can distinguish between agent policies. Both objectives are complementary, and we combine Eq. (1) and Eq. (2) to get the final objective used for representation learning:
$$\sum_{i=1}^n \mathbb{E}_{e_+ \sim E_i,\, e_* \sim E_i \setminus e_+} \Bigg[ \underbrace{\sum_{\langle o, a \rangle \sim e_+} \log \pi_{\phi,\theta}(a \mid o, e_*)}_{\text{imitation}} \; - \; \lambda \underbrace{\sum_{j \neq i} \mathbb{E}_{e_- \sim E_j} \big[ d_\theta(e_+, e_-, e_*) \big]}_{\text{agent identification}} \Bigg] \qquad (3)$$

where $\lambda > 0$ is a tunable hyperparameter that controls the relative weights of the discriminative and generative terms. The pseudocode for the proposed algorithm is given in Algorithm 1. In experiments, we parameterize the conditional policy $\pi_{\phi,\theta}$ using neural networks and use stochastic gradient-based methods for optimization.

4. Generalization in MAS

Generalization is well-understood for supervised learning: models that show similar train and test performance exhibit good generalization. To measure the quality of the learned representations for a multiagent system (MAS), we introduce a graphical formalism for reasoning about agents and their interactions.

Figure 1: (a) An example of an agent-interaction graph used for evaluating generalization in a multiagent system. Illustrations of the environments used in our experiments: (b) the competitive RoboSumo environment and (c) the cooperative ParticleWorld environment.

4.1. Generalization across agents & interactions

In many scenarios, we are interested in generalization of the policy representation function $f_\theta$ across novel agents and interactions in a multiagent system. For instance, we would like $f_\theta$ to output useful embeddings for a downstream task, even when evaluated with respect to unseen agents and interactions. This notion of generalization is best understood using agent-interaction graphs (Grover et al., 2018).

The agent-interaction graph describes interactions between a set of agent policies $\mathcal{P}$ and a set of interaction episodes $\mathcal{I}$ through a graph $G = (\mathcal{P}, \mathcal{I})$.[1] An example graph is shown in Figure 1a. The graph represents a multiagent system consisting of interactions between pairs of agents, and we will especially focus on the interactions involving Alice, Bob, Charlie, and Davis. The interactions could be competitive (e.g., a match between two agents) or cooperative (e.g., two agents communicating for a navigation task).

We learn the representation function $f_\theta$ on a subset of the interactions, denoted by the solid black edges in Figure 1a. At test time, $f_\theta$ is evaluated on some downstream task of interest. The agents and interactions observed at test time can be different from those used for training. In particular, we consider the following cases:

Weak generalization.[2] Here, we are interested in the generalization performance of the representation function on an unseen interaction between existing agents, all of which are observed during training. This corresponds to the red edge representing the interaction between Alice and Bob in Figure 1a. From the context of an agent-interaction graph, the test graph adds only edges to the train graph.

Strong generalization. Generalization can also be evaluated with respect to unseen agents (and their interactions). This corresponds to the addition of agents Charlie and Davis in Figure 1a. Akin to a few-shot learning setting, we observe a few of their interactions with existing agents Alice and Bob (green edges), and generalization is evaluated on unseen interactions involving Charlie and Davis (blue edges).
The test graph adds both nodes and edges to the train graph. For brevity, we skip discussion of weaker forms of generalization that involve evaluation of the test performance on unseen episodes of an existing training edge (black edge).

[1] If we have more than two participating agents per interaction episode, we could represent the interactions using a hypergraph.
[2] Also referred to as intermediate generalization by Grover et al. (2018).

4.2. Generalization across tasks

Since the representation function is learned using an unsupervised auxiliary objective, we test its generalization performance by evaluating the usefulness of these embeddings for the various kinds of downstream tasks described below.

Unsupervised. These embeddings can be used for clustering, visualization, and interpretability of agent policies in a low-dimensional space. Such semantic associations between the learned embeddings can be defined for a single agent, wherein we expect representations for the same agent based on distinct episodes to be embedded close to each other, or across agents, wherein agents with similar policies will have similar embeddings on average.

Supervised. Deep neural network representations are especially effective for predictive modeling. In a multiagent setting, the embeddings serve as useful features for learning agent properties and interactions, including assignment of role categories to agents with different skills in a collaborative setting, or prediction of win or loss outcomes of interaction matches between agents in a competitive setting.

Reinforcement. Finally, we can use the learned representation functions to improve generalization of the policies learned from a reinforcement signal in competitive and cooperative settings. We design policy networks that, in addition to observations, take embedding vectors of the opposing agents as inputs. The embeddings are computed from the past interactions of the opposing agent, either with the agent being trained or with other agents, using the representation function (Figure 2). Such embeddings play the role of privileged information and allow us to train a policy network that uses this information to learn faster and generalize better to opponents or cooperators unseen at training time.

Figure 2: Illustration of the proposed model for optimizing a policy $\pi_\psi$ that conditions on an embedding of the opponent policy $\pi_A$. At time $t$, the pre-trained representation function $f_\theta$ computes the opponent embedding based on a past interaction $e_{t-1}$. We optimize $\pi_\psi$ to maximize the expected rewards in its current interactions $e_t$ with the opponent.

5. Evaluation methodology & results

We evaluate the proposed framework for both competitive and collaborative environments on various downstream machine learning tasks. In particular, we use the RoboSumo and ParticleWorld environments for the competitive and collaborative scenarios, respectively. We consider the embedding objectives in Eq. (1), Eq. (2), and Eq. (3) independently and refer to them as Emb-Im, Emb-Id, and Emb-Hyb, respectively. The hyperparameter $\lambda$ for Emb-Hyb is chosen by grid search over $\lambda \in \{0.01, 0.05, 0.1, 0.5\}$ on a held-out set of interactions.

In all our experiments, the representation function $f_\theta$ is specified through a multi-layer perceptron (MLP) that takes as input an episode and outputs an embedding of that episode. In particular, the MLP takes as input a single (observation, action) pair to output an intermediate embedding. We average the intermediate embeddings for all (observation, action) pairs in an episode to output an episode embedding.
To condition a policy network on the embedding, we simply concatenate the observation fed as input to the network with the embedding. Experimental setup and other details beyond what we state below are deferred to the Appendix.

5.1. The RoboSumo environment

For the competitive environment, we use RoboSumo (Al-Shedivat et al., 2018), a 3D environment with simulated physics (based on MuJoCo (Todorov et al., 2012)) that allows agents to control multi-legged 3D robots and compete against each other in continuous-time wrestling games (Figure 1b). For our analysis, we train a diverse collection of 25 agents, some of which are trained via self-play and others are trained in pairs concurrently using the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017). We start with a fully connected agent-interaction graph (clique) of 25 agents. Every edge in this graph corresponds to 10 rollout episodes involving the corresponding agents. The maximum length (or horizon) of any episode is 500 time steps, after which the episode is declared a draw. To evaluate weak generalization, we sample a connected subgraph for training with approximately 60% of the edges preserved, and the remaining edges split equally for validation and testing. For strong generalization, we preserve 15 agents and their interactions with each other for training, and similarly, 5 agents and their within-group interactions each for validation and testing.

Table 1: Intra-inter clustering ratios (IICR) and accuracies for outcome prediction (Acc) for weak (W) and strong (S) generalization on RoboSumo.

|         | IICR (W) | IICR (S) | Acc (W) | Acc (S) |
|---------|----------|----------|---------|---------|
| Emb-Im  | 0.24     | 0.23     | 0.71    | 0.60    |
| Emb-Id  | 0.25     | 0.27     | 0.67    | 0.56    |
| Emb-Hyb | 0.22     | 0.21     | 0.73    | 0.56    |

5.1.1. EMBEDDING ANALYSIS

To evaluate the robustness of the embeddings, we compute multiple embeddings for each policy based on different episodes of interaction at test time. Our evaluation metric is based on the intra- and inter-cluster Euclidean distances between embeddings. The intra-cluster distance for an agent is the average pairwise distance between its embeddings computed on the set of test interaction episodes involving the agent. Similarly, the inter-cluster distance is the average pairwise distance between the embeddings of an agent with those of other agents. Let $T_i = \{t_c^{(i)}\}_{c=1}^{n_i}$ denote the set of test interaction embeddings for agent $i$. We define the intra-inter cluster ratio (IICR) as:

$$\text{IICR} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{n_i^2} \sum_{a=1}^{n_i} \sum_{b=1}^{n_i} \big\lVert t_a^{(i)} - t_b^{(i)} \big\rVert_2}{\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \big\lVert t_a^{(i)} - t_b^{(j)} \big\rVert_2}.$$

The intra-inter clustering ratios are reported in Table 1. A ratio less than 1 suggests that there is signal that identifies the agent, and the signal is stronger for lower ratios. Even though this task might seem especially suited for the agent identification objective, we interestingly find that Emb-Im attains lower clustering ratios than Emb-Id for both weak and strong generalization. Emb-Hyb outperforms both these methods. We qualitatively visualize the embeddings learned using Emb-Hyb by projecting them on the leading principal components, as shown in Figures 3a and 3b for 10 test interaction episodes of 5 randomly selected agents in the weak and strong generalization settings, respectively.
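A minimal NumPy sketch of this ratio, assuming the test-episode embeddings are grouped into one array per agent; the helper name and example data are illustrative only.

```python
import numpy as np

def iicr(embeddings_per_agent):
    """Intra-inter cluster ratio over per-agent embedding arrays.

    `embeddings_per_agent` is a list where entry i is an array of shape
    [n_i, d] holding agent i's test-episode embeddings. Lower is better.
    """
    def mean_pairwise(A, B):
        # average Euclidean distance between all rows of A and all rows of B
        diffs = A[:, None, :] - B[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()

    n = len(embeddings_per_agent)
    intra = np.mean([mean_pairwise(E, E) for E in embeddings_per_agent])
    inter = np.mean([
        mean_pairwise(embeddings_per_agent[i], embeddings_per_agent[j])
        for i in range(n) for j in range(n) if i != j
    ])
    return intra / inter

# Example: 3 agents, 10 test-episode embeddings each, 100-dimensional.
rng = np.random.default_rng(0)
emb = [rng.normal(loc=i, size=(10, 100)) for i in range(3)]
print(iicr(emb))  # well-separated clusters give a ratio below 1
```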
5.1.2. OUTCOME PREDICTION

We can use these embeddings directly for training a classifier to predict the outcome of an episode (win/loss/draw). For classification, we use an MLP with 3 hidden layers of 100 units each, and the learning objective minimizes the cross-entropy error. The input to the classifier is the embeddings of the two agents involved in the episode. The results are reported in Table 1. Again, imitation-based methods seem more suited for this task, with Emb-Hyb and Emb-Im outperforming the other methods for weak and strong generalization, respectively.

Figure 3: Embeddings learned using Emb-Hyb for 10 test interaction episodes of 5 agents, projected on the first three principal components: (a) RoboSumo, weak; (b) RoboSumo, strong; (c) ParticleWorld, weak; (d) ParticleWorld, strong. Color denotes agent policy.

Figure 4: Average win rates of the newly trained agents against 5 training agents and 5 testing agents. The left two charts compare the baseline (PPO) with policies that make use of Emb-Im, Emb-Id, and Emb-Hyb (all computed online). The right two charts compare different embeddings used at evaluation time (Emb-online, Emb-offline, Emb-zero, Emb-rand; all embedding-conditioned policies use Emb-Hyb). At each iteration, win rates were computed based on 50 1-on-1 games. Each agent was trained 3 times, each time from a different random initialization. Shaded regions correspond to 95% CI.

5.1.3. POLICY OPTIMIZATION

Here we ask whether embeddings can be used to improve learned policies in a reinforcement learning setting, both in terms of end performance and generalization. To this end, we select 5 training, 5 validation, and 5 testing opponents from the pool of 25 pre-trained agents. Next, we train a new agent with reinforcement learning to compete against the selected 5 training opponents; the agent is trained concurrently against all 5 opponents using a distributed version of the PPO algorithm, as described in Al-Shedivat et al. (2018). Throughout training, we evaluate new agents on the 5 testing opponents and record the average win and draw rates.

Using this setup, we compare a baseline agent with an MLP-based policy against an agent whose policy takes 100-dimensional embeddings of the opponents as additional inputs at each time step and uses that information to condition its behavior on the opponent's representation. The embeddings for each opponent are either computed online, i.e., based on an interaction episode rolled out during training at a previous time step (Figure 2), or offline, i.e., pre-computed before training the new agent using only interactions between the pre-trained opponents.

Figure 4 shows the average win rates against the set of training and testing opponents for the baseline and our agents that use different types of embeddings. While every new agent is able to achieve almost 100% win rate against the training opponents, we see that the agents that condition their policies on the opponent's embeddings perform better on the held-out set of opponents, i.e., generalize better, with the best performance achieved with Emb-Hyb. We also note that embeddings computed offline turn out to lead to better performance than if computed online.[3]
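The sketch below illustrates only the conditioning mechanism of Figure 2 for the online variant: the opponent embedding is computed from the previous interaction episode and concatenated with the observation at every step. The environment interface, network sizes, and discrete action space are assumptions made for clarity; the paper's agents use distributed PPO on a continuous control task.

```python
import torch
import torch.nn as nn

class OpponentConditionedPolicy(nn.Module):
    """Policy that concatenates a fixed opponent embedding with each observation."""
    def __init__(self, obs_dim, n_actions, embed_dim=100, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs, opp_emb):          # obs: [obs_dim], opp_emb: [embed_dim]
        logits = self.net(torch.cat([obs, opp_emb], dim=-1))
        return torch.distributions.Categorical(logits=logits)

def play_episode(env, policy, encoder, prev_episode, horizon=500):
    """Roll out one episode while conditioning on the embedding of the
    opponent's previous interaction (the "online" variant). `env` is a
    hypothetical single-agent view of the match; `prev_episode` holds the
    opponent's (obs, action) tensors from the last interaction."""
    with torch.no_grad():
        opp_emb = encoder(*prev_episode)      # f_theta applied to e_{t-1}
    obs = env.reset()
    transitions = []
    for _ in range(horizon):
        dist = policy(torch.as_tensor(obs, dtype=torch.float32), opp_emb)
        action = dist.sample()
        next_obs, reward, done, _ = env.step(action.item())
        transitions.append((obs, action, reward))  # fed to PPO in practice
        obs = next_obs
        if done:
            break
    return transitions
```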
As an ablation test, we also evaluate our agents when they are provided an incorrect embedding (either all zeros, Emb-zero, or an embedding selected for a different random opponent, Emb-rand) and observe that such embeddings lead to a degradation in performance.[4]

[3] Perhaps this is due to differences in the interactions of the opponents between themselves and with the new agent that the embedding network was not able to capture entirely.
[4] The performance decrease is most significant for Emb-zero, which is an out-of-distribution all-zeros vector.

Figure 5: Win, loss, and draw rates plotted for the first agent in each pair (Emb-Hyb vs. PPO, Emb-Hyb vs. Emb-Im, Emb-Hyb vs. Emb-Id, Emb-Im vs. PPO, Emb-Im vs. Emb-Id, Emb-Id vs. PPO). Each pair of agents was evaluated after each training iteration on 50 1-on-1 games; curves are based on 5 evaluation runs. Shaded regions correspond to 95% CI.

Figure 6: Win rates for the agents specified in each row, computed at iteration 1000.

Finally, to evaluate strong generalization in the RL setting, we pit the newly trained baseline and agents with embedding-conditioned policies against each other. Since the embedding network has never seen the new agents, it must exhibit strong generalization to be useful in such a setting. The results are given in Figures 5 and 6. Even though the margin is not very large, the agents that use Emb-Hyb perform the best on average.

5.2. The ParticleWorld environment

For the collaborative setting, we evaluate the framework on the ParticleWorld environment for cooperative communication (Mordatch & Abbeel, 2018; Lowe et al., 2017). The environment consists of a continuous 2D grid with 3 landmarks and two kinds of agents collaborating to navigate to a common landmark goal (Figure 1c). At the beginning of every episode, the speaker agent is shown the RGB color of a single target landmark on the grid. The speaker then communicates a fixed-length binary message to the listener agent. Based on the received messages, the listener agent then moves in a particular direction. The final reward, shared across the speaker and listener agents, is the distance of the listener to the target landmark after a fixed time horizon.

The agent-interaction graph for this environment is bipartite, with only cross edges between speaker and listener agents. Every interaction edge in this graph corresponds to 1000 rollout episodes, where the maximum length of any episode is 25 steps. We pretrain 28 MLP-parameterized speaker and listener agent policies. Every speaker learns through communication with only two different listeners and vice-versa, giving an extremely sparse agent-interaction graph.

Table 2: Intra-inter clustering ratios (IICR) for weak (W) and strong (S) generalization on ParticleWorld. Lower is better.

|         | IICR (W) | IICR (S) |
|---------|----------|----------|
| Emb-Im  | 0.58     | 0.86     |
| Emb-Id  | 0.50     | 0.82     |
| Emb-Hyb | 0.54     | 0.85     |

Table 3: Average train and test rewards for speaker policies on ParticleWorld.

|                  | Train | Test  |
|------------------|-------|-------|
| MADDPG           | 11.66 | 18.99 |
| MADDPG + Emb-Im  | 11.68 | 17.75 |
| MADDPG + Emb-Id  | 11.68 | 17.68 |
| MADDPG + Emb-Hyb | 11.77 | 17.20 |

We explicitly encoded diversity in these speaker and listener agents by masking bits in the communication channel. In particular, we masked 1 or 2 randomly selected bits for every speaker agent in the graph to give a total of $\binom{7}{1} + \binom{7}{2} = 28$ distinct speaker agents.
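A quick sketch of this construction; the 7-bit message length is inferred from the count above, and the masking helper is purely illustrative.

```python
from itertools import combinations
from math import comb

MSG_LEN = 7   # assumed message length implied by the count below

# Each distinct speaker is defined by the subset of message bits it cannot use.
masks = [frozenset(c) for r in (1, 2) for c in combinations(range(MSG_LEN), r)]
assert len(masks) == comb(7, 1) + comb(7, 2) == 28

def apply_mask(message_bits, masked_positions):
    """Zero out the masked bits of a binary message (illustrative helper)."""
    return [0 if i in masked_positions else b for i, b in enumerate(message_bits)]

print(apply_mask([1, 1, 0, 1, 1, 0, 1], masks[0]))
```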
Depending on the neighboring speaker agents in the agent-interaction graph, the listener agents also show diversity in the learned policies. The policies are learned using multiagent deep deterministic policy gradients (MADDPG, Lowe et al., 2017).

In this environment, the speakers and listeners are tightly coupled. Hence, we vary the setup used previously in the competitive scenario. We wish to learn embeddings of listeners based on their interactions with speakers. Since the agent-interaction graph is bipartite, we use the embeddings of listener agents to condition a shared policy network for the respective speaker agents.

5.2.1. EMBEDDING ANALYSIS

For the weak generalization setting, we remove an outgoing edge from every listener agent in the original graph to obtain the training graph. In the case of strong generalization, we set aside 7 listener agents (and their outgoing edges) each for validation and testing, while the representation function is learned on the remaining 14 listener agents and their interactions. The intra-inter clustering ratios are shown in Table 2, and the projections of the embeddings learned using Emb-Hyb are visualized in Figure 3c and Figure 3d for weak and strong generalization, respectively. In spite of the high degree of sparsity in the training graph, the intra-inter clustering ratio for the test interaction embeddings is less than unity, suggesting an agent-specific signal. Emb-Id works particularly well in this environment, achieving the best results for both weak and strong generalization.

5.2.2. POLICY OPTIMIZATION

Here, we are interested in learning speaker agents that can communicate more effectively with a much wider range of listeners given knowledge of their embeddings. Referring back to Figure 2, we learn a policy $\pi_\psi$ for a speaker agent that conditions on the representation function $f_\theta$ for the listener agents. For cooperative communication, we consider interactions with 14 pre-trained listener agents split as 6 training, 4 validation, and 4 test agents.[5] Similar to the competitive setting, we compare performance against a baseline speaker agent that does not have access to any privileged information about the listeners. We summarize the results for the best validated models during training, evaluated over 100 interaction episodes per test listener agent across 5 initializations, in Table 3. From the results, we observe that online embedding-based methods can generalize better than the baseline methods. The baseline MADDPG achieves the lowest training error, but fails to generalize well enough and incurs a low average reward for the test listener agents.

[5] None of the methods considered were able to learn a non-trivial speaker agent when trained simultaneously with all 28 listener agents. Hence, we simplified the problem by considering the 14 listener agents that attained the best rewards during pretraining.

6. Discussion & Related Work

Agent modeling is a well-studied topic within multiagent systems; see Albrecht & Stone (2017) for an excellent recent survey on this subject. The vast majority of the literature is concerned with learning models for a specific predictive task. Predictive tasks are typically defined over actions, goals, and beliefs of other agents (Stone & Veloso, 2000). In competitive domains such as Poker and Go, such tasks are often integrated with domain-specific heuristics to model opponents and learn superior policies (Rubin & Watson, 2011; Mnih et al., 2015). Similarly, intelligent tutoring systems take into account pedagogical features of students and teachers to accelerate learning of desired behaviors in a collaborative environment (McCalla et al., 2000).
In this work, we proposed an approach for modeling agent behavior in multiagent systems through unsupervised representation learning of agent policies. Since we sidestep any domain-specific assumptions and learn in an unsupervised manner, our framework learns representations that are useful for several downstream tasks. This extends the use of deep neural networks in multiagent systems to applications beyond traditional reinforcement learning and predictive modeling (Mnih et al., 2015; Hoshen, 2017).

Both the generative and discriminative components of our framework have been explored independently in prior work. Imitation learning has been extensively studied in the single-agent setting, and recent work by Le et al. (2017) proposes an algorithm for imitation in a coordinated multiagent system. Wang et al. (2017) proposed an imitation learning algorithm for learning robust controllers with few expert demonstrations in a single-agent setting that conditions the policy network on an inference network, similar to the encoder in our framework. In another recent work, Li et al. (2017) propose an algorithm for learning interpretable representations using generative adversarial imitation learning. Agent identification, which represents the discriminative term in the learning objective, is inspired by triplet losses and Siamese networks that are used for learning representations of data using distance comparisons (Hoffer & Ailon, 2015).

A key contribution of this work is a principled methodology for evaluating generalization of representations in multiagent systems based on the graphs of the agent interactions. Graphs are a fundamental abstraction for modeling relational data, such as the interactions arising in multiagent systems (Zhou et al., 2016a;b; Chen et al., 2017; Battaglia et al., 2016; Hoshen, 2017), and concurrent work proposes to learn such graphs directly from data (Kipf et al., 2018).

7. Conclusion & Future Work

In this work, we presented a framework for learning representations of agent policies in multiagent systems. The agent policies are accessed using a few interaction episodes with other agents. Our learning objective is based on a novel combination of a generative component based on imitation learning and a discriminative component for distinguishing the embeddings of different agent policies. Our overall framework is unsupervised, sample-efficient, and domain-agnostic, and hence can be readily extended to many environments and downstream tasks. Most importantly, we showed the role of these embeddings as privileged information for learning more adaptive agent policies in both collaborative and competitive settings.

In the future, we would like to explore multiagent systems with more than two agents participating in the interactions. Semantic interpolation of policies directly in the embedded space, in order to quickly obtain a policy with desired behaviors, is another promising direction. Finally, it would be interesting to extend and evaluate the proposed framework to learn representations for history-dependent policies, such as those parameterized by long short-term memory networks.
Acknowledgements

We are thankful to Lisa Lee, Daniel Levy, Jiaming Song, and everyone at OpenAI for helpful comments and discussion. AG is supported by a Microsoft Research PhD Fellowship. MA is partially supported by NIH R01GM114311. JKG is partially supported by the Army Research Laboratory through the Army High Performance Computing Research Center under Cooperative Agreement W911NF-07-2-0027.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004.

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations, 2018.

Albrecht, S. V. and Stone, P. Autonomous agents modeling other agents: A comprehensive survey and open problems. arXiv preprint arXiv:1709.08071, 2017.

Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, 2016.

Bengio, Y. et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.

Chen, M., Zhou, Z., and Tomlin, C. J. Multiplayer reach-avoid games via pairwise outcomes. IEEE Transactions on Automatic Control, 62(3):1451-1457, 2017.

Ferber, J. Multi-agent systems: An introduction to distributed artificial intelligence, volume 1. Addison-Wesley Reading, 1999.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In SIGKDD Conference on Knowledge Discovery and Data Mining, 2016.

Grover, A., Al-Shedivat, M., Gupta, J. K., Burda, Y., and Edwards, H. Evaluating generalization in multiagent systems using agent-interaction graphs. In International Conference on Autonomous Agents and Multiagent Systems, 2018.

Hoffer, E. and Ailon, N. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84-92. Springer, 2015.

Hoshen, Y. VAIN: Attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems, 2017.

Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In International Conference on Machine Learning, 2018.

Le, H. M., Yue, Y., and Carr, P. Coordinated multi-agent imitation learning. In International Conference on Machine Learning, 2017.

Li, Y., Song, J., and Ermon, S. Inferring the latent structure of human decision-making from raw visual inputs. In Advances in Neural Information Processing Systems, 2017.

Littman, M. L. Markov games as a framework for multiagent reinforcement learning. In International Conference on Machine Learning, 1994.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, 2017.

McCalla, G., Vassileva, J., Greer, J., and Bull, S. Active learner modelling. In Intelligent Tutoring Systems, pp. 53-62. Springer, 2000.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Mordatch, I. and Abbeel, P. Emergence of grounded compositional language in multi-agent populations. In AAAI Conference on Artificial Intelligence, 2018.

Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88-97, 1991.

Rubin, J. and Watson, I. Computer poker: A review. Artificial Intelligence, 175(5-6):958-987, 2011.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Stone, P. and Veloso, M. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345-383, 2000.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.

Wang, Z., Merel, J., Reed, S., Wayne, G., de Freitas, N., and Heess, N. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, 2017.

Wooldridge, M. An introduction to multiagent systems. John Wiley & Sons, 2009.

Zhou, Z., Bambos, N., and Glynn, P. Dynamics on linear influence network games under stochastic environments. In International Conference on Decision and Game Theory for Security, 2016a.

Zhou, Z., Yolken, B., Miura-Ko, R. A., and Bambos, N. A game-theoretical formulation of influence networks. In American Control Conference, 2016b.