Finding Friend and Foe in Multi-Agent Games

Jack Serrino (MIT, jserrino@mit.edu), Max Kleiman-Weiner (Harvard, MIT, Diffeo, maxkleimanweiner@fas.harvard.edu), David C. Parkes (Harvard University, parkes@eecs.harvard.edu), and Joshua B. Tenenbaum (MIT, CBMM, jbt@mit.edu)

AI for multi-agent games like Go, Poker, and Dota has seen great strides in recent years. Yet none of these games address the real-life challenge of cooperation in the presence of unknown and uncertain teammates. This challenge is a key game mechanism in hidden role games. Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on The Resistance: Avalon, the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play. Our algorithm integrates deductive reasoning into vector-form CFR to reason about joint beliefs and deduce partially observable actions. We augment deep value networks with constraints that yield interpretable representations of win probabilities. These innovations enable DeepRole to scale to the full Avalon game. Empirical game-theoretic methods show that DeepRole outperforms other hand-crafted and learned agents in five-player Avalon. DeepRole also played with and against human players on the web in hybrid human-agent teams. We find that DeepRole outperforms human players as both a cooperator and a competitor.

1 Introduction

Cooperation enables agents to achieve feats together that no individual can achieve on her own [16, 39]. Cooperation is challenging, however, because it is embedded within a competitive world [15]. Many multi-party interactions start off by asking: who is on my team? Who will collaborate with me, and who do I need to watch out for? These questions arise whether it is your first day of kindergarten or your first day at the stock exchange. Figuring out who to cooperate with and who to protect oneself against is a fundamental challenge for any agent in a diverse multi-agent world. This challenge has been explored in cognitive science, economics, and computer science [2, 7, 8, 21, 23, 24, 25, 26, 28, 30, 31, 44].

Core to this challenge is that information about who to cooperate with is often noisy and ambiguous. Typically, we only get this information indirectly through others' actions [1, 3, 21, 41]. Since different agents may act in different ways, these inferences must be robust and take into account ad-hoc factors that arise in an interaction. Furthermore, these inferences might be carried out in the presence of a sophisticated adversary with superior knowledge and the intention to deceive. Such adversaries could intentionally hide their non-cooperative intentions and try to appear cooperative for their own benefit [36]. The presence of adversaries also makes communication challenging: when the intent to cooperate is unknown, simple communication is unreliable or merely "cheap" [14].

This challenge has not been addressed by recent work in multi-agent reinforcement learning (RL). In particular, the impressive results in imperfect-information two-player zero-sum games such as poker [4, 6, 27] are not straightforward to apply to problems where cooperation is ambiguous. In heads-up poker, there is no opportunity to actually coordinate or cooperate with others, since two-player zero-sum games are strictly adversarial.
In contrast, games such as Dota and capture the flag have been used to train deep RL agents that coordinate with each other to compete against other teams [17, 29]. However, in neither setting was there ambiguity about who to cooperate with. Further, in real-time games, rapid reflexes and reaction times give machines an inherent non-strategic advantage [9].

Here we develop DeepRole, a multi-agent reinforcement learning algorithm that addresses the challenge of learning who to cooperate with and how. We apply DeepRole to a five-player game of alliances, The Resistance: Avalon (Avalon), a popular hidden role game where the challenge of learning who to cooperate with is the central focus of play [13]. Hidden role games start with players joining particular teams and adopting roles that are not known to all players of the game. During the course of the game, players try to infer and deduce the roles of their peers, while others simultaneously try to prevent their own role from being discovered. As of May 2019, Avalon is the most highly rated hidden role game on boardgamegeek.com. Hidden role games such as Mafia, Werewolf, and Saboteur are widely played around the world.

Related work. DeepRole builds on the recent success of heuristic search techniques that combine efficient depth-limited lookahead planning with a value function learned through self-play in two-player zero-sum games [27, 33, 34]. In particular, the DeepStack algorithm for no-limit heads-up poker combines counterfactual regret minimization (CFR), using a continual re-solving local search strategy, with deep neural networks [27, 45]. While DeepStack was developed for games where all actions are public (such as poker), in hidden role games some actions are observable only by some agents and therefore must be deduced. Furthermore, in Avalon players obtain new private information as the game progresses, whereas in poker the only hidden information is the initial set of cards.

Contributions. Our key contributions build on these recent successes. Our algorithm integrates deductive reasoning into vector-form CFR [19] to reason about joint beliefs and partially observable actions based on consistency with observed outcomes, and augments value networks with constraints that yield interpretable representations of win probabilities. This augmented network enables training with better sample efficiency and generalization. We conduct an empirical game-theoretic analysis in five-player Avalon and show that the DeepRole CFR-based algorithm outperforms existing approaches and hand-crafted systems. Finally, we had DeepRole play with a large sample of human players on a popular online Avalon site. DeepRole outperforms people as both a teammate and an opponent when playing with and against humans, even though it was trained only through self-play. We conclude by discussing the value of hidden role games as a long-term challenge for multi-agent RL systems.

Figure 1: Description of the public game dynamics in The Resistance: Avalon. (left) Each round (rectangle) has up to 5 proposals (white circles) and leads to either a mission that fails or succeeds. (right) Example dynamics within each round. Players (colored circles) alternate proposing subsets of players (2 or 3) to go on a mission, which are then put to a vote by all 5 players. If the majority approves, those players (1 & 5 in this example) privately and independently decide to succeed or fail the mission.
If the majority disapproves, the next player proposes a subset.

2 The Resistance: Avalon

We first briefly describe the game mechanics of The Resistance: Avalon played with five players. At the beginning of the game, 3 players are randomly assigned to the Resistance team and 2 players are assigned to the Spy team. The Spies know which players are on the Spy team (and hence also know which players are on the Resistance team). One member of the Resistance team is randomly and privately chosen to be the Merlin role, who also knows the role assignments. One member of the Spy team is randomly chosen to be the Assassin. At the end of the game, if the Resistance team has won, the Assassin guesses the identity of Merlin. If the Assassin guesses Merlin correctly, then the Spy team wins.

Figure 1 shows a visual description of the public game dynamics. There are five rounds in the game. During each round, a player proposes a subset (two or three players, depending on the round) of agents to go on a mission. All players simultaneously and publicly vote (approve or not approve) on that subset. If a simple majority does not approve, another player is selected to propose a subset to go on the mission. If after five attempts no proposal receives a simple majority, the Spy team wins. If a simple majority approves, the subset of players privately select whether the mission succeeds or fails. Players on the Resistance team must always choose success, but players on the Spy team can choose success or failure. If any of the Spies choose to fail the mission, the mission fails; otherwise, the mission succeeds. The total number of success and fail votes is made public, but the identity of who cast those votes is private. If three missions succeed, the Resistance team wins. If three missions fail, the Spy team wins.

When people play Avalon, games are usually rich in cheap talk, such as defending oneself, accusing others, or debunking others' claims [10]. In this work, we do not consider the strategic implications of natural language communication. Although Avalon is a simple game to describe, it has a large state space. We compute a lower bound of 10^56 distinct information sets in the 5-player version of Avalon (see Appendix D for details). This is larger than the state space of chess (10^47) and larger than the number of information sets in heads-up limit poker (10^14) [18].

3 Algorithm: DeepRole

The DeepRole algorithm builds off of recent success in poker by combining DeepStack's innovations of deep value networks and depth-limited solving with deductive reasoning. Compared to MCTS-based methods like AlphaGo, CFR-based methods like DeepStack and DeepRole can soundly reason over hidden information. Unique to DeepRole, our innovations allow the algorithm to play games with simultaneous and hidden actions. In broad strokes, DeepRole is composed of two parts: (1) a CFR planning algorithm augmented with deductive reasoning; and (2) neural value networks that are used to reduce the size of the game tree. Source code and experimental data are available at https://github.com/Detry322/DeepRole.

Background. Hidden role games like Avalon can be modeled as extensive-form games. We follow the notation of [19]. Briefly, these games have a game tree with nodes that correspond to different histories of actions, $h \in H$, with $Z \subseteq H$ the set of terminal histories. For each $z \in Z$, let $u_i(z)$ be the utility to player $i$ in terminal history $z$.
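To make the hidden-information structure concrete, the sketch below enumerates the 60 possible role assignments and the private information each seat starts with. It is illustrative code rather than the paper's implementation: the helper names are ours, and we follow the 15-information-sets-per-seat encoding implied by Figure 2 (Merlin sees which pair of seats are Spies, each Spy sees who the Assassin is, and the Assassin sees the other Spy).

```python
from itertools import permutations

ROLES = ("Merlin", "Resistance", "Resistance", "Spy", "Assassin")

def enumerate_assignments():
    """All distinct assignments of the five Avalon roles to seats 0..4 (60 in total)."""
    return sorted(set(permutations(ROLES)))

def information_set(assignment, seat):
    """Private information `seat` holds at the start of the game (illustrative encoding).

    Follows the 15-information-sets-per-seat scheme suggested by Figure 2:
    1 (plain Resistance) + 6 (Merlin: which pair are Spies)
    + 4 (Spy: who the Assassin is) + 4 (Assassin: who the other Spy is).
    """
    role = assignment[seat]
    spies = frozenset(i for i, r in enumerate(assignment) if r in ("Spy", "Assassin"))
    if role == "Resistance":
        return (seat, "Resistance")
    if role == "Merlin":
        return (seat, "Merlin", spies)                       # sees the Spy pair only
    if role == "Spy":
        return (seat, "Spy", assignment.index("Assassin"))   # sees who the Assassin is
    other_spy = next(i for i in spies if i != seat)
    return (seat, "Assassin", other_spy)                     # sees the other Spy

assignments = enumerate_assignments()
assert len(assignments) == 60
for seat in range(5):
    assert len({information_set(a, seat) for a in assignments}) == 15
```

These 60 assignments are exactly the support of the uniform prior belief used by the planner in Section 3.1.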
In extensive-form games, only a single player $P(h)$ can move at any history $h$; because Avalon's mechanics intimately involve simultaneous action, we extend this definition and let $P'(h)$ be the set of players simultaneously moving at $h$. Histories are partitioned into information sets ($I \in \mathcal{I}_i$) that represent the game states that player $i$ cannot distinguish between. For example, a Resistance player does not know who is on the Spy team, and thus all $h$ differing only in the role assignments of the other players are in a single information set. The actions available in a given information set are $a \in A(I)$. A strategy $\sigma_i$ for player $i$ is a mapping from each $I \in \mathcal{I}_i$ to a probability distribution over $A(I)$. Let $\sigma = (\sigma_1, \ldots, \sigma_p)$ be the joint strategy of all $p$ players. We write $\sigma_{I \to a}$ to mean strategy $\sigma$, modified so that action $a$ is always played at information set $I$. Then, we let $\pi^{\sigma}(h)$ be the probability of reaching $h$ if all players act according to $\sigma$. We write $\pi^{\sigma}_i(h)$ for the contribution of player $i$ to the joint probability, so that $\pi^{\sigma}(h) = \prod_i \pi^{\sigma}_i(h)$. Finally, let $\pi^{\sigma}_{-i}(h)$ be the product of the strategies of all players except $i$, and let $\pi^{\sigma}(h, h')$ be the probability of reaching history $h'$ under strategy $\sigma$, given that $h$ has occurred.

Counterfactual regret minimization (CFR) iteratively refines $\sigma$ based on the regret accumulated through a self-play-like procedure. Specifically, in CFR+, at iteration $T$ the cumulative counterfactual regret is

$$R^{+,T}_i(I, a) = \max\Big\{\sum_{t=1}^{T} \big(\mathrm{CFV}_i(\sigma^t_{I \to a}, I) - \mathrm{CFV}_i(\sigma^t, I)\big),\ 0\Big\},$$

where the counterfactual values for player $i$ are defined as

$$\mathrm{CFV}_i(\sigma, I) = \sum_{z \in Z} u_i(z)\, \pi^{\sigma}_{-i}(z[I])\, \pi^{\sigma}(z[I], z),$$

with $z[I]$ the $h \in I$ such that $h \sqsubseteq z$ [38]. At a high level, CFR iteratively improves $\sigma$ by boosting the probability of actions that would have been beneficial to each player. In two-player zero-sum games, CFR provably converges to a Nash equilibrium; however, it does not necessarily converge to an equilibrium in games with more than two players [37]. We investigate whether CFR can generate strong strategies in a multi-agent hidden role game like Avalon.

3.1 CFR with deductive logic

The CFR component of DeepRole is based on the vector-form public chance sampling (PCS) version of CFR introduced in [19], together with CFR+ regret matching [38]. Vector-form versions of CFR can result in faster convergence and take advantage of SIMD instructions, but they require a public game tree [20]. In poker-like games, one can construct a public game tree from player actions, since all actions are public (e.g., bets, new cards revealed) except for the initial chance action (dealing players their cards). In hidden role games, however, key actions after the initial chance action are taken privately, breaking the standard construction. To support hidden role games, we extend the public game tree to be a history of third-person observations, $o \in O(h)$, instead of just actions. This includes both public actions and the observable consequences of private actions (lines 22-44 in Alg. 1 in the Appendix). Our extension works when deductive reasoning from these observations reveals the underlying private actions. For instance, if a mission fails and one of the players on it is known to be a Spy, one can deduce that the Spy failed the mission. deduceActions(h, o) carries out this deductive reasoning and returns the actions taken by each player under each information set ($\vec{a}_i[I]$) (line 23). With $\vec{a}_i[I]$ and the player's strategy ($\vec{\sigma}_i$), the player's reach probabilities are updated for the public game state following the observation ($ho$) (lines 24-26).
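As an illustration of the kind of logic deduceActions relies on, the sketch below checks whether a role assignment can explain an observed number of fail cards on a mission and filters out assignments that cannot. This is a simplified stand-in of our own, not the paper's deduceActions (which resolves per-information-set actions inside vector-form CFR); it reuses `assignments` and the role labels from the previous sketch.

```python
def consistent_with_mission(assignment, mission_players, observed_fails):
    """Could `observed_fails` fail cards have been played under this role assignment?

    Resistance players must play success, so every fail card must come from a Spy
    on the mission. An assignment with fewer Spies on the mission than observed
    fail cards is logically impossible.
    """
    spies_on_mission = sum(
        1 for p in mission_players if assignment[p] in ("Spy", "Assassin")
    )
    return spies_on_mission >= observed_fails

def deduce_possible_assignments(assignments, mission_players, observed_fails):
    """Deductive step: keep only role assignments consistent with the public outcome."""
    return [a for a in assignments
            if consistent_with_mission(a, mission_players, observed_fails)]

# Example: players 0 and 1 go on a mission and one fail card is observed.
# Every assignment in which neither 0 nor 1 is on the Spy team is eliminated.
possible = deduce_possible_assignments(assignments, mission_players=(0, 1),
                                       observed_fails=1)
assert all(any(a[p] in ("Spy", "Assassin") for p in (0, 1)) for a in possible)
```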
Using the public game tree, we maintain a human-interpretable joint posterior belief $b(\rho \mid h)$ over the initial assignment of roles $\rho$. Here $\rho$ represents a full assignment of roles to players (the result of the initial chance action), so our belief $b(\rho \mid h)$ represents the joint probability that each player has the role specified in $\rho$, given the observed actions in the public game tree. See Figure 2 for an example belief $b$ and assignment $\rho$. This joint posterior $b(\rho \mid h)$ can be approximated by using the individual players' strategies as the likelihood in Bayes' rule:

$$b(\rho \mid h) \;\propto\; b(\rho)\, \mathbb{1}\{h \sim \rho\} \prod_i \pi^{\sigma}_i\big(I_i(h, \rho)\big) \qquad (1)$$

where $b(\rho)$ is the prior over assignments (uniform over the 60 possible assignments), $I_i(h, \rho)$ is the information set implied by public history $h$ and assignment $\rho$, and the product is the likelihood of playing to $h$ given each player's implied information set. A problem is that this likelihood alone can put positive mass on assignments that are impossible given the history. This arises because vector-form CFR algorithms can only compute likelihoods for each player independently rather than jointly. For instance, consider two players that went on a failing mission. In the information sets implied by the $\rho$ in which they are both Resistance, each player is assumed to have passed the mission; however, this is logically inconsistent with the history, as one of them must have played fail. To address this, the indicator term $\mathbb{1}\{h \sim \rho\}$ zeros the probability of any $\rho$ that is logically inconsistent with the public game tree $h$. This zeroing removes any impact these impossible outcomes would have had on the value and regret calculations in CFR (line 20 in Alg. 2).

3.2 Value network

The enhanced vector-form CFR cannot be run on the full public game tree of Avalon (or of any real hidden role game). This is also the case for games like poker, so CFR-based poker systems [6, 27] rely on action abstraction and state abstraction to reduce the size of the game tree. However, actions in Avalon are not obviously related to each other. Betting 105 chips in poker is strategically similar to betting 104 chips, but voting up a mission in Avalon is distinct from voting it down. The size of Avalon's game tree does not come from the number of available actions, but rather from the number of players. Furthermore, since Avalon has until now received only limited attention, there are no developed hand-crafted state abstractions available either (although see [5] for how these could be learned).

Figure 2: DeepRole neural network architecture used to limit game tree depth. Tables (black headers) show example inputs: a one-hot encoding of the proposer position and the public belief state $b$ over role assignments, which pass through dense ReLU layers to a win-likelihood layer $P(\text{win} \mid \rho)$. The uppercase characters represent the different roles: (R)esistance, (S)py, (M)erlin, (A)ssassin. The outputs are the probability-weighted value of each information set, $P(I_i) \cdot V_i(I_i)$, for each player. While there is only one information set for a Resistance player (since they only know their own role), there are multiple for each of the other role types. M(2,3) should be read as Merlin who sees players 2 and 3 as Spies, and S(4) should be read as a Spy who sees player 4 as the Assassin.
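Before turning to the value networks, here is a minimal numerical sketch of the belief update in Eq. (1), with variable names of our own: the per-player reach probabilities stand in for $\pi^{\sigma}_i(I_i(h,\rho))$, and the boolean mask stands in for the consistency indicator (it could be derived from the deduction filter sketched above).

```python
import numpy as np

def update_joint_belief(prior, reach_probs, consistent):
    """Approximate joint posterior over role assignments, in the spirit of Eq. (1).

    prior:        shape (60,), prior b(rho) over assignments (uniform at the root)
    reach_probs:  shape (5, 60); reach_probs[i, k] approximates player i's
                  probability of playing to the public history h under the
                  information set that assignment k implies for them
    consistent:   shape (60,), True where assignment k is logically consistent
                  with h (the indicator/deduction term), False otherwise
    """
    likelihood = np.prod(reach_probs, axis=0)        # product over the 5 players
    posterior = prior * likelihood * consistent      # zero impossible assignments
    total = posterior.sum()
    return posterior / total if total > 0 else posterior

# Toy usage: uniform prior, random reach probabilities, all assignments possible.
prior = np.full(60, 1.0 / 60)
reach = np.random.uniform(0.1, 1.0, size=(5, 60))
mask = np.ones(60, dtype=bool)
belief = update_joint_belief(prior, reach, mask)
assert np.isclose(belief.sum(), 1.0)
```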
We follow the general approach taken by [27], using deep neural networks to limit the size of the game tree that we traverse (lines 14-16 in Alg. 1 in Appendix A). We first partition the Avalon public game tree into individually solvable parts, segmented by proposal for every possible number of succeeded and failed missions (white circles on the left side of Figure 1). This yields 45 neural networks, and each $h$ corresponding to a proposal is mapped to one of them. These networks take in a tuple $(i, b)$, where $i$ is the proposing player and $b$ is the posterior belief at that position in the game tree; the set of such tuples covers all possible game situations. The value networks are trained to predict the probability-weighted value of each information set (Figure 2).

Unlike in DeepStack, our networks calculate the non-counterfactual (i.e., normal) values of every information set $I$ for each player. This is because our joint belief representation loses the individual contribution of each player's likelihood, making it impossible to calculate a counterfactual value. The value $V_i(I)$ of private information $I$ for player $i$ can be written as

$$V_i(I) \;=\; \sum_{h \in I} \sum_{z \in Z} \pi^{\sigma}(h)\, \pi^{\sigma}(h, z)\, u_i(z) \;=\; \pi^{\sigma}_i(I)\, \mathrm{CFV}_i(\sigma, I),$$

where players play according to a strategy $\sigma$. Since we maintain $\pi^{\sigma}_i(I)$ during planning, we can convert the values produced by the network into the counterfactual values needed by CFR (line 15 in Alg. 2).

Value network architecture. While it is possible to estimate these values using a generic feed-forward architecture, doing so may lower sample efficiency, require longer training time, or fail to achieve a low loss. We therefore design an interpretable custom neural network architecture that takes advantage of restrictions imposed by the structure of many hidden role games. Our network feeds a one-hot encoded vector of the proposing player $i$ and the belief vector $b$ into two fully-connected hidden layers of 80 ReLU units. These feed into a fully-connected win probability layer with sigmoid activation. This layer is designed to take into account the specific structure of $V$, respecting the binary nature of payoffs in Avalon (players can only win or lose). It explicitly represents the probability of a Resistance win, $\vec{w} = P(\text{win} \mid \rho)$, for each assignment $\rho$. Using these probabilities, we then calculate $V_i(I)$ for each player and information set, constraining the network's outputs to sound values. To do this calculation, for each player $i$, win probabilities are first converted to expected values $\vec{u}_i \vec{w} + \bar{\vec{u}}_i (1 - \vec{w})$ (elementwise), where $\vec{u}_i$ represents $i$'s payoff in each $\rho$ if the Resistance win and $\bar{\vec{u}}_i$ the payoff if they lose. These are then turned into the probability-weighted value of each information set, which is what CFR uses and produces:

$$\vec{V}_i = M_i\big[(\vec{u}_i \vec{w} + \bar{\vec{u}}_i (1 - \vec{w})) \odot \vec{b}\big],$$

where $M_i$ is a $(15 \times 60)$ multi-one-hot matrix mapping each $\rho$ to player $i$'s information set, and $\vec{b}$ is the belief over roles passed to the network.

Figure 3: DeepRole generalization and sample efficiency. (left) Generalization error on held-out samples, averaged across the 45 neural networks. (right) Generalization error as a function of training data for the first deep value network (averaged over N=5 runs; intervals are SE).

Figure 4: Comparing the expected win rate of DeepRole with other agents. The x-axis shows how many of the first four agents are DeepRole. The y-axis shows the expected win rate for the fifth agent if they played as DeepRole or as the benchmark. Standard errors are smaller than the size of the dots. (top) Combined expected win rate. (middle) Spy-only win rate. (bottom) Resistance-only win rate.
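The output head of the win probability layer reduces to a few lines of array arithmetic. The sketch below is ours, with hypothetical variable names; it reproduces only the fixed, non-learned transform that maps the predicted win probabilities $\vec{w}$, the input belief $\vec{b}$, and the payoff vectors to probability-weighted information-set values, with dimensions (60 assignments, 15 information sets per seat) taken from Figure 2.

```python
import numpy as np

def probability_weighted_values(w, b, payoff_res_win, payoff_res_lose, M):
    """Fixed output transform of the value network for one seat (a sketch).

    w:               shape (60,), predicted P(Resistance win | rho) per assignment
    b:               shape (60,), joint belief over assignments fed to the network
    payoff_res_win:  shape (60,), this seat's payoff in each rho if Resistance wins
    payoff_res_lose: shape (60,), this seat's payoff in each rho if the Spies win
    M:               shape (15, 60), multi-one-hot map from each assignment to the
                     information set it implies for this seat
    Returns shape (15,): probability-weighted value of each information set.
    """
    expected = payoff_res_win * w + payoff_res_lose * (1.0 - w)
    return M @ (expected * b)

# Toy usage with +1/-1 payoffs; M here is a random stand-in for the real mapping.
rng = np.random.default_rng(0)
w = rng.uniform(size=60)                      # sigmoid outputs of the win layer
b = np.full(60, 1.0 / 60)                     # belief from the CFR planner
u_win = rng.choice([1.0, -1.0], size=60)      # +1 where this seat is on the Resistance
M = np.zeros((15, 60))
M[rng.integers(0, 15, size=60), np.arange(60)] = 1.0
values = probability_weighted_values(w, b, u_win, -u_win, M)
```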
This architecture is fully differentiable and is trained via back-propagation. A diagram and description of the full network is shown in Figure 2. See Appendix B and Alg. 3 for details of the network training algorithm, procedure, parameters, and compute. The win probability layer enabled training with less training data and better generalization: compared to a lesioned neural network that replaces the win probability layer with a zero-sum layer (as in DeepStack), the average held-out loss per network was lower and less training data was required (Figure 3).

4 Empirical game-theoretic analysis

The possibility of playing with diverse teammates, who may be playing conflicting equilibrium strategies, out-of-equilibrium strategies, or even human strategies, makes evaluation outside of two-player zero-sum games challenging. In two-player zero-sum games, all Nash equilibria are minimally exploitable, so algorithms that converge to Nash are provably optimal in that sense. Evaluating interactions among three or more players, however, requires considering multiple equilibria and metrics that account for coordination with teammates. Further, Elo and its variants such as TrueSkill are only good measures of performance when relative skill is transitive; they have no predictive power in intransitive games (e.g., rock-paper-scissors) [40]. Thus, we turn to methods for empirical game-theoretic analysis, which require running agents against a wide variety of benchmark opponents [40, 42].

We compare the performance of DeepRole to five alternative baseline agents: CFR, an agent trained with MCCFR [22] over a hand-crafted game abstraction; LogicBot, a hand-crafted strategy that uses logical deduction; RandomBot, which plays randomly; ISMCTS, a single-observer ISMCTS algorithm found in [11, 12, 35]; and MOISMCTS, a multiple-observer variant of ISMCTS [43]. Details for these agents are given in Appendix C.

Figure 5: Empirical game-theoretic evaluation. Arrow size and darkness are proportional to the size of the gradient. (left) DeepRole against hand-coded agents. (center) DeepRole compared to systems without our algorithmic improvements. (right) DeepRole against itself, but with CFR iterations equal to the number next to the game.

We first investigated the conditional win rates for each baseline agent playing against DeepRole. We consider the effect of adding a fifth agent to a preset group of agents and compare DeepRole's win rate as the fifth agent with the win rate of a baseline strategy as the fifth agent in that same preset group. For each preset group (0-4 DeepRole agents) we simulated >20K games. Figure 4 shows the win probability of each of these bots when playing DeepRole, both overall and conditioned on role (Spy or Resistance). In most cases, adding a fifth DeepRole player yielded a higher win rate than adding any of the other bots. This was true in every case we tested in which at least two other DeepRole agents were playing. Thus, from an evolutionary perspective, DeepRole is robust to invasion by all of these agent types and in almost all cases outperforms the baselines even when it is in the minority.

To formalize these intuitions, we construct a meta-game in which players select a mixed meta-strategy over agent types rather than over actions. Figure 5 shows the gradient of the replicator dynamic in these meta-games [40, 42]. The replicator dynamic gradient describes the direction in which a player playing meta-strategy $\sigma$ can update their strategy for maximal gain, assuming the other players are also playing $\sigma$.
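As a sketch of how such a gradient can be estimated from empirical win rates, the snippet below follows the general recipe of [40, 42] rather than the paper's exact evaluation code; the function and the toy payoff table are hypothetical.

```python
import numpy as np
from itertools import product

def replicator_gradient(sigma, payoff):
    """Single-population replicator-dynamics gradient for a 5-player meta-game (a sketch).

    sigma:   shape (n_types,), current mixed meta-strategy over agent types
    payoff:  payoff(k, opponents) -> empirical expected payoff (e.g., win rate)
             of agent type k when the other four seats hold the given types
    Returns d sigma / dt under the replicator dynamic.
    """
    n = len(sigma)
    fitness = np.zeros(n)
    for k in range(n):
        for opponents in product(range(n), repeat=4):
            # Opponents drawn i.i.d. from the population mixture sigma.
            prob = np.prod([sigma[o] for o in opponents])
            fitness[k] += prob * payoff(k, opponents)
    return sigma * (fitness - sigma @ fitness)

# Toy usage: two agent types with a dummy payoff function.
sigma = np.array([0.5, 0.5])
grad = replicator_gradient(sigma, lambda k, opps: 0.6 if k == 0 else 0.4)
```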
Both vector field sinks and points with zero gradient correspond to Nash equilibria of the meta-game. First, we compare DeepRole to the two hand-crafted strategies (LogicBot and CFR) and show that purely playing DeepRole is the equilibrium with the largest basin of attraction. The ISMCTS agents are too computationally demanding to run in these contests, but in a pairwise evaluation, playing DeepRole is the sole equilibrium. Next, we test whether our innovations make DeepRole a stronger agent. We compare DeepRole to two lesioned alternatives. The first, DeepRole (No Win Layer), uses a zero-sum layer instead of our win probability layer in the neural network; otherwise it is identical to DeepRole. In Figure 3, we saw that this architecture did not generalize as well. We also compare to a version of DeepRole that includes neither the logical deduction step in Equation 1 nor the win probability layer (No Win Layer, No Deduction). The agent without deduction is the weakest, and the full DeepRole agent is the strongest, showing that our innovations lead to enhanced performance. Finally, we looked at the impact of the number of CFR solving iterations during play (thinking time). More iterations make each move slower but may yield a better strategy. When playing DeepRole variants with 10, 30, and 100 iterations against each other, each variant was robust to invasion by the others, but the more iterations used, the larger the basin of attraction (Figure 5).

5 Human evaluation

Playing with and against human players is a strong test of generalization. First, humans are likely to play a diverse set of strategies that will be challenging for DeepRole to respond to. DeepRole never learns from human data during training, so its ability to play with people must be the result of playing a strategy that generalizes to human play. Importantly, even if human players take the DeepRole neural networks "out of distribution", the online CFR iterations can still enable smart play in novel situations (as with MCTS in AlphaGo).

Figure 6: Belief dynamics over the course of the game. (left) DeepRole's posterior belief in the ground-truth Spy role assignments as a Resistance player with four humans. (right) DeepRole's posterior belief of the true Spy team while observing all-human games from the perspective of a Resistance player.

Humans played with DeepRole on the popular online platform ProAvalon.com (see Appendix F for commentated games and brief descriptions of DeepRole's "play style"). In the 2189 mixed human/agent games we collected, all humans knew which players were human and which were DeepRole. There were no restrictions on chat usage for the human players, but DeepRole did not say anything and did not process sent messages. Table 1 shows the win rate of DeepRole compared to humans. In the left half of the table, we can see that DeepRole is robust: when four of the players were DeepRole, a player would do better playing the DeepRole strategy than playing as an average human, regardless of team. More interestingly, when considering a game of four humans, the humans were better off playing with the DeepRole agent as a teammate than with another human, again regardless of team. Although we have no way of quantifying the absolute skill level of these players, among this pool of avid Avalon players, DeepRole acted as both a superior cooperator and competitor: it cooperated with its teammates to compete against the others.
Finally, DeepRole's interpretable belief state can be used to gain insights into play. In Figure 6 we show DeepRole's posterior probability estimate of the true set of Spies when playing as a Resistance player. When DeepRole played as the sole agent among four humans (left plot), the belief state rapidly converged to the ground truth in the situations where three missions passed, even though it had never been trained on human data. When three missions failed, it was often because the agent had failed to learn the roles correctly. Next, we analyze the belief state when fed actions and observations from the perspective of a human Resistance player playing against a group of humans (yoked actions). As shown in Figure 6, the belief estimates increase as the game progresses, indicating that DeepRole can make correct inferences even while just observing the game. The belief estimate converges to the correct state faster in games with three passed missions, presumably because the data in these games was more informative to all players.

Table 1: Win rates (%) for humans playing with and against the DeepRole agent. When a human replaces a DeepRole agent in a group of 5 DeepRole agents, the win rate goes down for the team that the human joins. When a DeepRole agent replaces a human in a group of 5 humans, the win rate goes up for the team the DeepRole agent joins. Averages ± standard errors of the mean, calculated over a binary outcome.

| Win rate (%) | +DeepRole to 4 DeepRole | +Human to 4 DeepRole | +DeepRole to 4 Humans | +Human to 4 Humans |
|---|---|---|---|---|
| Overall | 46.9 ± 0.6 (N=7500) | 38.8 ± 1.3 (N=1451) | 60.0 ± 5.5 (N=80) | 48.1 ± 1.2 (N=1675) |
| Resistance | 34.4 ± 0.7 (N=4500) | 25.6 ± 1.5 (N=856) | 51.4 ± 8.2 (N=37) | 40.3 ± 1.5 (N=1005) |
| Spy | 65.6 ± 0.9 (N=3000) | 57.8 ± 2.0 (N=595) | 67.4 ± 7.1 (N=43) | 59.7 ± 1.9 (N=670) |

6 Discussion

We developed a new algorithm for multi-agent games, DeepRole, which effectively collaborates and competes with a diverse set of agents in The Resistance: Avalon. DeepRole surpassed both humans and existing machines in simulated contests against other agents and in a real-world evaluation with human Avalon players. These results are achieved through the addition of a deductive reasoning system to vector-based CFR and a win probability layer in deep value networks for depth-limited search. Taken together, these innovations allow DeepRole to scale to the full game of Avalon, allowing CFR agents to play hidden role games for the first time. In future work, we will investigate whether the interpretable belief state of DeepRole could also be used to ground language, enabling better coordination through communication.

Looking forward, hidden role games are an exciting opportunity for developing AI agents. They capture the ambiguous nature of day-to-day interactions with others and go beyond the strictly adversarial nature of two-player zero-sum games. Only by studying environments with three or more players can we start to capture some of the richness of human social interactions, including alliances, relationships, teams, and friendships [32].

Acknowledgments

We thank Victor Kuo and ProAvalon.com for help integrating DeepRole with human players online. We also thank Noam Brown, Murray Campbell, and Michael Wellman for helpful discussions and comments. This work was supported by the Harvard Data Science Initiative, CRCS, and MBB, The Future of Life Institute, DARPA Ground Truth, and the Center for Brains, Minds and Machines (NSF STC award CCF-1231216).

References
[1] Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. arXiv preprint arXiv:1709.08071, 2017.
[2] Robert Axelrod. The Evolution of Cooperation. Basic Books, 1985.
[3] Chris L Baker, Julian Jara-Ettinger, Rebecca Saxe, and Joshua B Tenenbaum. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1:0064, 2017.
[4] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.
[5] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. arXiv preprint arXiv:1811.00164, 2018.
[6] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 2017.
[7] Colin Camerer. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, 2003.
[8] Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, pages 861–898, 2004.
[9] Rodrigo Canaan, Christoph Salge, Julian Togelius, and Andy Nealen. Leveling the playing field: Fairness in AI versus human game benchmarks. arXiv preprint arXiv:1903.07008, 2019.
[10] Gokul Chittaranjan and Hayley Hung. Are you a werewolf? Detecting deceptive roles and outcomes in a conversational role-playing game. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5334–5337. IEEE, 2010.
[11] Peter I Cowling, Edward J Powley, and Daniel Whitehouse. Information set Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 4(2):120–143, 2012.
[12] Peter I Cowling, Daniel Whitehouse, and Edward J Powley. Emergent bluffing and inference with Monte Carlo tree search. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 114–121. IEEE, 2015.
[13] Don Eskridge. The Resistance: Avalon, 2012.
[14] Joseph Farrell and Matthew Rabin. Cheap talk. Journal of Economic Perspectives, 10(3):103–118, 1996.
[15] Adam Galinsky and Maurice Schweitzer. Friend and Foe: When to Cooperate, When to Compete, and How to Succeed at Both. Random House, 2015.
[16] Joseph Henrich. The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter. Princeton University Press, 2015.
[17] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.
[18] Michael Johanson. Measuring the size of large no-limit poker games. arXiv preprint arXiv:1302.7008, 2013.
[19] Michael Johanson, Nolan Bard, Marc Lanctot, Richard Gibson, and Michael Bowling. Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pages 837–846. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
[20] Michael Johanson, Kevin Waugh, Michael Bowling, and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[21] Max Kleiman-Weiner, Mark K Ho, Joseph L Austerweil, Michael L Littman, and Joshua B Tenenbaum. Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, 2016.
[22] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems, pages 1078–1086, 2009.
[23] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Julien Perolat, David Silver, Thore Graepel, et al. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4193–4206, 2017.
[24] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
[25] Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.
[26] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In ICML, volume 94, pages 157–163, 1994.
[27] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
[28] Martin A Nowak. Five rules for the evolution of cooperation. Science, 314(5805):1560–1563, 2006.
[29] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
[30] Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pages 3646–3655, 2017.
[31] David G Rand and Martin A Nowak. Human cooperation. Trends in Cognitive Sciences, 17(8):413, 2013.
[32] Michael Shum, Max Kleiman-Weiner, Michael L Littman, and Joshua B Tenenbaum. Theory of minds: Understanding behavior in groups through inverse planning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), 2019.
[33] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[34] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
[35] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.
[36] DJ Strouse, Max Kleiman-Weiner, Josh Tenenbaum, Matt Botvinick, and David J Schwab. Learning to share and hide intentions using information regularization. In Advances in Neural Information Processing Systems, pages 10270–10281, 2018.
[37] Duane Szafron, Richard Gibson, and Nathan Sturtevant. A parameterized family of equilibrium profiles for three-player Kuhn poker. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 247–254. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[38] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[39] Michael Tomasello. A Natural History of Human Thinking. Harvard University Press, 2014.
[40] Karl Tuyls, Julien Perolat, Marc Lanctot, Joel Z Leibo, and Thore Graepel. A generalised method for empirical game theoretic analysis. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, pages 77–85. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
[41] Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, and Joshua B Tenenbaum. Help or hinder: Bayesian models of social goal inference. In Advances in Neural Information Processing Systems, pages 1874–1882, 2009.
[42] Michael P Wellman. Methods for empirical game-theoretic analysis. In AAAI, pages 1552–1556, 2006.
[43] Daniel Whitehouse. Monte Carlo Tree Search for Games with Hidden Information and Uncertainty. PhD thesis, University of York, 2014.
[44] Michael Wunder, Michael Kaisers, John Robert Yaros, and Michael Littman. Using iterated reasoning to predict opponent strategies. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pages 593–600. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
[45] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pages 1729–1736, 2008.