# Off-Belief Learning

Hengyuan Hu, Adam Lerer, Brandon Cui, Luis Pineda, Noam Brown, Jakob Foerster (Facebook AI Research). Correspondence to: Hengyuan Hu, Jakob Foerster. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions, and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep, OBL agents follow a policy π1 that is optimized assuming past actions were taken by a given, fixed policy (π0), but assuming that future actions will be taken by π1. When π0 is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents' behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy setting and the benchmark human-AI and ZSC problem, Hanabi.

## 1. Introduction

An important goal of multi-agent reinforcement learning (MARL) research is to develop AI systems that can coordinate with novel partners, such as humans or other artificial agents, unseen at training time. However, the standard problem setting in cooperative MARL is self-play (SP), where a team of agents is trained to find near-optimal joint policies. Crucially, SP assumes that at test time all parties follow this policy faithfully, without considering the performance when paired with other agents. As a result, strong joint policies for SP often rely on efficient, yet arbitrary conventions (Foerster et al., 2019). Although SP agents achieve high scores among themselves, they often perform very poorly when paired with other, independently-trained agents or with humans because they cannot decipher the other agents' signals and intentions, and vice versa (Carroll et al., 2019; Bard et al., 2020; Hu et al., 2020).

Such agents fail because they rely too heavily on conventions and other inferences that finely depend on their partner's policy, rather than on grounded information. For example, suppose in the board game Hanabi (detailed game rules in Section 6) our partner reveals the color of a particular card, e.g. green. We now know that the card is green. This is grounded information. But we may also infer other information. If we have agreed on a convention to only reveal the color of a card if it is playable, we can infer that the green card must also be playable. So under this shared convention, in this situation "green" also means "playable", not simply "green". Conventions may also be iterated to higher levels of cognitive depth, where agents build more complex conventions on top of the baseline expectation that others are following earlier conventions.
For example, not playing a card that was hinted to be playable could be used as a signal for an even higher-priority action. Agents trained via self-play tend to learn sets of conventions where the meaning of each action in each situation can vary unpredictably between different training runs. These conventions can be of arbitrarily high cognitive depth, which each agent expects the other to understand perfectly. By contrast, humans often limit themselves to a cognitive depth of only a few levels (Camerer et al., 2003), and fewer when other agents are anticipated to be less experienced (Agranov et al., 2012), across a wide variety of competitive and collaborative games. When collaborating with unknown partners for the first time, humans may still successfully communicate by simple grounded information sharing. They can avoid making fragile inferences based on their partner's unknown policy, while still planning their own actions and assuming that other players will respond "reasonably". This is a behavior that current methods cannot replicate.

In this paper, we present off-belief learning (OBL), a MARL algorithm that addresses this gap in prior methods by controlling the cognitive reasoning depth, preventing the use of arbitrary conventions, while converging to optimal behavior under this constraint. Given a starting policy π0, OBL agents follow a policy π1 that, at each step, acts optimally assuming that all past actions were taken by π0 and all future actions will be taken by π1. Note that this includes accounting for the fact that future actions from π1 will again be interpreted as if they were chosen by π0. OBL can also be applied iteratively to obtain a hierarchy of policies that converges to a unique solution, making it a natural candidate for solving zero-shot coordination (ZSC) problems, where the goal is to maximize the test-time performance between policies from independent training runs of the same algorithm (cross-play).

Our work is motivated by the question: "Can an agent act optimally, and expect other agents to do the same, without assuming any shared conventions?" In this work we formalize and answer this question affirmatively with OBL. In particular, if π0 is a uniform random policy, then under OBL π1 is an optimal grounded policy, meaning that it plays the game relying only on the grounded information, without assuming any conventions. This is because interpreting actions as coming from π0 results in π1 conditioning only on grounded information revealed by the environment due to actions taken (a card revealed to be green is green, even if interpreted as if it were revealed by a random agent), but no further information about any agent's private state, since regardless of private state, a uniform random agent takes all actions with equal probability. Crucially, since π1 will act upon this grounded information optimally at future time steps, π1 will still learn optimal grounded signaling, i.e. taking actions specifically to reveal information other agents can use to take better actions.

In this work, we first characterize OBL theoretically. We prove that OBL converges to a unique policy, i.e. given the same π0, OBL produces the same π1 across independent training runs. We then prove that OBL is a policy improvement operator under certain conditions. We further show that when π0 doesn't depend on private information, OBL will converge to an optimal grounded policy. We then evaluate OBL in both a toy setting and Hanabi.
In the toy setting, we demonstrate that OBL learns an optimal grounded policy while other existing methods such as SP and Cognitive Hierarchies do not. In Hanabi, OBL finds fully grounded policies that reach a score of 20.92 in SP without relying on conventions, an important data point that tells us how well we can perform in this benchmark without conventions. OBL also performs strongly in zero-shot coordination settings, achieving significantly higher cross-play scores than previous methods, particularly when iterated. Finally, we find that it also plays well in ad-hoc team play (Stone et al., 2010) when paired with an agent trained by imitation learning on human data and with two distinct agents trained with reinforcement learning, all of which are unseen at training time.

## 2. Related Work

This work is inspired by the notion of Cognitive Hierarchies (CH) (Camerer et al., 2004) and k-level reasoning. CH is a framework for decision-making in multi-agent settings. Like OBL, CH starts with an uninformative level-0 policy and constructs a hierarchy of policies that are optimal assuming that other agents play the policies at lower levels. However, in contrast to OBL, CH does not differentiate between reasoning about other agents' past behavior and their expected future behavior. In other words, assuming that π0 is a uniform random policy, π1 under CH is simply a best response to a random policy. While this seems like a reasonable default, it has some rather drastic consequences. In particular, unlike OBL, it is impossible for CH to learn optimal grounded policies in many settings. The intuition here is simple: only the first CH level, the best response to a random agent, interprets actions as entirely grounded, i.e., conditioning only on observable information. However, the first level also assumes that the future moves of the partner are entirely random, which means it cannot learn strategies that require the partner to cooperate. One class of behavior that is difficult or impossible to learn is grounded signaling, where an agent can reveal information to another agent through a costly action, as we show in Section 5. The striking difference between OBL and CH can also be illustrated in fully observable turn-based settings. While OBL learns the optimal policy at the first level, CH in general needs as many levels as the length of the episode to converge to an optimal policy. Another difference is the role of π0: while in CH the exact π0 can change the final policy, in OBL any policy π0 that ignores the private observation will result in the same, perfectly uninformative belief and lead to the same OBL policy π1.

OBL is conceptually related to the Rational Speech Acts (RSA) framework, which has been able to model and explain a large amount of human communication behavior (Frank & Goodman, 2012). RSA assumes a speaker-listener setting with grounded hints and starts out with a literal listener (LL) that only considers the grounded information provided. Like CH, RSA then introduces a hierarchy of speakers and listeners, each level defined via Bayesian reasoning (i.e., a best response) given the level below. OBL allows us to train an analogous hierarchy and thus introduce pragmatic reasoning in a controlled fashion. However, while RSA is mostly a conceptual framework that is focused on simple communication tasks, OBL is designed to deal with high-dimensional, complex Dec-POMDPs in which agents have to both act and communicate through their actions.
There is a large body of work on coordination in behavioral game theory. One of the most well-known frameworks is focal points (Schelling, 1980), i.e., common-knowledge labels in the environment that allow agents to coordinate on a joint action. In contrast to this line of work, we address the zero-shot coordination setting (Hu et al., 2020), where the structure of the Dec-POMDP itself needs to be used for coordination and players specifically cannot coordinate on labels for states, actions and observations. In this setting, Other-Play (OP) (Hu et al., 2020) has been proposed as a method to prevent agents from learning equivalent but mutually incompatible policies across independent training runs. OP accomplishes this by enforcing equivariance of the policies under the symmetries of the Dec-POMDP, which must be provided as an input to the algorithm. While OP prevents agents from breaking symmetries of the environment to encode conventions, it does not prevent all arbitrary convention formation, since not all conventions are symmetry-breaking. By starting from grounded information and deriving conventions in a controlled fashion, OBL avoids both symmetry breaking and arbitrary convention formation when run with an appropriate (e.g. random) π0. Unlike OP, it does not require access to the symmetries.

## 3. Background

**Dec-POMDPs.** We consider decentralized partially observable Markov Decision Processes (Dec-POMDPs) (Nair et al., 2003) G with state s, action a, observation function Ω^i(s) for each player i, and transition function T(s, a). Players receive a common reward R(s, a). We denote the historical trajectory as $\tau = (s_1, a_1, \ldots, a_{t-1}, s_t)$ and the action-observation history (AOH) for player i as $\tau^i = (\Omega^i(s_1), a_1, \ldots, a_{t-1}, \Omega^i(s_t))$, which encodes the trajectory from player i's point of view. A policy π^i(a|τ^i_t) for player i takes as input an AOH and outputs a distribution over actions. We denote the joint policy as π. We remark that while the value V^π(τ) of policy π from any trajectory τ is well-defined, the value of playing π from an AOH τ^i is not well-defined without specifying what policy was played leading to τ^i, because that policy affects the distribution of world states at τ^i. The distribution of world states is often referred to as a belief, B_π(τ|τ^i) = P(τ|τ^i, π).

In this work, we restrict ourselves to bounded-length, turn-based Dec-POMDPs. Formally, we assume that G reaches a terminal state after at most t_max steps, and that only a single agent acts at each state s (other agents have a null action). See Section 4.5 for a discussion of turn-based vs. simultaneous-action environments.

Deep MARL has been successfully applied in many Dec-POMDP settings (de Witt et al., 2019; Foerster et al., 2019). The specific OBL algorithm introduced later uses Recurrent Replay Distributed Deep Q-Networks (R2D2) (Kapturowski et al., 2019) as its backbone. In deep Q-learning (Mnih et al., 2015), the agent learns to predict the expected total return for each action given the AOH,

$$Q(\tau^i_t, a) = \mathbb{E}_{\tau \sim P(\tau \mid \tau^i_t)}\, R_t(\tau),$$

where $R_t(\tau) = \sum_{t' \ge t} \gamma^{(t'-t)} r_{t'}$ is the forward-looking return from time t, r_{t'} is the reward at step t' with a_t = a and $a_{t'} = \arg\max_{a'} Q(\tau_{t'}, a')$ for t' > t, and γ is an optional discount factor.
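As a worked illustration of the forward-looking return above, here is a minimal Python sketch; the function name and the toy reward list are ours, not the paper's:

```python
def forward_return(rewards, t, gamma=1.0):
    """R_t(tau) = sum over t' >= t of gamma^(t'-t) * r_{t'}: the forward-looking
    return from step t of a recorded list of per-step rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Example: rewards observed from step 0 onward.
rewards = [0.0, 1.0, 0.0, 1.0]
print(forward_return(rewards, t=0, gamma=0.99))  # 0.99 * 1.0 + 0.99**3 * 1.0, roughly 1.96
```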
The state-of-the-art R2D2 algorithm incorporates many modern best practices such as double DQN (van Hasselt et al., 2016), the dueling network architecture (Wang et al., 2016), prioritized experience replay (Schaul et al., 2016), a distributed training setup with parallel running environments (Horgan et al., 2018), and a recurrent neural network for handling partial observability.

The straightforward way to apply deep Q-learning to Dec-POMDP settings is Independent Q-Learning (IQL) (Tan, 1993), where each agent treats other agents as part of the environment and learns an independent estimate of the expected return without taking other agents' actions into account in the bootstrap process. Many methods (Sunehag et al., 2017; Rashid et al., 2018) have been proposed to learn joint Q-functions to take advantage of the centralized training and decentralized control structure in Dec-POMDPs. In this work, however, we use the IQL setup with shared neural network weights θ for simplicity, and note that any potential improvements to the RL algorithm used are orthogonal to our contribution.

**Zero-Shot Coordination.** The most common problem setting for learning in Dec-POMDPs is self-play (SP), where a team of agents is trained and tested together. Optimal SP policies typically rely on arbitrary conventions, which the entire team can jointly coordinate on during training. However, many real-world problems require agents to coordinate with other, unknown AI agents and humans at test time. This desideratum was formalized as the Zero-Shot Coordination (ZSC) setting by Hu et al. (2020), where the goal is stated as finding algorithms that allow agents to coordinate with independently trained agents at test time, a proxy for the independent reasoning process in humans. ZSC immediately rules out arbitrary conventions as optimal solutions and instead requires learning algorithms that produce robust and, ideally, unique solutions across multiple independent runs.
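Cross-play between independent training runs is the quantity ZSC evaluates; a minimal sketch of how such an evaluation could be organized follows. The agent objects and the `play_episode` helper are hypothetical placeholders, not code from the paper:

```python
import itertools
import numpy as np

def cross_play_matrix(agents, play_episode, n_games=100):
    """Pair agent i from run i with agent j from run j and record the average score;
    the diagonal is self-play, the off-diagonal entries are cross-play."""
    n = len(agents)
    scores = np.zeros((n, n))
    for i, j in itertools.product(range(n), range(n)):
        scores[i, j] = np.mean(
            [play_episode(agents[i], agents[j]) for _ in range(n_games)]
        )
    return scores

def mean_cross_play(scores):
    """ZSC performance is typically reported as the mean of the off-diagonal entries."""
    mask = ~np.eye(len(scores), dtype=bool)
    return scores[mask].mean()
```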
## 4. Off-Belief Learning

One of the big challenges in ZSC under partially observable settings is to determine how to interpret the actions of other agents and how to select actions that will be interpretable to other agents. In the following we introduce OBL, a method that addresses this problem by learning a hierarchy of policies, with an optimal grounded policy at the lowest level, which does not interpret other agents' actions at all. We will discuss OBL both as a formal method with corresponding proofs and as a scalable algorithm based on a fictitious transition mechanism applicable to the deep RL setting.

### 4.1. The OBL Operator

We first introduce the OBL operator that computes π1 given any π0. If a common-knowledge policy π0 is played by all agents up to τ^i, then agent i can compute a belief distribution B_π0(τ|τ^i) = P(τ|τ^i, π0) conditional on its AOH. This belief distribution fully characterizes the effect of the history on the current state. Notice that the return for (all players) playing a counterfactual policy π0 to τ^i and π1 thereafter, which we denote V^{π0→π1}(τ^i), is precisely the expected return for all players playing according to π1, starting from a trajectory τ ∼ B_π0(τ^i). Therefore we define a counterfactual value function as follows:

$$V^{\pi_0 \to \pi_1}(\tau^i) = \mathbb{E}_{\tau \sim B_{\pi_0}(\tau^i)}\left[V^{\pi_1}(\tau)\right]. \tag{1}$$

We can similarly define counterfactual Q-values as

$$Q^{\pi_0 \to \pi_1}(a \mid \tau^i_t) = \sum_{\tau_t, \tau_{t+1}} B_{\pi_0}(\tau_t \mid \tau^i_t)\left[R(s_t, a) + \mathcal{T}(\tau_{t+1} \mid \tau_t, a)\, V^{\pi_1}(\tau_{t+1})\right]. \tag{2}$$

The OBL operator is then defined to be the operator that maps an initial policy π0 to a new policy π1 as follows:

$$\pi_1(a \mid \tau^i) = \frac{\exp\left(Q^{\pi_0 \to \pi_1}(a \mid \tau^i)/T\right)}{\sum_{a'} \exp\left(Q^{\pi_0 \to \pi_1}(a' \mid \tau^i)/T\right)} \tag{3}$$

for some temperature hyperparameter T.

### 4.2. Properties of Off-Belief Learning

Theorem 1. For any T > 0 and starting policy π0, OBL computes a unique policy π1. Proof in Appendix A.

Unique means that, once T is fixed, π1 is a well-defined deterministic function of π0. This means that, unlike most MARL methods, OBL always converges to the same answer in principle, regardless of random initialization or other hyperparameters.

Corollary 1. For any T > 0 and starting policy π0, N agents independently computing the OBL joint policy π1 = {π^i_1} and playing their parts of it (π^i_1) will achieve the same return as if they had computed a centralized π1 policy (zero-shot coordination).

Proof. Since all π1 are identical, this follows trivially.

Theorem 1 is trivial in a single-agent context, but illustrates a substantial departure from traditional multi-agent learning rules, under which independently computed policies for each agent are typically not unique or compatible due to the presence of multiple equilibria, a particularly severe example of which is the formation of arbitrary conventions for communication in a game like Hanabi.

Theorem 2. For every policy π1 generated by OBL from π0, J(π1) ≥ J(π0) − t_max · T/e, i.e. OBL is a policy improvement operator except for a term that vanishes as T → 0. Proof in Appendix A.

Theorem 2 implies that self-play performance will gradually increase if we apply OBL iteratively. Together with Corollary 1, it also guarantees that zero-shot coordination performance, i.e. cross-play between independent training runs, will improve at the same pace. Notably, unlike standard MARL, the fixed points of the OBL learning rule are not guaranteed to be equilibria of the game. However, we have the following theorem:

Theorem 3. If repeated application of the OBL policy improvement operator converges to a fixed-point policy π, then π is an ε-subgame-perfect equilibrium of the Dec-POMDP, where ε = t_max · T/e. Proof in Appendix A.

### 4.3. Optimal Grounded Policies

In multi-agent cooperative settings with imperfect information, the optimal policy for an agent depends not only on assumptions about the future policy, but also on assumptions about other agents' policies, i.e. what do those agents' prior actions say about their private information? Making wrong assumptions is a source of coordination failure, and we thus may wish to consider grounded policies that avoid reasoning about partners' actions altogether. To do so, we define the grounded belief B_G as a modified belief that conditions only on the observations but not the partner actions, i.e.

$$B_G(\tau \mid \tau^i) = \frac{P(\tau)\prod_t P(o^i_t \mid \tau)}{\sum_{\tau'} P(\tau')\prod_t P(o^i_t \mid \tau')}.$$

Then, we can define an optimal grounded policy π_G to be any policy that at each AOH τ^i plays an action that maximizes expected reward assuming the state distribution at τ^i is drawn from the grounded belief, and assuming that π_G is played thereafter.

Theorem 4. Application of OBL to any constant policy π0(a|τ^i) = f(a), or in fact any policy that only conditions on public state, yields an optimal grounded policy in the limit as T → 0. Proof in Appendix A.
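To make the operator concrete, here is a minimal tabular sketch of Eqs. (1)-(3), assuming a precomputed belief table for π0 and Q-values for π1 are available as plain dictionaries; the data structures and names are ours, not the paper's:

```python
import numpy as np

def counterfactual_q(aoh, action, belief_pi0, q_pi1):
    """Q^{pi0->pi1}(a | tau^i): expected return of playing `action` now and pi1
    thereafter, averaged over full histories tau drawn from the belief that the
    past was generated by pi0."""
    return sum(p * q_pi1[(tau, action)] for tau, p in belief_pi0[aoh].items())

def obl_policy(aoh, actions, belief_pi0, q_pi1, temperature=0.1):
    """Eq. (3): softmax with temperature T over the counterfactual Q-values."""
    q = np.array([counterfactual_q(aoh, a, belief_pi0, q_pi1) for a in actions])
    logits = q / temperature
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    return dict(zip(actions, probs / probs.sum()))
```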
### 4.4. Iterated Application of OBL

While an optimal grounded policy can be an effective baseline in some settings, clearly there are many settings where reasoning about the partner's behavior improves performance. In humans this ability is commonly described as "theory of mind", and a variety of experimental studies (Camerer et al., 2003; Agranov et al., 2012; Kleiman-Weiner et al., 2016) have shown that humans typically carry out a limited number of such reasoning steps, depending on the exact situation, even in zero-shot settings.

Iterating OBL is simple. When the OBL operator is applied to a random policy π0, it computes an optimal grounded policy π1, which we also refer to as "OBL level 1". We may compute OBL level 2 similarly by using OBL level 1 as π0, compute OBL level 3 by replacing π0 with OBL level 2, and so on. Level 2 acts optimally while using beliefs induced by level 1 to interpret partners' actions. Different from OBL level 1, a level 2 agent will assume that other agents take positive-utility actions instead of purely random actions and share useful factual information with others. For example, if it sees another agent avoid an action that would likely be beneficial, it will infer that the other agent probably has private information that the action is not so beneficial. Since all independently trained agents use the same belief (modulo convergence errors), this training procedure is helpful in ZSC, unlike self-play where agents may learn arbitrary beliefs and signaling. Iterated OBL beliefs also appear to be a better match for human behavior than any prior work. We also show in Section 6 that while OBL level 1 already cooperates better with a human-like policy than any prior method, level 2 and higher cooperate even better with it and other tested agents.

The iterated application of OBL allows us to directly control the number of cognitive reasoning steps carried out by our agent. Different from the cognitive hierarchies framework, at each level the agent plays optimally assuming optimal game play in the future, rather than assuming the partners are fixed and play one level below. Therefore it is possible to optimize the overall quality of the policy independently of the depth of cognitive reasoning in OBL, but not in cognitive hierarchy, because there we cannot reach the optimal policy at each level.

### 4.5. Simultaneous-action vs turn-based environments

So far we have restricted ourselves to turn-based games in which a single agent acts at each state. If two players acted simultaneously, then OBL could not guarantee convergence to a unique policy (Thm 1); for example, consider a one-step coordination game with payoff matrix [[1, 0], [0, 1]] (which is a single-state, single-step Dec-POMDP). OBL is identical to self-play in a one-step game, and could thus converge to either equilibrium.

This may seem like a strong restriction of OBL to turn-based games. However, any simultaneous-action game can be converted to an equivalent stage game by ordering the simultaneous actions into separate timesteps and only letting players observe the actions of others once the timesteps corresponding to the simultaneous actions have completed. An OBL policy can be generated for this equivalent turn-based game. The ordering of players' actions solves the equilibrium selection problem, because the player who acts second assumes that the other player acted according to π0. The player order we use to convert the simultaneous-action game to a turn-based one may change the OBL policy.
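As an illustration of this conversion, a minimal sketch of a wrapper that serializes one round of simultaneous moves into two consecutive turns, revealing the first player's action only once the round resolves; the underlying environment interface is a hypothetical placeholder:

```python
class TurnBasedWrapper:
    """Turns a two-player simultaneous-move game into an equivalent turn-based
    stage game: player 0 acts, then player 1 acts without having observed
    player 0's choice; only then is the joint action applied and revealed."""

    def __init__(self, sim_env):
        self.sim_env = sim_env        # assumed to expose reset() and step((a0, a1))
        self.pending = None           # player 0's queued action for this round

    def reset(self):
        self.pending = None
        return self.sim_env.reset()

    def step(self, player, action):
        if player == 0:
            # Queue the action; nothing is revealed to player 1 yet.
            self.pending = action
            return None, 0.0, False   # obs, reward, done placeholders
        # Player 1 has now also acted: resolve the round in the underlying game.
        obs, reward, done = self.sim_env.step((self.pending, action))
        self.pending = None
        return obs, reward, done
```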
### 4.6. Algorithms for Off-Belief Learning

Equation 3 immediately suggests a simple algorithm for computing an OBL policy in small tabular environments: compute B_π0(τ^i) for each AOH, and then compute Q^{π0→π1}(τ^i) for each AOH in backwards topological order. However, such approaches for POMDPs are intractable for all but the smallest problems. In order to apply value iteration methods, we can write the Bellman equation for Q^{π0→π1} for each agent i as follows:

$$Q^{\pi_0 \to \pi_1}(a_t \mid \tau^i_t) = \mathbb{E}_{\tau_t, \tau_{t+k}}\Big[\sum_{t'=t}^{t+k-1} R(\tau_{t'}, a_{t'}) + \sum_{a_{t+k}} \pi_1(a_{t+k} \mid \tau^i_{t+k})\, Q^{\pi_0 \to \pi_1}(a_{t+k} \mid \tau^i_{t+k})\Big] \tag{4}$$

where τ_{t+k} is the next history at which player i acts, τ_t ∼ B_π0(τ^i_t), and τ_{t+k} ∼ (T, π1) denotes that τ_{t+k} is sampled from the distribution of successor states where all (other) players play according to π1.

Now, how can we perform the Bellman iteration in Eq. 4? One approach, which we will denote Q-OBL, is to perform independent Q-learning where each player uses π0 as the exploration policy to play until time t and uses π1 afterwards. We can then train π1 at time t but not at other timesteps. This would guarantee that τ_t ∼ B_π0(τ^i_t). However, in order to compute the value of playing π1 in the future, we must simulate the transition τ_{t+k} ∼ (T, π1) by having the other players play according to their current π1. This is guaranteed to converge to the OBL policy (in the tabular case) as long as π0(a|τ^i) > 0 for all τ^i and a, by the same inductive argument used in Theorems 1 and 2, since Q and π1 at each AOH only depend on Q and π1 at successor AOHs. In practice, this approach has a downside: the states reached by π1 may be reached with very low probability under π0 (the bound is P(τ|π0)/P(τ|π1) ≥ (min_a P(a))^t), so this procedure will have low sample efficiency.

Instead, we first learn an approximate belief B̂_π0 that takes an AOH τ^i and can sample a trajectory from an approximation of P(τ|τ^i, π0). We compute this belief model following the procedure described in Hu et al. (2021). We then perform Q-learning with a modified rule for computing the target value for Q(a|τ^i_t): we re-sample a new τ from B̂_π0(τ^i_t), simulate a transition to τ^i_{t+1} with the other agents playing their policy π1, and use max_a Q(a|τ^i_{t+1}) as the target value. We refer to this variant as Learned-Belief OBL (LB-OBL).

An illustration comparing the LB-OBL algorithm with independent Q-learning (IQL) is shown in Figure 1. In both cases, we assume a two-player setting with a shared Q-function and 2-step Q-learning, i.e. k = 2 in Equation 4. The active player at time t is i and the partner is −i. In IQL, each player simply observes the world at each timestep and takes turns to act. The target is computed on the actual trajectory with r_t, r_{t+1} and max_a Q(a|τ_{t+2}) using a target network. By contrast, LB-OBL involves only fictitious transitions. For the active player i, we first decide the action a^i_t given AOH τ^i_t. However, in addition to applying a^i_t to the actual environment state S_t, we also apply a^i_t to a fictitious state S′_t sampled from the learned belief model B̂_π0 and forward the fictitious environment to S′_{t+1}. Note that the action a^i_t can be directly applied to the fictitious state because the observation from the fictitious state is the same as that of the real state. Next, we need to evaluate the partner's policy to produce a fictitious action a^{−i}_{t+1} on τ^{−i}_{t+1} = [τ^{−i}; a^i_t, Ω^{−i}(S′_{t+1})]. The learning target in LB-OBL is the sum of the fictitious rewards r′_t, r′_{t+1} and the fictitious bootstrapped value max_a Q(a|τ′_{t+2}).

Figure 1. Comparison of independent Q-learning (left) with LB-OBL (right). Superscripts i and −i denote player i and its partner. In LB-OBL, the target value for an action a^i_t is computed by first sampling a state from a belief model B̂_π0 that assumes π0 has been played, and then simulating play from that state to i's next turn using the current policy π1.

We also note that it is not possible to simply train a Q-learning agent that takes the grounded beliefs as an input. The agents would still develop conventions, and the grounded belief would simply lose its semantic meaning.
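A minimal sketch of how a single LB-OBL target might be assembled, assuming access to a learned belief sampler, a copy of the environment simulator, the current partner policy π1, and a target Q-network; all names and interfaces here are illustrative placeholders rather than the authors' implementation:

```python
def lb_obl_target(aoh_i, action, belief, simulator, q_net, partner_policy, gamma=1.0):
    """Compute the LB-OBL learning target for Q(action | aoh_i).

    belief(aoh_i)        -> samples a fictitious full state consistent with aoh_i,
                            assuming the past was played by pi0.
    simulator.step(s, a) -> (next_state, reward), a copy of the environment dynamics.
    partner_policy(obs)  -> the partner's action under the current pi1.
    q_net(obs)           -> dict of Q-values at the active player's next turn.
    """
    # 1) Re-sample the hidden state from the pi0-belief instead of using the real one.
    s_fict = belief(aoh_i)

    # 2) Apply the active player's real action to the fictitious state.
    s1, r0 = simulator.step(s_fict, action)

    # 3) Let the partner act on the fictitious continuation with the current pi1.
    partner_action = partner_policy(simulator.observe(s1, player="partner"))
    s2, r1 = simulator.step(s1, partner_action)

    # 4) Bootstrap with the active player's Q-values two steps ahead (k = 2).
    next_obs = simulator.observe(s2, player="active")
    bootstrap = max(q_net(next_obs).values())
    return r0 + gamma * r1 + gamma ** 2 * bootstrap
```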
## 5. Experiments in a Toy Environment

We first test OBL in a simple, fully cooperative toy environment that includes both a cheap-talk channel and a grounded communication channel. As shown in Figure 3, Alice observes a random binary variable, pet ∈ {cat, dog}, and has four possible actions. She can turn on the light bulb, bail out of the game for a fixed reward of 1, or remove a barrier to let Bob see the pet, which costs them 5. After Alice takes her action, Bob observes the outcome and has three options. He can either bail for a fixed reward of 0.5 or guess the identity of the pet. If the guess is correct, they obtain an additional 10, otherwise they lose 10.

Figure 3. Toy cooperative communication game: Alice (left) observes the pet and can either signal to Bob by turning on the light bulb, pay 5 to remove the barrier so that Bob can see the pet, or bail out for a reward of 1. Bob then needs to guess the identity of the pet or can bail out for a reward of 0.5.

**Results.** We run three different algorithms on this simple toy game: SP, OBL and cognitive hierarchies (CH). Figure 2 shows the cross-play (XP) results for 10 independent runs of each method as a matrix, where entry (i, j) is the reward obtained from pairing the first agent (Alice) from run i with the second agent (Bob) from run j.

Figure 2. Cross-play matrix for each of the different algorithms in the toy communication game (panels: a) Self-Play, b) OBL level 1, c) OBL level 2, d) OBL level 10, e) CH level 1, f) CH level 2, g) CH level 10). 10 agents were independently trained with each method; each cell (i, j) of the matrix represents the average score of the ith and jth agent. OBL with a depth of 1 (b) achieves a higher score than CH (e) by learning optimal grounded communication, while avoiding arbitrary handshakes that lead to the poor XP performance of SP (a) or multi-level reasoning (d, g).

As can be seen from the checkerboard-like pattern, the SP agents learn to use the cheap-talk channel, obtaining +10 in about half the cases and −10 in the others. Clearly, these agents have developed an arbitrary handshake where on/off corresponds to dog/cat or vice versa, each incompatible with the other half of the agents that made the opposite choice. In contrast, OBL learns the optimal grounded policy at the first level, choosing to remove the barrier, which results in a reward of +5, both in XP and SP. The CH agents at level 1 learn to simply bail out of the game, since there is no way to win when playing with a random agent. They also fail to discover the grounded solution in between and proceed straight to an arbitrary handshake at higher levels, as is indicated by the +10/−10 rewards. At higher levels, our OBL implementation starts to diverge between different runs. Although theoretically the OBL policy is unique, some environment states do not occur very often and the resulting noise is amplified higher up. This could be addressed with a higher temperature or increased regularization if desired. All results can be reproduced in our notebook implementation: https://bit.ly/3cIia6z.
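For concreteness, a minimal sketch of the toy game described above; the rewards follow the text, while the class, method and action names are our own:

```python
import random

class PetGame:
    """Minimal sketch of the toy game: Alice sees the pet and acts, then Bob acts."""

    def reset(self):
        self.pet = random.choice(["cat", "dog"])
        return self.pet                          # Alice's private observation

    def step_alice(self, action):
        # action in {"light_on", "bail", "remove_barrier"} (the actions named in the text)
        if action == "bail":
            return 1.0, None, True               # reward 1, no obs for Bob, episode over
        reward = -5.0 if action == "remove_barrier" else 0.0
        # Bob observes Alice's action, and the pet only if the barrier was removed.
        bob_obs = (action, self.pet if action == "remove_barrier" else None)
        return reward, bob_obs, False

    def step_bob(self, action):
        # action in {"bail", "guess_cat", "guess_dog"}
        if action == "bail":
            return 0.5
        return 10.0 if action == ("guess_" + self.pet) else -10.0
```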
## 6. Hanabi Experiments

We now test our methods in the more complex domain of Hanabi. Hanabi is a fully cooperative, multiplayer, partially observable board game. It has become a popular benchmark environment (Bard et al., 2020) for MARL, theory of mind, and zero-shot coordination research. Hanabi is a 2-5 player card game. The deck is composed of 50 cards, split among five different colors (suits) and ranks, with each color having three 1s, two each of 2s, 3s and 4s, and one 5. In a 2-player game, each player maintains a 5-card hand. Players can see their partner's hand but not their own. The goal of the team is to play one card of each rank in each color, in order from 1 to 5. The team shares 8 hint tokens and 3 life tokens. Taking turns, each turn a player can play or discard a card from their hand, or spend a hint token to give a hint to their partner. Playing a card succeeds if it is the lowest-rank card in its color not yet played; otherwise it fails and loses a life token. Giving a hint consists of choosing a rank or a color that the partner's hand contains and indicating all cards in the partner's hand sharing that color or rank. Discarding a card or successfully playing a 5 regains one hint token. The team's score is zero if all life tokens are lost; otherwise it is equal to the number of cards successfully played, giving a maximum possible score of 25.

**Experimental Setup.** In Hanabi, we implement LB-OBL on top of R2D2 (Kapturowski et al., 2019). Training consists of a large number of parallel environments that invoke a deep neural network to approximate the Q-function at each time step to generate trajectories, a prioritized experience replay buffer to store the trajectories, and a training loop that updates the neural network using samples from the replay buffer. The queries from different environments are dynamically batched together to run efficiently on GPUs (Espeholt et al., 2020). We introduce several novel engineering designs to make our training infrastructure significantly more efficient than previous ones in Hanabi, allowing us to reach new state-of-the-art results in SP. We refer to Appendix B for more details on the implementation and benchmarks.

Our policy (Q-function) is parameterized by a deep recurrent neural network θ. As illustrated in Figure 1, we sample a trajectory τ from the replay buffer and update the neural network with the TD error

$$\sum_{t=1}^{T}\Big[r'_t + r'_{t+1} + \max_{a'} Q_{\hat\theta}(a' \mid \tau'_{t+2}) - Q_\theta(a_t \mid \tau_t)\Big]^2 \tag{5}$$

where Q_θ̂ is the target network that uses a slightly outdated version of θ for stability reasons. In normal Q-learning with RNNs, the targets Q_θ̂(a_t|τ_t) for t = 1, 2, ..., T can be re-computed efficiently by passing the whole sequence τ through the RNN at once. However, this is no longer feasible in OBL, as each τ′_t contains unique fictitious transitions. To solve this problem, we pre-compute the target G′_t = r′_t + r′_{t+1} + max_a Q_θ̂(a|τ′_{t+2}) during rollouts and store the sequence {G′_t} along with the real trajectory τ in the replay buffer. Although this might cause stability issues if a stored target was computed using a very old target network, in practice we find it to be fairly stable because the replay buffer gets refreshed quickly thanks to the fast trajectory generation speed of our training infrastructure. On average, a trajectory is used 4-5 times before it gets evicted from the buffer.
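A minimal sketch of the rollout-side bookkeeping this describes: each fictitious target is computed with the current target network while the real trajectory is generated, then stored alongside it. The environment, network, and replay interfaces are hypothetical, and `lb_obl_target` refers to the helper sketched in Section 4.6:

```python
def generate_episode(env, q_net, target_net, belief, simulator, partner_policy, replay):
    """Roll out one real episode while pre-computing the LB-OBL targets G'_t on the
    fly, so the learner never has to re-run the fictitious transitions."""
    trajectory, targets = [], []
    obs = env.reset()
    done = False
    while not done:
        q = q_net(obs)
        action = max(q, key=q.get)        # greedy for brevity; use epsilon-greedy in practice
        # Fictitious target computed now, with the current target network.
        g = lb_obl_target(obs, action, belief, simulator, target_net, partner_policy)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        targets.append(g)
        obs = next_obs
    # Store the real trajectory together with its pre-computed fictitious targets.
    replay.add(trajectory, targets)
```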
We set up the belief training following Hu et al. (2021). The belief model is trained to predict the player's hand given τ^i, which is the only missing information needed to produce fictitious game states in Hanabi. It takes τ^i as input and predicts each of the player's cards auto-regressively, from the oldest to the newest. The belief model is an RNN trained via supervised learning:

$$\mathcal{L}(h \mid \tau^i_t) = -\sum_{k=1}^{n} \log p(h_k \mid \tau^i_t, h_{1:k-1}), \tag{6}$$

where h_k is the kth card in hand and n is the hand size.

To compare the zero-shot coordination performance of OBL against previous methods, we train three categories of policies with existing methods to serve as unseen partners for the OBL models. Rank Bot is trained with Other-Play (OP). Empirically, OP in Hanabi consistently converges to using rank-based conventions to communicate to a partner to play a card, hence the name. Color Bot is roughly the color equivalent of Rank Bot, induced by adding extra reward for hinting color in the early stage of training. Clone Bot is trained with supervised learning on human game data collected from an online board game platform. Similar to the toy example, we implemented a variant of Cognitive Hierarchies (CH) called k-level reasoning as one of our baselines. The details of the neural network design, hyper-parameters and computation cost of the OBL policies and belief models, as well as those of Rank Bot, Color Bot, Clone Bot and K-Level, can be found in Appendix B.3. We will open source our code and all models. For all the methods and OBL levels shown in Table 1, we carry out 5 independent runs with different seeds, except for the 12 SAD models directly downloaded from the open-sourced repository of Hu et al. (2020).

Table 1. Average scores in 2-player Hanabi for different pairings of agents. SP indicates play between an agent and itself. XP indicates play between agents from different independently-trained runs of the same algorithm. The other columns indicate play with agents from different algorithms. We evaluate each model on 5000 games and aggregate results of 5 independent training runs (seeds). (*) We use the 12 SAD agents obtained from Hu et al. (2020), which use different network architectures; thus the numbers are not directly comparable.

| Method | Self-Play | Cross-Play | w/ OP (Rank Bot) | w/ OP (Color Bot) | w/ Clone Bot |
|---|---|---|---|---|---|
| SAD(*) | 23.97 ± 0.04 | 2.52 ± 0.34 | 7.29 ± 1.40 | 0.18 ± 0.06 | 0.83 ± 0.31 |
| OP | 24.14 ± 0.03 | 21.77 ± 0.68 | 22.81 ± 0.87 | 4.05 ± 0.37 | 8.55 ± 0.48 |
| K-Level | 16.97 ± 1.19 | 17.17 ± 0.98 | 14.80 ± 1.77 | 12.36 ± 1.44 | 13.03 ± 1.91 |
| OBL (level 1) | 20.92 ± 0.07 | 20.85 ± 0.03 | 10.83 ± 0.26 | 13.21 ± 0.34 | 13.80 ± 0.19 |
| OBL (level 2) | 23.41 ± 0.03 | 23.24 ± 0.03 | 15.99 ± 0.30 | 18.74 ± 0.46 | 15.61 ± 0.10 |
| OBL (level 3) | 23.93 ± 0.01 | 23.68 ± 0.05 | 15.61 ± 0.34 | 20.68 ± 0.44 | 16.02 ± 0.26 |
| OBL (level 4) | 24.10 ± 0.01 | 23.76 ± 0.06 | 14.46 ± 0.59 | 21.78 ± 0.42 | 16.00 ± 0.13 |

**Grounded Policies in Hanabi.** In this experiment, we empirically show that OBL can indeed learn surprisingly strong grounded policies in Hanabi.

Figure 4. Grounded knowledge of cards played. OBL-1 mostly plays cards of known rank and color. Iterated OBL converges to an even balance of color and rank, similar to humans.

Although Theorem 4 states that applying OBL to any constant policy π0 should yield an optimal grounded policy π1, in practice we want the trajectories generated by π0 to be as diverse as possible to help train a good belief model. Hence a simple choice is to make π0 a uniform-random policy. We first train a belief model B_π0 to convergence and then use B_π0 to train the OBL policy. We repeat this training procedure with five different seeds.
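A minimal sketch of this data-collection step under a uniform-random π0, together with the autoregressive objective of Eq. (6); every interface here (the environment accessors and `belief_model.log_prob`) is a hypothetical placeholder, not the paper's code:

```python
import random

def collect_random_play_data(env, num_games):
    """Generate supervised data for the belief model: play every move with the
    uniform-random pi0 and record (AOH so far, acting player's hidden hand)."""
    dataset = []
    for _ in range(num_games):
        obs, done, aoh = env.reset(), False, []
        while not done:
            aoh.append(obs)
            # Pair the acting player's AOH with their true hand (hidden from them).
            dataset.append((list(aoh), env.acting_player_hand()))
            action = random.choice(env.legal_actions())
            aoh.append(action)
            obs, done = env.step(action)
    return dataset

def belief_loss(belief_model, aoh, hand):
    """Eq. (6): negative log-likelihood of the hand, predicted card by card, each
    card conditioned on the AOH and the previously predicted (older) cards."""
    loss = 0.0
    for k, card in enumerate(hand):
        loss -= belief_model.log_prob(card, aoh, hand[:k])
    return loss
```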
The numerical results are shown in the OBL (level 1) row of Table 1. We find that a policy with no conventions can achieve nearly 21 points in SP, which by itself is an important milestone for this popular benchmark environment. The XP score, which is computed by pairing agents with different seeds, is almost as high as the SP score, indicating that they indeed converge to nearly identical policies. The slight gap between XP and SP scores may come from noise in the belief and policy training. We note that despite being a grounded policy, OBL (level 1) already cooperates with human-like policies in Hanabi (proxied by Clone Bot) better than any prior method.

One way to quantitatively show whether a Hanabi policy is grounded is to look at how much the active player knows about a card before playing it. Playing a card unsuccessfully is costly, and therefore playing it without high confidence that it will succeed is usually suboptimal. Cards played can be categorized into four groups: 1) both color and rank of the card are known; 2) only color is known; 3) only rank is known; 4) neither is known. Here the known information can be revealed by hints or deduced from card counts. The result is shown in the OBL L1 column of Figure 4. More than 60% of the cards played by this policy are completely known to the player, significantly higher than for other agents, which may infer playability based on arbitrary signals rather than grounded knowledge. Although roughly 28% of the time the policy plays cards knowing only the rank, inspection showed that in those cases the cards are safe to play regardless of color. These observations are consistent with the theoretical prediction that OBL should converge to the optimal grounded policy in Hanabi.

**OBL Hierarchy in Hanabi.** As described in Section 4.4, we can repeat the OBL training process to get OBL level i + 1 using level i as the new π0. In practice, we can speed up the training of the next level's belief and policy by loading weights from the previous level as a starting point. We train three additional levels and the evaluation results are shown in Table 1. OBL level 2 infers beliefs over unseen cards assuming other agents are more likely to take good actions and to share factual information about useful cards the way OBL level 1 would, and so on for higher levels. Since other agents, even if they have unknown conventions, will also often act this way, these beliefs are sufficiently robust that OBL level 2 achieves even better scores than OBL level 1 with Clone Bot and with all other tested agents.

For comparison, we show results of OP applied to the same network architecture as OBL. The OP agents are also trained with the auxiliary task from the original paper. We see from the table that both SP and XP of OBL gradually increase with each level. The final SP performance is similar to that of OP while the XP score is significantly higher, showing that OBL is a better method for zero-shot coordination. OBL also performs significantly better than OP when playing with a diverse set of unseen partners. Note that OP gets high scores with Rank Bot because Rank Bot is randomly chosen from the five OP training runs. Based on its performance with Clone Bot, it is probable that OBL would achieve much better performance than OP when paired with real humans.

From Figure 4, we can see that as we progress through OBL levels, the agents play fewer grounded cards and start to infer playability based on their partners' behaviors. Unlike SP, where conventions are formed implicitly as each agent adapts to the random biases of its partner, OBL exhibits a consistent path of evolution that can be understood by reasoning through the behaviors of the lower-level OBL policy. The final level of OBL is the most similar to Clone Bot in how it decides to play cards.
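The level-by-level procedure described in this subsection amounts to a short outer loop; a minimal sketch, assuming hypothetical `train_belief` and `train_obl_policy` routines and warm-starting each level from the previous one as described above:

```python
def train_obl_hierarchy(env, num_levels, train_belief, train_obl_policy, uniform_random_policy):
    """Build the OBL hierarchy: level 1 is grounded in a uniform-random pi0,
    and each subsequent level uses the previous level's policy as its pi0."""
    pi0 = uniform_random_policy
    policies = []
    prev_belief, prev_policy = None, None
    for level in range(1, num_levels + 1):
        # Belief over hidden state assuming the past was played by pi0 (warm-started).
        belief = train_belief(env, pi0, init=prev_belief)
        # Policy that acts optimally under that belief, assuming itself in the future.
        policy = train_obl_policy(env, belief, init=prev_policy)
        policies.append(policy)
        pi0, prev_belief, prev_policy = policy, belief, policy
        print(f"finished OBL level {level}")
    return policies
```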
**OBL in 3-Player Hanabi.** We also apply OBL to 3-player Hanabi and evaluate its performance in XP and in play with Clone Bot. All OBL levels perform better than OP on both criteria. In XP, OBL (level 4) gets 23.02, which is significantly higher than the 17.36 of OP. When paired with Clone Bot, OP gets 12.18 while the highest score of OBL is 14.09, from level 2. Clone Bot itself gets 17.46 in self-play. Due to the page limit, the detailed results are provided in Appendix D.

## 7. Conclusion and Future Work

We present off-belief learning, a new method that can train optimal grounded policies, preventing agents from exchanging information through arbitrary conventions. When used in a hierarchy, each level adds one step of reasoning over beliefs from the previous level, thus providing a controlled means of reintroducing conventions and pragmatic reasoning into the learning process. Crucially, OBL removes the weirdness of learning in Dec-POMDPs, since, given the beliefs induced by the level below, each level has a unique optimal policy. Therefore, OBL at convergence can solve instances of the ZSC problem, as we illustrate in Hanabi. Importantly, OBL's performance gains under ZSC directly translate to state-of-the-art ad-hoc teamplay and human-AI coordination results, validating the "ZSC hypothesis".

An interesting direction for future work is to extend OBL to two-player zero-sum and general-sum settings, where the optimal grounded policy might lead to a new type of equilibrium concept. Furthermore, it should be possible to adapt OBL to ensure convergence to a Nash equilibrium in these settings, e.g. by training average beliefs and sampling the fictitious partner move randomly from levels below. Lastly, there are major features of human-level coordination that OBL doesn't capture: for example, human discarding behavior in Hanabi is based on being maximally predictable and robust to uncertainty in the beliefs. More generally speaking, since episodes in ZSC are temporally extended, humans are commonly able to adapt to each other's policy and to introduce conventions within the episode. OBL clearly highlights the need for fundamentally new methods that are capable of accomplishing this.

## 8. Acknowledgements

We would like to thank David Wu for his contribution in the discussions and help in the writing. He should be on the author list if we are allowed to update it. We also thank Andrei Lupu, Samuel Sokota, Michael Dennis and Wendelin Boehmer for their feedback on the manuscript. Lastly, we thank Board Game Arena for providing us with anonymized Hanabi game play data for this research and James Nesta (creator of hanabi.live) for providing pointers and advice.

## References

Agranov, M., Potamites, E., Schotter, A., and Tergiman, C. Beliefs and endogenous cognitive levels: An experimental study. Games and Economic Behavior, 75(2):449-463, 2012. doi: 10.1016/j.geb.2012.02.002. URL http://www.sciencedirect.com/science/article/pii/S089982561200022X.

Bard, N., Foerster, J. N., Chandar, S., Burch, N., Lanctot, M., Song, H. F., Parisotto, E., Dumoulin, V., Moitra, S., Hughes, E., et al. The Hanabi challenge: A new frontier for AI research.
Artificial Intelligence, 280:103216, 2020.

Camerer, C., Ho, T., and Chong, J.-K. A cognitive hierarchy theory of one-shot games and experimental analysis. August 2003. URL https://ssrn.com/abstract=411061.

Camerer, C. F., Ho, T.-H., and Chong, J.-K. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861-898, 2004.

Carroll, M., Shah, R., Ho, M. K., Griffiths, T., Seshia, S. A., Abbeel, P., and Dragan, A. D. On the utility of learning about humans for human-AI coordination. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 5175-5186, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/f5b1b89d98b7286673128a5fb112cb9a-Abstract.html.

de Witt, C., Foerster, J., Farquhar, G., Torr, P., Boehmer, W., and Whiteson, S. Multi-agent common knowledge reinforcement learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32, pp. 9927-9939. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/f968fdc88852a4a3a27a81fe3f57bfc5-Paper.pdf.

Espeholt, L., Marinier, R., Stanczyk, P., Wang, K., and Michalski, M. SEED RL: Scalable and efficient deep-RL with accelerated central inference. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgvXlrKwH.

Foerster, J., Song, F., Hughes, E., Burch, N., Dunning, I., Whiteson, S., Botvinick, M., and Bowling, M. Bayesian action decoder for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 1942-1951. PMLR, 2019.

Frank, M. C. and Goodman, N. D. Predicting pragmatic reasoning in language games. Science, 336(6084):998, 2012.

Hendon, E., Jacobsen, H. J., and Sloth, B. The one-shot-deviation principle for sequential rationality. Games and Economic Behavior, 12(2):274-282, 1996. doi: 10.1006/game.1996.0018. URL https://www.sciencedirect.com/science/article/pii/S0899825696900184.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=H1Dy---0Z.

Hu, H. and Foerster, J. N. Simplified action decoder for deep multi-agent reinforcement learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=B1xm3RVtwB.

Hu, H., Peysakhovich, A., Lerer, A., and Foerster, J. "Other-Play" for zero-shot coordination. In Proceedings of Machine Learning and Systems 2020, pp. 9396-9407. 2020.

Hu, H., Lerer, A., Brown, N., and Foerster, J. N. Learned belief search: Efficiently improving policies in partially observable settings, 2021. URL https://openreview.net/forum?id=xP37gkVKa_0.

Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. Recurrent experience replay in distributed reinforcement learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
URL https://openreview.net/forum?id=r1lyTjAqYX.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Kleiman-Weiner, M., Ho, M. K., Austerweil, J. L., Littman, M. L., and Tenenbaum, J. B. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In CogSci, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015. URL http://dx.doi.org/10.1038/nature14236.

Nair, R., Tambe, M., Yokoo, M., Pynadath, D., and Marsella, S. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In IJCAI, volume 3, pp. 705-711, 2003.

Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 4292-4301. PMLR, 2018. URL http://proceedings.mlr.press/v80/rashid18a.html.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In Bengio, Y. and LeCun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.05952.

Schelling, T. C. The strategy of conflict. Harvard University Press, 1980.

Stone, P., Kaminka, G. A., Kraus, S., and Rosenschein, J. S. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Fox, M. and Poole, D. (eds.), Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press, 2010. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/view/1843.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V. F., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and Graepel, T. Value-decomposition networks for cooperative multi-agent learning. CoRR, abs/1706.05296, 2017. URL http://arxiv.org/abs/1706.05296.

Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Schuurmans, D. and Wellman, M. P. (eds.), Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 2094-2100. AAAI Press, 2016. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12389.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Balcan, M. and Weinberger, K. Q.
(eds.), Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 1995-2003. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/wangf16.html.