# Model-Based Opponent Modeling

Xiaopeng Yu, Jiechuan Jiang, Wanpeng Zhang, Haobin Jiang, Zongqing Lu
School of Computer Science, Peking University
Correspondence to: zongqing.lu@pku.edu.cn

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

When one agent interacts with a multi-agent environment, it is challenging to deal with various opponents unseen before. Modeling the behaviors, goals, or beliefs of opponents could help the agent adjust its policy to adapt to different opponents. In addition, it is also important to consider opponents who are learning simultaneously or capable of reasoning. However, existing work usually tackles only one of the aforementioned types of opponents. In this paper, we propose model-based opponent modeling (MBOM), which employs the environment model to adapt to all kinds of opponents. MBOM simulates the recursive reasoning process in the environment model and imagines a set of improving opponent policies. To effectively and accurately represent the opponent policy, MBOM further mixes the imagined opponent policies according to the similarity with the real behaviors of opponents. Empirically, we show that MBOM achieves more effective adaptation than existing methods in a variety of tasks against different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner.

## 1 Introduction

Reinforcement learning (RL) has made great progress in multi-agent competitive games, e.g., AlphaGo [33], OpenAI Five [26], and AlphaStar [39]. In multi-agent environments, an agent usually has to compete against or cooperate with diverse other agents (collectively termed opponents, whether collaborators or competitors) unseen before. Since the opponent policy influences the transition dynamics experienced by the agent, interacting with diverse opponents makes the environment nonstationary from the agent's perspective. Due to the complexity and diversity of opponent policies, it is very challenging for the agent to retain overall supremacy.

Explicitly modeling the behaviors, goals, or beliefs of opponents [2], rather than treating them as a part of the environment, could help the agent adjust its policy to adapt to different opponents. Many studies rely on predicting the actions [11, 13, 10, 27] and goals [29, 30] of opponents during training. When facing diverse or unseen opponents, the agent policy conditions on such predictions or on representations generated by the corresponding modules. However, opponents may also have the same reasoning ability, e.g., an opponent who makes predictions about the agent's goal. In this scenario, higher-level reasoning and other modeling techniques are required to handle such sophisticated opponents [40, 43, 45]. In addition, if the opponents learn simultaneously, the modeling becomes unstable and the models fitted on historical experiences lag behind. To enable the agent to continuously adapt to learning opponents, LOLA [8] takes into account the gradients of the opponent's learning for policy updates, Meta-PG [1] formulates continuous adaptation as a meta-learning problem, and Meta-MAPG [16] combines meta-learning with LOLA. However, LOLA requires knowing the learning algorithm of opponents, while Meta-PG and Meta-MAPG require that all opponents use the same learning algorithm.
Unlike existing work, we do not make such assumptions and focus on enabling the agent to learn effectively by directly representing the policy improvement of opponents while interacting with them, even if they may also be capable of reasoning. Inspired by the intuition that humans can anticipate the future behaviors of opponents by simulating the interactions in their mind once they know the rules and mechanics of the environment, in this paper we propose model-based opponent modeling (MBOM), which employs the environment model to predict and capture the policy improvement of opponents. By simulating the interactions in the environment model, we can obtain the best responses of opponents to the agent policy that is conditioned on the opponent model. Then, the opponent model can be finetuned using the simulated best responses to get a higher-level opponent model. By recursively repeating the simulation and finetuning, MBOM imagines the learning and reasoning of opponents and generates a set of opponent models with different levels, which can be seen as recursive reasoning. However, since the learning and reasoning of opponents are unknown, a certain-level opponent model might erroneously estimate the opponent. To effectively and accurately represent the opponent policy, we further propose to mix the imagined opponent policies according to the similarity with the real behaviors of opponents, updated by the Bayesian rule.

We empirically evaluate MBOM in a variety of competitive tasks against three types of opponents, i.e., fixed policy, naïve learner, and reasoning learner. MBOM outperforms strong baselines, especially against the naïve learner and reasoning learner. Ablation studies verify the effectiveness of recursive imagination and Bayesian mixing. We also show that MBOM can be applied in cooperative tasks.

## 2 Related Work

**Opponent Modeling.** In multi-agent reinforcement learning (MARL), it is a big challenge to learn robust policies due to the unknown opponent policy. From the perspective of an agent, if opponents are considered as a part of the environment, the environment is unstable and complex for policy learning when the policies of opponents are also changing. If information about opponents is included, e.g., behaviors, goals, and beliefs, the environment may become stable, and the agent can learn using single-agent RL methods. This line of research is opponent modeling. One simple idea of opponent modeling is to build a model each time a new opponent or group of opponents is encountered [48]. However, learning a model every time is not efficient. A more computationally tractable approach is to represent an opponent's policy with an embedding vector. Grover et al. [10] use a neural network as an encoder, taking as input the trajectory of one agent; imitation learning and contrastive learning are used to train the encoder. The learned encoder can then be combined with RL by feeding the generated representation into the policy or/and value networks. Learning of the model can also be performed simultaneously with RL, as an auxiliary task [14]. Based on DQN [24], DRON [11] and DPIQN [13] use a secondary network that takes observations as input and predicts the opponents' actions; the hidden layer of this network is fed to the DQN module so that the policy can condition on it. It is also feasible to model opponents using variational auto-encoders [27], in which case the generated representations are no longer deterministic vectors but high-dimensional distributions.
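To make the prediction-based methods above concrete, the sketch below shows a generic auxiliary head in the spirit of DRON/DPIQN [11, 13]: a network predicts the opponent's action from the observation, and its hidden features condition the agent's Q-network. The module name, layer sizes, and wiring are illustrative assumptions, not the exact published architectures.

```python
import torch
import torch.nn as nn

class OpponentAwareQNetwork(nn.Module):
    """Generic sketch of an opponent-modeling auxiliary head in the spirit of
    DRON/DPIQN: predict the opponent's action and let the Q-head condition on
    the same hidden features. Sizes and wiring are illustrative assumptions."""

    def __init__(self, obs_dim: int, n_agent_actions: int, n_opp_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.opp_head = nn.Linear(hidden, n_opp_actions)            # auxiliary opponent-action prediction
        self.q_head = nn.Linear(obs_dim + hidden, n_agent_actions)  # Q-values conditioned on opponent features

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        opp_logits = self.opp_head(h)         # trained with cross-entropy on observed opponent actions
        q_values = self.q_head(torch.cat([obs, h], dim=-1))
        return q_values, opp_logits
```

The auxiliary cross-entropy loss on the opponent-action logits is optimized jointly with the usual RL loss, which is the shared idea behind this family of methods.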
ToMnet [29] tries to give agents the same Theory of Mind [28] as humans. ToMnet consists of three networks, reasoning about the agent's action and goal based on past and current information. SOM [30] implements Theory of Mind from a different perspective: it uses its own policy, the opponent's observation, and the opponent's action to work backward and infer the opponent's goal by gradient ascent. The aforementioned methods only consider opponent policies that are independent of the agent. If opponents hold a belief about the agent, the agent can form a higher-level belief about the opponents' beliefs. This process can be performed recursively, which is termed recursive reasoning. In repeated games, R2-B2 [5] performs recursive reasoning by assuming that all agents select actions according to the GP-UCB acquisition function [34], and shows that reasoning one level higher than the other agents leads to faster regret convergence. When it comes to more complicated stochastic games, recursive reasoning should be compatible with reinforcement learning algorithms. PR2 [40] and GR2 [41] use the agent's joint Q-function to perform recursive reasoning. Yuan et al. [45] take both level-0 and level-1 beliefs as input to the value function, where the level-0 belief is updated according to the Bayesian rule and the level-1 belief is updated using a learnable neural network. However, these methods [40, 41, 45] use centralized training with decentralized execution to train a set of fixed agents that cannot handle diverse opponents in execution.

If the opponents are also learning, the modeling mentioned above becomes unstable and the models fitted on historical experiences lag behind. It is therefore beneficial to take the learning process of opponents into consideration. LOLA [8] introduces the impact of the agent's policy on the anticipated parameter update of the opponent. A neural network is used to model the opponent's policy and estimate the learning gradient of the opponent's policy, which implies that the learning algorithm used by the opponent should be known; otherwise the estimated gradient will be inaccurate. Further, the opponents may keep learning during execution. Meta-PG [1] is a method based on meta policy gradient, using trajectories from the current opponents to take multiple meta-gradient steps and construct a policy that is good for the updated opponents. Meta-MAPG [16] extends this method by including an additional term that accounts for the impact of the agent's current policy on the future policies of opponents, similar to LOLA. These meta-learning based methods require that the distribution of trajectories matches across training and test, which implicitly means all opponents use the same learning algorithm. Unlike existing work, we go one step further and consider a more general setting, where the opponents could be fixed, randomly sampled from an unknowable policy set, or continuously learning using an unknowable and changeable RL algorithm, both in training and execution. We learn an environment model that allows the agent to perform recursive reasoning against opponents who may also have the same reasoning ability.

**Model-Based RL and MARL.** Model-based RL allows the agent to have access to the transition function. There are two typical branches of model-based RL approaches: background planning and decision-time planning. In background planning, the agent could use the learned model to generate additional experiences for assisting learning.
For example, Dyna-style algorithms [35, 19, 23] perform policy optimization on simulated experiences, and model-augmented value expansion algorithms [7, 25, 3] use model-based rollouts to improve the update targets. In decision-time planning [4], the agent could use the model to roll out and select the optimal action at a given state by looking forward during execution, e.g., with model predictive control. Recent studies have extended model-based methods to multi-agent settings for sample efficiency [46, 47], centralized training [42], and communication [17]. Unlike these studies, we exploit the environment model for opponent modeling.

## 3 Preliminaries

In general, we consider an $n$-agent stochastic game $(\mathcal{S}, \mathcal{A}^1, \dots, \mathcal{A}^n, P, R^1, \dots, R^n, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}^i$ is the action space of agent $i \in \{1, \dots, n\}$, $\mathcal{A} = \prod_{i=1}^{n}\mathcal{A}^i$ is the joint action space of agents, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition function, $R^i: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function of agent $i$, and $\gamma$ is the discount factor. The policy of agent $i$ is $\pi^i$, and the joint policy of the other agents is $\pi^{-i}(a^{-i}|s) = \prod_{j \neq i}\pi^{j}(a^{j}|s)$, where $a^{-i}$ is the joint action except agent $i$. All agents interact with the environment simultaneously without communication. The historical trajectory is available, i.e., for agent $i$ at timestep $t$, $\{s_0, a^i_0, a^{-i}_0, \dots, s_{t-1}, a^i_{t-1}, a^{-i}_{t-1}\}$ is observable. The goal of agent $i$ is to maximize its expected cumulative discounted reward
$$\mathbb{E}_{\,s_{t+1}\sim P(\cdot|s_t, a^i_t, a^{-i}_t),\; a^i_t\sim\pi^i(\cdot|s_t),\; a^{-i}_t\sim\pi^{-i}(\cdot|s_t)}\left[\sum_{t=0}^{\infty}\gamma^{t} R^{i}(s_t, a^i_t, a^{-i}_t)\right].$$
For convenience, the learning agent treats all other agents as a joint opponent with joint action $a^o \sim \pi^o(\cdot|s)$ and reward $r^o$. The action and reward of the learning agent are denoted as $a \sim \pi(\cdot|s)$ and $r$, respectively.

## 4 Model-Based Opponent Modeling

MBOM employs the environment model to predict and capture the learning of the opponent policy. By simulating recursive reasoning via the environment model, MBOM imagines the learning and reasoning of the opponent and generates a set of opponent models. To obtain a stronger representation ability and accurately capture the adaptation of the opponent, MBOM mixes the imagined opponent policies according to the similarity with the real behaviors of the opponent.

*Figure 1: Architecture of MBOM, consisting of the environment model (trained on the model dataset), recursive imagination, and Bayesian mixing.*

### 4.1 Recursive Imagination

If the opponent is also learning during the interaction, the opponent model fitted with historical experiences always lags behind, making it hard for the agent to adapt to the opponent. Moreover, if the opponent can adjust its policy according to the actions, intentions, or goals of the agent, then recursive reasoning may occur between agent and opponent. However, based on the lagged opponent model, the agent would struggle to keep up with the learning of the opponent. To adapt to a learning and reasoning opponent, the agent should predict the current opponent policy and reason more deeply than the opponent. MBOM explicitly simulates the recursive reasoning process using the environment model, called recursive imagination, to generate a series of opponent models, called Imagined Opponent Policies (IOPs).

First, we pretrain the agent's policy $\pi$ using PPO [32] while interacting with $\nu$ different opponents with diverse policies, which can be obtained for example by [37], and collect a buffer $\mathcal{D}$ that contains the experiences $\langle s, a, a^o, s', \boldsymbol{r}\rangle$, where $\boldsymbol{r} = (r, r^o)$.
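As a concrete illustration of the joint-opponent convention and the pretraining buffer $\mathcal{D}$ just described, here is a minimal collection sketch; the environment interface, the pretrained policy, and all names are hypothetical stand-ins rather than the actual implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Transition:
    s: np.ndarray       # state
    a: np.ndarray       # agent action
    a_o: np.ndarray     # joint opponent action (all other agents concatenated)
    s_next: np.ndarray  # next state
    r: float            # agent reward
    r_o: float          # joint opponent reward (assumed observable during collection)

def collect_buffer(env, agent_policy, opponents, num_steps):
    """Roll out the pretrained agent against a set of opponents and store
    <s, a, a_o, s', (r, r_o)> tuples for training the environment model and the
    level-0 IOP. `env`, `agent_policy`, and `opponents` are hypothetical objects."""
    buffer = []
    s = env.reset()
    for _ in range(num_steps):
        a = agent_policy(s)
        a_o = np.concatenate([opp(s) for opp in opponents])  # treat the others as one joint opponent
        s_next, r, r_o, done = env.step(a, a_o)
        buffer.append(Transition(s, a, a_o, s_next, r, r_o))
        s = env.reset() if done else s_next
    return buffer
```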
For a zero-sum game ($r + r^o = 0$) and a fully cooperative game ($r = r^o$), $r^o$ can be easily obtained, while for a general-sum game we make the assumption that $r^o$ can be accessed during experience collection. Then, using the experience buffer $\mathcal{D}$, we can train the environment model $\Gamma(s', \boldsymbol{r}\,|\,s, a, a^o; \zeta)$ by minimizing the mean squared error
$$\mathbb{E}_{\langle s, a, a^o, s', \boldsymbol{r}\rangle \sim \mathcal{D}}\left[\tfrac{1}{2}\lVert s' - \hat{s}'\rVert^2 + \tfrac{1}{2}\lVert \boldsymbol{r} - \hat{\boldsymbol{r}}\rVert^2\right], \qquad (2)$$
where $\hat{s}', \hat{\boldsymbol{r}} = \Gamma(s, a, a^o; \zeta)$, and obtain the level-0 IOP $\pi^o_0(\cdot|s;\phi_0)$ by maximum-likelihood estimation, i.e., maximizing
$$\mathbb{E}_{\langle s, a^o\rangle \sim \mathcal{D}}\left[\log \pi^o_0(a^o|s;\phi_0)\right]. \qquad (3)$$

To imagine the learning of the opponent, as illustrated in Figure 1, we use the rollout algorithm [38] to get the best response of the opponent to the agent policy $\pi$. For each opponent action $a^o_t$ at timestep $t$, we uniformly sample the opponent action sequences of the next $k$ timesteps, simulate the trajectories using the learned environment model $\Gamma_\zeta$, and select the best response with the highest rollout value,
$$a^{o*}_t = \operatorname*{argmax}_{a^o_t}\;\max_{a^o_{t+1},\dots,a^o_{t+k}\sim \mathrm{Unif}}\;\sum_{j=0}^{k}\gamma^{j} r^o_{t+j}. \qquad (4)$$
During the simulation, the agent acts according to the policy conditioned on the modeled opponent policy, $a_t \sim \pi(\cdot\,|\,s_t, a^o_t \sim \pi^o_0(\cdot|s_t;\phi_0); \theta)$, and the learned environment model provides the transition $s_{t+1}, \boldsymbol{r}_t = \Gamma(s_t, a_t, a^o_t; \zeta)$. With larger $k$, the rollout has a longer planning horizon and thus can evaluate the action $a^o$ more accurately, assuming a perfect environment model. However, the computation cost of the rollout increases exponentially with the planning horizon, while in practice the compounding error of the environment model also increases with the planning horizon [15]. Therefore, the choice of $k$ is a tradeoff between accuracy and cost. Specifically, for a zero-sum game and a fully cooperative game, we can approximately estimate the opponent state value $V^o(s)$ as $-V(s)$ and $V(s)$, respectively, and modify the rollout value like an $n$-step return [36] to obtain a longer horizon (see Appendix C for the empirical effect of $k$),
$$a^{o*}_t = \operatorname*{argmax}_{a^o_t}\;\max_{a^o_{t+1},\dots,a^o_{t+k}\sim \mathrm{Unif}}\left[\sum_{j=0}^{k}\gamma^{j} r^o_{t+j} + \gamma^{k+1} V^o(s_{t+k+1})\right]. \qquad (5)$$

**Algorithm 1: MBOM**

*Pretraining:*
1. Initialize the recursive imagination level $M$ and the weights $\alpha = (1/M, 1/M, \dots, 1/M)$.
2. Interact with $\nu$ learning opponents to train the agent policy $\theta$ and collect the experience buffer $\mathcal{D}$.
3. Train the level-0 IOP $\phi_0$ and the environment model $\Gamma_\zeta$ using the experience buffer $\mathcal{D}$.

*Interaction:* for $t = 1, \dots, \mathrm{max\_epoch}$:
4. Interact with the opponent to observe the real opponent actions.
5. (Recursive imagination) Finetune $\phi_0$ with the real opponent actions; then for $m = 1, \dots, M-1$: rollout in $\Gamma_\zeta$ with $\pi(\cdot|s, a^o \sim \pi^o_{m-1}(\cdot|s;\phi_{m-1}); \theta)$ to get $a^{o*}$ by (4) or (5), and finetune $\phi_{m-1}$ with the best responses to get $\phi_m$.
6. (Bayesian mixing) Update $\alpha$ by (8) and mix the IOPs to get $\pi^o_{\mathrm{mix}}$ by (6).

By imagination, we obtain the best response of the opponent to the agent policy $\pi$ and construct the simulated data $\{\langle s, a^{o*}\rangle\}$. Then, we use this data to finetune the level-0 IOP $\pi^o_0(\cdot|s;\phi_0)$ by maximum-likelihood estimation and obtain the level-1 IOP $\pi^o_1(\cdot|s;\phi_1)$. The level-1 IOP can be seen as the best response of the opponent to the response of the agent conditioned on the level-0 IOP. In imagination, this takes the nested form "the opponent believes [that the agent believes (that the opponent believes ...)]". The existing imagined opponent policy is the innermost "(that the opponent believes)", while the outermost "the opponent believes" corresponds to $a^{o*}$, which is obtained by the rollout process.
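The rollout in (4) and (5) can be sketched as follows. The environment model, the conditional agent policy, the level-$(m{-}1)$ IOP, and the optional opponent value estimate are passed in as callables; their signatures, the discrete candidate action set, and the number of sampled continuations are assumptions made for illustration.

```python
import numpy as np

def imagined_best_response(s_t, env_model, agent_policy, iop, opp_actions,
                           k=3, n_samples=64, gamma=0.99, v_opp=None):
    """Sketch of Eq. (4)/(5): pick the opponent action with the highest rollout
    value in the learned environment model, while the agent acts with its policy
    conditioned on the current IOP. All callables are assumptions:
      env_model(s, a, a_o) -> (s_next, r_agent, r_opp)
      agent_policy(s, a_o_pred) -> a     # policy conditioned on the modeled opponent
      iop(s) -> a_o_pred                 # level-(m-1) imagined opponent policy
      v_opp(s) -> float                  # optional opponent value estimate, Eq. (5)
    `opp_actions` is a list of discrete candidate opponent actions."""
    best_action, best_value = None, -np.inf
    for a_o_first in opp_actions:                       # candidate first opponent action a^o_t
        value_of_first = -np.inf
        for _ in range(n_samples):                      # uniformly sampled continuations
            s, ret, a_o = s_t, 0.0, a_o_first
            for j in range(k + 1):
                a = agent_policy(s, iop(s))             # agent responds to the modeled opponent
                s, _, r_o = env_model(s, a, a_o)
                ret += (gamma ** j) * r_o
                a_o = opp_actions[np.random.randint(len(opp_actions))]  # next opponent action, uniform
            if v_opp is not None:                       # n-step-like bootstrap of Eq. (5)
                ret += (gamma ** (k + 1)) * v_opp(s)
            value_of_first = max(value_of_first, ret)   # max over sampled sequences
        if value_of_first > best_value:
            best_action, best_value = a_o_first, value_of_first
    return best_action                                  # imagined best response a^{o*}_t
```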
Recursively repeating the rollout and finetuning, where the agent policy is conditioned on the IOP $\pi^o_{m-1}(\cdot|s;\phi_{m-1})$, we can derive the level-$m$ IOP $\pi^o_m(\cdot|s;\phi_m)$.

### 4.2 Bayesian Mixing

By recursive imagination, we get $M$ IOPs with different reasoning levels. However, since the learning and reasoning of the opponent are unknown, a single IOP might erroneously estimate the opponent. To obtain stronger representation capability and accurately capture the learning of the opponent, we linearly combine the IOPs to get a mixed IOP,
$$\pi^o_{\mathrm{mix}}(\cdot|s) = \sum_{m=0}^{M-1}\alpha_m\,\pi^o_m(\cdot|s;\phi_m), \qquad (6)$$
where $\alpha_m$ is the weight of the level-$m$ IOP, which is taken as $p(\pi^o_m|a^o)$, also denoted as $p(m|a^o)$, the probability that the opponent action $a^o$ is generated from the level-$m$ IOP. Thus, $p(m|a^o)$ indicates the similarity between the level-$m$ IOP $\pi^o_m$ and the opponent policy $\pi^o$ in the most recent stage. By the Bayesian rule, we have
$$p(m|a^o) = \frac{p(a^o|m)\,p(m)}{\sum_{i=0}^{M-1}p(a^o|i)\,p(i)} = \frac{\pi^o_m(a^o|s;\phi_m)\,p(m)}{\sum_{i=0}^{M-1}\pi^o_i(a^o|s;\phi_i)\,p(i)}. \qquad (7)$$
Updating $p(m|a^o)$ during the interaction yields a more accurate estimate of the improving opponent policy. In practice, we use the moving average of $p(m|a^o)$ as the prior $p(m)$, take the decayed moving average of $p(m|a^o)$ as $\Psi_m$, and obtain $\alpha$ by the IOPs mixer,
$$\alpha = (\alpha_0, \dots, \alpha_{M-1}) = \text{softer-softmax}(\Psi_0, \dots, \Psi_{M-1}). \qquad (8)$$
Softer-softmax [12] is a variant of softmax that uses a higher temperature to produce a softer probability distribution. The IOPs mixer is non-parametric, so it can be updated quickly and efficiently, without parameter training or many interactions (see Appendix B for an empirical analysis of $\alpha$). Therefore, the IOPs mixer can adapt to a fast-improving opponent. For completeness, the full procedure of MBOM is given in Algorithm 1.
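A minimal sketch of the mixing step in (6)–(8) follows; the decay rate and temperature are assumed hyperparameter values, and `softer_softmax` is simply a softmax with a temperature greater than one, as described above.

```python
import numpy as np

def softer_softmax(psi, temperature=5.0):
    """Softmax with a high temperature, yielding a softer distribution (Eq. 8)."""
    z = np.asarray(psi, dtype=float) / temperature
    z = z - z.max()                                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

class IOPMixer:
    """Non-parametric mixer over the M imagined opponent policies (Eqs. 6-8).
    The decay rate and temperature are assumed values for illustration."""

    def __init__(self, n_levels, decay=0.9, temperature=5.0):
        self.prior = np.full(n_levels, 1.0 / n_levels)  # moving-average prior p(m)
        self.psi = np.full(n_levels, 1.0 / n_levels)    # decayed moving average of p(m|a^o)
        self.decay = decay
        self.temperature = temperature

    def update(self, iop_probs_of_action):
        """iop_probs_of_action[m] = pi^o_m(a^o | s; phi_m) for the observed opponent action."""
        likelihood = np.asarray(iop_probs_of_action, dtype=float)
        posterior = likelihood * self.prior
        posterior /= posterior.sum()                    # Bayes rule, Eq. (7)
        self.prior = self.decay * self.prior + (1 - self.decay) * posterior
        self.psi = self.decay * self.psi + (1 - self.decay) * posterior
        return softer_softmax(self.psi, self.temperature)   # alpha, Eq. (8)

    def mixed_policy(self, iop_action_dists, alpha):
        """Eq. (6): alpha-weighted mixture of the per-level action distributions."""
        return np.tensordot(alpha, np.asarray(iop_action_dists), axes=1)
```

In use, `update` is called once per observed opponent action with the per-level likelihoods $\pi^o_m(a^o|s;\phi_m)$, and the returned $\alpha$ weights the per-level action distributions in `mixed_policy`.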
### 4.3 Theoretical Analysis

MBOM consists of several components, including a conditional policy, an environment model, and $M$ IOPs. They all interact with each other through model rollouts, which makes it extremely hard to give an error analysis based on each individual component of MBOM. Therefore, we instead focus on recursive imagination and Bayesian mixing, and give a deeper understanding of MBOM. Proofs are available in Appendix A.

Without loss of generality, we define $\delta_m$ as the discrepancy between the level-$m$ and level-$(m{-}1)$ IOPs in terms of the estimated value function given the IOP, and $\varepsilon_m$ as the error of the level-$m$ IOP compared to the true value function given the opponent policy, i.e., $\delta_m := |\widehat{V}_m - \widehat{V}_{m-1}|$, $\varepsilon_m := |\widehat{V}_m - V|$, and $\delta_0 := \varepsilon_0$. Then, we have the following two lemmas.

**Lemma 1.** *For the mixed imagined opponent policy (IOP) $\pi^o_{\mathrm{mix}}(\cdot|s) = \sum_{m=0}^{M-1}\alpha_m\pi^o_m(\cdot|s;\phi_m)$, the total error $\varepsilon_{\mathrm{total}} = \sum_{m=0}^{M-1}\alpha_m\varepsilon_m$ satisfies the following inequality:*
$$\varepsilon_{\mathrm{total}} \le \sum_{m=0}^{M-1}\alpha_m\sum_{i=0}^{m}\delta_i.$$

**Lemma 2.** *Suppose the value function is Lipschitz continuous on the state space $\mathcal{S}$ with Lipschitz constant $K$, and let $\widehat{\mathcal{M}}_i(s,a) := \Gamma(s,a,a^o;\zeta)\,\pi^o_i(a^o|s;\phi_i)$ denote the transition distribution given $s$ and $a$. Then*
$$\big|\widehat{V}_i - \widehat{V}_j\big| \le \frac{\gamma K}{1-\gamma}\,\mathbb{E}_{s,a\sim\pi,\widehat{\mathcal{M}}_j}\Big[\big\lVert \widehat{\mathcal{M}}_i(s,a) - \widehat{\mathcal{M}}_j(s,a)\big\rVert\Big].$$

According to (6) and (7), we can express $\alpha_m$ as
$$\alpha_m := \mathbb{E}\left[p(m|a^o)\right] = \sum_{s,a^o} d(s,a^o)\,\frac{\pi^o_m(a^o|s;\phi_m)\,p(m)}{\sum_{i=0}^{M-1}\pi^o_i(a^o|s;\phi_i)\,p(i)},$$
where $d(s,a^o)$ is the stationary distribution of pairs of state and opponent action. Then, based on Lemmas 1 and 2, we can obtain the following theorem, where $\widehat{\mathcal{M}}_{-1}$ denotes $P(\cdot|s,a,a^o)\,\pi^o(a^o|s)$, the true transition dynamics experienced by the agent.

**Theorem 1.** *Define the error between the approximated value function of the Bayesian mixing IOP and the true value function as $\varepsilon_{\mathrm{true}} := |\widehat{V} - V|$. For the Bayesian mixing IOP, the true error satisfies the following inequality:*
$$\varepsilon_{\mathrm{true}} \le \varepsilon_{\mathrm{total}} \le \frac{\gamma K}{1-\gamma}\sum_{m=0}^{M-1}\mathbb{E}_{s,a\sim\pi,\widehat{\mathcal{M}}_{m-1}}\!\left[\big\lVert \widehat{\mathcal{M}}_m(s,a) - \widehat{\mathcal{M}}_{m-1}(s,a)\big\rVert\,\frac{\sum_{i=m}^{M-1}\pi^o_i(a^o|s;\phi_i)\,p(i)}{\sum_{j=0}^{M-1}\pi^o_j(a^o|s;\phi_j)\,p(j)}\right].$$

Theorem 1 tells us that the error accumulates as the level $M$ increases. However, a larger $M$ also has advantages. To analyze this, we define the benefit of using the mixed IOP. We use the negative distribution distance $\eta := -\lVert\hat{\pi} - \pi\rVert$ to denote the benefit of using a policy at a given level, where $\hat{\pi}$ is the estimated opponent policy, $\pi$ is the true opponent policy, and $\lVert\cdot\rVert$ is a general form of distribution distance. For a mixed policy, we define the benefit as
$$\eta_M = -\big\lVert(\alpha_0\pi^o_0 + \dots + \alpha_{M-1}\pi^o_{M-1}) - \pi^o\big\rVert.$$
Then, we have the following lemma and theorem.

**Lemma 3.** *Assume that the true opponent policy $\pi^o$ has a probability distribution over the given set of IOPs $\{\pi^o_0, \dots, \pi^o_{M-1}\}$, with probability $\Pr\{\pi^o = \pi^o_m\} = p_m$ for each $\pi^o_m$. Then the maximum expectation of $\eta_M$ is achieved if and only if $\alpha_m = p_m$ for all $m \in [0, M-1]$.*

**Theorem 2.** *Given the action trajectory of the opponent $\{a^o_0, a^o_1, \dots, a^o_t\}$, the posterior probability updated by Bayesian mixing approximates the true probability of the opponent as the length of the trajectory grows, i.e., $P(m|a^o_t, \dots, a^o_1, a^o_0) \to P_{\mathrm{true}}(m) := p_m$ as $t \to \infty$. Then the maximum expectation of $\eta_M$ can be achieved with $\alpha_m = \mathbb{E}\left[P(m|a^o_t, \dots, a^o_1, a^o_0)\right]$.*

According to Theorem 2, the estimated probability distribution over the IOPs converges to the true distribution of the opponent. It is obvious that a larger $M$ improves the representation capability of the IOPs and thus better satisfies the assumption in Lemma 3, but it also increases the error bound in Theorem 1. Therefore, the selection of $M$ is a tradeoff between them (see Appendix C for an empirical study on $M$).

## 5 Experiments

We first evaluate MBOM thoroughly in two-player zero-sum tasks. Then, we investigate MBOM against multiple opponents and in a cooperative task. In all the experiments, the baselines have the same neural network architectures as MBOM. All the methods are trained for five runs with different random seeds, and results are presented as means with 95% confidence intervals. More details about experimental settings and hyperparameters are available in Appendix E.

### 5.1 Two-Player Zero-Sum Tasks

We evaluate MBOM in two competitive tasks: (1) *Triangle Game* is an asymmetric zero-sum game implemented on Multi-Agent Particle Environments (MPE) [21]. When facing different opponent policies, the agent has to adjust its policy to adapt to the opponent for a higher reward. (2) *One-on-One* is a two-player competitive game implemented on Google Research Football [18]. The goalkeeper, controlled by the agent, can only passively react to the strategies of the shooter, controlled by the opponent, and must adapt its policy when the shooter strategy changes.

**Baselines.** In the experiments, we compare MBOM with the following methods: LOLA-DiCE [9] is an extension of LOLA that uses the Differentiable Monte-Carlo Estimator (DiCE) to consider how to shape the learning dynamics of other agents. Meta-PG [1] uses trajectories from the current opponents to take multiple meta-gradient steps and construct a policy that is good for the updated opponents.
Meta-MAPG [16] includes an additional term that accounts for the impact of the agent's current policy on the future policies of opponents, compared with Meta-PG. PPO [32] is a classical single-agent RL algorithm, without any other modules.

**Opponents.** We construct three types of opponents (a rough sketch of these opponent types is given below):
- *Fixed policy.* The opponents are pre-trained but not updated during interaction.
- *Naïve learner.* The opponents are pre-trained and updated using PPO during interaction.
- *Reasoning learner.* The opponents are pre-trained and can model the behavior of the agent. The model is finetuned during interaction, and their policy is conditioned on the action predicted by the model.

**Performance.** The experimental results against test opponents in Triangle Game and One-on-One are shown in Figure 2, and the mean performance with standard deviation over all test opponents is summarized in Table 1. Without explicitly considering the opponent policy, PPO achieves poor performance. The learning of LOLA-DiCE depends on the gradient of the opponent model. However, the gradient information cannot clearly reflect the distinctions between diverse opponents, so LOLA-DiCE cannot adapt to unseen opponents quickly and effectively. Meta-PG and Meta-MAPG show mediocre performance in Triangle Game. In One-on-One, since Meta-PG and Meta-MAPG heavily rely on the reward signal, adaptation is difficult for these two methods in this sparse-reward task. MBOM outperforms Meta-MAPG and Meta-PG, because the opponent model of MBOM improves the ability to adapt to different opponents. The performance gain becomes more significant when the opponent is a naïve learner or reasoning learner, which is attributed to the recursive reasoning capability brought by recursive imagination in the environment model and to Bayesian mixing, which quickly captures the learning of the opponent.

*Figure 2: Performance against different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner, where the x-axis is the opponent index. Panels: (a) Triangle Game versus fixed policy; (b) Triangle Game versus naïve learner; (c) Triangle Game versus reasoning learner; (d) One-on-One versus fixed policy; (e) One-on-One versus naïve learner; (f) One-on-One versus reasoning learner. Methods: LOLA-DiCE, Meta-PG, Meta-MAPG, PPO, MBOM. The results show that MBOM outperforms the other baselines, especially against the naïve learner and reasoning learner.*
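For concreteness, the three types of test opponents described above could be wrapped as follows; every name and update rule here is a hypothetical stand-in for the actual evaluation setup.

```python
import copy

def make_test_opponent(kind, pretrained_opponent, agent_model=None):
    """Rough sketch of the three opponent types: fixed policy, naive learner
    (updated online with PPO), and reasoning learner (conditions on a finetuned
    prediction of the agent's action). All objects are hypothetical stand-ins."""
    opponent = copy.deepcopy(pretrained_opponent)

    if kind == "fixed":
        def act(obs):
            return opponent.act(obs)                          # never updated
    elif kind == "naive":
        def act(obs):
            action = opponent.act(obs)
            opponent.ppo_update()                             # hypothetical online PPO update
            return action
    elif kind == "reasoning":
        def act(obs):
            predicted_agent_action = agent_model.predict(obs)  # model finetuned during interaction
            return opponent.act(obs, predicted_agent_action)
    else:
        raise ValueError(f"unknown opponent type: {kind}")
    return act
```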
**Table 1: Performance on Triangle Game (score, higher is better) and One-on-One (win rate, %),** reported as mean (standard deviation) over all test opponents.

| Method | Triangle Game: Fixed Policy | Triangle Game: Naïve Learner | Triangle Game: Reasoning Learner | One-on-One: Fixed Policy | One-on-One: Naïve Learner | One-on-One: Reasoning Learner |
|---|---|---|---|---|---|---|
| LOLA-DiCE | -22.51 (5.22) | -20.48 (4.02) | -21.55 (4.43) | 33.0 (0.5) | 25.2 (1.2) | 18.4 (0.9) |
| Meta-PG | -3.78 (0.13) | -6.72 (0.18) | -8.35 (1.87) | 41.7 (0.7) | 35.4 (2.5) | 28.0 (0.6) |
| Meta-MAPG | -2.01 (0.06) | -5.76 (0.29) | -6.14 (0.84) | 44.4 (1.5) | 36.6 (1.6) | 31.1 (1.9) |
| PPO | -13.29 (1.80) | -18.42 (1.28) | -20.51 (1.80) | 40.6 (0.7) | 21.5 (0.7) | 25.8 (1.0) |
| MBOM | -1.19 (0.03) | -1.66 (0.29) | -2.75 (0.89) | 59.5 (0.9) | 51.3 (1.0) | 64.2 (1.8) |

**Ablation Studies.** The opponent modeling module of MBOM is composed of recursive imagination and Bayesian mixing. In the following, we test the functionality of these two components separately.

In recursive imagination, MBOM generates a sequence of IOPs of different levels, finetuned with the imagined best responses of the opponent in the environment model. For comparison, we use only $\phi_0$ as the opponent model (i.e., without recursive imagination), denoted as MBOM w/o IOPs, and we use random actions to finetune the IOPs rather than the best responses, denoted as MBOM-BM. Note that MBOM w/o IOPs is essentially the same as the simple opponent modeling methods that predict opponents' actions, like [11, 13, 10]. The experimental results of MBOM, MBOM w/o IOPs, and MBOM-BM are shown in Figures 3(a) and 3(b). MBOM w/o IOPs obtains similar results to MBOM when facing fixed-policy opponents, because $\phi_0$ can accurately predict the opponent behaviors if the opponent is fixed, which corroborates the empirical results in [11, 13, 10]. However, if the opponent is learning or reasoning, $\phi_0$ cannot accurately represent the opponent, and thus the performance of MBOM w/o IOPs drops. The methods with IOPs perform better than MBOM w/o IOPs, which means that a variety of IOPs can capture the policy changes of learning opponents. When facing the reasoning learner, MBOM outperforms MBOM-BM, indicating that the IOPs finetuned with the imagined best responses have a stronger ability to represent the reasoning learner.

In Bayesian mixing, MBOM mixes the generated IOPs to obtain a policy that is close to the real opponent. As a comparison, we directly use the individual level-$m$ IOP generated by recursive imagination without mixing, denoted as MBOM-$\phi_m$, and we use uniform mixing instead of Bayesian mixing, denoted as MBOM-unif. The experimental results are illustrated in Figures 3(c) and 3(d). Benefiting from recursive imagination, MBOM-$\phi_1$ and MBOM-$\phi_2$ show stronger performance than MBOM-$\phi_0$. MBOM consistently outperforms the ablation baselines without mixing when fighting against reasoning learners. The middling performance of MBOM-unif is due to not exploiting the more representative IOPs, indicating that Bayesian mixing obtains a more accurate estimate of reasoning opponents.

*Figure 3: Ablation study of MBOM against fixed policy, naïve learner, and reasoning learner. MBOM is compared with MBOM-BM and MBOM w/o IOPs on recursive imagination in (a) Triangle Game and (b) One-on-One, and with MBOM-$\phi_0$, MBOM-$\phi_1$, MBOM-$\phi_2$, and MBOM-unif on Bayesian mixing in (c) Triangle Game and (d) One-on-One.*
*Figure 4: (a) Performance against different types of opponents in Predator-Prey, where a smaller number of touches means better performance (methods: Meta-MAPG, Meta-PG, PPO, LOLA-DiCE, MBOM). (b) Learning curves in Coin Game over 200 iterations, where the numbers in the legend are the mean score with standard deviation over 200 iterations: LOLA-DiCE, Meta-PG & Meta-MAPG, PPO (7.9 ± 2.1), MBOM (8.2 ± 1.2).*

We additionally provide an analysis of the weight $\alpha$ and ablation studies on hyperparameters, including the rollout length $k$ and the level of recursive imagination $M$, which are available in Appendix B and Appendix C, respectively.

### 5.2 Multiple Opponents

When facing multiple opponents, MBOM takes them as a joint opponent, which is consistent with the single-opponent case. We perform experiments in Predator-Prey [21]. We control the prey as the agent and take the three predators as opponents. In this task, the agent tries not to be touched by the three opponents. When the policies of the opponents are fixed (e.g., enveloping the agent), it is difficult for the agent to get a high reward. However, if the policies of the opponents change during interaction, there may be chances for the agent to escape. The results are shown in Figure 4(a). When facing learning opponents, MBOM achieves a performance gain, especially when competing with the reasoning learner, which indicates that MBOM adapts to the learning and reasoning of the opponents and responds effectively. Meta-MAPG and Meta-PG do not effectively adapt to the changes of the opponents' policies and are prone to being touched by the opponents many times in a small area, resulting in poor performance. LOLA-DiCE, Meta-PG, and Meta-MAPG underperform PPO when against multiple reasoning learners, indicating that their adaptation may have a negative effect on the agent's performance. Detailed experimental results with different opponents are available in Appendix D.

### 5.3 Cooperative Task

MBOM can also be applied to cooperative tasks. We test MBOM on a cooperative scenario, Coin Game, which is a high-dimensional extension of the iterated prisoner's dilemma with multi-step actions [20, 8]. Both agents simultaneously update their policies using MBOM or the baseline methods to maximize the sum of rewards. The experimental results are shown in Figure 4(b). Meta-PG and Meta-MAPG degenerate to policy gradient for this task as there is no training set. Both learn a greedy strategy that collects coins of any color, which leads to a zero total score for the two players. LOLA-DiCE learns too slowly and does not learn to cooperate within 200 iterations, indicating the inefficiency of estimating the opponent gradients. PPO learns to cooperate quickly and successfully, which corroborates the good performance of independent PPO in cooperative multi-agent tasks as pointed out in [6, 44]. MBOM slightly outperforms PPO, which indicates that MBOM can also be applied to cooperative tasks without negative effects. Note that we do not compare with other cooperative MARL methods like QMIX [31], as they require centralized training.

### 5.4 Opponent Model Accuracy Analysis

MBOM intends to achieve a higher return by improving model accuracy via adapting to the changing opponent. In the previous sections, we have shown that MBOM achieves superior performance. In this section, we investigate the correlation between the performance and the prediction accuracy of the opponent model.
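The prediction error reported below is the KL divergence between the opponent model's predicted action distribution and the opponent's real action distribution; a minimal sketch of such a measurement follows (the names and the direction of the KL are assumptions).

```python
import numpy as np

def opponent_model_kl(true_dists, pred_dists, eps=1e-8):
    """Average KL(true || predicted) over visited states, as a measure of
    opponent-model prediction error. Both inputs have shape
    (num_states, num_opponent_actions); the KL direction is an assumption."""
    p = np.clip(np.asarray(true_dists, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(pred_dists, dtype=float), eps, 1.0)
    p /= p.sum(axis=-1, keepdims=True)
    q /= q.sum(axis=-1, keepdims=True)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))
```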
First, we test the prediction error of the opponent model, using the KL divergence to measure the distance between the predicted action distribution and the real action distribution of the opponent policy. The results, compared with MBOM w/o IOPs (without recursive imagination), are shown in Table 2 (the Error columns). Overall, MBOM's opponent model achieves more accurate predictions than MBOM w/o IOPs. Then, we test the performance of the same policy $\pi(\cdot|s, a^o; \theta)$ with different opponent models. The opponent model of MBOM w/o IOPs is continuously updated by supervised learning on the observed opponent actions and has lower prediction accuracy than MBOM's opponent model. We also consider MBOM-pro, which uses the true actions of the opponent as the input of the policy during the interaction phase instead of the actions predicted by the IOPs; that is, the prediction accuracy of MBOM-pro's opponent model can be considered to be 100%. With increasing prediction accuracy of the opponent model, the performance also increases correspondingly, from MBOM w/o IOPs to MBOM to MBOM-pro, as shown in Table 2 (the Performance columns). The results show that a more accurate opponent model improves performance, and that MBOM indeed obtains a more accurate opponent model when the opponent is learning.

**Table 2: Prediction accuracy of the opponent model (Error, lower is better) and performance in Triangle Game and One-on-One.** Performance on Triangle Game is score (higher is better); on One-on-One it is win rate (%). Standard deviations in parentheses.

| Task / Method | Fixed policy: Error | Fixed policy: Performance | Naïve learner: Error | Naïve learner: Performance | Reasoning learner: Error | Reasoning learner: Performance |
|---|---|---|---|---|---|---|
| Triangle Game: MBOM w/o IOPs | 15.72 (0.18) | -1.14 (0.03) | 15.17 (0.17) | -3.23 (0.07) | 8.19 (0.09) | -7.10 (2.00) |
| Triangle Game: MBOM | 15.47 (0.04) | -1.19 (0.03) | 13.47 (0.43) | -1.66 (0.29) | 7.81 (0.38) | -2.75 (0.89) |
| Triangle Game: MBOM-pro | - | -0.96 (0.03) | - | -0.11 (0.18) | - | -0.82 (0.35) |
| One-on-One: MBOM w/o IOPs | 5.62 (0.06) | 59.4 (0.7) | 6.04 (0.14) | 48.8 (1.7) | 7.36 (0.34) | 55.7 (1.9) |
| One-on-One: MBOM | 5.78 (0.05) | 59.5 (0.9) | 5.78 (0.06) | 51.3 (1.0) | 7.04 (0.12) | 64.2 (1.8) |
| One-on-One: MBOM-pro | - | 59.8 (0.6) | - | 52.0 (1.7) | - | 67.5 (0.4) |

## 6 Conclusion and Future Work

We have proposed MBOM, which employs recursive imagination and Bayesian mixing to predict and capture the learning and improvement of opponents. Empirically, we evaluated MBOM in a variety of competitive tasks and demonstrated that MBOM adapts to learning and reasoning opponents much better than the baselines. This makes MBOM a simple and effective RL method for competitive environments, whether opponents are fixed, continuously learning, or reasoning. Moreover, MBOM can also be applied in cooperative tasks. In MBOM, the learning agent treats all opponents as a joint opponent. If the size of the joint opponent is large, the agent will need many rollouts to get an accurate best response, and the cost increases dramatically with the size of the joint opponent. How to reduce the computation overhead in such scenarios will be considered in future work. Moreover, MBOM implicitly assumes that the relationship among the opponents is fully cooperative. How to deal with the case where their relationship is non-cooperative is also left as future work.

## Acknowledgments and Disclosure of Funding

This work was supported in part by NSF China under grant 61872009. The authors would like to thank the anonymous reviewers for their valuable comments.

## References

[1] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In International Conference on Learning Representations (ICLR), 2018.
[2] Stefano V. Albrecht and Peter Stone. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. Artificial Intelligence, 258:66–95, 2018.

[3] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[4] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep Reinforcement Learning in a Handful of Trials Using Probabilistic Dynamics Models. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[5] Zhongxiang Dai, Yizhou Chen, Bryan Kian Hsiang Low, Patrick Jaillet, and Teck-Hua Ho. R2-B2: Recursive Reasoning-Based Bayesian Optimization for No-Regret Learning in Games. In International Conference on Machine Learning (ICML), 2020.

[6] Christian Schröder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip H. S. Torr, Mingfei Sun, and Shimon Whiteson. Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge? arXiv preprint arXiv:2011.09533, 2020.

[7] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine. Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning. arXiv preprint arXiv:1803.00101, 2018.

[8] Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with Opponent-Learning Awareness. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018.

[9] Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric Xing, and Shimon Whiteson. DiCE: The Infinitely Differentiable Monte Carlo Estimator. In International Conference on Machine Learning (ICML), 2018.

[10] Aditya Grover, Maruan Al-Shedivat, Jayesh K. Gupta, Yuri Burda, and Harrison Edwards. Learning Policy Representations in Multiagent Systems. In International Conference on Machine Learning (ICML), 2018.

[11] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent Modeling in Deep Reinforcement Learning. In International Conference on Machine Learning (ICML), 2016.

[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015.

[13] Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A Deep Policy Inference Q-Network for Multi-Agent Systems. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018.

[14] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, and K. Kavukcuoglu. Reinforcement Learning with Unsupervised Auxiliary Tasks. In International Conference on Learning Representations (ICLR), 2016.

[15] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to Trust Your Model: Model-Based Policy Optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[16] Dong-Ki Kim, Miao Liu, Matthew Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan P. How. A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning. In International Conference on Machine Learning (ICML), 2021.

[17] Woojun Kim, Jongeui Park, and Youngchul Sung. Communication in Multi-Agent Reinforcement Learning: Intention Sharing. In International Conference on Learning Representations (ICLR), 2021.
[18] Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google Research Football: A Novel Reinforcement Learning Environment. In AAAI Conference on Artificial Intelligence (AAAI), 2020.

[19] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-Ensemble Trust-Region Policy Optimization. In International Conference on Learning Representations (ICLR), 2018.

[20] Adam Lerer and Alexander Peysakhovich. Maintaining Cooperation in Complex Social Dilemmas Using Deep Reinforcement Learning. arXiv preprint arXiv:1707.01068, 2018.

[21] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[22] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic Framework for Model-Based Deep Reinforcement Learning with Theoretical Guarantees. In International Conference on Learning Representations (ICLR), 2019.

[23] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic Framework for Model-Based Deep Reinforcement Learning with Theoretical Guarantees. In International Conference on Learning Representations (ICLR), 2019.

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-Level Control Through Deep Reinforcement Learning. Nature, 518(7540):529–533, 2015.

[25] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value Prediction Network. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[26] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[27] Georgios Papoudakis and Stefano V. Albrecht. Variational Autoencoders for Opponent Modeling in Multi-Agent Systems. arXiv preprint arXiv:2001.10829, 2020.

[28] David Premack and Guy Woodruff. Does the Chimpanzee Have a Theory of Mind? Behavioral and Brain Sciences, 1(4):515–526, 1978.

[29] Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine Theory of Mind. In International Conference on Machine Learning (ICML), 2018.

[30] Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling Others Using Oneself in Multi-Agent Reinforcement Learning. In International Conference on Machine Learning (ICML), 2018.

[31] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In International Conference on Machine Learning (ICML), 2018.

[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.

[33] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587):484–489, 2016.

[34] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. arXiv preprint arXiv:0912.3995, 2009.
[35] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In International Conference on Machine Learning (ICML), 1990.

[36] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[37] Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, and Yi Wu. Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization. In International Conference on Learning Representations (ICLR), 2021.

[38] G. Tesauro and G. R. Galperin. On-Line Policy Improvement Using Monte-Carlo Search. In Advances in Neural Information Processing Systems (NeurIPS), 1996.

[39] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning. Nature, 575(7782):350–354, 2019.

[40] Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning. In International Conference on Learning Representations (ICLR), 2019.

[41] Ying Wen, Yaodong Yang, Rui Luo, and Jun Wang. Modelling Bounded Rationality in Multi-Agent Interactions by Generalized Recursive Reasoning. In International Joint Conference on Artificial Intelligence (IJCAI), 2020.

[42] Daniël Willemsen, Mario Coppola, and Guido C. H. E. de Croon. MAMBPO: Sample-Efficient Multi-Robot Reinforcement Learning Using Learned World Models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.

[43] Tianpei Yang, Jianye Hao, Zhaopeng Meng, Chongjie Zhang, Yan Zheng, and Ze Zheng. Towards Efficient Detection and Optimal Response Against Sophisticated Opponents. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.

[44] Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre M. Bayen, and Yi Wu. The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games. arXiv preprint arXiv:2103.01955, 2021.

[45] Luyao Yuan, Zipeng Fu, Jingyue Shen, Lu Xu, Junhong Shen, and Song-Chun Zhu. Emergence of Pragmatics from Referential Game Between Theory of Mind Agents. arXiv preprint arXiv:2001.07752, 2020.

[46] Kaiqing Zhang, Sham M. Kakade, Tamer Basar, and Lin F. Yang. Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity. arXiv preprint arXiv:2007.07461, 2020.

[47] Weinan Zhang, Xihuai Wang, Jian Shen, and Ming Zhou. Model-Based Multi-Agent Policy Optimization with Adaptive Opponent-Wise Rollouts. arXiv preprint arXiv:2105.03363, 2021.

[48] Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, and Changjie Fan. A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 6.
   (c) Did you discuss any potential negative societal impacts of your work? [No] As we just propose a reinforcement learning algorithm, we do not immediately see any potential negative societal impacts.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix A.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Appendix E.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix E.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] In Appendix E.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes] In Appendix E.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]