# multiagent_advisor_qlearning__c94189a3.pdf

Journal of Artiﬁcial Intelligence Research 74 (2022) 1-74 Submitted 10/2021; published 05/2022

Multi-Agent Advisor Q-Learning

Sriram Ganapathi Subramanian S2GANAPA@UWATERLOO.CA University of Waterloo 200 University Ave W, Waterloo, ON N2L 3G1 Vector Institute

661 University Ave Suite 710, Toronto, ON M5G 1M1

Matthew E. Taylor MATTHEW.E.TAYLOR@UALBERTA.CA University of Alberta 116 Street and 85 Avenue, Edmonton, AB T6G 2R3 Alberta Machine Intelligence Institute (Amii) 10065 Jasper Ave Edmonton, AB T5J 3B1

Kate Larson KATE.LARSON@UWATERLOO.CA University of Waterloo 200 University Ave W, Waterloo, ON N2L 3G1

Mark Crowley MCROWLEY@UWATERLOO.CA University of Waterloo 200 University Ave W, Waterloo, ON N2L 3G1

In the last decade, there have been signiﬁcant advances in multi-agent reinforcement learning (MARL) but there are still numerous challenges, such as high sample complexity and slow convergence to stable policies, that need to be overcome before wide-spread deployment is possible. However, many real-world environments already, in practice, deploy sub-optimal or heuristic approaches for generating policies. An interesting question that arises is how to best use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online suboptimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM), and evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide ﬁxed point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms: can be used in a variety of environments, have performances that compare favourably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.

1. Introduction

Reinforcement learning (RL) research is growing and expanding rapidly, however, this method still ﬁnds only limited applications in practical real-world settings (Dulac-Arnold et al., 2021). One major reason for this is that RL algorithms typically have high sample complexity and can learn effective policies only after experiencing millions of data samples in simulation (Kakade, 2003). Multi-agent reinforcement learning (MARL) extends RL to domains where more than one agent learn

2022 AI Access Foundation. All rights reserved.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

simultaneously in the environment (Shoham and Leyton-Brown, 2008). Moving from single-agent to multi-agent settings introduces new challenges including non-stationary environments and the curse-of-dimensionality (Hernandez-Leal et al., 2019), while concerns from single-agent RL such as exploration-exploitation trade-offs and sample efﬁciency remain (Yogeswaran and Ponnambalam, 2012). In MARL environments, it has been reported that learning complex tasks from scratch is even impractical due to its poor sample complexity (Silva and Costa, 2019). In this regard, it becomes necessary for agents to obtain guidance from an external source to have any possibility of scaling up to real-world domains. Furthermore, during the early stages of learning, agents policies may be quite random and dangerous, which makes it almost impossible to use them in real-world environments. Thus, it is hard to improve upon these policies by only using direct interactions with the environment. In this paper, we aim to tackle the problem of improving sample efﬁciency in MARL through the use of other available sources of knowledge, particularly during the early stages of training.

In single-agent RL use of external knowledge sources such as advisors to drive exploration has been successful in a variety of domains. The advisors provide actions to the agent at different states to bootstrap learning by targeted exploration (Nair et al., 2018). However, the biases of suboptimal advisors pose a challenge to successful learning (Gao et al., 2018). Further, many of these approaches do not directly extend to multi-agent environments due to the additional complications present in the multi-agent environments. Though learning from external sources of knowledge has been explored in multi-agent settings, many previous works assume the presence of fully optimal experts (Natarajan et al., 2010; Hadﬁeld-Menell et al., 2016; Yu et al., 2019). Generally, they entail additional assumptions such as having simpliﬁed environments with only two agents (Lin et al., 2019) and consider restrictive environments such as competitive zero-sum settings (Wang and Klabjan, 2018) or fully cooperative settings where all agents share a common goal (Natarajan et al., 2010; Le et al., 2017; Peng et al., 2020). Additionally, some approaches such as Lin et al. (2019) restrict themselves to simple multi-agent environments with discrete state and action spaces. The use of sub-optimal advisors in multi-agent general-sum settings with an arbitrary number of agents has been less explored, and to the best of our knowledge, there has been no comprehensive analysis of this approach, especially from a theoretical perspective.

1.1 Motivational Examples

We describe two motivational examples relevant to the goals of our paper. These examples provide

an intuition about the kind of practical problems where our approach can be used and clarify the potential impact of this line of research. The ﬁrst example uses a cooperative wildﬁre response setting and the second example uses the competitive product marketing domain.

Motivational Example 1: Wildﬁre response is a complicated process that requires systematic planning of important resources and a good understanding of wildﬁre behaviour for proper estimation and combat (Thompson et al., 2018). The ﬁreﬁghters and ﬁre managers need to make many critical decisions related to wildﬁre control, and on many occasions, these decisions could be the difference between life and death (Thompson et al., 2019). Additionally, these decisions have a high ecological impact. This is a multi-agent problem (multiple ﬁre-ﬁghters aim to ﬁght ﬁre) where artiﬁcial learning agents can learn suitable policies to aid in ﬁre-ﬁghting efforts (Nikitin et al., 2019). Machine learning, particularly reinforcement learning, has a huge potential in this area but has been underutilized so far (Jain et al., 2020). Notably, in these sustainability-based domains, there is a general paucity of data (Tymstra et al., 2020), since obtaining good quality high-resolution sensor data is expensive and hard.

MULTI-AGENT ADVISOR Q-LEARNING

Hence, current state-of-the-art MARL algorithms are incapable of being used in such problems due to poor sample efﬁciency. Notably, current practical ﬁre-ﬁghting efforts use physics-based models (Rothermel, 1972) that help in predicting the spread of ﬁres given current location and intensity. Despite being the current state-of-the-art, these models are sub-optimal, have low accuracies (Jahdi et al., 2015), and possess other problems like under-prediction bias and lack of generalizability to regions outside North America (Cruz and Alexander, 2010). Thus, we have particular knowledge sources that are not optimal but are still used in practice (particularly due to lack of alternatives). Our work in this paper will enable MARL training to use these physics-based models to speed up learning. The policies that the MARL algorithms will ﬁnally arrive at will have the potential to be better than these physics-based models since the MARL algorithms will also simultaneously learn from data.

Motivational Example 2: Multi-agent algorithms have the potential to learn from available data and formulate effective marketing and price management strategies to improve ﬁnancial proﬁt for companies (Ganesh et al., 2019). However, the problem of poor sample efﬁciency prevents the usage of MARL for this problem, since many companies would ﬁnd it difﬁcult to procure sufﬁcient data for MARL training. There are many mathematical marketing models in the literature that typically help companies formulate marketing strategies (Eryigit, 2017), however, these models are sub-optimal, with scope for improvement, especially in adapting to changing trends (Storbacka and Moser, 2020). These models can serve as external knowledge sources that MARL training can leverage to learn good policies.

Hence, learning from possibly sub-optimal external content sources, which we broadly refer to as advisors , is useful for MARL training. We formally introduce this problem and study it further in this paper.

1.2 Related Work

Imitation learning includes various methods for learning the behaviour of advisors, the simplest being behaviour cloning, where supervised learning is used to mimic the advisor policy. This method dates back to the early 90s, where agents were shown to be successful in copying the behaviour of the demonstrator in autonomous driving tasks such as road following and perception (Pomerleau, 1988, 1991). This method comes with certain theoretical guarantees, where prior works have conducted

formal analysis and established that near-optimal advisors are the easiest to imitate, requiring far fewer demonstrations than sub-optimal advisors to achieve the same performance as the advisors being imitated (Syed and Schapire, 2010). Behaviour cloning is hard to generalize to different unseen environments since the agent is learning in a supervised fashion (Kiran et al., 2021). Prior works in the area of behaviour cloning have also introduced methods to detect and safeguard against a few noisy/bad demonstrations (Zheng et al., 2014; Hussein et al., 2021), however, in general, the demonstrations are assumed to be near-optimal to enable learning reasonable policies. Further, it has been noticed that behaviour cloning methods are prone to a problem of distribution drift, where the trajectory distribution at the test time drifts away from the distribution learned from the advisor during training (Ross et al., 2011; Rajeswaran et al., 2018). In autonomous driving environments, on-policy data collection has been shown to mitigate this problem to some extent (Santana and Hotz, 2016). Some recent approaches propose off-policy solutions, along with techniques of expanding the input (image)-action space using data-augmentation methods (Codevilla et al., 2018; Laskey et al., 2017). However, several other limitations like dataset bias and high variance in neural network-based

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

solutions have been reported to limit the application of behaviour cloning to real-world environments (Codevilla et al., 2019).

Another popular imitation learning framework is inverse reinforcement learning (IRL) (Ng and Russell, 2000), where the objective is to learn the reward function from demonstrations. This framework typically assumes that the environment does not have a reward function and/or it is difﬁcult to formulate good reward functions, but expert demonstrations of good behaviour are easier to obtain. A good example is autonomous driving, where formulating a complete reward function that covers all scenarios is hard while obtaining demonstrations of good driver behaviour is much simpler. Initial approaches to IRL used maximum margin methods where an initial estimate of the reward function for the demonstrator keeps being iteratively improved, such that the performance of the demonstrator is at least more than a margin of the previous reward estimate for the demonstrator (Abbeel and Ng, 2004; Ratliff et al., 2006a). This iteration is repeated until no such improvement is possible. The problem with the maximum margin methods are their sensitivity to noise and imperfection in the demonstrator behaviour. To alleviate this problem, probabilistic approaches using principles of maximum entropy have been proposed for the IRL framework (Ziebart et al., 2008; Boularias et al., 2011). These approaches reason over a set of possible behaviours, rather than monotonically improving upon estimates of reward or policies. Neural networks have also been considered to learn a suitable reward function, where convolutional networks aim to map the relationship between input images to ﬁnal rewards (Wulfmeier et al., 2015). Several techniques from supervised learning such as Gaussian processes (Rasmussen, 2003) and ensemble methods (Dietterich, 2000) have also been used in the IRL paradigm (Levine et al., 2011; Ratliff et al., 2006b). A recent approach, Generative Adversarial Imitation Learning (GAIL), aims to recover the policy of the expert directly instead of extracting an explicit reward function using Generative Adversarial Networks (GANs) (Ho and Ermon, 2016). Since GAIL is not learning a reward function, it may not be considered an IRL technique and since it is not learning in a supervised fashion it may not be considered a behaviour cloning technique as well. This approach opened up a new class of methods in the intersection of imitation learning and generative adversarial networks (Miyato et al., 2018; Kueﬂer et al., 2017). Some of these approaches aim to extract an explicit reward function from the demonstrations using GANs (Finn et al., 2016; Fu et al., 2018) and these can be considered to fall within the IRL framework.

The DAGGER algorithm introduced by Ross et al. (2011) is yet another imitation learning method. This algorithm has strong guarantees of performance while learning stationary deterministic policies in environments with an online advisor that can be queried interactively for additional feedback. However, the major aim of this algorithm is to obtain a policy that guarantees no-regret under its induced distribution of states and does not aim to improve upon the provided demonstrator. This method belongs to a wider set of online algorithms that aim to provide the no-regret guarantee while learning based on demonstrations from a perfect advisor (Hazan et al., 2007; Cesa-Bianchi et al., 2001; Shalev-Shwartz and Kakade, 2008).

In general, imitation learning methods assume fully optimal or near-optimal advisors (or experts), with the major goal being to copy the policy or behaviour of the experts. Since MARL problems are non-stationary, there is little expectation of obtaining perfect experts. Rather, we expect guidance on action choices that aid agents in learning and improving over time. Further, several multi-agent imitation-based methods in the literature are restricted to cooperative games (Barrett et al., 2017; Bogert and Doshi, 2014) or games with strict restrictions on the nature of the reward function (such as having linear relations to some underlying feature) (Reddy et al., 2012; Waugh et al., 2011), in

MULTI-AGENT ADVISOR Q-LEARNING

addition to assuming the availability of (near) optimal experts. On the other hand, other approaches based on IRL in multi-agent settings are restricted to zero-sum games (Lin et al., 2017; Wang and Klabjan, 2018). Some recent approaches aim to apply IRL in general-sum games (Yu et al., 2019; Song et al., 2018), however the assumption of availability of perfect experts are present in these works as well. A different approach from Price and Boutilier (2003) studies imitating more experienced peers in a multi-agent setting. However, this work considers a very restrictive environment, where each agent s dynamics is independent of other agents. Further, strict assumptions on the reward function exist, such as obtaining the same numerical reward over a part of the state space and having an independent reward function that does not depend on other agents actions. Such assumptions are hard to verify in real-world multi-agent environments. Pure imitation methods generally lack the ability to exceed the performance of the available expert/advisor.

Another approach, Learning from Demonstrations (Lf D), combines the imitation-based behaviour cloning approach of learning from expert demonstrations and the reinforcement learning-based approach of directly learning from the environment to reach a suitable goal. Here, the objective is not to simply perform imitation learning, but to use imitation learning as a bootstrap mechanism that can speed up the training of RL agents. The RL algorithm enables further ﬁne-tuning of the policy learned from imitation, which provides an opportunity for improving upon the performance of the advisor and learning goal-oriented policies. The algorithms in this approach use the environmental rewards along with expert/advisor demonstrations collected ofﬂine. Unlike the IRL paradigm, the environmental rewards are assumed to be available, and the rewards need not be extracted from suitable expert demonstrations. Our work in this paper is most related to the Lf D framework. Early works in this area studied the Lf D approach using model-based reinforcement learning, which found applications in classical RL environments like cart-pole (Schaal, 1996) and robot arm learning to balance a pendulum (Atkeson and Schaal, 1997). More recently, the model-free RL approaches gained prominence, especially after the exceptional performance shown by Deep Q-learning (DQN) on Atari games (Mnih et al., 2015). Model-free RL using replay buffers for training have been successful in the Lf D framework as well (Piot et al., 2014; Chemali and Lazaric, 2015). One stateof-the-art algorithm, Deep Q-learning from Demonstrations (DQf D) (Hester et al., 2018) pre-trains the agent using demonstration data, keeps this data permanently for training, and combines an imitation-based hinge loss with a temporal difference (TD) error. This additional loss helps DQf D learn faster from demonstrations, but also makes DQf D prone to the problem of overﬁtting to the demonstration data. Normalized Actor-Critic (NAC) (Jing et al., 2020) drops the imitation loss and hence is more robust to imperfect demonstrations (from bad, almost adversarial advisors) than DQf D. However, we ﬁnd that the performance of DQf D is at least as good as NAC for reasonable advisors (due to the imitation loss). Goecks et al. (2020) introduce the Cycle-of-Learning (Co L) algorithm that provides a novel Lf D mechanism in which additional human inputs can be obtained during training in environments where humans are present in the loop to help agents train faster. The agents can make use of the human feedback in addition to the demonstration from other sources for training. Notably, Lf D algorithms have found numerous applications in robotics. Some early applications have focussed on teaching primitive movements of motors to policy gradient based RL algorithms that can learn to perform a suite of relatively simple robotic tasks, such as manoeuvring gaps (Peters and Schaal, 2008; Theodorou et al., 2010). Recently, Rajeswaran et al. (2018) introduce an effective algorithm that can learn highly complex dexterous manipulation such as object relocation and in-hand manipulation in response to sensor inputs. They introduce an algorithm, Demonstration Augmented Policy Gradient (DAPG) that uses an on-policy policy gradient (Sutton et al., 1999)

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

update as opposed to the off-policy nature of prior approaches such as Hester et al. (2018). Zhu et al. (2018) provides yet another approach that uses the Lf D technique for robot manipulation tasks such as block lifting, block stacking and pouring liquids, where the agents need to learn effective visuomotor policies that can take actions in response to image inputs from a camera or sensor. The well-known RL algorithm, Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2016) has been adapted to the Lf D framework to incorporate human demonstrations and learn continuous control object manipulation robotic tasks such as peg-insertion and clip-insertion in both real and simulated environments (Vecerik et al., 2017).

While powerful, the requirement for ofﬂine demonstrations commonly seen in prior works on Lf D is not a good ﬁt for MARL. In MARL, since the environments are non-stationary with dynamic opponents, real-time action advising would be more useful as the advisors can teach agents to adapt to changing opponents. Recently, some multi-agent approaches have used the Lf D method, where the algorithms can, in-theory, do without fully optimal advisors (Silver et al., 2016; Wang and Klabjan, 2018; Hu et al., 2018). These works, however, are applicable only to a restrictive set of MARL environments. The Alpha-Go approach from Silver et al. (2016) and the approach from Wang and Klabjan (2018) are restricted to zero-sum competitive games and cannot naturally extend

to general-sum games. The work by Hu et al. (2018) is designed for a very particular application (Star Craft Micromanagement), where the authors require the availability of specialized human-made opponents that contain speciﬁc pre-deﬁned tactics about game-playing. In MARL, prior works have also studied peer-to-peer teaching, where each agent can learn when and what advice needs to be extended to peers in addition to learning how to use the available advice and improve its own learning (Silva et al., 2017; Omidshaﬁei et al., 2019; Ye et al., 2020). The agents can switch between the roles of student or teacher at different points of time based on the situation. However, as evident from the setting, this method is only applicable to fully cooperative environments.

Expert demonstrations have also been used for reward shaping in single-agent RL (Laud, 2004), but this undermines the convergence guarantees of Q-learning based algorithms. Using prior knowledge to deﬁne a potential function over the state space provides an approach known as potential-based reward shaping, which preserves the total order over policies and does not affect the convergence guarantees (Ng et al., 1999; Wiewiora et al., 2003). The approach of reward shaping has been popular in single-agent RL (Marom and Rosman, 2018), and very recently adopted to the MARL setting as well (Devlin et al., 2011; Gangwani et al., 2020; Xiao et al., 2021). The work by Gangwani et al. (2020) uniformly redistributes the rewards accumulated at the end of a trajectory, to each state-action pair along the length of the trajectory. This approach is based on independent learning, which hurts convergence guarantees in the MARL setting (Tan, 1997). Although it shows good empirical performance in single-agent tasks, this approach performs poorly in many MARL tasks, since the credit-assignment is not always accurate (Xiao et al., 2021). On the other hand, the approach by Xiao et al. (2021) adapts potential-based reward shaping to the MARL setting. There are two important limitations of this potential-based reward shaping approach, formulated by Xiao et al. (2021) in MARL. The ﬁrst is that the shaping advice is a heuristic that needs to come from an expert who has complete prior knowledge about the entire problem domain and is capable of designing these heuristics. Obtaining such experts for complex MARL tasks is not always possible. Second, the shaping advice is provided at the beginning and then ﬁxed for the duration of training. However, MARL requires adaptive advising at different parts of the state based on opponent behaviour.

Another group of methods such as human agent transfer (HAT) (Taylor et al., 2011) aim to summarize limited ofﬂine demonstrations (from sources like humans) into decision tree-based expert

MULTI-AGENT ADVISOR Q-LEARNING

rules that boost learning online. Conﬁdence-based human-agent transfer (CHAT) (Wang and Taylor, 2017) improves HAT by adding conﬁdence measurements to safeguard against bad demonstrations. Both these methods demonstrate good performance in a multi-agent Keep Away game, although the algorithms themselves are single-agent only. These algorithms are independent methods that consider the other agent(s) as part of the state and do not track opponent actions or perform any kind of opponent modelling. In MARL environments, changes in opponent behaviour play a crucial role in determining the agent rewards (Shoham and Leyton-Brown, 2008). Particularly in competitive environments, these agents are expected to adapt between risk-seeking and risk-averse strategies based on the nature of their opponents (Conitzer and Sandholm, 2003). Without tracking opponent behaviour, this adaptability is not possible, since the differences in opponent behaviour in the same state would not stimulate different responses by the learning agent. A related approach known as the Teacher-Student Framework (Torrey and Taylor, 2013) aims at accelerating the learning process under limited communication with an advisor. Almost all works in this framework assume either fully optimal or a moderate level of expertise for the advisors (Amir et al., 2016; Zhan et al., 2016).

At their core, RL (and MARL) algorithms are ﬁxed point iterative methods that iterate recursively until no further iterations are desirable or required (Littman and Moore, 1996). An RL algorithm s ability to converge to a ﬁxed point provides a clear picture of the goal towards which the algorithm is progressing. The ﬁxed point deﬁnes the completion of the task of an RL agent in the given environment and is like a terminal point in the sequence of RL iterations. For example, single-agent RL methods use the optimal Q-value that provides the maximum expected discounted sum of rewards in the given environment as the ﬁxed point (Sutton and Barto, 1998). In MARL the ﬁxed point is deﬁned by the solution concept of the game. We note that many of the prior works referenced here do not contain a theoretical analysis of the learning algorithms that provide conditions for ﬁxed point guarantees regarding learning in MARL environments. Since all these methods involve the presence of external sources in the learning process, it is unknown if previous guarantees in RL convergence extend to these approaches. Without such guarantees, it is unclear whether the algorithms will learn reasonable policies in any generic environment (beyond those considered in the paper) and whether the algorithms will progress towards obtaining a suitable solution for the current problem. Since there are many solutions concepts in multi-agent environments (Shoham and Leyton-Brown, 2008), the kind of solutions that are likely to be obtained by these algorithms are unclear. Some approaches establish the existence of unique solution concepts in the particular model of multi-agent environment considered, yet still lack ﬁxed point guarantees for any RL method and theoretical guarantees of arriving at the established solution concept by any learning algorithm (Yu et al., 2019; Song et al., 2018).

From the above discussion, we see that there are ﬁve fundamental limitations of existing algorithms that learn from external sources of knowledge, which hinders their applicability to real-world multi-agent environments. All prior works can be seen to contain one or more of these limitations. 1) Strict assumptions on the quality of advisors, 2) algorithms designed as single-agent based inde-

pendent methods that consider other agents as simply part of the state in the environment, though the actions of these agents strongly inﬂuence the rewards for the learning agent, 3) ofﬂine advising, where demonstrations are collected and used for training agents ofﬂine, which is not well-suited for MARL due to the adaptive nature of opponents, where real-time feedback is critical, 4) algorithms designed towards a restricted class of MARL environments that are not generally applicable to many other environments, and 5) lack of thorough analysis for the conditions under which theoretical ﬁxed point guarantees can be provided.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

1.3 Our Approach

In this paper, we study advising in MARL under the stochastic game model (Shapley, 1953) and aim to resolve the ﬁve major limitations of prior methods discussed in Section 1.2. We will explore the use of advisors in multi-agent reinforcement learning (MARL) under general-sum settings, where advisors suggest (possibly sub-optimal) actions to different agents in the environment. The advisors can belong to a broad class of categories, such as pre-trained policies, rule-based systems or other systems that continue to learn and/or adapt during gameplay. We do not make any assumptions or place constraints on the quality or type of the advisors themselves. The advisors are assumed to be available online so that agents can get real-time feedback while training in dynamic MARL environments. We will also assume that each agent has access to at most one advisor. Communication between the agent and the advisor is assumed to be free. The advisor receives the state of an agent and provides an action recommendation for the current state. These action recommendations can be deterministic or stochastic.

We introduce a principled framework for studying the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL). We propose two Q-learning based algorithms (Watkins and Dayan, 1992). The ﬁrst algorithm, ADvising Multiple Intelligent Reinforcement Agents - Decision Making (ADMIRAL-DM), learns to act in the environment using advisor-guidance,

while the second, ADvising Multiple Intelligent Reinforcement Agents - Advisor Evaluation (ADMIRAL-AE), provides a principled method to evaluate the usefulness of the advisor in the current MARL context. To the best of our knowledge, we are the ﬁrst to propose a method to evaluate a knowledge source before using it for learning in MARL1. We empirically study the performance of our algorithms in suitable test-beds, along with a comparison to related baselines. Theoretically, we establish conditions under which we can provide ﬁxed point guarantees regarding the learning of our ADMIRAL algorithms in general-sum stochastic games.

Speciﬁcally, our contributions in this work are: 1) introducing a general paradigm for learning from external advisors in MARL, 2) analyzing two important challenges in learning from advisors in MARL, 3) presenting a suitable algorithm for each of these challenges, 4) establishing conditions for appropriate ﬁxed point guarantees in these algorithms, 5) proving that it is possible to provide convergence results under less restrictive assumptions compared to prior work, and 6) empirically showing that our algorithm can adapt and perform well in many challenges in MARL.

2. Background

In this section, we present the key concepts used in this paper. We start with a brief introduction to single-agent RL before describing the generalized multi-agent RL setting we use in this paper.

Deﬁnition 1. A Markov decision process (MDP) is deﬁned as h S, A, R, T, γi where S is the set of states, A is a set of actions, R : S A 7! R is the reward function, T : S A S 7! [0, 1] is the transition function and 0 γ < 1 is the discount factor.

1. In their comprehensive survey of literature that aims at reusing knowledge to accelerate MARL, Silva and Costa (2019)

state that several prior works (Amir et al., 2016; Zhan et al., 2016) assume (at least) a moderate level of expertise for the advisors for action advising and are only applicable to single-agent environments, in line with our discussions in Section 1.2. While Silva et al. (2017) relax the assumption of optimal advisors by allowing agents to advise each other (in cooperative games), they do not provide a method to evaluate the available agents/advisors before using them.

MULTI-AGENT ADVISOR Q-LEARNING

Given an MDP, it is assumed that the agent starts in some state s 2 S, takes some action a 2 A, and transitions to another state s0 with probability T(s, a, s0) where it collects reward R(s, a). A policy, : S 7! A, speciﬁes a probability distribution over the set of actions for each state s 2 S (the notation is used to denote a probability distribution). The value, or expected discounted sum of future rewards, of following some policy when starting in state s is deﬁned as v(s, ) = P1

t=0 γt E[rt|s0 = s, ] where rt is the reward collected at time t. The objective of the agent is to ﬁnd the optimal policy, , which maximizes the expected discounted sum of future rewards at each state, v(s, ). An alternative approach is to use Q-values. A Q-value, Q (s, a), of a state-action pair, is the expected future reward estimate that can be obtained from taking action a in the state s and following the policy from there. The optimal policy is obtained by (s) = arg maxa2A Q (s, a) where Q (s, a) is the optimal action-value function that returns the maximum Q-value in all the states.

In multi-agent settings, the optimal policy of an agent may depend on the policies followed by the other agents. Generalizations of MDPs, called stochastic games, are used to model multi-agent settings and form the basis of MARL.

Deﬁnition 2. A stochastic game is deﬁned as h S, N, A, P, R, βi where S is a ﬁnite set of states, N is the ﬁnite set of agents, |N| = n, and A = A1 . . . An is the set of joint actions, where Ai is the ﬁnite action set of an agent i, and a = (a1, . . . , an) 2 A is the joint action where an agent i takes action ai 2 Ai. Furthermore, P : S A S 7! [0, 1] is the transition function, R = {R1, . . . , Rn} is the set of reward functions, where Ri : S A 7! Rn is the reward function of the agent i, and β is the discount factor (0 β < 1).

In a stochastic game, the common assumption is that all agents share the same set of states S (where S is the state space), which contains information about all agents participating in the stochastic game (Shapley, 1953). The environment provides the (global) state to each agent participating in the stochastic game. At each time step t, each agent i observes the current state s 2 S and takes a local action ai 2 Ai (where Ai is called as the action space of the agent i). Subsequently, the agent obtains a reward ri according to its reward function Ri, and the joint action a of all agents in the environment at state s. The transition function, P, determines the transition of the environment to the next state, s0. This transition depends on the current state (s) and the joint action (a) of all agents in the environment. Further, the transition function is ﬁxed and satisﬁes the constraint, P

s02S P(s0|s, a) = 1 for all s 2 S and a 2 A. Given a stochastic game, a joint policy is represented by = ( 1, . . . , n), where i is the stochastic policy followed by an agent i. As in the single-agent setting, each individual agent tries to maximize their value function. However, this optimization depends on the policies of others: vi(s, 1, . . . , n) = P1

t=0 βt E(ri

t| 1, . . . , n, s0 = s). There are several formulations of the stochastic game model. The general-sum model is the most general formulation, where the rewards that an agent receives at any time step can be related to the rewards obtained by other agents in an arbitrary fashion. Special cases of general-sum games are zero-sum games that restrict the sum of rewards obtained by all the agents at any time step to be zero, and identical interest coordination games that require all agents in the environment to obtain the same numerical reward at every time step. We will use the general-sum formulation in this work.

It is useful to think of a stochastic game as a multi-period stage game. In a stage game, agents select an individual action and then receive some (possibly different) payoff, which depends on the joint action taken. This can be formally deﬁned as follows.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Deﬁnition 3. An n-player stage game is deﬁned as (A, M1, . . . , Mn), where Mk : A 7! R is agent k s payoff function, specifying a payoff for agent k for each joint action (a1, . . . , an) 2 A.

The main solution concept we are interested in is the Nash equilibrium (Nash, 1951), namely a stable point in the joint policy-space. We ﬁrst formally deﬁne a Nash equilibrium in a stage game, before moving on to the generalization of this concept in stochastic games. We switch terminology slightly and will refer to agents strategies to be consistent with the game-theoretic literature, although we can use strategy and policy interchangeably. In particular, an agent s strategy in a stage game is simply a probability distribution over actions, given the underlying state s. Let φk

be the strategy of agent k in the stage game and φ k be the product of strategies of all agents other than k, φ k , φ1 φk 1 φk+1 φn.

Deﬁnition 4. A joint strategy (φ1, . . . , φn) can be considered as a Nash equilibrium for the stage game (A, M1, . . . , Mn), for k = 1, . . . , n, if

φkφ k Mk ˆφkφ k Mk, for all ˆφk 2 φ(Ak) (1)

where φ(Ak) is the set of all probability distributions over Ak.

The term φkφ k Mk is a scalar value. The product of strategies denotes the product of probabilities of taking speciﬁc actions by an agent. This is multiplied with the value of that action as denoted by Mk. The dimensionality of Mk is equal to the action space |A|.

For a stochastic game, the strategies for an agent apply to the entire time horizon of the game.

Deﬁnition 5. In a stochastic game Γ, a Nash equilibrium is a tuple of n strategies ( 1

, . . . , n

), such that for all states s 2 S and agents i = 1, . . . , n,

, . . . , i 1

, . . . , n

, . . . , n

for all i 2 i where i is the set of strategies available to the agent i.

That is, no agent has incentive to unilaterally change their strategy in a Nash equilibrium. Now, we deﬁne the Nash Q-function (Hu and Wellman, 2003) as follows,

Deﬁnition 6. Agent i s Nash Q-function is deﬁned as the sum of the agent i s immediate reward and its discounted future rewards when all agents follow a joint Nash equilibrium strategy ( 1

, . . . , n

(s, a) = ri(s, a) + β P

s02S P(s0|s, a)vi(s0, 1

, . . . , n

where ri(s, a) is the immediate one-stage reward of the agent i at state s and the corresponding joint action a, and vi(s0, 1

, . . . , n

) is the agent i s total discounted reward over inﬁnite periods starting from s0, given that all agents follow the joint equilibrium strategy.

The Q-values of the Nash Q-function are denoted as the Nash Q-value in Hu and Wellman (2003). Finally, we deﬁne an approximate Nash equilibrium concept, an -equilibrium, which bounds the beneﬁts of agents deviations from the joint Nash equilibrium strategy.

Deﬁnition 7. In a stochastic game Γ, a joint strategy ( 1

0, . . . , n

0) is an ( )-equilibrium if it satisﬁes (for all i0 2 i and 8s)

0, . . . , i 1

0 , i0, i+1

0 , . . . , n

0, . . . , n

MULTI-AGENT ADVISOR Q-LEARNING

As seen by the deﬁnition of the Nash equilibrium (Deﬁnition 5), given the strategies of the other agents, the Nash Equilibrium guarantees that any given agent is obtaining the best possible payoff. This is the best guarantee we can provide in a general-sum stochastic game setting with fully independent agents Hu and Wellman (2003), without any restrictions on the nature of the environment or the agent. Hence, we choose to use the Nash equilibrium as our solution concept.

3. Advisor Q-Learning

In this section, we introduce the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL). We have a set of agents that can either take an action using their own policy or consult an advisor that provides action recommendations, given the current state, at each time step. Each agent has access to at most one advisor. An advisor can be any external source of knowledge, such as a rule-based agent, a pre-trained policy, or any other system that continues to learn during gameplay. The advisor is assumed to be available online with the possibility of providing instantaneous action recommendations to an agent. Further, we consider a centralized training setting, where agents can observe the state, the local actions, and rewards of all other agents. Another speciﬁcation is that in our setting, the advisor and agent communication is free, while the agents cannot communicate amongst themselves. There is no communication amongst the agents themselves since establishing reliable communication protocols amongst every single agent may be prohibitively expensive in large multi-agent environments. For example, in the case of wildﬁre ﬁghting, it has been noted that communication, even if available, could be very limited since the individual agents (ﬁre-ﬁghters) may be very far apart (Phan and Liu, 2008). However, all agents can receive global inputs from satellite/airborne sensors (Leblon et al., 2012). The centralized setting we consider is similar to that in prior works (Hu and Wellman, 2003). Subsequently, in Section 3.4 we will show that our method can be adapted to the popular centralized training and decentralized execution (Lowe et al., 2017) paradigm, which provides a suitable relaxation of the centralization assumption.

We study two challenges that arise when learning from advisors in MARL and provide algorithms for each problem. The ﬁrst challenge is learning a policy with the help of an advisor. We introduce an algorithm for this challenge, which we call ADvising Multiple Intelligent Reinforcement Agents - Decision Making (ADMIRAL-DM). In this setting, each agent aims to learn a suitable policy that provides the best responses to the opponent(s) and performs effectively in the given multi-agent environment. An agent has access to a (possibly sub-optimal) advisor that could be leveraged to improve the speed of learning. Hence, at each time step, the agent could choose to follow its own policy or that of the advisor. In early stages of learning, the dependence on the advisor is greater, and this dependence gradually declines as the agent s policy improves. If an agent does not receive an action recommendation at some time step, they can simply use their own current policy. Hence, we do not require the advisor to be capable of providing action advice at every state in the state space.

The second challenge is the evaluation of the advisor itself. We provide an algorithm called ADvising Multiple Intelligent Reinforcement Agents - Advisor Evaluation (ADMIRAL-AE) that tackles this challenge. Before using an advisor, it is beneﬁcial to evaluate it to determine whether the advisor will provide effective advice. Hence, we propose a pre-learning phase (i.e., a distinct phase before the beginning of training of ADMIRAL-DM) where the ADMIRAL-AE is used with the goal of getting a good understanding of the capabilities of the advisor in the current environment. We assume that a single advisor exists in the system and this advisor could be evaluated by one or more agents.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Many real-world multi-agent application domains may have a state-of-the-art solution method being currently used in practice. This method could be normally useful, but may not be suitable for every context and against every possible opponent in MARL environments. Hence, it is possible to face situations in which the available advisor provides advice that is optimal or close to optimal for some states, but whose advice is poor in others. Intuitively, to learn well, agents must listen more to the advisor when the available advisor is good and well-suited to the current context and listen less (or not at all) to the available advisor when it is bad. To make this issue more concrete, recall the wildﬁre example discussed in Section 1.1. A well-known ﬁre simulator model is the Farsite (Finney, 1998) simulator that is actively used in practice to model the spread of ﬁre. This ﬁre simulator model

predicts the spread of ﬁre in the future and forms the basis of ﬁre-control strategies of ﬁreﬁghters. Notably, through extensive experimentation, it has been reported that this model does not give satisfactory performances under certain environmental conditions such as extreme downslope winds (Zigner et al., 2020). However, since real ﬁres are affected by many dynamic environmental factors at the same time, coming up with suitable heuristics for deciding the ﬁt of the given model is almost impossible. Thus, it is important to understand when an advisor is providing useful advice and when its advice has limitations. Always relying on a poor advisor may lead to poor policies being learned or hurt the overall sample complexity of the algorithms. Thus, we recommend evaluating the advisor before using it for learning, if possible. We argue that it is important to decouple the problem of advisor-evaluation where the objective is to study the suitability of the advisor in the given MARL

environment from the problem of decision-making which aims to improve the training of MARL algorithms by making use of advisor knowledge.

We propose to conduct a pre-learning phase before training the decision-making algorithm. In this phase we perform an advisor-evaluation study, either in simulation or in real-world environments (particularly in environments that are not safety-critical like recommendation systems and board games such as chess) that helps to answer two important questions before beginning to learn from the advisor. 1) Does the advisor have some good knowledge of the domain that could be helpful for the MARL algorithms during training? and 2) How much should one listen to the given advisor? Performing advisor evaluation before beginning the process of learning helps the agent gain a good understanding of the advisor before learning from it, which would help in leveraging it effectively in learning a suitable decision-making policy. Most importantly, in MARL, advisors (especially good ones) could continue to learn and/or adapt online, during gameplay, based on the nature of opponents. Such advisors cannot be evaluated effectively unless their learning and evolution is captured using a principled method designed to evaluate them. If advisors are being evaluated by agents along with agents simultaneously using them for learning decision-making policies, the evaluation becomes limited and advisors are prone to be discarded quickly based on the metrics of performance and consistency at the early stages of training. During this time, the advisor could be still evolving its strategies based on the nature of the opponent(s), and hence, the evaluation is not accurate. For example, in the algorithm CHAT from Wang and Taylor (2017), the conﬁdence of an agent on a demonstrator is determined based on the demonstrator s consistency in action recommendations for the same state. This approach works well in the single-agent context. However, in MARL, due to the adaptive nature of opponents, good advisors could adapt based on the opponent and possibly evolve mixed (stochastic) strategies that will provide different actions at the same state. An approach such as CHAT would reduce the conﬁdence of such an advisor, but this is not accurate since the advisor is good and should be leveraged more for better performance.

MULTI-AGENT ADVISOR Q-LEARNING

Algorithm 1 ADvising Multiple Intelligent Reinforcement Agents - Decision Making (ADMIRALDM)

1: Set t = 0, get the initial state s0. Let the learning agent be indexed by j

2: For all j, s 2 S, and aj 2 Aj, let Qj

0(s, a) = 0, where a = (a1, . . . , an). Initialize a value for hyperparameters and 0 (i.e. value for 0 and 0

3: Deﬁne policy derived from Q to return a random action with probability t, advisor suggested

action with probability 0

t and greedy action with probability 1 t 0

t 4: Choose aj

0 at state s0 for each j using policy derived from Q

5: while Qj is not constant for each j 2 {1, . . . , n} do

6: For each agent j, execute the action aj

t and observe r1

t , . . . , rn

t , . . . , an

t ; and st+1 = s0

7: For each j, choose the next greedy action for all other agents from the copy of their respective policies using s0. The next greedy actions are chosen using the current observed actions of other agents

8: For each j, let u be a uniform random number between 0 and 1

9: if u < 0

10: Obtain next action aj

t+1 = aj0 from the advisor (using state s0)

11: else if u > 0

t and u < t then

12: Set the next action aj

t+1 = aj0 as a random action from the action space Aj

14: Choose a greedy action aj

t+1 = aj0 from the Q-function using s0 and the next greedy actions of other agents

16: Update Qj

t for each j 2 {1, . . . , n} using Eq. 5

17: Let t := t + 1

18: For each agent j, set the current action aj = aj0 and current state s = s0

19: At the end of each episode, linearly decay t and 0

t 20: end while

3.1 Decision-Making Using Advisor

Our ﬁrst algorithm learns to act in the environment by leveraging the available advisor.

ADMIRAL-DM is described in Algorithm 1. First, for simplicity, we assume that all the agents in the environment use the same algorithmic steps for learning, as done in prior work (Hu and Wellman, 2003). Subsequently, the same algorithm can be used in other scenarios where different agents use different algorithms for learning as well. Further, as in Hu and Wellman (2003), we assume that all agents maintain a copy of the Q-updates of the other agents. This is possible since, during training, agents are in a centralized setting and can observe the local actions and rewards of all other agents at each time step. This helps in predicting the actions of opponents needed for providing the best responses.

A learning agent (represented by j) starts with an arbitrary initialization of its Q-value Qj

0(s, a). One such assignment could be to set Qj

0(s, a) = 0, for all agents j, all states s 2 S and actions a1 2 A1, . . . , an 2 An. Recall, in this setting, each agent has access to an online advisor that it could query during learning. Whenever the agent needs to choose an action, it does so based on its current Q-value, the advisor s recommendation, or simply a random action, as the case may be (lines 8 15). The dependence on the advisor s recommendation and the random exploration

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

is captured by two hyperparameters, 0

t and t, respectively. This action is subsequently executed and the actions and rewards of the other agents are observed, including the next state s0 (line 6). During training, at each time step, the agent picks the possible next actions of other agents in line 7 using its copy of other agents Q-values. Then, in lines 8 15, the agent j picks its next action based on ADMIRAL-DM algorithm s policy which chooses a random action and an advisor action with diminishing probabilities, and a greedy action with increasing probabilities, such that it becomes greedy in the limit with inﬁnite exploration (GLIE). Thus, the agent is guaranteed to train without any further advisor inﬂuence after some ﬁnite time t in the training process. Accordingly, the dependence on the advisor s recommendation is decayed linearly (line 19). In this process, the dependence of an agent is more on the advisor during the earlier stages of learning, when its own policy is quite bad. This dependence gradually reduces as its own policy improves. The Q-values are updated (line 16) following an update scheme given by,

t+1(s, a) = (1 t)Qj

t(s, a) + t[rj

t(s0, a0)] (5)

where a = (a1, . . . , an) denotes the actions for all agents at state s and a0 = (a10, . . . , an0) denotes the actions for all the agents at state s0. β denotes the discount factor, and t 2 (0, 1) is the learning rate. The other variables have the usual meaning described in Section 2. The algorithm s steps are repeated continuously until either the Q-values fully converge or come within a small threshold of convergence, as is commonly done in practice (Sutton and Barto, 1998).

The ADMIRAL-DM algorithm s time and space complexity can be compared to the Nash Q algorithm from Hu and Wellman (2003). At each time step, a learning agent j needs to update (Q1, . . . , Qn), for all states s 2 S, and all actions a1 2 A1, . . . , an 2 An. Let the total number of

states in the environment be represented by |S|, and |Aj| be the total number of actions in the action space of the agent j. Further, assuming that |A1| = = |An| = |A|, we get the total number of entries in Qj to be |S||A|n. If the learning agent needs to update a total of n Q-tables, then the space complexity can be given by n|S||A|n. Thus, regarding the space complexity, a tabular version of ADMIRAL-DM is linear in the number of states, polynomial in the number of actions, and exponential in the number of agents, which is the same as the guarantees for the Nash Q algorithm in Hu and Wellman (2003). However, in the case of time complexity, ADMIRAL-DM has the same guarantees as given for the space complexity, since in the worst case, each entry in the table needs to be queried before updating a Q-value. Note that this is better than the algorithm by Hu and Wellman (2003), which had exponential time complexity in the states and actions, even in the case of

a two-player game. This is because the Nash Q algorithm has a requirement of determining the Nash equilibrium at each stage game, which has exponential time complexity even for two-player games (Neumann, 1928). We do not have this requirement.

3.2 Evaluation Of Advisors

The second challenge is that of evaluating the advisor to determine its nature and its suitability for the given context. As described in the previous sub-section, the ADMIRAL-DM uses an advisor if one exists. In this sub-section, we provide an algorithm (ADMIRAL-AE) that evaluates a potential advisor and helps in guiding the conﬁguration of Algorithm 1 by setting the initial value of 0. The objective is to make an agent following ADMIRAL-DM listen more to good advisors and listen less (or not at all) to bad advisors. The ADMIRAL-AE is used in the pre-learning phase discussed earlier, where agent(s) are evaluating the advisor in the context of the given environment and opponents.

MULTI-AGENT ADVISOR Q-LEARNING

Algorithm 2 ADvising Multiple Intelligent Reinforcement Agents - Advisor Evaluation (ADMIRALAE)

1: Set t = 0 and get the initial state s0. Let the learning agent be indexed by j

2: For all j, s 2 S, and aj 2 Aj, let Qj

0(s, a) = 0 where a = (a1, . . . , an)

3: Set the value of hyperparameters and 0

4: while Qj is not constant for each j 2 {1, . . . , n} do

5: For each j, let u be a uniform random number between 0 and 1

6: if u < 0 then

7: Obtain action aj

t for the current state st from the advisor

8: else if u > 0 and u < then

9: Set the action aj

t as a random action from the action space Aj

11: Choose a greedy action aj

t from the Q-function using st and the observed previous actions of other agents

13: Execute aj

t and then observe a1

t , . . . , an

t; and st+1 = s0 for each j 2 {1, . . . , n}

14: Obtain the advisor solution from the advisor for state s0

15: Update Qj

t for each j 2 {1, . . . , n} using Eq. 6

16: Let t := t + 1 and current state st = st+1 17: end while

We start with a deﬁnition of an advisor strategy.

Deﬁnition 8. In a stage game, (A, M1, . . . , Mn) an advisor strategy (or advisor solution) is a tuple of n strategies (σ1, . . . , σn), an advisor speciﬁes for all n agents.

Since the advisor is deﬁned to be a general model that receives a state and provides action recommendations to an agent in the multi-agent environment, the advisor, in general, is capable of predicting the actions of other agents as well. All the advisor s predictions and/or recommendations towards every agent in the environment constitute the advisor solution as in Deﬁnition 8. Here, we do not restrict our setting to environments where the advisor can predict the actions of all agents. In practice, it is possible to encounter situations where the advisor cannot predict or recommend actions to some agents. In this case, these agents can get random or placeholder strategies in the advisor solution formulated in Deﬁnition 8.

Similar to our decision-making setting, we ﬁrst provide an algorithm that will have all the agents in the environment using the same algorithmic steps. In each state s, and time t during training, we form a stage game (Q1

t (s), . . . , Qn

t (s)) using the Q-values of all agents. Here, the notation Qj

t(s, a1), . . . , Qj

t(s, an)). The advisor receives the state and provides its predictions/recommendation for each agent, which will form the advisor solution (σ1

t (s), . . . , σn

t (s)) for the stage game (Q1

t (s), . . . , Qn

t (s)). In the stochastic game, having access to the state and the advisor allows an agent to have access to the full advisor solution at every state s 2 S for all time t.

Algorithm 2 describes our ADMIRAL-AE algorithm. A learning agent j starts with an arbitrary value of Qj

0(s, a), which represents the value at the initial time step t = 0. We deﬁne an action selection scheme (lines 5 12) that chooses to directly use the advisor s recommendation with probability 0, a random action with probability and an action that maximizes the Q-values at the

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

current state with probability 1 0. The idea is to mix between directly following the advisor at the current time step and choosing an action that maximizes the value of following the advisor at later stages for the action selection. We also perform a small percentage of random actions to ensure sufﬁcient exploration of the environment. At each time t, the agent j observes the current state s, and takes a local action aj (line 13) and observes the action of all agents (including itself), the reward it obtains and the new state s0. Note that, unlike the decision-making setting, here we do not require all agents to maintain copies of the updates of other agents and hence the rewards of other agents are not required to be observed by the agent j. The agent then obtains the advisor solution (Deﬁnition 8) from the advisor for the next state s0 (line 14). Subsequently, each agent j updates its Q-value as follows (using β 2 [0, 1), as the discount factor) :

t+1(s, a) = (1 t)Qj

t(s, a) + t[rj

t + βAdvisor Qj

where t 2 (0, 1) is the learning rate. The term Advisor Qj

t(s0), is the total value (payoff) that the agent j will obtain at the state s0 when all agents (including itself) play the advisor solution. This is calculated as Advisor Qj

t(s0), where (σ1

t (s0), . . . , σn

t (s0)) denotes the advisor solution at state s0 and time t. This can be seen as a solution to the stage game (Q1

t (s0), . . . , Qn

t (s0)), since the value of each agent s payoff at state s0 is reﬂected in their corresponding Q-values at state s0. Since the advisor recommendation to each agent can be a stochastic sample, the σj

t (s0) is interpreted as a vector that contains the probability of taking each action in the action space of j. Similarly, from the Q-function of the agent j, we can obtain Qj

t(s0), which consists of the Q-value of taking each action in the action space of j. Hence, Advisor Qj

t(s0) is a scalar value obtained using a component-wise multiplication of the advisor solution and the Q-values.

The Q-values of all agents are updated using the advisor strategies at each iteration, as given in line 15 of Algorithm 2. The above-described steps continue until convergence, or until the Q-values come within a small threshold of convergence, as in ADMIRAL-DM.

Recall that the primary purpose of this algorithm is advisor evaluation. After implementing ADMIRAL-AE using the given advisor in the pre-learning phase, the performance of the algorithm can be used to determine 0

0 (hyperparameter of ADMIRAL-DM) in different ways. We provide one heuristic in this paper. We propose that the performance of ADMIRAL-AE (in terms of cumulative rewards) be compared against the maximum possible performance of any algorithm (maximum cumulative rewards). The ADMIRAL-AE s performance using the given advisor should lie between the maximum possible performance in that environment and the performance of ADMIRAL-AE using a random advisor. This can then be normalized in the range of [0, 1] to determine a value for 0

0. This normalization is given in Eq. 7.

0 = CR RCR MCR RCR (7)

where CR denotes the cumulative reward obtained by ADMIRAL-AE using the advisor (averaged across multiple trials), RCR denotes the cumulative reward obtained by ADMIRAL-AE using a random advisor (averaged across multiple trials), MCR denotes the maximum possible cumulative reward in the given environment 2. To be more accurate, a correction can be applied to MCR to compensate for the loss in performance from random exploration in ADMIRAL-AE (hyperparameter ).

2. We clarify that in this method if the RCR and MCR are not very tight, the difference between almost similar

performing advisors could become small. However, this will not be a problem for our method, since the ADMIRALDM can also learn directly through environmental interactions.

MULTI-AGENT ADVISOR Q-LEARNING

After obtaining the value of 0

0 from ADMIRAL-AE, this hyperparameter is used in the training of ADMIRAL-DM where its value is linearly decayed in line with the steps in Algorithm 1. We experimentally illustrate these steps later in Section 5.1. A more elaborate study demonstrating the effectiveness of this technique is given in Section 5.4.

Another way of using ADMIRAL-AE is to study the effectiveness of the available advisor against simulated or baseline opponents (Section 5.2). An important advantage of ADMIRAL-AE is in situations of adapting advisors, as discussed earlier. An experimental illustration of this advantage is given in Appendix C. The ADMIRAL-AE algorithm is off-policy, as the update policy (line 15) and policy followed (lines 5 12) are different. Due to this off-policy nature of ADMIRAL-AE, we do not require an agent to follow the advisor at every state in this setting. The evaluation is happening through the Q-values, while the action selection policy can be independent of the policy being updated as in any off-policy algorithm. The convergence guarantees in off-policy methods do not require using speciﬁc action selection policies as long as sufﬁcient exploration is guaranteed (Jaakkola et al., 1994; Sutton and Barto, 1998).

We would like to clarify that our main algorithm, ADMIRAL-DM, uses the advisor in a fullyonline fashion and does not require any pre-learning for implementation. ADMIRAL-AE is a principled method that helps in determining how much to listen to an advisor (through the hyperparameter 0

0) in an optional pre-learning phase. If the pre-learning phase is not conducted, an approximate value for 0

0 can still be obtained using experimental heuristics. Using such an approximate value is not a problem since ADMIRAL-DM is also capable of learning from environmental rewards (through direct environmental interactions), in addition to the advisor. Here the advisor only aims to accelerate the process of training. Hence, the use of ADMIRAL-AE does not violate our contribution of relaxing the ofﬂine limitation in prior methods.

In a tabular implementation of the ADMIRAL-AE algorithm, both space and time complexities will be linear in the number of states, polynomial in the number of actions, and exponential in the number of agents, same as that described for ADMIRAL-DM. However, the space and time complexity of ADMIRAL-AE can be represented by |S||A|n. Here, notice that the complexity does not have the product term of the number of agents n, unlike the requirement for ADMIRAL-DM. This is due to the fact that ADMIRAL-AE does not have the requirement of each agent maintaining copies of the Q-values of other agents as in ADMIRAL-DM. As described in the previous sub-section, this time complexity of ADMIRAL-AE is much better than that of Nash Q (Hu and Wellman, 2003).

3.3 Illustrative Example For Algorithm 2

In this sub-section, we show the calculations of some steps in the Q-updates of Algorithm 2 for a single state system, to serve as a demonstration of this algorithm. Our objective is to provide a practical illustration of the various steps involved since, to the best of our knowledge, pre-evaluation of advisors have not been considered before. Since Algorithm 1 has similarities to the well-known Q-learning algorithm (Watkins and Dayan, 1992), we omit a demonstrative example for Algorithm 1.

Let us consider a two agent game with all the initial Q-values set to 0. The ﬁrst agent (column agent) can perform two actions Up and Down and the second agent (row agent) can also perform two actions Left and Right . The system has only one state, s1. Let the learning rate be 0.9 and discount factor β be 0.9. Let us assign the hyper-parameters = 0.05 and 0 = 0.45. At the initial state, at time t = 0, the stage game constructed from the Q-values of both the agents is given in Table 1a.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

0) Left Right

Up (0,0) (0,0) Down (0,0) (0,0)

(a) Initial stage game at time t = 0

1) Left Right

Up (1.8, 1.8) (0,0) Down (0,0) (0,0)

(b) Stage game at time t = 1

2) Left Right

Up (2.34, 2.34) (0,0) Down (0,0) (0,0)

(c) Stage game at time t = 2

Table 1: Stage games constructed in the example.

At state s1, in time t = 0, let us assume that both agents decide to use the advisor recommended actions Up and Left respectively. They execute these actions and obtain a reward of 2 each. Now the agents receive the next state, which is s1 (single state system). Let the advisor solution at this state for the column agent be σ1

0(s1) = (1, 0) and that for the row agent be σ2

0(s1) = (1, 0). The ﬁrst value of the tuple in this notation is the probability of taking the ﬁrst action, and the second value of the tuple is the probability of the second action for the respective agents. This means that the advisor recommends both the agents to perform the Up and Left actions respectively, with probability 1, and the other action with probability 0. The Q update will be as follows:

1(s1, Up, Left) = Q1

0(s1, Up, Left) +

0 + βAdvisor Q1

0(s1, Up, Left)

1(s1, Up, Left) = Q1

0(s1, Up, Left) +

0(s1, Up, Left)

1(s1, Up, Left) = 0 + 0.9

2 + 0.9 0 0

1(s1, Up, Left) = 1.8

(8) The superscript for Q represents the agent index (1 represents the column player). The above equation also holds for the row agent and the new stage game with Q-values at state s1, in time t = 1 is given in Table 1b.

Now, at state s1 and time t = 1, let us assume that the agents decide to perform the best actions from their respective Q-values, i.e., Up and Left again. The actions are executed, and the agents obtain a reward of 2 each. The next state is observed, and let the advisor solution here be σ1

1(s1) = (0.5, 0.5) and σ2

1(s1) = (0.5, 0.5). This means that the advisor assigns a probability of 0.5 for each of the actions for both the agents. Again, we calculate the Q-update for the column agent,

2(s1, Up, Left) = Q1

1(s1, Up, Left) +

1 + βAdvisor Q1

1(s1, Up, Left)

2(s1, Up, Left) = Q1

1(s1, Up, Left) +

1(s1, Up, Left)

2(s1, Up, Left)

= 1.8 + 0.9

(0.5)2 1.8 + (0.5)2 0 + (0.5)2 0 + (0.5)2 0

2(s1, Up, Left) = 2.34

MULTI-AGENT ADVISOR Q-LEARNING

The above equation also holds for the row agent and the new stage game at t = 2 is given in Table 1c. Similarly, the Q-values continue to be updated until convergence or until the values come to a small threshold of convergence.

In the above steps, we have demonstrated the Q-values of different actions at the given state based on the advisor solutions. Such a process, at convergence, will lead to the agent(s) evaluating the advisor obtain a value for the advisor at each state in the state space of the environment.

3.4 Neural Network And Actor-Critic Implementations

It is well known that tabular algorithms are not applicable for environments with large state and action spaces in RL (Mnih et al., 2016). All our algorithms can be extended to large state-action environments using function approximations as is common in RL (Mnih et al., 2015, 2016), where neural networks serve as the function approximators. In this section, we give a neural network-based implementation of ADMIRAL-DM and ADMIRAL-AE.

We incorporate techniques introduced in the well-known Deep Q-learning (DQN) algorithm (Mnih et al., 2015) with the update rule in ADMIRAL-DM to obtain its neural network implementation in Algorithm 3. To highlight differences from the tabular implementation, we make note of some parts of Algorithm 3. All agents maintain two networks (evaluation and target) throughout the training process. Both the evaluation and target networks start with the same conﬁguration. The evaluation network is updated periodically at every training step and used for action selection at each step. The target network provides the target value for calculating the Bellman errors during training and is updated less frequently compared to the evaluation network. In line 18 of Algorithm 3, all agents store the experience tuples in their respective replay buffers. After every episode in lines 21 25, the evaluation networks are trained using the temporal difference (TD) errors as the loss function. The TD target is obtained from the Bellman equation given in Eq. 5. After every ﬁnite number of training steps, the target network parameters are updated by copying over values from the evaluation network in line 26 as previously performed in Mnih et al. (2015).

Similar to our approach with ADMIRAL-DM, we incorporate the techniques of DQN (Mnih et al., 2015) with the update rule in ADMIRAL-AE to obtain Algorithm 4. The important changes are similar to our discussion with the ADMIRAL-DM case, where the algorithm uses two networks and a replay buffer for training. The target for the loss function is obtained from Eq. 6 (line 21 in Algorithm 4). If more exploration is desired, need not be decayed in these implementations.

We also extend Algorithm 3 to an actor-critic method ADvising Multiple Intelligent Reinforcement Agents - Decision Making (Actor-Critic) abbreviated as ADMIRAL-DM(AC). This algorithm uses the Q-function as the critic and the policy derived from Q as the actor. The algorithm follows a Centralized Training and Decentralized Execution (CTDE) scheme (Lowe et al., 2017), where the critic uses the information associated with other agents during the training time and the actors can act independently without access to other agent information during execution. This allows our methods to be applicable in environments where global information (i.e., information associated with other agents) is available during training but not available during execution such as, for example, autonomous driving (Zhou et al., 2020). The CTDE scheme extends our algorithm to partially observable environments, where the actor can just use the local observations of the agent for action selection (during both training and execution), while the critic can use the joint observation of all agents during training. As discussed in Lowe et al. (2017) in the simplest case, the Q-function (used as the critic) would consist of the observations of all agents, but it could also include additional state

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Algorithm 3 ADMIRAL-DM Neural Network Implementation

1: Initialize Qφj, Q j, where φ represent the evaluation (eval) net and represents the target net,

for all j 2 {1, . . . , n}. Initialize a value for hyperparameters and 0 (i.e. value for 0 and 0

2: At t = 0, get the initial state s0 from the environment

3: Let the learning agent be indexed by j

4: For all s 2 S, aj 2 Aj, and j 2 {1, . . . , n}, let Qj

t(s, a1, . . . , an) = 0

5: Deﬁne policy derived from Q to return the random action a with probability t, advisor suggested

action with probability 0

t and greedy action with probability 1 t 0

t 6: Choose action aj

0 for s0 using policy derived from Q for each j 2 {1, . . . , n}

7: while training is not ﬁnished do

8: For each agent j, execute the action aj

t and observe r1

t , . . . , rn

t , . . . , an

t ; and st+1 = s0

9: For each j, choose the next greedy action for all other agents from the copy of their respective policies using s0. The next greedy actions are chosen using the current observed actions of other agents

10: For each j, let u be a uniform random number between 0 and 1

11: if u < 0

12: Obtain next action aj

t+1 = aj0 from the advisor (using state s0)

13: else if u > 0

t and u < t then

14: Set the next action aj

t+1 = aj0 as a random action from the action space Aj

16: Choose a greedy action aj

t+1 = aj0 from Qφj using s0 and the next greedy actions of other agents

18: Store hs, a, r, s0, a0i in replay buffer D, where s = st, a = a1

t , . . . , an

t , . . . , rn

t and a0 = a1

t+1, . . . , an

t+1; for each agent j

19: Set the current action aj

t+1 and the current state st = st+1; for each agent j

20: At the end of each episode linearly decay 0

t and t 21: while j = 1 to n do

22: Sample a minibatch of K experiences hs, a, r, s0, a0i from D

23: Set yj = rj + βQ j(s0, a0) according to Eq. 5

24: Update the Q eval network by minimizing the loss L(φj) = 1

P(yj Qφj(s, a))2

25: end while

26: Update the parameters of the target network for each agent by copying over the eval network every T steps: j φj

27: end while

MULTI-AGENT ADVISOR Q-LEARNING

Algorithm 4 ADMIRAL-AE Neural Network Implementation

1: Initialize Qφj, Q j, where φ represent the evaluation (eval) net and represents the target net,

for all j 2 {1, . . . , n}. Initialize the hyperparameters and 0

2: At t = 0, get the initial state s0 from the environment 3: Let the learning agent be indexed by j

4: For all s 2 S, aj 2 Aj, and j 2 {1, . . . , n}, let Qj

t(s, a1, . . . , an) = 0 5: Set the value of hyperparameters and 0

6: while training is not ﬁnished do 7: For each j, let u be a uniform random number between 0 and 1 8: if u < 0 then

9: Obtain action aj

t for the current state st from the advisor 10: else if u > 0 and u < then

11: Set the action aj

t as a random action from the action space Aj

12: else 13: Choose a greedy action aj

t from Qφj using st and the observed previous actions of other agents 14: end if 15: For each agent j, execute the action aj

t and observe rj

t , . . . , an

t , and st+1 = s0

16: For each agent j, obtain the advisor action a1

e,t+1, . . . , an

e,t+1 for all agents at state s0

17: For each agent j, store hs, a, rj, s0, a0i in replay buffer D, where s = st, a = a1

t , . . . , an

t and a0 = a1

e,t+1, . . . , an

e,t+1 18: Set the current state st = st+1 19: while j = 1 to n do 20: Sample a minibatch of K experiences hs, a, rj, s0, a0i from D 21: Set yj = rj + βAdvisor Q j(s0) according to Eq. 6 22: Update the Q eval network by minimizing the loss L(φj) = 1

P(yj Qφj(s, a))2

23: end while 24: Update the parameters of the target network for each agent by copying over the eval network every T steps: j φj

25: end while

information when available. The critic is not required during execution in this setting. Furthermore, ADMIRAL-DM(AC) makes our method applicable to continuous action spaces as well.

We provide the complete pseudocode for the actor-critic implementation of ADMIRAL-DM in Algorithm 5. All agents maintain two networks during training. The ﬁrst network is the value network that serves as a critic, and the second network is a policy network that serves as the actor (line 1). After each experience, the value (critic) network is updated in line 16 using the TD errors (from Eq. 5) as the loss function. The actor is updated in line 17 using policy gradients. Since the algorithm is maintaining a stochastic policy which explores naturally, we do not need to perform a random action selection for exploration (unlike in the other algorithms).

4. Theoretical Results

In this section, we ﬁrst show that Q-updates following Algorithm 2 will converge to an -equilibrium in the stochastic game. From the update rule provided in Eq. 6, we note that this equilibrium corresponds to the value of the advisor, which is the action-value function that provides the expectation of immediate reward and future discounted rewards when all agents follow the advisor solutions

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Algorithm 5 ADMIRAL-DM(AC) Neural Network Implementation

1: Initialize Vφj, j, the critic and actor networks for all j 2 {1, . . . , n}. Initialize a value for

hyperparameters and 0 (i.e. value for 0 and 0

0) 2: At t = 0, get the initial state s0 3: Let the learning agent be indexed by j

4: For all s 2 S and aj 2 Aj, j 2 {1, . . . , n}, let Qj

t(s, a1, . . . , an) = 0 5: Deﬁne policy derived from Q to return the random action a with probability t, advisor sug-

gested action with probability 0

t and greedy action with probability 1 t 0

t 6: For each j sample action aj

0 from the actor j at state s0 7: while training is not ﬁnished do

8: Execute the action aj

t and observe r1

t , . . . , rn

t , . . . , an

t ; and st+1 = s0, for all agents j 9: For each j, let u be a uniform random number between 0 and 1 10: if u < 0

11: Obtain next action aj

t+1 = aj0 from the advisor (using state s0) 12: else 13: Choose an action aj

t+1 = aj0 using the respective actor j(s0) 14: end if 15: Set yj = rj + βV j

φj(s0, a0 j), where a0 j = (a01, . . . , a0j 1, a0j+1, . . . , a0N), for all j

16: For each j, update the critic by minimizing the loss L(φj) = (yj V j

φj(s, a j))2, where a j = (a1, . . . , aj 1, aj+1, . . . , a N) and s = st 17: For each j, update the actor using the log loss J ( j) = log j(aj|s)L(φj) 18: Set the current action aj = aj0 and the current state s = s0 for each agent j 19: At the end of each episode linearly decay 0

t and t 20: end while

for inﬁnite periods starting from the current state and joint action. This of the -equilibrium will depend on the nature of the advisor used. Further, we prove that the Q-updates following Algorithm 1 converges to the Nash Q-value, thus ﬁnding the Nash equilibrium of the stochastic game.

The primary convergence result for Q-learning based algorithms in a general-sum stochastic game was provided by Hu and Wellman (2003). However, this result relies on a very restrictive assumption that states that every stage game of the stochastic game contains a Nash equilibrium that is either a global optimum or a saddle point. Additionally, an agent must use the payoff at this equilibrium to update its Q-value in every stage game of the stochastic game. As shown by Bowling (2000), this assumption implies that every stage game should use the same kind of equilibrium, it cannot oscillate between being a global optimum or saddle point between stage games. There is almost no game that satisﬁes this condition in practice (Hu and Wellman, 2003). We will show in this section that the convergence results in our setting can be provided under a set of assumptions weaker than that used by Hu and Wellman (2003).

In this section, we provide three important theorems with their detailed proofs. The proofs of each theorem depend on a set of lemmas that we provide. We have included the statement of these lemmas in this section, while the complete proofs of the lemmas can be found in Appendix A.

We start by providing a general result for stochastic processes. Theorem 1 is a technical result, extending a result of Szepesvári and Littman (1999), which will form the foundation of our convergence result in Theorem 2. Theorem 1 aims to relax a requirement of contraction conditions in the result from Szepesvári and Littman (1999). The presence of these conditions would necessitate strong assumptions in a MARL setting as shown in Hu and Wellman (2003), which we aim to avoid.

MULTI-AGENT ADVISOR Q-LEARNING

Towards our Theorem 1, we restate some formal deﬁnitions relating to translation and invariance of operators from Szepesvári and Littman (1999) in Appendix B to stay self-contained.

Theorem 1. Let X be an arbitrary set and assume that B is the space of bounded functions over X . Let T : B ! B, be an arbitrary operator. Let F B, be a subset of B and let F0 : F ! 2B be a mapping that associates subsets of B with the elements of F. Let v be a ﬁxed point of T and let T = (T0, T1, . . .) be a sequence of random operators, Tt mapping B B to B, that approximate T at v and for initial values from F0(v ). Further, assume that F0 is invariant under T . Let V0 2 F0(v ), and deﬁne Vt+1 = Tt(Vt, Vt). If there exist random functions 0 Ft(x) 1 and 0 Gt(x) 1 satisfying the conditions below with probability 1 (w. p. 1), then Vt converges to a point (v S)3 w. p. 1 in the norm of B(X ):

1. For all U1 and U2 2 F0 and all x 2 X ,

Tt(U1, v )(x) Tt(U2, v )(x) = Gt(x)(U1(x) U2(x)).

2. For all U and V 2 F0, and all x 2 X , we can ﬁnd a ﬁnite sequence kt(x) such that,

Tt(U, v )(x) Tt(U, V )(x) = Ft(x)(||v V || + λt + kt(x)||v V ||)

where λt ! 0 w. p. 1 as t ! 1 and kt(x) is ﬁnite for all values of x and t.

3. kt(x) converges to a ﬁnite point (independent of time) K(x), as t ! 1. 4. For all l > 0, n

t=l Gt(x) converges to 0 uniformly in x as n ! 1. 5. There exists 0 γ < 1 such that for all x 2 X and large enough t, Ft(x) = γ(1 Gt(x)) The point S can be represented by the equation S(x) = 1

ˆβ(γC1 + K(x)C1), if K(x) 6= 0,

8x 2 X , where 0 < ˆβ 1 and C1 is a small positive constant. If K(x) = 0 8x 2 X , then S(x) = 0 8x.

Before providing the proof of Theorem 1, we provide an intuitive grasp of the result by relating the different variables in Theorem 1 to RL. The operators T and T are similar to Bellman operators commonly seen in Q-learning, where Tt is the component of T at time t. The B is the space of all Q-functions and F0 provides a mapping on the subsets of B (since a set of n elements has 2n subsets, the F0 maps to 2B). Speciﬁc instances (U, V ) of F0 are considered. These can be seen as particular Q-functions. The variable x denotes the parameters of the Q-function (state, action pair). Here v is the ﬁxed point of the Q-function (Q ). The Gt(x) and Ft(x) are functions of the learning rate ( t). These relations will become more explicit in our upcoming results.

Although our proof for Theorem 1 is structurally similar to that from Szepesvári and Littman (1999), there are signiﬁcant differences in our detailed proof arguments which stems from the differences in the nature of our results. First, Szepesvári and Littman (1999) required an inequality condition in condition 2 to hold for all x and all t. We do not have this requirement. Our condition 2 uses an exact equivalence, includes an additional term to capture the difference in the other terms and the constraint on this additional term needs to hold only as t ! 1 (condition 3). As a consequence, we are restricted to showing convergence to a point close to the ﬁxed point v , instead of exactly to v . Second, as discussed in Szepesvári and Littman (1999), condition 2 in their theorem combined with the other conditions turns Tt operator into a contraction condition for all t which is hard to

3. Note that the variable S here does not denote the state space but a deviation from the ﬁxed point v . We use the

(calligraphic) S to denote the state space.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

satisfy or ensure in multi-agent environments. While Hu and Wellman (2003) use a very restrictive assumption to overcome this problem, we show that this problem can be altogether avoided using our Theorem 1 (which is the core motivation for this theorem). We also use an exact equivalence in condition 1 and condition 5 to avoid the contraction condition. The complete proof of Theorem 1 is given next.

Proof. Let U0 be a value function in F0(v ) and let Ut+1 = Tt(Ut, v ). Since Tt approximates T at v , Ut converges to Tv = v w. p. 1 uniformly over X . Let

δt(x) = Ut(x) Vt(x) (10)

t(x) = v (x) Ut(x). (11)

We know that t(x) converges to 0 because Ut converges to v . We will show that δt converges to a point (independent of t) S, w. p. 1, which implies that Vt converges to the point (v S).

Now from the conditions of Theorem 1 we have,

δt+1(x) = Ut+1(x) Vt+1(x)

= Tt(Ut, v )(x) Tt(Vt, Vt)(x)

= Tt(Ut, v )(x) Tt(Vt, v )(x) + Tt(Vt, v )(x) Tt(Vt, Vt)(x)

= Gt(x)(Ut(x) Vt(x)) + Ft(x)(||v Vt|| + λt + kt(x)||v V ||)

= Gt(x)δt(x) + Ft(x)(||v Vt|| + λt + kt(x)||v V ||)

= Gt(x)δt(x) + Ft(x)(||v Ut + Ut Vt|| + λt + kt(x)||v Ut + Ut Vt||)

= Gt(x)δt(x) + Ft(x)(||δt + t|| + λt + kt(x)(||δt + t||))

1 Gt(x)δt(x) + Ft(x)(||δt|| + λt + kt(x)||δt|| + || t|| + kt(x)|| t||)

= Gt(x)δt(x) + Ft(x)(||δt|| + kt(x)||δt|| + t).

The (1) comes from the fact that t is guaranteed to converge to 0 in the limit, so the effect of splitting the sum under the norm is negligible. Regarding the last step, let us denote the term (λt + || t|| + kt(x)|| t||) by t as all these terms converge to 0 in the limit. Another assumption in

Theorem 1 is that kt(x) converges to K(x) in the limit. We consider two cases, in which the ﬁrst case is K(x) = 0 8x 2 X and the second case is when K(x) 6= 0. In the ﬁrst case, notice that, the theorem will then effectively change to Theorem 1 in Szepesvári and Littman (1999), where the authors prove that δt(x) converges to 0 and hence Vt(x) converges to v (x).

For the second case, we provide the proof for the process δt to converge to a point (independent of time) by keeping a modiﬁed process ˆδt bounded by rescaling δt. Since δt is a homogeneous process, it can be written in the form δt+1 = Gt(δt, || t|| + λt), such that ˆβGt(x, y) = Gt(ˆβx, ˆβy)

MULTI-AGENT ADVISOR Q-LEARNING

holds for all ˆβ > 0. Now we will prove that, if ˆδt converges to a point S, then δt will also converge to another point, that is 1

ˆβS, where ˆβ is the scale factor applied to δt to get the modiﬁed process ˆδt. Similar to Szepesvári and Littman (1999), we will begin by considering a homogenous process and another equivalent formulation of this process that can be obtained by keeping it bounded by scaling.

Let us consider a process of the form,

xt+1 = Gt(xt, t) (13)

where Gt : B B ! B is a homogeneous random function, i.e.,

Gt(ˆβx, ˆβ ) = ˆβGt(x, ) (14)

holds for all ˆβ > 0, x and . We want to prove that xt converges to some point independent of t.

Now, consider another process, that is obtained from modifying process in Eq. 13 by keeping it bounded by re-scaling, namely the process

n Gt(yt, t) if:||Gt(yt, t)|| C CGt(yt, t)/||Gt(yt, t)|| otherwise (15)

We denote the solution of Eq. 13 corresponding to an initial condition of x0 = ! and a sequence = { k} by xt(!, ). Similarly, we denote the solution of Eq. 15 corresponding to the initial condition of y0 = ! and the sequence by yt(!, ).

Next, we state a lemma that provides a relationship between convergence of the sequence represented by xt and the sequence represented by yt. The result is used later in Lemma 4.

Lemma 1. Let us ﬁx an arbitrary positive constant C, an arbitrary w0, and a sequence . Then provided that

(i) yt(w0, ) converges to a point (independent of t) D. (ii) The sequence converges to 0 in the limit (t ! 1). The homogeneous process xt(w0, ) converges to a point 1

ˆβD w. p. 1, where ˆβ, satisfying 0 < ˆβ 1, is the scaling factor applied.

Next, we state another lemma that provides conditions for the convergence of a cascade of two converging processes. Again, this result will be used later in Lemma 4.

Lemma 2. Let X and Y be normed vector spaces, Ut : X Y ! X(t = 0, 1, 2, . . .) be a sequence of mappings, and t 2 Y be an arbitrary sequence. Let 1 2 Y and x1 2 X. Consider the sequences xt+1 = Ut(xt, 1), and yt+1 = Ut(y t, t) and suppose that xt and t converge to x1 and 1 respectively, in the norm of the appropriate spaces.

k be the uniform Lipschitz index of Uk(x, ) with respect to at 1 and, similarly, let LX

t satisfy the relations L

m = 0 where C > 0 is some constant and t = 0, 1, 2, . . . , then limt !1 ||yt x1|| = 0.

Now, we show that stochastic processes having certain special structure will converge to some point independent of t, under a set of conditions. This result will also be used later in the proof of Lemma 4.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Lemma 3. Let Z be an arbitrary set and consider the process

xt+1(z) = Gt(z)xt(z) + Ft(z)(C + kt(z)C) (16)

where x1, Ft, Gt 0 are random processes, ||x1|| < C < 1 w. p. 1 for some C > 0, and z is an element in Z. Assume that for all k, limn !1 n

t=k Gt(z) = 0 uniformly in z w. p. 1 and Ft(z) = γ(1 Gt(z)), for some 0 γ < 1, and 8z 2 Z, w. p. 1. Also, kt(z) converges to K(z) in the limit. Then, xt(z) converges to a point D(z) = γ(C + K(z)C) w. p. 1.

Given the previous results, we are now in a position to conclude the proof by showing that a stochastic process that can be represented by Eq. 12 will converge to a point independent of t, which is our required result.

Lemma 4. Consider an equation of the form

xt+1(z) = Gt(z)xt(z) + Ft(z)(||xt|| + t + kt(z)||xt||) (17)

where the sequence t converges to zero w. p. 1. Assume that for all k, limn !1 n

t=k Gt(z) = 0 uniformly in z w. p. 1 and Ft(z) = γ(1 Gt(z)), for some 0 γ < 1, and 8z 2 Z, w. p. 1. Assume further that kt(z) is ﬁnite, and it converges to K(z) in the limit (t ! 1). Then xt(z) converges to a point represented by S0(z) = 1

ˆβ(γC1 + K(z)C1), where C1 is a small positive constant, w. p. 1

uniformly over Z. Here ˆβ is a scaling factor satisfying 0 < ˆβ 1.

Hence, we have proved that Eq. 10 has converged to a point independent of t, and thus Theorem 1 follows. The expression for point S is derived in Lemma 4.

Let us deﬁne a relaxation process of the form

Vt+1(x) = (1 ft(x))Vt(x) + ft(x)[Pt Vt](x) (18)

where 0 ft(x) 1 is a relaxation parameter and the sequence Pt : B(X ) ! B(X ) can be considered a randomized version of the operator T. Let us consider a process Ut such that E[Vt(x)] = Ut(x). Also let, E[Pt Vt](x) = TV (x). Now we can state the following corollary to Theorem 1.

Corollary 1. Assume that the process deﬁned by

Ut+1(x) = (1 ft(x))Ut(x) + ft(x)[Ptv ](x) (19)

converges to v w. p. 1. Assume further that the following conditions hold:

1. There exist a scalar γ satisfying 0 γ < 1 and a sequence λt 0 converging to 0 w. p. 1 such that ||Ptv Pt V || = γ||v V || + λt + γkt(x)||v V || holds for all V 2 B(X ) and for ﬁnite kt(x).

2. kt(x) converges to ﬁnite point K(x) as t goes to 1. 3. 0 ft(x) 1, t 0, and Pn

t=1 ft(x) goes to inﬁnity uniformly in x as n ! 1. Then, the iteration deﬁned by Eq. 18 converges to a point (v S), where S is deﬁned as in Theorem 1.

MULTI-AGENT ADVISOR Q-LEARNING

Proof. Here, Pt is a randomized version of an operator T. It can be proved that a process of the form,

Ut+1(x) = (1 ft(x))Ut(x) + ft(x)[Pt V ](x) (20)

will converge to TV w. p. 1 where V 2 B(X ). The convergence is due to Lemma 5 below.

Lemma 5. Let Ft be an increasing sequence of σ-ﬁelds, let 0 t and wt be random variables such that t and wt 1 are Ft measurable. Assume that the following hold w. p. 1: E[wt|Ft, t 6= 0] = A, E[w2

t |Ft] < B < 1, P1

t=1 t = 1 and P1

t < C < 1 for some B, C > 0. Then the process

Qt+1 = (1 t)Qt + twt (21)

converges to A w. p. 1.

Let the random operator sequence Tt : B(X ) B(X ) ! B(X ) be deﬁned by

Tt(U, V )(x) = (1 ft(x))U(x) + ft(x)[Pt V ](x). (22)

We know that the operator Tt approximates T at v , since the process deﬁned by Eq. 20 converges to

TV for all V 2 B(X ) by Lemma 5 and the process deﬁned by Eq. 19 converges to v by deﬁnition. Moreover, observe that Vt as deﬁned by Eq. 18 satisﬁes Vt+1 = Tt(Vt, Vt). Due to conditions 1,2, and 3, it can be readily veriﬁed that coefﬁcients Gt(x) = 1 ft(x), and Ft(x) = γft(x) satisfy the rest of the conditions of Theorem 1, and this yields that the process Vt converges to (v S) w. p. 1.

Now, we deﬁne the Pt operator in the context of Algorithm 2.

Deﬁnition 9. Let Q = (Q1, . . . , Qn). We deﬁne an operator Pt : Q ! Q such that Pt Q = (Pt Q1, . . . , Pt Qn), where

Pt Qk(s, a) = rk

t (s, a) + βσ1

t (s0)Qk(s0) (23)

and k = (1, . . . , n), s0 is the state at time t + 1, and (σ1

t (s0), . . . , σn

t (s0)) is an advisor solution for the stage game (Q1(s0), . . . , Qn(s0)) at time t.

The Pt operator s value depends on the advisor solution. Next, we will state some assumptions. The ﬁrst two are commonly used in RL (Jaakkola et al., 1994; Singh et al., 2000).

Assumption 1. Every state s 2 S and action aj 2 Aj, for each agent j 2 {1, . . . , n}, are visited inﬁnitely often, and the reward function for all agents stay bounded.

Assumption 2. For all s, t, and a, 0 t(s, a) < 1, P1

t=0 t(s, a) = 1, P1

t=0[ t(s, a)]2 < 1.

Assumption 3. The advisor solution to any stage game (Q1(s), . . . , Qn(s)) pertaining to state s 2 S will become stationary in the limit (t ! 1). In other words, the advisor can adapt its solutions, but in the limit it is guaranteed to settle down and provide the same advisor strategy for a particular state s 2 S.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Assumption 3 allows the advisor to learn and adapt during gameplay. However, in the limit, the advisor is required to converge and provide the same strategy. Now, directly, it can be seen that this assumption is much weaker than the restrictive assumption in Hu and Wellman (2003). Firstly, Assumption 3 does not involve the computation of Nash equilibrium of every stage game in the stochastic game. Secondly, agents need not use the Nash equilibrium of every stage game to update their Q-values unlike the assumption in Hu and Wellman (2003). If this condition were needed, then the advisor will need to only suggest the Nash equilibrium at every stage game, greatly reducing the scope of the advisor. Thirdly, the Nash equilibrium at every stage game does not need to be a global optimum or saddle point. This condition in Hu and Wellman (2003) is not satisﬁed in any practical game, which we have relaxed.

Next, we state a lemma needed to prove convergence.

Lemma 6. Assume that t satisﬁes Assumption 2 and the mapping Pt : Q ! Q satisﬁes the condition that, there exists a scalar γ satisfying 0 γ < 1, a sequence λt 0 converging to zero w. p. 1, and a ﬁnite sequence kt(s) such that ||Pt Q Pt Q || = β||Q Q || + λt + βkt(s)||Q Q || for all Q, and all s 2 S. Assume further that, kt(s) converges to a ﬁnite point K(s) in the limit. Additionally, Q (s, a) = E[Pt Q (s, a)], then the iteration deﬁned by

Qt+1(s, a) = (1 t)Qt(s, a) + t[Pt Qt(s, a)] (24)

converges to (Q S) w. p. 1, where S is as given in Theorem 1.

Theorem 2 proves that the Q-updates in Algorithm 2, converges to an -equilibrium consistent with Deﬁnition 7. The Q-values at convergence denotes the value of the advisor. The proof of this theorem will be an application of Lemma 6. The bound of the -equilibrium will depend on the advisor through its advisor solutions.

As proved by Fink et al. (1964), every stochastic game is guaranteed to have at least one Nash equilibrium point. However, there can be more than one Nash equilibrium point, in which case, the Q in Theorem 2, could be the Nash Q-function of any Nash equilibrium strategy. We do not require the Nash equilibrium point of the stochastic game to be unique in Theorem 2 as assumed in prior works (Hu and Wellman, 2003).

Theorem 2. Under the Assumptions 1, 2, and 3, the Q-functions updated by Eq. 6 converges to a bounded distance from the Nash Q-function Q = (Q1

, . . . , Qn

), represented as (Q S), in the limit (t ! 1). The point S is as given in Theorem 1.

Proof. We will state a lemma before we begin the proof of the theorem.

Lemma 7. For a n-player stochastic game, E[Pt Q ] = Q where Q = (Q1

, . . . , Qn

The proof of the Theorem 2 is a direct application of Lemma 6, which establishes convergence given three conditions. The condition pertaining to the expectation relationship is satisﬁed by Lemma 7.

To satisfy the ﬁrst condition (distances between Q-functions), we will begin by providing two expressions pertaining to distances between Q-functions seen in Lemma 6. The ﬁrst expression gives a formal way of determining the distance between any Q-function and the Nash Q-function.

MULTI-AGENT ADVISOR Q-LEARNING

Consider two Q functions, Q, Q 2 Q. Then we can get,

||Q Q || , maxjmaxs||Qj(s) Qj

= maxjmaxsmaxa1,...,an||Qj(s, a1, . . . , an) Qj

(s, a1, . . . , an)||

where Q = (Q1

, . . . , Qn

is the Nash Q-function of the agent j. The second expression below pertains to distances under corresponding Pt operators. Consider two Q functions Q, Q 2 Q. Then we can get,

||Pt Q Pt Q || , maxj||Pt Qj Pt Qj

= maxjmaxs||βσ1

t (s)Qj(s) βσ1

= maxjβ||σ1

t (s)Qj(s) σ1

t (s), . . . , σn

t (s)) is an advisor solution for the stage game (Q1(s), . . . , Qn(s)) at time t and (σ1

,t(s0), . . . , σn

,t(s0)) is an advisor solution for the stage game (Q1

(s0), . . . , Qn

(s0)) at time t. Also Qj

is the Nash Q-function of agent j.

In the second step, the corresponding reward terms from the Pt operators get cancelled from Deﬁnition 9. Since the state s changes for every t, the dependence on maxs in Eq. 26 is removed in the last step.

||Pt Q Pt Q || = β||Q Q || + λt + βkt(s)||Q Q || (27)

We state the equation in Lemma 6 again in Eq. 27. Now, we can ﬁnd a ﬁnite kt(s) such that the Eq. 27 is satisﬁed for all Q 2 Q and all s 2 S. This is due to the fact that all the other terms in that expression are guaranteed to be ﬁnite due to Assumption 1. This satisﬁes another condition of Lemma 6.

To satisfy the last condition of Lemma 6, we rewrite the Eq. 27, to get an expression for kt(s) in the limit (t ! 1),

kt(s) = ||Pt Q Pt Q ||

β||Q Q || 1

t (s)Qj(s) σ1

(s)| β||Q Q || 1

kt(s) = maxj |σ1(s) σn(s)Qj(s) σ1

(s)| ||Q Q || 1

kt(s) = K(s).

The ﬁrst expression in Eq. 28 is obtained as λt is guaranteed to go to 0 in the limit. In the second expression, Eq. 26 is being used. The third expression is from Assumption 3. Now, from Eq. 28, we can see that kt(s) is a constant for a given state that has no dependence on time t. This satisﬁes the last condition of Lemma 6.

Hence, all the conditions of Lemma 6 are satisﬁed and, therefore, the process Qt+1 = (1 t)Qt + t[Pt Qt] converges to a bounded distance from the Nash equilibrium (Q S). Thus, the

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

equilibrium point reached is an epsilon equilibrium, where the epsilon value can be given by the point S. The expression for this point is as given in Theorem 1 with the value of K being given in Eq. 28. Here the agents can still unilaterally deviate and possibly obtain an additional payoff consistent with Deﬁnition 7.

Now, we move on to providing theoretical guarantees for Algorithm 1. We will show in Theorem 3 that this algorithm will converge to a Nash equilibrium under a set of assumptions. Again, these assumptions are weaker than others previously considered in literature. To prove this convergence result for Algorithm 1 we will retain the ﬁrst two assumptions but replace Assumption 3 with two other assumptions.

Assumption 4. The algorithm is Greedy in the limit with Inﬁnite Exploration (GLIE).

Assumption 5. The Nash equilibrium is a global optimum or saddle point in every stage game of the stochastic game.

Assumption 4 allows the policy to explore, but in the limit (t ! 1) this assumption requires the policy to choose greedy actions. Assumption 5 is strong, however, Hu and Wellman (2003) show that similar assumptions are required to prove convergence in theory but not necessary to observe convergence in practice, which is consistent with our observations as well. Thus, even if such assumptions are violated in practice, convergence is still observed. Further, it is to be noted that Assumption 5 is a weaker condition than the assumption in Hu and Wellman (2003), since it does not require calculating the Nash equilibrium at each stage game and using the same to update the Q-values.

Theorem 3. When we update the Q-functions according to Eq. 5, they converge to the Nash Qfunction under Assumptions 1, 2, 4, and 5, in the limit (t ! 1).

Theorem 3 proves that the Q-updates in Algorithm 1 converges to the Nash equilibrium strategies. The proof is along the lines of Theorem 1 in Singh et al. (2000), but involves signiﬁcant modiﬁcations to cater to the multi-agent scenario.

Proof. The proof will involve using two lemmas from previous work, one from Jaakkola et al. (1994) and the other from Hu and Wellman (2003). We will start the proof of this theorem by stating the ﬁrst lemma.

Lemma 8. A random iterative process

t+1(x) = (1 t(x)) t(x) + t(x)Ft(x) (29)

where x 2 X, t = 0, 1, . . . , 1, converges to zero with probability one (w. p. 1) if the following properties hold:

1. The set of possible states X is ﬁnite. 2. 0 t(x) 1, P

t t(x) = 1, P

t (x) < 1 w. p. 1, where the probability is over the learning rates t.

3. || E{Ft(x)|Pt}||W K || t||W + ct, where K 2 [0, 1) and ct converges to zero w. p. 1. 4. var{Ft(x)|Pt} K(1 + || t||W )2, where K is some constant. Here Pt is an increasing sequence of σ-ﬁelds that includes the past of the process. In particular, we assume that t, t, Ft 1 2 Pt. The notation || ||W refers to some (ﬁxed) weighted maximum norm and the notation var refers to the variance.

MULTI-AGENT ADVISOR Q-LEARNING

Let us deﬁne a Nash operator Pt, consistent with the deﬁnition in Hu and Wellman (2003). The Nash operator is deﬁned using the following equation,

Pt Qk(s, a1, . . . , an) = Es0 p[rk

t (s, a1, . . . , an) + γ 1

(s0)Qk(s0)] (30)

where s0 is the state at time t + 1, ( 1

(s0), . . . , n

(s0)) is the Nash equilibrium solution for the stage game (Q1(s0), . . . , Qn(s0)), and p is the transition function. Qk denotes the Q-value of a representative agent k.

Lemma 9. Under Assumption 5, the Nash operator as deﬁned in Eq. 30 forms a contraction mapping with the ﬁxed point being the Nash Q-value of the game.

Now, since the Pt operator forms a contraction mapping, ||Pt Q Pt Q || β||Q Q ||, is satisﬁed for some β 2 [0, 1) and all Q. Here Q is the Nash Q-value.

We will apply Lemma 8 to show that the Q-values converge to the Nash Q-value. We will drop the agent index in all the expressions for simplicity. The rest of the proof is conducted for the Q-values of a representative agent.

The ﬁrst two conditions of Lemma 8 are satisﬁed from the assumptions. Comparing Eq. 29 and Eq. 5 we get that x can be associated with the state action pairs (s, a1, . . . , an) and t(st, at) can be associated with Qt(s, a1, . . . , an) Q (s, a1, . . . , an). Here, Q (s, a1, . . . , an) can be considered as the Nash Q-value (Q-values under the Nash Q-function).

t+1(x) = (1 t(x)) t(x) + t(x)Ft(x), (31)

Ft(x) = rt + γv Nash(st+1) Q (st, a1

t , . . . , an

t ) + γ[Qt(st+1, a1

t+1, . . . , an

t+1) v Nash(st+1)]

= rt + γv Nash(st+1) Q (st, a1

t , . . . , an

t ) + Ct(st, a1

t , . . . , an

t , . . . , an

t ) + Ct(st, a1

t , . . . , an

We deﬁne Ft(st, a1

t , . . . , an

t , . . . , an

t ) = Ct(st, a1

t , . . . , an

t ) = 0 if (s, a) 6= (st, at). We also deﬁne a = (a1, . . . , an) and at = (a1,t, . . . , an,t). Let the σ-ﬁeld generated by all the

random variables (st, t, a1

t , . . . , an

t , rt 1, . . . , s1, 1, a1, Q0) be represented by Pt. Now, all the Qvalues are Pt measurable which makes t and Ft, Pt measurable and this satisﬁes the measurability condition of Lemma 8.

Hu and Wellman (2003) showed that v Nash(st+1) , vk(s0, 1

, . . . , n

(s0) (see the proof in Lemma 10 of Hu and Wellman (2003)). Hence, from Lemma 9, we can show that the E[F Q

t ] forms a contraction mapping. This can be done using the fact that E(Pt Q ) = Q (Lemma 7). Here, the norm is the maximum norm on the joint action.

Now, we have the following for all t,

t , . . . , an

t )|Pt]|| γ||Qt(st, a1

t , . . . , an

t ) Q (st, a1

t , . . . , an

t )|| = γ|| t||. (33)

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Now from Eq. 32,

|| E[Ft(st, a1

t , . . . , an

t )|Pt]|| || E[F Q

t , . . . , an

t )|Pt]|| + || E[Ct(st, a1

t , . . . , an

γ|| t|| + || E[Ct(st, a1

t , . . . , an

(34) This satisﬁes the third condition of Lemma 8 if ct = || E[Ct(st, a1

t , . . . , an

t )|Pt]|| converges to 0 w. p. 1.

Let us rewrite the deﬁnition of the Ct,

t , . . . , an

t ) = γ[Qt(st+1, a10

t+1, . . . , an0

t+1) v Nash(st+1)]

t , . . . , an

t ) = γ[max Qt(st+1, a10

t+1 . . . , an0

t+1) v Nash(st+1)]

t , . . . , an

t ) = γ[v(st+1) v Nash(st+1)].

In the second step, we are using the assumption that the Q-value is GLIE. The max operator operates over the action space of the representative agent.

According to Assumption 5, the Nash equilibrium could only be a global optimum or a saddle point. Now, if it is a global optimum, the value of maximizing all the actions in Eq. 35 will lead to the global optimum for all the agents and this will be the Nash payoff, thus leading to Ct evaluating to 0 in the limit. Furthermore, Hu and Wellman (2003) show that a global optimum is always a Nash equilibrium and all global optima are guaranteed to have equal values. Alternatively, if the Nash equilibrium is a saddle point, consider a stage game, with saddle point equilibrium payoff, σ and . Then, σkσ k Qk(s) kσ k Qk(s), as deviating from the equilibrium when the others are playing the equilibrium strategy will leave an agent worse off by deﬁnition of a Nash equilibrium. Also, k k Qk(s) kσ k Qk(s), as in a saddle point, if others deviate the agent should be better off (see Deﬁnition 13 in Hu and Wellman (2003)). Thus, we will get the relation, k k Qk(s) σkσ k Qk(s). Since σ and are saddle points, the previous argument holds without the loss of generality. Hence, the following is also true, σkσ k Qk(s) k k Qk(s). Thus, the value obtained is the same in the saddle points and the value would be Nash value if all the agents are being greedy given the strategies of all other agents. Thus, we have proved that for all cases Ct converges to 0 in the limit.

The fourth condition of the Lemma 8 is satisﬁed since we have the reward to be bounded (Assumption 1) and we know the variance of F Q

t is bounded (Jaakkola et al., 1994). Thus, it follows from Lemma 8 that the process t converges to 0 and hence, Qt converges to the Nash Q-function Q .

Theorem 3 shows that complicated steps in algorithms such as Nash-Q (Hu and Wellman, 2003) to predict the actions of other agents using an equilibrium calculation are unnecessary. The other agents current policy can be used directly instead of the equilibrium calculations. Assumption 5 is strong enough to ensure that such processes converge. Also, Theorem 3 proves convergence without any restrictions on the nature of advisors. They could be sub-optimal or adversarial. Thus, in both the Theorem 2 and Theorem 3, we have proved ﬁxed point guarantees in general-sum stochastic games with weaker assumptions than those used by earlier work (Hu and Wellman (2003)). We also

MULTI-AGENT ADVISOR Q-LEARNING

Figure 1: Grid Maze environment

note that Theorem 3 just assumes that the advisor inﬂuence decays to 0 in the limit, and so is also applicable to learning algorithms without advisors.

To conclude, from the Theorem 2 and Theorem 3 we see that both our algorithms (i.e., ADMIRALAE and ADMIRAL-DM) have a suitable ﬁxed point and a guarantee of converging to that ﬁxed point in the limit. This shows that our algorithms are theoretically grounded.

5. Experiments

We experimentally validate our algorithms, showing their effectiveness in a variety of situations

using different testbeds. We also demonstrate superior performance to common baselines previously used in literature. The source code for the experiments has been open sourced (Subramanian, 2022).

5.1 Experimental Results - Tabular Version

The objective of this section is to provide a simple illustration of the tabular version of our algorithms using different kinds of advisors.

For our ﬁrst experiment, we investigate the performance of ADMIRAL-AE, where we control the quality of the advisors. To this end, we use a 5 5 Grid Maze environment for the empirical evaluation. The schematic representation for this environment is available in Figure 1. Here the red and blue agents are trying to reach the yellow goal. The agents must learn to avoid the black pitfalls while reaching the goal. This is a cooperative game where the two agents must learn to perfectly coordinate the task of reaching the goal to achieve maximum reward. However, the agents are unable to communicate directly. An agent has to learn to take the correct actions to reach the goal in relation to the observed behaviour of the other agent. The game terminates if at least one of the agents reaches the goal state or hits a pitfall. We implemented four rule-based advisors of decreasing quality, from Advisor 1, which provides the best action for each state (relative to the other advisors), to Advisor 4, which makes random suggestions. The state in this game corresponds to the positions (grid coordinates) of both the agents and the action involves moving in one of the four cardinal directions. A detailed description of the environment, reward function, action selection, and more details on the advisors are in Appendix D.

First we conduct the pre-learning phase where we study the performance of ADMIRAL-AE using the different advisors in the Grid Maze domain. In our implementations, both agents play a separate instance of the same algorithm (ADMIRAL-AE with a particular advisor). We consider

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

(a) Performance of ADMIRAL-AE using 4 different advisors.

(b) Mean Square Error (MSE), between current Q-value of ADMIRAL-AE using Advisor 1 and the Qσ, which is the true value of the advisor.

Figure 2: Experimental ﬁndings using the tabular version of ADMIRAL-AE with different advisors. (a) shows that ADMIRAL-AE with the best advisor (Advisor 1) gives the best overall performance, and ADMIRAL-AE with the worst advisor (Advisor 4) gives the worst overall performance. (b) shows that the MSE between the Q-value from ADMIRAL-AE and the value of the advisor Qσ progressively reduces, and hence ADMIRAL-AE evaluates correctly. All results show an average of ﬁve runs, and they have negligible standard deviation.

the cumulative rewards obtained by ADMIRAL-AE with each of the advisors over a period of 2000 episodes of training. Figure 2(a) shows the average result over ﬁve runs. Since perfect coordination gives the large positive rewards, we can see from Figure 2(a) that the ADMIRAL-AE implementation with the best advisor (Advisor 1) results in the highest performance for the task. The performance progressively degrades from Advisor 1 to Advisor 4. We highlight that the performance of ADMIRAL-AE depends on the quality of the advisor, with the best advisor (Advisor 1) leading to the best overall performance. This also shows that we would ﬁnd most value in learning from Advisor 1, as compared to the other advisors. Thus, from Figure 2(a) we conclude that there is great value in using ADMIRAL-AE when the quality and suitability of advisors are the objective of study.

Further, we plot the mean square error (MSE) between the Q-values (of every state and joint action) obtained from playing ADMIRAL-AE (using the best advisor, Advisor 1) at the end of each episode and the true Q-value of the advisor (denoted as Qσ). As mentioned before, the true Q-value of the advisor captures the expectation of the sum of the immediate reward and the discounted sum of future rewards obtained, when all agents follow the advisor s strategy for inﬁnite periods starting from the current state and joint action. We obtain this value by running trajectories from each state and joint action pair until the end of the episode and calculating the expected discounted sum of rewards. The plot of the MSE is given in Figure 2(b). From this ﬁgure, we see that the MSE approaches close to zero after about 2000 episodes of training, which shows that the ADMIRAL-AE algorithm correctly evaluates the advisor. This experiment provides another illustration for the effectiveness of ADMIRAL-AE in evaluating the given advisor.

In Table 2 we use the average cumulative performances (5 runs) of ADMIRAL-AE along with all the advisors (from Figure 2(a)) to ﬁnd a value for 0

0, the hyperparameter that controls the advisor inﬂuence in ADMIRAL-DM. Recall that this is a major objective of the pre-learning phase. We use

MULTI-AGENT ADVISOR Q-LEARNING

Advisor Average cumulative reward (rounded to nearest 10)

Maximum possible cumulative reward (adjusted for random exploration)

Average performance of ADMIRALAE using a random advisor (rounded to nearest 10)

Normalized value (rounded up to nearest ﬁrst decimal)

Advisor 1 3560 3800 930 1 Advisor 2 1700 3800 930 0.3 Advisor 3 1030 3800 930 0.1 Advisor 4 930 3800 930 0

Table 2: Finding 0

0 using ADMIRAL-AE for the Grid Maze environment.

the Eq. 7 given in Section 3.2. The performance of ADMIRAL-AE using each of the advisors should lie between the maximum possible performance and the performance of ADMIRAL-AE using a random advisor. We normalize the average performances between the range of [0, 1], which will be used as the initial value of 0 (or 0

0) in ADMIRAL-DM. The maximum possible performance in Table 2 is adjusted for random exploration, which is approximately 5% of all actions. Hence, we subtract this portion from the theoretical possible maximum performance of 4000 for this domain. The initial values of 0 that would pertain to the advisors, are given in the last column of Table 2. From this table, it can be seen that the 0

0 value is high for Advisor 1, since there is good value to be gained in listening to this advisor. On the other hand, 0

0 is lower for other advisors and for Advisor 4 this value is 0, which means there will be no advisor inﬂuence. Since Advisor 4 only provides random advice, the agent is better off listening to its own policy rather than following actions suggested by Advisor 4. Thus, using the values given in Table 2, ADMIRAL-DM would listen more to the good advisor and listen less (or not at all) to the bad advisors, as was our goal.

Next, we show an illustration of the tabular implementation of our ADMIRAL-DM algorithm (Algorithm 1). We use the same Grid Maze domain along with the same four advisors for this study. In this setting, we have the ADMIRAL-DM algorithm training along with each of the advisors for 2000 episodes. In each implementation, both agents use the same algorithm for training (ADMIRALDM with an advisor) similar to the previous experiments. We set the initial value of 0 (or 0

0) to the values obtained from ADMIRAL-AE (given in Table 2). As described, since the Advisor 4 is bad, the value of 0

0 is set to 0 and there is no inﬂuence from this advisor during learning. Further, for all the implementations, the advisor inﬂuence through 0 is linearly decayed during training, as described in the algorithmic steps for ADMIRAL-DM. More details regarding the game conditions and reward functions are given in Appendix D. The results (cumulative rewards) are in Figure 3(a), where we plot the averages of ﬁve experimental runs. We note that learning from the best advisor (Advisor 1) using ADMIRAL-DM obtains the best overall performance, while ADMIRAL-DM with the other advisors requires more episodes to obtain similar performances to that of the Advisor 1. Since Advisor 1 teaches useful strategies, ADMIRAL-DM using Advisor 1 sees a good performance early on in training, even when there have been only limited interactions with the environment. This

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

(a) Performance of ADMIRAL-DM in the Grid Maze domain.

(b) Mean Square Error (MSE), between current Q-value of ADMIRAL-DM (using Advisor 1) and Q , which is the Nash Q-value.

Figure 3: Experimental ﬁndings on the Grid Maze domain with both agents playing a tabular implementation of ADMIRAL-DM with different advisors with hyperparameter 0

0 obtained from Table 2. (a) shows that using a tuned value for 0

0, gives a good performance for all the implementations of ADMIRAL-DM with the different advisors. However, better advisors help in getting a relatively better performance. (b) shows that the MSE between the current Q-estimate of ADMIRAL-DM using Advisor 1, and the Nash Q-value progressively reduces. All results show an average of ﬁve runs, and they have negligible standard deviation.

shows the value of positive inﬂuences from good advisors for improving the sample efﬁciency of MARL algorithms.

Now, we show that the Q-values of an agent following the ADMIRAL-DM algorithm converges to the Nash Q-value of the stochastic game. In Figure 3(b) we plot the MSE between the Q-values (of every state and joint action) of ADMIRAL-DM using the Advisor 1, and the Nash Q-value. To obtain the Nash Q-value we construct the Nash policy and obtain the value of this policy by running trajectories from each state and joint action pair till the end of the episode and calculating the expected discounted sum of rewards (similar to obtaining the value of the advisor in the previous experiment). In this environment, the Nash equilibrium strategies will provide the actions of perfect coordination that obtains large positive rewards. The MSE in Figure 3(b) approaches very close to zero after 2000 episodes of training, showing that the Q-values following ADMIRAL-DM ﬁnds the Nash Q-value in the limit.

5.2 Experimental Results - Function Approximation - ADMIRAL-AE

We present results for our advisor evaluation algorithm (Algorithm 2) on the large state-action

Pommerman environment (Resnick et al., 2018). The objective is to conduct the pre-learning phase to evaluate a set of advisors and pick a suitable 0

0 for learning using ADMIRAL-DM, which we study in the upcoming sub-sections. We use a two-agent version of Pommerman, which we denote as Domain One Vs One of Pommerman (we will consider another domain of Pommerman shortly). Pommerman is a complex multi-agent domains, with each state containing more than 200 elements describing the position of the board, special features like bombs, and the position of other agents.

MULTI-AGENT ADVISOR Q-LEARNING

Each agent can perform 6 actions, which include moving in the grid and laying bombs to kill the opponent. The reward function in Pommerman is quite sparse with the agents getting a +1 for winning the game, -1 for losing or a draw, with nothing in between. This game is general-sum since both agents get -1 for a draw. There is a maximum of 800 steps and the games where there are no winners after 800 steps are declared to be a draw. It is very hard for RL agents to learn good performance in Pommerman due to difﬁculties in balancing the twin goals of the killing of opponent and protecting themselves (Gao et al., 2019).

(a) Advisor 1

(b) Advisor 2

(c) Advisor 3

(d) Advisor 4

Figure 4: Analysis of ADMIRAL-AE algorithm on Pommerman (Domain One Vs One) against (single-agent) DQN. The standard deviation in (a) and (d) are very small (negligible). The best advisor (Advisor 1) makes the agent reach the best overall performance. The performance steadily decreases from Advisor 1 to Advisor 4. All results are averages of 10 experimental runs ((a) and (d) have negligible standard deviations).

For these experiments, we use the neural network implementation of the ADMIRAL-AE algorithm (Algorithm 4). We consider four different advisors. Advisors are arranged in decreasing quality of advice, with Advisor 1 providing the best advice and Advisor 4 proving the worst (e.g. random advice). The Advisors 1 and 2 have a positive inﬂuence on learning, as they can teach many useful techniques to win the game, while Advisor 4 has a negative inﬂuence on learning. The Advisor 3 is also capable of teaching some useful strategies and in general, is better than a random advisor (Advisor 4). However, it is much worse compared to Advisor 1 or Advisor 2. Appendix D contains complete descriptions of the advisors and the implementation details of the algorithms used.

We conduct all experimental runs on 100,000 Pommerman games (episodes). Each episode is a full Pommerman game containing a maximum of 800 steps. Each experiment has a DQN

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

(single-agent version as introduced in Mnih et al. (2015)) agent and an ADMIRAL-AE agent training and competing against each other. The experiments analyze the performance of ADMIRAL-AE with each of the four advisors against the common opponent (DQN). The performance is plotted in Figure 4. We repeat the experiments 10 times and plot the averages and standard deviations. We observe that using the best advisor (Advisor 1) clearly results in the best performance of the ADMIRAL-AE algorithm, with an overall cumulative reward reaching around 60,000 (Figure 4(a)). The second-best advisor (Advisor 2) results in cumulative reward around 35,000 (Figure 4(b)). When the ADMIRAL-AE uses Advisor 3 and Advisor 4, DQN results in better performance cumulatively than the ADMIRAL-AE algorithm (Figures 4(c) and (d)). The results (Figure 4) show that the ADMIRAL-AE algorithm can distinguish between different quality advisors. From Figure 4, it is clear that the advisor of choice for learning in this domain is Advisor 1. This result is obtained by running a separate instance of the ADMIRAL-AE algorithm with each of the advisors. This is consistent with our description of possible ways of evaluating the advisors using the ADMIRAL-AE algorithm in Section 3.2.

In Table 3 we tabulate the results and normalize the average performances to obtain a suitable value for 0

0 (using Eq. 7). The procedure is the same as that adopted in Section 5.1. We adjust the column for maximum possible performance value for 10% random exploration as done in Table 2. The Advisor 1 along with its initial value of 0 is used in the next sub-section for learning using the ADMIRAL-DM method.

Advisor Average cumulative reward (rounded to nearest 1000)

Maximum possible cumulative reward (adjusted for random exploration)

Average performance of the ADMIRALAE using a random advisor (rounded to nearest 1000)

Normalized value (rounded up to nearest ﬁrst decimal)

Advisor 1 58000 90000 -63000 0.8 Advisor 2 35000 90000 -63000 0.7 Advisor 3 -16000 90000 -63000 0.4 Advisor 4 -63000 90000 -63000 0

Table 3: Finding 0

0 using ADMIRAL-AE for the Pommeran Domain One Vs One against DQN.

5.3 Experimental Results - Function Approximation - ADMIRAL-DM

We now show that it is possible to extend our tabular ADMIRAL-DM method to function approxi-

mation based implementations that make our algorithms more generally applicable to environments with large state-action spaces. We will use the neural network-based version of ADMIRAL-DM as discussed in Section 3.4. Additionally, we show that ADMIRAL-DM is capable of outperforming several strong baselines from literature.

We perform comparative experiments in three domains. All our experiments in this section are repeated 30 times, and we plot the mean and standard deviation. The important elements of our experimental domains are mentioned here, while the complete details of the domains and

MULTI-AGENT ADVISOR Q-LEARNING

implementation details of all algorithms are in Appendix D. Neural network implementations of decision-making algorithms (ADMIRAL-DM, ADMIRAL-DM(AC)) are used in this sub-section. The ﬁrst domain we consider is Domain One Vs One of Pommerman introduced in Section 5.2. Our baselines are DQf D, CHAT, and DQN. We perform 50,000 episodes of training, where the algorithms train against speciﬁc opponents. Each episode is a full Pommerman game (lasting a maximum of 800 steps). All the algorithms relying on demonstrations (DQf D, CHAT, ADMIRAL-DM, and ADMIRAL-DM(AC)) use the Advisor 1 considered in Section 5.2. The probability of using the advisor action ( 0

t in Algorithm 1) starts from 0.8 (obtained from Table 3) and linearly decays to be close to zero at the end of training for both ADMIRAL-DM and ADMIRAL-DM(AC). To provide data for ofﬂine pretraining in the case of DQf D, two instances of Advisor 1 is used to play many Pommerman games that generate the required data. The DQf D is pretrained with all of this data, before entering the training phase of our experiments.

(a) ADMIRAL-DM vs DQN

(b) ADMIRAL-DM vs DQf D

(c) ADMIRAL-DM vs CHAT

(d) ADMIRAL-DM vs ADMIRAL-DM(AC)

(e) Faceoff against ADMIRAL-DM

Figure 5: Pommerman competition against ADMIRAL-DM. The ADMIRAL-DM beats all baselines in the execution phase. In the training phase ADMIRAL-DM(AC) performs better than ADMIRALDM in a head-to-head challenge.

After the training phase, the trained algorithms enter a face-off competition of 10,000 games where there is no more training, no further exploration and additionally ADMIRAL-DM and ADMIRAL-DM(AC) play without any advisor inﬂuence. ADMIRAL-DM(AC) is a CTDE technique,

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

which only performs decentralized execution in face-off using the trained actor-network. We plot the cumulative rewards in the training phase (Figure 5 (a), (b), (c), (d)), from which it can be seen that ADMIRAL-DM s performance is better than the baselines (DQN, DQf D, and CHAT). The face-off plots in Figure 5(e) show that ADMIRAL-DM wins more games on average against all the other baselines, showing its dominance. DQf D relies on pretraining, which is harder in MARL, as the nature of opponents that an agent will face during competition is impossible to determine upfront. The algorithms that use online advisors to give real-time feedback (that captures the changing nature of the opponent) tend to do better. DQf D has also been previously reported to have over-ﬁtting issues (Gao et al., 2018), which is likely to hurt its performance more in multi-agent environments as compared to single-agent environments. In multi-agent environments, it is more important to be able to generalize to unseen dynamic opponent behaviour, which is different from that seen in pre-collected demonstration data. As discussed previously, CHAT maintains a conﬁdence measure on the advisor, which depends on the advisor s consistency in action recommendations at different states. In MARL, this measure is not completely reliable, since even good advisors may need to formulate stochastic action recommendations as responses to the opponent. DQN, on the other hand, learns directly from interaction experiences and cannot learn from advisor inputs. This is a disadvantage in environments where external sources of knowledge, such as advisors, are available to be leveraged. Furthermore, since our baselines are independent algorithms (that consider opponents to be part of the state), they lose out to ADMIRAL-DM, which explicitly tracks opponent action. ADMIRAL-DM loses to ADMIRAL-DM(AC) during training (Figure 5 (d)). Though ADMIRAL-DM(AC) shows slower learning overall (as it is training both actor and critic), it ultimately learns a higher performing policy. One important reason is that the actor-critic method trains a stochastic policy that can explore naturally, whereas the Q-learning method needs a hyperparameter to conduct a forced exploration ( -greedy). Another reason could be that ADMIRAL-DM(AC) learns from each recent experience, while the ADMIRAL-DM has delayed learning using the replay buffer. However, in the face-off, ADMIRAL-DM has an edge over ADMIRAL-DM(AC) (Figure 5(e)), probably due to being centralized. Since the performance of ADMIRAL-DM and ADMIRAL-DM (AC) in the face-off results given in Figure 5(e) are close, we perform a Fischer s exact test for the average performances to check statistical signiﬁcance. We get p < 0.03 which shows that this result is statistically signiﬁcant (we treat p < 0.05 as statistically signiﬁcant as in common practice).

Next, we conduct similar experiments with ADMIRAL-DM(AC) that show it outperforms the baselines. The ADMIRAL-DM(AC) is explicitly compared to all the baselines in a training and face-off scheme similar to that done with ADMIRAL-DM. To recall, we perform training experiments of 50,000 full Pommerman games and face-off contests of 10,000 games, where the trained agents compete against each other without any further training or advisor inﬂuence. The results are plotted in Figure 6, where the ADMIRAL-DM(AC) shows better performance than the baselines (in both training and face-off phases). In training, ADMIRAL-DM(AC) dominates all the other baselines by winning around 20,000 30,000 games out of the 50,000 games conducted. As observed in the previous experiments, the ADMIRAL-DM(AC) algorithm s learning is slower than that of the Q-learning variant, making the overall number of games won (captured by cumulative rewards), against the baselines to be lower than that of the corresponding training of ADMIRAL-DM against the baselines in Figure 5. In the face-off contests, the ADMIRAL-DM(AC) algorithm wins more than 50% of the games against all baselines except ADMIRAL-DM. As noted previously, the ADMIRALDM algorithm has a slight edge in performance over that of ADMIRAL-DM(AC) in the face-off stage.

MULTI-AGENT ADVISOR Q-LEARNING

(a) ADMIRAL-DM(AC) vs DQN

(b) ADMIRAL-DM(AC) vs DQf D

(c) ADMIRAL-DM(AC) vs CHAT

(d) Faceoff against ADMIRAL-DM(AC)

Figure 6: Pommerman competition against ADMIRAL-DM(AC). ADMIRAL-DM(AC) defeats all other baselines in both the training and execution phases.

Next, we use two cooperative domains from the Stanford Intelligent Systems Laboratory (SISL) (Gupta et al., 2017). These experiments have two phases training and execution. The algorithms train for 1000 games in the training phase and then enter an execution phase, where they execute the trained policy for 100 games. We choose to set the value of advisor inﬂuence 0

t to 0.8 at the start of training and linearly decay it the same way as in the above experiments with Pommerman (since we are using good advisors). In the execution phase, there is no further exploratory actions for all algorithms, and no more inﬂuence of the advisor for ADMIRAL-DM and ADMIRAL-DM(AC).

The ﬁrst SISL environment is a Pursuit environment, which contains 8 pursuers controlled by learning algorithms, trying to capture 30 evaders moving randomly in the grid-based environment. Rewards have a local structure, where the pursuers participating in the capture of an evader or the pursuers encountering evaders are rewarded individually. The game is general-sum and does not have a global reward structure. The local reward structure helps to tackle the issue of credit assignment. We use a pre-trained policy of DQN (trained for 1000 episodes) as the advisor. We plot the reward

obtained (averaged per agent) for the training and execution phases in Figures 7(a) and (b). The execution performance bars in Figure 7(b) is the average performance across the 100 execution games. The results show that ADMIRAL-DM has better performance than all other baselines, including DQN used as the advisor, in both phases. Thus, our algorithm can ultimately outperform the advisor. This environment is highly non-stationary (due to having more number of learning agents), so completely centralized ADMIRAL-DM has an edge over ADMIRAL-DM(AC) which uses decentralized actors.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

(a) Pursuit-Training

(b) Pursuit-Execution

(c) Waterworld - Training

(d) Waterworld - Execution

Figure 7: SISL Environments - Training and Execution. The ADMIRAL algorithms give a better performance than all the baselines in both the phases. Training graphs have been smoothed with a running average of 100.

Since some performances in Figure 7(b) are close, we perform an unpaired 2-sided t-test for statistical signiﬁcance. Regarding the performances of CHAT and ADMIRAL-DM(AC) we get a value of p < 0.02 and similarly for the performances of ADMIRAL-DM and ADMIRAL-DM(AC) we get a p < 0.02, which shows that both these comparisons are statistically signiﬁcant.4

Our second SISL environment is the continuous action space Waterworld environment, which has 5 pursuer agents trying to consume food and avoid poison. The actions are continuous-valued thrust, that the agents can apply to move in a particular direction, and with a desired speed. Here, multiple pursuer agents need to work together to consume food. Agents get rewards based on foods captured and punishments based on poison consumed. Rewards have a local structure similar to the Pursuit environment. The advisor here is a pre-trained proximal policy optimization (PPO) (Schulman et al., 2017) agent (trained for 1000 episodes). Similar to the Pursuit environment, we plot the performances for both training and execution phases (averaged per agent) in the Waterworld environment (see Figures 7(c) and (d)). The ADMIRAL-DM(AC) alone is used for these experiments, as the Q-learning variant is not applicable for continuous action spaces. We use two popular RL algorithms for continuous control, PPO and deep deterministic policy gradients (DDPG) (Lillicrap et al., 2016) as baselines. The results show that ADMIRAL-DM(AC) has better performance than others in both phases (Figure 7(c) and (d)). The ADMIRAL-DM(AC) algorithm has two important advantages over the other baselines here. The ﬁrst advantage is that it is capable of leveraging

4. We consider p < 0.05 as statistically signiﬁcant.

MULTI-AGENT ADVISOR Q-LEARNING

an advisor. The second advantage is that ADMIRAL-DM(AC) is trained in a centralized fashion by tracking the opponent behaviour while the other algorithms are independent methods. Still, ADMIRAL-DM(AC) is decentralized in execution. Notably, the ADMIRAL-DM(AC) algorithm also improves upon PPO, used as the advisor, similar to our observation in the Pursuit environment.

In both the above SISL experiments, we see a small improvement in performance in the execution phases for both algorithms, ADMIRAL-DM and ADMIRAL-DM(AC), as compared to the ﬁnal training performances. This is due to the fact that, at the end of the training, there is still a small amount of exploration and advisor inﬂuence (1 % of actions) involved, whereas during execution both these inﬂuences are completely removed, which contributes to a net improvement in performance.

To summarize, our experimental results show that the ADMIRAL-DM and ADMIRAL-DM(AC) algorithms make the best use of advisors in multi-agent settings compared to the other state-of-the-art algorithms. After the advisor inﬂuence completely stops, the performance of ADMIRAL-DM and ADMIRAL-DM(AC) is better than the others. We have also demonstrated that our methods can be extended to continuous action spaces and work in decentralized environments using the popular CTDE technique.

5.4 Performance of ADMIRAL-DM Under The Inﬂuence Of Different Advisors

Next we study the impact of using the ADMIRAL-AE in a pre-learning phase to determine the value of 0

0. Towards the same, we would like to use an algorithm to serve as a common opponent. We choose to use a different algorithm compared to the baselines considered in the previous sub-section (where the objective was to show better performances of ADMIRAL-DM as compared to these baselines). The algorithm we choose to use as the opponent is Deep Sarsa, which is similar to DQN but uses a Sarsa-like (Sutton and Barto, 1998) Bellman update for the Q-values. We clarify that our objective in this section is not to show better performances against any baseline (which we have already done in Section 5.3).

Further, in this section, we provide additional experiments that evaluate ADMIRAL-DM on different advisors with the common opponent (Deep Sarsa) and show that ADMIRAL-DM is capable of recovering from bad action-advice. All results reported in this section use averages and standard deviations of 30 runs.

We describe two sets of experiments with two different domains of Pommerman. In the ﬁrst set of experiments, we use the neural network implementation of ADMIRAL-DM and ADMIRAL-AE on the Domain One Vs One of Pommerman using the four different advisors introduced in Section 5.2. To recall, Advisor 1 is the best advisor who can give the best action (relative to other advisors) at all states and Advisor 4 is the worst. The quality of advisors reduces from Advisor 1 to Advisor 4. As mentioned, we use a common opponent as an agent playing Deep Sarsa.

First, we wish to evaluate the given advisors against the performance of Deep Sarsa in the pre-learning phase. To do this, we run a series of training experiments, where we implement

ADMIRAL-AE using each of the four advisors against Deep Sarsa. The results are plotted in Figure 8. As observed earlier, the best advisor leads to the best overall performance and the worst advisor leads to the worst performance. Using these performances the 0

0 values are determined in Table 4 (using Eq. 7). It can be seen that the suggested value of 0

0 is highest (0.9) for the best advisor and the smallest (0) for the worst advisor. These values of 0

0 show that the agent will listen more to the good advisors and listen less (or not at all) to the bad ones.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

(a) Advisor 1

(b) Advisor 2

(c) Advisor 3

(d) Advisor 4

Figure 8: Results in Domain One Vs One of Pommerman using different advisors with ADMIRAL-AE and Deep Sarsa. The best advisor (Advisor 1) gives the highest overall performance while the worst advisor (Advisor 4) gives the lowest performance.

Advisor Average cumulative reward (rounded to nearest 1000)

Maximum possible cumulative reward (adjusted for random exploration)

Average performance of agent using a random advisor (rounded to nearest 1000)

Normalized value (rounded up to nearest ﬁrst decimal)

Advisor 1 63000 90,000 -54000 0.9 Advisor 2 34000 90,000 -54000 0.7 Advisor 3 -16400 90,000 -54000 0.3 Advisor 4 -54000 90,000 -54000 0

Table 4: Finding an initial value for 0

0 using ADMIRAL-AE for the Pommeran Domain One Vs One against Deep Sarsa.

Next, we run the ADMIRAL-DM algorithm against Deep Sarsa, using each of the four advisors. We use four initial values of 0

0 for each of the advisors, where one of these values corresponds to the choice of 0

0 as obtained from our previous experiment with ADMIRAL-AE reﬂected in Table 4. In addition to these four values, we also consider a value of 0 for 0

0, which considers the performance

MULTI-AGENT ADVISOR Q-LEARNING

of ADMIRAL-DM with no advisor inputs to serve as a baseline. As done previously, the value of 0 is decayed linearly, during training, for ADMIRAL-DM in all the experiments. All training is conducted for 100,000 episodes with the advisor inﬂuence ( 0) being linearly decayed to 0 at 50,000 episodes, i.e. there is no advisor inﬂuence after 50,000 episodes. Each episode is a complete Pommerman game involving a maximum of 800 steps as in the previous experiments. More details of the game conditions and the advisors can be found in Appendix D. The results showing the performance of ADMIRAL-DM in each of these experiments are presented in Figure 9.

(a) Advisor 1

(b) Advisor 2

(c) Advisor 3

(d) Advisor 4

Figure 9: Results in Domain One Vs One of Pommerman using different advisors with ADMIRALDM and Deep Sarsa. The result plots show the performance of ADMIRAL-DM in the training against Deep Sarsa. The results show that 0

0 value obtained from the performance of ADMIRAL-AE in Table 4 show either the best performance or is very close to the best possible performances amongst all the 0

We highlight several observations. Figure 9(a) and (b) show that ADMIRAL-DM, using the good advisors (Advisor 1 and Advisor 2), achieves the maximum overall performance since the positive inﬂuence from the good advisors helps. However, if the advisor inﬂuence is limited ( 0 = 0.3), the good advisors only have a limited impact. As the value of 0

0 increases, we see that the performance of ADMIRAL-DM using the ﬁrst two advisors improves (Figures 9(a) and (b)), as expected. Since Advisor 1 is even better than Advisor 2, we see from Figures 9(a) and (b) that ADMIRAL-DM using Advisor 1 clearly shows superior performance to that of Advisor 2 for the highest value of 0

0, 0.9. When 0

0 values were lower (such as 0.5), performance using both Advisor 1 and Advisor 2 are comparable since these advisors did not have many opportunities to make an impact. Notably, ADMIRAL-DM using Advisor 1 shows the best performance for 0 = 0.9, the value obtained from

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

(a) One Vs One - 0 = 0.3

(b) One Vs One - 0 = 0.5

(c) One Vs One - 0 = 0.7

Figure 10: Results of ADMIRAL-DM vs Deep Sarsa using the Advisor 4, which is the worst advisor among the ones considered. The plots correspond to the One Vs One domain. All the ﬁgures show that, ADMIRAL-DM is capable of recovering from bad action advice. Greater inﬂuence of the bad advisor (larger 0

0) leads to a larger time needed for recovering from the bad action inﬂuence.

ADMIRAL-AE as seen in Table 4. ADMIRAL-DM using Advisor 2 almost provides the same performances for the highest values of 0

0, 0.7 and 0.9. Additional inputs from this advisor are not as useful (compared to Advisor 1), since it is weaker. Hence, a value of 0 = 0.7 as obtained from ADMIRAL-AE is sufﬁcient for this advisor.

Turning our attention to the performance of ADMIRAL-DM with the third advisor, we ﬁnd that it is signiﬁcantly inferior compared to the other two advisors, yet still has a limited positive inﬂuence (Figure 9(c)). Furthermore, the performance using this advisor is considerably better than using no advisor at all ( 0 = 0). However, while using Advisor 3, we notice that, as the values of 0

0 increases, there is no appreciable improvement in performance. This shows that more inﬂuence of a comparatively less effective advisor does not lead to much improvement in performance. Again, the value suggested by ADMIRAL-AE (0.3), comes very close to the best possible performance with other values of 0

0. The performance of ADMIRAL-DM using the last advisor (Advisor 4) is interesting. This advisor has a negative inﬂuence on learning and makes ADMIRAL-DM lose for the ﬁrst few episodes (Figure 9(d)). However, ADMIRAL-DM recovers after the advisor inﬂuence wanes in all cases. As the value of 0

0 increases, we see that further inﬂuence from the bad advisor is detrimental and a larger number of episodes is required before ADMIRAL-DM shows signs of recovery from the poor advice. Hence, the best value of 0

0, in this case, is 0, since listening to this advisor only

MULTI-AGENT ADVISOR Q-LEARNING

harms learning. Again, this was the value obtained for Advisor 4, in Table 4. We present a more elaborate set of results on the experiments with Advisor 4 in Figure 10. Here we show that for all cases of 0

0, ADMIRAL-DM is capable of recovering from bad action-advising and after a suitable number of episodes, can overtake the performance of Deep Sarsa. However, higher values of 0

0 makes the learning from the bad advisor problematic, since while ADMIRAL-DM shows signs of recovery it still cannot overtake the cumulative performance of Deep Sarsa even after 300,000 episodes (Fig: 10(c)).

(a) Advisor 1

(b) Advisor 2

(c) Advisor 3

(d) Advisor 4

Figure 11: Results in Domain Two Vs Two of Pommerman using different advisors with ADMIRALAE and Deep Sarsa. The best advisor (Advisor 1) gives the highest overall performance while the worst advisor (Advisor 4) gives the lowest performance. The standard deviation of (a) and (d) are negligible.

We now provide a brief description of another domain of Pommerman. Domain Two Vs Two is a larger version of Pommerman, where there are a total of four agents, with two of the four belonging to the same team. The state space is much larger than Domain One Vs One, with each state containing 372 elements. The reward function remains sparse, with the two agents belonging to the winning team getting +1 and the two agents belonging to the losing team getting -1 at the end of the game. In case of a draw, all the agents get -1. In Domain Two Vs Two, we consider one team of Deep Sarsa and one team of the ADMIRAL-DM. This makes this domain harder than the Domain One Vs One, as the agents must learn to cooperate amongst the members of the same team and compete against the members of the opponent team to win the game. The Domain Two Vs Two is a mixed competitive-cooperative domain, which is different from Domain One Vs One that had only pure competition. We use the same four advisors as considered before for the Domain Two Vs Two.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Similar to the experiments with Domain One Vs One, we ﬁrst evaluate the advisors against Deep Sarsa using the ADMIRAL-AE algorithm. The results are presented in Figure 11. Again, the best advisor gives the maximum overall performance and the worst advisor results in the minimum performance. The 0

0 values are determined based on these results in Table 5 (using Eq. 7). These values are used in further experiments using ADMIRAL-DM and Deep Sarsa in the Domain Two Vs Two of Pommerman.

Advisor Average cumulative reward (rounded to nearest 1000)

Maximum possible cumulative reward (adjusted for random exploration)

Average performance of agent using a random advisor (rounded to nearest 1000)

Normalized value (rounded up to nearest ﬁrst decimal)

Advisor 1 79000 90000 -81000 1 Advisor 2 39000 90000 -81000 0.7 Advisor 3 -21000 90000 -81000 0.4 Advisor 4 -81000 90000 -81000 0

Table 5: Finding 0

0 using ADMIRAL-AE for the Pommeran Domain Two Vs Two against Deep Sarsa.

Our next set of experiments train in the Domain Two Vs Two with each agent in one team of Pommerman agents playing ADMIRAL-DM and each agent in the other team playing Deep Sarsa. We consider four different initial values of 0 (where one of these values will correspond to the value

obtained in Table 5), as we did in the One Vs One domain. In addition to these 0

0 values, we continue to use the value of 0

0 = 0 as the baseline. This corresponds to the situation of ADMIRAL-DM learning with no advisor inﬂuence. All training is run for 100,000 episodes, with the value of 0

decaying linearly to 0 at 50,000 episodes. Thus, there is no more inﬂuence from advisors for the last 50,000 episodes of training. The results showing the performance of the team playing ADMIRALDM (in the competition against a team playing Deep Sarsa) are in Figure 12. The ADMIRAL-DM performances show similar characteristics to that seen in Domain One Vs One. ADMIRAL-DM, using the best advisor (Advisor 1), shows the best performance for all values of 0

0 and its performance keeps improving with the increase in value of 0

0 (refer Figure 12(a)). Notably, comparing Figure 12(a) and Figure 9(a), the performance of ADMIRAL-DM using the best advisor is better in the case of Domain Two Vs Two as compared to the Domain One Vs One. This suggests that a good advisor has comparatively higher impact when tasks get harder. This is because there are far more strategies that need to be learned to do well in this domain and the opponent learning from scratch needs more time for learning the hard task. A good advisor, on the other hand, can teach the different strategies needed much faster and provide an early lead for ADMIRAL-DM. Similar observations apply to the performances of ADMIRAL-DM using Advisor 2 as well (refer Figure 12(b) and Figure 9(b)).

For Advisor 3, the performance does not change much with varying values of 0

0 (refer Figure 12(c)) as observed in the case of Domain One Vs One. Comparing Figure 12(c) and Figure 9(c), we note that the best performance of ADMIRAL-DM using Advisor 3 is better for the Domain Two Vs Two as compared to Domain One Vs One, reinforcing our inferences earlier about a higher potential for impact in useful advisors when the tasks get harder. Regarding Advisor 4,

MULTI-AGENT ADVISOR Q-LEARNING

(a) Advisor 1

(b) Advisor 2

(c) Advisor 3

(d) Advisor 4

Figure 12: Results in Domain Two Vs Two of Pommerman using different advisors with ADMIRALDM and Deep Sarsa. The result plots show the performance of team playing ADMIRAL-DM in training competitions against Deep Sarsa. For this domain too, 0

0 value obtained from the performance of ADMIRAL-AE in Table 5 show either the best performance or is very close to the best possible performances amongst all the 0

the greater negative inﬂuence from this advisor necessitates a longer time for recovery (refer Figure 12(d)).

Again, from Figure 12, for all the four advisors, we observe that the value for 0

0 as obtained from Table 5 gives the best possible performance or comes quite close to the best possible performance, compared to other possible values of 0

0. This shows the advantage of evaluation using ADMIRALAE. In Figures 13(a), (b) and (c), ADMIRAL-DM shows signs of recovery for all values of 0

0 while using Advisor 4 for learning. However, for Advisor 4, the smaller the value of 0

0 the better. When making a comparison between Figure 10 and Figure13, we note that, as the negative inﬂuence from the advisor increases (through a larger 0

0) the time needed for recovery also rises in the case of Domain Two Vs Two and is greater than that needed in the Domain One Vs One. As the complexity of the tasks increase, the agents need a lot more time to learn good policies that recovers the loss of performance from bad action-advice. This shows that negative inﬂuence from an advisor is more costly in the case of harder MARL tasks/environments as against comparatively simpler environments.

From our experiments, we conclude that ADMIRAL-DM is capable of recovering from bad advisor recommendations and is able to suitably leverage advisors who have some positive inﬂuence on learning. However, if possible, it is best to evaluate an advisor using the ADMIRAL-AE method

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

(a) Two Vs Two - 0 = 0.3

(b) Two Vs Two - 0 = 0.5

(c) Two Vs Two - 0 = 0.7

Figure 13: Results of ADMIRAL-DM vs. Deep Sarsa using the Advisor 4 in the Two Vs Two domain. Similar to the ﬁrst domain, all the ﬁgures show that ADMIRAL-DM is capable of eventually recovering from bad action advice.

and obtain a suitable initial value for the hyperparameter that determines the advisor inﬂuence ( 0

0). This helps in learning good policies faster.

5.5 Summary

To summarize, in Section 5.1 we showed an experimental illustration of our algorithms in a tabular domain. We showed that, while using ADMIRAL-AE, the best advisor gives the best overall performance. Further, ADMIRAL-AE provides a suitable value for the hyperparameter 0

0, which when used by ADMIRAL-DM subsequently, provides good performances with different types of advisors. We also provided an experimental illustration of our theoretical convergence results in the case of both ADMIRAL-DM and ADMIRAL-AE. In Section 5.2 we provided an illustration of ADMIRAL-AE in a large environment with neural networks as function approximators. Again, we illustrated that using ADMIRAL-AE with the best advisor provides the best performance amongst other (comparatively worse) advisors. Obtaining an appropriate value for 0

0 from Section 5.2 we showed that ADMIRAL-DM and ADMIRAL-DM(AC) provide better performances than a set of baselines in Section 5.3. We tested our algorithms in both competitive and cooperative domains, as well as settings with discrete and continuous action spaces.

In Section 5.4, we used two Pommerman domains to illustrate that using ADMIRAL-AE to obtain a suitable value for 0

0, provides the best performance for ADMIRAL-DM. Hence, when possible, it would be best to use ADMIRAL-AE for determining 0

0 using a pre-learning phase. Also,

MULTI-AGENT ADVISOR Q-LEARNING

we illustrated that ADMIRAL-DM is capable of recovering from bad action advice from advisors if appropriate values for 0

0 cannot be determined before training ADMIRAL-DM.

Additionally, in Appendix C we show another important advantage of using the principled method of ADMIRAL-AE for advisor evaluation in environments having dynamically learning and adapting advisors. We show that a principled method like ADMIRAL-AE would ﬁnd a suitable value for 0

0 when it is presented with a learning advisor, where other methods based on simple heuristics may have a high chance of failure.

6. Conclusion

In this paper, we introduced the problem of learning under the inﬂuence of external advisors in MARL. We provided a principled framework for MARL algorithms learning to use advisors. Using Q-learning based methods, we proposed two MARL algorithms for this problem. We conducted theoretical analyses of these algorithms, establishing conditions under which ﬁxed point guarantees can be provided regarding their learning in general-sum stochastic games. We proved that previous theoretical results can extend to this setting under a comparatively weaker set of assumptions than previously considered. Empirically, we showed that our algorithms can be scaled to domains with large state-action spaces using traditional function approximators like neural networks. We also introduced an additional actor-critic variant of our ADMIRAL-DM algorithm that can operate under the CTDE paradigm and can learn in environments with continuous action spaces. Our empirical results further established the superiority of our algorithms compared to standard baselines. Furthermore, we have shown that our methods would be useful in a wide variety of problems and that the algorithms can recover from the inﬂuence of weak/bad advisors during learning.

While there is a rich body of literature on the use of external knowledge sources in single-agent RL (Bignold et al., 2021), MARL provides additional challenges which mean that not all the results and approaches can transfer over directly. We discussed the important problems of directly using the single-agent based methods that learn from external sources in MARL. Additionally, we performed direct comparison experiments to elucidate a few of these problems. Our approach to learning from advisors in MARL may look more complex compared to other single-agent approaches, however, the non-stationarity of the environment makes learning under the inﬂuence of advisors in MARL considerably more challenging. In MARL, quick adaptation to the changing environment is the key to better performance (Littman, 2001). Our approach of using an online advisor is a more appropriate formulation of advisors in MARL, as real-time feedback against non-stationary opponents are critical for learning effective multi-agent policies, as demonstrated in our experiments.

Importantly, we consider a general setting, where we had no restrictions on the type or quality of the advisor, and no restrictions on the relation between the reward functions of different agents (general-sum). Particularly in MARL, the assumption of optimal advisors could be overly strong, since performance depends on the nature of opponents. The advisor could be capable of providing good feedback in strategizing against a particular class of opponents yet be useless against another class of opponents. Furthermore, a sub-optimal advisor could be good only in a very narrow portion of the state space, which is still useful for an agent learning from scratch in this environment. By explicitly allowing nonrestrictive sub-optimal advisors, our work is more widely applicable than previous methods that make an assumption of optimal (or near-optimal) experts to help RL training (Ross et al., 2011; Giusti et al., 2015; Sonabend et al., 2020).

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

From the empirical perspective, as future work, we would like to study the performance of ADMIRAL algorithms under the simultaneous inﬂuence of multiple advisors providing conﬂicting demonstrations. This problem has been studied in single-agent RL environments (Li et al., 2019), but not yet in the MARL context. There is an emerging line of work that studies the possibility of multiple agents learning from peers in cooperative MARL settings (Omidshaﬁei et al., 2019). Our paper has the potential to contribute to this line of work as well. Further, in this paper, we provided a simple technique of making use of the evaluation of advisors in a learning algorithm by setting the value of 0

0. More sophisticated ways of analyzing the performance of ADMIRAL-AE and using the results for learning faster and more effective decision-making policies is left to future work. An observation about ADMIRAL-AE is that, in MARL, the advisors can be used as a way to predict the behaviour of other agents as well, which is not relevant in single-agent settings. In MARL, each agent needs to have the ability to perform accurate opponent modelling, based on its observations, to obtain strong performances (Hernandez-Leal et al., 2019). This is because the reward function and the transition dynamics depends on the joint action at each state. Previous methods have used several techniques, such as using a separate neural network for learning opponent behaviour (He and Boyd-Graber, 2016), learning policy features from raw observations (Hong et al., 2018), and using the agent s own policy to predict opponent actions (Raileanu et al., 2018). However, many of these methods are computationally expensive and scale poorly with the number of states, actions, and agents. Another possibility for opponent modelling is leveraging an external advisor that can possibly predict opponent behaviour, as done in ADMIRAL-AE, which could be relatively computationally friendly given the availability of an appropriate advisor. This could open up a very interesting research direction in learning from advisors in MARL. Both our ADMIRAL-DM and ADMIRAL-AE algorithms are exponential in the number of agents as described in our complexity analysis (Section 3). Hence, our algorithms are not easily scalable to environments with a large number of agents which is one limitation of our framework. As future work, our framework could be combined with works on mean ﬁeld games Lasry and Lions (2007) which can make the approach more scalable (since mean ﬁeld methods guarantee a constant dependence on the number of agents).

From the theoretical perspective, as future work, we would like to fully characterize the convergence rates of our algorithms. Additionally, some of our theoretical assumptions for the environmental settings may seem restrictive, however, we assert that this work is the ﬁrst to provide a theoretical foundation for MARL with advisors and that these assumptions are useful in understanding the strengths and limitations of such an approach. Furthermore, we note that the assumptions we make are also made by other works exploring the foundations of MARL, such as Hu and Wellman (2003). In future work, we wish to explore the ramiﬁcations of relaxing some of these assumptions. The theoretical understanding of MARL, in general, is still in its infancy and much more research into MARL theory is required to enhance our understanding of this area (Zhang et al., 2019). In this paper, we restrict the theoretical analysis to tabular settings, which is in line with the state-of-the-art in theoretical analysis of learning in general-sum stochastic games (Zhang et al., 2019). The objective is to provide a theoretical guarantee in the most basic (baseline) setting possible. Using a similar approach to single-agent RL methods that extend the tabular results to the function approximation setting (Carvalho et al., 2020), it would be possible to extend our theoretical results in this paper to the function approximation setting as well. An elaborate theoretical study of this is left to future work.

Two speciﬁc motivational applications for our work were discussed in Section 1. Similar to these examples, other domains such as managing natural disasters, controlling disease outbreaks, and

MULTI-AGENT ADVISOR Q-LEARNING

learning to play multi-player/multi-team sports would also ﬁnd beneﬁt from our approach. These domains also have state-of-the-art advisor models. These models, while not optimal, are still used in practice. Due to the inherent poor sample efﬁciency of MARL methods, it becomes critical to use all available sources of knowledge judiciously. Hence, learning from fallible advisors is important for many real-world domains that typically have the structure of multiple agent interaction.

The framework we have introduced in this paper on using external information through advisors is expected to improve the sample complexity of MARL algorithms and is an important step towards making MARL methods usable in real-world environments. We expect that our paper will spark more work in the area of accelerating RL training using other available sources of knowledge (or advisors) and that it will interest a broad community of researchers in the area of RL, game theory, machine learning, and multi-agent systems.

Acknowledgements

Kate Larson is an afﬁliate of the Vector Institute, Toronto. The authors would like to thank Seyed Majid Zahedi at University of Waterloo for his comments on the paper draft. Resources used in preparing this research at the University of Waterloo were provided by the province of Ontario and the government of Canada through CIFAR, NRC, NSERC and companies sponsoring the Vector Institute. Part of this work has taken place in the Intelligent Robot Learning (IRL) Lab at the University of Alberta, which is supported in part by research grants from the Alberta Machine Intelligence Institute (Amii); a Canada CIFAR AI Chair, Amii; Compute Canada; Mitacs; and NSERC.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Appendix A. Proof For The Lemmas In The Main Paper

In this section, we will restate all the lemmas in the main paper with detailed proofs. No new lemmas are given in this section.

Lemma 1. Let us ﬁx an arbitrary positive constant C, an arbitrary w0, and a sequence . Then provided that

(i) yt(w0, ) converges to some point (independent of t) D (ii) The sequence converges to 0 in the limit (t ! 1) The homogeneous process xt(w0, ) converges to a point 1

ˆβD w. p. 1, where ˆβ, satisfying 0 < ˆβ 1, is the scaling factor applied.

Proof. We state that

yt(w, ) = xt(dtw, ct t) (36)

for some sequences {ct} and {dt}, where ct = (ct0, ct1, . . . , cti, . . .), {ct} and {dt} satisfy 0, dt, cti 1, and cti = 1 if i t. Here, the product of the sequences ct and t is component wise: (ct t)i = cti i.

Note that yt(w, t) and xt(w, t) depend only on 0, , t 1. Thus, it is possible to prove Eq. 36 by constructing the appropriate sequence ct and dt.

Set c0i = di = 1 for all i = 0, 1, 2, . . .. Then Eq. 36 holds for t = 0. Let us assume that (ci, di) is deﬁned in a way that Eq. 36 holds for t. Let Bt be the scaling coefﬁcient of yt at step t + 1 (Bt = 1 if there is no scaling, otherwise 0 < Bt < 1 with Bt = C/||Gt(yt, t||). Now we have:

yt(w, t) = Bt Gt(yt(w, t), t) = Gt(Btyt(w, t), Bt t) = Gt(Btxt(dtw, ct t), Bt t).

We claim that

Bxt(w, t) = xt(Bw, B t) (38)

holds for all w, t and B > 0. For t = 0, this obviously holds. Assume that it holds for some time t. Then, from Eq. 13,

Bxt+1(w, t) = BGt(xt(w, t), t) = Gt(Bxt(w, t), B t) = xt+1(Bw, B t) (39)

yt+1(w, t) = Gt(xt(Btdtw, Btct t), Bt t), (40)

and we see that Eq. 36 holds if we deﬁne ct+1,i as ct+1,i = Btcti if 0 i t, ct+1,i = 1 if i > t and dt+1 = Btdt.

Thus, we get that with the sequences

j=i Bj, if i < t; ct,i = 1, otherwise; (41)

i=0Bi, (42)

MULTI-AGENT ADVISOR Q-LEARNING

Eq. 36 is satisﬁed for all t 0.

We know that yt(w, t) ! D w. p. 1. Since the process yt has been constructed by bounding the process with ||yt|| C, it follows that D C. Then, there exists a ﬁnite index M such that if t > M then

Pr(||yt(w, t)|| < C) > 1 δ (43)

without applying any more rescaling. Now, let us restrict our attention to those events ! for which ||yt(w, t(!))|| < C for all t > M without rescaling. Thus, we have no rescaling for all t, such that t > M. Thus, ct,i = c M+1,i for all t M + 1 and i, and speciﬁcally cti = 1 if i, t M + 1. Similarly, if t > M then dt+1(!) = M

i=0Bi(!) = d M+1(!). Let A! = {! : ||yt(w, )(!)|| < C} without rescaling. By Eq. 36, we have that if t > M then,

yt(w, t(!)) = xt(d M+1(!)w, c M+1(!) t(!)). (44)

Thus, it follows from our assumption concerning yt that xt(d M+1(!)w, c M+1 t(!)) converges to D almost everywhere (a. e.) on A! and, consequently, by Eq. 38, xt(w, c M+1 t(!)/d M+1(!)) converges to D0 = 1 d M+1 D a. e. on A!. Since c M+1 = 1 in the limit and we are analyzing in the space of A!, xt(w, t(!)/d M+1(!)) converges to D0 too. Now, since converges to 0 in the limit, and we are analyzing values in the space of A!, we can write xt(w, t(!)) xt(w, t(!)/d M+1(!))) which also converges to D0. All these hold with probability at least 1 δ, since, by Eq. 43 Pr(A! > 1 δ). Since δ was arbitrary, the lemma follows.

Lemma 2. Let X and Y be normed vector spaces, Ut : X Y ! X(t = 0, 1, 2, . . .) be a sequence of mappings, and t 2 Y be an arbitrary sequence. Let 1 2 Y and x1 2 X. Consider the sequences xt+1 = Ut(xt, 1), and yt+1 = Ut(y t, t) and suppose that xt and t converge to x1 and 1 respectively, in the norm of the appropriate spaces.

k be the uniform Lipschitz index of Uk(x, ) with respect to at 1 and, similarly, let LX

t satisfy the relations L

m = 0 where C > 0 is some constant and t = 0, 1, 2, . . . , then limt !1 ||yt x1|| = 0.

Proof. See Theorem 15 in Szepesvári and Littman (1999) for detailed proof.

Lemma 3. Let Z be an arbitrary set and consider the process

xt+1(z) = Gt(z)xt(z) + Ft(z)(C + kt(z)C)

where x1, Ft, Gt 0 are random processes, ||x1|| < C < 1 w. p. 1 for some C > 0, and z is an element in Z. Assume that for all k, limn !1 n

t=k Gt(z) = 0 uniformly in z w. p. 1 and Ft(z) = γ(1 Gt(z)), for some 0 γ < 1, and 8z 2 Z, w. p. 1. Also, kt(z) converges to K(z) in the limit. Then, xt(z) converges to a point D(z) = γ(C + K(z)C) w. p. 1.

Proof. Consider the process that is obtained from substituting the value of Ft(z) in terms of Gt(z) in Eq. 16,

xt+1(z) = Gt(z)xt(z) + γ(1 Gt(z))(C + kt(z)C). (45)

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Now, subtracting γ(C + kt(z)C) on both sides of the equation we get

xt+1(z) γ(C + kt(z)C) = Gt(z)(xt(z) γ(C + kt(z)C)). (46)

The above equation converges to 0 in the limit as limn !1 n

t=k Gt(z) = 0. Thus, we have that xt+1(z) γ(C + kt(z)C) converges to 0 in the limit. Hence, xt converges to γ(C + K(z)C), since kt(z) converges to K(z) in the limit.

Lemma 4. Consider an equation of the form

xt+1(z) = Gt(z)xt(z) + Ft(z)(||xt|| + t + kt(z)||xt||)

where the sequence t converges to zero w. p. 1. Assume that for all k, limn !1 n

t=k Gt(z) = 0 uniformly in z w. p. 1 and Ft(z) = γ(1 Gt(z)), for some 0 γ < 1, and 8z 2 Z, w. p. 1. Assume further that kt(z) is ﬁnite, and it converges to K(z) in the limit (t ! 1). Then xt(z) converges to a point represented by S0(z) = 1

ˆβ(γC1 + K(z)C1), where C1 is a small positive constant, w. p. 1

uniformly over Z. Here ˆβ is a scaling factor satisfying 0 < ˆβ 1.

Proof. Let us consider a process yt that is obtained from keeping the original process ||xt|| bounded by a constant C1. This is an arbitrary bound, with C1 speciﬁed to be a small positive constant. Since, ||xt|| is guaranteed to be positive, we can ﬁnd such a positive C1. Now we get,

yt+1(z) = Gt(z)yt(z) + γ(1 Gt(z))(C1 + t + kt(z)C1). (47)

By Lemma 2, yt converges to γC1 + K(z)C1 as the following bindings show X, Y := R, t := t, Ut(x, ) := Gt(z)x + γ(1 Gt(z))(C1 + kt(z)C1 + ), where z 2 Z is arbitrary. Then, LX

t = Gt(z) and L

t = γ(1 Gt(z)) satisfying the conditions of Theorem 2. We know that x in the expression Ut(x, ) converges to γC1 + K(z)C1 as proved in Lemma 3.

In Lemma 1, we proved that the original process will converge to a point 1

ˆβD, if the bounded

process converges to D. Here ˆβ is the scaling factor applied, satisfying the relation 0 < ˆβ 1. Using this result, now we get that the process represented by the Eq. 17 converges to a point S0(z) = 1

ˆβ(γC1 + K(z)C1) which is a constant for a given z. This gives an expression for the point S in Theorem 1.

Lemma 5. Let Ft be an increasing sequence of σ-ﬁelds, let 0 t and wt be random variables such that t and wt 1 are Ft measurable. Assume that the following hold w. p. 1: E[wt|Ft, t 6= 0] = A, E[w2

t |Ft] < B < 1, P1

t=1 t = 1 and P1

t < C < 1 for some B, C > 0. Then the process

Qt+1 = (1 t)Qt + twt

converges to A w. p. 1.

Proof. Refer to Lemma 4 in Szepesvári and Littman (1999) for detailed proof.

MULTI-AGENT ADVISOR Q-LEARNING

Lemma 6. Assume that t satisﬁes Assumption 2 and the mapping Pt : Q ! Q satisﬁes the condition that, there exists a scalar γ satisfying 0 γ < 1, a sequence λt 0 converging to zero w. p. 1, and a ﬁnite sequence kt(s) such that ||Pt Q Pt Q || = β||Q Q || + λt + βkt(s)||Q Q || for all Q, and all s 2 S. Assume further that, kt(s) converges to a ﬁnite point K(s) in the limit. Additionally, Q (s, a) = E[Pt Q (s, a)], then the iteration deﬁned by

Qt+1(s, a) = (1 t)Qt(s, a) + t[Pt Qt(s, a)]

converges to (Q S) w. p. 1, where S is as given in Theorem 1.

Proof. This lemma directly follows from Corollary 1 and Lemma 5.

Lemma 7. For a n-player stochastic game, E[Pt Q ] = Q where Q = (Q1

, . . . , Qn

Proof. Refer to Lemma 11 in Hu and Wellman (2003) for proof.

Lemma 8. A random iterative process

t+1(x) = (1 t(x)) t(x) + t(x)Ft(x)

where x 2 X, t = 0, 1, . . . , 1, converges to zero with probability one (w. p. 1) if the following properties hold:

1. The set of possible states X is ﬁnite. 2. 0 t(x) 1, P

t t(x) = 1, P

t (x) < 1 w. p. 1, where the probability is over the learning rates t.

3. || E{Ft(x)|Pt}||W K || t||W + ct, where K 2 [0, 1) and ct converges to zero w. p. 1. 4. var{Ft(x)|Pt} K(1 + || t||W )2, where K is some constant. Here Pt is an increasing sequence of σ-ﬁelds that includes the past of the process. In particular, we assume that t, t, Ft 1 2 Pt. The notation || ||W refers to some (ﬁxed) weighted maximum norm.

Proof. Refer to Theorem 1 in Jaakkola et al. (1994) for proof.

Lemma 9. Under Assumption 5, the Nash operator as deﬁned in Eq. 30 forms a contraction mapping with the ﬁxed point being the Nash Q-value of the game.

Proof. The detailed proof is given in Theorem 17 of Hu and Wellman (2003).

Appendix B. Additional Deﬁnitions For Theorem 1

We restate some deﬁnitions from Szepesvári and Littman (1999), needed for us in Theorem 1, to stay

self-contained.

Let us consider an arbitrary operator, T : B ! B, where B is a normed vector space with norm ||.||. Let T = (T0, T1, , Tt, ) be a sequence of random operators, Tt mapping B B to B.

Deﬁnition 10. Let F B be a subset of B and let F0 : F ! 2B be a mapping that associates subsets of B with the elements of F. If, for all f 2 F and all m0 2 F0(f), the sequence generated by the recursion mt+1 = Tt(mt, f) converges to Tf in the norm of B with probability 1, then we say that T approximates T for initial values from F0(f) and on the set F B. Further, we say that T approximates T on the singleton set f and the initial value mapping F0 : F ! B deﬁned by F0(f) = F0.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Deﬁnition 11. The subset F B is invariant under T : B B ! B if, for all f, g 2 F, T(f, g) 2 F. If T is an operator sequence as above, then F is said to be invariant under T if for all i 0, F is invariant under Ti.

Appendix C. ADMIRAL-AE Using An Adaptive Advisor

In this sub-section, we aim to provide an illustration of the behaviour of ADMIRAL-AE in the presence of an adaptive advisor. This advisor would actively change and adapt its strategies according to the changing opponent. The objective is to show that the ADMIRAL-AE algorithm will be able to capture the strength of the adaptive advisor, and hence there is merit in using a principled approach to evaluate an advisor. Also, we would like to discuss the advantages of keeping this evaluation method separate from another approach that aims to learn from the advisor. In this section, we clarify that we are only considering the pre-learning phase introduced in our paper, since our objective is to analyze the performance of ADMIRAL-AE with two different advisors. Results in this section use the average and standard deviation of 30 runs.

Figure 14: Two instances of ADMIRAL-AE along with an adaptive advisor and a non-adaptive advisor on Pommerman Domain One Vs One. The plot shows the performance of ADMIRAL-AE using the Advisor Non-Adaptive against the common opponent of DQN (orange line) and the performance of ADMIRAL-AE using the Advisor Adaptive against the common opponent of DQN (blue line). This plot shows that the ADMIRAL-AE using an adaptive advisor starts off week, but eventually surpasses the performance of the non-adaptive advisor. Thus, a principled method like ADMIRAL-AE can evaluate an adaptive advisor appropriately, while other non-principled approaches may have a high percentage of failure.

We will continue to use the two-agent version (Domain One Vs One) of Pommerman for this experiment. We consider two different advisors, namely, Advisor Non-Adaptive and Advisor Adaptive. The Advisor Non-Adaptive does not actively track opponent strategies or adapt to them. This advisor plays a balanced strategy choosing to be risk-seeking (actively laying bombs to kill the opponent) while the opponent is in proximity and choosing to be risk-averse (escaping from the enemy) otherwise. This advisor is reactive and only responds to the present position of the opponent, and does not attempt to model the opponent s nature actively. It has been observed that this is a relatively good strategy in Pommerman, particularly in the early stages of training (Meisheri et al., 2019). At this stage, an agent playing a relatively conservative strategy could wait for the opponent to make a mistake and kill itself. However, this strategy is not very strong and could lose out once the

MULTI-AGENT ADVISOR Q-LEARNING

opponent is well-trained. A well-trained opponent is less likely to make the mistake of killing itself, and there is the added possibility of the non-adaptive strategy becoming predictable, which could be ﬁgured out by the opponent. Hence, the Advisor Non-Adaptive will ﬁnd it hard to win games in the middle and later stages of training. On the other hand, the Advisor Adaptive uses an adaptive strategy that plays a risk-averse strategy when the enemy is risk-seeking, and a risk-seeking strategy when the enemy is risk-averse. This is a very strong strategy for winning in Pommerman (Zhou et al., 2018) but requires active modelling of the opponent. The Advisor Adaptive tracks the percentage of bombs played by the opponent to determine the nature of the opponent. In the initial few episodes (about 5000) the Advisor Adaptive s behaviour is close to random since it still does not have enough information to learn the nature of the opponent.

We implement a separate instance of the ADMIRAL-AE algorithm with both the advisors and plot the performance against a common baseline agent using DQN for learning. We run all the training for 100,000 episodes and plot the performances of both algorithms. The results are captured in Figure 14. The results show that ADMIRAL-AE using the Advisor Adaptive loses out in the beginning while it is still ﬁguring out the nature of the opponent. However, it soon shows a much stronger performance that surpasses the performance of the ADMIRAL-AE using the Advisor Non-Adaptive. ADMIRAL-AE using the Advisor Non-Adaptive, while showing good overall performance, does not quite reach the levels of the performance of ADMIRAL-AE using the Advisor Adaptive, due to the non-adaptive strategy possibly becoming predictable and prone to exploitation by the opponent, after the opponent has trained for a sufﬁcient number of episodes. This performance shows that an agent during learning should listen more to the Advisor Adaptive as compared to the Advisor Non-Adaptive for the best outcome. If the advisors were directly used for learning without being evaluated separately, then a learning agent would be prone to discarding Advisor Adaptive quickly due to its initial weak performance, while the agent would listen more to the Advisor Non-Adaptive. Yet we have seen that the opposite behaviour would actually be better using an implementation of ADMIRAL-AE with each of these advisors. This demonstrates that the evaluation would be prone to inaccuracy and inconsistency if it is combined with learning a policy. The analysis in this sub-section shows the advantage of using a principled evaluation algorithm (ADMIRAL-AE) for evaluating an advisor, especially when they have adaptive characteristics.

Appendix D. Experimental Details

In this section, we provide the experimental details for all the experiments. We have given detailed information about the advisors and reward functions for all the experiments. Hyperparameters for all algorithms are also provided.

D.1 Grid Maze Domain

The reward function is deﬁned in such a way that the agents get a +1 if any one of two agents reaches the goal, and they get a -1 if any one of the two gets to the pitfall. Both the agents get a +2 if both reach the goal at the same instant, and both the agents get a -2 if they reach a pitfall at the same instant. If one of the two gets to the goal and the other gets to the pitfall, they both still get a +1. In this environment, each agent obtains a local state (observation) which corresponds to the coordinates of the grid cell the agent is currently located. We use the joint observations of the two agents as the state in the environment for determining the Q-values in the experiments showing convergence in Section 5.1. This state is available to both agents. The other experiments simply use the observation

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

of the concerned agent in the Q-updates, in case of both ADMIRAL-DM and ADMIRAL-AE, to make the challenge harder.

The actions that the agents can take in this game are one of moving up, down, left or right. If the wall obstructs an action, then the agent will remain at the same spot.

All four advisors used are rule-based agents, where Advisor 1 follows high-quality rules at each grid cell, which enables the agent to move to the goal state and avoid the pitfalls. Advisor 2 on the other hand can only suggest the correct actions for the agents to reach the goal and escape the pitfall if the agent is right next to the goal or pitfall (within one step). It is not capable of giving the right action in the other parts of the grid. Thus, Advisor 2 cannot teach perfect coordination to obtain the large positive reward (+2). Advisor 3 can only suggest actions that make the agents move closer to the goal state, but cannot give accurate actions that avoid pitfalls. Advisor 4 is a random advisor which only suggests random actions.

The ADMIRAL-AE implementation for this experiment chooses to do the advisor actions with a probability of 50% ( 0 = 0.5). The action that maximizes the Q-value (greedy action) is chosen with probability 45% and a random action with probability 5% ( = 0.05) to satisfy Assumption 1. We use the previous actions of the other agents while determining actions at the current time step

t as done in the other algorithms. The ADMIRAL-DM algorithm simply follows the scheme in Algorithm 1 where the advisor suggestions are performed with decreasing probabilities and the algorithm becomes completely greedy (dependent only on the agent s own policy) after a ﬁnite number of episodes.

D.2 Pommerman Domain

All the advisors are based on the rule-based agent (called the simple agent) already provided in the Pommerman framework. It is well documented that the simple agent is hard to beat even for complex RL algorithms in the Pommerman environment (Resnick et al., 2018). We use four advisors for the ADMIRAL-AE experiments in Section 5.2 and 5.4, of which the ﬁrst advisor alone is used in the experiments that study the performance of ADMIRAL-DM and ADMIRAL-DM(AC) in Section 5.3. The ﬁrst advisor (Advisor 1) is the best of all the four advisors and has rules for all the different aspects of the game (escaping from enemies, collecting bombs, laying bombs, etc.). The second advisor (Advisor 2) has rules for moving away from danger and collecting powerups like life and bombs. This is a relatively conservative advisor which relies on staying safe and hoping for the opponent to make a mistake. Actually, this is a very effective strategy in the Pommerman game and therefore Advisor 2 is also a useful one. Advisor 3 only has rules for collecting powerups, but cannot teach any of the other strategies needed to win the Pommerman game. When there is no possibility of teaching any useful strategy based on the situation of the game (no nearly enemies, powerups, etc.), the ﬁrst three advisors provide pseudo-random strategies that encourage exploration of states which are relatively less visited. The fourth advisor (Advisor 4) is the weakest advisor of the four. It suggests random actions to the agent and does not contribute to the learning of the agent (rather, it harms learning).

Each new Pommerman game gives a randomized game map and there is a maximum of 800 steps for each game.

MULTI-AGENT ADVISOR Q-LEARNING

D.3 Pursuit Domain

The Pursuit domain was initially deﬁned in SISL (Gupta et al., 2017). We use the canonical domain implemented in the petting zoo environment (Terry et al., 2020). All the environmental parameters including the rewards are left as the same as the defaults in Gupta et al. (2017). For the training phase, the game has 30 evader agents and 8 pursuer agents, where the evader agents move randomly, and the pursuer agents are controlled by learning algorithms. The game is cooperative, with the pursuers receiving a reward of 5 for fully surrounding an evader (the evader is removed from the environment). The pursuers receive a reward of 0.01 for each time they touch an evader. The actions space is discrete with 5 actions, each action corresponding to moving to a neighbouring grid (including stay). Each episode is a full game with a maximum of 500 steps. Each agent in this environment is able to observe a grid of 7 7 centered around itself (the whole grid is 16 16). Since this environment is fully cooperative with homogeneous learning agents, all training (for each algorithm) is completely centralized (all agents train the same network).

D.4 Waterworld Domain

Waterworld is also a SISL environment with a set of 5 pursuers tasked with collecting food and

avoiding poison. We use the corresponding petting zoo environment (Terry et al., 2020). The environmental parameters are the same as that in Gupta et al. (2017). This is also a cooperative environment where multiple pursuers need to work together to capture food. The environment is a two-dimensional continuous space, where the action space is a continuous two-dimensional value representing the horizontal and vertical thrust that makes the agents move in particular directions with the desired speed. The local state (observation) of each agent consists of multiple sensor features and two other elements that indicate the collision of the agent with a food or poison respectively. In this environment, at least two agents need to attack a food together to capture it. There are a total of 5 food particles (not destroyed upon capture) and 10 poison particles in the environment. The agents get a reward of 10.0 for capturing food and a 0.01 for encountering food. Further, the agents have a thrust penalty of -0.5, and a penalty of -1.0 for encountering poison. The poison particles and food particles move in the environment with a speed of 0.01. Since this environment is also fully cooperative with homogenous learning agents, all training is completely centralized for this domain too.

D.5 Hyperparameters And Implementation Details

The hyperparameters for the baselines were chosen to be the same as those recommended by the respective papers. Some minor modiﬁcations were made due to performance and computational efﬁciency considerations.

Regarding the hyperparameters of DQf D, we set 1 106 as the demo buffer size and perform 50,000 mini-batch updates for pretraining. The replay buffer size is twice the size of the demo buffer. The N-step return weight is 1.0, the supervised loss weight is 1.0 and the L2 regularization weight is 10 5. The epsilon greedy exploration is 0.9. The discount factor is 0.99 and the learning rate is 0.002. The pretraining for DQf D comes from a data buffer related to a series of games where two rule-based agents (advisors) compete against each other. All other values are similar to that used in Hester et al. (2018).

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

The CHAT (Wang and Taylor, 2017) implementation uses a neural network for conﬁdence measurement (termed NNHAT in Wang and Taylor (2017)). The learning rate is 0.01, we use a discount factor of 0.9 and a ﬁxed exploration constant ( -greedy) of 0.9. We use the extra action variant of HAT (Taylor et al., 2011) in the CHAT implementation, as this gave the best performance in most of our comparative experiments. A neural network is used as the function approximator, as described in Mnih et al. (2015). The target net is replaced every 10 learning iterations. The conﬁdence threshold is set as 0.6 and the default action as action-0 . The mini-batch size is 32 and learning rate is = 0.01. The CHAT algorithm can directly use advisors in an online fashion similar to ADMIRAL-DM. We omit the rule summarization step of CHAT, and directly use the advisor, to make the performance as good as possible. The replay buffer size is 2 106.

The DQN and ADMIRAL-DM hyperparameters are the same as those mentioned for CHAT (as relevant). These algorithms also perform -greedy exploration with a constant value of 0.9 for . The advisor inﬂuence parameter ( 0

t in Algorithm 1) for ADMIRAL-DM and ADMIRAL-DM(AC) starts at a high value of 0.8 at the beginning of the training and linearly decays to 0.01 during training. The face-off and execution phases have this parameter to be 0 (no advisor inﬂuence). All other hyperparameters for DQN, ADMIRAL-DM and CHAT are similar to that used in Mnih et al. (2015). The ADMIRAL-DM(AC) has the critic learning rate set at 10 3 and the actor learning rate to be 10 5. The discount factor for ADMIRAL-DM(AC) is 0.9.

The ADMIRAL-AE algorithm in Section 5.2 chooses to do the advisor action with a probability of 50%. The action that maximizes the Q-value (the greedy action) is chosen with probability 45% and a random action with probability 5% to satisfy Assumption 1. This is the same as that in the tabular domain (Section 5.1). The hyperparameters of ADMIRAL-AE are the same as ADMIRAL-DM.

For the function approximation experiments, both ADMIRAL-DM and ADMIRAL-DM(AC) simply use the observed current actions of other agents for action selection instead of maintaining copies of policies of the other agents as speciﬁed in Algorithm 3 and Algorithm 5, since the opponents could possibly be using different algorithms (all agents are not always using the same algorithmic steps). This is also computationally more efﬁcient. The current actions of all other agents are either directly observed or provided by the game engine to each agent in all our experiments, to perform a joint action update.

The DDPG uses the learning rate of the actor as 0.001 and the critic as 0.002. The discount factor is 0.9. We use the soft replacement strategy with a learning rate of 0.01. The batch size is 32. The PPO implementation also uses the same batch size and actor and critic learning rates. All other values are similar to those used in Schulman et al. (2017) (PPO) and Lillicrap et al. (2016) (DDPG). For PPO, we used a single-thread implementation, which we found to be as good as the multi-threaded implementation for our experiments, and more computationally efﬁcient. This could be because the data correlations are already being broken by the multi-agent (non-stationary) nature of the domains.

For the Deep Sarsa implementation used in this paper, we follow almost the same steps as done in Algorithm 4, except that we do not have any advisors and hence the algorithm does not have the term 0

t in the implementation. As a consequence, the greedy action is selected with probability (1 t). Also, we are using an independent implementation, where the actions of the other agents are

not considered during action selection by the algorithms playing Deep Sarsa. The action is chosen only based on the current state. Deep Sarsa uses the same hyperparameters as ADMIRAL-DM. The values are either the same or closely match those considered in previous research (Mnih et al., 2015).

We use a set of 30 random seeds (1 30) for all training experiments and a new set of 30 random seeds (31 60) for all execution experiments.

MULTI-AGENT ADVISOR Q-LEARNING

D.6 Wall Clock Times

The experiments on the Grid Maze domain can just be run on the CPU and takes less than 20 minutes to complete.

The training for all the experiments on the Pommerman domain in Section 5.2 and Section 5.3 was run on a 2 GPU virtual machine with 16 GB GPU memory per GPU. The experiments take an average of 18 hours wall clock time to complete. We use Nvidia Volta-100 (V100) GPUs for all these experiments. The CPUs use Skylake as the processor microarchitecture.

The experiments on Pursuit and Waterworld domains in Section 5.3 were run on a virtual machine having the same conﬁguration containing 2 GPUs. These experiments take an average of 12 hours wall clock time to complete. The experiments on Pommeran in Section 5.4 took an average of 15 hours wall clock time to complete.

Abbeel, P. and Ng, A. Y. (2004). Apprenticeship Learning via Inverse Reinforcement Learning.

In Proceedings of the Twenty-ﬁrst International Conference of Machine Learning (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69, pages 1 9. ACM.

Amir, O., Kamar, E., Kolobov, A., and Grosz, B. J. (2016). Interactive Teaching Strategies for

Agent Training. In Proceedings of the Twenty-Fifth International Joint Conference on Artiﬁcial Intelligence, IJCAI, New York, NY, USA, 9-15 July 2016, pages 804 811. IJCAI/AAAI Press.

Atkeson, C. G. and Schaal, S. (1997). Robot Learning From Demonstration. In Proceedings of the

Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12, 1997, pages 12 20. Morgan Kaufmann.

Barrett, S., Rosenfeld, A., Kraus, S., and Stone, P. (2017). Making friends on the ﬂy: Cooperating

with new teammates. Artiﬁcial Intelligence, 242:132 171.

Bignold, A., Cruz, F., Taylor, M. E., Brys, T., Dazeley, R., Vamplew, P., and Foale, C. (2021). A

Conceptual Framework for Externally-inﬂuenced Agents: An Assisted Reinforcement Learning Review. Journal of Ambient Intelligence and Humanized Computing.

Bogert, K. D. and Doshi, P. (2014). Multi-robot inverse reinforcement learning under occlusion

with interactions. In International conference on Autonomous Agents and Multi-Agent Systems, (AAMAS 2014), Paris, France, May 5-9, 2014, pages 173 180. IFAAMAS.

Boularias, A., Kober, J., and Peters, J. (2011). Relative Entropy Inverse Reinforcement Learning. In

Proceedings of the Fourteenth International Conference on Artiﬁcial Intelligence and Statistics, (AISTATS 2011), Fort Lauderdale, USA, April 11-13, 2011, volume 15 of JMLR Proceedings,

pages 182 189. JMLR.org.

Bowling, M. H. (2000). Convergence Problems of General-Sum Multiagent Reinforcement Learning.

In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 89 94. Morgan Kaufmann.

Carvalho, D., Melo, F. S., and Santos, P. (2020). A new convergent variant of Q-learning with

linear function approximation. In Advances in Neural Information Processing Systems 33: Annual

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Conference on Neural Information Processing Systems (Neur IPS 2020), December 6-12, 2020, virtual.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2001). On the Generalization Ability of On-Line

Learning Algorithms. In Advances in Neural Information Processing Systems, (Neur IPS 2001), Vancouver, British Columbia, Canada, December 3-8, 2001, pages 359 366. MIT Press.

Chemali, J. and Lazaric, A. (2015). Direct Policy Iteration with Demonstrations. In Proceedings of

the Twenty-Fourth International Joint Conference on Artiﬁcial Intelligence, (IJCAI 2015), Buenos Aires, Argentina, July 25-31, 2015, pages 3380 3386. AAAI Press.

Codevilla, F., Müller, M., López, A. M., Koltun, V., and Dosovitskiy, A. (2018). End-to-End Driving

via Conditional Imitation Learning. In International Conference on Robotics and Automation, (ICRA 2018), Brisbane, Australia, May 21-25, 2018, pages 1 9. IEEE.

Codevilla, F., Santana, E., López, A. M., and Gaidon, A. (2019). Exploring the Limitations of

Behavior Cloning for Autonomous Driving. In International Conference on Computer Vision, (ICCV 2019) Seoul, Korea (South), October 27 - November 2, 2019, pages 9328 9337. IEEE.

Conitzer, V. and Sandholm, T. (2003). AWESOME: A General Multiagent Learning Algorithm that

Converges in Self-Play and Learns a Best Response Against Stationary Opponents. In Proceedings of the Twentieth International Conference of Machine Learning, (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 83 90. AAAI Press.

Cruz, M. and Alexander, M. (2010). Assessing Crown Fire Potential in Coniferous Forests of

Western North America: A Critique of Current Approaches and Recent Simulation Studies. The Bark Beetles, Fuels, and Fire Bibliography, 19.

Devlin, S., Kudenko, D., and Grze s, M. (2011). An empirical study of potential-based reward shaping

and advice in complex, multi-agent systems. Advances in Complex Systems, 14(02):251 278.

Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. In Multiple Classiﬁer Systems,

First International Workshop, (MCS 2000), Cagliari, Italy, June 21-23, 2000, Proceedings, volume 1857 of Lecture Notes in Computer Science, pages 1 15. Springer.

Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. (2021).

Challenges of real-world reinforcement learning: Deﬁnitions, Benchmarks and Analysis. Machine Learning, 1:1 50.

Eryigit, C. (2017). Marketing Models: A Review of the Literature. International Journal of Market

Research, 59(3):355 381.

Fink, A. M. et al. (1964). Equilibrium in a stochastic n-person game. Journal of Science of the

Hiroshima University, Series AI (Mathematics), 28(1):89 93.

Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016). A Connection between Generative

Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models. ar Xiv preprint ar Xiv:1611.03852.

MULTI-AGENT ADVISOR Q-LEARNING

Finney, M. A. (1998). FARSITE, Fire Area Simulator model development and evaluation. 4. US

Department of Agriculture, Forest Service, Rocky Mountain Research Station.

Fu, J., Luo, K., and Levine, S. (2018). Learning Robust Rewards with Adverserial Inverse Rein-

forcement Learning. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018. Open Review.net.

Ganesh, S., Vadori, N., Xu, M., Zheng, H., Reddy, P., and Veloso, M. (2019). Reinforcement learning

for market making in a multi-agent dealer market. ar Xiv preprint ar Xiv:1911.05892.

Gangwani, T., Zhou, Y., and Peng, J. (2020). Learning Guidance Rewards with Trajectory-space

Smoothing. In Advances in Neural Information Processing Systems (Neur IPS 2020), December 6-12, 2020, virtual.

Gao, C., Kartal, B., Hernandez-Leal, P., and Taylor, M. E. (2019). On Hard Exploration for

Reinforcement Learning: A Case Study in Pommerman. In Proceedings of the Fifteenth AAAI Conference on Artiﬁcial Intelligence and Interactive Digital Entertainment, (AIIDE 2019), October 8-12, 2019, Atlanta, Georgia, USA, pages 24 30. AAAI Press.

Gao, Y., Xu, H., Lin, J., Yu, F., Levine, S., and Darrell, T. (2018). Reinforcement Learning

from Imperfect Demonstrations. In 6th International Conference on Learning Representations, (ICLR 2018), Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings.

Open Review.net.

Giusti, A., Guzzi, J., Cire san, D. C., He, F.-L., Rodríguez, J. P., Fontana, F., Faessler, M., Forster, C.,

Schmidhuber, J., Di Caro, G., et al. (2015). A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661 667.

Goecks, V. G., Gremillion, G. M., Lawhern, V. J., Valasek, J., and Waytowich, N. R. (2020).

Integrating Behavior Cloning and Reinforcement Learning for Improved Performance in Dense and Sparse Reward Environments. In Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, (AAMAS 2020), Auckland, New Zealand, May 9-13, 2020, pages 465 473. IFAAMAS.

Gupta, J. K., Egorov, M., and Kochenderfer, M. J. (2017). Cooperative Multi-agent Control Using

Deep Reinforcement Learning. In Autonomous Agents and Multiagent Systems - (AAMAS 2017) Workshops, São Paulo, Brazil, May 8-12, 2017, volume 10642 of Lecture Notes in Computer

Science, pages 66 83. Springer.

Hadﬁeld-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. D. (2016). Cooperative Inverse

Reinforcement Learning. In Advances in Neural Information Processing Systems (Neur IPS 2016), Barcelona, Spain, December 5-10, 2016, pages 3909 3917.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex

optimization. Machine Learning, 69(2-3):169 192.

He, H. and Boyd-Graber, J. L. (2016). Opponent Modeling in Deep Reinforcement Learning. In

Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1804 1813. JMLR.org.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Hernandez-Leal, P., Kartal, B., and Taylor, M. E. (2019). A survey and critique of multiagent deep

reinforcement learning. Journal of Autonomous Agents and Multi-Agent Systems (JAAMAS), 33(6):750 797.

Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J.,

Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J. P., Leibo, J. Z., and Gruslys, A. (2018). Deep Q-learning From Demonstrations. In Thirty-Second Conference of Association for the Advancement of Artiﬁcial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 3223 3230. AAAI Press.

Ho, J. and Ermon, S. (2016). Generative Adversarial Imitation Learning. In Advances in Neural

Information Processing Systems (Neur IPS 2016), Barcelona, Spain, December 5-10, 2016, pages 4565 4573.

Hong, Z., Su, S., Shann, T., Chang, Y., and Lee, C. (2018). A Deep Policy Inference Q-Network for

Multi-Agent Systems. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, AAMAS 2018 Stockholm, Sweden, July 10-15, 2018, pages 1388 1396. IFAAMAS.

Hu, J. and Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of

Machine Learning Research (JMLR), 4(Nov):1039 1069.

Hu, Y., Li, J., Li, X., Pan, G., and Xu, M. (2018). Knowledge-Guided Agent-Tactic-Aware Learning for Starcraft Micromanagement. In Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence, (IJCAI 2018), July 13-19, 2018, Stockholm, Sweden, pages 1471 1477. ijcai.org.

Hussein, M., Crowe, B., Petrik, M., and Begum, M. (2021). Robust Maximum Entropy Behavior

Cloning. ar Xiv preprint ar Xiv:2101.01251.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the Convergence of Stochastic Iterative

Dynamic Programming Algorithms. Neural Computation, 6(6):1185 1201.

Jahdi, R., Salis, M., Darvishsefat, A. A., Alcasena, F., Mostafavi, M. A., Etemad, V., Lozano, O. M.,

and Spano, D. (2015). Evaluating ﬁre modelling systems in recent wildﬁres of the Golestan National Park, Iran. Forestry, 89(2):136 149.

Jain, P., Coogan, S. C., Subramanian, S. G., Crowley, M., Taylor, S., and Flannigan, M. D. (2020).

A review of machine learning applications in wildﬁre science and management. Environmental Reviews, 28(4):478 505.

Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., and Liu, H. (2020). Reinforcement Learning

from Imperfect Demonstrations under Soft Expert Guidance. In The Thirty-Fourth Conference of Association for the Advancement of Artiﬁcial Intelligence (AAAI 2020), New York, NY, USA, February 7-12, 2020, pages 5109 5116. AAAI Press.

Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Ph D thesis, UCL

(University College London).

MULTI-AGENT ADVISOR Q-LEARNING

Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab, A. A., Yogamani, S., and Pérez, P. (2021).

Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems.

Kueﬂer, A., Morton, J., Wheeler, T., and Kochenderfer, M. (2017). Imitating Driver Behavior with

Generative Adversarial Networks. In IEEE Intelligent Vehicles Symposium (IV 2017), pages 204 211. IEEE.

Laskey, M., Lee, J., Fox, R., Dragan, A. D., and Goldberg, K. (2017). DART: Noise Injection for

Robust Imitation Learning. In 1st Annual Conference on Robot Learning, (Co RL 2017), Mountain View, California, USA, 2017, volume 78 of Proceedings of Machine Learning Research, pages 143 156. PMLR.

Lasry, J.-M. and Lions, P.-L. (2007). Mean ﬁeld games. Japanese journal of mathematics, 2(1):229

Laud, A. D. (2004). Theory and application of reward shaping in reinforcement learning. University

of Illinois at Urbana-Champaign.

Le, H. M., Yue, Y., Carr, P., and Lucey, P. (2017). Coordinated Multi-Agent Imitation Learning. In

Proceedings of the 34th International Conference on Machine Learning, (ICML 2017), Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1995 2003. PMLR.

Leblon, B., Bourgeau-Chavez, L., and San-Miguel-Ayanz, J. (2012). Use of remote sensing in

wildﬁre management. In Sustainable Development-Authoritative and Leading Edge Content for Environmental Management. Intech Open.

Levine, S., Popovic, Z., and Koltun, V. (2011). Nonlinear Inverse Reinforcement Learning with

Gaussian Processes. In Advances in Neural Information Processing Systems (Neur IPS 2011) Granada, Spain, December 2011, pages 19 27.

Li, M., Wei, Y., and Kudenko, D. (2019). Two-level Q-learning: learning from conﬂict demonstrations.

The Knowledge Engineering Review, 34.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D.

(2016). Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, (ICLR 2016), San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Lin, X., Adams, S. C., and Beling, P. A. (2019). Multi-agent inverse reinforcement learning for certain

general-sum stochastic games. Journal of Artiﬁcial Intelligence Research (JAIR), 66:473 502.

Lin, X., Beling, P. A., and Cogill, R. (2017). Multiagent inverse reinforcement learning for two-person

zero-sum games. IEEE Transactions on Games, 10(1):56 68.

Littman, M. and Moore, A. (1996). Reinforcement Learning: A Survey. Journal of Artiﬁcial

Intelligence Research (JAIR), 4:237 285.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Littman, M. L. (2001). Friend-or-Foe Q-learning in General-Sum Games. In Proceedings of

the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 322 328. Morgan Kaufmann.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. (2017). Multi-Agent Actor-Critic

for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems (Neur IPS 2017), Long Beach, CA, USA, December 4-9, 2017, pages 6379 6390.

Marom, O. and Rosman, B. (2018). Belief Reward Shaping in Reinforcement Learning. In Proceedings of the Thirty-Second AAAI Conference on Artiﬁcial Intelligence, (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 3762 3769. AAAI Press.

Meisheri, H., Shelke, O., Verma, R., and Khadilkar, H. (2019). Accelerating training in Pommerman

with imitation and reinforcement learning. ar Xiv preprint ar Xiv:1911.04947.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral Normalization for Generative

Adversarial Networks. In 6th International Conference on Learning Representations, (ICLR 2018), Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu,

K. (2016). Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the Thirty-Third International Conference on Machine Learning, (ICML 2016), New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1928 1937. JMLR.org.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A.,

Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Nair, A., Mc Grew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Overcoming Explo-

ration in Reinforcement Learning with Demonstrations. In 2018 IEEE International Conference on Robotics and Automation, (ICRA 2018) Brisbane, Australia, May 21-25, 2018, pages 6292 6299. IEEE.

Nash, J. F. (1951). Non-Cooperative Games. Princeton University Press.

Natarajan, S., Kunapuli, G., Judah, K., Tadepalli, P., Kersting, K., and Shavlik, J. W. (2010). Multi-

Agent Inverse Reinforcement Learning. In The Ninth International Conference on Machine Learning and Applications, (ICMLA 2010), Washington, DC, USA, 12-14 December 2010, pages 395 400. IEEE Computer Society.

Neumann, J. v. (1928). Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295 320.

Ng, A. Y., Harada, D., and Russell, S. J. (1999). Policy Invariance Under Reward Transformations:

Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27 - 30, 1999, pages 278 287. Morgan Kaufmann.

MULTI-AGENT ADVISOR Q-LEARNING

Ng, A. Y. and Russell, S. J. (2000). Algorithms for Inverse Reinforcement Learning. In Proceedings of

the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 663 670. Morgan Kaufmann.

Nikitin, V., Golubin, S., Belov, R., Gusev, V., and Andrianov, N. (2019). Development of a robotic

vehicle complex for wildﬁre-ﬁghting by means of ﬁre-protection roll screens. In IOP Conference Series: Earth and Environmental Science, volume 226, page 012003. IOP Publishing.

Omidshaﬁei, S., Kim, D., Liu, M., Tesauro, G., Riemer, M., Amato, C., Campbell, M., and How, J. P.

(2019). Learning to Teach in Cooperative Multiagent Reinforcement Learning. In Thirty-Third Conference of Association for the Advancement of Artiﬁcial Intelligence (AAAI-19), Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 6128 6136. AAAI Press.

Peng, P., Xing, J., and Cao, L. (2020). Hybrid Learning for Multi-agent Cooperation with Sub-

optimal Demonstrations. In Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence, (IJCAI-2020), pages 3037 3043. International Joint Conferences on Artiﬁcial Intelligence Organization. Main track.

Peters, J. and Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural

networks, 21(4):682 697.

Phan, C. and Liu, H. H. (2008). A cooperative UAV/UGV platform for wildﬁre detection and ﬁghting.

In Asia Simulation Conference-7th International Conference on System Simulation and Scientiﬁc Computing, pages 494 498. IEEE.

Piot, B., Geist, M., and Pietquin, O. (2014). Boosted Bellman Residual Minimization Handling

Expert Demonstrations. In Machine Learning and Knowledge Discovery in Databases - European Conference, (ECML-PKDD 2014), Nancy, France, September 15-19, 2014. Proceedings, Part II, volume 8725 of Lecture Notes in Computer Science, pages 549 564. Springer.

Pomerleau, D. (1988). ALVINN: An Autonomous Land Vehicle in a Neural Network. In Advances

in Neural Information Processing Systems (Neur IPS 1988) Denver, Colorado, USA, 1988, pages 305 313. Morgan Kaufmann.

Pomerleau, D. A. (1991). Efﬁcient training of artiﬁcial neural networks for autonomous navigation.

Neural Computation, 3(1):88 97.

Price, B. and Boutilier, C. (2003). Accelerating Reinforcement Learning through Implicit Imitation.

Journal of Artiﬁcial Intelligence Research, 19:569 629.

Raileanu, R., Denton, E., Szlam, A., and Fergus, R. (2018). Modeling Others using Oneself in Multi-

Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4254 4263. PMLR.

Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine, S.

(2018). Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. In Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26-30, 2018.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Rasmussen, C. E. (2003). Gaussian Processes in Machine Learning. In Advanced Lectures on

Machine Learning, ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, volume 3176 of Lecture Notes in Computer Science, pages 63 71. Springer.

Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. (2006a). Maximum Margin Planning. In Proceedings

of the Twenty-Third International Conference on Machine Learning (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, volume 148 of ACM International Conference Proceeding Series, pages 729 736. ACM.

Ratliff, N. D., Bradley, D. M., Bagnell, J. A., and Chestnutt, J. E. (2006b). Boosting Structured

Prediction for Imitation Learning. In Advances in Neural Information Processing Systems (Neur IPS 2006) Vancouver, British Columbia, Canada, December 4-7, 2006, pages 1153 1160. MIT Press.

Reddy, T. S., Gopikrishna, V., Záruba, G. V., and Huber, M. (2012). Inverse reinforcement learning

for decentralized non-cooperative multiagent systems. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2012), Seoul, Korea (South), October 14-17, 2012, pages 1930 1935. IEEE.

Resnick, C., Eldridge, W., Ha, D., Britz, D., Foerster, J., Togelius, J., Cho, K., and Bruna, J. (2018).

Pommerman: A Multi-Agent Playground. In ar Xiv preprint ar Xiv:1809.07124.

Ross, S., Gordon, G. J., and Bagnell, D. (2011). A Reduction of Imitation Learning and Structured

Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artiﬁcial Intelligence and Statistics (AISTATS 2011), Fort Lauderdale, USA, April 11-13, 2011, volume 15 of JMLR Proceedings, pages 627 635. JMLR.org.

Rothermel, R. C. (1972). A mathematical model for predicting ﬁre spread in wildland fuels, volume

115. Intermountain Forest & Range Experiment Station, Forest Service, US.

Santana, E. and Hotz, G. (2016). Learning a driving simulator. ar Xiv preprint ar Xiv:1608.01230.

Schaal, S. (1996). Learning from Demonstration. In Advances in Neural Information Processing

Systems (Neur IPS 1996), Denver, CO, USA, December 2-5, 1996, pages 1040 1046. MIT Press.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy

optimization algorithms. In ar Xiv preprint ar Xiv:1707.06347.

Shalev-Shwartz, S. and Kakade, S. M. (2008). Mind the Duality Gap: Logarithmic regret algorithms

for online optimization. In Advances in Neural Information Processing Systems (Neur IPS 2008), Vancouver, British Columbia, Canada, December 8-11, 2008, pages 1457 1464. Curran Associates,

Shapley, L. S. (1953). Stochastic games. Proceedings of the national academy of sciences, 39(10):1095 1100.

Shoham, Y. and Leyton-Brown, K. (2008). Multiagent systems: Algorithmic, game-theoretic, and

logical foundations. Cambridge University Press.

MULTI-AGENT ADVISOR Q-LEARNING

Silva, F. L. D. and Costa, A. H. R. (2019). A Survey on Transfer Learning for Multiagent Reinforce-

ment Learning Systems. Journal of Artiﬁcial Intelligence Research (JAIR), 64:645 703.

Silva, F. L. D., Glatt, R., and Costa, A. H. R. (2017). Simultaneously Learning and Advising in

Multiagent Reinforcement Learning. In Proceedings of the 16th Conference on Autonomous Agents and Multi Agent Systems, (AAMAS 2017), São Paulo, Brazil, May 8-12, 2017, pages 1100 1108. ACM.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J.,

Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484 489.

Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. (2000). Convergence results for single-step

on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287 308.

Sonabend, A., Lu, J., Celi, L. A., Cai, T., and Szolovits, P. (2020). Expert-Supervised Reinforce-

ment Learning for Ofﬂine Policy Learning and Evaluation. In Advances in Neural Information Processing Systems (Neur IPS 2020), virtual, December 6-12, 2020.

Song, J., Ren, H., Sadigh, D., and Ermon, S. (2018). Multi-Agent Generative Adversarial Imitation

Learning. In Advances in Neural Information Processing Systems (Neur IPS 2018), Montreal, Canada, December 3-8, 2018,, pages 7472 7483.

Storbacka, K. and Moser, T. (2020). The changing role of marketing: transformed propositions,

processes and partnerships. AMS Review, 10(3):299 310.

Subramanian, S. G. (2022). Multi-Agent Advisor Q-Learning. https://github.com/ Sriram94/multiagentadvisorqlearning.

Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning, volume 135. MIT

Press Cambridge.

Sutton, R. S., Mc Allester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy Gradient Methods

for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, [Nuer IPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pages 1057 1063. The MIT Press.

Syed, U. and Schapire, R. E. (2010). A Reduction from Apprenticeship Learning to Classiﬁcation. In

Advances in Neural Information Processing Systems (Neur IPS 2010) Vancouver, British Columbia, Canada, December 6-9, 2010, pages 2253 2261. Curran Associates, Inc.

Szepesvári, C. and Littman, M. L. (1999). A uniﬁed analysis of value-function-based reinforcement-

learning algorithms. Neural Computation, 11(8):2017 2060.

Tan, M. (1997). Multi-agent reinforcement learning: Independent vs Cooperative learning. Readings

in Agents, pages 487 494.

Taylor, M. E., Suay, H. B., and Chernova, S. (2011). Integrating reinforcement learning with

human demonstrations of varying ability. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011), Taipei, Taiwan, May 2-6, 2011, pages 617 624. IFAAMAS.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Terry, J. K., Black, B., Jayakumar, M., Hari, A., Santos, L., Dieffendahl, C., Williams, N. L., Lokesh,

Y., Sullivan, R., Horsch, C., and Ravi, P. (2020). Petting Zoo: Gym for Multi-Agent Reinforcement

Learning. In ar Xiv preprint ar Xiv:2009.14471.

Theodorou, E. A., Buchli, J., and Schaal, S. (2010). Reinforcement learning of motor skills in high

dimensions: A path integral approach. In International Conference on Robotics and Automation, (ICRA 2010), Anchorage, Alaska, USA, 3-7 May 2010, pages 2397 2403. IEEE.

Thompson, M. P., Lauer, C. J., Calkin, D. E., Rieck, J. D., Stonesifer, C. S., and Hand, M. S. (2018).

Wildﬁre Response Performance Measurement: Current and Future Directions. Fire, 1(2).

Thompson, M. P., Wei, Y., Calkin, D. E., O Connor, C. D., Dunn, C. J., Anderson, N. M., and

Hogland, J. S. (2019). Risk management and analytics in wildﬁre response. Current Forestry Reports, 5(4):226 239.

Torrey, L. and Taylor, M. E. (2013). Teaching on a budget: agents advising agents in reinforcement

learning. In International conference on Autonomous Agents and Multi-Agent Systems, AAMAS 13, Saint Paul, MN, USA, May 6-10, 2013, pages 1053 1060. IFAAMAS.

Tymstra, C., Stocks, B. J., Cai, X., and Flannigan, M. D. (2020). Wildﬁre management in Canada:

Review, challenges and opportunities. Progress in Disaster Science, 5:100045.

Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe,

T., and Riedmiller, M. (2017). Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. ar Xiv preprint ar Xiv:1707.08817.

Wang, X. and Klabjan, D. (2018). Competitive Multi-agent Inverse Reinforcement Learning with

Sub-optimal Demonstrations. In Proceedings of the Thirty-ﬁfth International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 5130 5138. PMLR.

Wang, Z. and Taylor, M. E. (2017). Improving Reinforcement Learning with Conﬁdence-Based

Demonstrations. In Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI 2017), Melbourne, Australia, August 19-25, 2017, pages 3027 3033. ijcai.org.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279 292.

Waugh, K., Ziebart, B. D., and Bagnell, D. (2011). Computational Rationalization: The Inverse

Equilibrium Problem. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, (ICML 2011), Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1169 1176. Omnipress.

Wiewiora, E., Cottrell, G. W., and Elkan, C. (2003). Principled Methods for Advising Reinforcement

Learning Agents. In Proceedings of the Twentieth International Conference of Machine Learning (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 792 799. AAAI Press.

Wulfmeier, M., Ondruska, P., and Posner, I. (2015). Maximum Entropy Deep Inverse Reinforcement

Learning. ar Xiv preprint ar Xiv:1507.04888.

MULTI-AGENT ADVISOR Q-LEARNING

Xiao, B., Ramasubramanian, B., and Poovendran, R. (2021). Shaping Advice in Deep Multi-Agent

Reinforcement Learning. Co RR, abs/2103.15941.

Ye, D., Zhu, T., Cheng, Z., Zhou, W., and Yu, P. S. (2020). Differential Advising in Multi-Agent

Reinforcement Learning. Co RR, abs/2011.03640.

Yogeswaran, M. and Ponnambalam, S. (2012). Reinforcement learning: Exploration exploitation

dilemma in multi-agent foraging task. Opsearch, 49(3):223 236.

Yu, L., Song, J., and Ermon, S. (2019). Multi-Agent Adversarial Inverse Reinforcement Learning.

In Proceedings of the Thirty-Sixth International Conference on Machine Learning, (ICML 2019), Long Beach, California, USA, 9-15 June 2019, volume 97 of Proceedings of Machine Learning Research, pages 7194 7201. PMLR.

Zhan, Y., Bou-Ammar, H., and Taylor, M. E. (2016). Theoretically-Grounded Policy Advice from

Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer. In Proceedings of the Twenty-Fifth International Joint Conference on Artiﬁcial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2315 2321. IJCAI/AAAI Press.

Zhang, K., Yang, Z., and Bashar, T. (2019). Multi-agent reinforcement learning: A selective overview

of theories and algorithms. ar Xiv preprint ar Xiv:1911.10635, 1.

Zheng, J., Liu, S., and Ni, L. M. (2014). Robust Bayesian Inverse Reinforcement Learning with

Sparse Behavior Noise. In Proceedings of the Twenty-Eighth Conference of Association of Artiﬁcial Intelligence (AAAI-2014), Quebec City, Quebec, Canada, July 27 -31, 2014, pages 2198 2205. AAAI Press.

Zhou, H., Gong, Y., Mugrai, L., Khalifa, A., Nealen, A., and Togelius, J. (2018). A hybrid search

agent in Pommerman. In Proceedings of the 13th International Conference on the Foundations of Digital Games (FDG 2018), Malmo, Sweden, August 07-10, 2018, pages 46:1 46:4. ACM.

Zhou, M., Luo, J., Villela, J., Yang, Y., Rusu, D., Miao, J., Zhang, W., Alban, M., Fadakar, I., Chen,

Z., Huang, A. C., Wen, Y., Hassanzadeh, K., Graves, D., Chen, D., Zhu, Z., Nguyen, N. M., Elsayed, M., Shao, K., Ahilan, S., Zhang, B., Wu, J., Fu, Z., Rezaee, K., Yadmellat, P., Rohani, M., Nieves, N. P., Ni, Y., Banijamali, S., Cowen-Rivers, A. I., Tian, Z., Palenicek, D., Bou-Ammar, H., Zhang, H., Liu, W., Hao, J., and Wang, J. (2020). SMARTS: Scalable Multi-agent Reinforcement Learning Training School for Autonomous Driving. Co RR, abs/2010.09776.

Zhu, Y., Wang, Z., Merel, J., Rusu, A. A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramár, J.,

Hadsell, R., de Freitas, N., and Heess, N. (2018). Reinforcement and Imitation Learning for Diverse Visuomotor Skills. In Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26-30, 2018.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum Entropy Inverse

Reinforcement Learning. In Proceedings of the Twenty-Third Conference of Association for the Advancement of Artiﬁcial Intelligence (AAAI-2008), Chicago, Illinois, USA, July 13-17, 2008,

pages 1433 1438. AAAI Press.

SUBRAMANIAN, TAYLOR, LARSON, & CROWLEY

Zigner, K., Carvalho, L., Peterson, S., Fujioka, F., Duine, G.-J., Jones, C., Roberts, D., and Moritz, M.

(2020). Evaluating the ability of FARSITE to simulate wildﬁres inﬂuenced by extreme, downslope winds in Santa Barbara, California. Fire, 3(3):29.