# Multi-Agent Learning from Learners

Mine Melodi Caliskan*1, Francesco Chini*1, Setareh Maghsudi1

A large body of the Inverse Reinforcement Learning (IRL) literature focuses on recovering the reward function from a set of demonstrations of an expert agent who acts optimally or noisily optimally. Nevertheless, some recent works move away from the optimality assumption to study the Learning from a Learner (LfL) problem, where the challenge is inferring the reward function of a learning agent from a sequence of demonstrations produced by progressively improving policies. In this work, we take one of the initial steps in addressing the multi-agent version of this problem and propose a new algorithm, MA-LfL (Multi-Agent Learning from a Learner). Unlike the state-of-the-art literature, which recovers the reward functions from trajectories produced by agents in some equilibrium, we study the problem of inferring the reward functions of interacting agents in a general-sum stochastic game without assuming any equilibrium state. The MA-LfL algorithm is rigorously built on a theoretical result that ensures its validity in the case of agents learning according to a multi-agent soft policy iteration scheme. We empirically test MA-LfL and observe a high positive correlation between the recovered reward functions and the ground truth.

*Equal contribution. 1Department of Computer Science, University of Tübingen, Tübingen, Germany. Correspondence to: Mine Melodi Caliskan, Francesco Chini, Setareh Maghsudi.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Figure 1. Two autonomous cars from different companies might optimize different reward functions which are not directly accessible. For example, one company might prioritize speed and another one safety or energy efficiency. They share the same environment (the road), and learning each other's reward function can help them predict the other agent's behaviour.

1. Introduction

The Inverse Reinforcement Learning (IRL) problem corresponds to inferring the reward function of a reinforcement learning (RL) agent from a set of trajectories. Learning the reward function, as compared to directly learning the policy of the demonstrator, yields a more succinct description of the task performed by the agent, and this knowledge is better suited to being transferred to new environments. This is even more important when the demonstrator is not an expert, especially when it is undergoing a policy learning process. Learning the reward functions, which do not change during the learning process, is also crucial in a multi-agent setting. Consider for instance the case of lane changes on a highway for autonomous cars (Fig. 1). Here the environment contains multiple private agents, which can observe each other's states and actions but cannot access any other information, such as the policies and rewards of others. Inferring the reward functions of other agents can be useful to model and predict their behaviour.

Initial work on IRL typically assumes the reward function to be linear w.r.t. a set of features (Abbeel & Ng, 2004). However, recent approaches to IRL have relaxed this assumption (Ho & Ermon, 2016; Fu et al., 2017). Early IRL literature also assumes the observed agent to be an expert, i.e., to behave optimally or noisily optimally (Ng et al., 2000; Ziebart et al., 2008).
Recent work has relaxed the optimality assumption (Brown et al., 2019; Tangkaratt et al., 2020), and in (Jacq et al., 2019) the authors introduced the Learning from a Learner (LfL) problem, where the challenge is to infer the reward function of a learning agent from trajectories produced by a sequence of progressively improving policies. For another approach to the LfL problem, see also (Ramponi et al., 2020).

The IRL problem has also been studied in the multi-agent case (Natarajan et al., 2010), where the goal is to recover the reward functions of a set of agents interacting in a stochastic game. In this setting, the agents are usually assumed to be in a certain equilibrium, such as a Nash or correlated equilibrium (Reddy et al., 2012). This is quite restrictive considering that in many real-world applications, such as autonomous cars, multi-agent systems will likely not be in any equilibrium.

Here we introduce and study the multi-agent version of the LfL problem. We address the problem of recovering the reward functions of agents learning in a general-sum stochastic game. We do not assume the agents to be in any equilibrium but rather to be independently learning according to a multi-agent soft policy iteration scheme. To address this problem, we propose a new algorithm, MA-LfL (Multi-Agent Learning from a Learner), which builds upon the single-agent LfL algorithm (Jacq et al., 2019). Our algorithm, which we present in both offline and online settings, allows each agent to recover the reward functions of other agents while improving its own policy with respect to its own reward function. Moreover, the recovered reward functions can be used by the agents to predict the next policy improvements of the other agents. We include error bounds both for the reward recovery and for the policy improvement predictions. These are novel contributions even in the single-agent case.

2. Related Work

Our work stems from (Jacq et al., 2019), where the authors introduce the LfL framework. The framework enables an Observer to learn the reward function of a Learner, who learns to solve a Markov Decision Process. The motivation there is to train the Observer with the recovered reward in order to potentially outperform the Learner. In our multi-agent setting, all agents are Observers and Learners simultaneously. Our motivation is not to make the agents imitate (Yu et al., 2019; Torabi et al., 2018) or outperform each other (Jacq et al., 2019). Rather, we focus on modeling the agents during an ongoing learning process, and we allow the agents to be heterogeneous, namely to have different action spaces and different reward functions.

The majority of the state-of-the-art research assumes specific reward structures, ranging from fully cooperative games (Natarajan et al., 2010; Barrett et al., 2017; Le et al., 2017; Šošić et al., 2017) to zero-sum games (Lin et al., 2017). We do not assume any of these restrictions, as we allow the agents to interact in a general-sum stochastic game. Multi-Agent Adversarial Inverse Reinforcement Learning (MA-AIRL) (Yu et al., 2019) and Multi-Agent Generative Adversarial Imitation Learning (MA-GAIL) (Song et al., 2018) are frameworks based on adversarial learning that estimate policies and reward functions. In both works there are no strong assumptions on the reward structure. However, in (Song et al., 2018) the agents are assumed to be in a Nash equilibrium.
In (Yu et al., 2019) the agents are assumed to be in a logistic stochastic best response equilibrium (LSBRE), an equilibrium concept that is a stochastic generalization of Nash and correlated equilibria. This reflects the assumption that the agents act sub-optimally, significantly relaxing the assumptions of early works on multi-agent IRL (Natarajan et al., 2010; Reddy et al., 2012). We take a step further by assuming the agents to be in a learning process rather than in an equilibrium.

3. Problem Setting

We consider the problem of $N$ agents with an entropy-regularized objective acting in a Markov game.

Definition 3.1. A Markov game (Littman, 1994) $\mathcal{M}$ for $N$ agents is a tuple $(S, \{A^i\}_{i=1}^N, T, \{R^i\}_{i=1}^N, P_0, \gamma)$, where $S$ is the state space, $A^i$ is the action set of agent $i \in \{1, \dots, N\}$, $T\colon S \times A^1 \times \dots \times A^N \to \mathcal{P}(S)$ is the transition function, $R^i\colon S \times A^1 \times \dots \times A^N \to \mathbb{R}$ is the reward function of agent $i \in \{1, \dots, N\}$, $P_0 \in \mathcal{P}(S)$ is the initial state distribution, and $0 \le \gamma < 1$ is the discount factor.

Definition 3.2. A policy for agent $i$ is a map $\pi^i\colon S \to \mathcal{P}(A^i)$, where $\mathcal{P}(A^i)$ denotes the set of probability measures over $A^i$. Given policies $\pi^1, \dots, \pi^N$, we use $\pi$ to denote the joint policy $\pi\colon S \to \mathcal{P}(A^1 \times \dots \times A^N)$, where $\pi(a^1, \dots, a^N \mid s) = \prod_{i=1}^N \pi^i(a^i \mid s)$. Moreover, $a = (a^1, \dots, a^N)$ is the joint action profile of all agents. Besides, $a^{-i} = (a^1, \dots, a^{i-1}, a^{i+1}, \dots, a^N)$ and $\pi^{-i}(a^{-i} \mid s) = \prod_{j \ne i} \pi^j(a^j \mid s)$ respectively denote the joint action and the joint policy of the opponents of agent $i$.

Remark 3.3. Note that we do not assume the existence of a centralized actor. The symbol $\pi$ only denotes the product of $N$ individual policies.

Assumption 3.4. In our setting, agents have access to the states $s$ and actions $a$ of all agents. However, each agent $i$ can only observe its own reward $R^i$.

3.1. Entropy-regularized objective

In a standard stochastic game, the objective of each agent $i$ is to find a policy $\pi^i$ that maximizes the expected total discounted reward. Formally,

$$J(\pi^i) = \mathbb{E}_{a^i \sim \pi^i,\, a^{-i} \sim \pi^{-i}} \left[ \sum_{t \ge 0} \gamma^t R^i\big(s_t, a^{-i}_t, a^i_t\big) \right].$$

Remark 3.5. Note that the reward $R^i$ of every agent $i$ depends also on the actions of the other agents. Consequently, the objective depends on the joint policy $\pi^{-i}$ of the other agents as well.

Assumption 3.6. We assume that the objective is entropy-regularized; i.e., each agent $i$ maximizes

$$J_{\mathrm{soft}}(\pi^i) = \mathbb{E}_{\pi^i, \pi^{-i}} \left[ \sum_{t \ge 0} \gamma^t \big( R^i_t + \alpha H_t \big) \right], \qquad (1)$$

where $R^i_t = R^i(s_t, a_t)$, $H_t = H\big(\pi^i(\cdot \mid s_t)\big) = \mathbb{E}_{a^i \sim \pi^i(\cdot \mid s_t)}\big[ -\ln \pi^i(a^i \mid s_t) \big]$ is the Shannon entropy, and $\alpha > 0$ is a coefficient that controls the degree of regularization. Entropy regularization has been introduced in the RL literature as an approach to tackle the exploration-exploitation dilemma (Haarnoja et al., 2017; 2018).

Definition 3.7. Given a joint policy $\pi$, the soft Q-value function for agent $i$ is defined as

$$Q^{\pi,i}_{\mathrm{soft}}(s, a) = R^i_0 + \mathbb{E}_{\pi} \left[ \sum_{t > 0} \gamma^t \big( R^i_t + \alpha H_t \big) \right]$$

for every $s \in S$, $a \in A^1 \times \dots \times A^N$, where $R^i_t = R^i(s_t, a_t)$ and $H_t = H\big(\pi^i(\cdot \mid s_t)\big)$.

Remark 3.8. It is straightforward to show that $Q^{\pi,i}_{\mathrm{soft}}$ satisfies the following Bellman equation:

$$Q^{\pi,i}_{\mathrm{soft}}(s, a) = R^i(s, a) + \gamma\, \mathbb{E}_{\pi}\big[ Q^{\pi,i}_{\mathrm{soft}}(s', a') + \alpha H\big(\pi^i(\cdot \mid s')\big) \big]. \qquad (3)$$
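To make the soft Bellman equation (3) concrete, the following minimal sketch (our own illustration, not the authors' code) iterates the backup for a fixed joint policy in a hypothetical two-agent tabular game; the function name and tensor shapes are ours.

```python
import numpy as np

# Sketch (ours): iterate the soft Bellman equation (3) for agent i under a
# fixed joint policy in a tiny tabular two-agent Markov game.

def soft_policy_evaluation(T, R_i, pi, pi_i, alpha=0.3, gamma=0.9, iters=500):
    """T[s, a1, a2, s']: transition probs; R_i[s, a1, a2]: reward of agent i;
    pi[s, a1, a2]: joint policy (product of individual policies);
    pi_i[s, a_i]: policy of agent i (used only for its entropy term)."""
    Q = np.zeros_like(R_i)                               # Q_soft^{pi,i}(s, a)
    for _ in range(iters):
        H = -(pi_i * np.log(pi_i + 1e-12)).sum(axis=1)   # H(pi^i(.|s')) per state
        # V(s') = E_{a' ~ pi}[Q(s', a')] + alpha * H(pi^i(.|s'))
        V = (pi * Q).sum(axis=(1, 2)) + alpha * H
        Q = R_i + gamma * np.einsum("abcd,d->abc", T, V)  # backup of Eq. (3)
    return Q
```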
4. Multi-agent Soft Policy Iteration

Our MA-LfL algorithm is built on the assumption that the agents are learning according to a multi-agent soft policy iteration (MA-SPI) scheme, which we derive from SPI in the single-agent case. Before introducing the proposed MA-LfL algorithm, in this section we explain the MA-SPI algorithm in detail. Similar to many policy iteration algorithms (Sutton & Barto, 2018), it consists of a policy evaluation step and a policy improvement step, and it is an on-policy algorithm.

4.1. Reducing a Markov Game to a Single-Agent Markov Decision Process

Let us recall here the statement of the theorem that underlies the single-agent SPI algorithm, which guarantees that it improves policies monotonically.

Theorem 4.1 (Theorem 4 in Appendix A of (Haarnoja et al., 2017)). Given a policy $\pi$ in an entropy-regularized Markov Decision Process, define a new policy $\pi_{\mathrm{new}}$ as

$$\pi_{\mathrm{new}}(\cdot \mid s) \propto \exp\left( \frac{Q^{\pi}_{\mathrm{soft}}(s, \cdot)}{\alpha} \right) \qquad (4)$$

for every state $s$, where $\alpha$ is the entropy coefficient. Then it follows that $Q^{\pi_{\mathrm{new}}}_{\mathrm{soft}}(s, a) \ge Q^{\pi}_{\mathrm{soft}}(s, a)$ for every state-action pair $(s, a)$.

We report here the statement of the theorem that underlies the single-agent LfL algorithm of Jacq et al. (2019).

Theorem 4.2 (Theorem 2 in (Jacq et al., 2019)). Let $\pi$ and $\pi_{\mathrm{new}}$ be two consecutive policies in an entropy-regularized Markov Decision Process with entropy coefficient $\alpha$, such that $\pi_{\mathrm{new}}$ is the single-agent soft policy improvement given by (4). Then the reward function

$$\bar{R}(s, a) = \alpha \ln \pi_{\mathrm{new}}(a \mid s) + \alpha \gamma\, \mathbb{E}_{s' \sim P}\big[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s') \,\|\, \pi_{\mathrm{new}}(\cdot \mid s') \big) \big]$$

coincides with the actual reward function $R$ up to a shaping, which will be defined in Section 4.4. Namely, $\bar{R}(s, a) = R(s, a) + g(s) - \gamma\, \mathbb{E}_{s' \sim P}[g(s')]$, where $g$ is a function defined on the state space.

Definition 4.3. Let $\mathcal{M} = (S, \{A^i\}_{i=1}^N, T, \{R^i\}_{i=1}^N, P_0, \gamma)$ be a Markov game and let $\pi^{-i}$ be a joint policy for all agents except agent $i$. We define the single-agent Markov decision process $\widetilde{\mathcal{M}}^i = (S, A^i, \widetilde{P}, \widetilde{R}, P_0, \gamma)$, where

$$\widetilde{P}(s' \mid s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}(\cdot \mid s)}\big[ P(s' \mid s, a^{-i}, a^i) \big]; \qquad \widetilde{R}(s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}(\cdot \mid s)}\big[ R^i(s, a^{-i}, a^i) \big].$$

For agent $i$, a policy $\pi^i$ defines a policy for the MDP $\widetilde{\mathcal{M}}^i$. Moreover, if $\pi^{-i}$ remains fixed, then the entropy-regularized objective in Eq. (1) is equal to the entropy-regularized objective for $\widetilde{\mathcal{M}}^i$, i.e.,

$$\widetilde{J}_{\mathrm{soft}}(\pi^i) = \mathbb{E}_{\pi^i} \left[ \sum_{t \ge 0} \gamma^t \big( \widetilde{R}(s_t, a^i_t) + \alpha H\big(\pi^i(\cdot \mid s_t)\big) \big) \right].$$
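As an illustration of Definition 4.3, the sketch below (ours, with hypothetical tabular shapes) marginalizes the opponent's policy out of the transition and reward tensors of a two-agent game to obtain the induced single-agent MDP $\widetilde{\mathcal{M}}^i$ for agent 1.

```python
import numpy as np

# Sketch (ours, not the paper's code): build the induced single-agent MDP of
# Definition 4.3 for agent 1 by averaging the joint transition and reward
# tensors over the opponent's (agent 2's) fixed policy.

def reduce_to_mdp(T, R1, pi2):
    """T[s, a1, a2, s']: joint transition; R1[s, a1, a2]: reward of agent 1;
    pi2[s, a2]: opponent policy. Returns (P_tilde[s, a1, s'], R_tilde[s, a1])."""
    P_tilde = np.einsum("sabt,sb->sat", T, pi2)   # E_{a2 ~ pi2}[P(s'|s, a2, a1)]
    R_tilde = np.einsum("sab,sb->sa", R1, pi2)    # E_{a2 ~ pi2}[R1(s, a2, a1)]
    return P_tilde, R_tilde
```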
4.2. Policy Evaluation

Given a joint policy $\pi = \prod_{i=1}^N \pi^i$, each agent $i$ learns the expectation of $Q^{\pi,i}_{\mathrm{soft}}$ with respect to $\pi^{-i}$ during the run of some episodes. From the perspective of agent $i$, during the evaluation phase the other agents can be thought of as being part of the environment, by absorbing the policy $\pi^{-i}$ into the dynamics. Therefore, during the evaluation phase, the Markov game is equivalent to a Markov Decision Process $\widetilde{\mathcal{M}}^i$ for agent $i$, and the expectation of $Q^{\pi,i}_{\mathrm{soft}}$ is in fact the soft Q-function $\widetilde{Q}^{\pi^i}_{\mathrm{soft}}$ w.r.t. $\widetilde{\mathcal{M}}^i$. Hence, agent $i$ learns $\widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ Q^{\pi,i}_{\mathrm{soft}}(s, a^{-i}, a^i) \big]$ via temporal-difference learning based on the Bellman equation (3):

$$\widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, a^i) = \widetilde{R}(s, a^i) + \gamma\, \mathbb{E}_{s',\, a^{i\prime} \sim \pi^i}\big[ \widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s', a^{i\prime}) + \alpha H\big( \pi^i(\cdot \mid s') \big) \big], \qquad (5)$$

where $\widetilde{R}(s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ R^i(s, a^{-i}, a^i) \big]$.

4.3. Policy Improvement

Definition 4.4. Given a policy $\pi^i$ for agent $i$ and $\pi^{-i}$ for the opponents, the soft policy improvement for agent $i$ is defined as

$$\pi^i_{\mathrm{new}}(a^i \mid s) \propto \exp\left( \frac{1}{\alpha} \widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, a^i) \right), \qquad (6)$$

where $\widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ Q^{\pi,i}_{\mathrm{soft}}(s, a^{-i}, a^i) \big]$. In the following we will use the notation $\mathrm{SPI}_{\pi^{-i}}(\pi^i)$ to denote the soft policy improvement $\pi^i_{\mathrm{new}}$.

Assumption 4.5. We assume all the agents update their policies simultaneously, $\pi_{\mathrm{new}} = \prod_{i=1}^N \mathrm{SPI}_{\pi^{-i}}(\pi^i)$.

Lemma 4.6. Let $\widetilde{Q}^{\pi^i}_{\mathrm{soft}}$ be the soft Q-value function of $\pi^i$ regarded as a policy for the MDP $\widetilde{\mathcal{M}}^i$. Formally,

$$\widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, a^i) = \widetilde{R}(s, a^i) + \mathbb{E}_{\pi^i} \left[ \sum_{t > 0} \gamma^t \big( \widetilde{R}_t + \alpha H_t \big) \right],$$

where $\widetilde{R}_t = \widetilde{R}(s_t, a^i_t)$ and $H_t = H\big(\pi^i(\cdot \mid s_t)\big)$. Then we have

$$\widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ Q^{\pi,i}_{\mathrm{soft}}(s, a^{-i}, a^i) \big],$$

where $\pi = \pi^{-i} \pi^i$ and $Q^{\pi,i}_{\mathrm{soft}}$ is the soft Q-value function of $\pi$ for agent $i$ in the Markov game $\mathcal{M}$.

Proof. The proof follows immediately from the definition of $\widetilde{\mathcal{M}}^i$ given above.

Theorem 4.7 (Soft Policy Improvement Theorem). Let $\pi^i$ be a policy for agent $i$, let $\pi^{-i}$ be a joint policy for the other agents, and let $\pi^i_{\mathrm{new}} = \mathrm{SPI}_{\pi^{-i}}(\pi^i)$ as defined in (6). Then for every $a^i \in A^i$ we have

$$Q^{\pi_{\mathrm{new}},i}_{\mathrm{soft}}(s, a^i) \ge Q^{\pi,i}_{\mathrm{soft}}(s, a^i),$$

where $Q^{\pi,i}_{\mathrm{soft}}(s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ Q^{\pi,i}_{\mathrm{soft}}(s, a^{-i}, a^i) \big]$ (and analogously for $\pi_{\mathrm{new}}$), and $\pi_{\mathrm{new}} = \pi^i_{\mathrm{new}} \pi^{-i}$.

Proof. As explained above, when the policies $\pi^{-i}$ of the other agents are held fixed, which is guaranteed by Assumption 4.5, the Markov game reduces to an MDP $\widetilde{\mathcal{M}}^i$ for agent $i$. Therefore the proof follows directly from (6) and Theorem 4.1.

Remark 4.8. As a consequence of Theorem 4.7, $\pi^i_{\mathrm{new}}(a^i \mid s)$ is the greedy improvement for agent $i$ w.r.t. the opponents' joint policy $\pi^{-i}$, namely

$$\pi^i_{\mathrm{new}} = \arg\max_{\tilde{\pi}^i}\; \mathbb{E}_{a^{-i} \sim \pi^{-i},\, a^i \sim \tilde{\pi}^i(\cdot \mid s)}\big[ Q^{\pi,i}_{\mathrm{soft}}(s, a^{-i}, a^i) \big] + \alpha H\big( \tilde{\pi}^i(\cdot \mid s) \big).$$

In other words, the soft policy update is guaranteed to be an improvement for agent $i$ in the case where the other agents do not change their policy $\pi^{-i}$. However, in our setting we assume all the agents update their policies simultaneously; therefore there is no guarantee that the policy update is an actual improvement.

4.4. Invariance Under Reward Shaping

The classical IRL problem is ill-posed (Ng et al., 2000); that is, the solution is not unique, as several different reward functions explain the behavior of an optimal agent. Similar difficulties arise in the LfL setting (Jacq et al., 2019) because the single-agent Soft Policy Iteration algorithm is invariant under reward shaping. Naturally, the multi-agent setting inherits the same issue. More precisely, if we transform the reward function of an agent $i$ by adding a shaping, then the soft policy improvement in Eq. (6) remains the same. A function $\mathrm{sh}\colon S \times A^1 \times \dots \times A^N \to \mathbb{R}$ is called a shaping if there exists a function $g\colon S \to \mathbb{R}$ such that

$$\mathrm{sh}(s, a) = g(s) - \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ g(s') \big].$$

Lemma 4.9 (SPI invariance under shaping). Let $R^i_1\colon S \times A^1 \times \dots \times A^N \to \mathbb{R}$ and $R^i_2\colon S \times A^1 \times \dots \times A^N \to \mathbb{R}$ be two reward functions for agent $i$ such that for every $s \in S$ and $a = (a^1, \dots, a^N) \in A^1 \times \dots \times A^N$:

$$R^i_1(s, a) = R^i_2(s, a) + \mathrm{sh}(s, a).$$

Let $\mathrm{SPI}^1_{\pi^{-i}}$ and $\mathrm{SPI}^2_{\pi^{-i}}$ be the soft policy improvement operators induced by $R^i_1$ and $R^i_2$, respectively. Then for every policy $\pi^i$, $\mathrm{SPI}^1_{\pi^{-i}}(\pi^i) = \mathrm{SPI}^2_{\pi^{-i}}(\pi^i)$.

Proof. The proof is a simple extension of the proofs of Lemma 1 and Theorem 1 in (Jacq et al., 2019).

Algorithm 1 Multi-agent Soft Policy Iteration (MA-SPI)
  Initialize $\pi^i$ as a uniformly random policy, for $i = 1, \dots, N$
  for $h = 1$ to $H$ do
    Initialize $\widetilde{Q}^{\pi^i}_{\mathrm{soft}} \leftarrow 0$, for $i = 1, \dots, N$
    for each episode do
      $t \leftarrow 0$; $s_0 \sim P_0$
      while $s_t$ not terminal do
        Each agent $i$ chooses $a^i_t \sim \pi^i(\cdot \mid s_t)$
        Each agent $i$ observes $R^i(s_t, a_t)$
        $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
        Each agent $i$ chooses $a^i_{t+1} \sim \pi^i(\cdot \mid s_{t+1})$
        Each agent $i$ updates $\widetilde{Q}^{\pi^i}_{\mathrm{soft}}$ according to Eq. (5)
        $t \leftarrow t + 1$
      end while
    end for
    Each agent $i$ simultaneously updates $\pi^i \leftarrow \mathrm{SPI}(\pi^i)$ using Eq. (6)
  end for
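The following compact sketch (ours, under the assumption that the transition tensor and reward tables are known) shows one MA-SPI round for two tabular agents; it performs the evaluation step by dynamic programming rather than by the sample-based TD update of Eq. (5), which is equivalent in expectation, and then applies the improvement of Eq. (6) simultaneously.

```python
import numpy as np

# Sketch (ours): one MA-SPI round for two tabular agents with known dynamics.

def softmax(x, alpha):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / alpha)
    return z / z.sum(axis=-1, keepdims=True)

def ma_spi_round(T, R, policies, alpha=0.3, gamma=0.9, eval_iters=300):
    """T[s,a1,a2,s']; R = (R1[s,a1,a2], R2[s,a1,a2]); policies = (pi1[s,a1], pi2[s,a2])."""
    new_policies = []
    for i, (R_i, pi_i) in enumerate(zip(R, policies)):
        pi_other = policies[1 - i]
        # Marginalize the opponent out (Definition 4.3).
        if i == 0:
            P_t = np.einsum("sabt,sb->sat", T, pi_other)
            R_t = np.einsum("sab,sb->sa", R_i, pi_other)
        else:
            P_t = np.einsum("sabt,sa->sbt", T, pi_other)
            R_t = np.einsum("sab,sa->sb", R_i, pi_other)
        # Policy evaluation: soft Bellman backups for Q_tilde (cf. Eq. (5)).
        Q = np.zeros_like(R_t)
        for _ in range(eval_iters):
            H = -(pi_i * np.log(pi_i + 1e-12)).sum(axis=1)
            V = (pi_i * Q).sum(axis=1) + alpha * H
            Q = R_t + gamma * P_t @ V
        # Policy improvement: pi_new(a|s) proportional to exp(Q(s,a)/alpha), Eq. (6).
        new_policies.append(softmax(Q, alpha))
    return tuple(new_policies)
```

Because the new policies are collected before being returned, both agents improve against the old opponent policy, matching the simultaneous update of Assumption 4.5.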
5. Multi-Agent Learning from Learners

In this section, we explain how each agent $i$ can recover an estimate $\hat{R}^j_i$ of the reward function $R^j$ of each other agent $j$ after a certain number of MA-SPI steps, as described in Section 4. We call this algorithm Multi-Agent Learning from a Learner (MA-LfL), as it is a multi-agent extension of the LfL algorithm developed in (Jacq et al., 2019). The pseudocode is given in Algorithm 2.

The core of the proposed MA-LfL algorithm is the theorem below, which states the following: from the observation of one soft policy improvement for an agent $i$, namely from observing two consecutive policies $\pi^i$ and $\mathrm{SPI}_{\pi^{-i}}(\pi^i)$, it is possible to recover the expectation w.r.t. $\pi^{-i}$ of the reward function $R^i$, up to a shaping.

Theorem 5.1 (Recovering the reward up to shaping). Let $\pi^{-i}$ be a joint policy for all the agents except $i$, let $\pi^i$ be a policy for agent $i$, and let $\pi^i_{\mathrm{new}} = \mathrm{SPI}_{\pi^{-i}}(\pi^i)$ be the soft policy improvement given by Eq. (6). Then

$$\mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ \bar{R}^i(s, a^{-i}, a^i) \big] = \alpha \ln \pi^i_{\mathrm{new}}(a^i \mid s) + \alpha \gamma\, \mathbb{E}_{\substack{a^{-i} \sim \pi^{-i} \\ s' \sim P(\cdot \mid s, a^{-i}, a^i)}}\big[ D_{\mathrm{KL}}\big( \pi^i(\cdot \mid s') \,\|\, \pi^i_{\mathrm{new}}(\cdot \mid s') \big) \big], \qquad (7)$$

where $\bar{R}^i(s, a^{-i}, a^i) = R^i(s, a^{-i}, a^i) + \mathrm{sh}(s, a^{-i}, a^i)$ and $\mathrm{sh}\colon S \times A^1 \times \dots \times A^N \to \mathbb{R}$ is a shaping.

Proof. As in the proof of Theorem 4.7, if $\pi^{-i}$ remains fixed, then for agent $i$ the Markov game $\mathcal{M}$ reduces to the Markov decision process $\widetilde{\mathcal{M}}^i$ defined in Section 4.1. From Lemma 4.6, we have that $\pi^i_{\mathrm{new}}(\cdot \mid s) \propto \exp\big( \tfrac{1}{\alpha} \widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, \cdot) \big)$. Therefore, using Theorem 4.2, we can recover the reward $\widetilde{R}$ of $\widetilde{\mathcal{M}}^i$ up to a shaping. Formally,

$$\bar{\widetilde{R}}(s, a^i) = \alpha \ln \pi^i_{\mathrm{new}}(a^i \mid s) + \alpha \gamma\, \mathbb{E}_{s' \sim \widetilde{P}(\cdot \mid s, a^i)}\big[ D_{\mathrm{KL}}\big( \pi^i(\cdot \mid s') \,\|\, \pi^i_{\mathrm{new}}(\cdot \mid s') \big) \big], \qquad (8)$$

$$\bar{\widetilde{R}}(s, a^i) = \widetilde{R}(s, a^i) + g(s) - \gamma\, \mathbb{E}_{s' \sim \widetilde{P}}\big[ g(s') \big], \qquad (9)$$

for some function $g\colon S \to \mathbb{R}$. From the definition of the Markov decision process $\widetilde{\mathcal{M}}^i$ in Section 4.1 and Eq. (9), we rewrite Eq. (8) as

$$\mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ \bar{R}^i(s, a^{-i}, a^i) \big] = \alpha \ln \pi^i_{\mathrm{new}}(a^i \mid s) + \alpha \gamma\, \mathbb{E}_{\substack{a^{-i} \sim \pi^{-i} \\ s' \sim P(\cdot \mid s, a)}}\big[ D_{\mathrm{KL}}\big( \pi^i(\cdot \mid s') \,\|\, \pi^i_{\mathrm{new}}(\cdot \mid s') \big) \big].$$

Remark 5.2. As mentioned in Remark 4.8, the MA-SPI algorithm is not guaranteed to improve the agents' policies. However, our reward-recovering MA-LfL algorithm only relies on the way the agents update their policies; it is not affected by whether the agents are actually improving.

5.1. Estimating Other Agents' Policies

Theorem 5.1 allows each agent to extract information about the reward functions of the other agents given their policies. In practice, agents can only observe the actions of each other; therefore, to apply Theorem 5.1, they must learn the policies from the observed trajectories. Every agent uses the entropy-regularized maximum likelihood estimation (MLE) method to estimate the other agents' policies from the observed trajectories. Let $\pi = \prod_{i=1}^N \pi^i$ be a joint policy for the agents, and let $D = \{\tau_1, \dots, \tau_K\}$ represent a set of trajectories produced by $\pi$. Each trajectory $\tau_k$ is a sequence of states and actions $\tau_k = \{s_{k,0}, a_{k,0}, s_{k,1}, a_{k,1}, \dots\}$. Each agent $j$ learns a parameterized approximation $\hat{\pi}^i_j = \hat{\pi}^i_{\theta^i_j}$ of the policy $\pi^i$ of agent $i$ by maximizing the entropy-regularized likelihood

$$L(\theta^i_j) = \sum_{k=1}^{K} \sum_{(s, a^i) \in \tau_k} \Big[ \ln \hat{\pi}^i_j(a^i \mid s) + \lambda H\big( \hat{\pi}^i_j(\cdot \mid s) \big) \Big], \qquad (10)$$

where $\lambda > 0$ is the entropy regularization parameter.
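A minimal tabular sketch of the policy-estimation step follows; it is our own illustration in the spirit of Eq. (10), not the authors' implementation (which uses parameterized state-action models optimized with Adam), and the function name and hyperparameters are hypothetical.

```python
import numpy as np

# Sketch (ours): tabular entropy-regularized maximum likelihood estimation of
# another agent's policy, cf. Eq. (10). The estimate is parameterized by
# logits theta[s, :]; we ascend the per-state average log-likelihood plus
# lambda times the entropy, with hand-derived gradients.

def estimate_policy(counts, lam=0.1, lr=0.5, steps=2000):
    """counts[s, a]: how often action a of the observed agent occurred in state s."""
    n_s, n_a = counts.shape
    freq = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    theta = np.zeros((n_s, n_a))
    for _ in range(steps):
        p = np.exp(theta - theta.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        H = -(p * np.log(p + 1e-12)).sum(axis=1, keepdims=True)
        grad_ll = freq - p                       # gradient of the log-likelihood term
        grad_H = -p * (np.log(p + 1e-12) + H)    # gradient of the entropy bonus
        theta += lr * (grad_ll + lam * grad_H)
    p = np.exp(theta - theta.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)
```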
5.2. Estimating Rewards from Trajectories

Now we discuss how each agent learns the rewards of the other agents after a specific number of MA-SPI steps. Let $\{\pi_0, \pi_1, \dots, \pi_H\}$ be the $H + 1$ joint policies for the $N$ agents obtained while performing the MA-SPI algorithm (Algorithm 1) for $H$ rounds; i.e., for every $h = 1, \dots, H$, let $\pi_h = \mathrm{SPI}(\pi_{h-1})$. Moreover, let $D_h$ be the set of trajectories produced by the agents with the joint policy $\pi_h$ during the $h$-th MA-SPI step:

$$\pi_0 \xrightarrow{\ \mathrm{SPI}\ } \pi_1 \xrightarrow{\ \mathrm{SPI}\ } \dots \xrightarrow{\ \mathrm{SPI}\ } \pi_H,$$

with each joint policy $\pi_h$ generating the trajectory set $D_h$.

Let $\hat{R}^i_j = \hat{R}^i_{\phi^i_j}$ be a parametrization of the reward $R^i$ that agent $j$ attempts to learn. As explained in Section 5.1, agent $j$ can learn $\{\pi^i_0, \pi^i_1, \dots, \pi^i_H\}$ from $\{D_0, D_1, \dots, D_H\}$, respectively. From those learned policies, agent $j$ computes $H$ targets $\{Y^i_1, \dots, Y^i_H\}$ according to Theorem 5.1, defined as

$$Y^i_h(s, a^i) = \alpha \ln \pi^i_h(a^i \mid s) + \alpha \gamma\, \mathbb{E}_{\substack{a^{-i} \sim \pi^{-i} \\ s' \sim P(\cdot \mid s, a)}}\big[ D_{\mathrm{KL}}\big( \pi^i_{h-1}(\cdot \mid s') \,\|\, \pi^i_h(\cdot \mid s') \big) \big], \qquad (11)$$

for every $h = 1, \dots, H$. Recall that from each improvement $\pi^i_{h-1} \xrightarrow{\mathrm{SPI}} \pi^i_h$, Theorem 5.1 allows inferring the expectation of $R^i + \mathrm{sh}_h$, where $\mathrm{sh}_h$ is a shaping function. Observe that we use the index $h$ because for different improvements we might have different shapings. Since $\mathrm{sh}_h$ is a shaping, by definition (see Section 4.4) there exists a function $g_h\colon S \to \mathbb{R}$ such that

$$\mathrm{sh}_h(s, a) = g_h(s) - \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ g_h(s') \big].$$

Definition 5.3. Let $g_{\psi_h}$ be a parametrization of $g_h$. We define the loss function for the parameters $\phi^i_j$ of $\hat{R}^i_j = \hat{R}^i_{\phi^i_j}(s, a^{-i}, a^i)$ as

$$L^i_j(\phi^i_j) = \min_{\psi_1, \dots, \psi_H} \sum_{h=1}^{H} \sum_{(s, a, s') \in D_h} \big( \hat{R}^i_j(s, a^{-i}, a^i) + \mathrm{sh}_{\psi_h}(s, s') - Y^i_h(s, a^i) \big)^2, \qquad (12)$$

where $\mathrm{sh}_{\psi_h}(s, s') = g_{\psi_h}(s) - \gamma g_{\psi_h}(s')$.

Remark 5.4. The optimization of the loss function above is directly affected by the success of the policy inference, because the target values $Y^i_h$ are produced by the inferred policies.

Algorithm 2 Multi-agent Learning from a Learner (MA-LfL)
  Run Algorithm 1 and generate the sets of trajectories $\{D_h\}_{h=1}^H$ using $\{\pi_h\}_{h=1}^H$
  for $h = 1$ to $H$ do
    for each pair of agents $j, i = 1, \dots, N$, $j \ne i$ do
      Agent $j$ learns the estimate $\hat{\pi}^i_h$ from $D_h$ via Eq. (10)
      Agent $j$ computes the targets $Y^i_h$ using Eq. (11)
    end for
  end for
  Each agent $j$ computes $\hat{R}^i_j$ via Eq. (12)
  Return $\hat{R}^i_j$ for each $j, i = 1, \dots, N$, $j \ne i$

5.3. Semi-online MA-LfL

In the previous section, we explained how agents learn the reward functions of the other agents in an offline manner, from a collection of sets of trajectories generated during a number of MA-SPI iterations. However, MA-LfL can also be performed semi-online, meaning that each agent maintains an estimate of the reward functions of the opponents which is updated after each MA-SPI step. Since entropy-regularized maximum likelihood estimation can be performed online, each agent can learn the policies of the opponents in a streaming manner during each MA-SPI step. Now, consider the $h$-th MA-SPI step. Agent $j$ updates the parameters $\phi^i_j$ of $\hat{R}^i_j$ using the gradient $\nabla L^j_h$, as in mini-batch gradient descent, where $L^j_h$ is the loss function

$$L^j_h(\phi^i_j) = \sum_{(s, a, s') \in D_h} \big( \hat{R}^i_j(s, a^{-i}, a^i) + \mathrm{sh}_{\psi_h}(s, s') - Y^i_h(s, a^i) \big)^2,$$

where $\mathrm{sh}_{\psi_h}(s, s') = g_{\psi_h}(s) - \gamma g_{\psi_h}(s')$.

After the $h$-th iteration of MA-SPI, to predict the next soft policy improvement of the opponents using (6), each agent $j$ has to estimate $\widetilde{Q}^{\pi^i}_{\mathrm{soft}}(s, a^i) = \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ Q^{\pi,i}_{\mathrm{soft}}(s, a^{-i}, a^i) \big]$ for each other agent $i$. That is doable with an offline version of TD-learning or Monte Carlo (Sutton & Barto, 2018) from the trajectories in $D_h$, using the current estimates $\hat{R}^i_j$ and $\hat{\pi}^i$ instead of $R^i$ and $\pi^i$.

Remark 5.5. When performing MA-LfL online, the agents must assess the quality of their current estimates of the other agents' reward functions. One way to do so is to use the current reward estimates to predict future soft policy improvements, followed by observing the actual ones. If the agents have valid estimates of the reward functions, their predictions of the soft policy improvements will be close to the actual ones. We provide more details on how to bound the reward estimation error in Section 6.
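To illustrate the reward-recovery step end to end, the sketch below (ours, in a fully tabular two-agent setting with known dynamics) computes the targets of Eq. (11) from two consecutive policies and then fits a tabular reward plus a shaping potential by minimizing the squared loss of Eq. (12) with plain gradient descent; the paper instead uses parameterized models optimized with Adam.

```python
import numpy as np

# Sketch (ours): targets of Eq. (11) and the fit of Eq. (12), tabular case.

def recovery_targets(T, pi_old, pi_new, pi_other, alpha=0.3, gamma=0.9):
    """Y[s, a1] = alpha*ln pi_new(a1|s) + alpha*gamma*E_{a2,s'}[KL(pi_old || pi_new)]."""
    kl = (pi_old * (np.log(pi_old + 1e-12) - np.log(pi_new + 1e-12))).sum(axis=1)
    P_t = np.einsum("sabt,sb->sat", T, pi_other)          # marginal next-state dist.
    return alpha * np.log(pi_new + 1e-12) + alpha * gamma * np.einsum("sat,t->sa", P_t, kl)

def fit_reward(samples, Y, n_s, n_a1, n_a2, gamma=0.9, lr=0.05, steps=5000):
    """samples: list of (s, a1, a2, s_next) tuples taken from the trajectories."""
    R_hat = np.zeros((n_s, n_a1, n_a2))
    g = np.zeros(n_s)                                     # shaping potential g
    for _ in range(steps):
        for (s, a1, a2, s2) in samples:
            err = R_hat[s, a1, a2] + g[s] - gamma * g[s2] - Y[s, a1]
            R_hat[s, a1, a2] -= lr * err                  # squared-loss gradients
            g[s] -= lr * err
            g[s2] += lr * gamma * err
    return R_hat, g
```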
6. Error Bound Analysis

In this section we provide a bound on the error of the recovered reward functions in terms of the policy improvement prediction error in Theorem 6.1 and Theorem 6.3. Conversely, we also provide a bound on the policy improvement prediction error in terms of the error of the recovered rewards in Theorem 6.5. In Appendix A we state and prove the single-agent version of these results for the LfL framework of (Jacq et al., 2019), and in Appendix B we extend the proofs to our multi-agent setting.

6.1. Reward Recovery Error Bound

Theorem 6.1. Let $R^i$ be the reward function of agent $i$ and let $\hat{R}^i_j$ be the estimate of $R^i$ learned by agent $j$. Let $\pi^i$ be a policy for agent $i$ and $\pi^{-i}$ a joint policy for the other agents. Let $\pi^i_{\mathrm{new}} = \mathrm{SPI}_{\pi^{-i}}(\pi^i)$ be the soft policy improvement as defined in Theorem 4.7, and let $\hat{\pi}^i_{\mathrm{new}}$ be the soft policy improvement predicted by agent $j$ using $\hat{R}^i_j$, namely $\hat{\pi}^i_{\mathrm{new}} \propto \exp\big( \tfrac{1}{\alpha}\, \mathbb{E}_{\pi^{-i}}\big[ Q^{\pi, i, \hat{R}^i_j}_{\mathrm{soft}} \big] \big)$. If

$$\sup_{a^i \in A^i,\, s \in S} \big| \ln \pi^i_{\mathrm{new}}(a^i \mid s) - \ln \hat{\pi}^i_{\mathrm{new}}(a^i \mid s) \big| < \delta,$$

then there exists a shaping $\mathrm{sh}$ such that for every $s \in S$ and $a^i \in A^i$,

$$\Big| \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ \hat{R}^i_j(s, a^{-i}, a^i) - (R^i + \mathrm{sh})(s, a^{-i}, a^i) \big] \Big| < \varepsilon,$$

where $\varepsilon = \delta \alpha (1 + \gamma)$, $\alpha$ is the entropy coefficient and $\gamma$ is the discount factor.

Proof. See Appendix B.

Corollary 6.2. Consider the case in which $R^i$ and $\hat{R}^i$ are state-only dependent reward functions. If

$$\sup_{a^i \in A^i,\, s \in S} \big| \ln \pi^i_{\mathrm{new}}(a^i \mid s) - \ln \hat{\pi}^i_{\mathrm{new}}(a^i \mid s) \big| < \delta,$$

then there exists a shaping $\mathrm{sh}\colon S \times A^i \to \mathbb{R}$ that depends only on the state and on the actions of agent $i$ such that

$$\sup_{s \in S,\, a^i \in A^i} \big| \hat{R}^i_j(s) - \big( R^i(s) + \mathrm{sh}(s, a^i) \big) \big| < \varepsilon,$$

where $\varepsilon = \delta \alpha (1 + \gamma)$, $\alpha$ is the entropy coefficient and $\gamma$ is the discount factor.

Proof. Follows directly from Theorem 6.1.

In the special case of state-only dependent reward functions, we can provide another error bound that depends on the KL-divergence between the predicted and the actual soft policy improvement.

Theorem 6.3. Let us assume the reward function $R^i$ of agent $i$ and its estimate $\hat{R}^i_j$ maintained by agent $j$ to be state-only dependent. Let $\pi^i$ be a policy for agent $i$, $\pi^i_{\mathrm{new}} = \mathrm{SPI}_{\pi^{-i}}(\pi^i)$ its soft policy improvement, and $\hat{\pi}^i_{\mathrm{new}}$ the soft policy improvement predicted by agent $j$. Let us assume $A^i$ to be a finite set and let $|A^i|$ be its cardinality. If

$$\sup_{s} D_{\mathrm{KL}}\big( \pi^i_{\mathrm{new}}(\cdot \mid s) \,\|\, \hat{\pi}^i_{\mathrm{new}}(\cdot \mid s) \big) < \delta,$$

then there exists a shaping $\mathrm{sh}\colon S \times A^i \to \mathbb{R}$, depending only on states and on the actions of agent $i$, such that

$$\sup_{s \in S} \Big| \hat{R}^i_j(s) - \mathbb{E}_{a^i \sim \pi^i_{\mathrm{new}}(\cdot \mid s)}\big[ R^i(s) + \mathrm{sh}(s, a^i) \big] \Big| < \varepsilon, \qquad \varepsilon = \delta \alpha \Big( 1 + \gamma |A^i| e^{\frac{\Delta^i}{\alpha(1-\gamma)}} \Big),$$

where $\alpha$ is the entropy coefficient, $\gamma$ is the discount factor and $\Delta^i$ is the maximum gap of $R^i$, namely $\Delta^i = \sup_{s \in S} R^i(s) - \inf_{s \in S} R^i(s)$.

Proof. See Appendix B.

Remark 6.4. The assumptions in Theorem 6.3 on the finiteness of the action space $A^i$ and on $R^i$ being a function of the state only are not so restrictive. Moreover, the error bound on the reward estimation error is expressed in terms of a bound on the KL-divergence, which can be estimated in practice by agent $j$.

6.2. Policy Improvement Prediction Error Bound

The recovered rewards allow the agents to predict the soft policy improvements of each other. The following theorem provides a bound on the KL-divergence between the actual improvement and the predicted improvement.

Theorem 6.5. Let $R^i$ be the reward function of agent $i$ and let $\hat{R}^i_j$ be an estimate recovered by agent $j$. Let $\pi^i_{\mathrm{new}} = \mathrm{SPI}_{\pi^{-i}}(\pi^i)$ be the actual policy improvement of the policy $\pi^i$, and $\hat{\pi}^i_{\mathrm{new}}$ the soft policy improvement predicted by agent $j$ using $\hat{R}^i_j$. Let $\delta > 0$ be such that for all $s \in S$, $a^i \in A^i$,

$$\Big| \mathbb{E}_{a^{-i} \sim \pi^{-i}}\big[ \hat{R}^i_j(s, a) - (R^i + \mathrm{sh})(s, a) \big] \Big| < \delta$$

for a shaping $\mathrm{sh}$. Then

$$\sup_{s \in S} D_{\mathrm{KL}}\big( \pi^i_{\mathrm{new}}(\cdot \mid s) \,\|\, \hat{\pi}^i_{\mathrm{new}}(\cdot \mid s) \big) < \varepsilon, \qquad \varepsilon = \delta \left( \frac{1}{\alpha(1-\gamma)} + |A^i| \right),$$

where $\alpha$ is the entropy coefficient and $\gamma$ is the discount factor.

Proof. See Appendix B.
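As a quick sanity check on the magnitudes involved, the trivial sketch below (ours) plugs hypothetical numbers into the two bounds with the simplest constants, Theorem 6.1 and Theorem 6.5; $\alpha$, $\gamma$, and $|A^i|$ are taken from the experimental setup of Section 7 and Appendix C, while $\delta = 0.05$ is an arbitrary illustrative value.

```python
# Sketch (ours): numeric illustration of the bounds in Theorems 6.1 and 6.5.

def reward_error_bound(delta, alpha=0.3, gamma=0.9):
    # Theorem 6.1: eps = delta * alpha * (1 + gamma)
    return delta * alpha * (1 + gamma)

def policy_prediction_bound(delta, n_actions=5, alpha=0.3, gamma=0.9):
    # Theorem 6.5: eps = delta * (1 / (alpha * (1 - gamma)) + |A^i|)
    return delta * (1 / (alpha * (1 - gamma)) + n_actions)

print(reward_error_bound(0.05), policy_prediction_bound(0.05))
```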
7. Experiments

We test MA-LfL experimentally in a 3×3 deterministic grid-world environment. The agents always start at the top-left cell and try to reach the bottom-right cell. Our experimental setting involves two agents, i.e., $N = 2$. We emphasize that our theoretical results hold regardless of the number of agents. We assume the transition function is deterministic and known. The action space includes five actions: move up, down, left, or right, or stay. We use two different reward functions in order to demonstrate that our algorithm achieves reward recovery in general-sum games: $M_{\mathrm{hom}}$, a homogeneous reward function based on the Manhattan disjoint distance, Eq. (13), and $M_{\mathrm{het}}$, a heterogeneous reward function combining the Manhattan joint and disjoint distances, Eq. (14).

Definition 7.1. Let $p_g = (x_g, y_g)$ be the goal location and let $p_i(t) = (x_i, y_i)$ and $p_j(t) = (x_j, y_j)$ be the positions of agent $i$ and agent $j$ at time $t$. Then we define, for agents $i = 1, 2$ (with $j$ denoting the other agent),

$$M^i_{\mathrm{hom}}(t) = -\lVert p_i(t) - p_g \rVert_1 + \lVert p_i(t) - p_j(t) \rVert_1, \qquad (13)$$

$$M^i_{\mathrm{het}}(t) = \begin{cases} -\lVert p_i(t) - p_g \rVert_1 - \lVert p_i(t) - p_j(t) \rVert_1 & \text{for Agent \#1,} \\ -\lVert p_i(t) - p_g \rVert_1 + \lVert p_i(t) - p_j(t) \rVert_1 & \text{for Agent \#2.} \end{cases} \qquad (14)$$

Figure 2. (a) The grid world: agents start at the top-left cell and the goal location is the bottom-right cell. Every time the agents arrive at the goal location, their states are reset to the start cell. (b) Example reward calculation: purple lines indicate the Manhattan distance between the two agents, and green lines indicate the Manhattan distance between Agent #2 and the goal location.

In the $M_{\mathrm{hom}}$ setting, the agents try to minimize the distance between themselves and the goal while at the same time trying to stay as far away from each other as possible. In $M_{\mathrm{het}}$, similarly, both agents try to minimize the distance between themselves and the goal; however, while one agent tries to stay as close to the other as possible, the other agent tries to stay away.

We measure the performance of MA-LfL by computing the correlation between the recovered rewards and the ground-truth ones. In all our experiments, the agents have no access to the other agents' policies or rewards, and they use state-action models for estimating the reward functions. However, since all the experiments are simulations, we have access to the ground-truth reward functions, which we use for the evaluation. We use two statistical correlation metrics, Pearson's correlation coefficient (PCC) for linear correlation and Spearman's correlation coefficient (SCC) for rank correlation, to compare the estimated reward functions with the actual rewards. In our experiments, we demonstrate the reward recovery that MA-LfL achieves under MA-SPI in both the heterogeneous and the homogeneous reward cases. We present our results in Table 1.

| Metric | $M_{\mathrm{hom}}$ | $M_{\mathrm{het}}$ |
|---|---|---|
| PCC #1 | 0.48 ± 0.06 | 0.45 ± 0.04 |
| PCC #2 | 0.59 ± 0.02 | 0.42 ± 0.02 |
| $\hat{P}$ | 0.54 ± 0.03 | 0.44 ± 0.01 |
| SCC #1 | 0.44 ± 0.14 | 0.51 ± 0.02 |
| SCC #2 | 0.60 ± 0.04 | 0.43 ± 0.03 |
| $\hat{S}$ | 0.52 ± 0.06 | 0.47 ± 0.01 |

Table 1. Pearson's correlation coefficients (PCC) and Spearman's correlation coefficients (SCC) between the true and estimated reward functions for Agent 1 and Agent 2. $\hat{P}$ and $\hat{S}$ are the PCC and SCC scores averaged over both agents. Mean and variance are taken over experiments with different random seeds.

Figure 3. The quality of the recovered rewards grows at a logarithmic rate as the agents improve their policies.

In our experiments, we observed that the agents were able to recover rewards using only 10 iterations in a 3×3 grid world.
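The sketch below (ours) spells out the Manhattan-distance rewards of Definition 7.1, under the sign convention reconstructed above (distances to the goal are penalized), together with the PCC/SCC metrics used to score recovered rewards; the goal coordinate assumes 0-indexed cells in the 3×3 grid.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Sketch (ours): rewards of Eqs. (13)-(14) and the correlation metrics.

GOAL = np.array([2, 2])  # bottom-right cell of the 3x3 grid (0-indexed)

def manhattan_disjoint(p_i, p_j):
    # Eq. (13): reach the goal while staying away from the other agent.
    return -np.abs(p_i - GOAL).sum() + np.abs(p_i - p_j).sum()

def manhattan_joint(p_i, p_j):
    # Eq. (14), Agent #1 case: reach the goal while staying close to the other agent.
    return -np.abs(p_i - GOAL).sum() - np.abs(p_i - p_j).sum()

def correlation_scores(true_rewards, estimated_rewards):
    pcc, _ = pearsonr(true_rewards, estimated_rewards)
    scc, _ = spearmanr(true_rewards, estimated_rewards)
    return pcc, scc
```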
As a baseline for the correlation coefficients, we calculated the cross-correlation between the estimated joint and disjoint rewards and the disjoint and joint ground truths, respectively. The results are given in Table 2.

| Ground truth | Estimated: Manh. Disjoint | Estimated: Manh. Joint |
|---|---|---|
| Manhattan Disjoint | PCC: 0.55, SCC: 0.57 | PCC: 0.40, SCC: 0.37 |
| Manhattan Joint | PCC: 0.32, SCC: 0.33 | PCC: 0.47, SCC: 0.51 |

Table 2. Cross-correlation between the ground truths and the estimates of the two reward functions. All correlations are positive due to the very similar structure of the two rewards; however, the correlations between the recovered rewards and their corresponding ground truths are higher.

8. Discussion on Generalization to Different Frameworks

Even though reward recovery in MA-LfL is based on the assumption that the agents use MA-SPI to optimize their policies, MA-LfL could potentially be used when agents optimize their policies with different models, as demonstrated in the single-agent case (Jacq et al., 2019). We expect MA-LfL to perform well with learning frameworks that share similar characteristics with SPI. SPI is an on-policy algorithm, which makes it easier for an observing agent to infer the policies from trajectories generated with a fixed policy. Off-policy algorithms such as SAC for continuous environments and Soft Q-learning for discrete environments might be preferable to practitioners because of their sample efficiency, but the constant updating of the policies after each step requires some care to compensate for potential errors in inferring the policies of the other agents from the generated trajectories. Another important characteristic of SPI is that it optimizes a stochastic policy, which encourages exploration while the agents optimize their own policies, especially in sparse-reward cases. Since PPO maintains both characteristics, it would be reasonable to expect MA-LfL to perform well under this learning framework as an alternative to SPI in continuous and high-dimensional environments.

9. Conclusion

We propose MA-LfL, a multi-agent algorithm that enables inverse reinforcement learning in an entropy-regularized reinforcement learning setting. The input data of our algorithm are trajectories produced by agents that are not assumed to be in any equilibrium, but rather are learning according to a multi-agent soft policy iteration (MA-SPI). The reward functions recovered by MA-LfL in our experiments show high correlation with the ground-truth ones. Some of the potential applications of our MA-LfL algorithm are: imitation learning from multi-agent systems that have not yet reached an equilibrium; enabling MARL algorithms that explicitly use the knowledge of all the agents' reward functions in scenarios where those are not accessible; and promoting fairness or fostering collaboration in social dilemmas such as the Prisoner's Dilemma, by making each agent aware of the other agents' rewards. However, since MA-LfL allows agents to recover rewards only up to a shaping, some care is required, especially in scenarios with many agents. As future work, it would be valuable to study the generalization of MA-LfL to different learning frameworks, and further experimental investigations would provide useful insights. Investigating the scalability and the performance in partially observable scenarios would also be worthwhile.

Acknowledgements

This work was supported by the German Federal Ministry of Education and Research (BMBF) under Grant 01IS20051 and by the Cyber Valley under Grant CyVy-RF-2021-20.
We thank to Glenn Angrabeit for his support on the experiments. We would also like to thank the anonymous ICML reviewers for their valuable feedback on the manuscript. Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty First International Conference on Machine Learning, ICML 04, pp. 1, New York, NY, USA, 2004. Association for Computing Machinery. ISBN 1581138385. doi: 10.1145/1015330.1015430. URL https://doi. org/10.1145/1015330.1015430. Barrett, S., Rosenfeld, A., Kraus, S., and Stone, P. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132 171, 2017. Brown, D., Goo, W., Nagarajan, P., and Niekum, S. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning, pp. 783 792. PMLR, 2019. Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. ar Xiv preprint ar Xiv:1710.11248, 2017. Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352 1361. PMLR, 2017. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861 1870. PMLR, 2018. Multi-Agent Learning from Learners Ho, J. and Ermon, S. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016. Jacq, A., Geist, M., Paiva, A., and Pietquin, O. Learning from a learner. In International Conference on Machine Learning, pp. 2990 2999. PMLR, 2019. Le, H. M., Yue, Y., Carr, P., and Lucey, P. Coordinated multiagent imitation learning. In International Conference on Machine Learning, pp. 1995 2003. PMLR, 2017. Lin, X., Beling, P. A., and Cogill, R. Multiagent inverse reinforcement learning for two-person zero-sum games. IEEE Transactions on Games, 10(1):56 68, 2017. Littman, M. L. Markov games as a framework for multiagent reinforcement learning. In Machine learning proceedings 1994, pp. 157 163. Elsevier, 1994. Natarajan, S., Kunapuli, G., Judah, K., Tadepalli, P., Kersting, K., and Shavlik, J. Multi-agent inverse reinforcement learning. In 2010 ninth international conference on machine learning and applications, pp. 395 400. IEEE, 2010. Ng, A. Y., Russell, S., et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, pp. 2, 2000. Ramponi, G., Drappo, G., and Restelli, M. Inverse reinforcement learning from a gradient-based learner. Advances in Neural Information Processing Systems, 33:2458 2468, 2020. Reddy, T. S., Gopikrishna, V., Zaruba, G., and Huber, M. Inverse reinforcement learning for decentralized noncooperative multiagent systems. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1930 1935, 2012. doi: 10.1109/ICSMC.2012. 6378020. Song, J., Ren, H., Sadigh, D., and Ermon, S. Multi-agent generative adversarial imitation learning. Advances in neural information processing systems, 31, 2018. ˇSoˇsi c, A., Khuda Bukhsh, W. R., Zoubir, A. M., and Koeppl, H. Inverse reinforcement learning in swarm systems. In Proceedings of the 16th Conference on Autonomous Agents and Multi Agent Systems, pp. 1413 1421, 2017. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Tangkaratt, V., Han, B., Khan, M. E., and Sugiyama, M. 
Variational imitation learning with diverse-quality demonstrations. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 9407 9417. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/ v119/tangkaratt20a.html. Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4950 4957, 2018. Yu, L., Song, J., and Ermon, S. Multi-agent adversarial inverse reinforcement learning. In International Conference on Machine Learning, pp. 7194 7201. PMLR, 2019. Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pp. 1433 1438. Chicago, IL, USA, 2008. Multi-Agent Learning from Learners A. Proofs of the error bounds for the single-agent case For the sake of clarity, we start by presenting a single agent version of the theorems in Section 6. In Appendix B, we will discuss how to extend them to the multi-agent case. In this section, we use terminology of Learner and Observer as in (Jacq et al., 2019), where the Learner is the RL agent and the Observer is the IRL algorithm. A.1. Single-agent setting In the single-agent Lf L setting (Jacq et al., 2019), an agent, called the Learner, is learning to solve a Markov Decision Process M = (S, A, P, R, P0, γ) via soft-policy iteration. Namely, the Learner starts with a policy π0 and it subsequantially improves to π1 exp Qπ0 soft (s,a) α , then π1 will be improved to π2 exp Qπ1 soft (s,a) α and so on. The Observer, namely the IRL algorithm, perceives trajectories generated by the policies π0, π1, . . . , πK of the Learner agent and infers its reward function R. A.2. Reward recovery error bounds Let π be a policy for the Learner and let ˆR an estimation of the reward R maintained by the Observer. In the following we denote by Qπ,R soft the actual soft Q function for π, and Qπ, ˆ R soft the soft Q function for π computed w.r.t. ˆR. Namely Qπ,R soft (s, a) = R(s, a) + γEπ t=0 γt(R(st, at) + αH(π( |st)) Qπ, ˆ R soft (s, a) = ˆR(s, a) + γEπ t=0 γt( ˆR(st, at) + αH(π( |st)) Note that the Observer can learn Qπ, ˆ R soft from the trajectories produced by the policy π of the Learner using its ˆR of R. The Observer can then use Qπ, ˆ R soft to predict the future soft policy improvement of the Learner. Theorem A.2 and Theorem A.1 quantify the error of the reward estimation in terms of the soft policy improvement prediction error. Theorem A.1. Let R be the actual reward and let ˆR be an estimation recovered by the Observer. Then if sup a A,s S |ln πnew(a|s) ln ˆπnew(a|s)| < δ, then there exists a shaping sh such that sup a A,s S ˆR(s, a) (R + sh)(s, a) < ε, where ε = δα(1 + γ). Proof. From the definition of soft policy improvement, we have that for every s S and a A ln πnew(a|s) ln ˆπnew(a|s) = 1 Qπ,R soft (s, a) Qπ, ˆ R soft (s, a) + f(s) , where f(s) = ln ˆZ(s) ln Z(s), and Z(s) and ˆZ(s) are the normalizing terms Z(s) = P Qπ,R soft (s, a) α and ˆZ(s) = P Qπ, ˆ R soft (s, a) α . From Lemma 1 in (Jacq et al., 2019), we have that there exists a shaping sh: S A R such that ln πnew(a|s) ln ˆπnew(a|s) = 1 Qπ,R+sh soft (s, a) Qπ, ˆ R soft (s, a) . 
Multi-Agent Learning from Learners From the soft Bellman equations for Qπ,R+sh soft and Qπ, ˆ R, we have ˆR(s, a) (R + sh)(s, a) = Qπ, ˆ R soft (s, a) Qπ,R+sh soft (s, a)] γ E s P a π [Qπ, ˆ R soft (s , a ) Qπ,R+sh soft (s , a ) ln ˆπnew(a|s) ln πnew(a|s) γ E s P a π [ln ˆπnew(a |s ) ln πnew(a |s )] αδ (1 + γ) . ˆR(s, a) (R + sh)(s, a) αδ (1 + γ) . In the same spirit as Theorem A.1, the following theorem provides an error bound on the recovered reward function in terms on the soft policy prediction error. Here the error bound does depend also on the size of the action space |A| and on the gap = sup R inf R. However, instead of assuming a strong bound on the difference between the logarithm of the predicted improvement and the logarithm of the actual one, here it is enough to use a bound on the KL-divergence. Theorem A.2 (Single-agent Lf L). Let R be the actual reward and let ˆR be an estimation recovered by the Observer. Let us assume A to be finite and let = supa A,s S R(s, a) infa A,s S R(s, a). Let π be a policy for the Learner. Let ˆπnew exp Qπ, ˆ R soft α be the soft policy improvement predicted by the Observer. Let δ > 0 be such that sup s S DKL(πnew( |s)|ˆπnew( |s)) < δ, then there exists a shaping sh: S A R such that for every s S |Ea πnew h ˆR(s, a) i Ea πnew [(R + sh)(s, a)] | < ε, ε = δ 1 + |A|e Proof. From the definition of soft policy improvement, we have that for every s S and a A ln πnew(a|s) ln ˆπnew(a|s) = 1 Qπ,R soft (s, a) Qπ, ˆ R soft (s, a) + f(s) , where f(s) = ln ˆZ(s) ln Z(s), and Z(s) and ˆZ(s) are the normalizing terms Z(s) = P Qπ,R soft (s, a) α and ˆZ(s) = P Qπ, ˆ R soft (s, a) α . From Lemma 1 in (Jacq et al., 2019), we have that there exists a shaping sh: S A R such that ln πnew(a|s) ln ˆπnew(a|s) = 1 Qπ,R+sh soft (s, a) Qπ, ˆ R soft (s, a) Multi-Agent Learning from Learners From the Bellman equation for the soft Q function, we have E a πnew[ ˆR(s, a) (R + sh)(s, a)] Qπ,R+sh soft (s, a) Qπ, ˆ R soft (s, a) γEs P ( |s,a) a π( |s ) h Qπ, ˆ R soft (s , a ) Qπ,R+sh soft (s , a ) i# h Qπ, ˆ R soft (s, a) Qπ,R+sh soft (s, a) i + γ E a πnew s P ( |s,a) a π( |s ) h Qπ,R+sh soft (s , a ) Qπ, ˆ R soft (s, a) i = E a πnew[ln πnew(a|s) ln ˆπnew(a|s)] + γ E a πnew s P ( |s,a) a π( |s ) [ln πnew(a |s ) ln ˆπnew(a |s )] = DKL (πnew( |s) ˆπnew( |s)) + γ E a πnew s P (s,a) a π( |s ) [ln πnew(a |s ) ln ˆπnew(a |s )] Let us now analyze the second term on right-hand side. Our goal is to bound it with the expectation w.r.t. to πnew so we can use our assumption on the KL-divergence between πnew and ˆπnew. Ea π( |s )[ln πnew(a |s ) ln ˆπ(a |s )] = a A π(a |s ) (ln πnew(a |s ) ln ˆπnew(a |s )) a A πnew(a |s )(ln πnew(a |s ) ln ˆπnew(a |s )) π(a |s ) πnew(a |s ) α(1 γ) sup s a A πnew(a |s )(ln πnew(a |s ) ln ˆπnew(a |s )) The inequality ( ) follows from the following observation. Observe that π(a |s ) πnew(a |s ) = π(a |s ) X Qπ,R soft (s , a) Qπ,R soft (s ,a ) α |A|e The last inequality follows from the fact that Qπ,R soft (s, a) Qπ,R soft (s , a ) 1 1 γ , for every s, s S and a, a A. A.3. Soft Policy Improvement prediction error bound As discuss in the previous section, the Observer can use the recovered reward to predict the next soft policy improvements of the Learner. Here we prove an error bound of the prediction in terms of the reward estimation error. Theorem A.3 (Single agent Lf L.). Let R be the actual reward and let ˆR be an estimation recovered by the Observer. 
Let πnew is the actual policy improvement of the policy π and ˆπnew is the predicted policy improvement of the actual policy π using recovered reward ˆR. If there exist δ > 0 and a shaping sh: S A R such that sup s S,a A ˆR(s, a) (R + sh)(s, a) < δ then sup s S,a A DKL(πnew|ˆπnew) < ε ε = δ 1 α(1 γ) + |A| Multi-Agent Learning from Learners Proof. For every s S |DKL (πnew( |s)|ˆπnew( |s))| = E a πnew[ln πnew(a|s) ln ˆπ(a|s)] h QR+sh,π soft (s, a) α ln Z(s) Q ˆ R,π soft (s, a) + α ln ˆZ(s) i t=0 γt((R + sh)(st, a) ˆR(st, at)) (ln Z(s) ln ˆZ(s)), where Z(s) and ˆZ(s) are the normalizing factors QR+sh,π soft (s, a) α ˆZ(s) = X Q ˆ R,π soft (s, a) α . Therefore, using the assumption and the property that ln x y = ln x y x , we have |DKL (πnew( |s)|ˆπnew( |s))| δ α(1 γ) + ˆZ(s) Z(s) δ 1 α(1 γ) + |A| B. Proofs of the error bounds in the multi-agent case Here we include the proofs of Theorem 6.1, Theorem 6.3 and Theorem6.5 which are the multi-agent extensions of Theorem A.1, Theorem A.2 and Theorem A.3 in Appendix A. Proof of Theorem 6.1. Similar to the proof of Theorem A.1, we can write ln πnew(ai|s) ln ˆπnew(ai|s) = 1 Ea i π i h Qπ,Ri+sh soft (s, a i, ai) Qπ, ˆ R soft (s, a i, ai) i , (15) for a certain shaping sh: S A1 . . . AN R. Using (15) and the Bellman equation (3), we can write h ˆRi j(s, a i, ai) (Ri + sh)(s, a i, ai) i = α ln ˆπi new(ai|s) ln πnew(ai|s) γ E a i π i s P ( |s,a) ai πi( |s ) ln ˆπi new( ai|s ) ln πnew( ai|s ) αδ(1 + γ). (16) Proof of Theorem 6.3. As in the proof of Theorem A.2, there exists a shaping sh: S Ai R, that depends only on the state and on the action of agent i, such that ln πnew(ai|s) ln ˆπnew(ai|s) = 1 h Qπ,Ri+sh soft (s, a i, ai) Qπ, ˆ R soft (s, a i, ai) i . Multi-Agent Learning from Learners Therefore, following the same idea as in the proof of Theorem A.2, we can write E ai πinew [ ˆRi(s) (Ri + sh)(s, ai)] ln πi new(ai|s) ln ˆπi new(ai|s) + γ E ai πinew E a i π i( |s) s P ( |s,a) ai πi( |s ) [ln πi new( ai|s ) ln ˆπi new( ai|s )] = DKL(πi new( |s) ˆπnew( |s)) + γ E a i π i s P ( |s,a i,ai) E ai πi ln πi new( ai|s ) ln ˆπnew( ai|s ) As in the proof of Theorem A.2, we would like the most inner expectation of the second term in the right-hand side to be w.r.t. πi new, in order to express it in terms of the KL-divergence. To achieve that, we can similarly bound the ratio πi πinew as follows πi(ai|s ) πinew(ai|s ) = π(ai|s ) X Ea i π i [Qπ,Ri soft (s ,a i,ai) Qπ,Ri soft (s ,a i,ai)] where i = sups S Ri(s) infs S Ri(s). This allows us to conclude that E ai πinew h ˆRi(s) (Ri + sh)(s, ai) i δ 1 + γ|Ai|e Proof of Theorem 6.5. Similarly to the proof of Theorem A.3, we have DKL πi new( |s)|ˆπi new( |s) = E ai πinew ln πi new(ai|s) ln ˆπi(ai|s) α E ai πinew h Qπ,Ri+sh soft (s, a i, ai) Qπ, ˆ Ri soft (s, a i, ai) i + α ln ˆZ(s) α ln Z(s) α E ai πinew t=0 γt((Ri + sh)(st, at) ˆR(st, at)) (ln Z(s) ln ˆZ(s)), where Z(s) and ˆZ(s) are the normalizing terms. Following the same argument in the proof of Theorem A.3 we get DKL πi new( |s)|ˆπi new( |s) δ 1 α(1 γ) + |Ai| C. Experiments In this section, we provide the details of our experimental evaluations. We execute all experiments under a Conda environment using Python with a computation unit GPU-2080i and the source code is available at Git Hub 1. In Fig. 4 and Fig. 5 we present the visualizations of heterogeneous and homogeneous reward cases respectively. 
1 https://github.com/melodiCyb/multiagent-learning-from-learners

| Parameter | Value |
|---|---|
| Alpha | 3 |
| Beta | 0.1 |
| Gamma | 0.9 |
| Episode Length | 1000 |
| Iteration # | 10 |
| Episode # | 3000 |
| Entropy Coefficient | 0.3 |
| Adam Learning Rate | 0.1 |
| Adam Epoch # | 10 |
| Reward Adam Epoch # | 1000 |
| Reward Adam Learning Rate | 0.01 |

Table 3. Parameters to reproduce the MA-LfL results for the grid-world scenario of Section 7, Table 1.

Figure 4. $M_{\mathrm{het}}$ case for Agent #1 with MA-SPI: (a) normalized true rewards w.r.t. $M_{\mathrm{het}}$; (b) normalized recovered rewards w.r.t. $M_{\mathrm{het}}$.

Figure 5. $M_{\mathrm{hom}}$ case for Agent #1 with MA-SPI: (a) normalized true rewards w.r.t. $M_{\mathrm{hom}}$; (b) normalized recovered rewards w.r.t. $M_{\mathrm{hom}}$.