On Imitation in Mean-field Games

Giorgia Ramponi¹, Pavel Kolev², Olivier Pietquin³, Niao He⁴, Mathieu Laurière³⁵, Matthieu Geist³
¹ ETH AI Center, Zurich  ² Max Planck Institute for Intelligent Systems, Tübingen, Germany  ³ Google DeepMind  ⁴ ETH Zurich, Department of Computer Science  ⁵ Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning, NYU Shanghai

Abstract

We explore the problem of imitation learning (IL) in the context of mean-field games (MFGs), where the goal is to imitate the behavior of a population of agents following a Nash equilibrium policy according to some unknown payoff function. IL in MFGs presents new challenges compared to single-agent IL, particularly when both the reward function and the transition kernel depend on the population distribution. In this paper, departing from the existing literature on IL for MFGs, we introduce a new solution concept called the Nash imitation gap. Then we show that when only the reward depends on the population distribution, IL in MFGs can be reduced to single-agent IL with similar guarantees. However, when the dynamics is population-dependent, we provide a novel upper bound that suggests IL is harder in this setting. To address this issue, we propose a new adversarial formulation where the reinforcement learning problem is replaced by a mean-field control (MFC) problem, suggesting progress in IL within MFGs may have to build upon MFC.

1 Introduction

Imitation learning (IL) is a popular framework involving an apprentice agent who learns to imitate the behavior of an expert agent by observing its actions and transitions. In the context of mean-field games (MFGs), IL is used to learn a policy that imitates the behavior of a population of infinitely many expert agents that are following a Nash equilibrium policy, according to some unknown payoff function. Mean-field games are an approximation introduced to simplify the analysis of games with a large (but finite) number of identical players, where we can look at the interaction between a representative infinitesimal player and a term capturing the population's behavior. The MFG framework enables scaling to an infinite number of agents, where both the reward and the transition are population-dependent. The aim is to learn policies that effectively imitate the behavior of a large population of agents, which is a crucial problem in many real-world applications, such as traffic management [12, 30, 31], crowd control [11, 1], and financial markets [6, 5].

IL in MFGs presents new challenges compared to single-agent IL, as both the (unknown) reward function and the transition kernel can depend on the population distribution. Furthermore, algorithms will depend on whether we can only observe the trajectories drawn from the Nash equilibrium (NE) or if we can access the MFG itself, either driven by the expert population or the imitating one. The main question we address is whether IL in MFGs is actually harder than IL in single-agent settings, and whether we can use single-agent techniques to solve IL in the MFG framework. Although there exist IL algorithms for MFGs in the literature, none comes with a characterization of the quality of the learnt imitation policy. We will also explain that they essentially amount to a reduction to classic IL, and discuss the possible underlying issues. So as to compare algorithms on a
rational basis, we provide an extension of the concept of imitation gap to this setting and study it. Our contributions are:

- We introduce a new solution concept for IL in MFGs called the Nash imitation gap, which is a strict generalization of the classic imitation gap and which we think may be more widely applicable to multi-agent reinforcement learning.
- In light of this new criterion, we first study the setting where only the reward depends on the population's distribution, while the dynamics does not. This setting was largely studied in the past few years [16, 4, 1, 14, 22], and we show that in this case IL in MFGs reduces to single-agent IL with similar guarantees for Behavioral Cloning (BC) and adversarial imitation (ADV) types of algorithms.
- Then, we provide a similar analysis in the more general setting where the dynamics depends on the population's distribution. In this case, we provide for BC and ADV approaches upper bounds that are exponential in the horizon, suggesting that IL is harder in this setting. At an abstract level, all previous works in the existing literature correspond to this setting.
- Due to these negative results, we introduce a new proxy to the Nash imitation gap, for which we can derive an upper bound that is quadratic in the horizon. Then, we discuss how a practical algorithm could be designed with an adversarial learning viewpoint. The idea behind it is to use an approach similar to adversarial IL, where the underlying RL problem is replaced by a mean-field control (MFC) problem. We leave the design and experimentation of practical algorithms for future work, but this suggests that making progress on IL in MFGs may have to build upon MFC.
- We also provide a numerical illustration empirically supporting our claims in the appendix.

2 Background

2.1 Mean-field Games

Intuitively, an MFG corresponds to the limit of an N-player game when N tends towards infinity. We focus on the finite-horizon setting, in discrete time, and with finite state and action spaces. Mathematically, the MFG is defined by a tuple $\mathcal{M} = (S, A, P, r, H, \nu_0)$ where $S$ is a finite state space, $A$ is a finite action space, $P : S \times A \times \Delta_S \to \Delta_S$ is a transition kernel, $r : S \times A \times \Delta_S \to \mathbb{R}$ is a reward function, $H$ is a finite horizon and $\nu_0 \in \Delta_S$ is a distribution over initial states. The first and second inputs of $P$ and $r$ represent respectively the individual player's state and action, while the third input represents the distribution of the population. This allows the model to capture interactions of mean-field type. We denote $[H-1] = \{0, \dots, H-1\}$.

Since the problem is set in a finite time horizon, we consider non-stationary stochastic policies of the form $\pi = (\pi_0, \dots, \pi_{H-1})$ where for each $n \in [H-1]$, $\pi_n : S \to \Delta_A$. This means that, at time $n$, a representative player whose state is $s_n \in S$ picks an action according to $\pi_n(s_n)$. We will also view $\pi_n$ as a function from $S \times A$ to $[0, 1]$ and use the notation $\pi_n(a|s_n)$ to denote the probability to pick action $a$ according to $\pi_n(s_n)$. We also introduce the set of population-independent non-stationary rewards uniformly bounded by 1, which will be useful later: $\mathcal{R} = \{r : S \times A \times [H-1] \to [-1, 1]\}$.

When a player is at state $s$ and uses action $a$ while the population distribution is $\nu$, it gets the reward $r(s, a, \nu)$. In a finite-player game, $\nu$ would be replaced by the empirical distribution of the other players' states. These players would be influenced by the player under consideration, leading to complex interactions.
However, the influence of a single player on the population distribution becomes smaller as the number of players increases. In the limit, we can expect that each player has no influence on the distribution. We will use the terms "agent" and "player" interchangeably. The MFG framework allows us to formalize this idea. The problem faced by a single representative player is thus a Markov decision process (MDP) in which the distribution is fixed: given a mean-field sequence $\nu = (\nu_n)_{n \in [H-1]}$, the player wants to find a policy maximizing the value function $V$ defined by
$$V(\pi, \nu) = \mathbb{E}\Big[\sum_{n=0}^{H-1} r(S_n, A_n, \nu_n)\Big],$$
where $S_0$ is distributed according to the initial state distribution $\nu_0$, $A_n \sim \pi_n(S_n)$, and $S_{n+1} \sim P(S_n, A_n, \nu_n)$. The resulting policy is called a best response to the mean field $\nu$. We will sometimes write $V_r(\pi, \nu)$ to make explicit the reward of the value function.

Definition 1 (Best response). A policy $\pi$ is a best response to $\nu$ if: $V(\pi, \nu) = \max_{\pi'} V(\pi', \nu)$.

Intuitively, $\nu$ is a Nash equilibrium mean field if it is the distribution sequence obtained when all the players use a best response policy against $\nu$. To formalize this idea, we introduce the notion of population distribution sequence induced by a policy; this population does influence the transition kernel.

Definition 2 (Population distribution sequence). We denote by $\nu^{(\pi)}$ the state-distribution sequence induced by the population following policy $\pi$, which is defined as:
$$\nu^{(\pi)}_0(s) = \nu_0(s) \ \ \forall s \in S, \qquad \nu^{(\pi)}_{n+1}(s') = \sum_{s, a} \nu^{(\pi)}_n(s)\, \pi_n(a|s)\, P(s'|s, a, \nu^{(\pi)}_n) \ \ \forall (n, s') \in [H-1] \times S.$$
The state-action distribution sequence, denoted by $\mu^{(\pi)}$, is defined as: $\mu^{(\pi)}_n(s, a) = \pi_n(a|s)\, \nu^{(\pi)}_n(s)$, for all $(n, s, a) \in [H-1] \times S \times A$.

We can now give a formal definition of Nash equilibrium.¹

Definition 3 (Nash equilibrium). A policy $\pi$ is called a Nash equilibrium policy if it is a best response against $\nu^{(\pi)}$, i.e., $\pi$ is a maximizer of $\pi' \mapsto V(\pi', \nu^{(\pi)})$. The distribution sequence induced by a Nash equilibrium policy is called a Nash equilibrium mean-field sequence.

The value function can be written without expectation by introducing the state and state-action distributions for a single agent, which does not influence the transition kernel (only the population does, which is a core reason for considering the mean-field limit).

Definition 4 (Single-agent distribution sequence). Consider an agent using policy $\pi'$ who evolves among a population using policy $\pi$. We denote by $\nu^{(\pi)\pi'}$ the state-distribution sequence of this single agent, which is defined by:
$$\nu^{(\pi)\pi'}_0(s) = \nu_0(s) \ \ \forall s \in S, \qquad \nu^{(\pi)\pi'}_{n+1}(s') = \sum_{s, a} \nu^{(\pi)\pi'}_n(s)\, \pi'_n(a|s)\, P(s'|s, a, \nu^{(\pi)}_n) \ \ \forall (n, s') \in [H-1] \times S.$$
The state-action distribution sequence is defined as: $\mu^{(\pi)\pi'}_n(s, a) = \pi'_n(a|s)\, \nu^{(\pi)\pi'}_n(s)$, for all $(n, s, a) \in [H-1] \times S \times A$.

From these definitions we directly have that $\nu^{(\pi)\pi} = \nu^{(\pi)}$ and $\mu^{(\pi)\pi} = \mu^{(\pi)}$. Furthermore, for any policies $\pi$ and $\pi'$, we define a generalized value function
$$V(\pi', \nu^{(\pi)}) = \sum_{n=0}^{H-1} \sum_{s, a} \mu^{(\pi)\pi'}_n(s, a)\, r(s, a, \nu^{(\pi)}_n).$$

Definition 5 (Exploitability). The exploitability of a policy $\pi$ quantifies the gain for a representative player to replace its policy by a best response: $\mathcal{E}(\pi) = \max_{\pi'} V(\pi', \nu^{(\pi)}) - V(\pi, \nu^{(\pi)})$.

A Nash equilibrium policy can be defined equivalently as a policy such that its exploitability is 0. Finding a Nash equilibrium policy is different from trying to maximize the value function, which can be interpreted as a social optimum. This problem is sometimes referred to as the mean-field control (MFC) problem because it can be interpreted as optimal control for an MDP where the state is augmented with the distribution.

Definition 6 (Social optimum). A policy $\pi$ is socially optimal if $V(\pi, \nu^{(\pi)}) = \max_{\pi'} V(\pi', \nu^{(\pi')})$.
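To make these definitions concrete, the following minimal tabular sketch (Python/NumPy, with illustrative names of our own: policies are arrays `pi[n, s, a]`, and `P(nu)`, `r(nu)` are callables returning arrays of shapes `(S, A, S)` and `(S, A)` for a given population state distribution `nu`) computes the population flow of Definition 2, the single-agent flow of Definition 4, the generalized value function, and the exploitability of Definition 5 via backward induction in the MDP induced by a frozen mean-field sequence.

```python
import numpy as np

def population_flow(pi, P, nu0, H):
    """Definition 2: state distributions nu^(pi)_n of a population following pi."""
    nu = np.zeros((H, nu0.shape[0]))
    nu[0] = nu0
    for n in range(H - 1):
        # nu_{n+1}(s') = sum_{s,a} nu_n(s) pi_n(a|s) P(s'|s,a,nu_n)
        nu[n + 1] = np.einsum("s,sa,sat->t", nu[n], pi[n], P(nu[n]))
    return nu

def agent_flow(pi_agent, nu_pop, P, nu0, H):
    """Definition 4: state distributions of one agent playing pi_agent while the
    dynamics are driven by the (fixed) population sequence nu_pop."""
    rho = np.zeros((H, nu0.shape[0]))
    rho[0] = nu0
    for n in range(H - 1):
        rho[n + 1] = np.einsum("s,sa,sat->t", rho[n], pi_agent[n], P(nu_pop[n]))
    return rho

def value(pi_agent, nu_pop, P, r, nu0, H):
    """Generalized value V(pi_agent, nu_pop) = sum_n sum_{s,a} mu_n(s,a) r(s,a,nu_pop[n])."""
    rho = agent_flow(pi_agent, nu_pop, P, nu0, H)
    return sum(np.einsum("s,sa,sa->", rho[n], pi_agent[n], r(nu_pop[n])) for n in range(H))

def best_response_value(nu_pop, P, r, nu0, H):
    """max_pi' V(pi', nu_pop), by backward induction with the mean field frozen."""
    V_next = np.zeros(nu0.shape[0])
    for n in reversed(range(H)):
        Q = r(nu_pop[n]) + P(nu_pop[n]) @ V_next   # Q_n(s, a), with V_H = 0
        V_next = Q.max(axis=1)
    return float(nu0 @ V_next)

def exploitability(pi, P, r, nu0, H):
    """Definition 5: E(pi) = max_pi' V(pi', nu^(pi)) - V(pi, nu^(pi))."""
    nu = population_flow(pi, P, nu0, H)
    return best_response_value(nu, P, r, nu0, H) - value(pi, nu, P, r, nu0, H)

# Toy usage: 2 states, 2 actions, mild mean-field dependence in P and r.
def P(nu):   # drift toward state 1 grows with the crowd already in state 1
    base = np.array([[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]])
    return (1.0 - 0.5 * nu[1]) * base + 0.5 * nu[1] * np.array([0.0, 1.0])

def r(nu):   # reward 1 in state 1, penalized by crowding
    return np.array([[0.0, 0.0], [1.0, 1.0]]) - 0.5 * nu[:, None]

H, nu0 = 5, np.array([1.0, 0.0])
pi_uniform = np.full((H, 2, 2), 0.5)
print(exploitability(pi_uniform, P, r, nu0, H))   # > 0 unless pi_uniform is an equilibrium
```

The Nash imitation gap introduced in Sec. 4 is exactly this `exploitability` quantity evaluated at the imitation policy.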
In general the Nash equilibrium and the social optimum do not coincide, i.e., the MFG policy and the MFC solution can be different. In other words, if $\pi$ is an MFG solution, then we might have $V(\pi, \nu^{(\pi)}) < \max_{\pi'} V(\pi', \nu^{(\pi')})$, and if $\pi$ is an MFC policy, then we might have $\mathcal{E}(\pi) \neq 0$. In this work, we will frequently make two common assumptions (e.g., [2, Asm. 1], or [34, Asm. 1]).

¹We will sometimes omit the term "Nash" and simply write "equilibrium" when the context is clear.

Assumption 1. The reward function $r$ and the transition kernel $P$ are Lipschitz continuous w.r.t. the population distribution, with corresponding Lipschitz constants $L_r$ and $L_P$. In particular, for any state-action pair $(s, a)$ it holds for any state distributions $\nu, \nu' \in \Delta_S$ that
$$|r(s, a, \nu) - r(s, a, \nu')| \le L_r \|\nu - \nu'\|_1, \qquad \|P(\cdot|s, a, \nu) - P(\cdot|s, a, \nu')\|_1 \le L_P \|\nu - \nu'\|_1.$$

These assumptions mean that the reward and the transitions depend on the mean field in a smooth way. For example, $r(s, a, \nu) = -c_r\, \nu(s) + \tilde{r}(s, a)$ satisfies the assumption with $L_r = c_r$. A reward of this form penalizes the agent for being in a crowded state. In particular, these assumptions imply that $r$ and $P$ are continuous with respect to the distribution, and since $S$ and $A$ are finite sets and we consider a discrete-time, finite-horizon MFG, they are sufficient to ensure existence of a Nash equilibrium, see e.g. [10, Proposition 1]. Note that these assumptions do not require $L_r$ or $L_P$ to be small. However, we will consider separately the case $L_P = 0$, which has received attention in the MFG literature and which corresponds to situations where the mean-field interactions occur through the reward only. Note that when $L_P = L_r = 0$, there are no interactions, so the Nash equilibrium condition becomes trivial and the problem reduces to a single-agent MDP. In that case we will write the value function as $V(\pi)$.

2.2 Classic imitation learning

The single-agent setting is a special case of the above, with $L_P = L_r = 0$. In this case, the imitation learning (IL) problem has been extensively studied [20]. In IL we assume that we observe state-action trajectories generated by an expert who is using an optimal policy $\pi^E$, i.e., $\pi^E \in \operatorname{argmax}_\pi V(\pi)$. We do not know $\pi^E$, $V$, $r$ nor $P$. The goal is to learn a policy, denoted by $\pi^A$, which performs as well as the expert policy $\pi^E$ according to the unknown reward: $V(\pi^A) = V(\pi^E) = \max_{\pi'} V(\pi')$. Given the imitation policy $\pi^A$, the imitation gap is a non-negative quantity defined by $V(\pi^E) - V(\pi^A)$ [29]. The goal is to learn a policy $\pi^A$ whose imitation gap is as close to 0 as possible. However, when we do not know the model, this quantity cannot be minimized directly. For this reason, different methods were introduced in the literature. Here we focus on two prominent methods: Behavioral Cloning (BC) [24] and adversarial IL such as Generative Adversarial IL (GAIL) [18].

For simplicity, we will denote $\nu^E = \nu^{(\pi^E)\pi^E}$, $\mu^E = \mu^{(\pi^E)\pi^E}$, $\nu^A = \nu^{(\pi^A)\pi^A}$, and $\mu^A = \mu^{(\pi^A)\pi^A}$. When the dynamics does not depend on the population ($L_P = 0$), we will write $\nu^\pi = \nu^{(\mathrm{any})\pi}$ and $\mu^\pi = \mu^{(\mathrm{any})\pi}$, as the population driven by any policy has no effect on the transition kernel.

Behavioral Cloning. In this work, we frame BC as minimizing the expected $\ell_1$-distance between the action probability distributions of the expert and the imitation policy, where the expectation is over the expert state occupancy.
Although in practice BC consists in minimizing a reverse KL divergence, or equivalently in performing maximum likelihood estimation as in supervised learning, here we consider the $\ell_1$-norm distance (or scaled total variation), which is more convenient for the analysis. However, the analysis could be adapted to this maximum likelihood estimation with minor changes thanks to Pinsker's inequality (see App. A). In the finite-horizon case, one obtains a bound on quantities of the form:
$$\epsilon^{\mathrm{BC}}_n := \mathbb{E}_{s \sim \nu^E_n}\big[\|\pi^E_n(\cdot|s) - \pi^A_n(\cdot|s)\|_1\big], \quad n \in \{0, \dots, H-1\}. \tag{1}$$
That is, we solve one BC problem per time-step, from data generated by an expert. The single-agent imitation-learning gap is bounded by $O(H^2 \epsilon^{\mathrm{BC}})$, where $\epsilon^{\mathrm{BC}} = \max_{n \in \{0, \dots, H-1\}} \epsilon^{\mathrm{BC}}_n$ [26]. We will retrieve this result as a special case of our analysis.

Adversarial Imitation Learning. Using adversarial-like approaches [18, 13], we control a distance or divergence between the state-action occupancy measures of the expert and the imitation policy. In the single-agent IL setting this divergence can be expressed as:
$$\epsilon^{\mathrm{ADV}}_n := \|\mu^E_n - \mu^A_n\|_1, \quad n \in \{0, \dots, H-1\}. \tag{2}$$
Similar IL quantities are extensively studied in the single-agent case, leading to algorithms minimizing a divergence or distance such as Generative Adversarial Imitation Learning [18] or IQ-Learn [13]. If we control the state-action occupancy measure errors, we get a single-agent imitation-learning gap of order $O(H \epsilon^{\mathrm{ADV}})$, where $\epsilon^{\mathrm{ADV}} = \max_{n \in \{0, \dots, H-1\}} \epsilon^{\mathrm{ADV}}_n$, gaining an $H$ factor compared to the BC bound [32]. To make the link to adversarial approaches, controlling the terms of Eq. (2) can be achieved through the following equivalent formulation, based on the Integral Probability Metric (IPM) formulation of the total variation (see Appx. B for a detailed derivation and additional discussion):
$$\min_\pi \max_{f \in \mathcal{R}} \big(V_f(\pi^E) - V_f(\pi)\big), \tag{3}$$
with $\mathcal{R}$ defined in Sec. 2.1 (non-stationary, population-independent reward functions uniformly bounded by 1). This form is similar to the classic adversarial imitation approach, learning both a reward function and a policy, the inner problem being a reinforcement learning (RL) problem. Previous works usually consider an f-divergence [15, 19] between state-action occupancy measures, which leads to different min-max problems. However, in this paper we focus on the $\ell_1$-distance, which we think is a meaningful and practical abstraction of adversarial approaches. Our results could be extended to other settings by using tools like Pinsker's inequality.

Remark 1. Remember that in the single-agent case (the same as when we have $L_P = L_r = 0$), the transition dynamics and the reward function do not depend on the population distribution. In a mean-field game, on the other hand, the transition dynamics and the reward function can depend on the population distribution. For this reason, we need to consider different quantities to be controlled. We will see more about this in the next section.
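As a small illustration of the two proxies and of the IPM viewpoint of Eq. (3), the following sketch (our own tabular notation, assuming arrays `pi_E[n, s, a]` and `pi_A[n, s, a]` for the policies, `nu_E[n, s]` for the expert state occupancies, and `mu_E[n, s, a]`, `mu_A[n, s, a]` for the state-action occupancies) computes the per-step BC and ADV errors and the reward achieving the inner maximum of Eq. (3); in practice these quantities would of course be estimated from sampled trajectories.

```python
import numpy as np

def bc_errors(pi_E, pi_A, nu_E):
    """Eq. (1): eps_BC[n] = E_{s ~ nu_E[n]} || pi_E[n](.|s) - pi_A[n](.|s) ||_1."""
    return np.einsum("ns,nsa->n", nu_E, np.abs(pi_E - pi_A))

def adv_errors(mu_E, mu_A):
    """Eq. (2): eps_ADV[n] = || mu_E[n] - mu_A[n] ||_1 over state-action pairs."""
    return np.abs(mu_E - mu_A).sum(axis=(1, 2))

def ipm_maximizer(mu_E, mu_A):
    """Inner maximum of Eq. (3): over rewards f in R (bounded by 1 in absolute value),
    V_f(pi_E) - V_f(pi_A) = sum_n <mu_E[n] - mu_A[n], f[n]> is maximized by the sign
    of the occupancy gap, and the maximum equals sum_n || mu_E[n] - mu_A[n] ||_1."""
    return np.sign(mu_E - mu_A)
```

In this tabular view, minimizing the ADV errors over $\pi^A$ therefore amounts to the min-max problem of Eq. (3), the inner maximization being available in closed form.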
3 Related works

We provide a commented review of the literature, focusing on the fundamentals of the different approaches, their pros and their cons.

To the best of our knowledge, the first work addressing the IL problem in MFGs is [33]. The authors consider a discrete-time MFG over a complete graph, and they propose a reduction from MFGs to a finite-horizon MDP (with a population-augmented state) and then use single-agent IL algorithms on the new MDP. In the reduction, the new reward function is computed in the following way: $\bar{r}(\nu, \pi) = \sum_{s, a} \nu(s)\, \pi(a|s)\, r(s, a, \nu)$, considering that the state is the population distribution $\nu$ and the actions are the possible policies $\pi$. Using this reduction, the authors implicitly assume that the observed expert is solving an MFC problem, i.e., she is looking for a socially optimal policy (see Def. 6). However, in general, this does not coincide with a Nash equilibrium policy (see Def. 3). The reduction thus works only for cooperative MFGs, and it is prone to biased reward inference in non-cooperative environments.

The first work highlighting this issue is [8]. The authors consider a discounted finite-horizon MFG and propose a novel method called Mean Field IRL. They reframe the problem as finding a reward function that makes the expert policy $\pi^E$ a best response with respect to the expert population distribution $\nu^E$, i.e., finding a population-dependent $r$ such that $V_r(\pi^E, \nu^E) = \max_\pi V_r(\pi, \nu^E)$. To solve this problem they use a max-margin approach similar to [25] in the single-agent case. In fact, fixing the population distribution $\nu^E$, the MFG is reduced to an MDP, and the imitation problem is reduced to single-agent IL. However, since they do not have access to the actual population distribution $\nu^E$, they need to estimate it from samples. As a result, the authors do not have any actual theoretical guarantee on the performance of the recovered policy, since it depends on the estimation of $\nu^E$.

Recently, [9] proposed a novel approach reusing ideas from adversarial learning [18] and maximum entropy [36]. The authors assume they have access to an expert optimizing a mean-field-regularized Nash equilibrium, i.e., the value function has an additional regularization term:
$$V(\pi', \nu^{(\pi)}) = \sum_{n=0}^{H-1} \mathbb{E}_{s \sim \nu^{(\pi)\pi'}_n}\big[\mathcal{H}(\pi'_n(\cdot|s))\big] + \sum_{n=0}^{H-1} \sum_{s, a} \mu^{(\pi)\pi'}_n(s, a)\, r(s, a, \nu^{(\pi)}_n),$$
with $\mathcal{H}$ the Shannon entropy. This term guarantees that the MFG has a unique equilibrium solution, and only one best response policy. Then, they assume to have access to the population distribution $\nu^E$, or to be able to estimate this population distribution from samples. Fixing the population distribution, the MFG problem reduces to the single-agent setting, which allows applying GAIL [18] or maximum entropy IRL to learn an approximate policy $\pi^A$. This leads to two main problems. (i) We are actually estimating $\nu^E$ from samples while trying to find a policy that gives us the true $\nu^E$; this circular reasoning leads to many estimation errors. (ii) From an abstract viewpoint, the final theoretical guarantee we can hope to have for the recovered policy $\pi^A$ would approximately be a bound on $\|\mu^{(\pi^E)\pi^A} - \mu^{(\pi^E)\pi^E}\|_1$ (in practice for a different distance or divergence, but involving the same occupancy measures), which may not be sufficient, as we will show in Sec. 4.2: the upper bound for this case has an exponential dependency on the horizon.

In [35], the authors assume observing an MFG where the agents are acting according to a correlated equilibrium. They justify the study of a correlated equilibrium by applications where we can have access to some correlation device (e.g., a traffic network equilibrium induced by public routing recommendations). Similar to [9], the authors of this work reuse ideas from adversarial learning [18] to recover a policy observing a correlated equilibrium policy. As in [9], they fix the population distribution to be the expert one and then apply single-agent GAIL to the problem. Although the solution concept is different, this approach therefore faces similar problems as [9], where instead the authors considered observing an expert following a Nash equilibrium policy. We will focus on experts achieving a Nash equilibrium. Extending our results (see Sec.
4) to more general equilibria such as coarse correlated ones is an interesting future research direction. Notice also that all GAIL-like approaches discussed above learn an intermediate population-dependent reward function. This is complicated (as the population is a distribution over the state space, generally difficult to represent compactly), and in fact superfluous in their setting. Indeed, as all these approaches assume that the interaction is done with the MFG driven by the expert population (or a given approximation of it), it is sufficient to consider an intermediate population-independent but non-stationary reward. This will appear clearly in the IPM formulations we propose in Sec. 4.

4 Nash imitation gap and imitation in MFGs

We consider the imitation learning problem in MFGs. Similar to single-agent IL, we observe the interactions between an expert and a fixed MFG environment, for which we do not know the reward function $r$ nor the transition kernel $P$. We only observe samples coming from an expert policy denoted by $\pi^E$, which we assume to be a Nash equilibrium policy, i.e., $\mathcal{E}(\pi^E) = 0$. In single-agent IL the goal is clear: find a policy $\pi^A$ minimizing the imitation gap $\max_\pi V(\pi) - V(\pi^A)$. However, in MFGs the goal is less clear. Previous works focused on solving IL for a single agent in a population that stays at equilibrium, but this is not relevant for many applications, in which the population might use the learnt policy. The learnt policy should thus not only be good at the individual level, but it should also be an equilibrium policy. As a first contribution, we propose a natural formulation for studying the performance of imitation learning in MFGs, called the Nash imitation gap:

Definition 7. The Nash imitation gap (NIG) of a policy $\pi^A$ is defined as: $\mathcal{E}(\pi^A) = \max_{\pi'} V(\pi', \nu^{(\pi^A)}) - V(\pi^A, \nu^{(\pi^A)})$.

Therefore, the NIG is simply defined as being the exploitability of the considered policy. The NIG has many interesting and useful properties. (i) If it is zero, the recovered policy $\pi^A$ is a Nash equilibrium policy. (ii) It is a generalization of the single-agent imitation gap (see above and Sec. 2.2), which is recovered as a special case when $L_P = L_r = 0$. As for the single-agent imitation gap, we cannot optimize it directly, since we do not know the reward function, but we can instead envision proxies, such as reducing the distance between the recovered policy $\pi^A$ and the expert policy $\pi^E$ (BC-like) or their occupancy measures (GAIL-like), as in the classic IL setting (see Sec. 2.2).

In this section, we first discuss the imitation learning problem when the dynamics does not depend on the population, i.e., $L_P = 0$, a common setting largely explored in the last years [22, 23]. Then, we present our results for the general setting where the dynamics depends on the population, i.e., when $L_P > 0$. For what follows, in addition to the Lipschitz assumption (Asm. 1), we also assume that the unknown reward function $r$ for which the expert $\pi^E$ is a Nash equilibrium is uniformly bounded when the population is the expert one.

Assumption 2. The unknown reward $r$ satisfies $\max_{n, s, a} |r(s, a, \nu^{(\pi^E)}_n)| \le r_{\max}$.

4.1 Population-independent dynamics: a reduction to classic imitation

We start by analyzing a simpler but commonly used setting (e.g., see [23] in the context of reinforcement learning methods, or [16, 4, 1] in the context of the analysis of discrete or continuous space MFGs), where the dynamics does not depend on the population.
Here the interaction with the MFG provides no reward signal, and it is only the (unknown) reward function that depends on the population. This setting is equivalent to observing an interaction between an expert and an MDP, since the state distribution of a single agent does not depend on the population, i.e., $\nu^{(\pi)\pi'} = \nu^{\pi'}$ for any population policy $\pi$.

Behavioral Cloning. The behavioral cloning setting is the same as the single-agent setting (see Sec. 2.2), i.e., we control:
$$\epsilon^{\mathrm{BC}}_n := \mathbb{E}_{s \sim \nu^E_n}\big[\|\pi^E_n(\cdot|s) - \pi^A_n(\cdot|s)\|_1\big], \quad n \in \{0, \dots, H-1\}, \tag{4}$$
where $\pi^E$ is the expert policy and $\pi^A$ the imitation policy. Under the assumption of the BC-type error, we give the following bound on the Nash imitation gap (proof in Appx. C).

Theorem 1. Let $\epsilon^{\mathrm{BC}} = \max_{n \in \{0, \dots, H-1\}} \epsilon^{\mathrm{BC}}_n$. If $L_P = 0$, the Nash imitation gap satisfies $\mathcal{E}(\pi^A) \le H^2 (2 L_r + r_{\max})\, \epsilon^{\mathrm{BC}}$.

Theorem 1 shows that when $L_P = 0$ we recover the single-agent imitation learning bound. Perhaps surprisingly, the simple BC approach was not previously considered for MFGs, to the best of our knowledge.

Adversarial learning. In the adversarial setting, similarly to the single-agent case (see Sec. 2.2), we control a distance or divergence between occupancy measures. More precisely, we consider the following distance between occupancy measures: $\epsilon^{\mathrm{ADV}}_n := \|\mu^E_n - \mu^A_n\|_1$, $n \in \{0, \dots, H-1\}$. It is important to recall that in this setting the dynamics do not depend on the population, thus $\mu^{(\pi)\pi''} = \mu^{(\pi')\pi''}$ for every triplet of policies $(\pi, \pi', \pi'')$. Since in this case the dynamics do not depend on the population distribution, the same IPM approach (see Eq. (3)) as in single-agent imitation learning also works in this context. We can then practically solve the MFG IL problem using GAIL-like approaches [18, 15, 19, 13]. We now provide a novel bound on the Nash imitation gap (proof in Appx. C).

Theorem 2. Let $\epsilon^{\mathrm{ADV}} = \max_{n \in \{0, \dots, H-1\}} \epsilon^{\mathrm{ADV}}_n$. If $L_P = 0$, the Nash imitation gap satisfies $\mathcal{E}(\pi^A) \le H (2 L_r + r_{\max})\, \epsilon^{\mathrm{ADV}}$.

In contrast to the quadratic horizon dependence in Theorem 1, we derive here a linear horizon dependence. Furthermore, the bound in Theorem 2 is almost the same (similarly to the BC case) as in the single-agent IL problem. In fact, it recovers the bound of [32] by setting $L_r = 0$.

Discussion. When $L_P = 0$, interacting with the MFG without reward amounts to interacting with an MDP without reward, so from a practical viewpoint any classic IL approach could be used, including DAgger-like approaches [27] that we do not analyse here. Our upper bounds show that when the dynamics is independent of the population, IL for MFGs has similar guarantees as in single-agent imitation learning. In fact, our results suggest that a population-dependent unknown reward function affects the IL policy performance very moderately (through the $L_r$ constant). However, new challenges may arise if we also want to recover the reward function, as in inverse reinforcement learning. We leave this interesting research direction for future work.

4.2 Population-dependent dynamics

When the dynamics of the MFG depend on the population, i.e., $L_P > 0$, the previous results do not apply anymore. In this section, we present results on MFGs with population-dependent dynamics, for the same proxies introduced above (BC and adversarial).

Behavioral Cloning. The BC proxy to the NIG is the same as in Equation (4). In this case, however, the bound we get is no longer comparable to the classic IL setting (proof in Appx. C).

Theorem 3. Let $\epsilon^{\mathrm{BC}} = \max_{n \in \{0, \dots, H-1\}} \epsilon^{\mathrm{BC}}_n$.
If $L_P > 0$, the Nash imitation gap satisfies:
$$\mathcal{E}(\pi^A) \le \big(2 (L_r + r_{\max})(1 + L_P)^H + r_{\max} H^2\big)\, \epsilon^{\mathrm{BC}}.$$

We observe that when the dynamics depend on the population, the dependence on the horizon is no longer quadratic but exponential. From a technical viewpoint, this comes from the dependency of the transition kernel on the population. A worse dependency makes sense intuitively. In the classic IL setting, an error on policies is amplified at the occupancy measure level (the drift phenomenon in imitation learning). Here, this problem is amplified even more by the fact that the transition kernel defining the occupancy measure itself depends on the related population, further compounding the imitation error. Sec. 5 provides some empirical evidence that IL in MFGs has an exponential dependence on the horizon for BC as $L_P$ increases. Although we cannot claim our result is tight, this suggests that in MFGs we cannot hope to use BC to obtain a good imitation policy $\pi^A$; we need instead to control the divergence between the state-action occupancy induced by $\pi^A$ and the one induced by $\pi^E$.

Adversarial learning. In contrast to the population-independent case, the quantity to control in this setting is not trivial. In fact, it depends on the MFG interaction assumption. We can assume to have the possibility to interact only with the MFG, or with the MFG driven by the expert population. These two types of interaction lead us to consider two different errors. We start by considering the error we would like to minimize if we can interact with the MFG driven by the expert population $\nu^{(\pi^E)}$. Assuming access to this MFG, the IL problem reduces to doing classic IL in an MDP without reward and with transition distribution $P(\cdot|s, a, \nu^{(\pi^E)}_n)$. The error is defined as follows:
$$\epsilon^{\text{vanilla-ADV}}_n := \|\mu^{(\pi^E)\pi^A}_n - \mu^E_n\|_1, \quad n \in \{0, \dots, H-1\}.$$
This assumption was implicitly made in all previous works (see Sec. 3), where the authors assume to fix the expert distribution $\nu^{(\pi^E)}$ and then solve the IL problem. However, assuming access to the expert distribution may not be reasonable in practice, and for this reason, in [9, 8] the authors replace the expert population distribution $\nu^{(\pi^E)}$ with an approximation $\hat{\nu}^E$ estimated from samples, in order to learn $\mu^{(\pi^E)\pi^A}$. In reality, however, we seek to control $\mu^{(\pi^E)\pi^A}$, and thus we need $\nu^{(\pi^E)}$, which is induced precisely by the expert behavior we are trying to recover, the ultimate goal of the problem. This circular reasoning means that theoretical guarantees are not easily obtained in this setting. In this case, to obtain an adversarial formulation, we can use an approach similar to the one for the population-independent dynamics, Eq. (3), providing (details in Appx. B):
$$\min_\pi \max_{f \in \mathcal{R}} \big(V_f(\pi^E, \nu^{(\pi^E)}) - V_f(\pi, \nu^{(\pi^E)})\big).$$
We can observe that the inner problem is again an RL problem (for the MDP induced by $\nu^{(\pi^E)}$), and in practice any single-agent adversarial approach could be applied to solve a similar proxy (related to a different min-max problem). We provide a bound for this approach (proof in Appx. C).

Theorem 4. Let $\epsilon^{\text{vanilla-ADV}} = \max_{n \in \{0, \dots, H-1\}} \epsilon^{\text{vanilla-ADV}}_n$. If $L_P > 0$, the Nash imitation gap satisfies:
$$\mathcal{E}(\pi^A) \le \big(2 (L_r + r_{\max})(1 + L_P)^H + r_{\max} H\big)\, \epsilon^{\text{vanilla-ADV}}.$$

Therefore, in this case, even if in practice we can apply GAIL-like approaches to the MDP with transition distribution $P(\cdot|s, a, \nu^{(\pi^E)}_n)$, we have an exponential dependency on the horizon. Interestingly, this is close to the quantity that previous works [9, 8] tried to control. Although we do not know if this bound is tight, controlling this quantity can reasonably be expected to lead only to weak theoretical guarantees. Sec. 5 provides empirical evidence that, when $H$ and $L_P$ are large enough, vanilla-ADV depends exponentially on the horizon, like BC.
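To make the vanilla-ADV quantity explicit, here is a minimal sketch (names and conventions as in the earlier sketches, i.e., our own) that rolls out the occupancy of Definition 4 with the dynamics frozen at the expert mean field $\nu^{(\pi^E)}$, which is exactly the interaction assumed by the previous works discussed above.

```python
import numpy as np

def occupancy_under_population(pi_agent, nu_pop, P, nu0, H):
    """State-action occupancies mu[n, s, a] of an agent playing pi_agent while the
    dynamics are driven by the fixed population sequence nu_pop (Definition 4)."""
    nS, nA = pi_agent.shape[1:]
    mu, rho = np.zeros((H, nS, nA)), nu0.copy()
    for n in range(H):
        mu[n] = rho[:, None] * pi_agent[n]                  # mu_n(s, a) = rho_n(s) pi_n(a|s)
        if n < H - 1:
            rho = np.einsum("sa,sat->t", mu[n], P(nu_pop[n]))
    return mu

def vanilla_adv_errors(pi_A, pi_E, nu_E_pop, P, nu0, H):
    """eps^vanilla-ADV[n] = || mu^{(pi_E) pi_A}[n] - mu^E[n] ||_1: both occupancies are
    computed in the MDP whose transitions are P(.|s, a, nu^{(pi_E)}_n)."""
    mu_E = occupancy_under_population(pi_E, nu_E_pop, P, nu0, H)
    mu_A = occupancy_under_population(pi_A, nu_E_pop, P, nu0, H)
    return np.abs(mu_A - mu_E).sum(axis=(1, 2))
```

Even if these errors are driven close to zero, Theorem 4 only guarantees a bound on the exploitability of $\pi^A$ in the true MFG, where the population follows $\pi^A$ rather than $\pi^E$, that scales like $(1 + L_P)^H$.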
4.3 Population-dependent dynamics: a new efficient proxy

Although the vanilla-ADV error is a reasonable quantity to control, another interesting proxy is to consider state-action occupancy measures with the population induced by the associated policy, namely
$$\epsilon^{\text{MFC-ADV}}_n := \|\mu^{(\pi^A)\pi^A}_n - \mu^E_n\|_1, \quad n \in \{0, \dots, H-1\}.$$
To control this quantity, we do not need access to the expert population (except through data); we can directly interact with the MFG, driven by what we are learning. As we explained in Sec. 4.1, if $L_P = 0$ then this quantity is the same as the vanilla-ADV error. Before presenting and discussing an adversarial formulation (which will explain the naming choice for the error), we provide a novel bound for this quantity (proof in Appx. C):

Theorem 5. Let $\epsilon^{\text{MFC-ADV}} = \max_{n \in \{0, \dots, H-1\}} \epsilon^{\text{MFC-ADV}}_n$. If $L_P > 0$, the Nash imitation gap satisfies:
$$\mathcal{E}(\pi^A) \le \big(3 L_P r_{\max} H^2 + (2 L_r + r_{\max}) H\big)\, \epsilon^{\text{MFC-ADV}}.$$

This gives us a significant improvement compared to the BC case and the vanilla-ADV error. Indeed, the dependency on the horizon is now quadratic and no longer exponential. Comparing this result with classic imitation learning, we have a worse dependency on $H$, since in the classic setting it is only linear (we recover this dependency when $L_P = 0$). However, as explained before, the fact that the transition kernel does depend on the population indeed intuitively implies a larger amplification of errors. Sec. 5 provides empirical evidence that this approach is better than the previous two, with much less influence from $H$ and $L_P$.

Now, we provide an adversarial formulation, to get a sense of what a practical algorithm could look like. Using again an IPM argument, we have (details in Appx. B)
$$\min_\pi \max_{f \in \mathcal{R}} \big(V_f(\pi^E, \nu^{(\pi^E)}) - V_f(\pi, \nu^{(\pi)})\big).$$
Notice that it is not obvious whether we can switch the min and the max here, since the underlying set of policy-induced occupancy measures is not necessarily convex (due to the dependency of the dynamics on the population). Assuming we can, we still learn an intermediate population-independent non-stationary reward, but the underlying control problem is no longer RL: it is an MFC problem, as it implies studying $\max_\pi V_f(\pi, \nu^{(\pi)})$, where the population does depend on the optimized policy. This suggests that in practice one could start from a classic adversarial IL approach and replace the underlying RL algorithm by an inner MFC algorithm (e.g., [28, 7, 21, 17, 3]). We leave the design and implementation of such a practical algorithm for future work, as it may be more complex than a straight replacement of the control part (notably, many MFC approaches learn a population-dependent policy). However, this overall suggests that making progress on IL for MFGs may have to build upon MFC.
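The contrast with the vanilla-ADV proxy lies only in which policy drives the population. Below is a minimal sketch (our own names, and only a conceptual outline, not a prescribed algorithm): the MFC-ADV errors are computed with the dynamics driven by $\pi^A$ itself, and the discriminator step of the adversarial formulation reduces, in this tabular $\ell_1$ view, to taking the sign of the occupancy gap; the policy update it calls for, $\max_\pi V_f(\pi, \nu^{(\pi)})$, is the MFC problem discussed above and is not solved here.

```python
import numpy as np

def mfc_adv_errors(pi_A, mu_E, P, nu0, H):
    """eps^MFC-ADV[n] = || mu^{(pi_A) pi_A}[n] - mu^E[n] ||_1: the population itself
    follows pi_A, so the transition at step n uses the induced mean field."""
    errs, nu = np.zeros(H), nu0.copy()
    for n in range(H):
        mu_n = nu[:, None] * pi_A[n]                    # mu^{(pi_A) pi_A}_n(s, a)
        errs[n] = np.abs(mu_n - mu_E[n]).sum()
        if n < H - 1:
            nu = np.einsum("sa,sat->t", mu_n, P(nu))    # population-driven transition
    return errs

def discriminator_step(pi_A, mu_E, P, nu0, H):
    """Adversarial view: over rewards f in R, the maximizer of
    V_f(pi_E, nu^(pi_E)) - V_f(pi_A, nu^(pi_A)) = sum_n <mu_E[n] - mu^{(pi_A) pi_A}[n], f[n]>
    is the sign of the occupancy gap. The subsequent policy update,
    max_pi V_f(pi, nu^(pi)), is a mean-field control problem (not solved here)."""
    f, nu = np.zeros((H,) + pi_A.shape[1:]), nu0.copy()
    for n in range(H):
        mu_n = nu[:, None] * pi_A[n]
        f[n] = np.sign(mu_E[n] - mu_n)
        if n < H - 1:
            nu = np.einsum("sa,sat->t", mu_n, P(nu))
    return f
```

A complete algorithm would alternate this discriminator step with an (approximate) MFC solver for the current reward $f$, e.g. one of [28, 7, 21, 17, 3]; as discussed above, the design of such an algorithm is left for future work.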
5 Simulation

In order to provide some empirical evidence of the insights given by our analysis (the influence of the horizon and of the dependency of the dynamics on the population on the various considered proxies to the Nash imitation gap), we introduce the Attractor MFG (see more details about it in App. D). This is a 2-state and 2-action MFG with initial distribution satisfying $\nu_0(s_0) = 1$, with horizon $H$ and with Lipschitz parameter $L$. The reward only depends on the state (not on the distribution nor the action) and satisfies, for all $a \in A$ and $\nu \in \Delta_S$, $r(s_0, a, \nu) = 0$ and $r(s_1, a, \nu) = 1$. In the state $s_1$, any choice of action leads deterministically to $s_1$: the transition kernel satisfies $P(s_1|s_1, a, \nu) = 1$ for all $a \in A$, $\nu \in \Delta_S$. In the state $s_0$, the action $a_1$ leads deterministically to $s_1$, while action $a_0$ leads stochastically to one of the two states: the higher the fraction of the population in $s_1$, the higher the chance to transit to $s_1$ after choosing $a_0$, that is, $P(s_1|s_0, a_1, \nu) = 1$ and $P(s_1|s_0, a_0, \nu) = \min\{1, L\, \nu(s_1)\}$. Therefore, the state $s_1$ is an attractor, hence the chosen name for the MFG.
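For concreteness, the Attractor MFG can be written down in a few lines, in the same tabular conventions as the earlier sketches (our own encoding of the description above, with states and actions indexed 0 and 1).

```python
import numpy as np

def make_attractor_mfg(L):
    """Attractor MFG: s1 is absorbing; from s0, action a1 moves to s1 deterministically,
    while action a0 reaches s1 with probability min(1, L * nu(s1))."""
    def P(nu):                      # P[s, a, s'] for population state distribution nu
        p = min(1.0, L * nu[1])
        return np.array([[[1.0 - p, p],          # s0, a0: stochastic, pulled by the crowd in s1
                          [0.0,     1.0]],       # s0, a1: deterministically to s1
                         [[0.0,     1.0],        # s1, a0: stay in s1
                          [0.0,     1.0]]])      # s1, a1: stay in s1
    def r(nu):                      # reward depends only on the state: 0 in s0, 1 in s1
        return np.array([[0.0, 0.0], [1.0, 1.0]])
    nu0 = np.array([1.0, 0.0])      # everyone starts in s0
    return P, r, nu0

# Example: P, r, nu0 = make_attractor_mfg(L=0.5); policies can then be rolled out with
# the routines from the earlier sketches to compute the BC, vanilla-ADV and MFC-ADV errors.
```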
The experiment consists in computing the errors $\epsilon^{\mathrm{BC}}_n$, $\epsilon^{\text{vanilla-ADV}}_n$, and $\epsilon^{\text{MFC-ADV}}_n$ for various values of $L \in \{0.01, 0.05, 0.1, 0.5\}$ and $H \in \{3, 25, 50, 75, 100\}$, and we show the NIG as a function of the mentioned errors.

Figure 1: For each value of $L$, each horizon $H$, and each type of error (BC, vanilla-ADV and MFC-ADV), we consider the points $(\mathrm{NIG}_k, \epsilon_k)$ making up the corresponding curve and compute the value $\max_k(\mathrm{NIG}_k / \epsilon_k)$ as an empirical upper bound for that specific value of $L$, $H$ and kind of error. Then, for each value of $L$ and each kind of error, we plot this empirical upper bound as a function of the horizon $H$, with a log scale on the y-axis.

From Fig. 1, we can observe an exponential dependency on the horizon for both BC and vanilla-ADV when $L$ increases, while this does not happen for MFC-ADV. This experiment demonstrates empirically what our theoretical upper bounds show, supporting the insights of our analysis. This suggests that a practical algorithm for IL in MFGs needs to minimize the MFC-ADV error (see Sec. 4.3). In Appx. D we provide more experimental results.

6 Conclusion

In this paper, we have studied the recent question of imitation learning in mean-field games. We have reviewed the few previous works tackling this problem and provided a critical discussion of the proposed approaches. We have then introduced the new solution concept of the Nash imitation gap to quantify the quality of imitation. In the simpler case of population-independent dynamics, we have shown that the problem basically reduces to single-agent imitation learning, and that abstractions of the canonical BC and adversarial approaches come with a similar performance guarantee. In the harder population-dependent case, we have provided upper bounds that are exponential in the horizon, for both BC and adversarial IL, the latter being the approach adopted in previous work. We also introduce a new proxy that amounts to controlling different occupancy measures, driven not solely by the expert population but by the policy-induced population. From a practical viewpoint, it only implies interacting with the MFG (without access to the expert population for driving the dynamics), and from a theoretical viewpoint it enjoys a much better, quadratic dependency on the horizon. The associated adversarial formulation suggests that classic adversarial IL approaches could be adapted by replacing the inner RL loop by an MFC one.

This work opens a number of interesting research questions. Our last result suggests a simple modification of existing adversarial approaches, but in practice it may be more difficult than just replacing the control part, and could call for additional research in MFC and MFGs. It may also be that controlling other kinds of occupancy measures, or even different quantities related to the policy and dynamics, leads to even better guarantees, ideally recovering the linear dependency on the horizon of the single-agent case. We provided upper bounds, but there is no known associated lower bound in this setting, which is another interesting research direction. We focused on experts achieving a Nash equilibrium, and extending our work to different kinds of equilibria is another possible direction. We focused on imitating the expert, not on recovering the underlying optimized reward (inverse RL). Doing so may require a different approach, as our adversarial formulations all rely on population-independent rewards (which is convenient from a practical viewpoint, whenever one just wants to recover the policy). We also plan to work on the extension of the Nash imitation gap to other kinds of games.

Funding and Acknowledgements

Giorgia Ramponi was supported by the ETH AI Center and Google DeepMind. Pavel Kolev was supported by the Cyber Valley Research Fund and the Volkswagen Stiftung (No 98 571). Mathieu Laurière is with the NYU Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning, and the NYU-ECNU Institute of Mathematical Sciences at NYU Shanghai. Part of the work was done while Mathieu Laurière was a visiting faculty at Google Brain. We thank the anonymous reviewers for comments, which helped improve the presentation of the paper.

References

[1] Yves Achdou and Jean-Michel Lasry. Mean field games for modeling crowd motion. Contributions to Partial Differential Equations and Applications, pages 17-42, 2019.
[2] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, 13(1):89-117, 2023.
[3] Andrea Angiuli, Jean-Pierre Fouque, and Mathieu Laurière. Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems, 34(2):217-271, 2022.
[4] Pierre Cardaliaguet and P Jameson Graber. Mean field games systems of first order. ESAIM: Control, Optimisation and Calculus of Variations, 21(3):690-722, 2015.
[5] Rene Carmona. Applications of mean field games in financial engineering and economic theory. arXiv preprint arXiv:2012.05237, 2020.
[6] René Carmona and Mathieu Laurière. Deep learning for mean field games and mean field control with applications to finance. arXiv preprint arXiv:2107.04568, 2021.
[7] René Carmona, Mathieu Laurière, and Zongjun Tan. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. arXiv preprint arXiv:1910.12802, 2019.
[8] Yang Chen, Libo Zhang, Jiamou Liu, and Shuyue Hu. Individual-level inverse reinforcement learning for mean field games. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 253-262, 2022.
[9] Yang Chen, Libo Zhang, Jiamou Liu, and Michael Witbrock. Adversarial inverse reinforcement learning for mean field games. arXiv preprint arXiv:2104.14654, 2021.
[10] Kai Cui and Heinz Koeppl. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909-1917. PMLR, 2021.
[11] Christian Dogbé. Modeling crowd dynamics by the mean-field limit approach. Mathematical and Computer Modelling, 52(9-10):1506-1520, 2010.
[12] Mohammed Elhenawy, Ahmed A Elbery, Abdallah A Hassan, and Hesham A Rakha. An intersection game-theory-based traffic control algorithm in a connected vehicle environment. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pages 343-347. IEEE, 2015.
[13] Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. IQ-Learn: Inverse soft-Q learning for imitation. Advances in Neural Information Processing Systems, 34:4028-4039, 2021.
[14] Matthieu Geist, Julien Pérolat, Mathieu Laurière, Romuald Elie, Sarah Perrin, Oliver Bachem, Rémi Munos, and Olivier Pietquin. Concave utility reinforcement learning: The mean-field game viewpoint. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 489-497, 2022.
[15] Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pages 1259-1277. PMLR, 2020.
[16] Diogo A Gomes, Joana Mohr, and Rafael Rigao Souza. Discrete time, finite state space mean field games. Journal de Mathématiques Pures et Appliquées, 93(3):308-328, 2010.
[17] Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis. SIAM Journal on Mathematics of Data Science, 3(4):1168-1196, 2021.
[18] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[19] Liyiming Ke, Sanjiban Choudhury, Matt Barnes, Wen Sun, Gilwoo Lee, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. In Algorithmic Foundations of Robotics XIV: Proceedings of the Fourteenth Workshop on the Algorithmic Foundations of Robotics, pages 313-329. Springer, 2021.
[20] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1-179, 2018.
[21] Barna Pasztor, Ilija Bogunovic, and Andreas Krause. Efficient model-based multi-agent mean-field reinforcement learning. arXiv preprint arXiv:2107.04050, 2021.
[22] Julien Pérolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling mean field games by online mirror descent. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 1028-1037, 2022.
[23] Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. Advances in Neural Information Processing Systems, 33:13199-13213, 2020.
[24] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88-97, 1991.
[25] Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pages 729-736, 2006.
[26] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661-668. JMLR Workshop and Conference Proceedings, 2010.
[27] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627-635. JMLR Workshop and Conference Proceedings, 2011.
[28] Jayakumar Subramanian and Aditya Mahajan. Reinforcement learning in stationary mean-field games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 251-259, 2019.
[29] Gokul Swamy, Sanjiban Choudhury, J Andrew Bagnell, and Steven Wu.
Of moments and matching: A game-theoretic framework for closing the imitation gap. In International Conference on Machine Learning, pages 10022-10032. PMLR, 2021.
[30] Takashi Tanaka, Ehsan Nekouei, and Karl Henrik Johansson. Linearly solvable mean-field road traffic games. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 283-289. IEEE, 2018.
[31] Huisheng Wang, Yuejiang Li, and H Vicky Zhao. Research on intelligent traffic control methods at intersections based on game theory. arXiv preprint arXiv:2009.05216, 2020.
[32] Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments. Advances in Neural Information Processing Systems, 33:15737-15749, 2020.
[33] Jiachen Yang, Xiaojing Ye, Rakshit Trivedi, Huan Xu, and Hongyuan Zha. Learning deep mean field games for modeling large population behavior. arXiv preprint arXiv:1711.03156, 2017.
[34] Batuhan Yardim, Semih Cayci, Matthieu Geist, and Niao He. Policy mirror ascent for efficient and independent learning in mean field games. arXiv preprint arXiv:2212.14449, 2022.
[35] Zhiyu Zhao, Renyuan Xu, Haifeng Zhang, Jun Wang, Mingyuan Zhang, and Yaodong Yang. Imitation learning for mean field games with correlated equilibria. OpenReview preprint, id=VUdMeSbExWg, 2023.
[36] Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, Volume 3, pages 1433-1438, 2008.