# EFFICIENT INVERSE MULTIAGENT LEARNING

Published as a conference paper at ICLR 2024

Denizalp Goktas & Amy Greenwald
Brown University, Computer Science
denizalp_goktas@brown.edu

Sadie Zhao
Harvard University, Computer Science

Alex Koppel & Sumitra Ganesh
JP Morgan Chase & Co

ABSTRACT

In this paper, we study inverse game theory (resp. inverse multiagent learning), in which the goal is to find parameters of a game's payoff functions for which the expected (resp. sampled) behavior is an equilibrium. We formulate these problems as generative-adversarial (i.e., min-max) optimization problems, for which we develop polynomial-time algorithms, the former of which relies on an exact first-order oracle, and the latter, a stochastic one. We extend our approach to solve inverse multiagent simulacral learning in polynomial time and number of samples. In these problems, we seek a simulacrum, meaning parameters and an associated equilibrium that replicate the given observations in expectation. We find that our approach outperforms the widely-used ARIMA method in predicting prices in Spanish electricity markets based on time-series data.

1 INTRODUCTION

Game theory provides a mathematical framework, called games, which is used to predict the outcome of the interaction of preference-maximizing agents called players. Each player in a game chooses a strategy from its strategy space according to its preference relation, often represented by a payoff function over possible outcomes, implied by a strategy profile (i.e., a collection of strategies, one per player). The canonical outcome, or solution concept, prescribed by game theory is the Nash equilibrium (NE) (Nash, 1950): a strategy profile such that each player's strategy, fixing the equilibrium strategies of its opponents, is payoff-maximizing (or more generally, preference-maximizing).

In many applications of interest, such as contract design (Holmström, 1979; Grossman & Hart, 1992) and counterfactual prediction (Peysakhovich et al., 2019), the payoff functions (or more generally, preference relations) of the players are not available, but the players' strategies are. In such cases, we are concerned with estimating payoff functions for which these observed strategies are an equilibrium. This estimation task serves to rationalize the players' strategies (i.e., we can interpret the observed strategies as solutions to preference-maximization problems). Estimation problems of this nature characterize inverse game theory (Waugh et al., 2013; Bestick et al., 2013).

The primary object of study of inverse game theory is the inverse game, which comprises a game with the payoff functions omitted, and an observed strategy profile. The canonical solution concept prescribed for an inverse game is the inverse Nash equilibrium, i.e., payoff functions for which the observed strategy profile corresponds to a Nash equilibrium. If the set of payoff functions in an inverse game is unrestricted, the set of inverse Nash equilibria can contain a wide variety of spurious solutions: e.g., in all inverse games, the payoff function that assigns zero payoffs to all outcomes is an inverse Nash equilibrium, because any strategy profile is a Nash equilibrium of a constant game, i.e., one whose payoffs are constant across strategies.
To meaningfully restrict the class of payoff functions over which to search for an inverse Nash equilibrium, one common approach (Kuleshov & Schrijvers, 2015; Syrgkanis et al., 2017) is to assume that the inverse game includes, in addition to all the aforementioned objects, a parameter-dependent payoff function for each player, in which case an inverse Nash equilibrium is simply defined as parameter values such that the observed strategy profile is a Nash equilibrium of the parameter-dependent payoff functions evaluated at those values.

* Research conducted while the author was an intern at JP Morgan Chase & Co.

If one assumes exact oracle access to the payoffs of the game (i.e., if there exists a function which, for any strategy profile, returns the players' payoffs¹), the problem of computing an inverse Nash equilibrium is one of inverse multiagent planning. In many games, however, a more appropriate assumption is stochastic oracle access, because of inherent stochasticity in the game (Shapley, 1953) or because players employ randomized strategies (Nash, 1950). The problem of computing an inverse Nash equilibrium assuming stochastic oracle access is one of inverse multiagent learning.

One important class of inverse games is that of inverse Markov games, in which the underlying game is a Markov game (Shapley, 1953; Fink, 1964; Takahashi, 1964), i.e., the game unfolds over an infinite time horizon: at each time period, players observe a state, take an action (simultaneously), receive a reward, and transition to a new state. In such games, each player's strategy,² also called a policy, is a mapping from states to actions describing the action the player takes at each state, with any strategy profile inducing a history distribution over histories of play, i.e., sequences of (state, action profile) tuples. The payoff for any strategy profile is then given by its expected cumulative reward over histories of play drawn from the history distribution associated with the strategy profile. Excluding rare instances,³ the payoff function in Markov games is only accessible via a stochastic oracle, typically implemented via a game simulator that returns estimates of the value of the game's rewards and transition probabilities. As such, the computation of an inverse Nash equilibrium in an inverse Markov game is an inverse multiagent learning problem, which is often called inverse multiagent reinforcement learning (inverse MARL) (Natarajan et al., 2010).

In many real-world applications of inverse Markov games, such as robotics control (Coates et al., 2009), one does not directly observe Nash equilibrium strategies but rather histories of play, which we assume are sampled from the history distribution associated with some Nash equilibrium. In these applications, we are given an inverse simulation, i.e., an inverse Markov game together with sample histories of play, based on which we seek parameter values which induce payoff functions that rationalize the observed histories. As a Nash equilibrium itself is not directly observed in this setting, we aim to compute parameter values that induce a Nash equilibrium that replicates the observed histories in expectation. We call the solution of such an inverse simulation (i.e., parameter values together with an associated Nash equilibrium) a simulacrum. Not only does a simulacrum serve to explain (i.e., rationalize) observations; it can additionally provide predictions of unobserved behavior.
We study two simulacral learning problems: a first-order version in which sample histories of play are faithful, and a second-order version in which they are not, i.e., a (possibly stochastic) function of each history of play is observed rather than the history itself. Here, the use of the term first-order refers to the fact that the simulacrum does not necessarily imitate the actual equilibrium that generated the histories of play, since multiple equilibria can generate the same histories of play (Baudrillard, 1994). More generally, if the simulacrum is second-order, it is nonfaithful, meaning some information about the sample histories of play is lost. We refer to the problems of computing first-order (resp. second-order) simulacra as first-order (resp. second-order) simulacral learning: i.e., build a first-order (resp. second-order) simulacrum from faithful (resp. non-faithful, e.g., aggregate agent behavior) sample histories of play. We summarize the problems characterizing inverse game theory in Table 1a.

| Equilibrium access | Exact oracle | Stochastic oracle |
|---|---|---|
| Direct | Inverse multiagent planning | Inverse multiagent learning |
| Faithful samples | First-order simulacral planning | First-order simulacral learning |
| Nonfaithful samples | Second-order simulacral planning | Second-order simulacral learning |

Table 1(a): Taxonomy of inverse game theory problems. First-order simulacral learning is more commonly known as multiagent apprenticeship learning (Abbeel & Ng, 2004; Yang et al., 2020).

| Reference | Game Type | Solution Concept | Polytime? |
|---|---|---|---|
| (Fu et al., 2021) | Finite Markov | Nash | ✗ |
| (Yu et al., 2019) | Finite Markov | Quantal response | ✗ |
| (Lin et al., 2019) | Finite zero-sum Markov | Various | ✗ |
| (Song et al., 2018) | Finite Markov | Quantal response | ✗ |
| (Syrgkanis et al., 2017) | Finite Bayesian | Bayes-Nash | ✓ |
| (Kuleshov & Schrijvers, 2015) | Finite normal-form | Correlated | ✓ |
| (Waugh et al., 2013) | Finite normal-form | Correlated | ✓ |
| (Bestick et al., 2013) | Finite normal-form | Correlated | ✗ |
| (Natarajan et al., 2010) | Finite Markov | Cooperative | ✗ |
| This work | Finite/concave normal-form, finite/concave Markov | Nash/correlated/any other | ✓ |

Table 1(b): A comparison of our work and prior work on inverse game theory and inverse MARL.

¹ Throughout this work, we assume that the oracle evaluations are constant time and measure computational complexity in terms of the number of oracle calls.
² Throughout this paper, a strategy refers to the complete description of a player's behavior at any state or time of the game, while an action refers to a specific realization of a strategy at a given state and time.
³ For simple enough games, one can express the expected cumulative reward in closed form, and then solve the inverse (stochastic) game assuming exact oracle access.

Contributions. The algorithms introduced in this paper extend the class of games for which an inverse Nash equilibrium can be computed efficiently (i.e., in polynomial time) to the class of normal-form concave games (which includes normal-form finite-action games), finite state and action Markov games, and a large class of continuous state and action Markov games. While our focus is on Markov games in this paper, the results apply to normal-form (Nash, 1950), Bayesian (Harsanyi, 1967; 1968), and extensive-form games (Zermelo, 1913). The results also extend to other equilibrium concepts beyond Nash, such as (coarse) correlated equilibrium (Aumann, 1974; Moulin & Vial, 1978), and more generally, Φ-equilibrium (Greenwald & Jafari, 2003), mutatis mutandis.

First, regarding inverse multiagent planning, we provide a min-max characterization of the set of inverse Nash equilibria of any inverse game for which the set of inverse Nash equilibria is non-empty, assuming an exact oracle (Theorem 3.1). We then show that for any inverse concave game, when the
regret of each player is convex in the parameters of the inverse game, an assumption satisfied by a large class of inverse games such as inverse normal-form games, this min-max optimization problem is convex-concave, and can thus be solved in polynomial time (Theorem 3.2) via standard first-order methods. This characterization also shows that the set of inverse Nash equilibria can be convex, even when the set of Nash equilibria is not.

Second, we generalize our min-max characterization to inverse multiagent learning, in particular inverse MARL, where we are given an inverse Markov game, and correspondingly, a stochastic oracle, and we seek a first-order simulacrum (Corollary 1). We show that under standard assumptions, which are satisfied by a large class of inverse Markov games (e.g., all finite state and action Markov games and a class of continuous state and action Markov games), the ensuing min-max optimization problem is convex in the parameters and gradient-dominated in the policies, and thus an inverse Nash equilibrium can be computed once again via standard first-order methods in polynomial time (Theorem 4.1).

Third, we provide an extension of our min-max characterization to (second-order) simulacral learning (Theorem 5.1). We once again characterize the problem as a solution to a min-max optimization problem, for which standard first-order methods compute a first-order stationary solution (Lin et al., 2020) in polynomial time, using a number of observations (i.e., unfaithful samples of histories of play) that is polynomial in the size of the inverse simulation (Theorem 5.2).

Finally, we include two sets of experiments. In the first, we show that our method is effective in synthetic economic settings where the goal is to recover buyers' valuations from observed competitive equilibria (which, in these markets, coincide with Nash equilibria). Second, using real-world time-series data, we apply our method to predict prices in Spanish electricity markets, and find that it outperforms the widely-used ARIMA method on this data set.

2 PRELIMINARIES

Notation. All notation for variable types, e.g., vectors, should be clear from context; if any confusion arises, see Section 7.1. We denote by $[n]$ the set of integers $\{1, \ldots, n\}$. Let $X$ be any set and $(X, \mathcal{F})$ any associated measurable space, where the $\sigma$-algebra $\mathcal{F}$, unless otherwise noted, is assumed to be the $\sigma$-algebra of countable sets, i.e., $\mathcal{F} := \{E \subseteq X \mid E \text{ is countable}\}$. We write $\Delta(X) := \{\mu : (X, \mathcal{F}) \to [0, 1]\}$ to denote the set of probability measures on $(X, \mathcal{F})$. Additionally, we denote the orthogonal projection operator onto a set $X$ by $\Pi_X(x) := \arg\min_{y \in X} \|x - y\|^2$.

Mathematical Concepts. Consider any normed space $(X, \|\cdot\|)$, where $X \subseteq \mathbb{R}^m$, and any function $f : X \to \mathbb{R}$. $f$ is $\ell_f$-Lipschitz-continuous w.r.t. a norm $\|\cdot\|$ (typically, Euclidean) iff for all $x_1, x_2 \in X$, $\|f(x_1) - f(x_2)\| \le \ell_f \|x_1 - x_2\|$.
If the gradient $\nabla f$ of $f$ is $\ell_{\nabla f}$-Lipschitz-continuous, we refer to $f$ as $\ell_{\nabla f}$-Lipschitz-smooth. Furthermore, given $\mu > 0$, $f$ is said to be $\mu$-gradient-dominated if $\min_{x' \in X} f(x') \ge f(x) + \mu \min_{x' \in X} \langle x' - x, \nabla f(x) \rangle$ (Bhandari & Russo, 2019).

Normal-form Games. A (parametric) game $\mathcal{G}_\theta := (n, m, d, X, \Theta, \theta, u)$ comprises $n \in \mathbb{N}_+$ players, each $i \in [n]$ of whom simultaneously chooses a strategy $x_i \in X_i$ from a strategy space $X_i \subseteq \mathbb{R}^m$. We refer to any vector of per-player strategies $x = (x_1, \ldots, x_n) \in X$ as a strategy profile, where $X := \times_{i \in [n]} X_i \subseteq \mathbb{R}^{nm}$ denotes the space of all strategy profiles. After the players choose their strategies $x \in X$, each receives a payoff $u_i(x; \theta)$ given by a payoff function $u_i : X \times \Theta \to \mathbb{R}$ parameterized by a vector $\theta$ in a parameter space $\Theta \subseteq \mathbb{R}^d$. We define the payoff profile function $u(x; \theta) := (u_i(x; \theta))_{i \in [n]}$; the cumulative regret $\psi : X \times X \times \Theta \to \mathbb{R}$ across all players, between two strategy profiles $x, y \in X$, given $\theta \in \Theta$, as $\psi(x, y; \theta) := \sum_{i \in [n]} u_i(y_i, x_{-i}; \theta) - u_i(x; \theta)$; and the exploitability (or Nikaido-Isoda potential (Nikaido & Isoda, 1955)) $\varphi(x; \theta) := \max_{y \in X} \psi(x, y; \theta)$.

A game is said to be concave if, for all parameters $\theta \in \Theta$ and players $i \in [n]$: 1. $X_i$ is nonempty, compact, and convex; 2. $u_i$ is continuous; and 3. $x_i \mapsto u_i(x_i, x_{-i}; \theta)$ is concave. Given $\theta \in \Theta$, an $\varepsilon$-Nash equilibrium ($\varepsilon$-NE) of a game $\mathcal{G}_\theta$ is a strategy profile $x^* \in X$ s.t. $u_i(x^*; \theta) \ge \max_{x_i \in X_i} u_i(x_i, x^*_{-i}; \theta) - \varepsilon$, for all players $i \in [n]$. A 0-Nash equilibrium is simply called a Nash equilibrium, and is guaranteed to exist in concave games (Nash, 1950; Arrow & Debreu, 1954).

Dynamic Games. An (infinite-horizon, discounted, parametric) Markov game (Shapley, 1953; Fink, 1964; Takahashi, 1964) $\mathcal{M}_\theta := (n, m, S, A, \Theta, \theta, r, p, \gamma, \mu)$ is a dynamic game played over an infinite time horizon. The game initiates at time $t = 0$ in some state $S^{(0)} \sim \mu$ drawn from an initial state distribution $\mu \in \Delta(S)$. At each time period $t = 0, 1, \ldots$, each player $i \in [n]$ plays an action $a^{(t)}_i \in A_i$ from an action space $A_i \subseteq \mathbb{R}^m$. We define the space of action profiles $A := \times_{i \in [n]} A_i$. After the players choose their action profile $a^{(t)} := (a^{(t)}_1, \ldots, a^{(t)}_n) \in A$, each player $i$ receives a reward $r_i(s^{(t)}, a^{(t)}; \theta)$ according to a parameterized reward profile function $r : S \times A \times \Theta \to \mathbb{R}^n$. The game then either ends with probability $1 - \gamma$, where $\gamma \in (0, 1)$ is called the discount factor, or transitions to a new state $S^{(t+1)} \sim p(\cdot \mid s^{(t)}, a^{(t)})$ according to a (Markov) probability transition kernel $p$, whereby for all $(s, a) \in S \times A$, $p(\cdot \mid s, a) \in \Delta(S)$, and $p(s^{(t+1)} \mid s^{(t)}, a^{(t)}) = \mathbb{P}(S^{(t+1)} = s^{(t+1)} \mid S^{(t)} = s^{(t)}, A^{(t)} = a^{(t)})$ is the probability of transitioning to state $s^{(t+1)}$ from state $s^{(t)}$ when the players take action profile $a^{(t)}$.⁴

A (stationary Markov) policy (Maskin & Tirole, 2001) for player $i \in [n]$ is a mapping $\pi_i : S \to A_i$ from states to actions, so that $\pi_i(s) \in A_i$ denotes the action that player $i$ takes at state $s$. For each player $i \in [n]$, we define the space of all (measurable) policies $P_i := \{\pi_i : S \to A_i\}$. As usual, $\pi := (\pi_1, \ldots, \pi_n) \in P := \times_{i \in [n]} P_i$ denotes a policy profile. A history (of play) $h \in (S \times A)^T$ of length $T \in \mathbb{N}$ is a sequence of state-action tuples $h = (s^{(t)}, a^{(t)})_{t=0}^{T-1}$. For any policy profile $\pi \in P$, define the discounted history distribution $\nu_\pi(h) := \mu(s^{(0)}) \prod_{t=0}^{T-1} \gamma^t \, p(s^{(t+1)} \mid s^{(t)}, \pi(s^{(t)}))$ as the probability of observing a history $h$ of length $T$. Throughout, we denote by $H := (S^{(t)}, A^{(t)})_t$ any randomly sampled history from $\nu_\pi$.⁵

⁴ For notational convenience, we assume the probability transition function is independent of the parameters, but we note that our min-max characterizations apply more broadly without any additional assumptions, while our polynomial-time computation results apply when, in addition to Assumption 4, one assumes the probability transition function is stochastically convex (see, for instance, Atakan (2003a)) in the parameters of the game.

⁵ Let $(S, \mathcal{F}_S)$, $(A, \mathcal{F}_A)$, and $(S \times A, \mathcal{F}_{S \times A})$ be the measurable spaces associated with the state, action profile, and state-action profile ($S \times A$) spaces, respectively. Further, let $([0, 1], \mathcal{B}_{[0,1]})$ and $(\mathbb{R}^n, \mathcal{B}_{\mathbb{R}^n})$ be measurable spaces on $[0, 1]$ and $\mathbb{R}^n$ defined by the Borel $\sigma$-algebra. For simplicity, we do not explicitly represent the reward profile function, transition probability kernel, initial state distribution, or policies as measures or measurable functions. We note, however, that for the expectations we define to be well-posed, they all must be assumed to be measurable functions. We simply write $r : S \times A \to \mathbb{R}^n$, $p : S \times (S \times A) \to [0, 1]$, $\mu : S \to [0, 1]$, and $\pi : S \to A$ to mean, respectively, $r : (S \times A, \mathcal{F}_{S \times A}) \to (\mathbb{R}^n, \mathcal{B}_{\mathbb{R}^n})$, $p : (S, \mathcal{F}_S) \times (S \times A, \mathcal{F}_{S \times A}) \to ([0, 1], \mathcal{B}_{[0,1]})$, $\mu : (S, \mathcal{F}_S) \to ([0, 1], \mathcal{B}_{[0,1]})$, and $\pi : (S, \mathcal{F}_S) \to (A, \mathcal{F}_A)$.
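To make the rollout model concrete, the following is a minimal sketch of sampling a history $h \sim \nu_\pi$ from a parametric Markov game under the discounted, infinite-horizon interpretation above (the game ends each period with probability $1 - \gamma$). The `reset`, `step`, and `policy_profile` interfaces are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sample_history(reset, step, policy_profile, gamma, rng):
    """Sample one history h = ((s_t, a_t))_t from nu_pi.

    reset() -> initial state S(0) ~ mu; step(s, a) -> (reward profile, next state);
    policy_profile is a list of per-player policies pi_i(s) -> a_i.
    """
    history, s = [], reset()
    while True:
        a = np.array([pi_i(s) for pi_i in policy_profile])  # simultaneous action profile pi(s)
        history.append((s, a))
        _, s = step(s, a)                                    # S(t+1) ~ p(. | s, a)
        if rng.random() > gamma:                             # game ends with probability 1 - gamma
            return history
```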
Fix a policy profile $\pi \in P$ and a player $i$. In our analysis of Markov games, we rely on the following terminology. The expected cumulative payoff is given by $u_i(\pi; \theta) := \mathbb{E}_{H \sim \nu_\pi}\big[\sum_{t=0}^{\infty} r_i(S^{(t)}, A^{(t)}; \theta)\big]$. The state- and action-value functions are defined, respectively, as $v^{\pi}_i(s; \theta) := \mathbb{E}_{H \sim \nu_\pi}\big[\sum_{t=0}^{\infty} r_i(S^{(t)}, A^{(t)}; \theta) \mid S^{(0)} = s\big]$ and $q^{\pi}_i(s, a; \theta) := \mathbb{E}_{H \sim \nu_\pi}\big[\sum_{t=0}^{\infty} r_i(S^{(t)}, A^{(t)}; \theta) \mid S^{(0)} = s, A^{(0)} = a\big]$. The state occupancy distribution $\delta^{\pi}_\mu \in \Delta(S)$ denotes the probability that a state is reached under a policy $\pi$, given initial state distribution $\mu$. Finally, as usual, an $\varepsilon$-Nash equilibrium ($\varepsilon$-NE) of a game $\mathcal{M}_\theta$ is a policy profile $\pi^* \in P$ such that for all $i \in [n]$, $u_i(\pi^*; \theta) \ge \max_{\pi_i \in P_i} u_i(\pi_i, \pi^*_{-i}; \theta) - \varepsilon$; and a Nash equilibrium ensues when $\varepsilon = 0$.

3 INVERSE MULTIAGENT PLANNING

The goal of inverse multiagent planning is to invert an equilibrium, i.e., estimate a game's parameters, given observed behavior. In this section, we present our main idea, namely a zero-sum game (i.e., min-max optimization) characterization of inverse multiagent planning, where one player, called the stabilizer, picks parameters, while the other, called the destabilizer, picks per-player deviations. This game is zero-sum because the stabilizer seeks parameters that rationalize (i.e., minimize the exploitability of) the observed equilibrium, while the destabilizer aims to rebut the rationality of the observed equilibrium (i.e., seeks deviations that maximize cumulative regret). We use this characterization to develop a gradient descent ascent algorithm that finds inverse NE in polynomial time, assuming access to an exact first-order oracle: specifically, a pair of functions that return the value and gradient of the payoff profile function.

An inverse game $\mathcal{G}^{-1} := (\mathcal{G}_{\setminus \theta}, x^*)$ comprises a game form (i.e., a parametric game sans its parameter) $\mathcal{G}_{\setminus \theta}$ together with an observed strategy profile $x^*$, which we assume is a Nash equilibrium. Crucially, we do not observe the parameters of the payoff functions. Given an inverse game $\mathcal{G}^{-1}$, our goal is to compute an $\varepsilon$-inverse Nash equilibrium, meaning parameter values $\theta \in \Theta$ s.t. $x^* \in X$ is an $\varepsilon$-NE of $\mathcal{G}_\theta$. As usual, a 0-inverse NE is simply called an inverse NE.
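As a concrete illustration of these definitions, the following is a minimal sketch that checks whether candidate parameters $\theta$ rationalize an observed profile $x^*$ in a toy two-player Cournot-style concave game, $u_i(x; \theta) = x_i(\theta_i - x_1 - x_2)$; this game, its closed-form best responses, and the tolerance are illustrative assumptions, not one of the paper's benchmark settings.

```python
import numpy as np

def payoff(i, x, theta):
    """Toy concave parametric payoff: u_i(x; theta) = x_i * (theta_i - x_1 - x_2)."""
    return x[i] * (theta[i] - x.sum())

def best_response(i, x, theta, x_max=10.0):
    """argmax over x_i in [0, x_max] of u_i(x_i, x_{-i}; theta); closed form for this toy game."""
    return np.clip((theta[i] - (x.sum() - x[i])) / 2.0, 0.0, x_max)

def regrets(x_star, theta):
    """Per-player regrets u_i(BR_i, x*_{-i}; theta) - u_i(x*; theta); their sum equals the
    exploitability phi(x*; theta), since the max over deviations decomposes per player."""
    out = []
    for i in range(len(x_star)):
        y = x_star.copy()
        y[i] = best_response(i, x_star, theta)
        out.append(payoff(i, y, theta) - payoff(i, x_star, theta))
    return np.array(out)

def is_eps_inverse_ne(x_star, theta, eps=1e-6):
    """theta is an eps-inverse NE iff the observed x* is an eps-NE of the game at theta."""
    return bool(regrets(x_star, theta).max() <= eps)

x_star, theta = np.array([2.0, 2.0]), np.array([6.0, 6.0])       # x* is the NE of the game at theta
print(regrets(x_star, theta), is_eps_inverse_ne(x_star, theta))  # -> [0. 0.] True
```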
Note that this definition does not require that we identify the true parameters $\theta^*$, as identifying $\theta^*$ is impossible unless there exists a bijection between the set of parameters and the set of Nash equilibria, a highly restrictive assumption that is not even satisfied in games with a unique Nash equilibrium. To compute an inverse NE is to find parameter values that minimize the exploitability of the observed equilibrium. This problem is a min-max optimization problem, as the exploitability is itself the maximum, over deviations, of the players' cumulative regrets. More precisely:

Theorem 3.1. The set of inverse NE of $\mathcal{G}^{-1}$ is the set of parameter profiles $\theta \in \Theta$ that solve the optimization problem $\min_{\theta \in \Theta} \varphi(x^*; \theta)$, or equivalently, this min-max optimization problem:

$$\min_{\theta \in \Theta} \max_{y \in X} f(\theta, y) := \psi(x^*, y; \theta) = \sum_{i \in [n]} u_i(y_i, x^*_{-i}; \theta) - u_i(x^*; \theta) \quad (1)$$

This min-max optimization problem can be seen as a generalization of the dual of Waugh et al.'s (2013) maximum entropy likelihood maximization method for games with possibly continuous strategy spaces, taking Nash equilibrium rather than maximum entropy correlated equilibrium as the inverse equilibrium. In contrast to Waugh et al.'s dual, our min-max optimization problem characterizes the set of all inverse NE, and not only a subset of the inverse correlated equilibria, in particular those that maximize entropy. This formulation also generalizes Swamy et al.'s (2021) moment matching game from a single-agent to a multiagent setting.

Algorithm 1: Adversarial Inverse Multiagent Planning
Inputs: $\Theta, X, f, \eta_\theta, \eta_y, T, \theta^{(0)}, y^{(0)}, x^*$
Outputs: $(\theta^{(t)}, y^{(t)})_{t=0}^{T}$
1: for $t = 0, \ldots, T-1$ do
2: $\quad \theta^{(t+1)} \leftarrow \Pi_\Theta\big(\theta^{(t)} - \eta_\theta^{(t)} \nabla_\theta f(\theta^{(t)}, y^{(t)})\big)$
3: $\quad y^{(t+1)} \leftarrow \Pi_X\big(y^{(t)} + \eta_y^{(t)} \nabla_y f(\theta^{(t)}, y^{(t)})\big)$
4: return $(\theta^{(t)}, y^{(t)})_{t=0}^{T}$

Without further assumptions, the objective function $f$ in Equation (1) is non-convex non-concave; however, under suitable assumptions (Assumption 1), satisfied by finite-action normal-form games, for example, it becomes convex-concave.

Assumption 1. Given an inverse game $\mathcal{G}^{-1}$, assume: 1. (Concave game) for all parameters $\theta \in \Theta$, $\mathcal{G}_\theta$ is concave; and 2. (Convex parametrization) $\Theta$ is non-empty, compact, and convex; and for all $i \in [n]$, $y_i \in X_i$, and $x \in X$, each player $i$'s regret $\theta \mapsto u_i(y_i, x_{-i}; \theta) - u_i(x; \theta)$ is convex.

Remark 1. Perhaps surprisingly, the set of inverse NE can be convex even when the set of NE is not, since the set of solutions to a convex-concave (or even convex-non-concave) min-max optimization problem is convex. This observation should alleviate any worries about the computational intractability of inverse game theory that might have been suggested by the computational intractability of game theory itself (Daskalakis et al., 2009; Chen & Deng, 2006).

If, additionally, we assume the players' payoffs are Lipschitz-smooth (Assumption 2), Equation (1) can then be solved to $\varepsilon$ precision in $O(1/\varepsilon^2)$ iterations via gradient descent ascent (Algorithm 1). That is, as Theorem 3.2 shows, an $\varepsilon$-inverse NE can be computed in $O(1/\varepsilon^2)$ iterations.⁶ We note that this convergence complexity can be further reduced to $O(1/\varepsilon)$ (even without decreasing step sizes) if one instead applies an extragradient descent ascent method (Golowich et al., 2020) or optimistic GDA (Gorbunov et al., 2022).

⁶ We include detailed theorem statements and proofs in Section 7.2.

Assumption 2 (Lipschitz-Smooth Game). For all players $i \in [n]$, $u_i$ is $\ell_{\nabla u_i}$-Lipschitz-smooth.
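For intuition, the following is a minimal JAX sketch of the gradient descent ascent scheme of Algorithm 1, run on the toy Cournot-style game used above; the parameter box $[0, 10]^2$, the constant step size (in place of the schedule in Theorem 3.2), and the iteration budget are illustrative assumptions.

```python
import jax, jax.numpy as jnp

def payoff(i, x, theta):                      # toy concave parametric payoff from the earlier sketch
    return x[i] * (theta[i] - x.sum())

def f(theta, y, x_star):                      # cumulative regret psi(x*, y; theta), as in Eq. (1)
    def regret(i):
        dev = x_star.at[i].set(y[i])          # player i deviates to y_i against x*_{-i}
        return payoff(i, dev, theta) - payoff(i, x_star, theta)
    return sum(regret(i) for i in range(x_star.shape[0]))

def project(z):                               # orthogonal projection onto the box [0, 10]
    return jnp.clip(z, 0.0, 10.0)

def adversarial_planning(x_star, theta0, y0, eta=0.05, T=2000):
    grad_theta, grad_y = jax.grad(f, argnums=0), jax.grad(f, argnums=1)
    theta, y, theta_avg = theta0, y0, jnp.zeros_like(theta0)
    for _ in range(T):
        theta, y = (project(theta - eta * grad_theta(theta, y, x_star)),   # stabilizer: descend
                    project(y + eta * grad_y(theta, y, x_star)))           # destabilizer: ascend
        theta_avg += theta / T                # report the time-averaged iterate, as in Theorem 3.2
    return theta_avg

x_star = jnp.array([2.0, 2.0])                # observed strategy profile, assumed to be an NE
print(adversarial_planning(x_star, theta0=jnp.ones(2), y0=x_star))
```

In this toy instance the regret is affine in $\theta$ and concave in $y$, so the objective is convex-concave, and the averaged parameter iterate approaches the inverse NE of this instance, $\theta = (6, 6)$.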
Theorem 3.2 (Inverse NE Complexity). Under Assumptions 1 and 2, for $\varepsilon \ge 0$, if Algorithm 1 is run with inputs that satisfy $T \in \Omega(1/\varepsilon^2)$ and, for all $t \in [T]$, $\eta^{(t)} \propto 1/\sqrt{t}$, then the time-average of the parameter iterates, $\bar\theta^{(T)} := \frac{1}{T+1} \sum_{t=0}^{T} \theta^{(t)}$, is an $\varepsilon$-inverse NE.

4 INVERSE MULTIAGENT REINFORCEMENT LEARNING

In this section, we build on our zero-sum game (i.e., min-max optimization) characterization of inverse game theory to tackle inverse MARL in an analogous fashion. As it is unreasonable to assume exact oracle access to the players' (cumulative) payoffs in inverse MARL, we relax this assumption in favor of a stochastic oracle model. More specifically, we assume access to a differentiable game simulator (Suh et al., 2022), which simulates histories of play $h$ according to $\nu_\pi$, given any policy profile $\pi$, and returns the rewards $r$ and transition probabilities $p$,⁷ encountered along the way, together with their gradients.

⁷ We note that in inverse reinforcement learning, as opposed to reinforcement learning, it is typical to assume that the transition model is known (see, for instance, (Abbeel & Ng, 2004), Footnote 8).

Formally, an inverse Markov game $\mathcal{M}^{-1} := (\mathcal{M}_{\setminus \theta}, \pi^*)$ is an inverse game that comprises a Markov game form (i.e., a parametric Markov game sans its parameter) $\mathcal{M}_{\setminus \theta}$ together with an observed policy profile $\pi^*$, which we assume is a Nash equilibrium. Crucially, we do not observe the parameters of the payoff functions. Since a Markov game is a normal-form game with payoffs given by $u(\pi; \theta) = \mathbb{E}_{H \sim \nu_\pi}\big[\sum_{t=0}^{\infty} r(S^{(t)}, A^{(t)}; \theta)\big]$, the usual definitions of inverse NE and cumulative regret apply, and the following result, which characterizes the set of inverse NE as the minimizers of a stochastic min-max optimization problem, is a corollary of Theorem 3.1.

Corollary 1. The set of inverse NE of $\mathcal{M}^{-1}$ is characterized by solutions to the following problem:

$$\min_{\theta \in \Theta} \max_{\pi \in P} f(\theta, \pi) := \sum_{i \in [n]} \mathbb{E}\left[\sum_{t=0}^{\infty} r_i(S^{(t)}, A^{(t)}; \theta) - \sum_{t=0}^{\infty} r_i(S^{*(t)}, A^{*(t)}; \theta)\right] \quad (2)$$

where, for each player $i$, $(S^{(t)}, A^{(t)})_t \sim \nu_{(\pi_i, \pi^*_{-i})}$ and $(S^{*(t)}, A^{*(t)})_t \sim \nu_{\pi^*}$.

As is usual in reinforcement learning, we use policy gradient to solve the destabilizer's problem in Equation (2). To do so, we restrict the destabilizer's action space to a policy class $P_X$ parameterized by $X \subseteq \mathbb{R}^l$. Redefining $f(\theta, x) := f(\theta, \pi_x)$, for $\pi_x \in P_X$, we aim to solve the stochastic min-max optimization problem $\min_{\theta \in \Theta} \max_{x \in X} f(\theta, x)$. Solutions to this problem are a superset of the solutions to Equation (2), unless it so happens that all best responses can be represented by policies in $P_X$, because restricting the expressivity of the policy class decreases the power of the destabilizer.

As in Section 3, without any additional assumptions, $f$ is in general non-convex, non-concave, and non-smooth. While we can ensure convexity and smoothness of $\theta \mapsto f(\theta, x)$ under suitable assumptions on the game parameterization, namely by assuming the regret at each state is convex in $\theta$, concavity in $x$ is not satisfied even by finite state and action Markov games. Under the following conditions, however, we can guarantee that $f$ is Lipschitz-smooth, convex in $\theta$, and gradient-dominated in $x$.

Assumption 3 (Lipschitz-Smooth Gradient-Dominated Game). Given an inverse Markov game $\mathcal{M}^{-1}$, assume: 1. $S$ and $A$ are non-empty and compact; 2. (Convex parameter spaces) $X$ and $\Theta$ are non-empty, compact, and convex; 3. (Smooth Game) $\nabla r$, $\nabla p$, and $\nabla_x \pi_x$, for all policies $\pi_x \in P_X$, are continuously differentiable; 4. (Gradient-Dominated Game) for all players $i \in [n]$, states $s \in S$, action profiles $a \in A$, and policies $\pi_x \in P_X$, $x \mapsto q^{\pi_x}_i(s, \pi_x(s); \theta)$ is $\mu$-gradient-dominated for some $\mu > 0$; and 5. (Closure under Policy Improvement) for all states $s \in S$, players $i \in [n]$, and policy profiles $\pi \in P$, there exists $\pi_x \in P_X$ s.t. $q^{\pi}_i(s, (\pi_x)_i(s), \pi_{-i}(s)) = \max_{a_i \in A_i} q^{\pi}_i(s, a_i, \pi_{-i}(s))$.

Part 3 of Assumption 3 implies that the game's cumulative payoff function is Lipschitz-smooth in the policy parameters $x$.
We note that a large class of Markov games satisfy Part 4, including linear-quadratic games (Bhandari & Russo, 2019), finite state and action games, and continuous state and action games whose rewards (resp. transition probabilities) are concave (resp. stochastically concave) in each player's action (Atakan, 2003b). Finally, Part 5 is a standard assumption (see, for instance, Section 5 of Bhandari & Russo (2019)), which guarantees that the policy parameterization is expressive enough to represent best responses.

Assumption 4 (Convex Parameterization). Given an inverse Markov game $\mathcal{M}^{-1}$, assume that for all players $i \in [n]$, states $s \in S$, and action profiles $a, b \in A$, the per-state regret $\theta \mapsto r_i(s, b_i, a_{-i}; \theta) - r_i(s, a; \theta)$ is convex.

With these assumptions in hand, we face a convex, gradient-dominated optimization problem, i.e., $\theta \mapsto f(\theta, x)$ is convex, for all $x \in X$, and $x \mapsto f(\theta, x)$ is gradient-dominated, for all $\theta \in \Theta$. As for normal-form games (see Remark 1), the set of inverse NE in Markov games is convex under Assumptions 3 and 4. Consequently, we can obtain polynomial-time convergence of stochastic gradient descent ascent (Algorithm 2) by slightly modifying known results (Daskalakis et al., 2020).

Algorithm 2: Adversarial Inverse MARL
Inputs: $\Theta, P_X, (\hat f_\theta, \hat f_x), \eta_\theta, \eta_x, T, \theta^{(0)}, x^{(0)}, \pi^*$
Outputs: $(\theta^{(t)}, x^{(t)})_{t=0}^{T}$
1: for $t = 0, \ldots, T-1$ do
2: $\quad$ Sample $H \sim \times_{i \in [n]} \nu_{(\pi_{x^{(t)},i},\, \pi^*_{-i})}$ and $h^* \sim \nu_{\pi^*}$
3: $\quad \theta^{(t+1)} \leftarrow \Pi_\Theta\big(\theta^{(t)} - \eta_\theta^{(t)} \hat f_\theta(\theta^{(t)}, x^{(t)}; H, h^*)\big)$
4: $\quad x^{(t+1)} \leftarrow \Pi_X\big(x^{(t)} + \eta_x^{(t)} \hat f_x(\theta^{(t)}, x^{(t)}; H, h^*)\big)$
5: return $(\theta^{(t)}, x^{(t)})_{t=0}^{T}$

Algorithm 2 requires an estimate of $\nabla f$ w.r.t. both $\theta$ and $x$. Under Part 3 of Assumption 3, the gradient of $f$ w.r.t. $x$ can be obtained by the deterministic policy gradient theorem (Silver et al., 2014), while the gradient of $f$ w.r.t. $\theta$ can be obtained by the linearity of the gradient and expectation operators. However, both of these gradients involve expectations over $H \sim \times_{i \in [n]} \nu_{(\pi_{x,i}, \pi^*_{-i})}$ and $H^* \sim \nu_{\pi^*}$. As such, we estimate them using simulated trajectories from the deviation history distribution, $H := (h_1, \ldots, h_n) \sim \times_{i \in [n]} \nu_{(\pi_{x,i}, \pi^*_{-i})}$, and the equilibrium history distribution, $h^* \sim \nu_{\pi^*}$, respectively. For a given such pair $(H, h^*)$, the cumulative regret gradient estimators $\hat f_\theta$ and $\hat f_x$ correspond to the gradients of the cumulative regrets between each deviation history $h_i$ in $H$ and $h^*$, and can be computed directly using the chain rule, as we assume access to a differentiable game simulator.⁸

⁸ For completeness, we show how to compute $\hat f_x$ and $\hat f_\theta$ in Section 7.5. In our experiments, however, as has become common practice in the literature (Mora et al., 2021), we compute these gradients by simply autodifferentiating the cumulative regret of any history w.r.t. the policy parameters using a library like JAX (Bradbury et al., 2018). We also show that under Assumption 3, $(\hat f_\theta, \hat f_x)$ is an unbiased estimate of $(\nabla_\theta f, \nabla_x f)$ whose variance is bounded.

Finally, we define the equilibrium distribution mismatch coefficient $\|\partial \delta^{\pi^*}_\mu / \partial \mu\|_\infty$ as (the sup-norm of) the Radon-Nikodym derivative of the state occupancy distribution of the NE $\pi^*$ w.r.t. the initial state distribution $\mu$. This coefficient, which measures the inherent difficulty of reaching states under $\pi^*$, is closely related to other distribution mismatch coefficients introduced in the analysis of policy gradient methods (Agarwal et al., 2020). With this definition in hand, we can finally show polynomial-time convergence of stochastic GDA (Algorithm 2) under Assumptions 3 and 4.

Theorem 4.1. Under Assumptions 3 and 4, for all $\varepsilon \in (0, 1)$, if Algorithm 2 is run with a number of iterations $T$ that is polynomial in $1/\varepsilon$ and, for all $t \in [T]$, step sizes satisfying $\eta_x^{(t)} \in O(\varepsilon^4)$ and $\eta_\theta^{(t)} \in O(\varepsilon^8)$, then the time-average of the parameter iterates, $\bar\theta^{(T)} := \frac{1}{T+1} \sum_{t=0}^{T} \theta^{(t)}$, is an $\varepsilon$-inverse NE.
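The following is a minimal JAX sketch of the gradient estimation step underlying Algorithm 2: the cumulative regret of sampled deviation histories against a sampled equilibrium history is autodifferentiated w.r.t. both the game parameters and the policy parameters, in the spirit of the differentiable-simulator approach described above. The `rollout_dev` and `rollout_eq` functions (differentiable Monte Carlo returns under the deviation and equilibrium policies) are assumed interfaces, not the paper's API.

```python
import jax, jax.numpy as jnp

def regret_estimate(theta, x, key, rollout_dev, rollout_eq, n_players):
    """Single-sample estimate of f(theta, x): for each player i, the return of the deviation
    (pi_{x,i}, pi*_{-i}) minus the return of the observed equilibrium pi*, both computed by a
    differentiable simulator so the estimate can be autodifferentiated w.r.t. theta and x."""
    keys = jax.random.split(key, n_players + 1)
    dev = jnp.stack([rollout_dev(theta, x, i, keys[i])        # player i's return on h_i ~ nu_(pi_{x,i}, pi*_{-i})
                     for i in range(n_players)])
    eq = rollout_eq(theta, keys[-1])                          # per-player returns on h* ~ nu_{pi*}
    return jnp.sum(dev - eq)

def sgda_step(theta, x, key, eta_theta, eta_x, theta_box, x_box, **sim):
    """One iteration of Algorithm 2: the stabilizer descends on theta, the destabilizer ascends
    on the policy parameters x; projections are onto boxes for simplicity."""
    g_theta, g_x = jax.grad(regret_estimate, argnums=(0, 1))(theta, x, key, **sim)
    theta = jnp.clip(theta - eta_theta * g_theta, *theta_box)
    x = jnp.clip(x + eta_x * g_x, *x_box)
    return theta, x
```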
5 SIMULACRAL LEARNING

In this section, we consider the more realistic setting in which we do not observe an equilibrium, but observe only sample histories $\{(s^{(t,k)}, a^{(t,k)})_t\}_k$ associated with an unobserved equilibrium $\pi^*$. The problem of interest then becomes one of not only inferring parameter values from observed behavior, but of additionally finding equilibrium policies that generate the observed behavior, a solution which we refer to as a first-order simulacrum. A first-order simulacrum can be seen as a generalization of an inverse equilibrium, as it not only comprises parameters that rationalize the observed histories, but also policies that mimic them in expectation. First-order simulacral learning is also known as multiagent apprenticeship learning (Abbeel & Ng, 2004; Yang et al., 2020).

Algorithm 3: Adversarial Simulacral Learning
Inputs: $\Theta, P_X, (\hat g_\theta, \hat g_x, \hat g_y), \eta_\theta, \eta_x, \eta_y, T, \theta^{(0)}, x^{(0)}, y^{(0)}, \{o^{*(k)}\}_k$
Outputs: $(\theta^{(t)}, x^{(t)}, y^{(t)})_{t=0}^{T}$
1: for $t = 0, \ldots, T-1$ do
2: $\quad$ Sample $H \sim \times_{i \in [n]} \nu_{(\pi_{y^{(t)},i},\, \pi_{x^{(t)},-i})}$ and $h \sim \nu_{\pi_{x^{(t)}}}$
3: $\quad \theta^{(t+1)} \leftarrow \Pi_\Theta\big(\theta^{(t)} - \eta_\theta^{(t)} \hat g_\theta(\theta^{(t)}, x^{(t)}, y^{(t)}; H, h)\big)$
4: $\quad x^{(t+1)} \leftarrow \Pi_X\big(x^{(t)} - \eta_x^{(t)} \hat g_x(\theta^{(t)}, x^{(t)}, y^{(t)}; H, h)\big)$
5: $\quad y^{(t+1)} \leftarrow \Pi_X\big(y^{(t)} + \eta_y^{(t)} \hat g_y(\theta^{(t)}, x^{(t)}, y^{(t)}; H, h)\big)$
6: return $(\theta^{(t)}, x^{(t)}, y^{(t)})_{t=0}^{T}$

Even more generally, we might not have access to samples $\{h^{*(k)}\}_{k \in [K]}$ from an equilibrium history distribution, but rather a lossy (possibly stochastic) function of those histories that produces observations $\{o^{*(k)}\}_{k \in [K]}$, distributed according to some (pushforward) observation distribution $\zeta_{\pi^*} \in \Delta(O)$, parameterized by policy profile $\pi^* \in P$, where $O$ is the observation space. This more general framework is very useful in applications where there are limitations on the data collection process: e.g., if there are game states at which some of the players' actions are unobservable, or when only an unfaithful function of them is available. Here, we seek to learn the more general notion of a second-order simulacrum.

Formally, an inverse simulation $\mathcal{I}^{-1} := (\mathcal{M}_{\setminus \theta}, O, \zeta, \zeta_{\pi^*})$ is a tuple consisting of a Markov game form $\mathcal{M}_{\setminus \theta}$ with unknown parameters $\theta$, an observation distribution $\zeta : P \to \Delta(O)$ mapping policies to distributions over the observation space $O$, and an observation distribution $\zeta_{\pi^*}$ for the unobserved behavioral policy $\pi^*$, which we assume is a Nash equilibrium. Our goal is to find an $(\varepsilon, \delta)$-Nash simulacrum, meaning a tuple of parameters and policies $(\theta, \pi) \in \Theta \times P$ that $(\varepsilon, \delta)$-simulates the observations as a Nash equilibrium: i.e., $u_i(\pi; \theta) \ge \max_{\pi_i' \in P_i} u_i(\pi_i', \pi_{-i}; \theta) - \varepsilon$ for all players $i \in [n]$, and $\mathbb{E}_{(O, O^*)}\big[\|O - O^*\|^2\big] \le \delta$, where $O \sim \zeta_\pi$ and $O^* \sim \zeta_{\pi^*}$. Theorem 5.1, which is analogous to Corollary 1, characterizes the set of Nash simulacra of an inverse simulation.

Theorem 5.1. Given an inverse simulation $\mathcal{I}^{-1}$, for any $\beta > 0$, the set of Nash simulacra of $\mathcal{I}^{-1}$ is equal to the set of minimizers of the following stochastic min-max optimization problem:

$$\min_{(\theta, \pi) \in \Theta \times P} \ \max_{\rho \in P} \ g(\theta, \pi, \rho) := \mathbb{E}_{(O, O^*)}\big[\|O - O^*\|^2\big] + \beta\, \psi(\pi, \rho; \theta) \quad (3)$$

To tackle simulacral learning, we approximate $g$ via realized observation samples $\{o^{*(k)}\}_{k \in [K]}$, based on which we compute the empirical learning loss $\hat g(\theta, \pi, \rho) := \frac{1}{K} \sum_{k \in [K]} \mathbb{E}_{O \sim \zeta_\pi}\big[\|O - o^{*(k)}\|^2\big] + \beta\, \psi(\pi, \rho; \theta)$.
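As a rough sketch of how this empirical loss can be assembled for gradient-based training (cf. Algorithm 3), the snippet below combines a Monte Carlo observation-matching term with $\beta$ times an estimated cumulative regret; `observe` (one sampled observation $O \sim \zeta_{\pi_x}$) and `regret_hat` (an estimator of $\psi(\pi_x, \pi_y; \theta)$, built along the lines of the Section 4 sketch) are assumed interfaces, not the paper's API.

```python
import jax, jax.numpy as jnp

def empirical_loss(theta, x, y, obs_samples, key, observe, regret_hat, beta=1.0):
    """Empirical simulacral learning loss: mean squared observation mismatch plus beta times
    the estimated cumulative regret of the candidate policy pi_x against deviations pi_y."""
    keys = jax.random.split(key, obs_samples.shape[0] + 1)
    sim = jnp.stack([observe(x, k) for k in keys[:-1]])           # one simulated O ~ zeta_{pi_x} per sample
    match = jnp.mean(jnp.sum((sim - obs_samples) ** 2, axis=-1))  # (1/K) sum_k ||O - o*(k)||^2
    return match + beta * regret_hat(theta, x, y, keys[-1])

# Algorithm 3 descends on (theta, x) and ascends on y using these gradients:
loss_grads = jax.grad(empirical_loss, argnums=(0, 1, 2))
```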
Additionally, as in the previous section, we once again restrict policies to lie within a parametric class of policies $P_X$, redefine $g(\theta, x, y) := g(\theta, \pi_x, \pi_y)$ and $\hat g(\theta, x, y) := \hat g(\theta, \pi_x, \pi_y)$, and solve the ensuing optimization problem over the empirical learning loss, $\min_{(\theta, x) \in \Theta \times X} \max_{y \in X} \hat g(\theta, x, y)$. In general, this stochastic min-max optimization problem is non-convex non-concave. By Assumption 3, however, the function $y \mapsto g(\theta, x, y)$ is gradient-dominated, for all $\theta \in \Theta$ and $x \in X$. Nevertheless, it is not possible to guarantee that $(\theta, x) \mapsto g(\theta, x, y)$ is convex or gradient-dominated, for all $y \in X$, without overly restrictive assumptions. This claim is intuitive, since the computation of an inverse simulacrum involves computing a Nash equilibrium policy, which in general is a PPAD-complete problem (Daskalakis et al., 2009; Foster et al., 2023). Finally, defining gradient estimators as we did in Section 4, to obtain gradient estimators $(\hat g_\theta, \hat g_x, \hat g_y)(\theta, x, y; H, h_x)$ from sampled histories $H \sim \times_{i \in [n]} \nu_{(\pi_{y,i}, \pi_{x,-i})}$ and $h_x \sim \nu_{\pi_x}$, we can use Algorithm 3 to compute a local solution of Equation (3) from polynomially many observations.

Theorem 5.2. Suppose that Assumption 3 holds, and that for all $x \in X$, $\zeta_{\pi_x}$ is twice continuously differentiable in $x$. For any $\varepsilon \in (0, 1)$, if Algorithm 3 is run with a number of iterations $T$ that is polynomial in $1/\varepsilon$ and, for all $t \in [T]$, step sizes satisfying $\eta_y^{(t)} \in O(\varepsilon^4)$ and $\eta_\theta^{(t)}, \eta_x^{(t)} \in O(\varepsilon^8)$, then the best iterate $(\theta_{\mathrm{best}}, x_{\mathrm{best}})$ converges to an $\varepsilon$-stationary point of $\varphi$ (defined in Section 7.2). Additionally, for any $\epsilon, \delta > 0$, it holds with probability at least $1 - \delta$ that $\big|\hat\varphi(\theta^{(T)}_{\mathrm{best}}) - \varphi(\theta^{(T)}_{\mathrm{best}})\big| \le \epsilon$ if the number of sample observations is in $\Omega\big(\tfrac{1}{\epsilon^2}\log\tfrac{1}{\delta}\big)$.

6 EXPERIMENTS

We run two sets of experiments with the aim of answering two questions. Our first goal is to understand the extent to which our algorithms are able to compute inverse Nash equilibria, if any, beyond our theoretical guarantees. Our second goal is to understand the ability of game-theoretic models to make predictions about the future.⁹

[Figure 1: Hourly prices in the Spanish electricity market from January 2015 to December 2020. The Nash simulacrum achieves an MSE half that of the ARIMA method.]

In our first set of experiments, we consider five types of economic games whose equilibria and payoffs have different properties. The first three are Fisher market (FM) games, which are zero-sum games between sellers and buyers engaged in trading goods. These games can be categorized based on the buyers' utility functions as linear, Cobb-Douglas, or Leontief (Cheung et al., 2013). We then consider two general-sum economic games, which model competition between two firms, namely Cournot competition and Bertrand oligopoly. When budgets are the only parameters we seek to recover, our min-max formulation is convex-concave, because the players' payoffs are concave in their actions and affine in their budgets, and hence the regret of the players is also affine in the players' budgets. In addition, in both the Cournot competition and Bertrand oligopoly games, regret is again convex in the parameters of the game. Finally, all the games we study are concave, with the exception of the Bertrand oligopoly game, and the equilibria are unique in the Cobb-Douglas FM, Cournot competition, and Bertrand oligopoly games. In each experiment, we generate 500 synthetic game instances, for which the true parameters are known, and use Algorithm 1 (which does not rely on this knowledge) to compute an inverse NE for each.
We record whether our algorithm recovers the true parameters of the market and whether it finds an inverse NE (i.e., average exploitability). We summarize our findings for the FM games in Table 2. We find that our algorithm recovers the true parameters more often when budgets are the only parameters we seek to recover, as opposed to both budgets and types; but even in the non-convex-concave case, our algorithm is still able to approximate inverse NE over 80% of the time. In settings where the equilibria are unique, we recover true parameters most often, while the worst performance is on Leontief FM games, where payoffs are not differentiable.

| Fisher market games | Budgets: Linear | Budgets: Leontief | Budgets: CD | Types + Budgets: Linear | Types + Budgets: Leontief | Types + Budgets: CD |
|---|---|---|---|---|---|---|
| % parameters recovered | 100% | 36.8% | 100% | 12% | 1% | 99.6% |
| Average exploitability | 0.0018 | 0.2240 | 0.0004 | 0.0119 | 0.1949 | 0.0004 |

| | Cournot | Bertrand |
|---|---|---|
| % parameters recovered | 95.2% | 78% |
| Average exploitability | 0.0000 | 0.0011 |

Table 2: The percentage of games for which we recovered the true parameters, and the average exploitability of the observed equilibrium evaluated w.r.t. the computed inverse Nash equilibrium.

In our second set of experiments, we model the Spanish electricity market as a stochastic Fisher market game between electricity re-sellers and consumers. In this game, the state comprises the supply of each good and the consumers' budgets, while the re-sellers' actions are to set prices in today's spot market and tomorrow's day-ahead market, and the consumers' actions are their electricity demands. We assume the consumers' utilities are linear; this choice is suited to modeling the substitution effect between electricity today and electricity tomorrow. Using publicly available hourly Spanish electricity prices and aggregate demand data from Kaggle, we compute a simulacrum of the game that seeks to replicate these observations from January 2015 to December 2016. We also train an ARIMA model on the same data, and run a hyperparameter search for both algorithms using data from January 2017 to December 2018. After picking hyperparameters, we then retrain both models on the data between January 2015 and December 2018, and predict prices up to December 2018. We also compute the mean squared error (MSE) of both methods using January 2018 to December 2020 as a test set. We show the predictions of both methods in Figure 1. To summarize, we find that the simulacrum makes predictions whose MSE is half that of the ARIMA model's.

⁹ Our code can be found here.

REFERENCES

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1, 2004.

Sydney N Afriat. The construction of utility functions from expenditure data. International Economic Review, 8(1):67-77, 1967.

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pp. 64-66. PMLR, 2020.

Kenneth Arrow and Gerard Debreu. Existence of an equilibrium for a competitive economy. Econometrica: Journal of the Econometric Society, pp. 265-290, 1954.

Alp E Atakan. Stochastic convexity in dynamic programming. Economic Theory, 22:447-455, 2003a.

Alp E. Atakan. Stochastic convexity in dynamic programming. Economic Theory, 22(2):447-455, 2003b. URL http://www.jstor.org/stable/25055693.

Robert J Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67-96, 1974.
Monica Babes, Vukosi Marivate, Kaushik Subramanian, and Michael L Littman. Apprenticeship learning about multiple intentions. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 897-904, 2011.

Patrick Bajari, Han Hong, and Stephen P Ryan. Identification and estimation of a discrete game of complete information. Econometrica, 78(5):1529-1568, 2010.

Jean Baudrillard. Simulacra and simulation. University of Michigan Press, 1994.

Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716, 1952.

Aaron Bestick, Lillian J Ratliff, Posu Yan, Ruzena Bajcsy, and S Shankar Sastry. An inverse correlated equilibrium framework for utility learning in multiplayer, noncooperative settings. In Proceedings of the 2nd ACM international conference on High confidence networked systems, pp. 9-16, 2013.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.

Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 182-189. JMLR Workshop and Conference Proceedings, 2011.

Abdeslam Boularias, Oliver Krömer, and Jan Peters. Structured apprenticeship learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part II 23, pp. 227-242. Springer, 2012.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

William C Brainard, Herbert E Scarf, et al. How to compute equilibrium prices in 1891. Citeseer, 2000.

Timothy F Bresnahan and Peter C Reiss. Empirical models of discrete games. Journal of Econometrics, 48(1-2):57-81, 1991.

Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning, pp. 783-792. PMLR, 2019.

Timothy C. Y. Chan, Rafid Mahmood, and Ian Yihang Zhu. Inverse optimization: Theory and applications, 2022.

Timothy CY Chan, Rafid Mahmood, and Ian Yihang Zhu. Inverse optimization: Theory and applications. arXiv preprint arXiv:2109.03920, 2021.

Xi Chen and Xiaotie Deng. Settling the complexity of two-player nash equilibrium. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 06), pp. 261-272. IEEE, 2006.

Yun Kuen Cheung, Richard Cole, and Nikhil Devanur. Tatonnement beyond gross substitutes? Gradient descent to the rescue. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC 13, pp. 191-200, New York, NY, USA, 2013. Association for Computing Machinery. doi: 10.1145/2488608.2488633. URL https://doi.org/10.1145/2488608.2488633.

Jaedeug Choi and Kee-Eung Kim. MAP inference for Bayesian inverse reinforcement learning. Advances in neural information processing systems, 24, 2011.

Adam Coates, Pieter Abbeel, and Andrew Y Ng. Apprenticeship learning for helicopter control. Communications of the ACM, 52(7):97-105, 2009.
John M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641-664, 1966. URL http://www.jstor.org/stable/2946123.

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195-259, 2009.

Constantinos Daskalakis, Dylan J Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. Advances in neural information processing systems, 33:5527-5540, 2020.

Arlington M Fink. Equilibrium in a stochastic n-person game. Journal of Science of the Hiroshima University, Series A-I (Mathematics), 28(1):89-93, 1964.

Dylan J Foster, Noah Golowich, and Sham M Kakade. Hardness of independent learning and sparse equilibrium computation in Markov games. arXiv preprint arXiv:2303.12287, 2023.

Justin Fu, Andrea Tacchetti, Julien Perolat, and Yoram Bachrach. Evaluating strategic structures in multi-agent inverse reinforcement learning. Journal of Artificial Intelligence Research, 71:925-951, 2021.

David Gale. The theory of linear economic models. University of Chicago Press, 1989.

Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman E. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. CoRR, abs/2002.00057, 2020. URL https://arxiv.org/abs/2002.00057.

Eduard Gorbunov, Adrien Taylor, and Gauthier Gidel. Last-iterate convergence of optimistic gradient method for monotone variational inequalities. Advances in neural information processing systems, 35:21858-21870, 2022.

Amy Greenwald and Amir Jafari. A general class of no-regret learning algorithms and game-theoretic equilibria. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pp. 2-12. Springer, 2003.

Sanford J Grossman and Oliver D Hart. An analysis of the principal-agent problem. In Foundations of Insurance Economics: Readings in Economics and Finance, pp. 302-340. Springer, 1992.

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29, 2016.

John C Harsanyi. Games with incomplete information played by Bayesian players, I-III. Part I. The basic model. Management Science, 14(3):159-182, 1967.

John C. Harsanyi. Games with incomplete information played by "Bayesian" players, I-III. Part III. The basic probability distribution of the game. Management Science, 14(7):486-502, 1968. URL http://www.jstor.org/stable/2628894.

Clemens Heuberger. Inverse combinatorial optimization: A survey on problems, methods, and results. Journal of Combinatorial Optimization, 8:329-361, 2004.

Bengt Holmström. Moral hazard and observability. The Bell Journal of Economics, pp. 74-91, 1979.

Kamal Jain, Vijay V Vazirani, and Yinyu Ye. Market equilibria for homothetic, quasi-concave utilities and economies of scale in production. In SODA, volume 5, pp. 63-71, 2005.

Shankar Kalyanaraman and Christopher Umans. The complexity of rationalizing matchings. In Algorithms and Computation: 19th International Symposium, ISAAC 2008, Gold Coast, Australia, December 15-17, 2008. Proceedings 19, pp. 171-182. Springer, 2008.
Shankar Kalyanaraman and Christopher Umans. The complexity of rationalizing network formation. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pp. 485-494. IEEE, 2009.

Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin. Inverse reinforcement learning through structured classification. Advances in neural information processing systems, 25, 2012.

Edouard Klein, Bilal Piot, Matthieu Geist, and Olivier Pietquin. A cascaded supervised learning approach to inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I 13, pp. 1-16. Springer, 2013.

Volodymyr Kuleshov and Okke Schrijvers. Inverse game theory: Learning utilities in succinct games. In Web and Internet Economics: 11th International Conference, WINE 2015, Amsterdam, The Netherlands, December 9-12, 2015, Proceedings 11, pp. 413-427. Springer, 2015.

Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. Advances in neural information processing systems, 24, 2011.

Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pp. 6083-6093. PMLR, 2020.

Xiaomin Lin, Peter A Beling, and Randy Cogill. Multiagent inverse reinforcement learning for two-person zero-sum games. IEEE Transactions on Games, 10(1):56-68, 2017.

Xiaomin Lin, Stephen C Adams, and Peter A Beling. Multi-agent inverse reinforcement learning for certain general-sum stochastic games. Journal of Artificial Intelligence Research, 66:473-502, 2019.

Wietze Lise. Estimating a game theoretic model. Computational Economics, 18:141-157, 2001.

Manuel Lopes, Francisco Melo, and Luis Montesano. Active learning for reward estimation in inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 31-46. Springer, 2009.

Andreu Mas-Colell, Michael D. Whinston, and Jerry R. Green. Microeconomic Theory. Number 9780195102680 in OUP Catalogue. Oxford University Press, 1995. URL https://ideas.repec.org/b/oxp/obooks/9780195102680.html.

Eric Maskin and Jean Tirole. Markov perfect equilibrium: I. Observable actions. Journal of Economic Theory, 100(2):191-219, 2001.

Miguel Angel Zamora Mora, Momchil Peychev, Sehoon Ha, Martin Vechev, and Stelian Coros. PODS: Policy optimization via differentiable simulation. In International Conference on Machine Learning, pp. 7805-7817. PMLR, 2021.

Hervé Moulin and J P Vial. Strategically zero-sum games: the class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 7:201-221, 1978.

John F. Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48-49, 1950. doi: 10.1073/pnas.36.1.48. URL https://www.pnas.org/doi/abs/10.1073/pnas.36.1.48.

Sriraam Natarajan, Gautam Kunapuli, Kshitij Judah, Prasad Tadepalli, Kristian Kersting, and Jude Shavlik. Multi-agent inverse reinforcement learning. In 2010 ninth international conference on machine learning and applications, pp. 395-400. IEEE, 2010.

Denis Nekipelov, Vasilis Syrgkanis, and Eva Tardos. Econometrics for learning agents. In Proceedings of the sixteenth ACM conference on economics and computation, pp. 1-18, 2015.
Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Andrew Y Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pp. 2, 2000.

Hukukane Nikaido and Kazuo Isoda. Note on non-cooperative convex games. Pacific Journal of Mathematics, 5(S1):807-815, 1955.

Alexander Peysakhovich, Christian Kroer, and Adam Lerer. Robust multi-agent counterfactual prediction. Advances in Neural Information Processing Systems, 32, 2019.

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pp. 2586-2591, 2007.

Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729-736, 2006.

Paul A Samuelson. Consumption theory in terms of revealed preference. Economica, 15(60):243-253, 1948.

Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095-1100, 1953.

David Silver, James Bagnell, and Anthony Stentz. High performance outdoor navigation from overhead data using imitation learning. Robotics: Science and Systems IV, Zurich, Switzerland, 1, 2008.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387-395. PMLR, 2014.

Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. Advances in neural information processing systems, 31, 2018.

Hyung Ju Suh, Max Simchowitz, Kaiqing Zhang, and Russ Tedrake. Do differentiable simulators give better policy gradients? In International Conference on Machine Learning, pp. 20668-20696. PMLR, 2022.

Gokul Swamy, Sanjiban Choudhury, J Andrew Bagnell, and Steven Wu. Of moments and matching: A game-theoretic framework for closing the imitation gap. In International Conference on Machine Learning, pp. 10022-10032. PMLR, 2021.

Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. Advances in neural information processing systems, 20, 2007.

Vasilis Syrgkanis, Elie Tamer, and Juba Ziani. Inference on auctions with weak assumptions on information. arXiv preprint arXiv:1710.03830, 2017.

Masayuki Takahashi. Equilibrium points of stochastic non-cooperative n-person games. Journal of Science of the Hiroshima University, Series A-I (Mathematics), 28(1):95-99, 1964.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd international conference on Machine learning, pp. 896-903, 2005.

Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137-3181, 2010.

Hal R Varian. The nonparametric approach to demand analysis. Econometrica: Journal of the Econometric Society, pp. 945-973, 1982.

Hal R Varian. Revealed preference. Samuelsonian economics and the twenty-first century, pp. 99-115.

Leon Walras. Elements de l'economie politique pure, ou, Theorie de la richesse sociale. F. Rouge.
Kevin Waugh, Brian D Ziebart, and J Andrew Bagnell. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, 2013.

Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.

Mingzhou Yang, Yanhua Li, Xun Zhou, Hui Lu, Zhihong Tian, and Jun Luo. Inferring passengers' interactive choices on public transits via MA-AL: Multi-agent apprenticeship learning. In Proceedings of The Web Conference 2020, pp. 1637-1647, 2020.

Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-agent adversarial inverse reinforcement learning. In International Conference on Machine Learning, pp. 7194-7201. PMLR, 2019.

Ernst Zermelo. Über eine Anwendung der Mengenlehre auf die Theorie des Schachspiels. In Proceedings of the fifth international congress of mathematicians, volume 2, pp. 501-504. Cambridge University Press, Cambridge, 1913.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433-1438. Chicago, IL, USA, 2008.