# LOQA: Learning with Opponent Q-Learning Awareness

Published as a conference paper at ICLR 2024

Milad Aghajohari, Juan Agustin Duque, Tim Cooijmans, Aaron Courville
University of Montreal & Mila
firstname.lastname@umontreal.ca

In various real-world scenarios, interactions among agents often resemble the dynamics of general-sum games, where each agent strives to optimize its own utility. Despite the ubiquitous relevance of such settings, decentralized machine learning algorithms have struggled to find equilibria that maximize individual utility while preserving social welfare. In this paper we introduce Learning with Opponent Q-Learning Awareness (LOQA), a novel, decentralized reinforcement learning algorithm tailored to optimizing an agent's individual utility while fostering cooperation among adversaries in partially competitive environments. LOQA assumes the opponent samples actions proportionally to its action-value function Q. Experimental results demonstrate the effectiveness of LOQA at achieving state-of-the-art performance in benchmark scenarios such as the Iterated Prisoner's Dilemma and the Coin Game. LOQA achieves these outcomes with a significantly reduced computational footprint, making it a promising approach for practical multi-agent applications.

## 1 INTRODUCTION

A major difficulty in reinforcement learning (RL) and multi-agent reinforcement learning (MARL) is the non-stationary nature of the environment, where the outcome of each agent is determined not only by its own actions but also by those of other players (von Neumann, 1928). This difficulty often causes traditional algorithms to fail to converge to desirable solutions. In the context of general-sum games, independent RL agents that each seek to optimize their own utility often converge to solutions that are sub-optimal in the Pareto sense (Foerster et al., 2018b). This situation parallels many real-world scenarios in which individuals pursuing their own selfish interests end up worse off than if they had cooperated. One of the objectives of MARL research must therefore be to develop decentralized agents that are able to cooperate while avoiding being exploited in partially competitive settings. We call this reciprocity-based cooperation.

Previous work has produced algorithms that train reciprocity-based cooperative agents by differentiating through the opponent's learning step (Foerster et al., 2018b; Letcher et al., 2021; Zhao et al., 2022; Willi et al., 2022) or by modeling opponent shaping as a meta-game in the space of agent policies (Al-Shedivat et al., 2018; Kim et al., 2021; Lu et al., 2022; Cooijmans et al., 2023). However, both of these approaches have important drawbacks with respect to computational efficiency. On one hand, differentiating through even a few of the opponent's learning steps can only be done sequentially and requires building large computation graphs, which is computationally costly when dealing with complex opponent policies. On the other hand, meta-learning defines a meta-state over the product space of agent and opponent policies and learns a meta-policy that maps the meta-state to the agent's updated policy. The complexity of the problem then scales with the policy parameterization, which is usually a neural network with many parameters.
In this paper we introduce Learning with Opponent Q-Learning Awareness (LOQA), which stands apart because it avoids computing gradients through optimization steps or learning the dynamics of a meta-game, resulting in significantly improved computational efficiency. LOQA performs opponent shaping by assuming that the opponent's behavior is guided by an internal action-value function Q. This assumption allows LOQA agents to build a model of the opponent policy that can be shaped by influencing its returns for different actions. Controlling the return by differentiating through stochastic objectives is a key idea in RL and can be done using the REINFORCE estimator (Williams, 1992).

## 2 BACKGROUND

We consider general-sum, n-player Markov games, also referred to as stochastic games (Shapley, 1953). A Markov game is defined by a tuple $\mathcal{M} = (N, \mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A} := \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$ is the joint action space of all players, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ maps every state and joint action to a probability distribution over states, $R = \{r_1, \ldots, r_n\}$ is the set of reward functions, where each $r_i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ maps every state and joint action to a scalar reward, and $\gamma \in [0, 1]$ is the discount factor. We use the notation and definitions for standard RL algorithms of Agarwal et al. (2021).

Consider two agents, 1 (the agent) and 2 (the opponent), that interact in an environment with neural network policies $\pi_1 := \pi(\cdot \mid \cdot\,; \theta_1)$ and $\pi_2 := \pi(\cdot \mid \cdot\,; \theta_2)$, parameterized by $\theta_1$ and $\theta_2$ respectively. We denote by $\tau$ a trajectory with initial state distribution $\mu$ and probability measure $\Pr^{\pi_1, \pi_2}_{\mu}$ given by

$$\Pr^{\pi_1, \pi_2}_{\mu}(\tau) = \mu(s_0)\, \pi_1(a_0 \mid s_0)\, \pi_2(b_0 \mid s_0)\, P(s_1 \mid s_0, a_0, b_0) \cdots$$

where $b \in \mathcal{A}_2$ denotes the action of the opponent. In multi-agent reinforcement learning, each agent seeks to optimize its expected discounted return; for the agent this is given by

$$V^1(\mu) := \mathbb{E}_{\tau \sim \Pr^{\pi_1, \pi_2}_{\mu}}\!\left[ R^1(\tau) \right] = \mathbb{E}_{\tau \sim \Pr^{\pi_1, \pi_2}_{\mu}}\!\left[ \sum_{t=0}^{\infty} \gamma^t r^1(s_t, a_t, b_t) \right].$$

The key observation is that, under the definitions above, $V^1$ depends on the policy of the opponent through the reward function $r^1(s_t, a_t, b_t)$. $V^1$ is thus differentiable with respect to the parameters of the opponent via the REINFORCE estimator (Williams, 1992):

$$\nabla_{\theta_2} V^1(\mu) = \mathbb{E}_{\tau \sim \Pr^{\pi_1, \pi_2}_{\mu}}\!\left[ R^1(\tau) \sum_{t=0}^{\infty} \nabla_{\theta_2} \log \pi_2(b_t \mid s_t) \right] = \mathbb{E}_{\tau \sim \Pr^{\pi_1, \pi_2}_{\mu}}\!\left[ \sum_{t=0}^{\infty} \gamma^t r^1(s_t, a_t, b_t) \sum_{k \le t} \nabla_{\theta_2} \log \pi_2(b_k \mid s_k) \right].$$
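To make the background concrete, the following is a minimal Python sketch, not the authors' implementation, of the two ideas above: modeling the opponent's policy from its action-value function Q (here instantiated as a softmax over Q-values, which is one plausible reading of "proportionally to Q") and forming a single-sample REINFORCE estimate of $\nabla_{\theta_2} V^1(\mu)$ from a sampled trajectory. The helper names `opponent_q_net` and `trajectory`, and the default discount factor, are illustrative assumptions.

```python
import torch

def opponent_policy_from_q(q_values: torch.Tensor) -> torch.Tensor:
    # Assumed opponent model: action probabilities given by a softmax over Q_2(s, .).
    return torch.softmax(q_values, dim=-1)

def reinforce_grad_wrt_opponent(trajectory, opponent_q_net, gamma=0.96):
    """Single-sample REINFORCE estimate of grad_{theta_2} V^1(mu).

    `trajectory` is a list of (state, agent_action, opponent_action, agent_reward)
    tuples, and `opponent_q_net` is a torch.nn.Module mapping a state to Q-values
    over opponent actions (parameters theta_2). Both are illustrative placeholders.
    """
    surrogate = torch.zeros(())
    log_prob_sum = torch.zeros(())  # running sum of log pi_2(b_k | s_k) for k <= t
    for t, (state, _agent_action, b, r1) in enumerate(trajectory):
        q = opponent_q_net(state)                  # Q_2(s_t, .)
        log_pi = torch.log_softmax(q, dim=-1)[b]   # log pi_2(b_t | s_t)
        log_prob_sum = log_prob_sum + log_pi
        # Accumulate gamma^t * r^1(s_t, a_t, b_t) * sum_{k <= t} log pi_2(b_k | s_k);
        # differentiating this surrogate reproduces the REINFORCE estimator above,
        # since the rewards are constants with respect to theta_2.
        surrogate = surrogate + (gamma ** t) * r1 * log_prob_sum
    return torch.autograd.grad(surrogate, list(opponent_q_net.parameters()))
```

In practice one would average such estimates over a batch of sampled trajectories to reduce variance; the sketch only illustrates how the agent's return becomes differentiable with respect to the opponent's parameters.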