# evolutionary_dynamics_and_phiregret_minimization_in_games__074997e4.pdf Journal of Artificial Intelligence Research 74 (2022) 1125-1158 Submitted 07/2021; published 07/2022 Evolutionary Dynamics and Φ-Regret Minimization in Games Georgios Piliouras georgios@sutd.edu.sg Singapore University of Technology and Design Singapore Mark Rowland markrowland@deepmind.com Shayegan Omidshafiei somidshafiei@deepmind.com Romuald Elie relie@deepmind.com Daniel Hennes hennes@deepmind.com Jerome Connor jeromeconnor@deepmind.com Karl Tuyls karltuyls@deepmind.com Deep Mind London, United Kingdom Regret has been established as a foundational concept in online learning, and likewise has important applications in the analysis of learning dynamics in games. Regret quantifies the difference between a learner s performance against a baseline in hindsight. It is wellknown that regret-minimizing algorithms converge to certain classes of equilibria in games; however, traditional forms of regret used in game theory predominantly consider baselines that permit deviations to deterministic actions or strategies. In this paper, we revisit our understanding of regret from the perspective of deviations over partitions of the full mixed strategy space (i.e., probability distributions over pure strategies), under the lens of the previously-established Φ-regret framework, which provides a continuum of stronger regret measures. Importantly, Φ-regret enables learning agents to consider deviations from and to mixed strategies, generalizing several existing notions of regret such as external, internal, and swap regret, and thus broadening the insights gained from regret-based analysis of learning algorithms. We prove here that the well-studied evolutionary learning algorithm of replicator dynamics (RD) seamlessly minimizes the strongest possible form of Φ-regret in generic 2 2 games, without any modification of the underlying algorithm itself. We subsequently conduct experiments validating our theoretical results in a suite of 144 2 2 games wherein RD exhibits a diverse set of behaviors. We conclude by providing empirical evidence of Φ-regret minimization by RD in some larger games, hinting at further opportunity for Φ-regret based study of such algorithms from both a theoretical and empirical perspective. 1. Introduction Understanding the behavior of learning dynamics in games is a fundamental problem studied in game theory, online learning theory, dynamical systems, and multiagent systems. Numerous works have been developed in the area that describe connections between these distinct fields (Bloembergen et al., 2015; Celli et al., 2020; Cesa-Bianchi and Lugosi, 2006; Chang, 2007; Galstyan, 2013; Gatti and Restelli, 2016; Klos et al., 2010; Nisan et al., 2007; Srini- c 2022 AI Access Foundation. All rights reserved. Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls vasan et al., 2018; Tuyls and Parsons, 2007; Tuyls et al., 2003, 2006; Vlatakis-Gkaragkounis et al., 2020; Wunder et al., 2010). Given this wealth of prior work, a natural question emerges: have we converged to a more-or-less complete mathematical language that allows us to accurately describe the behavior of learning dynamics (at least in simple small games), or are there remaining characteristics of these dynamics that are not yet well understood or fully described by existing concepts? Arguably, one of the most important concepts in this area is that of regret (Auer et al., 1995; Chang and Kaelbling, 2005; Freund and Schapire, 1999; Kleinberg et al., 2009; Monnot and Piliouras, 2017; Roughgarden, 2009). Regret is a basic definition of online learning with numerous variants, each with specific associated properties. Regret minimization forms the basis of a key class of algorithms for learning in games, and is concurrently central to the more general field of online sequential prediction (Cesa-Bianchi and Lugosi, 2006; Chang, 2007). Informally, regret is defined as the difference in the cumulative performance of an agent against a baseline, which typically allows hindsight deviations (or swaps) of one or more of the agent s actions with alternative actions. There exist various notions of regret, ranging from basic forms such as external regret (Hannan, 1957), which measures a learner s performance against the best fixed action in hindsight, to stronger variants such as internal regret (Foster and Vohra, 1998) and swap-regret (Blum and Mansour, 2007a), where the deviating action of the agent is allowed to be a function of their originally chosen action. These notions of regret have typically been defined within the context of single-agent online learning, and subsequently used to analyze behavior of decision-making in multiagent games. The benefit of regret-minimizing algorithms, when applied to games, is that they typically yield time-average convergence to a corresponding set of equilibria. For example, in generalsum games, (external) regret minimization by all players yields time-average convergence to coarse correlated equilibria (Nisan et al., 2007); in two-player zero-sum games, this yields convergence to Nash equilibria, and has driven key successes towards solving games such as poker and offered other important insights in extensive-form games (Brown and Sandholm, 2018; Brown et al., 2020; Farina et al., 2021a; Moravˇc ık et al., 2017; Morrill et al., 2021a,b; Sandholm, 2010a; Zinkevich et al., 2007). Despite these insights, traditional notions of regret such as those described above are, at times, too general to provide strong guarantees for certain learning dynamics in games. Even in zero-sum games, for example, standard regret-minimizing dynamics such as multiplicative weights update (Arora et al., 2012b) and its well-known continuous-time limit counterpart, the replicator dynamics (RD), can be non-convergent or even chaotic in their real-time behaviors (Bailey and Piliouras, 2018; Cheung and Piliouras, 2019, 2020; Piliouras and Shamma, 2014; Sato et al., 2002), despite their time-average convergence. Emergence of chaos can even materialize in simple congestion games (Chotibut et al., 2020a; Palaiopanos et al., 2017), and perhaps surprisingly, such behaviors can occur despite these algorithms time-average convergence to equilibria (Freund and Schapire, 1999). In fact, such prototypical learning algorithms can be non-equilibrating even in the smallest of environments, 2 2 games with two agents and two strategies, e.g., Matching Pennies (Bailey and Piliouras, 2019; Chotibut et al., 2020b; Papadimitriou and Piliouras, 2016). Overall, the behavior of well-known dynamics even in small games is diverse and non-trivial, and clearly cannot be fully inferred from standard regret analysis alone. Evolutionary Dynamics and Φ-Regret Minimization in Games The clear gap between traditional regret theory and the observed empirical behaviors of these well-studied algorithms hints at the enticing possibility of using stronger notions of regret to better understand dynamical behaviors at finer levels of granularity. As it is well-known that regret-minimizing algorithms must, in general, randomize (i.e., output probability distributions over actions, rather than deterministic actions), such a stronger notion could consider a regret definition where agents may condition their deviating behavior not merely on deterministic actions (e.g., as in traditional notions of regret), but using more refined deviation functions. In investigating this, a line of prior work has analyzed a stronger concept known as Φ-regret (Gordon et al., 2008; Greenwald and Jafari, 2003; Hong, 2008; Stoltz and Lugosi, 2007), which encapsulates several more general classes of deviations in contrast to traditional regret notions. Related notions has attracted recently a lot of attention in extensive-form games as well (Farina et al., 2021a,b; Morrill et al., 2021a,b). However, as later detailed, while some of these works have introduced algorithms for Φ-regret minimization, such algorithms have either been investigated only for specific and simple classes of Φ-regret, or apply to more general classes of games albeit being significantly more intricate (e.g., require more book-keeping and are more difficult to implement). In this paper, our primary contribution is to establish a link between Φ-regret and the simple and well-studied evolutionary dynamics algorithm of RD. RD is arguably one of the most well studied dynamics in evolutionary game theory Sandholm (2010b); Weibull (1997) and due to its well known connection to Multiplicative Weights Updates (Arora et al., 2012b; Kleinberg et al., 2009) enjoys strong external regret guarantees (Cheung and Piliouras, 2021; Kwon and Mertikopoulos, 2017; Mertikopoulos et al., 2018) making it an natural candidate for producing powerful Φ-regret guarantees as well. Our key theoretical result is that in general classes of 2 2 games, RD in self-play seamlessly minimizes the strongest possible notion of Φ-regret, without any modification of the underlying learning dynamics. Informally, this result implies that an RD learner, in self-play, attains on time-average at least the value of the game, even if time-averaging is applied to specific recurrent parts of the trajectory as defined by the agent s mixed strategy. Moreover, we prove that there exist algorithms that minimize external regret, which exhibit unbounded strong swap regret in some 2 2 games. We further ground our theoretical results in empirical analysis focusing on two-player games, introducing a version of Φ-regret we denote mosaic regret , which is more amenable to empirical implementation. We illustrate mosaic regret-convergence under RD in a broad range of 144 2 2 games (Bruns, 2015), with widely varying characteristics (e.g., fully cooperative games, social dilemmas, cyclical games, etc.). Finally, we conclude by showing empirical evidence that RD minimizes mosaic regret in some larger games, hinting at future avenues of further exploration. The remainder of the paper is structured as follows. In Section 2, we overview the necessary preliminaries for establishing our theoretical results. In Section 3, we delve into the Φ-regret framework and related concepts targeted in the paper. Following this, we establish our theoretical results in Sections 4 and 5, and subsequently validate them empirically in Section 6. Finally, we conclude with key discussion points and takeaways in Section 7. 2. Preliminaries We first review preliminaries related to game theory and online learning algorithms. Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls 2.1 Game Theory We study two-player normal-form games, where the first (resp., second) player has access to a finite set of pure strategies A1 (resp., A2). In two-player games, players 1 and 2 are also referred to as the row and column players, respectively. The joint mixed strategies of the players are denoted by (x, y), where x (A1) and y (A2). We will denote the probability that the first (resp., second player) assigns to their i-th strategy (resp., j-th strategy) as xi (resp., yj). The player payoffs are, respectively, specified by matrices A Rn m and B Rn m, where n and m are the number of strategies available to each player. Let aij, bij represent the payoffentries of the respective matrices. The respective utilities received by the players are x Ay and x By. In a zero-sum (resp., coordination) game, players receive payoffs A = B (resp., A = B). Given strategy profile (x, y), the best response for each player is the strategy that maximizes their utility against the other player s current strategy. The best response dynamics arise when players iteratively update their policy to their best response. In game theory, the Nash equilibrium has been well-established as a solution concept of interest, and is defined as follows. Definition 2.1. (Nash, 1950) A mixed strategy profile (x , y ) (A1) (A2) is a Nash equilibrium (NE) if x T Ay x Ay x (A1) and x T By x T By y (A2) . (1) In other words, the players are simultaneously in best response with one another when their profiles constitute an NE. Several weaker notions of equilibria are also important in game theory, and bear a close relationship to regret-minimizing strategies. In preparation for those definitions, it will be useful to introduce some notation that allows for us to work with arbitrary correlated probability distributions over A = A1 A2. Let z (A) be such a probability distribution. Let zij be the probability assigned to the outcome where the first (resp., second) player choose strategy i (resp., j). Let z(1| ) (resp., z(2| )) be the marginal probability of the first (resp., second) player. We denote by z(2|i) (A2) the conditional distribution of the second player s strategy given the first player s strategy is i, and similarly define z(1|j). Finally, we denote by ei (A1) the distribution putting mass 1 on i A1, and similarly define ej. Given these definitions, we next introduce several additional notions of equilibria. Definition 2.2. (Aumann, 1974) A distribution over joint strategies z (A1 A2) is a correlated equilibrium (CE) if (ei) Az(2|i) x Az(2|i) i A1 , x (A1) and (2) z(1|j) Bej z(1|j) By j A2 , y (A2) . (3) Definition 2.3. (Moulin and Vial, 1978) A distribution over joint strategies z (A1 A2) is a coarse correlated equilibrium (CCE) if X i,j aijzij x Az(2| ) x (A1) and (4) i,j bijzij z(1| ) By y (A2) . (5) Evolutionary Dynamics and Φ-Regret Minimization in Games It is important to note that the set of CCE is a superset of the CE, which itself is a superset of NE (i.e., NE CE CCE). An intuitive way to establish this set of inclusions is to define each of these solution concepts via allowable sets of joint distributions and classes of allowable deviating policies under which no player can strictly improve their payoff. From this perspective, a NE is a product of (mixed) strategies such that no player can deviate to another (mixed) strategy and strictly increase their payoff. The set of CCE can be defined equivalently merely by removing the restriction that the joint distribution of the two players has to be a product of each of the player s marginal distributions. Hence, CCE is a superset of NE. Finally, if the players initial joint distribution is not a mixture, one can define strictly more powerful deviating strategies by allowing a player s deviating strategy to depend on their realized strategy. This is the case for CE, where the set of allowable deviating strategies is strictly more expansive, and yet none of them can strictly improve the payoffs of the players; thus, the set of CE is more restrictive than CCE, while generalizing NE as, once again, it allows for correlated joint distributions. 2.2 Replicator Dynamics In this paper, we seek to investigate the connection of more expansive notions of regret to simple, well-studied learning algorithms. For these purposes, we focus on the replicator dynamics (RD), a standard and well-studied model defining the evolution of strategic, interacting individuals under biologically-inspired mechanisms (Schuster and Sigmund, 1983; Taylor and Jonker, 1978). In the two-player setting of interest, the time-evolution of player strategies is described by RD as follows, xi = xi (Ay)i x Ay i A1 and (6) yj = yj (x B)j x By j A2 . (7) RD is the continuous-time variant of the well-known multiplicative weights update (MWU) meta-algorithm (Arora et al., 2012b; Kleinberg et al., 2009), and the seminal dynamics in the areas of mathematical evolution, biology, ecology, and evolutionary game theory (Hofbauer and Sigmund, 1998; Weibull, 1997). In recent years, RD has enjoyed a particularly strong surge in applications to learning in multiplayer games (Boone and Piliouras, 2019; Flokas et al., 2020; Hennes et al., 2020; Lanctot et al., 2019; Nagarajan et al., 2020; Omidshafiei et al., 2019; Papadimitriou and Piliouras, 2019; Sanders et al., 2018; Skoulakis et al., 2021; Sorin, 2020). Despite its algorithmic simplicity, RD is well-known to minimize external regret (a concept later detailed in Section 3.1), thus yielding time-average convergence to a coarse correlated equilibrium (Mertikopoulos et al., 2018; Sorin, 2009). 2.3 Online Sequential Prediction The central problem of online sequential prediction in continuous time is specified by the interaction of a player with finite action set A and an environment at each time t [0, ).1 At time t, the player selects a distribution over actions xt (A), and a utility function 1. We emphasize that A is the action set of a single player, analogously with A1 and A2 in the gametheoretic setup above. Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls (a) Example external regret deviations. (b) Example swap regret deviations. Figure 1: Strategy deviations associated with different forms of regret. This example considers a simple online learning setting with three actions (i.e., pure strategies) available to the player, illustrated via the simplices in each row. (a) External regret considers deviations of all three possible actions to a single, fixed, deterministic action in hindsight. (b) Swap regret considers a function independently swapping each of the player s actions with an alternative fixed action. ut RA is revealed.2 The player s expected instantaneous utility is xt, ut , and the player is shown the entire utility vector ut, which it may use in deciding how to act in subsequent interactions. After interacting up to time T > 0, the cumulative utility experienced by the player is R T 0 xt, ut dt. It is difficult to judge how well the player has done in selecting its actions on the basis of this cumulative utility alone; the quality of the performance depends on whether there were other actions available that would have yielded significantly higher utility. This is formalized by judging the player s performance based on its regret, a common means of studying the performance of such algorithms in both discrete and continuous time (Banerjee and Peng, 2004; Blum and Mansour, 2007b; Cesa-Bianchi and Lugosi, 2006; Harris, 1998; Kwon and Mertikopoulos, 2017; Mertikopoulos et al., 2018; Shoham and Leyton-Brown, 2008; Sorin, 2009; Zinkevich, 2003). 3. From Traditional Regret Minimization to Φ-Regret and Mosaic Regret We next lay the foundations for our theoretical results, by overviewing a spectrum of noregret algorithms: from the more traditional forms of regret to the more general Φ-regret framework, and our introduced notion of mosaic regret. 3.1 External and Swap Regret Informally, regret quantifies whether a player could have done better by using an alternative method for picking actions throughout the interaction. One of the most common variants 2. The typical language of online learning uses loss vectors, ℓt = ut. Here, we use utilities to match the focus on payoffs, rather than losses, which is common in the game theory literature. Evolutionary Dynamics and Φ-Regret Minimization in Games studied is external regret, the expected improvement in performance that could have been achieved by sticking with a single action throughout all interactions during the time interval [0, T]. Figure 1(a) provides an illustrative example of the action deviations considered when computing external regret. This example considers a simple online learning setting with three actions available to the player (illustrated via each of the 2-simplices), with external regret considering deviations of all three actions to a single, fixed action in hindsight. External regret is mathematically defined by 0 ut adt Z T 0 xt, ut dt , (8) where ut a is the utility of action a A at time t. If a player is able to attain o(T) external regret, over bounded utility sequences (ut [0, 1]A, for example), then the player s strategy is said to minimize external regret, or is simply regret-minimizing. Intuitively, the suboptimality of the player s decision at each timestep, relative to the best constant action in hindsight, becomes vanishingly small, no matter what sequence of utilities (ut)t [0,T] are yielded by the environment. Regret in games. Regret minimization is central to many algorithms for computing equilibria in game theory, due to the close relationship between the notion of an alternative action, and the game-theoretic notions of strategy deviations that feature in the definitions of equilibria in Section 2.1. As such, one can consider actions in the sense of online learning as being synonymous with (pure) strategies in game theory; we henceforth use the latter terminology for simplicity. For instance, consider a two-player game as specified in Section 2.1. Let us cast the problem the players face in playing the game as an online sequential prediction problem, focusing on the first player in the following description. The first player s strategy set is [n], the set of strategies in the game. The utility vector ut for this player at time t is given by Ayt, where yt is the strategy selected by the second player at time t. A well-known folk-theorem implies that if both players employ algorithms that minimize external regret to select their strategies, the players time-average behavior (i.e., the joint strategy T 1 R T 0 (xt, yt)dt) is guaranteed to converge to the set of coarse correlated equilibrium (Hart and Mas-Colell, 2000; Roughgarden, 2016; Young, 2004). Further, if the game is zero-sum, then the product of the marginals of their individual time-averaged strategies converges to the set of Nash equilibria at the same rate (Freund and Schapire, 1999; Nisan et al., 2007; Young, 2004). A similar relationship holds between the set of correlated equilibria, and the stronger notions of regret described below (Hart and Mas-Colell, 2000). Broader deviation classes. There exist more general notions of regret that compare a player s behavior against a wider class of baselines than just those that deviate to a single, fixed strategy throughout time. One such alternative, swap regret (Blum and Mansour, 2007a), permits deviations involving the player using strategy b A every time they had selected strategy a A. Figure 1(b) illustrates swap regret in our earlier example, where now each of the three possible pure strategies may be independently deviated to a different one. The notion of swap regret is formalized through swap functions F : A A that can Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls be lifted to a function F : (A) (A) for all x (A). The swap regret is then defined to be 0 F(xt), ut dt Z T 0 xt, ut dt . Players interacting in a two-player game using algorithms that minimize swap regret are guaranteed to converge to the set of correlated equilibria in time-average, a stronger notion that the coarse correlated equilibrium guaranteed by algorithms that minimize external regret. A slightly weaker notion than swap regret is that of internal regret, which restricts the swap functions F : A A that are lifted to deviations on the simplex to take the form F(a) = a for all but one a A. However, note that in the special case of zero-sum games, internal and swap regret minimization algorithms do not offer stronger guarantees, as external regret minimization algorithms already guarantee convergence to a Nash equilibrium in the time-average. 3.2 Φ-regret Framework In swap regret, each swap function F : A A is lifted to a function F : (A) (A), representing an admissible deviation in the algorithm s past strategies. Notably, each such F is affine in the input strategy. This is also true of the deviation functions F : (A) (A) permitted under external and internal regret. A natural question to pose is whether there exist learning algorithms that minimize regret against even broader classes of deviation functions. To better intuit the possibilities, let us consider an alternative view of the deviations permitted by various regret notions in Fig. 2. As in our earlier example, we consider a simple 3-strategy game in this instance. However, we now also visualize the full spectrum of mixed strategies, with the color of each mixed strategy in the interior of the simplex corresponding to its composition of the three underlying pure strategies (red, green, and blue). For instance, Fig. 2(a) visualizes the three possibilities of deviations permitted by external regret, each respectively mapping all mixed strategies to either the red, green, or blue pure strategy. Likewise, in the swap regret case shown in Fig. 2(b), deviations can correspond to affine transformations of the input strategies. Here, the deterministic swap function F yields a mixed swap function F, mapping each input mixed strategy to a distinct output mixed strategy. Naturally, as swap regret generalizes external regret, deviations permitted in the latter are also possible under the swap regret case (e.g., as shown in example deviation #3 in Fig. 2(b)). Φ-regret. A previous line of work (Gordon et al., 2008; Greenwald and Jafari, 2003; Stoltz and Lugosi, 2007) introduces the concept of Φ-regret, expanding the space of deviations permitted compared to traditional regret notions.3 For our purposes, Φ-regret is defined 3. Φ-regret as introduced by Greenwald and Jafari (2003) applied only to linear deviation functions over (A). Later work (Gordon et al., 2008; Stoltz and Lugosi, 2007) applied the concept much more generally Evolutionary Dynamics and Φ-Regret Minimization in Games (a) External Regret (b) Swap Regret (c) Φ-regret Figure 2: Mixed-strategy view of permissible deviations under various forms of regret. We visualize here a simple 3-strategy game, where the color of each mixed strategy in the simplex interior corresponds to the combination of pure strategies (red, green, and blue) that compose it. (a) Under external regret, permitted deviations map all mixed strategies to either the red, green, or blue deterministic strategy. (b) Swap regret deviations can correspond to affine transformations of the input strategies, generalizing external regret. (c) Under Φ-regret, the input space may be partitioned in a multitude of ways. For instance, in example #1, the simplex is discretized into partitions, each capturing mixed strategies with a certain range of KL-divergences from the center of the simplex. Likewise, in example #2, a discrete set of deviation partitions is defined in a granular manner enabling the output mappings visualized. This definition allows for arbitrarily complex mappings, as illustrated by example #3, which reproduces an image from the three primary strategies (colors). by first specifying a set of deviation functions Φ { F : (A) (A)}, and then defining the corresponding Φ-regret as 0 φ(xt), ut dt Z T 0 xt, ut dt . An algorithm is said to minimize Φ-regret if the above expression is o(T); that is, it grows sub-linearly with time. Figure 2(c) visualizes example deviations permitted under Φ-regret, illustrating that the input space may be partitioned in a multitude of ways under the above definition. For instance, in example #1 of Fig. 2(c), partitions are defined as a function of KL-divergence from the center of the simplex. Likewise, in example #2 of Fig. 2(c), deviations are defined in a manner enabling the highly granular output mappings visualized. Theoretically, this definition would allow for any arbitrarily complex mapping, as illustrated by example #3 of Fig. 2(c), which reproduces an image from the three primary strategies (colors). Mathematically, the Φ-regret framework generalizes the concepts of external regret, internal regret, and swap regret. Additional specific instances of Φ-regret, however, can be to convex games on compact action spaces. The definition we use in this paper is a specific case of this notion, applied to the mixed extension of normal-form games. Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls considerably stronger than these notions of regret, as any continuous function on (A) can be uniformly approximated by piecewise affine functions. Strong Swap Regret. The strongest version of Φ-regret takes Φ to be as large as possible; the set of all measurable functions from (A) to itself. We refer to this notion as strong swap regret, as it is defined explicitly as a mapping over the mixed-strategy simplex and may, thus, be non-affine (in contrast to the traditional notion of swap regret, which is first defined over pure strategies before being lifted to the mixed space). Past work has led to the development of algorithms that are able to minimize this notion of regret, which is an extremely strong property, although the algorithms necessarily to do so are rather intricate and impractical to implement (Stoltz and Lugosi, 2007). One of the central findings of our work is that in certain circumstances, the straightforward algorithm of replicator dynamics actually minimizes strong swap regret. 3.3 Mosaic Regret: A Practical Strong Swap Regret Variant In moving from theory to experiments, strong swap regret can no longer practically be minimized due to time discretization effectively preventing the learning dynamics from revisiting specific mixed strategies. It is, therefore, useful to consider a slight weakening of this notion to enable practical evaluation, which we define as follows.4 Definition 3.1 (Mosaic regret). A learning algorithm that produces a strategy process (xt)t 0 in response to a utility process (ut)t 0 is said to minimize mosaic regret if, for any finite measurable partition Ω= {Ω1, . . . , ΩK} of (A), the algorithm has vanishing regret relative to all deviation strategies that are affine on the elements of the partition Ω. Mathematically, given a finite measurable partition Ωof (A), let D(Ω) = {G : (A) (A) | G affine on each Ωk} be the corresponding set of piecewise-affine functions on (A). Then the algorithm minimizes mosaic regret if 0 G(xt), ut dt Z T 0 xt, ut dt = o(T) over measurable, bounded utilities u ([0, 1]A)[0, ), for each finite partition Ωof (A). Mosaic regret also encompasses many standard existing notions of regret. Minimization of mosaic regret, for example, encompasses minimization of external regret through the singleton partition Ω= { (A)} and constant deviation maps. Internal and swap regret are also encompassed through the singleton partition, and various classes of affine maps. We shall return to this introduced notion of mosaic regret in our later experiments in Section 6. 4. Analysis of Replicator Dynamics under Φ-Regret We next present theoretical results pertaining to the dynamics of games wherein players use RD. We focus our attention on two-player 2 2 normal-form games, which, despite their 4. For related reasons, Gordon et al. (2008) introduce a similarly-weakened notion known as finite-element Φ-regret, which restricts deviations to be continuous, and affine on each piece of a finite partition. Our introduced notion of mosaic regret is a slight strengthening of finite-element Φ-regret, allowing for non-continuous deviations. Evolutionary Dynamics and Φ-Regret Minimization in Games size, form an important class capturing numerous canonical games traditionally used for closed-form analysis of regret-minimizing algorithms (Bruns, 2015; Guyer and Rapoport, 1972; Klos et al., 2010; Rapoport, 1966; Robinson and Goforth, 2005). We first begin by deriving preliminary results, subsequently establishing that RD minimizes strong swap regret in all classes of 2 2 games. Finally, we conclude by revisiting strong swap regret in the context of online learning, and in comparison to the standard notion of external regret. We seek to establish the key result that strong swap regret is minimized in all generic 2 2 games when both players are using RD. The notion of genericity that we consider here is that we only allow for games where, for each pure strategy of their opponent, each player has a unique (pure) best response as well as that for both payoffmatrices A, B it must hold that a11 a12 a21 + a22 = 0 and b11 b12 b21 + b22 = 0.5 Clearly, these properties are satisfied by all but a zero-measure set of games from the space of all 2 2 games. To prove the minimization of strong swap regret as outlined above, we consider the three possible classes of such games: (I) those with no pure Nash equilibrium (that have a unique interior, i.e., fully mixed, Nash equilibrium), (II) those with a unique pure Nash equilibrium, and (III) those with (at least) two pure Nash equilibria on the boundary (due to our genericity assumptions we can only have at most two pure Nash equilibria). From the perspective of a best response graph6 that connects pure strategy profiles via directed edges from one to another by a single player best-responding, these cases respectively correspond to: (I) the best response graph is a cycle through the four pure strategy outcomes, (II) the best response graph has a single sink that is a pure (strict) Nash equilibrium, and (III) the best response graph has two sinks/pure (strict) Nash equilibria. In case (III), due to our genericity assumption, both players choose a different pure strategy in each pure Nash equilibrium, which implies the existence of an isolated fully mixed Nash equilibrium as well. 4.1 Equivalent and Rescaled Games Before proving the result that RD minimizes strong swap regret in generic 2 2 games, we will show in this section that in order to prove case (I) it is sufficient to prove this result for the case of rescaled zero-sum games with a unique interior Nash equilibrium, whereas to prove case (III) it is sufficient to consider the case of rescaled coordination games with a unique interior Nash equilibrium. Let us first define these notions of equivalent and rescaled games below. 5. A natural sufficient condition for the uniqueness of best response strategies is that all payoffentries are distinct. It is easy to see that in 2 2 games under uniqueness of best responses, other useful generic properties follow, such as a finite number of possible Nash equilibria. Indeed, if, given a pure strategy of the opponent, each strategy has a unique best response then there cannot exist any mixed Nash equilibrium where exactly one of the two agents is randomizing. So the only possibility for having a continuum of Nash equilibria is in its interior. However, given any two interior Nash equilibria, any linear combination of them is also a Nash equilibrium, where in fact both agents are strictly indifferent between their two strategies. This line of equilibria intersects the boundary, implying the existence of a strategy profile where at least one of the agents plays a pure strategy and their opponent is indifferent between their two strategies, reaching a contradiction. 6. A best response graph is a directed graph with one vertex for each strategy profile. It has a directed edge between two vertices/profiles if and only if this transition corresponds to a best response of some agent. For more discussion see (Goemans et al., 2005; Omidshafiei et al., 2019). Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls Definition 4.1 (Equivalent games up to column/row shifts). Let G = (A, B) be a two player game with payoffmatrices A and B. We say that a game G = (A , B ) is equivalent to G up to column/row shifts if there exists (c1, c2, d1, d2) such that a ij = aij + cj, b ij = bij + di for all (i, j) {1, 2} . We then write (A, B) (A , B ) to denote equivalence up to column/row shifts. A game is called rescaled zero-sum (resp., rescaled coordination game) if there exist a payoffmatrix C and a negative (resp., positive) constant c such that (A, B) (C, c C). Observe that since the RD update expression for the row player can be written as j xj ((Ay)i (Ay)j) , the row player s RD trajectories only depend on payoffdifferences, which remain invariant after column payoffshifts. Hence, shifting a game to an equivalent version does not affect RD, implying the following remark. Remark 4.2. The orbits in RD of two equivalent games are identical. The following simple lemma (see also Hofbauer and Sigmund 1998) establishes that every generic 2 2 game is RD trajectory equivalent to either a rescaled zero-sum game or a rescaled coordination game. Lemma 4.3. Every generic 2 2 game is either a rescaled zero-sum game or a rescaled coordination game. Proof. By assumption we have that b11 b12 b21 +b22 and a11 a12 a21 +a22 be different from 0. Then, there exists c = 0 such that b11 b12 b21 + b22 = c(a11 a12 a21 + a22). Let D := c A B then d11 d12 d21 + d22 = 0. Thus, for all i, j, we have dij = di2 + (d2j d22) = ui + vj, with ui := di2 and vj := d2j d22. Since caij bij = ui + vj, we can define cij = bij+ui cvj. Hence, (A, B) (C, c C). In case (I) it suffices to examine only rescaled zero-sum games, as best-response dynamics cannot cycle in rescaled coordination games given they are simply weighted potential games.7 Similarly, since in case (III) the game has three isolated equilibria, it cannot be equivalent to a rescaled zero-sum game since such games have the same equilibrium set as some zero-sum game and thus their set of equilibria must form a convex polytope8. Thus, in case (III), it suffices to explore only rescaled coordination games. 7. In a rescaled coordination game, the ratio of the utilities of the agents is a constant that is independent of the action profile. Hence, any payoff-improving best-response for one agent is also payoff-improving for the other implying that any best response path must terminate at a Nash equilibrium. 8. Any rescaled zero-sum game has the same Nash equilibrium set as some zero-sum game. The set of Nash equilibria in each zero-sum game is a convex polytope (von Neumann and Morgenstern, 1944). Evolutionary Dynamics and Φ-Regret Minimization in Games 4.2 Case I: No pure Nash equilibrium (Rescaled Zero-Sum Games) We will start offour analysis with Case I, concerning rescaled zero-sum games without a pure Nash equilibrium (and therefore having a unique interior Nash equilibrium). The analysis of strong swap regret minimization for RD will involve three steps. Firstly, we will show that all interior trajectories in this case are cycles. This topological argument hinges upon recent topological characterizations of RD trajectories in zero-sum games and variants thereof (Boone and Piliouras, 2019; Mertikopoulos et al., 2018; Nagarajan et al., 2020; Piliouras and Shamma, 2014; Piliouras et al., 2014). These works typically argue for weaker notions of divergence (e.g., Poincar e recurrence) for almost all initial conditions, whereas in our case we will need to prove periodicity of all interior trajectories. The closest result to our own is in Boone and Piliouras (2019), where periodicity of RD trajectories is argued for a different class of single-agent, evolutionary zero-sum games with three strategies. Our proof strategy will tailor those arguments to address our different class of games. Armed with this topological result, we will next prove a strong time-average property of 2 2 games with a unique interior Nash equilibrium and periodic RD trajectories. Specifically, we will show that along any such trajectory if we fix a specific mixed strategy of one of the agents and compute the time-average of their opponent conditioned only on those points where the first player applies the specified mixed strategy then the time-average of second agent will converge to the mixed Nash equilibrium. This is a significant strengthening of a well-known result that specifies that the time-average of external-regret minimization algorithms in zero-sum games over their whole trajectory converges to the marginal of a Nash equilibrium. Leveraging this strong property, it will easily follow that no agent can profitably deviate from their current play even if we allow them to use deviations which are conditional to their current mixed strategy, implying strong swap regret minimization in this case. 4.2.1 Periodicity of Interior Orbits We will start of with a well-known theorem that constrains the possible limit behavior of smooth dynamical systems on a plane. Particularly, the Poincar e-Bendixson theorem provides sufficient conditions under which the limit behavior of such systems is periodic. Let z = (x1, y1) denote the state of RD in a 2 2 rescaled zero-sum game where we remove the constrained variables x2, y2 to express that the system has two degrees of freedom. Formally, a limit set ω(z) of an initial condition z is the union of the limits of all its convergent subsequences, i.e., all its possible limit behaviors. Specifically, let φ : [0, 1]2 R [0, 1]2 be the flow of RD such that for any point z [0, 1]2, φ(z, ) defines a function of time corresponding to the trajectory of x. The ω-limit (set) of z is formally defined as the set of points w [0, 1]2 such that there exists a sequence (tn) diverging to + such that φ(z, tn) w. Theorem 4.4 (Poincar e-Bendixson theorem). A limit set of a C1 dynamical system over the plane, if non-empty and compact, that does not contain a rest point is a periodic orbit. We will apply the Poincar e-Bendixson theorem to show that all interior RD trajectories are periodic for these games. Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls Theorem 4.5. Given any rescaled zero-sum game with a unique interior Nash equilibrium, all interior trajectories are periodic. Proof. First, from Piliouras et al. (2014), we have that the weighted sum of the Kullback Leibler (KL) divergences of each agent s current mixed strategy from their respective interior Nash equilibrium strategy is a constant of the motion, i.e., it remains invariant over time for all orbits. The weights are strictly positive numbers such that, after multiplying these numbers with the current agent utilities, the rescaled game is zero-sum. The KL divergence is equal to zero if and only if the respective probability distributions are equal to each other. Thus, the fact that for all non-trivial (i.e., non-equilibrium) orbits, the original weighted sum of KL divergences is positive implies that any such trajectory will stay bounded away from the interior equilibrium. The only other fixed points for RD lie on the boundary and correspond to pure strategy profiles, however, since the Nash equilibrium is interior the KL divergence becomes unbounded as we approach the boundary and thus due to the invariance and non-negativity of KL divergences any interior trajectory will stay bounded away from the boundary. Due to the Poincar e-Bendixson theorem, we have than any non-trivial orbit of RD will stay bounded away from all fixed points and thus the resulting limit set will be a periodic orbit. Finally, it remains to show that not only are the limit set of all trajectories cycles, but all trajectories themselves are cycles. We know that for any interior initial condition z, ω(z) is a periodic orbit and thus a Jordan curve of the plane [0, 1]2, so it is separated into two connected components: an interior I and an exterior E. By assumption, we know that there exists an interior rest point of RD, the interior Nash equilibrium. Let us denote it z . We claim that this z has to be a point of I. Let us assume not, i.e., z E. Since ω(z) is non-trivial, its interior I is non-empty. Pick a point a I. Draw a semi-infinite ray starting from z E in the direction of a I. Since I is bounded, this ray will transit from E to I then I to E at least once. Hence, it crosses ω(z) at least two times. Let us name those points ω1 then ω2. The sum of KL divergences is a strictly convex function, which attains its global minimum at z , so will strictly increase as we advance along the ray. Thus, the weighted sum of KL divergence at ω2 will be strictly larger than its value at ω1, which contradicts the fact that it is a continuous invariant of motion. Therefore, we have that z I; see the left panel of Fig. 3 for an illustration of the points constructed in this argument. Summarizing, ω(z) is a cycle, and its time-average (the interior Nash equilibrium z ) lies in the interior I of ω(z). We now wish to show that the orbit starting from z is a periodic orbit. To prove that, we will argue that z ω(z). From z , draw a semi-infinite line L in the direction of z. As I is bounded, this semi-infinite line must cross ω(z) in at least one point. Pick one of these points and call it zω. We claim that the weighted sum of KL divergences is equal to the sum of weighted KL divergences between z and z on L only at zω. Indeed, the sum of KL divergences is a strictly convex function with global minimum at z (equal to zero), so it is strictly increasing as one moves along the line L. Thus, zω is the unique intersection point between L and the level set of the weighted KL divergences with value equal to the weighted KL divergence between z and z. Therefore, we derive that z = zω ω(z); see the right panel of Fig. 3 for an illustration of the points constructed in this argument. Evolutionary Dynamics and Φ-Regret Minimization in Games Figure 3: Illustrations of two key steps in the proof of Theorem 4.5. Left: A proof by contradiction that the unique Nash equilibrium is contained in the interior I of ω(z), the limit set of replicator dynamics initialised at z. If the Nash equilibrium z lies in the exterior E, then the points ω1 and ω2 constructed must have different weighted sums of KL divergences from z ; a contradiction since they lie on the same level set for this function. Right: A proof by contradiction that z ω(z). If not, we construct zω ω(z) which has smaller weighted sum of KL divergences from z than z. However, since this weighted sum of KL divergences is constant on trajectories of the dynamics, this is a contradiction unless z = zω. Next, we will move forward to prove strong time-average properties for periodic trajectories of RD dynamics in 2 2 games with unique interior Nash equilibrium, which naturally apply to our case of rescaled zero-sum games with unique interior Nash equilibrium. 4.2.2 Average Opponent Strategies We first consider the average opponent strategy that an RD learner observes whenever playing a particular mixed strategy, a concept used to establish our subsequent regretminimization proofs. In order to better understand the notion of time-average opponent strategy derived here, consider the example in Fig. 4. Here, the left panel illustrates RD trajectories for a two-player 2-strategy game. The opponent strategy that the row player observes whenever their own strategy, x, is fixed to a particular value, σ, is illustrated via the highlighted vertical band in the figure; for a fixed orbit, whenever the row player passes through the mixed strategy x = σ, they observe one of two possible strategies for the (opposing) column player. The right panel illustrates the time-average of these observed opponent strategies, for which we seek to derive an analytic formula here. Below, we derive the time-average opponent strategy result specifically for periodic 2 2 games (i.e., those wherein the dynamics cycle around an interior Nash equilibrium), as strong swap regret minimization can be handled more straightforwardly for other instances. Proposition 4.6. Consider a periodic 2 2 game with unique interior Nash equilibrium, i.e., a 2 2 game such that all interior orbits of RD are periodic. Let (xt, yt)t 0 be the RD trajectories with some fixed interior initial condition (x0, y0) [0, 1]2. Let σ [0, 1] denote Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls 0 1000 2000 3000 4000 5000 t 0.0 0.2 0.4 0.6 0.8 1.0 x Figure 4: Illustration of RD for a two-player 2-strategy game. The left panel illustrates the player dynamics (with a particular periodic trajectory traced in blue and the corresponding line thickness indicating the row player s update speed, vx). For a fixed orbit, whenever the row player passes through mixed strategy σ [0, 1], it observes one of two possible mixed strategies for the column player: ν1, ν2 [0, 1]. The right panel illustrates the recurrence of these orbital crossings, with green series indicating the time-average convergence of the observed opponent strategies to their Nash distribution. a fixed mixed strategy for the row player, and let ν1, ν2 [0, 1] denote the (at most) two possible mixed strategies played by the column player, i.e., such that both (σ, ν1) and (σ, ν2) lie on the orbit of RD.9 Let (vi x, vi y) denote the gradient of RD at point (σ, νi), for i = 1, 2. Then the time-average opponent strategy observed by the row player up to time t is given by 1 v1x ν1 + 1 Proof. We derive the time-average opponent strategy observed within time spans that are multiples of the period of the dynamics, T. It is sufficient to consider this case, as other cases can be dealt with by noting they modify the time-average observed strategy by at most a constant.10 Let ε > 0. The average strategy played by the column player whilst the row player plays a strategy within ε of σ is R T 0 yt1[xt [σ ε, σ + ε]] dt R T 0 1[xt [σ ε, σ + ε]] dt . (10) 9. On top of periodicity of all interior orbits, this proposition assumes that given any such orbit and for any fixed mixed strategy of the first (resp., second) player, the points of the orbit where the first (resp., second) agent applies that mixed strategy correspond to at most two mixed strategies of the second player. In Section 4.2.1, we have shown that the necessary conditions for this theorem are satisfied by all rescaled zero-sum games with a unique interior Nash equilibrium. 10. Specifically, writing t = NT + r, with N N0 and r [0, T), the contribution of opponent strategies from the interval (t r, t) to the time-average up to time t goes as O( 1 Evolutionary Dynamics and Φ-Regret Minimization in Games We aim to derive a limiting expression as ε 0. Let t1, t2 (0, T) be such that (xt1, yt1) = (σ, ν1) and (xt2, yt2) = (σ, ν2). Since the path of RD is differentiable, we have xt1+t = σ + tv1 x + O(t) , yt1+t = ν1 + tv1 y + O(t) xt2+t = σ + tv2 x + O(t) , yt2+t = ν2 + tv2 y + O(t) , where (vi x, vi y) denotes the gradient of RD at (σ, νi), for i = 1, 2. Then note that the condition xt [σ ε, σ+ε] is satisfied for t [t1 ε v1x +O(ε), t1+ ε v1x +O(ε)] [t2 ε v2x +O(ε), t2+ ε v2x +O(ε)]. Hence, the denominator integral in (10) simplifies to v2x + O(ε) . As for the numerator in (10), observe that 0 yt1[xt [σ ε, σ + ε]] dt = vix +O(ε) ytdt Z ε vix +O(ε) νi + tvi y + O(t) dt vix + O(ε) νi + O(ε) . Altogether, the average strategy faced by the row player in (10) is rewritten as vix + O(ε) νi + O(ε) v2x + O(ε) . Letting ε 0, we compute P2 i=1 (2 ε vix + O(ε) νi + O(ε) v2x + O(ε) = lim ε 0 v1x ν1 + 2 ε v2x + O(ε) + O(ε) 2 ε 1 v1x ν1 + 1 Remark 4.7. In the representation (9) of the time-average opponent strategy induced by Proposition 4.6, the weights do not depend on the underlying parametrization choice. Namely, consider any reparameterization of mixed strategy (x, y) via (θ1, θ2) = (g(x), g(y)), where g is a non-decreasing differentiable function. Then, a similar line of arguments leads to the following result: 1 v1 θ g(ν1) + 1 Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls Proposition 4.6 implies in particular that, whenever the 2 2 game of interest is also zero-sum while exhibiting a unique interior Nash equilibrium, the time-average opponent strategy identifies to the Nash equilibrium itself. This result is proven in the following proposition. Proposition 4.8. Consider a periodic 2 2 game with unique interior Nash equilibrium, i.e., a 2 2 game such that all interior orbits of RD are periodic. Furthermore, assume that given any such periodic orbit and for any fixed mixed strategy of the first (resp., second) player, the points of the orbit where the first (resp., second) agent applies that mixed strategy correspond to at most two mixed strategies of the second player such that the equilibrium strategy of the second (resp., first) player lies in their convex combination.11 Then, for any fixed first player strategy, the corresponding time-average opponent strategy in RD is the Nash strategy of the opponent. Proof. Consider the logit parameterization of strategies in these 2-strategy games, wherein a player s strategy is described by a single real number τ R, and the corresponding strategy probabilities are given by eτ(1 + eτ) 1 and (1 + eτ) 1. Given two joint strategies (σ, τ1), (σ, τ2) in logit parameterization on the orbit of RD, the gradient for the row player under RD is given by the difference in payoffs for row strategies 1 and 2 against the column player s joint strategy, namely: 1 1 A eτ2(1 + eτ2) 1 (1 + eτ2) 1 and similarly for the column player. For shorthand, denote z := (1, 1) and y(τ) := (eτ(1 + eτ) 1, (1 + eτ) 1) . Per Proposition 4.6 and Remark 4.7, the average opponent strategy that the row player experiences when playing σ is expressed by eτ1(1+eτ1) 1 |z Ay(τ1)| + eτ2(1+eτ2) 1 |z Ay(τ2)| 1 |z Ay(τ1)| + 1 |z Ay(τ2)| . Let pi := eτi(1 + eτi) 1 so that y(τi) = (pi, 1 pi) , for i = 1, 2. Similarly, let p denote the probability of the first strategy in the unique Nash equilibrium of the game, so that y := (p , 1 p ) . For the statement of the proposition to be valid, it remains to prove the following relation: 1 |z Ay(τ1)|p1 + 1 |z Ay(τ2)|p2 1 |z Ay(τ1)| + 1 |z Ay(τ2)| = p , which rewrites equivalently as 1 |z Ay(τ1)|(p1 p ) + 1 |z Ay(τ2)|(p2 p ) = 0 . (11) 11. In Section 4.2.1, we have shown that the necessary conditions for this theorem are satisfied by all rescaled zero-sum games with a unique interior Nash equilibrium. Evolutionary Dynamics and Φ-Regret Minimization in Games By definition of the Nash equilibrium y , recall that z Ay = 0, so that |z Ay(τi)| = |z Ay(τi) z Ay | = |z Az | |pi p |, for i = 1, 2 . This allows one to rewrite the targeted relation (11) as |p1 p | + (p2 p ) which holds true since p [p1, p2] as (σ, τ1) and (σ, τ2) belong to the same orbit of RD. We next prove the result establishing minimization of strong swap regret for Case I. Theorem 4.9. In any generic 2 2 game with a unique interior Nash equilibrium, RD minimizes strong swap regret. Proof. By Lemma 4.3 along with the fact that the game does not have any pure Nash equilibrium, we have that the game has to be equivalent to a rescaled zero-sum game. In such a case, we have that all interior trajectories of RD are periodic. Following the line of arguments of Proposition 4.6 and Proposition 4.8, we can conclude that RD minimizes strong swap regret, since for any fixed first player strategy, the average opponent strategy it observes is the Nash equilibrium component of the opponent and thus no profitable deviation is possible even if the agent is allowed to condition their deviating strategy to their current mixed strategy. The argument about the time-average convergence to the Nash equilibrium is specific to zero-sum games with a unique interior Nash equilibrium. In some sense, this is the most interesting subclass in which to study strong swap regret minimization, since this minimization result is itself strongest in this class of games; we will see below that in other cases, strong swap regret minimization coincides with weaker notions of regret. 4.3 Case II: Games with a Unique Pure Nash Equilibrium In the case with a unique Nash equilibrium on the simplex boundary, as argued above, RD converges to the pure Nash equilibrium, which we will exploit to establish a link between external regret and strong swap regret. Theorem 4.10. In any generic 2 2 game with a unique pure Nash equilibrium, RD minimizes strong swap regret. Proof. We will first argue that in this case the game is strictly dominance solvable. Namely, after the iterated elimination of pure strategies that are strictly dominated by other pure strategies, we are left with a single strategy for each agent. By our genericity assumption we have that for each pure strategy of their opponent, each player has a unique (pure) best response. Since the best response dynamic has a unique sink (the unique pure/strict Nash equilibrium) for at least one of the two agents one strategy strictly dominates the other. Once that dominated strategy has been eliminated, the genericity of the game will lead to the elimination of one of the two remaining strategies of the other agent, implying that the game is strictly dominance solvable. Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls In such games, RD (as well as a wide range of other evolutionary dynamics) is bound to converge to playing the (iteratively) undominated strategies with probability one (Weibull, 1997, Proposition 4.6). However, in any small enough neighborhood around this strict Nash equilibrium, the best response of each agent is unique and and it is their strategy at the unique pure/strict Nash equilibrium. Hence, given any interior initial condition, after a finite amount of time, the optimal deviating strategies for strong swap regret are going to be independent of the (exact) mixed strategy of the agent and, instead, they will stay fixed and equal to the Nash equilibrium strategy of that agent. Thus, strong swap regret will be within a constant of the external regret of RD, which is known to be bounded in general games. Thus, RD also minimizes strong swap regret. 4.4 Case III: Games with Two Pure Nash Equilibria (Rescaled Coordination Games) This is the last case of 2 2 games, where we examine games with multiple Nash equilibria. Given the fact that some trajectories converge to pure Nash equilibria and some to mixed Nash equilibria, we need specialized arguments. Lemma 4.11. In any 2 2 games with several Nash equilibria on the simplex boundary, RD minimizes strong swap regret. Proof. In this case, the game is equivalent to a rescaled coordination game per Section 4.1. However, rescaled coordination games are weighted potential games where the potential function of the player is the utility of, e.g., the first player. In this case, RD is known to converge to Nash equilibria for all interior initial conditions (Papadimitriou and Piliouras, 2018). In terms of the possible limit points of different trajectories, only two situations can occur: either the limit is a pure Nash equilibrium, or it is an interior (mixed) Nash equilibrium. In the case where the trajectory converges to a pure Nash equilibrium, the boundedness of strong swap regret follows from the same argument as in the case of strictly dominance solvable games converging to a pure Nash equilibrium, see the proof of Theorem 4.10. Thus, the only remaining case is that of trajectories converging to a mixed Nash equilibrium. In the case of rescaled coordination games with interior Nash equilibrium, there always exists a positive weight such that the weighted difference of KL divergences between each agent s mixed Nash equilibrium and their evolving state remains invariant over time (see Hofbauer 1996; Nagarajan et al. 2018). Specifically, given any pair of (interior) mixed strategies x = (x1, x2) and y = (y1, y2), there exists w > 0 such that the KL divergence w DKL p xt DKL q yt = w X i pi ln( pi between the interior Nash equilibrium (p, q) and the state (xt, yt) of the system is invariant over time. Clearly, the value of this invariant function evaluated at the mixed Nash equilibrium (p, q) is equal to zero and by continuity the value of this function must be equal to zero for any trajectory that converges to that mixed Nash equilibrium. The important property implied by this invariant function is that any trajectory that converges to the mixed Nash equilibrium (p, q) cannot intersect the lines x1 = p1, y1 = q1. If that was the case and Evolutionary Dynamics and Φ-Regret Minimization in Games exactly one of the two conditions x1 = p1, y1 = q1 was satisfied, then exactly one of the two KL divergence terms would be equal to zero, implying that the invariant function cannot be equal to zero. On the other hand, if the trajectory intersects the lines x1 = p1, y1 = q1 simultaneously then by uniqueness of the solution of the ODE this is the trivial equilibrium trajectory that does not converge to equilibrium since it already starts at it. Hence, when encoding the state of the 2 2 replicator dynamics in the [0, 1]2 plane expressing the variables x1, y1, then any trajectory converging to the interior Nash equilibrium must be strictly contained in one of the four parallelograms encoded by the lines x1 = p1, y1 = q1. Thus, the best response of both agents along any such trajectory remains fixed over time and, as a result, the maximal strong swap regret is achieved by deviating to that fixed best response throughout the system orbit. Therefore, strong swap regret is upper bounded by standard external regret, which is well-known to be bounded for RD (e.g., see Mertikopoulos et al. 2018) and the proof is complete. 5. Linking Strong Swap Regret to Other Optimization Notions Having established that RD minimizes strong swap regret in generic 2 2 games, we now present relationships to more standard perspectives in online learning. 5.1 Strong Swap Regret versus External Regret Minimization A natural question to ask is whether minimization of strong swap regret follows immediately from minimization of standard (external) regret in the special case of 2 2 games. The following theorem shows that this is not the case. Theorem 5.1. There exist algorithms that minimize external regret, which exhibit unbounded strong swap regret in some 2 2 games. Proof. Consider an arbitrary tuple of algorithms that minimize external regret (e.g., bounded external regret algorithms such as RD), with one such algorithm used by each of the agents. Given such a tuple of algorithms, we define a slight variant of them that initializes them as playing a pre-computed CCE in a chosen target game that does not have bounded strong swap regret. For example, consider the 2 2 game where both agents receive a payoffof 1 if they choose the same strategy (i.e., both play strategy a or b), otherwise receiving a payoff of 0. Now, consider the following strategy distribution in this game: with probability 1/2, both agents play the mixed strategy where they choose the first strategy with probability 3/4; with probability 1/2, both agents play the mixed strategy where they choose the second strategy with probability 3/4. This overall strategy distribution is a CCE.12 Thus, dividing the time interval into epochs of length one, we can have the agents reproduce this CCE by alternating between playing the two described mixed strategies. The agents continue producing this predefined play as long as no deviation from it is observed. If any deviation is observed from any individual agent, then all agents subsequently deviate from this play, from this point onward always applying their assigned regret-minimizing algorithm (e.g., RD). Clearly, each of these algorithms still minimizes 12. The overall distribution chooses pure strategy profiles (a, a), (b, b) each with probability 5/16 and the profiles (a, b), (b, a) with probability 3/16; it is easy to verify this distribution satisfies the CCE definition. Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls external regret, as even if a deviation is observed leading to this switch away from the original CCE mixed strategy, the total regret experienced on any single period is bounded and the total regret of the individual algorithm used (e.g., RD) is bounded as well. On the other hand, by construction the resulting play does not minimize strong swap regret, as any of the two mixed strategies applied by the agents is not a best response during the time epoch that is being played; thus, instead the agents would rather deviate from playing their assigned strategy, not with probability 3/4, but with probability 1. 5.2 Strong Swap Regret in the Case of Online Learning With Two Actions While the definition of strong swap regret was motivated by the need for more granular techniques in the study of learning in games, a related question is whether this notion also yields insights for more traditional online learning settings. Below, we show that it is possible to provide positive regret results even if we allow for the deviating strategies to depend on the current mixed strategy, the telltale attribute of strong swap regret, under specific conditions on the learning trajectory. Specifically, we explore the conditions under which we can prove positive results for strong swap regret, not in the case of 2 2 games, but rather in the case of online learning with two actions (i.e., strategies) whose utilities evolve continuously over time. Theorem 5.2. Given bounded and continuous utility functions ut 1, ut 2, the cumulative regret of RD remains bounded even if we consider only the part of the history of play where replicator chooses the first action with probability p (p0 ϵ, p0 + ϵ) for any p0 (0, 1), as long as ϵ is small enough such that the replicator trajectory always exits the interval from the opposite end of the interval from the one it enters it. Proof. Let us create a possibly infinite sequence of time intervals (aj, bj) whose union captures the set of time instances t such that pt (p0 ϵ, p0 + ϵ). To explore the total regret for a player in online learning in the time intervals where the probability of playing their first strategy is between p0 ϵ and p0 + ϵ, it suffices to track the total regret for deviating against any fixed strategy for the union of (aj, bj) intervals. In the case of RD in any such interval (aj, bj), the regret for not deviating to some fixed strategy i is given as follows: pi = pi(ui X Rearranging, pi pi = ui X ln(pi(bj)) ln(pi(aj)) = Z bj aj uτ i dτ Z bj i pτ i uτ i dτ . Evolutionary Dynamics and Φ-Regret Minimization in Games Without loss of generality, we can assume that the trajectory enters the (p0 ϵ, p0 + ϵ) interval from below. In this case, the regret in the union of these times intervals is X j ln(pi(bj)) ln(pi(aj)) = ln(p0 + ϵ) ln(p0 ϵ) + ln(p0 ϵ) ln(p0 + ϵ) + . . . ln(p0 + ϵ) ln(p0 ϵ) and thus remains bounded for all time. Overall, these insights indicate that there exist interesting positive and negative results linking strong swap regret to classical notions of regret. 6. Experiments In this section, we ground our theoretical findings in experimental analysis of RD in numerous games of varying characteristics. 6.1 Φ-Regret Minimization via RD in 2 2 Games As mentioned earlier, practical experiments preclude one from demonstrating that strong swap regret is minimized, as time-discretization implies that any individual mixed strategy observed in a single RD trajectory may not be revisited in future trajectories. Thus, we focus our empirical analysis here on mosaic regret, the practically-realizable instance of Φ-regret we introduced in Section 3.3. We first conduct a set of experiments illustrating the minimization of mosaic regret under RD in a wide suite of 2 2 games. As noted earlier, despite their simple structure, these payoffs form an important class capturing numerous canonical games that have received tremendous attention throughout the game theory literature (e.g., Matching Pennies, Prisoner s Dilemma, Stag Hunt, etc.) (Bruns, 2015; Guyer and Rapoport, 1972; Klos et al., 2010; Rapoport, 1966; Robinson and Goforth, 2005). The specific games we consider are those defined by Bruns (2015), which taxonomizes a collection of 2 2 games into several distinct classes by considering the patterns of payoffs received by each player. Bruns (2015) identifies 12 sets of basis payoffs corresponding to canonical games of varying characteristics, summarized for the row player in Fig. 5(a); corresponding column player payoffs for these games are defined in Bruns (2015) as the transpose along the anti-diagonal, which we also use for consistency. The combination of these basis payoffs enables the definition of a large collection of 144 2 2 games with varying characteristics (e.g., cyclical games, win-win games, pure coordination games, etc.). For each of these games, we visualize the vector field summarizing the behavior of agents playing RD in Fig. 5(b). The xand y-axes here, respectively, show the row and column player s probabilities of playing their first strategies. Figure 6 visualizes an evaluation of the mosaic regret across this diverse set of games. For simplicity, the partitioning scheme Σ we use for computing mosaic regret is to subdivide the first player s mixed strategy space x into 10 discrete, equally-sized bins. To generate the mosaic regret results, for each game we run 10 trials of RD (each with an independently-initialized set of seed strategies for the row and column players). We plot the time-average mosaic Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls Chicken Battle Hero Compromise Deadlock Prisoner s dilemma Stag hunt Assurance Coordination Peace Harmony Concord (Ch) (Ba) (Hr) (Cm) (Dl) (Pd) (Sh) (As) (Co) (Pc) (Ha) (Nc) 2 3 1 4 Win-win Biased Second best Unfair Traps Sad Cyclic 0 1 Ch (b) Figure 5: (a) Row player payoffs for 2 2 games defined by Bruns (2015). Note: in the convention of Bruns (2015), corresponding column player payoffs are defined as a transpose along the anti-diagonal. (b) RD run on 144 games composed of the indicated combinations of row and column player payoffs. Each subplot shows the dynamics for RD on a distinct game, with background colors corresponding to the classes of games originally defined by Bruns (2015). Nash equilibria for each game are indicated by the red points. Evolutionary Dynamics and Φ-Regret Minimization in Games Time-average Mosaic Regret Figure 6: Mosaic regret convergence for 2 2 games. For each of the 144 games (shown earlier in Fig. 5(b)), we run 10 trials of RD with independently random initial strategies for both the row and column players, computing mosaic regret through time. The leftmost subplot above summarizes the time-average convergence of mosaic regret, 1/T PT t=0 MRt, across all games and trials. The subsequent subplots show convergence for one example game per Bruns class, with classes corresponding to those in Fig. 5(b); in each of these latter subplots, each line indicates an individual trial, and the shaded region shows the average mosaic regret and 95% confidence interval across all trials. regret averaged across all 1440 combinations of games and trials in the lefternmost subplot of Fig. 6, verifying our earlier theoretical convergence results for these general classes of games. Moreover, to understand the game-specific convergence characteristics, we plot the time-average mosaic regret for one example game per Bruns class (i.e., one per win-win , biased , second best , unfair , traps , sad , and cyclic class of games) in the subsequent subplots of Fig. 6. The time-average mosaic regret converges across all these games, with a notable observation being that final game, Ba As, suffers from higher-variance across trials, which is explained primarily due to the cyclical nature of this particular game. 6.2 Mosaic Regret Beyond 2 2 Games We next consider several increasingly-complex instances of larger games, where we observe evidence that RD minimizes mosaic regret. Each row of Fig. 8 visualizes results associated with a distinct 3 3 game, where the respective row player payoffs are defined as 0 1 1 1 0 1 1 1 0 0 1 2 1 0 1 2 1 0 1 1 1.2 1 0 1 1 1 0.5 with corresponding column player payoffs Bi = Ai for each game i. These payoffs, in order, correspond to the canonical game of Rock Paper Scissors (RPS), a variant of RPS with symmetrically-biased payoffs (yielding an non-uniform Nash equilibrium for both players), and an asymmetric variant of RPS (yielding distinct Nash equilibria for either player). We next construct a particular instance of a mosaic regret deviation partition, and illustrate that RD minimizes mosaic regret under such a partitioning scheme in the above games. The first column in Fig. 8 visualizes points corresponding to raw trajectories exhibited under RD. For each game, the second column of Fig. 8 visualizes a specific mosaic regret deviation partition as follows: we select an arbitrary reference point in the row player s trajectories, xr, and compute the KL-divergence of the point with respect to the row player s Nash equilibrium, x (indicated in orange in the corresponding plots), i.e., d DKL(xr x ). Subsequently, throughout all raw trajectories, we retain only the joint Piliouras, Rowland, Omidshafiei, Elie, Hennes, Connor and Tuyls Raw Trajectories Retained Subset Average Observed Column Strategies Game (A1, B1) Game (A2, B2) Game (A3, B3) Figure 8: Several instances of 3 3 games where we observe evidence that RD minimizes mosaic regret. In all plots, the Nash equilibrium is indicated in orange for each player. Each row corresponds to a zero-sum game (with payoffs described in the main text). The first column visualizes points corresponding to raw trajectories exhibited under RD. Next, we retain only the partition of points wherein the first player s strategy, x, is within a small neighborhood of a reference KL-divergence from its Nash equilibrium; the second column plots the joint strategy under this filtering scheme. In the final column, we use all such sampled column player strategies up to the given timestep t, y