Policy Regret in Repeated Games

Raman Arora, Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21204, arora@cs.jhu.edu
Michael Dinitz, Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21204, mdinitz@cs.jhu.edu
Teodor V. Marinov, Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21204, tmarino2@jhu.edu
Mehryar Mohri, Courant Institute and Google Research, New York, NY 10012, mohri@cims.nyu.edu

The notion of policy regret in online learning is a well-defined performance measure for the common scenario of adaptive adversaries, which more traditional quantities such as external regret do not take into account. We revisit the notion of policy regret and first show that there are online learning settings in which policy regret and external regret are incompatible: any sequence of play that achieves a favorable regret with respect to one definition must do poorly with respect to the other. We then focus on the game-theoretic setting, where the adversary is a self-interested agent. In that setting, we show that external regret and policy regret are not in conflict and, in fact, that a wide class of algorithms can ensure favorable regret with respect to both definitions, so long as the adversary is also using such an algorithm. We also show that the sequence of play of no policy regret algorithms converges to a policy equilibrium, a new notion of equilibrium that we introduce. Relating this back to external regret, we show that coarse correlated equilibria, to which no external regret players converge, are a strict subset of policy equilibria. Thus, in game-theoretic settings, every sequence of play with no external regret also admits no policy regret, but the converse does not hold.

1 Introduction

Learning in dynamically evolving environments can be described as a repeated game between a player, an online learning algorithm, and an adversary. At each round of the game, the player selects an action, e.g., invests in a specific stock; the adversary, which may be the stock market, chooses a utility function; and the player gains the utility value of its action. The player observes the utility value and uses it to update its strategy for subsequent rounds. The player's goal is to accumulate the largest possible utility over a finite number of rounds of play.¹

The standard measure of the performance of a player is its regret, that is, the difference between the utility achieved by the best offline solution from some restricted class and the utility obtained by the online player, when utilities are revealed incrementally.

Formally, we can model learning as the following problem. Consider an action set $A$. The player selects an action $a_t$ at round $t$, the adversary picks a utility function $u_t$, and the player gains the utility value $u_t(a_t)$. While in the full observation setting the player observes the entire utility function $u_t$, in the bandit setting the player only observes the utility value of its own action, $u_t(a_t)$. We use the shorthand $a_1^t$ to denote the player's sequence of actions $(a_1, \dots, a_t)$ and denote by $U_t = \{u : A^t \to \mathbb{R}\}$ the family of utility functions $u_t$.

¹Such games can be equivalently described in terms of minimizing losses rather than maximizing utilities. All our results can be equivalently expressed in terms of losses instead of utilities.
The objective of the player is to maximize its expected cumulative utility over $T$ rounds, i.e., to maximize $\mathbb{E}\big[\sum_{t=1}^T u_t(a_1^t)\big]$, where the expectation is over the player's (possible) internal randomization. Since this is clearly impossible to maximize without knowledge of the future, the algorithm instead seeks to achieve a performance comparable to that of the best fixed action in hindsight. Formally, external regret is defined as

$$R(T) = \mathbb{E}\Big[\max_{a \in A} \sum_{t=1}^T u_t(a_1^{t-1}, a) - u_t(a_1^t)\Big]. \qquad (1)$$

A player is said to admit no external regret if the external regret is sublinear, that is, $R(T) = o(T)$.

In contrast to statistical learning, online learning algorithms do not need to make stochastic assumptions about data generation: strong regret bounds are possible even if the utility functions are adversarial. There are two main adversarial settings in online learning: the oblivious setting, where the adversary ignores the player's actions and the utility functions can be thought of as determined before the game starts (for instance, in weather prediction); and the adaptive setting, where the adversary can react to the player's actions, thus seeking to throw the player off track (e.g., competing with other agents in the stock market). More generally, we define an $m$-memory bounded adversary as one that at any time $t$ selects a utility function based only on the player's past $m$ actions:

$$u_t(a'_1, \dots, a'_{t-m-1}, a_{t-m}, \dots, a_t) = u_t(a_1, \dots, a_{t-m-1}, a_{t-m}, \dots, a_t),$$

for all $a'_1, \dots, a'_{t-m-1}$ and all $a_1, \dots, a_t$. An oblivious adversary can therefore be equivalently viewed as a 0-memory bounded adversary and an adaptive adversary as an $\infty$-memory bounded adversary. For an oblivious adversary, the external regret in Equation (1) reduces to $R(T) = \mathbb{E}[\max_{a \in A} \sum_{t=1}^T u_t(a) - u_t(a_t)]$, since the utility functions do not depend on past actions. Thus, external regret is meaningful when the adversary is oblivious, but it does not admit any natural interpretation when the adversary is adaptive. The problem stems from the fact that in the definition of external regret, the benchmark is a function of the player's actions. Thus, if the adversary is adaptive, or even memory bounded for some $m > 0$, then external regret does not take into account how the adversary would react had the player selected some other action. To resolve this critical issue, Arora et al. (2012b) introduced an alternative measure of performance called policy regret, for which the benchmark does not depend on the player's actions. Policy regret is defined as follows:

$$\max_{a \in A} \sum_{t=1}^T u_t(a, \dots, a) - \mathbb{E}\Big[\sum_{t=1}^T u_t(a_1^t)\Big]. \qquad (2)$$

Arora et al. (2012b) further gave a reduction, using a mini-batching technique where the mini-batch size is larger than the memory $m$ of the adversary, that turns any algorithm with sublinear external regret against an oblivious adversary into an algorithm with sublinear policy regret against an $m$-memory bounded adversary, albeit at the price of a somewhat worse, though still sublinear, regret bound.

In this paper, we revisit the problem of online learning against adaptive adversaries. Since Arora et al. (2012b) showed that there exists an adaptive adversary against which any online learning algorithm admits linear policy regret, even when the external regret may be sublinear, we ask whether no policy regret implies no external regret. One could expect this to be the case, since policy regret seems to be a stronger notion than external regret.
However, our first main result (Theorem 3.2) shows that this is in fact not the case and that the two notions of regret are incompatible: there exist adversaries (or sequences of utilities) on which action sequences with sublinear external regret admit linear policy regret and action sequences with sublinear policy regret incur linear external regret. We argue, however, that such sequences may not arise in practical settings, that is, in settings where the adversary is a self-interested entity. In such settings, rather than considering a malicious opponent whose goal is to hurt the player by inflicting large regret, it seems more reasonable to consider an opponent whose goal is to maximize their own utility. In zero-sum games, maximizing one's utility comes at the expense of the other player's, but there is a subtle difference between an adversary who is seeking to maximize the player's regret and an adversary who is seeking to minimize the player's utility (or maximize their own utility). We show that in such strategic game settings there is indeed a strong relationship between external regret and policy regret. In particular, we show in Theorem 3.4 that a large class of stable online learning algorithms with sublinear external regret also enjoys sublinear policy regret.

Further, we consider a two-player game where each player is playing a no policy regret algorithm. It is known that no external regret play converges to a coarse correlated equilibrium (CCE) in such a game, but what happens when players are using no policy regret algorithms? We show in Theorem 4.8 that the average play in repeated games between no policy regret players converges to a policy equilibrium, a new notion of equilibrium that we introduce. Policy equilibria differ from more traditional notions of equilibria, such as Nash equilibria or CCEs, in a crucial way. Recall that a CCE is defined to be a recommended joint strategy for the players in a game such that no player has an incentive to deviate unilaterally from the recommended strategy if the other players do not deviate. What happens if the other players react to one player's deviation by deviating themselves? This type of reasoning is not captured by external regret, but it is essentially what is captured by policy regret. Thus, our notion of policy equilibrium must take these counterfactuals into account, and so the definition is significantly more complex. But, by considering functions rather than just actions, we can define such equilibria and prove that they exactly characterize no policy regret play.

Finally, it is natural to determine the relationship between policy equilibria (which characterize no policy regret play) and CCEs (which characterize no external regret play). We show in Theorems 4.9 and 4.10 that the set of CCEs is a strict subset of policy equilibria. In other words, every CCE can be thought of as a policy equilibrium, but no policy regret play might not converge to a CCE.

2 Related work

The problem of minimizing policy regret in a fully adversarial setting was first studied by Merhav et al. (2002). Their work dealt specifically with the full observation setting and assumed that the utility (or loss) functions were m-memory bounded. They gave regret bounds in $O(T^{2/3})$. The follow-up work of Farias and Megiddo (2006) designed algorithms in a reactive bandit setting. However, their results were not stated as regret bounds; instead, they introduced a new way of comparing against playing according to a fixed expert strategy. Arora et al.
(2012b) studied m-memory bounded adversaries in both the bandit and the full information settings, and provided extensions to the more powerful competitor classes considered in swap regret and, more generally, Φ-regret. Dekel et al. (2014) provided a lower bound in the bandit setting for switching-cost adversaries, which also leads to a tight lower bound of order $T^{2/3}$ for policy regret. Their results were later extended by Koren et al. (2017a) and Koren et al. (2017b). More recently, Heidari et al. (2016) considered the multi-armed bandit problem where each arm's loss evolves with consecutive pulls. The process according to which the loss evolves was not assumed to be stochastic, but it was not arbitrary either: the authors required the losses to be either concave, increasing, and satisfying a decreasing marginal returns property, or decreasing. The regret bounds given are in terms of the time required to distinguish the optimal arm from all others.

A large part of reinforcement learning is also aimed at studying sequential decision-making problems. In particular, one can define a Markov decision process (MDP) by a set of states equipped with transition distributions, a set of actions, and a set of reward or loss distributions associated with each state-action pair. The transition and reward distributions are assumed unknown, and the goal is to play according to a strategy that minimizes loss or maximizes reward. We refer the reader to (Sutton and Barto, 1998; Kakade, 2003; Szepesvári, 2010) for general results in RL. MDPs in the online setting with bandit feedback or arbitrary payoff processes have been studied by Even-Dar et al. (2009a), Yu et al. (2009), Neu et al. (2010), and Arora et al. (2012a).

The tight connection between no-regret algorithms and correlated equilibria was established and studied by Foster and Vohra (1997), Fudenberg and Levine (1999), Hart and Mas-Colell (2000), and Blum and Mansour (2007). A general extension to games with compact, convex strategy sets was given by Stoltz and Lugosi (2007). No external regret dynamics were studied in the context of socially concave games by Even-Dar et al. (2009b). More recently, Hazan and Kale (2008) considered more general notions of regret and established an equivalence result between fixed-point computation, the existence of certain no-regret algorithms, and convergence to the corresponding equilibria. In follow-up work, Mohri and Yang (2014, 2017) considered a more powerful set of competitors and showed that repeated play according to conditional swap regret or transductive regret algorithms leads to a new set of equilibria.

3 Policy regret in reactive versus strategic environments

Often, actions taken in the distant past influence an adversary less than more recent ones. The definition of policy regret in (2) models this influence decay by assuming that the adversary is m-memory bounded for some $m \in \mathbb{N}$. This assumption is somewhat stringent, however, since ideally we would model the current move of the adversary as a function of the entire past, even if actions taken further in the past have less significance. Thus, we extend the definition of Arora et al. (2012b) as follows.

Definition 3.1. The m-memory policy regret at time $T$ of a sequence of actions $(a_t)_{t=1}^T$, with respect to a fixed action $a$ in the action set $A$ and the sequence of utilities $(u_t)_{t=1}^T$, where $u_t : A^t \to \mathbb{R}$ and $m \in \mathbb{N} \cup \{\infty\}$, is

$$P(T, a) = \sum_{t=1}^T u_t(a_1, \dots, a_{t-m}, a, \dots, a) - u_t(a_1, \dots, a_t).$$
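To make the two benchmarks concrete, the following minimal Python sketch (our own illustration, not code from the paper; the helper names and the toy 1-memory bounded utility are ours) computes the external regret of Equation (1) and the m-memory policy regret of Definition 3.1 for a given play sequence, in each case against the best fixed action in hindsight.

```python
import random

def external_regret(utils, plays, actions):
    # utils[t] takes the full history tuple (a_1, ..., a_t) of the player's
    # actions; Equation (1) swaps only the round-t action for a fixed a.
    realized = sum(u(tuple(plays[:t + 1])) for t, u in enumerate(utils))
    best = max(sum(u(tuple(plays[:t]) + (a,)) for t, u in enumerate(utils))
               for a in actions)
    return best - realized

def policy_regret(utils, plays, actions, m):
    # Definition 3.1 replaces the last m actions of the history by the fixed a,
    # so a memory-bounded u_t "reacts" to the deviation inside the benchmark.
    realized = sum(u(tuple(plays[:t + 1])) for t, u in enumerate(utils))
    best = max(
        sum(u(tuple(plays[:max(t + 1 - m, 0)]) + (a,) * min(m, t + 1))
            for t, u in enumerate(utils))
        for a in actions)
    return best - realized

# Toy 1-memory bounded utility: the player is rewarded for repeating the
# previous action. We evaluate policy regret with m = 2 so that the constant
# block in the benchmark covers the utility's memory.
def u(history):
    return 1.0 if len(history) >= 2 and history[-1] == history[-2] else 0.5

T, actions = 200, ("L", "R")
plays = [random.choice(actions) for _ in range(T)]
print(external_regret([u] * T, plays, actions),
      policy_regret([u] * T, plays, actions, m=2))
```

The only difference between the two functions is the counterfactual history handed to $u_t$: external regret swaps only the round-$t$ action, while policy regret rewrites the last $m$ actions, which is what lets it account for a memory-bounded adversary's reaction.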
We say that the sequence $(a_t)_{t=1}^T$ has sublinear policy regret (or no policy regret) if $P(T, a) = o(T)$ for all actions $a \in A$. Let us emphasize that this definition is just an extension of the standard policy regret definition and that, when the utility functions are m-memory bounded, the two definitions exactly coincide.

While the motivation for policy regret suggests that it should be a stronger notion than external regret, we show not only that these notions are incomparable in the general adversarial setting, but also that they are incompatible in a strong sense.

Theorem 3.2. There exists a sequence of m-memory bounded utility functions $(u_t)_{t=1}^T$, with $u_t : A^t \to \mathbb{R}$, such that for any constant $m \geq 2$ (independent of $T$), any action sequence with sublinear policy regret will have linear external regret and any action sequence with sublinear external regret will have linear policy regret.

The proof of the above theorem constructs a sequence for which no reasonable play can attain sublinear external regret. In particular, the only way the learner can have sublinear external regret is by choosing to have very small utility. To achieve this, the utility functions chosen by the adversary are the following. At time $t$, if the player plays the same action as their past two actions, then they get utility $1/2$. If the player's past two actions were equal but their current action is different, then they get utility $1$, and if their past two actions differ, then they receive utility $0$ no matter what their current action is. It is easy to see that the maximum-utility play for this sequence (and the strategy with lowest 2-memory bounded policy regret) is to choose the same action at every round. However, such an action sequence admits linear external regret. Moreover, every sublinear external regret strategy must then have sublinear utility and thus linear policy regret.

As discussed in Section 1, in many realistic environments we can instead think of the adversary as a self-interested agent trying to maximize their own utility, rather than trying to maximize the regret of the player. This more strategic environment is better captured by the game-theoretic setting, in particular a 2-player game where both players are trying to maximize their own utility. Even though we have argued that external regret is not a good measure against adaptive opponents, our next result shows that minimizing policy regret in games can be done if both players choose their strategies according to certain no external regret algorithms. More generally, we adapt a classical notion of stability from the statistical machine learning setting and argue that if the players use no external regret algorithms that are stable, then the players will have no policy regret in expectation. To state the result formally, we first need to introduce some notation.

Game definition: We consider a 2-player game $G$ with players 1 and 2. The action set of player $i$ is denoted by $A_i$, which we think of as embedded into $\mathbb{R}^{|A_i|}$ in the obvious way, where each action corresponds to a standard basis vector. The corresponding simplex is $\Delta_{A_i}$. The action of player 1 at time $t$ is $a_t$ and that of player 2 is $b_t$. The observed utility for player $i$ at time $t$ is $u_i(a_t, b_t)$; this is a bilinear form with corresponding matrix $P_i$. We assume that the utilities are bounded in $[0, 1]$.

Algorithm of the player: When discussing algorithms, we take the view of player 1. Specifically, at time $t$, player 1 plays according to an algorithm which can be described as a map $\mathrm{Alg}_t : (A_1 \times A_2)^{t-1} \to \Delta_{A_1}$.
We distinguish between two settings: the full information setting, in which the player observes the full utility function at time $t$ (i.e., $u_1(\cdot, b_t)$), and the bandit setting, in which the player only observes $u_1(a_t, b_t)$. In the full information setting, algorithms like multiplicative weights updates (MWU; Arora et al. (2012c)) depend only on the past $t-1$ utility functions $(u_1(\cdot, b_\tau))_{\tau=1}^{t-1}$, and thus we can think of $\mathrm{Alg}_t$ as a function $f_t : A_2^{t-1} \to \Delta_{A_1}$. In the bandit setting, though, the output of the algorithm at time $t$ depends both on the previous $t-1$ actions $(a_\tau)_{\tau=1}^{t-1}$ and on the utility functions (i.e., the actions picked by the other player). But even in the bandit setting, we would like to think of the player's algorithm as a function $f_t : A_2^{t-1} \to \Delta_{A_1}$. We cannot quite do this; however, we can think of the player's algorithm as a distribution over such functions. So how do we remove the dependence on $A_1^{t-1}$? Intuitively, if we fix the sequence of actions played by player 2, we want to take the expectation of $\mathrm{Alg}_t$ over the possible choices of the actions played by player 1. More formally, consider the distribution $\mu$ over $(A_1 \times A_2)^{t-1}$ generated by simulating the play of the players for $t-1$ rounds. Then let $\mu_{b_1^{t-1}}$ be the distribution obtained by conditioning $\mu$ on the actions of player 2 being $b_1^{t-1}$. Now we let $f_t(b_1^{t-1})$ be the distribution obtained by sampling $a_1^{t-1}$ from $\mu_{b_1^{t-1}}$ and using $\mathrm{Alg}_t(a_1^{t-1}, b_1^{t-1})$. When taking expectations over $f_t$, the expectation is taken with respect to the above distribution. We also refer to the output $p_t^1 = f_t(b_1^{t-1})$ as the strategy of the player at time $t$.

Now that we can refer to algorithms simply as functions (or distributions over functions), we introduce the notion of a stable algorithm.

Definition 3.3. Let $f_t : A_2^{t-1} \to \Delta_{A_1}$ be a sample from $\mathrm{Alg}_t$ (as described above), mapping the past $t-1$ actions in $A_2$ to a distribution over the action set $A_1$. Let the distribution returned at time $t$ be $p_t^1 = f_t(b_1, \dots, b_{t-1})$. We call the algorithm on average $(m, S(T))$-stable with respect to the norm $\|\cdot\|$ if, for any $b'_{t-m}, \dots, b'_{t-1} \in A_2$ with $\tilde p_t^1 = f_t(b_1, \dots, b_{t-m-1}, b'_{t-m}, \dots, b'_{t-1}) \in \Delta_{A_1}$, it holds that $\mathbb{E}\big[\sum_{t=1}^T \|p_t^1 - \tilde p_t^1\|\big] \leq S(T)$, where the expectation is taken with respect to the randomization in the algorithm.

Even though this definition of stability is given with respect to the game setting, it is not hard to see that it extends to the general online learning setting; in fact, this definition is similar in spirit to the one given by Saha et al. (2012). It turns out that most natural no external regret algorithms are stable. In particular, we show in the supplementary material that both Exp3 (Auer et al., 2002) and MWU are on average $(m, m\sqrt{T})$-stable with respect to the $\ell_1$ norm for any $m = o(\sqrt{T})$. It is now possible to show that if each of the players is facing a stable no external regret algorithm, they will also have bounded policy regret (so the incompatibility from Theorem 3.2 cannot occur in this case).

Theorem 3.4. Let $(a_t)_{t=1}^T$ and $(b_t)_{t=1}^T$ be the action sequences of players 1 and 2, and suppose that they come from no external regret algorithms modeled by functions $f_t$ and $g_t$, with regrets $R_1(T)$ and $R_2(T)$ respectively. Assume that the algorithms are on average $(m, S(T))$-stable with respect to the $\ell_2$ norm. Then

$$\mathbb{E}[P(T, a)] \leq \|P_1\|\, S(T) + R_1(T), \qquad \mathbb{E}[P(T, b)] \leq \|P_2\|\, S(T) + R_2(T),$$

where $u_t(a_1^t)$ in the definition of $P(T, a)$ equals $u_1(a_t, g_t(a_1^{t-1}))$ and, similarly, in the definition of $P(T, b)$ it equals $u_2(b_t, f_t(b_1^{t-1}))$. The above holds for any fixed actions $a \in A_1$ and $b \in A_2$. Here the matrix norm is the spectral norm.
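As an informal illustration of the preceding result, the sketch below simulates two full-information MWU (Hedge) players in a small two-player game and estimates player 1's policy regret by replaying player 2's algorithm against a constant deviation. The game matrices, learning rate, and horizon are arbitrary choices of ours rather than anything specified in the paper, and for simplicity the deviation is a constant action from round 1 (a stronger deviation than the m-memory one in Definition 3.1).

```python
import numpy as np

def mwu_play(P1, P2, T, eta=0.1, fixed_a=None):
    """Full-information MWU (Hedge) for both players over T rounds.
    If fixed_a is set, player 1 deviates to that constant action while
    player 2's MWU reacts to it; returns player 1's average expected utility."""
    n1, n2 = P1.shape
    w1, w2 = np.ones(n1), np.ones(n2)
    total = 0.0
    for _ in range(T):
        p1 = w1 / w1.sum() if fixed_a is None else np.eye(n1)[fixed_a]
        p2 = w2 / w2.sum()
        total += p1 @ P1 @ p2
        # Multiplicative update on each action's expected utility this round.
        w1 *= np.exp(eta * (P1 @ p2))
        w2 *= np.exp(eta * (P2.T @ p1))
        w1, w2 = w1 / w1.sum(), w2 / w2.sum()  # normalize for numerical stability
    return total / T

# A toy 2x2 game with utilities in [0, 1] (rows: player 1, columns: player 2).
P1 = np.array([[1.0, 0.0], [0.2, 0.6]])   # player 1's utility matrix
P2 = np.array([[0.3, 0.8], [0.9, 0.1]])   # player 2's utility matrix
T = 5000
base = mwu_play(P1, P2, T)
deviations = [mwu_play(P1, P2, T, fixed_a=a) for a in range(P1.shape[0])]
# Average per-round policy regret estimate for player 1: the best constant
# deviation, letting player 2's algorithm react, minus the realized utility.
print(max(deviations) - base)
```

Restarting player 2's algorithm against the constant action is what distinguishes this estimate from external regret, which would score the deviation against player 2's realized play.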
4 Policy equilibrium

Recall that, unlike external regret, policy regret captures how other players in a game might react if a player decides to deviate from their strategy. The story is similar when considering different notions of equilibria. In particular, Nash equilibria, correlated equilibria, and CCEs can be interpreted in the following way: if player $i$ deviates from the equilibrium play, their utility will not increase, no matter how they decide to switch, provided that all other players continue to play according to the equilibrium. This is a reflection of what no external regret and no swap regret algorithms guarantee. Equipped with the knowledge that no policy regret sequences are attainable in the game setting under reasonable play from all parties, it is natural to ask how other players would react if player $i$ deviated, and what the cost of deviation would be when taking such reactions into account.

Let us again consider the 2-player game setup through the view of player 1. The player believes their opponent might be m-memory bounded and decides to proceed by playing according to a no policy regret algorithm. After many rounds of the game, player 1 has computed an empirical distribution of play $\hat\sigma$ over $A = A_1 \times A_2$. The player is familiar with the guarantees of the algorithm and knows that, if instead they had switched to playing any fixed action $a \in A_1$, then the resulting empirical distribution of play $\hat\sigma_a$, where player 2 has responded accordingly in a memory-bounded way, is such that $\mathbb{E}_{(a,b)\sim\hat\sigma}[u_1(a,b)] \geq \mathbb{E}_{(a,b)\sim\hat\sigma_a}[u_1(a,b)]$. This thought experiment suggests that if no policy regret play converges to an equilibrium, then the equilibrium is described not only by the deviations of player 1, but also by the change in player 2's behavior, which is encoded in the distribution $\hat\sigma_a$. Thus, any equilibrium induced by no policy regret play can be described by tuples of distributions $\{(\sigma, \sigma_a, \sigma_b) : (a,b) \in A\}$, where $\sigma_a$ is the distribution corresponding to player 1's deviation to the fixed action $a \in A_1$ and $\sigma_b$ captures player 2's deviation to the fixed action $b \in A_2$. Clearly $\sigma_a$ and $\sigma_b$ are not arbitrary, but we still need a formal way to describe how they arise.

For convenience, let us restrict the memory of player 2 to be 1. Thus, what player 1 believes is that at each round $t$ of the game, they play an action $a_t$ and player 2 plays a function $f_t : A_1 \to A_2$, mapping $a_{t-1}$ to $b_t = f_t(a_{t-1})$. Finally, the observed utility is $u_1(a_t, f_t(a_{t-1}))$. The empirical distribution of play, $\hat\sigma$, from the perspective of player 1, is formed from the observed play $(a_t, f_t(a_{t-1}))_{t=1}^T$. Moreover, the distribution $\hat\sigma_a$ that would have occurred had player 1 chosen to play action $a$ on every round is formed from the play $(a, f_t(a))_{t=1}^T$. In player 1's view of the world, the actions taken by player 2 are actually functions rather than actions in $A_2$. This suggests that the equilibrium induced by no policy regret play is a distribution over the functional space defined below.

Definition 4.1. Let $\mathcal{F}_1 = \{f : A_2^{m_1} \to A_1\}$ and $\mathcal{F}_2 = \{g : A_1^{m_2} \to A_2\}$ denote the functional spaces of play of players 1 and 2, respectively. Denote the product space by $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2$.

Note that when $m_1 = m_2 = 0$, $\mathcal{F}$ is in one-to-one correspondence with $A$, i.e., when players believe their opponents are oblivious, we recover the action set studied in standard equilibria. For simplicity, for the remainder of the paper we assume that $m_1 = m_2 = 1$.
However, all of the definitions and results that follow can be extended to the fully general setting of arbitrary $m_1$ and $m_2$; see the supplementary material for details.

Let us now investigate how a distribution $\pi$ over $\mathcal{F}$ can give rise to a tuple of distributions $(\sigma, \sigma_a, \sigma_b)$. We begin by defining the utility of $\pi$ so that it equals the utility of a distribution $\sigma$ over $A$, i.e., we want $\mathbb{E}_{(f,g)\sim\pi}[u_1(f,g)] = \mathbb{E}_{(a,b)\sim\sigma}[u_1(a,b)]$. Since utilities are not defined for functions, we need an interpretation of $\mathbb{E}_{(f,g)\sim\pi}[u_1(f,g)]$ which makes sense. We notice that $\pi$ induces a Markov chain with state space $A$ in the following way.

Definition 4.2. Let $\pi$ be any distribution over $\mathcal{F}$. Then $\pi$ induces a Markov process with transition probabilities

$$\mathbb{P}[(a_2, b_2) \mid (a_1, b_1)] = \sum_{\substack{(f,g) \in \mathcal{F}_1 \times \mathcal{F}_2 \\ f(b_1) = a_2,\ g(a_1) = b_2}} \pi(f, g).$$

We associate with this Markov process the transition matrix $M \in \mathbb{R}^{|A| \times |A|}$, with $M_{x_1, x_2} = \mathbb{P}[x_2 \mid x_1]$, where $x_i = (a_i, b_i)$.

Since every Markov chain with a finite state space has a stationary distribution, we think of the utility of $\pi$ as the utility of a particular stationary distribution $\sigma$ of $M$. How we choose $\sigma$ among all stationary distributions will become clear later, but for now we can think of $\sigma$ as the distribution which maximizes the utilities of both players. Next, we need to construct $\sigma_a$ and $\sigma_b$, which capture the deviation in play when player 1 switches to action $a$ and when player 2 switches to action $b$. The no policy regret guarantee can be interpreted as $\mathbb{E}_{(f,g)\sim\pi}[u_1(f,g)] \geq \mathbb{E}_{(f,g)\sim\pi}[u_1(a, g(a))]$, i.e., if player 1 chose to switch to a fixed action (or, equivalently, the constant function which maps everything to the action $a \in A_1$), then their utility should not increase. Switching to a fixed action $a$ changes $\pi$ to a new distribution $\pi_a$ over $\mathcal{F}$. This turns out to be a product distribution which also induces a Markov chain.

Definition 4.3. Let $\pi$ be any distribution over $\mathcal{F}$. Let $\delta_a$ be the distribution over $\mathcal{F}_1$ putting all its mass on the constant function mapping every action $b \in A_2$ to the fixed action $a \in A_1$. Let $\pi_{\mathcal{F}_2}$ be the marginal of $\pi$ over $\mathcal{F}_2$. The distribution resulting from player 1 switching to playing the fixed action $a \in A_1$ is denoted $\pi_a = \delta_a \times \pi_{\mathcal{F}_2}$. This distribution induces a Markov chain with transition probabilities

$$\mathbb{P}[(a, b_2) \mid (a_1, b_1)] = \sum_{(f,g)\,:\, g(a_1) = b_2} \pi(f, g),$$

and the transition matrix of this Markov process is denoted by $M_a$. The distribution $\pi_b$ and matrix $M_b$ are defined similarly for player 2.

Since the no policy regret algorithms we work with do not directly induce distributions over the functional space $\mathcal{F}$, but rather only distributions over the action space $A$, we would like to state all of our utility inequalities in terms of distributions over $A$. Thus, we would like to check whether there is a stationary distribution $\sigma_a$ of $M_a$ such that $\mathbb{E}_{(f,g)\sim\pi}[u_1(a, g(a))] = \mathbb{E}_{(a,b)\sim\sigma_a}[u_1(a,b)]$. This is indeed the case, as verified by the following theorem.

Theorem 4.4. Let $\pi$ be a distribution over the product of function spaces $\mathcal{F}_1 \times \mathcal{F}_2$. For any fixed $a \in A_1$, there exists a stationary distribution $\sigma_a$ of the Markov chain $M_a$ such that $\mathbb{E}_{(a,b)\sim\sigma_a}[u_1(a,b)] = \mathbb{E}_{(f,g)\sim\pi}[u_1(a, g(a))]$. Similarly, for every fixed action $b \in A_2$, there exists a stationary distribution $\sigma_b$ of $M_b$ such that $\mathbb{E}_{(a,b)\sim\sigma_b}[u_2(a,b)] = \mathbb{E}_{(f,g)\sim\pi}[u_2(f(b), b)]$.

The proof of this theorem is constructive and can be found in the supplementary material. With all of this notation, we are ready to formally describe what no policy regret play promises in the game setting in terms of an equilibrium.

Definition 4.5.
A distribution $\pi$ over $\mathcal{F}_1 \times \mathcal{F}_2$ is a policy equilibrium if, for all fixed actions $a \in A_1$ and $b \in A_2$, which generate Markov chains $M_a$ and $M_b$ with the stationary distributions $\sigma_a$ and $\sigma_b$ from Theorem 4.4, there exists a stationary distribution $\sigma$ of the Markov chain $M$ induced by $\pi$ such that

$$\mathbb{E}_{(a,b)\sim\sigma}[u_1(a,b)] \geq \mathbb{E}_{(a,b)\sim\sigma_a}[u_1(a,b)] \quad\text{and}\quad \mathbb{E}_{(a,b)\sim\sigma}[u_2(a,b)] \geq \mathbb{E}_{(a,b)\sim\sigma_b}[u_2(a,b)]. \qquad (3)$$

In other words, $\pi$ is a policy equilibrium if there exists a stationary distribution $\sigma$ of the Markov chain corresponding to $\pi$ such that, when actions are drawn according to $\sigma$, no player has an incentive to change their action. For a simple example of a policy equilibrium, see Section E of the supplementary material.

4.1 Convergence to the set of policy equilibria

Definition 4.5 is our attempt to formally capture the notion of an equilibrium in which player 1's deviation would lead to a reaction from player 2 and vice versa. The definition is inspired by the counterfactual guarantees of no policy regret play, and we would like to check that if the players' strategies yield sublinear policy regret, then the play converges to a policy equilibrium. Since the definition of sublinear policy regret does not involve a distribution over functional spaces but only works with empirical distributions of play, we present our result in terms of distributions over the action space $A$. Thus, we begin by defining the set of all product distributions $\sigma \times \sigma_a \times \sigma_b$ induced by policy equilibria as described in the previous subsection. Here $\sigma_a$ and $\sigma_b$ represent the deviation in strategy if player 1 changed to playing the fixed action $a \in A_1$ and if player 2 changed to playing the fixed action $b \in A_2$, respectively, as constructed in Theorem 4.4.

Definition 4.6. For a policy equilibrium $\pi$, let $S_\pi$ be the set of all stationary distributions which satisfy the equilibrium inequalities (3):

$$S_\pi = \{\sigma \times \sigma_a \times \sigma_b : (a,b) \in A\}.$$

Define $S = \bigcup_{\pi \in \Pi} S_\pi$, where $\Pi$ is the set of all policy equilibria.

Our main result states that the sequence of empirical product distributions $\hat\sigma \times \hat\sigma_a \times \hat\sigma_b$ formed after $T$ rounds of the game converges to $S$. Here $\hat\sigma_a$ and $\hat\sigma_b$ denote the distributions of the deviation in play when player 1 switches to the fixed action $a \in A_1$ and when player 2 switches to the fixed action $b \in A_2$, respectively. We now define these distributions formally.

Definition 4.7. Suppose player 1 is playing an algorithm with output at time $t$ given by $f_t : A_2^{t-1} \to \Delta_{A_1}$, i.e., $p_t^1 = f_t(b_1^{t-1})$. Similarly, suppose player 2 is playing an algorithm with output at time $t$ given by $p_t^2 = g_t(a_1^{t-1})$. The empirical distribution at time $T$ is $\hat\sigma = \frac{1}{T}\sum_{t=1}^T p_t$, where $p_t = p_t^1 \times p_t^2$ is the product distribution over $A$ at time $t$. Further, let $(p_a^2)_t = g_t(a_1^{t-m}, a, \dots, a)$ denote the distribution at time $t$, provided that player 1 switched their strategy to the constant action $a \in A_1$. Let $\delta_a$ denote the distribution over $A_1$ which puts all of its probability mass on action $a$, and let $(p_a)_t = \delta_a \times (p_a^2)_t$ be the product distribution over $A$ corresponding to this change of play at time $t$. Denote by $\hat\sigma_a = \frac{1}{T}\sum_{t=1}^T (p_a)_t$ the empirical distribution corresponding to the change of play. The distribution $\hat\sigma_b$ is defined similarly.

If $f_t$ and $g_t$ are no policy regret algorithms, then our main result states that the sequence $(\hat\sigma \times \hat\sigma_a \times \hat\sigma_b)_T$ converges to the set $S$.

Theorem 4.8. If the algorithms played by player 1 in the form of $f_t$ and by player 2 in the form of $g_t$ yield sublinear policy regret sequences, then the sequence of product distributions $(\hat\sigma \times \hat\sigma_a \times \hat\sigma_b)_{T=1}^\infty$ converges weakly to the set $S$.
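To make Definitions 4.2-4.5 concrete, the following sketch (our own illustration, with an arbitrary random $\pi$ and random utilities; none of the names come from the paper) builds the induced transition matrices for a $2 \times 2$ game with $m_1 = m_2 = 1$ and checks the inequalities (3). For simplicity it takes a single eigenvector of each chain as its stationary distribution, whereas Definition 4.5 only requires that some stationary distribution of $M$ work, paired with the particular $\sigma_a, \sigma_b$ constructed in Theorem 4.4.

```python
import itertools
import numpy as np

A1, A2 = [0, 1], [0, 1]
states = list(itertools.product(A1, A2))          # state space A = A1 x A2

# With m1 = m2 = 1, player 1's "actions" are functions f: A2 -> A1 and
# player 2's are g: A1 -> A2 (Definition 4.1); pi is a distribution on pairs.
F1 = list(itertools.product(A1, repeat=len(A2)))  # f encoded as (f(0), f(1))
F2 = list(itertools.product(A2, repeat=len(A1)))
pairs = list(itertools.product(F1, F2))
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(len(pairs)))           # an arbitrary example pi
u1, u2 = rng.random((2, 2)), rng.random((2, 2))   # toy utilities u_i[a, b]

def chain(dev1=None, dev2=None):
    # Definitions 4.2/4.3: from state (a1, b1) move to (f(b1), g(a1)), with f
    # forced to the constant dev1 (or g to dev2) to model a fixed-action deviation.
    M = np.zeros((len(states), len(states)))
    for w, (f, g) in zip(pi, pairs):
        for i, (a1, b1) in enumerate(states):
            nxt = (dev1 if dev1 is not None else f[b1],
                   dev2 if dev2 is not None else g[a1])
            M[i, states.index(nxt)] += w
    return M

def stationary(M):
    # One stationary distribution: left eigenvector for the eigenvalue closest to 1.
    vals, vecs = np.linalg.eig(M.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return v / v.sum()

def value(dist, u):
    return sum(p * u[a, b] for p, (a, b) in zip(dist, states))

sigma = stationary(chain())
# Inequalities (3): no fixed-action deviation helps either player, once the
# other player's (function-valued) response is accounted for.
ok1 = all(value(sigma, u1) >= value(stationary(chain(dev1=a)), u1) - 1e-9 for a in A1)
ok2 = all(value(sigma, u2) >= value(stationary(chain(dev2=b)), u2) - 1e-9 for b in A2)
print("policy equilibrium inequalities hold for this sigma:", ok1 and ok2)
```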
In particular, if both players are playing MWU or Exp3, we know that they will have sublinear policy regret, so Theorem 4.8 applies. Not surprisingly, we can show something slightly stronger as well. Let $\bar\sigma$, $\bar\sigma_a$ and $\bar\sigma_b$ denote the empirical distributions of observed play corresponding to $\hat\sigma$, $\hat\sigma_a$ and $\hat\sigma_b$, i.e., $\bar\sigma = \frac{1}{T}\sum_{t=1}^T \delta_t$, where $\delta_t$ denotes the Dirac distribution putting all of its weight on the actions played at time $t$. Then these empirical distributions also converge to $S$ almost surely.

4.2 Sketch of proof of the main result

The proof of Theorem 4.8 has three main steps. The first step defines the natural empirical Markov chains $\hat M$, $\hat M_a$ and $\hat M_b$ from the empirical play $(p_t)_{t=1}^T$ (see Definition B.2) and shows that the empirical distributions $\hat\sigma$, $\hat\sigma_a$ and $\hat\sigma_b$ are stationary distributions of the respective Markov chains; the latter is done in Lemma B.3. The next step is to show that the empirical Markov chains converge to Markov chains $M$, $M_a$ and $M_b$ induced by some distribution over $\mathcal{F}$. In particular, we construct an empirical distribution $\hat\pi$ and distributions $\hat\pi_a$ and $\hat\pi_b$ corresponding to the players' deviations (see Definition B.5), and show that these induce the Markov chains $\hat M$, $\hat M_a$ and $\hat M_b$, respectively (Lemma B.7). The distribution $\pi$ we want is then the limit of the sequence $(\hat\pi)_T$. The final step is to show that $\pi$ is a policy equilibrium. The proof goes by contradiction: if $\pi$ is not a policy equilibrium, then no stationary distribution of $M$, together with the corresponding stationary distributions of $M_a$ and $M_b$, can satisfy the inequalities (3). Since the empirical distributions $\hat\sigma$, $\hat\sigma_a$ and $\hat\sigma_b$ of the play satisfy the inequalities (3) up to an $o(1)$ additive factor, we show, in Theorem B.8, that in the limit the policy equilibrium inequalities are exactly satisfied. Combined with the convergence of $\hat M$, $\hat M_a$ and $\hat M_b$ to $M$, $M_a$ and $M_b$, respectively, this implies that there exist stationary distributions of $M$, $M_a$ and $M_b$ satisfying (3), giving a contradiction.

We would like to emphasize that the convergence guarantee of Theorem 4.8 does not rely on there being a unique stationary distribution of the empirical Markov chains $\hat M$, $\hat M_a$ and $\hat M_b$ or of their respective limits $M$, $M_a$, $M_b$. Indeed, Theorem 4.8 shows that any limit point of $\{(\hat\sigma, \hat\sigma_a, \hat\sigma_b)_T\}_{T=1}^\infty$ satisfies the conditions of Definition 4.5. The proof does not require that any of the respective Markov chains have a unique stationary distribution, but only that $\hat\sigma$ has sublinear policy regret. We would also like to remark that $\{(\hat\sigma, \hat\sigma_a, \hat\sigma_b)_T\}_{T=1}^\infty$ need not have a unique limit, and our convergence result only guarantees that the sequence approaches the set $S$. This is standard when showing that any type of no regret play converges to an equilibrium; see, for example, Stoltz and Lugosi (2007).

4.3 Relation of policy equilibria to CCEs

So far we have defined a new class of equilibria and shown that they correspond to no policy regret play. Furthermore, we know that if both players in a 2-player game play stable no external regret algorithms, then their play also has sublinear policy regret. It is natural to ask whether every CCE is also a policy equilibrium: if $\sigma$ is a CCE, is there a corresponding policy equilibrium $\pi$ which induces a Markov chain $M$ for which $\sigma$ is a stationary distribution satisfying (3)? We show that the answer to this question is positive.

Theorem 4.9. For any CCE $\sigma$ of a 2-player game $G$, there exists a policy equilibrium $\pi$ which induces a Markov chain $M$ with stationary distribution $\sigma$.
To prove this, we show that for any CCE we can construct stable no external regret algorithms which converge to it; since stable no external regret algorithms always converge to policy equilibria (Theorem 3.4), this implies that the CCE is also a policy equilibrium. However, we show that the converse is not true: policy equilibria can give rise to behavior which is not a CCE. Our proof appeals to a utility sequence which is similar in spirit to the one in Theorem 3.2, but adapted to the game setting.

Theorem 4.10. There exists a 2-player game $G$ and product distributions $\sigma \times \sigma_a \times \sigma_b \in S$ (where $S$ is defined in Definition 4.6 as the possible distributions of play arising from policy equilibria) such that $\sigma$ is not a CCE of $G$.

In Section E of the supplementary material we give a simple example of a policy equilibrium which is not a CCE.

5 Discussion

In this work we gave a new twist on policy regret by examining it in the game setting, where we introduced the notion of policy equilibrium and showed that it captures the behavior of no policy regret players. While our characterization is precise, we view this as only the first step towards truly understanding policy regret and its variants in the game setting. Many interesting open questions remain.

Even with our current definitions, since we now have a broader class of equilibria to consider, it is natural to go back to the extensive literature in algorithmic game theory on the price of anarchy and the price of stability and reconsider it in the context of policy equilibria. For example, Roughgarden (2015) showed that in smooth games the worst CCE is no worse than the worst Nash equilibrium. Since policy equilibria contain all CCEs (Theorem 4.9), is the same true for policy equilibria? Even more interesting questions arise if we make our definitions more general. For example, what happens with more than 2 players? With three or more players, definitions of reaction by necessity become more complicated. Or what happens when m is not a constant? No policy regret algorithms exist for superconstant m, but our notion of equilibrium requires m to be constant in order for the Markov chains to make sense. Finally, what if we compare against deviations that are more complicated than a single action, in the spirit of swap regret or Φ-regret? From an online learning perspective, note that our notion of on-average stability and the definition of m-memory boundedness are different notions of stability. Is there one unified definition of stability which would allow us to give no policy regret algorithms against stable adversaries even outside of the game setting?

Acknowledgments

This work was supported in part by NSF BIGDATA grant IIS-1546482, NSF BIGDATA grant IIS-1838139, NSF CCF-1535987, NSF IIS-1618662, NSF CCF-1464239, and NSF AITF CCF-1535887.

References

Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: even faster SVD decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974-982, 2016.
Raman Arora, Ofer Dekel, and Ambuj Tewari. Deterministic MDPs with adversarial rewards and bandit feedback. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2012a.
Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the International Conference on Machine Learning (ICML), 2012b.
Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012c.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307-1324, 2007.
Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T^{2/3} regret. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 459-467. ACM, 2014.
Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009a.
Eyal Even-Dar, Yishay Mansour, and Uri Nadav. On the convergence of regret minimization dynamics in concave games. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 523-532. ACM, 2009b.
Daniela Pucci De Farias and Nimrod Megiddo. Combining expert advice in reactive environments. Journal of the ACM (JACM), 53(5):762-799, 2006.
Dean P. Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40, 1997.
Drew Fudenberg and David K. Levine. Conditional universal consistency. Games and Economic Behavior, 29(1-2):104-130, 1999.
Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150, 2000.
Elad Hazan and Satyen Kale. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In Advances in Neural Information Processing Systems, pages 625-632, 2008.
Hoda Heidari, Michael Kearns, and Aaron Roth. Tight policy regret bounds for improving and decaying bandits. In IJCAI, pages 1562-1570, 2016.
Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.
Tomer Koren, Roi Livni, and Yishay Mansour. Bandits with movement costs and adaptive pricing. In Proceedings of the 2017 Conference on Learning Theory, pages 1242-1268, 2017a.
Tomer Koren, Roi Livni, and Yishay Mansour. Multi-armed bandits with metric movement costs. In Advances in Neural Information Processing Systems, pages 4119-4128, 2017b.
Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J. Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947-1958, 2002.
Mehryar Mohri and Scott Yang. Conditional swap regret and conditional correlated equilibrium. In Advances in Neural Information Processing Systems, pages 1314-1322, 2014.
Mehryar Mohri and Scott Yang. Online learning with transductive regret. In Advances in Neural Information Processing Systems, pages 5220-5230, 2017.
Gergely Neu, András Antos, András György, and Csaba Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804-1812, 2010.
Tim Roughgarden. Intrinsic robustness of the price of anarchy. Journal of the ACM (JACM), 62(5), 2015.
Ankan Saha, Prateek Jain, and Ambuj Tewari. The interplay between stability and regret in online learning. arXiv preprint arXiv:1211.6158, 2012.
Gilles Stoltz and Gábor Lugosi. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59(1):187-208, 2007.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1-103, 2010.
Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737-757, 2009.