# Regret Minimization in Behaviorally-Constrained Zero-Sum Games

Gabriele Farina¹, Christian Kroer¹, Tuomas Sandholm¹

¹Carnegie Mellon University, Pittsburgh, PA 15213, USA. Correspondence to: Gabriele Farina, Christian Kroer, Tuomas Sandholm. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

No-regret learning has emerged as a powerful tool for solving extensive-form games. This was facilitated by the counterfactual-regret minimization (CFR) framework, which relies on the instantiation of regret minimizers for simplexes at each information set of the game. We use an instantiation of the CFR framework to develop algorithms for solving behaviorally-constrained (and, as a special case, perturbed in the Selten sense) extensive-form games, which allows us to compute approximate Nash equilibrium refinements. Nash equilibrium refinements are motivated by a major deficiency in Nash equilibrium: it provides virtually no guarantees on how it will play in parts of the game tree that are reached with zero probability. Refinements can mend this issue, but have not been adopted in practice, mostly due to a lack of scalable algorithms. We show that, compared to standard algorithms, our method finds solutions that have substantially better refinement properties, while enjoying a convergence rate that is comparable to that of state-of-the-art algorithms for Nash equilibrium computation both in theory and practice.

1. Introduction

No-regret learning algorithms have become a powerful tool for solving large-scale zero-sum extensive-form games (EFGs) (Bowling et al., 2015; Brown et al., 2015). This has largely been facilitated by the counterfactual-regret minimization (CFR) algorithm (Zinkevich et al., 2007) and its newer variants (Lanctot et al., 2009; Sandholm, 2010; Bowling et al., 2015; Brown and Sandholm, 2015; Brown et al., 2017; Brown and Sandholm, 2017). This framework works by defining a notion of regret local to an information set, and instantiating a standard regret minimizer at each information set in order to minimize local regret. Zinkevich et al. (2007) prove that this scheme of local regret minimization leads to a Nash equilibrium in two-player zero-sum extensive-form games of perfect recall. The framework works with any regret-minimizing algorithm, but in practice variants of the regret-matching algorithm have been dominant (Hart and Mas-Colell, 2000; Bowling et al., 2015; Brown and Sandholm, 2015; Brown et al., 2015). We investigate the extension of regret-matching+ (RM+) (Tammelin et al., 2015), an even faster regret-matching algorithm, to more general regret-minimization problems over (finitely-generated) convex polytopes. We use these results to instantiate RM+ for linearly constrained simplexes, which in turn allows us to model and solve behaviorally-constrained EFGs (which are EFGs with additional linear constraints on the simplexes at each information set). An important special case of this framework is behaviorally-perturbed EFGs, which can be used to compute Nash equilibrium refinements.¹ Nash equilibrium refinements are motivated by major deficiencies in the Nash equilibrium solution concept: Nash equilibria provide no guarantees on performance in information sets that are reached with probability zero in equilibrium, beyond not giving up more utility than the value of the game.
Thus, if an opponent makes a mistake, Nash equilibrium is not guaranteed to capitalize on it, but may instead give back up to all of that utility (Miltersen and Sørensen, 2010). This is especially relevant when Nash equilibria are used as a solution concept for playing against imperfect opponents. Equilibrium refinements ameliorate this issue by introducing further constraints on behavior in information sets that are reached with probability zero. We will be interested in equilibrium concepts that achieve this through the notion of perturbations or trembling hands (Selten, 1975). At each decision point, a player is assumed to tremble with some small probability, and a Nash equilibrium is then computed in this perturbed game. A refinement is then a limit point of the sequence of Nash equilibria achieved as the probability of trembles is taken to zero. In quasi-perfect equilibria, players take into account only the trembles of their opponents (van Damme, 1984), whereas in an extensive-form perfect equilibrium (EFPE), players take into account mistakes made both by themselves and by their opponents (Selten, 1975).

We compare our algorithm for perturbed EFGs to state-of-the-art large-scale zero-sum EFG-solving algorithms: the standard CFR+ algorithm (Tammelin et al., 2015) and the excessive gap technique (EGT) (Nesterov, 2005a) instantiated with a state-of-the-art smoothing function (Nesterov, 2005b; Hoda et al., 2010; Kroer et al., 2015; 2017b). We find that our perturbed variant of CFR+ converges (in the perturbed game) at the same rate as those algorithms, while leading to orders of magnitude more refined strategies. Our algorithm also converges at the same rate in the unperturbed game, almost until the point where the imposed behavioral constraints necessarily prevent further convergence.

¹ The idea of certain kinds of behavioral perturbations to CFR has been suggested by Neller and Lanctot (2013). They suggest that at every information set, with small probability ϵ, a player will make a random move. However, they provide no results on what the refinement consequences are (i.e., what kind of refinement this would lead to), and it is unclear whether the proposed method actually leads to refinements. In contrast, we establish a connection to (approximate) EFPEs in this paper. Furthermore, Neller and Lanctot cite Miltersen and Sørensen (2010), which is about quasi-perfect equilibrium (where a player assumes she will not make errors in the future), while the ϵ modeling of Neller and Lanctot makes ϵ errors at future information sets as well.

2. Related work

No-regret algorithms have a long history in EFG solving. Gordon (2006) developed the Lagrangian Hedging algorithm, which can be used to find a Nash equilibrium in EFGs. However, it suffers from a drawback: it requires projection onto the strategy space at each iteration. Zinkevich et al. (2007) developed CFR, which avoids projection while maintaining the same convergence rate. It has since been extended in a number of ways. Lanctot et al. (2009) showed how to incorporate sampling in CFR. Brown and Sandholm (2015) showed how to achieve greater pruning in CFR, thereby reducing the iteration costs. CFR+ is a state-of-the-art variant of CFR (Tammelin et al., 2015), which has vastly superior practical performance compared to standard CFR, though it is not known to be stronger from a theoretical perspective. Gordon et al.
(2008) show how no-regret algorithms can also be utilized for computing extensive-form correlated equilibria in EFGs.

Polynomial-time algorithms have been proposed for computing certain equilibrium refinements in two-player zero-sum perfect-recall EFGs. Miltersen and Sørensen (2010) develop a linear program (LP) for computing quasi-perfect equilibria by choosing a sufficiently small perturbation to realization plans. Farina and Gatti (2017) develop a similar approach for EFPE computation, but rely on perturbations to the behavioral strategies of the players. These approaches rely on solving modified variants of the sequence-form LP for computing Nash equilibrium (von Stengel, 1996) in EFGs. These algorithms are of theoretical interest only and do not work in practice: they require rational numbers with n log n bits of precision, where n is the number of sequences in the game. Another issue is that LP algorithms do not scale to large EFGs even when just finding Nash equilibria, and in practice CFR-based or EGT-based approaches are used to achieve scalability. Kroer et al. (2017a) recently showed how smoothing functions for first-order methods such as EGT can be extended to games with perturbations.

Johanson et al. (2007) consider robust strategies that arise from assuming that the opponent will randomize between playing a Nash equilibrium and a strategy within some model of opponent behavior. Johanson and Bowling (2009) consider a similar model-biased Nash equilibrium approach on games where an independent model is used at each information set. Ganzfried and Sandholm (2011) develop an opponent-modeling approach that adds opponent-modeled constraints across information sets. Our approach provides a principled framework for solving model-biased games that use general constraints on per-information-set behavioral strategies. Constraints across information sets currently require the much less scalable LP approach.

3. Preliminaries

We briefly introduce several of the basic concepts we use in the rest of the paper. We denote by $\mathbb{R}_+$ and $\mathbb{R}_-$ the sets of non-negative and non-positive reals, respectively.

3.1. Normal-Form Games

Definition 1. A two-player zero-sum normal-form game (for the rest of the paper, simply normal-form game or NFG) is a tuple $(A_1, A_2, u)$ where $A_1$ represents the finite set of actions that player 1 can play, $A_2$ represents the finite set of actions that player 2 can play, and $u : A_1 \times A_2 \to \mathbb{R}$ is the payoff function for player 1, mapping the pair of actions $(a_1, a_2)$ of the players into the payoff for player 1. The corresponding payoff for player 2 is given by $-u(a_1, a_2)$.

Usually, the payoff function $u$ is given as a matrix $U$, called the payoff matrix of the game. The rows of $U$ represent the actions $\{a_{1,1}, \dots, a_{1,n}\} = A_1$ of player 1, while the columns of $U$ represent the actions $\{a_{2,1}, \dots, a_{2,m}\} = A_2$ of player 2. At the intersection of the $i$-th row and the $j$-th column is the payoff for the action pair $(a_{1,i}, a_{2,j})$.

Definition 2. A mixed strategy $\pi$ for player $i \in \{1, 2\}$ is a probability mass function over the set $A_i$. When players play according to mixed strategies $\pi_1$ and $\pi_2$ respectively, the expected payoff is given by
$$\mathbb{E}_{\pi_1, \pi_2}[u] = \sum_{a_1 \in A_1} \sum_{a_2 \in A_2} \pi_1(a_1)\, \pi_2(a_2)\, u(a_1, a_2). \tag{1}$$

3.2. Generalized Normal-Form Games

Telgarsky (2011) and Abernethy et al. (2011) propose a generalization of the concept of normal-form games, which conveniently allows us to remove all expectation operators, making the notation lighter and more legible.
In this generalization, players select deterministic strategies from a convex and compact set. For a normal-form game, this set is the space of all mixed strategies.

Definition 3. A two-player zero-sum generalized normal-form game $\Gamma = (X, Y, u)$ is a tuple defined by a pair of convex and compact action spaces $X \subseteq \mathbb{R}^n$, $Y \subseteq \mathbb{R}^m$, one for each player, as well as a biaffine utility function $u : X \times Y \to \mathbb{R}$. The utility function $u(x, y)$ maps the pair of actions $(x, y) \in X \times Y$ of the players into the payoff for player 1, while the corresponding payoff for player 2 is given by $-u(x, y)$.

Observation 1. Any normal-form game can be mapped to an instance of a generalized normal-form game. Given $\Gamma = (A_1, A_2, u)$, where $|A_1| = n$ and $|A_2| = m$, the set of all mixed strategies for player 1 forms the $n$-dimensional simplex $X = \Delta^n$, while the set of all mixed strategies for player 2 forms the $m$-dimensional simplex $Y = \Delta^m$. Let $U$ be the payoff matrix associated with $\Gamma$. Using Equation 1, we conclude that $\Gamma$ is equivalent to the generalized two-player zero-sum normal-form game $\Gamma' = (X, Y, u')$, where $u'(x, y) = x^\top U y$ for all $(x, y) \in X \times Y$.

3.3. Extensive-Form Games

Definition 4. A two-player zero-sum extensive-form game with imperfect information and perfect recall $\Gamma$ is a tuple $(H, Z, A, P, f_c, \mathcal{I}_1, \mathcal{I}_2, u)$ composed of:

- $H$: a finite set of possible sequences (or histories) of actions, such that the empty sequence $\emptyset \in H$, and every prefix $z$ of a sequence $h \in H$ is also in $H$.
- $Z \subseteq H$: the set of terminal histories, i.e. those sequences that are not a proper prefix of any sequence.
- $A$: a function mapping $h \in H \setminus Z$ to the set of available actions at the non-terminal history $h$.
- $P$: the player function, mapping each non-terminal history $h \in H \setminus Z$ to $\{1, 2, c\}$, representing the player who takes action after $h$. If $P(h) = c$, the player is chance.
- $f_c$: a function assigning to each $h \in H \setminus Z$ such that $P(h) = c$ a probability mass function over $A(h)$.
- $\mathcal{I}_i$, for $i \in \{1, 2\}$: a partition of $\{h \in H : P(h) = i\}$ with the property that $A(h) = A(h')$ for each $h, h'$ in the same set of the partition. For notational convenience, we will write $A(I)$ to mean $A(h)$ for any of the $h \in I$, where $I \in \mathcal{I}_i$. $\mathcal{I}_i$ is the information partition of player $i$, while the sets in $\mathcal{I}_i$ are called the information sets of player $i$.
- $u$: utility function mapping $z \in Z$ to the utility (a real number) gained by player 1 when the history $z$ is reached. The corresponding utility for player 2 is given by $-u(z)$.

We further assume that all players can recall their previous actions and the corresponding information sets. In the rest of the paper, we will use the more relaxed term extensive-form game, or EFG, to mean a two-player zero-sum extensive-form game with imperfect information and perfect recall.

Observation 2. Extensive-form games can be represented as generalized NFGs, for example, via the normal-form representation or the sequence-form representation (Romanovskii, 1962; Koller et al., 1996; von Stengel, 1996).

3.4. Regret and Regret Minimization

Suppose there exists an iterative algorithm which, at each step $t = 1, \dots, T$, computes a strategy $x_t \in X$ for player 1, and plays a (generalized) normal-form game $(X, Y, u)$ against player 2 using that strategy. Let $y_t$ be the strategy used by player 2 at step $t$. The average external regret of player 1 up to step $T$ against action $\hat{x} \in X$ is
$$R^T_1(\hat{x}) = \frac{1}{T} \sum_{t=1}^{T} \big( u(\hat{x}, y_t) - u(x_t, y_t) \big).$$
The case for player 2 is symmetrical. A regret-minimizing scheme is a function that assigns, to each sequence of past actions $x_1, y_1, \dots, x_{t-1}, y_{t-1}$, an action $x_t$ such that $\limsup_{T \to \infty} \max_{\hat{x} \in X} R^T_1(\hat{x}) \le 0$.
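To make the definition concrete, the following minimal sketch (not from the paper; the payoff matrix, strategies, and function names are hypothetical) computes the maximum average external regret of player 1 over a sequence of play in a payoff-matrix game, using the fact that the maximum over all mixed $\hat{x}$ is attained at a pure action.

```python
import numpy as np

def max_average_external_regret(U, xs, ys):
    """Maximum over x-hat of R^T_1(x-hat) = (1/T) * sum_t [u(x-hat, y_t) - u(x_t, y_t)],
    where u(x, y) = x^T U y and U is player 1's payoff matrix."""
    T = len(xs)
    avg_y = sum(ys) / T                        # the regret is linear in y, so averaging y suffices
    realized = sum(x @ U @ y for x, y in zip(xs, ys)) / T
    best_in_hindsight = np.max(U @ avg_y)      # best pure action; no mixed x-hat can do better
    return best_in_hindsight - realized

# Hypothetical example: matching pennies, player 1 stubbornly plays uniform for 100 steps.
U = np.array([[1.0, -1.0], [-1.0, 1.0]])
xs = [np.array([0.5, 0.5])] * 100
ys = [np.array([0.9, 0.1])] * 100
print(max_average_external_regret(U, xs, ys))  # 0.8: the uniform player ignores the opponent's bias
```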
Regret-matching (RM) (Hart and Mas-Colell, 2000) is a regret-minimizing scheme for normal-form games, based on Blackwell's approachability theorem (Blackwell, 1956). The following theorem, a proof of which is given by Cesa-Bianchi and Lugosi (2006), characterizes the convergence rate of RM.

Theorem 1. Given a normal-form game $(A_1, A_2, u)$, the maximum average external regret for player 1 at iteration $T$, when player 1 plays according to the regret-matching algorithm, is
$$\max_{\hat{x}} R^T_1(\hat{x}) \le \frac{\gamma \sqrt{|A_1|}}{\sqrt{T}},$$
where $\gamma := \max_{x,y} u(x, y) - \min_{x,y} u(x, y)$.

Regret-matching+ is an extension of RM, and converges significantly faster in practice. Tammelin et al. (2015, Lemma 2) proved that the convergence rate of RM+ is the same as that of RM above.

3.5. Nash Equilibria and Refinements

We now review the needed solution concepts from game theory. We mostly focus on generalized normal-form games, allowing a compact presentation of concepts pertaining to both normal-form games and extensive-form games.

Definition 5 (Approximate best response). Given a generalized normal-form game $(X, Y, u)$ and a strategy $y \in Y$, we say that $x \in X$ is an ϵ-best response to $y$ for player 1 if $u(x, y) + \epsilon \ge u(\hat{x}, y)$ for all $\hat{x} \in X$. Symmetrically, given $x \in X$, we say that $y \in Y$ is an ϵ-best response to $x$ for player 2 if $-u(x, y) + \epsilon \ge -u(x, \hat{y})$ for all $\hat{y} \in Y$.

Definition 6 (Approximate Nash equilibrium). Given a generalized normal-form game $(X, Y, u)$, the strategy pair $(x, y) \in X \times Y$ is an ϵ-Nash equilibrium for the game if $x$ is an ϵ-best response to $y$ for player 1, and $y$ is an ϵ-best response to $x$ for player 2.

Definition 7 (Nash equilibrium). Given a generalized normal-form game $(X, Y, u)$, a Nash equilibrium for the game is a 0-Nash equilibrium.

There exists a well-known relationship between regret and approximate Nash equilibria (Definition 6), as summarized in the next theorem.

Theorem 2. In a zero-sum game, if the average external regrets of the players up to step $T$ are such that $R^T_1(\hat{x}) \le \epsilon_1$ and $R^T_2(\hat{y}) \le \epsilon_2$ for all actions $\hat{x} \in X$, $\hat{y} \in Y$, then the strategy pair
$$(\bar{x}_T, \bar{y}_T) := \left( \frac{1}{T} \sum_{t=1}^{T} x_t,\ \frac{1}{T} \sum_{t=1}^{T} y_t \right)$$
is an $(\epsilon_1 + \epsilon_2)$-Nash equilibrium.

Theorem 2 basically says that if there exists an iterative algorithm able to progressively propose strategies so that the maximum average external regret goes to zero, then recovering a Nash equilibrium is straightforward, and just a matter of averaging the individual strategies proposed.

We now turn to the class of perturbed games (Selten, 1975). Intuitively, a perturbation restricts the set of playable strategies by enforcing a lower bound on the probability of playing each action. We recall the definition and some of the properties of game perturbations, starting from the normal-form case. We focus on player 1, but remark that the same definitions hold symmetrically for player 2 as well.

Definition 8. Let $\Gamma = (A_1, A_2, u)$ be an NFG and let $\bar{\Gamma} = (\Delta^{|A_1|}, \Delta^{|A_2|}, u')$ be its generalized NFG representation (see Observation 1). A perturbation is a function $p : A_1 \cup A_2 \to \mathbb{R}_+$ such that $\sum_{a \in A_1} p(a) < 1$ and $\sum_{a \in A_2} p(a) < 1$. The corresponding perturbed NFG $\Gamma_p$ is the generalized NFG where each action $a$ must be played with probability at least $p(a)$. Formally, $\Gamma_p = (X_p, Y_p, u')$ where $X_p = \{x \in \Delta^{|A_1|} : x_a \ge p(a)\ \forall a \in A_1\}$, and $Y_p$ is defined analogously.

In the case of extensive-form games, a perturbation for player 1 assigns a lower bound to each action playable by the player. More precisely:

Definition 9. Let $\Gamma = (H, Z, A, P, f_c, \mathcal{I}_1, \mathcal{I}_2, u)$ be an extensive-form game.
A perturbation is a function $p$ mapping each pair $(I, a)$, where $I \in \mathcal{I}_1 \cup \mathcal{I}_2$ and $a \in A(I)$, to a non-negative real, such that
$$\sum_{a \in A(I)} p(I, a) < 1 \qquad \forall I \in \mathcal{I}_1 \cup \mathcal{I}_2.$$
The corresponding perturbed EFG $\Gamma_p$ is the analogous game where each action $a$ at each information set $I$ has to be played with probability at least $p(I, a)$.

Perturbations play an important role in equilibrium refinement, as they form the basis for the concept of equilibrium perfection (Selten, 1975). In this paper we only focus on the case of extensive-form perfect equilibria (EFPEs).

Definition 10. A strategy pair $(x, y) \in X \times Y$ is an EFPE of $\Gamma$ if it is a limit point of a sequence $\{(x_p, y_p)\}_{p \to 0}$, where $(x_p, y_p)$ is a Nash equilibrium of the perturbed game $\Gamma_p$.

Intuitively, an EFPE is an equilibrium refinement that takes into account an imperfect ability to deterministically commit to a single action.

4. Generalized Normal-Form Games over Finitely-Generated Convex Polytopes

In this section, we show how to adapt a regret-minimization algorithm to handle generalized normal-form games played on finitely-generated convex polytopes. The key insight is that when the action space is a finitely-generated convex polytope, the generalized game can be cast back as a normal-form game, i.e. a generalized normal-form game played over simplexes, and solved by a regret-minimization algorithm; subsequently, the solution for the normal-form game gets mapped back into the polytope. This is achieved by constructing new simplex action spaces for the players, where each point in a simplex denotes a convex combination of weights on the vertices of that player's finitely-generated convex polytopal action space.

Theorem 3. Let $\Gamma = (X, Y, u)$ be a generalized normal-form game played on the finitely-generated convex polytopes $X$ and $Y$. There exists a regret-minimizing scheme for player 1 in $\Gamma$.

Proof. Let $\{b_1, \dots, b_n\}$ be a convex basis for $X$, and $\{c_1, \dots, c_m\}$ be a convex basis for $Y$; also, let $B = (b_1 \mid \cdots \mid b_n)$ and $C = (c_1 \mid \cdots \mid c_m)$ be the basis matrices for $X$ and $Y$, respectively. We construct a generalized normal-form game $\Gamma' = (\Delta^n, \Delta^m, u')$, where $u'(x, y) := u(Bx, Cy)$ for all $x \in \Delta^n$, $y \in \Delta^m$. Of course $Bx \in X$ and $Cy \in Y$ for all $x$ and $y$, so the definition is valid. Let $f'$ be any regret-minimizing scheme for normal-form games (e.g., RM or RM+). We construct a regret-minimizing scheme $f$ for $\Gamma$ such that, at each iteration $t = 1, 2, \dots$,
$$f(x_1, y_1, \dots, x_{t-1}, y_{t-1}) = B\, f'(x'_1, y'_1, \dots, x'_{t-1}, y'_{t-1}),$$
where $x', y'$ denote the coordinates of $x, y$ with respect to the bases of $X, Y$, respectively; note that this definition is well-defined since the coordinates are guaranteed to belong to $\Delta^n, \Delta^m$.² The regret induced by this scheme is
$$R^T_1(\hat{x}) = \frac{1}{T} \sum_{t=1}^{T} \big( u(\hat{x}, y_t) - u(x_t, y_t) \big) = \frac{1}{T} \sum_{t=1}^{T} \big( u'(\hat{x}', y'_t) - u'(x'_t, y'_t) \big).$$
Notice that the last expression is exactly the average regret for player 1 up to iteration $T$ against action $\hat{x}'$ in $\Gamma'$. Since $f'$ is a regret-minimizing scheme, the average regret against any action converges to zero, meaning that $\limsup_T R^T_1(\hat{x}) \le 0$ for each $\hat{x}$, i.e. $\limsup_T \max_{\hat{x} \in X} R^T_1(\hat{x}) \le 0$. This proves that $f$ is a regret-minimizing scheme for $\Gamma$, concluding the proof.

Another way to think about the construction above is that at each iteration, we compute the regret for not playing each of the strategies forming the vertices of the polytope, and update the next strategy by taking a convex combination of the vertices, in a way proportional to the regret against them.
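The construction above can be sketched in a few lines of code. The following is a minimal illustration (not the paper's Algorithm 1 verbatim, and using the standard thresholded RM+ update): the polytope is assumed to be given explicitly by its vertex matrix B, the opponent's strategies are assumed to be observable, and all numbers are hypothetical.

```python
import numpy as np

class PolytopeRegretMatchingPlus:
    """RM+ over a finitely-generated polytope X = conv{b_1, ..., b_n}.

    Regrets are kept per vertex (column of B); the played strategy is the
    regret-proportional convex combination of the vertices, i.e. x = B @ w."""

    def __init__(self, B):
        self.B = np.asarray(B, dtype=float)   # d x n matrix whose columns are the vertices
        self.n = self.B.shape[1]
        self.Q = np.zeros(self.n)             # thresholded cumulative regrets (RM+)
        self.sum_x = np.zeros(self.B.shape[0])
        self.t = 0

    def next_strategy(self):
        s = self.Q.sum()
        w = self.Q / s if s > 0 else np.full(self.n, 1.0 / self.n)
        return self.B @ w                     # a point of the polytope

    def observe(self, x, utility_fn):
        """utility_fn(z) = u(z, y_t), player 1's utility against the observed y_t."""
        u_x = utility_fn(x)
        inst = np.array([utility_fn(self.B[:, i]) - u_x for i in range(self.n)])
        self.Q = np.maximum(self.Q + inst, 0.0)
        self.sum_x += x
        self.t += 1

    def average_strategy(self):
        return self.sum_x / max(self.t, 1)

# Hypothetical usage: X = {x in the 2-simplex : x_1 >= 0.1}, bilinear utility u(x, y) = x^T U y.
U = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = np.array([[0.1, 1.0],                     # the two vertices of the constrained simplex
              [0.9, 0.0]])
rm = PolytopeRegretMatchingPlus(B)
y = np.array([0.9, 0.1])                      # a fixed, hypothetical opponent strategy
for _ in range(1000):
    x = rm.next_strategy()
    rm.observe(x, lambda z: z @ U @ y)
print(rm.average_strategy())                  # approaches (1, 0), the best response within X
```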
Algorithm 1 represents an instantiation of the construction given in the proof of Theorem 3, where the regret-minimizing scheme for the normal-form game was chosen to be RM+. A careful analysis of the construction also reveals that the convergence bound for RM+ carries over, as expressed by Theorem 4. At time $t$, RM+ projects the cumulative regret $r_{t-1}$ onto the non-negative orthant $\mathbb{R}^n_+$; the projection is equal to the vector $[r_{t-1}]^+$, where $[a]^+_i := \max\{0, a_i\}$.

Theorem 4. Given a generalized normal-form game $(X, Y, u)$ with finitely generated $X$ and $Y$, the maximum average external regret for player 1 at iteration $T$, when player 1 plays according to Algorithm 1, is bounded by
$$\max_{\hat{x} \in X} R^T_1(\hat{x}) \le \frac{\gamma \sqrt{|X|}}{\sqrt{T}},$$
where $\gamma := \max_{x,y} u(x, y) - \min_{x,y} u(x, y)$, and $|X|$ denotes the number of vertices of $X$.

² Passage to coordinates might not be unique. In this case, any coordinate vector will do, as long as the choice is deterministic.

Algorithm 1: RM+ algorithm for generalized normal-form games played over finitely-generated convex polytopes.

```
 1: procedure REGRET-MATCHING+(Γ)      ▷ Γ = (X, Y, u); B = (b_1 | ... | b_n) is a fixed convex basis for X;
                                        ▷ this reflects the point of view of player 1
 2:   r_0 ← (0, ..., 0) ∈ R^n
 3:   x̄ ← (0, ..., 0) ∈ R^n
 4:   for t = 1, 2, 3, ... do
 5:     if r_{t-1} ∈ R^n_- then
 6:       x_t ← any action in X
 7:     else
 8:       Λ_{t-1} ← Σ_{i=1}^{n} [r_{t-1}]^+_i
 9:       x_t ← B [r_{t-1}]^+ / Λ_{t-1}
10:     play action x_t
11:     observe y_t ∈ Y played by the opponent
12:     r_t ← r_{t-1} + ( u(b_1, y_t) - u(x_t, y_t), ..., u(b_n, y_t) - u(x_t, y_t) )^⊤
13:     x̄ ← x̄ + x_t
                                        ▷ x̄ accumulates the average strategy for player 1
```

5. Behavioral Constraints and Perturbations

Behavioral constraints are linear constraints on the simplexes at each information set. In order to obtain a regret minimizer for a behaviorally-constrained EFG, we could try to cast the game as a generalized NFG by means of the normal-form or sequence-form representation (see Observation 2). However, the number of vertices of this representation is exponential, and therefore it does not work well with Theorem 4. Counterfactual Regret (CFR, Zinkevich et al. 2007) solves this problem by defining a regret-minimizing scheme that runs in polynomial time in the size of the game. Intuitively, CFR minimizes a variant of instantaneous regret, called immediate counterfactual regret, at each information set separately, and later combines the strategies computed at each information set. It requires simplex regret minimizers for each information set. If we have a finite number of linear constraints, the constrained strategy space at each information set can be modeled as a finitely-generated convex polytope. We can then use Theorem 4 to get regret minimizers for each information set. Perturbations can be handled as a special case. Theorem 5 below shows that CFR+ instantiated with such regret minimizers for each behaviorally-constrained information set converges to an equilibrium of the constrained EFG. For this approach to be practical, we need the set of vertices for each information set to be of manageable size, as reflected in the dependence on $\max_{I \in \mathcal{I}} \sqrt{|Q_I|}$ in Theorem 5, where $|Q_I|$ is the number of vertices of the behaviorally-constrained simplex at information set $I$.

Theorem 5. Let $(H, Z, A, P, f_c, \mathcal{I}_1, \mathcal{I}_2, u)$ be an extensive-form game; let $Q_I \subseteq \Delta^{|A(I)|}$ represent the behaviorally-constrained strategy space at information set $I$, for all $I \in \mathcal{I}_1 \cup \mathcal{I}_2$. The maximum average external regret for player 1 in the constrained game at iteration $T$, when player 1 plays according to CFR+, is bounded by
$$\frac{\gamma\, |\mathcal{I}_1| \max_{I \in \mathcal{I}_1} \sqrt{|Q_I|}}{\sqrt{T}},$$
where $\gamma := \max_{x,y} u(x, y) - \min_{x,y} u(x, y)$.
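As a concrete, hypothetical illustration of the vertex sets $Q_I$ appearing in Theorem 5, consider an information set $I$ with three actions $\{a_1, a_2, a_3\}$ and the behavioral constraints $x_{a_1} \ge 0.1$ and $x_{a_2} \le 0.5$. The constrained simplex
$$Q_I = \big\{ x \in \Delta^3 : x_{a_1} \ge 0.1,\ x_{a_2} \le 0.5 \big\}$$
has the four vertices
$$(1, 0, 0), \quad (0.1, 0, 0.9), \quad (0.5, 0.5, 0), \quad (0.1, 0.5, 0.4),$$
so $|Q_I| = 4$, and the regret minimizer of Theorem 4 for this information set runs over these four vertices.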
6. Perturbed Normal-Form Games

Section 4 established that, in general, the problem of finding an approximate Nash equilibrium for player 1 in the generalized normal-form game $\Gamma = (X, Y, u)$, where $X$ is a convex polytope generated by $n$ vectors, can be solved via regret-matching. We now specialize this result to the specific case of perturbed normal-form games. The following holds:

Proposition 6. Let $\Gamma = (A_1, A_2, u)$ be a normal-form game, where $A_1 = \{a_1, \dots, a_n\}$, and let $p$ be a perturbation for player 1. Let $\Gamma_p = (X_p, Y_p, u')$ be the generalized normal-form game corresponding to the perturbation (Definition 8). Then the perturbed action space $X_p$ is a finitely-generated convex polytope of dimension $n$, a basis of which is given by the columns of the following invertible matrix:
$$B_p = \begin{pmatrix} \tau_p + p(a_1) & p(a_1) & \cdots & p(a_1) \\ p(a_2) & \tau_p + p(a_2) & \cdots & p(a_2) \\ \vdots & \vdots & \ddots & \vdots \\ p(a_n) & p(a_n) & \cdots & \tau_p + p(a_n) \end{pmatrix}, \qquad \tau_p := 1 - p(a_1) - \cdots - p(a_n).$$

This means that Algorithm 1 is applicable and provides a regret-minimizing scheme. We remark that when computing the instantaneous regrets (Algorithm 1, Line 12), it is important to remember that these values have to be computed against the basis $\{b_1, \dots, b_n\}$ of $X_p$. However, computing the (expected) utility of the game when player 1 plays according to a mixed strategy is usually more expensive than the same task when player 1 plays a deterministic action. For this reason, we express the instantaneous regret calculation against $\{b_1, \dots, b_n\}$ in terms of instantaneous regrets against the pure actions $\{e_1, \dots, e_n\}$ in the unperturbed game. In particular, by using the fact that the utility function is biaffine, we can write, for each $i \in \{1, \dots, n\}$,
$$u(b_i, y_t) = u\Big(\tau_p e_i + \sum_{j=1}^{n} p(a_j)\, e_j,\ y_t\Big) = \tau_p\, u(e_i, y_t) + \sum_{j=1}^{n} p(a_j)\, u(e_j, y_t),$$
so that, by introducing $\phi_{t,i} := u(e_i, y_t) - u(x_t, y_t)$ and the corresponding vector $\phi_t := (\phi_{t,1}, \dots, \phi_{t,n})^\top$, we have
$$r_t = r_{t-1} + \tau_p\, \phi_t + \big(p^\top \phi_t\big)\, \mathbf{1}, \qquad p := (p(a_1), \dots, p(a_n))^\top.$$
This allows us to compute the regret update in terms of the instantaneous regret against $\{e_1, \dots, e_n\}$ in the unperturbed game, without introducing any overhead from an asymptotic point of view. The maximum average external regret for player 1 at iteration $T$ is given by Theorem 4; in this case $|X_p| = |A_1|$.

7. Perturbed Extensive-Form Games

As discussed in Section 5, it is possible to use CFR in conjunction with any regret-minimizing scheme for generalized NFGs, in order to define a regret-minimizing scheme able to support any behaviorally-perturbed EFG (thus including the restricted case of perturbed EFGs). In Algorithm 2, we propose an implementation of CFR+, i.e. CFR instantiated with the RM+ algorithm, able to handle perturbed EFGs. Algorithm 2 assumes that we are given a perturbation $p$ of the extensive-form game; $X^I_p$ denotes the perturbed simplex for information set $I$, and $\tau_p(I)$ is as in Proposition 6. The following theorem characterizes the convergence guarantee of the proposed algorithm.

Theorem 7. Let $(H, Z, A, P, f_c, \mathcal{I}_1, \mathcal{I}_2, u)$ be an extensive-form game; let $p$ be the perturbation applied to the game. The maximum average external regret for player 1 in the perturbed game at iteration $T$, when player 1 plays according to Algorithm 2, is bounded by
$$\frac{\gamma\, |\mathcal{I}_1| \max_{I \in \mathcal{I}_1} \sqrt{|A(I)|}}{\sqrt{T}},$$
where $\gamma := \max_{x,y} u(x, y) - \min_{x,y} u(x, y)$.

Proof. Follows as a corollary of Theorem 5.

Notice that the bound provided by Theorem 5 is the same as that provided by the original CFR algorithm proposed by Zinkevich et al. (2007). In other words, our modification does not impair the convergence and speed guarantees given by the original algorithm.
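Before the listing of Algorithm 2, the following minimal numerical sketch (not from the paper; all numbers and function names are hypothetical) illustrates the two ingredients the algorithm reuses at every information set: the perturbed basis matrix $B_p$ of Proposition 6, through which the simplex RM+ strategy is mapped (yielding $\tau_p [r]^+/\Lambda + p$, so every action keeps probability at least $p(a)$), and the regret-update identity $r_t = r_{t-1} + \tau_p \phi_t + (p^\top \phi_t)\mathbf{1}$.

```python
import numpy as np

def perturbed_basis(p):
    """Vertices of the perturbed simplex X_p (Proposition 6), as columns: B_p = tau_p * I + p * 1^T."""
    p = np.asarray(p, dtype=float)
    tau = 1.0 - p.sum()
    return tau * np.eye(len(p)) + np.outer(p, np.ones(len(p)))

def perturbed_strategy(r, p):
    """Map the RM+ simplex strategy [r]^+ / sum([r]^+) through B_p.

    Since every column of B_p equals tau_p * e_i + p, this is
    tau_p * [r]^+ / sum([r]^+) + p, so each action gets probability at least p(a)."""
    p = np.asarray(p, dtype=float)
    r_plus = np.maximum(r, 0.0)
    w = r_plus / r_plus.sum() if r_plus.sum() > 0 else np.full(len(p), 1.0 / len(p))
    return perturbed_basis(p) @ w

# Hypothetical three-action information set with per-action lower bounds.
p = np.array([0.05, 0.10, 0.02])
tau = 1.0 - p.sum()
r = np.array([2.0, -1.0, 0.5])                 # hypothetical cumulative regrets
x = perturbed_strategy(r, p)
print(x, x.sum(), bool(np.all(x >= p)))        # a distribution that respects the lower bounds

# Regret-update identity from Section 6: the increment against the vertices b_i equals
# tau_p * phi_t + (p^T phi_t) * 1, with phi_{t,i} = u(e_i, y_t) - u(x_t, y_t).
u_pure = np.array([1.3, -0.4, 0.7])            # hypothetical u(e_i, y_t)
u_xt = 0.5                                     # hypothetical u(x_t, y_t)
phi = u_pure - u_xt
direct = perturbed_basis(p).T @ u_pure - u_xt  # u(b_i, y_t) - u(x_t, y_t), by bilinearity
cheap = tau * phi + (p @ phi) * np.ones(len(p))
print(bool(np.allclose(direct, cheap)))        # True
```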
Algorithm 2: Regret minimization algorithm for perturbed extensive-form games.

```
 1: procedure REGRET-MATCH+-INFOSET(I, t)       ▷ we assume A(I) = {a_1, ..., a_n}
 2:   if r^I_{t-1} ∈ R^n_- then
 3:     x^I_t ← any action in X^I_p
 4:   else
 5:     x^I_t ← τ_p(I) [r^I_{t-1}]^+ / Σ_{i=1}^{n} [r^I_{t-1}]^+_i + ( p(I, a_1), ..., p(I, a_n) )^⊤

 1: procedure TRAVERSE(h, t, π_1, π_2)          ▷ assume h belongs to information set I
 2:   if h ∈ Z then
 3:     return u(h)
 4:   if P(h) = c then                          ▷ chance node
 5:     sample a ~ f_c(h)
 6:     return TRAVERSE(ha, t, π_1, π_2)
 7:   else if P(h) = 2 then                     ▷ player 2's turn
 8:     return (y^I_t)^⊤ ( TRAVERSE(ha_1, t, π_1, y^I_{t,1} π_2), ..., TRAVERSE(ha_n, t, π_1, y^I_{t,n} π_2) )^⊤
 9:   else                                      ▷ player 1's turn
10:     REGRET-MATCH+-INFOSET(I, t)
11:     v^I_t ← ( TRAVERSE(ha_1, t, x^I_{t,1} π_1, π_2), ..., TRAVERSE(ha_n, t, x^I_{t,n} π_1, π_2) )^⊤
12:     v ← (x^I_t)^⊤ v^I_t
13:     φ^I_t ← π_2 ( v^I_t - 1 v )
14:     r^I_t ← r^I_{t-1} + τ_p(I) φ^I_t + ( p_I^⊤ φ^I_t ) 1,   where p_I := (p(I, a_1), ..., p(I, a_n))^⊤
15:     x̄^I ← x̄^I + π_1 x^I_t
16:     return v

 1: procedure CFR+(Γ)                           ▷ Γ = (H, Z, A, P, f_c, I_1, I_2, u)
 2:   for all I ∈ I_1 do
 3:     r^I_0 ← (0, ..., 0) ∈ R^{|A(I)|}
 4:     x̄^I ← (0, ..., 0) ∈ R^{|A(I)|}
 5:   for t = 1, 2, 3, ... do
 6:     play according to strategy x_t
 7:     observe strategy y_t played by the opponent
 8:     TRAVERSE(∅, t, 1, 1)
 9:   for all I ∈ I_1 do
10:     x̄^I ← x̄^I / Σ_{i=1}^{|A(I)|} x̄^I_i
                                                ▷ x̄ contains the average strategy for player 1
```

8. Experimental Evaluation

We conducted experiments to investigate the practical performance of our perturbed-regret-minimization approach when used to instantiate the CFR and CFR+ algorithms for computing an approximate EFPE in EFGs. We compare these algorithms to state-of-the-art Nash-equilibrium-finding algorithms: EGT (Nesterov, 2005a) on an unperturbed polytope using the state-of-the-art smoothing technique of Kroer et al. (2017b), CFR (Zinkevich et al., 2007), and CFR+ (Tammelin et al., 2015). We conducted the experiments on Leduc hold'em poker (Southey et al., 2005), a widely-used benchmark in the imperfect-information game-solving community. In our variant, Leduc k, the deck consists of k pairs of cards 1, ..., k, for a total deck size of 2k. We experiment on the standard Leduc game where k = 3 and a larger game where k = 5. Each player initially pays one chip to the pot, and is dealt a single private card. After a round of betting, a community card is dealt face up. After a subsequent round of betting, if neither player has folded, both players reveal their private cards. If either player pairs their card with the community card, they win the pot. Otherwise, the player with the highest private card wins. In the event that both players have the same private card, they draw and split the pot. We consider k ∈ {3, 5}.

We test our approach on games subject to different uniform perturbations $p(I, a) = \xi$ for all information sets $I$ and actions $a \in A(I)$, for $\xi \in \{0.1, 0.05, 0.01, 0.005, 0.001\}$. Figure 1 reports on convergence to Nash equilibrium. The x-axis shows the number of tree traversals performed. We use tree traversals rather than iterations because EGT requires more tree traversals than CFR+ per iteration. The y-axis shows the sum of player regrets in the full (unperturbed) game. For both Leduc 3 and 5, we find that the ξ perturbations have only a small effect on overall convergence rate until convergence within the perturbed polytope, at which point the regret in the unperturbed game stops decreasing, as expected. Until bottoming out, the convergence is almost identical for all CFR+ algorithms. This shows that our approach can be utilized in practice: there is no substantial loss of convergence rate.
Later in the run, once the perturbed algorithms have bottomed out, there is a tradeoff between exploitability in the full game and refinement (i.e., better performance in low-probability information sets).

Figure 1. Regret in Leduc 3 and Leduc 5 as a function of the number of iterations for EGT and CFR+ with various ξ perturbations (denoted in parentheses). Both axes are on a log scale.

The second set of experiments, Figure 2, investigates the improvement that our perturbation approach achieves compared to standard Nash equilibrium solutions in terms of equilibrium refinement. Our measure of refinement is the maximum regret at any information set, conditioned on reaching that information set. As discussed, convergence to a Nash equilibrium does not guarantee that this measure goes to zero. Again, the x-axis shows the number of tree traversals performed. The y-axis shows the maximum regret at any individual information set. Both unperturbed CFR+ and EGT perform badly in both games with respect to this measure of refinement. In Leduc 5, both have maximum regret two orders of magnitude worse than the perturbed approach. In Leduc 3, EGT still does as poorly. CFR+ does slightly better, but is still worse than our stronger refinements by more than an order of magnitude. The maximum regret one can possibly cause in an information set in either Leduc game is 23, so CFR+ and unperturbed EGT also do poorly in that sense. In contrast to this, we find that our ξ-perturbed solution concepts converge to a strategy with low regret at every information set. The choice of ξ is important: for ξ = 0.001, the smallest perturbation, we see that it takes a long time to converge at low-probability information sets, whereas we converge reasonably quickly for ξ = 0.01 or ξ = 0.005; for ξ = 0.1 and ξ = 0.05 the perturbations are too large, and we end up converging with relatively high regret (due to being forced to play every action with probability ξ). Thus, within this set of experiments, ξ ∈ [0.005, 0.01] seems to be the ideal amount of perturbation.

Figure 2. Maximum regret at any individual information set in Leduc 3 and Leduc 5, as a function of the number of iterations, for standard EGT as well as with various ξ perturbations (denoted EGT(ξ)) and CFR+. The y-axis is on a log scale.

9. Discussion

We extended the RM and RM+ regret minimization algorithms to finitely-generated convex polytopes, and specialized our results to linearly-constrained simplexes and behaviorally-perturbed EFGs. We then showed how this allows us to compute an approximate EFPE. Our experiments showed that this approach leads to much stronger strategies for information sets reached with low probability, while maintaining the strong convergence rate of CFR+. Our experiments raise an interesting question.
Across both games, we see that the maximum information set regret goes down much faster for larger amounts of perturbation, but then it bottoms out earlier than for smaller perturbations (as expected). To get the best of both large and small perturbations, it may be possible to decrease the perturbations over time, leading to a faster convergence rate while never bottoming out. However, this has a number of challenges associated with it. Most importantly, we need a variant of RM or RM+ that can handle a slowly expanding feasible set within the simplex. This would also require decreasing the perturbations at the correct rate; if decreased too quickly, it is unlikely that we will converge to a refinement, and if decreased too slowly, we might still bottom out.

The CFR algorithms have been shown to work well with a number of other techniques, notably sampling (Lanctot et al., 2009) and abstraction (Lanctot et al., 2012; Kroer and Sandholm, 2016). It would be both theoretically and practically interesting to see how well our refinement approach works in conjunction with these techniques.

Acknowledgments

This material is based on work supported by NSF grant IIS-1617590 and ARO award W911NF-17-1-0082.

References

Jacob Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Conference on Learning Theory (COLT), pages 27–46, 2011.
David Blackwell. An analog of the minmax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218), January 2015.
Noam Brown and Tuomas Sandholm. Regret-based pruning in extensive-form games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2015.
Noam Brown and Tuomas Sandholm. Reduced space and faster convergence in imperfect-information games via pruning. In International Conference on Machine Learning (ICML), 2017.
Noam Brown, Sam Ganzfried, and Tuomas Sandholm. Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit Texas Hold'em agent. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2015.
Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2017.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Gabriele Farina and Nicola Gatti. Extensive-form perfect equilibrium computation in two-player games. In AAAI Conference on Artificial Intelligence (AAAI), 2017.
Sam Ganzfried and Tuomas Sandholm. Game theory-based opponent modeling in large imperfect-information games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2011.
Geoffrey J. Gordon, Amy Greenwald, and Casey Marks. No-regret learning in convex games. In International Conference on Machine Learning (ICML), pages 360–367. ACM, 2008.
Geoffrey J. Gordon. No-regret algorithms for online convex programs. NIPS, 19:489, 2006.
Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.
Michael Johanson and Michael Bowling. Data biased robust counter strategies. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
Michael Johanson, Martin Zinkevich, and Michael Bowling. Computing robust counter-strategies. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.
Daphne Koller, Nimrod Megiddo, and Bernhard von Stengel. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2), 1996.
Christian Kroer and Tuomas Sandholm. Imperfect-recall abstractions with bounds in games. In Proceedings of the ACM Conference on Economics and Computation (EC), 2016.
Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.
Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Smoothing method for approximate extensive-form perfect equilibrium. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017.
Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Theoretical and practical advances on smoothing for extensive-form games. In Proceedings of the ACM Conference on Economics and Computation (EC), 2017.
Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.
Marc Lanctot, Richard Gibson, Neil Burch, Martin Zinkevich, and Michael Bowling. No-regret learning in extensive-form games with imperfect recall. In International Conference on Machine Learning (ICML), 2012.
Peter Bro Miltersen and Troels Bjerre Sørensen. Computing a quasi-perfect equilibrium of a two-player game. Economic Theory, 42(1), 2010.
Todd W. Neller and Marc Lanctot. An introduction to counterfactual regret minimization. Tutorial, 2013. Available from http://modelai.gettysburg.edu/2013/cfr/cfr.pdf.
Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1), 2005.
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103, 2005.
I. Romanovskii. Reduction of a game with complete memory to a matrix game. Soviet Mathematics, 3, 1962.
Tuomas Sandholm. The state of solving large incomplete-information games, and application to poker. AI Magazine, 2010. Special issue on Algorithmic Game Theory.
Reinhard Selten. Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 1975.
Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' Bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
Matus Telgarsky. Blackwell approachability and minimax theory. arXiv e-prints, October 2011.
Eric van Damme. A relation between perfect equilibria in extensive form games and proper equilibria in normal form games. International Journal of Game Theory, 1984.
Bernhard von Stengel.
Efficient computation of behavior strategies. Games and Economic Behavior, 14(2), 1996.
Martin Zinkevich, Michael Bowling, Michael Johanson, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.