# Regret-Based Pruning in Extensive-Form Games

Noam Brown, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15217, noamb@cmu.edu

Tuomas Sandholm, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15217, sandholm@cs.cmu.edu

## Abstract

Counterfactual Regret Minimization (CFR) is a leading algorithm for finding a Nash equilibrium in large zero-sum imperfect-information games. CFR is an iterative algorithm that repeatedly traverses the game tree, updating regrets at each information set. We introduce an improvement to CFR that prunes any path of play in the tree, and its descendants, that has negative regret. It revisits that sequence at the earliest subsequent CFR iteration where the regret could have become positive, had that path been explored on every iteration. The new algorithm maintains CFR's convergence guarantees while making iterations significantly faster, even when previously known pruning techniques are used in the comparison. This improvement carries over to CFR+, a recent variant of CFR. Experiments show an order of magnitude speed improvement, and the relative speed improvement increases with the size of the game.

## 1 Introduction

Extensive-form imperfect-information games are a general model of strategic interaction. The last ten years have witnessed a leap of several orders of magnitude in the size of two-player zero-sum extensive-form imperfect-information games that can be solved to (near-)equilibrium [11, 2, 6]. This is the game class that this paper focuses on. For small games, a linear program (LP) can find a solution (that is, a Nash equilibrium) to the game in polynomial time, even in the presence of imperfect information. However, today's leading LP solvers only scale to games with around $10^8$ nodes in the game tree [4]. Instead, iterative algorithms are used to approximate solutions for larger games. There are a variety of such iterative algorithms that are guaranteed to converge to a solution [5, 3, 10]. Among these, Counterfactual Regret Minimization (CFR) [16] has emerged as the most popular, and CFR+ as the state-of-the-art variant thereof [13, 14]. CFR begins by exploring the entire game tree (though sampling variants exist as well [9]) and calculating regret for every hypothetical situation in which the player could be.

A key improvement that makes CFR practical in large games is pruning. At a high level, pruning allows the algorithm to avoid traversing the entire game tree while still maintaining the same convergence guarantees. The classic version of pruning, which we will refer to as partial pruning, allows the algorithm to skip updates for a player in a sequence if the other player's current strategy does not reach that sequence with positive probability. This dramatically reduces the cost of each iteration. The magnitude of this reduction varies considerably depending on the game, but can easily exceed 90% [9], which improves the convergence speed of the algorithm by a factor of 10. Moreover, the benefit of partial pruning empirically seems to grow with the size of the game.

While partial pruning leads to a large gain in speed, we observe that there is still room for a much larger speed improvement. Partial pruning only skips updates for a player if an opponent's action on the path leading to that point has zero probability. This can fail to prune paths that are actually prunable.
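To make the comparison concrete, the following minimal sketch (in Python, with a hypothetical node representation and helper names, not the authors' implementation) shows where partial pruning skips work inside a recursive CFR traversal: a subtree is skipped only when the opponent's reach probability is zero.

```python
# Sketch of partial pruning inside a recursive CFR traversal.
# The Node class and the traversal interface are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    player: int                                   # 1 or 2 (chance omitted for brevity)
    actions: List[str] = field(default_factory=list)
    children: Dict[str, "Node"] = field(default_factory=dict)
    payoff: float = 0.0                           # payoff for Player 1 at terminal nodes
    strategy: Dict[str, float] = field(default_factory=dict)  # current sigma^t at this node

def cfr_traverse(node: Node, traverser: int, pi_traverser: float, pi_opponent: float) -> float:
    """Return the expected value for `traverser` conditional on reaching `node`."""
    if not node.children:                         # terminal history
        return node.payoff if traverser == 1 else -node.payoff

    # Partial pruning: if the opponent's reach probability is zero, every regret
    # update below this node would be weighted by zero, so the whole subtree can
    # be skipped for this traversal.
    if pi_opponent == 0.0:
        return 0.0

    value = 0.0
    for a in node.actions:
        p = node.strategy.get(a, 1.0 / len(node.actions))
        if node.player == traverser:
            child_value = cfr_traverse(node.children[a], traverser, pi_traverser * p, pi_opponent)
        else:
            child_value = cfr_traverse(node.children[a], traverser, pi_traverser, pi_opponent * p)
        value += p * child_value
    # (If node.player == traverser, regret and average-strategy updates would go here.)
    return value
```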
Consider a game where the first player to act (Player 1) has hundreds of actions to choose from, and where, over several iterations, the reward received from many of them is extremely poor. Intuitively, we should be able to spend less time updating the strategy for Player 1 following these poor actions, and more time on the actions that have proved worthwhile so far. However, partial pruning will continue to update Player 1's strategy following each action on every iteration.

In this paper we introduce a better version of pruning, regret-based pruning (RBP), in which CFR can avoid traversing a path in the game tree if either player takes the actions leading to that path with zero probability. This pruning must be temporary, because the probabilities may change over the course of the CFR iterations, so the reach probability may turn positive later on. The number of CFR iterations during which a sequence can be skipped depends on how poorly the sequence has performed in previous CFR iterations. More specifically, the number of iterations for which an action can be pruned is proportional to how negative the regret is for that action. We detail these topics in this paper.

RBP can lead to a dramatic improvement, depending on the game. As a rough example, consider a game in which each player has very negative regret for actions leading to 90% of nodes. Partial pruning, which skips updates for a player when the opponent does not reach the node, would traverse 10% of the game tree per iteration. In contrast, regret-based pruning, which skips updates when either player does not reach the node, would traverse only $0.1 \times 0.1 = 1\%$ of the game tree. In general, RBP roughly squares the performance gain of partial pruning.

We test RBP with CFR and CFR+. Experiments show that it leads to more than an order of magnitude speed improvement over partial pruning. The benefit increases with the size of the game.

## 2 Background

In this section we present the notation used in the rest of the paper. In an imperfect-information extensive-form game there is a finite set of players, $\mathcal{P}$. $H$ is the set of all possible histories (nodes) in the game tree, represented as sequences of actions, and includes the empty history. $A(h)$ is the set of actions available at history $h$, and $P(h) \in \mathcal{P} \cup \{c\}$ is the player who acts at that history, where $c$ denotes chance. Chance plays an action $a \in A(h)$ with a fixed probability $\sigma_c(h, a)$ that is known to all players. The history $h'$ reached after an action is taken in $h$ is a child of $h$, represented by $h \cdot a = h'$, while $h$ is the parent of $h'$. More generally, $h'$ is an ancestor of $h$ (and $h$ is a descendant of $h'$), represented by $h' \sqsubset h$, if there exists a sequence of actions leading from $h'$ to $h$. $Z \subseteq H$ is the set of terminal histories, for which no actions are available. For each player $i \in \mathcal{P}$, there is a payoff function $u_i : Z \to \mathbb{R}$. If $\mathcal{P} = \{1, 2\}$ and $u_1 = -u_2$, the game is two-player zero-sum. We define $\Delta_i = \max_{z \in Z} u_i(z) - \min_{z \in Z} u_i(z)$ and $\Delta = \max_i \Delta_i$.

Imperfect information is represented by information sets: for each player $i \in \mathcal{P}$, $\mathcal{I}_i$ is a partition of $\{h \in H : P(h) = i\}$. For any information set $I \in \mathcal{I}_i$, all histories $h, h' \in I$ are indistinguishable to player $i$, so $A(h) = A(h')$. $I(h)$ is the information set $I$ such that $h \in I$. $P(I)$ is the player $i$ such that $I \in \mathcal{I}_i$. $A(I)$ is the set of actions such that for all $h \in I$, $A(I) = A(h)$. $|A_i| = \max_{I \in \mathcal{I}_i} |A(I)|$ and $|A| = \max_i |A_i|$. We define $U(I)$ to be the maximum payoff reachable from a history in $I$, and $L(I)$ to be the minimum. That is, $U(I) = \max_{z \in Z,\, h \in I : h \sqsubseteq z} u_{P(I)}(z)$ and $L(I) = \min_{z \in Z,\, h \in I : h \sqsubseteq z} u_{P(I)}(z)$.
We define $\Delta(I) = U(I) - L(I)$ to be the range of payoffs reachable from a history in $I$. We similarly define $U(I, a)$, $L(I, a)$, and $\Delta(I, a)$ as the maximum, minimum, and range of payoffs (respectively) reachable from a history in $I$ after taking action $a$. We define $D(I, a)$ to be the set of information sets reachable by player $P(I)$ after taking action $a$. Formally, $I' \in D(I, a)$ if for some history $h \in I$ and $h' \in I'$, $h \cdot a \sqsubseteq h'$ and $P(I) = P(I')$.

A strategy $\sigma_i(I)$ is a probability vector over $A(I)$ for player $i$ at information set $I$. The probability of a particular action $a$ is denoted by $\sigma_i(I, a)$. Since all histories in an information set belonging to player $i$ are indistinguishable, the strategies at each of them must be identical. That is, for all $h \in I$, $\sigma_i(h) = \sigma_i(I)$ and $\sigma_i(h, a) = \sigma_i(I, a)$. We define $\sigma_i$ to be a probability vector for player $i$ over all available strategies $\Sigma_i$ in the game. A strategy profile $\sigma$ is a tuple of strategies, one for each player. $u_i(\sigma_i, \sigma_{-i})$ is the expected payoff for player $i$ if all players play according to the strategy profile $\langle \sigma_i, \sigma_{-i} \rangle$. If a series of strategies is played over $T$ iterations, then $\bar{\sigma}_i^T = \frac{\sum_{t \le T} \sigma_i^t}{T}$.

$\pi^\sigma(h) = \prod_{h' \cdot a \sqsubseteq h} \sigma_{P(h')}(h', a)$ is the joint probability of reaching $h$ if all players play according to $\sigma$. $\pi_i^\sigma(h)$ is the contribution of player $i$ to this probability (that is, the probability of reaching $h$ if all players other than $i$, and chance, always chose actions leading to $h$). $\pi_{-i}^\sigma(h)$ is the contribution of all players other than $i$, and chance. $\pi^\sigma(h, h')$ is the probability of reaching $h'$ given that $h$ has been reached, and $0$ if $h \not\sqsubseteq h'$. In a perfect-recall game, $\forall h, h' \in I \in \mathcal{I}_i$, $\pi_i(h) = \pi_i(h')$. In this paper we focus on perfect-recall games. Therefore, for $i = P(I)$ we define $\pi_i(I) = \pi_i(h)$ for $h \in I$. We define the average strategy $\bar{\sigma}_i^T(I)$ for an information set $I$ to be

$$\bar{\sigma}_i^T(I) = \frac{\sum_{t \le T} \pi_i^{\sigma^t}(I)\, \sigma_i^t(I)}{\sum_{t \le T} \pi_i^{\sigma^t}(I)} \qquad (1)$$

### 2.1 Nash Equilibrium

A best response to $\sigma_{-i}$ is a strategy $\sigma_i^*$ such that $u_i(\sigma_i^*, \sigma_{-i}) = \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i})$. A Nash equilibrium is a strategy profile in which every player plays a best response. Formally, it is a strategy profile $\sigma^*$ such that $\forall i$, $u_i(\sigma_i^*, \sigma_{-i}^*) = \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i}^*)$. We define a Nash equilibrium strategy for player $i$ as a strategy $\sigma_i$ that is part of some Nash equilibrium. In two-player zero-sum games, if $\sigma_i$ and $\sigma_{-i}$ are both Nash equilibrium strategies, then $\langle \sigma_i, \sigma_{-i} \rangle$ is a Nash equilibrium. An $\epsilon$-equilibrium is a strategy profile $\sigma^*$ such that $\forall i$, $u_i(\sigma_i^*, \sigma_{-i}^*) + \epsilon \ge \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i}^*)$.

### 2.2 Counterfactual Regret Minimization

Counterfactual Regret Minimization (CFR) is a popular regret-minimization algorithm for extensive-form games [16]. Our analysis of CFR makes frequent use of counterfactual value. Informally, this is the expected utility of an information set given that player $i$ tries to reach it. For player $i$ at information set $I$ given a strategy profile $\sigma$, this is defined as

$$v_i^\sigma(I) = \sum_{h \in I} \Big( \pi_{-i}^\sigma(h) \sum_{z \in Z} \pi^\sigma(h, z)\, u_i(z) \Big) \qquad (2)$$

The counterfactual value of an action $a$ is

$$v_i^\sigma(I, a) = \sum_{h \in I} \Big( \pi_{-i}^\sigma(h) \sum_{z \in Z} \pi^\sigma(h \cdot a, z)\, u_i(z) \Big) \qquad (3)$$

Let $\sigma^t$ be the strategy profile used on iteration $t$. The instantaneous regret on iteration $t$ for action $a$ at information set $I$ is

$$r^t(I, a) = v_{P(I)}^{\sigma^t}(I, a) - v_{P(I)}^{\sigma^t}(I) \qquad (4)$$

and the regret for action $a$ in $I$ on iteration $T$ is

$$R^T(I, a) = \sum_{t \le T} r^t(I, a) \qquad (5)$$

Additionally, $R_+^T(I, a) = \max\{R^T(I, a), 0\}$ and $R^T(I) = \max_a \{R_+^T(I, a)\}$. Regret for player $i$ in the entire game is

$$R_i^T = \max_{\sigma_i' \in \Sigma_i} \sum_{t \le T} \big( u_i(\sigma_i', \sigma_{-i}^t) - u_i(\sigma_i^t, \sigma_{-i}^t) \big) \qquad (6)$$

In CFR, a player at an information set picks each action with positive regret in proportion to the positive regret on that action.
Formally, on each iteration $T + 1$, player $i$ selects actions $a \in A(I)$ according to probabilities

$$\sigma_i^{T+1}(I, a) = \begin{cases} \dfrac{R_+^T(I, a)}{\sum_{a' \in A(I)} R_+^T(I, a')}, & \text{if } \sum_{a' \in A(I)} R_+^T(I, a') > 0 \\[2mm] \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases} \qquad (7)$$

If a player plays according to CFR on every iteration, then on iteration $T$, $R^T(I) \le \Delta_i \sqrt{|A_i|}\, \sqrt{T}$. Moreover,

$$R_i^T \le \sum_{I \in \mathcal{I}_i} R^T(I) \le |\mathcal{I}_i|\, \Delta_i \sqrt{|A_i|}\, \sqrt{T} \qquad (8)$$

So, as $T \to \infty$, $\frac{R_i^T}{T} \to 0$. In two-player zero-sum games, if both players' average regret satisfies $\frac{R_i^T}{T} \le \epsilon$, their average strategies $\langle \bar{\sigma}_1^T, \bar{\sigma}_2^T \rangle$ form a $2\epsilon$-equilibrium [15]. Thus, CFR constitutes an anytime algorithm for finding an $\epsilon$-Nash equilibrium in zero-sum games.

## 3 Applying Best Response to Zero-Reach Sequences

In Section 2 it was explained that if both players' average regret approaches zero, then their average strategies approach a Nash equilibrium. CFR provides one way to compute strategies that have bounded regret, but it is not the only way. CFR-BR [7] is a variant of CFR in which one player plays CFR while the other plays a best response to the opponent's strategy on every iteration. Calculating a best response to a fixed strategy is computationally cheap (in games of perfect recall), costing only a single traversal of the game tree. By playing a best response on every iteration, the best responder is guaranteed to have at most zero regret. Moreover, the CFR player's regret is still bounded according to (8). However, in practice the CFR player's regret in CFR-BR tends to be higher than when both players play vanilla CFR (since the opponent is clairvoyantly maximizing the CFR player's regret). For this reason, empirical results show that CFR-BR converges slower than CFR, even though the best responder's regret is always at most zero.

We now discuss a modification of CFR that will motivate the main contribution of this paper, which, in turn, is described in Section 4. The idea is that by applying a best response only in certain situations (and CFR in others), we can lower regret for one player without increasing it for the opponent. Without loss of generality, we discuss how to reduce regret for Player 1. Specifically, consider an information set $I \in \mathcal{I}_1$ and an action $a$ where $\sigma_1^t(I, a) = 0$, and any history $h \in I$. Then for any ancestor history $h'$ such that $h' \sqsubseteq h \cdot a$, we know $\pi_1^{\sigma^t}(h', h \cdot a) = 0$. Likewise, for any descendant history $h'$ such that $h \cdot a \sqsubseteq h'$, we know $\pi_1^{\sigma^t}(h') = 0$. Thus, from (4) we see that Player 1's strategy on iteration $t$ at any information set following action $a$ has no effect on Player 2's regret for that iteration. Moreover, it also has no effect on Player 1's regret at any information set, except for $R(I, a)$ itself and regrets at information sets that follow action $a$. Therefore, by playing a best response only in information sets following action $a$ (and playing vanilla CFR elsewhere), Player 1 guarantees zero regret for himself in all information sets following action $a$, without the practical cost of increasing his regret in information sets before $I$ or of increasing Player 2's regret. This may increase regret for action $a$ itself, but if we only do this when $R(I, a) \le -\Delta(I)$, we can guarantee $R(I, a) \le 0$ even after the iteration. Similarly, Player 2 can simultaneously play a best response in information sets following an action $a'$ where $\sigma_2^t(I', a') = 0$ for $I' \in \mathcal{I}_2$. This approach leads to lower regret for both players. (In situations where both players' sequences for reaching an information set have zero probability, that is, $\pi_1(h) = \pi_2(h) = 0$, the strategies chosen have no impact on the regret or average strategy of either player, so there is no need to compute what strategies should be played from then on.)
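The constructions in this section and the next hinge on a simple property of the regret-matching rule (7): whenever some action at an information set has positive regret, every action with non-positive regret is played with probability zero. A minimal sketch, with illustrative data structures (not the implementation used in the experiments):

```python
# Minimal sketch of the regret-matching rule (7).
from typing import Dict

def regret_matching(regrets: Dict[str, float]) -> Dict[str, float]:
    """Map cumulative regrets R^T(I, a) to the next strategy sigma^{T+1}(I, .)."""
    positive = {a: max(r, 0.0) for a, r in regrets.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: rp / total for a, rp in positive.items()}
    # If no action has positive regret, play uniformly.
    n = len(regrets)
    return {a: 1.0 / n for a in regrets}

# Example: the action with very negative regret gets probability zero, which is
# exactly the condition that the best-response technique above and RBP exploit.
print(regret_matching({"fold": -1000.0, "call": 3.0, "raise": 1.0}))
# -> {'fold': 0.0, 'call': 0.75, 'raise': 0.25}
```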
Our experiments showed that this technique leads to a dramatic improvement over CFR in terms of the number of iterations needed, though the theoretical convergence bound remains the same. However, each iteration touches more nodes (because negative-regret actions become positive more quickly and therefore are not skipped by partial pruning) and thus takes longer. It depends on the game whether CFR or this technique is faster overall; see the experiments in Appendix A. Regret-based pruning, introduced in the next section, significantly outperforms both of these approaches.

## 4 Regret-Based Pruning (RBP)

In this section we present the main contribution of this paper: a technique for soundly pruning, on a temporary basis, negative-regret actions from the tree traversal in order to speed it up significantly. In Section 3 we proposed a variant of CFR in which a player plays a best response in information sets that the player reaches with zero probability. In this section, we show that these information sets and their descendants need not be traversed on every iteration. Rather, how often they must be traversed depends on how negative the regret is for the action leading to them: the more negative the regret, the less frequently they must be traversed. This less-frequent traversal does not hurt the regret bound (8).

Consider an information set $I \in \mathcal{I}_1$ and an action $a$ where $R^t(I, a) = -1000$ and regret for at least one other action in $I$ is positive, and assume $\Delta(I) = 1$. From (7), we see that $\sigma_1^{t+1}(I, a) = 0$. As described in Section 3, the strategy played by Player 1 on iteration $t + 1$ at any information set following action $a$ has no effect on Player 2. Moreover, it has no immediate effect on what Player 1 will do in the next iteration (other than in information sets following action $a$), because we know regret for action $a$ will still be at most $-999$ on iteration $t + 2$ (since $\Delta(I) = 1$) and the action will therefore continue not to be played. So rather than traverse the game tree following action $a$, we could procrastinate in deciding what Player 1 did on iterations $t+1, t+2, \ldots, t+1000$ in that branch until after iteration $t + 1000$ (at which point regret for that action may be positive). That is, we could (in principle) store Player 2's strategy for each iteration between $t+1$ and $t+1000$, and on iteration $t+1000$ calculate a best response to each of them and announce that Player 1 played those best responses following action $a$ on iterations $t+1$ to $t+1000$ (and update the regrets to match this). Obviously this by itself would not be an improvement; performance would be identical to the algorithm described in Section 3. However, rather than have Player 1 calculate and play a best response for each iteration between $t+1$ and $t+1000$ separately, we could simply calculate a best response against the average strategy that Player 2 played in those iterations. This can be accomplished in a single traversal of the game tree. We can then announce that Player 1 played this best response on each iteration between $t+1$ and $t+1000$. This provides benefits similar to the algorithm described in Section 3, but allows us to do the work of 1000 iterations in a single traversal! We coin this regret-based pruning (RBP).

We now present a theorem that guarantees that when $R(I, a) < 0$, we can prune $D(I, a)$ through regret-based pruning for $\frac{|R(I, a)|}{U(I, a) - L(I)}$ iterations.

**Theorem 1.** Consider a two-player zero-sum game. Let $a \in A(I)$ be an action such that on iteration $T_0$, $R^{T_0}(I, a) \le 0$. Let $I'$ be an information set of either player such that $I' \notin D(I, a)$, and let $a' \in A(I')$. Let $m = \frac{|R^{T_0}(I, a)|}{U(I, a) - L(I)}$.
If $\sigma(I, a) = 0$ whenever $R(I, a) \le 0$, then regardless of what is played in $D(I, a)$ during iterations $\{T_0, \ldots, T_0 + m\}$, $R_+^T(I', a')$ is identical for all $T \le T_0 + m$.

*Proof.* Since $v_i^\sigma(I) \ge L(I)$ and $v_i^\sigma(I, a) \le U(I, a)$, from (4) we get $r^t(I, a) \le U(I, a) - L(I)$. Thus, for every iteration $T_0 \le T \le T_0 + m$, $R^T(I, a) \le 0$.

Clearly the theorem is true for $T < T_0$. We prove that it continues to hold, by induction, for $T \le T_0 + m$. Assume the theorem holds for iteration $T$ and consider iteration $T + 1$.

First suppose $I' \in \mathcal{I}_{P(I)}$ and either $I' \ne I$ or $a' \ne a$. Then for any $h' \in I'$, there is no ancestor of $h'$ in an information set in $D(I, a)$. Thus, $\pi_{-i}^{\sigma^{T+1}}(h')$ does not depend on the strategy in $D(I, a)$. Moreover, for any $z \in Z$, if $h' \sqsubseteq h'' \sqsubseteq z$ for some $h'' \in I'' \in D(I, a)$, then $\pi^{\sigma^{T+1}}(h', z) = 0$ because $\sigma^{T+1}(I, a) = 0$. Since $I' \ne I$ or $a' \ne a$, it similarly holds that $\pi^{\sigma^{T+1}}(h' \cdot a', z) = 0$. Then from (4), $r^{T+1}(I', a')$ does not depend on the strategy in $D(I, a)$.

Now suppose $I' \in \mathcal{I}_i$ for $i \ne P(I)$. Consider some $h' \in I'$ and some $h \in I$. First suppose that $h \cdot a \sqsubseteq h'$. Since $\pi_{-i}^{\sigma^{T+1}}(h \cdot a) = 0$, we have $\pi_{-i}^{\sigma^{T+1}}(h') = 0$, so $h'$ contributes nothing to the regret of $I'$. Next suppose $h' \sqsubseteq h$. Then for any $z \in Z$, if $h \cdot a \sqsubseteq z$ then $\pi^{\sigma^{T+1}}(h', z) = 0$ and so does not depend on the strategy in $D(I, a)$. Finally, suppose $h' \not\sqsubseteq h$ and $h \cdot a \not\sqsubseteq h'$. Then for any $z \in Z$ such that $h' \sqsubseteq z$, we know $h \cdot a \not\sqsubseteq z$, and therefore $\pi^{\sigma^{T+1}}(h', z)$ does not depend on the strategy in $D(I, a)$.

Now suppose $I' = I$ and $a' = a$. We proved that $R^T(I, a) \le 0$ for $T_0 \le T \le T_0 + m$, so $R_+^T(I, a) = 0$. Thus, for all $T \le T_0 + m$, $R_+^T(I', a')$ is identical regardless of what is played in $D(I, a)$. $\square$

We can improve this approach significantly by not requiring knowledge beforehand of exactly how many iterations can be skipped. Rather, we decide, in light of what happens during the intervening CFR iterations, when an action needs to be revisited. From (3) and (4) we know that $r^T(I, a) \le \pi_{-i}^{\sigma^T}(I)\, U(I, a) - v_{P(I)}^{\sigma^T}(I)$. Moreover, $v_{P(I)}^{\sigma^T}(I)$ does not depend on the strategy in $D(I, a)$. Thus, we can prune $D(I, a)$ from iteration $T_0$ until iteration $T_1$ so long as

$$\sum_{t=1}^{T_0} v_{P(I)}^{\sigma^t}(I, a) + \sum_{t=T_0+1}^{T_1} \pi_{-i}^{\sigma^t}(I)\, U(I, a) \;\le\; \sum_{t=1}^{T_1} v_{P(I)}^{\sigma^t}(I) \qquad (9)$$

In the worst case, this allows us to skip only $\frac{|R(I, a)|}{U(I, a) - L(I)}$ iterations. However, in practice it performs significantly better, though we cannot know on iteration $T_0$ how many iterations it will skip, because that depends on what is played during $T_0 \le t \le T_1$. Our exploratory experiments showed that in practice performance also improves by replacing $U(I, a)$ in (9) with a more accurate upper bound on reward. CFR will still converge if $D(I, a)$ is pruned for too many iterations; however, that hurts convergence speed. In the experiments included in this paper, we conservatively use $U(I, a)$ as the upper bound.

### 4.1 Best Response Calculation for Regret-Based Pruning

In this section we discuss how one can efficiently compute the best responses called for in regret-based pruning. The advantage of Theorem 1 is that we can wait until after pruning has finished (that is, until we revisit the action) to decide what strategies were played in $D(I, a)$ during the intervening iterations. We can then calculate a single best response to the average strategy that the opponent played over those iterations, and say that that best response was played in $D(I, a)$ on each of the intervening iterations. This results in zero regret over those iterations for information sets in $D(I, a)$. We now describe how this best response can be calculated efficiently. Typically, when playing CFR one stores $\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma_i^t(I)$ for each information set $I$. This allows one to immediately calculate the average strategy defined in (1) on any particular iteration.
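To make this bookkeeping concrete, here is a minimal sketch (class and field names are illustrative assumptions, not the authors' implementation) of a per-information-set accumulator from which the average strategy (1) can be read off at any iteration; the best-response machinery described next operates on top of quantities stored this way.

```python
# Standard per-information-set CFR bookkeeping: accumulate the reach-weighted
# strategy so the average strategy (1) is available at any iteration.
from collections import defaultdict
from typing import Dict

class InfosetStats:
    def __init__(self, actions):
        self.actions = list(actions)
        self.cumulative_regret: Dict[str, float] = defaultdict(float)    # R^T(I, a)
        self.cumulative_strategy: Dict[str, float] = defaultdict(float)  # sum_t pi_i^t(I) * sigma_i^t(I, a)

    def accumulate(self, reach_i: float, strategy: Dict[str, float]) -> None:
        """Called once per iteration with player i's reach pi_i^t(I) and sigma_i^t(I)."""
        for a in self.actions:
            self.cumulative_strategy[a] += reach_i * strategy[a]

    def average_strategy(self) -> Dict[str, float]:
        """Average strategy per equation (1); uniform if the infoset was never reached."""
        total = sum(self.cumulative_strategy[a] for a in self.actions)
        if total == 0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: self.cumulative_strategy[a] / total for a in self.actions}
```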
If we start pruning on iteration $T_0$ and revisit the action on iteration $T_1$, we wish to calculate a best response to the opponent's average strategy over those iterations, $\bar{\sigma}_{-i}^{T_0:T_1}$, where

$$\bar{\sigma}_{-i}^{T_0:T_1}(I) = \frac{\sum_{t=T_0}^{T_1} \pi_{-i}^{\sigma^t}(I)\, \sigma_{-i}^t(I)}{\sum_{t=T_0}^{T_1} \pi_{-i}^{\sigma^t}(I)}$$

An easy approach would be to store the opponent's cumulative strategy before pruning begins and subtract it from the current cumulative strategy when pruning ends. In fact, we would only need to store the opponent's strategy in information sets that follow action $a$. However, this could potentially require $O(|H|)$ memory, because the same information set belonging to Player 2 may be reached from multiple information sets belonging to Player 1. In contrast, CFR only requires $O(|\mathcal{I}||A|)$ memory, and we want to maintain this desirable property. We accomplish that as follows.

To calculate a best response against $\bar{\sigma}_2^T$, we traverse the game tree and calculate the counterfactual value, defined in (3), of every action at every information set belonging to Player 1 that does not lead to any further Player 1 information sets. Specifically, we calculate $v_1^{\bar{\sigma}^T}(I, a)$ for every action $a$ in $I$ such that $D(I, a) = \emptyset$. Since we calculate this only for actions where $D(I, a) = \emptyset$, $v_1^{\bar{\sigma}^T}(I, a)$ does not depend on $\sigma_1$. Then, starting from the bottom information sets, we set the best-response strategy $\sigma_1^{BR}(I)$ to always play the action with the highest counterfactual value (ties can be broken arbitrarily), and we pass this value up as the payoff for reaching $I$, repeating the process up the tree.

In order to calculate a best response to $\bar{\sigma}_2^{T_0:T_1}$, we first store, before pruning begins, the counterfactual values for Player 1 against Player 2's average strategy for every action $a$ in each information set $I$ where $D(I, a) = \emptyset$. When we revisit the action on iteration $T_1$, we calculate a best response to $\bar{\sigma}_2^{T_1}$, except that we set the counterfactual value of every action $a$ in an information set $I$ where $D(I, a) = \emptyset$ to be $T_1\, v_1^{\bar{\sigma}^{T_1}}(I, a) - (T_0 - 1)\, v_1^{\bar{\sigma}^{T_0-1}}(I, a)$. The latter term was stored, and the former term can be calculated from the current average strategy profile. As before, we set $\sigma_1^{BR}(I)$ to always play whichever action has the highest counterfactual value, and pass this value up.

A slight complication arises when we are pruning an action $a$ in information set $I$ and wish to start pruning an earlier action $a'$ in an information set $I'$ such that $I \in D(I', a')$. In this case, it is necessary to explore action $a$ in order to calculate the best response in $D(I', a')$. However, if such traversals happen frequently, then this would defeat the purpose of pruning action $a$. One way to address this is to only prune an action when the number of iterations guaranteed (or estimated) to be skipped exceeds some threshold. This ensures that the overhead is worthwhile, and that we are not frequently traversing an action farther down the tree that is already being pruned. Another option is to add an upper bound on how long we will prune an action. If the lower bound on how long we will prune $a$ exceeds the upper bound on how long we will prune $a'$, then we need not traverse $a$ in the best-response calculation for $a'$, because $a$ will still be pruned when we are finished pruning $a'$. In our experiments, we use the former approach. Experiments to determine a good value for this threshold are presented in Appendix B.

### 4.2 Regret-Based Pruning with CFR+

CFR+ [13] is a variant of CFR in which regret is never allowed to drop below zero. Formally, $R^T(I, a) = \max\{R^{T-1}(I, a) + r^T(I, a),\, 0\}$ for $T \ge 1$, and $R^T(I, a) = 0$ for $T = 0$.
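As a concrete illustration, the following minimal sketch (function names are ours, not from the paper) contrasts the vanilla CFR update, the CFR+ floor just defined, and the relaxed rule introduced below, which lets regret go negative so that RBP can be applied while producing the same strategies as CFR+ (regret matching uses only the positive part of regret).

```python
# Cumulative-regret update rules discussed in this subsection.

def update_cfr(prev_regret: float, instant_regret: float) -> float:
    # Vanilla CFR: R^T = R^{T-1} + r^T (may become arbitrarily negative).
    return prev_regret + instant_regret

def update_cfr_plus(prev_regret: float, instant_regret: float) -> float:
    # CFR+: R^T = max{R^{T-1} + r^T, 0} (never drops below zero).
    return max(prev_regret + instant_regret, 0.0)

def update_cfr_plus_rbp(prev_regret: float, instant_regret: float) -> float:
    # RBP-compatible CFR+: regret may go negative, but snaps back to r^T as soon
    # as positive instantaneous regret arrives from a non-positive total.
    if instant_regret > 0 and prev_regret <= 0:
        return instant_regret
    return prev_regret + instant_regret
```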
Although this change appears small, and does not improve the bound on regret, it leads to faster empirical convergence. CFR+ was a key advancement that allowed limit Texas Hold'em poker to be essentially solved [1]. At first glance, it would seem that CFR+ and RBP are incompatible: RBP allows actions to be traversed with decreasing frequency as regret decreases below zero, whereas CFR+ sets a floor for regret at zero. Nevertheless, it is possible to combine the two, as we now show. We modify the definition of regret in CFR+ so that it can drop below zero but immediately returns to being positive as soon as regret begins increasing. Formally, we modify the definition of regret in CFR+ for $T > 0$ to be

$$R^T(I, a) = \begin{cases} r^T(I, a), & \text{if } r^T(I, a) > 0 \text{ and } R^{T-1}(I, a) \le 0 \\ R^{T-1}(I, a) + r^T(I, a), & \text{otherwise} \end{cases}$$

This leads to identical behavior in CFR+, and also allows regret to drop below zero so that actions can be pruned.

When using RBP with CFR+, regret does not strictly follow the rules of CFR+: CFR+ calls for an action to be played with positive probability whenever its instantaneous regret was positive on the previous iteration. Since RBP only checks the regret for an action after potentially several iterations have been skipped, there may be a delay between the iteration on which an action would return to play in CFR+ and the iteration on which it returns to play under RBP. This does not pose a theoretical problem: CFR's convergence rate still applies. However, the difference is noticeable when combined with linear averaging. Linear averaging weighs each iteration $\sigma^t$ in the average strategy by $t$. It does not affect regret or influence the selection of strategies on an iteration. That is, with linear averaging the definition of the average strategy becomes

$$\bar{\sigma}_i^T(I) = \frac{\sum_{t \le T} t\, \pi_i^{\sigma^t}(I)\, \sigma_i^t(I)}{\sum_{t \le T} t\, \pi_i^{\sigma^t}(I)}$$

Linear averaging still maintains the asymptotic convergence rate of constant averaging (where each iteration is weighed equally) in CFR+ [14], and empirically it causes CFR+ to converge to a Nash equilibrium much faster. However, in vanilla CFR it results in worse performance, and there is no proof guaranteeing convergence. Since RBP with CFR+ results in behavior that does not strictly conform to CFR+, linear averaging results in somewhat noisier convergence. This can be mitigated by reporting the strategy profile found so far that is closest to a Nash equilibrium, rather than the current average strategy profile, and we do this in the experiments.

## 5 Experiments

We tested regret-based pruning in both CFR and CFR+ against partial pruning, as well as against CFR with no pruning. Our implementation traverses the game tree once each iteration.¹ We tested our algorithm on standard Leduc Hold'em [12] and a scaled-up variant of it featuring more actions. Leduc Hold'em is a popular benchmark problem for imperfect-information game solving due to its size (large enough to be highly nontrivial but small enough to be solvable) and strategic complexity. In Leduc Hold'em, there is a deck consisting of six cards: two each of Jack, Queen, and King. There are two rounds. In the first round, each player places an ante of 1 chip in the pot and receives a single private card. A round of betting then takes place with a two-bet maximum, with Player 1 going first. A public shared card is then dealt face up and another round of betting takes place. Again, Player 1 goes first, and there is a two-bet maximum. If one of the players has a pair with the public card, that player wins.
Otherwise, the player with the higher card wins. In standard Leduc Hold'em, the bet size in the first round is 2 chips, and 4 chips in the second round. In our scaled-up variant, which we call Leduc-5, there are 5 bet sizes to choose from: in the first round a player may bet 0.5, 1, 2, 4, or 8 chips, while in the second round a player may bet 1, 2, 4, 8, or 16 chips.

We measure the quality of a strategy profile by its exploitability, which is the sum of both players' $\epsilon$ distances from a Nash equilibrium strategy. Formally, the exploitability of a strategy profile $\sigma$ is $\max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) + \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$. We measure exploitability against the number of nodes touched over all CFR traversals.

As shown in Figure 1, RBP leads to a substantial improvement over vanilla CFR with partial pruning in Leduc Hold'em, increasing the speed of convergence by more than a factor of 8. This is partially due to the game tree being traversed twice as fast, and partially due to the use of a best response in sequences that are pruned (the benefit of which was described in Section 3). The improvement when added on top of CFR+ is smaller, increasing the speed of convergence by about a factor of 2. This matches the reduction in game-tree traversal size. The benefit from RBP is more substantial in the larger benchmark game, Leduc-5. RBP increases the convergence speed of CFR by a factor of 12, and reduces the per-iteration game-tree traversal cost by about a factor of 7. In CFR+, RBP improves the rate of convergence by about an order of magnitude. RBP also decreases the number of nodes touched per iteration in CFR+ by about a factor of 40.

¹ Canonical CFR+ traverses the game tree twice each iteration, updating the regrets for each player in separate traversals [13]. This difference does not, however, affect the error measure (y-axis) in the experiments.

Figure 1: Exploitability (top) and nodes touched per iteration (bottom) for (a) Leduc Hold'em and (b) Leduc-5 Hold'em.

The results imply that larger games benefit more from RBP than smaller games. This is not universally true, since it is possible to have a large game in which every action is part of the Nash equilibrium. Nevertheless, there are many games with very large action spaces in which the vast majority of those actions are suboptimal, but players do not know beforehand which ones are. In such games, RBP would improve convergence tremendously.

## 6 Conclusions and Future Research

In this paper we introduced a new method of pruning that allows CFR to avoid traversing negative-regret actions on every iteration. Our regret-based pruning (RBP) temporarily ceases their traversal in a sound way, without compromising the overall convergence rate. Experiments show an order of magnitude speed improvement over partial pruning, and suggest that the benefit of RBP increases with game size. Thus RBP is particularly useful in large games where many actions are suboptimal, but where it is not known beforehand which actions those are.

In future research, it would be worth examining whether similar forms of pruning can be applied to other equilibrium-finding algorithms as well. RBP, as presented in this paper, applies to CFR with regret matching as the rule that maps regrets to the strategies used on each iteration. RBP does not directly apply to other strategy-selection techniques that could be used within CFR, such as exponential weights, because the latter always puts positive probability on every action.
Also, it would be interesting to see whether RBP-like pruning could be applied to first-order methods for equilibrium finding [5, 3, 10, 8]. The results in this paper suggest that, for any equilibrium-finding algorithm to be efficient in large games, effective pruning is essential.

### 6.1 Acknowledgements

This material is based on work supported by the National Science Foundation under grants IIS-1320620 and IIS-1546752, as well as XSEDE computing resources provided by the Pittsburgh Supercomputing Center.

## References

[1] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145-149, 2015.

[2] Noam Brown, Sam Ganzfried, and Tuomas Sandholm. Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit Texas Hold'em agent. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2015.

[3] Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with O(ln(1/ϵ)) convergence for ϵ-equilibrium in two-person zero-sum games. Mathematical Programming, 133(1-2):279-298, 2012. Conference version appeared in AAAI-08.

[4] Andrew Gilpin and Tuomas Sandholm. Lossless abstraction of imperfect information games. Journal of the ACM, 54(5), 2007. Early version "Finding equilibria in large sequential games of imperfect information" appeared in the Proceedings of the ACM Conference on Electronic Commerce (EC), pages 160-169, 2006.

[5] Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494-512, 2010. Conference version appeared in WINE-07.

[6] Eric Griffin Jackson. A time and space efficient algorithm for approximately solving large imperfect information games. In AAAI Workshop on Computer Poker and Imperfect Information, 2014.

[7] Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract strategies in extensive-form games. In AAAI Conference on Artificial Intelligence (AAAI), 2012.

[8] Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.

[9] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1078-1086, 2009.

[10] François Pays. An interior point approach to large games of incomplete information. In AAAI Computer Poker Workshop, 2014.

[11] Tuomas Sandholm. The state of solving large incomplete-information games, and application to poker. AI Magazine, pages 13-32, Winter 2010. Special issue on Algorithmic Game Theory.

[12] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' Bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 550-558, July 2005.

[13] Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.

[14] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas Hold'em. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2015.
[15] Kevin Waugh, David Schnizlein, Michael Bowling, and Duane Szafron. Abstraction pathologies in extensive games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2009.

[16] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.