# Stochastic Regret Minimization in Extensive-Form Games

Gabriele Farina¹, Christian Kroer², Tuomas Sandholm¹ ³ ⁴ ⁵

## Abstract

Monte-Carlo counterfactual regret minimization (MCCFR) is the state-of-the-art algorithm for solving sequential games that are too large for full tree traversals. It works by using gradient estimates that can be computed via sampling. However, stochastic methods for sequential games have not been investigated extensively beyond MCCFR. In this paper we develop a new framework for developing stochastic regret minimization methods. This framework allows us to use any regret-minimization algorithm, coupled with any gradient estimator. The MCCFR algorithm can be analyzed as a special case of our framework, and this analysis leads to significantly stronger theoretical guarantees on convergence, while simultaneously yielding a simplified proof. Our framework allows us to instantiate several new stochastic methods for solving sequential games. We show extensive experiments on five games, where some variants of our methods outperform MCCFR.

¹Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213. ²IEOR Department, Columbia University, New York, NY 10027. ³Strategic Machine, Inc. ⁴Strategy Robot, Inc. ⁵Optimized Markets, Inc. Correspondence to: Gabriele Farina, Christian Kroer, Tuomas Sandholm.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

## 1. Introduction

Extensive-form games (EFGs) are a broad class of games that can model sequential and simultaneous moves, outcome uncertainty, and imperfect information. This includes real-world settings such as negotiation, sequential auctions, security games (Lisý et al., 2016; Munoz de Cote et al., 2013), cybersecurity games (DeBruhl et al., 2014; Chen et al., 2018), recreational games such as poker (Sandholm, 2010) and billiards (Archibald & Shoham, 2009), and medical treatment (Chen & Bowling, 2012; Sandholm, 2015).

Typically, EFG models are operationalized by computing either a Nash equilibrium of the game, or an approximate Nash equilibrium if the game is large. Approximate Nash equilibrium of zero-sum EFGs has been the underlying idea of several recent AI milestones, in which strong AIs for two-player poker were created (Moravčík et al., 2017; Brown & Sandholm, 2017b). In principle, a zero-sum EFG can be solved in polynomial time using a linear program whose size is linear in the size of the game tree (von Stengel, 1996). However, for most real-world games this linear program is much too large to solve, either because it does not fit in memory, or because iterations of the simplex algorithm or interior-point methods become prohibitively expensive due to matrix inversion. Instead, first-order methods (Hoda et al., 2010; Kroer et al., 2020) or regret-based methods (Zinkevich et al., 2007; Tammelin et al., 2015; Brown & Sandholm, 2019a) are used in practice. These methods work by keeping only one or two strategies around for each player (typically the size of a strategy is much smaller than the size of the game tree). The game tree is then only accessed for computing gradients, which can be done via a single tree traversal (often without storing the tree), and sometimes game-specific structure can be exploited to speed this up further (Johanson et al., 2011). Finally, these gradients are used to update the strategy iterates.
However, for large games, even these gradient-based methods, which require traversing the entire game tree (we refer to them as deterministic methods), are prohibitively expensive. This was seen in two recent superhuman poker AIs: Libratus (Brown & Sandholm, 2017b) and Pluribus (Brown & Sandholm, 2019b). Both AIs were generated in a two-stage manner: an offline blueprint strategy was computed, and then refinements to the blueprint solution were computed online while actually playing against human opponents. The online solutions were computed using deterministic methods (since those subgames are significantly smaller than the entire game). However, the original blueprint strategies had to be computed without traversing the entire game tree, as this game tree is far too large for even a moderate number of traversals.

When full tree traversals are too expensive, stochastic methods can be used to compute approximate gradients instead. The most common stochastic method for solving large EFGs is the Monte-Carlo Counterfactual Regret Minimization (MCCFR) algorithm (Lanctot et al., 2009). This algorithm, enhanced with certain dynamic pruning techniques, was also used to compute the blueprint strategies in the above-mentioned superhuman poker milestones (Brown & Sandholm, 2015; 2017a; 2017b; 2019b; Brown et al., 2017). MCCFR combines the CFR algorithm (Zinkevich et al., 2007) with certain stochastic gradient estimators. Follow-up papers have been written on MCCFR, investigating various methods for improving the sampling schemes used in estimating gradients, and so on (Gibson et al., 2012; Schmid et al., 2019). However, beyond the MCCFR setting, stochastic methods have not been studied extensively for solving EFGs.¹

¹ Kroer et al. (2015) study the stochastic mirror prox algorithm for EFGs, but it is not the primary focus of that paper, and seems to be more of a preliminary investigation.

In this paper we develop a general framework for constructing stochastic regret-minimization methods for solving EFGs. In particular, we introduce a way to combine any regret-minimizing algorithm with any gradient estimator, and obtain high-probability bounds on the performance of the resulting combined algorithm. As a first application of our approach, we show that with probability $1 - p$, the regret in MCCFR is at most $O(\sqrt{\log(1/p)})$ worse than that of CFR, an exponential improvement over the bound $O(\sqrt{1/p})$ previously known in the literature. Second, our approach enables us to develop a slew of other stochastic methods for solving EFGs. As an example of our framework, we show how each of two popular online convex optimization algorithms, follow-the-regularized-leader (FTRL) and online mirror descent (OMD), can be used to obtain stochastic EFG-solving algorithms with these guarantees. We then provide extensive numerical simulations on four diverse games, showing that it is possible to beat MCCFR in several of the games using our new methods. Because of the flexibility and modularity of our approach, it paves the way for many potential future investigations into stochastic methods for EFGs, either via better gradient estimators, via better deterministic regret minimization methods that can now be converted into stochastic methods, or both.

## 2. Preliminaries

### 2.1. Two-Player Zero-Sum Extensive-Form Games

In this subsection we introduce the notation that we will use in the rest of the paper when dealing with two-player zero-sum extensive-form games. An extensive-form game is played on a tree rooted at a node $r$. Each node $v$ in the tree belongs to a player from the set $\{1, 2, c\}$, where $c$ is called the chance player.
The chance player plays actions from a fixed distribution known to Players 1 and 2, and it is used as a device to model stochastic events such as drawing a random card from a deck. We denote the set of actions available at node $v$ by $A_v$. Each action corresponds to an outgoing edge from $v$. Given $a \in A_v$, we let $\rho(v, a)$ denote the node that is reached by following the edge corresponding to action $a$ at node $v$. Nodes $v$ such that $A_v = \emptyset$ are called leaves and represent terminal states of the game. We denote by $Z$ the set of leaves of the game. Associated with each leaf $z \in Z$ is a pair $(u_1(z), u_2(z)) \in \mathbb{R}^2$ of payoffs for Players 1 and 2, respectively. We denote by $\Delta$ the payoff range of the game, that is, the value $\Delta := \max_{z \in Z} \max\{u_1(z), u_2(z)\} - \min_{z \in Z} \min\{u_1(z), u_2(z)\}$. In this paper we are concerned with zero-sum extensive-form games, that is, games in which $u_1(z) = -u_2(z)$ for all $z \in Z$.

To model private information, the set of all nodes of each player $i \in \{1, 2, c\}$ is partitioned into a collection $\mathcal{I}_i$ of non-empty sets, called information sets. Each information set $I \in \mathcal{I}_i$ contains nodes that Player $i$ cannot distinguish among. In this paper, we will only consider perfect-recall games, that is, games in which no player forgets what he or she observed or knew earlier. Necessarily, if two nodes $u$ and $v$ belong to the same information set $I$, the sets of actions $A_u$ and $A_v$ must be the same (or the player would be able to tell $u$ and $v$ apart). So, we denote by $A_I$ the set of actions of any node in $I$.

**Sequences.** The set of sequences for Player $i$, denoted $\Sigma_i$, is defined as the set of all possible information set-action pairs, plus a special element called the empty sequence and denoted $\varnothing$. Formally, $\Sigma_i := \{(I, a) : I \in \mathcal{I}_i, a \in A_I\} \cup \{\varnothing\}$. Given a node $v$ for Player $i$, we denote by $\sigma_i(v)$ the last information set-action pair of Player $i$ encountered on the path from the root to node $v$; if the player does not act before $v$, $\sigma_i(v) = \varnothing$. It is known that in perfect-recall games $\sigma_i(u) = \sigma_i(v)$ for any two nodes $u, v$ in the same information set $I$. For this reason, for each information set $I$ we define $\sigma_i(I)$ to equal $\sigma_i(v)$ for any $v \in I$.

**Sequence-Form Strategies.** A strategy for Player $i \in \{1, 2, c\}$ is an assignment of a probability distribution over the set of actions $A_I$ to each information set $I$ that belongs to Player $i$. In this paper, we represent strategies using their sequence-form representation (Romanovskii, 1962; Koller et al., 1996; von Stengel, 1996). A sequence-form strategy for Player $i$ is a non-negative vector $z$ indexed over the set of sequences $\Sigma_i$ of that player. For each $\sigma = (I, a) \in \Sigma_i$, the entry $z[\sigma]$ contains the product of the probabilities of all the actions that Player $i$ takes on the path from the root of the game tree down to action $a$ at information set $I$, included. In order for these probabilities to be consistent, it is necessary and sufficient that $z[\varnothing] = 1$ and
$$\sum_{a \in A_I} z[(I, a)] = z[\sigma_i(I)] \qquad \forall\, I \in \mathcal{I}_i.$$
A strategy such that exactly one action is selected with probability 1 at each node is called a pure strategy.
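To make the consistency constraints above concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the dictionary-based data structures are hypothetical) that checks whether a vector indexed by sequences is a valid sequence-form strategy.

```python
# Sequence-form consistency check (illustrative sketch). A strategy is a dict
# mapping sequences (info_set, action) -- plus the empty sequence, encoded as
# None -- to reach probabilities. `parent_seq[I]` is sigma_i(I), the last
# sequence of the same player on the path to information set I.

def is_sequence_form(strategy, actions_at, parent_seq, tol=1e-9):
    """Check z[empty] = 1 and sum_a z[(I, a)] = z[sigma_i(I)] for every infoset I."""
    if abs(strategy[None] - 1.0) > tol:              # z[empty sequence] = 1
        return False
    for I, actions in actions_at.items():            # consistency at every infoset
        mass = sum(strategy[(I, a)] for a in actions)
        if abs(mass - strategy[parent_seq[I]]) > tol:
            return False
    return all(p >= -tol for p in strategy.values())  # non-negativity

# Toy example: infoset "I0" at the root with actions "a", "b", and infoset "I1"
# reachable only after playing "a" at "I0".
z = {None: 1.0, ("I0", "a"): 0.6, ("I0", "b"): 0.4,
     ("I1", "c"): 0.3, ("I1", "d"): 0.3}
actions_at = {"I0": ["a", "b"], "I1": ["c", "d"]}
parent_seq = {"I0": None, "I1": ("I0", "a")}
assert is_sequence_form(z, actions_at, parent_seq)
```

Whenever the denominator is positive, the behavioral probability of action $a$ at information set $I$ can be recovered from such a vector as $z[(I,a)] / z[\sigma_i(I)]$.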
We denote by $X$ and $Y$ the sets of all sequence-form strategies of Player 1 and Player 2, respectively. We denote by $c$ the fixed sequence-form strategy of the chance player. For any leaf $z \in Z$, the probability that the game ends in $z$ is the product of the probabilities of all the actions on the path from the root to $z$. Because of the definition of sequence-form strategies, when Players 1 and 2 play according to strategies $x \in X$ and $y \in Y$, respectively, this probability is equal to $x[\sigma_1(z)]\, y[\sigma_2(z)]\, c[\sigma_c(z)]$. So, Player 2's expected utility is computed via the trilinear map
$$u_2(x, y, c) := \sum_{z \in Z} u_2(z)\, x[\sigma_1(z)]\, y[\sigma_2(z)]\, c[\sigma_c(z)]. \tag{1}$$
Since the strategy of the chance player is fixed, the above expression is bilinear in $x$ and $y$ and therefore can be expressed more concisely as $u_2(x, y) = x^\top A_2\, y$, where $A_2$ is called the sequence-form payoff matrix of Player 2.

### 2.2. Regret Minimization

In this section we present the regret minimization algorithms that we will work with. We will operate within the framework of online convex optimization (Zinkevich, 2003). In this setting, a decision maker repeatedly makes decisions $z^1, z^2, \dots$ from some convex compact set $\mathcal{Z} \subseteq \mathbb{R}^n$. After each decision $z^t$ at time $t$, the decision maker faces a linear loss $z^t \mapsto (\ell^t)^\top z^t$, where $\ell^t$ is a gradient vector in $\mathbb{R}^n$. Given $\hat z \in \mathcal{Z}$, the regret compared to $\hat z$ of the regret minimizer up to time $T$, denoted $R^T(\hat z)$, measures the difference between the loss cumulated by the sequence of output decisions $z^1, \dots, z^T$ and the loss that would have been cumulated by playing the fixed, time-independent decision $\hat z \in \mathcal{Z}$. In symbols,
$$R^T(\hat z) := \sum_{t=1}^T (\ell^t)^\top (z^t - \hat z).$$
A good regret minimizer is such that the regret compared to any $\hat z \in \mathcal{Z}$ grows sublinearly in $T$.

The two algorithms beyond MCCFR that we consider assume access to a distance-generating function $d : \mathcal{Z} \to \mathbb{R}$, which is 1-strongly convex (with respect to some norm $\|\cdot\|$) and continuously differentiable on the interior of $\mathcal{Z}$. Furthermore, $d$ should be such that the gradient of the convex conjugate, $\nabla d^*(g) = \arg\max_{z \in \mathcal{Z}} \langle g, z\rangle - d(z)$, is easy to compute. From $d$ we also construct the Bregman divergence $D(z \,\|\, z') := d(z) - d(z') - \langle \nabla d(z'),\, z - z'\rangle$.

We will use the following two classical regret minimization algorithms as examples that can be used in the framework that we introduce in this paper. The online mirror descent (OMD) algorithm produces iterates according to the rule
$$z^{t+1} = \arg\min_{z \in \mathcal{Z}} \left\{ \langle \ell^t, z\rangle + \frac{1}{\eta}\, D(z \,\|\, z^t) \right\}. \tag{2}$$
The follow-the-regularized-leader (FTRL) algorithm produces iterates according to the rule (Shalev-Shwartz & Singer, 2007)
$$z^{t+1} = \arg\min_{z \in \mathcal{Z}} \left\{ \left\langle \sum_{\tau=1}^{t} \ell^\tau,\, z \right\rangle + \frac{1}{\eta}\, d(z) \right\}. \tag{3}$$
OMD and FTRL satisfy regret bounds of the form $\max_{\hat z \in \mathcal{Z}} R^T(\hat z) \le 2L\sqrt{D(\hat z \,\|\, z^1)\, T}$, where $L$ is an upper bound on $\max_{x \in \mathbb{R}^n \setminus \{0\}} (\ell^t)^\top x / \|x\|$ (that is, on the dual norm of $\ell^t$) for all $t$, and $\|\cdot\|$ is the norm with respect to which we measure the strong convexity of $d$ (see, e.g., Orabona (2019)).

### 2.3. Equilibrium Finding in Extensive-Form Games using Regret Minimization

It is known that in a two-player zero-sum extensive-form game, a Nash equilibrium (NE) is a solution to the bilinear saddle-point problem
$$\min_{\hat x \in X}\, \max_{\hat y \in Y}\; \hat x^\top A_2\, \hat y.$$
Given a pair $(x, y) \in X \times Y$ of sequence-form strategies for Players 1 and 2, respectively, the saddle-point gap
$$\xi(x, y) := \max_{\hat y \in Y}\{x^\top A_2\, \hat y\} - \min_{\hat x \in X}\{\hat x^\top A_2\, y\}$$
measures how far the pair is from being a Nash equilibrium. In particular, $(x, y)$ is a Nash equilibrium if and only if $\xi(x, y) = 0$.

Regret minimizers can be used to find a sequence of points $(x^t, y^t)$ whose saddle-point gap converges to 0. The fundamental idea is to instantiate two regret minimizers $\mathcal{R}_1$ and $\mathcal{R}_2$ for the sets $X$ and $Y$, respectively, and let them respond to each other in a self-play fashion using a particular choice of loss vectors (see Figure 1).

*Figure 1. Self-play method for computing NE in EFGs.*

At each time $t$, the strategies $x^t$ and $y^t$ output by the regret minimizers are used to compute the loss vectors
$$\ell^t_1 := A_2\, y^t, \qquad \ell^t_2 := -A_2^\top x^t. \tag{4}$$
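As a concrete illustration of the self-play scheme of Figure 1 with the loss vectors in (4), here is a minimal Python sketch (not the paper's implementation; the `recommend`/`observe_loss` interface of the regret minimizers is a hypothetical abstraction).

```python
import numpy as np

# Minimal sketch of the self-play scheme of Figure 1 (hypothetical interface).
# `rm1` and `rm2` stand for any regret minimizers over the sequence-form
# polytopes X and Y; A2 is the sequence-form payoff matrix of Player 2.

def self_play(rm1, rm2, A2, T):
    """Run T rounds of self-play and return the averaged strategies (x_bar, y_bar)."""
    x_sum = np.zeros(A2.shape[0])
    y_sum = np.zeros(A2.shape[1])
    for _ in range(T):
        x_t = rm1.recommend()           # next sequence-form strategy x^t in X
        y_t = rm2.recommend()           # next sequence-form strategy y^t in Y
        rm1.observe_loss(A2 @ y_t)      # l_1^t =  A2 y^t, as in (4)
        rm2.observe_loss(-A2.T @ x_t)   # l_2^t = -A2^T x^t, as in (4)
        x_sum += x_t
        y_sum += y_t
    return x_sum / T, y_sum / T
```

Any regret minimizer exposing such an interface, including the CFR, OMD, and FTRL instantiations discussed in this paper, can be plugged into this loop unchanged; the stochastic variants of Section 3 simply replace the two loss vectors with unbiased estimates.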
Let $\bar x$ and $\bar y$ be the averages of the strategies output by $\mathcal{R}_1$ and $\mathcal{R}_2$, respectively, up to time $T$. Furthermore, let $\bar R^T_1 := \max_{\hat x \in X} R^T_1(\hat x)$ and $\bar R^T_2 := \max_{\hat y \in Y} R^T_2(\hat y)$ be the maximum regret cumulated by $\mathcal{R}_1$ and $\mathcal{R}_2$ against any sequence-form strategy in $X$ and $Y$, respectively. A well-known folk lemma asserts that $\xi(\bar x, \bar y) \le (\bar R^T_1 + \bar R^T_2)/T$. So, if $\mathcal{R}_1$ and $\mathcal{R}_2$ have regret that grows sublinearly, then the strategy profile $(\bar x, \bar y)$ converges to a saddle point.

## 3. Stochastic Regret Minimization for Extensive-Form Games

In this section we provide some key analytical tools to understand the performance of regret minimization algorithms when gradient estimates are used instead of exact gradient vectors. The results in this section are complemented by those of Section 4, where we introduce computationally cheap gradient estimators for the purpose of equilibrium finding in extensive-form games.

### 3.1. Regret Guarantees when Gradient Estimators are Used

We start by studying how much the guarantee on the regret degrades when gradient estimators are used instead of exact gradient vectors. Our analysis need not assume that we operate over extensive-form strategy spaces, so we present our results in full generality.

Let $\mathcal{R}$ be a deterministic regret minimizer over a convex and compact set $\mathcal{Z}$, and consider a second regret minimizer $\tilde{\mathcal{R}}$ over the same set $\mathcal{Z}$ that is implemented starting from $\mathcal{R}$ as in Figure 2. In particular, at all times $t$, $\tilde{\mathcal{R}}$ queries the next decision $z^t$ of $\mathcal{R}$ and outputs it; each gradient vector $\ell^t$ received by $\tilde{\mathcal{R}}$ is used to compute a gradient estimate $\tilde\ell^t$ such that
$$\mathbb{E}_t[\tilde\ell^t] := \mathbb{E}[\tilde\ell^t \mid \ell^1, \dots, \ell^{t-1}] = \ell^t$$
(that is, the estimate is unbiased). The internal regret minimizer $\mathcal{R}$ is then shown $\tilde\ell^t$ instead of $\ell^t$.

*Figure 2. Abstract regret minimizer $\tilde{\mathcal{R}}$ considered in Section 3.1.*

The regret minimizer $\tilde{\mathcal{R}}$ is a purely conceptual construction. We introduce $\tilde{\mathcal{R}}$ in order to compare the regret incurred by $\tilde{\mathcal{R}}$ to that incurred by $\mathcal{R}$. This will allow us to quantify the degradation in regret that is incurred when the gradient vectors are estimated instead of exact. In practice, it is not necessary to explicitly construct $\tilde{\mathcal{R}}$ and fully observe the gradient vectors $\ell^t$ in order to compute the estimates $\tilde\ell^t$. Examples of cheap gradient estimators for extensive-form games are given in Section 4.

When the estimate of the gradient is very accurate (for instance, it has low variance), it is reasonable to expect that the regret $\tilde R^T$ incurred by $\tilde{\mathcal{R}}$ up to any time $T$ is roughly equal to the regret $R^T$ incurred by $\mathcal{R}$, plus some degradation term that depends on the error of the estimates. We can quantify this relationship by fixing an arbitrary $u \in \mathcal{Z}$ and introducing the discrete-time stochastic process
$$d^t := (\ell^t)^\top (z^t - u) - (\tilde\ell^t)^\top (z^t - u). \tag{5}$$
Since by hypothesis $\mathbb{E}_t[\tilde\ell^t] = \ell^t$ and $\mathcal{R}$ is a deterministic regret minimizer, $\mathbb{E}_t[d^t] = 0$, and so $\{d^t\}$ is a martingale difference sequence; note that summing over $t$ gives $\sum_{t=1}^T d^t = \tilde R^T(u) - R^T(u)$. This martingale difference sequence is well known, especially in the context of bandit regret minimization (Abernethy & Rakhlin, 2009; Bartlett et al., 2008). Using the Azuma-Hoeffding concentration inequality (Hoeffding, 1963; Azuma, 1967), we can prove the following.
**Proposition 1.** *Let $M$ and $\tilde M$ be positive constants such that $|(\ell^t)^\top (z - z')| \le M$ and $|(\tilde\ell^t)^\top (z - z')| \le \tilde M$ for all times $t = 1, \dots, T$ and all feasible points $z, z' \in \mathcal{Z}$. Then, for all $p \in (0, 1)$ and all $u \in \mathcal{Z}$,*
$$\mathbb{P}\left[\tilde R^T(u) \le R^T(u) + (M + \tilde M)\sqrt{2T \ln\frac{1}{p}}\,\right] \ge 1 - p.$$

A straightforward consequence of Proposition 1 is that if $\mathcal{R}$ has regret that grows sublinearly in $T$, then the regret of $\tilde{\mathcal{R}}$ will also grow sublinearly in $T$ with high probability.

**Remark.** As shown in Proposition 1, using gradient estimators instead of exact gradients incurs an additive regret degradation term that scales proportionally with the bound $\tilde M$ on the norm of the gradient estimates $\tilde\ell^t$. We remark that the regret $R^T(u)$ also scales proportionally to the norm of the gradient estimates $\tilde\ell^t$. So, increasing the value of $p$ in Proposition 1 is not enough to counter the dependence on $\tilde M$.

### 3.2. Connection to Equilibrium Finding

We now apply the general theory of Section 3.1 to the specific application of this paper, that is, Nash equilibrium computation in large extensive-form games. We start from the construction of Section 2.3. In particular, we instantiate two deterministic regret minimizers $\mathcal{R}_1, \mathcal{R}_2$ and let them play strategies against each other. However, instead of computing the exact losses $\ell^t_1$ and $\ell^t_2$ as in (4), we compute estimates $\tilde\ell^t_1$ and $\tilde\ell^t_2$ according to some algorithm that guarantees that $\mathbb{E}_t[\tilde\ell^t_1] = \ell^t_1$ and $\mathbb{E}_t[\tilde\ell^t_2] = \ell^t_2$ at all times $t$. We will show that despite this modification, the average strategy profile has a saddle-point gap that is guaranteed to converge to zero with high probability.

Because of the particular definition of $\ell^t_1$, we have that at all times $t$,
$$\max_{x, x' \in X}\, \left|(\ell^t_1)^\top (x - x')\right| = \max_{x, x' \in X}\, \left|x^\top A_2\, y^t - x'^\top A_2\, y^t\right| \le \Delta,$$
where $\Delta$ is the payoff range of the game (see Section 2.1). (A symmetric statement holds for Player 2.) For $i \in \{1, 2\}$, let $\tilde M_i$ be positive constants such that $|(\tilde\ell^t_i)^\top (z - z')| \le \tilde M_i$ at all times $t = 1, \dots, T$ and all strategies $z, z'$ in the sequence-form polytope of Player $i$ (that is, $X$ when $i = 1$ and $Y$ when $i = 2$). Using Proposition 1, we find that for all $\hat x \in X$ and $\hat y \in Y$, with probability (at least) $1 - p$,
$$\sum_{t=1}^T (x^t - \hat x)^\top A_2\, y^t \;\le\; R^T_1(\hat x) + (\Delta + \tilde M_1)\sqrt{2T \ln\frac{1}{p}},$$
and, with probability (at least) $1 - p$,
$$-\sum_{t=1}^T (x^t)^\top A_2\, (y^t - \hat y) \;\le\; R^T_2(\hat y) + (\Delta + \tilde M_2)\sqrt{2T \ln\frac{1}{p}},$$
where $R^T_i$ denotes the regret of the regret minimizer $\mathcal{R}_i$ that at each time $t$ observes $\tilde\ell^t_i$. Summing the above inequalities, dividing by $T$, and using the union bound, we obtain that, with probability at least $1 - 2p$,
$$\bar x^\top A_2\, \hat y - \hat x^\top A_2\, \bar y \;\le\; \frac{R^T_1(\hat x) + R^T_2(\hat y)}{T} + (2\Delta + \tilde M_1 + \tilde M_2)\sqrt{\frac{2\ln(1/p)}{T}}, \tag{6}$$
where $\bar x := \frac{1}{T}\sum_{t=1}^T x^t$ and $\bar y := \frac{1}{T}\sum_{t=1}^T y^t$. Since (6) holds for all $\hat x \in X$ and $\hat y \in Y$, we obtain the following.

**Proposition 2.** *With probability at least $1 - 2p$,*
$$\xi(\bar x, \bar y) \;\le\; \frac{\max_{\hat x \in X} R^T_1(\hat x) + \max_{\hat y \in Y} R^T_2(\hat y)}{T} + (2\Delta + \tilde M_1 + \tilde M_2)\sqrt{\frac{2\ln(1/p)}{T}}.$$

If $\mathcal{R}_1$ and $\mathcal{R}_2$ have regret that is sublinear in $T$, we conclude that the saddle-point gap $\xi(\bar x, \bar y)$ converges to 0 with high probability, just like in the non-stochastic setting. So, $(\bar x, \bar y)$ converges to a saddle point over time.

## 4. Game-Theoretic Gradient Estimators

We complete the theory of Sections 3.1 and 3.2 by showing some examples of computationally cheap gradient estimators designed for game-theoretic applications. We will illustrate how each technique can be used to construct an estimate $\tilde\ell^t_1$ of the gradient $\ell^t_1 = A_2\, y^t$ for Player 1 defined in (4). The computation of an estimate of $\ell^t_2$ is analogous.

### 4.1. External Sampling

An unbiased estimator of the gradient vector $\ell^t_1 = A_2\, y^t$ can easily be constructed by independently sampling pure strategies $\tilde y^t$ for Player 2 and $\tilde c^t$ for the chance player. Indeed, as long as $\mathbb{E}_t[\tilde y^t] = y^t$ and $\mathbb{E}_t[\tilde c^t] = c$, from (1) we have that for all $x \in X$, $u_2(x, y^t, c) = \mathbb{E}_t[u_2(x, \tilde y^t, \tilde c^t)]$.
Hence, the vector corresponding to the (random) linear function $x \mapsto u_2(x, \tilde y^t, \tilde c^t)$ is an unbiased gradient estimator, called the external sampling gradient estimator. Since at all times $t$, $\tilde y^t$ and $\tilde c^t$ are sequence-form strategies, $u_2(x, \tilde y^t, \tilde c^t)$ is lower bounded by the minimum payoff of the game and upper bounded by the maximum payoff of the game. Hence, for this estimator, the constant $\tilde M$ in Proposition 1 is equal to the payoff range $\Delta$ of the game. Substituting that value into Proposition 2, we conclude that when the external sampling gradient estimator is used to estimate the gradients of both players, with probability at least $1 - 2p$ the saddle-point gap of the average strategy profile $(\bar x, \bar y)$ is
$$\xi(\bar x, \bar y) \;\le\; \frac{\max_{\hat x \in X} R^T_1(\hat x) + \max_{\hat y \in Y} R^T_2(\hat y)}{T} + 4\Delta\sqrt{\frac{2\ln(1/p)}{T}}. \tag{7}$$

The external sampling gradient estimator, that is, the vector corresponding to the linear function $x \mapsto u_2(x, \tilde y^t, \tilde c^t)$, can be computed via a simple traversal of the game tree. The algorithm starts at the root of the game tree and visits the tree. Every time a node that belongs to the chance player or to Player 2 is encountered, an action is sampled according to the strategy $c$ or $y^t$, respectively. Every time a node of Player 1 is encountered, the algorithm branches on all possible actions and recurses. A simple linear-time implementation is given as Algorithm 1 (a Python sketch follows at the end of this subsection). For every node of Player 2 or of the chance player, the algorithm branches on only one action. Thus, an external sampling gradient estimate is significantly cheaper to compute than the exact gradient $\ell^t_1$.

**Algorithm 1:** Efficient implementation of the external sampling gradient estimator

Input: strategy $y^t$ for Player 2. Output: unbiased gradient estimate $\tilde\ell^t_1$ of $\ell^t_1$ defined in (4).

1. $\tilde\ell^t_1 \gets \mathbf{0} \in \mathbb{R}^{|\Sigma_1|}$
2. subroutine TraverseAndSample($v$):
3. $\quad$ $I \gets$ information set to which $v$ belongs
4. $\quad$ if $v$ is a leaf then
5. $\quad\quad$ $\tilde\ell^t_1[\sigma_1(v)] \gets u_2(v)$
6. $\quad$ else if $v$ belongs to the chance player then
7. $\quad\quad$ sample an action $a^* \sim \big(c[(I,a)] / c[\sigma_c(I)]\big)_{a \in A_v}$ and call TraverseAndSample($\rho(v, a^*)$)
8. $\quad$ else if $v$ belongs to Player 2 then
9. $\quad\quad$ sample an action $a^* \sim \big(y^t[(I,a)] / y^t[\sigma_2(I)]\big)_{a \in A_v}$ and call TraverseAndSample($\rho(v, a^*)$)
10. $\quad$ else if $v$ belongs to Player 1 then
11. $\quad\quad$ for each $a \in A_v$: TraverseAndSample($\rho(v, a)$)
12. TraverseAndSample($r$), where $r$ is the root of the game tree
13. return $\tilde\ell^t_1$

**Remark.** Analogous estimators where only the chance player's strategy $c$ or only Player 2's strategy $y^t$ is sampled are referred to as the chance sampling estimator and the opponent sampling estimator, respectively. In both cases, the same value $\tilde M = \Delta$ (and therefore the bound in (7)) applies.

**Remark.** In the special case in which $\mathcal{R}_1$ and $\mathcal{R}_2$ run the CFR regret minimization algorithm, our analysis immediately implies the correctness of external-sampling MCCFR, chance-sampling MCCFR, and opponent-sampling MCCFR, while at the same time yielding a significant improvement over the theoretical convergence rate to Nash equilibrium of the overall algorithm: the right-hand side of (7) grows as $\sqrt{\log(1/p)}$ in $p$, compared to the $O(\sqrt{1/p})$ of the original analysis by Lanctot et al. (2009). Finally, we remark that our regret bound has a more favorable dependence on game-specific constants (for example, the number of information sets of each player) than the original analysis by Lanctot et al. (2009).
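For concreteness, the following Python sketch mirrors Algorithm 1. It is our own illustration, not the paper's code; the tree and strategy interfaces (`v.player`, `v.children`, `v.infoset`, the `sigma` functions, `u2`, and the dict-based strategies) are hypothetical.

```python
import random

# Illustrative sketch of the external sampling gradient estimator (Algorithm 1).
# `v.children` maps actions to child nodes (empty at leaves); sigma1/sigma2/sigma_c
# return the last sequence of the corresponding player on the path to a node;
# strategies are sequence-form dicts with the empty sequence mapped to 1.

def external_sampling_estimate(root, y_t, c, sigma1, sigma2, sigma_c, u2):
    """Return an unbiased estimate of l_1^t = A2 y^t as a dict over Player 1's sequences."""
    grad = {}

    def sample_action(strategy, v, parent_seq):
        actions = list(v.children)
        # Behavioral probability of a at v is strategy[(I, a)] / strategy[sigma_i(I)].
        weights = [strategy[(v.infoset, a)] / strategy[parent_seq] for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    def traverse(v):
        if not v.children:                        # leaf: record Player 2's payoff
            grad[sigma1(v)] = u2(v)
        elif v.player == "chance":                # sample one chance action
            traverse(v.children[sample_action(c, v, sigma_c(v))])
        elif v.player == 2:                       # sample one action of Player 2
            traverse(v.children[sample_action(y_t, v, sigma2(v))])
        else:                                     # Player 1: branch on all actions
            for a in v.children:
                traverse(v.children[a])

    traverse(root)
    return grad
```

Within a single call, the recursion branches only at Player 1's nodes, which is what makes the estimate much cheaper to compute than the exact gradient.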
### 4.2. Outcome Sampling

Let $w^t \in X$ be an arbitrary strategy of Player 1. Furthermore, let $\tilde z^t \in Z$ be a random variable such that for all $z \in Z$,
$$\mathbb{P}_t[\tilde z^t = z] = w^t[\sigma_1(z)]\; y^t[\sigma_2(z)]\; c[\sigma_c(z)],$$
and let $e_z$ be the vector such that $e_z[\sigma_1(z)] = 1$ and $e_z[\sigma] = 0$ for all other $\sigma \in \Sigma_1$, $\sigma \ne \sigma_1(z)$. It is a simple exercise to prove that the random vector
$$\tilde\ell^t_1 := \frac{u_2(\tilde z^t)}{w^t[\sigma_1(\tilde z^t)]}\, e_{\tilde z^t}$$
is such that $\mathbb{E}_t[\tilde\ell^t_1] = \ell^t_1$ (see Appendix A for a proof). This particular definition of $\tilde\ell^t_1$ is called the outcome sampling gradient estimator.

Computationally, the outcome sampling gradient estimator is cheaper than the external sampling gradient estimator. Indeed, since $w^t \in X$, one can sample $\tilde z^t$ by following a random path from the root of the game tree, sampling (from the appropriate player's strategy) one action at each node encountered along the way. The walk terminates as soon as it reaches a leaf, which corresponds to $\tilde z^t$. As we show in Appendix A, the value of $\tilde M$ for the outcome sampling gradient estimator is
$$\tilde M = \Delta \cdot \max_{\sigma \in \Sigma_1} \frac{1}{w^t[\sigma]}.$$
So, the high-probability bound on the saddle-point gap is inversely proportional to the minimum entry of $w^t$, as already noted by Lanctot et al. (2009).

#### 4.2.1. Exploration-Balanced Outcome Sampling

In Appendix A we show that a strategy $w^*$ exists such that $w^*[\sigma] \ge 1/(|\Sigma_1| - 1)$ for every $\sigma \in \Sigma_1$. Since $w^*$ guarantees that all of the $|\Sigma_1|$ entries of $w^*$ are at least $1/(|\Sigma_1| - 1)$, we call $w^*$ the exploration-balanced strategy, and the corresponding outcome sampling gradient estimator the exploration-balanced outcome sampling gradient estimator. As a consequence of the above analysis, when both players' gradients are estimated using the exploration-balanced outcome sampling gradient estimator, with probability at least $1 - 2p$ the saddle-point gap of the average strategy profile $(\bar x, \bar y)$ is upper bounded as
$$\xi(\bar x, \bar y) \;\le\; \frac{\max_{\hat x \in X} R^T_1(\hat x) + \max_{\hat y \in Y} R^T_2(\hat y)}{T} + 2\Delta\,(|\Sigma_1| + |\Sigma_2|)\sqrt{\frac{2\ln(1/p)}{T}}.$$
To our knowledge, this is the first time that the exploration-balanced outcome sampling gradient estimator has been introduced. The final remark of Section 4.1 applies to outcome sampling as well. A Python sketch of the single-trajectory sampling procedure is given below.
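The sketch below illustrates the single-trajectory walk described above (again our own illustration, under the same hypothetical tree and strategy interfaces as the external-sampling sketch in Section 4.1; here `w_t` is the exploration strategy of Player 1, for instance the exploration-balanced strategy $w^*$).

```python
import random

# Illustrative sketch of the outcome sampling gradient estimator of Section 4.2.
# A single leaf is sampled by walking root -> leaf, drawing one action at every
# node from w^t (Player 1), y^t (Player 2), or c (chance); the estimate puts the
# importance-weighted payoff u2(z) / w^t[sigma_1(z)] on coordinate sigma_1(z).

def outcome_sampling_estimate(root, w_t, y_t, c, sigma1, sigma2, sigma_c, u2):
    """Return an unbiased estimate of l_1^t supported on a single sequence of Player 1."""
    def step(strategy, v, parent_seq):
        actions = list(v.children)
        weights = [strategy[(v.infoset, a)] / strategy[parent_seq] for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    v = root
    while v.children:                              # walk down, one sampled action per node
        if v.player == "chance":
            a = step(c, v, sigma_c(v))
        elif v.player == 2:
            a = step(y_t, v, sigma2(v))
        else:                                      # Player 1 samples from w^t
            a = step(w_t, v, sigma1(v))
        v = v.children[a]

    return {sigma1(v): u2(v) / w_t[sigma1(v)]}     # single-coordinate, importance-weighted
```

Averaging several independent calls, as done in the experiments of Section 5.2, reduces the variance of the estimate while, as noted there, leaving the constant $\tilde M$ of Proposition 1 unchanged.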
## 5. Experiments

In this section we perform numerical simulations to investigate the practical performance of several stochastic regret-minimization algorithms. First, we have the MCCFR algorithm instantiated with regret matching (Hart & Mas-Colell, 2000). Second, we instantiate two algorithms through our framework: FTRL and OMD, both using the dilated entropy distance-generating function from Kroer et al. (2020), using their theoretically correct recursive scheme for information-set weights.²

² We do this as opposed to the constant information-set weights used numerically by some past papers. Our preliminary experiments with constant weights gave worse results.

We will show two sets of experiments, one with external sampling and one with exploration-balanced outcome sampling. For each game, we try four choices of stepsize $\eta$ in FTRL and OMD: 0.1, 1, 10, 100. For each algorithm-game pair we show only the best-performing of these four stepsizes in the plots below. The results for all stepsizes can be found in Appendix C. The stepsize is important: for most games where FTRL or OMD beats MCCFR, only the best stepsize does so. At the same time, we did not extensively tune stepsizes (four stepsizes increasing by a factor of 10 per choice leads to very coarse tuning), so there is room for better tuning. Figuring out how to intelligently choose, or adapt, stepsizes is an important future research direction to the present paper, and would likely lead to even faster algorithms.

For each game-algorithm pair, we run the experiment 50 times, in order to account for variance in the gradient estimates. All plots show the mean performance, and each line is surrounded by shading indicating one standard deviation around the mean. In each plot we show the number of nodes of the game tree touched on the x-axis, and the saddle-point gap on the y-axis. All algorithms are run until the number of nodes touched corresponds to 50 full tree traversals (or, equivalently, 25 iterations of deterministic CFR or CFR+).

We run our experiments on four different games (one of them, Search, in two variants, for five game instances in total). Below, we summarize some key properties of the games. The full description of each game is in Appendix B.

**Leduc poker** is a standard parametric benchmark game in the EFG-solving community (Southey et al., 2005). For our experiments we consider the largest variant of the game, Leduc13, which uses a deck of 13 unique cards, with two copies of each card. The game has 166,336 nodes and 6,007 sequences per player.

**Goofspiel.** The variant of Goofspiel (Ross, 1971) that we use in our experiments is a two-player card game, employing three identical decks of 4 cards each. This game has 54,421 nodes and 21,329 sequences per player.

**Search** is a security-inspired pursuit-evasion game. The game is played on the graph shown in Figure 5 in Appendix B. We consider two variants of the game, which differ in the number $k$ of simultaneous moves allowed before the game ties out. Search-4 uses $k = 4$ and has 21,613 nodes, 2,029 defender sequences, and 52 attacker sequences. Search-5 uses $k = 5$ and has 87,972 nodes, 11,830 defender sequences, and 69 attacker sequences. Our search game is a zero-sum variant of the one used by Kroer et al. (2018). A similar search game was considered by Bošanský et al. (2014) and Bošanský & Čermák (2015).

**Battleship** is a parametric version of a classic board game, where two competing fleets take turns shooting at each other (Farina et al., 2019c). The game has 732,607 nodes, 73,130 sequences for Player 1, and 253,940 sequences for Player 2.

### 5.1. External Sampling

*Figure 3. Performance of MCCFR, FTRL, and OMD with the external sampling gradient estimator, over 50 seeds. Each panel plots the saddle-point gap against the number of nodes touched: Battleship (FTRL $\eta = 10$, OMD $\eta = 10$), Goofspiel (FTRL $\eta = 100$, OMD $\eta = 10$), Leduc13 (FTRL $\eta = 10$, OMD $\eta = 1$), and the Search game with 4 turns (FTRL $\eta = 100$, OMD $\eta = 1$).*

Figure 3 (top left) shows the performance on Battleship with external sampling. We see that both FTRL and OMD perform better than MCCFR when using stepsize $\eta = 10$. In Goofspiel (top right) we find that OMD performs significantly worse than MCCFR and FTRL; MCCFR also performs slightly better than FTRL. In Leduc13 (bottom left) we find that OMD performs significantly worse than MCCFR and FTRL, while FTRL performs slightly better than MCCFR. Finally, in Search-4 (bottom right) we find that OMD and MCCFR perform comparably, while FTRL performs significantly better. Due to space limitations, we show the experimental evaluation for Search-5 in Appendix C. In Search-5 all algorithms perform comparably, with FTRL performing slightly better than OMD and MCCFR.
Summarizing across all five games for external sampling, we see that FTRL, either with $\eta = 10$ or $\eta = 100$, was better than MCCFR on four out of five games (and essentially tied on the last game), with significantly better performance in the Search games. OMD performs significantly better than MCCFR and FTRL on Battleship.

### 5.2. Exploration-Balanced Outcome Sampling

Next, we investigate the performance of our exploration-balanced outcome sampling. For that gradient estimator we drew 100 outcome samples per gradient estimate, and use the empirical mean of those 100 samples as our estimate. The reason for this is that FTRL and OMD seem more sensitive to stepsize issues under outcome sampling. It can easily be shown that by averaging gradient estimators, the constant $\tilde M$ required in Proposition 1 does not increase.

*Figure 4. Performance of MCCFR, FTRL, and OMD with the exploration-balanced outcome sampling gradient estimator, over 10 seeds. Each panel plots the saddle-point gap against the number of nodes touched: Battleship (FTRL $\eta = 1$, OMD $\eta = 1$), Goofspiel (FTRL $\eta = 100$, OMD $\eta = 100$), Leduc13 (FTRL $\eta = 100$, OMD $\eta = 10$), and the Search game with 5 turns (FTRL $\eta = 0.1$, OMD $\eta = 1$).*

Due to computational time issues, we present performance for only 10 random seeds per game in outcome sampling. For this reason we omit performance on Search-4, which seemed too noisy to make conclusions about. Search-4 plots can be found in Appendix C.

Figure 4 (top left) shows the performance on Battleship with outcome sampling. Here all algorithms perform essentially identically, with MCCFR performing significantly worse for a while, then slightly better, before all algorithms become similar around $3 \times 10^7$ nodes touched. In Goofspiel (top right), MCCFR performs significantly better than both FTRL and OMD. Both FTRL and OMD were best with $\eta = 100$, our largest stepsize. It thus seems likely that even more aggressive stepsizes are needed in order to get better performance in Goofspiel. In Leduc13 (bottom left), FTRL with outcome sampling is initially slower than MCCFR, but eventually overtakes it. OMD is significantly worse than the other algorithms. Finally, in Search-5 (bottom right), MCCFR performs significantly better than FTRL and OMD, although FTRL seems to be catching up in later iterations.

Overall, when the exploration-balanced outcome sampling gradient estimator is used for all three algorithms, MCCFR seems to perform better than FTRL and OMD. In two out of four games it is significantly better, in one it is marginally better, and in one FTRL is marginally better. We hypothesize that FTRL and OMD are much more sensitive to stepsize issues with outcome sampling as opposed to external sampling. This would make sense, as the variance becomes much higher.

## 6. Conclusion

We introduced a new framework for constructing stochastic regret-minimization methods for solving zero-sum games. This framework completely decouples the choice of regret minimizer and gradient estimator, thus allowing any regret minimizer to be coupled with any gradient estimator.
Our framework also yields a streamlined and dramatically simpler proof of MCCFR. Furthermore, it immediately gives a significantly stronger bound on the convergence rate of MCCFR, whereby with probability $1 - p$ the regret grows as $O(\sqrt{T \log(1/p)})$ instead of $O(\sqrt{T/p})$ as in the original analysis, an exponentially better bound. We also instantiated stochastic variants of the FTRL and OMD algorithms for solving zero-sum EFGs using our framework. Extensive numerical experiments showed that it is often possible to beat MCCFR using these algorithms, even with a very mild amount of stepsize tuning.

Due to its modular nature, our framework opens the door to many possible future research questions around stochastic methods for solving EFGs. Among the most promising are methods for controlling the stepsize in, for instance, FTRL or OMD, as well as instantiating our framework with other regret minimizers.

One potential avenue for future work is to develop gradient-estimation techniques with stronger control over the variance. In that case, it is possible to derive a variation of Proposition 1 that is based on the sum of conditional variances, an intrinsic notion of time in martingales (e.g., Blackwell & Freedman (1973)). In particular, using the Freedman-style (Freedman, 1975) concentration result of Bartlett et al. (2008) for martingale difference sequences, we obtain:

**Proposition 3.** *Let $T \ge 4$, and let $M$ and $\tilde M$ be positive constants such that $|(\ell^t)^\top (z - u)| \le M$ and $|(\tilde\ell^t)^\top (z - u)| \le \tilde M$ for all times $t = 1, \dots, T$ and all feasible points $z, u \in \mathcal{Z}$. Furthermore, let $\sigma := \sqrt{\sum_{t=1}^T \mathrm{Var}[d^t \mid \ell^1, \dots, \ell^{t-1}]}$ be the square root of the sum of the conditional variances of the random variables $d^t$ introduced in (5). Then, for all $p \in (0, 1/2]$ and all $u \in \mathcal{Z}$,*
$$\mathbb{P}\left[\tilde R^T(u) \le R^T(u) + 4\max\{\sigma\beta,\; (M + \tilde M)\beta^2\}\right] \ge 1 - p,$$
*where $\beta$ is a quantity of order $\sqrt{\log\log T + \log(1/p)}$ arising from the concentration inequality of Bartlett et al. (2008).*

The concentration result of Proposition 3 takes into account the variance of the martingale difference sequence. When the variance is low, the dominant term on the right-hand side of the inequality is $(M + \tilde M)\beta^2 = O(\log\log T)$. On the other hand, when the variance is high (that is, $\sigma$ grows as $\sqrt{T}$), we recover a bound similar to the Azuma-Hoeffding inequality (albeit with a slightly worse polylogarithmic dependence on $T$).

Finally, our framework can also be applied to more general EFG-like problems, and thus this work also enables one to instantiate MCCFR or other stochastic methods for new sequential decision-making problems, for example by using the generalizations of CFR in Farina et al. (2019a) or Farina et al. (2019b).

## Acknowledgments

This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, IIS-1901403, and CCF-1733556, and by the ARO under awards W911NF-17-1-0082 and W911NF2010081. Gabriele Farina is supported by a Facebook fellowship.

## References

Abernethy, J. D. and Rakhlin, A. Beating the adaptive bandit with high probability. In 2009 Information Theory and Applications Workshop, 2009.

Archibald, C. and Shoham, Y. Modeling billiards games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Budapest, Hungary, 2009.

Azuma, K. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357-367, 1967.

Bartlett, P. L., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In Conference on Learning Theory (COLT), 2008.

Blackwell, D. and Freedman, D. On the amount of variance needed to escape from a strip. The Annals of Probability, pp. 772-787, 1973.
Bošanský, B. and Čermák, J. Sequence-form algorithm for computing Stackelberg equilibria in extensive-form games. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Bošanský, B., Kiekintveld, C., Lisý, V., and Pěchouček, M. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. Journal of Artificial Intelligence Research, pp. 829-866, 2014.

Brown, N. and Sandholm, T. Regret-based pruning in extensive-form games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2015.

Brown, N. and Sandholm, T. Reduced space and faster convergence in imperfect-information games via pruning. In International Conference on Machine Learning (ICML), 2017a.

Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, Dec. 2017b.

Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2019a.

Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885-890, 2019b.

Brown, N., Kroer, C., and Sandholm, T. Dynamic thresholding and pruning for regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2017.

Chen, K. and Bowling, M. Tractable objectives for robust policy optimization. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2012.

Chen, X., Han, Z., Zhang, H., Xue, G., Xiao, Y., and Bennis, M. Wireless resource scheduling in virtualized radio access networks using stochastic learning. IEEE Transactions on Mobile Computing, (1):1-1, 2018.

DeBruhl, B., Kroer, C., Datta, A., Sandholm, T., and Tague, P. Power napping with loud neighbors: optimal energy-constrained jamming and anti-jamming. In Proceedings of the 2014 ACM Conference on Security and Privacy in Wireless & Mobile Networks, pp. 117-128. ACM, 2014.

Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensive-form games. In AAAI Conference on Artificial Intelligence (AAAI), 2019a.

Farina, G., Kroer, C., and Sandholm, T. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning (ICML), pp. 1863-1872, 2019b.

Farina, G., Ling, C. K., Fang, F., and Sandholm, T. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Conference on Neural Information Processing Systems (NeurIPS), 2019c.

Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, 1975.

Gibson, R., Lanctot, M., Burch, N., Szafron, D., and Bowling, M. Generalized sampling and variance in counterfactual regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2012.

Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127-1150, 2000.

Hoda, S., Gilpin, A., Peña, J., and Sandholm, T. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Johanson, M., Waugh, K., Bowling, M., and Zinkevich, M. Accelerating best response calculation in large extensive games. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011.
Koller, D., Megiddo, N., and von Stengel, B. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2), 1996.

Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.

Kroer, C., Farina, G., and Sandholm, T. Robust Stackelberg equilibria in extensive-form games and extension to limited lookahead. In AAAI Conference on Artificial Intelligence (AAAI), 2018.

Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, 2020.

Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.

Lisý, V., Davis, T., and Bowling, M. Counterfactual regret minimization in sequential security games. In AAAI Conference on Artificial Intelligence (AAAI), 2016.

McDiarmid, C. Concentration, pp. 195-248. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998. ISBN 978-3-662-12788-9. doi: 10.1007/978-3-662-12788-9_6.

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, May 2017.

Munoz de Cote, E., Stranders, R., Basilico, N., Gatti, N., and Jennings, N. Introducing alarms in adversarial patrolling games. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1275-1276. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Romanovskii, I. Reduction of a game with complete memory to a matrix game. Soviet Mathematics, 3, 1962.

Ross, S. M. Goofspiel: the game of pure strategy. Journal of Applied Probability, 8(3):621-625, 1971.

Sandholm, T. The state of solving large incomplete-information games, and application to poker. AI Magazine, 2010. Special issue on Algorithmic Game Theory.

Sandholm, T. Steering evolution strategically: Computational game theory and opponent exploitation for treatment planning, drug design, and synthetic biology. In AAAI Conference on Artificial Intelligence (AAAI), 2015. Senior Member Track.

Schmid, M., Burch, N., Lanctot, M., Moravcik, M., Kadlec, R., and Bowling, M. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 2157-2164, 2019.

Shalev-Shwartz, S. and Singer, Y. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115-142, 2007.

Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner, C. Bayes' bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.

Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold'em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.

von Stengel, B. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220-246, 1996.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pp. 928-936, Washington, DC, USA, 2003.

Zinkevich, M., Bowling, M., Johanson, M., and Piccione, C. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.