Published as a conference paper at ICLR 2020

DOUBLE NEURAL COUNTERFACTUAL REGRET MINIMIZATION

Hui Li, Kailiang Hu, Shaohua Zhang, Yuan Qi, Le Song
Ant Financial Services Group; Georgia Institute of Technology
{ken.lh, hkl163251, yaohua.zsh, yuan.qi, le.song}@antfin.com, lsong@cc.gatech.edu

ABSTRACT

Counterfactual regret minimization (CFR) is a fundamental and effective technique for solving Imperfect Information Games (IIG). However, the original CFR algorithm only works for discrete state and action spaces, and the resulting strategy is maintained as a tabular representation. Such tabular representation limits the method from being directly applied to large games. In this paper, we propose a double neural representation for IIGs, where one neural network represents the cumulative regret and the other represents the average strategy. Such neural representations allow us to avoid manual game abstraction and carry out end-to-end optimization. To make the learning efficient, we also develop several novel techniques, including a robust sampling method and a mini-batch Monte Carlo Counterfactual Regret Minimization (MCCFR) method, which may be of independent interest. Empirically, on games tractable to tabular approaches, neural strategies trained with our algorithm converge comparably to their tabular counterparts and significantly outperform those based on deep reinforcement learning. On extremely large games with billions of decision nodes, our approach achieves strong performance while using hundreds of times less memory than tabular CFR. In head-to-head matches of heads-up no-limit Texas Hold'em, our neural agent beats the strong agent ABS-CFR (footnote 1) by 9.8 ± 4.1 chips per game. This is a successful application of neural CFR to large games.

1 INTRODUCTION

While significant advances have been made in addressing large perfect information games, such as Go (Silver et al., 2016), solving imperfect information games remains a challenging task. In an Imperfect Information Game (IIG), a player has only partial knowledge about her opponents before making a decision, so she has to reason under uncertainty about her opponents' information while exploiting the opponents' uncertainty about herself. Thus, IIGs provide more realistic modeling than perfect information games for many real-world applications, such as trading, traffic routing, and politics.

Nash equilibrium is a typical solution concept for a two-player perfect-recall IIG. One of the most effective approaches is CFR (Zinkevich et al., 2007), which minimizes the overall counterfactual regret so that the average strategies converge to a Nash equilibrium. However, the original CFR only works for discrete state and action spaces, and the resulting strategy is maintained as a tabular representation. Such tabular representation limits the method from being directly applied to large games. To tackle this challenge, one can simplify the game by grouping similar states together and then solve the simplified (abstracted) game approximately via tabular CFR (Zinkevich et al., 2007; Lanctot et al., 2009). Constructing an effective abstraction, however, demands rich domain knowledge, and its solution may be a coarse approximation of the true equilibrium. Function approximation can be used to replace the tabular representation. Waugh et al. (2015) combine regression tree function approximation with CFR based on handcrafted features, which is called Regression CFR (RCFR).
However, since RCFR uses full traversals of the game tree, it is still impractical for large games. Moravcik et al. (2017) propose a seminal approach, DeepStack, which uses fully connected neural networks to represent the players' counterfactual values; tabular CFR, however, is still used in the subgame solving. Jin et al. (2017) use deep reinforcement learning to solve the regret minimization problem in single-agent settings, which is different from two-player perfect-recall IIGs.

Previous versions of this work were published in the AAAI 2019 workshop on RLG and the ICML 2019 workshop on RWSDM. It took more than one year for us to reimplement DeepStack and evaluate our method on large-scale heads-up no-limit Texas Hold'em. Correspondence to: Hui Li, lihuiknight@google.com.

Footnote 1: ABS-CFR is the advanced version of HITSZ_LMW_2pn, which won the third prize of the 2018 Annual Computer Poker Competition (ACPC). In our experiment, we chose this advanced version as the benchmark.

To learn approximate Nash equilibria for IIGs in an end-to-end manner, Heinrich et al. (2015) and Heinrich & Silver (2016) propose eXtensive-form Fictitious Play (XFP) and Neural Fictitious Self-Play (NFSP), respectively, based on deep reinforcement learning. In an NFSP model, the neural strategies are updated by selecting the best responses to their opponents' average strategies. These approaches are advantageous in the sense that they do not rely on abstracting the game, and accordingly their strategies can improve continuously with more optimization iterations. However, fictitious play empirically converges much more slowly than CFR-based approaches. Srinivasan et al. (2018) use actor-critic policy optimization methods to minimize regret and achieve performance comparable to NFSP. Thus it remains an open question whether a purely neural end-to-end approach can achieve performance comparable to tabular CFR approaches.

In this paper, we address this open question by designing a double neural counterfactual regret minimization (DNCFR) algorithm (footnote 2). To obtain a neural representation, we model the imperfect information game by a novel recurrent neural network with attention. Furthermore, in order to improve the convergence of the neural algorithm, we also develop a new sampling technique, which converges much more efficiently than outcome sampling while being more memory efficient than external sampling. In the experiments, we conduct a set of ablation studies related to each novelty. The experiments show that DNCFR converges to results comparable to those produced by its tabular counterpart while performing much better than NFSP. In addition, we test DNCFR on an extremely large game, heads-up no-limit Texas Hold'em (HUNL). The experiments show that DNCFR, with only a small number of parameters, achieves a strong neural strategy and beats ABS-CFR.

2 BACKGROUND

Figure 1: Extensive-form IIG and information set notation.

Figure 1 illustrates an extensive game for a finite set $N = \{0, 1, \ldots, n-1\}$ of $n$ players. Define $x_i^v$ as the hidden information of player $i$ in the IIG; $x_{-i}^v$ refers to the hidden variables of all players other than $i$. $H$ refers to a finite set of histories. $h \in H$ denotes a possible history (or state), which consists of each player's hidden variable and the actions taken by all players, including chance. The empty sequence $\emptyset$ is a member of $H$. $h_j \sqsubseteq h$ denotes that $h_j$ is a prefix of $h$.
$Z \subseteq H$ denotes the terminal histories, and any member $z \in Z$ is not a prefix of any other sequence. $A(h) = \{a : ha \in H\}$ is the set of available actions after a non-terminal history $h \in H \setminus Z$. A player function $P$ assigns a member of $N \cup \{c\}$ to each non-terminal history, where $c$ denotes the chance player (we set $c = -1$). $P(h)$ is the player who takes an action after history $h$. For each player $i$, imperfect information is captured by information sets (infosets) $I_i$: all states $h \in I_i$ are indistinguishable to $i$. $\mathcal{I}_i$ refers to the set of infosets of player $i$. The utility function $u_i(z)$ defines the payoff of $i$ at terminal state $z$. See Appendix B.1 for more details.

Algorithm 1: CFR Algorithm. For $t = 1$ to $T$ do:
$$v_i^{\sigma^t}(I_i) = \sum_{h \in I_i,\, h \sqsubseteq z,\, z \in Z} \pi_i^{\sigma^t}(h, z)\, \pi_{-i}^{\sigma^t}(z)\, u_i(z). \quad (1)$$
$$r_i^{\sigma^t}(a|I_i) = v_i^{\sigma^t}(a|I_i) - v_i^{\sigma^t}(I_i). \quad (2)$$
$$R_i^t(a|I_i) = R_i^{t-1}(a|I_i) + r_i^{\sigma^t}(a|I_i). \quad (3)$$
$$\sigma_i^{t+1}(a|I_i) = \begin{cases} \dfrac{1}{|A(I_i)|} & \text{if } \sum_{a \in A(I_i)} R_i^{t,+}(a|I_i) = 0 \\[4pt] \dfrac{R_i^{t,+}(a|I_i)}{\sum_{a \in A(I_i)} R_i^{t,+}(a|I_i)} & \text{otherwise.} \end{cases} \quad (4)$$
$$S^t(a|I_i) = S^{t-1}(a|I_i) + \pi_i^{\sigma^t}(I_i)\, \sigma_i^t(a|I_i). \quad (5)$$
$$\bar{\sigma}_i^T(a|I_i) = \frac{S^T(a|I_i)}{\sum_{a \in A(I_i)} S^T(a|I_i)}. \quad (6)$$

A strategy profile $\sigma = \{\sigma_i \mid \sigma_i \in \Sigma_i, i \in N\}$ is a collection of strategies for all players, where $\Sigma_i$ is the set of all possible strategies for player $i$. $\sigma_{-i}$ refers to the strategies of all players other than player $i$. For player $i \in N$, the strategy $\sigma_i(I_i)$ is a function which assigns an action distribution over $A(I_i)$ to infoset $I_i$. $\sigma_i(a|h)$ denotes the probability of action $a$ taken by player $i$ at state $h$. In an IIG, for all $h_1, h_2 \in I_i$, we have $\sigma_i(I_i) = \sigma_i(h_1) = \sigma_i(h_2)$. For an iterative method such as CFR, $\sigma^t$ refers to the strategy profile at the $t$-th iteration. The state reach probability of history $h$ is denoted by $\pi^\sigma(h)$ if players take actions according to $\sigma$. The reach probability is also called the range in DeepStack (Moravcik et al., 2017). Similarly, $\pi_i^\sigma(h)$ refers to player $i$'s contribution, while $\pi_{-i}^\sigma(h)$ refers to the contribution of all players other than $i$. For the empty sequence, $\pi^\sigma(\emptyset) = 1$. One can also show that the reach probability of the opponent is proportional to the posterior probability of the opponent's hidden variable, i.e., $p(x_{-i}^v | I_i) \propto \pi_{-i}^\sigma(h)$, where $x_{-i}^v$ and $I_i$ determine a particular $h$ (proof in Appendix D.1).

Footnote 2: Solving IIGs via function approximation methods is an important and challenging problem. In the past year, several concurrent works (Lockhart et al., 2019; Brown et al., 2018; Steinberger, 2019) have been proposed to address this problem. We discuss their differences in Section 6.

Figure 2: (a) Tabular CFR and (b) our double neural CFR framework. $\tilde{r}_i^{\sigma^t}((a|I_i)|Q_j)$ is the estimated regret in MCCFR, $R_i^{t-1}(a|I_i)$ is the cumulative regret, $s_i^t(a|I_i)$ is the weighted additional strategy, and $S_i^{t-1}(a|I_i)$ is the cumulative behavior strategy. In tabular CFR, the cumulative regret and strategy are stored in tabular memory, which limits its ability to solve large games. In DNCFR, we use two deep neural networks to approximate these two values. DNCFR needs less memory than tabular methods because of its generalization.

Finally, the infoset reach probability of $I_i$ is defined as $\pi^\sigma(I_i) = \sum_{h \in I_i} \pi^\sigma(h)$. Similarly, $\pi_i^\sigma(I_i) = \sum_{h \in I_i} \pi_i^\sigma(h)$ and $\pi_{-i}^\sigma(I_i) = \sum_{h \in I_i} \pi_{-i}^\sigma(h)$. More details can be found in Appendix B.3.

Counterfactual Regret Minimization. CFR is an iterative method for finding a Nash equilibrium in zero-sum perfect-recall IIGs (Zinkevich et al., 2007); see Algorithm 1 and Figure 2(a).
Given strategy profile $\sigma$, the counterfactual value (CFV) $v_i^\sigma(I_i)$ at infoset $I_i$ is defined by Eq. (1). The action CFV of taking action $a$ is $v_i^\sigma(a|I_i)$, and its regret is defined by Eq. (2). The cumulative regret of action $a$ after $T$ iterations is given by Eq. (3), where $R_i^0(a|I_i) = 0$. Defining $R_i^{t,+}(a|I_i) = \max(R_i^t(a|I_i), 0)$, the current strategy (or behavior strategy) at iteration $t+1$ is updated by Eq. (4). Define $s_i^t(a|I_i) = \pi_i^{\sigma^t}(I_i)\, \sigma_i^t(a|I_i)$ as the additional strategy in iteration $t$; the cumulative strategy is then defined by Eq. (5), where $S^0(a|I_i) = 0$. The average strategy $\bar{\sigma}_i^T$ after $T$ iterations is defined by Eq. (6); it approaches a Nash equilibrium after enough iterations.

Monte Carlo CFR. Lanctot et al. (2009) propose Monte Carlo CFR (MCCFR), which computes an unbiased estimate of the counterfactual value by sampling subsets of infosets in each iteration. Although MCCFR still needs two tabular storages for the cumulative regret and the cumulative strategy, as CFR does, it needs much less working memory than standard CFR (Zinkevich et al., 2007), because MCCFR only maintains values for the visited nodes in working memory. Define $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_m\}$, where $Q_j \subseteq Z$ is a set (block) of sampled terminal histories in each iteration, such that these blocks together span the set $Z$. Define $q_{Q_j}$ as the probability of considering block $Q_j$, where $\sum_{j=1}^{m} q_{Q_j} = 1$. Define $q(z) = \sum_{j: z \in Q_j} q_{Q_j}$ as the probability of considering a particular terminal history $z$. For infoset $I_i$, an estimate of the sampled counterfactual value is $\tilde{v}_i^\sigma(I_i|Q_j) = \sum_{h \in I_i,\, z \in Q_j,\, h \sqsubseteq z} \frac{1}{q(z)}\, \pi_{-i}^\sigma(z)\, \pi_i^\sigma(h, z)\, u_i(z)$.

Lemma 1 (Lanctot et al. (2009)). The sampled counterfactual value in MCCFR is an unbiased estimate of the actual counterfactual value in CFR: $\mathbb{E}_{j \sim q_{Q_j}}[\tilde{v}_i^\sigma(I_i|Q_j)] = v_i^\sigma(I_i)$.

Define $\sigma^{rs}$ as the sampled strategy profile, where $\sigma_i^{rs}$ is the sampled strategy of player $i$ and $\sigma_{-i}^{rs}$ are those of the other players. The regret of the sampled action $a \in A(I_i)$ is defined by $\tilde{r}_i^\sigma((a|I_i)|Q_j) = \sum_{z \in Q_j,\, ha \sqsubseteq z,\, h \in I_i} \pi_i^\sigma(ha, z)\, u_i^{rs}(z) - \sum_{z \in Q_j,\, h \sqsubseteq z,\, h \in I_i} \pi_i^\sigma(h, z)\, u_i^{rs}(z)$, where $u_i^{rs}(z) = u_i(z)/\pi_i^{\sigma^{rs}}(z)$ is a new utility weighted by $1/\pi_i^{\sigma^{rs}}(z)$. The sampled estimate of the cumulative regret of action $a$ after $t$ iterations is $\tilde{R}_i^t((a|I_i)|Q_j) = \tilde{R}_i^{t-1}((a|I_i)|Q_j) + \tilde{r}_i^{\sigma^t}((a|I_i)|Q_j)$, where $\tilde{R}_i^0((a|I_i)|Q_j) = 0$.

3 DOUBLE NEURAL COUNTERFACTUAL REGRET MINIMIZATION

The double neural CFR algorithm employs two neural networks, one for the cumulative regret $R$ and the other for the average strategy $S$, as shown in Figure 2(b).

3.1 MODELING

The iterative updates of the CFR algorithm maintain the cumulative regret $R^t(a|I_i)$ and the average strategy $\bar{\sigma}_i^t(a|I_i)$. Our two neural networks are designed accordingly.

Regret Sum Network (RSN): according to Eq. (4), the current strategy $\sigma^{t+1}(a|I_i)$ is computed from the cumulative regret $R^t(a|I_i)$. We only need to track the numerator in Eq. (4), since the normalization in the denominator can be computed easily when the strategy is used. Given infoset $I_i$ and action $a$, we design a neural network $R(a, I_i|\theta_R^t)$ to track $R^t(a|I_i)$, where $\theta_R^t$ are the network parameters.
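As a concrete illustration of the two normalizations mentioned above, the following minimal sketch recovers a current strategy from cumulative regrets via regret matching (Eq. 4) and an average strategy from cumulative strategy sums (Eq. 6). It operates on plain dictionaries; the function names and data layout are illustrative assumptions, not the paper's implementation.

```python
def regret_matching(cum_regret):
    """Eq. (4): turn cumulative regrets for one infoset into a strategy.

    cum_regret: dict mapping action -> cumulative regret R^t(a|I).
    Returns a dict mapping action -> probability sigma^{t+1}(a|I).
    """
    positive = {a: max(r, 0.0) for a, r in cum_regret.items()}
    total = sum(positive.values())
    if total <= 0.0:
        # No positive regret: fall back to the uniform strategy.
        return {a: 1.0 / len(cum_regret) for a in cum_regret}
    return {a: r / total for a, r in positive.items()}


def average_strategy(cum_strategy):
    """Eq. (6): normalize cumulative strategy sums S^T(a|I)."""
    total = sum(cum_strategy.values())
    return {a: s / total for a, s in cum_strategy.items()}


# Toy usage: two actions with cumulative regrets +2 and -1.
print(regret_matching({"pass": 2.0, "bet": -1.0}))  # {'pass': 1.0, 'bet': 0.0}
print(average_strategy({"pass": 3.0, "bet": 1.0}))  # {'pass': 0.75, 'bet': 0.25}
```

In DNCFR the cumulative regrets would come from the RSN's outputs instead of a table, but the normalization step is the same.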
Figure 3: (a) Recurrent neural network architecture with attention for extensive games. Both RSN and ASN are based on this architecture but with different parameters ($\theta_R$ and $\theta_S$ respectively). (b) An overview of the proposed robust sampling and mini-batch techniques. The trajectories marked by red arrows are the samples produced by robust sampling ($k = 2$ here).

Average Strategy Network (ASN): according to Eq. (6), the approximate Nash equilibrium is the weighted average of all previous behavior strategies up to iteration $t$, which is computed by normalizing the cumulative strategy $S^t(a|I_i)$. Similarly to the cumulative regret, we employ another deep neural network $S(a, I_i|\theta_S^t)$ with network parameters $\theta_S^t$ to track the cumulative strategy.

3.2 RECURRENT NEURAL NETWORK REPRESENTATION WITH ATTENTION

In order to define our R and S networks, we need to represent the infosets of an extensive-form game. In such games, players take actions in an alternating fashion, and each player makes decisions according to the observed history. Because the action sequences vary in length, we model them with recurrent neural networks, where each action in the sequence corresponds to a cell in the RNN. This architecture differs from the one in DeepStack (Moravcik et al., 2017), which uses a fully connected deep neural network to estimate counterfactual values. Figure 3(a) illustrates the proposed deep sequential neural network representation for infosets. Besides the vanilla RNN, there are several more expressive variants, such as the GRU (Cho et al., 2014) and LSTM (Hochreiter & Schmidhuber, 1997). In our experiments, we compare these different neural architectures as well as a fully connected network representation. Furthermore, because different positions in the sequence may contribute differently to the decision making, we add an attention mechanism (Desimone & Duncan, 1995; Cho et al., 2015) to the RNN architecture to enhance the representation. For example, a player may need to play a more aggressive strategy after beneficial public cards are revealed in a poker game, so the information after the public cards are revealed may be more important. In practice, we find that the attention mechanism helps DNCFR obtain a better convergence rate. See Appendix E for more details on the architectures.

3.3 OPTIMIZATION METHOD

The parameters of the two neural networks are optimized via stochastic gradient descent in a stage-wise fashion, interleaved with CFR iterations.

3.3.1 OPTIMIZING THE CURRENT STRATEGY

We use $M_R^t = \{(I_i, \tilde{r}_i^{\sigma^t}((a|I_i)|Q_j)) \mid \text{for all sampled } I_i\}$ to store the sampled infosets $I_i$ and the corresponding regrets $\tilde{r}_i^{\sigma^t}((a|I_i)|Q_j)$ for all players in the $t$-th iteration, where $Q_j$ is the sampled block (shown in Figure 2(b)). These samples are produced by our proposed robust sampling and mini-batch MCCFR methods, which will be discussed in Section 4. According to Eq. (3), we optimize the cumulative regret neural network $R(a, I_i|\theta_R^{t+1})$ using the following loss function over all $i \in N$, $I_i \in \mathcal{I}_i$, $a \in A(I_i)$:
$$\mathcal{L}(R) = \sum_{i \in N} \sum_{I_i \in \mathcal{I}_i} \sum_{a \in A(I_i)} \begin{cases} \left( R(\cdot|\theta_R^t) + \tilde{r}_i^{\sigma^t}(\cdot|Q_j) - R(\cdot|\theta_R^{t+1}) \right)^2 & \text{if } I_i \in M_R^t \\ \left( R(\cdot|\theta_R^t) + 0 - R(\cdot|\theta_R^{t+1}) \right)^2 & \text{otherwise,} \end{cases} \quad (7)$$
where $R((a|I_i)|\theta_R^t)$ is abbreviated as $R(\cdot|\theta_R^t)$, $\tilde{r}_i^{\sigma^t}((a|I_i)|Q_j)$ as $\tilde{r}_i^{\sigma^t}(\cdot|Q_j)$, $\theta_R^t$ are the old parameters, and $\theta_R^{t+1}$ are the new parameters to optimize. Note that Eq. (7) is minimized over the samples of all players rather than a particular player $i$. In standard MCCFR, if an infoset is not sampled, the corresponding regret is set to 0, which leads to an unbiased estimate according to Lemma 1. The design of the loss function in Eq. (7) follows the same intuition. Techniques in Schmid et al. (2018) can be used to reduce the variance.
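The sketch below shows how Eq. (7) could be turned into regression targets for the RSN: for sampled infosets the target is the old network output plus the newly estimated regret, otherwise the old output itself (an increment of 0). The `old_regret_net` callable and the memory layout are illustrative assumptions; any supervised regression routine could then fit the new parameters $\theta_R^{t+1}$ to these targets.

```python
def build_rsn_targets(old_regret_net, infosets, actions, sampled_regrets):
    """Construct regression targets for the Regret Sum Network per Eq. (7).

    old_regret_net(I, a) -> float : output R((a|I)|theta_R^t) of the previous network.
    infosets              : iterable of infoset keys to build targets for.
    actions               : dict infoset -> list of legal actions A(I).
    sampled_regrets       : dict (infoset, action) -> sampled regret from memory M_R^t;
                            infosets absent from this dict were not sampled this iteration.
    Returns dict (infoset, action) -> target value for fitting theta_R^{t+1}.
    """
    targets = {}
    for I in infosets:
        for a in actions[I]:
            old_value = old_regret_net(I, a)
            # Unsampled infosets contribute an increment of 0, as in standard MCCFR.
            increment = sampled_regrets.get((I, a), 0.0)
            targets[(I, a)] = old_value + increment
    return targets
```

In practice, as discussed next, DNCFR only fits the targets of the observed infosets rather than enumerating all of them.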
Sampling unobserved infosets? Theoretically, in order to optimize Eq. (7), we need to collect both observed and unobserved infosets. This would require a suitable sampling method to select additional training samples from the large number of unobserved infosets, which demands a lot of memory and computation. Clearly, this is intractable in large games such as HUNL. In practice, we find that minimizing the loss only over the observed samples still yields a converged strategy.

Learning without forgetting? Another concern is that only a small proportion of infosets are sampled due to mini-batch training, which may cause the neural networks to forget the values of unobserved infosets. To address this challenge, we use the neural network parameters from the previous iteration as the initialization, which gives the updates an online learning/adaptation flavor. Experimentally, on large games, due to the generalization ability of the neural networks, even when only a small proportion of infosets are used to update the networks, our double neural approach can still converge to an approximate Nash equilibrium. See Appendix F for more implementation details.

Scaling regret for stable training? According to Theorem 6 in Burch (2017), the cumulative regret satisfies $R_i^t(a|I_i) \le \Delta \sqrt{|A| T}$, where $|A| = \max_{I_i \in \mathcal{I}} |A(I_i)|$ and $\Delta = \max_{I_i, a, t} |R^t(a|I_i) - R^{t-1}(a|I_i)|$. This indicates that $R_i^t(a|I_i)$ can grow increasingly large. In practice, we scale the cumulative regret by a factor of $\sqrt{t}$ to keep its range stable. Specifically, define $\hat{R}_i^t(a|I_i) = R_i^t(a|I_i)/\sqrt{t}$; we can then update the cumulative regret of Eq. (3) by $\hat{R}_i^t(a|I_i) = \left( \sqrt{t-1}\, \hat{R}_i^{t-1}(a|I_i) + r_i^{\sigma^t}(a|I_i) \right) / \sqrt{t}$, where $\hat{R}_i^0(a|I_i) = 0$.
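As a quick check (a step not spelled out in the text), induction on $t$ confirms that this scaled update tracks exactly $R_i^t(a|I_i)/\sqrt{t}$, so regret matching over $\hat{R}^t$ yields the same current strategy as over $R^t$:

```latex
% Inductive hypothesis: \hat{R}_i^{t-1}(a|I_i) = R_i^{t-1}(a|I_i)/\sqrt{t-1}.
\hat{R}_i^{t}(a|I_i)
  = \frac{\sqrt{t-1}\,\hat{R}_i^{t-1}(a|I_i) + r_i^{\sigma^t}(a|I_i)}{\sqrt{t}}
  = \frac{R_i^{t-1}(a|I_i) + r_i^{\sigma^t}(a|I_i)}{\sqrt{t}}
  = \frac{R_i^{t}(a|I_i)}{\sqrt{t}}.
% The common factor 1/\sqrt{t} cancels in the normalization of Eq. (4),
% so the behavior strategy is unchanged while the regression targets stay bounded.
```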
3.3.2 OPTIMIZING THE AVERAGE STRATEGY

The other memory, $M_S^t = \{(I_i, s_i^t(a|I_i)) \mid \text{for all sampled } I_i\}$, stores the sampled infosets $I_i$ and the weighted additional behavior strategy $s_i^t(a|I_i)$ in the $t$-th iteration. Similarly, the loss function $\mathcal{L}(S)$ of ASN is defined over all $i \in N$, $I_i \in \mathcal{I}_i$, $a \in A(I_i)$ by:
$$\mathcal{L}(S) = \sum_{i \in N} \sum_{I_i \in \mathcal{I}_i} \sum_{a \in A(I_i)} \begin{cases} \left( S(\cdot|\theta_S^t) + s_i^t(a|I_i) - S(\cdot|\theta_S^{t+1}) \right)^2 & \text{if } I_i \in M_S^t \\ \left( S(\cdot|\theta_S^t) + 0 - S(\cdot|\theta_S^{t+1}) \right)^2 & \text{otherwise,} \end{cases} \quad (8)$$
where $S(\cdot|\theta_S^t)$ is an abbreviation of $S(a, I_i|\theta_S^t)$, $\theta_S^t$ are the old parameters, and $\theta_S^{t+1}$ are the new parameters to optimize. According to Algorithm 1, the cumulative regret is used to generate the behavior strategy of the next iteration, while the cumulative strategy is the sum of the weighted behavior strategies. In theory, if we stored all the memories $M_S^t$ of every iteration, we could compute the final average strategy directly, so there would be no need to optimize the average strategy network (ASN) $S(\cdot|\theta_S^t)$ in every iteration. However, saving all such values in a huge memory is very expensive in large games. A compromise is to accumulate these values over multiple iterations in a memory; once this memory is large enough, the increments collected over those iterations can be learned by optimizing Eq. (8).

Minimum squared loss versus maximum likelihood? The average strategy is a distribution over actions, which implies that we could use a maximum likelihood method to optimize it directly. However, the maximum likelihood method must be based on all samples up to the $t$-th iteration rather than only the newly collected ones, so it is very memory-expensive. To address this limitation, one could use uniform reservoir sampling (Osborne et al., 2014) to obtain an unbiased estimate of each strategy. In practice, we find that this maximum likelihood method has high variance and cannot approach a less exploitable Nash equilibrium. Experimentally, optimization by minimizing the squared loss gives a fast-converging average strategy profile and uses much less memory than the maximum likelihood method.

3.4 CONTINUAL IMPROVEMENT

When solving large IIGs, prior methods such as Libratus (Brown & Sandholm, 2017) and DeepStack (Moravcik et al., 2017) are based on an abstracted HUNL, which has a manageable number of infosets. The abstraction techniques are usually based on domain knowledge, such as clustering similar hand-strength cards into the same buckets or only allowing discrete actions (e.g., fold, call, one-pot raise, and all-in). DNCFR is not limited to specific abstracted cards or actions. For example, we can use a continuous variable to represent the bet size rather than encoding it as a discrete action. In practice, DNCFR can clone an existing tabular or neural representation and then continually improve the strategy from that initialization. More specifically, for infoset $I_i$ and action $a$, define $R_i^*(a|I_i)$ as the cumulative regret to be cloned. We can use behavior cloning to learn the cumulative regret by optimizing $\theta_R^* \leftarrow \arg\min_{\theta_R} \sum_{I_i \in \mathcal{I}_i} \left\| R(\cdot|\theta_R) - R^*(\cdot|I_i) \right\|^2$. The cumulative strategy can be cloned in the same way. Based on the learned parameters, we can warm start DNCFR and continually improve beyond the tabular strategy profile.

3.5 OVERALL ALGORITHM

Algorithm 2: DNCFR Algorithm
Function Agent(T, b):
  For t = 1 to T do:
    if t = 1 and using warm starting then
      Initialize $\theta_R^t$ and $\theta_S^t$ from a checkpoint; $t \leftarrow t + 1$
    else
      Initialize $\theta_R^t$ and $\theta_S^t$ randomly.
    $M_R^t, M_S^t \leftarrow$ sampling methods.
    Sum aggregated values in $M_R$ by infoset. Remove duplicated records in $M_S$.
    $\theta_R^t \leftarrow$ NeuralAgent($R(\cdot|\theta_R^{t-1}), M_R^t, \theta_R^{t-1}, \beta_R$)
    $\theta_S^t \leftarrow$ NeuralAgent($S(\cdot|\theta_S^{t-1}), M_S^t, \theta_S^{t-1}, \beta_S$)
  return $\theta_R^t, \theta_S^t$

Algorithm 2 summarizes the proposed double neural counterfactual regret minimization approach. In the first iteration, if the system warm starts from a tabular-based method, the techniques in Section 3.4 are used to clone the cumulative regrets and strategies. Without warm-start initialization, we start the algorithm by randomly initializing the parameters of RSN and ASN. The sampling methods then return the sampled infosets and values, which are saved in the memories $M_R^t$ and $M_S^t$ respectively. These samples are used by the Neural Agent algorithm (Algorithm 3) to optimize RSN and ASN. Details of the sampling methods are discussed in the next section. Due to space limitations, we present the Neural Agent fitting algorithm in Appendix F.

4 EFFICIENT TRAINING

In this section, we propose two techniques to improve the efficiency of the double neural method. These techniques can also be used separately in other CFR variants.

4.1 ROBUST SAMPLING TECHNIQUE

We now introduce a robust sampling method (RS), which generalizes both external sampling and outcome sampling (Lanctot et al., 2009). RS samples $k$ actions at one player's infosets and one action at the other players' infosets. Specifically, the sampled profile is defined by $\sigma^{rs(k)} = (\sigma_i^{rs(k)}, \sigma_{-i})$, where player $i$ randomly selects $k$ actions according to the sampled strategy $\sigma_i^{rs(k)}(I_i)$ at $I_i$, and the other players randomly select one action according to $\sigma_{-i}$. We design an efficient sampling policy for robust sampling as follows and discuss the relationship among robust sampling, external sampling, and outcome sampling in Appendix D.2.
If $k = \max_{I_i \in \mathcal{I}} |A(I_i)|$ and $\sigma_i^{rs(k)}(a|I_i) = 1$ for each action, then robust sampling is identical to external sampling. If $k = 1$, $\sigma_i^{rs(k)} = \sigma_i$, and $q(z) \ge \delta > 0$ ($\delta$ a small positive number), then robust sampling is identical to outcome sampling. In particular, if player $i$ randomly selects $\min(k, |A(I_i)|)$ actions according to the discrete uniform distribution $\mathrm{unif}(0, |A(I_i)|)$ at $I_i$, i.e., $\sigma_i^{rs(k)}(a|I_i) = \frac{\min(k, |A(I_i)|)}{|A(I_i)|}$, then $\pi_i^{\sigma^{rs(k)}}(I_i) = \prod_{h \in I_i,\, h' \sqsubseteq h,\, h'a' \sqsubseteq h,\, h' \in I'_i} \frac{\min(k, |A(I'_i)|)}{|A(I'_i)|}$, and the weighted utility $u_i^{rs(k)}(z)$ is a constant in each iteration.

In many settings, when $k = 1$, we find that this robust sampling scheme converges more efficiently than outcome sampling. With a suitable $k$, robust sampling achieves convergence comparable to external sampling while using less working memory. This is reasonable because our scheme only samples $k$ rather than all actions at player $i$'s infosets, so the sampled game tree is smaller than the one produced by external sampling. In the experiments, we compare these sampling policies in our ablation studies.

4.2 MINI-BATCH TECHNIQUE

Traditional MCCFR samples only one block per iteration and provides an unbiased estimate of the original CFV. In this paper, we present a mini-batch Monte Carlo technique that randomly samples $b$ blocks in one iteration. Let $Q_j$ denote a block of terminals sampled according to the scheme in Section 4.1; the mini-batch CFV with mini-batch size $b$ is
$$\tilde{v}_i^\sigma(I_i|b) = \sum_{j=1}^{b} \sum_{h \in I_i,\, h \sqsubseteq z,\, z \in Q_j} \frac{\pi_{-i}^\sigma(z)\, \pi_i^\sigma(h, z)\, u_i(z)}{b\, q(z)}.$$

Theorem 1. $\mathbb{E}_{Q_j \sim \text{mini-batch}}[\tilde{v}_i^\sigma(I_i|b)] = v_i^\sigma(I_i)$.

We prove that $\tilde{v}_i^\sigma(I_i|b)$ is an unbiased estimate of the CFV in Appendix D.3. Following ideas similar to CFR and CFR+, if we replace regret matching by regret matching plus (Tammelin, 2014), we obtain a mini-batch MCCFR+ algorithm. Our mini-batch technique can sample the $b$ blocks in parallel and empirically converges faster than the original MCCFR on multi-core machines.
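To make the sampling scheme of Sections 4.1 and 4.2 concrete, here is a minimal sketch of the per-infoset step of robust sampling under the uniform policy described above: player $i$ samples $\min(k, |A(I_i)|)$ distinct actions, and every other player samples a single action from its current strategy. The function names and the way strategies are passed in are illustrative assumptions only.

```python
import random

def sample_own_actions(actions, k):
    """Robust sampling at one of player i's infosets: pick min(k, |A|) distinct
    actions uniformly. Returns the sampled actions and the probability
    min(k, |A|)/|A| with which each individual action is included."""
    m = min(k, len(actions))
    return random.sample(actions, m), m / len(actions)

def sample_opponent_action(actions, strategy):
    """At any other player's (or chance's) node, sample one action from the
    current strategy sigma_{-i}(.|I)."""
    probs = [strategy[a] for a in actions]
    return random.choices(actions, weights=probs, k=1)[0]

# Toy usage with k = 2: player i explores 2 of 3 actions, the opponent samples 1.
own_actions, p_include = sample_own_actions(["fold", "call", "raise"], k=2)
opp_action = sample_opponent_action(["fold", "call"], {"fold": 0.3, "call": 0.7})
print(own_actions, p_include, opp_action)
```

Repeating the whole traversal independently $b$ times and averaging the resulting sampled counterfactual values gives the mini-batch estimator of Section 4.2, which is why the $b$ blocks can be generated in parallel.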
Figure 4: Log-log performance on Leduc(5). (a) Different sampling methods, where $k$ is the number of actions sampled at each infoset by the proposed robust sampling method. (b) Neural architectures. (c) Number of parameters. (d) Proportion of observed infosets; a higher proportion indicates more working memory.

5 EXPERIMENT

To understand the contributions of the various components of the DNCFR algorithm, we first conduct a set of ablation studies. We then compare DNCFR with tabular CFR and a deep reinforcement learning method, NFSP, which is a prior leading function approximation method for IIGs. Finally, we conduct experiments on heads-up no-limit Texas Hold'em (HUNL) to show the scalability of DNCFR. The games and key statistics used in our experiments are listed in Table 1.

Table 1: Summary. #infoset is the number of infosets, #state the number of states, %observed the ratio of infosets observed in each iteration, and #emd and #param the embedding size and number of parameters in DNCFR.

setting | #infoset | #state | %observed | #emd | #param | action abstraction | card abstraction
Leduc(5) | 1 × 10^4 | 6 × 10^5 | 5.59% | 16 | 2608 | No | No
Leduc(10) | 3 × 10^5 | 2 × 10^6 | 2.39% | 32 | 7424 | No | No
Leduc(15) | 3 × 10^6 | 2 × 10^7 | 0.53% | 64 | 23360 | No | No
HUNL(1) | 2 × 10^8 | 3 × 10^11 | 0.01% | 64 | 19200 | Yes | No
HUNL(2) | 8 × 10^10 | 1 × 10^14 | 0.001% | 512 | 1070592 | Yes | No
ABS-CFR | 2 × 10^10 | 2 × 10^13 | – | – | – | Yes | Yes
HUNL(full) | 1 × 10^161 | 1 × 10^164 | – | – | – | Yes | No

5.1 SETTINGS AND METRIC

We perform the ablation studies on Leduc Hold'em poker, a commonly used benchmark in the research community (Heinrich & Silver, 2016; Schmid et al., 2018; Steinberger, 2019; Lockhart et al., 2019). In our experiments, we test DNCFR on three Leduc Hold'em instances with stack sizes 5, 10, and 15, denoted Leduc(5), Leduc(10), and Leduc(15) respectively. To test DNCFR's scalability, we develop a neural agent to solve HUNL, which contains about 10^161 infosets (Johanson, 2013) and has served for decades as a challenging benchmark and milestone for solving IIGs. The rules of these games are given in Appendix A. The experiments are evaluated by exploitability, which is used as a standard win-rate measure in many key articles (Zinkevich et al., 2007; Lanctot et al., 2009; Bowling et al., 2015; Brown et al., 2018). The unit of exploitability in our paper is chips per game: it denotes how many chips one player wins on average per hand of poker. A method with lower exploitability is better, and the exploitability of a Nash equilibrium is zero. In extremely large games, where computing exploitability is intractable, we use head-to-head performance to compare agents. For reproducibility, we present the implementation details of the neural agent in Algorithms 2, 3, and 4; Appendix F.4 provides the parameters used in our experiments. Solving HUNL is a challenging task: although there are published papers (Moravcik et al., 2017; Brown & Sandholm, 2017), open-source code for such solvers is not available. Developing a HUNL solver not only requires tedious work, but it is also difficult to verify the correctness of the implementation because of the game's well-known high variance and extremely large size. In Appendix G, we provide several approaches to validate the correctness of our implementation for HUNL.

5.2 ABLATION STUDIES

We first conduct a set of ablation studies related to mini-batch training, robust sampling, and the choice of neural architecture on Leduc Hold'em.

Is mini-batch sampling helpful? We present the convergence curves of the proposed robust sampling method with $k = \max(|A(I_i)|)$ under different mini-batch sizes in Figure 8(a) in Appendix C. The experimental results show that larger batch sizes generally lead to better strategy profiles.

Figure 5: Log-log performance. (a) Individual effect of RSN and ASN. RS-MCCFR+ refers to the tabular mini-batch MCCFR+ method with the proposed robust sampling. RS-MCCFR+-RSN uses only the RSN network to learn the cumulative regret while keeping the cumulative strategy in a table. RS-MCCFR+-ASN uses only the ASN network. RS-MCCFR+-RSN-ASN refers to DNCFR with both RSN and ASN. (b) Warm start from tabular CFR and RS-MCCFR+. (c) DNCFR vs. XFP vs. NFSP. (d) Large Leduc(10) and Leduc(15).

Is robust sampling helpful? Figure 4(a) presents convergence curves for outcome sampling, external sampling ($k = \max(|A(I_i)|)$), and the proposed robust sampling method under different numbers of sampled actions.
Outcome sampling cannot converge to a low exploitability (it remains above 0.1 after 1000 iterations). The proposed robust sampling algorithm with $k = 1$, which samples only one trajectory like outcome sampling, achieves a better strategy profile after the same number of iterations. With increasing $k$, the robust sampling method achieves an even better convergence rate. The results show that $k = 3$ and $k = 5$ follow a trend similar to $k = \max(|A(I_i)|)$, which demonstrates that robust sampling achieves performance similar to external sampling while requiring less memory. We choose $k = 3$ for the later Leduc Hold'em experiments.

Is attention in the neural architecture helpful? Figure 4(b) shows that all the neural architectures achieve similar results, while LSTM with attention performs slightly better after a large number of iterations. We select LSTM plus attention as the default architecture in the later experiments.

Do the neural networks just memorize rather than generalize? One indication that the neural networks are generalizing is that they use far fewer parameters than their tabular counterparts. We experimented with LSTM-plus-attention networks with embedding sizes 8 and 16, containing 1048 and 2608 parameters respectively. Both are much smaller than the tabular memory (more than 11083 entries here) and still lead to a converging strategy profile, as shown in Figure 4(c). We select embedding size 16 as the default. In the later experiments, we show a similar conclusion on HUNL.

Do the neural networks generalize to unseen infosets? To investigate the generalization ability, we run DNCFR with small mini-batch sizes (b = 50, 100, 500), where only 3.08%, 5.59%, and 13.06% of infosets are observed in each iteration. In all these settings, DNCFR still converges and reaches exploitability below 0.1 within only 1000 iterations, as shown in Figure 4(d). In the later experiments, we set b = 100 as the default mini-batch size. We learn new parameters based on the old parameters and a subset of observed samples. All infosets share the same parameters, so the neural network can estimate values for unseen infosets. Note that the number of parameters is orders of magnitude smaller than the number of infosets in many settings, which indicates the generalization of our method. Furthermore, Figure 4(d) shows that DNCFR is slightly better than tabular MCCFR; we attribute this to generalization to unseen infosets.

What is the individual effect of RSN and ASN? Figure 5(a) presents an ablation study of the effects of the RSN and ASN networks respectively. Specifically, the RSN-only method employs RSN to learn the cumulative regret while the cumulative strategy is stored in tabular memory; similarly, the ASN-only method employs only ASN to learn the cumulative strategy. Both single-network methods perform only slightly better than DNCFR.

How well does continual improvement work? As shown in Figure 5(b), warm starting from either full-width or sampling-based CFR leads to continual improvement. Specifically, the first 10 iterations are learned by tabular CFR and RS-MCCFR+; after the behavior cloning in Section 3.4, the remaining iterations are continually improved by DNCFR.

How well does DNCFR do on larger games? We test DNCFR on the larger Leduc(10) and Leduc(15), which contain millions of infosets.
Even though only a small proportion of nodes are sampled in each iteration, Figure 5(d) shows that DNCFR still converges on these large games.

5.3 COMPARISON AND SPACE-TIME TRADE-OFF

How does DNCFR compare to its tabular counterpart, XFP, and NFSP? NFSP is the prior leading function approximation method for solving IIGs, based on reinforcement learning and fictitious self-play. In the experiment, NFSP requires two memories storing 2 × 10^5 state-action pairs and 2 × 10^6 samples for supervised learning respectively; these memory sizes exceed the number of infosets. Figure 5(c) shows that NFSP obtains a 0.06-Nash equilibrium after touching 10^9 infosets, and XFP obtains the same exploitability after touching about 10^7 nodes. XFP, however, is the precursor of NFSP and is updated by tabular, full-width fictitious play. Our DNCFR achieves the same performance while touching no more than 10^6 nodes, far fewer than both NFSP and XFP. The experiment shows that DNCFR converges significantly better than its reinforcement learning counterparts.

Space and time trade-off. In this experiment, we investigate the time and space needed for DNCFR to achieve a given exploitability relative to the tabular CFR algorithm. We compare their runtime and memory in Figure 6. Clearly, the number of infosets is much larger than the number of parameters used in DNCFR; for example, on Leduc(15), tabular CFR needs 128 times more memory than DNCFR. In the figure, the horizontal axis is the ratio between the runtime of DNCFR and CFR, and the vertical axis is the ratio between the sampled (observed) infosets of DNCFR and those of full-width tabular CFR. Note that the larger the sampling ratio, the more memory is needed to store the sampled values.

Figure 6: Time-space trade-off.

There is a clear trade-off between relative runtime and relative memory in DNCFR: the longer the relative runtime, the less relative memory DNCFR needs. It is reasonable to expect that a useful method should trade space for time fairly, i.e., a onefold increase in relative runtime should lead to a onefold decrease in relative memory (the dashed line in Figure 6, slope −1). Interestingly, DNCFR achieves a much better trade-off: for a onefold increase in relative runtime, DNCFR can obtain roughly a fivefold decrease in relative memory consumption (red line, slope −5). We believe this is due to the generalization ability of the learned neural networks. To present the time-space trade-off over a range of exploitabilities, we fix the exploitability at 1.0, 0.5, 0.1, 0.05, 0.01, and 0.005 and run both neural and tabular CFR on Leduc Hold'em; Figure 6 shows that DNCFR achieves a much better time-space trade-off throughout. We believe research on neural CFR is important future work, and running time is not the key limitation of DNCFR. Recent works (Schmid et al., 2018; Davis et al., 2019) provide strong variance reduction techniques for MCCFR and suggest a promising direction for DNCFR. In the future, we will combine DNCFR with the latest acceleration techniques and use multi-process or distributed computation to make it more efficient.

5.4 HEADS-UP NO-LIMIT TEXAS HOLD'EM

To test the scalability of DNCFR on an extremely large game, we develop a neural agent to solve HUNL. Directly solving HUNL is challenging even with abstraction techniques.
For example, ABS-CFR uses k-means to cluster similar cards into thousands of clusters. Although this is a rough abstraction of the original HUNL, the resulting agent still contains about 2 × 10^10 infosets and needs 80GB of memory to store its strategies. The working memory for training ABS-CFR is even larger (more than about 200GB), because it also needs to store cumulative regrets and other essential variables, such as the abstraction mapping. To make solving HUNL via deep learning tractable, we combine ideas from both DeepStack (Moravcik et al., 2017) and Libratus (Brown & Sandholm, 2017). First, we train flop and turn networks as in DeepStack and use these networks to predict counterfactual values given the two players' ranges and the pot size. Specifically, the flop network estimates values after the first three public cards are dealt, and the turn network estimates values after the fourth public card is dealt. After that, we train blueprint strategies as in Libratus; in contrast to Libratus, the blueprint strategies in our setting are learned by DNCFR. Because we have value networks to estimate counterfactual values, there is no need to reach the terminal nodes on the river.

To demonstrate the convergence of DNCFR, we first test it on HUNL(1). This game places no limit on the bet sizes, contains four actions at each infoset, and ends at the terminals where the first three public cards are dealt. HUNL(1) contains more than 2 × 10^8 infosets and 3 × 10^11 states, so it is tractable to compute its exploitability within a limited time. We believe this game is suitable for evaluating the scalability and generalization of DNCFR. Figure 7(a) shows the convergence of DNCFR for different embedding sizes (emd = 8, 16, 32, 64, 128). The smallest neural network contains only 608 parameters, while the largest contains 71168 parameters. It is reasonable to expect that a larger neural network typically achieves better performance, because more parameters help the network represent more complicated patterns and structures. Figure 7(b) presents the performance for different numbers of stochastic gradient descent (SGD) updates used to train the neural network in each MCCFR iteration. The results show that the number of SGD updates per iteration affects the asymptotic exploitability of DNCFR. This is reasonable because the neural network achieves a smaller loss as the number of gradient descent updates increases.

Match-up | Win (chips/h)
ABS-CFR | 9.8 ± 4.1
Tabular Agent | 0.7 ± 2.2

Figure 7: Performance of DNCFR on heads-up no-limit Texas Hold'em. (a) Log-log performance of DNCFR on HUNL(1) under different embedding sizes (emd = 8, 16, 32, 64, 128). (b) Log-log performance of DNCFR on HUNL(1) under different numbers of gradient descent updates per iteration (200, 2000, 10000). (c) Head-to-head win rates: DNCFR beats ABS-CFR by 9.8 ± 4.1 chips per game and achieves performance similar to its tabular version while using much less memory.

Finally, we measure the head-to-head performance of our neural agent against its tabular version and ABS-CFR on HUNL. ABS-CFR is a strong HUNL agent, the advanced version of the third-place agent in ACPC 2018. Although ABS-CFR uses both card and action abstraction techniques, it still needs 80GB of memory to store its strategies. More details about ABS-CFR are provided in Appendix G.1.
Although abstraction pathologies are well known in extensive games (Waugh et al., 2009), finer-grained abstraction typically leads to better strategies in many settings. Following this idea, we use DNCFR to learn blueprint strategies on HUNL(2), which is similar to HUNL(1) but contains eight actions at each infoset. HUNL(2) contains 8 × 10^10 infosets; such a large game size makes it intractable to perform subgame solving (Burch et al., 2014) in real time. For the later rounds, we use continual re-solving techniques to compute strategies in real time. The action sizes in the look-ahead tree are similar to Table S3 in Moravcik et al. (2017). The tabular agent is identical to our neural agent except that it uses tabular CFR to learn the blueprint strategies. With variance reduction techniques (Burch et al., 2018) applied (footnote 3), Figure 7(c) shows that our neural agent beats ABS-CFR by 9.8 ± 4.1 chips per game and performs comparably to its tabular counterpart (0.7 ± 2.2 chips per game). In contrast, our neural agent only needs to store 1070592 parameters, which requires much less memory than both the tabular agent and ABS-CFR.

Footnote 3: It is well known that head-to-head evaluation of HUNL is challenging because of its high variance. AIVAT is the state-of-the-art technique for reducing evaluation variance in poker.

6 RELATED WORK AND DISCUSSION

Solving IIGs via function approximation methods is an important and challenging problem. Neural Fictitious Self-Play (NFSP) (Heinrich & Silver, 2016) is a function approximation method based on deep reinforcement learning and a prior leading method for solving IIGs. However, fictitious play empirically converges more slowly than CFR-based approaches in many settings. Recently, Lockhart et al. (2019) propose a new framework to directly optimize the final policy against worst-case opponents; however, the authors consider only small games. Regression CFR (RCFR) (Waugh et al., 2015) is a function approximation method based on CFR, but it needs to traverse the full game tree, which is intractable in large games. In addition, RCFR uses hand-crafted features and regression trees to estimate the cumulative regret rather than learning features from data. Deep learning empirically performs better than regression trees in many areas, such as the Transformer and BERT in natural language modeling (Vaswani et al., 2017; Devlin et al., 2018). In the past year, the concurrent works Deep CFR (DCFR) (Brown et al., 2018) and Single Deep CFR (SDCFR) (Steinberger, 2019) have been proposed to address this problem via deep learning. DCFR, SDCFR, RCFR, and our DNCFR are all based on the framework of counterfactual regret minimization, but they differ in several important aspects, listed as follows. (1) We represent the extensive-form game by a recurrent neural network; the proposed LSTM with attention performs better than a fully connected network (see Section 3.2). (2) DNCFR updates the cumulative regret based only on the samples collected in the current iteration rather than on samples from a big reservoir (see Section 3.3.1). (3) It is important to use the squared loss for the average strategies rather than a log loss, because the log loss must be computed over the reservoir of samples up to the T-th iteration, which is very memory-expensive (see Section 3.3.2). (4) Another important aspect for making the deep learning model work is that we divide the regret by √T and renormalize it, because the cumulative regret can grow unboundedly
(see Section 3.3.1). (5) Also, DNCFR collects data by an efficient, unbiased mini-batch robust sampling method, which may be of independent interest to the IIG community (see Section 4). There are also big differences in the experimental evaluations: we conduct a set of ablation studies in various settings, which we believe are informative and could have a significant impact on this family of algorithms, and we evaluate DNCFR on extremely large games, whereas RCFR and SDCFR are evaluated only on small toy games.

7 CONCLUSIONS

We proposed a novel double neural counterfactual regret minimization approach to solve large IIGs by combining several novel techniques: recurrent neural representation, attention, robust sampling, and mini-batch MCCFR. We conducted a set of ablation studies, and the results show that these techniques may be of independent interest. This is a successful application of deep learning to large IIGs. We believe DNCFR and other related neural methods open up a promising direction for future work.

8 ACKNOWLEDGMENTS

We would like to thank Marc Lanctot, Neil Burch, Noam Brown, Tuomas Sandholm, Richard Gibson, Matej Moravcik, Martin Schmid, Viliam Lisy, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling for their help in developing large-scale heads-up no-limit Texas Hold'em. We appreciate the help of Zhibang Ge and Tao Jiang for their early exploration of this difficult project. We thank Lin Wang, Jun Zhou, Yongchao Liu, Yewei Huang, Junping Zhao, Jun Jiang, Jincan Kong, Shaohua Du, Yao Zhang, and Changhua He for supporting computation resources. We thank Chuanxiao Zhou, Shuai Xiao, Xiang Li, Yaxi Zhu, et al. for their suggestions and professional testing. Finally, we would like to thank the anonymous reviewers for their valuable feedback.

REFERENCES

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, 2017.
Brown, N., Lerer, A., Gross, S., and Sandholm, T. Deep counterfactual regret minimization. arXiv preprint arXiv:1811.00164, 2018.
Burch, N. Time and Space: Why Imperfect Information Games are Hard. PhD thesis, 2017.
Burch, N., Johanson, M., and Bowling, M. Solving imperfect information games using decomposition. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
Burch, N., Schmid, M., Moravcik, M., Morrill, D., and Bowling, M. AIVAT: A new variance reduction technique for agent evaluation in imperfect information games. In AAAI, 2018.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based encoder-decoder networks. arXiv preprint arXiv:1507.01053, 2015.
Davis, T., Schmid, M., and Bowling, M. Low-variance and zero-variance baselines for extensive-form games. arXiv preprint arXiv:1907.09633, 2019.
Desimone, R. and Duncan, J. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, pp. 193-222, 1995.
Ganzfried, S. and Sandholm, T.
Potential-aware imperfect-recall abstraction with earth mover's distance in imperfect-information games. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
Gibson, R. G. Regret minimization in games and the development of champion multiplayer computer poker-playing agents. PhD thesis, 2014.
Gilpin, A., Sandholm, T., and Sørensen, T. B. A heads-up no-limit Texas Hold'em poker player: discretized betting models and automatically generated equilibrium-finding programs. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, Volume 2, pp. 911-918, 2008.
Gordon, G. J. No-regret algorithms for structured prediction problems. Technical Report CMU-CALD-05-112, Carnegie Mellon University, School of Computer Science, 2005.
Harris, D. and Harris, S. Digital Design and Computer Architecture (2nd ed.). Morgan Kaufmann, ISBN 978-0-12-394424-5.
Heinrich, J. and Silver, D. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.
Heinrich, J., Lanctot, M., and Silver, D. Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pp. 805-813, 2015.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8), pp. 1735-1780, 1997.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Jin, P., Keutzer, K., and Levine, S. Regret minimization for partially observable deep reinforcement learning. arXiv preprint arXiv:1710.11424, 2017.
Johanson, M. Measuring the size of large no-limit poker games. arXiv preprint arXiv:1302.7008, 2013.
Johanson, M., Burch, N., Valenzano, R., and Bowling, M. Evaluating state-space abstractions in extensive-form games. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 271-278, 2013.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Lanctot, M. Monte Carlo sampling and regret minimization for equilibrium computation and decision-making in large extensive form games. PhD thesis, 2013.
Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte Carlo sampling for regret minimization in extensive games. In NIPS, 2009.
Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J., Morrill, D., Timbers, F., and Tuyls, K. Computing approximate equilibria in sequential adversarial games by exploitability descent. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), Macao, China, pp. 464-470, 2019.
Bowling, M., Burch, N., Johanson, M., and Tammelin, O. Heads-up limit hold'em poker is solved. Science, 347(6218), pp. 145-149, 2015.
Moravcik, M., Schmid, M., Burch, N., Lisy, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337), pp. 508-513, 2017.
Osborne, M., Lall, A., and Van Durme, B. Exponential reservoir sampling for streaming language models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 687-692, 2014.
Osborne, M. J. and Rubinstein, A. A Course in Game Theory, volume 1.
MIT Press, 1994.
Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2017.
Schmid, M., Burch, N., Lanctot, M., Moravcik, M., Kadlec, R., and Bowling, M. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. arXiv preprint arXiv:1809.03057, 2018.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 2016.
Southey, F., Bowling, M. P., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner, C. Bayes' bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411, 2012.
Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., and Bowling, M. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems, pp. 3422-3435, 2018.
Steinberger, E. Single deep counterfactual regret minimization. arXiv preprint arXiv:1901.07621, 2019.
Tammelin, O. Solving large imperfect information games using CFR+. arXiv preprint, 2014.
Waugh, K., Schnizlein, D., Bowling, M., and Szafron, D. Abstraction pathologies in extensive games. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pp. 781-788, 2009.
Waugh, K., Morrill, D., Bagnell, J. A., and Bowling, M. Solving games with functional regret estimation. In AAAI, 2015.
Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In NIPS, 2007.

A GAME RULES

A.1 ONE-CARD POKER

One-Card Poker is a two-player IIG of poker described by Gordon (2005). The rules are as follows. Each player is dealt one card from a deck of X cards. The first player can pass or bet. If the first player bets, the second player can call or fold. If the first player passes, the second player can pass or bet; if the second player bets, the first player can fold or call. The game ends with two passes, a call, or a fold. A player who folds loses 1 chip. If the game ends with two passes, the player with the higher card wins 1 chip; if the game ends with a call, the player with the higher card wins 2 chips.

A.2 LEDUC HOLD'EM

Leduc Hold'em is a two-player IIG of poker, first introduced in Southey et al. (2012). In Leduc Hold'em, there is a deck of 6 cards comprising two suits of three ranks; the cards are often denoted king, queen, and jack. A player may wager any number of chips up to a maximum of that player's remaining stack, and there is no limit on the number of raises or bets in each betting round. There are two rounds. In the first betting round, each player is dealt one card from the deck of 6 cards. In the second betting round, a community (or public) card is revealed from the 4 remaining cards. In this paper, we use Leduc(x) to refer to Leduc Hold'em with stack size x.

A.3 HEADS-UP NO-LIMIT TEXAS HOLD'EM

Heads-Up No-Limit Texas Hold'em (HUNL) has at most four betting rounds if neither of the two players folds during play. The four betting rounds are the preflop, flop, turn, and river. The rules are as follows. In the Annual Computer Poker Competition (ACPC), the two players each start with 20000 chips.
The player in the small blind position first puts 50 chips in the pot, while the player in the big blind position then puts 100 chips in the pot. The first round of betting follows. If the preflop betting round ends without a player folding, three public cards are revealed face-up on the table and the flop betting round occurs. After this round, one more public card is dealt (the turn) and the third round of betting takes place, followed by a fifth public card (the river) and a final round of betting. In no-limit poker a player can fold, call, or bet, and the bet size ranges from one big blind up to the number of chips the player has left in his stack.

B DEFINITION OF EXTENSIVE-FORM GAMES

B.1 DETAILED DEFINITIONS AND NOTATIONS

We define the components of an extensive-form game following Osborne & Rubinstein (1994) (pages 200-201).

A finite set $N = \{0, 1, \ldots, n-1\}$ of players. Define $x_i^v$ as the hidden variable of player $i$ in the IIG; e.g., in a poker game $x_i^v$ refers to the private cards of player $i$. $H$ refers to a finite set of histories. Each member $h = (x_i^v)_{i=0,1,\ldots,n-1}(a_l)_{l=0,\ldots,L-1} = x_0^v x_1^v \ldots x_{n-1}^v a_0 a_1 \ldots a_{L-1}$ of $H$ denotes a possible history (or state), which consists of each player's hidden variable and $L$ actions taken by the players, including chance. For player $i$, $h$ can also be written as $x_i^v x_{-i}^v a_0 a_1 \ldots a_{L-1}$, where $x_{-i}^v$ refers to the opponents' hidden variables. The empty sequence $\emptyset$ is a member of $H$. $h_j \sqsubseteq h$ denotes that $h_j$ is a prefix of $h$, where $h_j = (x_i^v)_{i=0,1,\ldots,n-1}(a_l)_{l=0,\ldots,L'-1}$ and $0 < L' < L$. $Z \subseteq H$ denotes the terminal histories, and any member $z \in Z$ is not a prefix of any other sequence. $A(h) = \{a : ha \in H\}$ is the set of available actions after non-terminal history $h \in H \setminus Z$. A player function $P$ assigns a member of $N \cup \{c\}$ to each non-terminal history, where $c$ denotes the chance player id, which is usually $-1$. $P(h)$ is the player who takes an action after history $h$.

$\mathcal{I}_i$ of a history $\{h \in H : P(h) = i\}$ is an information partition of player $i$. A set $I_i \in \mathcal{I}_i$ is an information set (infoset) of player $i$, and $I_i(h)$ refers to the infoset $I_i$ at state $h$. Generally, $I_i$ only contains the information observed by player $i$, including player $i$'s hidden variable and the public actions; therefore $I_i$ indicates a sequence in the IIG, i.e., $x_i^v a_0 a_2 \ldots a_{L-1}$. For $I_i \in \mathcal{I}_i$ we denote by $A(I_i)$ the set $A(h)$ and by $P(I_i)$ the player $P(h)$ for any $h \in I_i$. For each player $i \in N$, a utility function $u_i(z)$ defines the payoff at terminal state $z$.

For player $i$, the expected game utility $u_i^\sigma = \sum_{z \in Z} \pi^\sigma(z) u_i(z)$ of $\sigma$ is the expected payoff over all possible terminal nodes. Given a fixed strategy profile $\sigma_{-i}$, any strategy $\sigma_i^* = \arg\max_{\sigma_i' \in \Sigma_i} u_i^{(\sigma_i', \sigma_{-i})}$ of player $i$ that achieves maximal payoff against $\sigma_{-i}$ is a best response. For two-player extensive-form games, a Nash equilibrium is a strategy profile $\sigma^* = (\sigma_0^*, \sigma_1^*)$ such that each player's strategy is a best response to the opponent's. An $\epsilon$-Nash equilibrium is an approximation of a Nash equilibrium whose strategy profile $\sigma$ satisfies: $\forall i \in N$, $u_i^\sigma + \epsilon \ge \max_{\sigma_i' \in \Sigma_i} u_i^{(\sigma_i', \sigma_{-i})}$. The exploitability of a strategy $\sigma_i$ is defined as $\epsilon_i(\sigma_i) = u_i^{\sigma^*} - u_i^{(\sigma_i, \sigma_{-i}^*)}$. A strategy is unexploitable if $\epsilon_i(\sigma_i) = 0$. In large two-player zero-sum games such as poker, $u_i^{\sigma^*}$ is intractable to compute. However, if the players alternate positions, the value of a pair of games is zero, i.e., $u_0^{\sigma^*} + u_1^{\sigma^*} = 0$. We define the exploitability of strategy profile $\sigma$ as $\epsilon(\sigma) = \left( u_1^{(\sigma_0, \sigma_1^*)} + u_0^{(\sigma_0^*, \sigma_1)} \right) / 2$.
B.2 EXPLANATION BY EXAMPLE

To provide a more detailed explanation, Figure 1 presents an illustration of a partial game tree in One-Card Poker. In the first tree, the two players are dealt (queen, jack), as shown in the left subtree, and (queen, king), as shown in the right subtree. $z_i$ denotes a terminal node and $h_i$ denotes a non-terminal node. There are 19 distinct nodes in the left tree, corresponding to 9 non-terminal nodes (including the chance node $h_0$) and 10 terminal nodes. The trajectory from the root to each node is a history of actions. In an extensive-form game, $h_i$ refers to this history. For example, $h_3$ consists of actions 0:Q, 1:J, and P; $h_7$ consists of actions 0:Q, 1:J, P, and B; $h_8$ consists of actions 0:Q, 1:K, P, and B. We have $h_3\sqsubseteq h_7$, $A(h_3)=\{P,B\}$, and $P(h_3)=1$.

In an IIG, the private card of player 1 is invisible to player 0, so $h_7$ and $h_8$ are actually indistinguishable for player 0. We use an infoset to denote the set of such indistinguishable states. Similarly, $h_1$ and $h_2$ are in the same infoset. For the right tree of Figure 1, $h_3'$ and $h_5'$ are in the same infoset, and $h_4'$ and $h_6'$ are in the same infoset. Generally, any $I_i\in\mathcal{I}$ only records the information observed by player $i$, including player $i$'s hidden variable and the public actions. For example, the infoset of $h_7$ and $h_8$ indicates the sequence 0:Q, P, B. Because $h_7$ and $h_8$ are indistinguishable to player 0 in the IIG, all such states share the same strategy. For example, if $I_0$ is the infoset of $h_7$ and $h_8$, we have $I_0=I_0(h_7)=I_0(h_8)$, $\sigma_0(I_0)=\sigma_0(h_7)=\sigma_0(h_8)$, and $\sigma_0(a|I_0)=\sigma_0(a|h_7)=\sigma_0(a|h_8)$.

B.3 DETAILED DEFINITION OF STRATEGY AND NASH EQUILIBRIUM

A strategy profile $\sigma=\{\sigma_i\,|\,\sigma_i\in\Sigma_i, i\in N\}$ is a collection of strategies for all players, where $\Sigma_i$ is the set of all possible strategies for player $i$. $\sigma_{-i}$ refers to the strategies of all players other than player $i$. For player $i\in N$, the strategy $\sigma_i(I_i)$ is a function that assigns an action distribution over $A(I_i)$ to infoset $I_i$. $\sigma_i(a|h)$ denotes the probability of action $a$ taken by player $i\in N\cup\{c\}$ at state $h$. In an IIG, $\forall h_1,h_2\in I_i$, we have $I_i=I_i(h_1)=I_i(h_2)$, $\sigma_i(I_i)=\sigma_i(h_1)=\sigma_i(h_2)$, and $\sigma_i(a|I_i)=\sigma_i(a|h_1)=\sigma_i(a|h_2)$. For iterative methods such as CFR, $\sigma^t$ refers to the strategy profile at the $t$-th iteration.

The state reach probability of history $h$ is denoted by $\pi^\sigma(h)$ if players take actions according to $\sigma$. For the empty sequence, $\pi^\sigma(\emptyset)=1$. The reach probability can be decomposed into $\pi^\sigma(h)=\prod_{i\in N\cup\{c\}}\pi_i^\sigma(h)=\pi_i^\sigma(h)\pi_{-i}^\sigma(h)$ according to each player's contribution, where $\pi_i^\sigma(h)=\prod_{h'a\sqsubseteq h,\,P(h')=i}\sigma_i(a|h')$ and $\pi_{-i}^\sigma(h)=\prod_{h'a\sqsubseteq h,\,P(h')\neq i}\sigma_{-i}(a|h')$. The infoset reach probability of $I_i$ is defined as $\pi^\sigma(I_i)=\sum_{h\in I_i}\pi^\sigma(h)$. If $h'\sqsubseteq h$, the interval state reach probability from state $h'$ to $h$ is defined as $\pi^\sigma(h',h)$, and we have $\pi^\sigma(h',h)=\pi^\sigma(h)/\pi^\sigma(h')$. The quantities $\pi_i^\sigma(I_i)$, $\pi_{-i}^\sigma(I_i)$, $\pi_i^\sigma(h',h)$, and $\pi_{-i}^\sigma(h',h)$ are defined similarly.

C ADDITIONAL EXPERIMENT RESULTS

Figure 8(a) shows that robust sampling with a larger mini-batch size yields better performance. This is reasonable because a larger mini-batch size leads to more sampled infosets in each iteration, at the cost of more memory to store the sampled values. If $b=1$, only one block is sampled in each iteration. The results demonstrate that a larger mini-batch size generally leads to faster convergence. Because the mini-batches are easy to sample in a parallel fashion on a large-scale distributed system, this method is very efficient. In practice, we can specify a suitable mini-batch size according to the available computation and memory.
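The parallel collection of sampled blocks mentioned above can be sketched as follows. This is a minimal illustration only: `sample_block` is a hypothetical stand-in for one robust-sampling traversal of the game tree, and the averaging mirrors the unbiased mini-batch estimator of Theorem 1.

```python
import random
from multiprocessing import Pool

def sample_block(seed):
    """Stand-in for one robust-sampling traversal. A real implementation would
    return the sampled (infoset, regret) and (infoset, strategy) tuples; here we
    simply return a sampled scalar to illustrate the averaging."""
    rng = random.Random(seed)
    return rng.random()

def mini_batch_estimate(b, workers=4):
    """Sample b independent blocks in parallel and average them. Because every
    block is drawn independently, averaging keeps the estimate unbiased."""
    with Pool(workers) as pool:
        blocks = pool.map(sample_block, range(b))
    return sum(blocks) / len(blocks)

if __name__ == "__main__":
    print(mini_batch_estimate(b=100))
```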
In Figure 8(b), we compare the proposed robust sampling against Average Strategy (AS) sampling (Gibson, 2014) on Leduc Hold'em (stack = 5). We set the mini-batch size of MCCFR to $b=100$ and $k=2$ for robust sampling. The parameters of average strategy sampling are set to $\epsilon=k/|A(I)|$, $\tau=0$, and $\beta=0$. After 1000 iterations, our robust sampling performs better than AS. More specifically, if $k=1$, the exploitability of our robust sampling is 0.5035 while that of AS is 0.5781; if $k=2$, the exploitability of our robust sampling is 0.2791 while that of AS is 0.3238. Robust sampling samples $\min(k,|A(I)|)$ of player $i$'s actions, while AS samples a random number of player $i$'s actions. Note that if $\rho$ is small or the number of actions is small, there is a chance that the random number drawn between 0 and 1 is larger than $\rho$ for every action, in which case AS samples zero actions. Therefore, AS has a higher variance than our robust sampling. In addition, according to Gibson (2014), the parameter ranges of AS are $\epsilon\in(0,1]$, $\tau\in[1,\infty)$, and $\beta\in[0,\infty)$; the case $\tau<1$ was not analyzed experimentally.

Figure 8: Comparison of different CFR-family methods on Leduc Hold'em. (a) Performance of robust sampling with different mini-batch sizes (RS-MCCFR+ with b = 1, 1000, 5000, 10000). (b) Robust sampling vs. average strategy sampling (k = 1, 2).

D THEORETICAL ANALYSIS

D.1 REACH PROBABILITY AND POSTERIOR PROBABILITY

Lemma 2. The reach probability of the opponent is proportional to the posterior probability of the opponent's hidden variable, i.e., $p(x_{-i}^v|I_i)\propto\pi_{-i}^\sigma(h)$, where $x_{-i}^v$ and $I_i$ indicate a particular $h$.

For player $i$ at infoset $I_i$ and a fixed strategy profile $\sigma_i$, $\pi_i^\sigma(h)$ is constant for all $h\in I_i$. Based on the definition of an extensive-form game, the combination of $I_i$ and the opponent's hidden state $x_{-i}^v$ indicates a particular history $h=x_i^v x_{-i}^v a_0 a_1\dots a_{L-1}$. With Bayes' theorem, we can infer the posterior probability of the opponent's private cards by Equation (9):
$$p(x_{-i}^v|I_i)=\frac{p(x_{-i}^v,I_i)}{p(I_i)}=\frac{p(h)}{p(I_i)}\propto p(h)=p(x_i^v)p(x_{-i}^v)\prod_{l=1}^{L-1}\sigma_{P(x_i^v x_{-i}^v a_0\dots a_{l-1})}(a_l\,|\,x_i^v x_{-i}^v a_0\dots a_{l-1})=\pi^\sigma(h)=\pi_i^\sigma(h)\pi_{-i}^\sigma(h)\propto\pi_{-i}^\sigma(h). \tag{9}$$

D.2 ROBUST SAMPLING, OUTCOME SAMPLING AND EXTERNAL SAMPLING

For robust sampling, given a strategy profile $\sigma$ and the block $Q_j$ sampled according to the sampled profile $\sigma^{rs(k)}=(\sigma_i^{rs(k)},\sigma_{-i})$, we have $q(z)=\pi_i^{\sigma^{rs(k)}}(z)\pi_{-i}^\sigma(z)$, and the regret of action $a\in A^{rs(k)}(I_i)$ is
$$\begin{aligned}
\tilde r_i^\sigma((a|I_i)\,|\,Q_j) &= \tilde v_i^\sigma((a|I_i)\,|\,Q_j)-\tilde v_i^\sigma(I_i\,|\,Q_j)\\
&=\sum_{z\in Q_j,\,ha\sqsubseteq z,\,h\in I_i}\frac{1}{q(z)}\pi_{-i}^\sigma(z)\pi_i^\sigma(ha,z)u_i(z)-\sum_{z\in Q_j,\,h\sqsubseteq z,\,h\in I_i}\frac{1}{q(z)}\pi_{-i}^\sigma(z)\pi_i^\sigma(h,z)u_i(z)\\
&=\sum_{z\in Q_j,\,ha\sqsubseteq z,\,h\in I_i}\frac{u_i(z)}{\pi_i^{\sigma^{rs(k)}}(z)}\pi_i^\sigma(ha,z)-\sum_{z\in Q_j,\,h\sqsubseteq z,\,h\in I_i}\frac{u_i(z)}{\pi_i^{\sigma^{rs(k)}}(z)}\pi_i^\sigma(h,z)\\
&=\sum_{z\in Q_j,\,ha\sqsubseteq z,\,h\in I_i}\pi_i^\sigma(ha,z)u_i^{rs}(z)-\sum_{z\in Q_j,\,h\sqsubseteq z,\,h\in I_i}\pi_i^\sigma(h,z)u_i^{rs}(z),
\end{aligned}$$
where $u_i^{rs}(z)=u_i(z)/\pi_i^{\sigma^{rs(k)}}(z)$ is the utility weighted by the reach probability $\pi_i^{\sigma^{rs(k)}}(z)$. Because the weighted utility no longer requires explicit knowledge of the opponent's strategy, we can use this sampling method for online regret minimization.

Generally, if player $i$ randomly selects $\min(k,|A(I_i)|)$ actions according to the discrete uniform distribution $\mathrm{unif}(0,|A(I_i)|)$ at infoset $I_i$, i.e., $\sigma_i^{rs(k)}(a|I_i)=\frac{\min(k,|A(I_i)|)}{|A(I_i)|}$, then
$$\pi_i^{\sigma^{rs(k)}}(I_i)=\prod_{h\in I_i,\,h'a\sqsubseteq h,\,h'\in I_i'}\frac{\min(k,|A(I_i')|)}{|A(I_i')|}, \tag{11}$$
where the product ranges over player $i$'s infosets $I_i'$ on the path to $I_i$, and $u_i^{rs}(z)$ is a constant once the sampled profile $\sigma^{rs(k)}$ is given.
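As a concrete illustration of the uniform sampled profile $\sigma_i^{rs(k)}$ above, the following is a minimal sketch of how player $i$'s actions could be drawn and how the per-decision sampling probability in Equation (11) arises; the function name and the action labels are illustrative, not taken from our implementation.

```python
import random

def sample_robust_actions(actions, k, rng=random):
    """Sample min(k, |A(I_i)|) distinct actions uniformly at random, as in the
    uniform robust-sampling profile sigma_i^{rs(k)}. Returns the sampled actions
    and the per-action sampling probability min(k, |A|)/|A|; the product of these
    probabilities along player i's decisions gives pi_i^{rs(k)} in Eq. (11)."""
    m = min(k, len(actions))
    sampled = rng.sample(actions, m)
    prob = m / len(actions)
    return sampled, prob

# Toy usage: at an infoset with three actions and k = 2,
# each action is kept with probability 2/3.
acts, p = sample_robust_actions(["fold", "call", "raise"], k=2)
print(acts, p)
```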
Specifically, if $k=\max_{I_i\in\mathcal{I}}|A(I_i)|$, then $\pi_i^{\sigma^{rs(k)}}(I_i)=1$, $u_i^{rs(k)}(z)=u_i(z)$, and
$$\tilde r_i^\sigma((a|I_i)\,|\,Q_j)=\sum_{z\in Q_j,\,h\sqsubseteq z,\,h\in I_i}u_i(z)\big(\pi_i^\sigma(ha,z)-\pi_i^\sigma(h,z)\big). \tag{12}$$
Therefore, robust sampling is the same as external sampling when $k=\max_{I_i\in\mathcal{I}}|A(I_i)|$. For large games, because one player has to take all actions at her infosets, external sampling is intractable. Robust sampling is more flexible and memory-efficient than external sampling: in practice we can specify a suitable $k$ according to our memory budget, and experimentally a smaller $k$ achieves a convergence rate similar to external sampling.

If $k=1$ and $\sigma_i^{rs(k)}=\sigma_i$, only one history $z$ is sampled in this case; then $u_i^{rs(k)}(z)=\frac{u_i(z)}{\pi_i^{\sigma_i}(z)}$ and, for $h\in I_i$ and $a\in A^{rs(k)}(I_i)$,
$$\tilde r_i^\sigma((a|I_i)\,|\,Q_j)=\tilde r_i^\sigma((a|h)\,|\,Q_j)=\sum_{z\in Q_j,\,ha\sqsubseteq z}\pi_i^\sigma(ha,z)u_i^{rs}(z)-\sum_{z\in Q_j,\,h\sqsubseteq z}\pi_i^\sigma(h,z)u_i^{rs}(z)=(1-\sigma_i(a|h))\,u_i(z).$$
For $a\notin A^{rs(k)}(I_i)$, the regret is $\tilde r_i^\sigma((a|h)|j)=0-\tilde v_i^\sigma(h|j)$. If we add exploration and guarantee $q(z)\ge\delta>0$, then robust sampling is the same as outcome sampling when $k=1$ and $\sigma_i^{rs(k)}=\sigma_i$.

If $k=1$ and player $i$ randomly selects one action according to the discrete uniform distribution $\mathrm{unif}(0,|A(I_i)|)$ at infoset $I_i$, then $u_i^{rs(1)}(z)=\frac{u_i(z)}{\pi_i^{\sigma^{rs(k)}}(z)}$ is a constant and, for $h\in I_i$ and $a\in A^{rs(k)}(I_i)$,
$$\tilde r_i^\sigma((a|I_i)\,|\,Q_j)=\sum_{z\in Q_j,\,ha\sqsubseteq z,\,h\in I_i}\pi_i^\sigma(ha,z)u_i^{rs}(z)-\sum_{z\in Q_j,\,h\sqsubseteq z,\,h\in I_i}\pi_i^\sigma(h,z)u_i^{rs}(z)=(1-\sigma_i(a|h))\,\pi_i^\sigma(ha,z)\,u_i^{rs(1)}(z).$$
If action $a$ is not sampled at state $h$, the regret is $\tilde r_i^\sigma((a|h)|j)=0-\tilde v_i^\sigma(h|j)$. Compared with outcome sampling, robust sampling in this case converges more efficiently. Note that in our experiments we select this sampling policy as the default robust sampling when $k=1$.

D.3 UNBIASED MINI-BATCH MCCFR

Theorem 1. $\mathbb{E}_{Q_j\sim\text{mini-batch}}\big[\tilde v_i^\sigma(I_i|b)\big]=v_i^\sigma(I_i)$.

In this section, we prove that mini-batch MCCFR gives an unbiased estimate of the counterfactual value. Writing $\tilde v_i^\sigma(I_i|b)=\frac{1}{b}\sum_{j=1}^{b}\tilde v_i^\sigma(I_i|Q_j)$, we have
$$\mathbb{E}_{Q_j\sim\text{mini-batch}}\big[\tilde v_i^\sigma(I_i|b)\big]=\mathbb{E}\Big[\frac{1}{b}\sum_{j=1}^{b}\sum_{h\in I_i,\,z\in Q_j,\,h\sqsubseteq z}\frac{\pi_{-i}^\sigma(z)\pi_i^\sigma(h,z)u_i(z)}{q(z)}\Big]=\frac{1}{b}\sum_{j=1}^{b}\mathbb{E}\big[\tilde v_i^\sigma(I_i|Q_j)\big]=\frac{1}{b}\sum_{j=1}^{b}v_i^\sigma(I_i)=v_i^\sigma(I_i),$$
where the second-to-last equality uses the fact that each sampled block gives an unbiased estimate, i.e., $\mathbb{E}[\tilde v_i^\sigma(I_i|Q_j)]=v_i^\sigma(I_i)$.

E SEQUENTIAL REPRESENTATION AND RECURRENT NEURAL NETWORK WITH ATTENTION

In order to define our R and S networks, we need to represent the infosets $I_i\in\mathcal{I}$ of extensive-form games. In such games, players take actions in an alternating fashion and each player makes decisions according to the observed history. In this paper, we model the behavior sequence as a recurrent neural network, where each action in the sequence corresponds to a cell in the RNN. Figure 3(a) provides an illustration of the proposed deep sequential neural network representation for infosets. In a standard RNN, the recurrent cell has a very simple structure, such as a single tanh or sigmoid layer. Hochreiter & Schmidhuber (1997) proposed long short-term memory (LSTM) with a gating mechanism, which outperforms the standard version and is capable of learning long-term dependencies; we therefore use LSTM for the representation. Furthermore, because different positions in the sequence may contribute differently to the decision making, we add an attention mechanism (Desimone & Duncan, 1995; Cho et al., 2015) to the LSTM architecture to enhance the representation. For example, the player may need to take a more aggressive strategy after beneficial public cards are revealed in a poker game, so the information after the public cards are revealed may be more important.
More specifically, for the $l$-th cell, define $x_l$ as the input vector, which can encode either player or chance actions, $e_l$ as the hidden-layer embedding, and $\phi$ as a general nonlinear function. Each action is represented by an LSTM cell, which can remove or add information to the cell state through three different gates. Let $\odot$ denote the element-wise product. The forget gate is defined as $g_l^f=\phi^f(w^f[x_l,e_{l-1}])$, where $[x_l,e_{l-1}]$ denotes the concatenation of $x_l$ and $e_{l-1}$. The input gate, which decides which values to update, is defined as $g_l^i=\phi^i(w^i[x_l,e_{l-1}])$. A nonlinear layer outputs a vector of new candidate values $\tilde C_l=\phi^c(w^c[x_l,e_{l-1}])$, which decides what can be added to the state. Given the forget gate and the input gate, the new cell state is updated as $C_l=g_l^f\odot C_{l-1}+g_l^i\odot\tilde C_l$. The output gate is defined as $g_l^o=\phi^o(w^o[x_l,e_{l-1}])$. Finally, the updated hidden embedding is $e_l=g_l^o\odot\phi^e(C_l)$.

As shown in Figure 3(a), for each LSTM cell $j$, an attention weight is learned by an attention network; each such weight is a scalar $\alpha_j=\phi^a(w^a e_j)$. The attention embedding of the $l$-th cell is then defined as $e_l^a=\sum_{j=1}^{l}\alpha_j e_j$, the summation of the hidden embeddings $e_j$ weighted by the learned attention weights $\alpha_j$. The final output of the network is predicted by a value network, defined as
$$y_l:=f(a,I_i|\theta)=w^y\phi^v(e_l^a)=w^y\phi^v\Big(\sum_{j=1}^{l}\phi^a(w^a e_j)\,e_j\Big),$$
where $\theta$ refers to the parameters of the defined sequential neural network. Specifically, $\phi^f$, $\phi^i$, and $\phi^o$ are sigmoid functions; $\phi^c$ and $\phi^e$ are hyperbolic tangent functions; $\phi^a$ and $\phi^v$ are rectified linear functions.

Remark. The proposed RSN and ASN share the same neural architecture but use different parameters, that is, $R(a,I_i|\theta_R^t)=f(a,I_i|\theta_R^t)$ and $S(a,I_i|\theta_S^t)=f(a,I_i|\theta_S^t)$. $R(\cdot,I_i|\theta_R^t)$ and $S(\cdot,I_i|\theta_S^t)$ denote vectors of predicted values for all $a\in A(I_i)$.

F OPTIMIZING NEURAL REPRESENTATION AND IMPLEMENTATION

F.1 CODE FOR THE DNCFR FRAMEWORK

Algorithm 2 provides a summary of the proposed double neural counterfactual regret minimization method. Specifically, in the first iteration, if we start the optimization from a tabular-based method, the techniques in Section 3.4 should be used to clone the cumulative regrets and the average strategy, which are used to initialize RSN and ASN respectively. If there is no warm-start initialization, we can start the algorithm by randomly initializing the parameters of RSN and ASN. After either kind of initialization, we use a sampling method, such as the proposed robust sampling, to collect training samples (infosets and the corresponding values), which are saved in the memories $M_R^t$ and $M_S^t$ respectively. These samples are then used by the NeuralAgent procedure of Algorithm 3 to optimize RSN and ASN. Algorithm 4 provides the implementation of the proposed mini-batch robust-sampling MCCFR. Note that, with the help of the mini-batch technique in Section 4, we can collect training samples in parallel on multiple processors or distributed systems while retaining an unbiased estimate according to Theorem 1. Accelerated training and distributed implementation are beyond the scope of this paper; to compare the performance of DNCFR and tabular CFR, all of our experiments run on a single processor.
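To make the sequential representation of Appendix E concrete, the following is a minimal PyTorch sketch of $f(a,I_i|\theta)$, the architecture shared by RSN and ASN. It is a sketch under stated assumptions, not the authors' released code: the layer sizes, variable names, and the use of `nn.LSTM` in place of hand-written gate equations are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqValueNet(nn.Module):
    """Sketch of f(a, I_i | theta): an LSTM over the action sequence of an
    infoset, a scalar attention weight per step, and a linear value head that
    outputs one value per action in A(I_i)."""

    def __init__(self, input_dim, hidden_dim, num_actions):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)                  # alpha_j = relu(w^a e_j)
        self.value_out = nn.Linear(hidden_dim, num_actions)   # y_l = w^y phi^v(e_l^a)

    def forward(self, seq):
        # seq: (batch, length, input_dim) encoding of the observed action sequence
        embeddings, _ = self.lstm(seq)                 # hidden embeddings e_1, ..., e_l
        alpha = F.relu(self.attn(embeddings))          # (batch, length, 1)
        attended = (alpha * embeddings).sum(dim=1)     # e_l^a = sum_j alpha_j e_j
        return self.value_out(F.relu(attended))        # one value per action

# RSN and ASN share this architecture but keep separate parameters (theta_R, theta_S).
rsn = SeqValueNet(input_dim=16, hidden_dim=16, num_actions=3)
asn = SeqValueNet(input_dim=16, hidden_dim=16, num_actions=3)
print(rsn(torch.zeros(1, 4, 16)).shape)  # torch.Size([1, 3])
```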
F.2 CODE FOR NEURAL NETWORKS

Algorithm 3: Optimization of the deep neural network

Function NeuralAgent(f(.|θ^{T-1}), M, θ^{T-1}, β*):
    initialize the optimizer and the scheduler
    θ^T <- θ^{T-1}, l_best <- +inf, t_best <- 0
    for t = 1 to β_epoch do
        loss <- []
        for each mini-batch do
            {x^(i), y^(i)}_{i=1}^m <- sample m training examples from M
            batch_loss <- (1/m) Σ_{i=1}^m ( f(x^(i)|θ^{T-1}) + y^(i) - f(x^(i)|θ^T) )^2
            back-propagate batch_loss with learning rate lr
            clip the gradient of θ^T to [-ε, ε]^d
            optimizer step on batch_loss
            loss.append(batch_loss)
        lr <- scheduler(lr)
        if avg(loss) < β_loss then θ^T_best <- θ^T, early stopping
        else if avg(loss) has not improved within β_re epochs then lr <- β_lr
    return θ^T

Notation for the neural networks. Define β_epoch as the number of training epochs, β_lr as the learning rate, β_loss as the criterion for early stopping, β_re as the upper bound on the number of epochs since the minimal loss was last improved, θ^{t-1} as the old parameters learned in iteration t-1, f(.|θ^{t-1}) as the neural network, and M as the training samples, which include infosets and the corresponding targets. To simplify notation, we use β* to denote the set of hyperparameters of the proposed deep neural networks; β_R* and β_S* refer to the hyperparameter sets of RSN and ASN respectively.

Optimizing the neural networks. Algorithm 3 provides the implementation of the optimization technique for both RSN and ASN. Both $R(a,I_i|\theta_R^{t+1})$ and $S(a,I_i|\theta_S^t)$ are optimized by mini-batch stochastic gradient descent. In this paper, we use the Adam optimizer (Kingma & Ba, 2014) with both momentum and adaptive learning rates. We also tried replacing Adam with other optimizers such as Nadam and RMSprop (Ruder, 2017); however, these optimizers did not achieve better experimental results. In practice, existing optimizers (Ruder, 2017) may not reach a sufficiently low loss because of potential saddle points or local minima. To obtain higher accuracy and a lower optimization loss, we design a scheduler that reduces the learning rate when the loss has stopped decreasing. Specifically, the scheduler reads a metric, e.g., the mean squared error; if no improvement is seen for a number of epochs, the learning rate is reduced by a factor. In addition, we reset the learning rate in both the optimizer and the scheduler once the loss stops decreasing within β_re epochs. A gradient clipping mechanism is used to limit the magnitude of the parameter gradients and to make the optimizer behave better in the vicinity of steep cliffs. After each epoch, the best parameters, which lead to the minimum loss, replace the old parameters. An early stopping mechanism is triggered once the lowest loss is less than the specified criterion β_loss.

The features are encoded as follows. As shown in Figure 3(a), for a history $h$ and player $P(h)$, we use vectors to represent the observed actions, including those of the chance player. For example, on Leduc Hold'em, the input feature $x_l$ of the $l$-th cell is the concatenation of three one-hot features: the given private cards, the revealed public cards, and the current action $a$. Both the private cards and the public cards are encoded with the one-hot technique (Harris & Harris), where the value at the corresponding position is 1 and the others are 0. If there are no public cards, the respective positions are filled with 0. The betting chips are represented by the normalized cumulative spent, i.e., the cumulative chips divided by the stack size. For HUNL, each card is encoded by a vector of length 17: 13 for the rank embedding and 4 for the suit embedding.
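Before turning to the action features described next, here is a minimal sketch of the 17-dimensional HUNL card encoding discussed above. The rank and suit orderings below are illustrative assumptions; only the 13-plus-4 one-hot layout is taken from the text.

```python
import numpy as np

RANKS = "23456789TJQKA"   # illustrative ordering of the 13 ranks
SUITS = "cdhs"            # illustrative ordering of the 4 suits

def encode_card(card):
    """Encode a card such as 'As' or 'Td' as a length-17 vector:
    13 one-hot entries for the rank followed by 4 one-hot entries for the suit."""
    vec = np.zeros(17, dtype=np.float32)
    vec[RANKS.index(card[0])] = 1.0
    vec[13 + SUITS.index(card[1])] = 1.0
    return vec

print(encode_card("As"))  # rank 'A' -> position 12, suit 's' -> position 16
```

The action features described in the next paragraph are handled analogously: one-hot action indicators plus the normalized cumulative spent for raises.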
The actions in the public sequence are represented by one-hot vectors, and a raise action is additionally represented by the normalized cumulative spent.

F.3 CODE FOR MINI-BATCH ROBUST SAMPLING MCCFR

Algorithm 4: Mini-batch RS-MCCFR with double neural networks

Function Mini-Batch-MCCFR-NN(t):
    M_R^t <- {}, M_S^t <- {}
    for i = 1 to b do in parallel
        MCCFR-NN(t, ∅, 0, 1, 1)
        MCCFR-NN(t, ∅, 1, 1, 1)
    return M_R^t, M_S^t

Function MCCFR-NN(t, h, i, π_i, π_i^{rs(k)}):
    I_i <- I_i(h)
    if h ∈ Z then
        return u_i(h) / π_i^{rs(k)}
    else if P(h) = -1 then                         (chance player)
        a ~ σ_{-i}(I_i)
        return MCCFR-NN(t, ha, i, π_i, π_i^{rs(k)})
    else if P(h) = i then
        R̂_i(.|I_i) <- R(., I_i|θ_R^t) if t > 1 else 0
        σ_i(I_i) <- CalculateStrategy(R̂_i(.|I_i), I_i)
        v_i(h) <- 0, r_i(.|I_i) <- 0, s_i(.|I_i) <- 0
        A^{rs(k)}(I_i) <- sample k different actions according to σ_i^{rs(k)}
        for a ∈ A^{rs(k)}(I_i) do
            v_i(a|h) <- MCCFR-NN(t, ha, i, π_i σ_i(a|I_i), π_i^{rs} σ_i^{rs(k)}(a|I_i))
            v_i(h) <- v_i(h) + v_i(a|h) σ_i(a|I_i)
        for a ∈ A^{rs(k)}(I_i) do
            r_i(a|I_i) <- v_i(a|h) - v_i(h)
            s_i(a|I_i) <- π_i^σ(I_i) σ_i(a|I_i)
        store the updated cumulative regret tuple (I_i, r_i(.|I_i)) in M_R^t
        store the updated current strategy tuple (I_i, s_i(.|I_i)) in M_S^t
        return v_i(h)
    else
        R̂_{-i}(.|I_i) <- R(., I_i|θ_R^t) if t > 1 else 0
        σ_{-i}(I_i) <- CalculateStrategy(R̂_{-i}(.|I_i), I_i)
        a ~ σ_{-i}(I_i)
        return MCCFR-NN(t, ha, i, π_i, π_i^{rs(k)})

Function CalculateStrategy(R_i(.|I_i), I_i):
    sum <- Σ_{a ∈ A(I_i)} max(R_i(a|I_i), 0)
    for a ∈ A(I_i) do
        σ_i(a|I_i) <- max(R_i(a|I_i), 0) / sum if sum > 0 else 1/|A(I_i)|
    return σ_i(I_i)

Algorithm 4 presents one application scenario of the proposed mini-batch robust sampling method. The function MCCFR-NN traverses the game tree like tabular MCCFR, starting from the root $h=\emptyset$. Define $I_i$ as the infoset of $h$ and suppose that player $i$ samples $k$ actions according to robust sampling. Algorithm 4 proceeds as follows. If the history is terminal, the function returns the weighted utility. If the current player is the chance player, one action $a\in A(I_i)$ is sampled according to the strategy $\sigma_{-i}(I_i)$ and added to the history, i.e., $h\leftarrow ha$. If $P(I_i)=i$, the current strategy is computed from the cumulative regret predicted by RSN. We then sample $k$ actions according to the specified sampled strategy profile $\sigma_i^{rs(k)}$. After the recursive updates, we obtain the counterfactual value and the regret of each sampled action at $I_i$. For the observed nodes, the counterfactual regrets and the numerators of the corresponding average strategy are stored in $M_R^t$ and $M_S^t$ respectively. If $P(I_i)$ is the opponent, only one action is sampled according to the strategy $\sigma_{-i}(I_i)$. The function Mini-Batch-MCCFR-NN presents the mini-batch sampling method, where $b$ blocks are sampled in parallel. This mini-batch method helps MCCFR achieve an unbiased estimate of the counterfactual value, and the parallel implementation makes it efficient in practice.

Remark: We update the average strategy in the branch $P(h)=i$, which potentially leads to a biased estimate of the average strategy. There is a trade-off among unbiasedness, convergence, and data efficiency in Algorithm 4. A feasible solution is stochastically-weighted averaging (SWA); however, SWA typically leads to a large variance, as discussed in Lanctot's Ph.D. thesis (Lanctot, 2013, p. 49). Classical external sampling (ES) avoids this problem by only updating the average strategy for $-i$. Because ES samples $k=|A(I_i)|$ actions for $i$ and only one action for $-i$, it is inefficient for collecting average-strategy samples at $-i$ in neural CFR. In contrast, we collect samples at $i$.
Typically, when collecting average-strategy samples at $i$, we would need SWA to maintain an unbiased estimate of the average strategy; however, because of the high variance of SWA, we find empirically that the variant without SWA converges more efficiently.

F.4 HYPERPARAMETERS

In the experiments, we set the network hyperparameters as follows.

Hyperparameters on Leduc Hold'em. The Leduc(5), Leduc(10), and Leduc(15) games in our experiments have $1.1\times10^4$ infosets ($6\times10^4$ states), $3\times10^5$ infosets ($1.5\times10^6$ states), and $3\times10^6$ infosets ($2\times10^7$ states) respectively. We set $k=3$ as the default parameter of the provable robust sampling method on all these games. For the small Leduc(5), we select $b=100$ as the default mini-batch size of mini-batch MCCFR, which samples only 5.59% of the infosets in each iteration. For the larger Leduc(10) and Leduc(15), we select $b=500$ by default, which visits (observes) only 2.39% and 0.53% of the infosets in each iteration respectively. To train RSN and ASN, we set the default embedding size of both neural networks to 16, 32, and 64 for Leduc(5), Leduc(10), and Leduc(15) respectively. In each update, 256 samples are used to compute the parameter gradients with mini-batch stochastic gradient descent. We select Adam (Kingma & Ba, 2014) as the default optimizer and LSTM with attention as the default neural architecture in all experiments. The neural networks have only 2608, 7424, and 23360 parameters respectively, far fewer than the number of infosets. The default learning rate of Adam is $\beta_{lr}=0.001$. A scheduler, which reduces the learning rate based on the number of epochs and the convergence of the loss, helps the neural agent reach high accuracy: the learning rate is reduced by a factor of 0.5 when the loss has stopped improving for 10 epochs, and its lower bound is $10^{-6}$. To avoid converging to potential local minima or saddle points, we reset the learning rate to 0.001 to help the optimizer obtain better performance. $\theta_{best}^T$ denotes the best parameters, i.e., those achieving the lowest loss after $T$ epochs. If the average loss at epoch $t$ is less than the specified criterion $\beta_{loss}=10^{-4}$ for RSN (this criterion is set to $10^{-5}$ for ASN), we early-stop the optimizer. We set $\beta_{epoch}=2000$, i.e., the optimizer is updated for at most 2000 epochs. For ASN, the learning rate is reduced by a factor of 0.7 when the loss has stopped improving for 15 epochs.

For NFSP in our experiments, we set the hyperparameters according to the original paper (Heinrich & Silver, 2016). The neural network in NFSP has 1 hidden layer of 64 neurons with rectified linear activation. The reinforcement and supervised learning rates are set to 0.1 and 0.005 respectively. Both neural networks are optimized by vanilla stochastic gradient descent every 128 steps of the game, with a mini-batch size of 128 for each network. The sizes of the memories are 200k and 2m for reinforcement learning and supervised learning respectively. We set the anticipatory parameter of NFSP to 0.1. The exploration rate of the $\epsilon$-greedy policy starts at 0.06 and decays to 0.

Hyperparameters on HUNL. To solve HUNL(1) and HUNL(2), we sample 0.01% and 0.001% of the infosets in each iteration. The batch size for training the neural networks is set to 100000. We prefer a large batch size because gradient descent accounts for most of the running time; typically, a larger batch size means fewer gradient descent updates.
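As a minimal sketch of the optimizer and scheduler settings described above for the Leduc networks, the snippet below assumes PyTorch's Adam and ReduceLROnPlateau; the model, the loss, and the gradient clipping threshold are placeholders, not our implementation.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(16, 3)  # placeholder for RSN/ASN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # beta_lr = 0.001
# Reduce the learning rate by 0.5 when the loss has not improved for 10 epochs,
# with a lower bound of 1e-6 (0.7 and 15 epochs would be used for ASN).
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5,
                              patience=10, min_lr=1e-6)

for epoch in range(2000):                                   # beta_epoch = 2000
    epoch_loss = model(torch.zeros(1, 16)).pow(2).mean()    # placeholder loss
    optimizer.zero_grad()
    epoch_loss.backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), 1.0)  # gradient clipping (illustrative bound)
    optimizer.step()
    scheduler.step(epoch_loss.item())
    if epoch_loss.item() < 1e-4:                            # beta_loss: early stopping
        break
```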
We ran DNCFR with different embedding sizes and different numbers of gradient descent updates; the experiment results are presented in Figure 7. The other hyperparameters of the neural networks and optimizers are the same as for Leduc(15).

G ABS-CFR, DEEPSTACK, DOUBLE NEURAL CFR AND HUNL

Figure 9: Huber loss of the three counterfactual value networks in our reimplemented DeepStack. (a) Huber loss of the auxiliary network on the preflop subgame: the training loss is 0.0000789 and the validation loss is 0.0000776, versus 0.000053 and 0.000055 in the original DeepStack. (b) Huber loss of the deep counterfactual value network on the flop subgame: the training loss is 0.008 and the validation loss is 0.019, versus 0.008 and 0.034 in the original DeepStack. (c) Huber loss of the deep counterfactual value network on the turn subgame (which contains the last two rounds of HUNL): the training loss is 0.016 and the validation loss is 0.035, versus 0.016 and 0.026 in the original DeepStack. The learning rate is decayed at the 200th iteration (here an iteration equals an epoch), which explains the drop of the Huber loss in (b) and (c). To balance the performance on training and validation samples, we select the checkpoints with the lowest validation loss.

The game size of the imperfect-information game HUNL is comparable with that of Go (Silver et al., 2016), and its partially observable nature makes it very difficult. Burch (2017) gives a detailed analysis of this problem from the perspective of both computational time and space complexity. To evaluate the proposed method, we reimplemented DeepStack (Moravcik et al., 2017), an expert-level artificial intelligence for heads-up no-limit Texas Hold'em that defeated professional poker players. The number of decision points of HUNL exceeds $10^{161}$ (Johanson, 2013). We provide the rules of Texas Hold'em in Appendix A.3. In this section, we provide details about our implementation, compare our agent with the original DeepStack to verify the correctness of the implementation, and apply our double neural method to the subgame of DeepStack.

G.1 DETAILS ABOUT ABS-CFR

The ABS-CFR agent is an enhanced version of HITSZ_LMW_2pn, whose previous version won the third prize of the 2018 Annual Computer Poker Competition (ACPC); it has $2\times10^{10}$ information sets. The idea of the ABS-CFR agent is to first abstract full HUNL into a smaller abstract game and then solve the abstracted game with CFR. ABS-CFR uses two kinds of abstraction: action abstraction and card abstraction. The action abstraction uses a discretized betting model (Gilpin et al., 2008), which allows fold, call, 0.5-pot raise, 1-pot raise, 2-pot raise, 4-pot raise, and all-in at each decision node. The card abstraction uses domain knowledge so that strategically similar states are collapsed into a single state. In the preflop we use a lossless abstraction, which has 169 buckets.
In the flop and turn, we use a potential-aware imperfect-recall abstraction with earth mover's distance (Ganzfried & Sandholm, 2014), which has 10000 and 50000 buckets respectively. In the river, we use an opponent cluster hand strength abstraction (Johanson et al., 2013), which has 5000 buckets.

G.2 DETAILS ABOUT OUR IMPLEMENTATION OF DEEPSTACK

Because the University of Alberta did not release the source code of DeepStack for no-limit Texas Hold'em, we implemented the algorithm according to the original article (Moravcik et al., 2017). It should be noted that the released example code 4 for Leduc Hold'em cannot be used directly for heads-up no-limit Texas Hold'em, for at least three reasons. (1) The tiny game Leduc Hold'em has only 2 rounds, 6 cards, and a default stack size of 5, and it runs on a single desktop, while HUNL has four rounds, 52 cards, and a stack size of 20000 according to the ACPC game rules. (2) There are 55,627,620,048,000 possible public and private card combinations for two players in HUNL (Johanson, 2013), and the whole game contains about $10^{161}$ infosets, so the program has to be implemented and run on a large-scale distributed computing cluster. (3) The example code does not contain the acceleration techniques and parallel algorithms necessary for Texas Hold'em.

Our implementation follows the key ideas presented in the original DeepStack article, using the same hyperparameters and training samples. To optimize the counterfactual value network on the turn subgame (this subgame looks ahead two rounds and contains both turn and river), we generated nine million samples. Because each sample is generated by traversing 1000 iterations of the CFR+ algorithm from a random reach probability, generating these samples is computationally expensive: it took a cluster of 1500 nodes (each with 32 CPU cores and 60GB of memory) more than 60 days. To optimize the counterfactual value network on the flop subgame (which looks ahead only one round), we generated two million samples, which took about one week with similar computation resources. The auxiliary network on the preflop subgame was optimized on ten million samples and took 2 days. The whole implementation of DeepStack cost us several months and hundreds of thousands of lines of code.

G.3 VERIFY THE CORRECTNESS OF OUR IMPLEMENTATION

The overall DeepStack algorithm contains three ingredients: (1) computing a strategy for the current public state, (2) a depth-limited lookahead to the end of the subgame rather than the end of the full game, using a counterfactual value network to infer the values of the leaf nodes of the subgame, and (3) an action abstraction technique to reduce the size of the game tree. To evaluate a strategy in an imperfect information game, exploitability is usually used as the metric measuring the distance between the strategy and a Nash equilibrium in a two-player zero-sum game. However, in large games such as heads-up no-limit Texas Hold'em, computing exploitability is expensive because of the $10^{161}$ search space. We therefore verified the correctness of our implementation from three different aspects. First, the logs of DeepStack playing against professional poker players are released on the official website and contain more than 40000 hand histories. From these logs, we counted the frequency of each action taken by DeepStack under different private cards and used the normalized frequency as an estimate of DeepStack's strategy.
We compared this estimated strategy with that of our reimplemented DeepStack. Figure 10 in Appendix G provides the comparison and demonstrates that our implementation produces policies very close to those of the original DeepStack. Second, we compared the Huber loss of the three deep counterfactual value networks; clearly, our implementation achieves losses similar to those in the original paper. Third, our agent also played against an enhanced version of HITSZ_LMW_2pn, whose previous version won the third prize of the 2018 Annual Computer Poker Competition (ACPC); our implementation beats HITSZ_LMW_2pn by 120 mbb/g.

4 https://github.com/lifrordi/DeepStack-Leduc

Figure 10: Comparison of action probabilities between Alberta's DeepStack (Moravcik et al., 2017) (left column) and our reimplemented DeepStack (right column). For each pair of private-card ranks (A through 2 on both axes), the panels show the probability of taking fold, call, half-pot raise, and one-pot raise.