Published as a conference paper at ICLR 2024

DYNAMIC DISCOUNTED COUNTERFACTUAL REGRET MINIMIZATION

Hang Xu1,2, Kai Li1,2, Haobo Fu6, Qiang Fu6, Junliang Xing5, Jian Cheng1,3,4
1Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3School of Future Technology, University of Chinese Academy of Sciences
4AiRiA
5Tsinghua University
6Tencent AI Lab
{xuhang2020,kai.li,jian.cheng}@ia.ac.cn, {haobofu,leonfu}@tencent.com, jlxing@tsinghua.edu.cn

ABSTRACT

Counterfactual regret minimization (CFR) is a family of iterative algorithms showing promising results in solving imperfect-information games. Recent novel CFR variants (e.g., CFR+, DCFR) have significantly improved the convergence rate of the vanilla CFR. The key to these CFR variants' performance is weighting each iteration non-uniformly, i.e., discounting earlier iterations. However, these algorithms use a fixed, manually-specified scheme to weight each iteration, which enormously limits their potential. In this work, we propose Dynamic Discounted CFR (DDCFR), the first equilibrium-finding framework that discounts prior iterations using a dynamic, automatically-learned scheme. We formalize CFR's iteration process as a carefully designed Markov decision process and transform the discounting scheme learning problem into a policy optimization problem within it. The learned discounting scheme dynamically weights each iteration on the fly using information available at runtime. Extensive experimental results demonstrate that DDCFR's dynamic discounting scheme has a strong generalization ability and leads to faster convergence with improved performance. The code is available at https://github.com/rpSebastian/DDCFR.

1 INTRODUCTION

Imperfect-information games (IIGs) model strategic interactions between players with hidden information. Solving such games is challenging since it requires reasoning under uncertainty about the opponents' private information. Hidden information is omnipresent in real-world decision-making problems, making research on IIGs theoretically and practically crucial. In this work, we focus on solving two-player zero-sum IIGs. For these games, the typical goal is to find an (approximate) Nash equilibrium (Nash, 1950): a strategy profile in which no player can improve their utility by unilaterally switching to a different strategy. One of the most successful approaches to computing a Nash equilibrium in IIGs is the family of counterfactual regret minimization (CFR) algorithms (Zinkevich et al., 2007). CFR iteratively minimizes both players' regrets so that the time-averaged strategy profile gradually approaches the Nash equilibrium. Due to their sound theoretical guarantees and strong empirical performance, CFR-type algorithms are the basis of several breakthroughs (Bowling et al., 2015; Moravčík et al., 2017; Brown & Sandholm, 2018; 2019a). Over the past decade, researchers have proposed various novel CFR variants (Tammelin, 2014; Brown & Sandholm, 2019b; Xu et al., 2022) with faster convergence rates. In particular, the development of CFR+ (Tammelin, 2014) was a milestone: it is at least an order of magnitude faster than the vanilla CFR in practice, which was the key to solving heads-up limit Texas Hold'em poker.
The recently proposed Discounted CFR (DCFR) (Brown & Sandholm, 2019b) is one of the most efficient equilibrium-finding algorithms, outperforming other CFR variants in a diverse set of IIGs. Unlike the vanilla CFR, which assigns equal weights to every iteration, the key to the performance of these new CFR variants is that they use different discounting schemes to weight each iteration non-uniformly. For example, CFR+ uses a linear discounting scheme in which iteration t's contribution to the average strategy is proportional to t. DCFR is a family of algorithms that discounts prior iterations when determining both regrets and the average strategy, i.e., assigning smaller weights to earlier iterations using three hyperparameters (α, β, and γ). Although these CFR variants have shown that faster convergence is achievable by discounting, they all rely on a fixed and manually-specified discounting scheme, which significantly limits their potential. Specifically, since there are a limitless number of discounting schemes in theory, it is impractical and even impossible to manually find one that performs well in most games, which leads to the manually-specified scheme being suboptimal. Additionally, the fixed discounting scheme is pre-determined before the start of the iteration process, which is excessively restrictive. It is desirable to design a more flexible scheme that can dynamically weight each iteration on the fly using the information available at runtime.

To this end, we propose a novel Dynamic Discounted CFR (DDCFR) framework that weights each iteration using a dynamic, automatically-learned discounting scheme. We formulate CFR's iteration process as a Markov decision process (MDP) and transform the discounting scheme learning problem into a policy optimization problem within it. Specifically, we regard the dynamic discounting scheme as an agent interacting with the MDP. The agent receives the current state of the iteration process and outputs an action consisting of the discounting weights for the next iteration. The agent's goal is to find an approximate Nash equilibrium promptly by choosing appropriate weights.

Our ultimate goal is to learn a discounting policy that generalizes across many different IIGs, not limited to some particular games. To this end, we carefully design the training games, the MDP, and the training algorithm. Specifically, we train the discounting policy on a diverse set of IIGs to improve its generalization ability. We design a game-agnostic state space for the MDP, so the trained policy naturally applies to different games. We further exploit a highly scalable training algorithm to optimize the discounting policy based on evolution strategies (Salimans et al., 2017). We evaluate the learned dynamic discounting scheme on new games not seen during training, including large-scale games such as heads-up no-limit Texas hold'em (HUNL) subgames. The results demonstrate that it has a strong generalization ability and outperforms the fixed, manually-specified discounting scheme on both training and testing games.

This paper makes three novel contributions: (i) We propose DDCFR, the first equilibrium-finding framework that discounts prior iterations using a dynamic, automatically-learned scheme. (ii) We formulate CFR's iteration process as an MDP and transform the discounting scheme learning problem into a policy optimization problem in this MDP.
(iii) We design an effective training algorithm to optimize the discounting policy. The learned discounting scheme exhibits strong generalization ability, achieving competitive performance on new games not seen during training.

2 PRELIMINARY

2.1 NOTATIONS

Extensive-form games (Osborne & Rubinstein, 1994) are a tree-based formalism commonly used to describe IIGs. In these games, there is a finite set $N$ of players and a unique player $c$ called chance with a fixed, known stochastic strategy. A history $h$ consists of all actions taken by players, including private knowledge known to only a subset of players. All possible histories in the game tree form the set $H$. Players take turns taking actions, and $A(h) = \{a : ha \in H\}$ denotes the actions available at a given history $h$. $Z \subseteq H$ is the set of terminal histories, for which no actions are available. Each player $i \in N$ has a utility function $u_i(z): Z \to [u_{\min}, u_{\max}] \subseteq \mathbb{R}$, and $\Delta = u_{\max} - u_{\min}$ is the range of utilities. In IIGs, the lack of information is represented by information sets $\mathcal{I}_i$ for each player $i \in N$. If $h, h'$ are in the same information set $I \in \mathcal{I}_i$, player $i$ cannot distinguish between them. For example, in poker, all histories within an information set differ only in the private cards of other players. Thus, $A(I) = A(h)$ for arbitrary $h \in I$. We define $|\mathcal{I}| = \sum_{i \in N} |\mathcal{I}_i|$ and $|A| = \max_{i \in N} \max_{I \in \mathcal{I}_i} |A(I)|$.

A strategy $\sigma_i(I)$ assigns a probability distribution over the actions available to acting player $i$ in an information set $I$. Specifically, $\sigma_i(I, a)$ denotes the probability of player $i$ taking action $a$. Since all histories in an information set belonging to player $i$ are indistinguishable for that player, the strategies in each must be identical. Therefore, for any $h_1, h_2 \in I$, we have $\sigma_i(I) = \sigma_i(h_1) = \sigma_i(h_2)$. A strategy profile $\sigma = \{\sigma_i \mid \sigma_i \in \Sigma_i, i \in N\}$ is a specification of strategies for all players, where $\Sigma_i$ denotes the set of all possible strategies for player $i$, and $\sigma_{-i}$ refers to the strategies of all players other than player $i$. $u_i(\sigma_i, \sigma_{-i})$ represents the expected utility for player $i$ if player $i$ plays according to $\sigma_i$ and the others play according to $\sigma_{-i}$.

2.2 BEST RESPONSE AND NASH EQUILIBRIUM

The best response to $\sigma_{-i}$ is any strategy $BR(\sigma_{-i})$ that maximizes player $i$'s expected utility, such that $u_i(BR(\sigma_{-i}), \sigma_{-i}) = \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i})$. The Nash equilibrium is a strategy profile $\sigma^* = (\sigma_i^*, \sigma_{-i}^*)$ in which every player plays a best response to the other players' strategies. Formally, $\forall i \in N$, $u_i(\sigma_i^*, \sigma_{-i}^*) = \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i}^*)$. The exploitability of a strategy $\sigma_i$ is a measure of how far that strategy is from optimal. It is defined as $e_i(\sigma_i) = u_i(\sigma_i^*, \sigma_{-i}^*) - u_i(\sigma_i, BR(\sigma_i))$. In an $\epsilon$-Nash equilibrium, no player has exploitability higher than $\epsilon$. The exploitability of a strategy profile $\sigma$ is $e(\sigma) = \sum_{i \in N} e_i(\sigma_i)/|N|$. We can interpret it as the approximation error to the Nash equilibrium.

2.3 COUNTERFACTUAL REGRET MINIMIZATION

Counterfactual regret minimization (CFR) is one of the most popular equilibrium-finding algorithms for solving extensive-form IIGs (Zinkevich et al., 2007). The algorithm iteratively refines the players' average strategies by minimizing regrets. CFR frequently uses the counterfactual value $v_i^{\sigma}(I)$, which is the expected utility of an information set $I \in \mathcal{I}_i$ for a player $i$ under a given strategy profile $\sigma$, assuming that the player tries to reach it. The counterfactual value of a specific action $a$ in $I$ is $v_i^{\sigma}(I, a)$. The rigorous definitions of $v_i^{\sigma}(I)$ and $v_i^{\sigma}(I, a)$ can be found in Appendix A.
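As a concrete illustration of the exploitability definition in Section 2.2, the following toy sketch computes $e(\sigma)$ for a two-player zero-sum matrix game (standard rock-paper-scissors, which is not a game used in this paper); the payoff matrix and strategies are purely illustrative.

```python
# Toy illustration of best response and exploitability (Section 2.2) on a
# two-player zero-sum matrix game; this example is not taken from the paper.
import numpy as np

# Payoff matrix for player 1 in rock-paper-scissors (player 2 gets the negative).
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)

def exploitability(x, y, A):
    """e(sigma) = sum_i e_i(sigma_i) / |N| for the profile (x, y)."""
    br1 = np.max(A @ y)       # best-response value of player 1 against y
    br2 = np.max(-(A.T @ x))  # best-response value of player 2 against x
    # In a zero-sum game the equilibrium value cancels, so the average of the
    # two best-response gains equals the exploitability of the profile.
    return (br1 + br2) / 2

uniform = np.ones(3) / 3
biased = np.array([0.5, 0.25, 0.25])
print(exploitability(uniform, uniform, A))  # 0.0: uniform play is unexploitable
print(exploitability(biased, uniform, A))   # 0.125: the biased strategy can be exploited
```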
CFR typically proceeds with a uniform random strategy $\sigma^1$. On each iteration $t$, CFR first recursively traverses the game tree using the current strategy $\sigma^t$ to calculate the instantaneous regret $r_i^t(I, a)$ of not choosing action $a$ in an information set $I$ for player $i$, which is defined as:
$$r_i^t(I, a) = v_i^{\sigma^t}(I, a) - v_i^{\sigma^t}(I). \quad (1)$$
CFR then calculates the cumulative regret $R_i^t(I, a) = \sum_{k=1}^{t} r_i^k(I, a)$ and computes a strategy for the next iteration using regret-matching (Hart & Mas-Colell, 2000; Cesa-Bianchi & Lugosi, 2006):
$$\sigma_i^{t+1}(I, a) = \begin{cases} \dfrac{R_i^{t,+}(I, a)}{\sum_{a' \in A(I)} R_i^{t,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{t,+}(I, a') > 0, \\ \dfrac{1}{|A(I)|}, & \text{otherwise}, \end{cases} \quad (2)$$
where $R_i^{t,+}(I, a) = \max(R_i^t(I, a), 0)$. In two-player zero-sum IIGs, if both players use CFR on each iteration, their average strategies $\bar{\sigma}^t$ will converge to an $\epsilon$-Nash equilibrium in $O(|\mathcal{I}|^2 |A| \Delta^2 / \epsilon^2)$ iterations (Zinkevich et al., 2007). The average strategy $\bar{\sigma}^t$ is calculated as:
$$C_i^t(I, a) = \sum_{k=1}^{t} \pi_i^{\sigma^k}(I)\,\sigma_i^k(I, a), \qquad \bar{\sigma}_i^t(I, a) = \frac{C_i^t(I, a)}{\sum_{a' \in A(I)} C_i^t(I, a')},$$
where $\pi_i^{\sigma^k}(I)$ is information set $I$'s reach probability (Appendix A) and $C_i^t(I, a)$ is player $i$'s cumulative strategy for action $a$ in information set $I$ on iteration $t$.

2.4 THE DISCOUNTING VARIANTS OF CFR

Since the birth of CFR, many new CFR variants have been proposed that greatly improve the convergence rate of the vanilla CFR. CFR+ (Tammelin, 2014) is like CFR with three small but effective modifications and converges an order of magnitude faster than CFR. First, CFR+ uses a linear discounting scheme that weights iteration $t$ by $t$, rather than using a uniformly-weighted average strategy as in CFR. Second, to immediately reuse an action when it shows promise of performing well instead of waiting for the cumulative regret to become positive, CFR+ sets any action with negative cumulative regret to zero on each iteration. Finally, CFR+ uses the alternating-updates technique. The recently proposed DCFR (Brown & Sandholm, 2019b) algorithm investigates the discounting schemes in regret minimization algorithms more systematically. DCFR is a family of algorithms that discounts cumulative regrets and the cumulative strategy, dramatically accelerating the convergence rate. Specifically, on each iteration $t$, DCFR multiplies positive cumulative regrets by $\frac{(t-1)^{\alpha}}{(t-1)^{\alpha}+1}$, negative cumulative regrets by $\frac{(t-1)^{\beta}}{(t-1)^{\beta}+1}$, and the cumulative strategy by $\left(\frac{t-1}{t}\right)^{\gamma}$. Formally,
$$R_i^t(I, a) = \begin{cases} R_i^{t-1}(I, a)\dfrac{(t-1)^{\alpha}}{(t-1)^{\alpha}+1} + r_i^t(I, a), & \text{if } R_i^{t-1}(I, a) > 0, \\ R_i^{t-1}(I, a)\dfrac{(t-1)^{\beta}}{(t-1)^{\beta}+1} + r_i^t(I, a), & \text{otherwise}, \end{cases} \quad (3)$$
$$C_i^t(I, a) = C_i^{t-1}(I, a)\left(\frac{t-1}{t}\right)^{\gamma} + \pi_i^{\sigma^t}(I)\,\sigma_i^t(I, a). \quad (4)$$
The three hyperparameters (α, β, γ) determine DCFR's discounting scheme. For example, we can recover the linear discounting scheme by setting α = 1, β = 1, and γ = 1. Although there are a limitless number of discounting schemes that converge in theory, DCFR's authors empirically set α = 1.5, β = 0, and γ = 2, since they found this setting performs the best in the games they tested.
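To make the discounting rules in Equations 2-4 concrete, here is a minimal Python sketch of a DCFR-style tabular update for a single information set. It follows the formulas above with DCFR's default hyperparameters; the function and variable names are illustrative and are not taken from the released DDCFR code.

```python
# Minimal sketch of DCFR's discounted tabular update (Equations 2-4) for one
# information set; variable names are illustrative, not the paper's code.
import numpy as np

def regret_matching(cum_regret):
    """Equation 2: next strategy from the positive part of cumulative regrets."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

def dcfr_update(cum_regret, cum_strategy, inst_regret, strategy, reach_prob,
                t, alpha=1.5, beta=0.0, gamma=2.0):
    """Equations 3 and 4: discount the accumulators, then add iteration t's terms."""
    pos_w = (t - 1) ** alpha / ((t - 1) ** alpha + 1) if t > 1 else 0.0
    neg_w = (t - 1) ** beta / ((t - 1) ** beta + 1) if t > 1 else 0.0
    strat_w = ((t - 1) / t) ** gamma
    # Positive and negative cumulative regrets are discounted with different weights.
    cum_regret = np.where(cum_regret > 0, cum_regret * pos_w, cum_regret * neg_w)
    cum_regret = cum_regret + inst_regret
    # The cumulative-strategy numerator is discounted before adding this iteration.
    cum_strategy = cum_strategy * strat_w + reach_prob * strategy
    return cum_regret, cum_strategy, regret_matching(cum_regret)

# Example: one update for a 3-action information set at iteration t = 10.
R = np.array([0.4, -0.2, 0.0])
C = np.array([3.0, 5.0, 2.0])
r = np.array([0.1, 0.3, -0.4])
R, C, next_strategy = dcfr_update(R, C, r, np.array([0.5, 0.3, 0.2]), 0.8, t=10)
```

In DDCFR, the same update is used, except that α, β, and γ are replaced by the per-iteration values chosen by the learned policy, as described in Section 3.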
3 LEARNING DYNAMIC DISCOUNTING SCHEME

3.1 THE DDCFR FRAMEWORK

As mentioned earlier, the existing discounting CFR variants have obtained remarkable performance in solving IIGs. However, they all exploit a fixed and manually-specified discounting scheme. These pre-determined schemes are not flexible enough and thus inevitably limit the convergence performance. We argue that an ideal discounting scheme should fulfill two criteria. First, the scheme should be automatically learned rather than manually designed. For example, in DCFR, designing the discounting scheme by manually setting the values of α, β, and γ is unscalable and suboptimal, since finding one setting that can efficiently solve a wide variety of IIGs is challenging. Second, the scheme should be able to adjust the weights dynamically instead of using fixed weights. For regret minimization algorithms, the discounting weights often need to change dynamically during learning. This is particularly important in solving different types of IIGs, where the distribution of regret values and the cost of taking suboptimal actions vary significantly during the iteration process.

To achieve this goal, we propose a novel Dynamic Discounted CFR (DDCFR) framework that weights each iteration using a dynamic, automatically-learned discounting scheme. DDCFR learns to adjust the discounting weights dynamically using information available at runtime, such as iterations and exploitability. By adapting to changing conditions in real time, we believe that DDCFR can lead to more efficient and effective learning.

Figure 1: The DDCFR framework.

The high-level idea of DDCFR is to encapsulate CFR's iteration process into an environment and regard the discounting scheme as an agent interacting with it. Specifically, the agent receives the current state (i.e., the current status of CFR's iteration process) and outputs an action consisting of the discounting weights for the next iteration. This process continues until the maximum number of iterations. The agent's goal is to find an optimal discounting policy that can dynamically choose suitable weights for each iteration and thus minimize the exploitability of the obtained average strategies. Formally, the interaction process between the agent and the environment in game $G$ constitutes a Markov decision process (MDP), represented as $(G, \mathcal{S}, \mathcal{A}, \mathcal{P}^G, \hat{R}^G)$.

The game $G$. Each game $G$ represents a different environment. Our ultimate goal is to train an agent whose discounting policy performs well on the training games and generalizes to new games.

The state space $\mathcal{S}$. The state $s_t \in \mathcal{S}$ contains information that the agent observes at each iteration $t$ in game $G$. $s_t$ should contain as much general and transferable information as possible, helping the agent make good decisions when choosing discounting weights for each iteration and making the learned scheme applicable to different games. We propose a game-agnostic state representation that meets these requirements, i.e., $s_t$ consists of two parts: 1) the normalized iteration $\hat{t}$, which is the ratio of the current iteration $t$ to the total number of iterations; 2) the normalized exploitability $\hat{E}^G_{t-1}$, which is calculated as $(\log E^G_{t-1} - \log E_{\min})/(\log E^G_1 - \log E_{\min})$, where $E^G_1$ and $E^G_{t-1}$ are the exploitability of the average strategies in game $G$ at iterations 1 and $t-1$, respectively, and $E_{\min}$ is the minimum reachable exploitability, set to $10^{-12}$. The logarithmic operation results in a more even distribution of normalized exploitability, thereby facilitating agent training (Appendix I).
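A minimal sketch of this game-agnostic state representation, assuming the exploitability values $E^G_1$ and $E^G_{t-1}$ are tracked by the surrounding CFR loop; the function name is illustrative.

```python
# Sketch of the game-agnostic state described above; the exploitability values
# are assumed to be computed elsewhere. E_MIN follows the paper's 1e-12 floor.
import math

E_MIN = 1e-12

def ddcfr_state(t, total_iterations, expl_first, expl_prev):
    """Return [normalized iteration, log-normalized exploitability]."""
    norm_iter = t / total_iterations
    norm_expl = (math.log(expl_prev) - math.log(E_MIN)) / (
        math.log(expl_first) - math.log(E_MIN))
    return [norm_iter, norm_expl]

# Example: at iteration 500 of 1000 with exploitability 1e-4 (initially 0.5),
# the state is roughly [0.5, 0.68].
print(ddcfr_state(500, 1000, 0.5, 1e-4))
```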
The action space $\mathcal{A}$. The agent's action $\hat{a}_t \in \mathcal{A}$ at iteration $t$ consists of four numbers, i.e., $\hat{a}_t = [\alpha_t, \beta_t, \gamma_t, \tau_t]$. Similar to DCFR in Equations 3 and 4, $\alpha_t$, $\beta_t$, and $\gamma_t$ are used to determine the discounting weights, i.e., $\frac{(t-1)^{\alpha_t}}{(t-1)^{\alpha_t}+1}$, $\frac{(t-1)^{\beta_t}}{(t-1)^{\beta_t}+1}$, and $\left(\frac{t-1}{t}\right)^{\gamma_t}$ for positive cumulative regrets, negative cumulative regrets, and the cumulative strategy, respectively. The key difference between our DDCFR and DCFR is that DCFR uses a pre-determined discounting scheme, i.e., DCFR empirically sets α=1.5, β=0, and γ=2, which is fixed not only during the iteration process but also across different games. In contrast, our proposed DDCFR can utilize $\hat{a}_t$ to dynamically adjust the discounting weights at runtime and adaptively control its iteration process for each game. $\tau_t$ represents the duration for which to use these discounting weights. Incorporating $\tau_t$ enhances the interpretability of the learned scheme, indicating whether the hyperparameters need to be adjusted frequently.

The state transition $\mathcal{P}^G$. $\mathcal{P}^G: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ describes how the state changes during the iteration process of DDCFR in game $G$. Specifically, the agent observes the current state $s_t$ at each iteration $t$ and outputs an action $\hat{a}_t$. DDCFR uses the discounting weights calculated by $\alpha_t$, $\beta_t$, and $\gamma_t$ for $\tau_t$ iterations, and finally, the state transitions to the next state $s_{t+\tau_t}$. This process allows for the consecutive use of each set of discounting weights for a certain number of iterations.

The reward function $\hat{R}^G$. $\hat{R}^G$ evaluates the agent's performance and guides DDCFR's iteration process. Since the agent's ultimate goal is to find a strategy profile with the lowest exploitability, we use a sparse reward setting where the agent only receives a reward at the end of the iteration process. Specifically, the reward is $\hat{R}^G = \log E^G_1 - \log E^G_T$, where $E^G_T$ is the exploitability of the average strategies in game $G$ at the maximum number of iterations $T$.

The agent is typically represented by a neural network with parameters θ, i.e., $\pi_\theta: \mathcal{S} \to \mathcal{A}$. Figure 1 shows DDCFR's overall iteration process. Given a game $G$, at iteration $t$, the agent receives the current state $s_t$ and outputs an action $\hat{a}_t = (\alpha_t, \beta_t, \gamma_t, \tau_t) = \pi_\theta(s_t)$. DDCFR then uses the discounting weights calculated by $\alpha_t$, $\beta_t$, and $\gamma_t$ to update cumulative regrets and the average strategy for $\tau_t$ iterations. This process continues until the maximum number of iterations, and the agent eventually receives a reward $\hat{R}^G$. In each game $G$, the objective is to maximize the final reward $\hat{R}^G$, represented as $f^G(\theta) = \hat{R}^G$. DDCFR's overall objective is to maximize the average reward across the training games $\mathcal{G}$, i.e., $f(\theta) = \frac{1}{|\mathcal{G}|}\sum_{G \in \mathcal{G}} f^G(\theta)$. By optimizing $f(\theta)$, our ultimate goal is to learn a generalizable discounting policy that applies to new games.

3.2 THEORETICAL ANALYSIS

Although DDCFR's dynamic discounting scheme is highly flexible, a question arises regarding its convergence properties. Here we theoretically demonstrate that DDCFR is guaranteed to converge to a Nash equilibrium as long as $\alpha_t$, $\beta_t$, and $\gamma_t$ are within a certain range.

Theorem 1. Assume that DDCFR is run for $T$ iterations in a two-player zero-sum game. If DDCFR selects its hyperparameters as follows: $\alpha_t \in [0, 5]$ for $t < \frac{T}{2}$ and $\alpha_t \in [1, 5]$ for $t \ge \frac{T}{2}$, $\beta_t \in [-5, 0]$, and $\gamma_t \in [0, 5]$, then the weighted average strategy profile is a $6|\mathcal{I}|\left(\frac{8\Delta\sqrt{|A|}}{3\sqrt{T}} + \frac{2\Delta}{T}\right)$-Nash equilibrium.

The proof is in Appendix B; it mainly rescales inequalities based on the ranges of $\alpha_t$, $\beta_t$, and $\gamma_t$, following DCFR's proof. This theorem signifies that numerous dynamic discounting schemes converge in theory.
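For illustration, the hyperparameter ranges required by Theorem 1 can be enforced by simply clipping the policy's raw outputs; the helper below is a sketch under that assumption and is not part of the paper's implementation.

```python
# Sketch: clip the policy's raw outputs to the ranges Theorem 1 requires.
def clip_to_theorem1(alpha, beta, gamma, t, total_iterations):
    alpha_low = 1.0 if t >= total_iterations / 2 else 0.0  # alpha_t in [1,5] late, [0,5] early
    alpha = min(max(alpha, alpha_low), 5.0)
    beta = min(max(beta, -5.0), 0.0)                        # beta_t in [-5, 0]
    gamma = min(max(gamma, 0.0), 5.0)                       # gamma_t in [0, 5]
    return alpha, beta, gamma
```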
We then describe how to efficiently optimize $\pi_\theta$ to find a well-performing scheme in practice.

3.3 OPTIMIZATION THROUGH EVOLUTION STRATEGIES

When training the discounting policy $\pi_\theta$, we encounter the challenges of sparse rewards (the agent receives a reward only at the end of the iteration process), long time horizons (CFR requires thousands of iterations), and multi-task learning (the agent maximizes rewards across various games). Off-the-shelf reinforcement learning (RL) algorithms like DQN (Mnih et al., 2015) and PPO (Schulman et al., 2017) struggle to cope with these challenges, requiring the integration of more sophisticated components. Additionally, RL is notably sensitive to hyperparameters. Evolution Strategies (ES) (Wierstra et al., 2014; Salimans et al., 2017) has demonstrated its efficacy as a scalable alternative to RL in tackling these challenges. As a black-box optimization technique, ES is indifferent to the distribution of rewards and tolerant of arbitrarily long time horizons. Besides, ES is easy to implement and is highly scalable and efficient to use on distributed hardware.

Inspired by the effectiveness of ES in optimizing complex objective functions and addressing RL-related problems, we design an ES-based algorithm to train the discounting policy $\pi_\theta$. The ES approach proposed in (Salimans et al., 2017) states that for an objective function $f(\theta)$ over a network parameterized by θ, its gradient estimation can be obtained by applying Gaussian perturbations to θ, i.e., $\nabla_\theta \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} f(\theta + \delta\epsilon) = \frac{1}{\delta}\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[f(\theta + \delta\epsilon)\epsilon\right]$, where $\delta$ denotes the noise standard deviation and $\mathcal{N}(0, I)$ is a Gaussian distribution with mean 0 and covariance $I$. This gradient estimation can be approximated with samples. Specifically, we generate a population of perturbed network parameters at each epoch and evaluate the resulting parameters under all training games. We then combine the results to calculate the gradient and update the original parameters using stochastic gradient ascent. To improve the efficiency of the training process, we employ three acceleration techniques: the antithetic estimator, fitness shaping, and parallel evaluation. Algorithm 1 outlines the complete training procedure of the discounting policy.

Algorithm 1: DDCFR's training procedure.
Input: Training games $\mathcal{G}$, epochs $M$, learning rate $lr$, population size $N$, standard deviation $\delta$
1: Randomly initialize the agent's parameters $\theta_1$
2: for $m = 1$ to $M$ do
3:   Initialize population $P$ with an empty list
4:   for $i = 1, \ldots, N$ do
5:     Sample $\epsilon_i \sim \mathcal{N}(0, I)$
6:     Add $\epsilon_i, -\epsilon_i$ to the population $P$
7:   foreach $\epsilon \in P$ do
8:     foreach $G \in \mathcal{G}$ do
9:       Compute $f^G(\theta_m + \delta\epsilon)$ (Algorithm 2)
10:    $f(\theta_m + \delta\epsilon) \leftarrow \frac{1}{|\mathcal{G}|}\sum_{G \in \mathcal{G}} f^G(\theta_m + \delta\epsilon)$
11:    Compute $f'(\theta_m + \delta\epsilon)$ using Eqn. 5
12:  $\theta_{m+1} \leftarrow \theta_m + \frac{lr}{\delta N}\sum_{\epsilon \in P} f'(\theta_m + \delta\epsilon)\,\epsilon$
Output: The trained agent $\pi_{\theta_{M+1}}$

Algorithm 2: The calculation process of $f^G(\theta)$.
Input: Training game $G$, agent $\pi_\theta$, total iterations $T$
1: $t \leftarrow 1$
2: while $t \le T$ do
3:   $[\alpha_t, \beta_t, \gamma_t, \tau_t] = \hat{a}_t = \pi_\theta(s_t)$
4:   Use $[\alpha_t, \beta_t, \gamma_t]$ to calculate discounting weights and perform $\tau_t$ CFR iterations
5:   $t \leftarrow t + \tau_t$
6: Compute exploitability $E^G_1$, $E^G_T$
Output: $f^G(\theta) \leftarrow \log E^G_1 - \log E^G_T$

(i) We use the antithetic estimator (Nesterov & Spokoiny, 2017) to reduce variance. It is an unbiased estimator using control variates (Liu et al., 2019), given by the symmetric difference, i.e., $\nabla_\theta \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} f(\theta + \delta\epsilon) = \frac{1}{2\delta}\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\left(f(\theta + \delta\epsilon) - f(\theta - \delta\epsilon)\right)\epsilon\right]$. (ii) We apply fitness shaping (Wierstra et al., 2014) by performing a rank transformation on the objective function $f(\theta)$ before computing the gradient.
For the parameter $\theta_k$ with the $k$-th largest value in the population of size $N$, the shaped result is as follows:
$$f'(\theta_k) = \frac{\max\left(0, \log\left(\frac{N}{2}+1\right) - \log k\right)}{\sum_{j=1}^{N} \max\left(0, \log\left(\frac{N}{2}+1\right) - \log j\right)} - \frac{1}{N}. \quad (5)$$
This approach makes the gradient relevant only to the relative ordering of the parameters' fitness values rather than their absolute magnitudes. By doing so, our algorithm becomes more robust to variations in the original fitness values. (iii) To further speed up training, we evaluate the performance of each perturbed parameter under each game in parallel.

3.4 ADDITIONAL COMPUTATION COST ANALYSIS

Compared with DCFR, DDCFR introduces additional computation costs, which mainly include three parts: feature calculations, network inference, and policy training. However, these additional time costs are very small compared to the overall time costs. (i) Feature calculations. The state representation requires the iteration number and the exploitability. The iteration number is effortlessly obtainable. Since the exploitability calculation and the instantaneous regret calculation are completely independent, multiprocessing can be used to execute both processes concurrently without increasing the wall-clock time (Appendix C). (ii) Network inference. The time overhead of $\pi_\theta$ is negligible compared to the overall iteration time. To support this claim, we compare the wall-clock time between DCFR and DDCFR in Appendix D. DDCFR exhibits only a 0.22% increase in time on average. (iii) Policy training. The effort put into training is worthwhile, as the trained discounting policy can be directly applied to numerous new games without modification. This means that the additional work during training is amortized across the various games to which the policy is applied.
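The following sketch condenses one epoch of Algorithm 1 in Python: antithetic sampling, the rank-based fitness shaping of Eqn. 5, and a gradient-ascent step. The `evaluate_on_games` callable stands in for Algorithm 2, and the stand-in objective in the usage example is an assumption for illustration only, not the paper's code.

```python
# Compact sketch of one ES epoch from Algorithm 1: antithetic sampling,
# rank-based fitness shaping (Eqn. 5), and a gradient-ascent step.
# `evaluate_on_games` stands in for Algorithm 2 and is assumed, not a real API.
import numpy as np

def fitness_shaping(scores):
    """Rank transformation: the result depends only on the ordering of scores."""
    n = len(scores)
    ranks = np.empty(n, dtype=int)
    ranks[np.argsort(-np.asarray(scores))] = np.arange(1, n + 1)  # 1 = best
    util = np.maximum(0.0, np.log(n / 2 + 1) - np.log(ranks))
    return util / util.sum() - 1.0 / n

def es_epoch(theta, evaluate_on_games, lr=0.01, sigma=0.5, n_pairs=50, rng=None):
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n_pairs, theta.size))
    population = np.concatenate([eps, -eps])           # antithetic pairs
    scores = np.array([evaluate_on_games(theta + sigma * e) for e in population])
    shaped = fitness_shaping(scores)
    grad = (shaped[:, None] * population).sum(axis=0) / (sigma * len(population))
    return theta + lr * grad                            # stochastic gradient ascent

# Usage with a toy objective (maximize -||theta||^2, i.e., move toward the origin):
theta = np.ones(4)
for _ in range(5):
    theta = es_epoch(theta, lambda p: -np.sum(p ** 2))
```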
4 EXPERIMENTS

4.1 EXPERIMENTAL SETUP

Training and testing games: We use several commonly used IIGs in the research community. Here, we provide brief descriptions of these games; please refer to Appendix J for more details. Kuhn Poker (Kuhn, 1950) is a simplified form of poker with three cards in a deck and one chance to bet for each player. Leduc Poker (Southey et al., 2005) is a larger game with a 6-card deck and two rounds. Big Leduc Poker uses a deck of twenty-four cards with twelve ranks, allowing a maximum of six raises per round. In Liar's Dice-x (Lisý et al., 2015), each player has an x-sided dice, which they roll at the start. Players then take turns placing bets on the outcome. Goofspiel-x (Ross, 1971) is a card game where each player has x cards and tries to score the most points by making sealed bids in x rounds. In Battleship-x (Farina et al., 2019), players secretly place a 1×2 ship on separate 2×x grids and then take turns firing at the opponent's ship three times. HUNL Subgame-x is a HUNL subgame generated by the top poker agent Libratus.

We select eight relatively large IIGs as testing games, i.e., Battleship-2, Battleship-3, Goofspiel-4, Liar's Dice-4, Leduc Poker, Big Leduc Poker, HUNL Subgame-3, and HUNL Subgame-4. These games are diverse in size and challenging to solve, making them suitable for testing the learned discounting scheme's generalization performance. The training games should be relatively small to speed up the training process while still exhibiting the typical characteristics of the testing games. We select four training games: Kuhn Poker, Goofspiel-3, Liar's Dice-3, and Small Matrix. The first three games are simplified versions of the testing games, while Small Matrix simulates common situations in games where irrational actions lead to large losses. Our training games have varying rules and scales, exhibiting a strong diversity, which is crucial for improving generalization ability.

Training details: We set a fixed noise standard deviation of δ=0.5 and a population size of N=100. We distribute the evaluation of perturbed parameters across 200 CPU cores. For the action space, we set the range of α and γ to [0, 5] and that of β to [-5, 0] following Theorem 1, and choose τ in [1, 2, 5, 10, 20]. We employ a network consisting of three fully-connected layers with 64 units and ELU activation functions to represent the discounting policy $\pi_\theta$. The network takes the current state of the environment as input and outputs values for α, β, γ, and the probability of selecting each τ. We use Adam with a learning rate lr of 0.01 to optimize the network and train the agent for M = 1000 epochs, which takes roughly 24 hours. We set the number of CFR iterations T to 1000, typically sufficient for convergence to a small exploitability. All CFR variants use the alternating-updates technique.

4.2 COMPARISON TO DISCOUNTING CFR VARIANTS

In Figures 2 and 3, we compare DDCFR's learned discounting scheme with two discounting CFR variants, i.e., CFR+ and DCFR.

Figure 2: Comparison of DDCFR against two discounting CFR variants on four training games.

Figure 3: Comparison of DDCFR against two discounting CFR variants on eight testing games.

Since the iteration process of DDCFR is deterministic (Appendix F), we only test it once in each game. Due to the rapid decline in exploitability in the early iterations of different CFR algorithms, their differences are primarily manifested in the later iterations. For a better comparison of the algorithms, we focus on their performance in the later iterations, i.e., from 500 to 1000. Please refer to Appendix E for all iterations. As shown in Figure 2, DDCFR converges faster to a lower exploitability on the training games, which is expected since we are optimizing DDCFR's performance on these games. However, even on the unseen testing games shown in Figure 3, DDCFR still achieves competitive performance against the other CFR variants, thanks to the fact that DDCFR is designed with generalization in mind. Specifically, it yields an average reduction in exploitability of 42% compared to DCFR, which is remarkable given DCFR's already strong performance.
Unlike CFR+ and DCFR with fixed and manually-specified discounting schemes, our DDCFR achieves better performance on all games thanks to the learned dynamic discounting scheme's ability to adjust the discounting weights on the fly using information available at runtime. Greedy Weights (Zhang et al., 2022) is a recently proposed algorithm specifically designed for normal-form games that greedily weighs iterates based on regrets observed at runtime. Although it can be adapted for extensive-form games as well, we find that its performance is notably inferior to that of DDCFR, which is designed explicitly for extensive-form games (Appendix G). We suspect that when faced with a large number of information sets, the computed weights might not be appropriate, resulting in poor performance in extensive-form games.

To further understand the behaviors of the learned dynamic discounting scheme $\pi_\theta$, we visualize its actions (i.e., $\alpha_t$, $\beta_t$, $\gamma_t$, $\tau_t$) during the iteration process. Figure 4 shows that the actions output by $\pi_\theta$ have different values in different games but exhibit a similar trend. Specifically, $\alpha_t$ increases with iterations in all games while $\beta_t$ and $\gamma_t$ decrease. Compared with DCFR's fixed discounting scheme (i.e., α=1.5, γ=2), the learned scheme is more aggressive in earlier iterations. A smaller α and a bigger γ lead to smaller weights $\frac{(t-1)^{\alpha}}{(t-1)^{\alpha}+1}$ and $\left(\frac{t-1}{t}\right)^{\gamma}$ when discounting positive cumulative regrets and the cumulative strategy. This may be because the optimal strategy based on the opponent's strategy frequently changes in earlier iterations; thus, earlier iterations need to be given smaller weights. As the iteration progresses, the scheme becomes more moderate, which may aid in the convergence of the average strategy. Additionally, the scheme consistently uses more aggressive strategies when discounting negative cumulative regrets. This is beneficial as it allows for more rapid reuse of promising actions instead of waiting for cumulative regrets to become positive. The change in $\tau_t$ suggests that discounting weights should be adjusted more frequently in earlier iterations.

Figure 4: The learned discounting scheme behaves differently in various games yet exhibits a similar trend. Compared with DCFR's fixed discounting scheme (we can view DCFR as a special case of DDCFR, where $\alpha_1 = 1.5$, $\beta_1 = 0$, $\gamma_1 = 2$, and the duration $\tau_1 = \infty$), it is more aggressive in the earlier iterations and becomes more moderate as the iteration progresses.

4.3 ABLATION STUDIES

To verify the effectiveness of DDCFR's components, we conduct extensive ablation studies and report the results in Table 1. The results in each column are obtained by replacing one component of DDCFR while the rest remains unchanged. We use the exploitability on three testing games to compare the performance of the learned discounting schemes.

The number of training games. We first investigate how the number of training games affects the learned discounting scheme. DDCFR-x means using the first x training games from Kuhn Poker, Goofspiel-3, Liar's Dice-3, and Small Matrix. DDCFR-1 performs relatively poorly since the learned discounting scheme is prone to overfitting when using only one training game.
DDCFR-2 and DDCFR-3 obtain better results because they need to balance their performance across multiple games, which improves the generalization ability of the learned discounting schemes. With the addition of Small Matrix, DDCFR achieves a much lower exploitability in HUNL Subgame-3, probably because both games exhibit the same property of having high-loss actions.

The state representation. We consider how the state representation affects DDCFR's final performance. DDCFR-it and DDCFR-ex only use the normalized iteration and the normalized exploitability as the state representation, respectively. As shown in Table 1, DDCFR-it achieves better performance than DDCFR-ex, indicating that iterations provide the agent with more information for decision-making. However, both algorithms perform worse than DDCFR, showing that these two state features are complementary and that using them together obtains the best performance.

The action space design. DDCFR adopts a continuous action space design, i.e., α and γ are continuous values between 0 and 5, and β is a continuous value between -5 and 0. In contrast, DDCFR-dis uses a discrete action space design, i.e., it discretizes the ranges of α, β, and γ to consecutive integers. The results in Table 1 show that continuous actions help the learned dynamic discounting scheme perform more fine-grained control and obtain better performance.

The optimization algorithm. We compare DDCFR's optimization algorithm, ES, with RL. DDCFR-rl uses PPO to optimize the discounting policy, underperforming on all testing games. We attribute this outcome to several factors commonly encountered during training, such as sparse rewards, long time horizons, and multi-task learning.

Table 1: Ablation analyses of DDCFR.
Game        | DDCFR-1  | DDCFR-2  | DDCFR-3  | DDCFR-it | DDCFR-ex | DDCFR-dis | DDCFR-rl | DDCFR
Leduc Poker | 1.566e-4 | 1.490e-4 | 1.213e-4 | 1.249e-4 | 1.289e-4 | 1.156e-4  | 1.229e-4 | 1.198e-4
Subgame-3   | 2.533e-3 | 2.481e-3 | 2.393e-3 | 2.292e-3 | 5.320e-3 | 5.795e-3  | 4.987e-3 | 1.963e-3
Subgame-4   | 2.517e-3 | 2.175e-3 | 2.122e-3 | 2.714e-3 | 3.804e-3 | 5.803e-3  | 5.198e-3 | 2.005e-3

4.4 COMBINATION WITH OTHER CFR VARIANTS

Our proposed dynamic discounted regret minimization is a general framework that can be universally applied to any CFR variant that employs fixed and manually specified discounting rules. DDCFR is just an example application of this framework to DCFR. Predictive CFR+ (PCFR+) (Farina et al., 2021) is a recently proposed state-of-the-art CFR variant based on predictive Blackwell approachability. Nonetheless, it also uses a fixed quadratic discounting scheme. To showcase the generalizability of our framework, we apply it to PCFR+, resulting in Dynamic Predictive CFR+ (DPCFR+). We find that DPCFR+ can further unlock the potential of PCFR+ and enhance its performance (Appendix H). These results demonstrate the general applicability of our dynamic discounting framework. It can be regarded as a performance booster that can be applied in a plug-and-play manner to various CFR variants that employ fixed discounting rules.

5 CONCLUSION

We present DDCFR, the first equilibrium-finding framework that discounts prior iterations using an automatically-learned dynamic scheme. We first formulate CFR's iteration process as a carefully designed MDP and transform the discounting scheme learning problem into a policy optimization problem. We then design a scalable ES-based algorithm to optimize the discounting policy efficiently.
The learned discounting policy exhibits strong generalization ability, achieving competitive performance on both training games and new testing games.

6 ACKNOWLEDGEMENTS

This work is supported in part by the National Science and Technology Major Project (2022ZD0116401); the Natural Science Foundation of China under Grant 62076238, Grant 62222606, and Grant 61902402; the Jiangsu Key Research and Development Plan (No. BE2023016); and the China Computer Federation (CCF)-Tencent Open Fund.

REFERENCES

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2):281-305, 2012.

Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145-149, 2015.

Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418-424, 2018.

Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885-890, 2019a.

Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artificial Intelligence, pp. 1829-1836, 2019b.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Gabriele Farina, Chun Kai Ling, Fei Fang, and Tuomas Sandholm. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Advances in Neural Information Processing Systems, pp. 9229-9239, 2019.

Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Faster game solving via predictive Blackwell approachability: Connecting regret matching and mirror descent. In AAAI Conference on Artificial Intelligence, pp. 5363-5371, 2021.

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150, 2000.

Harold W. Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97-103, 1950.

Viliam Lisý, Marc Lanctot, and Michael H. Bowling. Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In International Conference on Autonomous Agents and Multiagent Systems, pp. 27-36, 2015.

Hao Liu, Richard Socher, and Caiming Xiong. Taming MAML: Efficient unbiased meta-reinforcement learning. In International Conference on Machine Learning, pp. 4061-4071, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508-513, 2017.

John F. Nash Jr. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences of the United States of America, 36(1):48-49, 1950.

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527-566, 2017.

Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. MIT Press, 1994.

Sheldon M. Ross. Goofspiel: The game of pure strategy. Journal of Applied Probability, 8(3):621-625, 1971.
Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951-2959, 2012.

Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' bluff: Opponent modelling in poker. In Conference on Uncertainty in Artificial Intelligence, pp. 550-558, 2005.

Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.

Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949-980, 2014.

Hang Xu, Kai Li, Haobo Fu, Qiang Fu, and Junliang Xing. AutoCFR: Learning to design counterfactual regret minimization algorithms. In AAAI Conference on Artificial Intelligence, pp. 5244-5251, 2022.

Zhongwen Xu, Hado P. van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396-2407, 2018.

Tong Yu and Hong Zhu. Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689, 2020.

Hugh Zhang, Adam Lerer, and Noam Brown. Equilibrium finding in normal-form games via greedy regret minimization. In AAAI Conference on Artificial Intelligence, pp. 9484-9492, 2022.

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pp. 1729-1736, 2007.

A RIGOROUS DEFINITIONS OF REACH PROBABILITIES AND COUNTERFACTUAL VALUES

The history reach probability of a history $h$ is the joint probability of reaching that history if all players play according to $\sigma$, denoted by $\pi^\sigma(h) = \prod_{h'a \sqsubseteq h} \sigma_{P(h')}(h', a)$, where the relationship $g \sqsubseteq h$ indicates that $g$ is equal to or a prefix of $h$, and $P(h)$ is the unique player who takes an action at history $h$. It can be decomposed into each player's contribution, i.e., $\pi^\sigma(h) = \pi_i^\sigma(h)\,\pi_{-i}^\sigma(h)$, where $\pi_i^\sigma(h)$ refers to player $i$'s contribution and $\pi_{-i}^\sigma(h)$ denotes all players' contributions except player $i$. We define the information set reach probability as $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$, and the interval history reach probability from history $h'$ to $h$ as $\pi^\sigma(h', h) = \pi^\sigma(h)/\pi^\sigma(h')$ if $h' \sqsubseteq h$. $\pi_i^\sigma(I)$, $\pi_{-i}^\sigma(I)$, $\pi_i^\sigma(h', h)$, and $\pi_{-i}^\sigma(h', h)$ are defined similarly.

For player $i$ at an information set $I \in \mathcal{I}_i$ under a given strategy profile $\sigma$, the counterfactual value of $I$ is $v_i^\sigma(I) = \sum_{h \in I}\left(\pi_{-i}^\sigma(h)\sum_{z \in Z}\pi^\sigma(h, z)\,u_i(z)\right)$. The counterfactual value of a specific action $a$ in $I$ is $v_i^\sigma(I, a) = \sum_{h \in I}\left(\pi_{-i}^\sigma(h)\sum_{z \in Z}\pi^\sigma(ha, z)\,u_i(z)\right)$.

B PROOF OF THEOREM 1

Proof. Since the lowest amount of instantaneous regret on any iteration is $-\Delta$ and DDCFR multiplies negative regrets by $\frac{(t-1)^{\beta_t}}{(t-1)^{\beta_t}+1} \le \frac{(t-1)^{0}}{(t-1)^{0}+1} = \frac{1}{2}$ on iteration $t$, the regret for any action at any point is greater than $-2\Delta$. Consider the weighted sequence of iterates $\sigma'^1, \ldots, \sigma'^T$, where $\sigma'^t$ is identical to $\sigma^t$ but weighted by $w_{a,t} = \prod_{k=t+1}^{T}\left(\frac{k-1}{k}\right)^{\gamma_k}$ rather than $w_t = \prod_{k=t+1}^{T}\frac{(k-1)^{\alpha_k}}{(k-1)^{\alpha_k}+1}$.
The regret of action $a$ in information set $I$ on iteration $t$ of this new sequence is $\tilde{R}^t(I, a)$. From Lemma 5 we know that $R^{w,T}(I, a) \le \frac{8}{3}\Delta\sqrt{|A|T}$ for player $i$ for action $a$ in information set $I$. Since $w_{a,t}$ is an increasing sequence, we can apply Lemma 1 using weight $w_{a,t}$ for iteration $t \le T$ and $C = 2\Delta$. This yields $\tilde{R}^T(I, a) \le \frac{8}{3}\Delta\sqrt{|A|T} + 2\Delta$. As the weights sum to
$$\sum_{t=1}^{T} w_{a,t} \ge \sum_{t=1}^{T}\prod_{k=t+1}^{T}\frac{(k-1)^5}{k^5} = \sum_{t=1}^{T}\frac{t^5}{T^5} = \frac{T^2(T+1)^2(2T^2+2T-1)}{12T^5} \ge \frac{T}{6},$$
the weighted average regret satisfies
$$\tilde{R}^{w,T}_i = \frac{\max_{a \in A}\tilde{R}^T(I, a)}{\sum_{t=1}^{T} w_{a,t}} \le 6\left(\frac{8\Delta\sqrt{|A|}}{3\sqrt{T}} + \frac{2\Delta}{T}\right).$$
Applying Theorem 3 in CFR (Zinkevich et al., 2007), the overall weighted average regret for player $i$ is at most $6|\mathcal{I}_i|\left(\frac{8\Delta\sqrt{|A|}}{3\sqrt{T}} + \frac{2\Delta}{T}\right)$. Since $|\mathcal{I}_1| + |\mathcal{I}_2| = |\mathcal{I}|$, the weighted average strategies form a $6|\mathcal{I}|\left(\frac{8\Delta\sqrt{|A|}}{3\sqrt{T}} + \frac{2\Delta}{T}\right)$-Nash equilibrium after $T$ iterations.

Lemma 1. (Brown & Sandholm, 2019b, Lemma 1) Call a sequence $x_1, \ldots, x_T$ of bounded real values BC-plausible if $B > 0$, $C \ge 0$, $\sum_{k=1}^{t} x_k \ge -C$ for all $t \le T$, and $\sum_{t=1}^{T} x_t \le B$. For any BC-plausible sequence and any sequence of non-decreasing weights $w_t \ge 0$, $\sum_{t=1}^{T}(w_t x_t) \le w_T(B + C)$.

Lemma 2. Given a set of actions $A$, any sequence of rewards $v^t$, and any sequence of weights $w_t$ such that $\sum_{t=1}^{\infty} w_t = \infty$ and $|v^t(a) - v^t(b)| \le \Delta$ for all $t$ and all $a, b \in A$, after playing a sequence of strategies determined by regret matching in which every reward $v^t$ is weighted by $w_t$, the weighted average regret $R^{w,T}_i = \frac{\max_{a \in A}\sum_{t=1}^{T}(w_t r^t(a))}{\sum_{t=1}^{T} w_t}$ is bounded by
$$R^{w,T}_i \le \frac{\Delta\sqrt{|A|}\sqrt{\sum_{t=1}^{T} w_t^2}}{\sum_{t=1}^{T} w_t}.$$

Proof. The proof is identical to that of Theorem 2.8 in (Cesa-Bianchi & Lugosi, 2006), with $p = 2$ for the polynomially weighted average forecaster.

Lemma 3. (Brown & Sandholm, 2019b, Lemma 2) Given a sequence of strategies $\sigma^1, \ldots, \sigma^T$, each defining a probability distribution over a set of actions $A$, consider any definition of $Q^t(a)$ satisfying the following conditions:
1. $Q^0(a) = 0$
2. $Q^t(a) = Q^{t-1}(a) + r^t(a)$ if $Q^{t-1}(a) + r^t(a) > 0$
3. $0 \ge Q^t(a) \ge Q^{t-1}(a) + r^t(a)$ if $Q^{t-1}(a) + r^t(a) \le 0$.
The regret-like value $Q^t(a)$ is then an upper bound on the regret $R^t(a)$, and $Q^t(a) - Q^{t-1}(a) \ge r^t(a) = R^t(a) - R^{t-1}(a)$.

Lemma 4. (Brown & Sandholm, 2019b, Lemma 3) Given a set of actions $A$ and any sequence of rewards $v^t$ such that $|v^t(a) - v^t(b)| \le \Delta$ for all $t$ and all $a, b \in A$, after playing a sequence of strategies determined by regret matching but using the regret-like value $Q^t(a)$ in place of $R^t(a)$, $Q^T(a) \le \Delta\sqrt{|A|T}$ for all $a \in A$.

Lemma 5. Assume that player $i$ conducts $T$ iterations of DDCFR. Then the weighted regret for the player is at most $\Delta|\mathcal{I}_i|\sqrt{|A|T}$, and the weighted average regret for the player is at most $\frac{8\Delta|\mathcal{I}_i|\sqrt{|A|}}{3\sqrt{T}}$.

Proof. The weight of iteration $t < T$ is $w_t = \prod_{k=t+1}^{T}\frac{(k-1)^{\alpha_k}}{(k-1)^{\alpha_k}+1}$ and $w_T = 1$. Thus, $w_t \le \prod_{k=t+1}^{T}\frac{(k-1)^5}{(k-1)^5+1} \le 1$ for all $t$, and therefore $\sum_{t=1}^{T} w_t^2 \le T$. Additionally, $w_t \ge \prod_{k=t+1}^{T}\frac{(k-1)^1}{(k-1)^1+1} = \frac{t}{T}$ for $\frac{T}{2} \le t < T$ and $w_T = 1$. Thus, $\sum_{t=1}^{T} w_t \ge \sum_{t=\frac{T}{2}}^{T} w_t \ge \sum_{t=\frac{T}{2}}^{T}\frac{t}{T} \ge \frac{3T}{8}$. Applying Lemmas 2 and 4, we see that $Q^{w,T}_i(I, a) \le \frac{\Delta\sqrt{|A|}\sqrt{\sum_{t=1}^{T} w_t^2}}{\sum_{t=1}^{T} w_t} \le \frac{8\Delta\sqrt{|A|T}}{3T}$. Applying Theorem 3 in CFR (Zinkevich et al., 2007), we see that $Q^{w,T}_i \le \frac{8\Delta|\mathcal{I}_i|\sqrt{|A|T}}{3T}$. Since $R^{w,T}_i \le Q^{w,T}_i$, we have $R^{w,T}_i \le \frac{8\Delta|\mathcal{I}_i|\sqrt{|A|}}{3\sqrt{T}}$.

C FEATURE CALCULATIONS IN DDCFR

The features in DDCFR contain the normalized iteration $\hat{t}$ and the normalized exploitability $\hat{E}^G_{t-1}$. The iteration number is effortlessly obtainable. The normalized exploitability calculation can be done in parallel with the instantaneous regret calculation. We provide a detailed explanation below. On each iteration $t$, our DDCFR first recursively traverses the game tree using the strategy $\sigma^t$ to compute the instantaneous regrets $r^t$.
Then, DDCFR updates the cumulative regrets $R^t$, the average strategy $\bar{\sigma}^t$, and the new strategy $\sigma^{t+1}$ using the weights $(\alpha_t, \beta_t, \gamma_t)$ generated by the discounting policy $\pi_\theta(s_t)$. Specifically, $\alpha_t, \beta_t, \gamma_t = \pi_\theta(s_t) = \pi_\theta([\hat{t}, \hat{E}^G_{t-1}])$, where $\hat{t}$ is the normalization of iteration $t$ and $\hat{E}^G_{t-1}$ is the normalized exploitability of the average strategy $\bar{\sigma}^{t-1}$ at iteration $t-1$. Calculating $\hat{E}^G_{t-1}$ requires traversing the game tree using the average strategy $\bar{\sigma}^{t-1}$, which leads to additional time overhead under a naive implementation. However, it is crucial to note that we can calculate $\hat{E}^G_{t-1}$ and compute the instantaneous regrets $r^t$ in parallel, since these two processes depend on two different strategies, $\bar{\sigma}^{t-1}$ and $\sigma^t$, and each requires traversing the game tree once. Consequently, there is no additional wall-clock time overhead associated with the calculation of exploitability in our parallel implementation.
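A sketch of the parallel arrangement described above, assuming the two game-tree traversals are exposed as separate functions; both workers below are placeholders that only mimic the cost of a traversal and are not the released implementation.

```python
# Sketch of Appendix C's parallel feature calculation: the exploitability of the
# previous average strategy and the instantaneous regrets of the current strategy
# depend on different strategies, so the two traversals can run concurrently.
from concurrent.futures import ProcessPoolExecutor
import time

def compute_exploitability(avg_strategy):
    time.sleep(1.0)          # placeholder for a full game-tree traversal
    return 0.01

def compute_instant_regrets(current_strategy):
    time.sleep(1.0)          # placeholder for a full game-tree traversal
    return {"infoset": [0.1, -0.2]}

if __name__ == "__main__":
    start = time.time()
    with ProcessPoolExecutor(max_workers=2) as pool:
        expl_future = pool.submit(compute_exploitability, "avg_strategy_t_minus_1")
        regret_future = pool.submit(compute_instant_regrets, "strategy_t")
        exploitability, regrets = expl_future.result(), regret_future.result()
    # Wall-clock time is roughly one traversal, not two.
    print(f"{time.time() - start:.2f}s", exploitability, regrets)
```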
D RUNNING TIME COMPARISON BETWEEN DCFR AND DDCFR

To further support the claim in Section 3.4, we compare the running time between DCFR and DDCFR. Since the running time is influenced by the available hardware resources of the system, we execute the algorithms on the various games one by one, but distributed across multiple servers. In addition, to mitigate the time overhead associated with writing to disk, we remove the logging module and only print the final running time. Table 2 presents the running time of the two algorithms on eight testing games. For example, completing 1000 iterations of DCFR on Battleship-2 takes roughly 14 minutes, 54 seconds, and 558 milliseconds. In comparison to DCFR, DDCFR exhibits only a 0.22% increase in time on average. Therefore, the additional computation time incurred by DDCFR is negligible compared to the overall iteration time.

Table 2: Running time comparison between DCFR and DDCFR.
Game            | DCFR time    | DDCFR time   | (DDCFR time - DCFR time) / DCFR time
Battleship-2    | 0:14:54.558  | 0:14:55.814  | 0.140%
Battleship-3    | 11:10:46.895 | 11:12:05.156 | 0.194%
Goofspiel-4     | 00:01:03.259 | 00:01:03.518 | 0.409%
Liar's Dice-4   | 0:06:57.896  | 0:06:58.577  | 0.163%
Leduc Poker     | 0:08:13.140  | 0:08:13.921  | 0.158%
Big Leduc Poker | 14:55:48.347 | 14:58:07.357 | 0.259%
HUNL Subgame-3  | 0:02:54.634  | 0:02:55.024  | 0.223%
HUNL Subgame-4  | 0:02:14.379  | 0:02:14.648  | 0.200%

E ALL ITERATIONS OF DDCFR AND DISCOUNTING CFR VARIANTS

Due to the rapid decline in exploitability in the early iterations of different CFR algorithms, their differences are primarily manifested in the later iterations. Consequently, for a better comparison of the algorithms, we focus on the later iterations in the main paper. For the sake of completeness, we show all iterations of DDCFR and the discounting CFR variants in Figures 5 and 6.

Figure 5: All iterations of DDCFR and discounting CFR variants on four training games.

Figure 6: All iterations of DDCFR and discounting CFR variants on eight testing games.

F DETERMINISTIC NATURE OF DDCFR'S ITERATION PROCESS

We use ES to learn the discounting policy and select one of the best-performing policies on the training games for testing. It is worth noting that there is no randomness throughout the iteration process of DDCFR when using the learned discounting policy on the testing games. Specifically, DDCFR always initializes with the uniform random average strategy, and the policy network consistently produces the same fixed action when its inputs are determined. Put simply, each run of DDCFR with the learned model consistently generates the same exploitability curve.

G COMPARISON WITH GREEDY WEIGHTS

We conduct experimental evaluations comparing DDCFR and Greedy Weights (Zhang et al., 2022) on extensive-form games. We integrate Greedy Weights with CFR by extending the officially released code for random normal-form games. We run a grid search within the range of {0.1, 0.5, 1.0, 2.0, 5.0} to tune the hyperparameter weight floor for each game. The experimental results shown in Figure 7 demonstrate that DDCFR consistently converges faster to a lower exploitability on four extensive-form games. While Greedy Weights is designed to efficiently solve normal-form games and shows promising results on extensive-form games, our DDCFR, designed explicitly for extensive-form games, outperforms Greedy Weights significantly. One key distinction between normal-form games and extensive-form games lies in the quantity of information sets. We suspect that when faced with a large number of information sets, the computed weights in Greedy Weights might not be appropriate, resulting in poor performance in extensive-form games.

Figure 7: Performance comparison between DDCFR and Greedy Weights on extensive-form games.

H COMPARISON AND COMBINATION WITH PREDICTIVE CFR+

Predictive CFR+ (PCFR+) (Farina et al., 2021) is a recently proposed state-of-the-art CFR variant based on predictive Blackwell approachability. However, it is not a consistently well-performing algorithm. The authors of PCFR+ have already found in their paper that PCFR+ is faster than DCFR on non-poker games, whereas on poker games DCFR is the fastest. We present a comparison between our DDCFR and PCFR+ in Figure 8. The experimental results align with expectations, i.e., PCFR+ converges faster in non-poker games while DDCFR outperforms PCFR+ in poker games, given that our DDCFR is built upon DCFR. Nonetheless, DDCFR can improve the performance of DCFR to narrow the gap with PCFR+ on non-poker games, while amplifying the advantage over PCFR+ on poker games.
Figure 8: Performance comparison between DDCFR and PCFR+ on four testing games.

It is worth emphasizing that our main contribution is not to propose a specific algorithm, but to propose a general dynamic and trainable discounting regret minimization framework. This framework can be universally applied to any CFR variant that employs fixed and manually specified discounting rules. The DDCFR algorithm presented in the paper is just an example application of our framework to the DCFR algorithm.

PCFR+ also uses a fixed quadratic discounting scheme, formalized as $\left(\frac{t-1}{t}\right)^{\gamma}$, where $\gamma$ is set to 2 in PCFR+. We apply our framework to learn a dynamic discounted averaging strategy, resulting in the Dynamic Predictive CFR+ (DPCFR+) algorithm. Figure 9 illustrates that this integration can further unlock PCFR+'s potential for solving non-poker games without sacrificing its performance on poker games.

Figure 9: Performance comparison between DPCFR+ and PCFR+ on four testing games.

Besides PCFR+, we consider another algorithm from the PCFR+ paper, PDCFR, which can be seen as a combination of PCFR+ and DCFR. PDCFR also uses a fixed discounting scheme and performs well on both non-poker and poker games, benefiting from the characteristics inherited from PCFR+ and DCFR. We aim to enhance PDCFR using our framework, resulting in the Predictive Dynamic Discounted CFR (PDDCFR) algorithm. As illustrated in Figure 10, our framework significantly enhances PDCFR's performance in both non-poker and poker games.

Figure 10: Performance comparison between PDDCFR and PDCFR on four testing games.

All of the above results demonstrate the general applicability of our dynamic discounting framework. Our framework can be regarded as a performance booster that can be applied in a plug-and-play manner to various CFR variants that employ fixed discounting rules.

I ABOUT LOGARITHMIC OPERATION

The state in DDCFR consists of the normalized iteration and the normalized exploitability. When calculating the normalized exploitability, we employ a logarithmic operation. We use Kuhn Poker as an example to demonstrate how the normalized exploitability changes with the number of iterations. We compare three different normalization methods in Table 3.

Table 3: Normalized exploitability variation in Kuhn Poker with three normalization methods.
t                                                               | 1        | 10       | 20       | 100      | 500      | 1000
$E_t$                                                           | 4.583e-1 | 3.302e-2 | 1.599e-2 | 1.767e-3 | 1.699e-4 | 4.021e-5
$(\log E_t - \log E_{\min})/(\log E_1 - \log E_{\min})$         | 1.000    | 0.902    | 0.875    | 0.793    | 0.706    | 0.652
$(\sqrt{E_t} - \sqrt{E_{\min}})/(\sqrt{E_1} - \sqrt{E_{\min}})$ | 1.000    | 0.268    | 0.187    | 0.062    | 0.019    | 0.009
$(E_t - E_{\min})/(E_1 - E_{\min})$                             | 1.000    | 0.072    | 0.034    | 0.004    | 0.000    | 0.000

CFR variants have an upper bound on the regret values at a square-root scale, but they converge much faster in practice.
J DESCRIPTION OF THE GAMES

Kuhn Poker is a simplified version of poker proposed by Harold W. Kuhn (Kuhn, 1950). It uses a deck of three cards (J, Q, K); players begin by betting one chip into the pot and receiving a private card from the shuffled deck. Players can take four different actions: 1) fold, in which they forfeit the current game and the other player wins the pot; 2) call, in which they match the other player's bet; 3) bet, in which they add more chips to the pot; and 4) check, in which they decline to place a bet when not facing one from the other player. In Kuhn Poker, each player has one chance to bet one chip. If neither player folds, both players reveal their cards, and the player with the higher card takes all chips in the pot. The utility for each player is the difference in the number of chips they possess before and after playing.

Leduc Poker is a larger poker game first introduced in (Southey et al., 2005). It uses six cards divided into two suits, each with three ranks (Js, Qs, Ks, Jh, Qh, Kh). As in Kuhn Poker, each player starts by betting one chip, receives a private card, and has the same action options. Leduc Poker has two betting rounds: in the first round each player may bet two chips, and in the second round four chips. After the first round, one public card is revealed. If a player's private card matches the public card, that player wins the game; otherwise, the player with the higher private card wins.

Big Leduc Poker is a more complex version of poker. Its rules are similar to Leduc Poker, but it uses twenty-four cards divided into twelve ranks. In addition to the larger deck, it allows a maximum of six raises per round instead of the two raises allowed in Leduc Poker.

HUNL Subgame-x is a heads-up no-limit Texas hold'em (HUNL) subgame generated and solved in real time by the state-of-the-art poker agent Libratus (Brown & Sandholm, 2018) (https://github.com/CMU-EM/LibratusEndgames). In HUNL, the two players, P1 and P2, start each hand with $20,000 and are dealt two private cards from a standard 52-card deck. P1 places $100 in the pot, and P2 places $50. P2 starts the first round of betting, and players take turns choosing to fold, call, check, or raise. A round ends when a player calls, provided both players have acted. After the first round, three community cards are dealt face-up for all players to observe, and P1 starts a similar betting round. In the third and fourth rounds, one additional community card is dealt, and betting again starts with P1. Unless a player has folded, the player with the best five-card poker hand, constructed from their two private cards and the five community cards, wins the pot. In the case of a tie, the pot is split evenly. HUNL Subgame-3 begins at the start of the final betting round with $500 in the pot, and HUNL Subgame-4 begins at the start of the final betting round with $3,750 in the pot. In the first betting round, we use bet sizes of 0.5x and 1x the size of the pot, plus an all-in bet; in the other betting rounds, we use 1x the pot and all-in.
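Several of the benchmark games above are available in open-source game libraries. As a point of reference (and an assumption on our part, since the paper does not state which implementation it uses), the sketch below loads Kuhn and Leduc Poker with OpenSpiel and evaluates the exploitability of the uniform random strategy, i.e., the starting point of the exploitability curves reported in the figures.

```python
import pyspiel
from open_spiel.python import policy
from open_spiel.python.algorithms import exploitability

# Load two of the benchmark games described above and evaluate the
# exploitability of the uniform random strategy profile, i.e., the
# quantity that the CFR variants drive toward zero.
for name in ("kuhn_poker", "leduc_poker"):
    game = pyspiel.load_game(name)
    uniform = policy.UniformRandomPolicy(game)
    print(name, exploitability.exploitability(game, uniform))
```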
Liar's Dice-x (Lisý et al., 2015) is a dice game in which each player receives an x-sided die and a cup used for concealment. At the start of the game, each player rolls their die under the cup and examines it secretly. The first player starts by making a bid of the form p-q, announcing that there are at least p dice showing the number q under both cups. The highest face x is wild and can be treated as any number. On their turn, players then have two choices: 1) bid p-q with p or q greater than in the previous player's bid, or 2) call "Liar", which ends the game immediately and reveals all dice. If the last bid is not satisfied, the player who called "Liar" wins the game. The winner's utility is 1 and the loser's is -1.

Goofspiel-x (Ross, 1971) is a bidding card game in which each player starts with x cards numbered 1 to x. A shuffled point-card deck, also numbered 1 to x, is used in the game, which proceeds over x rounds. In each round, players select a card from their hand as a sealed bid for the top revealed point card. When both players have chosen their cards, they reveal them simultaneously, and the player with the higher bid wins the point card; if the bids are equal, the point card is discarded. After x rounds, the player holding the most point cards wins the game and receives a utility of 1, while the losing player receives a utility of -1. We use a fixed deck of decreasing points and an imperfect-information variant in which players are only told whether they have won or lost the bid, not which card the other player played.

Battleship-x (Farina et al., 2019) is a classic board game in which each player secretly places a ship on their own grid of size 2 × x at the start of the game. Each ship is 1 × 2 in size and has a value of 2. Players take turns shooting at the opponent's ship; a ship is sunk once all of its cells have been hit. The game ends when one player's ship is sunk or when each player has completed three shots. The utility for each player is the sum of the values of the opponent's sunk ship minus the sum of the values of their own lost ship.

Small Matrix is a game in which player 1 can choose five actions (rock, paper, scissors, A1, and A2), while player 2 can only choose rock, paper, or scissors. When both players choose among rock, paper, and scissors, the game is a modified rock-paper-scissors game in which the winner receives two points if either player chose scissors and one point otherwise. However, when player 1 chooses A1 or A2, player 1 loses 10,000 or 20,000 points, respectively.

We measure the sizes of the games along several dimensions and report the results in Table 4. In the table, #Histories is the number of histories in the game tree, #Infosets is the number of information sets, and #Terminal histories is the number of terminal histories. Depth is the depth of the game tree, i.e., the maximum number of actions in one history. Max size of infosets is the maximum number of histories that belong to the same information set.
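These quantities can be obtained by exhaustively traversing the game tree. The sketch below is one way to do this for the smaller games, using OpenSpiel as an assumed implementation; the paper does not state which game implementations it uses, and its exact counting conventions (e.g., whether chance actions count toward depth) may differ.

```python
import pyspiel


def measure_game(name):
    """Count histories, information sets, terminal histories, and tree depth
    by exhaustively walking the game tree (feasible for the smaller games)."""
    game = pyspiel.load_game(name)
    stats = {"histories": 0, "terminals": 0, "depth": 0}
    infosets = set()

    def walk(state, depth):
        stats["histories"] += 1
        stats["depth"] = max(stats["depth"], depth)
        if state.is_terminal():
            stats["terminals"] += 1
            return
        if not state.is_chance_node():
            # Information sets are identified per player by the acting
            # player's information state string.
            infosets.add((state.current_player(), state.information_state_string()))
        for action in state.legal_actions():
            walk(state.child(action), depth + 1)

    walk(game.new_initial_state(), 0)
    return stats["histories"], len(infosets), stats["terminals"], stats["depth"]


# Order of the returned tuple: #Histories, #Infosets, #Terminal histories, Depth.
print("kuhn_poker:", measure_game("kuhn_poker"))
```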
Table 4: Sizes of the games.

Game              #Histories    #Infosets   #Terminal histories   Depth   Max size of infosets
Small Matrix              21            2                    15       3                      5
Kuhn Poker                58           12                    30       6                      2
Goofspiel-3               67           16                    36       5                      4
Liar's Dice-3          1,147          192                   567      10                      3
Battleship-2          10,069        3,286                 5,568       9                      4
Battleship-3         732,607       81,027               552,132       9                      7
Goofspiel-4            1,077          162                   576       7                     14
Liar's Dice-4          8,181        1,024                 4,080      12                      4
Leduc Poker            9,457          936                 5,520      12                      5
Big Leduc Poker    6,178,561      100,800             3,953,424      20                     23
HUNL Subgame-3   398,112,843       69,184           261,126,360      10                  1,980
HUNL Subgame-4   244,005,483       43,240           158,388,120       8                  1,980

K RELATED WORK

DDCFR is closely related to hyperparameter optimization, since the discounting weights can be regarded as hyperparameters of the CFR algorithm. Hyperparameter optimization (Yu & Zhu, 2020) is a widely studied topic in machine learning that aims to find the optimal hyperparameters for a learning algorithm. Perhaps the most popular hyperparameter optimizers are grid and random search (Bergstra & Bengio, 2012); more sophisticated techniques, such as Bayesian optimization (Snoek et al., 2012), have also proven effective on many different tasks. However, these methods share a major limitation: they only estimate the best fixed hyperparameter values, which are then used throughout model learning. In contrast, DDCFR's discounting scheme dynamically adjusts the discounting weights online rather than merely finding the best fixed values. Recently proposed techniques such as meta-gradients (Xu et al., 2018) can tune hyperparameters on the fly by directly computing the gradients of the objective function with respect to the hyperparameters. In our setting, however, computing the gradients of exploitability with respect to the discounting policy's parameters via backpropagation is challenging, because the forward pass contains many CFR iterations involving complex tree-based recursive computations. To our knowledge, Greedy Weights (Zhang et al., 2022) is the only other dynamic discounted regret minimization algorithm; however, it focuses on solving normal-form games, whereas DDCFR is designed explicitly for extensive-form games.