# Last-iterate Convergence in Extensive-Form Games

Chung-Wei Lee, University of Southern California, leechung@usc.edu
Christian Kroer, Columbia University, christian.kroer@columbia.edu
Haipeng Luo, University of Southern California, haipengl@usc.edu

Regret-based algorithms are highly efficient at finding approximate Nash equilibria in sequential games such as poker games. However, most regret-based algorithms, including counterfactual regret minimization (CFR) and its variants, rely on iterate averaging to achieve convergence. Inspired by recent advances on last-iterate convergence of optimistic algorithms in zero-sum normal-form games, we study this phenomenon in sequential games, and provide a comprehensive study of last-iterate convergence for zero-sum extensive-form games with perfect recall (EFGs), using various optimistic regret-minimization algorithms over treeplexes. This includes algorithms using the vanilla entropy or squared Euclidean norm regularizers, as well as their dilated versions which admit more efficient implementation. In contrast to CFR, we show that all of these algorithms enjoy last-iterate convergence, with some of them even converging exponentially fast. We also provide experiments to further support our theoretical results.

## 1 Introduction

Extensive-form games (EFGs) are an important class of games in game theory and artificial intelligence which can model imperfect information and sequential interactions. EFGs are typically solved by finding or approximating a Nash equilibrium. Regret-minimization algorithms are among the most popular approaches for approximating Nash equilibria. The motivation comes from a classical result which says that in two-player zero-sum games, when both players use no-regret algorithms, the average strategy converges to a Nash equilibrium [Freund and Schapire, 1999, Hart and Mas-Colell, 2000, Zinkevich et al., 2007]. Counterfactual Regret Minimization (CFR) [Zinkevich et al., 2007] and its variants such as CFR+ [Tammelin, 2014] are based on this motivation.

However, due to their ergodic convergence guarantee, theoretical convergence rates of regret-minimization algorithms are typically limited to $O(1/\sqrt{T})$ or $O(1/T)$ for $T$ rounds, and this is also the case in practice [Brown and Sandholm, 2019a, Burch et al., 2019]. In contrast, it is known that linear convergence rates are achievable for certain other first-order algorithms [Tseng, 1995, Gilpin et al., 2008]. Additionally, the averaging procedure can create complications. It not only increases the computational and memory overhead [Bowling et al., 2015], but also makes things difficult when incorporating neural networks in the solution process, where averaging is usually not possible. Indeed, to address this issue, Brown et al. [2019] create a separate neural network to approximate the average strategy in their Deep CFR model.

Therefore, a natural idea is to design regret-minimization algorithms whose last strategy converges (we call this last-iterate convergence), ideally at a faster rate than the average iterate. Unfortunately, many regret-minimization algorithms such as regret matching, regret matching+, and hedge, are known not to satisfy this property, empirically and theoretically, even for normal-form games. Although Bowling et al. [2015] find that in Heads-Up Limit Hold'em poker the last strategy of CFR+ is better than the average strategy, and Farina et al.
[2019b] observe in some experiments that the last iterates of optimistic OMD and FTRL converge fast, a theoretical understanding of this phenomenon is still absent for EFGs.

In this work, inspired by recent results on last-iterate convergence in normal-form games [Wei et al., 2021], we greatly extend the theoretical understanding of last-iterate convergence of regret-minimization algorithms in two-player zero-sum extensive-form games with perfect recall, and open up many interesting directions both in theory and practice. First, we show that any optimistic online mirror-descent algorithm instantiated with a strongly convex regularizer that is continuously differentiable on the EFG strategy space provably enjoys last-iterate convergence, while CFR with either regret matching or regret matching+ fails to converge. Moreover, for some of the optimistic algorithms, we further show explicit convergence rates. In particular, we prove that optimistic mirror descent instantiated with the 1-strongly-convex dilated entropy regularizer [Kroer et al., 2020], which we refer to as Dilated Optimistic Multiplicative Weights Update (DOMWU), has a linear convergence rate under the assumption that there is a unique Nash equilibrium; we note that this assumption was also made by Daskalakis and Panageas [2019], Wei et al. [2021] in order to achieve similar results for normal-form games.

## 2 Related Work

**Extensive-form Games** Here we focus on work related to two-player zero-sum perfect-recall games. Although there are many game-solving techniques such as abstraction [Kroer and Sandholm, 2014, Ganzfried and Sandholm, 2014, Brown et al., 2015], endgame solving [Burch et al., 2014, Ganzfried and Sandholm, 2015], and subgame solving [Moravcik et al., 2016, Brown and Sandholm, 2019b], these methods all rely on scalable methods for computing approximate Nash equilibria. There are several classes of algorithms for computing approximate Nash equilibria, such as double-oracle methods [McMahan et al., 2003], fictitious play [Brown, 1951, Heinrich and Silver, 2016], first-order methods [Hoda et al., 2010, Kroer et al., 2020], and CFR methods [Zinkevich et al., 2007, Lanctot et al., 2009, Tammelin, 2014]. Notably, variants of the CFR approach have achieved significant success in poker games [Bowling et al., 2015, Moravčík et al., 2017, Brown and Sandholm, 2018]. Underlying the first-order and CFR approaches is the sequence-form representation [von Stengel, 1996], which allows the problem to be represented as a bilinear saddle-point problem. This leads to algorithms based on smoothing techniques and other first-order methods [Gilpin et al., 2008, Kroer et al., 2017, Gao et al., 2021], and enables CFR via the theorem connecting no-regret guarantees to Nash equilibrium.

**Online Convex Optimization and Optimistic Regret Minimization** Online convex optimization [Zinkevich, 2003] is a framework for repeated decision making where the goal is to minimize regret. When applied to repeated two-player zero-sum games, it is known that the average strategy converges to Nash equilibria at the rate of $O(1/\sqrt{T})$ when both players apply regret-minimization algorithms whose regret grows on the order of $O(\sqrt{T})$ [Freund and Schapire, 1999, Hart and Mas-Colell, 2000, Zinkevich et al., 2007]. Moreover, when the players use optimistic regret-minimization algorithms, the convergence rate is improved to $O(1/T)$ [Rakhlin and Sridharan, 2013, Syrgkanis et al., 2015].
Recent works have applied optimism ideas to EFGs, such as optimistic algorithms with dilated regularizers [Kroer et al., 2020, Farina et al., 2019b], CFR-like local optimistic algorithms [Farina et al., 2019a], and optimistic CFR algorithms [Burch, 2018, Brown and Sandholm, 2019a, Farina et al., 2021a]. However, the theoretical results in all these existing papers consider the average strategy, while we are the first to consider last-iterate convergence in EFGs.

**Last-iterate Convergence in Saddle-point Optimization** As mentioned previously, two-player zero-sum games can be formulated as saddle-point optimization problems. Saddle-point problems have recently gained a lot of attention due to their applications in machine learning, for example in generative adversarial networks [Goodfellow et al., 2014]. Basic algorithms, including gradient descent ascent and multiplicative weights update, diverge even in simple instances [Mertikopoulos et al., 2018, Bailey and Piliouras, 2018]. In contrast, their optimistic versions, optimistic gradient descent ascent (OGDA) [Daskalakis et al., 2018, Mertikopoulos et al., 2019, Wei et al., 2021] and optimistic multiplicative weights update (OMWU) [Daskalakis and Panageas, 2019, Lei et al., 2021, Wei et al., 2021], have been shown to enjoy attractive last-iterate convergence guarantees. However, almost none of these results apply to the case of EFGs: Wei et al. [2021] show a result that implies linear convergence of vanilla OGDA in EFGs (see Corollary 5), but no results are known for vanilla OMWU or, more importantly, for algorithms instantiated with dilated regularizers, which lead to fast iterate updates in EFGs. In this work we extend the existing results on normal-form games to EFGs, including the practically-important dilated regularizers.

## 3 Problem Setup

We start with some basic notation. For a vector $z$, we use $z_i$ to denote its $i$-th coordinate and $\|z\|_p$ to denote its $p$-norm (with $\|z\|$ being a shorthand for $\|z\|_2$). For a convex function $\psi$, the associated Bregman divergence is defined as $D_\psi(u, v) = \psi(u) - \psi(v) - \langle \nabla\psi(v), u - v\rangle$, and $\psi$ is called $\kappa$-strongly convex with respect to the $p$-norm if $D_\psi(u, v) \ge \frac{\kappa}{2}\|u - v\|_p^2$ holds for all $u$ and $v$ in the domain. The Kullback-Leibler divergence, which is the Bregman divergence with respect to the entropy function, is denoted by $\mathrm{KL}(\cdot, \cdot)$. Finally, we use $\Delta^P$ to denote the $(P-1)$-dimensional simplex and $[N]$ to denote the set $\{1, 2, \ldots, N\}$ for some positive integer $N$.

**Extensive-form Games as Bilinear Saddle-point Optimization** We consider the problem of finding a Nash equilibrium of a two-player zero-sum extensive-form game (EFG) with perfect recall. Instead of formally introducing the definition of an EFG (see Appendix A for an example), for the purpose of this work, it suffices to consider an equivalent formulation, which casts the problem as a simple bilinear saddle-point optimization [von Stengel, 1996]:

$$\min_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}} x^\top G y = \max_{y\in\mathcal{Y}} \min_{x\in\mathcal{X}} x^\top G y, \qquad (1)$$

where $G \in [-1, +1]^{M\times N}$ is a known matrix, and $\mathcal{X} \subseteq \mathbb{R}^M$ and $\mathcal{Y} \subseteq \mathbb{R}^N$ are two polytopes called treeplexes (to be defined soon). The set of Nash equilibria is then defined as $\mathcal{Z}^\star = \mathcal{X}^\star \times \mathcal{Y}^\star$, where $\mathcal{X}^\star = \operatorname{argmin}_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}} x^\top G y$ and $\mathcal{Y}^\star = \operatorname{argmax}_{y\in\mathcal{Y}} \min_{x\in\mathcal{X}} x^\top G y$. Our goal is to find a point $z \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ that is close to the set of Nash equilibria $\mathcal{Z}^\star$, and we use the Bregman divergence (of some function $\psi$) between $z$ and the closest point in $\mathcal{Z}^\star$ to measure the closeness, that is, $\min_{z^\star\in\mathcal{Z}^\star} D_\psi(z^\star, z)$. For notational convenience, we let $P = M + N$ and $F(z) = (Gy, -G^\top x)$ for any $z = (x, y) \in \mathcal{Z} \subseteq \mathbb{R}^P$.
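To make the notation concrete, the following minimal sketch (Python/NumPy) evaluates the operator $F$ and the duality gap for a toy $2\times 2$ matrix game, i.e., the special case where both treeplexes are simplexes; the payoff matrix and all values below are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Toy illustration of the saddle-point formulation (1) in the special case where
# both treeplexes are simplexes (a normal-form game). The payoff matrix below is a
# made-up example, not one of the benchmark games used in this paper.
G = np.array([[0.0, 1.0],
              [-1.0, 0.5]])

def F(x, y):
    """The operator F(z) = (Gy, -G^T x) for z = (x, y)."""
    return np.concatenate([G @ y, -G.T @ x])

def duality_gap(x, y):
    """xi = max_{y'} x^T G y' - min_{x'} x'^T G y; it is nonnegative and equals
    zero exactly at a Nash equilibrium (the max/min over a simplex are attained
    at pure strategies, hence the coordinate-wise max/min)."""
    return np.max(G.T @ x) - np.min(G @ y)

x = np.array([0.5, 0.5])  # uniform strategies as an arbitrary test point
y = np.array([0.5, 0.5])
print(F(x, y), duality_gap(x, y))
```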
Without loss of generality, we assume $\|F(z)\| \le 1$ for all $z \in \mathcal{Z}$ (which can always be ensured by normalizing the entries of $G$ accordingly).

**Treeplexes** The structure of the EFG is implicitly captured by the treeplexes $\mathcal{X}$ and $\mathcal{Y}$, which are generalizations of simplexes that capture the sequential structure of an EFG. The formal definition is as follows. (In Appendix A, we provide more details on the connection between treeplexes and the structure of the EFG, as well as concrete examples of treeplexes for better illustration.)

Definition 1 (Hoda et al. [2010]). A treeplex is recursively constructed via the following three operations:
1. Every probability simplex is a treeplex.
2. Given treeplexes $\mathcal{Z}_1, \ldots, \mathcal{Z}_K$, the Cartesian product $\mathcal{Z}_1 \times \cdots \times \mathcal{Z}_K$ is a treeplex.
3. (Branching) Given treeplexes $\mathcal{Z}_1 \subseteq \mathbb{R}^M$ and $\mathcal{Z}_2 \subseteq \mathbb{R}^N$, and any $i \in [M]$, the set $\{(u, u_i v) \in \mathbb{R}^{M+N} : u \in \mathcal{Z}_1, v \in \mathcal{Z}_2\}$ is a treeplex.

By definition, a treeplex is a tree-like structure built with simplexes, which intuitively represents the tree-like decision space of a single player, and an element in the treeplex represents a strategy for the player. Let $\mathcal{H}_\mathcal{Z}$ denote the collection of all the simplexes in treeplex $\mathcal{Z}$, which following Definition 1 can be recursively defined as: $\mathcal{H}_\mathcal{Z} = \{\mathcal{Z}\}$ if $\mathcal{Z}$ is a simplex; $\mathcal{H}_\mathcal{Z} = \bigcup_{k=1}^K \mathcal{H}_{\mathcal{Z}_k}$ if $\mathcal{Z}$ is a Cartesian product $\mathcal{Z}_1 \times \cdots \times \mathcal{Z}_K$; and $\mathcal{H}_\mathcal{Z} = \mathcal{H}_{\mathcal{Z}_1} \cup \mathcal{H}_{\mathcal{Z}_2}$ if $\mathcal{Z}$ is obtained from $\mathcal{Z}_1$ and $\mathcal{Z}_2$ via branching at index $i$. In EFG terminology, $\mathcal{H}_\mathcal{X}$ and $\mathcal{H}_\mathcal{Y}$ are the collections of information sets for player $x$ and player $y$ respectively, which are the decision points for the players, at which they select an action within the simplex. For any $h \in \mathcal{H}_\mathcal{Z}$, we let $\Omega_h$ denote the set of indices belonging to $h$, and for any $z \in \mathcal{Z}$, we let $z_h$ be the slice of $z$ whose indices are in $\Omega_h$. For each index $i$, we also let $h(i)$ be the simplex $i$ falls into, that is, $i \in \Omega_{h(i)}$.

In Definition 1, the last branching operation naturally introduces the concept of a parent variable for each $h \in \mathcal{H}_\mathcal{Z}$, which can again be recursively defined as: if $\mathcal{Z}$ is a simplex, then it has no parent; if $\mathcal{Z}$ is a Cartesian product $\mathcal{Z}_1 \times \cdots \times \mathcal{Z}_K$, then the parent of $h \in \mathcal{H}_\mathcal{Z}$ is the same as the parent of $h$ in the treeplex $\mathcal{Z}_k$ that $h$ belongs to (that is, $h \in \mathcal{H}_{\mathcal{Z}_k}$); finally, if $\mathcal{Z} = \{(u, u_i v) : u \in \mathcal{Z}_1, v \in \mathcal{Z}_2\}$, then for all $h \in \mathcal{H}_{\mathcal{Z}_2}$ without a parent, their parent in $\mathcal{Z}$ is $u_i$, and for all other $h$, their parents remain the same as in $\mathcal{Z}_1$ or $\mathcal{Z}_2$. We denote by $\sigma(h)$ the index of the parent variable of $h$, and let it be $0$ if $h$ has no parent. For convenience, we let $z_0 = 1$ for all $z \in \mathcal{Z}$ (so that $z_{\sigma(h)}$ is always well-defined). Also define $\mathcal{H}_i = \{h \in \mathcal{H}_\mathcal{Z} : \sigma(h) = i\}$ to be the collection of simplexes whose parent index is $i$. Similarly, for an index $i$, its parent index is defined as $p_i = \sigma(h(i))$, and $i$ is called a terminal index if it is not a parent index (that is, $i \neq p_j$ for all $j$). Finally, for an element $z \in \mathcal{Z}$ and an index $i$, we define $q_i = z_i / z_{p_i}$. In EFG terminology, $q_i$ specifies the probability of selecting action $i$ in the information set $h(i)$ according to strategy $z$.
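To make the index bookkeeping concrete, here is a minimal sketch that converts behavioral probabilities $q_i = z_i / z_{p_i}$ into the sequence form $z$ on a small, made-up treeplex (one root information set $\{1,2\}$ and one child information set $\{3,4\}$ reached through index 1); the structure and all names are hypothetical and only meant to illustrate the parent/index notation.

```python
# A tiny treeplex described by the parent index p_i of each coordinate (0 means
# "no parent", with z_0 := 1) and by the grouping of indices into information
# sets (the simplexes h, with index sets Omega_h). This specific structure is a
# made-up example: a root information set {1, 2}, and a second information set
# {3, 4} that is reached only after playing index 1 (so sigma(h) = 1 for it).
parent = {1: 0, 2: 0, 3: 1, 4: 1}                    # p_i for each index i
infosets = {"h_root": [1, 2], "h_after_1": [3, 4]}   # Omega_h for each h

def behavioral_to_sequence(q):
    """Given behavioral probabilities q_i (one distribution per information set),
    return the sequence-form vector z with z_i = q_i * z_{p_i}."""
    z = {0: 1.0}
    pending = dict(infosets)
    while pending:  # fill each information set once its parent index is known (top-down)
        for h, idxs in list(pending.items()):
            if parent[idxs[0]] in z:
                for i in idxs:
                    z[i] = q[i] * z[parent[i]]
                del pending[h]
    return z

q = {1: 0.5, 2: 0.5, 3: 0.3, 4: 0.7}  # uniform at the root, (0.3, 0.7) at the child
z = behavioral_to_sequence(q)
print(z)  # {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.15, 4: 0.35}; note z_3 + z_4 = z_1
```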
## 4 Optimistic Regret-minimization Algorithms

There are many different algorithms for solving bilinear saddle-point problems over general constrained sets. We focus specifically on a family of regret-minimization algorithms, called Optimistic Online Mirror Descent (OOMD) [Rakhlin and Sridharan, 2013], which are known to be highly efficient. In contrast to the CFR algorithm and its variants, which minimize a local regret notion at each information set (which upper bounds global regret), the algorithms we consider explicitly minimize global regret. As our main results in the next section show, these global regret-minimization algorithms enjoy last-iterate convergence, while CFR provably diverges.

Specifically, given a step size $\eta > 0$ and a convex function $\psi$ (called a regularizer), OOMD sequentially performs the following update for $t = 1, 2, \ldots$:

$$x_t = \operatorname{argmin}_{x\in\mathcal{X}} \Big\{ \eta\langle x, G y_{t-1}\rangle + D_\psi(x, \hat{x}_t) \Big\}, \qquad \hat{x}_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}} \Big\{ \eta\langle x, G y_t\rangle + D_\psi(x, \hat{x}_t) \Big\},$$
$$y_t = \operatorname{argmin}_{y\in\mathcal{Y}} \Big\{ \eta\langle y, -G^\top x_{t-1}\rangle + D_\psi(y, \hat{y}_t) \Big\}, \qquad \hat{y}_{t+1} = \operatorname{argmin}_{y\in\mathcal{Y}} \Big\{ \eta\langle y, -G^\top x_t\rangle + D_\psi(y, \hat{y}_t) \Big\},$$

with $(\hat{x}_1, \hat{y}_1) = (x_0, y_0) \in \mathcal{Z}$ being arbitrary. Using the shorthands $z_t = (x_t, y_t)$, $\hat{z}_t = (\hat{x}_t, \hat{y}_t)$, $\psi(z) = \psi(x) + \psi(y)$, and recalling the notation $F(z) = (Gy, -G^\top x)$, the updates above can be compactly written as

OOMD with regularizer $\psi$ over treeplex $\mathcal{Z}$:
$$z_t = \operatorname{argmin}_{z\in\mathcal{Z}} \Big\{ \eta\langle z, F(z_{t-1})\rangle + D_\psi(z, \hat{z}_t) \Big\}, \qquad \hat{z}_{t+1} = \operatorname{argmin}_{z\in\mathcal{Z}} \Big\{ \eta\langle z, F(z_t)\rangle + D_\psi(z, \hat{z}_t) \Big\}. \qquad (2)$$

Below, we discuss four different regularizers and their resulting algorithms (throughout, we use the notation $\Phi$ for regularizers based on the Euclidean norm and $\Psi$ for regularizers based on entropy).

**Vanilla Optimistic Gradient Descent Ascent (VOGDA)** Define the vanilla squared Euclidean norm regularizer as $\Phi^{\mathrm{van}}(z) = \frac{1}{2}\sum_i z_i^2$. We call OOMD instantiated with $\psi = \Phi^{\mathrm{van}}$ Vanilla Optimistic Gradient Descent Ascent (VOGDA). In this case, the Bregman divergence is $D_{\Phi^{\mathrm{van}}}(z, z') = \frac{1}{2}\|z - z'\|^2$ (by definition $\Phi^{\mathrm{van}}$ is thus 1-strongly convex with respect to the 2-norm), and the updates simply become projected gradient descent. For VOGDA there is no closed form for Eq. (2), since projection onto the treeplex $\mathcal{Z}$ is required. Nevertheless, the solution can still be computed in $O(P^2 \log P)$ time (recall that $P$ is the dimension of $\mathcal{Z}$) [Gilpin et al., 2008].

**Vanilla Optimistic Multiplicative Weights Update (VOMWU)** Define the vanilla entropy regularizer as $\Psi^{\mathrm{van}}(z) = \sum_i z_i \ln z_i$. We call OOMD with $\psi = \Psi^{\mathrm{van}}$ Vanilla Optimistic Multiplicative Weights Update (VOMWU). The Bregman divergence in this case is the generalized KL divergence: $D_{\Psi^{\mathrm{van}}}(z, z') = \sum_i \big( z_i \ln(z_i/z'_i) - z_i + z'_i \big)$. Although it is well-known that $\Psi^{\mathrm{van}}$ is 1-strongly convex with respect to the 1-norm for the special case when $\mathcal{Z}$ is a simplex, this is not true generally on a treeplex. Nevertheless, it can still be shown that $\Psi^{\mathrm{van}}$ is 1-strongly convex with respect to the 2-norm; see Appendix C. The name Multiplicative Weights Update is inherited from the case when $\mathcal{X}$ and $\mathcal{Y}$ are simplexes, in which case the updates in Eq. (2) have a simple multiplicative form. We emphasize, however, that in general VOMWU does not admit a closed-form update. Instead, to solve Eq. (2), one can equivalently solve a simpler dual optimization problem; see [Zimin and Neu, 2013, Proposition 1].

The two regularizers mentioned above ignore the structure of the treeplex. Dilated regularizers [Hoda et al., 2010], on the other hand, take the structure into account and allow one to decompose the update into simpler updates at each information set. Specifically, given any convex function $\psi$ defined over the simplex and a weight parameter $\alpha \in \mathbb{R}_+^{\mathcal{H}_\mathcal{Z}}$, the dilated version of $\psi$ defined over $\mathcal{Z}$ is:

$$\psi^{\mathrm{dil}}_\alpha(z) = \sum_{h\in\mathcal{H}_\mathcal{Z}} \alpha_h\, z_{\sigma(h)}\, \psi\!\left(\frac{z_h}{z_{\sigma(h)}}\right). \qquad (3)$$

This is well-defined since $z_h / z_{\sigma(h)}$ is indeed a distribution within the simplex $h$ (with $q_i$ for $i \in \Omega_h$ being its entries). It can also be shown that $\psi^{\mathrm{dil}}_\alpha$ is always convex in $z$ [Hoda et al., 2010]. Intuitively, $\psi^{\mathrm{dil}}_\alpha$ applies the base regularizer $\psi$ to the action distribution in each information set and then scales the value by its parent variable and the weight $\alpha_h$.
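As a sanity check on definition (3), the sketch below evaluates $\psi^{\mathrm{dil}}_\alpha$ with the entropy base regularizer on the same made-up toy treeplex as above; the weights $\alpha_h$ are arbitrary illustrative values, not the specific $\alpha$ used in our theoretical results.

```python
import numpy as np

# Sketch: evaluating the dilated regularizer (3) with the entropy base regularizer
# psi(q) = sum_i q_i ln q_i. The treeplex and the strategy z are the same made-up
# toy example as in the previous snippet; the weights alpha_h are arbitrary.
parent = {1: 0, 2: 0, 3: 1, 4: 1}
infosets = {"h_root": [1, 2], "h_after_1": [3, 4]}
alpha = {"h_root": 1.0, "h_after_1": 2.0}
z = {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.15, 4: 0.35}  # a valid sequence-form strategy

def entropy(q):
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * np.log(q)))

def dilated_regularizer(z, base=entropy):
    # psi_dil_alpha(z) = sum_h alpha_h * z_{sigma(h)} * psi(z_h / z_{sigma(h)})
    total = 0.0
    for h, idxs in infosets.items():
        z_parent = z[parent[idxs[0]]]           # z_{sigma(h)}, with z_0 = 1
        q_h = [z[i] / z_parent for i in idxs]   # local distribution at h
        total += alpha[h] * z_parent * base(q_h)
    return total

print(dilated_regularizer(z))  # Psi^dil_alpha evaluated at the toy strategy
```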
By picking different base regularizers, we obtain the following two algorithms.

**Dilated Optimistic Gradient Descent Ascent (DOGDA)** [Farina et al., 2019b] Define the dilated squared Euclidean norm regularizer $\Phi^{\mathrm{dil}}_\alpha$ as Eq. (3) with $\psi$ being the vanilla squared Euclidean norm $\psi(z) = \frac{1}{2}\sum_i z_i^2$. Direct calculation shows $\Phi^{\mathrm{dil}}_\alpha(z) = \frac{1}{2}\sum_i \alpha_{h(i)} z_i q_i$. We call OOMD with regularizer $\Phi^{\mathrm{dil}}_\alpha$ the Dilated Optimistic Gradient Descent Ascent algorithm (DOGDA). It is known that there exists an $\alpha$ such that $\Phi^{\mathrm{dil}}_\alpha$ is 1-strongly convex with respect to the 2-norm [Farina et al., 2019b]. Importantly, DOGDA decomposes the update Eq. (2) into simpler gradient descent-style updates at each information set, as shown below.

Lemma 1 (Hoda et al. [2010]). If $z^\star = \operatorname{argmin}_{z\in\mathcal{Z}} \big\{ \eta\langle z, f\rangle + D_{\Phi^{\mathrm{dil}}_\alpha}(z, \hat{z}) \big\}$, then for every $h \in \mathcal{H}_\mathcal{Z}$, the corresponding vector $q^\star_h = \frac{z^\star_h}{z^\star_{\sigma(h)}}$ can be computed by:

$$q^\star_h = \operatorname{argmin}_{q_h \in \Delta^{|\Omega_h|}} \Big\{ \eta\langle q_h, L_h\rangle + \frac{\alpha_h}{2}\|q_h - \hat{q}_h\|^2 \Big\}, \qquad (4)$$

where $\hat{q}_h = \frac{\hat{z}_h}{\hat{z}_{\sigma(h)}}$, $L_h$ is the slice of $L$ whose entries are in $\Omega_h$, and $L$ is defined through:

$$L_i = f_i + \sum_{h\in\mathcal{H}_i} \Big( \langle q^\star_h, L_h\rangle + \frac{\alpha_h}{2\eta}\|q^\star_h - \hat{q}_h\|^2 \Big).$$

While the definitions of $q^\star_h$ and $L$ are seemingly recursive, one can verify that they can in fact be computed in a bottom-up manner, starting with the terminal indices. Although Eq. (4) still does not admit a closed-form solution, it only requires projection onto a simplex, which can be solved efficiently; see e.g. [Condat, 2016]. Finally, with $q^\star_h$ computed for all $h$, $z^\star$ can be calculated in a top-down manner by definition.

**Dilated Optimistic Multiplicative Weights Update (DOMWU)** [Kroer et al., 2020] Finally, define the dilated entropy regularizer $\Psi^{\mathrm{dil}}_\alpha$ as Eq. (3) with $\psi$ being the vanilla entropy $\psi(z) = \sum_i z_i \ln z_i$. Direct calculation shows $\Psi^{\mathrm{dil}}_\alpha(z) = \sum_i \alpha_{h(i)} z_i \ln q_i$. We call OOMD with regularizer $\Psi^{\mathrm{dil}}_\alpha$ the Dilated Optimistic Multiplicative Weights Update algorithm (DOMWU). Similar to DOGDA, there exists an $\alpha$ such that $\Psi^{\mathrm{dil}}_\alpha$ is 1-strongly convex with respect to the 2-norm [Kroer et al., 2020].¹ Moreover, in contrast to all the three algorithms mentioned above, the update of DOMWU has a closed-form solution:

Lemma 2 (Hoda et al. [2010]). Suppose $z^\star = \operatorname{argmin}_{z\in\mathcal{Z}} \big\{ \eta\langle z, f\rangle + D_{\Psi^{\mathrm{dil}}_\alpha}(z, \hat{z}) \big\}$. Similarly to the notation $q_i$, define $q^\star_i = z^\star_i / z^\star_{p_i}$ and $\hat{q}_i = \hat{z}_i / \hat{z}_{p_i}$. Then we have

$$q^\star_i \propto \hat{q}_i \exp\big( -\eta L_i / \alpha_{h(i)} \big), \qquad \text{where } L_i = f_i - \sum_{h\in\mathcal{H}_i} \frac{\alpha_h}{\eta} \ln\Big( \sum_{j\in\Omega_h} \hat{q}_j \exp(-\eta L_j / \alpha_h) \Big).$$

This lemma again implies that we can compute $q^\star_i$ bottom-up, and then $z^\star$ can be computed top-down. This is similar to DOGDA, except that all updates have a closed form.

¹ Kroer et al. [2020] also show a better strong convexity result with respect to the 1-norm. We focus on the 2-norm here for consistency with our other results, but our analysis can be applied to the 1-norm case as well.
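To illustrate how Lemma 2 is used, here is a minimal sketch of the bottom-up computation of $q^\star$ and the top-down recovery of $z^\star$ on the same made-up toy treeplex; the loss vector $f$, the step size $\eta$, the weights $\alpha_h$, and the previous iterate $\hat{q}$ are all arbitrary illustrative values.

```python
import numpy as np

# Sketch of the closed-form DOMWU-style proximal step from Lemma 2 on the same
# made-up toy treeplex: L is computed bottom-up, the local distributions q* are
# multiplicative-weights updates of q_hat, and z* is recovered top-down.
parent = {1: 0, 2: 0, 3: 1, 4: 1}
infosets = {"h_root": [1, 2], "h_after_1": [3, 4]}
children = {1: ["h_after_1"], 2: [], 3: [], 4: []}   # H_i: information sets whose parent index is i
alpha = {"h_root": 1.0, "h_after_1": 2.0}
eta = 0.1
q_hat = {1: 0.5, 2: 0.5, 3: 0.3, 4: 0.7}             # previous behavioral strategy
f = {1: 0.2, 2: -0.1, 3: 1.0, 4: -0.5}               # loss vector (e.g., a slice of F(z_t))

L, q_new = {}, {}
for h in ["h_after_1", "h_root"]:                    # any child-before-parent order works
    for i in infosets[h]:
        # L_i = f_i - sum_{g in H_i} (alpha_g / eta) * ln( sum_j q_hat_j * exp(-eta L_j / alpha_g) )
        L[i] = f[i] - sum(
            (alpha[g] / eta) * np.log(sum(q_hat[j] * np.exp(-eta * L[j] / alpha[g])
                                          for j in infosets[g]))
            for g in children[i])
    w = {i: q_hat[i] * np.exp(-eta * L[i] / alpha[h]) for i in infosets[h]}
    s = sum(w.values())
    q_new.update({i: w[i] / s for i in infosets[h]})  # q*_i, normalized within h

z_new = {0: 1.0}
for h in ["h_root", "h_after_1"]:                    # top-down: z*_i = q*_i * z*_{p_i}
    for i in infosets[h]:
        z_new[i] = q_new[i] * z_new[parent[i]]
print(q_new, z_new)
```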
## 5 Last-iterate Convergence Results

In this section, we present our main last-iterate convergence results for the global regret-minimization algorithms discussed in Section 4. Before doing so, we point out again that the sequence produced by the well-known CFR algorithm may diverge (even if the average converges to a Nash equilibrium). Indeed, this can happen even for a simple normal-form game, as formally shown below.

Theorem 3. In the rock-paper-scissors game, CFR (with some particular initialization) produces a diverging sequence.

In fact, we empirically observe that all of CFR, CFR+ [Tammelin, 2014] (with simultaneous updates), and their optimistic versions [Farina et al., 2021a] may diverge in the rock-paper-scissors game. We introduce the algorithms and show the results in Appendix D. On the contrary, every algorithm from the OOMD family given by Eq. (2) ensures last-iterate convergence, as long as the regularizer is strongly convex and continuously differentiable.

Theorem 4. Consider the update rules in Eq. (2). Suppose that $\psi$ is 1-strongly convex with respect to the 2-norm and continuously differentiable on the entire domain, and $\eta \le \frac{1}{8P}$. Then $z_t$ converges to a Nash equilibrium as $t \to \infty$.

As mentioned, $\Phi^{\mathrm{van}}$ and $\Psi^{\mathrm{van}}$ are both 1-strongly convex with respect to the 2-norm, and so are $\Phi^{\mathrm{dil}}_\alpha$ and $\Psi^{\mathrm{dil}}_\alpha$ under some specific choice of $\alpha$ (in the rest of the paper, we fix this choice of $\alpha$). However, only $\Phi^{\mathrm{van}}$ and $\Phi^{\mathrm{dil}}_\alpha$ are continuously differentiable on the entire domain. Therefore, Theorem 4 provides an asymptotic convergence result only for VOGDA and DOGDA, but not VOMWU and DOMWU. Nevertheless, below, we resort to different analyses to show a concrete last-iterate convergence rate for three of our algorithms, which is a much more challenging task.

First of all, note that [Wei et al., 2021, Theorem 5, Theorem 8] already provide a general last-iterate convergence rate for VOGDA over polytopes. Since treeplexes are polytopes, we can directly apply their results and obtain the following corollary.

Corollary 5. Define $\operatorname{dist}^2(z, \mathcal{Z}^\star) = \min_{z^\star\in\mathcal{Z}^\star} \|z - z^\star\|^2$. For $\eta \le \frac{1}{8P}$, VOGDA guarantees $\operatorname{dist}^2(z_t, \mathcal{Z}^\star) \le 64\operatorname{dist}^2(\hat{z}_1, \mathcal{Z}^\star)(1 + C_1)^{-t}$, where $C_1 > 0$ is some constant that depends on the game and $\eta$.

However, the results for VOMWU in [Wei et al., 2021, Theorem 3] are very specific to normal-form games (that is, when $\mathcal{X}$ and $\mathcal{Y}$ are simplexes) and thus cannot be applied here. Nevertheless, we are able to extend their analysis to get the following result.

Theorem 6. If the EFG has a unique Nash equilibrium $z^\star$, then VOMWU with step size $\eta \le \frac{1}{8P}$ guarantees $\frac{1}{2}\|\hat{z}_t - z^\star\|^2 \le D_{\Psi^{\mathrm{van}}}(z^\star, \hat{z}_t) \le \frac{C_2}{t}$, where $C_2 > 0$ is some constant depending on the game, $\hat{z}_1$, and $\eta$.

We note that the uniqueness assumption is often required in the analysis of OMWU even for normal-form games [Daskalakis and Panageas, 2019, Wei et al., 2021] (although [Wei et al., 2021, Appendix A.5] provides empirical evidence to show that this may be an artifact of the analysis). Also note that for normal-form games, [Wei et al., 2021, Theorem 3] show a linear convergence rate, whereas here we only show a slower sub-linear rate, due to additional complications introduced by treeplexes (see more discussion in the next section). Whether this can be improved is left as a future direction. On the other hand, thanks to the closed-form updates of DOMWU, we are able to show the following linear convergence rate for this algorithm.

Theorem 7. If the EFG has a unique Nash equilibrium $z^\star$, then DOMWU with step size $\eta \le \frac{1}{8P}$ guarantees $\frac{1}{2}\|z_t - z^\star\|^2 \le D_{\Psi^{\mathrm{dil}}_\alpha}(z^\star, z_t) \le C_3(1 + C_4)^{-t}$, where $C_3, C_4 > 0$ are constants that depend on the game, $\hat{z}_1$, and $\eta$.

To the best of our knowledge, this is the first last-iterate convergence result for algorithms with dilated regularizers. Unfortunately, due to technical difficulties, we were unable to prove similar results for DOGDA (see Appendix E for more discussion). We leave that as an important future direction.

## 6 Analysis Overview

In this section, we provide an overview of our analysis. It starts from the following standard one-step regret analysis of OOMD (see, for example, [Wei et al., 2021, Lemma 1]):

Lemma 8. Consider the update rules in Eq. (2). Suppose that $\psi$ is 1-strongly convex with respect to the 2-norm, $\|F(z_1) - F(z_2)\| \le L\|z_1 - z_2\|$ for all $z_1, z_2 \in \mathcal{Z}$ and some $L > 0$, and $\eta \le \frac{1}{8L}$. Then for any $z \in \mathcal{Z}$ and any $t \ge 1$, we have

$$\eta F(z_t)^\top (z_t - z) \le D_\psi(z, \hat{z}_t) - D_\psi(z, \hat{z}_{t+1}) - D_\psi(\hat{z}_{t+1}, z_t) - \frac{15}{16} D_\psi(z_t, \hat{z}_t) + \frac{1}{16} D_\psi(\hat{z}_t, z_{t-1}).$$
Note that the Lipschitz condition on $F$ holds in our case with $L = P$ since

$$\|F(z_1) - F(z_2)\| = \sqrt{\|G(y_1 - y_2)\|^2 + \|G^\top(x_1 - x_2)\|^2} \le \sqrt{P\|z_1 - z_2\|_1^2} \le P\|z_1 - z_2\|,$$

which is also why the step size is chosen to be $\eta \le \frac{1}{8P}$ in all our results. In the following, we first prove Theorem 4. Then, we review the convergence analysis of [Wei et al., 2021] for OMWU in normal-form games, and finally demonstrate how to prove Theorem 6 and Theorem 7 by building upon this previous work and addressing the additional complications from EFGs.

### 6.1 Proof of Theorem 4

For any $z^\star \in \mathcal{Z}^\star$, by optimality of $z^\star$ we have:

$$F(z_t)^\top(z_t - z^\star) = x_t^\top G y_t - x_t^\top G y_t + x_t^\top G y^\star - x^{\star\top} G y_t \ge x^{\star\top} G y^\star - x^{\star\top} G y^\star = 0.$$

Thus, taking $z = z^\star$ in Lemma 8 and rearranging, we arrive at

$$D_\psi(z^\star, \hat{z}_{t+1}) \le D_\psi(z^\star, \hat{z}_t) - D_\psi(\hat{z}_{t+1}, z_t) - \frac{15}{16} D_\psi(z_t, \hat{z}_t) + \frac{1}{16} D_\psi(\hat{z}_t, z_{t-1}).$$

Defining $\Theta_t = D_\psi(z^\star, \hat{z}_t) + \frac{1}{16} D_\psi(\hat{z}_t, z_{t-1})$ and $\zeta_t = D_\psi(\hat{z}_{t+1}, z_t) + D_\psi(z_t, \hat{z}_t)$, we rewrite the inequality above as

$$\Theta_{t+1} \le \Theta_t - \frac{15}{16}\zeta_t. \qquad (5)$$

We remark that similar inequalities appear in [Wei et al., 2021, Eq. (3) and Eq. (4)], but here we pick $z^\star \in \mathcal{Z}^\star$ arbitrarily while they have to pick a particular $z^\star \in \mathcal{Z}^\star$ (such as the projection of $\hat{z}_t$ onto $\mathcal{Z}^\star$). Summing Eq. (5) over $t$, telescoping, and applying the strong convexity of $\psi$, we have

$$\Theta_1 \ge \Theta_1 - \Theta_T \ge \frac{15}{16}\sum_{t=1}^{T-1}\zeta_t \ge \frac{15}{32}\sum_{t=1}^{T-1}\Big( \|\hat{z}_{t+1} - z_t\|^2 + \|z_t - \hat{z}_t\|^2 \Big) \ge \frac{15}{64}\sum_{t=2}^{T-1}\|z_t - z_{t-1}\|^2.$$

Similar to the last inequality, we also have $\Theta_1 \ge \frac{15}{64}\sum_{t=1}^{T-1}\|\hat{z}_{t+1} - \hat{z}_t\|^2$ since $2\|\hat{z}_{t+1} - z_t\|^2 + 2\|z_t - \hat{z}_t\|^2 \ge \|\hat{z}_{t+1} - \hat{z}_t\|^2$. Therefore, we conclude that $\|z_t - \hat{z}_t\|$, $\|z_{t+1} - z_t\|$, and $\|\hat{z}_{t+1} - \hat{z}_t\|$ all converge to 0 as $t \to \infty$.

On the other hand, since the sequence $\{z_1, z_2, \ldots\}$ is bounded, by the Bolzano-Weierstrass theorem, there exists a convergent subsequence, which we denote by $\{z_{i_1}, z_{i_2}, \ldots\}$. Let $z^\circ = \lim_{\tau\to\infty} z_{i_\tau}$. By $\|\hat{z}_t - z_t\| \to 0$ we also have $z^\circ = \lim_{\tau\to\infty} \hat{z}_{i_\tau}$. Now, using the first-order optimality condition of $\hat{z}_{t+1}$, we have for every $z \in \mathcal{Z}$, $(\nabla\psi(\hat{z}_{t+1}) - \nabla\psi(\hat{z}_t) + \eta F(z_t))^\top (z - \hat{z}_{t+1}) \ge 0$. Applying this with $t = i_\tau$ for every $\tau$ and letting $\tau \to \infty$, we obtain

$$0 \le \lim_{\tau\to\infty} \big( \nabla\psi(\hat{z}_{i_\tau+1}) - \nabla\psi(\hat{z}_{i_\tau}) + \eta F(z_{i_\tau}) \big)^\top (z - \hat{z}_{i_\tau+1}) \qquad \text{(by the first-order optimality)}$$
$$= \lim_{\tau\to\infty} \eta F(z_{i_\tau})^\top (z - \hat{z}_{i_\tau+1}) \qquad \text{(by $\|\hat{z}_{t+1} - \hat{z}_t\| \to 0$ and the continuity of $\nabla\psi$)}$$
$$= \eta F(z^\circ)^\top (z - z^\circ) \qquad \text{(by $z^\circ = \lim_{\tau\to\infty} z_{i_\tau} = \lim_{\tau\to\infty} \hat{z}_{i_\tau}$).}$$

This implies that $z^\circ$ is a Nash equilibrium. Finally, choosing $z^\star = z^\circ$ in the definition of $\Theta_t$, we have $\lim_{\tau\to\infty} \Theta_{i_\tau} = 0$ because $\lim_{\tau\to\infty} D_\psi(z^\circ, \hat{z}_{i_\tau}) = 0$ and $\lim_{\tau\to\infty} \|\hat{z}_{i_\tau} - z_{i_\tau - 1}\| = 0$. Additionally, by Eq. (5) we also have that $\lim_{t\to\infty} \Theta_t = 0$ as $\Theta_t$ is non-increasing. Therefore, we conclude that the entire sequence $\{z_1, z_2, \ldots\}$ converges to $z^\circ$. On the other hand, since OOMD is a regret-minimization algorithm, it is well known that the average iterate converges to a Nash equilibrium [Freund and Schapire, 1999]. Consequently, combining the two facts above implies that $z_t$ has to converge to a Nash equilibrium, which proves Theorem 4.

We remark that Lemma 8 holds for general closed convex domains as shown in [Wei et al., 2021]. Consequently, with the same argument, Theorem 4 holds more generally as long as $\mathcal{X}$ and $\mathcal{Y}$ are closed convex sets. While the argument is straightforward, we are not aware of similar results in prior works. Also note that unlike Theorem 6 and Theorem 7, Theorem 4 holds without the uniqueness assumption for VOMWU and DOMWU.

### 6.2 Review for normal-form games

To better explain our analysis and highlight its novelty, we first review the two-stage analysis of [Wei et al., 2021] for OMWU in normal-form games, a special case of our setting when $\mathcal{X}$ and $\mathcal{Y}$ are simplexes. Note that both VOMWU and DOMWU reduce to OMWU in this case. As with Theorem 6 and Theorem 7, the normal-form OMWU results assume a unique Nash equilibrium $z^\star$.
With this uniqueness assumption and [Mertikopoulos et al., 2018, Lemma C.4], Wei et al. [2021] show the following inequality

$$\zeta_t = D_\psi(\hat{z}_{t+1}, z_t) + D_\psi(z_t, \hat{z}_t) \ge C_5 \|z^\star - \hat{z}_{t+1}\|^2 \qquad (6)$$

for some problem-dependent constant $C_5 > 0$, which, when combined with Eq. (5), implies that if the algorithm's current iterate is far from $z^\star$, then the decrease in $\Theta_t$ is more substantial, that is, the algorithm makes more progress on approaching $z^\star$. To establish a recursion, however, we need to connect the 2-norm back to the Bregman divergence (a reverse direction of strong convexity). To do so, Wei et al. [2021] argue that $\hat{z}_{t+1,i}$ can be lower bounded by another problem-dependent constant for $i \in \operatorname{supp}(z^\star)$ [Wei et al., 2021, Lemma 19], where $\operatorname{supp}(z^\star)$ denotes the support of $z^\star$. This further allows them to lower bound $\|z^\star - \hat{z}_{t+1}\|$ in terms of $D_\psi(z^\star, \hat{z}_{t+1})$ (which is just $\mathrm{KL}(z^\star, \hat{z}_{t+1})$), leading to

$$\zeta_t = D_\psi(\hat{z}_{t+1}, z_t) + D_\psi(z_t, \hat{z}_t) \ge C_6 D_\psi(z^\star, \hat{z}_{t+1})^2, \qquad (7)$$

for some $C_6 > 0$. On the other hand, ignoring the nonnegative term $D_\psi(z_t, \hat{z}_t)$, we also have:

$$\zeta_t = D_\psi(\hat{z}_{t+1}, z_t) + D_\psi(z_t, \hat{z}_t) \ge D_\psi(\hat{z}_{t+1}, z_t) \ge \tfrac{1}{4} D_\psi(\hat{z}_{t+1}, z_t)^2, \qquad (8)$$

where the last step uses the fact that $\hat{z}_{t+1}$ and $z_t$ are close [Wei et al., 2021, Lemma 17 and Lemma 18]. Now, Eq. (7) and Eq. (8) together imply

$$6\zeta_t \ge 2 C_6 D_\psi(z^\star, \hat{z}_{t+1})^2 + D_\psi(\hat{z}_{t+1}, z_t)^2 \ge \min\big\{C_6, \tfrac{1}{2}\big\} \Theta_{t+1}^2.$$

Plugging this back into Eq. (5), we obtain a recursion

$$\Theta_{t+1} \le \Theta_t - C_7 \Theta_{t+1}^2 \qquad (9)$$

for some $C_7 > 0$, which then implies $\Theta_t = O(1/t)$ [Wei et al., 2021, Lemma 12]. This can be seen as the first and slower stage of the convergence behavior of the algorithm.

To further show a linear convergence rate, they argue that there exists a constant $C_8 > 0$ such that when the algorithm's iterate is reasonably close to $z^\star$ in the following sense:

$$\max\{ \|z^\star - \hat{z}_t\|_1, \|z^\star - z_t\|_1 \} \le C_8, \qquad (10)$$

the following improved version of Eq. (7) holds (note the lack of square on the right-hand side):

$$\zeta_t = D_\psi(\hat{z}_{t+1}, z_t) + D_\psi(z_t, \hat{z}_t) \ge C_9 D_\psi(z^\star, \hat{z}_{t+1}) \qquad (11)$$

for some constant $0 < C_9 < 1$. Therefore, using the $1/t$ convergence rate derived in the first stage, there exists a $T_0$ such that when $t \ge T_0$, Eq. (10) holds and the algorithm enters the second stage. In this stage, combining Eq. (11) and the fact $\zeta_t \ge D_\psi(\hat{z}_{t+1}, z_t)$ gives $\zeta_t \ge \frac{C_9}{2}\Theta_{t+1}$, which, together with Eq. (5) again, implies an improved recursion $\Theta_{t+1} \le \Theta_t - \frac{15}{32} C_9 \Theta_{t+1}$. This finally shows a linear convergence rate $\Theta_t = O((1 + \rho)^{-t})$ for some problem-dependent constant $\rho > 0$.

### 6.3 Analysis of Theorem 6 and Theorem 7

While we mainly follow the steps of the analysis of [Wei et al., 2021] discussed above to prove Theorem 6 and Theorem 7, we remark that the generalization is highly non-trivial. First of all, we have to prove Eq. (6) for $\mathcal{Z}$ being a general treeplex, which does not follow from [Mertikopoulos et al., 2018, Lemma C.4] since its proof is very specific to simplexes. Instead, we prove it by writing down the primal-dual linear program of Eq. (1) and applying strict complementary slackness; see Appendix E.1 for details. Next, to connect the 2-norm back to the Bregman divergence (which is not the simple KL divergence anymore, especially for DOMWU), we prove the following for VOMWU:

$$D_\psi(z^\star, \hat{z}_{t+1}) \le \sum_{i\in\operatorname{supp}(z^\star)} \frac{(z^\star_i - \hat{z}_{t+1,i})^2}{\hat{z}_{t+1,i}} + \sum_{i\notin\operatorname{supp}(z^\star)} \hat{z}_{t+1,i} \le \frac{3P \|z^\star - \hat{z}_{t+1}\|}{\min_{i\in\operatorname{supp}(z^\star)} \hat{z}_{t+1,i}}, \qquad (12)$$

and the following for DOMWU:

$$D_\psi(z^\star, \hat{z}_{t+1}) \le C\left( \sum_{i\in\operatorname{supp}(z^\star)} \frac{(z^\star_i - \hat{z}_{t+1,i})^2}{z^\star_i \hat{q}_{t+1,i}} + \sum_{i\notin\operatorname{supp}(z^\star)} z^\star_{p_i} \hat{q}_{t+1,i} \right) \le \frac{C \|z^\star - \hat{z}_{t+1}\|_1}{\min_{i\in\operatorname{supp}(z^\star)} z^\star_i \hat{z}_{t+1,i}}, \qquad (13)$$

where $C = 4P\|\alpha\|_\infty$ (see Appendix E.2). We then show a lower bound on $z_{t+1,i}$ and $\hat{z}_{t+1,i}$ for all $i \in \operatorname{supp}(z^\star)$, using arguments similar to those of [Wei et al., 2021] (see Appendix E.3).
Combining Eq. (12) and Eq. (13) with Eq. (6), we have the counterpart of Eq. (7) for both VOMWU and DOMWU. Showing Eq. (8) also involves extra complication if we follow their analysis, especially for VOMWU which does not admit a closed-form update. Instead, we find a simple workaround: by applying Eq. (5) repeatedly, we get $D_\psi(z^\star, \hat{z}_1) = \Theta_1 \ge \Theta_{t+1} \ge \frac{1}{16} D_\psi(\hat{z}_{t+1}, z_t)$; thus, $\zeta_t \ge D_\psi(\hat{z}_{t+1}, z_t) \ge C_{10} D_\psi(\hat{z}_{t+1}, z_t)^2$ for some $C_{10} > 0$ depending on $D_\psi(z^\star, \hat{z}_1)$. Combining this with Eq. (7), and applying them to Eq. (5), we obtain the recursion $\Theta_{t+1} \le \Theta_t - C_{11}\Theta_{t+1}^2$ for some $C_{11} > 0$, similar to Eq. (9), which implies $\Theta_t = O(1/t)$ for both VOMWU and DOMWU and proves Theorem 6.

Finally, to show a linear convergence rate, we need to show the counterpart of Eq. (11), which is again more involved compared to the normal-form game case. Indeed, we are only able to do so for DOMWU by making use of its closed-form update described in Lemma 2. Specifically, observe that in Eq. (13), the term $\sum_{i\notin\operatorname{supp}(z^\star)} z^\star_{p_i} \hat{q}_{t+1,i}$ is the one that prevents us from bounding $D_\psi(z^\star, \hat{z}_{t+1})$ by $O(\|z^\star - \hat{z}_{t+1}\|^2)$. Thus, our high-level idea is to argue that $\sum_{i\notin\operatorname{supp}(z^\star)} z^\star_{p_i} \hat{q}_{t+1,i}$ decreases significantly as $\hat{z}_t$ gets close enough to $z^\star$. To do so, we use a bottom-up induction to prove that, for any information set $h \in \mathcal{H}_\mathcal{Z}$ and indices $i, j \in \Omega_h$ such that $i \notin \operatorname{supp}(z^\star)$ and $j \in \operatorname{supp}(z^\star)$, $\hat{L}_{t,i}$ is significantly larger than $\hat{L}_{t,j}$ when $\hat{z}_t$ is close to $z^\star$, where $\hat{L}_t$ is the counterpart of $L$ in Lemma 2 when computing $\hat{q}_{t+1}$. This makes sure that the term $\sum_{i\notin\operatorname{supp}(z^\star)} z^\star_{p_i} \hat{q}_{t+1,i}$ is dominated by the other term involving $i \in \operatorname{supp}(z^\star)$ in Eq. (13), which eventually helps us show Eq. (11) and the final linear convergence rate in Theorem 7. See Appendix E.5 for details.

## 7 Experiments

In this section, we experimentally evaluate the algorithms on three standard EFG benchmarks: Kuhn poker [Kuhn, 1950], Pursuit-evasion [Kroer et al., 2018], and Leduc poker [Southey et al., 2005]. The results are shown in Figure 1. Besides the optimistic algorithms, we also show two CFR-based algorithms as reference points. CFR+ refers to CFR with alternating updates, linear averaging [Tammelin, 2014], and regret matching+ as the regret minimizer. CFR w/ RM+ is CFR with regret matching+ and linear averaging. We provide the formal descriptions of these two algorithms in Appendix D for completeness. For the optimistic algorithms, we plot the last-iterate performance. For the CFR-based algorithms, we plot the performance of the linear average of iterates (recall that the last iterate of CFR-based algorithms is not guaranteed to converge to a Nash equilibrium).

For Kuhn poker and Pursuit-evasion (on the left and in the middle of Figure 1), all of the optimistic algorithms perform much better than CFR+, and their curves are nearly straight, showing their linear last-iterate convergence on these games. For Leduc poker, although CFR+ performs the best, we can still observe the last-iterate convergence trends of the optimistic algorithms.

Figure 1: Experiments on Kuhn poker (left), Pursuit-evasion (middle), and Leduc poker (right), plotting the duality gap against the iteration count. Legend entries (with tuned step sizes): Kuhn poker: CFR w/ RM+, CFR+, DOGDA ($\eta=2.0$), DOMWU ($\eta=4.3$), VOMWU ($\eta=4.4$), VOGDA ($\eta=2.0$); Pursuit-evasion: CFR w/ RM+, CFR+, VOGDA ($\eta=0.02$), DOGDA ($\eta=0.04$), DOMWU ($\eta=0.2$); Leduc poker: DOMWU ($\eta=4.2$), VOGDA ($\eta=0.3$), CFR w/ RM+, DOGDA ($\eta=0.9$), CFR+. A description of each game is given in Appendix B.
Here $\xi_t = \max_{y} x_t^\top G y - \min_{x} x^\top G y_t$ is the duality gap at time step $t$, where $(x_t, y_t)$ is the approximate Nash equilibrium computed by the algorithm at time $t$ (for the optimistic algorithms, this is the last iterate, while for the CFR-based algorithms, it is the linear average). The legend order reflects the curve order at the right-most point.

Due to the much higher computation overhead than all the other algorithms, we only run VOMWU on Kuhn poker, the game with the smallest size among the three games. For each optimistic algorithm, we fine-tune the step size $\eta$ to get better convergence results and show its value in the legend. There is no hyperparameter for the CFR-based algorithms. All the experiments are run on CPU on a personal computer and the total computation time is less than an hour. There is no random seed and the results are all deterministic.

We remark that although VOGDA and DOMWU have a linear convergence rate in theory, the experiment on Leduc uses a step size $\eta$ which is much larger than Corollary 5 and Theorem 7 suggest, which may void the linear convergence guarantee. This is done because the theoretical step size takes too many iterations before it starts to show improvement. It is worth noting that CFR+ improves significantly when changing simultaneous updates (that is, CFR w/ RM+) to alternating updates. Analyzing alternation and combining it with optimistic algorithms is a promising direction. We provide a description of each game, more discussions, and details of the experiments in Appendix B.

## 8 Conclusions

In this work, we developed the first general last-iterate convergence results for solving EFGs. Our paper opens up many potential future directions. The recent dilatable global entropy regularizer of Farina et al. [2021b] can likely be analyzed using techniques similar to our analysis of VOMWU and DOMWU, and it would likely lead to a linear rate as with DOMWU, due to its closed-form DOMWU-style update. Other natural questions include whether it is possible to obtain better convergence rates for VOMWU and DOGDA, whether one can remove the uniqueness assumption for VOMWU and DOMWU, and finally whether it is possible to obtain last-iterate convergence rates for CFR-like optimistic algorithms such as those in [Farina et al., 2021a]. On the practical side, optimistic algorithms with last-iterate convergence guarantees may allow more efficient computation and better incorporation with deep learning-based game-solving approaches.

## Acknowledgments and Disclosure of Funding

CWL and HL are supported by NSF Award IIS-1943607.

## References

James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, 2018.

Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.

George W Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.

Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1829–1836, 2019a.

Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019b.

Noam Brown, Sam Ganzfried, and Tuomas Sandholm.
Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit Texas Hold'em agent. In AAAI Workshop: Computer Poker and Imperfect Information, 2015.

Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In International Conference on Machine Learning, pages 793–802. PMLR, 2019.

Neil Burch. Time and space: Why imperfect information games are hard. PhD thesis, University of Alberta, 2018.

Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.

Neil Burch, Matej Moravcik, and Martin Schmid. Revisiting CFR+ and alternating updates. Journal of Artificial Intelligence Research, 64:429–443, 2019.

Laurent Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1):575–585, 2016.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. Innovations in Theoretical Computer Science, 2019.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.

Gabriele Farina, Christian Kroer, Noam Brown, and Tuomas Sandholm. Stable-predictive optimistic counterfactual regret minimization. In International Conference on Machine Learning, pages 1853–1862. PMLR, 2019a.

Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Optimistic regret minimization for extensive-form games via dilated distance-generating functions. In Advances in Neural Information Processing Systems, pages 5222–5232, 2019b.

Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Faster game solving via predictive Blackwell approachability: Connecting regret matching and mirror descent. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 2021a.

Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Better regularization for sequential decision spaces: Fast convergence rates for Nash, correlated, and team equilibria. In Proceedings of the 2021 ACM Conference on Economics and Computation, 2021b.

Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.

Sam Ganzfried and Tuomas Sandholm. Potential-aware imperfect-recall abstraction with earth mover's distance in imperfect-information games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.

Sam Ganzfried and Tuomas Sandholm. Endgame solving in large imperfect-information games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 37–45. Citeseer, 2015.

Yuan Gao, Christian Kroer, and Donald Goldfarb. Increasing iterate averaging for solving saddle-point problems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with O(ln(1/ϵ)) convergence for ϵ-equilibrium in two-person zero-sum games. In AAAI, 2008.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

Johannes Heinrich and David Silver.
Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.

Samid Hoda, Andrew Gilpin, Javier Pena, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010.

Christian Kroer and Tuomas Sandholm. Extensive-form game abstraction with bounds. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 621–638, 2014.

Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Smoothing method for approximate extensive-form perfect equilibrium. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 295–301, 2017.

Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Robust Stackelberg equilibria in extensive-form games and extension to limited lookahead. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, 179(1):385–417, 2020.

HW Kuhn. Simplified two-person poker. Contributions to the Theory of Games, pages 97–103, 1950.

Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael H Bowling. Monte Carlo sampling for regret minimization in extensive games. In NIPS, pages 1078–1086, 2009.

Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. The 24th International Conference on Artificial Intelligence and Statistics, 2021.

H Brendan McMahan, Geoffrey J Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 536–543, 2003.

Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 2018.

Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations, 2019.

Matej Moravcik, Martin Schmid, Karel Ha, Milan Hladik, and Stephen Gaukrodger. Refining subgames in large imperfect information games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007.

Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.

Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' bluff: opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 550–558, 2005.

Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games.
In Advances in Neural Information Processing Systems, 2015.

Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.

Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 1995.

Robert J Vanderbei et al. Linear Programming, volume 3. Springer, 2015.

Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.

Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations, 2021.

Alexander Zimin and Gergely Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. In Neural Information Processing Systems 26, 2013.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20:1729–1736, 2007.