Near-Optimal Φ-Regret Learning in Extensive-Form Games

Ioannis Anagnostides 1, Gabriele Farina 2, Tuomas Sandholm 1 3 4 5

1 Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA  2 FAIR, Meta AI  3 Strategy Robot, Inc.  4 Optimized Markets, Inc.  5 Strategic Machine, Inc. Correspondence to: Ioannis Anagnostides. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

In this paper, we establish efficient and uncoupled learning dynamics so that, when employed by all players in multi-player perfect-recall imperfect-information extensive-form games, the trigger regret of each player grows as O(log T) after T repetitions of play. This improves exponentially over the prior best known trigger-regret bound of O(T^{1/4}), and settles a recent open question by Bai et al. (2022). As an immediate consequence, we guarantee convergence to the set of extensive-form correlated equilibria and coarse correlated equilibria at a near-optimal rate of (log T)/T. Building on prior work, at the heart of our construction lies a more general result regarding fixed points deriving from rational functions with polynomial degree, a property that we establish for the fixed points of (coarse) trigger deviation functions. Moreover, our construction leverages a refined regret circuit for the convex hull, which, unlike prior guarantees, preserves the RVU property introduced by Syrgkanis et al. (NIPS, 2015); this observation is of independent interest for establishing near-optimal regret under learning dynamics based on a CFR-type decomposition of the regret.

1. Introduction

A primary objective of artificial intelligence is the design of agents that can adapt effectively in complex and nonstationary multiagent environments modeled as general-sum games. Multiagent decision making often occurs in a decentralized fashion, with each agent only obtaining information about its own reward function, and the goal is to learn how to play the game through repeated interactions. But how do we measure the performance of a learning agent? A popular metric commonly used is that of external regret (or simply regret). However, external regret can be a rather weak benchmark: a no-external-regret agent could still incur substantial regret under simple in-hindsight transformations of its behavior, e.g., consistently switching from an action a to a different action a' (Gordon et al., 2008). A more general metric is Φ-regret (Hazan & Kale, 2007; Rakhlin et al., 2011; Stoltz & Lugosi, 2007; Greenwald & Jafari, 2003), parameterized by a set of deviations Φ. From a game-theoretic standpoint, the importance of this framework is that different choices of Φ lead to different types of equilibria (Greenwald & Jafari, 2003; Stoltz & Lugosi, 2007). For example, one such celebrated result guarantees that no-internal-regret players converge in terms of empirical frequency of play to the set of correlated equilibria (CE) (Foster & Vohra, 1997; Hart & Mas-Colell, 2000). This brings us to the following central question:

What are the best performance guarantees when no-Φ-regret learners are playing in multi-player general-sum games?
Special cases of this question have recently received considerable attention in the literature (Daskalakis et al., 2011; Rakhlin & Sridharan, 2013a;b; Syrgkanis et al., 2015; Foster et al., 2016; Wei & Luo, 2018; Chen & Peng, 2020; Hsieh et al., 2021; Daskalakis et al., 2021; Daskalakis & Golowich, 2022; Anagnostides et al., 2022a; Piliouras et al., 2022b). In particular, Daskalakis et al. (2021) were the first to establish O(polylog T) external regret bounds for normal-form games,1 and subsequent work extended those results to internal regret (Anagnostides et al., 2022a); those guarantees, applicable when all players employ specific learning dynamics, improve exponentially over what is possible when a player is facing a sequence of adversarially produced utilities the canonical consideration in online learning. However, much less is known about Φ-regret learning beyond normal-form games. One such important application revolves around learn- 1With a slight abuse of notation, we use the O( ) notation in our introduction to suppress parameters that depend (polynomially) on the natural parameters of the game. Near-Optimal Φ-Regret Learning in Extensive-Form Games ing dynamics for extensive-form correlated equilibria (EFCE) (Von Stengel & Forges, 2008; Gordon et al., 2008; Celli et al., 2020; Morrill et al., 2021a; Anagnostides et al., 2022b; Morrill et al., 2021b; Bai et al., 2022a; Song et al., 2022). Indeed, a particular instantiation of Φ regret, referred to as trigger regret, is known to drive the rate of convergence to EFCE (Farina et al., 2022b). Incidentally, minimizing trigger regret lies at the frontier of Φ-regret minimization problems that are known to be computationally tractable in extensive-form games. In this context, prior work established O(T 1/4) per-player trigger regret bounds (Anagnostides et al., 2022b), thereby leaving open the possibility of obtaining near-optimal rates for EFCE; that question was also recently posed by Bai et al. (2022a). 1.1. Our Contributions Our main contribution is to establish the first uncoupled learning dynamics with near-optimal per-player trigger regret guarantees: Theorem 1.1 (Informal; precise version in Theorem 3.10). There exist uncoupled and computationally efficient learning dynamics so that the trigger regret of each player grows as O(log T) after T repetitions of play. This improves exponentially over the O(T 1/4) bounds obtained in prior work (Celli et al., 2020; Farina et al., 2022b; Anagnostides et al., 2022b), and settles an open question recently posed by Bai et al. (2022a). As an immediate consequence, given that trigger regret drives the rate of convergence to EFCE (Theorem 2.4), we obtain the first nearoptimal rates to EFCE. Corollary 1.2. There exist uncoupled and computationally efficient learning dynamics converging to EFCE at a nearoptimal rate of log T Overview of our techniques Our construction leverages the template of Gordon et al. (2008) for minimizing Φ regret (Algorithm 1). In particular, we follow the regret decomposition approach of Farina et al. (2022b) to construct an external regret minimizer for the set of deviations corresponding to trigger deviation functions. A key difference is that we instantiate each regret minimizer using the recent algorithm of Farina et al. 
(2022a), namely LRL-OFTRL, which is based on optimistic follow the regularizer leader (OFTRL) (Syrgkanis et al., 2015) under logarithmic regularization; LRL-OFTRL guarantees suitable RVU bounds (Syrgkanis et al., 2015) for each local regret minimizer. To combine those local regret minimizers into a global one for the set of trigger deviations that still enjoys a suitable RVU bound, we provide a refined guarantee for the regret circuit of the convex hull (Proposition 3.3), which ensures that the RVU property is preserved along the construction. Incidentally, this simple observation can be used to obtain the first near-optimal regret guarantees for algorithms based on a CFR-type decomposition of the regret (Zinkevich et al., 2007); as such, Proposition 3.3 has an independent and broader interest. The next key step relates to the behavior of the fixed points of trigger deviation functions. (Fixed points are at heart of all known constructions for minimizing Φ regret (Hazan & Kale, 2007); see Algorithm 1.) More precisely, to convert the RVU property from the space of deviations to the actual space of the player s strategies, we show that it suffices that the fixed points deriving from trigger deviation functions can be expressed as a rational function with a polynomial degree (Lemma 3.5). Importantly, we prove this property for the fixed points of trigger deviation functions (Proposition 3.7), thereby leading to Theorem 1.1; the last part of our analysis builds on a technique developed for obtaining O(log T) swap regret in normal-form games (Anagnostides et al., 2022c), although (imperfectinformation) extensive-form games introduce considerable new challenges, not least due to the combinatorial structure of trigger deviation functions. We also obtain slightly improved guarantees for extensive-form coarse correlated equilibria (EFCCE) (Farina et al., 2020), a relaxation of EFCE that is attractive due to its reduced per-iteration complexity compared to EFCE. Finally, we support our theory (Theorem 1.1) by implementing our algorithm and evaluating its performance through experiments on several benchmark extensive-form games in Section 4. 1.2. Further Related Work Φ regret has received extensive attention as a solution concept in the literature since it strengthens and unifies many common measures of performance in online learning (e.g., see (Hazan & Kale, 2007; Rakhlin et al., 2011; Stoltz & Lugosi, 2007; Greenwald & Jafari, 2003; Marks, 2008; Piliouras et al., 2022a; Fujii, 2023; Bernasconi et al., 2023)). This framework has been particularly influential in game theory given that no-Φ-regret learning outcomes are known to converge to different equilibrium concepts, depending on the richness of the set of deviations Φ. For example, when Φ includes all constant transformations reducing to external regret no-regret learning outcomes are known to converge to coarse correlated equilibria (CCE) (Moulin & Vial, 1978), a relaxation of CE (Aumann, 1974). Unfortunately, CCE is understood to be a rather weak equilibrium concept, potentially prescribing irrational behavior (Dekel & Fudenberg, 1990; Viossat & Zapechelnyuk, 2013; Giannou et al., 2021). This motivates enlarging the set of deviaitons Φ, thereby leading to stronger and arguably more plausible equilibrium concepts. 
Indeed, the framework of Φ-regret has been central in the quite recent development of the first uncoupled no-regret learning dynamics for EFCE (Celli et al., 2020; Farina et al., 2022b) (see also (Morrill et al., 2021a;b; Zhang, 2022)).2

Our paper lies at the interface of the aforementioned literature with a recent line of work that strives for improved regret guarantees when specific learning dynamics are in place; this allows bypassing the notorious Ω(√T) lower bounds applicable under an adversarial sequence of utilities (Cesa-Bianchi & Lugosi, 2006). The latter line of work was pioneered by Daskalakis et al. (2011), and has thereafter been extended along several lines (Rakhlin & Sridharan, 2013a;b; Syrgkanis et al., 2015; Chen & Peng, 2020; Daskalakis et al., 2021; Daskalakis & Golowich, 2022; Piliouras et al., 2022b; Yang & Ma, 2022), incorporating partial or noisy information feedback (Foster et al., 2016; Wei & Luo, 2018; Hsieh et al., 2022; Bai et al., 2022b), and, more recently, general Markov games (Erez et al., 2022; Zhang et al., 2022). For additional pointers, we refer the interested reader to the survey of Li et al. (2022).

A key reference point for our paper is the work of Anagnostides et al. (2022b), which established O(T^{1/4}) trigger regret bounds through optimistic hedge. Specifically, building on the work of Chen & Peng (2020), they showed multiplicative stability of the fixed points associated with EFCE. While those works operate in the full-information model, recent papers have also developed dynamics converging to EFCE under bandit feedback (Bai et al., 2022a; Song et al., 2022). Finally, it is worth noting that O(polylog T) regret bounds in extensive-form games were already obtained in prior work (Farina et al., 2022c), but they only applied to the weaker notion of external regret.

2. Preliminaries

In this section, we introduce our notation and basic background on online learning and extensive-form games; the familiar reader can just skim this section for our notation. For a more comprehensive treatment of those subjects, we refer to the books of Cesa-Bianchi & Lugosi (2006) and Leyton-Brown & Shoham (2008), respectively.

Notation We denote by N = {1, 2, . . . } the set of natural numbers. We use the variable i with a subscript to index a player, and t with a superscript to indicate the (discrete) time. To access the r-th coordinate of a d-dimensional vector x ∈ R^d, for some index r ∈ [[d]] := {1, 2, . . . , d}, we use the symbol x[r].

2 While there are other methods for efficiently computing EFCE (Dudík & Gordon, 2009; Huang & von Stengel, 2008), approaches based on uncoupled no-regret learning typically scale significantly better in large games.

2.1. Optimistic Online Learning and Regret

Let X ⊆ [0, 1]^d be a nonempty convex and compact set, for d ∈ N. In the framework of online learning, a learner (or a player), denoted by R, interacts with the environment at time t ∈ N via the following subroutines.

R.NEXTSTRATEGY(): The learner outputs its next strategy x^(t) ∈ X based on its internal state; and

R.OBSERVEUTILITY(u^(t)): The learner receives feedback from the environment in the form of a utility vector u^(t) ∈ R^d.

The canonical measure of performance in online learning is the notion of regret, denoted by Reg^T, defined for a fixed time horizon T ∈ N as

\[ \mathrm{Reg}^T := \max_{x^\ast \in \mathcal{X}} \sum_{t=1}^{T} \langle x^\ast, u^{(t)} \rangle - \sum_{t=1}^{T} \langle x^{(t)}, u^{(t)} \rangle. \tag{1} \]

In words, the performance of the learner is compared to the performance of playing an optimal fixed strategy in hindsight.
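To make (1) concrete, here is a toy numeric computation of external regret over the probability simplex; the utilities and strategies below are illustrative and not taken from any game in the paper.

```python
import numpy as np

# Toy illustration of the external regret (1): X is the simplex over three
# actions, so the best fixed strategy in hindsight is the best single action.
utilities = np.array([[0.3, 0.1, 0.5],      # u^(1)
                      [0.2, 0.4, 0.1],      # u^(2)
                      [0.6, 0.2, 0.3]])     # u^(3)
plays = np.array([[1/3, 1/3, 1/3],          # x^(1)
                  [0.2, 0.5, 0.3],          # x^(2)
                  [0.1, 0.1, 0.8]])         # x^(3)

realized = np.sum(plays * utilities)            # sum_t <x^(t), u^(t)>
best_fixed = np.max(utilities.sum(axis=0))      # maximum over simplex vertices
regret = best_fixed - realized                  # Reg^T per (1)
```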
We will say that the agent has no-regret if Reg^T = o(T) under any sequence of observed utilities.

Optimistic FTRL By now, it is well understood that broad families of online learning algorithms, such as follow the regularized leader, incur at most O(√T) regret, even when the sequence of utilities is selected adversarially (Cesa-Bianchi & Lugosi, 2006). In addition, significant improvements are possible when the observed utilities satisfy further properties, such as small variation (Rakhlin & Sridharan, 2013a), which turns out to be crucial in the context of learning in games. To leverage such structure, Syrgkanis et al. (2015) introduced optimistic follow the regularized leader (OFTRL), which updates every strategy x^(t+1) as the (unique) solution to the optimization problem

\[ x^{(t+1)} := \arg\max_{x \in \mathcal{X}} \left\{ \eta \left\langle x, \, u^{(t)} + \sum_{\tau=1}^{t} u^{(\tau)} \right\rangle - \mathcal{R}(x) \right\}, \]

where η > 0 is the learning rate and R : X → R is a 1-strongly convex regularizer with respect to some norm ‖·‖. Syrgkanis et al. (2015) showed that OFTRL satisfies a remarkable regret bound coined the RVU property.

Definition 2.1 (RVU property). A regret minimizer satisfies the RVU property w.r.t. a dual pair of norms (‖·‖, ‖·‖_*) if there exist α, β, γ > 0 such that for any (u^(t))_{1≤t≤T},

\[ \mathrm{Reg}^T \leq \alpha + \beta \sum_{t=1}^{T-1} \| u^{(t+1)} - u^{(t)} \|_\ast^2 - \gamma \sum_{t=1}^{T-1} \| x^{(t+1)} - x^{(t)} \|^2. \]

Φ-regret A much more general performance metric than (1) is Φ-regret, parameterized by a set of transformations Φ, where each ϕ ∈ Φ maps X → X. Namely, the Φ-regret Reg^T_Φ for a time horizon T ∈ N is defined as

\[ \mathrm{Reg}^T_\Phi := \max_{\phi^\ast \in \Phi} \sum_{t=1}^{T} \langle \phi^\ast(x^{(t)}), u^{(t)} \rangle - \sum_{t=1}^{T} \langle x^{(t)}, u^{(t)} \rangle. \tag{2} \]

External regret (1) is simply a special case of (2) when Φ includes all possible constant transformations, but Φ-regret can be much more expressive. A celebrated game-theoretic motivation for Φ-regret stems from the fact that when all players employ suitable Φ-regret minimizers, the dynamics converge to different notions of correlated equilibria, well beyond coarse correlated equilibria (Foster & Vohra, 1997; Stoltz & Lugosi, 2007; Hart & Mas-Colell, 2000; Celli et al., 2020).

From external to Φ regret As it turns out, there is a general template for minimizing Φ-regret due to Gordon et al. (2008). In particular, they assume access to the following.

1. A no-external-regret minimizer RΦ operating over the set of transformations Φ; and

2. A fixed point oracle FIXEDPOINT(ϕ) that, for any ϕ ∈ Φ, computes a fixed point x ∈ X with ϕ(x) = x, under the assumption that such a point indeed exists.

Based on those ingredients, Gordon et al. (2008) were able to construct a regret minimization algorithm R with sublinear Φ-regret, as illustrated in Algorithm 1. Specifically, R determines its next strategy by first obtaining the strategy ϕ^(t) of RΦ, and then outputting any fixed point of ϕ^(t). Then, upon receiving the utility vector u^(t) ∈ R^d, R forwards as input to RΦ the utility function ϕ ↦ ⟨u^(t), ϕ(x^(t))⟩. We will assume that Φ contains linear transformations, in which case that utility can be represented as U^(t) := u^(t) ⊗ x^(t) := u^(t) (x^(t))^⊤ ∈ R^{d×d}; that is, ⊗ denotes the outer product of the two vectors. This algorithm enjoys the following guarantee.

Algorithm 1 Φ-Regret Minimizer (Gordon et al., 2008)
1: Input: An external regret minimizer RΦ for Φ
2: function NEXTSTRATEGY()
3:   ϕ^(t) ← RΦ.NEXTSTRATEGY()
4:   x^(t) ← FIXEDPOINT(ϕ^(t))
5:   return x^(t)
6: end function
7: function OBSERVEUTILITY(u^(t))
8:   Construct the utility U^(t) ← u^(t) ⊗ x^(t)
9:   RΦ.OBSERVEUTILITY(U^(t))
10: end function

Theorem 2.2 (Gordon et al., 2008). Let Reg^T be the external regret of RΦ, and Reg^T_Φ be the Φ-regret of R. Then, for any T ∈ N, Reg^T = Reg^T_Φ.
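To fix ideas, the following is a minimal Python sketch of the template in Algorithm 1. The class and method names are ours, and the generic power-iteration fixed-point oracle is an illustrative assumption (the constructions in this paper compute the fixed points exactly); it presupposes that the transformation maps X to itself and that a fixed point exists.

```python
import numpy as np

class PhiRegretMinimizer:
    """Sketch of the Gordon et al. (2008) template (Algorithm 1). `phi_minimizer`
    is any external-regret minimizer over linear transformations of R^d, exposing
    next_strategy() -> (d, d) matrix and observe_utility(U) for a (d, d) matrix."""

    def __init__(self, phi_minimizer):
        self.phi_minimizer = phi_minimizer
        self.x = None                              # last fixed point played

    def next_strategy(self):
        Phi = self.phi_minimizer.next_strategy()   # transformation phi^(t)
        self.x = self._fixed_point(Phi)            # x^(t) with Phi @ x^(t) = x^(t)
        return self.x

    def observe_utility(self, u):
        # The utility forwarded to the Phi-minimizer is phi -> <u, phi(x)>,
        # represented by the outer product U = u x^T when Phi is linear.
        self.phi_minimizer.observe_utility(np.outer(u, self.x))

    @staticmethod
    def _fixed_point(Phi, iters=10_000, tol=1e-12):
        # Generic (approximate) fixed-point oracle via power iteration; this is
        # an assumption made purely for illustration.
        x = np.full(Phi.shape[0], 1.0 / Phi.shape[0])
        for _ in range(iters):
            x_new = Phi @ x
            if x_new.sum() > 0:
                x_new = x_new / x_new.sum()
            if np.linalg.norm(x_new - x, 1) < tol:
                break
            x = x_new
        return x
```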
It is also worth pointing out that a similar guarantee applies even under approximate fixed-point computations, as long as the accuracy is high enough (Gordon et al., 2008). No-Regret learning in games The main focus of our paper is about the behavior of no-regret learning dynamics when employed by all players in n-player games. More precisely, the strategy set of each player i [[n]] is a nonempty convex and compact set Xi. Further, the utility function ui : n i =1 Xi R of each player i [[n]] is multilinear, so that for any x i := (x1, . . . , xi 1, xi+1, xn), ui(x) := xi, ui(x i) . In this context, learning procedures work as follows. At every iteration t N each player i [[n]] commits to a strategy x(t) i Xi, and subsequently receives as feedback the utility corresponding to the other players strategies at time t: u(t) i := ui(x(t) i). It is assumed that players use no-regret learning algorithms to adapt to the behavior of the other players, leading to uncoupled learning dynamics, in the sense that players do not use information about other players utilities (Hart & Mas-Colell, 2000; Daskalakis et al., 2011). For convenience, and without any loss, we assume that u(t) i 1, for i [[n]] and t N. 2.2. Background on EFGs An extensive-form game (EFG) is played on a rooted and directed tree with node-set H. Every decision (nonterminal) node h H is uniquely associated with a player who selects an action from a finite and nonempty set Ah. By convention, the set of players includes a fictitious chance player c that acts according to a fixed distribution. The set of leaves (terminal) nodes Z H corresponds to different outcomes of the game. Once the game reaches a terminal node z Z, every player i [[n]] receives a payoff according to a (normalized) utility function ui : Z [ 1, 1]. In an imperfect-information EFG, the decision nodes of each player i [[n]] are partitioned into information sets Ji, inducing a partially ordered set (Ji, ). For an information set j Ji and an action a Aj, we let σ := (j, a) be the sequence of i s actions encountered from the root of the tree until (and including) action a; we use the special symbol to denote the empty sequence. The set of i s sequences is denoted by Σi := {(j, a) : j Ji, a Aj} { }. We also let Σ i := Σi \ { } and Σj := {σ Σi : σ j}, where we write σ j if sequence σ must pass from some node in j. We will use σj to represent the parent sequence of an information set j Ji; namely, the last sequence before reaching j, or if j is a root information set. For any pair of sequences σ, σ Σ i , with σ = (j, a) and σ = (j , a ), we write σ σ if the sequence of actions encountered from the root of the tree to any node in j includes select- Near-Optimal Φ-Regret Learning in Extensive-Form Games ing action a at some node from information set j. Further, by convention, we let σ for any σ Σ i . Sequence-form strategies The strategy of a player specifies a probability distribution for every information set. Assuming perfect recall players never forget acquired information a strategy can be represented via the sequence-form strategy polytope Qi R|Σi| 0 , defined as n qi : qi[ ] = 1, qi[σj] = P a Aj qi[(j, a)], j Ji o . Under the sequence-form representation, learning in extensive-form games can be cast in the framework of online linear optimization described earlier in Section 2.1; we refer to, for example, the work of Farina et al. (2022b). Further, we let Πi := Qi {0, 1}|Σi| be the set of deterministic sequence-form strategies. 
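As a concrete illustration of the sequence-form constraints, the snippet below converts behavioral (per-information-set) strategies into a sequence-form vector and checks the flow constraint; the helper and the small two-information-set example (loosely modeled on the game of Figure 2 in Appendix A, where both information sets of player 1 hang from the empty sequence) are ours and purely illustrative.

```python
def behavioral_to_sequence_form(info_sets, behavioral):
    """Illustrative helper: build a sequence-form vector q from behavioral
    strategies. info_sets maps an information set j to (parent sequence, actions),
    listed top-down (parents before children); the empty sequence is "".
    The output satisfies q[""] = 1 and q[parent(j)] = sum_a q[(j, a)]."""
    q = {"": 1.0}
    for j, (parent_seq, actions) in info_sets.items():
        for a in actions:
            q[(j, a)] = q[parent_seq] * behavioral[(j, a)]
    return q

# Two root information sets A and B with two actions each.
info_sets = {"A": ("", ["1", "2"]), "B": ("", ["3", "4"])}
behavioral = {("A", "1"): 0.5, ("A", "2"): 0.5, ("B", "3"): 1.0, ("B", "4"): 0.0}
q = behavioral_to_sequence_form(info_sets, behavioral)
assert abs(q[("A", "1")] + q[("A", "2")] - q[""]) < 1e-9   # flow constraint at A
```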
Analogously, one can define the sequence-form polytope Qj rooted at information set j ∈ Ji, and Πj := Qj ∩ {0, 1}^{|Σj|}. We also use ‖Qi‖_1 to denote the maximum ℓ1-norm of a vector qi ∈ Qi. Finally, we denote by Di the depth of i's subtree.

Trigger deviations and EFCE To formalize the connection between EFCE and the framework of Φ-regret, we introduce trigger deviation functions.

Definition 2.3 (Farina et al., 2022b). A trigger deviation function with respect to a trigger sequence ˆσ = (j, a) ∈ Σ*_i and a continuation strategy ˆπi ∈ Πj is any linear mapping f : R^{|Σi|} → R^{|Σi|} such that f(πi) = πi for all πi ∈ Πi such that πi[ˆσ] = 0; otherwise, that is, for all πi ∈ Πi with πi[ˆσ] = 1,

\[ f(\pi_i)[\sigma] = \begin{cases} \pi_i[\sigma] & \text{if } \sigma \notin \Sigma_j, \\ \hat{\pi}_i[\sigma] & \text{if } \sigma \in \Sigma_j. \end{cases} \]

We denote by Ψi the convex hull of all trigger deviation functions over all trigger sequences and deterministic continuation strategies; Ψi-regret is referred to as trigger regret. In an extensive-form correlated equilibrium (EFCE) (Von Stengel & Forges, 2008) no trigger deviation by any player can improve the utility of that player, leading to the following connection.

Theorem 2.4 (Farina et al., 2022b). If each player i ∈ [[n]] incurs trigger regret Reg^T_{Ψi} after T repetitions of the game, the average product distribution of play is a (1/T) max_{i ∈ [[n]]} Reg^T_{Ψi}-approximate EFCE.

Moreover, extensive-form coarse correlated equilibria (EFCCE) (Farina et al., 2020) are defined analogously based on coarse trigger deviations Ψi; the difference is that in EFCCE the player decides whether to follow the recommendation before actually seeing the recommendation at that information set (see Appendix A for the definition and specific examples).

3. Near-Optimal Learning for EFCE

In this section, we establish our main result: efficient learning dynamics with O(log T) per-player trigger regret; this is made precise in Theorem 3.10, the informal version of which was stated earlier in Theorem 1.1. This section is organized as follows: Section 3.1 analyzes the regret of the algorithm operating over the set of trigger deviation functions; Section 3.2 provides a refined characterization for the corresponding fixed points; and, finally, Section 3.3 combines all the previous pieces to arrive at Theorem 3.10. All the proofs are deferred to Appendix B.

3.1. Regret Minimizer for Trigger Deviations

The algorithm Our construction for minimizing trigger regret uses the general template of Gordon et al. (2008) (Algorithm 1), and in particular, the approach of Farina et al. (2022b) in order to construct an external regret minimizer for the set Ψi (a similar approach also applies for the set of coarse trigger deviations Ψi). More precisely, that construction leverages one separate regret minimizer Rˆσ for every possible trigger sequence ˆσ ∈ Σ*_i (recall Definition 2.3). In particular, Rˆσ, with ˆσ = (j, a), operates (after performing an affine transformation) over sequence-form vectors qˆσ ∈ Qj (rooted at information set j ∈ Ji). Then, those regret minimizers are combined using a regret minimizer R operating over the simplex Δ(Σ*_i). The first key ingredient in our construction is the use of a logarithmic regularizer. More specifically, we instantiate each regret minimizer with LRL-OFTRL, a recent algorithm due to Farina et al. (2022a). LRL-OFTRL is an instance of OFTRL (Syrgkanis et al., 2015) with logarithmic regularization; the main twist is that LRL-OFTRL operates over an appropriately lifted space. The overall construction is given in Algorithm 2.
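To convey the flavor of the logarithmic regularizer, here is a sketch of a single OFTRL step with log regularization over the probability simplex, solved through its KKT conditions by bisection. This is only meant to illustrate the update; the actual LRL-OFTRL algorithm of Farina et al. (2022a) operates over a lifted sequence-form set, and the function below is our simplification.

```python
import numpy as np

def oftrl_log_step(S, eta, tol=1e-12):
    """One step of argmax_{x in simplex} eta * <S, x> + sum_e log x[e], where S
    is the optimistic aggregate utility (last utility counted twice). The KKT
    conditions give x[e] = 1 / (lam - eta * S[e]) for the multiplier lam that
    enforces sum(x) = 1; lam is found by bisection."""
    d = len(S)
    lo, hi = eta * np.max(S) + 1e-12, eta * np.max(S) + d
    excess = lambda lam: np.sum(1.0 / (lam - eta * S)) - 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if excess(mid) > 0 else (lo, mid)
        if hi - lo < tol:
            break
    x = 1.0 / (0.5 * (lo + hi) - eta * S)
    return x / x.sum()          # tiny renormalization for numerical safety

# With observed utilities u^(1), ..., u^(t), the optimistic aggregate is
# S = u^(t) + sum_tau u^(tau); a small eta keeps iterates multiplicatively stable.
u = [np.array([0.2, -0.1, 0.4]), np.array([0.1, 0.0, 0.3])]
S = u[-1] + sum(u)
x = oftrl_log_step(S, eta=0.05)
assert np.all(x > 0) and abs(x.sum() - 1.0) < 1e-9
```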
For our purposes, we first apply (Farina et al., 2022a, Proposition 2 and Corollary 1) to obtain a suitable RVU bound for each regret minimizer Rˆσ instantiated with LRL-OFTRL, for each ˆσ Σ i . Lemma 3.1. Fix any ˆσ Σ i , and let Reg T ˆσ be the regret of Rˆσ up to time T 2. For any η 1 256 Qi 1 , max{0, Reg T ˆσ } can be upper bounded by 2|Σi| log T η +16η Qi 2 1 t=1 U (t+1) i U (t) i 2 t=1 q(t+1) ˆσ q(t) ˆσ 2 q(t) ˆσ , . (3) A few remarks are in order. First, we recall that U (t) i := u(t) i x(t) i , in accordance to Algorithm 2. Also, η > 0 Near-Optimal Φ-Regret Learning in Extensive-Form Games Algorithm 2 Ψi-Regret Minimizer R LRL-OFTRL(η ) over (Σ i ) {Rˆσ LRL-OFTRL(η) over Qj}ˆσ=(j, ) Σ i 2: function NEXTSTRATEGY() 3: λ(t) i R .NEXTSTRATEGY() 4: X(t) ˆσ Rˆσ.NEXTSTRATEGY(), for all ˆσ Σ i 5: ϕ(t) i P ˆσ Σ i λ(t) i [ˆσ]X(t) ˆσ 6: x(t) i FIXEDPOINT(ϕ(t) i ) 7: return x(t) i 8: end function 9: function OBSERVEUTILITY(u(t) i ) 10: Construct the utility U (t) i u(t) i x(t) i 11: Rˆσ.OBSERVEUTILITY(U (t) i ), for all ˆσ Σ i 12: R .OBSERVEUTILITY(( X(t) ˆσ , U (t) i )ˆσ Σ i ) 13: end function denotes the (time-invariant) learning rate of LRL-OFTRL. Furthermore, for ˆσ = (j, a) Σ i , in Lemma 3.1 we used the notation q(t+1) ˆσ q(t) ˆσ q(t) ˆσ , := max σ Σj 1 q(t+1) ˆσ [σ] q(t) ˆσ [σ] Lemma 3.1 establishes an RVU bound (Definition 2.1), but with two important refinements. First, the bound applies to max{0, Reg T ˆσ }, instead of Reg T ˆσ , ensuring that (3) is nonnegative. Further, the local norm appearing in (3) will also be crucial for our argument in the sequel (Lemma 3.5). Next, similarly to Lemma 3.1, we obtain a regret bound for R , the regret minimizer mixing over all {Rˆσ}ˆσ Σ i . Lemma 3.2. Let Reg T be the regret of R up to time T 2. For any η 1 512|Σi|, max{0, Reg T } can be upper bounded by 2|Σi| log T η + 16η |Σi|2 T 1 X t=1 u(t+1) u(t) 2 t=1 λ(t+1) i λ(t) i 2 λ(t) i , . Here, we used the notation u(t) [ˆσ] := X(t) ˆσ , U (t) i , where X(t) ˆσ is the output of Rˆσ, for each ˆσ Σ i ; that is, X(t) ˆσ R|Σi| |Σi| transforms sequence-form vectors based on the continuation strategy q(t) ˆσ below the trigger sequence ˆσ (recall Definition 2.3). We also note that λ(t) i (Σ i ) above represents the output of R at time t. We next use Theorem 2.2 to obtain a bound for Reg T Ψi, the Ψi-regret of the overall construction (Algorithm 2). Proposition 3.3. For any T N, max{0, Reg T Ψi} max{0, Reg T }+ X ˆσ Σ i max{0, Reg T ˆσ }. This uses the regret circuit for the convex hull (Farina et al., 2019b) to combine all the regret minimizers {Rˆσ}ˆσ Σ i via R into an external regret minimizer for the set Ψi; by virtue of Theorem 2.2, the external regret of the induced algorithm is equal to the Ψi-regret (Reg T Ψi) of player i. There is, however, one crucial twist: the guarantee of Farina et al. (2019b) would give a bound in terms of maxˆσ Σ i Reg T ˆσ , instead of P ˆσ Σ i Reg T ˆσ ; this is problematic for obtaining near-optimal rates as it breaks the RVU property over the convex hull. In general, it is not clear how to bound the maximum of the regrets by their sum since (external) regret can be negative. This is, in fact, a recurrent obstacle encountered in this line of work (Syrgkanis et al., 2015), and it is precisely the reason why approaches based on regret decomposition in the spirit of CFR (Zinkevich et al., 2007) failed to bring rates better than T 3/4 (Farina et al., 2019a). Proposition 3.3 circumvents those obstacles by establishing bounds in terms of nonnegative measures of regret. 
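For intuition, the following sketch shows, in code, the two ingredients of Algorithm 2 that the above lemmas analyze: the convex combination of the local outputs (the regret circuit for the convex hull behind Proposition 3.3) and the utilities that are fed back to the local and mixing regret minimizers. The function names and array conventions are ours.

```python
import numpy as np

def next_transformation(lam, local_matrices):
    """Mixing step of Algorithm 2 (a sketch): the simplex minimizer outputs
    weights lam over trigger sequences, each local minimizer outputs a deviation
    matrix X_sigma, and the combined transformation is their convex combination."""
    X = np.stack(local_matrices)            # shape: (num_trigger_seqs, d, d)
    return np.tensordot(lam, X, axes=1)     # sum_sigma lam[sigma] * X_sigma

def feedback(local_matrices, u, x):
    """Utilities fed back in Algorithm 2 (a sketch): each local minimizer observes
    the matrix U = u x^T, while the simplex minimizer observes
    u_*[sigma] = <X_sigma, U> (a Frobenius inner product)."""
    U = np.outer(u, x)
    u_star = np.array([np.tensordot(X, U, axes=2) for X in local_matrices])
    return U, u_star
```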
Remark 3.4 (Near-optimal regret via CFR-type algorithms). An important byproduct of our techniques, and in particular Proposition 3.3 along with RVU bounds for nonnegative measures of regret (Anagnostides et al., 2022c), is the first near-optimal O(log T) regret bound for CFR-type algorithms in general games, a question that has been open even in (two-player) zero-sum games (see the discussion by Farina et al. (2019a)).3 3.2. Characterizing the Fixed Points Next, our main goal is to obtain an RVU bound for max{0, Reg T Ψi}, but cast in terms of the player s strategies (x(t) i )1 t T , as well as the utilities (u(t) i )1 t T observed by that player. In particular, in light of Lemmas 3.1 and 3.2, the crux is to appropriately bound λ(t+1) i λ(t) i λ(t) i , and P ˆσ Σ i q(t+1) ˆσ q(t) ˆσ q(t) ˆσ , in terms of x(t+1) i x(t) i the deviation of the player s strategy at every time t. To do so, we prove the following key result. Lemma 3.5. Let X(t) i RD >0 be defined for every time t N, for some D N. Further, suppose that for every 3Liu et al. (2022) very recently obtained near-optimal rates in zero-sum games, though with very different techniques. Near-Optimal Φ-Regret Learning in Extensive-Form Games time t N and σ Σi, x(t) i [σ] = pσ,k(X(t) i ) qσ,k(X(t) i ) , (4) for some multivariate polynomials {pσ,k}, {qσ,k} with positive coefficients and maximum degree degi N. If max e [[D]] 1 X(t+1) i [e] 100 256 degi , (5) it holds that x(t+1) i x(t) i 1 (4 Qi 1 degi) max e [[D]] 1 X(t+1) i [e] We recall that, based on Algorithm 1, the final strategy x(t) i is simply a fixed point of ϕ(t) i Ψ(t) i , where ϕ(t) i is a function of X(t) i = (λ(t) i , (q(t) ˆσ )ˆσ Σ i ). Equation (4) postulates that the fixed point is given by a rational function with positive coefficients. Taking a step back, let us clarify that assumption in the context of the no-swap-regret algorithm of Blum & Mansour (2007), a specific instance of Algorithm 1. In that algorithm, the fixed point is a stationary distribution of the underlying stochastic matrix Xi; hence, (4) is simply a consequence of the Markov chain tree theorem (see (Anantharam & Tsoucas, 1989)), with degree in the order of the rank of the corresponding stochastic matrix. While insisting on having positive coefficients in Lemma 3.5 may seem restrictive at first glance, in Propositions B.5 and B.6 (in Appendix B) we show that, in fact, it comes without any loss under sequence-form vectors. We further remark that the degree of the rational function is a measure of the complexity of the fixed points, as it will be highlighted in Propositions 3.6 and 3.7 below. Finally, the property in (5) will be satisfied for our construction since the regret minimizers we employ guarantee multiplicative stability (as we formally show in Lemmas B.2 and B.4), meaning that the ratio of any two consecutive coordinates is close to 1; this refined notion of stability is ensured by the use of the logarithmic regularizer. We now establish that assumption (4) is satisfied for transformations in Ψi with only a moderate degree. First, as a warm-up, we consider fixed points associated with coarse trigger deviation functions Ψi. Proposition 3.6. Let ϕ(t) i Ψi be a transformation defined by X(t) i = (λ(t) i , (q(t) j )j Ji) RD >0, for some D N and time t N. The unique fixed point x(t) i of ϕ(t) i satisfies (4) with degi 2Di. This property is established by leveraging the closedform characterization for the fixed points associated with EFCCE given by Anagnostides et al. (2022b). 
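Before turning to trigger deviations, a toy numeric check may help convey the mechanism behind Lemma 3.5: when each coordinate of the fixed point is a ratio of positive-coefficient polynomials of degree at most deg, a multiplicative perturbation of the inputs by a factor of 1 ± μ perturbs each coordinate by a relative factor of at most roughly 4·deg·μ, mirroring the per-coordinate bound used in the proof. The polynomials below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
deg, mu = 3, 1e-3
X = rng.uniform(0.5, 1.5, size=5)
p = lambda X: X[0] * X[1] * X[2] + 2.0 * X[3] ** 2            # positive coefficients, degree 3
q = lambda X: X[0] * X[4] + X[1] * X[2] * X[3] + 0.5 * X[4]   # positive coefficients, degree 3

# Perturb every coordinate multiplicatively by a factor in [1 - mu, 1 + mu].
X_new = X * rng.uniform(1 - mu, 1 + mu, size=5)
rel_change = abs(p(X_new) / q(X_new) - p(X) / q(X)) / (p(X) / q(X))
assert rel_change <= 4 * deg * mu     # the (1 +/- mu)^deg expansion from the proof
```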
Next, let us focus on the fixed points of trigger deviation functions. Unlike EFCCE, determining such fixed points requires computing stationary distributions of Markov chains along paths of the tree, commencing from the root and gradually making way towards the leaves (Farina et al., 2022b); this substantially complicates the analysis. Nevertheless, we leverage a refined characterization of the stationary distribution at every information set (Anagnostides et al., 2022b) to obtain the following. Proposition 3.7. Let ϕ(t) i Ψi be a transformation defined by X(t) i = (λ(t) i , (q(t) ˆσ )ˆσ Σ i ) RD >0, for some D N and time t N. The (unique) fixed point x(t) i of ϕ(t) i satisfies (4) with degi 2Di|Ai|, where |Ai| := maxj Ji |Aj|. In proof, we show that augmenting a partial fixed point at a new (successor) information set can only increase the degree of the rational function by an additive factor of 2|Ai|; Proposition 3.7 then follows by induction. It is crucial to note that using the Markov chain tree theorem directly at every information set would only give a bound on the degree that could be exponential in the description of the extensive-form game. Next, we combine Proposition 3.7 with Lemma 3.5 to derive the following key inequality. Lemma 3.8. Consider any parameters η 1 256 Qi 1 degi and η 1 512|Σi| degi , where degi := 2|Ai|Di. Then, for any time t [[T 1]], x(t+1) i x(t) i 1 8 Qi 1|Ai|Di M(X(t) i ), where M(X(t) i ) is defined as 1 λ(t+1) i [ˆσ] λ(t) i [ˆσ] , max ˆσ Σ i max σ Σj 1 q(t+1) ˆσ [σ] q(t) ˆσ [σ] 3.3. Putting Everything Together We now combine Lemma 3.8 with Lemmas 3.1 and 3.2 and Proposition 3.3, as well as some further manipulations of the utilities (U (t) i )1 t T (Lemma 3.1) and (u(t) )1 t T (Lemma 3.2) to derive the following RVU bound. Corollary 3.9. Suppose that η 1 212|Σi|1.5 Qi 1 degi and η = 1 2|Σi|η, where degi := 2|Ai|Di. For any T 2, max{0, Reg T Ψi} can be upper bounded by 8|Σi|2 log T η + 256η|Σi|3 T 1 X t=1 u(t+1) i u(t) i 2 1 215η deg2 i Qi 2 1 t=1 x(t+1) i x(t) i 2 1. Near-Optimal Φ-Regret Learning in Extensive-Form Games Now given that the RVU bound in Corollary 3.9 has been obtained for max{0, Reg T Ψi}, a nonnegative measure of regret, we can show that the second-order path length of the dynamics when all players follow Algorithm 2 is bounded by O(log T); that is, i=1 x(t+1) i x(t) i 2 1 = O(log T). This step is formalized in Theorem B.12 (in Appendix B), and follows the technique of Anagnostides et al. (2022c), leading to our main result; below we use the notation |Σ| := maxi [[n]] |Σi|, and similarly for the other symbols (namely, Q 1, |A| and D). Theorem 3.10. If all players employ Algorithm 2, the trigger regret of each player i [[n]] after T repetitions will be bounded as Reg T Ψi Cn|Σ|3.5 Q 1|Z||A|D log T, (6) for a universal constant C > 0. While the above theorem applies when all players follow the prescribed protocol, it is easy to ensure at the same time that the trigger regret of each player will grow as O( T) even if the rest of the players are instead acting so as to minimize that player s utility (Anagnostides et al., 2022c). As we highlight in Section 5, improving the dependence of (6) on the underlying parameters of the game is an important direction for future work. For EFCCE, in accordance to Proposition 3.6, we obtain a slightly improved regret bound (see Corollary B.14 in Appendix B). Remark 3.11 (Beyond EFCE). 
While we have focused primarily on obtaining near-optimal guarantees for trigger regret, corresponding to EFCE, our techniques apply more broadly to Φ-regret minimization under two conditions: i) the fixed point of any ϕ Φ should admit a characterization as a rational function per (4); and (ii) one can efficiently perform projections to the set Φ (in the sense of Farina et al. (2022a)). For example, we believe that those two conditions are met even when the set of deviations Φ includes all possible linear transformations, which is of course stronger than trigger regret. 4. Experimental Results Finally, in this section we experimentally verify our theoretical results on several common benchmark extensiveform games: (i) 3-player Kuhn poker (Kuhn, 1953); (ii) 2-player Goofspiel (Ross, 1971); and (iii) 2-player Sheriff (Farina et al., 2019c). We note that none of the above is a two-player zero-sum game for which no-regret learning is well-known to lead to Nash equilibria. A detailed description of the game instances we use in our experiments is included in Appendix C. In accordance to Theorem 3.10, we instantiate each local regret minimizer using LRL-OFTRL, and all players use the same learning algorithm. For simplicity we use the same learning rate η > 0 for all the local regret minimizers, which is treated as a hyperparameter in order to obtain better empirical performance. In particular, after a very mild tuning process, we chose η = 1 for all our experiments. We compare the performance of our algorithm with that of two other popular regret minimizers: 1) CFR with regret matching (RM) (Zinkevich et al., 2007), meaning that every local regret minimizer Rˆσ uses CFR (with RM) and R (which is an algorithm for the simplex) also uses RM; and 2) CFR+ with RM+ (Tammelin et al., 2015). We did not employ alternation or linear averaging, two popular tricks that accelerate convergence in zero-sum games, as it is not known if those techniques retain convergence in our setting. Our findings are illustrated in Figure 1. As predicted by our theory (Theorem 3.10), the trigger regret of all players appears to grow as O(log T) (the x-axis is logarithmic), implying convergence to the set of EFCE with a rate of log T T . In contrast, although the trigger regret experienced by the other regret minimizers is sometimes smaller compared to our algorithm, their asymptotic growth apparently exhibits an unfavorable exponential increase, meaning that their trigger regret grows as ω(log T), with the exception of 3-player Kuhn poker. In fact, for Kuhn poker we see that the learning dynamics actually converge to a Nash equilibrium after only a few iterations, but this is not a typical behavior beyond two-player zero-sum games. Indeed, for the other two games in Figure 1 we do not have convergence to a Nash equilibrium, although as predicted by our theory we observe convergence to EFCE (since the players average trigger regrets approach to 0). The overall conclusion is that our algorithm should be preferred in the high-precision regime. It is also worth pointing out that we obtained qualitatively similar regret bounds for coarse trigger regret associated with EFCCE. 5. Conclusions and Future Research In this paper, we established the first near-optimal log T T rates of convergence to extensive-form correlated equilibria, thereby extending recent work from normalform games to the substantially more complex class of imperfect-information extensive-form games. 
Our approach for obtaining near-optimal Φ-regret guarantees can in fact be further extended even beyond extensive-form games, as long as the fixed points admit the characterization imposed by Lemma 3.5. Our techniques are also of independent interest for deriving near-optimal rates using the regret-decomposition approach, a question that has previously remained elusive even in two-player zero-sum games (Farina et al., 2019a). In particular, we initiated the study of regret circuits in the sense of Farina et al. (2019b) that preserve the RVU property, and we established a new composition result for the convex hull, which has been the main obstacle in prior approaches (Farina et al., 2019a).

[Figure 1: nine log-scale plots of per-player trigger regret; row panel titles: Trigger regret (Ours), Trigger regret (CFR/RM), Trigger regret (CFR/RM+).]
Figure 1. Trigger regret of each player on (i) Kuhn poker (left); (ii) Goofspiel (center); and (iii) Sheriff (right). Every row corresponds to a different algorithm, starting from ours in the first one. The x-axis indicates the iteration, while the y-axis indicates the corresponding trigger regret for each player. We emphasize that the x-axis is logarithmic.

There are still many interesting avenues for future research related to our work. While our trigger-regret bounds are near-optimal in terms of the dependence on T (Theorem 3.10), the dependence on the parameters of the game in (6) can likely be improved. Establishing near-optimal trigger regret under dynamics that do not employ logarithmic regularization, such as optimistic hedge, could be a helpful step in that direction, but that is currently a challenging open problem; it is plausible that the techniques of Anagnostides et al. (2022a) in conjunction with the regret bounds of Daskalakis et al. (2021) could be useful in that direction, although the combinatorial complexity of trigger deviation functions poses considerable challenges. Another interesting problem is to characterize the set of transformations Φ under which our techniques are applicable (see Remark 3.11). In a similar vein, we suspect that our approach leads to near-optimal convergence to the set of behavioral correlated equilibria (BCE), which corresponds to minimizing swap regret at every information set locally (Morrill et al., 2021b).

Acknowledgements

We are grateful to the anonymous reviewers at ICML 2023 for a number of helpful suggestions and corrections. The work of Prof. Sandholm's research group is funded by the National Science Foundation under grants IIS-1901403 and CCF-1733556, and by the ARO under award W911NF2210266.

References

Anagnostides, I., Daskalakis, C., Farina, G., Fishelson, M., Golowich, N., and Sandholm, T. Near-optimal no-regret learning for correlated equilibria in multi-player general-sum games. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pp. 736-749. ACM, 2022a.

Anagnostides, I., Farina, G., Kroer, C., Celli, A., and Sandholm, T. Faster no-regret learning dynamics for extensive-form correlated and coarse correlated equilibria. In Proceedings of the ACM Conference on Economics and Computation (EC), pp. 915-916. ACM, 2022b.
Anagnostides, I., Farina, G., Kroer, C., Lee, C., Luo, H., and Sandholm, T. Uncoupled learning dynamics with O(log T) swap regret in multiplayer games. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2022c. Anantharam, V. and Tsoucas, P. A proof of the markov chain tree theorem. Statistics & Probability Letters, 8 (2):189 192, 1989. Aumann, R. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67 96, 1974. Bai, Y., Jin, C., Mei, S., Song, Z., and Yu, T. Efficient phiregret minimization in extensive-form games via online mirror descent. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2022a. Bai, Y., Jin, C., Mei, S., and Yu, T. Near-optimal learning of extensive-form games with imperfect information. In International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, pp. 1337 1382. PMLR, 2022b. Bernasconi, M., Castiglioni, M., Marchesi, A., Trov o, F., and Gatti, N. Constrained phi-equilibria. Co RR, abs/2301.13600, 2023. Blum, A. and Mansour, Y. From external to internal regret. Journal of Machine Learning Research, 8:1307 1324, 2007. Celli, A., Marchesi, A., Farina, G., and Gatti, N. No-regret learning dynamics for extensive-form correlated equilibrium. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2020. Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006. Chen, X. and Peng, B. Hedging in games: Faster convergence of external and swap regrets. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2020. Daskalakis, C. and Golowich, N. Fast rates for nonparametric online learning: from realizability to learning in games. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pp. 846 859. ACM, 2022. Daskalakis, C., Deckelbaum, A., and Kim, A. Nearoptimal no-regret algorithms for zero-sum games. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011. Daskalakis, C., Fishelson, M., and Golowich, N. Nearoptimal no-regret learning in general games. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), pp. 27604 27616, 2021. Dekel, E. and Fudenberg, D. Rational behavior with payoff uncertainty. Journal of Economic Theory, 52(2):243 267, 1990. Dud ık, M. and Gordon, G. J. A sampling-based approach to computing equilibria in succinct extensive-form games. In Bilmes, J. A. and Ng, A. Y. (eds.), Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pp. 151 160. AUAI Press, 2009. Erez, L., Lancewicki, T., Sherman, U., Koren, T., and Mansour, Y. Regret minimization and convergence to equilibria in general-sum markov games. Co RR, abs/2207.14211, 2022. Farina, G., Kroer, C., Brown, N., and Sandholm, T. Stable-predictive optimistic counterfactual regret minimization. In International Conference on Machine Learning (ICML), 2019a. Farina, G., Kroer, C., and Sandholm, T. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning, pp. 1863 1872, 2019b. Farina, G., Ling, C. K., Fang, F., and Sandholm, T. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Conference on Neural Information Processing Systems (Neur IPS), 2019c. Farina, G., Bianchi, T., and Sandholm, T. Coarse correlation in extensive-form games. 
In AAAI Conference on Artificial Intelligence (AAAI), pp. 1934 1941, 2020. Farina, G., Anagnostides, I., Luo, H., Lee, C., Kroer, C., and Sandholm, T. Near-optimal no-regret learning dynamics for general convex games. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2022a. Near-Optimal Φ-Regret Learning in Extensive-Form Games Farina, G., Celli, A., Marchesi, A., and Gatti, N. Simple uncoupled no-regret learning dynamics for extensive-form correlated equilibrium. Journal of the ACM, 69(6):41:1 41:41, 2022b. Farina, G., Lee, C., Luo, H., and Kroer, C. Kernelized multiplicative weights for 0/1-polyhedral games: Bridging the gap between learning in extensive-form and normal-form games. In International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, pp. 6337 6357. PMLR, 2022c. Foster, D. and Vohra, R. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40 55, 1997. Foster, D. J., Li, Z., Lykouris, T., Sridharan, K., and Tardos, E. Learning in games: Robustness of fast convergence. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), pp. 4727 4735, 2016. Fujii, K. Bayes correlated equilibria and no-regret dynamics. Co RR, abs/2304.05005, 2023. Giannou, A., Vlatakis-Gkaragkounis, E., and Mertikopoulos, P. Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information. In Conference on Learning Theory (COLT), volume 134 of Proceedings of Machine Learning Research, pp. 2147 2148. PMLR, 2021. Gordon, G. J., Greenwald, A., and Marks, C. No-regret learning in convex games. In International Conference on Machine Learning (ICML), pp. 360 367. ACM, 2008. Greenwald, A. and Jafari, A. A general class of no-regret learning algorithms and game-theoretic equilibria. In Computational Learning Theory and Kernel Machines, COLT/Kernel 2003, volume 2777 of Lecture Notes in Computer Science, pp. 2 12. Springer, 2003. Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68: 1127 1150, 2000. Hazan, E. and Kale, S. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pp. 625 632. Curran Associates, Inc., 2007. Hsieh, Y., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium. In Conference on Learning Theory (COLT), volume 134 of Proceedings of Machine Learning Research, pp. 2388 2422. PMLR, 2021. Hsieh, Y., Antonakopoulos, K., Cevher, V., and Mertikopoulos, P. No-regret learning in games with noisy feedback: Faster rates and adaptivity via learning rate separation. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2022. Huang, W. and von Stengel, B. Computing an extensiveform correlated equilibrium in polynomial time. In International Workshop On Internet And Network Economics (WINE), volume 5385 of Lecture Notes in Computer Science, pp. 506 513. Springer, 2008. Kuhn, H. W. Extensive games and the problem of information. In Kuhn, H. W. and Tucker, A. W. (eds.), Contributions to the Theory of Games, volume 2 of Annals of Mathematics Studies, 28, pp. 193 216. Princeton University Press, Princeton, NJ, 1953. Leyton-Brown, K. and Shoham, Y. Essentials of Game Theory: A Concise Multidisciplinary Introduction. 
Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2008. Li, X., Meng, M., Hong, Y., and Chen, J. A survey of decision making in adversarial games. ar Xiv preprint ar Xiv:2207.07971, 2022. Liu, M., Ozdaglar, A. E., Yu, T., and Zhang, K. The power of regularization in solving extensive-form games. Co RR, abs/2206.09495, 2022. Marks, C. A. No-regret learning and game-theoretic equilibria. Brown University, 2008. Morrill, D., D Orazio, R., Lanctot, M., Wright, J. R., Bowling, M., and Greenwald, A. R. Efficient deviation types and learning for hindsight rationality in extensive-form games. In International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pp. 7818 7828. PMLR, 2021a. Morrill, D., D Orazio, R., Sarfati, R., Lanctot, M., Wright, J. R., Greenwald, A. R., and Bowling, M. Hindsight and sequential rationality of correlated play. In AAAI Conference on Artificial Intelligence (AAAI), pp. 5584 5594. AAAI Press, 2021b. Moulin, H. and Vial, J.-P. Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 7(3-4):201 221, 1978. Piliouras, G., Rowland, M., Omidshafiei, S., Elie, R., Hennes, D., Connor, J. T., and Tuyls, K. Evolutionary Near-Optimal Φ-Regret Learning in Extensive-Form Games dynamics and phi-regret minimization in games. Journal of Artificial Intelligence Research, 74:1125 1158, 2022a. Piliouras, G., Sim, R., and Skoulakis, S. Beyond timeaverage convergence: Near-optimal uncoupled online learning via clairvoyant multiplicative weights update. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2022b. Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on Learning Theory (COLT), pp. 993 1019, 2013a. Rakhlin, A. and Sridharan, K. Optimization, learning, and games with predictable sequences. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pp. 3066 3074, 2013b. Rakhlin, A., Sridharan, K., and Tewari, A. Online learning: Beyond regret. In Conference on Learning Theory (COLT), volume 19 of JMLR Proceedings, pp. 559 594. JMLR.org, 2011. Ross, S. M. Goofspiel the game of pure strategy. Journal of Applied Probability, 8(3):621 625, 1971. Song, Z., Mei, S., and Bai, Y. Sample-efficient learning of correlated equilibria in extensive-form games. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2022. Stoltz, G. and Lugosi, G. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59(1):187 208, 2007. Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), pp. 2989 2997, 2015. Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015. Viossat, Y. and Zapechelnyuk, A. No-regret dynamics and fictitious play. J. Econ. Theory, 148(2):825 842, 2013. Von Stengel, B. and Forges, F. Extensive-form correlated equilibrium: Definition and computational complexity. Mathematics of Operations Research, 33(4):1002 1022, 2008. Wei, C. and Luo, H. More adaptive algorithms for adversarial bandits. 
In Conference on Learning Theory (COLT), volume 75 of Proceedings of Machine Learning Research, pp. 1263 1291. PMLR, 2018. Yang, Y. and Ma, C. O(T 1) convergence of optimisticfollow-the-regularized-leader in two-player zero-sum markov games. ar Xiv preprint ar Xiv:2209.12430, 2022. Zhang, H. A simple adaptive procedure converging to forgiving correlated equilibria. Co RR, abs/2207.06548, 2022. Zhang, R., Liu, Q., Wang, H., Xiong, C., Li, N., and Bai, Y. Policy optimization for markov games: Unified framework and faster convergence. In Proceedings of the Annual Conference on Neural Information Processing Systems (Neur IPS), 2022. Zinkevich, M., Bowling, M., Johanson, M., and Piccione, C. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007. Near-Optimal Φ-Regret Learning in Extensive-Form Games A. Additional Preliminaries In this section, we provide some additional background on extensive-form games and (coarse) trigger deviation functions. An illustrative example First, to clarify some of the concepts we introduced earlier in Section 2, we consider the simple two-player EFG of Figure 2. White round nodes correspond to player 1, while black round nodes to player 2. We use square nodes to represent terminal nodes (or leaves). As illustrated in Figure 2, player 1 has two information sets, denoted by J1 := {A, B}, each containing two nodes. Further, the set of sequences of player 1 can be represented as Σ1 := { , 1, 2, 3, 4}; here, we omitted specifying the corresponding information set since we use different symbols for actions belonging to different information sets. 1 1 2 2 3 3 4 4 Figure 2. Example of a two-player EFG. Trigger deviation functions It will be convenient to represent trigger deviation functions, in the sense of Definition 2.3, as follows. Definition A.1 (Farina et al., 2022b). Let ˆσ = (j, a) Σ i , and q Qj. We let Mˆσ q R|Σi| |Σi| be a matrix, so that for any σr, σc Σi, 1 if σc ˆσ and σr = σc; q[σr] if σc = ˆσ and σr j; and 0 otherwise. We will let ϕˆσ q denote the linear function x 7 Mˆσ qx, for some q Qj. It is immediate to verify that for any ˆσ = (j, a) Σ i and q Qj, ϕˆσ q is a trigger deviation function in the sense of Definition 2.3. To clarify Definition A.1, below we give two examples for the EFG of Figure 2. If q = ( 1 1 0 0 0 0 1 0 1/2 0 0 0 2 0 1/2 1 0 0 3 0 0 0 1 0 4 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 2 0 0 1 0 0 3 0 0 0 1/2 0 4 0 0 0 1/2 1 The following characterization can be readily extracted from (Farina et al., 2022b). Claim A.2. Every transformation ϕi Ψi can be expressed as P ˆσ Σ i λi[ˆσ]ϕˆσ qˆσ, where λi (Σ i ) and qˆσ Qj for ˆσ = (j, a) Σ i . Coarse trigger deviation functions Analogously, coarse trigger deviation functions can be represented as follows. Definition A.3 (Anagnostides et al., 2022b). Let j Ji and q Qj. We let Mj q R|Σi| |Σi| be a matrix, so that for any σr, σc Σi, 1 if σc j and σr = σc; q[σr] if σc = σj and σr j; and 0 otherwise. Near-Optimal Φ-Regret Learning in Extensive-Form Games Unlike trigger deviations, which are triggered by a sequence, we point out that coarse trigger deviations are triggered by an information set; see the work of Farina et al. (2020) for a more detailed discussion on this point. 
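For concreteness, a small sketch of Definition A.1 in code; coarse deviation matrices (Definition A.3) are built analogously, with the identity placed on columns σ_c not reaching j and the continuation placed on the column of the parent sequence σ_j. The index bookkeeping below is ours, and the example at the end appears to reproduce the first matrix given above for the game of Figure 2, with trigger sequence (A, 1) and a uniform continuation strategy.

```python
import numpy as np

def trigger_deviation_matrix(seqs, q, trigger, below_j, below_trigger):
    """Build M^{sigma_hat}_q per Definition A.1: identity on columns that do not
    succeed the trigger sequence, and the continuation strategy q on the column
    indexed by the trigger sequence (for rows at or below information set j)."""
    idx = {s: k for k, s in enumerate(seqs)}
    M = np.zeros((len(seqs), len(seqs)))
    for s in seqs:
        if s not in below_trigger:          # sigma_c does not succeed sigma_hat
            M[idx[s], idx[s]] = 1.0
    for s in below_j:                       # sigma_r in Sigma_j, column sigma_hat
        M[idx[s], idx[trigger]] = q[s]
    return M

seqs = ["empty", "1", "2", "3", "4"]        # the sequences of player 1 in Figure 2
M = trigger_deviation_matrix(seqs, q={"1": 0.5, "2": 0.5}, trigger="1",
                             below_j=["1", "2"], below_trigger=["1"])
# Column "1" becomes (0, 1/2, 1/2, 0, 0)^T; every other column is an identity column.
```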
Returning to the example of Figure 2, and letting again q = ( 1 1 0 0 0 0 1 1/2 0 0 0 0 2 1/2 0 0 0 0 3 0 0 0 1 0 4 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 2 0 0 1 0 0 3 1/2 0 0 0 0 4 1/2 0 0 0 0 Analogously to Claim A.2, one can show the following characterization. Claim A.4. Every transformation ϕi Ψi can be expressed as P j Ji λi[j]ϕj qj, where λi (Ji) and qj Qj. The connection between coarse trigger deviation functions and EFCCE is illuminated in the following fact. Theorem A.5 (Anagnostides et al., 2022b). If each player i [[n]] incurs coarse trigger regret Reg T Ψi after T repetitions of the game, the average product distribution of play is a 1 T maxi [[n]] Reg T Ψi-approximate EFCCE. B. Omitted Proofs In this section, we provide all the omitted proofs from the main body (Section 3). For the convenience of the reader, we restate each claim before proceeding with its proof. B.1. RVU Bounds for the Set of Deviations Let us fix a player i [[n]]. First, we prove Lemma 3.1. To this end, let us provide some auxiliary claims. Recall that, for each ˆσ = (j, a) Σ i , Rˆσ receives at every time t the utility U (t) i := u(t) i x(t) i , and the next strategy is computed via LRL-OFTRL (Farina et al., 2022a); namely, we first compute q(t) ˆσ := ( q(t) ˆσ [0], ( q(t) ˆσ [e])e Σj) = (λ(t) ˆσ , y(t) ˆσ ) Qj, for a time t N, as arg max qˆσ Qj η D S(t 1) ˆσ , qˆσ E + X e Σj {0} log qˆσ[e] (i) Qj := {(λˆσ, yˆσ) : λˆσ [0, 1], yˆσ λˆσQj}; (ii) U (t) ˆσ := ( q(t) ˆσ , U (t) ˆσ , U (t) ˆσ ), where in turn U (t) ˆσ is the component of U (t) i that corresponds to sequence ˆσ; (iii) S(t 1) ˆσ := U (t 1) ˆσ + Pt 1 τ=1 U (τ) ˆσ ; and (iv) η > 0 is the learning rate common among all Rˆσ. Finally, having determined q(t) ˆσ = (λ(t) ˆσ , y(t) ˆσ ), we compute q(t) ˆσ := y(t) ˆσ λ(t) ˆσ Qj. In turn, this gives the next strategy of Rˆσ as X(t) ˆσ := Mˆσ q(t) ˆσ (recall Definition A.1). It is evident that the regret minimization problem faced by each Rˆσ is equivalent to minimizing regret over Qj, since only the components of X(t) ˆσ that correspond to q(t) ˆσ cumulate regret (the rest are constant), leading to the regret bound below. We note that all the subsequent analysis operates under the tacit premise that each local regret minimizer is updated via LRL-OFTRL, without explicitly mentioned in the statements in order to lighten the exposition. Proposition B.1. For any learning rate η 1 256 Qi 1 and T 2, max{0, Reg T ˆσ } can be upper bounded by 2|Σi| log T η + 16η Qi 2 T 1 X t=1 U (t+1) i U (t) i 2 1 32η λ(t+1) ˆσ y(t+1 ˆσ λ(t) ˆσ y(t ˆσ Near-Optimal Φ-Regret Learning in Extensive-Form Games Proof. This regret bound is an immediate implication of (Farina et al., 2022a, Proposition 2 and Corollary 1). More precisely, we note that the regret bound by Farina et al. (2022a) applies if U (t) i 1 Qi 1 , for any t N. That assumption can be met by rescaling the learning rate by a factor of 1 Qi 1 since in our setting it holds that U (t) i 1; the latter follows from the definition of U (t) i := u(t) i x(t) i (Algorithm 1), and the fact that u(t) i 1 (by assumption) and x(t) i 1 (since Qi [0, 1]|Σi|). In Proposition B.1 we used the shorthand notation λ(t+1) ˆσ y(t+1 ˆσ λ(t) ˆσ y(t ˆσ λ(t+1) ˆσ y(t+1 ˆσ λ(t) ˆσ y(t ˆσ (λ(t) ˆσ ,y(t) ˆσ ) , where for a vector w Rd+1 and x Rd+1 >0 , we used the notation for the local norm induced by x. Further, we will also use the notation w x, := max e [[d+1]] Lemma B.2. 
Lemma B.2. For any sequence $\hat\sigma = (j,a) \in \Sigma_i^*$, learning rate $\eta \le \frac{1}{50\|Q_i\|_1}$, time $t \in [\![T-1]\!]$, and $\sigma \in \Sigma_j$,
$$\left| 1 - \frac{q^{(t+1)}_{\hat\sigma}[\sigma]}{q^{(t)}_{\hat\sigma}[\sigma]} \right| \le 4 \left\| (\lambda^{(t+1)}_{\hat\sigma}, y^{(t+1)}_{\hat\sigma}) - (\lambda^{(t)}_{\hat\sigma}, y^{(t)}_{\hat\sigma}) \right\|_{t,\infty} \le 100\eta \|Q_i\|_1.$$

Proof. We will need the following stability bound, extracted from (Farina et al., 2022a, Proposition 3).

Lemma B.3 (Farina et al., 2022a). For any sequence $\hat\sigma = (j,a) \in \Sigma_i^*$, time $t \in [\![T-1]\!]$ and learning rate $\eta \le \frac{1}{50\|Q_i\|_1}$,
$$\left\| (\lambda^{(t+1)}_{\hat\sigma}, y^{(t+1)}_{\hat\sigma}) - (\lambda^{(t)}_{\hat\sigma}, y^{(t)}_{\hat\sigma}) \right\|_{t,\infty} \le 22\eta \|Q_i\|_1.$$

Now let us fix a time $t \in [\![T-1]\!]$. For convenience, we introduce the notation
$$\mu^{(t)} := \left\| (\lambda^{(t+1)}_{\hat\sigma}, y^{(t+1)}_{\hat\sigma}) - (\lambda^{(t)}_{\hat\sigma}, y^{(t)}_{\hat\sigma}) \right\|_{t,\infty}.$$
For our choice of the learning rate $\eta \le \frac{1}{50\|Q_i\|_1}$, Lemma B.3 implies that $\mu^{(t)} \le \frac12$. By definition, we have
$$q^{(t+1)}_{\hat\sigma} := \frac{y^{(t+1)}_{\hat\sigma}}{\lambda^{(t+1)}_{\hat\sigma}} \le \frac{(1+\mu^{(t)})\, y^{(t)}_{\hat\sigma}}{(1-\mu^{(t)})\, \lambda^{(t)}_{\hat\sigma}} = \left(1 + \frac{2\mu^{(t)}}{1-\mu^{(t)}}\right) q^{(t)}_{\hat\sigma} \le (1 + 4\mu^{(t)})\, q^{(t)}_{\hat\sigma},$$
where the last bound follows since $\mu^{(t)} \le \frac12$. That is,
$$q^{(t+1)}_{\hat\sigma} \le q^{(t)}_{\hat\sigma} (1 + 4\mu^{(t)}). \tag{9}$$
Similarly,
$$q^{(t+1)}_{\hat\sigma} = \frac{y^{(t+1)}_{\hat\sigma}}{\lambda^{(t+1)}_{\hat\sigma}} \ge \frac{(1-\mu^{(t)})\, y^{(t)}_{\hat\sigma}}{(1+\mu^{(t)})\, \lambda^{(t)}_{\hat\sigma}} = \left(1 - \frac{2\mu^{(t)}}{1+\mu^{(t)}}\right) q^{(t)}_{\hat\sigma} \ge (1 - 2\mu^{(t)})\, q^{(t)}_{\hat\sigma}.$$
Thus,
$$q^{(t+1)}_{\hat\sigma} \ge q^{(t)}_{\hat\sigma} (1 - 2\mu^{(t)}). \tag{10}$$
As a result, the claim follows from (9) and (10).

We are now ready to establish Lemma 3.1, restated below.

Lemma 3.1. Fix any $\hat\sigma \in \Sigma_i^*$, and let $\mathrm{Reg}^T_{\hat\sigma}$ be the regret of $R_{\hat\sigma}$ up to time $T \ge 2$. For any $\eta \le \frac{1}{256\|Q_i\|_1}$, $\max\{0, \mathrm{Reg}^T_{\hat\sigma}\}$ can be upper bounded by
$$\frac{2|\Sigma_i| \log T}{\eta} + 16\eta \|Q_i\|_1^2 \sum_{t=1}^{T-1} \left\| U^{(t+1)}_i - U^{(t)}_i \right\|_\infty^2 - \frac{1}{512\eta} \sum_{t=1}^{T-1} \left\| q^{(t+1)}_{\hat\sigma} - q^{(t)}_{\hat\sigma} \right\|^2_{q^{(t)}_{\hat\sigma},\infty}.$$

Proof. The claim follows directly from Proposition B.1 and Lemma B.2.

Under the premise that $R_\lambda$—the regret minimizer maintaining the mixing weights $\lambda_i \in \Delta(\Sigma_i^*)$—is also updated via LRL-OFTRL, similar reasoning yields the proof of Lemma 3.2.

Lemma 3.2. Let $\mathrm{Reg}^T_\lambda$ be the regret of $R_\lambda$ up to time $T \ge 2$. For any $\eta' \le \frac{1}{512|\Sigma_i|}$, $\max\{0, \mathrm{Reg}^T_\lambda\}$ can be upper bounded by
$$\frac{2|\Sigma_i| \log T}{\eta'} + 16\eta' |\Sigma_i|^2 \sum_{t=1}^{T-1} \left\| u^{(t+1)}_\lambda - u^{(t)}_\lambda \right\|_\infty^2 - \frac{1}{512\eta'} \sum_{t=1}^{T-1} \left\| \lambda^{(t+1)}_i - \lambda^{(t)}_i \right\|^2_{\lambda^{(t)}_i,\infty}.$$

Proof. The argument is analogous to the proof of Lemma 3.1, leveraging the fact that
$$\left| u^{(t)}_\lambda[\hat\sigma] \right| = \left| \left\langle X^{(t)}_{\hat\sigma}, U^{(t)}_i \right\rangle \right| \le \left\| X^{(t)}_{\hat\sigma} \right\|_1 \left\| U^{(t)}_i \right\|_\infty \le 2|\Sigma_i|,$$
for any $\hat\sigma \in \Sigma_i^*$, by the Cauchy–Schwarz inequality.

Lemma B.4. For any $t \in [\![T-1]\!]$, $\eta' \le \frac{1}{512|\Sigma_i|}$, and $\hat\sigma \in \Sigma_i^*$,
$$\left| 1 - \frac{\lambda^{(t+1)}_i[\hat\sigma]}{\lambda^{(t)}_i[\hat\sigma]} \right| \le 200\eta' |\Sigma_i|.$$

Proof. The argument is analogous to Lemma B.2.

Next, we combine all those local regret minimizers, namely $R_\lambda$ and $(R_{\hat\sigma})_{\hat\sigma \in \Sigma_i^*}$, into a global regret minimizer $R_{\Psi_i}$ for the set $\Psi_i$ via the regret circuit for the convex hull. Finally, we denote by $R$ the $\Psi_i$-regret minimizer derived from Algorithm 1, based on $R_{\Psi_i}$.

Proposition 3.3. For any $T \in \mathbb{N}$,
$$\max\{0, \mathrm{Reg}^T_{\Psi_i}\} \le \max\{0, \mathrm{Reg}^T_\lambda\} + \sum_{\hat\sigma \in \Sigma_i^*} \max\{0, \mathrm{Reg}^T_{\hat\sigma}\}.$$

Proof. Using the guarantee of the regret circuit for the convex hull (Farina et al., 2019b), the external regret cumulated by $R_{\Psi_i}$ up to time $T$ is at most $\mathrm{Reg}^T_\lambda + \max_{\hat\sigma \in \Sigma_i^*} \mathrm{Reg}^T_{\hat\sigma}$. But, by Theorem 2.2, that external regret is equal to the $\Psi_i$-regret of $R$, constructed according to Algorithm 1. As a result,
$$\mathrm{Reg}^T_{\Psi_i} \le \mathrm{Reg}^T_\lambda + \max_{\hat\sigma \in \Sigma_i^*} \mathrm{Reg}^T_{\hat\sigma}.$$
In turn, this implies that
$$\max\{0, \mathrm{Reg}^T_{\Psi_i}\} \le \max\Big\{0,\ \mathrm{Reg}^T_\lambda + \max_{\hat\sigma \in \Sigma_i^*} \mathrm{Reg}^T_{\hat\sigma}\Big\} \le \max\{0, \mathrm{Reg}^T_\lambda\} + \max_{\hat\sigma \in \Sigma_i^*} \max\{0, \mathrm{Reg}^T_{\hat\sigma}\} \le \max\{0, \mathrm{Reg}^T_\lambda\} + \sum_{\hat\sigma \in \Sigma_i^*} \max\{0, \mathrm{Reg}^T_{\hat\sigma}\},$$
where the last inequality follows from the fact that $\max\{0, \mathrm{Reg}^T_{\hat\sigma}\} \ge 0$, for any $\hat\sigma \in \Sigma_i^*$.
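Concretely, one round of the convex-hull combination just described can be sketched as follows: the weights produced by $R_\lambda$ mix the matrices produced by the local minimizers, the matrix utility $U^{(t)}_i = u^{(t)}_i (x^{(t)}_i)^\top$ is forwarded to every $R_{\hat\sigma}$, and the vector of inner products $\langle X^{(t)}_{\hat\sigma}, U^{(t)}_i\rangle$ is forwarded to $R_\lambda$. The helper name and the toy $3\times3$ matrices below are ours.

```python
import numpy as np

def combine_and_split(lams, X_list, u, x):
    """One iteration of the convex-hull regret circuit used to build R_{Psi_i}:
    'lams' are the mixing weights produced by R_lambda, 'X_list' the matrices produced by
    the local minimizers R_{sigma_hat}, and (u, x) the current utility/strategy of player i.
    Returns the combined transformation and the utilities routed back to the local
    regret minimizers.  (Toy sketch; dimensions and names are illustrative.)"""
    Phi = sum(l * X for l, X in zip(lams, X_list))        # combined transformation
    U = np.outer(u, x)                                    # U^(t) = u^(t) (x^(t))^T
    u_lambda = np.array([np.sum(X * U) for X in X_list])  # <X_sigma, U>, fed to R_lambda
    return Phi, U, u_lambda                               # U is fed to every R_{sigma_hat}

# toy usage with two 3x3 "deviation" matrices and random utilities/strategies
rng = np.random.default_rng(0)
X_list = [np.eye(3), np.full((3, 3), 1 / 3)]
lams = np.array([0.7, 0.3])
Phi, U, u_lam = combine_and_split(lams, X_list, rng.random(3), rng.random(3))
print(Phi, u_lam)
```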
B.2. Characterizing the Fixed Points

We recall that $(x^{(t)}_i)_{1 \le t \le T}$ denotes the sequence of fixed points produced by Algorithm 1—that is, the strategies produced by $R$. The next key result relates the deviation of the fixed points in $\ell_1$ norm to the multiplicative deviation of the transformations, assuming a particular rational-function characterization of the fixed points.

Lemma 3.5. Let $X^{(t)}_i \in \mathbb{R}^D_{>0}$ be defined for every time $t \in \mathbb{N}$, for some $D \in \mathbb{N}$. Further, suppose that for every time $t \in \mathbb{N}$ and $\sigma \in \Sigma_i$,
$$x^{(t)}_i[\sigma] = \frac{p_{\sigma,k}(X^{(t)}_i)}{q_{\sigma,k}(X^{(t)}_i)}, \tag{4}$$
for some multivariate polynomials $\{p_{\sigma,k}\}, \{q_{\sigma,k}\}$ with positive coefficients and maximum degree $\deg_i \in \mathbb{N}$. If
$$\max_{e \in [\![D]\!]} \left| 1 - \frac{X^{(t+1)}_i[e]}{X^{(t)}_i[e]} \right| \le \frac{100}{256 \deg_i}, \tag{5}$$
it holds that
$$\left\| x^{(t+1)}_i - x^{(t)}_i \right\|_1 \le \left(4 \|Q_i\|_1 \deg_i\right) \max_{e \in [\![D]\!]} \left| 1 - \frac{X^{(t+1)}_i[e]}{X^{(t)}_i[e]} \right|.$$

Proof. Let us define
$$\mu^{(t)} := \max_{e \in [\![D]\!]} \left| 1 - \frac{X^{(t+1)}_i[e]}{X^{(t)}_i[e]} \right|. \tag{11}$$
By assumption, it holds that $\mu^{(t)} \le \frac{100}{256\deg_i} \le \frac{1}{2\deg_i}$. Further, suppose that
$$p_{\sigma,k} : X_i \mapsto \sum_{T \in \mathcal{T}_{\sigma,k}} C_T \prod_{e \in T} X_i[e], \tag{12}$$
and
$$q_{\sigma,k} : X_i \mapsto \sum_{T \in \mathcal{T}'_{\sigma,k}} C'_T \prod_{e \in T} X_i[e], \tag{13}$$
for all $(\sigma, k) \in \Sigma_i \times [\![m]\!]$, where $C_T > 0$ for any $T \in \mathcal{T}_{\sigma,k}$ and $C'_T > 0$ for any $T \in \mathcal{T}'_{\sigma,k}$. Here, $T$ can be a multiset or the empty set (the validity of (12) and (13) follows by assumption). Then, for $(\sigma,k) \in \Sigma_i \times [\![m]\!]$,
$$p_{\sigma,k}(X^{(t+1)}_i) = \sum_{T \in \mathcal{T}_{\sigma,k}} C_T \prod_{e \in T} X^{(t+1)}_i[e] \le \sum_{T \in \mathcal{T}_{\sigma,k}} C_T \prod_{e \in T} (1+\mu^{(t)}) X^{(t)}_i[e] \tag{14}$$
$$\le (1+\mu^{(t)})^{\deg_i} \sum_{T \in \mathcal{T}_{\sigma,k}} C_T \prod_{e \in T} X^{(t)}_i[e] \tag{15}$$
$$= (1+\mu^{(t)})^{\deg_i}\, p_{\sigma,k}(X^{(t)}_i) \le (1 + 1.3\mu^{(t)} \deg_i)\, p_{\sigma,k}(X^{(t)}_i), \tag{16}$$
where (14) follows since $X^{(t+1)}_i[e] \le (1+\mu^{(t)})X^{(t)}_i[e]$ for any $e \in [\![D]\!]$, by definition of $\mu^{(t)}$ in (11); (15) uses the fact that $|T| \le \deg_i$ for any $T \in \mathcal{T}_{\sigma,k}$; and (16) follows since $(1+\mu^{(t)})^{\deg_i} \le \exp\{\mu^{(t)} \deg_i\} \le 1 + 1.3\mu^{(t)} \deg_i$ for $\mu^{(t)} \le \frac{1}{2\deg_i}$. Similarly, for $(\sigma,k) \in \Sigma_i \times [\![m]\!]$, we get
$$p_{\sigma,k}(X^{(t+1)}_i) = \sum_{T \in \mathcal{T}_{\sigma,k}} C_T \prod_{e \in T} X^{(t+1)}_i[e] \ge \sum_{T \in \mathcal{T}_{\sigma,k}} C_T \prod_{e \in T} (1-\mu^{(t)}) X^{(t)}_i[e] \ge (1-\mu^{(t)})^{\deg_i}\, p_{\sigma,k}(X^{(t)}_i) \ge (1-\mu^{(t)}\deg_i)\, p_{\sigma,k}(X^{(t)}_i), \tag{17}$$
where the last bound follows from Bernoulli's inequality. Analogous reasoning yields that for any $(\sigma,k) \in \Sigma_i \times [\![m]\!]$,
$$q_{\sigma,k}(X^{(t+1)}_i) \le (1 + 1.3\mu^{(t)} \deg_i)\, q_{\sigma,k}(X^{(t)}_i), \tag{18}$$
and
$$q_{\sigma,k}(X^{(t+1)}_i) \ge (1 - \mu^{(t)} \deg_i)\, q_{\sigma,k}(X^{(t)}_i). \tag{19}$$
As a result, for $\sigma \in \Sigma_i$,
$$x^{(t+1)}_i[\sigma] - x^{(t)}_i[\sigma] = \frac{p_{\sigma,k}(X^{(t+1)}_i)}{q_{\sigma,k}(X^{(t+1)}_i)} - \frac{p_{\sigma,k}(X^{(t)}_i)}{q_{\sigma,k}(X^{(t)}_i)} \le \frac{1 + 1.3\mu^{(t)}\deg_i}{1 - \mu^{(t)}\deg_i} \cdot \frac{p_{\sigma,k}(X^{(t)}_i)}{q_{\sigma,k}(X^{(t)}_i)} - \frac{p_{\sigma,k}(X^{(t)}_i)}{q_{\sigma,k}(X^{(t)}_i)} \tag{20}$$
$$= \frac{2.3\mu^{(t)}\deg_i}{1 - \mu^{(t)}\deg_i}\, x^{(t)}_i[\sigma] \le 4\mu^{(t)}\deg_i\, x^{(t)}_i[\sigma], \tag{21}$$
where (20) uses (16) and (19), and (21) follows from the fact that $\mu^{(t)}\deg_i \le \frac{100}{256}$. Similarly, by (17) and (18),
$$x^{(t)}_i[\sigma] - x^{(t+1)}_i[\sigma] = \frac{p_{\sigma,k}(X^{(t)}_i)}{q_{\sigma,k}(X^{(t)}_i)} - \frac{p_{\sigma,k}(X^{(t+1)}_i)}{q_{\sigma,k}(X^{(t+1)}_i)} \le 4\mu^{(t)}\deg_i\, x^{(t)}_i[\sigma].$$
As a result, we conclude that $\left\| x^{(t+1)}_i - x^{(t)}_i \right\|_1 \le 4\mu^{(t)}\deg_i \|Q_i\|_1$.

Lemma 3.5 makes the assumption that each polynomial in (4) has positive coefficients. While this might seem rather restrictive, we next show that there is a procedure that eliminates the negative monomials, as long as the involved variables derive from the sequence-form polytope. As a warm-up, we first establish this property for variables deriving from the simplex. We note that the processes described in the proofs below are not meant to be algorithmically meaningful; instead, they highlight the generality of Lemma 3.5. Indeed, the way one computes the fixed point need not be related to the rational-function formula postulated in (4); for example, computing the stationary distribution of a Markov chain using the Markov chain tree theorem would make little sense, as it would require exponential time.
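Before turning to the positivity assumption, here is a quick numerical sanity check of Lemma 3.5: a random rational function with positive coefficients is perturbed multiplicatively by at most the amount allowed in (5), and the per-coordinate bound from (21) is verified. All names and toy dimensions below are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
D, deg = 6, 3          # number of variables and maximum degree (toy values)

def random_positive_poly(n_monomials=5):
    """A random polynomial with positive coefficients and total degree at most `deg`."""
    exps = np.zeros((n_monomials, D), dtype=int)
    for m in range(n_monomials):
        for _ in range(rng.integers(1, deg + 1)):   # at most `deg` variable factors
            exps[m, rng.integers(D)] += 1
    return exps, rng.random(n_monomials) + 0.1

def evaluate(poly, X):
    exps, coefs = poly
    return float(np.sum(coefs * np.prod(X[None, :] ** exps, axis=1)))

# one rational coordinate x[sigma] = p(X) / q(X), as in (4)
p, q = random_positive_poly(), random_positive_poly()
x = lambda X: evaluate(p, X) / evaluate(q, X)

X_old = rng.random(D) + 0.5
mu = 100 / (256 * deg)                               # the largest deviation allowed by (5)
X_new = X_old * (1 + rng.uniform(-mu, mu, size=D))   # entrywise multiplicative perturbation
# Per-coordinate form of the conclusion of Lemma 3.5 (cf. (21)):
assert abs(x(X_new) - x(X_old)) <= 4 * mu * deg * x(X_old)
print("relative change:", abs(x(X_new) - x(X_old)) / x(X_old), "<= bound", 4 * mu * deg)
```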
Proposition B.5. Let $p$ be a non-constant multivariate polynomial of degree $\deg \in \mathbb{N}$ in the variables $X$ such that $p(0) = 0$. If $X = (x_1, \dots, x_m)$ is such that $x_k \in \Delta^{d_k}$ for all $k \in [\![m]\!]$, then $p$ can be expressed as a combination of monomials with positive coefficients and degree at most $\deg$.

Proof. Let
$$p(X) = \sum_{T \in \mathcal{T}} C_T \prod_{e \in T} X[e],$$
where $\mathcal{T}$ is a finite and nonempty set, and $T \neq \emptyset$ and $C_T \neq 0$ for all $T \in \mathcal{T}$; the validity of such a formulation follows since, by assumption, $p(0) = 0$ and $p$ is non-constant. To establish the claim, we consider the following iterative algorithm. First, if it happens that $C_T > 0$ for all $T \in \mathcal{T}$, the algorithm terminates. Otherwise, we take any monomial of the form $C_T \prod_{e \in T} X[e]$ with $C_T < 0$. Since $T \neq \emptyset$, we may take some $e \in T$. Further, let $X[e] = x_k[r]$, for some $k \in [\![m]\!]$ and $r \in [\![d_k]\!]$, where $x_k \in \Delta^{d_k}$. As such, we have that $x_k[r] = 1 - \sum_{r' \neq r} x_k[r']$. Thus,
$$C_T \prod_{e' \in T} X[e'] = C_T \prod_{e' \in T\setminus\{e\}} X[e'] + \sum_{r' \neq r} (-C_T)\, x_k[r'] \prod_{e' \in T\setminus\{e\}} X[e'].$$
Here, by convention, the product over an empty set is taken to be $1$. This step clearly cannot increase the degree of the polynomial. To analyze this iterative process, we consider as a potential function the sum of the degrees of all the negative monomials—monomials for which $C_T < 0$. It should be evident that every step of the previous algorithm decreases the potential function by one. Further, the previous step can always be applied as long as the potential function is not zero. As a result, given that $\mathcal{T}$ is finite, we conclude that after a finite number of iterations the potential function will be zero. Then, we will have that
$$p(X) = \sum_{T \in \mathcal{T}'} C_T \prod_{e \in T} X[e] + C,$$
where $T \neq \emptyset$ and $C_T > 0$ for all $T \in \mathcal{T}'$. But, given that $p(0) = 0$, we conclude that $C = 0$, and the claim follows.

Proposition B.6. Let $p$ be a non-constant multivariate polynomial of degree $\deg \in \mathbb{N}$ such that $p(0) = 0$. If $X = (q_1, \dots, q_m)$ is such that $q_k \in Q_{d_k}$ for all $k \in [\![m]\!]$, then $p$ can be expressed as a combination of monomials with positive coefficients and degree at most $\deg$.

Proof. As in Proposition B.5, let
$$p(X) = \sum_{T \in \mathcal{T}} C_T \prod_{e \in T} X[e],$$
where $\mathcal{T}$ is a finite and nonempty set, and $T \neq \emptyset$ and $C_T \neq 0$ for all $T \in \mathcal{T}$. We consider the following algorithm. First, if $C_T > 0$ for all $T \in \mathcal{T}$, the algorithm may terminate. In the contrary case, we consider any monomial $C_T \prod_{e \in T} X[e]$ for which $C_T < 0$. Further, take any $e \in T$, which is possible since $T \neq \emptyset$. Now let us assume that $X[e] = q_k[\sigma]$, for some $k \in [\![m]\!]$ and $\sigma = (j,a)$. By the sequence-form polytope constraints, we have
$$q_k[\sigma] = q_k[\sigma_j] - \sum_{a' \in A_j\setminus\{a\}} q_k[(j,a')].$$
Thus,
$$C_T \prod_{e' \in T} X[e'] = C_T\, q_k[\sigma_j] \prod_{e' \in T\setminus\{e\}} X[e'] + \sum_{a' \neq a} (-C_T)\, q_k[(j,a')] \prod_{e' \in T\setminus\{e\}} X[e'].$$
This step clearly does not increase the degree of the polynomial. To construct a potential function, we will say that the depth of a monomial $\prod_{e \in T} X[e]$, for $T \neq \emptyset$, is the sum of the depths of each $X[e]$; more precisely, the depth of $q_k[\sigma]$ is $0$ if $\sigma = \varnothing$, or $1$ plus the depth of $q_k[\sigma_j]$ otherwise. Now we claim that the sum of the depths of the negative monomials is a proper potential function. Indeed, by construction, every step reduces the potential by $1$, while the previous step can always be applied when the potential function is not zero. As a result, given that $\mathcal{T}$ is finite, we conclude that after a finite number of iterations the potential function will be zero, which in turn implies that
$$p(X) = \sum_{T \in \mathcal{T}'} C_T \prod_{e \in T} X[e] + C,$$
where $T \neq \emptyset$ and $C_T > 0$ for all $T \in \mathcal{T}'$. But, since $p(0) = 0$, it follows that $C = 0$, concluding the proof.
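The elimination procedure in the proof of Proposition B.5 can also be carried out symbolically. The sketch below applies it to a toy polynomial over a two-dimensional simplex; the helper name and the example polynomial are ours, and the sketch only handles negative monomials that contain at least one variable (the leftover constant is zero by the proposition).

```python
import sympy as sp

# Two variables lying on a probability simplex: x1 + x2 = 1, x1, x2 >= 0.
x1, x2 = sp.symbols("x1 x2", nonnegative=True)
simplex_identity = {x1: 1 - x2, x2: 1 - x1}   # x_k[r] = 1 - sum of the remaining entries

p = sp.expand(2 * x1 + x2 - x1 * x2)   # p(0) = 0, degree 2, one negative monomial: -x1*x2

def eliminate_negative_monomials(poly):
    """Iteratively rewrite one variable occurrence of a negative monomial via the simplex
    identity, as in the proof of Proposition B.5, until all coefficients are positive."""
    poly = sp.Poly(poly, x1, x2)
    while True:
        # negative non-constant monomials; by the proposition the constant term ends up 0
        negatives = [(mono, c) for mono, c in poly.terms() if c < 0 and sum(mono) > 0]
        if not negatives:
            return poly.as_expr()
        (e1, e2), c = negatives[0]
        term = c * x1**e1 * x2**e2
        var = x2 if e2 > 0 else x1                   # pick a variable occurring in the monomial
        replaced = term / var * simplex_identity[var]  # substitute one factor of that variable
        poly = sp.Poly(sp.expand(poly.as_expr() - term + replaced), x1, x2)

print(eliminate_negative_monomials(p))   # a positive-coefficient representation: x1**2 + x1 + x2
```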
Now we show that the fixed points associated with EFCCE and EFCE can be analyzed through the lens of Lemma 3.5, establishing Propositions 3.6 and 3.7.

Proposition 3.6. Let $\phi^{(t)}_i \in \Psi_i$ be a transformation defined by $X^{(t)}_i = (\lambda^{(t)}_i, (q^{(t)}_j)_{j \in J_i}) \in \mathbb{R}^D_{>0}$, for some $D \in \mathbb{N}$ and time $t \in \mathbb{N}$. The unique fixed point $x^{(t)}_i$ of $\phi^{(t)}_i$ satisfies (4) with $\deg_i \le 2D_i$.

Proof. Consider any coarse trigger deviation function $\phi^{(t)}_i = \sum_{j \in J_i} \lambda^{(t)}_i[j]\, \phi^{j}_{q^{(t)}_j}$, where $q^{(t)}_j \in Q_j$ (Claim A.4). Given that we are updating $R_\lambda$ using LRL-OFTRL, it follows that $\lambda^{(t)}_i[j] > 0$ for any $j \in J_i$. As a result, by (Anagnostides et al., 2022b, Theorem 5.1), the (unique) fixed point $x^{(t)}_i \in Q_i$ can be computed in a top-down fashion as
$$x^{(t)}_i[\sigma] = \frac{\sum_{j' \preceq j} \lambda^{(t)}_i[j']\, q^{(t)}_{j'}[\sigma]\, x^{(t)}_i[\sigma_{j'}]}{\sum_{j' \preceq j} \lambda^{(t)}_i[j']}, \tag{22}$$
for any sequence $\sigma = (j,a) \in \Sigma_i^*$. We will prove the claim by induction. For the basis of the induction, we note that the empty sequence is trivially given by a $0$-degree rational function with positive coefficients; namely, $x^{(t)}_i[\varnothing] = \frac{1}{1}$. Now, for the inductive step, let us take any sequence $\sigma = (j,a) \in \Sigma_i^*$. We suppose that for any sequence $\sigma_{j'}$, for $j' \preceq j$, it holds that
$$x^{(t)}_i[\sigma_{j'}] = \frac{p_{\sigma_{j'},k}(X^{(t)}_i)}{q_{\sigma_{j'},k}(X^{(t)}_i)},$$
where $\{p_{\sigma_{j'},k}\}, \{q_{\sigma_{j'},k}\}$ are multivariate polynomials with positive coefficients and maximum degree at most $h \in \mathbb{N} \cup \{0\}$. Then, each term
$$\frac{\lambda^{(t)}_i[j']\, q^{(t)}_{j'}[\sigma]}{\sum_{j'' \preceq j} \lambda^{(t)}_i[j'']} \cdot \frac{p_{\sigma_{j'},k}(X^{(t)}_i)}{q_{\sigma_{j'},k}(X^{(t)}_i)}$$
is a rational function in $X^{(t)}_i$ with positive coefficients and maximum degree at most $h+2$. Hence, the term below is a sum of rational functions with positive coefficients and maximum degree at most $h+2$:
$$\frac{\sum_{j' \preceq j} \lambda^{(t)}_i[j']\, q^{(t)}_{j'}[\sigma]\, x^{(t)}_i[\sigma_{j'}]}{\sum_{j' \preceq j} \lambda^{(t)}_i[j']}.$$
As a result, by (22) we conclude that
$$x^{(t)}_i[\sigma] = \frac{p_{\sigma,k}(X^{(t)}_i)}{q_{\sigma,k}(X^{(t)}_i)},$$
where $\{p_{\sigma,k}\}, \{q_{\sigma,k}\}$ have positive coefficients and maximum degree at most $h+2$. This establishes the inductive step, concluding the proof.
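As a sanity check of the closed form (22), the following sketch computes the fixed point of a coarse trigger deviation for a small hypothetical tree of two nested player-$i$ information sets (A with parent sequence $\varnothing$ and actions 1, 2; B with parent sequence 1 and actions 3, 4 — note this is not the game of Figure 2), and verifies that it is indeed fixed by the convex combination of the matrices of Definition A.3. All structure and names below are ours.

```python
import numpy as np

# Hypothetical nested structure (NOT the game of Figure 2): sequences are
# [empty, 1, 2, 3, 4]; info set A = {1, 2} hangs off the empty sequence, B = {3, 4} off sequence 1.
PARENT = {"A": 0, "B": 1}                      # index of the parent sequence sigma_j
SEQS_BELOW = {"A": [1, 2, 3, 4], "B": [3, 4]}  # sequences at or below each info set
PRECEDING = {1: ["A"], 2: ["A"], 3: ["A", "B"], 4: ["A", "B"]}  # info sets j' with j' preceding j

def coarse_matrix(j, q):
    """M^j_q of Definition A.3 (q is indexed over the sequences at or below j)."""
    M = np.zeros((5, 5))
    for c in range(5):
        if c not in SEQS_BELOW[j]:
            M[c, c] = 1.0
    for r in SEQS_BELOW[j]:
        M[r, PARENT[j]] = q[r]
    return M

# A transformation phi = lam_A * M^A_{q_A} + lam_B * M^B_{q_B} (Claim A.4).
lam = {"A": 0.6, "B": 0.4}
q = {"A": {1: 0.3, 2: 0.7, 3: 0.1, 4: 0.2},    # q_A in Q_A: q[1]+q[2]=1, q[3]+q[4]=q[1]
     "B": {3: 0.5, 4: 0.5}}                    # q_B in Q_B: q[3]+q[4]=1
Phi = sum(lam[j] * coarse_matrix(j, q[j]) for j in ("A", "B"))

# Fixed point via the top-down formula (22).
x = np.zeros(5)
x[0] = 1.0
for sigma, j in [(1, "A"), (2, "A"), (3, "B"), (4, "B")]:
    num = sum(lam[jp] * q[jp].get(sigma, 0.0) * x[PARENT[jp]] for jp in PRECEDING[sigma])
    den = sum(lam[jp] for jp in PRECEDING[sigma])
    x[sigma] = num / den

assert np.allclose(Phi @ x, x)                                       # x is fixed by phi
assert np.isclose(x[1] + x[2], 1) and np.isclose(x[3] + x[4], x[1])  # sequence-form constraints
print(x)
```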
Next, for the proof of Proposition 3.7, we will need the following key refinement of the Markov chain tree theorem (Anagnostides et al., 2022b, Corollary A.8).

Theorem B.7 (Anagnostides et al., 2022b). Let $M$ be the transition matrix of a $d$-state Markov chain such that $M = v\mathbf{1}_d^\top + C$, where $C \in \mathbb{R}^{d\times d}_{>0}$ and $v \in \mathbb{R}^d_{>0}$ has entries summing to $\lambda > 0$. Further, let $v = r/l$, for some $l > 0$. If $x \in \Delta^d$ is the (unique) stationary distribution of $M$, then for each coordinate $r' \in [\![d]\!]$ there exist a nonempty and finite set $F_{r'}$, with $F = \cup_{r'=1}^d F_{r'}$, and parameters $b_k \in \{0,1\}$, $0 \le p_k \le d-2$, $|S_k| = d - p_k - b_k - 1$, for each $k \in F_{r'}$, such that the $r'$-th coordinate of $w := l\,x$ can be expressed as
$$w[r'] = \frac{\sum_{k \in F_{r'}} \lambda^{p_k+1}\, (r[q_k])^{b_k}\, l^{1-b_k} \prod_{(a,b)\in S_k} C[a,b]}{\sum_{k \in F} \lambda^{p_k+b_k}\, C_k \prod_{(a,b)\in S_k} C[a,b]},$$
for each $r' \in [\![d]\!]$, where $C_k = C_k(d) > 0$.

Let us also introduce the following terminology, borrowed from the work of Farina et al. (2022b).

Definition B.8 (Farina et al., 2022b). Let $J \subseteq J_i$ be a subset of player $i$'s information sets. We say that $J$ is a trunk of $J_i$ if, for all $j \in J$, all predecessors of $j$ are also in $J$.

Definition B.9 (Farina et al., 2022b). Let $\phi_i \in \Psi_i$ and let $J$ be a trunk of $J_i$. We say that a vector $x_i \in \mathbb{R}^{|\Sigma_i|}_{\ge 0}$ is a $J$-partial fixed point if it satisfies all the sequence-form constraints at all information sets $j \in J_i$, and
$$\phi_i(x_i)[\varnothing] = x_i[\varnothing] = 1, \qquad \phi_i(x_i)[(j,a)] = x_i[(j,a)], \quad \forall j \in J,\ a \in A_j.$$

Proposition 3.7. Let $\phi^{(t)}_i \in \Psi_i$ be a transformation defined by $X^{(t)}_i = (\lambda^{(t)}_i, (q^{(t)}_{\hat\sigma})_{\hat\sigma \in \Sigma_i^*}) \in \mathbb{R}^D_{>0}$, for some $D \in \mathbb{N}$ and time $t \in \mathbb{N}$. The (unique) fixed point $x^{(t)}_i$ of $\phi^{(t)}_i$ satisfies (4) with $\deg_i \le 2D_i|A_i|$, where $|A_i| := \max_{j \in J_i} |A_j|$.

Proof. For the basis of the induction, the claim trivially holds since $x^{(t)}_i[\varnothing] = \frac{1}{1}$. For the inductive step, consider a trunk $J$ together with a $J$-partial fixed point, and let $j' \in J_i \setminus J$ be an information set all of whose predecessors lie in $J$. Let us first define a vector $r^{(t)} \in \mathbb{R}^{|A_{j'}|}_{\ge 0}$, so that $r^{(t)}[a]$ is equal to
$$\sum_{j'' \prec j'} \sum_{a' \in A_{j''}} \lambda^{(t)}_i[(j'',a')]\, q^{(t)}_{(j'',a')}[(j',a)]\, x^{(t)}_i[(j'',a')].$$
Further, we let $W^{(t)} \in \mathbb{S}^{|A_{j'}|}$ be a stochastic matrix, so that for any $a_r, a_c \in A_{j'}$, $W^{(t)}[a_r,a_c]$ is equal to
$$\frac{r^{(t)}[a_r]}{x^{(t)}_i[\sigma_{j'}]} + \lambda^{(t)}_i[(j',a_c)]\, q^{(t)}_{(j',a_c)}[(j',a_r)] + \left(\sum_{\hat\sigma \not\preceq (j',a_c)} \lambda^{(t)}_i[\hat\sigma]\right)\mathbb{1}\{a_r = a_c\}.$$
By (Farina et al., 2022b, Proposition 4.14), if $b^{(t)} \in \Delta(A_{j'})$ is the (unique) stationary distribution of $W^{(t)}$, extending $x^{(t)}_i$ by $x^{(t)}_i[\sigma_{j'}]\, b^{(t)}$ at information set $j'$ yields a $(J \cup \{j'\})$-partial fixed point (Definition B.9). To bound the increase in the degree of the rational function, we will use Theorem B.7. In particular, we define a matrix $C^{(t)} \in \mathbb{R}^{|A_{j'}| \times |A_{j'}|}$, so that for any $a_r, a_c \in A_{j'}$,
$$C^{(t)}[a_r,a_c] := \lambda^{(t)}_i[(j',a_c)]\, q^{(t)}_{(j',a_c)}[(j',a_r)] + \left(\sum_{\hat\sigma \not\preceq (j',a_c)} \lambda^{(t)}_i[\hat\sigma]\right)\mathbb{1}\{a_r = a_c\}. \tag{23}$$
For a fixed $a_c \in A_{j'}$, we have
$$\sum_{a_r \in A_{j'}} C^{(t)}[a_r,a_c] = \lambda^{(t)}_i[(j',a_c)] + \sum_{\hat\sigma \not\preceq (j',a_c)} \lambda^{(t)}_i[\hat\sigma], \tag{24}$$
where we used the fact that, for any $a_c \in A_{j'}$, $\sum_{a_r \in A_{j'}} q^{(t)}_{(j',a_c)}[(j',a_r)] = 1$ since $q^{(t)}_{(j',a_c)} \in Q_{j'}$. Thus, from (24) we obtain that
$$1 - \sum_{a_r \in A_{j'}} C^{(t)}[a_r,a_c] = \sum_{\hat\sigma \prec (j',a_c)} \lambda^{(t)}_i[\hat\sigma]. \tag{25}$$
Now, for the inductive step, suppose that for any information set $j'' \preceq \sigma_{j'}$ and $a' \in A_{j''}$, the partial fixed point $x^{(t)}_i[\sigma']$, with $\sigma' = (j'',a')$, can be expressed as
$$x^{(t)}_i[\sigma'] = \frac{p_{\sigma',k}(X^{(t)}_i)}{q_{\sigma',k}(X^{(t)}_i)}, \tag{26}$$
where $\{p_{\sigma',k}\}, \{q_{\sigma',k}\}$ are multivariate polynomials with positive coefficients and maximum degree $h$. By (23), (25), the inductive hypothesis (26), and Theorem B.7, we conclude that for any $a \in A_{j'}$,
$$x^{(t)}_i[(j',a)] = \frac{p_{a,k}(X^{(t)}_i)}{q_{a,k}(X^{(t)}_i)},$$
where $\{p_{a,k}\}, \{q_{a,k}\}$ are multivariate polynomials with positive coefficients and maximum degree $h + 2|A_{j'}| \le h + 2|A_i|$. This concludes the inductive step, and the proof.

Before we proceed, we note that while Lemmas 3.1 and 3.2 and Lemmas B.2 and B.4 were stated for the construction relating to trigger deviations, those results readily apply to the construction relating to coarse trigger deviations as well; we omit the formal statements as they are almost identical. In this context, combining Propositions 3.6 and 3.7 with Lemma 3.5, we arrive at the following conclusions.

Lemma B.10. For any parameters $\eta \le \frac{1}{512\|Q_i\|_1 D_i}$, $\eta' \le \frac{1}{1024|\Sigma_i| D_i}$, and time $t \in [\![T-1]\!]$,
$$\left\| x^{(t+1)}_i - x^{(t)}_i \right\|_1 \le 8\|Q_i\|_1 D_i\, M(X^{(t)}_i),$$
where $M(X^{(t)}_i)$ is defined as
$$M(X^{(t)}_i) := \max\left\{ \max_{j \in J_i}\left|1 - \frac{\lambda^{(t+1)}_i[j]}{\lambda^{(t)}_i[j]}\right|,\ \max_{j \in J_i}\max_{\sigma \in \Sigma_j}\left|1 - \frac{q^{(t+1)}_j[\sigma]}{q^{(t)}_j[\sigma]}\right| \right\}.$$

Proof. By Proposition 3.6, it follows that the fixed point $x^{(t)}_i$ can be expressed, for any $\sigma \in \Sigma_i$, as
$$x^{(t)}_i[\sigma] = \frac{p_{\sigma,k}\big(\lambda^{(t)}_i, (q^{(t)}_j)_{j \in J_i}\big)}{q_{\sigma,k}\big(\lambda^{(t)}_i, (q^{(t)}_j)_{j \in J_i}\big)},$$
such that $\{p_{\sigma,k}\}, \{q_{\sigma,k}\}$ are multivariate polynomials in $X^{(t)}_i = (\lambda^{(t)}_i, (q^{(t)}_j)_{j \in J_i})$ with positive coefficients and maximum degree $\deg_i := 2D_i$. As a result, similarly to Lemmas B.2 and B.4, it follows that
$$\left|1 - \frac{q^{(t+1)}_j[\sigma]}{q^{(t)}_j[\sigma]}\right| \le 100\eta\|Q_i\|_1 \le \frac{100}{256\deg_i},$$
for any $j \in J_i$ and $\sigma \in \Sigma_j$, and
$$\left|1 - \frac{\lambda^{(t+1)}_i[j]}{\lambda^{(t)}_i[j]}\right| \le 200\eta'|\Sigma_i| \le \frac{100}{256\deg_i}.$$
As a result, the claim follows from Lemma 3.5.

Lemma 3.8. Consider any parameters $\eta \le \frac{1}{256\|Q_i\|_1 \deg_i}$ and $\eta' \le \frac{1}{512|\Sigma_i| \deg_i}$, where $\deg_i := 2|A_i|D_i$. Then, for any time $t \in [\![T-1]\!]$,
$$\left\| x^{(t+1)}_i - x^{(t)}_i \right\|_1 \le 8\|Q_i\|_1 |A_i| D_i\, M(X^{(t)}_i),$$
where $M(X^{(t)}_i)$ is defined as
$$M(X^{(t)}_i) := \max\left\{ \max_{\hat\sigma \in \Sigma_i^*}\left|1 - \frac{\lambda^{(t+1)}_i[\hat\sigma]}{\lambda^{(t)}_i[\hat\sigma]}\right|,\ \max_{\hat\sigma \in \Sigma_i^*}\max_{\sigma \in \Sigma_j}\left|1 - \frac{q^{(t+1)}_{\hat\sigma}[\sigma]}{q^{(t)}_{\hat\sigma}[\sigma]}\right| \right\}.$$

Proof. By Proposition 3.7, it follows that the fixed point $x^{(t)}_i$ can be expressed, for any $\sigma \in \Sigma_i$, as
$$x^{(t)}_i[\sigma] = \frac{p_{\sigma,k}\big(\lambda^{(t)}_i, (q^{(t)}_{\hat\sigma})_{\hat\sigma \in \Sigma_i^*}\big)}{q_{\sigma,k}\big(\lambda^{(t)}_i, (q^{(t)}_{\hat\sigma})_{\hat\sigma \in \Sigma_i^*}\big)},$$
such that $\{p_{\sigma,k}\}, \{q_{\sigma,k}\}$ are multivariate polynomials in $X^{(t)}_i = (\lambda^{(t)}_i, (q^{(t)}_{\hat\sigma})_{\hat\sigma \in \Sigma_i^*})$ with positive coefficients and maximum degree $\deg_i := 2|A_i|D_i$.
As a result, in light of Lemmas B.2 and B.4,
$$\left|1 - \frac{q^{(t+1)}_{\hat\sigma}[\sigma]}{q^{(t)}_{\hat\sigma}[\sigma]}\right| \le 100\eta\|Q_i\|_1 \le \frac{100}{256\deg_i},$$
for any $\hat\sigma = (j,a) \in \Sigma_i^*$ and $\sigma \in \Sigma_j$, and
$$\left|1 - \frac{\lambda^{(t+1)}_i[\hat\sigma]}{\lambda^{(t)}_i[\hat\sigma]}\right| \le 200\eta'|\Sigma_i| \le \frac{100}{256\deg_i}.$$
As a result, the claim follows from Lemma 3.5.

B.3. Completing the Proof

Finally, here we combine all of the previous ingredients to complete the proof of Theorem 3.10.

Proposition B.11. Let $\eta \le \frac{1}{256|\Sigma_i|^{1.5}}$ and $\eta' \le \frac{1}{512|\Sigma_i|^{2.5}}$. Then, for any $T \ge 2$, $\max\{0, \mathrm{Reg}^T_{\Psi_i}\}$ can be upper bounded by
$$\frac{2|\Sigma_i|^2 \log T}{\eta} + \frac{2|\Sigma_i| \log T}{\eta'} + \left(32\eta|\Sigma_i|\|Q_i\|_1^2 + 256\eta'|\Sigma_i|^4\right)\sum_{t=1}^{T-1}\left\|u^{(t+1)}_i - u^{(t)}_i\right\|_\infty^2$$
$$+ \left(32\eta|\Sigma_i|\|Q_i\|_1^2 + 256\eta'|\Sigma_i|^4\right)\sum_{t=1}^{T-1}\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_1^2 - \frac{1}{512\eta'}\sum_{t=1}^{T-1}\left\|\lambda^{(t+1)}_i - \lambda^{(t)}_i\right\|^2_{\lambda^{(t)}_i,\infty} - \frac{1}{1024\eta}\sum_{\hat\sigma \in \Sigma_i^*}\sum_{t=1}^{T-1}\left\|q^{(t+1)}_{\hat\sigma} - q^{(t)}_{\hat\sigma}\right\|^2_{q^{(t)}_{\hat\sigma},\infty}.$$

Proof. Fix any $t \in [\![T-1]\!]$. By the definition of $U^{(t)}_i$ (Algorithm 1),
$$\left\|U^{(t+1)}_i - U^{(t)}_i\right\|_\infty^2 = \left\|u^{(t+1)}_i (x^{(t+1)}_i)^\top - u^{(t)}_i (x^{(t)}_i)^\top\right\|_\infty^2 \le 2\left\|u^{(t+1)}_i (x^{(t+1)}_i - x^{(t)}_i)^\top\right\|_\infty^2 + 2\left\|(u^{(t+1)}_i - u^{(t)}_i)(x^{(t)}_i)^\top\right\|_\infty^2 \tag{27}$$
$$= 2\left\|u^{(t+1)}_i\right\|_\infty^2\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_\infty^2 + 2\left\|x^{(t)}_i\right\|_\infty^2\left\|u^{(t+1)}_i - u^{(t)}_i\right\|_\infty^2 \tag{28}$$
$$\le 2\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_\infty^2 + 2\left\|u^{(t+1)}_i - u^{(t)}_i\right\|_\infty^2, \tag{29}$$
where (27) follows from the triangle inequality for the norm together with Young's inequality; (28) uses the fact that $\|u x^\top\|_\infty = \|u\|_\infty \|x\|_\infty$, for any vectors $x, u$; and (29) follows from the assumption that $\|u^{(t)}_i\|_\infty, \|x^{(t)}_i\|_\infty \le 1$. Similarly, for any $t \in [\![T-1]\!]$,
$$\left\|u^{(t+1)}_\lambda - u^{(t)}_\lambda\right\|_\infty^2 = \left|\left\langle X^{(t+1)}_{\hat\sigma}, U^{(t+1)}_i\right\rangle - \left\langle X^{(t)}_{\hat\sigma}, U^{(t)}_i\right\rangle\right|^2 \tag{30}$$
$$\le 2\left|\left\langle X^{(t+1)}_{\hat\sigma}, U^{(t+1)}_i - U^{(t)}_i\right\rangle\right|^2 + 2\left|\left\langle U^{(t)}_i, X^{(t+1)}_{\hat\sigma} - X^{(t)}_{\hat\sigma}\right\rangle\right|^2 \tag{31}$$
$$\le 8|\Sigma_i|^2\left\|U^{(t+1)}_i - U^{(t)}_i\right\|_\infty^2 + 2\left\|q^{(t+1)}_{\hat\sigma} - q^{(t)}_{\hat\sigma}\right\|_1^2, \tag{32}$$
for some $\hat\sigma \in \Sigma_i^*$, where (30) follows from the definition of $u^{(t)}_\lambda$; (31) uses Young's inequality; and (32) uses the Cauchy–Schwarz inequality, along with the fact that $\|U^{(t)}_i\|_\infty \le 1$ and $\|X^{(t)}_{\hat\sigma}\|_1 \le 2|\Sigma_i|$. Further, for any $\hat\sigma \in \Sigma_i^*$, $\eta \le \frac{1}{256|\Sigma_i|^{1.5}}$ and $\eta' \le \frac{1}{512|\Sigma_i|^{2.5}}$,
$$-\frac{1}{1024\eta}\left\|q^{(t+1)}_{\hat\sigma} - q^{(t)}_{\hat\sigma}\right\|^2_{q^{(t)}_{\hat\sigma},\infty} + 32\eta'|\Sigma_i|^2\left\|q^{(t+1)}_{\hat\sigma} - q^{(t)}_{\hat\sigma}\right\|_1^2 \le \left(-\frac{1}{1024\eta} + 32\eta'|\Sigma_i|^4\right)\left\|q^{(t+1)}_{\hat\sigma} - q^{(t)}_{\hat\sigma}\right\|^2_{q^{(t)}_{\hat\sigma},\infty} \le 0, \tag{33}$$
where the last inequality uses the assumed upper bounds on $\eta$ and $\eta'$. As a result, the proof follows from Lemmas 3.1 and 3.2, Proposition 3.3, (29), (32), and (33).

We are now ready to establish Corollary 3.9, the statement of which is recalled below.

Corollary 3.9. Suppose that $\eta \le \frac{1}{2^{12}|\Sigma_i|^{1.5}\|Q_i\|_1 \deg_i}$ and $\eta' = \frac{1}{2|\Sigma_i|}\eta$, where $\deg_i := 2|A_i|D_i$. For any $T \ge 2$, $\max\{0, \mathrm{Reg}^T_{\Psi_i}\}$ can be upper bounded by
$$\frac{8|\Sigma_i|^2 \log T}{\eta} + 256\eta|\Sigma_i|^3 \sum_{t=1}^{T-1}\left\|u^{(t+1)}_i - u^{(t)}_i\right\|_\infty^2 - \frac{1}{2^{15}\eta\deg_i^2\|Q_i\|_1^2}\sum_{t=1}^{T-1}\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_1^2.$$

Proof. By Lemma 3.8,
$$\frac{1}{512\eta'}\sum_{t=1}^{T-1}\left\|\lambda^{(t+1)}_i - \lambda^{(t)}_i\right\|^2_{\lambda^{(t)}_i,\infty} + \frac{1}{1024\eta}\sum_{\hat\sigma \in \Sigma_i^*}\sum_{t=1}^{T-1}\left\|q^{(t+1)}_{\hat\sigma} - q^{(t)}_{\hat\sigma}\right\|^2_{q^{(t)}_{\hat\sigma},\infty} \ge \frac{1}{2^{14}\eta\|Q_i\|_1^2\deg_i^2}\sum_{t=1}^{T-1}\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_1^2.$$
Thus, the proof follows directly from Proposition B.11 since, for any $t \in [\![T-1]\!]$,
$$\left(256\eta|\Sigma_i|^3 - \frac{1}{2^{15}\eta\|Q_i\|_1^2\deg_i^2}\right)\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_1^2 \le 0,$$
for any $\eta \le \frac{1}{2^{12}|\Sigma_i|^{1.5}\|Q_i\|_1\deg_i}$.

So far we have performed the analysis from the perspective of a fixed player $i \in [\![n]\!]$, while remaining oblivious to the mechanism that produces the sequence of utilities $(u^{(t)}_i)_{1 \le t \le T}$. Having established the RVU bound of Corollary 3.9, we are now ready to show that, when all players employ our learning dynamics, the second-order path length is bounded by $O(\log T)$. (In what follows, we tacitly assume that each player uses $\eta' := \frac{1}{2|\Sigma_i|}\eta$, in accordance with Corollary 3.9.)

Theorem B.12. Suppose that each player $i \in [\![n]\!]$ uses learning rate $\eta \le \frac{1}{2^{12}(n-1)|\Sigma|^{1.5}\|Q\|_1|Z|\deg}$, where $\deg := 2|A|D$. Then, for any $T \ge 2$,
$$\sum_{t=1}^{T-1}\sum_{i=1}^{n}\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_1^2 \le 2^{19} n|\Sigma|^2\|Q\|_1^2\deg^2\log T.$$

Proof. For any time $t \in [\![T-1]\!]$ and player $i \in [\![n]\!]$,
$$\left\|u^{(t+1)}_i - u^{(t)}_i\right\|_\infty^2 \le (n-1)|Z|^2\sum_{i' \neq i}\left\|x^{(t+1)}_{i'} - x^{(t)}_{i'}\right\|_1^2,$$
by (Anagnostides et al., 2022b, Claim 4.16).
Further, for $\eta \le \frac{1}{2^{12}(n-1)|\Sigma|^{1.5}\|Q\|_1|Z|\deg}$,
$$256\eta(n-1)^2|\Sigma|^3|Z|^2 \le \frac{1}{2^{16}\eta\|Q\|_1^2\deg^2}.$$
As a result, using Corollary 3.9, $\sum_{i=1}^{n}\max\{0, \mathrm{Reg}^T_{\Psi_i}\}$ can be upper bounded by
$$\frac{8n|\Sigma|^2\log T}{\eta} - \frac{1}{2^{16}\eta\deg^2\|Q\|_1^2}\sum_{t=1}^{T-1}\sum_{i=1}^{n}\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_1^2.$$
But, given that $\sum_{i=1}^{n}\max\{0, \mathrm{Reg}^T_{\Psi_i}\} \ge 0$, we conclude that
$$\sum_{t=1}^{T-1}\sum_{i=1}^{n}\left\|x^{(t+1)}_i - x^{(t)}_i\right\|_1^2 \le 2^{19} n|\Sigma|^2\deg^2\|Q\|_1^2\log T.$$

We now arrive at Theorem 3.10, which is restated below with the precise parameterization.

Corollary B.13. Suppose that all players employ Algorithm 1, instantiated with LRL-OFTRL for all local regret minimizers $R_\lambda$ and $\{R_{\hat\sigma}\}_{\hat\sigma \in \Sigma_i^*}$, with $\eta = \frac{1}{2^{13}(n-1)|\Sigma|^{1.5}\|Q\|_1|Z||A|D}$ and $\eta' = \frac{1}{2|\Sigma_i|}\eta$. Then, the trigger regret of each player $i \in [\![n]\!]$ after $T$ repetitions is bounded as
$$\mathrm{Reg}^T_{\Psi_i} \le 2^{17} n|\Sigma|^{3.5}\|Q\|_1|Z||A|D\log T.$$

Proof. This follows directly from Corollary 3.9 and Theorem B.12.

Corollary B.14. Suppose that all players employ Algorithm 1, instantiated with LRL-OFTRL for all local regret minimizers $R_\lambda$ and $\{R_j\}_{j \in J_i}$, with $\eta = \frac{1}{2^{13}(n-1)|\Sigma|^{1.5}\|Q\|_1|Z|D}$ and $\eta' = \frac{1}{2|\Sigma_i|}\eta$. Then, the coarse trigger regret of each player $i \in [\![n]\!]$ after $T$ repetitions is bounded as
$$\mathrm{Reg}^T_{\Psi_i} \le 2^{17} n|\Sigma|^{3.5}\|Q\|_1|Z|D\log T. \tag{34}$$

Proof. The proof is analogous to that of Corollary B.13.

We remark that for coarse trigger regret our bound (34) is loose, as the analysis has not been optimized to handle coarse trigger deviation functions; instead, Corollary B.14 follows the construction for trigger deviations, with the exception of using Lemma B.10 in order to obtain a slightly improved RVU bound. Further refining Corollary B.14 was not within our scope.

C. Description of the Game Instances

In this section, to keep our paper self-contained, we describe the games used in our experiments (Section 4), as well as the precise parameterization of each instance.

Kuhn poker. Kuhn poker is a simple poker variant introduced by Kuhn (1953). For simplicity, below we describe the 2-player version of Kuhn poker; the 3-player version we consider in our experiments is analogous. In Kuhn poker, each player initially submits an ante worth 1 into the pot. Then, each player is privately dealt one card from a deck of $r$ unique cards, or ranks; in our experiments we used $r = 3$. Next, a single round of betting occurs:
* First, player 1 decides to either check or bet.
* If player 1 checked, player 2 can either check or raise. If player 2 also checked, a showdown occurs, meaning that the player with the highest card wins the pot, thereby terminating the game. On the other hand, if player 2 raised, player 1 can either fold or call; in the former case player 2 wins the pot, while in the latter a showdown follows.
* If player 1 bet, player 2 can either fold or call. If player 2 folded, then player 1 wins the pot, while if player 2 called, a showdown occurs.

Sheriff. Sheriff (Farina et al., 2019c) is a 2-player bargaining game inspired by the board game Sheriff of Nottingham. Initially, player 1 (the "Smuggler") secretly loads his cargo with $m \in \{0, 1, \dots, m_{\max}\}$ illegal items. The game then proceeds for $r$ bargaining rounds. In each round, the Smuggler first decides on a bribe amount $b \in \{0, 1, \dots, b_{\max}\}$. This amount becomes known to player 2 (the "Sheriff"), although the Smuggler does not transfer that amount unless it is the final round. The Sheriff then decides whether to accept the bribe. If the Sheriff accepts a bribe of value $b$, the Smuggler gets a payoff of $v \cdot m - b$, while the Sheriff receives a payoff of $b$.
In the contrary case, the Sheriff decides whether to inspect the cargo.
* If the Sheriff does not inspect the cargo, the Smuggler receives a payoff of $v \cdot m$, while the Sheriff gets 0 utility.
* Otherwise, if the Sheriff inspects and detects illegal items, the Smuggler must pay the Sheriff an amount of $p \cdot m$, while if no illegal items were loaded, the Sheriff has to compensate the Smuggler with a utility of $s$.
In our experiments, we use the baseline version of Sheriff, wherein $v = 5$, $p = 1$, $s = 1$, $m_{\max} = 5$, $b_{\max} = 2$, and $r = 2$.

Goofspiel. Goofspiel is a 2-player card game introduced by Ross (1971). The game is based on three identical decks of $r$ cards each, with values ranging from 1 to $r$; we use $r = 3$ in our experiments. Initially, each player is dealt a full deck, while the third deck (the prize deck) is placed face down on the board after being shuffled. In each round, the topmost card of the prize deck is revealed. Then, each player privately selects a card from their hand, with the goal of winning the card revealed from the prize deck. The players' selected cards are revealed simultaneously, and the card with the highest value prevails; in case of a tie, the prize card is discarded. This tie-breaking mechanism makes the game general-sum. Finally, the score of each player is the sum of the values of the prize cards that player has won.