# regret_circuits_composability_of_regret_minimizers__76df80d4.pdf

Regret Circuits: Composability of Regret Minimizers

Gabriele Farina 1 Christian Kroer 2 Tuomas Sandholm 1 3 4 5

Regret minimization is a powerful tool for solving large-scale problems; it was recently used in breakthrough results for large-scale extensiveform game solving. This was achieved by composing simplex regret minimizers into an overall regret-minimization framework for extensiveform game strategy spaces. In this paper we study the general composability of regret minimizers. We derive a calculus for constructing regret minimizers for composite convex sets that are obtained from convexity-preserving operations on simpler convex sets. We show that local regret minimizers for the simpler sets can be combined with additional regret minimizers into an aggregate regret minimizer for the composite set. As one application, we show that the CFR framework can be constructed easily from our framework. We also show ways to include curtailing (constraining) operations into our framework. For one, they enable the construction of CFR generalization for extensive-form games with general convex strategy constraints that can cut across decision points.

1. Introduction

Counterfactual regret minimization (CFR) (Zinkevich et al., 2007), and its newer variants (Lanctot et al., 2009; Brown & Sandholm, 2015a; Tammelin et al., 2015; Brown et al., 2017; Brown & Sandholm, 2017a; 2019), have been a central component in several recent milestones in solving imperfectinformation extensive-form games (EFGs). Bowling et al. (2015) used CFR+ to near-optimally solve heads-up limit Texas hold em. Brown & Sandholm (2017c) used CFR

1Computer Science Department, Carnegie Mellon University, Pittsburgh PA 15213 2IEOR Department, Columbia University, New York NY 10027 3Strategic Machine, Inc. 4Strategy Robot, Inc. 5Optimized Markets, Inc.. Correspondence to: Gabriele Farina <gfarina@cs.cmu.edu>, Christian Kroer <christian.kroer@columbia.edu>, Tuomas Sandholm <sandholm@cs.cmu.edu>.

Proceedings of the 36 th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

variants, along with other scalability techniques such as real-time endgame solving (Ganzfried & Sandholm, 2015; Burch et al., 2014; Moravcik et al., 2016; Brown & Sandholm, 2017b) and automated action abstraction (Brown & Sandholm, 2014), to create Libratus, an AI that beat top human specialist professionals at the larger game of heads-up no-limit Texas hold em. Moravˇc ık et al. (2017) also used CFR variants and endgame solving to beat professional human players at that game.

CFR and its newer variants are usually presented as algorithms for ﬁnding an approximate Nash equilibrium in zero-sum EFGs. However, an alternative view is that it is a framework for constructing regret minimizers for the types of action spaces encountered in EFGs, as well as singleagent sequential decision making problems with similarlystructured actions spaces. Viewed from a convex optimization perspective, the class of convex sets to which they apply are sometimes referred to as treeplexes (Hoda et al., 2010; Kroer et al., 2015; 2018). In this view, those algorithms specify how a set of regret minimization algorithms for simplexes and linear loss functions can be composed to form a regret minimizer for a treeplex. Farina et al. (2019) take this view further, describing how regret-minimization algorithms can be composed to form regret minimizers for a generalization of treeplexes that allows convex sets and convex losses. This decomposition into individual optimization problems can be beneﬁcial because it enables the use of 1) different algorithms for different parts of the search space, 2) specialized techniques for different parts of the problem, such as warm starting (Brown & Sandholm, 2014; 2015b; 2016) and pruning (Lanctot et al., 2009; Brown & Sandholm, 2015a; Brown et al., 2017; Brown & Sandholm, 2017a), and 3) approximation of some parts of the space.

In this paper we introduce a general methodology for composing regret minimizers. We derive a set of rules for how regret minimizers can be constructed for composite convex sets via a calculus of regret minimization: given regret minimizers for convex sets X, Y we show how to compose these regret minimizers for various convexity-preserving operations performed on the sets (e.g., intersection, convex hull, Cartesian product), in order to arrive at a regret minimizer for the resulting composite set.

Our approach treats the regret minimizers for individual

Regret Circuits: Composability of Regret Minimizers

convex sets as black boxes, and builds a regret minimizer for the resulting composite set by combining the outputs of the individual regret minimizers. This is important because it allows freedom in choosing the best regret minimizer for each individual set (from either a practical or theoretical perspective). For example, in practice the regret matching (Hart & Mas-Colell, 2000) and regret matching+ (RM+) (Tammelin et al., 2015) regret minimizers are known to perform better than theoretically-superior regret minimizers such as Hedge (Brown et al., 2017), while Hedge may give better theoretical results when trying to prove the convergence rate of a construction through our calculus.

One way to conceptually view our construction is as regret circuits: in order to construct a regret minimizer for some convex set X that consists of convexity-preserving operations on (say) two sets X1, X2, we construct a regret circuit consisting of regret minimizers for X1 and X2, along with a sequence of operations that aggregate the results of those circuits in order to form an overall circuit for X. We use this view extensively in the paper; we show the regret-circuit representation of every operation that we develop.

As an application, we show that the correctness and convergence rate of the CFR algorithm can be proven easily through our calculus. We also show that the recent Constrained CFR algorithm (Davis et al., 2019) can be constructed via our framework. Our framework enables the construction of two algorithms for that problem. The ﬁrst is based on Lagrangian relaxation, and only guarantees approximate feasibility of the output strategies. The second is based on projection and guarantees exact feasibility, for the ﬁrst time in any algorithm that decomposes overall regret into local regrets at decision points.

2. Regret Minimization

We will prove our results in the online learning framework called online convex optimization (Zinkevich, 2003) (OCO). In OCO, a decision maker repeatedly interacts with an unknown environment by making a sequence of decisions x1, x2, . . . from a convex and compact set X Rn. After each decision xt, the decision maker faces a convex loss function ℓt(xt), which is unknown to the decision maker until after the decision is made. So, we are constructing a device that supports two operations: (i) it provides the next decision xt+1 X and (ii) it receives/observes the convex loss function ℓt used to evaluate decision xt. The decision making is online in the sense that the next decision, xt+1, is based only on the previous decisions x1, . . . , xt and corresponding observed loss functions ℓ1, . . . , ℓt.

The quality of the device is measured by its cumulative regret, which is the difference between the loss cumulated by the sequence of decisions x1, . . . , x T and the loss that would

have been cumulated by playing the best-in-hindsight timeindependent decision ˆx. Formally, the cumulative regret up to time T is

RT (X,F) :=

t=1 ℓt(xt) min ˆx X

Above we introduce the new notation of a subscript (X, F) to be explicit about the domain of the decisions {xt} and the domain of the loss functions {ℓt}, respectively. This turns out to be important because we will study composability of devices with different domains.

The device is called a regret minimizer if it satisﬁes the desirable property of Hannan consistency: the average regret approaches zero, that is, RT (X,F) grows sublinearly in T. Formally, in our notation, we have the following deﬁnition.

Deﬁnition 1 ((X, F)-regret minimizer). Let X be a convex and compact set, and let F be a convex cone in the space of bounded convex functions on X, and such that F contains the space L of linear functions. An (X, F)- regret minimizer is a function that selects the next decision xt+1 X given the history of decisions x1, . . . , xt and observed loss functions ℓ1, . . . , ℓt F, so that the cumulative regret RT (X,F) = o(T).

2.1. Universality of Linear Loss Functions

Regret minimizers for linear loss functions are in a sense universal: one can construct a regret minimizer for convex loss functions from any regret minimizer for linear loss functions (e.g., Mc Mahan (2011)). The crucial insight is that the regret that we are trying to minimize, RT (X,F), is bounded by the regret of a (X, L)-regret minimizer that, at each time t, observes as its loss function a tangent plane of ℓt at the most recent decision xt. Thus we can minimize RT (X,F) by minimizing RT (X,L).

Formally, let ℓt(xt) be any subgradient of ℓt at xt. By convexity of ℓt,

ℓt(ˆx) ℓt(xt) + ℓt(xt), ˆx xt ˆx X,

and, substituting into (1), we obtain

t=1 ℓt(xt), xt min ˆx X

t=1 ℓt(xt), ˆx

where the right hand side is RT (X,L), the regret cumulated by a device that observes the linear loss functions ℓt(xt), .1

1A downside of this approach is that (2) is an inequality, not an equality. When we use a linearization of the loss function at each decision point, we introduce error. This can cause regret to be minimized more slowly than somehow working on the nonlinear loss functions directly. Nevertheless, we obtain a regret minimizer for the original problem.

Regret Circuits: Composability of Regret Minimizers

2.2. Connection to Convex-Concave Saddle-Point Problems and Game Theory

In this subsection we review how regret minimization can be used to compute solutions to regularized bilinear saddlepoint problems, that is solutions to problems of the form

min x X max y Y

x Ay + d1(x) d2(y) , (3)

where X, Y are closed convex sets, and d1, d2 are convex functions. This general formulation allows us to capture, among other settings, several game-theoretical applications such as computing Nash equilibria in two-player zero-sum games. In that setting, d1 and d2 are the constant zero functions, X and Y are convex polytopes whose description is provided by the sequence-form constraints, and A is a real payoff matrix (von Stengel, 1996).

In order to use regret minimization to solve problems of the form (3), we consider the loss functions ℓt X : x 7 ( Ayt) x + d1(x), ℓt Y : y 7 (A xt) y + d2(y). The error metric that we use is the saddle-point residual (or gap) ξ of ( x, y), deﬁned as ξ( x, y) := maxˆy Y{d1( x) d2(ˆy)+ x, Aˆy } minˆx X {d1(ˆx) d2( y)+ ˆx, A y }. The following folk theorem shows that the average of a sequence of regretminimizing strategies for the choice of losses above leads to a bounded saddle-point residual (see, for example, Farina et al. (2019) for a proof).

Theorem 1. If the average regret accumulated on X and Y by the two sets of strategies {xt}T t=1 and {yt}T t=1 is ϵ1 and ϵ2, respectively, then the strategy proﬁle ( x, y) where x = 1 T PT t=1 xt, y = 1 T PT t=1 yt has a saddle-point residual bounded above by ϵ1 + ϵ2.

When d1 d2 0 and X, Y are the players sequenceform strategy spaces, Theorem 1 asserts that the average strategy proﬁle produced by the regret minimizers is an (ϵ1 + ϵ2)-Nash equilibrium. Different choices of the regularizing functions d1 and d2 can be used to solve for strategies in other game-theoretic applications as well, such as computing a normal-form quantal-response equilibrium (Ling et al., 2018; Farina et al., 2019). Farina et al. (2019) study opponent exploitation where the goal is to compute a best response, subject to a penalty for moving away from a precomputed Nash equilibrium strategy; this is captured by having d1 or d2 include a penalty term that penalizes distance from the precomputed strategy. Farina et al. (2017) and Kroer et al. (2017) study constraints on individual decision points, and Davis et al. (2019) study additional constraints on the overall EFG polytopes X, Y. Regret minimization in those settings requires regret minimizers that can operate on more general domains X, Y than the sequence form. In this paper we show how one can construct regret minimizers for any convex domain that can be expressed from simpler domains using convexity-preserving operations.

3. Regret Circuits

In this paper, we introduce regret circuits. They are composed of independent regret minimizers connected by wires on which the loss functions and decisions can ﬂow. Regret circuits encode how the inputs and outputs of multiple regret minimizers can be combined to achieve a goal, in a divide-and-conquer fashion, and help simplify the design and analysis of regret-minimizing algorithms. Using the constructions that we will present, one can compose different regret circuits and produce increasingly complex circuits.

The regret circuits approach has several advantages that make it appealing when compared to other, more monolithic, approaches. For one, by treating every regret minimizer that appears in a regret circuit as an independent black box, our approach enables one to select the best individual algorithm for each of them. Second, our framework is amenable to pruning or warm-starting techniques in different parts of the circuit, and substituting one or more parts of the circuit with an approximation. Finally, regret circuits can be easily run in distributed and parallel environments.

We will express regret circuits pictorially through block diagrams. We will use the following conventions when drawing regret circuits:

an (X, F)-regret minimizer is drawn as a box

(X, F) ℓt 1 xt

where the input (red) arrow represents the loss at a generic time t 1, while the output (blue) arrow represents the decision produced at time t; the symbol is used to denote an operation that constructs or manipulates one or more loss functions; the symbol is used to denote an operation that combines or manipulates one or more decisions; the symbol denotes an adder, that is a node that outputs the sum of all its inputs; dashed arrows denote decisions that originate from the previous iteration.

As an example, consider the construction of Section 2.1, where we showed how one can construct a regret minimizer for generic convex loss functions from any regret minimizer for linear loss functions. Figure 1 shows how that construction can be expressed as a regret circuit.

(X, L) xt ℓt 1 ℓt 1(xt 1),

Figure 1. Regret circuit representing the construction of an (X, F)- regret minimizer using an (X, L)-regret minimizer.

Regret Circuits: Composability of Regret Minimizers

4. Circuit Construction for Operations that Enlarge or Transform Sets

In this section, we begin the construction of our calculus of regret minimization. Given regret minimizers for two closed convex sets X and Y, we show how to construct a regret minimizer for sets obtained via convexity-preserving operations on X and Y. In this section, we focus on operations that take one or more sets and produce a regret minimizer for a larger set this is the case, for instance, of convex hulls, Cartesian products, and Minkowski sums.

As explained in Section 2.1, one can extend any (X, L)- regret minimizer to handle more expressive loss functionals. Therefore, in the rest of the paper, we focus on (X, L)-regret minimizers.

4.1. Cartesian Product

In this section, we show how to combine an (X, L)- and a (Y, L)-regret minimizer to form an (X Y, L)-regret minimizer. Any linear function ℓ: X Y R can be written as ℓ(x, y) = ℓX (x) + ℓY(y) where the linear functions ℓX : X R and ℓY : Y R are deﬁned as ℓX : x 7 ℓ(x, 0) and ℓY : y 7 ℓ(0, y). It is immediate to verify that

RT (X Y,L) =

t=1 ℓt X (xt) min ˆx X

t=1 ℓt X (ˆx)

t=1 ℓt Y(yt) min ˆy Y

t=1 ℓt Y(ˆy)

= RT (X,L) + RT (Y,L).

In other words, it is possible to minimize regret on X Y by simply minimizing it on X and Y independently and then combining the decisions, as in Figure 2.

(xt, yt) ℓt 1

Figure 2. Regret circuit for the Cartesian product X Y.

4.2. Afﬁne Transformation and Minkowski Sum

Let H : E F be an afﬁne map between two Euclidean spaces E and F, and let X E be a convex and compact set. We now show how an (X, L)-regret minimizer can be employed to construct a (H(X), L)-regret minimizer.

Since every y H(X) can be written as y = H(x) for some x X, the cumulative regret for an (H(X), L)-regret minimizer can be expressed as

RT (H(X),L) =

t=1 (ℓt H)(xt) min ˆx X

t=1 (ℓt H)(ˆx)

Since ℓt and H are afﬁne, their composition ℓt H := ℓt H is also afﬁne. Hence, RT (H(X),L) is the same regret as an (X, L)-regret minimizer that observes the linear function ℓt H( ) ℓt H(0) instead of ℓt. The construction is summarized by the circuit in Figure 3.

(X, L) xt H(xt) ℓt 1 ℓt 1 H ( ) ℓt 1 H (0)

Figure 3. Regret circuit for the image H(X) of X under the afﬁne transformation H.

As an application, we use the above construction to form a regret minimizer for the Minkowski sum X + Y := {x + y : x X, y Y} of two sets. Indeed, note that X + Y = σ(X Y), where σ : (x, y) 7 x + y is a linear map. Hence, we can combine the construction in this section together with the construction of the Cartesian product (Figure 2). See Figure 8 in the appendix for the resulting circuit.

4.3. Convex Hull

In this section, we show how to combine an (X, L)- and a (Y, L)-regret minimizer to form a (co{X, Y}, L)-regret minimizer, where co denotes the convex hull operation, co{X, Y} = {λ1x + λ2y : x X, y Y, (λ1, λ2) 2}, and 2 := {(λ1, λ2) R2 + : λ1 + λ2 = 1} is the twodimensional simplex. We can think of a (co{X, Y}, L)-regret minimizer as picking a triple (λt, xt, yt) 2 X Y at each time point t. Using the linearity of the loss functions,

RT (co{X,Y},L) =

t=1 λt 1ℓt(xt) + λt 2ℓt(yt)

min ˆλ 2 ˆx X,ˆy Y

t=1 ℓt(ˆx) + ˆλ2

Now, we make two crucial observations. First,

min ˆλ 2 ˆx X,ˆy Y

t=1 ℓt(ˆx) + ˆλ2

( ˆλ1 min ˆx X

+ ˆλ2 min y Y

since all components of ˆλ are non-negative. Second, the inner minimization problem over X is related to the cumulative regret RT (X,L) of the (X, L)-regret minimizer that observes the loss functions ℓt as follows:

= RT (X,L) +

t=1 ℓt(xt).

(An analogous relationship holds for Y.) Hence,

RT (co{X,Y},L) =

t=1 λt 1ℓt(xt) + λt 2ℓt(yt)

t=1 ˆλ1ℓt(xt)+ˆλ2ℓt(yt)

ˆλ1RT (X,L)+ˆλ2RT (Y,L) )

Regret Circuits: Composability of Regret Minimizers

Using the fact that min(f + g) min f + min g, and introducing the quantity

RT ( 2,L) :=

t=1 λt 1ℓt(xt) + λt 2ℓt(yt)

t=1 ˆλ1ℓt(xt) + ˆλ2ℓt(yt)

we conclude that

RT (co{X,Y},L) RT ( 2,L) + max{RT (X,L), RT (Y,L)}. (4)

The introduced quantity, RT ( 2,L), is the cumulative regret of a ( 2, L)-regret minimizer that, at each time instant t, observes the (linear) loss function

ℓt λ : 2 (λ1, λ2) 7 λ1ℓt(xt) + λ2ℓt(yt). (5)

Intuitively, this means that in order to make good decisions in the convex hull co{X, Y}, we can let two independent (X, L)- and (Y, L)-regret minimizers pick good decisions in X and Y respectively, and then use a third regret minimizer that decides how to mix the two outputs. This way, we break the task of picking the next recommended triple (λt, xt, yt) into three different subproblems, two of which can be run independently. Equation (4) guarantees that if all three regrets {RT ( 2,L), RT (X,L), RT (Y,L)} grow sublinearly, then so does RT (co{X,Y},L). Figure 4 shows the regret circuit that corresponds to our construction above.

λt 1xt + λt 2yt ℓt 1 ℓt 1 λ λt

Figure 4. Regret circuit for the convex hull co{X, Y}. The loss function ℓt λ is deﬁned in Equation (5).

Extending to multiple sets The construction shown in Figure 4 can be extended to handle the convex hull co{X1, . . . , Xn} of n sets as follows. First, the input loss function ℓt 1 is fed into all the (Xi, L)-regret minimizers (i = 1, . . . , n). Then, the loss function ℓt λ, deﬁned as

ℓt λ : n (λ1, . . . , λn) 7 λ1ℓ(xt 1) + + λnℓ(xt n),

is input into a ( n, L)-regret minimizer, where n is the n-dimensional simplex. Finally, at each time instant t, the n decisions xt 1, . . . , xtn output by the (Xi, L)-regret minimizers are combined with the decision λt output by the ( n, L)-regret minimizer to form λt 1xt 1 + + λtnxtn.

V -polytopes Our construction can be directly applied to construct an (X, L)-regret minimizer for a V -polytope X = co{v1, . . . , vn} where v1, . . . , vn are n points in a Euclidean

space E. Of course, any ({vi}, L)-regret minimizer outputs the constant decision vi. Hence, our construction (Figure 4) reduces to a single ( n, L)-regret minimizer that observes the (linear) loss function

ℓt λ : n (λ1, . . . , λn) 7 λ1ℓt(v1) + + λnℓt(vn).

The observation that a regret minimizer over a simplex can be used to minimize regret over a V -polytope already appeared in Zinkevich (2003), Schuurmans & Zinkevich (2016), and Farina et al. (2017, Theorem 3).

5. Application: Derivation of CFR

We now show that these constructions can be used to construct the CFR framework. The ﬁrst thing to note is that the strategy space of a single player in an extensive-form game is a treeplex, which can be viewed recursively as a series of convex hull and Cartesian product operations. This perspective is also used when constructing distance functions for ﬁrst-order methods for EFGs (Hoda et al., 2010; Kroer et al., 2015; 2018). In particular, an information set is viewed as an n-dimensional convex hull (since the sum of probabilities over actions is 1), where each action a at the information set corresponds to a treeplex Xa representing the set of possible information sets coming after a (in order to perform the convex hull operation, we create a new, larger representation of Xa so that the dimension is the same for all a, described in detail below). The Cartesian product operation is used to represent multiple potential information sets being arrived at (for example different hands dealt in a poker game).

Figure 5 shows an example. Each information set Xi (except X0) corresponds to a 2-dimensional convex hull over two treeplexes, one of which is always empty (that is, a leaf node). Each is a Cartesian product. The top-most represents the three possible hands that the player may have when making their ﬁrst decision. The second layer of Cartesian products represent actions taken by the opponent.

The information-set construction is as follows: let I be the information set under construction, and AI the set of actions. Each action a AI has some, potentially empty, treeplex Xa beneath it; let na be the dimension of that treeplex. We cannot form a convex hull over {Xa}a AI directly since the sets are not of the same dimension, and we do not wish to average across different strategy spaces. Instead, we create a new convex set X a R|AI|+P

a AI na for each a. The ﬁrst |AI| indices correspond to the actions in AI, and each Xa gets its own subset of indices. For each x Xa there is a corresponding x X a; x has a 1 at the index of a, x at the indices corresponding to Xa, and 0 everywhere else. The convex hull is constructed over the set {X a}a, which gives exactly the treeplex rooted at I. The Cartesian product is easy and can be done over a given set of treeplexes rooted

Regret Circuits: Composability of Regret Minimizers

Fold Call Fold Call Fold Call

Check Raise Check Raise Check Raise

Jack Queen King

Check Raise Check Raise Check Raise

Figure 5. Treeplex for the ﬁrst player in the game of Kuhn poker. Each Xi represents a convex hull over the treeplexes below, while

denotes the Cartesian product operation.

e1 X1 {0} ... {0}

e2 {0} X2 ... {0}

en {0} {0} ... Xn

Figure 6. Inductive treeplex construction rules. ei Rn contains a 1 at index i, and 0 everywhere else.

at information sets I1, . . . , In. The inductive construction rules for the treeplex are given in Figure 6. In fact, one can prove that the ℓλ loss functions deﬁned in Equation (5) are exactly the counterfactual loss functions deﬁned in the original CFR paper (Zinkevich et al., 2007). If we use as our loss function the gradient Ayt where yt is the opponent s strategy at iteration t, and then apply our expressions for the Cartesian-product and convex-hull regrets inductively, it follows from (5) that the loss function associated with each action is exactly the negative counterfactual value. Finally, the average treeplex strategy as per Theorem 1 coincides with the per-information-set averaging used in standard CFR expositions (e.g., (Zinkevich et al., 2007)).

6. Circuit Construction for Operations that Constrain Sets

Unlike Section 4, in this section we deal with operations that curtail the set of decisions that can be output by our regret minimizer. Section 6.1 and 6.2 propose two different constructions, and Section 6.3 discusses the merits and drawbacks of the two.

6.1. Constraint Enforcement via Lagrangian Relaxation

Suppose that we want to construct an (X {x : g(x) 0}, L)- regret minimizer, where g is a convex function, but we only

possess an (X, L)-regret minimizer. One natural idea is to use the latter to approximate the former, by penalizing any choice of x X such that g(x) > 0. In particular, it seems natural to introduce the penalized loss function

ℓt : X x 7 ℓt(x) + βt max{0, g(x)},

where βt is a (large) positive constant that can change over time. This approach is reminiscent of Lagrangian relaxation. The loss function ℓt is not linear, and as such it cannot be handled as is by our (X, L)-regret minimizer. However, as we have observed in Section 2.1, the regret induced by ℓt can be minimized by our (X, L)-regret minimizer if it observes the linearized loss function ℓt : X R deﬁned as

ℓt (x) = ℓt(x) + βt g(xt), x , βt :=

( βt if g(xt) > 0 0 otherwise.

Figure 9 in the appendix shows the regret circuit corresponding to the construction described so far.

In the rest of this subsection we analyze in what sense small cumulative regret implies that the constraint g(x) 0 is satisﬁed. Let RT (X,L) be the cumulative regret of our (X, L)- regret minimizer. Introducing Xg := X {x : g(x) 0} and τg := {t {1, . . . , T} : g(xt)>0},

t=1 ℓt (xt) min ˆx X

t=1 ℓt (ˆx)

t=1 ℓt(xt) + X

t τg βtg(xt)

t=1 ℓt(ˆx)+

max{0, g(ˆx)}

t=1 ℓt(xt) min ˆx Xg

t τg βtg(xt), (6)

where the ﬁrst inequality is by (2) and the second inequality comes from simply restricting the domain of the minimization from X to Xg.2 Thus, if the βt are sufﬁciently large, the average decision x := 1 B (β1x1 + + βT x T ) where B := β1 + + βT satisﬁes

max{0, g}( x) 1

t=1 βt max{0, g}(xt) = 1

t τg βtg(xt)

RT (X,L) + min ˆx Xg

t=1 ℓt(ˆx xt)

where the ﬁrst inequality follows by convexity of max{0, g}, and the second inequality follows by (6).

If PT t=1 βi TLD, where L is an upper bound on the norm of the loss functions ℓ1, . . . , ℓT and D is an upper bound on the diameter of X, then max{0, g( x)} 0 as T , that

2It may tempting to recognize in the term in parentheses in (6) the cumulative regret of an(Xg,L)-regret minimizer.This would be incorrect: the decisions xt are not guaranteed to satisfy g(xt) 0.

Regret Circuits: Composability of Regret Minimizers

is, the constraint is satisﬁed at least by the average in the limit. If L and D are known ahead of time, one practical way to guarantee PT t=1 βi TLD is to choose βt = κLD where κ > 0 is a large constant. This guarantees that in the limit, small cumulative regret implies that the average strategy approximately satisﬁes the constraint and satisﬁes Hannan consistency. Alternatively, the βt can be chosen by a regret minimizer which sees the constraint violation max{0, g(xt)} at time t as its loss function.

6.2. Intersection with a Closed Convex Set

In this subsection we consider constructing an (X Y, L)- regret minimizer from an (X, L)-regret minimizer, where Y is a closed convex set such that X Y = . As it turns out, this is always possible, and can be done by letting the (X, L)- regret minimizer give decisions in X, and then projecting them onto the intersection X Y.

We will use a Bregman divergence D(y x) := d(y) d(x) d(x), y x as our notion of distance between the points x and y, where the distance generating function (DGF) d is µstrongly convex and β-smooth (that is, d is differentiable and its gradient is Lipschitz continuous with Lipschitz constant β). Our construction makes no further assumptions on d, so the most appropriate DGF can be used for the application at hand. When d(x) = x 2 2 we obtain D(y x) = y x 2 2, so we recover the usual Euclidean distance between x and y. In accordance with our generalized notion of distance, we deﬁne the projection of a point x X onto X Y as πX Y(x) = argminy X Y D(y x). For ease of notation, we will denote the projection of x onto X Y as [x]; since X Y is closed and convex, and since D( x) is strongly convex, such projection exists and is unique. The cumulative regret of the (X Y, L)-minimizer is

RT (X Y,L) =

t=1 ℓt([xt]) min ˆx X Y

t=1 ℓt([xt] xt) min ˆx X Y

t=1 ℓt(ˆx xt)

where the second equality holds by linearity of ℓt. The ﬁrst-order optimality condition for the projection problem is

d(xt) d([xt]), ˆx [xt] 0 ˆx X Y. Consequently, provided αt 0 for all t,

t=1 ℓt(ˆx xt)

t=1 ℓt(ˆx xt)

t=1 αt d(xt) d([xt]), ˆx [xt]

(note the change in the domain of the minimum between the leftand right-hand side). The role of the αt coefﬁcients is to penalize choices of xt that are in X \ Y. In particular, if

t=1 ℓt([xt] xt)

t=1 αt [xt] xt 2, (9)

then, by µ-strong convexity of d, we have

t=1 ℓt([xt] xt)

t=1 αt d(xt) d([xt]), xt [xt] . (10)

Substituting (10) and (8) into Equation (7) we get

t=1 ℓt(xt) + αt d(xt) d([xt]), xt

t=1 ℓt(ˆx) + αt d(xt) d([xt]), ˆx

which is the regret observed by an (X, L)-regret minimizer that at each time t observes the linear loss function

ℓt : x 7 ℓt(x) + αt d(xt) d([xt]), x . (11)

Hence, as long as condition (9) holds, the regret circuit of Figure 7 is guaranteed to be Hannan consistent.

(X, L) [xt]

πX Y + ℓt 1 ℓt 1

αt 1 d(xt 1) d([xt 1]),

Figure 7. Regret circuit representing the construction of an (X Y, L)-regret minimizer using a (X, L)-regret minimizer.

On the other hand, condition (9) can be trivially satisﬁed by the deterministic choice

0 if xt X Y

max 0, ℓt([xt] xt)

µ [xt] xt 2

The fact that αt can be arbitrarily large (when xt and [xt] are very close) is not an issue. Indeed, αt is only used in ℓt (Equation 11) and is always multiplied by a term whose magnitude grows proportionally with the distance between xt and [xt]. In fact, the norm of the functional ℓt is bounded:

ℓt ℓt + ℓt([xt] xt) µ [xt] xt 2

d(xt) d([xt])

ℓt + β ℓt([xt] xt) µ [xt] xt 2

where the second inequality follows by β-smoothness of d. In other words, our construction dilates the loss functions by at most a factor 1 + β/µ. For instance, when d(x) = x 2, this dilation factor is equal to 2.

6.3. Comparison of the Two Constructions

As we pointed out, the decisions in the construction using Lagrangian relaxation only converge to the constrained domain Xg on average. Thus, formally the construction does

Regret Circuits: Composability of Regret Minimizers

not provide an (Xg, L)-regret minimizer, but only an approximate one. Section 6.2 solves this problem by providing a generic construction for an (X Y, L)-regret minimizer, a strictly more general task. The price to pay is the need for (generalized) projections, a potentially expensive operation. Thus, the choice of which construction to use reduces to a tradeoff between the computational cost of projecting and the need to have exact versus approximate feasibility with respect to g. The right choice depends on the application at hand. Finally, the construction based on Lagrangian relaxation requires large penalization factors βt in order to work properly. Therefore, the norm of ℓt can be large, which can complicate the task of minimizing the regret RT (X,L).

7. Application:Handling Strategy Constraints

When solving EFGs, there may be a need to add additional constraints beyond simply computing feasible strategies:

Opponent modeling. Upon observing repeated play from an opponent, we may wish to constrain our model of their strategy space to reﬂect such observations. Since observations can be consistent with several information sets belonging to the opponent, this requires adding constraints that span across information sets. Bounding probabilities. For example, in a patrolling game we may wish to ensure that a patrol returns to its base at the end of the game with high probability. Nash equilibrium reﬁnement computation. Reﬁnements can be computed, or approximated, via perturbation of the strategy space of each player. For extensive-form perfect equilibrium this can be done by lower-bounding the probability of each action at each information set (Farina & Gatti, 2017), which can be handled with small modiﬁcations to standard CFR or ﬁrst-order methods (Farina et al., 2017; Kroer et al., 2017). However, quasi-perfect equilibrium requires perturbations on the probability of sequences of action (Miltersen & Sørensen, 2010), which requires strategy constraints that cross information sets. All the applications above potentially require adding strategy space constraints that span across multiple information sets. Such constraints break the recursive nature of the treeplex, and are thus not easily incorporated into standard regret-minimization or ﬁrst-order methods for EFG solving. Davis et al. (2019) propose a Lagrangian relaxation approach called Constrained CFR (CCFR): each strategy constraint is added to the objective with a Lagrangian multiplier, and a regret minimizer is used to penalize violation of the strategy constraints. They prove that if the regret minimizer for the Lagrange multipliers has the optimal Lagrangian multipliers as part of their strategy space, the average output strategy converges to an approximate solution to the constrained game. They also prove a bound on the approximate feasibility of the average output strategy when

their algorithm is instantiated with Regret Matching (Hart & Mas-Colell, 2000) as the local regret minimizer at each information set.

At least two alternative variants of CFR for EFGs with strategy constraints can be obtained using our framework. First, we can apply our method for Lagrangian relaxation of X and a constraint g(x) 0. Our Lagrangian approach yields as a special case the CCFR algorithm. Our approach supports regret minimization for the Lagrangian multipliers, as was done in CCFR, since we put no constraints on the form of the βt multipliers. However, our approach is more general in that it also allows instantiation with a ﬁxed choice of multipliers, thus obviating the need for regret minimization. The second alternative is to apply our construction for the intersection of convex sets (Section 6.2), which uses (generalized) projection onto X {x : g(x) 0}. This leads to a different regret-minimization approach, which has the major advantage that all iterates are feasible, whereas Lagrangian approaches only achieve approximate feasibility. The cost of projection may be nontrivial, and so in general the choice of method depends on the application at hand.

8. Conclusion and Future Research

We developed a calculus of regret minimization, which enables the construction of regret minimizers for composite convex sets that can be inductively expressed as a series of convexity-preserving operations on simpler sets. We showed that our calculus can be used to construct the CFR algorithm directly, as well as several of its variants for the more general case where we have strategy constraints. Our regret calculus is much more broadly applicable than just EFGs: it applies to any setting where the decision space can be expressed via the convexity-preserving operations that we support. In the future we plan to investigate novel applications of our regret calculus. One potential application would be online portfolio selection with additional constraints (e.g., exposure constraints across industries); our framework makes it easy to construct such a regret minimizer from any standard online-portfolio-selection algorithm.

The approach presented in this paper has a large number of potential future applications. For one, it would be interesting to apply our approaches of including additional constraints to the computation of quasi-perfect equilibria. Currently the only solver that is fairly scalable is based on an exact, monolithic, custom approach that uses heavy-weight operations such as matrix inversion, etc. (Farina et al., 2018). Our regret-minimization approach would be the ﬁrst of its kind for equilibrium reﬁnement that requires constraints that cut across information sets. It obviates the need for the heavy-weight operations, and would still converge to a feasible solution that satisﬁes an approximate notion of quasi-perfect equilibrium. It would be interesting to study this tradeoff between speed and solution quality.

Regret Circuits: Composability of Regret Minimizers

Acknowledgments

This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, and CCF-1733556, and the ARO under award W911NF-171-0082.

Bowling, M., Burch, N., Johanson, M., and Tammelin, O. Heads-up limit hold em poker is solved. Science, 347 (6218), January 2015.

Brown, N. and Sandholm, T. Regret transfer and parameter optimization. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2014.

Brown, N. and Sandholm, T. Regret-based pruning in extensive-form games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2015a.

Brown, N. and Sandholm, T. Simultaneous abstraction and equilibrium ﬁnding in games. In Proceedings of the International Joint Conference on Artiﬁcial Intelligence (IJCAI), 2015b.

Brown, N. and Sandholm, T. Strategy-based warm starting for regret minimization in games. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2016.

Brown, N. and Sandholm, T. Reduced space and faster convergence in imperfect-information games via pruning. In International Conference on Machine Learning (ICML), 2017a.

Brown, N. and Sandholm, T. Safe and nested subgame solving for imperfect-information games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pp. 689 699, 2017b.

Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, Dec. 2017c.

Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2019.

Brown, N., Kroer, C., and Sandholm, T. Dynamic thresholding and pruning for regret minimization. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2017.

Burch, N., Johanson, M., and Bowling, M. Solving imperfect information games using decomposition. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2014.

Davis, T., Waugh, K., and Bowling, M. Solving large extensive-form games with strategy constraints. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2019.

Farina, G. and Gatti, N. Extensive-form perfect equilibrium computation in two-player games. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2017.

Farina, G., Kroer, C., and Sandholm, T. Regret minimization in behaviorally-constrained zero-sum games. In International Conference on Machine Learning (ICML), 2017.

Farina, G., Gatti, N., and Sandholm, T. Practical exact algorithm for trembling-hand equilibrium reﬁnements in games. In Conference on Neural Information Processing Systems (Neur IPS), 2018.

Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensiveform games. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2019.

Ganzfried, S. and Sandholm, T. Endgame solving in large imperfect-information games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2015. Early version in AAAI-13 Workshop on Computer Poker and Incomplete Information.

Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68: 1127 1150, 2000.

Hoda, S., Gilpin, A., Pe na, J., and Sandholm, T. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.

Kroer, C., Waugh, K., Kılınc -Karzan, F., and Sandholm, T. Faster ﬁrst-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.

Kroer, C., Farina, G., and Sandholm, T. Smoothing method for approximate extensive-form perfect equilibrium. In Proceedings of the International Joint Conference on Artiﬁcial Intelligence (IJCAI), 2017.

Kroer, C., Waugh, K., Kılınc -Karzan, F., and Sandholm, T. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, pp. 1 33, 2018.

Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.

Regret Circuits: Composability of Regret Minimizers

Ling, C. K., Fang, F., and Kolter, J. Z. What game are we playing? End-to-end learning in normal and extensive form games. In Proceedings of the International Joint Conference on Artiﬁcial Intelligence (IJCAI), 2018.

Mc Mahan, B. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pp. 525 533, 2011.

Miltersen, P. B. and Sørensen, T. B. Computing a quasiperfect equilibrium of a two-player game. Economic Theory, 42(1), 2010.

Moravcik, M., Schmid, M., Ha, K., Hladik, M., and Gaukrodger, S. Reﬁning subgames in large imperfect information games. In AAAI Conference on Artiﬁcial Intelligence (AAAI), 2016.

Moravˇc ık, M., Schmid, M., Burch, N., Lis y, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. Deepstack: Expert-level artiﬁcial intelligence in heads-up no-limit poker. Science, 356(6337), May 2017.

Schuurmans, D. and Zinkevich, M. A. Deep learning games. In Advances in Neural Information Processing Systems, pp. 1678 1686, 2016.

Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold em. In Proceedings of the 24th International Joint Conference on Artiﬁcial Intelligence (IJCAI), 2015.

von Stengel, B. Efﬁcient computation of behavior strategies. Games and Economic Behavior, 14(2):220 246, 1996.

Zinkevich, M. Online convex programming and generalized inﬁnitesimal gradient ascent. In International Conference on Machine Learning (ICML), pp. 928 936, Washington, DC, USA, 2003.

Zinkevich, M., Bowling, M., Johanson, M., and Piccione, C. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.