# Subgame Solving in Adversarial Team Games

Brian Hu Zhang (Computer Science Department, Carnegie Mellon University, bhzhang@cs.cmu.edu) · Luca Carminati (DEIB, Politecnico di Milano, luca.carminat@polimi.it) · Federico Cacciamani (DEIB, Politecnico di Milano, federico.cacciamani@polimi.it) · Gabriele Farina (Computer Science Department, Carnegie Mellon University, gfarina@cs.cmu.edu) · Pierriccardo Olivieri (DEIB, Politecnico di Milano, pierriccardo.olivieri@polimi.it) · Nicola Gatti (DEIB, Politecnico di Milano, nicola.gatti@polimi.it) · Tuomas Sandholm (Computer Science Department, CMU; Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.; sandholm@cs.cmu.edu)

*Equal contribution; author order randomized. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).*

**Abstract.** In adversarial team games, a team of players sequentially faces a team of adversaries. These games are the simplest setting with multiple players where cooperation and competition coexist, and it is known that the information asymmetry among the team members makes equilibrium approximation computationally hard. Although much effort has been spent designing scalable algorithms, the problem of solving large game instances remains open. In this paper, we extend the successful approach used to solve huge two-player zero-sum games, where a blueprint strategy is computed offline by using an abstract version of the game and then refined online, that is, during a playthrough. In particular, to the best of our knowledge, our paper provides the first method for online strategy refinement via subgame solving in adversarial team games. Our method, based on the team belief DAG, generates a gadget game and then refines the blueprint strategy by using column-generation approaches in an anytime fashion. If the blueprint is sparse, then our whole algorithm runs end-to-end in polynomial time given a best-response oracle; in particular, it avoids expanding the whole team belief DAG, which has exponential worst-case size. We apply our method to a standard test suite, and we empirically show the performance improvement of the strategies thanks to subgame solving.

## 1 Introduction

Sequential strategic games with multiple players where cooperation and competition coexist are fascinating problems receiving increasing interest in the scientific literature. The simplest case is represented by adversarial team games (ATGs) [19], where a team of players faces a team of adversaries. This setting extends the basic zero-sum two-player setting, introducing the problem of coordinating team members with asymmetric and partial information. A common approach is that of ex ante coordination, introduced in [7], in which the players can coordinate their strategies beforehand but have no means of communication during gameplay. Many recreational and real-world scenarios can be modeled according to this framework, such as, e.g., contract bridge, collusion in poker, and sequential security games. While ex ante coordination makes the optimization problem convex, the problem of finding an equilibrium is inapproximable in polynomial time [7], and many efforts have been spent to design algorithms scaling up to non-toy instances.

**Related works.** The mainstream literature on ATGs, e.g., [7, 8, 9, 23], focuses on n-vs.-1 scenarios and resorts to column-generation (CG) algorithms to solve small/medium-sized instances, exploiting the existence of equilibria with small support.
These algorithms iterate between the problem of finding an equilibrium with a restricted set of strategies (the meta-problem) and the problem of enlarging such a set by finding the players' best responses. A crucial issue concerns the design of practical representations for the team's strategy space, which in the worst case is exponential in the size of the instance. In particular, a junction-tree-based decomposition is proposed in [20], while the team is represented through an explicit coordinator prescribing actions to team members in [5, 6]. The team belief DAG (TB-DAG) representation proposed in [22] bridges the previous two representations [5, 6, 20]. The TB-DAG is a directed acyclic graph representation of the team strategy space, and it captures the players' beliefs over the game nodes given the joint strategy computed beforehand and the information observed publicly by the team players. A central concept in the TB-DAG is the team belief (or team-public state), which is a combination of the team players' information sets sharing the same public information. The size of the TB-DAG is exponential in the number of information sets of each team-public state, and this number is called the information complexity [22]. Counterfactual regret minimization (CFR) [24] and CG algorithms [14] applied to the TB-DAG are the current state of the art for solving ATGs [21, 22]. Empirically, CFR outperforms CG when the information complexity is low, while CG outperforms CFR otherwise. However, despite the efforts to enhance representations and algorithms, solving medium-to-large games is currently out of reach, due to prohibitive memory requirements for CFR and iteration requirements for CG [21].

The problem of scaling up to huge games in the two-player zero-sum setting has been successfully addressed by refining strategies online via subgame solving [1, 3, 4, 17]. This approach led to, e.g., superhuman performance in poker [2, 16]. However, these techniques cannot be directly applied to ATGs due to some open, non-trivial technical questions, such as how to generate the gadget game for the refinement and how to compute counterfactual reach and best-response values.

**Original contributions.** To the best of our knowledge, this paper provides the first method for refining strategies by subgame solving in ATGs. In particular, we generate a blueprint strategy by running a CG algorithm with a time limit. This procedure results in a blueprint whose support size is linear in the number of iterations of the CG algorithm. We conveniently represent the blueprint via the TB-DAG, while we take inspiration from maxmargin subgame solving [15] to design our algorithm for refining the blueprint strategy during gameplay. In particular, the gadget game is generated from the portion of the TB-DAG played with strictly positive probability by the blueprint, rooted at the public state in which the players are currently playing. Then, the CG algorithm is applied to the gadget subgame to refine the blueprint strategy, with a time limit depending on the application. Here, the CG algorithm iteratively expands the gadget game by adding a new best response at each iteration. By using CG for both the blueprint and the subgame-solving procedures, the strategy is guaranteed to remain sparse. More precisely, the size of the representation of the strategy grows only linearly in the number of best-response-oracle (BRO) calls and polynomially in the size of the game.
Hence, our algorithm is time- and space-efficient modulo the BRO; in particular, it does not need to expand the whole TB-DAG, which has worst-case exponential size. We hope that, in the future, this property will allow scaling to huge games. The experimental evaluation of our method on a standard test suite shows that it dramatically reduces the gap between the values provided by the equilibrium strategy and the blueprint. In particular, the average reduction is about 39% when the time limit used for the strategy refinement after every single move in the game equals the time needed by the CG algorithm to complete even a single iteration at the root of the game, and allotting more iterations further improves performance. Such an empirical result shows that a local reoptimization can effectively lead to strategies whose value is close to the equilibrium value in ATGs.

## 2 Preliminaries

### 2.1 Adversarial team games

A two-team extensive-form adversarial team game (ATG) Γ, between teams ▲ and ▼, consists of:

- a directed tree of histories, or nodes, H, whose edges are labeled with actions. The set of actions available at a node h ∈ H is denoted with A(h), while the root of H is denoted with ∅. Leaves of the tree are called terminal nodes, and their set is denoted with Z. Given two nodes h, h′ ∈ H, we write h ⪯ h′ if there exists a path in the tree going from h to h′, and h ≺ h′ if h ⪯ h′ and h ≠ h′;
- a partition $H \setminus Z = H_N \cup H_\blacktriangle \cup H_\blacktriangledown$ of the set of nonterminal nodes determining who plays at each node: nature (N), also called chance, the max-team (▲), or the min-team (▼). Given a team i ∈ {▲, ▼}, the opponent is denoted with −i, and we let $H_{-i} = H \setminus H_i$;
- for each team i ∈ {▲, ▼}, a partition $\mathcal I_i$ of $H_i$ into information sets. In each information set $I \in \mathcal I_i$, every node in I must have the same action set, denoted as A(I);
- a utility function $u \in \mathbb R^Z$, where u[z] is the utility for ▲ for reaching terminal node z. Since the interaction between the teams is zero-sum, we let ▼'s utility be −u[z];
- for each nature node $h \in H_N$, a distribution p(·|h) over the children of h. We will use p[h] to denote the probability that nature plays all actions on the ∅ → h path.

Since each i ∈ {▲, ▼} is a team of players with potentially asymmetric information, we will not demand that the meta-players, each representing a whole team, have perfect recall. However, for most of the paper, we will be interested in the special case where ▼ consists of just one player, that is, ▼ has perfect recall, but ▲ may not. In this paper, we will have no need whatsoever to differentiate between the players on the same team; rather, we will take the perspective of the team as a whole. Equivalently, this whole paper can be formulated in terms of two-player games of imperfect recall. We will assume timeability, that is, no information set spans multiple levels of the tree [10]. Intuitively, this means that there is a turn clock in the game that is common knowledge to all players.

A pure strategy for team i is a selection of one action for each information set $I \in \mathcal I_i$. The realization form x of a pure strategy is the vector $x \in \{0,1\}^Z$ where x[z] = 1 if and only if the team plays every action on the path to z. A correlated strategy is a probability distribution over pure strategies, and its realization form is the appropriate convex combination. By taking arbitrary probability distributions over pure strategies, the individual players of a team can correlate their strategy with shared randomness hidden from the opposing team.
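To make the preceding definitions concrete, here is a minimal sketch (ours, not from the paper) of a game-tree node and of the realization form of a pure team strategy; the `Node` fields, the string team labels, and the `realization_form` helper are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    player: str                                                  # "N" (nature), "max", or "min"
    children: Dict[str, "Node"] = field(default_factory=dict)    # action label -> child node
    infoset: Optional[str] = None                                 # information-set id (team nodes)
    utility: float = 0.0                                          # u[z], max-team payoff at terminals

def is_terminal(h: Node) -> bool:
    return not h.children

def realization_form(root: Node, pure_strategy: Dict[str, str], team: str) -> Dict[int, int]:
    """x[z] = 1 iff `team` plays every one of its own actions on the path to terminal z.
    `pure_strategy` maps each of the team's infoset ids to the chosen action."""
    x: Dict[int, int] = {}
    def visit(h: Node, plays_to_here: bool) -> None:
        if is_terminal(h):
            x[id(h)] = int(plays_to_here)
            return
        for a, child in h.children.items():
            ok = plays_to_here and (h.player != team or pure_strategy[h.infoset] == a)
            visit(child, ok)
    visit(root, True)
    return x
```

The realization form of a correlated strategy is then simply the corresponding convex combination of such 0/1 vectors.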
The best response of a team i is one of its strategies maximizing its expected utility given a fixed strategy of the opponent team −i. The central solution concept in ATGs is the Team-Maxmin Equilibrium with correlation (TMEcor), introduced in [7]. A TMEcor is a pair of correlated strategies, one per team, such that each team's correlated strategy is a best response to the opponent's.

### 2.2 Public information and the team belief DAG

We call a piece of information public if it is common knowledge to all players. Formally, two nodes h, h′ in the same level of the tree are indistinguishable to team i if there is some information set $I \in \mathcal I_i$ that connects a descendant of h to a descendant of h′. A public state is a connected component of the graph induced by the indistinguishability relations of both teams. A team-public state for team i is a connected component of the graph given by the indistinguishability relation of that team alone.

Zhang et al. [22] introduce the team belief DAG (TB-DAG), a representation of the decision problem faced by a team in a team game. For a team i, the nodes $S_i$ of the TB-DAG are partitioned into a set of observation nodes $O_i$, i.e., nodes in which the team observes information concerning the state of the game, and a set of decision points called beliefs, i.e., nodes in which the team performs actions. To disambiguate the notation, the sets of beliefs for teams ▲ and ▼ are denoted with $\mathcal B$ and $\mathcal C$, respectively. Intuitively, both beliefs and observation nodes identify a subset of H that is indistinguishable to the team, based on the common information its members have. In what follows, we introduce the structure and basic notation of ▲'s TB-DAG representation; ▼'s TB-DAG is defined identically.

The DAG is constructed recursively, starting from a root decision node that contains only the root node of the ATG and alternating decision and observation nodes along every possible path. More in detail, each edge leaving a belief B corresponds to a so-called prescription, i.e., a tuple that assigns an action to each infoset that has a nonempty intersection with B. Formally, the set of possible prescriptions at a belief B is defined as $\mathcal A(B) = \{\vec a \mid \vec a \in \times_{I_j \sqsubseteq B} A(I_j)\}$, where, for any infoset $I \in \mathcal I_\blacktriangle$ and belief $B \in \mathcal B$, we write $I \sqsubseteq B$ if $I \cap B \neq \emptyset$. Following a prescription $\vec a$ at a belief $B \in \mathcal B$, the TB-DAG transitions to the observation node identified by $\{h a_j \mid h \in I_j \sqsubseteq B\} \cup \{h a \mid h \in B \cap H_{-\blacktriangle},\, a \in A(h)\} \in O_\blacktriangle$. A belief $B \in \mathcal B$ is said to be terminal if it contains terminal nodes of the ATG. Finally, the set of possible observations that team ▲ can obtain from an observation node $O \in O_\blacktriangle$ coincides with the set of ▲'s team-public states with nonempty intersection with O. Following the observation of a team-public state P at observation node O, the DAG transitions to the decision node identified by O ∩ P. Let us remark that this construction ensures that each belief is a subset of a single team-public state. We denote ▲'s team-public state associated with belief $B \in \mathcal B$ as P(B), while the set of beliefs in a public state P is denoted as $\mathcal B[P] = \{B \in \mathcal B \mid B \subseteq P\}$. Given a public state P and a belief B, we say B ⪰ P if for any h ∈ B there exists h′ ∈ P such that h ⪰ h′.
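The recursive expansion just described can be sketched as follows. This is an illustrative sketch building on the `Node` objects from the previous snippet; the extra assumption is that every node carries a `public` label identifying its team-public state, whereas the real construction derives those labels from the indistinguishability relation.

```python
from collections import defaultdict
from itertools import product

def expand_belief(belief, team="max"):
    """Enumerate the prescriptions available at `belief` (a tuple of histories) and, for each
    one, the beliefs reached after the team observes its team-public state."""
    if all(is_terminal(h) for h in belief):
        return {}                                 # terminal beliefs are not expanded
    infosets = defaultdict(list)                  # infoset id -> nodes of the belief in it
    others = []                                   # nodes where chance or the opponent acts
    for h in belief:
        if is_terminal(h):
            continue
        (infosets[h.infoset] if h.player == team else others).append(h)
    iset_ids = sorted(infosets)
    transitions = {}
    # a prescription assigns one action to every infoset intersecting the belief
    for prescription in product(*(list(infosets[I][0].children) for I in iset_ids)):
        observation = []                          # the observation node this prescription reaches
        for I, a in zip(iset_ids, prescription):
            observation += [h.children[a] for h in infosets[I]]
        for h in others:                          # chance/opponent nodes contribute all children
            observation += list(h.children.values())
        next_beliefs = defaultdict(list)
        for h in observation:                     # split the observation node by team-public state
            next_beliefs[h.public].append(h)
        transitions[prescription] = [tuple(ns) for ns in next_beliefs.values()]
    return transitions
```

Repeating this expansion from the root belief (the singleton containing ∅), and merging identical beliefs, yields the DAG described above.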
For a TB-DAG, the strategy representation that we rely on is the TB-DAG form, which yields a strategy representation equivalent to, yet more compact than, the set of correlated strategies. Given a pure strategy for team ▲, the associated TB-DAG form is a vector $x \in \{0,1\}^{S_\blacktriangle}$ such that x[B] = 1 if and only if the pure strategy prescribes to play all the actions from ∅ to all the nodes h ∈ B, and no belief B′ ⊋ B satisfying the same property exists. Furthermore, for each observation node $B\vec a \in O_\blacktriangle$, $x[B\vec a] = 1$ if and only if x[B] = 1 and the pure strategy prescribes to play all the actions $a_j \in \vec a$ with probability 1. The TB-DAG form of a correlated strategy is obtained by taking the appropriate convex combination of pure strategies' TB-DAG forms. The polytopes of TB-DAG-form strategies for teams ▲ and ▼ are denoted as X and Y, respectively.

Throughout this work, it will be useful to consider the restriction of X and Y to the set of decision and observation nodes following a public state P. We denote such sets as $X_P$ and $Y_P$. Formally, the set $X_P$ is defined by the following linear constraints:

$$\sum_{\vec a \in \mathcal A(B)} x[B\vec a] = x[B] \quad \text{for non-terminal beliefs } B \succeq P, \qquad\qquad \sum_{B'\vec a \,\to\, B} x[B'\vec a] = x[B] \quad \text{for beliefs } B \succ P,$$

where the second sum ranges over the observation nodes having B as a child. The set $Y_P$ is defined similarly. Note that in the definition of $X_P$ and $Y_P$ the strategy values at the initial beliefs at P are unconstrained. Additional constraints can be introduced by the subgame solving algorithm used and will be discussed in Section 3.

The TB-DAG has worst-case exponential size, and therefore algorithms that build it fully cannot scale past a certain point. The crucial property we will need for this paper, though, is that pure strategies have sparse representations in the TB-DAG. In particular, every pure strategy plays to at most one belief in each team-public state.

## 3 Subgame solving for adversarial team games: Team-Maxmargin

### 3.1 Maxmargin

The maxmargin algorithm [15] is used in two-player zero-sum extensive-form games for the online refinement of a blueprint strategy (i.e., a suboptimal strategy for the whole game, computed on an abstracted, smaller version of the game). In particular, for every public state P reached by the players during the playthrough, maxmargin keeps the blueprint strategy fixed in the trunk (i.e., the part of the game leading to P), optimizing the player's strategy exclusively in the subgame rooted at P. Its main objective, roughly speaking, is to compute the strategy giving the largest decrease in exploitability when merged with the trunk strategy. For the sake of presentation, from here on we refer to ▲ as the team whose strategy is to be refined and to ▼ as the opponent. Furthermore, in this section we describe the standard setting in which each team is composed of a single member.

The maxmargin algorithm works with an auxiliary game called the gadget game. The gadget game is obtained by adding additional nodes on top of the subgame starting at P. The root of the gadget game is a newly-added ▼-node where the available moves, called deviations, allow ▼ to choose an infoset among those belonging to P. Therefore, ▼ can adversarially decide from which of its own infosets the subgame will restart. Each of those actions leads to a chance node, where chance chooses a specific history in the infoset chosen by the opponent, based on chance reach probabilities and on the trunk blueprint strategy for ▲. Once a history has been selected by chance, the following part of the gadget game is the same as the original subgame up to the terminal nodes. Furthermore, the payoffs associated with each terminal node following a specific initial choice of infoset by the opponent are decreased by the counterfactual best-response value at that infoset. This value is the best-response value of ▼ conditioned on the game reaching that specific infoset, assuming that ▲ is playing the blueprint. The change of the payoffs induces ▼ to restart the gadget game in the infoset in which the current strategy of the refining player is most exploitable. Fully solving the gadget game guarantees safety: if maxmargin subgame solving is applied during a playthrough, then the resulting strategy is at most as exploitable as the blueprint.
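For the two-player case just described, the two gadget ingredients (the chance distribution over the histories of the chosen infoset, and the payoff shift by the counterfactual best-response value) can be sketched as follows; the dictionary-based inputs are illustrative assumptions, not the paper's implementation.

```python
def gadget_chance_distribution(infoset_histories, blueprint_reach, chance_reach):
    """Probability with which the gadget's chance node picks each history of the infoset chosen
    by the opponent: proportional to the refining player's blueprint trunk reach times chance's
    reach (i.e., to the counterfactual reach of each history)."""
    weights = {h: blueprint_reach[h] * chance_reach[h] for h in infoset_histories}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def gadget_payoff(u_z, deviation_infoset, alt_value):
    """Payoff of terminal z in the subgame copy entered by deviating to `deviation_infoset`:
    the original payoff decreased by the counterfactual best-response value that the opponent
    already guarantees at that infoset against the blueprint."""
    return u_z - alt_value[deviation_infoset]
```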
The crucial issue when applying the maxmargin algorithm to an ATG is that the TB-DAG allows the same history to be reached through multiple beliefs, a phenomenon unique to team games. In particular, this issue means that it is difficult to formulate team subgame solving in terms of a gadget game in the same way as is done in the standard two-player case. While this issue could be trivially solved by unrolling the TB-DAG into an extensive-form game [5, 6], such an unrolling is inefficient as it would require an amount of extra space exponential in the original game size [5, 6, 22]. We will therefore formulate the gadget game directly in terms of a max-min optimization problem that does not immediately correspond to an EFG.

### 3.2 Team-maxmargin

We introduce the team-maxmargin algorithm, which extends maxmargin subgame solving to ATGs. The algorithm that we present here can be applied to any ATG, without any constraint on the number of players in each team. As mentioned above, in order to successfully apply subgame solving techniques to the TB-DAG representation of ATGs, we need to relate the TB-DAG-form polytopes of the two teams. To do so, we rebalance the reach of ▼'s strategy depending on the fixed trunk strategy of ▲ and nature. The normalization factor that we use is called the counterfactual reach.

**Definition 1 (Counterfactual reach).** Let x be the strategy for team ▲. For each ▼-belief C, the counterfactual reach ρ[x, C] is the total nature and ▲ reach at C, defined as follows:

$$\rho[x, C] \;:=\; \sum_{h \in C} x[h]\, p[h].$$

Counterfactual reach corresponds to the probability that ▲ and nature play to reach a specific belief C, given that ▼ plays a strategy y such that y[C] = 1. In the definition above, the counterfactual reach ρ[x, C] aggregates the realization probabilities of x and p over the histories h ∈ C so as to effectively represent the behavior of nature and ▲ in the polytope of the opponents. The counterfactual reach allows us to determine which of ▼'s beliefs are actually reachable given ▲'s strategy. This set of nodes is denoted as the set of counterfactually reachable beliefs.

**Definition 2 (Counterfactually reachable beliefs).** The set of all counterfactually reachable beliefs in a public state P is denoted with $\mathcal C[x, P]$. Formally:

$$\mathcal C[x, P] \;:=\; \{\, C \in \mathcal C[P] \;\mid\; \rho[x, C] > 0 \,\},$$

where $\mathcal C[P]$ denotes the set of ▼-beliefs in P, defined analogously to $\mathcal B[P]$.

Another concept needed to extend maxmargin to TB-DAGs is that of the counterfactual best-response value at a ▼-belief C, which is defined as the value of the best ▼ prescription at C.

**Definition 3 (Counterfactual best-response value).** Let x be the strategy for team ▲. For each decision node C, let u[x, C] be the unnormalized counterfactual ▼-best-response value, defined by the following recurrences:

$$u[x, C] = \begin{cases} \displaystyle\sum_{z \in C} x[z]\, p[z]\, u[z] & \text{if } C \text{ is terminal} \\[2mm] \displaystyle\min_{\vec a \in \mathcal A(C)} u[x, C\vec a] & \text{otherwise}, \end{cases} \qquad u[x, C\vec a] = \sum_{C' \,:\, C\vec a \to C'} u[x, C'],$$

whereas the normalized counterfactual ▼-best-response value at C against x is defined as

$$u^N[x, C] = \frac{u[x, C]}{\rho[x, C]}.$$

Notice that, in the definition of the counterfactual best-response value, the realization form induced by the TB-DAG-form strategy x and by chance is taken into account. On the other hand, when considering the normalized counterfactual best-response value, we condition on reaching a specific C; hence, to remove the contribution to realizations due to actions played in the trunk, we divide by the counterfactual reach.
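A minimal sketch (illustrative data layout, not the paper's code) of Definitions 1–3: the counterfactual reach of a ▼-belief, the counterfactually reachable beliefs of a public state, and the unnormalized and normalized counterfactual best-response values computed bottom-up by the recurrence above. A real implementation would memoize, since the ▼ sub-DAG shares beliefs.

```python
def counterfactual_reach(x, p, C):
    """rho[x, C]: total max-team (x) and nature (p) reach over the histories in belief C."""
    return sum(x[h] * p[h] for h in C)

def reachable_beliefs(x, p, beliefs_in_P):
    """C[x, P]: the min-team beliefs of the public state with strictly positive reach."""
    return [C for C in beliefs_in_P if counterfactual_reach(x, p, C) > 0]

def cbr_value(x, p, u, C, dag):
    """Unnormalized counterfactual best-response value u[x, C].
    `dag[C]` maps each min-team prescription at belief C to the list of child beliefs it
    leads to; terminal beliefs (containing only terminal histories) map to an empty dict."""
    if not dag[C]:                                        # terminal belief
        return sum(x[z] * p[z] * u[z] for z in C)
    return min(sum(cbr_value(x, p, u, Cc, dag) for Cc in children)
               for children in dag[C].values())

def normalized_cbr_value(x, p, u, C, dag):
    """Normalized value: condition on reaching C by dividing out the trunk reach."""
    return cbr_value(x, p, u, C, dag) / counterfactual_reach(x, p, C)
```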
Given the above definitions, we can now provide the optimization problem which the team-maxmargin algorithm needs to solve for the refinement of the strategy.

**Definition 4 (Team-maxmargin linear program).** The maxmargin optimization problem in a TB-DAG subgame rooted at P is formulated as the following linear program:

$$\max_{x'} \;\min_{y, \hat y}\quad \sum_{z \succeq P} x'[z]\, \hat y[z]\, p[z]\, u[z] \;-\; \sum_{C \in \mathcal C[x,P]} y[C]\, u[x, C] \tag{1}$$
$$\text{s.t.}\quad x'[B] = x[B] \qquad \text{for all ▲-beliefs } B \in \mathcal B[P] \tag{2}$$
$$\hat y[C] = y[C] \,/\, \rho[x, C] \qquad \text{for all ▼-beliefs } C \in \mathcal C[x, P] \tag{3}$$
$$x' \in X_P, \qquad \hat y \in Y_P, \qquad y \in \Delta^{\mathcal C[x,P]} \tag{4}$$

As in the original two-player formulation, ▼ can choose to deviate to any of the counterfactually reachable beliefs at the public state of the subgame by choosing $y \in \Delta^{\mathcal C[x,P]}$. Equation (1) specifies the maximization of the margin, i.e., the difference between the expected value of the subgame and the value of the counterfactual best response to ▲'s blueprint. This encodes the objective of ▲: find an x′ that maximizes its value with respect to the blueprint for any adversarially chosen deviation of the opponent. Equation (2) fixes the trunk strategy of ▲. Equation (3) rescales the reach probabilities of the opponent according to the counterfactual reach ρ[x, C]; this is equivalent to normalizing by the nature and ▲ reach probabilities at C. Such a normalization is crucial because it accounts for the fixed reach probabilities in the trunk for chance and ▲. While it may seem more intuitive to normalize by the reach probabilities of ▲ and chance directly at the terminal nodes, as is typically done in two-player zero-sum games, in team games it is actually necessary to do so at the beliefs in C[x, P]. This is a consequence of the fact that, when ▼ is composed of more than one player, the reach of ▲ and nature over terminal nodes depends directly on the deviation C ∈ C[x, P] chosen by ▼, and multiple such beliefs C can reach the same terminal node. This is a difficulty that only occurs in the team-vs.-team setting. Note that only the counterfactually reachable beliefs C[x, P] of ▼ are considered in this constraint, since they are the only potential deviations of the opponent in the gadget game. Equation (4) encompasses all the TB-DAG constraints on the strategy spaces of ▲ and ▼.

One may expect that, in parallel to the two-player case, our team-maxmargin optimization problem may also be representable as an ATG. However, the constraints that we have introduced into the formulation go beyond the scope of ATGs, so this is not directly possible. Despite this, the strategy spaces of both players in Definition 4 are DAGs, to which any technique for solving team games, such as CFR [24] or column generation, applies. A complete construction procedure of the gadget DAG is provided in Appendix A. Furthermore, in Appendix B, we extend other subgame solving techniques (i.e., resolving [4] and reach [1]) to the team setting.
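The adversarial structure of Definition 4 can be read as follows (an illustrative sketch; `subgame_value` stands in for the first term of (1) restricted to a single deviation, which in practice comes from the inner optimization over ŷ): for a candidate refined strategy x′ with the trunk fixed by (2), the opponent inspects every counterfactually reachable belief, compares the value ▲ secures there against the blueprint's alternative value u[x, C], and deviates to the belief with the smallest margin; team-maxmargin maximizes that smallest margin over x′ ∈ X_P.

```python
def worst_margin(x_prime, reachable_C, alt_value, subgame_value):
    """Smallest margin over the counterfactually reachable min-team beliefs, and the belief
    attaining it; this is the inner minimization of the team-maxmargin program."""
    margins = {C: subgame_value(x_prime, C) - alt_value[C] for C in reachable_C}
    worst = min(margins, key=margins.get)
    return margins[worst], worst
```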
**Safety of the refinement.** The team-maxmargin algorithm achieves the same safety guarantees as the maxmargin algorithm, by arguments similar to those in [15, 1]. Formally:

**Theorem 5.** Applying team-maxmargin subgame solving (Definition 4) to every subgame reached during play, in a nested fashion, results in ▲ playing a strategy with exploitability no worse than that of the blueprint.

A proof of Theorem 5 can be found in Appendix D.

**Algorithm 1: Maxmargin subgame solving with column generation, at public state P**

1: **Input:**
2: &nbsp;&nbsp; a sparse DAG-form blueprint x;
3: &nbsp;&nbsp; reach probabilities ρ[x, C] and alt values u[x, C] for all reachable ▼-beliefs C ∈ C[x, P].
4: Let $\bar{\mathcal B}$ be the set of ▲-beliefs in P with x[B] > 0.
5: $x_{B,1} \leftarrow$ GenerateArbitraryStrategy(B) for each $B \in \bar{\mathcal B}$. (For each time t and belief $B \in \bar{\mathcal B}$, $x_{B,t}$ is a pure strategy defined on the sub-DAG rooted at B; it represents a strategy that ▲ may take following belief B.)
6: For iteration t = 1, 2, . . . :
7: &nbsp;&nbsp; Solve the meta-game:
$$\max_{\lambda_B : B \in \bar{\mathcal B},\; \lambda_0 \ge 0} \;\min_{y, \hat y}\; \sum_{z \succeq P} x'[z]\, \hat y[z]\, p[z]\, u[z] \;-\; \sum_{C \in \mathcal C[x,P]} y[C]\, u[x, C]$$
$$\text{s.t.}\quad x' = \lambda_0\, x + \sum_{B \in \bar{\mathcal B}} \sum_{1 \le \tau \le t} \lambda_{B,\tau}\, x_{B,\tau}$$
$$\hat y[C] = y[C] \,/\, \rho[x, C] \qquad \text{for all ▼-beliefs } C \in \mathcal C[x, P]$$
$$\sum_{1 \le \tau \le t} \lambda_{B,\tau} = 1 - \lambda_0 \qquad \text{for all ▲-beliefs } B \in \bar{\mathcal B}$$
$$\lambda_{B,\tau} \ge 0,\; \lambda_0 \ge 0, \qquad \hat y \in Y_P, \qquad y \in \Delta^{\mathcal C[x,P]}$$
8: &nbsp;&nbsp; For each belief $B \in \bar{\mathcal B}$:
9: &nbsp;&nbsp;&nbsp;&nbsp; Solve for a best response to ŷ given belief B:
$$x_{B,t+1} \in \operatorname*{argmax}_{x' \in \{0,1\}^{H_B}} \; \sum_{z \succeq B} x'[z]\, \hat y[z]\, p[z]\, u[z] \tag{5}$$
$$\text{s.t.}\quad x'[ha]\, x'[h'] = x'[h'a]\, x'[h] \qquad \text{for all } h, h' \in I,\; I \in \mathcal I_\blacktriangle,\; a \in A(I)$$
$$x'[h] = 1 \qquad \text{for all } h \in B,$$
&nbsp;&nbsp;&nbsp;&nbsp; where $H_B$ is the set of nodes weakly following B.
10: Return x′ converted to sparse TB-DAG form.

**Team-maxmargin algorithm.** The team-maxmargin algorithm is applied similarly to the original maxmargin algorithm. If we want to locally refine a strategy x of ▲ while playing, team-maxmargin proceeds as follows: at each game turn, it considers the public state P reached and computes an approximate solution x′ to the linear program in Definition 4 (while respecting a time limit). Then x is updated for the next turns: x[B] ← x′[B] for every node B of the sub-DAG on which $X_P$ is defined. The blueprint strategy is used as the first strategy x. We solve the linear program by means of a column-generation procedure in which the current strategy x is given as an available choice. This guarantees that the solution found is not worse than the current strategy, even if a single iteration is performed.

## 4 Column generation for sparser solutions

In this section, we argue that, if the opponent is a single player (rather than a team) and the blueprint is sparse, then, by using column generation (e.g., Farina et al. [8]) as the solver, we can arrive at an online game-playing algorithm that runs in polynomial time (in |H|, per iteration) with the exception of a best-response oracle. The key observation is that the maxmargin formulation of subgame solving (Definition 4) has an input of size $|\bar{\mathcal B}| + |\mathcal C|$, where $\bar{\mathcal B}$ is the set of played beliefs for ▲ in the current public state and $\mathcal C$ is the set of all beliefs for ▼ in the current public state. Therefore, to have a hope of an efficient algorithm, we need both of these quantities to be small. Hence, for this section, let us assume that ▼ is a single player or a team whose DAG is small (so that $|\mathcal C|$ is small) and that the blueprint is sparse (so that $|\bar{\mathcal B}|$ is small, at least on the first iteration).

The algorithm is given in Algorithm 1. At a high level, for computing best responses, it splits the problem of solving a game starting at public state P into subproblems, one for each reachable ▲-belief B. This allows a finer mixing of strategies than would have been allowed otherwise. Given a best-response oracle (i.e., a solution method to the integer program (5)), the whole algorithm runs in time poly$(|H_P|, |\bar{\mathcal B}|, |\mathcal C[P]|, t)$, where $H_P$ is the set of nodes weakly following P and $\bar{\mathcal B}$ is the set of ▲-beliefs in P with positive reach. While NP-hard in the general case, integer programs are in practice reasonably fast to solve. In particular, if a polynomial-space algorithm such as depth-first branch-and-bound is used to solve the integer program, then the whole algorithm runs in polynomial space.
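To make the structure of Algorithm 1 concrete, here is a compact sketch (ours) of its column-generation loop; the meta-game LP solver, the best-response integer program, and the initial-strategy generator are passed in as placeholders, which in the experiments below would be backed by an off-the-shelf LP/IP solver.

```python
def team_maxmargin_cg(blueprint, played_beliefs, reachable_C, alt_value,
                      initial_strategy, solve_metagame, best_response, num_iters):
    """Column generation for the team-maxmargin program at one public state.
    `columns[B]` collects the pure sub-DAG strategies generated so far for each played belief B."""
    columns = {B: [initial_strategy(B)] for B in played_beliefs}           # step 5
    x_refined = blueprint
    for _t in range(num_iters):                                            # step 6
        # step 7: restricted meta-game over convex weights of the blueprint and the known
        # columns, against the worst rescaled deviation of the opponent
        x_refined, y_hat = solve_metagame(blueprint, columns, reachable_C, alt_value)
        for B in played_beliefs:                                           # steps 8-9
            # best-response integer program on the sub-DAG rooted at B, holding y_hat fixed
            # and forcing every history of B to be reached
            columns[B].append(best_response(B, y_hat))
    return x_refined                                                       # step 10
```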
Such a polynomial-space guarantee is in stark contrast to any algorithm that would expand the whole TB-DAG, as the TB-DAG takes worst-case exponential space.

**The creation of sparse blueprints.** Throughout this section, we have been assuming that the blueprints created are sparse, in particular, that in every public state there are at most polynomially many beliefs. There are two straightforward ways to guarantee this. The first is to use a natively sparse algorithm such as column generation to generate the blueprint in the first place. The second is to take an arbitrary blueprint in realization form and express it as a sparse convex combination via Carathéodory's theorem (which asserts that any convex combination of points in $\mathbb R^d$ can be expressed as a convex combination of at most d + 1 of them). Then, the blueprint will have support size at most |Z|. On the other hand, the support size of the blueprint generated by a CG algorithm scales linearly with the number of iterations, which, under reasonable time constraints, rarely exceeds the hundreds. Thus, the far smaller support size of the blueprint yielded by CG is a point in favour of adopting the former family of algorithms for blueprint generation.

## 5 Experimental evaluation

### 5.1 Experimental setting

**Game instances.** In our evaluation, we resort to game instances whose exact solution can be computed in practice by standard algorithms available in the literature. This choice allows us to calculate the equilibrium value, which is necessary to appropriately assess how our subgame solving method approximates the exact solution as the time spent on the strategy refinement increases. In particular, our test suite is composed of parametric versions of the ATG instances customarily adopted in the literature where the adversary team is composed of a single player: Kuhn [11] and Leduc [18] poker with team collusion, team Liar's Dice [13], and Tricks [21] (a simplified Bridge endgame). For space reasons, we omit the rules of these games, referring the interested reader to the aforementioned articles. We use the following labels:

- Knr: n-player Kuhn poker with r ranks;
- Lnbrs: n-player Leduc poker with b bets in each betting round, r ranks, and s suits;
- Dnd: n-player Liar's Dice with one d-sided die per player;
- Td: three-player trick-taking game with three ranks, four suits, and a limited number of possible deals, fixed to d.

We also use a binary string of length n to denote the assignment of teams, where the i-th character is 1 if and only if the i-th player is on team ▲.

**Blueprint and strategy refinement time limits.** As usual with subgame solving methods, the performance depends on the quality of the blueprint and on the time spent on its refinement after every single move in the game. To provide coherent results over multiple, different game instances, we set both of these time limits depending on the complexity of the game.
More precisely, the blueprint computation is stopped once 10 minutes have elapsed or column generation has achieved a Nash gap of Δ/10, where Δ is the difference between the maximum and minimum payoffs of team ▲, whichever comes first. We remark that the main goal of our experimental evaluation is to show that, no matter the suboptimality of the blueprint strategy and the amount of computation allocated to the subgame solving routine, the local re-optimization performed under our subgame solving framework allows us to improve the starting strategy. The choice of focusing on cases in which the blueprint strategy is suboptimal, and in which the computational budget available for subgame solving is (even severely) limited, goes precisely in that direction. To create the strategy that would be played by Algorithm 1, we perform refinement at every public state, in a top-down order. We use a range of time limits for the strategy refinement, defined as the average time needed by a single iteration of the CG algorithm at the root of the whole game, multiplied by a number α ∈ {0, 1, . . . , 10}. Notice that the value obtained with α = 0 is the value provided by the blueprint, while using α ≥ 1 ensures that at least one iteration of strategy refinement is done after every move in the game. Each experiment was allocated 32 CPU cores and 256 GB RAM on a cluster machine. Integer and linear programs were solved with Gurobi 9.5.

Table 1: Team ▲'s values against a best-responding opponent when playing the equilibrium strategy, the blueprint, or the refined strategy with α = 5, and the corresponding algorithm running times. When the equilibrium computation requires more than 2 hours, we use the equilibrium values provided in [21]. The table summarizes the experiments for different game instances (see Section 5.1) and team structures (the i-th element of the vector equal to 1 means that player i is in team ▲). Equilibrium time, refinement time (per move), and blueprint time denote, respectively, the time needed to find an exact equilibrium via CG, the time allocated to a single iteration of subgame solving, and the time allotted to blueprint computation. Equilibrium value indicates the expected utility of the team at the equilibrium, while refinement value and blueprint value represent the team's expected utility when playing against a best-responding opponent team and adopting, respectively, the refined strategy and the blueprint strategy. The gap reduction column quantifies the improvement of the refined strategy over the blueprint strategy (see Section 5.2 for a formal definition of gap).

| Game instance | Team structure | Equilibrium time | Refinement time (per move) | Blueprint time | Equilibrium value | Refinement value | Blueprint value | Gap reduction |
|---|---|---|---|---|---|---|---|---|
| K34 | 110 | 0.05s | 0.02s | 0.02s | -0.042 | -0.050 | -0.061 | 58.0% |
| K36 | 110 | 0.55s | 0.07s | 0.03s | -0.024 | -0.055 | -0.098 | 58.3% |
| K38 | 110 | 2.58s | 0.20s | 0.04s | -0.019 | -0.031 | -0.152 | 91.2% |
| K312 | 110 | 10.73s | 0.74s | 0.29s | -0.014 | -0.024 | -0.064 | 78.9% |
| K45 | 1110 | 6.92s | 0.54s | 0.22s | -0.030 | -0.046 | -0.354 | 95.2% |
| L3133 | 110 | 6m 39s | 0.57s | 1.59s | 0.215 | 0.038 | -0.616 | 78.7% |
| L3143 | 110 | >2h | 2.59s | 6.21s | 0.107 | -0.133 | -0.673 | 69.2% |
| L3151 | 110 | 2m 46s | 1.27s | 2.80s | -0.019 | -0.399 | -0.624 | 37.3% |
| L3153 | 110 | >2h | 4.91s | 8.84s | 0.024 | -0.177 | -0.681 | 71.5% |
| L3223 | 110 | 7m 44s | 1.02s | 12.28s | 0.516 | -0.320 | -1.299 | 54.0% |
| L3523 | 110 | >2h | 3m 21s | 10m 46s | 0.953 | -1.151 | -6.671 | 72.4% |
| D33 | 110 | 3m 55s | 1.06s | 9.15s | 0.284 | 0.279 | 0.238 | 90.2% |
| D34 | 110 | >2h | 33.21s | 10m 11s | 0.284 | 0.241 | 0.139 | 70.0% |
| D62 | 111110 | >2h | 40.22s | 10m 11s | 0.333 | 0.167 | -0.000 | 50.0% |
| T350 | 101 | >2h | 0.49s | 1.09s | 0.600 | 0.509 | 0.352 | 63.3% |
| T3100 | 101 | >2h | 1.16s | 1.85s | 0.710 | 0.514 | 0.495 | 8.5% |

### 5.2 Discussion of the results

For every game instance, we computed the best-response values for the opponent against the following strategies: the equilibrium strategy, the blueprint strategy, and the strategy returned by our refinement method for different values of α. We will use gap to refer to the difference between a value and the equilibrium value.
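As a concrete check of these quantities (our arithmetic, using the rounded values reported for K38 in Table 1; the gap-reduction index is defined formally in the next paragraph):

$$E = -0.019, \qquad R = -0.031, \qquad B = -0.152,$$
$$\text{gap}(\text{refined}) = E - R = 0.012, \qquad \text{gap}(\text{blueprint}) = E - B = 0.133,$$
$$\text{gap reduction} = 1 - \frac{E - R}{E - B} = 1 - \frac{0.012}{0.133} \approx 91\%,$$

close to the 91.2% reported in the table; the small discrepancy comes from the three-decimal rounding of the tabulated values.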
Intuitively, the gap shows the relative quality of the blueprint when compared with the equilibrium. We use a performance index based on the concept of gap that measures the capability of our refinement method to reduce the gap. We call it gap reduction. It is defined as 1 − (E − R)/(E − B), where E, R, and B are the values of the equilibrium, refined, and blueprint strategies, respectively. The values provided by the strategies when α = 5 are reported in Table 1, together with the algorithms' runtimes. We initially observe that our experimental evaluation confirms the maxmargin theoretical results, as the team's exploitability never increases. Most importantly, for the majority of the game instances, the gap reduction is dramatic (on average, about 65%). These results are coherent with the results achieved in the literature for two-player zero-sum games [1, 15].

We also investigate how the performance of our strategy refinement method varies as α varies. In Figure 1, we report the results for the two most representative instances, together with their blueprint and equilibrium values. The plots and tables for the entire test suite are in Appendix C. Interestingly, we observe that, for the vast majority of the game instances, the curve describing the improvement provides the maximum increase already at the first iteration of strategy refinement (corresponding to α = 1). In particular, the average gap reduction is about 39% when α = 1, showing that a time limit equal to the time needed for a single CG iteration at the root of the tree is sufficient to get a value very close to the equilibrium value. We also observe that the equilibrium value is never completely recovered in the test suite. This is not surprising, as strategy refinement methods perform a local reoptimization. The improvement is sometimes not monotonically increasing in the number of iterations; this is due to the nature of CG algorithms.

## 6 Conclusions

In this paper, we propose subgame solving algorithms for adversarial team games based on the team belief DAG. In team-vs.-player games, when applied with column generation as both the blueprint generator and the subgame solving algorithm, our techniques run efficiently modulo a best-response oracle, which in practice can be implemented via an integer program solver. In general team-vs.-team games, our method addresses several technical difficulties that do not arise in the two-player setting, most notably the normalization of reach probabilities required by maxmargin. In the experiments, we find that applying any amount of subgame solving to a blueprint, even a single iteration of column generation, is significantly superior to not applying it. More precisely, on a standard test suite, our method reduces the gap between the equilibrium value and the value provided by the blueprint by 39% on average even when α = 1, and by up to 65% when α = 5. In the future, we believe that our contributions can allow the scaling of subgame solving techniques to very large team games, where column-generation-like methods such as PSRO [12] will likely be the fundamental building block for practical solutions.

**Acknowledgements.** This material is based on work supported by the National Science Foundation under grants IIS1901403 and CCF-1733556, and the ARO under award W911NF2010081.

Figure 1: Value of the team's refined strategy as the refinement time limit α varies.
The blue dashed line at the bottom is the blueprint value, while the black dotted line at the top is the equilibrium value.

## References

[1] N. Brown and T. Sandholm. Safe and nested subgame solving for imperfect-information games. In NIPS, 2017.
[2] N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359:418–424, 2018.
[3] N. Brown, T. Sandholm, and B. Amos. Depth-limited solving for imperfect-information games. In NeurIPS, 2018.
[4] N. Burch, M. B. Johanson, and M. Bowling. Solving imperfect information games using decomposition. In AAAI, 2014.
[5] L. Carminati, F. Cacciamani, M. Ciccone, and N. Gatti. A marriage between adversarial team games and 2-player games: Enabling abstractions, no-regret learning, and subgame solving. In ICML, 2022.
[6] L. Carminati, F. Cacciamani, M. Ciccone, and N. Gatti. Public information representation for adversarial team games. arXiv, abs/2201.10377, 2022.
[7] A. Celli and N. Gatti. Computational results for extensive-form adversarial team games. In AAAI, 2018.
[8] G. Farina, A. Celli, N. Gatti, and T. Sandholm. Ex ante coordination and collusion in zero-sum multi-player extensive-form games. In NeurIPS, 2018.
[9] G. Farina, A. Celli, N. Gatti, and T. Sandholm. Connecting optimal ex-ante collusion in teams to extensive-form correlation: Faster algorithms and positive complexity results. In ICML, 2021.
[10] S. K. Jakobsen, T. B. Sørensen, and V. Conitzer. Timeability of extensive-form games. In ITCS, pages 191–199, 2016.
[11] H. Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97–103, 1951.
[12] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In NeurIPS, pages 4190–4203, 2017.
[13] V. Lisý, M. Lanctot, and M. Bowling. Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In AAMAS, 2015.
[14] H. B. McMahan, G. J. Gordon, and A. Blum. Planning in the presence of cost functions controlled by an adversary. In ICML, 2003.
[15] M. Moravčík, M. Schmid, K. Ha, M. Hladík, and S. J. Gaukrodger. Refining subgames in large imperfect information games. In AAAI, 2016.
[16] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. B. Johanson, and M. H. Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356:508–513, 2017.
[17] T. Sandholm. Abstraction for solving large incomplete-information games. In AAAI, 2015.
[18] F. Southey, M. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and D. C. Rayner. Bayes' bluff: Opponent modelling in poker. In UAI, 2005.
[19] B. von Stengel and D. Koller. Team-maxmin equilibria. Games and Economic Behavior, 21:309–321, 1997.
[20] B. H. Zhang and T. Sandholm. Team correlated equilibria in zero-sum extensive-form games via tree decompositions. In AAAI, 2021.
[21] B. H. Zhang, G. Farina, A. Celli, and T. Sandholm. Optimal correlated equilibria in general-sum extensive-form games: Fixed-parameter algorithms, hardness, and two-sided column-generation. In Proceedings of the 23rd ACM Conference on Economics and Computation, 2022.
[22] B. H. Zhang, G. Farina, and T. Sandholm. Team belief DAG form: A concise representation for team-correlated game-theoretic decision making. arXiv, abs/2202.00789, 2022.
[23] Y. Zhang and B. An. Computing ex ante coordinated team-maxmin equilibria in zero-sum multiplayer extensive-form games. In AAAI, 2021.
[24] M. A. Zinkevich, M. B. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In NeurIPS, 2007.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]