# Game Redesign in No-regret Game Playing

Yuzhe Ma, Young Wu, Xiaojin Zhu
University of Wisconsin-Madison
{yzm234, yw, jerryzhu}@cs.wisc.edu

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)

Abstract

We study the game redesign problem, in which an external designer has the ability to change the payoff function in each round but incurs a design cost for deviating from the original game. The players apply no-regret learning algorithms to repeatedly play the changed games with limited feedback. The goals of the designer are to (i) incentivize the players to take a specific target action profile frequently and (ii) incur a small cumulative design cost. We present game redesign algorithms with the guarantee that the target action profile is played in $T - o(T)$ rounds while incurring only $o(T)$ cumulative design cost. Simulations on four classic games confirm the effectiveness of our proposed redesign algorithms.

1 Introduction

Consider a finite normal-form game with loss function $\ell^o$. This is the original game. As an example, the Volunteer's Dilemma (see Table 1) has each player choose whether or not to volunteer for a cause that benefits all players. It is known that all pure Nash equilibria in this game involve a subset of the players free-riding on the contribution of the remaining players. $M$ players, who initially do not know $\ell^o$, use no-regret algorithms to individually choose their actions in each of the $t = 1, \ldots, T$ rounds. The players receive limited feedback: if the chosen action profile in round $t$ is $a^t = (a^t_1, \ldots, a^t_M)$, then the $i$-th player only receives her own loss $\ell^o_i(a^t)$, not the other players' actions or losses.

Game redesign is the following task. A game designer (not a player) does not like the solution concept of $\ell^o$. Instead, the designer wants to incentivize a target action profile $a^*$, for example "every player volunteers". The designer has the power to redesign the game: before each round $t$ is played, the designer can change $\ell^o$ to some $\ell^t$. The players then receive the new losses $\ell^t_i(a^t)$, but the designer pays a design cost $C(\ell^o, \ell^t, a^t)$ in that round for deviating from $\ell^o$. The designer's goal is to make the players play the target action profile $a^*$ in the vast majority ($T - o(T)$) of rounds while incurring $o(T)$ cumulative design cost.

Game redesign naturally emerges in two opposing contexts. A benevolent designer (interested party) wants to redesign the game to improve social welfare, as in the Volunteer's Dilemma; this is the motivation behind k-implementation [Monderer and Tennenholtz, 2004]. A malicious designer (attacker) wants to poison the payoffs to force a nefarious target action profile; this is an extension of reward-poisoning attacks (previously studied on bandits [Jun et al., 2018; Liu and Shroff, 2019; Ma et al., 2018; Yang et al., 2021; Guan et al., 2020; Garcelon et al., 2020; Bogunovic et al., 2021; Zuo, 2020; Lu et al., 2021] and reinforcement learning [Zhang et al., 2020; Ma et al., 2019; Rakhsha et al., 2020; Sun et al., 2020; Huang and Zhu, 2019]) to game playing. For both contexts the mathematical question is the same.

Since the design costs are measured by deviations from the original game $\ell^o$, the designer is not totally free in creating new games. Our idea for successful game redesign is:
1. Do not change the loss of the target action profile, i.e. let $\ell^t(a^*) = \ell^o(a^*)$ for all $t$. If game redesign is indeed successful, then $a^*$ will be played for $T - o(T)$ rounds. As we will see, $\ell^t(a^*) = \ell^o(a^*)$ means there is no design cost in those rounds under our definition of $C$.
The remaining rounds incur at most $o(T)$ cumulative design cost.
2. Make the target action profile $a^*$ a strictly dominant strategy equilibrium. This ensures that no-regret players will eventually learn to prefer $a^*$ over any other action profile.

Game redesign is closely related to the k-implementation problem [Monderer and Tennenholtz, 2004]. Both aim to manipulate player behaviors by changing the payoffs. However, there are major differences: k-implementation assumes the players know the game, while in our case the players have to learn the game; and k-implementation only allows increases to existing payoffs, while we allow both positive (subsidy) and negative (tax) changes. Our interior design (Algorithm 1) in fact produces a 0-implementation in their terminology, because we keep the payoff of the desired strategy profile unchanged. Nonetheless, our players have to discover this strategy profile by exploration, meaning that the designer still incurs costs, especially in earlier rounds. More broadly, game redesign is related to, but distinct from, constrained mechanism design: the players in game redesign are no-regret learners, not rational (best-response) players of a repeated game.

2 Formal Definition

We first describe the original game without the designer. There are $M$ players. Let $\mathcal{A}_i$ be the finite action space of player $i$, and let $A_i = |\mathcal{A}_i|$. The original game is defined by the loss function $\ell^o : \mathcal{A}_1 \times \cdots \times \mathcal{A}_M \to \mathbb{R}^M$. The players do not know $\ell^o$. Instead, we assume they play the game for $T$ rounds using no-regret algorithms. This may be the case, for example, if the players are learning an approximate Nash equilibrium in a zero-sum $\ell^o$ or a coarse correlated equilibrium in a general-sum $\ell^o$. In running the no-regret algorithm, the players maintain their own action selection policies $\pi^t_i \in \Delta_{\mathcal{A}_i}$ over time, where $\Delta_{\mathcal{A}_i}$ is the probability simplex over $\mathcal{A}_i$. In each round $t$, every player $i$ samples an action $a^t_i$ according to policy $\pi^t_i$. This forms an action profile $a^t = (a^t_1, \ldots, a^t_M)$. The original game produces the loss vector $\ell^o(a^t) = (\ell^o_1(a^t), \ldots, \ell^o_M(a^t))$. However, player $i$ only observes her own loss value $\ell^o_i(a^t)$, not the other players' losses or their actions. All players then update their policies according to their no-regret algorithms.

We now bring in the designer. The designer knows $\ell^o$ and wants the players to frequently play an arbitrary but fixed target action profile $a^*$. We stress that $a^*$ does not need to coincide with any solution concept of $\ell^o$. At the beginning of round $t$, the designer commits to a potentially different loss function $\ell^t$. Note this involves preparing the loss vector $\ell^t(a)$ for all action profiles $a$ (i.e. all cells in the payoff matrix). The players then choose their action profile $a^t$. Importantly, the players receive the losses $\ell^t(a^t)$, not $\ell^o(a^t)$. For example, in games involving money such as the volunteer game, the designer may achieve $\ell^t(a^t)$ via taxes or subsidies; in zero-sum games such as rock-paper-scissors, the designer essentially makes up a new outcome and tells each player whether they win, tie, or lose via $\ell^t_i(a^t)$. The designer incurs a cost $C(\ell^o, \ell^t, a^t)$ for deviating from $\ell^o$. The interaction between the designer and the players is summarized below.

Protocol: Game Redesign
The designer knows $\ell^o$, $a^*$, $M$, $\mathcal{A}_{1:M}$, and the players' no-regret rate $\alpha$.
for $t = 1, \ldots, T$ do
  The designer prepares a new loss function $\ell^t$.
  The players form the action profile $a^t = (a^t_1, \ldots, a^t_M)$, where $a^t_i \sim \pi^t_i$ for all $i \in [M]$.
  Each player $i$ observes its own loss $\ell^t_i(a^t)$ and updates its policy $\pi^t_i$.
  The designer incurs cost $C(\ell^o, \ell^t, a^t)$.
end for
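To make the interaction order concrete, the following Python sketch mirrors the protocol above. It is only an illustration: the `Designer` and `Player` interfaces (`redesign`, `act`, `update`, `cost`, `target`) are hypothetical placeholders, not part of the paper.

```python
import numpy as np

def run_redesign(designer, players, ell_o, T, seed=0):
    """One pass of the game-redesign protocol (hypothetical interfaces).

    designer.redesign(t)           -> loss tensor ell_t of shape A_1 x ... x A_M x M
    players[i].act(rng)            -> action index a_i sampled from pi_i^t
    players[i].update(a_i, loss_i) -> no-regret update from bandit feedback
    designer.cost(ell_o, ell_t, a) -> per-round design cost C(ell^o, ell^t, a^t)
    """
    rng = np.random.default_rng(seed)
    target_plays, total_cost = 0, 0.0
    for t in range(1, T + 1):
        ell_t = designer.redesign(t)                  # designer commits to ell^t
        a = tuple(p.act(rng) for p in players)        # players form a^t = (a^t_1, ..., a^t_M)
        for i, p in enumerate(players):
            p.update(a[i], ell_t[a][i])               # player i only sees ell^t_i(a^t)
        total_cost += designer.cost(ell_o, ell_t, a)  # designer pays the per-round cost
        target_plays += int(a == designer.target)     # track N^T(a*)
    return target_plays, total_cost
```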
The designer has two simultaneous goals:
1. To incentivize the players to frequently choose the target action profile $a^*$ (which may not coincide with any solution concept of $\ell^o$). Let $N^T(a) = \sum_{t=1}^T \mathbf{1}[a^t = a]$ be the number of times an action profile $a$ is chosen in $T$ rounds; the goal is to achieve $\mathbb{E}[N^T(a^*)] = T - o(T)$.
2. To keep the cumulative design cost $C^T := \sum_{t=1}^T C(\ell^o, \ell^t, a^t)$ small, specifically $\mathbb{E}[C^T] = o(T)$.

The per-round design cost $C(\ell^o, \ell^t, a)$ is application dependent. One plausible choice accounts for the overall cost over all action profiles, not just the one actually chosen; an example is $C(\ell^o, \ell^t, a^t) = \sum_a \|\ell^o(a) - \ell^t(a)\|_1$. Note that this cost ignores the $a^t$ argument. In many applications, though, only the chosen action profile costs the designer (the implementation cost in [Monderer and Tennenholtz, 2004]); an example is $C(\ell^o, \ell^t, a^t) = \|\ell^o(a^t) - \ell^t(a^t)\|_1$. We use a slight generalization of the latter cost:

Assumption 1. The non-negative designer cost function $C$ satisfies, for all $t$ and $a^t$, $C(\ell^o, \ell^t, a^t) \le \eta \|\ell^o(a^t) - \ell^t(a^t)\|_p$ for some Lipschitz constant $\eta$ and norm $p \ge 1$. This implies no design cost if the losses are not modified, i.e., when $\ell^o(a^t) = \ell^t(a^t)$ we have $C(\ell^o, \ell^t, a^t) = 0$.

3 Assumption: No-Regret Players

The designer assumes that the players each run a no-regret learning algorithm such as EXP3.P [Bubeck and Cesa-Bianchi, 2012]. It is well known that for two-player ($M = 2$) zero-sum games, no-regret learners can find an approximate Nash equilibrium [Blum, 2007]. More general results show that for multi-player ($M \ge 2$) general-sum games, no-regret learners can find an approximate coarse correlated equilibrium [Hart and Mas-Colell, 2000]. We first define a player's regret. We use $a^t_{-i}$ to denote the actions selected by all players except player $i$ in round $t$.

Definition 1 (Regret). For any player $i$, the best-in-hindsight regret with respect to a sequence of loss functions $\ell^t_i(\cdot, a^t_{-i})$, $t \in [T]$, is defined as
$$R^T_i = \sum_{t=1}^{T} \ell^t_i(a^t_i, a^t_{-i}) - \min_{a_i \in \mathcal{A}_i} \sum_{t=1}^{T} \ell^t_i(a_i, a^t_{-i}). \qquad (1)$$
The expected regret is defined as $\mathbb{E}[R^T_i]$, where the expectation is taken with respect to the randomness in the selection of the actions $a^t$, $t \in [T]$, over all players.

Remark. The loss functions $\ell^t_i(\cdot, a^t_{-i})$, $t \in [T]$, depend on the actions $a^t_{-i}$ selected by the other players, and $a^t_{-i}$ in turn depends on the actions $a^1, \ldots, a^{t-1}$ of all players in the first $t-1$ rounds. Therefore, $\ell^t_i(\cdot, a^t_{-i})$ depends on $a^1_i, \ldots, a^{t-1}_i$. That means, from player $i$'s perspective, the player faces a non-oblivious (adaptive) adversary [Slivkins, 2019].

Remark. The minimizer $\arg\min_{a_i \in \mathcal{A}_i} \sum_{t=1}^T \ell^t_i(a_i, a^t_{-i})$ in (1) is meant as a baseline in which player $i$ plays the best-in-hindsight action in all rounds $t \in [T]$. Such a baseline action would have caused the other players to change their plays away from $a^1_{-i}, \ldots, a^T_{-i}$; however, this is disregarded in (1). For this reason, (1) is not fully counterfactual and is called the best-in-hindsight regret [Bubeck and Cesa-Bianchi, 2012]. The same is true when we define the expected regret.

Our key assumption is that the learners achieve sublinear regret. This assumption is satisfied by standard bandit algorithms such as EXP3.P [Bubeck and Cesa-Bianchi, 2012].

Assumption 2 (No-regret Learner). We assume each player applies a no-regret learning algorithm that achieves expected regret $\mathbb{E}[R^T_i] = O(T^{\alpha})$ for all $i$, for some $\alpha \in [0, 1)$.
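For concreteness, here is a minimal EXP3-style bandit learner that fits the `act`/`update` interface of the protocol sketch above. It is a simplified stand-in for EXP3.P, which additionally mixes in explicit exploration and uses confidence terms to obtain high-probability regret bounds [Bubeck and Cesa-Bianchi, 2012]; the learning rate below is a common textbook choice, not necessarily the one used in the paper's experiments.

```python
import numpy as np

class Exp3Player:
    """Minimal EXP3-style learner over a finite action set (a sketch, not EXP3.P)."""

    def __init__(self, n_actions, horizon, L=-1.0, U=1.0):
        self.n = n_actions
        self.L, self.U = L, U                      # assumed loss range [L, U]
        self.eta = np.sqrt(2.0 * np.log(n_actions) / (n_actions * horizon))
        self.cum_est = np.zeros(n_actions)         # cumulative importance-weighted loss estimates

    def policy(self):
        w = np.exp(-self.eta * (self.cum_est - self.cum_est.min()))  # shift for stability
        return w / w.sum()

    def act(self, rng):
        return int(rng.choice(self.n, p=self.policy()))

    def update(self, action, loss):
        # Rescale the observed loss to [0, 1], then form the importance-weighted estimate.
        p = self.policy()[action]
        scaled = (loss - self.L) / (self.U - self.L)
        self.cum_est[action] += scaled / p
```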
4 Game Redesign Algorithms

There is an important consideration regarding the allowed values of $\ell^t$. The original game $\ell^o$ has a set of natural loss values $\mathcal{L}$. For example, in the rock-paper-scissors game $\mathcal{L} = \{-1, 0, 1\}$, corresponding to the player winning (recall the value is a loss), tying, and losing, respectively; for games involving money it is often reasonable to take $\mathcal{L}$ to be some interval. Ideally, $\ell^t$ should take values in $\mathcal{L}$ to match the semantics of the game or to avoid suspicion (in the attack context). Our designer can work with discrete $\mathcal{L}$ (Section 4.3); but for exposition we first allow $\ell^t$ to take real values in the interval $[L, U]$, where $L = \min_{x \in \mathcal{L}} x$ and $U = \max_{x \in \mathcal{L}} x$. We assume $U$ and $L$ are the same for all players and $U > L$, which is satisfied whenever $\mathcal{L}$ contains at least two distinct values.

4.1 Algorithm: Interior Design

The name refers to the narrow applicability of Algorithm 1: the original loss values for the target action profile, $\ell^o(a^*)$, must all lie in the interior of $[L, U]$. Formally, we require that there exists $\rho \in (0, \frac{U-L}{2}]$ such that $\ell^o_i(a^*) \in [L+\rho, U-\rho]$ for all $i$. In Algorithm 1, we present the interior design. The key insight is to keep $\ell^o(a^*)$ unchanged: if the designer is successful, $a^*$ will be played in $T - o(T)$ rounds, and in these rounds the design cost is zero. The other $o(T)$ rounds each incur a bounded cost. Overall, this ensures sublinear design cost. For the design to be successful, the designer can make $a^*$ the strictly dominant strategy profile. The designer can do this by judiciously increasing or decreasing the losses of the other action profiles in $\ell^o$: there is enough room because $\ell^o(a^*)$ is in the interior. In fact, the designer can use a time-invariant game $\ell^t = \tilde\ell$, as Algorithm 1 shows.

Algorithm 1 Interior Design
Input: the target action profile $a^*$; the original game $\ell^o$.
Output: a time-invariant game $\tilde\ell$ constructed as follows:
$$\forall i, a: \quad \tilde\ell_i(a) = \begin{cases} \ell^o_i(a^*) - \left(1 - \frac{d(a)}{M}\right)\rho & \text{if } a_i = a^*_i, \\ \ell^o_i(a^*) + \frac{d(a)}{M}\rho & \text{if } a_i \neq a^*_i, \end{cases} \qquad (2)$$
where $d(a) = \sum_{j=1}^M \mathbf{1}[a_j = a^*_j]$.

Lemma 3. The redesigned game (2) satisfies:
1. $\tilde\ell_i(a) \in [L, U]$ for all $i, a$; thus $\tilde\ell$ is valid.
2. For every player $i$, the target action $a^*_i$ strictly dominates any other action by $(1 - \frac{1}{M})\rho$, i.e., $\tilde\ell_i(a_i, a_{-i}) = \tilde\ell_i(a^*_i, a_{-i}) + (1 - \frac{1}{M})\rho$ for all $i$, all $a_i \neq a^*_i$, and all $a_{-i}$.
3. $\tilde\ell(a^*) = \ell^o(a^*)$.
4. If the original loss for the target action profile $\ell^o(a^*)$ is zero-sum, then the redesigned game $\tilde\ell$ is also zero-sum.

Our main result is that Algorithm 1 achieves the design goal with sublinear cumulative design cost. It is worth noting that although many entries in the redesigned game $\tilde\ell$ can appear quite different from the original game $\ell^o$, their contribution to the design cost is small because the design discourages them from being played often.

Theorem 4. Using Algorithm 1, the designer can achieve $\mathbb{E}[N^T(a^*)] = T - O(MT^{\alpha})$ while incurring expected cumulative design cost $\mathbb{E}[C^T] = O(\eta M^{1+1/p} T^{\alpha})$.

Corollary 5. If the players use EXP3.P, the designer can achieve $\mathbb{E}[N^T(a^*)] = T - O(MT^{1/2})$ while incurring expected cumulative design cost $\mathbb{E}[C^T] = O(\eta M^{1+1/p} T^{1/2})$.

If the original game $\ell^o$ is two-player zero-sum, then under redesign the players will think that $a^*$ is a Nash equilibrium.

Corollary 6. Assume $M = 2$ and $\ell^o$ is zero-sum. Then with the redesigned game (2), the expected averaged policy $\mathbb{E}[\bar\pi^T_i] = \mathbb{E}[\frac{1}{T}\sum_{t=1}^T \pi^t_i]$ converges to a point mass on $a^*_i$.
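Equation (2) is simple enough to transcribe directly. In the sketch below (the array layout is our own convention: `ell_o` has shape $A_1 \times \cdots \times A_M \times M$ and `a_star` is a tuple of target action indices), the returned array is the time-invariant game of Algorithm 1.

```python
import itertools
import numpy as np

def interior_design(ell_o, a_star, rho):
    """Construct the time-invariant redesigned game of equation (2)."""
    M = len(a_star)
    ell = np.empty_like(ell_o, dtype=float)
    for a in itertools.product(*(range(n) for n in ell_o.shape[:-1])):
        d = sum(a[j] == a_star[j] for j in range(M))   # d(a): players playing their target action
        for i in range(M):
            if a[i] == a_star[i]:
                ell[a + (i,)] = ell_o[a_star + (i,)] - (1 - d / M) * rho
            else:
                ell[a + (i,)] = ell_o[a_star + (i,)] + (d / M) * rho
    return ell
```

At $a = a^*$ we have $d(a) = M$, so the first branch leaves $\ell^o(a^*)$ unchanged, matching property 3 of Lemma 3.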
4.2 Boundary Design

When $\ell^o(a^*)$ has some values hitting the boundary of $[L, U]$, the designer cannot apply Algorithm 1 directly, because the losses of the other action profiles cannot be increased or decreased further to make $a^*$ a dominant strategy profile. However, a time-varying design can still achieve the design goals with sublinear design cost. In Algorithm 2, we present the boundary design, which is applicable to both boundary and interior $\ell^o(a^*)$ values.

Algorithm 2 Boundary Design
Input: the target action profile $a^*$; a loss vector $v \in \mathbb{R}^M$ whose elements are in the interior, i.e., $v_i \in [L+\rho, U-\rho]$ for all $i$ and some $\rho > 0$; the regret rate $\alpha$; $\epsilon \in (0, 1-\alpha)$.
Output: a time-varying game with losses $\ell^t$, $t \in [T]$.
1: Use $v$ in place of $\ell^o(a^*)$ in (2) and apply the interior design (Algorithm 1). Call the resulting game the source game $\tilde\ell$.
2: Define a destination game $\bar\ell$ with $\bar\ell(a) = \ell^o(a^*)$ for all $a$.
3: Interpolate the source and destination games:
$$\ell^t = w_t \tilde\ell + (1 - w_t)\bar\ell, \qquad (3)$$
where $w_t = t^{\alpha + \epsilon - 1}$.

The designer can choose any loss vector $v$ as long as $v$ lies in the interior of $[L, U]$. We give two exemplary choices of $v$.
1. Let the average player loss at $a^*$ be $\hat\ell(a^*) = \frac{1}{M}\sum_{i=1}^M \ell^o_i(a^*)$. If $\hat\ell(a^*) \in (L, U)$, one can choose $v$ to be the constant vector with value $\hat\ell(a^*)$. The nice property of this choice is that if $\ell^o$ is zero-sum, then $v$ is zero-sum, so property 4 below is satisfied and the redesigned game is zero-sum. However, when $\hat\ell(a^*)$ does hit the boundary, the designer cannot choose this $v$.
2. Choose $v$ to be the constant vector with value $(L + U)/2$. This choice is always valid, but may not preserve the zero-sum property of the original game unless $L = -U$.

The designer applies the interior design to $v$ to obtain a source game $\tilde\ell$. Note that the target action profile $a^*$ strictly dominates in the source game. The designer also creates a destination game $\bar\ell$ by repeating the $\ell^o(a^*)$ entry everywhere. The boundary algorithm then interpolates between the source and destination games with a decaying weight $w_t$. After the interpolation (3), the target $a^*$ still dominates by roughly $w_t$. We design the weight $w_t = t^{\alpha+\epsilon-1}$ so that cumulatively the sum of the $w_t$ grows at rate $\alpha + \epsilon$, which is faster than the regret rate $\alpha$. This is a critical consideration for enforcing frequent play of $a^*$. Also note that asymptotically $\ell^t$ converges toward the destination game. Therefore, in the long run, when $a^*$ is played the designer incurs a diminishing cost, resulting in $o(T)$ cumulative design cost.

Lemma 7. The redesigned game (3) satisfies:
1. $\ell^t_i(a) \in [L, U]$ for all $i, a$; thus the loss function is valid.
2. For every player $i$, the target action $a^*_i$ strictly dominates any other action by $(1 - \frac{1}{M})\rho w_t$, i.e., $\ell^t_i(a_i, a_{-i}) = \ell^t_i(a^*_i, a_{-i}) + (1 - \frac{1}{M})\rho w_t$ for all $i$, $t$, $a_i \neq a^*_i$, and $a_{-i}$.
3. For all $t$, $C(\ell^o, \ell^t, a^*) \le \eta (U - L) M^{1/p} w_t$.
4. If the original loss for the target action profile $\ell^o(a^*)$ and the vector $v$ are both zero-sum, then $\ell^t$ is zero-sum for all $t$.

Given Lemma 7, we provide our second main result.

Theorem 8. Using Algorithm 2, the designer can achieve $\mathbb{E}[N^T(a^*)] = T - O(MT^{1-\epsilon})$ while incurring expected cumulative design cost $\mathbb{E}[C^T] = O(M^{1+1/p} T^{1-\epsilon} + M^{1/p} T^{\alpha+\epsilon})$.

Remark. By choosing a larger $\epsilon$ in Theorem 8, the designer increases $\mathbb{E}[N^T(a^*)]$. However, the design cost can grow. When $\epsilon = \frac{1-\alpha}{2}$, the design cost attains its minimum order $T^{\frac{1+\alpha}{2}}$ and $\mathbb{E}[N^T(a^*)] = T - O(T^{\frac{1+\alpha}{2}})$.

Corollary 9. Assume the no-regret learning algorithm is EXP3.P. The designer can achieve an expected number of target plays $\mathbb{E}[N^T(a^*)] = T - O(MT^{3/4})$ while incurring $\mathbb{E}[C^T] = O(M^{1/p}(1 + M)T^{3/4})$ design cost.
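Reusing the `interior_design` sketch above, Algorithm 2 can be sketched as a closure that evaluates the interpolation (3) for any round $t$; the argument names follow the algorithm's notation and the implementation is only illustrative.

```python
import numpy as np

def boundary_design(ell_o, a_star, v, rho, alpha, eps):
    """Return a function t -> ell^t implementing the interpolation in (3)."""
    ell_src = np.array(ell_o, dtype=float)
    ell_src[a_star] = v                                # plant v at the target cell, as in step 1
    ell_src = interior_design(ell_src, a_star, rho)    # source game: v replaces ell^o(a*) in (2)
    # Destination game: every cell carries the original target loss ell^o(a*).
    ell_dst = np.broadcast_to(np.asarray(ell_o, dtype=float)[a_star], ell_src.shape).copy()

    def ell_t(t):
        w = t ** (alpha + eps - 1.0)                   # decaying weight w_t = t^(alpha + eps - 1)
        return w * ell_src + (1.0 - w) * ell_dst
    return ell_t
```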
4.3 Discrete Design

In previous sections, we assumed the games $\ell^t$ can take arbitrary continuous values in the relaxed loss range $[L, U]$. However, there are many real-world situations where a continuous loss has no natural interpretation. For example, in the rock-paper-scissors game the loss is interpreted as win, lose or tie, so $\ell^t$ should only take values in the original loss value set $\mathcal{L} = \{-1, 0, 1\}$. We now provide a discrete redesign that converts any game $\ell^t$ with values in $[L, U]$ into a game $\hat\ell^t$ involving only the loss values $L$ and $U$, which are both in $\mathcal{L}$. Specifically, the discrete design is given in Algorithm 3 (a short code sketch of the rounding step appears at the end of Section 4).

Algorithm 3 Discrete Design
Input: the target action profile $a^*$; a loss vector $v \in \mathbb{R}^M$ whose elements are in the interior, i.e., $v_i \in [L+\rho, U-\rho]$ for all $i$ and some $\rho > 0$; the regret rate $\alpha$; $\epsilon \in (0, 1-\alpha)$.
Output: a time-varying game with losses $\hat\ell^t$ taking values in $\mathcal{L}$:
$$\forall t, i, a: \quad \hat\ell^t_i(a) = \begin{cases} U & \text{with probability } \frac{\ell^t_i(a) - L}{U - L}, \\ L & \text{with probability } \frac{U - \ell^t_i(a)}{U - L}. \end{cases} \qquad (4)$$

It is easy to verify that $\mathbb{E}[\hat\ell^t] = \ell^t$. In the experiments we show that such discrete games also achieve the design goals.

4.4 Thresholding the Redesigned Game

For all the designs in previous sections, the designer can impose an additional min or max operator to threshold against the original game loss. For the interior design, for example, the redesigned game loss after thresholding becomes
$$\forall i, a: \quad \tilde\ell_i(a) = \begin{cases} \min\{\ell^o_i(a^*) - (1 - \frac{d(a)}{M})\rho,\ \ell^o_i(a)\} & \text{if } a_i = a^*_i, \\ \max\{\ell^o_i(a^*) + \frac{d(a)}{M}\rho,\ \ell^o_i(a)\} & \text{if } a_i \neq a^*_i. \end{cases} \qquad (5)$$
We point out a few differences between (5) and (2). First, (5) guarantees a dominance gap of at least (instead of exactly) $(1 - \frac{1}{M})\rho$. As a result, the thresholded game can induce a larger $N^T(a^*)$ because the target action profile $a^*$ is made to stand out even more. Second, one can easily show that (5) incurs less design cost $C^T$ than (2) due to the thresholding; therefore Theorem 4 still holds. However, thresholding no longer preserves the zero-sum property.
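As a minimal sketch of the rounding step (4) in Algorithm 3 (function and argument names are ours), applied to a redesigned loss tensor with values in $[L, U]$:

```python
import numpy as np

def discretize(ell_t, L=-1.0, U=1.0, rng=None):
    """Randomized rounding of equation (4): replace each entry by U with probability
    (ell^t - L)/(U - L) and by L otherwise, so that E[hat ell^t] = ell^t entrywise."""
    rng = np.random.default_rng() if rng is None else rng
    p_upper = (np.asarray(ell_t, dtype=float) - L) / (U - L)
    return np.where(rng.random(p_upper.shape) < p_upper, U, L)
```

In a run of the protocol, the designer would draw a fresh $\hat\ell^t$ from the current $\ell^t$ in every round before revealing the losses to the players.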
5 Experiments

We perform empirical evaluations of the game redesign algorithms on four games: the volunteer's dilemma (VD), the tragedy of the commons (TC), the prisoner's dilemma (PD) and rock-paper-scissors (RPS). Throughout the experiments, we use EXP3.P [Bubeck and Cesa-Bianchi, 2012] as the no-regret learner. The concrete form of the regret bound for EXP3.P is given in the appendix; based on it, we derive the exact form of our theoretical upper bounds for Theorem 4 and Theorem 8 and show these theoretical values for comparison in the experiments. We let the designer cost function be $C(\ell^o, \ell^t, a^t) = \|\ell^o(a^t) - \ell^t(a^t)\|_p$ with $p = 1$. For VD, TC and PD, the original game is not zero-sum, and we apply the thresholding (5) to slightly improve the redesign performance. For RPS, we apply the design without thresholding to preserve the zero-sum property. The results in all plots are averages over 5 trials.

5.1 Volunteer's Dilemma (VD)

In the volunteer's dilemma (Table 1) there are $M$ players. Each player has two actions: volunteer or not. When at least one player volunteers, the players who do not volunteer gain 1 (i.e. a $-1$ loss), while the volunteers receive zero payoff. If no player volunteers, every player incurs a loss of 10.

| Player $i$ \ Other players | some other player volunteers | no other player volunteers |
|---|---|---|
| volunteer | 0 | 0 |
| not volunteer | -1 | 10 |

Table 1: The loss function $\ell^o_i$ for an individual player $i$ in VD.

As mentioned earlier, all pure Nash equilibria involve free-riders. The designer aims at encouraging all players to volunteer, i.e., the target action $a^*_i$ is volunteer for every player $i$. Note that $\ell^o_i(a^*) = 0$ for all $i$, which lies in the interior of $[L, U] = [-1, 10]$, so the designer can apply the interior design of Algorithm 1. The margin parameter is $\rho = 1$ and we let $M = 3$. In Table 2, we show the redesigned game $\tilde\ell$. Note that when all three players volunteer (i.e., at $a^*$), the losses are unchanged compared to $\ell^o$. Furthermore, regardless of the other players, the action volunteer strictly dominates the action not volunteer by at least $(1 - \frac{1}{M})\rho = \frac{2}{3}$ for every player. When there is no other volunteer, the dominance gap is $\frac{32}{3}$, which is due to the thresholding in (5).

| Number of other volunteers | 0 | 1 | 2 |
|---|---|---|---|
| volunteer | -2/3 | -1/3 | 0 |
| not volunteer | 10 | 1/3 | 2/3 |

Table 2: The redesigned loss function $\tilde\ell_i$ for player $i$ in VD.

We simulated play on this redesigned game $\tilde\ell$ for $T = 10^4, 10^5, 10^6, 10^7$, respectively. In Figure 2a, we show $T - N^T(a^*)$ against $T$ on a log scale. The standard deviation estimated from the 5 trials is less than 3% of the corresponding value and is hard to see in a log-scale plot, so we do not show it. We also plot our theoretical upper bound as a dashed line for comparison; the theoretical value indeed upper bounds our empirical results. In Figure 2b, we show $C^T$ against $T$; again, the theoretical upper bound holds. As our theory predicts, for the four values of $T$ the designer increasingly enforces $a^*$, in 60%, 82%, 94%, and 98% of the rounds, respectively, while the per-round design cost $C^T/T$ decreases to 0.98, 0.44, 0.15, and 0.05, respectively.

Figure 2: Interior design on VD with $M = 3$. (a) The number of rounds with $a^t \neq a^*$ grows sublinearly. (b) The cumulative design cost also grows sublinearly. The dashed lines are theoretical upper bounds.

Figure 1: Interior design on TC and PD. (a) $T - N^T(a^*)$ on TC; (b) $C^T$ on TC; (c) loss change on TC; (d) $T - N^T(a^*)$ on PD; (e) $C^T$ on PD. The dashed lines are the theoretical upper bounds.

5.2 Tragedy of the Commons (TC)

Our second example is the tragedy of the commons (TC). There are $M = 2$ farmers who share the same pasture to graze sheep. Each farmer $i$ is allowed to graze at most 15 sheep, i.e., the action space is $\mathcal{A}_i = \{0, 1, \ldots, 15\}$. The more sheep are grazed, the less well fed they are, and thus the lower the market price. We assume the price of each sheep is $p(a) = \sqrt{30 - \sum_{i=1}^2 a_i}$, where $a_i$ is the number of sheep that farmer $i$ grazes. The loss function of farmer $i$ is then $\ell^o_i(a) = -p(a)\,a_i$, i.e. the negated total price of the sheep that farmer $i$ owns. The Nash equilibrium of this game is that every farmer grazes $a_i = 12$ sheep. It is well known that this Nash equilibrium is suboptimal. Instead, the designer hopes to maximize the social welfare $p(a)(a_1 + a_2)$, which is achieved when $a_1 + a_2 = 20$. Moreover, to promote equity the designer desires that the two farmers graze the same number of sheep, so the target action profile is $a^*_i = 10$ for both farmers. Note that the original loss function takes values in $[-15\sqrt{15}, 0]$ while $\ell^o_i(a^*) = -10\sqrt{10}$, so this is again an interior design scenario. Due to the large number of entries, we only visualize the difference $\tilde\ell_1(a) - \ell^o_1(a)$ for player 1 in Figure 1c; the plot for the other player is the same. We observe three patterns of loss change.
For most $a$'s, e.g., those with $a_1 \le 6$ or $a_2 \ge 11$, the original loss $\ell^o_1(a)$ is already sufficiently large and satisfies the dominance gap in Lemma 3, so the loss remains unchanged. For the $a$'s with $a_1 = a^*_1 = 10$, the designer reduces the loss to make the target action more profitable. For the $a$'s close to the bottom left ($a_1 > a^*_1$ and $a_2 \le 10$), the designer increases the loss to enforce the gap $(1 - \frac{1}{M})\rho$.

We simulated the game play for $T = 10^4, 10^5, 10^6, 10^7$ rounds and show the results in Figures 1a, 1b, and 1c. Again the game redesign is successful: the figures confirm $T - o(T)$ target action plays and $o(T)$ cumulative design cost. Numerically, for the four values of $T$ the designer enforces $a^*$ in 41%, 77%, 92%, and 98% of the rounds, and the per-round design costs are 9.4, 4.2, 1.4, and 0.5, respectively.

5.3 Prisoner's Dilemma (PD)

Our third example is the prisoner's dilemma (PD). There are two prisoners, each of whom can stay mum or fink. The original loss $\ell^o$ is given in Table 5a. The Nash equilibrium of this game is that both prisoners fink. Suppose a mafia designer hopes to force $a^* = $ (mum, mum) by sabotaging the losses. Note that $\ell^o_i(a^*) = 2$ for both players, which lies in the interior of the loss range $[L, U] = [1, 5]$, so this is again an interior design scenario. In Table 5b we show the redesigned game $\tilde\ell$. When both prisoners stay mum or both fink, the designer does not change the losses. However, when one prisoner stays mum and the other finks, the designer reduces the loss of the mum prisoner and increases the loss of the betrayer.

(a) The original loss $\ell^o$ of PD.

| | mum | fink |
|---|---|---|
| mum | 2, 2 | 5, 1 |
| fink | 1, 5 | 4, 4 |

(b) The redesigned loss $\tilde\ell$ of PD.

| | mum | fink |
|---|---|---|
| mum | 2, 2 | 1.5, 2.5 |
| fink | 2.5, 1.5 | 4, 4 |

Table 5: Interior redesign on the Prisoner's Dilemma.

We simulated play for $T = 10^4, 10^5, 10^6$, and $10^7$. In Figures 1d and 1e, we plot the number of non-target selections $T - N^T(a^*)$ and the cumulative design cost $C^T$ for PD. Both grow sublinearly as $T$ increases. The designer enforces $a^*$ in 85%, 94%, 98%, and 99% of the rounds, and the per-round design costs are 0.71, 0.28, 0.09, and 0.03, respectively.

5.4 Rock-Paper-Scissors (RPS)

Our last example is the RPS game, whose original loss $\ell^o$ is shown in Table 4a.

Boundary Design. Suppose the target profile is $a^* = (R, P)$. Since $\ell^o(a^*) = (1, -1)$ hits the boundary of the loss range $[L, U] = [-1, 1]$, the designer must use the boundary design. For simplicity we choose $v$ with $v_i = \frac{U+L}{2} = 0$ for all $i$. This choice of $v$ preserves the zero-sum property.
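As a sanity check, the snippet below reuses the `interior_design` and `boundary_design` sketches from Section 4 to reproduce the interpolated games discussed next. The margin $\rho$ for RPS is not stated here, so we assume $\rho = 1$ (the largest value admissible for $v = 0$ in $[-1, 1]$) together with the EXP3.P rate $\alpha = 1/2$; under these assumptions the printed target-cell losses match the corresponding entries of Table 3.

```python
import numpy as np

# Original RPS loss tensor: entry [r, c] = (loss of row player, loss of column player),
# with -1 for a win, 0 for a tie, +1 for a loss.
ell_o = np.zeros((3, 3, 2))
for r in range(3):
    for c in range(3):
        if (r - c) % 3 == 1:        # row beats column (P>R, S>P, R>S with R=0, P=1, S=2)
            ell_o[r, c] = [-1.0, 1.0]
        elif (c - r) % 3 == 1:      # column beats row
            ell_o[r, c] = [1.0, -1.0]

a_star = (0, 1)                      # target profile a* = (R, P)
ell_t = boundary_design(ell_o, a_star, v=np.zeros(2), rho=1.0, alpha=0.5, eps=0.3)
for t in (1, 10**3, 10**7):
    print(t, np.round(ell_t(t)[a_star], 2))   # target cell drifts toward ell^o(a*) = (1, -1)
```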
Table 3 shows the redesigned games at $t = 1, 10^3$ and $10^7$ under $\epsilon = 0.3$.

(a) $\ell^t$ at $t = 1$.

| | R | P | S |
|---|---|---|---|
| R | -0.5, 0.5 | 0, 0 | -0.5, 0.5 |
| P | 0, 0 | 0.5, -0.5 | 0, 0 |
| S | 0, 0 | 0.5, -0.5 | 0, 0 |

(b) $\ell^t$ at $t = 10^3$.

| | R | P | S |
|---|---|---|---|
| R | 0.62, -0.62 | 0.75, -0.75 | 0.62, -0.62 |
| P | 0.75, -0.75 | 0.87, -0.87 | 0.75, -0.75 |
| S | 0.75, -0.75 | 0.87, -0.87 | 0.75, -0.75 |

(c) $\ell^t$ at $t = 10^7$.

| | R | P | S |
|---|---|---|---|
| R | 0.94, -0.94 | 0.96, -0.96 | 0.94, -0.94 |
| P | 0.96, -0.96 | 0.98, -0.98 | 0.96, -0.96 |
| S | 0.96, -0.96 | 0.98, -0.98 | 0.96, -0.96 |

Table 3: The redesigned RPS games $\ell^t$ for selected $t$ (with $\epsilon = 0.3$). Note the target entry $a^* = (R, P)$ converges toward $(1, -1)$.

Note that the designer maintains the zero-sum property of the games. Also note that the redesigned loss function guarantees strict dominance of $a^*$ for all $t$, but the dominance gap decreases as $t$ grows. Finally, the loss of the target action profile $a^* = (R, P)$ converges to the original loss $\ell^o(a^*) = (1, -1)$ asymptotically, so the designer incurs a diminishing cost.

We ran Algorithm 2 for $\epsilon = 0.1, 0.2, 0.3, 0.4$. For each $\epsilon$ we simulated game play for $T = 10^4, 10^5, 10^6$ and $10^7$. In Figure 3a, we show $T - N^T(a^*)$ under different $\epsilon$ (solid lines) and the theoretical upper bounds of Theorem 8 (dashed lines) for comparison. In Figure 3b, we show $C^T$ and its upper bounds. Both $T - N^T(a^*)$ and $C^T$ grow sublinearly. For $\epsilon = 0.3$, for the four values of $T$ the designer forces $a^*$ in 34%, 60%, 76%, and 88% of the rounds, and the per-round design costs are 1.7, 1.2, 0.73 and 0.40. The results are similar for the other $\epsilon$'s. We note that empirically the cumulative design cost attains its minimum at some $\epsilon \in (0.3, 0.4)$, while Theorem 8 suggests the minimum is at $\epsilon = 0.25$. We investigate this inconsistency in the appendix.

Figure 3: Boundary design on RPS. (a) Number of rounds with $a^t \neq a^*$. (b) The cumulative design cost. The dashed lines are the theoretical upper bounds.

Discrete Design. We now compare the performance of the discrete design (Algorithm 3) with the boundary design. The target profile is still $a^* = (R, P)$. Recall that the purpose of the discrete design is to use only natural game loss values, in the RPS case $\mathcal{L} = \{-1, 0, 1\}$. Figure 4 shows that the performance of the discrete design nearly matches the boundary design. When $\epsilon = 0.3$, for the four values of $T$ the discrete design enforces $a^*$ 35%, 59%, 75% and 88% of the time, and the per-round design costs are 1.7, 1.2, 0.79, and 0.41. Overall, the discrete design does not lose much performance. Table 4 shows realized random games at $t = 1, 10^3$ and $10^7$ under $\epsilon = 0.3$. As $t$ increases, the redesigned loss function converges to a constant function that takes the target loss value $\ell^o(a^*)$.

Figure 4: Discrete redesign for $a^* = (R, P)$ with natural loss values in $\mathcal{L}$. (a) Number of rounds with $a^t \neq a^*$. (b) The cumulative design cost. The dashed lines are the corresponding boundary design.

(a) The original loss $\ell^o$ of RPS.

| | R | P | S |
|---|---|---|---|
| R | 0, 0 | 1, -1 | -1, 1 |
| P | -1, 1 | 0, 0 | 1, -1 |
| S | 1, -1 | -1, 1 | 0, 0 |

(b)-(d) $\hat\ell^t$ at $t = 1, 10^3, 10^7$: sampled realizations of (4) with every entry in $\{-1, +1\}$.

Table 4: Instantiation of the discrete design on the same games as in Table 3. The redesigned losses lie in $\mathcal{L} = \{-1, 0, 1\}$.

6 Conclusion

We studied the game redesign problem in which the players apply no-regret algorithms to play the game. We showed that a designer can enforce a target action profile in $T - o(T)$ rounds while incurring only $o(T)$ cumulative design cost. Experiments demonstrate the performance of our redesign algorithms.

Acknowledgments

This project is supported in part by NSF grants 1545481, 1704117, 1836978, 2041428, 2023239, ARO MURI W911NF2110317, and AF CoE FA9550-18-1-0166.

References

[Blum, 2007] A. Blum. Learning, regret minimization, and equilibria. In Algorithmic Game Theory, pages 79-102, 2007.

[Bogunovic et al., 2021] Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 991-999. PMLR, 2021.

[Bubeck and Cesa-Bianchi, 2012] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

[Garcelon et al., 2020] Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits. arXiv preprint arXiv:2002.03839, 2020.
[Guan et al., 2020] Ziwei Guan, Kaiyi Ji, Donald J. Bucci Jr, Timothy Y. Hu, Joseph Palombo, Michael Liston, and Yingbin Liang. Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack. arXiv preprint arXiv:2002.07214, 2020.

[Hart and Mas-Colell, 2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150, 2000.

[Huang and Zhu, 2019] Yunhan Huang and Quanyan Zhu. Deceptive reinforcement learning under adversarial manipulations on cost signals. In International Conference on Decision and Game Theory for Security, pages 217-237. Springer, 2019.

[Jun et al., 2018] Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640-3649, 2018.

[Liu and Shroff, 2019] Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pages 4042-4050, 2019.

[Lu et al., 2021] Shiyin Lu, Guanghui Wang, and Lijun Zhang. Stochastic graphical bandits with adversarial corruptions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8749-8757, 2021.

[Ma et al., 2018] Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186-204. Springer, 2018.

[Ma et al., 2019] Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Xiaojin Zhu. Policy poisoning in batch reinforcement learning and control. In Advances in Neural Information Processing Systems, pages 14570-14580, 2019.

[Monderer and Tennenholtz, 2004] Dov Monderer and Moshe Tennenholtz. k-Implementation. Journal of Artificial Intelligence Research, 21:37-62, 2004.

[Rakhsha et al., 2020] Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. arXiv preprint arXiv:2003.12909, 2020.

[Slivkins, 2019] Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.

[Sun et al., 2020] Yanchao Sun, Da Huo, and Furong Huang. Vulnerability-aware poisoning mechanism for online RL with unknown dynamics. arXiv preprint arXiv:2009.00774, 2020.

[Yang et al., 2021] Lin Yang, Mohammad Hajiesmaili, Mohammad Sadegh Talebi, John Lui, and Wing Shing Wong. Adversarial bandits with corruptions: Regret lower bound and no-regret algorithm. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[Zhang et al., 2020] Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. Adaptive reward-poisoning attacks against reinforcement learning. arXiv preprint arXiv:2003.12613, 2020.

[Zuo, 2020] Shiliang Zuo. Near optimal adversarial attack on UCB bandits. arXiv preprint arXiv:2008.09312, 2020.