# Payoff Control in the Iterated Prisoner's Dilemma

Dong Hao¹, Kai Li² and Tao Zhou¹
¹ University of Electronic Science and Technology of China, Chengdu, China
² Shanghai Jiao Tong University, Shanghai, China
haodong@uestc.edu.cn, kai.li@sjtu.edu.cn, zhutou@ustc.edu

Repeated games have long been the touchstone model for agents' long-run relationships. Previous results suggest that it is particularly difficult for a repeated-game player to exert autocratic control over the payoffs, since these are jointly determined by all participants. This work shows that the scale of a player's capability to unilaterally influence the payoffs may have been much underestimated. Under the conventional iterated prisoner's dilemma, we develop a general framework for controlling the feasible region in which the players' payoff pairs lie. A control-strategy player is able to confine the payoff pairs to her objective region, as long as this region has feasible linear boundaries. With this framework, many well-known existing strategies can be categorized, and various new strategies with nice properties can be identified. We show that the control strategies perform well both in a tournament and against a human-like opponent.

## 1 Introduction

Understanding what the best strategies are for intelligent agents in a long-run relationship is a fundamental challenge in many disciplines. Repeated games are prevailing tools for modeling and analyzing intelligent agents' long-run relationships [Mailath and Samuelson, 2006], and have been richly studied in economics, artificial intelligence and biology [Kandori, 2002; Claus and Boutilier, 1998; Nowak et al., 1995]. For multi-agent systems, repeated games are widely used to understand how cooperation or competition emerges among agents and how cooperative or winning strategies can be identified. It has been commonly accepted that in such games it is impossible for a unilateral player to freely control the payoffs and determine the evolutionary route of the game, since the outcomes are jointly determined by all participants.

In this paper, we propose a general framework for payoff control in the iterated prisoner's dilemma, a conventional model for repeated games. First, based on the game's Markov decision process (MDP), the correlation between a single player's strategy and the MDP's joint stationary distribution is derived. Then, according to this correlation, we establish a general payoff control framework, under which a control strategy can be obtained simply by solving a system of linear inequalities. Using the payoff control framework, as long as the control objective is feasible, a controller can restrict the relation between her and the opponent's payoffs (represented by a two-tuple) to an arbitrary region with linear boundaries. More specifically, she can (i) unilaterally determine the maximum and minimum values of the opponent's possible payoffs; or (ii) always win the game no matter what the opponent's strategy is, and even control her winning probability; or (iii) control the evolutionary route of the game: as long as the opponent is rational and self-optimizing, the controller can force the game to converge to a mutual-cooperation equilibrium or to any feasible equilibrium that she wishes.
We simulate several specific strategies generated under the payoff control framework in a tournament similar to that of Axelrod [Axelrod and Hamilton, 1981], and find that the new payoff control strategies perform remarkably well.

The discussion of payoff control in games can be traced back to [Boerlijst et al., 1997], which discovered that, in the iterated prisoner's dilemma, one player can pin the opponent's payoff to a certain value. However, the underlying mechanism by which such strategies exist, and how to formally derive them, were not thoroughly investigated. In recent years, Press and Dyson's discovery of zero-determinant (ZD) strategies illuminated a new starting point for payoff control [Hao et al., 2014]. They show that in repeated games it is possible for a player to unilaterally enforce a linear relation between her and the opponent's payoff [Press and Dyson, 2012]. This was the first time that linear payoff-relation control was formally investigated, and it received a lot of attention. Thereafter, linear control of players' payoff relations was discovered in multiplayer games [Pan et al., 2015; Hilbe et al., 2014], games with imperfect information [Chen and Zinger, 2014; Hao et al., 2015] and evolutionary games [Adami and Hintze, 2013; Hilbe et al., 2013]. Furthermore, from a mathematical point of view, Akin formally investigated why such linear control exists in games that can be represented by an MDP, and proposed a new payoff control scheme whereby one player can fix the upper bound of the opponent's payoff at the mutual-cooperation reward R; such strategies can force the game to converge to a mutual-cooperation situation [Akin, 2012]. Extended cases of Akin's cooperation-enforcing control were then studied, and special cases of nonlinear payoff-relation control were identified [Hilbe et al., 2015].

These existing payoff control strategies confront two major problems. The first is that they only realize special cases of payoff control, such as linear control or cooperative control. Zero-determinant strategies can only realize linear payoff relations, which are very strong and sometimes not easy to use; strategies based on Akin's method can only bound the opponent's payoff from above by the mutual-cooperation reward R. How to establish a general control framework with multiple, freely chosen control objectives remains challenging. The second problem is that these strategies are mostly difficult to obtain. For zero-determinant strategies, calculating the determinant of a matrix already has high computational complexity; for strategies based on Akin's method, when one tries to add objectives other than cooperation enforcement, the computational complexity increases exponentially and deriving the strategy becomes intractable.

In this paper, we propose a general payoff control framework which overcomes both problems. In Section 2, the repeated game is modeled as an MDP and the relationship between a single player's strategy and the stationary distribution of the MDP is derived. In Section 3, we realize control of the opponent's maximum and minimum payoffs. In Section 4, this is extended to a free regional payoff control with multiple linear boundaries, and various types of regional control strategies, especially cooperation-enforcing control strategies, are identified.
To analyze how the payoff control strategies perform when confronting various famous strategies, in Section 5 we simulate control strategies in an Axelrod-style tournament. In the last section, to evaluate how payoff control strategies perform in the real world, we simulate them against a reinforcement learning player [Sutton and Barto, 1998].

## 2 Strategy and Game Convergence

The iterated prisoner's dilemma (IPD) is the canonical example for analyzing cooperation and competition in agents' long-run relationships. The IPD consists of multiple rounds of stage games. In each stage, player $i \in \{X, Y\}$ adopts an action $a_i \in \{C, D\}$ with a certain probability, where C denotes cooperation and D denotes defection. The space of outcomes in each stage game is $\Omega = \{CC, CD, DC, DD\}$. If both players cooperate (CC), each earns a reward R; if one cooperates but the other defects (CD or DC), the cooperator earns S and the defector earns T; if both defect (DD), both get P. The payoff vector of player X over $\Omega$ is thus $S_X = (R, S, T, P)$, and that of player Y is $S_Y = (R, T, S, P)$.

In this paper we consider the case where player X chooses her action conditioning only on the outcome of the previous stage. It is worth noting that, in infinitely repeated games, it has been proved that such one-stage-memory strategies are at no disadvantage even if the opponent has a longer memory [Press and Dyson, 2012]. The strategy of player X is defined as a vector of probabilities $p = (p_1, p_2, p_3, p_4)$, where each component is the probability that she cooperates with player Y conditioning on the last stage outcome CC, CD, DC or DD, respectively. Analogously, the strategy of player Y is a vector of cooperation probabilities $q = (q_1, q_2, q_3, q_4)$ conditioning on the previous outcomes CC, DC, CD or DD, respectively. The transition matrix over the state space between adjacent stage games is then

$$
M = \begin{pmatrix}
p_1 q_1 & p_1(1-q_1) & (1-p_1)q_1 & (1-p_1)(1-q_1) \\
p_2 q_3 & p_2(1-q_3) & (1-p_2)q_3 & (1-p_2)(1-q_3) \\
p_3 q_2 & p_3(1-q_2) & (1-p_3)q_2 & (1-p_3)(1-q_2) \\
p_4 q_4 & p_4(1-q_4) & (1-p_4)q_4 & (1-p_4)(1-q_4)
\end{pmatrix}. \tag{1}
$$

If this Markov matrix is regular, it has a unique stationary distribution $v = (v_1, v_2, v_3, v_4)$, which is a probability distribution over the state space $\Omega$ and can be calculated as

$$
v = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} v_t, \tag{2}
$$

where $v_t$ is the distribution over $\Omega$ at the t-th stage. The average expected payoffs for players X and Y are then $s_X = v \cdot S_X = (v_1, v_2, v_3, v_4) \cdot (R, S, T, P)$ and $s_Y = v \cdot S_Y = (v_1, v_2, v_3, v_4) \cdot (R, T, S, P)$, respectively.

In the t-th stage, the total probability that player X cooperates is $p_c^t = (1, 1, 0, 0) \cdot v_t$, and the probability that she cooperates in the next stage game is $p_c^{t+1} = p \cdot v_t$. Taking the difference between these two probabilities, we have

$$
p_c^{t+1} - p_c^t = \left(p - (1, 1, 0, 0)\right) \cdot v_t. \tag{3}
$$

Denote $\tilde{p} = p - (1, 1, 0, 0)$. Essentially, this vector describes how fast player X changes her probability of cooperation. Summing eq. (3) from $t = 1$ to $t = n$ gives

$$
\sum_{t=1}^{n} \tilde{p} \cdot v_t = \sum_{t=1}^{n} \left(p_c^{t+1} - p_c^t\right) = p_c^{n+1} - p_c^1. \tag{4}
$$

If the game is infinitely repeated, averaging the above equation as $n \to \infty$ ensures that

$$
\tilde{p} \cdot v = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \tilde{p} \cdot v_t = \lim_{n \to \infty} \frac{1}{n} \left(p_c^{n+1} - p_c^1\right) = 0, \tag{5}
$$

where $v$ is exactly the stationary distribution of the game's state transition matrix. Expanding the vector equation $\tilde{p} \cdot v = 0$ leads to

$$
(-1 + p_1) v_1 + (-1 + p_2) v_2 + p_3 v_3 + p_4 v_4 = 0. \tag{6}
$$

This relation was first discovered in [Press and Dyson, 2012] and then investigated in [Akin, 2012]. It reveals the essential relationship between the game's transition matrix and the unilateral strategy of a single player.
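To make these quantities concrete, here is a small Python sketch (ours, not from the paper) that builds the transition matrix of eq. (1), computes the stationary distribution, and numerically checks the relation of eq. (6); the example strategies are arbitrary, and the payoff values follow the setting $(R, T, S, P) = (2, 3, 1, 0)$ used later in the paper.

```python
import numpy as np

# Payoff vectors over the outcome space (CC, CD, DC, DD), with the paper's
# later setting (R, T, S, P) = (2, 3, 1, 0).
R, S, T, P = 2, 1, 3, 0
SX = np.array([R, S, T, P])   # player X's payoffs
SY = np.array([R, T, S, P])   # player Y's payoffs

def transition_matrix(p, q):
    """Stage-to-stage Markov matrix M of eq. (1) over (CC, CD, DC, DD).
    p = (p1, p2, p3, p4): X's cooperation probabilities after CC, CD, DC, DD.
    q = (q1, q2, q3, q4): Y's cooperation probabilities after CC, DC, CD, DD."""
    p1, p2, p3, p4 = p
    q1, q2, q3, q4 = q
    rows = [(p1, q1), (p2, q3), (p3, q2), (p4, q4)]
    return np.array([[pc * qc, pc * (1 - qc), (1 - pc) * qc, (1 - pc) * (1 - qc)]
                     for pc, qc in rows])

def stationary(M):
    """Stationary distribution v with v M = v and sum(v) = 1 (regular chain assumed)."""
    A = np.vstack([M.T - np.eye(4), np.ones(4)])
    b = np.array([0, 0, 0, 0, 1.0])
    return np.linalg.lstsq(A, b, rcond=None)[0]

p = np.array([0.9, 0.5, 0.3, 0.2])   # an arbitrary memory-one strategy for X
q = np.array([0.8, 0.2, 0.6, 0.1])   # an arbitrary memory-one strategy for Y
v = stationary(transition_matrix(p, q))
print("s_X =", v @ SX, " s_Y =", v @ SY)

# Eq. (6): (p - (1, 1, 0, 0)) . v = 0, whatever the opponent's strategy q is.
print("p_tilde . v =", (p - np.array([1, 1, 0, 0])) @ v)   # ~0 up to numerical error
```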
## 3 Control of the Maximum and Minimum Values of the Opponent's Payoff

Suppose player X's objective is to ensure that Y's expected payoff satisfies

$$
s_Y \le W, \tag{7}
$$

where $s_Y = v \cdot S_Y = (v_1, v_2, v_3, v_4) \cdot (R, T, S, P)$ and W is a constant. The objective eq. (7) becomes $(v_1, v_2, v_3, v_4) \cdot \left((R, T, S, P) - (W, W, W, W)\right) \le 0$. Multiplying both sides by the positive factor $1 - p_2$ does not change the inequality, so eq. (7) is equivalent to $(s_Y - W)(1 - p_2) \le 0$. Substituting eq. (6) into it and collecting the coefficients of $v_i$, the objective of player X finally reduces to the inequality

$$
\alpha_1 v_1 + \alpha_3 v_3 + \alpha_4 v_4 \le 0, \tag{8}
$$

where the coefficients are

$$
\begin{cases}
\alpha_1 = (R - W)(1 - p_2) + (W - T)(1 - p_1), \\
\alpha_3 = (T - W)\, p_3 + (S - W)(1 - p_2), \\
\alpha_4 = (T - W)\, p_4 + (P - W)(1 - p_2).
\end{cases} \tag{9}
$$

One sufficient condition for eq. (8) to hold is that all $\alpha_i$ are non-positive. This requires the strategy of player X to fall into the following region:

$$
\begin{cases}
0 \le p_2 < 1, \\
0 \le p_1 \le \min\left\{1 - \frac{R - W}{T - W}(1 - p_2),\; 1\right\}, \\
0 \le p_3 \le \min\left\{\frac{W - S}{T - W}(1 - p_2),\; 1\right\}, \\
0 \le p_4 \le \min\left\{\frac{W - P}{T - W}(1 - p_2),\; 1\right\}.
\end{cases} \tag{10}
$$

This shows that there are multiple candidate strategies with which player X can control the maximum value of Y's possible payoff. Moreover, the choice of $p_1$, $p_3$ and $p_4$ depends on the value of $p_2$, which is the probability that X cooperates again after she cooperated but was defected by the opponent. The value of $p_2$ partially reflects X's bottom line for cooperation, and all the other components of $p$ should be adjusted accordingly. Nevertheless, as long as eqs. (10) have solutions, whatever such bottom line player X has, it is always possible for her to cap player Y's payoff at her objective W.

Based on the above model, besides controlling the maximum value of Y's payoff, player X can also control the minimum payoff Y can achieve. If X's objective is

$$
s_Y \ge U, \tag{11}
$$

then X secures a bottom line for Y's payoff. Applying the same trick as for eq. (7), the constraint for eq. (11) becomes

$$
\beta_1 v_1 + \beta_3 v_3 + \beta_4 v_4 \ge 0, \tag{12}
$$

where

$$
\begin{cases}
\beta_1 = (R - U)(1 - p_2) + (U - T)(1 - p_1), \\
\beta_3 = (T - U)\, p_3 + (S - U)(1 - p_2), \\
\beta_4 = (T - U)\, p_4 + (P - U)(1 - p_2).
\end{cases} \tag{13}
$$

One sufficient condition for $s_Y \ge U$ is that all $\beta_i$ are non-negative, which gives the solution region

$$
\begin{cases}
0 \le p_2 < 1, \\
\max\left\{0,\; 1 - \frac{R - U}{T - U}(1 - p_2)\right\} \le p_1 \le 1, \\
\frac{U - S}{T - U}(1 - p_2) \le p_3 \le 1, \\
\frac{U - P}{T - U}(1 - p_2) \le p_4 \le 1.
\end{cases} \tag{14}
$$

It is straightforward for X to set W and U simultaneously and sandwich Y's expected payoff $s_Y$ into an intermediate region; she does so by choosing a strategy $p$ satisfying both eqs. (10) and eqs. (14). As W and U move closer to each other, the range of Y's possible payoffs shrinks. In the extreme case where the controller X sets $W = U$, the region of Y's possible payoffs is compressed into a line. At this point, $p$ degenerates to an equalizer strategy, which was discovered in [Boerlijst et al., 1997] and formally discussed in [Press and Dyson, 2012].
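Because eqs. (10) and (14) are simple box constraints, a concrete control strategy is easy to construct. The sketch below (ours; the targets $W = 1.5$, $U = 0.5$ and $p_2 = 0.5$ are illustrative choices, and the spot check reuses `transition_matrix` and `stationary` from the Section 2 sketch) picks the midpoint of the feasible box and verifies empirically that the opponent's payoff stays between U and W.

```python
import numpy as np

R, S, T, P = 2, 1, 3, 0   # (R, T, S, P) = (2, 3, 1, 0)

def maxmin_bounds(W, U, p2):
    """Bounds on (p1, p3, p4) from eqs. (10) and (14) for the target U <= s_Y <= W."""
    hi = (min(1.0, 1 - (R - W) / (T - W) * (1 - p2)),   # upper bounds, eq. (10)
          min(1.0, (W - S) / (T - W) * (1 - p2)),
          min(1.0, (W - P) / (T - W) * (1 - p2)))
    lo = (max(0.0, 1 - (R - U) / (T - U) * (1 - p2)),   # lower bounds, eq. (14)
          max(0.0, (U - S) / (T - U) * (1 - p2)),
          max(0.0, (U - P) / (T - U) * (1 - p2)))
    return lo, hi

W, U, p2 = 1.5, 0.5, 0.5                        # illustrative targets (our choice)
lo, hi = maxmin_bounds(W, U, p2)
assert all(l <= h for l, h in zip(lo, hi))       # the box is non-empty, i.e. feasible
p = np.array([(lo[0] + hi[0]) / 2, p2, (lo[1] + hi[1]) / 2, (lo[2] + hi[2]) / 2])
print("control strategy p =", p)

# Spot check: for any sampled opponent q, the opponent's payoff stays inside [U, W].
SY = np.array([R, T, S, P])
rng = np.random.default_rng(0)
for _ in range(1000):
    sY = stationary(transition_matrix(p, rng.random(4))) @ SY
    assert U - 1e-6 <= sY <= W + 1e-6
print("s_Y confined to [%.2f, %.2f] for all sampled opponents" % (U, W))
```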
Figure 1: Control of the maximum and minimum values of Y's possible payoffs. Each black dot is a possible payoff pair. In (A), (B) and (C), X's strategies are $p = (1, 0.51, 1, 0)$, $p = (0.998, 0.994, 0.01, 0.0012)$ and $p = (0.5, 0, 1, 0.5)$, respectively.

In Figure 1 we show an example of how a control on Y's maximum and minimum payoffs degenerates into an equalizer strategy. The convex hull is the space of the two players' payoff pairs $(s_X, s_Y)$; the x-axis and y-axis are the payoff values of players X and Y, respectively. In each subfigure, X uses a control strategy and Y's strategy is randomly sampled 5000 times. We use a traditional prisoner's dilemma payoff setting $(R, T, S, P) = (2, 3, 1, 0)$. Each black dot represents a possible payoff pair consisting of X's and Y's average payoffs under the fixed control strategy of X and a random strategy of Y. The upper and lower bounds of Y's possible payoffs are depicted by the red and blue lines, respectively. In 1(A), X's control strategy yields the maximum and minimum values $W = 2$ and $U = 0$ for Y; in 1(B), X sets $W = 1.5$ and $U = 0.5$ and the possible payoff region of Y shrinks; in 1(C), the regional payoff control finally degenerates to an equalizer strategy, under which Y's payoff is pinned to the fixed value $W = U = 1.0$. The equalizer (pinning) strategies are thus special cases of control of the maximum and minimum values of the opponent's payoff.

## 4 Control of Players' Payoff Relations and Cooperation Enforcement

The above framework makes it possible to control the maximum and minimum values of the opponent's possible payoffs. In this section, we show that it is even possible for the controller X to confine the two players' possible payoff pairs to an arbitrary region whose boundaries are characterized by linear functions. Under such a regional control, the game can be led to a mutual-cooperation situation.

Assume the controller X wants to establish a relation between the two players' payoffs such that the opponent always obtains less than a linear combination of what she earns:

$$
s_Y \le \frac{1}{\chi} s_X + \kappa, \tag{15}
$$

where $\chi \ge 1$ and $\kappa \ge \left(1 - \frac{1}{\chi}\right)P$. This objective claims a linear upper bound on Y's payoff and ensures that all possible payoff pairs lie below it. Eq. (15) is equivalent to $(S_X - \chi S_Y + \chi\kappa \mathbf{1}) \cdot v \ge 0$, which further leads to

$$
\left[(R, S, T, P) - \chi (R, T, S, P) + \chi\kappa \mathbf{1}\right] \cdot (v_1, v_2, v_3, v_4) \ge 0.
$$

Multiplying both sides by $1 - p_2$ and representing $v_2$ by $v_1$, $v_3$ and $v_4$ via eq. (6), we have

$$
\gamma_1 v_1 + \gamma_3 v_3 + \gamma_4 v_4 \ge 0, \tag{16}
$$

where

$$
\begin{cases}
\gamma_1 = \mu\, (-1 + p_1) + \left[(1 - \chi) R + \chi\kappa\right](1 - p_2), \\
\gamma_3 = \mu\, p_3 + (T - \chi S + \chi\kappa)(1 - p_2), \\
\gamma_4 = \mu\, p_4 + \left[(1 - \chi) P + \chi\kappa\right](1 - p_2),
\end{cases} \tag{17}
$$

and $\mu = S - \chi T + \chi\kappa$. Making $\gamma_1$, $\gamma_3$ and $\gamma_4$ simultaneously non-negative is sufficient to ensure that eq. (15) holds.

Similarly, if X wants to establish a payoff relation such that Y always obtains more than a linear combination of what she earns, then her objective is

$$
s_Y \ge \frac{1}{\chi} s_X + \kappa. \tag{18}
$$

This indicates that X sets a linear lower bound on Y's payoff. Such an objective demands a payoff region above the line $s_X - \chi s_Y + \chi\kappa = 0$. To realize it, she just needs to make $\gamma_1$, $\gamma_3$ and $\gamma_4$ in eqs. (17) simultaneously non-positive.
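Checking whether a given strategy enforces a linear bound thus amounts to three sign tests on eqs. (17). The sketch below (ours) evaluates the $\gamma$ coefficients of the Figure 2(A) strategy $p = (1, 0.1, 0.75, 0)$ for an upper bound with $\chi = 1.5$, $\kappa = 1$ and a lower bound with $\chi = 1.5$, $\kappa = 0$, then spot-checks both bounds against random opponents; it again reuses `transition_matrix` and `stationary` from the Section 2 sketch.

```python
import numpy as np

R, S, T, P = 2, 1, 3, 0   # (R, T, S, P) = (2, 3, 1, 0)

def gammas(p, chi, kappa):
    """Coefficients (gamma_1, gamma_3, gamma_4) of eq. (17) for the line s_Y = s_X/chi + kappa."""
    p1, p2, p3, p4 = p
    mu = S - chi * T + chi * kappa
    return np.array([mu * (p1 - 1) + ((1 - chi) * R + chi * kappa) * (1 - p2),
                     mu * p3 + (T - chi * S + chi * kappa) * (1 - p2),
                     mu * p4 + ((1 - chi) * P + chi * kappa) * (1 - p2)])

p = np.array([1.0, 0.1, 0.75, 0.0])          # the strategy of Figure 2(A)
print(gammas(p, chi=1.5, kappa=1.0))          # all >= 0: upper bound s_Y <= s_X/1.5 + 1, eq. (15)
print(gammas(p, chi=1.5, kappa=0.0))          # all <= 0: lower bound s_Y >= s_X/1.5,     eq. (18)

# Spot check of both bounds against random opponents.
SX, SY = np.array([R, S, T, P]), np.array([R, T, S, P])
rng = np.random.default_rng(0)
for _ in range(1000):
    v = stationary(transition_matrix(p, rng.random(4)))
    sX, sY = v @ SX, v @ SY
    assert sX / 1.5 - 1e-6 <= sY <= sX / 1.5 + 1.0 + 1e-6
```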
A necessary condition for a control strategy to exist is that the objective region be feasible. This means, on the one hand, that the objective region of possible payoff pairs must intersect the segment between $(P, P)$ and $(S, T)$, which is the left boundary of the payoff region in the iterated prisoner's dilemma and depicts the payoff pairs when Y unconditionally defects, i.e., $q = (0, 0, 0, 0)$; on the other hand, the objective region must also terminate at some point on the segment between $(R, R)$ and $(T, S)$, which is the right boundary and depicts the possible payoff pairs when Y unconditionally cooperates, i.e., $q = (1, 1, 1, 1)$.

Specifically: (1) If X controls using objective eq. (15) with $\chi \ge 1$ and $\kappa = (1 - \frac{1}{\chi})P$, we have $(s_X - P) \ge \chi (s_Y - P)$. Any point in the objective region ensures that X's payoff difference relative to P is at least $\chi$ times that of Y. Under such a strategy, X only cares about winning the game herself, regardless of the opponent's outcome. This feature captures selfishness in nature, so we call such strategies selfish control strategies. For example, in a game with $P = 0$, if X sets eq. (15) with $\chi = 1.5$ and $\kappa = 0$, she always obtains more than 1.5 times what Y obtains; $p = (0.5, 0.5, 0.4, 0)$ is one such selfish strategy. (2) If X's objective is eq. (18) with $\chi \ge 1$ and $\kappa = (1 - \frac{1}{\chi})R$, then Y's payoff difference relative to R is at most $\frac{1}{\chi}$ times that of X, meaning that X offers a benefit to Y at the expense of her own. Since in biological organisms altruism is defined as an individual performing an action at a cost to itself that benefits another individual, we call these altruist control strategies. (3) If X controls with the constraint $(1 - \frac{1}{\chi})R > \kappa > (1 - \frac{1}{\chi})P$, who wins the game is uncertain, since whether a payoff pair lies below the diagonal from $(P, P)$ to $(R, R)$ still depends on Y's strategy. We call these contingent control strategies.

More generally, it is also possible for controller X to set up combinatorial objectives, so that there are multiple linear upper and/or lower bounds on Y's possible payoffs. She can do this by generalizing the constraint coefficients $\gamma$ to

$$
G\, \tilde{v} \ge \mathbf{0}, \tag{19}
$$

where $\tilde{v} = (v_1, v_3, v_4)$ and G is a coefficient matrix whose entry $\gamma_{ij}$ is the j-th coefficient of the i-th control objective. With this formulation, the complex payoff control problem reduces to a standard linear program. As long as G constitutes a feasible payoff region, the combinatorial control objective can be realized. Under this framework of regional control with multiple constraints, payoff regions of various shapes can be generated. In particular, ZD strategies are extreme cases of regional control.
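Since every coefficient in eqs. (17) is affine in $(p_1, p_2, p_3, p_4)$, such a combinatorial design can indeed be handled with off-the-shelf linear programming. The sketch below (ours; the two objectives, the choice to maximize $p_1$, and the cap $p_2 \le 0.99$ are illustrative assumptions rather than the paper's construction) uses `scipy.optimize.linprog` to find a strategy that enforces an upper and a lower linear bound simultaneously through the sufficient sign conditions.

```python
import numpy as np
from scipy.optimize import linprog

R, S, T, P = 2, 1, 3, 0   # (R, T, S, P) = (2, 3, 1, 0)

def gamma_rows(chi, kappa):
    """Affine form gamma = A @ p + b of the coefficients in eq. (17), p = (p1, p2, p3, p4)."""
    mu = S - chi * T + chi * kappa
    c1 = (1 - chi) * R + chi * kappa
    c3 = T - chi * S + chi * kappa
    c4 = (1 - chi) * P + chi * kappa
    A = np.array([[mu, -c1, 0.0, 0.0],
                  [0.0, -c3, mu, 0.0],
                  [0.0, -c4, 0.0, mu]])
    b = np.array([c1 - mu, c3, c4])
    return A, b

# Objective 1: upper bound s_Y <= s_X/1.5 + 1  -> need gamma >= 0  ->  -A p <= b
# Objective 2: lower bound s_Y >= s_X/1.5      -> need gamma <= 0  ->   A p <= -b
A_up, b_up = gamma_rows(1.5, 1.0)
A_lo, b_lo = gamma_rows(1.5, 0.0)
A_ub = np.vstack([-A_up, A_lo])
b_ub = np.concatenate([b_up, -b_lo])

# Feasibility LP over p; the objective (maximize p1) and the cap p2 <= 0.99 are arbitrary.
res = linprog(c=[-1, 0, 0, 0], A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 1), (0, 0.99), (0, 1), (0, 1)])
print("feasible:", res.success, " control strategy p =", res.x)
```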
If each player has chosen a certain strategy and no one can benefit by changing his strategy while the other players keep theirs unchanged, then the current set of strategy choices and the corresponding payoffs constitute a Nash equilibrium. A strategy $p_N$ of player X is called a Nash strategy if with it player X controls the upper bound of player Y's payoff to R. Thus, under the general payoff control framework in eq. (19), any strategy with $s_Y \le R$ as a tight constraint is a Nash strategy. By the definition of Nash equilibrium, it is straightforward that any pair of Nash strategies constitutes a Nash equilibrium.

However, although a Nash strategy induces the fixed upper bound R on Y's payoff, it is possible for Y to choose a strategy other than full cooperation ($q = \mathbf{1}$) which still yields R for herself but gives the controller X a payoff smaller than R. This is why a Nash equilibrium is not necessarily a cooperative equilibrium, and selecting a cooperation-enforcing Nash strategy is therefore a problem in itself. The controller can enforce cooperation by setting

$$
s_Y \le \frac{1}{\chi} s_X + \left(1 - \frac{1}{\chi}\right) R, \qquad p_1 = 1, \tag{20}
$$

where $\chi \ge 1$. Under such strategies of X, the only best response of Y is to fully cooperate, whereby both players finally receive payoff R and the game is led to a win-win situation. We call these strategies cooperation-enforcing control.

Under the above framework, one can derive arbitrary regional control strategies, as long as the region has feasible linear boundaries. In Figure 2, we show several examples of regional control strategies.

Figure 2: Control of the region of possible payoff pairs. Each black dot is a possible payoff pair. (A) is a normal winning/selfish control $p = (1, 0.1, 0.75, 0)$. (B) is a normal altruist control $p = (1, 0.182, 1, 0)$. (C) is a TFT-like strategy with $p_1 = 1$, $p_4 = 0$ and $p_2 + p_3 = 1$. (D) is an extreme case of selfish control $p = (1, 0, 0.005, 0)$. (E) is an extreme case of altruist control $p = (1, 0.995, 1, 0)$. (A), (B), (C) and (D) are all cooperation-enforcing.

In 2(A) and 2(B), X sets the same linear upper and lower bounds for the payoff region: the red lines are upper bounds with $\chi = 1.5$ and $\kappa = 1$, and the blue lines are lower bounds with $\chi = 1.5$ and $\kappa = 0$. In 2(A), X uses a selfish control under which her payoff is always larger than that of Y. In 2(B), X uses an altruist control which always lets Y win. Both strategies are cooperation-enforcing, leading the game to evolve to a mutual-cooperation equilibrium. In 2(C), we shrink the controlled region to an extreme case by setting the upper and lower bounds identically, with $\chi = 1.0$ and $\kappa = 0$. The solution shows that, as long as $p_1 = 1$, $p_4 = 0$ and $p_2 + p_3 = 1$, the control strategies have a similar effect to traditional Tit-for-Tat (TFT): they equalize the two players' expected payoffs. In 2(D), X uses an extreme case of selfish control, while in 2(E) X uses an extreme case of altruist control. These two cases have also been investigated as partner and rival strategies in [Hilbe et al., 2015].

## 5 Control in Axelrod's Tournaments

To this day, it has remained a fundamental challenge for many disciplines to understand how various strategies perform in multi-agent interactions, what the best strategy in repeated games is, and why it is the best. The most influential experiments for strategy evaluation are Robert Axelrod's iterated prisoner's dilemma computer tournaments [Axelrod and Hamilton, 1981]. Based on the payoff control framework, in this section we derive several control strategies and simulate them in a tournament. The simulated tournament is similar to that in [Stewart and Plotkin, 2012] but uses a different IPD setting, $(R, T, S, P) = (2, 3, 1, 0)$. Besides classic strategies, Stewart and Plotkin implemented two ZD strategies: Extort-2, with $s_X - P = 2(s_Y - P)$, and Generous-2, with $s_X - R = 2(s_Y - R)$. Their simulation shows that the best performance comes from Generous-2, followed by GTFT and TFT. We add four regional control strategies to the tournament: Altruist G, Selfish G, Altruist TFT and Selfish TFT. Here G denotes the control objective matrix from eq. (19). Altruist G is derived with respect to two objectives whose boundaries are the lines $s_X - R = 2(s_Y - R)$ and $s_X - R = \frac{4}{3}(s_Y - R)$, while Selfish G is derived with respect to objectives with boundaries $s_X - P = 2(s_Y - P)$ and $s_X - P = \frac{4}{3}(s_Y - P)$. Selfish TFT and Altruist TFT use the same strategies as in Figures 2(A) and 2(B), respectively. It is worth noting that Altruist G is essentially a regional control expansion based on Generous-2, while Selfish G is expanded from Extort-2. Both Selfish TFT and Altruist TFT can be viewed as expansions of the original TFT.
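For reference, tournament scores of the kind reported below can be estimated by simulating 200-stage matches between memory-one strategies. The following sketch (ours; the convention of starting from mutual cooperation and the example pairings are assumptions) plays a single match and returns the two average payoffs.

```python
import numpy as np

R, S, T, P = 2, 1, 3, 0
PAYOFF_X = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}
IDX = {('C', 'C'): 0, ('C', 'D'): 1, ('D', 'C'): 2, ('D', 'D'): 3}

def play_match(p, q, stages=200, seed=0):
    """Average payoffs of one IPD match between memory-one strategies p (X) and q (Y).
    p is indexed by the previous outcome from X's view, q from Y's view."""
    rng = np.random.default_rng(seed)
    state = ('C', 'C')          # starting from mutual cooperation (our convention)
    total_x = total_y = 0.0
    for _ in range(stages):
        ax = 'C' if rng.random() < p[IDX[state]] else 'D'
        ay = 'C' if rng.random() < q[IDX[(state[1], state[0])]] else 'D'
        total_x += PAYOFF_X[(ax, ay)]
        total_y += PAYOFF_X[(ay, ax)]
        state = (ax, ay)
    return total_x / stages, total_y / stages

altruist_tft = [1, 0.182, 1, 0]                  # the strategy of Figure 2(B)
print(play_match(altruist_tft, [1, 0, 1, 0]))    # vs. TFT: mutual cooperation, (2.0, 2.0)
print(play_match(altruist_tft, [0, 0, 0, 0]))    # vs. ALLD
```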
| Name | p | Score | Wins |
|---|---|---|---|
| ALTRUIST G | (1, 2/15, 1, 1/3) | | 0 |
| GENEROUS-2 | (1, 2/7, 1, 2/7) | 1.60 | 0 |
| ALTRUIST TFT | (1, 0.182, 1, 0) | | 0 |
| GTFT | (1, 2/3, 1, 2/3) | 1.46 | 0 |
| TFT | (1, 0, 1, 0) | 1.44 | 0 |
| TF2T | | 1.43 | 0 |
| HARD TF2T | | 1.37 | 0 |
| SELFISH TFT | (1, 0.1, 0.75, 0) | | 1 |
| WSLS | (1, 0, 0, 1) | 1.29 | 0 |
| HARD PROBE | | 1.26 | 8 |
| ALLC | (1, 1, 1, 1) | 1.18 | 0 |
| PROBE2 | | 1.11 | 4 |
| GRIM | (1, 0, 0, 0) | 1.08 | 4 |
| HARD TFT | | 1.08 | 4 |
| RANDOM | (1/2, 1/2, 1/2, 1/2) | 0.92 | 10 |
| HARD MAJO | | 0.91 | 13 |
| PROBE | | 0.81 | 6 |
| CALCULATOR | | 0.76 | 12 |
| PROBE3 | | 0.72 | 10 |
| HARD JOSS | (0.9, 0, 1, 0) | 0.72 | 14 |
| SELFISH G | (5/7, 0, 13/15, 0) | | 15 |
| ALLD | (0, 0, 0, 0) | 0.45 | 20 |
| EXTORT-2 | (6/7, 1/2, 5/14, 0) | 0.45 | 19 |

Table 1: Results of the IPD tournament.

Due to the inherent stochasticity of some strategies, the tournament is repeated 1000 times. In each tournament, every strategy in the above set meets every other strategy (including itself) in a perfect iterated prisoner's dilemma (IPD) game, and each IPD game has 200 stages. The average results are shown in Table 1; the four control strategies derived under our framework are Altruist G, Altruist TFT, Selfish TFT and Selfish G. One can see that Altruist G has the best performance: it beats Generous-2 and scores much higher than either TFT or GTFT. Altruist TFT also performs better than TFT and GTFT. Selfish TFT is a little tougher than TFT and has a slightly higher number of wins. Analogously, using the above payoff control framework, one could also generate other regional control strategies that outperform the corresponding ZD strategies. Although no strategy is universally best in such tournaments, since a player's performance depends on the strategies of all her opponents as well as on the environment of the game, the control framework still provides a new perspective for formally identifying new, nice strategies.

In perfect environments, TFT has long been recognized as the most remarkable basic strategy. Starting with cooperation, it can constitute a Nash equilibrium strategy that enforces long-run cooperation. Nevertheless, TFT is not flawless. Its first drawback, which is not apparent in a perfect environment, is that if one of two interacting TFT players suffers from a trembling hand or imperfect observation, a false defection leads to a long sequence of alternating cooperation and defection, and both players then receive payoffs much lower than under mutual cooperation. This indicates that TFT is not a subgame perfect equilibrium strategy. Another weakness of TFT is that a population of TFT players can be replaced by ALLC through random drift.
Once ALLC has increased beyond some threshold, ALLD can invade the population. The reason TFT is vulnerable to noise is that, when facing the opponent's unilateral defection (CD), a TFT player is too vengeful and defects for sure ($p_2 = 0$). The reason TFT can be invaded by ALLC players is that it is not greedy at all: when it occasionally takes advantage of the opponent (DC), it completely stops defecting and returns to cooperation ($p_3 = 1$). To overcome these drawbacks, a nice strategy in a noisy environment should embody three features: (1) it should be cooperation-enforcing, i.e., its objective payoff region should have the tight upper bound $s_Y \le \frac{1}{\chi} s_X + (1 - \frac{1}{\chi})R$ with $\chi \ge 1$; (2) it should not be too vengeful, i.e., $p_2 > 0$, meaning its objective payoff region should not be too far from the $(S, T)$ point; and (3) it should be somewhat greedy, i.e., $p_3 < 1$, meaning its objective payoff region should not be too far from the $(T, S)$ point.

## 6 Control against Human-like Players

In the real world, a player who is not aware of any nice strategy in effect updates his stage action dynamically according to the learned history of the long-run game, gradually evolving his own optimal plan for interacting with the opponent. In artificial intelligence, this learning and planning procedure is usually investigated with reinforcement learning models, which are the state of the art for human-like play when agents confront complex environments or strategic opponents. To better understand how the payoff control strategies would perform in the real world, in this section we simulate several repeated games between payoff control players and reinforcement learning players.

Let X be the payoff controller, who uses a payoff control strategy obtained beforehand, and let Y be the reinforcement learner, who evolves his strategy (plan) q according to the reinforcement learning dynamics; q is a mapping from the game history to the probability of selecting $a = C$. Y's objective is to find an optimal $q^*$ which maximizes his average stage payoff:

$$
q^* = \arg\max_q \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_q\!\left[s_Y^t\right], \tag{21}
$$

where $s_Y^t$ is Y's realized stage payoff at time t and $\mathbb{E}_q$ is the expectation with respect to q. Y's strategy q is updated according to the following average-reward value function:

$$
Q(\omega, a) \leftarrow (1 - \alpha)\, Q(\omega, a) + \alpha \left[\tilde{r} + \max_{a'} Q(\omega', a')\right], \tag{22}
$$

where $Q(\omega, a)$ is the evaluation value of player Y choosing action a after stage-game outcome $\omega$, and $\tilde{r} = r(\omega, a, \omega') - \bar{r}$ is the difference between the instantaneous reward $r(\omega, a, \omega')$, induced by player Y taking action a after outcome $\omega$ and transiting the game to a new outcome $\omega'$, and the estimated average reward $\bar{r}$. The parameter $\alpha$ is the learning rate. With the Q values updated stage after stage, Y can improve his strategy dynamically [Gosavi, 2004].

We implement the above algorithm, simulate four repeated games, and show the results in Figure 3.

Figure 3: Control against human-like players. In each subfigure, the x-axis is the number of rounds (up to 50000) and the y-axis is the average payoff of payoff controller X and reinforcement learner Y. Y uses a reinforcement learning strategy. In (A) X uses a TFT-like strategy $p = (1, 0.2, 0.8, 0)$. In (B) X uses a selfish control $p = (1, 0.1, 0.6, 0)$ which makes mutual cooperation ($s_X = R$, $s_Y = R$) the optimal outcome for Y. In (C) X uses a selfish control $p = (0.3, 0, 1, 0)$ which makes the payoff pair ($s_X = 1$, $s_Y = 1$) the optimal outcome for Y. In (D) X uses a selfish control $p = (5/7, 0, 1, 0)$ which guarantees that Y's payoff is much lower than that of X.

In each of these four games, player Y uses the reinforcement learning strategy described above. In 3(A) and 3(B), the strategies used by controller X are both cooperation-enforcing. Under X's TFT-like control strategy in 3(A), X and Y always have almost the same average payoff, while under X's winning yet cooperation-enforcing strategy in 3(B), X dominates Y for a long time but the game finally converges to mutual cooperation. In 3(C), X's objective is to set $s_X = s_Y = 1$, and the game indeed converges as X wishes. In 3(D), X uses a very tough selfish control with which she beats Y by a large margin. In this situation, as the intelligent agent Y improves his own payoff step by step, he improves that of the controller even more. In a word, when playing against human-like reinforcement learning players, payoff control players can lead the game to evolve to their objective outcomes.
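The following sketch (ours) shows one way to instantiate the learner of eq. (22) against a fixed memory-one controller; the $\varepsilon$-greedy exploration, the learning rates, the separate step size for the average-reward estimate and the initial state are our assumptions, not specifics given in the paper.

```python
import numpy as np

R, S, T, P = 2, 1, 3, 0
PAYOFF_Y = np.array([[R, S], [T, P]])   # rows: Y's action (0 = C, 1 = D); cols: X's action

def learn_against(p, stages=50_000, alpha=0.1, beta=0.01, eps=0.05, seed=0):
    """Average-reward Q-learning for Y (eq. 22) against a fixed memory-one controller p.
    States are the previous outcomes CC, CD, DC, DD from X's point of view."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((4, 2))      # Q(omega, a), a in {C, D}
    rho = 0.0                 # running estimate of the average reward (step size beta is ours)
    state = 0                 # start from CC (our assumption)
    avg_y = 0.0
    for t in range(1, stages + 1):
        a_y = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[state]))
        a_x = 0 if rng.random() < p[state] else 1
        r = PAYOFF_Y[a_y, a_x]
        new_state = 2 * a_x + a_y                 # CC = 0, CD = 1, DC = 2, DD = 3
        # Eq. (22): Q <- (1 - alpha) Q + alpha [ (r - rho) + max_a' Q(omega', a') ]
        Q[state, a_y] = (1 - alpha) * Q[state, a_y] + alpha * ((r - rho) + Q[new_state].max())
        rho += beta * (r - rho)
        state = new_state
        avg_y += (r - avg_y) / t
    return avg_y, Q

# Figure 3(B): against p = (1, 0.1, 0.6, 0), the paper reports convergence to mutual cooperation.
print(learn_against([1, 0.1, 0.6, 0])[0])
```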
## 7 Conclusions

We propose a general framework for controlling the linear boundaries of the region in which the repeated-game players' possible payoff pairs lie. By generating payoff control strategies under this framework, a single player can unilaterally set arbitrary boundaries on the two players' payoff relation and thereby realize her control objective, including limiting the maximum and minimum payoffs of the opponent, always winning the game, offering an altruist share to the opponent, forcing the game to converge to a win-win outcome, and so on. The idea in this work is not limited to the iterated prisoner's dilemma; it can be introduced into other two-player repeated games and also generalized to repeated multiplayer games, such as iterated public goods games. Future research on payoff control in repeated games with imperfect monitoring [Hao et al., 2015] or with different memory sizes [Li and Kendall, 2014], and research into better winning or cooperation-enforcing strategies [Crandall et al., 2018; Mathieu and Delahaye, 2015], could be inspired by this work. Furthermore, all the control strategies rest on the important premise that the player has a theory of mind [Devaine et al., 2014]. Therefore, how to identify more cognitively complex human-like strategies in the context of the IPD, such as intention recognition [Han et al., 2012], is of great value for future research.

## Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (NNSFC) under Grants No. 71601029 and No. 61761146005. Kai Li also acknowledges funding from Ant Financial Services Group.

## References

[Adami and Hintze, 2013] Christoph Adami and Arend Hintze. Evolutionary instability of zero-determinant strategies demonstrates that winning is not everything. Nature Communications, 4:2193, 2013.

[Akin, 2012] Ethan Akin. Stable cooperative solutions for the iterated prisoner's dilemma. arXiv preprint, 1211.0969, 2012.

[Axelrod and Hamilton, 1981] Robert Axelrod and William Donald Hamilton. The evolution of cooperation. Science, 211(4489):1390-1396, 1981.

[Boerlijst et al., 1997] Maarten Boerlijst, Martin A. Nowak, and Karl Sigmund. Equal pay for all prisoners. The American Mathematical Monthly, 104(4):303-305, 1997.

[Chen and Zinger, 2014] Jing Chen and Aleksey Zinger. The robustness of zero-determinant strategies in iterated prisoner's dilemma games. Journal of Theoretical Biology, 357:46-54, 2014.

[Claus and Boutilier, 1998] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th National Conference on Artificial Intelligence, pages 746-752, 1998.

[Crandall et al., 2018] Jacob W. Crandall, Mayada Oudah, Fatimah Ishowo-Oloko, Sherief Abdallah, Jean-François Bonnefon, Manuel Cebrian, Azim Shariff, Michael A. Goodrich, Iyad Rahwan, et al. Cooperating with machines. Nature Communications, 9(1):233, 2018.

[Devaine et al., 2014] Marie Devaine, Guillaume Hollard, and Jean Daunizeau. Theory of mind: did evolution fool us? PLoS ONE, 9(2):e87619, 2014.

[Gosavi, 2004] Abhijit Gosavi. Reinforcement learning for long-run average cost. European Journal of Operational Research, 155(3):654-674, 2004.
[Han et al., 2012] The Anh Han, Luis Moniz Pereira, and Francisco C. Santos. Corpus-based intention recognition in cooperation dilemmas. Artificial Life, 18(4):365-383, 2012.

[Hao et al., 2014] Dong Hao, Zhihai Rong, and Tao Zhou. Zero-determinant strategy: An underway revolution in game theory. Chinese Physics B, 23(7):078905, 2014.

[Hao et al., 2015] Dong Hao, Zhihai Rong, and Tao Zhou. Extortion under uncertainty: Zero-determinant strategies in noisy games. Physical Review E, 91(5):052803, 2015.

[Hilbe et al., 2013] Christian Hilbe, Martin A. Nowak, and Karl Sigmund. Evolution of extortion in iterated prisoner's dilemma games. Proceedings of the National Academy of Sciences, 110(17):6913-6918, 2013.

[Hilbe et al., 2014] Christian Hilbe, Bin Wu, Arne Traulsen, and Martin A. Nowak. Cooperation and control in multiplayer social dilemmas. Proceedings of the National Academy of Sciences, 111(46):16425-16430, 2014.

[Hilbe et al., 2015] Christian Hilbe, Arne Traulsen, and Karl Sigmund. Partners or rivals? Strategies for the iterated prisoner's dilemma. Games and Economic Behavior, 92:41-52, 2015.

[Kandori, 2002] Michihiro Kandori. Introduction to repeated games with private monitoring. Journal of Economic Theory, 102(1):1-15, 2002.

[Li and Kendall, 2014] Jiawei Li and Graham Kendall. The effect of memory size on the evolutionary stability of strategies in iterated prisoner's dilemma. IEEE Transactions on Evolutionary Computation, 18(6):819-826, 2014.

[Mailath and Samuelson, 2006] George J. Mailath and Larry Samuelson. Repeated Games and Reputations: Long-Run Relationships. Oxford University Press, 2006.

[Mathieu and Delahaye, 2015] Philippe Mathieu and Jean-Paul Delahaye. New winning strategies for the iterated prisoner's dilemma. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1665-1666, 2015.

[Nowak et al., 1995] Martin A. Nowak, Karl Sigmund, and Esam El-Sedy. Automata, repeated games and noise. Journal of Mathematical Biology, 33(7):703-722, 1995.

[Pan et al., 2015] Liming Pan, Dong Hao, Zhihai Rong, and Tao Zhou. Zero-determinant strategies in iterated public goods game. Scientific Reports, 5:13096, 2015.

[Press and Dyson, 2012] William H. Press and Freeman J. Dyson. Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409-10413, 2012.

[Stewart and Plotkin, 2012] Alexander J. Stewart and Joshua B. Plotkin. Extortion and cooperation in the prisoner's dilemma. Proceedings of the National Academy of Sciences, 109(26):10134-10135, 2012.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.