# deception_in_finitely_repeated_security_games__3b8b23fd.pdf

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

## Deception in Finitely Repeated Security Games

Thanh H. Nguyen,¹ Yongzhao Wang,² Arunesh Sinha,² Michael P. Wellman²
¹University of Oregon, thanhhng@cs.uoregon.edu
²University of Michigan, {wangyzh,arunesh,wellman}@umich.edu

## Abstract

Allocating resources to defend targets from attack is often complicated by uncertainty about the attacker's capabilities, objectives, or other underlying characteristics. In a repeated-interaction setting, the defender can collect attack data over time to reduce this uncertainty and learn an effective defense. However, a clever attacker can manipulate the attack data to mislead the defender, influencing the learning process toward its own benefit. We investigate strategic deception on the part of an attacker with private type information who interacts repeatedly with a defender. We present a detailed computation and analysis of both players' optimal strategies given that the attacker may play deceptively. Computational experiments illuminate conditions conducive to strategic deception and quantify the benefits to the attacker. By taking into account the attacker's deception capacity, the defender can significantly mitigate the loss from misleading attack actions.

## Introduction

Real-world security domains are often characterized by imperfect information: uncertainty (particularly on the defender's part) about actions taken or underlying characteristics of the opposing agent. Experience observed through repeated interaction in such domains provides an opportunity for the defender to learn about the behaviors and characteristics of attackers (Kar et al., 2017; Gholami et al., 2017; Haghtalab et al., 2016; Nguyen et al., 2016; Xu, Tran-Thanh, and Jennings, 2016; Balcan et al., 2015; Blum, Haghtalab, and Procaccia, 2014; Marecki, Tesauro, and Segal, 2012; Letchford, Conitzer, and Munagala, 2009). For example, in wildlife protection (Fang et al., 2016), repeated interaction with poachers allows the defense authorities to observe poaching signs and patterns over time. From these observations, the defender may infer features of the poacher's capabilities and preferences, and thus design more effective patrolling strategies.

To the extent that the defender relies on data, however, the attacker may choose to modify its behavior to mislead the defender. That is, in a particular interaction the attacker may select an action that does not actually yield the best immediate reward, to avoid revealing sensitive private information. Such deceptive behavior could manipulate the outcome of learning to the long-term benefit of the attacker. A savvy defender, therefore, would take into account the attacker's manipulative strategy in designing his own strategy. We label an attacker strategy as non-manipulative if it chooses an action without regard to the defender's learning, for example if its behavior in each stage is a myopic best response. Previous work on learning in security games has generally treated the attacker as non-manipulative in this sense (Blum, Haghtalab, and Procaccia, 2014; Marecki, Tesauro, and Segal, 2012).

We study the strategic deployment of attacker deception in finitely repeated security games. We adopt an incomplete-information model, where the defender has underlying uncertainty about the attacker's type.
At each time step, the defender updates his belief about the attacker's type based on attack data collected at previous steps. Based on the updated belief, the defender chooses an action to play. The attacker decides its own action, aware that the defender is collecting attack data to infer the attacker's type. The ultimate goal of both players is to maximize expected utility accumulated over the whole time horizon. A pair of strategies that best-respond to each other, accounting for observations, constitutes a Perfect Bayesian Nash Equilibrium (PBNE).

The paper includes four main contributions. First, we present a non-linear optimization program to find a PBNE of the finitely repeated simultaneous-move game. Second, we present a result that provides an easy technique to find a sequential equilibrium of the game based on a computed PBNE. Third, we show that there exists a PBNE in which the players' equilibrium strategies depend only on histories of the attacker's actions. This allows us to represent both players' strategies in a compact form, which significantly speeds up the equilibrium computation. Fourth, we provide a preliminary extension to the Stackelberg (sequential-move) setting. Finally, we present a detailed experimental analysis of strategic deception, showing how various game factors affect the tendency of the attacker to deviate from myopic best responses in order to mislead the defender. Our results show that the defender suffers a significant loss, and the attacker obtains a significant benefit, if the defender does not address the attacker's deception. By taking deceptive attacks into account, this loss and benefit are reduced drastically.

## Related Work

**Learning in security games.** Most existing work on learning in security games follows a Stackelberg model and assumes the attacker plays myopically at every time step (Kar et al., 2017; Gholami et al., 2017; Haghtalab et al., 2016; Nguyen et al., 2016; Blum, Haghtalab, and Procaccia, 2014; Marecki, Tesauro, and Segal, 2012; Letchford, Conitzer, and Munagala, 2009). Balcan et al. (2015) and Xu, Tran-Thanh, and Jennings (2016) study the problem of learning with no prior knowledge of the attacker's behavior; they take a regret-minimization approach to determine the defender's strategies at each time step.

**Secrecy and deception in security games.** Previous work studies security scenarios in which the information available to the defender and attacker is asymmetric (Guo et al., 2017; Xu et al., 2015; Rabinovich et al., 2015; Hendricks and McAfee, 2006; Brown et al., 2005; Farrell and Rabin, 1996; Zhuang, Bier, and Alagoz, 2010). The defender can exploit this information asymmetry by strategically revealing or disguising his information to the attacker, inducing attacker responses that favor the defender. For example, in the model of Guo et al. (2017), the defender can disguise defense resources to deceive the attacker about the defender's type. We study the opposite scenario, in which the attacker acts deceptively to mislead the defender.

**Repeated games with incomplete information.** Previous work has studied infinitely repeated games with incomplete information (Sorin, 2002; Aumann and Maschler, 1995; Jordan, 1995; Zamir, 1992; Forges, 1988). These studies analyze properties and the convergence of players' strategies in an infinitely repeated game setting. We study the problem of one-sided incomplete information (i.e., uncertainty about the attacker's type) in finitely repeated security games.
**Adversarial machine learning.** There have been several studies on adversarial machine learning, investigating different attack scenarios against machine learning algorithms (Brückner, Kanzow, and Scheffer, 2012; Brückner and Scheffer, 2011; Barreno et al., 2010, 2006; Lowd and Meek, 2005). For example, causative attacks alter the training process by influencing the training data, while exploratory attacks attempt to discover information about the learner and its training data. That line of work then proposes machine learning algorithms that can resist such sophisticated attacks. Our work focuses on a causative attack scenario in security games: we aim to obtain effective defense strategies that minimize the damage of deceptive attacks, given some learning outcome of the attack data.

## Game Model

In a finitely repeated simultaneous-move security game, there is a set of $N$ targets, denoted by $N = \{1, \dots, N\}$. A defender attempts to protect these targets by allocating limited security resources over them; conversely, an attacker aims at attacking these targets. We denote by $K < N$ the number of the defender's security resources. At each time step $t$ in a finite time horizon $T = \{1, \dots, T\}$, both the defender and the attacker have to decide which action to take. An action of the defender, $s$, is an allocation of the $K$ resources over $N$; we denote by $S$ the set of all feasible actions of the defender. An action of the attacker is a target to attack. There is a set of attacker types $\Lambda = \{1, \dots, L\}$. Each type $\lambda \in \Lambda$ has a prior probability $p^\lambda \in (0, 1)$ such that $\sum_\lambda p^\lambda = 1$. At the beginning, Nature randomly draws a type to play the game according to the prior distribution $\{p^\lambda\}_\lambda$. The attacker knows its type while the defender does not; the defender is aware of $\{p^\lambda\}_\lambda$.

**Player payoffs.** Each target $i \in N$ is associated with rewards and penalties of the defender, $\langle R^d(i), P^d(i)\rangle$, and of the attacker, $\langle R^\lambda(i), P^\lambda(i)\rangle$, for every type $\lambda \in \Lambda$. When the attacker of type $\lambda \in \Lambda$ attacks $i$, if the defender is protecting $i$, the attacker receives a penalty $P^\lambda(i)$ while the defender obtains a reward $R^d(i)$. Conversely, if the defender is not protecting target $i$, the attacker gets $R^\lambda(i) > P^\lambda(i)$ while the defender receives $P^d(i) < R^d(i)$.

**Player observations.** At each $t + 1 \in T$, both players observe their actions at previous time steps, $h_t = \{(s_1, i_1), \dots, (s_t, i_t)\}$, where $s_{t'}$ and $i_{t'}$ are the defender and attacker actions, respectively, at time step $t'$. We denote by $H_t$ the set of all possible histories of length $t$ and by $H = \{H_t\}$ (where $t = 0, \dots, T-1$) the set of all histories. In particular, $H_0 = \emptyset$. We denote by $h^a_t = \{i_1, \dots, i_t\}$ a history of the attacker's actions and by $H^a_t$ the set of all such attack histories.

**Behavioral strategies.** At each step $t + 1$, given a history $h_t \in H_t$, a behavioral strategy of the defender is a probability distribution $x(h_t) = \{x(s \mid h_t) : \sum_s x(s \mid h_t) = 1,\ x(s \mid h_t) \in [0, 1],\ s \in S\}$ over the defender's action set $S$; $x(s \mid h_t)$ is the probability that the defender takes action $s \in S$ given the history $h_t$. Similarly, a behavioral strategy of the attacker of type $\lambda$ is a probability distribution $y^\lambda(h_t) = \{y^\lambda(i \mid h_t) : \sum_i y^\lambda(i \mid h_t) = 1,\ y^\lambda(i \mid h_t) \in [0, 1],\ i \in N\}$ over the attacker's action set $N$; $y^\lambda(i \mid h_t)$ is the probability that the attacker of type $\lambda$ attacks target $i$ given $h_t$. We denote by $x = \{x(h_t)\}$ and $y^\lambda = \{y^\lambda(h_t)\}$ strategies of the defender and of the attacker of type $\lambda$ over all $h_t \in H$, respectively. Finally, $X$ and $Y = \{Y^\lambda\}$ denote the sets of all strategies $x$ and $y^\lambda$, respectively.
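To make the model concrete, here is a minimal sketch of the game data in Python. All names (`SecurityGame`, `rewards_d`, and so on) are hypothetical illustrations rather than anything from the paper, and the sketch assumes each resource covers exactly one target, so a defender action is a size-$K$ subset of targets.

```python
import itertools
import random

class SecurityGame:
    """Hypothetical container for the game model described above."""

    def __init__(self, num_targets, num_resources, horizon,
                 rewards_d, penalties_d, rewards_a, penalties_a, prior):
        assert num_resources < num_targets                # K < N
        assert abs(sum(prior.values()) - 1.0) < 1e-9      # sum_lambda p^lambda = 1
        self.targets = list(range(num_targets))           # N = {0, ..., N-1}
        self.K = num_resources
        self.T = horizon                                  # number of time steps
        self.Rd, self.Pd = rewards_d, penalties_d         # defender payoffs per target
        self.Ra, self.Pa = rewards_a, penalties_a         # attacker payoffs per (type, target)
        self.prior = prior                                # prior over attacker types

    def defender_actions(self):
        """The set S: all allocations of K resources over the targets."""
        return list(itertools.combinations(self.targets, self.K))

    def draw_type(self):
        """Nature draws the attacker type according to the prior."""
        types, probs = zip(*self.prior.items())
        return random.choices(types, weights=probs, k=1)[0]
```

A behavioral strategy can then be stored as a dictionary mapping each history to a distribution over `defender_actions()` (for the defender) or over `targets` (for each attacker type).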
**Player expected utilities.** Let $x$ and $y = \{y^\lambda\}$ be the defender's and attacker's behavioral strategies, respectively. At each $t + 1$, the defender can update his belief about the attacker's type using Bayes' rule:

$$p(\lambda \mid h_t) \propto p^\lambda \prod_{t'=1}^{t} y^\lambda(i_{t'} \mid h_{t'-1})$$

where $h_0 = \emptyset$ and $h_{t'} = \{h_{t'-1}, i_{t'}\}$. Let

$$EU^d_i(x, h_t) = \Big[\sum_{s : i \in s} x(s \mid h_t)\Big]\big(R^d(i) - P^d(i)\big) + P^d(i)$$

$$EU^\lambda_i(x, h_t) = \Big[\sum_{s : i \in s} x(s \mid h_t)\Big]\big(P^\lambda(i) - R^\lambda(i)\big) + R^\lambda(i)$$

be the immediate expected utilities of the defender and of the attacker of type $\lambda$, respectively, at target $i$ at step $t + 1$ given that the defender plays $x(h_t)$. Based on the immediate expected utilities at every target, the players' total expected utilities over $T$ can be computed using backward induction as follows. At the last time step $T$, the total expected utilities of the defender and the attacker with respect to history $h_{T-1} \in H_{T-1}$ equal their immediate expected utilities at $h_{T-1}$:

$$U^d_T(x, y \mid h_{T-1}) = \sum_{\lambda, i} p(\lambda \mid h_{T-1})\, y^\lambda(i \mid h_{T-1})\, EU^d_i(x, h_{T-1})$$

$$U^\lambda_T(x, y \mid h_{T-1}) = \sum_i y^\lambda(i \mid h_{T-1})\, EU^\lambda_i(x, h_{T-1}), \quad \forall \lambda$$

At time step $t + 1 < T$, the total expected utilities of both players with respect to history $h_t \in H_t$ consist of (i) the immediate expected utility at $t + 1$ and (ii) the future expected utility after $t + 1$:

$$U^d_{t+1}(x, y \mid h_t) = \sum_{\lambda, i} p(\lambda \mid h_t)\, y^\lambda(i \mid h_t)\, EU^d_i(x, h_t) + \sum_{s, \lambda, i} x(s \mid h_t)\, p(\lambda \mid h_t)\, y^\lambda(i \mid h_t)\, U^d_{t+2}(x, y \mid h_t, (s, i))$$

$$U^\lambda_{t+1}(x, y \mid h_t) = \sum_i y^\lambda(i \mid h_t)\, EU^\lambda_i(x, h_t) + \sum_{s, i} x(s \mid h_t)\, y^\lambda(i \mid h_t)\, U^\lambda_{t+2}(x, y \mid h_t, (s, i)), \quad \forall \lambda$$

**Player's goals.** Given any history $h_t$, both players aim at choosing strategies $x(h_t)$, $\{y^\lambda(h_t)\}$ that maximize their total expected utility at $h_t$. In this scenario, the attacker is no longer myopic; it has to reason about all future possibilities to decide which behavioral strategy to play at each $h_t$. Such attack strategies (which may not be myopically optimal) are chosen to mislead the defender about the attacker's type, ultimately benefiting the attacker in future steps. These optimal behavioral strategies of the players form a PBNE.
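The belief update and the backward-induction evaluation above can be sketched directly in Python. The functions below are illustrative only (all names are hypothetical); they assume the `game` container sketched earlier and, for simplicity, key strategies by the attacker's action history alone, anticipating the compact representation established in Theorem 2 below, with the defender strategy stored as marginal coverage per target.

```python
from math import prod

def posterior(prior, y, h):
    """Bayes update p(lambda | h) from the attacker's past attacks.

    y[lam] maps an attack-history tuple to a dict {target: probability};
    h is the tuple (i_1, ..., i_t) of past attacks.
    """
    unnorm = {lam: prior[lam] * prod(y[lam][h[:k]][h[k]] for k in range(len(h)))
              for lam in prior}
    z = sum(unnorm.values())
    return {lam: w / z for lam, w in unnorm.items()} if z > 0 else dict(prior)

def defender_value(game, x, y, h=()):
    """Total expected defender utility at attack history h, by backward induction.

    x[h] is a dict {target: marginal coverage}, so the immediate utility follows
    the marginal-coverage form of Equation (5) given later in the paper.
    """
    belief = posterior(game.prior, y, h)
    attack_prob = {i: sum(belief[lam] * y[lam][h][i] for lam in belief)
                   for i in game.targets}
    immediate = sum(attack_prob[i] *
                    (x[h][i] * (game.Rd[i] - game.Pd[i]) + game.Pd[i])
                    for i in game.targets)
    if len(h) + 1 == game.T:       # last time step: no future term
        return immediate
    future = sum(attack_prob[i] * defender_value(game, x, y, h + (i,))
                 for i in game.targets)
    return immediate + future
```

The attacker's value for a given type is computed analogously, except that the expectation is taken over that type's own strategy only, with no belief term.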
## Game Equilibria

**Definition 1 (PBNE).** Behavioral strategies $x^*$ of the defender and $y^*$ of the attacker form a PBNE of the game if and only if, for every $h_t \in H_t$ that occurs:

- $x^*$ is a best response of the defender: $U^d_{t+1}(x^*, y^* \mid h_t) \geq U^d_{t+1}(x, y^* \mid h_t)$ for all $x \in X$;
- $y^{\lambda,*}$ is a best response of the attacker of type $\lambda$: $U^\lambda_{t+1}(x^*, y^* \mid h_t) \geq U^\lambda_{t+1}(x^*, y \mid h_t)$ for all $y \in Y$.

Since the action sets of both players are finite, a PBNE of the game always exists. Our first result extends a given PBNE to a refined sequential equilibrium.

**Theorem 1.** For each PBNE, there is a sequential equilibrium in which the players' strategies are identical to those in the PBNE at histories that occur with positive probability.

**Proof.** We denote by $(x, y)$ a PBNE of the game. We follow the trembling-hand approach to find a corresponding sequential equilibrium of $(x, y)$. Let $r^d(h_t)$ and $r^\lambda(h_t)$ be the ratios of the number of zero probabilities to the number of non-zero probabilities in $x(h_t)$ and $y^\lambda(h_t)$, respectively. For each $\epsilon > 0$, we construct new fully mixed behavioral strategies of the defender and the attacker, $(x_\epsilon, y_\epsilon)$, as follows:

$$x_\epsilon(s \mid h_t) = \begin{cases} \epsilon, & \text{if } x(s \mid h_t) = 0 \\ x(s \mid h_t) - \epsilon\, r^d(h_t), & \text{if } x(s \mid h_t) > 0 \end{cases}$$

$$y^\lambda_\epsilon(i \mid h_t) = \begin{cases} \epsilon, & \text{if } y^\lambda(i \mid h_t) = 0 \\ y^\lambda(i \mid h_t) - \epsilon\, r^\lambda(h_t), & \text{if } y^\lambda(i \mid h_t) > 0 \end{cases}$$

Here $\epsilon$ is chosen small enough that all resulting probabilities are positive. We use breadth-first search (according to the time horizon $T$) to examine the whole history set. When encountering a history with a zero probability of occurrence, $p(h_t) = 0$ according to $(x, y)$, we construct a new belief of the defender over attacker types at $h_t$ as follows:

$$p^*(\lambda \mid h_t) = \lim_{\epsilon \to 0} p_\epsilon(\lambda \mid h_t) = \lim_{\epsilon \to 0} \frac{p^\lambda \prod_{t'} y^\lambda_\epsilon(i_{t'} \mid h_{t'-1})}{\sum_{\lambda'} p^{\lambda'} \prod_{t'} y^{\lambda'}_\epsilon(i_{t'} \mid h_{t'-1})}$$

We find a PBNE of the corresponding subgame starting from this history $h_t$ with this new belief. We then replace the strategies of the subgame in $(x, y)$ with these new equilibrium strategies, and update $(x_\epsilon, y_\epsilon)$ accordingly. This process continues until all histories are examined. The resulting strategies $(x^*, y^*)$ with belief $\{p^*(\lambda \mid h_t)\}$ belong to a sequential equilibrium of the game. Indeed, it is straightforward to prove that

$$x^*(s \mid h_t) = \lim_{\epsilon \to 0} x_\epsilon(s \mid h_t), \qquad y^{\lambda,*}(i \mid h_t) = \lim_{\epsilon \to 0} y^\lambda_\epsilon(i \mid h_t), \qquad p^*(\lambda \mid h_t) = \lim_{\epsilon \to 0} p_\epsilon(\lambda \mid h_t) \ \text{(by definition)}.$$

Furthermore, the updating process only replaces strategies in $(x, y)$ at histories $h_t$ with a zero probability of occurrence by a PBNE of the subgame at $h_t$ with respect to the belief $p^*(\lambda \mid h_t)$. Therefore, $(x^*, y^*)$ is a best response of the players at every $h_t$ according to $p^*(\lambda \mid h_t)$.

Next, we present a result that enables a compact representation of the game. We denote by $X^a$ the subset of behavioral strategies of the defender that are independent of histories of the defender's actions; in other words, for all $x \in X^a$, $x(s \mid h_t) = x(s \mid h^a_t)$ for every history $h_t$, where $h^a_t$ is the corresponding history of attacker actions. Similarly, $Y^a$ is the analogous subset of behavioral strategies of the attacker.

**Theorem 2.** There exists a PBNE of the game in which the equilibrium strategies of the players depend only on the histories of the attacker's actions.

**Proof.** We use Brouwer's fixed-point theorem and a backward-induction argument. We show that there exists a PBNE $x^* \in X^a$ and $y^* \in Y^a$ such that, for all $h_t$:

$$U^d_{t+1}(x^*, y^* \mid h_t) \geq U^d_{t+1}(x, y^* \mid h_t), \quad \forall x \in X \quad (1)$$

$$U^\lambda_{t+1}(x^*, y^* \mid h_t) \geq U^\lambda_{t+1}(x^*, y \mid h_t), \quad \forall y \in Y \quad (2)$$

We denote by $x\{x(h_t) \to s\}$ the defender strategy obtained by replacing $x(h_t)$ in $x$ with the defense action $s$. Similarly, $y^\lambda\{y^\lambda(h_t) \to i\}$ is the attacker strategy of type $\lambda$ obtained by replacing $y^\lambda(h_t)$ in $y^\lambda$ with the attack action $i$. We define

$$\varphi^d_s(x, y \mid h_t) = \max\big\{0,\ U^d_{t+1}(x\{x(h_t) \to s\}, y \mid h_t) - U^d_{t+1}(x, y \mid h_t)\big\}$$

$$\varphi^\lambda_i(x, y \mid h_t) = \max\big\{0,\ U^\lambda_{t+1}(x, y\{y^\lambda(h_t) \to i\} \mid h_t) - U^\lambda_{t+1}(x, y \mid h_t)\big\}$$

which are non-negative continuous functions of $(x, y)$. We define a function $F : (X^a, Y^a) \to (X^a, Y^a)$ by $F(x, y) = (x', y')$, where

$$x'(s \mid h_t) = \frac{x(s \mid h_t) + \varphi^d_s(x, y \mid h_t)}{\sum_{s'} \big[x(s' \mid h_t) + \varphi^d_{s'}(x, y \mid h_t)\big]}$$

$$y'^{\lambda}(i \mid h_t) = \frac{y^\lambda(i \mid h_t) + \varphi^\lambda_i(x, y \mid h_t)}{\sum_{j} \big[y^\lambda(j \mid h_t) + \varphi^\lambda_j(x, y \mid h_t)\big]}$$

Since $F$ is continuous over the convex and compact set $(X^a, Y^a)$, there exists $(x^*, y^*)$ such that $F(x^*, y^*) = (x^*, y^*)$, by Brouwer's fixed-point theorem. On the other hand, by linearity of expectation, there must be an action $s$ such that $U^d_{t+1}(x^*\{x^*(h_t) \to s\}, y^* \mid h_t) - U^d_{t+1}(x^*, y^* \mid h_t) \leq 0$, meaning that $\varphi^d_s(x^*, y^* \mid h_t) = 0$. Therefore, we have

$$x^*(s \mid h_t) = \frac{x^*(s \mid h_t)}{1 + \sum_{s'} \varphi^d_{s'}(x^*, y^* \mid h_t)}$$

This implies $\varphi^d_{s'}(x^*, y^* \mid h_t) = 0$ for all $s'$. Similarly, $\varphi^\lambda_i(x^*, y^* \mid h_t) = 0$ for all $i$. As a result, we obtain

$$U^d_{t+1}(x^*\{x^*(h_t) \to s\}, y^* \mid h_t) \leq U^d_{t+1}(x^*, y^* \mid h_t) \quad (3)$$

$$U^\lambda_{t+1}(x^*, y^*\{y^{\lambda,*}(h_t) \to i\} \mid h_t) \leq U^\lambda_{t+1}(x^*, y^* \mid h_t) \quad (4)$$

for all $s$, $i$, and $h_t$. While the above argument holds for any convex and compact $(X^a, Y^a)$, showing that the fixed-point profile provides more utility than any deviation to $X$ or $Y$ requires $(X^a, Y^a)$ to depend on the attacker's past actions. This dependence arises because the posterior belief depends on the attacker's past actions, as can be seen in the next steps of the proof. Based on the above inequalities, we show that $(x^*, y^*)$ satisfies (1)–(2) using backward induction. At the last time step $T$, for every $x \in X$, we have:

$$U^d_T(x^*, y^* \mid h_{T-1}) \geq \sum_s x(s \mid h_{T-1})\, U^d_T(x^*\{x^*(h_{T-1}) \to s\}, y^* \mid h_{T-1}) = U^d_T(x, y^* \mid h_{T-1})$$

Therefore, $x^*$ is the defender's best response against the attacker's strategy $y^*$ at time step $T$. At time step $t + 1 < T$, suppose that (1)–(2) hold for all $t' > t + 1$; then for every $x \in X$, we have:

$$U^d_{t+1}(x^*, y^* \mid h_t) \geq \sum_s x(s \mid h_t)\, U^d_{t+1}(x^*\{x^*(h_t) \to s\}, y^* \mid h_t)$$

$$= \sum_{\lambda, i} p(\lambda \mid h_t)\, y^{\lambda,*}(i \mid h_t)\, EU^d_i(x, h_t) + \sum_s x(s \mid h_t) \sum_{i, \lambda} p(\lambda \mid h_t)\, y^{\lambda,*}(i \mid h_t)\, U^d_{t+2}(x^*, y^* \mid h_t, (s, i))$$

$$\geq \sum_{\lambda, i} p(\lambda \mid h_t)\, y^{\lambda,*}(i \mid h_t)\, EU^d_i(x, h_t) + \sum_s x(s \mid h_t) \sum_{i, \lambda} p(\lambda \mid h_t)\, y^{\lambda,*}(i \mid h_t)\, U^d_{t+2}(x, y^* \mid h_t, (s, i))$$

$$= U^d_{t+1}(x, y^* \mid h_t)$$

Therefore, $x^*$ is the defender's best response against the attacker's strategy $y^*$ at $t + 1$. Similarly, $y^*$ is the attacker's best response against $x^*$ at all time steps.
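The map $F$ used in the proof is the classic better-response (Nash/Brouwer) map. The sketch below applies it once at a single history; it is an illustration of the construction, not of the paper's solution method (which is the optimization program developed in the next section), and Brouwer's theorem only guarantees that a fixed point exists, not that repeated application converges. The gain callbacks are hypothetical stand-ins for the utility differences $U^d_{t+1}(x\{x(h_t)\to s\}, y \mid h_t) - U^d_{t+1}(x, y \mid h_t)$ and the attacker analogue.

```python
def apply_better_response_map(x_h, y_h, gain_d, gain_a):
    """One application of the map F at a fixed history h_t.

    x_h: dict {defender action s: probability}
    y_h: dict {attacker type: {target i: probability}}
    gain_d(s), gain_a(lam, i): change in total expected utility from switching
    the strategy at h_t to the pure action s (resp. target i).
    """
    phi_d = {s: max(0.0, gain_d(s)) for s in x_h}
    zd = 1.0 + sum(phi_d.values())            # since sum_s x(s | h_t) = 1
    new_x = {s: (x_h[s] + phi_d[s]) / zd for s in x_h}

    new_y = {}
    for lam, dist in y_h.items():
        phi_a = {i: max(0.0, gain_a(lam, i)) for i in dist}
        za = 1.0 + sum(phi_a.values())
        new_y[lam] = {i: (dist[i] + phi_a[i]) / za for i in dist}
    return new_x, new_y
```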
## Equilibrium Computation

Based on Theorem 2, in computing a PBNE we only need to search over the strategy sets $(X^a, Y^a)$, and we only need to consider attack histories $\{h^a_t\}$. We can then represent the defender's behavioral strategies as compact marginal coverage probabilities over targets. We overload the notation $x(h^a_t) = \{x(i \mid h^a_t)\}$, where $x(i \mid h^a_t)$ is the defender's coverage probability at target $i$ at history $h^a_t$, such that $\sum_i x(i \mid h^a_t) \leq K$ and $x(i \mid h^a_t) \in [0, 1]$ for all $i \in N$. In particular, $x(i \mid h^a_t) = \sum_{s : i \in s} x(s \mid h^a_t)$. The players' immediate and total expected utilities can be reformulated accordingly as follows:

$$EU^d_i(x, h^a_t) = x(i \mid h^a_t)\big(R^d(i) - P^d(i)\big) + P^d(i) \quad (5)$$

$$EU^\lambda_i(x, h^a_t) = x(i \mid h^a_t)\big(P^\lambda(i) - R^\lambda(i)\big) + R^\lambda(i) \quad (6)$$

$$U^d_{t+1}(x, y \mid h^a_t) = \sum_{\lambda, i} p(\lambda \mid h^a_t)\, y^\lambda(i \mid h^a_t)\, EU^d_i(x, h^a_t) + \sum_{i, \lambda} p(\lambda \mid h^a_t)\, y^\lambda(i \mid h^a_t)\, U^d_{t+2}(x, y \mid h^a_t, i) \quad (7)$$

$$U^\lambda_{t+1}(x, y \mid h^a_t) = \sum_i y^\lambda(i \mid h^a_t)\, EU^\lambda_i(x, h^a_t) + \sum_i y^\lambda(i \mid h^a_t)\, U^\lambda_{t+2}(x, y \mid h^a_t, i) \quad (8)$$

Note that the players' total expected utilities at the last time step, $U^d_T(x, y \mid h^a_{T-1})$ and $U^\lambda_T(x, y \mid h^a_{T-1})$, do not have the second (future expected utility) term of Equations (7)–(8). In the following, we present a backward-induction-based method to find the attacker (defender) best response against a fixed behavioral strategy of the defender (attacker). We then introduce a program to compute a PBNE based on these best-response solutions.

### Attacker best response

Given a defender strategy $x$, we can compute a best response of the attacker of type $\lambda$ using backward induction.

**At the last time step $T$.** Given a history $h^a_{T-1}$, finding a best response of the attacker of type $\lambda$ against $x$ is formulated as the following linear program:

$$\max_{y^\lambda(h^a_{T-1})} \ \sum_i y^\lambda(i \mid h^a_{T-1})\, EU^\lambda_i(x, h^a_{T-1}) \quad (9)$$

$$\text{s.t.} \quad \sum_i y^\lambda(i \mid h^a_{T-1}) = 1, \quad y^\lambda(i \mid h^a_{T-1}) \geq 0, \ \forall i \quad (10)$$

which maximizes the attacker's total expected utility at $h^a_{T-1}$. Its corresponding dual program is:

$$\min_{v^\lambda(h^a_{T-1})} \ v^\lambda(h^a_{T-1}) \quad (11)$$

$$\text{s.t.} \quad v^\lambda(h^a_{T-1}) \geq EU^\lambda_i(x, h^a_{T-1}), \quad \forall i \in N \quad (12)$$

By complementary slackness, any optimal primal and dual solutions $(y^{\lambda,*}(h^a_{T-1}), v^{\lambda,*}(h^a_{T-1}))$ satisfy:

$$y^{\lambda,*}(i \mid h^a_{T-1})\big[v^{\lambda,*}(h^a_{T-1}) - EU^\lambda_i(x, h^a_{T-1})\big] = 0, \quad \forall i \quad (13)$$
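At the last step, the LP (9)–(10) simply places all probability on targets with maximal immediate expected utility, and the dual value $v^{\lambda,*}$ is that maximum. A minimal sketch under these assumptions (hypothetical names; one attacker type; `coverage` is the marginal coverage vector $x(\cdot \mid h^a_{T-1})$):

```python
def attacker_last_step_best_response(coverage, R_a, P_a, tol=1e-9):
    """Best response for one attacker type at the last step (LP in Eqs. 9-10).

    Returns (y, v): a best-response distribution over targets and the optimal
    value v, which is also the dual objective of Eqs. 11-12.
    """
    # Immediate expected utility per target, Eq. (6).
    eu = {i: coverage[i] * (P_a[i] - R_a[i]) + R_a[i] for i in coverage}
    v = max(eu.values())
    best = [i for i, u in eu.items() if u >= v - tol]
    # Any distribution over the argmax set is optimal; use the uniform one.
    y = {i: (1.0 / len(best) if i in best else 0.0) for i in coverage}
    return y, v
```

Complementary slackness (13) holds by construction: `y` puts positive mass only on targets whose expected utility equals `v`. At earlier steps the same routine applies after adding the continuation value $v^{\lambda,*}(h^a_t, i)$ to each target's immediate utility, as in the program (14)–(15) below.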
**At time step $t + 1 < T$.** Given a history $h^a_t$, finding a best response of the attacker of type $\lambda$ is formulated as:

$$\max_{y^\lambda(h^a_t)} \ \sum_i y^\lambda(i \mid h^a_t)\big[EU^\lambda_i(x, h^a_t) + v^{\lambda,*}(h^a_t, i)\big] \quad (14)$$

$$\text{s.t.} \quad \sum_i y^\lambda(i \mid h^a_t) = 1, \quad y^\lambda(i \mid h^a_t) \geq 0, \ \forall i \in N \quad (15)$$

which maximizes the attacker's total expected utility at $h^a_t$, where $v^{\lambda,*}(h^a_t, i)$ is the attacker's optimal total expected utility at $(h^a_t, i)$. Its corresponding dual program is:

$$\min_{v^\lambda(h^a_t)} \ v^\lambda(h^a_t) \quad (16)$$

$$\text{s.t.} \quad v^\lambda(h^a_t) \geq EU^\lambda_i(x, h^a_t) + v^{\lambda,*}(h^a_t, i), \quad \forall i \quad (17)$$

By complementary slackness, any optimal solution $(y^{\lambda,*}(h^a_t), v^{\lambda,*}(h^a_t))$ must satisfy:

$$y^{\lambda,*}(i \mid h^a_t)\big[v^{\lambda,*}(h^a_t) - EU^\lambda_i(x, h^a_t) - v^{\lambda,*}(h^a_t, i)\big] = 0, \quad \forall i \quad (18)$$

### Defender best response

Given an attack strategy $y$, we denote by

$$\bar{p}(\lambda \mid h^a_t) = p^\lambda \prod_{t'=1}^{t} y^\lambda(i_{t'} \mid h^a_{t'-1})$$

the unnormalized belief weight of each type. The defender's belief over attacker types at each history $h^a_t$ can then be computed as:

$$p(\lambda \mid h^a_t) = \frac{\bar{p}(\lambda \mid h^a_t)}{\sum_{\lambda'} \bar{p}(\lambda' \mid h^a_t)} \quad (19)$$

Similar to the computation of the attacker's best response, we can compute a best response of the defender against an attack strategy $y$ using backward induction as follows.

**At the last time step $T$.** Given a history $h^a_{T-1}$, finding a best response $x(h^a_{T-1})$ can be formulated as:

$$\max_{x(h^a_{T-1})} \ \sum_\lambda \bar{p}(\lambda \mid h^a_{T-1}) \sum_i y^\lambda(i \mid h^a_{T-1})\, EU^d_i(x, h^a_{T-1}) \quad (20)$$

$$\text{s.t.} \quad \sum_i x(i \mid h^a_{T-1}) \leq K, \quad x(i \mid h^a_{T-1}) \in [0, 1], \ \forall i \quad (21)$$

**Proposition 1.** For every attack history $h^a_{T-1}$, denote by $v^{d,*}(h^a_{T-1})$ the defender's optimal total expected utility against the attacker's strategy $y$ at $h^a_{T-1}$. Then:

$$v^{d,*}(h^a_{T-1}) = \frac{\bar{v}^{d,*}(h^a_{T-1})}{\sum_\lambda \bar{p}(\lambda \mid h^a_{T-1})}$$

The proof of Proposition 1 is in the appendix.¹ Here, $\bar{v}^{d,*}(h^a_{T-1})$ denotes the optimal objective of (20)–(21). By removing the constant $P^d(i)$ in $EU^d_i(x, h^a_{T-1})$ (Equation 5) and taking the dual, we obtain the corresponding dual program:

$$\min \ K\, \bar{v}^d(0 \mid h^a_{T-1}) + \sum_i \bar{v}^d(i \mid h^a_{T-1}) \quad (22)$$

$$\text{s.t.} \quad \bar{v}^d(0 \mid h^a_{T-1}) \geq 0, \quad \bar{v}^d(i \mid h^a_{T-1}) \geq 0, \ \forall i \quad (23)$$

$$\bar{v}^d(0 \mid h^a_{T-1}) + \bar{v}^d(i \mid h^a_{T-1}) \geq \sum_\lambda \bar{p}(\lambda \mid h^a_{T-1})\, y^\lambda(i \mid h^a_{T-1})\big(R^d(i) - P^d(i)\big), \ \forall i \quad (24)$$

By complementary slackness, any optimal solution $(x^*(h^a_{T-1}), \{\bar{v}^{d,*}(i \mid h^a_{T-1})\}, \bar{v}^{d,*}(0 \mid h^a_{T-1}))$ satisfies:

$$x^*(i \mid h^a_{T-1})\Big[\bar{v}^{d,*}(0 \mid h^a_{T-1}) + \bar{v}^{d,*}(i \mid h^a_{T-1}) - \sum_\lambda \bar{p}(\lambda \mid h^a_{T-1})\, y^\lambda(i \mid h^a_{T-1})\big(R^d(i) - P^d(i)\big)\Big] = 0, \ \forall i \quad (25)$$

$$\bar{v}^{d,*}(i \mid h^a_{T-1})\big[x^*(i \mid h^a_{T-1}) - 1\big] = 0 \quad (26)$$

$$\bar{v}^{d,*}(0 \mid h^a_{T-1})\Big[\sum_j x^*(j \mid h^a_{T-1}) - K\Big] = 0 \quad (27)$$

¹Link: https://ix.cs.uoregon.edu/~thanhhng/publications/Conf_Paper/AAAI19_Appendix.pdf

**At time step $t + 1 < T$.** Given a history $h^a_t$, finding an optimal behavioral strategy $x(h^a_t)$ can be formulated as the following program:

$$\max_{x(h^a_t)} \ \sum_{\lambda, i} \bar{p}(\lambda \mid h^a_t)\, y^\lambda(i \mid h^a_t)\, EU^d_i(x, h^a_t) + \sum_i \bar{v}^{d,*}(h^a_t, i) \quad (28)$$

$$\text{s.t.} \quad \sum_i x(i \mid h^a_t) \leq K, \quad x(i \mid h^a_t) \in [0, 1], \ \forall i \quad (29)$$

We denote by $\bar{v}^{d,*}(h^a_t)$ the optimal objective of (28)–(29) at $h^a_t$. In (28), $\bar{v}^{d,*}(h^a_t, i)$ is the optimal objective of the same primal program with respect to the history $(h^a_t, i)$.

**Proposition 2.** For every attack history $h^a_t$, denote by $v^{d,*}(h^a_t)$ the defender's optimal total expected utility against the attacker's strategy $y$ at $h^a_t$. Then:

$$v^{d,*}(h^a_t) = \frac{\bar{v}^{d,*}(h^a_t)}{\sum_\lambda \bar{p}(\lambda \mid h^a_t)}$$

The proof of Proposition 2 is in the appendix. In (28), the term $\sum_i \bar{v}^{d,*}(h^a_t, i)$ and the term $P^d(i)$ in $EU^d_i(x, h^a_t)$ are constant. By removing these constants and taking the dual, we obtain the corresponding dual program:

$$\min \ K\, \bar{v}^d(0 \mid h^a_t) + \sum_i \bar{v}^d(i \mid h^a_t) \quad (30)$$

$$\text{s.t.} \quad \bar{v}^d(0 \mid h^a_t) \geq 0, \quad \bar{v}^d(i \mid h^a_t) \geq 0, \ \forall i \in N \quad (31)$$

$$\bar{v}^d(0 \mid h^a_t) + \bar{v}^d(i \mid h^a_t) \geq \sum_\lambda \bar{p}(\lambda \mid h^a_t)\, y^\lambda(i \mid h^a_t)\big(R^d(i) - P^d(i)\big), \ \forall i \quad (32)$$

By complementary slackness, any optimal solution $(x^*(h^a_t), \{\bar{v}^{d,*}(i \mid h^a_t)\}, \bar{v}^{d,*}(0 \mid h^a_t))$ satisfies:

$$x^*(i \mid h^a_t)\Big[\bar{v}^{d,*}(0 \mid h^a_t) + \bar{v}^{d,*}(i \mid h^a_t) - \sum_\lambda \bar{p}(\lambda \mid h^a_t)\, y^\lambda(i \mid h^a_t)\big(R^d(i) - P^d(i)\big)\Big] = 0, \ \forall i \quad (33)$$

$$\bar{v}^{d,*}(i \mid h^a_t)\big[x^*(i \mid h^a_t) - 1\big] = 0 \quad (34)$$

$$\bar{v}^{d,*}(0 \mid h^a_t)\Big[\sum_i x^*(i \mid h^a_t) - K\Big] = 0 \quad (35)$$
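Per history, the defender's programs (20)–(21) and (28)–(29) are small LPs over marginal coverage, and only the immediate term depends on $x(h^a_t)$. Here is a hedged sketch using `scipy.optimize.linprog`; the helper name is hypothetical, and `weights[i]` collects the objective coefficient $\sum_\lambda \bar{p}(\lambda \mid h^a_t)\, y^\lambda(i \mid h^a_t)\big(R^d(i) - P^d(i)\big)$, with constant terms dropped as in the dual derivation above.

```python
import numpy as np
from scipy.optimize import linprog

def defender_best_response_lp(weights, K):
    """Solve the per-history defender LP over marginal coverage (Eqs. 20-21 / 28-29).

    weights: array of objective coefficients, one per target.
    Returns the optimal coverage vector x(. | h^a_t).
    """
    n = len(weights)
    res = linprog(
        c=-np.asarray(weights, dtype=float),   # linprog minimizes, so negate
        A_ub=np.ones((1, n)), b_ub=[K],        # sum_i x(i) <= K
        bounds=[(0.0, 1.0)] * n,               # x(i) in [0, 1]
        method="highs",
    )
    return res.x
```

Because the coefficients are non-negative ($R^d(i) > P^d(i)$), an optimal solution simply assigns coverage 1 to $K$ highest-weight targets, consistent with the complementary slackness conditions (25)–(27) and (33)–(35).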
### Equilibrium computation program

Based on the computation of the players' best responses, a pair of behavioral strategies $(x, y)$ forms a PBNE if and only if these strategies satisfy (i) the feasibility constraints (21), (23), (24), (29), (31), (32) and (10), (12), (15), (17); and (ii) the complementary slackness constraints (25)–(27), (33)–(35) and (13), (18). Since finding strategies that satisfy these slackness constraints directly is not straightforward, we convert the problem of finding a PBNE into the following program:

$$\min \ \delta \quad \text{such that, for all } i, \lambda, \text{ and } h^a_t: \quad (36)$$

$$\delta \geq y^\lambda(i \mid h^a_t)\big[v^\lambda(h^a_t) - EU^\lambda_i(x, h^a_t) - v^\lambda(h^a_t, i)\big] \quad (37)$$

$$\delta \geq x(i \mid h^a_t)\Big[\bar{v}^d(0 \mid h^a_t) + \bar{v}^d(i \mid h^a_t) - \sum_\lambda \bar{p}(\lambda \mid h^a_t)\, y^\lambda(i \mid h^a_t)\big(R^d(i) - P^d(i)\big)\Big] \quad (38)$$

$$\delta \geq \bar{v}^d(i \mid h^a_t)\big[1 - x(i \mid h^a_t)\big] \quad (39)$$

$$\delta \geq \bar{v}^d(0 \mid h^a_t)\Big[K - \sum_i x(i \mid h^a_t)\Big] \quad (40)$$

$$\text{Constraints (21), (23), (24), (29), (31), (32), (10), (12), (15), (17)} \quad (41)$$

where $v^\lambda(h^a_t, i) = 0$ if $t = T - 1$. Constraint (37) and constraints (38)–(40) correspond to the complementary slackness conditions of the attacker and the defender, respectively. Note that any equilibrium of the game is a feasible solution of the program (36)–(41) with an objective value of $\delta = 0$. On the other hand, the right-hand sides of constraints (37)–(40) are always non-negative due to constraint (41). Thus, $\delta \geq 0$ for all feasible solutions of the program (36)–(41), so any equilibrium of the game is an optimal solution of (36)–(41). In addition, since the optimal objective value is $\delta = 0$, any optimal solution of (36)–(41) drives all the right-hand sides of (37)–(40) to zero. Therefore, any optimal solution of this program is a PBNE.

## Extension to Stackelberg Setting

In the Stackelberg game model, a mixed strategy of the defender is defined as a probability distribution $m = \{m(s) : \sum_s m(s) = 1,\ m(s) \in [0, 1]\}$ over the action set $S$. We denote by $M$ the set of all mixed strategies of the defender. At each time step, the defender commits to a mixed strategy; the attacker observes that mixed strategy and then decides which target to attack. Therefore, in finitely repeated Stackelberg games, at each time step $t + 1$ an observation of the defender is a history $h_t = \{(m_1, i_1), \dots, (m_t, i_t)\}$, while an observation of the attacker is a history $(h_t, m_{t+1})$. The behavioral strategy of the defender at $h_t$ is a probability distribution $x(h_t) = \{x(m \mid h_t) : \sum_m x(m \mid h_t) = 1,\ x(m \mid h_t) \in [0, 1]\}$ over the set of mixed strategies of the defender. On the other hand, a behavioral strategy of the attacker of type $\lambda$ at $(h_t, m_{t+1})$ is a probability distribution $y^\lambda(h_t, m_{t+1}) = \{y^\lambda(i \mid h_t, m_{t+1}) : \sum_i y^\lambda(i \mid h_t, m_{t+1}) = 1,\ y^\lambda(i \mid h_t, m_{t+1}) \in [0, 1]\}$. A PBNE of a Stackelberg security game is then defined analogously to the simultaneous-move case. Since the set of mixed strategies of the defender is infinite, the existence of a PBNE in Stackelberg security games is an open research question. Nevertheless, we can compute an $\epsilon$-PBNE by discretizing the set of defense mixed strategies and applying the same backward induction method as in the simultaneous-move case.
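For the two-target, single-resource case analyzed next, a defender mixed strategy is determined by the probability of covering target 1, so discretization reduces to a one-dimensional grid. A minimal sketch (hypothetical name; the grid resolution controls the $\epsilon$ of the resulting $\epsilon$-PBNE):

```python
def discretize_mixed_strategies(num_points=11):
    """Grid over defender mixed strategies for |N| = 2, K = 1.

    Each mixed strategy is represented by the pair (q, 1 - q), where q is the
    probability of covering target 1.  The resulting finite set can be fed to
    the same backward-induction procedure as in the simultaneous-move case.
    """
    step = 1.0 / (num_points - 1)
    return [(k * step, 1.0 - k * step) for k in range(num_points)]
```

For larger games, the analogous construction enumerates grid points of the marginal-coverage polytope, at a cost that grows quickly with the number of targets.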
We specifically analyze the deception of the attacker in finitely repeated Stackelberg security games with $|N| = 2$, $|\Lambda| = 2$, and $K = 1$. We adopt the convention in Stackelberg security games that rewards are strictly positive and penalties strictly negative for both players. We consider a game scenario in which the defender only plays a pure behavioral strategy in $X^{pure} = \{x : x(m \mid h_t) = 1 \text{ for some } m \in M, \ \forall h_t\}$.

**Theorem 3.** In a finitely repeated Stackelberg security game with $|N| = 2$, $|\Lambda| = 2$, and $K = 1$, if the defender only plays a pure behavioral strategy in $X^{pure}$ and the rewards and penalties are strictly positive and strictly negative, respectively, for both players, then there exists a PBNE of the game in which the attacker plays a myopic best response at every history $(h_t, m_{t+1})$.

One significance of this preliminary result is that the assumption of a myopic attacker in previous work on finitely repeated Stackelberg security games is justified (at least in the simple setting of this result) even when the attacker cares about future expected utility. In future research, we aim to generalize this special case and explore the deception patterns for multiple targets and multiple types in the Stackelberg setting.

## Experiments

We focus on the attacker's strategic use of deception. In our experiments, the players' rewards and penalties are generated uniformly at random in the ranges $[1, 10]$ and $[-10, -1]$, respectively.

### Analysis of attacker deception

The purpose of deception is to shift the defender's belief away from the attacker's true type. Any action the attacker takes toward this purpose must take into account similar reasoning by other attacker types. Further, shaping the belief of the defender is beneficial only if it results in a later gain for the attacker. In the following, we present our results with respect to an attacker of type 1; the behavior for other attacker types is symmetric.

In our first experiment, we analyze games with $|\Lambda| = 2$ attacker types, $|N| \in \{4, 6, 8, 10, 12\}$ targets, and $|T| \in \{2, 3\}$ time steps. Results are shown in Figure 1(a)(b). The x-axis is the prior probability of attacker type 1. The y-axis is the probability that a type-1 attacker attacks a myopically non-optimal target (i.e., the probability of a deceptive action, or lie for short) at time step $t = 1$ or $t = 2$ (for 3-step games). Each data point is averaged over 220 game instances.

Figure 1(a) shows results for 2-step games; each curve corresponds to a number of targets. Overall, the attacker's tendency to deceive is roughly concave in the prior probability of its true type. This makes sense, as deception has relatively less power to change the beliefs of a defender when they start near extremes. We also see an increase in deception with the number of targets. This reflects the growth in options for deception, as well as increased potential benefit from misleading the defender.

Results for 3-step games are shown in Figure 1(b). We present deception probabilities for the attacker of type 1 at: (i) Step 1; (ii) Step 2; (iii) Step 2-lie (step 2 conditioned on the attacker having lied at step 1); and (iv) Step 2-not lie (step 2 conditioned on the attacker not having lied at step 1). In this figure, $|N| = 4$. As in the 2-step game, the probability of deception in each case is roughly concave in the prior. The probability of deception at step 1 is somewhat elevated in the 3-step game, since the attacker accrues longer-term benefit from misleading the defender. Moreover, the peak is shifted to the right, reflecting an increased chance of successful deception given the opportunity to repeat the lie over two periods.
Indeed, given that the attacker lies at step 1, the attacker lies with roughly proportional probability at step 2 (blue curve versus yellow curve). On the other hand, when the attacker does not lie at step 1, its pattern of deception at step 2 (purple curve) is qualitatively different. Switching to be deceptive at step 2 is more promising at low priors (where the act has some chance of misleading), and very unlikely at high priors, where there is little chance to mislead the defender if it had not already started in step 1.

[Figure 1: Attacker deception analysis, attacker type 1. Panels: (a) 2-type, 2-step games; (b) 2-type, 3-step games (curves for Step 1, Step 2, Step 2-lie, Step 2-not lie); (c) 3-type, 2-step games (axes: prior probabilities of types 1 and 2); (d) 2-type, 2-step games with varying numbers of defender resources. Unless noted, the x-axis is the prior probability of type 1 and the y-axis is the probability of a lie.]

[Figure 2: Solution comparison and runtime performance. Panels (a)–(c): attacker type 1 utility, attacker type 2 utility, and defender utility versus the prior probability of type 1, comparing Perfect Bayesian, Myopic w/ Learn, and Deceptive w/ Learn. Panel (d): runtime (minutes) versus the number of targets for 2-step and 3-step games.]

In our second experiment, we analyze deceptive strategies of the attacker of type 1 in 2-step games with $|\Lambda| = 3$ attacker types and $|N| = 8$ targets. The result is shown in Figure 1(c); the x-axis and y-axis represent the prior probabilities of types 1 and 2, respectively. Figure 1(c) shows that the attacker's deception tendency is unimodal with respect to the prior of its type 1, and less sensitive to the distribution across the other types.

In our third experiment, we vary the number of defender resources in 2-step games with 2 attacker types. The result is shown in Figure 1(d). When the number of defender resources is high (close to the number of targets), the defender can provide high coverage probability at all targets. In particular, when $K = |N|$, the defender protects all targets all the time. As a result, the attacker may not achieve any benefit by lying. Therefore, the attacker lies less as $K$ gets closer to $|N|$.

### Solution quality and runtime performance

In our last experiment, we compare the players' utilities for playing strategies computed in three scenarios:

1. Perfect Bayesian. The attacker is rationally deceptive and the defender takes into account the potential deceit.
2. Myopic w/ Learn. The attacker is myopic and the defender also assumes so.
3. Deceptive w/ Learn. The attacker is rationally deceptive while the defender assumes the attacker is myopic.

The defender performs a Bayesian update of his belief about the attacker's type in all three cases. Results are shown in Figures 2(a)(b)(c), averaging over 220 3-step game instances with two attacker types and five targets. The x-axis is the prior probability of type 1; the y-axis is the attacker utility of each type or the average defender utility. Figures 2(a)(b)(c) show that if the defender does not account for deception, the rationally deceptive attacker achieves a significant gain while the defender suffers a significant loss (yellow versus red curves). When the defender accounts for the prospect of deception by the attacker, these gains and losses are drastically reduced (blue versus red).
Finally, we display in Figure 2(d) the runtime performance of our equilibrium-finding algorithm. The x-axis is the number of targets and the y-axis is runtime in minutes. For 2-step games, the runtime remains modest for up to ten targets. For $|T| = 3$, the runtime grows quickly and exceeds 24 minutes when the number of targets is five.

## Conclusion

We study the problem of deception in finitely repeated security games, in which the defender collects attack data over time to learn about the attacker's type while the attacker plays deceptively to mislead the defender. We present a detailed analysis and computation of optimal strategies for both players. We then show through computational experiments that the attacker (defender) receives a great benefit (loss) when the defender does not take deceptive attacks into account. Conversely, this benefit (loss) is reduced significantly when the defender addresses the attacker's deception.

## Acknowledgment

This work was supported in part by MURI grant W911NF-13-1-0421 from the US Army Research Office.

## References

Aumann, R. J., and Maschler, M. 1995. Repeated Games with Incomplete Information. MIT Press.

Balcan, M.-F.; Blum, A.; Haghtalab, N.; and Procaccia, A. D. 2015. Commitment without regrets: Online learning in Stackelberg security games. In 16th ACM Conference on Economics and Computation, 61–78.

Barreno, M.; Nelson, B.; Sears, R.; Joseph, A. D.; and Tygar, J. D. 2006. Can machine learning be secure? In ACM Symposium on Information, Computer and Communications Security, 16–25.

Barreno, M.; Nelson, B.; Joseph, A. D.; and Tygar, J. D. 2010. The security of machine learning. Machine Learning 81(2):121–148.

Blum, A.; Haghtalab, N.; and Procaccia, A. D. 2014. Learning optimal commitment to overcome insecurity. In Advances in Neural Information Processing Systems, 1826–1834.

Brown, G.; Carlyle, M.; Diehl, D.; Kline, J.; and Wood, K. 2005. A two-sided optimization for theater ballistic missile defense. Operations Research 53(5):745–763.

Brückner, M., and Scheffer, T. 2011. Stackelberg games for adversarial prediction problems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 547–555.

Brückner, M.; Kanzow, C.; and Scheffer, T. 2012. Static prediction games for adversarial learning problems. Journal of Machine Learning Research 13:2617–2654.

Fang, F.; Nguyen, T. H.; Pickles, R.; Lam, W. Y.; Clements, G. R.; An, B.; Singh, A.; Tambe, M.; and Lemieux, A. 2016. Deploying PAWS: Field optimization of the protection assistant for wildlife security. In 30th AAAI Conference on Artificial Intelligence, 3966–3973.

Farrell, J., and Rabin, M. 1996. Cheap talk. Journal of Economic Perspectives 10(3):103–118.

Forges, F. 1988. Repeated games of incomplete information: Non-zero-sum. Technical report, Université Catholique de Louvain, Center for Operations Research and Econometrics (CORE).

Gholami, S.; Ford, B.; Fang, F.; Plumptre, A.; Tambe, M.; Driciru, M.; Wanyama, F.; Rwetsiba, A.; Nsubaga, M.; and Mabonga, J. 2017. Taking it for a test drive: A hybrid spatiotemporal model for wildlife poaching prediction evaluated through a controlled field test. In European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases.

Guo, Q.; An, B.; Bosansky, B.; and Kiekintveld, C. 2017. Comparing strategic secrecy and Stackelberg commitment in security games. In 26th International Joint Conference on Artificial Intelligence.

Haghtalab, N.; Fang, F.; Nguyen, T. H.; Sinha, A.; Procaccia, A. D.; and Tambe, M. 2016. Three strategies to success: Learning adversary models in security games. In 25th International Joint Conference on Artificial Intelligence, 308–314.
Hendricks, K., and McAfee, R. P. 2006. Feints. Journal of Economics & Management Strategy 15(2):431–456.

Jordan, J. S. 1995. Bayesian learning in repeated games. Games and Economic Behavior 9(1):8–20.

Kar, D.; Ford, B.; Gholami, S.; Fang, F.; Plumptre, A.; Tambe, M.; Driciru, M.; Wanyama, F.; Rwetsiba, A.; Nsubaga, M.; et al. 2017. Cloudy with a chance of poaching: Adversary behavior modeling and forecasting with real-world poaching data. In 16th International Conference on Autonomous Agents and Multi-Agent Systems, 159–167.

Letchford, J.; Conitzer, V.; and Munagala, K. 2009. Learning and approximating the optimal strategy to commit to. In International Symposium on Algorithmic Game Theory, 250–262.

Lowd, D., and Meek, C. 2005. Adversarial learning. In ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 641–647.

Marecki, J.; Tesauro, G.; and Segal, R. 2012. Playing repeated Stackelberg games with unknown opponents. In 11th International Conference on Autonomous Agents and Multiagent Systems, 821–828.

Nguyen, T. H.; Sinha, A.; Gholami, S.; Plumptre, A.; Joppa, L.; Tambe, M.; Driciru, M.; Wanyama, F.; Rwetsiba, A.; Critchlow, R.; et al. 2016. CAPTURE: A new predictive anti-poaching tool for wildlife protection. In 15th International Conference on Autonomous Agents and Multi-Agent Systems, 767–775.

Rabinovich, Z.; Jiang, A. X.; Jain, M.; and Xu, H. 2015. Information disclosure as a means to security. In 14th International Conference on Autonomous Agents and Multi-Agent Systems, 645–653.

Sorin, S. 2002. A First Course on Zero-Sum Repeated Games, volume 37. Springer Science & Business Media.

Xu, H.; Rabinovich, Z.; Dughmi, S.; and Tambe, M. 2015. Exploring information asymmetry in two-stage security games. In 29th AAAI Conference on Artificial Intelligence, 1057–1063.

Xu, H.; Tran-Thanh, L.; and Jennings, N. R. 2016. Playing repeated security games with no prior knowledge. In 15th International Conference on Autonomous Agents and Multi-Agent Systems, 104–112.

Zamir, S. 1992. Repeated games of incomplete information: Zero-sum. Handbook of Game Theory with Economic Applications 1:109–154.

Zhuang, J.; Bier, V. M.; and Alagoz, O. 2010. Modeling secrecy and deception in a multi-period attacker-defender signaling game. European Journal of Operational Research 203:409–418.