# Adaptive Reward-Poisoning Attacks against Reinforcement Learning

Xuezhou Zhang¹, Yuzhe Ma¹, Adish Singla², Xiaojin Zhu¹
¹University of Wisconsin-Madison, ²Max Planck Institute for Software Systems (MPI-SWS). Correspondence to: Xuezhou Zhang.
Proceedings of the 37th International Conference on Machine Learning (ICML 2020), PMLR 119. Copyright 2020 by the author(s).

Abstract

In reward-poisoning attacks against reinforcement learning (RL), an attacker can perturb the environment reward $r_t$ into $r_t + \delta_t$ at each step, with the goal of forcing the RL agent to learn a nefarious policy. We categorize such attacks by the infinity-norm constraint on $\delta_t$: we provide a lower threshold below which reward-poisoning attacks are infeasible and RL is certified to be safe, and a corresponding upper threshold above which the attack is feasible. Feasible attacks can be further categorized as non-adaptive, where $\delta_t$ depends only on $(s_t, a_t, s_{t+1})$, or adaptive, where $\delta_t$ depends further on the RL agent's learning process at time $t$. Non-adaptive attacks have been the focus of prior work. However, we show that under mild conditions, adaptive attacks can achieve the nefarious policy in a number of steps polynomial in the state-space size $|S|$, whereas non-adaptive attacks require exponentially many steps. We provide a constructive proof that a Fast Adaptive Attack strategy achieves the polynomial rate. Finally, we show empirically that an attacker can find effective reward-poisoning attacks using state-of-the-art deep RL techniques.

1. Introduction

In many reinforcement learning (RL) applications the agent extracts reward signals from user feedback. For example, in recommendation systems the rewards are often represented by user clicks, purchases or dwell time (Zhao et al., 2018; Chen et al., 2019); in conversational AI, the rewards can be user sentiment or conversation length (Dhingra et al., 2016; Li et al., 2016). In such scenarios, an adversary can manipulate user feedback to influence the RL agent in nefarious ways. Figure 1 describes a hypothetical scenario of how a conversational AI can be attacked. One real-world example is the chatbot Tay, which was quickly corrupted by a group of Twitter users who deliberately taught it misogynistic and racist remarks shortly after its release (Neff & Nagy, 2016). Such attacks reveal significant security threats in the application of reinforcement learning.

Figure 1. Example: an RL-based conversational AI learns from real-time conversations with human users. The chatbot says "Hello! You look pretty!" and expects to learn from user feedback (sentiment). A benign user will respond with gratitude, which is decoded as a positive reward signal. An adversarial user, however, may express anger in his reply ("Hey, don't say that!"), which is decoded as a negative reward signal.

In this paper, we formally study the problem of training-time attacks on RL via reward poisoning. As in standard RL, the RL agent updates its policy $\pi_t$ by performing action $a_t$ at state $s_t$ in each round $t$. The environment Markov Decision Process (MDP) generates reward $r_t$ and transits the agent to $s_{t+1}$. However, the attacker can change the reward $r_t$ to $r_t + \delta_t$, with the goal of driving the RL agent's policy $\pi_t$ toward a target policy $\pi^\dagger$.

Figure 2. A chain MDP with the attacker's target policy $\pi^\dagger$.

Figure 2 shows a running example that we use throughout the paper.
The episodic MDP is a linear chain with five states, with "left" and "right" actions and no movement if the agent hits the boundary. Each move incurs a $-0.1$ reward, and $G$ is the absorbing goal state with reward $1$. Without attack, the optimal policy $\pi^*$ is to always move right. The attacker's goal, however, is to force the agent to learn the nefarious target policy $\pi^\dagger$ represented by the arrows in Figure 2. Specifically, the attacker wants the agent to move left, and hit its head against the wall, whenever the agent is at the left-most state.

Our main contributions are:
1. We characterize conditions under which such attacks are guaranteed to fail (thus RL is safe), and vice versa;
2. In the case where an attack is feasible, we provide upper bounds on the attack cost in the process of achieving $\pi^\dagger$;
3. We show that effective attacks can be found empirically using deep RL techniques.

2. Related Work

Test-time attacks against RL. Prior work on adversarial attacks against reinforcement learning focused primarily on test time, where the RL policy $\pi$ is pre-trained and fixed, and the attacker manipulates the perceived state $s_t$ into $s'_t$ in order to induce an undesired action (Huang et al., 2017; Lin et al., 2017; Kos & Song, 2017; Behzadan & Munir, 2017). For example, in video games the attacker can make small pixel perturbations to a frame (Goodfellow et al., 2014) to induce an action $\pi(s'_t) \neq \pi(s_t)$. Although test-time attacks can severely impact the performance of a deployed and fixed policy $\pi$, they do not modify $\pi$ itself. For ever-learning agents, however, the attack surface includes $\pi$ itself. This motivates us to study training-time attacks on the RL policy.

Reward poisoning. Reward poisoning has been studied in bandits (Jun et al., 2018; Peltola et al., 2019; Altschuler et al., 2019; Liu & Shroff, 2019; Ma et al., 2018), where the authors show that adversarially perturbed rewards can mislead standard bandit algorithms into pulling a suboptimal arm or suffering large regret. Reward poisoning has also been studied in batch RL (Zhang & Parkes, 2008; Zhang et al., 2009; Ma et al., 2019), where rewards are stored in a batch data set pre-collected by some behavior policy and the attacker modifies the batch data. Because all data are available to the attacker at once, the batch attack problem is relatively easier. This paper instead focuses on the online RL attack setting, where reward poisoning must be done on the fly. Huang & Zhu (2019) study a restricted version of reward poisoning, in which the perturbation depends only on the current state and action: $\delta_t = \phi(s_t, a_t)$. While such a restriction guarantees the convergence of Q-learning under the perturbed reward and makes the analysis easier, we show both theoretically and empirically that it severely harms attack efficiency. Our paper subsumes their results by considering more powerful attacks that can depend on the RL victim's Q-table $Q_t$. Theoretically, our analysis does not require the RL agent's underlying $Q_t$ to converge while still providing robustness certificates; see Section 4.

Reward shaping. While this paper is phrased from the adversarial angle, the framework and techniques are also applicable to the teaching setting, where a teacher aims to guide the agent to learn the optimal policy as quickly as possible by designing the reward signal.
Traditionally, reward shaping, and more specifically potential-based reward shaping (Ng et al., 1999), has been shown to speed up learning while preserving the optimal policy. Devlin & Kudenko (2012) extend potential-based reward shaping to be time-varying while remaining policy-preserving. More recently, intrinsic motivation (Schmidhuber, 1991; Oudeyer & Kaplan, 2009; Barto, 2013; Bellemare et al., 2016) was introduced as a new form of reward shaping with the goal of encouraging exploration and thus speeding up learning. Our work contributes by mathematically defining the teaching-via-reward-shaping task as an optimal control problem, and by providing computational tools that solve for problem-dependent, high-performing reward-shaping strategies.

3. The Threat Model

In the reward-poisoning attack problem, we consider three entities: the environment MDP, the RL agent, and the attacker. Their interaction is formally described by Alg. 1. The environment MDP is $M = (S, A, R, P, \mu_0)$, where $S$ is the state space, $A$ is the action space, $R: S \times A \times S \to \mathbb{R}$ is the reward function, $P: S \times A \times S \to \mathbb{R}$ is the transition probability, and $\mu_0: S \to \mathbb{R}$ is the initial state distribution. We assume $S$ and $A$ are finite, and that a uniformly random policy can visit each $(s, a)$ pair infinitely often.

We focus on an RL agent that performs standard Q-learning, defined by a tuple $\mathcal{A} = (Q_0, \varepsilon, \gamma, \{\alpha_t\})$, where $Q_0$ is the initial Q-table, $\varepsilon$ is the random exploration probability, $\gamma$ is the discounting factor, and $\{\alpha_t\}$ is the learning-rate schedule as a function of $t$. This assumption can be generalized: in additional experiments provided in the appendix, we show how the same framework can be applied to attack general RL agents, such as DQN.

Denote by $Q^*$ the optimal Q-table that satisfies the Bellman equation
$$Q^*(s, a) = \mathbb{E}_{P(s'|s,a)}\left[ R(s, a, s') + \gamma \max_{a' \in A} Q^*(s', a') \right] \quad (1)$$
and denote the corresponding optimal policy by $\pi^*(s) = \arg\max_a Q^*(s, a)$. For notational simplicity, we assume $\pi^*$ is unique, though it is easy to generalize to multiple optimal policies, since most of our analyses happen in the space of value functions.

The threat model. The attacker sits between the environment and the RL agent. In this paper we focus on white-box attacks: the attacker has knowledge of the environment MDP and of the RL agent's Q-learning algorithm, except for their future randomness. Specifically, at time $t$ the attacker observes the learner's Q-table $Q_t$, state $s_t$, action $a_t$, the environment transition $s_{t+1}$, and reward $r_t$. The attacker can choose to add a perturbation $\delta_t \in \mathbb{R}$ to the current environmental reward $r_t$, and the RL agent receives the poisoned reward $r_t + \delta_t$. We assume the attack is inf-norm bounded: $|\delta_t| \le \Delta$ for all $t$.

Algorithm 1 Reward Poisoning against Q-learning
PARAMETERS: agent parameters $\mathcal{A} = (Q_0, \varepsilon, \gamma, \{\alpha_t\})$, MDP parameters $M = (S, A, R, P, \mu_0)$.
1: for $t = 0, 1, \ldots$ do
2:   agent at state $s_t$, has Q-table $Q_t$.
3:   agent acts according to the $\varepsilon$-greedy behavior policy
$$a_t \leftarrow \begin{cases} \arg\max_a Q_t(s_t, a), & \text{w.p. } 1 - \varepsilon \\ \text{uniform from } A, & \text{w.p. } \varepsilon. \end{cases} \quad (2)$$
4:   environment transits $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, produces reward $r_t = R(s_t, a_t, s_{t+1})$.
5:   attacker poisons the reward to $r_t + \delta_t$.
6:   agent receives $(s_{t+1}, r_t + \delta_t)$, performs the Q-learning update
$$Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha_t) Q_t(s_t, a_t) + \alpha_t \left( r_t + \delta_t + \gamma \max_{a' \in A} Q_t(s_{t+1}, a') \right) \quad (3)$$
7:   environment resets if the episode ends: $s_{t+1} \sim \mu_0$.
8: end for
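To make the interaction protocol of Alg. 1 concrete, here is a minimal, self-contained Python sketch of the loop on a small chain environment in the spirit of Figure 2. It is illustrative only: the environment dynamics, the `attacker` callable, and all function names are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

# Hypothetical 5-state chain in the spirit of Figure 2: actions 0=left, 1=right,
# -0.1 per move, +1 upon reaching the right-most (goal) state, then the episode resets.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def chain_step(s, a):
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    r = 1.0 if s_next == GOAL else -0.1
    done = (s_next == GOAL)
    return s_next, r, done

def run_poisoned_q_learning(attacker, T=10_000, eps=0.1, gamma=0.9, alpha=0.9,
                            delta_max=1.0, seed=0):
    """Alg. 1: epsilon-greedy Q-learning where each reward r_t is replaced by r_t + delta_t."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    s = 0
    for t in range(T):
        # epsilon-greedy behavior policy (Eq. 2)
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = chain_step(s, a)
        # the attacker observes the attack state and picks a bounded perturbation
        delta = float(np.clip(attacker(Q, s, a, s_next, r), -delta_max, delta_max))
        # Q-learning update on the poisoned reward (Eq. 3); no bootstrap at episode end
        target = r + delta + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = 0 if done else s_next  # environment resets when the episode ends
    return Q

# Example: no attack (delta_t = 0); the agent should learn to always move right.
Q_clean = run_poisoned_q_learning(lambda Q, s, a, s2, r: 0.0)
print(np.argmax(Q_clean, axis=1))
```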
There can be many possible attack goals against an RL agent: forcing the RL agent to perform certain actions, reaching or avoiding certain states, or maximizing its regret. In this paper, we focus on a specific attack goal: policy manipulation. Concretely, the goal of policy manipulation is to force a target policy $\pi^\dagger$ on the RL agent for as many rounds as possible.

Definition 1 (Target (partial) policy). $\pi^\dagger: S \mapsto 2^A$. For each $s \in S$, $\pi^\dagger(s) \subseteq A$ specifies the set of actions desired by the attacker.

The partial policy $\pi^\dagger$ allows the attacker to desire multiple target actions at one state. In particular, if $\pi^\dagger(s) = A$ then $s$ is a state that the attacker does not care about. Denote by $S^\dagger = \{s \in S : \pi^\dagger(s) \neq A\}$ the set of target states on which the attacker does have a preference. In many applications, the attacker only cares about the agent's behavior on a small set of states, namely $|S^\dagger| \ll |S|$. For RL agents utilizing a Q-table, a target policy $\pi^\dagger$ induces a set of Q-tables:

Definition 2 (Target Q-table set).
$$\mathcal{Q}^\dagger := \left\{ Q : \max_{a \in \pi^\dagger(s)} Q(s, a) > \max_{a \notin \pi^\dagger(s)} Q(s, a), \ \forall s \in S^\dagger \right\}$$

If the target policy $\pi^\dagger$ always specifies a singleton action or does not care on every state, then $\mathcal{Q}^\dagger$ is a convex set. But in general, when $1 < |\pi^\dagger(s)| < |A|$ on some $s$, $\mathcal{Q}^\dagger$ is a union of convex sets and can itself be non-convex.

4. Theoretical Guarantees

We are now ready to formally define the optimal attack problem. At time $t$, the attacker observes an attack state (N.B. distinct from the MDP state $s_t$)
$$\xi_t := (s_t, a_t, s_{t+1}, r_t, Q_t) \in \Xi \quad (4)$$
which jointly characterizes the MDP and the RL agent. The attacker's goal is to find an attack policy $\phi: \Xi \to [-\Delta, \Delta]$, where for $\xi_t \in \Xi$ the attack action is $\delta_t := \phi(\xi_t)$, that minimizes the number of rounds on which the agent's $Q_t$ disagrees with the attack target $\mathcal{Q}^\dagger$:
$$\mathbb{E}_\phi\left[ \sum_{t=0}^{\infty} \mathbf{1}\left[ Q_t \notin \mathcal{Q}^\dagger \right] \right], \quad (5)$$
where the expectation accounts for randomness in Alg. 1. We denote by $J(\phi) = \mathbb{E}_\phi \sum_{t=0}^{\infty} \mathbf{1}[Q_t \notin \mathcal{Q}^\dagger]$ the total attack cost, and by $J_T(\phi) = \mathbb{E}_\phi \sum_{t=0}^{T} \mathbf{1}[Q_t \notin \mathcal{Q}^\dagger]$ the finite-horizon cost. We say the attack is feasible if (5) is finite. Next, we characterize attack feasibility in terms of the poison magnitude constraint $\Delta$, as summarized in Figure 3. Proofs of all theorems can be found in the appendix.

4.1. Attack Infeasibility

Intuitively, a smaller $\Delta$ makes it harder for the attacker to achieve the attack goal. We show that there is a threshold $\Delta_1$ such that for any $\Delta < \Delta_1$ the RL agent is eventually safe, in that $\pi_t \to \pi^*$, the correct MDP policy. This implies that (5) is infinite and the attack is infeasible. There is a potentially larger $\Delta_2$ such that for any $\Delta < \Delta_2$ the attack is also infeasible, though $\pi_t$ may not converge to $\pi^*$. While the above statements are about $\pi_t$, our analysis is via the RL agent's underlying $Q_t$. Note that under attack the rewards $r_t + \delta_t$ no longer follow a fixed stochastic reward function, so we cannot invoke the usual Q-learning convergence guarantee. Nonetheless, we show that $Q_t$ is bounded within a polytope in Q-space.

Theorem 1 (Boundedness of Q-learning). Assume that $|\delta_t| \le \Delta$ for all $t$, and that the step sizes satisfy $\alpha_t \le 1$ for all $t$, $\sum_t \alpha_t = \infty$, and $\sum_t \alpha_t^2 < \infty$. Let $Q^*$ be defined as in (1). Then, for any attack sequence $\{\delta_t\}$, there exists $N \in \mathbb{N}$ such that, with probability 1, for all $t \ge N$ we have
$$Q^*(s, a) - \frac{\Delta}{1 - \gamma} \le Q_t(s, a) \le Q^*(s, a) + \frac{\Delta}{1 - \gamma}. \quad (6)$$

Remark 1. The bounds in Theorem 1 are in fact tight: the lower and upper bounds can be achieved by setting $\delta_t = -\Delta$ or $+\Delta$, respectively.

Figure 3. A summary diagram of the theoretical results.

We immediately have the following two infeasibility certificates.

Corollary 2 (Strong Infeasibility Certificate). Define
$$\Delta_1 = \frac{1 - \gamma}{2} \min_{s} \left( Q^*(s, \pi^*(s)) - \max_{a \neq \pi^*(s)} Q^*(s, a) \right).$$
If $\Delta < \Delta_1$, there exists $N \in \mathbb{N}$ such that, with probability 1, for all $t > N$, $\pi_t = \pi^*$. In other words, the RL agent eventually learns the optimal MDP policy $\pi^*$ despite the attacks.
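As a small illustration of Theorem 1 and Corollary 2, the sketch below computes the box radius of (6) and the certificate $\Delta_1$ from a given optimal Q-table. The `Q_star` array and both function names are hypothetical; this is not code from the paper.

```python
import numpy as np

def strong_infeasibility_threshold(Q_star, gamma):
    """Delta_1 = (1 - gamma)/2 * min_s [ Q*(s, pi*(s)) - max_{a != pi*(s)} Q*(s, a) ].
    Any attack with |delta_t| <= Delta < Delta_1 cannot stop the agent from eventually
    acting greedily w.r.t. the true optimal policy (Corollary 2)."""
    Q_star = np.asarray(Q_star, dtype=float)
    sorted_q = np.sort(Q_star, axis=1)           # ascending per state
    margins = sorted_q[:, -1] - sorted_q[:, -2]  # best minus second-best action value
    return (1.0 - gamma) * margins.min() / 2.0

def theorem1_box(Q_star, gamma, Delta):
    """Per Theorem 1, for t large enough every entry of Q_t lies in
    [Q*(s,a) - Delta/(1-gamma), Q*(s,a) + Delta/(1-gamma)]."""
    radius = Delta / (1.0 - gamma)
    return Q_star - radius, Q_star + radius

# Toy usage with a made-up 3-state, 2-action Q*:
Q_star = np.array([[1.0, 0.7], [0.5, 0.9], [0.2, 0.1]])
print(strong_infeasibility_threshold(Q_star, gamma=0.9))
```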
Corollary 3 (Weak Infeasibility Certificate). Given the attack target policy $\pi^\dagger$, define
$$\Delta_2 = \frac{1 - \gamma}{2} \max_{s \in S^\dagger} \left( Q^*(s, \pi^*(s)) - \max_{a \in \pi^\dagger(s)} Q^*(s, a) \right).$$
If $\Delta < \Delta_2$, there exists $N \in \mathbb{N}$ such that, with probability 1, for all $t > N$, $\pi_t(s) \notin \pi^\dagger(s)$ for some $s \in S^\dagger$. In other words, eventually the attacker is unable to enforce $\pi^\dagger$ (though $\pi_t$ may not settle on $\pi^*$ either).

Intuitively, an MDP is difficult to attack if its margin $\min_s \left( Q^*(s, \pi^*(s)) - \max_{a \neq \pi^*(s)} Q^*(s, a) \right)$ is large. This suggests a defense: for RL to be robust against poisoning, the environmental reward signal should be designed such that optimal actions and suboptimal actions have large performance gaps.

4.2. Attack Feasibility

We now show that there is a threshold $\Delta_3$ such that for all $\Delta > \Delta_3$ the attacker can enforce $\pi^\dagger$ for all but a finite number of rounds.

Theorem 4. Given a target policy $\pi^\dagger$, define
$$\Delta_3 = \frac{1 + \gamma}{2} \max_{s \in S^\dagger} \left[ \max_{a \notin \pi^\dagger(s)} Q^*(s, a) - \max_{a \in \pi^\dagger(s)} Q^*(s, a) \right]_+ \quad (7)$$
where $[x]_+ := \max(x, 0)$. Assume the same conditions on $\alpha_t$ as in Theorem 1. If $\Delta > \Delta_3$, there is a feasible attack policy $\phi^{sas}_{\Delta_3}$. Furthermore, $J(\phi^{sas}_{\Delta_3}) \le O(L^5)$, where $L$ is the covering number.

Algorithm 2 The Non-Adaptive Attack $\phi^{sas}_{\Delta_3}$
PARAMETERS: target policy $\pi^\dagger$, agent parameters $\mathcal{A} = (Q_0, \varepsilon, \gamma, \{\alpha_t\})$, MDP parameters $M = (S, A, R, P, \mu_0)$, maximum poisoning magnitude $\Delta$.
def Init($\pi^\dagger$, $\mathcal{A}$, $M$):
1: Construct a Q-table $Q'$, where
$$Q'(s, a) = \begin{cases} Q^*(s, a) + \Delta/(1 + \gamma), & \text{if } s \in S^\dagger, a \in \pi^\dagger(s) \\ Q^*(s, a) - \Delta/(1 + \gamma), & \text{if } s \in S^\dagger, a \notin \pi^\dagger(s) \\ Q^*(s, a), & \text{if } s \notin S^\dagger. \end{cases}$$
2: Calculate a new reward function
$$R'(s, a) = Q'(s, a) - \gamma\, \mathbb{E}_{P(s'|s,a)}\left[ \max_{a'} Q'(s', a') \right].$$
3: Define the attack policy $\phi^{sas}_{\Delta_3}$ as
$$\phi^{sas}_{\Delta_3}(s, a) = R'(s, a) - \mathbb{E}_{P(s'|s,a)}\left[ R(s, a, s') \right], \quad \forall s, a.$$
def Attack($\xi_t$):
1: Return $\phi^{sas}_{\Delta_3}(s_t, a_t)$.

Theorem 4 is proved by constructing the attack policy $\phi^{sas}_{\Delta_3}(s_t, a_t)$ detailed in Alg. 2. Note that this attack policy does not depend on $Q_t$; we call this type of attack a non-adaptive attack. Under such a construction, one can show that Q-learning converges to the target policy $\pi^\dagger$.

Recall that the covering number $L$ is an upper bound on the minimum length of a trajectory that starts from any $(s, a)$ pair and follows the MDP until all (state, action) pairs appear in the trajectory (Even-Dar & Mansour, 2003). It is well known that $\varepsilon$-greedy exploration has a covering time $L \le O(e^{|S|})$ (Kearns & Singh, 2002). Prior work has constructed examples on which this bound is tight (Jin et al., 2018). We show in the appendix that on our toy example $\varepsilon$-greedy indeed has a covering time $O(e^{|S|})$. Therefore, the objective value of (5) for the non-adaptive attack is upper bounded by $O(e^{|S|})$. In other words, the non-adaptive attack is slow.
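When the MDP is fully known, the `Init` phase of Alg. 2 can be written down directly. The following is a hedged sketch under a simplifying assumption of deterministic transitions (so the expectations over $P(s'\mid s,a)$ reduce to single successor states); all names are hypothetical and the code mirrors, rather than reproduces, the authors' construction.

```python
import numpy as np

def build_nonadaptive_attack(Q_star, R, next_state, target, gamma, Delta):
    """Alg. 2 (Init) for a deterministic MDP.

    Q_star:     (S, A) optimal Q-table of the clean MDP
    R:          (S, A) array of clean rewards, R[s, a] = R(s, a, s') with s' = next_state(s, a)
    next_state: callable (s, a) -> s'
    target:     dict mapping each target state s to its set of target actions pi_dagger(s)
    Returns phi with phi[s, a] = R'(s, a) - R(s, a), the state-action reward perturbation.
    """
    S, A = Q_star.shape
    bump = Delta / (1.0 + gamma)

    # Step 1: the shifted table Q': push target actions up and non-target actions down
    Q_prime = Q_star.copy()
    for s, actions in target.items():
        for a in range(A):
            Q_prime[s, a] += bump if a in actions else -bump

    # Step 2: the reward function R' whose optimal Q-table is Q'
    R_prime = np.empty_like(Q_prime)
    for s in range(S):
        for a in range(A):
            R_prime[s, a] = Q_prime[s, a] - gamma * Q_prime[next_state(s, a)].max()

    # Step 3: the (s, a)-dependent perturbation added on top of the environment reward
    return R_prime - R

# Hypothetical usage on the 5-state chain of Figure 2 (target: go left at the left-most state):
#   phi = build_nonadaptive_attack(Q_star, R, next_state, target={0: {0}}, gamma=0.9, Delta=0.2)
#   delta_t = phi[s_t, a_t]   # added to the environment reward at every step
```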
4.3. Fast Adaptive Attack (FAA)

We now show that there is a fast adaptive attack $\phi^{\xi}_{FAA}$ which depends on $Q_t$ and achieves $J$ polynomial in $|S|$. The price to pay is a larger attack constraint $\Delta_4$, and the requirement that the attack target states are sparse: $k = |S^\dagger| = O(\log |S|)$. The FAA attack policy $\phi^{\xi}_{FAA}$ is defined in Alg. 3. Conceptually, FAA ranks the target states in descending order of their distance to the starting states, and focuses on attacking one target state at a time. Of central importance is the temporary target policy $\nu_i$, which is designed to navigate the agent to the currently focused target state $s^\dagger_{(i)}$ while not altering the already-achieved target actions on target states of earlier rank. This allows FAA to maintain a form of program invariance: after FAA achieves the target policy on a target state $s^\dagger_{(i)}$, the target policy on $s^\dagger_{(i)}$ will be preserved indefinitely. We provide a more detailed walk-through of Alg. 3 with examples in the appendix.

Algorithm 3 The Fast Adaptive Attack (FAA)
PARAMETERS: target policy $\pi^\dagger$, margin $\eta$, agent parameters $\mathcal{A} = (Q_0, \varepsilon, \gamma, \{\alpha_t\})$, MDP parameters $M = (S, A, R, P, \mu_0)$.
def Init($\pi^\dagger$, $\mathcal{A}$, $M$, $\eta$):
1: Given $(s_t, a_t, Q_t)$, define the hypothetical Q-update function without attack as
$$\tilde{Q}_{t+1}(s_t, a_t) = (1 - \alpha_t) Q_t(s_t, a_t) + \alpha_t \left( r_t + \gamma (1 - \text{EOE}) \max_{a' \in A} Q_t(s_{t+1}, a') \right),$$
where EOE indicates that the episode ends at step $t$ (in which case there is no bootstrap term).
2: Given $(s_t, a_t, Q_t)$, denote by $g(A_{s_t})$ the greedy attack function at $s_t$ w.r.t. a target action set $A_{s_t}$, defined as
$$g(A_{s_t}) = \begin{cases} \frac{1}{\alpha_t}\left[ \max_{a \notin A_{s_t}} Q_t(s_t, a) - \tilde{Q}_{t+1}(s_t, a_t) + \eta \right]_+, & \text{if } a_t \in A_{s_t} \\ \frac{1}{\alpha_t}\left[ \max_{a \in A_{s_t}} Q_t(s_t, a) - \tilde{Q}_{t+1}(s_t, a_t) - \eta \right], & \text{if } a_t \notin A_{s_t}. \end{cases}$$
3: Define $\text{Clip}_\Delta(\delta) = \min(\max(\delta, -\Delta), \Delta)$.
4: Rank the target states in descending order as $\{s^\dagger_{(1)}, \ldots, s^\dagger_{(k)}\}$ according to their expected shortest $\varepsilon$-distance from the initial state, $\mathbb{E}_{s \sim \mu_0}\, d_\varepsilon(s, s^\dagger_{(i)})$.
5: for $i = 1, \ldots, k$ do
6:   Define the temporary target policy $\nu_i$ as
$$\nu_i(s) = \begin{cases} \pi_{s^\dagger_{(i)}}(s), & \text{if } s \notin \{s^\dagger_{(j)} : j \le i\} \\ \pi^\dagger(s), & \text{if } s \in \{s^\dagger_{(j)} : j \le i\}. \end{cases}$$
7: end for
def Attack($\xi_t$):
1: for $i = 1, \ldots, k$ do
2:   if $\arg\max_a Q_t(s^\dagger_{(i)}, a) \notin \pi^\dagger(s^\dagger_{(i)})$ then
3:     Return $\delta_t \leftarrow \text{Clip}_\Delta\!\left(g(\{\nu_i(s_t)\})\right)$.
4:   end if
5: end for
6: Return $\delta_t \leftarrow \text{Clip}_\Delta\!\left(g(\pi^\dagger(s_t))\right)$.

Definition 3. Define the shortest $\varepsilon$-distance from $s$ to $s'$ as
$$d_\varepsilon(s, s') = \min_{\pi \in \Pi} \mathbb{E}_{\pi_\varepsilon}[T] \quad \text{s.t.} \quad s_0 = s,\ s_T = s',\ s_t \neq s'\ \forall t < T, \quad (9)$$
where $\pi_\varepsilon$ denotes the $\varepsilon$-greedy policy based on $\pi$. Since we are in an MDP, there exists a common (partial) policy $\pi_{s'}$ that achieves $d_\varepsilon(s, s')$ for every source state $s \in S$. We call $\pi_{s'}$ the navigation policy to $s'$.

Definition 4. The $\varepsilon$-diameter of an MDP is the longest shortest $\varepsilon$-distance between pairs of states in $S$:
$$D_\varepsilon = \max_{s, s' \in S} d_\varepsilon(s, s'). \quad (10)$$

Theorem 5. Assume that the learner runs the $\varepsilon$-greedy Q-learning algorithm on an episodic MDP with $\varepsilon$-diameter $D_\varepsilon$ and maximum episode length $H$, and that the attacker aims at $k$ distinct target states, i.e. $|S^\dagger| = k$. If $\Delta$ is large enough that the $\text{Clip}_\Delta(\cdot)$ function in Alg. 3 never takes effect, then $\phi^{\xi}_{FAA}$ is feasible, and $J(\phi^{\xi}_{FAA})$ is upper bounded by an explicit quantity that scales with $k$, $|S|\,|A|\,H$, and the $\varepsilon$-diameter $D_\varepsilon$.

How large is $D_\varepsilon$? For MDPs whose underlying structure is an undirected graph, such as grid worlds, the expected hitting time of a uniform random walk is bounded by $O(|S|^2)$ (Lawler, 1986). Note that the random-walk hitting time tightly upper bounds the optimal hitting time, a.k.a. the $\varepsilon$-diameter $D_\varepsilon$, and the two match when $\varepsilon = 1$. This immediately gives us the following result:

Corollary 6. If, in addition to the assumptions of Theorem 5, the maximum episode length is $H = O(|S|)$, then $J(\phi^{\xi}_{FAA}) \le O(e^k |S|^2 |A|)$ in grid-world environments. When the number of target states is small, i.e. $k = O(\log |S|)$, we have $J(\phi^{\xi}_{FAA}) \le O(\mathrm{poly}(|S|))$.

Remark 2. Theorem 5 and Corollary 6 can be thought of as defining an implicit $\Delta_4$, such that for any $\Delta > \Delta_4$ the clip function in Alg. 3 never takes effect and $\phi^{\xi}_{FAA}$ achieves polynomial cost.
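The core of Alg. 3 is the greedy one-step poisoning function $g(A_{s_t})$: given the set of actions the attacker currently wants to be greedy at $s_t$, it chooses the perturbation that makes the post-update Q-table respect that preference by margin $\eta$. Below is a minimal sketch of that single step; the names are hypothetical, and the sign convention in the non-target branch follows the margin interpretation reconstructed above rather than a verbatim transcription.

```python
import numpy as np

def faa_greedy_delta(Q, s, a, r, s_next, done, desired, alpha, gamma, eta, Delta):
    """One step of the greedy attack g(A_{s_t}) in Alg. 3, followed by Clip_Delta.

    Q:       current Q-table, shape (S, A)
    desired: set of actions the attacker wants to be greedy at state s
    Returns the clipped perturbation delta_t.
    """
    # Hypothetical Q-update the agent would make with no attack (delta_t = 0)
    q_tilde = (1 - alpha) * Q[s, a] + alpha * (r + (0.0 if done else gamma * Q[s_next].max()))

    others = [Q[s, b] for b in range(Q.shape[1]) if b not in desired]
    if a in desired:
        # Push Q(s, a) at least eta above every non-desired action's value
        delta = max(max(others, default=-np.inf) - q_tilde + eta, 0.0) / alpha
    else:
        # Keep Q(s, a) at least eta below the best desired action's value
        best_desired = max(Q[s, b] for b in desired)
        delta = (best_desired - q_tilde - eta) / alpha

    return float(np.clip(delta, -Delta, Delta))
```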
Figure 4. Attack cost $J_{10^5}(\phi)$ under different $\Delta$. Each curve shows mean ±1 standard error over 1000 independent test runs.

4.4. Illustrating Attack (In)feasibility Thresholds

The theoretical results developed so far are summarized in the diagram of Figure 3. We use the chain MDP in Figure 2 to illustrate the four thresholds $\Delta_1, \Delta_2, \Delta_3, \Delta_4$ developed in this section. On this MDP and with this attack target policy $\pi^\dagger$, we found that $\Delta_1 = \Delta_2 = 0.0069$. The two coincide because this $\pi^\dagger$ is the easiest target to achieve, in the sense of having the smallest possible threshold $\Delta_2$. Attackers whose poison magnitude $|\delta_t| < \Delta_2$ will not be able to enforce the target policy $\pi^\dagger$ in the long run.

We found that $\Delta_3 = 0.132$. We know that $\phi^{sas}_{\Delta_3}$ should be feasible if $\Delta > \Delta_3$. To illustrate this, we ran $\phi^{sas}_{\Delta_3}$ with $\Delta = 0.2 > \Delta_3$ for 1000 trials and obtained the estimate $J_{10^5}(\phi^{sas}_{\Delta_3}) = 9430$. The fact that $J_{10^5}(\phi^{sas}_{\Delta_3}) \ll T = 10^5$ is empirical evidence that $\phi^{sas}_{\Delta_3}$ is feasible.

We found that $\Delta_4 = 1$ by simulation. The adaptive attack $\phi^{\xi}_{FAA}$ constructed in Theorem 5 should be feasible with $\Delta = \Delta_4 = 1$. We ran $\phi^{\xi}_{FAA}$ for 1000 trials and observed $J_{10^5}(\phi^{\xi}_{FAA}) = 30.4 \ll T$, again verifying the theorem. Also observe that $J_{10^5}(\phi^{\xi}_{FAA})$ is much smaller than $J_{10^5}(\phi^{sas}_{\Delta_3})$, verifying the fundamental difference in attack efficiency between the two attack policies, as shown in Theorem 4 and Corollary 6.

While FAA is able to force the target policy in polynomial time, it is not necessarily the optimal attack strategy. Next, we demonstrate how to solve the optimal attack problem in practice, and empirically show that with techniques from deep reinforcement learning (DRL) we can find efficient attack policies in a variety of environments.

5. Attack RL with RL

The attack policies $\phi^{sas}_{\Delta_3}$ and $\phi^{\xi}_{FAA}$ were manually constructed for theoretical analysis. Empirically, though, they need not be the most effective attacks under the relevant constraint. In this section, we present our key computational insight: the attacker can find an effective attack policy by relaxing the attack problem (5) so that the relaxed problem can be solved effectively with RL.

Concretely, consider the higher-level attack MDP $N = (\Xi, [-\Delta, \Delta], \rho, \tau)$ and the associated optimal control problem:

- The attacker observes the attack state $\xi_t \in \Xi$.
- The attack action space is $\{\delta_t \in \mathbb{R} : |\delta_t| \le \Delta\}$.
- The original attack loss function $\mathbf{1}[Q_t \notin \mathcal{Q}^\dagger]$ is a 0-1 loss that is hard to optimize. We replace it with a continuous surrogate loss function $\rho$ that measures how close the current agent Q-table $Q_t$ is to the target Q-table set:
$$\rho(\xi_t) = \sum_{s \in S^\dagger} \left[ \max_{a \notin \pi^\dagger(s)} Q_t(s, a) - \max_{a \in \pi^\dagger(s)} Q_t(s, a) + \eta \right]_+, \quad (12)$$
where $\eta > 0$ is a margin parameter encouraging $\pi^\dagger(s)$ to be strictly preferred over $A \setminus \pi^\dagger(s)$ with no ties.
- The attack state transition probability is defined by $\tau(\xi_{t+1} \mid \xi_t, \delta_t)$. Specifically, the new attack state $\xi_{t+1} = (s_{t+1}, a_{t+1}, s_{t+2}, r_{t+1}, Q_{t+1})$ is generated as follows: $s_{t+1}$ is copied from $\xi_t$ if the episode has not ended, else $s_{t+1} \sim \mu_0$; $a_{t+1}$ is the RL agent's exploration action drawn according to (2) (note that it involves $Q_{t+1}$); $s_{t+2}$ is the RL agent's new state drawn according to the MDP transition probability $P(\cdot \mid s_{t+1}, a_{t+1})$; $r_{t+1}$ is the new (not yet poisoned) reward $R(s_{t+1}, a_{t+1}, s_{t+2})$; the attack $\delta_t$ happens, and the RL agent updates $Q_{t+1}$ according to (3).

With the higher-level attack MDP $N$, we relax the optimal attack problem (5) into
$$\phi^* = \arg\min_{\phi}\ \mathbb{E}_\phi \left[ \sum_{t=0}^{\infty} \rho(\xi_t) \right]. \quad (13)$$
One can now solve (13) using deep RL algorithms. In this paper, we choose Twin Delayed DDPG (TD3) (Fujimoto et al., 2018), a state-of-the-art algorithm for continuous action spaces. We use the same set of TD3 hyperparameters across all experiments, described in the appendix.
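In practice, the higher-level attack MDP $N$ can be wrapped as an ordinary continuous-control environment whose observation is $\xi_t$, whose action is $\delta_t$, and whose per-step cost is $\rho(\xi_t)$, and then handed to TD3 or a similar algorithm. The sketch below is schematic and assumes hypothetical `env_step`/`mu0` interfaces for the victim's MDP; it is not the authors' implementation.

```python
import numpy as np

def surrogate_loss(Q, target, eta):
    """rho(xi_t) of Eq. (12): for every target state, how far the target actions are
    from being strictly preferred (by margin eta) over all other actions."""
    loss = 0.0
    for s, actions in target.items():
        best_target = max(Q[s, a] for a in actions)
        best_other = max(Q[s, a] for a in range(Q.shape[1]) if a not in actions)
        loss += max(best_other - best_target + eta, 0.0)
    return loss

def attack_env_step(xi, delta, Q, env_step, mu0, target,
                    alpha=0.9, gamma=0.9, eps=0.1, eta=0.1, rng=None):
    """One transition of the higher-level attack MDP N = (Xi, [-Delta, Delta], rho, tau).

    xi = (s, a, s_next, r, done); the victim's Q-table is carried separately for clarity.
    env_step(s, a) -> (s_next, r, done) is the victim's environment; mu0() samples an initial state.
    Returns the next attack state and the cost rho evaluated on the updated Q-table.
    """
    rng = rng or np.random.default_rng()
    s, a, s_next, r, done = xi
    # Victim's Q-learning update on the poisoned reward r + delta
    bootstrap = 0.0 if done else gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + delta + bootstrap)
    # Generate the next attack state: reset if needed, epsilon-greedy action w.r.t. the updated Q
    s1 = mu0() if done else s_next
    a1 = rng.integers(Q.shape[1]) if rng.random() < eps else int(np.argmax(Q[s1]))
    s2, r1, done1 = env_step(s1, a1)
    return (s1, a1, s2, r1, done1), surrogate_loss(Q, target, eta)
```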
6. Experiments

In this section, we make empirical comparisons between a number of attack policies $\phi$. We use a naming convention in which the superscript denotes a non-adaptive or adaptive policy: $\phi^{sas}$ depends on $(s_t, a_t, s_{t+1})$ but not on $Q_t$ (such policies have been used extensively in the reward-shaping literature and in prior work on reward poisoning (Ma et al., 2019; Huang & Zhu, 2019)), while $\phi^{\xi}$ depends on the whole attack state $\xi_t$. We use the subscript to denote how the policy is constructed. Therefore, $\phi^{\xi}_{TD3}$ is the attack policy found by solving (13) with TD3; $\phi^{\xi}_{FAA+TD3}$ is the attack policy found by TD3 initialized from FAA (Algorithm 3), where TD3 learns to provide an additional $\delta'_t$ on top of the $\delta_t$ generated by $\phi^{\xi}_{FAA}$, and the agent receives $r_t + \delta_t + \delta'_t$ as reward; $\phi^{sas}_{TD3}$ is the attack policy found using TD3 with the restriction that the attack policy only takes $(s_t, a_t, s_{t+1})$ as input.

In all of our experiments, we assume a standard Q-learning RL agent with parameters $Q_0 = 0_{S \times A}$, $\varepsilon = 0.1$, $\gamma = 0.9$, and $\alpha_t = 0.9$ for all $t$. The plots show ±1 standard error around each curve (some error bands are difficult to see). We will often evaluate an attack policy $\phi$ using a Monte Carlo estimate of the 0-1 attack cost $J_T(\phi)$ for $T = 10^5$, which approximates the objective $J(\phi)$ in (5).

6.1. Efficiency of Attacks across Different $\Delta$'s

Recall that $\Delta > \Delta_3$ and $\Delta > \Delta_4$ are sufficient conditions for the manually designed attack policies $\phi^{sas}_{\Delta_3}$ and $\phi^{\xi}_{FAA}$ to be feasible, but they are not necessary conditions. In this experiment, we empirically investigate the feasibility and efficiency of non-adaptive and adaptive attacks across different values of $\Delta$. We perform the experiments on the chain MDP in Figure 2. Recall that on this example $\Delta_3 = 0.132$ and $\Delta_4 = 1$ (implicit). We evaluate four values, $\Delta \in \{0.1, 0.2, 0.5, 1\}$, covering the range from $\Delta_3$ to $\Delta_4$. The results are shown in Figure 4.

Figure 6. The 10×10 Grid World. $s_0$ is the starting state and $G$ the terminal goal. Each move incurs a $-0.1$ reward, and reaching the goal gives a $+1$ reward. We consider two partial target policies: $\pi^\dagger_1$, marked by the green arrows, and $\pi^\dagger_2$, marked by both the green and the orange arrows.

We make several interesting observations: (1) All attacks are feasible (y-axis values ≪ T), even when $\Delta$ falls below the thresholds $\Delta_3$ and $\Delta_4$ for the corresponding methods. This suggests that the feasibility thresholds are not tight. (2) For non-adaptive attacks, as $\Delta$ increases the best-found attack policies $\phi^{sas}_{TD3}$ achieve only small improvements and generally incur a large attack cost. (3) Adaptive attacks are very efficient when $\Delta$ is large. At $\Delta = 1$, the best adaptive attack $\phi^{\xi}_{FAA+TD3}$ achieves a cost of merely 13 (it takes only 13 steps to permanently force $\pi^\dagger$ on the RL agent). However, as $\Delta$ decreases the performance quickly degrades; at $\Delta = 0.1$ adaptive attacks are only as good as non-adaptive attacks. This reveals an interesting transition region that our theoretical analysis does not cover.

6.2. Adaptive Attacks are Faster

In this experiment, we empirically verify that, while both are feasible, adaptive attacks indeed have an attack cost $O(\mathrm{poly}(|S|))$ while non-adaptive attacks have cost $O(e^{|S|})$. The 0-1 costs $\mathbf{1}[\pi_t \neq \pi^\dagger]$ are in general incurred at the beginning of each $t = 0, \ldots, T$ run; in other words, adaptive attacks achieve $\pi^\dagger$ faster than non-adaptive attacks. We use several chain MDPs similar to Figure 2 but with an increasing number of states, $|S| = 3, 4, 5, 6, 12$. We provide a large enough $\Delta = 2\Delta_4$ to ensure the feasibility of all attack policies. The results are shown in Figure 5.

Figure 5. Attack performance on chain MDPs of different lengths. Each curve shows mean ±1 standard error over 1000 independent test runs.
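All attack-cost numbers reported in this section are Monte Carlo estimates of the truncated 0-1 cost $J_T(\phi)$ averaged over independent runs. A minimal sketch of such an estimator is given below; the `rollout` interface, which should yield the victim's Q-table at every step, is a hypothetical stand-in rather than the paper's evaluation code.

```python
import numpy as np

def in_target_set(Q, target):
    """Check Q in Q_dagger (Definition 2): at every target state, some target action is strictly greedy."""
    for s, actions in target.items():
        best_target = max(Q[s, a] for a in actions)
        best_other = max(Q[s, a] for a in range(Q.shape[1]) if a not in actions)
        if best_target <= best_other:
            return False
    return True

def estimate_attack_cost(rollout, target, T=10**5, n_runs=1000, seed=0):
    """Monte Carlo estimate of J_T(phi) = E[ sum_{t<=T} 1[Q_t not in Q_dagger] ].

    rollout(T, seed) must yield the victim's Q-table Q_t at every step t = 0..T.
    Returns the mean cost and its standard error over n_runs independent runs.
    """
    costs = []
    for i in range(n_runs):
        cost = sum(0 if in_target_set(Q, target) else 1 for Q in rollout(T, seed + i))
        costs.append(cost)
    costs = np.asarray(costs, dtype=float)
    return costs.mean(), costs.std(ddof=1) / np.sqrt(n_runs)
```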
The best-found non-adaptive attack $\phi^{sas}_{TD3}$ is approximately a straight line on the log-scale plot, suggesting an attack cost $J$ that grows exponentially with the MDP size $|S|$. In contrast, the two adaptive attack policies $\phi^{\xi}_{FAA}$ and $\phi^{\xi}_{FAA+TD3}$ actually achieve attack costs linear in $|S|$. This is not easy to see from the log-scaled plot; we reproduce Figure 5 without the log scale in the appendix, where the linear rate can be clearly verified. This suggests that the upper bound developed in Theorem 5 and Corollary 6 can potentially be improved.

6.3. Ablation Study

In this experiment, we compare three adaptive attack policies: $\phi^{\xi}_{TD3}$, the policy found by out-of-the-box TD3; $\phi^{\xi}_{FAA}$, the manually designed FAA policy; and $\phi^{\xi}_{FAA+TD3}$, the policy found by using FAA as the initialization for TD3. We use three MDPs: a 6-state chain MDP, a 12-state chain MDP, and a 10×10 grid-world MDP. The 10×10 MDP has two separate target policies $\pi^\dagger_1$ and $\pi^\dagger_2$; see Figure 6. For evaluation, we compute the number of target actions achieved, $|\{s \in S^\dagger : \pi_t(s) \in \pi^\dagger(s)\}|$, as a function of $t$. This allows us to look more closely into the progress made by an attack. The results are shown in Figure 7.

Figure 7. Experiment results for the ablation study: (a) 6-state chain MDP, (b) 12-state chain MDP, (c) 10×10 MDP with $\pi^\dagger_1$, (d) 10×10 MDP with $\pi^\dagger_2$. Each curve shows mean ±1 standard error over 20 independent test runs. The gray dashed lines indicate the total number of target actions.

First, observe that across all four experiments, the attack policy $\phi^{\xi}_{TD3}$ found by out-of-the-box TD3 never succeeded in achieving all target actions. This indicates that TD3 alone cannot produce an effective attack. We hypothesize that this is due to the lack of an effective exploration scheme: when the target states are sparse ($|S^\dagger| \ll |S|$), it can be hard for TD3 equipped with Gaussian exploration noise to locate all target states. As a result, the attack policy found by vanilla TD3 only achieves the target actions on a subset of frequently visited target states.

The hand-crafted $\phi^{\xi}_{FAA}$ is effective in achieving the target policies, as guaranteed by our theory. Nevertheless, we found that $\phi^{\xi}_{FAA+TD3}$ always improves upon $\phi^{\xi}_{FAA}$. Recall that we use FAA as the initialization and then run TD3. This indicates that TD3 can be highly effective with a good initialization, which effectively serves as an initial exploration policy that allows TD3 to locate all the target states.

Of special interest are the two experiments on the 10×10 Grid World with different target policies. Conceptually, the advantage of the adaptive attack is that the attacker can perform explicit navigation to lure the agent into the target states. An efficient navigation policy that leads the agent to all target states makes the attack very efficient. Observe that in Figure 6, both target policies form a chain, so that if the agent starts at the beginning of the chain, the target actions naturally lead the agent to the subsequent target states, achieving efficient navigation. Recall that the FAA algorithm prioritizes the target states farthest from the starting state. In the 10×10 Grid World, the farthest state is the top-left grid cell. For the target states $S^\dagger_1$, the top-left cell turns out to be the beginning of the target chain. As a result, $\phi^{\xi}_{FAA}$ is already very efficient, and $\phi^{\xi}_{FAA+TD3}$ could not achieve much improvement, as shown in Figure 7c.
On the other hand, for the target states $S^\dagger_2$, the top-left cell is in the middle of the target chain, which makes $\phi^{\xi}_{FAA}$ less efficient. In this case, $\phi^{\xi}_{FAA+TD3}$ makes a significant improvement, successfully forcing the target policy in about 500 steps, whereas $\phi^{\xi}_{FAA}$ takes as many as 1000 steps, about twice as long as $\phi^{\xi}_{FAA+TD3}$.

7. Conclusion

In this paper, we studied the problem of reward-poisoning attacks against reinforcement-learning agents. Theoretically, we provided robustness certificates that guarantee the truthfulness of the learned policy when the attacker's constraint is stringent. When the constraint is loose, we showed that by being adaptive to the agent's internal state, the attacker can force the target policy in polynomial time, whereas a naive non-adaptive attack takes exponential time. Empirically, we formulated the reward-poisoning problem as an optimal control problem on a higher-level attack MDP, and developed computational tools based on DRL that are able to find efficient attack policies across a variety of environments.

Acknowledgments

This work is supported in part by NSF 1545481, 1623605, 1704117, 1836978 and the MADLab AF Center of Excellence FA9550-18-1-0166.

References

Altschuler, J., Brunel, V.-E., and Malek, A. Best arm identification for contaminated bandits. Journal of Machine Learning Research, 20(91):1–39, 2019.

Barto, A. G. Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems, pp. 17–47. Springer, 2013.

Behzadan, V. and Munir, A. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pp. 262–275. Springer, 2017.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.

Chen, M., Beutel, A., Covington, P., Jain, S., Belletti, F., and Chi, E. H. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 456–464. ACM, 2019.

Devlin, S. M. and Kudenko, D. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pp. 433–440. IFAAMAS, 2012.

Dhingra, B., Li, L., Li, X., Gao, J., Chen, Y.-N., Ahmed, F., and Deng, L. Towards end-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.

Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.

Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596, 2018.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.

Huang, Y. and Zhu, Q. Deceptive reinforcement learning under adversarial manipulations on cost signals. arXiv preprint arXiv:1906.10571, 2019.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.
Jun, K.-S., Li, L., Ma, Y., and Zhu, J. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3640–3649, 2018.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Kos, J. and Song, D. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.

Lawler, G. F. Expected hitting times for a random walk on a connected graph. Discrete Mathematics, 61(1):85–92, 1986.

Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., and Jurafsky, D. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.

Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-Y., and Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.

Liu, F. and Shroff, N. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pp. 4042–4050, 2019.

Ma, Y., Jun, K.-S., Li, L., and Zhu, X. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pp. 186–204. Springer, 2018.

Ma, Y., Zhang, X., Sun, W., and Zhu, J. Policy poisoning in batch reinforcement learning and control. In Advances in Neural Information Processing Systems, pp. 14543–14553, 2019.

Melo, F. S. Convergence of Q-learning: A simple proof.

Neff, G. and Nagy, P. Talking to bots: Symbiotic agency and the case of Tay. International Journal of Communication, 10:17, 2016.

Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.

Oudeyer, P.-Y. and Kaplan, F. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.

Peltola, T., Çelikok, M. M., Daee, P., and Kaski, S. Machine teaching of active sequential learners. In Advances in Neural Information Processing Systems, pp. 11202–11213, 2019.

Schmidhuber, J. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222–227, 1991.

Zhang, H. and Parkes, D. C. Value-based policy teaching with active indirect elicitation. 2008.

Zhang, H., Parkes, D. C., and Chen, Y. Policy teaching through reward function learning. In Proceedings of the 10th ACM Conference on Electronic Commerce, pp. 295–304, 2009.

Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., and Tang, J. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 95–103. ACM, 2018.