# When Are Linear Stochastic Bandits Attackable?

Huazheng Wang *1, Haifeng Xu 2, Hongning Wang 2

*Most work was done while with University of Virginia. 1 Princeton University. 2 University of Virginia. Correspondence to: Huazheng Wang. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

We study adversarial attacks on linear stochastic bandits: by manipulating the rewards, an adversary aims to control the behaviour of the bandit algorithm. Perhaps surprisingly, we first show that some attack goals can never be achieved. This is in sharp contrast to context-free stochastic bandits, and is intrinsically due to the correlation among arms in linear stochastic bandits. Motivated by this finding, this paper studies the attackability of a k-armed linear bandit environment. We first provide a complete necessary and sufficient characterization of attackability based on the geometry of the arms' context vectors. We then propose a two-stage attack method against LinUCB and Robust Phase Elimination. The method first asserts whether the given environment is attackable; and if yes, it poisons the rewards to force the algorithm to pull a target arm a linear number of times using only a sublinear cost. Numerical experiments further validate the effectiveness and cost-efficiency of the proposed attack method.

1. Introduction

In a contextual bandit problem, a learner takes sequential actions to interact with an environment to maximize its received cumulative reward. As a natural and important variant, linear stochastic bandits (Auer, 2002; Li et al., 2010; Abbasi-Yadkori et al., 2011) assume the expected reward of an arm a is a linear function of its feature vector x_a and an unknown bandit parameter θ*. A linear bandit algorithm thus adaptively improves its estimation of θ* based on the reward feedback on its pulled arms. Thanks to their sound theoretical guarantees and promising empirical performance, linear stochastic bandits have become a reference solution to many real-world problems, such as content recommendation and online advertisement (Li et al., 2010; Chapelle & Li, 2011; Durand et al., 2018).

Since bandit algorithms adapt their behavior according to the received feedback, such algorithms are susceptible to adversarial attacks, especially poisoning attacks. Under such an attack, a malicious adversary observes the pulled arm and its reward feedback, and then modifies the reward to misguide the bandit algorithm into pulling a target arm that is of the adversary's interest. Due to the wide applicability of bandit algorithms in practice as mentioned before, understanding the robustness of such algorithms under poisoning attacks becomes increasingly important (Jun et al., 2018; Liu & Shroff, 2019; Garcelon et al., 2020).

Most existing studies on adversarial attacks in bandits focused on the context-free stochastic multi-armed bandit (MAB) setting. Jun et al. (2018) and Liu & Shroff (2019) showed that an adversary can force any MAB algorithm to pull a target arm a linear number of times using only a logarithmic cost. Garcelon et al. (2020) showed the vulnerability of k-armed linear contextual bandits under poisoning attacks. Linear stochastic bandits are closely related to both context-free stochastic bandits and linear contextual bandits. Interestingly, however, there has been no known result about attacks on linear stochastic bandits until now.
This paper provides a formal explanation for this gap: the analysis of attacks on linear stochastic bandits turns out to be significantly more challenging due to the correlation among arms; in fact, some learning environments are provably unattackable.

Specifically, we fill the aforementioned gap by studying poisoning attacks on k-armed linear stochastic bandits, where an adversary modifies the rewards using a sublinear attack cost to misguide the bandit algorithm into pulling a target arm x̃ a linear number of times. We first show that a linear stochastic bandit environment is not always efficiently attackable¹, and its attackability is governed by the feasibility of finding a parameter vector θ̃ under which the rewards of all non-target arms are smaller than the reward of the target arm x̃, while the reward of x̃ remains the same as in the original environment specified by θ*. Intuitively, to promote the target arm x̃, an adversary needs to lower the rewards of non-target arms in the null space of x̃ via θ̃, which is not always feasible. We prove that the feasibility of the resulting convex quadratic program is both sufficient and necessary for attacking a linear stochastic bandit environment.

¹Throughout this paper, an efficient attack means fooling the bandit algorithm into pulling the target arm a linear number of times with a sublinear attack cost. We use "attackable" and "efficiently attackable" interchangeably, as the adversary normally only has a limited budget and needs to design a cost-efficient strategy.

Inspired by our attackability analysis, we propose a two-stage attack framework against linear stochastic bandit algorithms and demonstrate its application against LinUCB (Li et al., 2010) and Robust Phase Elimination (Bogunovic et al., 2021): the former is one of the most widely used linear contextual bandit algorithms, and the latter is a robust version designed for settings with adversarial corruptions. In the first stage, our method collects a carefully calibrated amount of rewards on the target arm to assess whether the given environment is attackable. The decision is based on an empirical version of our feasibility characterization. If attackable, i.e., there exists a feasible solution θ̃, in the second stage the method depresses the rewards the bandit algorithm receives on non-target arms based on θ̃, to fool the bandit algorithm into recognizing the target arm as optimal. We prove that in an attackable environment, both algorithms can be successfully manipulated with only a sublinear cost.

Our main contributions can be summarized as follows. First, we characterize the sufficient and necessary conditions for when a stochastic linear bandit environment is attackable as the feasibility of a convex quadratic program. En route to proving the sufficiency, we also provide an oracle attack method that can attack any no-regret learning algorithm given the knowledge of the ground-truth bandit parameter θ*. If the environment is unattackable, i.e., the program is infeasible, our necessity proof implies that even the vanilla LinUCB algorithm cannot be efficiently attacked. A direct corollary of our characterization is that context-free stochastic MAB is always attackable, resonating with the findings in (Jun et al., 2018; Liu & Shroff, 2019). Second, we propose a two-stage attack method that works without the knowledge of the ground-truth bandit parameter. In the first stage, the algorithm detects the attackability of the environment and estimates the model parameter.
In the second stage, it solves for a working solution θ̃ and attacks accordingly. Our theoretical analysis shows this attack method is effective against LinUCB (Li et al., 2010) and Robust Phase Elimination (Bogunovic et al., 2021), i.e., it makes them pull the target arm T − o(T) times using o(T) cost whenever the environment is attackable.

2. Preliminaries

Linear stochastic bandit. We study poisoning attacks on the fundamental k-armed linear stochastic bandit problem (Auer, 2002), where a bandit algorithm sequentially interacts with an environment for T rounds. In each round t, the algorithm pulls an arm a_t ∈ [k] = {1, ..., k} from a set A = {x_i}_{i=1}^k of k arms, and receives reward r_{a_t} from the environment. Each arm a is associated with a d-dimensional context vector x_a ∈ R^d, and the observed reward follows a linear mapping r_{a_t} = x_{a_t}^T θ* + η_t, where θ* ∈ R^d is a common unknown bandit parameter vector and η_t is an R-sub-Gaussian noise term. We assume context vectors and parameters are all bounded; for convenience and without loss of generality, we assume ‖x_i‖_2 ≤ 1 and ‖θ*‖_2 ≤ 1. The performance of a bandit algorithm is evaluated by its pseudo-regret, defined as

$$R_T(\theta^*) = \sum_{t=1}^{T} \big( x_{a^*}^\top \theta^* - x_{a_t}^\top \theta^* \big),$$

where a* is the best arm according to θ*, i.e., a* = argmax_{a ∈ [k]} x_a^T θ*.

LinUCB. LinUCB (Li et al., 2010; Abbasi-Yadkori et al., 2011) is a classical algorithm for linear stochastic bandits. It estimates the bandit model parameter by ridge regression, i.e., θ̂_t = A_t^{-1} Σ_{i=1}^t x_{a_i} r_i, where A_t = Σ_{i=1}^t x_{a_i} x_{a_i}^T + λI and λ is the L2-regularization coefficient. We use ‖x‖_A = √(x^T A x) to denote the matrix norm of vector x. The confidence bound on the reward estimate of arm x is defined as CB_t(x) = α_t ‖x‖_{A_t^{-1}}, where α_t is a high-probability bound on ‖θ* − θ̂_t‖_{A_t}. In each round t, LinUCB pulls the arm with the highest upper confidence bound, i.e., a_t = argmax_{a ∈ [k]} [x_a^T θ̂_t + CB_t(x_a)], to balance the exploration-exploitation trade-off. LinUCB achieves a sublinear regret upper bound, i.e., R_T = O(√T) ignoring logarithmic terms.

Threat model. The goal of an adversary is to fool a linear stochastic bandit algorithm into pulling the target arm x̃ ∈ A for T − o(T) times. Like most recent works in this space (Jun et al., 2018; Liu & Shroff, 2019; Garcelon et al., 2020; Zhang et al., 2020), we consider the widely studied poisoning attack on the rewards: after arm a_t is pulled by the bandit algorithm, the adversary modifies the realized reward r_{a_t} from the environment by an amount Δ_t into r̃_{a_t} = r_{a_t} + Δ_t, and feeds the manipulated reward r̃_{a_t} to the algorithm. Naturally, the adversary aims to achieve its attack goal with minimum attack cost, defined as C(T) = Σ_{t=1}^T |Δ_t|. By convention, an attack strategy is said to be efficient if it uses only a sublinear total cost, i.e., C(T) = o(T).
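To make the interaction concrete, the following is a minimal sketch (not the authors' code) of the LinUCB selection rule together with the poisoning interface of the threat model above: the adversary observes the pulled arm and its realized reward, returns a perturbation Δ_t, and accumulates cost |Δ_t|. The fixed confidence multiplier `alpha` stands in for the time-dependent α_t; all names are illustrative.

```python
import numpy as np

class LinUCB:
    """Ridge-regression estimate plus upper-confidence-bound arm selection."""
    def __init__(self, arms, lam=1.0, alpha=1.0):
        self.arms = np.asarray(arms)                 # k x d context vectors
        self.A = lam * np.eye(self.arms.shape[1])    # A_t = sum x x^T + lam * I
        self.b = np.zeros(self.arms.shape[1])        # sum of reward-weighted contexts
        self.alpha = alpha

    def select(self):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        width = np.sqrt(np.einsum('ij,jk,ik->i', self.arms, A_inv, self.arms))
        return int(np.argmax(self.arms @ theta_hat + self.alpha * width))

    def update(self, a, reward):
        x = self.arms[a]
        self.A += np.outer(x, x)
        self.b += reward * x

def run_poisoned(bandit, env_reward, attacker, T):
    """Threat model: the attacker observes (arm, reward) and returns a perturbation."""
    cost = 0.0
    for t in range(T):
        a = bandit.select()
        r = env_reward(a)                            # realized reward r_{a_t}
        delta = attacker(a, r)                       # adversarial perturbation Delta_t
        cost += abs(delta)                           # attack cost C(T) = sum |Delta_t|
        bandit.update(a, r + delta)                  # algorithm only sees the poisoned reward
    return cost
```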
We conclude the preliminaries with an important remark about a key difference between attacking linear stochastic bandits, studied in this paper, and attacking the k-armed linear contextual bandit setting recently studied in (Garcelon et al., 2020). In linear contextual bandits, all arms share a context vector at each round but each arm has its own (to-be-estimated) bandit parameter. Therefore, the reward manipulation at round t only affects the parameter estimation of the pulled arm a_t, but not of any other arm. This isolates the attack's impact across arms. In contrast, in linear stochastic bandits, all arms share the same bandit parameter but have different context vectors. Thus manipulating the reward of any arm alters the shared parameter estimate, which in turn affects the reward estimates of all arms due to the correlation among their context vectors. Such a coupled effect of adversarial manipulation from the pulled arm a_t to all other arms is unique to linear stochastic bandits, and makes the analysis of attacks much more challenging. It is also the fundamental reason why some linear stochastic bandit environments may not be attackable (see our illustration in Example 1).

3. The Attackability of a Linear Stochastic Bandit Environment

We study the attackability of a linear stochastic bandit environment. At first glance, one might wonder whether attackability is a property of a bandit algorithm rather than a property of the environment, since if an algorithm can be attacked, we should blame the algorithm for not being robust. A key finding of this work is that attackability is also a property of the learning environment; in other words, not all environments are attackable.

Definition 1 (Attackability of a k-Armed Linear Stochastic Bandit Environment). A k-armed linear stochastic bandit environment (A = {x_i}_{i=1}^k, θ*) is attackable with respect to the target arm x̃ ∈ A if for any no-regret learning algorithm, there exists an attack method that uses o(T) attack cost and fools the algorithm into pulling arm x̃ at least T − o(T) times with high probability² for any T large enough, i.e., T larger than a constant T_0.

²Typically, high probability refers to probability 1 − δ for an arbitrarily small δ. Please see the theorems later for rigorous statements.

We make a few remarks about the above definition of attackability. First, this definition is all about the bandit environment (A, θ*) and the target arm x̃, but independent of any specific bandit algorithm. Second, if an attack method can only fool a bandit algorithm into pulling the target arm x̃ under a few different time horizons T, it does not count as a successful attack: it has to succeed for infinitely many time horizons. Third, by reversing the order of quantifiers, we obtain the assertion that a bandit environment is not attackable w.r.t. the target arm x̃ if there exists some no-regret learning algorithm such that no attack method can use o(T) attack cost to fool the algorithm into pulling arm x̃ at least T − o(T) times with high probability for any T large enough.

The following simple yet insightful example illustrates that there are indeed linear stochastic bandit environments in which some no-regret learning algorithm cannot be attacked.

Figure 1. Illustration of attackability of a linear stochastic bandit environment.

Example 1 (An unattackable environment). Figure 1 shows a three-arm environment with A = {x_1 = (0, 1), x_2 = (1, 2), x_3 = (−1, 2)}. Suppose the target arm is x̃ = x_1 and the ground-truth bandit parameter is θ* = (1, 1)³. The expected true rewards of the arms are r*_1 = 1, r*_2 = 3, r*_3 = 1, and x_2 is the best arm in this environment. Based on Definition 1, we need to identify a no-regret learning algorithm that cannot be attacked in this environment, and we argue that LinUCB is such an algorithm. Suppose, for the sake of contradiction, that there exists an efficient attack which fools LinUCB into pulling x_1 for T − o(T) times. LinUCB must then estimate θ* in x_1's direction almost accurately as T becomes large, since the Ω(T) amount of true reward feedback in the x_1 direction will ultimately dominate the o(T) adversarial manipulations. This suggests that the estimated parameter θ̂_t will be close to (α, 1) for some α. Since (α, 1)^T (x_2 + x_3) = 4, either x_2 or x_3 will have its estimated reward larger than 2 (i.e., strictly larger than x_1's estimated reward) for any α. Hence, for large T, x_1 cannot be the arm with the highest UCB during the execution of LinUCB, which is a contradiction. Therefore, this environment cannot be efficiently attacked with o(T) cost. Here we only give an intuitive argument for why this environment with target arm x̃ is not attackable; the formal proof is an instantiation of our Theorem 1. Note that when A = {x_1, x_2}, the environment becomes attackable: as shown in Figure 1, a feasible attack strategy is to perturb the rewards according to θ̃ = (−2, 1). The key idea is that, in the null space of x_1, θ̃ reduces the reward of x_2 to make x_1 the best arm without changing the actual reward of x_1 from the environment. The presence of arm x_3 prevents the existence of such a θ⊥ (and therefore θ̃) and makes the environment unattackable.

³One can re-scale all vectors to have norms smaller than 1, e.g., divide each dimension by 10, without changing the conclusion that the environment is unattackable.
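The argument above can be checked numerically. The snippet below (a small illustration using the vectors of Example 1, not part of the paper) sweeps over parameters of the form (α, 1), i.e., parameters that keep x_1's estimated reward at its true value 1, and confirms that some non-target arm always looks strictly better:

```python
import numpy as np

# Arms from Example 1; x1 is the target arm.
x1, x2, x3 = np.array([0., 1.]), np.array([1., 2.]), np.array([-1., 2.])

for alpha in np.linspace(-5.0, 5.0, 101):
    theta_hat = np.array([alpha, 1.0])     # any estimate consistent with x1's true reward
    assert x1 @ theta_hat == 1.0
    # x2 and x3 rewards sum to 4, so the larger one is at least 2 > 1 = x1's reward.
    assert max(x2 @ theta_hat, x3 @ theta_hat) >= 2.0
print("For every (alpha, 1), some non-target arm dominates the target arm x1.")
```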
The above example motivates us to study when a linear stochastic bandit environment is attackable. After all, only when facing an unattackable environment can we ensure the existence of a no-regret algorithm that is robust to any o(T) poisoning attack.

Next we show that there indeed exists a complete characterization of when a linear stochastic bandit environment is attackable. As Example 1 shows, the attackability of a bandit environment depends on the arm set A = {x_i}_{i=1}^k, the target arm x̃, and the underlying bandit parameter θ*. For clarity of presentation, in this section we shall always assume that the adversary knows exactly the ground-truth bandit parameter θ* and thus the true expected reward of each arm. This is also called the oracle attack in previous works (Jun et al., 2018; Rakhsha et al., 2020). In the next section, we will show that this assumption is not necessary: when the bandit environment is indeed attackable, we can design provably successful attacks even if the adversary does not know the underlying bandit parameter θ*.

We need the following convenient notation to state our results. Let

$$\theta_{\parallel} := \frac{\tilde{x}^\top \theta^*}{\|\tilde{x}\|_2^2}\, \tilde{x} \qquad (1)$$

denote the projection of the ground-truth bandit parameter θ* onto the target direction x̃. Since attackability depends on the target arm x̃ as well, we include the target arm x̃ as part of the bandit environment. The following theorem provides a clean characterization of the attackability of a linear stochastic bandit environment.

Theorem 1 (Characterization of Attackability). A bandit environment (A = {x_i}_{i=1}^k, θ*, x̃) is attackable if and only if the optimal objective ε* of the following convex quadratic program (CQP) satisfies ε* > 0:

$$
\begin{aligned}
\text{maximize} \quad & \epsilon \\
\text{subject to} \quad & \tilde{x}^\top \theta^* \ge \epsilon + x_a^\top (\theta_{\parallel} + \theta_{\perp}), \quad \forall\, x_a \neq \tilde{x}, \\
& \tilde{x}^\top \theta_{\perp} = 0, \\
& \|\theta_{\parallel} + \theta_{\perp}\|_2 \le 1, \qquad (2)
\end{aligned}
$$

where ε ∈ R and θ⊥ ∈ R^d are the variables.
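The characterization is directly checkable given (A, θ*, x̃). Below is one possible implementation of the test (a sketch rather than the authors' code), using the cvxpy package to solve CQP (2); the rescaled vectors of Example 1 serve as a usage check, with and without the third arm.

```python
import numpy as np
import cvxpy as cp

def attackability_index(arms, theta_star, target):
    """Solve CQP (2); return the index epsilon* and the certificate theta_perp."""
    arms = np.asarray(arms)
    x_t = arms[target]
    theta_par = (x_t @ theta_star) / (x_t @ x_t) * x_t        # projection onto the target arm, Eq (1)

    eps, theta_perp = cp.Variable(), cp.Variable(arms.shape[1])
    cons = [x_t @ theta_perp == 0,                            # certificate lies in the null space of x_t
            cp.norm(theta_par + theta_perp, 2) <= 1]          # attacked rewards stay on the same scale
    for a in range(len(arms)):
        if a != target:                                       # every non-target arm must fall below the target
            cons.append(x_t @ theta_star >= eps + arms[a] @ (theta_par + theta_perp))
    cp.Problem(cp.Maximize(eps), cons).solve()
    return eps.value, theta_perp.value

# Example 1, rescaled by 1/10 so that all norms are below 1 (footnote 3).
arms = np.array([[0., 1.], [1., 2.], [-1., 2.]]) / 10.0
theta_star = np.array([1., 1.]) / 10.0
print(attackability_index(arms, theta_star, target=0)[0])     # <= 0: unattackable with all three arms
print(attackability_index(arms[:2], theta_star, target=0)[0]) # > 0: attackable once x3 is removed
```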
Since the conceptual message of Theorem 1 significantly differs from previous studies on adversarial attacks against bandit algorithms, we would like to elaborate on its implications. First of all, we point out, for the first time, that some learning environments are intrinsically robust: even the vanilla LinUCB algorithm, as we will analyze in the proof of Theorem 1, cannot be efficiently attacked when the optimal objective of CQP (2) is not positive. Notably, although almost all previous works have focused on the vulnerability of bandit algorithms, e.g., by designing attacks against UCB and ε-Greedy (Jun et al., 2018) or LinUCB (Garcelon et al., 2020), it just so happens that they were studying attackable environments (see our Corollary 2). To the best of our knowledge, the intrinsic robustness of a linear bandit environment has not been studied before, and our result can be viewed as a complement to these previous works. Second, as we will show next, our proof techniques are also significantly different from existing ones, since what is central to our proof is to demonstrate that when CQP (2) admits no positive objective, there exists a robust algorithm that cannot be efficiently attacked by any adversary. This can be viewed as analyzing the robustness of certain bandit algorithms when ε* ≤ 0 in CQP (2).

Since CQP (2) and its solutions will show up very often in our later analysis, we provide the following definition for reference convenience.

Definition 2 (Attackability Index and Certificate). The optimal objective ε* of CQP (2) is called the attackability index, and the optimal solution θ⊥ of CQP (2) is called the attackability certificate.⁴

⁴We sometimes omit "attackability" when it is clear from the context, and simply mention index and certificate.

We should note that both the index ε* and the certificate θ⊥ are intrinsic to the bandit environment (A = {x_i}_{i=1}^k, θ*, x̃) and are irrelevant to the bandit algorithm used. As we will see in the next section, when designing attack algorithms without the knowledge of θ*, the index ε* determines how difficult it is to attack the environment.

Proof Sketch of Theorem 1.

Proof of sufficiency. This direction is relatively straightforward. Suppose the attackability index ε* > 0 and let θ⊥ be a certificate. We design the oracle null space attack based on the knowledge of the bandit parameter θ*. Let the target parameter be θ̃ = θ∥ + θ⊥, where θ∥ is defined in Eq (1). The adversary replaces the reward of any non-target arm x_a ≠ x̃ with r̃_{a,t} = x_a^T θ̃ + η̃_t, whereas the reward of the target arm x̃ is not touched. To make the attack appear less "suspicious", a sub-Gaussian noise term η̃_t is added to the perturbed reward to make it stochastic. The first constraint in CQP (2) ensures the non-target arms' rewards are smaller than the target arm's, and thus any no-regret bandit algorithm will only pull the non-target arms o(T) times. The resulting cost is also o(T) since the expected cost of each attack is bounded by a constant. Hence, such an attack is successful. Importantly, we note that this argument only relies on the definition of no regret but does not depend on what the algorithm is. This is crucial for proving the sufficiency of attackability.
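Phrased as code, the oracle attack amounts to a one-line reward filter. Below is a minimal sketch under the assumption that θ̃ = θ∥ + θ⊥ has already been computed (e.g., with the CQP solver sketched above); the noise level and names are illustrative, not the authors' implementation.

```python
import numpy as np

_rng = np.random.default_rng(0)

def oracle_null_space_attack(arm_x, true_reward, target_x, theta_tilde, noise_sd=0.1):
    """Reward fed back to the bandit algorithm under the oracle null space attack.

    The target arm passes through untouched (zero cost); every non-target arm's
    reward is replaced by a stochastic sample from the environment induced by
    theta_tilde = theta_par + theta_perp.
    """
    if np.array_equal(arm_x, target_x):
        return true_reward
    return float(arm_x @ theta_tilde + _rng.normal(0.0, noise_sd))
```

Because any no-regret algorithm pulls non-target arms only o(T) times in the environment induced by θ̃, the filter modifies rewards, and hence incurs cost, only o(T) times.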
Proof of necessity. This is the more difficult direction. We shall prove that if ε* ≤ 0, the bandit environment is not attackable. To do so, we need to identify at least one no-regret bandit algorithm such that no attack strategy can successfully attack it. We argue that even the vanilla LinUCB is already robust to any attack strategy with o(T) cost when ε* ≤ 0. Recall that LinUCB maintains an estimate θ̂_t at round t using the attacked rewards {r̃_{1:t}}. We consider LinUCB with a choice of the L2-regularization parameter λ that guarantees ‖θ̂_t‖_2 < 1, in order to satisfy the last constraint in CQP (2). Consider the decomposition θ̂_t = θ̂_{t,∥} + θ̂_{t,⊥}, where θ̂_{t,∥} is parallel to x̃ and θ̂_{t,⊥} is orthogonal to x̃. Suppose, for the sake of contradiction, that LinUCB is attackable when ε* ≤ 0. According to Definition 1, the target arm x̃ will be pulled T − o(T) times with high probability for infinitely many different time horizons T. Fix any large T; we know that x̃ must have the largest UCB score whenever it is pulled at some round t ∈ [T], or formally, for any x_a ≠ x̃ we must have

$$\tilde{x}^\top \hat{\theta}_{t,\parallel} + CB_t(\tilde{x}) \;\ge\; x_a^\top \hat{\theta}_{t,\parallel} + x_a^\top \hat{\theta}_{t,\perp} + CB_t(x_a). \qquad (3)$$

By attackability, the above inequality holds for infinitely many t's. Our main idea for constructing the proof is that as t → ∞, we have CB_t(x̃) → 0 while CB_t(x_a) > 0. Moreover, the estimate θ̂_{t,∥} converges to θ∥, since x̃ is pulled t − o(t) times. The key challenge is to show that CB_t(x_a) − CB_t(x̃), due to Inequality (3), is strictly greater than 0 for all large t. To do so, we prove a lower bound of order Θ(√(log(t/δ)/m)) for CB_t(x_a), where m = o(t) is the number of non-target pulls (Lemma 3 in Appendix B), and an upper bound of order O(√(log(t/δ)/(t − m))) for CB_t(x̃) (Lemma 2). The main technical barrier we overcome is the lower bound proof for the confidence bound term, which employs non-standard arguments, since most (if not all) bandit algorithm analyses only need upper bounds on the confidence bound terms. For this reason, we believe this technical proof is of independent interest, particularly for analyzing the robustness properties of linear bandit algorithms. By letting t → ∞, we obtain the following condition:

$$\tilde{x}^\top \theta^* \;>\; x_a^\top \theta_{\parallel} + x_a^\top \hat{\theta}_{t,\perp}, \qquad \forall\, x_a \neq \tilde{x}. \qquad (4)$$

This implies that for any sufficiently large t, there exists a θ̂_{t,⊥} that satisfies the constraints of CQP (2) and makes its optimal objective ε* positive. But this contradicts the starting assumption ε* ≤ 0; hence, the bandit environment is not attackable.

We now provide an intuitive explanation of Theorem 1. CQP (2) looks for a θ⊥ such that: 1) it is orthogonal to x̃ (hence the subscript); and 2) it maximizes the gap ε between x̃^T θ* and the largest x_a^T (θ∥ + θ⊥) among all x_a ≠ x̃. Theorem 1 states that the bandit environment is attackable if and only if this gap is strictly larger than 0, i.e., when the geometry of the context vectors allows the adversary to lower the non-target arms' rewards in the null space of x̃. The constraint ‖θ∥ + θ⊥‖_2 ≤ 1 ensures the attacked rewards are on the same scale as the unattacked rewards.

Recent works have shown that any no-regret algorithm for the context-free k-armed setting (where the arm set A is orthonormal) can always be attacked (Liu & Shroff, 2019). This finding turns out to be a corollary of Theorem 1.

Corollary 2. For the standard stochastic multi-armed bandit setting, where the arm set A is orthonormal, the environment (A = {x_a}, θ*, x̃) is attackable for any target arm x̃.

Proof. Since the arms are orthogonal to each other, it is easy to see that θ⊥ = −C [Σ_{x_a: x_a ≠ x̃} x_a] achieves objective value C + x̃^T θ∥ in CQP (2). Choosing C to be a large enough positive constant such that the objective value is positive gives us a feasible θ⊥ for CQP (2), which yields the corollary.

The intuition behind this corollary is that arms in context-free stochastic multi-armed bandits are independent, so an adversary can arbitrarily lower the rewards of non-target arms to make the target arm optimal. This is also the attack strategy in (Jun et al., 2018; Liu & Shroff, 2019). Garcelon et al. (2020) showed that a similar idea works for k-armed linear contextual bandits, where each arm is associated with its own unknown bandit parameter and the reward estimates are independent across arms.

We should point out an important distinction between poisoning attacks on k-armed bandits and another line of research on stochastic bandits under adversarial corruption initiated by Lykouris et al. (2018).
For poisoning attacks considered in this paper, the adversary manipulates the realized rewards after the algorithm selects an action, whereas in (Lykouris et al., 2018), the adversary manipulates the entire reward vector before the algorithm selects any action. Obviously, the latter threat model is strictly weaker and has led to various bandit algorithms that achieve sublinear regret as long as the total manipulation is sublinear in T (Lykouris et al., 2018; Zimmert & Seldin, 2021).

4. Effective Attacks without Knowledge of True Model Parameters

In the previous section, we characterized the attackability of a linear stochastic bandit environment assuming knowledge of the ground-truth bandit parameter θ*. We now show that such prior knowledge is not needed when designing practical attacks. Towards this end, we propose provably effective attacks against two representative bandit algorithms: the most well-known LinUCB and a state-of-the-art robust linear stochastic bandit algorithm, Robust Phase Elimination (Bogunovic et al., 2021). We remark that the optimal attacks on these algorithms depend on the characteristics of the algorithms themselves and are generally different, due to their different levels of robustness. This also resonates with the important message mentioned in the introduction: the attackability analysis often goes hand-in-hand with understanding the robustness of different algorithms, as reflected in various parts of our analysis. However, we point out that it remains an intriguing open question whether there is a single attack strategy that can manipulate any no-regret algorithm in an attackable environment.

Two-stage Null Space Attack. Our proposed attack method is presented in Algorithm 1, with a code sketch given below. The method spends the first T_1 rounds as the first stage, attacking the rewards of all arms by imitating a bandit environment θ_0 in which x̃ is the best arm, so that x̃ will be pulled most often by the bandit algorithm. This stage lets the adversary observe the true rewards of x̃ from the environment, which helps it estimate the parameter θ∥. At round T_1, the method uses the estimate of θ∥, denoted θ̄∥, to perform the attackability test by solving CQP (2) with θ̄∥ to obtain an estimated index ε̄ and certificate θ̄⊥. If ε̄ > 0, the method asserts the environment is attackable and starts the second stage of the attack. From T_1 to T, the method perturbs the rewards of non-target arms by r̃_t = x_{a_t}^T (θ̄∥ + θ̄⊥) + η̃_t (just like the oracle attack, but using the estimated θ̄∥). When the bandit algorithm pulls the target arm x̃ for the first time in the second stage, the method compensates the reward of x̃ as shown in line 20, where n(x̃) is the number of times the target arm was pulled before T_1. The goal is to correct the rewards on x̃ collected in the first stage so that they follow θ̃ instead of θ_0. Note that a larger T_1 brings in more observations on x̃ to improve the estimate of θ∥, but it also means a higher attack cost. The optimal choice of T_1 depends on the robustness of the bandit algorithm being attacked, and consequently the attack cost differs across algorithms; for example, as we will show next, attacking Robust Phase Elimination is more costly than attacking LinUCB. Note also that our attackability test may make both false positive and false negative assertions due to the estimation error in θ̄∥. But as T becomes larger, the estimate becomes more accurate and the assertion is correct with high probability.
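The following sketch shows one way the two stages could be wired together (illustrative only, not the authors' implementation). It assumes a `bandit` object with the select()/update() interface and the `attackability_index` helper from Section 3; passing the projected estimate θ̄∥ in place of θ* is fine because CQP (2) uses θ* only through x̃^T θ* and θ∥.

```python
import numpy as np

def two_stage_attack(bandit, env_reward, arms, target, T, T1, noise_sd=0.1, seed=0):
    """Sketch of the two-stage null space attack (cf. Algorithm 1 below)."""
    rng = np.random.default_rng(seed)
    arms = np.asarray(arms)
    x_t = arms[target]
    theta_0 = x_t / np.linalg.norm(x_t)          # simple stand-in for line 2 of Algorithm 1
    target_rewards, theta_tilde, compensated = [], None, False

    for t in range(T):
        a = bandit.select()
        r = env_reward(a)                        # unattacked reward from the environment
        if t < T1:                               # stage 1: imitate the environment theta_0
            if a == target:
                target_rewards.append(r)         # clean observations used to estimate theta_par
            r_fed = arms[a] @ theta_0 + rng.normal(0.0, noise_sd)
        else:
            if theta_tilde is None:              # attackability test at the end of stage 1
                theta_par = (np.mean(target_rewards) / (x_t @ x_t)) * x_t
                eps_hat, theta_perp = attackability_index(arms, theta_par, target)
                if eps_hat is None or eps_hat <= 0:
                    return "not attackable"
                theta_tilde = theta_par + theta_perp
            if a != target:                      # stage 2: depress non-target arms only
                r_fed = arms[a] @ theta_tilde + rng.normal(0.0, noise_sd)
            elif not compensated:                # one-time compensation of stage-1 target rewards (line 20)
                r_fed = (len(target_rewards) * (x_t @ (theta_tilde - theta_0))
                         + x_t @ theta_tilde + rng.normal(0.0, noise_sd))
                compensated = True
            else:
                r_fed = r                        # target arm otherwise left untouched
        bandit.update(a, r_fed)
    return "done"
```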
Remark 1. We acknowledge that the rewards from the two stages follow different reward distributions and could therefore be detected, e.g., using a homogeneity test (Li et al., 2021). Thus a bandit player could notice the attack if equipped with a change detector. However, attacking such a cautious bandit algorithm is beyond the scope of this paper. Moreover, it is very difficult (if not impossible) to attack with a stationary reward distribution or undetectable perturbations (e.g., using a fixed target parameter θ̃ from the start). One can easily construct cases where the adversary's parameter θ̃ is limited to extremely few choices, so that it is almost impossible to directly start the attack with a valid θ̃ without knowing θ*. For example, if we change the third arm in Example 1 to x_3 = (−1 + ε, 0) with a small ε, the valid parameters lie only in a small range around θ̃ = (−1 − ε, 1); to attack with a stationary reward distribution, the adversary would need to start from somewhere very close to θ̃ = (−1 − ε, 1), which we believe is extremely difficult without knowing θ*. Overall, designing an attack strategy against a bandit algorithm equipped with a reward change detector, or showing the inability to attack such cautious algorithms, is an important direction for future work.

Algorithm 1 Two-stage Null Space Attack
1: Inputs: T, T_1
2: θ_0 = argmax_{‖θ‖_2 ≤ 1} [ x̃^T θ − max_{x_a ≠ x̃} x_a^T θ ]; let ε_0 be its optimal objective
3: if ε_0 ≤ 0 then // Initial attackability test
4:   return Not attackable
5: end if
6: for t = 1 to T_1 do
7:   Set r̃_t = x_{a_t}^T θ_0 + η̃_t // Attack as if x̃ is the best arm
8:   Bandit algorithm observes modified reward r̃_t
9: end for
10: Estimate θ̄∥ = ( Σ_{i=1}^{n(x̃)} r_i(x̃) ) / ( n(x̃) ‖x̃‖_2^2 ) · x̃
11: Solve CQP (2) using θ̄∥ to obtain the estimated attackability index ε̄ and certificate θ̄⊥
12: if ε̄ ≤ 0 then // Attackability test
13:   return Not attackable
14: else // Attack stage
15:   Set θ̃ = θ̄∥ + θ̄⊥
16:   for t = T_1 + 1 to T do
17:     if x_{a_t} ≠ x̃ then
18:       Set r̃_t = x_{a_t}^T θ̃ + η̃_t
19:     else if x_{a_t} = x̃ for the first time then
20:       Set r̃_t = n(x̃) x̃^T (θ̃ − θ_0) + x̃^T θ̃ + η̃_t
21:     else
22:       Set r̃_t = r_t
23:     end if
24:     Bandit algorithm observes modified reward r̃_t
25:   end for
26: end if

Attack against LinUCB. We now show how the LinUCB algorithm can be attacked by Algorithm 1.

Theorem 3. For a large enough T_1, the attack strategy in Algorithm 1 correctly asserts attackability with probability at least 1 − δ. Moreover, when the environment is attackable, with probability at least 1 − 3δ the attack strategy fools LinUCB into pulling non-target arms at most

$$O\Big( d \big( \sqrt{\log(T/\delta)} + \sqrt{T_1 \log (T_1/\delta)} + \sqrt{T \log(1/\delta)} / \sqrt{T_1} \big) \sqrt{T \log(T/\delta)} \,/\, \epsilon^* \Big)$$

rounds, and with probability at least 1 − 4δ the adversary's cost is of the same order. Specifically, setting T_1 = T^{1/2} gives the minimum attack cost, of order O(T^{3/4}), and the non-target arms are pulled at most O(T^{3/4}) rounds.

Proof Sketch. To prove that the assertion is correct with high probability, the key idea is that the estimated θ̄∥ is close to the true projection θ∥. We first note that in the first stage, the bandit algorithm pulls the target arm x̃ at least T_1 − O(√T_1) times, since x̃ is the best arm according to θ_0. By Hoeffding's inequality, the estimation error satisfies ‖θ̄∥ − θ∥‖_2 ≤ √( 2 log(2/δ) / (T_1 − O(√T_1)) ). Therefore, with a large enough T_1, the error in x̃'s reward estimate is smaller than ε*. Thus, solving CQP (2) with θ̄∥ in place of θ∥, we correctly assert attackability with a positive estimated index ε̄ whenever the environment is attackable with index ε*.
To prove the success and the cost of the attack, we need to analyze the behavior of LinUCB under the reward discrepancy between the two stages. Our proof crucially hinges on the following robustness result for LinUCB.

Lemma 1. Consider LinUCB with ridge regression under a poisoning attack. Let S̃_t = Σ_{τ ≤ t: x_{a_τ} ≠ x̃} |Δ_τ| be the total corruption on non-target arms until time t, and assume every corruption on the target arm is bounded by γ. For any t = 1, ..., T, with probability at least 1 − δ we have

$$\|\tilde{\theta} - \hat{\theta}_t\|_{A_t} \;\le\; \alpha_t + \tilde{S}_t/\sqrt{\lambda} + \gamma\sqrt{t}, \qquad \text{where } \alpha_t = R\sqrt{d \log\big((1 + t/\lambda)/\delta\big)} + \sqrt{\lambda}.$$

Based on this lemma, we can derive the regret R_T(θ̃) of LinUCB with θ̃ viewed as the true parameter. The total corruption on non-target arms is O(d√(T_1 log(T_1/δ))), since their first-stage rewards are generated by θ_0 (the first-stage rewards of the target arm are compensated in line 20). Because the target arm's rewards are not attacked in the second stage and follow θ∥, we have γ = O(1/√T_1). Since the non-target arms are pulled at most R_T(θ̃)/ε* rounds, substituting the total corruption back gives the claimed bound. The attack cost has two sources: the attacks in the first stage, over T_1 rounds, and the attacks on non-target arms in the second stage. The second term has the same order as the number of rounds in which non-target arms are pulled by LinUCB. Each attack's cost can be decomposed into 1) the change of the mean reward, |x_a^T(θ̃ − θ*)|, and 2) the injected sub-Gaussian noise |η̃_t|, whose sum grows linearly in the number of attacked rounds with high probability. Setting T_1 = T^{1/2}, the total cost is O(T^{3/4}).

Remark 2. Lemma 1 shows that LinUCB still enjoys sublinear regret for any corruption amount S̃ = o(√T). This tolerance of o(√T) corruption turns out to match that of the recently proposed robust linear stochastic bandit algorithm based on phased elimination in (Bogunovic et al., 2021), which we examine next. However, the regret term S̃√T of LinUCB has a worse dependence on S̃ within the o(√T) regime compared to the O(S̃²) regret dependence of the robust algorithm in (Bogunovic et al., 2021).

Attack against Robust Phase Elimination. We now show that Robust Phase Elimination (RobustPhE) (Bogunovic et al., 2021) can also be attacked by Algorithm 1. Compared to attacking LinUCB, the robustness of RobustPhE makes the first-stage attack more challenging, as the attack cost is more sensitive to the length of this stage.

Corollary 4. For any large enough T_1, the attack strategy in Algorithm 1 correctly asserts attackability with high probability. Moreover, when the environment is attackable, with probability at least 1 − 2δ the attack strategy fools RobustPhE into pulling non-target arms at most

$$O\Big( \big( d\sqrt{T \log(T/\delta)} + \sqrt{d}\, T \sqrt{\log(T)\log(1/\delta)} / \sqrt{T_1} + T_1^2 \big) \,/\, \epsilon^* \Big)$$

rounds, and with probability at least 1 − 3δ the adversary's cost is of the same order. Specifically, setting T_1 = T^{2/5} gives the minimum attack cost order O(T^{4/5}), and the non-target arms are pulled at most O(T^{4/5}) rounds.

RobustPhE has an additional regret term O(S²) for a total corruption of S (assuming S is unknown to the bandit algorithm). If we view the second-stage attack model θ̃ as the underlying environment's bandit parameter, the rewards generated by θ_0 in the first stage must be treated as corrupted rewards. The T_1 rewards from the first stage thus contribute a T_1 amount of corruption, which leads to the additional T_1² term compared with the result in Theorem 3. Hence, the adversary can afford fewer rounds in the first stage, yet spends a higher attack cost overall. On the other hand, this also simplifies the design of the attack, as lines 19-20 of Algorithm 1 are not necessary: the corruption in the first stage can be absorbed by the robustness of RobustPhE. The unattacked rewards of the target arm in the second stage are viewed as misspecification from θ̃ with error γ, which leads to the O(γT) term (the second term) in the bound. Our success in attacking RobustPhE does not violate the robustness claim in the original paper (Bogunovic et al., 2021): RobustPhE tolerates o(√T) corruption, whereas our attack cost is O(T^{4/5}).
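To see how the stage lengths in Theorem 3 and Corollary 4 balance, the quick computation below compares the dominant terms of the two bounds, with constants and logarithmic factors dropped (purely illustrative, assuming the bounds as stated above):

```python
import numpy as np

T = 1e8  # illustrative horizon

# LinUCB (Theorem 3): first-stage cost ~ T1, second-stage terms ~ sqrt(T1*T) + T/sqrt(T1)
for p in (0.4, 0.5, 0.6):
    T1 = T ** p
    print(f"LinUCB    T1 = T^{p}: {T1 + np.sqrt(T1 * T) + T / np.sqrt(T1):.2e}")

# RobustPhE (Corollary 4): first-stage corruption enters as T1^2, plus T/sqrt(T1)
for p in (0.3, 0.4, 0.5):
    T1 = T ** p
    print(f"RobustPhE T1 = T^{p}: {T1 ** 2 + T / np.sqrt(T1):.2e}")
```

The first loop is minimized near T_1 = T^{1/2} (order T^{3/4}) and the second near T_1 = T^{2/5} (order T^{4/5}), matching the choices in the statements above.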
5. Experiments

We use simulation-based experiments to validate the effectiveness and cost-efficiency of our proposed attack methods. In our simulations, we generate a size-k arm pool A, in which each arm a is associated with a context vector x_a ∈ R^d. Each dimension of x_a is drawn from a set of zero-mean Gaussian distributions with variances sampled from a uniform distribution U(0, 1). Each x_a is then normalized to ‖x_a‖_2 = 1. The bandit model parameter θ* is sampled from N(0, 1) and normalized to ‖θ*‖_2 = 1. We set d to 10, the standard deviation σ of the Gaussian noise η_t and η̃_t to 0.1, and the arm pool size k to 30. We run each experiment for T = 10,000 iterations. To create an attackable environment, we re-sample the environment (A, θ*, x̃) until it is attackable, following Theorem 1; a sketch of this sampling procedure is given at the end of this section.

We compare the two proposed attack methods, the oracle null space attack and the two-stage null space attack, against LinUCB (Li et al., 2010) and Robust Phase Elimination (RobustPhE) (Bogunovic et al., 2021). We report the average results of 10 runs, where in each run we randomly sample an attackable environment.

Figure 2. Total cost of attack under different attack methods. We report the averaged cost and its variance over 10 runs.
Figure 3. Target arm pulls under different attack methods.

Both the oracle attack and the two-stage attack effectively fool the two bandit algorithms into pulling the target arm a linear number of times, as shown in Figure 3, and the total cost of the attack is shown in Figure 2. We observe that both attack methods are cost-efficient with a sublinear total cost, while the two-stage attack requires a higher attack cost when attacking the same bandit algorithm. Specifically, we notice that the adversary spends an almost linear cost in the first stage. This is because in the first stage the adversary attacks according to the parameter θ_0, which incurs an almost constant cost in every round; this is the price of estimating the bandit model parameter in order to construct the target parameter θ̃. In the second stage, the cost becomes much smaller since the adversary only attacks the non-target arms. We also notice that, for the same attack method, attacking RobustPhE requires a higher cost and results in fewer target arm pulls compared with attacking LinUCB, due to the robustness of the algorithm.
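The rejection-sampling step described above can be written directly in terms of the attackability test. The sketch below reuses the `attackability_index` helper from Section 3; the parameter defaults follow the setup in this section, and everything else is illustrative rather than the authors' code.

```python
import numpy as np

def sample_attackable_environment(k=30, d=10, target=0, seed=0, max_tries=1000):
    """Resample (arm set, theta*) until the environment is attackable w.r.t. the target arm."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        variances = rng.uniform(0.0, 1.0, size=d)              # per-dimension variances ~ U(0, 1)
        arms = rng.normal(0.0, np.sqrt(variances), size=(k, d))
        arms /= np.linalg.norm(arms, axis=1, keepdims=True)    # normalize each arm to unit norm
        theta_star = rng.normal(0.0, 1.0, size=d)
        theta_star /= np.linalg.norm(theta_star)               # normalize the bandit parameter
        eps_star, _ = attackability_index(arms, theta_star, target)
        if eps_star is not None and eps_star > 0:              # keep only attackable instances (Theorem 1)
            return arms, theta_star, eps_star
    raise RuntimeError("no attackable environment found within max_tries")
```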
6. Related Work

Adversarial attacks on bandit algorithms were first studied in the stochastic multi-armed bandit setting (Jun et al., 2018; Liu & Shroff, 2019) and recently in linear contextual bandits (Garcelon et al., 2020). These works share a similar attack idea: lowering the rewards of non-target arms while not modifying the reward of the target arm. However, as our attackability analysis revealed, this idea can fail in a linear stochastic bandit environment, where one cannot lower the rewards of non-target arms without modifying the reward of the target arm, due to their correlation. This insight is a key reason that gives rise to unattackable environments. Ma et al. (2018) also considered the attackability issue of linear bandits, but under the setting of offline data poisoning attacks, where the adversary has the power to modify the rewards in history. There are also several recent works on reward poisoning attacks against reinforcement learning (Yu & Sra, 2019; Zhang et al., 2020; Rakhsha et al., 2021; 2020), but with a quite different focus from ours. Besides reward poisoning attacks, recent works have also studied other threat models such as action poisoning attacks (Liu & Lai, 2020; 2021).

A parallel line of work focuses on improving the robustness of bandit algorithms. Lykouris et al. (2018) proposed a robust MAB algorithm, and Gupta et al. (2019) further improved the solution with an additive regret dependency on the attack cost. Zimmert & Seldin (2021) and Masoudian & Seldin (2021) proposed best-of-both-worlds solutions for stochastic and adversarial bandits, which also solve stochastic bandits with adversarial corruption. Ito (2021) further proposed an optimally robust algorithm under adversarial corruption. These works assume a weaker, oblivious adversary who determines the manipulation before the bandit algorithm pulls an arm. Hajiesmaili et al. (2020) studied robust adversarial bandit algorithms. Guan et al. (2020) proposed a median-based robust bandit algorithm for probabilistic unbounded attacks. Bogunovic et al. (2021) proposed the Robust Phase Elimination algorithm for linear stochastic bandits under a stronger adversary (the same as ours), which tolerates o(√T) corruption when the total corruption is unknown to the algorithm; we showed that our two-stage null space attack can effectively attack this algorithm with O(T^{4/5}) cost. Recently, Zhao et al. (2021) proposed an OFUL-style robust algorithm that can handle an infinite action set, but it only tolerates o(T^{1/4}) corruption.

7. Conclusion

In this paper, we studied the problem of poisoning attacks in k-armed linear stochastic bandits. Different from context-free stochastic bandits and k-armed linear contextual bandits, where the environment is always attackable, we showed that some linear stochastic bandit environments are not attackable due to the correlation among arms. We characterized the attackability condition as the feasibility of a CQP based on the geometry of the arms' context vectors. Our key insight is that, given the ground-truth parameter θ*, the adversary should perform an oracle attack that lowers the rewards of non-target arms in the null space of the target arm's context vector x̃. Based on this insight, we proposed a two-stage null space attack that operates without the knowledge of θ* and applied it against LinUCB and Robust Phase Elimination. We showed that the proposed attack methods are effective and cost-efficient, both theoretically and empirically.

As future work, it is interesting to study the lower bound of attack cost in linear stochastic bandits and to design a cost-optimal attack method with a matching upper bound. One idea is to design a multi-stage attack method following the doubling-trick idea, which is briefly discussed in Appendix C.3. Although the analysis could be much more challenging than for our two-stage attack, it may lead to a lower attack cost as well as handle an unknown time horizon T.
Another intriguing direction is to study algorithm-oblivious choice of the length of the first stage T1 or more generally, algorithm-oblivious attack strategies that can achieve sublinear cost for arbitrary no-regret algorithm without the knowledge of θ . Acknowledgements This work is based upon work supported by National Science Foundation under grant IIS-2128019, IIS-2007492 and IIS-1838615, Bloomberg Data Science Ph.D. Fellowship, and a Google Faculty Research Award. Abbasi-yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. In NIPS, pp. 2312 2320. 2011. Auer, P. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3:397 422, 2002. Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pp. 991 999. PMLR, 2021. Chapelle, O. and Li, L. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pp. 2249 2257, 2011. Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pp. 67 82, 2018. Garcelon, E., Roziere, B., Meunier, L., Tarbouriech, J., Teytaud, O., Lazaric, A., and Pirotta, M. Adversarial attacks on linear contextual bandits. Advances in Neural Information Processing Systems, 33, 2020. Guan, Z., Ji, K., Bucci Jr, D. J., Hu, T. Y., Palombo, J., Liston, M., and Liang, Y. Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 4036 4043, 2020. Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 1562 1578. PMLR, 2019. Hajiesmaili, M., Talebi, M. S., Lui, J., Wong, W. S., et al. Adversarial bandits with corruptions: Regret lower bound and no-regret algorithm. Advances in Neural Information Processing Systems, 33, 2020. Ito, S. On optimal robustness to adversarial corruption in online decision problems. Advances in Neural Information Processing Systems, 34:7409 7420, 2021. Jun, K.-S., Li, L., Ma, Y., and Zhu, J. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3640 3649, 2018. Lattimore, T., Szepesvari, C., and Weisz, G. Learning with good feature representations in bandits and in rl with a generative model. In International Conference on Machine Learning, pp. 5662 5670. PMLR, 2020. Li, C., Wu, Q., and Wang, H. Unifying clustered and nonstationary bandits. In International Conference on Artificial Intelligence and Statistics, pp. 1063 1071. PMLR, 2021. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670. ACM, 2010. Liu, F. and Shroff, N. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pp. 4042 4050, 2019. When Are Linear Stochastic Bandits Attackable? Liu, G. and Lai, L. Action-manipulation attacks on stochastic bandits. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3112 3116. IEEE, 2020. Liu, G. and Lai, L. 
Provably efficient black-box action poisoning attacks against reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021. Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114 122. ACM, 2018. Ma, Y., Jun, K.-S., Li, L., and Zhu, X. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pp. 186 204. Springer, 2018. Masoudian, S. and Seldin, Y. Improved analysis of the tsallis-inf algorithm in stochastically constrained adversarial bandits and stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 3330 3350. PMLR, 2021. Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and Singla, A. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, pp. 7974 7984. PMLR, 2020. Rakhsha, A., Zhang, X., Zhu, X., and Singla, A. Reward poisoning in reinforcement learning: Attacks against unknown learners in unknown environments. ar Xiv preprint ar Xiv:2102.08492, 2021. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Yu, T. and Sra, S. Efficient policy learning for non-stationary mdps under adversarial manipulation. ar Xiv preprint ar Xiv:1907.09350, 2019. Zhang, X., Ma, Y., Singla, A., and Zhu, X. Adaptive rewardpoisoning attacks against reinforcement learning. ar Xiv preprint ar Xiv:2003.12613, 2020. Zhao, H., Zhou, D., and Gu, Q. Linear contextual bandits with adversarial corruptions. ar Xiv preprint ar Xiv:2110.12615, 2021. Zimmert, J. and Seldin, Y. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. J. Mach. Learn. Res., 22:28 1, 2021. When Are Linear Stochastic Bandits Attackable? A. Notations For clarity, we collect the notations used in the paper below. x Context vector of target arm xa Context vector of arm a θ Unknown bandit model parameter of the environment θ Projection of θ on x rt Unattacked reward feedback at time t ηt Sub-Gaussian noise in reward, i.e., rt = x T t θ + ηt. rt Attacked reward ˆθt Parameter estimated by the victim bandit algorithm with attacked rewards { r1:t} θ Parameter parallel to x, estimated by adversary with unattacked rewards θ Paramter of adversary s attack strategy θ Attackability certificate, the parameter orthogonal to x solved by CQP (2) ϵ Attackability index, optimal objective of CQP (2) ϵ Estimated index, optimal objective of CQP (2) with θ replacing θ B. Details on Attackability of Linear Stochastic Bandits We illustrate the details of oracle null space attack in Algorithm 2, which is constructed for the sufficiency proof of Theorem 1. Algorithm 2 Oracle Null Space Attack 1: Inputs: T, θ 2: Initialize: 3: if Optimal objective ϵ of CQP (2) > 0 then // Attackability Test 4: Find the optimal solution θ 5: Set θ = θ + θ 6: else 7: return Not attackable 8: end if 9: for t = 1 to T do 10: Bandit algorithm pulls arm at 11: Attacker observes the corresponding reward rt = x T atθ + ηt from the environment 12: if xat = x then 13: Set rt = x T at θ + ηt // Attack 14: else 15: Set rt = rt 16: end if 17: Bandit algorithm observes modified reward rt 18: end for B.1. Sufficiency Proof of Theorem 1 Proof. For sufficiency proof, we show that there exists an efficient attack strategy if CQP (2) holds. 
Suppose the attackability index ϵ > 0 and let θ be a certificate. In Algorithm 2, we design the oracle null space attack based on the knowledge of bandit parameter θ . Let θ = θ + θ where θ is defined in Eq (1). The adversary perturbs the reward of any non-target arm xa = x by ra,t = x T a θ + ηt, whereas the reward of the target arm x is not touched. In other words, the adversary misleads the algorithm to believe that θ is the ground-truth parameter We should note both θ and θ generate the same expected reward on x, i.e., x T θ = x Tθ = x Tθ . To make the attack appear less suspicious , a sub-Gaussian noise term ηt is added to the perturbed reward to make it stochastic. The key idea is that the attacker does not need to perturb the reward of the target arm because the original reward is the same as perturbed reward in expectation. When Are Linear Stochastic Bandits Attackable? Instead, the attacker only perturbs the reward in the null space of x according to θ , which guarantees the cost-efficiency of the attack. Since the perturbed rewards observed by the bandit algorithm are generated by θ, the target arm x is the optimal arm in this environment due to the attackability index ϵ being strictly positive. According to the definition, any no-regret bandit algorithm will only pull suboptimal arms, i.e., the non-target arms, o(T) times and pull target arm T o(T) times with high probability. Thus the attack is successful. Moreover, the cost of oracle attack is o(T) because the attacker only perturbs rewards on the non-target arms for o(T) times, and the cost on each attack is bounded by a constant (because of the finite norm of xa and θ ). B.2. Necessity Proof of Theorem 1 To prove its necessity, we will rely on the following results. Lemma 2. Suppose arm x is pulled n times till round t by Lin UCB. Its confidence bound CBt(x) in Lin UCB satisfies CBt(x) αt n. (6) with probability at least 1 δ, where αt = r d log 1+t/λ λ. Furthermore, we have with probability at least 1 δ. Proof. In (Abbasi-yadkori et al., 2011), the exploration bonus term is computed as CBt(x) = αt x A 1 t . Denote A t = n xx T + λI. Since At = Pt i=1 xaix T ai + λI, we have At A t. We can thus bound x A 1 t by x A 1 t x A t 1 1 n (8) According to Theorem 2 in (Abbasi-yadkori et al., 2011), d log 1 + t/λ log(t/δ)). (9) Combining αt and (8) finishes the proof. Claim 1. Target arm x is pulled n = T o(T) times till round T. According to Lemma 2, we have log(T/δ) T o(T) Lemma 3. Suppose arm x is pulled t m times till round t by Lin UCB, and other arms are pulled m times in total. Confidence bound CBt(xa) of any arm xa that is not parallel to x satisfies with probability at least 1 δ, where αt = r d log 1+t/λ λS and constant b = x T ax x Tx. Furthermore, we have CBt(xa) Θ p log(t/δ) 1 m 1 t m with probability at least 1 δ When Are Linear Stochastic Bandits Attackable? Proof. Since CBt(x) = αt x A 1 t , we need to show a lower bound of xa A 1 t . Since xa x, we decompose xa = x a + x a , where x a x. By the reverse triangle inequality we have xa A 1 t x a A 1 t x a A 1 t (13) First we analyze the term x a A 1 t . Decompose At = (t m) xx T + P i,xai =x xaix T ai + λI and let A t = P i,xai =x xaix T ai + λI. Since x a x, we have x a TAtx a = x a TA tx a . 
There are at most m terms in the summation of A t = P i,xai =x xaix T ai + λI, thus x a TA tx a x a T m x a 2 2 x a x a T + λI x a m + λ Then we have x a A 1 t = q x a TA 1 t x a = q x a TA 1 t x a = 1 q x a TA tx a Similar to Eq (8), we have x a A 1 t x a 2 x 2 t m + λ (15) Let constant b = x a 2 x 2 = x T ax x Tx. Substitute the terms and we have CBt(x) = αt x A 1 t αt x a A 1 t x a A 1 t Claim 2. Non-target arms are pulled m = o(T) times till round T. According to Lemma 3, any arm xa x satisfies with probability at least 1 δ. Lemma 4. Suppose the non-target arms {xa = x} are pulled o(T) times, the arm x is pulled T o(T) times, and the total manipulation is CT . With probability at least 1 δ, reward estimation error satisfies x TˆθT, x Tθ CT T o(T) + αt p T o(T) , x A. (18) When Are Linear Stochastic Bandits Attackable? x T(ˆθT θ ) t=1 xt rt(xt) Atθ ! t=1 xt( rt(xt) x T t θ ) λθ ! t=1 xtηt λθ ! 2 + 1 x 2 2 x TA 1/2 t A 1/2 t t=1 xtηt λθ ! CT T o(T) + αt p Note that x 2 1. In the last step, the first term is because there are T o(T) number of x x T in At and x TA 1 t 2 1 T o(T ), and PT t=1 xt t 2 is bounded by total manipulation CT . Similarly, in the second term we have x TA 1/2 t 2 1 T o(T ), and A 1/2 t PT t=1 xtηt λθ 2 A 1/2 t PT t=1 xtηt 2 + A 1/2 t λθ 2 = PT t=1 xtηt A 1 t + λθ A 1 t αt is the self-normalized bound for vector-valued martingales following Theorem 1 in (Abbasi-yadkori et al., 2011). Now we are ready to prove that the index ϵ in CQP (2) being positive is the necessary condition of an attackable environment. Proof of Necessity of Theorem 1. We prove if ϵ 0, the bandit environment is not attackable. To prove this, we show that there exists some no-regret bandit algorithm (Lin UCB in particular) such that no attacking strategy can succeed. In particular, we will show that Lin UCB (with a specific choice of its L2-regularization parameter λ) is robust under any attacking strategy with o(T) cost when ϵ 0. We prove it by contradiction: assume Lin UCB is attackable with o(T) cost when ϵ 0. According to Definition 1, the target arm x will be pulled T o(T) times for infinitely many different time horizons T, and the following inequalities hold when arm x is pulled by Lin UCB: x TˆθT, + CBT ( x) > x T a ˆθT, + x T a ˆθT, + CBT (xa), xa = x (19) where ˆθt is Lin UCB s estimated parameter at round t based on the attacked rewards. We decompose ˆθT = ˆθT, + ˆθT, , where x ˆθt, and x ˆθT, . We will now show that the above inequalities lead to x Tθ > x T aθ + x T a ˆθT, , xa = x By Lemma 4 we have x T a ˆθT, x T aθ CT T o(T) αT p x TˆθT, x Tθ + CT T o(T) + αT p Substitute them back and we have that with probability at least 1 2δ, x Tθ > x T aθ + x T a ˆθT, + CBT (xa) CBT ( x) 2CT T o(T) 2αT p T o(T) , xa = x (20) When Are Linear Stochastic Bandits Attackable? Let us first consider the case of xa x. Substitute Claim 1 and Claim 2 back and we have with probability at least 1 4δ x Tθ >x T aθ + x T a ˆθT, + Θ log(T/δ) T o(T) 2CT T o(T) 2αT p T o(T) , xa x Taking T and noticing that the last three terms diminish to 0 faster than the third term, there must exists a T0 such that for any T > T0, x Tθ > x T aθ + x T a ˆθT, , xa = x (21) satisfies when xa x. Now we consider the special case that some xa x and show that the above inequality is still strict. Let xa = c x. If |c| > 1, we have CBT (xa) CBT ( x) = (c 1)CBT ( x) > 0. If |c| < 1, since x is pulled linear times for any large t with sublinear cost, then x Tˆθt, > 0; otherwise the cost would be linear. 
We directly have x Tˆθt, = x T a ˆθt, + (1 c) x Tˆθt, > x T a ˆθt, . This leads to x Tθ > x T aθ (inequality (21)) since xa ˆθT, . Combining the two cases, we know there must exist a ˆθT, that satisfies inequality (21) (the first constraint of CQP (2)), x ˆθT, (the second constraint of CQP (2)), and makes the objective of CQP (2) larger than 0. To satisfy the last constraint, we consider Lin UCB with the choice of L2-regularization parameter λ that guarantees ˆθt 2 < 1 given the data for large enough T and any t < T. Note that ridge regression is equivalent to minimizing square loss under some constraint ˆθt 2 c, and there always exists a one-to-one correspondence between λ and c (one simple way to show the correspondence is using KKT conditions). Therefore, we are guaranteed to find a λ that gives us c = 1 ζ where ζ is an arbitrarily small constant. Then we know that ˆθT, satisfies ˆθT = ˆθT, + ˆθT, 2 c < 1. We prove the last constraint θ = θ + ˆθT, 2 1 by the fact that ˆθT, + ˆθT, 2 < 1 and θ ˆθT, 2 is arbitrarily small for large enough T according to Lemma 4. Now we proved that there exists a ˆθT, that satisfies all the constraints in CQP (2) with positive objective, which means the optimal objective ϵ must also be positive. This however contradicts our assumption ϵ 0, implying that such Lin UCB is not attackable by any attack strategy. C. Details on Effective Attacks Without Knowledge of Model Parameters We now prove the theorems of using Two-stage Null Space Attack (Algorithm 1) against Lin UCB and Robust Phase Elimination. C.1. Proof of Theorem 3 Proof of Theorem 3. We first prove that for a large enough T, Algorithm 1 will correctly assert the attackability with probability at least 1 δ. We rely on the following lemma to show θ estimated in step 11 of Algorithm 1 is close to the true parameter θ . Lemma 5 (Estimation error of θ ). Algorithm 1 estimates θ by θ = Pn( x) i=1 ri( x) n( x) x 2 2 x. (22) With probability at least 1 δ, the estimation error is bounded by 2R2 log(1/δ) where the rewards have R-sub-Gaussian noise. When Are Linear Stochastic Bandits Attackable? Proof. θ is the projected vector of θ onto x, which is as defined in Eq (1). Though we need to estimate the vector θ Rd, we actually only need to estimate the scale of it by ˆl = Pn( x) i=1 ri( x) n( x) x 2 2 , since the direction is known to be x. Based on Hoeffding s inequality, the estimation error of averaged rewards on x is bounded by Pn( x) i=1 ri( x) n( x) r ( x) 2R2 log(1/δ) where r ( x) = x Tθ . Considering x 2 2 = 1 and we finish the proof. In the first stage, the bandit algorithm will pull the target arm x for T1 O( T1) times, since x is the best arm according to θ0. According to Lemma 5, with probability at least 1 δ the estimation error is bounded as 2R2 log(1/δ) T1 O( T1). As a result, with a large enough T1, the error of x s reward estimation satisfies | x T θ x Tθ | x 2 θ θ 2 2R2 log(1/δ) T1 O( T1) ϵ . Thus solving CQP (2) with θ replacing θ and we could correctly assert attackability with an estimated positive index ϵ when the environment is attackable with index ϵ . Remark 3. From the analysis above, we notice that the adversary requires sufficiently large T1 to collect enough rewards on the target arm, in order to make the correct attackability assertion. When T1 is not large enough, the attackability test may have false positive or false negative conclusions. We empirically test the error rate and report the results in Additional Experiments section. 
We now prove the success and the total cost of the proposed attack. The analysis relies on the robustness property of LinUCB stated in Lemma 1, which is restated and proved below.

Lemma 6 (Robustness of ridge regression). Consider LinUCB with ridge regression under a poisoning attack. Let $S_t = \sum_{\tau \in \{1,\dots,t\}:\, x_{a_\tau} \neq \tilde{x}} |\Delta_\tau|$ be the total corruption on non-target arms up to time $t$, and assume every corruption on the target arm is bounded by $\gamma$. Then for any $t = 1,\dots,T$, with probability at least $1 - \delta$ we have

$$\big\|\theta^* - \hat\theta_t\big\|_{A_t} \le \alpha_t + \frac{S_t}{\sqrt{\lambda}} + \gamma\sqrt{t}, \qquad \text{where } \alpha_t = R\sqrt{d\log\frac{1 + t/\lambda}{\delta}} + \sqrt{\lambda}.$$

Proof. Based on the closed-form solution of ridge regression, we have

$$\hat\theta_t = \theta^* - \lambda A_t^{-1}\theta^* + A_t^{-1}\sum_{\tau=1}^{t} x_{a_\tau}\big[\eta_\tau + \Delta_\tau\big].$$

Therefore, using ideas from (Abbasi-yadkori et al., 2011), we have

$$\big\|\hat\theta_t - \theta^*\big\|_{A_t} \le \lambda^{1/2}\|\theta^*\|_2 + \bigg\|\sum_{\tau=1}^{t} x_{a_\tau}\eta_\tau\bigg\|_{A_t^{-1}} + \bigg\|\sum_{\tau=1}^{t} x_{a_\tau}\Delta_\tau\bigg\|_{A_t^{-1}} \le \alpha_t + \bigg\|\sum_{\tau=1}^{t} x_{a_\tau}\Delta_\tau\bigg\|_{A_t^{-1}}$$

and

$$\bigg\|\sum_{\tau=1}^{t} x_{a_\tau}\Delta_\tau\bigg\|_{A_t^{-1}} \le \bigg\|\sum_{\tau:\, x_{a_\tau}\neq\tilde{x}} x_{a_\tau}\Delta_\tau\bigg\|_{A_t^{-1}} + \bigg\|\sum_{\tau:\, x_{a_\tau}=\tilde{x}} x_{a_\tau}\Delta_\tau\bigg\|_{A_t^{-1}} \le \sum_{\tau:\, x_{a_\tau}\neq\tilde{x}} \big\|x_{a_\tau}\Delta_\tau\big\|_{A_t^{-1}} + \gamma\, n(\tilde{x})\,\|\tilde{x}\|_{A_t^{-1}}$$
$$\le \sum_{\tau:\, x_{a_\tau}\neq\tilde{x}} \frac{|\Delta_\tau|\,\|x_{a_\tau}\|_2}{\sqrt{\lambda}} + \gamma\, n(\tilde{x})\,\|\tilde{x}\|_{A_t^{-1}} \le \frac{S_t}{\sqrt{\lambda}} + \gamma\, n(\tilde{x})\,\|\tilde{x}\|_{A_t^{-1}} \le \frac{S_t}{\sqrt{\lambda}} + \gamma\sqrt{n(\tilde{x})}$$

with probability at least $1 - \delta$, where $n(\tilde{x})$ is the number of times the target arm has been pulled. The bound by $\alpha_t$ uses the definition of $\alpha_t$ and introduces the high-probability self-normalized bound; the term $\gamma\, n(\tilde{x})\,\|\tilde{x}\|_{A_t^{-1}}$ uses $|\Delta_\tau| \le \gamma$ whenever $x_{a_\tau} = \tilde{x}$; the factor $1/\sqrt{\lambda}$ is because $A_t \succeq \lambda I$; and the last inequality follows Eq (8). Finally, noticing that $n(\tilde{x}) \le t$ finishes the proof.

Let us first analyze the attack in the first stage. Denote $R_T(\theta)$ as the regret of LinUCB up to round $T$, where $\theta$ is the ground-truth parameter. We know from (Abbasi-yadkori et al., 2011) that if the rewards are all generated by $\theta$, then with probability at least $1 - \delta$ we have

$$R_T(\theta) \le \alpha_T\sqrt{d\,T\log(1 + T/\lambda)} = O\big(d\sqrt{T}\log(T/\delta)\big), \qquad (26)$$

where $\alpha_t = R\sqrt{d\log\frac{1 + t/\lambda}{\delta}} + \sqrt{\lambda}$. Then the attack in the first $T_1$ rounds based on $\theta_0$ makes the bandit algorithm pull $\tilde{x}$ at least $T_1 - R_{T_1}(\theta_0)/\epsilon_0$ times. According to Lemma 5, with probability at least $1 - \delta$ the parameter estimation error is bounded by

$$\big\|\tilde\theta_{\parallel,T_1} - \theta^*_\parallel\big\|_2 \le \frac{\sqrt{2\log(1/\delta)}}{\sqrt{T_1 - R_{T_1}(\theta_0)/\epsilon_0}} \le \frac{2\sqrt{2\log(1/\delta)}}{\sqrt{T_1}}$$

for large enough $T_1$. This means we have

$$\gamma = \big|\tilde{x}^\top\big(\tilde\theta_{\parallel,T_1} - \theta^*_\parallel\big)\big| \le \big\|\tilde\theta_{\parallel,T_1} - \theta^*_\parallel\big\|_2 \le \frac{2\sqrt{2\log(1/\delta)}}{\sqrt{T_1}}. \qquad (28)$$

Now we prove that the attack succeeds with high probability. Consider the regret of LinUCB against $\tilde\theta$ as the ground-truth parameter. The estimation error $\|\hat\theta_t - \tilde\theta\|_{A_t}$ has three sources: the sub-Gaussian noise, the rewards on non-target arms in the first stage generated by $\theta_0$ (the rewards on the target arm are corrected to $\tilde\theta$ in steps 19-20 of Algorithm 1), and the unattacked rewards on the target arm in the second stage generated by $\theta^*$. According to Lemma 1, with probability at least $1 - 3\delta$ we have

$$\big\|\hat\theta_t - \tilde\theta\big\|_{A_t} \le \alpha_t + \frac{R_{T_1}(\theta_0)}{\sqrt{\lambda}} + \frac{2\sqrt{2 t\log(1/\delta)}}{\sqrt{T_1}}, \quad \forall t > T_1.$$

To bound the number of rounds pulling non-target arms, we first look at the regret against $\tilde\theta$, i.e., $R_T(\tilde\theta)$:

$$R_T(\tilde\theta) = \sum_{t=1}^{T}\big(\tilde{x}^\top\tilde\theta - x_{a_t}^\top\tilde\theta\big) \le \sum_{t=1}^{T}\big(\tilde{x}^\top\hat\theta_t + CB_t(\tilde{x}) - x_{a_t}^\top\tilde\theta\big) \le \sum_{t=1}^{T}\big(x_{a_t}^\top\hat\theta_t + CB_t(x_{a_t}) - x_{a_t}^\top\tilde\theta\big)$$
$$\le \sum_{t=1}^{T} 2\,CB_t(x_{a_t}) \le 2\sqrt{T\sum_{t=1}^{T} CB_t^2(x_{a_t})} \le 2\bigg(\alpha_T + \frac{R_{T_1}(\theta_0)}{\sqrt{\lambda}} + \frac{2\sqrt{2 T\log(1/\delta)}}{\sqrt{T_1}}\bigg)\sqrt{T\sum_{t=1}^{T}\|x_{a_t}\|_{A_t^{-1}}^2}$$
$$\le 2\bigg(\alpha_T + \frac{R_{T_1}(\theta_0)}{\sqrt{\lambda}} + \frac{2\sqrt{2 T\log(1/\delta)}}{\sqrt{T_1}}\bigg)\sqrt{d\,T\log(1 + T/\lambda)},$$

which holds with probability at least $1 - 3\delta$. LinUCB will pull non-target arms at most $R_T(\tilde\theta)/\tilde\epsilon$ times, which can be bounded by

$$\frac{R_T(\tilde\theta)}{\tilde\epsilon} \le 2\bigg(\alpha_T + \frac{R_{T_1}(\theta_0)}{\sqrt{\lambda}} + \frac{2\sqrt{2 T\log(1/\delta)}}{\sqrt{T_1}}\bigg)\frac{\sqrt{d\,T\log(1 + T/\lambda)}}{\tilde\epsilon}$$

and is in the order of

$$O\Big(\big(d\sqrt{\log(T/\delta)} + \sqrt{T_1\log(T_1/\delta)} + \sqrt{T\log(1/\delta)/T_1}\big)\sqrt{T\log(T/\delta)}\,\big/\,\tilde\epsilon\Big). \qquad (29)$$

The $\sqrt{T_1\log(T_1/\delta)}$ term is due to the corrupted rewards of non-target arms observed in the first stage. Setting $T_1 = T^{1/2}$ gives the minimum number of rounds pulling non-target arms, which is $O(T^{3/4})$ according to Eq (29).
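As a sanity check on Lemma 6, the short simulation below corrupts a toy linear-bandit data set in the two ways the lemma distinguishes (a total budget $S_t$ on non-target arms and a per-round bound $\gamma$ on the target arm) and compares the ridge estimate's $A_T$-norm error with $\alpha_t + S_t/\sqrt{\lambda} + \gamma\sqrt{t}$. The arm vectors, the corruption schedule, and all constants are arbitrary choices of ours, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, lam, R, delta = 2, 2000, 1.0, 0.1, 0.05
x_target = np.array([0.0, 1.0])
x_other = np.array([0.7, 0.3])
theta_star = np.array([0.3, 0.5])

X, rewards = [], []
S_nontarget, gamma = 0.0, 0.02        # total non-target corruption / per-round bound on the target arm
for t in range(T):
    x = x_target if t % 4 else x_other        # target arm pulled 3 rounds out of 4
    r = x @ theta_star + R * rng.normal()
    if x is x_other and t < 200:              # corrupt the early non-target rewards
        r -= 0.5
        S_nontarget += 0.5
    if x is x_target:                         # every target-arm reward corrupted by at most gamma
        r += gamma
    X.append(x)
    rewards.append(r)

X, rewards = np.array(X), np.array(rewards)
A = X.T @ X + lam * np.eye(d)
theta_hat = np.linalg.solve(A, X.T @ rewards)
err = np.sqrt((theta_hat - theta_star) @ A @ (theta_hat - theta_star))   # ||theta* - theta_hat||_{A_T}
alpha_T = R * np.sqrt(d * np.log((1 + T / lam) / delta)) + np.sqrt(lam)
bound = alpha_T + S_nontarget / np.sqrt(lam) + gamma * np.sqrt(T)
print(err, bound, err <= bound)
```

In runs like this the realized error sits far below the bound; the bound is loose mainly because the $S_t/\sqrt{\lambda}$ term ignores how the corrupted directions interact with $A_t$.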
Now we prove the total cost $C(T)$. Note that in order to make the attack stealthy, we inject sub-Gaussian noise $\tilde\eta_t$ into the perturbed reward to keep it stochastic. We separate the total cost into the cost of changing the mean reward and the cost of the injected sub-Gaussian noise:

$$C(T) = \sum_{t\in\{1,\dots,T\}:\, \tilde{r}_t \neq r_t} |\tilde{r}_t - r_t| \le \sum_{i}\big|x_{a_i}^\top(\tilde\theta - \theta^*)\big| + \sum_{i}|\tilde\eta_i|, \qquad (30)$$

where $i$ ranges over the attacked rounds $\{t \in \{1,\dots,T\} : \tilde{r}_t \neq r_t\}$. Let $N = |\{i\}|$ be the total number of attacked rounds. Since the number of attacks on non-target arms is bounded by Eq (29) and the target arm is attacked at most $T_1$ times, we have with probability $1 - \delta$,

$$N = T_1 + O\Big(\big(d\sqrt{\log(T/\delta)} + \sqrt{T_1\log(T_1/\delta)} + \sqrt{T\log(1/\delta)/T_1}\big)\sqrt{T\log(T/\delta)}\,\big/\,\tilde\epsilon\Big) = O(T^{3/4}) \qquad (31)$$

when setting $T_1 = T^{1/2}$. Notice that since $\tilde\eta_t$ is $R$-sub-Gaussian, its absolute value $|\tilde\eta_t|$ is also $R$-sub-Gaussian, and $\mathbb{E}[|\tilde\eta_t|] < L$ for some constant $L$ following Proposition 2.5.2 in (Vershynin, 2018). According to the general Hoeffding's inequality, with probability at least $1 - \delta$,

$$\sum_{i=1}^{N}|\tilde\eta_i| \le N L + \sqrt{N L\log(2/\delta)} = O\big(N + \sqrt{N\log(1/\delta)}\big). \qquad (32)$$

Thus the second term of Eq (30) is of order $O(N) = O(T^{3/4})$. The first term of Eq (30) is bounded by $2N + 2T_1$, because each attack changes the mean reward by at most $2$ except for the compensation step in line 20, and the reward compensation on the target arm can be bounded by $2T_1$ because the target arm is pulled at most $T_1$ times in the first stage. Overall, with probability at least $1 - 4\delta$,

$$C(T) \le \sum_{i}\big|x_{a_i}^\top(\tilde\theta - \theta^*)\big| + \sum_{i}|\tilde\eta_i| \le 2N + 2T_1 + N L + \sqrt{N L\log(2/\delta)} = O(N) = O(T^{3/4}) \qquad (33)$$

when setting $T_1 = T^{1/2}$.

C.2. Proof Sketch of Corollary 4

The proof is similar to that of Theorem 3, so we only explain the differences here. Instead of using Lemma 1 to analyze the impact of the perturbed rewards generated by $\theta_0$ (against $\tilde\theta$) collected in the first stage, we use the fact that Robust PhE incurs an additional regret term $O(S^2)$ under corruption level $S$ (assuming $S$ is unknown to the bandit algorithm). Since the bandit algorithm observes $T_1$ rewards in the first stage, $S \le 2T_1$ and the additional regret due to the rewards from the first stage is $O(T_1^2)$. For the unattacked rewards on the target arm in the second stage generated by $\theta^*$, we view them as rewards generated by $\tilde\theta$ with misspecification error $\gamma$, i.e., $|\tilde{x}^\top(\theta^* - \tilde\theta)| \le \gamma$. Proposition 5.1 in (Lattimore et al., 2020) shows that the phase elimination algorithm with misspecification $\gamma$ incurs additional regret $O(\gamma\sqrt{d}\,T\log(T))$. With $\gamma \le 2\sqrt{2\log(1/\delta)/T_1}$ by Eq (28), the total regret is

$$R_T(\tilde\theta) = O\Big(d\sqrt{T\log(T/\delta)} + \sqrt{d}\,T\log(T)\sqrt{\log(1/\delta)}/\sqrt{T_1} + T_1^2\Big)$$

with probability at least $1 - 2\delta$. Therefore, with probability at least $1 - 2\delta$, the attack strategy fools Robust PhE into pulling non-target arms at most $O\big(\big(d\sqrt{T\log(T/\delta)} + \sqrt{d}\,T\log(T)\sqrt{\log(1/\delta)}/\sqrt{T_1} + T_1^2\big)/\tilde\epsilon\big)$ rounds. Similar to Eq (31), we bound the total number of attacked rounds against Robust PhE by

$$N = T_1 + O\Big(\big(d\sqrt{T\log(T/\delta)} + \sqrt{d}\,T\log(T)\sqrt{\log(1/\delta)}/\sqrt{T_1} + T_1^2\big)\big/\tilde\epsilon\Big). \qquad (34)$$

From Eq (33), we know the total cost is of the same order as the number of attacked rounds. So with probability at least $1 - 3\delta$ the total cost is $O\big(T_1 + \big(d\sqrt{T\log(T/\delta)} + \sqrt{d}\,T\log(T)\sqrt{\log(1/\delta)}/\sqrt{T_1} + T_1^2\big)/\tilde\epsilon\big)$. Setting $T_1 = T^{2/5}$ gives the minimum attack cost $O(T^{4/5})$, and the non-target arms are pulled at most $O(T^{4/5})$ rounds.

Remark 4. Note that we bound the total corruption of the first stage by $S \le 2T_1$, which means the adversary does not need to compensate the rewards on the target arm as in line 20 of Algorithm 1. The robustness of Robust PhE allows us to carry over the rewards from the first stage, while LinUCB does not.
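The choices $T_1 = T^{1/2}$ (LinUCB) and $T_1 = T^{2/5}$ (Robust PhE) come from balancing the polynomial orders of the dominant terms in Eqs (31) and (34). The tiny script below redoes only that exponent bookkeeping; it drops $d$, $\tilde\epsilon$, all log factors, and constants, so it illustrates the scaling rather than the actual bounds.

```python
import numpy as np

a = np.linspace(0.05, 0.95, 181)          # candidate exponents: T1 = T^a
half = np.full_like(a, 0.5)

# Dominant-term exponents with constants and log factors dropped.
linucb_exp = np.max(np.vstack([a, 0.5 + a / 2, 1 - a / 2]), axis=0)    # T1, sqrt(T1*T), T/sqrt(T1)
robust_exp = np.max(np.vstack([a, half, 1 - a / 2, 2 * a]), axis=0)    # T1, sqrt(T), T/sqrt(T1), T1^2

for name, e in [("LinUCB  (Eq. 31)", linucb_exp), ("RobustPhE (Eq. 34)", robust_exp)]:
    i = int(np.argmin(e))
    print(f"{name}: T1 = T^{a[i]:.2f}  ->  rounds/cost ~ T^{e[i]:.2f}")
```

It reports $T_1 = T^{0.50} \Rightarrow T^{0.75}$ for LinUCB and $T_1 = T^{0.40} \Rightarrow T^{0.80}$ for Robust PhE, matching the $O(T^{3/4})$ and $O(T^{4/5})$ orders derived above.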
C.3. Attack under unknown T

Our two-stage null space attack algorithm assumes that $T$ is known, for the convenience of analysis. Here we discuss a promising idea for handling unknown $T$: leveraging the doubling trick, which leads to a multi-stage attack that repeatedly adjusts the target parameter $\tilde\theta$ at each stage (a short schedule sketch is given at the end of this appendix). More concretely, to attack LinUCB without knowing $T$, the adversary starts with a pre-specified horizon $T_0$ and runs the two-stage attack. Once the actual horizon reaches $T_0$, the adversary expands the horizon to $T_0^2$ (i.e., the doubling trick), views the previous $T_0$ rounds as the new first stage, re-calculates the target parameter $\tilde\theta$ using the rewards from those $T_0$ rounds, and runs the new attack strategy until round $T_0^2$. Using this exponential horizon sequence $\{T_i = T_0^{2^i},\, i \in \mathbb{N}\}$, we obtain a multi-stage attack on LinUCB with an unknown time horizon. Similarly, to attack Robust PhE, we adjust the horizon sequence to $\{T_i = T_0^{(5/2)^i},\, i \in \mathbb{N}\}$, which leads to a similar multi-stage attack on Robust PhE. We believe this direction is feasible based on our current analysis; a thorough and complete proof is left for future work.

D. Additional Experiments

Figure 4. False negative rate of the attackability test.

In Figure 4, we study the false negative rate of the attackability test in Algorithm 1, i.e., how often the adversary mistakenly asserts that an attackable environment is not attackable. As explained in the proof of Theorem 3, a wrong assertion is caused by using the estimated $\tilde\theta_\parallel$ instead of the ground-truth bandit parameter. In this experiment, we consider a challenging attackable three-arm environment with $\mathcal{A} = \{x_1 = (0, 1),\, x_2 = (0.11, 1.1),\, x_3 = (-2, 0)\}$, target arm $\tilde{x} = x_1$, and $\theta^* = (0, 0.5)$. By solving CQP (2), we have attackability index $\epsilon^* = 0.005$ and certificate $\tilde\theta = (-0.5, 0)$.⁵ We test the two-stage null space attack against LinUCB with $T = 10{,}000$, where the adversary tests attackability after the first $T_1 = T^{1/2} = 100$ rounds. We vary $T_1$ from 1 to 100 to see how many rounds are sufficient for the attackability test, and report results averaged over 100 runs. We also vary the standard deviation $\sigma$ of the Gaussian noise from 0.1 to 0.3. In Figure 4, we can see that the false negative rate is almost zero once $T_1 > 50$, suggesting $T_1 = 100$ is sufficient. When $\sigma = 0.1$ the adversary only needs around 10 rounds to make a correct assertion. We also notice that the false negative rate becomes higher under a larger noise scale. As suggested by Lemma 5, the error in the $\tilde\theta_\parallel$ estimate is larger when the noise scale is larger or when the number of target-arm rewards $n(\tilde{x})$ is smaller, which in turn depends on $T_1$. A larger estimation error means CQP (2) with $\tilde\theta_\parallel$ is more likely to be infeasible and thus yields a false negative assertion. However, $T_1 = 100$ is still enough for the attackability test when $\epsilon^* = 0.005$.

⁵ We introduce arm $x_3$ to guarantee that the first dimension of $\tilde\theta$ cannot be smaller than $-0.5$. Comparing $\tilde{x}$ and $x_2$, we can see the optimal objective is $\epsilon^* = 0.005$.
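Returning to the multi-stage attack idea of Appendix C.3, the following small sketch generates the restart horizons for the two schedules mentioned there; the function name and the stopping horizon are ours, and the schedule is only the bookkeeping part of the heuristic, not a full implementation of the attack.

```python
def horizon_schedule(T0, power, T_max):
    """Restart horizons {T_i = T0^(power^i)} for the multi-stage attack idea in Appendix C.3."""
    horizons, T_i = [], float(T0)
    while T_i < T_max:
        horizons.append(int(round(T_i)))
        T_i = T_i ** power
    return horizons

# LinUCB squares the horizon (first stage ~ sqrt of the new horizon); Robust PhE uses the 5/2 power.
print(horizon_schedule(100, 2.0, 10**9))   # [100, 10000, 100000000]
print(horizon_schedule(100, 2.5, 10**9))   # [100, 100000]
```

Each printed horizon marks a point where the adversary would re-estimate $\tilde\theta$ from all rounds observed so far and restart the second stage of the attack.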