# action_poisoning_attacks_on_linear_contextual_bandits__82eccc16.pdf

Published in Transactions on Machine Learning Research (02/2023)

Action Poisoning Attacks on Linear Contextual Bandits

Guanlin Liu glnliu@ucdavis.edu Department of Electrical and Computer Engineering University of California, Davis

Lifeng Lai lflai@ucdavis.edu Department of Electrical and Computer Engineering University of California, Davis

Reviewed on Open Review: https: // openreview. net/ forum? id= yh GCKUs KJS

Contextual bandit algorithms have many applicants in a variety of scenarios. In order to develop trustworthy contextual bandit systems, understanding the impacts of various adversarial attacks on contextual bandit algorithms is essential. In this paper, we propose a new class of attacks: action poisoning attacks, where an adversary can change the action signal selected by the agent. We design action poisoning attack schemes against disjoint linear contextual bandit algorithms in both white-box and black-box settings. We further analyze the cost of the proposed attack strategies for a very popular and widely used bandit algorithm: Lin UCB. We show that, in both white-box and black-box settings, the proposed attack schemes can force the Lin UCB agent to pull a target arm very frequently by spending only logarithm cost. We also extend the proposed attack strategies to generalized linear models and show the effectiveness of the proposed strategies.

1 Introduction

Multiple armed bandits(MABs), a popular framework of sequential decision making model, has been widely investigated and has many applicants in a variety of scenarios (Chapelle et al., 2014; Lai et al., 2011; Kveton et al., 2015). The contextual bandits model is an extension of the multi-armed bandits model with contextual information. At each round, the reward is associated with both the arm (a.k.a, action) and the context. Contextual bandits algorithms have a broad range of applications, such as recommender systems (Li et al., 2010), wireless networks (Saxena et al., 2019), etc.

In the modern industry-scale applications of bandit algorithms, action decisions, reward signal collection, and policy iterations are normally implemented in a distributed network. Action decisions and reward signals may need to be transmitted over communication links. For example, in recommendation systems, the transitions of the decisions and the reward signal rely on a feedback loop between the recommendation system and the user. When data packets containing the reward signals and action decisions etc are transmitted through the network, the adversary can implement adversarial attacks by intercepting and modifying these data packets. As the result, poisoning attacks on contextual bandits could possibly happen. In many applications of contextual bandits, an adversary may have an incentive to force the contextual bandits system to learn a specific policy. For example, a restaurant may attack the bandit systems to force the systems into increasing the restaurant s exposure. Thus, understanding the risks of different kinds of adversarial attacks on contextual bandits is essential for the safe applications of the contextual bandit model and designing robust contextual bandit systems.

Depending on the target of the poisoning attacks, the poisoning attacks against contextual linear bandits can be categorized into four types: reward poisoning attack, action poisoning attack, context poisoning attack and the mix of them. In this paper, we aim to investigate the impact of action poisoning attacks on contextual bandit models. To our best knowledge, this paper is the first work to analyze the impact

Published in Transactions on Machine Learning Research (02/2023)

of action poisoning attacks on contextual bandit models. Detailed comparisons of various types of attacks against contextual bandits will be provided in Section 3. We note that the goal of this paper is not to promote any particular type of poisoning attack. Rather, our goal is to understand the potential risks of action poisoning attacks. We note that for the safe applications and design of robust contextual bandit algorithms, it is essential to address all possible weaknesses of the models and understanding the risks of different kinds of adversarial attacks. Our hope is that the understanding of the potential risks of action poisoning attacks could motivate follow up research to design algorithms that are robust to such attacks.

In this paper, we study the action poisoning attack against disjoint linear contextual bandit (Li et al., 2010; Kong et al., 2020; Garcelon et al., 2020; Huang et al., 2021) in both white-box and black-box settings. In the white-box setting, we assume that the attacker knows the coefficient vectors associated with arms. Thus, at each round, the attacker knows the mean rewards of all arms. While it is often unrealistic to exactly know the coefficient vectors, the understanding of the white-box attacks could provide valuable insights on how to design the more practical black-box attacks. In the black-box setting, we assume that the attacker has no prior information about the arms and does not know the agent s algorithm. The limited information that the attacker has are the context information, the action signal chosen by the agent, and the reward signal generated from the environment. In both white-box and black-box settings, the attacker aims to manipulate the agent into frequently pulling a target arm chosen by the attacker with a minimum cost. The cost is measublack by the number of rounds that the attacker changes the actions selected by the agent. The contributions of this paper are:

(1) We propose a new online action poisoning attack against contextual bandit in which the attacker aims to force the agent to frequently pull a target arm chosen by the attacker via strategically changing the agent s actions.

(2) We introduce a white-box attack strategy that can manipulate any sublinear-regret disjoint linear contextual bandit agent into pulling a target arm T o(T) rounds over a horizon of T rounds, while incurring a cost that is sublinear dependent on T.

(3) We design a black-box attack strategy whose performance nearly matches that of the white-box attack strategy. We apply the black-box attack strategy against a very popular and widely used bandit algorithm: Lin UCB (Li et al., 2010). We show that our proposed attack scheme can force the Lin UCB agent into pulling a target arm T O(log3 T) times with attack cost scaling as O(log3 T).

(4) We extend the proposed attack strategies to generalized linear contextual bandits. We further analyze the cost of the proposed attack strategies for a generalized linear contextual bandit algorithm: UCB-GLM (Li et al., 2017).

(5) We evaluate our attack strategies using both synthetic and real datasets. We observe empirically that the total cost of our black-box attack is sublinear for a variety of contextual bandit algorithms.

2 Related Work

In this section, we discuss related works on two parts: attacks that cause standard bandit algorithms to fail and robust bandit algorithms that can defend attacks.

Attacks Models. While there are many existing works addressing adversarial attacks on supervised learning models (Szegedy et al., 2014; Moosavi-Dezfooli et al., 2017; Cohen et al., 2019; Dohmatob, 2019; Wang et al., 2019; Dasgupta et al., 2019; Cicalese et al., 2020), the understanding of adversarial attacks on contextual bandit models is less complete. Of particular relevance to our work is a line of interesting recent work on adversarial attacks on MABs (Jun et al., 2018; Liu & Shroff, 2019; Liu & Lai, 2020b) and on linear contextual bandits (Ma et al., 2018; Garcelon et al., 2020). In recent works in MABs setting, the types of attacks include both reward poisoning attacks and action poisoning attacks. In the reward poisoning attacks, there is an adversary who can manipulate the reward signal received by the agent (Jun et al., 2018; Liu & Shroff, 2019). In the action poisoning attacks, the adversary can manipulate the action signal chosen by the agent before the environment receives it (Liu & Lai, 2020b). Existing works on adversarial attacks against linear contextual bandits focus on the reward (Ma et al., 2018; Garcelon et al., 2020) or context poisoning

Published in Transactions on Machine Learning Research (02/2023)

attacks (Garcelon et al., 2020). In the context poisoning attacks, the adversary can modify the context observed by the agent without changing the reward associated with the context. (Wang et al., 2022) defines the concept of attackability of linear stochastic bandits and introduces the sufficient and necessary conditions on attackability. There are also some recent interesting work on adversarial attacks against reinforcement learning (RL) algorithms under various setting (Behzadan & Munir, 2017; Huang & Zhu, 2019; Ma et al., 2019; Zhang et al., 2020; Sun et al., 2021; Rakhsha et al., 2020; 2021; Liu & Lai, 2021; Rangi et al., 2022). Although there are some recent works on the action poisoning attacks against MABs and RL, the action poisoning attack on contextual linear bandit is not a simple extension of the case of MAB or RL. Firstly, in the MAB settings the rewards only depend on the arm (action), while in the contextual bandit setting, the rewards depend on both the arm and the context (state). Secondly, (Liu & Lai, 2021) discusses the action poisoning attack in the tabular RL case where the number of states is finite. In the linear contextual bandit problem, the number of contexts is infinite. These factors make the design of attack strategies and performance analysis for the contextual linear bandit problems much more challenging.

Robust algorithms. Lots of efforts have been made to design robust bandit algorithms to defend adversarial attacks. In the MABs setting, (Lykouris et al., 2018) introduces a bandit algorithm, called Multilayer Active Arm Elimination Race algorithm, that is robust to reward poisoning attacks. (Gupta et al., 2019) presents an algorithm named BARBAR that is robust to reward poisoning attacks and the regret of the proposed algorithm is nearly optimal. (Guan et al., 2020) proposes algorithms that are robust to a reward poisoning attack model where an adversary attacks with a certain probability at each round. (Feng et al., 2020) proves that Thompson Sampling, UCB, and ϵ-greedy can be modified to be robust to self-interested reward poisoning attacks. (Liu & Lai, 2020a) introduce a bandit algorithm, called MOUCB, that is robust to action poisoning attacks. The algorithms for the context-free stochastic multi-armed bandit (MAB) settings are not suited for our settings. In the linear contextual bandit setting, (Bogunovic et al., 2021) proposes two variants of stochastic linear bandit algorithm that is robust to reward poisoning attacks, which separately work on known attack budget case and agnostic attack budget case. (Bogunovic et al., 2021) also proves that a simple greedy algorithm based on linear regression can be robust to linear contextual bandits with shablack coefficient under a stringent diversity assumption on the contexts. (Ding et al., 2022) provides a robust linear contextual bandit algorithm that works under both the reward poisoning attacks and context poisoning attacks. (He et al., 2022) provides nearly optimal algorithms for linear contextual bandits with adversarial corruptions.

3 Problem Setup

3.1 Review of Linear Contextual Bandit

Consider the standard disjoint linear contextual bandit model in which the environment consists of K arms. In each round t = 1, 2, 3, . . . , T, the agent observes a context xt D where D Rd, pulls an arm It and receives a reward rt,It. Each arm i is associated with an unknown but fixed coefficient vector θi Θ where Θ Rd. In each round t, the reward satisfies

rt,It = xt, θIt + ηt, (1)

where ηt is a conditionally independent zero-mean R-subgaussian noise and , denotes the inner product. Hence, the expected reward of arm i under context xt follows the linear setting

E[rt,i] = xt, θi (2)

for all t and all arm i. If we consider the σ-algebra Ft = σ(η1, . . . , ηt), ηt becomes Ft measurable which is a conditionally independent zero-mean random variable. The agent aims to minimize the cumulative pseudo-regret

xt, θI t xt, θIt ,

where I t = arg maxi xt, θi .

Published in Transactions on Machine Learning Research (02/2023)

In this paper, we assume that there exist L > 0 and S > 0, such that for all round t and arm i, xt 2 L and θi 2 S, where 2 denotes the ℓ2-norm. We assume that there exist D Rd such that for all x D and all arm i, x, θi > 0. In addition, for all t, xt D.

3.2 Action Poisoning Attack Model

In this paper, we introduce a novel adversary setting, in which the attacker can manipulate the action chosen by the agent. In particular, at each round t, after the agent chooses an arm It, the attacker can manipulate the agent s action by changing It to another I0 t {1, . . . , K}. If the attacker decides not to attack, I0 t = It. The environment generates a random reward rt,I0 t based on the post-attack arm I0 t and the context xt. Then the agent and the attacker receive reward rt,I0 t from the environment. Since the agent does not know the attacker s manipulations and the presence of the attacker, the agent will still view rt,I0 t as the reward corresponding to the arm It. The action poisoning attack model is summarized in Algorithm 1 .

Algorithm 1 Action poisoning attacks on contextual linear bandit agent

1: for t = 1, 2, . . . , T do 2: Agent chooses arm It after observing the context xt. 3: Attacker observes the agent s action It. If the attacker decides to attack, it manipulates the action to I0 t . If the attacker does not attack, I0 t = It. 4: The environment generates reward rt,I0 t according to arm I0 t and context xt.

5: The agent and attacker receive reward rt,I0 t . 6: end for

The goal of the attacker is to design attack strategy to manipulate the agent into pulling a target arm very frequently but by making attacks as rarely as possible. Without loss of generality, we assume arm K is the attack target" arm or target arm. Define the set of rounds when the attacker decides to attack as C := {t : t T, I0 t = It}. The cumulative attack cost is the total number of rounds where the attacker decides to attack, i.e., |C|. The attacker can monitor the contexts, the actions of the agent and the reward signals from the environment.

Mean Reward(<x, >)

arm 1 arm 2 arm 3 arm 4

Figure 1: An example of one dimension linear contextual bandit model.

As the action poisoning attack only changes the actions, it can impact but does not have direct control of the agent s observations. Furthermore, when the action space is discrete and finite, the ability of the action poisoning attacker is severely limited. It is reasonable to limit the choice of the target policy. Here we introduce an important assumption that the target arm is not the worst arm:

Assumption 1. For all x D, the mean reward of the target arm satisfies x, θK > mini [K] x, θi .

If the target arm is the worst arm in most contexts, the attacker should change the target arm to a better arm or the optimal arm so that the agent learns that the target set is optimal for almost every context. In this case, the cost of attack may be up to O(T). Assumption 1 does not imply that the target arm is optimal

Published in Transactions on Machine Learning Research (02/2023)

at some contexts. The target arm could be sub-optimal for all contexts. Fig. 1 shows an example of one dimension linear contextual bandit model, where the x-axis represents the contexts and the y-axis represents the mean rewards of arms under different contexts. As shown in Fig. 1, arms 3 and 4 satisfy Assumption 1. In addition, arm 3 is not optimal at any context.

Under Assumption 1, there exists α (0, 1

2) such that

max x D mini x, θi

x, θK (1 2α).

Equivalently, Assumption 1 implies that there exists α (0, 1

2), such that for all t, we have

(1 2α) xt, θK min i [K] xt, θi . (3)

Equation 3 describes the relation of the target arm and the worst arm at context xt. The action poisoning attack can only indirectly impacts the agent s rewards. The mean rewards at xt after the action attacks must be larger or equal to mini [K] xt, θi . Under Equation 3, (1 2α) xt, θK mini [K] xt, θi at any context xt. Then, with xt, θK > 0, (1 α) xt, θK > mini [K] xt, θi at any context xt. Thus, the attacker can indirectly change the agent s mean reward of the non-target arm to (1 α) xt, θK . Then, the agent s estimate of each non-target arm s coefficient vector is close to (1 α)θK, which is worse than the target arm at any context. Equation 3 brings the possibility of successful action attacks.

Assumption 1 is necessary in our analysis to prove a formal bound of the attack cost. In Appendix A, we show that this assumption is necessary in the sense that, if this assumption does not hold, there exist a sequence of contexts {xt}t [T ] such that no efficient action poisoning attack scheme can successfully attack Lin UCB. Certainly, these sequences of contexts are worst case scenarios for attacks. In practice, if these worst case scenarios do not occur, the proposed algorithms in Sections 4.2 and 4.3 may still work even if the target arm is the worst in a small portion of the contexts (as illustrated in the numerical example section).

3.3 Comparisons with Different Poisoning Attacks

We now compare the three types of poisoning attacks against contextual linear bandit: reward poisoning attack, action poisoning attack and context poisoning attack. In the reward poisoning attack (Ma et al., 2018; Garcelon et al., 2020), after the agent observes context xt and chooses arm It, the environment will generate reward rt,It based on context xt and arm It. Then, the attacker can change the reward rt,It to ert and feed ert to the agent.

Compablack with the reward poisoning attacks, the action poisoning attack consideblack in this paper is more difficult to carry out. In particular, as the action poisoning attack only changes the action, it can impact but does not have direct control of the reward signal. By changing the action It to I0 t , the reward received by the agent is changed from rt,It to rt,I0 t which is a random variable drawn from a distribution based on the action I0 t and context xt. This is in contrast to reward poisoning attacks where an attacker has direct control and can change the reward signal to any value ert of his choice.

In the context poisoning attack (Garcelon et al., 2020), the attacker changes the context shown to the agent. The reward is generated based on the true context xt and the agent s action It. Nevertheless, the agent s action may be indirectly impacted by the manipulation of the context, and so as the reward. Since the attacker attacks before the agent pulls an arm, the context poisoning attack is the most difficult to carry out.

For numerical comparison, as our paper is the first paper that discusses the action poisoning attack against contextual linear bandits, there is no existing action poisoning attack method to compare. One could run simulations to compare with reward or context poisoning attacks, but the attack costs are defined differently, due to the different nature of attacks. For example, for reward poisoning attacks, (Garcelon et al., 2020) proposed a reward poisoning attack method whose attack cost scales as O(log(T)2). However, the definition of the attack cost in (Garcelon et al., 2020) is different from that of our paper. The cost of the reward attack is defined as the cumulative differences between the post-attack rewards and the pre-attack rewards. The

Published in Transactions on Machine Learning Research (02/2023)

cost of the action attack is defined as the number of rounds that the attacker changes the actions selected by the agent. Although the definition of the attack cost of these two different kinds of attacks are different, the attack cost of our proposed white-box attack strategy scales on O(log(T)2), which is same with the results in (Garcelon et al., 2020).

As mentioned in the introduction, the goal of this paper is not to promote any particular types of poisoning attacks. Instead, our goal is to understand the potential risks of action poisoning attacks, as the safe applications and design of robust contextual bandit algorithm relies on the addressing all possible weakness of the models.

4 Attack Schemes and Cost Analysis

In this section, we introduce the proposed action poisoning attack schemes in the white-box setting and black-box setting respectively. In order to demonstrate the significant security threat of action poisoning attacks to linear contextual bandits, we investigate our attack strategies against a widely used algorithm: Lin UCB algorithm. Furthermore, we analyze the attack cost of our action poisoning attack schemes.

4.1 Overview of Lin UCB

For reader s convenience, we first provide a brief overview of the Lin UCB algorithm (Li et al., 2010). The Lin UCB algorithm is summarized in Algorithm 2. The main steps of Lin UCB are to obtain estimates of the unknown parameters θi using past observations and then make decisions based on these estimates. Define τi(t) := {s : s t, Is = i} as the set of rounds up to t where the agent pulls arm i. Let Ni(t) = |τi(t)|. Then, at round t, the ℓ2-regularized least-squares estimate of θi with regularization parameter λ > 0 is obtained by (Li et al., 2010) ˆθt,i = V 1 t,i X

k τi(t 1) rt,ixk, (4)

where Vt,i = X

k τi(t 1) xkx k + λI

with I being identity matrix.

After ˆθi s are obtained, at each round, an upper confidence bound of the mean reward has to be calculated for each arm (step 4 of Algorithm 2). Then, the Lin UCB algorithm picks the arm with the largest upper confidence bound (step 5 of Algorithm 2). By following the setup in "optimism in the face of uncertainty linear algorithm" (OFUL) (Abbasi-Yadkori et al., 2011), we set

2 log K/δ + d log (1 + L2N/(λd)),

and βt,i = ω(Ni(t)) =

2 log K/δ + d log (1 + L2Ni(t)/(λd)).

It is easy to verify that ω(N) is a monotonically increasing function over N (0, + ).

4.2 White-box Attack

We first consider the white-box attack scenario, in which the attacker has knowledge of the environment. In particular, in the white-box attack scenario, the attacker knows the coefficient vectors θi s for all i. The understanding of this scenario could provide useful insights for the more practical black-box attack scenario to be discussed in Section 4.3.

Our proposed attack strategy works as follows. When the agent chooses arm K, the attacker does not attack. When the agent chooses arm It = K, the attacker changes it to arm

( K with probability ϵt

I t with probability 1 ϵt (5)

Published in Transactions on Machine Learning Research (02/2023)

Algorithm 2 Contextual Lin UCB (Li et al., 2010) Require: λ, K, T, L, S.

1: Initialize for every arm i, Vi λI, bi 0, ˆθi V 1 i bi. 2: for t = 1, 2, . . . , T do 3: Observe the context xt.

4: Compute the upper confidence bound: pt,i x t ˆθi + βt,i q

x t V 1 i xt for all arm i [K].

5: Pull arm It = arg maxi pt,i. 6: The environment generates reward rt according to arm It. The agent receive rt. 7: VIt VIt + xtx t , b It b It + rtxt, ˆθIt V 1 It b It. 8: end for

where I t = arg mini xt, θi and

ϵt = (1 α) xt, θK mini xt, θi

xt, θK mini xt, θi . (6)

We now explain the main idea behind the attack strategy specified in Equation 5 and Equation 6. Intuitively speaking, using Equation 5 and Equation 6, the attacker can manipulate the agent into learning some particular coefficient vectors. In particular, for arm K (the target arm), the agent obtains the true coefficient vector θK. For any arm i = K, because of the attacks, the agent will obtain an estimate of arm i s coefficient vector that is close to (1 α)θK. This can be seen as follows. Under Assumption 1 and the fact that the mean rewards are positive, we have that 1/2 < ϵt < 1 α. Under the white-box attack, when the agent pulls a non-target arm It = K, the post-attack action I0 t is a random variable and the mean reward received by the agent should satisfy

E[rt,I0 t |Ft 1, It] = E[ xt, θI0 t |Ft 1, It] = ϵt xt, θK + (1 ϵt) min i xt, θi = (1 α) xt, θK . (7)

Because of the positive mean reward assumption and the fact that 0 < 1 α < 1, the target arm K appears to be the optimal arm for any context. Hence the proposed white-box attack strategy can easily mislead an agent who employs a contextual bandit algorithm with sublinear pseudo-regret, as shown in the following proposition.

Proposition 1. For δ > 0. Suppose the agent employs a linear contextual bandit algorithm whose cumulative pseudo-regret is upper bounded by o(T) with probability at least 1 δ. When using the proposed white-box attack scheme, with probability at least 1 δ, the attacker can force the agent to pull the target arm for T o(T) times, while the total attack cost |C| is upper bounded by o(T).

Proposition 1 applies to any contextual bandit algorithm whose cumulative pseudo-regret is upper bounded by o(T). To further illustrate the proposed attack scheme, we now provide a finer analysis the impact of this attack on Lin UCB described in Algorithm 2. We first build a high probability bound for the estimate of parameter at the agent, when the agent computes the estimate of θi by equation 4 and the attacker employs the white-box attack. Recall that ω(N) =

2 log K/δ + d log (1 + L2N/(λd)).

Lemma 1. Under the proposed white-box attack, the estimate of θi for each arm i obtained by Lin UCB agent as described in Algorithm 2 satisfies

|x t ˆθt,i x t (1 α)θK| ω(Ni(t)) + LS p

0.5 log (2KT/δ) xt V 1 t,i , (8)

with probability 1 2(K 1)δ/K, for all arm i = K and all t 0. Here, x V =

x V x is the weighted norm of vector x for a positive definite matrix V .

Lemma 1 shows that, under our white-box attack, the agent s estimate of the parameter of non-target arm, i.e. ˆθi, will converge to (1 α)θK. Thus, the agent is misled to believe that arm K is the optimal arm for every context in most rounds. The following theorem provides an upper bound of the cumulative cost of the attack.

Published in Transactions on Machine Learning Research (02/2023)

Theorem 1. Define γ = minx D x, θK . Under the same assumptions as in Lemma 1, for any δ > 0 with probability at least 1 2δ, for all T 0, the attacker can manipulate the Lin UCB agent into pulling the target arm in at least T |C| rounds, using an attack cost

|C| 2d(K 1)

(αγ)2 log 1 + TL2/(dλ) 2ω(T) + LS p

0.5 log (2KT/δ) 2 . (9)

Theorem 1 shows that our white-box attack strategy can force Lin UCB agent into pulling the target arm T O(log2 T) times with attack cost scaled only as O(log2 T).

Proof. Detailed proof of Theorem 1 can be found in Appendix B.3. Here we provide sketch of the main proof idea.

For round t and context xt, if Lin UCB pulls arm i = K, we have

x t ˆθt,K + βt,K q

x t V 1 t,Kxt x t ˆθt,i + βt,i q

x t V 1 t,i xt.

Since the attacker does not attack the target arm, the confidence bound of arm K does not change and

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt holds with probability 1 δ

Then, by Lemma 1,

x t θK x t (1 α)θK + βt,i xt V 1 t,i +

1 2 log 2KT

+ ω (Ni(t))

xt V 1 t,i . (10)

By multiplying both sides 1{It=i} and summing over rounds, we have

k=1 1{Ik=i}αx k θK

k=1 1{Ik=i}

1 2 log 2KT

δ + d log 1 + L2Ni(k)

xk V 1 k,i .

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and obtain

k=1 1{Ik=i} xk V 1 k,i

Ni(t)2d log 1 + t L2

Set γ = minx D x, θK . Since Ni(t) = PT k=1 1{Ik=i}, we have

Ni(t) 2d (αγ)2 log 1 + t L2

1 2 log 2KT

δ + d log 1 + t L2

4.3 Black-box Attack

We now focus on the more practical black-box setting, in which the attacker does not know any of arm s coefficient vector. The attacker knows the value of α (or a lower bound) in which the Equation 3 holds for all t. Although the attacker does not know the coefficient vectors, the attacker can compute an estimate of the unknown parameters by using past observations. On the other hand, there are multiple challenges brought by the estimation errors that need to properly addressed.

Published in Transactions on Machine Learning Research (02/2023)

The proposed black-box attack strategy works as follows. When the agent chooses arm K, the attacker does not attack. When the agent chooses arm It = K, the attacker changes it to arm

( K with probability ϵt

I t with probability 1 ϵt (14)

where I t = arg min i =K

xt, ˆθ0 t,i β0 t,i xt (V 0 t,i) 1 , (15)

and β0 t,i = ϕi ω(N i (t)) + LS p

0.5 log (2KT/δ) ,

ϕi = 1/α when i = K and ϕK = 2, and

2, (1 α) xt, ˆθ0 t,K xt, ˆθ0 t,I t

xt, ˆθ0 t,K xt, ˆθ0 t,I t , 1 α

with clip(a, x, b) = min(b, max(x, a)) where a b.

For notational convenience, we set I t = K and ϵt = 1 when It = K. We define that, if i = K,

τ i (t) := {s : s t, I s = i}

and N i (t) = |τ i (t)|. We also define τ K(t) := {s : s t} and N K(t) = |τ K(t)|.

ˆθ0 t,i = V 0 t,i 1 X

k τ i (t 1) wk,irk,I0 kxk, (17)

where V 0 t,i = X

k τ i (t 1) xkx k + λI

1/ϵt if i = I0 t = K

1/(1 ϵt) if i = I0 t = I t 0 if i = I0 t

Here, ˆθ0 t,i is the estimation of θi by the attacker, while ˆθt,i in Equation 4 is the estimation of θi at the agent side. We will show in Lemma 2 and Lemma 4 that ˆθ0 t,i will be close to the true value of θi while ˆθt,i will be close to a sub-optimal value chosen by the attacker. This disparity gives the attacker the advantage for carrying out the attack.

We now highlight the main idea of why our black-box attack strategy works. As discussed in Section 4.2, if the attacker knows the coefficient vectors of all arms, the proposed white-box attack scheme can mislead the agent to believe that the coefficient vector of every non-target arm is (1 α)θK, hence the agent will think the target arm is optimal. In the black-box setting, the attacker does not know the coefficient vector for any arm. The attacker should estimate an coefficient vector of each arm. Then, the attacker will use the estimated coefficient vector to replace the true coefficient vector in the white-box attack scheme. As the attacker does not know the true values of θi s, we need to design the estimator ˆθ0 t,i, the attack choice I t and the probability ϵt carefully. In the following, we explain the main ideas behind our design choices. 1): Firstly, we explain why we design estimator ˆθ0 t,i using the form Equation 17, in which the attacker employs the importance sampling to obtain an estimate of θi. There are two reasons for this. The first reason is that, for a successful attack, the number of observations in arm i = K will be limited. Hence if the importance sampling is not used, the estimation variance of the mean reward x, θi at the attacker side for some contexts x may be large. The second reason is that the attacker s action is stochastic when the agent

Published in Transactions on Machine Learning Research (02/2023)

pulls a non-target arm. Thus, the attacker uses the observations at round t when the attacker pulls arm i with certain probability, i.e. when t τ i , to estimate θi. Since the agent s action is deterministic, the agent uses the observations at round t when the agent pulls arm i, i.e. when t τi, to estimate θi. 2): Secondly, we explain ideas behind the choice of I t in Equation 15. Under our black-box attack, when the agent pulls a non-target arm It = K, the mean reward received by the agent satisfies

E[rt,I0 t |Ft 1, It] = E[ xt, θI0 t |Ft 1, It] = ϵt xt, θK + (1 ϵt) xt, θI t . (19)

In white-box attack scheme, I t is the worst arm at context xt. In the black-box setting, the attacker does not know a priori which arm is the worst. In the proposed black-box attack scheme, as indicated in Equation 15, we use the lower confidence bound (LCB) method to explore the worst arm and arm I t has the smallest lower confidence bound. 3): Finally, we provide reasons why we choose ϵt using Equation 16. In our white-box attack scheme, we have that 1/2 < ϵt < 1 α. Thus, in our black-box attack scheme, we limit the choice of ϵt to [1/2, 1 α]. Furthermore, in Equation 6 used for the white-box attack, ϵt is computed by the true mean reward. Now, in the black-box attack, as the attacker does not know the true coefficient vector, the attacker uses an estimation of θ to compute the second term in the clip function in Equation 16.

In summary, our design of ˆθ0 t,i, I t and ϵt can ensure that the attacker s estimation ˆθ0 t,i is close to θi, while the agent s estimation ˆθt,i will be close to (1 α)θK. In the following, we make these statements precise, and formally analyze the performance of the proposed black-box attack scheme. First, we analyze the estimation ˆθ0 t,i at the attacker side. We establish a confidence ellipsoid of xt, ˆθ0 t,i at the attacker.

Lemma 2. Assume the attacker performs the proposed black-box action poisoning attack. With probability 1 2δ, we have |x t ˆθ0 t,i x t θi| β0 t,i xt (V 0 t,i) 1

holds for all arm i and all t 0.

Lemma 2 shows that ˆθ0 i lies in an ellipsoid with center at θi with high probability, which implies that the attacker has good estimate. We then analyze the estimation ˆθt,i at the agent side. The following lemma provides an upper bound on the difference between E[rt,I0 t |Ft 1, It] and (1 α) xt, θK .

Lemma 3. Under the black-box attack, the estimate obtained by an Lin UCB agent satisfies E[rt,I0 t |Ft 1, It] (1 α) xt, θK (1 α)β0 t,K xt (V 0 t,K) 1 + (1 + α)β0 t,I t xt

simultaneously for all t 0 when It = K, with probability 1 2δ.

The bound in Lemma 3 consists of the confidence ellipsoid of the estimate of arm I t and that of arm K. As mentioned above, for a successful attack, the number of pulls on arm I t will be limited. Thus, in our proposed algorithm, the attacker use the importance sampling to obtain the estimate of θi, which will increases the number of observations that can be used to estimate the coefficient vector of arm I t . Using Lemma 3, we have the following lemma regarding the estimation ˆθt,i at the agent side.

Lemma 4. Consider the same assumption as in Lemma 2. With a probability at least 1 (3K 1)δ

K , the estimate ˆθt,i obtained by the Lin UCB agent will satisfy

|x t ˆθt,i x t (1 α)θK| 1 + 4d/α p

K log (1 + t L2/(dλ)) ω(t) + LS p

0.5 log (2KT/δ) xt V 1 t,i ,

(21) simultaneously for all arm i = K and all t 0.

Lemma 4 shows that, under the proposed black-box attack scheme, the agent s estimate of the parameter of the non-target arm, i.e. ˆθi, will converge to (1 α)θK. Hence the agent will believe that the target arm K is the optimal arm for any context in most rounds. Using these supporting lemmas, we can then analyze the performance of the proposed black-box attack strategy.

Published in Transactions on Machine Learning Research (02/2023)

Theorem 2. Define γ = minx D x, θK . Under the same assumptions as in Lemma 4, with probability at least 1 3δ, for all T 0, the attacker can manipulate a Lin UCB agent into pulling the target arm in at least T |C| rounds, using an attack cost

|C| 2d(K 1)

K log 1 + TL2

log 1 + TL2/(dλ) ω(T) + LS p

0.5 log (2KT/δ) 2 .

Proof. Detailed proof of Theorem 2 can be found in Appendix C.4. Here we provide a sketch of the main proof idea.

For round t and context xt, if Lin UCB pulls arm i = K, we have

x t ˆθt,K + βt,K q

x t V 1 t,Kxt x t ˆθt,i + βt,i q

x t V 1 t,i xt.

Since the attacker does not attack the target arm, the confidence bound of arm K does not change and

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt holds with probability 1 δ

Thus, by Lemma 4,

x t θK x t (1 α)θK + ω(Ni(t)) xt V 1 t,i

K log 1 + t L2

1 2 log 2KT

xt V 1 t,i . (23)

By multiplying both sides by 1{It=i} and summing over rounds, we have

k=1 1{Ik=i}αx k θK

k=1 1{Ik=i}

K log 1 + k L2

1 2 log 2KT

xk V 1 k,i .

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and get

k=1 1{Ik=i} xk 2 V 1 k,i

Ni(t)2d log 1 + t L2

Thus, we have

k=1 1{Ik=i}

2d (αγ)2 log 1 + t L2

K log 1 + t L2

1 2 log 2KT

where γ = minx D x, θK .

Theorem 2 shows that our black-box attack strategy can manipulate a Lin UCB agent into pulling a target arm T O(log3 T) times with attack cost scaling as O(log3 T). Compablack with the result for the white-box attack, the black-box attack only brings an additional log T factor.

Published in Transactions on Machine Learning Research (02/2023)

4.4 Generalized Linear Models

We note that the proposed attack strategies can also be extended to generalized linear models. Detailed analysis of the cost of the proposed attack strategies for the generalized linear contextual bandit model can be found in Appendix D.

5 Numerical Experiments

In this section, we provide numerical examples to illustrate the impact of proposed action poisoning attack schemes.

5.1 Attack Linear Contextual Bandit Algorithms

We first empirically evaluate the performance of the proposed action poisoning attack schemes on three contextual bandit algorithms: Lin UCB (Abbasi-Yadkori et al., 2011), Lin TS (Agrawal & Goyal, 2013), and ϵ-Greedy. We run the experiments on three datasets:

Synthetic data: The dimension of contexts and the coefficient vectors is d = 6. We set the first entry of every context and coefficient vector to 1. The other entries of every context and coefficient vector are uniformly drawn from ( 1 d 1, 1 d 1). Thus, x 2

2 and mean rewards x, θ > 0. The reward noise ηt is drawn from a Gaussian distribution N(0, 0.01).

Jester dataset (Goldberg et al., 2001): Jester contains 4.1 million ratings of jokes in which the rating values scale from 10.00 to +10.00. We normalize the rating to [0, 1]. The dataset includes 100 jokes and the ratings were collected from 73,421 users between April 1999 - May 2003. We consider a subset of 10 jokes and 38432 users. Every jokes are rated by each user. We perform a low-rank matrix factorization (d = 6) on the ratings data and obtain the features for both users and jokes. At each round, the environment randomly select a user as the context and the reward noise is drawn from a Gaussian distribution N(0, 0.01).

Movie Lens 25M dataset: (Harper & Konstan, 2015) Movie Lens 25M dataset contains 25 million 5-star ratings of 62,000 movies by 162,000 users. The preprocessing of this data is almost the same as the Jester dataset, except that we consider a subset of 10 movies and 7344 users. At each round, the environment randomly select a user as the context and the reward noise is drawn from N(0, 0.01).

We set δ = 0.1 and λ = 2. For all the experiments, we set the total number of rounds T = 106 and the number of arms K = 10. We independently run ten repeated experiments. Results reported are averaged over the ten experiments. We set α to 0.2 for the two proposed attack strategies, hence the target arm may be the worst arm in some rounds. Each of the individual experimental runs costs up to 10 minutes on one physical CPU core. The type of CPU is Intel Core i7-8700.

0 2 4 6 8 10 Time(t) 105

0 2 4 6 8 10 Time(t) 105

0 2 4 6 8 10 Time(t) 105

White-box attack on Lin TS Black-box attack on Lin TS White-box attack on Lin UCB Black-box attack on Lin UCB White-box attack on -Greeedy Black-box attack on -Greeedy

Figure 2: The cumulative cost of the attacks for the synthetic (Left), Jester (Center) and Movie Lens (Right) datasets.

The results are shown in Table 1 and Figure 2. These experiments show that the action poisoning attacks can force the three agents to pull the target arm very frequently, while the agents rarely pull the target arm under no attack. Under the attacks, the true regret of the agent becomes linear as the target arm is not optimal for most context. Table 1 show the number of rounds the agent pulls the target arm among 106

total rounds. In the synthetic dataset, under the proposed white-box attacks, the target arm is pulled more

Published in Transactions on Machine Learning Research (02/2023)

Synthetic Jester Movie Lens ϵ-Greeedy without attacks 2124.6 5908.7 3273.5 White-box attack on ϵ-Greeedy 982122.5 971650.9 980065.6 Black-box attack on ϵ-Greeedy 973378.5 939090.2 935293.8 Lin UCB without attacks 8680.9 16927.2 13303.4 White-box attack on Lin UCB 981018.7 911676.9 969118.6 Black-box attack on Lin UCB 916140.8 875284.7 887373.1 Lin TS without attacks 5046.9 18038.0 9759.0 White-box attack on Lin TS 981112.8 908488.3 956821.1 Black-box attack on Lin TS 918403.8 862556.8 825034.8

Table 1: Average number of rounds when the agent pulls the target arm over T = 106 rounds.

than 98.1% of the times by the three agent (see Table 1). The target arm is pulled more than 91.6% of the times in the worst case (the black-box attacks on Lin UCB). Fig 2 shows the cumulative cost of the attacks on three agents for the three datasets. The results show that the attack cost |C| of every attack scheme on every agent for every dataset scales sublinearly, which exposes a significant security threat of the action poisoning attacks on linear contextual bandits.

5.2 Attack Robust Algorithms

We now discuss existing robust linear bandit algorithms (Ding et al., 2022; Bogunovic et al., 2021) and evaluate the performance of the proposed attack strategy on these algorithms.

In particular, (Bogunovic et al., 2021) focuses on a special case in which the context and coefficient vectors are assumed to be fixed over rounds, and developed Robust Phased Elimination (RPE) algorithm. In contrast, our paper focuses on the general setting where the contexts are different for each round and coefficients are different for each arm. Hence the RPE algorithm will not work for the contextual bandit setting consideblack in our paper. (Bogunovic et al., 2021) also proves that a simple greedy algorithm based on linear regression can be robust to linear contextual bandits with shablack coefficient under a stringent diversity assumption on the contexts. We empirically evaluate the action attacks on the greedy algorithm in (Bogunovic et al., 2021). The greedy algorithm is designed in the shablack coefficient setting and may not work in the disjoint setting.

(Ding et al., 2022) provides a linear contextual bandit algorithm that is robust to rewards attacks and context attacks. The scheme in (Ding et al., 2022) could be used to defend against the action attacks. However, our numerical results show that the proposed attack strategy can successfully defeat the scheme in (Ding et al., 2022). In the following, we empirically evaluate the performance of the proposed action poisoning attack schemes on Robust Bandit algorithm in (Ding et al., 2022).

(He et al., 2022) provides nearly optimal algorithms for linear contextual bandits with adversarial corruptions. (He et al., 2022) consider two cases: 1) the agent knows the corruption budget; and 2) the agent does not know the corruption budget. We also empirically evaluate the action attacks on CW-OFUL algorithm in (He et al., 2022).

We use the same synthetic data setting in Section 5.1. We set δ = 0.1 and λ = 2. For all the experiments, we set the total number of rounds T = 106 and the number of arms K = 10. We independently run ten repeated experiments. Results reported are averaged over the ten experiments. We set α to 0.2 for the two proposed attack strategies. For the action attacks on CW-OFUL algorithm, we run the simulation on the two cases that the agent know the corruption budget and the agent does not know the corruption budget. For the case that the agent know the corruption budget, we set the attack budget to 3000 and the attacker will stop the attack if the budget is exhaust. For the case that the agent does not know the corruption budget, we does not limit the attack budget.

The simulation results in Figure 3 show that our proposed white-box action poisoning attack can force the greedy contextual agent in (Bogunovic et al., 2021) to pull the target arm. Although the black-box

Published in Transactions on Machine Learning Research (02/2023)

0 2 4 6 8 10 Time(t) 105

Total Attack Costs

White-box attack on Contextual Greedy

0 2 4 6 8 10 Time(t) 105

Total Attack Costs

105 Black-box attack on Contextual Greedy

0 2 4 6 8 10 Time(t) 105

Total Regret

105 Black-box attack on Contextual Greedy

Figure 3: The cumulative cost of the attacks against greedy contextual algorithm in (Bogunovic et al., 2021).

action poisoning attack strategy fails, it causes linear regret on the agent. The greedy contextual algorithm in (Bogunovic et al., 2021) highly relies a stringent diversity assumption on the contexts. In the disjoint setting, the agent will pull each arm in some correlated contexts. The stringent diversity assumption may not be satisfied in disjoint setting. The greedy contextual algorithm in (Bogunovic et al., 2021) will induced to linear regret as shown in Figure 3.

0 2 4 6 8 10 Time(t) 105

White-box attack on Robust Bandit Black-box attack on Robust Bandit

Figure 4: The cumulative cost of the attacks against Robust Bandit algorithm for the synthetic datasets.

The simulation results in Figure 4 show that our proposed action poisoning attack can force the Robust Bandit agent to pull the target arm with T o(T) times. The attack cost also scales sublinearly on T. The reason why the Robust Bandit agent cannot defend against the action attacks is that its regret scales on O(C

T). If the attack cost C = O(

T), the regret will be linear and the agent will be fooled.

0 2 4 6 8 10 Time(t) 105

Total Attack Costs

White-box attack on CW-OFUL with unknown C Black-box attack on CW-OFUL with unknown C

0 2 4 6 8 10 Time(t) 105

Total Regret

White-box attack on CW-OFUL with unknown C Black-box attack on CW-OFUL with unknown C

Figure 5: The cumulative cost of the attacks against CW-OFUL algorithm for the synthetic datasets.

The simulation results in Figure 5 show that our proposed action poisoning attack can force the CW-OFUL agent to pull the target arm in the case that the agent does not know the corruption budget and the attack budget is unlimited. The attack cost also scales sublinearly on T. The reason why the CW-OFUL agent cannot defend against the action attacks is that its regret scales on O(T) once the corruption level is larger

Published in Transactions on Machine Learning Research (02/2023)

T. CW-OFUL agent can defend the action poisoning attack and achieve sublinear regret when the agent knows the corruption budget and the attack budget is limited.

5.3 The Attack Performance When the Assumption 1 Violates

0 0.1 0.2 0.3 0.4 0.5 The proportion of contexts when the target arm is the worst

Total Attack Costs

White-box attack on -Greeedy Black-box attack on -Greeedy White-box attack on Lin UCB Black-box attack on Lin UCB White-box attack on Lin TS Black-box attack on Lin TS

Figure 6: The total costs of the attacks for the synthetic datasets when the Assumption 1 violates.

We now evaluate the sensitivity of the proposed algorithms on Assumption 1. We empirically evaluate the performance of the proposed action poisoning attack schemes on three contextual bandit algorithms: Lin UCB, Lin TS, and ϵ-greedy, when the target arm is the worst for some contexts.

We use the same synthetic data setting in Section 5.1. We set δ = 0.1 and λ = 2. For all the experiments, we set the total number of rounds T = 106 and the number of arms K = 10. We independently run ten repeated experiments. Results reported are averaged over the ten experiments. We set α to 0.2 for the two proposed attack strategies. We manipulate the simulation environments so that the proportions of contexts, at which the target arm is the worst, are separately controlled around: 0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5.

The results in Figure 6 show the total costs under different proportions of contexts when the target arm is the worst. The x-axis represents the proportions of context when the target arm is the worst. The y-axis represents the total cost over 106 rounds. The results shows that the total attack costs scale linearly on the proportions of contexts when the target arm is the worst. When the proportions of contexts when the target arm is the worst is small, our proposed attack strategies can still efficiently attacks.

5.4 The Attack Performance When the Rewards Are Normalized in [-1,1]

Here, we would like to note that we make the positive reward assumption for the formal analysis, our results actually can be generalized to the case with negative rewards. If the positive reward assumption does not hold, for the case with negative rewards, the attacker can preprocess the reward by adding a positive constant to all rewards such that all rewards become positive. As long as Assumption 1 holds, the proposed attack strategy can mislead the agent to believe that the target arm is optimal regardless of the positive reward assumption.

To illustrate this, we evaluate the performance of the proposed algorithms when negative rewards exist. We use the same synthetic data setting in Section 5.1. We set δ = 0.1 and λ = 2. For all the experiments, we set the total number of rounds T = 106 and the number of arms K = 10. We independently run ten repeated experiments. Results reported are averaged over the ten experiments. We set α to 0.2 for the two proposed attack strategies. We normalize the rewards in [-1,1]. The attacker preprocesses the rewards by adding a constant c = 3 to the rewards in his algorithms. In the white-box attack setting, the attacker adds c = 3 to every x, θ and use the preprocessed x, θ to compute ϵt. In the black-box attack setting, the attacker adds c = 3 to every reward rt and use the preprocessed rt to compute ˆθ and ϵt.

Published in Transactions on Machine Learning Research (02/2023)

0 2 4 6 8 10 Time(t) 105

White-box attack on Lin TS Black-box attack on Lin TS White-box attack on Lin UCB Black-box attack on Lin UCB White-box attack on -Greeedy Black-box attack on -Greeedy

Figure 7: The cumulative cost of the attacks when the rewards are normalized in [-1,1].

The results in Figure 7 show that our proposed attack strategies still work for the case with negative rewards.

6 Future Works

In this section, we discuss several future directions.

Linear Bandits with Shablack Coefficients. In this paper, we discuss the disjoint linear contextual model, similar to those consideblack in (Li et al., 2010; Kong et al., 2020; Garcelon et al., 2020; Huang et al., 2021), where each arm is associated with its own coefficient vector. At each time, the agent observes a contextual vector xt that is shablack with all arms.

Our model is different from the linear contextual bandit model with shablack coefficient. In that model, the coefficient vector θ is shablack with all arms. At each time, each arm i is given a specific contextual vector xi,t. These contextual vectors form the decision set. The decision set changes over time and can even be infinite. In this model, adversarial attacks may not always be successful. For example, (Wang et al., 2022) shows that some attack goals can never be achieved in linear stochastic bandits due to the shablack coefficient. This may also occur in the linear contextual bandit model with shablack coefficient. It is of interest to conduct rigorous analysis of action attacks in the shablack coefficient model in the future work.

Attack Strategy without Assumption 1. In Section 5.3, we empirically evaluate the sensitive of the proposed algorithms on Assumption 1. The results shows that the total attack costs seem to scale linearly on the proportions of contexts when the target arm is the worst. It is of interest to further theoretically analyze the phenomenon shown in the simulation results in Section 5.3. However, using current tools analysis developed in the paper, it is challenging to formally analyze the relationship between the effectiveness of the action attack strategies and the portion of contexts when the target arm is the worst. On the other hand, we can try to find a new algorithm that can force the agent to choose the target arm when the target arm is not the worst. In the following, we provide our initial ideas for this direction.

In the action attack strategies consideblack in the paper, the attacker misleads the agent to obtain an estimate of the non-target arms coefficient vectors that are close to (1 α)θK. However, this can not be always achieved without Assumption 1, since the target arm is the worst at some contexts but (1 α)θK is worse than the target arm. Action attacks can not force the agent to learn (1 α)θK. We need some new attack strategies. Here, we provide our initial thoughts on theoretical insights to solve this challenge in the white-box attack scenario.

If the Assumption 1 does not hold, we set

DK := {x D : x, θK = min i [K] x, θi }

Published in Transactions on Machine Learning Research (02/2023)

as the set of context where the target arm is the worst. We consider such sequence of contexts {xt}t [T ] that xt DK for all t < TK and TK linearly depends on T.

We consider a r-neighborhood of DK: e DK := x DKBr(x),

where Br(x) = {x D : x x 2 < r}. In the white-box attack case, we can find a coefficient vector θ

such that x θK > x θ for all context x DK and x θK > x θ for all context x D/ e DK. Then the attacker force the agent to obtain an estimate of the non-target arms coefficient vectors that are close to θ . Then the target arm is the optimal in D/ e DK, and the distribution of e DK is limited.

The proposed attack strategies in this paper may make the target arm to be the optimal in D/ e DK with some r. Thus, even when the target arm is the worst arm, the attack algorithm designed in this paper can still work.

It is of interest to conduct rigorous analysis of this strategy in the future work. Furthermore, it is of interest to carry out the design and analysis for algorithms for the black box attack strategy under this scenario, which will be more challenging.

7 Conclusion

In this paper, we have proposed a class of action poisoning attacks on linear contextual bandits. We have shown that our white-box attack strategy is able to force any linear contextual bandit agent, whose regret scales sublinearly with the total number of rounds, into pulling a target arm chosen by the attacker. We have also shown that our white-box attack strategy can force Lin UCB agent into pulling a target arm T O(log2 T) times with attack cost scaled as O(log2 T). We have further shown that the proposed blackbox attack strategy can force Lin UCB agent into pulling a target arm T O(log3 T) times with attack cost scaled as O(log3 T). Our results expose a significant security threat to contextual bandit algorithms. In the future, we will investigate the defense strategy to mitigate the effects of this attack.

8 Acknowledgement

This work was supported in part by the National Science Foundation under Grants ECCS-1824553, CCF1908258 and ECCS-2000415.

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312 2320, 2011.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 127 135, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR.

Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pp. 262 275. Springer, 2017.

Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pp. 991 999. PMLR, 2021.

Olivier Chapelle, Eren Manavoglu, and Romer Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61:1 61:34, December 2014. ISSN 2157-6904. doi: 10.1145/2532128. URL http://doi.acm.org/10.1145/2532128.

Ferdinando Cicalese, Eduardo Laber, Marco Molinaro, et al. Teaching with limited information on the learner s behaviour. In International Conference on Machine Learning, pp. 2016 2026, 2020.

Published in Transactions on Machine Learning Research (02/2023)

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310 1320, 2019.

Sanjoy Dasgupta, Daniel Hsu, Stefanos Poulis, and Xiaojin Zhu. Teaching a black-box learner. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 1547 1555, 2019.

Qin Ding, Cho-Jui Hsieh, and James Sharpnack. Robust stochastic linear contextual bandits under adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pp. 7111 7123. PMLR, 2022.

Elvis Dohmatob. Generalized no free lunch theorem for adversarial robustness. In Proceedings of the 36th International Conference on Machine Learning, pp. 1646 1654, 2019.

Zhe Feng, David Parkes, and Haifeng Xu. The intrinsic robustness of stochastic bandits to strategic manipulation. In International Conference on Machine Learning, pp. 3092 3101. PMLR, 2020.

Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits. In Advances in Neural Information Processing Systems, pp. 14362 14373, 2020.

Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. information retrieval, 4(2):133 151, 2001.

Z. Guan, K. Ji, D. Bucci, T. Hu, J. Palombo, M. Liston, and Y. Liang. Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4036 4043, New York City, NY, Feb. 2020.

Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 1562 1578. PMLR, 2019.

F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1 19, 2015.

Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. ar Xiv preprint ar Xiv:2205.06811, 2022.

Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057 27068, 2021.

Yunhan Huang and Quanyan Zhu. Deceptive reinforcement learning under adversarial manipulations on cost signals. In International Conference on Decision and Game Theory for Security, pp. 217 237. Springer, 2019.

Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3644 3653, Montréal, Canada, Dec. 2018.

Weihao Kong, Emma Brunskill, and Gregory Valiant. Sublinear optimal policy value estimation in contextual bandits. In International Conference on Artificial Intelligence and Statistics, pp. 4377 4387. PMLR, 2020.

Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767 776, Lille, France, July 2015. PMLR.

Lifeng Lai, Hesham El Gamal, Hai Jiang, and H Vincent Poor. Cognitive medium access: Exploration, exploitation and competition. IEEE transactions on mobile computing, 10(2):239 253, Feb. 2011.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010.

Published in Transactions on Machine Learning Research (02/2023)

Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071 2080. PMLR, 2017.

Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, volume 97, pp. 4042 4050, Long Beach, CA, June 2019.

Guanlin Liu and Lifeng Lai. Action-manipulation attacks against stochastic bandits: Attacks and defense. IEEE Transactions on Signal Processing, 68:5152 5165, 2020a.

Guanlin Liu and Lifeng Lai. Action-manipulation attacks on stochastic bandits. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3112 3116, 2020b.

Guanlin Liu and Lifeng Lai. Provably efficient black-box action poisoning attacks against reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.

Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114 122, Los Angeles, CA, June 2018. ISBN 978-1-4503-5559-9.

Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pp. 186 204. Springer, 2018.

Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Jerry Zhu. Policy poisoning in batch reinforcement learning and control. In Advances in Neural Information Processing Systems, volume 32, 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1765 1773, 2017.

Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, pp. 7974 7984, 2020.

Amin Rakhsha, Xuezhou Zhang, Xiaojin Zhu, and Adish Singla. Reward poisoning in reinforcement learning: Attacks against unknown learners in unknown environments. ar Xiv preprint ar Xiv:2102.08492, 2021.

Anshuka Rangi, Haifeng Xu, Long Tran-Thanh, and Massimo Franceschetti. Understanding the limits of poisoning attacks in episodic reinforcement learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 3394 3400. International Joint Conferences on Artificial Intelligence Organization, 7 2022.

Vidit Saxena, Joakim Jaldén, Joseph E Gonzalez, Mats Bengtsson, Hugo Tullberg, and Ion Stoica. Contextual multi-armed bandits for link adaptation in cellular networks. In Proceedings of the 2019 Workshop on Network Meets AI & ML, pp. 44 49, 2019.

Yanchao Sun, Da Huo, and Furong Huang. Vulnerability-aware poisoning mechanism for online rl with unknown dynamics. In International Conference on Learning Representations, 2021.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

Huazheng Wang, Haifeng Xu, and Hongning Wang. When are linear stochastic bandits attackable? In Proceedings of the 39th International Conference on Machine Learning, pp. 23254 23273, 2022.

Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In Proceedings of the 36th International Conference on Machine Learning, volume 1, pp. 2, 2019.

Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. Adaptive reward-poisoning attacks against reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 11225 11234, 2020.

Published in Transactions on Machine Learning Research (02/2023)

A Necessity of Assumption 1

In the main paper, we show that, if Assumption 1 holds, our proposed attack schemes can successfully attack Lin UCB regardless the context.

Here, we highlight the necessity of Assumption 1 in the sense that, if Assumption 1 does not hold, for some sequence of context {xt}t [T ], there is no efficient action poisoning attack scheme that can successfully attack Lin UCB. If the Assumption 1 does not hold, we set

DK := {x D| x, θK = min i [K] x, θi }

as the set of context where the target arm is the worst. We consider such sequence of contexts {xt}t [T ] that xt DK for all t < TK and TK linearly depends on T.

If the attacker does not attack the target arm, the target arm is the worst in DK. No matter how we change the non-target arms to another arm, the cost and loss will linearly depend on TK and therefore linearly depend on T.

If the attacker attacks the target arm when xt DK and changes the target arm to a better arm or the optimal arm so that the agent learns that the target set is optimal even in the context set DK, the cost of attack may be up to O(TK) and hence scales as O(T). We define

τK,[K 1](t) = {s : s t, Is = K but I0 s = K}

as the set of rounds up to t where the attacker attacks arm K and CK(t) = |τK,[K 1](t)| as the attack cost of attacking arm K. We can show that if we attack the target arm, for any attack scheme,

ˆθt,K θK Vt,K V 1 t,K X

s τK,[K 1](t) LSxs Vt,K

s τK,[K 1](t) LS q

x s V 1 t,Kxs

s τK,[K 1](t) xs 2 V 1 t,K

2d CK(t) log 1 + NK(t)L2

where the last inequality is obtained by the Lemma 11 from (Abbasi-Yadkori et al., 2011). Then, the average estimation error of the mean reward of arm K is bounded by

t=1 |x t (ˆθt,K θK)|

t=1 ˆθt,K θK Vt,K xt V 1 t,K

2d LS log 1 + NK(TK)L2

Note that we consider a sequence of contexts {xt}t [T ] that xt DK for all t < TK and τ linearly depends on T. In the action poisoning attacks, if the attacker tries to blackuce the rewards of non-target arms, he can change the non-target arm to a worse arm or the worst arm. However, the target arm K is the worst arm for all t < TK. The mean rewards of the non-target arm at the contexts {xt}t [TK] must be larger or equal to x t θK. If the attacker tries to force the agent to pull the target arm in linear time at the first TK rounds, the attacker needs to let the post-attack arm K be -optimal than the original arm K. If the attacker tries

Published in Transactions on Machine Learning Research (02/2023)

to let the post-attack arm K be -optimal than the original arm K, we have CK(TK) = e O(TK), where e O ignores the logarithm dependency.

In summary, without the prior information of the context distribution or the context sequence, there is no action poisoning attack scheme that can always efficiently attack without Assumption 1. However, in some situations, some action poisoning attack scheme may still work even if the target arm is the worst in a small portion of the contexts, as shown in the numerical result session.

B Attack Cost Analysis of White-box Setting

B.1 Proof of Proposition 1

When the agent pulls a non-target arm It = K, the mean reward received by the agent should satisfy

E[rt,I0 t |Ft 1, It] = (1 α) xt, θK .

In the observation of the agent, the target arm becomes optimal and the non-target arms are associated with the coefficient vector (1 α)θK. In addition, the cumulative pseudo-regret should satisfy

t=1 1{It =K}α xt, θK

t=1 1{It =K}αγ.

We define γ = minx D x, θK . If RT is upper bounded by o(T), PT t=1 1{It =K} is also upper bounded by o(T).

B.2 Proof of Lemma 1

If the agent computes an estimate of θi by equation 4 and Vt,i = P

k τi(t 1) xkx k + λI , we have

x t ˆθt,i x t (1 α)θK

=x t V 1 t,i

k τi(t 1) rt,I0 kxk

x t V 1 t,i Vt,i(1 α)θK

=x t V 1 t,i

k τi(t 1) xk rt,I0 k (1 α)x k θK

λx t V 1 t,i (1 α)θK

k τi(t 1) x t V 1 t,i xk x k θI0 k + ηk (1 α)x k θK λx t V 1 t,i (1 α)θK,

and by triangle inequality,

|x t ˆθt,i x t (1 α)θK|

k τi(t 1) x t V 1 t,i xk x k θI0 k (1 α)x k θK +

k τi(t 1) x t V 1 t,i xkηk

+ λx t V 1 t,i (1 α)θK . (30)

Now we separately bound the three items in the RHS of Equation 30.

(1) In our model, the mean reward is bounded by 0 < xt, θi xt 2 2 θi 2 2 = LS. Since the mean

rewards are bounded and the rewards are generated independently, we have 0 x k θI0 k (1 α)x k θK

LS and E[x k θI0 k|Fk 1] = (1 α)x k θK, where the post-attack action I0 k is the random variable. Thus, n x t V 1 t,i xk x k θI0 k (1 α)x k θK o

k τi(t 1) is a bounded martingale difference sequence w.r.t the filtration

{Fk}k τi(t 1).

Published in Transactions on Machine Learning Research (02/2023)

Then, by Azuma s inequality,

k τi(t 1) x t V 1 t,i xk x k θI0 k (1 α)x k θK | B)

k τi(t 1)(x t V 1 t,i xk LS)2

where B represents confidence bound. In order to ensure the confidence bounds hold for all arms and all round t simultaneously, we set Pt,i = δ KT so

k τi(t 1) (x t V 1 t,i xk)2 LS

1 2 log 2KT

xt V 1 t,i , (32)

where the last inequality is obtained from the fact that

xt 2 V 1 t,i = x t V 1 t,i

k τi(t 1) xkx k + λI

x t V 1 t,i

k τi(t 1) xkx k

k τi(t 1) (x t V 1 t,i xk)2.

In other words, with probability 1 δ, we have

k τi(t 1) x t V 1 t,i xk x k θI0 k (1 α)x k θK LS

1 2 log 2KT

xt V 1 t,i , (34)

for all arms and all t.

(2) Note that Vt,i = P

k τi(t 1) xkx k + λI is positive definite. We define x, y V = x V y as the weighted inner-product. According to Cauchy-Schwarz inequality, we have

k τi(t 1) x t V 1 t,i xkηk

k τi(t 1) xkηk

Assume that λ L. From Theorem 1 and Lemma 11 in (Abbasi-Yadkori et al., 2011), we know that for any δ > 0, with probability at least 1 δ

k τi(t 1) xkηk

2R2 log K det(Vt,i)1/2 det(λI) 1/2

δ + d log 1 + L2Ni(t)

for all arms and all t > 0.

(3) For the third part of the right hand side of Equation 30,

|λx t V 1 t,i (1 α)θK| (1 α)λθK V 1 t,i xt V 1 t,i . (37)

Published in Transactions on Machine Learning Research (02/2023)

Since Vt,i λI, the maximum eigenvalue of V 1 t,i is smaller or equal to 1/λ. Thus,

(1 α)λθK 2 V 1 t,i 1

λ (1 α)λθK 2 2 (1 α)2λS2.

In summary,

|x t ˆθt,i x t (1 α)θK|

1 2 log 2KT

δ + d log 1 + L2Ni(t)

xt V 1 t,i . (38)

B.3 Proof of Theorem 1

For round t and context xt, if Lin UCB pulls arm i = K, we have

x t ˆθt,K + βt,K q

x t V 1 t,Kxt x t ˆθt,i + βt,i q

x t V 1 t,i xt.

Recall βt,i =

δ + d log 1 + L2Ni(t)

Since the attacker does not attack the target arm, the confidence bound of arm K does not change and

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt holds with probability 1 δ

Then, by Lemma 1,

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt

x t ˆθt,i + βt,i q

x t V 1 t,i xt

x t (1 α)θK + βt,i xt V 1 t,i +

1 2 log 2KT

+ ω (Ni(t))

xt V 1 t,i .

By multiplying both sides 1{It=i} and summing over rounds, we have

k=1 1{Ik=i}αx k θK

k=1 1{Ik=i}

1 2 log 2KT

δ + d log 1 + L2Ni(k)

xk V 1 k,i .

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and obtain

k=1 1{Ik=i} xk 2 V 1 k,i 2d log(1 + Ni(t)L2

2d log 1 + t L2

According to PT k=1 1{Ik=i} xk V 1 k,i r

Ni(t) PT k=1 1{Ik=i} xk 2 V 1 k,i , we have

k=1 1{Ik=i} xk V 1 k,i

Ni(t)2d log 1 + t L2

Published in Transactions on Machine Learning Research (02/2023)

Thus, we have

k=1 1{Ik=i}αx k θK

Ni(t)2d log 1 + t L2

1 2 log 2KT

δ + d log 1 + t L2

k=1 1{Ik=i}

Ni(t)2d log 1 + t L2

1 2 log 2KT

δ + d log 1 + t L2

where γ = minx D x, θK . Since Ni(t) = PT k=1 1{Ik=i}, we have

Ni(t) 2d (αγ)2 log 1 + t L2

1 2 log 2KT

δ + d log 1 + t L2

C Attack Cost Analysis of Black-box Setting

C.1 Proof of Lemma 2

Since the estimate of θi obtained by the agent satisfies

ˆθ0 t,i = V 0 t,i 1

k τ i (t 1) wk,irk,I0 kxk

we have x t ˆθ0 t,i x t θi

=x t V 0 t,i 1

k τ i (t 1) wk,irk,I0 kxk

x t V 0 t,i 1 V 0 t,iθi

=x t V 0 t,i 1

k τ i (t 1) (wk,irk,I0 k x k θi)xk

λx t V 0 t,i 1 θi

=x t V 0 t,i 1

k τ i (t 1) (wk,ix k θI0 k x k θi)xk

+ x t V 0 t,i 1

k τ i (t 1) wk,iηk

λx t V 0 t,i 1 θi.

Now we separately bound the three items in the RHS of Equation 47.

(1) We have 0 wk,ix k θI0 k x k θi wk,i LS and E[wk,ix k θI0 k|Fk 1] = x k θi, where the post-attack action

I0 k is the random variable. In addition, by the definition of wk,i, we have that wk,i 1/α if i = K,

Published in Transactions on Machine Learning Research (02/2023)

and wk,i 2 if i = K. Thus, n x t (V 0 t,i) 1 P

k τ i (t 1)(wk,ix k θI0 k x k θi)xk o

k τi(t 1) is also a bounded

martingale difference sequence w.r.t the filtration {Fk}k τi(t 1). By following the steps in Section B.2, we have, with probability 1 K 1

K δ, for any arm i = K and any round t, x t V 0 t,i 1

k τ i (t 1) (wk,ix k θI0 k x k θi)xk

1 2 log 2KT

xt (V 0 t,i) 1,

and with probability 1 1

K δ, for arm K and any round t, x t V 0 t,K 1

k τ K(t 1) (wk,Kx k θI0 k x k θK)xk

1 2 log 2KT

xt (V 0 t,K) 1.

(2) The confidence bound of the second item of the right side hand of Equation 47 can be obtained from Equation 36. With probability, 1 K 1

K δ, for any arm i = K and any round t, x t V 0 t,i 1

k τ i (t 1) wk,iηk

v u u t2 log K

1 + L2N i (t) λd

xt (V 0 t,i) 1. (47)

With probability, 1 1

K δ, for arm K and any round t, x t V 0 t,K 1

k τ K(t 1) wk,Kηk

v u u t2 log K

1 + L2N K(t) λd

xt (V 0 t,K) 1. (48)

(3) For the third part of the right hand side of Equation 47,

|λx t V 0 t,i 1 θi| λθi (V 0 t,i) 1 xt (V 0 t,i) 1

λS xt (V 0 t,i) 1 . (49)

In summary,

|x t ˆθ0 t,i x t θi|

1 2 log 2KT

v u u t2 log K

1 + L2N i (t) λd

xt (V 0 t,K) 1, (50)

where ϕi = 1/α when i = K and ϕK = 2.

C.2 Proof of Lemma 3

Recall the definition of ϵt:

2, (1 α) xt, ˆθ0 t,K xt, ˆθ0 t,I t

xt, ˆθ0 t,K xt, ˆθ0 t,I t , 1 α

and the definition of I t :

I t = arg min i =K

xt, ˆθ0 t,i β0 t,i xt (V 0 t,i) 1 . (52)

Published in Transactions on Machine Learning Research (02/2023)

By Lemma 2, xt, ˆθ0 t,I t β0 t,I t xt (V 0

t,I t ) 1 mini xt, θi with probability 1 2δ.

Because ϵt is bounded by [1/2, 1 α], we can analyze E[rt,I0 t |Ft 1, It] in four cases.

Case 1: when xt, ˆθ0 t,K < xt, ˆθ0 t,I t and ϵt = 1 α, we have

E[rt,I0 t |Ft 1, It] = (1 α) xt, θK + α xt, θI t . (53)

Then, by Lemma 2,

(1 α)x t θK + αx t θI t (1 α)x t θK

(1 α) x t ˆθ0 t,K + β0 t,K xt (V 0 t,K) 1 + α

x t ˆθ0 t,I t + β0 t,I t xt

(1 α)x t θK

x t ˆθ0 t,I t + (1 α)β0 t,K xt (V 0 t,K) 1 + αβ0 t,I t xt

1 (1 α)x t θK

(1 α)β0 t,K xt (V 0 t,K) 1 + (1 + α)β0 t,I t xt

where the second inequality is obtained by the condition of Case 1 and the last inequality is obtained by x t ˆθ0 t,I t β0 t,I t xt

1 mini xt, θi and Assumption 1.

On the other side, we have

(1 α)x t θK + αx t θI t (1 α)x t θK = αx t θI t 0. (55)

Case 2: when xt, ˆθ0 t,K xt, ˆθ0 t,I t > (1 2α) xt, ˆθ0 t,K and ϵt = 1/2, we have

E[rt,I0 t |Ft 1, It] = 1

2 xt, θK + 1

2 xt, θI t . (56)

Then, by Lemma 2,

1 2(x t θK + x t θI t ) (1 α)x t θK

x t θI t (1 2α)x t θK

x t ˆθ0 t,I t + β0 t,I t xt

1 (1 2α)x t θK

β0 t,I t xt

where the last inequality is obtained by x t ˆθ0 t,I t β0 t,I t xt

1 mini xt, θi and Assumption 1.

Published in Transactions on Machine Learning Research (02/2023)

On the other side, by Lemma 2,

1 2(x t θK + x t θI t ) (1 α)x t θK

x t ˆθ0 t,I t β0 t,I t xt

2(1 2α) x t ˆθ0 t,K + β0 t,K xt (V 0 t,K) 1

2β0 t,I t xt

2(1 2α)β0 t,K xt (V 0 t,K) 1.

where the last inequality is obtained by the conditions of Case 2.

Case 3: when 0 xt, ˆθ0 t,I t (1 2α) xt, ˆθ0 t,K and 1/2 ϵt 1 α, we have

E[rt,I0 t |Ft 1, It] = ϵt xt, θK + (1 ϵt) xt, θI t . (58)

We can find that

ϵt xt, θK + (1 ϵt) xt, θI t (1 α) xt, θK

=ϵt( xt, θK xt, θI t ) + xt, θI t (1 α) xt, θK

=ϵt( xt, ˆθ0 t,K xt, ˆθ0 t,I t ) + xt, θI t (1 α) xt, θK

+ ϵt( xt, ˆθ0 t,I t xt, θI t ) + ϵt( xt, θK xt, ˆθ0 t,K )

=(1 α) xt, ˆθ0 t,K xt, ˆθ0 t,I t + xt, θI t (1 α) xt, θK

+ ϵt( xt, ˆθ0 t,I t xt, θI t ) + ϵt( xt, θK xt, ˆθ0 t,K )

=(1 α ϵt) xt, ˆθ0 t,K xt, θK + (1 ϵt) xt, ˆθ0 t,I t xt, θI t ,

which is equivalent to E[rt,I0 t |Ft 1, It] (1 α) xt, θK

(1 α ϵt)β0 t,K xt (V 0 t,K) 1 + (1 ϵt)β0 t,I t xt

Case 4: when xt, ˆθ0 t,I t < 0 and ϵt = 1 α, we have

E[rt,I0 t |Ft 1, It] = (1 α) xt, θK + α xt, θI t . (61)

Then, by Lemma 2, (1 α)x t θK + αx t θI t (1 α)x t θK

=αx t θI t αx t ˆθ0 t,I t + αβ0 t,I t xt

αβ0 t,I t xt

where the last inequality is obtained by the condition of Case 4. We also have

(1 α)x t θK + αx t θI t (1 α)x t θK = αx t θI t 0. (63)

Published in Transactions on Machine Learning Research (02/2023)

Combining these four cases, we have

E[rt,I0 t |Ft 1, It] (1 α) xt, θK

(1 α)β0 t,K xt (V 0 t,K) 1 + (1 + α)β0 t,I t xt

C.3 Proof of Lemma 4

From Section B.2, we have, for any arm i = K,

|x t ˆθt,i x t (1 α)θK|

k τi(t 1) x t V 1 t,i xk x k θI0 k (1 α)x k θK +

k τi(t 1) x t V 1 t,i xkηk

+ λx t V 1 t,i (1 α)θK

k τi(t 1) x t V 1 t,i xk x k θI0 k ϵk xk, θK (1 ϵk) xk, θI k

k τi(t 1) x t V 1 t,i xk ϵk xk, θK + (1 ϵk) xk, θI k (1 α)x k θK

k τi(t 1) x t V 1 t,i xkηk

+ λx t V 1 t,i (1 α)θK .

Now we separately bound the first item and second item in the RHS of Equation 65. The bounds of the third item and fourth item in the RHS of Equation 65 are provided in Section B.2.

(1) Since the mean rewards are bounded and the rewards are generated independently, we have 0 x k θI0 k ϵk xk, θK (1 ϵk) xk, θI k LS and E[x k θI0 k|Fk 1] = ϵk xk, θK + (1 ϵk) xk, θI k .

Then n x t V 1 t,i xk x k θI0 k E[x k θI0 k|Fk 1] o

k τi(t 1) is also a bounded martingale difference sequence w.r.t

the filtration {Fk}k τi(t 1). By following the steps in Section B.2, we have, with probability 1 δ, for any arm i and any round t,

k τi(t 1) x t V 1 t,i xk x k θI0 k E[x k θI0 k|Fk 1] LS

1 2 log 2KT

xt V 1 t,i . (66)

(2) From Equation 33 in Section B.2, we have

xt 2 V 1 t,i X

k τi(t 1) (x t V 1 t,i xk)2. (67)

Published in Transactions on Machine Learning Research (02/2023)

Then, the second item of the right hand side of Equation 65 can be upper bounded by

k τi(t 1) x t V 1 t,i xk ϵk xk, θK + (1 ϵk) xt, θI k (1 α)x k θK

E[rk,I0 k|Fk 1, Ik] (1 α)x k θK 2s X

k τi(t 1) (x t V 1 t,i xk)2

(1 α)β0 k,K xk (V 0 k,K) 1 + (1 + α)β0 k,I k xk

xt V 1 t,i ,

where the first inequality is obtained from Cauchy-Schwarz inequality, the second inequality is obtained from Lemma 3 and Equation 33.

In addition, by the fact that (a + b)2 2a2 + 2b2 for any real number, we have

(1 α)β0 k,K xk (V 0 k,K) 1 + (1 + α)β0 k,I k xk

k τi(t 1) 2 (1 α)β0 k,K xk (V 0 k,K) 1 2 + X

k τi(t 1) 2

(1 + α)β0 k,I k xk

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and get, for any arm i,

k τ i (t 1) xk 2 (V 0 k,i) 1 2d log 1 + Ni(t)L2

2d log 1 + t L2

By the fact that P

i τi(t 1) = τ K(t 1), and P

i =K τi(t 1) = P

i =K τ i (t 1), we have, for any arm i,

τi(t 1) τ K(t 1), and τi(t 1) P j =K τ j (t 1). Thus,

k τi(t 1) xk 2 (V 0 k,K) 1 X

k τ K(t 1) xk 2 (V 0 k,K) 1 2d log 1 + t L2

k τi(t 1) xk 2

k τ i (t 1) xk 2 (V 0 k,i) 1 2(K 1)d log 1 + t L2

Published in Transactions on Machine Learning Research (02/2023)

By combining the definition of β0 t,i, Equation 69, Equation 71 and Equation 72, we have

(1 α)β0 k,K xk (V 0 k,K) 1 + (1 + α)β0 k,I k xk

k τi(t 1) 2 β0 k,K xk (V 0 k,K) 1 2 + X

k τi(t 1) 2

2β0 k,I k xk

1 2 log 2KT

log 1 + t L2

+ 16d2(K 1)

1 2 log 2KT

log 1 + t L2

1 2 log 2KT

log 1 + t L2

In summary, we have

|x t ˆθt,i x t (1 α)θK|

K log 1 + t L2

1 2 log 2KT

xt V 1 t,i . (74)

C.4 Proof of Theorem 2

For round t and context xt, if Lin UCB pulls arm i = K, we have

x t ˆθt,K + βt,K q

x t V 1 t,Kxt x t ˆθt,i + βt,i q

x t V 1 t,i xt.

In this case, βt,i = ω(Ni(t)) =

δ + d log 1 + L2Ni(t)

Since the attacker does not attack the target arm, the confidence bound of arm K does not change and

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt holds with probability 1 δ

Thus, by Lemma 4,

x t θK x t ˆθt,i + βt,i q

x t V 1 t,i xt

x t (1 α)θK + ω(Ni(t)) xt V 1 t,i

K log 1 + t L2

1 2 log 2KT

xt V 1 t,i .

By multiplying both sides by 1{It=i} and summing over rounds, we have

k=1 1{Ik=i}αx k θK

k=1 1{Ik=i}

K log 1 + k L2

1 2 log 2KT

xk V 1 k,i .

Published in Transactions on Machine Learning Research (02/2023)

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and get

k=1 1{Ik=i} xk 2 V 1 k,i 2d log 1 + Ni(t)L2

2d log 1 + t L2

According to PT k=1 1{Ik=i} xk V 1 k,i r

Ni(t) PT k=1 1{Ik=i} xk 2 V 1 k,i , we have

k=1 1{Ik=i} xk 2 V 1 k,i

Ni(t)2d log 1 + t L2

Thus, we have

k=1 1{Ik=i}αx k θK

Ni(t)2d log 1 + t L2

K log 1 + t L2

1 2 log 2KT

k=1 1{Ik=i}

2d (αγ)2 log 1 + t L2

K log 1 + t L2

1 2 log 2KT

where γ = minx D x, θK .

D Attacks on Generalized Linear Contextual Bandits

In the generalized linear model (GLM), there is a fixed, strictly increasing link function µ : R R such that the reward satisfies rt,It = µ( xt, θIt ) + ηt,

where ηt is a conditionally independent zero-mean R-subgaussian noise and , denotes the inner product. If we consider the σ-algebra Ft = σ(η1, . . . , ηt), ηt becomes Ft measurable.

Hence, the expected reward of arm i under context xt follows the GLM setting: E[rt,i] = µ( xt, θi ) for all t and all arm i. One can verify that µ(x) = x leads to the linear model and µ(x) = exp(x)/(1 + exp(x)) leads to the logistic model.

We assume that the link function µ is continuously twice differentiable, Lipschitz with constant kµ and such that cµ = infθ Θ,x D µ(x θ) > 0, where µ denote the first derivatives of µ. It can be verified that the link function of the linear model is Lipschitz with constant kµ = 1 and which of the logistic model is Lipschitz with constant kµ = 1/4.

The agent is interested in minimizing the cumulative pseudo-regret, and the cumulative pseudo-regret for the GLM can be formally written as

µ( xt, θI t ) µ( xt, θIt ) , (81)

where I t = arg maxi µ( xt, θi ).

For the GLM consideblack here, µ is a strictly increasing function and I t = arg maxi µ( xt, θi ) = arg maxi xt, θi . As the link function µ is strictly increasing, the target arm is not the worst arm under Assumption 1.

Published in Transactions on Machine Learning Research (02/2023)

D.1 Overview of UCB-GLM

For reader s convenience, we first provide a brief overview of the UCB-GLM algorithm (Li et al., 2017). The UCB-GLM algorithm is summarized in Algorithm 3.

The algorithm is simply initialized by play every arm j times to ensure a unique solution of ˆθi for each arm i. We assume that after playing arm i J times, Vi is invertible and the minimal eigenvalue of Vi is greater or equal to λ0 for all arm i. We assume that xt is drawn iid from some distribution v with support in the unit ball and set Σ := E[xtx t ]. Proposition 1 in (Li et al., 2017) shows that there exist positive, universal constants D1 and D2 such that λmin(Vi) λ0 with probability at least 1 δ, as long as

log(1/δ) λmin(Σ)

+ 2λ0 λmin(Σ). (82)

Algorithm 3 UCB-GLM (Li et al., 2017) Require:

number of arms K, number of rounds T, number of initial rounds j. 1: Initialize for every arm i.

2: Play every arm J times. At each time, update VIt VIt + xtx t . 3: for t = KJ + 1, KJ + 2, . . . , T do 4: observe the context xt. 5: for i = 1, 2, . . . , K do 6: Calculate the maximum-likelihood estimator ˆθi by solving the equation X

n τi(t 1) (rn µ(x n ˆθi))xn = 0.

7: Compute the upper confidence bound: pt,i x t ˆθi + βt,i q

x t V 1 i xt.

9: Pull arm It = arg maxi pt,i. 10: The environment generates reward rt according to arm It. 11: The agent receive rt. 12: VIt VIt + xtx t . 13: end for

By following the setup in (Li et al., 2017), we set βt,i = 2R

δ + d log 1 + L2Ni(t)

λ0d for GLM.

D.2 White-box Attack Strategy

Similar to the white-box attack strategy for linear contextual bandits, the white-box attack strategy for GLM works as follows. When the agent chooses arm K, the attacker does not attack. When the agent chooses arm It = K, the attacker changes it to arm

( K with probability ϵt

I t with probability 1 ϵt (83)

where I t = arg mini xt, θi and

ϵt = µ((1 α) xt, θK ) µ(mini xt, θi )

µ( xt, θK ) µ(mini xt, θi ) . (84)

The attacker can manipulate the agent into learning some particular coefficient vectors. In particular, for arm K (the target arm), the agent obtains the true coefficient vector θK. For any arm i = K, because of

Published in Transactions on Machine Learning Research (02/2023)

the attacks, the agent will obtain an estimate of arm i s coefficient vector to be close to (1 α)θK. Under the white-box attack, when the agent pulls a non-target arm It = K, the mean reward received by the agent should satisfy E[rt,I0 t |Ft 1, It] = µ((1 α) xt, θK ).

Proposition 2. For δ > 0. Suppose the agent employs a generalized linear contextual bandit algorithm whose cumulative pseudo-regret is upper bounded by o(T) over T with probability at least 1 δ. When using the proposed white-box attack scheme, with probability at least 1 δ, the attacker can force the agent to pull the target arm for T o(T) times, while the total attack cost |C| is upper bounded by o(T).

The proof of Proposition 2 is provided in Appendix E.1. To further illustrate the proposed attack scheme, we now provide a finer analysis the impact of this attack on UCB-GLM described in Algorithm 3.

Lemma 5. Under the proposed white-box attack, the estimate of θi for each arm i = K obtained by UCBGLM agent as described in Algorithm 3 satisfies

|x t ˆθt,i x t (1 α)θK| 2kµLS + 2R

δ + d log 1 + L2Ni(t)

xt V 1 t,i . (85)

The proof of Lemma 5 is provided in Appendix E.2.

Theorem 3. Define γ = minx D x, θK . Under the same assumptions as in Lemma 5, for any δ > 0 with probability at least 1 2δ, for all T 0, the attacker can manipulate the UCB-GLM agent into pulling the target arm in at least T |C| rounds, using an attack cost

|C| 4d(K 1)

(αγ)2 log 1 + t L2

δ + d log 1 + L2T

The proof of Theorem 3 is provided in Appendix E.3.

D.3 Black-box Attack Strategy

The modified black-box attack strategy for GLM works as follows. When the agent chooses arm K, the attacker does not attack. When the agent chooses arm It = K, the attacker changes it to arm

( K with probability ϵt

I t with probability 1 ϵt (87)

where I t = arg min i =K

xt, ˆθ0 t,i β0 t,i xt ( V 0 t,i) 1 , (88)

β0 t,i = 2ϕi kµLS + R

δ + d log 1 + L2Ni(t)

ϕi = kµ cµα when i = K and ϕK = 1 + kµ

cµ cµ + kµ , µ((1 α)x t ˆθ0 t,K) µ(x t ˆθ0 t,I t )

µ(x t ˆθ0 t,K) µ(x t ˆθ0 t,I t ) , 1 α cµ

with clip(a, x, b) = min(b, max(x, a)) where a b.

For notational convenience, we set I t = K and ϵt = 1 when It = K. We define that, if i = K, τ i (t) := {s : s t, I s = i} and N i (t) = |τ i (t)| ; τ K(t) := {s : s t} and N K(t) = |τ K(t)|.

Published in Transactions on Machine Learning Research (02/2023)

Calculate the maximum-likelihood estimator ˆθ0 t,i by solving the equation

n τi(t 1) (wt,irn µ(x n ˆθt,i))xn = 0.

where V 0 t,i = P

k τ i (t 1) xkx k

1/ϵt if i = I0 t = K

1/(1 ϵt) if i = I0 t = I t 0 if i = I0 t

First, we analyze the estimation ˆθ0 t,i at the attacker side. We establish a confidence ellipsoid of xt, ˆθ0 t,i at the attacker.

Lemma 6. Assume the attacker performs the proposed black-box action poisoning attack. With probability 1 2δ, we have

|x t ˆθ0 t,i x t θi| β0 t,i xt ( V 0 t,i) 1. (92)

holds for all arm i and all t 0 simultaneously.

The proof of Lemma 6 is provided in Appendix E.4.

Lemma 7. Under the black-box attack, with probability 1 2δ, the estimate obtained by an UCB-GLM agent satisfies E[rt,I0 t |Ft 1, It] µ((1 α)x t θK) 2kµβ0 t,K xt ( V 0 t,K) 1 + 2kµβ0 t,I t xt

simultaneously for all t 0 when It = K.

Lemma 8. Assume the attacker performs the proposed black-box action poisoning attack. With a probability at least 1 3Kδ

K , the estimate ˆθt,i obtained by the UCB-GLM agent will satisfy

|x t ˆθt,i x t (1 α)θK|

1 + 16k2 µd cµα

K log 1 + t L2

δ + d log 1 + L2t

x V 1 t,i (93)

simultaneously for all arm i = K and all t 0.

Theorem 4. Define γ = minx D x, θK . Assume the attacker performs the proposed black-box action poisoning attack. For any δ > 0 with probability at least 1 2δ, for all T 0, the attacker can manipulate the UCB-GLM agent into pulling the target arm in at least T |C| rounds, using an attack cost

|C| 4d(K 1)

2 log 1 + TL2

δ + d log 1 + L2T

1 + 16k2 µd cµα

K log 1 + TL2

The proof of Theorem 4 is provided in Appendix E.7. Theorem 4 shows that our black-box attack strategy can manipulate a UCB-GLM agent into pulling a target arm T O(log3 T) times with attack cost scaling as O(log3 T). Compablack with the result for the white-box attack, the black-box attack only brings an additional log T factor.

Published in Transactions on Machine Learning Research (02/2023)

E Attack Cost Analysis of GLM

E.1 Proof of Proposition 2

Let us consider a contextual bandit problem P1, in which the arm K (the target arm) is associated with a fixed coefficient vector θK and all other arms are associated with the coefficient vector (1 α)θK. For a generalized linear contextual bandit algorithm A , we suppose that the cumulative pseudo-regret regret of algorithm A for the problem P1 is upper bounded with probability at least 1 δ by a function f A (T) such that f A (T) = o(T).

Under the proposed white-box attack, when the agent pulls a non-target arm It = K, the mean reward received by the agent should satisfy E[rt,I0 t |Ft 1, It] = µ((1 α) xt, θK ). In the observation of the agent, the target arm becomes optimal and the non-target arms are associated with the coefficient vector (1 α)θK. For the agent, the combination of the attacker and the environment form problem P1. The cumulative pseudo-regret should satisfy

t=1 1{It =K} (µ ( xt, θK ) µ ( xt, (1 α)θK ))

t=1 1{It =K}cµ xt, αθK

t=1 1{It =K}cµγ,

which is equivalent to PT t=1 1{It =K} RT /(cµγ). Since RT is upper bounded by f A (T) = o(T), |C| = PT t=1 1{It =K} is also upper bounded by o(T).

E.2 Proof of Lemma 5

The maximum-likelihood estimation can be written as the solution to the following equation X

n τi(t 1) (rn µ(x n ˆθt,i))xn = 0. (95)

Define gt,i(θ) = P

n τi(t 1) µ(x n θ))xn. gt,i(ˆθt,i) = P

n τi(t 1) rnxn. Since µ is continuously twice differentiable, gt,i is continuous, and for any θ Θ, gt,i(θ) = P

n τi(t 1) xnx n µ(x n θ)). gt,i(θ) denotes the Jacobian matrix of gt,i at θ. By the Fundamental Theorem of Calculus,

gt,i(ˆθt,i) gt,i((1 α)θK) = Gt,i(ˆθt,i (1 α)θK), (96)

0 gt,i sˆθt,i + (1 s)(1 α)θK ds. (97)

Note that gt,i(θ) = P

n τi(t 1) xnx n µ(x n θ)). According to the assumption that cµ = infθ Θ,x D µ(x θ) > 0, we have Gt,i cµVt,i cµVKJ,i λ0I 0, where in the last two step we used the assumption that the minimal eigenvalue of Vi is greater or equal to λ0 after playing arm i J times. Thus, Gt,i is positive definite and non-singular. Therefore,

ˆθi (1 α)θK = G 1 t,i gt,i(ˆθi) gt,i((1 α)θK) . (98)

For arm K, gt,i(ˆθi) gt,i(θK) = P

n τi(t 1) ηnxn.

Published in Transactions on Machine Learning Research (02/2023)

For all arm i = K, the right hand side of Equation 98 is equivalent to

gt,i(ˆθi) gt,i((1 α)θK)

n τi(t 1) (rn µ((1 α)x n θK))xn

n τi(t 1) (µ(x n θI0n) µ((1 α)x n θK))xn + X

n τi(t 1) ηnxn.

We set Z1 = P

n τi(t 1)(µ(x n θI0 n) µ((1 α)x n θK))xn and Z2 = P

n τi(t 1) ηnxn.

We have gt,i(ˆθi) gt,i((1 α)θK) = Z1 + Z2 and

x t (ˆθi (1 α)θK) = x t G 1 t,i (Z1 + Z2). (100)

For any context x D and arm i = K, we have

|x (ˆθt,i (1 α)θK)|

=|x G 1 t,i (Z1 + Z2)|

|x G 1 t,i Z1| + |x G 1 t,i Z2|.

We first bound |x G 1 t,i Z2|. Since Gt,i is positive definite and G 1 t,i is also positive definite, |x G 1 t,i Z2| x G 1 t,i Z2 G 1 t,i .

Since Gt,i cµVt,i implies that G 1 t,i c 1 µ V 1 t,i , we have x G 1 t,i 1 cµ x V 1 t,i holds for any x Rd. Thus,

|x G 1 t,i Z2| 1

cµ x V 1 t,i Z2 V 1 t,i . (102)

Note that Vt,i = Vt,i + λI. Hence, for all vector x Rd

x 2 V 1 t,i = x 2 V 1 t,i + x ( V 1 t,i V 1 t,i )x. (103)

Since (A + B) 1 = A 1 A 1B(A + B) 1,

V 1 t,i = V 1 t,i λ V 1 t,i V 1 t,i . (104)

The above implies that

0 x ( V 1 t,i V 1 t,i )x

=x λ V 1 t,i V 1 t,i x

λ0 x 2 V 1 t,i .

and x 2 V 1 t,i (1 + λ

λ0 ) x 2 V 1 t,i .

From Theorem 1 and Lemma 11 in (Abbasi-Yadkori et al., 2011), we know that for any δ > 0, with probability at least 1 δ

k τi(t 1) xkηk

2R2 log K det(Vt,i)1/2 det(λI) 1/2

δ + d log 1 + L2Ni(t)

Published in Transactions on Machine Learning Research (02/2023)

for all arms and all t > 0.

Set λ = λ0, we have

Z2 V 1 t,i 2R

δ + d log 1 + L2Ni(t)

Now we bound |x G 1 t,i Z1|. Similarly,

|x G 1 t,i Z1| 1

cµ x V 1 t,i Z1 V 1 t,i . (108)

In our model, we have 0 < xt, θi xt 2 2 θi 2 2 = LS. Further,

0 µ(x k θI0 k) µ((1 α)x k θK)

kµ x k θI0 k (1 α)x k θK

Since we have E[µ(x k θI0 k)|Fk 1] = µ((1 α)x k θK), n µ(x k θI0 k) µ (1 α)x k θK o

k τi(t 1) is a bounded

martingale difference sequence w.r.t the filtration {Fk}k τi(t 1) and is also kµLS-sub-Gaussian-sub-Gaussian.

From Theorem 1 and Lemma 11 in (Abbasi-Yadkori et al., 2011), we know that for any δ > 0, with probability at least 1 K 1

Z1 V 1 t,i 2kµLS

δ + d log 1 + L2Ni(t)

for any arm i = K and all t > 0.

In summary, for all arm i = K,

|x (ˆθi (1 α)θK)| 2kµLS + 2R

δ + d log 1 + L2Ni(t)

x V 1 t,i . (111)

E.3 Proof of Theorem 3

For round t and context xt, if UCB-GLM pulls arm i = K, we have

x t ˆθt,K + βt,K q

x t V 1 t,Kxt x t ˆθt,i + βt,i q

x t V 1 t,i xt,

Recall βt,i = 2R

δ + d log 1 + L2Ni(t)

λ0d in GLM.

Since the attacker does not attack the target arm, the confidence bound of arm K does not change and

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt holds with probability 1 δ

Thus, by Lemma 5,

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt

x t ˆθt,i + βt,i q

x t V 1 t,i xt

x t ˆθt,i + 2kµLS + 2R

cu 2Rβt,i q

x t V 1 t,i xt + βt,i q

x t V 1 t,i xt

x t (1 α)θK + kµLS + 2R

R βt,i xt V 1 t,i .

Published in Transactions on Machine Learning Research (02/2023)

By multiplying both sides 1{It=i} and summing over rounds, we have

k=1 1{Ik=i}αx k θK

k=1 1{Ik=i} kµLS + 2R

R βt,i xt V 1 t,i . (113)

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and obtain

k=1 1{Ik=i} xk 2 V 1 k,i 2d log(1 + Ni(t)L2

dλ ) 2d log 1 + t L2

According to PT k=1 1{Ik=i} xk V 1 k,i r

Ni(t) PT k=1 1{Ik=i} xk 2 V 1 k,i , we have

k=1 1{Ik=i} xk V 1 k,i

Ni(t)2d log 1 + t L2

Set λ = λ0, we have x 2 V 1 t,i (1 + λ

λ0 ) x 2 V 1 t,i = 2 x 2 V 1 t,i . Thus, we have

k=1 1{Ik=i}αx k θK kµLS + 2R

4Ni(t)d log 1 + t L2

k=1 1{Ik=i} 4d (αγ)2 log 1 + t L2

where γ = minx D x, θK .

E.4 Proof of Lemma 6

The attacker calculate the maximum-likelihood estimator ˆθ0 t,i by solving the equation

n τi(t 1) (wn,irn µ(x n ˆθi))xn = 0. (118)

Note that g0 t,i(θ) = P

n τ i (t 1) µ(x n θ))xn. g0 t,i(ˆθ0 t,i) = P

n τ i (t 1) wn,irnxn.

For all arm i,

g0 t,i(ˆθ0 t,i) g0 t,i(θi)

n τ i (t 1) (wn,irn µ(x n θi))xn

n τ i (t 1) (wn,iµ(x n θI0n) µ(x n θi))xn + X

n τ i (t 1) wn,iηnxn.

Similarly, we set Z3 = P

n τ i (t 1) wn,iηnxn and Z4 = P

n τ i (t 1)(wn,iµ(x n θI0n) µ((1 α)x n θK))xn .

We have g0 t,i(ˆθ0 t,i) g0 t,i(θi) = Z3 + Z4 and

x t (ˆθ0 t,i θi) = x t (G0 t,i) 1(Z3 + Z4), (120)

Published in Transactions on Machine Learning Research (02/2023)

G0 t,i = Z 1

0 g0 t,i sˆθ0 t,i + (1 s)θi ds. (121)

For any context x D, we have

|x (ˆθ0 t,i θi)|

=|x (G0 t,i) 1(Z3 + Z4)|

|x (G0 t,i) 1Z3| + |x (G0 t,i) 1Z4|.

We first bound |x (G0 t,i) 1Z3|. We have

|x (G0 t,i) 1Z3| 1

cµ x ( V 0 t,i) 1 Z3 ( V 0 t,i) 1, (123)

where V 0 t,i = P

n τi(t 1) xnx n .

Note that V = t,i V 0 t,i + λI. Hence,

Z3 2 ( V 0 t,i) 1 = Z3 2 (V 0 t,i) 1 + Z 3 (( V 0 t,i) 1 (V 0 t,i) 1)Z3

λ0 ) Z3 2 (V 0 t,i) 1. (124)

From Theorem 1 and Lemma 11 in (Abbasi-Yadkori et al., 2011), we know that for any δ > 0, with probability at least 1 δ

k τi(t 1) xkηk

2R2 log K det(Vt,i)1/2 det(λI) 1/2

δ + d log 1 + L2Ni(t)

for all arms and all t > 0.

Set λ = λ0, we have

Z3 2 ( V 0 t,i) 1 2ϕi R

δ + d log 1 + L2N 0 i (t) λ0d

Now we bound |x (G0 t,i) 1Z4|. Similarly,

|x (G0 t,i) 1Z4| 1

cµ x ( V 0 t,i) 1 Z4 ( V 0 t,i) 1. (127)

In our model, we have 0 < xt, θi xt 2 2 θi 2 2 = LS. Further,

0 wk,iµ(x k θI0 k) µ((1 α)x k θK) ϕikµLS. (128)

Since we have E[wk,iµ(x k θI0 k)|Fk 1] = µ(x k θi), n wk,iµ(x k θI0 k) µ x k θi o

k τi(t 1) is a bounded martingale

difference sequence w.r.t the filtration {Fk}k τi(t 1) and is also ϕikµLS-sub-Gaussian-sub-Gaussian.

From Theorem 1 and Lemma 11 in (Abbasi-Yadkori et al., 2011), we know that for any δ > 0, with probability at least 1 δ

Z4 2 ( V 0 t,i) 1 2ϕikµLS

δ + d log 1 + L2N 0 i (t) λ0d

Published in Transactions on Machine Learning Research (02/2023)

for any arm i = K and all t > 0.

In summary, for all arm i = K,

|x (ˆθi (1 α)θK)| 2ϕi kµLS + R

δ + d log 1 + L2Ni(t)

x ( V 0 t,i) 1. (130)

E.5 Proof of Lemma 7

Recall the definition of ϵt:

cµ cµ + kµ , µ((1 α)x t ˆθ0 t,K) µ(x t ˆθ0 t,I t )

µ(x t ˆθ0 t,K) µ(x t ˆθ0 t,I t ) , 1 α cµ

and the definition of I t :

I t = arg min i =K

xt, ˆθ0 t,i β0 t,i xt ( V 0 t,i) 1 . (132)

By Lemma 6, xt, ˆθ0 t,I t β0 t,I t xt ( V 0

t,I t ) 1 mini xt, θi with probability 1 2δ. Thus, with probability

1 2δ, µ(x t ˆθ0 t,I t ) mini µ(x t θi) kµβ0 t,I t xt ( V 0

Because ϵt is bounded by [ cµ cµ+kµ , 1 α cµ

kµ ], we can analyze E[rt,I0 t |Ft 1, It] in four cases.

Case 1: when xt, ˆθ0 t,K < xt, ˆθ0 t,I t , we have ϵt = 1 α cµ

kµ and µ( xt, ˆθ0 t,K ) < µ( xt, ˆθ0 t,I t ). Thus,

E[rt,I0 t |Ft 1, It] = (1 α cµ

kµ )µ(x t θK) + α cµ

kµ µ(x t θI t ). (133)

Then, by Lemma 6,

kµ )µ(x t θK) + α cµ

kµ µ(x t θI t ) µ((1 α)x t θK)

kµ ) µ(x t ˆθ0 t,K) + kµβ0 t,K xt ( V 0 t,K) 1 + α cµ

µ(x t ˆθ0 t,I t ) + kµβ0 t,I t xt

µ((1 α)x t θK)

µ(x t ˆθ0 t,I t ) + αcµβ0 t,I t xt

1 + (1 α cµ

kµ )kµβ0 t,K xt ( V 0 t,K) 1 µ((1 α)x t θK)

(kµ αcµ)β0 t,K xt ( V 0 t,K) 1 + (kµ + αcµ)β0 t,I t xt

where the second inequality is obtains by µ( xt, ˆθ0 t,K ) < µ( xt, ˆθ0 t,I t ) and the last inequality is obtained by

µ(x t ˆθ0 t,I t ) mini µ(x t θi) kµβ0 t,I t xt ( V 0

t,I t ) 1 and Assumption 1.

Published in Transactions on Machine Learning Research (02/2023)

On the other side, we have

kµ )µ(x t θK) + α cµ

kµ µ(x t θI t ) µ((1 α)x t θK)

kµ ) µ(x t ˆθ0 t,K) kµβ0 t,K xt ( V 0 t,K) 1 + α cµ

µ(x t ˆθ0 t,I t ) kµβ0 t,I t xt

µ((1 α)x t θK)

µ(x t ˆθ0 t,K) µ((1 α)x t θK) αcµβ0 t,I t xt

kµ )kµβ0 t,K xt ( V 0 t,K) 1

µ(x t θK) µ((1 α)x t θK) αcµβ0 t,I t xt

kµ )kµβ0 t,K xt ( V 0 t,K) 1

(2kµ αcµ)β0 t,K xt ( V 0 t,K) 1 αcµβ0 t,I t xt

Case 2: when µ(x t ˆθ0 t,K) µ(x t ˆθ0 t,I t ) > (1 + cµ

kµ )µ((1 α)x t ˆθ0 t,K) cµ

kµ µ(x t ˆθ0 t,K) and ϵt = cµ cµ+kµ , we have

E[rt,I0 t |Ft 1, It] = cµ cµ + kµ µ(x t θK) + kµ cµ + kµ µ(x t θI t ). (136)

By Lemma 6, with probability 1 2δ,

µ(x t ˆθ0 t,I t ) min i µ(x t θi) kµβ0 t,I t xt ( V 0

Since mini x t θi (1 2α)x t θK, we have

µ(x t ˆθ0 t,I t ) µ((1 2α)x t θK) + kµβ0 t,I t xt ( V 0

µ(x t θI t ) µ((1 2α)x t θK) + 2kµβ0 t,I t xt ( V 0

cµµ(x t θK) cµ + kµ + kµµ(x t θI t )

cµ + kµ µ((1 α)x t θK)

= cµ cµ + kµ µ(x t θK) cµ cµ + kµ µ((1 α)x t θK)

+ kµ cµ + kµ µ(x t θI t ) kµ cµ + kµ µ((1 α)x t θK)

µ(x t θK) µ((1 α)x t θK)

+ kµ cµ + kµ

µ((1 2α)x t θK) µ((1 α)x t θK) + 2k2 µ cµ + kµ β0 t,I t xt ( V 0

Published in Transactions on Machine Learning Research (02/2023)

According to the definition of kµ and cµ and Lemma 6,

cµµ(x t θK) cµ + kµ + kµµ(x t θI t )

cµ + kµ µ((1 α)x t θK)

cµ cµ + kµ kµ x t θK (1 α)x t θK + kµ cµ + kµ cµ (1 2α)x t θK (1 α)x t θK

+ 2k2 µ cµ + kµ β0 t,I t xt ( V 0

= 2k2 µ cµ + kµ β0 t,I t xt ( V 0

In addition, by Lemma 6,

cµµ(x t θK) cµ + kµ + kµµ(x t θI t )

cµ + kµ µ((1 α)x t θK)

=cµµ(x t θK) cµ + kµ + kµµ(x t θI t )

cµ + kµ µ((1 α)x t ˆθ0 t,K)

+ µ((1 α)x t ˆθ0 t,K) µ((1 α)x t θK)

cµ cµ + kµ µ(x t θK) + kµ cµ + kµ µ(x t θI t ) cµ cµ + kµ µ(x t ˆθ0 t,K) kµ cµ + kµ µ(x t ˆθ0 t,I t )

+ µ((1 α)x t ˆθ0 t,K) µ((1 α)x t θK)

(1 α + cµ cµ + kµ )kµβ0 t,K xt ( V 0 t,K) 1 kµ cµ + kµ kµβ0 t,I t xt

where the first inequality is obtained by the condition of case 2:

µ(x t ˆθ0 t,I t ) > (1 + cµ

kµ )µ((1 α)x t ˆθ0 t,K) cµ

kµ µ(x t ˆθ0 t,K)

which is equivalent to

cµµ(x t ˆθ0 t,K) cµ + kµ + kµµ(x t ˆθ0 t,I t )

cµ + kµ > µ((1 α)x t ˆθ0 t,K).

Case 3: when the attacker s estimates satisfy

kµ αcµ µ((1 α)x t ˆθ0 t,K) ( kµ

αcµ 1)µ(x t ˆθ0 t,K)

µ(x t ˆθ0 t,I t )

kµ )µ((1 α)x t ˆθ0 t,K) cµ

kµ µ(x t ˆθ0 t,K)

and hence cµ cµ+kµ ϵt 1 α cµ

kµ , we have

E[rt,I0 t |Ft 1, It] = ϵtµ(x t θK) + (1 ϵt)µ(x t θI t ). (141)

Published in Transactions on Machine Learning Research (02/2023)

We can find that

ϵtµ(x t θK) + (1 ϵt)µ(x t θI t ) µ((1 α)x t θK)

=ϵt(µ(x t θK) µ(x t θI t )) + µ(x t θI t ) µ((1 α)x t θK)

=ϵt(µ(x t ˆθ0 t,K) µ(x t ˆθ0 t,I t )) + ϵt(µ(x t ˆθ0 t,I t ) µ(x t θI t )) + ϵt(µ(x t θK) µ(x t ˆθ0 t,K))

+ µ(x t θI t ) µ((1 α)x t θK)

=µ((1 α)x t ˆθ0 t,K) µ(x t ˆθ0 t,I t ) + ϵt(µ(x t ˆθ0 t,I t ) µ(x t θI t )) + ϵt(µ(x t θK) µ(x t ˆθ0 t,K))

+ µ(x t θI t ) µ((1 α)x t θK)

=µ((1 α)x t ˆθ0 t,K) µ((1 α)x t θK) + ϵt(µ(x t θK) µ(x t ˆθ0 t,K)) + (1 ϵt) µ(x t θI t ) µ(x t ˆθ0 t,I t ) .

(142) From Lemma 6, E[rt,I0 t |Ft 1, It] µ((1 α)x t θK

(1 α + ϵt)kµβ0 t,K xt ( V 0 t,K) 1

+ (1 ϵt)kµβ0 t,I t xt

Case 4: when µ(x t ˆθ0 t,I t ) < kµ αcµ µ((1 α)x t ˆθ0 t,K) ( kµ

αcµ 1)µ(x t ˆθ0 t,K) and ϵt = 1 α cµ

kµ , we have

E[rt,I0 t |Ft 1, It] = (1 α cµ

kµ )µ(x t θK) + α cµ

kµ µ(x t θI t ). (144)

Then, by Lemma 6,

kµ )µ(x t θK) + α cµ

kµ µ(x t θI t ) µ((1 α)x t θK)

kµ )µ(x t θK) µ((1 α)x t θK) + α cµ

kµ µ(x t ˆθ0 t,I t ) + αcµβ0 t,I t xt ( V 0

kµ )µ(x t θK) µ((1 α)x t θK) + µ((1 α)x t ˆθ0 t,K) (1 α cµ

kµ )µ(x t ˆθ0 t,K) + αcµβ0 t,I t xt ( V 0

(kµ cµ)β0 t,K xt ( V 0 t,K) 1 + αcµβ0 t,I t xt ( V 0

(145) Since x t θI t > 0, µ(x t θK) µ(x t θI t ) kµx t θK. Hence, we also have

kµ )µ(x t θK) + α cµ

kµ µ(x t θI t ) µ((1 α)x t θK)

=µ(x t θK) µ((1 α)x t θK) α cµ

µ(x t θK) µ(x t θI t )

cµαx t θK α cµ

kµ kµx t θK = 0.

Combining these four cases, we have

E[rt,I0 t |Ft 1, It] µ((1 α)x t θK)

2kµβ0 t,K xt ( V 0 t,K) 1 + 2kµβ0 t,I t xt

Published in Transactions on Machine Learning Research (02/2023)

E.6 Proof of Lemma 8

The agent s maximum-likelihood estimation can be written as the solution to the following equation X

n τi(t 1) (rn µ(x n ˆθt,i))xn = 0. (148)

As described in the section E.2, we have gt,i(ˆθi) gt,i((1 α)θK) = Z1 + Z2 and

x t (ˆθi (1 α)θK) = x t G 1 t,i (Z1 + Z2). (149)

We set Z1 = P

n τi(t 1)(µ(x n θI0n) µ((1 α)x n θK))xn and Z2 = P

n τi(t 1) ηnxn.

In the white-box attack case, we have E[µ(x k θI0 k)|Fk 1] = µ((1 α)x k θK) and hence E[Z1|Fk 1] = 0. Under the proposed black-box attack, E[Z1|Fk 1] = 0 but E[µ(x t θI0 t )|Ft 1, It] (1 α) xt, θK

2kµβ0 t,K xt ( V 0 t,K) 1 + 2kµβ0 t,I t xt

We set Z1 = Z5 + Z6, where

n τi(t 1) (µ(x n θI0n) E[µ(x t θI0 t )|Ft 1, It])xn

n τi(t 1) (E[µ(x t θI0 t )|Ft 1, It] µ((1 α)x n θK))xn.

For any context x D and arm i = K, we have

|x (ˆθt,i (1 α)θK)|

|x G 1 t,i Z2| + |x G 1 t,i Z5| + |x G 1 t,i Z6|. (151)

Since we have n µ(x k θI0 k) E[µ(x k θI0 k)|Fk 1, Ik] o

k τi(t 1) is a bounded martingale difference sequence w.r.t

the filtration {Fk, Ik}k τi(t 1) and is also kµLS-sub-Gaussian-sub-Gaussian.

From Theorem 1 and Lemma 11 in (Abbasi-Yadkori et al., 2011), we know that for any δ > 0, with probability at least 1 K 1

Z5 V 1 t,i 2kµLS

δ + d log 1 + L2Ni(t)

for any arm i = K and all t > 0.

We have the fact that

xt 2 V 1 t,i = x t V 1 t,i

k τi(t 1) xkx k + λI

x t V 1 t,i

k τi(t 1) xkx k

k τi(t 1) (x t V 1 t,i xk)2.

Published in Transactions on Machine Learning Research (02/2023)

Hence, xt 2 G 1 t,i = x t G 1 t,i Gt,i G 1 t,i xt

cµx t V 1 t,i

k τi(t 1) xkx k

k τi(t 1) (x t G 1 t,i xk)2,

and hence P k τi(t 1)(x t G 1 t,i xk)2 1 c2µ xt 2 V 1 t,i .

Then, |x G 1 t,i Z6| can be upper bounded by

|x G 1 t,i Z6|

E[rk,I0 k|Fk 1, Ik] µ((1 α)x k θK) 2s X

k τi(t 1) (x t G 1 t,i xk)2

2kµβ0 k,K xk ( V 0 k,K) 1 + 2kµβ0 k,I k xk

1 cµ xt V 1 t,i ,

where the first inequality is obtained from Cauchy-Schwarz inequality, the second inequality is obtained from Lemma 7 and Equation 153.

In addition, by the fact that (a + b)2 2a2 + 2b2 for any real number, we have

2kµβ0 k,K xk ( V 0 k,K) 1 + 2kµβ0 k,I k xk

k τi(t 1) 2 2kµβ0 k,K xk ( V 0 k,K) 1 2 + X

k τi(t 1) 2

2kµβ0 k,I k xk

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and get, for any arm i,

k τ i (t 1) xk 2 (V 0 k,i) 1 2d log 1 + Ni(t)L2

2d log 1 + t L2

Set λ = λ0, we have x 2 V 1 t,i (1 + λ

λ0 ) x 2 V 1 t,i 2 x 2 V 1 t,i .

By the fact that P

i τi(t 1) = τ K(t 1), and P

i =K τi(t 1) = P

i =K τ i (t 1), we have, for any arm i,

τi(t 1) τ K(t 1), and τi(t 1) P j =K τ j (t 1). Thus,

k τi(t 1) xk 2 ( V 0 k,K) 1 X

k τ K(t 1) xk 2 ( V 0 k,K) 1 4d log 1 + t L2

k τi(t 1) xk 2

k τ i (t 1) xk 2 ( V 0 k,i) 1 4(K 1)d log 1 + t L2

Published in Transactions on Machine Learning Research (02/2023)

By combining Equation 156, Equation 158 and Equation 159 and when K 3, we have

2kµβ0 k,K xk (V 0 k,K) 1 + 2kµβ0 k,I k xk

k τi(t 1) 2 2kµβ0 k,K xk (V 0 k,K) 1 2 + X

k τi(t 1) 2

2kµβ0 k,I k xk

2ϕK kµLS + R

δ + d log 1 + L2t

8k2 µ 16d2 log 1 + t L2

cµα kµLS + R

δ + d log 1 + L2t

8k2 µ 16d2(K 1) log 1 + t L2

128k2 µd2 2kµLS + 2R

δ + d log 1 + L2t

log 1 + t L2

(K 1) k2 µ c2µα2 + (1 + kµ

128k2 µd2 2kµLS + 2R

δ + d log 1 + L2t

log 1 + t L2

2K k2 µ c2µα2 .

In summary, we have

|x t ˆθt,i x t (1 α)θK|

1 + 16k2 µd cµα

K log 1 + t L2

δ + d log 1 + L2t

x V 1 t,i . (161)

E.7 Proof of Theorem 4

For round t and context xt, if UCB-GLM pulls arm i = K, we have

x t ˆθt,K + βt,K q

x t V 1 t,Kxt x t ˆθt,i + βt,i q

x t V 1 t,i xt.

Recall βt,i = 4R

δ + d log 1 + L2Ni(t)

Since the attacker does not attack the target arm, the confidence bound of arm K does not change and

x t θK x t ˆθt,K + βt,K q

x t V 1 t,Kxt holds with probability 1 δ

Thus, by Lemma Equation 8,

x t θK x t ˆθt,i + βt,i q

x t V 1 t,i xt

x t (1 α)θK + 2kµLS + 2R

1 + 16k2 µd cµα

K log 1 + t L2

δ + d log 1 + L2t

x V 1 t,i .

(162) By multiplying both sides 1{It=i} and summing over rounds, we have

k=1 1{Ik=i}αx k θK

k=1 1{Ik=i} 2kµLS + 2R

1 + 16k2 µd cµα

K log 1 + t L2

δ + d log 1 + L2t

x V 1 t,i .

Published in Transactions on Machine Learning Research (02/2023)

Here, we use Lemma 11 from (Abbasi-Yadkori et al., 2011) and obtain

k=1 1{Ik=i} xk 2 V 1 k,i 2d log(1 + Ni(t)L2

dλ ) 2d log 1 + t L2

According to PT k=1 1{Ik=i} xk V 1 k,i r

Ni(t) PT k=1 1{Ik=i} xk 2 V 1 k,i , we have

k=1 1{Ik=i} xk V 1 k,i

Ni(t)2d log 1 + t L2

Set λ = λ0, we have x 2 V 1 t,i (1 + λ

λ0 ) x 2 V 1 t,i 2 x 2 V 1 t,i . Thus, we have

k=1 1{Ik=i}αx k θK

1 + 16k2 µd cµα

K log 1 + t L2

δ + d log 1 + L2t

4Ni(t)d log 1 + t L2

k=1 1{Ik=i}

2 log 1 + t L2

δ + d log 1 + L2t

1 + 16k2 µd cµα

K log 1 + t L2

where γ = minx D x, θK .