# stochastic_bandits_robust_to_adversarial_attacks__9ec74b2d.pdf Published as a conference paper at ICLR 2025 STOCHASTIC BANDITS ROBUST TO ADVERSARIAL ATTACKS Xuchuang Wang CICS, UMass Amherst Maoli Liu CSE, CUHK Jinhang Zuo CS, City U Xutong Liu ECE, CMU John C.S. Lui CSE, CUHK Mohammad Hajiesmaili CICS, UMass Amherst This paper investigates stochastic bandit algorithms that are robust to adversarial attacks, where an attacker can first observe the learner s action and then alter their reward observation. We study two cases of this model, with or without the knowledge of an attack budget C, defined as an upper bound of the summation of the difference between the actual and altered rewards. For both cases, we devise two types of algorithms with regret bounds having additive or multiplicative C dependence terms. For the known attack budget case, we prove our algorithms achieve the regret bound of O((K/ ) log T + KC) and O( KTC) for the additive and multiplicative C terms, respectively, where K is the number of arms, T is the time horizon, is the gap between the expected rewards of the optimal arm and the second-best arm, and O hides the logarithmic factors. For the unknown case, we prove our algorithms achieve the regret bound of O( KT + KC2) and O(KC T) for the additive and multiplicative C terms, respectively. In addition to these upper bound results, we provide several lower bounds showing the tightness of our bounds and the optimality of our algorithms. These results delineate an intrinsic separation between the bandits with attacks and corruption models. 1 INTRODUCTION Online learning literature (Borodin & El-Yaniv, 2005, 7.1.2) usually considers two types of nonoblivious adversary models: the medium adversary and the strong adversary.1 The medium adversary chooses the next instance before observing the learner s actions, while the strong adversary chooses instances after observing the learner s actions. When it comes to the multi-armed bandits (MAB) learning with an adversary, the medium adversary corresponds to the bandits with corruption (Lykouris et al., 2018), and the strong adversary corresponds to adversarial attacks on bandits (Jun et al., 2018). Bandit algorithms robust to corruption are developed for the medium adversary models in bandits literature, e.g., Auer et al. (2002); Audibert & Bubeck (2010); Lykouris et al. (2018); Gupta et al. (2019). However, for the strong adversary model, i.e., the adversarial attack model on MAB, most previous efforts focus on devising attacking policies to mislead the learner to pull a suboptimal arm and thus suffer a linear regret, e.g., Jun et al. (2018); Liu & Shroff (2019); Zuo (2024). Developing robust algorithms for the adversarial attack model and achieving regret with benign dependence on the total attack is still largely unexplored (details discussed at the end of 1). Stochastic MAB with adversarial attacks In this paper, we study the stochastic MAB under adversarial attacks and develop robust algorithms whose regret bounds degrade gracefully in the presence of such attacks. Denote by K N+ as the number of arms, and each arm k is associated with a reward random variable Xk with an unknown mean µk. Denote by k as the arm index with the highest mean reward, and k := µk µk the reward gap between the optimal arm k and any suboptimal arm k. The learner aims to minimize the regret, defined as the difference between the highest total rewards of pulling a single arm and the accumulated rewards of the concern algorithm. 1There is a third adversary model, called the weak adversary, which is oblivious. The medium adversary is also called the adaptive-online adversary, while the strong adversary is called the adaptive-offline adversary. Published as a conference paper at ICLR 2025 In each time slot, the learner first pulls an arm. Then, the adversary observes the pulled arm and its realized reward and chooses an attacked reward for the learner to observe. Denote by C as the total amount of attacks that the adversary uses to alter the original rewards to the attacked rewards over all rounds. We present the extended formal model definitions in 2. Model difference between bandits with corruptions and attacks. Different from the attack (strong adversary) above, in each time slot, the medium adversary (corruption) chooses the corrupted rewards for all arms before observing which arm is pulled by the learner. This simple order alternation yields intrinsic differences, making the attack model more challenging than the corruption model. Specifically, the attack model (attack after observing the pulled arm) makes the randomized arm pulling policies, widely used in algorithms for bandits robust to corruptions (Lykouris et al., 2018; Gupta et al., 2019), invalid. Without randomization, devising robust algorithms for attacks is much more challenging than that for corruptions. We discuss more details of the differences in 2. Contributions This paper offers a thorough investigation into robust algorithms for stochastic MABs under an adversarial attack model. We explore both known and unknown attack budget C scenarios, where the unknown attack budget case is also known as the adaptive budget case. For both cases, we develop robust algorithms with two kinds of regret bounds having additive or multiplicative C dependence terms. We summarize our contributions as follows: Known attack budget case. For the known attack case ( 4), we first investigate the successive elimination with wider confidence radius (SE-WR) algorithm, which was initially proposed for bandits with corruptions (Lykouris et al., 2018). In 4.1, we show SE-WR can be applied to the bandits with known adversarial attack budget and prove that this algorithm enjoys a tighter regret upper bound O(P k =k log T/ k +KC) (under both attack and corruption models) than previous analysis (Lykouris et al., 2018) under the corruption model. This improvement removes the original additive term P k =k C/ k s dependence of k and reduces it to KC. To achieve gap-independent regret bounds in 4.2, we propose two stopping conditions for the successive elimination to modify the SE-WR algorithm to two SE-WR-Stop algorithms (see 1 in Figure 1). We prove that the regrets of these two modified algorithms are O( KT log T + KC) and O( p KT(log T + C)) respectively. Unknown attack budget case. To address the MAB with unknown attack case ( 5), we use algorithms proposed for known attacks as building blocks to devise algorithms for unknown attacks. We consider two types of algorithmic techniques. In 5.1, we apply a multi-phase idea to extend the SE-WR algorithm to a phase elimination algorithm (PE-WR, see 2 in Figure 1). The PE-WR algorithm utilizes a multi-phase structure each phase with carefully chosen attack budgets and phase lengths to defend against unknown attacks. We show that the regret of PE-WR is upper bounded by O( KT + KC2) (additive). In 5.2, we apply a model selection technique, which treats the unknown attack C as a parameter in model selection (see 3 in Figure 1). Specifically, we consider log2 T base instances of SE-WR-Stop with different input attack budgets and then use the model selection technique to corral these base instances, composing the model-selection SE-WR-Stop (called MS-SE-WR) algorithm. We upper bound the regret of MS-SE-WR by O( KCT 2 3 ) or O(KC T) (multiplicative) depending on the choice of the model selection algorithm. Figure 1 summarizes the relations of algorithm design, and Figure 2 shows which type of algorithm performs better, those with additive bounds or those with multiplicative bounds, varies depending on the total attacks, highlighting the necessity to study both types of bounds. Lower bounds. In 3, we also provide lower bound results to show the tightness of our upper bounds. We first show a general Ω(KC) lower bound for any algorithm under attack. This lower bound improves a factor K of the number of arms upon prior Ω(C) lower bounds (Gupta et al., 2019). Based on this, we propose the gap-dependent lower bound Ω(P k =k log T/ k + KC) (informal expression) and gap-independent lower bound Ω( KT + KC). Our refined upper bound of SE-WR matches the gap-dependent lower bound up to some prefactors, and one of our proposed SE-WR-Stop algorithms matches the gap-independent lower bound up to some logarithmic factors. To show the tightness of our upper bounds for the unknown attack algorithms, we derive lower bounds, Ω(T α + C 1 α ) and Ω(C 1 α 1T α) with α [ 1 2, 1), for two classes of algorithms whose regret upper bounds follow certain additive and multiplicative forms respectively. With α = 1 2, these two lower bounds, becoming Ω( T + C2) and Ω(C T), match the upper bounds of PE-WR and MS-SE-WR in terms of T and C. We summarize our analysis results in Table 1. Published as a conference paper at ICLR 2025 Figure 1: Algorithm design overview: only SE-WR has a gap-dependent bound; others are all gap-independent. 0.0 0.2 0.4 0.6 0.8 1.0 Attack parameter of C = O(T ) Regret parameter of RT = O(T ) O( KT + KC2) Figure 2: Comparison of unknown C regrets (see Remark 11 for detail) Table 1: Results overview: upper bounds with are for pseudo regret in expectation, while all other upper bounds are for realized regret with high probability; lower bounds with are only for special classes of algorithms (see Propositions 4 and 5), and the lower bound with holds when T . We study both additive and multiplicative bounds because which one is better depends on the value of C (see Figure 2 for unknown C case and the detailed discussion in Remark 11). Regret Bounds Known Attack C ( 4) Unknown Attack C ( 5) Additive C O P k + KC , O( KT log T + KC) O Multiplicative C O( p KT(log T + C)) O( KCT 2 3 ) , O(KC Lower Bounds ( 3) Ω P k + KC , Ω( KT + KC) Ω(T α + C 1 α ) , Ω(C 1 α 1T α) Experiment. Besides the above analytical results, we also conduct experiments to validate the performance of our algorithms. The attacker employs the attack strategy proposed in Jun et al. (2018) when there are remaining attack budgets. We compare the regrets of our algorithms with prior algorithms under different attack budgets. The results show that our algorithms outperform the state-of-the-art algorithms in most cases, especially when the attack budget is large. We provide more details in 6. Result separation between corruption and attack. With the results in this paper, we demonstrate a clear separation between corruption (medium adversary) and attack (strong adversary) models in terms of the additive regret bounds. Denote by C the total corruption to distinguish from the total attack C. In the case of known corruption/attack budget, while both achieving the gap-dependent regret bounds P k =k (1/ k) log T in T-dependent terms, we show that the regret due to attack is Θ(KC) which has a factor of K worse than the additional Θ(C ) regret due to corruption. That implies, with the same amount of budget, an attacker can produce K times larger impact on regret than that under the corruption model. For unknown budget cases, while bandits with corruptions have the regret bounds of O(P k =k log2(KT/δ)/ k + KC ) (Gupta et al., 2019), it is impossible for attack case to achieve a regret upper bound in the form of O (ploylog(T) + Cα) for any α > 0 (contradicts with Theorem 3). Instead, we show that the regret with unknown attacks can attain O( KT + KC2) where the C2 additive term is tight. Even with ignoring the worse order of the first T-related terms, the second regret term due to attack O(C2) is still a factor of C worse than the additional O(C ) regret due to corruption. That is, an attack with O( T) budget is enough to make the algorithm suffer a linear regret in the attack model,2 while one needs a linear O(T) corruption budget to achieve the same effect in the corruption case. This separation is also studied in linear bandits in the concurrent work (Liu et al., 2024). Extended literature review Robustness against corruption. Lykouris et al. (2018) propose the bandits with corruption model and devised algorithms with O((K log T + KC )/ ) regret for known corruption C case and algorithms with O(C K2 log T/ ) regret for unknown C case, where 2This statement is for the algorithm with O( KT + KC2) mentioned above. Our lower bound results do not exclude the possible existence of other algorithms with upper bounds O(T α + C 1 α ) for α [ 1 2, 1), in which case a sublinear O(T α) budget can make the algorithm suffer linear regret. Published as a conference paper at ICLR 2025 Procedure 1 Decision under Attacks 1: for each round t = 1, 2, . . . , T do 2: The learner pulls an arm It 3: A stochastic reward XIt,t is drawn from this arm It 4: The adversary observes the pulled arm It and its realized reward XIt,t 5: The adversary chooses an attacked reward XIt,t 6: The learner only observes the attacked reward XIt,t, and not XIt,t denotes the smallest non-zero reward gap. Later, Gupta et al. (2019) study the same model and propose an algorithm that improves the multiplicative C dependence of regret for unknown C case to additive C , i.e., O(K log2 T/ + KC ). However, the robust design of both algorithms above relies on randomly pulling arms, which is not feasible in our attack setting. Besides bandits robust to corruption, there is another line of works on best-of-both-world bandit algorithm design, e.g., Seldin & Lugosi (2017); Wei & Luo (2018); Zimmert & Seldin (2021); Masoudian & Seldin (2021), where their algorithms can be applied to the bandit with corruption but do not work in the adversarial attack model either. Notably, Zimmert & Seldin (2021); Masoudian & Seldin (2021) show that their best-of-both-worlds algorithms can be applied to the bandit with corruption model and achieve O(K log T/ + C log T) regret, which can be better than that of Gupta et al. (2019) in some parameter regimes. Attacks on bandits. Attack policy design for MAB algorithms has been studied by Jun et al. (2018); Liu & Shroff (2019); Xu et al. (2021); Zuo et al. (2024); Zuo (2024) and many others. They aim to design attacking policies to mislead the learner, typically with logarithmic regrets when there is no attack, to pull a suboptimal arm and thus suffer a linear regret. This is the opposite of our objective for designing robust algorithms to achieve sublinear regret under any attacks. Robustness against attacks. The only existing results on robust algorithms for MAB under attacks are by Zuo (2024, Section 6) and Rangi et al. (2022). Zuo et al. (2024, Section 6) uses competitive ratio as the objective used when one cannot achieve sublinear regret, instead of the finer-grained regret studied in this paper. Rangi et al. (2022) studies how the verification scheme , an additional assumption, can help the learner to detect the attack and achieve sublinear regret, but they do not study robust algorithms as this paper does. Apart from the MAB model, there are works studying robust algorithms on structured bandits under attacks, e.g., linear bandits (Bogunovic et al., 2021; He et al., 2022), Gaussian bandits (Bogunovic et al., 2020; 2022), and Lipschitz bandits (Kang et al., 2024). Among them, Bogunovic et al. (2020) are the first to devise robust algorithms for bandits under attacks (referred to as corruptions). Although their bandits results can be reduced to MAB, this reduction does not provide tight regrets as in Table 1 of this paper. We elaborate on this in 5. 2 MODEL: MAB WITH ADVERSARIAL ATTACKS We consider a MAB with K N+ arms. Each arm k K := {1, 2, . . . , K} is associated with a reward random variable Xk [0, 1] with an unknown mean µk. Denote the unique arm with highest reward mean as k , i.e., k = arg maxk K µk.3 We consider T N+ decision rounds. At each round t T := {1, 2, . . . , T}, the learner selects an arm It K, and the arm generates a reward realization XIt,t (not disclosed to the learner yet). The adversary observes the pulled arm It and its realized reward XIt,t and then chooses an attacked reward XIt,t for the learner to observe. The total attack is the sum of the absolute differences between the raw rewards and the attacked rewards over all rounds, i.e., C := PT t=1|XIt,t XIt,t|. We summarize the decision process in Procedure 1. Regret objective The adversary aims to maximize the learner s regret, while the learner aims to minimize the regret. We define the (realized) regret as the difference between the highest total realized raw rewards of a single arm and the accumulated raw rewards of the learning algorithm, RT := max k K t T (Xk,t XIt,t). (1) 3We assume the uniqueness of the optimal arm for simplicity. Our algorithms and results (with slightly change) also work for the case with multiple optimal arms. Published as a conference paper at ICLR 2025 Besides the realized regret in (1), another common regret definition is the pseudo-regret, defined as RT := maxk K E P t T (Xk,t XIt,t) = E Tµk P t T µIt . By Jensen s inequality, we have RT E[RT ], implying that the pseudo-regret is a weaker metric than the realized regret. The paper focuses on realized regret, with pseudo-regret results included as needed. Comparison between corruption and attack models Unlike the attack model, prior work on robust bandit algorithms against corruption (Lykouris et al., 2018; Gupta et al., 2019) involves the adversary selecting corruption rewards before observing the learner s actions (see Appendix B for details). The total corruption is defined as C := PT t=1 maxk|Xt(k) Xk,t|. Although the definitions of corruption and attacks differ slightly, the models are fundamentally distinct due to the sequence of events: in the attack model, the adversary modifies rewards after observing the learner s actions, whereas in the corruption model, the modification occurs before. To illustrate this distinction, consider that achieving the same impact on a bandit algorithm requires a significantly larger budget under the corruption model compared to the attack model. For example, a naive attack policy that forces the learner to incur linear regrets might involve altering the reward of the optimal arm k whenever it is selected, making it smaller than the mean of the best suboptimal arm (Liu & Shroff, 2019). In the attack model, where the adversary observes the learner s action before attacking, this strategy may require only an O(log T) or sublinear O(T α) budget for some α < 1, depending on the specific algorithm. In contrast, under the corruption model, since the adversary does not know which arm will be pulled before corrupting the rewards, they would need to corrupt the optimal arm s reward realization in all T rounds to achieve the same effect, resulting in an O(T) cost. This is substantially higher than the sublinear attack budget. From another perspective, designing robust algorithms is more challenging in the attack model because the adversary s ability to act after observing the selected arm renders common randomized strategies ineffective. For instance, algorithms like EXP3.P (Auer et al., 2002) and BARBAR (Gupta et al., 2019) utilize randomization to mitigate corruption, but this approach is less effective against adversaries who can adapt their attacks based on observed actions. Therefore, developing robust algorithms against such adaptive adversaries is significantly more complex. 3 LOWER BOUNDS This section first presents a general lower bound for the adversarial attack model on stochastic multiarmed bandits (Theorem 1) and two of its variants. Later, we derive two lower bounds for two special classes of bandit algorithms (Propositions 4 and 5). All proofs are deferred to Appendix C. 3.1 A GENERAL LOWER BOUND We first present a general lower bound Ω(KC) in Theorem 1, originally derived in the linear bandit scenario by Bogunovic et al. (2021, Theorem 3). Together with known lower bounds of MAB, we derive Proposition 2. The 4 presents algorithms matching these lower bounds. Theorem 1. Given a stochastic multi-armed bandit game with K arms, under attack with budget C, and T > KC decision rounds,4 for any bandits algorithm, there exists an attack policy with budget C that can make the algorithm suffer Ω(KC) regret. Proposition 2. For gap-dependent regret bounds and any consistent bandit algorithm5, the lower bound is roughly Ω P k + KC , or formally, for some universal constant ξ > 0, 1 2 k . (2) For gap-independent cases, the lower bound is RT Ω( KT + KC). (3) 3.2 TWO LOWER BOUNDS FOR SPECIAL ALGORITHM CLASSES We first recall results from adversarial bandit attacks on bandits (Liu & Shroff, 2019; Zuo, 2024) in Theorem 3, then derive two lower bounds for bandit algorithms with additive and multiplicative regret bounds respectively. In 5, we present algorithms that match these lower bounds. 4Note that T KC implies that C = Ω(T) which trivially results in a linear regret. 5When C = 0, a consistent bandits algorithm has regret RT O(T α) for any α (0, 1). Published as a conference paper at ICLR 2025 Algorithm 2 SE-WR: Successive Elimination with Wide Confidence Radius Input: total attack C (or budget CInput), number of arms K, decision rounds T, parameter δ Initialize: candidate arm set S K, number of arm pulls Nk 0, reward estimates µk 0 1: while t T do 2: Uniformly pull each arm in S once and observes Xk for all arms k S 3: Update parameters: t t + |S|, Nk Nk + 1, µk µk(Nk 1)+ Xk Nk for all arms k S 4: µmax maxk µk 5: S k S : µk µmax 2 q Theorem 3 (Adapted from (Zuo, 2024, Fact 1) and (He et al., 2022, Theorem 4.12)). If an algorithm achieves regret RT when total attack C = 0, then there exists an attack policy with C = Θ(RT ) that can make the algorithm suffer linear regret. Proposition 4 (Achievable additive regret bounds). Given any parameter α [ 1 2, 1), for any bandit algorithm with an additive regret upper bound of the form O(T α + Cβ), we have β 1 Proposition 5 (Achievable multiplicative regret bounds). Given parameter α [ 1 2, 1), for any bandit algorithm with a multiplicative regret bound of the form O(T αCβ), we have β 1 4 ALGORITHMS WITH KNOWN ATTACK BUDGET In this section, we study the stochastic multi-armed bandit problem with a known attack budget. In 4.1, we first present an algorithm with an additive gap-dependent upper bound on the attack budget. Then, in 4.2, we modify this algorithm to two variants with gap-independent upper bounds, one with an additive C term and another with a multiplicative C term. 4.1 SE-WR: AN ALGORITHM WITH GAP-DEPENDENT UPPER BOUND Given the knowledge of attack budget C, one is able to predict the worst-case attack and design an algorithm to defend against it. Here, the robustness is achieved by widening the confidence radius of the reward estimate to account for the C attack such that the corresponding widened confidence interval contains the true reward mean with a high probability (as if the rewards were not attacked). Denote µk as the empirical mean of the attacked reward observations of arm k. The widened confidence interval is centered at µk and has a radius of p log(2/δ)/Nk+C/Nk, where Nk is the number of times arm k is pulled and δ is the confidence parameter. Then, we utilize the successive elimination (SE) algorithm (Even-Dar et al., 2006) to devise a robust bandit algorithm in Algorithm 2. The algorithm maintains a candidate set of arms S, initialized as the set of all arms K, and successively eliminate suboptimal arms according to the widened confidence intervals (Line 5), called successive elimination with wide radius (SE-WR). A similar algorithm design idea was also used in Lykouris et al. (2018) for the stochastic bandit problem with corruption attacks. Theorem 6 provides the regret upper bound of Algorithm 2. Theorem 6. Given the total attack or its upper bound C, Algorithm 2 s the realized regret is upper bounded by RT O(P k =k log(KT/δ)/ k + KC) with probability 1 δ, and, with δ = K/T, its pseudo regret is upper bounded as RT O(P k =k log T/ k + KC). The regret upper bound in Theorem 6 matches the lower bound in (2) in terms of both the first classic regret term and the second additive attack Ω(KC) term. This shows that the regret upper bound is nearly optimal up to some prefactors. Our regret upper bounds improve the O(P k =k log(KT/δ)/ k + P k =k C/ k) in Lykouris et al. (2018, Theorem 1) by removing the multiplicative dependence of 1/ k on the total attack C term. This improvement comes from a tighter analysis for sufficient pulling times for eliminating a suboptimal arm (see Lemma 14 in Appendix E). This improvement is crucial for achieving our tight multiplicative regret bounds for both known and unknown attack budget cases (Sections 4.2 and 5.2). Published as a conference paper at ICLR 2025 4.2 SE-WR-Stop: ALGORITHMS WITH GAP-INDEPENDENT UPPER BOUNDS We follow the approach of stochastic bandits literature (Lattimore & Szepesv ari, 2020, Chapter 6) to transfer the algorithm with a gap-dependent upper bound to another algorithm with gap-independent bounds. The main idea is to replace the while loop condition in Line 1 in Algorithm 2 with nuanced stopping conditions. If the stopping condition is triggered before the end of the MAB game, i.e., t = T, the algorithm will uniformly pull the remaining arms in the candidate set S until the end of the game. The pseudo-code is deferred to Algorithm 6 in Appendix D for its similarity to SE-WR. We consider two kinds of stopping conditions. The first condition aims to transfer the gap-dependent bound to an independent one while keeping the additive KC term in the upper bound. We start from a stopping condition Nk log(KT/δ)/ϵ2 + C/ϵ, where ϵ is a parameter to be determined. With this condition and the proof details of Theorem 6, we can upper bound the realized regret as follows, RT O(K log(KT/δ)/ϵ+ϵT +KC) O( p KT log(KT/δ)+KC), where we choose ϵ = p K log(KT/δ)/T in the last inequality. The second stopping condition, while transferring the gap-dependent to independent, also converts the additive KC term to multiplication. To derive it, we start from a stopping condition Nk log(KT/δ)+C ϵ2 , where ϵ is a parameter to be determined. Similar to the derivation for the first condition, we can upper bound the realized regret as follows, RT O((K log(KT/δ) + KC)/ϵ+ϵT) O( p KT(log(KT/δ) + C)) where the last inequality sets ϵ = p (K log(KT/δ) + KC)/T. The above results are summarized in Theorem 7 below. Theorem 7. Given the total attack or its upper bound C, SE-WR-Stop (Algorithm 6) with stopping condition Nk T T K log(KT/δ) has the realized regret upper bounded as follows, KT log(KT/δ) + KC) with probability 1 δ, (4) and SE-WR-Stop with stopping condition Nk T K has the realized regret as follows, KT(log(KT/δ) + C)) with probability 1 δ. (5) Let δ = K/T, the regret in (4) becomes O KT log T + KC , matching the lower bound in (3) up to logarithmic factors. This implies the regret bound in (4) is nearly optimal. Let δ = K/T, the regret in (5) becomes O( p KT(log T + C)). Although this bound cannot match the lower bound in (3), its dependence on total attack C is square root, better than the linear dependence in (4). This better dependence will shine when devising algorithms for unknown attack budgets in 5.2. 5 ALGORITHMS WITH UNKNOWN ATTACK BUDGET This section presents algorithms that can defend against attacks with unknown attack budgets. We present algorithms with additive and multiplicative upper bounds in 5.1 and 5.2 respectively. Before diving into the detailed algorithm designs, we recall that the input attack budget CInput for SE-WR and SE-WR-Stop can be interpreted as the level of robustness that the algorithms have. For any actual attack C CInput, the algorithms achieve the regret upper bounds in Theorems 6 and 7 respectively; for attack C > CInput, the algorithms may be misled by the attacker and suffer linear regret. One simple way to achieve robustness against unknown attacks is to set the CInput = p T/K for SE-WR in 4.1, called SE-WR p T/K later, which leads to a piece-wise regret bound KT) if C = O( p T/K), O(T) if C = Ω( p We note that even this simple conversion-based bound in (6) is new among literature, thanks to our tighter analysis in Theorem 6. The best prior regret bound with a similar form is from He et al. (2022) (reduced to stochastic bandits): RT = O(K T) if C = O( T); RT = O(T) if C = Ω( T), which is worse than our bound in (6) by a factor of K. Without the knowledge of the actual attack budget C, the performance of SE-WR and SE-WR-Stop algorithms has a robust-or-not separation as in (6), which is not satisfactory for practical applications, especially when the actual corruption is small. In this section, we apply two types of smoothing algorithmic techniques to smooth this robust-or-not separation of SE-WR and SE-WR-Stop so as to achieve robustness against unknown attacks. An overview of algorithm design is in Figure 1 (see 2 and 3 ). Published as a conference paper at ICLR 2025 Algorithm 3 PE-WR: Phase Elimination with Wide Confidence Radius Input: number of arms K, decision rounds T, confidence parameter δ Initialize: candidate arm set S K, the number of arm pulls Nk 0, reward mean estimates ˆµk 0, phase h 0, phase upper bound H log2 T 1, initial pulling times L0 K 1: for each phase h = 0, 1, . . . do 2: ˆCh min{ T/K, 2H h 1} 3: Pull each arm in S for Lh/|Sh| times 4: Update arm empirical means ˆµk,h for all arms k S 5: Sh+1 k Sh : ˆµk,h max k Sh ˆµk ,h 2 q Lh/|Sh| + ˆ Ch Lh/|Sh| 6: Lh+1 2Lh On the other hand, robust algorithms designed for unknown corruptions, e.g., Lykouris et al. (2018); Gupta et al. (2019), do not apply to the unknown attack setting. Because these robust algorithms all rely on some randomized action mechanism to defend against corruption, which is invalid for the attack as the attacker can manipulate the rewards of the arms after observing the learner s actions. 5.1 PE-WR: AN ALGORITHM WITH ADDITIVE UPPER BOUND The first smoothing algorithmic technique is applying the multi-phase structure to modify SE-WR (Algorithm 2), yielding the phase elimination with wide confidence radius algorithm, PE-WR (Algorithm 3) for unknown attack budget. This algorithm considers multiple phases of SE-WR, each with a different assumed attack budget CInput, and the candidate arm set S is updated consecutively among phases, that is, the arms eliminated in previous phases are not considered in later phases. Specifically, PE-WR separates the total decision rounds into multiple phases h {1, 2, . . . }, each with a length doubled from the previous phase (Line 6). In phase h, the elimination assumes the potential attack is upper bounded by ˆCh, which is halved in each phase (Line 2). This mechanism results in two kinds of phases. Denote h as the first phase whose corresponding ˆCh is smaller than the true attack budget C. Then, for phases h < h , we have ˆCh C, and, therefore, the algorithm eliminates arms properly as SE-WR; for phases h h , as ˆCh < C, the algorithm may falsely eliminate the optimal arm and suffer extra regret. But even though the optimal arm is eliminated, the algorithm in these phases still has the level of ˆCh robustness, which guarantees some relatively good suboptimal arms with a mean close to the optimal arm remaining in the candidate set S such that the total regret due to losing the optimal arm in these phases is not large (bounded by KC2 in the proof). We present PE-WR in Algorithm 3 and its regret upper bound in Theorem 8. Theorem 8. Algorithm 3 has a realized regret upper bound, with a probability of 1 δ, RT O( p KT log (K log T/δ) + T log T + KC log T + KC2), and by setting δ = K/T, a pseudo regret upper bound RT O( Comparing our upper bound in Theorem 8 to the lower bound in Proposition 4 with α = 1/2, Ω( T + C2), our upper bound is tight in terms of both T and C. Compared with the lower bound Ω( KT + KC) in Theorem 2, our upper bound is also tight in terms of K. We also note that the bound in Theorem 8 for PE-WR has the same order as the robust-or-not piece-wise bound in (6), but PE-WR is expected to be empirically superior, especially when the actual corruption is small. This multi-phase elimination idea is widely used in bandit literature, e.g., batched bandits (Gao et al., 2019), multi-player bandits (Wang et al., 2019), and linear bandits (Bogunovic et al., 2021). The most related to ours is the phase elimination algorithm proposed in Bogunovic et al. (2021) for linear bandits. However, directly reducing their results (Bogunovic et al., 2021, Theorem 2) to our stochastic bandit case would yield a O( K log T + C2) upper bound for C T/(K log(log K) log T) only, and, for general C, their results would become O( K log T + K2C2). Our result in Theorem 8, simplified as O( KT + KC log T + KC2), is tighter than theirs by a factor of K on the second log T term, and, for the general C case, our result is tighter by a factor of K on the third C2 term. This improvement comes from the tighter confidence bound employed in Line 5 of Algorithm 3 and the corresponding tighter analysis for the necessary number of observations for the algorithm to eliminate arms (Lemma 14) in 4.1. Published as a conference paper at ICLR 2025 Algorithm 4 MS-SE-WR: Model Selection with Wide Confidence Radius Input: number of arms K, decision rounds T, model selection algorithm Mod Sel Alg 1: Construct G := log2 T base algorithm instances as B {SE-WR-Stop(C = 2g, K, T, δ = K/T) : g = 1, 2, . . . , G}. 2: Mod Sel Alg(B) for T decision rounds 5.2 MS-SE-WR: AN ALGORITHM WITH MULTIPLICATIVE UPPER BOUND While the additive bound in Theorem 8 can be transferred to a multiplicative bound O(KC2 T), this bound cannot match the lower bound in Proposition 5, e.g., Ω(C T) for α = 1 2. In this section, we deploy another algorithmic technique, model selection, to SE-WR-Stop to achieve tight multiplicative upper bounds for the unknown attack budget case. Unlike the multi-phase technique, where each phase assumes a different attack budget, we consider multiple algorithm instances with different attack budget inputs, called models. Model selection means selecting the model (algorithm instance) that performs best in the unknown environment. In our scenario, it is to select the instance with the smallest attack budget input CInput that is larger than the actual attack budget C. Specially, we consider G := log2 T instances of SE-WR-Stop, each with a different attack budget input CInput = 2g, g = 1, 2, . . . , G. Among these instances, the best is g = log2 C , as 2g [C, 2C) is the smallest attack budget larger than the actual attack C. We then apply a model selection algorithm, e.g., CORRAL (Agarwal et al., 2017) or EXP3.P (Auer et al., 2002), to select the best one among the G instances. Algorithm 4 presents the pseudo-code. To prove the regret upper bound, we recall a simplified version of Pacchiano et al. (2020, Theorem 5.3) in Lemma 9. Lemma 9. If base algorithm instance has regret upper bound RT T αc(δ) with probability 1 δ for known constant α [ 1 2, 1) and the unknown function c : R R, then the regrets of CORRAL and EXP3.P are RT O(c(δ) 1 α T α) and RT O(c(δ)T 1 2 α ) respectively. Applying this lemma to our case, we first convert the gap-independent upper bound in (5) of SE-WR-Stop (Algorithm 6) in Theorem 7 to a multiplicative upper bound: When the stop condition is Nk T K , the realized regret upper bound is RT O( p KTC log(KT/δ)) with a probability of at least 1 δ. Plugging this into Lemma 9 with α = 1 2 and c(δ) = p C log(KT/δ) and letting δ = K/T, we have the pseudo regret upper bounds for MS-SE-WR as follows: Theorem 10. Deploying SE-WR-Stop (stop condition Nk T/K) as the base algorithm instance, Algorithm 4 with CORRAL has pseudo regret upper bound RT O(KC T), and Algorithm 4 with EXP3.P has pseudo regret upper bound RT O( The bounds in Theorem 10 are tight in terms of the trade-off of T and C, as they respectively match the lower bounds in Proposition 5, Ω( CT 2 3 ) when α = 2 3 and Ω(C T) when α = 1 2. The challenge of achieving these tight regret bounds in Theorem 10 is in discovering the right base algorithm instances with suitable multiplicative upper bounds. For example, the first upper bound in Theorem 7 can also be converted to the multiplicative RT O(C p KT log(KT/δ)) bound with a probability of at least 1 δ. However, applying the model selection technique to this bound only results in O(C KT 2 3 ) and O(C2K T) bounds, which are not as tight as the ones in Theorem 10. This model selection technique was also used in Kang et al. (2024) to boost the algorithm for Lipchitz bandits with a known attack budget to an algorithm with an unknown attack budget. However, their regret upper bounds, mapping to our MAB setting, are O(C 1 K+2 T K+2 K+3 ) and O(C 1 K+1 T K+1 K+2 ), which have a much worse dependence on T than the ones in Theorem 10, especially when K is large. This is because the regret upper bounds of their base algorithms have a much worse dependence on T than our base SE-WR-Stop algorithm. This highlights the importance of devising the suitable base algorithm instances for MAB with the known attack budget in 4.2. Remark 11 (Regret comparison for unknown C case in Figure 2). The section presents three different upper bounds for unknown attacks, one additive O( KT + KC2) by PE-WR and two multiplicative O( KCT 2 3 ), O(KC T) by two types of MS-SE-WR. Giving attack C = T β for some parameter β [0, 1], these three bounds can be written in the form of O(T γ) for corresponding exponents γ. Figure 2 compares these bounds in terms of the exponent γ by varying β. Comparing all three bounds, the additive bound is better than both multiplicative ones when β 4 Published as a conference paper at ICLR 2025 UCB SE SE-WR SE-WR-STOP-T/K SE-WR-STOP FSAAER 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (b) C = 3000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (c) C = 6000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (d) C = 9000 Figure 3: Regret comparison of algorithms with known attack budgets when varying budget C MLAAER BARBAR MS-SE-WR-EXP3.P MS-SE-WR-CORRAL 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (b) C = 3000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (c) C = 6000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (d) C = 9000 Figure 4: Regret comparison of algorithms with unknown attack budgets when varying budget C KCT 2 3 ) bound is better than the additive one when β > 4 9. Especially, the multiplicative bound O( KCT 2 3 ) is still sublinear when the total corruption is large than T, specifically, Ω( T) < C < O(T 2 3 ), which is not achievable by the additive bound. Comparing among the two multiplicative bounds, the O(KC T) is better than O( KCT 2 3 ) when β 1 3, and vice versa. 6 EXPERIMENTS Baselines We compare our algorithms with baselines: Upper Confidence Bound (UCB) (Lattimore & Szepesv ari, 2020), Successive Elimination (SE) (Even-Dar et al., 2006), Fast-Slow Active Arm Elimination Race (FSAAER) and Multi-Layer Active Arm Elimination Race (MLAAER) (Lykouris et al., 2018), BARBAR (Gupta et al., 2019). These baselines are grouped by whether they require knowledge of attack budgets. To distinguish our two SE-WR-Stop algorithms, we call the one with stopping condition Nk T/K as SE-WR-Stop T/K and the other SE-WR-Stop. Setup We adopt the attack strategy from Jun et al. (2018) (details in Appendix G.1), a standard attack policy for stochastic bandits. Although our algorithms outperform baselines under this attack, they could perform even better under more sophisticated policies. We consider a stochastic bandit with K = 10 arms, where each arm k follows a Bernoulli distribution with means µk decreasing from 0.9 to 0.45 in steps of 0.05. We evaluate all algorithms under four attack budgets (C {0, 3000, 6000, 9000}) over T = 200000 rounds, repeating each experiment 10 times and reporting the average and deviation of cumulative regret. Additional results can be found in Appendix G. Observations Figures 3 and 4 show the regret of algorithms under known and unknown attack budgets. Known attack budgets: Figures 3b 3d show that baselines, including FSAAER (Lykouris et al., 2018), perform poorly, while our algorithms (SE-WR and SE-WR-Stop) achieve sublinear regret, confirming the need for developing robust algorithms in this attack model. The increase in the regret of SE-WR and SE-WR-Stop with the attack budget C is consistent with our theoretical results. Unknown attack budgets: Figure 4 shows the following: (i) When total attack is small, e.g., C = 0 in Figure 4a, the adaptive PE-WR algorithm outperforms the SE-WR p T/K algorithm (with rigid robust-or-not separate in (6)), highlighting the need for adaptive strategies in unknown attack settings. (ii) Baselines like MLAAER (Lykouris et al., 2018) (developed for corruptions) achieve sublinear regret for small budgets (e.g., C = 0) as expected but perform poorly when C = 6000 (comparing to our robust algorithms, PE-WR and MS-SE-WR), and even fail when C = 9000, underscoring the need for devising robust algorithms for attacks. (iii) For large budgets (C = 9000), only our MS-SE-WR algorithms maintain sublinear regret, consistent with our theoretical results that algorithms with multiplicative regret bounds perform better when C > Ω( T) (detailed in Remark 11), emphasizing the importance of studying such bounds. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENTS The authors want to thank Thodoris Lykouris for suggesting this topic in an AIM s SQua REs meeting and for his invaluable feedback and insightful discussions throughout the research. The work of Jinhang Zuo was supported by City U 9610706. The work of John C.S. Lui was supported in part by the RGC GRF14202923. The work of Mohammad Hajiesmaili was supported by NSF CPS2136199, CAREER-2045641, and CNS-2325956. The work of Xutong Liu was supported in part by a fellowship award from the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK PDFS2324-4S04). Xutong Liu is the corresponding author. Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12 38. PMLR, 2017. Jean-Yves Audibert and S ebastien Bubeck. Regret bounds and minimax policies under partial monitoring. The Journal of Machine Learning Research, 11:2785 2836, 2010. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002. Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1071 1081. PMLR, 2020. Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pp. 991 999. PMLR, 2021. Ilija Bogunovic, Zihan Li, Andreas Krause, and Jonathan Scarlett. A robust phased elimination algorithm for corruption-tolerant gaussian process bandits. Advances in Neural Information Processing Systems, 35:23951 23964, 2022. Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis. cambridge university press, 2005. Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006. Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019. Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 1562 1578. PMLR, 2019. Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. Advances in neural information processing systems, 35:34614 34625, 2022. Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. Advances in neural information processing systems, 31, 2018. Yue Kang, Cho-Jui Hsieh, and Thomas Chun Man Lee. Robust lipschitz bandits to adversarial corruptions. Advances in Neural Information Processing Systems, 36, 2024. Tor Lattimore and Csaba Szepesv ari. Bandit algorithms. Cambridge University Press, 2020. Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pp. 4042 4050. PMLR, 2019. Haolin Liu, Artin Tajdini, Andrew Wagenmaker, and Chen-Yu Wei. Corruption-robust linear bandits: Minimax optimality and gap-dependent misspecification. Advances in Neural Information Processing Systems, 2024. Published as a conference paper at ICLR 2025 Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114 122, 2018. Saeed Masoudian and Yevgeny Seldin. Improved analysis of the tsallis-inf algorithm in stochastically constrained adversarial bandits and stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 3330 3350. PMLR, 2021. Aldo Pacchiano, My Phan, Yasin Abbasi Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33:10328 10337, 2020. Anshuka Rangi, Long Tran-Thanh, Haifeng Xu, and Massimo Franceschetti. Saving stochastic bandits from poisoning attacks via limited data verification. In Proceedings of the aaai conference on artificial intelligence, volume 36, pp. 8054 8061, 2022. Yevgeny Seldin and G abor Lugosi. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pp. 1743 1759. PMLR, 2017. Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Nearoptimal regret with efficient communication. ar Xiv preprint ar Xiv:1904.06309, 2019. Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pp. 1263 1291. PMLR, 2018. Yinglun Xu, Bhuvesh Kumar, and Jacob D Abernethy. Observation-free attacks on stochastic bandits. Advances in Neural Information Processing Systems, 34:22550 22561, 2021. Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. The Journal of Machine Learning Research, 22(1):1310 1358, 2021. Jinhang Zuo, Zhiyao Zhang, Zhiyong Wang, Shuai Li, Mohammad Hajiesmaili, and Adam Wierman. Adversarial attacks on online learning to rank with click feedback. Advances in Neural Information Processing Systems, 36, 2024. Shiliang Zuo. Near optimal adversarial attack on stochastic bandits and defenses with smoothed responses. In The 27th International Conference on Artificial Intelligence and Statistics, 2024. Published as a conference paper at ICLR 2025 A APPLICATION EXAMPLES OF OUR ATTACK MODEL A key application of our model lies in online news recommendation systems. Here, the platform (learner) recommends articles (arms) to users with the goal of maximizing total clicks (rewards). However, click fraud botnets (adversaries) may maliciously target the system by generating fake clicks on specific articles (attack reward observations), thereby misleading the recommendation algorithm. This scenario extends naturally to other online recommendation systems, such as ecommerce, video streaming, and social media platforms. Another potential application is in resource allocation for content delivery networks (CDNs). In this context, the CDN manager (learner) allocates resources (e.g., storage, bandwidth) to various servers (arms) to optimize overall performance (rewards, such as response time or throughput). However, the manager might encounter distributed denial-of-service (DDo S) attacks (adversaries) targeting specific servers (attack their reward observations) to degrade system performance. B OTHER MODEL DETAILS: CORRUPTION PROCESS, REGRET DISCUSSION In this section, we provide more details on the corruption process. At each round t {1, 2, . . . , T}, the adversary observes the realized rewards Xk,t for all arms k, as well as the rewards and actions of the learner in previous rounds. The adversary then chooses the corrupted rewards Xk,t for all arms k. The learner pulls an arm It (maybe randomly) and observes the attacked reward XIt,t. We summarize the decision process in Procedure 5. We denote the total amount of corruption as t=1 max k |Xk,t Xk,t|, where It is the pulled arm at time slot t. Procedure 5 Decision under Corruptions 1: for each round t = 1, 2, . . . , T do 2: Stochastic rewards Xk,t are drawn from all arms k 3: The adversary observes the realized rewards Xk,t as well as the rewards and actions of the learner in previous rounds 4: The adversary chooses the corruption rewards Xk,t for all arms k 5: The learner pulls an arm It (maybe randomly) and observes the attacked reward XIt,t Remark 12 (Regret defined by corrupted rewards). Beside the regret defined on raw reward in (1), one can also define the regret based on the corrupted rewards as follows, RT := maxk K P t T ( Xk,t XIt,t). This definition is considered in Zimmert & Seldin (2021); Bogunovic et al. (2020). As the adversary can manipulate at most C rewards, one can easily show that, | RT RT | 2C. We will focus on the regret defined by raw rewards (1) in the following, as most results in this paper have at least a linear dependence on C. C PROOF FOR LOWER BOUND Theorem 1. Given a stochastic multi-armed bandit game with K arms, under attack with budget C, and T > KC decision rounds,6 for any bandits algorithm, there exists an attack policy with budget C that can make the algorithm suffer Ω(KC) regret. Proof of Theorem 1. We consider K instances of a MAB game where for instance Ik for any k {1, 2, . . . , K}, the arm k has reward ϵ > 0 and all other arms have reward 0 without any noise. The adversary s policy is that whenever the algorithm pulls an arm with a non-zero reward, the adversary attacks the algorithm by changing the reward of the arm to 0. The adversary can conceal the optimal arm for C 6Note that T KC implies that C = Ω(T) which trivially results in a linear regret. Published as a conference paper at ICLR 2025 2 time slots, there exists at least K 2 arms that are pulled at most C ϵ times. That is, if the optimal arm is among these K 2 arms, the algorithm cannot identify the optimal arm. Let us pick the instance whose optimal arm is among these K 2 arms. Then, during these C 2 time slots, the algorithm needs to pay the regret of C 1 ϵ = Ω(KC). Proposition 2 For gap-dependent regret bounds and any consistent bandit algorithm7, the lower bound is roughly Ω P k + KC , or formally, for some universal constant ξ > 0, For gap-independent cases, the lower bound is Proof of Proposition 2. From bandits literature, e.g., Lattimore & Szepesv ari (2020, Chapter 16), we know that the pseudo regret of any consistent bandit algorithm can be lower bounded as follows, On the other hand, we can rewrite the Ω(KC) lower bound in Theorem 1 as follows, where ξ > 0 is a universal constant. Adding the above two inequalities and dividing both sides by 2, we have For the gap-independent case, recall in bandits literature (Lattimore & Szepesv ari, 2020, Chapter 15), we have the RT Ω( KT) lower bound. Combining it with the Ω(KC) lower bound in Theorem 1 yields RT Ω(max{ KT, KC}) = Ω( KT + KC) lower bound. Proposition 4 (Achievable additive regret bounds) Given any parameter α [ 1 2, 1), for any bandit algorithm that achieves an additive regret upper bound of the form O(T α + Cβ), we have β 1 Proposition 5 (Achievable multiplicative regret bounds) Given parameter α [ 1 2, 1), for any bandit algorithm that achieves a multiplicative regret bound of the form O(T αCβ), we have β 1 Proof of Propositions 4 and 5. We use contradiction to prove both propositions. We first prove Proposition 4. Given Theorem 3, we know that there exists an attack policy with C = Θ(RT ) that can make the algorithm suffer linear regret. Thus, if the algorithm achieves an additive regret upper bound of the form O(T α +Cβ) with β < 1 α, then the algorithm only suffers a sublinear regret when C = T α, which contradicts Theorem 3. Thus, the parameter β 1 α. Proposition 5 can be proved via a similar contradiction argument. Published as a conference paper at ICLR 2025 Algorithm 6 SE-WR-Stop: Successive Elimination with Wide Confidence Radius and Stop Condition Input: total attack C, number of arms K, decision rounds T, confidence parameter δ 1: Initialize candidate arm set S K, the number of arm pulls Nk 0, and reward mean estimates ˆµk 0 2: while Nk T K and t T do or Nk T T K log(KT/δ) 3: Uniformly pull each arm in S once and observes Xk for all arms k S 4: Update parameters: t t + |S|, Nk Nk + 1, ˆµk ˆµk(Nk 1)+Xk Nk 5: ˆµmax maxk ˆµk 6: S k S : ˆµk ˆµmax 2 q 7: Uniformly pull arms in S till the end of the game D DEFERRED ALGORITHM PSEUDO-CODE E PROOF FOR UPPER BOUNDS WITH KNOWN ATTACK Theorem 6. Given the total attack or its upper bound C, Algorithm 2 s the realized regret is upper bounded by RT O(P k =k log(KT/δ)/ k + KC) with probability 1 δ, and, with δ = K/T, its pseudo regret is upper bounded as RT O(P k =k log T/ k + KC). Proof of Theorem 6. We first recall the following lemma from Lykouris et al. (2018). Lemma 13 (Lykouris et al. (2018, Lemma 3.1)). With a probability of at least 1 δ, the optimal arm k is never eliminated. Next, we prove a lemma shows the sufficient number of pulling that are necessary for a suboptimal arm to be eliminated. This lemma is tighter than the one in Lykouris et al. (2018, Lemma 3.2) (proved for Nk 64 log(2KT/δ)+64C 2 k ), and, based on it, we derive tighter regret bounds. Lemma 14. With a probability of at least 1 δ, all suboptimal arms k = k are eliminated after pulling each arm Nk 64 log(2KT/δ) Proof for Lemma 14. With Lemma 13, we know that the optimal arm k is never eliminated. Thus, in the following, we show that the suboptimal arm k will be eliminated by the optimal arm k on or before Nk 64 log(2KT/δ) Denote ˆµk as the empirical mean of the raw stochastic reward observation of arm k. By Hoeffding s inequality, we have, with a probability of at least 1 δ KT , where the last inequality is due to Nk 64 log(2KT/δ) Comparing the true empirical mean ˆµk with the attacked empirical mean µk, we have | µk ˆµk| C 7When C = 0, a consistent bandits algorithm has regret RT O(T α) for any α (0, 1). Published as a conference paper at ICLR 2025 where the last inequality is due to Nk 8C k . Combining the two inequalities, we have Last, we show that the elimination condition for arm k would have been triggered before Nk 64 log(2KT/δ) k as follows, As the pseudo regret is upper bounded at most P k k Nk, we have 64 log(2KT/δ) where the last inequality is by choosing δ = K Transfer pseudo-regret to realized regret. Next, we show the high probability bound for the realized regret RT . Multiplying Nk to both sides of (7), we have that, for any arm k, the difference between the actual accumulated reward and its expected counterpart is upper bounded by p Nk log(2KT/δ) with a probability of at least 1 δ/KT. The probability of any of the above events does not hold is at most 1 δ. For any suboptimal arm k, we bound the difference as follows, Nk log(2KT/δ) Nk log(2KT/δ) 64 log(2KT/δ) + 8C k Nk k In case that the arm k with the highest empirical mean is not the optimal arm k , for example, this arm k has a smaller reward gap k O( p 1/T), the regret reduction due to this event is at most Nk k . Putting the potential impact of different kinds of reward realization above, we can bound the realized regret by O(P k =k Nk k) still. Thus, with similar derivation as (8), we have the realized regret of the algorithm is at most with probability 1 δ. Published as a conference paper at ICLR 2025 F PROOF FOR UPPER BOUNDS WITH UNKNOWN ATTACK Theorem 8. Algorithm 3 has a realized regret upper bound, with a probability of 1 δ, RT O( p KT log (K log T/δ) + T log T + KC log T + KC2), and by setting δ = K/T, a pseudo regret upper bound RT O( Proof of Theorem 8. For C > T/K, the O( KT + KC2) = O(T) upper bound holds trivially. Hence, we only consider C T/K in this proof. Note that the elimination condition in Algorithm 3 fails with a probability of at most δ/KH. Applying a union bound to sum all potential failure probabilities shows that, with a probability of at least 1 δ, the elimination condition works properly for all arms in all phases. In the rest of the proof, we exclude this failure probability and assume that the elimination only fails when C > ˆCh. Denote H as the actual number of phases, while recall H = log2 T 1 is the upper bound of the number of phases. As the analysis of Algorithm 2 shows, for phase h with ˆCh > C, the optimal arm k is in the candidate set Sh with high probability. Denote h as the first phase with ˆCh C, implying H h log2 C. Denote k h as the arm with the highest reward mean in phase h. Then, we have k h = k for h < h . Next, we use induction to show, for h h , the remaining best arm k h is close to the optimal arm k as follows, µk µk h 2C Lh/|Sh|. If the optimal arm is eliminated in phase h , then we have ˆµk ,h max k Sh ˆµk,h 2 Lh /|Sh | + ˆCh Lh /|Sh | Denote ˆk h = arg maxk Sh ˆµk,h . We have (a) ˆµˆk h ,h Lh /|Sh | + C Lh /|Sh | (b) ˆµk ,h + 2 Lh /|Sh | + ˆCh Lh /|Sh | Lh /|Sh | + C Lh /|Sh | Lh /|Sh | + ˆCh Lh /|Sh | Lh /|Sh | + C Lh /|Sh | = µk 2(C ˆCh ) µk 2C Lh /|Sh | Lh where inequalities (a) and (c) come from the Hoeffding s inequality and the attack C, inequality (b) is due to the elimination condition in round h . That is, µk µˆk h 2CK With a similar analysis, we can show, µˆk h +i µˆk h +i+1 2CK Lh +i , for all i = 1, . . . , H h 1. Based on the inequality, we further have, for any h h , 2CK Lh +i = 2CK L02h +i 2CK L02h +1 . (9) Published as a conference paper at ICLR 2025 Next, we decompose the pseudo regret as follows, t=1 (µk µIt) = X k K Nk,T (µk µk) = Lh |Sh|(µk µk) Lh |Sh|(µk µk) + Lh |Sh|(µk µk) Lh |Sh|(µk µk) + Lh |Sh|(µk h µk) | {z } Part I Lh |Sh|(µk µk h) | {z } Part II To bound Part I, we notice that all arms in Sh are not eliminated at the end of the previous phase, implying, for h < h , log(2KH/δ) Lh 1/|Sh 1| + C Lh 1/|Sh 1| (b) max k Sh 1 ˆµk ,h 1 2 log(2KH/δ) Lh 1/|Sh 1| + ˆCh 1 Lh 1/|Sh 1| log(2KH/δ) Lh 1/|Sh 1| + C Lh 1/|Sh 1| (c) ˆµk ,h 1 log(2KH/δ) Lh 1/|Sh 1| + 2 ˆCh 1 + C Lh 1/|Sh 1| log(2KH/δ) Lh 1/|Sh 1| + 2 ˆCh 1 + 2C Lh 1/|Sh 1| where inequalities (a) and (d) are due to the Hoeffding s inequality and the attack C, inequality (b) is due to the elimination condition, and inequality (c) is due to k Sh 1. That is, for arms k Sh, we have log(2KH/δ) Lh 1/|Sh 1| + 2( ˆCh 1 + C) Lh 1/|Sh 1| . (10) As we have k h Sh, and with the same derivation as (10), we also have log(2KH/δ) Lh 1/|Sh 1| + 2( ˆCh 1 + C) Lh 1/|Sh 1| . (11) Published as a conference paper at ICLR 2025 Therefore, we have Lh |Sh| k + Lh |Sh|(µk h µk) log(2KH/δ) Lh 1/|Sh 1| + 2( ˆCh 1 + C) Lh 1/|Sh 1| . L2 h log(2KH/δ) Lh 1/|Sh 1| + 2Lh( ˆCh 1 + C) Lh 1/|Sh 1| 2Lh K log(2KH/δ) + 4K( ˆCh 1 + C) 2K log(2KH/δ) h=1 ( ˆCh 1 + C) 2KL0 log(2KH/δ)T + 4 T log2 T + 4KC log2 T. where inequality (a) is due to (10) and (11), and inequality (b) is because PH T since the summation of a geometric increasing series is upper bounded by g LH , for some universal constant g > 0, and LH T. Next, we bound Part II as follows, Lh |Sh|(µk µk h) (a) Lh |Sh| 2CK L02h +1 2CKLh L02h +1 h=h +1 2h (h +1) where inequality (a) is due to (9), and inequality (b) is because H h H h log2 C. Hence, substituting (12) and (13) into the regret bound, we have the pseudo-regret bound with a probability of at least 1 δ as follows, RT L0 + 4g p 2K log(2KH/δ)T + 4 T log2 T + 4KC log2 T + KC2 KT log K log T T log T + KC log T + KC2 ! With a similar analysis as in the proof in Theorem 6, we can show that the realized regret RT also has the same upper bound (in order) with a probability of at least 1 δ. G ADDITIONAL EXPERIMENT DETAILS AND RESULTS G.1 ATTACK POLICY ON STOCHASTIC BANDITS We adopt the attack strategy against UCB-type algorithms in Jun et al. (2018). Initially, the adversary randomly selects one suboptimal arm as the target. At each round t, after observing the pulled arm It and its reward realization XIt,t, the adversary determines the attack αt according to Equation (7) in Jun et al. (2018) and returns the attacked reward XIt,t = XIt,t αt to the learner. We set 0 = 0.1, σ = 1, and δ = 0.05 in the attack model. The attacker cannot attack anymore if the attack budget C is exhausted. Published as a conference paper at ICLR 2025 UCB SE SE-WR SE-WR-STOP-T/K SE-WR-STOP FSAAER 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (b) C = 3000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (c) C = 6000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (d) C = 9000 Figure 5: Regret comparison of algorithms with known attack budgets when varying budget C MLAAER BARBAR MS-SE-WR-EXP3.P MS-SE-WR-CORRAL 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (b) C = 3000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (c) C = 6000 0.0 0.5 1.0 1.5 2.0 Round Cumulative Regret (d) C = 9000 Figure 6: Regret comparison of algorithms with unknown attack budgets when varying budget C G.2 EXPERIMENTS WITH LARGER GAPS We take K = 10 and T = 200, 000, with rewards for all arms following Bernoulli distributions Bern(µi) and mean rewards defined as [0.9, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3]. We use four attack budgets C {0, 3000, 6000, 9000}. Each experiment is repeated 10 times. Figures 5 and 6 illustrate the cumulative regrets of different algorithms. With known attack budgets, as depicted in Figure 5, the performance of FSAAER (Lykouris et al., 2018) gradually deteriorates, while our proposed algorithms remain robust as the attack budget C increases. When attack budgets are unknown to the learner, as shown in Figure 6, baselines including MLAAER (Lykouris et al., 2018) achieve sublinear regrets for small budgets (e.g., C = 0 and C = 3000) as expected, but perform poorly when the budgets become large (e.g., C = 6000) and fail entirely when C = 9000. As the attack budget C varies, our robust PE-WR and MS-SE-WR algorithms consistently maintain sublinear regrets. Developing these two STOP algorithms for the known budget case is mainly for (1) later algorithm design for the unknown budget case and (2) the theoretical interest of deriving gap-independent and multiplicative C bounds. The worst performance of the two STOP algorithms when C = 0 is expected. Because both STOP conditions are triggered when there are still some suboptimal arms in the candidate arm set. Then, after the STOP conditions, the algorithms uniformly pull these remaining arms, resulting in a linear increasing in regret. Although the regret increases in linear in the later stage of the plot, it is still logarithmic bounded theoretically. Because the remaining suboptimal arms are with reward means close to the optimal arm, which only incurs a slow linear regret increase. On the other hand, note that the y-axis in Figure 3(a) is in 103 scale while the scales of Figures 3(b)(c)(d) are in 104. That is, the two STOP algorithms although with the worst performance among plots in Figure 3(a) with C = 0 still outperform other scenarios when C = 3000, 6000, 9000.