# Adversarial Attacks on Combinatorial Multi-Armed Bandits

Rishab Balasubramanian* 1, Jiawei Li* 2, Prasad Tadepalli 1, Huazheng Wang 1, Qingyun Wu 3, Haoyu Zhao 4

*Alphabetical order. 1 Oregon State University, 2 University of Illinois Urbana-Champaign, 3 Pennsylvania State University, 4 Princeton University. Correspondence to: Huazheng Wang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

We study reward poisoning attacks on Combinatorial Multi-Armed Bandits (CMAB). We first provide a sufficient and necessary condition for the attackability of CMAB, a notion that captures the vulnerability and robustness of CMAB. The attackability condition depends on the intrinsic properties of the corresponding CMAB instance, such as the reward distributions of super arms and the outcome distributions of base arms. Additionally, we devise an attack algorithm for attackable CMAB instances. Contrary to the prior understanding of multi-armed bandits, our work reveals a surprising fact: the attackability of a specific CMAB instance also depends on whether the bandit instance is known or unknown to the adversary. This finding indicates that adversarial attacks on CMAB are difficult in practice and that a general attack strategy for any CMAB instance does not exist, since the environment is mostly unknown to the adversary. We validate our theoretical findings via extensive experiments on real-world CMAB applications, including the probabilistic maximum covering problem, online minimum spanning tree, cascading bandits for online ranking, and online shortest path.

1. Introduction

Multi-armed bandits (MAB) (Auer, 2002) is a classic framework for sequential decision-making problems that has been extensively studied (Lattimore & Szepesvári, 2020; Slivkins et al., 2019). In each round, the learning agent selects one out of $m$ arms and observes its reward feedback, which follows an unknown reward distribution. The goal is to maximize the cumulative reward, which requires the agent to balance exploitation (selecting the arm with the highest average reward) and exploration (exploring arms that have high potential but have not been played enough).

Combinatorial multi-armed bandits (CMAB) is a generalized framework of the original MAB with many real-world applications, including online advertising, recommendation, ranking, and influence maximization (Liu & Zhao, 2012; Kveton et al., 2015; Chen et al., 2016; Wang & Chen, 2017). In CMAB, the agent chooses a combinatorial action (called a super arm) over the $m$ base arms in each round, and observes the outcomes of the base arms triggered by the action as feedback, known as semi-bandit feedback. The exploration-exploitation trade-off in CMAB is extremely hard compared to MAB because the number of candidate super arms can be exponential in $m$.

Recent studies showed that MAB and its variants are vulnerable to adversarial attacks, especially poisoning attacks (Jun et al., 2018; Liu & Shroff, 2019; Wang et al., 2022; Garcelon et al., 2020). Under such attacks, an adversary observes the pulled arm and its reward feedback, and then modifies the reward to misguide the bandit algorithm into pulling a target arm that is in the adversary's interest. Specifically, the adversary aims to spend an attack cost sublinear in the time horizon $T$, i.e., $o(T)$ cost, to modify rewards such that the bandit algorithm pulls the target arm almost all the time, i.e., $T - o(T)$ times.
Liu & Shroff (2019) showed that no-regret MAB algorithms can be efficiently attacked under any problem instance, while Wang et al. (2022) showed that there exist instances of linear bandits that cannot be attacked without linear cost, indicating that such instances are intrinsically robust to adversarial attacks even when facing vanilla bandit algorithms. Due to the wide applicability of CMAB, understanding its vulnerability and robustness to poisoning attacks is increasingly important. Directly applying the MAB concept of attackability to CMAB is tempting (Liu & Shroff, 2019; Wang et al., 2022). However, following the same attack strategy as in MAB, it leads to a cost bound that is sublinear in $T$ but exponential in the number of base arms $m$. In practice, this exponential cost in $m$ is undesirable and can even exceed $T$, resulting in vacuous guarantees. Therefore, the original attackability notion is insufficient, and we have the following research question:

RQ: What is a good notion to capture the vulnerability and robustness of CMAB?

1.1. Our contribution

To answer the research question, we propose the definition of polynomial attackability (Section 3): when the attack is successful, the cost upper bound should not only be sublinear in the time horizon $T$, but also polynomial in the number of base arms $m$ and other factors. Following this definition, we propose the first study of adversarial attacks on combinatorial multi-armed bandits under the reward poisoning attack (Jun et al., 2018; Liu & Shroff, 2019; Wang et al., 2022). We provide a sufficient and necessary condition for the polynomial attackability of CMAB, which depends on the bandit environment, such as the reward distributions of super arms and the outcome distributions of base arms (Section 3). The polynomial attackability notion captures the vulnerability of certain CMAB instances regardless of the CMAB algorithm, while other instances are intrinsically robust to poisoning attacks. The condition also motivates the design of our attack algorithm. We analyze the attackability of several real-world CMAB applications, including cascading bandits for online learning to rank (Kveton et al., 2015), online shortest path (Liu & Zhao, 2012), online minimum spanning tree (Kveton et al., 2014), and the probabilistic maximum coverage problem (Chen et al., 2016) (a special instance of online influence maximization (Wen et al., 2017)). Our results can also be applied to simple reinforcement learning settings (Section 3) and, to the best of our knowledge, give the first instance-level attackability result for reinforcement learning.

Perhaps surprisingly, we discovered that for the same CMAB instance, polynomial attackability is not always the same, but is conditioned on whether the bandit environment is known or unknown to the adversary (Section 4). We construct a hard example such that the instance is polynomially attackable if the environment is known to the adversary but polynomially unattackable if it is unknown to the adversary in advance. This hardness result suggests that adversarial attacks on CMAB may be extremely difficult in practice and that a general attack strategy for any CMAB instance does not exist, since the environment is usually unknown to the adversary (black-box settings). Instead, the success of an attack strategy should be analyzed specifically for an instance or an application.
Following the attackability condition, we prove that in black-box settings our attack algorithm can successfully attack applications such as probabilistic maximum coverage, online minimum spanning tree, and cascading bandits. Finally, numerical experiments are conducted to verify our theory (Section 5); code can be found at https://github.com/haoyuzhao123/robust-cmab-code. Specifically, we validated that our attack algorithm efficiently attacks the applications mentioned above in unknown environments.

1.2. Related works

Adversarial attacks on bandits and reinforcement learning. Reward poisoning attacks on bandit algorithms were first studied in the stochastic multi-armed bandit setting (Jun et al., 2018; Liu & Shroff, 2019), where an adversary can always force the bandit algorithm to pull a target arm a linear number of times using only a logarithmic cost. Garcelon et al. (2020) studied attacks on linear contextual bandits. The notion of attackability was first termed by Wang et al. (2022), who studied the attackability of linear stochastic bandits. Adversarial attacks on reinforcement learning have been studied under white-box (Rakhsha et al., 2020; Zhang et al., 2020) and black-box settings (Rakhsha et al., 2021; Rangi et al., 2022). Specifically, Rangi et al. (2022) showed that there exist unattackable episodic RL instances under reward poisoning attacks. However, none of the existing work analyzed the attackability of a given instance. Besides reward poisoning attacks, other threat models such as environment poisoning attacks (Rakhsha et al., 2020; Sun et al., 2021; Xu et al., 2021; Rangi et al., 2022) and action poisoning attacks (Liu & Lai, 2020) have also been studied. We focus on reward poisoning attacks in this paper and leave the investigation of other threat models as future work.

Corruption-tolerant bandits. Another line of work studies the robustness of bandit algorithms against poisoning attacks, also known as corruption-tolerant bandits. Lykouris et al. (2018) and Gupta et al. (2019) proposed robust MAB algorithms under an oblivious adversary who determines the manipulation before the bandit algorithm pulls an arm. Dong et al. (2022) proposed a robust CMAB algorithm under strategic manipulations where the corruptions are limited to only increasing the outcomes of base arms in semi-bandit feedback, with a known attack budget. This is weaker than our threat model, as we allow the adversary to both increase and decrease the outcomes of base arms.

2. Preliminary

2.1. Combinatorial semi-bandit

In this section, we introduce our model for the combinatorial semi-bandit (CMAB) problem. The CMAB model is mainly based on Wang & Chen (2017), which handles nonlinear reward functions, approximate offline oracles, and probabilistically triggered base arms. A CMAB problem (Wang & Chen, 2017) can be considered a game between a player and an environment. We summarize the important concepts involved in a CMAB problem here:

Base arm: The environment has $m$ base arms $[m] = \{1, 2, \ldots, m\}$ associated with random variables following a joint distribution $D$ over $[0, 1]^m$. At each time $t$, random outcomes $X^{(t)} = (X^{(t)}_1, X^{(t)}_2, \ldots, X^{(t)}_m)$ are sampled from $D$. The unknown mean vector $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$, where $\mu_i = \mathbb{E}_{X \sim D}[X^{(t)}_i]$, represents the means of the $m$ base arms.

Super arm: At time $t$, the player selects a base arm set $S^{(t)}$ from the action space $\mathcal{S}$ based on the feedback from the previous rounds.
The base arm set, referred to as a super arm, is composed of individual base arms.

Probabilistically triggered base arms: When the player selects a super arm $S^{(t)}$, a random subset $\tau_t \subseteq [m]$ of base arms is triggered, and the outcomes $X^{(t)}_i$ for $i \in \tau_t$ are observed as feedback. $\tau_t$ is sampled from the distribution $D_{\mathrm{trig}}(S^{(t)}, X^{(t)})$, where $D_{\mathrm{trig}}(S, X)$ is the probabilistic triggering function on the subsets $2^{[m]}$ given $S$ and $X$. We denote the probability of triggering arm $i$ with action $S$ as $p^{D,S}_i$, where $D$ is the environment triggering distribution. The set of base arms that can be triggered by $S$ under $D$ is denoted as $O_S = \{i \in [m] : p^{D,S}_i > 0\}$.

Reward: The player receives a nonnegative reward $R(S^{(t)}, X^{(t)}, \tau_t)$ determined by $S^{(t)}$, $X^{(t)}$, and $\tau_t$. The objective is to select an arm $S^{(t)}$ in each round $t$ to maximize the cumulative reward. We assume that $\mathbb{E}[R(S^{(t)}, X^{(t)}, \tau_t)]$ depends only on $S^{(t)}$ and $\mu$, and denote $r_S(\mu) := \mathbb{E}_X[R(S, X, \tau)]$ as the expected reward of super arm $S$ with mean vector $\mu$. This assumption is similar to Chen et al. (2016); Wang & Chen (2017), and holds when the variables $X^{(t)}_i$ are independent Bernoulli random variables. We define $\mathrm{opt}_\mu := \sup_{S \in \mathcal{S}} r_S(\mu)$ as the maximum reward given $\mu$. In summary, a CMAB problem instance with probabilistically triggered arms can be described as a tuple $([m], \mathcal{S}, D, D_{\mathrm{trig}}, R)$.

We introduce the following assumptions on the reward function, which are standard assumptions commonly used in the CMAB problem.

Assumption 2.1 (Monotonicity). For any $\mu$ and $\mu'$ with $\mu \le \mu'$ (dimension-wise) and any super arm $S \in \mathcal{S}$, $r_S(\mu) \le r_S(\mu')$.

Assumption 2.2 (1-Norm TPM Bounded Smoothness). For any two distributions $D, D'$ with expectation vectors $\mu$ and $\mu'$ and any super arm $S \in \mathcal{S}$, there exists a $B \in \mathbb{R}^+$ such that
$$|r_S(\mu) - r_S(\mu')| \le B \sum_{i \in [m]} p^{D,S}_i \, |\mu_i - \mu'_i|.$$

CUCB algorithm. The combinatorial upper confidence bound (CUCB) algorithms (Chen et al., 2013; Wang & Chen, 2017) are a series of algorithms devised for the CMAB problem. Key ingredients of a typical CUCB algorithm include (1) optimistic estimates of the expected values of the base arms $\mu$, and (2) a computational oracle that takes the expected value vector $\mu$ as input and returns an optimal or close-to-optimal super arm. For example, an $(\alpha, \beta)$-approximation oracle ensures that $\Pr(r_S(\mu) \ge \alpha \cdot \mathrm{opt}_\mu) \ge \beta$. A typical CUCB algorithm proceeds in the following manner: in each round, the player constructs tight upper confidence bounds (UCBs) on the expected outcomes based on historical observations and feeds the UCBs to the computational oracle; the oracle yields which super arm to select. When combined with the two assumptions above, a legitimate CUCB algorithm's regret, or the $(\alpha, \beta)$-approximation regret, can typically be upper bounded.

2.2. Threat model

We consider the reward poisoning attack as the threat model (Jun et al., 2018; Liu & Shroff, 2019; Garcelon et al., 2020; Wang et al., 2022). Under this threat model, a malicious adversary has a set $M$ of target super arms in mind, and the goal of the adversary is to misguide the player into pulling any of the target super arms a linear number of times within the time horizon $T$. At each round $t$, the adversary observes the pulled super arm $S^{(t)}$, the outcomes of the base arms $X^{(t)} = \{X^{(t)}_i\}_{i \in \tau_t}$, and the reward feedback $R(S^{(t)}, X^{(t)}, \tau_t)$. The adversary modifies the outcome of each base arm from $X^{(t)}_i$ to $\tilde{X}^{(t)}_i$ for all $i \in \tau_t$; $\tilde{X}^{(t)} := \{\tilde{X}^{(t)}_i\}_{i \in \tau_t}$ is thus the corrupted return of the observed base arms. The cost of the attack is defined as $C(T) = \sum_{t=1}^{T} \|\tilde{X}^{(t)} - X^{(t)}\|_0$.
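To make the threat model concrete, the following is a minimal sketch of one round of this interaction and the resulting $\ell_0$ cost accounting. It is illustrative only: `bandit_pull` and `adversary_modify` are hypothetical stand-ins for the victim algorithm and the attack strategy, not interfaces from the paper's released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def attack_round(bandit_pull, adversary_modify, mu, cost_so_far):
    """One round of the reward-poisoning interaction (illustrative sketch)."""
    S, triggered = bandit_pull()                 # player selects a super arm; environment triggers tau_t
    X = {i: float(rng.binomial(1, mu[i])) for i in triggered}  # outcomes of triggered base arms
    X_tilde = adversary_modify(S, X)             # adversary corrupts the observed outcomes
    cost_so_far += sum(X_tilde[i] != X[i] for i in triggered)  # l0 cost: number of modified entries
    return X_tilde, cost_so_far                  # the player only ever sees X_tilde
```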
2.3. Selected applications of CMAB

Online minimum spanning tree. Consider an undirected graph $G = (V, E)$. Every edge $e = (u, v)$ in the graph has a cost realization $X^{(t)}_{u,v} \in [0, 1]$ drawn from a distribution with mean $\mu_{u,v}$ at each time slot $t$. The online minimum spanning tree problem is to select a spanning tree $S^{(t)}$ at every time step $t$ on an unknown graph in order to minimize the cumulative (pseudo-)regret, defined as
$$R = \sum_{t=1}^{T} \mathbb{E}\Big[\sum_{(u,v) \in S^{(t)}} X^{(t)}_{u,v} - \sum_{(u,v) \in S^*} X^{(t)}_{u,v}\Big],$$
where $S^*$ is the minimum spanning tree given the edge cost $\mu_e$ for every edge $e$.

Online shortest path. Consider an undirected graph $G = (V, E)$. Every edge $e = (u, v)$ in the graph has a cost realization $X^{(t)}_{u,v} \in [0, 1]$ drawn from a distribution with mean $\mu_{u,v}$ at each time slot $t$. Given the parameters $\mu_e$, the shortest path problem with a source $s$ and a destination $t$ is to select a path from $s$ to $t$ with minimum cost, i.e., to solve $\min_S \sum_{(u,v) \in S} \mu_{u,v}$, where $S$ is a path from $s$ to $t$. The online version is to select a path $S^{(t)}$ from $s$ to $t$ at time $t$ on the graph $G = (V, E)$ in order to minimize the cumulative regret.

Cascading bandit. The cascading bandit is a popular model for online learning to rank with Bernoulli click feedback under the cascade click model (Kveton et al., 2015; Vial et al., 2022). There is a set of items (base arms) $[m] = \{1, 2, \ldots, m\}$. At round $t$, the bandit algorithm recommends a list of $K < m$ items $[a_{t,1}, a_{t,2}, \ldots, a_{t,K}]$ as a super arm to the user, who clicks on the first attractive item (if any) and stops examining items after clicking. The user selects item $j$ from the list with probability $\mu_j$, which must be estimated online by the algorithm. The cascading bandit problem can be reduced to CMAB with probabilistically triggered arms (Wang & Chen, 2017).

Probabilistic maximum coverage. Given a weighted bipartite graph $G = (L, R, E)$ where each edge $(u, v)$ has a probability $\mu_{u,v}$, the probabilistic maximum coverage (PMC) problem is to find a set $S \subseteq L$ of size $k$ that maximizes the expected number of activated nodes in $R$, where a node $v \in R$ can be activated by a node $u \in S$ with an independent probability $\mu_{u,v}$. In the online version of PMC, the $\mu_{u,v}$ are unknown. The algorithm estimates $\mu_{u,v}$ online and minimizes the pseudo-regret.

3. Polynomial Attackability of CMAB Instances

Previous literature defined the following attackability notion for MAB instances (Jun et al., 2018; Liu & Shroff, 2019; Garcelon et al., 2020; Wang et al., 2022): for any no-regret algorithm $A$ (the regret of $A$ is $o(T)$ for large enough $T$), the attack can only use sublinear cost $C(T) = o(T)$ and fool the algorithm $A$ into playing the arms in the target set $M$ for $T - o(T)$ times. It is worth noting that Wang et al. (2022) were the first to observe that certain linear MAB instances are unattackable without linear cost, suggesting the intrinsic robustness of such linear bandit instances. However, the notion of attackability needs to be modified in the CMAB framework, since the number of super arms is exponential, which would lead to impractical guarantees. In this work, we propose to consider the polynomial attackability of CMAB. For a particular CMAB instance $([m], \mathcal{S}, D, D_{\mathrm{trig}}, R)$, we define $p^* := \inf_{S \in \mathcal{S},\, i \in O_S} p^{D,S}_i$.
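To fix ideas before the formal definitions, here is a minimal sketch (a toy instance invented for illustration, not taken from the paper) of how $O_S$ and $p^*$ are read off a CMAB instance:

```python
# Toy 3-arm CMAB instance (illustrative only).
mu = {1: 0.9, 2: 0.6, 3: 0.3}
# Each super arm maps every base arm it can trigger to its triggering probability p_i^{D,S}.
trigger_prob = {
    "S1": {1: 1.0, 2: 0.5},   # pulling S1 always observes arm 1, and arm 2 half the time
    "S2": {2: 1.0, 3: 1.0},   # pulling S2 always observes arms 2 and 3
}

# O_S is the support of the triggering distribution for S.
O = {S: {i for i, p in probs.items() if p > 0} for S, probs in trigger_prob.items()}
# p* is the smallest positive triggering probability over all super arms and arms.
p_star = min(p for probs in trigger_prob.values() for p in probs.values() if p > 0)
print(O, p_star)   # {'S1': {1, 2}, 'S2': {2, 3}} 0.5
```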
Definition 3.1 (Polynomially attackable). A CMAB instance is polynomially attackable with respect to a set of super arms $M$ if, for any learning algorithm with regret $O(\mathrm{poly}(m, 1/p^*, K) \cdot T^{1-\gamma})$ with high probability for some constant $\gamma > 0$, there exists an attack method with constant $\gamma' > 0$ that uses at most $T^{1-\gamma'}$ attack cost and misguides the algorithm into pulling super arms $S \in M$ for at least $T - T^{1-\gamma'}$ times with high probability for any $T \ge T^*$, where $T^*$ depends polynomially on $m, 1/p^*, K$. (We use the terms "attackable" and "polynomially attackable" interchangeably in the following discussion.)

Definition 3.2 (Polynomially unattackable). A CMAB instance is polynomially unattackable with respect to a set of super arms $M$ if there exists a learning algorithm $A$ with regret $O(\mathrm{poly}(m, 1/p^*, K) \cdot T^{1-\gamma})$ with high probability for some constant $\gamma > 0$, such that for any attack method with constant $\gamma' > 0$ that uses at most $T^{1-\gamma'}$ attack cost, the algorithm $A$ will pull super arms $S \in M$ for at most $T/2$ times with high probability for any $T \ge T^*$, where $T^*$ depends polynomially on $m, 1/p^*, K$.

Remark 3.3 (Conventional attackability definition vs. polynomially attackable vs. polynomially unattackable). (1) Compared to conventional attackability definitions for $k$-armed stochastic bandit instances, both Definition 3.1 and Definition 3.2 require a polynomial dependency on $m$. (2) Note that the "polynomially unattackable" notion is stronger than "not polynomially attackable".

Remark 3.4 (Polynomial dependency). $T^*$'s dependency on $m, 1/p^*, K$ is important to CMAB. Otherwise, the problem reduces to a vanilla MAB with an exponentially large number of arms.

Definition 3.5 (Gap). For each super arm $S$, we define the gap
$$\Delta_S := r_S(\mu) - \max_{S' \ne S} r_{S'}(\mu \circ O_S),$$
where $\circ$ is the element-wise product (i.e., $\mu \circ O_S$ zeroes out the means of all base arms outside $O_S$). For a set $M$ of super arms, we define the gap of $M$ as $\Delta_M = \max_{S \in M} \Delta_S$.

Algorithm 1 Attack algorithm for a CMAB instance
Require: Target arm set $M$ such that $\Delta_M > 0$; CMAB algorithm Alg.
1: Find a super arm $S^* \in M$ such that $\Delta_{S^*} > 0$.
2: for $t = 1, 2, \ldots, T$ do
3:   Alg returns super arm $S^{(t)}$.
4:   The adversary returns $\tilde{X}^{(t)}_i = 0$ for all $i \in \tau_t \setminus O_{S^*}$ and keeps all other outcomes $\tilde{X}^{(t)}_i = X^{(t)}_i$ unchanged.
5: end for

Theorem 3.6 (Polynomial attackability of CMAB). Given a particular CMAB instance and the target set of super arms $M$ to attack: if $\Delta_M > 0$, then the CMAB instance is polynomially attackable; if $\Delta_M < 0$, the instance is polynomially unattackable.

We do not consider the special case of $\Delta_M = 0$ in Theorem 3.6, as the condition is ill-defined for attackability: it relies on how the CMAB algorithm breaks ties, so we omit this case in the theorem. Theorem 3.6 provides a sufficient and necessary condition for a CMAB instance to be polynomially attackable. Specifically, we prove the sufficiency condition ($\Delta_M > 0$) by constructing Algorithm 1 as an attack that spends $\mathrm{poly}(m, 1/p^*, K) \cdot m \cdot (1/\Delta_{S^*}) \cdot T^{1-\gamma}$ attack cost to pull target super arms a linear number of times. The main idea is to reduce the rewards of base arms that are not associated with the target super arm to 0. On the other hand, we show that $\Delta_M < 0$ leads to a polynomially unattackable instance, which is stronger than and covers "not polynomially attackable", and serves as the necessary condition. The polynomially unattackable instances are intrinsically robust to any attack strategy, even with the vanilla CUCB algorithm. Note that in Theorem 3.6 we do not specify the attack cost, since the cost depends not only on the CMAB instance but also on the victim CMAB algorithm.
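Before specializing the victim algorithm, we give a minimal sketch of Algorithm 1 as code. The `Alg` and `env` interfaces and the `observable_arms` helper are hypothetical (assumed to expose the super-arm selection, the triggered feedback, and $O_{S^*}$, respectively); this is an illustration of the strategy, not the paper's implementation.

```python
def attack(Alg, env, M, gap, T):
    """Reward-poisoning attack of Algorithm 1 (illustrative sketch)."""
    # Line 1: pick any target super arm with a strictly positive gap.
    S_star = next(S for S in M if gap(S) > 0)
    O_star = observable_arms(S_star)          # O_{S*}: the arms S* can trigger (hypothetical helper)
    total_cost = 0
    for t in range(T):                        # lines 2-5
        S_t = Alg.select()                    # line 3: the victim returns a super arm
        tau_t, X_t = env.play(S_t)            # environment triggers arms and draws outcomes
        X_tilde = dict(X_t)
        for i in tau_t - O_star:              # line 4: zero out every observed arm outside O_{S*}
            if X_tilde[i] != 0:
                total_cost += 1               # l0 cost counts modified entries
                X_tilde[i] = 0
        Alg.update(S_t, tau_t, X_tilde)       # the victim only ever sees the corrupted feedback
    return total_cost
```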
If we specify the victim algorithm to be CUCB, which is the most general algorithm for solving the CMAB problem, we obtain the following result directly from the proof of Theorem 3.6.

Corollary 3.7. Given a particular CMAB instance and the target set of super arms $M$ to attack, if $\Delta_M > 0$ and the victim algorithm is chosen to be CUCB, then the attack cost can be bounded by $\mathrm{poly}(m, K, 1/p^*, 1/\Delta_{S^*}) \cdot \log(T)$.

In general, the probability $p^*$ may be exponentially small, and there already exist analyses for the combinatorial semi-bandit (Wang & Chen, 2017) that can remove this $p^*$ dependence. However, there are still some differences between the original CMAB setting and our attack setting. The following example shows the necessity of the term $p^*$: if we remove the dependency on $1/p^*$ in Definitions 3.1 and 3.2, there exist instances that are neither polynomially attackable nor polynomially unattackable.

Consider the case of 3 base arms $a, b, c$ with $\mu_a = \mu_b = 1/2$, $\mu_c = 1/4$, and two super arms $S_1, S_2$. Assume that when you pull $S_1$, you observe base arm $b$ with probability 1 and arm $a$ with probability $p < 1/2$, and when you pull $S_2$, you observe $a, c$ with probability 1. Thus, the reward of $S_1$ is $R_b + \mathbb{I}[E_a] R_a$, where $E_a$ is the event that $a$ is triggered, and the reward of $S_2$ is $R_a + R_c$. Now note that if we set the time scale $T$ large enough, CUCB will not play $S_1$ for $T - o(T)$ times. However, if $T$ is not that large, it is still possible to misguide CUCB into playing $S_1$ for $T - o(T)$ times. The attack strategy is the following: whenever $a$ is observed, we set the reward of $a$ to 0, so that $\mathrm{UCB}(a) < 1/4$. Then CUCB will pull $S_1$. However, if $S_1$ has not been pulled $1/p$ times, it may not observe base arm $a$, and thus $S_1$ can still be played for $T - o(T)$ times for $T$ not large enough (say, not dependent on $1/p$).

Following Theorem 3.6, we now analyze the attackability of several practical CMAB problems.

Corollary 3.8. In the online minimum spanning tree problem, for any super arm $S$, $\Delta_S \ge 0$. In the cascading bandit problem, for any super arm $S$, $\Delta_{M = \mathrm{permutation}(S)} \ge 0$.

Notice that if the goal of the CMAB problem is to minimize an objective (cost), as in online minimum spanning tree, we change line 4 of Algorithm 1 to $\tilde{X}^{(t)}_i = 1$, i.e., we modify the base arms to have the highest cost.

Corollary 3.8 is relatively straightforward. First, every spanning tree has the same number of edges. The edges in $O_S$ (where $S$ is the target spanning tree we select) are then the edges with the smallest mean cost under the parameter $\mu \circ O_S$, since the other edges are set to have cost 1; thus, any other spanning tree will incur a larger cost. As for the cascading bandit, the selected items have the highest click probabilities under the mean vector $\mu \circ O_S$ for super arm $S$, and replacing any of them with any other item will cause the total click probability to drop.

$(\alpha, \beta)$-oracle. Different from the previous CMAB literature, we do not consider the $(\alpha, \beta)$-approximation oracle and the $(\alpha, \beta)$-regret (Chen et al., 2016) in this paper. The fundamental problem for algorithms with $(\alpha, \beta)$-regret is that they are not no-regret algorithms, since they compare not to the best reward $\mathrm{opt}(\mu)$ but to $\alpha\beta \cdot \mathrm{opt}(\mu)$. It is then always possible to change an $(\alpha, \beta)$-oracle into an $(\alpha, \beta - \epsilon)$-oracle by applying the original oracle with probability $1 - \epsilon$ and doing random exploration with probability $\epsilon$, which would make the problem unattackable. Thus, one should not hope for general characterizations of the attackability of CMAB instances under approximation oracles, and should instead discuss the attackability issue for a specific problem solved by a specific algorithm and oracle.
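On small instances, the gap of Definition 3.5 can be checked by brute force, which also makes results like Corollary 3.8 easy to verify empirically. Below is a minimal sketch for a toy instance with linear rewards and $O_S = S$ (both simplifying assumptions made for illustration):

```python
def gap(S, super_arms, mu, O):
    """Delta_S = r_S(mu) - max_{S' != S} r_{S'}(mu o O_S), for linear rewards."""
    def r(arm_set, means):
        return sum(means.get(i, 0.0) for i in arm_set)
    mu_masked = {i: (mu[i] if i in O[S] else 0.0) for i in mu}  # zero out means outside O_S
    best_other = max(r(Sp, mu_masked) for Sp in super_arms if Sp != S)
    return r(S, mu) - best_other

mu = {1: 0.8, 2: 0.7, 3: 0.6, 4: 0.2}
super_arms = [frozenset({1, 2}), frozenset({1, 3}), frozenset({3, 4})]
O = {S: set(S) for S in super_arms}
for S in super_arms:
    print(sorted(S), round(gap(S, super_arms, mu, O), 2))   # all gaps are positive here
```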
In the following, we show the attackability of the probabilistic maximum coverage problem solved by CUCB together with the greedy oracle.

Theorem 3.9. In the probabilistic maximum coverage problem (PMC), the CUCB algorithm with the greedy oracle is polynomially attackable when $\Delta_M > 0$. Besides, for any PMC instance, $\Delta_M \ge 0$.

The intuition of the proof is that although the greedy oracle is an approximation oracle, when used inside CUCB it acts like an exact oracle once the number of observations of each base arm is large enough, and thus we can follow the proof idea of Theorem 3.6.

Another interesting application of Theorem 3.6 is the simple episodic reinforcement learning (RL) setting where the transition probabilities are known, i.e., the white-box attack (Rakhsha et al., 2020; Zhang et al., 2020). Since adversarial attacks on RL are a very large topic and somewhat removed from the current paper, we only provide minimal details for this result. Please refer to Appendix B.3 for more details.

Corollary 3.10 (Informal). For the episodic reinforcement learning setting where the transition probabilities are known, the reward poisoning attack on episodic RL can be reduced to an adversarial attack on CMAB, and thus the attackability of episodic RL instances is captured by Theorem 3.6.

We sketch the reduction below. Please refer to Appendix B.3 for the formal version of Corollary 3.10 and its proof.

Proof sketch. We study an episodic Markov Decision Process (MDP), denoted by the tuple $(\mathrm{State}, \mathrm{Action}, H, P, \mu)$, where State is the set of states, Action is the set of actions, $H$ is the number of steps in each episode, $P$ is the transition kernel such that $P(\cdot \mid s, a)$ gives the transition distribution over the next state if action $a$ is taken in state $s$, and $\mu : \mathrm{State} \times \mathrm{Action} \to \mathbb{R}$ is the expected reward of the state-action pair $(s, a)$. We assume that the states State and the actions Action are finite and that the transition probability $P$ is known in advance. For simplicity, we only consider deterministic policies $\pi$. We then construct the CMAB instance as follows.

1. There are $|\mathrm{State}| \times |\mathrm{Action}|$ base arms; each base arm $(s, a)$ denotes the random reward $r(s, a)$ with unknown expected value $\mathbb{E}[r(s, a)] = \mu(s, a)$.
2. Each policy $\pi$ is a super arm, and the expected reward of super arm $\pi$ is just the value function.
3. Note that when we select a policy $\pi$ (a super arm) and execute $\pi$ for episode $t$, a random set of base arms $\tau_t$ is triggered and observed following a distribution $D$, which is determined by the episodic MDP. Define $A^\pi_h(s, a)$ as the probability that the process reaches state $s$ with $\pi(s) = a$ at step $h$; then base arm $(s, a)$ is triggered with probability $\sum_{h=1}^{H} A^\pi_h(s, a)$ in each episode, i.e., $p^{D,\pi}_{(s,a)} = \sum_{h=1}^{H} A^\pi_h(s, a)$, where $p^{D,\pi}_{(s,a)}$ is the probability of triggering arm $(s, a)$ under policy $\pi$ (defined in Section 2).

In this way, the simple version of episodic RL (white-box) is reduced to CMAB, and one can also verify that Assumption 2.1 and Assumption 2.2 hold.
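The triggering probabilities in step 3 above can be computed by a standard forward recursion over the known transition kernel. The sketch below assumes a tabular MDP with a deterministic policy and a fixed start state; all names are illustrative:

```python
import numpy as np

def trigger_probs(P, pi, H, s0):
    """p^{D,pi}_{(s,a)} = sum_{h=1}^{H} A^pi_h(s, a) for a deterministic policy pi.

    P: transition kernel of shape (S, A, S); pi: action per state; s0: start state.
    """
    S = P.shape[0]
    occ = np.zeros(S)
    occ[s0] = 1.0                  # A^pi_1(s): distribution over states at step 1
    p = np.zeros(S)                # accumulated triggering probability of arm (s, pi(s))
    for _ in range(H):
        p += occ                   # arm (s, pi(s)) is triggered whenever state s is visited
        occ = np.array([sum(occ[s] * P[s, pi[s], s2] for s in range(S)) for s2 in range(S)])
    return p                       # entry s corresponds to base arm (s, pi(s))
```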
Although the above corollary is a simple, direct application of Theorem 3.6 to white-box episodic RL, we anticipate that comparable techniques can be applied to analyze much more complex RL problems and provide an attackability characterization at the instance level. This part is left as future work.

4. Attack in Unknown Environment

In the previous section, we discussed the polynomial attackability of a CMAB instance in a known environment, i.e., when all parameters of the instance, such as the reward distributions of super arms and the outcome distributions of base arms, are given. However, in practice the environment is often unknown to the adversary. Previous studies of adversarial attacks on MAB and linear bandits showed that the attackability of an instance is the same in known and unknown environments (Liu & Shroff, 2019; Wang et al., 2022). In this section, we show that polynomial attackability can differ for the same instance between a known and an unknown environment.

Theorem 4.1. There exists a CMAB instance satisfying Assumptions 2.1 and 2.2 such that it is polynomially attackable given the parameter $\mu$ (induced from the instance's base-arm joint distribution $D$), but there exists no attack algorithm that can efficiently attack the instance against the CUCB algorithm when the parameter $\mu$ is unknown.

The following example constructs hard instances satisfying Theorem 4.1; we illustrate it with $n = 5$ in Figure 1.

[Figure 1. The hard example with $n = 5$: base arms $a_1, \ldots, a_5$ and $b_1, \ldots, b_5$, and super arms $S_0, S_1, \ldots, S_6$.]

Hard example. We construct the following CMAB instances $I_i$. There are $2n$ base arms, $\{a_i\}_{i \in [n]}$ and $\{b_i\}_{i \in [n]}$, each corresponding to a random variable ranging in $[0, 1]$. We have $\mu_{a_j} = 1 - 2\epsilon$ for all $j \ne i$ and $\mu_{a_i} = 1$, and $\mu_{b_j} = 1 - \epsilon$ for all $j \in [n]$, for some $\epsilon > 0$. We then construct $n + 2$ super arms. $S_j$ for $j \in [n]$ observes base arms $a_j$ and $b_j$, with $r_{S_j} = \mu_{a_j} + \mu_{b_j}$. There is another super arm $S_{n+1}$ that observes base arms $b_j$ for all $j \in [n]$, with $r_{S_{n+1}} = \sum_{j \in [n]} \mu_{b_j} + (1 - \epsilon)$. Besides, there is also a super arm $S_0$ with constant reward $r_{S_0} = 2 - 2\epsilon$ that does not observe any base arm. The target super arm set is $M = \{S_j\}_{j \in [n]}$. In total, we can construct $n$ hard instances $I_i$ for $i \in [n]$.

The intuition for why this example is hard is that it blocks the exploration of base arms $a_1, \ldots, a_n$ simultaneously. This means that the algorithm needs to visit exponentially many super arms and the cost is also exponential, violating the definition of polynomial attackability. Next, we give a proof sketch of Theorem 4.1 and explain the intuition in more detail; the full proof is deferred to Appendix C.

Proof sketch of Theorem 4.1. First, given an instance $I_i$, if we know the environment parameters $\mu$, we can attack the CMAB instance by setting the rewards of all base arms besides $a_i, b_i$ to 0. In the unknown setting, however, if we still want to attack the instance, we need to identify the base arm $a_j$ for which $\mu_{a_j} = 1$. Thus, the attack algorithm needs to make CUCB pull at least $\Omega(n)$ super arms in $M = \{S_j\}_{j \in [n]}$. For simplicity, assume that the attack algorithm lets CUCB pull super arms $S_1, S_2, \ldots, S_n$ in order. Now, if CUCB pulls super arm $S_i$, it means that $\mathrm{UCB}(a_i) \ge 1 - 2\epsilon$ and $\mathrm{UCB}(b_i) \ge 1 - 2\epsilon$; otherwise, $\mathrm{UCB}(a_i) + \mathrm{UCB}(b_i) < 2 - 2\epsilon$ and CUCB would choose to play $S_0$ instead. Besides, we also require $\mathrm{UCB}(b_j) \le \epsilon$ for all $j \ne i$; otherwise, $\mathrm{UCB}(b_i) + \mathrm{UCB}(b_j) + 1 - \epsilon > \mathrm{UCB}(a_i) + \mathrm{UCB}(b_i)$, and CUCB would choose $S_{n+1}$. Suppose that CUCB first pulls $S_1$ and then pulls $S_2$. When pulling $S_1$, we know that $\mathrm{UCB}(b_2) \le \epsilon$, but for CUCB to pull $S_2$, we need $\mathrm{UCB}(b_2) \ge 1 - 2\epsilon$. Note that $b_2$ can only be observed by $S_2$ and $S_{n+1}$, and without pulling $S_2$, the attack algorithm needs to let CUCB pull $S_{n+1}$ to change the UCB value of $b_2$. Then CUCB needs to pull $S_{n+1}$ at least $\Omega(1/\epsilon)$ times.
Now say the attack algorithm needs CUCB to pull $S_3$: the UCB value of $b_3$ needs to be raised to at least $1 - 2\epsilon$ from at most $\epsilon$. However, note that $b_3$ has already been observed at least $\Omega(1/\epsilon)$ times, since $S_{n+1}$ was pulled at least $\Omega(1/\epsilon)$ times to observe $b_2$. Hence $S_{n+1}$ now needs to be pulled $\Omega(1/\epsilon^2)$ times. Similarly, if the attack algorithm wants CUCB to pull $S_4$, CUCB must pull $S_{n+1}$ for $\Omega(1/\epsilon^3)$ times. Because the attack algorithm needs CUCB to visit $\Omega(n)$ super arms in $M$, the total cost is at least $(1/\epsilon)^{\Omega(n)} = (1/\epsilon)^{\Omega(m)}$, since $m = 2n$.

Despite this hardness result, an adversary can still attack CMAB for certain instances or applications. For example, an adversary can simply pick a super arm $S$ at random from the set $M$ and set $S$ as the target arm in Algorithm 1. The guarantee of this heuristic follows from Theorem 3.6:

Corollary 4.2. If $S$ satisfies $\Delta_S > 0$, then running Algorithm 1 successfully attacks the CMAB instance in the sense of Definition 3.1. (The case $\Delta_S = 0$ is very special and we do not consider it in the corollary: analyzing it requires knowing how the algorithm breaks ties between super arms with the same reward, and this information is generally unknown.)

Note that $\Delta_M \ge 0$ for the cascading bandit and online minimum spanning tree problems. Thus, even in the unknown-environment case, cascading bandits and online minimum spanning tree are mostly polynomially attackable (by applying Algorithm 1). Also, from Theorem 3.9, when solving the probabilistic maximum coverage problem using the CUCB algorithm with the greedy oracle, $\Delta_M \ge 0$, and thus it is also mostly polynomially attackable. However, for the shortest path problem, we do not know its attackability in the unknown environment, since $\Delta_M$ can be either positive or negative. Designing attack algorithms for other CMAB applications in the unknown setting is left as future work.

5. Numerical Experiments

In this section, we present the numerical experiments along with their corresponding results. We empirically evaluate our attack on four CMAB applications: probabilistic maximum coverage, online minimum spanning tree, online shortest path, and cascading bandits. Additionally, we conduct experiments on the influence maximization problem, discussed in Appendix A.2.

5.1. Experiment setup

General setup. In our study, we utilize the Flickr dataset (McAuley & Leskovec, 2012) for the probabilistic maximum coverage, online minimum spanning tree, and online shortest path problems. We downsample a subgraph from this dataset and retain only the maximum weakly connected component to ensure connectivity, which comprises 1,892 nodes and 7,052 directed edges. We incorporate the corresponding edge activation weights provided in the dataset. For cascading bandits, we employ the MovieLens small dataset (Harper & Konstan, 2015), which comprises 9,000 movies. From this dataset, we randomly sample 5,000 movies for our experiments. Each experiment is conducted a minimum of 10 times, and we report the average results and variances. Please refer to Appendix A.1 for more details.

Probabilistic maximum coverage. We use the CUCB algorithm (Chen et al., 2016) along with a greedy oracle. We consider two kinds of targets. In the first case, we calculate the average weight of all outgoing edges of a node, sort the nodes by decreasing average weight, and select nodes $K + 1, \ldots, 2K$; this selected node set is denoted the fixed target.
The second type is the random target, where we randomly sample $K$ nodes whose average weight over all outgoing edges is greater than 0.5, to ensure that no node with extremely sparse edge activation is selected.

Online minimum spanning tree. We convert the Flickr dataset to an undirected graph, where we use the average probability over both directions as the expected cost of each undirected edge. We employ a modified version of Algorithm 1, with line 4 changed to $\tilde{X}^{(t)}_i = 1$, since we are now minimizing the cost. We consider two types of targets: (1) the fixed target, where $M$ contains only the second-best minimum spanning tree; and (2) the random target, where $M$ contains the minimum spanning tree obtained on the graph with the same topology but randomized edge weights (means uniformly sampled from $[0, 1]$).

Online shortest path. We consider two types of targets $M$. In the first type, $M$ contains an unattackable path that is carefully constructed. We randomly draw a source node $s$, then use a random walk to unobserved nodes and record the path from $s$. When we reach a node $k$ and the path weight is larger than the shortest path length from $s$ to $k$ by a certain threshold $\theta$, we set $k$ as the destination node $t$ and set the path obtained from the random walk as the target path. If after 50 steps the target path still cannot be found, we re-sample the source node $s$. To avoid trivially unattackable cases, we require the shortest path to have more than one edge. In our experiments, we set the threshold $\theta$ to 0.5. We call this type of $M$ the unattackable target. The second type of $M$ is generated by randomly sampling the source node $s$ and the destination node $t$, and then finding the shortest path from $s$ to $t$ given randomized edge weights. We call this type of $M$ the random target. We sample 100 targets of both the unattackable and random kinds, and repeat the experiment 10 times for each target.

Cascading bandit. In this experiment, we test CascadeKL-UCB, CascadeUCB1 (Kveton et al., 2015), and CascadeUCB-V (Vial et al., 2022). We compute a rank-$d$ SVD approximation using the training data, which is used to compute a mapping $\phi$ from movie ratings to the probability that a user selected at random would rate the movie with 3 stars or above (please refer to Section 6 of Vial et al. (2022) for more details). In our experiments, we select the target super arm from the subset of base arms whose average click probability is greater than 0.1.
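For intuition about the semi-bandit feedback the attacker observes in this setting, here is a minimal, self-contained sketch of one round of the cascade click model (illustrative only): the user scans the recommended list in order, so only a prefix of the list is ever triggered.

```python
import numpy as np

rng = np.random.default_rng(1)

def cascade_round(item_list, mu):
    """Simulate cascade feedback: return (click position or None, observed base arms)."""
    observed = []
    for pos, item in enumerate(item_list):
        observed.append(item)                 # item is examined, so its arm is triggered
        if rng.random() < mu[item]:           # first attractive item: user clicks and stops
            return pos, observed
    return None, observed                     # no click: the whole list was examined

mu = {10: 0.2, 42: 0.6, 7: 0.1}
print(cascade_round([10, 42, 7], mu))
```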
5.2. Experiment results

[Figure 2. Cost and target arm pulls for: (2(a), 2(b)) probabilistic maximum coverage; (2(c), 2(d)) online minimum spanning tree; (2(e), 2(f)) online shortest path; (2(g), 2(h)) cascading bandits. Experiments are repeated at least 10 times and we report the averaged result and its variance.]

The experiment results are summarized in Figure 2, which shows the cost and the number of target super arm plays under different settings.

Probabilistic maximum coverage. Figs. 2(a) and 2(b) show the results of our experiments on the two types of targets. We observe that the number of target arm pulls increases linearly after around 5,000 iterations, while the cost grows sublinearly. Considering the large number of nodes and edges, the cost is clearly polynomial in the number of base arms. This result validates Theorem 3.9: probabilistic maximum coverage is polynomially attackable. We also report the variance in the figures, highlighting that attacking a random target results in larger variances for both cost and target arm pulls.

Online minimum spanning tree. As demonstrated in Figs. 2(c) and 2(d), the total cost is sublinear and the number of rounds pulling the target arm is linear. This result aligns with Corollary 3.8: since every spanning tree has the same number of edges and the reward of a super arm is a linear combination of base arms, $\Delta_M \ge 0$. Thus, the online minimum spanning tree problem is always attackable, as suggested by the experiments on a random target, provided there is only one minimum spanning tree.

Online shortest path. In Figs. 2(e) and 2(f), we show that for the random targets $M$, the total cost is sublinear and the target arm pulls are linear. For the unattackable targets $M$, the total cost is linear while the target arm pulls are almost constant, which means they are indeed unattackable. We show an example from the Flickr dataset in Figure 3.

[Figure 3. An unattackable shortest path from $s$ to $t$ in the Flickr dataset. Optimal path: $(s, a, b, e, t)$. Target path: $(s, a, v, c, d, t)$. The cost on $(b, c, d, t)$ is larger than the number of edges on $(b, e, t)$, and the attacker cannot fool the algorithm into playing the target path.]

Cascading bandits. We can clearly observe from Figs. 2(g) and 2(h) that the number of times the target arm is played increases linearly, while the cost of the attack is sublinear. Considering the large number of base arms $m$, the results validate Corollary 3.8: cascading bandits are polynomially attackable.

6. Discussions, Limitation, and Open Problems

Attack algorithms with better cost. The findings presented in Theorem 3.6 comprehensively delineate the polynomial attackability of CMAB instances, wherein the susceptible instances are vulnerable to the attack strategy outlined in Algorithm 1. However, the attack cost associated with Algorithm 1 may not be optimal. There is significant interest in developing attack algorithms that achieve lower attack costs than Algorithm 1. Furthermore, establishing a lower bound on the attack costs for CMAB instances is an intriguing direction for future research.

More discussions on black-box settings. Theorem 3.6 characterizes the unattackable CMAB instances, i.e., the (intrinsic) robustness of CMAB instances in the known environment. On the other hand, it is interesting to consider whether there is any robustness guarantee for CMAB instances in the unknown environment. For CMAB instances in the unknown environment, the only guarantee (derived from this paper) is that if an instance of a certain problem is always attackable in the known environment, then it must be attackable in the unknown environment.
For example, in the online minimum spanning tree and cascading bandit problems, the instances always have a nonnegative gap (Definition 3.5, Corollary 3.8), and thus are always attackable in the unknown case. For other problem instances, we might need to discuss their attackability in the unknown setting case by case, which is probably very hard. Characterizing the robustness guarantee for CMAB instances in the unknown setting is an interesting open problem.

Limitation. One limitation of our findings is that our attackability characterization is limited to one threat model: reward poisoning attacks. The characterization cannot be directly generalized to environment poisoning attacks (Rakhsha et al., 2020; Sun et al., 2021; Xu et al., 2021; Rangi et al., 2022), where the adversary can directly change the environment, such as the reward function. Environment poisoning is more powerful than reward poisoning, and perturbations of the environment would invalidate our analysis regarding $\Delta_M$. On the other hand, our attackability characterization can be applied to action poisoning attacks (Liu & Lai, 2020), which in principle are still reward poisoning where the reward cannot be changed arbitrarily but must be replaced by another action's reward. We leave the study of the attackability of CMAB under other threat models as future work.

7. Conclusion

In this paper, we provide the first study of the attackability of CMAB and propose to consider polynomial attackability, a definition accommodating the exponentially large number of super arms in CMAB. We first provide a sufficient and necessary condition for the polynomial attackability of CMAB. Our analysis reveals a surprising fact: the polynomial attackability of a particular CMAB instance is conditioned on whether the bandit instance is known or unknown to the adversary. In addition, we present a hard example such that the instance is polynomially attackable if the environment is known to the adversary and polynomially unattackable if it is unknown. Our result reveals that adversarial attacks on CMAB are difficult in practice, and that a general attack strategy for any CMAB instance does not exist, since the environment is mostly unknown to the adversary. We propose an attack algorithm and show that the attack is guaranteed to be successful and efficient if the instance is attackable. Extensive experimental results on real-world datasets from four applications of CMAB validate the effectiveness of our attack algorithm. As future work, we will further investigate the impact of probabilistically triggered arms on attackability, e.g., the attackability of online influence maximization under different diffusion models. It is also important to generalize the hardness result in the unknown environment, e.g., via a general framework reducing unattackable CMAB problems to the hardness example. We also leave the study of the attackability of CMAB under other threat models as future work.

Impact Statement

This research studied the vulnerability of combinatorial bandits against reward poisoning attacks. Our findings on attackability characterization and the attack algorithm have the potential to inspire the design of robust CMAB algorithms, which would benefit many real-world applications, including the ones studied in this paper.

References

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.
Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pp. 151-159. PMLR, 2013.

Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746-1778, 2016.

Dong, J., Li, K., Li, S., and Wang, B. Combinatorial bandits under strategic manipulations. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 219-229, 2022.

Garcelon, E., Roziere, B., Meunier, L., Tarbouriech, J., Teytaud, O., Lazaric, A., and Pirotta, M. Adversarial attacks on linear contextual bandits, 2020.

Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 1562-1578. PMLR, 2019.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1-19, 2015.

Jun, K.-S., Li, L., Ma, Y., and Zhu, J. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3640-3649, 2018.

Kveton, B., Wen, Z., Ashkan, A., Eydgahi, H., and Eriksson, B. Matroid bandits: Fast combinatorial optimization with learning. arXiv preprint arXiv:1403.5045, 2014.

Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767-776. PMLR, 2015.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Liu, F. and Shroff, N. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pp. 4042-4050, 2019.

Liu, G. and Lai, L. Action-manipulation attacks on stochastic bandits. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3112-3116. IEEE, 2020.

Liu, K. and Zhao, Q. Adaptive shortest-path routing under unknown and stochastically varying link states. In 2012 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt), pp. 232-237. IEEE, 2012.

Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114-122. ACM, 2018.

McAuley, J. and Leskovec, J. Image labeling on a network: Using social-network metadata for image classification. In Computer Vision - ECCV 2012, pp. 828-841. Springer, 2012.

Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and Singla, A. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, pp. 7974-7984. PMLR, 2020.

Rakhsha, A., Zhang, X., Zhu, X., and Singla, A. Reward poisoning in reinforcement learning: Attacks against unknown learners in unknown environments. arXiv preprint arXiv:2102.08492, 2021.

Rangi, A., Xu, H., Tran-Thanh, L., and Franceschetti, M. Understanding the limits of poisoning attacks in episodic reinforcement learning. arXiv preprint arXiv:2208.13663, 2022.

Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1-286, 2019.

Sun, Y., Huo, D., and Huang, F. Vulnerability-aware poisoning mechanism for online RL with unknown dynamics. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=9r30XCjf5Dt.
Tang, Y., Shi, Y., and Xiao, X. Influence maximization in near-linear time: A martingale approach. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1539-1554, 2015.

Vial, D., Sanghavi, S., Shakkottai, S., and Srikant, R. Minimax regret for cascading bandits. arXiv preprint arXiv:2203.12577, 2022.

Wang, H., Xu, H., and Wang, H. When are linear stochastic bandits attackable? In International Conference on Machine Learning, pp. 23254-23273. PMLR, 2022.

Wang, Q. and Chen, W. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. Advances in Neural Information Processing Systems, 30, 2017.

Wen, Z., Kveton, B., Valko, M., and Vaswani, S. Online influence maximization under independent cascade model with semi-bandit feedback. Advances in Neural Information Processing Systems, 30, 2017.

Xu, H., Wang, R., Raizman, L., and Rabinovich, Z. Transferable environment poisoning: Training-time attack on reinforcement learning. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems, pp. 1398-1406, 2021.

Zhang, X., Ma, Y., Singla, A., and Zhu, X. Adaptive reward-poisoning attacks against reinforcement learning. arXiv preprint arXiv:2003.12613, 2020.

A. Additional Experiments

In this section, we provide additional information regarding the experiments. In Appendix A.1, we give more details on the experiments in Section 5. In Appendix A.2, we show the results of our experiments on online influence maximization.

A.1. More experiment details

For the online shortest path, online minimum spanning tree, and probabilistic maximum coverage problems, we use the Flickr dataset (McAuley & Leskovec, 2012). We first select the fraction of nodes whose degree is in the range $[m, n]$; here, we take $m = 2$ and $n = 75$. We then clean up the graph by selecting the largest connected component, which has 1,892 nodes and 7,052 edges. For cascading bandits, we use the small version of the MovieLens dataset (Harper & Konstan, 2015), which contains 9,000 movies. Each movie has a rating (in the range of one to five stars) given to it by several users. To convert the ratings from stars to an average click probability for any user, we take the click probability to be close to the probability that the movie has a rating of 3 stars or above. We follow Vial et al. (2022) to generate a mapping $\phi$ from ratings in stars to the average click probability for a user selected at random. However, as the average click probabilities are sparse, we use only the top 1,000 movies in our experiments.

A.2. Experiment results on Influence Maximization

Here we provide additional experimental results for the online influence maximization (IM) problem.

Basic influence maximization settings. The IM problem involves selecting (activating) an initial set of $K$ nodes. Each node $u$ can attempt to activate its neighbor $v$ with probability $p_{u,v}$. If node $v$ was not activated by node $u$ at time instant $i$, it cannot be activated by $u$ at any later time instant $i+1, i+2, \ldots$. This is called one step of diffusion. We continue this process until a time step $T_0$ such that no new node can be activated at time step $T_0 + 1$. The goal of the IM problem is to select a set of $K$ nodes that maximizes the final number of activated nodes.
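The diffusion process just described is the independent cascade model; the following is a minimal sketch of one realization (illustrative toy code, not the experiment pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

def diffuse(seeds, p):
    """One realization of independent-cascade diffusion.

    seeds: initially activated nodes; p: dict mapping edge (u, v) -> activation probability.
    Each edge out of a newly active node gets a single activation attempt,
    matching the one-shot rule above.
    """
    active, frontier = set(seeds), set(seeds)
    while frontier:                           # one diffusion step per iteration
        new = set()
        for u in frontier:
            for (a, v), prob in p.items():
                if a == u and v not in active and rng.random() < prob:
                    new.add(v)
        active |= new
        frontier = new                        # stops once no new node is activated
    return active                             # the reward is |active|

p = {(0, 1): 0.9, (1, 2): 0.5, (0, 3): 0.1}
print(len(diffuse({0}, p)))
```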
The goal in online IM is to select a set of $K$ nodes at each round $t$ so as to minimize the total regret. At each round, the player observes the diffusion process and receives the number of activated nodes as the reward. Note that the IM problem is very similar to the probabilistic maximum coverage problem; the only difference is the length of the diffusion process, since the probabilistic maximum coverage problem can be viewed as influence maximization with only one diffusion step.

Attack heuristics. Note that the IM problem is NP-hard and only has a $(1 - 1/e)$-approximation oracle. Thus, we do not have an attack strategy with a theoretical guarantee from Theorem 3.6. However, we can still use heuristics to attack the online influence maximization problem. We define a set $S^\ell_{\mathrm{ex}}$ that represents the extended target set. This set includes the target nodes and all other nodes $v$ within distance $\ell$ of any node $u$ in the target set $S$:
$$S' = \{v \in V : \exists u \in S,\ d(u, v) < \ell\}, \qquad S^\ell_{\mathrm{ex}} = S \cup S',$$
where $d(u, v)$ is the distance between nodes $u$ and $v$. We do not attack an edge between $u$ and $v$ if either of the nodes lies in $S^\ell_{\mathrm{ex}}$; for all other observed edges, we modify the realization to 0 (i.e., attack it). Note that if $\ell = \infty$, then $S^\ell_{\mathrm{ex}}$ contains all nodes, and thus there is no viable attack. If $\ell = 1$, then $S^\ell_{\mathrm{ex}}$ only includes the target set, the reward feedback is very similar to the probabilistic maximum coverage problem, and the expected value after the attack for each arm is exactly the same as under the probabilistic maximum coverage attack strategy (Theorem 3.9). This heuristic simplifies the attacker's task, as there are more nodes in the extended set than in the target set, making the attack easier. In all the experiments, we use an $(\alpha, \beta)$-approximation oracle based on Tang et al. (2015) (the IMM oracle), and Figure 4 shows the corresponding results. Experiments are repeated at least 5 times and we report the averaged result and its variance.
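For reference, the extended target set $S^\ell_{\mathrm{ex}}$ used by this heuristic can be computed with a multi-source breadth-first search; a minimal sketch (illustrative only, assuming an undirected adjacency-list graph):

```python
from collections import deque

def extended_target_set(S, adj, ell):
    """S^ell_ex: the target nodes plus all nodes within distance < ell of the target set."""
    dist = {u: 0 for u in S}                  # multi-source BFS: dist(v) = min_{u in S} d(u, v)
    q = deque(S)
    while q:
        u = q.popleft()
        if dist[u] + 1 >= ell:                # neighbors would be at distance >= ell
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)                          # equals S for ell = 1, and grows with ell

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(extended_target_set({0}, adj, 2))       # {0, 1}: node 1 is at distance 1 < 2
```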
[Figure 4. Cost and percentage of base arms selected for: (4(a), 4(b)) influence maximization; (4(c), 4(d)) probabilistic maximum coverage.]

Discussion of the results. Figures 4(b) and 4(d) show the percentage of target nodes played in each round for online influence maximization and probabilistic maximum coverage, respectively. This value represents the percentage of base arms in the target super arm that are selected at a given time instant. For example, if we consider a target set containing 5 nodes (base arms), "20% of base arms selected" would imply that at that round, 1 node belonging to the target set was played.

From Figure 4, we can clearly see that although the attack cost in each round decreases as the number of iterations increases, the number of target sets played is a constant 0. (In theory, under our attack strategy, the attack cost can never be sublinear here: no matter what $\ell$ we choose, there is a constant probability, independent of the time $t$, that the diffusion reaches nodes outside $S^\ell_{\mathrm{ex}}$. However, if we choose $\ell$ large enough, the probability of attacking decreases, and if we choose the hyperparameter $\ell$ appropriately so that the probability of attacking is $o(1)$, the attack cost can still be sublinear with respect to a given time scale $T$.) This may happen because (1) attacking online influence maximization is hard, and (2) the number of iterations is not large enough. To show that attacking influence maximization is harder, we also show the result of probabilistic maximum coverage with only $T = 5000$ (Figure 4). Although for probabilistic maximum coverage there is also no target set pulled in the first 5,000 rounds, we can clearly observe that the percentage of target nodes selected is increasing, and there is a clear trend toward selecting all the nodes in the target arm set once the number of iterations is large enough (which is the case in Figure 2(b)). However, for the online influence maximization problem, the algorithm selects none or 20% of the target node set for $\ell = 1, 2, 3$ for the majority of the experiment, and there is no trend indicating that the number of target nodes selected would increase. Note that when $\ell = 1$, the reward structures of influence maximization and probabilistic maximum coverage are nearly the same, the difference being the oracle. This finding corroborates our claim that when the oracle for a CMAB instance is not exact (an $\alpha$-approximation oracle), we need to analyze the instance case by case. Although we do not have a promising attack strategy for online influence maximization, this does not mean it is unattackable. Developing efficient attack methods for the online influence maximization problem, or proving its intrinsic robustness, is a very interesting direction for future work.

B. Missing proofs in Section 3

In this section, we give the missing proofs from Section 3. We first prove Theorem 3.6 in Appendix B.1, which characterizes the attackability of different CMAB instances under an exact oracle. Then, in Appendix B.2, we prove Theorem 3.9, which involves an approximation oracle.

B.1. Proof of Theorem 3.6

In this section, we prove Theorem 3.6, which is restated below.

Theorem 3.6 (Polynomial attackability of CMAB). Given a particular CMAB instance and the target set of super arms $M$ to attack: if $\Delta_M > 0$, then the CMAB instance is polynomially attackable; if $\Delta_M < 0$, the instance is polynomially unattackable.

The proof of Theorem 3.6 is divided into two cases: $\Delta_M > 0$ and $\Delta_M < 0$. We first show the case $\Delta_M > 0$, and then prove the theorem for $\Delta_M < 0$.

Proof of Theorem 3.6 when $\Delta_M > 0$. The proof of this direction is relatively straightforward: we only need to show that there exists an attack strategy that successfully attacks the CMAB instance when the number of rounds $T$ is polynomially large. The attack strategy is exactly Algorithm 1. Note that if $\Delta_M > 0$, we can find $S^*$ such that $\Delta_{S^*} > 0$. The regret of the CMAB algorithm Alg is bounded by $\mathrm{poly}(m, 1/p^*, K) \cdot T^{1-\gamma}$ with high probability, so in the modified CMAB environment with $\mu' = \mu \circ O_{S^*}$, Alg will play super arms other than $S^*$ at most $\mathrm{poly}(m, 1/p^*, K) \cdot T^{1-\gamma} / \Delta_{S^*}$ times. Each time Alg pulls a super arm other than $S^*$, the attack algorithm (Algorithm 1) might need to modify the rewards, leading to a per-round cost bounded by $m$. Therefore, the total budget can be bounded by $\mathrm{poly}(m, 1/p^*, K) \cdot m \cdot (1/\Delta_{S^*}) \cdot T^{1-\gamma}$. We conclude the proof by choosing $T^*$ such that $(T^*)^{\gamma/2} \ge \mathrm{poly}(m, 1/p^*, K) \cdot m \cdot (1/\Delta_{S^*})$ and setting $\gamma' = \gamma/2$.
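The cost accounting in this argument can be written out in one chain; the display below merely restates the proof's bounds, with $N_{\mathrm{off}}$ denoting the number of off-target pulls:

```latex
\begin{align*}
N_{\mathrm{off}}
  &\le \frac{\mathrm{poly}(m, 1/p^*, K)\, T^{1-\gamma}}{\Delta_{S^*}}
  && \text{(regret bound in the environment $\mu \circ O_{S^*}$)}\\
C(T) &\le m \cdot N_{\mathrm{off}}
  \le \mathrm{poly}(m, 1/p^*, K) \cdot m \cdot \frac{T^{1-\gamma}}{\Delta_{S^*}}
  && \text{(at most $m$ outcomes corrupted per off-target pull)}\\
C(T) &\le T^{\gamma/2} \cdot T^{1-\gamma} = T^{1-\gamma/2}
  && \text{(for $T \ge T^*$, where $(T^*)^{\gamma/2} \ge \mathrm{poly}(m, 1/p^*, K)\, m / \Delta_{S^*}$)}
\end{align*}
```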
Proof of Theorem 3.6 when $\Delta_M < 0$. Now we prove the other direction: when $\Delta_M < 0$, the CMAB instance is polynomially unattackable (Definition 3.2). The intuition of the proof is as follows: once every base arm $i \in \bigcup_{S \in M} O_S$ has been observed a certain number of times (say, $T^{1-\gamma'}/2$ times each), CUCB will no longer choose to play the super arms in M. Since the budget is bounded by $B \le T^{1-\gamma'}$, the upper confidence bound maintained by CUCB after the attacker's corruption must still be close to the true empirical mean; hence, because $\Delta_M < 0$, the CUCB algorithm will not try to play the arms in M anymore. The remainder of this section formalizes this intuition.

The proof is organized as follows: we first present the necessary notation, technical propositions, definitions, and lemmas; we then formalize the intuition in the main lemma (Lemma B.8); finally, we conclude the proof based on Lemma B.8.

We use $\tilde{\mu}^{(t)}_i$ to denote the mean of base arm i estimated by the CUCB algorithm (possibly after the attacker's corruption), $T^{(t)}_i$ to denote the number of times base arm i has been observed before round t, and $\rho^{(t)}_i$ to denote the confidence radius of base arm i. Because we are interested in a high-probability event, we choose

$$\rho^{(t)}_i = \sqrt{\frac{\ln(4mt^3/\delta)}{2T^{(t)}_i}}.$$

We also define $\hat{\mu}^{(t)}_i$ to be the empirical mean generated by the environment before the attacker's corruption (i.e., the empirical mean the attacker observes from the environment). Finally, we define $\mathrm{UCB}_t = \min\{\tilde{\mu}^{(t)} + \rho^{(t)}, \mathbf{1}\}$ (coordinatewise) to be the upper confidence bound at round t. A small numerical sketch of these quantities is given below.
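As a concrete illustration of these definitions (not part of the proof), the confidence radius and UCB value can be computed as follows:

```python
import math

def confidence_radius(T_i, t, m, delta):
    """rho_i^(t) = sqrt(ln(4 m t^3 / delta) / (2 T_i)); infinity if arm i is
    unobserved, matching the convention in Definition B.3 below."""
    if T_i == 0:
        return float("inf")
    return math.sqrt(math.log(4 * m * t**3 / delta) / (2 * T_i))

def ucb_value(mu_tilde_i, T_i, t, m, delta):
    """UCB_t(i) = min{ mu_tilde_i + rho_i^(t), 1 }, where mu_tilde_i is the
    (possibly corrupted) empirical mean maintained by CUCB."""
    return min(mu_tilde_i + confidence_radius(T_i, t, m, delta), 1.0)
```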
First, we state two standard concentration inequalities (Propositions B.1 and B.2) and several technical results.

Proposition B.1 (Hoeffding's inequality). Suppose $X_i \in [0, 1]$ for all $i \in [n]$ and the $X_i$ are independent. Then

$$\Pr\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n X_i\Big]\right| \ge \epsilon\right\} \le 2\exp(-2n\epsilon^2).$$

Proposition B.2 (Multiplicative Chernoff bound). Suppose the $X_i$ are Bernoulli variables for all $i \in [n]$, with $\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] \ge \mu$ for every $i \le n$. Let $Y = X_1 + \cdots + X_n$. Then for any $\delta' \in (0,1)$,

$$\Pr\{Y \le (1-\delta')n\mu\} \le \exp\left(-\frac{\delta'^2 n\mu}{2}\right).$$

Next, we define the event that the empirical means of the base arms returned by the environment are close to the true means µ. This definition is similar to that in Wang & Chen (2017).

Definition B.3 (Sampling is nice). We say that the sampling is nice at the beginning of round t if for every arm $i \in [m]$ we have $|\hat{\mu}^{(t)}_i - \mu_i| < \rho^{(t)}_i$, where $\rho^{(t)}_i = \sqrt{\ln(4mt^3/\delta)/(2T^{(t)}_i)}$ (and $\rho^{(t)}_i = \infty$ if $T^{(t)}_i = 0$). We denote this event by $\mathcal{N}^s_t$.

The following lemma shows that this event holds with high probability. The proof is a straightforward application of Hoeffding's inequality (Proposition B.1).

Lemma B.4. For each round $t \ge 1$, we have $\Pr\{\neg\mathcal{N}^s_t\} \le \frac{\delta}{2t^2}$. Moreover, $\Pr\{\exists t\,(\neg\mathcal{N}^s_t)\} \le \delta$.

Proof. For each round $t \ge 1$,

$$\Pr\{\neg\mathcal{N}^s_t\} = \Pr\left\{\exists i \in [m]:\ |\hat{\mu}^{(t)}_i - \mu_i| \ge \rho^{(t)}_i\right\} \le \sum_{i \in [m]}\sum_{k=1}^{t-1} \Pr\left\{T^{(t)}_i = k,\ |\hat{\mu}^{(t)}_i - \mu_i| \ge \sqrt{\tfrac{\ln(4mt^3/\delta)}{2k}}\right\}.$$

Applying Hoeffding's inequality to each term,

$$\Pr\left\{T^{(t)}_i = k,\ |\hat{\mu}^{(t)}_i - \mu_i| \ge \sqrt{\tfrac{\ln(4mt^3/\delta)}{2k}}\right\} \le 2e^{-2k\cdot\frac{\ln(4mt^3/\delta)}{2k}} = \frac{\delta}{2mt^3}.$$

Plugging in this bound gives $\Pr\{\neg\mathcal{N}^s_t\} \le \frac{\delta}{2t^2}$. Taking a union bound over t, $\Pr\{\exists t\,(\neg\mathcal{N}^s_t)\} \le \sum_{t\ge1}\frac{\delta}{2t^2} \le \delta$, and we conclude the proof.

Because of probabilistically triggered arms, a base arm may be observed only with some probability when a super arm is pulled. For the sake of a simpler analysis, we define the following notion of a counter, which tracks how many times a base arm could have been triggered by the pulled super arms.

Definition B.5 (Counter). In an execution of the CUCB algorithm, define the counter

$$N_{i,t} = \sum_{s=0}^{t-1}\mathbb{I}\left\{p^{D,S_s}_i > 0\right\},$$

i.e., the number of rounds before t in which the pulled super arm could trigger base arm i.

The following event (Definition B.6) bridges the observation count of a base arm and this counter. Lemma B.7 then shows that the event holds with high probability; its proof is a direct application of the multiplicative Chernoff bound (Proposition B.2).

Definition B.6 (Triggering is nice). We say that the triggering is nice at the beginning of round t if for every arm i, as long as $\frac{1}{4}N_{i,t-1}p^* \ge \ln(4mt^3/\delta)$, we have $T^{(t-1)}_i \ge \frac{1}{4}N_{i,t-1}p^*$. We denote this event by $\mathcal{N}^t_t$.

Lemma B.7. For every round $t \ge 1$, $\Pr\{\neg\mathcal{N}^t_t\} \le \frac{\delta}{4t^2}$. Consequently, $\Pr\{\exists t\,(\neg\mathcal{N}^t_t)\} \le \delta$.

Proof. We apply the multiplicative Chernoff bound (Proposition B.2) with $\delta' = 3/4$. As long as $N_{i,t-1}p^* \ge 4\ln(4mt^3/\delta)$, we have

$$\Pr\left\{T^{(t-1)}_i \le \frac{1}{4}N_{i,t-1}p^*\right\} \le \exp\left(-\frac{9}{32}N_{i,t-1}p^*\right) \le \exp\left(-\frac{9}{8}\ln(4mt^3/\delta)\right) \le \frac{\delta}{4mt^3}.$$

Applying a union bound over $i \in [m]$ and over the at most t possible values of $N_{i,t-1}$ gives $\Pr\{\neg\mathcal{N}^t_t\} \le \frac{\delta}{4t^2}$; a further union bound over t gives $\Pr\{\exists t\,(\neg\mathcal{N}^t_t)\} \le \sum_{t\ge1}\frac{\delta}{4t^2} \le \delta$.

Given the preceding technical lemmas, we now state the main lemma used to prove Theorem 3.6 when $\Delta_M < 0$. It upper-bounds the number of times CUCB plays arms in M under a budget constraint B.

Lemma B.8. Suppose the adversary has budget $B = o(T)$ and assume that $\mathcal{N}^s_t$ holds. For any $S \in M$, if $\Delta_M < 0$ and S is pulled in round t, then there must exist $i \in O_S$ such that

$$T^{(t)}_i \le T_0 := \frac{6KB_sB}{|\Delta_M|} + \frac{18K^2B_s^2\ln(4mT^3/\delta)}{\Delta_M^2},$$

where $B_s$ denotes the bounded-smoothness constant of Assumption 2.2 (for 1-TPM instances, $B_s = 1$).

Proof. We argue by contradiction: suppose there is a round t in which S is pulled and $T^{(t)}_i \ge T_0$ for all $i \in O_S$. Since the adversary only has budget B, each arm i can receive a corrupted value at most B times; hence for all $i \in O_S$,

$$|\tilde{\mu}^{(t)}_i - \hat{\mu}^{(t)}_i| \le \frac{B}{T^{(t)}_i} \le \frac{B}{T_0} \le \frac{|\Delta_S|}{6KB_s},$$

where we use the fact that $|\Delta_S| \ge |\Delta_M|$. Moreover, given $\mathcal{N}^s_t$, for all $i \in O_S$,

$$|\hat{\mu}^{(t)}_i - \mu_i| \le \rho^{(t)}_i = \sqrt{\frac{\ln(4mt^3/\delta)}{2T^{(t)}_i}} \le \frac{|\Delta_S|}{6KB_s}.$$

Then, under $\mathcal{N}^s_t$,

$$(\mathrm{UCB}_t - \mu)\circ O_S = (\tilde{\mu}^{(t)} + \rho^{(t)} - \mu)\circ O_S \le \left(\hat{\mu}^{(t)} + \frac{|\Delta_S|}{6KB_s}\mathbf{1} + \rho^{(t)} - \mu\right)\circ O_S \le \left(\frac{|\Delta_S|}{6KB_s}\mathbf{1} + 2\rho^{(t)}\right)\circ O_S \le \frac{|\Delta_S|}{2KB_s}\,\mathbf{1}\circ O_S,$$

and bounded smoothness (Assumption 2.2) yields $r_S(\mathrm{UCB}_t) \le r_S(\mu) + \frac{|\Delta_S|}{2}$. Now let $S'$ denote a super arm such that $r_{S'}(\mu\circ O_S) = r_S(\mu) + |\Delta_S|$ (such an $S'$ exists since $\Delta_S < 0$). Under $\mathcal{N}^s_t$, by monotonicity (Assumption 2.1) and the bounds above,

$$r_{S'}(\mathrm{UCB}_t) \ge r_{S'}(\mathrm{UCB}_t\circ O_S) \ge r_{S'}\left(\left(\mu - \frac{|\Delta_S|}{6KB_s}\mathbf{1}\right)\circ O_S\right) \ge r_{S'}(\mu\circ O_S) - \frac{|\Delta_S|}{6} = r_S(\mu) + \frac{5|\Delta_S|}{6}.$$

Combining the two displays, $r_{S'}(\mathrm{UCB}_t) \ge r_S(\mathrm{UCB}_t) + \frac{|\Delta_S|}{3}$, which means that CUCB will not play S, contradicting our assumption that S is pulled in round t. Thus there exists $i \in O_S$ with $T^{(t)}_i \le T_0$, completing the proof.

We are now ready to prove Theorem 3.6 when $\Delta_M < 0$.

Proof of Theorem 3.6 when $\Delta_M < 0$. First, assume that $\mathcal{N}^s_t$ and $\mathcal{N}^t_t$ hold for all t, which happens with probability at least $1 - 2\delta$ by Lemmas B.4 and B.7. Under $\mathcal{N}^s_t$, if some $S \in M$ is pulled, then by Lemma B.8 at least one base arm $i \in O_S$ has been observed fewer than $T_0$ times. Under $\mathcal{N}^t_t$, when S is pulled at time t, at least one of its base arms $i \in O_S$ therefore satisfies $N_{i,t} \le 4T_0/p^*$. Since every pull of some $S \in M$ increases $\sum_i N_{i,t}$ by at least 1 for such an arm, and there are m arms in total, CUCB plays super arms belonging to M at most $4mT_0/p^*$ times in total. Note that if $B \le T^{1-\gamma'}$, then $T_0$ is bounded by $\mathrm{poly}(m, 1/p^*, K, 1/|\Delta_M|, \log(1/\delta))\cdot T^{1-\gamma'}$. Thus there exists a polynomial $T^* = \mathrm{poly}(m, 1/p^*, K, 1/|\Delta_M|, \log(1/\delta))$ such that for all $T > T^*$ we have $4mT_0/p^* \le \frac{T}{2}$, and we conclude the proof.

B.2. Proof of Theorem 3.9

In this section, we prove Theorem 3.9.
The intuition behind Theorem 3.9 is that, although the Greedy oracle is only an approximation oracle, when run inside CUCB it behaves like an exact oracle once each base arm has been observed enough times; we can therefore follow the proof idea of Theorem 3.6. We first present some technical lemmas that leverage the submodular structure of the reward function.

Lemma B.9 (Chen et al. (2013)). Probabilistic maximum coverage satisfies 1-TPM bounded smoothness (Assumption 2.2).

Lemma B.10. Let $S = \{u_1, u_2, \ldots, u_k\}$ denote a super arm of the probabilistic maximum coverage problem. Then for any $u \in S$ and any set $S' \subseteq S$ with $u \notin S'$, we have

$$r_{S'\cup\{u\}}(\mu\circ O_S) - r_{S'}(\mu\circ O_S) \ge \Delta_S.$$

Moreover, $\Delta_S \ge 0$.

Proof. First, for any $u \in S$ and $u' \notin S$, by the definition of the gap we have $r_S(\mu\circ O_S) - r_{(S\setminus\{u\})\cup\{u'\}}(\mu\circ O_S) \ge \Delta_S$. Observe that $O_S$ assigns 0 to all edges without an endpoint in S, and thus $r_{(S\setminus\{u\})\cup\{u'\}}(\mu\circ O_S) = r_{S\setminus\{u\}}(\mu\circ O_S)$. Then, because the reward function of the probabilistic maximum coverage problem is submodular in the seed set L (Chen et al., 2013), for any $S' \subseteq S$ with $u \notin S'$,

$$r_{S'\cup\{u\}}(\mu\circ O_S) - r_{S'}(\mu\circ O_S) \ge r_S(\mu\circ O_S) - r_{S\setminus\{u\}}(\mu\circ O_S) \ge \Delta_S.$$

Next we show that $\Delta_S \ge 0$ for any S. Let $S'$ be the super arm attaining $\max_{S''\neq S} r_{S''}(\mu\circ O_S)$. Because any $u \notin S$ does not increase the reward under the mean vector $\mu\circ O_S$, we have $r_{S'}(\mu\circ O_S) = r_{S'\cap S}(\mu\circ O_S)$. Then, because the reward function is monotone in the seed set L,

$$\Delta_S = r_S(\mu\circ O_S) - r_{S'}(\mu\circ O_S) = r_S(\mu\circ O_S) - r_{S'\cap S}(\mu\circ O_S) \ge 0,$$

and we conclude the proof.

The following lemma is the main lemma in the proof of Theorem 3.9. It states that if CUCB does not choose the target super arm S, then some base arm has not been observed enough times, and that arm will be observed.

Lemma B.11. Suppose $\bigcap_{t=1}^T \mathcal{N}^s_t$ holds during the execution of Algorithm 1, and let S be a super arm with $\Delta_S > 0$. If the Greedy oracle does not choose S at time t, then there must be a selected node such that some base arm i incident to that node has $\rho^{(t)}_i > \frac{\Delta_S}{4m}$.

Proof. Suppose the Greedy oracle chooses a super arm $\tilde{S} \neq S$, and let $u'$ be the first node picked by the Greedy oracle that does not belong to S. Let A be the set of nodes picked before $u'$, and let u be any node in $S\setminus A$. Denote by $M_A$, $M_u$, and $M_{u'}$ the sets of base arms (edges) incident to the node set A, the node u, and the node $u'$, respectively, and let $\mathrm{UCB}_t$ denote the upper confidence values of the arms. We show that there exists a base arm i incident to a node in $\{u'\}\cup A$ such that $\rho^{(t)}_i > \frac{\Delta_S}{4m}$.

We argue by contradiction: suppose every base arm i incident to a node in $\{u'\}\cup A$ has $\rho^{(t)}_i \le \frac{\Delta_S}{4m}$.

We first lower-bound $r_{A\cup\{u\}}(\mathrm{UCB}_t) - r_A(\mathrm{UCB}_t)$. Under $\bigcap_{t=1}^T \mathcal{N}^s_t$ we have $\mathrm{UCB}_t \ge \mu\circ O_S$, so by monotonicity (Assumption 2.1), $r_{A\cup\{u\}}(\mathrm{UCB}_t) \ge r_{A\cup\{u\}}(\mu\circ O_S)$. Also, under $\bigcap_{t=1}^T \mathcal{N}^s_t$ we have $\mathrm{UCB}_t \le \mu\circ O_S + 2\rho^{(t)}$. Since $r_A$ depends only on the arms in $M_A$, Lemma B.9 (1-TPM bounded smoothness) gives

$$r_A(\mathrm{UCB}_t) \le r_A\big((\mu\circ O_S + 2\rho^{(t)})\circ M_A\big) \le r_A(\mu\circ O_S) + 2m\cdot\frac{\Delta_S}{4m} = r_A(\mu\circ O_S) + \frac{\Delta_S}{2}.$$

Combining the two bounds and applying Lemma B.10,

$$r_{A\cup\{u\}}(\mathrm{UCB}_t) - r_A(\mathrm{UCB}_t) \ge r_{A\cup\{u\}}(\mu\circ O_S) - r_A(\mu\circ O_S) - \frac{\Delta_S}{2} \ge \Delta_S - \frac{\Delta_S}{2} = \frac{\Delta_S}{2}.$$

We then upper-bound $r_{A\cup\{u'\}}(\mathrm{UCB}_t) - r_A(\mathrm{UCB}_t)$. Since $u'$ does not belong to S, every arm $i \in M_{u'}$ has $(\mu\circ O_S)_i = 0$, and hence $\mathrm{UCB}_t(i) \le 2\rho^{(t)}_i \le \frac{\Delta_S}{2m}$. By submodularity and 1-TPM bounded smoothness,

$$r_{A\cup\{u'\}}(\mathrm{UCB}_t) - r_A(\mathrm{UCB}_t) \le r_{\{u'\}}(\mathrm{UCB}_t) \le m\cdot\frac{\Delta_S}{2m} = \frac{\Delta_S}{2}.$$

Thus, because we assume $\Delta_S > 0$, we have $r_{A\cup\{u\}}(\mathrm{UCB}_t) - r_A(\mathrm{UCB}_t) > r_{A\cup\{u'\}}(\mathrm{UCB}_t) - r_A(\mathrm{UCB}_t)$, which means that the Greedy oracle will not select $u'$, and we conclude the proof.
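To make the object of Lemma B.11 concrete, here is a minimal sketch of a Greedy oracle for probabilistic maximum coverage run on UCB values; the bipartite-graph layout (a dict of edge probabilities keyed by (left, right) pairs) is assumed for illustration, and we assume K does not exceed the number of left-side nodes.

```python
def greedy_oracle(left_nodes, right_nodes, ucb_edge_prob, K):
    """Greedy oracle analyzed in Lemma B.11: starting from the empty seed set,
    repeatedly add the left-side node with the largest marginal gain in
    expected coverage, evaluated at the UCB estimates of edge probabilities.
    For monotone submodular rewards this gives the standard (1 - 1/e)
    approximation."""
    def cover_prob(seeds, v):
        # Probability that right-side node v is covered by the seed set.
        miss = 1.0
        for u in seeds:
            miss *= 1.0 - ucb_edge_prob.get((u, v), 0.0)
        return 1.0 - miss

    def expected_coverage(seeds):
        return sum(cover_prob(seeds, v) for v in right_nodes)

    chosen = set()
    for _ in range(K):
        best = max((u for u in left_nodes if u not in chosen),
                   key=lambda u: expected_coverage(chosen | {u}))
        chosen.add(best)
    return chosen
```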
Now, with Lemma B.11, we can prove Theorem 3.9.

Proof of Theorem 3.9. The proof is a straightforward application of Lemma B.11. Under $\bigcap_{t=1}^T \mathcal{N}^s_t$, which happens with probability at least $1-\delta$, if CUCB does not choose S at round t, then there exists a base arm $i \in O_S$ such that $\rho^{(t)}_i > \frac{\Delta_S}{4m}$, i.e.,

$$\sqrt{\frac{\ln(4mT^3/\delta)}{2T^{(t)}_i}} > \frac{\Delta_S}{4m},$$

which implies

$$T^{(t)}_i \le \frac{8m^2\ln(4mT^3/\delta)}{\Delta_S^2}.$$

Moreover, by Lemma B.11 the arm i corresponds to a selected node, so it is observed in that round and $T^{(t)}_i$ increases by 1. Hence the total number of rounds in which S is not chosen is at most $\frac{8m^3\ln(4mT^3/\delta)}{\Delta_S^2}$, since there are at most m base arms, and the total attack cost is at most $\frac{8m^4\ln(4mT^3/\delta)}{\Delta_S^2}$, since the attack cost in each round is bounded by m. Therefore, there exists $T^* = \mathrm{poly}(m, 1/\Delta_S, \log(1/\delta))$ such that for all $T \ge T^*$, $\frac{8m^4\ln(4mT^3/\delta)}{\Delta_S^2} \le T^{1-\gamma}$ for some constant $\gamma \in (0,1)$. Thus, the probabilistic maximum coverage problem with CUCB and the Greedy oracle is polynomially attackable.

B.3. Proof of Corollary 3.10

In this part, we prove Corollary 3.10. We first state our setting, and then show how to reduce this simple RL setting to CMAB.

Episodic MDP. We consider an episodic Markov decision process (MDP), denoted by the tuple $(\mathit{State}, \mathit{Action}, H, P, \mu)$, where State is the set of states, Action is the set of actions, H is the number of steps in each episode, P is the transition kernel such that $P(\cdot\,|\,s,a)$ gives the distribution over the next state when action a is taken in state s, and $\mu: \mathit{State}\times\mathit{Action}\to\mathbb{R}$ is the expected reward of the state-action pair (s, a). We assume that the state set State and the action set Action are finite. We work with stationary MDPs, with the same reward and transition functions at every step $h \le H$; all our analyses extend directly to non-stationary MDPs. We assume that the transition probability P is known in advance (see footnote 6).

An RL agent (or learner) interacts with the MDP for T episodes, each consisting of H steps. For simplicity, we assume the initial state s(1) of each episode is fixed. In episode t and step h, the learner observes the current state $s_t(h) \in \mathit{State}$, selects an action $a_t(h) \in \mathit{Action}$, and incurs a noisy reward $r_{t,h}(s_t(h), a_t(h))$ with $\mathbb{E}[r_{t,h}(s_t(h), a_t(h))] = \mu(s_t(h), a_t(h))$. Our results also extend to the setting where the reward depends on the step $h \le H$.

A (deterministic) policy π of an agent is a collection of H functions $\{\pi_h: \mathit{State}\to\mathit{Action}\}$. The value function $V^\pi_h(s)$ is the expected reward under policy π, starting from state s at step h, until the end of the episode, namely

$$V^\pi_h(s) = \mathbb{E}\left[\sum_{h'=h}^H \mu(s_{h'}, \pi_{h'}(s_{h'}))\,\Big|\, s_h = s\right],$$

where $s_{h'}$ denotes the state at step $h'$. Likewise, the Q-value function $Q^\pi_h(s,a)$ is the expected reward under policy π, starting from state s and action a at step h, until the end of the episode, namely

$$Q^\pi_h(s,a) = \mathbb{E}\left[\mu(s_h, a_h) + \sum_{h'=h+1}^H \mu(s_{h'}, \pi_{h'}(s_{h'}))\,\Big|\, s_h = s,\ a_h = a\right].$$

Since State, Action, and H are finite, there exists an optimal policy $\pi^\star$ such that $V^{\pi^\star}_h(s) = \sup_\pi V^\pi_h(s)$ for every state s. The regret $R_A(T, H)$ of an algorithm A is the difference between the total expected reward of the best fixed policy $\pi^\star$ in hindsight and the expected reward collected over T episodes, namely

$$R_A(T, H) = \mathbb{E}\left[\sum_{t=1}^T \left(V^{\pi^\star}_1(s(1)) - V^{\pi_t}_1(s(1))\right)\right].$$

The objective of the learner is to minimize the regret $R_A(T, H)$.

Reward poisoning attack (reward manipulation). In the reward poisoning attack, or reward manipulation, the attacker can intercept the reward $r_{t,h}(s_t(h), a_t(h))$ at every episode t and step h, and decide whether to modify it to $\tilde{r}_{t,h}(s_t(h), a_t(h))$. The goal of the attack is to fool the algorithm into playing a target policy $\pi^\dagger$ for $T - o(T)$ episodes.
The cost of the whole attack process is defined as $C(T) = \sum_{t=1}^T\sum_{h=1}^H \mathbb{I}\{\tilde{r}_{t,h}(s_t(h), a_t(h)) \neq r_{t,h}(s_t(h), a_t(h))\}$.

Reduction to a CMAB problem. In this simple RL setting where the transition probabilities are known in advance, the problem can be reduced to a CMAB problem and thus solved by the CUCB algorithm. We construct the CMAB instance as follows.

1. There are $|\mathit{State}|\cdot|\mathit{Action}|$ base arms; each base arm (s, a) denotes the random reward r(s, a), with unknown expected value $\mathbb{E}[r(s,a)] = \mu(s,a)$.

2. Each policy π is a super arm, and the expected reward of super arm π is the value function $V^\pi_1(s(1))$.

3. When we select a policy π (a super arm) and execute it for episode t, a random set of base arms $\tau_t$ is triggered and observed following a distribution D determined by the episodic MDP. Define $A^\pi_h(s,a)$ as the probability of being in state s at step h with $\pi(s) = a$; then base arm (s, a) is triggered with probability $\sum_{h=1}^H A^\pi_h(s,a)$ in each episode. Thus $p^{D,\pi}_{(s,a)} = \sum_{h=1}^H A^\pi_h(s,a)$, where $p^{D,\pi}_{(s,a)}$ is the probability of triggering arm (s, a) under policy π (defined in Section 2).

Footnote 6. Unlike most of the RL literature, where the transition probability P is unknown, we need to know the transition probabilities of the MDP (the white-box setting) for a direct reduction from this simple RL setting to a CMAB problem solvable by CUCB. We believe similar techniques can also be applied to study the attackability of RL instances with unknown transition probabilities.

Now the episodic MDP problem is reduced to a CMAB instance, and the remaining task is to validate Assumption 2.1 and Assumption 2.2. Note that the reward of a super arm (policy) under the expected reward vector µ is

$$r_\pi(\mu) = V^\pi_1(s(1)) = \mathbb{E}\left[\sum_{h'=1}^H \mu(s_{h'}, \pi_{h'}(s_{h'}))\,\Big|\, s_1 = s(1)\right] = \sum_{h'=1}^H \sum_{s,a}\mu(s,a)\,A^\pi_{h'}(s,a) = \sum_{s,a}\mu(s,a)\sum_{h'=1}^H A^\pi_{h'}(s,a) = \sum_{s,a}\mu(s,a)\,p^{D,\pi}_{(s,a)}.$$

Thus, Assumption 2.1 and Assumption 2.2 hold naturally, and we can apply Theorem 3.6 to determine whether an episodic MDP is polynomially attackable or not. A small sketch of how the trigger probabilities can be computed from the known transitions is given below.
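For illustration, the trigger probabilities $p^{D,\pi}_{(s,a)}$ in item 3 can be computed from the known transition kernel by forward propagation; the sketch below assumes a deterministic stationary policy given as an array of action indices.

```python
import numpy as np

def trigger_probabilities(P, pi, H, s0):
    """Compute A^pi_h(s, a) -- the probability of being in state s and taking
    a = pi(s) at step h -- by propagating the state distribution through the
    known kernel P, then return p^{D,pi}_{(s,a)} = sum_h A^pi_h(s, a) as in
    the reduction above. P has shape (S, A, S); pi maps state -> action."""
    S = P.shape[0]
    dist = np.zeros(S)
    dist[s0] = 1.0                             # fixed initial state s(1)
    p_trigger = np.zeros((S, P.shape[1]))
    for _ in range(H):
        for s in range(S):
            p_trigger[s, pi[s]] += dist[s]     # A^pi_h(s, pi(s)) = Pr[s_h = s]
        nxt = np.zeros(S)
        for s in range(S):
            nxt += dist[s] * P[s, pi[s]]       # push the distribution forward
        dist = nxt
    return p_trigger
```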
C. Missing Proofs in Section 4

In this section, we prove Theorem 4.1, restated below.

Theorem 4.1. There exists a CMAB instance satisfying Assumptions 2.1 and 2.2 that is polynomially attackable given the parameter vector µ (induced from the joint distribution D of the instance's base arms), yet no attack algorithm can efficiently attack the instance against the CUCB algorithm when the parameter µ is unknown.

As presented in Section 4, the hardness result for CMAB instances comes from the combinatorial structure of the super arms, which can block the exploration of other base arms. When the attacker does not know the vector µ, she needs to observe different base arms to obtain an estimate of the problem parameter µ before choosing a super arm S to attack. The following formalizes this idea.

Before proceeding to the proof, we restate the hard instances we constructed (Example 4).

Hard example. We construct the following CMAB instances $I_i$. There are 2n base arms, $\{a_i\}_{i\in[n]}$ and $\{b_i\}_{i\in[n]}$, each corresponding to a random variable ranged in [0, 1]. In instance $I_i$, we set $\mu_{a_j} = 1 - 2\epsilon$ for all $j \neq i$, $\mu_{a_i} = 1$, and $\mu_{b_j} = 1 - \epsilon$ for all $j \in [n]$, for some $\epsilon > 0$. We then construct n + 2 super arms. For each $j \in [n]$, pulling $S_j$ observes base arms $a_j$ and $b_j$, and $r_{S_j} = \mu_{a_j} + \mu_{b_j}$. Another super arm $S_{n+1}$ observes base arms $b_j$ for all $j \in [n]$, with $r_{S_{n+1}} = \sum_{j\in[n]}\mu_{b_j} + (1-\epsilon)$. Besides, there is also a super arm $S_0$ with constant reward $r_{S_0} = 2 - 2\epsilon$ that observes no base arm. The attack super arm set is $M = \{S_j\}_{j\in[n]}$. In total, we construct n hard instances $I_i$ for $i \in [n]$.

The following lemma claims that the instances we construct are indeed polynomially attackable (Definition 3.1), by computing the gap (Definition 3.5).

Lemma C.1. For instance $I_i$, we have $\Delta_{S_i} = \epsilon > 0$ and $\Delta_{S_j} = -\epsilon < 0$ for all $j \neq i$, $j \in [n]$.

Proof. First recall that the gap (Definition 3.5) of a super arm $S \in M$ is defined as

$$\Delta_S := r_S(\mu) - \max_{S'\neq S} r_{S'}(\mu\circ O_S).$$

For instance $I_i$,

$$\Delta_{S_i} = r_{S_i}(\mu) - \max_{S'\neq S_i} r_{S'}(\mu\circ O_{S_i}) = r_{S_i}(\mu) - r_{S_0}(\mu) = (\mu_{a_i} + \mu_{b_i}) - (2-2\epsilon) = 1 + (1-\epsilon) - (2-2\epsilon) = \epsilon.$$

Similarly, for $j \neq i$,

$$\Delta_{S_j} = r_{S_j}(\mu) - \max_{S'\neq S_j} r_{S'}(\mu\circ O_{S_j}) = r_{S_j}(\mu) - r_{S_0}(\mu) = (\mu_{a_j} + \mu_{b_j}) - (2-2\epsilon) = (1-2\epsilon) + (1-\epsilon) - (2-2\epsilon) = -\epsilon.$$

A short script instantiating these means and gaps is given below.
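As a sanity check on Lemma C.1, the following few lines instantiate the means and gaps of $I_i$ (illustrative only):

```python
def hard_instance(n, i, eps):
    """Base-arm means of the hard instance I_i (Example 4) and the gaps of
    the n attackable super arms S_1, ..., S_n, computed as in Lemma C.1 with
    r_{S_0} = 2 - 2*eps attaining the maximum in the gap definition."""
    mu_a = [1.0 if j == i else 1.0 - 2.0 * eps for j in range(n)]
    mu_b = [1.0 - eps] * n
    # Delta_{S_j} = r_{S_j}(mu) - r_{S_0}(mu) = mu_a[j] + mu_b[j] - (2 - 2*eps)
    gaps = [mu_a[j] + mu_b[j] - (2.0 - 2.0 * eps) for j in range(n)]
    return mu_a, mu_b, gaps

# Example: only S_i has a positive gap (= eps); all others have gap -eps.
mu_a, mu_b, gaps = hard_instance(n=4, i=2, eps=0.05)
assert abs(gaps[2] - 0.05) < 1e-9
assert all(g < 0 for j, g in enumerate(gaps) if j != 2)
```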
The next two lemmas (Lemmas C.2 and C.3) show that if the attacker lets the CUCB algorithm observe l different super arms in the attack set M, then the attacker must suffer cost exponential in l, namely $(1/\epsilon)^{\Omega(l)}$.

Lemma C.2. Fix a base arm a and suppose that during the attack on CUCB there exist two rounds $t_1 < t_2$ such that $\mathrm{UCB}_{t_1}(a) \le \epsilon$ at time $t_1$, when a has already been observed x times, and $\mathrm{UCB}_{t_2}(a) \ge 1 - 2\epsilon$ at time $t_2$. Then a must be observed at least $x/(2\epsilon)$ times during $(t_1, t_2)$, for $\epsilon < 1/8$.

Proof. Note that $\mathrm{UCB}_{t_1}(a) \le \epsilon$ implies $\tilde{\mu}_{a,t_1} \le \epsilon$ and $\rho_{a,t_1} \le \epsilon$. Since $\rho_{a,t}$ does not increase and $t_2 > t_1$, we have

$$\tilde{\mu}_{a,t_2} \ge \mathrm{UCB}_{t_2}(a) - \rho_{a,t_2} \ge \mathrm{UCB}_{t_2}(a) - \rho_{a,t_1} \ge 1 - 2\epsilon - \epsilon = 1 - 3\epsilon.$$

Now suppose a has been observed $x'$ times during $(t_1, t_2]$ and x times during $[0, t_1]$. Since every observation lies in [0, 1] and the first x observations have average at most $\epsilon$,

$$\tilde{\mu}_{a,t_2} \le \frac{\epsilon x + x'}{x + x'}.$$

Combining this with $\tilde{\mu}_{a,t_2} \ge 1 - 3\epsilon$ gives $3\epsilon x' \ge (1-4\epsilon)x$, and the claim follows for $\epsilon < 1/8$, concluding the proof.

Lemma C.3. For any instance $I_i$, suppose there exist l different times $t_1 < \cdots < t_l$ such that CUCB pulls $S_{t_j}$ at time $t_j$ and the super arms $S_{t_j}$, $j \in [l]$, are all distinct. Then $t_l \ge (1/2\epsilon)^{l-1}$, and $S_{n+1}$ is pulled at least $(1/2\epsilon)^{l-1}$ times when $l > 1$.

Proof. We prove this by induction. When l = 1, the claim holds trivially. Now suppose the claim holds for $l = k$, and consider $l = k+1$. Applying the induction hypothesis to the times $t_1, \ldots, t_{l-1}$, we know that by time $t_{l-1}$, $S_{n+1}$ has already been pulled at least $(1/2\epsilon)^{l-2}$ times if $k > 1$, which means $b_{t_l}$ has already been observed $(1/2\epsilon)^{l-2}$ times. If $k = 1$ (i.e., l = 2), we also know that $b_{t_l}$ must have been observed at least once, since $\mathrm{UCB}_{t_1}(b_{t_l}) \le \epsilon$.

Now choose $t'_{l-1}$ and $t'_l$ with $t_{l-1} \le t'_{l-1} < t'_l \le t_l$ such that CUCB pulls $S_{t_{l-1}}$ at time $t'_{l-1}$, pulls $S_{t_l}$ at time $t'_l$, and never pulls $S_{t_{l-1}}$ or $S_{t_l}$ during $(t'_{l-1}, t'_l)$. Because CUCB pulls $S_{t_l}$ at time $t'_l$, we have $\mathrm{UCB}_{t'_l}(b_{t_l}) \ge 1 - 2\epsilon$; and because CUCB pulls $S_{t_{l-1}}$ at time $t'_{l-1}$, we have $\mathrm{UCB}_{t'_{l-1}}(b_{t_l}) \le \epsilon$. Applying Lemma C.2, $b_{t_l}$ must be observed at least $(1/2\epsilon)^{l-1}$ times during $(t'_{l-1}, t'_l)$. Since the super arm $S_{t_l}$ is never pulled during $(t'_{l-1}, t'_l)$, the only way to observe $b_{t_l}$ is to pull the super arm $S_{n+1}$. This completes the induction step.

Now we can finally prove Theorem 4.1. Lemma C.3 already shows that if the attacker lets CUCB pull l different arms in the attack set M, the attacker suffers a loss exponential in l. The remaining intuition is that, without knowing the parameter µ, the attacker must visit at least $\Omega(n)$ different arms in M to guarantee a high probability of attack success, and thus suffers exponential cost.

Proof of Theorem 4.1. We argue by contradiction. Suppose an attack algorithm A with constant γ successfully attacks CUCB with high probability on the random instance I, chosen uniformly from $\{I_1, \ldots, I_n\}$. Here we assume $\epsilon < 1/4$. Then for $T \ge P(n, K)$, where P(n, K) denotes some polynomial in the variables n and K, algorithm A uses at most $T^{1-\gamma}$ budget and plays super arms in $M = \{S_1, \ldots, S_n\}$ at least $T - T^{1-\gamma}$ times.

Now choose n such that $(1/2\epsilon)^{n/2} \ge P(n, K)$, which is possible since P(n, K) is polynomial in n, and consider $T = (1/2\epsilon)^{n/2}$. By Lemma C.3, at time T, CUCB visits at most n/2 different arms in M. Recall from Lemma C.1 that for instance $I_i$, only super arm $S_i$ has gap $\Delta_{S_i} = \epsilon > 0$, while every other super arm $S_j$, $j \neq i$, $j \in [n]$, has gap $\Delta_{S_j} = -\epsilon < 0$. Because I is chosen uniformly from $\{I_1, \ldots, I_n\}$, with probability at least 1/2, CUCB does not visit the super arm with gap ϵ by time T. However, for any super arm $S_j$ with gap $\Delta_{S_j} = -\epsilon < 0$, Lemma B.8 implies that $S_j$ can be played at most $P'(n, 1/\epsilon, \log(1/\delta))\, T^{1-\gamma}$ times, for some polynomial $P'$ in n, $1/\epsilon$, and $\log(1/\delta)$, under the event $\bigcap_t \mathcal{N}^s_t$, which holds with probability at least $1-\delta$.

Thus, with probability at least $1/2 - \delta$, CUCB plays super arms in M at most $\frac{n}{2}\, P'(n, 1/\epsilon, \log(1/\delta))\, T^{1-\gamma}$ times. However, by Definition 3.1, a successful attack requires CUCB to pull arms in M at least $T - T^{1-\gamma}$ times, and choosing n large enough yields

$$T - T^{1-\gamma} > \frac{n}{2}\, P'(n, 1/\epsilon, \log(1/\delta))\, T^{1-\gamma},$$

since $T = (1/2\epsilon)^{n/2}$ grows exponentially in n. Hence, with probability at least $1/2 - \delta$, CUCB does not pull arms in M at least $T - T^{1-\gamma}$ times, contradicting the assumed success of A. Noting that the number of base arms is m = 2n, we conclude the proof.