# Pareto Regret Analyses in Multi-objective Multi-armed Bandit

Mengfan Xu¹, Diego Klabjan¹

¹Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, U.S.A. Correspondence to: Mengfan Xu, Diego Klabjan. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

We study Pareto optimality in multi-objective multi-armed bandit by providing a formulation of adversarial multi-objective multi-armed bandit and defining its Pareto regrets, which apply to both stochastic and adversarial settings. The regrets do not rely on any scalarization functions and reflect Pareto optimality, in contrast to scalarized regrets. We also present new algorithms for the cases with and without prior information on the multi-objective multi-armed bandit setting. The algorithms are shown to be optimal in adversarial settings and nearly optimal, up to a logarithmic factor, in stochastic settings simultaneously, by means of our established upper and lower bounds on Pareto regrets. Moreover, the lower bound analyses show that the new regrets are consistent with the existing Pareto regret for stochastic settings and extend an adversarial attack mechanism from bandit to the multi-objective one.

## 1. Introduction

Multi-armed bandit (MAB) is a sequential paradigm where a player chooses arms and receives reward values from an environment at each time step. It usually aims at maximizing the cumulative reward of the player, or equivalently minimizing the regret, formulated as the difference between the rewards of the best arm and the rewards obtained by the player. Multi-objective multi-armed bandit (MO-MAB) extends the reward values of arms to multi-dimensional reward vectors and thereby changes the nature of the MAB problem significantly, as a result of the complicated order relationships among vectors. Rewards can be optimal in one dimension but suboptimal in others, leading to multiple optimal arms. Formally, given time horizon $T$, at time step $t \le T$ the player chooses one arm $a_t$ among $K$ arms and receives the $D$-dimensional vector $r_{a_t,t}$ among the rewards $(r_{i,t})_{i=1}^K$.

The goal is to optimize the total reward vector $\sum_{t=1}^T r_{a_t,t}$ by minimizing some regret metric measuring how far the player is from optimality. There are two ways to define optimality: Pareto optimality in the reward vector space and scalarized optimality obtained by scalarizing reward vectors. Pareto optimality admits a Pareto optimal front, defined as the set of rewards of optimal arms determined by the Pareto order relationship. With the limited information available in MO-MAB, directly estimating the Pareto optimal front is a great challenge, yet it is crucial for an optimal strategy. Scalarization is often used to transform reward vectors to scalars and thereby reduce MO-MAB to MAB depending on scalarization functions, which, however, has several limitations. First, it may be troublesome to choose a scalarization function properly. Linear and Chebyshev options have been discussed in (Drugan and Nowé, 2013), despite the fact that they may not fully explore non-convex Pareto optimal fronts. Secondly, algorithms can be less robust since their performance highly depends on the specific scalarization functions. In other words, what is optimal for one scalarization function can result in large scalarized regrets for other functions.
Therefore, Pareto optimality in the reward vector space makes more sense and is the main focus herein. MAB usually falls into the category of either stochastic MAB or adversarial MAB, depending on how rewards are generated. Following similar assumptions on the reward process, we consider stochastic MO-MAB and adversarial MO-MAB, respectively. In stochastic MO-MAB, the rewards of each arm at different time steps are assumed to be i.i.d., following a multi-dimensional distribution whose mean vector is time-invariant. Henceforth, the Pareto optimal set is the set of arms whose mean vectors are Pareto optimal, and the Pareto optimal reward set (front) is the set of corresponding mean vectors, which are constant over time. A regret metric concerning Pareto optimality, namely Pareto regret, is then defined as the cumulative distance between the expected reward received at each time step and the Pareto optimal front (Drugan and Nowé, 2013). However, this regret measure does not generalize to adversarial MO-MAB, since the Pareto optimal front is ill-defined there. Precisely, in adversarial MO-MAB the rewards are arbitrarily specified by adversaries at each time step and thereby vary with time and may even depend on past observations. Meanwhile, when the adversary generates rewards from a distribution depending on the state (context) at each time step, i.e. a context-dependent reward generator, the problem boils down to contextual MO-MAB. This necessitates a new way to determine Pareto optimality, the Pareto optimal front, and subsequently a Pareto regret measure in adversarial settings, which has not yet been studied.

Traditionally, algorithms designed for MO-MAB work either by modifying methods for MAB (MAB-based) or by importing mechanisms from multi-objective optimization (MOO-based). More specifically, MAB-based algorithms replace the estimation of the best arm by an approximation of the Pareto optimal set, from which an arm is randomly selected. Examples include Pareto UCB (Drugan and Nowé, 2013), based on optimism in the face of uncertainty, and MOSS++ (Zhu and Nowak, 2020), an extension of MOSS (Audibert and Bubeck, 2009). Their formulations and analyses are limited to stochastic MO-MAB. While a lot of attention has been given to algorithms for variants of stochastic MO-MAB (see (Hüyük and Tekin, 2021; Van Moffaert et al., 2014)), a counterpart remains unexplored in adversarial settings, including the lack of a valid formulation and analyses, which we fully cover herein. Moreover, the aforementioned MAB-based algorithms may result in linear Pareto regret when applied to our defined adversarial settings. Precisely, we provide a regret analysis of Pareto UCB in the adversarial regime which shows that Pareto UCB suffers a linear regret. Some MOO-based approaches adapt multi-objective evolutionary algorithms (Hong et al., 2021) to bandit problems, and (Yahyaa et al., 2014) propose the annealing knowledge gradient descent method; both, however, are examined empirically without theoretical guarantees. Analyses for MOO under uncertainty provide possibilities for adversarial MO-MAB given that the objective functions can be black-box. Recently, a minimax regret formulation has been suggested by (Groetzner and Werner, 2022), though Pareto optimality is again neither defined nor guaranteed by the proposed algorithms with an optimal minimax regret.
An algorithm that guarantees theoretically optimal Pareto regret for adversarial MO-MAB has not yet been proposed, let alone one achieving optimality for both adversarial and stochastic MO-MAB. Herein we propose a formulation of the adversarial MO-MAB problem where rewards are determined arbitrarily under no assumptions on their distributions. Moreover, we formally introduce its Pareto optimal front, Pareto regret, and Pareto pseudo regret, which makes it possible to consider algorithms from the perspective of Pareto optimality. To the best of our knowledge, this is the first work formally introducing the general MO-MAB setting and considering the task of Pareto regret optimization in adversarial MO-MAB, in line with both MAB and stochastic MO-MAB. Notably, these regrets extend to stochastic settings without any modifications. On the basis of the existing literature on stochastic MO-MAB, the achievements of this paper are to build the theoretical connection between general MO-MAB and MAB, which allows us to characterize Pareto regret generally, and to develop theoretically effective methods that achieve Pareto optimality in both stochastic and adversarial MO-MAB.

In this paper, we characterize Pareto regret explicitly and propose algorithms that achieve regrets of order $\sqrt{T}$ in adversarial MO-MAB and of order $\log T$ or $(\log T)^2$ in stochastic MO-MAB. More specifically, when it is known whether the schema is stochastic or adversarial before the game starts, our algorithm MO-KS, which switches between the UCB algorithm and the EXP3.P algorithm, obtains Pareto regret of the same orders as the regret in MAB, namely of order $\log T$ and $\sqrt{T}$ in stochastic MO-MAB and adversarial MO-MAB, respectively. Otherwise, when this prior information is not available, a different algorithm MO-US is proposed herein that combines the EXP3 and UCB algorithms, as a modification of the existing IEXP3++ algorithm for MAB (Seldin and Lugosi, 2017). Consequently, its Pareto pseudo regret has an order of $(\log T)^2$ in stochastic MO-MAB and $\sqrt{T}$ in adversarial MO-MAB, respectively, without necessarily knowing the setting of MO-MAB.

To claim optimality of our proposed algorithms and to provide worst-case analyses for the proposed framework, we are the first to provide regret lower bound analyses for both stochastic and adversarial MO-MAB. More precisely, we show that in stochastic MO-MAB, the expected Pareto regret and the Pareto pseudo regret cannot be smaller than order $\log T$, whatever algorithm is deployed. For adversarial MO-MAB, a regret of order at least $\sqrt{T}$ results from any randomized algorithm. The lower bounds on Pareto regret match the upper bounds of MO-KS in that they have the same order with respect to $T$, which implies that MO-KS is optimal in both settings. Note that for MO-US, the lower bound coincides with the upper bound in adversarial settings and is almost the same as the upper bound in stochastic ones, up to a $\log T$ multiplicative factor, i.e. MO-US is optimal in adversarial MO-MAB and nearly optimal up to a $\log T$ factor in stochastic MO-MAB. Moreover, we derive a lower bound for Pareto UCB, a specific algorithm that is optimal in stochastic MO-MAB, by constructing instances in adversarial MO-MAB. We report that Pareto UCB can incur a linear Pareto regret, far from optimal, which further necessitates the proposed algorithms. As a by-product, the constructed instances generalize an adversarial attack on UCB in MAB (Jun et al., 2018).
We provide an attack cost analysis and pre- and post-attack regret analyses of Pareto UCB for MO-MAB herein.

The rest of the paper is organized as follows. In Section 2, we present the existing work in the domain of MO-MAB. In Section 3, we introduce the preliminaries, including notations and formulations in the context of MO-MAB. We proceed by proposing an algorithm that is optimal in both stochastic and adversarial MO-MAB, as well as a nearly optimal one up to a $\log T$ factor, and establishing their regret upper bounds in Section 4. Finally, in Section 5, we provide full analyses of the regret lower bound and of a special case when Pareto UCB is adopted.

## 2. Literature Review

MAB is well understood and many algorithms have proven effective for MAB both theoretically and numerically. For stochastic MAB, UCB and Thompson Sampling achieve optimal regret of order $\log T$. For adversarial MAB, (Auer et al., 2002) propose EXP3 with an optimal regret of order $\sqrt{T}$. Moreover, in the work of (Seldin and Slivkins, 2014), the EXP3++ algorithm is designed as a nearly optimal solution to both the stochastic and the adversarial setting, up to a $\log T$ factor. More specifically, the EXP3++ algorithm dynamically updates two levers used for EXP3 and UCB, respectively, by estimating the assumed nature of rewards, and yields a sampling rule accordingly that is a weighted average of EXP3 and UCB. The SAO algorithm (Auer and Chiang, 2016) improves on it by having smaller regret in the stochastic setting, but with a larger regret in the adversarial setting and with more complicated steps. (Seldin and Lugosi, 2017) improve on EXP3++ by a careful parameterization that leads to smaller regret and does not require knowing $T$ in advance, and (Zimmert and Seldin, 2019) finally attain the optimum for both settings under certain strong conditions. MO-MAB extends the single-dimensional objective of MAB to a multi-dimensional one, which makes it strikingly challenging to develop optimal methods. To this end, it is natural to extend MAB algorithms to MO-MAB, and Pareto UCB is such a success. Specifically, Pareto UCB (Drugan and Nowé, 2013) constructs upper confidence bounds for each arm and determines a Pareto optimal front instead of a single best arm in the presence of Pareto optimality. An arm is then randomly pulled from this set. A Pareto regret of order $\log T$ for Pareto UCB is guaranteed by the multi-dimensional Chernoff-Hoeffding bound. Further improvements can be found in (Drugan et al., 2014). However, these methods only work for stochastic MO-MAB. There are variants of stochastic MO-MAB, such as PF-LEX for MO-MAB with a lexicographic order and satisficing objectives (a relaxation of maximizing that allows for suboptimal, less risky decision making in the face of uncertainty) (Hüyük and Tekin, 2021) and MO-LinUCB for contextual MO-MAB (Mehrotra et al., 2020). But no attention has been given to the adversarial framework. Another line of work on MO-MAB is to modify multi-objective optimization (MOO) methods. Many such algorithms have shown empirically great performance in real applications, such as large-scale MOEA using evolutionary algorithms dealing with the classic trade-off between exploration and exploitation (Hong et al., 2021) and Annealing Pareto Knowledge Gradient as a variant of Thompson Sampling (Yahyaa et al., 2014).
But the lack of a theoretical guarantee presents a concern, since data-driven methods may be less robust and more sensitive, which motivates our focus on theoretically effective algorithms. MOO methods consider both deterministic settings and settings under uncertainty, where the former can be adapted to stochastic MO-MAB with deterministic mean vectors and the latter raises the possibility of use in adversarial MO-MAB. For stochastic MO-MAB, various scalarization techniques from MOO have been proposed for dimension reduction, such as linear and Chebyshev scalarization, with which the problem essentially degenerates to MAB. But scalarizations may result in a loss of information and be highly problem-dependent, as stated earlier. Optimizing the Generalized Gini Index (GGI) using online convex optimization (Busa-Fekete et al., 2017) extends the concept of the Gini index by means of an ordered weighted average, albeit the regret metric is again defined on the GGI. For MOO under uncertainty, utility functions including both linear and non-linear options are often applied in economics, whereas they work similarly to scalarization methods. Recently, (Groetzner and Werner, 2022) suggest a relative minimax regret approach with theoretical guarantees. Specifically, the minimax regret first takes a maximum over the random variables for each dimension and then a minimum over the decision variables using Pareto order relationships. However, it requires precomputing all optimal values with respect to the random variables, which does not generalize to on-the-fly bandit settings. Herein we fill the gap by formally introducing the Pareto regret for adversarial MO-MAB and by proposing algorithms that show optimality or near optimality in both stochastic and adversarial MO-MAB.

Lower bound analyses on regret play an important role in studying the optimality of algorithms in MO-MAB. On the one hand, work on MO-MAB algorithms usually shows lower bounds on regret depending on the problem setting and the regret metric used. For Pareto regret, (Drugan and Nowé, 2013) argue a lower bound of order $\log T$ for stochastic MO-MAB that matches both the regret upper bound of Pareto UCB and the regret lower bound for stochastic MAB. For our newly proposed Pareto regrets, we also establish lower bounds in both stochastic and adversarial MO-MAB, which are consistent not only with most of the upper bounds, but also with (Drugan and Nowé, 2013) in stochastic settings, which further supports the validity of the proposed regret metrics. On the flip side, an adversarial attack on UCB (Jun et al., 2018) shows that UCB may suffer a linear regret given a small attack cost, while the vulnerability of Pareto UCB has not yet been studied, which potentially provides bad cases where Pareto UCB does not work. In this paper, we consider the adversarial attack on Pareto UCB, which not only justifies its use in MO-MAB, but also serves as evidence of the limitation of Pareto UCB in adversarial MO-MAB.

## 3. Preliminaries

In this section, we start with the notations used throughout the paper and then formally introduce Pareto regret, including the existing one for stochastic MO-MAB and our newly proposed ones that capture both stochastic MO-MAB and adversarial MO-MAB. The regret analyses are presented in the next section.

### 3.1. Notation

For vectors, a Pareto order relationship has been introduced in (Drugan and Nowé, 2013) and (Drugan et al., 2014). There are several possible order relationships between two vectors $a$ and $b$.
Vector $a$ is said to weakly dominate $b$, written as $a \succeq b$ or $b \preceq a$, if and only if $a_d \ge b_d$ for every dimension $d$. Requiring in addition a strict inequality in at least one dimension gives dominance, denoted by $a \succ b$ or $b \prec a$. Vector $a$ is incomparable with $b$, written $a \,\|\, b$, if there exists a dimension $d$ such that $a_d > b_d$ and another dimension $d'$ satisfying $a_{d'} < b_{d'}$. We say that vector $a$ is non-dominated by $b$ if there is a dimension $d$ such that $a_d > b_d$.

We consider MO-MAB with $K$ arms $\{1, 2, \ldots, K\}$, $D$-dimensional rewards, and time horizon $T$. The arm chosen by the player at time step $t$ is denoted by $a_t$, and only the reward of $a_t$, namely $r_{a_t,t}$, is revealed to the player. Value $N_i(t)$ is the total number of pulls of arm $i$ by the player up to time $t$. In the stochastic setting, at each time step $t$, the reward $r_{i,t} = (r_{i,t}^d)_{1 \le d \le D}$ of arm $i$ follows a distribution with time-invariant mean vector $\mu_i = (\mu_i^d)_{1 \le d \le D}$ satisfying $\mathbf{0} \preceq \mu_i \preceq \mathbf{1}$. For adversarial MO-MAB, the reward $\mathbf{0} \preceq r_{i,t} \preceq \mathbf{1}$ of each arm $i$ is specified by an adversary, which can be adaptive or oblivious, where the former depends on past actions of the player and the latter does not.

### 3.2. Pareto Optimality and Regret

#### 3.2.1. Optimality

The Pareto order relationship determines the Pareto optimality of arms as follows. Formally, in stochastic settings, an arm $j$ is said to be Pareto optimal if $\mu_j$ is non-dominated by the reward mean vector of any other arm, and the set of Pareto optimal arms is called the Pareto optimal set, denoted by $O_A$ (Drugan and Nowé, 2013). We propose to define the Pareto optimal sets $O^*_A$ and $O'_A$ based on Pareto optimality with respect to the cumulative reward vector $\sum_{t=1}^T r_{i,t}$ and $\sum_{t=1}^T E[r_{i,t}]$, respectively, for both stochastic and adversarial settings. Formally,
$$O^*_A = \Big\{i^* : \textstyle\sum_{t=1}^T r_{i^*,t} \text{ is non-dominated by } \sum_{t=1}^T r_{j,t} \text{ for any arm } j \neq i^*\Big\},$$
$$O'_A = \Big\{i' : \textstyle\sum_{t=1}^T E[r_{i',t}] \text{ is non-dominated by } \sum_{t=1}^T E[r_{j,t}] \text{ for any arm } j \neq i'\Big\}$$
allow us to define a uniformly effective Pareto regret for MO-MAB. Note that $O'_A$ is actually the same as $O_A$ in stochastic settings. We denote the corresponding Pareto optimal fronts by $O = \{\mu_a : a \in O_A\}$, $O^* = \{\sum_{t=1}^T r_{a,t} : a \in O^*_A\}$, and $O' = \{\sum_{t=1}^T E[r_{a,t}] : a \in O'_A\}$.

#### 3.2.2. Regret

We now proceed to introduce Pareto regrets by measuring the distance between the obtained rewards and the rewards of arms in the Pareto optimal set, with a metric consistent with (Drugan and Nowé, 2013). In (Drugan and Nowé, 2013), the Pareto regret for stochastic MO-MAB is given by $R_T = \sum_{t=1}^T \mathrm{Dist}(\mu_{a_t}, O) = \sum_{i=1}^K N_i(T)\,\mathrm{Dist}(\mu_i, O)$, where the distance between a vector $a$ and a set $O$ is defined by $\mathrm{Dist}(a, O) = \min_{\epsilon \ge 0}\{\epsilon : a + \epsilon\mathbf{1} \,\|\, \sigma \text{ for every } \sigma \in O\}$. Here $\mathbf{1} \in \mathbb{R}^D$ is the vector of all ones. Nevertheless, this definition of Pareto regret is with respect to $O$, which requires the reward mean vectors to be constant over time, a condition that only holds for stochastic MO-MAB. To this end, we propose a Pareto regret $R^*_T$ and a Pareto pseudo regret $R'_T$ based on a similar distance metric but with the newly defined Pareto optimal fronts. To the best of our knowledge, this is the first work defining and establishing Pareto regret in adversarial MO-MAB that also generalizes to stochastic MO-MAB.

Pareto Regret. Formally, the Pareto regret in MO-MAB is defined as $R^*_T = \mathrm{Dist}(\sum_{t=1}^T r_{a_t,t}, O^*)$.

Pareto Pseudo Regret. In the context of MO-MAB, we propose the Pareto pseudo regret $R'_T = \mathrm{Dist}(E[\sum_{t=1}^T r_{a_t,t}], O')$.
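To make the order relations and the distance metric concrete, the following small Python sketch (ours, not part of the paper's artifacts) checks non-dominance, builds the Pareto optimal set from a list of mean vectors, and approximates $\mathrm{Dist}(a, O)$ by a brute-force scan over candidate values of $\epsilon$; the grid resolution and the toy mean vectors are arbitrary illustrative choices.

```python
import numpy as np

def non_dominated(a, b):
    """a is non-dominated by b: a exceeds b in at least one dimension."""
    return np.any(a > b)

def incomparable(a, b):
    """a || b: a is strictly better in some dimension and strictly worse in another."""
    return np.any(a > b) and np.any(a < b)

def pareto_optimal_set(vectors):
    """Indices of vectors that are non-dominated by every other vector (the set O_A)."""
    return [i for i, v in enumerate(vectors)
            if all(non_dominated(v, w) for j, w in enumerate(vectors) if j != i)]

def dist_to_front(a, front, eps_grid=np.linspace(0.0, 2.0, 20001)):
    """Brute-force approximation of Dist(a, O) = min{eps >= 0 : a + eps*1 || sigma for all sigma in O},
    scanning a fine grid of candidate eps values (0 if a is already incomparable with the whole front)."""
    a = np.asarray(a, dtype=float)
    for eps in eps_grid:
        if all(incomparable(a + eps, sigma) for sigma in front):
            return float(eps)
    return float("inf")  # no admissible eps found on the grid

# toy example with D = 2: arms 0 and 1 form the Pareto front, arm 2 is dominated by arm 0
mus = [np.array([0.9, 0.6]), np.array([0.2, 0.8]), np.array([0.5, 0.3])]
front_idx = pareto_optimal_set(mus)              # -> [0, 1]
front = [mus[i] for i in front_idx]
print(front_idx, dist_to_front(mus[2], front))   # small positive distance for the dominated arm
```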
In the proofs and some statements we utilize rewards only along a single dimension. To this end, denote by $R^d_T$ the regret with respect to rewards of dimension (coordinate) $d$, $R^d_T = \max_i \sum_{t=1}^T r_{i,t}^d - \sum_{t=1}^T r_{a_t,t}^d$, and define the corresponding pseudo regret as $R'^d_T = \max_i E[\sum_{t=1}^T r_{i,t}^d] - E[\sum_{t=1}^T r_{a_t,t}^d]$. Note that the proposed regrets apply to both stochastic MO-MAB and adversarial MO-MAB, since rewards can be arbitrary. To this end, we call $R^*_T$ the general Pareto regret, as opposed to $R_T$, the stochastic Pareto regret that only works in the stochastic setting.

## 4. Upper bounds on Pareto regret

In this section, we formally elaborate on the newly proposed algorithms for MO-MAB based on MAB algorithms, considering both the case with and the case without a priori knowledge of whether the setting is stochastic or adversarial, and prove their optimality or near optimality in stochastic and adversarial settings simultaneously. More specifically, we denote the indicator for the setting of MO-MAB by $s$, with $s = 0$ being stochastic and $s = 1$ being adversarial. When $s$ is known, we propose Algorithm 1, which achieves optimality. For unknown $s$, i.e. with less information, Algorithm 2 is nearly optimal with respect to Pareto pseudo regret, up to a $\log T$ factor, in line with the work on MAB.

### 4.1. A Priori Knowledge of Stochastic or Adversarial

When the indicator $s$ is given, Pareto regret in MO-MAB can be related to regret in MAB as follows.

Theorem 4.1. For any dimension $d$, we have that $R^*_T \le R^d_T$.

Based on this result, which essentially characterizes Pareto regret in the form of vanilla regret in MAB, we formally develop Algorithm 1, called Multi-Objective with Known S (MO-KS), for MO-MAB, which has performance comparable to MAB by switching between EXP3.P and UCB depending on the problem setting $s$ and a randomly sampled dimension $d'$. Theorem 4.2 gives the upper bound of Algorithm 1 with respect to the Pareto regret $R^*_T$, which has the same order as the regret upper bound in MAB.

Algorithm 1: Algorithm With Known $s$ (MO-KS)
Input: fixed dimension $d'$, $1 \le d' \le D$; indicator $s \in \{0, 1\}$.
If $s = 0$: for $t = 1, 2, \ldots, T$, play the bandit game by applying the UCB algorithm along dimension $d'$.
If $s = 1$: for $t = 1, 2, \ldots, T$, play the bandit game by applying the EXP3.P algorithm along dimension $d'$.

Theorem 4.2. Based on Algorithm 1, for dimension $d'$ we have $E[R^{d'}_T] \le O(\log T)\, 1_{s=0} + O(\sqrt{T})\, 1_{s=1}$, which leads to $E[R^*_T] \le O(\log T)\, 1_{s=0} + O(\sqrt{T})\, 1_{s=1}$.

The proof of Theorem 4.2 combines the result of Theorem 4.1 with the existing regret upper bounds of the UCB and EXP3.P algorithms. Note that the choice of $d'$ in Algorithm 1 is arbitrary and thus can depend on the context, since the theoretical guarantee holds for any $d'$. For example, in a recommendation system the click rate may be of most interest to decision makers and thereby becomes the optimization objective. Moreover, from the perspective of minimizing the constant term in the Pareto regret, we can always specify such a dimension by running a burn-in period to determine the optimal dimension, although the regret order, which is the focus of this paper, remains the same.
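The following illustrative Python sketch mirrors the structure of MO-KS: it projects the $D$-dimensional rewards onto the fixed coordinate $d'$ and delegates to a scalar bandit routine, UCB1 when $s = 0$ and a plain exponential-weights (EXP3-style) update standing in for EXP3.P when $s = 1$. The environment interface `pull(arm)`, the learning-rate choice, and the toy means are assumptions made for the sketch, not the authors' implementation.

```python
import numpy as np

def mo_ks(pull, K, T, s, d_prime, rng=np.random.default_rng(0)):
    """MO-KS sketch: run a scalar bandit on coordinate d_prime of the reward vectors.
    pull(arm) must return a length-D reward vector in [0, 1]; s = 0 stochastic, s = 1 adversarial."""
    if s == 0:
        # UCB1 on dimension d_prime
        counts, means = np.zeros(K), np.zeros(K)
        for t in range(1, T + 1):
            if t <= K:
                arm = t - 1                                  # pull each arm once to initialize
            else:
                arm = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
            x = pull(arm)[d_prime]
            counts[arm] += 1
            means[arm] += (x - means[arm]) / counts[arm]
    else:
        # exponential-weights update (EXP3-style, standing in for EXP3.P) on dimension d_prime
        eta = np.sqrt(np.log(K) / (T * K))                   # assumed fixed learning rate
        weights = np.ones(K)
        for t in range(1, T + 1):
            probs = weights / weights.sum()
            arm = int(rng.choice(K, p=probs))
            x = pull(arm)[d_prime]
            est = np.zeros(K)
            est[arm] = x / probs[arm]                        # importance-weighted reward estimate
            weights *= np.exp(eta * est)
            weights /= weights.max()                         # rescale to avoid overflow

# usage with a toy stochastic environment (assumed Bernoulli rewards per dimension)
mu = np.array([[0.9, 0.2], [0.4, 0.7], [0.5, 0.5]])
env_rng = np.random.default_rng(1)
env = lambda arm: (env_rng.random(2) < mu[arm]).astype(float)
mo_ks(env, K=3, T=2000, s=0, d_prime=0)
```

Note that the theorem's $O(\sqrt{T})$ guarantee in the adversarial case relies on EXP3.P proper (with its explicit exploration and confidence terms); the plain exponential-weights update above is only a stand-in for illustration.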
### 4.2. Lack of Knowledge of Stochastic or Adversarial

In this section, we focus on MO-MAB where the indicator $s$ is unknown and propose an algorithm (see Algorithm 2) that remains nearly optimal, up to a $\log T$ factor, no matter what the value of $s$ is. Moreover, Algorithm 2 allows an unknown $T$, compared to the scenario with a known indicator, which increases generalizability. Similarly to Theorem 4.1, we have the following result that makes it possible to analyze Pareto pseudo regret by means of pseudo regret in the context of MO-MAB.

Theorem 4.3. For any dimension $d$, we have that $R'_T \le R'^d_T$.

Algorithm 2: Algorithm With Unknown $s$ and $T$ (MO-US)
Input: fixed dimension $d'$, $1 \le d' \le D$.
Initialization: $c = 256$, $\alpha = 3$; pull each arm once and set $L_a(K) = r_a^{d'}$ and $N_a(K) = 1$ for every arm $a$.
For $t = K+1, K+2, \ldots$:
- $\eta_t = \frac{1}{2}\sqrt{\frac{\ln K}{tK}}$;
- for every arm $a$, $\mathrm{UCB}_a(t) = \min\big\{1, \frac{L_a(t-1)}{N_a(t-1)} + \sqrt{\frac{\alpha \ln (t K^{1/\alpha})}{2 N_a(t-1)}}\big\}$ and $\mathrm{LCB}_a(t) = \min\big\{1, \frac{L_a(t-1)}{N_a(t-1)} - \sqrt{\frac{\alpha \ln (t K^{1/\alpha})}{2 N_a(t-1)}}\big\}$;
- $\zeta_t(a) = \max\{0, \mathrm{LCB}_a(t) - \min_{a'} \mathrm{UCB}_{a'}(t)\}$; $\psi_t(a) = \frac{c \ln t}{t\, \zeta_t(a)^2}$; $\epsilon_t(a) = \min\{\frac{1}{2K}, \eta_t, \psi_t(a)\}$;
- $\rho_t(a) = \frac{\exp(-\eta_t L_{t-1}(a))}{\sum_{a'} \exp(-\eta_t L_{t-1}(a'))}$; $\tilde\rho_t(a) = (1 - \sum_{a'} \epsilon_t(a'))\,\rho_t(a) + \epsilon_t(a)$;
- pull arm $a_t$ with probability $\tilde\rho_t(a)$ and receive the reward $r_{a_t,t}$;
- for every arm $a$, $l^a_t = \frac{1 - r^{d'}_{a_t,t}}{\tilde\rho_t(a)}\, 1_{a = a_t}$, $L_t(a) = L_{t-1}(a) + l^a_t$, and $N_a(t) = N_a(t-1) + 1_{a = a_t}$.

The proof of Theorem 4.3 follows the same steps as the proof of Theorem 4.1, substituting the analysis of Pareto regret with that of Pareto pseudo regret. Motivated by the relationship between Pareto pseudo regret and pseudo regret in Theorem 4.3, we develop Algorithm 2 (MO-US) by modifying the EXP3++ algorithm for MAB as in (Seldin and Slivkins, 2014; Seldin and Lugosi, 2017). We are given a dimension and apply the parameter-specific EXP3++ algorithm to guarantee near optimality for the pseudo regret in that dimension and consequently for the Pareto pseudo regret. The upper bound on the Pareto pseudo regret of Algorithm 2 is formally stated as follows.

Theorem 4.4. In Algorithm 2, without knowing the time horizon $T$, for dimension $d'$ we have $R'^{d'}_T \le O((\log T)^2)\, 1_{s=0} + O(\sqrt{T})\, 1_{s=1}$. The Pareto pseudo regret $R'_T$ can be bounded as $R'_T \le O((\log T)^2)\, 1_{s=0} + O(\sqrt{T})\, 1_{s=1}$.

In like manner, the proof utilizes the result of Theorem 4.3 and the known results for the IEXP3++ algorithm as in (Seldin and Lugosi, 2017). The discussion in the previous subsection on the choice of dimension applies here as well.
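Below is a condensed, illustrative Python rendering of the MO-US loop as reconstructed above (again our own sketch, not the authors' code). It tracks importance-weighted losses of the fixed coordinate $d'$ and mixes an exponential-weights distribution with per-arm exploration rates; the learning-rate constant and the initialization of $L$ with observed losses are assumptions where the extracted pseudocode is ambiguous.

```python
import numpy as np

def mo_us(pull, K, T, d_prime, c=256.0, alpha=3.0, rng=np.random.default_rng(0)):
    """MO-US sketch: EXP3++-style updates on the losses 1 - r^{d'} of a fixed coordinate d_prime.
    pull(arm) must return a length-D reward vector in [0, 1]."""
    L = np.zeros(K)          # cumulative importance-weighted losses along dimension d_prime
    N = np.zeros(K)          # pull counts
    for a in range(K):       # initialization round: pull each arm once
        L[a] = 1.0 - pull(a)[d_prime]
        N[a] = 1.0
    for t in range(K + 1, T + 1):
        eta = 0.5 * np.sqrt(np.log(K) / (t * K))                 # assumed EXP3++ learning rate
        radius = np.sqrt(alpha * np.log(t * K ** (1.0 / alpha)) / (2.0 * N))
        ucb = np.minimum(1.0, L / N + radius)
        lcb = np.minimum(1.0, L / N - radius)
        gap = np.maximum(0.0, lcb - ucb.min())                   # empirical gap estimate
        with np.errstate(divide="ignore"):
            psi = c * np.log(t) / (t * gap ** 2)                 # infinite where gap == 0
        eps = np.minimum(np.minimum(1.0 / (2 * K), eta), psi)    # per-arm exploration rate
        rho = np.exp(-eta * (L - L.min()))                       # shifted for numerical stability
        rho /= rho.sum()
        rho_tilde = (1.0 - eps.sum()) * rho + eps
        arm = int(rng.choice(K, p=rho_tilde / rho_tilde.sum()))
        loss = 1.0 - pull(arm)[d_prime]
        L[arm] += loss / rho_tilde[arm]                          # importance-weighted loss update
        N[arm] += 1.0

# usage with a toy stochastic environment
mu = np.array([[0.9, 0.2], [0.4, 0.7], [0.5, 0.5]])
env_rng = np.random.default_rng(1)
env = lambda arm: (env_rng.random(2) < mu[arm]).astype(float)
mo_us(env, K=3, T=3000, d_prime=0)
```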
Remark 4.5. Note that there is an algorithm proposed by (Zimmert and Seldin, 2019) that improves the $(\log T)^2$ regret order in stochastic MAB to $\log T$, exactly the order of the lower bound. However, the additional assumptions of that algorithm may limit its application in MO-MAB, such as the condition of a unique best arm. In MO-MAB, it is possible that multiple best arms exist in one dimension.

For stochastic MO-MAB, (Drugan and Nowé, 2013; Drugan et al., 2014) provide a formulation of regret and subsequently propose optimal algorithms; however, their algorithms and analyses do not hold for adversarial settings. While techniques from multi-objective optimization ((Hong et al., 2021; Yahyaa et al., 2014; Busa-Fekete et al., 2017; Groetzner and Werner, 2022)) adapt to any reward distributions, they either convert the MO-MAB problem to single-objective MAB by scalarization and minimax or only work empirically. Our MO-MAB results establish algorithms and regret bounds for the adversarial setting (Figure 1).

## 5. Lower bounds on Pareto Regret

### 5.1. Lower Bounds

In this section, we show that for stochastic MO-MAB and adversarial MO-MAB, there exist scenarios where the regret of any randomized algorithm exceeds certain lower bounds, with respect to both Pareto regret and Pareto pseudo regret. This validates the optimality of Algorithm 1 and the near optimality of Algorithm 2, up to constant factors. Moreover, the lower bounds are the same as those for MAB, which further connects MO-MAB with MAB, in conjunction with the upper bound results. Formally, the lower bounds on Pareto regret and Pareto pseudo regret in both stochastic MO-MAB and adversarial MO-MAB are summarized as follows.

Theorem 5.1. For stochastic MO-MAB, there exist scenarios where the Pareto pseudo regret of any randomized algorithm is at least of order $\log T$.

Theorem 5.2. For stochastic MO-MAB and small $T$, there exist scenarios where the expected Pareto regret of any randomized algorithm is at least of order $\log T$.

Theorem 5.3. For adversarial MO-MAB, there exist scenarios where the Pareto pseudo regret of any randomized algorithm is at least of order $\sqrt{T}$.

Theorem 5.4. Consider adversarial MO-MAB with repeating values across dimensions in the reward vectors, i.e. $r_{i,t} = (r_{i,t}^1, \ldots, r_{i,t}^1)$. Then there exist scenarios where the Pareto regret of any randomized algorithm is at least of order $\sqrt{T}$ with high probability.

In the proofs of Theorems 5.3 and 5.4, we find instances where the regret in MO-MAB is equivalent to the regret in MAB in one dimension, namely the marginal regret. This can be done by letting either the reward vectors or the reward mean vectors have the same value in all dimensions. Then the lower bounds on the marginal regret essentially hold for MO-MAB when the reward values in the vectors meet the corresponding conditions in MAB. This argument does not hold for Theorems 5.1 and 5.2 since, due to stochasticity, the realized reward vectors do not have the same value in all dimensions. This brings additional difficulties and thereby necessitates a new analytical approach. We herein utilize multi-dimensional concentration inequalities to deal with stochasticity.

For stochastic MO-MAB, (Drugan and Nowé, 2013) establish a lower bound of order $\log T$ and our results are consistent with it, with respect to the newly defined Pareto regret and Pareto pseudo regret. This justifies the introduction of the proposed measures. Furthermore, for adversarial MO-MAB, our regrets have lower bounds of order $\sqrt{T}$, which are the same as in single-dimensional MAB (Gerchinovitz and Lattimore, 2016; Bubeck et al., 2012). Meanwhile, the upper bounds of MO-KS have the same order as the lower bounds, implying optimality of MO-KS, while MO-US is nearly optimal up to a $\log T$ factor since its upper bound in the stochastic setting is $(\log T)^2$.

### 5.2. Lower bounds of Pareto UCB

We have already shown lower bounds for general algorithms. Motivated by the online adversarial attack on UCB with small attack cost in the MAB setting studied in (Jun et al., 2018), we go further for Pareto UCB, designed for stochastic MO-MAB, and show that it can incur linear regret in terms of the Pareto regrets $R_T$ and $R^*_T$ given adaptive adversaries, by borrowing the adversarial attack mechanism from MAB. This analysis is made possible by our newly proposed framework of adversarial MO-MAB, highlighting the ineffectiveness of Pareto UCB in such scenarios. Specifically, we consider scenarios where the reward vectors of all arms are sub-Gaussian distributed with the same variance $\sigma^2$, and $\mu_K$ is dominated by every other arm, so that for every $i < K$, $0 < \Delta_i = \max_d\{\mu_i^d - \mu_K^d\} = \|\mu_i - \mu_K\|_\infty$.
Generally speaking, the adversary, Alice, aims to manipulate Bob, the player, into pulling arm $K$ very often while making small attacks, under the assumptions that 1) Bob does not know the presence of Alice, 2) the number of arms $K$ and the time horizon $T$ are known to both Alice and Bob, 3) Alice knows that arm $K$ is dominated by all other arms, and 4) Alice knows the exact algorithm Bob is using. Note that Alice does not necessarily know the specific arm $a_t$ pulled by Bob at time step $t$, which generalizes the adversarial attack in MAB as in (Jun et al., 2018), which assumes a known $a_t$. Formally, $\alpha_t$ is the cost of Alice's attack on Bob at time step $t$. Parameter $\Delta_0 > 0$ is determined by Alice and $\delta \in (0, 1)$. (We have already defined $\Delta_i$ for $i \ge 1$ but not $\Delta_0$.) Let $\hat\mu_i(t)$ be the estimator of $\mu_i$ based on pre-attack rewards up to time step $t$, while $\hat{\tilde\mu}_i(t)$ and $\tilde\mu_i = \mu_i - \frac{1}{T}\sum_{t=1}^T \alpha_t\,\mathbf{1}$ are the estimator and the true value of the post-attack reward of arm $i$, respectively. The attack algorithm in MAB is shown in Algorithm 3.

Algorithm 3: Attack UCB
Initialization: $\beta(n) = \sqrt{\frac{2\sigma^2}{n}\log\frac{\pi^2 K n^2}{3\delta}}$.
For $t = 1, 2, \ldots, T$:
- Bob chooses $a_t$ according to the UCB algorithm; simultaneously, Alice computes $a_t$ based on the past $t-1$ observations and the UCB algorithm;
- the environment generates rewards for each arm $i$ by sampling from a stochastic generator;
- Alice learns the pre-attack reward $r_{a_t,t}$ from the environment and uses it to update all $\hat\mu$, $\hat{\tilde\mu}$;
- $\alpha_t = 1_{a_t \neq K}\,\max\{0,\ \hat{\tilde\mu}_{a_t}(t) - \hat\mu_K(t) + 2\beta(N_K(t)) + \Delta_0\}$;
- Bob receives the reward $r_{a_t,t} - \alpha_t$.

At each time step, Alice predicts the choice $a_t$ of Bob given by the UCB algorithm and attacks when $a_t \neq K$ to mislead Bob. When it comes to Pareto UCB in the context of MO-MAB, the randomness in the Pareto optimal front presents a concern since $a_t$ is unpredictable. As a generalization, we propose Algorithm 4, which attacks Pareto UCB with small costs. Let $\tilde O$ and $\tilde O_t$ be the post-attack Pareto optimal front and its estimator at time step $t$, let $O$ again be the pre-attack Pareto optimal front on the $\mu_i$, and let $O_t$ be the estimator of $O$ up to time step $t$. As before, we denote the corresponding sets of arm indices with subscript $A$. More specifically, we modify the estimation of the best arm in UCB by estimating the Pareto optimal front $O_t$, attack every arm in $O_t^A$ if $K \notin O_t^A$, and determine the attack cost as a corruption via the Pareto order relationship.

#### 5.2.1. Results on Stochastic Pareto Regret

Under Algorithm 4, we present a lower bound on the stochastic Pareto regret $R_T$ defined in (Drugan and Nowé, 2013). We assume that Alice does not attack in the first $2K$ rounds, i.e. $\sum_{s=1}^{2K} \alpha_s = 0$. The result generalizes the adversarial attack on UCB to an adversarial attack on Pareto UCB, i.e. from MAB to MO-MAB. We first state Theorem 5.5, which provides an upper bound on the number of times Bob does not pull arm $K$ and on the total attack cost.

Algorithm 4: Attack Pareto UCB
Initialization: $\beta(n) = \sqrt{\frac{2\sigma^2}{n}\log\frac{\pi^2 K n^2}{3\delta}}$.
For $t = 1, 2, \ldots, T$:
- Bob computes the Pareto front $\tilde O_t$; the environment generates a reward vector for each arm by sampling from a generator;
- simultaneously, Alice computes the Pareto front $O_t$ based on the past $t-1$ observations and the Pareto UCB algorithm;
- Alice learns the pre-attack rewards $r_{i,t}$ for $i \in O_t^A$ and uses them to update $\hat\mu$, $\hat{\tilde\mu}$;
- if $K \in O_t^A$ then $\alpha_t = 0$; else, for every $j \in O_t^A$ compute $z_j = \hat\mu_K - (2\beta(N_K(t)) + \Delta_0)\,\mathbf{1}$ and $\hat z_j = \frac{N_j(t-1)\,\hat\mu_j(t-1) - \sum_{s=1}^{t-1}\alpha_s\,\mathbf{1} + r_{j,t}}{N_j(t)}$, and set $\alpha_t = \max\{\max_{j \in O_t^A,\, d} N_j(t)(\hat z_j^d - z_j^d),\ 0\}$;
- Bob randomly samples arm $a_t$ from $\tilde O_t^A$ and receives the reward $r_{a_t,t} - \alpha_t\,\mathbf{1}$.
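For illustration only, here is a Python sketch of the per-round attack-cost computation in Algorithm 4 as we reconstructed it; the exact handling of the pull counts and of the cumulative cost term was partially lost in extraction, so the `attack_cost` helper below (its name, arguments, and the `N[j] + 1` count update) is an assumption rather than a faithful transcription.

```python
import numpy as np

def beta(n, K, sigma2, delta):
    """Confidence radius used by the attacker (as reconstructed from the paper's appendix)."""
    return np.sqrt(2.0 * sigma2 / n * np.log(np.pi ** 2 * K * n ** 2 / (3.0 * delta)))

def attack_cost(front_arms, K_target, mu_hat, N, r_t, cum_alpha, sigma2, delta, delta0):
    """One round of the Algorithm-4 attack cost (our sketch, not verified against the paper).
    mu_hat: (K, D) pre-attack empirical means; N: pull counts; r_t: (K, D) fresh rewards;
    cum_alpha: total attack cost spent so far; K_target: the dominated arm Alice promotes."""
    if K_target in front_arms:
        return 0.0                           # Bob may already pick the target arm; no attack
    K, D = mu_hat.shape
    # threshold vector z: push front arms below the target's estimate by 2*beta + delta0
    z = mu_hat[K_target] - (2.0 * beta(N[K_target], K, sigma2, delta) + delta0)
    costs = []
    for j in front_arms:
        # post-attack running mean of arm j if its fresh reward were left untouched
        z_hat = (N[j] * mu_hat[j] - cum_alpha + r_t[j]) / (N[j] + 1)
        costs.append(np.max((N[j] + 1) * (z_hat - z)))   # worst dimension determines the cost
    return max(max(costs), 0.0)
```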
Theorem 5.5. Let $K \ge \frac{3e^2\delta}{\pi^2}$. With probability at least $1 - D\delta$, for any $T > 2K$, under Algorithm 4, any non-target arm is pulled $O(\log T)$ times and the total attack cost is $O\big((K-1)\tfrac{\sigma^2}{\Delta_0^2}\max_i(\Delta_i + \Delta_0)\log T\big) + O(\log T)$.

As a consequence, the Pareto regret $R_T$ can be bounded via its definition, which is stated as Theorem 5.6.

Theorem 5.6. With probability at least $1 - D\delta$, for any $T > 2K$ and $K \ge \frac{3e^2\delta}{\pi^2}$, the Pareto regret $R_T$ of the Pareto UCB algorithm is at least of order $T$ under the adversarial attack generated by Algorithm 4.

The proofs of Theorem 5.5 and Theorem 5.6 follow the logic of what has been established in (Jun et al., 2018), but the analyses are mostly in the reward vector space and under our newly proposed attack algorithm. The dimensionality of rewards presents non-trivial challenges in that the attack is on multiple arms at each time step and Bob's rule of play is Pareto UCB, which makes no deterministic choices. To this end, we use multi-dimensional concentration inequalities (Drugan and Nowé, 2013). For stochastic MO-MAB under no adversarial attack, Pareto UCB is optimal with respect to the Pareto regret $R_T$ as established in (Drugan and Nowé, 2013), but its optimality is unknown in adversarial settings. While (Jun et al., 2018) fully study the adversarial attack on UCB in MAB, their result does not apply to Pareto UCB for MO-MAB. When the adversarial attack is added to Pareto UCB, which leads to an adversarial MO-MAB setting, we show that its Pareto regret $R_T$ is at least linear in $T$.

#### 5.2.2. Results on General Pareto Regret

Next, we study the lower bound on the general Pareto regret $R^*_T$ for Pareto UCB under Algorithm 4, as a validation of the claim that Pareto UCB is not optimal for adversarial MO-MAB. Let $\alpha^j_t$ denote the counterfactual attack cost of arm $j$ with respect to arm $K$. These costs are computed by Alice only for the arms in the Pareto optimal set and are $0$ for the arms not in the Pareto optimal set. In other words, the counterfactual attack cost is defined for each arm $j$ as $\alpha^j_t = \max\{\max_d N_j(t)(\hat z_j^d - z_j^d), 0\}$. Based on Algorithm 4, $\alpha_t = \max_{j \in O_t^A}\alpha^j_t = \max_j \alpha^j_t$.

We start by properly defining the Pareto optimal front for stochastic MO-MAB with the attack cost, which is necessary for the Pareto regret $R^*_T$. Two options can be considered: one averaging the actual attack costs and one taking into account the counterfactual attack costs.

Definition 1. Given the adversarial attack cost, $O'$ is defined as the Pareto optimal front of the vectors $\mu_i - \frac{1}{\sum_{j \neq K} N_j(T)}\sum_{t: a_t \neq K}\alpha_t\,\mathbf{1}$ over all arms $i$, whereas $O^*$ is the Pareto optimal front of the vectors $\frac{1}{T}\sum_{t=1}^T r_{i,t} - \frac{1}{N_i(T)}\sum_{t: a_t = i}\alpha_t\,\mathbf{1}$.

Definition 2. Given the adversarial attack cost, $O'$ is defined as the Pareto front of the vectors $\mu_i - \frac{1}{T}\sum_{t=1}^T \alpha^i_t\,\mathbf{1}$, whereas $O^*$ is the Pareto front of the vectors $\frac{1}{T}\big(\sum_{t=1}^T r_{i,t} - \sum_{t=1}^T \alpha^i_t\,\mathbf{1}\big)$.

Assumption 4. Let $S$ be the set $\{\mu_1, \ldots, \mu_K\}$. We assume that the distance from the arms that are not in the Pareto front of $S$ to the arms in the Pareto front of $S$ is at least $5\gamma$ for $0 < \gamma < \frac{1}{5}$. Formally, $\mu_j \succeq \mu_i + 5\gamma\,\mathbf{1}$ holds for any arm $i$ not in the Pareto front of $S$ and any arm $j$ in the Pareto front of $S$.

Based on either definition, we establish a lower bound on Pareto regret under certain conditions as follows. First, we consider the high probability regret lower bound, which essentially implies a lower bound on the expected regret by integration.
Theorem 5.7. Under Definition 1 and Assumption 4, with probability at least $1 - 2\eta - D\delta$, for any $T$, $0 < \mu < 1$, $\delta \le \frac{2\pi^2}{3e^2}$, and $\sigma^2 \le \frac{1}{5(\ln(4D) + \ln\frac{1}{\mu})}$, the Pareto regret $R^*_T$ of the Pareto UCB algorithm is at least of order $T$ under Algorithm 4, provided $K = 2$ and $\gamma \ge \sqrt{2\sigma^2(\ln(4D) + \ln\frac{1}{\mu})}$.

The proof connects $R^*_T$ and $R_T$ by concentration inequalities and an asymptotic property of the attack cost, and then studies $R_T$. Next, based on Definition 2, we obtain a similar result.

Theorem 5.8. Under Definition 2 and Assumption 4, with probability at least $1 - 2\eta - D\delta$, for $T \ge T(\gamma)$, $0 < \mu < 1$, $\delta \le \frac{2\pi^2}{3e^2}$, and $\sigma^2 \le \frac{1}{5(\ln(4D) + \ln\frac{1}{\mu})}$, the Pareto regret $R^*_T$ of the Pareto UCB algorithm is at least of order $T$ under Algorithm 4, provided $K = 2$ and $\gamma \ge \sqrt{2\sigma^2(\ln(4D) + \ln\frac{1}{\mu})}$.

Again, the proof utilizes concentration inequalities, the asymptotic property of the attack cost, and the aforementioned lower bound on $R_T$. For the Pareto regret of (Drugan and Nowé, 2013), we previously established that its lower bound is linear when attacking Pareto UCB with adversarial costs by extending the MAB result in (Jun et al., 2018). The regret analysis in (Jun et al., 2018) only covers pre-attack rewards, while post-attack scenarios are not studied. Here we generalize the result to our newly defined Pareto regret and Pareto pseudo regret, defined on post-attack reward vectors, to further validate the non-optimality of Pareto UCB in adversarial MO-MAB.

Remark 5.9. It is worth highlighting the robustness of our newly proposed algorithms against this online adversarial attack. In the presence of such attacks, where the attack cost is of order $\log T$ as shown in Theorem 5.5 and supported by Lemma 8.5, we observe that for any large $t$, $N_i(t) \le \min\{N_K(t), 2 + \frac{9\sigma^2}{\Delta_0^2}\log t\}$. This implies that $N_K(t)$ is of order $T$. As a result, for at least order $T$ rounds, Alice refrains from launching attacks. During this period, the regret of our proposed algorithms is of order $O(\sqrt{T})$, in contrast to Pareto UCB, which exhibits linear regret in such scenarios.

## 6. Conclusion

In this paper, we study multi-objective multi-armed bandit (MO-MAB) with full Pareto regret analyses. We formulate the general MO-MAB problem for both stochastic and adversarial settings from the perspective of Pareto optimality and regret. We then propose an optimal algorithm given a priori knowledge of whether the setting is stochastic or adversarial, and develop a nearly optimal algorithm, up to a $\log T$ factor, when such knowledge is lacking. The theoretical guarantees are fully examined by both upper bounds and lower bounds on Pareto regret and Pareto pseudo regret. Moreover, we extend the adversarial attack algorithm for UCB to a new one attacking Pareto UCB in the context of MO-MAB. We also establish a lower bound on the stochastic Pareto regret of Pareto UCB and subsequently show that Pareto UCB may perform poorly in adversarial settings by providing a lower bound on its general Pareto regret. A summary of the research gaps we close is shown in Figure 1, where the highlighted parts are the contributions of this paper.
Figure 1: An illustration of our contributions.

Table 1: Comparisons.
- (1) Pareto optimization — application: Pareto order relationship, Pareto optimal front, MO-MAB; limitation: scalarization functions, solving $T$ subproblems; our results: formulation of general MO-MAB.
- (2) Adversarial attack — application: attacks on stochastic bandits; limitation: single-objective, regret not analyzed in adversarial settings; our results: analyses in MO-MAB, regret $T$.
- (3) Best-of-both-worlds — known results: (nearly) optimal in the stochastic (adversarial) settings, $\ln T$ ($(\ln T)^2$); limitation: single-objective; our results: analyses in MO-MAB, $\ln T$ ($(\ln T)^2$).

To explore and compare the connections between the research domains discussed herein, we present a summary table (Table 1) that examines three distinct areas of research: Pareto optimization (labeled (1)), adversarial attack (labeled (2)), and best-of-both-worlds algorithms (labeled (3)). We compare these methods from various perspectives, including their applications, known results, and areas where they have not yet been studied, or equivalently, potential limitations, as well as our own contributions to these fields. For simplicity, we use the term nearly optimal to denote performance up to a factor of $\log T$.

As future work, we point out that it would be of great interest to improve the upper bound on Pareto pseudo regret to achieve optimality given no knowledge of the MO-MAB setting. This is in line with the current work on MAB, since it remains open to develop an optimal algorithm for general MAB under no assumptions. Meanwhile, it is worth investigating the explicit relationship between the regret upper bound and both the dimension $D$ and the size of the Pareto optimal set $|O|$. The existing lower bound results already indicate a dependency on $D$, suggesting the possibility of enhancing the regret upper bound through the development of more effective algorithms. As a concluding remark, other metrics can be developed for measuring the performance of algorithms, such as unfairness.

## 7. Acknowledgement

We would like to thank the ICML anonymous reviewers and meta-reviewer for their helpful suggestions and valuable comments. Their feedback has greatly helped to improve the paper.

## References

J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, volume 7, pages 1-122, 2009.

P. Auer and C.-K. Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116-120. Proceedings of Machine Learning Research, 2016.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

R. Busa-Fekete, B. Szörényi, P. Weng, and S. Mannor. Multi-objective bandits: Optimizing the generalized Gini index. In International Conference on Machine Learning, pages 625-634. Proceedings of Machine Learning Research, 2017.

M. M. Drugan and A. Nowé. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks, pages 1-8. IEEE, 2013.

M. M. Drugan, A. Nowé, and B. Manderick. Pareto upper confidence bounds algorithms: an empirical study. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 1-8. IEEE, 2014.

S. Gerchinovitz and T. Lattimore.
Refined lower bounds for adversarial bandits. Advances in Neural Information Processing Systems, 29, 2016. P. Groetzner and R. Werner. Multiobjective optimization under uncertainty: A multiobjective robust (relative) regret approach. European Journal of Operational Research, 296(1):101 115, 2022. W.-J. Hong, P. Yang, and K. Tang. Evolutionary computation for large-scale multi-objective optimization: A decade of progresses. International Journal of Automation and Computing, 18(2):155 169, 2021. A. H uy uk and C. Tekin. Multi-objective multi-armed bandit with lexicographically ordered and satisficing objectives. Machine Learning, 110(6):1233 1266, 2021. K.-S. Jun, L. Li, Y. Ma, and J. Zhu. Adversarial attacks on stochastic bandits. Advances in Neural Information Processing Systems, 31, 2018. T. Lattimore and C. Szepesv ari. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/ 9781108571401. R. Mehrotra, N. Xue, and M. Lalmas. Bandit based optimization of multiple objectives on a music streaming platform. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3224 3233, 2020. Y. Seldin and G. Lugosi. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pages 1743 1759. Proceedings of Machine Learning Research, 2017. Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287 1295. Proceedings of Machine Learning Research, 2014. K. Van Moffaert, K. Van Vaerenbergh, P. Vrancx, and A. Now e. Multi-objective χ-armed bandits. In 2014 International Joint Conference on Neural Networks, pages 2331 2338. IEEE, 2014. S. Q. Yahyaa, M. M. Drugan, and B. Manderick. Annealingpareto multi-objective multi-armed bandit algorithm. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 1 8. IEEE, 2014. Y. Zhu and R. Nowak. On regret with multiple best arms. Advances in Neural Information Processing Systems, 33: 9050 9060, 2020. J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 467 475. Proceedings of Machine Learning Research, 2019. Pareto Regret Analyses in Multi-objective Multi-armed Bandit 8. Appendix 8.1. Proofs of the results in Section 4 Throughout the following lemmas, we let a be a vector such that the set {ϵ 0 : a + ϵ1||σ for every σ O} is not empty. Lemma 8.1. For any a and O we have u = Dist(a, O) = min d max σ O(σd ad). Proof. By definition, we have that Dist(a, O) = min ϵ 0 {ϵ : a + ϵ1||σ for every σ O} = min ϵ 0 (Aϵ). For ϵ Aϵ and for every σ O, there exists at least one dimension d such that ad + ϵ > σd and one dimension d such that ad + ϵ σd . Let B be the set of ϵ where for every σ O, there exists a dimension d such that ϵ σd ad. Similarly, let C be the set of ϵ where for every σ O, there exists a dimension d such that ϵ > σd ad . Clearly, we observe that Formally, we have B = {ϵ : 0 ϵ b = minσ maxd(σd ad)} and C = {ϵ : ϵ c = mind maxσ(σd ad)}, by their definitions. Note that for Aϵ = , it must be the case that c b. Therefore, it indicates min ϵ (Aϵ) = min ϵ (B C) = c, which completes the proof. Lemma 8.2. If O = {σ}, then Dist(a, O) = mind(σd ad). Proof. This is a consequence of letting O = {σ} in Lemma 8.1. Lemma 8.3. There exists σ O such that a + u1 σ where u is defined in Lemma 8.1. 
For any dimension d we have u max σ O(σd ad). Proof. We prove the first part by contradiction. Let us assume that a + u1||σ for every σ O. Then for each σ there exist d = d(σ) such that ad + u > σd. We denote by s(σ) the set of all dimensions with this property. We have s(σ) = , (s(σ))c = . Let us define l(σ) = maxd s(σ)(ad + u σd) and d (σ) = arg maxd s(σ)(ad + u σd). By definition l(σ) > 0. Let ϵ = minσ O l(σ) > 0. We consider a + (u ϵ 2)1. For every σ O and d d (σ) = , we have ad + u ϵ 2 ad + u l(σ) 2 = ad+u ad+u σd 2 > σd. For d s(σ) (note that s(σ)c = ), we have ad + u ϵ 2 ad + u σd. We conclude that a + (u ϵ 2)1||σ for every σ O which contradicts the definition of u. For the second part, a + u1 σ implies that for every d we have u σd ad. The existence of σ gives us that u maxσ O(σd ad). 8.1.1. PROOF OF THEOREM 4.1 Proof. It follows from Lemma 8.3 and definitions. Specifically, we use Lemma 8.3 by applying a = PT t=1 rat,t and O = PT t=1 ri ,t, where a meets the condition that a {a σ for some σ O}. Otherwise, by the fact that PT t=1 rat,t does not dominate any PT t=1 ri ,t O which is the set of Pareto optimal reward vectors, we have a+c1 = σ for exactly one c and σ O. In this case, the distance is equivalent to σd ad maxσ O(σd sd). 8.1.2. PROOF OF THEOREM 4.2 Proof. The result can be shown by combining the regret analysis of the EXP3.P and UCB algorithms in onedimension MAB. Formally, when s = 0, regret Rd T of UCB satisfies that E[Rd T ] O (log T). While s = 1 gives us that the regret of EXP3.P satisfies E[Rd T ] O ( Meanwhile, according to Theorem 4.1, we have that RT mind Rd T . Then by the Jensen s inequality, the expected value of RT can be upper bounded by E[RT ] E[min d Rd T ] min d E[Rd T ] E[Rd T ]. (1) To conclude, the result follows by plugging the upper bounds on E[Rd T ] in (1). 8.1.3. PROOF OF THEOREM 4.3 Proof. It follows from Lemma 8.3 and definitions. Likewise, we again leverage Lemma 8.3 by using a = Pareto Regret Analyses in Multi-objective Multi-armed Bandit PT t=1 E[rat,t] and O = PT t=1 E[ri ,t], where a meets the condition that a {a σ for some σ O}. Otherwise, by noting that PT t=1 E[rat,t] does not dominate any PT t=1 E[ri ,t] O which is the set of Pareto optimal expected reward vectors, we obtain a + c1 = σ for exactly one c and σ O and thus the distance equals to σd ad maxσ O(σd sd), which completes the proof. 8.1.4. PROOF OF THEOREM 4.4 Proof. By the choice of parameters ηt, ζt(a), ψt(a), the results hold as follows. In the adversarial regime, we have that the pseudo regret satisfies while in the stochastic regime, the pseudo regret satisfies Rd T O ((log T)2). Meanwhile, we have that R T min d Rd T Rd T , (2) where the first inequality holds by the argument as in the proof of Theorem 4.3. By plugging the upper bounds of Rd T in 2, we derive the statement in the theorem. 8.2. Proofs of the results in Section 5 8.2.1. PROOF OF THEOREM 5.1 Proof. We consider stochastic MO-MAB with reward vectors µi = (µi 1, . . . , µi 1) RD for any arm i and any time steps t. By the definition of O , we have that O is equivalent to a unique best arm i with µi 1 = maxi µi 1. It gives us that R T = R1 T = . . . = RD T by noting that for any dimension d and by Lemma 2 we have R T = Dist(E[ t=1 rat,t], O ) t=1 rat,t], {Tµi }) t=1 µat, {Tµi }) t=1 µat 1 = Rd T . By the result in (Bubeck et al., 2012) stating that R1 T log T O(1), R T log T O(1). 8.2.2. PROOF OF THEOREM 5.2 Proof. We can assume Ni(T) 1 since otherwise the underlying algorithm has linear regret. 
Consider stochastic MO-MAB with reward mean vectors being µi = µi 1 1 for any arm i and any time steps t. It indicates that ri,t j , ri,t k are i.i.d for any 1 j, k D, though potentially having different values as a result of stochasticity. Therefore, the difference between R T and R1 T can be bounded by the concentration inequality as follows. Note that by Lemma 8.2 and 8.3, Theorem 4.1 and Theorem 4.3,we have R T = Dist( t=1 rat,t, O ) = min d max i ( X t=1 rat,t d ) = min d max i ( at=j rj,t d ) (3) R T = Dist(E[ t=1 rat,t], O ) = min d max i E[ X t=1 rat,t d ]. We use the Hoeffding s inequality, which leads to P(CA) = P( i : |µi d P t ri,t d Ni(T) | |K4 log log Ni(T) where η = P i 2 exp { 2Ni(T) ( K4 log log Ni(T ) Ni(T ) )2} = P i 2 exp { 2 (K4 log log Ni(T ))2 Through the end of the proof, we assume that K = 2 and 11 < T 1200, i.e. 2-arm bandits with mean values µ1, µ2. Pareto Regret Analyses in Multi-objective Multi-armed Bandit We first argue that f(x) = (log log x)2 x is decreasing for x 11. We get f (x) = log log x x2 ( 2 (log log x) log x log x ). For x 11, log x 0, log log x 0 and both are increasing. We conclude that (log log x) log x is increasing for x 11. Since f (11) < 0, it then follows that f (x) < 0 for x 11. If there exists arm i such that Ni(T) = 1, then we have Nj(T) = T 1 for j = i since K = 2. In this case (K4 log log Ni(T ))2 Ni(T ) = , which implies that η = 2 exp { } + 2 exp { 2(K4 log log (T 1))2 = 2 exp { 2(K4 log log (T 1))2 2 exp { 2(K4 log log T)2 = 2 exp { 29 (log log T)2 2 exp ( 29 (log log 1200)2 where the first inequality is by monotonicity of (log log x)2 x for x 11. Suppose now that for i = 1, 2 we have Ni(T) 2, and Ni(T) T. For x = 2, 3, . . . , 1200, f(x) can only have maximum in x {2, 3, . . . , 10, 1200}. Its maximum is for x = 3 and then the maximum in the η term is at Ni(T) = 3. We thus have η 2 2 exp { 29 (log log 3)2 Therefore, we have 1 η 1 9 > 0. With probability 1 η, R T = min d max i ( X at=j rj,t d ) min d max i (T µi d K4 log log T t=1 µat,t d j=1 |K4 log log Nj(T)|) min d max i (T µi d t=1 µat,t d (K + 1)K4 log log T) = min d Rd T (K + 1)K4 log log T = min d Rd T 48 log log T where the first inequality is straightforward from (4) and the second inequality holds by noticing Ni(T) T. Consider the instance-dependent lower bound with θ = µ2 1 µ1 1 = µ2 2 µ1 2 > 0. By the result in (Lattimore and Szepesv ari, 2020) for MAB, we have that for any T and any algorithm with T regret upper bound, it must hold When choosing θ 1 48, we derive R T Rd T 48 log log T (5) 48(log T log log T). (6) To conclude, we obtain E[R T ] E[R T ICA] 48(log T log log T) (1 η) 9 O (log T) = O (log T) where the second inequality holds by (6) and the last inequality uses 1 η > 1 8.2.3. PROOF OF THEOREM 5.3 Proof. (Bubeck et al., 2012) analyze the lower bound on pseudo regret for MAB and establish that for any ϵ > 0, there exists an adversarial MAB, denoted by (ri,t 1 ), satisfying inf R1 T O ( where inf is taken over all algorithms. Again, we focus on adversarial MO-MAB with reward vectors constant in dimensions, i.e. ri,t = (ri,t 1 , . . . , ri,t 1 ), which is equivalent to MAB in the sense that R T = R1 T = . . . = RD T given by O A = {i } = arg max i E[ t=1 ri,t d ] and subsequently by Lemma 8.2, we have t=1 rat,t], O ! t=1 rat,t], {E[ t=1 ri ,t d ]} t=1 ri ,t d ] E[ t=1 rat,t d ] = Rd T for any 1 d D. This is to say that inf R T O ( T) by (7) holds for all algorithms. Pareto Regret Analyses in Multi-objective Multi-armed Bandit 8.2.4. PROOF OF THEOREM 5.4 Proof. 
Consider adversarial MO-MAB with reward vectors being ri,t = (ri,t 1 , . . . , ri,t 1 ). The Pareto optimal set O A is the unique best arm i that satisfies i = arg maxi PT t=1 ri,t 1 . Then the MO-MAB problem essentially degenerates to MAB since R T = R1 T = . . . = RD T which is guaranteed by t=1 rat,t, O ! t=1 ri ,t d t=1 rat,t d = Rd T , for every d. By the result in (Gerchinovitz and Lattimore, 2016) stating that holds for any randomized algorithm, we have that the Pareto regret R T is larger than 8.2.5. RESULTS IN SECTION 5.2 8.2.6. PROOF OF THEOREM 5.5 Proof of Theorem 5.5. With β(n) = q n log π2Kn2 3δ where σ2 is the variance of values for different dimensions in reward vectors of different arms, let us define event E as E = { i, t > K : ||ˆµi(t) µi|| < β(Ni(t))}. We first present several lemmas used later. As before, without loss of generality we assume that in the first K steps every arm is pulled. This implies Ni(t) 1 for any t > K and arm i. Lemma 8.4. For any δ (0, 1), P(E) > 1 Dδ. Proof. Note that = P( i, t > K : ||ˆµi(t) µi|| < β(Ni(t))) = P( i, t > K, d : |N 1 i (t) X as=i ri,s d µi d| < β(Ni(t))) i K D exp { Ni(t) 2σ2 β(Ni(t))2} i K D exp { Ni(t) Ni(t) log π2KN 2 i (t) 3δ } i K D π2KN 2 i (t) 3δ = 1 6δ π2K D X i K N 2 i (t) 1 Dδ 6 π2K Kπ2 6 = 1 Dδ (8) where the first inequality is by the Hoeffding s inequality for sub-Gaussian distributions and the last inequality holds by the fact that P t=1 1 t 2 = π2 Lemma 8.5. Let us assume event E holds. Then, for any i < K and any t 2K, we have Ni(t) min {NK(t), 2 + 9σ2 2 0 log t}. Proof. Since t > 2K, if Nj(t) 2 for any j < K, then NK(t) 2 and the result holds trivially. Let S = {j < K : Nj(t) > 2} = . If we prove the statement for each arm in S, this implies NK(t) > 2 and thus the statement holds also for any arm j S. Let now i S, i.e. Ni(t) > 2 . If at = i, then we can consider the last time t we had at = i and apply the result in this case. Let us assume at = i and we consider the previous time step t < t when Bob pulled arm i. Since at = i, it implies i Ot A. We have that Ni(t 1)+1 = Ni(t ) = Ni(t) 1, Ni(t 1) = Ni(t ) by the definition of Ni(t). We note that by definition of αt we have zj ˆzj + αt Nj(t ) 0 for any j Ot A. This implies since i Ot A and ˆzi αt Ni(t ) = ˆ µi(t ), that ˆ µi(t) ˆ µK(t) (2β(NK(t)) + 0) 1. (9) Pareto Regret Analyses in Multi-objective Multi-armed Bandit Since at = i is chosen by Bob based on Pareto UCB, there exists at least one dimension d, such that ˆ µi d(t 1) + 3σ log t Ni(t 1) ˆ µK d (t 1) + 3σ log t NK(t 1) ˆ µi d(t ) + 3σ log t Ni(t ) ˆ µK d (t 1) + 3σ log t NK(t 1) which is equivalent to log t Ni(t ) 3σ log t NK(t 1) ˆ µK d (t 1) ˆ µi d(t ). ˆ µK d (t 1) ˆ µi d(t ) ˆµK d (t 1) ˆµK d (t ) + 2β(NK(t )) + 0 β(NK(t 1)) β(NK(t )) + 2β(NK(t )) + 0 0 > 0. The first inequality is due to (9) and since Alice never attacks arm K we have ˆ µK(t 1) = ˆµK(t 1). The second inequality holds by the definition of event E and the third inequality uses the fact that β(n) = q n log π2Kn2 monotone decreasing in n 1 if K 3δe2 π2 which can be shown by calculus. Therefore, we have that log t Ni(t ) 3σ log t NK(t 1) 0 > 0, (10) and thus NK(t) = NK(t 1) Ni(t ) + 1 = Ni(t). Meanwhile, from (10) we get log t Ni(t ) > 0 Ni(t) = 1 + Ni(t ) 2 + 9σ2 which completes the proof. Lemma 8.6. Let us assume event E holds. Then the cumulative attack cost for any arm i < K up to time step t 2K, can be bounded as αs max j Nj(t)( j + 0 + 4β(Nj(t))). Proof. Note that the right hand side is monotone increasing in t. 
Lemma 8.6. Let us assume event $E$ holds. Then the cumulative attack cost for any arm $i < K$ up to time step $t \ge 2K$ can be bounded as
$$\sum_{s\le t:\, a_s=i}\alpha_s \le \max_j N_j(t)\big(\Delta_j + \Delta_0 + 4\beta(N_j(t))\big).$$

Proof. Note that the right hand side is monotone increasing in $t$. It therefore suffices to show that the result holds for $t$ with $a_t = i$, essentially assuming that $N_K(t-1) = N_K(t)$. By its definition, the attack cost satisfies
$$\alpha_t = \max\Big\{\max_{j\in O_A^t,\, d} N_j(t)\big(\hat z_j^d - z_j^d\big),\ 0\Big\} = \max\Big\{\max_{j\in O_A^t,\, d}\Big[N_j(t)\hat\mu_j^d(t) - \sum_{s<t:\, a_s=j}\alpha_s - N_j(t)\big(\hat\mu_K^d(t) - 2\beta(N_K(t)) - \Delta_0\big)\Big],\ 0\Big\} = \max\Big\{\max_{j\in O_A^t,\, d} N_j(t)\Big(\hat\mu_j^d(t) - \big(\hat\mu_K^d(t) - 2\beta(N_K(t)) - \Delta_0\big)\Big) - \sum_{s<t:\, a_s=j}\alpha_s,\ 0\Big\},$$
which implies, by adding $\sum_{s<t:\, a_s=i}\alpha_s$ to both sides, that
$$\sum_{s\le t:\, a_s=i}\alpha_s \le \max\Big\{\max_{j\in O_A^t,\, d}\big\{N_j(t)\big(\hat\mu_j^d(t) - \hat\mu_K^d(t-1) + 2\beta(N_K(t-1)) + \Delta_0\big)\big\},\ \sum_{s<t:\, a_s=i}\alpha_s\Big\}.$$
Note that $\sum_{s<t:\, a_s=i}\alpha_s \le \sum_{s<t}\alpha_s$. Since event $E$ holds, we have
$$\sum_{s\le t:\, a_s=i}\alpha_s \le \max\Big\{\max_{j\in O_A^t,\, d} N_j(t)\big(\mu_j^d + \beta(N_j(t)) - \mu_K^d + \beta(N_K(t-1)) + 2\beta(N_K(t-1)) + \Delta_0\big),\ \sum_{s<t}\alpha_s\Big\} \le \max\Big\{\max_{j\in O_A^t,\, d} N_j(t)\big(\mu_j^d - \mu_K^d + 4\beta(N_j(t)) + \Delta_0\big),\ \sum_{s<t}\alpha_s\Big\} = \max\Big\{\max_{j\in O_A^t} N_j(t)\big(\Delta_j + \Delta_0 + 4\beta(N_j(t))\big),\ \sum_{s<t}\alpha_s\Big\} \le \max_{j\in O_A^t} N_j(t)\big(\Delta_j + \Delta_0 + 4\beta(N_j(t))\big),$$
where the first inequality holds by the definition of event $E$, the second inequality holds as a result of Lemma 8.5 and the monotonicity of $\beta(\cdot)$, and the third inequality holds by induction over $\sum_{s<t}\alpha_s$. The induction is with respect to all times $t$ such that $a_t = i$. The base case corresponds to $1 \le t \le 2K$ since we assume that initially every arm is pulled. If there exists $t'$, $K \le t' \le 2K$, such that $a_{t'} = i$, then this is the base case; otherwise $1 \le t' < K$ with $a_{t'} = i$ is the base case. We also use the fact that $\sum_{s=1}^{2K}\alpha_s = 0$.

Suppose event $E$ holds. Lemma 8.5 proves the first half of the theorem. For the second half, we observe that by Lemma 8.6 the total attack cost satisfies
$$\sum_{i<K}\sum_{s\le T:\, a_s=i}\alpha_s \le (K-1)\max_j N_j(T)\big(\Delta_j + \Delta_0 + 4\beta(N_j(T))\big).$$
By Theorem 5.5 we derive $N_K(T) = T - \sum_{i<K} N_i(T) \ge T - (K-1)\big(2 + \frac{9\sigma^2}{\Delta_0^2}\log T\big) > 0$.
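For concreteness, the per-round attack cost $\alpha_t$ manipulated in the proof of Lemma 8.6 can be computed directly from its definition. The following Python sketch is only an illustration of that formula under stated assumptions: it presumes the pre-attack empirical means, pull counts and past per-arm attack costs are tracked externally, it excludes the target arm $K$ itself from the inner maximization, and all function and variable names are hypothetical rather than taken from the paper.

    import math

    def beta(n, sigma, K, delta):
        return math.sqrt(2 * sigma ** 2 / n
                         * math.log(math.pi ** 2 * K * n ** 2 / (3 * delta)))

    def attack_cost(pareto_set, mu_hat, N, past_cost, target, delta0, sigma, K, delta):
        # alpha_t = max{ max_{j in O_A^t, d} N_j(t) * (z_hat_j^d - z_j^d), 0 } with
        #   z_hat_j^d = mu_hat_j^d(t) - (past attack cost spent on arm j) / N_j(t)
        #   z_j^d     = mu_hat_K^d(t) - 2 * beta(N_K(t)) - Delta_0
        b = beta(N[target], sigma, K, delta)
        alpha = 0.0
        for j in pareto_set:
            if j == target:            # arm K itself is never attacked
                continue
            for d in range(len(mu_hat[j])):
                z_hat = mu_hat[j][d] - past_cost[j] / N[j]
                z = mu_hat[target][d] - 2 * b - delta0
                alpha = max(alpha, N[j] * (z_hat - z))
        return alpha

Summing these per-round costs over the rounds in which a fixed arm $i < K$ is pulled gives exactly the quantity that Lemma 8.6 bounds by $\max_j N_j(t)(\Delta_j + \Delta_0 + 4\beta(N_j(t)))$.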
8.2.8. PROOF OF THEOREM 5.7

Proof. The proof is two-fold. We first show that the two Pareto optimal fronts $\bar O$ (built from the attack-adjusted mean vectors) and $\widetilde O$ (built from the attack-adjusted empirical reward vectors) can be quite close in a high-probability sense, and then show that the obtained reward vectors approach the mean vectors as $T$ grows large, by concentration inequalities.

We first note that the conditions in the theorem imply $\gamma \le \frac{1}{5}$, and $K = 2$ is valid; this implies that we can meet Assumption 4. Let $\eta$ be fixed with $0 < \eta < 1$. Since $\{r_{i,t}\}_{1\le t\le T}$ are i.i.d. sub-Gaussian distributed, the Chernoff-Hoeffding inequality implies
$$P\Big(\Big|\frac{1}{N_i(T)}\sum_{s:\, a_s=i} r_{i,s} - \mu_i\Big| \succeq \gamma\mathbf{1}\Big) \le 2D\exp\Big(-\frac{\gamma^2 N_i(T)}{2\sigma^2}\Big) \le 2D\exp\Big(-\frac{\gamma^2}{2\sigma^2}\Big). \qquad (12)$$
In (12) we use the derivation from (8) and the fact that $N_i(T) \ge 1$. By choosing $\gamma$ such that $2DK\exp\big(-\frac{\gamma^2}{2\sigma^2}\big) \le \eta$, which is imposed in the theorem, we have
$$P\Big(\Big|\frac{1}{N_i(T)}\sum_{s:\, a_s=i} r_{i,s} - \mu_i\Big| \prec \gamma\mathbf{1},\ \forall\, 1\le i\le K\Big) \ge 1 - \eta. \qquad (13)$$
We denote by $E_0$ the event in (13), i.e. $P(E_0) \ge 1 - \eta$. Likewise, note that
$$P\Big(\Big|\frac1T\sum_{t=1}^T r_{i,t} - \mu_i\Big| \prec \gamma\mathbf{1},\ \forall\, 1\le i\le K\Big) \ge 1 - \eta, \qquad (14)$$
which can be shown by
$$P\Big(\Big|\frac1T\sum_{t=1}^T r_{i,t} - \mu_i\Big| \prec \gamma\mathbf{1},\ \forall\, 1\le i\le K\Big) = P\Big(\forall i, d:\ \Big|\frac1T\sum_{t=1}^T r_{i,t}^d - \mu_i^d\Big| < \gamma\Big) = 1 - P\Big(\exists i, d:\ \Big|\frac1T\sum_{t=1}^T r_{i,t}^d - \mu_i^d\Big| > \gamma\Big) \ge 1 - \sum_{i,d} P\Big(\Big|\frac1T\sum_{t=1}^T r_{i,t}^d - \mu_i^d\Big| > \gamma\Big) \ge 1 - 2DK\exp\Big(-\frac{T\gamma^2}{2\sigma^2}\Big) \ge 1 - \eta,$$
where the first inequality is by Bonferroni's inequality, the second inequality is by the Chernoff-Hoeffding inequality and the last one holds by the choice of $\gamma$. We denote by $E_1$ the event in (14), i.e. $P(E_1) \ge 1 - \eta$. Meanwhile, by Assumption 4, $\mu_j - \mu_i \succeq 5\gamma\mathbf{1}$ holds for any arm $i$ not in $\bar O_A$ and any arm $j$ in $\bar O_A$. Note that $\bar O$ and $\widetilde O$ differ only in which attack-cost term, $\frac{1}{N_i(T)}\sum_{t:\, a_t=i}\alpha_t$ or $\frac{1}{\sum_{j\ne K}N_j(T)}\sum_{t:\, a_t\ne K}\alpha_t$, is present. Since $K = 2$, the two attack-cost terms are equivalent, by noting that
$$\frac{1}{N_i(T)}\sum_{t:\, a_t=i}\alpha_t = \frac{1}{\sum_{j\ne K}N_j(T)}\sum_{t:\, a_t\ne K}\alpha_t. \qquad (15)$$
For any arm $j \in \bar O_A$ and any arm $i \notin \bar O_A$, on event $E_1$ we derive
$$\frac1T\sum_{t=1}^T r_{j,t} - \frac1T\sum_{t=1}^T r_{i,t} = (\mu_j - \mu_i) - \Big(\frac1T\sum_{t=1}^T r_{i,t} - \mu_i\Big) + \Big(\frac1T\sum_{t=1}^T r_{j,t} - \mu_j\Big) \succeq (5\gamma - 2\gamma)\mathbf{1} = 3\gamma\mathbf{1} \succ 0, \qquad (16)$$
where the last inequality is by the choice of $i, j$ and (14). Together with (15), this implies that $i \notin \widetilde O_A$.

Similarly, consider any arm $i \in \widetilde O_A$ and let $j$ denote the other arm (recall $K = 2$). On event $E_1$ we have
$$\mu_j - \mu_i = \Big(\frac1T\sum_{t=1}^T r_{j,t} - \frac1T\sum_{t=1}^T r_{i,t}\Big) + \Big(\frac1T\sum_{t=1}^T r_{i,t} - \mu_i\Big) - \Big(\frac1T\sum_{t=1}^T r_{j,t} - \mu_j\Big) \preceq (0 + \gamma + \gamma)\mathbf{1} = 2\gamma\mathbf{1}, \qquad (17)$$
where the last inequality is by the choice of $i, j$ and (14). Note that since $K = 2$ we have $\sum_{t=1}^T r_{i,t} \succeq \sum_{t=1}^T r_{j,t}$. If $i \notin \bar O_A$, then $j \in \bar O_A$ and $\mu_j \succ \mu_i + 5\gamma\mathbf{1}$, since 1) there is an arm not in $\bar O_A$ (by our construction $\bar O_A$ contains at least one arm, arm $K$), and 2) Assumption 4 is valid. This contradicts (17) and thus implies that $i \in \bar O_A$. Consequently, we have established that on event $E_1$ we have $\widetilde O_A = \bar O_A$. Since $P(E_1) \ge 1 - \eta$ we conclude
$$P\big(\widetilde O_A = \bar O_A\big) \ge 1 - \eta. \qquad (18)$$
We further obtain, by (14) and (15), on event $E_1$ for any $i$,
$$\Big(\frac1T\sum_{t=1}^T r_{i,t} - \frac{1}{N_i(T)}\sum_{t:\, a_t=i}\alpha_t\,\mathbf{1}\Big) - \gamma\mathbf{1} \preceq \mu_i - \frac{1}{\sum_{j\ne K}N_j(T)}\sum_{t:\, a_t\ne K}\alpha_t\,\mathbf{1} \preceq \Big(\frac1T\sum_{t=1}^T r_{i,t} - \frac{1}{N_i(T)}\sum_{t:\, a_t=i}\alpha_t\,\mathbf{1}\Big) + \gamma\mathbf{1}. \qquad (19)$$
Therefore, with $\bar O$ defined as the Pareto optimal front of the vectors $\mu_i - \frac{1}{\sum_{j\ne K}N_j(T)}\sum_{t:\, a_t\ne K}\alpha_t\,\mathbf{1}$ over all arms $i$, and $\widetilde O$ defined over the vectors $\frac1T\sum_{t=1}^T r_{i,t} - \frac{1}{N_i(T)}\sum_{t:\, a_t=i}\alpha_t\,\mathbf{1}$, we have on event $E_1$ that
$$R'_T = T\,\mathrm{Dist}\Big(\frac1T\sum_{t=1}^T r_{a_t,t} - \frac{1}{\sum_{j\ne K}N_j(T)}\sum_{t:\, a_t\ne K}\alpha_t\,\mathbf{1},\ \widetilde O\Big) \ge T\,\mathrm{Dist}\Big(\frac1T\Big(N_K(T)\hat\mu_K + \sum_{i\ne K}N_i(T)\hat\mu_i\Big) - \frac{1}{\sum_{j\ne K}N_j(T)}\sum_{t:\, a_t\ne K}\alpha_t\,\mathbf{1},\ \bar O\Big) - 2\gamma T. \qquad (20)$$
The inequality follows from (18), (19) and the following statement in Proposition 1.

Proposition 1. For any $\gamma_1 > 0$ and $\gamma_2 > 0$, if $\widetilde O_A = \bar O_A$ and $u_a \succeq v_a - \gamma_1\mathbf{1}$ for any $v_a \in \bar O$ and $u_a \in \widetilde O$ with $a \in \bar O_A$, then for any $e, \tilde e$ with $e \preceq \tilde e + \gamma_2\mathbf{1}$, by Lemma 8.1 we have
$$\mathrm{Dist}(e, \widetilde O) = \min_d\max_{a\in\widetilde O_A}\big(u_a^d - e^d\big) = \min_d\max_{a\in\bar O_A}\big(u_a^d - e^d\big) \ge \min_d\max_{a\in\bar O_A}\big(v_a^d - e^d - \gamma_1\big) \ge \min_d\max_{a\in\bar O_A}\big(v_a^d - \tilde e^d\big) - \gamma_1 - \gamma_2 = \mathrm{Dist}(\tilde e, \bar O) - \gamma_1 - \gamma_2.$$
Note that $\mathrm{Dist}(e, A) = \mathrm{Dist}(e - z, \{a - z \mid a \in A\})$.

On event $E_0 \cap E_1$, we have that
$$N_K(T)\hat\mu_K + \sum_{i\ne K}N_i(T)\hat\mu_i \succeq N_K(T)\mu_K + \sum_{i\ne K}N_i(T)\mu_i - 3\gamma T\,\mathbf{1} =: T_D - 3\gamma T\,\mathbf{1},$$
where we use (15), the various definitions, and (13) by bounding $\mu_i \preceq \hat\mu_i + \gamma\mathbf{1}$ for any arm $i$. By Theorem 5.5, we have that on event $E$
$$N_i(T) \le O(\log T) \qquad (21)$$
for every $i \ne K$, and clearly $N_K(T) \le T$. Therefore, on the events $E, E_0, E_1$ we have
$$T\,\mathrm{Dist}\Big(\frac{T_D}{T},\ O^*\Big) \ge T\,\mathrm{Dist}\Big(\frac1T\Big(T\mu_K + (K-1)\,O(\log T)\max_{i\ne K}\mu_i\Big),\ O^*\Big) \ge T\,\mathrm{Dist}\big(\mu_K,\ O^*\big) - (K-1)\,O(\log T), \qquad (22)$$
where the last inequality uses the fact that $\max_i \mu_i \preceq \mathbf{1}$. By our construction, the distance from the reward vector of arm $K$ to the Pareto optimal front $O^*$ satisfies $\mathrm{Dist}(\mu_K, O^*) \ge 5\gamma$ since $K \notin O^*_A$. Therefore, on $E \cap E_0 \cap E_1$ we have that
$$R'_T \ge T\cdot 5\gamma - (K-1)\,O(\log T) - 3\gamma T = 2\gamma T - (K-1)\,O(\log T).$$
Note that
$$P(E\cap E_0\cap E_1) \ge 1 - P(E^c) - P\big((E_0\cap E_1)^c\big) \ge 1 - D\delta - P(E_0^c) - P(E_1^c) \ge 1 - D\delta - 2\eta.$$
This completes the proof.

8.2.9. PROOF OF THEOREM 5.8

Proof. We use the notation from the proof of Theorem 5.7. By the statement in (14), we obtain
$$P\Big(\Big|\frac1T\sum_{t=1}^T r_{i,t} - \mu_i\Big| \prec \gamma\mathbf{1},\ \forall\, 1\le i\le K\Big) \ge 1 - \eta.$$
By Assumption 4, $\mu_j - \mu_i \succeq 5\gamma\mathbf{1}$ holds for any arm $i$ not in $\bar O_A$ and any arm $j$ in $\bar O_A$. For any arm $j \in \bar O_A$ and any arm $i \notin \bar O_A$, on event $E_1$ we have
$$\frac1T\sum_{t=1}^T r_{j,t} - \frac1T\sum_{t=1}^T r_{i,t} \succeq 3\gamma\mathbf{1} \succ 0$$
by the result in (16). Meanwhile, since $\bar O$ and $\widetilde O$ here consider the same attack cost, this again implies that $i \notin \widetilde O_A$. Similarly, for any arm $i \in \widetilde O_A$, on event $E_1$ we have arm $i \in \bar O_A$. Consequently, on event $E_1$ we get
$$\Big(\frac1T\sum_{t=1}^T r_{i,t} - \frac1T\sum_{t=1}^T\alpha^i_t\,\mathbf{1}\Big) - \gamma\mathbf{1} \preceq \mu_i - \frac1T\sum_{t=1}^T\alpha^i_t\,\mathbf{1} \preceq \Big(\frac1T\sum_{t=1}^T r_{i,t} - \frac1T\sum_{t=1}^T\alpha^i_t\,\mathbf{1}\Big) + \gamma\mathbf{1}.$$
Therefore, with $\bar O$ now defined as the Pareto optimal front of the vectors $\mu_i - \frac1T\sum_{t=1}^T\alpha^i_t\,\mathbf{1}$, whereas $\widetilde O$ is over the vectors $\frac1T\big(\sum_{t=1}^T r_{i,t} - \sum_{t=1}^T\alpha^i_t\,\mathbf{1}\big)$, we have on event $E_0 \cap E_1$ that
$$R'_T \ge T\,\mathrm{Dist}\Big(\frac1T\Big(N_K(T)\mu_K + \sum_{i\ne K}N_i(T)\mu_i - \sum_{t=1}^T\alpha^{a_t}_t\,\mathbf{1}\Big),\ \bar O\Big) - 2\gamma T. \qquad (23)$$
This derivation follows (20) and (13) and relies on Proposition 1. Consider the counterfactual attack cost $\alpha^i_t$ on arm $i$ in the definition of $\bar O$. Formally, according to the definition of $\alpha^i_t$, it reads explicitly as
$$\alpha^i_t = \max\Big\{\max_d N_i(t)\big(\hat z_i^d - z_i^d\big),\ 0\Big\}.$$
We observe that
$$\alpha_t = \max\Big\{\max_{j\in O_A^t,\, d} N_j(t)\big(\hat z_j^d - z_j^d\big),\ 0\Big\} \ge \alpha^{a_t}_t,$$
since the arm chosen by Bob satisfies $a_t \in O_A^t$.
Therefore, by the results in Lemmas 8.6 and 8.5, on event $E$ we have
$$\sum_{t=1}^T\alpha^{a_t}_t \le \sum_{t=1}^T\alpha_t \le (K-1)\max_j N_j(T)\big(\Delta_j + \Delta_0 + 4\beta(N_j(T))\big).$$
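Throughout these proofs the Pareto regret is expressed through a distance of the form $\mathrm{Dist}(e, O) = \min_d\max_{a\in O_A}(u_a^d - e^d)$ to a Pareto optimal front, as in Proposition 1 and the displays (20) and (23). A minimal Python sketch of this computation is given below; the dominance rule, the min-max distance and the toy numbers are illustrative only and assume the evaluated point is weakly dominated by the front:

    def dominates(u, v):
        # u Pareto-dominates v: u >= v in every dimension, strictly in at least one
        return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

    def pareto_arms(vectors):
        # indices of arms whose reward vectors are not dominated by any other arm
        return [i for i, v in enumerate(vectors)
                if not any(dominates(u, v) for j, u in enumerate(vectors) if j != i)]

    def dist_to_front(e, vectors):
        # Dist(e, O) = min_d max_{a in O_A} (u_a^d - e^d), the form used in Proposition 1
        front = [vectors[i] for i in pareto_arms(vectors)]
        return min(max(u[d] - e[d] for u in front) for d in range(len(e)))

    # toy example: K = 3 arms, D = 2 objectives (numbers are illustrative only)
    mu = [(0.9, 0.2), (0.2, 0.9), (0.3, 0.1)]
    print(pareto_arms(mu))               # [0, 1]; arm 2 is Pareto dominated
    print(dist_to_front(mu[2], mu))      # approximately 0.6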