Multiobjective Lipschitz Bandits under Lexicographic Ordering

Bo Xue1,2, Ji Cheng1,2, Fei Liu1,2, Yimu Wang3, Qingfu Zhang1,2,*
1Department of Computer Science, City University of Hong Kong, Hong Kong, China
2The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China
3Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
{boxue4, jcheng27, fliu36}-c@my.cityu.edu.hk, yimu.wang@uwaterloo.ca, qingfu.zhang@cityu.edu.hk

*Qingfu Zhang is the corresponding author.
Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper studies the multiobjective bandit problem under lexicographic ordering, wherein the learner aims to maximize $m$ objectives hierarchically. The only existing algorithm for this problem considers the multi-armed bandit model, and its regret bound is $\tilde{O}((KT)^{2/3})$ under a metric called priority-based regret. However, this bound is suboptimal, as the lower bound for single-objective multi-armed bandits is $\Omega(K \log T)$. Moreover, this bound becomes vacuous when the arm number $K$ is infinite. To address these limitations, we investigate the multiobjective Lipschitz bandit model, which allows for an infinite arm set. Utilizing a newly designed multi-stage decision-making strategy, we develop an improved algorithm that achieves a general regret bound of $\tilde{O}(T^{(d_z^i+1)/(d_z^i+2)})$ for the $i$-th objective, where $d_z^i$ is the zooming dimension for the $i$-th objective, with $i \in \{1, 2, \ldots, m\}$. This bound matches the lower bound of the single-objective Lipschitz bandit problem in terms of $T$, indicating that our algorithm is almost optimal. Numerical experiments confirm the effectiveness of our algorithm.

Introduction

Online learning with bandit feedback is a powerful paradigm for modeling sequential decision-making (Robbins 1952), with applications such as clinical trials (Villar, Bowden, and Wason 2015), news recommendation (Li et al. 2010), and website optimization (White 2012). The fundamental model of this paradigm is multi-armed bandits (MAB), where a learner repeatedly selects one arm from $K$ available arms and receives a stochastic payoff drawn from an unknown distribution associated with the chosen arm (Bubeck et al. 2015; Luo et al. 2018; Zhou, Xu, and Blanchet 2019; Xue et al. 2020; Zhu and Mineiro 2022; Qin et al. 2023; Gou, Yi, and Zhang 2023). The goal of the learner is to minimize regret, defined as the cumulative difference between the expected payoff of the selected arm and that of the inherently best arm. To achieve this goal, the learner must strike a balance between exploration and exploitation, trying potentially better arms while concurrently employing the best arm identified so far.

Although MAB is powerful, many real-world applications involve multiple and potentially conflicting objectives, such as the Click-Through Rate (CTR) and the Post-Click Conversion Rate (CVR) in advertising recommendation systems (Ma et al. 2018). This has led to the study of multiobjective multi-armed bandits (MOMAB), in which payoffs are vectors containing multiple elements and the learner aims to simultaneously minimize the regret for all objectives (Drugan and Nowe 2013). A commonly used criterion for evaluating the performance of MOMAB is Pareto regret, which regards all objectives as equivalent (Van Moffaert et al. 2014; Q. Yahyaa, M. Drugan, and Manderick 2014; Turgay, Oner, and Tekin 2018; Lu et al. 2019a; Xu and Klabjan 2023).
However, some scenarios require varying levels of importance among objectives, such as radiation treatment for cancer patients, where the primary objective is target coverage and the secondary objective is the therapy's proximity to organs at risk (Jee, McShan, and Fraass 2007). Similarly, water resource planning legally mandates the prioritization of objectives such as flood protection, supply shortage for irrigation, and electricity generation (Weber et al. 2002). To deal with such applications, a natural idea is to utilize lexicographic ordering, as it ranks the objectives according to their importance (Ehrgott 2005; Wray and Zilberstein 2015; Wray, Zilberstein, and Mouaddib 2015; Hüyük and Tekin 2021; Hosseini et al. 2021; Skalse et al. 2022). Let $\mathcal{X}$ represent an arm space, and let the expected payoffs for $a, b \in \mathcal{X}$ be $[\mu^1(a), \mu^2(a), \ldots, \mu^m(a)] \in \mathbb{R}^m$ and $[\mu^1(b), \mu^2(b), \ldots, \mu^m(b)] \in \mathbb{R}^m$. The arm $a$ is said to lexicographically dominate arm $b$ if and only if $\mu^1(a) > \mu^1(b)$, or there exists an $i' \in \{2, 3, \ldots, m\}$ such that $\mu^i(a) = \mu^i(b)$ for $1 \le i \le i'-1$ and $\mu^{i'}(a) > \mu^{i'}(b)$. The lexicographically optimal arm is the one that is not lexicographically dominated by any other arm (Hüyük and Tekin 2021).

The only existing algorithm for multiobjective bandits under lexicographic ordering is specifically designed for the MOMAB model (Hüyük and Tekin 2021), whose arm set is finite, i.e., $\mathcal{X} = [K]$. (For any $n \in \mathbb{N}_+$, $[n]$ denotes the set $\{1, 2, \ldots, n\}$.) Let $x^*$ denote the lexicographically optimal arm in $\mathcal{X}$ and $x_t$ the arm chosen at the $t$-th epoch. Hüyük and Tekin (2021) defined a priority-based regret to evaluate the performance of their algorithm, given by

$\sum_{t=1}^{T}\big(\mu^i(x^*) - \mu^i(x_t)\big)\,\mathbb{I}\big(\mathcal{A}^i(x_t)\big), \quad i \in [m].$  (1)

Here, $\mathbb{I}(\cdot)$ is the indicator function, and $\mathcal{A}^i(x_t)$ denotes the event that the previous $i-1$ expected payoffs of the chosen arm are optimal, i.e., $\mathcal{A}^i(x_t) = \{\mu^j(x^*) - \mu^j(x_t) = 0, \ \forall j \in [i-1]\}$. Hüyük and Tekin (2021) proposed an algorithm with a bound of $\tilde{O}((KT)^{2/3})$ under this priority-based regret.

There are three limitations of the existing algorithm (Hüyük and Tekin 2021). First, its performance declines as $K$ increases and becomes ineffective when the arm set is infinite. Second, the regret bound is suboptimal with respect to $T$ even when the number of objectives reduces to one, as the lower bound for single-objective MAB is $\Omega(K \log T)$ (Lai and Robbins 1985). Third, regret (1) is not practicable due to the impact of $\mathbb{I}(\cdot)$. Specifically, for the $i$-th objective, if there exists an objective $j \in [i-1]$ for which the expected payoff of the chosen arm $x_t$ is not optimal, i.e., $\mu^j(x^*) \neq \mu^j(x_t)$, then the gap $\mu^i(x^*) - \mu^i(x_t)$ does not accumulate into regret (1).

To remove these limitations, we investigate the multiobjective Lipschitz bandit (MOLB) model under lexicographic ordering, where the arm set $\mathcal{X}$ can be infinite. We adopt the general regret to evaluate our algorithms, expressed as

$R^i(T) = T\mu^i(x^*) - \sum_{t=1}^{T} \mu^i(x_t), \quad i \in [m].$  (2)
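To make these definitions concrete, here is a small Python sketch (ours, not from the paper; the three-arm payoff table is hypothetical) that checks lexicographic dominance and accumulates both the priority-based regret (1) and the general regret (2) for a given sequence of pulls.

```python
import numpy as np

def lex_dominates(mu_a, mu_b):
    """True if payoff vector mu_a lexicographically dominates mu_b."""
    for a, b in zip(mu_a, mu_b):
        if a > b:
            return True
        if a < b:
            return False
    return False  # equal on every objective: no strict dominance

def regrets(mu, pulls):
    """Accumulate priority-based regret (1) and general regret (2).

    mu    : dict arm -> list of m expected payoffs
    pulls : sequence of arms chosen by the learner
    """
    arms = list(mu)
    # lexicographically optimal arm: not dominated by any other arm
    x_star = next(a for a in arms
                  if not any(lex_dominates(mu[b], mu[a]) for b in arms if b != a))
    m = len(mu[x_star])
    prio, gen = np.zeros(m), np.zeros(m)
    for x in pulls:
        for i in range(m):
            gap = mu[x_star][i] - mu[x][i]
            gen[i] += gap                      # general regret (2)
            # indicator A^i(x): the first i-1 objectives of x are already optimal
            if all(mu[x_star][j] == mu[x][j] for j in range(i)):
                prio[i] += gap                 # priority-based regret (1)
    return prio, gen

# hypothetical 2-objective example: arm 'a' is lexicographically optimal
mu = {'a': [0.9, 0.7], 'b': [0.9, 0.5], 'c': [0.8, 0.6]}
prio, gen = regrets(mu, pulls=['c', 'b', 'a'])
```

In this toy run, pulling arm 'c' contributes its second-objective gap only to the general regret (2), not to the priority-based regret (1), because its first-objective payoff is not optimal; this is exactly the impracticality of regret (1) discussed above.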
To the best of our knowledge, this work is the first to explore the MOLB model under lexicographic ordering, and the main contributions can be summarized as follows:

- We develop an algorithm that achieves the regret bound $\tilde{O}(T^{(d_z^i+1)/(d_z^i+2)})$ for the $i$-th objective, where $d_z^i$ is the zooming dimension to be introduced in the next section. This result matches the lower bound of the single-objective Lipschitz bandit problem (Kleinberg, Slivkins, and Upfal 2008), indicating that our algorithm is almost optimal.
- We propose a novel multi-stage decision-making strategy that delicately balances exploration and exploitation, which is crucial for improving the existing suboptimal result (Hüyük and Tekin 2021).
- We extend the metric of lexicographically ordered multiobjective bandits from the priority-based regret (1) to the general regret (2), which more accurately evaluates the performance of the learner.

Preliminaries

In this section, we first present the learning setting of MOLB, and then introduce two necessary concepts: the covering dimension and the zooming dimension.

Learning Setting

MOLB is a $T$-round decision-making process indexed by $t \in [T]$. At each epoch $t$, the learner selects an arm $x_t$ from the metric space $\mathcal{X}$ and receives a stochastic payoff vector $[y_t^1, y_t^2, \ldots, y_t^m] \in \mathbb{R}^m$, where $y_t^i$ is the payoff of the $i$-th objective and $m$ is the number of objectives. The payoffs are conditionally 1-sub-Gaussian, such that

$\mathbb{E}\big[e^{\alpha(y_t^i - \mu^i(x_t))} \,\big|\, \mathcal{F}_{t-1}\big] \le e^{\alpha^2/2}, \quad \forall \alpha \in \mathbb{R}$  (3)

where $\mu^i(x_t)$ denotes the $i$-th expected payoff of arm $x_t$, $i \in [m]$, and $\mathcal{F}_{t-1} = \sigma\big(\{x_1, x_2, \ldots, x_t\} \cup \{y_1^1, y_2^1, \ldots, y_{t-1}^1\} \cup \cdots \cup \{y_1^m, y_2^m, \ldots, y_{t-1}^m\}\big)$ is a $\sigma$-filtration (Auer 2002; Bubeck, Stoltz, and Yu 2011; Abbasi-Yadkori, Pál, and Szepesvári 2011; Shao et al. 2018). Another common assumption on the Lipschitz bandit model is that the expected payoff functions satisfy the Lipschitz property (Turgay, Oner, and Tekin 2018; Wanigasekara and Yu 2019; Podimata and Slivkins 2021), such that

$|\mu^i(x) - \mu^i(x')| \le D(x, x'), \quad \forall x, x' \in \mathcal{X}, \ i \in [m]$  (4)

where $D(\cdot, \cdot)$ is the distance function on the metric space $\mathcal{X}$. Without loss of generality, we assume the diameter of $\mathcal{X}$ is smaller than 1, i.e., $D(x, x') \le 1$ for all $x, x' \in \mathcal{X}$. Furthermore, we introduce a parameter $\lambda$ to depict the difficulty of identifying the lexicographically optimal arm. Precisely, we assume there exists some $\lambda \ge 0$ such that

$\mu^i(x) - \mu^i(x^*) \le \lambda \max_{j \in [i-1]}\{\mu^j(x^*) - \mu^j(x)\}$  (5)

for any $i \in \{2, 3, \ldots, m\}$ and $x \in \mathcal{X}$. An exceptionally large $\lambda$ implies the existence of an arm $x \in \mathcal{X}$ with a significantly higher payoff than the optimal arm $x^*$ on the $i$-th objective while maintaining similar payoffs on the preceding $i-1$ objectives, which makes the identification of the lexicographically optimal arm challenging.

Covering Dimension and Zooming Dimension

Let $B(\tilde{x}, r)$ denote the ball with center $\tilde{x} \in \mathcal{X}$ and radius $r \ge 0$, i.e., $B(\tilde{x}, r) = \{x \in \mathcal{X} \mid D(\tilde{x}, x) \le r\}$. The $r$-covering number of $\mathcal{X}$ is the minimal number of balls with radius $r$ needed to cover $\mathcal{X}$, i.e.,

$N_c(r) = \min\{n \in \mathbb{N} \mid \mathcal{X} \subseteq \cup_{k \in [n]} B(\tilde{x}_k, r)\}.$  (6)

Based on the covering number, the covering dimension of $\mathcal{X}$ is defined as

$d_c = \min\{d \ge 0 \mid \exists C > 0, \ N_c(r) \le C r^{-d}, \ \forall r > 0\}.$  (7)

We present two specific cases to help with the understanding of the covering dimension. One is the unit ball in $d$-dimensional Euclidean space, whose covering dimension is $d$ with $C = 1$. The other is any set containing finitely many elements, i.e., $|\mathcal{X}| = K$, whose covering dimension is 0 with $C = K$.
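As a quick sanity check on definition (6) — our own illustration, not part of the paper — the interval $\mathcal{X} = [0, 1]$ with the Euclidean metric can be covered by roughly $1/(2r)$ balls of radius $r$ placed on a uniform grid, so $N_c(r)$ grows like $r^{-1}$ and the covering dimension is 1.

```python
import math

def uniform_cover(r):
    """Centers of balls of radius r that cover X = [0, 1] (Euclidean metric).

    A grid with spacing 2r suffices, so ceil(1/(2r)) balls are used,
    which matches N_c(r) <= C * r^{-1}, i.e., covering dimension 1.
    """
    n = math.ceil(1.0 / (2.0 * r))
    return [min(1.0, (2 * k + 1) * r) for k in range(n)]

centers = uniform_cover(0.025)   # 20 balls of radius 0.025 cover [0, 1]
assert all(min(abs(x / 1000 - c) for c in centers) <= 0.025 + 1e-12
           for x in range(1001))  # every point of a fine grid is covered
```

The same grid with $r = 0.025$ reappears later as the static arm set used for SDLO and PF-LEX in the experiments.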
The covering dimension does not account for the structure of the expected payoff functions, and thus fails to accurately reflect the complexity of a Lipschitz bandit problem. To illustrate this issue, we provide a simple example. Suppose the arm space is $\mathcal{X} \subseteq \mathbb{R}^d$ with the Euclidean metric, and the expected payoff functions are $\mu^i(x) = x_1$, $i \in [m]$, for all $x \in \mathcal{X}$, where $x_1$ denotes the first element of the vector $x$. In this case, the complexity of identifying the optimal arm remains the same regardless of the covering dimension of $\mathcal{X}$. To deal with this issue, another concept termed the zooming dimension was proposed (Kleinberg, Slivkins, and Upfal 2008).

In this paper, we extend this concept to the multiobjective setting. First, we define the $r$-optimal region for the $i$-th objective as

$\mathcal{X}^i(r) = \{x \in \mathcal{X} \mid \lambda_i r/2 < \mu^i(x^*) - \mu^i(x) \le \lambda_i r\}$  (8)

where $\lambda_i = 1 + \lambda + \cdots + \lambda^{i-1}$ is a fixed constant and $r \ge 0$. Then, similar to the $r$-covering number, the $r$-zooming number $N_z^i(r)$ is defined as the minimal number of balls with radius $r/96$ needed to cover $\mathcal{X}^i(r)$, i.e.,

$N_z^i(r) = \min\{n \in \mathbb{N} \mid \mathcal{X}^i(r) \subseteq \cup_{k \in [n]} B(\tilde{x}_k, r/96)\}.$  (9)

Now, we are ready to define the zooming dimension for the $i$-th objective, which is

$d_z^i = \min\{d \ge 0 \mid \exists Z_i > 0, \ N_z^i(r) \le Z_i r^{-d}, \ \forall r > 0\}.$  (10)

Compared with the zooming dimension of single-objective Lipschitz bandits (Kleinberg, Slivkins, and Upfal 2008), the only difference is the adoption of the constant $\lambda_i$ in (8), which is due to technical reasons (see the Theoretical Analysis section) and does not constitute an essential difference, as $r$ can approach zero arbitrarily closely.
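As a sanity check of definition (10) against the finite-arm case used later in the paper (the Conclusion notes that $d_c = 0$ and $d_z = 0$ for $K$-armed bandits), the following short derivation is ours:

```latex
% Finite arm set, |X| = K.  Every subset of X, in particular the r-optimal
% region X^i(r) from (8), contains at most K arms, and each arm is covered by
% the ball of radius r/96 centered at itself.  Hence, by (9) and (10),
N_z^i(r) \;\le\; |\mathcal{X}^i(r)| \;\le\; K \quad \forall r > 0
\qquad\Longrightarrow\qquad
d_z^i = 0 \ \text{ with } \ Z_i = K .
```

With $d_z^i = 0$, the $\tilde{O}(T^{(d_z^i+1)/(d_z^i+2)})$ rate promised in the abstract specializes to a $\sqrt{T}$-type dependence, consistent with the $\tilde{O}(\sqrt{KT})$ bound recovered from Theorem 1 at $r = 0$ below.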
Related Work

In this section, we give a brief review of research on Lipschitz bandits and multiobjective bandits.

Lipschitz Bandits

Plenty of work on Lipschitz bandits has been conducted in recent years, and most of it employs two basic techniques: static discretization (Agrawal 1995; Kleinberg 2004; Auer, Ortner, and Szepesvári 2007) and adaptive discretization (Kleinberg, Slivkins, and Upfal 2008; Bubeck et al. 2008, 2011; Lu et al. 2019b; Wang et al. 2020; Feng, Huang, and Wang 2022). Static discretization divides the arm space into a uniform mesh and directly applies MAB algorithms, such as UCB (Auer 2002), to the mesh regions. The seminal work of Agrawal (1995) investigated a specific case called continuum-armed bandits, wherein the arm set is a compact interval (i.e., $\mathcal{X} = [0, 1]$). Building upon this research, Kleinberg (2004) proposed a near-optimal algorithm with a bound of $O(T^{2/3})$ and established a matching lower bound. Subsequently, Auer, Ortner, and Szepesvári (2007) improved this result to a regret bound of $O(\sqrt{T})$ under mild assumptions. Adaptive discretization dynamically discretizes the arm space according to the observed payoffs and allocates more trials to promising regions. This technique was first proposed by Kleinberg, Slivkins, and Upfal (2008), who extended the arm set to a general metric space and introduced the zooming algorithm, achieving a regret bound of $\tilde{O}(T^{(d_z+1)/(d_z+2)})$, where $d_z$ is the zooming dimension of the expected payoff function. Furthermore, Kleinberg, Slivkins, and Upfal (2008) provided a matching lower bound of $\Omega(T^{(d_z+1)/(d_z+2)})$. A subsequent work of Bubeck et al. (2011) relaxed the Lipschitz assumption to local Lipschitzness and proposed a tree-based algorithm that attains a regret bound of $\tilde{O}(T^{(d_z+1)/(d_z+2)})$. Wang et al. (2020) connected tree-based methods with Gaussian processes and developed a new analytical framework.

Multiobjective Bandits

Drugan and Nowe (2013) initially formalized the MOMAB model and introduced two algorithms enjoying $O(K \log T)$ bounds under the metrics of scalarized regret and Pareto regret, respectively. Scalarized regret is the weighted sum of the per-objective regrets, while Pareto regret measures the cumulative Pareto distance between the obtained payoff vectors and the Pareto-optimal payoff vector. Turgay, Oner, and Tekin (2018) studied the multiobjective contextual bandit model and proposed a zooming-based algorithm that achieves a Pareto regret bound of $\tilde{O}(T^{(d_p+1)/(d_p+2)})$, where $d_p$ is the Pareto zooming dimension. Subsequently, Lu et al. (2019a) investigated a parameterized bandit model called multiobjective generalized linear bandits. To our knowledge, the study by Hüyük and Tekin (2021) is the only one that focuses on bandits with lexicographic ordering. They proposed the algorithm PF-LEX, which enjoys a regret bound of $\tilde{O}((KT)^{2/3})$. However, this bound is suboptimal, as existing single-objective MAB algorithms attain a regret bound of $O(K \log T)$ (Lai and Robbins 1985).

To illustrate the intuition for improving PF-LEX, we briefly describe its decision-making strategy. The fundamental framework for bandit problems is the upper confidence bound (UCB), which first constructs confidence intervals for all arms and then selects the arm with the highest upper confidence bound (Lattimore and Szepesvári 2020). When adapting UCB to MOMAB, the main modification is considering all objectives in the arm selection. Let $c_t(a)$ denote the confidence term of arm $a \in [K]$ at round $t$. PF-LEX considers two cases for arm selection. If some arm $a_t \in [K]$ satisfies $c_t(a_t) > \epsilon$ for a given criterion $\epsilon > 0$, PF-LEX chooses this arm $a_t$. Otherwise, if $c_t(a) < \epsilon$ for all arms $a \in [K]$, PF-LEX filters promising arms based on the confidence intervals sequentially, from the first to the $m$-th objective, and ultimately selects an arm from the $m$-th filtered set. PF-LEX consumes numerous trials in the first case, which is a pure-exploration case and leads to suboptimal regret. Therefore, we avoid the pure-exploration case by dividing the decision-making process into multiple stages.

Algorithms

In this section, we first introduce a simple algorithm based on static discretization, which is easy to understand but needs an oracle. Then, we use adaptive discretization to remove the oracle, yielding an almost optimal algorithm.

Warm-up: SDLO

As a warm-up, we propose a simple algorithm called Static Discretization under Lexicographic Ordering (SDLO), summarized in Algorithm 1, which first discretizes the arm set $\mathcal{X}$ statically and then utilizes a multi-stage decision-making strategy to select arms.

Algorithm 1: Static Discretization under Lexicographic Ordering (SDLO)
Input: confidence parameter $\delta \in (0, 1)$, query radius $r \ge 0$
1: Query the oracle with $r$ to obtain the static arm set $\mathcal{A} = \{\tilde{x}_1, \ldots, \tilde{x}_{N_c(r)}\}$ satisfying $\mathcal{X} \subseteq \cup_{k \in [N_c(r)]} B(\tilde{x}_k, r)$
2: Initialize $\hat{\mu}^i(x) = 0$, $i \in [m]$, for all $x \in \mathcal{A}$
3: Initialize $r(x) = +\infty$ and $n(x) = 0$ for all $x \in \mathcal{A}$
4: for $t = 1, 2, \ldots, T$ do
5:   Invoke Algorithm 2 to select the arm $x_t = \text{MSDM-SD}\big(\{\hat{\mu}^i(x), i \in [m]\}_{x \in \mathcal{A}}, \{r(x)\}_{x \in \mathcal{A}}, r\big)$
6:   Play arm $x_t$ and receive the payoff $[y_t^1, y_t^2, \ldots, y_t^m]$
7:   Update $\hat{\mu}^i(x_t)$, $i \in [m]$, and $n(x_t)$ according to (14)
8:   Compute $r(x_t)$ according to (15)
9: end for

According to the Lipschitz property of the expected payoff functions, knowing the expected payoff of $\tilde{x} \in \mathcal{X}$ enables us to estimate the expected payoff of any arm $x \in B(\tilde{x}, r)$, i.e., $|\mu^i(x) - \mu^i(\tilde{x})| \le r$, $i \in [m]$. Consequently, a natural strategy for addressing the Lipschitz bandit problem is to discretize the arm space $\mathcal{X}$ into a collection of small balls and identify the best one. Given the radius $r$, using fewer balls to cover the arm space simplifies the task of identifying the optimal ball. Thus, covering $\mathcal{X}$ with $N_c(r)$ balls is the best choice, as $N_c(r)$ is the minimum number of balls with radius $r$ needed to cover $\mathcal{X}$. However, constructing this minimal coverage is challenging due to the potentially intricate structure of $\mathcal{X}$. Hence, we assume there exists an oracle that takes the radius $r$ as input and outputs the minimal arm set $\mathcal{A} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{N_c(r)}\}$ satisfying

$\mathcal{X} \subseteq \cup_{k \in [N_c(r)]} B(\tilde{x}_k, r).$  (11)

Note that for any $x \in \mathcal{X}$, there always exists an arm $\tilde{x}_k \in \mathcal{A}$ satisfying $D(x, \tilde{x}_k) \le r$, which reduces our MOLB problem to a MOMAB problem with $N_c(r)$ arms.

Similar to existing MAB algorithms (Auer 2002; Yu et al. 2018), SDLO initializes the mean payoffs $\hat{\mu}^i(x)$, $i \in [m]$, and the counter $n(x)$ to zero for all $x \in \mathcal{A}$, where $n(x)$ counts the number of times arm $x$ has been played. Meanwhile, the confidence term $r(x)$ is initialized to infinity. These terms are updated with new trial data as learning proceeds; the details are given in equations (14) and (15).

Equipped with the mean payoffs and confidence terms, SDLO is ready to make a decision. At each epoch $t$, SDLO utilizes a novel decision-making method to select an arm $x_t$ from $\mathcal{A}$, whose details are outlined in Algorithm 2, referred to as Multi-stage Decision-Making under Static Discretization (MSDM-SD).

Algorithm 2: Multi-stage Decision-Making under Static Discretization (MSDM-SD)
Input: estimated payoffs $\{\hat{\mu}^i(x), i \in [m]\}_{x \in \mathcal{A}}$, confidence interval widths $\{r(x)\}_{x \in \mathcal{A}}$, query radius $r \ge 0$
1: Initialize $s = 1$ and $\mathcal{A}_1 = \mathcal{A}$
2: repeat
3:   if $r(x_t) > 2^{-s}$ for some $x_t \in \mathcal{A}_s$ then
4:     Choose this arm $x_t$
5:   else
6:     Initialize the arm set $\mathcal{A}_s^0 = \mathcal{A}_s$
7:     for $i = 1, 2, \ldots, m$ do
8:       $\hat{x}_t^i = \operatorname{argmax}_{x \in \mathcal{A}_s^{i-1}} \hat{\mu}^i(x) + r(x)$
9:       $\mathcal{A}_s^i = \{x \in \mathcal{A}_s^{i-1} \mid \hat{\mu}^i(x) + r(x) \ge \hat{\mu}^i(\hat{x}_t^i) + r(\hat{x}_t^i) - (1 + 2\lambda + \cdots + 2\lambda^{i-1})(r + 2 \cdot 2^{-s})\}$
10:    end for
11:    Update $\mathcal{A}_{s+1} = \mathcal{A}_s^m$ and $s = s + 1$
12:  end if
13: until an arm $x_t$ is chosen
14: Return the chosen arm $x_t$

Starting with the initialized arm set $\mathcal{A}_1 = \mathcal{A}$ and stage index $s = 1$, MSDM-SD enters a loop that continues until an arm is chosen. In each stage $s$, MSDM-SD first checks whether there exists an arm $x_t \in \mathcal{A}_s$ whose confidence term $r(x_t)$ is greater than $2^{-s}$. If such an arm exists, MSDM-SD chooses this arm $x_t$. If no arm in $\mathcal{A}_s$ meets this criterion, MSDM-SD proceeds to an inner loop of $m$ iterations, which sequentially filters promising arms from the first objective to the $m$-th objective. The initial arm set for the inner loop is $\mathcal{A}_s^0 = \mathcal{A}_s$. For the $i$-th objective, MSDM-SD first selects the arm $\hat{x}_t^i$ with the highest sum of mean payoff and confidence term from the previously filtered arm set $\mathcal{A}_s^{i-1}$, i.e.,

$\hat{x}_t^i = \operatorname{argmax}_{x \in \mathcal{A}_s^{i-1}} \hat{\mu}^i(x) + r(x).$  (12)

Then MSDM-SD updates the arm set $\mathcal{A}_s^{i-1}$ to $\mathcal{A}_s^i$ by keeping the promising arms, such that

$\mathcal{A}_s^i = \big\{x \in \mathcal{A}_s^{i-1} \mid \hat{\mu}^i(x) + r(x) \ge \hat{\mu}^i(\hat{x}_t^i) + r(\hat{x}_t^i) - (1 + 2\lambda + \cdots + 2\lambda^{i-1})(r + 2 \cdot 2^{-s})\big\}.$  (13)

After filtering on the last objective, MSDM-SD sets the arm set $\mathcal{A}_{s+1} = \mathcal{A}_s^m$ and proceeds to the next stage $s = s + 1$. According to equation (15), $r(x) > 1/\sqrt{T}$ for all $x \in \mathcal{A}$, so MSDM-SD returns an arm $x_t$ before $s = \log_2(T)$. Once SDLO plays the arm $x_t$ returned by MSDM-SD and receives the payoff vector $[y_t^1, y_t^2, \ldots, y_t^m]$, it updates the mean payoff $\hat{\mu}^i(x_t)$, $i \in [m]$, and the counter $n(x_t)$ as follows:

$\hat{\mu}^i(x_t) = \frac{n(x_t)\hat{\mu}^i(x_t) + y_t^i}{n(x_t) + 1}, \qquad n(x_t) = n(x_t) + 1.$  (14)

Meanwhile, SDLO updates the confidence term of the chosen arm $x_t$ as

$r(x_t) = \sqrt{2\alpha(x_t)/n(x_t)}.$  (15)

Here, $\alpha(x_t) = 1 + 2\ln\big(m N_c(r)\sqrt{1 + n(x_t)}/\delta\big)$ and $\delta$ is an input confidence parameter.
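The multi-stage selection rule of Algorithm 2 is the core of SDLO, so here is a minimal Python sketch of it (our paraphrase; the data structures, variable names, and the stage cap are ours). It mirrors Steps 3-11: return any arm whose confidence width still exceeds $2^{-s}$, otherwise filter the candidates objective by objective with rule (13) and move to stage $s + 1$.

```python
import math

def msdm_sd(mu_hat, r_conf, r, lam, horizon):
    """Sketch of MSDM-SD (Algorithm 2); names and structure are ours.

    mu_hat : dict arm -> list of m estimated payoffs  (hat{mu}^i(x))
    r_conf : dict arm -> confidence width r(x) from (15)
    r      : query radius of the static discretization
    lam    : lexicographic parameter lambda from assumption (5)
    """
    candidates = list(mu_hat)
    m = len(next(iter(mu_hat.values())))
    max_stage = int(math.ceil(math.log2(horizon))) + 1
    for s in range(1, max_stage + 1):
        # Steps 3-4: play any arm whose confidence width still exceeds 2^{-s}.
        for x in candidates:
            if r_conf[x] > 2.0 ** (-s):
                return x
        # Steps 6-10: filter the promising arms objective by objective, rule (13).
        kept = candidates
        for i in range(m):                       # i = 0 is the first objective
            ucb = {x: mu_hat[x][i] + r_conf[x] for x in kept}
            best = max(ucb.values())
            c = 1.0 + 2.0 * sum(lam ** j for j in range(1, i + 1))
            kept = [x for x in kept
                    if ucb[x] >= best - c * (r + 2.0 * 2.0 ** (-s))]
        candidates = kept                        # Step 11: pass A_s^m to stage s+1
    # With r(x) set by (15), the loop returns before s = log2(T); this fallback
    # only matters if the sketch is used with other confidence widths.
    return candidates[0]
```

SDLO (Algorithm 1) would call this routine once per round and then refresh $\hat{\mu}^i(x_t)$, $n(x_t)$, and $r(x_t)$ via (14) and (15).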
The following theorem provides a theoretical guarantee for the SDLO algorithm.

Theorem 1 Suppose that (3), (4) and (5) hold. If SDLO is run with $\delta \in (0, 1)$ and $r \ge 0$, then with probability at least $1 - \delta$, the regret of SDLO is bounded as

$R^i(T) \le 2\lambda_i\Big(rT + 8\sqrt{\alpha_T N_c(r)\,T}\Big), \quad i \in [m]$

where $\lambda_i = 1 + \lambda + \cdots + \lambda^{i-1}$ and $\alpha_T = 1 + 2\ln\big(m N_c(r)\sqrt{1+T}/\delta\big)$.

Remark: Theorem 1 states that SDLO achieves a regret bound of $\tilde{O}\big(rT + \sqrt{N_c(r)T}\big)$. If the arm set is finite, i.e., $|\mathcal{X}| = K$, the query radius $r$ can be set to 0 and $N_c(r) = K$. Thus, Theorem 1 provides a regret bound of $\tilde{O}(\sqrt{KT})$ for MOMAB, which not only improves the existing result of Hüyük and Tekin (2021) by a factor of order $\tilde{O}((KT)^{1/6})$, but also extends the priority-based regret (1) to the general regret (2). Recalling the definition of the covering dimension $d_c$ in (7), $N_c(r) \le C r^{-d_c}$ for some constant $C > 0$. Plugging this inequality into Theorem 1 and minimizing the regret with respect to $r$ yields the following bound.

Corollary 1 Suppose that (3), (4) and (5) hold. If SDLO is run with $\delta \in (0, 1)$ and $r = T^{-\frac{1}{2+d_c}}$, then with probability at least $1 - \delta$, the regret of SDLO is bounded as

$R^i(T) \le \tilde{O}\big((1 + \lambda + \cdots + \lambda^{i-1})\, T^{\frac{1+d_c}{2+d_c}}\big), \quad i \in [m].$

Improved Algorithm: ADLO

Although SDLO is easy to understand, it has two limitations. First, it requires a complicated oracle to discretize the arm space $\mathcal{X}$. Second, SDLO fails to match the lower bound of the single-objective Lipschitz bandit problem (Kleinberg, Slivkins, and Upfal 2008), which means it can be further improved. We therefore adopt the adaptive discretization method proposed by Kleinberg, Slivkins, and Upfal (2008) to improve it.

Our second algorithm, based on adaptive discretization, is called Adaptive Discretization under Lexicographic Ordering (ADLO); the detailed procedure is given in Algorithm 3. ADLO maintains an adaptive arm set $\tilde{\mathcal{A}}$ that induces a collection of balls covering the arm space $\mathcal{X}$; the radius of each ball is the confidence term $r(x)$, which is dynamically adjusted as learning proceeds. To begin with, ADLO initializes the adaptive arm set $\tilde{\mathcal{A}}$ as the empty set $\emptyset$. In each round $t$, if the arm space $\mathcal{X}$ is not covered by the balls constructed from $\tilde{\mathcal{A}}$, i.e., $\mathcal{X} \not\subseteq \cup_{x \in \tilde{\mathcal{A}}} B(x, r(x))$, ADLO selects an arm $x$ randomly from the uncovered region, adds it to the arm set $\tilde{\mathcal{A}}$, and plays this arm. The mean payoffs $\hat{\mu}^i(x)$, $i \in [m]$, and the counter $n(x)$ of the new arm $x$ are initialized to zero. If the arm space is covered, ADLO employs a multi-stage decision-making method called Multi-stage Decision-Making under Adaptive Discretization (MSDM-AD) to select the most promising arm from $\tilde{\mathcal{A}}$.

Algorithm 3: Adaptive Discretization under Lexicographic Ordering (ADLO)
Input: confidence parameter $\delta \in (0, 1)$
1: Initialize $\tilde{\mathcal{A}} = \emptyset$
2: for $t = 1, 2, \ldots, T$ do
3:   if $\mathcal{X} \not\subseteq \cup_{x \in \tilde{\mathcal{A}}} B(x, r(x))$ then
4:     Pick an arm $x$ randomly from the uncovered region $\mathcal{X} \setminus \cup_{x \in \tilde{\mathcal{A}}} B(x, r(x))$
5:     $\tilde{\mathcal{A}} = \tilde{\mathcal{A}} \cup \{x\}$
6:     Initialize $\hat{\mu}^i(x) = 0$, $i \in [m]$, and $n(x) = 0$
7:     Play $x_t = x$ and receive the payoff $[y_t^1, \ldots, y_t^m]$
8:   else
9:     Invoke Algorithm 4 to select the arm $x_t = \text{MSDM-AD}\big(\{\hat{\mu}^i(x), i \in [m]\}_{x \in \tilde{\mathcal{A}}}, \{r(x)\}_{x \in \tilde{\mathcal{A}}}\big)$
10:    Play $x_t$ and receive the payoff $[y_t^1, \ldots, y_t^m]$
11:  end if
12:  Update $\hat{\mu}^i(x_t)$, $i \in [m]$, and $n(x_t)$ according to (14)
13:  Compute $r(x_t)$ according to (18)
14: end for

Algorithm 4: Multi-stage Decision-Making under Adaptive Discretization (MSDM-AD)
Input: estimated payoffs $\{\hat{\mu}^i(x), i \in [m]\}_{x \in \tilde{\mathcal{A}}}$, confidence interval widths $\{r(x)\}_{x \in \tilde{\mathcal{A}}}$
1: Initialize $s = 1$ and $\tilde{\mathcal{A}}_1 = \tilde{\mathcal{A}}$
2: repeat
3:   if $r(x_t) > 2^{-s}$ for some $x_t \in \tilde{\mathcal{A}}_s$ then
4:     Choose this arm $x_t$
5:   else
6:     Initialize the arm set $\tilde{\mathcal{A}}_s^0 = \tilde{\mathcal{A}}_s$
7:     for $i = 1, 2, \ldots, m$ do
8:       $\hat{x}_t^i = \operatorname{argmax}_{x \in \tilde{\mathcal{A}}_s^{i-1}} \hat{\mu}^i(x) + 2r(x)$
9:       $\tilde{\mathcal{A}}_s^i = \{x \in \tilde{\mathcal{A}}_s^{i-1} \mid \hat{\mu}^i(x) + 2r(x) \ge \hat{\mu}^i(\hat{x}_t^i) + 2r(\hat{x}_t^i) - (3 + 6\lambda + \cdots + 6\lambda^{i-1}) \cdot 2^{-s}\}$
10:    end for
11:    Update $\tilde{\mathcal{A}}_{s+1} = \tilde{\mathcal{A}}_s^m$ and $s = s + 1$
12:  end if
13: until an arm $x_t$ is chosen
14: Return the chosen arm $x_t$

As shown in Algorithm 4, MSDM-AD follows a similar framework to MSDM-SD: an outer loop restricts the confidence term $r(x)$ of the arms to be chosen, and an inner loop filters promising arms from the first objective to the $m$-th objective. Unlike MSDM-SD, MSDM-AD does not take the query radius $r$ as input, and the candidate arm set $\tilde{\mathcal{A}}$ changes as ADLO proceeds, which results in a different filtering mechanism within the inner loop. Precisely, for the $i$-th objective, MSDM-AD first selects an arm $\hat{x}_t^i$ that maximizes the mean payoff plus twice the confidence term, i.e.,

$\hat{x}_t^i = \operatorname{argmax}_{x \in \tilde{\mathcal{A}}_s^{i-1}} \hat{\mu}^i(x) + 2r(x)$  (16)

where $\tilde{\mathcal{A}}_s^{i-1}$ is the set filtered on the previous $i-1$ objectives and $\tilde{\mathcal{A}}_s^0 = \tilde{\mathcal{A}}_s$. Then, MSDM-AD eliminates arms from $\tilde{\mathcal{A}}_s^{i-1}$ that are less promising on the $i$-th objective as follows:

$\tilde{\mathcal{A}}_s^i = \big\{x \in \tilde{\mathcal{A}}_s^{i-1} \mid \hat{\mu}^i(x) + 2r(x) \ge \hat{\mu}^i(\hat{x}_t^i) + 2r(\hat{x}_t^i) - (3 + 6\lambda + \cdots + 6\lambda^{i-1}) \cdot 2^{-s}\big\}.$  (17)

After the inner loop ends, MSDM-AD obtains a set $\tilde{\mathcal{A}}_s^m$ containing arms that are promising for all $m$ objectives. MSDM-AD then passes $\tilde{\mathcal{A}}_s^m$ to the next stage $s+1$ as $\tilde{\mathcal{A}}_{s+1} = \tilde{\mathcal{A}}_s^m$ for a more refined filtration. Similar to MSDM-SD, MSDM-AD chooses an arm $x_t$ before the stage $s$ increases to $\log_2(T)$. Upon playing the arm $x_t$ and receiving the corresponding payoff vector, ADLO updates the mean payoffs $\hat{\mu}^i(x)$, $i \in [m]$, according to equation (14). Note that the update of the confidence term in SDLO relies on the arm number $N_c(r)$, whereas ADLO picks at most $T$ arms; thus ADLO updates the confidence term as

$r(x_t) = \sqrt{2\alpha(x_t)/n(x_t)}$  (18)

where $\alpha(x_t) = 1 + 2\ln\big(m T\sqrt{1 + n(x_t)}/\delta\big)$.

There are two main differences between SDLO and ADLO. The first is the construction of the candidate arm set: SDLO constructs the arm set statically at the beginning of the algorithm by querying an oracle, while ADLO grows the arm set to cover previously uncovered regions. The second is the filtering mechanism in the decision-making stage: MSDM-SD filters arms by rule (13), which relies on the parameter $r$, whereas MSDM-AD employs rule (17) for the adaptive arm set, which removes the dependence on $r$. The following theorem provides a theoretical guarantee for ADLO.

Theorem 2 Suppose that (3), (4) and (5) hold. If ADLO is run with $\delta \in (0, 1)$, then with probability at least $1 - \delta$, the regret of ADLO is bounded as

$R^i(T) \le 96\lambda_i \big(\alpha_T Z_i\big)^{\frac{1}{d_z^i+2}}\, T^{\frac{d_z^i+1}{d_z^i+2}}, \quad i \in [m]$

where $\lambda_i = 1 + \lambda + \cdots + \lambda^{i-1}$ and $\alpha_T = 1 + 2\ln\big(m T\sqrt{1+T}/\delta\big)$.

Remark: Theorem 2 states that ADLO attains a regret bound of $\tilde{O}\big(\lambda_i T^{(1+d_z^i)/(2+d_z^i)}\big)$, which matches the lower bound of the Lipschitz bandit problem with respect to $T$ (Kleinberg, Slivkins, and Upfal 2008). When applied to the single-objective problem, ADLO removes the dependence on $\lambda$ and achieves the regret bound $\tilde{O}\big(T^{(1+d_z^1)/(2+d_z^1)}\big)$, the same as the optimal single-objective Lipschitz bandit algorithm (Kleinberg, Slivkins, and Upfal 2008).
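To illustrate how the pieces of ADLO fit together, here is a compact Python sketch of Algorithm 3 on the interval $\mathcal{X} = [0, 1]$ (ours; the grid-based coverage test, the `play` environment, and the `msdm_ad` callable standing in for Algorithm 4 are illustrative assumptions, not the paper's implementation).

```python
import math
import random

def adlo(play, msdm_ad, horizon, m, delta=0.05, grid=1000):
    """Sketch of ADLO (Algorithm 3) on X = [0, 1]; helper names are ours."""
    mesh = [k / grid for k in range(grid + 1)]      # finite proxy for the arm space X
    active, mu_hat, n, r_conf = [], {}, {}, {}
    for t in range(1, horizon + 1):
        uncovered = [z for z in mesh
                     if all(abs(z - x) > r_conf[x] for x in active)]
        if uncovered:                               # Steps 3-7: grow the cover
            x = random.choice(uncovered)
            active.append(x)
            mu_hat[x], n[x], r_conf[x] = [0.0] * m, 0, float("inf")
        else:                                       # Steps 8-10: multi-stage selection
            x = msdm_ad(mu_hat, r_conf, horizon)
        y = play(x)                                 # payoff vector [y^1_t, ..., y^m_t]
        for i in range(m):                          # update (14)
            mu_hat[x][i] = (n[x] * mu_hat[x][i] + y[i]) / (n[x] + 1)
        n[x] += 1
        alpha = 1.0 + 2.0 * math.log(m * horizon * math.sqrt(1 + n[x]) / delta)
        r_conf[x] = math.sqrt(2.0 * alpha / n[x])   # update (18)
    return active
```

Compared with the SDLO sketch above, the cover is grown on demand instead of being requested from an oracle, and the confidence update (18) replaces $N_c(r)$ with $T$ because at most $T$ arms are ever activated.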
Theoretical Analysis

In this section, we provide a proof sketch for Theorem 2. The omitted details can be found in the supplementary material due to the page limit. For clarity, we use $\hat{\mu}_t^i(x)$, $n_t(x)$, and $r_t(x)$ to denote the values of $\hat{\mu}^i(x)$, $n(x)$, and $r(x)$ at the end of the $t$-th epoch, respectively. Furthermore, $\tilde{\mathcal{A}}_t$ denotes the adaptive arm set $\tilde{\mathcal{A}}$ at the end of the $t$-th epoch, and $\tilde{\mathcal{A}}_{t,s}$ denotes the arm set $\tilde{\mathcal{A}}_s$ in MSDM-AD at epoch $t$.

Proof of Theorem 2

First, we present Lemma 1 to show that the mean payoff $\hat{\mu}_t^i(x)$ and confidence term $r_t(x)$ construct a reliable confidence interval for the expected payoff $\mu^i(x)$.

Lemma 1 With probability at least $1 - \delta$, for any $x \in \tilde{\mathcal{A}}_t$, $|\hat{\mu}_t^i(x) - \mu^i(x)| \le r_t(x)$, $i \in [m]$, $t \in [T]$.

Next, we demonstrate an essential property of the multi-stage decision-making strategy by the following lemma.

Lemma 2 With probability at least $1 - \delta$, for any $x \in \tilde{\mathcal{A}}_{t,s}$,

$\mu^i(x^*) - \mu^i(x) \le 6\lambda_i 2^{-s+1}, \quad i \in [m], \ t \in [T]$

where $x^*$ is the optimal arm and $\lambda_i = 1 + \lambda + \cdots + \lambda^{i-1}$.

Remark: Lemma 2 establishes a bound on the instantaneous regret of any arm $x \in \tilde{\mathcal{A}}_{t,s}$, which decreases exponentially as the stage advances. This property is crucial for bounding the cumulative regret, as we further illustrate in the proof of Lemma 4.

Let $\Delta^i(x) = \mu^i(x^*) - \mu^i(x)$, $i \in [m]$. To proceed with the analysis, we partition the adaptive arm set $\tilde{\mathcal{A}}_+^i = \{x \in \tilde{\mathcal{A}}_T \mid \Delta^i(x) > 0\}$ into a collection of disjoint subsets. Specifically, we define

$\tilde{\mathcal{A}}_j^i = \{x \in \tilde{\mathcal{A}}_+^i \mid \lambda_i 2^{-j-1} < \Delta^i(x) \le \lambda_i 2^{-j}\},$  (19)

so that $\tilde{\mathcal{A}}_+^i = \cup_{j \in \mathbb{N}} \tilde{\mathcal{A}}_j^i$. Recalling the definitions of the $r$-optimal region in (8) and the zooming dimension $d_z^i$ in (10), we can bound the number of arms in $\tilde{\mathcal{A}}_j^i$ by the following lemma.

Lemma 3 With probability at least $1 - \delta$, for any $j \in \mathbb{N}$, $|\tilde{\mathcal{A}}_j^i| \le Z_i 2^{j d_z^i}$, $i \in [m]$.

Then, we give Lemma 4 to analyze the cumulative regret of any arm in $\tilde{\mathcal{A}}_j^i$. A detailed proof of Lemma 4 is provided, since it illustrates the benefit of dividing the decision-making process into multiple stages.

Lemma 4 With probability at least $1 - \delta$, for all $j \in \mathbb{N}$, the regret of any $x \in \tilde{\mathcal{A}}_j^i$ can be bounded as

$n_T(x)\,\Delta^i(x) \le 1152\lambda_i \alpha_T 2^j, \quad i \in [m]$

where $\alpha_T = 1 + 2\ln\big(m T\sqrt{1+T}/\delta\big)$.

Proof. For any $x \in \tilde{\mathcal{A}}_j^i$, if $n_T(x) = 1$, Lemma 4 holds trivially. Now, we assume $n_T(x) \ge 2$. Recalling Step 3 of MSDM-AD, if the last time $x$ is chosen occurs at the $s_T(x)$-th stage among the total $T$ rounds, we get

$n_T(x) - 1 \le 2^{s_T(x)}\sqrt{2\alpha_T (n_T(x)-1)}$

since $x$ has been played $n_T(x) - 1$ times before this round. Then, due to the fact that $1 \le (2 - \sqrt{2})\, 2^{s_T(x)}\sqrt{\alpha_T n_T(x)}$, we have

$n_T(x)\,\Delta^i(x) \le 2^{s_T(x)+1}\sqrt{\alpha_T n_T(x)}\,\Delta^i(x).$  (20)

Applying Lemma 2 to the right-hand side of (20) yields

$n_T(x)\,\Delta^i(x) \le 24\lambda_i\sqrt{\alpha_T n_T(x)}.$  (21)

This step reduces the linear term $n_T(x)\Delta^i(x)$ to a sublinear term $\tilde{O}\big(\sqrt{n_T(x)}\big)$, which is the crucial benefit of dividing the decision-making process into multiple stages. Squaring both sides of (21) gives $n_T(x)\,\Delta^i(x) \le 576\lambda_i^2 \alpha_T / \Delta^i(x)$. The definition of $\tilde{\mathcal{A}}_j^i$ implies that $1/\Delta^i(x) < 2^{j+1}/\lambda_i$ for any $x \in \tilde{\mathcal{A}}_j^i$. Plugging this into the right-hand side of the above inequality finishes the proof of Lemma 4.

Now, we are ready to prove Theorem 2. First, we relax $R^i(T)$ by some $r_0 > 0$ as follows:

$R^i(T) \le \lambda_i r_0 T + \sum_{x \in \tilde{\mathcal{A}}_+^i} n_T(x)\,\Delta^i(x)\,\mathbb{I}\big(\Delta^i(x) > \lambda_i r_0\big).$

Next, since $\tilde{\mathcal{A}}_+^i = \cup_{j \in \mathbb{N}} \tilde{\mathcal{A}}_j^i$ and by the definition of $\tilde{\mathcal{A}}_j^i$ in (19), we rewrite the above as

$R^i(T) \le \lambda_i r_0 T + \sum_{j \in \mathbb{N}} \sum_{x \in \tilde{\mathcal{A}}_j^i} n_T(x)\,\Delta^i(x)\,\mathbb{I}\big(2^{-j} \ge r_0\big).$
Then, Lemma 3 and Lemma 4 imply that

$R^i(T) \le \lambda_i r_0 T + 1152\lambda_i \alpha_T Z_i \sum_{j \in \mathbb{N}} 2^{j(d_z^i+1)}\,\mathbb{I}\big(2^{-j} \ge r_0\big).$

Utilizing the sum formula of a geometric sequence, we get

$R^i(T) \le \lambda_i r_0 T + 1152\lambda_i \alpha_T Z_i \sum_{j=0}^{\lfloor \log_2(1/r_0) \rfloor} 2^{j(d_z^i+1)} \le \lambda_i r_0 T + 1152\lambda_i \alpha_T Z_i (2/r_0)^{d_z^i+1}.$

Finally, we minimize the right-hand side of the above inequality by taking $r_0 = \big(1152\,\alpha_T Z_i\, 2^{d_z^i+1}/T\big)^{\frac{1}{d_z^i+2}}$. This concludes the proof.

Experiments

In this section, we conduct numerical experiments to verify the effectiveness of our algorithms. We adopt PF-LEX (Hüyük and Tekin 2021) and the zooming algorithm (Kleinberg, Slivkins, and Upfal 2008) as baselines. PF-LEX is a static method designed for MOMAB under lexicographic ordering, and the zooming algorithm is an adaptive method that is optimal for single-objective Lipschitz bandits. Following an existing experimental setup (Magureanu, Combes, and Proutiere 2014), we set the arm space $\mathcal{X} = [0, 1]$ with the Euclidean metric. The number of objectives is set to $m = 3$, and the expected payoff functions are $\mu^1(x) = 1 - \min_{p \in \{0.1, 0.4, 0.8\}} |x - p|$, $\mu^2(x) = 1 - 2\min_{p \in \{0.3, 0.7\}} |x - p|$, and $\mu^3(x) = 1 - 2|x - 0.3|$. Note that the optimal arms for the first objective are $\{0.1, 0.4, 0.8\}$, and the optimal arms for both the first and second objectives are $\{0.4, 0.8\}$. Thus, all three objectives must be considered to determine the lexicographically optimal arm $0.4$. We set the payoff $y_t^i = \mu^i(x_t) + \eta_t$, where $\eta_t$ is drawn from a Gaussian distribution with mean 0 and variance 1. The time horizon $T$ is $6 \times 10^4$, so the nearly optimal query radius $r$ for SDLO is $0.025$, as suggested by Corollary 1. The static arm set for SDLO and PF-LEX is constructed as $\mathcal{A} = \{0.025 + 0.05(k-1) \mid k \in [20]\}$. The confidence term (15) is scaled by a factor searched in $[10^{-2}, 1]$, which is a common practice in bandit learning (Chapelle and Li 2011; Li et al. 2012; Zhang et al. 2016; Jun et al. 2017). We report the cumulative regret for the first and third objectives. To reduce randomness, each algorithm is repeated 10 times and the average regret is reported.

[Figure 1: Comparison of our algorithms versus PF-LEX and the zooming algorithm. (a) Static methods: cumulative regret of PF-LEX and SDLO on the first and third objectives over $t \le 6 \times 10^4$. (b) Adaptive methods: cumulative regret of the zooming algorithm and ADLO on the first and third objectives.]

Figure 1(a) presents the performance of the static methods, PF-LEX and SDLO. SDLO and PF-LEX exhibit comparable performance on the first objective, while SDLO significantly outperforms PF-LEX on the third objective. The primary reason for this difference is that the theoretical guarantee of PF-LEX is established under the priority-based regret (1) and does not carry over to the general regret (2). Figure 1(b) shows the performance of the two adaptive methods, the zooming algorithm and ADLO. ADLO demonstrates regret similar to the zooming algorithm on the first objective but surpasses it on the third objective. This result confirms the effectiveness of ADLO in addressing the MOLB problem.
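For concreteness, the simulation environment described above can be set up as follows (our sketch; the random seed and the helper names are ours, and the regret-scaling-factor search is omitted).

```python
import numpy as np

def expected_payoffs(x):
    """Expected payoff vector [mu^1(x), mu^2(x), mu^3(x)] used in the experiments."""
    mu1 = 1.0 - min(abs(x - p) for p in (0.1, 0.4, 0.8))
    mu2 = 1.0 - 2.0 * min(abs(x - p) for p in (0.3, 0.7))
    mu3 = 1.0 - 2.0 * abs(x - 0.3)
    return np.array([mu1, mu2, mu3])

rng = np.random.default_rng(0)  # seed choice is ours

def pull(x):
    """Stochastic payoff: expected payoffs plus Gaussian noise with variance 1."""
    return expected_payoffs(x) + rng.normal(0.0, 1.0, size=3)

# 0.1, 0.4, 0.8 maximize mu^1; among these, 0.4 and 0.8 maximize mu^2;
# only 0.4 also maximizes mu^3, so the lexicographically optimal arm is x* = 0.4.
static_arms = [0.025 + 0.05 * (k - 1) for k in range(1, 21)]  # grid for SDLO / PF-LEX
```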
Conclusion and Future Work

We investigated the MOLB model under lexicographic ordering and proposed two algorithms: SDLO and ADLO. The SDLO algorithm is straightforward but requires an oracle, yielding a regret bound of $\tilde{O}\big((1 + \lambda + \cdots + \lambda^{i-1}) T^{(1+d_c)/(2+d_c)}\big)$ for the $i$-th objective, where $i \in [m]$. In contrast, the ADLO algorithm removes the dependence on the oracle and achieves an almost optimal bound of $\tilde{O}\big((1 + \lambda + \cdots + \lambda^{i-1}) T^{(1+d_z^i)/(2+d_z^i)}\big)$ for the $i$-th objective, which matches the lower bound of the Lipschitz bandit problem with respect to $T$ (Kleinberg, Slivkins, and Upfal 2008). Both SDLO and ADLO improve the regret bound of the recent work of Hüyük and Tekin (2021) by a factor of order $O((KT)^{1/6})$, since $d_c = 0$ and $d_z = 0$ for the $K$-armed bandit problem. Moreover, we extended the metric of lexicographically ordered multiobjective bandits from the priority-based regret to the general regret, which more accurately evaluates the performance of algorithms. However, both SDLO and ADLO require prior knowledge of $\lambda$. Thus, a challenging open problem is to eliminate the dependence on $\lambda$ and achieve a regret bound of $\tilde{O}\big(T^{(1+d_z^i)/(2+d_z^i)}\big)$ for all objectives.

Acknowledgments

The work described in this paper was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China [GRF Project No. CityU 11215622] and by the Natural Science Foundation of China [Project No. 62276223].

References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems 24, 2312–2320.
Agrawal, R. 1995. The Continuum-Armed Bandit Problem. SIAM Journal on Control and Optimization, 33(6): 1926–1951.
Auer, P. 2002. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3(11): 397–422.
Auer, P.; Ortner, R.; and Szepesvári, C. 2007. Improved Rates for the Stochastic Continuum-Armed Bandit Problem. In Proceedings of the 20th Annual Conference on Learning Theory, 454–468.
Bubeck, S.; Dekel, O.; Koren, T.; and Peres, Y. 2015. Bandit Convex Optimization: √T Regret in One Dimension. In Proceedings of The 28th Conference on Learning Theory, 266–278.
Bubeck, S.; Munos, R.; Stoltz, G.; and Szepesvári, C. 2011. X-Armed Bandits. Journal of Machine Learning Research, 12(46): 1655–1695.
Bubeck, S.; Stoltz, G.; Szepesvári, C.; and Munos, R. 2008. Online Optimization in X-Armed Bandits. In Advances in Neural Information Processing Systems 21, 201–208.
Bubeck, S.; Stoltz, G.; and Yu, J. Y. 2011. Lipschitz Bandits Without the Lipschitz Constant. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, 144–158.
Chapelle, O.; and Li, L. 2011. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems 24, 2249–2257.
Drugan, M. M.; and Nowe, A. 2013. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks, 1–8.
Ehrgott, M. 2005. Multicriteria Optimization. Berlin, Heidelberg: Springer-Verlag.
Feng, Y.; Huang, Z.; and Wang, T. 2022. Lipschitz Bandits with Batched Feedback. In Advances in Neural Information Processing Systems 35, 19836–19848.
Gou, Y.; Yi, J.; and Zhang, L. 2023. Stochastic Graphical Bandits with Heavy-Tailed Rewards. In Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence, 734–744.
Hosseini, H.; Sikdar, S.; Vaish, R.; and Xia, L. 2021. Fair and Efficient Allocations under Lexicographic Preferences. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, 5472–5480.
Hüyük, A.; and Tekin, C. 2021. Multi-objective multi-armed bandit with lexicographically ordered and satisficing objectives. Machine Learning, 110(6): 1233–1266.
Jee, K.-W.; McShan, D. L.; and Fraass, B. A. 2007. Lexicographic ordering: intuitive multicriteria optimization for IMRT. Physics in Medicine & Biology, 52: 1845–1861.
Jun, K.-S.; Bhargava, A.; Nowak, R.; and Willett, R. 2017. Scalable Generalized Linear Bandits: Online Computation and Hashing. In Advances in Neural Information Processing Systems 30, 99–109.
Kleinberg, R. 2004. Nearly Tight Bounds for the Continuum-armed Bandit Problem. In Advances in Neural Information Processing Systems 17, 697–704.
Kleinberg, R.; Slivkins, A.; and Upfal, E. 2008. Multi-armed Bandits in Metric Spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, 681–690.
Lai, T. L.; and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4–22.
Lattimore, T.; and Szepesvári, C. 2020. Bandit Algorithms. Cambridge University Press.
Li, L.; Chu, W.; Langford, J.; Moon, T.; and Wang, X. 2012. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, 19–36.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661–670.
Lu, S.; Wang, G.; Hu, Y.; and Zhang, L. 2019a. Multi-Objective Generalized Linear Bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 3080–3086.
Lu, S.; Wang, G.; Hu, Y.; and Zhang, L. 2019b. Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards. In Proceedings of the 36th International Conference on Machine Learning, 4154–4163.
Luo, H.; Wei, C.-Y.; Agarwal, A.; and Langford, J. 2018. Efficient Contextual Bandits in Non-stationary Worlds. In Proceedings of the 31st Conference On Learning Theory, 1739–1776.
Ma, X.; Zhao, L.; Huang, G.; Wang, Z.; Hu, Z.; Zhu, X.; and Gai, K. 2018. Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 1137–1140.
Magureanu, S.; Combes, R.; and Proutiere, A. 2014. Lipschitz Bandits: Regret Lower Bound and Optimal Algorithms. In Proceedings of The 27th Conference on Learning Theory, 975–999.
Podimata, C.; and Slivkins, A. 2021. Adaptive Discretization for Adversarial Lipschitz Bandits. In Proceedings of the 34th Conference On Learning Theory, 3788–3805.
Q. Yahyaa, S.; M. Drugan, M.; and Manderick, B. 2014. Knowledge Gradient for Multi-Objective Multi-Armed Bandit Algorithms. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence 1, 74–83.
Qin, Y.; Li, Y.; Pasqualetti, F.; Fazel, M.; and Oymak, S. 2023. Stochastic Contextual Bandits with Long Horizon Rewards. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, 9525–9533.
Robbins, H. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5): 527–535.
Shao, H.; Yu, X.; King, I.; and Lyu, M. R. 2018. Almost Optimal Algorithms for Linear Stochastic Bandits with Heavy-tailed Payoffs. In Advances in Neural Information Processing Systems 31, 8430–8439.
Skalse, J.; Hammond, L.; Griffin, C.; and Abate, A. 2022. Lexicographic Multi-Objective Reinforcement Learning. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, 3430–3436.
Turgay, E.; Oner, D.; and Tekin, C. 2018. Multi-objective contextual bandit problem with similarity information. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 1673–1681.
Van Moffaert, K.; Van Vaerenbergh, K.; Vrancx, P.; and Nowe, A. 2014. Multi-objective X-Armed bandits. In 2014 International Joint Conference on Neural Networks, 2331–2338.
Villar, S. S.; Bowden, J.; and Wason, J. 2015. Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges. Statistical Science, 30(2): 199–215.
Wang, T.; Ye, W.; Geng, D.; and Rudin, C. 2020. Towards Practical Lipschitz Bandits. In Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference, 129–138.
Wanigasekara, N.; and Yu, C. L. 2019. Nonparametric Contextual Bandits in Metric Spaces with Unknown Metric. In Advances in Neural Information Processing Systems 32, 14657–14667.
Weber, E.; Rizzoli, A. E.; Soncini-Sessa, R.; and Castelletti, A. 2002. Lexicographic Optimisation for Water Resources Planning: the Case of Lake Verbano, Italy. 235–240.
White, J. M. 2012. Bandit Algorithms for Website Optimization. O'Reilly Media, Inc.
Wray, K.; Zilberstein, S.; and Mouaddib, A.-I. 2015. Multi-Objective MDPs with Conditional Lexicographic Reward Preferences. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 3418–3424.
Wray, K. H.; and Zilberstein, S. 2015. Multi-Objective POMDPs with Lexicographic Reward Preferences. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 1719–1725.
Xu, M.; and Klabjan, D. 2023. Pareto Regret Analyses in Multi-objective Multi-armed Bandit. In Proceedings of the 40th International Conference on Machine Learning, 38499–38517.
Xue, B.; Wang, G.; Wang, Y.; and Zhang, L. 2020. Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, 2936–2942.
Yu, X.; Shao, H.; Lyu, M. R.; and King, I. 2018. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 937–946.
Zhang, L.; Yang, T.; Jin, R.; Xiao, Y.; and Zhou, Z. 2016. Online Stochastic Linear Optimization under One-bit Feedback. In Proceedings of the 33rd International Conference on Machine Learning, 392–401.
Zhou, Z.; Xu, R.; and Blanchet, J. 2019. Learning in Generalized Linear Contextual Bandits with Stochastic Delays. In Advances in Neural Information Processing Systems 32, 5197–5208.
Zhu, Y.; and Mineiro, P. 2022. Contextual Bandits with Smooth Regret: Efficient Learning in Continuous Action Spaces. In Proceedings of the 39th International Conference on Machine Learning, 27574–27590.