# Multi-Objective Generalized Linear Bandits

Shiyin Lu¹, Guanghui Wang¹, Yao Hu² and Lijun Zhang¹
¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
²Youku Cognitive and Intelligent Lab, Alibaba Group, Beijing 100102, China
{lusy, wanggh, zhanglj}@lamda.nju.edu.cn, yaoohu@alibaba-inc.com

## Abstract

In this paper, we study the multi-objective bandits (MOB) problem, where a learner repeatedly selects one arm to play and then receives a reward vector consisting of multiple objectives. MOB has found many real-world applications as varied as online recommendation and network routing. On the other hand, these applications typically contain contextual information that can guide the learning process which, however, is ignored by most existing work. To utilize this information, we associate each arm with a context vector and assume the reward follows the generalized linear model (GLM). We adopt the notion of Pareto regret to evaluate the learner's performance and develop a novel algorithm for minimizing it. The essential idea is to apply a variant of the online Newton step to estimate model parameters, based on which we utilize the upper confidence bound (UCB) policy to construct an approximation of the Pareto front, and then uniformly at random choose one arm from the approximate Pareto front. Theoretical analysis shows that the proposed algorithm achieves an $O(d\sqrt{T})$ Pareto regret, where $T$ is the time horizon and $d$ is the dimension of contexts, which matches the optimal result for the single objective contextual bandits problem. Numerical experiments demonstrate the effectiveness of our method.

## 1 Introduction

Online learning under bandit feedback is a powerful paradigm for modeling the sequential decision-making processes arising in various applications such as medical trials, advertisement placement, and network routing [Bubeck and Cesa-Bianchi, 2012]. In the classic stochastic multi-armed bandits (MAB) problem, at each round a learner first selects one arm to play and then obtains a reward drawn from a fixed but unknown probability distribution associated with the selected arm. The learner's goal is to minimize the regret, which is defined as the difference between the cumulative reward of the learner and that of the best arm in hindsight. Algorithms designed for this problem need to strike a balance between exploration and exploitation, i.e., identifying the best arm by trying different arms while spending as many rounds as possible on the seemingly optimal arm.

A natural extension of MAB is the multi-objective multi-armed bandits (MOMAB) problem, proposed by Drugan and Nowé [2013], where the reward pertaining to an arm is a multi-dimensional vector instead of a scalar value. In this setting, different arms are compared according to the Pareto order between their reward vectors, and those arms whose rewards are not inferior to that of any other arm are called Pareto optimal arms, all of which constitute the Pareto front. The standard metric is the Pareto regret, which measures the cumulative gap between the reward of the learner and that of the Pareto front. The task here is to design online algorithms that minimize the Pareto regret by judiciously selecting seemingly Pareto optimal arms based on historical observations, while ensuring fairness, that is, treating each Pareto optimal arm as equally as possible.
MOMAB is motivated by real-world applications involving multiple optimization objectives, e.g., novelty and diversity in recommendation systems [Rodriguez et al., 2012]. On the other hand, the aforementioned real-world applications typically contain auxiliary information (contexts) that can guide the decision-making process, such as user profiles in recommendation systems [Li et al., 2010], which is ignored by MOMAB. To incorporate this information into the decision-making process, Turgay et al. [2018] extended MOMAB to the multi-objective contextual bandits (MOCB) problem. In MOCB, the learner is endowed with contexts before choosing arms and the reward he receives in each round obeys a distribution whose expectation depends on the contexts and the chosen arm. Turgay et al. [2018] assumed that the learner has prior knowledge of similarity information that relates distances between the context-arm pairs to those between the expected rewards. Under this assumption, they proposed an algorithm called Pareto contextual zooming, which is built upon the contextual zooming method [Slivkins, 2014]. However, the Pareto regret of their algorithm is $O(T^{1-1/(2+d_p)})$, where $d_p$ is the Pareto zooming dimension, which is almost linear in $T$ when $d_p$ is large (say, $d_p = 10$) and hence hinders the application of their algorithm to broad domains.

To address this limitation, we formulate the multi-objective contextual bandits problem under a different assumption, the parameterized realizability assumption, which has been extensively studied in single objective contextual bandits [Auer, 2002; Dani et al., 2008]. Concretely, we model the context associated with an arm as a $d$-dimensional vector $x \in \mathbb{R}^d$ and, for the sake of clarity, denote the arm by $x$. The reward vector $y$ pertaining to an arm $x$ consists of $m$ objectives. The value of each objective is drawn according to the generalized linear model [Nelder and Wedderburn, 1972] such that
$$\mathbb{E}[y^i \mid x] = \mu_i(\theta_i^\top x), \quad i = 1, \dots, m$$
where $y^i$ represents the $i$-th component of $y$, $\theta_1, \dots, \theta_m$ are vectors of unknown coefficients, and $\mu_1, \dots, \mu_m$ are link functions. We refer to this formulation as multi-objective generalized linear bandits (MOGLB), which is very general and covers a wide range of problems, such as stochastic linear bandits [Auer, 2002; Dani et al., 2008] and online stochastic linear optimization under binary feedback [Zhang et al., 2016], where the link functions are the identity function and the logistic function respectively.

To the best of our knowledge, this is the first work that investigates generalized linear bandits (GLB) in multi-objective scenarios. Note that a naive application of existing GLB algorithms to a specific objective does not work, because it could favor those Pareto optimal arms that achieve maximal reward in this objective, which harms fairness. To resolve this problem, we develop a novel algorithm named MOGLB-UCB. Specifically, we employ a variant of the online Newton step to estimate the unknown coefficients and utilize the upper confidence bound policy to construct an approximate Pareto front, from which the arm is then pulled uniformly at random. Theoretical analysis shows that the proposed algorithm enjoys a Pareto regret bound of $O(d\sqrt{T})$, where $T$ is the time horizon and $d$ is the dimension of contexts.
This bound is sublinear in $T$ regardless of the dimension and matches the optimal regret bound for the single objective contextual bandits problem. Empirical results demonstrate that our algorithm can not only minimize the Pareto regret but also ensure fairness.

## 2 Related Work

In this section, we review the related work on stochastic contextual bandits, parameterized contextual bandits, and multi-objective bandits.

### 2.1 Stochastic Contextual Bandits

In the literature, there exist many formulations of stochastic contextual bandits under different assumptions on the problem structure, i.e., the mechanism of context arrivals and the relation between contexts and rewards. One category of assumption says that the context and reward follow a fixed but unknown joint distribution. This problem was first considered by Langford and Zhang [2008], who proposed an efficient algorithm named Epoch-Greedy. However, the regret of their algorithm is $O(T^{2/3})$, which is suboptimal when compared to inefficient algorithms such as Exp4 [Auer et al., 2002b]. Later on, efficient and optimal algorithms that attain an $O(T^{1/2})$ regret were developed [Dudik et al., 2011; Agarwal et al., 2014]. Another line of work [Kleinberg et al., 2008; Bubeck et al., 2009; Lu et al., 2010; Slivkins, 2014; Lu et al., 2019b] assumes that the relation between the rewards and the contexts can be modeled by a Lipschitz function. Lu et al. [2010] established an $\Omega(T^{1-1/(2+d_p)})$ lower bound on the regret under this setting and proposed the Query-Ad-Clustering algorithm, which attains an $O(T^{1-1/(2+d_c)})$ regret, where $d_p$ and $d_c$ are the packing dimension and the covering dimension of the similarity space respectively. Slivkins [2014] developed the contextual zooming algorithm, which enjoys a regret bound of $O(T^{1-1/(2+d_z)})$, where $d_z$ is the zooming dimension of the similarity space.

### 2.2 Parameterized Contextual Bandits

In this paper, we focus on parameterized contextual bandits, where each arm is associated with a $d$-dimensional context vector and the expected reward is modeled as a parameterized function of the arm's context. Auer [2002] considered the linear case of this problem under the name of stochastic linear bandits (SLB) and developed a complicated algorithm called SupLinRel, which attains an $O((\log K)^{3/2}\sqrt{dT})$ regret, assuming the arm set is finite. Later, Dani et al. [2008] proposed a much simpler algorithm named ConfidenceBall2, which enjoys a regret bound of $O(d\sqrt{T})$ and can be used for an infinite arm set. Filippi et al. [2010] extended SLB to generalized linear bandits, where the expected reward is a composite function of the arm's context: the inside function is linear and the outside function is a certain link function. The authors proposed a UCB-type algorithm that enjoys a regret bound of $O(d\sqrt{T})$. However, their algorithm is not efficient since it needs to store the whole learning history and perform batch computation to estimate the function. Zhang et al. [2016] studied a particular case of GLB in which the reward is generated by the logit model and developed an efficient algorithm named OL2M, which attains an $O(d\sqrt{T})$ regret. Later, Jun et al. [2017] extended OL2M to generic GLB problems.

### 2.3 Multi-Objective Bandits

The seminal work of Drugan and Nowé [2013] proposed the standard formulation of the multi-objective multi-armed bandits problem and introduced the notion of Pareto regret as the performance measure. By making use of the UCB technique, they developed an algorithm that enjoys a Pareto regret bound of $O(\log T)$. Turgay et al.
[2018] extended MOMAB to the contextual setting under the similarity information assumption. Based on the contextual zooming method [Slivkins, 2014], the authors proposed an algorithm called Pareto contextual zooming whose Pareto regret is $O(T^{1-1/(2+d_p)})$, where $d_p$ is the Pareto zooming dimension. Another related line of MOMAB research [Drugan and Nowé, 2014; Auer et al., 2016] studies the best arm identification problem. The focus of those papers is to identify all Pareto optimal arms within a fixed budget.

## 3 Multi-Objective Generalized Linear Bandits

We first introduce the notations used in this paper, next describe the learning model, then present our algorithm, and finally state its theoretical guarantees.

### 3.1 Notation

Throughout the paper, we use the subscript to distinguish different objects (e.g., scalars, vectors, functions) and the superscript to identify the component of an object. For example, $y_t^i$ represents the $i$-th component of the vector $y_t$. For the sake of clarity, we denote the $\ell_2$-norm by $\|\cdot\|$. The induced matrix norm associated with a positive definite matrix $A$ is defined as $\|x\|_A := \sqrt{x^\top A x}$. We use $B_R := \{x \mid \|x\| \le R\}$ to denote the centered ball whose radius is $R$. Given a positive semidefinite matrix $P$, the generalized projection of a point $x$ onto a convex set $W$ is defined as $\Pi^P_W[x] := \arg\min_{y \in W} (y - x)^\top P (y - x)$. Finally, $[n] := \{1, 2, \dots, n\}$.

### 3.2 Learning Model

We now give a formal description of the learning model investigated in this paper.

#### Problem Formulation

We consider the multi-objective bandits problem under the GLM realizability assumption. Let $m$ denote the number of objectives and $X \subseteq \mathbb{R}^d$ be the arm set. In each round $t$, a learner selects an arm $x_t \in X$ to play and then receives a stochastic reward vector $y_t \in \mathbb{R}^m$ consisting of $m$ objectives. We assume each objective $y_t^i$ is generated according to the GLM such that for $i = 1, 2, \dots, m$,
$$\Pr(y_t^i \mid x_t) = h_i(y_t^i, \tau_i)\exp\left(\frac{y_t^i\,\theta_i^\top x_t - g_i(\theta_i^\top x_t)}{\tau_i}\right)$$
where $\tau_i$ is the dispersion parameter, $h_i$ is a normalization function, $g_i$ is a convex function, and $\theta_i$ is a vector of unknown coefficients. Let $\mu_i = g_i'$ denote the so-called link function, which is monotonically increasing due to the convexity of $g_i$. It is easy to show that $\mathbb{E}[y_t^i \mid x_t] = \mu_i(\theta_i^\top x_t)$.

A remarkable member of the GLM family is the logit model, in which the reward is one-bit, i.e., $y \in \{0, 1\}$ [Zhang et al., 2016], and satisfies
$$\Pr(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^\top x)}.$$
Another well-known binary model belonging to the GLM is the probit model, which takes the following form
$$\Pr(y = 1 \mid x) = \Phi(\theta^\top x)$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.

Following previous studies [Filippi et al., 2010; Jun et al., 2017], we make the following standard assumptions.

**Assumption 1** The coefficients $\theta_1, \dots, \theta_m$ are bounded by $D$, i.e., $\|\theta_i\| \le D, \forall i \in [m]$.

**Assumption 2** The radius of the arm set $X$ is bounded by 1, i.e., $\|x\| \le 1, \forall x \in X$.

**Assumption 3** For each $i \in [m]$, the link function $\mu_i$ is $L$-Lipschitz on $[-D, D]$ and continuously differentiable on $(-D, D)$. Furthermore, we assume that $\mu_i'(z) \ge \kappa > 0, \forall z \in (-D, D)$ and $|\mu_i(z)| \le U, \forall z \in [-D, D]$.

**Assumption 4** There exists a positive constant $R$ such that $|y_t^i| \le R, \forall t \in [T], i \in [m]$ holds almost surely.

#### Performance Metric

According to the properties of the GLM, for any arm $x \in X$ that is played, its expected reward is the vector $[\mu_1(\theta_1^\top x), \mu_2(\theta_2^\top x), \dots, \mu_m(\theta_m^\top x)]^\top \in \mathbb{R}^m$. With a slight abuse of notation, we denote it by $\mu_x$.
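To make the reward model concrete, here is a minimal Python sketch, our own illustration rather than code accompanying the paper, that computes the expected reward vector $\mu_x$ and draws a one-bit reward for two objectives, one with a logit link and one with a probit link; all function and variable names are ours.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Link functions mu_i = g_i' for the two GLM members mentioned above.
logit = lambda z: 1.0 / (1.0 + np.exp(-z))           # logistic sigmoid
probit = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF

def expected_reward(x, thetas, links):
    """mu_x = [mu_1(theta_1^T x), ..., mu_m(theta_m^T x)]."""
    return np.array([mu(theta @ x) for theta, mu in zip(thetas, links)])

def draw_reward(x, thetas, links):
    """One-bit rewards y^i ~ Bernoulli(mu_i(theta_i^T x)), as in the logit/probit models."""
    return rng.binomial(1, expected_reward(x, thetas, links)).astype(float)

d, m = 5, 2
thetas = rng.normal(size=(m, d))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # Assumption 1 with D = 1
x = rng.normal(size=d)
x /= max(1.0, np.linalg.norm(x))                         # Assumption 2: ||x|| <= 1
print(expected_reward(x, thetas, [logit, probit]), draw_reward(x, thetas, [logit, probit]))
```

With $D = 1$ as in this sketch, Assumptions 1 and 2 ensure $|\theta_i^\top x| \le 1$, so the link functions are only ever evaluated on the bounded interval required by Assumption 3.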
We compare different arms by their expected rewards and adopt the notion of Pareto order.

**Definition 1 (Pareto order)** Let $u, v \in \mathbb{R}^m$ be two vectors. $u$ dominates $v$, denoted by $v \prec u$, if and only if $\forall i \in [m], v^i \le u^i$ and $\exists j \in [m], u^j > v^j$. $v$ is not dominated by $u$, denoted by $v \nprec u$, if and only if $v = u$ or $\exists i \in [m], v^i > u^i$. $u$ and $v$ are incomparable, denoted by $u \parallel v$, if and only if neither vector is dominated by the other, i.e., $u \nprec v$ and $v \nprec u$.

Equipped with the Pareto order, we can now define the Pareto optimal arm.

**Definition 2 (Pareto optimality)** Let $x \in X$ be an arm. $x$ is Pareto optimal if and only if its expected reward is not dominated by that of any arm in $X$, i.e., $\forall x' \in X, \mu_x \nprec \mu_{x'}$. The set comprised of all Pareto optimal arms is called the Pareto front, denoted by $O^*$.

It is clear that all arms in the Pareto front are incomparable. In the single objective bandits problem, the standard metric to measure the learner's performance is the regret, defined as the difference between the cumulative reward of the learner and that of the optimal arm in hindsight. In order to extend this metric to the multi-objective setting, we introduce the notion of Pareto suboptimality gap [Drugan and Nowé, 2013] to measure the difference between the learner's reward and that of the Pareto optimal arms.

**Definition 3 (Pareto suboptimality gap, PSG)** Let $x$ be an arm in $X$. Its Pareto suboptimality gap $\Delta_x$ is defined as the minimal scalar $\epsilon \ge 0$ such that $x$ becomes Pareto optimal after adding $\epsilon$ to all entries of its expected reward. Formally,
$$\Delta_x := \inf\left\{\epsilon \ge 0 \mid (\mu_x + \epsilon\mathbf{1}) \nprec \mu_{x'},\ \forall x' \in X\right\}.$$

We evaluate the learner's performance using the (pseudo) Pareto regret [Drugan and Nowé, 2013], defined as the cumulative Pareto suboptimality gap of the arms pulled by the learner.

**Definition 4 (Pareto regret, PR)** Let $x_1, x_2, \dots, x_T$ be the arms pulled by the learner. The Pareto regret is defined as
$$PR_T := \sum_{t=1}^{T} \Delta_{x_t}.$$

### 3.3 Algorithm

The proposed algorithm, termed MOGLB-UCB, is outlined in Algorithm 1. Had we known all coefficients $\theta_1, \dots, \theta_m$ in advance, we could compute the Pareto front directly and always pull the Pareto optimal arms, whose Pareto suboptimality gaps are zero. Motivated by this observation, we maintain an arm set $O_t$ as an approximation to the Pareto front $O^*$ and always play arms in $O_t$. To encourage fairness, we draw an arm $x_t$ from $O_t$ uniformly at random to play (Step 3).

The approximate Pareto front is initialized to be $X$ and is updated as follows. In each round $t$, after observing the reward vector $y_t$, we compute an estimate, denoted by $\hat\theta_{t+1,i}$, of each coefficient vector $\theta_i$ (Steps 4-7). Let $H_t := \{(x_1, y_1), (x_2, y_2), \dots, (x_t, y_t)\}$ be the learning history up to round $t$. A natural approach is to use the maximum log-likelihood estimation:
$$\hat\theta_{t+1,i} = \arg\max_{\|\theta\| \le D} \sum_{s=1}^{t} \log \Pr(y_s^i \mid x_s) = \arg\min_{\|\theta\| \le D} \sum_{s=1}^{t} \left(-y_s^i\,\theta^\top x_s + g_i(\theta^\top x_s)\right).$$
Despite its simplicity, this approach is inefficient since it needs to store the whole learning history and perform batch computation in each round, which makes its space and time complexity grow at least linearly with $t$. To address this drawback, we utilize an online learning method to estimate the unknown coefficients and construct confidence sets. Specifically, for each objective $i \in [m]$, let $\ell_{t,i}$ denote the surrogate loss function in round $t$, defined as
$$\ell_{t,i}(\theta) := -y_t^i\,\theta^\top x_t + g_i(\theta^\top x_t).$$
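As a concrete instance, under the logit model we have $g_i(z) = \log(1 + e^z)$ and $\mu_i(z) = 1/(1+e^{-z})$, and the surrogate loss and its gradient reduce to the short Python sketch below (our own illustration, not the authors' code).

```python
import numpy as np

def logit_g(z):
    """Log-partition g(z) = log(1 + exp(z)) for the logit model (numerically stable)."""
    return np.logaddexp(0.0, z)

def logit_mu(z):
    """Link function mu = g' for the logit model, i.e. the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def surrogate_loss(theta, x_t, y_ti):
    """ell_{t,i}(theta) = -y_t^i * theta^T x_t + g_i(theta^T x_t)."""
    z = theta @ x_t
    return -y_ti * z + logit_g(z)

def surrogate_grad(theta, x_t, y_ti):
    """Gradient of ell_{t,i}: (mu_i(theta^T x_t) - y_t^i) * x_t."""
    return (logit_mu(theta @ x_t) - y_ti) * x_t
```

The returned gradient $(\mu_i(\theta^\top x_t) - y_t^i)\,x_t$ is exactly the quantity $\nabla\ell_{t,i}(\hat\theta_{t,i})$ consumed by the online Newton step described next.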
We employ a variant of the online Newton step and compute $\hat\theta_{t+1,i}$ by
$$\hat\theta_{t+1,i} = \arg\min_{\|\theta\| \le D} \frac{1}{2}\left\|\theta - \hat\theta_{t,i}\right\|_{Z_{t+1}}^2 + \theta^\top \nabla\ell_{t,i}(\hat\theta_{t,i}) = \Pi^{Z_{t+1}}_{B_D}\left[\hat\theta_{t,i} - Z_{t+1}^{-1}\nabla\ell_{t,i}(\hat\theta_{t,i})\right] \tag{1}$$
where
$$Z_{t+1} = Z_t + \frac{\kappa}{2} x_t x_t^\top = \lambda I_d + \frac{\kappa}{2}\sum_{s=1}^{t} x_s x_s^\top \tag{2}$$
and $\nabla\ell_{t,i}(\hat\theta_{t,i}) = -y_t^i x_t + \mu_i(\hat\theta_{t,i}^\top x_t)\,x_t$.

Based on this estimate, we construct a confidence set $C_{t+1,i}$ (Step 8) such that the true vector of coefficients $\theta_i$ falls into it with high probability. According to the theoretical analysis (Theorem 1), we define $C_{t+1,i}$ as an ellipsoid centered at $\hat\theta_{t+1,i}$:
$$C_{t+1,i} := \left\{\theta : \left\|\theta - \hat\theta_{t+1,i}\right\|_{Z_{t+1}}^2 \le \gamma_{t+1}\right\} \tag{3}$$
where $\gamma_{t+1}$ is defined in (8). Then, we adopt the principle of optimism in the face of uncertainty to balance exploration and exploitation. Specifically, for each arm $x \in X$, we compute the upper confidence bound of its expected reward, $\hat\mu_{t+1,x}$ (Step 11), as
$$\hat\mu^i_{t+1,x} = \max_{\theta \in C_{t+1,i}} \mu_i(\theta^\top x), \quad i = 1, 2, \dots, m. \tag{4}$$
Based on it, we define empirical Pareto optimality.

**Definition 5 (Empirical Pareto optimality)** An arm $x \in X$ is empirically Pareto optimal if and only if the upper confidence bound of its expected reward is not dominated by that of any arm in $X$, i.e., $\forall x' \in X, \hat\mu_{t+1,x} \nprec \hat\mu_{t+1,x'}$.

**Algorithm 1** MOGLB-UCB
Require: Regularization parameter $\lambda \ge \max(1, \kappa/2)$
1: Initialize $Z_1 = \lambda I_d$, $\hat\theta_{1,1} = \dots = \hat\theta_{1,m} = 0$, $O_1 = X$
2: for $t = 1, 2, \dots, T$ do
3: &nbsp;&nbsp; Pull an arm $x_t$ from the approximate Pareto front $O_t$ uniformly at random
4: &nbsp;&nbsp; Observe the reward vector $y_t$
5: &nbsp;&nbsp; Update $Z_{t+1} = Z_t + \frac{\kappa}{2} x_t x_t^\top$
6: &nbsp;&nbsp; for $i = 1, 2, \dots, m$ do
7: &nbsp;&nbsp;&nbsp;&nbsp; Compute the estimate $\hat\theta_{t+1,i}$ by formula (1)
8: &nbsp;&nbsp;&nbsp;&nbsp; Construct the confidence set $C_{t+1,i}$ by formula (3)
9: &nbsp;&nbsp; end for
10: &nbsp;&nbsp; for each $x \in X$ do
11: &nbsp;&nbsp;&nbsp;&nbsp; Compute the upper confidence bound $\hat\mu_{t+1,x}$ by formula (6)
12: &nbsp;&nbsp; end for
13: &nbsp;&nbsp; Update the approximate Pareto front $O_{t+1} = \{x \in X \mid \forall x' \in X, \hat\mu_{t+1,x} \nprec \hat\mu_{t+1,x'}\}$
14: end for

Finally, we update the approximate Pareto front $O_t$ (Step 13) by finding all empirically Pareto optimal arms:
$$O_{t+1} = \{x \in X \mid \forall x' \in X, \hat\mu_{t+1,x} \nprec \hat\mu_{t+1,x'}\}.$$
Note that the computation in (4) involves the link function $\mu_i$, which may be very complicated. Fortunately, since the updating mechanism only relies on the Pareto order between arms' rewards and the link function is monotonically increasing, we can replace (4) by
$$\hat\mu^i_{t+1,x} = \max_{\theta \in C_{t+1,i}} \theta^\top x, \quad i = 1, 2, \dots, m. \tag{5}$$
Furthermore, by standard algebraic manipulations, the above optimization problem can be rewritten in closed form:
$$\hat\mu^i_{t+1,x} = \hat\theta_{t+1,i}^\top x + \sqrt{\gamma_{t+1}}\,\|x\|_{Z_{t+1}^{-1}}. \tag{6}$$

### 3.4 Theoretical Guarantees

We first show that the confidence sets constructed in each round contain the true coefficients with high probability.

**Theorem 1** With probability at least $1 - \delta$,
$$\left\|\theta_i - \hat\theta_{t+1,i}\right\|_{Z_{t+1}}^2 \le \gamma_{t+1}, \quad \forall i \in [m], t \ge 0 \tag{7}$$
where
$$\gamma_{t+1} = \frac{16(R+U)^2}{\kappa}\log\frac{\sqrt{1 + 4D^2 t}}{\delta} + \lambda D^2 + \frac{2(R+U)^2}{\kappa}\log\frac{\det(Z_{t+1})}{\det(Z_1)} + \kappa. \tag{8}$$

*Proof of Theorem 1.* The detailed proof can be found in the full paper [Lu et al., 2019a]. The main idea lies in exploring the properties of the surrogate loss function (Lemmas 1 and 2), analyzing the estimation method (Lemma 3), and utilizing the self-normalized bound for martingales (Lemma 4).

We then investigate the data-dependent term $\log\frac{\det(Z_{t+1})}{\det(Z_1)}$ appearing in the definition of $\gamma_{t+1}$ and bound the width of the confidence set by the following corollary, which is a direct consequence of Lemma 10 in Abbasi-Yadkori et al. [2011].

**Corollary 1** For any $t \ge 0$, we have
$$\log\frac{\det(Z_{t+1})}{\det(Z_1)} \le d\log\left(1 + \frac{\kappa t}{2\lambda d}\right)$$
and hence $\gamma_{t+1} \le O(d\log t)$.
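To illustrate how the pieces fit together, the following Python sketch performs one round of updates in the spirit of formulas (1)-(3), (6) and Step 13 for a finite arm set. It is our own illustrative code, not the authors' implementation: logit links are assumed for every objective, $\gamma$ is passed in as a constant (in theory it follows (8); in the experiments it is tuned), and the generalized projection $\Pi^{Z_{t+1}}_{B_D}$ is approximated by a simple rescaling.

```python
import numpy as np

def moglb_ucb_round(arms, Z, theta_hat, x_t, y_t, gamma, kappa, D):
    """One round of MOGLB-UCB-style updates: formulas (1)-(3), (6) and Step 13.

    arms:      (K, d) array, the finite arm set X
    Z:         (d, d) matrix Z_t
    theta_hat: (m, d) array of current estimates
    x_t, y_t:  pulled arm (d,) and observed reward vector (m,) of this round
    gamma:     confidence width gamma_{t+1}, treated here as a tuned constant
    Logit links are assumed; the exact generalized projection onto B_D is
    replaced by a crude rescaling for brevity.
    """
    mu = lambda z: 1.0 / (1.0 + np.exp(-z))              # logit link
    Z_next = Z + 0.5 * kappa * np.outer(x_t, x_t)        # formula (2)
    Z_inv = np.linalg.inv(Z_next)
    theta_next = np.empty_like(theta_hat)
    for i in range(theta_hat.shape[0]):
        grad = (mu(theta_hat[i] @ x_t) - y_t[i]) * x_t   # gradient of ell_{t,i}
        theta = theta_hat[i] - Z_inv @ grad              # online Newton step, formula (1)
        norm = np.linalg.norm(theta)
        theta_next[i] = theta if norm <= D else theta * (D / norm)
    # Upper confidence bounds, formula (6): theta_hat^T x + sqrt(gamma) * ||x||_{Z^{-1}}
    widths = np.sqrt(gamma * np.einsum('kd,de,ke->k', arms, Z_inv, arms))
    ucb = arms @ theta_next.T + widths[:, None]          # shape (K, m)
    # Approximate Pareto front (Step 13): keep arms whose UCB vector is not dominated
    dominated = np.array([
        (np.all(ucb >= ucb[k], axis=1) & np.any(ucb > ucb[k], axis=1)).any()
        for k in range(len(arms))
    ])
    pareto_front = np.flatnonzero(~dominated)
    return Z_next, theta_next, pareto_front
```

A full loop would then draw the next arm uniformly at random from the returned front (Step 3), which is the ingredient that enforces fairness among the empirically Pareto optimal arms.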
Finally, we present the Pareto regret bound of our algorithm, which is built upon Theorem 1.

**Theorem 2** With probability at least $1 - \delta$,
$$PR_T \le 4L\sqrt{\frac{\gamma_{T+1}\, T\, d}{\kappa}\log\left(1 + \frac{\kappa T}{2\lambda d}\right)}$$
where $\gamma_{T+1}$ is defined in (8).

**Remark.** The above theorem implies that our algorithm enjoys a Pareto regret bound of $O(d\sqrt{T})$, which matches the optimal result for the single objective GLB problem. Furthermore, in contrast to the $O(T^{1-1/(2+d_p)})$ Pareto regret bound of Turgay et al. [2018], which is almost linear in $T$ when the Pareto zooming dimension $d_p$ is large, our bound grows sublinearly with $T$ regardless of the dimension.

*Proof of Theorem 2.* By Theorem 1,
$$\theta_i \in C_{t,i}, \quad \forall i \in [m], t \ge 1 \tag{9}$$
holds with probability at least $1 - \delta$. For each objective $i \in [m]$ and each round $t \ge 1$, we define
$$\tilde\theta_{t,i} := \arg\max_{\theta \in C_{t,i}} \theta^\top x_t. \tag{10}$$
Recall that $x_t$ is selected from $O_t$, which implies that for any $x \in X$, there exists an objective $j \in [m]$ such that
$$\hat\mu^j_{t,x_t} \ge \hat\mu^j_{t,x}. \tag{11}$$
By the definitions in (5) and (10), we have
$$\hat\mu^j_{t,x_t} = \tilde\theta_{t,j}^\top x_t, \qquad \hat\mu^j_{t,x} = \max_{\theta \in C_{t,j}} \theta^\top x \overset{(9)}{\ge} \theta_j^\top x. \tag{12}$$
Combining (11) and (12), we obtain
$$\tilde\theta_{t,j}^\top x_t \ge \theta_j^\top x. \tag{13}$$
In the following, we consider two different scenarios, i.e., $\theta_j^\top x \le \theta_j^\top x_t$ and $\theta_j^\top x > \theta_j^\top x_t$. For the former case, it is easy to show $\mu_j(\theta_j^\top x) - \mu_j(\theta_j^\top x_t) \le 0$ since $\mu_j$ is monotonically increasing. For the latter case, we have
$$\begin{aligned}
\mu_j(\theta_j^\top x) - \mu_j(\theta_j^\top x_t) &\le L(\theta_j^\top x - \theta_j^\top x_t) \overset{(13)}{\le} L(\tilde\theta_{t,j}^\top x_t - \theta_j^\top x_t) \\
&= L(\tilde\theta_{t,j} - \hat\theta_{t,j})^\top x_t + L(\hat\theta_{t,j} - \theta_j)^\top x_t \\
&\le L\left(\|\tilde\theta_{t,j} - \hat\theta_{t,j}\|_{Z_t} + \|\hat\theta_{t,j} - \theta_j\|_{Z_t}\right)\|x_t\|_{Z_t^{-1}} \\
&\overset{(7)}{\le} 2L\sqrt{\gamma_t}\,\|x_t\|_{Z_t^{-1}} \le 2L\sqrt{\gamma_{T+1}}\,\|x_t\|_{Z_t^{-1}}
\end{aligned}$$
where the first inequality is due to the Lipschitz continuity of $\mu_j$, the third inequality follows from Hölder's inequality, and the last inequality holds since $\gamma_t$ is monotonically increasing with $t$. In summary, we have
$$\mu_j(\theta_j^\top x) - \mu_j(\theta_j^\top x_t) \le 2L\sqrt{\gamma_{T+1}}\,\|x_t\|_{Z_t^{-1}}.$$
Since the above inequality holds for any $x \in X$, we have
$$\Delta_{x_t} \le 2L\sqrt{\gamma_{T+1}}\,\|x_t\|_{Z_t^{-1}},$$
which immediately implies
$$\sum_{t=1}^{T} \Delta_{x_t} \le 2L\sqrt{\gamma_{T+1}}\sum_{t=1}^{T}\|x_t\|_{Z_t^{-1}}. \tag{14}$$
We bound the RHS by the Cauchy-Schwarz inequality:
$$\sum_{t=1}^{T}\|x_t\|_{Z_t^{-1}} \le \sqrt{T\sum_{t=1}^{T}\|x_t\|^2_{Z_t^{-1}}}. \tag{15}$$
By Lemma 11 in Abbasi-Yadkori et al. [2011], we have
$$\sum_{t=1}^{T}\|x_t\|^2_{Z_t^{-1}} \le \frac{4}{\kappa}\log\frac{\det(Z_{T+1})}{\det(Z_1)}. \tag{16}$$
Combining (14)-(16) and Corollary 1 finishes the proof.

## 4 Experiments

In this section, we conduct numerical experiments to compare our algorithm with the following multi-objective bandits algorithms.

- **P-UCB** [Drugan and Nowé, 2013]: the Pareto UCB algorithm, which compares different arms by the upper confidence bounds of their expected reward vectors and pulls an arm uniformly from the approximate Pareto front.
- **S-UCB** [Drugan and Nowé, 2013]: the scalarized UCB algorithm, which scalarizes the reward vector by assigning a weight to each objective and then employs the single objective UCB algorithm [Auer et al., 2002a]. Throughout the experiments, we assign each objective equal weight (a minimal sketch of this scalarization is given below).
- **P-TS** [Yahyaa and Manderick, 2015]: the Pareto Thompson sampling algorithm, which makes use of the Thompson sampling technique to estimate the expected reward of every arm and selects an arm uniformly at random from the estimated Pareto front.

Note that the Pareto contextual zooming algorithm proposed by Turgay et al. [2018] is not included in the experiments, because one step of this algorithm is finding relevant balls, whose specific implementation is not given in their paper, and no experimental results of their algorithm are reported either. In our algorithm, there is a parameter $\lambda$. Since its functionality is just to make $Z_t$ invertible and our algorithm is insensitive to it, we simply set $\lambda = \max(1, \kappa/2)$.
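As promised above, here is a minimal sketch of the scalarization used by the S-UCB baseline as we understand it: the reward vector is collapsed to $w^\top y$ with equal weights and a standard UCB1 index is computed on the scalarized values. This is our own illustration under these assumptions, not the implementation used in the experiments.

```python
import numpy as np

def s_ucb_choice(counts, sum_rewards, t, weights=None):
    """Pick an arm with scalarized UCB: UCB1 applied to w^T y with equal weights.

    counts:      (K,) number of pulls per arm so far
    sum_rewards: (K, m) per-arm sums of observed reward vectors
    t:           current round index (t >= 1)
    """
    K, m = sum_rewards.shape
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights)
    scalar_mean = (sum_rewards @ w) / np.maximum(counts, 1)
    bonus = np.sqrt(2.0 * np.log(t + 1) / np.maximum(counts, 1))  # +1 keeps the bonus positive in round 1
    index = scalar_mean + bonus
    index[counts == 0] = np.inf   # pull each arm at least once
    return int(np.argmax(index))
```

Because the index depends on a single fixed weighting, such a baseline can concentrate its pulls on the arms that score best under $w$, which mirrors the fairness concern raised in the introduction.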
Following common practice in bandits learning [Zhang et al., 2016; Jun et al., 2017], we also tune the width of the confidence set $\gamma_t$ as $c\log\frac{\det(Z_t)}{\det(Z_1)}$, where $c$ is searched within $[10^{-3}, 1]$.

We use a synthetic dataset constructed as follows. Let $m = 5$ and pick $d$ from $\{5, 10, 15\}$. For each objective $i \in [m]$, we sample the coefficients $\theta_i$ uniformly from the positive part of the unit ball. To control the size of the Pareto front, we generate an arm set comprised of $4d$ arms as follows. We first draw $3d$ arms uniformly from the centered ball whose radius is 0.5, and then sample $d$ arms uniformly from the centered unit ball. We repeat this process until the size of the Pareto front is not more than $d$. In each round $t = 1, 2, \dots, T$, after the learner submits an arm $x_t$, he observes an $m$-dimensional reward vector, each component of which is generated according to the generalized linear model. While the GLM family contains various statistical models, in the experiments we choose two frequently used models, namely the probit model and the logit model. Specifically, the first two components of the reward vector are generated by the probit model, and the last three components are generated by the logit model. Since both the problem and the algorithms involve randomness, we perform 10 trials up to round $T = 3000$ and report the average performance of the algorithms.

*Figure 1: Pareto regret of different methods.*

As can be seen from Fig. 1, where the vertical axis represents the cumulative Pareto regret up to round $t$, our algorithm significantly outperforms its competitors in all experiments. This is expected since all these competing algorithms are designed for the multi-armed bandits problem and hence do not utilize the particular structure of the problem considered in this paper, which is explicitly exploited by our algorithm.

Finally, we would like to investigate the issue of fairness. To this end, we examine the approximate Pareto front $O_t$ constructed by the tested algorithms except S-UCB and use the Jaccard index (JI) to measure the similarity between $O_t$ and the true Pareto front $O^*$, defined as
$$JI_t := \frac{|O_t \cap O^*|}{|O_t \cup O^*|},$$
for which the larger the better.

*Figure 2: Jaccard index of different methods.*

We plot the curve of $JI_t$ for each algorithm in Fig. 2, where we set $d = 10$. As can be seen, our algorithm finds the true Pareto front much faster than P-UCB and P-TS. Furthermore, the approximate Pareto front constructed by our algorithm is very close to the true Pareto front when $t > 1500$. Combining this with the result shown in Fig. 1 and the uniform sampling strategy used in Step 3 of Algorithm 1, we observe that our algorithm indeed minimizes the Pareto regret while ensuring fairness.
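The fairness evaluation above only requires a non-domination check on reward vectors and a set comparison; the helpers below sketch how $O^*$ and $JI_t$ could be computed (illustrative code of ours, not the paper's experimental scripts).

```python
import numpy as np

def pareto_front(mu):
    """Indices of the Pareto optimal rows of mu, a (K, m) matrix of expected rewards."""
    dominated = np.array([
        (np.all(mu >= mu[k], axis=1) & np.any(mu > mu[k], axis=1)).any()
        for k in range(len(mu))
    ])
    return set(np.flatnonzero(~dominated).tolist())

def jaccard_index(O_t, O_star):
    """JI_t = |O_t intersect O*| / |O_t union O*|."""
    O_t, O_star = set(O_t), set(O_star)
    return len(O_t & O_star) / len(O_t | O_star)
```

Here `mu` would be the matrix of true expected rewards, so `pareto_front(mu)` gives $O^*$, while $O_t$ is the approximate front maintained by each tested algorithm.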
## 5 Conclusion and Future Work

In this paper, we propose a novel bandits framework named multi-objective generalized linear bandits, which extends the multi-objective bandits problem to the contextual setting under the parameterized realizability assumption. By employing the principle of optimism in the face of uncertainty, we develop a UCB-type algorithm whose Pareto regret is upper bounded by $O(d\sqrt{T})$, which matches the optimal regret bound for the single objective contextual bandits problem. While we have conducted numerical experiments to show that the proposed algorithm is able to achieve high fairness, it is appealing to provide a theoretical guarantee regarding fairness. We will investigate this in future work.

## Acknowledgements

This work was partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61603177), Jiangsu SF (BK20160658), and YESS (2017QNRC001).

## References

[Abbasi-Yadkori et al., 2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320, 2011.

[Agarwal et al., 2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.

[Auer et al., 2002a] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[Auer et al., 2002b] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[Auer et al., 2016] Peter Auer, Chao-Kai Chiang, Ronald Ortner, and Madalina Drugan. Pareto front identification from stochastic bandit feedback. In International Conference on Artificial Intelligence and Statistics, pages 939–947, 2016.

[Auer, 2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[Bubeck and Cesa-Bianchi, 2012] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[Bubeck et al., 2009] Sébastien Bubeck, Gilles Stoltz, Csaba Szepesvári, and Rémi Munos. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems 22, pages 201–208, 2009.

[Dani et al., 2008] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, page 355, 2008.

[Drugan and Nowé, 2013] Madalina M. Drugan and Ann Nowé. Designing multi-objective multi-armed bandits algorithms: A study. In International Joint Conference on Neural Networks, pages 2352–2359, 2013.

[Drugan and Nowé, 2014] Madalina M. Drugan and Ann Nowé. Scalarization based Pareto optimal set of arms identification algorithms. In International Joint Conference on Neural Networks, pages 2690–2697, 2014.

[Dudik et al., 2011] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Conference on Uncertainty in Artificial Intelligence, pages 169–178, 2011.

[Filippi et al., 2010] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23, pages 586–594, 2010.

[Jun et al., 2017] Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems 30, pages 99–109, 2017.

[Kleinberg et al., 2008] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681–690, 2008.

[Langford and Zhang, 2008] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 21, pages 817–824, 2008.
[Li et al., 2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, pages 661–670, 2010.

[Lu et al., 2010] Tyler Lu, Dávid Pál, and Martin Pál. Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 485–492, 2010.

[Lu et al., 2019a] Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Multi-objective generalized linear bandits. arXiv preprint arXiv:1905.12879, 2019.

[Lu et al., 2019b] Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In Proceedings of the 36th International Conference on Machine Learning, pages 4154–4163, 2019.

[Nelder and Wedderburn, 1972] J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A, General, 135:370–384, 1972.

[Rodriguez et al., 2012] Mario Rodriguez, Christian Posse, and Ethan Zhang. Multiple objective optimization in recommender systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, pages 11–18, 2012.

[Slivkins, 2014] Aleksandrs Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568, 2014.

[Turgay et al., 2018] Eralp Turgay, Doruk Oner, and Cem Tekin. Multi-objective contextual bandit problem with similarity information. In International Conference on Artificial Intelligence and Statistics, pages 1673–1681, 2018.

[Yahyaa and Manderick, 2015] Saba Yahyaa and Bernard Manderick. Thompson sampling for multi-objective multi-armed bandits problem. In European Symposium on Artificial Neural Networks, pages 47–52, 2015.

[Zhang et al., 2016] Lijun Zhang, Tianbao Yang, Rong Jin, Yichi Xiao, and Zhi-Hua Zhou. Online stochastic linear optimization under one-bit feedback. In International Conference on Machine Learning, pages 392–401, 2016.