# Preference Based Adaptation for Learning Objectives

Yao-Xiang Ding, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
{dingyx, zhouzh}@lamda.nju.edu.cn

In many real-world learning tasks, it is hard to directly optimize the true performance measure, while choosing the right surrogate objective is also difficult. In this situation, it is desirable to incorporate an objective optimization process into the learning loop, based on weak modeling of the relationship between the true measure and the objective. In this work, we discuss the task of objective adaptation, in which the learner iteratively adapts the learning objective to the underlying true objective based on preference feedback from an oracle. We show that when the objective can be linearly parameterized, this preference based learning problem can be solved by utilizing the dueling bandit model. A novel sampling based algorithm DL2M is proposed to learn the optimal parameter, which enjoys strong theoretical guarantees and efficient empirical performance. To avoid learning a hypothesis from scratch after each objective update, a boosting based hypothesis adaptation approach is proposed to efficiently adapt pre-learned element hypotheses to the current objective. We apply the overall approach to multi-label learning, and show that the proposed approach achieves strong performance under various multi-label performance measures.

## 1 Introduction

Machine learning approaches have been applied to many real-world tasks, in which the target is usually to optimize some task-specific performance measure. For complex problems, the performance measures are usually hard to optimize directly, such as the click-through rate in online advertising and the profit gain in recommendation system design. Instead of directly optimizing these complex measures, surrogate objectives with better mathematical properties are designed to simplify optimization. Whether the objective is correctly designed essentially affects the application performance. However, designing it well requires detailed knowledge of the relationship between the true measure and the objective, which is often difficult to acquire. In this situation, it is more desirable to learn the objective and the hypothesis simultaneously. Based on this motivation, we consider the novel scenario of learning with objective adaptation from preference feedback. In this scenario, in each iteration of the objective adaptation process, the learner maintains a pair of objective functions, as well as the corresponding learned hypotheses, obtained from the latest two iterations. An oracle then provides a preference over the pair of hypotheses to the learner, according to the true task performance measure. Based on this preference information, the learner updates both the objective function and the corresponding hypothesis. In particular, this formulation even allows us to model complex scenarios in which the true performance measure is not quantified, such as subjective human preference. It is expected that the objective function converges to the optimal one, so that the learned hypothesis optimizes the true performance measure. In this work, we focus on the following linearly parameterized objective function class.
Denote the objective by $L_w$, $w \in \mathcal{W}$, in which $\mathcal{W}$ is the parameter space and $w = [w_1\ w_2\ \cdots\ w_K]$ is a $K$-dimensional real-valued vector. We assume that $L_w$ can be represented as
$$L_w = \sum_{i=1}^{K} w_i l_i + w_0 \Lambda, \quad w_0 \in \{0, 1\}, \tag{1}$$
in which $l_1, l_2, \ldots, l_K$ are $K$ convex element objectives, and $w_0$ is an additional indicator for $\Lambda$. When $w_0 = 0$, $L_w$ is a linear combination of the $K$ element objectives. When $w_0 = 1$, $\Lambda$ can be utilized to represent an additional convex regularization term. It is easy to see that this linear formulation covers a broad class of commonly used objectives in different learning tasks. By choosing different $w$, we are allowed to consider different trade-offs among the element objectives. The target of objective adaptation is then to learn the optimal $w^*$, which corresponds to the best trade-off, leading to the optimal task performance measure. To ensure that the problem of learning a hypothesis under any choice of $w$ is solvable, we restrict $w_i \geq 0$. When $w_0 = 0$, the scale of $w$ does not matter, so we restrict $\mathcal{W}$ to be the non-negative part of the $K$-dimensional unit sphere, i.e. $\|w\|_2 = 1$, $w_i \geq 0$. When $w_0 = 1$, i.e. the scale of $w$ is meaningful, we restrict $\mathcal{W}$ to be the $K$-dimensional ball $\|w - R\mathbf{1}_K\|_2 \leq R$, in which $\mathbf{1}_K$ is the $K$-dimensional all-one vector and $R$ is the radius. In this way, $L_w$ is kept convex for all $w \in \mathcal{W}$.

There are two main challenges in the above objective adaptation scenario. One is to learn the objective function based on preference feedback from the oracle, which requires proper modeling of the preference feedback. The other is how hypothesis learning can be done efficiently, without learning from scratch, when the objective is updated. In this work, we take the first step towards these two challenges. First, we naturally formulate the objective adaptation process into the dueling bandit model [Yue et al., 2012], in which $w$ is treated as the bandit arm and the oracle preference is treated as the reward. A novel sampling based algorithm DL2M, which stands for Dueling bandit Learning for Logit Model, is proposed for learning the optimal weight $w^*$; it enjoys an $O(K^{3/2}\sqrt{T})$ regret bound and efficient empirical performance. Second, assuming that $K$ element hypotheses $f^i$ are learned beforehand, corresponding to the one-hot weights $w^i$, $i \in [K]$, with only one non-zero entry, a novel gradient boosting based approach named Adapt-Boost is proposed for adapting the element hypotheses to the hypothesis $h_w$ corresponding to any $w$. We apply the proposed objective adaptation approach to multi-label learning, and the experimental results show that our approach achieves strong performance for various multi-label performance measures.
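As a concrete illustration of Equation 1, the following minimal Python sketch assembles a weighted objective from convex element losses. The callable interface (`element_losses`, `regularizer`, and the `model`/`data` arguments) is our own assumption for illustration and is not part of the paper.

```python
def make_objective(element_losses, w, regularizer=None, w0=0):
    """Assemble L_w = sum_i w_i * l_i + w0 * Lambda from convex element objectives (Eq. 1).

    element_losses : list of K callables l_i(model, data) -> float, assumed convex in the model
    w              : nonnegative weights of length K (unit l2 norm when w0 = 0)
    regularizer    : callable Lambda(model) -> float, used only when w0 = 1
    """
    def L_w(model, data):
        # weighted sum of the element objectives
        value = sum(w_i * l_i(model, data) for w_i, l_i in zip(w, element_losses))
        if w0 == 1 and regularizer is not None:
            value += regularizer(model)   # optional convex regularization term
        return value
    return L_w
```

Since each $l_i$ is convex and $w_i \geq 0$, any such combination remains convex, which is what keeps hypothesis learning tractable under every choice of $w$.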
## 2 Related Work

Some similarities exist between the objective adaptation scenario and multi-objective optimization (MOO) [Deb, 2014]. In both scenarios, multiple element objectives are considered, and the trade-offs among them should be properly dealt with. In MOO, however, the target is to find the Pareto solutions reflecting different trade-offs, rather than a single optimal solution defined by the oracle's preference. In fact, for our objective adaptation problem, it is also possible to utilize evolutionary algorithms instead of the proposed DL2M algorithm. However, evolutionary algorithms are usually heuristic, and theoretical guarantees are lacking. In [Agarwal et al., 2014], the multi-objective decision making problem is considered, in which the target of the learner is to optimize all objectives by observing the actions provided by a mentor. There is a significant difference between their setting and ours, since we focus on general learning tasks instead of decision making.

The proposed DL2M algorithm belongs to the family of continuous dueling bandit algorithms. In [Yue and Joachims, 2009], an online bandit gradient descent algorithm [Flaxman et al., 2005] was proposed, which achieves an $O(\sqrt{K}T^{3/4})$ regret bound for convex value functions. In [Kumagai, 2017], it was shown that when the value function is strongly convex and smooth, a stochastic mirror descent algorithm achieves a near optimal $O(\sqrt{KT})$ regret bound. Similar to DL2M, both of the above algorithms follow the online convex optimization framework [Zinkevich, 2003], while DL2M assumes that the underlying value function follows a linear model. The major advantage of DL2M lies in the reduction of the total number of arms that need to be sampled during learning. The above two algorithms sample two arms for comparison in each iteration. In contrast, DL2M samples only one arm in iteration $t$ and compares it with the arm sampled in iteration $t-1$. Thus the total number of arms needed is halved for DL2M compared to the above two algorithms. For objective adaptation, choosing an arm incurs the cost of learning the corresponding hypothesis, so reducing the total number of arms sampled is important.

Our boosting based hypothesis adaptation procedure is motivated by multi-task learning [Evgeniou and Pontil, 2004; Chapelle et al., 2010]. By regarding each element objective as a single task, the hypothesis adaptation procedure can be decomposed into the adaptation of the element hypothesis for each element objective, and the adaptation of a global hypothesis for the weighted total objective. On the other hand, since the target is to optimize the single total objective, multi-task learning approaches cannot be directly applied to our problem. The hypothesis adaptation task is also considered in [Li et al., 2013], in which an efficient adaptation approach is proposed under the assumption that the auxiliary hypothesis is a linear model. However, the linear model assumption also restricts the capacity of adaptation. To address this issue, our approach utilizes a gradient boosting based learner, which can use any weak hypothesis for adaptation, so the optimization procedure can be more flexible and efficient.

## 3 Dueling Bandit Learning for Objective Adaptation

In this section, a dueling bandit algorithm DL2M is proposed to learn the optimal weight vector $w^*$ from preference feedback in order to solve the objective adaptation task. For convenience of optimization, we assume the arm space for DL2M is the full $K$-dimensional unit sphere $\mathcal{W}_0: \|w\|_2 = 1$. How to apply DL2M to the space $\mathcal{W}$ defined in Section 1 is discussed in Remark 3 below. To model the preference of the oracle, we assume a total order $\succeq$ exists on $\mathcal{W}_0$: for the optimal arm $w^* \in \mathcal{W}_0$, we have $w^* \succeq w$ for all $w \in \mathcal{W}_0$. Whenever the oracle is given an ordered pair $(w, w')$, the oracle gives the feedback $r = 1$ if $w \succ w'$, and $r = -1$ otherwise. To precisely model this order and how the oracle provides the preference information, we assume that each arm can be evaluated by a value function $v(w)$, such that $v(w') \geq v(w) \Leftrightarrow w' \succeq w$, and the preference feedback is generated by a probabilistic model based on the gap between $v(w)$ and $v(w')$: $\Pr(r = 1) = \mu(v(w) - v(w'))$, in which $\mu(x)$ is a strictly increasing link function.
In this paper, the logistic probability model $\mu(x) = 1/(1+\exp(-x))$ is utilized, which is the common choice in related research. The generation of preferences is also assumed to be independent of the other parts of learning. In each iteration $t$ out of the total $T$ iterations, a pair $(w_t, w'_t)$ is submitted to the oracle for feedback. The target is to minimize the total (pseudo) regret
$$\sum_{t=1}^{T} \big[ \mu(v(w^*) - v(w_t)) + \mu(v(w^*) - v(w'_t)) \big].$$
If we further restrict $w'_t$ to be $w_{t-1}$, we only need to consider the summation over $w_t$. Furthermore, by observing that $\mu(v(w^*) - v(w_t))$ achieves its minimum $1/2$ when $w_t = w^*$, we can reformulate the regret as
$$R_T = \sum_{t=1}^{T} \big[ \mu(v(w^*) - v(w_t)) - \mu(v(w^*) - v(w^*)) \big] = \sum_{t=1}^{T} \big[ \mu(v(w^*) - v(w_t)) - 1/2 \big].$$
From the above definition, the tasks of regret minimization and optimal weight vector estimation coincide. By L'Hôpital's rule, $f(x) = 1/(1+e^{-x}) - 1/2$ has the same convergence rate as $f(x) = x$ when $x \to 0$. Thus we only need to consider
$$\tilde{R}_T = \sum_{t=1}^{T} \big[ v(w^*) - v(w_t) \big].$$
In this work, we adopt the commonly used linear value function $v_{\mathrm{LIN}}(w) = w^T\theta^*$, in which $\theta^*$ is an underlying optimal evaluation vector. This leads to the classical linear regret formulation
$$\tilde{R}_T = \sum_{t=1}^{T} \big( w^{*T}\theta^* - w_t^T\theta^* \big), \tag{2}$$
which indicates that the objective is to maximize the linear value function. Since $\|w\|_2 = 1$, $w^T\theta^*$ is the projection of $\theta^*$ onto $w$, and achieves its maximum when the directions of $w$ and $\theta^*$ coincide. Thus the direction of $\theta^*$ can be interpreted as the direction of the optimal weight vector kept in the oracle's mind. To simplify optimization, we assume $\|\theta^*\|_2 \leq 1$ without loss of generality.

From the definition of regret, it is crucial to estimate $\theta^*$ accurately, so we first consider the procedure of estimating $\theta^*$ in each iteration. Motivated by the logit one-bit bandit algorithm proposed in [Zhang et al., 2016], in each iteration $t$ we can utilize the online version of the maximum likelihood estimator, i.e. minimize the loss function
$$f_t(\theta) = \log\left(1 + \exp\left(-r_t\big(w_t^T\theta - w_{t-1}^T\theta\big)\right)\right),$$
which satisfies the exponentially concave property. As a result, the optimal update can be approximated by an analogue of the online Newton step [Hazan et al., 2007]:
$$\theta_{t+1} = \arg\min_{\|\theta\|_2 \leq 1}\ \frac{1}{2}\|\theta - \theta_t\|_{Z_{t+1}}^2 + (\theta - \theta_t)^T \nabla f_t(\theta_t), \tag{3}$$
$$Z_{t+1} = Z_t + \frac{\beta}{2}(w_t - w_{t-1})(w_t - w_{t-1})^T, \quad Z_1 = \lambda I, \tag{4}$$
with $\beta = \frac{1}{2(e+1)}$. Next, we consider how to choose $w_t$ in each round for better exploration. Different from the UCB based exploration strategy implemented in [Zhang et al., 2016], we extend the linear Thompson sampling technique proposed in [Abeille and Lazaric, 2017] to our dueling bandit setting, leading to the DL2M algorithm illustrated in Algorithm 1.

**Algorithm 1** Dueling bandit Learning for Logit Model (DL2M)
1. Input: initialization $\theta_1 = 0$, $Z_1 = \lambda I$, $w_0$, number of iterations $T$.
2. for $t = 1$ to $T$ do
3. Sample $\eta_t \sim \mathcal{N}(0, I_K)$.
4. Choose $\kappa_t$ according to Theorem 1.
5. Compute $\tilde{\theta}_t$ as $\tilde{\theta}_t \leftarrow \theta_t + \kappa_t Z_t^{-1/2}\eta_t$. (5)
6. Compute $w_t$ as $w_t \leftarrow \arg\max_{\|w\|_2 = 1} w^T\tilde{\theta}_t$. (6)
7. Submit $w_t$ and $w_{t-1}$ and get $r_t$.
8. Compute $\theta_{t+1}$ and $Z_{t+1}$ as in Equations 3 and 4.
9. end for

We provide a regret guarantee for the proposed algorithm; the proof will be presented in a longer version of the paper.

**Theorem 1.** Assume that $\kappa_t$ in Algorithm 1 is set according to $\kappa_t = \sqrt{\gamma_{t}(\delta/4T)}$, where
$$\gamma_{t+1}(\delta) = \lambda + 16 + \frac{8}{3}\log\frac{2\sqrt{2}\log t \cdot t^2}{\delta} + \frac{1}{\beta}\log\frac{\det(Z_{t+1})}{\det(Z_1)}. \tag{7}$$
After running DL2M for $T$ rounds, for any $\delta > 0$ the following result holds with probability at least $1 - \delta$:
$$\sum_{t=1}^{T} \big( w^{*T}\theta^* - w_t^T\theta^* \big) \leq \sqrt{\gamma_{T+1}(\delta/4T)\, KT \log\frac{8KT}{\delta}\, \frac{1}{\beta}\log\frac{\det(Z_{T+1})}{\det(Z_1)}} + 128.$$

By Lemma 10 of [Abbasi-Yadkori et al., 2011], we have $\log\left(\det(Z_{t+1})/\det(Z_1)\right) \leq K\log\left(1 + \frac{\beta t}{2\lambda K}\right)$. Thus Theorem 1 provides an $O(K^{3/2}\sqrt{T})$ regret guarantee for DL2M.
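For concreteness, the following is a minimal NumPy sketch of the DL2M loop. It is not the authors' implementation: the exact $Z_{t+1}$-norm projection onto the unit ball in Equation 3 is replaced by a plain Euclidean rescaling, $\kappa_t$ is set by the tunable rule from Remark 1 rather than by Theorem 1, and the `oracle` interface is our own assumption.

```python
import numpy as np

def dl2m(oracle, K, T, lam=1.0, c=0.05, rng=None):
    """Sketch of DL2M (Algorithm 1) with the logistic link.

    oracle(w, w_prev) must return +1 if w is preferred over w_prev and -1 otherwise.
    Returns the final estimate of theta* and the list of sampled arms.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    beta = 1.0 / (2.0 * (np.e + 1.0))
    theta = np.zeros(K)
    Z = lam * np.eye(K)
    w_prev = np.ones(K) / np.sqrt(K)                  # arbitrary initial arm w_0
    arms = []
    for t in range(1, T + 1):
        # exploration scale kappa_t as in Remark 1 (det(Z_1) = lam^K)
        log_det_ratio = np.linalg.slogdet(Z)[1] - K * np.log(lam)
        kappa = min(c / 2.0, c * log_det_ratio)
        # Thompson-style perturbation (Eq. 5): theta_tilde = theta + kappa * Z^{-1/2} * eta
        vals, vecs = np.linalg.eigh(Z)
        Z_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
        theta_tilde = theta + kappa * Z_inv_sqrt @ rng.standard_normal(K)
        if np.linalg.norm(theta_tilde) < 1e-12:
            theta_tilde = rng.standard_normal(K)      # fall back to a random direction
        # arm selection (Eq. 6): the maximizer over the unit sphere is the direction of theta_tilde
        w = theta_tilde / np.linalg.norm(theta_tilde)
        arms.append(w)
        r = oracle(w, w_prev)                         # preference feedback r_t
        # online Newton step (Eqs. 3-4); exact Z-norm projection simplified to rescaling
        x = w - w_prev
        grad = -r * x / (1.0 + np.exp(r * (x @ theta)))   # gradient of f_t at theta_t
        Z = Z + 0.5 * beta * np.outer(x, x)
        theta = theta - np.linalg.solve(Z, grad)
        if np.linalg.norm(theta) > 1.0:
            theta /= np.linalg.norm(theta)
        w_prev = w
    return theta, arms
```

Note that only one new arm is drawn per round; the second arm of the duel is always the previous one, which is what halves the number of hypotheses that need to be trained in the objective adaptation setting.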
Remark 1. The well-known doubling trick [Shalev-Shwartz, 2012] can be utilized to make $\kappa_t$ independent of the total number of iterations $T$. In practice, since $\kappa_t$ determines the step size of exploration, it is desirable to make it fine-tunable. In all the experiments, we set $\kappa_t = \min\left(c/2,\ c\log\left(\det(Z_t)/\det(Z_1)\right)\right)$, in which $c$ is a hyperparameter. The min operator is introduced to control the largest step size.

Remark 2. Theorem 1 provides a guarantee on the total regret. Our objective adaptation approach can be utilized in many real-world tasks in which the learned hypothesis is already deployed during the learning stage and the preference feedback is generated from its true effectiveness. The total regret guarantee is natural in this situation. Meanwhile, it is also important to consider another kind of task, in which only the final estimation accuracy matters. In that situation, it is better to consider simple regret instead of total regret, since it is a pure exploration problem. This is a particularly interesting and challenging task since we assume a continuous arm space. The experimental results in Section 5 show that DL2M is efficient in finding the best arms, and we leave designing an optimal pure exploration algorithm for continuous dueling bandits as future work.

Remark 3. In the above discussion, we assume that the arm space for DL2M is $\mathcal{W}_0: \|w\|_2 = 1$. For objective learning, the parameter domains introduced in Section 1 are different; we now discuss how DL2M can be applied. When $w_0 = 0$, the domain of $w$ is $\mathcal{W}: \|w\|_2 = 1, w_i \geq 0$, which is the non-negative part of $\mathcal{W}_0$. To apply DL2M, it is necessary to restrict each dimension of $\theta_t$ and $\tilde{\theta}_t$ to be non-negative. For $\theta_t$, we can simply change the domain of the update of $\theta$ to
$$\theta_{t+1} = \arg\min_{\|\theta\|_2 \leq 1,\ \theta^i \geq 0,\ i \in [K]}\ \frac{1}{2}\|\theta - \theta_t\|_{Z_{t+1}}^2 + (\theta - \theta_t)^T \nabla f_t(\theta_t),$$
in which $\theta^i$ is the $i$-th entry of $\theta$. Since the domain remains convex, the efficiency of optimization is unaffected. For $\tilde{\theta}_{t+1}$, we can simply take a $\tilde{\theta}^i_{t+1} \leftarrow \max(0, \tilde{\theta}^i_{t+1})$, $i \in [K]$, operation to limit its value. Though this operation may affect the theoretical guarantee for $w$ near the boundary, we observe that the performance is not affected in experiments. When $w_0 = 1$, the domain of $w$ is $\mathcal{W}: \|w - R\mathbf{1}_K\|_2 \leq R$. The main idea is to establish a topologically identical mapping from the arm space to $\mathcal{W}$; we can then perform DL2M in the arm space and map the result to $\mathcal{W}$. First, it is easy to establish a bijective mapping $g_1$ from the $K$-dimensional ball $\mathcal{W}_1: \|w\|_2 \leq 1$ to $\mathcal{W}$ by constant shifting and scaling. Second, another simple bijective mapping $g_2$ maps a point on half of the $(K+1)$-dimensional sphere, i.e. $\mathcal{W}_2: \|w\|_2 = 1, w_{K+1} \geq 0$, to a point in $\mathcal{W}_1$, by simply setting $w_{K+1} = 0$ (just imagine the mapping from the upper half of the three-dimensional sphere onto a two-dimensional disc). Thus we can simply utilize the composite mapping $g_1 \circ g_2$ to map an arm in $\mathcal{W}_2$ to a parameter in $\mathcal{W}$. To apply DL2M on $\mathcal{W}_2$, we can update $\theta$ by
$$\theta_{t+1} = \arg\min_{\|\theta\|_2 \leq 1,\ \theta^{K+1} \geq 0}\ \frac{1}{2}\|\theta - \theta_t\|_{Z_{t+1}}^2 + (\theta - \theta_t)^T \nabla f_t(\theta_t),$$
and perform $\tilde{\theta}^{K+1}_{t+1} \leftarrow \max(0, \tilde{\theta}^{K+1}_{t+1})$ to restrict both $\theta_{t+1}$ and $\tilde{\theta}_{t+1}$, which is similar to the $w_0 = 0$ case.
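The two domain manipulations of Remark 3 can be sketched as below; the helper names are ours, and the clipping helper implements only the simple entrywise projection used for $\tilde{\theta}$, not the constrained update of $\theta$.

```python
import numpy as np

def clip_nonnegative(theta_tilde):
    """w0 = 0 case: entrywise projection theta_tilde^i <- max(0, theta_tilde^i)."""
    return np.maximum(theta_tilde, 0.0)

def half_sphere_to_ball(w_ext, R):
    """w0 = 1 case: the composite mapping g1(g2(.)) from Remark 3.

    w_ext is an arm on the half sphere W2 in R^{K+1} (||w_ext||_2 = 1, last entry >= 0).
    g2 drops the last coordinate, landing in the unit ball W1 of R^K;
    g1 then scales by R and shifts by R*1_K into W = {w : ||w - R*1_K||_2 <= R}.
    """
    w_ball = w_ext[:-1]                              # g2
    return R * (np.ones(w_ball.shape[0]) + w_ball)   # g1
```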
## 4 Boosting Based Hypothesis Adaptation

After each objective adaptation step, $w$ is updated and a new $L_w$ is obtained. To avoid learning the corresponding hypothesis $F_w$ from scratch, a hypothesis adaptation procedure is considered. Recall the formulation of the objective function defined in Equation 1. Assume that before the whole objective adaptation process, we have learned $K$ element hypotheses $f^i$ under the (regularized) element objectives $l_i + w_0\Lambda$, $i \in [K]$. To obtain the $F_w$ corresponding to $L_w$, we can linearly combine the $f^i$ together with a newly learned auxiliary hypothesis $\phi_w$, i.e. set $F_w = \sum_{i=1}^{K}\alpha_i f^i + \phi_w$. As a result, the learning problem is transformed into
$$\min_{\alpha_i,\, i \in [K],\ \phi_w}\ L_w\left(\sum_{i=1}^{K}\alpha_i f^i + \phi_w\right). \tag{8}$$
Under this formulation, the learning target is to decide the weight $\alpha_i$ for each $f^i$, together with the auxiliary hypothesis $\phi_w$. Intuitively, there should be a close relationship between $\alpha_i$ and $w_i$, the weight of $l_i$ in $L_w$. When $w_i$ is large, the corresponding $l_i$ has a large impact on the global $L_w$. Since $f^i$ is learned under $l_i + w_0\Lambda$, $\alpha_i$ should also be large, to make the contribution of $f^i$ to $F_w$ more significant. For the same reason, if $w_i$ is small then $\alpha_i$ should be small as well. As a result, to solve Equation 8 properly, establishing a close relationship between $\alpha_i$ and $w_i$ is necessary. Based on this motivation, a boosting based hypothesis adaptation approach named Adapt-Boost is proposed. Assume that the learning procedure runs for $N$ iterations; in each iteration $j$, Adapt-Boost learns one weak hypothesis $h_j$. Denote by $H = [h_1\ h_2\ \cdots\ h_N]^T$ the vector of all learned weak hypotheses, and set $w_{*,0} = 1$, $w_{*,i} = w_i$, $i \in [K]$. Adapt-Boost solves the following $\ell_1$-regularized problem:
$$\min_{\beta^i,\, i \in [K]\cup\{0\},\ H}\ L_w\left(\sum_{k=1}^{K}\beta^{k,T}\left(\mathbf{1}_N f^k + H\right) + \beta^{0,T}H\right), \quad \text{s.t.}\ \sum_{i=0}^{K}\frac{1}{w_{*,i}}\|\beta^i\|_1 \leq \mu, \tag{9}$$
in which $\beta^i$, $i \in [K]\cup\{0\}$, are $N$-dimensional weight vectors and $\mathbf{1}_N$ is the $N$-dimensional all-one vector. Compared to Equation 8, $F_w$ is further restricted to $\left(\sum_{k=1}^{K}\beta^{k,T}(\mathbf{1}_N f^k + H)\right) + \beta^{0,T}H$, and the auxiliary hypothesis $\phi_w$ is decomposed into $K$ local terms $\beta^{k,T}H$ corresponding to the $f^k$, together with a global term $\beta^{0,T}H$. For each $f^k$, the weight $\alpha_k$, which represents the importance of $f^k$ in learning $F_w$, is replaced by the weight vector $\beta^k$. Thus controlling the magnitude of $\alpha_k$ is equivalent to controlling the norm of $\beta^k$. This target is realized by introducing the sum of $1/w_{*,i}$-weighted $\ell_1$-norm constraints on $\beta^i$, $i \in [K]\cup\{0\}$, in Equation 9, with a hyperparameter $\mu$ controlling the global sparsity. Meanwhile, by controlling the local sparsity of each $\beta^k$ using $w_{*,k}$, $k \in [K]$, we are able to relate the importance of $f^k$ in $F_w$ to the objective weights $w$. The key advantage of employing Equation 9 is that this sparsity-constrained problem can be solved by the $\epsilon$-boost algorithm [Rosset et al., 2004], which is briefly introduced below. To simplify notation, we use $\beta$ to denote the concatenation of all $\beta^i$, $i \in [K]\cup\{0\}$. Temporarily, we also assume that the weak hypotheses $H$ are fixed and only $\beta$ needs to be optimized. Instead of explicitly setting the sparsity level $\mu$, we decompose the sparsity constraint over all steps. In each iteration, a small increment $\Delta\beta$ is added to $\beta$, and an $\epsilon$-sparsity constraint is applied on $\Delta\beta$, leading to the following per-iteration optimization problem:
$$\min_{\Delta\beta}\ L_w(\beta + \Delta\beta), \quad \text{s.t.}\ \sum_{i=0}^{K}\frac{1}{w_{*,i}}\|\Delta\beta^i\|_1 \leq \epsilon, \tag{10}$$
in which $\Delta\beta^i$ is the part of $\Delta\beta$ added to $\beta^i$. The objective function can be approximated by Taylor expansion as
$$L_w(\beta + \Delta\beta) \approx L_w(\beta) + [\nabla L_w(\beta)]^T \Delta\beta, \tag{11}$$
so we turn to minimizing $[\nabla L_w(\beta)]^T\Delta\beta$. Since the sparsity budget is added gradually in increments of $\epsilon$ over the learning process, and $L_w$ is convex, the optimal solution of Equation 10 always satisfies $\sum_{i=0}^{K}\frac{1}{w_{*,i}}\|\Delta\beta^i\|_1 = \epsilon$. Let $[\nabla L_w(\beta)]^i_j$ and $[\Delta\beta]^i_j$, $i \in [K]\cup\{0\}$, $j \in [N]$, denote the $(Ni+j)$-th dimension of $\nabla L_w(\beta)$ and $\Delta\beta$ respectively. It is easy to see that the optimal $\Delta\beta$ in Equation 11 is a vector of all zeros except for $[\Delta\beta]^{i^*}_{j^*} = w_{*,i^*}\epsilon$, where $(i^*, j^*) = \arg\min_{i,j} w_{*,i}[\nabla L_w(\beta)]^i_j$. Furthermore, we can explicitly write the $(Ni+j)$-th component of $\nabla L_w(\beta)$ as
$$\frac{\partial L_w}{\partial \beta^i_j} = [\nabla L_w(F_w)]\left(\frac{\partial F_w}{\partial \beta^i_j}\right) = [\nabla L_w(F_w)](f^i + h_j),$$
in which $h_j$ is the $j$-th weak hypothesis in $H$, and an additional $f^0 \equiv 0$ is introduced to simplify notation. Now let us take the choice of the weak hypotheses $H$ into consideration. Based on the above discussion, in the $j$-th iteration our target is to solve $\max_{i, h_j} w_{*,i}[-\nabla L_w(F_w)](f^i + h_j)$. This formulation inspires us to utilize gradient boosting. To obtain the optimal update, a candidate weak hypothesis $h^i_j$ is chosen for each $i \in [K]\cup\{0\}$ to let $f^i + h^i_j$ fit the residual $-\nabla L_w(F_w)$, and then we choose the optimal update $f^{i^*} + h^{i^*}_j$ which best fits the residual weighted by $w_{*,i^*}$. The optimal step size for the update is $w_{*,i^*}\epsilon$, according to the previous discussion. The whole process of Adapt-Boost is illustrated in Algorithm 2.

**Algorithm 2** Adapt-Boost
1. Input: loss parameter $w$, loss function $L_w$, element hypotheses $f^i$, $i \in [K]$, $f^0 \equiv 0$, number of iterations $N$, number of element losses $K$, step size $\epsilon$.
2. $w_{*,i} \leftarrow w_i$, $i \in [K]$; $w_{*,0} \leftarrow 1$; $F_0 \leftarrow f^0$.
3. for $j = 1$ to $N$ do
4. Calculate the current residual $-\nabla L_w(F_{j-1})$.
5. for $i = 0$ to $K$ do
6. Fit the residual $-\nabla L_w(F_{j-1})$ with $f^i + h^i_j$ to obtain a weak hypothesis $h^i_j$.
7. end for
8. Choose the optimal update $i^*$ as $i^* = \arg\max_i w_{*,i}[-\nabla L_w(F_{j-1})](f^i + h^i_j)$.
9. Update the current hypothesis as $F_j = F_{j-1} + (w_{*,i^*}\epsilon)(f^{i^*} + h^{i^*}_j)$.
10. end for
11. Output: the learned hypothesis $F_N$.

It can be seen that Adapt-Boost utilizes a boosting based process to gradually add the element and weak hypotheses into the learned hypothesis, instead of explicitly setting their weights. Thanks to the flexibility of choosing the weak learners and the efficiency of gradient boosting, we are able to solve complex hypothesis adaptation problems with low cost.
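The following is a compact sketch of the Adapt-Boost loop for a differentiable loss, working directly with prediction vectors on a fixed training set; `fit_weak_learner` stands for any regression weak learner (e.g. a shallow tree, or an SGD update as in Section 5.2), and the interface is our own simplification rather than the paper's implementation.

```python
import numpy as np

def adapt_boost(X, grad_Lw, w_star, element_preds, fit_weak_learner,
                n_rounds=100, eps=0.05):
    """Sketch of Adapt-Boost (Algorithm 2) over a fixed training set.

    grad_Lw(F)      : gradient of L_w with respect to the prediction vector F, shape (n,)
    w_star          : weights [w_{*,0} = 1, w_{*,1} = w_1, ..., w_{*,K} = w_K]
    element_preds   : list of K+1 prediction vectors [f^0 = 0, f^1(X), ..., f^K(X)]
    fit_weak_learner(X, target) : fits one weak regressor and returns its predictions, shape (n,)
    """
    F = np.zeros(X.shape[0])                            # F_0 = f^0 = 0
    for _ in range(n_rounds):
        residual = -grad_Lw(F)                          # negative functional gradient at F_{j-1}
        best_score, best_i, best_update = -np.inf, None, None
        for i, f_i in enumerate(element_preds):
            h = fit_weak_learner(X, residual - f_i)     # fit the residual with f^i + h^i_j
            candidate = f_i + h
            score = w_star[i] * (residual @ candidate)  # weighted agreement with the residual
            if score > best_score:
                best_score, best_i, best_update = score, i, candidate
        F = F + w_star[best_i] * eps * best_update      # step size w_{*,i*} * eps
    return F
```

The design mirrors the $\epsilon$-boost view: larger objective weights $w_i$ both favor selecting the corresponding element hypothesis and allow a larger step when it is selected.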
Figure 1: Instantaneous regret of DL2M over iterations $t$. Panels: (a) $K = 10$, $c = 0.1$; (b) $K = 10$, $c = 0.05$; (c) $K = 10$, $c = 0.01$; (d) $K = 100$, $c = 0.1$; (e) $K = 100$, $c = 0.05$; (f) $K = 100$, $c = 0.01$.

## 5 Experiments

### 5.1 Testing DL2M on Synthetic Data

We present experimental results on synthetic data to verify the effectiveness of DL2M. In each experiment, a $K$-dimensional point is uniformly sampled from the unit ball as $\theta^*$. Once the learner submits the pair of arms $(w_t, w_{t-1})$, a preference feedback $r_t \in \{-1, 1\}$ is randomly generated according to
$$\Pr\left(r_t \mid (w_t, w_{t-1})\right) = \frac{1}{1 + \exp\left(-\rho\, r_t (w_t - w_{t-1})^T\theta^*\right)},$$
in which $\rho$ is a parameter controlling the randomness of the preferences. In all experiments we use $\rho = 100$ to ensure the preferences are relatively consistent, and we set $\lambda = 1$. The performance is measured by the change of the instantaneous regret $w^{*T}\theta^* - w_t^T\theta^*$ over time. We compare the results for different $c$ and $K$, which are illustrated in Figure 1. It can be observed that when the parameter $c$ is properly set and $K$ is not large, DL2M is very efficient and converges within a limited number of iterations. As the dimension gets larger, the performance degrades accordingly.
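The synthetic oracle above can be simulated as follows, reusing the `dl2m` sketch from Section 3; the sampling of $\theta^*$ and the regret computation follow our reading of the setup, with $\rho = 100$ and $\lambda = 1$ as stated.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, rho = 10, 200, 100.0

# theta* drawn uniformly from the unit ball: uniform direction, radius ~ U^(1/K)
theta_star = rng.standard_normal(K)
theta_star *= rng.random() ** (1.0 / K) / np.linalg.norm(theta_star)
w_opt = theta_star / np.linalg.norm(theta_star)      # optimal arm on the unit sphere

def oracle(w, w_prev):
    """Draw r = +1 with probability 1 / (1 + exp(-rho * (w - w_prev)^T theta*))."""
    p = 1.0 / (1.0 + np.exp(-rho * (w - w_prev) @ theta_star))
    return 1 if rng.random() < p else -1

theta_hat, arms = dl2m(oracle, K, T, lam=1.0, c=0.05, rng=rng)   # dl2m as sketched in Section 3
inst_regret = [(w_opt - w) @ theta_star for w in arms]           # w*^T theta* - w_t^T theta*
```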
To verify the efficiency of Thompson sampling in dueling and one-bit bandit problems, it is interesting to compare the above results with those reported in [Zhang et al., 2016], which utilizes UCB based exploration. It can be seen that Thompson sampling achieves more efficient performance in practice, as in many other bandit problems.

### 5.2 Multi-Label Performance Measure Adaptation

The multi-label classification task is utilized to evaluate the effectiveness of DL2M and Adapt-Boost, both separately and jointly. It is well known that various performance measures exist for multi-label learning, and the choice of performance measure largely affects the evaluation of the learned classifier. In [Wu and Zhou, 2017], two notions of multi-label margin, i.e. label-wise and instance-wise margins, are proposed to characterize different multi-label performance measures. According to their work, a specific multi-label performance measure tends to be biased towards one of the two margins, such that optimizing the corresponding margin also improves the measure. Based on this finding, the stochastic gradient descent (SGD) based LIMO algorithm is proposed to jointly optimize both margins, in order to achieve good performance on different measures simultaneously. The high-level formulation of LIMO's objective is
$$L_{\mathrm{LIMO}} = \Lambda + w_1 L_{\mathrm{label}} + w_2 L_{\mathrm{inst}}, \tag{12}$$
in which $L_{\mathrm{label}}$ and $L_{\mathrm{inst}}$ are two margin loss terms for maximizing the two margins, and $\Lambda$ is a regularization term. In $L_{\mathrm{LIMO}}$, the weights $w_1, w_2$ control the relative importance of the two loss terms, so different choices of the weights can significantly affect the performance. We will show that by utilizing DL2M and Adapt-Boost, we can automatically find the proper weights between the two margin losses in LIMO's objective, and efficiently adapt to different performance measures.

The experiments are conducted on six benchmark multi-label datasets: emotions, CAL500, enron, Corel5k, medical and bibtex (http://mulan.sourceforge.net/datasets-mlc.html). On each dataset, four multi-label performance measures are adopted for evaluation: ranking loss, coverage, average precision and one-error. Three LIMO based comparison methods are adopted as baselines: (i) LIMO-label, which optimizes $L_{\mathrm{LIMO}}$ with $w_1 = 1, w_2 = 0$; (ii) LIMO-inst, which optimizes $L_{\mathrm{LIMO}}$ with $w_1 = 0, w_2 = 1$; (iii) LIMO, which optimizes $L_{\mathrm{LIMO}}$ with $w_1 = 1, w_2 = 1$, the recommended parameters in the original paper. To evaluate DL2M and Adapt-Boost both separately and jointly, three adaptation based approaches are tested: (i) ADAPT-hypo, which optimizes $L_{\mathrm{LIMO}}$ with $w_1 = 1, w_2 = 1$ using Adapt-Boost; (ii) ADAPT-obj, which utilizes DL2M with SGD training; (iii) ADAPT-both, which utilizes both DL2M and Adapt-Boost. To implement DL2M, each dataset is randomly split into training, validation and test sets with a size ratio of 3:1:1. During the learning process, the preference feedback is generated by testing the learned hypothesis on the validation set, and DL2M is used to update the objective for 20 iterations, with $c = 0.05$ and $\lambda = 1$. For Adapt-Boost, to evaluate its efficiency, we use only half the number of training iterations of standard LIMO training. Furthermore, to make Adapt-Boost compatible with LIMO training, the SGD updates are utilized as the weak learners for adaptation.
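Putting the pieces together for the two-term LIMO objective, the outer adaptation loop looks roughly as follows. This is a sketch under our own assumptions: `train_for_weights` and `prefer_on_validation` are hypothetical placeholders for the actual LIMO/Adapt-Boost training routine and the validation-set comparison, and `dl2m` is the sketch from Section 3.

```python
import numpy as np

def adapt_limo_objective(train_for_weights, prefer_on_validation, T=20, c=0.05):
    """Outer loop of an ADAPT-both style pipeline for L_LIMO (K = 2).

    train_for_weights(w)              : trains (or adapts, via Adapt-Boost) a classifier
                                        under the weights w = (w1, w2) on the training split.
    prefer_on_validation(h_new, h_old): returns +1 if h_new beats h_old on the validation
                                        split under the target performance measure, else -1.
    """
    cache = {}                                     # hypotheses for the arms seen so far

    def oracle(w, w_prev):
        for arm in (tuple(w), tuple(w_prev)):
            if arm not in cache:
                # Remark 3: clip to the nonnegative quarter of the circle before training
                cache[arm] = train_for_weights(np.maximum(np.array(arm), 0.0))
        return prefer_on_validation(cache[tuple(w)], cache[tuple(w_prev)])

    theta_hat, _ = dl2m(oracle, K=2, T=T, c=c)     # dl2m as sketched in Section 3
    w_best = np.maximum(theta_hat, 0.0)
    w_best /= np.linalg.norm(w_best) + 1e-12       # final weights on the unit quarter circle
    return w_best, train_for_weights(w_best)
```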
The experimental results are shown in Table 1, and the average ranks across all experiments are shown in Table 2. It can be seen that the DL2M based method ADAPT-obj achieves better performance than LIMO, which assigns fixed weights to the two margin losses. This shows that DL2M can automatically identify the best trade-off among different element objectives. Furthermore, though running with far fewer training iterations, the Adapt-Boost based method ADAPT-hypo achieves even better performance than LIMO, which is based on standard SGD training. This verifies the efficiency of Adapt-Boost. ADAPT-both, which utilizes both adaptation methods, achieves superior performance. It shows that by utilizing DL2M and Adapt-Boost together, we can solve the objective and hypothesis adaptation problem better and faster.

| Dataset | Algorithm | ranking loss ↓ | coverage ↓ | avg. precision ↑ | one-error ↓ |
|---|---|---|---|---|---|
| emotions | LIMO-inst | .420±.051 (6) | 2.950±.134 (6) | .603±.028 (6) | .500±.047 (6) |
| emotions | LIMO-label | .349±.028 (5) | 2.745±.174 (5) | .619±.025 (5) | .509±.064 (5) |
| emotions | LIMO | .299±.023 (4) | 2.483±.070 (4) | .648±.028 (4) | .498±.057 (4) |
| emotions | ADAPT-hypo | .279±.026 (3) | 2.331±.090 (2) | .671±.032 (3) | .481±.048 (3) |
| emotions | ADAPT-obj | .268±.033 (2) | 2.377±.144 (3) | .673±.033 (2) | .478±.062 (2) |
| emotions | ADAPT-both | **.254±.020 (1)** | **2.298±.200 (1)** | **.678±.028 (1)** | **.465±.058 (1)** |
| CAL500 | LIMO-inst | .522±.026 (6) | 162.950±2.417 (6) | .153±.010 (6) | .971±.019 (6) |
| CAL500 | LIMO-label | .182±.005 (2) | 131.439±1.764 (5) | .496±.006 (5) | .099±.026 (2) |
| CAL500 | LIMO | .182±.004 (2) | 131.020±1.697 (2) | **.498±.006 (1)** | .131±.053 (5) |
| CAL500 | ADAPT-hypo | .182±.005 (2) | 131.297±1.899 (4) | .497±.007 (2) | .128±.036 (4) |
| CAL500 | ADAPT-obj | .182±.005 (2) | 131.088±1.849 (3) | .497±.007 (2) | **.098±.028 (1)** |
| CAL500 | ADAPT-both | **.181±.004 (1)** | **131.008±2.072 (1)** | .497±.008 (2) | .107±.024 (3) |
| enron | LIMO-inst | .229±.010 (6) | 25.166±.957 (6) | .504±.017 (6) | .350±.043 (6) |
| enron | LIMO-label | .087±.009 (3) | 12.362±.612 (5) | .672±.014 (4) | **.235±.024 (1)** |
| enron | LIMO | .089±.009 (5) | 12.199±.625 (4) | .670±.014 (5) | .246±.029 (3) |
| enron | ADAPT-hypo | .087±.009 (3) | 12.060±.648 (2) | .680±.013 (2) | .246±.022 (3) |
| enron | ADAPT-obj | .086±.009 (2) | **12.049±.624 (1)** | .675±.012 (3) | .251±.022 (5) |
| enron | ADAPT-both | **.085±.008 (1)** | 12.066±.577 (3) | **.683±.016 (1)** | .242±.027 (2) |
| Corel5k | LIMO-inst | .302±.006 (6) | 188.785±3.122 (6) | .101±.006 (6) | .893±.007 (6) |
| Corel5k | LIMO-label | .121±.005 (3) | 106.920±2.457 (5) | **.281±.005 (1)** | **.718±.017 (1)** |
| Corel5k | LIMO | .130±.004 (5) | 104.465±2.149 (4) | .222±.006 (5) | .793±.012 (5) |
| Corel5k | ADAPT-hypo | .121±.005 (3) | 101.668±2.141 (3) | .252±.006 (3) | .762±.014 (3) |
| Corel5k | ADAPT-obj | .118±.005 (2) | 100.478±3.376 (2) | .247±.010 (4) | .772±.013 (4) |
| Corel5k | ADAPT-both | **.114±.005 (1)** | **98.880±2.989 (1)** | .280±.007 (2) | .719±.016 (2) |
| medical | LIMO-inst | .019±.005 (2) | 1.781±.337 (5) | .857±.020 (5) | .192±.034 (5) |
| medical | LIMO-label | .028±.004 (6) | 2.326±.489 (6) | .830±.020 (6) | .216±.037 (6) |
| medical | LIMO | .020±.005 (4) | 1.563±.249 (3) | .869±.026 (2) | **.163±.028 (1)** |
| medical | ADAPT-hypo | .021±.004 (5) | 1.621±.246 (4) | .863±.023 (4) | .181±.031 (4) |
| medical | ADAPT-obj | .019±.004 (2) | 1.499±.340 (2) | **.874±.021 (1)** | .171±.034 (2) |
| medical | ADAPT-both | **.018±.004 (1)** | **1.447±.288 (1)** | .866±.025 (3) | .176±.036 (3) |
| bibtex | LIMO-inst | .120±.003 (6) | 32.751±1.144 (6) | .488±.007 (6) | .486±.018 (6) |
| bibtex | LIMO-label | .072±.003 (5) | 20.460±.515 (5) | .526±.007 (5) | .440±.018 (5) |
| bibtex | LIMO | .060±.002 (4) | 17.648±.596 (4) | .567±.007 (4) | .395±.014 (4) |
| bibtex | ADAPT-hypo | **.056±.002 (1)** | **16.708±.430 (1)** | .579±.007 (2) | .384±.013 (2) |
| bibtex | ADAPT-obj | .057±.003 (3) | 17.119±.832 (2) | .575±.006 (3) | .391±.013 (3) |
| bibtex | ADAPT-both | **.056±.002 (1)** | 17.128±.590 (3) | **.581±.006 (1)** | **.383±.013 (1)** |

Table 1: Experimental results for the adaptation based and LIMO based methods. For each measure, ↓ indicates "the smaller the better" and ↑ indicates "the larger the better". The results are shown in mean±std (rank) format, calculated from ten repeated experiments. The rank is calculated from the mean; the smaller the rank, the better the performance. The first-ranked results are bolded.
| Algorithm | LIMO-inst | LIMO-label | LIMO | ADAPT-hypo | ADAPT-obj | ADAPT-both |
|---|---|---|---|---|---|---|
| avg. rank | 5.71 | 4.21 | 3.67 | 2.83 | 2.42 | 1.58 |

Table 2: The average performance rank across all experiments.

## 6 Conclusion and Future Work

In this work, the preference based objective adaptation task is studied. The DL2M algorithm is proposed for this setting; it efficiently solves the objective adaptation problem based on the dueling bandit model. For better hypothesis adaptation, the Adapt-Boost method is proposed in order to adapt the pre-learned element classifiers to the new objective at low cost. To further investigate the objective adaptation problem, it is possible to relax the linear combination formulation of the objective function adopted in this work. We are also interested in applying the proposed approaches to other real-world problems, especially tasks in which human expert feedback can be utilized. Furthermore, it is also interesting to investigate Adapt-Boost on larger-scale problems, as well as to study its theoretical guarantees.

## Acknowledgement

This research is supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61751306) and the Collaborative Innovation Center of Novel Software Technology and Industrialization. Yao-Xiang Ding is supported by the Outstanding PhD Candidate Program of Nanjing University. The authors would like to thank the anonymous reviewers for constructive suggestions, as well as Lijun Zhang, Ming Pang, Xi-Zhu Wu and Yichi Xiao for helpful discussions.

## References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, pages 2312–2320, 2011.

Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. In AISTATS, pages 176–184, 2017.

Alekh Agarwal, Ashwinkumar Badanidiyuru, Miroslav Dudik, Robert E. Schapire, and Aleksandrs Slivkins. Robust multi-objective learning with mentor feedback. In COLT, pages 726–741, 2014.

Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang, and Belle Tseng. Multi-task learning for boosting with application to web search ranking. In KDD, pages 1189–1198, 2010.

Kalyanmoy Deb. Multi-objective optimization. In Search Methodologies, pages 403–449. Springer, 2014.

Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In KDD, pages 109–117, 2004.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394, 2005.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Wataru Kumagai. Regret analysis for continuous dueling bandit. In NIPS, pages 1488–1497, 2017.

Nan Li, Ivor W. Tsang, and Zhi-Hua Zhou. Efficient optimization of performance measures by classifier adaptation. IEEE TPAMI, 35(6):1370–1382, 2013.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. JMLR, 5(Aug):941–973, 2004.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Xi-Zhu Wu and Zhi-Hua Zhou. A unified view of multi-label performance measures. In ICML, pages 3780–3788, 2017.

Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML, pages 1201–1208, 2009.
Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. JCSS, 78(5):1538–1556, 2012.

Lijun Zhang, Tianbao Yang, Rong Jin, Yichi Xiao, and Zhi-Hua Zhou. Online stochastic linear optimization under one-bit feedback. In ICML, pages 392–401, 2016.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.