# Confidence Weighted Multitask Learning

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Peng Yang,¹ Peilin Zhao,² Jiayu Zhou,³ Xin Gao¹
¹King Abdullah University of Science and Technology, Saudi Arabia; ²Tencent AI Lab, China; ³Michigan State University, USA
{peng.yang.2, xin.gao}@kaust.edu.sa, masonzhao@tencent.com, jiayuz@msu.edu

## Abstract

Traditional online multitask learning utilizes only the first-order information of the data stream. To remedy this issue, we propose a confidence weighted multitask learning algorithm, which maintains a Gaussian distribution over each task model to guide the online learning process. The mean (covariance) of the Gaussian distribution is a sum of a local component and a global component shared among all tasks. In addition, this paper addresses the challenge of active learning in the online multitask setting. Instead of requiring labels for all instances, the proposed algorithm decides whether the learner should acquire a label by considering the confidence of its related tasks in the label prediction. Theoretical results show that the regret bounds can be significantly reduced. Empirical results demonstrate that the proposed algorithm achieves promising learning efficacy while simultaneously minimizing the labeling cost.

## Introduction

Multitask learning (MTL) aims to enhance overall generalization performance by learning the shared knowledge of multiple related tasks. Most existing work in multitask learning focuses on how to exploit task relationships, either by sharing model parameters via regularization techniques (Argyriou, Evgeniou, and Pontil 2008; Zhang and Yeung 2010; Yang, Zhao, and Gao 2017) or by learning from cross-task data directly (Crammer and Mansour 2012).
This paper focuses on a specific multitask setting in which tasks are allowed to query labels by interacting with other tasks on difficult cases, for example, recommending new products based on customers' preferences for old ones. Broadly, there are two settings for learning multiple tasks together: 1) batch learning, where the entire training set is available to the learner before training; and 2) online learning, where the model is trained over data streams. In contrast to batch learning, which often suffers from expensive re-training costs whenever new training data arrive, online learning avoids re-training and learns incrementally from sequential data, which is much more efficient (Hoi et al. 2018; Zhao et al. 2011). Online MTL (OMTL) has been intensively studied: each task learner receives an instance on each round and predicts its label; the true label is then revealed and the learner updates the model as necessary. Previous studies (Dekel, Long, and Singer 2007; Yang and Zhao 2015; Murugesan et al. 2016) mostly utilized first-order information of the data stream; few studied second-order algorithms. To remedy this issue, we propose a confidence weighted MTL. The confidence, estimated through a local-global Gaussian distribution over each task model, guides the direction and scale of parameter updates. Specifically, updates not only fix prediction errors but also increase confidence. However, such a method assumes that true labels are readily provided for all tasks, which is impractical in settings where tasks (e.g., minority languages, new products) naturally have very few labels. Online active learning (OAL) addresses this concern by letting the learner decide whether or not to request the true label of the current instance.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Most OAL techniques decide whether to query by estimating confidence in the current prediction (Cesa-Bianchi, Gentile, and Zaniboni 2006; Dekel, Gentile, and Sridharan 2012). In the multitask setting, one can further reduce the total number of required labels by interacting with related peer tasks. This paper proposes an active multitask learning method, where the learner decides whether to query by considering the confidence in the label prediction from related peer tasks. The key idea is that when a task model is uncertain about its prediction, it consults its peers. Theoretical results show that the regret bounds can be significantly reduced. Empirical results demonstrate that the proposed technique minimizes learning errors and labeling cost simultaneously.

## Related Work

Existing work on OMTL focuses on how to exploit task relationships. To achieve this, Lugosi et al. (Lugosi, Papaspiliopoulos, and Stoltz 2009) imposed a hard constraint across the task model parameters; Agarwal et al. (Agarwal, Rakhlin, and Bartlett 2008) used matrix regularization on the model weights, such as the nuclear norm (Ding et al. 2018); and Dekel et al. (Dekel, Long, and Singer 2007) learned the task relationship via a global loss function. However, all of these works assume that the true label is provided for each instance. OAL addresses this problem by deciding whether to query the label over the data stream (Cesa-Bianchi, Gentile, and Zaniboni 2006; Dekel, Gentile, and Sridharan 2012). It extends smoothly to the multitask setting by applying active learning to each local task separately or to several related tasks. Saha et al. (Saha et al. 2011) utilized a learned task-relationship matrix to query labels of instances with more shared information. Murugesan et al. (Murugesan and Carbonell 2017) determined queries based on first-order information of the predicted margin.
To save labeling cost, our query decision is made by leveraging the Gaussian distribution of the predicted margin, which captures informative instances more effectively than prior solutions. This work is motivated by recent OAL techniques in the multitask setting (Saha et al. 2011; Murugesan and Carbonell 2017), where a query is referred to peer tasks if the single task model is uncertain about its prediction. Our query strategy differs in two aspects: 1) the query decision is made by consulting the local-global knowledge of multiple tasks; and 2) the query confidence is estimated from the Gaussian distribution of the predicted margin. Finally, our method is related to confidence weighted (CW) learning (Crammer, Dredze, and Pereira 2009). Different from (Li et al. 2014), we adapt the CW technique to a multitask setting that performs active learning across the multiple tasks.

## Algorithm

In this section, we first introduce the OMTL problem, then solve it with an active learning strategy.

### Problem Setting

Suppose there are $K$ tasks and each task $k$ has a sequence of $T_k$ training examples. In this work, we consider an online binary classification problem for each task. Let $\{(x^k_t, y^k_t)\}_{t=1}^{T_k}$ be the set of instance-label pairs for task $k$, where $x^k_t \in \mathbb{R}^d$ is the $t$-th instance and $y^k_t \in \{\pm 1\}$ is its true label. Let $\{w^k\}_{k \in [K]}$ be a set of arbitrary vectors with $w^k \in \mathbb{R}^d$. Given a model weight $w$, we denote by $\ell_t(w)$ its instantaneous loss and by $L^k_T(w) = \sum_{t=1}^{T_k} \ell_t(w)$ its cumulative loss. The goal is to achieve low regret compared with the best linear function. Formally, the regret of a model is

$$\mathrm{Regret} = \sum_{t=1}^{T_k} \ell_t(w_t) - \inf_{w} L^k_T(w).$$

The objective is to let the loss of the online algorithm converge to the loss of the best linear function $w$.
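To make the regret quantity concrete, here is a minimal pure-Python sketch of the cumulative hinge loss $L_T(w)$ of a fixed comparator over a stream; the data and the comparator weight are illustrative, not from the paper:

```python
# Sketch: cumulative hinge loss of a fixed linear model over a stream.
# The stream and the comparator weight below are hypothetical.

def hinge(w, x, y):
    """Instantaneous hinge loss l(w) = max(0, 1 - y * <w, x>)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

def cumulative_loss(w, stream):
    """L_T(w) = sum_t l_t(w) over instance-label pairs."""
    return sum(hinge(w, x, y) for x, y in stream)

stream = [([1.0, 0.0], +1), ([0.0, 1.0], -1), ([0.5, 0.5], +1)]
w = [1.0, -1.0]
print(cumulative_loss(w, stream))  # per-round losses 0, 0, 1 -> 1.0
```

The regret then compares the online learner's accumulated losses $\sum_t \ell_t(w_t)$ against this quantity minimized over all fixed $w$.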
### Confidence Weighted Multitask Learning

We propose a local-global MTL framework in which local and global memories store parts of the weight vector of each task, motivated by (Evgeniou and Pontil 2004). Formally, the task weight $w^k_t$ is modeled in terms of local and global memories on round $t$:

$$w^k_t = u_t + v^k_t, \tag{1}$$

where the global memory $u$ captures the interdependent information across all tasks, while the local memory $v^k$ learns the unique characteristics of a single task. When the offset $v^k$ is small, the tasks are similar to each other.

To better explore the second-order structure of the parameter weights, motivated by (Crammer, Dredze, and Pereira 2009), we assume that a weight $w$ follows a Gaussian distribution $w \sim \mathcal{N}(\mu, \Sigma)$ with mean vector $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. The values $\mu_i$ and $\Sigma_{i,i}$ encode the model's knowledge of and confidence in the weight $w_i$: the smaller the value of $\Sigma_{i,i}$, the more confident the learner is in the mean value $\mu_i$. The covariance term $\Sigma_{i,j}$ captures the interaction between $w_i$ and $w_j$. To adapt confidence weighting to the local-global setting in Eq. (1), we begin with the following lemma:

**Lemma 1.** If $u \sim \mathcal{N}(p, A)$ and $\{v^k \sim \mathcal{N}(q^k, B^k)\}_{k=1}^K$ are mutually independent normal random variables, then the linear combination $w^k = u + v^k$ follows the Gaussian distribution $w^k \sim \mathcal{N}(p + q^k, A + B^k)$.

*Proof.* The proof is in the Supplementary Material.¹

Given an instance $(x^k_t, y^k_t)$ at round $t$, the local-global parameters adjust their distributions so that the probability of a correct prediction is at least $\eta > 0$:

$$\min_{p, A, \{q^k, B^k\}} \sum_{k=1}^K D_{KL}\big(\mathcal{N}(q^k, B^k)\,\|\,\mathcal{N}(q^k_{t-1}, B^k_{t-1})\big) + D_{KL}\big(\mathcal{N}(p, A)\,\|\,\mathcal{N}(p_{t-1}, A_{t-1})\big)$$
$$\text{s.t.}\quad \Pr_{w^k \sim \mathcal{N}(p + q^k,\, A + B^k)}\big[y^k_t (w^k \cdot x^k_t) \ge 0\big] \ge \eta,$$

where $D_{KL}$ is the Kullback-Leibler divergence,

$$D_{KL}\big(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_t, \Sigma_t)\big) = \frac{1}{2}\log\frac{|\Sigma_t|}{|\Sigma|} + \frac{1}{2}\mathrm{Tr}\big(\Sigma_t^{-1}\Sigma\big) + \frac{1}{2}\|\mu_t - \mu\|^2_{\Sigma_t^{-1}} - \frac{d}{2},$$

$|\cdot|$ is the determinant of a matrix and $\mathrm{Tr}(\cdot)$ is the trace. In the following lemma, the probability constraint is reformulated through the Gaussian distribution of the margin.

**Lemma 2.**
The predicted margin on $(x^k, y^k)$ under the model $w^k \sim \mathcal{N}(p + q^k, A + B^k)$ follows the Gaussian distribution

$$y^k (w^k \cdot x^k) \sim \mathcal{N}\big(y^k\,(p + q^k)\cdot x^k,\; x^{k\top}(A + B^k)\,x^k\big),$$

so the probability constraint can be written explicitly as

$$y^k\,(p + q^k)\cdot x^k - \phi \sqrt{x^{k\top}(A + B^k)\,x^k} \ge 0,$$

where $\phi = \Phi^{-1}(\eta) > 0$ and $\Phi$ is the Gaussian cumulative distribution function.

*Proof.* The proof is in the Supplementary Material.¹

¹ https://github.com/YoungBigBird1985/Second-Order-Online-Multitask-Learning

We tackle the variance variable directly so that the problem becomes convex, replacing $y^k\,(p + q^k)\cdot x^k$ with the hinge loss $\ell(p + q^k) = \max\big(0,\, 1 - y^k\,(p + q^k)\cdot x^k\big)$. Recasting the constraint as a regularizer, we solve the following unconstrained problem:

$$\min_{p, A, \{q^k, B^k\}} \sum_{k=1}^K D_{KL}\big(\mathcal{N}(q^k, B^k)\,\|\,\mathcal{N}(q^k_{t-1}, B^k_{t-1})\big) + D_{KL}\big(\mathcal{N}(p, A)\,\|\,\mathcal{N}(p_{t-1}, A_{t-1})\big) + \frac{1}{2\epsilon}\,\ell_t(p + q^k) + \frac{1}{2\lambda}\, x^{k\top}_t (A + B^k)\, x^k_t, \tag{2}$$

where $\epsilon$ and $\lambda$ are positive tradeoff parameters. The objective (2) trades off the distribution divergence (the first two terms), the loss function (the third term) and the predicted variance (the fourth term). In other words, the objective makes the least adjustment that minimizes the suffered loss and maximizes the confidence of the prediction. To solve this problem, we exploit block coordinate descent (Tseng 2001) to optimize the local and global memory variables alternately.

### Optimizing Local Memory

The parameter $B^k$ is solved from

$$f(B^k) = \log\frac{|B^k_{t-1}|}{|B^k|} + \mathrm{Tr}\big((B^k_{t-1})^{-1} B^k\big) + \frac{1}{\lambda}\, x^{k\top}_t B^k x^k_t.$$

Applying the KKT condition on $B^k$ gives

$$(B^k_t)^{-1} = (B^k_{t-1})^{-1} + \frac{1}{\lambda}\, x^k_t x^{k\top}_t.$$

By the Sherman-Morrison formula (Sherman and Morrison 1950), $B^k_t$ can be updated efficiently with time complexity $O(d^2)$:

$$B^k_t = B^k_{t-1} - \frac{B^k_{t-1}\, x^k_t x^{k\top}_t\, B^k_{t-1}}{\lambda + x^{k\top}_t B^k_{t-1} x^k_t}. \tag{3}$$

Setting $p = p_{t-1}$, $q^k$ is solved under the hinge loss and the squared hinge loss, respectively.

**Lemma 3.** Let $\hat e^k_t = (p_{t-1} + q^k_{t-1})^\top x^k_t$.
Whenever $y^k_t \ne \mathrm{sign}(\hat e^k_t)$, we solve the problem

$$f(q^k) = \|q^k - q^k_{t-1}\|^2_{(B^k_t)^{-1}} + \frac{1}{\epsilon}\,\ell_t(p_{t-1} + q^k),$$

whose optimal solution is

$$q^k_t = q^k_{t-1} + g^k_t\, y^k_t\, B^k_t\, x^k_t, \tag{4}$$

with

$$g^k_t = \frac{\max\{0,\, 1 - y^k_t \hat e^k_t\}}{\epsilon + x^{k\top}_t B^k_t x^k_t} \;\; \text{(squared hinge)}, \qquad g^k_t = \min\Big\{\frac{1}{2\epsilon},\; \frac{\max\{0,\, 1 - y^k_t \hat e^k_t\}}{x^{k\top}_t B^k_t x^k_t}\Big\} \;\; \text{(hinge)}.$$

*Proof.* The proof is in the Supplementary Material.¹

### Optimizing Global Memory

The global parameter $A$ is optimized from

$$f(A) = \log\frac{|A_{t-1}|}{|A|} + \mathrm{Tr}\big(A_{t-1}^{-1} A\big) + \frac{1}{\lambda}\,\mathrm{Tr}\big(X_t^\top A X_t\big),$$

where $X_t = [x^1_t, x^2_t, \ldots, x^K_t] \in \mathbb{R}^{d \times K}$. By the Woodbury matrix identity, $A$ can be updated as

$$A_t = A_{t-1} - A_{t-1} X_t\, C_{t-1}^{-1}\, X_t^\top A_{t-1}, \tag{5}$$

where $C_{t-1} = \lambda I_K + X_t^\top A_{t-1} X_t$ is positive-definite and $I_K \in \mathbb{R}^{K \times K}$ is the identity matrix. The matrix inverse in Eq. (5) takes $O(K^3 + d^2 K)$ time, which is acceptable when the number of tasks $K$ is small.

Let $z^k_t = \mathbb{I}(y^k_t \ne \hat y^k_t)$, where $\mathbb{I}(\cdot)$ is the indicator function; $p$ is solved from

$$f(p) = \|p - p_{t-1}\|^2_{A_t^{-1}} + \frac{1}{\epsilon}\sum_{k=1}^K z^k_t\, \ell_t(p + q^k_{t-1}).$$

Setting the derivative $\nabla_p f(p)$ to zero yields

$$p_t = p_{t-1} + \frac{1}{2\epsilon}\, A_t \sum_{k=1}^K z^k_t\, y^k_t\, x^k_t. \tag{6}$$

We summarize the confidence weighted multitask learning procedure in Algorithm 1.

```
Algorithm 1  CWMT: Confidence Weighted Multitask Learning
 1: Input: λ, ϵ > 0.
 2: Output: p_T, A_T, {q^k_T, B^k_T}, k ∈ [K].
 3: Initialize: p_0 = q^k_0 = 0, A_0 = B^k_0 = I, ∀k ∈ [K].
 4: for t = 1, ..., T do
 5:   for (local update) k = 1, ..., K in parallel do
 6:     Receive x^k_t and let µ^k_{t-1} = p_{t-1} + q^k_{t-1};
 7:     Compute ŷ^k_t = sign(µ^k_{t-1} · x^k_t);
 8:     If y^k_t ≠ ŷ^k_t, update B^k_t by Eq. (3) and q^k_t by Eq. (4);
 9:   end for
10:   Reduce (global update): aggregate {z^1_t, ..., z^K_t};
11:   Update A_t by Eq. (5) and p_t by Eq. (6);
12: end for
```

Algorithm 1 uses a conservative strategy, updating the model only when an error occurs. Unlike first-order techniques (Murugesan et al. 2016), it captures second-order information by exploiting the confidence of the weight parameters under the Gaussian distribution. Unlike the CW-based method (Li et al.
2014), we provide a theoretical analysis of CW-based MTL in Theorem 1.

**Theorem 1.** Let $\{(x^k_t, y^k_t)\}_{t=1}^T$ be a sequence of samples for any task $k \in [K]$, where $x^k_t \in \mathbb{R}^d$ and $y^k_t \in \{\pm 1\}$. For any $\mu \in \mathbb{R}^d$ under the convex loss $\ell(\mu)$, CWMT satisfies

$$\mathrm{Regret} \le \frac{\lambda \log(1 + KT)}{4\epsilon} + \epsilon\,(D(\mu))^2\, \mathrm{Tr}\big((A_T + B^k_T)^{-1}\big),$$

where $D(\mu) = \max_t \|(p_t + q^k_t) - \mu\|$.

*Proof.* The proof is in the Supplementary Material.¹

**Discussion.** Setting $\epsilon = \sqrt{\frac{\lambda \log(1 + KT)}{(D(\mu))^2\,\mathrm{Tr}((A_T + B^k_T)^{-1})}}$, the regret is of order $\sqrt{\lambda\,(D(\mu))^2\,\mathrm{Tr}\big((A_T + B^k_T)^{-1}\big)\log(1 + KT)}$. Let $\Sigma^k = A + B^k$; then $A \preceq \Sigma^k$ and $B^k \preceq \Sigma^k$ yield $(\Sigma^k)^{-1} \preceq A^{-1}$ and $(\Sigma^k)^{-1} \preceq (B^k)^{-1}$. Assuming $\|x_t\|^2 \le 1$, we have $\mathrm{Tr}\big((\Sigma^k)^{-1}\big) \le \frac{1}{2}\mathrm{Tr}(A^{-1}) + \frac{1}{2}\mathrm{Tr}\big((B^k)^{-1}\big) \le O\big(\frac{(K+1)T}{\lambda}\big)$. Thus the regret is of order $O(\sqrt{KT})$.

**Remark.** Since $\mathrm{Regret} = \sum_{t=1}^{T_k} \ell^{hinge}_t(w^k_t) - \inf_w L^k_T(w)$ and $\sum_t \ell^{hinge}_t(\cdot) \ge \sum_t \ell^{0\text{-}1}_t(\cdot) = |\mathcal{M}|$, we have $\mathrm{Regret} \ge |\mathcal{M}| - \inf_w L^k_T(w)$. This implies a relative mistake bound:

$$\mathbb{E}[M] - \inf_w L^k_T(w) \le \frac{\lambda}{4\epsilon}\log(1 + KT) + \epsilon\,(D(\mu))^2\,\mathrm{Tr}\big((A_T + B^k_T)^{-1}\big). \tag{7}$$

### Active Confidence Weighted Multitask Learning

Unlike the online algorithm, which retrieves the label of every instance, active learning respects a labeling budget and must decide whether to query the true label of the current instance $x_t$. If the label is queried, the algorithm can update the learner with $y_t$; otherwise, no action is performed and the learner proceeds to the next instance. Query and update decisions are encoded by binary variables $Q_t$ and $Z_t$ at time $t$: $Q_t = 1$ if $y_t$ is queried, and $0$ otherwise; $Z_t$ is defined analogously for updates. In general, optimizing the local memory $(B, q)$ with query and update decisions on round $t$ yields

$$B_t^{-1} = B_{t-1}^{-1} + Q_t Z_t\,\frac{x_t x_t^\top}{\lambda}, \qquad q_t = q_{t-1} + Q_t Z_t\, g_t\, y_t\, B_t x_t.$$

The goal of active learning is to make few mistakes with a small query count $\sum_t Q_t$. Next we propose a query method that performs active learning across multiple tasks.

### Adaptive-Margin Query

The target of the active learning algorithm is to make few mistakes with a small number of queries.
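As a numerical sanity check on the covariance recursions above, the Sherman-Morrison update of $B^k$ (Eq. (3)) and the Woodbury update of $A$ (Eq. (5)) can be verified against their inverse-space forms. A minimal numpy sketch, where the dimensions, seed, and helper names are ours:

```python
import numpy as np

def update_B(B, x, lam):
    """Eq. (3): local covariance update via Sherman-Morrison, O(d^2);
    equivalent to (B_t)^{-1} = (B_{t-1})^{-1} + x x^T / lam."""
    Bx = B @ x
    return B - np.outer(Bx, Bx) / (lam + x @ Bx)

def update_A(A, X, lam):
    """Eq. (5): global covariance update via the Woodbury identity,
    with C = lam * I_K + X^T A X, at O(K^3 + d^2 K) cost."""
    C = lam * np.eye(X.shape[1]) + X.T @ A @ X
    return A - A @ X @ np.linalg.solve(C, X.T @ A)

rng = np.random.default_rng(0)
lam, d, K = 100.0, 5, 3
x = rng.standard_normal(d)
X = rng.standard_normal((d, K))   # one instance per task, stacked as columns

B_t = update_B(np.eye(d), x, lam)
A_t = update_A(np.eye(d), X, lam)

# Both closed forms match the direct inverse-space updates.
print(np.allclose(np.linalg.inv(B_t), np.eye(d) + np.outer(x, x) / lam))  # -> True
print(np.allclose(np.linalg.inv(A_t), np.eye(d) + X @ X.T / lam))         # -> True
```

In the active setting these updates are simply gated by $Q_t Z_t$: they are applied only on rounds where a label was both queried and used.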
To achieve this goal, we propose a randomized query strategy based on a confidence score $\Theta_t$: given $h > 0$, the binary variable $Q_t$ is sampled from a Bernoulli distribution with parameter $h / (h + \max(0, \Theta_t))$; if $Q_t = 1$, the actual label $y_t$ is queried; otherwise no query is performed. Randomized query strategies have been studied in (Dekel, Gentile, and Sridharan 2012); we extend them to the multitask setting by performing active learning across multiple tasks. We begin with the additional notation

$$a^k_t = x^{k\top}_t A_{t-1}\, x^k_t, \qquad b^k_t = x^{k\top}_t B^k_{t-1}\, x^k_t.$$

Using the Sherman-Morrison formula, $x^{k\top}_t \Sigma^k_t x^k_t = x^{k\top}_t A_t x^k_t + x^{k\top}_t B^k_t x^k_t$, where the two terms evaluate to

$$x^{k\top}_t\Big(A_{t-1} - \frac{A_{t-1} x^k_t x^{k\top}_t A_{t-1}}{\lambda + x^{k\top}_t A_{t-1} x^k_t}\Big) x^k_t = \frac{a^k_t}{1 + a^k_t/\lambda}, \qquad x^{k\top}_t\Big(B^k_{t-1} - \frac{B^k_{t-1} x^k_t x^{k\top}_t B^k_{t-1}}{\lambda + x^{k\top}_t B^k_{t-1} x^k_t}\Big) x^k_t = \frac{b^k_t}{1 + b^k_t/\lambda}.$$

**Definition 1.** Assume a task weight $w^k \sim \mathcal{N}(\mu^k, \Sigma^k)$ with mean $\mu^k = p + q^k$ and variance $\Sigma^k = A + B^k$. At round $t$, the algorithm queries with probability $\frac{h}{h + \max(0, \Theta^k_t)}$ ($h > 0$), where the score $\Theta^k_t$ is defined as

$$\Theta^k_t = |\Delta^k_t| - \frac{1}{4\epsilon}\, C^k_t, \tag{8}$$

where $|\cdot|$ is the absolute value, $\Delta^k_t = (p_{t-1} + q^k_{t-1})^\top x^k_t$, and $C^k_t = \frac{a^k_t}{1 + a^k_t/\lambda} + \frac{b^k_t}{1 + b^k_t/\lambda}$.

$\Theta_t$ is a function of two quantities, $|\Delta_t|$ and $C_t$. The variable $|\Delta_t|$ is the distance of an input to the decision boundary, known as the margin (Cesa-Bianchi, Gentile, and Zaniboni 2006), while $C_t$ is the confidence of the current prediction, known as the variance. In this way, $|\Delta_t| - \frac{1}{4\epsilon} C_t$ acts as a lower confidence bound on the prediction. The query probability $h/(h + \max(0, \Theta_t))$ is reduced only when an input is far from the boundary (a large $|\Delta_t|$) and its prediction has low variance (a small $C_t$); in that case the predicted result is reliable and the label can be omitted safely. Similar to (Murugesan and Carbonell 2017), the confidence $\Theta_t$ is estimated by exploiting the local-global knowledge of multiple tasks.
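The score in Eq. (8) and the resulting query probability can be computed with scalar arithmetic alone. A small pure-Python sketch (function names are ours; the parameter values are illustrative):

```python
def confidence_score(delta, a, b, lam, eps):
    """Eq. (8): Theta = |delta| - C / (4 * eps), where
    C = a / (1 + a/lam) + b / (1 + b/lam) combines the global
    and local predictive variances."""
    C = a / (1.0 + a / lam) + b / (1.0 + b / lam)
    return abs(delta) - C / (4.0 * eps)

def query_probability(theta, h):
    """Pr[Q_t = 1] = h / (h + max(0, Theta)); equals 1 when Theta <= 0."""
    return h / (h + max(0.0, theta))

# Zero margin with nonzero variance -> Theta <= 0 -> deterministic query.
print(query_probability(confidence_score(0.0, 1.0, 1.0, 100.0, 100.0), h=0.1))  # -> 1.0
# Large margin with zero variance -> query probability drops well below 1.
print(query_probability(confidence_score(5.0, 0.0, 0.0, 100.0, 100.0), h=0.1))
```

This makes the two regimes explicit: uncertain or boundary-adjacent inputs are always labeled, while confident, large-margin inputs are labeled only occasionally.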
The idea is that when a single task is uncertain about its prediction, it can consult the global memory, which encodes the knowledge of similar tasks, to help make the query decision.

Intuitively, a query strategy is effective only if it controls the probability of making a mistake when the label is not queried. To analyze the effectiveness of the query method, we derive an error bound for our approach, which learns only from the queried labels $\{t : Q_t = 1\}$. For the randomized queries, the trials can be partitioned into two disjoint sets: $\mathcal{S} = \{t : \frac{h}{h + \max(0, \Theta_t)} < 1\}$, the indices on which a stochastic query is conducted, and $\mathcal{D} = \{t : \frac{h}{h + \max(0, \Theta_t)} = 1\}$, the indices on which the query is deterministic. The expected number of queries (i.e., the expected labeling cost) is then upper bounded by $|\mathcal{D}| + \sum_{t \in \mathcal{S}} \frac{h}{h + \Theta_t}$.

Let $Z_t = 1$ if a mistake occurs at round $t$, i.e., $\hat y_t \ne y_t$ after the true label is queried. We denote by $\mathcal{M} = \{t : y_t \Delta_t \le 0\}$ the set of mistake trials with $M = |\mathcal{M}|$, and by $U_T = \{t : Z_t Q_t = 1, t \in [T]\}$ the set of update trials.

**Theorem 2.** Assume $\{(x^k_t, y^k_t)\}_{t=1}^T$ is a sequence of instances for any task $k \in [K]$, where $x^k_t \in \mathbb{R}^d$ and $y^k_t \in \{\pm 1\}$. CWMT learns from the queried labels $\{t : Q^k_t = 1\}$, where $Q^k_t$ is sampled with probability $\frac{h}{h + \max(0, \Theta^k_t)}$. For any $\mu \in \mathbb{R}^d$, it satisfies

$$\mathbb{E}[M] \le \sum_{t=1}^T \ell_t(p + q^k) + \frac{\lambda}{4 h \epsilon}\log(1 + KT) + h\,\mathbb{E}\Big[(D(\mu))^2\,\mathrm{Tr}\big((A_{U_T} + B^k_{U_T})^{-1}\big)\Big], \tag{9}$$

where $D(\mu) = \max_t \|(p_t + q^k_t) - h\mu\|$.

*Proof.* The proof is in the Supplementary Material.¹

```
Algorithm 2  ACWMT: Active Confidence Weighted Multitask Learning
 1: Input: λ, ϵ > 0 and h > 0.
 2: Output: p_T, A_T, {q^k_T, B^k_T}, k ∈ [K].
 3: Initialize: p_0 = q^k_0 = 0, A_0 = B^k_0 = I, ∀k ∈ [K].
 4: for t = 1, ..., T do
 5:   for (local update) k = 1, ..., K in parallel do
 6:     Receive x^k_t and compute ŷ^k_t = sign(µ^k_{t-1} · x^k_t);
 7:     Compute Θ^k_t by Eq. (8);
 8:     if Θ^k_t ≤ 0 then (aggressive update)
 9:       Set Q^k_t = Z^k_t = 1 and query the true label;
10:     else (stochastic update)
11:       Sample Q^k_t ~ Bernoulli(h / (h + Θ^k_t));
12:       Query if Q^k_t = 1, and let Z^k_t = 1 if y^k_t ≠ ŷ^k_t;
13:     end if
14:     If Q^k_t Z^k_t = 1, update B^k_t by Eq. (3) and q^k_t by Eq. (4);
15:   end for
16:   Reduce (global update): aggregate {Z^1_t, ..., Z^K_t};
17:   Update A_t by Eq. (5) and p_t by Eq. (6);
18: end for
```

**Discussion.** Replacing $\sum_{t=1}^T \ell_t(p + q^k)$ with the loss of the best model, $\inf_\mu \sum_{t=1}^T \ell_t(\mu)$ where $\mu = p + q^k$, turns Eq. (9) into a relative mistake bound (Abernethy, Bartlett, and Rakhlin 2007). CWMT trained on actively selected labels thus achieves a mistake bound comparable to the one trained on all labels in Eq. (7), which demonstrates the efficacy of the proposed query strategy.

### Adaptive Optimization

We solve the active multitask learning problem with adaptive optimization. The algorithm maintains $(B^k, q^k)$ for each local memory and $(A, p)$ for the global memory. At round $t$, the learner receives an input $x^k_t$ and predicts its binary label $\hat y^k_t = \mathrm{sign}(\mu^{k\top}_{t-1} x^k_t)$. Its actual label $y^k_t$ is then revealed with probability $\frac{h}{h + \max(0, \Theta^k_t)}$, which yields either a stochastic or a deterministic query. In a stochastic query ($\frac{h}{h + \max(0, \Theta^k_t)} < 1$), the update is mistake-driven: if an error occurs ($\hat y^k_t \ne y^k_t$), the algorithm updates the local memory recursively; otherwise, no action is performed and we proceed to the next round with $B^k_t = B^k_{t-1}$ and $q^k_t = q^k_{t-1}$. A deterministic query is issued when $\frac{h}{h + \max(0, \Theta^k_t)} = 1$; in this case an aggressive update is performed, i.e., the model is updated even if no error occurs. Afterwards, the global update of $(p, A)$ is conducted by aggregating the update decisions $\{Z^k\}_{k=1}^K$ from the local tasks. We summarize this algorithm, ACWMT (active confidence weighted multitask learning), in Algorithm 2.
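The full round structure of Algorithm 2 can be sketched end-to-end on synthetic data. In this sketch the synthetic tasks, hyperparameter values, and tie-breaking details are our assumptions, and the global mean step follows our reading of Eq. (6); it is an illustration of the control flow, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
K, d, T, lam, eps, h = 3, 5, 200, 100.0, 1.0, 0.1

# Synthetic related tasks: a shared direction plus small per-task offsets.
u_true = rng.standard_normal(d)
tasks = [u_true + 0.1 * rng.standard_normal(d) for _ in range(K)]

p, A = np.zeros(d), np.eye(d)
q = [np.zeros(d) for _ in range(K)]
B = [np.eye(d) for _ in range(K)]
queries = mistakes = 0

for t in range(T):
    updates = []                      # (x, y) pairs of tasks that updated
    for k in range(K):
        x = rng.standard_normal(d)
        y = 1 if tasks[k] @ x >= 0 else -1
        mu = p + q[k]
        y_hat = 1 if mu @ x >= 0 else -1
        if y_hat != y:
            mistakes += 1
        # Adaptive-margin query score, Eq. (8).
        a, b = x @ A @ x, x @ B[k] @ x
        C = a / (1 + a / lam) + b / (1 + b / lam)
        theta = abs(mu @ x) - C / (4 * eps)
        Qk = 1 if theta <= 0 else int(rng.random() < h / (h + theta))
        if Qk:
            queries += 1
            Zk = 1 if (theta <= 0 or y_hat != y) else 0   # aggressive vs. mistake-driven
            if Zk:
                Bx = B[k] @ x
                B[k] = B[k] - np.outer(Bx, Bx) / (lam + x @ Bx)     # Eq. (3)
                g = min(1 / (2 * eps),
                        max(0.0, 1 - y * (mu @ x)) / (x @ B[k] @ x))  # hinge coefficient
                q[k] = q[k] + g * y * B[k] @ x                       # Eq. (4)
                updates.append((x, y))
    if updates:                       # global reduce, Eqs. (5)-(6)
        Xt = np.stack([x for x, _ in updates], axis=1)
        Ct = lam * np.eye(Xt.shape[1]) + Xt.T @ A @ Xt
        A = A - A @ Xt @ np.linalg.solve(Ct, Xt.T @ A)
        p = p + (1 / (2 * eps)) * A @ sum(y * x for x, y in updates)

print(f"mistakes={mistakes}, queries={queries} of {K * T} labels")
```

Note how the query indicator gates both the local covariance/mean updates and the contribution of each task to the global reduce step.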
To further understand this algorithm, we derive the condition under which an update is issued aggressively. An aggressive update occurs when $\Theta^k_t \le 0$, i.e.,

$$|\Delta^k_t| \le \theta^k_t(a, b) = \frac{1}{4\epsilon}\Big(\frac{a^k_t}{1 + a^k_t/\lambda} + \frac{b^k_t}{1 + b^k_t/\lambda}\Big).$$

If $|\Delta^k_t|$ falls below $\theta^k_t(a, b)$, a deterministic query/update is conducted, while if $|\Delta^k_t|$ is above $\theta^k_t(a, b)$, the label is queried with probability less than $1$. The threshold $\theta^k_t(a, b)$ increases with $a$ and $b$. Since $0 \preceq A_t, B_t \preceq I$ and $\|x_t\| \le 1$, it follows that $0 \le a^k_t, b^k_t \le 1$. When $a^k_t = b^k_t = 0$, i.e., minimal uncertainty, a deterministic query is issued only if the input lies exactly on the boundary ($|\Delta^k_t| \le \theta^k_t(a, b) = 0$). When $a^k_t = b^k_t = 1$, i.e., maximal uncertainty, an aggressive query is issued whenever the margin satisfies $|\Delta^k_t| \le \theta^k_t(a, b) = \frac{\lambda}{2\epsilon(\lambda + 1)}$.

Table 1: Description of the data sets.

| | Spam Email | MHC-I | Each Movie |
|---|---|---|---|
| #Tasks | 4 | 12 | 30 |
| #Samples | 7068 | 18664 | 6000 |
| #Dimensions | 1458 | 400 | 1783 |
| #Max Samples | 4129 | 3793 | 200 |
| #Min Samples | 710 | 415 | 200 |

In the following analysis, we show the superiority of ACWMT. Besides the stochastic query trials $\mathcal{S}$ and the deterministic query trials $\mathcal{D}$, we denote by $\mathcal{V} = \{t : y_t \Delta_t > 0,\ \Theta_t \le 0\}$ the set of trials with an aggressive update but no prediction error, and let $V = |\mathcal{V}|$.

**Theorem 3.** Let $\{(x^k_t, y^k_t)\}_{t=1}^T$ be an instance-label sequence for any task $k \in [K]$. Under the setting of Theorem 2, the proposed ACWMT satisfies

$$\mathbb{E}[M] \le \sum_{t=1}^T \ell(\mu; x^k_t, y^k_t) + \frac{\lambda}{4 h \epsilon}\log(1 + KT) + h\,\mathbb{E}\Big[(D(\mu))^2\,\mathrm{Tr}\big((A_{U_T} + B^k_{U_T})^{-1}\big)\Big] - \mathbb{E}[V].$$

The expected number of updates is $\mathbb{E}\big[|\mathcal{D}| + \sum_{t \in \mathcal{S} \cap \mathcal{M}} \frac{h}{h + \Theta_t}\big]$.

*Proof.* The proof is in the Supplementary Material.¹

**Discussion.** In expectation, the error bound of ACWMT is lower than that of the conservative-update algorithm in Theorem 2, due to the subtraction of $\mathbb{E}[V]$, the expected number of aggressive update trials. This provides theoretical support for our aggressive strategy.
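The two extreme cases of the aggressive threshold can be checked numerically. A short pure-Python sketch (the helper name is ours):

```python
def aggressive_threshold(a, b, lam, eps):
    """theta(a, b) = (1 / 4eps) * (a / (1 + a/lam) + b / (1 + b/lam)):
    an update is aggressive whenever the margin |delta| falls below it."""
    return (a / (1.0 + a / lam) + b / (1.0 + b / lam)) / (4.0 * eps)

lam, eps = 100.0, 100.0
# Minimal uncertainty (a = b = 0): only boundary points force a query.
print(aggressive_threshold(0.0, 0.0, lam, eps))  # -> 0.0
# Maximal uncertainty (a = b = 1): threshold equals lam / (2*eps*(lam+1)).
t1 = aggressive_threshold(1.0, 1.0, lam, eps)
print(abs(t1 - lam / (2 * eps * (lam + 1))) < 1e-12)  # -> True
```

As the text notes, higher predictive variance widens the margin band inside which labels are always requested, concentrating labeling effort on uncertain regions.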
Although the aggressive strategy requires more updates in the early stage, it helps reduce the prediction variance and accelerates the learning progress, which reduces the error rate and the number of queries once the model has learned sufficient knowledge of the data.

## Experiments

We evaluate the algorithms on three real-world datasets. We first introduce the experimental data and evaluation metrics, then present and discuss the empirical results.

### Experimental Datasets

We use three real-world datasets to evaluate the methods. Table 1 summarizes their statistics.

Table 2: Mean and standard deviation of cumulative error rate (%), F1-measure (%) and queried number on the datasets.

| Algorithm | Spam Email Err. | Spam Email F1 | Spam Email Query | MHC-I Err. | MHC-I F1 | MHC-I Query | Each Movie Err. | Each Movie F1 | Each Movie Query |
|---|---|---|---|---|---|---|---|---|---|
| MTFL | 13.40 (3.47) | 88.39 (4.67) | 7068 | 43.84 (5.98) | 51.04 (7.40) | 18664 | 27.51 (12.25) | 79.18 (12.87) | 6000 |
| TRML | 16.21 (3.77) | 86.02 (5.27) | 7068 | 44.26 (6.05) | 50.50 (7.36) | 18664 | 26.58 (11.82) | 79.89 (12.49) | 6000 |
| OMTL | 13.18 (7.65) | 90.24 (4.75) | 7068 | 38.13 (5.03) | 54.73 (4.28) | 18664 | 18.61 (7.29) | 84.45 (8.64) | 6000 |
| COL | 5.94 (1.67) | 94.93 (1.93) | 7068 | 41.46 (4.13) | 50.89 (3.00) | 18664 | 18.43 (6.75) | 84.14 (8.93) | 6000 |
| OSMTL-e | 5.22 (2.06) | 95.67 (1.78) | 7068 | 38.18 (4.95) | 55.03 (4.09) | 18664 | 19.24 (7.15) | 83.18 (9.04) | 6000 |
| CWMT | 4.90 (1.94) | 95.93 (1.69) | 7068 | 36.72 (4.87) | 55.87 (3.09) | 18664 | 17.98 (6.57) | 84.73 (8.39) | 6000 |
| ALP | 7.49 (0.34) | 93.32 (2.30) | 5530.8 (53.18) | 42.64 (3.39) | 50.05 (4.02) | 13684.1 (86.19) | 21.64 (6.98) | 81.77 (8.96) | 3905.3 (46.40) |
| ACWMT | 4.89 (1.95) | 95.94 (1.67) | 3983.2 (31.23) | 36.73 (3.66) | 55.48 (4.14) | 11388.5 (69.26) | 17.97 (6.54) | 84.78 (8.23) | 3171.5 (29.09) |

Figure 1: Cumulative error rate, F1-measure and queried number along the entire online learning process. [Figure omitted: three rows of panels (cumulative error rate, F1-measure, queried number) versus number of epochs on Spam Email, MHC-I and Each Movie, comparing ACWMT, CWMT, ALP, OSMTL-e, COL, OMTL, MTFL and TRML.]

Spam Email², maintained by the Internet Content Filtering Group, collects 7068 emails from the mailboxes of 4 users (i.e., 4 tasks). Each mail entry is represented by a word-document vector via the TF-IDF conversion technique. For each user, a model classifies each incoming email into two categories: legitimate or spam. Since the emails carry no time stamps, their order is shuffled into a random sequence.

MHC-I³, a biomarker dataset, contains 18664 peptide sequences from 12 MHC-I molecules (i.e., 12 tasks). Each peptide sequence is converted to a 400-dimensional feature vector (Li et al. 2011). The task is to classify whether a peptide sequence is a binder or non-binder for each MHC-I molecule.

Each Movie⁴ is a movie recommendation dataset. We randomly prioritize 6000 user-rating pairs where 30 users rate exactly 200 movies each. The six possible ratings (i.e., [1, 6]) are converted into two classes, like or dislike, based on the rating order. For each user (as a task), we randomly select 1783 users who viewed the same 200 movies and use their ratings as the features of the movie instances.

² http://labs-repos.iit.demokritos.gr/skel/i-config/
³ http://web.cs.iastate.edu/~honavar/ailab/
⁴ http://goldberg.berkeley.edu/jester-data/
### Evaluation Metrics

We evaluate the algorithms with three measurements: 1) cumulative error rate, reflecting the prediction accuracy of an online algorithm; 2) number of queried labels, reflecting the label-efficiency of an algorithm; and 3) F1-measure, the harmonic mean of precision and recall, suitable for evaluating performance on class-imbalanced data. For error rate and queried number, smaller values indicate better performance; for F1-measure, higher is better. To compare the algorithms fairly, we randomly shuffle the ordering of samples in each dataset, repeat each experiment 10 times, and report the averaged results.

Figure 2: A comparison between ACWMT and ALP with respect to different queried ratios. [Figure omitted: three panels of cumulative error rate (%) versus query ratio in [0, 0.8], one per dataset.]

### Baseline and Parameter Setting

Our baselines cover three categories of MTL techniques: 1) two batch learning techniques: multitask feature learning (MTFL) (Argyriou, Evgeniou, and Pontil 2006) and trace-norm regularized MTL (TRML) (Zhou, Chen, and Ye 2011); 2) three online learning algorithms: online MTL (OMTL) (Saha et al. 2011), collaborative online MTL (COL) (Li et al. 2014), and online smoothed MTL (OSMTL-e) (Murugesan et al. 2016); and 3) one active online learning method: active learning from peers (ALP) (Murugesan and Carbonell 2017). To handle the online setting, we modify the batch setting of MTFL and TRML by periodically retraining on the streamed data after every 100 observed samples. All baseline parameters are tuned according to their recommended instructions. CWMT and ACWMT are the proposed multitask algorithms in the fully-supervised and the partially-labeled settings, respectively. For simplicity, we set $\epsilon = \lambda = 100$ to avoid overfitting.
For the query method, we set $h = 0.1$ for MHC-I and Each Movie, and $h = 1$ for Spam Email.

### Comparison Results

The experimental results are presented in Table 2. We also plot the evaluation measures over the rounds of online learning in Fig. 1. The improvement of our algorithms over ALP is significant on all datasets. This is consistent with previous observations in online multitask learning: second-order algorithms are generally better than first-order ones (Yang et al. 2015; Yang, Zhao, and Gao 2018), because the covariance matrix, which encodes the confidence of the parameters, can guide the direction and magnitude of parameter learning.

ACWMT always achieves a lower error rate with a smaller number of queries. The likely reasons are threefold: 1) the prediction variance is reduced by exploiting the Gaussian distribution of the parameter weights; 2) labeling cost is saved by the proposed adaptive-margin query; and 3) the aggressive update strategy accelerates the learning progress. These techniques speed up convergence, so the error rate and queried number are reduced further once the model has learned sufficient knowledge of the data. This demonstrates both the computational efficiency and the label efficiency of our algorithms.

Table 3: Run-time (in seconds) for each algorithm.

| Algorithm | Spam Email | MHC-I | Each Movie |
|---|---|---|---|
| TRML | 73.55 | 361.42 | 391.22 |
| MTFL | 78.01 | 198.90 | 302.17 |
| COL | 1.86 | 6.35 | 31.01 |
| OSMTL-e | 1.36 | 4.33 | 11.43 |
| ACWMT | 11.49 | 11.45 | 17.92 |

We observe that ACWMT achieves accuracy comparable to CWMT with a small number of queries. The reason may be the class-imbalance issue, where training instances from the minority class are much fewer than those from the majority class. ACWMT can learn aggressively on the minority class, while CWMT accesses all labels, so the majority class may dominate the predictive model, leading to poor performance (Zhang, Yang, and Srinivasan 2016).
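The F1-measure used throughout Table 2 follows its standard definition as the harmonic mean of precision and recall; a minimal pure-Python sketch with a hypothetical confusion count:

```python
def f1_measure(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall); preferred over
    raw accuracy on class-imbalanced data such as spam or binder labels."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy counts: 8 true positives, 2 false positives, 4 false negatives.
print(round(f1_measure(8, 2, 4), 4))  # precision 0.8, recall ~0.667 -> 0.7273
```

Because the harmonic mean is dominated by the smaller of precision and recall, a model that ignores the minority class cannot score well, which is why F1 accompanies the error rate in the tables above.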
### Sensitivity Analysis on Query Ratio

We study the impact of the parameter $h$: a lower value of $h$ leads to a smaller number of queries. Specifically, we set $h \in \{10^{-4}, 10^{-3}, \ldots, 1\}$ and compute the average query ratio over 10 random shuffles. The comparison is shown in Fig. 2. ACWMT achieves consistently better accuracy under different query ratios. This is expected, since ACWMT determines a query by leveraging second-order information of the prediction, which captures informative instances more effectively than the first-order query strategy adopted by ALP.

### Time Complexity

We report the run-time of the algorithms in Table 3. The online methods run faster than the two batch learning algorithms, which is expected since online models learn only on the current instance while batch models learn from a substantial number of samples. In addition, ACWMT is somewhat slower than OSMTL-e, since ACWMT must update the covariance matrix of the parameter weights. However, the extra computational cost is worthwhile given the significant improvement achieved by considering parameter confidence.

## Conclusion

This work presents an online active learning method in the multitask setting, where the learner trains each local task jointly with the global knowledge of its peer tasks. The key intuition is that learning efficacy can be improved by accessing the knowledge of peer tasks. Our query strategy benefits from two aspects: 1) the query decision is made by consulting the local-global knowledge of multiple tasks; and 2) the query confidence depends on both the margin and the variance of the prediction. Theoretical results show that our method, run on a fraction of informative labels, achieves a lower error bound than its fully-supervised counterpart.
Finally, the promising empirical results validate the effectiveness of our methods on several real-world datasets.

Acknowledgements: The research reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST), under award number URF/1/3007-01-01.

References
Abernethy, J.; Bartlett, P.; and Rakhlin, A. 2007. Multitask learning with expert advice. In COLT, 484-498. Springer.
Agarwal, A.; Rakhlin, A.; and Bartlett, P. 2008. Matrix regularization techniques for online multitask learning. Technical report, EECS Department, University of California, Berkeley.
Argyriou, A.; Evgeniou, T.; and Pontil, M. 2006. Multi-task feature learning. In NIPS, 41-48.
Argyriou, A.; Evgeniou, T.; and Pontil, M. 2008. Convex multi-task feature learning. Machine Learning 73(3):243-272.
Cesa-Bianchi, N.; Gentile, C.; and Zaniboni, L. 2006. Worst-case analysis of selective sampling for linear classification. JMLR 7(Jul):1205-1230.
Crammer, K., and Mansour, Y. 2012. Learning multiple tasks using shared hypotheses. In NIPS, 1475-1483.
Crammer, K.; Dredze, M.; and Pereira, F. 2009. Exact convex confidence-weighted learning. In NIPS, 345-352.
Dekel, O.; Gentile, C.; and Sridharan, K. 2012. Selective sampling and active learning from single and multiple teachers. JMLR 13(Sep):2655-2697.
Dekel, O.; Long, P. M.; and Singer, Y. 2007. Online learning of multiple tasks with a shared loss. JMLR 8:2233-2264.
Ding, L.-Z.; Liao, S.; Liu, Y.; Yang, P.; and Gao, X. 2018. Randomized kernel selection with spectra of multilevel circulant matrices. In AAAI.
Evgeniou, T., and Pontil, M. 2004. Regularized multi-task learning. In SIGKDD, 109-117. ACM.
Hoi, S. C.; Sahoo, D.; Lu, J.; and Zhao, P. 2018. Online learning: A comprehensive survey. arXiv preprint arXiv:1802.02871.
Li, G.; Chang, K.; Hoi, S. C. H.; Liu, W.; and Jain, R. 2011. Collaborative online learning of user generated content. In CIKM, 285-290.
Li, G.; Hoi, S. C.; Chang, K.; Liu, W.; and Jain, R. 2014. Collaborative online multitask learning. TKDE 26(8):1866-1876.
Lugosi, G.; Papaspiliopoulos, O.; and Stoltz, G. 2009. Online multi-task learning with hard constraints. arXiv preprint arXiv:0902.3526.
Murugesan, K., and Carbonell, J. 2017. Active learning from peers. In NIPS, 7011-7020.
Murugesan, K.; Liu, H.; Carbonell, J.; and Yang, Y. 2016. Adaptive smoothed online multi-task learning. In NIPS, 4296-4304.
Saha, A.; Rai, P.; Daumé III, H.; and Venkatasubramanian, S. 2011. Online learning of multiple tasks and their relationships. In AISTATS.
Sherman, J., and Morrison, W. J. 1950. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics 21(1):124-127.
Tseng, P. 2001. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109(3):475-494.
Yang, P., and Zhao, P. 2015. A min-max optimization framework for online graph classification. In CIKM, 643-652. ACM.
Yang, P.; Zhao, P.; Zheng, V. W.; and Li, X.-L. 2015. An aggressive graph-based selective sampling algorithm for classification. In ICDM, 509-518. IEEE.
Yang, P.; Zhao, P.; and Gao, X. 2017. Robust online multitask learning with correlative and personalized structures. IEEE Transactions on Knowledge and Data Engineering 29(11):2510-2521.
Yang, P.; Zhao, P.; and Gao, X. 2018. Bandit online learning on graphs via adaptive optimization. In IJCAI.
Zhang, Y., and Yeung, D.-Y. 2010. A convex formulation for learning task relationships in multi-task learning. In UAI, 733-742. AUAI Press.
Zhang, X.; Yang, T.; and Srinivasan, P. 2016. Online asymmetric active learning with imbalanced data. In SIGKDD, 2055-2064. ACM.
Zhao, P.; Hoi, S. C.; Jin, R.; and Yang, T. 2011. Online AUC maximization. In ICML, 233-240. Omnipress.
Zhou, J.; Chen, J.; and Ye, J. 2011. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University.