Adaptive Smoothed Online Multi-Task Learning

Keerthiram Murugesan, Carnegie Mellon University, kmuruges@cs.cmu.edu
Hanxiao Liu, Carnegie Mellon University, hanxiaol@cs.cmu.edu
Jaime Carbonell, Carnegie Mellon University, jgc@cs.cmu.edu
Yiming Yang, Carnegie Mellon University, yiming@cs.cmu.edu

*Both student authors contributed equally.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

This paper addresses the challenge of jointly learning both the per-task model parameters and the inter-task relationships in a multi-task online learning setting. The proposed algorithm features a probabilistic interpretation, efficient updating rules, and flexible modulation of whether learners focus on their specific tasks or jointly address all tasks. The paper also proves a sub-linear regret bound compared to the best linear predictor in hindsight. Experiments over three multi-task learning benchmark datasets show advantageous performance of the proposed approach over several state-of-the-art online multi-task learning baselines.

1 Introduction

The power of joint learning across multiple tasks arises from the transfer of relevant knowledge across said tasks, especially from information-rich tasks to information-poor ones. Instead of learning individual models, multi-task methods leverage the relationships between tasks to jointly build a better model for each task. Most existing work in multi-task learning focuses on how to take advantage of these task relationships, either to share data directly [1] or to learn model parameters via cross-task regularization techniques [2, 3, 4]. In a broad sense, there are two settings for learning these task relationships: 1) batch learning, in which an entire training set is available to the learner; and 2) online learning, in which the learner sees the data in a sequential fashion. In recent years, online multi-task learning has attracted extensive research attention [5, 6, 7, 8, 9].

Following the online setting, particularly that of [6, 7], at each round t the learner receives a set of K observations from K tasks and predicts the output label for each of these observations. Subsequently, the learner receives the true labels and updates the model(s) as necessary. This sequence is repeated over the entire data, simulating a data stream. Our approach follows an error-driven update rule in which the model for a given task is updated only when the prediction for that task is in error. The goal of an online learner is to minimize errors compared to the full hindsight learner. The key challenge in online learning with a large number of tasks is to adaptively learn the model parameters and the task relationships, which potentially change over time. Without manageable, efficient updates at each round, learning the task relationship matrix automatically may impose a severe computational burden. In other words, we need to make predictions and update the models in an efficient, real-time manner.

We propose an online learning framework that efficiently learns multiple related tasks by estimating the task relationship matrix from the data, along with the model parameters for each task. We learn the model for each task by sharing data from related tasks directly. Our model provides a natural way to specify the trade-off between learning the hypothesis from each task's own (possibly quite limited) data and from the data of multiple related tasks.
We propose an iterative algorithm that alternately learns the task parameters and the task-relationship matrix. We first describe our proposed approach in a batch setting and then extend it to the online learning paradigm. In addition, we provide a theoretical analysis of our online algorithm and show that it achieves a sub-linear regret compared to the best linear predictor in hindsight. We evaluate our model against several state-of-the-art online learning algorithms for multiple tasks.

There are many useful application areas for online multi-task learning, including optimizing financial trading, email prioritization, personalized news, and spam filtering. Consider the latter, where some spam is universal to all users (e.g., financial scams), some messages might be useful to certain affinity groups but spam to most others (e.g., announcements of meditation classes or other special-interest activities), and some may depend on evolving user interests. In spam filtering each user is a task, and shared interests and dis-interests form the inter-task relationship matrix. If we can learn the matrix as well as improve the models from specific spam/not-spam decisions, we can perform mass customization of spam filtering, borrowing spam/not-spam feedback from users with similar preferences. The primary contribution of this paper is precisely the joint learning of inter-task relationships and its use in estimating per-task model parameters in an online setting.

1.1 Related Work

While there is considerable literature on online multi-task learning, many crucial aspects remain largely unexplored. Most existing work in online multi-task learning focuses on how to take advantage of task relationships. To achieve this, Lugosi et al. [7] imposed a hard constraint on the K simultaneous actions taken by the learner in the expert setting, Agarwal et al. [10] used matrix regularization, and Dekel et al. [6] proposed a global loss function, an absolute norm, to tie together the loss values of the individual tasks. Different from existing online multi-task learning models, our paper proposes an intuitive and efficient way to learn the task relationship matrix automatically from the data, and to explicitly take the learned relationships into account during model updates.

Cavallanti et al. [8] assume that task relationships are available a priori. Kshirsagar et al. [11] do the same, but in a more adaptive manner. However, such task-relation prior knowledge is either unavailable or infeasible to obtain for many applications, especially when the number of tasks K is large [12] and/or when manual annotation of task relationships is expensive [13]. Saha et al. [9] formulated the learning of the task relationship matrix as a Bregman-divergence minimization problem w.r.t. positive definite matrices. Their model suffers from high computational complexity, as semi-definite programming is required when updating the task relationship matrix at each online round. We show that with a different formulation, we can obtain a similar but much cheaper updating rule for learning the inter-task weights.

The work most closely related to ours is the Shared Hypothesis model (SHAMO) of Crammer and Mansour [1], whose key idea is to use a K-means-like procedure that simultaneously clusters different tasks and learns a small pool of $m \ll K$ shared hypotheses.
Specifically, each task is free to choose a hypothesis from the pool that better classifies its own data, and each hypothesis is learned by pooling together all the training data that belongs to the same cluster. A similar idea was explored by Abernethy et al. [5] under expert settings.

2 Smoothed Multi-Task Learning

Suppose we are given K tasks where the jth task is associated with $N_j$ training examples. For brevity we consider a binary classification problem for each task, but the methods generalize to multi-class settings and are also applicable to regression tasks. We denote by $[N]$ the consecutive integers ranging from 1 to N. Let $\{(x_j^{(i)}, y_j^{(i)})\}_{i=1}^{N_j}$ and $L_j(w) = \frac{1}{N_j}\sum_{i \in [N_j]} \big(1 - y_j^{(i)} \langle x_j^{(i)}, w \rangle\big)_+$ be the training set and batch empirical loss for task j, respectively, where $(z)_+ = \max(0, z)$, $x_j^{(i)} \in \mathbb{R}^d$ is the ith instance from the jth task and $y_j^{(i)}$ is its corresponding true label. We start from the motivation of our formulation in Section 2.2, based on which we first propose a batch formulation in Section 2.3. Then, we extend the method to the online setting in Section 2.4.

2.2 Motivation

Learning tasks may be addressed independently via $w_k^* = \operatorname{argmin}_{w_k} L_k(w_k)$, $\forall k \in [K]$. However, when each task has limited training data, it is often beneficial to allow information sharing among the tasks, which can be achieved via the following optimization:

$$w_k^* = \operatorname*{argmin}_{w_k} \sum_{j \in [K]} \eta_{kj} L_j(w_k) \quad \forall k \in [K] \tag{1}$$

Beyond each task k, optimization (1) encourages hypothesis $w_k^*$ to do well on the remaining $K-1$ tasks, thus allowing tasks to borrow information from each other. In the extreme case where the K tasks have an identical data distribution, optimization (1) amounts to using $\sum_{j \in [K]} N_j$ examples for training, as compared to $N_k$ in independent learning.

The weight matrix $\eta$ is in essence a task relationship matrix, and a prior may be manually specified according to domain knowledge about the tasks. For instance, $\eta_{kj}$ would typically be set to a large value if tasks k and j are of a similar nature. If $\eta = I$, (1) reduces to learning the tasks independently. It is clear that manual specification of $\eta$ is feasible only when K is small. Moreover, tasks may be statistically correlated even if a domain expert is unavailable to identify an explicit relation, or if the effort required is too great. Hence, it is often desirable to automatically estimate the optimal $\eta$ adapted to the inter-task problem structure. We propose to learn $\eta$ in a data-driven manner. For the kth task, we optimize

$$w_k^*, \eta_k^* = \operatorname*{argmin}_{w_k,\, \eta_k \in \Theta} \sum_{j \in [K]} \eta_{kj} L_j(w_k) + \lambda\, r(\eta_k) \tag{2}$$

where $\Theta$ defines the feasible domain of $\eta_k$, and the regularizer $r$ prevents degenerate cases, e.g., where $\eta_k$ becomes an all-zero vector. Optimization (2) shares the same underlying insight with Self-Paced Learning (SPL) [14, 15], where the algorithm automatically learns the weights over data points during training. However, the process and scope of the two methods differ fundamentally: SPL minimizes the weighted loss over data points within a single domain, while optimization (2) minimizes the weighted loss over multiple tasks across possibly heterogeneous domains.
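As a concrete illustration, the following minimal NumPy sketch (ours, not from the original paper; the names `hinge_loss` and `weighted_objective` are illustrative) computes the batch hinge loss $L_j(w)$ and the cross-task objective of (1):

```python
import numpy as np

def hinge_loss(w, X, y):
    """Batch empirical hinge loss L_j(w) = (1/N_j) * sum_i (1 - y_i <x_i, w>)_+."""
    margins = 1.0 - y * (X @ w)
    return np.maximum(margins, 0.0).mean()

def weighted_objective(w_k, eta_k, data):
    """Cross-task objective of (1): sum_j eta_kj * L_j(w_k).
    `data` is a list of per-task (X_j, y_j) pairs; `eta_k` is the kth row of eta."""
    return sum(eta_kj * hinge_loss(w_k, X_j, y_j)
               for eta_kj, (X_j, y_j) in zip(eta_k, data))
```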
A common choice of $\Theta$ and $r(\eta_k)$ in SPL is $\Theta = [0, 1]^K$ and $r(\eta_k) = -\|\eta_k\|_1$. There are several drawbacks to naively applying this type of setting to the multi-task scenario: (i) Lack of focus: there is no guarantee that the kth learner will put more focus on the kth task itself. When task k is intrinsically difficult, $\eta_{kk}^*$ could simply be set near zero and $w_k^*$ becomes almost independent of the kth task. (ii) Weak interpretability: the learned $\eta_k^*$ may not be interpretable, as it is not directly tied to any physical meaning. (iii) Lack of a worst-case guarantee in the online setting. All of these issues are addressed by our proposed model in the following.

2.3 Batch Formulation

We parametrize the aforementioned task relationship matrix $\eta \in \mathbb{R}^{K \times K}$ as follows:

$$\eta = \alpha I_K + (1 - \alpha) P \tag{3}$$

where $I_K \in \mathbb{R}^{K \times K}$ is an identity matrix, $P \in \mathbb{R}^{K \times K}$ is a row-stochastic matrix and $\alpha$ is a scalar in $[0, 1]$. The task relationship matrix $\eta$ defined as above has the following interpretations:

1. Concentration Factor: $\alpha$ quantifies the learners' concentration on their own tasks. Setting $\alpha = 1$ amounts to independent learning. We will see from the forthcoming Theorem 1 how to specify $\alpha$ to ensure the optimality of the online regret bound.

2. Smoothed Attention Matrix: $P$ quantifies to which degree the learners are attentive to all tasks. Specifically, we define the kth row of $P$, namely $p_k \in \Delta^{K-1}$, as a probability distribution over all tasks, where $\Delta^{K-1}$ denotes the probability simplex. Our goal of learning a data-adaptive $\eta$ now becomes learning a data-adaptive attention matrix $P$.

Common choices of $\eta$ in several existing algorithms are special cases of (3). For instance, domain adaptation assumes $\alpha = 0$ and a fixed row-stochastic matrix $P$; in multi-task learning, we obtain the effective heuristic for specifying $\eta$ of Cavallanti et al. [8] when $\alpha = \frac{1}{1+K}$ and $P = \frac{1}{K} \mathbf{1}\mathbf{1}^\top$. When there are only $m \ll K$ unique distributions $p_k$, the problem reduces to the SHAMO model [1]. Equation (3) implies the task relationship matrix $\eta$ is also row-stochastic, where we always reserve probability $\alpha$ for the kth task itself since $\eta_{kk} \geq \alpha$. For each learner, the presence of $\alpha$ entails a trade-off between learning from other tasks and concentrating on its own task. Note that we do not require $P$ to be symmetric, due to the asymmetric nature of information transferability: while classifiers trained on a resource-rich task can be transferred well to a resource-scarce task, the inverse is not usually true.

Motivated by the above discussion, our batch formulation instantiates (2) as follows:

$$w_k^*, p_k^* = \operatorname*{argmin}_{w_k,\, p_k \in \Delta^{K-1}} \sum_{j \in [K]} \eta_{kj}(p_k)\, L_j(w_k) - \lambda H(p_k) \tag{4}$$
$$= \operatorname*{argmin}_{w_k,\, p_k \in \Delta^{K-1}} \mathbb{E}_{j \sim \mathrm{Multinomial}(\eta_k(p_k))}\big[L_j(w_k)\big] - \lambda H(p_k) \tag{5}$$

where $H(p_k) = -\sum_{j \in [K]} p_{kj} \log p_{kj}$ denotes the entropy of the distribution $p_k$. Optimization (4) can be viewed as balancing between minimizing the cross-task loss with mixture weights $\eta_k$ and maximizing the smoothness of the cross-task attentions. The max-entropy regularization favours a uniform attention over all tasks and leads to analytical updating rules for $p_k$ (and $\eta_k$).

Optimization (4) is biconvex over $w_k$ and $p_k$. With $p_k^{(t)}$ fixed, the solution for $w_k$ can be obtained using off-the-shelf solvers. With $w_k^{(t)}$ fixed, the solution for $p_k$ is given in closed form:

$$p_{kj}^{(t+1)} = \frac{e^{-\frac{1-\alpha}{\lambda} L_j(w_k^{(t)})}}{\sum_{j'=1}^{K} e^{-\frac{1-\alpha}{\lambda} L_{j'}(w_k^{(t)})}} \quad \forall j \in [K] \tag{6}$$

The exponential updating rule in (6) has an intuitive interpretation. That is, our algorithm attempts to use the hypothesis $w_k^{(t)}$ obtained from the kth task to classify training examples in all other tasks. Task j will be treated as related to task k if its training examples can be well classified by $w_k$. The intuition is that two tasks are likely to be related if they share similar decision boundaries, so combining their associated data should yield a stronger model, trained over more data.
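The alternating scheme can be sketched as follows (an illustrative implementation, not the authors' code: the inner subgradient loop stands in for the off-the-shelf solver mentioned above, the step size and iteration counts are arbitrary, and `hinge_loss` is the helper from the earlier sketch):

```python
import numpy as np

def smtl_e_iteration(models, P, data, alpha, lam, n_inner=50, lr=0.01):
    """One outer iteration of batch SMTL-e (a sketch of Algorithm 1).
    models: (K, d) array of per-task weights w_k; P: (K, K) row-stochastic
    attention matrix; data: list of K (X_j, y_j) pairs."""
    K = len(data)
    # (a) With p_k fixed, take subgradient steps on
    #     alpha * L_k(w_k) + (1 - alpha) * sum_j p_kj * L_j(w_k).
    for k in range(K):
        eta_k = alpha * np.eye(K)[k] + (1.0 - alpha) * P[k]   # eq. (3), row k
        for _ in range(n_inner):
            grad = np.zeros_like(models[k])
            for eta_kj, (X, y) in zip(eta_k, data):
                active = y * (X @ models[k]) < 1.0            # hinge subgradient mask
                grad -= eta_kj * (y[active, None] * X[active]).sum(axis=0) / len(y)
            models[k] = models[k] - lr * grad
    # (b) Closed-form exponential update (6) for every attention row p_k.
    losses = np.array([[hinge_loss(w_k, X, y) for (X, y) in data] for w_k in models])
    logits = -(1.0 - alpha) / lam * losses
    P = np.exp(logits - logits.max(axis=1, keepdims=True))    # stabilized softmax
    P /= P.sum(axis=1, keepdims=True)
    return models, P
```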
2.4 Online Formulation

In this section, we extend our batch formulation to the online setting. We assume that all tasks will be performed at each round, though the assumption can be relaxed with some added complexity to the method. At time t, the kth task receives a training instance $x_k^{(t)}$, makes a prediction $\langle x_k^{(t)}, w_k^{(t)} \rangle$ and suffers a loss after $y_k^{(t)}$ is revealed. Our algorithm follows an error-driven update rule in which the model is updated only when a task makes a mistake.

Let $\ell_{kj}^{(t)}(w) = 1 - y_j^{(t)} \langle x_j^{(t)}, w \rangle$ if $y_j^{(t)} \langle x_j^{(t)}, w_k^{(t)} \rangle < 1$, and $\ell_{kj}^{(t)}(w) = 0$ otherwise. For brevity, we introduce the shorthands $\ell_{kj}^{(t)} = \ell_{kj}^{(t)}(w_k^{(t)})$ and $\eta_{kj}^{(t)} = \eta_{kj}(p_k^{(t)})$. For the kth task we consider the following optimization problem at each time:

$$w_k^{(t+1)}, p_k^{(t+1)} = \operatorname*{argmin}_{w_k,\, p_k \in \Delta^{K-1}} C \sum_{j \in [K]} \eta_{kj}(p_k)\, \ell_{kj}^{(t)}(w_k) + \big\|w_k - w_k^{(t)}\big\|^2 + \lambda\, D_{KL}\big(p_k \,\big\|\, p_k^{(t)}\big) \tag{7}$$

where $\sum_{j \in [K]} \eta_{kj}(p_k)\, \ell_{kj}^{(t)}(w_k) = \mathbb{E}_{j \sim \mathrm{Multi}(\eta_k(p_k))}\big[\ell_{kj}^{(t)}(w_k)\big]$, and $D_{KL}\big(p_k \,\|\, p_k^{(t)}\big)$ denotes the Kullback–Leibler divergence between the current and previous soft-attention distributions. The presence of the last two terms in (7) allows the model parameters to evolve smoothly over time. Optimization (7) is naturally analogous to the batch optimization (4), where the batch loss $L_j(w_k)$ is replaced by its noisy version $\ell_{kj}^{(t)}(w_k)$ at time t, and the negative entropy $-H(p_k) = \sum_j p_{kj} \log p_{kj}$ is replaced by $D_{KL}(p_k \,\|\, p_k^{(t)})$, also known as the relative entropy. We will show that the above formulation leads to analytical updating rules for both $w_k$ and $p_k$, a particularly desirable property for an online algorithm.

The solution for $w_k^{(t+1)}$ conditioned on $p_k^{(t)}$ is given in closed form by the proximal operator

$$w_k^{(t+1)} = \mathrm{prox}\big(w_k^{(t)}\big) = \operatorname*{argmin}_{w_k}\; C \sum_{j \in [K]} \eta_{kj}\big(p_k^{(t)}\big)\, \ell_{kj}^{(t)}(w_k) + \big\|w_k - w_k^{(t)}\big\|^2 \tag{8}$$
$$= w_k^{(t)} + C \sum_{j:\, y_j^{(t)} \langle x_j^{(t)}, w_k^{(t)} \rangle < 1} \eta_{kj}\big(p_k^{(t)}\big)\, y_j^{(t)} x_j^{(t)} \tag{9}$$

The solution for $p_k^{(t+1)}$ conditioned on $w_k^{(t)}$ is also given in closed form, analogous to mirror descent [16]:

$$p_k^{(t+1)} = \operatorname*{argmin}_{p_k \in \Delta^{K-1}}\; C(1-\alpha) \sum_{j \in [K]} p_{kj}\, \ell_{kj}^{(t)} + \lambda\, D_{KL}\big(p_k \,\big\|\, p_k^{(t)}\big) \tag{10}$$
$$\Longrightarrow\quad p_{kj}^{(t+1)} = \frac{p_{kj}^{(t)}\, e^{-\frac{C(1-\alpha)}{\lambda} \ell_{kj}^{(t)}}}{\sum_{j' \in [K]} p_{kj'}^{(t)}\, e^{-\frac{C(1-\alpha)}{\lambda} \ell_{kj'}^{(t)}}} \quad \forall j \in [K] \tag{11}$$

The pseudo-code is in Algorithm 2.² Our algorithm is passive in the sense that updates are carried out only when a classification error occurs, namely when $\hat{y}_k^{(t)} \neq y_k^{(t)}$. An alternative is to perform aggressive updates whenever the active set $\{j : y_j^{(t)} \langle x_j^{(t)}, w_k^{(t)} \rangle < 1\}$ is non-empty.

Algorithm 1: Batch Algorithm (SMTL-e)

    while not converged do
        for k ∈ [K] do
            w_k^(t) ← argmin_{w_k} α L_k(w_k) + (1 − α) Σ_{j∈[K]} p_kj^(t) L_j(w_k)
            for j ∈ [K] do
                p_kj^(t+1) ← exp(−(1−α)/λ · L_j(w_k^(t))) / Σ_{j'=1}^{K} exp(−(1−α)/λ · L_{j'}(w_k^(t)))
            end
        end
        t ← t + 1
    end

Algorithm 2: Online Algorithm (OSMTL-e)

    for t ∈ [T] do
        for k ∈ [K] do
            if y_k^(t) ⟨x_k^(t), w_k^(t)⟩ < 1 then
                w_k^(t+1) ← w_k^(t) + C ( α 1[ℓ_kk^(t) > 0] y_k^(t) x_k^(t) + (1 − α) Σ_{j: ℓ_kj^(t) > 0} p_kj^(t) y_j^(t) x_j^(t) )
                for j ∈ [K] do
                    p_kj^(t+1) ← p_kj^(t) exp(−C(1−α)/λ · ℓ_kj^(t)) / Σ_{j'=1}^{K} p_kj'^(t) exp(−C(1−α)/λ · ℓ_kj'^(t))
                end
            else
                (w_k^(t+1), p_k^(t+1)) ← (w_k^(t), p_k^(t))
            end
        end
    end

² It is recommended to set $\alpha = \frac{\sqrt{T}}{1+\sqrt{T}}$, as suggested by Corollary 2.
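A minimal sketch of one online round, implementing updates (9) and (11) jointly for all K tasks (the vectorization and the name `osmtl_e_round` are our own; this is an illustration, not the authors' code):

```python
import numpy as np

def osmtl_e_round(models, P, X_t, y_t, C, alpha, lam):
    """One round of OSMTL-e (a sketch of Algorithm 2).
    models: (K, d) per-task weights; P: (K, K) row-stochastic attention;
    X_t: (K, d), one instance per task at round t; y_t: (K,) labels in {-1, +1}."""
    K = len(models)
    margins = y_t * np.sum(X_t * models, axis=1)          # y_k <x_k, w_k> per task
    # Hinge losses l_kj = (1 - y_j <x_j, w_k>)_+ for every learner/task pair.
    losses = np.maximum(1.0 - y_t[None, :] * (models @ X_t.T), 0.0)
    for k in range(K):
        if margins[k] >= 1.0:                             # no mistake: stay passive
            continue
        eta_k = alpha * np.eye(K)[k] + (1.0 - alpha) * P[k]
        active = losses[k] > 0.0
        # Perceptron-style step (9): move toward misclassified related examples.
        models[k] += C * ((eta_k * active * y_t)[:, None] * X_t).sum(axis=0)
        # Multiplicative-weights step (11): down-weight tasks w_k classifies poorly.
        P[k] *= np.exp(-C * (1.0 - alpha) / lam * losses[k])
        P[k] /= P[k].sum()
    return models, P
```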
2.5 Regret Bound

Theorem 1. $\forall k \in [K]$, let $S_k = \{(x_k^{(t)}, y_k^{(t)})\}_{t=1}^{T}$ be a sequence of T examples for the kth task, where $x_k^{(t)} \in \mathbb{R}^d$, $y_k^{(t)} \in \{-1, +1\}$ and $\|x_k^{(t)}\|_2 \leq R$, $\forall t \in [T]$. Let C be a positive constant and let $\alpha$ be some predefined parameter in $[0, 1]$. Let $\{w_k^*\}_{k \in [K]}$ be any arbitrary vectors with $w_k^* \in \mathbb{R}^d$, whose hinge losses on the examples $(x_k^{(t)}, y_k^{(t)})$ and $(x_j^{(t)}, y_j^{(t)})_{j \neq k}$ are given by $\ell_{kk}^{*(t)} = \big(1 - y_k^{(t)} \langle x_k^{(t)}, w_k^* \rangle\big)_+$ and $\ell_{kj}^{*(t)} = \big(1 - y_j^{(t)} \langle x_j^{(t)}, w_k^* \rangle\big)_+$, respectively. If $\{S_k\}_{k \in [K]}$ is presented to the OSMTL algorithm, then $\forall k \in [K]$ we have

$$\sum_{t \in [T]} \big(\ell_{kk}^{(t)} - \ell_{kk}^{*(t)}\big) \leq \frac{1}{2C\alpha} \|w_k^*\|^2 + (1-\alpha)\, T \Big(\ell_{kk}^{*(t)} + \max_{j \in [K], j \neq k} \ell_{kj}^{*(t)}\Big) + C R^2 T \tag{12}$$

Notice that when $\alpha \to 1$, the above reduces to the perceptron mistake bound [17].

Corollary 2. Setting $\alpha = \frac{\sqrt{T}}{1 + \sqrt{T}}$ and $C = \frac{1 + \sqrt{T}}{T}$ in Theorem 1, we have

$$\sum_{t \in [T]} \big(\ell_{kk}^{(t)} - \ell_{kk}^{*(t)}\big) \leq \sqrt{T} \Big(\frac{1}{2}\|w_k^*\|^2 + \ell_{kk}^{*(t)} + \max_{j \in [K], j \neq k} \ell_{kj}^{*(t)} + 2R^2\Big) \tag{13}$$

Proofs are given in the supplementary material. Theorem 1 and Corollary 2 have several implications:

1. The quality of the bound depends on both $\ell_{kk}^{*(t)}$ and the maximum of $\{\ell_{kj}^{*(t)}\}_{j \in [K], j \neq k}$. In other words, the worst-case regret will be lower if the kth true hypothesis $w_k^*$ can well distinguish training examples both in the kth task itself and in all the other tasks.

2. Corollary 2 indicates that the difference between the cumulative loss achieved by our algorithm and by any fixed hypothesis for task k is bounded by a term growing sub-linearly in T.

3. Corollary 2 provides a principled way to set the hyperparameters to achieve the sub-linear regret bound, as illustrated by the sketch below. Specifically, recall that $\alpha$ quantifies the self-concentration of each task. Therefore, $\alpha = \frac{\sqrt{T}}{1+\sqrt{T}} \to 1$ implies that for a large horizon it becomes less necessary to rely on other tasks, as the available supervision for the task itself is already plentiful; $C = \frac{1+\sqrt{T}}{T} \to 0$ suggests a diminishing learning rate over the horizon length.
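A small helper (hypothetical name; assuming the constants of Corollary 2 as reconstructed above) that derives these hyperparameters from the horizon T:

```python
import math

def corollary2_hyperparams(T):
    """Hyperparameters suggested by Corollary 2 for a horizon of T rounds:
    alpha -> 1 (rely less on other tasks) and C -> 0 (diminishing step size)."""
    sqrt_T = math.sqrt(T)
    alpha = sqrt_T / (1.0 + sqrt_T)
    C = (1.0 + sqrt_T) / T
    return alpha, C

# For example, T = 10000 gives alpha ~ 0.990 and C ~ 0.0101.
```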
3 Experiments

We evaluate the performance of our algorithm in both batch and online settings. All reported results in this section are averaged over 30 random runs or permutations of the training data. Unless otherwise specified, all model parameters are chosen via 5-fold cross validation.

3.1 Benchmark Datasets

We use three datasets for our experiments. Details are given below:

Landmine Detection³ consists of 19 tasks collected from different landmine fields. Each task is a binary classification problem: landmines (+) or clutter (−), and each example consists of 9 features extracted from radar images: four moment-based features, three correlation-based features, one energy-ratio feature and one spatial-variance feature. The landmine data is collected from two different terrains: tasks 1–10 are from highly foliated regions and tasks 11–19 are from desert regions, so the tasks naturally form two clusters. Any hypothesis learned from a task should be able to utilize the information available from other tasks belonging to the same cluster.

Spam Detection⁴ We use the dataset obtained from the ECML PAKDD 2006 Discovery Challenge for the spam detection task. We used the task B challenge dataset, which consists of labeled training data from the inboxes of 15 users. We consider each user as a single task, and the goal is to build a personalized spam filter for each user. Each task is a binary classification problem: spam (+) or non-spam (−), and each example consists of approximately 150K features representing term frequencies of the word occurrences. Since some spam is universal to all users (e.g., financial scams), while some messages might be useful to certain affinity groups but spam to most others, such adaptive behavior of each user's interests and dis-interests can be modeled efficiently by utilizing the data from other users to learn per-user model parameters.

Sentiment Analysis⁵ We evaluated our algorithm on product reviews from Amazon. The dataset contains product reviews from 24 domains. We consider each domain as a binary classification task. Reviews with rating > 3 were labeled positive (+), those with rating < 3 were labeled negative (−), and reviews with rating = 3 were discarded, as the sentiments were ambiguous and hard to predict. Similar to the previous dataset, each example consists of approximately 350K features representing term frequencies of the word occurrences.

We choose 3040 examples (160 training examples per task) for landmine, 1500 emails for spam (100 emails per user inbox) and 2400 reviews for sentiment (100 reviews per domain) for our experiments.

³ http://www.ee.duke.edu/~lcarin/LandmineData.zip
⁴ http://ecmlpkdd2006.org/challenge.html
⁵ http://www.cs.jhu.edu/~mdredze/datasets/sentiment

[Figure 1: Average AUC calculated for the compared models (left). A visualization of the task relationship matrix in landmine learned by SMTL-t (middle) and SMTL-e (right). The probabilistic formulation of SMTL-e allows it to discover more interesting patterns than SMTL-t.]

Note that we intentionally kept the size of the training data small to drive the need for learning from other tasks, which diminishes as the training sets per task become large. Since all these datasets have a class-imbalance issue (with few (+) examples compared to (−) examples), we use the average Area Under the ROC Curve (AUC) as the performance measure.

3.2 Batch Setting

Since the main focus of this paper is online learning, we briefly conduct an experiment on the landmine detection dataset for our batch learner to demonstrate the advantages of learning from shared data. We implement two versions of our proposed algorithm with different updates: SMTL-t (SMTL with thresholding updates), where $p_{kj}^{(t+1)} \propto \big(\lambda - \ell_{kj}^{(t)}\big)_+$,⁶ and SMTL-e (SMTL with exponential updates) as in Algorithm 1. We compare our SMTL* variants with two standard baseline methods for the batch setting: Independent Task Learning (ITL), which learns a single model for each task, and Single Task Learning (STL), which learns a single classification model on the pooled data from all tasks. In addition, we compare our models with SHAMO, which is closest in spirit to our proposed models. We select the values of λ and α for SMTL* and M for SHAMO using cross validation.

Figure 1 (left) shows the average AUC calculated for different training sizes on landmine. We can see that the baseline results are similar to the ones reported by Xue et al. [3]. Our proposed algorithm (SMTL*) outperforms the other baselines, but when we have very few training examples (say 20 per task), the performance of STL improves, as it has more examples than the others. Since η depends on the loss incurred on the data from related tasks, this loss-based measure can be unreliable for a small training sample size. To our surprise, SHAMO performs worse than the other models, which suggests that assuming two tasks are exactly the same (in the sense of their hypothesis) may be inappropriate in real-world applications. Figure 1 (middle & right) shows the task relationship matrix η for SMTL-t and SMTL-e on landmine when the number of training instances is 160 per task.

3.3 Online Setting

To evaluate the performance of our algorithm in the online setting, we use all three datasets (landmine, spam and sentiment) and compare our proposed methods to 5 baselines. We implemented two variants of the Passive-Aggressive algorithm (PA) [18]: PA-ITL learns an independent model for each task, and PA-ONE builds a single model for all the tasks. We also implemented the algorithm proposed by Dekel et al. for online multi-task learning with a shared loss (OSGL) [6]. These three baselines do not exploit the task relationships or the data from other tasks during model updates.
Next, we implemented two online multi-task learning methods related to our approach: FOML, which initializes η with fixed weights [8], and Online Multi-Task Relationship Learning (OMTRL) [9], which learns a task covariance matrix along with the task parameters. We could not find a good way to implement an online version of the SHAMO algorithm, since the number of shared hypotheses or clusters varies over time.

⁶ Our algorithm and theorem can be easily generalized to other types of updating rules by replacing exp in (6) with other functions. In the latter cases, however, η may no longer have a probabilistic interpretation.

Table 1: Average performance on three datasets: means and standard errors over 30 random shuffles.

| Model | Landmine AUC | Landmine nSV | Landmine Time (s) | Spam AUC | Spam nSV | Spam Time (s) | Sentiment AUC | Sentiment nSV | Sentiment Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| PA-ONE | 0.5473 (0.12) | 2902.9 (4.21) | 0.01 | 0.8739 (0.01) | 1455.0 (4.64) | 0.16 | 0.7193 (0.03) | 2350.7 (6.36) | 0.19 |
| PA-ITL | 0.5986 (0.04) | 618.1 (27.31) | 0.01 | 0.8350 (0.01) | 1499.9 (0.37) | 0.16 | 0.7364 (0.02) | 2399.9 (0.25) | 0.16 |
| OSGL | 0.6482 (0.03) | 740.8 (42.03) | 0.01 | 0.9551 (0.007) | 1402.6 (13.57) | 0.17 | 0.8375 (0.02) | 2369.3 (14.63) | 0.17 |
| FOML | 0.6322 (0.04) | 426.5 (36.91) | 0.11 | 0.9347 (0.009) | 819.8 (18.57) | 1.5 | 0.8472 (0.02) | 1356.0 (78.49) | 1.20 |
| OMTRL | 0.6409 (0.05) | 432.2 (123.81) | 6.9 | 0.9343 (0.008) | 840.4 (22.67) | 53.6 | 0.7831 (0.02) | 1346.2 (85.99) | 128 |
| OSMTL-t | 0.6776 (0.03) | 333.6 (40.66) | 0.18 | 0.9509 (0.007) | 809.5 (19.35) | 1.4 | 0.9354 (0.01) | 1312.8 (79.15) | 2.15 |
| OSMTL-e | 0.6404 (0.04) | 458 (36.79) | 0.19 | 0.9596 (0.006) | 804.2 (19.05) | 1.3 | 0.9465 (0.01) | 1322.2 (80.27) | 2.16 |

Table 1 summarizes the performance of all the above algorithms on the three datasets. In addition to the AUC scores, we report the average total number of support vectors (nSV) and the CPU time taken for learning from one instance (Time). From the table, it is evident that OSMTL* outperforms all the baselines in terms of both AUC and nSV. This is expected relative to the two default baselines (PA-ITL and PA-ONE). We believe that PA-ONE shows a better result than PA-ITL on spam because the former learns the global information (common spam emails) that is quite dominant in the spam detection problem. The update rule for FOML is similar to ours but uses fixed weights. The results justify our claim that making the weights adaptive leads to improved performance.

In addition to better results, our algorithm consumes less or comparable CPU time relative to the baselines that take inter-task relationships into account. Compared to the OMTRL algorithm, which recomputes the task covariance matrix every iteration using expensive SVD routines, the adaptive weights in our method are updated independently for each task. As specified in [9], we learn the task weight vectors for OMTRL separately, as K independent perceptrons, on the first half of the available training data (EPOCH = 0.5). OMTRL thus potentially loses half the data without learning the task-relationship matrix, which depends on the quality of the task weight vectors.

It is evident from the table that the algorithms which use loss-based update weights η (OSGL, OSMTL*) considerably outperform the ones that do not (FOML, OMTRL). We believe that the loss incurred per instance gives the algorithm valuable information both for learning from that instance and for evaluating the inter-dependencies among tasks. Task relationship information does help by enabling learning from related tasks' data, but we demonstrate that combining both the task relationships and the loss information yields a better algorithm, as is evident from our experiments.
We would like to note that our proposed algorithm OSMTL* does exceptionally well on sentiment, which has been used as a standard benchmark application for domain adaptation experiments in the existing literature [19]. We believe the advantageous results on the sentiment dataset imply that, even with relatively few examples, effective knowledge transfer among tasks/domains can be achieved by adaptively choosing the (probabilistic) inter-task relationships from the data.

4 Conclusion

We proposed a novel online multi-task learning algorithm that jointly learns the per-task hypothesis and the inter-task relationships. The key idea is based on smoothing the loss function of each task w.r.t. a probabilistic distribution over all tasks, and adaptively refining such a distribution over time. In addition to closed-form updating rules, we showed that our method achieves a sub-linear regret bound. The effectiveness of our algorithm is empirically verified over several benchmark datasets.

Acknowledgments

This work is supported in part by NSF under grants IIS-1216282 and IIS-1546329.

References

[1] Koby Crammer and Yishay Mansour. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems, pages 1475–1483, 2012.
[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[3] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. The Journal of Machine Learning Research, 8:35–63, 2007.
[4] Yu Zhang and Dit-Yan Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.
[5] Jacob Abernethy, Peter Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. In Learning Theory, pages 484–498. Springer, 2007.
[6] Ofer Dekel, Philip M. Long, and Yoram Singer. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8(10):2233–2264, 2007.
[7] Gábor Lugosi, Omiros Papaspiliopoulos, and Gilles Stoltz. Online multi-task learning with hard constraints. arXiv preprint arXiv:0902.3526, 2009.
[8] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. The Journal of Machine Learning Research, 11:2901–2934, 2010.
[9] Avishek Saha, Piyush Rai, Suresh Venkatasubramanian, and Hal Daume. Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics, pages 643–651, 2011.
[10] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-138, 2008.
[11] Meghana Kshirsagar, Jaime Carbonell, and Judith Klein-Seetharaman. Multisource transfer learning for host-pathogen protein interaction prediction in unlabeled tasks. In NIPS Workshop on Machine Learning for Computational Biology, 2013.
[12] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
[13] Meghana Kshirsagar, Jaime Carbonell, and Judith Klein-Seetharaman. Multitask learning for host-pathogen protein interactions. Bioinformatics, 29(13):i217–i226, 2013.
[14] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
[15] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
[16] A. S. Nemirovsky, D. B. Yudin, and E. R. Dawson. Problem Complexity and Method Efficiency in Optimization. 1982.
[17] Shai Shalev-Shwartz and Yoram Singer. Online learning: Theory, algorithms, and applications. PhD thesis, 2007.
[18] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585, 2006.
[19] John Blitzer, Mark Dredze, Fernando Pereira, et al. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447, 2007.