# bayesian_maxmargin_multitask_learning_with_data_augmentation__b488d5d5.pdf Bayesian Max-margin Multi-Task Learning with Data Augmentation Chengtao Li CTLI.CS@HOTMAIL.COM Jun Zhu DCSZJ@MAIL.TSINGHUA.EDU.CN Jianfei Chen CHENJF10@MAILS.TSINGHUA.EDU.CN Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China Dept. of Comp. Sci. & Tech, TNList Lab, State Key Lab of Intell. Tech & Sys, Tsinghua University, Beijing, China Both max-margin and Bayesian methods have been extensively studied in multi-task learning, but have rarely been considered together. We present Bayesian max-margin multi-task learning, which conjoins the two schools of methods, thus allowing the discriminative max-margin methods to enjoy the great flexibility of Bayesian methods on incorporating rich prior information as well as performing nonparametric Bayesian feature learning with the latent dimensionality resolved from data. We develop Gibbs sampling algorithms by exploring data augmentation to deal with the non-smooth hinge loss. For nonparametric models, our algorithms do not need to make mean-field assumptions or truncated approximation. Empirical results demonstrate superior performance than competitors in both multi-task classification and regression. 1. Introduction Multi-task learning (MTL) (Caruana, 1997; Thrun & O Sullivan, 1996) has been widely studied in computer vision (Zhang et al., 2012), text classification (Zhu et al., 2013), and bioinformatics (Widmer & R atsch, 2012). MTL is particularly good for the scenarios where some tasks are under sampled. The primary belief of MTL is that solving multiple potentially correlated tasks together could improve the performance of some (or all) of these tasks by sharing statistic strength. According to the strategies on sharing statistics, existing methods can be grouped into two categories. The first category consists of approaches that aim to discover the relationships between tasks (Bakker & Heskes, 2003; Jacob et al., 2008; Xue et al., 2007; Yu et al., Proceedings of the 31 st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s). 2007; Hariharan et al., 2010), while the other one consists of the approaches that aim to mine the related features or find a common feature structure shared by all tasks (Argyriou et al., 2006; Chen et al., 2009; Rai & Daume, 2010; Obozinski et al., 2010). Recent attempts have been made to simultaneously estimate task correlations and feature correlations in one unified learning method, under either a Bayesian or a max-margin framework. Bayesian methods (Archambeau et al., 2011; Yang et al., 2013) employ hierarchical structures to model multi-task data and try to extract both types of correlations by setting proper priors. Though enjoying great flexibility for incorporating latent variables and performing nonparametric Bayesian inference, the generative nature could make these Bayesian methods less than sufficient in predictive learning. On the other hand, the max-margin MTL methods (Zhang & Schneider, 2010) could also learn task correlations and feature correlations simultaneously by using a proper regularization over model parameters, under a regularized loss minimization framework. Though max-margin methods enjoy the strong discriminative ability, they are usually lack of the flexibility of Bayesian methods. One recent work that attempts to bring max-margin learning and Bayesian multi-task learning together is the multitask infinite latent SVM (MT-i LSVM) (Zhu et al., 2014b), which learns a common projection matrix to share statistics among multiple tasks. By performing max-margin learning, it could achieve promising prediction results. It also applies nonparametric Bayesian methods to automatically resolve the dimensionality of the projection matrix. However, MT-i LSVM solely focuses on learning latent features, while not considering task correlations. Furthermore, for computational tractability, MT-i LSVM adopts variational approximation methods with truncated mean-field assumptions, which could be too strict to be realistic in practice. This paper presents a systematical study of Bayesian maxmargin multi-task learning (BM-MTL) for both classification and regression. First, we present a generic framework of performing Bayesian max-margin MTL, which can Bayesian Max-margin Multi-Task Learning with Data Augmentation use structured priors, e.g., the matrix normal prior (Archambeau et al., 2011), to jointly estimate task correlations and feature correlations. Second, we extend the basic BM-MTL framework and present a nonparametric Bayesian max-margin MTL method (NPBM-MTL), which can learn latent feature representations and estimate task correlations, with latent dimensionality automatically resolved from data. Finally, for all the Bayesian max-margin MTL methods, we develop simple Gibbs sampling algorithms by exploring data augmentation techniques (Tanner & Wong, 1987; Polson & Scott, 2011; Zhu et al., 2014a). Unlike the truncated variational mean-field methods of MTi LSVM, our algorithms for the nonparametric Bayesian methods do not make any restrictive assumptions and are truncation free, thus allowing for inference of the true posterior distributions. We empirically study the effectiveness of our methods and make comparisons with other stateof-the-art models. Our results on several real data sets demonstrate superior performance than various competitors in both multi-task classification and regression tasks. 2. Background We briefly overview Multi-Task SVM and Bayesian MTL approaches, on which our methods are based. 2.1. Multi-Task SVM and Extensions Let L be the number of tasks. We denote the multi-task training set by D = (X, Y) = {{(xil, yil)}Nl i=1}L l=1, where Nl is the number of instances in task l. Each data instance is a pair (xil, yil) with xil RD being an input feature vector and yil being a response variable. Without loss of generality, we consider binary classification, where yil equals to +1 if the label of instance i in task l is positive and 1 otherwise. For a linear MTL model, we characterize each learning task l by a parameter vector ηl RD. Let η denote the D L matrix formed by concatenating ηl s of all the tasks. The prediction of the i-th sample in the l-th task is given by the sign rule ˆyil = sgn(η l xil), where sgn(x) is +1 if x 0 and 1 otherwise. Multi-task SVM (MTSVM) employs hinge loss as its loss measure for each instance, i.e., for the i-th sample in task l, the loss is calculated as max(0, ζil), where ζil = t yilη l xil with t specifying the panelty of making a wrong prediction. The objective function of MTSVM is then defined as min η 1 2 Reg(η) + 2C max(0, ζil) , (1) where Reg(η) is the regularization term over η. A na ıve choice of the regularization is Vec(η) 2, where Vec(η) is the vectorization of η with dimension DL 1. With such a regularization the model degenerates to a set of singletask SVMs, one for each task, thus doesn t share statistics among tasks. Many efforts have been done to fit this max- margin framework to multi-task learning problems better and make good use of hidden correlation information. Previous work has focused on designing appropriate regularization terms to capture the correlations among tasks and features. For example, Argyriou et al. (2006) used the regularization of Tr{η Ω 1η} to model feature correlations via the feature correlation matrix Ω; Hariharan et al. (2010) and Zhang & Yeung (2012) used Tr{ηR 1η } as regularization to correlate ηl s together via the task correlation matrix R; and Zhang & Schneider (2010) proposed to use Tr{Ω 1ηR 1η } as the regularization term, and added additional parameters related to |R| and |Ω| to learn sparse correlations among both tasks and features. Though powerful on learning discriminative models, all the aforementioned methods perform point estimation to learn a single large-margin model by solving a regularized loss minimization problem. Such a deterministic formulation could limit the model flexibility. For instance, it lacks the flexibility of Bayesian methods on incorporating latent variables, leveraging informative priors, or performing nonparametric inference, as reviewed later. 2.2. Bayesian Multi-task Learning Due to the great flexibility for incorporating rich priors and performing nonparametric inference, Bayesian methods have been widely used for multi-task learning. For example, Xue et al. (2007) employed a nonparametric hierarchical structure in Bayesian framework to automatically adjust model complexity and learn the shared statistics among tasks; Archambeau et al. (2011) assumed a matrix normal prior over the model parameters with the mean being decomposed into two latent matrices and with the covariance matrices following an inverse Wishart distribution; later Yang et al. (2013) extended such approaches with the covariance matrices of model parameter following a more complex matrix generalized inverse Gaussian (Barndorff Nielsen et al., 1982) prior. Bayesian models allow one to learn latent structures hiding in the data, and by placing smart prior distributions like matrix normal over model parameters, one would be able to infer the task and feature correlations simply by doing Bayesian inference. However, such Bayesian models have difficulty to incorporate side information like discriminative margin constraints and structural bias, thus may suffer in performance. Finally, under a similar constrained Bayesian inference framework, Koyejo & Ghosh (2013) presented a Bayesian MTL method, where a nuclear norm constraint on the predictive weight matrix was used to force a low rank solution. 3. Bayesian Max-margin Multi-task Learning The advantages of max-margin learning and Bayesian methods could be integrated by bringing them together and do Bayesian max-margin multi-task learning (BM-MTL). Bayesian Max-margin Multi-Task Learning with Data Augmentation We now present a BM-MTL method to jointly estimate task correlations and feature correlations. 3.1. The Model As a Bayesian model, we learn a posterior distribution of the classifier weights, q(η). We adopt the approach of Gibbs classifiers (Mc Allester, 2003; Germain et al., 2009) to account for the uncertainty of the models. Specifically, if a random sample η is drawn from q(η), we can make predictions using the same sign rule as in the deterministic MT-SVM, and measure the goodness of the sampled classifier using the hinge-loss as a surrogate for the training error. Gibbs classifiers consider the posterior distribution by taking the expectation and define the expected hinge loss i=1 Eq max(0, ζil) , (2) which is a good surrogate (in fact, an upper bound) for the expected training error, t PL l=1 PNl i=1 Eq[I(ˆyil = yil)]. Given a prior distribution over η, p0(η), we define the Bayesian max-margin multi-task learning (BM-MTL) as solving the entropy-regularized loss minimization problem min q(η) P L q(η) + 2C R q(η) , (3) where L(q(η)) = KL(q(η) p0(η)) and P is the probability simplex with an appropriate dimension. We should note that although the entropy-regularized loss minimization problem looks similar to that of maximum entropy discrimination (MED) (Jaakkola et al., 1999), this loss function derived from a Gibbs classifier has rarely been studied. As we shall see, it will lead to simple Gibbs sampling algorithms by exploring data augmentation. Moreover, in a Bayesian formulation, we have the flexibility to incorporate a likelihood model if necessary. Let p(D|γ) be a likelihood parameterized by γ. We can perform regularized Bayesian inference by solving the augmented problem min q(η,γ) P L q(η, γ) + 2C R q(η) , (4) where L(q(η, γ)) = KL(q p0(η, γ)) Eq[log p(D|γ)]. Let ψ(yil|η, xil) = exp{ 2C max(0, ζil)} be the unnormalized likelihood of yil for the i-th sample in l-th task. Then solving problem (3) will result in the posterior distribution q(η|D) = p0(η)ψ(Y|η, X)/Γ(D), where ψ(Y|η, X) = QL l=1 QNl i=1 ψ(yil|η, xil) and Γ(D) is a normalization factor to make q a normalized probability distribution. This update rule is in a similar form as Bayes rule. Such a transformation from max-margin learning to Bayesian inference results in discriminative models that inherit the flexibility of Bayesian methods. Comparing with standard Bayesian methods, our method has the flexibility xil Ω Ψ1,ν1 Figure 1. Graphical models for (a) BM-MTL; (b) NPBM-MTL. of incorporating rich side-information such as max-margin posterior regularization, which is hard to be done within full Bayesian framework. The regularization term in MTSVM within this Bayesian formalism could be explained as a prior distribution over η. The na ıve regularization of Vec(η) 2 corresponds to a prior of multi-variate normal distribution over Vec(η). As we have stated, such a regularization (prior) is ignorant of the fact that η is actually a D L matrix instead of a single column vector and results in a covariance of η of size DL DL, which is usually prohibitive for modeling and estimation. To capture the structure of η, matrix normal prior could be used (Yang et al., 2013; Archambeau et al., 2011), which assume that the DL DL covariance matrix could be decomposed as a Kronecker product Ω R, and η follows Vec(η) ND,L(0, Ω R), i.e., p0(η|Ω, R) = exp 1 2 Tr(Ω 1ηR 1η ) (2π)DL/2|R|D/2|Ω|L/2 . (5) Such a prior has the advantage in that it could use decomposed covariance matrix to model latent correlations in data, where Ωcorresponds to feature correlations and R corresponds to task correlations respectively, as described in (Zhang & Schneider, 2010). The Bayesian framework allows us to view correlation matrices as random variables and include proper priors for them. There are many possible choices (Archambeau et al., 2011; Yang et al., 2013). We adopt the conjugate inverse Wishart priors (Mardia et al., 1980), namely, Ω IW(Ψ1, ν1) and R IW(Ψ2, ν2). Then, the posterior probability becomes q(η, Ω, R|D) p0(Ω|Ψ1, ν1)p0(R|Ψ2, ν2) p0(η|Ω, R)ψ(Y|η, X). (6) Fig. 1(a) shows the graphical structure of the model. 3.2. Gibbs Sampling with Data Augmentation Although the expected hinge loss R is hard to deal with, by using ideas of data augmentation (Tanner & Wong, 1987; Polson & Scott, 2011) we can express the posterior distribution as a marginal of a higher dimensional distribution Bayesian Max-margin Multi-Task Learning with Data Augmentation with augmented variables and then develop a simple Gibbs sampler. Specifically, the unnormalized likelihood for each label yil can be expressed as ψ(yil|η, xil) = Z 1 2πλil exp (Cζil + λil)2 with the augmented variable λil (0, ). This result indicates that the posterior distribution q(η, Ω, R|D) can be expressed as the marginal of a higher dimensional distribution with the augmented variables λ = {{λil}Nl i=1}L l=1, where the complete posterior distribution is q(η, Ω, R, λ|D) p0(Ω)p0(R)p0(η|Ω, R)ψ(Y, λ|η, X) with ψ(y, λ|η, X) = QL l=1 QNl i=1 ψ(yil, λil|η, X) and ψ(yil, λil|η, X) = 1 2πλil exp{ 1 2λil (Cζil + λil)2}. With the data augmentation representation, we can develop a simple Gibbs sampler to infer the complete distribution q(η, Ω, R, λ|D) and thus the target posterior q(η, Ω, R|D) by dropping λ. The Gibbs sampler iteratively performs the following steps: For η: we have conditional distribution: q(η|Ω, R, λ) p0(η|Ω, R)ψ(Y, λ|η, X). Though jointly sampling η will lead to a high-dimensional Gaussian, we can effectively sample each ηl task-wisely. This leads to sampling from a low dimensional Gaussian. Namely, we have the conditional probabilities q(ηl|Ω, R, λ, η l) exp 1 2 Tr{Ω 1ηR 1η } i=1 exp (Cζil + λil)2 = N(µl, Σl), (7) where the posterior covariance matrix and mean are computed as Σl = (PNl i=1 C2 xilx il λil + R 1 ll Ω 1) 1 and µl = k =l R 1 lk Ω 1ηk + C PNl i=1 yil( Ct λil + 1)xil), respectively. We can easily draw a sample from a Ddimensional normal distribution, and the inverse can be robustly done using Cholesky decomposition, an O(D3) procedure. Thus the sampling could be done efficiently when D is not very large. For Ωand R: due to the conjugacy, we have the inverse Wishart conditional distributions: q(Ω|R, η, λ) = IW Ψ1 + ηR 1η , ν1 + L , q(R|Ω, η, λ) = IW Ψ2 + η Ω 1η, ν2 + D . (8) The sampling procedure involves matrix inversions of sizes D D and L L which, again, could be done robustly with Cholesky decomposition. Thus the inversion would be efficient when D and L are not very large. For λ: due to the conditional independence, we can sample each λil seperately from a generalized inverse Gaussian distribution (Devroye, 1986) q(λil|η, Ω, R) 1 2πλil exp 1 2λil (Cζil + λil)2 = GIG λil; 1 2, 1, C2ζ2 il Therefore, λ 1 il follows an inverse Gaussian distribution q(λ 1 il |η, Ω, R) = IG λ 1 il ; 1 C|ζil|, 1 (10) where IG(x; a, b) = q b 2πx3 exp b(x 1)2 2a2x for a, b > 0. We can draw a sample from an inverse Gaussian distribution in constant time (Michael et al., 1976). 4. Nonparametric Bayesian Extensions We present an extension of the basic framework for nonparametric Bayesian inference, thus allowing to automatically resolve the model complexity in learning latent features for multi-task learning. 4.1. The Model We again consider the simple linear discriminant function f(x; ηl) = η l x and make predictions using the sign rule for multi-task binary classification. Instead of imposing a zero-mean prior on η, which provides no structural information about what model parameters would look like, we take the suggestions by (Archambeau et al., 2011; Yang et al., 2013) and impose the structured non-zero mean matrix normal prior η ND,L(ZV, I R), where both Z and V are latent and Z works as a projection matrix. Using such a decomposed mean would resort to a low rank approximation of correlation matrix as in (Archambeau et al., 2011; Yang et al., 2013) and provide structural information for η. As for the task correlation matrix R, we again assume an inverse Wishart prior with hyper-parameters Ψ and ν. We should note that the reason for choosing such a covariance for η, as shall be seen soon, is that the modeling of the common projection matrix Z shared by all the tasks is essentially an indirect modeling of feature correlation matrix Ω. Thus there is no need to assume another matrix for the same use. The graphical representation is shown in Figure 1(b). In (Archambeau et al., 2011; Yang et al., 2013), some Gaussian hyper-priors are imposed on Z and V, both of which are Bayesian Max-margin Multi-Task Learning with Data Augmentation assumed to have a finite and fixed dimension K. However, since K is an unknown parameter, a model selection procedure (e.g., cross-validation) is needed to select a good value of K. We address this issue in our Bayesian maxmargin MTL by exploring the flexibility on nonparametric inference, where we let Z have an unbounded number of columns, corresponding to an unbounded latent feature dimensionality. For simplicity, we start with the case where Z has fixed K finite columns. We impose the the matrix normal prior for V, V NK,L(0, I R). Then, we can show that the marginal prior p0(η|Z, R) by integrating out V is p0(η|Z, R) = exp 1 2 Tr{R 1η (I ZMZ )η} (2π)DL/2|R|D/2|I + Z Z|L/2 , where M = (I+Z Z) 1. For Z, without loss of generality, we assume it is a binary matrix, i.e., each entry is 0 or 1. For the finite case, a simple prior is the Beta-Bernoulli prior, that is, each column k is associated with a parameter πk Beta(α/K, 1) and the entries in column k are i.i.d, namely, zik Bernoulli(πk). We can generalize the above process to let Z have an infinite number of columns. The infinite generalization of the Beta-Bernoulli prior is the hierarchical Beta process (Thibaux & Jordan, 2007) (also known as Indian buffet process, IBP (Griffiths & Ghahramani, 2005)). Although Z is allowed to have an infinite number of columns, it would have a finite number of non-zero, or active, columns with probability 1 under the IBP prior. We denote the matrix formed by combining these active columns as Z+. The column number of Z+ is denoted as K+, corresponding to learned latent feature dimensionality. Then, with the infinite limit of K, the marginal prior becomes p0(η|Z, R) = exp 1 2 Tr{R 1η Q 1η} (2π)DL/2|R|D/2|I + Z +Z+|L/2 , where we let Q = (I Z+(I + Z +Z+) 1Z +) 1 for simplicity of notations. We should note that, for all the terms related to η, by substituting Q with Ωwe will get exactly the same conditional probability as in the parametric case, indicating that modeling the common projection matrix Z is essentially modeling the feature correlation matrix. That s why we force parts of the covariance in the prior of η to be identity, or else it would be redundant. With the sampled η, we can similarly measure the training error. We again adopt the hinge loss as a surrogate and take the expectation to account for the uncertainty of η. This leads to the optimization problem min q(η,Z,R) P L q(η, Z, R) + 2C R q(η) , (11) where L(q(η, Z, R)) = KL(q p0(η, Z, R)). 4.2. Gibbs Sampling with Data Augmentation By applying the same data augmentation techniques, we are able to obtain the augmented posterior distribution q(η, Z, R, λ) p0(Z)p0(R)p0(η|Z, R)ψ(Y, λ|η, X), from which we can develop a Gibbs sampler to draw samples. Specifically, the sampling probabilities for η, λ and R are exactly the same as before except that we replace Ω in the original probabilities by Q. For the binary matrix Z, we do the sampling entry-wisely. For the existing column k, we sample zdk via p(zdk = 1|Z (dk), η, R) mk zdk D p(η|Z, R), (12) where mk is the number of non-zero entries in the k-th column of Z. Note that if there exists no non-zero entries except zdk, the sampler would directly remove this column. A number of new columns would be sampled via p(Knew) Poisson Knew, α p(η|Znew, R), (13) where Znew is obtained by appending Knew columns of ed to Z, where ed is a 0-vector except the d-th entry being 1. 5. Multi-Task Regression The above techniques could be generalized to tackle multitask regression problems, where the response variable yil for each instance xil takes real values. For the i-th instance in the l-th task, we consider the linear prediction rule, yil = η l xil. One widely used margin-based loss measure is the ε-insensitive loss Rε(η) = P il max(0, | il| ε) for support vector regression (Smola & Sch olkopf, 2004), where il = yil η l xil is the margin. To do Bayesian max-margin learning, we define the expected ε-intensive loss as Rε(q(η)) = P il Eq(max(0, | il| ε)). Then we define BM-MTL regression model as solving the entropyregularized loss minimization problem min q(η) P L q(η) + 2C Rε q(η) . The resulting posterior probability follows the form as in Eq. (6), with the pseudo-likelihood being ψ(Y|η, X) QL l=1 QNl i=1 exp{ 2C max(0, | il| ε)}. By noting that max(0, |x| ε) = max(0, x ε) + max(0, x ε), we can do similar data augmentation and express the unnormalized likelihood for each instance (xil, yil) as ψ(yil|η, xil) = Z 0 ψ(yil, λil, ωil|η, xil)dλildωil, where ψ(yil, λil, ωil|η, xil) = 1 2πλil exp{ 1 2λil (λil + C( il ε))2} 1 2πωil exp{ 1 2ωil (ωil C( il + ε))2}, Bayesian Max-margin Multi-Task Learning with Data Augmentation with the augmented variables λ = {{λil}Nl i=1}L l=1 and ω = {{ωil}Nl i=1}L l=1. Then, the target posterior distribution q(η, Ω, R) is the marginal of the complete posterior q(η, Ω, R, λ, ω) p0(Ω)p0(R)p0(η|Ω, R)ψ(Y, λ, ω|η, X), where ψ(y, λ, ω|η, X) = Q i ψ(yil, λil, ωil|η, xil). Following similar derivations as in the classification models, the Gibbs sampler iteratively performs the steps: For η: again we can employ an effective sampling of each ηl task-wisely. With the matrix normal prior on η, the conditional probability is still Gaussian: q(ηl|Ω, R, λ, η l) N(µl, Σl), (14) where the posterior covariance matrix and mean are computed as Σl = (P i C2ρilxix i + R 1 ll Ω 1) 1 and µl = Σl( 1 k =l R 1 lk Ω 1ηk + P i C2(yilρil εσil)xi) with ρil = ( 1 λil + 1 ωil ) and σil = ( 1 λil 1 ωil ). For λ and ω: the conditional distribution for both sets of augmented variables are still inverse Gaussian: q(λ 1 il |η, Ω, R) = IG λ 1 il ; 1 C| il ε|, 1 , q(ω 1 il |η, Ω, R) = IG ω 1 il ; 1 C| il + ε|, 1 . (15) The sampling probabilities for Ωand R are exactly the same as Eq. (8) in the classification case. We could futher derive the sampling probabilities for NPBM-MTL model by substituting Ωwith Q in sampling distributions of the parametric regression case, together with the sampling probability of Z following Eq. (12) and Eq. (13). 6. Experiments We present empirical studies for both multi-task classification and multi-task regression. 6.1. Multi-Task Classification 6.1.1. DATASETS, METHODS AND SETUPS We use four multi-label datasets that are publicly available1. Table 1 summarizes their statistics. For multi-label classification, we formulate it as a multi-task learning problem, where each task is a binary classifier determining whether an instance has a particular label. We compare with various baselines, including LR (Independent Logistic Regression), MTL-C (Clustered Multitask Learning) (Jacob et al., 2008), MTL-F (Multitask Feature Learning) (Argyriou et al., 2006) and MTi LSVM (Multi-task infinite latent SVMs) (Zhu et al., 1http://mulan.sourceforge.net/datasets. html Table 1. Statistics of various datasets, where Avg-labels stands for the average number of labels per instance. DATASETS NUM SAMPLES L D AVG-LABELS EMOTIONS 593 6 72 1.869 CAL500 502 174 68 26.044 SCENE 2,407 6 294 1.074 YEAST 2,417 14 103 4.237 2011). MTL-C could learn clusters of different tasks to capture the task correlations. MTL-F learns a lowdimensional representation shared across a set of related tasks, while MT-i LSVM learns an infinite dimensional feature representation. Both MTL-F and MT-i LSVM are essentially learning the feature correlations. In BM-MTL, we have the flexibility to learn the correlation matrices or simply set them to some fixed matrices. There are also various choices for hyper-parameters Ψ1 and Ψ2 in inverse Wishart priors. We either set R or Ωto be fixed identity or sample them with Ψ1 or Ψ2 being identity, yielding four different configurations of BM-MTL listed below: (I-Ω&I-R) We fix both Ωand R to be identity. (Ω&I-R) We fix R to be identity and learn Ωwith Ψ1 being identity. (I-Ω&R) We fix Ωto be identity and learn R with Ψ2 being identity. (Ω&R) We learn both Ωand R with Ψ1 and Ψ2 being identities. Finally, for the nonparametric NPBM-MTL, we learn both Z and R with Ψ being identity. The regularization parameter C is chosen from [10 3, 103] using 5-fold cross validation. We use the F1-score as the performance measure, which accounts for label bias in the datasets. To avoid the condition that the sampled Ω 1 and R 1 are too large or close to zero, we divide the sampled matrices by a constant number and force the diagonal entries to be at least one. 6.1.2. RESULTS AND ANALYSIS Table 2 shows the mean F1-scores and standard deviation (in parentheses) obtained from runs on 5 random splits of each dataset, where each row corresponds to a model and each column corresponds to a dataset. Our random splits are obtained following the datasets original trainingtesting split ratios, which are provided along with the data. We observe that among the four BM-MTL models with various configurations, the one with (Ω&R) provides consistently promising results it either offers the best performance or gives a competitive one. This shows the benefits brought by simultaneously estimating task correlations and feature correlations. As for the two non-parametric methods, NPBM-MTL obtains significant improvements over MT-i LSVM by modeling task correlations and relieving Bayesian Max-margin Multi-Task Learning with Data Augmentation Table 2. Results for different models on the four datasets. Competitive results are shown in bold, and the best one is marked with * . MODELS F1(%) F1(%) F1(%) F1(%) EMOTIONS CAL500 SCENE YEAST LR 54.83 (1.63) 32.05 (1.63) 61.65 (1.45) 61.44 (0.92) MTL-C 63.50 (1.67) 33.51 (1.39) 64.98 (0.66) 61.86 (0.69) MTL-F 64.51 (1.10) 34.01 (1.47) 64.23 (0.89) 62.66 (0.55) BM-MTL(I-Ω&I-R) 64.16 (1.19) 34.65 (1.06) 65.85 (0.78) 63.75 (0.57) BM-MTL(Ω&I-R) 65.22 (1.34) 34.91 (1.05) 65.97 (0.70) 64.20 (0.59) BM-MTL(I-Ω&R) 64.97 (1.37) 35.07 (1.22) 66.30 (0.70) 64.16 (0.53) BM-MTL(Ω&R) 65.67 (1.32)* 35.54 (1.09) 66.33 (0.70) 64.36 (0.58)* MT-ILSVM 62.27(2.25) 32.20(1.09) 62.27(2.46) 61.81(0.60) NPBM-MTL(Z&R) 65.53 (1.34) 35.67 (0.69)* 66.85 (0.55)* 64.21 (0.44) the truncated mean-field assumption made by MT-i LSVM, which only estimates the feature correlations. Overall, both BM-MTL and NPBM-MTL could significantly improve the performance over existing competitors2, mainly due to the advantages brought by conjoining Bayesian methods and max-margin learning as well as the joint estimation of task correlations and feature correlations. We also compare with the methods in (Archambeau et al., 2011). Since their code is not available, we compare with the reported accuracy on the yeast and scene datasets. Our NPBM-MTL achieves an accuracy of 80.01 (0.06) on yeast, slightly better than the reported accuracy of 79.88 (0.17) in (Archambeau et al., 2011), and gives competitive accuracy of 88.95 (0.07) on scene where the reported result is 88.97 (0.34). Our model could not only give stabler performance, but also free users from manually tuning the latent feature dimensionality, contrast to methods in (Archambeau et al., 2011). 6.2. Multi-Task Regression 6.2.1. DATASETS, METHODS AND SETUPS We use the public School dataset, which consists of the examination records of 15,362 students from 139 secondary schools in years 1985, 1986 and 1987. The dataset has been used to study the effectiveness of schools. It has been used extensively to evaluate multi-task learning methods (Bakker & Heskes, 2003; Zhang & Yeung, 2012), where the goal of the task is to predict the exam scores of students from different schools based on four studentdependent features and four school-dependent features. To make a fair comparison, we follow the same setup described in (Bakker & Heskes, 2003) and use the same 10 random splits of the data to do the training and testing. The average number of students included in training set is about 80 per school, while the number in testing set is about 30. We compare with the state-of-the-art results of BMTL (Bayesian Multi-task Learning) (Bakker & Heskes, 2We performed the 2-tailed t-test. For BM-MLT, the p-values over all the existing methods on all the datasets are less than 0.027; similarly for NPBM-MTL. Table 3. Experimental results for different models on School dataset. PEV stands for the percentage of explained variance. BASELINES PEV(%) MODEL PEV(%) STL 23.5 (1.9) BM-MTL(I-Ω&I-R) 31.12 (1.02) BMTL 29.5 (0.4) BM-MTL(Ω&I-R) 31.21 (1.02) MTGP 29.2 (1.6) BM-MTL(I-Ω&R) 31.54 (1.04) MTRL 29.9 (1.8) BM-MTL(Ω&R) 31.56 (1.09) MT-ILSVM 31.13 (1.15) NPBM-MTL(Z&R) 31.86 (1.00) 2003), MTGP (Multi-task Gaussian Processes) (Bonilla et al., 2007), MTRL (Convex Multi-task Relationship Learning) (Zhang & Yeung, 2012), STL (Single Task Learning) as reported in (Bonilla et al., 2007; Zhang & Yeung, 2012) and the MT-i LSVM regression model (Zhu et al., 2011). We empirically set ε to be 0.001. The hyper-parameter of the IBP prior of Z (i.e., α) is fixed at 5 in this experiment, and we will return to investigate the sensitivity over this parameter in Section 6.3.2. We search the regularization parameter C in the range of [0.1,10] with 5-fold cross validation. For performance measure, we use percentage of explained variance (PEV) (Bakker & Heskes, 2003). PEV is defined as the total variance of the data minus the sum-squared error on the test set as a percentage of the total variance. 6.2.2. RESULTS AND ANALYSIS Table 3 shows the results. We can see that both BM-MTL and NPBM-MTL could work very well on regression tasks. For the four configurations of BM-MTL, the hybrid learning of Ωand R again yields the best performance. Although all these four configurations could outperform other state-of-the-art methods, the variances are large compared to some of the baselines. While for NPBM-MTL, it not only gives the best performance, but also enjoys a small variance. This may result from the modeling of latent features when the original features are redundant and noisy, the learned latent feature dimensionality would be low, which could work as a refiner of data noise and result in simple and effective models. In Section 6.3.2, we will see that the learned latent feature dimensionality is low on School, which gives evidence to our statement. Bayesian Max-margin Multi-Task Learning with Data Augmentation 0 5 10 15 20 0.6 0 5 10 15 20 0.05 BM MTL(I 1&I R) BM MTL(1&I R) BM MTL(I 1&R) BM MTL(1&R) NPBM MTL Figure 2. Prediction performance w.r.t the number of iterations (i.e., burn-in steps) on (a) yeast and (b) School datasets. 0 5 10 15 20 0 0 5 10 15 20 0.05 _ = 1 _ = 2 _ = 4 _ = 8 _ = 16 Figure 3. Impact of α on (a) learned latent feature dimensions and (b) performance of NPBM-MTL. 6.3. More Discussions 6.3.1. BURN-IN STEPS We analyze the effects of the number of burn-in steps on the performance for both multi-task classification and regression problems. For classification, we choose the yeast dataset as an example, and for regression we choose one of the 10 random splits of the school data. Figure 2 presents the performance changes of both the parametric BM-MTL and nonparametric NPBM-MTL with respect to the number of burn-in steps. We can observe that all our methods converge quickly. For example, on both datasets, we need about 10 burn-in steps to get stable prediction performance. 6.3.2. SENSITIVITY ANALYSIS AND LATENT FEATURES Since the initialization of Z follows an IBP, the number of new columns added to Z follows Poisson( α i ) when initializing the i-th row of Z, where α is the IBP hyper-parameter. Thus the initial latent dimensions of Z would be different if using different α values, resulting in initial latent dimensions that may be far from the actual one. To analyze whether NPBM-MTL is sensitive to such an initialization, we analyze the influence of α on both the learned feature dimensionality and prediction performance on the School dataset. We set α to be 2[0:4]. Figure 3 shows the results. We can see that even though the initial dimensions vary a lot with different α, with no more than 10 iterations, all the NPBM-MTL models with different α values converge to a similar dimension, which is about 15. Meanwhile, the prediction performance gradually improves until the sampling procedure converges and the latent feature dimensionality settles down. This shows that NPBM-MTL is able to con- Table 4. Learned latent feature dimensionality K+ on different datasets. D is the original feature dimensionality. EMOTIONS CAL500 SCENE YEAST SCHOOL K+ 57.8 (2.0) 4.4 (0.5) 48.2 (1.9) 4.8 (0.8) 14.8 (1.1) D 72 68 294 103 28 verge quickly and is stable against different choices of α. We also observe in Table 4 that the inferred latent feature dimensions are generally much smaller than the original feature dimensions. For example, the original feature dimensionality in the School dataset is nearly 30, but the learned latent feature dimensionality is only about 15. On most datasets, the learned K+ is relatively small compared to D, thus enjoying a denoising effect and resulting in stabler performance (except on EMOTIONS, whose K+ is close to D, and the performance on EMOTIONS, as can be seen from Table 2, is not that stable). The low ratio of K+/D indicates that usually data has redundant features, and by applying our non-parametric method we would be able to learn the true feature dimensionality automatically instead of setting them manually in the model. 7. Conclusions and Future Work We present Bayesian max-margin multi-task learning, which conjoins the max-margin learning with Bayesian formalism and allows discriminative max-margin models to enjoy the great flexibility of Bayesian methods on incorporating rich prior information and performing nonparametric Bayesian inference to learn latent features or structures as well as the latent feature dimensionality. We present simple Gibbs sampling algorithms by exploring data augmentation techniques. Our algorithms are truncation-free and assumption-free when applied to nonparametric Bayesian models. Experimental results demonstrate promising prediction performance over various competitors. Bayesian max-margin models for multi-task learning could be further extended. For example, both BM-MTL and NPBM-MTL may be expensive in training when the feature dimension is huge, because the sampling of weight matrices would involve large matrix inversion. To tackle such a problem, feature-dimension reduction techniques may be used. Another interesting direction is to selectively share information among tasks (Kumar & III, 2012), which may be beneficial especially when we have many tasks. In theory, we can consider the selective sharing structure by formulating it via some prior. We leave these for future work. Acknowledgments The work is supported by the National Basic Research Program of China (No. 2013CB329403), National Natural Science Foundation of China (Nos. 61322308, 61332007), and Tsinghua University Initiative Scientific Research Program (No. 20121088071). Bayesian Max-margin Multi-Task Learning with Data Augmentation Archambeau, C., Guo, S., and Zoeter, O. Sparse Bayesian multi-task learning. In NIPS, pp. 1755 1763, 2011. Argyriou, A., Evgeniou, T., and Pontil, M. Multi-task feature learning. In NIPS, pp. 41 48, 2006. Bakker, B. and Heskes, T. Task clustering and gating for Bayesian multitask learning. JMLR, 4:83 99, 2003. Barndorff-Nielsen, O., Blæsild, P., Jensen, J., and Jorgensen, B. Exponential transformation models. Proceedings of the Royal Society of London. A. Mathematical and Physical Sciences, 379:41 65, 1982. Bonilla, E., Chai, K., and Williams, C. Multi-task Gaussian process prediction. In NIPS, 2007. Caruana, R. Multitask learning. Machine Learning, 28: 41 75, 1997. Chen, J., Tang, L., and Liu, J.and Ye, J. A convex formulation for learning shared structures from multiple tasks. In ICML, pp. 137 144, 2009. Devroye, L. Non-uniform random variate generation. Springer-Verlag, 1986. Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classifiers. In ICML, pp. 353 360, 2009. Griffiths, T. and Ghahramani, Z. Infinite latent feature models and the Indian Buffet Process. In NIPS, 2005. Hariharan, B., Zelnik-Manor, L., Vishwanathan, S. V. N., and Varma, M. Large scale max-margin multi-label classification with priors. In ICML, pp. 423 430, 2010. Jaakkola, T., Meila, M., and Jebara, T. Maximum entropy discrimination. In NIPS, 1999. Jacob, L., Bach, F., and Vert, J. Clustered multi-task learning: A convex formulation. In NIPS, pp. 745 752, 2008. Koyejo, O. and Ghosh, J. Constrained bayesian inference for low rank multitask learning. UAI, pp. 97 106, 2013. Kumar, A. and III, Daume H. Learning task grouping and overlap in multi-task learning. ICML, pp. 690 697, 2012. Mardia, K., Kent, J., and Bibby, J. Multivariate analysis. Academic press, 1980. Mc Allester, D. PAC-Bayesian stochastic model selection. Machine Learning, 51:5 21, 2003. Michael, J.R., Schucany, W.R., and Haas, R.W. Generating random variates using transformations with multiple roots. The American Statistician, 30(2):88 90, 1976. Obozinski, G., Taskar, B., and Jordan, M. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20: 231 252, 2010. Polson, N. and Scott, S. Data augmentation for support vector machines. Bayesian Analysis, 6:1 24, 2011. Rai, P. and Daume, H. Infinite predictor subspace models for multitask learning. In AISTATS, pp. 613 620, 2010. Smola, A. and Sch olkopf, B. A tutorial on support vector regression. Statistics and Computing, 14:199 222, 2004. Tanner, M. and Wong, W. The calculation of posterior distributions by data augmentation. Journal of the Americal Statistical Association (JASA), 82(398):528 540, 1987. Thibaux, R. and Jordan, M. Hierarchical Beta processes and the Indian Buffet Process. In AISTATS, pp. 564 571, 2007. Thrun, S. and O Sullivan, J. Discovering structure in multiple learning tasks: The TC algorithm. In ICML, pp. 489 497, 1996. Widmer, C. and R atsch, G. Multitask learning in computational biology. JMLR, 27:207 216, 2012. Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. Multitask learning for classification with Dirichlet process priors. JMLR, 8:35 63, 2007. Yang, M., Li, Y., and Zhang, Z. Multi-task learning with Gaussian matrix generalized inverse Gaussian model. In ICML, pp. 423 431, 2013. Yu, S., Tresp, V., and Yu, K. Robust multi-task learning with t-processes. In ICML, pp. 1103 1110, 2007. Zhang, T., Ghanem, B., Liu, S., and Ahuja, N. Robust visual tracking via multi-task sparse learning. In CVPR, pp. 2042 2049, 2012. Zhang, Y. and Schneider, J. Learning multiple tasks with a sparse matrix-normal penalty. In NIPS, pp. 2550 2558, 2010. Zhang, Y. and Yeung, D. A convex formulation for learning task relationships in multi-task learning. ar Xiv:1203.3536, 2012. Zhu, J., Chen, N., and Xing, E. Infinite latent SVM for classification and multi-task learning. In NIPS, pp. 1620 1628, 2011. Zhu, J., Zheng, X., Zhou, L., and Zhang, B. Scalable inference in max-margin topic models. In KDD, pp. 964 972, 2013. Zhu, J., Chen, N., Perkins, H., and Zhang, B. Gibbs max-margin topic models with data augmentation. JMLR, 15:1073 1110, 2014a. Zhu, J., Chen, N., and Xing, E. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR (in press, ar Xiv:1210.1766v3), 2014b.