# multiplicative_multitask_feature_learning__7141194a.pdf

Journal of Machine Learning Research 17 (2016) 1-33 Submitted 5/15; Revised 11/15; Published 4/16

Multiplicative Multitask Feature Learning

Xin Wang wangxin@engr.uconn.edu Department of Computer Science and Engineering University of Connecticut Storrs, CT 06279, USA

Jinbo Bi jinbo@engr.uconn.edu Department of Computer Science and Engineering University of Connecticut Storrs, CT 06279, USA

Shipeng Yu shipeng.yu@siemens.com Health Services Innovation Center Siemens Healthcare Malvern, PA 19355, USA

Jiangwen Sun javon@engr.uconn.edu Department of Computer Science and Engineering University of Connecticut Storrs, CT 06279, USA

Minghu Song minghu.song@pfizer.com Worldwide Research and Development Pﬁzer Inc. Groton, CT 06340, USA

Editor: Urun Dogan, Marius Kloft, Francesco Orabona, and Tatiana Tommasi

We investigate a general framework of multiplicative multitask feature learning which decomposes individual task s model parameters into a multiplication of two components. One of the components is used across all tasks and the other component is task-speciﬁc. Several previous methods can be proved to be special cases of our framework. We study the theoretical properties of this framework when diﬀerent regularization conditions are applied to the two decomposed components. We prove that this framework is mathematically equivalent to the widely used multitask feature learning methods that are based on a joint regularization of all model parameters, but with a more general form of regularizers. Further, an analytical formula is derived for the across-task component as related to the task-speciﬁc component for all these regularizers, leading to a better understanding of the shrinkage eﬀects of diﬀerent regularizers. Study of this framework motivates new multitask learning algorithms. We propose two new learning formulations by varying the parameters in the proposed framework. An eﬃcient blockwise coordinate descent algorithm is developed suitable for solving the entire family of formulations with rigorous convergence analysis. Simulation studies have identiﬁed the statistical properties of data that would be in favor of the new formulations. Extensive empirical studies on various classiﬁcation and regression benchmark data sets have revealed the relative advantages of the two new

c 2016 Xin Wang, Jinbo Bi, Shipeng Yu, Jiangwen Sun and Minghu Song.

Wang, Bi, Yu, Sun and Song

formulations by comparing with the state of the art, which provides instructive insights into the feature learning problem with multiple tasks.

Keywords: Multitask learning, regularization, sparse modeling, blockwise coordinate descent

1. Introduction

Multitask learning (MTL) improves the generalization of the estimated models for multiple related learning tasks by capturing and exploiting the task relationships. It has been theoretically and empirically shown to be more eﬀective than learning tasks individually. Especially when single task learning suﬀers from limited sample size, multitask learning reinforces a single learning process with the transferable knowledge learned from the related tasks. Multi-task learning has been widely applied in many scientiﬁc ﬁelds, such as robotics (Zhang and Yeung, 2010), natural language processing (Ando and Zhang, 2005), computer aided diagnosis (Bi et al., 2008), and computer vision (Kang et al., 2011). Research eﬀorts have been devoted to various Multi Task Feature Learning (MTFL) algorithms. One direction of these works directly learns the dependencies among tasks, either by modeling the correlated regression or classiﬁcation noise (Greene, 2002), or assuming that the model parameters share a common prior (Yu et al., 2005; Lee et al., 2007; Xue et al., 2007; Yu et al., 2007; Jacob et al., 2008), or by examining the tasks covariance matrix (Bonilla et al., 2007; Zhang and Yeung, 2010; Guo et al., 2011; Yang et al., 2013). Another research direction relies on a basic assumption that the diﬀerent tasks may share a substructure in the feature space. In order to exploit this shared substructure, Rai and Daume (2010) project the task parameters to explore the latent common substructure. Kang et al. (2011) form a shared low-dimensional representation of data by feature learning. More recent methods explore the latent basis that can be used to characterize the entire set of tasks and examine the potential clusters of tasks. For instance, Passos et al. (2012) assume that subsets of features may be shared only between tasks in the same cluster. Kumar and Daume III (2012) allow overlapping of tasks in diﬀerent groups by having several bases in common. Maurer et al. (2013) exploit a dictionary allowing sparse representations of the tasks. Gong et al. (2012) detect if certain tasks are outliers from the majority of the tasks. Regularization methods are widely used in MTFL to learn a shared subset of features. A common strategy is to impose a blockwise joint regularization (Meier et al., 2008; Zhao et al., 2009) on all task model parameters to shrink the eﬀects of features across the tasks and simultaneously minimize the regression or classiﬁcation loss. These methods employ the so-called ℓ1,p matrix norm (Lee et al., 2010; Liu et al., 2009a; Obozinski and Taskar, 2006; Zhang et al., 2010; Zhou et al., 2010) that is the sum of the ℓp norms of the rows in a matrix. As a result, this regularizer encourages sparsity among the rows. If a row of the parameter matrix corresponds to a feature and a column represents an individual task, the ℓ1,p regularizer intends to rule out the unrelated features across tasks by shrinking the entire rows of the matrix to zero. Typical choices for p are 2 (Obozinski and Taskar, 2006; Evgeniou and Pontil, 2004) and (Turlach et al., 2005). Eﬀective algorithms have since then been developed for the ℓ1,2 (Liu et al., 2009a) and ℓ1, (Quattoni et al., 2009) regularization. Later, the ℓ1,p norm is generalized to include 1 < p with a probabilistic interpretation that the resultant MTFL method solves a relaxed optimization problem with a generalized

Multiplicative Multitask Feature Learning

normal prior for all tasks (Zhang et al., 2010). Although the matrix-norm based regularizers lead to convex learning formulations for MTFL, recent studies show that a convex regularizer may be too relaxed to approximate the ℓ0-type regularizer for the shrinkage eﬀects in the feature space and thus results in suboptimal performance (Gong et al., 2013). To address this problem, non-convex regularizers, such as the capped-ℓ1, ℓ1 regularizer (Gong et al., 2013), have been proposed for the multitask joint regularization. However, using non-convex regularizers may bring up computational challenges. For instance, non-convex formulations are usually diﬃcult to solve, and require more complicated optimization algorithms to guarantee satisfactory performance. For the existing MTFL methods based on joint regularization, a major limitation is that it either selects a feature as relevant to all tasks or excludes it from all models, which is very restrictive in practice where tasks may share some features but may also have their own speciﬁc features that are not relevant to other tasks. To overcome this limitation, one of the most eﬀective strategies is to decompose the model parameters into either summation (Jalali et al., 2010; Chen et al., 2011; Gong et al., 2012) or multiplication (Xiong et al., 2006; Bi et al., 2008; Lozano and Swirszcz, 2012) of two components with separate regularizers applied to the two components. One regularizer is imposed on the component taking care of the task-speciﬁc model parameters and the other one is imposed on the component for mining the cross-feature sparsity. Speciﬁcally, for the methods that decompose the parameter matrix into summation of two matrices, the dirty model in (Jalali et al., 2010) employs ℓ1,1 and ℓ1, regularizers to the two components. A robust MTFL method in (Chen et al., 2011) uses the trace norm on one component for mining a low-rank structure shared by tasks and a column-wise ℓ1,2-norm on the other component for identifying task outliers. A more recent method applies the ℓ1,2-norm both row-wisely to one component and column-wisely to the other (Gong et al., 2012). For these additive decomposition methods, it requires the corresponding entries in both components to be zero in order to exclude a feature from a task. For the methods that work with multiplicative decompositions, the parameter vector of each task is decomposed into an element-wise product of two vectors where one is used across tasks and the other is task-speciﬁc. To exclude a feature from a task, the multiplicative decomposition only requires one of the components to be zero. Existing methods of this line apply the same regularization to both of the component vectors, by either the ℓ2-norm penalty (Bi et al., 2008), or the sparse ℓ1-norm (i.e., multi-level LASSO (Lozano and Swirszcz, 2012)). The multi-level LASSO method has been analytically compared to the dirty model (Lozano and Swirszcz, 2012), showing that the multiplicative decomposition creates better shrinkage on the global and task-speciﬁc parameters. The across-task component can screen out the features irrelevant to all tasks. An individual task s component can further select features from those selected by the cross-task component for use in its corresponding model. Although there are diﬀerent ways to regularize the two components in the product, no systematic work has been done to analyze the algorithmic and statistical properties of the diﬀerent regularizers. It is insightful to answer the questions such as how these learning formulations diﬀer from the early methods based on blockwise joint regularization, how the optimal solutions of the two components intervene, and how the resultant solutions are compared with those of other methods that also learn both shared and task-speciﬁc features. We highlight the contributions of this paper as follows:

Wang, Bi, Yu, Sun and Song

We propose and examine a general framework of the multiplicative decomposition that enables a variety of regularizers to be applied. The general form corresponds to a family of MTFL methods, including all early methods that decompose model parameters as a product of two components (Bi et al., 2008; Lozano and Swirszcz, 2012).

Our theoretical analysis has revealed that this family of methods is actually equivalent to the joint regularization based approach but with a more general form of regularizers, including matrix-norm based and non-matrix-norm based regularizers. The non-matrix-norm based joint regularizers derived from the proposed framework have never been considered previously. If they are considered in the joint regularization form, the resultant optimization problems will be diﬃcult to solve. However, our equivalent multiplicative MTFL framework in this case uses convex regularizers on the two components, which can be solved eﬃciently.

Further analysis reveals that the optimal solution of the across-task component can be analytically computed by a formula of the optimal task-speciﬁc parameters. This analytical result facilitates a better understanding of the shrinkage eﬀects of the diﬀerent regularizers applied to the two components.

Statistical justiﬁcation is also derived for this family of formulations. It proves that the proposed framework is equivalent to maximizing a lower bound of the maximum a posterior solution under a probabilistic model that assumes generalized normal priors on model parameters.

Two new MTFL formulations are derived from the proposed general framework. Unlike the existing methods (Bi et al., 2008; Lozano and Swirszcz, 2012) where the same kind of vector norm is applied to both components, the shrinkage of the global and task-speciﬁc parameters diﬀers in the new formulations. We empirically illustrate the scenarios where the two new formulations are more suitable for solving the MTFL problems.

An eﬃcient blockwise coordinate descent algorithm is derived suitable for solving the entire family of the methods. Given some of the methods (including the two new formulations we study) correspond to non-matrix-norm based joint regularizers, our algorithm provides a powerful alternative to solving the related diﬃcult optimization problems, allowing us to explore the behaviors and properties of these regularizers in an eﬀective way. Convergence analysis is thoroughly discussed.

To depict the diﬀerences between our approach and previous methods, Table 1 summaries various regularizers used in the joint regularization based and model decomposition based MTFL methods. There are some fundamental connections among these methods. As studied in the present work, multiplicative MTFL models can be connected with the early blockwise joint regularization models. We also attempt to empirically compare the model behaviors between the multiplicative and additive MTFL methods, and particularly compare the applicability of the four diﬀerent choices of regularization listed in Table 1 for multiplicative MTFL .

Multiplicative Multitask Feature Learning

Table 1: The regularization terms used in various MTFL methods.

Model Methods Norms

Joint regularization models

Evgeniou and Pontil (2004) ℓ1,2 Turlach et al. (2005) ℓ1, Lee et al. (2010) both ℓ1,1 and ℓ1,2 Zhang et al. (2010) ℓ1,p, 1 < p Gong et al. (2012) capped ℓ1,1

Decomposed models

A = P + Q Jalali et al. (2010) ℓ1,1 on P, ℓ1, on Q Gong et al. (2012) ℓ1,2 on P, ℓ1,2 on QT

A = diag(c) B

The proposed framework (ℓk)k on c, (ℓp)p on B Bi et al. (2008) k=2, p=2 Lozano and Swirszcz (2012) k=1, p=1 The new formulation 1 k=1, p=2 The new formulation 2 k=2, p=1

The rest of the paper is organized as follows. Section 2 deﬁnes the mathematical notation and introduces the proposed models in detail. Section 3 discusses several important theoretical properties of the proposed models including equivalence analysis. Section 4 provides the statistical justiﬁcation of the multiplicative MTFL models. In Section 5, we develop an eﬃcient algorithm to solve the optimization problems in the proposed models with a convergence analysis. Section 6 shows the empirical results, in which simulations have been designed to examine the various feature sharing patterns for which a speciﬁc choice of regularizer may be preferred. Extensive experiments with a variety of classiﬁcation and regression benchmark data sets are also described in Section 6. In Section 7, we conclude this work.

2. The Proposed Multiplicative MTFL

Given T tasks in total, for each task t, t {1, , T}, we have sample set (Xt Rℓt d, yt Rℓt). The data set of Xt has ℓt examples, where the i-th row corresponds to the i-th example xt i of task t, i {1, , ℓt}, and each column represents a feature and there are totally d features. The vector yt contains yt i, the label of the i-th example of task t. We consider functions of the linear form yt = α t xt i where αt Rd, which corresponds to computing Xtαt on the training data as the estimate of yt. We deﬁne the parameter matrix or weight matrix A = [α1, , αT ] and denote the rows of A by αj, j {1, , d}. The joint regularization based MTFL method with the ℓ1,p regularizer minimizes

t=1 L(αt, Xt, yt) + λ

j=1 ||αj||p, (1)

for the best αt:t=1, ,T , where L(αt, Xt, yt) is the loss function of task t which computes the discrepancy between the observed yt and the model output Xtαt, and λ is a tuning parameter to balance between the loss and the regularizer. Although any suitable loss function can be used in the formulation (1), convex loss functions are the common choices such as the least squares loss Pℓt i=1(yt i α t xt i)2 for regression problems or the logistic loss

Wang, Bi, Yu, Sun and Song

Pℓt i=1 log(1+e yt i(α t xt i)) for classiﬁcation problems. These loss functions are strictly convex with respect to the model parameters αt. The ℓp norm is computed for each row of the matrix A corresponding to a feature (rather than a task) so to enforce sparsity on the features. A family of multiplicative MTFL methods can be derived by rewriting αt = diag(c)βt where diag(c) is a diagonal matrix with its diagonal elements composing a vector c. The c vector is used across all tasks, indicating if a feature is useful for any of the tasks, and the vector βt is only for task t. Let j index the entries in these vectors. We have αt j = cjβt j. Typically c comprises binary entries that are equal to 0 or 1, but the integer constraint is often relaxed to require just non-negativity (c 0). We minimize a regularized loss function as follows for the best c and βt:t=1, ,T :

t=1 L(c, βt, Xt, yt) + γ1

t=1 ||βt||p p + γ2||c||k k, (2)

where L(c, βt, Xt, yt) is the same loss function used in Eq.(1) but with αt replaced by the new vector of (cjβt j)j=1, ,d, ||βt||p p = Pd j=1 |βt j|p and ||c||k k = Pd j=1(cj)k, which are the ℓp-norm of βt to the power of p and the ℓk-norm of c to the power of k if p and k are positive integers. The tuning parameters γ1, γ2 are used to balance the empirical loss and regularizers. At optimality, if cj = 0, the j-th variable is removed for all tasks, and the corresponding row vector αj = 0; otherwise the j-th variable is selected for use in at least one of the α s. Then, a speciﬁc βt can rule out the j-th variable from task t if βt j = 0. If both p = k = 2, Problem (2) becomes the formulation used in Bi et al. (2008). Since the ℓ2-norm regularization is applied on both βt and c, this model does not impose strong sparsity on the model parameters. According to our empirical study, this model may be suitable for the scenarios where only a few features can be excluded from all of the tasks, and the diﬀerent models (tasks) share a lot features between each other. There could exist features that, although irrelevant to most of the tasks, cannot be completely excluded only due to few tasks. If p = k = 1, Problem (2) becomes the formulation used in Lozano and Swirszcz (2012), where the ℓ1-norm regularization is applied on βt and c and thus it induces very strong sparsity both on task-speciﬁc parameters and the across-task component to select the features. Compared to the model in Bi et al. (2008), this model is more suitable for learning from the tasks with persistently sparse models. For example, many features are irrelevant to any of the tasks, and only a few of the features selected by c are used by an individual task, indicating the sparse pattern in sharing the selected features among tasks. Besides the above two existing models, any other choices of p and k will derive into new formulations for MTFL. Note that the two existing methods discussed in Bi et al. (2008); Lozano and Swirszcz (2012) use p = k in their formulations, which renders βt j and cj the same amount of shrinkage. To explore other feature sharing patterns among tasks, we propose two new formulations where p = k.

Formulation 1: If p = 2 but k = 1 in Problem (2), then we obtain the following optimization problem

t=1 L(c, βt, Xt, yt) + γ1

t=1 ||βt||2 2 + γ2||c||1. (3)

Multiplicative Multitask Feature Learning

When there exists a large subset of features irrelevant to any of the tasks, it requires a sparsity-inducing norm on c. However, within the relevant features selected by c, the majority of these features are shared between tasks. In other words, the features used in each task are not sparse relative to the features selected by c, which requires a non-sparsityinducing norm on β. Hence, we use ℓ1 norm on c and ℓ2 norm on each β in our formulation (2).

Formulation 2: If p = 1 but k = 2 in Problem (2), we obtain the following optimization problem

t=1 L(c, βt, Xt, yt) + γ1

t=1 ||βt||1 + γ2||c||2 2. (4)

When the union of the features relevant to any given tasks includes many or even all features, the ℓ2 norm penalty on c may be preferred. However, only a limited number of features are shared between tasks, i.e., the features used by individual tasks are sparse with respect to the features selected as useful across tasks by c. We can impose the ℓ1 norm penalty on β. Clearly, many other choices of p and k values can be used, such as those corresponding to higher order polynomials (e.g., Pd j=1 c3 j). Our theoretical results in the next few sections apply to all positive value choices of p and k unless otherwise speciﬁed. In our empirical studies, however, we have implemented algorithms for the two existing models and the two new models for comparison. Some other choices of regularizers may require signiﬁcant re-programming of our algorithms and we will leave them for more thorough individual examinations in the future.

3. Theoretical Analysis

We ﬁrst extend the formulation (1) to allow more choices of regularizers. We introduce a new notation that is an operator applied to a vector, such as αj. The operator ||αj||p/q p =

qq PT t=1 |αt j|p, p, q 0, which corresponds to the ℓp norm if p = q and both are positive integers. A joint regularized MTFL approach can solve the following optimization problem with pre-speciﬁed values of p, q and λ, for the best parameters αt:t=1, ,T :

t=1 L(αt, Xt, yt) + λ d P

||αj||p/q p . (5)

Our main results of this paper include (i) a theorem that establishes the equivalence between the models derived from solving Problem (2) and Problem (5) for properly chosen values of λ, q, k, γ1 and γ2; (ii) a theorem that delineates the conditions for (2) to impose a convex (or concave) regularizer on the model parameter matrix A; and (iii) an analytical solution of Problem (2) for c which shows how the sparsity of the across-task component is relative to the sparsity of task-speciﬁc components.

Theorem 1 (Main Result 1) Let ˆαt be the optimal solution to Problem (5) and ( ˆβt, ˆc)

be the optimal solution to Problem (2). Then ˆαt = diag(ˆc) ˆβt when λ = 2 q

p kq 2 and q = k+p

2k (or equivalently, k = p 2q 1).

Wang, Bi, Yu, Sun and Song

Proof Theorem 1 can be proved by establishing the following two lemmas and two theorems. The two lemmas provide the basis for the proofs of the two theorems and then from the ﬁrst theorem, we conclude that the solution ˆαt of Problem (5) also minimizes the following optimization problem:

min αt,σ 0 PT t=1 L(αt, Xt, yt) + µ1 Pd j=1 σ 1 j ||αj||p/q p + µ2 Pd j=1 σj, (6)

and the optimal solution of Problem (6) also minimizes Problem (5) when proper values of λ, µ1 and µ2 are chosen. The second theorem connects Problem (6) to the proposed formulation (2). We show that the optimal ˆσj is equal to (ˆcj)k, and then the optimal ˆα can be computed as diag(ˆc) ˆβt from the optimal ˆβt.

Note that when p = 2 and q = 1, the intermediate problem (6) uses a similar regularizer to that in Micchelli et al. (2013) where |αj|2

σj +σj is used to approximate |αj| in the ℓ1-norm regularizer. Problem (6) extends the discussion in Micchelli et al. (2013) to include more general regularizers according to p and q.

Lemma 2 For any given αt:t=1, ,T , Problem (6) can be optimized with respect to σ by the following analytical solution

||αj||p/q p . (7)

Proof By the Cauchy-Schwarz inequality, we can derive a lower bound for the sum of the two regularizers in Problem (6) as follows:

j=1 σ 1 j ||αj||p/q p + µ2

j=1 σj 2 µ1µ2

||αj||p/q p , (8)

where the equality holds if and only if σj = µ

||αj||p/q p . Using the method of proof by contradiction, suppose that σ optimizes Problem (6)

with a given set of αt:t=1, ,T , but σ j = µ

||αj||p/q p . Thus, σ does not make the regularization term reach its lower bound. Then, we can choose another σ where

||αj||p/q p , so σ delivers the lower bound in Eq.(8). Because the lower bound (the right hand side of Eq.(8)) only depends on α, it is a constant for ﬁxed αt:t=1, ,T . Hence, since the loss function is also ﬁxed for given αt:t=1, ,T , σ reaches a lower objective value of Problem (6) than that of σ , which contradicts to the optimality of σ . Therefore, the optimal σ always takes the form of Eq.(7).

Remark 3 Based on the proof of Lemma 2, we also know that the objective function of (5) is the lower bound of the objective function of (6) for any given αt (including the optimal ˆαt) when λ = 2 µ1µ2, and the lower bound can be attained if and only if σ is set according to the formula (7). Hence, we can also conclude that if (ˆαt, ˆσ) is the optimal solution of

Problem (6), then ˆσj = µ

||ˆαj||p/q p .

Multiplicative Multitask Feature Learning

Lemma 4 Let αt j = cjβt j for all t and j. Replacing βt by αt in Problem (2) yields the following optimization problem

t=1 L(αt, Xt, yt) + γ1 Pd j=1 c p j ||αj||p p + γ2 Pd j=1(cj)k. (9)

For any given αt:t=1, ,T , Problem (9) can be optimized with respect to c by the following analytical solution

cj = (γ1γ 1 2

t=1 (αt j)p) 1 p+k . (10)

Proof This lemma can be proved following a similar argument in the proof of Lemma 2. By the Cauchy-Schwarz inequality, the sum of the two regularizers in (9) satisﬁes the following inequality

j=1 c p j ||αj||p p + γ2

j=1 ck j 2γ

j=1 (||αj||p p) k p+k , (11)

and the equality holds if and only if γ1c p j ||αj||p p = γ2ck j (and note that all parameters γ1, γ2, ||ˆαj||p p and c are non-negative), which yields the following formula

cj = (γ1γ 1 2

t=1 (αt j)p) 1 p+k .

Through proof by contradiction, we know that the optimal c has to take the above formula.

Based on Lemma 2, we will prove that Problem (5) is equivalent to Problem (6) in the sense that an optimal solution of Problem (5) is also an optimal solution of Problem (6) and vice versa when λ, µ1, and µ2 satisfy certain conditions.

Theorem 5 The solution sets of Problem (5) and Problem (6) are identical when λ = 2 µ1µ2.

Proof First, if ˆA = [ˆαt:t=1, ,T ] minimizes Problem (5), we show that the pair ( ˆA, ˆσ)

minimizes Problem (6) where ˆσj = µ

||ˆαj||p/q p . By Remark 3, the objective function of (5) is the lower bound of the objective function

of (6) for any given A (including the optimal solution ˆA). When ˆσj = µ

||ˆαj||p/q p , Problem (6) reaches the lower bound. In this case, Problem (5) at ˆA and Problem (6) at ( ˆA, ˆσ) have the same objective value. Now, suppose that the pair ( ˆA, ˆσ) does not minimize Problem (6), there exists another pair ( A, σ) = ( ˆA, ˆσ) that achieves a lower objective value

than that of ( ˆA, ˆσ). By Lemma 2, we have that σj = µ

|| αj||p/q p , and then Problem (6), at A, reaches the lower bound which is formulated as the objective of Problem (5)

Wang, Bi, Yu, Sun and Song

when λ = 2 µ1µ2. In other words, the objective values of (6) and (5) are identical at A. Hence, A achieves a lower objective value than that of ˆA for Problem (5), contradicting to the optimality of ˆA. Second, if ( ˆA, ˆσ) minimizes Problem (6), we show that ˆA minimizes Problem (5). Suppose that ˆA does not minimize Problem (5), which means that there exists αj

( = ˆαj for some j) that achieves a lower objective value than that of ˆαj. We set σj =

|| αj||p/q p . Then ( A, σ) is an optimal solution of Problem (6) as proved in the ﬁrst paragraph, and will bring the objective function of Problem (6) to a lower value than that of ( ˆA, ˆσ), contradicting to the optimality of ( ˆA, ˆσ). Therefore, Problems (5) and (6) have identical solution sets when λ = 2 µ1µ2.

In order to link Problem (6) to our formulation (2), we let σj = (cj)k, k R, k = 0 and αt j = cjβt j for all t and j, and derive an equivalent objective function of Problem (6) based on Lemmas 2 and 4.

Theorem 6 The optimal solution ( ˆA, ˆσ) of Problem (6) is equivalent to the optimal so-

lution ( ˆB, ˆc) of Problem (2) where ˆαt j = ˆcj ˆβt j and ˆσj = (ˆcj)k when γ1 = µ

kq 2kq p 1 µ

kq p 2kq p 2 , γ2 = µ2, and k = p 2q 1.

Proof First, if ˆαt j and ˆσj optimize Problem (6), we show that ˆcj = kp

ˆσj and ˆβt j = ˆαt j/ˆcj optimize Problem (2). By a change of variables (replacing ˆαt and ˆσ by ˆβt and ˆc in Problem (6)), we know that ˆβt and ˆc minimize the following objective function

J(ˆβt j, ˆcj) =

t=1 L(ˆc, ˆβt, Xt, yt) + µ1

j=1 ||ˆβ j||p/q p ˆc(p kq)/q j + µ2

j=1 (ˆcj)k. (12)

By Lemma 2 and Remark 3, the optimal ˆσj = µ

||ˆαj||p/q p . Because ˆcj = kp

ˆσj, we derive that

ˆcj = µ1µ 1 2 ||ˆβ j||p/q p q 2kq p .

Substituting the formula of ˆcj into Eq.(12) yields the same objective function of Problem

(2) after replacing µ1 and µ2 by γ1 and γ2 with the equations γ1 = µ

kq 2kq p 1 µ

kq p 2kq p 2 , γ2 = µ2. Therefore, ˆβt and ˆc optimize Problem (2) because otherwise, if any other solution (β, c) can further reduce the objective value of Problem (2), then the corresponding α and σ will bring the objective function of Problem (6) to a lower value than ˆα and ˆσ. Next, if ˆβt j and ˆcj optimize Problem (2), we show that ˆαt j = ˆcj ˆβt j and ˆσj = (ˆcj)k optimize Problem (6). Substituting ˆαt j, ˆσj for ˆβt j, ˆcj in Problem (2) yields an objective function

J(ˆαt j, ˆσj) =

t=1 L(ˆαt, Xt, yt) + γ1

j=1 ||ˆαj||p pˆσ (p/k) j + γ2

j=1 ˆσj. (13)

Multiplicative Multitask Feature Learning

We hence know that ˆαt j and ˆσj minimize Eq.(13). Similar to the proof of Lemma 4, we can show that the optimal ˆσ takes the form of

ˆσj = (γ1γ 1 2 ) k k+p (||ˆαj||p p) k k+p .

Substituting the formula into Eq.(13) and setting γ1 = µ

kq 2kq p 1 µ

kq p 2kq p 2 and γ2 = µ2 transfer Eq.(13) to the objective function of Problem (6). Thus, ˆαt j and ˆσj optimize Problem (6).

Now, based on the above two lemmas and two theorems, we can derive that when

p kq 2 and q = k+p

2k , the optimal solutions to Problems (2) and (5) are equivalent. Solving Problem (2) will yield an optimal solution ˆα to Problem (5) and vice versa. By the equivalence analysis, the proposed framework corresponds to a family of joint regularization methods as deﬁned by Eq.(5). Assuming a convex loss function is used, this family includes some convex formulations when convex regularizers are applied to A in (5) and some other non-convex formulations when non-matrix-norm based regularizers are applied to A. Particularly, when q = p/2, the regularization term on αj in (5) becomes the standard ℓp-norm. Correspondingly, when k = p/(p 1) which is commonly not an integer except p = 2, our formulation (2) amounts to imposing a ℓ1,p-norm on A. When both k and p take positive integers (except p = k = 2), Problem (2) is equivalent to using a non-matrix-norm regularizer in (5). Combinations of diﬀerent p and k values will render the models diﬀerent algorithmic behaviors.

Theorem 7 (Main Result 2) For any positive k and p, if kp k+p, then the formulation (2) imposes a convex regularizer on the model parameter matrix A; or otherwise, it imposes a concave regularizer on A.

Proof According to Theorem 1, the formulation (2) is equivalent to the formulation (5) that imposes the following regularizer on A:

||αj||p/q p = λ

j=1 (||αj||p)

j=1 (||αj||p)

kp k+q . (14)

where q = k+p

2k and λ = 2 q

p kq 2 . A function xa is a convex function in terms of x > 0 if a 1; or otherwise, it is concave. Hence, if kp k + p, Eq.(14) is a composite function of two functions: a convex function

kp k+p and another convex function which is the ℓp vector norm of αj. Thus, the overall regularizer is convex. Otherwise, if kp < k + p, the regularizer becomes a concave function in terms of α s.

Remark 8 For a particular choice of p = 2 and k = 2, Problem (2) is formulated as

t=1 L(c, βt, Xt, yt) + γ1

t=1 ||βt||2 2 + γ2||c||2 2, (15)

Wang, Bi, Yu, Sun and Song

which is used in Bi et al. (2008). This problem is equivalent to the following joint regularization method as used in Obozinski and Taskar (2006); Argyriou et al. (2007).

t=1 L(αt, Xt, yt) + λ

t=1 (αt j)2, (16)

when λ = 2 γ1γ2. Problem (16) uses the so called ℓ1,2-norm to regularize the matrix A.

Remark 9 For a particular choice of p = 1 and k = 1, Problem (2) is formulated as

t=1 L(c, βt, Xt, yt) + γ1

t=1 ||βt||1 + γ2||c||1, (17)

which is used in Lozano and Swirszcz (2012). This problem is equivalent to the following joint regularization method

t=1 L(αt, Xt, yt) + λ

t=1 |αt j|, (18)

when λ = 2 γ1γ2. Problem (17) imposes stronger sparsity on β and c (and correspondingly on α) than (15), which intends to shrink more model parameters to zero. Problem (18) uses a concave non-matrix-norm regularizer.

Remark 10 For a particular choice of p = 2 and k = 1, which corresponds to the new formulation (3). This problem is equivalent to the following joint regularization method

t=1 L(αt, Xt, yt) + λ

t=1 (αt j)2, (19)

when λ = 2γ

2 3 2 . Problem (19) uses a concave non-matrix-norm regularizer as well but the concavity is weaker than Problem (18) in the sense that the polynomial order is 2

3 rather than 1

The proposed formulation (3) imposes stronger sparsity induction on the across-task component than on the task-speciﬁc component. Thus, it has stronger shrinkage eﬀects to exclude many features for all the tasks. If the jth feature is considered as unrelated to most of the tasks, the model of p = k = 2 may shrink cj to a small value but not zero, the new model (3) might shrink cj to zero instead. Therefore this model would be more favorable to jointly learning from tasks where a large portion of noisy features exist that may be irrelevant or redundant to all of the tasks. Compared to the method in Lozano and Swirszcz (2012) where p = k = 1, the new formulation (3) may allow more selected features to be shared across diﬀerent tasks.

Multiplicative Multitask Feature Learning

Remark 11 For a particular choice of p = 1 and k = 2, which corresponds to the new formulation (4). This problem is equivalent to the following joint regularization method

t=1 L(αt, Xt, yt) + λ

when λ = 2γ

1 3 2 . Problem (20) uses another concave non-matrix-norm regularizer that has the similar polynomial order of 2

3 to Problem (19). In comparison with (19), the cross-task quadratic terms (e.g., |α1 j||α2 j|, |α2 j||α3 j|) are allowed inside the cube root in this formulation.

The new formulation (4) imposes stronger sparsity induction on task-speciﬁc component than on the across-task vector. This model can be favorable in the cases where few tasks share a limited number of selected features. As shown in our empirical results, for instance, when every two or three tasks share a limited subset of selected features but no common features are used by more than three tasks, this model performs the best among the four multiplicative MTFL formulations. In comparison to the model in Lozano and Swirszcz (2012), this model allows more of the features that are only relevant to a few tasks to be selected. In comparison with the model in Bi et al. (2008), this model can help to remove a lot of irrelevant or redundant features for individual tasks from the selected features. We further derive another main result that characterizes the optimal across-task vector as a formula of the optimal task-speciﬁc vectors. This connection can be easily developed from the above equivalence analysis, and can help us understand the relationship and interaction between the two components.

Theorem 12 (Main Result 3) Let ˆβt, t = 1, , T, be the optimal solutions of Problem (2), Let ˆB = [ˆβ1, , ˆβT ] and ˆβ j denote the j-th row of the matrix ˆB. Then,

ˆcj = (γ1/γ2) 1 k k

t=1 (ˆβt j)p, (21)

for all j = 1, , d, is optimal to Problem (2).

Proof This analytical formula can be directly derived from Theorem 5 and Theorem 6. When we set ˆσj = (ˆcj)k and ˆαt j = ˆcj ˆβt j in the proof of Theorem 6, we obtain ˆcj = µ1µ 1 2 ||ˆβ j||p/q p q 2kq p . In the same proof, we also show that µ1 = γ

kq 2 and µ2 = γ2.

Substituting these formulas into the formula of c yields ˆcj = (γ1/γ2) 1 k ||ˆβ j||

p 2kq p . Moreover, to establish the equivalence between (2) and (5), we require q = k+p

2k , which leads to

k = 2kq p. Hence, ||ˆβ j||

p 2kq p = kq PT t=1(ˆβt j)p. We then obtain the formula (21).

Remark 13 For particular choices of p and k, the relation between the optimal c and β can be computed according to Theorem 12. Table 2 summarizes the relation formula for the common choices when p {1, 2} and k {1, 2}.

Wang, Bi, Yu, Sun and Song

Table 2: The shrinkage eﬀect of c with respect to β for four common choices of p and k.

2 2 ˆcj = q

γ1γ 1 2 q PT t=1(ˆβt j)2

1 1 ˆcj = γ1γ 1 2 PT t=1 |ˆβt j|

2 1 ˆcj = γ1γ 1 2 PT t=1(ˆβt j)2

1 2 ˆcj = q

γ1γ 1 2 q PT t=1 |ˆβt j|

4. Probabilistic Interpretation

In this section we show that the proposed multiplicative formalism is related to the maximum a posteriori (MAP) solution of a probabilistic model. Let p(A| ) be the prior distribution of the weight matrix A = [α1, . . . , αT ] = [α1 , . . . , αd ] Rd T , where denote the parameter of the prior. Then the a posteriori distribution of A can be calculated via Bayes rule as

p(A|X, y, ) p(A| )

t=1 p(yt|Xt, αt). (22)

Denote z GN(µ, ρ, p) the univariate generalized normal distribution, with the density function

p(z) = 1 2ρΓ(1 + 1/p) exp |z µ|p

in which ρ > 0, p > 0, and Γ( ) is the Gamma function (Goodman and Kotz, 1973). Now let each element of A, αt j, follow a generalized normal prior, αt j GN(0, δj, p). Then with the independent and identically distributed assumption, the prior takes the form (also refer to Zhang et al. (2010) for a similar treatment)

1 δj exp |αt j|p

1 δT j exp αj p p δp j

where p denotes the vector p-norm. With an appropriately chosen likelihood function p(yt|Xt, αt) exp( L(αt, Xt, yt)), ﬁnding the MAP solution is equivalent to solving the following problem:

t=1 L(αt, Xt, yt) +

αj p p δp j + T ln δj . (25)

Setting the derivative of J with respect to δj to zero, we obtain:

1/p αj p. (26)

Multiplicative Multitask Feature Learning

Bringing this back to (25), we have the following equivalent problem:

min A J = XT

t=1 L(αt, Xt, yt) + T Xd

j=1 ln αj p. (27)

Now let us look at the multiplicative nature of αt j with diﬀerent p [1, ). When p = 1, we have:

j=1 ln αj 1 =

t=1 |cjβt j|

ln |cj| + ln

t=1 |βt j| 2d,

where the inequality is because of the fact that ln z z 1 for any z > 0. Therefore we can optimize an upper bound of J in (27),

t=1 L(αt, Xt, yt) + T

j=1 |cj| + T

t=1 |βt j| 2d T, (28)

which is equivalent to the multiplicative formulation (2) where {p, k} = {1, 1}. This proves the following theorem:

Theorem 14 When {p, k} = {1, 1}, optimizing the multiplicative formulation (2) is equivalent to maximizing a lower bound of the MAP solution under probabilistic model (22) with p = 1 in the prior deﬁnition.

In the general case, we have:

j=1 ln αj p = 1

t=1 |cjβt j|p !

(ck j ) p k

t=1 |βt j|p !

j=1 ln |cj|k + 1

t=1 |βt j|p

j=1 |cj|k + 1

t=1 βt p p d(1

Wang, Bi, Yu, Sun and Song

These inequalities lead to an upper bound of J in (27). By minimizing the upper bound, the problem is formulated as:

min A Jp,k =

t=1 L(αt, Xt, yt) + T

k c k k + T

t=1 βt p p, (29)

which is equivalent to the general multiplicative formulation in (2). Therefore we have proved the following theorem:

Theorem 15 Optimizing the multiplicative formulation (2), in the form of (29), is equivalent to maximizing a lower bound of the MAP solution under probabilistic model (22) with p (1, ) in the prior deﬁnition.

5. Optimization Algorithm

Alternating optimization algorithms have been used in both of the early methods (Bi et al., 2008; Lozano and Swirszcz, 2012) to solve Problem (2) which alternate between solving two subproblems: solve for βt with ﬁxed c; solve for c with ﬁxed βt. The convergence property of such an alternating algorithm has been analyzed in Bi et al. (2008); Lozano and Swirszcz (2012) that it converges to a local minimizer. In these early methods, both of the two subproblems have to be solved using iterative algorithms such as gradient descent, linear or quadratic program solvers. Besides the algorithm that solves for c and βt alternatively and can be applied to our formulations, we design an alternating optimization algorithm that utilizes the closed-form solution for c we have derived in Lemma 4 and the property that both Problems (2) and (5) are equivalent to the intermediate Problem (6) (or Problem (9)). In the algorithm to solve Problem (2), we start from an initial choice of c. At iteration s, we start from cs, and solve for βs t with the ﬁxed cs. We then compute the value of αs t = diag(cs)βs t, which is used to update cs to cs+1 according to Eq.(10) in Lemma 4. The overall procedure is summarized in Algorithm 1. As a central idea in designing this algorithm, at each iteration we update β to reduce the loss function, and update c to reduce the regularizer while maintaining the same loss function value. Note that if the value of α is ﬁxed, the loss will remain the same.

To analyze the convergence property of Algorithm 1, we utilize the fact that Problem (2) and Problem (9) are equivalent. For notational convenience, we denote the objective function of Problem (2) by g(B, c) that takes inputs βt and c. We denote the objective function of Problem (9) by f(A, c). Both objective functions comprise the sum of three parts. For instance, f can be written as follows:

f(A, c) = f0(A, c) + f A(A) + fc(c), (31)

where f0(A, c) = γ1 Pd j=1(c p j PT t=1(αt j)p) is the part that involves both α and c, f A(A) = P t L(αt, Xt, yt) is the part relying only on α, and fc(c) = µ2 Pd j=1(cj)k is the part for c only. Let z be the vector consisting of all variables in Problem (9). Similar to what has been deﬁned in Tseng (2001), we deﬁne that the point z is a stationary point of f if z dom f where dom f is the feasible region of f, and

lim λ 0[f(z + λb) f(z)]/λ 0, b such that (z + λb) dom f. (32)

Multiplicative Multitask Feature Learning

Algorithm 1 The blockwise coordinate descent algorithm for multiplicative MTFL

Input: Xt, yt, t = 1, , T, as well as γ1, γ2, p and k Initialize: cj = 1, j = 1, , d, and s = 1 repeat

Compute Xt = Xtdiag(cs), t = 1, , T for t = 1, , T do

Solve the following problem for βs t

min βt L(βt, Xt, yt) + γ1||βt||p p (30)

end for Compute αs t = diag(cs)βs t Set s = s + 1 Compute cs+1 using αs t according to Eq.(10) until maxt,j(|(αt j)s (αt j)s 1|) < ϵ (or other proper termination rules) Output: αt, c and βt, t = 1, , T

where b denotes any feasible direction, (which corresponds to f(z) = 0 if f is diﬀerentiable, or 0 f(z) if f is non-diﬀerentiable where f(z) is the subgradient of f at z). In our case, f is not diﬀerentiable at α = 0 or c = 0 when p is set to an odd number. We also deﬁne that a point z is a coordinate-wise minimum point of f if z dom f, and bk Rdk that makes (0, , bk, , 0) a feasible direction, there exists a small ϵ > 0, such that for all positive λ ϵ, f(z + λ(0, , bk, , 0)) f(z), (33)

where k indexes the blocks of variables in our algorithm, which include α1, , αT and c, so k = {1, , T + 1}, and the (T + 1)-th block is for c, dk is the number of variables in the kth coordinate block and in our case dk = d. The vector (0, , bk, , 0) is a vector in Rd (T+1) and used to only vary z in the k-th block. We ﬁrst prove that for the sequence of points generated by Algorithm 1, the objective function f is monotonically non-increasing. Then we prove the sequence of points is bounded, because of which, the sequence will have accumulation points. We prove that each accumulation point is a coordinate-wise minimum point. Then according to Tseng (2001), if f is regular at an accumulation point z , this z is a stationary point.

Lemma 16 Let the sequence of iterates generated by Algorithm 1 be zs = {As, cs}s=1,2, and the function f is deﬁned by (31), then f(As+1, cs+1) f(As, cs).

Proof First, note that for each {As, cs}, we accordingly have {Bs, cs} where αs t = diag(cs)βs t, t, and we have f(As, cs) = g(Bs, cs). Because we start with c so at each iteration c gets updated ﬁrst (the order dose not matter actually). At iteration s + 1, we compute cs+1 based on As according to Eq.(10). According to Lemma 4, cs+1 will reach the lower bound of f when A is ﬁxed to As. Hence f(As, cs+1) f(As, cs). Moreover, when c is updated, there is an implicit new value of B s that is just computed as ( βt j)s = (αt j)s/(cj)s+1,

Wang, Bi, Yu, Sun and Song

j and t. Then, g( B s, cs+1) = f(As, cs+1). Next in Algorithm 1, we obtain Bs+1 by solving Problem (30), i.e., by optimizing g with respect B when c is ﬁxed to cs+1. Hence, g(Bs+1, cs+1) g( B s, cs+1). Then A will be updated by As+1 = diag(cs+1)Bs+1, which leads to f(As+1, cs+1) = g(Bs+1, cs+1). Overall, we have

f(As+1, cs+1) = g(Bs+1, cs+1) g( B s, cs+1) = f(As, cs+1) f(As, cs).

This proves that f(As+1, cs+1) f(As, cs).

Based on the proof of Lemma 16, we can also show that g(Bs+1, cs+1) g(Bs, cs) for the sequence of {Bs, cs}s=1,2, that is also created by Algorithm 1.

Lemma 17 The sequence of iterates generated by Algorithm 1, zs = {As, cs}s=1,2, , (or equivalently, {Bs, cs}s=1,2, ) is bounded.

Proof According to Lemma 16, we know that f(As+1, cs+1) f(As, cs), and g(Bs+1, cs+1) g(Bs, cs), s = 1, 2, . Hence, g(Bs, cs) g(B1, c1), s = 1, 2, . Let g(B1, c1) = C. Then, g(Bs, cs) is upper bounded by C. In our algorithm, we assume that the loss function in g can be either the least squares loss or the logistic regression loss of all the tasks. Hence, the loss terms are non-negative (actually most other loss functions, such as the hinge loss, are also non-negative). The two regularizers, one on βt and the other on c, are both non-negative. Thus, we have that ||βs t||p p C/γ1 and ||cs||k k C/γ2, s = 1, 2, . This shows that the sequence of {Bs, cs}s=1,2, is bounded. Because αs t = diag(cs)βs t, ||αs t||p p = Pd j=1 |(αt j)s|p = Pd j=1 |(cs j(βt j)s|p ||cs||2p 2p + ||βs t||2p 2p C where C is a constant computed from C, γ1 and γ2. Thus, the sequence of zs = {As, cs}s=1,2, is also bounded.

Theorem 18 The sequence zs = {As, cs}s=1,2, generated by Algorithm 1 has at least one accumulation point. For any accumulation point z = {A , c }, z is a coordinate-wise minimum point of f.

Proof According to Lemma 17, the sequence zs is bounded, so there exists at least one accumulation point z and a subsequence of {zs}s=1,2, that converges to z . Without loss of generality and for notational convenience, let us just assume that {zs}s=1,2, converges to z . Because z is an accumulation point, if it is the iterate at the current iteration s, then in the next iteration s + 1, the same iterate z will be obtained. Hence, β t is the optimal solution of Problem (30) when c is set to c (and all other βk =t = β k). Correspondingly, c

is the optimal solution of Problem (2) when B is set to B . Hence, for any feasible direction (0, , bt, , 0), t = 1, 2, , T + 1, we have

f(z + λ(0, , bt, , 0)) f(z ), (34)

for small λ values. Hence, z is a coordinate-wise minimum point of f.

Multiplicative Multitask Feature Learning

Theorem 19 If both p and k are positive, and k 1, the accumulation point z is a stationary point of f.

Proof Due to Lemma 4, we know that the optimal solution of Problem (9) can only occur when c and αt satisfy Eq.(10), so any other points can be excluded from discussion. Hence, we deﬁne a new level set Z0 = {z|f(z) C} {z|cj = (γ1γ 1 2 PT t=1(αt j)p) 1 p+k , αt j}. We now prove that f is continuous on this set Z0 and the gradient of f exists when p is even (diﬀerentiable case), and examine 0 f at z for non-diﬀerentiable cases. Given the deﬁnition of f, f is continuous with respect to αt j, j and t. Because f0 is a division term that has a divisor based on cj, we have the division-by-0 issue, so f is in general not continuous with respect to cj at 0 (but continuous and diﬀerentiable at other values). However, using L Hospital s rule (i.e., lim φ 0

f2(φ ) = lim φ 0

f 1(φ) f 2(φ ), we show that f is continuous

with respect to cj at cj = 0 in the set Z0. When cj = 0, ||αj||p p = PT t=1(αt j)p = 0 due to Eq.(10). Let φ = ||αj||p p. Since the function f0 is a ratio of two items both approaching 0 as functions of φ, we can apply L Hospital s rule as follows

lim φ 0 ||αj||p p (cj)p = lim φ 0 φ

(γ1γ 1 2 φ)

p p+k = lim φ 0 1

p p+k(γ1γ 1 2 )

p p+k φ k p+k = lim φ 0 (p + k)φ k p+k

p(γ1γ 1 2 )

We then compute the partial derivative of f0 with respect to cj, which is f0

cj = p||αj||p p (cj)p+1 for cj = 0. Now when cj approaches 0, we can prove continuity using L Hospital s rule:

f0 cj |cj 0 = lim φ 0 p||αj||p p (cj)p+1 = lim φ 0 pφ

(γ1γ 1 2 φ)

p+1 p+k = lim φ 0 p(p + k)φ k 1 p+k

(p + 1)(γ1γ 1 2 )

when k > 1. When k = 1, the limit is a ﬁnite number p(p + k)/((p + 1)(γ1γ 1 2 )

p+k ). Note that when an odd p is taken, ||αt||p p = P |αt j|p is not diﬀerentiable at 0. However, the above limit exists no matter f is diﬀerentiable or not because we take φ as the varying parameter. With the above conditions, we use the results in Tseng (2001) that when each subproblem has a unique minimum, which is the case in our algorithm because Subproblem (30) is strictly convex (for our chosen loss functions) and we have already proved the unique analytical solution of c, z is a stationary point of f.

We brieﬂy discuss the computation cost of Algorithm 1. Subproblem (30) is essentially for single task learning, which can be solved by many existing eﬃcient algorithms, such as gradient-based optimization methods. The second subproblem has a closed-form solution for c, which requires only a minimal level of computation. The computation cost of Algorithm 1 only linearly increases with the number of tasks. Due to the nature of Algorithm 1, it can be easily parallelizable and distributed to multiple processors when optimizing individual βt. More eﬃcient algorithms may be designed for speciﬁc choices of p in the future.

Wang, Bi, Yu, Sun and Song

6. Experiments

We empirically evaluated the performance of the multiplicative MTFL algorithms on both synthetic data sets and a variety of real-world data sets, where we solved either classiﬁcation (using the logistic regression loss) or regression (using the least squares loss) problems. In the experiments, we implemented and compared Algorithm 1 for four parameter settings: (p, k) = (2, 2), (1, 1), (2, 1), and (1, 2), corresponding to four multiplicative MTFL (MMTFL) methods as listed in Table 2. Although when the values of (p, k) were (2,2) and (1,1), the two models corresponded respectively to the same methods in Bi et al. (2008) and Lozano and Swirszcz (2012), they were solved diﬀerently from prior methods using our Algorithm 1 with higher computational eﬃciency. When (p, k) = (2, 1) and (1, 2), the resultant models corresponded to the two new formulations. In our experiments, the multiplicative MTFL methods were also compared with the additive MTFL methods that decompose the model parameters into an addition of two components, such as the Dirty model (DMTL) (Jalali et al., 2010) and the robust MTFL (r MTFL) (Gong et al., 2012). Single task learning (STL) approaches were also implemented as baselines and compared with all of the MTFL algorithms in the experiments. We list the various methods used for comparison in our experiments as follows:

STL lasso : Learning each task independently with ||αt||1 as the regularizer.

STL ridge : Learning each task independently with ||αt||2 2 as the regularizer.

DMTL (Jalali et al., 2010) : The dirty model with regularizers ||P||1,1 and ||Q||1, , where A = P + Q.

r MTFL (Gong et al., 2012) : Robust multitask feature learning with the regularizers ||P||1,2 and ||QT ||1,2, where A = P + Q.

MMTFL{2, 2} (Bi et al., 2008) : Multiplicative multitask feature learning with regularizers ||B||2 F and ||c||2 2, where A = diag(c)B.

MMTFL{1, 1} (Lozano and Swirszcz, 2012) : Multiplicative multitask feature learning with regularizers ||B||1,1 and ||c||1, where A = diag(c)B.

MMTFL{2, 1}(New formulation 1) : Multiplicative multitask feature learning with regularizers ||B||2 F and ||c||1, where A = diag(c)B.

MMTFL{1, 2}(New formulation 2) : Multiplicative multitask feature learning with regularizers ||B||1,1 and ||c||2 2, where A = diag(c)B.

In all experiments, unless otherwise noted, the original data set was partitioned to have 25%, 33% or 50% of the data in a training set and the rest used for testing. For each speciﬁed partition ratio (corresponding to a trial), we randomly partitioned the data 15 times and reported the average performance. The same tuning process was used to tune the hyperparameters (e.g.,γ1 and γ2) of every method in the comparison. In every trial, an internal three-fold cross validation (CV) was performed within the training data of the ﬁrst partition to select a proper hyperparameter value for each of the methods from the choices of

Multiplicative Multitask Feature Learning

2k with k = 10, 9, , 7. In the subsequent partitions of each trial, the hyperparameters were ﬁxed to the values that yielded the best performance in the CV. The regression performance was measured by the coeﬃcient of determination, denoted by R2, which measures the explained variance of the data by the ﬁtted model. In particular, we used the following formula to report performance R2 = 1 Pn i=1(yi fi)2 Pn i=1(yi y)2 where y is the mean of the observed values of y and fi is the prediction of the observed yi. The reported values in our experiments were the averaged R2 over all tasks. The R2 value ranges from 0 to 1, and a higher value indicates better regression performance. The classiﬁcation performance was measured by the F1 score, which is the harmonic mean of precision and recall. The numbers we reported in each trial was the F1 score averaged over all tasks. Similarly, the F1 score also ranges from 0 to 1 and higher values represent better classiﬁcation performance.

6.1 Simulation Studies

We created three categories of data sets, all of which were designed for regression experiments to evaluate the behaviors of the diﬀerent methods. The data sets in each category were created with a pre-speciﬁed feature sharing structure. The ﬁrst two categories were designed to validate the scenarios that we hypothesized for our two new formulations to work. We performed sensitivity analyses using these two categories of data sets, i.e., studying how performance was altered when the number of tasks or the number of features varied. Because we also empirically compared with a few additive decomposition based methods, it would be interesting to see how multiplicative MTFL behaved in a scenario that was actually in favor of additive MTFL. Hence, in the third category, the feature sharing structure was generated following the assumption that the robust MTFL method used (Gong et al., 2012). In all experiments, we created input examples Xt for each task t using a number of features randomly drawn from the standard multivariate Gaussian distribution N(0, 1), and pre-deﬁned A = [α1, , αT ] across all tasks. The responses for each task was computed by yt = Xtαt + ϵt where ϵt N(0, 0.5) was the noise introduced to the model. If an entry of A was set to non-zero, its value was randomly sampled from a uniform distribution in the interval [0.5, 1.5]. As a side note, we transposed A in Figure 1 for better illustration.

6.1.1 Synthetic Datasets

Category 1 (C1). In the C1 experiments, 40% of the rows in the matrix A were set to 0. Since each row corresponded to a feature across the tasks, the zero rows made the features irrelevant to all tasks. All remaining features received non-zero values in A so that they were used in every task s model although with diﬀerent combination weights. Hence, the individual models were sparse with respect to all synthesized features, but not sparse with respect to the selected features. The pre-deﬁned A is demonstrated in Figure 1a(top), where we transpose A to have columns representing features and a darker spot indicates that the particular element of A had a larger absolute value. Note that this synthetic data follows the assumption that has motivated the early block-wise joint regularized methods that used matrix-norm-based regularizers. Those methods were developed with an assumption that a subset of features was shared by all tasks. However, we observed that the level of shrinkage needed would be diﬀerent for c and β, which corresponded to non-matrix-norm-

Wang, Bi, Yu, Sun and Song

(a) Synthetic data C1

(b) Synthetic data C2

(c) Synthetic data C3

Figure 1: The true parameter matrix versus the parameter matrices constructed by the various methods on synthetic data. The ﬁgure shows the results when 200 examples and 100 features were created for 20 tasks each. Darker color indicates larger values in magnitude.

based regularizers. We hypothesized that the proposed new formulation (3) would produce better regression performance than existing models on this data category. To examine how the number of tasks inﬂuenced the performance of diﬀerent methods, we varied the number of tasks from 1, 5, 10, 50, and 100 to 1000. For each task, we created 200 examples, each represented by 100 features. We also tested the diﬀerent methods when increasing the number of features. In this set of experiments, we used 20 tasks and each task contained 1000 examples. Each example was represented by a number of features ranging from 200 to 1000 with a step size of 200. Category 2 (C2). In the C2 experiments, 10% of the rows in A were set to non-zero and these features were shared by all tasks. We then arranged the tasks to follow six diﬀerent sparse structures (the staircases) as shown in Figure 1b(top), where we once again transpose A. Each of the remaining features except the 10% common features was used by a comparatively small proportion of the tasks. Consecutive tasks were grouped such that the neighboring groups of tasks shared 7% of the features besides the 10% common features, whereas the non-neighboring groups of tasks did not share any features. Therefore, no feature could be excluded from all tasks, but a majority of individual features (90%) was only useful for few tasks (i.e., the useful features for one task were sparse). In this case, the non-sparsity-inducing norm was suitable for regularizing c and sparsity-inducing norm was more suitable for regularizing β. We hypothesized that the new formulation (4) would produce better regression performance than the other models on this data set. We created 20, 50, 100, 500 and 1000 tasks, respectively, to test the algorithms sensitivity to the number of tasks. The numbers of tasks were chosen to make sure enough tasks in each of the six groups. The number of tasks in each group ranged from 3 to 170. We generated 200 examples and 100 features for each task. We also created another set of C2 data sets with the number of features changing from 200 to 1000 with a step size of 200 for 20 tasks and 1000 examples for each task. Category 3 (D3). This category was synthesized following the model of the additive decomposition methods. It only contained one data set where 200 examples, each represented

Multiplicative Multitask Feature Learning

Table 3: Comparison of the test R2 values obtained by the diﬀerent MTFL methods on synthetic data sets using diﬀerent partition ratios for training data (where standard deviation 0 means that it is less than 0.01).

Dataset STL lasso STL ridge DMTL r MTFL MMTFL(2,2) MMTFL(1,1) MMTFL(2,1) MMTFL(1,2) 25% 0.32 0.02 0.45 0.01 0.60 0.02 0.58 0.02 0.64 0.02 0.54 0.03 0.73 0.02 0.42 0.04 C1 33% 0.54 0.03 0.62 0.02 0.73 0.01 0.61 0.02 0.79 0.02 0.76 0.01 0.86 0.01 0.65 0.03 50% 0.70 0.02 0.86 0.01 0.75 0.01 0.66 0.01 0.86 0.01 0.88 0.01 0.90 0.01 0.84 0.01 25% 0.42 0.03 0.33 0.01 0.36 0.01 0.46 0.01 0.45 0.01 0.35 0.05 0.46 0.02 0.49 0.02 C2 33% 0.77 0.02 0.63 0.01 0.42 0.02 0.63 0.03 0.69 0.02 0.75 0.01 0.67 0.03 0.83 0.02 50% 0.95 0.01 0.89 0.01 0.81 0.01 0.83 0.01 0.91 0 0.95 0 0.92 0.01 0.97 0 25% 0.28 0.03 0.43 0.03 0.50 0.04 0.55 0.03 0.54 0.03 0.31 0.02 0.43 0.02 0.34 0.02 C3 33% 0.47 0.01 0.56 0.02 0.60 0.02 0.65 0.02 0.64 0.02 0.45 0.04 0.60 0.04 0.47 0.03 50% 0.77 0.01 0.78 0.01 0.76 0.02 0.83 0.01 0.81 0.02 0.75 0.01 0.79 0.02 0.76 0.01

by 100 features, were generated for each of 20 tasks. The parameter matrix A = P + Q where 80 rows in P and 16 columns in Q were set to 0. The component P was used to indicate the subset of relevant features across all tasks, and the component Q was used to tell that there were outlier tasks that did not share features with other tasks. Given this simulation process, this data set would be in favor of the r MTFL model proposed in Gong et al. (2012). The designed model parameter matrix was shown in Figure 1c(top).

6.1.2 Performance on synthetic data sets

We ﬁrst compared the regression performance of the diﬀerent methods on the three categories of data sets. Table 3 shows the averaged R2 values together with standard deviations for each method and each trial setting. The best results are shown in bold fonts. The results in Table 3 were obtained on synthetic data sets that had 20 tasks with 200 examples and 100 features for each task. We reported the test R2 obtained on each data set when 25%, 33% and 50% of the data were used in training. From Table 3, we observe that the proposed formulation (3)(MMTFL(2,1)) consistently outperformed other models on C1 data sets, whereas the proposed model (4)(MMTFL(1,2)) consistently outperformed on C2 data sets. The results conﬁrmed our hypotheses that the two proposed models could be more suitable for learning the type of sharing structures in C1 and C2. As anticipated, r MTFL model constantly outperformed other models on the C3 data set. Among the multiplicative MTFL methods, MMTFL(2,2) achieved similar performance to that of r MTFL (oﬀonly by 0.01 0.02 for average R2). In order to elucidate the diﬀerent shrinkage eﬀects of the diﬀerent decomposition strategies and regularizers, we compared the true parameter matrix with the constructed parameter matrices by the six MTFL methods used in our experiments in Figure 1. From the results on the C1 and C2 data sets, we observe that only MMTFL(2,1), MMTFL(1,2) and MMTFL(1,1) produced reasonably sparse structures. The two additive decomposition methods could not yield a suﬃcient level of sparsity in the models. Although the unused features did receive smaller weights in general, they were not completely excluded. To evaluate the accuracy of feature selection, we quantitatively measured the discrepancy between the estimated models and the true model by computing the mean squared error (MSE) trace((A Aest) (A Aest)) where Aest was the matrix estimated by a method. We compared MSE values of individual mothods. On the C1 data, MMTFL(2,1) learned better combination weights (darker areas) for the relevant features than MMTFL(1,1). MMTFL(1,1) appeared to be unnecessarily too

Wang, Bi, Yu, Sun and Song

(a) On C1 data

(b) On C2 data

Figure 2: The regression performance of diﬀerent models on synthetic data C1 and C2 when the number of tasks is varied.

sparse because the useful features received much smaller weights than needed (lighter than the true model). The smallest MSE was achieved by MMTFL(2,1) with a value of 0.1, and the second best model, MMTFL(1,1), had MSE = 0.2 whereas the r MTFL model had the largest MSE = 0.25. On the C2 data, MMTFL(1,2) learned a model that was most comparable to the true model. Both MMTFL(1,2) and MMTFL(1,1) eliminated well the irrelevant features. However, if we compared the two rows corresponding to these two models in Figure 1, we could see that MMTFL(1,1) broke the staircases in several places (e.g., towards the lower right and the up left corners) by excluding more features than necessary. Note that the feature sharing patterns (particularly in synthetic data C2) may not be revealed by the recent methods on clustered multitask learning that cluster tasks into groups (Kang et al., 2011; Jacob et al., 2008; Zhou et al., 2011) because no cluster structure is present in Figure 1b. Rather, the sharing pattern in Figure 1b is actually in between the consecutive groups of tasks. MMTFL(1,2) had the smallest MSE = 0.05, which was smaller than that of the second best model MMTFL(1,1) by 0.025. DMTL received the largest MSE = 0.19. Figure 1c shows the results on the C3 data set. MMTFL(2,1), MMTFL(1,2), and MMTFL(1,1) imposed excessive sparsity on the parameter matrix, which removed some useful features. The other three models, DMTL, r MTFL and MMTFL(2,2), produced similar parameter matrices, but r MTFL was originally designed to detect outlier tasks and thus was more favorable for this data set. The r MTFL model obtained the smallest MSE (0.03), and MMTFL(2,2) had a similar performance (MSE=0.04), which was the same as that of DMTL. On this data set, MMTFL(1,1) got the largest MSE (0.09). These results bring out an interesting observation that for the MTL scenarios that have outlier tasks but relevant tasks share the same set of features, MMTFL(2,2) (which corresponds to the very early joint regularized method using the ℓ1,2 matrix norm) is most suitable among the multiplicative MTFL methods. Figure 2 compares the performance of diﬀerent methods when we vary the number of tasks in the C1 and C2 categories. On each data set, we used 33% of the data in training

Multiplicative Multitask Feature Learning

(a) On C1 data

(b) On C2 data

Figure 3: The regression performance of diﬀerent models on synthetic data C1 and C2 when the number of features is varied.

with 15 trials, and reported here the average R2 values and standard deviation bars. From Figure 2, MMTFL(2,1) constantly performed the best among all methods on the C1 data sets (but not in the single task learning) whereas MMTFL(1,2) outperformed the other models on the C2 data sets. We also observed that on C2 data, MMTFL(1,1) obtained very similar performance to that of MMTFL(1,2) after the number of tasks reached 50. On this data category, almost all methods reached a stable level of accuracy after the number of tasks reached 50 except DMTL. DMTL continued to gain knowledge from more relevant tasks until it reached 500 tasks but it produced the lowest R2 values among all methods. Overall, the results indicate that with the ﬁxed dimension and sample size, when the number of tasks reaches a certain level, the transferable knowledge learned from the tasks can be saturated for a speciﬁc feature sharing structure. On C1 data, we observe that the performance was not always monotonically improved or non-degraded (for all methods) when more tasks were included, which may indicate that when an unnecessarily large number of tasks was used, it could add more uncertainty to the learning process. Figure 3 compares the performance of diﬀerent methods when we vary the number of features in the C1 and C2 categories. Obviously, when the problem dimension was higher, the learning problem became more diﬃcult (especially when the number of tasks and sample size remained the same). All methods dropped their performance substantially with increasing numbers of features although MMTFL(2,1) and MMTFL(1,2) still outperformed other methods, respectively, on the C1 and C2 data sets. This ﬁgure also shows that MMTFL(1,1) performed well on the C2 data sets but much worse on the C1 data sets. DMTL produced good performance, close to that of MMTFL(2,1), on the C1 data sets.

6.2 Experiments with Benchmark Data

Extensive empirical studies were conducted on benchmark data sets where we tested the proposed multiplicative MTFL algorithms on ten real-world data sets. Among these data sets, three were for regression experiments and all others were for classiﬁcation experiments. Characteristics of these data sets are summarized in the next section.

Wang, Bi, Yu, Sun and Song

6.2.1 Benchmark Datasets

Sarcos (Argyriou et al., 2007): Sarcos data were collected for a robotics problem of learning the inverse dynamics of a 7 degrees-of-freedom SARCOS anthropomorphic robot arm. Each observation has 21 features corresponding to 7 joint positions and their velocities and accelerations. We needed to map from the 21-dimensional input space to 7 joint torques, which corresponded to 7 tasks. For each task, we randomly selected 2000 cases for training and the remaining 5291 cases for test. Readers can consult with http://www.gaussianprocess.org/ gpml/data/ for more details. College Drinking (Bi et al., 2013): The college drinking data were collected in order to identify alcohol use patterns of college students and the risk factors associated with the binge drinking. The data set contained daily responses from 100 college students to a survey questionnaire measuring various daily measures, such as drinking expectation, negative aﬀects, and level of stress, every day in a 30 day period. The goal was to predict the amount of nighttime drinks based on 51 daily measures for each student, corresponding to 100 regression tasks. Because there were only 30 records for each person, we used 66%, 75% and 80% of the records to form the training set, and the rest for test. QSAR (Ma et al., 2015): The quantitative structure-activity relationship (QSAR) methods are commonly used to predict biological activities of chemical compounds in the ﬁeld of drug discovery. The data sets we used were collected from three diﬀerent types of drug activities, including binding to cannabinoid receptor 1 (CB1), inhibition of dipeptidyl peptidase 4 (DPP4) and time dependent 3A4 inhibitions (TDI). For each activity, there were 200 molecule examples represented by 2618 features. Three regression models were constructed to simultaneously predict the targets log(IC50)) of the CB1, DPP4 and TDI eﬀectiveness based on the molecular features. C.M.S.C. (Lucas et al., 2013): The Climate Model Simulation Crashes (C.M.S.C.) data set contained records of simulated crashes encountered during climate model uncertainty quantiﬁcation ensembles. The data set comprised 3 tasks. There were 180 examples for each task. Each example was represented by an 18-dimensional feature vector. Each task is formed by a binary classiﬁcation problem, which was to predict simulation outcomes (either fail or succeed) from the input parameter values for a climate model. Landmine (Xue et al., 2007): The original Landmine data contained 29 data sets where sets 1-15 corresponded to the geographical regions that were highly foliated and sets 16-29 corresponded to the regions with bare earth or desert. Each data set could be used to build a binary classiﬁer. We used the data sets 1-10 and 16-25 to form 20 tasks where each example was represented by 9 features extracted from radar images. The number of examples varied between individual tasks ranging from 445 to 690. Alphadigits (Maurer et al., 2013): This data set was composed of binary 20 16 images of the 10 digits and capital letters. We used all the images of digits to form 10 binary classiﬁcation tasks. For each digit, there were 39 images in this data set. We labeled the images of a single digit as positive examples, and randomly selected other 39 images from other digits and labeled them as negative examples. All the pixels were concatenated to form a 320-dimensional feature vector for each image. Underwatermine (Liu et al., 2009b): This data set was originally used in the underwater mine classiﬁcation problem that aimed to detect mines from non-mines based on the

Multiplicative Multitask Feature Learning

synthetic-aperture sonar images. The data set consisted of 8 tasks with sample sizes ranging from 756 to 3562 for each task, and each task was a binary classiﬁcation problem. Each example was represented by 13 features. Animal recognition (Kang et al., 2011): This data set consisted of images from 20 animal classes. Each image was originally represented by 2000 features extracted using the bag of word descriptors from the Scale-invariant Feature Transform (SIFT), and then the dimensionality was reduced to 202 by a principal component analysis, retaining 95% of the data variance. For each animal class, there were 100 images. We formed 20 binary classiﬁcation tasks where for each task, 100 positive examples were from a speciﬁc animal class and 100 negative examples were randomly sampled from other classes. HWMA base and HWMA peak (Qazi et al., 2007; Bi and Wang, 2015): The heart wall motion abnormality (HWMA) detection data set was used to analyze and predict if a heart had abnormal motion based on the image features extracted from stress test echocardiographs of 220 patients. The images were taken at the base dose and also peak dose of stress contrast. The wall of left ventricle was medically segmented into 16 segments, and 25 features were extracted from each segment. Every segment of every heart case was annotated by radiologists in terms of normal or abnormal motions. Thus, there were 16 binary classiﬁcation tasks, each corresponding to one of the 16 heart segments, and each task comprised 220 examples.

6.2.2 Performance on real world data sets

Three real-world data sets, the Sarcos, College Drinking and QSAR, were used in regression experiments. The performance of the diﬀerent methods is summarized in Table 4 depicting the R2 values averaged over the 15 re-partitions in each trial. MMTFL(2,1) achieved the best R2 values on the Sarcos data set (in all of the 3 trials) and the College Drinking data set (in 2 of the 3 trials). The Sarcos data appeared to be in favor of denser models given MMTFL(2,2) also performed reasonably well on this data set. MMTFL(1,2) models achieved the best performance on the QSAR data set consistently across all the 3 trial settings. On this data set, it was obvious that MMTFL(1,2) was more suitable, which indicated that most of the 2618 features were useful for some tasks, but the tasks shared few features between each other. The diﬀerence between the proposed models and the additively decomposed models ranged from 1 % to 10%, and most importantly, the trend was consistent for the proposed models to outperform on these data sets. The other seven real-world data sets were used in classiﬁcation experiments. Table 5 summarizes the results where the F1 scores were averaged across the 15 random splits in each trial together with standard derivations. MMTFL(2,1) models achieved consistently the best performance on the C.S.M.C. and Landmine data sets in comparison with other models. In particular, we observed both MMTFL(1,1) and the MMTFL(2,1) models produced the best F1 scores in the trial with 33% training split whereas MMTFL(2,1) outperformed all other models in the trial with other partition ratios. These two data sets may prefer acrosstask sparse models, indicating that many irrelevant features may exist in the data. For the remaining ﬁve data sets used in classiﬁcation experiments, MMTFL(1,2) models showed generally better performance than all other models. The diﬀerence between the best model and other MMTFL models could reach 4% to 8%.

Wang, Bi, Yu, Sun and Song

Figure 4: The models constructed by MTFL methods for individual digits using the pixels in hand written digit images. The models can be re-organized back into images. The pixel in the above images ranges from 0 to 1. The lighter, the closer to 1 for lucid illustration.

In particular, for the Alphadigits data set, we used the raw pixels of the hand written digits as the features to build models. Each task aimed to learn a linear model in the original pixel dimensions to distinguish a digit from other nine digits. Thus, each model, or equivalently the weight vector of the linear model, could be re-shaped back into an image. Figure 4 compares the constructed models for each digit by each of the six MTFL methods. In the top of Figure 4, we illustrate some sample images. In the middle, we show the results by the four MMTFL methods whereas at the bottom we include the constructed additively decomposed models. Clearly, the additively decomposed models were much noisier and selected many undesirable features. Among the multiplicative MTFL methods, MMTFL(1,1) models were too sparse to trace out the digits. Overall, MMTFL(1,2) models were the closest to the shapes of the diﬀerent digits. If we compare MMTFL(2,1) and (1,2), MMTFL(2,1) excluded too many features from all digits. It indicated that most of the pixels were useful for predicting a digit but not many pixels were shared by multiple digits.

Multiplicative Multitask Feature Learning

Table 4: Comparison of the R2 values of the diﬀerent MTFL methods on the benchmark regression data sets.

Dataset STL lasso STL ridge DMTL r MTFL MMTFL(2,2) MMTFL(1,1) MMTFL(2,1) MMTFL(1,2)

25% 0.75 0.02 0.78 0.02 0.86 0 0.86 0 0.89 0 0.88 0 0.89 0.01 0.86 0.01 SARCOS 33% 0.78 0.01 0.78 0.02 0.81 0.11 0.81 0.1 0.90 0 0.89 0 0.90 0.01 0.82 0.01 50% 0.82 0.01 0.83 0.06 0.87 0.1 0.87 0.1 0.89 0 0.90 0.01 0.91 0.01 0.86 0.01

66% 0.15 0.05 0.09 0.02 0.11 0.07 0.22 0.02 0.23 0.04 0.15 0.02 0.21 0.01 0.19 0.04 College Drinking 75% 0.18 0.05 0.15 0.02 0.18 0.08 0.18 0.05 0.14 0.05 0.25 0.08 0.27 0.05 0.21 0.08 80% 0.11 0.06 0.09 0.04 0.19 0.06 0.16 0.06 0.19 0.05 0.15 0.04 0.21 0.04 0.15 0.04

25% 0.39 0.03 0.36 0.01 0.39 0.01 0.39 0.01 0.34 0.02 0.38 0.03 0.38 0.03 0.40 0.03 QSAR 33% 0.39 0.04 0.39 0.04 0.40 0.04 0.39 0.04 0.37 0.04 0.41 0.04 0.40 0.05 0.43 0.03 50% 0.44 0.06 0.41 0.03 0.44 0.03 0.44 0.04 0.40 0.06 0.44 0.04 0.45 0.05 0.48 0.04

Table 5: Comparison of the F1 scores of the diﬀerent MTFL methods on the benchmark classiﬁcation data sets.

Dataset STL lasso STL ridge DMTL r MTFL MMTFL(2,2) MMTFL(1,1) MMTFL(2,1) MMTFL(1,2)

25% 0.30 0.04 0.31 0.03 0.36 0 0.26 0.01 0.40 0 0.39 0.01 0.43 0 0.39 0 C.S.M.C. 33% 0.37 0.02 0.37 0.06 0.37 0.01 0.39 0.01 0.40 0.01 0.45 0.01 0.45 0.01 0.39 0.01 50% 0.40 0.02 0.46 0.01 0.37 0.01 0.44 0.01 0.48 0.01 0.46 0.01 0.49 0.01 0.46 0

25% 0.14 0.01 0.06 0.01 0.04 0.01 0.06 0 0.1 0.01 0.17 0.01 0.17 0.01 0.17 0.01 Landmine 33% 0.13 0.01 0.15 0.01 0.16 0.01 0.16 0.01 0.1 0.01 0.18 0.01 0.18 0.01 0.12 0.01 50% 0.15 0.01 0.15 0.01 0.18 0.01 0.21 0.01 0.11 0.01 0.19 0.01 0.23 0.01 0.15 0.01

25% 0.86 0.01 0.77 0.01 0.86 0.01 0.86 0 0.85 0.01 0.84 0.01 0.78 0.01 0.88 0.01 Alphadigits 33% 0.90 0.01 0.87 0.01 0.90 0.01 0.90 0.01 0.87 0.01 0.85 0.01 0.81 0.01 0.91 0.01 50% 0.90 0.01 0.87 0.01 0.91 0.01 0.92 0.01 0.91 0.01 0.90 0.01 0.85 0.01 0.93 0.01

25% 0.24 0.01 0.27 0.02 0.21 0.01 0.27 0 0.29 0.01 0.24 0.01 0.25 0.01 0.27 0.01 Underwatermine 33% 0.25 0.01 0.31 0.02 0.31 0.01 0.29 0.01 0.3 0.01 0.3 0.01 0.27 0.01 0.32 0.01 50% 0.27 0.01 0.28 0.01 0.32 0.01 0.34 0.01 0.32 0.01 0.34 0.01 0.35 0.01 0.37 0.01

25% 0.63 0.01 0.63 0.01 0.65 0.01 0.65 0 0.66 0.01 0.64 0.01 0.63 0.01 0.66 0.01 Animal 33% 0.66 0.01 0.67 0.01 0.66 0.01 0.65 0.01 0.67 0.01 0.66 0.01 0.66 0.01 0.68 0.01 50% 0.67 0 0.68 0.01 0.68 0.01 0.66 0.01 0.69 0.01 0.67 0.01 0.68 0.01 0.69 0.01

25% 0.35 0.01 0.37 0.01 0.42 0.01 0.4 0 0.46 0.01 0.36 0.01 0.44 0.01 0.45 0.01 HWMA base 33% 0.38 0.01 0.34 0 0.45 0.01 0.47 0.01 046 0.01 0.44 0.01 0.47 0.01 0.49 0.01 50% 0.44 0.01 0.45 0.01 0.48 0.01 0.48 0.01 0.51 0.01 0.47 0.01 0.47 0.01 0.55 0.01

25% 0.41 0 0.48 0.01 0.47 0.01 0.47 0 0.53 0.01 0.53 0.01 0.51 0.01 0.54 0.01 HWMA peak 33% 0.42 0.01 0.47 0.01 0.47 0.01 0.5 0.01 0.56 0.01 0.55 0.01 0.52 0.01 0.53 0.01 50% 0.44 0.02 0.50 0.02 0.49 0.01 0.51 0.01 0.56 0.01 0.56 0.01 0.55 0.01 0.58 0.01

Wang, Bi, Yu, Sun and Song

7. Conclusion

In this paper, we have studied a general framework of multiplicative multitask feature learning. By decomposing the model parameter of each task into a product of two components: the across-task feature indicator and task-speciﬁc parameters, and applying diﬀerent regularizers to the two components, we can select features for individual tasks and also search for the shared features among tasks. We have examined the theoretical properties of this framework when diﬀerent regularizers are applied and found that this family of methods creates models equivalent to those of the joint regularized multitask learning methods but with a more general form of regularization. Further, we show that this family consists of some convex and some non-convex formulations and specify the conditions to obtain convexity. An analytical formula is derived for the across-task component as related to the task-speciﬁc component, which sheds light on the diﬀerent shrinkage eﬀects in the various regularizers. An eﬃcient algorithm is derived to solve the entire family of methods and also tested in our experiments for some chosen parameter values. Empirical results on synthetic data clearly show that there may not be a particular choice of regularizers that is universally better than other choices. We empirically show a few feature sharing patterns that are in favor of the two newly-proposed choices of regularizers. The extensive experimental results on real-world benchmark data sets also conﬁrm the observation and demonstrate the advantages of the proposed two formulations over several existing methods.

Acknowledgments

This work was supported by NSF grants IIS-1320586, DBI-1356655 and the NIH grant R01DA037349. Jinbo Bi was also supported by NSF grants CCF-1514357, IIS-1407205 and IIS-1447711. Correspondence should be addressed to Jinbo Bi (jinbo@engr.uconn.edu).

Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817 1853, December 2005.

Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41 48. 2007.

Jinbo Bi and Xin Wang. Learning classiﬁers from dual annotation ambiguity via a minmax framework. Neurocomputing, 151, Part 2(0):891 904, 2015.

Jinbo Bi, Tao Xiong, Shipeng Yu, Murat Dundar, and R. Bharat Rao. An improved multitask learning approach with applications in medical diagnosis. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, pages 117 132, 2008.

Jinbo Bi, Jiangwen Sun, Yu Wu, Howard Tennen, and Stephen Armeli. A machine learning approach to college drinking prediction and risk factor identiﬁcation. ACM Transactions on Intelligent Systems Technology, 4:72:1 72:24, 2013.

Multiplicative Multitask Feature Learning

Edwin Bonilla, Kian Ming Chai, and Christopher Williams. Multi-task gaussian process prediction. In Advances in neural information processing systems, pages 153 160, 2007.

Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 42 50, 2011.

Theodoros Evgeniou and Massimiliano Pontil. Regularized multi task learning. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 109 117, 2004.

Pinghua Gong, Jieping Ye, and Changshui Zhang. Robust multi-task feature learning. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 895 903, 2012.

Pinghua Gong, Jieping Ye, and Changshui Zhang. Multi-stage multi-task feature learning. The Journal of Machine Learning Research, 14(1):2979 3010, 2013.

Irwin R Goodman and Samuel Kotz. Multivariate θ-generalized normal distributions. Journal of Multivariate Analysis, 3(2):204 219, 1973.

William H. Greene. Econometric Analysis. Prentice Hall, ﬁfth edition, 2002.

Shengbo Guo, Onno Zoeter, and C edric Archambeau. Sparse bayesian multi-task learning. In Advances in Neural Information Processing Systems, pages 1755 1763, 2011.

Laurent Jacob, Bach Francis, and Jean-Philippe Vert. Clustered multi-task learning: a convex formulation. In Advances in Neural Information Processing Systems, pages 745 752, 2008.

Ali Jalali, Sujay Sanghavi, Chao Ruan, and Pradeep K Ravikumar. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems, pages 964 972, 2010.

Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In Proceedings of the International Conference on Machine Learning, pages 521 528, 2011.

Abhishek Kumar and Hal Daume III. Learning task grouping and overlap in multi-task learning. In Proceedings of the International Conference on Machine Learning, pages 1383 1390, 2012.

Seunghak Lee, Jun Zhu, and Eric Xing. Adaptive multi-task lasso: with application to e QTL detection. In Advances in Neural Information Processing Systems, pages 1306 1314. 2010.

Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. Learning a metalevel prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning, pages 489 496, New York, NY, USA, 2007.

Wang, Bi, Yu, Sun and Song

Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via eﬃcient ℓ1,2-norm minimization. In Proceedings of the Conference on Uncertainty in Artiﬁcial Intelligence, pages 339 348, 2009a.

Qiuhua Liu, Xuejun Liao, Hui Li, Jason R Stack, and Lawrence Carin. Semisupervised multitask learning. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(6):1074 1086, 2009b.

Aurelie Lozano and Grzegorz Swirszcz. Multi-level lasso for sparse multi-task regression. In Proceedings of the International Conference on Machine Learning, pages 361 368, 2012.

D. D. Lucas, R. Klein, J. Tannahill, D. Ivanova, S. Brandon, D. Domyancic, and Y. Zhang. Failure analysis of parameter-induced simulation crashes in climate models. Geoscientiﬁc Model Development Discussions, 6(1):585 623, 2013.

Junshui Ma, Robert P. Sheridan, Andy Liaw, George E. Dahl, and Vladimir Svetnik. Deep neural nets as a method for quantitative structureactivity relationships. Journal of Chemical Information and Modeling, 55(2):263 274, 2015.

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. In Proceedings of the 30th International Conference on Machine Learning, pages 343 351, 2013.

L. Meier, S. v. d. Geer, and P. B uhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70 Part 1:53 71, 2008.

Charles A Micchelli, Jean M Morales, and Massimiliano Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455 489, 2013.

Guillaume Obozinski and Ben Taskar. Multi-task feature selection. In Technical report, Statistics Department, UC Berkeley, 2006.

Alexandre Passos, Piyush Rai, Jacques Wainer, and Hal Daume III. Flexible modeling of latent task structures in multitask learning. In Proceedings of the International Conference on Machine Learning, pages 1103 1110, 2012.

M. Qazi, G. Fung, S. Krishnan, J. Bi, B. Rao, and A. Katz. Automated heart abnormality detection using sparse linear classiﬁers. IEEE Engineering in Medicine and Biology, 26 (2):56 63, 2007.

Ariadna Quattoni, Xavier Carreras, Michael Collins, and Trevor Darrell. An eﬃcient projection for l1,inﬁnity regularization. In Proceedings of the International Conference on Machine Learning, pages 108 115, 2009.

Piyush Rai and Hal Daume. Inﬁnite predictor subspace models for multitask learning. In International Conference on Artiﬁcial Intelligence and Statistics, pages 613 620, 2010.

P. Tseng. Convergence of a block coordinate descent method for nondiﬀerentiable minimization. Journal of Optimization Theory and Applications, 109(3):475 494, 2001.

Multiplicative Multitask Feature Learning

B. A. Turlach, W. N. Wenables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349 363, 2005.

T. Xiong, J. Bi, B. Rao, and V. Cherkassky. Probabilistic joint feature selection for multitask learning. In Proceedings of SIAM International Conference on Data Mining, pages 69 76, 2006.

Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classiﬁcation with dirichlet process priors. Journal of Machine Learning Research, 8:35 63, 2007.

Ming Yang, Yingming Li, and Zhongfei Mark Zhang. Multi-task learning with gaussian matrix generalized inverse gaussian model. In Proceedings of the 30th International Conference on Machine Learning, pages 423 431, 2013.

Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning gaussian processes from multiple tasks. In Proceedings of the 22Nd International Conference on Machine Learning, pages 1012 1019, New York, NY, USA, 2005.

Shipeng Yu, Volker Tresp, and Kai Yu. Robust multi-task learning with t-processes. In Proceedings of the 24th International Conference on Machine Learning, pages 1103 1110, 2007.

Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationship in multitask learning. In Proceedings of the Conference on Uncertainty in Artiﬁcial Intelligence, pages 733 742, 2010.

Yu Zhang, Dit-Yan Yeung, and Qian Xu. Probabilistic multi-task feature selection. In Advances in Neural Information Processing Systems, pages 2559 2567, 2010.

Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468 3497, 2009.

Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in Neural Information Processing Systems, pages 702 710, 2011.

Y. Zhou, R. Jin, and S. C.H. Hoi. Exclusive lasso for multi-task feature selection. In Proceedings of International Conference on Artiﬁcial Intelligence and Statistics, pages 988 995, 2010.