# On Multiplicative Multitask Feature Learning

Xin Wang, Jinbo Bi, Jiangwen Sun (Dept. of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269; {wangxin,jinbo,javon}@engr.uconn.edu)

Shipeng Yu (Health Services Innovation Center, Siemens Healthcare, Malvern, PA 19355; shipeng.yu@siemens.com)

## Abstract

We investigate a general framework of multiplicative multitask feature learning which decomposes each task's model parameters into a multiplication of two components. One of the components is used across all tasks and the other component is task-specific. Several previous methods have been proposed as special cases of our framework. We study the theoretical properties of this framework when different regularization conditions are applied to the two decomposed components. We prove that this framework is mathematically equivalent to the widely used multitask feature learning methods that are based on a joint regularization of all model parameters, but with a more general form of regularizers. Further, an analytical formula is derived for the across-task component as related to the task-specific component for all these regularizers, leading to a better understanding of the shrinkage effect. Study of this framework motivates new multitask learning algorithms. We propose two new learning formulations by varying the parameters in the proposed framework. Empirical studies have revealed the relative advantages of the two new formulations by comparison with the state of the art, which provides instructive insights into the feature learning problem with multiple tasks.

## 1 Introduction

Multitask learning (MTL) captures and exploits the relationship among multiple related tasks and has been empirically and theoretically shown to be more effective than learning each task independently. Multitask feature learning (MTFL) investigates a basic assumption that different tasks may share a common representation in the feature space. Either the task parameters can be projected to explore the latent common substructure [18], or a shared low-dimensional representation of data can be formed by feature learning [10]. Recent methods explore the latent basis that is used to develop the entire set of tasks, learn how to group the tasks [16, 11], or identify whether certain tasks are outliers to other tasks [6].

A widely used MTFL strategy is to impose a blockwise joint regularization of all task parameters to shrink the effects of features for the tasks. These methods employ a regularizer based on the so-called ℓ1,p matrix norm [12, 13, 15, 22, 24], which is the sum of the ℓp norms of the rows in a matrix. Regularizers based on the ℓ1,p norms encourage row sparsity. If rows represent features and columns represent tasks, they shrink entire rows of the matrix to zero. Typical choices for p are 2 [15, 4] and ∞ [20], which were used in the very early MTFL methods. Effective algorithms have since been developed for the ℓ1,2 [13] and ℓ1,∞ [17] regularization. Later, the ℓ1,p norm was generalized to the case 1 < p ≤ ∞, with a probabilistic interpretation showing that the resultant MTFL method solves a relaxed optimization problem with a generalized normal prior for all tasks [22]. Recent research applies the capped ℓ1,1 norm as a nonconvex joint regularizer [5].
The major limitation of joint regularized MTFL is that it either selects a feature as relevant to all tasks or excludes it from all models, which is very restrictive in practice, where tasks may share some features but also have their own specific features. To overcome this limitation, one of the most effective strategies is to decompose the model parameters into either a summation [9, 3, 6] or a multiplication [21, 2, 14] of two components, with different regularizers applied to the two components. One regularizer takes care of cross-task similarities and the other of cross-feature sparsity. Specifically, among the methods that decompose the parameter matrix into a summation of two matrices, the dirty model in [9] applies ℓ1,1 and ℓ1,∞ regularizers to the two components. A robust MTFL method in [3] uses the trace norm on one component for mining a low-rank structure shared by tasks and a column-wise ℓ1,2 norm on the other component for identifying task outliers. Another method applies the ℓ1,2 norm both row-wise to one component and column-wise to the other [6]. In the methods that work with multiplicative decompositions, the parameter vector of each task is decomposed into an element-wise product of two vectors, one used across tasks and the other task-specific. These methods use either the ℓ2-norm penalty on both component vectors [2] or the sparse ℓ1 norm on both components (i.e., the multi-level LASSO) [14]. The multi-level LASSO method has been analytically compared to the dirty model [14], showing that the multiplicative decomposition creates better shrinkage of the global and task-specific parameters. The across-task component can screen out the features irrelevant to all tasks. To exclude a feature from a task, the additive decomposition requires the corresponding entries in both components to be zero, whereas the multiplicative decomposition only requires one of the components to have a zero entry.

Although there are different ways to regularize the two components in the product, no systematic work has been done to analyze the algorithmic and statistical properties of the different regularizers. It is insightful to answer questions such as how these learning formulations differ from the early methods based on blockwise joint regularization, what the optimal solutions of the two components look like, and how the resultant solutions compare with those of other methods that also learn both shared and task-specific features. In this paper, we investigate a general framework of the multiplicative decomposition that enables a variety of regularizers to be applied. This general form includes all early methods that represent model parameters as a product of two components [2, 14]. Our theoretical analysis reveals that this family of methods is actually equivalent to the joint regularization based approach, but with a more general form of regularizers, including those that do not correspond to a matrix norm. The optimal solution of the across-task component can be computed analytically from the optimal task-specific parameters, showing the different shrinkage effects. Statistical justification and efficient algorithms are derived for this family of formulations. Motivated by the analysis, we propose two new MTFL formulations. Unlike the existing methods [2, 14], where the same kind of vector norm is applied to both components, the shrinkage of the global and task-specific parameters differs in the new formulations.
Specifically, one component is regularized by the ℓ2 norm and the other by the ℓ1 norm, reflecting the degree of sparsity of the task-specific parameters relative to the sparsity of the across-task parameters. In empirical experiments, simulations have been designed to examine the various feature sharing patterns under which a specific choice of regularizer may be preferred. Empirical results on benchmark data are also discussed.

## 2 The Proposed Multiplicative MTFL

Given $T$ tasks in total, for each task $t$, $t \in \{1, \ldots, T\}$, we have a sample set $(X_t \in \mathbb{R}^{\ell_t \times d},\ y_t \in \mathbb{R}^{\ell_t})$. The data matrix $X_t$ has $\ell_t$ examples, where the $i$-th row corresponds to the $i$-th example $x_i^t$ of task $t$, $i \in \{1, \ldots, \ell_t\}$, and each column represents a feature. The vector $y_t$ contains $y_i^t$, the label of the $i$-th example of task $t$. We consider functions of the linear form $X_t\alpha_t$ where $\alpha_t \in \mathbb{R}^d$. We define the parameter (weight) matrix $A = [\alpha_1, \ldots, \alpha_T]$ whose rows are denoted by $\alpha^j$, $j \in \{1, \ldots, d\}$.

A family of multiplicative MTFL methods can be derived by rewriting $\alpha_t = \mathrm{diag}(c)\,\beta_t$, where $\mathrm{diag}(c)$ is the diagonal matrix whose diagonal elements compose the vector $c$. The vector $c$ is used across all tasks, indicating whether a feature is useful for any of the tasks, and the vector $\beta_t$ is specific to task $t$. Let $j$ index the entries in these vectors, so $\alpha_j^t = c_j\beta_j^t$. Typically $c$ comprises binary entries equal to 0 or 1, but the integer constraint is often relaxed to require just non-negativity. We minimize a regularized loss function as follows for the best $c$ and $\beta_{t:t=1,\ldots,T}$:

$$\min_{c \geq 0,\ \beta_t}\ \sum_{t=1}^T L(c, \beta_t, X_t, y_t) + \gamma_1 \sum_{t=1}^T \|\beta_t\|_p^p + \gamma_2 \|c\|_k^k \qquad (1)$$

where $L(\cdot)$ is a loss function, e.g., the least squares loss for regression problems or the logistic loss for classification problems, $\|\beta_t\|_p^p = \sum_{j=1}^d |\beta_j^t|^p$ and $\|c\|_k^k = \sum_{j=1}^d (c_j)^k$, which are the ℓp norm of $\beta_t$ to the power of $p$ and the ℓk norm of $c$ to the power of $k$ when $p$ and $k$ are positive integers. The tuning parameters $\gamma_1, \gamma_2$ balance the empirical loss and the regularizers. At optimality, if $c_j = 0$, the $j$-th variable is removed for all tasks and the corresponding row vector $\alpha^j = 0$; otherwise the $j$-th variable is selected for use in at least one of the $\alpha$'s. A specific $\beta_t$ can then rule out the $j$-th variable from task $t$ if $\beta_j^t = 0$. In particular, if $p = k = 2$, Problem (1) becomes the formulation in [2], and if $p = k = 1$, Problem (1) becomes the formulation in [14]. Any other choice of $p$ and $k$ yields a new formulation for MTFL. We first examine the theoretical properties of this entire family of methods, and then empirically study two new formulations obtained by varying $p$ and $k$.
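To make the decomposition concrete, the following minimal Python sketch (our own variable names, with a least squares loss assumed) evaluates objective (1) for given components $c$ and $\beta_t$:

```python
import numpy as np

def mmtfl_objective(c, betas, Xs, ys, gamma1, gamma2, p, k):
    """Objective (1): alpha_t = diag(c) * beta_t, with c shared across tasks."""
    loss = sum(np.sum((X @ (c * b) - y) ** 2)            # least squares loss per task
               for X, b, y in zip(Xs, betas, ys))
    reg_beta = gamma1 * sum(np.sum(np.abs(b) ** p) for b in betas)  # sum_t ||beta_t||_p^p
    reg_c = gamma2 * np.sum(c ** k)                      # ||c||_k^k, with c >= 0
    return loss + reg_beta + reg_c
```

Setting $p = k = 2$ or $p = k = 1$ in this sketch recovers the objectives of [2] and [14], respectively.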
## 3 Theoretical Analysis

The joint ℓ1,p regularized MTFL method minimizes $\sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda \sum_{j=1}^d \|\alpha^j\|_p$ for the best $\alpha_{t:t=1,\ldots,T}$, where $\lambda$ is a tuning parameter. We now extend this formulation to allow more choices of regularizers. We introduce a new notation, an operator applied to a row vector such as $\alpha^j$:

$$\|\alpha^j\|_{p/q} = \sqrt[2q]{\textstyle\sum_{t=1}^T |\alpha_j^t|^p}, \qquad p, q > 0,$$

which corresponds to the standard ℓp norm when $q = p/2$ and both are positive. A joint regularized MTFL approach can solve the following optimization problem with pre-specified values of $p$, $q$ and $\lambda$, for the best parameters $\alpha_{t:t=1,\ldots,T}$:

$$\min_{\alpha_t}\ \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda \sum_{j=1}^d \|\alpha^j\|_{p/q}. \qquad (2)$$

The main result of this paper is (i) a theorem that establishes the equivalence between the models derived from solving Problem (1) and Problem (2) for properly chosen values of $\lambda$, $q$, $k$, $\gamma_1$ and $\gamma_2$; and (ii) an analytical solution of Problem (1) for $c$, which shows how the sparsity of the across-task component relates to the sparsity of the task-specific components.

**Theorem 1** *Let $\hat\alpha_t$ be the optimal solution to Problem (2) and $(\hat\beta_t, \hat c)$ be the optimal solution to Problem (1). Then $\hat\alpha_t = \mathrm{diag}(\hat c)\hat\beta_t$ when $\lambda = 2\gamma_1^{k/(k+p)}\gamma_2^{p/(k+p)}$ and $q = \frac{k+p}{2k}$ (or, equivalently, $k = \frac{p}{2q-1}$).*

**Proof.** The theorem holds by proving the following two lemmas. The first lemma proves that the solution $\hat\alpha_t$ of Problem (2) also minimizes the optimization problem

$$\min_{\alpha_t,\ \sigma \geq 0}\ \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \mu_1 \sum_{j=1}^d \sigma_j^{-1}\,\|\alpha^j\|_{p/q}^2 + \mu_2 \sum_{j=1}^d \sigma_j, \qquad (3)$$

and that the optimal solution of Problem (3) also minimizes Problem (2) when proper values of $\lambda$, $\mu_1$ and $\mu_2$ are chosen. The second lemma connects Problem (3) to our formulation (1). We show that the optimal $\hat\sigma_j$ equals $(\hat c_j)^k$, and the optimal $\hat\beta$ can then be computed from the optimal $\hat\alpha$.

**Lemma 1** *The solution sets of Problem (2) and Problem (3) are identical when $\lambda = 2\sqrt{\mu_1\mu_2}$.*

**Proof.** First, we show that when $\lambda = 2\sqrt{\mu_1\mu_2}$, the optimal solution $\hat\alpha_j^t$ of Problem (2) minimizes Problem (3), with the optimal $\hat\sigma_j = \sqrt{\mu_1\mu_2^{-1}}\,\|\hat\alpha^j\|_{p/q}$. By the Cauchy–Schwarz inequality, the following inequality holds:

$$\mu_1 \sum_{j=1}^d \sigma_j^{-1}\|\alpha^j\|_{p/q}^2 + \mu_2 \sum_{j=1}^d \sigma_j \ \geq\ 2\sqrt{\mu_1\mu_2}\sum_{j=1}^d \|\alpha^j\|_{p/q},$$

where equality holds if and only if $\sigma_j = \sqrt{\mu_1\mu_2^{-1}}\,\|\alpha^j\|_{p/q}$. Since Problems (3) and (2) use exactly the same loss function, when we set $\hat\sigma_j = \sqrt{\mu_1\mu_2^{-1}}\,\|\hat\alpha^j\|_{p/q}$, Problems (3) and (2) have identical objective functions if $\lambda = 2\sqrt{\mu_1\mu_2}$. Hence the pair $(\hat A = (\hat\alpha_j^t)_{jt},\ \hat\sigma = (\hat\sigma_j)_{j=1,\ldots,d})$ minimizes Problem (3), as it makes the objective function reach its lower bound. Second, it can be proved by contradiction that if the pair $(\hat A, \hat\sigma)$ minimizes Problem (3), then $\hat A$ also minimizes Problem (2). Suppose that $\hat A$ does not minimize Problem (2), i.e., there exists $\tilde A$ (with $\tilde\alpha^j \neq \hat\alpha^j$ for some $j$) that is optimal for Problem (2) and achieves a lower objective value than $\hat A$. Set $\tilde\sigma_j = \sqrt{\mu_1\mu_2^{-1}}\,\|\tilde\alpha^j\|_{p/q}$. The pair $(\tilde A, \tilde\sigma)$ is then optimal for Problem (3), as proved in the first paragraph, and it brings the objective of Problem (3) to a lower value than that of $(\hat A, \hat\sigma)$, contradicting the assumption that $(\hat A, \hat\sigma)$ is optimal for Problem (3). Hence, Problems (3) and (2) have identical solutions when $\lambda = 2\sqrt{\mu_1\mu_2}$.
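Before stating the second lemma, the equality condition of Lemma 1 can be checked numerically. The sketch below (with arbitrary made-up values for $\mu_1$, $\mu_2$ and $\|\alpha^j\|_{p/q}$) verifies by grid search that the inner minimization over $\sigma_j$ is attained at $\sigma_j = \sqrt{\mu_1/\mu_2}\,\|\alpha^j\|_{p/q}$ with minimum value $2\sqrt{\mu_1\mu_2}\,\|\alpha^j\|_{p/q}$:

```python
import numpy as np

# For fixed b_j = ||alpha^j||_{p/q}, minimize mu1 * b_j^2 / sigma + mu2 * sigma over sigma > 0.
mu1, mu2 = 0.7, 1.3
b = np.array([0.5, 2.0, 3.7])                 # hypothetical ||alpha^j||_{p/q} values
sigma_grid = np.linspace(1e-4, 20, 200001)
for bj in b:
    vals = mu1 * bj**2 / sigma_grid + mu2 * sigma_grid
    # Minimizer matches sqrt(mu1/mu2) * b_j; minimum matches 2 * sqrt(mu1*mu2) * b_j.
    assert np.isclose(sigma_grid[np.argmin(vals)], np.sqrt(mu1 / mu2) * bj, atol=1e-3)
    assert np.isclose(vals.min(), 2 * np.sqrt(mu1 * mu2) * bj, rtol=1e-6)
```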
From the proof of Lemma 1, we also see that the optimal objective value of Problem (2) gives a lower bound to the objective of Problem (3). Now let $\sigma_j = (c_j)^k$, $k \in \mathbb{R}$, $k \neq 0$, and $\alpha_j^t = c_j\beta_j^t$; an equivalent objective function of Problem (3) can be derived.

**Lemma 2** *The optimal solution $(\hat A, \hat\sigma)$ of Problem (3) is equivalent to the optimal solution $(\hat B, \hat c)$ of Problem (1), where $\hat\alpha_j^t = \hat c_j\hat\beta_j^t$ and $\hat\sigma_j = (\hat c_j)^k$, when $\gamma_1 = \mu_1^{\frac{kq}{2kq-p}}\mu_2^{\frac{kq-p}{2kq-p}}$, $\gamma_2 = \mu_2$, and $k = \frac{p}{2q-1}$.*

**Proof.** First, by contradiction, we show that if $\hat\alpha_j^t$ and $\hat\sigma_j$ optimize Problem (3), then $\hat c_j$ and $\hat\beta_j^t = \hat\alpha_j^t/\hat c_j$ optimize Problem (1). Denote the objectives of (1) and (3) by $J_{(1)}$ and $J_{(3)}$. Substituting $\hat\beta_j^t$, $\hat c_j$ for $\hat\alpha_j^t$, $\hat\sigma_j$ in $J_{(3)}$ yields the objective

$$\sum_{t=1}^T L(\hat c, \hat\beta_t, X_t, y_t) + \mu_1 \sum_{j=1}^d \|\hat\beta^j\|_{p/q}^2\,\hat c_j^{(p-kq)/q} + \mu_2 \sum_{j=1}^d (\hat c_j)^k.$$

By the proof of Lemma 1, $\hat\sigma_j = \sqrt{\mu_1\mu_2^{-1}}\,\|\hat\alpha^j\|_{p/q}$, and hence $\hat c_j = \left(\sqrt{\mu_1\mu_2^{-1}}\,\|\hat\beta^j\|_{p/q}\right)^{\frac{2q}{2kq-p}}$. Applying this formula of $\hat c_j$ and substituting $\mu_1$ and $\mu_2$ by $\gamma_1$ and $\gamma_2$ yields an objective identical to $J_{(1)}$. Now suppose there exists $(\tilde\beta_j^t, \tilde c_j) \neq (\hat\beta_j^t, \hat c_j)$ that minimizes (1) with $J_{(1)}(\tilde\beta_j^t, \tilde c_j) < J_{(1)}(\hat\beta_j^t, \hat c_j)$. Let $\tilde\alpha_j^t = \tilde c_j\tilde\beta_j^t$ and substitute $\tilde\beta_j^t$ by $\tilde\alpha_j^t/\tilde c_j$ in $J_{(1)}$. By the Cauchy–Schwarz inequality, we similarly obtain $\tilde c_j = \left(\gamma_1\gamma_2^{-1}\sum_{t=1}^T |\tilde\alpha_j^t|^p\right)^{\frac{1}{p+k}}$, so that $J_{(1)}(\tilde\alpha_j^t, \tilde c_j)$ can be turned into $J_{(3)}(\tilde\alpha_j^t, \tilde\sigma_j)$ with $\tilde\sigma_j = (\tilde c_j)^k$. We then have $J_{(3)}(\tilde\alpha_j^t, \tilde\sigma_j) < J_{(3)}(\hat\alpha_j^t, \hat\sigma_j)$, which contradicts the optimality of $(\hat\alpha_j^t, \hat\sigma_j)$. Second, we similarly prove that if $\hat\beta_j^t$ and $\hat c_j$ optimize Problem (1), then $\hat\alpha_j^t = \hat c_j\hat\beta_j^t$ and $\hat\sigma_j = (\hat c_j)^k$ optimize Problem (3).

Combining the results of the two lemmas, we derive that when $\lambda = 2\gamma_1^{k/(k+p)}\gamma_2^{p/(k+p)}$ and $q = \frac{k+p}{2k}$, the optimal solutions to Problems (1) and (2) are equivalent: solving Problem (1) yields an optimal solution $\hat\alpha$ to Problem (2), and vice versa.

**Theorem 2** *Let $\hat\beta_t$, $t = 1, \ldots, T$, be the optimal solutions of Problem (1), let $\hat B = [\hat\beta_1, \ldots, \hat\beta_T]$, and let $\hat\beta^j$ denote the $j$-th row of $\hat B$. Then*

$$\hat c_j = (\gamma_1/\gamma_2)^{1/k}\,\|\hat\beta^j\|_p^{\frac{p}{2kq-p}}, \qquad (4)$$

*for all $j = 1, \ldots, d$, is optimal to Problem (1), where $\|\hat\beta^j\|_p$ is the standard ℓp norm of the row vector $\hat\beta^j$.*

**Proof.** The analytical formula can be derived directly from Lemmas 1 and 2. Setting $\hat\sigma_j = (\hat c_j)^k$ and $\hat\alpha_j^t = \hat c_j\hat\beta_j^t$ in Problem (3), we obtain $\hat c_j = \left(\sqrt{\mu_1\mu_2^{-1}}\,\|\hat\beta^j\|_{p/q}\right)^{\frac{2q}{2kq-p}}$. From the proof of Lemma 2, $\mu_1 = \gamma_1^{\frac{2kq-p}{kq}}\gamma_2^{\frac{p-kq}{kq}}$ and $\mu_2 = \gamma_2$. Substituting these into the formula of $\hat c_j$ yields formula (4).

Based on this derivation, for each pair $\{p, q\}$ and $\lambda$ in Problem (2), there exists an equivalent Problem (1) with determined values of $k$, $\gamma_1$ and $\gamma_2$, and vice versa. Note that if $q = p/2$, the regularization term on $\alpha^j$ in Problem (2) becomes the standard ℓp norm. In particular, if $\{p, q\} = \{2, 1\}$ in Problem (2), as used in the methods of [15] and [1], the ℓ2-norm regularizer is applied to $\alpha^j$; this problem is then equivalent to Problem (1) with $k = 2$ and $\lambda = 2\sqrt{\gamma_1\gamma_2}$, the same formulation as in [2]. If $\{p, q\} = \{1, 1\}$, the square root of the ℓ1-norm regularizer is applied to $\alpha^j$; our Theorem 1 shows that this problem is equivalent to the multi-level LASSO MTFL formulation [14], which is Problem (1) with $k = 1$ and $\lambda = 2\sqrt{\gamma_1\gamma_2}$.

## 4 Probabilistic Interpretation

In this section we show that the proposed multiplicative formalism is related to the maximum a posteriori (MAP) solution of a probabilistic model. Let $p(A|\delta)$ be the prior distribution of the weight matrix $A = [\alpha_1, \ldots, \alpha_T] = [\alpha^1, \ldots, \alpha^d]^\top \in \mathbb{R}^{d \times T}$, where $\delta$ denotes the parameters of the prior. The posterior distribution of $A$ can be calculated via Bayes' rule as $p(A|X, y, \delta) \propto p(A|\delta)\prod_{t=1}^T p(y_t|X_t, \alpha_t)$. Denote by $z \sim GN(\mu, \rho, q)$ the univariate generalized normal distribution with density $p(z) = \frac{1}{2\rho\,\Gamma(1+1/q)}\exp\left(-\frac{|z-\mu|^q}{\rho^q}\right)$, in which $\rho > 0$, $q > 0$, and $\Gamma(\cdot)$ is the Gamma function [7]. Now let each element $\alpha_j^t$ of $A$ follow a generalized normal prior, $\alpha_j^t \sim GN(0, \delta_j, q)$. Then, with the i.i.d. assumption, the prior takes the form (also refer to [22] for a similar treatment)

$$p(A|\delta) \propto \prod_{j=1}^d \prod_{t=1}^T \frac{1}{\delta_j}\exp\left(-\frac{|\alpha_j^t|^q}{\delta_j^q}\right) = \prod_{j=1}^d \frac{1}{\delta_j^T}\exp\left(-\frac{\|\alpha^j\|_q^q}{\delta_j^q}\right), \qquad (5)$$

where $\|\cdot\|_q$ denotes the vector q-norm. With an appropriately chosen likelihood function $p(y_t|X_t, \alpha_t) \propto \exp(-L(\alpha_t, X_t, y_t))$, finding the MAP solution is equivalent to solving

$$\min_{A,\,\delta}\ J = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \sum_{j=1}^d \left(\frac{\|\alpha^j\|_q^q}{\delta_j^q} + T\ln\delta_j\right).$$

Setting the derivative of $J$ with respect to $\delta_j$ to zero and substituting the result back, we obtain (up to additive constants):

$$\min_A\ J = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + T\sum_{j=1}^d \ln\|\alpha^j\|_q. \qquad (6)$$
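As a side note on the generalized normal prior in (5): this family is available in SciPy as gennorm, whose shape parameter plays the role of $q$ here. A quick sketch (assuming SciPy is installed) confirms that $q = 2$ recovers a Gaussian and $q = 1$ a Laplace prior, up to scale conventions:

```python
import numpy as np
from scipy.stats import gennorm, norm, laplace

x = np.linspace(-3, 3, 61)
# gennorm.pdf(x, beta) = beta / (2 * Gamma(1/beta)) * exp(-|x|^beta)
print(np.allclose(gennorm.pdf(x, 2), norm.pdf(x, scale=np.sqrt(0.5))))  # True: Gaussian
print(np.allclose(gennorm.pdf(x, 1), laplace.pdf(x)))                   # True: Laplace
```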
Now let us look at the multiplicative nature of $\alpha_j^t = c_j\beta_j^t$ for different $q \in [1, \infty)$. When $q = 1$, we have

$$\sum_{j=1}^d \ln\|\alpha^j\|_1 = \sum_{j=1}^d \ln\left(\sum_{t=1}^T |c_j\beta_j^t|\right) = \sum_{j=1}^d \left(\ln|c_j| + \ln\sum_{t=1}^T |\beta_j^t|\right).$$

Because $\ln z \leq z - 1$ for any $z > 0$, we can optimize an upper bound of $J$ in (6):

$$\min_A\ J_1 = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + T\sum_{j=1}^d |c_j| + T\sum_{j=1}^d \sum_{t=1}^T |\beta_j^t|,$$

which is equivalent to the multiplicative formulation (1) with $\{p, k\} = \{1, 1\}$. For $q > 1$, we have

$$\sum_{j=1}^d \ln\|\alpha^j\|_q = \frac{1}{q}\sum_{j=1}^d \ln\left(\sum_{t=1}^T |c_j\beta_j^t|^q\right) \leq \frac{1}{q}\sum_{j=1}^d \ln\left(\max\{|c_1|, \ldots, |c_d|\}^q \sum_{t=1}^T |\beta_j^t|^q\right) = d\ln\|c\|_\infty + \frac{1}{q}\sum_{j=1}^d \ln\left(\sum_{t=1}^T |\beta_j^t|^q\right) \leq d\|c\|_\infty + \frac{1}{q}\sum_{t=1}^T \|\beta_t\|_q^q - \left(d + \frac{d}{q}\right).$$

Since vector norms satisfy $\|z\|_\infty \leq \|z\|_k$ for any vector $z$ and $k \geq 1$, these inequalities lead to an upper bound of $J$ in (6), i.e.,

$$\min_A\ J_{q,k} = \sum_{t=1}^T L(\alpha_t, X_t, y_t) + Td\|c\|_k + \frac{T}{q}\sum_{t=1}^T \|\beta_t\|_q^q,$$

which corresponds to the general multiplicative formulation (1).

## 5 Optimization Algorithm

Alternating optimization algorithms have been used in both of the early methods [2, 14] to solve Problem (1); they alternate between two subproblems: solving for $\beta_t$ with $c$ fixed, and solving for $c$ with $\beta_t$ fixed. The convergence of such an alternating algorithm has been analyzed in [2], showing that it converges to a local minimizer. However, both subproblems in the existing methods can only be solved with iterative algorithms such as gradient descent or linear/quadratic program solvers. We design a new alternating optimization algorithm that exploits the fact that both Problems (1) and (2) are equivalent to Problem (3) used in our proof, and we derive a closed-form solution of the second subproblem for $c$. The following theorem characterizes this result.

**Theorem 3** *For any given values of $\alpha_{t:t=1,\ldots,T}$, the optimal $\sigma$ of Problem (3) with $\alpha_{t:t=1,\ldots,T}$ fixed to the given values can be computed as*

$$\sigma_j = \left(\gamma_1\gamma_2^{-1}\right)^{1-\frac{p}{2kq}}\sqrt[2q]{\textstyle\sum_{t=1}^T |\alpha_j^t|^p}, \qquad j = 1, \ldots, d.$$

**Proof.** By the Cauchy–Schwarz inequality and the same argument as in the proof of Lemma 1, the best $\sigma$ for a given set of $\alpha_{t:t=1,\ldots,T}$ is $\sigma_j = \sqrt{\mu_1\mu_2^{-1}}\,\|\alpha^j\|_{p/q}$. We also know that $\mu_1$ and $\mu_2$ are chosen such that $\gamma_1 = \mu_1^{\frac{kq}{2kq-p}}\mu_2^{\frac{kq-p}{2kq-p}}$ and $\gamma_2 = \mu_2$, which is equivalent to $\mu_1 = \gamma_1^{\frac{2kq-p}{kq}}\gamma_2^{\frac{p-kq}{kq}}$ and $\mu_2 = \gamma_2$. Substituting these into the formula of $\sigma_j$ yields the result.

In the algorithm for Problem (1), we solve the first subproblem to obtain a new iterate $\beta_t^{new}$, then use the current value $c^{old}$ to compute $\alpha_t^{new} = \mathrm{diag}(c^{old})\beta_t^{new}$, which is used to compute $\sigma_j$ according to Theorem 3. Then $c$ is updated as $c_j = \sigma_j^{1/k}$, $j = 1, \ldots, d$. The overall procedure is summarized in Algorithm 1.

**Algorithm 1** Alternating optimization for multiplicative MTFL

Input: $X_t, y_t$, $t = 1, \ldots, T$, as well as $\gamma_1$, $\gamma_2$, $p$ and $k$.
Initialize: $c_j^{(0)} = 1$, $j = 1, \ldots, d$; $s = 0$.
repeat
  1. $s \leftarrow s + 1$; convert $\tilde X_t \leftarrow X_t\,\mathrm{diag}(c^{(s-1)})$, $t = 1, \ldots, T$; for each $t = 1, \ldots, T$, solve $\min_{\beta_t} L(\beta_t, \tilde X_t, y_t) + \gamma_1\|\beta_t\|_p^p$ for $\beta_t^{(s)}$.
  2. Compute $\alpha_t^{(s)} = \mathrm{diag}(c^{(s-1)})\beta_t^{(s)}$, and compute $c^{(s)}$ as $c_j^{(s)} = \sigma_j^{1/k}$, where $\sigma_j$ is given by the formula in Theorem 3.
until $\max_{j,t} |(\alpha_j^t)^{(s)} - (\alpha_j^t)^{(s-1)}| < \epsilon$
Output: $\alpha_t$, $c$ and $\beta_t$, $t = 1, \ldots, T$.

Algorithm 1 can be used to solve the entire family of methods characterized by Problem (1). The first subproblem involves convex optimization if a convex loss function is chosen and $p \geq 1$, and it can be solved separately for individual tasks using single task learning. The second subproblem is solved analytically by a formula that guarantees that Problem (1) reaches a lower bound for the current $\alpha_t$. In this paper, the least squares and logistic regression losses are used, both of which are convex and differentiable. When convex and differentiable losses are used, the theoretical results in [19] can be used to prove the convergence of the proposed algorithm.
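For concreteness, here is a minimal Python sketch of Algorithm 1 for the least squares loss and $p = 2$, so that subproblem 1 is a ridge regression with a closed-form solution; the function and variable names are ours, not from the paper:

```python
import numpy as np

def mmtfl(Xs, ys, gamma1, gamma2, p=2, k=1, max_iter=100, tol=1e-5):
    """Sketch of Algorithm 1 (least squares loss, p = 2).
    q is set to (k+p)/(2k) as required by Theorem 1."""
    d = Xs[0].shape[1]
    q = (k + p) / (2.0 * k)
    c = np.ones(d)                            # start with all features active
    alphas = np.zeros((len(Xs), d))
    for _ in range(max_iter):
        # Subproblem 1: per task, min_beta ||X diag(c) beta - y||^2 + gamma1 ||beta||_2^2.
        betas = []
        for X, y in zip(Xs, ys):
            Xc = X * c                        # X_t diag(c)
            betas.append(np.linalg.solve(Xc.T @ Xc + gamma1 * np.eye(d), Xc.T @ y))
        new_alphas = np.array([c * b for b in betas])
        # Subproblem 2: closed-form sigma_j from Theorem 3, then c_j = sigma_j^(1/k).
        a = np.sum(np.abs(new_alphas) ** p, axis=0)        # sum_t |alpha_j^t|^p
        sigma = (gamma1 / gamma2) ** (1 - p / (2 * k * q)) * a ** (1 / (2 * q))
        c = sigma ** (1.0 / k)
        if np.max(np.abs(new_alphas - alphas)) < tol:
            alphas = new_alphas
            break
        alphas = new_alphas
    return alphas, c, betas
```

For $p = 1$, the first subproblem becomes a Lasso for each task; any ℓ1 solver can replace the closed-form ridge step above.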
We choose to monitor the maximum norm of the change in the $A$ matrix to terminate the process, but this can be replaced by any other suitable termination criterion. Initialization can be important for this algorithm; we suggest starting with $c = \mathbf{1}$, which considers all features initially in the learning process.

## 6 Two New Formulations

The two existing methods discussed in [2, 14] use $p = k$ in their formulations, which gives $\beta_j^t$ and $c_j$ the same amount of shrinkage. To explore other feature sharing patterns among tasks, we propose two new formulations with $p \neq k$. For the common choices of $p$ and $k$, the relation between the optimal $c$ and $\beta$ can be computed according to Theorem 2 and is summarized in Table 1; a small numerical sketch follows the table.

1. When the majority of features are not relevant to any of the tasks, a sparsity-inducing norm on $c$ is required. Within the relevant features, however, many features are shared between tasks; in other words, the features used in each task are not sparse relative to the features selected by $c$, which calls for a non-sparsity-inducing norm on $\beta$. Hence, we use the ℓ1 norm on $c$ and the ℓ2 norm on the $\beta$'s in Formulation (1). This formulation is equivalent to the joint regularization method $\min_{\alpha_t} \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda\sum_{j=1}^d \sqrt[3]{\sum_{t=1}^T (\alpha_j^t)^2}$ with $\lambda = 2\gamma_1^{1/3}\gamma_2^{2/3}$.

2. When many or all features are relevant to the given tasks, the ℓ2 norm penalty on $c$ may be preferred. However, if only a limited number of features are shared between tasks, i.e., the features used by individual tasks are sparse with respect to the features selected as useful across tasks by $c$, we can impose the ℓ1 norm penalty on $\beta$. This formulation is equivalent to the joint regularization method $\min_{\alpha_t} \sum_{t=1}^T L(\alpha_t, X_t, y_t) + \lambda\sum_{j=1}^d \sqrt[3]{\left(\sum_{t=1}^T |\alpha_j^t|\right)^2}$ with $\lambda = 2\gamma_1^{2/3}\gamma_2^{1/3}$.

Table 1: The shrinkage effect of $c$ with respect to $\beta$ for four common choices of $p$ and $k$.

| $p$ | $k$ | $\hat c_j$ |
|---|---|---|
| 2 | 2 | $\hat c_j = \sqrt{\gamma_1\gamma_2^{-1}\sum_{t=1}^T (\hat\beta_j^t)^2}$ |
| 2 | 1 | $\hat c_j = \gamma_1\gamma_2^{-1}\sum_{t=1}^T (\hat\beta_j^t)^2$ |
| 1 | 1 | $\hat c_j = \gamma_1\gamma_2^{-1}\sum_{t=1}^T |\hat\beta_j^t|$ |
| 1 | 2 | $\hat c_j = \sqrt{\gamma_1\gamma_2^{-1}\sum_{t=1}^T |\hat\beta_j^t|}$ |
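As a sanity check on Table 1, the following sketch evaluates formula (4) for the four $(p, k)$ settings, using toy numbers; note that a feature with $\hat\beta^j = 0$ in all tasks gets $\hat c_j = 0$ and is screened out:

```python
import numpy as np

def c_from_beta(B_hat, gamma1, gamma2, p, k):
    """Formula (4): c_j = (g1/g2)^(1/k) * ||beta^j||_p^(p/(2kq-p)).
    With q = (k+p)/(2k), the exponent 2kq - p simplifies to k.
    B_hat has shape (T, d): row t holds beta_t."""
    a = np.sum(np.abs(B_hat) ** p, axis=0)     # sum_t |beta_j^t|^p per feature
    return (gamma1 / gamma2) ** (1.0 / k) * a ** (1.0 / k)

# Feature 1 has beta_j^t = 0 in every task, so its c_j is 0 under all settings.
B = np.array([[0.0, 1.0, -2.0], [0.0, 0.5, 1.0]])
for p, k in [(2, 2), (2, 1), (1, 1), (1, 2)]:
    print((p, k), c_from_beta(B, gamma1=0.1, gamma2=0.4, p=p, k=k))
```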
## 7 Experiments

In this section, we empirically evaluate the performance of the proposed multiplicative MTFL with the four parameter settings listed in Table 1, on synthetic and real-world data, for both classification and regression problems. The first two settings, $(p, k) = (2, 2)$ and $(1, 1)$, give the methods in [2] and [14], respectively, and the last two settings correspond to our new formulations. The least squares and logistic regression losses are used for regression and classification problems, respectively. We focus on understanding the shrinkage effects created by the different choices of regularizers in multiplicative MTFL. These methods are referred to as MMTFL and are compared with the dirty model (DMTL) [9] and robust MTFL (rMTFL) [6], which use the additive decomposition. The first subproblem of Algorithm 1 was solved using CPLEX solvers, and the single task learning models obtained in the initial pass of the first subproblem served as the baseline (STL). We used 25%, 33% and 50% of the available data in each data set for training and the rest for testing. We repeated the random split 15 times and report the averaged performance. For each split, the regularization parameters of each method were tuned by 3-fold cross validation within the training data. Regression performance was measured by the coefficient of determination, denoted $R^2$, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. Classification performance was measured by the F1 score, the harmonic mean of precision and recall.

Synthetic Data. We created two synthetic data sets comprising 10 and 20 tasks, respectively. For each task, we created 200 examples using 100 features with pre-defined combination weights $\alpha$. Each feature was generated following the $N(0, 1)$ distribution. We added noise and computed $y_t = X_t\alpha_t + \epsilon_t$ for each task $t$, where the noise $\epsilon$ followed the distribution $N(0, 1)$. We put the different tasks' $\alpha$'s together as rows in Figure 1. The values of the $\alpha$'s were specified so as to explore how the structure of feature sharing influences the multitask learning models when various regularizers are used. In particular, we illustrate cases where the two newly proposed formulations outperform the other methods.

Figure 1: Parameter matrices learned by different methods (darker color indicates greater values): (a) Synthetic data D1; (b) Synthetic data D2.

Synthetic Data 1 (D1). As shown in Figure 1a, 40% of the components of all $\alpha$'s were set to 0, so these features were irrelevant to all tasks. The remaining features were used in every task's model, and hence the models were sparse with respect to the full set of features but not with respect to the selected features. This was the assumption under which the early joint regularized methods were designed to work. To learn this feature sharing structure, however, we observed that the amount of shrinkage needed differs between $c$ and $\beta$: this case might favor the ℓ1 norm penalty on $c$.

Synthetic Data 2 (D2). The designed parameter matrix is shown in Figure 1b, where the tasks were split into 6 groups. Five features were irrelevant to all tasks, 10 features were used by all tasks, and each of the remaining 85 features was used by only 1 or 2 groups. Neighboring groups of tasks in Figure 1b shared only 7 features besides the 10 common features; non-neighboring groups did not share additional features. We expected $c$ to be non-sparse. However, each task used very few features relative to all available features, and hence each $\beta$ should be sparse.

Figure 1 shows the parameter matrices (with columns representing features for illustrative convenience) learned by the different methods using 33% of the available examples in each data set. We can clearly see that MMTFL(2,1) performs best on Synthetic data D1. This result suggests that the classic choice of the ℓ2 or ℓ1 penalty on both $c$ and $\beta$ (corresponding to the early joint regularized methods) may not always be optimal. MMTFL(1,2) is superior on Synthetic data D2, where each model shows strong feature sparsity but few features can be removed if all tasks are considered. Table 2 summarizes the performance comparison, with the best performance highlighted in bold. Note that these feature sharing patterns may not be revealed by the recent methods on clustered multitask learning that cluster tasks into groups [10, 8, 23], because, for instance, no cluster structure is present in Figure 1b; rather, the sharing pattern in Figure 1b has the shape of a staircase, which can be generated along the lines of the sketch below.
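A D2-like staircase weight matrix can be generated roughly as follows (a hedged sketch: the exact feature indices and group sizes used in the paper are not given, so these are our own choices matching the description):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 20, 100, 200
A = np.zeros((T, d))                       # rows: tasks, columns: features
A[:, 5:15] = rng.normal(size=(T, 10))      # 10 features used by all tasks
for g in range(6):                         # 6 task groups over the remaining 85 features;
    tasks = slice(g * T // 6, (g + 1) * T // 6)        # neighboring groups share 7 of them
    feats = slice(15 + 13 * g, 35 + 13 * g)            # 20 features per group, stride 13
    A[tasks, feats] = rng.normal(size=A[tasks, feats].shape)
Xs = [rng.normal(size=(n, d)) for _ in range(T)]       # features ~ N(0, 1)
ys = [X @ a + rng.normal(size=n) for X, a in zip(Xs, A)]   # y_t = X_t alpha_t + noise
```

Features 0–4 stay zero for every task (irrelevant), features 5–14 are shared by all tasks, and each remaining feature is used by at most two neighboring groups, producing the staircase pattern.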
Table 2: Comparison of the performance of various multitask learning models (mean ± standard deviation; best in bold).

| Data set | Train % | STL | DMTL | rMTFL | MMTFL(2,2) | MMTFL(1,1) | MMTFL(2,1) | MMTFL(1,2) |
|---|---|---|---|---|---|---|---|---|
| Synthetic D1 ($R^2$) | 25% | 0.40±0.02 | 0.60±0.02 | 0.58±0.02 | 0.64±0.02 | 0.54±0.03 | **0.73±0.02** | 0.42±0.04 |
| | 33% | 0.55±0.03 | 0.73±0.01 | 0.61±0.02 | 0.79±0.02 | 0.76±0.01 | **0.86±0.01** | 0.65±0.03 |
| | 50% | 0.60±0.02 | 0.75±0.01 | 0.66±0.01 | 0.86±0.01 | 0.88±0.01 | **0.90±0.01** | 0.84±0.01 |
| Synthetic D2 ($R^2$) | 25% | 0.28±0.02 | 0.36±0.01 | 0.46±0.01 | 0.45±0.01 | 0.35±0.05 | 0.46±0.02 | **0.49±0.02** |
| | 33% | 0.35±0.01 | 0.42±0.02 | 0.63±0.03 | 0.69±0.02 | 0.75±0.01 | 0.67±0.03 | **0.83±0.02** |
| | 50% | 0.75±0.01 | 0.81±0.01 | 0.83±0.01 | 0.91±0.00 | 0.95±0.00 | 0.92±0.01 | **0.97±0.00** |
| SARCOS ($R^2$) | 25% | 0.78±0.02 | **0.90±0.00** | **0.90±0.00** | 0.89±0.00 | 0.89±0.00 | **0.90±0.01** | 0.87±0.01 |
| | 33% | 0.78±0.02 | 0.88±0.11 | 0.89±0.10 | 0.90±0.00 | 0.90±0.00 | **0.91±0.01** | 0.89±0.01 |
| | 50% | 0.83±0.06 | 0.87±0.10 | 0.89±0.10 | **0.91±0.00** | 0.90±0.01 | **0.91±0.01** | 0.89±0.01 |
| USPS (F1 score) | 25% | 0.83±0.01 | 0.89±0.01 | **0.91±0.01** | 0.90±0.01 | 0.90±0.01 | 0.90±0.01 | **0.91±0.01** |
| | 33% | 0.84±0.02 | 0.90±0.01 | 0.90±0.01 | 0.89±0.01 | 0.90±0.01 | 0.90±0.01 | **0.91±0.01** |
| | 50% | 0.87±0.02 | 0.91±0.01 | 0.92±0.01 | 0.92±0.01 | 0.92±0.01 | 0.92±0.01 | **0.93±0.01** |

Real-world Data. Two benchmark data sets, the SARCOS data set [1] and the USPS data set [10], were used for regression and classification tests, respectively. The SARCOS data set has 48,933 observations, and each observation (example) has 21 features. Each task is to map from the 21 features to one of the 7 joint torques of the SARCOS robot arm. We randomly selected 2,000 examples for use in each task. The USPS handwritten digits data set has 2,000 examples and 10 classes, the digits 0 to 9. We first used principal component analysis to reduce the feature dimension to 87. To create binary classification tasks, one per digit, we randomly chose images from the other 9 classes as negative examples. Table 2 reports the performance of the different methods on these two data sets, showing the effectiveness of MMTFL(2,1) and MMTFL(1,2).

## 8 Conclusion

In this paper, we study a general framework of multiplicative multitask feature learning. By decomposing the model parameters of each task into a product of two components, the across-task feature indicator and the task-specific parameters, and applying different regularizers to the two components, we can select features for individual tasks and also search for the shared features among tasks. We have studied the theoretical properties of this framework under different regularizers and found that this family of methods creates models equivalent to those of the joint regularized MTL methods, but with a more general form of regularization. Further, an analytical formula is derived for the across-task component as related to the task-specific component, which sheds light on the different shrinkage effects of the various regularizers. An efficient algorithm is derived to solve the entire family of methods and is tested in our experiments. Empirical results on synthetic data clearly show that no particular choice of regularizers is universally better than the others. We empirically identify a few feature sharing patterns that favor the two newly proposed choices of regularizers, which is confirmed on both synthetic and real-world data sets.

## Acknowledgements

Jinbo Bi and her students Xin Wang and Jiangwen Sun were supported by NSF grants IIS-1320586, DBI-1356655, IIS-1407205, and IIS-1447711.

## References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Proceedings of NIPS'07, pages 41–48, 2007.
[2] J. Bi, T. Xiong, S. Yu, M. Dundar, and R. B. Rao. An improved multi-task learning approach with applications in medical diagnosis. In Proceedings of ECML'08, pages 117–132, 2008.
[3] J. Chen, J. Zhou, and J. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of KDD'11, pages 42–50, 2011.
[4] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of KDD'04, pages 109–117, 2004.
[5] P. Gong, J. Ye, and C. Zhang. Multi-stage multi-task feature learning. In Proceedings of NIPS'12, pages 1997–2005, 2012.
[6] P. Gong, J. Ye, and C. Zhang. Robust multi-task feature learning. In Proceedings of KDD'12, pages 895–903, 2012.
[7] I. R. Goodman and S. Kotz. Multivariate θ-generalized normal distributions. Journal of Multivariate Analysis, 3(2):204–219, 1973.
[8] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: a convex formulation. In Proceedings of NIPS'08, 2008.
[9] A. Jalali, S. Sanghavi, C. Ruan, and P. K. Ravikumar. A dirty model for multi-task learning. In Proceedings of NIPS'10, pages 964–972, 2010.
[10] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In Proceedings of ICML'11, pages 521–528, 2011.
[11] A. Kumar and H. Daume III. Learning task grouping and overlap in multi-task learning. In Proceedings of ICML'12, 2012.
[12] S. Lee, J. Zhu, and E. Xing. Adaptive multi-task lasso: with application to eQTL detection. In Proceedings of NIPS'10, pages 1306–1314, 2010.
[13] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of UAI'09, pages 339–348, 2009.
[14] A. Lozano and G. Swirszcz. Multi-level lasso for sparse multi-task regression. In Proceedings of ICML'12, pages 361–368, 2012.
[15] G. Obozinski and B. Taskar. Multi-task feature selection. Technical report, Statistics Department, UC Berkeley, 2006.
[16] A. Passos, P. Rai, J. Wainer, and H. Daume III. Flexible modeling of latent task structures in multitask learning. In Proceedings of ICML'12, pages 1103–1110, 2012.
[17] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. In Proceedings of ICML'09, pages 108–115, 2009.
[18] P. Rai and H. Daume III. Infinite predictor subspace models for multitask learning. In International Conference on Artificial Intelligence and Statistics, pages 613–620, 2010.
[19] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[20] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
[21] T. Xiong, J. Bi, B. Rao, and V. Cherkassky. Probabilistic joint feature selection for multi-task learning. In Proceedings of SIAM International Conference on Data Mining (SDM), pages 69–76, 2006.
[22] Y. Zhang, D.-Y. Yeung, and Q. Xu. Probabilistic multi-task feature selection. In Proceedings of NIPS'10, pages 2559–2567, 2010.
[23] J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In Proceedings of NIPS'11, pages 702–710, 2011.
[24] Y. Zhou, R. Jin, and S. C. Hoi. Exclusive lasso for multi-task feature selection. In Proceedings of UAI'10, pages 988–995, 2010.