# Learning to Multitask

Yu Zhang¹, Ying Wei², Qiang Yang¹
¹HKUST  ²Tencent AI Lab
yu.zhang.ust@gmail.com, judywei@tencent.com, qyang@cse.ust.hk

Abstract

Multitask learning has shown promising performance in many applications, and many multitask models have been proposed. In order to identify an effective multitask model for a given multitask problem, we propose a learning framework called Learning to Multitask (L2MT). To achieve this goal, L2MT exploits historical multitask experience, organized as a training set consisting of several tuples, each of which contains a multitask problem with multiple tasks, a multitask model, and the relative test error. Based on such a training set, L2MT first uses a proposed layerwise graph neural network to learn task embeddings for all the tasks in a multitask problem and then learns an estimation function to estimate the relative test error based on both the task embeddings and the representation of the multitask model derived from a unified formulation. Given a new multitask problem, the estimation function is used to identify a suitable multitask model. Experiments on benchmark datasets show the effectiveness of the proposed L2MT framework.

1 Introduction

Multitask learning [9] aims to leverage the useful information contained in multiple tasks to improve the generalization performance of all of them. Over the past decades, many multitask models have been proposed. According to a recent survey [38], these models can be classified into two main categories: the feature-based approach and the parameter-based approach. The feature-based approach uses data features as the medium for sharing knowledge among tasks, usually by learning a common feature representation for all the tasks. It can be further divided into two categories: the shallow approach [9, 2, 43] and the deep approach [25]. Different from the feature-based approach, the parameter-based approach links different tasks by placing regularizers or Bayesian priors on model parameters to achieve knowledge transfer among tasks. It can be further classified into four categories: the low-rank approach [1, 28, 17], the task clustering approach [18, 20, 15], the task relation learning approach [39, 40, 36, 35, 41, 42, 21, 37], and the decomposition approach [10, 19, 44, 16]. Given so many multitask models, an important issue is how to choose a good one for a given multitask problem. One solution is model selection, e.g., cross-validation or its variants, but this is computationally heavy since each candidate model needs to be trained multiple times. In this paper, we propose a framework called Learning to Multitask (L2MT) to solve this issue with a learning-based approach. The main idea of L2MT is to exploit historical multitask experience to learn how to choose a suitable multitask model for a new multitask problem. To achieve this, the historical multitask experience is represented as a training set consisting of tuples, each of which has three entries: a multitask problem, a multitask model, and the relative test error, which equals the ratio of the average test error of the multitask model on the multitask problem to that of the single-task learning model.
Based on this training set, we propose an end-to-end approach to learn the mapping from both the multitask problem and the multitask model to the relative test error, which requires determining representations for the multitask problem and the multitask model. First, a Layerwise Graph Neural Network (LGNN) is proposed to learn a task embedding as the representation of each task in a multitask problem, and by aggregating all the task embeddings, the task embedding matrix is used as the representation of the multitask problem. For multitask models that admit a unified formulation, task covariance matrices are used as their representations since they play an important role in revealing pairwise task relations. Both representations are then fed into an estimation function that estimates the relative test error. For a new multitask problem, we obtain the task embedding matrix via the LGNN learned on the training set, and then, in order to achieve a low relative test error, we minimize the estimation function to learn the task covariance matrix as well as the corresponding multitask model. Experiments on benchmark datasets show the effectiveness of the proposed L2MT framework.

2 A Unified Formulation for Multitask Learning

Before presenting the L2MT framework, in this section we give a unified formulation for multitask learning by extending the one proposed in the survey [38]. Suppose that we are given a multitask problem consisting of $m$ tasks $\{T_i\}_{i=1}^m$. For task $T_i$, the training dataset contains $n_i$ data points $\{x_{i,j}\}_{j=1}^{n_i}$ as well as their labels $\{y_{i,j}\}_{j=1}^{n_i}$, where $x_{i,j} \in \mathbb{R}^d$ and $y_{i,j} \in \{-1, 1\}$ for classification problems. The learning function for task $T_i$ is defined as $f_i(x) = w_i^T x + b_i$. A regularized formulation to learn task relations, which can unify several representative models [14, 13, 18, 28, 36, 39, 29, 42, 37], is

$$\min_{W, b, \Omega \succeq 0}\ \sum_{i=1}^m \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(w_i^T x_{i,j} + b_i,\, y_{i,j}\big) + \frac{\lambda_1}{2} \mathrm{tr}\big(W \Omega^{-1} W^T\big) + \lambda_2 g(\Omega), \qquad (1)$$

where $W = (w_1, \ldots, w_m)$, $b = (b_1, \ldots, b_m)^T$, $l(\cdot, \cdot)$ denotes a loss function such as the cross-entropy loss, $\Omega \succeq 0$ means that $\Omega$ is positive semidefinite (PSD), $\mathrm{tr}(\cdot)$ denotes the trace of a square matrix, $\Omega^{-1}$ denotes the inverse or pseudoinverse of a square matrix, and $\lambda_1, \lambda_2$ are regularization hyperparameters controlling the trade-off among the three terms in problem (1). The first term in problem (1) measures the empirical loss. The second term is a regularizer on $W$ based on $\Omega$; according to [39], $\Omega$, the task covariance matrix, describes pairwise task relations. The function $g(\cdot)$ in problem (1) can be viewed as a regularizer on $\Omega$ that characterizes its structure. The survey [38] has shown that the models proposed in [14, 13, 18, 28, 36, 39, 29, 42, 37] can be formulated as problem (1) with different choices of $g(\cdot)$; the detailed connections between these works and problem (1) are given in the supplementary material for completeness.
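To make the unified formulation concrete, the following minimal numpy sketch (with our own illustrative names and data layout, not the authors' code) evaluates the objective of problem (1) for given $W$, $b$, and $\Omega$, taking the loss $l$ and the regularizer $g$ as callables.

```python
import numpy as np

def mtl_objective(W, b, Omega, tasks, loss, g, lam1, lam2):
    """Evaluate the objective of problem (1).

    W: d x m matrix of task parameters (column i is w_i); b: length-m
    bias vector; Omega: m x m PSD task covariance matrix; tasks: list
    of (X, y) pairs with X of shape (n_i, d); loss(pred, y) returns
    per-example losses; g(Omega) is the structure regularizer.
    """
    emp = 0.0
    for i, (X, y) in enumerate(tasks):
        preds = X @ W[:, i] + b[i]               # f_i(x) = w_i^T x + b_i
        emp += np.mean(loss(preds, y))           # (1/n_i) * sum_j l(., .)
    reg = 0.5 * lam1 * np.trace(W @ np.linalg.pinv(Omega) @ W.T)
    return emp + reg + lam2 * g(Omega)
```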
In the following, we propose two main extensions to enrich problem (1). First, in Theorem 1 we prove that Schatten norm regularization is an instance of problem (1). As a special case, the trace norm is widely used in multitask learning [28] as a regularizer to capture the low-rank structure of $W$. Here we generalize it to the Schatten $a$-norm, denoted by $|||\cdot|||_a$ for $a > 0$, where $|||\cdot|||_1$ is just the trace norm. To see the relation between Schatten norm regularization and problem (1), we prove the following theorem, with the proof in the supplementary material.

Theorem 1 When $g(\Omega) = \mathrm{tr}(\Omega^r)$ for any given positive scalar $r$, by defining $\hat{r} = 2r/(r+1)$ and $\lambda_r = (1 + 1/r)\big(\lambda_1^r \lambda_2 r / 2^r\big)^{1/(r+1)}$, problem (1) reduces to the following problem:

$$\min_{W, b}\ \sum_{i=1}^m \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(w_i^T x_{i,j} + b_i,\, y_{i,j}\big) + \lambda_r |||W|||_{\hat{r}}^{\hat{r}}. \qquad (2)$$

When $r = 1$, Theorem 1 implies that problem (1) with $g(\Omega) = \mathrm{tr}(\Omega)$ is equivalent to trace norm regularization. Even though $r$ can be any positive scalar, problem (2) corresponds to Schatten $\hat{r}$-norm regularization with $\hat{r} = \frac{2r}{r+1} < 2$, and $\hat{r} \geq 1$ when $r \geq 1$. Second, in the following theorem we prove that squared Schatten norm regularization is an instance of problem (1).

Theorem 2 By defining
$$g(\Omega) = \begin{cases} 0 & \text{if } \mathrm{tr}(\Omega^r) \leq 1 \\ +\infty & \text{otherwise}, \end{cases}$$
which is an extended real-valued function and corresponds to a constraint on $\Omega$, for any given positive scalar $r$ and $\hat{r} = \frac{2r}{r+1}$, problem (1) is equivalent to the following problem:
$$\min_{W, b}\ \sum_{i=1}^m \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(w_i^T x_{i,j} + b_i,\, y_{i,j}\big) + \lambda_1 |||W|||_{\hat{r}}^2.$$

The aforementioned multitask models with their different instantiations of $g(\cdot)$ are summarized in Table 1 in the supplementary material. Based on the above discussion, problem (1) embraces many, indeed infinitely many, multitask models, since $r$ in the (squared) Schatten norm regularization can take an infinite number of values.
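Since Theorems 1 and 2 reduce problem (1) to (squared) Schatten norm penalties, a quick numerical reminder may help: the Schatten $a$-norm is simply the $\ell_a$ norm of the singular values. A minimal sketch, purely illustrative:

```python
import numpy as np

def schatten_norm(W, a):
    """Schatten a-norm of W: the l_a norm of its singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s ** a) ** (1.0 / a)

W = np.random.randn(5, 3)
r = 1.0                              # g(Omega) = tr(Omega^r) with r = 1
r_hat = 2 * r / (r + 1)              # r_hat = 1, i.e., the trace norm
print(schatten_norm(W, r_hat))       # |||W|||_1 as in Theorem 1
print(schatten_norm(W, r_hat) ** 2)  # squared version as in Theorem 2
```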
Given a multitask problem and so many candidate models, the top priority is to decide which model to use. One solution is to try all possible models and pick the best one, but this is computationally heavy. In the following section, we give our solution: L2MT.

3 Learning to Multitask

In this section, we present the proposed L2MT framework and its associated solution.

3.1 The Framework

Recall that the aim of the proposed L2MT framework, shown in Figure 1, is to determine a suitable multitask model for a test multitask problem by exploiting historical multitask experience. To achieve this, as a representation of historical multitask experience, the training set of the L2MT framework consists of $q$ tuples $\{(S_i, M_i, o_i)\}_{i=1}^q$. $\mathcal{S}$ denotes the space of multitask problems and $S_i \in \mathcal{S}$ denotes a multitask problem. Each multitask problem $S_i$ consists of $m_i$ learning tasks, each of which is associated with a training dataset, a validation dataset, and a test dataset. As we will see later, the $j$th task in $S_i$ is represented as a task embedding $e^i_j \in \mathbb{R}^{\hat{d}}$ by applying the proposed LGNN model to its training dataset, and by aggregating the task embeddings of all the tasks, the task embedding matrix $E_i = (e^i_1, \ldots, e^i_{m_i})$ is treated as the representation of the multitask problem $S_i$. $\mathcal{M}$ denotes the space of multitask models, and $M_i \in \mathcal{M}$ denotes a specific multitask model trained on the training datasets in $S_i$. $M_i$ can be a discrete index over candidate multitask models or a continuous representation based on model parameters. In the sequel, based on the unified formulation presented in the previous section, $M_i$ is represented by the task covariance matrix $\Omega_i$, and hence $\mathcal{M}$ is continuous. One reason to choose the task covariance matrix as the representation of a multitask model is that it is core to problem (1): once it has been determined, the model parameters $W$ and $b$ can easily be obtained. One benefit of choosing a continuous $\mathcal{M}$ is that we can learn a new model outside the set of candidate models.

$o_i \in \mathbb{R}$ denotes the relative test error $\epsilon_{MTL}/\epsilon_{STL}$, where $\epsilon_{MTL}$ denotes the average test error of the multitask model $M_i$ on the test datasets of the tasks in $S_i$ and $\epsilon_{STL}$ denotes the average test error of a single-task learning (STL) model trained on each task independently. Hence, the training process of the L2MT framework learns an estimation function $f(\cdot, \cdot)$ that maps $\{(S_i, M_i)\}_{i=1}^q$, or concretely $\{(E_i, \Omega_i)\}_{i=1}^q$, to $\{\upsilon(o_i)\}_{i=1}^q$, where the link function $\upsilon(\cdot)$ transforms $o_i$ to make the estimation easier and will be introduced later. Moreover, based on problem (1), we can see that $\Omega$ is a function of the hyperparameters $\lambda_1$ and $\lambda_2$, and so is the relative test error. Here we make the assumption that $\Omega$ is sufficient to estimate the relative test error. This assumption works very well empirically, and it simplifies the design of the estimation function. Moreover, under this assumption, we do not need to find the best hyperparameters for each training tuple, which saves a lot of computational cost.

In the test process, suppose that we are given a test multitask problem $\tilde{S}$ which is not in the training set. Each task in $\tilde{S}$ also has a training dataset, a validation dataset, and a test dataset. To obtain a relative test error $o$ as low as possible, we minimize $\gamma_1 f(\tilde{E}, \tilde{\Omega})$ with respect to $\tilde{\Omega}$ to find the optimal task covariance matrix $\tilde{\Omega}$, where $\tilde{E}$ denotes the task embedding matrix of the test multitask problem and $\gamma_1$ is a parameter in the link function $\upsilon(\cdot)$ that controls its monotonicity. Then, by plugging $\tilde{\Omega}$ into problem (1) without manually specifying $g(\cdot)$, we can learn the optimal $W$ and $b$, which are used to make predictions on the test datasets.

There are several learning paradigms related to the L2MT framework, including multitask learning, transfer learning [27], lifelong learning [12], and learning to transfer [33]. However, there exist significant differences between the L2MT framework and these related paradigms. In multitask learning, the training set contains only one multitask problem, i.e., $S_1$, and the goal is to learn model parameters in a given multitask model. The difference between transfer learning and L2MT is similar to that between multitask learning and L2MT. Lifelong learning can be viewed as online transfer/multitask learning and hence is different from L2MT. Different from learning to transfer, which targets transfer learning and relies on handcrafted task features, the proposed L2MT is end-to-end, based on neural networks, and targets multitask learning.

Figure 1: An illustration of the L2MT framework consisting of two stages. The training process learns the estimation function $f(\cdot, \cdot)$ to approximate the relative test error based on multitask problems and multitask models, and the test process learns the task covariance matrix by minimizing the relative test error, or approximately $\gamma_1 f(\tilde{E}, \tilde{\Omega})$, with respect to $\tilde{\Omega}$. $D^i_j$ denotes the training dataset of the $j$th task in the $i$th multitask problem $S_i$, and $\tilde{D}_i$ denotes the training dataset of the $i$th task in the test multitask problem $\tilde{S}$. The LGNN, which receives a training dataset as input and is learned in the training process, is shared by all the tasks in the training and test multitask problems; multiple copies are drawn only for clarity of presentation.
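Assembling one training tuple is mechanical once a multitask model has been trained and evaluated. The sketch below (a hypothetical helper of ours, not part of the paper) computes the relative test error and normalizes the task covariance matrix to unit trace, matching the preprocessing described in Section 3.3.

```python
import numpy as np

def make_tuple(E, Omega, mtl_errors, stl_errors):
    """Build one training tuple (E_i, Omega_i, o_i).

    E: d_hat x m_i task embedding matrix from the LGNN; Omega: m_i x m_i
    task covariance matrix of the trained multitask model; mtl_errors /
    stl_errors: per-task test errors of the MTL and STL models.
    """
    Omega = Omega / np.trace(Omega)                  # enforce tr(Omega) = 1
    o = np.mean(mtl_errors) / np.mean(stl_errors)    # relative test error
    return E, Omega, o
```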
3.2 Task Embedding

In order to learn the estimation function in the training process, the first thing we need to do is determine the representation of the multitask problems $\{S_i\}$. Each multitask problem is associated with multiple training datasets, each of which corresponds to a task. So we can reduce representing a multitask problem to representing the training dataset of each task in it, which we call the task embedding. In the following, we propose a method to compute the task embedding based on neural networks with powerful capacities. For ease of presentation, the training dataset of a task in a multitask problem consists of $n$ data-label pairs $\{(x_j, y_j)\}_{j=1}^n$, omitting the task index, where $x_j$ is assumed to have a vectorized representation. Due to the varying nature of the training datasets of different tasks (e.g., their sizes and the relations among their data points), it is difficult to use conventional neural networks such as convolutional or recurrent neural networks to represent a dataset. A dataset, however, can usually be represented as a graph in which each vertex corresponds to a data point and an edge between vertices encodes the relation between the corresponding data points. Based on this graph representation, we propose the LGNN to obtain the task embedding. Specifically, the input to the LGNN is a data matrix $X = (x_1, \ldots, x_n)$. Using ReLU as the activation function, the output of the first hidden layer in the LGNN is

$$H_1 = \mathrm{ReLU}(L_1^T X + \beta_1 \mathbf{1}^T), \qquad (3)$$

where $\hat{d}$ denotes the dimension of the hidden representations, $L_1 \in \mathbb{R}^{d \times \hat{d}}$ and $\beta_1 \in \mathbb{R}^{\hat{d}}$ denote the transformation matrix and bias, and $\mathbf{1}$ denotes a vector or matrix of all ones whose size depends on the context. According to Eq. (3), $H_1$ contains the hidden representations of all the training data points in the task. With an adjacency matrix $G \in \mathbb{R}^{n \times n}$ modeling the relations between each pair of training data points, the output of the $i$th hidden layer ($2 \leq i \leq s$) in the LGNN is defined as

$$H_i = \mathrm{ReLU}(L_i^T X + H_{i-1} G + \beta_i \mathbf{1}^T), \qquad (4)$$

where $L_i \in \mathbb{R}^{d \times \hat{d}}$ and $\beta_i \in \mathbb{R}^{\hat{d}}$ are the transformation matrix and bias, and $s$ denotes the total number of hidden layers. According to Eq. (4), the hidden representations of all the data points at the $i$th layer (i.e., $H_i$) rely on those at the previous layer (i.e., $H_{i-1}$), and if $x_i$ and $x_j$ are correlated according to $G$ (i.e., $g_{ij} \neq 0$), their hidden representations are correlated. The term $L_i^T X$ in Eq. (4) not only preserves the comprehensive information encoded in the original representation but also alleviates the vanishing gradient issue by acting as a skip connection, as in the highway network [32], when $s$ is large. The task embedding of the task is then the average of the last hidden layer $H_s$ over all data points, i.e., $e = H_s \mathbf{1}/n$. One advantage of using the mean here is that it handles datasets of varying sizes. In the LGNN, $\{L_i\}$ and $\{\beta_i\}$ are learnable parameters trained with the objective function presented in the next section.

The graph $G$ plays an important role in the LGNN. Here we use the label information in the training dataset to construct it. For example, when each learning task is a classification problem, $g_{ij}$, the $(i,j)$th entry of $G$, is set to 1 when $y_i = y_j$; to $-1$ when $y_i \neq y_j$ and $x_i$ is one of the $k$ nearest neighbors of $x_j$ or $x_j$ is one of the $k$ nearest neighbors of $x_i$; and to 0 otherwise. By the definition of $g_{ij}$ and Eq. (4), when two data points are in the same class, their hidden representations have positive effects on each other; when two data points are in different classes and are nearby (i.e., in each other's neighborhood), their hidden representations have negative effects on each other.
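To ground Eqs. (3) and (4) and the graph construction rule above, here is a minimal numpy forward pass of the LGNN under our own naming conventions; it is a sketch of the described computation, not the released implementation.

```python
import numpy as np

def build_graph(X, y, k):
    """g_ij = 1 if y_i = y_j; -1 if y_i != y_j and i, j are k-nearest
    neighbors of each other (either direction); 0 otherwise."""
    n = X.shape[1]
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # pairwise distances
    nn = np.argsort(D, axis=1)[:, 1:k + 1]    # k nearest neighbors, skipping self
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                       # self-loops omitted in this sketch
            if y[i] == y[j]:
                G[i, j] = 1.0
            elif j in nn[i] or i in nn[j]:
                G[i, j] = -1.0
    return G

def lgnn_embed(X, G, Ls, betas):
    """LGNN forward pass (Eqs. (3)-(4)); returns e = H_s 1 / n.

    X: d x n data matrix; Ls: list of s matrices of shape (d, d_hat);
    betas: list of s vectors of length d_hat."""
    relu = lambda z: np.maximum(z, 0.0)
    n = X.shape[1]
    H = relu(Ls[0].T @ X + betas[0][:, None])          # Eq. (3)
    for L, beta in zip(Ls[1:], betas[1:]):
        H = relu(L.T @ X + H @ G + beta[:, None])      # Eq. (4)
    return H @ np.ones(n) / n                          # task embedding
```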
The original graph neural network [30] needs to solve for the fixed point of a recursive equation, which restricts the functional form of the activation function. Graph convolutional neural networks [8, 26, 4] focus on how to select neighboring data points for the convolution operation, while the LGNN aggregates all the neighborhood information in a layerwise manner. Given a multitask problem consisting of $m$ tasks, we construct one LGNN with parameters shared across all the tasks. The task embedding matrix $E = (e_1, \ldots, e_m)$, where $e_i$ denotes the task embedding of the $i$th task, is then treated as the representation of the entire multitask problem. In the next section, we show how to learn the estimation function based on this representation.

3.3 Training Process

Recall that the training set in L2MT contains $q$ tuples $\{(S_i, M_i, o_i)\}_{i=1}^q$. Applying the LGNN of the previous section, we represent $S_i$ with its $m_i$ tasks as a task embedding matrix $E_i \in \mathbb{R}^{\hat{d} \times m_i}$. Based on the unified formulation in Section 2, $M_i$ is represented by the task covariance matrix $\Omega_i \in \mathbb{R}^{m_i \times m_i}$. In the training process, we aim to learn an estimation function mapping both the task embedding matrix and the task covariance matrix to the relative test error, i.e., $f(E_i, \Omega_i) \approx \upsilon(o_i)$ for $i = 1, \ldots, q$, where $\upsilon(\cdot)$ is a link function that transforms the target. Considering the difficulty of designing a numerically stable $f(\cdot, \cdot)$ that fits all the positive $o_i$'s, we introduce the link function $\upsilon(\cdot)$, which transforms the $o_i$'s into real scalars that may be positive or negative. Different $\Omega_i$'s may have different scales, as they are produced by different multitask models with different $g(\cdot)$'s. To make their scales comparable, we impose the restriction that $\mathrm{tr}(\Omega_i)$ equals 1; if some $\Omega_i$ does not satisfy this requirement, we simply preprocess it as $\Omega_i / \mathrm{tr}(\Omega_i)$. Note that different $E_i$'s can have different sizes since $m_i$ is not fixed. Taking this into consideration, we design an estimation function whose number of parameters is independent of $m_i$:

$$f(E_i, \Omega_i) = \alpha_1 \mathrm{tr}(E_i^T E_i \Omega_i) + \alpha_2 \mathrm{tr}(K_i \Omega_i) + \alpha_4 \mathrm{tr}(\Omega_i^2), \qquad (5)$$

where $e^i_j$ is the $j$th column of $E_i$, $K_i$ is an $m_i \times m_i$ matrix whose $(j,k)$th entry equals $\exp\{-\alpha_3 \|e^i_j - e^i_k\|_2^2\}$, and $\alpha = (\alpha_1, \alpha_2, \alpha_3, \alpha_4)$ contains the four real parameters of the estimation function to be optimized. On the right-hand side of Eq. (5), $E_i^T E_i$ and $K_i$ are the linear and RBF kernel matrices, respectively, defining task similarities based on task embeddings. The first two terms in $f(\cdot, \cdot)$ define the consistency between the kernel matrices and $\Omega_i$, with $\alpha_1$ and $\alpha_2$ controlling the positive/negative magnitudes used to estimate $o_i$. The resulting kernel matrices, which have the same size as $\Omega_i$, are also the key to making the estimation function accommodate $\Omega_i$'s of different sizes. The link function takes the form $\upsilon(o) = \tanh(\gamma_1 o + \gamma_2)$, where $\tanh(\cdot)$ denotes the hyperbolic tangent function, which transforms a positive $o$ into the range $(-1, 1)$, and $\gamma = (\gamma_1, \gamma_2)$ contains two learnable parameters. The objective function of the training process is formulated as

$$\min_{\Theta}\ \sum_{i=1}^q \big|f(E_i, \Omega_i) - \upsilon(o_i)\big| + \lambda \sum_{i=1}^s \|L_i\|_F^2, \qquad (6)$$

where $\Theta = \{\{L_i\}, \{\beta_i\}, \alpha, \gamma\}$ denotes the set of parameters to be optimized. Here we use the absolute loss because it is robust to outliers. Problem (6) shows that the proposed method is end-to-end, from the training datasets of a multitask problem to its relative test error.
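Eq. (5) and the link function translate directly into numpy. The sketch below (our own names; the actual training uses Adam in TensorFlow, as noted next) evaluates one absolute-loss term of problem (6) for a single tuple.

```python
import numpy as np

def estimate(E, Omega, alpha):
    """Estimation function f(E_i, Omega_i) of Eq. (5)."""
    a1, a2, a3, a4 = alpha
    lin = E.T @ E                                  # linear kernel E^T E
    sq = np.sum(E * E, axis=0)
    K = np.exp(-a3 * (sq[:, None] + sq[None, :] - 2 * lin))  # RBF kernel
    return (a1 * np.trace(lin @ Omega) + a2 * np.trace(K @ Omega)
            + a4 * np.trace(Omega @ Omega))

def link(o, gamma):
    """Link function v(o) = tanh(gamma_1 * o + gamma_2)."""
    return np.tanh(gamma[0] * o + gamma[1])

def tuple_loss(E, Omega, o, alpha, gamma):
    """One absolute-loss term of problem (6)."""
    return abs(estimate(E, Omega, alpha) - link(o, gamma))
```

In the full objective these terms are summed over all tuples, and the LGNN parameters $\{L_i\}, \{\beta_i\}$ are trained jointly through the embedding matrices $E_i$.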
We optimize problem (6) with the Adam optimizer in the TensorFlow package. In each batch, we randomly choose a tuple (say, the $k$th tuple) and optimize problem (6) with the first term replaced by $|f(E_k, \Omega_k) - \upsilon(o_k)|$ as an approximation. The left part of Figure 1 illustrates the training process.

3.4 Test Process

In the test process, suppose that we are given a new test multitask problem $\tilde{S}$ consisting of $\tilde{m}$ tasks, each of which is associated with a training dataset, a validation dataset, and a test dataset. The goal here is to learn the optimal $\tilde{\Omega}$ automatically via the estimation function and the training datasets, without manually specifying the form of $g(\cdot)$ in problem (1). With $\tilde{\Omega}$ plugged in, the validation datasets of all the tasks can be used to identify the regularization hyperparameter $\lambda_1$ in problem (1), and the test datasets are used to evaluate the performance of L2MT as usual. For the training datasets of the $\tilde{m}$ tasks, we first apply the LGNN learned in the training process to obtain their task embedding matrix $\tilde{E} \in \mathbb{R}^{\hat{d} \times \tilde{m}}$. Here the task covariance matrix is unknown, and what we need to do is estimate it by minimizing the relative test error, which, however, is difficult to measure based on the training datasets. Recall that the estimation function approximates the relative test error transformed by the link function, so we optimize the estimation function instead. Due to the monotonically increasing property of the hyperbolic tangent in the link function $\upsilon(\cdot)$, minimizing the relative test error can be approximated by minimizing/maximizing the estimation function when $\gamma_1$ is positive/negative (we do not consider the trivial case $\gamma_1 = 0$, where the estimation function approximates a constant). This leads to the minimization of $\gamma_1 f(\tilde{E}, \tilde{\Omega})$ with respect to $\tilde{\Omega}$, which based on Eq. (5) can be simplified as

$$\min_{\tilde{\Omega}}\ \rho\, \mathrm{tr}(\tilde{\Omega}^2) + \mathrm{tr}(\Phi \tilde{\Omega}) \quad \text{s.t.} \quad \tilde{\Omega} \succeq 0,\ \mathrm{tr}(\tilde{\Omega}) = 1, \qquad (7)$$

where $\rho = \gamma_1 \alpha_4$, $\tilde{e}_i$ denotes the $i$th column of $\tilde{E}$, $\tilde{K}$ is an $\tilde{m} \times \tilde{m}$ matrix whose $(i,j)$th entry equals $\exp\{-\alpha_3 \|\tilde{e}_i - \tilde{e}_j\|_2^2\}$, and $\Phi = \gamma_1(\alpha_1 \tilde{E}^T \tilde{E} + \alpha_2 \tilde{K})$. The constraints in problem (7) come from the requirement that the trace of the PSD task covariance matrix equals 1, matching the preprocessing in the training process. It is easy to see that problem (7) is convex when $\rho \geq 0$ and non-convex otherwise. Even though the convexity of problem (7) varies with $\rho$, we can always find efficient solutions, as summarized in the following theorem.

Theorem 3 Define the eigendecomposition of $\Phi$ as $\Phi = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\kappa)$ denotes the diagonal eigenvalue matrix with $\kappa = (\kappa_1, \ldots, \kappa_{\tilde{m}})^T$ ($\kappa_1 \geq \ldots \geq \kappa_{\tilde{m}}$), $U = (u_1, \ldots, u_{\tilde{m}})$ denotes the eigenvector matrix, and the multiplicity of $\kappa_{\tilde{m}}$ is assumed to be $t$ ($t \geq 1$). When $\rho = 0$, the optimal solution $\tilde{\Omega}$ of problem (7) lies in the convex hull of $u_{\tilde{m}-t+1} u_{\tilde{m}-t+1}^T, \ldots, u_{\tilde{m}} u_{\tilde{m}}^T$. When $\rho < 0$, the optimal solutions of problem (7) lie in the set $\{u_{\tilde{m}-t+1} u_{\tilde{m}-t+1}^T, \ldots, u_{\tilde{m}} u_{\tilde{m}}^T\}$. When $\rho > 0$, the optimal solution is $\tilde{\Omega} = U \mathrm{diag}(\mu) U^T$, where $\mu$ is the solution of the following problem:

$$\min_{\mu}\ \rho \|\mu\|_2^2 + \mu^T \kappa \quad \text{s.t.} \quad \mu \geq 0,\ \mu^T \mathbf{1} = 1. \qquad (8)$$

According to Theorem 3, we need to solve problem (8) when $\rho > 0$. Based on the Lagrange multiplier method, we design an efficient algorithm with $O(\tilde{m})$ complexity in the supplementary material. After learning $\tilde{\Omega}$ according to Theorem 3, we plug it into problem (1) and learn the optimal $W$ and $b$ for the $\tilde{m}$ tasks of the test multitask problem. The right part of Figure 1 illustrates the test process.
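As a concrete companion to Theorem 3, the following sketch solves problem (7): for $\rho > 0$ it solves problem (8) by Euclidean projection onto the simplex (our implementation choice; the paper's supplementary material gives an $O(\tilde{m})$ Lagrangian routine), and for $\rho \leq 0$ it returns one optimal rank-one solution.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[idx] - 1.0) / (idx + 1)
    return np.maximum(v - theta, 0.0)

def solve_omega(Phi, rho):
    """Solve problem (7) following Theorem 3."""
    kappa, U = np.linalg.eigh(Phi)        # eigenvalues in ascending order
    if rho > 0:
        # problem (8): min rho*||mu||^2 + mu^T kappa over the simplex,
        # i.e., project -kappa/(2*rho) onto the simplex
        mu = project_simplex(-kappa / (2.0 * rho))
        return U @ np.diag(mu) @ U.T
    # rho <= 0: an eigenvector of the smallest eigenvalue is optimal
    u = U[:, 0]
    return np.outer(u, u)
```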
3.5 Analysis

The training process of L2MT induces a new learning problem in which each multitask problem is used to predict the relative test error, and the task embedding matrix serves as meta-features describing the multitask problem. In this section, we study the generalization bound for this learning problem. For ease of presentation, we assume each multitask problem contains the same number of tasks. Following [7], tasks originate in a common environment $\eta$, which is by definition a probability measure on the space of learning tasks. In L2MT the absolute loss is used; here we generalize it to a loss function $l : \mathbb{R} \times \mathbb{R} \to [0, 1]$ that is assumed to be 1-Lipschitz in its first argument. In the test process, the task covariance matrix is a function of the task embedding matrix; inspired by this, we assume that there is some function representing the task covariance matrix in terms of the task embedding matrix. Under this assumption, the estimation function can be written as $\tilde{f}(E) \equiv f(E, \Omega)$. The expected loss is then defined as $\mathcal{E} = \mathbb{E}[l(\tilde{f}(E), \upsilon(o))]$, where the expectation $\mathbb{E}$ is over the space of multitask problems and relative test errors, and $E$ denotes the task embedding matrix induced by the corresponding multitask problem. The training loss is defined as $\hat{\mathcal{E}} = \frac{1}{q} \sum_{i=1}^q l(\tilde{f}(E_i), \upsilon(o_i))$. Based on the Gaussian average [6, 22], we can bound $\mathcal{E}$ in terms of $\hat{\mathcal{E}}$ as follows.

Theorem 4 Let $\mathcal{F}$ be a real-valued function class on the space of task embeddings whose members take values in $[0, 1]$, and let $\mathcal{H}$ denote the space of transformation functions in the LGNN. With probability greater than $1 - \delta$, for any $\tilde{f} \in \mathcal{F}$ and any $h \in \mathcal{H}$, we have

$$\mathcal{E} \leq \hat{\mathcal{E}} + \frac{c_1 L\, G(\{E_i\})}{q} + \frac{c_2 Q \sup_{h \in \mathcal{H}} \|E\|_F}{q} + \sqrt{\frac{9 \ln(2/\delta)}{2q}},$$

where $c_1, c_2$ are universal constants, the functions in $\mathcal{F}$ are assumed to have Lipschitz constant at most $L$, $G(Y) = \mathbb{E} \sup_{y \in Y} \langle \sigma, y \rangle$ denotes the Gaussian average with $\sigma$ a generic vector or matrix of independent standard normal variables, $\min_E G(\mathcal{F}(E))$ is assumed to be 0 following [23], $\|\cdot\|_F$ denotes the Frobenius norm, and

$$Q = \sup_{E \neq E'} \mathbb{E} \sup_{\tilde{f} \in \mathcal{F}} \frac{\langle \sigma, \tilde{f}(E) - \tilde{f}(E') \rangle}{\|E - E'\|_F}.$$

According to Theorem 4, the expected loss is upper-bounded by the sum of the training loss, model-complexity terms based on the task embedding matrices, and a confidence term with convergence rate $O(q^{-1/2})$. The Gaussian average over the task embedding matrices induced by the LGNN can be estimated via the chain rule [22].

4 Experiments

Four datasets are used in the experiments: MIT-Indoor-Scene, Caltech256, 20newsgroup, and RCV1. The MIT-Indoor-Scene and Caltech256 datasets are for image classification, while the 20newsgroup and RCV1 datasets are for text classification. For the two image datasets, we use the FC8 layer of the VGG-19 network [31], pretrained on the ImageNet dataset, as the feature extractor. The two text datasets use bag-of-words representations and hence lie in high-dimensional spaces. To reduce the resulting heavy computational cost, we preprocess these two datasets to reduce the dimension to 1,000 following [34], which utilizes ridge regression to select important features. The RCV1 dataset is highly imbalanced, as the number of data points per class varies from 5 to 130,426. To reduce the effect of imbalanced classes on multitask learning, we keep the categories whose numbers of data samples are between 400 and 5,000.
Based on each of the aforementioned datasets, we construct the training set for L2MT in two steps. 1) We first construct a multitask problem in which each task is a binary classification task. The number of tasks in a multitask problem is uniformly distributed between 4 and 8, since the number of tasks in real applications is limited. For a multitask problem with $m$ tasks, we randomly sample $m$ pairs of classes along with their data, where each task is to distinguish between one pair of classes. 2) We sample $q$ multitask problems to constitute the training set for L2MT. The test set for L2MT is obtained similarly, and its construction is disjoint from that of the training set. Baseline methods in the comparison consist of a single-task learner (STL), which is trained on each task independently with the cross-entropy loss, and all the instantiation models of problem (1), including regularized multitask learning (RMTL) [14], Schatten norm regularization with $r = 1$ (SNR1), which is trace norm regularization [28], Schatten norm regularization with $r = 2$ (SNR2), which is equivalent to Schatten $\frac{4}{3}$-norm regularization by Theorem 1, the MTRL method [39, 42], squared Schatten norm regularization with $r = 2$ (SSNR2), which is equivalent to squared Schatten $\frac{4}{3}$-norm regularization by Theorem 2, clustered multitask learning (CMTL) [18], multitask learning with graphical Lasso (glMTL) [36, 29], asymmetric multitask learning (AMTL) [21], and SPATS [37]. So in total there are 10 baseline methods. Moreover, to ensure fairness of the comparison, we also allow each baseline method to access all the training datasets of all the training multitask problems, in addition to the training datasets of the test multitask problem at hand. Consequently, we report the better of the two performances of each baseline method: learning on the test multitask problem only, or learning on all the training and test multitask problems. Since collecting a training set with many multitask problems takes much time, in the experiments we only collect 100 multitask problems for training, where 30% of the data in each task form the training dataset. For better training on such a training set and to control the model complexity, the $L_i$'s and $\beta_i$'s ($2 \leq i \leq s$) are constrained to be identical in Eq. (4), i.e., $L_2 = \ldots = L_s$ and $\beta_2 = \ldots = \beta_s$. There are 50 test multitask problems in the test set. Each entry of $\{L_i\}$ is initialized from a normal distribution with zero mean and variance 1/100, and the biases $\{\beta_i\}$ are initialized to zero. $\alpha$ in the estimation function is initialized to $[1, 1, 1, 0.1]^T$ and $\gamma$ in the link function is initialized to $[1, 0]^T$. The learning rate decays linearly from 0.01 with respect to the number of epochs.

Figure 2: Results of different models on the four datasets when varying the size of the training data (panels: (a) MIT-Indoor-Scene, (b) Caltech256, (c) 20newsgroup, (d) RCV1).

To investigate the effect of the size of the training dataset on performance, we vary the size of the training data from 30% to 50% at intervals of 10%, with the validation proportion fixed at 30% in the test process, and plot the average relative test errors of the different methods over STL in Figure 2, where $\bar{\epsilon}_{MTL} = \frac{1}{q} \sum_{i=1}^q \frac{1}{m_i} \sum_{j=1}^{m_i} \epsilon_{MTL}^{i,j}$ denotes the average test error of a multitask model over all the tasks in all the test multitask problems, $\bar{\epsilon}_{STL}$ is defined similarly for STL, and the average relative test error is defined as $\bar{\epsilon}_{MTL}/\bar{\epsilon}_{STL}$.
All the relative test errors of STL equal 1, and the performance of RMTL is not very good because its assumption that all tasks are equally similar to each other is usually violated in real applications; hence we omit these two methods from Figure 2 for clearer presentation. According to Figure 2, some multitask models perform worse than STL, with relative test errors larger than 1, which can be explained by a mismatch between the data and the model assumptions imposed on the task covariance. By learning the task covariance directly from data without explicit assumptions, the proposed L2MT performs better than all the baseline methods under all the settings, which demonstrates its effectiveness.

Figure 3: Sensitivity analysis of L2MT on the 20newsgroup dataset when using 30% of the data for training.

In Figure 3, we conduct a sensitivity analysis on the 20newsgroup dataset with respect to the hyperparameters of L2MT, including the number of layers $s$, the regularization hyperparameter $\lambda$, the latent dimension $\hat{d}$, and the number of neighbors $k$ in the LGNN, to see their effects on performance. According to Figure 3(a), the performance with $s$ equal to 2 or 3 is better than that with $s = 1$, which demonstrates the usefulness of the graph information used in the LGNN to learn the task embeddings. Yet the performance degrades when $s$ increases further; one reason is that L2MT is likely to overfit given a limited number of training multitask problems. As implied by Figures 3(b) and 3(d), when $\lambda$ is in $[0.01, 0.5]$ and $k$ is in $[5, 10]$, the performance is not sensitive, making these choices easy; hence in the experiments we always set $\lambda$ to 0.1 and $k$ to 6. According to Figure 3(c), when $\hat{d}$ is not very large, the performance is better than with a larger $\hat{d}$, where overfitting is likely to occur. Based on this observation, $\hat{d}$ is set to 50. Moreover, in Figure 3(e) we test the performance of L2MT by varying $q$, the number of training multitask problems, and we can see that the test error of L2MT decreases as $q$ increases, which matches the generalization bound in Theorem 4.

In the previous experiments, the VGG-19 network is used as the feature extractor. It is worth noting that L2MT can even be used to update the VGG-19 network. On the Caltech256 and MIT-Indoor-Scene datasets, we use problem (6) as the objective function to fine-tune the parameters of the fully connected layers of the VGG-19 network. After fine-tuning, the average test errors of L2MT are reduced by about 5% compared to L2MT without fine-tuning, which demonstrates the effectiveness of L2MT not only at improving the performance on multitask problems but also at learning good features.

We also study other formulations for the estimation and link functions. For example, another choice for the estimation function is $f(E, \Omega) = \alpha_1 \mathrm{tr}(\mathrm{ReLU}(\hat{E}^T \hat{E}) \Omega) + \alpha_2 \mathrm{tr}(\Omega^2)$, where $\hat{E} = LE + \beta \mathbf{1}^T$ with parameters $L$ and $\beta$, and another choice for the link function is $\upsilon(o) = \ln(\exp\{\gamma_1\} o + \exp\{\gamma_2\})$, where $\ln(\cdot)$ denotes the natural logarithm. Compared with the estimation and link functions proposed in Section 3.3, these new functions lead to slightly worse performance (about a 2% relative increase in the test error), which demonstrates the effectiveness of the proposed functions over the newly defined ones.

To assess the quality of the task covariance matrices learned by different models, we conduct a case study on a multitask problem consisting of three tasks constructed from the Caltech256 dataset.
The first task is to classify between the classes "Bat" and "Clutter", the second is to distinguish between the classes "Bear" and "Clutter", and the last is to classify between the classes "Dog" and "Clutter". The task correlation matrices learned by SNR1, MTRL, and L2MT, which can be computed from the task covariance matrices, are

$$\begin{pmatrix} 1.0000 & 0.0028 & 0.0789 \\ 0.0028 & 1.0000 & 0.0633 \\ 0.0789 & 0.0633 & 1.0000 \end{pmatrix}, \quad \begin{pmatrix} 1.0000 & 0.0067 & 0.0553 \\ 0.0067 & 1.0000 & 0.0473 \\ 0.0553 & 0.0473 & 1.0000 \end{pmatrix}, \quad \begin{pmatrix} 1.0000 & 0.0057 & 0.0052 \\ 0.0057 & 1.0000 & -0.9782 \\ 0.0052 & -0.9782 & 1.0000 \end{pmatrix},$$

respectively. From the three task correlation matrices, we can see that the correlations between the first and second tasks are close to 0 in all three models, which matches the intuition that bats and bears are almost irrelevant to each other as they belong to different species. The same observation holds for the first and third tasks. The difference among the three methods lies in the correlation between the second and third tasks. Specifically, in SNR1 and MTRL those correlations are close to 0, indicating that these two tasks are nearly uncorrelated, and hence the knowledge shared among the three tasks is very limited for SNR1 and MTRL. On the contrary, in L2MT the second and third tasks have a highly negative correlation, and hence there is strong knowledge leverage between these two tasks, which may be one reason why L2MT outperforms SNR1 and MTRL.

5 Conclusions

In this paper, we propose L2MT to identify a good multitask model for a multitask problem based on historical multitask problems. To achieve this, we propose an end-to-end procedure that employs the LGNN to learn task embedding matrices for multitask problems and then uses an estimation function to approximate the relative test error. In the test process, given a new multitask problem, minimizing the estimation function leads to the identification of the task covariance matrix. As revealed in the survey [38], there is another representative formulation for the feature-based approach [2, 3, 11] in multitask learning. In future research, we will extend the proposed L2MT method to learn good feature covariances for multitask problems based on this formulation. Moreover, the proposed L2MT method can be extended to meta learning, where the LGNN can be used to learn hidden representations for datasets.

Acknowledgments

This research has been supported by NSFC 61673202, National Grant Fundamental Research (973 Program) of China under Project 2014CB340304, and Hong Kong CERG projects 16211214/16209715/16244616.

References

[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48, 2006.
[3] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 20, pages 25–32, 2007.
[4] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems 29, pages 1993–2001, 2016.
[5] O. Banerjee, L. E. Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the Twenty-Third International Conference on Machine Learning, pages 89–96, 2006.
[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results.
Journal of Machine Learning Research, 3:463–482, 2002.
[7] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
[8] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.
[9] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[10] J. Chen, J. Liu, and J. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1179–1188, 2010.
[11] J. Chen, L. Tang, J. Liu, and J. Ye. A convex formulation for learning shared structures from multiple tasks. In Proceedings of the 26th International Conference on Machine Learning, pages 137–144, 2009.
[12] Z. Chen and B. Liu. Lifelong Machine Learning. Morgan & Claypool, 2016.
[13] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
[14] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.
[15] L. Han and Y. Zhang. Learning multi-level task groups in multi-task learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.
[16] L. Han and Y. Zhang. Learning tree structure in multi-task learning. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2015.
[17] L. Han and Y. Zhang. Multi-stage multi-task learning with reduced rank. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.
[18] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems 21, pages 745–752, 2008.
[19] A. Jalali, P. D. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems 23, pages 964–972, 2010.
[20] A. Kumar and H. Daumé III. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[21] G. Lee, E. Yang, and S. J. Hwang. Asymmetric multi-task learning based on task relatedness and loss. In Proceedings of the 33rd International Conference on Machine Learning, pages 230–238, 2016.
[22] A. Maurer. A chain rule for the expected suprema of Gaussian processes. In Proceedings of the 25th International Conference on Algorithmic Learning Theory, pages 245–259, 2014.
[23] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17:1–32, 2016.
[24] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
[25] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.
[26] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, pages 2014–2023, 2016.
[27] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[28] T. K. Pong, P. Tseng, S. Ji, and J. Ye.
Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2010.
[29] P. Rai, A. Kumar, and H. Daumé III. Simultaneously leveraging output and task structures for multiple-output regression. In Advances in Neural Information Processing Systems 25, pages 3185–3193, 2012.
[30] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
[33] Y. Wei, Y. Zhang, J. Huang, and Q. Yang. Transfer learning via learning to transfer. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[34] S. Yang, L. Yuan, Y.-C. Lai, X. Shen, P. Wonka, and J. Ye. Feature grouping and selection over an undirected graph. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2012.
[35] Y. Zhang. Heterogeneous-neighborhood-based multi-task local learning algorithms. In Advances in Neural Information Processing Systems 26, 2013.
[36] Y. Zhang and J. G. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems 23, pages 2550–2558, 2010.
[37] Y. Zhang and Q. Yang. Learning sparse task relations in multi-task learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.
[38] Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint, arXiv:1707.08114v2, 2017.
[39] Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 733–742, 2010.
[40] Y. Zhang and D.-Y. Yeung. Multi-task learning using generalized t process. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 964–971, 2010.
[41] Y. Zhang and D.-Y. Yeung. Learning high-order task relationships in multi-task learning. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013.
[42] Y. Zhang and D.-Y. Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data, 8(3):article 12, 2014.
[43] Y. Zhang, D.-Y. Yeung, and Q. Xu. Probabilistic multi-task feature selection. In Advances in Neural Information Processing Systems 23, 2010.
[44] A. Zweig and D. Weinshall. Hierarchical regularization cascade for joint learning. In Proceedings of the 30th International Conference on Machine Learning, pages 37–45, 2013.