# Deep Asymmetric Multi-task Feature Learning

Hae Beom Lee (1, 2), Eunho Yang (3, 2), Sung Ju Hwang (3, 2)

*Equal contribution. 1 UNIST, Ulsan, South Korea. 2 AItrics, Seoul, South Korea. 3 KAIST, Daejeon, South Korea. Correspondence to: Hae Beom Lee, Eunho Yang, Sung Ju Hwang.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

We propose Deep Asymmetric Multitask Feature Learning (Deep-AMTFL), which can learn deep representations shared across multiple tasks while effectively preventing the negative transfer that may happen in the feature sharing process. Specifically, we introduce an asymmetric autoencoder term that allows reliable predictors for the easy tasks to contribute strongly to the feature learning while suppressing the influence of unreliable predictors for more difficult tasks. This allows the learning of less noisy representations, and enables unreliable predictors to exploit knowledge from the reliable predictors via the shared latent features. Such asymmetric knowledge transfer through shared features is also more scalable and efficient than inter-task asymmetric transfer. We validate our Deep-AMTFL model on multiple benchmark datasets for multitask learning and image classification, on which it significantly outperforms existing symmetric and asymmetric multitask learning models by effectively preventing negative transfer in deep feature learning.

1. Introduction

Multi-task learning (Caruana, 1997) aims to improve the generalization performance of multiple task predictors by training them jointly, while allowing some form of knowledge transfer between them. One of the crucial challenges in multi-task learning is tackling the problem of negative transfer, which describes the situation where accurate predictors for easier tasks are negatively affected by inaccurate predictors for more difficult tasks. A recently introduced method, Asymmetric Multi-task Learning (AMTL) (Lee et al., 2016), proposes to solve this negative transfer problem by allowing asymmetric knowledge transfer between tasks through inter-task parameter regularization. Specifically, AMTL enforces the parameters of each task to also be represented as a sparse combination of the parameters of other tasks, which results in learning a directed graph that decides the amount of knowledge transfer between tasks.

However, such an inter-task transfer model based on parameter regularization is limited in several aspects. First of all, in most cases the tasks exhibit relatedness to a certain degree, but the model parameter for a task might not be reconstructible as a combination of the parameters of other tasks, because the tasks are only partly related. Consider the example in Fig. 1(b), where the task is to predict whether the given image contains any of the three animal classes. Here, the three animal classes are obviously related, as they share a common visual attribute, stripe. Yet, we will not be able to reconstruct the model parameter for class hyena by combining the model parameters for classes tiger and zebra, as there are other important attributes that define the hyena class, and the stripe is merely a single attribute among them that is also shared by other classes. Thus, it is more natural to presume that the related tasks leverage a common set of latent representations, rather than considering that a task parameter is generated from the parameters of a set of relevant tasks, as assumed in inter-task transfer models. Moreover, AMTL does not scale well with an increasing number of tasks, since the inter-task knowledge transfer graph grows quadratically, and thus it becomes both inefficient and prone to overfitting when there is a large number of tasks, such as in large-scale classification.
While sparsity can help reduce the number of parameters, it does not reduce the intrinsic complexity of the problem. Finally, the inter-task transfer models only store knowledge in the form of learned model parameters and their relationship graph. However, at times it might be beneficial to store what has been learned in the form of explicit representations, which can be used later for other tasks, such as in transfer learning.

Thus, we resort to the multi-task feature learning approach that aims to learn latent features, which is one of the most popular ways of sharing knowledge between tasks in the multi-task learning framework (Argyriou et al., 2008; Kumar & Daume III, 2012), and aim to prevent negative transfer under this scenario by enforcing asymmetric knowledge transfer. Specifically, we allow reliable task predictors to affect the learning of the shared features more, while downweighting the influence of unreliable task predictors, such that they contribute little or nothing to the feature learning. Figure 1 illustrates the concept of our model, which we refer to as asymmetric multi-task feature learning (AMTFL).

Figure 1. Concept. (a) Symmetric feature-sharing MTL: feature-sharing multi-task learning models such as GO-MTL suffer from negative transfer from unreliable predictors, which can result in learning noisy representations. (b) Asymmetric MTL: AMTL, an inter-task transfer asymmetric multi-task learning model that enforces the parameter of each task predictor to be generated as a linear combination of the parameters of relevant task predictors, may not make sense when the tasks are only partially related. (c) Asymmetric multi-task feature learning: our model enforces the learning of shared representations to be affected only by reliable predictors, and thus the learned features can transfer knowledge to unreliable predictors.
Another important advantage of AMTFL is that it naturally extends to feature learning in deep neural networks, in which case the top layer of the network contains an additional weight matrix for feedback connections, along with the original feed-forward connections, that allows asymmetric transfer from each task predictor to the bottom layer. This allows our model to leverage state-of-the-art deep neural network models and benefit from recent advances in deep learning. We extensively validate our method on a synthetic dataset as well as eight benchmark datasets for multi-task learning and image classification, using both shallow and deep neural network models, on which our models obtain superior performance over the existing symmetric feature-sharing multi-task learning model as well as the asymmetric multi-task learning model based on inter-task parameter regularization.

Our contributions are threefold:

1) We propose a novel multi-task learning model that prevents negative transfer by allowing asymmetric transfer between tasks through latent shared features, which is more natural when tasks are correlated but not in a cause-and-effect relationship, and is more scalable than the existing inter-task knowledge transfer model.
2) We extend our asymmetric multi-task learning model to the deep learning setting, where our model obtains even larger performance improvements over the base network and linear multi-task models.
3) We leverage the learned deep features for knowledge transfer in a transfer learning scenario, which demonstrates that our model learns more useful features than the base deep networks.

2. Related Work

Multitask learning. Multi-task learning (Caruana, 1997) is a learning framework that jointly trains a set of task predictors while sharing knowledge among them, by exploiting relatedness between the participating tasks. Such task relatedness is the main idea on which multi-task learning is based, and there are several assumptions on how the tasks are related.
Probably the most common assumption is that the task parameters lie in a low-dimensional subspace. One example of a model based on this assumption is multi-task feature learning (MTFL) (Argyriou et al., 2008), where a set of related tasks learns common features (or representations) shared across multiple tasks. Specifically, they propose to discard features that are not used by most tasks by imposing (2, 1)-norm regularization on the coefficient matrix, and solve the regularized objective via an equivalent convex optimization problem. The assumption in MTFL is rather strict, in that the (2, 1)-norm requires the features to be shared across all tasks, regardless of whether the tasks are related or not. To overcome this shortcoming, Kang et al. (2011) suggest a method that also learns to group tasks based on their relatedness, and enforces sharing only within each group. However, since such strict grouping might not exist between real-world tasks, Kumar & Daume III (2012) and Maurer et al. (2012) suggest learning overlapping groups by learning latent parameter bases that are shared across multiple tasks. Multi-task learning on neural networks is fairly straightforward: a single network is simply shared among all tasks. Recently, there has been some effort on finding more meaningful sharing structures between tasks instead of sharing between all tasks (Yang & Hospedales, 2016; 2017; Ruder et al., 2017). Yet, while these models consider task relatedness, they do not consider asymmetry in the direction of knowledge transfer between the related tasks.

Asymmetric multitask learning. The main limitation of the multi-task learning models based on the common bases assumption is that they cannot prevent negative transfer, as the shared bases are trained without consideration of the quality of the predictors.
To tackle the problem, asymmetric multitask learning (AMTL) (Lee et al., 2016) breaks the symmetry in the knowledge transfer direction between tasks. It assumes that each task parameter can be represented as a sparse linear combination of other task parameters, in which the knowledge flows from task predictors with low loss to predictors with high loss. Then, hard tasks can exploit more reliable information from easy tasks, whereas easy tasks do not have to rely on hard tasks when they have enough information to be accurately predicted. This helps prevent performance degradation on easier tasks from negative transfer. However, in AMTL, knowledge is transferred from one task to another, rather than from tasks to some common feature space. Thus the model is not scalable, and it is not straightforward to apply it to deep learning. In contrast, our model is scalable and straightforward to implement in a deep network, as it learns to transfer from tasks to shared features.

Autoencoders. Our asymmetric multi-task learning formulation has a sparse nonlinear autoencoder term for feature learning. The specific role of the term is to reconstruct the latent features from the model parameters using sparse nonlinear feedback connections, which results in the denoising of the latent features. Autoencoders were first introduced in (Rumelhart et al., 1986) for unsupervised learning, where the model is given the input features as the output and learns to transform the input features into a latent space and then decode them back to the original features. While there exist various autoencoder models, our reconstruction term closely resembles the sparse autoencoder (Ranzato et al., 2007), where the transformation onto and from the latent feature space is sparse.
Hinton & Salakhutdinov (2006) introduce a deep autoencoder architecture in the form of a restricted Boltzmann machine, along with an efficient learning algorithm that trains each layer in a bottom-up fashion. Vincent et al. (2008) propose a denoising autoencoder, which is trained with corrupted data as the input and clean data as the output, to make the internal representations robust to corruption. Our regularizer can in some sense also be considered a denoising autoencoder, since its goal is to refine the features through the autoencoder form such that the reconstructed latent features reflect the loss of more reliable predictors, thus obtaining cleaner representations.

3. Asymmetric Multi-task Feature Learning

In our multi-task learning setting, we have $T$ different tasks with varying degrees of difficulty. For each task $t \in \{1, \dots, T\}$, we have an associated training dataset $D_t = \{(X_t, y_t) \mid X_t \in \mathbb{R}^{N_t \times d},\ y_t \in \mathbb{R}^{N_t \times 1}\}$, where $X_t$ and $y_t$ respectively represent the $d$-dimensional feature vectors and the corresponding labels for $N_t$ data instances. The goal of multi-task learning is then to jointly train models for all $T$ tasks via the following generic learning objective:

$$\min_{W} \sum_{t=1}^{T} \mathcal{L}(w_t; X_t, y_t) + \Omega(W), \tag{1}$$

where $\mathcal{L}$ is the loss function applied across the tasks, $w_t \in \mathbb{R}^{d}$ is the model parameter for task $t$, and $W \in \mathbb{R}^{d \times T}$ is the column-wise concatenation $W = [w_1\ w_2\ \cdots\ w_T]$. Here, the penalty $\Omega$ enforces a certain prior assumption on the sharing properties across tasks in terms of $W$.

One of the popular assumptions is that there exists a common set of latent bases across tasks (Argyriou et al., 2008; Kumar & Daume III, 2012), in which case the matrix $W$ can be decomposed as $W = LS$. Here, $L \in \mathbb{R}^{d \times k}$ is the collection of $k$ latent bases, while $S \in \mathbb{R}^{k \times T}$ is the coefficient matrix for linearly combining those bases. Then, with a regularization term depending on $L$ and $S$, we build the following multi-task learning formulation:

$$\min_{L, S} \sum_{t=1}^{T} \mathcal{L}(L s_t; X_t, y_t) + \Omega(L, S), \tag{2}$$

where $s_t$ is the $t$-th column of $S$, representing $w_t$ as a linear combination of the shared latent bases $L$, that is, $w_t = L s_t$. As a special case of (2), GO-MTL (Kumar & Daume III, 2012), for example, encourages $L$ to be element-wise $\ell_2$ regularized and each $s_t$ to be sparse:

$$\min_{L, S} \sum_{t=1}^{T} \Big\{ \mathcal{L}(L s_t; X_t, y_t) + \mu \|s_t\|_1 \Big\} + \lambda \|L\|_F^2. \tag{3}$$

On the other hand, it is possible to exploit task relatedness without the explicit assumption of shared latent bases. AMTL (Lee et al., 2016) is such an instance of multi-task learning, based on the assumption that each task parameter $w_t \in \mathbb{R}^{d}$ is reconstructed as a sparse combination of the other task parameters $\{w_s\}_{s \neq t}$. In other words, it encourages $w_t \approx \sum_{s \neq t} B_{st} w_s$, where the weight $B_{st}$ of $w_s$ in reconstructing $w_t$ can be interpreted as the amount of knowledge transfer from task $s$ to $t$. Since there is no symmetry constraint on the matrix $B$, AMTL learns asymmetric knowledge transfer from easier tasks to harder ones. Towards this goal, AMTL solves the multi-task learning problem via the following optimization problem:

$$\min_{W, B} \sum_{t=1}^{T} \big(1 + \alpha \|b_t^o\|_1\big)\, \mathcal{L}(w_t; X_t, y_t) + \gamma \|W - WB\|_F^2, \tag{4}$$

where $B_{tt} = 0$ for $t = 1, \dots, T$, and the row vector $b_t^o \in \mathbb{R}^{1 \times T}$ of $B$ controls the amount of outgoing transfer from task $t$ to the other tasks $s \neq t$. The sparsity parameter $\alpha$ is multiplied by the training loss $\mathcal{L}(w_t; X_t, y_t)$, making the outgoing transfer from hard tasks sparser than that from easy tasks. The second, Frobenius-norm-based penalty is the inter-task regularization term for reconstructing each task parameter $w_t$.

Figure 2. (a) An illustration of negative transfer in the common latent bases model. (b) The effects of inter-task $\ell_2$ regularization on top of the common latent bases model. (c) Asymmetric task-to-basis transfer. (d) An illustration of the ReLU transformation with a bias term.

3.1. Asymmetric Transfer from Task to Bases

One critical drawback of (3) is the severe negative transfer from unreliable models to reliable ones, since all task models contribute equally to the construction of the latent bases. On the other hand, (4) is not scalable to a large number of tasks, and does not learn explicit features. In this section, we provide a novel framework for asymmetric multi-task feature learning that overcomes the limitations of these two previous approaches, and find an effective way to achieve asymmetric knowledge transfer in deep neural networks while preventing negative transfer.

We start our discussion with the observation of how negative transfer occurs in common latent bases multi-task learning models such as (3). Suppose that we train a multi-task learning model for three tasks, where the model parameters of each task are generated from the bases $\{l_1, l_2, l_3\}$. Specifically, $w_1$ is generated from $\{l_1, l_3\}$, $w_2$ from $\{l_1, l_2\}$, and $w_3$ from $\{l_2, l_3\}$. Further, we assume that the predictor for task 3 is unreliable and noisy, while the predictors for tasks 1 and 2 are reliable, as illustrated in Figure 2(a). In such a case, when we train the task predictors in a multi-task learning framework, $w_3$ will transfer noise to the shared bases $\{l_2, l_3\}$, which will in turn negatively affect the models parameterized by $w_1$ and $w_2$.

One might consider a naive combination of the shared basis model (3) and AMTL (4) to prevent negative transfer among latent features, where the task parameter matrix in (4) is decomposed into $LS$:

$$\min_{L, S, B} \sum_{t=1}^{T} \Big\{ \big(1 + \alpha \|b_t^o\|_1\big)\, \mathcal{L}(L s_t; X_t, y_t) + \mu \|s_t\|_1 \Big\} + \gamma \|LS - LSB\|_F^2 + \lambda \|L\|_F^2, \tag{5}$$

where $B_{tt} = 0$ for $t = 1, \dots, T$. However, this simple combination cannot resolve the issue, mainly due to two limitations. First, the inter-task transfer matrix $B$ still grows quadratically with respect to $T$ as in AMTL, which is not scalable for large $T$. Second and more importantly, this approach would induce additional negative transfer.
In the previous example in Figure 2(b), the unreliable model $w_3$ is enforced to be a linear combination of the other, reliable models via the matrix $B$ (the purple dashed lines in the figure). In other words, $w_3$ can now affect the clean basis $l_1$, which is trained only by the reliable models in Figure 2(a). As a result, the noise will be transferred to $l_1$, and consequently to the reliable models based on it. As this simple example shows, introducing inter-task asymmetric transfer into the shared basis MTL (3) leads to more severe negative transfer, which is contrary to the original intention.

To resolve this issue, we propose a completely new type of regularization in order to prevent negative transfer from the task predictors to the shared latent features, which we refer to as asymmetric task-to-basis transfer. Specifically, we encourage the latent features to be reconstructed from the task predictors' parameters in an asymmetric manner, where we enforce the reconstruction to be done by the parameters of reliable predictors only, as shown in Figure 2(c). Since the parameters of the task predictors are themselves reconstructed from the bases, this regularization can be considered an autoencoder framework. The difference here is that taking the predictor loss into account results in denoising of the learned representations. We describe the details of our asymmetric task-to-basis transfer framework in the following subsection.

Figure 3. Deep-AMTFL. The green lines denote feedback connections with $\ell_2$ constraints on the features. Different color scales denote different amounts of reliability (blue) and knowledge transfer from task predictions to features (green).

3.2. Feature Reconstruction with Nonlinearity

There are two main desiderata in our construction of the asymmetric feature learning framework. First, the reconstruction should be achieved in a non-linear manner.
Suppose that we perform linear reconstruction of the bases as shown in Figure 2(c). In this case, the linear span of $\{w_1, w_2\}$ does not cover any of $\{l_1, l_2, l_3\}$. Thus we need a nonlinearity to solve the problem. Second, the reconstruction needs to be done in the feature space, not on the bases of the parameters, in order to directly apply it to a deep learning framework.

We first define notation before introducing our framework. Let $X$ be the row-wise concatenation of $\{X_1, \dots, X_T\}$. We assume a neural network with a single hidden layer, where $L$ and $S$ are the parameters of the first and the second layer, respectively. As the nonlinearity in the hidden layer, we use the rectified linear unit (ReLU), denoted as $\sigma(\cdot)$. The nonnegative feature matrix is denoted as $Z = \sigma(XL)$, or $z_i = \sigma(X l_i)$ for each column $i = 1, \dots, k$. The task-to-feature transfer matrix is $A \in \mathbb{R}^{T \times k}$. Using the above notation, our asymmetric multi-task feature learning framework is defined as follows:

$$\min_{L, S, A} \sum_{t=1}^{T} \Big\{ \big(1 + \alpha \|a_t^o\|_1\big)\, \mathcal{L}(L, s_t; X_t, y_t) + \mu \|s_t\|_1 \Big\} + \gamma \|Z - \sigma(ZSA)\|_F^2 + \lambda \|L\|_F^2. \tag{6}$$

The goal of the autoencoder term is to reconstruct the features $Z$ from the model outputs $ZS$ with the nonlinear transformation $\sigma(\cdot\,; A)$. We also use the ReLU nonlinearity for the reconstruction term, as in the original network, since this allows the reconstruction $\hat{Z}$ to explore the same manifold as $Z$, thus making it easier to find an accurate reconstruction. In Fig. 2(d), for example, the linear span of the task output vectors $\{y_1, y_2\}$ forms the blue hyperplane. Transforming this hyperplane with ReLU and a bias term results in the manifold colored in gray and yellow, which includes the original feature vectors $\{z_1, z_2, z_3\}$.

Since our framework considers the asymmetric transfer in the feature space, we can seamlessly generalize (6) to deep networks with multiple layers. Specifically, the autoencoding regularization term is formulated at the penultimate layer to achieve the asymmetric transfer. We name this approach Deep-AMTFL:

$$\min_{A,\ \{W^{(l)}\}_{l=1}^{L}} \sum_{t=1}^{T} \Big\{ \big(1 + \alpha \|a_t^o\|_1\big)\, \mathcal{L}_t + \mu \|w_t^{(L)}\|_1 \Big\} + \gamma \big\|\sigma\big(Z W^{(L)} A\big) - Z\big\|_F^2, \tag{7}$$

where $W^{(l)}$ is the weight matrix for the $l$-th layer, with $w_t^{(L)}$ denoting the $t$-th column vector of $W^{(L)}$, and

$$Z = \sigma\big(W^{(L-1)} \sigma\big(W^{(L-2)} \cdots \sigma(X W^{(1)})\big)\big), \qquad \mathcal{L}_t := \mathcal{L}\big(w_t^{(L)}, W^{(L-1)}, \dots, W^{(1)}; X_t, y_t\big)$$

are the hidden representations at layer $L - 1$ and the loss for each task $t$, respectively. See Figure 3 for an illustration.

Loss functions. The loss function $\mathcal{L}(w; X, y)$ can be any generic loss function. We mainly consider the two most popular instances. For regression tasks, we use the squared loss:

$$\mathcal{L}(w_t; X_t, y_t) = \frac{1}{N_t} \|y_t - X_t w_t\|_2^2 + \delta / \sqrt{N_t}.$$

For classification tasks, we use the logistic loss:

$$\mathcal{L}(w_t; X_t, y_t) = -\frac{1}{N_t} \sum_{i=1}^{N_t} \big\{ y_{ti} \log \sigma(x_{ti} w_t) + (1 - y_{ti}) \log\big(1 - \sigma(x_{ti} w_t)\big) \big\} + \delta / \sqrt{N_t},$$

where $\sigma$ here is the sigmoid function. Note that we augment the loss terms with $\delta / \sqrt{N_t}$ to express the imbalance in the number of training instances for each task. As for $\delta$, we roughly tune $\delta / \sqrt{N_t}$ to have a similar scale to the first term of $\mathcal{L}(w_t; X_t, y_t)$.

4. Experiments

We validate AMTFL on both synthetic and real datasets against relevant baselines, using both shallow and deep neural networks as base models.

4.1. Shallow models - Feedforward Networks

We first validate our shallow AMTFL model (6) on synthetic and real datasets with shallow neural networks. Note that we use the one-vs-all logistic loss for multi-class classification.

Baselines and our models

1) STL. A linear single-task learning model.
2) GO-MTL. A feature-sharing MTL model from (Kumar & Daume III, 2012), where different task predictors share a common set of latent bases (3).
3) AMTL. Asymmetric multi-task learning model (Lee et al., 2016), with inter-task knowledge transfer through a parameter-based regularization (4).
4) NN. A simple feedforward neural network with a single hidden layer.
5) MT-NN. Same as NN, but with each task loss divided by $N_t$ to balance the task losses. Note that this model applies $\ell_1$-regularization at the last fully connected layer.
6) AMTFL. Our asymmetric multi-task feature learning model with feedback connections (6).

Figure 4. Visualization of the learned features and parameters on the synthetic dataset. (a-b) True parameters that generated the data; panel (b) shows the true $LS$. (c) Reconstructed parameters from AMTL. (d-e) Reconstructed parameters from GO-MTL. (f-h) Reconstructed parameters from AMTFL.

Synthetic dataset experiment. We first check the validity of AMTFL on a synthetic dataset. We first generate six 30-dimensional true bases, shown in Figure 4(a). Then, we generate parameters for 12 tasks from them with noise $\epsilon \sim \mathcal{N}(0, \sigma)$. We vary $\sigma$ to create two groups based on the noise level: easy and hard. Easy tasks have a noise level of $\sigma = 1$ and hard tasks have $\sigma = 2$. Each predictor $w_t \in \mathbb{R}^{30}$ for an easy task combinatorially picks two out of the four bases $\{l_1, \dots, l_4\}$ and linearly combines them, while each predictor for a hard task selects among $\{l_3, \dots, l_6\}$. Thus the bases $\{l_3, l_4\}$ are shared by both easy and hard tasks, while the other bases are used exclusively by one group, as shown in Figure 4(b). We generate five random train/val/test splits for each group: 50/50/100 for easy tasks and 25/25/100 for hard tasks.

For this dataset, we implement all base models as neural networks to better compare with AMTFL. We add $\ell_1$ regularization on $L$ for all models for better reconstruction of $L$. We remove the ReLU at the hidden layer in AMTFL since the features are linear (we avoid adding a nonlinearity to the features to make the qualitative analysis much easier). All hyper-parameters are selected on a separate validation set. For AMTL, we remove the nonnegativity constraint on $B$ due to the characteristics of this dataset.

We first check whether AMTFL can accurately reconstruct the true bases in Figure 4(a). We observe that the $L$ learned by AMTFL in Figure 4(f) more closely resembles the true bases than the $L$ reconstructed using GO-MTL in Figure 4(d), which is noisier.
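As an aside, the synthetic setup described above can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the authors' generation code: the function name `make_synthetic_tasks`, the Gaussian draw of the true bases, the unit combination coefficients, the Gaussian inputs, the choice to add the noise to the regression targets, and the reading of σ as a standard deviation are all assumptions, since the text specifies only the dimensions, the basis-selection pattern, the noise levels, and the split sizes.

```python
import itertools
import numpy as np


def make_synthetic_tasks(seed=0, dim=30, n_bases=6):
    """Hypothetical generator for a 12-task easy/hard synthetic problem."""
    rng = np.random.default_rng(seed)
    # Assumption: true bases are i.i.d. standard normal columns.
    L_true = rng.standard_normal((dim, n_bases))

    # Easy tasks pick 2 of {l1..l4}; hard tasks pick 2 of {l3..l6}.
    # C(4, 2) = 6 combinations per group gives 12 tasks in total.
    easy_pairs = list(itertools.combinations(range(0, 4), 2))
    hard_pairs = list(itertools.combinations(range(2, 6), 2))

    tasks = []
    groups = [(easy_pairs, 1.0, 50), (hard_pairs, 2.0, 25)]
    for pairs, noise_std, n_train in groups:
        for i, j in pairs:
            s = np.zeros(n_bases)
            s[[i, j]] = 1.0                      # assumption: unit coefficients
            w = L_true @ s                       # task parameter w_t = L s_t
            X = rng.standard_normal((n_train, dim))
            # Assumption: noise enters through the targets, with std = sigma.
            y = X @ w + noise_std * rng.standard_normal(n_train)
            tasks.append({"s": s, "w": w, "X": X, "y": y, "noise_std": noise_std})
    return L_true, tasks


L_true, tasks = make_synthetic_tasks(seed=0)
print(len(tasks), tasks[0]["X"].shape, tasks[-1]["X"].shape)
```

Only the training portion of each split is generated here; validation and test sets would be drawn the same way with the sizes given above.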
The reconstructed $W = LS$ from AMTFL in Figure 4(g), in turn, is closer to the true parameters than the $W = LS$ generated with GO-MTL in Figure 4(e) and the $W$ from AMTL in Figure 4(c), for both easy and hard tasks. Further analysis of the inter-task transfer matrix $AS$ in Figure 4(h) reveals that this accurate reconstruction is due to the asymmetric inter-task transfer, as it shows no transfer from hard to easy tasks, while we see a significant amount of transfer from easy to hard tasks.

The quantitative evaluation in Figure 5(a) further shows that AMTFL significantly outperforms existing MTL methods. AMTFL obtains lower errors than STL on both easy and hard tasks, while GO-MTL results in even higher errors than STL on hard tasks. We attribute this result to negative transfer from other hard tasks. AMTFL also outperforms AMTL by a significant margin, possibly because it is hard for AMTL to find meaningful relations between tasks in this particular synthetic dataset, where the data for each task is assumed to be generated from the same set of latent bases. Also, a closer look at the per-task error reduction over STL in Figure 5(b) shows that AMTFL effectively prevents negative transfer while GO-MTL suffers from it, and that AMTFL makes larger improvements than AMTL. Further, Figure 5(c) shows that AMTFL is more scalable than AMTL, both in terms of error reduction and training time, especially when there is a large number of tasks. One thing to note is that, for AMTFL, the error goes down as the number of tasks increases. This is a reasonable result, since the feature reconstruction using the task-specific model parameters becomes increasingly accurate with a larger number of tasks.

Real dataset experiment. We further test our model on one binary classification, one regression, and two multi-class classification datasets, which are the ones used for the experiments in (Kumar & Daume III, 2012; Lee et al., 2016).
We report the averaged performance of each model over five random splits for all datasets.

Figure 5. Results of the synthetic dataset experiment. (a) Average RMSE for clean/noisy/all tasks (STL, GO-MTL, AMTL, AMTFL). (b) Per-task RMSE reduction over STL for tasks 1 to 12 (GO-MTL, AMTL, AMTFL). (c) Scalability: RMSE and training time of AMTL and AMTFL for an increasing number of tasks (12, 120, 1200, 12000).

1) AWA-A: This is a classification dataset (Lampert et al., 2009) that consists of 30,475 images, where the task is to predict 85 binary attributes for each image describing a single animal. The feature dimension is reduced from 4096 (Decaf) to 500 by using PCA. The number of instances in the train, validation, and test sets for each task is 1080, 60, and 60, respectively. We set the number of hidden neurons to 1000, which is tuned on the base NN.
2) MNIST: This is a standard dataset for classification (LeCun et al., 1998) that consists of 60,000 training and 10,000 test images of size 28 × 28 that describe 10 handwritten digits (0-9). Following the procedure of (Kumar & Daume III, 2012), the feature dimension is reduced to 64 by using PCA, and 5 random 100/50/50 splits are used for train/val/test. We set the number of hidden neurons to 500.
3) School: This is a regression dataset where the task is to predict the exam scores of 15,362 students from 139 schools. Prediction of the exam scores for each school is considered a single task. The splits used are from (Argyriou et al., 2008), and we use the first 5 splits among 10. We set the number of hidden neurons to 10 or 15.
4) Room: This is a subset of the ImageNet dataset (Deng et al., 2009) from (Lee et al., 2016), where the task is to classify 14,140 images of 20 different indoor scene classes. The number of train/val instances varies from 30 to over 1000, while the test set has 20 instances per class.
The feature dimension is reduced from 4096 (Decaf) to 500 by using PCA. We set the number of hidden neurons to 1000.

Table 1 shows the results on the real datasets. As expected, AMTFL (6) outperforms the baselines on most datasets. The only exception is the School dataset, on which GO-MTL obtains the best performance, but as mentioned in (Lee et al., 2016), this is due to the strong homogeneity among the tasks in this particular dataset.

Table 1. Performance of the linear and shallow baselines and our asymmetric multi-task feature learning model. We report the RMSE for regression and the mean classification error (%) for classification, along with the standard error for the 95% confidence interval.

| Model | AWA-A | MNIST | School | Room |
|---|---|---|---|---|
| STL | 37.6 ± 0.5 | 14.8 ± 0.6 | 10.16 ± 0.08 | 45.9 ± 1.4 |
| GO-MTL | 35.6 ± 0.2 | 14.4 ± 1.3 | 9.87 ± 0.06 | 47.1 ± 1.4 |
| AMTL | 33.4 ± 0.3 | 12.9 ± 1.4 | 10.13 ± 0.08 | 40.8 ± 1.5 |
| NN | 26.3 ± 0.3 | 8.96 ± 0.9 | 9.89 ± 0.03 | 44.5 ± 2.0 |
| MT-NN | 26.2 ± 0.3 | 8.76 ± 1.0 | 9.91 ± 0.04 | 41.7 ± 1.7 |
| AMTFL | 25.2 ± 0.3 | 8.68 ± 0.9 | 9.89 ± 0.09 | 40.4 ± 2.4 |

4.2. Deep models - Convolutional Networks

Next, we validate our Deep-AMTFL (7) using CNNs as base networks for end-to-end image classification. Note that we use a one-vs-all classifier instead of softmax, since we want to consider the classification of each class as a separate task.

Table 2. Classification performance of the deep learning baselines and Deep-AMTFL. The reported numbers for the MNIST-Imbalanced and CUB datasets are averages over 5 runs.

| Model | MNIST-Imbal. | CUB | AWA-C | Small |
|---|---|---|---|---|
| CNN | 8.13 | 46.18 | 33.36 | 66.54 |
| MT-CNN | 8.72 | 43.92 | 32.80 | 65.69 |
| Deep-AMTL | 8.52 | 45.26 | 32.32 | 65.61 |
| Deep-AMTFL | 5.82 | 43.75 | 31.88 | 64.49 |

Baselines and our models

1) CNN: The base convolutional neural network.
2) MT-CNN: The base CNN with $\ell_1$-regularization on the parameters of the last fully connected layer $W^{(L)}$ instead of $\ell_2$-regularization, similarly to (7).
3) Deep-AMTL: The base CNN with the asymmetric multi-task learning objective of (Lee et al., 2016) replacing the original loss.
Note that this model is a deep version of (5), where $LS$ corresponds to the last fully connected layer $W^{(L)}$.
4) Deep-AMTFL: Our deep asymmetric multi-task feature learning model with the asymmetric autoencoder based on task loss (7).

Datasets and base networks

1) MNIST-Imbalanced: This is an imbalanced version of the MNIST dataset. Among the 6,000 training samples for each class 0, 1, ..., 9, we used 200, 180, ..., 20 samples for training, respectively, and the rest for validation. For testing, we use 1,000 instances per class, as in the standard MNIST dataset. As the base network, we used LeNet-Conv.
2) CUB-200: This dataset consists of images describing 200 bird classes, including Cardinal, Tree Sparrow, and Gray Catbird, of which 5,994 images are used for training and 5,794 for testing. As the base network, we used ResNet-18 (He et al., 2016).
3) AWA-C: This is the same AWA dataset used in the shallow model experiments, but we used the class labels for 50 animals instead of the binary attribute labels. Among the 30,475 images, we randomly sampled 50 instances per class to use as the test set, and used the rest for training, which results in an imbalanced training set (42-1118 samples per class). As with the CUB dataset, we used ResNet-18 as the base network.
4) ImageNet-Small: This is a subset of the ImageNet 22K dataset (Deng et al., 2009), which contains 352 classes. We deliberately created the dataset to be largely imbalanced, with the number of images per class ranging from 2 to 1,044. We used ResNet-50 as the base network.

Experimental setup. For the implementation of all the baselines and our deep models, we use the TensorFlow (Abadi et al., 2016) and Caffe (Jia, 2013) frameworks. For the AWA-C and Small datasets, we first train the base model from scratch, and fine-tune the other models starting from it for expedited training. We trained from scratch for the other datasets.
Note that we simply use the weight decay λ provided by the code of the base networks, and set µ = λ to reduce the effective number of hyperparameters to tune. We searched for α and γ in the range of $\{1, 0.1, 10^{-2}, 10^{-3}, 10^{-4}\}$. For the CUB dataset, we gradually increase α and γ from 0, which helps stabilize learning at the initial stage of training.

Quantitative evaluation
We report the average per-class classification performance of the baselines and our models in Table 2. Our Deep-AMTFL outperforms all baselines, including MT-CNN and Deep-AMTL, which shows the effectiveness of our asymmetric knowledge transfer from tasks to features, and back to tasks, in deep learning frameworks. To see where the performance improvement comes from, we further examine the per-task (class) performance improvement of the baselines and our Deep-AMTFL over the base CNN on the MNIST-Imbalanced dataset, along with the average per-task loss (Figure 6(a)).

Figure 6. Results of experiments on the MNIST-Imbalanced dataset. (a) Per-task improvements: accuracy improvements over the CNN (%) and the per-task losses (cross entropy, ×1E-4) for tasks 0 to 9. (b) Inter-task transfer: the inter-task transfer matrix AS. We remove the sign of values for better visualization.

We see that MT-CNN improves the performance over CNN on half of the tasks (5 out of 10) while degrading performance on the rest. Deep-AMTL obtains larger performance gains on later tasks with large loss (tasks 8 and 9) due to its asymmetric inter-task knowledge transfer, but still suffers from performance degradation (tasks 6 and 7). Our Deep-AMTFL, on the other hand, does not suffer from accuracy loss on any task and shows significantly improved performance on all tasks, especially on the tasks with large loss (task 9). This result suggests that the performance gain mostly comes from the suppression of negative transfer.
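The per-task comparison above (accuracy improvement of each model over the base CNN, one class per task) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the labels and predictions are toy values invented for the example, and per-class accuracy is taken to be the fraction of test samples of a class that are predicted correctly, which may differ from the exact metric used in the experiments.

```python
def per_class_accuracy(y_true, y_pred, num_classes):
    """Return the accuracy (%) of each class, treating each class as one task."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Classes without test samples get NaN instead of a division by zero.
    return [100.0 * c / n if n else float("nan") for c, n in zip(correct, total)]


def improvement_over_base(base_acc, model_acc):
    """Per-class accuracy gain (percentage points) of a model over the base."""
    return [m - b for b, m in zip(base_acc, model_acc)]


# Toy labels and predictions, made up for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
base_pred = [0, 0, 1, 0, 2, 0]
model_pred = [0, 0, 1, 1, 2, 0]

base_acc = per_class_accuracy(y_true, base_pred, num_classes=3)
model_acc = per_class_accuracy(y_true, model_pred, num_classes=3)
gains = improvement_over_base(base_acc, model_acc)

# Average per-class accuracy, i.e. the unweighted mean over classes.
mean_per_class_acc = sum(model_acc) / len(model_acc)
print(gains, mean_per_class_acc)
```

A negative entry in `gains` would indicate a task on which the model performs worse than the base network, which is the pattern discussed above for MT-CNN and Deep-AMTL.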
Qualitative analysis
As a further qualitative analysis, we examine how inter-task knowledge transfer is done in Deep-AMTFL. Although Deep-AMTFL does not explicitly model an inter-task knowledge transfer graph, we can obtain one by computing AS, as shown in Figure 6(b). We see that each task transfers to later tasks (upper triangular submatrix) that come with fewer training instances, but does not receive knowledge transfer from them, which demonstrates that Deep-AMTFL performs asymmetric knowledge transfer in the correct directions implicitly via the latent feature space. The only exception is the transfer from task 5 to task 2, which is reasonable since they have similar losses (Figure 6(a)).

5. Conclusion

We propose a novel deep asymmetric multi-task feature learning framework that can effectively prevent negative transfer resulting from symmetric influences of each task in feature learning. By introducing asymmetric feedback connections in the form of an autoencoder, our AMTFL enforces the participating task predictors to asymmetrically affect the learning of shared representations based on task loss. We perform extensive experimental evaluation of our model on various types of tasks on multiple public datasets. The experimental results show that our model significantly outperforms both symmetric multi-task feature learning and asymmetric multi-task learning based on inter-task knowledge transfer, for both shallow and deep frameworks.
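For the qualitative analysis of Figure 6(b), the implied inter-task transfer matrix can be sketched as the sign-removed product AS. The sketch below is illustrative only and rests on several assumptions: A is taken to have shape T × k (from tasks back to latent features) and S shape k × T (from latent features to tasks), entry [i][j] of the product is read as transfer from task i to task j, the function names are hypothetical, and the numeric values are toy numbers invented for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transfer_matrix(A, S):
    """Return |AS|; entry [i][j] is read as transfer from task i to task j."""
    return [[abs(v) for v in row] for row in matmul(A, S)]


def triangular_mass(M):
    """Sum the strictly upper and strictly lower triangular entries of M."""
    n = len(M)
    upper = sum(M[i][j] for i in range(n) for j in range(i + 1, n))
    lower = sum(M[i][j] for i in range(n) for j in range(i))
    return upper, lower


# Toy values with T = 3 tasks and k = 2 latent features (assumed shapes).
# Task 0 feeds strongly into the features; task 2 does not feed back at all.
A = [[1.0, 0.5], [0.0, -0.8], [0.0, 0.0]]
S = [[0.2, 0.6, 0.9], [0.1, 0.3, 0.7]]

M = transfer_matrix(A, S)
upper, lower = triangular_mass(M)
print(upper, lower)
```

Under this reading, a much larger upper triangular sum than lower triangular sum corresponds to earlier tasks transferring to later tasks without receiving transfer from them.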
Acknowledgements

This research was supported by Samsung Research Funding Center of Samsung Electronics (SRFC-IT150203), Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01061019), a Machine Learning and Statistical Inference Framework for Explainable Artificial Intelligence (No.2017-0-01779) and Development of Autonomous IoT Collaboration Framework for Space Intelligence (2017-0-00537) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467, 2016.

Argyriou, A., Evgeniou, T., and Pontil, M. Convex Multi-task Feature Learning. Machine Learning, 73(3):243-272, 2008.

Caruana, R. Multitask Learning. Machine Learning, 1997.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In CVPR, 2016.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006. ISSN 0036-8075. doi: 10.1126/science.1127647.

Jia, Y. Caffe: An open source convolutional architecture for fast feature embedding, 2013.

Kang, Z., Grauman, K., and Sha, F. Learning with whom to share in multi-task feature learning. In ICML, pp. 521-528, 2011.

Kumar, A. and Daume III, H. Learning task grouping and overlap in multi-task learning. In ICML, 2012.

Lampert, C., Nickisch, H., and Harmeling, S. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In CVPR, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lee, G., Yang, E., and Hwang, S. Asymmetric multi-task learning based on task relatedness and confidence. In ICML, 2016.

Maurer, A., Pontil, M., and Romera-Paredes, B. Sparse coding for multitask and transfer learning. In ICML, 2012.

Ranzato, M. A., Poultney, C., Chopra, S., and LeCun, Y. Efficient learning of sparse representations with an energy-based model. pp. 1137-1144. MIT Press, 2007.

Ruder, S., Bingel, J., Augenstein, I., and Søgaard, A. Learning what to share between loosely related tasks. ArXiv e-prints, May 2017.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pp. 318-362. MIT Press, Cambridge, MA, USA, 1986.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. pp. 1096-1103, New York, NY, USA, 2008. ACM.

Yang, Y. and Hospedales, T. Deep Multi-task Representation Learning: A Tensor Factorisation Approach. ICLR, 2017.

Yang, Y. and Hospedales, T. M. Trace Norm Regularised Deep Multi-Task Learning. ArXiv e-prints, June 2016.