# Adaptive Adversarial Multi-task Representation Learning

Yuren Mao 1, Weiwei Liu 2, Xuemin Lin 1

1 School of Computer Science and Engineering, University of New South Wales, Australia. 2 School of Computer Science, Wuhan University, China. Correspondence to: Weiwei Liu.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Adversarial Multi-task Representation Learning (AMTRL) methods are capable of boosting the performance of Multi-task Representation Learning (MTRL) models. However, the theoretical mechanism behind AMTRL has been only minimally investigated. To fill this gap, we study the generalization error bound of AMTRL through the lens of Lagrangian duality. Based on this duality, we propose a novel adaptive AMTRL algorithm that improves the performance of the original AMTRL methods. We further conduct extensive experiments to back up our theoretical analysis and validate the superiority of our proposed algorithm.

1. Introduction

Multi-task Representation Learning (MTRL), an influential line of research on multi-task learning, learns related tasks simultaneously by sharing a common representation. Compared with learning each task independently, MTRL typically has a lower computational cost and better prediction performance. It has achieved great success in applications ranging from computer vision (Kendall et al., 2018) to natural language processing (Collobert & Weston, 2008). Recently, adversarial MTRL (AMTRL) methods (Liu et al., 2017; Chen et al., 2018a; Shi et al., 2018; Yu et al., 2018; Liu et al., 2018; Yadav et al., 2018) have been widely utilized in a range of applications. AMTRL methods improve the performance of the original MTRL models by adding an extra adversarial module, i.e., a task discriminator in the representation space. Unfortunately, the theoretical mechanism behind AMTRL methods is still not well understood.

The findings of this paper suggest that AMTRL methods restrict the hypothesis class by enforcing all the tasks to share an identical distribution in the representation space. This identical-distribution restriction provides further inductive bias and tightens the task-averaged generalization error bound for MTRL. Based on this restriction, we formulate AMTRL as a constrained optimization problem and propose to solve it with the augmented Lagrangian method. To quantitatively measure how likely the tasks are to share an identical distribution in the representation space, we propose a pairwise relatedness metric for AMTRL. Based on this metric, a weight adaptation strategy is proposed to accelerate the convergence of the adversarial module. Combining the weight adaptation strategy and the augmented Lagrangian method, we present the adaptive AMTRL method.

This paper conducts experiments on two popular multi-task learning applications: sentiment analysis and topic classification. The experimental results verify our theoretical analysis and validate that the proposed algorithm outperforms several state-of-the-art methods.

2. Related Works

Adaptive weighting scalarization, which linearly scalarizes the tasks with adaptive weight assignment, is a typical MTRL method.
Various adaptive weighting strategies (Kendall et al., 2018; Chen et al., 2018b; Sener & Koltun, 2018; Lin et al., 2019; Mao et al., 2020) have been proposed to balance the regularization between tasks and improve the performance of the original MTRL. By contrast, existing AMTRL methods, for example (Liu et al., 2017; Chen et al., 2018a), only adopt naive uniform scalarization. In this paper, we propose an adaptive weighting strategy for AMTRL based on the augmented Lagrangian (Hestenes, 1969) and a novel task relatedness metric. The task relatedness metric is based on representation similarity. Compared with typical representation-similarity-based task relatedness metrics (Kriegeskorte et al., 2008; McClure & Kriegeskorte, 2016; Dwivedi & Roig, 2019), the proposed metric computes the representation similarity from the output of the adversarial module and does not require the extra computation of correlation coefficients, which makes it more efficient for AMTRL.

3. Preliminaries

Consider a multi-task representation learning problem with $T$ tasks over an input space $\mathcal{X}$ and a collection of task spaces $\{\mathcal{Y}^t\}_{t=1}^T$. We define the hypothesis class of the problem as $\mathcal{H} = \{\mathcal{F}^t\}_{t=1}^T \circ \mathcal{G}$. Here $\mathcal{G} = \{g : \mathcal{X} \to \mathbb{R}^K\}$ is the set of representation functions (i.e., the representation hypothesis class), and $K$ is the dimension of the representation space. $\{\mathcal{F}^t\}_{t=1}^T = \{f^t : \mathbb{R}^K \to \mathcal{Y}^t\}_{t=1}^T$ is the set of predictors (i.e., the prediction hypothesis class), and $f^t$ is $\rho$-Lipschitz for all $t \in \{1, \dots, T\}$. The representation $g$ is shared across the tasks, while $f^t$ is task-specific. Thus $\mathcal{H} = \{h = \{f^t(g(\cdot))\}_{t=1}^T : \mathcal{X} \to \{\mathcal{Y}^t\}_{t=1}^T\}$.

Learning $\mathcal{H}$ is based on the data observed for all the tasks. Without loss of generality, we assume that each task has $n$ samples. The data take the form of a multi-sample $S = \{S^t\}_{t=1}^T$ with $S^t = (X^t, Y^t)$ and $(X^t, Y^t) = \{x_i^t, y_i^t\}_{i=1}^n \sim D_t^n$, where $D_t$ is a probability distribution over $\mathcal{X} \times \mathcal{Y}^t$. After the representation mapping, $(g(X^t), Y^t) \sim \mu_t^n$, where $\mu_t$ is a distribution over $\mathbb{R}^K$. The loss function for task $t$ is defined as $l^t : \mathcal{Y}^t \times \mathcal{Y}^t \to [0, 1]$ and is assumed to be 1-Lipschitz. We define the true risk of a hypothesis $f^t \circ g$ for task $t$ as $L_{D_t}(f^t \circ g) = \mathbb{E}_{(x^t, y^t) \sim D_t}[l^t(f^t(g(x^t)), y^t)]$ and the task-averaged generalization error as $L_D(h) = \frac{1}{T}\sum_{t=1}^T L_{D_t}(f^t \circ g)$. Correspondingly, the empirical loss of task $t$ is defined as $L_{S^t}(f^t \circ g) = \frac{1}{n}\sum_{i=1}^n l^t(f^t(g(x_i^t)), y_i^t)$ and the empirical task-averaged error as $L_S(h) = \frac{1}{T}\sum_{t=1}^T L_{S^t}(f^t \circ g)$. We denote the transpose of a vector/matrix by the superscript $\top$ and logarithms to base 2 by $\log$.

Multi-task Representation Learning. Multi-task Representation Learning (MTRL) learns multiple tasks jointly by sharing a representation across tasks. This representation is typically produced by a representation map that has the same parameters for every task. For example, in deep neural networks, the common representation is obtained by sharing hidden layers. The original MTRL module in Figure 1 shows a deep MTRL network model utilizing a hard parameter sharing strategy (Ruder, 2017). Under the Empirical Risk Minimization (ERM) paradigm, MTRL minimizes the task-averaged empirical error (Maurer et al., 2016):

$$\min_{g, f^1, \dots, f^T} \; \frac{1}{T}\sum_{t=1}^T L_{S^t}(f^t \circ g). \tag{1}$$
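As a concrete illustration of (1), the following is a minimal PyTorch sketch of hard-parameter-sharing MTRL: a shared representation $g$, task-specific heads $f^t$, and the task-averaged empirical loss. The dimensions, encoder, and data are illustrative placeholders, not the architecture used later in Section 5.

```python
import torch
import torch.nn as nn

# Hard parameter sharing (Eq. 1): a shared representation g(.) feeding
# task-specific predictors f^t(.). All sizes below are illustrative.
T, K, IN_DIM, NUM_CLASSES = 4, 64, 300, 2

shared_g = nn.Sequential(nn.Linear(IN_DIM, K), nn.ReLU())               # g: X -> R^K
heads_f = nn.ModuleList([nn.Linear(K, NUM_CLASSES) for _ in range(T)])  # f^1, ..., f^T
criterion = nn.CrossEntropyLoss()

def task_averaged_loss(batches):
    """batches: list of (x_t, y_t) pairs, one minibatch per task."""
    losses = []
    for t, (x_t, y_t) in enumerate(batches):
        logits = heads_f[t](shared_g(x_t))      # f^t(g(x))
        losses.append(criterion(logits, y_t))   # empirical loss L_{S^t}
    return torch.stack(losses).mean()           # (1/T) * sum_t L_{S^t}

# Usage sketch with random data.
params = list(shared_g.parameters()) + list(heads_f.parameters())
opt = torch.optim.Adam(params)
batches = [(torch.randn(8, IN_DIM), torch.randint(0, NUM_CLASSES, (8,))) for _ in range(T)]
loss = task_averaged_loss(batches)
opt.zero_grad()
loss.backward()
opt.step()
```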
Theorem 1 (Maurer et al., 2016; Ando & Zhang, 2005) presents an upper bound on the task-averaged generalization error of MTRL.

Theorem 1. For $0 < \delta < 1$, with probability at least $1-\delta$ in $S$ we have that

$$L_D(h) - L_S(h) \le \frac{c_1 \rho\, G(\mathcal{G}(X))}{n\sqrt{T}} + \frac{c_2 Q \sup_{g\in\mathcal{G}} \lVert g(X)\rVert}{n\sqrt{T}} + \sqrt{\frac{\ln(1/\delta)}{2nT}}, \tag{2}$$

where $c_1$ and $c_2$ are universal constants. $G(\mathcal{G}(X))$ is the Gaussian average defined in (3),

$$G(\mathcal{G}(X)) = \mathbb{E}\Big[\sup_{g\in\mathcal{G}}\sum_{k,t,i}\gamma_{kti}\,g_k(x_i^t)\,\Big|\,x_i^t\Big], \tag{3}$$

where the $\gamma_{kti}$ denote independent standard normal variables. $\sup_{g\in\mathcal{G}}\lVert g(X)\rVert$ can be computed by (4):

$$\sup_{g\in\mathcal{G}}\lVert g(X)\rVert = \sup_{g\in\mathcal{G}}\sqrt{\sum_{k,t,i} g_k(x_i^t)^2}. \tag{4}$$

$Q$ is the quantity

$$Q = \sup_{y\neq y'\in\mathbb{R}^{Kn}} \frac{1}{\lVert y-y'\rVert}\,\mathbb{E}\Big[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\gamma_i\big(f(y_i)-f(y'_i)\big)\Big], \tag{5}$$

where the $\gamma_i$ are independent standard normal variables.

[Figure 1. A deep adversarial MTRL network model. Shared layers feed the task-specific layers of Task 1 through Task T (the original MTRL part) and, via a gradient reversal layer, a task discriminator; forward and backward propagation paths are indicated.]

Adversarial Multi-task Representation Learning. Adversarial MTRL (AMTRL) adds an extra task discriminator to the original MTRL model shown in Figure 1. For each training sample, the discriminator tries to recognize which task the sample belongs to. The loss functions of existing adversarial MTRL methods (Liu et al., 2017; Chen et al., 2018a; Shi et al., 2018; Yu et al., 2018; Liu et al., 2018; Yadav et al., 2018) share a common part,

$$\min_h L(h, \lambda) = L_S(h) + \lambda L_{adv}, \tag{6}$$

where $\lambda$ is a hyperparameter and the adversarial term $L_{adv}$ has the form

$$L_{adv} = \max_{\Phi} \frac{1}{nT}\sum_{t=1}^{T}\sum_{i=1}^{n} e_t^\top \Phi(g(x_i^t)). \tag{7}$$

$\Phi(\cdot) : \mathbb{R}^K \to [0, 1]^T$ is a task discriminator that estimates which task a sample belongs to, and $e_t$ is the vector whose components are all 0 except the $t$-th, which is 1. Objective (6) minimizes the task-averaged empirical risk and enforces the representations of all tasks to share an identical distribution ($\mu_1 = \mu_2 = \cdots = \mu_T$). When all tasks have an identical distribution in the representation space, $L_{adv} = c$, where $c$ is a discriminator-dependent constant. For the widely used softmax-based discriminator, where $\Phi(g(x_i^t)) = \mathrm{softmax}(W^\top g(x_i^t) + b)$ and $W \in \mathbb{R}^{K\times T}$, we have $c = \frac{1}{T}$. Without loss of generality, we can set $L_{adv} := L_{adv} - c$.
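In practice, the inner maximization in (6)-(7) is realized with a gradient reversal layer (Ganin & Lempitsky, 2015), as described later in Section 5.1.2, so that a single backward pass updates both the discriminator and the shared encoder. The sketch below is one hedged way to do this in PyTorch; the helper names, the tensor shapes, and the single-batch estimate of $L_{adv} - c$ are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the sign of the gradient in the
    backward pass (Ganin & Lempitsky, 2015)."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()

def adversarial_term(features, task_ids, discriminator, num_tasks):
    """Single-batch surrogate for L_adv - c in Eqs. (6)-(7).

    features:      g(x) for a batch mixing samples from all tasks, shape (B, K)
    task_ids:      true task index of each sample, shape (B,), dtype long
    discriminator: a module mapping R^K to T logits (softmax applied here)

    The returned scalar is the *negated* shifted true-task probability: gradient
    descent on it pushes the discriminator parameters towards the inner max over
    Phi, while the reversal layer flips the sign once more for the shared
    encoder, which therefore descends the outer objective of Eq. (6).
    """
    reversed_feat = GradReverse.apply(features)
    probs = F.softmax(discriminator(reversed_feat), dim=1)         # Phi(g(x)) in [0,1]^T
    true_task_prob = probs.gather(1, task_ids.unsqueeze(1)).mean()
    return -(true_task_prob - 1.0 / num_tasks)

# Usage sketch (lam is the hyperparameter lambda of Eq. 6):
#   total_loss = task_loss + lam * adversarial_term(feats, task_ids, disc, T)
```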
4. Proposed Methods

4.1. Task-averaged Generalization Error Bound

Assuming that the representations of all tasks share an identical distribution, Corollary 1 gives the task-averaged generalization error bound for AMTRL.

Corollary 1. Assume $\mu_1 = \mu_2 = \cdots = \mu_T$. For $0 < \delta < 1$, with probability at least $1-\delta$ in $S$ we have that

$$L_D(h) - L_S(h) \le \frac{c_1 \rho\, G(\mathcal{G}'(X^1))}{n} + \frac{c_2 Q \sup_{g\in\mathcal{G}'} \lVert g(X^1)\rVert}{n} + \sqrt{\frac{\ln(1/\delta)}{2nT}}, \tag{8}$$

where $c_1$ and $c_2$ are universal constants and $\mathcal{G}' = \{g \in \mathcal{G} : \mu_1 = \mu_2 = \cdots = \mu_T\}$. $G(\mathcal{G}'(X^1))$ is the Gaussian average of task 1 defined in (9),

$$G(\mathcal{G}'(X^1)) = \mathbb{E}\Big[\sup_{g\in\mathcal{G}'}\sum_{k,i}\gamma_{ki}\,g_k(x_i^1)\,\Big|\,x_i^1\Big], \tag{9}$$

where the $\gamma_{ki}$ are independent standard normal variables. $\sup_{g\in\mathcal{G}'}\lVert g(X^1)\rVert$ can be computed by (10):

$$\sup_{g\in\mathcal{G}'}\lVert g(X^1)\rVert = \sup_{g\in\mathcal{G}'}\sqrt{\sum_{k,i} g_k(x_i^1)^2}. \tag{10}$$

$Q$ is the quantity

$$Q = \sup_{y\neq y'\in\mathbb{R}^{Kn}} \frac{1}{\lVert y-y'\rVert}\,\mathbb{E}\Big[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\gamma_i\big(f(y_i)-f(y'_i)\big)\Big], \tag{11}$$

where the $\gamma_i$ denote independent standard normal variables.

Proof. For $\mu_1 = \mu_2 = \cdots = \mu_T$,

$$G(\mathcal{G}'(X)) = \mathbb{E}\Big[\sup_{g\in\mathcal{G}'}\sum_{k,t,i}\gamma_{kti}\,g_k(x_i^t)\,\Big|\,x_i^t\Big] = \sqrt{T}\,\mathbb{E}\Big[\sup_{g\in\mathcal{G}'}\sum_{k,i}\gamma_{ki}\,g_k(x_i^1)\,\Big|\,x_i^1\Big] = \sqrt{T}\,G(\mathcal{G}'(X^1)), \tag{12}$$

$$\sup_{g\in\mathcal{G}'}\lVert g(X)\rVert = \sqrt{T}\,\sup_{g\in\mathcal{G}'}\lVert g(X^1)\rVert. \tag{13}$$

Combining (12) and (13) with Theorem 1 concludes the proof.

The first term of the bound, which can be interpreted as the cost of estimating the representation $g$, is typically of order $1/\sqrt{n}$. The second term, which corresponds to the cost of estimating the task-specific predictors, is also typically of order $1/\sqrt{n}$. The last term contains the confidence parameter. According to Theorem 3 in (Maurer, 2014), $c_1$ and $c_2$ are rather large, so the last term typically makes only a small contribution.

From the properties of the Gaussian average, $\sqrt{T}\,G(\mathcal{G}'(X^1)) \le G(\mathcal{G}(X))$ for $\mathcal{G}' \subseteq \mathcal{G}$. Furthermore, we have $\sqrt{T}\,\sup_{g\in\mathcal{G}'}\lVert g(X^1)\rVert \le \sup_{g\in\mathcal{G}}\lVert g(X)\rVert$. Hence the generalization error bound for AMTRL is tighter than that for MTRL. Moreover, in AMTRL the number of tasks has little influence on the generalization error bound.

4.2. Task Relatedness in Representation Space

The above analysis shows that the similarity of the task distributions in the representation space determines the performance of AMTRL. This similarity is a data-dependent, between-task relatedness. This paper proposes a novel relatedness metric for AMTRL based on the task discriminator to measure the similarity quantitatively. Based on this metric, we are able to visualize the relatedness between tasks during training.

Assume that the discriminator $\Phi(\cdot)$ is the Bayes optimal classifier. We propose to measure the relatedness between task $i$ and task $j$ as

$$R_{ij} = \frac{\Phi_j(g(x^i)) + \Phi_i(g(x^j))}{\Phi_i(g(x^i)) + \Phi_j(g(x^j))}, \tag{14}$$

where $x^i$ and $x^j$ are sampled from $D_i$ and $D_j$ respectively, $g(x^i) \sim \mu_i$ and $g(x^j) \sim \mu_j$. $\Phi_i(\cdot)$ and $\Phi_j(\cdot)$ denote the probabilities that $\Phi(\cdot)$ assigns the input to tasks $i$ and $j$ respectively. $R_{ij} \in [0, 1]$ reflects the similarity between $\mu_i$ and $\mu_j$: $R_{ij}$ equals 1 when $\mu_i$ is the same as $\mu_j$ and equals 0 when $\mu_i$ and $\mu_j$ are totally different. In the Empirical Risk Minimization (ERM) setting, we approximate $R_{ij}$ with (15):

$$R_{ij} = \min\left\{ \frac{\sum_{n=1}^{N} e_j^\top \Phi(g(x_n^i)) + e_i^\top \Phi(g(x_n^j))}{\sum_{n=1}^{N} e_i^\top \Phi(g(x_n^i)) + e_j^\top \Phi(g(x_n^j))},\; 1 \right\}, \tag{15}$$

where $e_t$ is the vector whose components are all 0 except the $t$-th, which is 1.

[Figure 2. Performance of the proposed relatedness measure $R_{ij}$ on three two-dimensional Gaussian distributions. (a) Three tasks with 2-D Gaussian distributions over their representation space; 3,000 samples are used. The means of the Gaussians for tasks 1, 2 and 3 are $[0.2\alpha, 0]$, $[-0.2\alpha, 0]$ and $[0, 0.2\alpha]$ respectively, and all share the identity covariance matrix. (b) The discriminator is a two-layer fully connected network ending with a softmax function. (c) The relatedness $R_{ij}$ between tasks decreases as $\alpha$ increases.]

Figure 2 presents the performance of the proposed relatedness metric in a two-dimensional Gaussian case. It verifies that the metric is sensitive to variations in the similarity between distributions. We then collect the pairwise values into a relatedness matrix

$$R = \begin{pmatrix} R_{11} & R_{12} & \cdots & R_{1T} \\ R_{21} & R_{22} & \cdots & R_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ R_{T1} & R_{T2} & \cdots & R_{TT} \end{pmatrix}. \tag{16}$$

4.3. Adaptive Adversarial MTRL

Motivated by task relatedness and duality, we present an adaptive AMTRL algorithm with a novel weighting strategy in Section 4.3.1 and optimize it with the augmented Lagrangian method in Section 4.3.2.

4.3.1. WEIGHT ADAPTATION

Based on the relatedness matrix, we propose a weighting strategy designed to accelerate the convergence of the adversarial module of AMTRL models. Let $w = (w_1, w_2, \dots, w_T)$ and let $\mathbf{1} = (1, 1, \dots, 1)$ be the $T$-dimensional vector with all components equal to 1. The weighting strategy is used to form the empirical loss of the proposed adaptive AMTRL method,

$$\sum_{t=1}^{T} w_t\, L_{S^t}(f^t \circ g), \tag{17}$$

where

$$w = \frac{1}{\mathbf{1} R \mathbf{1}^\top}\, \mathbf{1} R. \tag{18}$$

Tasks that have a closer relationship with the other tasks in the representation space receive a larger weight. This has an intuitive interpretation: the weighting strategy encourages the tasks to become more similar in the representation space, which meets the constraint of AMTRL. The experimental results in Section 5.2.1 verify this intuition.
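The relatedness matrix (15)-(16) and the adaptive weights (18) can be computed directly from the discriminator's softmax outputs. Below is a small PyTorch sketch under the assumption that per-task batches of representations and the current discriminator are available; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def relatedness_matrix(features_per_task, discriminator):
    """Empirical relatedness matrix R of Eqs. (15)-(16).

    features_per_task: list of T tensors, each holding the representations
                       g(x) of N samples from one task, shape (N, K).
    discriminator:     the task discriminator Phi (logits over T tasks).
    """
    probs = [F.softmax(discriminator(f), dim=1) for f in features_per_task]  # each (N, T)
    T = len(probs)
    R = torch.ones(T, T)
    for i in range(T):
        for j in range(T):
            if i == j:
                continue  # R_ii = 1 by definition
            cross = probs[i][:, j].sum() + probs[j][:, i].sum()  # e_j' Phi(g(x^i)) + e_i' Phi(g(x^j))
            own = probs[i][:, i].sum() + probs[j][:, j].sum()    # e_i' Phi(g(x^i)) + e_j' Phi(g(x^j))
            R[i, j] = torch.clamp(cross / own, max=1.0)          # the min{., 1} in Eq. (15)
    return R

def task_weights(R):
    """Adaptive weights of Eq. (18): w = (1 R) / (1 R 1^T); w sums to 1."""
    col_sums = R.sum(dim=0)            # the row vector 1R
    return col_sums / col_sums.sum()   # normalize by 1 R 1^T
```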
4.3.2. AUGMENTED LAGRANGIAN

Objective (6) can be regarded as the Lagrangian dual function of the following equality-constrained optimization problem (Problem 1).

Problem 1.
$$\min_h L_S(h) \quad \text{s.t.} \quad L_{adv} = 0.$$

In existing adversarial MTRL works, $\lambda$ is tuned manually; this process is highly time-consuming and makes it almost impossible to reach the optimal Lagrange multiplier. An adaptive method that chooses $\lambda$ automatically is therefore desirable. Moreover, an MTL problem such as Problem 1 is usually non-convex, so the solution obtained from Lagrangian duality is in fact not optimal due to the duality gap (Rockafellar, 1974; Hager, 1987). Accordingly, we propose an augmented Lagrangian-based algorithm that dynamically tunes $\lambda$ and reduces the duality gap. The basic idea behind the augmented Lagrangian is to augment the ordinary Lagrangian with a penalty term, which usually has a quadratic form. Combining the proposed weighting strategy with the augmented Lagrangian method, the optimization objective of our adaptive AMTRL method is given in (19):

$$\sum_{t=1}^{T} w_t\, L_{S^t}(f^t \circ g) + \lambda L_{adv} + \frac{r}{2} L_{adv}^2, \tag{19}$$

where $\lambda$ is the Lagrange multiplier and $r > 0$ is the penalty parameter. As $r$ increases, the gap between the value of the primal problem and the value of the dual problem decreases. Following the typical augmented Lagrangian algorithmic framework, $\lambda$ is updated as

$$\lambda_{q+1} = \lambda_q + r_q L_{adv}, \tag{20}$$

with $r_q$ increasing linearly. The full procedure, the adaptive AMTRL algorithm, is shown in Algorithm 1.

Algorithm 1 Adaptive Adversarial MTRL
Input: $S$. Initialize $\lambda_0$, $r_0$, $R_0$.
for $q = 0$ to $N$ do
    $w_q = \frac{1}{\mathbf{1} R_q \mathbf{1}^\top}\, \mathbf{1} R_q$
    Train the AMTRL model with loss (19)
    Update $R_{q+1}$ using (15) with $\Phi_q(\cdot)$
    Compute $\lambda_{q+1}$ using (20); if $\lambda_{q+1} \le 0$, set $\lambda_{q+1} = \lambda_q$
    Choose a new penalty parameter $r_{q+1} > r_q$
end for
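The outer loop of Algorithm 1 can be summarized in a few lines. The sketch below assumes two callables, `train_one_epoch` (one epoch of training with loss (19), returning the epoch-averaged residual $L_{adv}$) and `estimate_relatedness` (the update of $R$ via (15)); both stand in for the actual training code, and the default hyperparameters mirror the values reported in Section 5.1.3.

```python
import torch

def adaptive_amtrl_outer_loop(train_one_epoch, estimate_relatedness, num_tasks,
                              num_epochs=600, lam=1.0, r=10.0, r_step=2.0):
    """Schematic outer loop of Algorithm 1 (augmented Lagrangian, Eqs. 19-20).

    train_one_epoch(w, lam, r) -> epoch-averaged constraint residual L_adv after
                                  one epoch of training with loss (19)
    estimate_relatedness()     -> current relatedness matrix R, via Eq. (15)
    Both callables are assumptions standing in for the actual training code.
    """
    R = torch.ones(num_tasks, num_tasks)      # R_0: a matrix of ones
    for q in range(num_epochs):
        w = R.sum(dim=0) / R.sum()            # adaptive weights, Eq. (18)
        ladv = train_one_epoch(w, lam, r)     # minimize Eq. (19) for one epoch
        R = estimate_relatedness()            # refresh R with the current discriminator
        candidate = lam + r * ladv            # multiplier update, Eq. (20)
        if candidate > 0:                     # keep the multiplier positive (Algorithm 1)
            lam = candidate
        r += r_step                           # penalty parameter increases linearly
    return lam, r, R
```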
5. Experiments

In this section, we perform experimental studies on sentiment analysis and topic classification in order to evaluate the performance of our proposed method and verify our theoretical analysis. The implementation is based on PyTorch (Paszke et al., 2019). The code can be found in the supplementary materials.

5.1. Experimental Setup

5.1.1. DATASETS

Sentiment Analysis. We evaluate our algorithm on product reviews from Amazon. The dataset (Blitzer et al., 2007) (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) contains product reviews from 14 domains, including books, DVDs, electronics, kitchen appliances, etc. We consider each domain as a binary classification task. Reviews with ratings > 3 are labeled positive and those with ratings < 3 are labeled negative; reviews with a rating of 3 are discarded, as their sentiment is ambiguous and difficult to predict. The data are randomly split into 70% training, 10% testing and 20% validation.

Topic Classification. We select 16 newsgroups from the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/), a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups, and formulate them into four 4-class classification tasks (shown in Table 1) to evaluate the performance of our algorithm on topic classification. The data are randomly split into 60% training, 20% testing and 20% validation.

Table 1. Data allocation for the topic classification tasks.

| Task | Newsgroups |
| --- | --- |
| COMP | OS.MS-WINDOWS.MISC, SYS.MAC.HARDWARE, GRAPHICS, WINDOWS.X |
| REC | SPORT.BASEBALL, SPORT.HOCKEY, AUTOS, MOTORCYCLES |
| SCI | CRYPT, ELECTRONICS, MED, SPACE |
| TALK | POLITICS.MIDEAST, RELIGION.MISC, POLITICS.MISC, POLITICS.GUNS |

5.1.2. NETWORK MODEL

We implement our adaptive AMTRL algorithm on the most prevalent deep multi-task representation learning network model, i.e., the hard parameter sharing network model (Caruana, 1997). As shown in Figure 1, all tasks have task-specific output layers and share the representation extraction layers. The shared representation extraction layers are typically built with a feature extraction structure such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN), and the task-specific output layers are typically fully connected layers. In our experiments, either TextCNN (Kim, 2014) or BiLSTM (Hochreiter & Schmidhuber, 1997) is used to build the shared representation extraction layers. The TextCNN module consists of three parallel convolutional layers with kernel sizes of 3, 5 and 7, respectively. The BiLSTM module consists of two bi-directional hidden layers of size 32. The extracted feature representations are concatenated and classified by the task-specific output module, which has one fully connected layer.

The adversarial module is built with one fully connected layer whose output size equals the number of tasks. It is noteworthy that the adversarial module connects to the shared layers via a gradient reversal layer (Ganin & Lempitsky, 2015). This gradient reversal layer multiplies the gradient by -1 during backpropagation, which optimizes the adversarial loss function (7).

5.1.3. TRAINING PARAMETERS

We train the deep AAMTRL network model with Algorithm 1, setting $\lambda_0 = 1$, $r_0 = 10$ and $r_{k+1} = r_k + 2$; $R_0$ is a matrix of ones. We use the Adam optimizer (Kingma & Ba, 2015) and train for 600 epochs for sentiment analysis and 1200 epochs for topic classification. The batch size is 256 for both sentiment analysis and topic classification. We use dropout with probability 0.5 for all task-specific output modules. For all experiments, we search over the set {1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2} of learning rates and choose the model with the highest validation accuracy.
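For concreteness, the following PyTorch sketch mirrors the hard-parameter-sharing architecture of Section 5.1.2 and Figure 1. The vocabulary size, embedding dimension and channel width are illustrative assumptions; the paper fixes only the kernel sizes (3, 5, 7), the one-layer task-specific heads, and the one-layer discriminator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNEncoder(nn.Module):
    """Shared representation layers: three parallel convolutions with kernel
    sizes 3, 5 and 7, max-pooled over time and concatenated."""
    def __init__(self, vocab_size=10000, emb_dim=128, channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, channels, k, padding=k // 2) for k in (3, 5, 7)])

    def forward(self, tokens):                    # tokens: (B, L) word ids
        x = self.embed(tokens).transpose(1, 2)    # (B, emb_dim, L)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)           # shared representation, (B, 3*channels)

class AMTRLModel(nn.Module):
    """Hard-parameter-sharing model of Figure 1: shared encoder, one fully
    connected head per task, and a one-layer task discriminator (which is fed
    through the gradient reversal layer during training)."""
    def __init__(self, num_tasks, num_classes, channels=64):
        super().__init__()
        self.encoder = TextCNNEncoder(channels=channels)
        self.heads = nn.ModuleList(
            [nn.Linear(3 * channels, num_classes) for _ in range(num_tasks)])
        self.discriminator = nn.Linear(3 * channels, num_tasks)

    def forward(self, tokens, task_id):
        feat = self.encoder(tokens)
        return self.heads[task_id](feat), feat
```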
5.2. Results and Analysis

5.2.1. RELATEDNESS EVOLUTION

To evaluate the performance of the adversarial module of AAMTRL, we record the change in the relatedness matrix during training. In this experiment, the TextCNN module is used to extract the representation. The relatedness matrix is summarized by the mean and the variance of $\{R_1, R_2, \dots, R_T\}$, where $R_t$ for $t \in \{1, \dots, T\}$ is the average relatedness of task $t$ defined in (21):

$$R_t = \frac{1}{T}\sum_{k=1}^{T} R_{tk}. \tag{21}$$

Let $R_{mean}$ and $R_{var}$ denote the mean and the variance, respectively. The results for sentiment analysis and topic classification are shown in Figure 3 and Figure 4, respectively.

[Figure 3. Evolution of the relatedness between tasks during training for sentiment analysis. (a) The change in $R_{mean}$ for the original MTRL (Orig MTRL), AAMTRL without the weighting strategy (Uniform AAMTRL) and AAMTRL. (b) The change in $R_{var}$ for Orig MTRL, Uniform AAMTRL and AAMTRL.]

[Figure 4. Evolution of the relatedness between tasks during training for topic classification. (a) The change in $R_{mean}$ for Orig MTRL, Uniform AAMTRL and AAMTRL. (b) The change in $R_{var}$ for Orig MTRL, Uniform AAMTRL and AAMTRL.]

The results show the following:

- The proposed AAMTRL is able to enforce the tasks to share an identical distribution in the representation space.
- The weighting strategy accelerates and smooths the convergence of the adversarial module during training.
- The tasks in sentiment analysis initially have a much closer relationship than those in topic classification.

5.2.2. CLASSIFICATION ACCURACY

We compare our proposed method with two baselines, (i) Single Task, which solves the tasks independently, and (ii) Uniform Scaling, which minimizes a uniformly weighted sum of the loss functions, as well as two state-of-the-art methods: (i) MGDA, which uses the MGDA-UB method proposed by Sener & Koltun (2018), and (ii) Adversarial MTRL, which uses the original adversarial MTL framework proposed by Liu et al. (2017). We report the error rate of each task for sentiment analysis and topic classification in Figure 5 and Figure 6, respectively. The exact numbers can be found in the supplementary materials.

[Figure 5. Radar chart of the error rate for each task in sentiment analysis. (a) Results for MTRL models with TextCNN-based representation extraction layers. (b) Results for MTRL models with BiLSTM-based representation extraction layers.]

[Figure 6. Radar chart of the error rate for each task in topic classification. (a) Results for MTRL models with TextCNN-based representation extraction layers. (b) Results for MTRL models with BiLSTM-based representation extraction layers.]

The results show the following:

- The proposed AAMTRL outperforms the state-of-the-art methods on sentiment analysis and achieves similar performance on topic classification.
- For topic classification, in which the tasks are not closely related (as shown in Figure 4(a)), the MTL strategies do not outperform single-task learning. This shows that the performance of MTL depends on the initial relatedness between the tasks.

5.2.3. INFLUENCE OF THE NUMBER OF TASKS

In this section, we investigate the influence of the number of tasks on the task-averaged risk. We define a relative task-averaged risk with respect to single-task learning (STL) in (22):

$$err_{rel} = \frac{er_{MTL}}{\frac{1}{T}\sum_{t=1}^{T} er_{STL}^{t}}, \tag{22}$$

where $er_{MTL}$ is the task-averaged test error of an MTL model and $er_{STL}^{t}$ is the test error of the STL model for task $t$. The MTL model and the STL models are the best-performing models obtained under our experimental setting, and the MTL model is trained with our AAMTRL algorithm. We carry out this experiment on sentiment analysis; the TextCNN module is used to extract the representation.
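As a quick reference, (22) amounts to the following small computation; the error values in the example are made up purely for illustration.

```python
def relative_task_averaged_risk(mtl_errors, stl_errors):
    """Relative task-averaged risk of Eq. (22): the task-averaged MTL test error
    divided by the average of the single-task (STL) test errors on the same tasks."""
    er_mtl = sum(mtl_errors) / len(mtl_errors)
    er_stl = sum(stl_errors) / len(stl_errors)
    return er_mtl / er_stl

# Illustrative values only: a ratio below 1 means MTL beats single-task learning.
print(relative_task_averaged_risk([0.12, 0.10, 0.15], [0.13, 0.11, 0.14]))  # ~0.97
```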
Figure 7 presents the change in the relative task-averaged risk as the number of tasks grows. Figure 8 presents the variation of the test error of one task (Appeal) when it is learned together with different sets of tasks.

[Figure 7. Change of the relative task-averaged risk with the number of tasks.]

[Figure 8. Variation of the test error for the task (Appeal) when learning with different sets of tasks.]

The results show the following:

- In AMTRL, an increase in the number of tasks does not decrease the task-averaged error.
- For a specific task in AMTRL, learning with more tasks does not guarantee better performance.

These results verify our analysis in Section 4.1.

6. Conclusion

While the performance of AMTRL is attractive, its theoretical mechanism has been largely unexplored. To fill this gap, we analyze the task-averaged generalization error bound for AMTRL. Based on this analysis, we propose a novel AMTRL method, named Adaptive AMTRL, that is designed to improve the performance of existing AMTRL methods. Numerical experiments support our theoretical results and demonstrate the effectiveness of the proposed approach.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant 61976161.

References

Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853, 2005.

Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, 2007.

Caruana, R. Multitask learning. Machine Learning, 28(1):41-75, 1997.

Chen, C., Yang, Y., Zhou, J., Li, X., and Bao, F. S. Cross-domain review helpfulness prediction based on convolutional neural networks with auxiliary domain discriminators. In NAACL, pp. 602-607, 2018a.

Chen, Z., Badrinarayanan, V., Lee, C., and Rabinovich, A. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, pp. 793-802, 2018b.

Collobert, R. and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, pp. 160-167, 2008.

Dwivedi, K. and Roig, G. Representation similarity analysis for efficient task taxonomy & transfer learning. In CVPR, 2019.

Ganin, Y. and Lempitsky, V. S. Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180-1189, 2015.

Hager, W. W. Dual techniques for constrained optimization. Journal of Optimization Theory and Applications, 55(1):37-71, 1987.

Hestenes, M. R. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303-320, 1969.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Kendall, A., Gal, Y., and Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pp. 7482-7491, 2018.

Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, pp. 1746-1751, 2014.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kriegeskorte, N., Mur, M., and Bandettini, P. A. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 2008.

Lin, X., Zhen, H., Li, Z., Zhang, Q., and Kwong, S. Pareto multi-task learning. In NeurIPS, 2019.

Liu, P., Qiu, X., and Huang, X. Adversarial multi-task learning for text classification. In ACL, pp. 1-10, 2017.

Liu, Y., Wang, Z., Jin, H., and Wassell, I. J. Multi-task adversarial network for disentangled feature learning. In CVPR, pp. 3743-3751, 2018.

Mao, Y., Yun, S., Liu, W., and Du, B. Tchebycheff procedure for multi-task text classification. In ACL, 2020.
Maurer, A. A chain rule for the expected suprema of Gaussian processes. In ALT, pp. 245-259, 2014.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. Journal of Machine Learning Research, 17:81:1-81:32, 2016.

McClure, P. and Kriegeskorte, N. Representational distance learning for deep neural networks. Frontiers in Computational Neuroscience, 10:131, 2016.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Rockafellar, R. T. Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM Journal on Control, 12(2):268-285, 1974.

Ruder, S. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098, 2017.

Sener, O. and Koltun, V. Multi-task learning as multi-objective optimization. In NeurIPS, pp. 525-536, 2018.

Shi, G., Feng, C., Huang, L., Zhang, B., Ji, H., Liao, L., and Huang, H. Genre separation network with adversarial training for cross-genre relation extraction. In EMNLP, pp. 1018-1023, 2018.

Yadav, S., Ekbal, A., Saha, S., Bhattacharyya, P., and Sheth, A. P. Multi-task learning framework for mining crowd intelligence towards clinical treatment. In NAACL, pp. 271-277, 2018.

Yu, J., Qiu, M., Jiang, J., Huang, J., Song, S., Chu, W., and Chen, H. Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In WSDM, pp. 682-690, 2018.