# Gradient Surgery for Multi-Task Learning

Tianhe Yu1, Saurabh Kumar1, Abhishek Gupta2, Sergey Levine2, Karol Hausman3, Chelsea Finn1
Stanford University1, UC Berkeley2, Robotics at Google3
tianheyu@cs.stanford.edu

Abstract

While deep learning and deep reinforcement learning (RL) systems have demonstrated impressive results in domains such as image classification, game playing, and robotic control, data efficiency remains a major challenge. Multi-task learning has emerged as a promising approach for sharing structure across multiple tasks to enable more efficient learning. However, the multi-task setting presents a number of optimization challenges, making it difficult to realize large efficiency gains compared to learning tasks independently. The reasons why multi-task learning is so challenging compared to single-task learning are not fully understood. In this work, we identify a set of three conditions of the multi-task optimization landscape that cause detrimental gradient interference, and develop a simple yet general approach for avoiding such interference between task gradients. We propose a form of gradient surgery that projects a task's gradient onto the normal plane of the gradient of any other task that has a conflicting gradient. On a series of challenging multi-task supervised and multi-task RL problems, this approach leads to substantial gains in efficiency and performance. Further, it is model-agnostic and can be combined with previously-proposed multi-task architectures for enhanced performance.

1 Introduction

While deep learning and deep reinforcement learning (RL) have shown considerable promise in enabling systems to learn complex tasks, the data requirements of current methods make it difficult to learn a breadth of capabilities, particularly when all tasks are learned individually from scratch. A natural approach to such multi-task learning problems is to train a network on all tasks jointly, with the aim of discovering shared structure across the tasks in a way that achieves greater efficiency and performance than solving tasks individually. However, learning multiple tasks all at once results in a difficult optimization problem, sometimes leading to worse overall performance and data efficiency compared to learning tasks individually [42, 50]. These optimization challenges are so prevalent that multiple multi-task RL algorithms have considered using independent training as a subroutine of the algorithm before distilling the independent models into a multi-tasking model [32, 42, 50, 21, 56], producing a multi-task model but losing out on the efficiency gains over independent training. If we could tackle the optimization challenges of multi-task learning effectively, we may be able to actually realize the hypothesized benefits of multi-task learning without the cost in final performance.

While there has been a significant amount of research in multi-task learning [6, 49], the optimization challenges are not well understood. Prior work has described varying learning speeds of different tasks [8, 26] and plateaus in the optimization landscape [52] as potential causes, whereas a range of other works have focused on the model architecture [40, 33]. In this work, we instead hypothesize that one of the main optimization issues in multi-task learning arises from gradients from different tasks conflicting with one another in a way that is detrimental to making progress.
We define two gradients to be conflicting if they point away from one another, i.e., have a negative cosine similarity. We hypothesize that such conflict is detrimental when a) conflicting gradients coincide with b) high positive curvature and c) a large difference in gradient magnitudes.

Figure 1: Visualization of PCGrad on a 2D multi-task optimization problem. (a) A multi-task objective landscape. (b) & (c) Contour plots of the individual task objectives that comprise (a). (d) Trajectory of gradient updates on the multi-task objective using the Adam optimizer. The gradient vectors of the two tasks at the end of the trajectory are indicated by blue and red arrows, where the relative lengths are on a log scale. (e) Trajectory of gradient updates on the multi-task objective using Adam with PCGrad. For (d) and (e), the optimization trajectory goes from black to yellow.

As an illustrative example, consider the 2D optimization landscapes of two task objectives in Figure 1a-c. The optimization landscape of each task consists of a deep valley, a property that has been observed in neural network optimization landscapes [22], and the bottom of each valley is characterized by high positive curvature and large differences in the task gradient magnitudes. Under such circumstances, the multi-task gradient is dominated by one task gradient, which comes at the cost of degrading the performance of the other task. Further, due to high curvature, the improvement in the dominating task may be overestimated, while the degradation in performance of the non-dominating task may be underestimated. As a result, the optimizer struggles to make progress on the optimization objective. In Figure 1d, the optimizer reaches the deep valley of task 1, but is unable to traverse the valley in a parameter setting where there are conflicting gradients, high curvature, and a large difference in gradient magnitudes (see gradients plotted in Fig. 1d). In Section 5.3, we find experimentally that this tragic triad also occurs in a higher-dimensional neural network multi-task learning problem.

The core contribution of this work is a method for mitigating gradient interference by altering the gradients directly, i.e. by performing gradient surgery. If two gradients are conflicting, we alter the gradients by projecting each onto the normal plane of the other, preventing the interfering components of the gradient from being applied to the network. We refer to this particular form of gradient surgery as projecting conflicting gradients (PCGrad). PCGrad is model-agnostic, requiring only a single modification to the application of gradients. Hence, it is easy to apply to a range of problem settings, including multi-task supervised learning and multi-task reinforcement learning, and can also be readily combined with other multi-task learning approaches, such as those that modify the architecture. We theoretically prove the local conditions under which PCGrad improves upon standard multi-task gradient descent, and we empirically evaluate PCGrad on a variety of challenging problems, including multi-task CIFAR classification, multi-objective scene understanding, a challenging multi-task RL domain, and goal-conditioned RL.
Across the board, we find PCGrad leads to substantial improvements in terms of data efficiency, optimization speed, and final performance compared to prior approaches, including a more than 30% absolute improvement in multi-task reinforcement learning problems. Further, on multi-task supervised learning tasks, PCGrad can be successfully combined with prior state-of-the-art methods for multi-task learning for even greater performance.

2 Multi-Task Learning with PCGrad

While the multi-task problem can in principle be solved by simply applying a standard single-task algorithm with a suitable task identifier provided to the model, or a simple multi-head or multi-output model, a number of prior works [42, 50, 53] have found this learning problem to be difficult. In this section, we introduce notation, identify possible causes for the difficulty of multi-task optimization, propose a simple and general approach to mitigate it, and theoretically analyze the proposed approach.

2.1 Preliminaries: Problem and Notation

The goal of multi-task learning is to find parameters $\theta$ of a model $f_\theta$ that achieve high average performance across all the training tasks drawn from a distribution of tasks $p(\mathcal{T})$. More formally, we aim to solve the problem $\min_{\theta} \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\left[\mathcal{L}_i(\theta)\right]$, where $\mathcal{L}_i$ is a loss function for the $i$-th task $\mathcal{T}_i$ that we want to minimize. For a set of tasks $\{\mathcal{T}_i\}$, we denote the multi-task loss as $\mathcal{L}(\theta) = \sum_i \mathcal{L}_i(\theta)$, and the gradients of each task as $g_i = \nabla \mathcal{L}_i(\theta)$ for a particular $\theta$. (We drop the dependence on $\theta$ in the notation for brevity.) To obtain a model that solves a specific task from the task distribution $p(\mathcal{T})$, we define a task-conditioned model $f_\theta(y \mid x, z_i)$, with input $x$, output $y$, and encoding $z_i$ for task $\mathcal{T}_i$, which could be provided as a one-hot vector or in any other form.

2.2 The Tragic Triad: Conflicting Gradients, Dominating Gradients, High Curvature

We hypothesize that a key optimization issue in multi-task learning arises from conflicting gradients, where gradients for different tasks point away from one another as measured by a negative inner product. However, conflicting gradients are not detrimental on their own. Indeed, simply averaging task gradients should provide the correct solution to descend the multi-task objective. However, there are conditions under which such conflicting gradients lead to significantly degraded performance. Consider a two-task optimization problem. If the gradient of one task is much larger in magnitude than the other, it will dominate the average gradient. If there is also high positive curvature along the directions of the task gradients, then the improvement in performance from the dominating task may be significantly overestimated, while the degradation in performance from the dominated task may be significantly underestimated. Hence, we can characterize the co-occurrence of three conditions as follows: (a) when gradients from multiple tasks are in conflict with one another, (b) when the difference in gradient magnitudes is large, leading to some task gradients dominating others, and (c) when there is high curvature in the multi-task optimization landscape. We formally define the three conditions below.

Definition 1. We define $\phi_{ij}$ as the angle between two task gradients $g_i$ and $g_j$. We define the gradients as conflicting when $\cos \phi_{ij} < 0$.

Definition 2. We define the gradient magnitude similarity between two gradients $g_i$ and $g_j$ as
$$\Phi(g_i, g_j) = \frac{2 \|g_i\|_2 \|g_j\|_2}{\|g_i\|_2^2 + \|g_j\|_2^2}.$$
When the magnitude of two gradients is the same, this value is equal to 1. As the gradient magnitudes become increasingly different, this value goes to zero.
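For a concrete sense of this measure, take an illustrative pair of gradients (numbers of our own choosing, not from the paper's experiments) with $\|g_i\|_2 = 1$ and $\|g_j\|_2 = 3$:
$$\Phi(g_i, g_j) = \frac{2 \cdot 1 \cdot 3}{1^2 + 3^2} = 0.6,$$
whereas equal norms give exactly 1 and a larger gap, e.g. $\|g_j\|_2 = 10$, gives $\Phi \approx 0.2$.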
Definition 3. We define the multi-task curvature as
$$H(\mathcal{L}; \theta, \theta') = \int_0^1 \nabla \mathcal{L}(\theta)^\top \, \nabla^2 \mathcal{L}\big(\theta + a(\theta' - \theta)\big) \, \nabla \mathcal{L}(\theta) \, da,$$
which is the averaged curvature of $\mathcal{L}$ between $\theta$ and $\theta'$ in the direction of the multi-task gradient $\nabla \mathcal{L}(\theta)$. When $H(\mathcal{L}; \theta, \theta') > C$ for some large positive constant $C$, for model parameters $\theta$ and $\theta'$ at the current and next iteration, we characterize the optimization landscape as having high curvature.

We aim to study the tragic triad and observe the presence of the three conditions through two examples. First, consider the two-dimensional optimization landscape illustrated in Fig. 1a, where the landscape for each task objective corresponds to a deep and curved valley with large curvatures (Fig. 1b and 1c). The optima of this multi-task objective correspond to where the two valleys meet. More details on the optimization landscape are in Appendix D. Particular points of this optimization landscape exhibit the three described conditions, and we observe that the Adam [30] optimizer stalls precisely at one of these points (see Fig. 1d), preventing it from reaching an optimum. This provides some empirical evidence for our hypothesis. Our experiments in Section 5.3 further suggest that this phenomenon occurs in multi-task learning with deep networks. Motivated by these observations, we develop an algorithm that aims to alleviate the optimization challenges caused by conflicting gradients, dominating gradients, and high curvature, which we describe next.

2.3 PCGrad: Project Conflicting Gradients

Our goal is to break one condition of the tragic triad by directly altering the gradients themselves to prevent conflict. In this section, we outline our approach for altering the gradients. In the next section, we will theoretically show that de-conflicting gradients can benefit multi-task learning when dominating gradients and high curvatures are present. To be maximally effective and widely applicable, we aim to alter the gradients in a way that allows for positive interactions between the task gradients and does not introduce assumptions on the form of the model. Hence, when gradients do not conflict, we do not change the gradients. When gradients do conflict, the goal of PCGrad is to modify the gradients for each task so as to minimize negative conflict with other task gradients, which will in turn mitigate under- and over-estimation problems arising from high curvature.

To deconflict gradients during optimization, PCGrad adopts a simple procedure: if the gradients between two tasks are in conflict, i.e. their cosine similarity is negative, we project the gradient of each task onto the normal plane of the gradient of the other task. This amounts to removing the conflicting component of the gradient for the task, thereby reducing the amount of destructive gradient interference between tasks. A pictorial description of this idea is shown in Fig. 2.

Algorithm 1 PCGrad Update Rule
Require: Model parameters $\theta$, task minibatch $B = \{\mathcal{T}_k\}$
1: $g_k \leftarrow \nabla_\theta \mathcal{L}_k(\theta)$ for all $k$
2: $g_k^{PC} \leftarrow g_k$ for all $k$
3: for $\mathcal{T}_i \in B$ do
4:   for $\mathcal{T}_j$ drawn uniformly from $B \setminus \{\mathcal{T}_i\}$ in random order do
5:     if $g_i^{PC} \cdot g_j < 0$ then
6:       // Subtract the projection of $g_i^{PC}$ onto $g_j$
7:       Set $g_i^{PC} = g_i^{PC} - \frac{g_i^{PC} \cdot g_j}{\|g_j\|^2} g_j$
8: return update $\Delta\theta = g^{PC} = \sum_i g_i^{PC}$

Figure 2: Conflicting gradients and PCGrad. In (a), tasks i and j have conflicting gradient directions, which can lead to destructive interference. In (b) and (c), we illustrate the PCGrad algorithm in the case where gradients are conflicting. PCGrad projects task i's gradient onto the normal vector of task j's gradient, and vice versa. Non-conflicting task gradients (d) are not altered under PCGrad, allowing for constructive interaction.
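As a small worked example of the projection step in line 7 (with illustrative numbers of our own choosing), suppose $g_i = (1, 0)$ and $g_j = (-1, 1)$, so that $g_i \cdot g_j = -1 < 0$ and the two gradients conflict. Then
$$g_i^{PC} = g_i - \frac{g_i \cdot g_j}{\|g_j\|_2^2}\, g_j = (1, 0) - \frac{-1}{2}(-1, 1) = \left(\tfrac{1}{2}, \tfrac{1}{2}\right),$$
which is orthogonal to $g_j$: the component of $g_i$ that opposed $g_j$ has been removed, while the remaining component is left intact.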
Suppose the gradient for task $\mathcal{T}_i$ is $g_i$, and the gradient for task $\mathcal{T}_j$ is $g_j$. PCGrad proceeds as follows: (1) First, it determines whether $g_i$ conflicts with $g_j$ by computing the cosine similarity between vectors $g_i$ and $g_j$, where negative values indicate conflicting gradients. (2) If the cosine similarity is negative, we replace $g_i$ by its projection onto the normal plane of $g_j$: $g_i = g_i - \frac{g_i \cdot g_j}{\|g_j\|^2} g_j$. If the gradients are not in conflict, i.e. the cosine similarity is non-negative, the original gradient $g_i$ remains unaltered. (3) PCGrad repeats this process across all of the other tasks sampled in random order from the current batch, i.e. $\mathcal{T}_j$ for all $j \neq i$, resulting in the gradient $g_i^{PC}$ that is applied for task $\mathcal{T}_i$. We perform the same procedure for all tasks in the batch to obtain their respective gradients. The full update procedure is described in Algorithm 1, and a discussion on using a random task order is included in Appendix H.

This procedure, while simple to implement, ensures that the gradients that we apply for each task per batch interfere minimally with the other tasks in the batch, mitigating the conflicting gradient problem and producing a variant on standard first-order gradient descent in the multi-objective setting. In practice, PCGrad can be combined with any gradient-based optimizer, including commonly used methods such as SGD with momentum and Adam [30], by simply passing the computed update to the respective optimizer instead of the original gradient. Our experimental results verify the hypothesis that this procedure reduces the problem of conflicting gradients, and find that, as a result, learning progress is substantially improved.

2.4 Theoretical Analysis of PCGrad

In this section, we theoretically analyze the performance of PCGrad with two tasks:

Definition 4. Consider two task loss functions $\mathcal{L}_1 : \mathbb{R}^n \to \mathbb{R}$ and $\mathcal{L}_2 : \mathbb{R}^n \to \mathbb{R}$. We define the two-task learning objective as $\mathcal{L}(\theta) = \mathcal{L}_1(\theta) + \mathcal{L}_2(\theta)$ for all $\theta \in \mathbb{R}^n$, where $g_1 = \nabla \mathcal{L}_1(\theta)$, $g_2 = \nabla \mathcal{L}_2(\theta)$, and $g = g_1 + g_2$.

We first aim to verify that the PCGrad update corresponds to a sensible optimization procedure under simplifying assumptions. We analyze convergence of PCGrad in the convex setting, under standard assumptions, in Theorem 1. For additional analysis on convergence, including the non-convex setting, with more than two tasks, and with momentum-based optimizers, see Appendices A.1 and A.4.

Theorem 1. Assume $\mathcal{L}_1$ and $\mathcal{L}_2$ are convex and differentiable. Suppose the gradient of $\mathcal{L}$ is $L$-Lipschitz with $L > 0$. Then, the PCGrad update rule with step size $t \leq \frac{1}{L}$ will converge to either (1) a location in the optimization landscape where $\cos(\phi_{12}) = -1$ or (2) the optimal value $\mathcal{L}(\theta^*)$.

Proof. See Appendix A.1.

Theorem 1 states that application of the PCGrad update in the two-task setting with a convex and Lipschitz multi-task loss function $\mathcal{L}$ leads to convergence to either the minimizer of $\mathcal{L}$ or a potentially sub-optimal objective value. A sub-optimal solution occurs when the cosine similarity between the gradients of the two tasks is exactly $-1$, i.e. the gradients directly conflict, leading to zero gradient after applying PCGrad. However, in practice, since we are using SGD, which is a noisy estimate of the true batch gradients, the cosine similarity between the gradients of two tasks in a minibatch is unlikely to be exactly $-1$, thus avoiding this scenario.
Note that, in theory, convergence may be slow if $\cos(\phi_{12})$ hovers near $-1$. However, we don't observe this in practice, as seen in the objective-wise learning curves in Appendix B. Now that we have checked the sensibility of PCGrad, we aim to understand how PCGrad relates to the three conditions in the tragic triad. In particular, we derive sufficient conditions under which PCGrad achieves lower loss after one update. Here, we still analyze the two-task setting, but no longer assume convexity of the loss functions.

Definition 5. We define the multi-task curvature bounding measure
$$\xi(g_1, g_2) = \left(1 - \cos^2 \phi_{12}\right) \frac{\|g_1 - g_2\|_2^2}{\|g_1 + g_2\|_2^2}.$$

With the above definition, we present our next theorem:

Theorem 2. Suppose $\mathcal{L}$ is differentiable and the gradient of $\mathcal{L}$ is Lipschitz continuous with constant $L > 0$. Let $\theta^{MT}$ and $\theta^{PCGrad}$ be the parameters after applying one update to $\theta$ with $g$ and the PCGrad-modified gradient $g^{PC}$ respectively, with step size $t > 0$. Moreover, assume $H(\mathcal{L}; \theta, \theta^{MT}) \geq \ell \|g\|_2^2$ for some constant $\ell \leq L$, i.e. the multi-task curvature is lower-bounded. Then $\mathcal{L}(\theta^{PCGrad}) \leq \mathcal{L}(\theta^{MT})$ if (a) $\cos \phi_{12} \leq -\Phi(g_1, g_2)$, (b) $\ell \geq \xi(g_1, g_2) L$, and (c) $t \geq \frac{2}{\ell - \xi(g_1, g_2) L}$.

Proof. See Appendix A.2.

Intuitively, Theorem 2 implies that PCGrad achieves a lower loss value after a single gradient update compared to standard gradient descent in multi-task learning when (i) the angle between task gradients is not too small, i.e. the two tasks need to conflict sufficiently (condition (a)), (ii) the difference in magnitude needs to be sufficiently large (condition (a)), (iii) the curvature of the multi-task gradient should be large (condition (b)), and (iv) the learning rate should be big enough so that large curvature would lead to overestimation of performance improvement on the dominating task and underestimation of performance degradation on the dominated task (condition (c)). The first three points (i-iii) correspond exactly to the triad of conditions outlined in Section 2.2, while the latter condition (iv) is desirable as we hope to learn quickly. We empirically validate that the first three points, (i-iii), are frequently met in a neural network multi-task learning problem in Figure 4 in Section 5.3. For additional analysis, including complete sufficient and necessary conditions for the PCGrad update to outperform the vanilla multi-task gradient, see Appendix A.3.

3 PCGrad in Practice

We use PCGrad in supervised learning and reinforcement learning problems with multiple tasks or goals. Here, we discuss the practical application of PCGrad to those settings. In multi-task supervised learning, each task $\mathcal{T}_i \sim p(\mathcal{T})$ has a corresponding training dataset $\mathcal{D}_i$ consisting of labeled training examples, i.e. $\mathcal{D}_i = \{(x, y)_n\}$. The objective for each task in this supervised setting is then defined as $\mathcal{L}_i(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_i}\left[-\log f_\theta(y \mid x, z_i)\right]$, where $z_i$ is a one-hot encoding of task $\mathcal{T}_i$. At each training step, we randomly sample a batch of data points $B$ from the whole dataset $\bigcup_i \mathcal{D}_i$ and then group the sampled data with the same task encoding into small batches denoted as $B_i$ for each $\mathcal{T}_i$ represented in $B$. We denote the set of tasks appearing in $B$ as $B_\mathcal{T}$. After sampling, we precompute the gradient of each task in $B_\mathcal{T}$ as $\nabla_\theta \mathcal{L}_i(\theta) = \mathbb{E}_{(x, y) \sim B_i}\left[-\nabla_\theta \log f_\theta(y \mid x, z_i)\right]$. Given the set of precomputed gradients $\nabla_\theta \mathcal{L}_i(\theta)$, we also precompute the cosine similarity between all pairs of the gradients in the set.
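To make this procedure concrete, below is a minimal sketch of the per-batch computation on flattened task-gradient vectors; the function and variable names are ours for illustration (they are not taken from the released implementation), and the small constant guards against a zero-norm gradient.

```python
import numpy as np

def pcgrad_update(task_grads, rng=None):
    """Combine per-task gradients with gradient surgery (cf. Algorithm 1).

    task_grads: list of 1-D NumPy arrays, one flattened gradient per task.
    Returns the summed, de-conflicted gradient to pass to the optimizer.
    """
    rng = rng or np.random.default_rng()
    pc_grads = [g.astype(float).copy() for g in task_grads]

    for i in range(len(task_grads)):
        # Visit the other tasks in random order, as discussed in Appendix H.
        others = [j for j in range(len(task_grads)) if j != i]
        for j in rng.permutation(others):
            g_j = task_grads[j]
            dot = pc_grads[i] @ g_j
            if dot < 0:  # negative cosine similarity: the gradients conflict
                # Remove the component of g_i^PC that points against g_j.
                pc_grads[i] = pc_grads[i] - (dot / (np.linalg.norm(g_j) ** 2 + 1e-12)) * g_j

    # The applied update is the sum of the projected task gradients.
    return sum(pc_grads)
```

In a deep learning framework, each entry of task_grads would be obtained by backpropagating the corresponding task loss through the shared parameters, and the returned vector would be handed to the optimizer in place of the usual summed gradient.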
Using the precomputed gradients and their similarities, we can obtain the PCGrad update by following Algorithm 1, without re-computing task gradients or backpropagating into the network. Since the PCGrad procedure only modifies the gradients of shared parameters in the optimization step, it is model-agnostic and can be applied to any architecture with shared parameters. We empirically validate PCGrad with multiple architectures in Section 5. For multi-task RL and goal-conditioned RL, PCGrad can be readily applied to policy gradient methods by directly updating the computed policy gradient of each task, following Algorithm 1, analogous to the supervised learning setting. For actor-critic algorithms, it is also straightforward to apply PCGrad: we simply replace the task gradients for both the actor and the critic by their gradients computed via PCGrad. For more details on the practical implementation for RL, see Appendix C.

4 Related Work

Algorithms for multi-task learning typically consider how to train a single model that can solve a variety of different tasks [6, 2, 49]. The multi-task formulation has been applied to many different settings, including supervised learning [63, 35, 60, 53, 62] and reinforcement learning [17, 58], as well as many different domains, such as vision [3, 39, 31, 33, 62], language [11, 15, 38, 44] and robotics [45, 59, 25]. While multi-task learning has the promise of accelerating acquisition of large task repertoires, in practice it presents a challenging optimization problem, which has been tackled in several ways in prior work.

A number of architectural solutions have been proposed to the multi-task learning problem based on multiple modules or paths [19, 14, 40, 51, 46, 57], or using attention-based architectures [33, 37]. Our work is agnostic to the model architecture and can be combined with prior architectural approaches in a complementary fashion. A different set of multi-task learning approaches aim to decompose the problem into multiple local problems, often corresponding to each task, that are significantly easier to learn, akin to divide-and-conquer algorithms [32, 50, 42, 56, 21, 13]. Eventually, the local models are combined into a single, multi-task policy using different distillation techniques (outlined in [27, 13]). In contrast to these methods, we propose a simple and cogent scheme for multi-task learning that allows us to learn the tasks simultaneously using a single, shared model without the need for network distillation.

Similarly to our work, a number of prior approaches have observed the difficulty of optimization in multi-task learning [26, 29, 52, 55]. Our work suggests that the challenge in multi-task learning may be attributed to what we describe as the tragic triad of multi-task learning (i.e., conflicting gradients, high curvature, and large gradient differences), which we address directly by introducing a simple and practical algorithm that deconflicts gradients from different tasks. Prior works combat optimization challenges by rescaling task gradients [53, 9]. We alter both the magnitude and direction of the gradient, which we find to be critical for good performance (see Fig. 3). Prior work has also used the cosine similarity between gradients to define when an auxiliary task might be useful [16] or when two tasks are related [55]. We similarly use cosine similarity between gradients to determine if the gradients between a pair of tasks are in conflict. Unlike Du et al.
[16], we use this measure for effective multi-task learning, instead of ignoring auxiliary objectives. Overall, we empirically compare our approach to a number of these prior approaches [53, 9, 55], and observe superior performance with PCGrad. Multiple approaches to continual learning have studied how to prevent gradient updates from adversely affecting previously-learned tasks through various forms of gradient projection [36, 7, 18, 23]. These methods focus on sequential learning settings, and either solve for the gradient projections using quadratic programming [36], only project onto the normal plane of the average gradient of past tasks [7], or project the current task gradients onto the orthonormal set of previous task gradients [18]. In contrast, our work focuses on positive transfer when simultaneously learning multiple tasks, does not require solving a QP, and iteratively projects the gradients of each task instead of averaging or only projecting the current task gradient. Finally, our method is distinct from and solves a different problem than the projected gradient method [5], which is an approach for constrained optimization that projects gradients onto the constraint manifold.

5 Experiments

The goal of our experiments is to study the following questions: (1) Does PCGrad make the optimization problems easier for various multi-task learning problems including supervised, reinforcement, and goal-conditioned reinforcement learning settings across different task families? (2) Can PCGrad be combined with other multi-task learning approaches to further improve performance? (3) Is the tragic triad of multi-task learning a major factor in making optimization for multi-task learning challenging? To broadly evaluate PCGrad, we consider multi-task supervised learning, multi-task RL, and goal-conditioned RL problems. We include the results on goal-conditioned RL in Appendix F. During our evaluation, we tune the parameters of the baselines independently, ensuring that all methods were fairly provided with equal model and training capacity. PCGrad inherits the hyperparameters of the respective baseline method in all experiments, and has no additional hyperparameters. For more details on the experimental set-up and model architectures, see Appendix J. The code is released at https://github.com/tianheyu927/PCGrad.

5.1 Multi-Task Supervised Learning

To answer question (1) in the supervised learning setting and question (2), we perform experiments on five standard multi-task supervised learning datasets: MultiMNIST, CityScapes, CelebA, multi-task CIFAR-100, and NYUv2. We include the results on MultiMNIST and CityScapes in Appendix E.
Table 1: Three-task learning on the NYUv2 dataset: 13-class semantic segmentation, depth estimation, and surface normal prediction results. #P shows the total number of network parameters. The best validation score for each metric is shown in bold; arrows indicate whether higher (↑) or lower (↓) is better. Prior methods: uncertainty weighting [28], DWA and MTAN [33], Cross-Stitch [40]. Performance of other methods as reported in Liu et al. [33].

| #P | Architecture | Weighting | Seg. mIoU (↑) | Seg. Pix Acc (↑) | Depth Abs Err (↓) | Depth Rel Err (↓) | Normal Angle Dist. Mean (↓) | Normal Angle Dist. Median (↓) | Normal Within 11.25° (↑) | Normal Within 22.5° (↑) | Normal Within 30° (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Cross-Stitch | Equal Weights | 14.71 | 50.23 | 0.6481 | 0.2871 | 33.56 | 28.58 | 20.08 | 40.54 | 51.97 |
| | | Uncert. Weights | 15.69 | 52.60 | 0.6277 | 0.2702 | 32.69 | 27.26 | 21.63 | 42.84 | 54.45 |
| | | DWA, T = 2 | 16.11 | 53.19 | 0.5922 | 0.2611 | 32.34 | 26.91 | 21.81 | 43.14 | 54.92 |
| 1.77 | MTAN | Equal Weights | 17.72 | 55.32 | 0.5906 | 0.2577 | 31.44 | 25.37 | **23.17** | 45.65 | 57.48 |
| | | Uncert. Weights | 17.67 | 55.61 | 0.5927 | 0.2592 | 31.25 | 25.57 | 22.99 | 45.83 | 57.67 |
| | | DWA, T = 2 | 17.15 | 54.97 | 0.5956 | 0.2569 | 31.60 | 25.46 | 22.48 | 44.86 | 57.24 |
| 1.77 | MTAN + PCGrad (ours) | Uncert. Weights | **20.17** | **56.65** | **0.5904** | **0.2467** | **30.01** | **24.83** | 22.28 | **46.12** | **58.77** |

Table 2: CIFAR-100 multi-task results. When combined with routing networks, PCGrad leads to a large improvement.

| Method | Accuracy (%) |
|---|---|
| task specific, 1-fc [46] | 42 |
| task specific, all-fc [46] | 49 |
| cross stitch, all-fc [40] | 53 |
| routing, all-fc + WPL [47] | 74.7 |
| independent | 67.7 |
| PCGrad (ours) | 71 |
| routing-all-fc + WPL + PCGrad (ours) | 77.5 |

For CIFAR-100, we follow Rosenbaum et al. [46] and treat the 20 coarse labels in the dataset as distinct tasks, creating a dataset with 20 tasks, with 2500 training instances and 500 test instances per task. We combine PCGrad with a powerful multi-task learning architecture, routing networks [46, 47], by applying PCGrad only to the shared parameters. For the details of this comparison, see Appendix J.1. As shown in Table 2, applying PCGrad to a single network achieves 71% classification accuracy, which outperforms most of the prior methods such as cross-stitch [40] and independent training, suggesting that sharing representations across tasks is conducive to good performance. While routing networks achieve better performance than PCGrad on its own, the two are complementary: combining PCGrad with routing networks leads to a 2.8% absolute improvement in test accuracy.

We also aim to use PCGrad to tackle a multi-label classification problem, which is a commonly used benchmark for multi-task learning. In multi-label classification, given a set of attributes, the model needs to decide whether each attribute describes the input. Hence, it is essentially a binary classification problem for each attribute. We choose the CelebA dataset [34], which consists of 200K face images with 40 attributes. Since each attribute poses a binary classification problem, we convert the dataset into a 40-way multi-task learning problem following [53]. We use the same architecture as in [53], and we use the binary classification error averaged across all 40 tasks to evaluate performance as in [53]. Similar to the MultiMNIST results, we compare PCGrad to Sener and Koltun [53] by rerunning the open-sourced code provided in [53]. As shown in Table 3, PCGrad outperforms Sener and Koltun [53], suggesting that PCGrad is effective in multi-label classification and can also improve multi-task supervised learning performance when the number of tasks is high.

Table 3: CelebA results. We show the average classification error across all 40 tasks in CelebA. PCGrad outperforms the prior method of Sener and Koltun [53] on this dataset.

| Method | Average classification error |
|---|---|
| Sener and Koltun [53] | 8.95 |
| PCGrad (ours) | 8.69 |

Finally, we combine PCGrad with another state-of-the-art multi-task learning algorithm, MTAN [33], and evaluate the performance on a more challenging indoor scene dataset, NYUv2, which contains 3 tasks: 13-class semantic segmentation, depth estimation, and surface normal prediction. We compare MTAN with PCGrad to a list of methods mentioned in Appendix J.1, where each method is trained with three different weighting schemes as in [33]: equal weighting, weight uncertainty [28], and DWA [33].
We only run MTAN with PCGrad with weight uncertainty, as we find weight uncertainty to be the most effective scheme for training MTAN. The results comparing Cross-Stitch, MTAN and MTAN + PCGrad are presented in Table 1, while the full comparison can be found in Table 8 in Appendix J.4. MTAN with PCGrad is able to achieve the best scores in 8 out of the 9 categories, where there are 3 categories per task. Our multi-task supervised learning results indicate that PCGrad can be seamlessly combined with state-of-the-art multi-task learning architectures and further improve their results on established supervised multi-task learning benchmarks. We include more results of PCGrad combined with additional multi-task learning architectures in Appendix I.

Figure 3: For the two plots on the left, we show learning curves on MT10 and MT50 respectively. PCGrad significantly outperforms the other methods in terms of both success rates and data efficiency. In the rightmost plot, we present the ablation study on only using the magnitude and the direction of gradients modified by PCGrad and a comparison to GradNorm [8]. PCGrad outperforms both ablations and GradNorm, indicating the importance of modifying both the gradient directions and magnitudes in multi-task learning.

5.2 Multi-Task Reinforcement Learning

To answer question (1) in the RL setting, we first consider the multi-task RL problem and evaluate our algorithm on the recently proposed Meta-World benchmark [61]. In particular, we test all methods on the MT10 and MT50 benchmarks in Meta-World, which contain 10 and 50 manipulation tasks respectively, shown in Figure 10 in Appendix J.2. The results are shown in the left two plots in Figure 3. PCGrad combined with SAC learns all tasks with the best data efficiency and successfully solves all of the 10 tasks in MT10 and about 70% of the 50 tasks in MT50. Training a single SAC policy or a multi-head policy is unable to acquire half of the skills in either MT10 or MT50, suggesting that eliminating gradient interference across tasks can significantly boost performance of multi-task RL. Training independent SAC agents is able to eventually solve all tasks in MT10 and 70% of the tasks in MT50, but requires about 2 million and 15 million more samples than PCGrad with SAC in MT10 and MT50 respectively, implying that applying PCGrad can result in leveraging shared structure among tasks that expedites multi-task learning. As noted by Yu et al. [61], these tasks involve distinct behavior motions, which makes learning all tasks with a single policy challenging, as demonstrated by the poor baseline performance. The ability to learn these tasks together opens the door for a number of interesting extensions to meta-learning and generalization to novel task families.

Since the PCGrad update affects both the gradient direction and the gradient magnitude, we perform an ablation study that tests two variants of PCGrad: (1) only applying the gradient direction corrected with PCGrad while keeping the gradient magnitude unchanged, and (2) only applying the gradient magnitude computed by PCGrad while keeping the gradient direction unchanged. We further run a direct comparison to GradNorm [8], which also scales only the magnitudes of the task gradients. As shown in the rightmost plot in Figure 3, both variants and GradNorm perform worse than PCGrad, and the variant where we only vary the gradient magnitude is much worse than PCGrad.
This emphasizes the importance of the orientation change, which is particularly notable as multiple prior works only alter gradient magnitudes [8, 53]. We also notice that the variant of PCGrad where only the gradient magnitudes change achieves performance comparable to GradNorm, which suggests that it is important to modify both the gradient directions and magnitudes to eliminate interference and achieve good multi-task learning results. Finally, to test the importance of keeping positive cosine similarities between tasks for positive transfer, we compare PCGrad to a recently proposed method [55] that regularizes the cosine similarities of different task gradients towards 0. PCGrad outperforms Suteu and Guo [55] by a large margin. We leave details of the comparison to Appendix G.

Figure 4: An empirical analysis of the theoretical conditions discussed in Theorem 2, showing the first 100 iterations of training on two RL tasks, reach and press button top. Left: The estimated value of the multi-task curvature. We observe that high multi-task curvatures exist throughout training, providing evidence for condition (b) in Theorem 2. Middle: The solid lines show the percentage of gradients with positive cosine similarity between the two task gradients, while the dotted and dashed lines show the percentage of iterations in which condition (a) and the implication of condition (b) ($\xi(g_1, g_2) \leq 1$) in Theorem 2 hold, respectively, among iterations where the cosine similarity is negative. Right: The average return of each task achieved by SAC and SAC combined with PCGrad. From the middle and right plots, we can tell that condition (a) holds most of the time for both Adam and Adam combined with PCGrad before they have solved Task 2, and as soon as Adam combined with PCGrad starts to learn Task 2, the percentage of iterations in which condition (a) holds starts to decline. This observation suggests that condition (a) is a key factor behind PCGrad excelling in multi-task learning.

5.3 Empirical Analysis of the Tragic Triad

Finally, to answer question (3), we compare the performance of standard multi-task SAC and multi-task SAC with PCGrad. We evaluate each method on two tasks, reach and press button top, in the Meta-World [61] benchmark. In the leftmost plot in Figure 4, we plot the multi-task curvature, which is computed as $H(\mathcal{L}; \theta_t, \theta_{t+1}) = 2\left[\mathcal{L}(\theta_{t+1}) - \mathcal{L}(\theta_t) - \nabla_\theta \mathcal{L}(\theta_t)^\top (\theta_{t+1} - \theta_t)\right]$ by Taylor's Theorem, where $\mathcal{L}$ is the multi-task loss and $\theta_t$ and $\theta_{t+1}$ are the parameters at iterations $t$ and $t+1$. During the training process, the multi-task curvature stays positive and is increasing for both Adam and Adam combined with PCGrad, suggesting that condition (b) in Theorem 2, that the multi-task curvature is lower-bounded by some positive value, holds widely in practice. To further analyze the conditions in Theorem 2 empirically, we plot the percentage of iterations in which condition (a) (i.e. conflicting gradients) and the implication of condition (b) ($\xi(g_1, g_2) \leq 1$) in Theorem 2 hold, among the total number of iterations where the cosine similarity is negative, in the middle plot of Figure 4. Along with the plot on the right in Figure 4, which presents the average return of the two tasks during training, we can see that while Adam and Adam with PCGrad have not yet received reward signal from Task 2, condition (a) and the implication of condition (b) continue to hold, and as soon as Adam with PCGrad begins to solve Task 2, the percentage of iterations in which condition (a) and the implication of condition (b) hold starts to decrease. Such a pattern suggests that conflicting gradients, high curvature, and dominating gradients indeed produce considerable optimization challenges before the multi-task learner gains any useful learning signal, which also implies that the tragic triad may indeed be the determining factor in where PCGrad can lead to performance gains over standard multi-task learning in practice.
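For reference, the curvature estimate plotted in Figure 4 only requires quantities that are already available at two consecutive training iterations; a minimal sketch of the computation is below (the helper name and arguments are ours, purely for illustration).

```python
def multitask_curvature_estimate(loss_t, loss_t1, grad_t, theta_t, theta_t1):
    """Taylor-expansion estimate of the multi-task curvature H(L; theta_t, theta_t1).

    loss_t, loss_t1: scalar multi-task losses at iterations t and t+1.
    grad_t: flattened multi-task gradient at theta_t (NumPy array).
    theta_t, theta_t1: flattened parameter vectors at iterations t and t+1.
    """
    delta = theta_t1 - theta_t
    # Rearranging L(theta_{t+1}) ~= L(theta_t) + grad_t . delta + 0.5 * delta^T H delta
    # isolates the second-order (curvature) term along the actual update direction.
    return 2.0 * (loss_t1 - loss_t - grad_t @ delta)
```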
6 Conclusion

In this work, we identified a set of conditions that underlies major challenges in multi-task optimization: conflicting gradients, high positive curvature, and large gradient differences. We proposed a simple algorithm (PCGrad) to mitigate these challenges via gradient surgery. PCGrad provides a simple solution to mitigating gradient interference that substantially improves optimization performance. We provide simple didactic examples and subsequently show significant improvement in optimization for a variety of multi-task supervised learning and reinforcement learning problems. We show that, when some optimization challenges of multi-task learning are alleviated by PCGrad, we can obtain the hypothesized benefits in efficiency and asymptotic performance of multi-task settings. While we studied multi-task supervised learning and multi-task reinforcement learning in this work, we suspect the problem of conflicting gradients to be prevalent in a range of other settings and applications, such as meta-learning, continual learning, multi-goal imitation learning [10], and multi-task problems in natural language processing applications [38]. Due to its simplicity and model-agnostic nature, we expect applying PCGrad in these domains to be a promising avenue for future investigation. Further, the general idea of gradient surgery may be an important ingredient for alleviating a broader class of optimization challenges in deep learning, such as the stability challenges in two-player games [48] and multi-agent optimization [41]. We believe this work to be a step towards simple yet general techniques for addressing some of these challenges.

Broader Impact

Applications and Benefits. Despite recent successes, current deep learning and deep RL methods mostly focus on tackling a single specific task from scratch. Prior works have proposed methods that can perform multiple tasks, but they often yield comparable or even higher data complexity compared to learning each task individually. Our method enables deep learning systems that mitigate interference between differing tasks and thus achieve data-efficient multi-task learning. Since our method is general and simple to apply to various problems, there are many possible real-world applications, including but not limited to computer vision systems, autonomous driving, and robotics. For computer vision systems, our method can be used to develop algorithms that enable efficient classification, instance and semantic segmentation, and object detection at the same time, which could improve the performance of computer vision systems by reusing features obtained from each task and lead to a leap in real-world domains such as autonomous driving. For robotics, there are many situations where multi-task learning is needed. For example, surgical robots are required to perform a wide range of tasks such as stitching and removing tumours from a patient's body. Kitchen robots should be able to complete multiple chores such as cooking and washing dishes at the same time.
Hence, our work represents a step towards making multi-task reinforcement learning more applicable to those settings.

Risks. However, there are potential risks that apply to all machine learning and reinforcement learning systems, including ours: safety, reward specification in RL (which is often difficult to acquire in the real world), bias in supervised learning systems due to the composition of training data, and compute/data-intensive training procedures. For example, safety issues arise when autonomous driving cars fail to generalize to out-of-distribution data, which can lead to crashing or even hurting people. Moreover, reward specification in RL is generally inaccessible in the real world, making RL hard to scale to real robots. In supervised learning domains, learned models can inherit the bias that exists in the training dataset. Furthermore, training procedures of ML models are generally compute/data-intensive, which causes inequitable access to these models. Our method is not immune to these risks. Hence, we encourage future research to design more robust and safe multi-task RL algorithms that can prevent unsafe behaviors. It is also important to push research in self-supervised and unsupervised multi-task RL in order to resolve the issue of reward specification. For supervised learning, we recommend that researchers publish their trained multi-task learning models to make access to those models equitable to everyone in the field, and develop new datasets that can mitigate biases and also be readily used in multi-task learning.

Acknowledgments and Disclosure of Funding

The authors would like to thank Annie Xie for reviewing an earlier draft of the paper, Eric Mitchell for technical guidance, and Aravind Rajeswaran and Deirdre Quillen for helpful discussions. Tianhe Yu is partially supported by Intel Corporation. Saurabh Kumar is supported by an NSF Graduate Research Fellowship and the Stanford Knight Hennessy Fellowship. Abhishek Gupta is supported by an NSF Graduate Research Fellowship. Chelsea Finn is a CIFAR Fellow in the Learning in Machines and Brains program.

References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2017.
[2] Bart Bakker and Tom Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 2003.
[3] Hakan Bilen and Andrea Vedaldi. Integrated perception with recurrent multi-task neural networks. In Advances in Neural Information Processing Systems, 2016.
[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] Paul H. Calamai and Jorge J. Moré. Projected gradient methods for linearly constrained problems. Mathematical Programming, 39(1), 1987.
[6] Rich Caruana. Multitask learning. Machine Learning, 1997.
[7] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv:1812.00420, 2018.
[8] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv:1711.02257, 2017.
[9] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, 2018.
[10] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In International Conference on Robotics and Automation (ICRA), 2018.
[11] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, 2008.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213-3223, 2016.
[13] Wojciech M. Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant M. Jayakumar, Grzegorz Swirszcz, and Max Jaderberg. Distilling policy distillation. In International Conference on Artificial Intelligence and Statistics, AISTATS, 2019.
[14] Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. CoRR, abs/1609.07088, 2016.
[15] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
[16] Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. CoRR, abs/1812.02224, 2018.
[17] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, 2018.
[18] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. arXiv preprint arXiv:1910.07104, 2019.
[19] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. CoRR, abs/1701.08734, 2017. URL http://arxiv.org/abs/1701.08734.
[20] William Fulton. Eigenvalues, invariant factors, highest weights, and Schubert calculus. Bulletin of the American Mathematical Society, 37(3):209-249, 2000.
[21] Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. CoRR, abs/1711.09874, 2017.
[22] Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014.
[23] Yunhui Guo, Mingrui Liu, Tianbao Yang, and Tajana Rosing. Improved schemes for episodic memory-based lifelong learning, 2020.
[24] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
[25] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. 2018.
[26] Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019.
[27] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[28] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Computer Vision and Pattern Recognition, 2018.
[29] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
[30] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[31] Iasonas Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Computer Vision and Pattern Recognition, 2017.
[32] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1), 2016.
[33] Shikun Liu, Edward Johns, and Andrew J. Davison. End-to-end multi-task learning with attention. CoRR, abs/1803.10704, 2018.
[34] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730-3738, 2015.
[35] Mingsheng Long and Jianmin Wang. Learning multiple tasks with deep relationship networks. arXiv:1506.02117, 2, 2015.
[36] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017.
[37] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. CoRR, abs/1904.08918, 2019.
[38] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730, 2018.
[39] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Computer Vision and Pattern Recognition, 2016.
[40] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Conference on Computer Vision and Pattern Recognition, CVPR, 2016.
[41] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 2009.
[42] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv:1511.06342, 2015.
[43] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
[44] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[45] Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing - solving sparse reward tasks from scratch. arXiv:1802.10567, 2018.
[46] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. International Conference on Learning Representations (ICLR), 2018.
[47] Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger.
Routing networks and the challenges of modular and compositional computation. arXiv:1904.12774, 2019.
[48] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, 2017.
[49] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv:1706.05098, 2017.
[50] Andrei A. Rusu, Sergio Gomez Colmenarejo, Çağlar Gülçehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In International Conference on Learning Representations, ICLR, 2016.
[51] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv:1606.04671, 2016.
[52] Tom Schaul, Diana Borsa, Joseph Modayil, and Razvan Pascanu. Ray interference: a source of plateaus in deep reinforcement learning. arXiv:1904.11455, 2019.
[53] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, 2018.
[54] Yuekai Sun. Notes on first-order methods for minimizing smooth functions, 2015. https://web.stanford.edu/class/msande318/notes/notes-first-order-smooth.pdf.
[55] Mihai Suteu and Yike Guo. Regularizing deep multi-task networks using orthogonal gradients. arXiv preprint arXiv:1912.06844, 2019.
[56] Yee Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
[57] Simon Vandenhende, Bert De Brabandere, and Luc Van Gool. Branched multi-task networks: Deciding what layers to share. CoRR, abs/1904.02920, 2019.
[58] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement learning: A hierarchical Bayesian approach. In International Conference on Machine Learning, 2007.
[59] Markus Wulfmeier, Abbas Abdolmaleki, Roland Hafner, Jost Tobias Springenberg, Michael Neunert, Tim Hertweck, Thomas Lampe, Noah Siegel, Nicolas Heess, and Martin Riedmiller. Regularized hierarchical policies for compositional transfer in robotics. arXiv:1906.11228, 2019.
[60] Yongxin Yang and Timothy M. Hospedales. Trace norm regularised deep multi-task learning. arXiv:1606.04038, 2016.
[61] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta-reinforcement learning. arXiv:1910.10897, 2019.
[62] Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Computer Vision and Pattern Recognition, 2018.
[63] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision. Springer, 2014.