Robust Multi-Task Learning with Excess Risks

Yifei He 1, Shiji Zhou 2, Guojun Zhang 3, Hyokun Yun 4, Yi Xu 4, Belinda Zeng 4, Trishul Chilimbi 4, Han Zhao 1 4

1 Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA. 2 Department of Automation, Tsinghua University, Beijing, China. 3 School of Computer Science, University of Waterloo, Waterloo, ON, Canada. 4 Amazon, Seattle, WA, USA. Correspondence to: Yifei He, Han Zhao.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Multi-task learning (MTL) considers learning a joint model for multiple tasks by optimizing a convex combination of all task losses. To solve the optimization problem, existing methods use an adaptive weight updating scheme, where task weights are dynamically adjusted based on their respective losses to prioritize difficult tasks. However, these algorithms face a significant challenge whenever label noise is present: excessive weights tend to be assigned to noisy tasks that have relatively large Bayes optimal errors, thereby overshadowing other tasks and causing performance to drop across the board. To overcome this limitation, we propose Multi-Task Learning with Excess Risks (ExcessMTL), an excess-risk-based task balancing method that instead updates the task weights by their distances to convergence. Intuitively, ExcessMTL assigns higher weights to worse-trained tasks that are further from convergence. To estimate the excess risks, we develop an efficient and accurate method based on Taylor approximation. Theoretically, we show that our proposed algorithm achieves convergence guarantees and Pareto stationarity. Empirically, we evaluate our algorithm on various MTL benchmarks and demonstrate its superior performance over existing methods in the presence of label noise. Our code is available at https://github.com/yifei-he/ExcessMTL.

1. Introduction

Multi-task learning (MTL) aims to train a single model to perform multiple related tasks (Caruana, 1997). Due to the nature of learning multiple tasks simultaneously, the problem is often tackled by aggregating multiple objectives into a scalar one via a convex combination. Despite various efforts to achieve more balanced training, a crucial aspect that often remains overlooked is robustness to label noise. Label noise is ubiquitous in real-world MTL problems, as the tasks are drawn from diverse sources, introducing variations in data quality (Hsieh & Tseng, 2021; Burgert et al., 2022). The presence of label noise can negatively impact the performance of all jointly trained tasks, leading to suboptimal performance across the board. Addressing this issue is essential for the robust and reliable deployment of MTL models in real-world scenarios.

In this work, we study label noise in MTL by delving into a practical scenario where one or more tasks are contaminated by label noise due to the heterogeneity of data collection processes. Under label noise, scalarization (a static weighted combination) is not robust because it overlooks dataset quality. In fact, scalarization is prone to overfitting to a subset of tasks, so it cannot achieve a balanced solution among tasks, especially for under-parametrized models (Hu et al., 2023). Similarly, existing adaptive weight updating methods are vulnerable to label noise.
These methods aim to prioritize difficult tasks during training, where difficulty is typically measured by the magnitude of each task loss (Chen et al., 2018; Liu et al., 2019; Sagawa et al., 2020; Liu et al., 2021c). Namely, they assign higher weights to the tasks with higher losses. However, a high loss may stem from label noise rather than insufficient training. For instance, if a task has noisy labels, its loss will be high, yet it provides no informative signal for the learning process.

To address this challenge, we propose Multi-Task Learning with Excess Risks (ExcessMTL), which is robust to label noise while retaining the benefit of prioritizing worse-trained tasks. It dynamically adjusts the task weights based on their distance to convergence. Specifically, we define this distance via excess risks, which measure the gap in loss between the current model and the optimal model in the hypothesis class. In the presence of different noise levels among tasks, the converged task-specific losses are likely to differ and are not guaranteed to be low. On the other hand, with proper training, the excess risks of all tasks approach 0. Thus, excess risks capture the true improvement achievable through model training or refinement, making them naturally robust to label noise by definition. The advantage of excess risks over losses as a difficulty measure is illustrated in Figure 1.

Figure 1. Conceptual comparison between ExcessMTL and loss weighting methods. The figure shows a two-task MTL setting, where Task 1 contains label noise while Task 2 does not; thus, the Bayes optimal loss (dashed line) for Task 1 is non-zero. The curve represents the Pareto front, i.e., all points on the curve are Pareto optimal. Loss weighting methods aim to find the solution with equal losses for the two tasks, severely sacrificing the performance of Task 2. In contrast, ExcessMTL finds the solution with equal excess risks, striking a better balance between the two tasks.

Since the optimal loss is usually intractable to compute, we propose an efficient method to estimate excess risks via Taylor approximation. At a high level, ExcessMTL iteratively executes the following steps until convergence: i) estimate the excess risk of each task, ii) update task weights based on their respective excess risks, and iii) perform a gradient update using the weighted sum of losses. The ability to identify the distance to convergence leads to robustness in the presence of label noise: even if one or multiple tasks are highly corrupted by label noise, the overall performance is not compromised.

Theoretically, we derive convergence guarantees for our algorithm and establish connections with multi-objective optimization, proving that the solutions of ExcessMTL are Pareto stationary. Empirically, we evaluate our method on various MTL benchmarks, showing that it outperforms existing adaptive weighting methods in the presence of label noise, even in cases of extreme noise in one or a few tasks. Our results highlight the robustness of excess-risk-based weighting methods, especially when label noise is a concern. Beyond MTL, our insights on the connection between excess risks and distance to convergence can potentially inspire more robust algorithmic design for applications using loss weighting methods.

2. Preliminaries

2.1. Excess Risks
Consider predicting the label y ∈ 𝒴 from the input data x ∈ 𝒳. Let the data be drawn from some distribution P. Given the per-sample loss function L, the risk (or expected loss) of a model θ from the model family Θ is ℓ(θ) = E_{(x,y)∼P}[L(θ; (x, y))]. The risk can be further decomposed as

$$\ell(\theta) = \underbrace{\underbrace{\ell(\theta) - \ell(\theta^*_{\Theta})}_{\text{Estimation error}} + \underbrace{\ell(\theta^*_{\Theta}) - \ell(\theta^*)}_{\text{Approximation error}}}_{\text{Excess risk } (\mathcal{E})} + \underbrace{\ell(\theta^*)}_{\text{Bayes error}},$$

where θ*_Θ = argmin_{θ∈Θ} E_{(x,y)∼P}[L(θ; (x, y))] is the optimal model in the model family Θ and θ* = argmin_θ E_{(x,y)∼P}[L(θ; (x, y))] is the optimal model over all model families. The combination of the estimation error and the approximation error is the excess risk, denoted E. When the model family Θ is expressive enough, the approximation error approaches 0. The Bayes error is irreducible due to stochasticity in the data-generating process (e.g., label noise): if the data-generating process is nondeterministic, one data point can have non-zero probability of belonging to multiple classes. This is an inherent property of a dataset, where high label noise leads to high Bayes error. The excess risk, on the other hand, captures the difference between the risks of the model and the Bayes optimal model, effectively removing the influence of label noise. It can therefore be viewed as a measure of distance to optimality.

The above decomposition highlights that the risk is not an appropriate criterion for evaluating model performance, because it incorporates the irreducible error, which heavily depends on the noise level in the dataset. Since we typically do not control dataset quality, excess risk is a more robust measure of model performance, particularly in the presence of label noise. It allows us to focus on the part of the risk that can be improved through model learning and optimization, making it a reliable metric for assessing the effectiveness of models.
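To make the decomposition concrete, the following self-contained sketch (our own illustration, not from the paper) considers a one-dimensional regression problem: under squared loss, the Bayes risk equals the label-noise variance, while the excess risk is the gap to the best model in the class and vanishes as training succeeds. The target function f(x) = 2x, the linear model class, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: y = f(x) + noise, with f(x) = 2x. Under squared loss, the Bayes
# risk is exactly the noise variance sigma^2, regardless of the model.
sigma = 0.5
x = rng.uniform(-1, 1, size=100_000)
y = 2.0 * x + rng.normal(0.0, sigma, size=x.shape)

def risk(w):
    """Population risk of the linear model w*x, estimated by Monte Carlo."""
    return np.mean((w * x - y) ** 2)

w_current, w_star = 1.2, 2.0   # a partially trained model vs. the class optimum
bayes_risk = sigma ** 2        # irreducible error caused by label noise
excess = risk(w_current) - risk(w_star)

print(f"risk(w_current) = {risk(w_current):.3f}")  # ~ excess + Bayes risk
print(f"Bayes risk      = {bayes_risk:.3f}")       # grows with label noise
print(f"excess risk     = {excess:.3f}")           # -> 0 as w_current -> w_star
```

The point of the demo is that raising sigma inflates both risks by the same amount but leaves the excess risk unchanged, which is exactly the robustness property exploited in the rest of the paper.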
2.2. Multi-Task Learning

In MTL, m ≥ 2 tasks are given. We use α_i to denote the weight of the i-th task and Δ_m to denote the (m−1)-dimensional probability simplex. We study the setting of hard parameter sharing, where a subset of model parameters (θ_sh) is shared across all tasks, while the other parameters (θ_i) are task-specific. The goal of MTL is to find the parameters θ_sh and θ_i that minimize a convex combination of all task-specific losses ℓ_i:

$$\min_{\theta_{sh}, \theta_1, \ldots, \theta_m} \sum_{i=1}^m \alpha_i \ell_i(\theta_{sh}, \theta_i), \tag{1}$$

where α ∈ Δ_m. The weights can be either static (Liu et al., 2021a; Yu et al., 2020) or dynamically computed (Chen et al., 2018; Liu et al., 2019; 2021c; Navon et al., 2022). Despite the wide usage of the weighted combination scheme, it is difficult to define optimality in the MTL setting, because a model may work well on some tasks but perform poorly on others. For more rigorous optimality analysis, MTL can be formulated as a multi-objective optimization problem (Sener & Koltun, 2018; Zhou et al., 2022), where two models can be compared by Pareto dominance.

Definition 2.1 (Pareto dominance). Let L(θ) = {ℓ_i(θ) : i ∈ [m]} be a set of loss functions. For two parameter vectors θ_1 and θ_2, if ℓ_i(θ_1) ≤ ℓ_i(θ_2) for all i ∈ [m] and L(θ_1) ≠ L(θ_2), we say that θ_1 Pareto dominates θ_2, denoted θ_1 ≺ θ_2.

The goal of multi-task learning as multi-objective optimization is to achieve Pareto optimality.

Definition 2.2 (Pareto optimal). A parameter vector θ is Pareto optimal if there exists no parameter θ′ such that θ′ ≺ θ.

There may exist multiple Pareto optimal solutions, and together they constitute the Pareto set. A weaker condition is Pareto stationarity; all Pareto optimal points are Pareto stationary.

Definition 2.3 (Pareto stationary). A parameter θ is Pareto stationary if there exists α ∈ Δ_m such that Σ_{i=1}^m α_i ∇_θ ℓ_i(θ) = 0.

3. Multi-Task Learning with Excess Risks

We now introduce our main algorithm. In Section 3.1, we begin by identifying a key limitation of previous MTL algorithms and outlining our objectives. In Section 3.2, we provide a detailed description of the algorithm. In Section 3.3, we make a conceptual comparison between our proposed algorithm and existing task weighting methods. Finally, in Section 3.4, we present a theoretical analysis of the convergence guarantee and Pareto stationarity of our algorithm.

3.1. Motivations and Objectives

Prior works have demonstrated that effective multi-task learning requires task balancing, i.e., tasks that the current model performs poorly on should be assigned higher weights during training (Guo et al., 2018; Kendall et al., 2018). One popular approach is loss balancing, which assigns high weights to tasks with high losses, in the hope of prioritizing difficult tasks and making training more balanced. However, we argue that the task-specific loss is not a good criterion for task difficulty, as it reflects not only model training but also the quality of the dataset, about which we may have no prior knowledge. For instance, a high loss may stem not from insufficient training but from high label noise. In that case, assigning high weights to the noisy tasks can hurt multi-task performance across the board.

To address this problem, we propose to weigh the tasks by their excess risks, which measure the true performance gap that can be closed by model training. We aim to assign high weights to tasks with high excess risks so that all tasks converge at similar rates. To achieve this goal, we solve the min-max problem

$$\min_{\theta_{sh}, \theta_1, \ldots, \theta_m} \max_{i \in [m]} \mathcal{E}_i(\theta_{sh}, \theta_i), \tag{2}$$

where E_i is the excess risk of the i-th task. Note that this formulation has the same form as the Tchebycheff scalarization function, whose solutions can fully explore the Pareto front of multiple objectives (Bowman Jr, 1976; Steuer & Choo, 1983), a property lacking in linear scalarization (Hu et al., 2023).

3.2. Algorithm

We present our method in Algorithm 1, which consists of three main components: excess risk estimation, multiplicative weight update, and scale processing.

Algorithm 1: ExcessMTL
  Input: step sizes η_α, η_θ; number of tasks m
  Initialize θ_sh^(1) and θ_i^(1) for all i ∈ [m]; α^(1) = [1/m, …, 1/m]
  for t = 1, 2, … do
    for i = 1, …, m do
      Compute the gradient g_i^(t) = ∇_{θ_sh} ℓ_i^(t)(θ_sh, θ_i)
      Estimate the excess risk with Eq. (6) and Eq. (7):
        Ê_i^(t) = g_i^(t)⊤ diag(Σ_{τ=1}^t g_i^(τ) g_i^(τ)⊤)^{-1} g_i^(t)
      Update the weight: α_i^(t+1) = α_i^(t) exp(η_α Ê_i^(t))
      Update the task-specific parameters: θ_i^(t+1) ← θ_i^(t) − η_θ ∇_{θ_i} ℓ_i^(t)(θ_sh, θ_i)
    end for
    Normalize α_i^(t+1) ← α_i^(t+1) / Σ_j α_j^(t+1) for all i
    Update the shared parameters: θ_sh^(t+1) ← θ_sh^(t) − η_θ Σ_i α_i^(t+1) ∇_{θ_sh} ℓ_i^(t)(θ_sh, θ_i)
  end for

Excess risk estimation. For ease of presentation, we overload θ_i to denote the combination of θ_sh and θ_i for task i. In the computation of excess risks, the Bayes optimal loss is generally intractable to compute exactly, so we propose to use a local approximation instead, derived in the remainder of this subsection; a runnable sketch of the overall loop follows.
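The loop below is a minimal PyTorch-style rendering of Algorithm 1 (our own sketch, not the authors' released code). The helpers `estimate_excess_risk` and `update_weights` are hypothetical and are sketched after the derivation below; the model objects, loss functions, batches, and step sizes are assumed inputs, and `sq_grad_hist` is assumed to be a list of zero-initialized tensors, one per task.

```python
import torch

def excessmtl_step(shared, heads, loss_fns, batches, alpha, sq_grad_hist,
                   eta_alpha=0.1, eta_theta=1e-3):
    """One iteration of Algorithm 1 (sketch). `alpha` is the task-weight
    vector on the simplex; `sq_grad_hist[i]` is the running sum of squared
    shared-parameter gradients of task i (the diagonal statistic of Eq. 7)."""
    shared_params = list(shared.parameters())
    task_grads, excess_risks = [], []
    for i, (head, loss_fn, (x, y)) in enumerate(zip(heads, loss_fns, batches)):
        loss = loss_fn(head(shared(x)), y)
        g_sh = torch.autograd.grad(loss, shared_params, retain_graph=True)
        g_head = torch.autograd.grad(loss, list(head.parameters()))
        flat_g = torch.cat([g.reshape(-1) for g in g_sh])
        task_grads.append(flat_g)
        sq_grad_hist[i] += flat_g ** 2                 # accumulate Eq. (7)
        excess_risks.append(estimate_excess_risk(flat_g, sq_grad_hist[i]))
        with torch.no_grad():                          # plain step on the head
            for p, gp in zip(head.parameters(), g_head):
                p -= eta_theta * gp
    alpha = update_weights(alpha, excess_risks, eta_alpha)  # Eq. (9)
    # Shared parameters descend the alpha-weighted combination of gradients.
    weighted = sum(a * g for a, g in zip(alpha, task_grads))
    with torch.no_grad():
        offset = 0
        for p in shared_params:
            n = p.numel()
            p -= eta_theta * weighted[offset:offset + n].reshape(p.shape)
            offset += n
    return alpha
```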
Specifically, we use the second-order Taylor expansion of the task-specific loss ℓ_i at the current parameter θ_i^(t):

$$\ell_i(\theta) = \ell_i(\theta_i^{(t)}) + (\theta - \theta_i^{(t)})^\top g_i^{(t)} + \frac{1}{2}(\theta - \theta_i^{(t)})^\top H_i^{(t)} (\theta - \theta_i^{(t)}) + O\big(\|\theta - \theta_i^{(t)}\|_2^3\big), \tag{3}$$

where g_i^(t) is the gradient and H_i^(t) is the Hessian matrix of ℓ_i at θ_i^(t). Plugging the locally optimal parameter θ_i^* into Eq. (3), we can estimate the excess risk as

$$\mathcal{E}_i(\theta_i^{(t)}) \approx \ell_i(\theta_i^{(t)}) - \ell_i(\theta_i^*) \approx (\theta_i^{(t)} - \theta_i^*)^\top g_i^{(t)} - \frac{1}{2}(\theta_i^{(t)} - \theta_i^*)^\top H_i^{(t)} (\theta_i^{(t)} - \theta_i^*). \tag{4}$$

To obtain the difference between θ_i^(t) and θ_i^*, we use the fact that θ_i^* is locally optimal:

$$\nabla \ell_i(\theta_i^*) \approx g_i^{(t)} + H_i^{(t)} (\theta_i^* - \theta_i^{(t)}) = 0 \;\Longrightarrow\; \theta_i^{(t)} - \theta_i^* \approx \big(H_i^{(t)}\big)^{-1} g_i^{(t)}. \tag{5}$$

Plugging Eq. (5) into Eq. (4) and assuming the second-order partial derivatives are continuous (so that H_i^(t) is symmetric), the first-order term contributes g_i^(t)⊤(H_i^(t))^{-1} g_i^(t) while the quadratic term subtracts half of it, giving

$$\mathcal{E}_i(\theta_i^{(t)}) \approx \frac{1}{2}\, g_i^{(t)\top} \big(H_i^{(t)}\big)^{-1} g_i^{(t)}. \tag{6}$$

The factor 1/2 can be dropped for simplicity. However, the computation of the Hessian matrix is generally intractable, so we estimate it with the diagonal approximation of the empirical Fisher information (Amari, 1998), which forms a diagonal matrix from the accumulation of outer products of historical gradients. This approach has proven useful in various practical settings (Duchi et al., 2011; Kingma & Ba, 2015). Specifically, let g_i^(τ) be the gradient of the i-th task with respect to the model parameter θ_i^(τ) at training step τ; the approximate Hessian at training step t for the i-th task is

$$H_i^{(t)} \approx \mathrm{diag}\Big(\sum_{\tau=1}^t g_i^{(\tau)} g_i^{(\tau)\top}\Big), \tag{7}$$

where diag(·) retains the diagonal of a matrix. The estimation is efficient, as its computational complexity is O(d), where d is the dimension of the parameters.

Multiplicative weight update. After estimating the excess risks, we update the task weights accordingly. We smooth out the hard choice of a single task in problem (2) with a weight vector α ∈ Δ_m, so that problem (2) can be reformulated as

$$\min_{\theta_{sh}, \theta_1, \ldots, \theta_m} \max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i \mathcal{E}_i(\theta_{sh}, \theta_i). \tag{8}$$

As the gradients contain stochastic factors during training, this online learning process calls for the stability of α for algorithmic convergence (Hazan et al., 2016); the one-hot solution of the inner maximization in Equation (8) fluctuates and cannot be applied directly. Following the framework of online mirror descent with entropy regularization, which uses the KL divergence as the Bregman divergence (Hazan et al., 2016), we find the task weights α^(t+1) by

$$\alpha^{(t+1)} = \operatorname*{argmax}_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i \mathcal{E}_i(\theta_{sh}^{(t)}, \theta_i^{(t)}) - \frac{1}{\eta_\alpha} \mathrm{KL}\big(\alpha \,\|\, \alpha^{(t)}\big),$$

which can easily be shown to be equivalent to

$$\alpha_i^{(t+1)} = \frac{\alpha_i^{(t)} \exp\big(\eta_\alpha \mathcal{E}_i(\theta_{sh}^{(t)}, \theta_i^{(t)})\big)}{\sum_{j=1}^m \alpha_j^{(t)} \exp\big(\eta_\alpha \mathcal{E}_j(\theta_{sh}^{(t)}, \theta_j^{(t)})\big)}, \tag{9}$$

where η_α is the step size for the weight update. Note that this update can be viewed as an exponentiated gradient (Kivinen & Warmuth, 1997) step on the convex combination of excess risks. Based on the updated task weights, we compute the loss as a convex combination of the task-specific losses and backpropagate its gradient. The multiplicative weight update provides stability in training, and we show convergence guarantees in Section 3.4.
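Putting Eqs. (6), (7), and (9) together, the two hypothetical helpers used in the loop above could look as follows (a sketch under our own naming; the damping constant `eps` is an assumption we add to keep the diagonal inverse well-defined):

```python
import torch

def estimate_excess_risk(grad, sq_grad_hist, eps=1e-12):
    """Eq. (6) with the diagonal-Fisher Hessian of Eq. (7):
    E_hat = g^T diag(sum_tau g g^T)^{-1} g, computed in O(d) because the
    approximate Hessian is diagonal (the 1/2 factor is dropped)."""
    return torch.sum(grad ** 2 / (sq_grad_hist + eps))

def update_weights(alpha, excess_risks, eta_alpha=0.1):
    """Eq. (9): exponentiated-gradient (multiplicative) update on the simplex.
    Subtracting the max before exponentiating does not change the normalized
    result but avoids overflow, as in a standard softmax."""
    e = torch.stack([torch.as_tensor(r, dtype=alpha.dtype) for r in excess_risks])
    new_alpha = alpha * torch.exp(eta_alpha * (e - e.max()))
    return new_alpha / new_alpha.sum()
```

The relative scaling of the next paragraph would simply divide each estimate by its per-task warm-up value before calling `update_weights`.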
Scale processing. Like losses and gradients, excess risks are sensitive to the scale of the tasks. The issue can manifest in two scenarios. The first is when the type of loss is the same across all tasks, but the input data of each task varies in magnitude; a simple solution is to standardize all input data to zero mean and unit variance. The second is when the tasks employ different losses; for instance, the cosine loss has a maximum value of 1, while the squared loss can be unbounded. To ensure a fair comparison among tasks with varying losses, we propose to compute relative excess risks. We adopt a method similar to that of Chen et al. (2018) and normalize excess risks by dividing the current value by the initial value, ensuring a range from 0 to 1.

3.3. Conceptual Comparison

In this section, we discuss the relationships and distinctions between our approach and prior task-balancing methods at a conceptual level, and we point out the limitations of previous methods, especially in the presence of label noise. Empirical verification of these arguments is presented in Section 4.3.

GradNorm (Chen et al., 2018) enforces similar gradient norms and training rates across all tasks, where the training rate is defined as the ratio of the current and the initial loss. It tends to favor tasks with small gradient norms, overlooking inherent scale differences in task gradients. In the face of label noise, noisy tasks inevitably exhibit a slow training rate due to their high losses. However, GradNorm treats this as insufficient training and assigns high weights to the noisy tasks. Additionally, the task weights are learnable parameters in GradNorm and require additional gradient updates, whereas our algorithm has no such requirement.

MGDA (Sener & Koltun, 2018) formulates multi-task learning as multi-objective optimization and solves it with the classical multiple gradient descent algorithm (Mukai, 1980; Fliege & Svaiter, 2000; Désidéri, 2012), which finds the minimum-norm point within the convex hull of the task gradients. It also tends to favor tasks with small gradient magnitudes due to the nature of the Frank-Wolfe algorithm it employs. Consider two tasks with gradients satisfying ‖g_1‖_2 > ‖g_2‖_2, where the projection of g_1 onto g_2 has a larger norm than g_2. In this case, MGDA will concentrate all weight on task 2 (a numerical check of this example appears at the end of this section). This observation holds in general beyond the two-task example. As the loss landscape of a noisy task tends to be flatter, its gradient magnitude will be smaller, and MGDA will thus favor it. Moreover, the Frank-Wolfe algorithm requires pairwise computation among all tasks, which is prohibitive for large numbers of tasks.

GroupDRO (Sagawa et al., 2020) was initially designed to address the problem of subpopulation shift and has since been extended to multi-task learning (Michel et al., 2021). It aims to optimize the worst task loss by assigning high weights to tasks with high losses. However, in the presence of label noise, tasks with high noise levels exhibit persistently high losses, which leads GroupDRO to assign excessive weight to those tasks, thereby ignoring the other tasks and causing an overall performance decline.

IMTL (Liu et al., 2021c) achieves impartial learning by requiring the gradient update to have equal projection onto each task. In a two-task scenario, this is equivalent to finding the angle bisector of the task-specific gradients, meaning that regardless of the gradient magnitude, each task exerts an equal influence on the final gradient direction. However, when confronted with label noise, the noisy gradient can substantially distort the final gradient direction, damping or even misguiding the training.
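The two-task MGDA example above can be checked numerically. The sketch below (our own illustration) uses the closed-form minimizer of ‖γ g₁ + (1−γ) g₂‖² over γ ∈ [0, 1]; whenever the projection of g₁ onto g₂ exceeds ‖g₂‖, the clipped solution puts all weight on task 2.

```python
import numpy as np

def mgda_two_task_weight(g1, g2):
    """Weight gamma on g1 for the min-norm point in the convex hull
    {gamma * g1 + (1 - gamma) * g2 : gamma in [0, 1]}."""
    diff = g1 - g2
    gamma = np.dot(g2, g2 - g1) / np.dot(diff, diff)  # unconstrained optimum
    return float(np.clip(gamma, 0.0, 1.0))            # project to [0, 1]

g1 = np.array([4.0, 1.0])   # large-magnitude gradient (e.g., a clean task)
g2 = np.array([1.0, 0.5])   # small gradient from a flatter (noisier) landscape
proj = g1 @ g2 / np.linalg.norm(g2)        # projection of g1 onto g2: ~4.02
print(proj > np.linalg.norm(g2))           # True: the stated condition holds
print(mgda_two_task_weight(g1, g2))        # 0.0 -> all weight lands on task 2
```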
3.4. Theoretical Analysis

In this section, we analyze ExcessMTL theoretically and verify its soundness through three key properties: ExcessMTL i) converges at the same rate as MGDA and GroupDRO (Theorem 3.1), ii) converges to Pareto optimal solutions in convex cases (Corollary 3.2), and iii) converges to Pareto stationary solutions in non-convex cases (Theorem 3.3).

Algorithm 1 takes inspiration from the online mirror descent algorithm (Nemirovskij & Yudin, 1983), where the updates of the parameter θ and the weights α correspond to online gradient descent and online exponentiated gradient, respectively. Using established results from the online learning literature (Nemirovski et al., 2009), we show that Algorithm 1 converges at the rate O(1/√t), the same as MGDA and GroupDRO.

Theorem 3.1 (Convergence). Suppose (i) each task-specific loss ℓ_i is L-Lipschitz, (ii) ℓ_i is convex in the model parameter θ, (iii) ℓ_i is bounded by B_ℓ, and (iv) ‖θ‖_2 is bounded by B_θ. At training step t, let θ̄^(1:t) := (1/t) Σ_{τ=1}^t θ^(τ). Then

$$\sum_{i=1}^m \alpha_i^{(t)} \mathcal{E}_i(\bar\theta^{(1:t)}) - \min_{\theta} \max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i \mathcal{E}_i(\theta) \le \sqrt{\frac{10\,(B_\theta^2 L^2 + B_\ell^2 \log m)}{t}}, \tag{10}$$

where m is the number of tasks.

With the convergence analysis, we further deduce that the solution of ExcessMTL is Pareto optimal in the convex setting.

Corollary 3.2 (Pareto Optimality). Under the same conditions as Theorem 3.1, using the weights α output by Algorithm 1, we have

$$\sum_{i=1}^m \alpha_i \mathcal{E}_i(\bar\theta^{(1:t)}) - \sum_{i=1}^m \alpha_i \mathcal{E}_i(\theta^*) \le \sqrt{\frac{10\,(B_\theta^2 L^2 + B_\ell^2 \log m)}{t}}, \tag{11}$$

where θ* is the Pareto optimal solution, i.e., θ* = argmin_θ Σ_{i=1}^m α_i E_i(θ). This condition matches Definition 2.2 because the weights α are positive, so any Pareto improvement over θ* would strictly decrease the weighted sum, contradicting its minimality. The proof is in Appendix A. For non-convex cases, we next provide a stationarity analysis.

Theorem 3.3 (Pareto Stationarity). Suppose each task-specific loss ℓ_i is (i) L-Lipschitz, (ii) G-smooth, and (iii) bounded by B_ℓ. At training step t, using the weights α output by Algorithm 1, we have

$$\min_{k=1,\ldots,t} \mathbb{E}\left[\Big\|\sum_{i=1}^m \alpha_i^{(k)} \nabla \ell_i(\theta^{(k)})\Big\|_2^2\right] \le \mathcal{O}\Big(\frac{1}{\sqrt{t}}\Big) + \mathcal{O}(\sigma), \tag{12}$$

where m is the number of tasks and σ bounds the approximation error between the stochastic subgradient and the gradient of the excess risk, as detailed in Assumption A.3 of Appendix A.

This shows that ExcessMTL converges to a Pareto stationary point at the rate O(1/t^{1/4}), matching the rate of single-objective SGD (Drori & Shamir, 2020). We also show empirically that the algorithm performs well in the non-convex setting in Section 4. The proof is in Appendix B. These results demonstrate the theoretical soundness of our method, as it provably converges to optimal solutions in multi-objective optimization. MGDA and GroupDRO also provably converge to Pareto optimal solutions in the convex setting, while GradNorm and IMTL lack this property, meaning that their solutions can be Pareto dominated.

4. Experiments

The experiments aim to investigate the following questions under different levels of noise injection: i) Does label noise significantly harm MTL performance? ii) Does ExcessMTL perform consistently with its theoretical properties? iii) In the presence of label noise, does ExcessMTL maintain high overall performance? iv) If so, is this achieved by appropriately assigning weights to the noisy tasks? In the subsequent analysis, we provide affirmative answers to all of these questions.

4.1. Datasets

MultiMNIST (Sabour et al., 2017) is a multi-task version of the MNIST dataset. Two randomly selected MNIST images are placed in the top-left and bottom-right corners, respectively, to construct a new image.
Noise is injected into the bottom-right corner task.

Office-Home (Venkateswara et al., 2017) consists of four image classification tasks: artistic images, clip art, product images, and real-world images. It contains 15,500 images over 65 classes. Noise is injected into the product image classification task.

NYUv2 (Silberman et al., 2012) consists of RGB-D indoor images. It contains 3 tasks: semantic segmentation, depth estimation, and surface normal prediction. Noise is injected into the semantic segmentation task.

4.2. Noise Injection Scheme

To replicate the real-world scenario where tasks come from heterogeneous sources of varying quality, we inject label noise into one or more (but not all) tasks within the task batch. For classification problems, we adopt the common practice of introducing symmetric noise (Kim et al., 2019), generated by flipping the true label to any possible label uniformly at random. For regression problems, we introduce additive Gaussian noise (Hu et al., 2020) with variance equal to that of the original regression target; this ensures that the level of noise introduced is consistent with the statistical characteristics of the original data. We vary the noise level by changing the proportion of training data subject to the noise injection procedure, allowing us to measure the impact of increasing levels of noise on the performance of our algorithm. The test data remains clean. In the following sections, we refer to tasks with noise injection as noisy tasks and tasks without noise injection as clean tasks. The proportion of noisy data in a noisy task is referred to as the noise level.
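The two injection recipes described above can be implemented as follows (our own sketch of the stated procedures; whether symmetric flipping may redraw the original class varies across implementations, and here any of the `num_classes` labels can be drawn):

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_symmetric_noise(labels, noise_level, num_classes):
    """Classification: flip each selected label to a class drawn uniformly
    at random (symmetric noise, in the style of Kim et al., 2019)."""
    noisy = labels.copy()
    idx = rng.choice(len(noisy), size=int(noise_level * len(noisy)), replace=False)
    noisy[idx] = rng.integers(0, num_classes, size=len(idx))
    return noisy

def inject_gaussian_noise(targets, noise_level):
    """Regression: add zero-mean Gaussian noise whose variance matches that
    of the regression target, so the noise has the data's own scale."""
    noisy = targets.copy()
    idx = rng.choice(len(noisy), size=int(noise_level * len(noisy)), replace=False)
    noisy[idx] += rng.normal(0.0, noisy.std(), size=len(idx))
    return noisy
```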
4.3. Empirical Analysis and Comparison

In this section, we use MultiMNIST and Office-Home as illustrative examples to analyze the behavior of ExcessMTL in detail and demonstrate its advantage over other task weighting methods.

Excess risk estimation. We present the estimated excess risks on MultiMNIST in Figure 2, using the difference between the current loss and the converged loss as a proxy for the ground-truth excess risk. Initially, the noisy task shows lower excess risk due to its higher Bayes optimal loss. As training proceeds, both tasks converge and their excess risks approach 0. The estimation is up to a constant multiplier, so we scale the estimated value by a constant to align it with the ground truth. Our estimated excess risk closely matches the ground-truth pattern, validating its accuracy.

Figure 2. Excess risks on MultiMNIST with noise level 0.6 (estimated vs. ground-truth excess risk over training epochs, one panel per task). The estimated excess risk closely matches the ground-truth pattern.
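The ground-truth proxy used in Figure 2 can be computed post hoc from a recorded loss trajectory; a minimal sketch follows (our own, with a hypothetical `history` of per-epoch losses):

```python
import numpy as np

def excess_risk_proxy(loss_history):
    """Per-epoch proxy for the ground-truth excess risk: the gap between the
    current loss and the converged (final) loss, which absorbs the task's
    Bayes-optimal level regardless of how noisy the task is."""
    losses = np.asarray(loss_history, dtype=float)
    return losses - losses[-1]

history = [2.30, 1.10, 0.62, 0.45, 0.41, 0.40]  # hypothetical noisy-task losses
print(excess_risk_proxy(history))               # decays to 0 despite the 0.40 floor
```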
Training dynamics and weight assignment. We analyze the training dynamics of the algorithms discussed in Section 3.3 in Figure 3. For each algorithm, we examine the change in task weights and test accuracy as training proceeds. GradNorm and GroupDRO rapidly assign substantial weight to the noisy task due to its high loss. IMTL and MGDA, on the other hand, initially allocate significant weight to the noisy task but eventually converge to nearly uniform weighting. Their emphasis on the noisy task dramatically impacts the performance of the clean task, leading to suboptimal performance. In contrast, ExcessMTL identifies the higher excess risk associated with the clean task at the beginning of training, leading to a higher accumulation of weight on the clean task. Simultaneously, the reduced emphasis on the noisy task helps mitigate overfitting, leading to improved performance when learning with noise. This ability to assign weights based on the proximity to convergence contributes to the superior overall performance of ExcessMTL.

Figure 3. Weight and accuracy on the MultiMNIST dataset with a noise level of 0.8 (clean accuracy, noisy accuracy, and clean-task weight over training epochs). ExcessMTL assigns most weight to the clean task, so its performance is least affected by the injected noise.

Performance comparison. In Figure 4, we present the converged performance of all algorithms with and without significant noise injection. When more noise is injected into task 2, its Bayes optimal loss increases, degrading its single-task performance. A robust MTL algorithm should retain performance close to Bayes optimal regardless of the noise level, i.e., it should reach or surpass the intersection of the single-task performance lines (dashed). Despite performing well in the noise-free scenario, all algorithms except ExcessMTL suffer a performance decrease on task 1 under label noise. The behavior of the loss weighting methods aligns with the expectations of Figure 1: they aim for equal losses across both tasks, resulting in undertraining of task 1.

Figure 4. MultiMNIST loss profile (lower left is better), showing task 1 loss vs. task 2 loss. The left plot has no noise injected; in the right plot, task 2 has 80% noise. With no noise injected, all algorithms achieve ideal performance. With significant noise injected, only ExcessMTL retains performance close to Bayes optimal on both tasks.

Clean validation set. One might expect that a clean validation set would resolve the label noise issue, since validation performance is a more robust metric than training loss. However, because weight selection on the validation set is performed on clean data while the training set still contains noise, it can still assign considerable weight to noisy tasks. To illustrate this possibility, we compare with MOML (Ye et al., 2021), which assigns weights using MGDA on a clean validation set. On the Office-Home dataset, we allocate 20% of the training data as a clean validation set and inject noise into the remainder.

From Figure 5, despite the additional requirement of clean data, MOML remains largely affected by label noise. When the noise level is low, MOML performs worse than MGDA because the validation set contains less gradient information than the training set. When the noise level is high, although MOML slightly improves over MGDA with more consistent weight assignment, the overall performance still degrades significantly. In contrast, ExcessMTL consistently outperforms both baselines, especially at high noise levels.

Figure 5. Comparison with MOML and MGDA (overall accuracy, clean accuracy, and clean-task weight vs. noise level). MGDA and MOML use the same method to select weights, on the training and validation set respectively. Despite more consistent weight assignment than MGDA, MOML fails when the noise level is high, showing that a clean validation set does not alleviate the label noise issue. ExcessMTL outperforms both baselines.

4.4. Benchmark Evaluation

We present results on three MTL benchmarks. For MultiMNIST and Office-Home, we present three plots: the overall performance, the average performance on the clean tasks, and the sum of the task weights assigned to the clean tasks, with the noise level on the x-axis. For NYUv2, we report the performance of each task, since the tasks have different evaluation criteria. Along with the adaptive weighting algorithms discussed in Section 3.3, we also include uniform scalarization as a baseline. The key evaluation criterion for robustness is that the performance on the clean tasks should not be affected even in the face of increasing label noise in the noisy tasks.

Figure 6. Results on the MultiMNIST dataset (overall accuracy, clean accuracy, and clean-task weight vs. noise level). Only ExcessMTL assigns smaller weights to the noisy task, maintaining the clean task accuracy, whereas other methods show decreasing performance.
MultiMNIST. From Figure 6, we can see that for all adaptive weighting methods except ExcessMTL, the performance on the clean task monotonically declines as the noise level increases. The weight plot indicates that this degradation stems from the excessive weights assigned to the noisy task; in extreme cases, the weight on the clean task is 0, effectively disregarding it during optimization. In contrast, ExcessMTL identifies the noisy task and assigns it minimal weight, preserving the performance on the clean task even under high label noise (the flat line in the clean accuracy plot). Surprisingly, uniform scalarization also exhibits resilience to label noise, although it is slightly inferior to ExcessMTL. This can be attributed to MultiMNIST being an easy dataset with abundant data, in the sense that the two tasks interfere little with each other: even if one task is fully corrupted, the clean task provides sufficient information to learn a model. However, the performance of uniform scalarization is inconsistent and varies across datasets, as the following experiments show.

Office-Home. Similar trends can be observed in Figure 7. The performance of ExcessMTL is least affected on the clean tasks, showcasing its robustness. When the noisy task has purely random labels (noise level = 1), all other adaptive methods fail completely. In contrast to the results on MultiMNIST, uniform scalarization is not sufficient to obtain good performance on this real-world dataset, as it is outperformed by most adaptive weighting methods in the noisy setting. This underscores the need for adaptive methods in such scenarios.

Figure 7. Results on the Office-Home dataset with noise in product classification (overall accuracy, clean accuracy, and clean-task weight vs. noise level). The left plot considers all tasks, while the other two consider all tasks except product images. The right plot shows the combined weight of all clean tasks (0.75 for uniform scalarization). ExcessMTL is least affected by label noise and is the only adaptive algorithm assigning small weights to the noisy task.

NYUv2. From Figure 8, we observe that ExcessMTL consistently outperforms the other methods on the two clean tasks. The other adaptive weighting algorithms assign increasing weights to the noisy task as the noise level rises. This phenomenon is particularly evident at high noise levels, confirming our hypothesis that label noise leads to suboptimal performance for all jointly trained tasks. Since the loss magnitude of the surface normal estimation task is the smallest, GroupDRO and uniform scalarization exhibit clear undertraining on it even without noise injection. Through our scale processing method, ExcessMTL evaluates the excess risks on a comparable scale and trains all tasks in a more balanced way.

Figure 8. Results on the NYUv2 dataset with noise in semantic segmentation (pixel accuracy for segmentation, relative error for depth estimation, angle accuracy at 30° for surface normals, and clean-task weight vs. noise level). Higher pixel accuracy for semantic segmentation, lower relative error for depth estimation, and higher angle accuracy for surface normal estimation are desired. Pixel accuracy on the noisy task monotonically decreases, as expected. ExcessMTL consistently achieves the best performance on the clean tasks, while other methods fail dramatically under high label noise.

In summary, the experiments validate our hypothesis that label noise affects not only the noisy tasks but all jointly trained tasks, leading to an overall decline in performance. Compared with other adaptive weighting methods, ExcessMTL consistently achieves high overall performance by appropriately assigning weights to the noisy tasks, thereby mitigating the negative impact of label noise. Notably, ExcessMTL maintains competitive performance with other methods in noise-free scenarios, indicating that its robustness does not come at the cost of performance degradation.

5. Related Work

MTL optimization can be tackled from multiple perspectives. One line of work focuses on network architecture design (Dai et al., 2016; Liu et al., 2019). Another approach is gradient manipulation: Chen et al. (2020) drop gradient components according to the extent of conflict; Yu et al. (2020) remove gradient conflict by projection; Liu et al. (2021a) minimize the average loss while tracking the improvement on the worst task. Researchers have also studied the problem from a game-theoretic perspective (Navon et al., 2022). In this study, our primary focus lies on task weighting methods (Kendall et al., 2018; Chen et al., 2018; Sener & Koltun, 2018; Liu et al., 2021c; Lin et al., 2022). Task weighting is a common strategy employed to prioritize tasks with inferior performance. Existing algorithms in this domain differ in their measures of task difficulty, including homoscedastic uncertainty (Kendall et al., 2018), task-specific performance measures (Guo et al., 2018), norms of gradients and training rates (Chen et al., 2018), or a combination of task-specific losses and gradients (Liu et al., 2021c). Another approach formulates the MTL problem as a multi-objective optimization problem (Sener & Koltun, 2018) and utilizes the classical multiple gradient descent algorithm (MGDA) (Mukai, 1980; Fliege & Svaiter, 2000; Désidéri, 2012) to find Pareto stationary solutions.
While MGDA does not explicitly use gradient norms for task weighting, it implicitly considers them by seeking the minimum-norm point within the convex hull of the task gradients. However, the problem of label noise is often overlooked in existing works, which can render many proposed task difficulty measures inappropriate. For instance, high label noise leads to high losses and potentially small gradient magnitudes, causing the aforementioned algorithms to assign high weights to the noisy tasks and creating an imbalance in training. Our work incorporates the idea of prioritizing difficult tasks while further improving robustness to label noise.

Loss weighting in other applications. Beyond MTL, various machine learning problems employ a similar mechanism of focusing on challenging samples, often using the loss as a difficulty measure. In domain generalization, prior works focus on domains or groups with higher losses to improve generalization (Sagawa et al., 2020; Liu et al., 2021b; Piratla et al., 2021). In hard example mining, researchers use losses to determine whether specific data instances are difficult for the model (Shrivastava et al., 2016; Yuan et al., 2017; Xue et al., 2019). Like MTL, these applications also suffer from the label noise problem. We believe our proposed method can offer a more robust difficulty measure for such applications, paving the way for more reliable solutions across a spectrum of machine learning challenges.

6. Conclusion

In this work, we identify a key limitation of existing adaptive weight updating methods in multi-task learning: their vulnerability to label noise. Building upon this observation, we propose ExcessMTL, a task balancing algorithm based on excess risks. It employs multiplicative weight updates to dynamically adjust the task weights according to their respective distances to convergence. We further show convergence guarantees for the proposed algorithm and establish connections with multi-objective optimization, showing the Pareto stationarity of its solutions. Extensive experiments across diverse MTL benchmarks demonstrate the consistent superiority of our method in the presence of label noise. Beyond multi-task learning, our insights on excess risks and their connection with distance to convergence can potentially inspire more robust algorithmic design in various machine learning applications utilizing loss weighting methods.

Acknowledgements

HZ is partially supported by a research grant from the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE) and a Google Research Scholar Award.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.

Badrinarayanan, V., Kendall, A., and Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481-2495, 2017.

Bowman Jr, V. J. On the relationship of the Tchebycheff norm and the efficient frontier of multiple-criteria objectives. In Multiple Criteria Decision Making: Proceedings of a Conference, Jouy-en-Josas, France, May 21-23, 1975, pp. 76-86. Springer, 1976.

Burgert, T., Ravanbakhsh, M., and Demir, B. On the effects of different types of label noise in multi-label remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 60:1-13, 2022. doi: 10.1109/tgrs.2022.3226371.
Caruana, R. Multitask learning. Machine Learning, 28:41-75, 1997.

Chen, Z., Badrinarayanan, V., Lee, C.-Y., and Rabinovich, A. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 794-803. PMLR, 2018.

Chen, Z., Ngiam, J., Huang, Y., Luong, T., Kretzschmar, H., Chai, Y., and Anguelov, D. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. Advances in Neural Information Processing Systems, 33:2039-2050, 2020.

Dai, J., He, K., and Sun, J. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150-3158, 2016.

Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313-318, 2012.

Drori, Y. and Shamir, O. The complexity of finding stationary points with stochastic gradient descent. In International Conference on Machine Learning, pp. 2658-2667, 2020.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Fliege, J. and Svaiter, B. F. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 51(3):479-494, 2000.

Guo, M., Haque, A., Huang, D.-A., Yeung, S., and Fei-Fei, L. Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270-287, 2018.

Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hsieh, M.-E. and Tseng, V. Boosting multi-task learning through combination of task labels - with applications in ECG phenotyping. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7771-7779, May 2021.

Hu, W., Li, Z., and Yu, D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020.

Hu, Y., Xian, R., Wu, Q., Fan, Q., Yin, L., and Zhao, H. Revisiting scalarization in multi-task learning: A theoretical perspective. Advances in Neural Information Processing Systems, 2023.

Kendall, A., Gal, Y., and Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482-7491, 2018.

Kim, Y., Yim, J., Yun, J., and Kim, J. NLNL: Negative learning for noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 101-110, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.
Lin, B. and Zhang, Y. LibMTL: A Python library for multi-task learning. Journal of Machine Learning Research, 24(209):1-7, 2023.

Lin, B., Ye, F., Zhang, Y., and Tsang, I. W.-H. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. Transactions on Machine Learning Research, 2022.

Liu, B., Liu, X., Jin, X., Stone, P., and Liu, Q. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878-18890, 2021a.

Liu, E. Z., Haghgoo, B., Chen, A. S., Raghunathan, A., Koh, P. W., Sagawa, S., Liang, P., and Finn, C. Just train twice: Improving group robustness without training group information. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 6781-6792. PMLR, 18-24 Jul 2021b.

Liu, L., Li, Y., Kuang, Z., Xue, J.-H., Chen, Y., Yang, W., Liao, Q., and Zhang, W. Towards impartial multi-task learning. In International Conference on Learning Representations, 2021c.

Liu, S., Johns, E., and Davison, A. J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1871-1880, 2019.

Michel, P., Ruder, S., and Yogatama, D. Balancing average and worst-case accuracy in multitask learning. arXiv preprint arXiv:2110.05838, 2021.

Mukai, H. Algorithms for multicriterion optimization. IEEE Transactions on Automatic Control, 25(2):177-186, 1980. doi: 10.1109/TAC.1980.1102298.

Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., and Fetaya, E. Multi-task learning as a bargaining game. Proceedings of Machine Learning Research, 162:16428-16446, 2022.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Nemirovskij, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. 1983.

Piratla, V., Netrapalli, P., and Sarawagi, S. Focus on the common good: Group distributional robustness follows. In International Conference on Learning Representations, 2021.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. Advances in Neural Information Processing Systems, 30, 2017.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks. In International Conference on Learning Representations, 2020.

Sener, O. and Koltun, V. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31, 2018.

Shrivastava, A., Gupta, A. K., and Girshick, R. B. Training region-based object detectors with online hard example mining. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 761-769, 2016.

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from RGB-D images. ECCV (5), 7576:746-760, 2012.

Steuer, R. E. and Choo, E.-U. An interactive weighted Tchebycheff procedure for multiple objective programming. Mathematical Programming, 26:326-344, 1983.

Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018-5027, 2017.
Vijayakumar, S. and Schaal, S. Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), volume 1, pp. 288-293. Morgan Kaufmann, 2000.

Xue, J., Han, J., Zheng, T., Guo, J., and Wu, B. Hard sample mining for the improved retraining of automatic speech recognition, 2019.

Ye, F., Lin, B., Yue, Z., Guo, P., Xiao, Q., and Zhang, Y. Multi-objective meta learning. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=wKf9iSu_TEm.

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824-5836, 2020.

Yuan, Y., Yang, K., and Zhang, C. Hard-aware deeply cascaded embedding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.

Zhou, S., Zhang, W., Jiang, J., Zhong, W., Gu, J., and Zhu, W. On the convergence of stochastic multi-objective gradient manipulation and beyond. Advances in Neural Information Processing Systems, 35:38103-38115, 2022.

A. Proof for Pareto Optimality

Theorem 3.1 (Convergence). Suppose (i) each task-specific loss ℓ_i is L-Lipschitz, (ii) ℓ_i is convex in the model parameter θ, (iii) ℓ_i is bounded by B_ℓ, and (iv) ‖θ‖_2 is bounded by B_θ. At training step t, let θ̄^(1:t) := (1/t) Σ_{τ=1}^t θ^(τ). Then

$$\sum_{i=1}^m \alpha_i^{(t)} \mathcal{E}_i(\bar\theta^{(1:t)}) - \min_{\theta} \max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i \mathcal{E}_i(\theta) \le \sqrt{\frac{10\,(B_\theta^2 L^2 + B_\ell^2 \log m)}{t}}, \tag{10}$$

where m is the number of tasks.

Proof. We directly apply the well-established regret bound for online mirror descent from Nemirovski et al. (2009) to the saddle-point problem

$$\min_{\theta \in \Theta} \max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i f_i(\theta).$$

In our case, f_i(·) is the excess risk of the i-th task, E_i(·). We take inspiration from the proof strategy of Proposition 2 in Sagawa et al. (2020): we first present the bound and then explain why our formulation satisfies all of its conditions.

Assumption A.1. f_i is convex on Θ.

Assumption A.2. Let ξ be a random vector taking values in Ξ. For all i ∈ [m], there exists a function F_i : Θ × Ξ → ℝ such that E_{ξ∼p}[F_i(θ; ξ)] = f_i(θ).

Assumption A.3. For every given θ ∈ Θ and ξ ∈ Ξ, we can compute F_i(θ, ξ) and a subgradient ∇F_i(θ, ξ) such that E_{ξ∼p}[∇F_i(θ, ξ)] = ∇f_i(θ) and E_{ξ∼p}[‖∇F_i(θ, ξ) − ∇f_i(θ)‖] ≤ σ.

Theorem A.4 (Nemirovski et al., 2009). If Assumptions A.1-A.3 hold, at training step t the regret of the online mirror descent algorithm on the saddle-point problem above can be bounded by

$$\max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i f_i(\bar\theta^{(1:t)}) - \min_{\theta \in \Theta} \sum_{i=1}^m \bar\alpha_i^{(1:t)} f_i(\theta) \le \sqrt{\frac{10\,(R_\theta^2 M_{*,\theta}^2 + M_{*,\alpha}^2 \log m)}{t}},$$

where M_{*,θ}^2 bounds ‖Σ_{i=1}^m α_i ∇F_i(θ; ξ)‖^2, M_{*,α}^2 bounds (Σ_{i=1}^m α_i F_i(θ; ξ))^2, and R_θ^2 := c · (max_θ ‖θ‖_θ^2 − min_θ ‖θ‖_θ^2) for a c-strongly-convex norm ‖·‖_θ.

Assumption A.1 is the same as our condition (ii). For Assumption A.2, we can let ξ be the tuple (x, y, i′), where i′ is a task index, and let the distribution p of ξ be a mixture of the task distributions P_i. To make F_i an unbiased estimator of f_i, we construct F_i(θ; (x, y, i′)) := m 1[i = i′] f_i(θ). We can check the validity of this construction:

$$\mathbb{E}_{(x,y,i') \sim p}[F_i(\theta; (x, y, i'))] = \frac{1}{m}\, \mathbb{E}_{P_i}[m f_i(\theta)] = f_i(\theta).$$

For Assumption A.3, similarly,

$$\mathbb{E}_{(x,y,i') \sim p}[\nabla F_i(\theta; (x, y, i'))] = \frac{1}{m}\, \mathbb{E}_{P_i}[m \nabla f_i(\theta)] = \nabla f_i(\theta).$$

According to our conditions (iii) and (iv), we have

$$\Big\|\sum_{i=1}^m \alpha_i \nabla F_i(\theta; (x, y, i'))\Big\|^2 \le m^2 L^2 =: M_{*,\theta}^2, \qquad \Big(\sum_{i=1}^m \alpha_i F_i(\theta; \xi)\Big)^2 \le m^2 B_\ell^2 =: M_{*,\alpha}^2,$$

and

$$c \cdot \big(\max_\theta \|\theta\|_\theta^2 - \min_\theta \|\theta\|_\theta^2\big) = B_\theta^2.$$
Therefore, we obtain

$$\sum_{i=1}^m \alpha_i^{(t)} \mathcal{E}_i(\bar\theta^{(1:t)}) - \min_{\theta \in \Theta} \max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i \mathcal{E}_i(\theta) \le \max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i f_i(\bar\theta^{(1:t)}) - \min_{\theta \in \Theta} \sum_{i=1}^m \bar\alpha_i^{(1:t)} f_i(\theta) \le \sqrt{\frac{10\,(B_\theta^2 L^2 + B_\ell^2 \log m)}{t}}. \tag{13}$$

Corollary 3.2 (Pareto Optimality). Under the same conditions as Theorem 3.1, using the weights α output by Algorithm 1, we have

$$\sum_{i=1}^m \alpha_i \mathcal{E}_i(\bar\theta^{(1:t)}) - \sum_{i=1}^m \alpha_i \mathcal{E}_i(\theta^*) \le \sqrt{\frac{10\,(B_\theta^2 L^2 + B_\ell^2 \log m)}{t}}, \tag{11}$$

where θ* is the Pareto optimal solution, i.e., θ* = argmin_θ Σ_{i=1}^m α_i E_i(θ).

Proof. From Equation (13), it is clear that α^(t) satisfies the above condition, because

$$\sum_{i=1}^m \alpha_i^{(t)} \mathcal{E}_i(\bar\theta^{(1:t)}) - \min_{\theta \in \Theta} \sum_{i=1}^m \alpha_i^{(t)} \mathcal{E}_i(\theta) \le \max_{\alpha \in \Delta_m} \sum_{i=1}^m \alpha_i f_i(\bar\theta^{(1:t)}) - \min_{\theta \in \Theta} \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta) \le \sqrt{\frac{10\,(B_\theta^2 L^2 + B_\ell^2 \log m)}{t}}.$$

B. Proof for Pareto Stationarity

Theorem 3.3 (Pareto Stationarity). Suppose each task-specific loss ℓ_i is (i) L-Lipschitz, (ii) G-smooth, and (iii) bounded by B_ℓ. At training step t, using the weights α output by Algorithm 1, we have

$$\min_{k=1,\ldots,t} \mathbb{E}\left[\Big\|\sum_{i=1}^m \alpha_i^{(k)} \nabla \ell_i(\theta^{(k)})\Big\|_2^2\right] \le \mathcal{O}\Big(\frac{1}{\sqrt{t}}\Big) + \mathcal{O}(\sigma), \tag{12}$$

where m is the number of tasks, and σ corresponds to the bounded approximation error between the subgradient and the gradient of the excess risk detailed in Assumption A.3 of Appendix A.

Before proving Theorem 3.3, we first present the necessary lemmas. For simplicity, we write F_i(θ^(t); ξ) as F_i(θ^(t)) in what follows. Without loss of generality, we assume that E_i(θ^(t)) is bounded by B_ℓ.

Lemma B.1. Under the same assumptions as in Theorem 3.3, selecting nonincreasing η_θ^(t) ≤ min{1/B_ℓ, 1/(Gm)}, we have

$$\mathbb{E}_\xi\left[\Big\langle \sum_{i=1}^m \alpha_i^{(t)} \nabla f_i(\theta^{(t)}),\; \sum_{i=1}^m \alpha_i^{(t)} \nabla F_i(\theta^{(t)}) \Big\rangle\right] \ge \mathbb{E}_\xi\left[\Big\|\sum_{i=1}^m \alpha_i^{(t)} \nabla f_i(\theta^{(t)})\Big\|_2^2\right] - 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Proof. We first decompose the inner product into

$$\Big\langle \sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)}),\; \sum_i \alpha_i^{(t)} \nabla F_i(\theta^{(t)}) \Big\rangle = \Big\|\sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)})\Big\|_2^2 - \Big\langle \sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)}),\; \sum_i \alpha_i^{(t)} \big(\nabla f_i(\theta^{(t)}) - \nabla F_i(\theta^{(t)})\big) \Big\rangle.$$

The second term can potentially corrupt the effectiveness of optimization, while the first measures the descent value; we next bound the second term. Adding and subtracting E[α_i^(t)] and expanding, it splits into three parts:

$$\text{term A} = \mathbb{E}_\xi\Big\langle \sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)}),\; \sum_i \big(\alpha_i^{(t)} - \mathbb{E}[\alpha_i^{(t)}]\big)\big(\nabla f_i(\theta^{(t)}) - \nabla F_i(\theta^{(t)})\big) \Big\rangle,$$

$$\text{term B} = \mathbb{E}_\xi\Big\langle \sum_i \big(\alpha_i^{(t)} - \mathbb{E}[\alpha_i^{(t)}]\big) \nabla f_i(\theta^{(t)}),\; \sum_i \mathbb{E}[\alpha_i^{(t)}]\big(\nabla f_i(\theta^{(t)}) - \nabla F_i(\theta^{(t)})\big) \Big\rangle,$$

$$\text{term C} = \mathbb{E}_\xi\Big\langle \sum_i \mathbb{E}[\alpha_i^{(t)}] \nabla f_i(\theta^{(t)}),\; \sum_i \mathbb{E}[\alpha_i^{(t)}]\big(\nabla f_i(\theta^{(t)}) - \nabla F_i(\theta^{(t)})\big) \Big\rangle.$$

We bound each term individually, using α_i^(t) ∈ [0, 1] and ‖∇f_i(θ^(t))‖ ≤ L for i = 1, …, m. By the Cauchy-Schwarz inequality and the fact that ab ≤ a²/(2β^(t)) + β^(t) b²/2 for any β^(t) > 0,

$$\text{term A} \le \frac{L}{2\beta^{(t)}} \sum_{i=1}^m \mathbb{E}_\xi\big[|\alpha_i^{(t)} - \mathbb{E}[\alpha_i^{(t)}]|^2\big] + \frac{L \beta^{(t)}}{2} \sum_{i=1}^m \mathbb{E}_\xi\big[\|\nabla f_i(\theta^{(t)}) - \nabla F_i(\theta^{(t)})\|_2^2\big].$$

For the first sum, |α_i^(t) − α_i^(t+1)| ≤ (exp(η_α^(t) E_i(θ^(t))) − 1) α_i^(t) ≤ 2 η_α^(t) E_i(θ^(t)), where the last step uses exp(x) ≤ 2x + 1 for x ∈ [0, 1], together with η_α^(t) E_i(θ^(t)) ∈ [0, 1] and α_i^(t) ∈ [0, 1]. Setting β^(t) = 2 η_α^(t) B_ℓ / σ then yields

$$\text{term A} \le 2 m L \sigma \eta_\alpha^{(t)} B_\ell.$$
With similar arguments, for term B, by the Cauchy-Schwarz inequality, the triangle inequality for the ℓ_2 norm (so that ‖Σ_i (α_i^(t) − E[α_i^(t)]) ∇f_i(θ^(t))‖_2 ≤ L Σ_i |α_i^(t) − E[α_i^(t)]|), the fact that ab ≤ a²/(2β^(t)) + β^(t) b²/2, and E[α_i^(t)] ≤ 1, we obtain

$$\text{term B} \le \frac{L}{2\beta^{(t)}} \sum_{i=1}^m \mathbb{E}_\xi\big[|\alpha_i^{(t)} - \mathbb{E}[\alpha_i^{(t)}]|^2\big] + \frac{m L \beta^{(t)}}{2} \sum_{i=1}^m \mathbb{E}_\xi\big[\|\nabla f_i(\theta^{(t)}) - \nabla F_i(\theta^{(t)})\|_2^2\big],$$

and setting β^(t) = 2 η_α^(t) B_ℓ / (σ√m) gives

$$\text{term B} \le 2 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

For term C, since only ∇F_i(θ^(t)) is random and E_ξ[∇F_i(θ^(t))] = ∇f_i(θ^(t)),

$$\text{term C} = \mathbb{E}_\xi\Big\langle \sum_i \mathbb{E}[\alpha_i^{(t)}] \nabla f_i(\theta^{(t)}),\; \sum_i \mathbb{E}[\alpha_i^{(t)}]\big(\nabla f_i(\theta^{(t)}) - \mathbb{E}_\xi[\nabla F_i(\theta^{(t)})]\big) \Big\rangle = 0.$$

Summing up the above results bounds the error term by 4 m^{3/2} L σ η_α^(t) B_ℓ. Plugging this into the decomposition at the beginning, we finally get

$$\mathbb{E}_\xi\left[\Big\langle \sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)}),\; \sum_i \alpha_i^{(t)} \nabla F_i(\theta^{(t)}) \Big\rangle\right] \ge \mathbb{E}_\xi\left[\Big\|\sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)})\Big\|_2^2\right] - 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Lemma B.2. Under the same assumptions as in Theorem 3.3, selecting nonincreasing η_θ^(t) ≤ min{1/B_ℓ, 1/(Gm)}, we have

$$\mathbb{E}_\xi\left[\Big\|\sum_{i=1}^m \alpha_i^{(t)} \nabla F_i(\theta^{(t)})\Big\|_2^2\right] \le \mathbb{E}_\xi\left[\Big\|\sum_{i=1}^m \alpha_i^{(t)} \nabla f_i(\theta^{(t)})\Big\|_2^2\right] + m\sigma^2 + 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Proof. We first decompose the difference of the expectations into

$$\mathbb{E}_\xi\Big[\Big\|\sum_i \alpha_i^{(t)} \nabla F_i(\theta^{(t)})\Big\|_2^2\Big] - \mathbb{E}_\xi\Big[\Big\|\sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)})\Big\|_2^2\Big] = \underbrace{\mathbb{E}_\xi\Big[\Big\|\sum_i \alpha_i^{(t)} \big(\nabla F_i(\theta^{(t)}) - \nabla f_i(\theta^{(t)})\big)\Big\|_2^2\Big]}_{\text{term A}} + \underbrace{2\,\mathbb{E}_\xi\Big[\Big\langle \sum_i \alpha_i^{(t)} \nabla f_i(\theta^{(t)}),\; \sum_i \alpha_i^{(t)}\big(\nabla F_i(\theta^{(t)}) - \nabla f_i(\theta^{(t)})\big) \Big\rangle\Big]}_{\text{term B}}.$$

For term A, by the triangle inequality for the ℓ_2 norm and the fact that Σ_i α_i^(t) = 1 with α_i^(t) ∈ [0, 1] for i = 1, …, m,

$$\text{term A} \le \sum_{i=1}^m \mathbb{E}_\xi\big[\|\nabla F_i(\theta^{(t)}) - \nabla f_i(\theta^{(t)})\|_2^2\big] \le m\sigma^2.$$

For term B, from the proof of Lemma B.1, and since the minus sign does not affect the bound on the magnitude,

$$\text{term B} \le 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Combining the results proves the claim.

Lemma B.3. Under the same assumptions as Theorem 3.3, selecting nonincreasing η_θ^(t) ≤ min{1/B_ℓ, 1/(Gm)}, we have

$$\frac{\eta_\theta^{(t)}}{2}\, \mathbb{E}_\xi\left[\Big\|\sum_{i=1}^m \alpha_i^{(t)} \nabla f_i(\theta^{(t)})\Big\|_2^2\right] \le \mathbb{E}\Big[\sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)})\Big] - \mathbb{E}\Big[\sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)})\Big] + 6 m^{3/2} L \sigma \eta_\alpha^{(t)} \eta_\theta^{(t)} B_\ell + \frac{G \big(\eta_\theta^{(t)}\big)^2 m \sigma^2}{2}.$$

Proof. From the G-smoothness of each objective function, we have

$$\alpha_i^{(t)} f_i(\theta^{(t+1)}) \le \alpha_i^{(t)} f_i(\theta^{(t)}) - \eta_\theta^{(t)} \alpha_i^{(t)} \Big\langle \nabla f_i(\theta^{(t)}),\; \sum_{j=1}^m \alpha_j^{(t)} \nabla F_j(\theta^{(t)}) \Big\rangle + \frac{G \big(\eta_\theta^{(t)}\big)^2}{2}\, \alpha_i^{(t)} \Big\|\sum_{j=1}^m \alpha_j^{(t)} \nabla F_j(\theta^{(t)})\Big\|_2^2.$$

Summing both sides over i = 1, …, m and taking the expectation over the random variable ξ,
For term C, by the fact that only $F_i(\theta^{(t)})$ carries randomness in $\xi$, we get

$$\text{term C} = \mathbb{E}_\xi \Big\langle \sum_{i=1}^m \mathbb{E}[\alpha_i^{(t)}] \nabla f_i(\theta^{(t)}), \sum_{i=1}^m \mathbb{E}[\alpha_i^{(t)}] \big( \nabla f_i(\theta^{(t)}) - F_i(\theta^{(t)}) \big) \Big\rangle = \Big\langle \sum_{i=1}^m \mathbb{E}[\alpha_i^{(t)}] \nabla f_i(\theta^{(t)}), \sum_{i=1}^m \mathbb{E}[\alpha_i^{(t)}] \big( \nabla f_i(\theta^{(t)}) - \mathbb{E}_\xi[F_i(\theta^{(t)})] \big) \Big\rangle = 0.$$

By summing up the above results, we obtain

$$\mathbb{E}_\xi \Big\langle \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}), \sum_{i=1}^m \alpha_i^{(t)} \big( \nabla f_i(\theta^{(t)}) - F_i(\theta^{(t)}) \big) \Big\rangle \le 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Plugging this into the first decomposition at the beginning, we finally get

$$\mathbb{E}_\xi \Big\langle \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}), \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) \Big\rangle \ge \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 - 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Lemma B.2. Under the same assumptions as Theorem 3.3, selecting nonincreasing $\eta_\theta^{(t)} \le \min\{1/B_\ell, 1/(Gm)\}$, we have the following inequality:

$$\mathbb{E}_\xi \Big\| \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) \Big\|_2^2 - \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 \le m \sigma^2 + 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Proof. We first decompose the expectation into

$$\mathbb{E}_\xi \Big\| \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) \Big\|_2^2 - \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 = \underbrace{\mathbb{E}_\xi \Big\| \sum_{i=1}^m \alpha_i^{(t)} \big( \nabla f_i(\theta^{(t)}) - F_i(\theta^{(t)}) \big) \Big\|_2^2}_{\text{term A}} + \underbrace{2 \, \mathbb{E}_\xi \Big\langle \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}), \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) - \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\rangle}_{\text{term B}}.$$

We next analyze term A. By the triangle inequality of the $\ell_2$ norm, and by the fact that $\sum_{i=1}^m \alpha_i^{(t)} = 1$ with $\alpha_i^{(t)} \in [0, 1]$ for all $i = 1, \dots, m$, we know that

$$\text{term A} \le \mathbb{E}_\xi \Big[ \Big( \sum_{i=1}^m \alpha_i^{(t)} \big\| \nabla f_i(\theta^{(t)}) - F_i(\theta^{(t)}) \big\|_2 \Big)^2 \Big] \le \sum_{i=1}^m \mathbb{E}_\xi \Big[ \big\| \nabla f_i(\theta^{(t)}) - F_i(\theta^{(t)}) \big\|_2^2 \Big] \le m \sigma^2.$$

We then analyze term B. From the proof of Lemma B.1, and by the fact that the minus sign does not affect the inequality, we know that

$$\text{term B} = 2 \, \mathbb{E}_\xi \Big\langle \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}), \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) - \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\rangle \le 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Combining all the results, we obtain

$$\mathbb{E}_\xi \Big\| \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) \Big\|_2^2 - \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 \le m \sigma^2 + 4 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell.$$

Lemma B.3. Under the same assumptions as Theorem 3.3, selecting nonincreasing $\eta_\theta^{(t)} \le \min\{1/B_\ell, 1/(Gm)\}$, we have the following inequality:

$$\frac{\eta_\theta^{(t)}}{2} \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 \le \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] + 6 m^{3/2} L \sigma \eta_\alpha^{(t)} \eta_\theta^{(t)} B_\ell + \frac{G (\eta_\theta^{(t)})^2 m \sigma^2}{2}.$$

Proof. From the $G$-smoothness of each objective function, we have

$$\alpha_i^{(t)} f_i(\theta^{(t+1)}) \le \alpha_i^{(t)} f_i(\theta^{(t)}) - \eta_\theta^{(t)} \alpha_i^{(t)} \Big\langle \nabla f_i(\theta^{(t)}), \sum_{j=1}^m \alpha_j^{(t)} F_j(\theta^{(t)}) \Big\rangle + \frac{G \alpha_i^{(t)} (\eta_\theta^{(t)})^2}{2} \Big\| \sum_{j=1}^m \alpha_j^{(t)} F_j(\theta^{(t)}) \Big\|_2^2.$$

Summing both sides over $i = 1, \dots, m$ and taking the expectation over the random variable $\xi$, we can get

$$\mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] \le -\eta_\theta^{(t)} \mathbb{E}_\xi \Big\langle \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}), \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) \Big\rangle + \frac{G (\eta_\theta^{(t)})^2}{2} \mathbb{E}_\xi \Big\| \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) \Big\|_2^2,$$

where the last inequality is by the assumption that $\sum_{i=1}^m \alpha_i^{(t)} \le 1$. From the result of Lemma B.1, we can bound the first term and obtain

$$\mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] \le 4 m^{3/2} L \sigma \eta_\alpha^{(t)} \eta_\theta^{(t)} B_\ell - \eta_\theta^{(t)} \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 + \frac{G (\eta_\theta^{(t)})^2}{2} \mathbb{E}_\xi \Big\| \sum_{i=1}^m \alpha_i^{(t)} F_i(\theta^{(t)}) \Big\|_2^2.$$

Then, adopting the result from Lemma B.2, we know that

$$\mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] \le 4 m^{3/2} L \sigma \eta_\alpha^{(t)} \eta_\theta^{(t)} B_\ell - \Big( \eta_\theta^{(t)} - \frac{G (\eta_\theta^{(t)})^2}{2} \Big) \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 + \frac{G (\eta_\theta^{(t)})^2 m \sigma^2}{2} + 2 m^{3/2} L G \sigma (\eta_\theta^{(t)})^2 \eta_\alpha^{(t)} B_\ell.$$

By the fact that $\eta_\theta^{(t)} \le 1/(Gm)$, we further have

$$\mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] \le 4 m^{3/2} L \sigma \eta_\alpha^{(t)} \eta_\theta^{(t)} B_\ell + 2 m^{1/2} L \sigma \eta_\alpha^{(t)} \eta_\theta^{(t)} B_\ell - \frac{\eta_\theta^{(t)}}{2} \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 + \frac{G (\eta_\theta^{(t)})^2 m \sigma^2}{2}.$$

By rearrangement, we therefore have

$$\frac{\eta_\theta^{(t)}}{2} \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 \le \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] + 6 m^{3/2} L \sigma \eta_\alpha^{(t)} \eta_\theta^{(t)} B_\ell + \frac{G (\eta_\theta^{(t)})^2 m \sigma^2}{2}.$$

Lemma B.4. Under the same assumptions as Theorem 3.3, selecting nonincreasing $\eta_\theta^{(t)} \le 1/(Gm)$, we have

$$\sum_{t=1}^T \frac{1}{\eta_\theta^{(t)}} \Big( \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] \Big) \le \sum_{t=1}^T \frac{2 m \eta_\alpha^{(t)} B_\ell^2}{\eta_\theta^{(t)}} + \frac{2 m B_\ell}{\eta_\theta^{(T)}}.$$

Proof. First, we can decompose the left-hand side as

$$\sum_{t=1}^T \mathbb{E}\Big[ \frac{1}{\eta_\theta^{(t)}} \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) - \frac{1}{\eta_\theta^{(t)}} \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] = \sum_{t=2}^T \mathbb{E}\Big[ \frac{1}{\eta_\theta^{(t)}} \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) - \frac{1}{\eta_\theta^{(t-1)}} \sum_{i=1}^m \alpha_i^{(t-1)} f_i(\theta^{(t)}) \Big] + \mathbb{E}\Big[ \frac{1}{\eta_\theta^{(1)}} \sum_{i=1}^m \alpha_i^{(1)} f_i(\theta^{(1)}) - \frac{1}{\eta_\theta^{(T)}} \sum_{i=1}^m \alpha_i^{(T)} f_i(\theta^{(T+1)}) \Big].$$

Then we have the following decomposition for the first term:

$$\mathbb{E}\Big[ \frac{1}{\eta_\theta^{(t)}} \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) - \frac{1}{\eta_\theta^{(t-1)}} \sum_{i=1}^m \alpha_i^{(t-1)} f_i(\theta^{(t)}) \Big] = \mathbb{E}\Big[ \frac{1}{\eta_\theta^{(t)}} \sum_{i=1}^m \big( \alpha_i^{(t)} - \alpha_i^{(t-1)} \big) f_i(\theta^{(t)}) \Big] + \mathbb{E}\Big[ \Big( \frac{1}{\eta_\theta^{(t)}} - \frac{1}{\eta_\theta^{(t-1)}} \Big) \sum_{i=1}^m \alpha_i^{(t-1)} f_i(\theta^{(t)}) \Big] \le \frac{2 m \eta_\alpha^{(t)} B_\ell}{\eta_\theta^{(t)}} \max_i \mathbb{E}\big[ |f_i(\theta^{(t)})| \big] + \Big( \frac{1}{\eta_\theta^{(t)}} - \frac{1}{\eta_\theta^{(t-1)}} \Big) \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t-1)} f_i(\theta^{(t)}) \Big].$$

Combining the above, and by the assumptions that $|f_i(\theta^{(t)})| \le B_\ell$ and $\alpha_i^{(t)} \in [0, 1]$, as well as the nonincreasing $\eta_\theta^{(t)}$ (so the middle differences telescope), we therefore have

$$\sum_{t=1}^T \frac{1}{\eta_\theta^{(t)}} \Big( \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] \Big) \le \sum_{t=1}^T \frac{2 m \eta_\alpha^{(t)} B_\ell^2}{\eta_\theta^{(t)}} + \sum_{t=2}^T \Big( \frac{1}{\eta_\theta^{(t)}} - \frac{1}{\eta_\theta^{(t-1)}} \Big) m B_\ell + \frac{m B_\ell}{\eta_\theta^{(1)}} + \frac{m B_\ell}{\eta_\theta^{(T)}} \le \sum_{t=1}^T \frac{2 m \eta_\alpha^{(t)} B_\ell^2}{\eta_\theta^{(t)}} + \frac{2 m B_\ell}{\eta_\theta^{(T)}}.$$

We can now complete the proof of the theorem. Dividing both sides of Lemma B.3 by $\eta_\theta^{(t)}/2$, we get

$$\mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 \le \frac{2}{\eta_\theta^{(t)}} \Big( \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] \Big) + 12 m^{3/2} L \sigma \eta_\alpha^{(t)} B_\ell + G \eta_\theta^{(t)} m \sigma^2.$$

Then, summing the above inequality from $t = 1$ to $t = T$ and applying the result of Lemma B.4, we have

$$\sum_{t=1}^T \mathbb{E}_\xi \Big\| \nabla \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big\|_2^2 \le \sum_{t=1}^T \frac{2}{\eta_\theta^{(t)}} \Big( \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t)}) \Big] - \mathbb{E}\Big[ \sum_{i=1}^m \alpha_i^{(t)} f_i(\theta^{(t+1)}) \Big] \Big) + 12 m^{3/2} L \sigma B_\ell \sum_{t=1}^T \eta_\alpha^{(t)} + G m \sigma^2 \sum_{t=1}^T \eta_\theta^{(t)} \le \sum_{t=1}^T \frac{4 m \eta_\alpha^{(t)} B_\ell^2}{\eta_\theta^{(t)}} + \frac{4 m B_\ell}{\eta_\theta^{(T)}} + 12 m^{3/2} L \sigma B_\ell \sum_{t=1}^T \eta_\alpha^{(t)} + G m \sigma^2 \sum_{t=1}^T \eta_\theta^{(t)}.$$

By selecting $\eta_\theta^{(t)} = 2 B_\ell / (\sigma \sqrt{TG})$ and $\eta_\alpha^{(t)} \le \sqrt{mG} / (12 L B_\ell \sqrt{T})$, and then averaging the above inequality over $t$, we prove the theorem.

C. More Experimental Details and Results

C.1. Implementation

We implement our algorithm using hard parameter sharing, where all tasks share a feature extractor and have task-specific heads. For the feature extractors, we use a two-layer CNN for MultiMNIST, ResNet-18 (He et al., 2016) for Office-Home, and SegNet (Badrinarayanan et al., 2017) for NYUv2. For the task-specific heads, we use two-layer CNNs for NYUv2 and MLPs for all other datasets. We standardize all datasets to ensure zero mean and unit variance, as excess risks are sensitive to the scale of the tasks.
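To make this hard-parameter-sharing setup concrete, the following PyTorch-style sketch shows one shared extractor feeding per-task heads. The module and layer sizes here are illustrative placeholders, not the exact configurations used in our experiments (which are detailed below).

```python
import torch.nn as nn

# Minimal sketch of hard parameter sharing: one shared feature
# extractor feeds m task-specific heads. The MLP trunk stands in for
# the CNN / ResNet-18 / SegNet extractors used in the experiments.
class HardSharingMTL(nn.Module):
    def __init__(self, in_dim, feat_dim, task_out_dims):
        super().__init__()
        # Shared trunk: parameters updated by all task losses.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # One linear head per task, as used for all datasets except NYUv2.
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, d) for d in task_out_dims]
        )

    def forward(self, x):
        z = self.encoder(x)                       # shared representation
        return [head(z) for head in self.heads]   # one output per task

# Example: a SARCOS-like setup with 21 inputs and 7 regression tasks.
model = HardSharingMTL(in_dim=21, feat_dim=128, task_out_dims=[1] * 7)
```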
The details are as follows. It is a well-known fact that overparametrized models can achieve zero training error even on a dataset of pure random noise, which means the training loss keeps decreasing rather than plateauing, even when substantial label noise is present. To address this issue, we use weight decay in the experiments on SARCOS, Office-Home, and NYUv2, where we employ overparametrized models. We use the Adam optimizer and ReLU activations on all datasets. For easier direct comparison across different model types, we use constant learning rates instead of adaptive ones. The experiments are run on NVIDIA RTX A6000 GPUs.

For all datasets except NYUv2, we use linear layers as the task-specific heads. On MultiMNIST, we use a two-layer CNN with kernel size 5, followed by one fully connected layer with 80 hidden units, as the feature extractor, trained with a learning rate of 1e-3. Since the model is small, we do not apply any regularization. On Office-Home, we use a ResNet-18 (without pretraining) as the shared feature extractor, trained with a weight decay of 1e-5 and a learning rate of 1e-4. On SARCOS, we use a three-layer MLP with 128 hidden units as the shared feature extractor, trained with a weight decay of 1e-2 and a learning rate of 1e-3. On NYUv2, we follow the implementation of Liu et al. (2019). Since the dataset is small, making it prone to overfitting, we use data augmentation as suggested by Liu et al. (2019). For the feature extractor, we use the SegNet architecture with four convolutional encoder layers and four convolutional decoder layers. For each of the three tasks, we use two convolutional layers as the task-specific head. We use a weight decay of 1e-3 and a learning rate of 1e-4.

When performing the scale processing mentioned in Section 3.2, directly dividing by the initial excess risk can be unstable, since it heavily depends on the initialization of the network, which can be arbitrarily poor. To address this, we deploy a warm-up strategy: we do not update the task weights in the first 3 epochs, and instead collect the average risks over those epochs as an estimate of the initial excess risk.
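The sketch below illustrates this warm-up logic. The helpers `estimate_excess_risks` and `train_one_epoch`, and the step size `eta_alpha`, are hypothetical stand-ins rather than our actual implementation; the exponential update mirrors the multiplicative-weights form $\alpha_i^{(t+1)} \propto \alpha_i^{(t)} \exp(\eta_\alpha^{(t)} \mathcal{E}_i(\theta^{(t)}))$ used in the proofs above.

```python
import numpy as np

# Sketch of the warm-up strategy: skip weight updates for the first
# few epochs, average the excess-risk estimates collected there, and
# use that average as the normalizer thereafter.
WARMUP_EPOCHS = 3

def run_excess_mtl(estimate_excess_risks, train_one_epoch,
                   num_tasks, num_epochs, eta_alpha=0.1):
    """estimate_excess_risks() -> np.ndarray of shape (num_tasks,);
    train_one_epoch(alpha) performs one weighted-sum gradient pass.
    Both are hypothetical callables supplied by the training harness."""
    alpha = np.ones(num_tasks) / num_tasks       # uniform initial weights
    warmup_risks = []
    for epoch in range(num_epochs):
        risks = estimate_excess_risks()
        if epoch < WARMUP_EPOCHS:
            warmup_risks.append(risks)           # collect, but do not update
        else:
            scale = np.mean(warmup_risks, axis=0)  # initial-risk estimate
            alpha = alpha * np.exp(eta_alpha * risks / scale)
            alpha = alpha / alpha.sum()          # project back onto the simplex
        train_one_epoch(alpha)
    return alpha
```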
Figure 9. Results on the SARCOS dataset (noise in the first two joints). The left panel considers all 7 tasks, while the other two panels consider all tasks except the first two. The right panel shows the combined weight of all clean tasks (around 0.71 for uniform scalarization). The curves compare ExcessMTL with IMTL, MGDA, GroupDRO, GradNorm, and uniform weighting as the noise level increases from 0 to 1. ExcessMTL consistently maintains its performance, significantly outperforming other adaptive methods in the face of label noise.

For the implementation of the baselines, we use the code from Lin & Zhang (2023) and Navon et al. (2022).

C.2. More Results

SARCOS (Vijayakumar & Schaal, 2000) presents an inverse dynamics problem for a robot arm with seven degrees of freedom. The task is multi-target regression: 21 attributes (7 joint positions, 7 joint velocities, and 7 joint accelerations) are used to predict the corresponding 7 joint torques. The noise is injected into the first two joint torques. The results on SARCOS are presented in Figure 9. In the face of increasing label noise, all adaptive weighting algorithms except ExcessMTL exhibit a trend of assigning increasing weight to the noisy tasks, which leads to a decline in performance on the clean tasks. ExcessMTL demonstrates resilience to label noise, consistently maintaining its performance. Here, a pattern similar to MultiMNIST can be observed: uniform scalarization performs well. However, we want to emphasize again that the performance of uniform scalarization varies across datasets, and it does not produce consistent results universally. In contrast, ExcessMTL demonstrates consistency across diverse datasets, reinforcing its robustness and effectiveness as a reliable solution in noisy environments.