Published as a conference paper at ICLR 2022

WEIGHTED TRAINING FOR CROSS-TASK LEARNING

Shuxiao Chen (University of Pennsylvania, shuxiaoc@wharton.upenn.edu), Koby Crammer (The Technion, koby@ee.technion.ac.il), Hangfeng He (University of Pennsylvania, hangfeng@seas.upenn.edu), Dan Roth (University of Pennsylvania, danroth@seas.upenn.edu), Weijie J. Su (University of Pennsylvania, suw@wharton.upenn.edu)

ABSTRACT

In this paper, we introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning based on minimizing a representation-based task distance between the source and target tasks. We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees. The effectiveness of TAWT is corroborated through extensive experiments with BERT on four sequence tagging tasks in natural language processing (NLP), including part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning, such as the choice of the source data and the impact of fine-tuning.¹

1 INTRODUCTION

The state-of-the-art (SOTA) models in real-world applications rely increasingly on the usage of weak supervision signals (Pennington et al., 2014; Devlin et al., 2019; Liu et al., 2019). Among these, cross-task signals are one of the most widely-used weak signals (Zamir et al., 2018; McCann et al., 2018). Despite their popularity, the benefits of cross-task signals are not well understood from a theoretical point of view, especially in the context of deep learning (He et al., 2021; Neyshabur et al., 2020), hence impeding the efficient usage of those signals. Previous work has adopted representation learning as a framework to understand the benefits of cross-task signals, where knowledge transfer is achieved by learning a representation shared across different tasks (Baxter, 2000; Maurer et al., 2016; Tripuraneni et al., 2020; 2021; Du et al., 2021). However, the existence of a shared representation is often too strong an assumption in practice. Such an assumption also makes it difficult to reason about several critical aspects of cross-task learning, such as the quantification of the value of the source data and the impact of fine-tuning (Kalan & Fabian, 2020; Chua et al., 2021).

In this paper, we propose Target-Aware Weighted Training (TAWT), a weighted training algorithm for efficient cross-task learning. The algorithm can be easily applied to existing cross-task learning paradigms, such as pre-training and joint training, to boost their sample efficiency by assigning adaptive (i.e., trainable) weights to the source tasks or source samples. The weights are determined in a theoretically principled way by minimizing a representation-based task distance between the source and target tasks. Such a strategy is in sharp contrast to other weighting schemes common in machine learning, such as importance sampling in domain adaptation (Shimodaira, 2000; Cortes et al., 2010; Jiang & Zhai, 2007).

¹Our code is publicly available at http://cogcomp.org/page/publication_view/963.

The effectiveness of TAWT is verified via both theoretical analyses and empirical experiments.
Using empirical process theory, we prove a non-asymptotic generalization bound for TAWT. The bound is a superposition of two vanishing terms and a term depending on the task distance, the latter of which is potentially negligible due to the re-weighting operation. We then conduct comprehensive experiments on four sequence tagging tasks in NLP: part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). We demonstrate that TAWT further improves the performance of BERT (Devlin et al., 2019) in both pre-training and joint training for cross-task learning with limited target data, achieving an average absolute performance improvement of 3.1%.

As a byproduct, we propose a representation-based task distance that depends on the quality of representations for each task separately, instead of assuming the existence of a single shared representation among all tasks. This finer-grained notion of task distance enables a better understanding of cross-task signals. For example, the representation-based task distance gives an interpretable measure of the value of the source data for the target task based on the discrepancy between their optimal representations. Such a measure is more informative than measuring the difference between tasks via the discrepancy of their task-specific functions (e.g., linear functions), as done in previous theoretical frameworks (Tripuraneni et al., 2020). Furthermore, the representation-based task distance clearly conveys the necessity of fine-tuning: if this distance is non-zero, then fine-tuning the representation becomes necessary, as the representation learned from the source data does not converge to the optimal target representation.

Finally, we compare our work with some recent attempts in similar directions. Liu et al. (2020) analyze the benefits of transfer learning by distinguishing source-specific features and transferable features in the source data. Based on the two types of features, they further propose a meta representation learning algorithm to encourage learning transferable and generalizable features. Instead of focusing on the distinction between two types of features, our algorithm and analyses are based on the representation-based task distance and are thus different. Chua et al. (2021) present a theoretical framework for analyzing representations derived from model-agnostic meta-learning (Finn et al., 2017), assuming all the tasks use approximately the same underlying representation. In contrast, we do not impose any a priori assumption on the proximity among source and target representations, and our algorithm seeks a weighting scheme that maximizes this proximity. Our work is also different from task weighting in curriculum learning. That line of work learns weights for a stochastic policy that decides which task to study next (Graves et al., 2017), whereas TAWT aims to learn better representations by assigning more suitable weights to the source tasks. Compared to heuristic weighting strategies in multi-task learning (Gong et al., 2019; Zhang & Yang, 2021), we aim to design a practical algorithm with theoretical guarantees for cross-task learning.

2 TAWT: TARGET-AWARE WEIGHTED TRAINING

2.1 PRELIMINARIES

Suppose we have $T$ source tasks, represented by a collection of probability distributions $\{\mathcal{D}_t\}_{t=1}^{T}$ on the sample space $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the feature space and $\mathcal{Y} \subseteq \mathbb{R}$ is the label space. For classification problems, we take $\mathcal{Y}$ to be a finite subset of $\mathbb{R}$.
We have a single target task, whose probability distribution is denoted by $\mathcal{D}_0$. For the $t$-th task, where $t = 0, 1, \ldots, T$, we observe $n_t$ i.i.d. samples $S_t = \{(x_{ti}, y_{ti})\}_{i=1}^{n_t}$ from $\mathcal{D}_t$. Typically, the number of samples from the target task, $n_0$, is much smaller than the number of samples from the source tasks, and the goal is to use samples from the source tasks to aid the learning of the target task.

Let $\Phi$ be a collection of representations from the feature space $\mathcal{X}$ to some latent space $\mathcal{Z} \subseteq \mathbb{R}^r$. We refer to $\Phi$ as the representation class. Let $\mathcal{F}$ be a collection of task-specific functions from the latent space $\mathcal{Z}$ to the label space $\mathcal{Y}$. The complexity of the representation class $\Phi$ is usually much larger (i.e., more expressive) than that of the task-specific function class $\mathcal{F}$. Given a bounded loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$, the optimal pair of representation and task-specific function of the $t$-th task is given by

$$(\phi^*_t, f^*_t) \in \operatorname*{argmin}_{\phi_t \in \Phi,\, f_t \in \mathcal{F}} \mathcal{L}_t(\phi_t, f_t), \qquad \mathcal{L}_t(\phi_t, f_t) := \mathbb{E}_{(X,Y) \sim \mathcal{D}_t}[\ell(f_t \circ \phi_t(X), Y)]. \tag{2.1}$$

Note that in general, the optimal representations of different tasks are different. For brevity, all proofs for the theory part are deferred to Appx. A.

2.2 DERIVATION OF TAWT

Under the assumption that the optimal representations $\{\phi^*_t\}_{t=0}^{T}$ are similar, a representation learned using samples only from the source tasks would perform reasonably well on the target task. Consequently, we can devote the $n_0$ samples from the target task to learning only the task-specific function. This is a much easier task, since the complexity of $\mathcal{F}$ is typically much smaller than that of $\Phi$. This discussion leads to a simple yet immensely popular two-step procedure as follows (Tripuraneni et al., 2020; Du et al., 2021). First, we solve a weighted empirical risk minimization problem with respect to the source tasks:

$$(\hat\phi, \{\hat f_t\}_{t=1}^{T}) \in \operatorname*{argmin}_{\phi \in \Phi,\, \{f_t\} \subseteq \mathcal{F}} \sum_{t=1}^{T} \omega_t \widehat{\mathcal{L}}_t(\phi, f_t), \qquad \widehat{\mathcal{L}}_t(\phi, f_t) := \frac{1}{n_t} \sum_{i=1}^{n_t} \ell(f_t \circ \phi(x_{ti}), y_{ti}), \tag{2.2}$$

where $\omega \in \Delta^{T-1}$ is a user-specified vector lying in the $T$-dimensional probability simplex (i.e., $\sum_{t=1}^{T} \omega_t = 1$ and $\omega_t \ge 0$ for $1 \le t \le T$). In the second stage, we freeze the representation $\hat\phi$ and seek the task-specific function that minimizes the empirical risk with respect to the target task:

$$\hat f_0 \in \operatorname*{argmin}_{f_0 \in \mathcal{F}} \widehat{\mathcal{L}}_0(\hat\phi, f_0). \tag{2.3}$$

In practice, we can allow $\hat\phi$ to vary slightly (e.g., via fine-tuning) to get a performance boost. A minimal sketch of this two-step procedure is given below.
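To make the two-step procedure (2.2)-(2.3) concrete, here is a minimal PyTorch-style sketch. The encoder architecture, dimensions, synthetic data, and optimization details are illustrative placeholders rather than the paper's implementation; only the structure (weighted source ERM, then fitting the target head on a frozen representation) follows the equations above.

```python
# Minimal sketch of the two-step procedure (2.2)-(2.3), assuming a generic
# encoder (the representation phi) and per-task linear heads (the f_t's).
# Dataset objects, dimensions, and the encoder architecture are illustrative
# placeholders, not the paper's exact implementation.
import torch
import torch.nn as nn

d, r, T = 32, 16, 3                      # input dim, latent dim, number of source tasks
num_labels = [5, 7, 4]                   # label-set sizes of the source tasks
encoder = nn.Sequential(nn.Linear(d, r), nn.ReLU())                 # phi in Phi
source_heads = nn.ModuleList(nn.Linear(r, c) for c in num_labels)   # f_1, ..., f_T
omega = torch.full((T,), 1.0 / T)        # fixed weights on the probability simplex
loss_fn = nn.CrossEntropyLoss()

# Toy source/target samples standing in for S_1, ..., S_T and S_0.
source_data = [(torch.randn(64, d), torch.randint(0, c, (64,))) for c in num_labels]
x0, y0 = torch.randn(16, d), torch.randint(0, 3, (16,))

# Step 1 (eq. 2.2): weighted empirical risk minimization on the source tasks.
opt = torch.optim.SGD(list(encoder.parameters()) + list(source_heads.parameters()), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = sum(w * loss_fn(head(encoder(x)), y)
               for w, head, (x, y) in zip(omega, source_heads, source_data))
    loss.backward()
    opt.step()

# Step 2 (eq. 2.3): freeze the representation, fit only the target head f_0.
target_head = nn.Linear(r, 3)
opt0 = torch.optim.SGD(target_head.parameters(), lr=0.1)
for _ in range(100):
    opt0.zero_grad()
    loss_fn(target_head(encoder(x0).detach()), y0).backward()
    opt0.step()
```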
In the two-step procedure (2.2)-(2.3), the weight vector $\omega$ is usually taken to be a hyperparameter and is fixed during training. Popular choices include uniform weights (i.e., $\omega_t = 1/T$) or weights proportional to the sample sizes (i.e., $\omega_t = n_t / \sum_{t'=1}^{T} n_{t'}$) (Liu et al., 2019; Johnson & Khoshgoftaar, 2019). This reveals the target-agnostic nature of the two-step procedure (2.2)-(2.3): the weights stay the same regardless of the level of proximity between the source tasks and the target task.

Consider the following thought experiment: if we know a priori that the first source task $\mathcal{D}_1$ is closer (compared to the other source tasks) to the target task $\mathcal{D}_0$, then we would expect a better performance by raising the importance of $\mathcal{D}_1$, i.e., making $\omega_1$ larger. This thought experiment motivates a target-aware procedure that adaptively adjusts the weights based on the proximity of the source tasks to the target. A natural attempt at developing such a target-aware procedure is as follows:

$$\min_{\phi \in \Phi,\, f_0 \in \mathcal{F},\, \omega \in \Delta^{T-1}} \widehat{\mathcal{L}}_0(\phi, f_0) \quad \text{subject to} \quad \phi \in \operatorname*{argmin}_{\psi \in \Phi} \min_{\{f_t\} \subseteq \mathcal{F}} \sum_{t=1}^{T} \omega_t \widehat{\mathcal{L}}_t(\psi, f_t). \tag{OPT1}$$

That is, we seek the best weights $\omega$ such that solving (2.2) with this choice of $\omega$ leads to the lowest training error when we subsequently solve (2.3). Despite its conceptual simplicity, the formulation (OPT1) is a complicated constrained optimization problem. Nevertheless, we demonstrate that it is possible to transform it into an unconstrained form to which a customized gradient-based optimizer can be applied. To do so, we let $(\phi_\omega, \{f^\omega_t\})$ be any representation and task-specific functions that minimize $\sum_{t=1}^{T} \omega_t \widehat{\mathcal{L}}_t(\phi, f_t)$ over $\phi \in \Phi$ and $\{f_t\} \subseteq \mathcal{F}$. Equivalently, $\phi_\omega$ minimizes $\sum_{t=1}^{T} \omega_t \min_{f_t \in \mathcal{F}} \widehat{\mathcal{L}}_t(\phi, f_t)$ over $\phi \in \Phi$. With such notation, we can re-write (OPT1) as

$$\min_{f_0 \in \mathcal{F},\, \omega \in \Delta^{T-1}} \widehat{\mathcal{L}}_0(\phi_\omega, f_0). \tag{OPT2}$$

The gradient of the above objective with respect to the task-specific function, $\nabla_f \widehat{\mathcal{L}}_0(\phi_\omega, f_0)$, is easy to calculate via back-propagation. The calculation of the gradient with respect to the weights requires more work, as $\phi_\omega$ is an implicit function of $\omega$. By the chain rule, we have $\frac{\partial}{\partial \omega_t} \widehat{\mathcal{L}}_0(\phi_\omega, f_0) = [\nabla_\phi \widehat{\mathcal{L}}_0(\phi_\omega, f_0)]^\top \frac{\partial \phi_\omega}{\partial \omega_t}$. Since $\phi_\omega$ is a minimizer of $\phi \mapsto \sum_{t=1}^{T} \omega_t \min_{f_t \in \mathcal{F}} \widehat{\mathcal{L}}_t(\phi, f_t)$, we have

$$F(\phi_\omega, \omega) = 0, \quad \forall \omega \in \Delta^{T-1}, \qquad F(\phi, \omega) := \nabla_\phi \sum_{t=1}^{T} \omega_t \min_{f_t \in \mathcal{F}} \widehat{\mathcal{L}}_t(\phi, f_t). \tag{2.4}$$

By the implicit function theorem, if $F(\cdot, \cdot)$ is everywhere differentiable and the matrix $\partial F(\phi, \omega)/\partial \phi$ is invertible for any $(\phi, \omega)$ near some $(\tilde\phi, \tilde\omega)$ satisfying $F(\tilde\phi, \tilde\omega) = 0$, then we can conclude that the map $\omega \mapsto \phi_\omega$ is a locally well-defined function near $\tilde\omega$, and the derivative of this map is given by

$$\frac{\partial \phi_\omega}{\partial \omega_t} = -\Big[\frac{\partial F(\phi_\omega, \omega)}{\partial \phi}\Big]^{-1} \frac{\partial F(\phi_\omega, \omega)}{\partial \omega_t}. \tag{2.5}$$

To simplify the above expression, note that under regularity conditions, we can regard $\nabla_\phi \widehat{\mathcal{L}}_t(\phi, f^\omega_t)$ as a sub-gradient of the map $\phi \mapsto \min_{f_t \in \mathcal{F}} \widehat{\mathcal{L}}_t(\phi, f_t)$. This means that we can write $F(\phi_\omega, \omega) = \sum_{t=1}^{T} \omega_t \nabla_\phi \widehat{\mathcal{L}}_t(\phi_\omega, f^\omega_t)$. Plugging this expression back into (2.5) and recalling the expression for $\partial \widehat{\mathcal{L}}_0(\phi_\omega, f_0)/\partial \omega_t$ derived via the chain rule, we get

$$\frac{\partial}{\partial \omega_t} \widehat{\mathcal{L}}_0(\phi_\omega, f_0) = -[\nabla_\phi \widehat{\mathcal{L}}_0(\phi_\omega, f_0)]^\top \Big[\sum_{t'=1}^{T} \omega_{t'} \nabla^2_\phi \widehat{\mathcal{L}}_{t'}(\phi_\omega, f^\omega_{t'})\Big]^{-1} [\nabla_\phi \widehat{\mathcal{L}}_t(\phi_\omega, f^\omega_t)]. \tag{2.6}$$

Now that we have the expressions for the gradients of $\widehat{\mathcal{L}}_0(\phi_\omega, f_0)$ with respect to $f_0$ and $\omega$, we can solve (OPT2) via a combination of alternating minimization and mirror descent. To be more specific, suppose that at iteration $k$, the current weights, representation, and task-specific functions are $\omega^k$, $\phi^k$, and $\{f^k_t\}_{t=0}^{T}$, respectively. At this iteration, we conduct the following three steps:

1. Freeze $\omega^k$. Starting from $(\phi^k, \{f^k_t\}_{t=1}^{T})$, run a few steps of SGD on the objective function $(\phi, \{f_t\}_{t=1}^{T}) \mapsto \sum_{t=1}^{T} \omega^k_t \widehat{\mathcal{L}}_t(\phi, f_t)$ to get $(\phi^{k+1}, \{f^{k+1}_t\}_{t=1}^{T})$, which is regarded as an approximation of $(\phi_{\omega^k}, \{f^{\omega^k}_t\}_{t=1}^{T})$;

2. Freeze $(\phi^{k+1}, \{f^{k+1}_t\}_{t=1}^{T})$. Approximate the gradient $\nabla_f \widehat{\mathcal{L}}_0(\phi_{\omega^k}, f_0)$ by $\nabla_f \widehat{\mathcal{L}}_0(\phi^{k+1}, f_0)$. Using this approximate gradient, run a few steps of SGD from $f^k_0$ to get $f^{k+1}_0$;

3. Freeze $(\phi^{k+1}, \{f^{k+1}_t\}_{t=0}^{T})$. Approximate the partial derivative $\partial \widehat{\mathcal{L}}_0(\phi_{\omega^k}, f^{k+1}_0)/\partial \omega_t$ by

$$g^k_t := -[\nabla_\phi \widehat{\mathcal{L}}_0(\phi^{k+1}, f^{k+1}_0)]^\top \Big[\sum_{t'=1}^{T} \omega^k_{t'} \nabla^2_\phi \widehat{\mathcal{L}}_{t'}(\phi^{k+1}, f^{k+1}_{t'})\Big]^{-1} [\nabla_\phi \widehat{\mathcal{L}}_t(\phi^{k+1}, f^{k+1}_t)]. \tag{2.7}$$

Then run one step of mirror descent (with step size $\eta_k$) from $\omega^k$ to get $\omega^{k+1}$:

$$\omega^{k+1}_t = \frac{\omega^k_t \exp\{-\eta_k g^k_t\}}{\sum_{t'=1}^{T} \omega^k_{t'} \exp\{-\eta_k g^k_{t'}\}}. \tag{2.8}$$

We use mirror descent in (2.8), as it is a canonical generalization of Euclidean gradient descent to gradient descent on the probability simplex (Beck & Teboulle, 2003). Note that other optimization methods, such as projected gradient descent, can also be used here. The update rule (2.8) has a rather intuitive explanation. Note that $g^k_t$ is a weighted dissimilarity measure between the gradients $\nabla_\phi \widehat{\mathcal{L}}_0$ and $\nabla_\phi \widehat{\mathcal{L}}_t$. This can further be regarded as a crude dissimilarity measure between the optimal representations of the target task and the $t$-th source task. The mirror descent step thus updates $\omega_t$ in the direction that favors source tasks that are more similar to the target task. The overall procedure is summarized in Algorithm 1.

Algorithm 1: Target-Aware Weighted Training (TAWT)
Input: Datasets $\{S_t\}_{t=0}^{T}$.
Output: Final pair of representation and task-specific function $(\hat\phi, \hat f_0)$ for the target task.
Initialize parameters $\omega^0 \in \Delta^{T-1}$, $\phi^0 \in \Phi$, $\{f^0_t\}_{t=0}^{T} \subseteq \mathcal{F}$;
for $k = 0, \ldots, K-1$ do
    Starting from $(\phi^k, \{f^k_t\}_{t=1}^{T})$, run a few steps of SGD to get $(\phi^{k+1}, \{f^{k+1}_t\}_{t=1}^{T})$;
    Use the approximate gradient $\nabla_f \widehat{\mathcal{L}}_0(\phi^{k+1}, f_0)$ to run a few steps of SGD from $f^k_0$ to get $f^{k+1}_0$;
    Run one step of approximate mirror descent (2.7)-(2.8) from $\omega^k$ to get $\omega^{k+1}$;
end
return $\hat\phi = \phi^K$, $\hat f_0 = f^K_0$

A faithful implementation of the above steps would require a costly evaluation of the inverse of the Hessian matrix $\sum_{t'=1}^{T} \omega_{t'} \nabla^2_\phi \widehat{\mathcal{L}}_{t'}(\phi^{k+1}, f^{k+1}_{t'}) \in \mathbb{R}^{r \times r}$. In practice, we can bypass this step by replacing² the Hessian-inverse-weighted dissimilarity measure (2.7) with a cosine-similarity-based dissimilarity measure (see Section 4 for details); a minimal sketch of this approximate weight update is given below.

²This type of approximation is common and almost necessary in related methods such as MAML (Finn et al., 2017).
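Continuing the toy setup from the previous sketch, the following hedged snippet illustrates step 3: the Hessian-inverse term in (2.7) is replaced by a (negative) cosine similarity between encoder gradients, in the spirit of the approximation used in the experiments, and the weights are then renormalized via the mirror-descent update (2.8). The scaling constant c, the step size, and the gradient flattening are illustrative choices, not the paper's exact implementation.

```python
# Minimal sketch of step 3 of TAWT: approximate g_t via (negative) cosine
# similarity between encoder gradients, then apply the mirror-descent update
# (2.8). Continues the toy encoder/heads/source_data/target data from the
# previous sketch; c and eta are an illustrative scaling constant and step size.
import torch
import torch.nn.functional as F

def encoder_grad(loss):
    """Flatten d(loss)/d(encoder parameters) into a single vector."""
    grads = torch.autograd.grad(loss, list(encoder.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

c, eta = 1.0, 1.0
target_loss = loss_fn(target_head(encoder(x0)), y0)
g_target = encoder_grad(target_loss)

g = torch.empty(T)
for t, (head, (x, y)) in enumerate(zip(source_heads, source_data)):
    g_source = encoder_grad(loss_fn(head(encoder(x)), y))
    # Cosine-similarity surrogate for the Hessian-inverse-weighted term in (2.7).
    g[t] = -c * F.cosine_similarity(g_target, g_source, dim=0)

# Mirror descent on the probability simplex, eq. (2.8).
omega = omega * torch.exp(-eta * g)
omega = omega / omega.sum()
```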
The previous derivation has focused on weighted pre-training, i.e., the target data is not used when defining the constraint set in (OPT1). It can be modified, mutatis mutandis, to handle weighted joint training, where we change (OPT1) to

$$\min_{\phi \in \Phi,\, f_0 \in \mathcal{F},\, \omega \in \Delta^{T}} \widehat{\mathcal{L}}_0(\phi, f_0) \quad \text{subject to} \quad \phi \in \operatorname*{argmin}_{\psi \in \Phi} \min_{\{f_t\} \subseteq \mathcal{F}} \sum_{t=0}^{T} \omega_t \widehat{\mathcal{L}}_t(\psi, f_t). \tag{2.9}$$

Compared to (OPT1), we now also use the data from the target task when learning the representation $\phi$, and thus there is an extra weight $\omega_0$ on the target task. The algorithm can also be easily extended to handle multiple target tasks or to put weights on samples (as opposed to putting weights on tasks). The algorithm could also be applied to improve the efficiency of learning from cross-domain and cross-lingual signals, and we postpone such explorations to future work.

3 THEORETICAL GUARANTEES

3.1 A REPRESENTATION-BASED TASK DISTANCE

In this subsection, we introduce a representation-based task distance, which will be crucial to the theoretical understanding of TAWT. To start with, let us define the representation and task-specific functions that are optimal in an $\omega$-weighted sense as follows:

$$(\bar\phi_\omega, \{\bar f^\omega_t\}_{t=1}^{T}) \in \operatorname*{argmin}_{\phi \in \Phi,\, \{f_t\} \subseteq \mathcal{F}} \sum_{t=1}^{T} \omega_t \mathcal{L}_t(\phi, f_t). \tag{3.1}$$

Intuitively, $(\bar\phi_\omega, \{\bar f^\omega_t\})$ are optimal on the $\omega$-weighted source tasks when only a single representation is used. Since $\bar\phi_\omega$ may not be unique, we introduce the function space $\bar\Phi_\omega \subseteq \Phi$ to collect all $\bar\phi_\omega$'s that satisfy (3.1). To further simplify the notation, we let $\mathcal{L}^*_t(\phi) := \min_{f_t \in \mathcal{F}} \mathcal{L}_t(\phi, f_t)$, which stands for the risk incurred by the representation $\phi$ on the $t$-th task. With the foregoing notation, we can write $\bar\phi_\omega \in \operatorname{argmin}_{\phi \in \Phi} \sum_{t=1}^{T} \omega_t \mathcal{L}^*_t(\phi)$. The definition of the task distance is given below.

Definition 3.1 (Representation-based task distance). The representation-based task distance between the $\omega$-weighted source tasks and the target task is defined as

$$\mathrm{dist}\Big(\sum_{t=1}^{T} \omega_t \mathcal{D}_t, \mathcal{D}_0\Big) := \sup_{\bar\phi_\omega \in \bar\Phi_\omega} \mathcal{L}^*_0(\bar\phi_\omega) - \mathcal{L}^*_0(\phi^*_0), \tag{3.2}$$

where the supremum is taken over any $\bar\phi_\omega$ satisfying (3.1), and $\phi^*_0$ is the optimal target representation.

If all the tasks share the same optimal representation, then the above distance becomes exactly zero.
Under such an assumption, the only source of discrepancy among tasks arises from the difference in their task-specific functions. This can be problematic in practice, as the task-specific functions alone are usually not expressive enough to describe the intrinsic difference among tasks. In contrast, we relax the shared-representation assumption substantially by allowing the optimal representations to differ and the distance to be non-zero.

The above notion of task distance also allows us to reason about certain important aspects of cross-task learning. For example, this task distance is asymmetric, capturing the asymmetric nature of cross-task learning. Moreover, if the task distance is non-zero, then fine-tuning the representation becomes necessary, because solving (2.2) alone would give a representation that does not converge to the correct target $\phi^*_0$. In addition, an empirical illustration of the representation-based task distance can be found in Fig. 3 in Appx. B.

This task distance can be naturally estimated from the data by replacing all population quantities with their empirical versions. Indeed, minimizing the estimated task distance over the weights is equivalent to the optimization formulation (OPT1) of our algorithm (see Appx. A.1 for a detailed derivation). This observation gives an alternative and theoretically principled derivation of TAWT.

3.2 PERFORMANCE GUARANTEES FOR TAWT

To give theoretical guarantees for TAWT, we need a few standard technical assumptions. The first one concerns the Lipschitzness of the function classes, and the second one controls the complexity of the function classes via uniform entropy (Wellner & van der Vaart, 2013), as follows.

Assumption A (Lipschitzness). The loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ is $L_\ell$-Lipschitz in the first argument, uniformly over the second argument. Any $f \in \mathcal{F}$ is $L_\mathcal{F}$-Lipschitz w.r.t. the $\ell_2$ norm.

Assumption B (Uniform entropy control of function classes). There exist $C_\Phi > 0$, $\nu_\Phi > 0$ such that for any probability measure $Q_X$ on $\mathcal{X} \subseteq \mathbb{R}^d$, we have

$$N(\Phi; L_2(Q_X); \varepsilon) \le (C_\Phi/\varepsilon)^{\nu_\Phi}, \quad \forall \varepsilon > 0, \tag{3.3}$$

where $N(\Phi; L_2(Q_X); \varepsilon)$ is the $L_2(Q_X)$ covering number of $\Phi$ (i.e., the minimum number of $L_2(Q_X)$ balls³ with radius $\varepsilon$ required to cover $\Phi$). In parallel, there exist $C_\mathcal{F} > 0$, $\nu_\mathcal{F} > 0$ such that for any probability measure $Q_Z$ on $\mathcal{Z} \subseteq \mathbb{R}^r$, we have

$$N(\mathcal{F}; L_2(Q_Z); \varepsilon) \le (C_\mathcal{F}/\varepsilon)^{\nu_\mathcal{F}}, \quad \forall \varepsilon > 0, \tag{3.4}$$

where $N(\mathcal{F}; L_2(Q_Z); \varepsilon)$ is the $L_2(Q_Z)$ covering number of $\mathcal{F}$.

³For two vector-valued functions $\phi, \psi \in \Phi$, their $L_2(Q_X)$ distance is $\big(\int \|\phi(x) - \psi(x)\|^2 \, \mathrm{d}Q_X(x)\big)^{1/2}$.

Uniform entropy generalizes the notion of Vapnik-Chervonenkis dimension (Vapnik, 2013) and allows us to give a unified treatment of regression and classification problems. For this reason, function classes satisfying the above assumption are also referred to as VC-type classes in the literature (Koltchinskii, 2006). In particular, if each coordinate of $\Phi$ has VC-subgraph dimension $c(\Phi)$, then (3.3) is satisfied with $\nu_\Phi = \Theta(r \cdot c(\Phi))$ (recall that $r$ is the dimension of the latent space $\mathcal{Z}$). Similarly, if $\mathcal{F}$ has VC-subgraph dimension $c(\mathcal{F})$, then (3.4) is satisfied with $\nu_\mathcal{F} = \Theta(c(\mathcal{F}))$.

The following definition characterizes how transferable a representation $\phi$ is from the $\omega$-weighted source tasks to the target task.

Definition 3.2 (Transferability). A representation $\phi \in \Phi$ is $(\rho, C_\rho)$-transferable from the $\omega$-weighted source tasks to the target task if there exist $\rho > 0$, $C_\rho > 0$ such that for any $\bar\phi_\omega \in \bar\Phi_\omega$, we have

$$\mathcal{L}^*_0(\phi) - \mathcal{L}^*_0(\bar\phi_\omega) \le C_\rho \Big(\sum_{t=1}^{T} \omega_t [\mathcal{L}^*_t(\phi) - \mathcal{L}^*_t(\bar\phi_\omega)]\Big)^{1/\rho}. \tag{3.5}$$
Intuitively, the above definition says that, relative to $\bar\phi_\omega$, the risk of $\phi$ on the target task can be controlled by a polynomial of the average risk of $\phi$ on the source tasks. This can be regarded as an adaptation of the notions of transfer exponent and relative signal exponent (Hanneke & Kpotufe, 2019; Cai & Wei, 2021), which originated in the transfer learning literature, to the representation learning setting. It can also be seen as a generalization of the task diversity assumption (Tripuraneni et al., 2020; Du et al., 2021) to the case where the optimal representations $\{\phi^*_t\}_{t=1}^{T}$ do not coincide. In addition, Tripuraneni et al. (2020) and Du et al. (2021) proved that $\rho = 1$ in certain simple models and under some simplified settings.

In this part, we prove a non-asymptotic generalization bound for TAWT. The fact that the weights are learned from the data substantially complicates the analysis. To proceed further, we make a few simplifying assumptions. First, we assume that the sample sizes of the source data are relatively balanced: there exists an integer $n$ such that $n_t = \Theta(n)$ for any $t \in \{1, \ldots, T\}$. Meanwhile, instead of directly analyzing (OPT1), we focus on its sample-split version. In particular, we let $B_1 \cup B_2$ be a partition of $\{1, \ldots, n_0\}$, where $|B_1| = \Theta(|B_2|)$. Define

$$\widehat{\mathcal{L}}^{(1)}_0(\phi, f_0) := \frac{1}{|B_1|} \sum_{i \in B_1} \ell(f_0 \circ \phi(x_{0i}), y_{0i}), \qquad \widehat{\mathcal{L}}^{(2)}_0(\phi, f_0) := \frac{1}{|B_2|} \sum_{i \in B_2} \ell(f_0 \circ \phi(x_{0i}), y_{0i}).$$

We first solve (OPT1) restricted to the first part of the data:

$$(\hat\phi, \hat\omega) \in \operatorname*{argmin}_{\phi \in \Phi,\, \omega \in \Delta^{T-1}} \min_{f_0 \in \mathcal{F}} \widehat{\mathcal{L}}^{(1)}_0(\phi, f_0) \quad \text{subject to} \quad \phi \in \operatorname*{argmin}_{\psi \in \Phi} \min_{\{f_t\} \subseteq \mathcal{F}} \sum_{t=1}^{T} \omega_t \widehat{\mathcal{L}}_t(\psi, f_t). \tag{3.6}$$

Then, we proceed by solving

$$\hat f_0 \in \operatorname*{argmin}_{f_0 \in \mathcal{F}} \widehat{\mathcal{L}}^{(2)}_0(\hat\phi, f_0). \tag{3.7}$$

Such a sample splitting ensures the independence of $\hat\phi$ and the second part of the target data $B_2$, hence allowing for more transparent theoretical analyses. Such strategies are common in the statistics and econometrics literature when the algorithm has delicate dependence structures (see, e.g., Hansen, 2000; Chernozhukov et al., 2018). We emphasize that sample splitting is conducted only for theoretical convenience and is not used in practice. The following theorem gives performance guarantees for the sample-split version of TAWT.

Theorem 3.1 (Performance of TAWT with sample splitting). Let $(\hat\phi, \hat f_0)$ be obtained via solving (3.6)-(3.7). Let Assumptions A and B hold. In addition, assume that the learned weights satisfy $\hat\omega \in \mathcal{W}_\beta := \{\omega \in \Delta^{T-1} : \beta^{-1} \le \omega_t/\omega_{t'} \le \beta, \ \forall t \ne t'\}$, where $\beta \ge 1$ is an absolute constant. Fix $\delta \in (0, 1)$. There exists a constant $C = C(L_\ell, L_\mathcal{F}, C_\Phi, C_\mathcal{F}) > 0$ such that the following holds: if for any weights $\omega \in \mathcal{W}_\beta$ and any representation $\phi$ in a $C\beta\sqrt{(\nu_\Phi + \log \delta^{-1})/(nT) + (\nu_\mathcal{F} + \log T)/n}$-neighborhood of $\bar\Phi_\omega$⁴, there exists a specific $\bar\phi_\omega \in \bar\Phi_\omega$ such that $\phi$ is $(\rho, C_\rho)$-transferable, then there exists another constant $C' = C'(L_\ell, L_\mathcal{F}, C_\Phi, C_\mathcal{F}, C_\rho, \rho)$ such that with probability at least $1 - \delta$, we have

$$\mathcal{L}_0(\hat\phi, \hat f_0) - \mathcal{L}_0(\phi^*_0, f^*_0) \le C' \left\{ \sqrt{\frac{\nu_\mathcal{F} + \log(1/\delta)}{n_0}} + \beta^{1/\rho} \Big(\frac{\nu_\Phi + \log(1/\delta)}{nT} + \frac{\nu_\mathcal{F} + \log T}{n}\Big)^{\frac{1}{2\rho}} + \mathrm{dist}\Big(\sum_{t=1}^{T} \hat\omega_t \mathcal{D}_t, \mathcal{D}_0\Big) \right\}. \tag{3.8}$$

⁴A representation $\phi$ is in an $\varepsilon$-neighborhood of $\bar\Phi_\omega$ if $\sum_{t=1}^{T} \omega_t [\mathcal{L}^*_t(\phi) - \mathcal{L}^*_t(\bar\phi_\omega)] \le \varepsilon$.

The upper bound in (3.8) is a superposition of three terms. Let us disregard the $\log(1/\delta)$ terms for now and focus on the dependence on the sample sizes and problem dimensions. The first term, which scales with $\sqrt{\nu_\mathcal{F}/n_0}$, corresponds to the error of learning the task-specific function for the target task. This is unavoidable even if the optimal representation $\phi^*_0$ is known.
The second term, which scales with $[(\nu_\Phi + T\nu_\mathcal{F})/(nT)]^{1/(2\rho)}$, characterizes the error of learning the imperfect representation $\bar\phi_\omega$ from the source datasets and transferring that knowledge to the target task. Note that this term is typically much smaller than $\sqrt{(\nu_\Phi + \nu_\mathcal{F})/n_0}$, the error that would have been incurred when learning from the target data alone, thus illustrating the potential benefits of representation learning. This happens because $nT$ is typically much larger than $n_0$. The third term is precisely the task distance introduced in Definition 3.1. The form of the task distance immediately demonstrates the possibility of matching $\bar\phi_\omega \approx \phi^*_0$ by varying the weights $\omega$, in which case the third term becomes negligible compared to the former two. For example, in Appx. A.2, we give a sufficient condition for exactly matching $\bar\phi_\omega = \phi^*_0$.

The proof of Theorem 3.1 is based on empirical process theory. Along the way, we also establish an interesting result on multi-task learning, where the goal is to improve the average performance over all tasks instead of the target task alone (see Lemma A.1 in Appx. A). The current analysis can also be extended to cases where multiple target tasks are present.

4 EXPERIMENTS

In this section, we verify the effectiveness of TAWT in extensive experiments on four NLP tasks: PoS tagging, chunking, predicate detection, and NER. More details are in Appx. B.

| Learning Paradigm | PoS | Chunking | Predicate Detection | NER | Avg |
| --- | --- | --- | --- | --- | --- |
| Single-Task Learning | 34.37 | 43.05 | 66.26 | 33.20 | 44.22 |
| Pre-Training | 49.43 | 73.15 | 74.10 | 41.22 | 59.48 |
| Weighted Pre-Training | 51.17*** | 73.41 | 75.77*** | 46.23*** | 61.64 |
| Joint Training | 53.83 | 75.58 | 75.42 | 43.50 | 62.08 |
| Weighted Joint Training | 57.34*** | 77.78*** | 75.98*** | 53.44*** | 66.14 |
| Normalized Joint Training | 84.14 | 88.91 | 77.02 | 61.15 | 77.80 |
| Weighted Normalized Joint Training | 86.07*** | 90.62*** | 76.67 | 63.44*** | 79.20 |

Table 1: The benefits of weighted training for different learning paradigms under different settings. There are four tasks in total: PoS tagging, chunking, predicate detection, and NER. For each setting, we choose one task as the target task and the remaining three tasks as source tasks. We randomly choose 9K training sentences for each source task, because the training size of the chunking dataset is 8,936. As for the target task, we randomly choose 100, 100, 300, and 500 training sentences for PoS tagging, chunking, predicate detection, and NER respectively, based on the difficulty of the tasks. Single-task learning denotes learning only with the small target data. *** indicates that the p-value of the paired sample t-test is smaller than 0.001.

Experimental settings. In our experiments, we mainly use two widely-used NLP datasets, OntoNotes 5.0 (Hovy et al., 2006) and CoNLL-2000 (Tjong Kim Sang & Buchholz, 2000). OntoNotes 5.0 contains annotations for PoS tagging, predicate detection, and NER, and CoNLL-2000 is a shared task for chunking. There are about 116K sentences, 16K sentences, and 12K sentences in the training, development, and test sets for tasks in OntoNotes 5.0. As for CoNLL-2000, there are about 9K sentences and 2K sentences in the training and test sets. As for the evaluation metric, we use accuracy for PoS tagging, span-level F1 for chunking, word-level F1 for predicate detection, and span-level F1 for NER. We use BERT⁵ (Devlin et al., 2019) as our basic model in our main experiments; a schematic sketch of how the shared encoder and the per-task classifiers fit together is given below.

⁵While BERT is no longer the SOTA model, all SOTA models are slight improvements of BERT, so our experiments are done with highly competitive models.
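The following hedged PyTorch sketch shows a shared BERT-base (cased) encoder with one linear classifier per task, matching the description that the task-specific function is the last-layer linear classifier and the representation is the remaining part. The label-set sizes follow Appx. B; the class and variable names, and the exact wiring, are illustrative rather than the paper's released implementation.

```python
# Hedged sketch: shared BERT-base (cased) encoder (the representation phi) with
# one linear classifier per task (the task-specific functions f_t). Label-set
# sizes follow Appx. B; names and wiring are illustrative.
import torch.nn as nn
from transformers import BertModel

NUM_LABELS = {"pos": 50, "chunking": 23, "predicate": 2, "ner": 37}

class MultiTaskTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-cased")  # shared representation
        hidden = self.encoder.config.hidden_size                     # 768 for BERT-base
        self.heads = nn.ModuleDict({task: nn.Linear(hidden, k)
                                    for task, k in NUM_LABELS.items()})

    def forward(self, input_ids, attention_mask, task):
        # Token-level representations from the encoder, then the task's linear classifier.
        hidden_states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.heads[task](hidden_states)
```

With this structure, the weighted source loss in (2.2) is simply a convex combination of per-task cross-entropy losses computed through the corresponding heads.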
Specifically, we use the pre-trained case-sensitive BERT-base PyTorch implementation (Wolf et al., 2020) and the common hyperparameters for BERT. In BERT, the task-specific function is the last-layer linear classifier, and the representation model is the remaining part. As for cross-task learning paradigms, we consider two popular learning paradigms, pre-training and joint training. Pre-training first pre-trains the representation part on the source data and then fine-tunes the whole target model on the target data. Joint training uses both source and target data to train the shared representation model and the task-specific functions for both source and target tasks at the same time. For the multi-task learning part of both pre-training and joint training, we adopt the same multi-task learning algorithm as in MT-DNN (Liu et al., 2019). More explanation of the choice of experimental settings can be found in Appx. C.

Settings for weighted training. For scalability, in all the experiments we approximate $g^k_t$ in Eq. (2.7) by $-c \cdot \mathrm{sim}\big(\nabla_\phi \widehat{\mathcal{L}}_0(\phi^{k+1}, f^{k+1}_0), \nabla_\phi \widehat{\mathcal{L}}_t(\phi^{k+1}, f^{k+1}_t)\big)$, where $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity between two vectors. Note that this type of approximation is common and almost necessary, as in MAML (Finn et al., 2017). For weighted joint training, we choose $c = 1$. For weighted pre-training, we choose the best $c$ among $\{0.3, 1, 10, 30, 100\}$, because the cosine similarity between the pre-training tasks and the target task is small in general. In practice, we further simplify the computation of $\nabla_\phi \widehat{\mathcal{L}}_t(\phi^{k+1}, f^{k+1}_t)$ ($t = 0, 1, \ldots, T$) in Eq. (2.7) by computing the gradient over the average loss of a randomly sampled subset of the training set instead of the average loss of the whole training set as in Eq. (2.2). In our experiments, we simply set the size of this randomly sampled subset to 64, though a larger size is more beneficial in general. We choose $\eta_k = 1.0$ in the mirror descent update (2.8). It is worth noting that there is no need to tune any extra hyperparameters for weighted joint training, though we believe that the performance of TAWT can be further improved by tuning extra hyperparameters, such as the learning rate $\eta_k$.

Results. The effectiveness of TAWT is first demonstrated for both pre-training and joint training on four tasks with only a small number of training examples in the target data, as shown in Table 1. Experiments with more target training examples and some additional experiments can be found in Table 2 and Table 3 in Appx. B. Finally, we note that TAWT can be easily extended from putting weights on tasks to putting weights on samples (see Fig. 2 in Appx. B).

Normalized joint training. Inspired by the final task weights learned by TAWT (see Table 6 in Appx. B), we also experiment with normalized joint training. The difference between joint training and normalized joint training lies in the initialization of the task weights. For (weighted) joint training, the weights on tasks are initialized to $\omega_t = n_t / \sum_{t'=1}^{T} n_{t'}$ (i.e., the weight on the loss of each example will be uniform), whereas for (weighted) normalized joint training, the weights on tasks are initialized to $\omega_t = 1/T$ (i.e., the weight on the loss of each example will be normalized by the sample sizes). Note that the loss for each task is already normalized by the sample size in our theoretical analysis (see Eq. (2.2)). As a byproduct, we find that normalized joint training is much better than the widely used joint training when the source data is much larger than the target data. In addition, TAWT can still be used to further improve normalized joint training. The corresponding results can be found in Table 1. We also find that dynamic weights might be a better choice than fixed weights in weighted training (see Table 7 in Appx. B). The two weight initializations are illustrated in the short snippet below.
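A tiny illustrative snippet of the two initializations, with placeholder sample sizes:

```python
# Illustrative only: the two task-weight initializations discussed above.
# With n_t-proportional weights, every training example contributes equally to
# the joint loss; with uniform 1/T weights, each task's size-normalized loss
# (as in Eq. (2.2)) contributes equally. The sample sizes are placeholders.
n = [8936, 9000, 9000, 500]                    # per-task training sizes (illustrative)
T = len(n)
w_joint = [n_t / sum(n) for n_t in n]          # (weighted) joint training
w_normalized = [1.0 / T for _ in n]            # (weighted) normalized joint training
print(w_joint, w_normalized)
```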
[Figure 1 is not reproduced here; its panels plot Improvement (%) against (a) the ratio between the source and target training sizes, and against the target training size (500 to 8000) for ratios 8 and 10.]

Figure 1: Analysis of the weighted training algorithms. We analyze the impact of two crucial factors on the improvement of the weighted training algorithms: the ratio between the training sizes of the source and target datasets, and the training size of the target dataset. In this figure, we use NER as the target task, and the source tasks include PoS tagging and predicate detection. In the first subfigure, we keep the training size of the target task at 500 and change the ratio from 1 to 128 on a log scale. In the second (third) subfigure, we keep the ratio at 8 (10) and change the training size of the target dataset from 500 to 8000 on a log scale. The corresponding improvement from normalized joint training to weighted normalized joint training is shown.

Analysis. Furthermore, we analyze two crucial factors (i.e., the ratio between the training sizes of the source and target datasets, and the training size of the target dataset) that affect the improvement of TAWT in Fig. 1. In general, we find that TAWT is more beneficial when the performance of the base model is poorer, either because of a smaller target dataset or because of a smaller ratio between the source data size and the target data size. More details are in Table 4 and Table 5 in Appx. B.

5 DISCUSSION

In this paper, we propose a new weighted training algorithm, TAWT, to improve the sample efficiency of learning from cross-task signals. TAWT adaptively assigns weights to tasks or samples in the source data to minimize the representation-based task distance between the source and target tasks. The algorithm is an easy-to-use plugin that can be applied to existing cross-task learning paradigms, such as pre-training and joint training, without introducing much computational overhead or hyperparameter tuning. The effectiveness of TAWT is further corroborated through theoretical analyses and empirical experiments. To the best of our knowledge, TAWT is the first weighted training algorithm for cross-task learning with theoretical guarantees, and the proposed representation-based task distance also sheds light on many critical aspects of cross-task learning.

Limitations and Future Work. There are two main limitations of our work. First, although we give an efficient implementation of the task-weighted version of TAWT, an efficient implementation of the sample-weighted version is still lacking, and we leave it for future work.
Second, we are not aware of any method that can efficiently estimate the representation-based task distance without training the model, and we plan to work more on this direction. In addition, we also plan to evaluate TAWT in other settings, such as cross-domain and cross-lingual settings; in more general cases, such as multiple target tasks in the target data; and in other tasks, such as language modeling, question answering, sentiment analysis, image classification, and object detection. Published as a conference paper at ICLR 2022 ACKNOWLEDGMENTS This material is based upon work supported by the US Defense Advanced Research Projects Agency (DARPA) under contracts FA8750-19-2-0201 and W911NF-20-1-0080, NSF through CAREER DMS-1847415 and an Alfred Sloan Research Fellowship. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12: 149 198, 2000. Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 175, 2003. Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013. T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100 128, 2021. Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 2018. Kurtland Chua, Qi Lei, and Jason D Lee. How fine-tuning allows for effective meta-learning. Advances in Neural Information Processing Systems, 34, 2021. Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Nips, volume 10, pp. 442 450. Citeseer, 2010. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171 4186, 2019. Simon Shaolei Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations, 2021. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126 1135. PMLR, 2017. Ting Gong, Tyler Lee, Cory Stephenson, Venkata Renduchintala, Suchismita Padhy, Anthony Ndirango, Gokce Keskin, and Oguz H Elibol. A comparison of loss weighting strategies for multi task learning in deep neural networks. IEEE Access, 7:141627 141632, 2019. Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In international conference on machine learning, pp. 1311 1320. PMLR, 2017. Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. In Advances in Neural Information Processing Systems, volume 32, 2019. Bruce E Hansen. Sample splitting and threshold estimation. Econometrica, 68(3):575 603, 2000. 
Hangfeng He, Mingyuan Zhang, Qiang Ning, and Dan Roth. Foreseeing the benefits of incidental supervision. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1782 1800, 2021. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015. Published as a conference paper at ICLR 2022 Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. Ontonotes: the 90% solution. In Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers, pp. 57 60, 2006. Jing Jiang and Cheng Xiang Zhai. Instance weighting for domain adaptation in nlp. In Proceedings of the 45th annual meeting of the association of computational linguistics, pp. 264 271, 2007. Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1 54, 2019. MM Kalan and Z Fabian. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Neural Information Processing Systems, 2020. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015. Vladimir Koltchinskii. Local rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593 2656, 2006. Hong Liu, Jeff Z Hao Chen, Colin Wei, and Tengyu Ma. Meta-learning transferable representations with a single target domain. ar Xiv preprint ar Xiv:2011.01418, 2020. Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4487 4496, 2019. Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853 2884, 2016. Bryan Mc Cann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. ar Xiv preprint ar Xiv:1806.08730, 2018. Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? In Advances in Neural Information Processing Systems, volume 33, pp. 512 523, 2020. Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532 1543, 2014. Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of statistical planning and inference, 90(2):227 244, 2000. Erik F Tjong Kim Sang and Sabine Buchholz. Introduction to the conll-2000 shared task: chunking. In Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning-Volume 7, pp. 127 132, 2000. Nilesh Tripuraneni, Michael Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems, volume 33, pp. 7852 7862, 2020. Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pp. 10434 10443. PMLR, 2021. Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer science & business media, 2013. Jon Wellner and Aad van der Vaart. 
Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 2013. Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, and Sam Shleifer. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38 45, 2020. Published as a conference paper at ICLR 2022 Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3712 3722, 2018. Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021. Published as a conference paper at ICLR 2022 A OMITTED PROOFS We start by introducing some notations. For a set S, we let 1S be its indicator function and we use #S and |S| interchangeably to denote its cardinality. For two positive sequences {an} and {bn}, we write an bn or an = O(bn) to denote lim sup an/bn < , and we let an bn or an = Ω(bn) to denote bn an. Meanwhile, the notation an bn or an = Θ(bn) means an bn and an bn simultaneously. For a vector x, we let x denote its ℓ2 norm. In this section, we treat Lℓ, LF, CΦ, CF as absolute constants and we hide the dependence on those parameters in our theoretical results. The exact dependence on those parameters can be easily traced from the proofs. A.1 TAWT AND TASK DISTANCE MINIMIZATION If we estimate L 0(φ) = minf0 F L0(φ, f0) by minf0 F b L0(φ, f0) and φω argminφ Φ PT t=1 ωt minft F Lt(φ, ft) by the argmin of PT t=1 ωt minft F b Lt(φ, ft), then overall, the quantity L 0( φω) can be estimated by min f0 F b L0(φ, f0) subject to φ argmin ψ Φ t=1 ωt b Lt(ψ, ft). Thus, minimizing the estimated task distance over the weights (note that L 0(φ ) is a constant) is equivalent to the optimization formulation (OPT1) of our algorithm. To further relate TAWT with task distance minimization, we provided an analysis of the two-step procedure (2.2) (2.3) with fixed weights. The next theorem gives the corresponding performance guarantees. Theorem A.1 (Performance of the two-step procedure with fixed weights). Let (bφ, bf0) be obtained via solving (2.2) (2.3) with fixed ω. Let Assumptions A and B hold. Fix any δ (0, 1) and define Nω = (PT t=1 ω2 t /nt) 1. There exists a constant C = C(Lℓ, LF, CΦ, CF) > 0 such that the following holds: if for any representation φ in a C p (νΦ + TνF + log δ 1)/Nω-neighborhood of Φω6, there exists a specific φω Φω such that φ is (ρ, Cρ)-transferable, then there exists another C = C (Lℓ, LF, CΦ, CF, Cρ, ρ) such that with probability at least 1 δ, we have L0(bφ, bf0) L0(φ 0, f 0 ) C νF + log(1/δ) 2 + νΦ + TνF + log(1/δ) t=1 ωt Dt, D0 . (A.1) Proof. See Appendix A.3. Note that the task distance naturally appears in the upper bound above. This theorem can be regarded as a predecessor of Theorem 3.1, and we refer readers to Section 3.2 for a detailed exposition on the meaning of each term in the upper bound. A.2 A SUFFICIENT CONDITION FOR EXACTLY MATCHING φω = φ 0 Proposition A.1 (Weighting and task distance minimization). Suppose that for every φ Φ, there exists two source tasks 1 t1, t2 T such that L t1(φ) L 0(φ) L t2(φ). Then there exists a choice of weights ω, possibly depending on φ, such that dist(PT t=1 ωt Dt, D0) = 0. Proof. 
By construction, it suffices to show the existence of some ω such that t=1 ωt L t (φ) = L 0(φ), φ Φ. (A.2) 6Recall that a representation φ is in an ε-neighborhood of Φω if PT t=1 ωt[L t (φ) L t ( φω)] ε. Published as a conference paper at ICLR 2022 By assumption, fix any φ Φ, we can find 1 t1, t2 T such that L t1(φ) L 0(φ) L t2(φ). We set ωt = 0 for any t / {t1, t2}. If L t1(φ) = L 0(φ) = L t2(φ), then any choice of ωt1, ωt2 will suffice. Otherwise, we let ωt1 = L t2(φ) L 0(φ) L t2(φ) L t1(φ), ωt2 = L 0(φ) L t1(φ) L t2(φ) L t1(φ). It is straightforward to check that such a choice of ω indeed ensures (A.2). The proof is concluded. A.3 PROOF OF THEOREM A.1 We start by stating two useful lemmas. Lemma A.1 (Error for learning the imperfect representation from source data). Under the setup of Theorem A.1, there exists a constant C1 = C1(Lℓ, LF, CΦ, CF) > 0 such that for any δ (0, 1), we have T X t=1 ωt[L t (bφ) L t ( φω)] C1 νΦ + TνF + log(1/δ) with probability at least 1 δ. Proof. See Appendix A.3.1. Lemma A.2 (Error for learning the task-specific function from target data). Under the setup of Theorem A.1, there exists a constant C2 = C2(Lℓ, CF) > 0 such that for any δ (0, 1), we have L0(bφ, bf0) L 0(bφ) C2 νF + log(1/δ) with probability at least 1 δ. Proof. See Appendix A.3.2. To prove Theorem A.1, we start by writing L0(bφ, bf0) L0(φ 0, f 0 ) = L0(bφ, bf0) L 0(bφ) + L 0(bφ) L 0( φω) + L 0( φω) L 0(φ 0). Suppose the two high probability events in Lemmas A.1 and A.2 hold. Then we can bound L0(bφ, bf0) L 0(bφ) by (A.4). Meanwhile, since bφ is in a C p (νΦ + TνF + log δ 1)/Nω-neighborhood of Φω, we can invoke the transferability assumption to get L 0(bφ) L 0( φω) Cρ t=1 ωt[L t (φ) L t ( φω)] 1/ρ CρC1/ρ 1 νΦ + TνF + log(1/δ) Moreover, we have the trivial bound that L 0( φω) L 0(φ 0) sup φω L 0( φω) L 0(φ 0). Assembling the three bounds above gives the desired result. A.3.1 PROOF OF LEMMA A.1 We start by writing t=1 ωt[L t (bφ) L t ( φω)] t=1 ωt[L0(bφ, bft) Lt( φω, f ω t )] Lt(bφ, bft) b Lt(bφ, bft) + b Lt(bφ, bft) b Lt( φω, f ω t ) + b Lt( φω, f ω t ) Lt( φω, f ω t ) Published as a conference paper at ICLR 2022 Lt(bφ, bft) b Lt(bφ, bft) + b Lt( φω, f ω t ) Lt( φω, f ω t ) sup φ Φ,{ft} F ωt Lt(φ, ft) b Lt(φ, ft) + b Lt( φω, f ω t ) Lt( φω, f ω t ) = sup φ Φ,{ft} F Lt(φ, ft) ℓ(f0 φ(xti), yti) + ℓ( f ω t φω(xti), yti) Lt( φω, f ω t ) , where the first inequality is by L t ( ) = minft F Lt( , ft) and the second inequality is by the fact that (bφ, { bft}T t=1) is a minimizer of (2.2). To simplify notations, let zti = (xti, yti) and let the right-hand side above be G({zti}). Fix two indices 1 t T, 1 it nt, and let {ezti} be the source datasets formed by replacing zt ,it with some ezt ,it = (ext ,it , eyt ,it ) X Y. Since {zti} and {ezti} differ by only one example, we have Lt(φ, ft) ℓ(ft φ(xti), yti) + ℓ( f ω t φω(xti), yti) + Lt( φω, f ω t ) Lt (φ, ft ) ℓ(ft φ(xt ,i), yt ,i) + ℓ( f ω t φω(xt ,i), yt ,i) + Lt ( φω, f ω t ) Lt (φ, ft ) ℓ(ft φ(xt ,it ), yt ,it ) + ℓ( f ω t φω(xt ,it ), yt ,it ) + Lt ( φω, f ω t ) Lt (φ, ft ) ℓ(ft φ(xt ,ii ), yt ,it ) Lt ( φω, f ω t ) ℓ( f ω t φω(xt ,it ), yt ,it ) + Lt (φ, ft ) ℓ(ft φ(ext ,ii ), eyt ,it ) Lt ( φω, f ω t ) ℓ( f ω t φω(ext ,it ), eyt ,it ) where the last inequality is by the fact that the loss function is bounded in [0, 1]. Taking the supremum over φ Φ, {ft} F at both sides, we get G({zti}) G({ezti}) 4ωt /nt . A symmetric argument shows that the reverse inequality, namely G({ezti}) G({zti}) 4ωt /nt , is also true. 
That is, we have shown |G({zti}) G({ezti})| 4ωt This means that we can invoke Mc Diarmid s inequality to get P G({zti}) E[G({zti})] ε exp 2ε2 PT t=1 Pnt i=1 16ω2 t /n2 t for any ε > 0, or equivalently G({zti}) E[G({zti})] + 2 with probability at least 1 δ for any δ (0, 1). To bound the expectation term, we use a standard symmetrization argument (see, e.g., Lemma 11.4 in (Boucheron et al., 2013)) to get NωE[G({zti})] 2 p NωE sup φ Φ,{ft} F ℓ(ft φ(xti), yti)+ℓ( ft φ(xti), yti) , where the expectation is taken over the randomness in both the source datasets {zti} and the i.i.d. symmetric Rademacher random variables {εti}. Consider the function space G := {(φ, {ft}T t=1) : φ Φ, {ft} F}. Let nt [ ℓ(ft φ(xti), yti)], g = (φ, {ft}T t=1) G Published as a conference paper at ICLR 2022 be the empirical process indexed by the function space G . Conditional on the randomness in the data {zti}, this is a Rademacher process with sub-Gaussian increments: log Eeλ(Mg Meg) λ2 2 d2(g, eg), λ 0, g = (φ, {ft}T t=1), eg = (eφ, { eft}T t=1) G , where the pseudometric d2(g, eg) := Nω ℓ(ft φ(xti), yti) ℓ( eft eφ(xti), yti) 2 Nω ω2 t nt = 1. Thus, we can invoke Dudley s entropy integral inequality (see, e.g., Corollary 13.2 in (Boucheron et al., 2013)) to get E[sup g G Mg M g | {zti}] Z 1 log N(G ; d; ε)dε, where g = ( φω, { f ω t }), and N(G ; d; ε) is the ε-covering number of G with respect to the pseudometric d. Taking expectation over the randomness in {zti}, we get E[G({zti})] N 1/2 ω Z 1 log N(G ; d; ε)dε. (A.7) We now bound the covering number of G . To do so, we define t=1 Nω ω2 t nt nt , Qt := 1 where δxti is a point mass at xti. Let {φ(1), . . . , φ(Nε)} Φ be an ε-covering of Φ with respect to L2(Q), where Nε = N(Φ; L2(Q); ε). This means that for any φ Φ, there exists j {1, . . . , Nε} such that φ φ(j) 2 L2(Q) := i=1 Nω ω2 t n2 t φ(xti) φ(j)(xti) 2 ε2. Now for each j {1, . . . , Nε} and t {1, . . . , T}, let {f (j,1) t , . . . , f (j,N (j) ε ) t } be an ε-covering of F with respect to L2(φ(j)#Qt), where φ(j)#Qt is the pushforward of Qt by φ(j), and N (j) ε = N(F; L2(φ(j)#Qt); ε) has no dependence on t due to the uniform entropy control from Assumption B. This means that for any ft F, j {1, . . . , Nε}, there exists k {1, . . . , N (j) ε } such that ft f (j,k) t 2 L2(φ(j)#Qt) := 1 ft φ(j)(xti) f (j,k) t φ(j)(xti) 2 ε2. Now, let us fix (φ, {ft}T t=1) G . By construction, we can find j {1, . . . , Nε} and kt {1, . . . , N (j) ε } for any 1 t T such that φ φ(j) L2(Q) ε, ft f (j,kt) t L2(φ(j)#Qt) ε, 1 t T. (A.8) Thus, we have d2 (φ, {ft}T t=1), (φ(j), {f (j,kt) t }T t=1) ℓ(ft φ(xti), yti) ℓ(f (j,kt) t φ(j)(xti), yti) 2 ft φ(xti) f (j,kt) t φ(j)(xti) 2 ft φ(xti) ft φ(j)(xti) + ft φ(j)(xti) f (j,kt) t φ(j)(xti) 2 Published as a conference paper at ICLR 2022 ω2 t n2 t L2 F φ(xti) φ(j)(xti) 2 + 2L2 ℓNω ft φ(j)(xti) f (j,kt) t φ(j)(xti) 2 = 2L2 ℓL2 F φ φ(j) 2 L2(Q) + 2L2 ℓNω ω2 t nt ft f (j,kt) t 2 L2(φ(j)#Qt) 2L2 ℓ(L2 F + 1)ε2. This yields L2 F + 1 ε) (φ(j), {f (j,kt) t }T t=1) : 1 j Nε, 1 kt N (j) ε , 1 t T = Nε (N (j) ε )T from which we get log N(G ; d; ε) νΦ log(CΦLℓ q 2(L2 F + 1)) + TνF log(CFLℓ q 2(L2 F + 1)) + (νΦ + TνF) log(1/ε) (νΦ + TνF)(1 + log(1/ε)). Plugging the above inequality to (A.7), we get E[G({zti})] N 1/2 ω p νΦ + TνF 1 + Z 1 The proof is concluded by plugging the above inequality to (A.6). A.3.2 PROOF OF LEMMA A.2 Since bφ is obtained from the source datasets {St}T t=1, it is independent of the target data S0. 
Throughout the proof, we condition on the randomness in the source datasets, thus effectively treating bφ as fixed. Let f0,bφ argminf0 F L0(bφ, f). We start by writing L0(bφ, bf0) L 0(bφ) = L0(bφ, bf0) b L0(bφ, bf0) + b L0(bφ, bf0) b L0(bφ, f0,bφ) + b L0(bφ, f0,bφ) L0(bφ, f0,bφ) L0(bφ, bf0) b L0(bφ, bf0) + b L0(bφ, f0,bφ) L0(bφ, f0,bφ) sup f0 F L0(bφ, f0) b L(bφ, f0) + b L0(bφ, f0,bφ) L0(bφ, f0,bφ). The right-hand side above is an empirical process indexed by f0 F. Using similar arguments as those appeared in the proof of Lemma A.1, we have sup f0 F L0(bφ, f0) b L(bφ, f0) + b L0(bφ, f0,bφ) L0(bφ, f0,bφ) νF + log(1/δ) with probability at least 1 δ, which is exactly the desired result. A.4 PROOF OF THEOREM 3.1 The proof bears similarities to the proof of Theorem A.1, with additional complications in ensuring a uniform control over the learned weights. Let f0,bφ argminf0 F L0(bφ, f0). Note that L0(bφ, bf0) L0(φ 0, f 0 ) = L0(bφ, bf0) b L(2) 0 (bφ, bf0) + b L(2) 0 (bφ, bf0) b L(2) 0 (bφ, f0,bφ) + b L(2) 0 (bφ, f0,bφ) L0(bφ, f0,bφ) + L 0(bφ) L 0( φbω) + L 0( φbω) L 0(φ 0) Published as a conference paper at ICLR 2022 L0(bφ, bf0) b L(2) 0 (bφ, bf0) + b L(2) 0 (bφ, f0,bφ) L0(bφ, f0,bφ) + L 0(bφ) L 0( φbω) + L 0( φbω) L 0(φ 0). (A.9) Let bft argminft F b Lt(bφ, ft). We then have t=1 bωt[L t (bφ) L t ( φbω)] t=1 bωt[Lt(bφ, bft) Lt( φbω, f bω t )] Lt(bφ, bft) b Lt(bφ, bft) + b Lt(bφ, bft) b Lt( φbω, f bω t ) + b Lt( φbω, f bω t ) Lt( φbω, f bω t ) Lt(bφ, bft) b Lt(bφ, bft) + b Lt( φbω, f bω t ) Lt( φbω, f bω t ) 2 sup φ Φ,{ft} F,ω Wβ Lt(φ, ft) b Lt(φ, ft) + b Lt( φω, f ω t ) Lt( φω, f ω t ) . Let zti = (xti, yti), and let {ezti} be the source datasets formed by replacing zt ,it with ezt ,it . Let G({zti}) := sup φ Φ,{ft} F,ω Wβ Lt(φ, ft) b Lt(φ, ft) + b Lt( φω, f ω t ) Lt( φω, f ω t ) . Then conducting a similar calculation to what led to (A.5), we get |G({zti}) G({ezti})| 4ωt where the last inequality is by the fact that ω Wβ implies ωt β/T for any 1 t T. Now, invoking Mc Diarmid s inequality, we get G({zti}) EG({zti}) + O β with probability at least 1 δ. Now, a standard symmetrization argument plus an application of Dudley s entropy integral bound (similar to what led to (A.7)) gives E[G({zti})] (n T/β2) 1/2 Z 1 log N(G ; d; ε)dε, (A.11) where now G := {(φ, {ft}T t=1, ω) : φ Φ, {ft} F, ω Wβ}, and d2 (φ, {ft}T t=1, ω), (eφ, { eft}T t=1, eω) ωtℓ(ft φ(xti), yti) eωtℓ( eft eφ(xti), yti) 2 ℓ(ft φ(xti), yti) ℓ( eft eφ(xti) 2 + n T 1 n2 t (ωt eωt)2[ℓ( eft eφ(xti), yti)]2 ℓ(ft φ(xti), yti) ℓ( eft eφ(xti) 2 + T 2 ω eω 2, where the last inequality is by nt n, ωt β/T for any 1 t T and β 1. This means that we can construct a Cε-covering of G by the following two steps (where C is an absolute constant only depending on Lℓand LF): (1) cover the space {(φ, {ft}T t=1)} by the same construction as (A.8); Published as a conference paper at ICLR 2022 (2) construct an ε/T-covering of Wβ with at most (c T/ε)T many points, where c is an absolute constant. Overall, we can construct a Cε covering G with (CΦ/ε)νΦ (CF/ε)T νF (c T/ε)T many points. Hence, we have log N(G ; d; ε) (νΦ + TνF)(log(1/ε) + 1) + T(1 + log T + log(1/ε)) (νΦ + TνF)(log(1/ε) + 1) + T log T, where the last inequality is by νF 1. Plugging this inequality back to (A.10) and (A.11), we get t=1 bωt[L t (bφ) L t ( φbω)] β νΦ + log(1/δ) n T + νF + log T with probability at least 1 δ. 
This means that under this high probability event, we can invoke the transferability assumption to conclude the existence of a specific φbω with L 0(bφ) L 0( φbω) Cρβ1/ρ νΦ + log(1/δ) n T + νF + log T Recalling (A.9), we arrive at L0(bφ, bf0) L0(φ 0, f 0 ) L0(bφ, bf0) b L(2) 0 (bφ, bf0) + b L(2) 0 (bφ, f0,bφ) L0(bφ, f0,bφ) + sup φ b ω L 0( φbω) L 0(φ 0) + O Cρβ1/ρ νΦ + log(1/δ) n T + νF + log T L0(bφ, f0) b L(2) 0 (bφ, f0) + b L(2) 0 (bφ, f0,bφ) L0(bφ, f0,bφ) + sup φ b ω L 0( φbω) L 0(φ 0) + O Cρβ1/ρ νΦ + log(1/δ) n T + νF + log T with probability 1 δ. Since bφ is independent of the second batch of target data {(x0i, y0i) : i B2} and |B2| n0, a nearly identical argument as that appeared in the proof of Lemma A.2 gives L0(bφ, f0) b L(2)(bφ, f0) + b L(2) 0 (bφ, f0,bφ) L0(bφ, f0,bφ) νF + log(1/δ) with probability at least 1 δ. We conclude the proof by invoking a union bound. B EXPERIMENTAL DETAILS AND ADDITIONAL RESULTS Learning Paradigm Target Task Po S Predicate Detection NER Avg Single-Task Learning 85.06 71.51 54.96 70.51 Pre-Training 88.66 78.57 58.22 75.15 Weighted Pre-Training 89.31 *** 79.21 *** 60.09 *** 76.20 Joint Training 87.23 74.90 59.55 73.89 Weighted Joint Training 90.71 *** 76.29 *** 64.49 *** 77.16 Table 2: Compared to Table 1, more training examples are considered here. Here we only consider the three tasks in Ontonotes 5.0, i.e., Po S tagging, predicate detection, and NER. For each setting, we choose one task as the target task and the remaining two tasks as source tasks. We randomly choose 20K training sentences for each source task respectively. As for the target task, we randomly choose 500, 600, 1000 training sentences for Po S tagging, predicate detection, and NER respectively. In this section, we briefly highlight some important settings in our experiments. More details can be found in our released code. It usually costs about half an hour to run the experiment for each setting (e.g. one number in Table 1) on one Ge Force RTX 2080 GPU. Published as a conference paper at ICLR 2022 Learning Paradigm Setting Po S + NER Po S + Chunking + NER Po S + Chunking + Predicate Detection Avg Single-Task Learning 68.47 68.47 78.00 71.65 Joint Training 65.70 67.12 79.81 70.88 Weighted Joint Training 69.58 69.73 80.38 73.23 Table 3: Additional experiments to illustrate the benefits of weighted joint training compared to joint training. For each setting, the task in the bold text is the target task and the remaining tasks are source tasks. The improvement from weighted training in the joint training paradigm under various settings indicates the effectiveness of TAWT. Learning Paradigm Ratio 1 2 4 8 16 32 64 128 Avg Single-Task Learning 30.08 30.08 30.08 30.08 30.08 30.08 30.08 30.08 30.08 Normalized Joint Training 38.55 36.53 42.37 54.18 62.08 63.58 62.85 62.33 52.81 Weighted Normalized Joint Training 42.31 42.13 51.64 59.08 63.46 64.86 64.08 63.65 56.40 Upper Bound 62.69 70.31 75.65 79.39 81.59 83.19 84.18 - - Table 4: The performance for the analysis on the ratio between the training sizes of the source and target datasets in Fig. 1. In this part, we use NER as the target task and source tasks include Po S tagging and predicate detection. We keep the training size of the target task as 500 and change the ratio from 1 to 128 on a log scale. Single-task learning denotes learning only with the small target data, and upper bound denotes learning with the target data that has the same size as the overall size of all datasets in cross-task learning. 
"-" means that we do not have enough training data for the upper bound in that setting. The sampling strategy7 for training examples we used for the analysis is a little different from that used in the experiments in Table 1, so that the performance of single-task learning for the same setting can be a little different. Data. In our experiments, we mainly use two widely-used NLP datasets, Ontontes 5.0 (Hovy et al., 2006) and Co NLL-2000 (Tjong Kim Sang & Buchholz, 2000). Ontonotes 5.0 is a large multilingual richly annotated corpus covering a lot of NLP tasks, and we use the corresponding English annotations for three sequence tagging tasks, Po S tagging, predicate detection, and NER. Co NLL-2000 is a shared task for another sequence tagging task, chunking. There are about 116K sentences, 16K sentences, and 12K sentences in the training, development, and test sets for tasks in Ontonotes 5.0. The average sentence length for sentences in Ontonotes 5.0 is about 19. As for Co NLL-2000, there are about 9K sentences and 2K sentences in the training and test sets. The corresponding average sentence length is about 24. Tasks. We consider four common sequence tagging tasks in NLP, Po S tagging, chunking, predicate detection, and NER. Po S tagging aims to assign a particular part of speech, such as nouns and verbs, for each word in the sentence. Chunking divides a sentence into syntactically related nonoverlapping groups of words and assigns them with specific types, such as noun phrases and verb phrases. Predicate detection aims to find the corresponding verbal or nominal predicates for each sentence. NER seeks to locate and classify named entities mentioned in each sentence into pre-defined categories such as person names, organizations, and locations. Based on the above datasets, there are 50 labels in Po S tagging, 23 labels in chunking, 2 labels in predicate detection, and 37 labels in NER. As for the evaluation metric, we use accuracy for Po S tagging, span-level F1 for chunking, token-level F1 for predicate detection, and span-level F1 for NER. The model. We use BERT as our basic model in our main experiments. Specifically, we use the pretrained case-sensitive BERT-base Py Torch implementation (Wolf et al., 2020). We use the common parameter settings for BERT. Specifically, the max length is 128, the batch size is 32, the epoch number is 4, and the learning rate is 5e 5. In the BERT model, the task-specific function is the last-layer linear classifier, and the representation model is the remaining part. 7Specifically, in the main experiments, we first randomly sample 9K sentences and then randomly sample a specific number of sentences (e.g. 500 sentences for NER) among the 9K sentences for the training set of the target task. In the analysis, we first randomly sample 100K sentences and then randomly sample a specific number of sentences (e.g. 500 sentences for NER) among the 100K sentences for the training set of the target task. We use a different sampling strategy for training sentences in main experiments and analysis, simply because the largest training size of the data we considered in the two situations is different. 
Learning Paradigm | Target size = 500 | 1000 | 2000 | 4000 | 8000 | Avg
Single-Task Learning | 30.08 | 54.26 | 68.05 | 74.77 | 79.20 | 61.27
Normalized Joint Training (Ratio=8) | 54.18 | 64.69 | 71.33 | 76.95 | 80.37 | 69.50
Weighted Normalized Joint Training (Ratio=8) | 59.08 | 66.00 | 72.30 | 76.95 | 80.64 | 70.99
Upper Bound | 79.39 | 81.49 | 83.53 | 84.26 | - | -
Normalized Joint Training (Ratio=10) | 58.06 | 65.94 | 71.48 | 77.33 | 80.34 | 70.63
Weighted Normalized Joint Training (Ratio=10) | 60.96 | 67.45 | 72.52 | 77.44 | 79.94 | 71.66
Upper Bound | 79.70 | 82.32 | 83.39 | 84.84 | - | -

Table 5: The performance for the analysis of the training size of the target dataset in Fig. 1. In this part, we use NER as the target task, and the source tasks are PoS tagging and predicate detection. We keep the ratio between the training sizes of the source and target datasets at 8 (or 10) and change the training size of the target dataset from 500 to 8,000 on a log scale. Single-task learning denotes learning only with the small target dataset, and upper bound denotes learning with target data whose size equals the overall size of all datasets in cross-task learning. "-" means that we do not have enough training data for the upper bound in that setting.

Learning Paradigm | Target task: PoS | Chunking | Predicate Detection | NER
Weighted Pre-Training | (0.68, 0.01, 0.31) | (0.32, 0.30, 0.37) | (0.33, 0.35, 0.32) | (0.92, 0.05, 0.04)
Weighted Joint Training | (0.04, 0.04, 0.03); target 0.89 | (0.04, 0.04, 0.05); target 0.87 | (0.04, 0.04, 0.05); target 0.87 | (0.04, 0.04, 0.05); target 0.87
Weighted Normalized Joint Training | (0.0005, 0.0004, 0.0004); target 0.9986 | (0.0005, 0.0005, 0.0004); target 0.9987 | (0.002, 0.002, 0.002); target 0.995 | (0.003, 0.003, 0.003); target 0.990

Table 6: The final learned weights on tasks in the experiments in Table 1. For each setting, the final weights on the three source tasks are given as a three-element tuple; the source tasks are always ordered as PoS tagging, chunking, predicate detection, and NER, with the target task excluded. For example, the tuple (0.68, 0.01, 0.31) for the target task PoS tagging under weighted pre-training indicates that the final weights on the three source tasks (chunking, predicate detection, NER) are 0.68, 0.01, and 0.31, respectively. For TAWT in the joint training paradigms, there is an extra weight on the target task, shown after the source-task tuple in each cell.

Cross-task learning paradigms. In our experiments, we consider two cross-task learning paradigms: pre-training and joint training. Pre-training first pre-trains the representation part on the source data and then fine-tunes the whole target model on the target data. Joint training uses both source and target data to train the shared representation model and the task-specific functions for both source and target tasks at the same time. For the multi-task learning part in both pre-training and joint training, we adopt the same multi-task learning algorithm as in MT-DNN (Liu et al., 2019) (a sketch of one epoch of weighted joint training is given below).

Experiments with more target training examples. Compared to the main experiments in Sec. 4, we further experiment with more training examples in the target data for the three tasks in Ontonotes 5.0. Since chunking is not involved here, we also use more training examples for the source tasks. As shown in Table 2, TAWT is still beneficial even with more training examples, though its relative improvement is smaller than with fewer training examples.
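To make the joint-training paradigm concrete, here is a hedged sketch of one epoch of weighted joint training under an MT-DNN-style schedule (one task per mini-batch). The function name, the data-loader layout, and the exact batch scheduling are our own simplifications for illustration rather than the released implementation.

```python
# Minimal sketch of one epoch of weighted joint training on a shared encoder.
import random
import torch.nn as nn

def weighted_joint_training_epoch(encoder, heads, loaders, weights, optimizer):
    """encoder: shared representation model; heads[t]: per-task linear classifiers;
    loaders[t]: per-task DataLoaders yielding (input_ids, attention_mask, labels);
    weights[t]: current task weights (summing to 1)."""
    loss_fn = nn.CrossEntropyLoss()
    # Interleave mini-batches from all tasks, MT-DNN style (one task per batch).
    # Materializing all batches up front is a simplification for clarity.
    batches = [(t, batch) for t, loader in loaders.items() for batch in loader]
    random.shuffle(batches)
    for t, (input_ids, attention_mask, labels) in batches:
        hidden = encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        logits = heads[t](hidden)
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        (weights[t] * loss).backward()   # task-weighted gradient step
        optimizer.step()
        optimizer.zero_grad()
```

Under this view, pre-training corresponds to running such weighted multi-task epochs on the source tasks only, followed by fine-tuning the whole target model on the target data.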
Additional experiments to illustrate the benefits of TAWT. In this part, we conduct additional experiments with TAWT on joint training with fewer source tasks compared to the main experiments in Sec. 4. In Table 3, the target task is the last task listed in each setting (shown in bold in the original table); for example, in the setting of PoS + Chunking + NER, NER is the target task and the other two tasks are source tasks. For PoS + NER, we randomly sample 20K training sentences for PoS tagging and 2K training sentences for NER. For PoS + Chunking + NER, we randomly sample 20K, 9K8, and 2K training sentences for PoS tagging, chunking, and NER, respectively. For PoS + Chunking + Predicate Detection, we randomly sample 20K, 9K, and 2K training sentences for PoS tagging, chunking, and predicate detection, respectively. As shown in Table 3, TAWT is still beneficial under these diverse settings, which complements our main experiments in Sec. 4.

8 We choose 9K training sentences for chunking because the training size of the chunking dataset is 8,936.

Learning Paradigm | Target task: PoS | Chunking | Predicate Detection | NER | Avg
Weighted Pre-Training (fixed weights) | 49.17 | 73.05 | 74.26 | 39.84 | 59.08
Weighted Pre-Training (dynamic weights) | 51.17 | 73.41 | 75.77 | 46.23 | 61.64
Weighted Normalized Joint Training (fixed weights) | 86.93 | 90.12 | 74.90 | 63.60 | 78.89
Weighted Normalized Joint Training (dynamic weights) | 86.07 | 90.62 | 76.67 | 63.44 | 79.20

Table 7: Comparison between dynamic weights and fixed final weights. There are four tasks in total: PoS tagging, chunking, predicate detection, and NER. For each setting, we choose one task as the target task (given in the column header) and the remaining three tasks as source tasks.

Figure 2: Extension to weighted-sample training (two panels comparing the performance (%) of single-task learning, joint training, and weighted-sample joint training, and of single-task learning, pre-training, and weighted-sample pre-training). The weighted training algorithm can be easily extended from weights on tasks to weights on samples. For the experiments on weighted-sample training, we use PoS tagging on entity words as the source task and named entity classification on entity words as the target task. Note that the settings for weighted-sample training are quite different from those for weighted-task training in the remaining parts because weighted-sample training is much more costly than weighted-task training.

Analysis of some crucial factors. As shown in Fig. 1, we analyze two crucial factors that affect the improvement of TAWT. First, in general, we find that the improvement of TAWT in normalized joint training is larger when the ratio between the training sizes of the source and target datasets is smaller, though the largest improvement may not be achieved at the smallest ratio. Note that the improvement vanishes at a large ratio mainly because the baseline (normalized joint training) is already good enough at a large ratio. Second, the improvement from TAWT decreases as the training size of the target data increases. In short, TAWT is more beneficial when the base model performs worse, either with less target data or with a smaller ratio. The specific performance for the experiments in our analysis of these two crucial factors, the ratio between the training sizes of the source and target datasets and the training size of the target dataset, can be found in Table 4 and Table 5, respectively.
Dynamic weights analysis. As for the experiments in Table 1, there are four tasks in total: PoS tagging, chunking, predicate detection, and NER. For each setting, we choose one task as the target task and the remaining three tasks as source tasks. The final learned weights on tasks are shown in Table 6. To better understand our algorithm, we compare TAWT with dynamic weights against TAWT with fixed final weights. TAWT with dynamic weights is the default setting for our algorithm, where the weights on tasks differ from epoch to epoch. For TAWT with fixed weights, we simply initialize the weights to the final weights learned by our algorithm and keep them fixed during training. As shown in Table 7, TAWT with dynamic weights (the default) is slightly better than TAWT with fixed final weights. This indicates that fixed weights might not be a good choice for weighted training, because the importance of different source tasks may change during training. In other words, it might be better to use weighted training with dynamic weights, where the weights are automatically adjusted based on the state of the trained model.

Extension to the weighted-sample training. As shown in Fig. 2, TAWT can be easily extended from weights on tasks to weights on samples. We can see that TAWT based on weighted-sample training is also beneficial for both pre-training and joint training. Because weighted-sample training is much more costly than weighted-task training, we choose a much simpler setting here. In this part, we use PoS tagging on entity words as the source task and named entity classification on entity words as the target task. Note that both tasks considered here are word-level classification tasks, i.e., predicting the label for a given word in a named entity. We still use Ontonotes 5.0 (Hovy et al., 2006) as our dataset. There are 50 labels in PoS tagging on entity words and 18 labels9 in named entity classification on entity words. There are 37,534 examples in the development set and 23,325 examples in the test set for both the source and target tasks. As for the training set, we randomly sample 1,000 examples for the source task and 100 examples for the target task. As for the model, we use two-layer NNs with 5-gram features; the two-layer NNs have a hidden size of 4096, ReLU non-linear activation, and a cross-entropy loss.

9 In NER, we have 37 labels because each type of entity has two label variants (B/I for beginning/inside) and one extra label O for non-entity words is also considered.
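A minimal sketch of this two-layer network is given below. We assume here that the 5-gram features are the concatenated 50-dimensional GloVe embeddings of a five-word window around the target word; the exact featurization is in the released code, and the class and variable names are illustrative only.

```python
# Minimal sketch of the two-layer network used for weighted-sample training.
# Assumption: "5-gram features" = concatenated 50-d GloVe embeddings of a 5-word window.
import torch.nn as nn

EMB_DIM, WINDOW, HIDDEN = 50, 5, 4096

class TwoLayerTagger(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(WINDOW * EMB_DIM, HIDDEN),  # representation layer
            nn.ReLU(),
        )
        self.classifier = nn.Linear(HIDDEN, num_labels)  # task-specific function

    def forward(self, features):          # features: (batch, WINDOW * EMB_DIM)
        return self.classifier(self.net(features))

# Trained with a cross-entropy loss; see the training details below (batch 128, Adam, lr 3e-4).
```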
As for the embeddings, we use 50-dimensional GloVe embeddings (Pennington et al., 2014). The majority baseline for named entity classification on entity words on the test data is 20.17%. As for training, the batch size is 128, the optimizer is Adam (Kingma & Ba, 2015) with a learning rate of 3e-4, and the number of training epochs is 200. As for updating the weights on samples, we update the weights every 5 epochs with the mirror descent step in Eq. 2.8, giving 40 weight updates in total. In this part, we approximate the inverse of the Hessian matrix $\big[\sum_{t=1}^T \omega_t \nabla^2_\phi \hat{\mathcal{L}}_t(\phi^{k+1}, f^{k+1}_t)\big]^{-1}$ in Eq. 2.7 by a constant multiple of the identity matrix, as in MAML (Finn et al., 2017), and choose the constant to be 5. According to our experiments, the results of this approximation are similar to those of the approximation used in Sec. 4. The corresponding results are shown in Fig. 2. In the future, we also plan to group the instances and assign a weight to each group rather than to each sample.

Settings for simulations on task distance. In this part, we first randomly generate 10K examples D, where the dimension of the inputs is 1,000 and the corresponding labels are in 0-9 (10 classes). Each dimension of the input is randomly sampled from a uniform distribution over [-0.5, 0.5]. For each flip rate q% in [0, 1], we randomly choose a q% fraction of the data and replace their labels with a random label uniformly sampled from the 10 classes. The data with flip rate q% is denoted as D_q% (including D_0.0 for the original dataset). For each dataset D_q%, we train a two-layer NN M_q% to 100% accuracy and almost zero training loss on the dataset. This network can then be used to generate data for the task T_q%. We use T_0.0 as the target task and vary q% from 0.0 to 1.0 for the source task. Based on this data generation process, we can see that the flip rate q% plays an important role in the intrinsic difference between the source task T_q% and the target task T_0.0. In general, we can expect that a smaller flip rate q% indicates a smaller task distance between the source task T_q% and the target task T_0.0. This data generation process is inspired by the teacher-student networks of Hinton et al. (2015) (a minimal sketch of the generation process is given at the end of this appendix). As shown in Fig. 3, we keep the training size of the source task at 10,000 and change the training size of the target task from 10 to 10,000 on a log scale. For simplicity, we use the same architecture for both the teacher network used for generating data and the student network used for learning from the data, i.e., a two-layer neural network with a hidden size of 4096, an input size of 1,000, and an output size of 10. As for training, the batch size is 100, the optimizer is Adam (Kingma & Ba, 2015) with a learning rate of 3e-4, and the number of training epochs is 100.

Figure 3: Illustration of the task distance (performance (%) versus the training size of the target task). We find that the source data is more beneficial when the task distance between the source data and the target data is smaller or when the size of the target data is smaller. In this part, we keep the training size of the source task at 10,000 and change the training size of the target task from 10 to 10,000 on a log scale. STL denotes single-task learning only with the target data. LTL denotes the learning-to-learn paradigm, where we first learn the representation on the source data and then learn the task-specific function on the target data. For the learning-to-learn paradigm, we consider source tasks with flip rates from 0.0 to 1.0, where the flip rate is an important factor in generating the source data and a lower flip rate indicates a smaller task distance between the source data and the target data.

Details on using existing assets. As for the code and pre-trained models for BERT, we download them directly from the Hugging Face GitHub repository, whose license is Apache-2.0; more details can be found at https://github.com/huggingface/transformers. As for Ontonotes 5.0, we obtain the data from the LDC; the corresponding license and other details can be found at https://catalog.ldc.upenn.edu/LDC2013T19. As for the CoNLL-2000 shared task, we download the data directly from the website, and more details are at https://www.clips.uantwerpen.be/conll2000/chunking.
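As referenced above, here is a minimal sketch of the data generation for the task-distance simulation. All names are illustrative, and how the labels of the base dataset D are drawn is our assumption, since the text above does not specify it; everything else follows the description in this appendix.

```python
# Minimal sketch of the synthetic data generation for the task-distance simulation.
import torch
import torch.nn as nn

NUM_EXAMPLES, INPUT_DIM, NUM_CLASSES, HIDDEN = 10_000, 1000, 10, 4096

def make_flipped_dataset(flip_rate: float, seed: int = 0):
    """Build D_q%: inputs uniform on [-0.5, 0.5]^1000 with labels in {0, ..., 9},
    where a flip_rate fraction of labels is replaced by uniformly random labels.
    The base labels are drawn uniformly at random here (an assumption; the text
    does not specify how D's labels are generated)."""
    g = torch.Generator().manual_seed(seed)
    x = torch.rand(NUM_EXAMPLES, INPUT_DIM, generator=g) - 0.5
    y = torch.randint(0, NUM_CLASSES, (NUM_EXAMPLES,), generator=g)
    flip = torch.rand(NUM_EXAMPLES, generator=g) < flip_rate
    y[flip] = torch.randint(0, NUM_CLASSES, (int(flip.sum()),), generator=g)
    return x, y

def make_teacher():
    """Two-layer teacher network M_q%, same architecture as the student."""
    return nn.Sequential(nn.Linear(INPUT_DIM, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, NUM_CLASSES))

# Usage: fit a teacher on (x, y) from make_flipped_dataset(q) until it reaches roughly
# 100% training accuracy, then label fresh inputs with the teacher to define task T_q%.
```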
C ON THE CHOICE OF EXPERIMENTAL SETTINGS

In this part, we clarify the choice of experimental settings from the following perspectives.

Signal selection: TAWT can also be applied to cross-domain or cross-lingual signals. Compared to these two types of signals, we think cross-task signals are more widespread and more difficult to use efficiently. Therefore, in our experiments, we choose the most difficult and crucial signals, cross-task signals, to verify the effectiveness of TAWT. Another reason is that we did not find any existing weighted training algorithms with theoretical guarantees for cross-task learning, while weighted training is common for domain adaptation, such as importance sampling.

The choice of similar domains: We choose tasks from similar domains. Ideally, TAWT can be used for cross-task signals across different domains or languages. However, the gap between different domains or languages can make cross-task learning more complicated, so we instead consider cross-task learning in similar English domains to simplify the settings. Note that similar domains do not mean the datasets overlap: we randomly sample sentences for different tasks independently and without repetition, so only a small fraction of sentences overlap in our case.

Area selection: We choose to experiment with NLP tasks because we think cross-task signals are more common in NLP and NLP tasks are more diverse than tasks in other areas.

Dataset selection: We choose Ontonotes 5.0 as our main dataset because it is widely used and provides large-scale expert annotations (on 2.9 million words) for a wide range of NLP tasks. This enables us to focus on learning with various tasks in similar domains. As for CoNLL-2000, we add it mainly because we want to analyze the impact of chunking on NER.

Task selection: On the one hand, sequence tagging is more challenging than classification tasks. On the other hand, the evaluation of sequence tagging is more reliable than that of generative tasks. Another reason is that Ontonotes 5.0 mainly covers sequence tagging tasks.

Model selection: We choose BERT for our main experiments because BERT is widely used and SOTA models are largely slight improvements over BERT. For weighted-sample training, we instead choose two-layer NNs with 5-gram features because they are simple and fast. Similar results could be shown with more complex models, but exhaustive experimentation is not our goal.

The choice of the low-resource setting: There is an intrinsic trade-off between the base performance of single-task learning and the relative improvement from cross-task learning. Specifically, if the base performance of single-task learning is good enough, adding cross-task signals can introduce more noise than information. In our experiments, we consider a simple low-resource setting where only a few target examples are available. We can still see the effectiveness of TAWT in cross-task learning as long as the base performance is not too high, but the relative improvement of TAWT will be smaller than in the low-resource setting.