AANG: AUTOMATING AUXILIARY LEARNING
Published as a conference paper at ICLR 2023

Lucio M. Dery (1), Paul Michel (2), Mikhail Khodak (1), Graham Neubig (1), Ameet Talwalkar (1,3)
(1) Carnegie Mellon University  (2) ENS PSL University  (3) Hewlett Packard Enterprise

Auxiliary objectives, supplementary learning signals that are introduced to help aid learning on data-starved or highly complex end-tasks, are commonplace in machine learning. Whilst much work has been done to formulate useful auxiliary objectives, their construction is still an art which proceeds by slow and tedious hand-design. Intuition for how and when these objectives improve end-task performance has also had limited theoretical backing. In this work, we present an approach for automatically generating a suite of auxiliary objectives. We achieve this by deconstructing existing objectives within a novel unified taxonomy, identifying connections between them, and generating new ones based on the uncovered structure. Next, we theoretically formalize widely-held intuitions about how auxiliary learning improves generalization on the end-task. This leads us to a principled and efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task. With natural language processing (NLP) as our domain of study, we demonstrate that our automated auxiliary learning pipeline leads to strong improvements over competitive baselines across continued-training experiments on a pre-trained model on 5 NLP tasks.1

1 INTRODUCTION

| Objective | Data (D) | Transform (T) | Representation (R) | Output (O) |
|---|---|---|---|---|
| BERT | Out-of-domain | BERT-Op | Bidirectional | Denoise Token |
| TAPT | Task data | BERT-Op | Bidirectional | Denoise Token |
| DAPT | In-domain | BERT-Op | Bidirectional | Denoise Token |
| ELMo | Out-of-domain | No-Op | Left-to-Right and Right-to-Left | Next Token |
| GPT | Out-of-domain | No-Op | Left-to-Right | Next Token |
| XLNet | Out-of-domain | No-Op | Random-Factorized | Next Token |
| Electra | Neural LM data | Replace | Bidirectional | Real / Synthetic |
| ... | ... | ... | ... | ... |

Figure 1: We present the decomposition of some auxiliary objectives in NLP within our framework.

The auxiliary learning paradigm, where we augment a primary objective with extra learning signals to boost end-task performance, is a staple of many machine learning (ML) domains. In natural language processing (NLP), well-known models like SpanBERT (Joshi et al., 2020) and RoBERTa (Liu et al., 2019b) are trained on masked language modelling (MLM) auxiliary objectives (Devlin et al., 2018) before fine-tuning on the end-task. In speech processing and reinforcement learning (RL), Oord et al. (2018) introduced the popular contrastive predictive coding objective, which achieved state-of-the-art performance in many settings when multi-tasked with the end-task.

Despite these successes and many more, research into devising such objectives has progressed in a very local, objective-by-objective manner (Raffel et al., 2019; Clark et al., 2020; Grill et al., 2020; Chen et al., 2020). Auxiliary objectives are constructed by hand-design and without much overarching structure, relying on the experience and intuition of a select group of researchers versed in making appropriate design choices. Unfortunately, this status quo not only creates a technical barrier of entry for exploring auxiliary objectives in new domains but also, by virtue of its incremental nature, limits the rate at which new objectives are discovered and investigated.
Correspondence to: ldery@andrew.cmu.edu
1 Code available at: https://github.com/ldery/Automating-Auxiliary-Learning.

To address the above challenges, this paper presents a framework for automatically generating and utilizing a large set of candidate auxiliary objectives. Our framework is seeded by the following key observation: leading auxiliary objectives across multiple domains can be viewed as making different design decisions within a 4-stage pipeline: Input Data (D) → Input Transformation (T) → Model Representation (R) → Output (O). For instance, in RL, a common auxiliary objective is to predict the environment's forward dynamics (Agrawal et al., 2016; Hafner et al., 2019). To construct this objective, the current task state-action pair (D) is corrupted (T) and then passed through the model to produce a latent representation (R), which is finally used to predict the next state (O). Similarly, in NLP, the XLNet objective (Yang et al., 2019), which performs language modelling on a randomly factorized permutation of the input, can be written within our taxonomy as {D = Out-of-domain, T = No-op, R = Random-Factorized, O = Next Token}. These two examples (along with others listed in Figure 1) fall within a class we term named objectives: objectives that have been previously proposed in the auxiliary learning literature.

| Data (D) | Transform (T) | Representation (R) | Output (O) |
|---|---|---|---|
| Out-of-domain | No-Op | Bidirectional | Next Token |
| In-domain | Replace | Left-to-Right | Real / Synth |
| Task data | Mask | Right-to-Left | Denoise Token |
| Neural LM data | Noising embeds | Rand. factorized | TF-IDF |
| ... | ... | ... | ... |

TAPT = {Task data → BERT-Op → Bidirectional → Denoise Token}
GPT = {Out-of-domain → No-Op → Left-to-Right → Next Token}
New-Obj1 = {Task data → BERT-Op → Left-to-Right → Denoise Token}
New-Obj2 = {In-domain → No-Op → Random-Factorized → TF-IDF}
...

Figure 2: Our framework in the context of NLP. We decompose named objectives within our four-staged taxonomy {D, T, R, O}. By taking the cartesian product of choices across stages, we reproduce named objectives and discover new ones.

Decomposing named objectives within our taxonomy provides a unified view of the auxiliary learning landscape. From this vantage point, it becomes clear that there are many unexplored combinations of the various primitives used across named objectives. This presents a simple formula for automatically generating a large set of candidate objectives: take the cartesian product of the design decisions across the given stages (Figure 2). Using this compositional process, not only can we reconstruct existing named objectives, we can also generate new combinations. This overcomes the tedium of implementing each objective independently, since we can simply reuse a small set of stage-wise primitives.

Generating a large set of objectives raises the natural question of how to efficiently select the most helpful ones for a given end-task. Instead of leaving this to practitioner intuition, we develop principled guidelines to address this question by theoretically studying the impact of auxiliary learning on a particular end-task. Specifically, using arguments based on algorithmic stability (Hardt et al., 2016; Bousquet & Elisseeff, 2002), we derive end-task generalization error bounds that are dependent on the choice of auxiliary task.
This contributes to existing theory (Saunshi et al., 2020; Xie et al., 2021) on how auxiliary learning impacts the end-task by suggesting a new candidate mechanism: auxiliary learning results in more stable optimization end-points in the sense of Bousquet & Elisseeff (2002), which in theory improves generalization of the final model.

Guided by our theory, we introduce AANG (Automating Auxiliary LearniNG), an efficient, structure-aware algorithm for adaptively combining a set of related objectives to improve generalization on a specific end-task. AANG incorporates the following prescriptions from our theory: (i) auxiliary tasks that are more similar to the end-task are desirable; given a set of objectives, AANG learns adaptive weights to bring the composite objective closer to the end-task; (ii) in general, more auxiliary data is better; AANG maximizes the effective amount of data used in training by using all the generated objectives instead of taking task-specific subsets.

To empirically validate our method for automatically generating and utilizing auxiliary objectives, we experiment on five NLP tasks. We do so in the widely-used setting of continued pretraining (Gururangan et al., 2020; Aghajanyan et al., 2021; Dery et al., 2021b; Zhang et al., 2022), where a model trained with a single auxiliary objective on large-scale data is further trained on end-task related data. Without introducing any external data or architectural modifications, variants of AANG outperform strong and widely used baselines on 4 out of 5 tasks. AANG achieves an average improvement of 4.2% over standard fine-tuning of RoBERTa across our chosen tasks. We believe our results will spur further research into automating auxiliary learning across a variety of settings. Notably, while we focus on NLP when discussing the space of auxiliary objectives (Section 3) and in our empirical evaluation (Section 6), our theoretical results (Section 4) and AANG itself are domain-agnostic.2

2 Our ideas could be applied to domains like RL or computer vision (CV), where a similar dissection of existing objectives can be performed.

2 RELATED WORK

To properly scope this work, we define auxiliary learning as training a model on alternative objectives with the goal of improving performance on some primary end-task. Auxiliary learning is an instantiation of transfer learning (Caruana, 1997; Baxter, 2000; Ruder et al., 2019). It covers the pretrain-then-finetune paradigm (Huh et al., 2016; Devlin et al., 2018; Schneider et al., 2019; Gururangan et al., 2020) as well as end-task aware multitasking approaches (Lin et al., 2019; Dery et al., 2021a;b). Whilst auxiliary objectives may also be meta-learned (Liu et al., 2019a; Navon et al., 2020), such objectives are out of the scope of this paper: for simplicity, we avoid the additional complication of our design space that incorporating them would require.

This work bears many parallels to the area of neural architecture search (NAS) (Stanley & Miikkulainen, 2002; Zoph & Le, 2016; Roberts et al., 2021). Whilst we seek to automate auxiliary learning, the objective of NAS is to automate the discovery of the right neural architecture for a specific end-task. Search spaces of candidate architectures are created by taking the cartesian product of architecture design choices across the depth of the network.
The design of suitable architectural search spaces for a variety of settings has been an active area of research (Tan & Le, 2019; Howard et al., 2019; Dao et al., 2020; Roberts et al., 2021). To develop AANG, we borrow ideas from the NAS literature on efficient algorithms for sifting through spaces of architectures. Mirroring the popular differentiable NAS method DARTS (Liu et al., 2018), we perform a continuous relaxation over the search space of objectives, allowing for efficient search by gradient descent. We also use a factored approach to model relationships between objectives that share primitives. This is inspired by recent work on stochastic-relaxation weight sharing (Dong & Yang, 2019; Li et al., 2020).

As a theoretical contribution, this work derives an end-task aware generalization error bound for auxiliary learning. Our bound is built on that of Hardt et al. (2016), who derive generalization bounds for parametric models trained with stochastic gradient descent (SGD). To derive their bounds, they leverage the concept of algorithmic stability introduced by Bousquet & Elisseeff (2002). Informally, a randomized algorithm is uniformly stable if changing a single training data point in the given samples does not change its end-point too much. Said change is characterized as the average difference in predictions between the two learned models. Stability implies generalization in expectation (Hardt et al., 2016; Kuzborskij & Lampert, 2018).

3 AUTOMATICALLY GENERATING AUXILIARY OBJECTIVES

To begin, we take a high-level view of the landscape of named objectives. Using running examples from NLP, we propose the following coarse structure for the sequence of choices made in the hand-design of auxiliary objectives:

1. Data, D: Auxiliary objective pipelines begin with a choice of input data. Here, options range from heterogeneous out-of-domain data (Radford et al., 2019), to in-domain data with respect to the final end-task (Beltagy et al., 2019), to the task data itself (Gururangan et al., 2020). They may even include data outside the modality of the end-task.

2. Input-Transformation, T: Many auxiliary objectives are self-supervised with respect to their input data. They corrupt or transform the input and then reconstruct it in whole or in part. For example, input text tokens can be masked, replaced or deleted. Operations can also be aggregated, as in BERT-Op: mask 80% of selected tokens and randomly replace 50% of the remaining ones (Devlin et al., 2018; Liu et al., 2019b).

3. Representation, R: After transformation, representations of the input data can be computed from a given model in different ways. A chosen token's representation can depend on only its left context (Left-to-Right) (Radford et al., 2018) or only its right context (Right-to-Left) (Peters et al., 2018). It could also depend on the representations of a randomly selected permutation of other tokens (Random-Factorized) (Yang et al., 2019).

4. Output, O: Finally, representations obtained from the previous stage are fed into a loss function producing a final output. The choice of output loss is usually coupled with the choice of transformation made in stage 2. Choices include, but are not restricted to, denoising tokens, predicting the next token, or predicting the TF-IDF (Term Frequency-Inverse Document Frequency) of a token.

The above taxonomy {D → T → R → O} is expansive enough to cover a range of named auxiliary objectives of interest in NLP (Figure 1).3
For example, we can write any member of the GPT series (Radford et al., 2018; 2019; Brown et al., 2020), which performs left-to-right language modelling on out-of-domain data, as {D = Out-of-domain, T = No-op, R = Left-to-Right, O = Next Token}. We can summarize the pre-existing choices within each design stage to obtain a unique set of options. For example, we can reduce the set of model representation types used by the objectives enumerated in Figure 1 to the unique set R = {Bi-directional, Left-to-Right, Right-to-Left, Random-Factorized}.

Having summarized the list of primitives within each stage, a simple formula for generating a space of auxiliary objectives becomes apparent: take the cartesian product of the design choices at each stage (see Figure 2). In general, given an instance of our taxonomy, we can construct a space of objectives A = D × T × R × O of size |A| ≤ |D| × |T| × |R| × |O|. Consider New-Obj1 from Figure 2. This previously unexplored objective can be obtained by combining the special masking operation from BERT (BERT-Op) with computing model representations based on left-to-right causal masking, as in GPT. In fact, this objective proved one of the most useful ones in our experiments below (see Figure 5).

Our framework also allows us to reason about whole families of objectives, F, by thinking in terms of design stages and choices. For example, given a particular end-task E with input text E_D, we can create a family of objectives based solely on task data by fixing that option in our input data stage; we call this family F_{D=E_D}. F_{D=E_D} not only includes the pre-existing TAPT (Gururangan et al., 2020) but also unexplored objectives like task-data dependent variants of XLNet, ELMo, etc. Auxiliary learning with F_{D=E_D} can be seen as a relaxed form of data augmentation, which we dub task augmentation. Whilst data augmentation requires applying transformations that preserve the data-point's label, task augmentation has no such restriction and thus offers greater flexibility in terms of specifying {T, R, O}.

We can also reason about expanding particular stages to include new primitives. Any supervised loss can be added to the output stage, O, allowing us to potentially explore auxiliary objectives based on supervised signals like NER or POS tagging (Carreras et al., 2003; Charniak, 1997). A special example is setting O to the end-task supervised output E_O. This leads to F_{D=E_D}^{O=E_O}, which is a subset of F_{D=E_D}. F_{D=E_D}^{O=E_O} includes many objectives like predicting the end-task signal from corrupted input data. In Section 6, we will introduce a search space of objectives that leverages task augmentation. A short illustrative sketch of the generation procedure follows.
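To make the generation procedure concrete, below is a minimal sketch of enumerating the cartesian product over stage primitives. This is our own illustration rather than the paper's released code; the option lists follow Figures 1-2, and in practice some combinations are pruned as invalid, which is why |A| ≤ |D| × |T| × |R| × |O|.

```python
from itertools import product

# Stage primitives for the {D, T, R, O} taxonomy (options drawn from Figures 1-2).
DATA = ["out-of-domain", "in-domain", "task-data", "neural-lm-data"]
TRANSFORM = ["no-op", "mask", "replace", "bert-op", "noise-embeds"]
REPRESENTATION = ["bidirectional", "left-to-right", "right-to-left", "random-factorized"]
OUTPUT = ["denoise-token", "next-token", "real-synthetic", "tf-idf"]

# The space A is the cartesian product of design choices across the four stages.
search_space = [
    {"D": d, "T": t, "R": r, "O": o}
    for d, t, r, o in product(DATA, TRANSFORM, REPRESENTATION, OUTPUT)
]

# Named objectives are recovered as particular points of the space, e.g.
tapt = {"D": "task-data", "T": "bert-op", "R": "bidirectional", "O": "denoise-token"}
gpt = {"D": "out-of-domain", "T": "no-op", "R": "left-to-right", "O": "next-token"}
assert tapt in search_space and gpt in search_space

# Fixing a stage yields a whole family, e.g. the task-augmentation family F_{D = task data}.
task_augmentation = [obj for obj in search_space if obj["D"] == "task-data"]
print(len(search_space), len(task_augmentation))
```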
4 THE IMPACT OF AUXILIARY LEARNING ON END-TASK GENERALIZATION

In this section, we relieve reliance on practitioner intuition by deriving a set of guiding principles on how to effectively utilize the automatically generated objectives from Section 3. Auxiliary learning influences the end-task through both training and generalization error. Previous theory has largely focused on characterizing the impact on end-task training error. Liu et al. (2021), for example, show that end-task agnostic pre-training can create a performance gap in training error compared to training with the end-task alone. The size of this gap depends on how dissimilar the pre-training auxiliary objective is from the end-task. They introduce the following assumption (which we will borrow) to formalize their notion of task similarity:

Assumption A.1: Let fe represent the end-task objective and fa the auxiliary objective. There exists Δ ≥ 0 such that ‖∇fa(θ) − ∇fe(θ)‖ ≤ Δ for all θ. Note that θ represents all the parameters of the model. A smaller Δ implies fa is more similar to the primary task fe.

Liu et al. (2021) bound the end-task agnostic training error gap to be logarithmic in Δ. Unlike training error, end-task generalization error has gone unstudied in the auxiliary learning setting. Bounding the generalization error not only adds to our theoretical understanding of the impact of auxiliary learning but also provides insights to guide algorithm design. To arrive at a bound, we adapt the technique of Hardt et al. (2016), who derive a generalization bound on training with only the end-task via stochastic gradient descent. We consider the end-task aware setting where the end-task is multi-tasked with the auxiliary objective. This setting has recently been shown to improve end-task performance over the pretrain-then-finetune paradigm (Dery et al., 2021a;b; Yao et al., 2021).

Auxiliary learning with Dynamic Sampling: We are given an auxiliary objective fa(·; z) ∈ [0, 1] with Na samples Sa = (z1, . . . , zNa) from the distribution Da. fa can either be a single objective or a weighted linear combination of objectives: fa = Σk wk fa^k. At any iteration of SGD, we sample either the end-task function fe or the auxiliary objective fa according to the probabilities λe, λa ∈ [0, 1] with λe + λa = 1. Given the chosen objective, we sample a data-point and perform a stochastic gradient step based on the sampled data-point. We now present our bound in the setting described.

Theorem 4.1 (Auxiliary learning with Dynamic Sampling). Assume that fe(·; ze), fa(·; za) ∈ [0, 1] are both L-Lipschitz, with βe- and βa-smooth loss functions respectively. Consider that we have N = Ne + Na total samples, where fe and fa have Ne and Na samples respectively, and let re = Ne/N be the fraction of the available data represented by the end-task. Suppose that we run stochastic gradient descent for T steps with monotonically non-increasing step sizes αt ≤ c/t, dynamically sampling the tasks according to λe and λa. Then, with respect to fe, the generalization error is bounded as

    ϵ_gen ≲ (1/N) · (1 + 1/(cλβ)) · (γT)^(1 − 1/(cλβ + 1)),   where γ = λe/re.    (1)

Here β = min{βe, βa}, λ is the weighting of the function with the smaller smoothness, and the suppressed multiplicative constant depends on L, c and Δ.

Proof. See Appendix E for the full proof and Appendix F for more discussion.

As a detailed inspection of the proof will show, we derive Equation 1 by appealing to algorithmic stability (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Kuzborskij & Lampert, 2018) (Section 2). To our knowledge, ours is the first work to present an algorithmic stability view to formally explain how auxiliary learning influences end-task performance. Equation 1 surfaces the following prescriptions about learning with auxiliary tasks:

P1: A smaller Δ improves ϵ_gen. This implies that the more similar the auxiliary objective is to the end-task (under Assumption A.1), the lower the generalization error.
P2: A larger N leads to a smaller ϵ_gen.4 Since we usually have a fixed amount of task data Ne, we can increase N by adding more auxiliary data Na.

3 Although this taxonomy is quite expansive, it obviously does not consider other elements of objective creation such as choice of model architecture, optimizer settings, etc.
4 This holds at fixed γ, which we achieve by adjusting λe to account for introducing more auxiliary data.

A minimal code sketch of the dynamic-sampling procedure analyzed in this section is given below.
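The following is our own illustration (the function and variable names are not from the paper's codebase) of SGD with dynamic sampling: at each step, one of fe or fa is drawn according to (λe, λa), a single example is sampled from that objective's data, and one gradient step is taken with step size αt ≤ c/t.

```python
import numpy as np

def sgd_with_dynamic_sampling(theta, grad_fe, grad_fa, S_e, S_a,
                              lambda_e=0.5, c=0.1, T=1000, seed=0):
    """grad_fe / grad_fa return the per-example gradient of f_e / f_a at (theta, z)."""
    rng = np.random.default_rng(seed)
    for t in range(1, T + 1):
        alpha_t = c / t                         # monotonically non-increasing step sizes
        if rng.random() < lambda_e:             # pick the end-task with probability lambda_e
            z = S_e[rng.integers(len(S_e))]
            g = grad_fe(theta, z)
        else:                                   # otherwise pick the auxiliary objective
            z = S_a[rng.integers(len(S_a))]
            g = grad_fa(theta, z)
        theta = theta - alpha_t * g             # single stochastic gradient step
    return theta
```

In this picture, Prescription P1 corresponds to choosing (or re-weighting) the auxiliary objective so that grad_fa stays close to grad_fe, and Prescription P2 to enlarging S_a while adjusting lambda_e so that γ stays fixed.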
5 END-TASK AWARE SEARCH OF STRUCTURED OBJECTIVE SPACES

Algorithm 1: AANG
  Input: search space A; factor vectors {W^All, W^D, W^T, W^R, W^O}; end-task E; end-task weight λe; initial model parameters θ0 ∈ R^D
  repeat
    Sample a batch of n objectives Kn ⊆ A
    # Construct the weighting wn of the objectives in Kn
    for k = 1 to n do
      (d, t, r, o) = Kn[k].stages
      wk ∝ exp( W^All_(d,t,r,o) + W^D_d + W^T_t + W^R_r + W^O_o )
    end for
    # Get losses from batches of data
    L̂A(Kn, wn) = Σ_{k=1}^{n} wk · Lk
    Ltotal = λe · LE + (1 − λe) · L̂A
    # Get gradients and update factors
    θ_{t+1}, {∇wn, ∇λe} ← META-TARTAN(θt, E, Ltotal)
    Update {W^All, W^D, W^T, W^R, W^O} using ∇wn
    Update λe using ∇λe
  until done
  Return: θT

Guided by Section 4, we build a practical method for exploring a set of objectives, A. Whilst the dynamic sampling setting described in Section 4 is amenable to theoretical consideration, we make a few practical changes to it. First, instead of performing alternating gradient descent by sampling fa, fe according to λe, λa, we instead use them as multitask weights and perform joint training. Joint training has been found to produce superior results compared to alternating optimization when leveraging auxiliary objectives (Aghajanyan et al., 2021). We perform gradient descent on the following total loss, which interpolates between the end-task and the auxiliary loss: Ltotal = λe LE + (1 − λe) LK. Here, K is a chosen subset of A. Second, as indicated in Section 4, given K we can write the set as a single objective fa = Σ_{k∈K} wk fa^k. By Prescription P1, we want to choose {wk} such that fa has a small Δ with respect to the end-task fe. We would also like to set λe such that the bound on ϵ_gen is minimized. Whilst a closed form exists for the optimal weightings λe, {wk}, it depends on variables like {Δk}, {βa^k} and L that are hard to estimate.

We therefore propose to learn λe, {wk} in an online, data-driven way. To do this, we build on top of the META-TARTAN algorithm proposed by Dery et al. (2021b). META-TARTAN is a meta-learning algorithm that learns adaptive weights for different auxiliary tasks in a way that prioritizes end-task generalization. It learns {wk} by descending the loss on the end-task validation set, using the meta-gradient

    ∂L^val_E / ∂wk ∝ − (∇θ L_{fa^k})^T ∇θ L^val_E.

This corresponds to learning {wk} such that (∇θ fa)^T (∇θ fe) is maximized, which shrinks one of the terms that contributes to Δ and thus attempts to fulfil Prescription P1. We can similarly learn λe to minimize the end-task validation loss. For a more detailed discussion of META-TARTAN, please see Appendix B.

So far, we have introduced independent weights, {wk}, for each objective. This is sufficient in the case of unrelated objectives. However, the objectives in A share an underlying structure. We recognize this by using a factored approach to model each wk. We introduce a factor vector for each of the 4 stages introduced in Section 3: W^D ∈ R^|D|, W^T ∈ R^|T|, W^R ∈ R^|R| and W^O ∈ R^|O|. This ties together the weights of objectives that share primitives in common. To capture the fact that an objective can be more than the sum of its parts, we also introduce an independent weight for each objective: W^All ∈ R^(|D|·|T|·|R|·|O|). Consider the objective k generated by the composition of the operations {d ∈ D, t ∈ T, r ∈ R, o ∈ O}; its weighting is computed as

    wk ∝ exp( W^All_(d,t,r,o) + W^D_d + W^T_t + W^R_r + W^O_o ).

A short code sketch of this factored weighting, and of the weighted auxiliary loss used in Algorithm 1, follows.
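Below is a minimal PyTorch-style sketch of the factored weighting inside Algorithm 1 and of the resulting total loss. It is our own illustration: the tensor names are ours, and normalizing the sampled objectives' weights with a softmax is one concrete realization of wk ∝ exp(·).

```python
import torch

# One factor vector per design stage, plus an independent per-objective factor.
n_d, n_t, n_r, n_o = 2, 4, 4, 2                      # |D|, |T|, |R|, |O| (illustrative sizes)
W_D = torch.zeros(n_d, requires_grad=True)
W_T = torch.zeros(n_t, requires_grad=True)
W_R = torch.zeros(n_r, requires_grad=True)
W_O = torch.zeros(n_o, requires_grad=True)
W_All = torch.zeros(n_d, n_t, n_r, n_o, requires_grad=True)

def objective_weights(sampled):
    """sampled: list of (d, t, r, o) index tuples for the n sub-sampled objectives K_n."""
    logits = torch.stack([
        W_All[d, t, r, o] + W_D[d] + W_T[t] + W_R[r] + W_O[o]
        for (d, t, r, o) in sampled
    ])
    return torch.softmax(logits, dim=0)              # w_k proportional to exp(...)

def total_loss(end_task_loss, objective_losses, sampled, lambda_e):
    w = objective_weights(sampled)
    aux_loss = (w * torch.stack(objective_losses)).sum()    # weighted auxiliary loss
    return lambda_e * end_task_loss + (1.0 - lambda_e) * aux_loss
```

Because objectives that share a primitive index into the same entries of W_D, W_T, W_R and W_O, an update driven by the sampled objectives also moves the weights of objectives that were not sampled at that step.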
Our factored approach not only allows us to share information between objectives, but also allows us to analyze which stages and primitives are most important to a particular end-task after training is completed (Section 7).

Prescription P2 from Section 4 advocates for introducing as much auxiliary data as possible. As such, instead of fixing a specific subset throughout training for a particular end-task, we propose to utilize all the objectives in A. This also avoids the combinatorial explosion that comes with exploring subsets of A at a time. |A| can be large, and descending on all of A at once can be computationally prohibitive. As an efficient workaround, at each training step we sample a subset of A for execution with META-TARTAN. Our samples are drawn from all of A, so any objective can get used at any timestep. Because we model each wk via a factored approach, even if an objective is not sampled, its weight is implicitly updated. Our approach is reminiscent of stochastic-relaxation weight sharing (Pham et al., 2018; Dong & Yang, 2019; Li et al., 2020), where sampled architectural primitives result in updates to shared model weights which can be used by other primitives that are not sampled.

We coalesce all the ideas we have introduced so far into Algorithm 1, which we dub AANG (Automating Auxiliary LearniNG). At a high level, given an end-task E:
1. We generate a space of auxiliary objectives A by leveraging the taxonomy discussed in Section 3. A may contain auxiliary tasks that can improve our performance on E.
2. We leverage MAML-style (Finn et al., 2017) meta-learning to adaptively weight the objectives in A based on measuring each objective's influence on E's validation set loss.
3. We make our algorithm scalable by sub-sampling the tasks in A. By exploiting the underlying structure of the objectives in A via a factored approach to modeling task weights, we reduce the impact of the inexact sub-sampling.

6 EXPERIMENTAL SETTING

Our exploration of auxiliary learning has made the following transitions from the status quo: manual to automated, single-task to multitask, end-task agnostic to end-task aware. In this section, we set up experiments to validate these deviations from the standard. We focus on continued pre-training (Gururangan et al., 2020; Aghajanyan et al., 2021). In this setting, we perform further auxiliary learning on an already pre-trained model. We favor this setting over pre-training from scratch (Liu et al., 2019b; Yang et al., 2019) not only because it is a more computationally feasible arena for experimentation, but also because it is more relevant to modern ML systems, where building upon pre-trained models is the norm (Qiu et al., 2020; Du et al., 2020).

Model Details and Datasets: We use a pre-trained RoBERTa_base (Liu et al., 2019b) as the shared model base. We implement each auxiliary objective as a separate head on top of this shared base. For classification-based objectives, the output head is a 2-layer multi-layer perceptron (MLP) that receives the representation of the special classification token [CLS] (Devlin et al., 2018) from RoBERTa_base. For sequence generation objectives, we make a copy of the pre-trained output layer of RoBERTa_base for each task. A sketch of this shared-base, multi-head setup is shown below.
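The following is a simplified sketch of the multi-head setup just described, under our own naming: the encoder stands in for the pre-trained RoBERTa_base body and is assumed to be a callable that returns final hidden states of shape [batch, sequence, hidden]; the objective specs are hypothetical.

```python
import torch.nn as nn

class SharedBaseMultiHead(nn.Module):
    """Shared pre-trained encoder with one output head per (auxiliary or end-task) objective."""

    def __init__(self, encoder, hidden_dim, vocab_size, num_classes, objectives):
        super().__init__()
        self.encoder = encoder                      # e.g. a pre-trained RoBERTa-base body
        self.heads = nn.ModuleDict()
        for name, spec in objectives.items():
            if spec["O"] == "end-task":             # classification-style output
                self.heads[name] = nn.Sequential(   # 2-layer MLP over the [CLS] representation
                    nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                    nn.Linear(hidden_dim, num_classes))
            else:                                   # token-level (denoising / LM-style) output
                self.heads[name] = nn.Linear(hidden_dim, vocab_size)

    def forward(self, objective_name, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask)        # [batch, seq, hidden]
        head = self.heads[objective_name]
        if isinstance(head, nn.Sequential):
            return head(hidden[:, 0])               # classification heads read the [CLS] position
        return head(hidden)                         # generation heads produce per-token logits
```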
Table 4 in Appendix C provides details of the 5 datasets used. All datasets are low-resource classification tasks. Not only are these datasets more amenable to meta-learning from a computational standpoint, but low-resource tasks also benefit the most from auxiliary learning. We also choose these tasks because they feature in previous work which we use as baselines (Gururangan et al., 2020; Dery et al., 2021b).

Baselines and Search Spaces: The following methods are end-task agnostic baselines. By end-task agnostic, we mean that they do not multitask with the end-task; fine-tuning on the end-task occurs after training on the auxiliary objective.
1. RoBERTa (Liu et al., 2019b): We simply fine-tune a pre-trained RoBERTa_base on the end-task.
2. TAPT (Gururangan et al., 2020): Continue training RoBERTa_base with masked language modelling on the end-task data itself before fine-tuning on the end-task.

The following named objectives are end-task aware baselines that use META-TARTAN (Dery et al., 2021b) but utilize only 1 auxiliary task. Each auxiliary objective is multi-tasked with the end-task.
1. GPT-style: We perform end-task aware training with a denoising auxiliary objective based on left-to-right causal masking for computing representations: {D = End-task data, T = No-op, R = Left-to-Right, O = Denoise Token}.
2. XLNET-style: A denoising auxiliary objective that uses randomized masking for computing representations: {D = End-task data, T = No-op, R = Random-Factorized, O = Denoise Token}.
3. BERT-style / TAPT: Denoising inputs corrupted via BERT-Op (80% masking and 10% random replacement): {D = End-task data, T = BERT-Op, R = Bi-directional, O = Denoise Token}. Please note that this baseline is equivalent to META-TARTAN as introduced in Dery et al. (2021b).

Table 1: AANG-TD (task data) has 24 objectives and is based on only end-task data. AANG-TD+ED (task data + external data) has 40 objectives and uses both end-task and in-domain data.

| Space | D | T | R | O |
|---|---|---|---|---|
| TD | End-task data | BERT-op, Mask, Replace, No-op | Bi-directional, Left-to-Right, Right-to-Left, Random-Factorized | Denoise Token, End-task |
| TD+ED | End-task data, In-domain data | BERT-op, Mask, Replace, No-op | Bi-directional, Left-to-Right, Right-to-Left, Random-Factorized | Denoise Token, End-task |

Table 1 details the search spaces that we evaluate against the above baselines. This is by no means the most encompassing search space, but we leave more expansive space design to future work. Please note that all tasks within AANG-TD, and those with {D = End-task data} in AANG-TD+ED, are instantiations of task augmentation as introduced in Section 3. We sketch below how the transformation primitives shared by these objectives might be implemented.
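To illustrate how little code the shared stage primitives require, here is a sketch of the transformation stage T used by the objectives in Table 1. This is our own illustration: the <mask> token id and the use of -100 as an ignored label follow common RoBERTa / PyTorch conventions and are assumptions, not the paper's implementation.

```python
import torch

MASK_ID = 50264  # assumed id of RoBERTa's <mask> token

def apply_transform(input_ids, kind, vocab_size, select_prob=0.15, seed=0):
    """Apply one T-stage primitive to a batch of token ids.

    Returns (corrupted_ids, labels): labels keep the original ids at selected
    positions and are -100 elsewhere so the loss ignores them.
    """
    g = torch.Generator().manual_seed(seed)
    ids = input_ids.clone()
    selected = torch.rand(ids.shape, generator=g) < select_prob
    labels = torch.where(selected, input_ids, torch.full_like(input_ids, -100))

    if kind == "no-op":                       # keep the input intact (e.g. next-token objectives)
        return input_ids, labels
    if kind == "mask":                        # replace every selected token with <mask>
        ids[selected] = MASK_ID
    elif kind == "replace":                   # replace selected tokens with random tokens
        rand = torch.randint(vocab_size, ids.shape, generator=g)
        ids[selected] = rand[selected]
    elif kind == "bert-op":                   # BERT-Op: 80% mask, 10% random replacement
        coin = torch.rand(ids.shape, generator=g)
        ids[selected & (coin < 0.8)] = MASK_ID
        rand = torch.randint(vocab_size, ids.shape, generator=g)
        swap = selected & (coin >= 0.8) & (coin < 0.9)
        ids[swap] = rand[swap]
    return ids, labels
```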
Training Details: Please see Appendix D for more details about hyper-parameter configurations.

7 RESULTS AND DISCUSSION

In this section, we experimentally validate our case for automating the creation of auxiliary objectives and using them in an end-task aware multitask fashion.

7.1 GOING A LONG WAY WITHOUT EXTERNAL DATA

We first consider the setting where we rely solely on end-task data (task augmentation) and work with the AANG-TD search space, which has 24 objectives. Table 2 shows that automatically generating auxiliary objectives from only task data and using them appropriately is productive.

End-task awareness is key: From Table 2, methods that are end-task aware obtain over 1.12% average improvement over those that are end-task agnostic, even under the most generous comparison (GPT-style 79.84% vs. task-agnostic TAPT 78.72%). Knowing the end-task means that at each iteration, AANG can make informed gradient updates by adapting task weights so the resulting auxiliary task better aligns with the end-task (Prescription P1). Amongst the single-task objectives, BERT-style performs best. We posit that this is because RoBERTa was trained from scratch on a similar objective, so this objective represents minimal shift in training distributions.

Adaptive multi-task auxiliary learning improves performance: We compare single-task end-task aware auxiliary learning to its multitask variant. Table 2 shows that multitasking our 3 different types of language modelling tasks results in improved average performance over using the tasks individually (81.12% for BERT-style versus 81.55% for combining the three single-task objectives). We get our best performance when we multitask the 24 auxiliary objectives automatically generated with our framework using AANG-TD. Boosting the number of objectives from 3 to 24 results in a 0.66% improvement in average performance across tasks. This is in line with Prescription P2 from Section 4, since we are increasing the effective amount of auxiliary data. We further posit that introducing more auxiliary objectives also serves to implicitly regularize the end-task during training.

Table 2: Our framework and AANG on tasks using only task data. Without using any external data, we are able to get significant average performance improvements over baselines. Entries are reported as mean ± standard deviation; parenthesized values are p-values from paired t-tests (best multitask versus best single-task).

| Task Adaptive | Method | # | ACL-ARC (CS) | SCIERC (CS) | CHEMPROT (BIOMED) | H.PARTISAN (NEWS) | SE-2016-6 (STANCE) | AVG |
|---|---|---|---|---|---|---|---|---|
| No | RoBERTa | 1 | 66.03 ± 3.55 | 77.96 ± 2.96 | 82.10 ± 0.98 | 93.39 ± 2.26 | 70.37 ± 1.51 | 77.97 |
| No | TAPT | 1 | 67.74 ± 3.68 | 79.53 ± 1.93 | 82.17 ± 0.65 | 93.42 ± 2.87 | 70.74 ± 1.21 | 78.72 |
| No | [OURS] Static Multitask-TD | 24 | 69.60 ± 3.80 | 83.37 ± 0.58 | 83.42 ± 0.26 | 97.95 ± 0.73 | 71.02 ± 0.43 | 81.07 |
| Yes | X. GPT-style | 1 | 67.22 ± 0.44 | 81.62 ± 0.84 | 83.29 ± 1.21 | 96.41 ± 0.73 | 70.67 ± 1.46 | 79.84 |
| Yes | Y. XLNET-style | 1 | 69.76 ± 2.42 | 81.81 ± 0.42 | 83.39 ± 0.31 | 96.41 ± 1.92 | 71.18 ± 0.58 | 80.51 |
| Yes | Z. BERT-style (Dery et al., 2021b) | 1 | 70.08 ± 4.70 | 81.48 ± 0.82 | 84.49 ± 0.50 (0.09) | 96.84 ± 1.72 | 72.70 ± 0.60 | 81.12 |
| Yes | [OURS] AANG-[X+Y+Z] | 3 | 71.51 ± 3.19 | 82.89 ± 0.78 | 83.68 ± 0.45 | 96.92 ± 1.26 | 72.75 ± 0.82 (0.94) | 81.55 |
| Yes | [OURS] AANG-TD | 24 | 73.26 ± 1.32 (0.28) | 82.98 ± 1.52 (0.27) | 83.91 ± 0.32 | 98.46 ± 0.0 (0.14) | 72.46 ± 1.65 | 82.21 |

7.2 INTRODUCING EXTERNAL DATA

Figure 3: AANG effectively leverages out-of-task data. P-values (in brackets) are comparisons to Dery et al. (2021b).

For the ACL-ARC task, we experiment with introducing auxiliary tasks based on external data. AANG-TD+ED has 40 tasks, 16 of which are based on domain data. We introduce CS domain data (from the S2ORC dataset (Lo et al., 2019)) that is n = 10× the size of the task data. From Figure 3, we see that AANG-TD+ED makes better use of domain data than doing end-task aware training using only the BERT-style objective with task data (TAPT) and domain data (DAPT) jointly, as in Dery et al. (2021b). However, AANG-TD+ED (73.70) does not significantly improve over AANG-TD (73.26) on the ACL-ARC task (Figure 3). This might seem at odds with Prescription P2, since the TD+ED search space introduces more data. However, note that the AANG search algorithm is approximate and, with a larger search space, it can be harder to find composite tasks with a small Δ, as suggested by Prescription P1. We posit that we need more external data than n = 10× in order to see marked improvements that offset our inexact search of the space of composite functions. However, such scales are outside our computational budget.

7.3 WHY DOES AANG WORK?

To better understand why our auxiliary learning pipeline improves end-task performance, we perform multiple ablations under AANG-TD.
Static versus Dynamic Weighting: We ablate the impact of using static task weights throughout training, as against adaptive task weights. Just as with AANG, we sub-sample n tasks from the search space at every iteration (n is cross-validated exactly as it is for AANG; see Appendix D). Each sampled task's weight is initialized to 1/n, and this remains unchanged throughout training. This is the Static Multitask-TD baseline in Table 2. AANG-TD improves upon the static multitask baseline by over 1.1% on average. With adaptive weighting, AANG down-weights objectives that are harmful to the end-task whilst up-weighting relevant ones (Prescription P1). However, using static weightings is more compute-friendly, since we do not have to calculate task-weight meta-gradients. This compute-vs-performance trade-off is left for practitioners to resolve based on their available resources.

Impact of number of sampled objectives: Due to computational constraints, AANG sub-samples the set of generated objectives. Whilst this sampling can result in approximation error when inferring task weightings, it can also introduce stochasticity which can help regularize the learned model. From Table 3 (Appendix A), we find that for some tasks (ACL-ARC and SCIERC) sampling a larger number of tasks helps. SE-2016-6 and CHEMPROT, on the other hand, benefit from a smaller number of sampled tasks. Our recommendation is that the number of sampled tasks be cross-validated on a per-task basis.

Learned task weight trajectories: AANG learns interesting trajectories for weighting design-stage primitives. From Table 2, the fact that AANG-TD roughly matches the best single-task performance (72.46 ± 1.65 versus 72.70 ± 0.60 for BERT-style) on the SE-2016-6 task suggests that it may be learning to mostly up-weight this task. Figure 4 provides evidence of this. For the SE-2016-6 task (row 1), composing the highest-weighted primitive from each stage, [BERT-Op → None → Denoise], results in BERT-style, the best single-task objective. Figure 4 also shows that AANG can adapt to overfitting.

Figure 4: Learned trajectories for AANG-TD for run instances of the SE-2016-6 and SCIERC tasks. Each row plots, over training iterations, the learned weights of the primitives within the T stage (BERT-Op, Mask, None, Replace), the R stage (Left-to-Right, None, Random-Factorized, Right-to-Left) and the O stage (Denoise, end-task output). The vertical black lines indicate the point of best validation set performance.

AANG responds to over-fitting by down-weighting objectives based on the output loss being over-fit to. Thus, after several iterations, the objective that dominates when the validation performance is at its highest (black vertical line) gets down-weighted in response to it becoming saturated.

What tasks are important, and when are they important? We study which tasks are most highly weighted early in training (first 10% of the learning trajectory) and later in training (last 50%). We aggregate statistics across 3 datasets.
Note that early in training, objectives based on the self-supervised output O = {Denoise} are highly weighted, but later, objectives based on the supervised signal, O = {Task}, play a larger role. AANG rediscovers the common practice of training on self-supervised objectives before introducing supervised ones. It is also interesting to note that many newly generated objectives (outside of the 3 named single-task baselines in Table 2), such as simple input reconstruction, were discovered to have a relevant impact on the end-tasks. This means AANG can automatically surface new, previously unexplored objectives relevant to the end-task.

Figure 5: Top-ranked objectives (by averaged weight) early in training (left: highest average weight during the first 10% of training) and later in training (right: highest average weight during the later half of training). Bars give the fraction of runs on H.PARTISAN, SCIERC and ACL-ARC in which each (T, R, O) composition is top-ranked; named compositions such as GPT-style, BERT-style, XLNET-style, input reconstruction and the copy of the end-task are annotated.

8 LIMITATIONS AND CONCLUSION

Our work has some limitations that we leave for future work. First, because AANG relies on meta-learning, it presents an extra compute burden over simple multitasking. This is because we have to independently compute meta-gradients for each auxiliary task, thus requiring O(n) forward-backward operations for n sampled tasks, compared to O(1) for static multitasking. In Table 2, we show that our Static Multitask-TD method outperforms all other non-task-adaptive methods by 2.4% and is thus a viable alternative when runtime is a significant constraint. Secondly, AANG as presented is an approximate algorithm, primarily due to sub-sampling the space of tasks. Thus, as mentioned in Section 7.2, we do not get as much gain as desired when our search space becomes larger. We leave finding an efficient exact search algorithm for future exploration.

This paper presents a procedure for automating the creation of auxiliary objectives. We showed, theoretically, how auxiliary learning impacts end-task generalization. This resulted in prescriptions that informed the design of AANG, an algorithm to search the space of generated objectives in an end-task aware multitask fashion. Our experiments show that AANG is a promising first step in automating auxiliary learning.
Published as a conference paper at ICLR 2023 9 ACKNOWLEDGEMENTS This work was supported in part by DSO National Laboratories, an ENS-CFM Data Science Chair, DARPA FA875017C0141, the National Science Foundation grants IIS1705121, IIS1838017, IIS2046613 and IIS-2112471, an Amazon Web Services Award, a Facebook Faculty Research Award, funding from Booz Allen Hamilton Inc., and a Block Center Grant. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of these funding agencies. We are grateful for helpful feedback from Uri Alon, Patrick Fernandes, Joon Sik Kim, Han Guo, Victor Akinwande and Clara Na. Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. Muppet: Massive multi-task representations with pre-finetuning. ar Xiv preprint ar Xiv:2101.11038, 2021. Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. Advances in neural information processing systems, 29, 2016. Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12: 149 198, 2000. Iz Beltagy, Arman Cohan, and Kyle Lo. Scibert: Pretrained contextualized embeddings for scientific text. Co RR, abs/1903.10676, 2019. URL http://arxiv.org/abs/1903.10676. Olivier Bousquet and Andr e Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499 526, 2002. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Co RR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165. Xavier Carreras, Llu ıs M arquez, and Llu ıs Padr o. A simple named entity extractor using adaboost. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, pp. 152 155, 2003. Rich Caruana. Multitask learning. Machine learning, 28(1):41 75, 1997. Eugene Charniak. Statistical techniques for natural language parsing. AI magazine, 18(4):33 33, 1997. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020. Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. ar Xiv preprint ar Xiv:2003.10555, 2020. Tri Dao, Nimit S Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, and Christopher R e. Kaleidoscope: An efficient, learnable representation for all structured linear maps. ar Xiv preprint ar Xiv:2012.14966, 2020. Lucio M Dery, Yann Dauphin, and David Grangier. Auxiliary task update decomposition: The good, the bad and the neutral. ar Xiv preprint ar Xiv:2108.11346, 2021a. Lucio M Dery, Paul Michel, Ameet Talwalkar, and Graham Neubig. Should we be pre-training? an argument for end-task aware training as an alternative. ar Xiv preprint ar Xiv:2109.07437, 2021b. 
Published as a conference paper at ICLR 2023 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1761 1770, 2019. Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, and Alexis Conneau. Self-training improves pre-training for natural language understanding. ar Xiv preprint ar Xiv:2010.02194, 2020. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks, 2017. URL https://arxiv.org/abs/1703.03400. Jean-Bastien Grill, Florian Strub, Florent Altch e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271 21284, 2020. Suchin Gururangan, Ana Marasovi c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don t stop pretraining: adapt language models to domains and tasks. ar Xiv preprint ar Xiv:2004.10964, 2020. Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pp. 2555 2565. PMLR, 2019. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225 1234. PMLR, 2016. Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314 1324, 2019. Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer learning? ar Xiv preprint ar Xiv:1608.08614, 2016. Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64 77, 2020. David Jurgens, Srijan Kumar, Raine Hoover, Dan Mc Farland, and Dan Jurafsky. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391 406, 2018. Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. Sem Eval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829 839, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/ S19-2145. URL https://aclanthology.org/S19-2145. Jens Kringelum, Sonny Kim Kjaerulff, Søren Brunak, Ole Lund, Tudor I Oprea, and Olivier Taboureau. Chemprot-3.0: a global chemical biology diseases mapping. Database, 2016, 2016. Ilja Kuzborskij and Christoph Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 2815 2824. PMLR, 2018. Liam Li, Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. 
Geometry-aware gradient algorithms for neural architecture search. ar Xiv preprint ar Xiv:2004.07802, 2020. Xingyu Lin, Harjatin Baweja, George Kantor, and David Held. Adaptive auxiliary task weighting for reinforcement learning. Advances in neural information processing systems, 32, 2019. Published as a conference paper at ICLR 2023 Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. ar Xiv preprint ar Xiv:1806.09055, 2018. Shikun Liu, Andrew J Davison, and Edward Johns. Self-supervised generalisation with meta auxiliary learning. ar Xiv preprint ar Xiv:1901.08933, 2019a. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019b. Ziquan Liu, Yi Xu, Yuanhong Xu, Qi Qian, Hao Li, Antoni B. Chan, and Rong Jin. Improved fine-tuning by leveraging pre-training data: Theory and practice. Co RR, abs/2111.12292, 2021. URL https://arxiv.org/abs/2111.12292. Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S Weld. S2orc: The semantic scholar open research corpus. ar Xiv preprint ar Xiv:1911.02782, 2019. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2017. URL https: //arxiv.org/abs/1711.05101. Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. ar Xiv preprint ar Xiv:1808.09602, 2018. Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. Sem Eval2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (Sem Eval-2016), pp. 31 41, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/S16-1003. URL https://aclanthology. org/S16-1003. Aviv Navon, Idan Achituve, Haggai Maron, Gal Chechik, and Ethan Fetaya. Auxiliary learning by implicit differentiation. ar Xiv preprint ar Xiv:2007.02693, 2020. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. Co RR, abs/1802.05365, 2018. URL http://arxiv.org/abs/1802.05365. Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pp. 4095 4104. PMLR, 2018. Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pp. 1 26, 2020. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ar Xiv preprint ar Xiv:1910.10683, 2019. Nicholas Roberts, Mikhail Khodak, Tri Dao, Liam Li, Christopher R e, and Ameet Talwalkar. 
Rethinking neural operations for diverse tasks. ar Xiv preprint ar Xiv:2103.15798, 2021. Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pp. 15 18, 2019. Published as a conference paper at ICLR 2023 Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. A mathematical exploration of why language models help solve downstream tasks. ar Xiv preprint ar Xiv:2010.03648, 2020. Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. ar Xiv preprint ar Xiv:1904.05862, 2019. Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99 127, 2002. Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105 6114. PMLR, 2019. Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. ar Xiv preprint ar Xiv:2111.02080, 2021. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019. Xingcheng Yao, Yanan Zheng, Xiaocong Yang, and Zhilin Yang. Nlp from scratch without large-scale pretraining: A simple and efficient framework. ar Xiv preprint ar Xiv:2111.04130, 2021. Tong Zhang, Peng Gao, Hao Dong, Yin Zhuang, Guanqun Wang, Wei Zhang, and He Chen. Consecutive pretraining: A knowledge transfer learning strategy with relevant unlabeled data for remote sensing domain. ar Xiv preprint ar Xiv:2207.03860, 2022. Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. ar Xiv preprint ar Xiv:1611.01578, 2016. Published as a conference paper at ICLR 2023 A MORE ABLATION TABLES Table 3: Varying number of sampled objectives per-iteration. Task 3 24 tasks 6 24 tasks ACL-ARC 72.112.12 73.261.32 SCIERC 82.351.76 82.981.52 SE-2016-6 72.461.65 72.460.90 CHEMPROT 83.910.32 83.690.98 H.PARTISAN 98.460.0 97.950.73 B DISCUSSION OF META-TARTAN (DERY ET AL., 2021B) META-TARTAN (Dery et al., 2021b) is a MAML style (Finn et al., 2017) meta-learning algorithm that learns to adaptively weight a given set of tasks based on their influence on the end-task validation performance. META-TARTAN achieves this by formulating the following bi-level optimization problem : θ , w = argmin{θ g(θ0), w} LE(θ) (2) θ0 = argminθ Ltotal(θ, w) = argminθ w LE(θ) + X Ti A wi LTi(θ) Note that E is the end-task and A is the set of auxiliary tasks. Since the above bi-level problem is difficult to solve directly, Dery et al. (2021a) relax the problem and into an alternating optimization problem where task weights are updated based on 1-step improvement to the validation performance of the end-task : Lval E (θt+1(w)) wi β LTi T Lval E (θt) (4) To prevent the above relaxation from finding the trivial solution of just upweigting solely the end-task, Dery et al. (2021b) introduce a special dev-head which they use for estimating the meta-gradient : Lval T (θ (w)) wi β θLTi T θLval E ([θbody; ϕ ]t) (5) Where ϕ t is the special dev-head and θbody is the body of the model. For even more details about META-TARTAN, please see Section 3 of Dery et al. (2021b). 
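To make the META-TARTAN weight update above concrete, here is a minimal sketch of the meta-gradient it describes (our own illustration with hypothetical names; it omits the dev-head bookkeeping and simply takes the dot product between each auxiliary task's gradient and the end-task validation gradient).

```python
import torch

def update_task_weights(model, aux_losses, val_loss, weights, beta=0.1):
    """One META-TARTAN-style weight update: w_i moves with (grad L_Ti)^T (grad L_E^val)."""
    params = [p for p in model.parameters() if p.requires_grad]
    val_grads = torch.autograd.grad(val_loss, params, retain_graph=True, allow_unused=True)

    new_weights = {}
    for name, loss in aux_losses.items():
        task_grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        align = sum((g1 * g2).sum() for g1, g2 in zip(task_grads, val_grads)
                    if g1 is not None and g2 is not None)
        # Up-weight tasks whose gradients align with the end-task validation gradient.
        new_weights[name] = weights[name] + beta * float(align)
    return new_weights
```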
Though we leverage MET-TARTAN, compared to Dery et al. (2021b), we make three distinct contributions to the field of auxiliary learning. We list them below 1. Novel Problem Formulation: As far as we are aware of, we are the first to formulate the problem of automated auxiliary learning. Specifically, we presented an approach for automatically constructing a suite of auxiliary objectives based on existing objectives. Please note that Dery et al. (2021b) perform auxiliary learning with only the DAPT/TAPT variants of the BERT objective. They effectively assume that the search space of objectives (the 2 they explore) is given before-hand. Our approach automatically creates the search space. 2. Theoretical Novelty: To the best of our knowledge, we are the first work to provide an exploration of why auxiliary learning improves primary task performance via algorithmic stability. Dery et al. (2021b) in introducing META-TARTAN do not attempt to give a theoretical characterization of why the algorithm improves end-task performance. 3. Algorithm Improvements to META-TARTAN: Please note that META-TARAN as presented in Dery et al. (2021b) was used with only 2 auxiliary tasks. When scaling to more tasks, using META-TARTAN naively becomes computationally prohibitive. Specifically, on a search space of N tasks, META-TARTAN requires O(N) order computation per step. Published as a conference paper at ICLR 2023 We improve upon this by introducing the task sub-sampling of (k N) which reduces the compute overhead to O(k). To account for the impact of sub-sampling as an approximation, we introduced the factorised modelling of task weights which allows sharing of information between auxiliary tasks that might themselves be related. C DATASET DETAILS Table 4: Specifications of datasets used to evaluate our methods. Domain Task Label Type Train Size Dev Size Test Size Classes Metric BIOMED CHEMPROT Kringelum et al. (2016) relation classification 4169 2427 3469 13 Accuracy CS SCIERC Luan et al. (2018) relation classification 3219 455 974 7 F1 STANCE SE-2016-6 Mohammad et al. (2016) stance detection 2497 417 1249 3 Accuracy CS ACL-ARC Jurgens et al. (2018) citation intent 1688 114 139 6 F1 NEWS H.PARTISAN Kiesel et al. (2019) partisanship 515 65 65 2 Accuracy D MORE TRAINING DETAILS We run each hyper-parameter configuration across 3 seeds {0, 1, 2}. We use a batch size of 128 for all end-tasks tasks except H.PARTISAN where we use a batch size of 64. The auxiliary task batch-size, aux bsz, is shared across all the n sub-sampled auxiliary objectives according to the objective s weight. We use the Adam W optimizer (Loshchilov & Hutter, 2017), with weight decay of 0.01 for all experiments. Table 5: AANG-TD specific Hyper-parameters Hyper-parameter Values Description aux lr 1.0, 0.1 Learning rate for factor vectors - {W All, W I, W T , W R, W O} sopt lr 0.1, 0.01 Learning rate for primary task weighting λe nconf subsamp 3, 6 Number of sub-sampled auxiliary tasks. learning rate 1e-3, 1e-4 Learning rate used for further training of Ro BERTabase aux bsz 256 Batch size of for auxiliary objectives Table 6: AANG-TD+ED specific Hyper-parameters Hyper-parameter Values Description aux lr 1.0, 0.5, 0.1 Learning rate for factor vectors - {W All, W I, W T , W R, W O} sopt lr 0.1 Learning rate for primary task weighting λe nconf subsamp 6, 12, 24 Number of sub-sampled auxiliary tasks. 
Table 7: META-TARTAN hyper-parameters for single auxiliary tasks

Hyper-parameter | Values | Description
sopt_lr | 1.0, 0.1, 0.01 | Learning rate for the primary-task weighting λ_e
learning rate | 1e-3, 1e-4, 5e-5 | Learning rate used for further training of RoBERTa-base

META-TARTAN introduces a dev-head which is trained sporadically during training to estimate the meta-gradients. We use the following hyper-parameters for training this dev-head: we sample 32 examples (8 examples in the case of H.PARTISAN) and perform full-batch gradient descent with a learning rate of 1e-2 for 10 iterations. The dev-head is trained with the AdamW optimizer with weight decay set to 0.1.

We copy the end-task-agnostic baseline results from Dery et al. (2021b) when available. We use the hyper-parameters specified for TAPT in Gururangan et al. (2020) to train for the SE-2016-6 task. All models were trained on one of two types of GPUs: NVIDIA A100 or NVIDIA A6000. All models fit within a single GPU. We used gradient accumulation to expand the effective batch sizes used for our experiments.

E GENERALIZATION ERROR BOUND FOR END-TASK AWARE TRAINING

E.1 DEFINITIONS

Definition E.1. A function $f : \Omega \rightarrow \mathbb{R}$ is $L$-Lipschitz if $\forall u, v \in \text{dom}(f)$: $|f(u) - f(v)| \le L\,\|u - v\|$. Note that $L$-Lipschitz implies bounded gradients.

Definition E.2. A function $f : \Omega \rightarrow \mathbb{R}$ is $\beta$-smooth if $\forall u, v \in \Omega$: $\|\nabla f(u) - \nabla f(v)\| \le \beta\,\|u - v\|$.

Definition E.3. An update rule $G$ is $\sigma$-bounded if $\sup_{w \in \Omega} \|w - G(w)\| \le \sigma$.

Consider the following general setting. There is an unknown distribution $\mathcal{D}_e$ over examples from some space $Z$. We receive a sample $S = (z_1, \ldots, z_{N_e})$ of $N_e$ examples drawn i.i.d. from $\mathcal{D}_e$. Our goal is to find a model $w$, which parameterizes the function $f_e$, with small population risk, defined as:

Definition E.4 (Population Risk). $R[w] = \mathbb{E}_{z \sim \mathcal{D}_e}\, f_e(w; z)$

Definition E.5 (Empirical Risk). Since we have a finite number of samples, we can only compute the empirical risk, which is $R_S[w] = \frac{1}{N_e} \sum_i f_e(w; z_i)$.

Let $A$ be a potentially randomized algorithm (such as Stochastic Gradient Descent) that is a function of $S$ such that $w = A(S)$.

Definition E.6 (Generalization Error). $\epsilon_{gen}(A, N_e) = \mathbb{E}_{S, A}\big[ R_S[A(S)] - R[A(S)] \big]$

Definition E.7 (Uniform Stability). A randomized algorithm $A$ is $\epsilon$-uniformly stable if for all datasets $S, S' \subseteq Z$ with $|S| = |S'| = N_e$ such that $S$ and $S'$ differ in at most one example, we have
$$\sup_{z} \; \mathbb{E}_A\big[ f_e(A(S); z) - f_e(A(S'); z) \big] \le \epsilon$$
Here, the expectation is taken only over the internal randomness of $A$. We will denote by $\epsilon_{stab}(A, N_e)$ the infimum over all $\epsilon$ for which the above holds.

E.2 RELEVANT THEOREMS

Theorem E.1 (Uniform Stability implies Generalization in Expectation). Let algorithm $A$ be $\epsilon$-uniformly stable. Then
$$\epsilon_{gen}(A, N_e) = \mathbb{E}_{S, A}\big[ R_S[A(S)] - R[A(S)] \big] \le \epsilon_{stab}(A, N_e)$$
For the full proof, see Theorem 2.2 of Hardt et al. (2016).

Theorem E.2 (Stochastic Gradient Method is Stable). Assume that $f_e(\cdot; z) \in [0, 1]$ is an $L$-Lipschitz and $\beta_e$-smooth loss function for every $z$. Suppose that we run SGM for $T$ steps with monotonically non-increasing step sizes $\alpha_t \le \frac{c}{t}$. Then SGM has uniform stability with
$$\epsilon_{stab} \le \frac{1 + \frac{1}{q}}{N_e - 1}\,\big(2cL^2\big)^{\frac{1}{q+1}}\, T^{\frac{q}{q+1}} \quad \text{where } q = \beta_e c$$
We can simplify this to only the terms involving $T$ and $N_e$:
$$\epsilon_{sgm} \lesssim \frac{1}{N_e}\, T^{1 - \frac{1}{c\beta_e + 1}}$$
Proof. For the full proof, see Theorem 3.12 of Hardt et al. (2016).
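For reference, the dynamic-sampling procedure analyzed in the growth and stability results below (Appendix E.3–E.4) can be sketched as follows. The gradient callables and the parameter representation are illustrative assumptions; the step sizes follow the $\alpha_t \le c/t$ schedule assumed in Theorem E.5.

```python
import random

def dynamic_sampling_sgd(w, grad_fe, grad_fa, S_e, S_a, lam_e, T, c=1.0):
    """T steps of SGD that, at each step, pick the end-task with probability lam_e
    (the auxiliary task otherwise), draw one example uniformly from the chosen task,
    and take a gradient step on that example's loss.

    grad_fe(w, z) / grad_fa(w, z) return per-example gradients as lists of floats
    (illustrative callables; w is a list of parameters of the same length)."""
    for t in range(1, T + 1):
        alpha_t = c / t  # monotonically non-increasing step sizes with alpha_t <= c / t
        if random.random() < lam_e:
            g = grad_fe(w, random.choice(S_e))   # end-task example
        else:
            g = grad_fa(w, random.choice(S_a))   # auxiliary-task example
        w = [wi - alpha_t * gi for wi, gi in zip(w, g)]
    return w
```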
E.3 GROWTH FUNCTIONS

Lemma E.3 (Growth Recursion Under Dynamic Sampling). We consider the stochastic gradient update rule $G : \Omega \rightarrow \Omega$:
$$G_f(w) = w - \alpha \nabla f(w)$$
Fix an arbitrary sequence of updates $G_{f_1}, \ldots, G_{f_T}$ and another $G'_{f_1}, \ldots, G'_{f_T}$. Let $w_0 = w'_0$ be a starting point in $\Omega$ and define $\delta_t = \mathbb{E}_{f_1 \ldots f_t \sim P_\lambda}\big[\|w_t - w'_t\|\big]$, where $w_t, w'_t$ are defined recursively through
$$w_t = G_{f_t}(w_{t-1}) \qquad w'_t = G'_{f_t}(w'_{t-1}) \qquad t \ge 1$$
Then we have the recurrence relation:
$$\delta_{t+1} \le \begin{cases} \min\Big\{ \big(1 + \alpha\lambda_1\beta_1\big)\delta_t + \alpha\lambda_2\big(\Delta + 2L\big),\ \ \big(1 + \alpha(\lambda_1\beta_1 + \lambda_2\beta_2)\big)\delta_t \Big\} & \text{if } G_{f_t} = G'_{f_t} \\ \delta_t + 2\sigma_t & \text{if } G_{f_t}, G'_{f_t} \text{ are } \sigma_t\text{-bounded} \end{cases}$$
Note that $P_\lambda$ is a distribution over the support $\{f^1, f^2\}$ with probabilities $\{\lambda_1, \lambda_2 \mid \lambda_1 + \lambda_2 = 1\}$. The functions $f^1, f^2$ have smoothness $\beta_1, \beta_2$ respectively, and $\Delta$ denotes the bound from Assumption A.1 on the difference between their gradients.

Proof. The second bound on $\delta_{t+1}$ is taken directly from Lemma 2.5 of Hardt et al. (2016). We now derive the first half of the first bound:
$$\begin{aligned}
\delta_{t+1} &= \mathbb{E}_{f_1 \ldots f_{t+1} \sim P_\lambda}\big[\|w_{t+1} - w'_{t+1}\|\big] \\
&= \mathbb{E}_{f_1 \ldots f_t \sim P_\lambda}\Big[\lambda_1 \|G_{f^1}(w_t) - G'_{f^1}(w'_t)\| + \lambda_2 \|G_{f^2}(w_t) - G'_{f^2}(w'_t)\|\Big] \\
&= \mathbb{E}\Big[\lambda_1 \|w_t - \alpha\nabla f^1(w_t) - w'_t + \alpha\nabla f^1(w'_t)\| + \lambda_2 \|w_t - \alpha\nabla f^2(w_t) - w'_t + \alpha\nabla f^2(w'_t)\|\Big] \\
&\le \mathbb{E}\big[\|w_t - w'_t\|\big] + \alpha\,\mathbb{E}\Big[\lambda_1 \|\nabla f^1(w'_t) - \nabla f^1(w_t)\| + \lambda_2 \|\nabla f^2(w'_t) - \nabla f^2(w_t)\|\Big] \quad \text{(triangle inequality)} \\
&\le \delta_t + \alpha\,\mathbb{E}\Big[\lambda_1 \beta_1 \|w_t - w'_t\| + \lambda_2 \|\nabla f^2(w'_t) - \nabla f^2(w_t)\|\Big] \quad \text{(smoothness; WLOG let } \beta_1 \le \beta_2\text{)} \\
&= \big(1 + \alpha\lambda_1\beta_1\big)\delta_t + \alpha\lambda_2\,\mathbb{E}\big[\|\nabla f^2(w'_t) - \nabla f^1(w'_t) + \nabla f^1(w'_t) - \nabla f^2(w_t)\|\big] \quad \text{(add zero)} \\
&\le \big(1 + \alpha\lambda_1\beta_1\big)\delta_t + \alpha\lambda_2\,\mathbb{E}\big[\|\nabla f^2(w'_t) - \nabla f^1(w'_t)\| + \|\nabla f^1(w'_t) - \nabla f^2(w_t)\|\big] \quad \text{(triangle inequality)} \\
&\le \big(1 + \alpha\lambda_1\beta_1\big)\delta_t + \alpha\lambda_2\Big(\Delta + \mathbb{E}\big[\|\nabla f^1(w'_t) - \nabla f^2(w_t)\|\big]\Big) \quad \text{(Assumption A.1)} \\
&\le \big(1 + \alpha\lambda_1\beta_1\big)\delta_t + \alpha\lambda_2\Big(\Delta + \mathbb{E}\big[\|\nabla f^1(w'_t)\| + \|\nabla f^2(w_t)\|\big]\Big) \quad \text{(triangle inequality)} \\
&\le \big(1 + \alpha\lambda_1\beta_1\big)\delta_t + \alpha\lambda_2\big(\Delta + 2L\big) \quad \text{($L$-Lipschitz functions have bounded gradients)}
\end{aligned}$$
To obtain the second half of the first bound, we proceed identically up to the first triangle-inequality step and then apply smoothness to both terms:
$$\begin{aligned}
\delta_{t+1} &\le \mathbb{E}\big[\|w_t - w'_t\|\big] + \alpha\,\mathbb{E}\Big[\lambda_1 \|\nabla f^1(w'_t) - \nabla f^1(w_t)\| + \lambda_2 \|\nabla f^2(w'_t) - \nabla f^2(w_t)\|\Big] \\
&\le \delta_t + \alpha\,\mathbb{E}\Big[\lambda_1\beta_1 \|w_t - w'_t\| + \lambda_2\beta_2 \|w_t - w'_t\|\Big] \quad \text{(smoothness)} \\
&= \delta_t + \alpha\big(\lambda_1\beta_1 + \lambda_2\beta_2\big)\delta_t \\
&= \big(1 + \alpha(\lambda_1\beta_1 + \lambda_2\beta_2)\big)\delta_t
\end{aligned}$$

E.4 STABILITY OF DYNAMIC SAMPLING

We repeat the description of our auxiliary learning with dynamic sampling setting here for ease of access.

Setting: We are given an auxiliary objective $f_a(\cdot; z) \in [0, 1]$ with $N_a$ samples $S_a = (z_1, \ldots, z_{N_a})$ from the distribution $\mathcal{D}_a$. At any iteration of SGD, we sample a choice of either the end-task function $f_e$ or the auxiliary objective $f_a$ according to the probabilities $\lambda_e, \lambda_a$ with $\lambda_e + \lambda_a = 1$. Given the chosen objective, we sample a data-point and perform a stochastic gradient descent (SGD) step based on the sampled data-point. An equivalent way to instantiate this procedure is to create $S_A$ by drawing $N = N_e + N_a$ total samples from the end-task and the auxiliary task according to $P_\lambda$. $S'_A$ is then created by replacing one end-task sample in $S_A$. At each step, a sample $z_i \sim P_{S_A}$ (respectively $z'_i \sim P_{S'_A}$) is drawn, and a gradient step is taken on the function corresponding to the set the sample was drawn from.

Lemma E.4 (Stability of dynamic sampling). We denote the outputs of $T$ steps of SGM on $S_A$ and $S'_A$ with the dynamically sampled functions as $w_T$ and $w'_T$ respectively. Then, for every $z_e \in Z_e$ and every $t_0 > 0$, under both the random update rule and the random permutation rule, we have:
$$\mathbb{E}\big| f_e(w_T; z_e) - f_e(w'_T; z_e) \big| \le \frac{\gamma t_0}{N}\, \sup_{w, z_e} f_e(w; z_e) + L\, \mathbb{E}\big[\delta_T \mid \delta_{t_0} = 0\big]$$
where $N = N_e + N_a$ and $\gamma = \frac{\lambda_e N}{N_e}$.

Proof.
Let $E$ denote the event that $\delta_{t_0} = 0$. We have
$$\begin{aligned}
\mathbb{E}\big| f_e(w_T; z) - f_e(w'_T; z) \big| &= P\{E\}\,\mathbb{E}\big[\,| f_e(w_T; z) - f_e(w'_T; z) | \mid E\,\big] + P\{E^c\}\,\mathbb{E}\big[\,| f_e(w_T; z) - f_e(w'_T; z) | \mid E^c\,\big] \\
&\le \mathbb{E}\big[\,| f_e(w_T; z) - f_e(w'_T; z) | \mid E\,\big] + P\{E^c\}\,\sup_{w, z_e} f_e(w; z_e) \quad \text{(because $f_e$ is non-negative)} \\
&\le L\,\mathbb{E}\big[\,\|w_T - w'_T\| \mid E\,\big] + P\{E^c\}\,\sup_{w, z_e} f_e(w; z_e) \quad \text{(because $f_e$ is $L$-Lipschitz)}
\end{aligned}$$
We now proceed to bound $P\{E^c\}$. Let $i \in [N]$ denote the position in which $S_A$ and $S'_A$ differ, and consider the random variable $I$ denoting the first time step in which SGM uses the example $z^i_e$. Note that when $I > t_0$, we must have $\delta_{t_0} = 0$, since the two samples are identical up until this point. Hence
$$P\{E^c\} = P\{\delta_{t_0} \neq 0\} \le P\{I \le t_0\}$$
Using the selection rule specified above (sample either $f_e$ or $f_a$ according to the probabilities $\lambda_e, \lambda_a$ and then sample uniformly from the selected task's data), we have
$$P\{I \le t_0\} \le \sum_{t=1}^{t_0} P\{I = t\} \le \frac{\lambda_e}{N_e}\, t_0 = \frac{\gamma t_0}{N}$$
which completes the proof.

Theorem E.5 (Stability Bound on Dynamic Sampling). Assume that $f_e(\cdot; z_e), f_a(\cdot; z_a) \in [0, 1]$ are $L$-Lipschitz and $\beta_e$- and $\beta_a$-smooth loss functions. Consider that we have $N = N_e + N_a$ total samples, where $f_e$ and $f_a$ have $N_e$ and $N_a$ samples respectively. Suppose that we run SGM for $T$ steps with monotonically non-increasing step sizes $\alpha_t \le \frac{c}{t}$, dynamically sampling the tasks according to $\lambda_e$ and $\lambda_a$. Then, with respect to $f_e$, SGM has uniform stability with
$$\epsilon_{stab} \le \frac{1 + \frac{1}{c\tilde{\beta}}}{N}\,\Big( Lc\big(2\gamma L + (N - \gamma)\rho\big) \Big)^{\frac{1}{c\tilde{\beta} + 1}}\,(\gamma T)^{\frac{c\tilde{\beta}}{c\tilde{\beta} + 1}}$$
where $\gamma = \frac{\lambda_e N}{N_e}$, $\beta^* = \min\{\beta_e, \beta_a\}$ and $\lambda^*$ is the corresponding weighting of the function with the smaller smoothness. Depending on which one gives a tighter bound, the pair $(\tilde{\beta}, \rho)$ can be either
$$(\tilde{\beta}, \rho)_1 = \big(\lambda^*\beta^*,\ (1 - \lambda^*)(\Delta + 2L)\big) \quad \text{or} \quad (\tilde{\beta}, \rho)_2 = \big(\lambda_e\beta_e + \lambda_a\beta_a,\ 0\big)$$
When $(\tilde{\beta}, \rho)_1$ gives the tighter bound, we can simplify (up to constants) to
$$\epsilon_{gen} \lesssim \big((1 - \lambda^*)(\Delta + 2L)\big)^{\frac{1}{1 + c\lambda^*\beta^*}}\,(\gamma T)^{1 - \frac{1}{c\lambda^*\beta^* + 1}}$$
as presented in Section 4.

Proof. Let $S_A, S'_A$ be two samples of size $N = N_e + N_a$ as described in Lemma E.4. Consider the gradient updates $G_{f_1}, \ldots, G_{f_T}$ and $G'_{f_1}, \ldots, G'_{f_T}$ induced by running SGM on the samples $S_A$ and $S'_A$ respectively. Let $w_T$ and $w'_T$ denote the corresponding outputs of SGM. By Lemma E.4 we have:
$$\mathbb{E}\big| f_e(w_T; z) - f_e(w'_T; z) \big| \le \frac{\gamma t_0}{N}\, \sup_{w, z_e} f_e(w; z_e) + L\, \mathbb{E}\big[\delta_T \mid \delta_{t_0} = 0\big] \qquad (8)$$
Let $\Psi_T = \mathbb{E}[\delta_T \mid \delta_{t_0} = 0]$. We will bound $\Psi_T$ as a function of $t_0$ and then minimize over $t_0$. Note the following:

- At any step $t$, with probability $1 - \frac{\gamma}{N}$, the sample selected is the same in both $S_A$ and $S'_A$. In this case $G_{f_t} = G'_{f_t}$ and we use the corresponding expansivity rule from Lemma E.3. This gives:
$$\delta_{t+1} \le \min\Big\{ \big(1 + \alpha_t\lambda^*\beta^*\big)\delta_t + \alpha_t(1 - \lambda^*)(\Delta + 2L),\ \ \big(1 + \alpha_t(\lambda_e\beta_e + \lambda_a\beta_a)\big)\delta_t \Big\}$$
where $\beta^* = \min\{\beta_e, \beta_a\}$ and $\lambda^*$ is the corresponding weighting of the function with the smaller smoothness. To avoid deriving the bound independently for each case, we perform a variable substitution that captures the two cases:
$$\delta_{t+1} \le \big(1 + \alpha_t\tilde{\beta}\big)\delta_t + \alpha_t\rho$$
with $\tilde{\beta} \in \{\lambda^*\beta^*,\ \lambda_e\beta_e + \lambda_a\beta_a\}$ and $\rho \in \{(1 - \lambda^*)(\Delta + 2L),\ 0\}$. We present the final bound in terms of these variables, which can be substituted depending on the minimizer.

- With probability $\frac{\gamma}{N}$, the selected example is different. Note that in this case we know that we are evaluating the end-task function $f_e$. We use the fact that both $G_{f_t}$ and $G'_{f_t}$ are $(\sigma_t = \alpha_t L)$-bounded according to Lemma E.3, since $f_e$ is $L$-Lipschitz.

Combining the above, we have:
$$\begin{aligned}
\Psi_{t+1} &\le \Big(1 - \frac{\gamma}{N}\Big)\Big[\big(1 + \alpha_t\tilde{\beta}\big)\Psi_t + \alpha_t\rho\Big] + \frac{\gamma}{N}\Big[\Psi_t + 2\alpha_t L\Big] \\
&= \Big(1 + \Big(1 - \frac{\gamma}{N}\Big)\alpha_t\tilde{\beta}\Big)\Psi_t + \Big(1 - \frac{\gamma}{N}\Big)\alpha_t\rho + \frac{2\gamma\alpha_t L}{N} \\
&= \Big(1 + \Big(1 - \frac{\gamma}{N}\Big)\alpha_t\tilde{\beta}\Big)\Psi_t + \frac{\alpha_t}{N}\big(2\gamma L + (N - \gamma)\rho\big) \\
&\le \Big(1 + \Big(1 - \frac{\gamma}{N}\Big)\frac{c\tilde{\beta}}{t}\Big)\Psi_t + \frac{c}{tN}\big(2\gamma L + (N - \gamma)\rho\big) \quad \text{(using } \alpha_t \le \tfrac{c}{t}\text{)} \\
&\le \exp\Big(\Big(1 - \frac{\gamma}{N}\Big)\frac{c\tilde{\beta}}{t}\Big)\Psi_t + \frac{c\hat{\rho}}{tN} \quad \text{(using } 1 + x \le \exp(x)\text{)}
\end{aligned}$$
where $\hat{\rho} = 2\gamma L + (N - \gamma)\rho$. We can unwind the recurrence until $\Psi_{t_0} = 0$.
$$\begin{aligned}
\Psi_T &\le \sum_{t = t_0 + 1}^{T} \Bigg[\prod_{k = t+1}^{T} \exp\Big(\Big(1 - \frac{\gamma}{N}\Big)\frac{c\tilde{\beta}}{k}\Big)\Bigg] \frac{c\hat{\rho}}{tN} \\
&\le \sum_{t = t_0 + 1}^{T} \exp\Big(\Big(1 - \frac{\gamma}{N}\Big)c\tilde{\beta}\,\big(\log T - \log t\big)\Big)\,\frac{c\hat{\rho}}{tN} \\
&= \frac{c\hat{\rho}}{N}\, T^{c\tilde{\beta}\left(1 - \frac{\gamma}{N}\right)} \sum_{t = t_0 + 1}^{T} t^{-c\tilde{\beta}\left(1 - \frac{\gamma}{N}\right) - 1} \\
&\le \frac{c\hat{\rho}}{N\, c\tilde{\beta}\left(1 - \frac{\gamma}{N}\right)}\,\Big(\frac{T}{t_0}\Big)^{c\tilde{\beta}\left(1 - \frac{\gamma}{N}\right)} \quad \text{(upper bound the sum over $t$ with an integral and drop negative terms)} \\
&= \frac{\hat{\rho}}{\tilde{\beta}(N - \gamma)}\,\Big(\frac{T}{t_0}\Big)^{c\tilde{\beta}\left(1 - \frac{\gamma}{N}\right)}
\end{aligned}$$
Plugging this bound back into Equation 8 and using the fact that $f_e \in [0, 1]$:
$$\mathbb{E}\big| f_e(w_T; z) - f_e(w'_T; z) \big| \le \frac{\gamma t_0}{N} + \frac{L\hat{\rho}}{\tilde{\beta}(N - \gamma)}\,\Big(\frac{T}{t_0}\Big)^{c\tilde{\beta}\left(1 - \frac{\gamma}{N}\right)}$$
We let $q = c\tilde{\beta}$; we can minimize the R.H.S. by setting
$$t_0 = \Big(\frac{N L c \hat{\rho}}{\gamma(N - \gamma)}\Big)^{\frac{1}{q+1}}\, T^{\frac{q}{q+1}}$$
Plugging this in gives us:
$$\mathbb{E}\big| f_e(w_T; z) - f_e(w'_T; z) \big| \le \frac{1 + \frac{1}{c\tilde{\beta}}}{N}\,\Big( Lc\big(2\gamma L + (N - \gamma)\rho\big) \Big)^{\frac{1}{c\tilde{\beta} + 1}}\,(\gamma T)^{\frac{c\tilde{\beta}}{1 + c\tilde{\beta}}} \qquad (12)$$
Recall that
$$\tilde{\beta} \in \big\{\lambda^*\beta^*,\ \lambda_e\beta_e + \lambda_a\beta_a\big\} \qquad \rho \in \big\{(1 - \lambda^*)(\Delta + 2L),\ 0\big\}$$
We can choose whichever of the pairs for $(\tilde{\beta}, \rho)$ minimizes the bound.

F DISCUSSION OF GENERALIZATION ERROR BOUNDS

F.1 WHAT DOES THEOREM E.5 SAY?

We consider the setting where
$$\tilde{\beta} = \lambda^*\beta^* \qquad \rho = (1 - \lambda^*)(\Delta + 2L)$$
Assuming the $\rho$ term dominates, Equation 12 in this setting becomes:
$$\epsilon^{aux\text{-}dyn}_{gen} \le \epsilon^{aux\text{-}dyn}_{stab}\Big|_{(\tilde{\beta}, \rho)_1} \lesssim \big((1 - \lambda^*)(\Delta + 2L)\big)^{\frac{1}{1 + c\tilde{\beta}}}\,(\gamma T)^{1 - \frac{1}{c\tilde{\beta} + 1}} \lesssim \big((1 - \lambda^*)\Delta\big)^{\frac{1}{1 + c\lambda^*\beta^*}}\,(\gamma T)^{1 - \frac{1}{c\lambda^*\beta^* + 1}}$$
This is Equation 1 from Section 4. In going from the first expression to the second, we consider the setting where $\Delta$ is of the order of $2L$; this is the case where the auxiliary task is sufficiently different from the primary task. Some observations about this setting:

1. A smaller $\Delta$ implies the auxiliary task is similar to the main task, and leads to an improved bound.
2. The dependence of the bound on $N$ is a bit more nuanced. Note that increasing $N$ increases $\gamma$ unless we reduce $\lambda_e$ appropriately. Remember that $\lambda_e$ is the rate at which we sample the primary task. Thus, if we add more auxiliary data but still sample the primary task at the original rate, then we are effectively ignoring the extra auxiliary data.
3. It might be tempting to assume that we can get arbitrary improvements in this setting by setting $\lambda_e = 0$. However, note that whilst this might reduce the generalization error, it means that we see none of the end-task, which would result in a large increase in the training error.
4. Note that $\tilde{\beta} = \lambda^*\beta^* \le \beta_e$ always. So we get an improvement in the dependence on $T$ compared to Theorem E.2.
5. We can optimize $\lambda_e, \lambda_a$ to minimize $\epsilon^{aux\text{-}dyn}_{stab}$.
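To illustrate observations 1–5 numerically, the simplified bound can be evaluated over a grid of end-task sampling rates $\lambda_e$. All constants below ($\beta^*$, $c$, $L$, $\Delta$, the sample sizes and $T$) are illustrative placeholders rather than values estimated from our experiments, and for simplicity the sketch assumes the end-task has the smaller smoothness constant, so that $\lambda^* = \lambda_e$.

```python
import numpy as np

def gen_bound(lam_e, Ne, Na, T, beta_star=1.0, c=1.0, L=1.0, delta=0.5):
    """Simplified bound ((1 - lam*)(Delta + 2L))^(1/(1+q)) * (gamma * T)^(q/(q+1)),
    with q = c * lam_e * beta_star and gamma = lam_e * (Ne + Na) / Ne (illustrative constants)."""
    N = Ne + Na
    gamma = lam_e * N / Ne
    q = c * lam_e * beta_star
    rho = (1.0 - lam_e) * (delta + 2.0 * L)
    return rho ** (1.0 / (1.0 + q)) * (gamma * T) ** (q / (q + 1.0))

# Observation 5: sweep lambda_e and pick the rate that minimizes the bound
# (a larger lambda_e shrinks rho but inflates gamma and the exponent on T).
lams = np.linspace(0.05, 0.95, 19)
vals = [gen_bound(l, Ne=2_000, Na=50_000, T=10_000) for l in lams]
print("bound-minimizing lambda_e:", lams[int(np.argmin(vals))])

# Observation 1: a smaller task dissimilarity Delta shrinks rho and hence the bound.
print(gen_bound(0.5, 2_000, 50_000, 10_000, delta=0.1)
      < gen_bound(0.5, 2_000, 50_000, 10_000, delta=1.5))
```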