
Joint Data-Task Generation for Auxiliary Learning

Hong Chen1, Xin Wang1,2 , Yuwei Zhou1, Yijian Qin1, Chaoyu Guan1, Wenwu Zhu1,2

1 Department of Computer Science and Technology, Tsinghua University
2 Beijing National Research Center for Information Science and Technology, Tsinghua
{h-chen20,zhou-yw21,qinyj19,guancy19}@mails.tsinghua.edu.cn
{xin_wang,wwzhu}@tsinghua.edu.cn

Current auxiliary learning methods mainly adopt the methodology of reweighing losses for the manually collected auxiliary data and tasks. However, these methods rely heavily on domain knowledge during data collection, which may be hard to obtain in practice. Therefore, current methods become less effective and can even harm the primary task when unhelpful auxiliary data and tasks are employed. To tackle this problem, we propose a joint data-task generation framework for auxiliary learning (DTG-AuxL), which benefits the primary task by generating new auxiliary data and a new auxiliary task in a joint manner. The proposed DTG-AuxL framework contains a joint generator and a bi-level optimization strategy. Specifically, the joint generator contains a feature generator and a label generator, which are designed to be applicable and expressive for various auxiliary learning scenarios. The bi-level optimization strategy optimizes the joint generator and the task learning model: the joint generator is optimized in the upper level via the implicit gradient from the primary loss and the explicit gradient of our proposed instance regularization, while the task learning model is optimized in the lower level on the generated data and task. Extensive experiments show that our proposed DTG-AuxL framework consistently outperforms existing methods in various auxiliary learning scenarios, particularly when the manually collected auxiliary data and tasks are unhelpful.

1 Introduction

Auxiliary learning aims to improve the model generalization ability on the primary task with the help of related auxiliary tasks [1, 2]. This learning paradigm has been widely adopted and has shown its effectiveness in various areas, like image classification [3], recommendation [4] and reinforcement learning [5, 6]. Different auxiliary tasks are often chosen manually according to the primary task, e.g., [7] utilize the task of visual attribute classification to help the fine-grained bird classification and [4] improve the click conversion rate prediction with the help of click-through rate prediction task.

Most existing works utilize the auxiliary information by first reweighing the losses of the auxiliary data and tasks, and then using the sum of the weighted losses together with the primary loss to optimize the task learning model. The weights, which are tuned with hyperparameter optimization (HPO) tools [4, 3, 8], balance the primary loss and the auxiliary losses to avoid negative auxiliary transfer. More recent works [9, 6, 10, 1, 11] propose to dynamically weigh different auxiliary losses during training.

However, the existing methods require that there exists beneficial information in the auxiliary data and tasks, so that the beneficial loss terms can be selected through the loss reweighing process. This condition cannot always be satisfied because whether the auxiliary data or task is beneficial depends on many factors, including the chosen auxiliary task, the scale of the primary task dataset, and the selected learning model for the tasks [1, 11], making it difficult to manually collect beneficial auxiliary data and tasks via prior knowledge. Therefore, existing approaches may adopt useless auxiliary information and end up harming the primary task when the involved auxiliary data and tasks are improperly collected, as observed in [1, 12]. Although [2] propose to generate fine-grained auxiliary classification tasks, this method can only be applied to a classification primary task. Additionally, it only generates the new task without considering new data, while data-level information has been pointed out to be an important factor in auxiliary learning [11].

Corresponding Authors.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

To address the problem, in this paper, we propose to simultaneously generate the auxiliary data and task for auxiliary learning in a joint manner. However, there are three challenges for the joint generation. First, it is challenging to design a generic framework that can accommodate various tasks with different inputs. This is because the types of data and tasks are quite diverse across auxiliary learning scenarios, e.g., the primary task of image classification takes visual images as input and outputs categorical labels for classification [7], while the primary task of rating prediction for recommendation takes tabular data as input and outputs numeric labels for regression [4]. Second, it is challenging to develop a generation framework that is expressive enough to produce beneficial data and tasks for the primary task. Finally, it is challenging to optimize the parameters of the generation framework effectively so that the jointly generated auxiliary data and task benefit the primary task.

To tackle these challenges, we propose a joint Data-Task Generation framework for Auxiliary Learning (DTG-AuxL), which involves a joint generator and a bi-level optimization strategy. Specifically, the joint generator consists of a feature generator and a label generator, which generate the new auxiliary data and the new task based on the existing manually collected data and tasks. The data generation process is conducted in the feature space so that it can accommodate various input types, while the label generator architecture is inspired by models used in recommendation and can accommodate both categorical labels for classification and numeric labels for regression. Moreover, we introduce nonlinear interaction terms in the joint generator, making it more expressive in producing beneficial auxiliary data and tasks. To effectively optimize the joint generator and the task learning model, we propose a bi-level optimization strategy with instance regularization. In the lower optimization, the task learning model is optimized on the generated auxiliary data and task. In the upper optimization for the joint generator, we utilize not only the implicit gradient from the primary loss but also the explicit gradient from the proposed instance regularization, to avoid label generation mode collapse. Extensive experiments show that DTG-AuxL outperforms existing methods in various auxiliary learning scenarios, especially when the manually collected auxiliary data and task are unhelpful to the primary task. We summarize our contributions as follows:

We propose to simultaneously generate auxiliary data and tasks in a joint manner in auxiliary learning for the first time, to the best of our knowledge.

We propose DTG-AuxL, a joint data-task generation framework applicable in various auxiliary learning scenarios, containing a joint generator and a bi-level optimization strategy.

Extensive experiments in various auxiliary learning scenarios demonstrate the superiority of our proposed DTG-AuxL framework. We further analyze when and how DTG-AuxL works to bring a performance boost.

2 Related Work

Auxiliary Learning The most widely adopted way in auxiliary learning is to combine the loss of the primary task and the losses of the auxiliary tasks in a linear way [3, 7, 4, 5], where the linear weights are tuned manually or with HPO methods. More recent works [9, 6, 10] propose to automatically assign dynamic weights to the auxiliary losses. [9] propose to assign weights to each loss based on the cosine gradient similarity between the primary loss and each auxiliary loss. Later work [10] aims to make the weighted auxiliary gradient sum closest to the gradient of the primary loss. [13, 1] utilize the bi-level optimization strategy to learn the auxiliary weights, where [1] even propose to combine the losses not only in a linear form but also in a nonlinear form. [11, 14] point out that only considering the task-level information is not sufficient, so they jointly consider the weights of different tasks and different data samples within the same task through a joint selector. As mentioned above, these reweighing methods easily fail to bring improvement when the chosen auxiliary tasks and data are unhelpful. There are also works [2, 1] that generate fine-grained classification auxiliary tasks for the primary classification task. However, they can only be applied to a classification problem. Additionally, they only generate auxiliary tasks without new data, limiting their performance especially when the data of the primary task is inadequate, which is an often encountered scenario in auxiliary learning [9, 1].

[Figure 1 diagram. Panels: Model Design and Optimization. Components: Task Learning Model (TLM) θ with a primary head and auxiliary heads; Joint Generator (JG) ϕ with a Feature Generator (feature linear term, feature nonlinear term) and a Label Generator (label linear term, label nonlinear term, bias); upper optimization for the joint generator and lower optimization for the task learning model.]

Figure 1: The overall DTG-AuxL framework. In the model design part, the joint generator contains a feature generator and a label generator. The feature generator utilizes the primary and auxiliary features to generate the new auxiliary feature $f_{g,j}$, whose label $\hat{y}_{g,j}$ is generated by the label generator by combining the information of all the auxiliary and primary labels. The optimization part shows that we optimize the task learning model and the joint generator in an alternating bi-level manner.

Multi-task learning Multi-task learning aims to share information among tasks to improve model performance. However, the goal of multi-task learning is to obtain good performance on all the tasks, while auxiliary learning only focuses on the primary task. Multi-task learning methods can be grouped into three categories [15]: multi-task architecture design [16, 17], multi-task optimization [18, 19] and multi-task relationship learning [20], where the multi-task optimization methods involve techniques for optimizing several losses, such as loss reweighing [18] and gradient modulation [21].

3 The Proposed Method

The overall DTG-AuxL framework is shown in Figure 1. Next, we give the problem formulation, describe the designs of the joint generator, and present the proposed bi-level optimization strategy.

3.1 Preliminaries and Problem Formulation

In auxiliary learning, we have one primary task $T_p$ and $K$ auxiliary tasks $\{T_{a_i}\}_{i=1}^{K}$. Each of these tasks has its corresponding training dataset, including the primary dataset $D_p = \{(x_{p,j}, y_{p,j})\}_{j=1}^{|D_p|}$ and the dataset for each auxiliary task $D_{a_i} = \{(x_{a_i,j}, y_{a_i,j})\}_{j=1}^{|D_{a_i}|}$, where $|\cdot|$ denotes the number of data samples in the dataset. If the auxiliary tasks share the same input with the primary task, which is a widely encountered and studied scenario in previous works [11, 1, 9, 10], we have $x_{a_i,j} = x_{p,j}$ and $|D_{a_i}| = |D_p|$. Besides the datasets, we have a task learning model parameterized by $\theta$ which is used to learn all the tasks together. There is also a validation dataset $D_v$ which is used to evaluate the model performance on the primary task. With these notations, the widely adopted training objective of auxiliary learning is:

$$L_t(\theta) = L_p(D_p; \theta) + \sum_{i=1}^{K} w_i L_{a_i}(D_{a_i}; \theta), \tag{1}$$

where $L_p$ and $L_{a_i}$ indicate the loss functions of the primary task and each auxiliary task. Current reweighing methods focus on how to decide $w_i$ so that the task learning model $\theta$ achieves the best performance on $D_v$. However, as mentioned above, the terms $L_{a_i}(D_{a_i}; \theta)$ are defined by the manually collected auxiliary data and tasks, with no guarantee of benefiting the primary task. Therefore, we propose to simultaneously generate the new beneficial auxiliary data and task in a joint manner, with the following training objective:

$$L_t(\theta, \phi) = L_p(D_p; \theta) + \sum_{i=1}^{K} w_i L_{a_i}(D_{a_i}; \theta) + w_g L_g(D_g(\phi); \theta), \tag{2}$$

where the last term is the loss on the generated auxiliary data and task. $D_g(\phi) = \{x_{g,j}, \hat{y}_{g,j}\}_{j=1}^{|D_g|}$ is the generated dataset, which contains the new data together with the new label defining the new task, and $\phi$ denotes the parameters of the joint generator that are used to generate $D_g$. Since we only care about the performance of the primary task in auxiliary learning, we keep $L_g$ the same loss function as $L_p$, e.g., cross-entropy or MSE loss. Note that we still keep the original auxiliary loss terms in our training objective, so that it can accommodate the scenario where the existing manually collected auxiliary data and tasks are beneficial. The task weights $w_i$ and $w_g$ are also learnable and are optimized together with the generator parameters, so we uniformly denote them all as $\phi$ for convenience. Next, we discuss the details of the generator and how we optimize $\phi$ and $\theta$.

3.2 Joint Generator Design

The joint generator involves a feature generator to generate new features and a label generator to generate a new task for the new feature, as shown in the Model Design of Figure 1.

3.2.1 Feature Generator

Since we expect that our data generator can tackle different types of data, a neat and elegant solution is to conduct the generation process in the feature space. In the feature space, different types of input data are transformed to vectors, so we can tackle these vectors in a unified way. Specifically, for the input from different tasks $[x_{p,j}, x_{a_1,j}, \dots, x_{a_K,j}]$, the task learning model first maps them to the feature space with an encoder, generally referred to as the "backbone", and then uses different task-specific heads to tackle each of the tasks. We generate the new features based on the features extracted via the backbone. In other words, the input of the feature generator is a feature list $[f_{p,j}, f_{a_1,j}, \dots, f_{a_K,j}]$, where $f_{p,j} = f_{\text{backbone}}(x_{p,j}; \theta)$ is a $d$-dimensional feature. We next describe how the new features are generated.

Feature linear term. A natural way to generate new features is to combine different features with linear masks. We assign an individual feature mask to each of the $K+1$ features, i.e., we have a mask list $[m_p, m_{a_1}, \dots, m_{a_K}]$, where each mask is a $d$-dimensional vector. Then, we denote the subscript set for the task IDs $\{p, a_1, \dots, a_K\}$ as $S$, and the linear combination is conducted as follows:

$$f_{gl,j} = \sum_{i \in S} \hat{m}_i \odot f_{i,j}, \qquad \hat{m}_i[k] = \frac{e^{m_i[k]}}{\sum_{j \in S} e^{m_j[k]}}, \quad k = 1, \dots, d, \tag{3}$$

where the subscript $gl$ stands for "generated linearly", $\odot$ denotes element-wise multiplication, and $\hat{m}_i$ is the normalized mask. $m_i[k]$ is the $k$-th element of the mask $m_i$, and it is normalized by a softmax over the elements at the same dimension in all the masks. This linear combination with normalized masks works as feature selection over all the input features, which are then combined into a new one.

Feature nonlinear term. To make the generated features more expressive, we introduce an MLP (multi-layer perceptron) to capture the nonlinear feature interaction:

$$f_{gn,j} = \mathrm{MLP}_F([f_{p,j}; f_{a_1,j}; \dots; f_{a_K,j}]), \tag{4}$$

where the subscript $gn$ stands for "generated nonlinearly". All the features are concatenated together and the nonlinear feature interactions are modeled by an MLP, whose output dimension is $d$. Finally, the generated feature is the sum of the linear and nonlinear terms, i.e., $f_{g,j} = f_{gl,j} + f_{gn,j}$.
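The feature generator of eqs. (3) and (4) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`linear_term`, `mlp`, `generate_feature`), the one-hidden-layer ReLU architecture of the MLP, and the toy sizes are assumptions made for illustration only.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def linear_term(features, masks):
    """Eq. (3): softmax the masks across tasks per dimension, then take a masked sum."""
    d = len(features[0])
    out = [0.0] * d
    for k in range(d):
        weights = softmax([m[k] for m in masks])  # normalise dimension k across tasks
        out[k] = sum(w * f[k] for w, f in zip(weights, features))
    return out

def mlp(x, layers):
    """Small MLP (assumed architecture): ReLU between layers, none after the last."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def generate_feature(features, masks, layers):
    """f_g = f_gl + f_gn, where f_gn is an MLP over the concatenated features (eq. (4))."""
    concat = [v for f in features for v in f]
    f_gl = linear_term(features, masks)
    f_gn = mlp(concat, layers)
    return [a + b for a, b in zip(f_gl, f_gn)]

def random_layer(rng, n_in, n_out):
    """Hypothetical initialiser: small uniform weights and zero bias."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
d, num_tasks = 4, 3  # primary task plus K = 2 auxiliary tasks (toy sizes)
features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_tasks)]
masks = [[0.0] * d for _ in range(num_tasks)]  # zero masks give a uniform average
layers = [random_layer(rng, d * num_tasks, 8), random_layer(rng, 8, d)]
f_g = generate_feature(features, masks, layers)
```

With all masks at zero, the softmax assigns weight 1/(K+1) to every task, so the linear term reduces to a plain average of the input features; training the masks moves it towards per-dimension feature selection.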

3.2.2 Label Generator

The label generator aims to generate a proper label for the feature produced by the feature generator. We use the labels of all the input data as the input of the label generator to make the generated label expressive enough, i.e., the input of the label generator is a list of labels $[y_{p,j}, y_{a_1,j}, \dots, y_{a_K,j}]$.

However, this label list may contain both numeric labels (if the corresponding task is regression) and categorical labels (if the corresponding task is classification). The key problem is how to utilize these different types of label information to generate a new label. Inspired by CTR (click-through rate) models in recommendation [22, 23], which predict a user's preference towards an item based on both numeric and categorical features, we design the label generator with both linear terms and deep nonlinear terms similar to those in CTR models.

Label linear term. The linear term models the direct and independent relationship between each input label and the generated label. We simply keep the dimension of the generated label the same as that of the primary label $y_{p,j}$; more adaptive ways to choose the dimension are interesting future work. Here, we assume that the dimension of $y_{p,j}$ is $m$, where $m = 1$ if the primary task $T_p$ is regression, and $m$ equals the total number of categories if $T_p$ is classification. If the primary task is regression, the generated label is also a scalar value for regression, while if the primary task is classification, the generated label is an $m$-dimensional probability distribution vector. Specifically, for each task ID $i \in S$, if $T_i$ is a classification task with $d_i$ categories in total, there is an embedding table $E_i$ with dimension $d_i \times m$ for this task. We directly map the label $y_{i,j}$ in task $T_i$ to its $m$-dimensional embedding through the embedding table:

$$y_{gl,i,j} = E_i(y_{i,j}). \tag{5}$$

If $T_i$ is a regression task, we follow [23] to map $y_{i,j}$ to its $m$-dimensional embedding. Specifically, we also have an embedding table $E_i$ for this task $T_i$, whose dimension is $d_i \times m$, where $d_i = H$ is a hyper-parameter fixed to 10 in our experiments. Then $y_{i,j}$ is mapped into the $m$-dimensional embedding space as follows:

$$c_{i,j} = \mathrm{Softmax}(\mathrm{Linear}(y_{i,j})), \quad c_{i,j} \in \mathbb{R}^{H}, \qquad y_{gl,i,j} = \sum_{k=1}^{H} c_{i,j}[k]\, E_i[k], \tag{6}$$

where the numeric label $y_{i,j}$ is linearly transformed to an $H$-dimensional vector $c_{i,j}$, which is used to weigh all the embeddings in the embedding table to obtain the final embedding. Then the final label linear term is obtained as the sum of all the input label embeddings, $y_{gl,j} = \sum_{i \in S} y_{gl,i,j}$.
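As a rough illustration of eqs. (5) and (6), the sketch below embeds one categorical label by table lookup and one numeric label by soft assignment over the rows of a table, then sums the two embeddings. The function names, the toy tables, and the choice H = 2 (the paper fixes H to 10) are assumptions made to keep the example small.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def embed_categorical(label, table):
    """Eq. (5): look up row `label` of a d_i x m embedding table."""
    return list(table[label])

def embed_numeric(value, weight, bias, table):
    """Eq. (6): map a scalar to H soft weights, then mix the H rows of the table."""
    logits = [w * value + b for w, b in zip(weight, bias)]  # Linear: R -> R^H
    c = softmax(logits)
    m = len(table[0])
    return [sum(c[k] * table[k][t] for k in range(len(table))) for t in range(m)]

def label_linear_term(embeddings):
    """Sum the per-task embeddings to obtain y_gl."""
    m = len(embeddings[0])
    return [sum(e[t] for e in embeddings) for t in range(m)]

# Toy setup with m = 3: one binary classification task and one regression task.
cat_table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # d_i = 2 rows
num_table = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]  # H = 2 rows (toy value)
e_cat = embed_categorical(1, cat_table)
e_num = embed_numeric(0.0, weight=[1.0, -1.0], bias=[0.0, 0.0], table=num_table)
y_gl = label_linear_term([e_cat, e_num])
```

For the numeric label 0.0 the two logits are equal, so both table rows receive weight 0.5 and their third components cancel; a positive label would shift weight towards the first row.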

Label nonlinear term. In addition to the label linear term capturing how each input label independently influences the final generated label, we also propose a nonlinear term to model the influence of more complex label interactions on the generated label. Specifically, we introduce another group of embedding tables for all the input tasks $\{E^{N}_{i}\}_{i \in S}$, where the ways to obtain the embeddings of the categorical and numeric labels are the same as in eq. (5) and eq. (6), respectively. The dimension of $E^{N}_{i}$ is $d_i \times m_n$, where $d_i$ equals the total number of categories if $T_i$ is classification and equals $H$ (i.e., 10) if $T_i$ is regression, and $m_n$ is a hyper-parameter. With these embedding tables, we can map the input label list $[y_{p,j}, y_{a_1,j}, \dots, y_{a_K,j}]$ to an embedding list $[e_{p,j}, e_{a_1,j}, \dots, e_{a_K,j}]$. Then the nonlinear label interactions are captured through an MLP:

$$y_{gn,j} = \mathrm{MLP}_L([e_{p,j}; e_{a_1,j}; \dots; e_{a_K,j}]). \tag{7}$$

The final generated label for the generated feature $f_{g,j}$ is the sum of the linear and nonlinear terms,

$$y_{g,j} = y_{gl,j} + y_{gn,j}. \tag{8}$$

Label bias term. Besides the linear and nonlinear terms, we also introduce a label bias term that guides the generated label to possess similar semantic meaning to the label from the target primary task, yp,j. As such, we add yp,j as the label bias term as follows,

$$\hat{y}_{g,j} = \alpha\, y_{p,j} + (1 - \alpha)\, \mathrm{norm}(y_{g,j}), \tag{9}$$

where $\alpha \in (0, 1)$ is a learnable parameter initialized as 0.5 and $\mathrm{norm}(\cdot)$ is the normalization function. If the primary task is classification, $\mathrm{norm}(\cdot)$ is $\mathrm{softmax}(\cdot)$ and $y_{p,j}$ is converted to its one-hot form. If the primary task is regression with range $(a, b)$, then $\mathrm{norm}(\cdot)$ is set to $(b - a)\,\mathrm{sigmoid}(\cdot) + a$. This bias term makes the generated label lie in the same space as the original label, enabling us to more conveniently explore its semantic meaning, and it is also verified to improve model performance in our experiments.
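The bias term of eq. (9) can be sketched for both label types as follows. This is an illustrative sketch only: the function names are hypothetical, and $\alpha$ is passed in as a plain float, whereas in the paper it is learnable (how it is kept inside (0, 1) during training is not specified here).

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def norm_regression(y_g, a, b):
    """norm() for regression: squash a scalar into the label range (a, b)."""
    return (b - a) / (1.0 + math.exp(-y_g)) + a

def biased_label_classification(y_p_index, y_g, alpha):
    """Eq. (9), classification: mix the one-hot primary label with softmax(y_g)."""
    one_hot = [1.0 if t == y_p_index else 0.0 for t in range(len(y_g))]
    probs = softmax(y_g)
    return [alpha * o + (1.0 - alpha) * p for o, p in zip(one_hot, probs)]

def biased_label_regression(y_p, y_g, alpha, a, b):
    """Eq. (9), regression: mix the primary rating with the rescaled sigmoid of y_g."""
    return alpha * y_p + (1.0 - alpha) * norm_regression(y_g, a, b)

# alpha = 0.5 matches the initial value stated in the text.
y_hat_cls = biased_label_classification(2, [0.0, 0.0, 0.0], alpha=0.5)
y_hat_reg = biased_label_regression(4.0, 0.0, alpha=0.5, a=1.0, b=5.0)
```

Because both the one-hot label and the softmax output sum to one, the mixed classification label remains a valid probability distribution for any $\alpha$ in (0, 1), and the regression label stays inside the range (a, b) whenever the primary label does.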

3.2.3 Discussion about the Share Input Scenario

The proposed generator can generate the new feature by combining the features from the primary and auxiliary tasks. However, in the most widely studied auxiliary learning scenario [9, 1, 11], all the tasks share the same input data. We only have the training dataset $\{(x_{p,j}, y_{p,j}, y_{a_1,j}, \dots, y_{a_K,j})\}_{j=1}^{|D_p|}$. Therefore, we cannot obtain the auxiliary data $[x_{a_1,j}, \dots, x_{a_K,j}]$ from the auxiliary tasks to conduct the feature generation. To tackle this problem, we obtain the new auxiliary data by randomly sampling another data sample in the dataset, which is easy to implement by randomly shuffling the training batch to match each original sample with another sample. Specifically, for a sample $(x_{p,j}, y_{p,j}, y_{a_1,j}, \dots, y_{a_K,j})$, we consider another sample $(x_{p,j_2}, y_{p,j_2}, y_{a_1,j_2}, \dots, y_{a_K,j_2})$ in the dataset to provide new auxiliary data information. The input of the feature generator is $[f_{p,j}, f_{p,j_2}]$, which are the features extracted by the backbone with $[x_{p,j}, x_{p,j_2}]$ as input. When generating new labels, we need to combine all the existing labels of the two samples, i.e., the input of the label generator is $[y_{p,j}, y_{a_1,j}, \dots, y_{a_K,j}, y_{p,j_2}, y_{a_1,j_2}, \dots, y_{a_K,j_2}]$, and then the label generation process is the same as before. Note that there are two small details: i) During the label embedding process, $y_{a_i,j}$ and $y_{a_i,j_2}$ share the same embedding table, because they both belong to the same task. ii) For the bias term in eq. (9), we now have two labels from the primary task, $y_{p,j}$ and $y_{p,j_2}$, and eq. (9) is adjusted to:

$$\hat{y}_{g,j} = \alpha\,(\beta\, y_{p,j} + (1 - \beta)\, y_{p,j_2}) + (1 - \alpha)\, \mathrm{norm}(y_{g,j}), \tag{10}$$

where $\beta \in (0, 1)$ is a learnable parameter of the generator. So far, we have described the designs of the joint generator, which is applicable in various auxiliary learning scenarios. However, the generator has several parameters to be optimized, such as the masks and MLPs. We denote all the learnable parameters in the generator together with the task weights $w_i$ in eq. (2) as $\phi$. Next, we elaborate on how we optimize the task learning model parameters $\theta$ and the generator parameters $\phi$.
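A minimal sketch of the shared-input variant: samples are paired by shuffling the batch indices, and the two primary labels are mixed as in eq. (10). The helper names are hypothetical, the labels are assumed to be one-hot, and `normed_y_g` stands for an already-normalised output of the label generator. The text does not say whether a sample may be paired with itself; the plain shuffle below does not exclude that case.

```python
import random

def pair_by_shuffle(batch_size, rng):
    """Return perm such that sample j is paired with sample perm[j]."""
    perm = list(range(batch_size))
    rng.shuffle(perm)
    return perm

def generated_label(y_p, y_p2, normed_y_g, alpha, beta):
    """Eq. (10): alpha * (beta * y_p + (1 - beta) * y_p2) + (1 - alpha) * norm(y_g)."""
    bias = [beta * u + (1.0 - beta) * v for u, v in zip(y_p, y_p2)]
    return [alpha * b + (1.0 - alpha) * g for b, g in zip(bias, normed_y_g)]

perm = pair_by_shuffle(4, random.Random(0))
y_hat = generated_label(
    y_p=[1.0, 0.0, 0.0],         # one-hot primary label of sample j
    y_p2=[0.0, 1.0, 0.0],        # one-hot primary label of the paired sample j2
    normed_y_g=[0.2, 0.3, 0.5],  # softmax output of the label generator (toy values)
    alpha=0.5,
    beta=0.5,
)
```

With $\alpha = \beta = 0.5$ the bias contributes 0.25 to each of the two primary classes, which is consistent with the two clear peaks reported for generated CUB labels in the visualization of Section 4.2.2.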

3.3 Optimization Strategy

Bi-level optimization. In our whole framework, the task learning model parameters θ are expected to minimize the loss of all the selected and generated tasks, while the generator parameters ϕ aim to make θ achieve the best performance on the primary task. These two different goals give rise to the following bi-level optimization problem:

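$$\phi^{*} = \arg\min_{\phi} L_p(\theta^{*}(\phi); D_v), \qquad \text{s.t.}\ \ \theta^{*}(\phi) = \arg\min_{\theta} L_t(\theta, \phi), \tag{11}$$

where $L_t(\theta, \phi)$ is the objective in eq. (2), and $L_p(\theta^{*}(\phi); D_v)$ is the primary task loss of the task learning model on the validation dataset. The lower optimization is straightforward: we can directly obtain the gradient of $\theta$ as $\nabla_{\theta} L_t(\theta, \phi)$. However, in the upper optimization, $L_p(\theta^{*}(\phi); D_v)$ depends directly on $\theta^{*}$ rather than on $\phi$. Assuming that the Hessian $\nabla^{2}_{\theta} L_t(\theta^{*}(\phi), \phi)$ is positive-definite, we can use the implicit function theorem to obtain the implicit gradient $\nabla_{\phi} L_p(\theta^{*}(\phi); D_v)$,

$$\nabla_{\phi} L_p(\theta^{*}(\phi); D_v) = -\nabla_{\theta} L_p \cdot (\nabla^{2}_{\theta} L_t)^{-1} \cdot \nabla_{\phi} \nabla_{\theta} L_t \,\Big|_{(\phi,\, \theta^{*}(\phi))}. \tag{12}$$

Detailed derivation can be found in Appendix 1. Since the inverse of the Hessian is often intractable, we follow [24] and approximate it with a truncated Neumann series, $(\nabla^{2}_{\theta} L_t)^{-1} \approx \sum_{i=0}^{n} (I - \nabla^{2}_{\theta} L_t)^{i}$, where $n$ is fixed to 3 in all our experiments. Thus, the complete implicit gradient is approximated as:

$$\nabla_{\phi} L_p(\theta^{*}(\phi); D_v) \approx -\nabla_{\theta} L_p \cdot \sum_{i=0}^{n} (I - \nabla^{2}_{\theta} L_t)^{i} \cdot \nabla_{\phi} \nabla_{\theta} L_t. \tag{13}$$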
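The truncated Neumann approximation can be checked on a tiny explicit example. The sketch below builds the Hessian as a dense 2x2 matrix purely for illustration; in practice one would use Hessian-vector products instead of forming the matrix, and the series only converges when the spectral norm of $I - \nabla^{2}_{\theta} L_t$ is below one. All names and numbers are assumptions, not taken from the paper's code.

```python
def mat_vec(H, v):
    """Dense matrix-vector product."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def neumann_inverse_hvp(H, v, n=3):
    """Approximate H^{-1} v by sum_{i=0}^{n} (I - H)^i v."""
    term = list(v)   # (I - H)^0 v
    total = list(v)
    for _ in range(n):
        Hv = mat_vec(H, term)
        term = [t - h for t, h in zip(term, Hv)]  # apply (I - H) once more
        total = [s + t for s, t in zip(total, term)]
    return total

def implicit_gradient(grad_theta_lp, H, mixed, n=3):
    """Approximate implicit gradient: -grad^T * sum_i (I - H)^i * mixed.

    mixed[r][c] holds d^2 L_t / (d theta_r d phi_c). Using H^{-1} grad in place of
    grad^T H^{-1} is valid because the Hessian is symmetric.
    """
    q = neumann_inverse_hvp(H, grad_theta_lp, n)
    num_phi = len(mixed[0])
    return [-sum(q[r] * mixed[r][c] for r in range(len(q))) for c in range(num_phi)]

# Toy symmetric positive-definite Hessian with eigenvalues inside (0, 2).
H = [[0.9, 0.1], [0.1, 0.8]]
v = [1.0, 2.0]                      # stands in for grad_theta L_p
mixed = [[1.0], [0.0]]              # a single generator parameter
approx = neumann_inverse_hvp(H, v, n=3)
hyper_grad = implicit_gradient(v, H, mixed, n=3)
```

Here the exact solution is $H^{-1}v = (0.6/0.71,\ 1.7/0.71) \approx (0.845, 2.394)$, and three Neumann terms already give $(0.849, 2.388)$, which suggests why a small $n$ can suffice when the Hessian is well conditioned and close to the identity.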

Instance regularization. In the upper optimization, we additionally introduce an instance regularization on the parameterized label $y_{g,j}$ in eq. (8), as follows, to prevent generation mode collapse:

$$L_{reg}(\phi; D_g) = \begin{cases} \sum_{j=1}^{|D_g|} \sum_{j' \neq j} \cos(y_{g,j}, y_{g,j'}) & \text{categorical} \\ -\sum_{j=1}^{|D_g|} \tilde{y}_{g,j} \log(\tilde{y}_{g,j}) & \text{numerical} \end{cases} \tag{14}$$

If the generated label is categorical, we expect the cosine similarity between different generated labels to be small. This regularization means that we expect each generated label to keep its instance-level uniqueness, preventing the generated labels of all the generated features from being the same. Similarly, if the generated label is numerical, since it is one-dimensional, we use the entropy regularization to achieve this goal, where $\tilde{y}_{g,j} = e^{y_{g,j}} / \sum_{j'=1}^{|D_g|} e^{y_{g,j'}}$. Then in the upper optimization, the gradient of $\phi$ is the sum of the implicit gradient and the explicit gradient from the regularization:

$$\nabla_{\phi} = \nabla_{\phi} L_p(\theta^{*}(\phi); D_v) + \nabla_{\phi} L_{reg}(\phi; D_g). \tag{15}$$
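The two branches of the instance regularization can be sketched as below. This is an illustrative reading rather than the authors' code: the function names are hypothetical, and the numerical branch is assumed to be the entropy of the batch-wise softmax of the generated labels, which is largest when all labels coincide and therefore penalises collapse when minimised.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def reg_categorical(labels):
    """Sum of pairwise cosine similarities over all ordered pairs with j != j'."""
    total = 0.0
    for j, u in enumerate(labels):
        for j2, v in enumerate(labels):
            if j != j2:
                total += cosine(u, v)
    return total

def reg_numerical(values):
    """Entropy of the softmax taken across the batch of scalar labels (assumed sign)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

collapsed = reg_categorical([[1.0, 0.0], [1.0, 0.0]])   # identical labels
distinct = reg_categorical([[1.0, 0.0], [0.0, 1.0]])    # orthogonal labels
flat = reg_numerical([1.0, 1.0, 1.0])                   # identical scalars
spread = reg_numerical([0.0, 5.0, 10.0])                # well-separated scalars
```

In both branches the collapsed batch receives the larger penalty, so adding the gradient of this term to the implicit gradient pushes the generator away from producing the same label for every generated feature.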

Now that the gradients of $\theta$ and $\phi$ are obtained, we follow previous works [1, 11] and update $\theta$ and $\phi$ alternately. The update loop continues until convergence, where in each loop we first update $\theta$ $N$ times and then update $\phi$ using the upper gradient in eq. (15), as shown in Figure 1. $N$ is the interval between two upper updates. We summarize the complete algorithm in Appendix 2, where we do not require an additional validation dataset $D_v$ to calculate the upper objective but reuse the primary training set $D_p$, as done in [13, 14].
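The alternating schedule can be illustrated on a scalar toy problem, which is an assumption made purely for exposition and unrelated to the paper's experiments. Take $L_t(\theta, \phi) = \tfrac{h}{2}(\theta - \phi)^2$ and $L_p(\theta) = \tfrac{1}{2}(\theta - 1)^2$, so that $\theta^{*}(\phi) = \phi$ and the upper-level optimum is $\phi = 1$. The learning rates, $h$, and the step counts are hypothetical, and the regularization gradient is omitted.

```python
def lower_grad(theta, phi, h=0.5):
    """Gradient of L_t = h/2 * (theta - phi)^2 with respect to theta."""
    return h * (theta - phi)

def upper_grad(theta, h=0.5, n=3):
    """Implicit gradient with a truncated Neumann series (scalar case).

    grad_theta L_p = theta - 1, Hessian = h, mixed derivative = -h.
    """
    inv_h = sum((1.0 - h) ** i for i in range(n + 1))  # approximates 1 / h
    return -(theta - 1.0) * inv_h * (-h)

def train(outer_steps=200, N=5, lr_theta=0.5, lr_phi=0.1):
    """Alternate N lower updates of theta with one upper update of phi."""
    theta, phi = 0.0, -2.0
    for _ in range(outer_steps):
        for _ in range(N):
            theta -= lr_theta * lower_grad(theta, phi)
        phi -= lr_phi * upper_grad(theta)
    return theta, phi

theta_final, phi_final = train()
```

Even though the Neumann series with $n = 3$ underestimates $1/h$ (1.875 instead of 2), the approximate gradient keeps the correct sign, and both variables still approach the optimum at 1 in this toy setting.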

4 Experiments

4.1 Experimental Setup

Task and Dataset. We conduct our experiments in two scenarios to validate the generalization ability of our method. One is the most widely studied scenario in previous auxiliary learning works, where the auxiliary tasks share the same input as the primary task (Share Input). The other is the scenario where the inputs of the primary and auxiliary tasks are different (Different Input). The Share Input setting covers three cases. (i) CUB [25]: we follow previous works [1, 11] and use bird visual attribute classification (e.g., whether the bird has a white belly) to help the bird species classification primary task, where both the auxiliary and primary labels are categorical. (ii) CIFAR100 [26]: this is a widely adopted image classification dataset with 100 categories in total. Additionally, each image has a coarse class, e.g., a car belongs to the "vehicles 1" coarse class. We use the coarse classification as the auxiliary task to help the 100-class primary task. (iii) Besides classification, we also consider regression, where we follow previous works [11, 12] and regard rating prediction in recommender systems as the primary task and CTR prediction as the auxiliary task. The primary task is regression and the auxiliary task is binary classification. We choose the widely used Amazon Toys and Movies [27] datasets, where we use the user ID, item ID and item category as the input data. The Different Input setting covers two cases. (i) CIFAR10-100: the primary task is the CIFAR10 classification problem, and the auxiliary task is the CIFAR100 classification problem. (ii) Pet-CUB: the primary task is fine-grained pet classification on the Pet [28] dataset, while the auxiliary task is bird species classification on the previously mentioned CUB dataset. More detailed data information is presented in Appendix 4.

Baselines. We compare with SOTA auxiliary learning methods, including the reweighing methods and the auxiliary task generation method. Single task learning (STL) is a natural baseline where we only train on the primary task. Equal is a baseline where we assign equal weights of 1.0 to all the tasks. Uncert [18] is a dynamic weighting method for multi-task learning based on task uncertainty. GCS [9] and AuxL [1] dynamically reweigh the auxiliary losses, and JTDS [11] reweighs not only the tasks but also each data sample within each task. MAXL [2] is a method that automatically generates a fine-grained auxiliary task for the primary classification task. We provide detailed differences between our work and the baselines in Appendix 3.1.

Implementation details. On the CUB dataset, we adopt ResNet18 [29] and ResNet50 as our backbones, respectively. On CIFAR100, we adopt ResNet18 and a 4-layer ConvNet composed of convolution, batch normalization and ReLU layers as our backbones, respectively. On Amazon Toys and Movies, we adopt AutoINT [30] as the backbone. On CIFAR10-100, the backbone is the 4-layer ConvNet, and on Pet-CUB the backbone is ResNet18. For the head of each task, we adopt a multi-layer perceptron (MLP) whose number of layers is searched from {1, 2}. In the generator, the embedding dimension $m_n$ is searched from {32, 64}, and the number of layers of the MLP is searched from {2, 3, 4}. More details are presented in Appendix 4.

4.2 Experimental Results

4.2.1 Method Performance

Main results. Table 1 presents the overall experimental results, showing that our proposed method consistently outperforms existing methods across the diverse auxiliary learning scenarios. When training on the CUB and Pet-CUB datasets, unlike [11], which does not use a learning rate scheduler, we found that applying a learning rate scheduler to all the baselines yields better or at least on-par performance. Therefore, we report the results with the scheduler. From the results, we have the following observations. (i) As expected, our proposed method brings benefits for the primary task when the original manually collected auxiliary data and task are unhelpful. For example, in three scenarios (CUB with the ResNet50 backbone, Movies, and Pet-CUB), all the methods that utilize auxiliary tasks perform worse than the STL baseline, indicating that the manually chosen auxiliary data and tasks are not beneficial to the primary task. This phenomenon is also observed in previous works [1, 12]. However, our proposed method can utilize the information in the originally harmful auxiliary task to generate new and beneficial auxiliary data and task, so that the model performance can be significantly improved. The ability of our method to convert originally harmful auxiliary labels into beneficial forms is further validated in Appendix 5.2. Although MAXL generates a new fine-grained classification task for the primary task, it brings no improvement in these settings. This is possibly because it requires a hierarchical dataset structure in the primary task. However, in the CUB and Pet-CUB settings, the primary task itself is a fine-grained image classification problem, and a task that is both more fine-grained and beneficial is hard to generate automatically. Additionally, MAXL does not consider generating new data. (ii) When the original auxiliary task is beneficial to the primary task, our proposed method inherits these benefits and can bring further improvement. For example, in the CIFAR100 setting with the ResNet18 backbone and in the CIFAR10-100 setting, almost all the methods that utilize the auxiliary information outperform the STL baseline, and our proposed method outperforms all the existing auxiliary methods. (iii) Whether the auxiliary data and task are beneficial to the primary task depends on the backbone we choose. On the CIFAR100 dataset, when we use the ConvNet backbone, the original auxiliary data and tasks are harmful because all the existing auxiliary methods perform worse than STL, but when we use ResNet18 as the backbone, the auxiliary data and tasks are beneficial.

Table 1: Overall performance. We report the results over 5 random seeds. Note that JTDS cannot handle the different input scenarios and MAXL cannot generate auxiliary tasks for the primary regression task. The metric for classification is accuracy (Acc) and the metric for the rating regression in recommendation is RMSE. Higher accuracy and lower RMSE indicate better results.

| Method | CUB, ResNet18, Acc(%) | CUB, ResNet50, Acc(%) | CIFAR100, ConvNet, Acc(%) | CIFAR100, ResNet18, Acc(%) | Toys, AutoINT, RMSE | Movies, AutoINT, RMSE | CIFAR10-100, ConvNet, Acc(%) | Pet-CUB, ResNet18, Acc(%) |
|---|---|---|---|---|---|---|---|---|
| STL | 76.16±0.70 | 80.46±0.42 | 50.35±0.50 | 55.79±0.22 | 0.9188±0.0005 | 1.0456±0.0008 | 79.35±0.41 | 69.47±0.16 |
| Equal | 74.25±0.17 | 78.28±0.23 | 49.56±0.31 | 56.42±0.05 | 0.9213±0.0004 | 1.0459±0.0009 | 78.99±0.42 | 67.15±0.41 |
| Uncert | 73.94±0.20 | 77.03±0.24 | 48.90±0.10 | 56.95±0.31 | 0.9171±0.0009 | 1.0495±0.0018 | 80.13±0.17 | 63.58±0.13 |
| GCS | 74.11±0.29 | 78.54±0.54 | 49.26±0.35 | 56.57±0.21 | 0.9224±0.0003 | 1.0459±0.0018 | 79.59±0.52 | 67.31±0.89 |
| AuxL | 75.39±0.59 | 78.00±0.55 | 49.39±0.82 | 57.14±0.25 | 0.9186±0.0005 | 1.0483±0.0019 | 79.69±0.41 | 66.17±0.37 |
| JTDS | 76.50±0.47 | 79.34±0.19 | 49.00±0.33 | 57.28±0.20 | 0.9187±0.0004 | 1.0504±0.0023 | - | - |
| MAXL | 75.79±0.45 | 78.48±0.88 | 49.71±0.37 | 56.32±0.23 | - | - | 80.02±0.51 | 68.48±0.85 |
| ours | 77.75±0.27 | 81.73±0.20 | 50.94±0.05 | 57.84±0.20 | 0.9153±0.0004 | 1.0426±0.0009 | 80.64±0.12 | 70.48±0.28 |

Table 2: Acc(%) on CUB few-shot experiments with ResNet50.

| Method | STL | Equal | Uncert | GCS | AuxL | JTDS | MAXL | ours |
|---|---|---|---|---|---|---|---|---|
| 5 shot | 44.99±1.15 | 45.37±0.61 | 45.63±0.08 | 45.68±0.67 | 45.07±1.31 | 46.13±0.89 | 45.47±1.06 | 52.33±1.36 |
| 10 shot | 63.25±0.68 | 60.58±0.81 | 59.30±0.55 | 60.81±0.74 | 63.79±0.11 | 63.39±0.18 | 61.70±0.10 | 67.68±0.33 |

Few-shot results. Since previous works [1, 11] indicate that auxiliary learning is more effective when the data of the primary task is inadequate, we conduct experiments on CUB where we use only 5 or 10 shots of each category for the primary task; the results are shown in Table 2. We can see that when the primary task has only 5 shots, all methods that utilize the auxiliary task perform better than STL, whereas in Table 1, when the primary task has full data, the auxiliary data and task are harmful. This indicates that whether the auxiliary information is beneficial depends on the scale of the primary dataset. Among all the auxiliary learning methods, our proposed method brings the most significant improvement. Compared to the full-dataset setting, the improvement of our method becomes more significant in the few-shot setting. The improvement largely results from the fact that we generate not only new tasks but also new data. The effectiveness of generating new data is further validated in Appendix 5.3.
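The few-shot protocol described above (keeping only a fixed number of shots per category for the primary task) can be sketched as follows. This is a minimal illustration and not the authors' data pipeline: the function name `sample_k_shot`, the seed handling, and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed=0):
    """Return sorted example indices, keeping at most k examples per category."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)        # pick a random subset within the category
        chosen.extend(indices[:k])  # categories with fewer than k examples keep all
    return sorted(chosen)


# Hypothetical toy labels: three categories with 4, 6 and 2 examples.
toy_labels = [0] * 4 + [1] * 6 + [2] * 2
subset = sample_k_shot(toy_labels, k=5)
```

With `k=5`, the second toy category is cut down to five examples while the two smaller categories are kept whole; the auxiliary data would be left untouched in this kind of protocol.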

4.2.2 Ablation Studies

To further understand how our proposed method works, we provide the following ablation studies.

Table 3: Generation visualization. The first two rows are from Pet-CUB and the last two rows are from CUB. The first and second columns are the images from the primary and auxiliary tasks used to generate the new feature and label. The third column is the generated label, where the x-axis represents the category and the y-axis represents the probability of the generated feature belonging to this category. In the fourth column, we visualize the images with maximum probabilities (largest peaks) in the generated label; besides the image from the peak guided by the bias term, we visualize images from the two largest of the small peaks.

Columns: Image-P | Image-Aux | G-Label | Max-Prob Images

Generation visualization. We visualize the generation process in Table 3. The first two rows are from the Pet-CUB dataset in the Different Input setting, and the last two rows are from the CUB dataset in the Share Input setting. In Pet-CUB, the generated label has one obvious peak from the label bias term, which represents the original category of the primary image, along with some other small peaks. Specifically, in the first row, when our feature generator combines a white cat and a yellow-and-black bird, the label generator assigns the generated feature the maximum probability of being the original white cat category, the second largest probability of being the yellow-and-black cat category, and then a similarly colored dog. The pattern in the second row is similar: the texture and color of the bird are combined with the black cat, so the label generator assigns the generated feature first to the original black cat, then to the brown cat with stripes from the bird, and then to a similarly colored dog. In CUB, the generated label has two obvious peaks, which represent the two categories of the primary and auxiliary images. There are also some other small peaks, and the images from these small peaks are similar to the primary and auxiliary images, so the peaks have reasonable semantic meanings.
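Reading off the largest peaks of a generated label, as done for the fourth column of Table 3, amounts to ranking categories by probability. The sketch below is only illustrative: the helper `top_peaks` and the probability vector are hypothetical and are not taken from the paper's code or results.

```python
def top_peaks(probs, k=3):
    """Return the k (category, probability) pairs with the largest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical generated label over six categories (sums to 1).
# Category 1 plays the role of the dominant peak; the rest are small peaks.
generated_label = [0.02, 0.70, 0.03, 0.15, 0.08, 0.02]

# Split the dominant peak from the two largest small peaks.
dominant, *small_peaks = top_peaks(generated_label, k=3)
```

Here `dominant` holds the category with the highest probability, and `small_peaks` holds the next two categories, whose images would be the ones shown alongside it.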

Effectiveness of the framework design. Our whole framework contains the generator and the optimization strategy. We validate their effectiveness in Table 4. To make the generator expressive, we incorporate nonlinear terms in the generator, and we also incorporate the label bias term in the label generator. To optimize the generator, we propose the bi-level optimization strategy with instance regularization in the upper update. We remove each of these components in turn and observe performance degradation, where w/o F-nonlinear, w/o L-nonlinear, w/o L-bias, w/o bi-level and w/o instance reg respectively represent the variants where we remove the feature nonlinear term, the label nonlinear term, the label bias term, the overall upper update, and the instance regularization. We find that the nonlinear terms indeed help to generate more beneficial data and tasks by making the generator more expressive. The instance regularization is also very important in the bi-level optimization: without it, we observe mode collapse in label generation, as shown in Appendix 5.4.

Table 4: Effectiveness of framework designs. The first three rows explore the effectiveness of the generator components, while the rest show the effectiveness of the training strategies.

| Dataset | CUB | Pet-CUB | Toys |
|---|---|---|---|
| Metric | Acc(%) | Acc(%) | RMSE |
| Backbone | ResNet50 | ResNet18 | AutoINT |
| w/o F-nonlinear | 81.22±0.27 | 67.76±0.17 | 0.9195±0.0021 |
| w/o L-nonlinear | 80.96±0.37 | 66.58±1.12 | 0.9213±0.0026 |
| w/o L-bias | 79.28±0.34 | 69.36±0.37 | 0.9204±0.0011 |
| w/o bi-level | 80.32±0.42 | 68.53±0.49 | 0.9163±0.0008 |
| w/o instance reg | 80.44±0.34 | 67.53±0.95 | 0.9174±0.0012 |
| complete | 81.73±0.20 | 70.48±0.28 | 0.9153±0.0004 |

Comparison with MixUp. During the generation visualization, we find that in the CUB setting, the generation process behaves like MixUp, where the inputs are combined linearly and the labels are also combined linearly. Therefore, we compare our proposed method with different MixUp methods in Table 5. MixUp [31] means we directly utilize the data and labels generated by MixUp as auxiliary tasks. F-MixUp [32] means the linear combination happens not in the input space but in the manifold feature space. Auto-Mix/Auto-F-Mix means the linear weight of MixUp/F-MixUp is learned automatically with the bi-level optimization, and the task weights are also learnable, i.e., learnable MixUp/F-MixUp replaces our joint generator. We can see that our proposed method outperforms these MixUp methods. The reason is that, as shown in Table 3, the labels generated by our method contain not only the two peaks produced by MixUp but also other semantically reasonable small peaks. Compared to MixUp, the advantages of our proposed method are three-fold. (i) Our method can combine information not only from labels of the same task but also from labels of other tasks. (ii) Our method can model not only linear relations in generation but also non-linear relations. (iii) Our method can handle the Different Input scenario, while MixUp cannot.
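For reference, the linear combination that MixUp [31] performs on inputs and labels can be sketched as below. This is a generic illustration of the MixUp interpolation, not the baseline implementation used in the experiments; the helper names `mixup` and `one_hot`, the fixed weight `lam`, and the toy vectors are assumptions. In the original MixUp formulation the weight is typically sampled from a Beta distribution, and F-MixUp [32] applies the same interpolation to hidden features instead of raw inputs.

```python
def one_hot(index, num_classes):
    """Build a one-hot label vector as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def mixup(x_a, x_b, y_a, y_b, lam):
    """Linearly interpolate two inputs and their labels with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Toy example: two 2-dimensional inputs from categories 0 and 2 of 3.
x, y = mixup([1.0, 0.0], [0.0, 2.0], one_hot(0, 3), one_hot(2, 3), lam=0.75)
```

The mixed label `y` has exactly two non-zero entries, one for each source category, which corresponds to the two peaks attributed to MixUp above; a label produced this way cannot place mass on any third category.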

To further show the advantage of our proposed generation over MixUp, we apply Auto-F-Mix on top of auxiliary learning methods (AuxL, JTDS, MAXL), and the experimental results are presented in Table 6. The results show that neither combining auxiliary weighting methods (AuxL, JTDS) with Auto-F-Mix nor combining the current auxiliary generation method (MAXL) with Auto-F-Mix is as effective as our method, indicating the advantage of our proposed joint generation.

Table 5: Comparison with MixUp methods. Note that MixUp cannot be applied to the recommendation scenario, where the inputs are categorical features.

| Dataset | CUB | Movies |
|---|---|---|
| Metric | Acc(%) | RMSE |
| Backbone | ResNet18 | AutoINT |
| MixUp | 75.88±0.50 | - |
| Auto-Mix | 76.53±0.36 | - |
| F-MixUp | 75.33±0.67 | 1.0460±0.0023 |
| Auto-F-Mix | 76.88±0.52 | 1.0464±0.0003 |
| ours | 77.75±0.27 | 1.0426±0.0009 |

Table 6: Comparison to the combination of Auto-F-Mix and current auxiliary learning methods. We denote Auto-F-Mix as AFM.

| Dataset | CUB | CUB-5shot | Toys |
|---|---|---|---|
| Metric | Acc(%) | Acc(%) | RMSE |
| Backbone | ResNet50 | ResNet50 | AutoINT |
| AFM | 80.30±0.57 | 46.97±0.95 | 0.9187±0.0003 |
| AuxL+AFM | 79.70±1.07 | 47.84±1.25 | 0.9195±0.0011 |
| MAXL+AFM | 80.76±0.38 | 47.96±0.89 | - |
| JTDS+AFM | 80.52±0.90 | 48.14±0.87 | 0.9189±0.0013 |
| ours | 81.73±0.20 | 52.33±1.36 | 0.9153±0.0004 |

5 Conclusion

In this paper, we propose to jointly generate beneficial auxiliary data and tasks for auxiliary learning, so that the primary task can still benefit when the manually collected auxiliary data and tasks are unhelpful. We propose the DTG-AuxL framework with a joint generator and a bi-level optimization strategy, which can be applied in various auxiliary learning scenarios. Future work, such as designing more adaptive generators and more efficient bi-level optimization algorithms, can further improve the generation.

Acknowledgement

This work was supported by the National Key Research and Development Program of China No. 2020AAA0106300, National Natural Science Foundation of China (No. 62222209, 62250008, 62102222), Beijing National Research Center for Information Science and Technology under Grant No. BNR2023RC01003, BNR2023TD03006, and Beijing Key Lab of Networked Multimedia.

References

[1] Aviv Navon, Idan Achituve, Haggai Maron, Gal Chechik, and Ethan Fetaya. Auxiliary learning by implicit differentiation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.

[2] Shikun Liu, Andrew J. Davison, and Edward Johns. Self-supervised generalisation with meta auxiliary learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 1677–1687, 2019.

[3] Lucas Beyer, Xiaohua Zhai, Avital Oliver, and Alexander Kolesnikov. S4L: self-supervised semi-supervised learning. In ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 1476–1485, 2019.

[4] Hong Wen, Jing Zhang, Yuan Wang, Fuyu Lv, Wentian Bao, Quan Lin, and Keping Yang. Entire space multi-task modeling via post-click behavior decomposition for conversion rate prediction. In SIGIR 2020, Virtual Event, China, July 25-30, 2020, pages 2377–2386, 2020.

[5] Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward: Self-supervision for reinforcement learning. In ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, 2017.

[6] Xingyu Lin, Harjatin Singh Baweja, George Kantor, and David Held. Adaptive auxiliary task weighting for reinforcement learning. In NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 4773–4784, 2019.

[7] Yabin Zhang, Hui Tang, and Kui Jia. Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, pages 241–256, 2018.

[8] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European conference on computer vision, pages 94–108, 2014.

[9] Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. CoRR, 2018.

[10] Baifeng Shi, Judy Hoffman, Kate Saenko, Trevor Darrell, and Huijuan Xu. Auxiliary task reweighting for minimum-data learning. In NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[11] Hong Chen, Xin Wang, Chaoyu Guan, Yue Liu, and Wenwu Zhu. Auxiliary learning with joint task and data scheduling. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, pages 3634–3647, 2022.

[12] Hong Chen, Xin Wang, Yue Liu, Yuwei Zhou, Chaoyu Guan, and Wenwu Zhu. Module-aware optimization for auxiliary learning. In Advances in Neural Information Processing Systems, 2022.

[13] Shikun Liu, Stephen James, Andrew J Davison, and Edward Johns. Auto-lambda: Disentangling dynamic task relationships. arXiv preprint arXiv:2202.03091, 2022.

[14] Hong Chen, Xin Wang, Ruobing Xie, Yuwei Zhou, and Wenwu Zhu. Cross-domain recommendation with behavioral importance perception. In Proceedings of the ACM Web Conference 2023, pages 1294–1304, 2023.

[15] Michael Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796, 2020.

[16] Simon Vandenhende, Stamatios Georgoulis, Bert De Brabandere, and Luc Van Gool. Branched multi-task networks: deciding what layers to share. arXiv preprint arXiv:1904.02920, 2019.

[17] Pengsheng Guo, Chen-Yu Lee, and Daniel Ulbricht. Learning to branch for multi-task learning. In International Conference on Machine Learning, pages 3854–3863. PMLR, 2020.

[18] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7482–7491, 2018.

[19] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML 2018, pages 793–802. PMLR, 2018.

[20] Jie Song, Yixin Chen, Xinchao Wang, Chengchao Shen, and Mingli Song. Deep model transferability from attribution maps. Advances in Neural Information Processing Systems, 32, 2019.

[21] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[22] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016.

[23] Huifeng Guo, Bo Chen, Ruiming Tang, Weinan Zhang, Zhenguo Li, and Xiuqiang He. An embedding learning framework for numerical features in ctr prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2910–2918, 2021.

[24] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], Proceedings of Machine Learning Research, pages 1540–1552. PMLR, 2020.

[25] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Handbook of Systemic Autoimmune Diseases, 1(4), 2009.

[27] Ruining He and Julian J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016, pages 507–517. ACM, 2016.

[28] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012.

[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016.

[30] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pages 1161–1170. ACM, 2019.

[31] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[32] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019.