# Active Multi-Task Representation Learning

Yifang Chen, Simon S. Du, Kevin Jamieson
Paul G. Allen School of Computer Science & Engineering, University of Washington

Abstract: To leverage the power of big data from source tasks and overcome the scarcity of target task samples, representation learning based on multi-task pretraining has become a standard approach in many applications. However, up until now, choosing which source tasks to include in the multi-task learning has been more art than science. In this paper, we give the first formal study of source task sampling by leveraging techniques from active learning. We propose an algorithm that iteratively estimates the relevance of each source task to the target task and samples from each source task based on the estimated relevance. Theoretically, we show that for the linear representation class, to achieve the same error rate, our algorithm can save up to a factor of the number of source tasks in the source task sample complexity, compared with naive uniform sampling from all source tasks. We also provide experiments on real-world computer vision datasets to illustrate the effectiveness of our proposed method on both linear and convolutional neural network representation classes. We believe our paper serves as an important initial step in bringing techniques from active learning to representation learning.

1. Introduction

Much of the success of deep learning is due to its ability to efficiently learn a map from high-dimensional, highly structured input like natural images into a dense, relatively low-dimensional representation that captures the semantic information of the input. Multi-task learning leverages the observation that similar tasks may share a common representation: it trains a single representation to overcome a scarcity of data for any one task. In particular, given only a small amount of data for a target task but copious amounts of data from source tasks, the source tasks can be used to learn a high-quality low-dimensional representation, and the target task just needs to learn the map from this low-dimensional representation to its target-specific output. This paradigm has been used with great success in natural language processing, e.g., GPT-2 (Radford et al.), GPT-3 (Brown et al., 2020), and BERT (Devlin et al., 2018), as well as in vision, e.g., CLIP (Radford et al., 2021).

This paper makes the observation that not all tasks are equally helpful for learning a representation, and a priori, it can be unclear which tasks will be best suited to maximize performance on the target task. For example, modern datasets like CIFAR-10, ImageNet, and the CLIP dataset were created using a list of search terms and a variety of different sources like search engines, news websites, and Wikipedia (Krizhevsky, 2009; Deng et al., 2009; Radford et al., 2021). Even if more data always leads to better performance, practicalities demand some finite limit on the size of the dataset that will be used for training. Up until now, choosing which source tasks to include in multi-task learning has been an ad hoc process and more art than science.
In this paper, we aim to formalize the process of prioritizing source tasks for representation learning by formulating it as an active learning problem. Specifically, we aim to achieve a target accuracy on a target task by requesting as little total data from source tasks as possible. For example, if a target task were to generate captions for images in a particular domain where few examples existed, each source task could be represented as a search term into Wikipedia from which (image, caption) pairs are returned. By sampling moderate numbers of (image, caption) pairs resulting from each search term (task), we can determine which tasks result in the best performance on the target task and increase the rate at which examples from those terms are sampled. By quickly identifying which source tasks are useful for the target task and sampling only from those, we can reduce the overall number of examples to train over, potentially saving time and money. Moreover, prioritizing relevant tasks in training, in contrast to weighting them uniformly, even has the potential to improve performance, as demonstrated in (Chen et al., 2021).

From the theoretical perspective, Tripuraneni et al. (2020; 2021) and Du et al. (2020) study few-shot learning via multi-task representation learning and give generalization guarantees showing that such representation learning can largely reduce the target sample complexity. But all those works only consider uniform sampling from each source task, and thus establish their proofs based on benign diversity assumptions on the source tasks as well as some common assumptions relating the target and source tasks. In this paper, we initiate the systematic study of using active learning to sample from source tasks. We aim to achieve the following two goals:

1. If there is a fixed budget on the source task data to use during training, we would like to select sources that maximize the accuracy on the target task relative to naive uniform sampling from all source tasks. Equivalently, to achieve a given error rate, we want to reduce the amount of required source data. In this way, we can reduce the computation, because training complexity generally scales with the amount of data used, especially when the user has limited computing resources (e.g., a finite number of GPUs).

2. Given a target task, we want to output a relevance score for each source task, which can be useful in at least two aspects. First, the scores suggest which source tasks are helpful for the target task and inform future task or feature selection (sometimes the task itself can be regarded as some latent feature). Second, the scores help the user decide which tasks to sample more from, in order to further improve the target task accuracy.

1.1. Our contributions

In our paper, given a single target task and $M$ source tasks, we propose a novel quantity $\nu^* \in \mathbb{R}^M$ that characterizes the relevance of each source task to the target task (cf. Definition 3.1). We design an active learning algorithm which can take any representation function class as input. The algorithm iteratively estimates $\nu^*$ and samples data from each source task based on the estimated $\nu^*$. The specific contributions are summarized below:

- In Section 3, we give the definition of $\nu^*$.
- As a warm-up, we prove that when the representation function class is linear and $\nu^*$ is known, if we sample data from the source tasks according to the given $\nu^*$, the sample complexity of the source tasks scales with the sparsity of $\nu^* \in \mathbb{R}^M$ (the $m$-th task is relevant if $\nu^*(m) \neq 0$). This can save up to a factor of $M$, the number of source tasks, compared with naive uniform sampling from all source tasks.

- In Section 4, we drop the assumption of knowing $\nu^*$ and describe our active learning algorithm that iteratively samples examples from tasks to estimate $\nu^*$ from data. We prove that when the representation function class is linear, our algorithm never performs worse than uniform sampling, and achieves a sample complexity nearly as good as when $\nu^*$ is known. The key technical innovation here is a trade-off on less related source tasks between saving sample complexity and collecting sufficiently informative data for estimating $\nu^*$.

- In Section 5, we empirically demonstrate the effectiveness of our active learning algorithm by testing it on the corrupted MNIST dataset with both linear and convolutional neural network (CNN) representation function classes. The experiments show our algorithm gains substantial improvements over the non-adaptive algorithm on both models. Furthermore, we also observe that our algorithm generally outputs higher relevance scores for source tasks that are semantically similar to the target task.

1.2. Related work

There are many existing works on provable non-adaptive representation learning under various assumptions. Tripuraneni et al. (2020; 2021); Du et al. (2020); Thekumparampil et al. (2021); Collins et al. (2021); Xu & Tewari (2021) assume there exists an underlying representation shared across all tasks. (Notice that some works focus on learning a representation function for any possible target task, instead of learning a model for a specific target task as is the case in our work.) In particular, Tripuraneni et al. (2020) and Thekumparampil et al. (2021) assume a low-dimensional linear representation; furthermore, they assume the covariance matrix of all input features is the identity and the linear representation model is orthonormal. Du et al. (2020) and Collins et al. (2021) study a similar setting but lift the identity-covariance and orthonormality assumptions. Both lines of work obtain similar conclusions. We will discuss our results in the context of these two settings in Section 2. Going beyond the linear representation, Du et al. (2020) generalize their bound to a 2-layer ReLU network, and Tripuraneni et al. (2021) further consider general representation classes with linear predictors. More recent work has studied fine-tuning in both theoretical and empirical contexts (Shachaf et al., 2021; Chua et al., 2021; Chen et al., 2021). We leave extending our theoretical analysis to more general representation function classes as future work.

Beyond the generalization perspective, Tripuraneni et al. (2021); Thekumparampil et al. (2021); Collins et al. (2021) propose computationally efficient algorithms for solving the non-convex empirical risk minimization problem that arises in representation learning, including the method-of-moments (MoM) algorithm and alternating minimization. Incorporating these efficient algorithms into our framework would also be a possible direction in the future.

Chen et al. (2021) also consider learning a weighting over tasks.
However, their motivation is quite different from ours: they work under the hypothesis that some tasks are not only irrelevant, but even harmful to include in the training of a representation. Thus, during training they aim to down-weight potentially harmful source tasks and up-weight those source tasks most relevant to the target task. The critical difference between their work and ours is that they assume a pass over the complete datasets from all tasks is feasible, whereas we assume it is not (e.g., where each task is represented by a search term to Wikipedia or Google). In our paper, their setting would amount to being able to solve for $\nu^*$ for free, the equivalent of the known-$\nu^*$ setting of our warm-up section. Our main contribution, however, is an active learning algorithm that ideally only looks at a vanishing fraction of the data from all the sources to train a representation.

There exist some empirical multi-task representation learning and transfer learning works with motivations similar to ours. For example, Yao et al. (2021) use a heuristic retriever method to select a subset of target-related NLP source tasks and show that training on a small subset of source tasks can achieve similar performance as large-scale training. Zamir et al. (2018); Devlin et al. (2018) propose a transfer learning algorithm based on learning the underlying structure among visual tasks, which they call Taskonomy, and obtain substantial experimental improvements.

Many classification, regression, and even optimization tasks may fall under the umbrella term active learning (Settles, 2009). We use it in this paper to emphasize that, a priori, it is unknown which source tasks are relevant to the target task. We overcome this challenge by iterating the closed-loop learning paradigm of 1) collecting a small amount of data, 2) making inferences about task relevance, and 3) leveraging these inferences to return to 1) with a more informed strategy for data collection.

2. Preliminaries

In this section, we formally describe our problem setup, which will be helpful for our theoretical development.

Problem setup. Suppose we have $M$ source tasks and one target task, which we denote as task $M+1$. Each task $m \in [M+1]$ is associated with a joint distribution $\mu_m$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the input space and $\mathcal{Y} \subseteq \mathbb{R}$ is the output space. We assume there exists an underlying representation function $\phi^* : \mathcal{X} \to \mathcal{Z}$ that maps the input to some feature space $\mathcal{Z} \subseteq \mathbb{R}^K$, where $K \ll d$. We restrict the representation function to be in some function class $\Phi$, e.g., linear functions, convolutional nets, etc. We also assume the predictor is a linear mapping from feature space to output space, represented for each task by $w^*_m \in \mathbb{R}^K$. Specifically, we assume that for each task $m \in [M+1]$, an i.i.d. sample $(x, y) \sim \mu_m$ satisfies
$$y = \phi^*(x)^\top w^*_m + z, \quad z \sim \mathcal{N}(0, \sigma^2).$$
Lastly, we impose a regularity condition: for all $m$, the distribution of $x$ when $(x, y) \sim \mu_m$ is 1-sub-Gaussian.

During the learning process, we assume that we have only a small, fixed amount of data $\{x^{M+1}_i, y^{M+1}_i\}_{i \in [n_{M+1}]}$ drawn i.i.d. from the target task distribution $\mu_{M+1}$. On the other hand, at any point during learning we assume we can obtain an i.i.d. sample from any source task $m \in [M]$ without limit. This setting aligns with our main motivation for active representation learning, where we usually have a limited sample budget for the target task but nearly unlimited access to large-scale source tasks (such as (image, caption) example pairs returned by a search engine from a task keyword).
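To make the setup concrete, the following is a minimal simulation sketch of this data-generating model; all dimensions, names, and the sampler interface are illustrative choices of ours, not part of the paper.

```python
import numpy as np

# Minimal synthetic instance of y = phi*(x)^T w*_m + z with a linear
# representation phi*(x) = (B*)^T x. Sizes here are illustrative only.
d, K, M = 50, 5, 20          # input dim, feature dim, number of source tasks
rng = np.random.default_rng(0)

B_star = np.linalg.qr(rng.normal(size=(d, K)))[0]  # orthonormal d x K representation
W_star = rng.normal(size=(K, M + 1))               # columns w*_1, ..., w*_{M+1}

def sample_task(m, n, sigma=0.1):
    """Draw n i.i.d. (x, y) pairs from task m (0-indexed; m == M is the target)."""
    X = rng.normal(size=(n, d))                    # x ~ N(0, I) is 1-sub-Gaussian
    y = X @ B_star @ W_star[:, m] + sigma * rng.normal(size=n)
    return X, y
```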
Our goal is to use as few total samples from the source tasks as possible to learn a representation and linear predictor $(\hat{\phi}, \hat{w}_{M+1})$ that minimize the excess risk on the target task, defined as
$$\mathrm{ER}_{M+1}(\phi, w) = L_{M+1}(\phi, w) - L_{M+1}(\phi^*, w^*_{M+1}), \quad \text{where } L_{M+1}(\phi, w) = \mathbb{E}_{(x,y) \sim \mu_{M+1}}\left[ (\langle \phi(x), w \rangle - y)^2 \right].$$

Our theoretical study focuses on the linear representation function class, which is studied in (Du et al., 2020; Tripuraneni et al., 2020; 2021; Thekumparampil et al., 2021).

Definition 2.1 (low-dimensional linear representation). $\Phi = \{x \mapsto B^\top x \mid B \in \mathbb{R}^{d \times K}\}$. We denote the true underlying representation as $B^*$. Without loss of generality, we assume $\mathbb{E}_{\mu_m}[xx^\top]$ is the same for all $m \in [M+1]$.

We also make the following assumption, which has been used in (Tripuraneni et al., 2020). We note that Theorem 3.2 does not require this assumption, but Theorem E.4 does.

Assumption 2.2 (benign low-dimensional linear representation). We assume $\mathbb{E}_{\mu_m}[xx^\top] = I$ and $\Omega(1) \leq \|w^*_m\|_2 \leq R$ for all $m \in [M+1]$. We also assume $B^*$ is not only linear but also orthonormal.

Notations. We denote the $n_m$ i.i.d. samples collected from source task $m$ by the input matrix $X_m \in \mathbb{R}^{n_m \times d}$, output vector $Y_m \in \mathbb{R}^{n_m}$, and noise vector $Z_m \in \mathbb{R}^{n_m}$. We then denote the expected and empirical input covariances as $\Sigma_m = \mathbb{E}_{(x,y) \sim \mu_m}[xx^\top]$ and $\hat{\Sigma}_m = \frac{1}{n_m} X_m^\top X_m$. In addition, we denote the collection of $\{w^*_m\}_{m \in [M]}$ by $W^* \in \mathbb{R}^{K \times M}$. Note that the learning process will be divided into several epochs in the algorithm stated later, so we sometimes add a subscript or superscript $i$ to these empirical quantities to refer to the data used in epoch $i$. Finally, we use $\tilde{O}$ to hide factors logarithmic in $K, M, d, 1/\varepsilon$, and $\sum_{m=1}^M n_m$.

Other data assumptions. Based on our large-scale source task motivation, we assume $M \geq K$ and $\sigma_{\min}(W^*) > 0$, which means the source tasks are diverse enough to learn all relevant representation features of the low-dimensional space. This is the standard diversity assumption used in many recent works (Du et al., 2020; Tripuraneni et al., 2020; 2021; Thekumparampil et al., 2021). In addition, we assume $\sigma \geq \Omega(1)$ to make our main result easier to read; this assumption can be lifted by adding some corner-case analysis.

3. Task Relevance $\nu^*$ and More Efficient Sampling with Known $\nu^*$

In this section, we give our key definition of task relevance, based on which we design a more efficient source task sampling strategy. Note that because $\sigma_{\min}(W^*) > 0$, we can regard $w^*_{M+1}$ as a linear combination of $\{w^*_m\}_{m \in [M]}$.

Definition 3.1. $\nu^* \in \mathbb{R}^M$ is defined as
$$\nu^* = \arg\min_{\nu} \|\nu\|_2 \quad \text{s.t. } W^* \nu = w^*_{M+1}, \tag{1}$$
where larger $|\nu^*(m)|$ means higher relevance between source task $m$ and the target task.

If $\nu^*$ is known to the learner, intuitively it makes sense to draw more samples from the source tasks that are most relevant. For each source task $m \in [M]$, Line 3 of Alg. 1 draws $n_m \propto (\nu^*(m))^2$ samples. The algorithm then estimates the shared representation $\phi: \mathbb{R}^d \to \mathbb{R}^K$ and the task-specific linear predictors $W = \{w_m\}_{m=1}^M$ by empirical risk minimization across all source tasks, following the standard multi-task representation learning approach.
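As an aside, the minimizer of Eqn. (1) has a standard closed form (cf. Lemma E.8 in the appendix): for a full-row-rank $W^*$ it is the pseudoinverse solution $(W^*)^\top (W^*(W^*)^\top)^{-1} w^*_{M+1}$. A minimal sketch:

```python
import numpy as np

def min_norm_relevance(W, w_target):
    """Solve Eqn. (1): the minimum-norm nu satisfying W @ nu = w_target.

    When W (K x M, M >= K) has full row rank, the Moore-Penrose pseudoinverse
    returns exactly this least-norm solution W^T (W W^T)^{-1} w_target.
    """
    return np.linalg.pinv(W) @ w_target

# Sanity check: the returned nu is feasible, i.e., W @ nu recovers w_target.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 6))
w_target = rng.normal(size=3)
nu_star = min_norm_relevance(W, w_target)
assert np.allclose(W @ nu_star, w_target)
```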
Below, we give our theoretical guarantee on the sample complexity from the source tasks when $\nu^*$ is known.

Theorem 3.2. Under the low-dimensional linear representation setting of Definition 2.1, with probability at least $1 - \delta$, our algorithm's output satisfies $\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \leq \varepsilon^2$ whenever the total sampling budget from all sources $N_{\mathrm{total}}$ is at least
$$\tilde{O}\left( (Kd + KM + \log(1/\delta))\, \sigma^2 s\, \|\nu^*\|_2^2\, \varepsilon^{-2} \right)$$
and the number of target samples $n_{M+1}$ is at least $\tilde{O}(\sigma^2 (K + \log(1/\delta)) \varepsilon^{-2})$, where
$$s = \min_{\gamma \in [0,1]} (1-\gamma)\|\nu^*\|_{0,\gamma} + \gamma M \quad \text{and} \quad \|\nu^*\|_{0,\gamma} := \left|\left\{ m : |\nu^*_m| > \sqrt{\gamma \|\nu^*\|_2^2 / N_{\mathrm{total}}} \right\}\right|.$$

Algorithm 1: Multi-task sampling strategy with known $\nu^*$
1: Input: confidence $\delta$, representation function class $\Phi$, combinatorial coefficient $\nu^*$, source-task sampling budget $N_{\mathrm{total}} \geq M(Kd + \log(M/\delta))$
2: Initialize the lower bound $\bar{N} = Kd + \log(M/\delta)$ and set the number of samples $n_m = \max\left\{ (N_{\mathrm{total}} - M\bar{N}) \frac{(\nu^*(m))^2}{\|\nu^*\|_2^2},\ \bar{N} \right\}$ for all $m \in [M]$
3: For each task $m$, draw $n_m$ i.i.d. samples from the corresponding offline dataset, denoted $\{X_m, Y_m\}_{m=1}^M$
4: Estimate the models as
$$\hat{\phi}, \hat{W} = \arg\min_{\phi \in \Phi,\, W = [w_1, \ldots, w_M]} \sum_{m=1}^M \|\phi(X_m)w_m - Y_m\|^2, \tag{2}$$
$$\hat{w}_{M+1} = \arg\min_w \|\hat{\phi}(X_{M+1})w - Y_{M+1}\|^2 \tag{3}$$
5: Return $\hat{\phi}, \hat{w}_{M+1}$

Note that the number of target samples $n_{M+1}$ scales only with the dimension of the feature space $K$, and not the input dimension $d \gg K$, which would be necessary without multi-task learning. This dependence is known to be optimal (Du et al., 2020). The quantity $s$ characterizes our algorithm's ability to adapt to the approximate sparsity of $\nu^*$. Noting that $\sqrt{\|\nu^*\|_2^2 / N_{\mathrm{total}}}$ is roughly on the order of $\varepsilon$, taking $\gamma \approx 1/M$ suggests that to satisfy $\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \leq \varepsilon^2$, only those source tasks with relevance $|\nu^*(m)| \gtrsim \varepsilon$ are important for learning.

For comparison, we rewrite the bound of (Du et al., 2020) in terms of $\nu^*$.

Theorem 3.3. Under Definition 2.1, to obtain the same accuracy, the non-adaptive (uniform) sampling of (Du et al., 2020) requires a total sampling budget from all sources of at least
$$\tilde{O}\left( (Kd + KM + \log(1/\delta))\, \sigma^2 M\, \|\nu^*\|_2^2\, \varepsilon^{-2} \right)$$
and the same number of target samples as above.

Note the key difference: the $s$ in Theorem 3.2 is replaced by $M$ in Theorem 3.3. Below we give a concrete example showing that this difference can be significant.

Example: sparse $\nu^*$. Consider an extreme case where $w^*_m = e_{(m \bmod (K-1)) + 1}$ for all $m \in [M-1]$, and $w^*_M = w^*_{M+1} = e_K$. That is, the target task is exactly the same as source task $M$, and all the other source tasks are uninformative. It follows that $\nu^*$ is the 1-sparse vector $e_M$ and $s = 1$ when $\gamma = 0$. We conclude that uniform sampling requires a sample complexity that is $M$ times larger than that of our non-uniform procedure.

3.1. Proof sketch of Theorem 3.2

We first claim two inequalities that are derived via straightforward modifications of the proofs in Du et al. (2020):
$$\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \lesssim \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* w^*_{M+1} \right\|_2^2,$$
$$\left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* \widetilde{W} \right\|_F^2 \lesssim n_{M+1}\, \sigma^2 \left( K(M+d) + \log\tfrac{1}{\delta} \right),$$
where $P_A = I - A(A^\top A)^\dagger A^\top$, $\tilde{\nu}(m) = \nu^*(m)/\sqrt{n_m}$, and $\widetilde{W} = \left[ \sqrt{n_1}\, w^*_1, \sqrt{n_2}\, w^*_2, \ldots, \sqrt{n_M}\, w^*_M \right]$. Using these two results and noting that $w^*_{M+1} = \widetilde{W}\tilde{\nu}$, we have
$$\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \lesssim \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* \widetilde{W} \tilde{\nu} \right\|_2^2 \leq \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* \widetilde{W} \right\|_F^2 \|\tilde{\nu}\|_2^2 \lesssim \sigma^2 \left( K(M+d) + \log\tfrac{1}{\delta} \right) \|\tilde{\nu}\|_2^2. \tag{4, 5}$$

The key step in our analysis is the decomposition of $\|\tilde{\nu}\|_2^2$. Denoting $\epsilon^2 = \|\nu^*\|_2^2 / N_{\mathrm{total}}$, we have, for any $\gamma \in [0,1]$,
$$\|\tilde{\nu}\|_2^2 = \sum_m \frac{(\nu^*(m))^2}{n_m} \left( \mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon\} + \mathbb{1}\{|\nu^*(m)| \leq \sqrt{\gamma}\epsilon\} \right) \lesssim \sum_m \left( \epsilon^2\, \mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon\} + \gamma\epsilon^2\, \mathbb{1}\{|\nu^*(m)| \leq \sqrt{\gamma}\epsilon\} \right),$$
where the inequality comes from the definition of $n_m$ and the fact that $N_{\mathrm{total}} \geq M\bar{N}$. Substituting the value of $\epsilon$ and the definition of $\|\nu^*\|_{0,\gamma}$ gives the desired result.
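For concreteness, here is a minimal sketch of the budget allocation in Line 2 of Alg. 1. Normalizing the squared relevances by $\|\nu^*\|_2^2$ is our reading of the (extraction-damaged) statement, so treat the exact constants as illustrative.

```python
import numpy as np

def allocate_budget(nu, N_total, N_floor):
    """Line 2 of Alg. 1: split the budget across tasks in proportion to
    nu(m)^2, never below the per-task floor N_floor."""
    M = len(nu)
    assert N_total >= M * N_floor, "budget must at least cover the per-task floor"
    proportional = (N_total - M * N_floor) * nu**2 / np.sum(nu**2)
    return np.maximum(np.ceil(proportional).astype(int), N_floor)

nu_star = np.array([0.9, 0.05, 0.05, 0.0])   # a nearly 1-sparse relevance vector
print(allocate_budget(nu_star, N_total=10_000, N_floor=100))
# -> nearly the whole budget is spent on the single highly relevant task
```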
4. Main Algorithm and Theory

In the previous section, we showed the advantage of target-aware source task sampling when the optimal mixing vector $\nu^*$ between the source tasks and the target task is known. In practice, however, $\nu^*$ is unknown and needs to be estimated from estimates of $W^*$ and $w^*_{M+1}$, which are themselves consequences of the unknown representation $\phi^*$. In this section, we design an algorithm that adaptively samples from the source tasks to efficiently learn $\nu^*$ and the prediction function $B^* w^*_{M+1}$ of the target task. The pseudocode for the procedure is given in Alg. 2.

We divide the algorithm into several epochs. At the end of each epoch $i$, we obtain estimates $\hat{\phi}_i, \hat{W}_i$ and $\hat{w}^i_{M+1}$, which are then used to calculate the task relevance estimate $\hat{\nu}_{i+1}$. In the next epoch $i+1$, we sample data based on $\hat{\nu}_{i+1}$. The key challenge in this iterative estimation approach is that estimation error propagates from round to round due to the unknown $\nu^*$ if we directly apply the sampling strategy proposed in Section 3. To avoid inconsistent estimation, we enforce the condition that each source task is sampled at least $\beta_i \epsilon_i^{-1}$ times, which guarantees that $|\hat{\nu}_i(m)|$ always remains close to $c|\nu^*(m)|$ for some $c \in [1/16, 4]$, up to a small additive term. We show in our analysis why this level of estimation accuracy is enough.

Algorithm 2: Active Task Relevance Sampling
1: Input: confidence $\delta$, a lower bound $\bar{\sigma}$ on $\sigma_{\min}(W^*)$, representation function class $\Phi$
2: Initialize $\hat{\nu}_1 = [1/M, 1/M, \ldots]$, $\epsilon_i = 2^{-i}$, and $\{\beta_i\}_{i=1,2,\ldots}$, which will be specified later
3: for $i = 1, 2, \ldots$ do
4:   Set $n^i_m = \max\left\{ \beta_i\, \hat{\nu}_i^2(m)\, \epsilon_i^{-2},\ \beta_i\, \epsilon_i^{-1} \right\}$
5:   For each task $m$, draw $n^i_m$ i.i.d. samples from the corresponding offline dataset, denoted $\{X^i_m, Y^i_m\}_{m=1}^M$
6:   Estimate $\hat{\phi}_i, \hat{W}_i, \hat{w}^i_{M+1}$ via Eqns. (2) and (3)
7:   Estimate the relevance coefficients as
$$\hat{\nu}_{i+1} = \arg\min_{\nu} \|\nu\|_2^2 \quad \text{s.t. } \hat{W}_i \nu = \hat{w}^i_{M+1} \tag{6}$$

4.1. Theoretical results under the linear representation

Here we give a theoretical guarantee for the realizable linear representation function class. Under this setting, we choose $\beta$ as follows:(1)
$$\beta := \beta_i = 3000\, K^2 R^2 \left( KM + Kd \log(N_{\mathrm{total}}/M) + \log(M\Gamma/\delta) \right) / \bar{\sigma}^6,$$
where $\Gamma$ denotes the total number of epochs.

Theorem 4.1. Suppose we know in advance a lower bound $\bar{\sigma}$ on $\sigma_{\min}(W^*)$. Under the benign low-dimensional linear representation setting of Assumption 2.2, we have $\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \leq \varepsilon^2$ with probability at least $1 - \delta$ whenever the number of source samples $N_{\mathrm{total}}$ is at least
$$\tilde{O}\left( \left( K(M+d) + \log\tfrac{1}{\delta} \right) \sigma^2 s\, \|\nu^*\|_2^2\, \varepsilon^{-2} + \Delta\, \bar{\sigma}\, \varepsilon^{-1} \right), \quad \text{where } \Delta = MK^2 d\, \sqrt{R}\, \sqrt{s} / \bar{\sigma}^3,$$
and the target task sample complexity $n_{M+1}$ is at least
$$\tilde{O}\left( \sigma^2 K \varepsilon^{-2} + \Delta' \varepsilon^{-1} \right), \quad \text{where } \Delta' = K(M+d) + \log\tfrac{1}{\delta}$$
and $s$ is as defined in Theorem 3.2.

Discussion. Compared to the known-$\nu^*$ case studied in the previous section, in this unknown-$\nu^*$ setting our algorithm only requires additional low-order $\varepsilon^{-1}$ terms to achieve the same objective (under the additional Assumption 2.2). Also, as long as $\Delta' \leq \tilde{O}(\sigma K \varepsilon^{-1})$, our target task sample complexity $\tilde{O}(\sigma^2 K \varepsilon^{-2})$ remains the optimal rate (Du et al., 2020). Finally, we remark that a limitation of our algorithm is that it requires some prior knowledge of $\bar{\sigma}$. However, because $\bar{\sigma}$ only enters the low-order $\varepsilon^{-1}$ terms, this is unlikely to dominate either of the sample complexities for reasonable values of $d$, $K$, and $M$.

(1) The choice of $\beta$ here is purely for the theoretical proof under the realizable linear representation function class. In practice, we find that the choice of $\beta$ can be more flexible, as long as the lower bound on the number of samples is proportional to $\epsilon_i^{-1}$.
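Before the proof sketch, here is a minimal sketch of the Algorithm 2 loop; `sample_fn` and `fit_fn` are hypothetical placeholders for the data access and the multi-task ERM of Eqns. (2) and (3), which the paper leaves to any standard solver.

```python
import numpy as np

def active_task_sampling(sample_fn, fit_fn, M, beta, num_epochs):
    """Sketch of Alg. 2 with placeholder interfaces.

    sample_fn(m, n): returns n fresh i.i.d. (X, y) pairs from source task m.
    fit_fn(data):    multi-task ERM over Eqns. (2)-(3), using `data` plus the
                     fixed target set; returns W_hat (K x M) and w_hat (K,).
    """
    nu_hat = np.full(M, 1.0 / M)                  # Line 2: uniform initialization
    for i in range(1, num_epochs + 1):
        eps = 2.0 ** (-i)
        # Line 4: relevance-proportional counts with a floor of beta/eps, so
        # even currently irrelevant-looking tasks stay estimable next epoch.
        n = np.ceil(np.maximum(beta * nu_hat**2 / eps**2, beta / eps)).astype(int)
        data = [sample_fn(m, n[m]) for m in range(M)]
        W_hat, w_hat = fit_fn(data)
        # Line 7 / Eqn. (6): min-norm solve of W_hat @ nu = w_hat.
        nu_hat = np.abs(np.linalg.pinv(W_hat) @ w_hat)
    return nu_hat
```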
4.2. Proof sketch

Step 1: We first show that the estimated relevance vector $\hat{\nu}_i$ is close to the underlying $\nu^*$.

Lemma 4.2 (Closeness between $\hat{\nu}_i$ and $\nu^*$). With probability at least $1 - \delta$, for any $i$, as long as $n_{M+1} \gtrsim 2000\, \epsilon_i^{-1} / \bar{\sigma}^4$, we have
$$\hat{\nu}_{i+1}(m) \in \begin{cases} \left[\, |\nu^*(m)|/16,\ 4|\nu^*(m)| \,\right] & \text{if } |\nu^*(m)| \geq \bar{\sigma}\sqrt{\epsilon_i}, \\ \left[\, 0,\ 4\bar{\sigma}\sqrt{\epsilon_i} \,\right] & \text{if } |\nu^*(m)| \leq \bar{\sigma}\sqrt{\epsilon_i}. \end{cases}$$
Notice that the sampling lower bound in the algorithm immediately implies a sufficiently good estimate in the next epoch even if $\hat{\nu}_{i+1}(m)$ goes to $0$.

Proof sketch: Under Assumption 2.2, by solving Eqn. (3), we can rewrite the optimization problems defining $\hat{\nu}_{i+1}$ and $\nu^*$ in Eqns. (6) and (1) roughly as follows (see the formal statement in the proof of Lemma E.1 in Appendix E): $\hat{\nu}_{i+1}$ is the minimum-norm solution of a noisy linear system whose right-hand side is a least-squares fit of $B^* w^*_{M+1} + \frac{1}{n^i_{M+1}} (X^i_{M+1})^\top Z_{M+1}$, while
$$\nu^* = \arg\min_{\nu} \|\nu\|_2^2 \quad \text{s.t. } \textstyle\sum_m w^*_m \nu(m) = w^*_{M+1}.$$
Solving these two optimization problems gives
$$\nu^*(m) = (B^* w^*_m)^\top \left( B^* W^* (B^* W^*)^\top \right)^\dagger (B^* w^*_{M+1}),$$
$$|\hat{\nu}_{i+1}(m)| \leq 2\left| (B^* w^*_m)^\top \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger (B^* w^*_{M+1}) \right| + \text{low-order noise}.$$
It is easy to see that the main difference between these two expressions is $\left( B^* W^* (B^* W^*)^\top \right)^\dagger$ versus its empirical counterpart. Therefore, denoting the difference between these two terms by $\Delta = \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger - \left( B^* W^* (B^* W^*)^\top \right)^\dagger$, we can relate the true and empirical task relevance as
$$|\hat{\nu}_{i+1}(m)| \leq 2|\nu^*(m)| + 2\left| (B^* w^*_m)^\top \Delta\, (B^* w^*_{M+1}) \right| + \text{low-order noise}. \tag{7}$$
Now the empirical risk minimization over the source tasks in Eqn. (2) ensures that $\|B^* W^* - \hat{B}_i \hat{W}_i\|_F \leq \bar{\sigma}\sqrt{\epsilon_i} / \mathrm{poly}(d, M)$. This helps us bound $\hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top - B^* W^* (B^* W^*)^\top$, which can be regarded as a perturbation of the underlying matrix $B^* W^* (B^* W^*)^\top$. Then, using the generalized inverse matrix theorem (Kovanic, 1979), we can show that the inverse of the perturbed matrix is close to that of the original matrix on a certain low-dimensional space. Therefore, we can upper bound the second term of Eqn. (7) by $\bar{\sigma}\sqrt{\epsilon_i}$. We repeat the same procedure to lower bound $\tfrac{1}{2}|\nu^*(m)| - |\hat{\nu}_{i+1}(m)|$. Combining the two directions, we have
$$|\hat{\nu}_{i+1}(m)| \in \left[ \tfrac{1}{2}|\nu^*(m)| - \tfrac{1}{16}\bar{\sigma}\sqrt{\epsilon_i},\ 2|\nu^*(m)| + 2\bar{\sigma}\sqrt{\epsilon_i} \right],$$
which directly leads to the result, based on whether $|\nu^*(m)| \geq \bar{\sigma}\sqrt{\epsilon_i}$ or not.

Step 2: We now prove the following two main lemmas on the final accuracy and the total sample complexity. Define the event $\mathcal{E}$ as the case that, for all epochs, the closeness between $\hat{\nu}_i$ and $\nu^*$ from Lemma 4.2 is satisfied.

Lemma 4.3 (Accuracy in each epoch (informal)). Under $\mathcal{E}$, after epoch $i$, $\mathrm{ER}(\hat{B}, \hat{w}_{M+1})$ is roughly upper bounded by
$$\sigma^2 \left( KM + Kd + \log\tfrac{1}{\delta} \right) \frac{s^i \epsilon_i^2}{\beta} + \frac{\sigma^2 (K + \log(1/\delta))}{n_{M+1}},$$
where $s^i = \min_{\gamma \in [0,1]} (1-\gamma)\|\nu^*\|^i_{0,\gamma} + \gamma M$ and $\|\nu^*\|^i_{0,\gamma} := |\{ m : |\nu^*_m| > \sqrt{\gamma}\epsilon_i \}|$.

Proof sketch: As shown in Section 3, the key to bounding the accuracy is to upper bound $\sum_m (\nu^*(m))^2 / n^i_m$. Similarly to Section 3, we employ the decomposition over the indicators $\mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon_i\} + \mathbb{1}\{|\nu^*(m)| \leq \sqrt{\gamma}\epsilon_i\}$. The sparsity-related second term can again be easily upper bounded by $O\left( (M - \|\nu^*\|^i_{0,\gamma})\, \gamma \epsilon_i^2 / \beta \right)$. Then, to connect $n^i_m$ and $\nu^*(m)$ using Lemma 4.2, we bound the first term roughly as
$$\sum_m \frac{(\nu^*(m))^2}{n^i_m} \mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon_i\} \lesssim \sum_m \frac{\hat{\nu}_i^2(m)}{n^i_m} \mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon_i\} \lesssim \sum_m \frac{\epsilon_i^2}{\beta} \mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon_i\} = O\left( \|\nu^*\|^i_{0,\gamma}\, \epsilon_i^2 / \beta \right),$$
where the first inequality uses Lemma 4.2 and the second uses the definition of $n^i_m$.

Lemma 4.4 (Sample complexity in each epoch (informal)). Under $\mathcal{E}$, after epoch $i$, the total number of training samples drawn from the source tasks is upper bounded by $O\left( \beta (M \epsilon_i^{-1} + \|\nu^*\|_2^2 \epsilon_i^{-2}) \right)$ plus low-order terms.
Proof sketch: For any fixed epoch $i$, by the definition of $n^i_m$, we again decompose the sum over source tasks based on $\nu^*(m)$ and bound the total sample count as
$$\sum_m n^i_m \lesssim \sum_m \beta\, \hat{\nu}_i^2(m)\, \epsilon_i^{-2}\, \mathbb{1}\{|\nu^*(m)| > \bar{\sigma}\sqrt{\epsilon_{i-1}}\} + \sum_m \beta\, \hat{\nu}_i^2(m)\, \epsilon_i^{-2}\, \mathbb{1}\{|\nu^*(m)| \leq \bar{\sigma}\sqrt{\epsilon_{i-1}}\} + M\beta\epsilon_i^{-1}.$$
Again, substituting the bounds on $\hat{\nu}_i$ from Lemma 4.2, we can upper bound the second term in terms of $\nu^*$, and we can also show that the third term is a low-order $\epsilon_i^{-1}$ term. Theorem E.4 follows by combining the two lemmas.

5. Experiments

In this section, we empirically evaluate our active learning algorithm on tasks derived from the corrupted MNIST dataset (MNIST-C) proposed by Mu & Gilmer (2019). While our theoretical results only hold for linear representations, our experiments demonstrate the effectiveness of our algorithm on neural network representations as well. We show that our proposed algorithm (1) achieves better performance than the non-adaptive sampling algorithm when using the same number of source samples, and (2) gradually draws more samples from the important source tasks.

5.1. Experiment setup

Dataset and problem setting. The MNIST-C dataset is a comprehensive suite of 16 different types of corruptions applied to the MNIST test set. To create source and target tasks, we divide each sub-dataset with a specific corruption into 10 tasks by applying one-hot encoding to the 0-9 labels. Therefore, we have 160 tasks in total, which we denote as "corruption type - label". For example, brightness-0 denotes the data corrupted by brightness noise and relabeled to 1/0 based on whether the digit is 0 or not. We choose a small, fixed number of samples from the target task to mimic the scarcity of target task data. On the other hand, we set no budget limitation on the source tasks. We compare the performance of our algorithm to the non-adaptive uniform sampling algorithm, where each is given the same number of source samples and the same target task dataset.

Models. We start with the linear representation as defined in our theory and set $B \in \mathbb{R}^{(28 \times 28) \times 50}$ and $w^i_m \in \mathbb{R}^{50}$. Note that although MNIST is usually treated as a classification problem with cross-entropy loss, here we model it as a regression problem with $\ell_2$ loss to align with the setting studied in this paper. Moreover, we also test our algorithm with a 2-layer ReLU convolutional neural network (CNN) followed by fully connected linear layers, where all the source tasks share the same model except the last linear layer, again denoted $w^i_m \in \mathbb{R}^{50}$.

AL algorithm implementation. We run our algorithm iteratively for 4 epochs. The non-adaptive uniform sampling algorithm is provided with the same number of source samples. There are some differences between our proposed algorithm and what is implemented here. First, we re-scale some parameters from the theorem to account for potential looseness in our analysis. Moreover, instead of drawing fresh i.i.d. samples in each epoch and discarding the past, in practice we reuse the samples from previous epochs and only draw what is necessary to meet the required number of samples for the current epoch (a sketch of this reuse rule follows below). This introduces some randomness into the total source sample usage: for example, we may only require 100 samples from source task A for the current epoch, but we may have already sampled 200 from it in the previous epoch. Thus, in a single epoch we always draw at most as many samples as the non-adaptive algorithm, and in the results shown below the total number of source samples varies across target tasks.
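Concretely, the reuse rule can be implemented as a simple per-task deficit draw; a sketch, where the pooling bookkeeping and `sample_fn` interface are our own illustrative choices:

```python
import numpy as np

def draw_with_reuse(pools, sample_fn, n_required):
    """Keep per-task sample pools across epochs and only draw each task's deficit.

    pools: dict mapping task index -> (X, y) accumulated so far;
    n_required: per-task sample counts demanded by the current epoch.
    """
    for m, n_req in enumerate(n_required):
        n_have = len(pools[m][1]) if m in pools else 0
        if n_req > n_have:
            X_new, y_new = sample_fn(m, n_req - n_have)
            if m in pools:
                X_old, y_old = pools[m]
                pools[m] = (np.vstack([X_old, X_new]),
                            np.concatenate([y_old, y_new]))
            else:
                pools[m] = (X_new, y_new)
    return pools
```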
We argue that this variation is roughly at the same level across target tasks and does not affect our conclusions; please refer to Appendix F.1 for details.

5.2. Results

Linear representation. We choose 500 target samples from each target task. After 4 epochs, we use in total around 30,000 to 40,000 source samples. As a result, our adaptive algorithm frequently outperforms the non-adaptive one, as shown in Figure 1.

[Figure 1: heatmaps, histograms, and a per-task sampling distribution comparing the adaptive and non-adaptive algorithms on the linear representation; numerical grid omitted.]

Figure 1. Performance of the adaptive (ada) versus the non-adaptive (non-ada) algorithm on the linear representation. Left: the prediction difference (in %) between ada and non-ada for all target tasks (larger is better). The y-axis denotes the noise type and the x-axis denotes the binarized label, with each grid cell representing a target task; e.g., the cell at the top left corner stands for target task brightness-0. In summary, the adaptive algorithm achieves 1.1% higher average accuracy than the non-adaptive one and gives the same or better accuracy on 136 out of 160 tasks. Middle: histogram of incorrect predictions (left is better); there is a clear shift to the left for the adaptive algorithm. Right: sampling distribution for the target task glass_blur-2, showing the number of samples drawn from each source task at the beginning of epoch 3 by the adaptive algorithm. The samples clearly concentrate on several "X-2" source tasks, which matches our intuition that all "2 vs. others" tasks should have a closer connection to the glass_blur-2 target task.

[Figure 2: analogous heatmaps, histograms, and sampling distribution for the CNN representation; numerical grid omitted.]

Figure 2. Performance of the adaptive (ada) versus the non-adaptive (non-ada) algorithm on the convnet.
Left: the prediction difference (in %) between ada and non-ada for all target tasks (see Figure 1 for the notation). In summary, the adaptive algorithm achieves 0.68% higher average accuracy than the non-adaptive one and gives the same or better accuracy on 133 out of 160 tasks. Middle: histogram of incorrect predictions (left is better); there is a clear shift to the left for the adaptive algorithm. Although the average improvement is smaller than with the linear representation, the relative improvement is still significant given the already strong baseline performance (most prediction errors are below 6%, while in the linear case most are above 6%). Right: sample distribution for the target task glass_blur-2. A large portion of samples again concentrates on several "X-2" source tasks, matching the intuition that all "2 vs. others" tasks should have a closer connection to the glass_blur-2 target task, though the overall sample distribution is more spread out than with the linear representation.

For those cases where gains are not observed, we conjecture that those tasks violate our realizability assumptions more than the others. We provide a more detailed discussion and supporting results for these failure cases in Appendix F.2.2.

Next, we investigate the sample distribution at the beginning of epoch 3. We show the result for glass_blur-2 as a representative case, with more examples in Appendix F.3. From the figure, we can clearly see that the samples concentrate on the more target-related source tasks.

Convnet. We choose 200 target samples from each target task. After 4 epochs, we use in total around 30,000 to 40,000 source samples. As a result, our adaptive algorithm again frequently outperforms the non-adaptive one, as shown in Figure 2.

Next, we again investigate the sample distribution at the beginning of epoch 3 and show a representative result (more examples in Appendix F.3.2). First of all, a large number of source samples again concentrate on the "2 vs. others" tasks. The reader may notice that some other source tasks also contribute a relatively large number of samples. This is a typical phenomenon in our experiments on convnets, seldom observed with the linear representation. It might be due to the greater expressive power of the CNN, which captures some non-intuitive relationships between certain source tasks and the target task, or it might simply be estimation error, since our algorithm is theoretically justified only for the realizable linear representation. We provide more discussion in Appendix F.3.1.

6. Conclusion and future work

Our paper takes an important initial step in bringing techniques from active learning to representation learning. There are many future directions. From the theoretical perspective, it is natural to analyze fine-tuned models or even more general models like neural nets, as mentioned in the related work section. From the empirical perspective, our next step is to modify and apply the algorithm to more complicated CV or NLP datasets and further analyze its performance. Finally, it would also be interesting to combine our task-wise active learning with instance-wise active learning results.

Acknowledgements

YC wants to thank Lei Chen for discussions about the experiments. SSD acknowledges funding from NSF Awards IIS-2110170 and DMS-2134106.

References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.
arXiv preprint arXiv:2005.14165, 2020.

Chen, S., Crammer, K., He, H., Roth, D., and Su, W. J. Weighted training for cross-task learning, 2021.

Chua, K., Lei, Q., and Lee, J. D. How fine-tuning allows for effective meta-learning, 2021.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. Exploiting shared representations for personalized federated learning. arXiv preprint arXiv:2102.07078, 2021.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009. doi: 10.1109/CVPR.2009.5206848.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations, 2020.

Kovanic, P. On the pseudoinverse of a sum of symmetric matrices with applications to estimation. Kybernetika, 15(5):341-348, 1979. URL http://eudml.org/doc/28097.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Mu, N. and Gilmer, J. MNIST-C: A robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337, 2019.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Settles, B. Active learning literature survey. 2009.

Shachaf, G., Brutzkus, A., and Globerson, A. A theoretical analysis of fine-tuning with linear teachers. Advances in Neural Information Processing Systems, 34, 2021.

Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Sample efficient linear meta-learning by alternating minimization. arXiv preprint arXiv:2105.08306, 2021.

Tripuraneni, N., Jordan, M., and Jin, C. On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems, 33, 2020.

Tripuraneni, N., Jin, C., and Jordan, M. Provable meta-learning of linear representations. In International Conference on Machine Learning, pp. 10434-10443. PMLR, 2021.

Xu, Z. and Tewari, A. Representation learning beyond linear prediction functions. arXiv preprint arXiv:2105.14989, 2021.

Yao, X., Zheng, Y., Yang, X., and Yang, Z. NLP from scratch without large-scale pretraining: A simple and efficient framework, 2021.

Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712-3722, 2018.

A. Appendix structure

In Appendix B, we define the notation commonly used in the following analysis. In Appendix C, we define three high-probability events and prove three claims as variants of original results in Du et al. (2020); these events and claims are used widely in the subsequent theoretical analysis. We then give formal proofs of Theorems 3.2 and 3.3 in Appendix D and a formal proof of Theorem E.4 in Appendix E. Finally, we show more comprehensive experimental results in Appendix F.

B. Notation
Define $\hat{\Delta} = \hat{B}\hat{W} - B^*W^*$ and correspondingly $\hat{\Delta}_i = \hat{B}_i\hat{W}_i - B^*W^*$ when the algorithm is divided into epochs.

Define $\hat{\Delta}_m = \hat{B}\hat{w}_m - B^*w^*_m$ and correspondingly $\hat{\Delta}^i_m = \hat{B}_i\hat{w}^i_m - B^*w^*_m$.

Restate $P_A = I - A(A^\top A)^\dagger A^\top$.

Restate that $\hat{\Sigma}_m = \frac{1}{n_m}(X_m)^\top X_m$ and correspondingly $\hat{\Sigma}^i_m = \frac{1}{n^i_m}(X^i_m)^\top X^i_m$ when the algorithm is divided into epochs.

Restate that $\Sigma_m = \mathbb{E}[\hat{\Sigma}_m]$ and correspondingly $\Sigma^i_m = \mathbb{E}[\hat{\Sigma}^i_m]$.

Define $\kappa = \lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma)$; recall we assume all $\Sigma_m = \Sigma$. Note that in the analysis of the adaptive algorithm we assume identity covariance, so $\kappa = 1$.

For convenience, we write $\sum_m$ for $\sum_{m=1}^M$.

When the algorithm is divided into epochs, we denote the total number of epochs by $\Gamma$.

C. Commonly used claims and definitions

C.1. Covariance concentration guarantees

We define the following guarantees on the feature covariance concentration, used in all proofs below:
$$\mathcal{E}_{\mathrm{source}} = \{0.9\,\Sigma_m \preceq \hat{\Sigma}_m \preceq 1.1\,\Sigma_m,\ \forall m \in [M]\},$$
$$\mathcal{E}_{\mathrm{target1}} = \{0.9\,B_1^\top B_2 \preceq B_1^\top \hat{\Sigma}_{M+1} B_2 \preceq 1.1\,B_1^\top B_2 \text{ for any orthonormal } B_1, B_2 \in \mathbb{R}^{d \times K} \mid \Sigma_{M+1} = I\},$$
$$\mathcal{E}_{\mathrm{target2}} = \{0.9\,\Sigma_{M+1} \preceq B^\top \hat{\Sigma}_{M+1} B \preceq 1.1\,\Sigma_{M+1} \text{ for any } B \in \mathbb{R}^{d \times K}\}.$$

By Claim A.1 in Du et al. (2020), as long as $n_m \gtrsim d + \log(M/\delta)$ for all $m \in [M+1]$, we have $\mathrm{Prob}(\mathcal{E}_{\mathrm{source}}) \geq 1 - \frac{\delta}{10}$. Moreover, as long as $n_{M+1} \gtrsim K + \log(1/\delta)$, we have $\mathrm{Prob}(\mathcal{E}_{\mathrm{target1}}) \geq 1 - \frac{\delta}{10}$ and $\mathrm{Prob}(\mathcal{E}_{\mathrm{target2}}) \geq 1 - \frac{\delta}{10}$.

Correspondingly, if the algorithm is divided into epochs where in each epoch we draw a new set of data, we define $\mathcal{E}^i_{\mathrm{source}} = \{0.9\,\Sigma_m \preceq \hat{\Sigma}^i_m \preceq 1.1\,\Sigma_m,\ \forall m \in [M]\}$. Again, as long as $n^i_m \gtrsim d + \log(M\Gamma/\delta)$ for all $m \in [M]$ and $i \in [\Gamma]$, the event $\bigcap_{i \in [\Gamma]} \mathcal{E}^i_{\mathrm{source}}$ holds with high probability.

Notice that $\mathcal{E}_{\mathrm{target1}}$ is only used in analyzing the main active learning Algorithm 2, while the other two events are used in analyzing both Algorithms 1 and 2.

C.2. Claims for unequal numbers of samples from the source tasks

Here we restate two claims and one result (not stated as a claim) from (Du et al., 2020), and prove that they still hold when the numbers of samples drawn from the source tasks are unequal, as long as the general low-dimensional linear representation assumption of Definition 2.1 is satisfied. The benign setting of Assumption 2.2 is not required.

Algorithm 3: General sampling procedure
1: For each task $m$, draw $n_m$ i.i.d. samples from the corresponding offline dataset, denoted $\{X_m, Y_m\}_{m=1}^M$
2: Estimate the models as
$$\hat{\phi}, \hat{W} = \arg\min_{\phi \in \Phi,\, \hat{W} = [\hat{w}_1, \hat{w}_2, \ldots]} \sum_{m=1}^M \|\phi(X_m)\hat{w}_m - Y_m\|^2, \qquad \hat{w}_{M+1} = \arg\min_w \|\hat{\phi}(X_{M+1})w - Y_{M+1}\|^2$$

Specifically, considering the above procedure, we show that the following holds for any $\{n_m\}_{m=1}^M$.

Claim C.1 (modified version of Claim A.3 in (Du et al., 2020)). Given $\mathcal{E}_{\mathrm{source}}$, with probability at least $1 - \delta/10$,
$$\sum_m \|X_m \hat{\Delta}_m\|_2^2 \lesssim \sigma^2 \left( KM + Kd \log\left( \kappa \left( \textstyle\sum_m n_m \right)/M \right) + \log(1/\delta) \right).$$

Proof. We follow nearly the same steps as the proof in (Du et al., 2020), so some details are skipped and we only focus on the main steps that require modification. We also directly borrow some notation, including $V$, $\mathcal{N}_\epsilon$, and $r$, from the original proof, and restate some of it here for clarity.

Notation restatement. Since $\mathrm{rank}(\hat{\Delta}) \leq 2K$, we can write $\hat{\Delta} = VR = [Vr_1, \ldots, Vr_M]$, where $V \in \mathcal{O}_{d,2K}$ and $R = [r_1, \ldots, r_M] \in \mathbb{R}^{2K \times M}$. Here $\mathcal{O}_{d_1,d_2}$ ($d_1 \geq d_2$) is the set of orthonormal $d_1 \times d_2$ matrices (i.e., matrices whose columns are orthonormal). For each $m \in [M]$ we further write $X_m V = U_m Q_m$, where $U_m \in \mathcal{O}_{n_m,2K}$ and $Q_m \in \mathbb{R}^{2K \times 2K}$. To cover all possible $V$, we use an $\epsilon$-net argument: there exists an $\epsilon$-net $\mathcal{N}_\epsilon$ of $\mathcal{O}_{d,2K}$ in Frobenius norm such that $\mathcal{N}_\epsilon \subseteq \mathcal{O}_{d,2K}$ and $|\mathcal{N}_\epsilon| \leq \left( 6\sqrt{2K}/\epsilon \right)^{2Kd}$. (Please refer to the original proof for why such an $\epsilon$-net exists.) Now we briefly state the proof.
Step 1: We have
$$\sum_m \|X_m(\hat{B}\hat{w}_m - B^*w^*_m)\|_2^2 \lesssim \sum_{m=1}^M \langle Z_m, X_m \bar{V} r_m \rangle + \sum_{m=1}^M \langle Z_m, X_m (V - \bar{V}) r_m \rangle,$$
where $\bar{V} \in \mathcal{N}_\epsilon$ is the net point closest to $V$; this comes from the first three lines of Step 4 of the original proof. For the first term, with probability at least $1 - \delta/10$, using the standard tail bound for $\chi^2$ random variables and the $\epsilon$-net argument (details in Eqn. (28) of the original proof), we have it upper bounded by
$$\sigma \sqrt{KM + \log(|\mathcal{N}_\epsilon|/\delta)} \left( \sqrt{\textstyle\sum_m \|X_m V r_m\|_2^2} + \sqrt{\textstyle\sum_m \|X_m (V - \bar{V}) r_m\|_2^2} \right).$$
For the second term, since $\sigma^{-2} \sum_m \|Z_m\|^2 \sim \chi^2_{\sum_{m=1}^M n_m}$, again using a standard tail bound (details in Eqn. (29) of the original proof), we have with probability at least $1 - \delta/20$,
$$\sum_{m=1}^M \langle Z_m, X_m(V - \bar{V})r_m \rangle \leq \sigma \sqrt{\sum_{m=1}^M n_m + \log(1/\delta)} \cdot \sqrt{\sum_m \|X_m(V - \bar{V})r_m\|_2^2}.$$

Step 2: We can further bound the term $\sqrt{\sum_m \|X_m(V - \bar{V})r_m\|_2^2}$ by showing
$$\sum_m \|X_m(V - \bar{V})r_m\|_2^2 \leq \sum_m \|X_m\|^2 \|V - \bar{V}\|_F^2 \|r_m\|_2^2 \lesssim \epsilon^2 \sum_m n_m \|r_m\|_2^2 = \epsilon^2 \sum_m n_m \|\hat{\Delta}_m\|_2^2 \lesssim \epsilon^2 \sum_m \|X_m \hat{\Delta}_m\|_2^2.$$
This combines Steps 2 and 3 of the original proof. The only difference here is that $n_m$ differs across $m$, so one needs to be more careful with these upper and lower bounds.

Step 3: Finally, we again use the self-bounding technique. Recall that we have
$$\sum_m \|X_m \hat{\Delta}_m\|_2^2 \leq 2 \sum_m \langle Z_m, X_m \hat{\Delta}_m \rangle.$$
Rearranging the inequality and using the distribution of $Z_m$, we have
$$\sum_m \|X_m \hat{\Delta}_m\|_2^2 \lesssim \sigma^2 \left( \sum_{m=1}^M n_m + \log(1/\delta) \right).$$

Step 4: Substituting these into the inequality from Step 1, we have
$$\sum_m \|X_m \hat{\Delta}_m\|_2^2 \lesssim \sigma \sqrt{KM + \log(|\mathcal{N}_\epsilon|/\delta)} \sqrt{\sum_m \|X_m \hat{\Delta}_m\|_2^2} + 2\epsilon \sigma^2 \left( \sum_m n_m + \log(1/\delta) \right).$$
Rearranging the inequality and choosing $\epsilon$ appropriately, we get the result.

Claim C.2 (modified version of Claim A.4 in (Du et al., 2020)). Given $\mathcal{E}_{\mathrm{source}}$ and $\mathcal{E}_{\mathrm{target2}}$, with probability at least $1 - \delta/10$,
$$\left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* \widetilde{W} \right\|_F^2 \leq 1.3\, n_{M+1}\, \sigma^2 \left( KM + Kd \log\left( \kappa \left( \textstyle\sum_m n_m \right)/M \right) + \log\tfrac{1}{\delta} \right),$$
where $\widetilde{W} = W^* \sqrt{\mathrm{diag}([n_1, n_2, \ldots, n_M])}$.

Proof. The proof is almost the same as the first part of the original one, except that we do not need to extract $n_m$:
$$\sum_m \|X_m(\hat{B}\hat{w}_m - B^*w^*_m)\|_2^2 \geq \sum_m \left\| P_{X_m\hat{B}}\, X_m B^* w^*_m \right\|_2^2 \geq 0.9 \sum_m n_m \left\| P_{\Sigma_m\hat{B}}\, \Sigma_m B^* w^*_m \right\|_2^2 = 0.9 \sum_m n_m \left\| P_{\Sigma_{M+1}\hat{B}}\, \Sigma_{M+1} B^* w^*_m \right\|_2^2$$
$$= 0.9 \left\| P_{\Sigma_{M+1}\hat{B}}\, \Sigma_{M+1} B^* \widetilde{W} \right\|_F^2 \geq \frac{0.9}{1.1} \cdot \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* \widetilde{W} \right\|_F^2,$$
where the first and second inequalities are the same as in the original proof, the first equality comes from our assumption that all $\Sigma_m$ are equal, the second equality is just another way of writing the preceding term, and the last inequality follows for the same reason as the second inequality (again, see the original proof). Now, using Claim C.1 as an upper bound on the left-hand side, we get the desired result. Essentially, this claim is just another way of writing Claim A.4 in (Du et al., 2020): here we fold the $n_m$ into $\widetilde{W}$, whereas the original proof extracts $n_m$ and lower bounds $W^*$ by its minimum singular value, which is possible there because all $n_m$ are equal.

Claim C.3. Given $\mathcal{E}_{\mathrm{source}}$ and $\mathcal{E}_{\mathrm{target2}}$, with probability at least $1 - \delta/5$,
$$\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \lesssim \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* w^*_{M+1} \right\|_F^2 + \sigma^2\, \frac{K + \log(1/\delta)}{n_{M+1}}.$$

Proof. This bound comes directly from part of the proof of Theorem 4.1 in (Du et al., 2020); nothing needs to change.

D. Analysis for the Warm-up

D.1. Proof of Theorem 3.2

Suppose events $\mathcal{E}_{\mathrm{source}}$ and $\mathcal{E}_{\mathrm{target2}}$ hold. Then with probability at least $1 - 3\delta$,
$$\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \lesssim \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* w^*_{M+1} \right\|_2^2 + \sigma^2\, \frac{K + \log(1/\delta)}{n_{M+1}}$$
$$= \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* \widetilde{W} \tilde{\nu} \right\|_2^2 + \sigma^2\, \frac{K + \log(1/\delta)}{n_{M+1}}$$
$$\leq \frac{1}{n_{M+1}} \left\| P_{X_{M+1}\hat{B}}\, X_{M+1} B^* \widetilde{W} \right\|_F^2 \|\tilde{\nu}\|_2^2 + \sigma^2\, \frac{K + \log(1/\delta)}{n_{M+1}}$$
$$\lesssim \sigma^2 \left( KM + Kd \log\left( \kappa \left( \textstyle\sum_m n_m \right)/M \right) + \log\tfrac{1}{\delta} \right) \|\tilde{\nu}\|_2^2 + \sigma^2\, \frac{K + \log(1/\delta)}{n_{M+1}},$$
where $\tilde{\nu}(m) = \nu^*(m)/\sqrt{n_m}$. The first inequality comes from Claim C.3 and the last from Claim C.2; using both claims costs probability at most $\delta/5$. The third inequality comes from Hölder's inequality.
The key step in our analysis is to decompose and upper bound $\|\tilde{\nu}\|_2^2$. Denoting $\epsilon^2 = \|\nu^*\|_2^2 / N_{\mathrm{total}}$, we have, for any $\gamma \in [0,1]$,
$$\|\tilde{\nu}\|_2^2 = \sum_m \frac{(\nu^*(m))^2}{n_m} \left( \mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon\} + \mathbb{1}\{|\nu^*(m)| \leq \sqrt{\gamma}\epsilon\} \right) \lesssim \sum_m \left( \epsilon^2\, \mathbb{1}\{|\nu^*(m)| > \sqrt{\gamma}\epsilon\} + \gamma\epsilon^2\, \mathbb{1}\{|\nu^*(m)| \leq \sqrt{\gamma}\epsilon\} \right)$$
$$\leq \|\nu^*\|_{0,\gamma}\,\epsilon^2 + (M - \|\nu^*\|_{0,\gamma})\,\gamma\epsilon^2 \leq \left( (1-\gamma)\|\nu^*\|_{0,\gamma} + \gamma M \right)\epsilon^2 = \frac{\|\nu^*\|_2^2}{N_{\mathrm{total}}} \left( (1-\gamma)\|\nu^*\|_{0,\gamma} + \gamma M \right),$$
where the first inequality comes from $n_m \geq \frac{1}{2}(\nu^*(m))^2 \epsilon^{-2}$. Finally, combining this with the probabilities of $\mathcal{E}_{\mathrm{source}}$ and $\mathcal{E}_{\mathrm{target2}}$, we finish the proof.

D.2. Proof of Theorem 3.3

By the same procedure as before, we again get
$$\mathrm{ER}(\hat{B}, \hat{w}_{M+1}) \lesssim \sigma^2 \left( KM + Kd \log\left( \kappa \left( \textstyle\sum_m n_m \right)/M \right) + \log\tfrac{1}{\delta} \right) \|\tilde{\nu}\|_2^2 + \sigma^2\, \frac{K + \log(1/\delta)}{n_{M+1}}.$$
Now, because of uniform sampling we have $n_m = N_{\mathrm{total}}/M$ for all $m$, which means $\|\tilde{\nu}\|_2^2 = \|\nu^*\|_2^2\, M / N_{\mathrm{total}}$. The result follows by direct calculation.

E. Analysis for Theorem E.4

E.1. Main analysis

Step 1: We first show that the estimated relevance vector $\hat{\nu}_i$ is close to the true $\nu^*$ for any fixed $i$. Notice that Assumption 2.2 is necessary for the proofs in this part.

Lemma E.1 (Closeness between $\hat{\nu}_i$ and $\nu^*$). Under Assumption 2.2, given $\mathcal{E}^i_{\mathrm{source}}$ and $\mathcal{E}_{\mathrm{target1}}$, for any $i, m$, as long as $n_{M+1} \gtrsim 2000\,\epsilon_i^{-1}/\bar{\sigma}^4$, we have, with probability at least $1 - \delta/10M\Gamma$,
$$\hat{\nu}_{i+1}(m) \in \begin{cases} \left[\, |\nu^*(m)|/16,\ 4|\nu^*(m)| \,\right] & \text{if } |\nu^*(m)| \geq \bar{\sigma}\sqrt{\epsilon_i}, \\ \left[\, 0,\ 4\bar{\sigma}\sqrt{\epsilon_i} \,\right] & \text{if } |\nu^*(m)| \leq \bar{\sigma}\sqrt{\epsilon_i}. \end{cases} \tag{8}$$
We define this conditional event as $\mathcal{E}^{i,m}_{\mathrm{relevance}} = \{ \text{Eqn. (8) holds} \mid \mathcal{E}^i_{\mathrm{source}}, \mathcal{E}_{\mathrm{target1}} \}$.

Proof. By the definition of $\nu^*$ and Lemma E.9, we have the following optimization problems:
$$\hat{\nu}_{i+1} = \arg\min_{\nu} \|\nu\|_2^2 \quad \text{s.t. } \sum_m \alpha^i_m \left( \hat{\Sigma}^i_m B^* w^*_m + \tfrac{1}{n^i_m}(X^i_m)^\top Z^i_m \right)\nu(m) = \alpha^i_{M+1} \left( \hat{\Sigma}^i_{M+1} B^* w^*_{M+1} + \tfrac{1}{n^i_{M+1}}(X^i_{M+1})^\top Z_{M+1} \right),$$
$$\nu^* = \arg\min_{\nu} \|\nu\|_2^2 \quad \text{s.t. } \sum_m w^*_m \nu(m) = w^*_{M+1},$$
where $\alpha^i_m = \left( \hat{B}_i^\top \hat{\Sigma}^i_m \hat{B}_i \right)^{-1} \hat{B}_i^\top$. Now we are ready to show that $\hat{\nu}_{i+1}$ is close to $\nu^*$ by comparing their closed-form solutions. First, using Lemma E.8 (based on the standard KKT conditions), it is easy to get a closed-form solution for $\nu^*$: for any $m$,
$$\nu^*(m) = (w^*_m)^\top \left( W^*(W^*)^\top \right)^{-1} w^*_{M+1} = (B^* w^*_m)^\top \left( B^* W^* (B^* W^*)^\top \right)^\dagger (B^* w^*_{M+1}),$$
where the last equality comes from the fact that $B^*$ is orthonormal. For the empirical solution, repeatedly using the events $\mathcal{E}^i_{\mathrm{source}}, \mathcal{E}_{\mathrm{target1}}$, the definition of $\alpha^i_m$, and the fact that $\hat{B}_i$ is orthonormal (notice that in practice this is not required), we can expand $|\hat{\nu}_{i+1}(m)|$ and bound it as
$$|\hat{\nu}_{i+1}(m)| \leq 2\left| (B^* w^*_m)^\top \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger B^* w^*_{M+1} \right| + \mathrm{noise\ term}(m),$$
where $\mathrm{noise\ term}(m)$ collects the three cross terms involving the noise vectors $Z^i_m$ and $Z^i_{M+1}$; it is a low-order term, as we show below.
Sub-step 1 (the non-noise term): The difference between $|\hat{\nu}_{i+1}(m)|$ and $2|\nu^*(m)|$ satisfies
$$|\hat{\nu}_{i+1}(m)| - 2|\nu^*(m)| - \mathrm{noise\ term}(m) \leq 2\left| (B^* w^*_m)^\top \left[ \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger - \left( B^* W^* (B^* W^*)^\top \right)^\dagger \right] (B^* w^*_{M+1}) \right|$$
$$\leq 2\left| (B^* w^*_m)^\top \left( B^* W^* (B^* W^*)^\top \right)^\dagger \left( \hat{\Delta}_i\hat{\Delta}_i^\top + \hat{\Delta}_i(B^* W^*)^\top + (B^* W^*)\hat{\Delta}_i^\top \right) \left( B^* W^* (B^* W^*)^\top \right)^\dagger (B^* w^*_{M+1}) \right|$$
$$\leq 2\, \|B^* w^*_m\|_2\, \|B^* w^*_{M+1}\|_2 \left\| \left( B^* W^* (B^* W^*)^\top \right)^\dagger \left( \hat{\Delta}_i\hat{\Delta}_i^\top + \hat{\Delta}_i(B^* W^*)^\top + (B^* W^*)\hat{\Delta}_i^\top \right) \left( B^* W^* (B^* W^*)^\top \right)^\dagger \right\|_F$$
$$\leq 2R \left\| \left( B^* W^* (B^* W^*)^\top \right)^\dagger \left( \hat{\Delta}_i\hat{\Delta}_i^\top + \hat{\Delta}_i(B^* W^*)^\top + (B^* W^*)\hat{\Delta}_i^\top \right) \left( B^* W^* (B^* W^*)^\top \right)^\dagger \right\|_F \leq \bar{\sigma}\sqrt{\epsilon_i},$$
where the first inequality comes from the triangle inequality, the second holds with high probability by the application of the generalized-inverse-of-matrices theorem (see Lemma E.10 for details), the fourth comes from the fact that $\|B^* w^*_m\|_2 \leq \|w^*_m\|_2 \leq R$, and the last follows by applying Lemma E.12, using the assumptions on $B^*, W^*$ as well as the multi-task model estimation error $\hat{\Delta}_i$. For the same reason, in the other direction,
$$0.5|\nu^*(m)| - |\hat{\nu}_{i+1}(m)| - \mathrm{noise\ term}(m) \leq \bar{\sigma}\sqrt{\epsilon_i}/4.$$
Combining these two, we have
$$|\hat{\nu}_{i+1}(m)| \in \left[\, 0.5|\nu^*(m)| - \bar{\sigma}\sqrt{\epsilon_i}/4 - 1.5\,\mathrm{noise\ term}(m),\ 2|\nu^*(m)| + \bar{\sigma}\sqrt{\epsilon_i} + 1.5\,\mathrm{noise\ term}(m) \,\right].$$

Sub-step 2 (the noise term): Now we deal with $\mathrm{noise\ term}(m)$, restated below for convenience:
$$1.3\left| \tfrac{1}{n^i_m} (Z^i_m)^\top X^i_m \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger B^* w^*_{M+1} \right| + 1.3\left| (B^* w^*_m)^\top \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger \tfrac{1}{n^i_{M+1}} (X^i_{M+1})^\top Z^i_{M+1} \right|$$
$$+ \left| \tfrac{1}{n^i_m} (Z^i_m)^\top X^i_m \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger \tfrac{1}{n^i_{M+1}} (X^i_{M+1})^\top Z^i_{M+1} \right|.$$
By the assumptions on $B^*, W^*$, it is easy to see that with probability at least $1 - \delta'$, where $\delta' = \delta/10\Gamma M$,
$$\left| \tfrac{1}{n^i_m} (Z^i_m)^\top X^i_m \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger B^* w^*_{M+1} \right| \lesssim \frac{1}{\lambda_{\min}\!\left( B^* W^* (B^* W^*)^\top \right)} \left| \tfrac{1}{n^i_m} (Z^i_m)^\top X^i_m B^* w^*_{M+1} \right|$$
$$\leq \frac{\sigma}{n^i_m\, \lambda_{\min}\!\left( B^* W^* (B^* W^*)^\top \right)} \sqrt{ (w^*_{M+1})^\top (B^*)^\top (X^i_m)^\top X^i_m B^* w^*_{M+1}\, \log(1/\delta') } \leq \frac{2.2\,\sigma\, \|w^*_{M+1}\|_2 \sqrt{\log(1/\delta')}}{\sqrt{n^i_m}\, \lambda_{\min}\!\left( B^* W^* (B^* W^*)^\top \right)} \lesssim \sqrt{\epsilon_i/\beta} \cdot \frac{2.2\,\sigma \sqrt{R \log(1/\delta')}}{\lambda_{\min}\!\left( B^* W^* (B^* W^*)^\top \right)},$$
where the first inequality comes from Lemma E.6, the second from the Chernoff bound, and the last from the definition $n^i_m = \max\{\beta\hat{\nu}_i^2\epsilon_i^{-2}, \beta\epsilon_i^{-1}, \bar{N}\}$. Recall that we choose $\beta = 3000 K^2 R^2 (KM + Kd\log(N_{\mathrm{total}}/M) + \log(M\Gamma/\delta))/\bar{\sigma}^6$; therefore, the above can be upper bounded by $\bar{\sigma}\sqrt{\epsilon_i}/24$. By a similar argument and the assumption that $n_{M+1} \gtrsim 3000 R\, \epsilon_i^{-1}/\bar{\sigma}^4$, we can also show that with probability at least $1 - \delta'$,
$$\left| (B^* w^*_m)^\top \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger \tfrac{1}{n^i_{M+1}} (X^i_{M+1})^\top Z^i_{M+1} \right| \leq \frac{2.2\,\sigma\, \|w^*_m\|_2 \sqrt{\log(1/\delta')}}{\sqrt{n_{M+1}}\, \lambda_{\min}\!\left( B^* W^* (B^* W^*)^\top \right)}.$$
Finally, the third term satisfies
$$\left| \tfrac{1}{n^i_m} (Z^i_m)^\top X^i_m \left( \hat{B}_i \hat{W}_i (\hat{B}_i \hat{W}_i)^\top \right)^\dagger \tfrac{1}{n^i_{M+1}} (X^i_{M+1})^\top Z^i_{M+1} \right| \lesssim \sqrt{\epsilon_i/\beta} \cdot \frac{\|w^*_m\|_2 \|w^*_{M+1}\|_2 \sqrt{\log(1/\delta')}}{\sqrt{n_{M+1}}\, \lambda_{\min}\!\left( B^* W^* (B^* W^*)^\top \right)}.$$
So overall we have $\mathrm{noise\ term}(m) \leq \bar{\sigma}\sqrt{\epsilon_i}/8$.

Sub-step 3 (combining): When $|\nu^*(m)| \geq \bar{\sigma}\sqrt{\epsilon_i}$, combining the above results shows that $|\hat{\nu}_{i+1}(m)| \in [|\nu^*(m)|/16, 4|\nu^*(m)|]$. On the other hand, if $|\nu^*(m)| \leq \bar{\sigma}\sqrt{\epsilon_i}$, then we directly have $\hat{\nu}_{i+1}(m) \in [0, 4\bar{\sigma}\sqrt{\epsilon_i}]$.

Step 2: Now we are ready to prove the two main lemmas on the final accuracy and the total sample complexity.

Lemma E.2 (Accuracy in each epoch). Given $\mathcal{E}_{\mathrm{source}}, \mathcal{E}_{\mathrm{target1}}, \mathcal{E}_{\mathrm{target2}}$, and $\mathcal{E}^{i,m}_{\mathrm{relevance}}$ for all $m$, after epoch $i$ we have, with probability at least $1 - \delta/10M\Gamma$, $\mathrm{ER}(\hat{B}, \hat{w}_{M+1})$ upper bounded by
$$\sigma^2 \left( KM + Kd + \log\tfrac{1}{\delta} \right) \frac{s^i \epsilon_i^2}{\beta} + \frac{\sigma^2 (K + \log(1/\delta))}{n_{M+1}},$$
where $s^i = \min_{\gamma \in [0,1]} (1-\gamma)\|\nu^*\|^i_{0,\gamma} + \gamma M$ and $\|\nu^*\|^i_{0,\gamma} := |\{ m : |\nu^*_m| > \sqrt{\gamma}\epsilon_i \}|$.
Now for any γ [0, 1], given Ei,m relevance, we are going to bound P nim 1{|ν (m)| > σ ϵi 1} + X nim 1{ γϵi 1 |ν (m)| σ ϵi 1} nim 1{|ν (m)| γϵi} 256ˆν2 i nim 1{|ν (m)| > σ ϵi 1} + X nim 1{ γϵi 1 |ν (m)| ϵi 1} + γϵ2 i /β X m 1{|ν (m)| γϵi} m ϵ2 i /β1{|ν (m)| > σ ϵi 1} m σ2ϵ2 i /β1{ γϵi 1 |ν (m)| σ ϵi 1} + (M ν i 0,γ)γϵ2 i /β m ϵ2 i /β1{|ν (m)| > σ γϵi 1 + (M ν i 0,γ)γϵ2 i /β ((1 γ) ν i 0,γ + γM)ϵ2 i /β Lemma E.3 (Sample complexity on each epoch). Given Esource, Etarget 1 and Ei,m relevance for all m, we have the total number of source samples used in epoch i as as O β(Mε 1 + ν 2 2ε 2) Proof. Given Ei,m relevance, we can get the sample complexity for block i as the follows. m max{ˆνi(m)2ϵ 2 i , ϵ 1 i , N} m βˆν2 i ϵ 2 i + m βˆν2 i (m)ϵ 2 i 1{|ν (m)| > σ ϵi 1} + m βˆν2 i (m)ϵ 2 i 1{|ν (m)| σ ϵi 1} + m β(4ν (m))2ϵ 2 i 1{|ν (m)| > σ ϵi 1} + m β(4σ ϵi 1)2/ϵ 2 i 1{|ν (m)| σ ϵi 1} + = O β(Mϵ 1 i + ν 2 2ϵ 2 i ) Active Multi-Task Representation Learning Theorem E.4. Suppose we know in advance a lower bound of σmin(W ) denoted as σ. Under the benign low-dimension linear representation setting as defined in Assumption 2.2, we have ER( ˆB, ˆw M+1) ε2 with probability at least 1 δ whenever the number of source samples Ntotal is at least e O K(M + d) + log 1 σ2s ν 2 2ε 2 + σε 1 where = MK2d R/σ3 s and the target task sample complexity n M+1 is at least e O σ2Kε 2 + where = min n K(M + d) + log 1 δ o and s has been defined in Theorem 3.2. Proof. Given Esource, Etarget 1, Etarget 2 and Ei,m relevance then by Lemma E.2, we the final accuracy from last epoch as ERM+1( ˆB, ˆw M+1) σ2 KM + Kd log(( X m nm)/M) + log 1 s Γϵ2 Γ/β + σ2 (K + log(1/δ)) Denote the final accuracy of the first term as ε2. So we can write ϵΓ as KM + Kd log(( X m nm)/M) + log 1 By applying lemma 4.4, we requires the total source sample complexity i=1 β(Mϵ 1 i + ν 2 2ϵ 2 i ) 2β(Mϵ 1 Γ + 2 ν 2 2ϵ 2 Γ ) KM + Kd log(( X m nm)/M) + log 1 + β ν 2 2σ2 KM + Kd log(( X m nm)/M) + log 1 KM + Kd log(( X m nm)/M) + log 1 KM + Kd log(( X m nm)/M) + log 1 = e O MK2d + M s Γσε 1 + Kds Γσ2 ν 2 2ε 2 Also in order to satisfy the assumption in Lemma E.1, we required n M+1 to be at least KM + Kd log(( X m nm)/M) + log 1 KM + Kd log(( X m nm)/M) + log 1 Notice that Γ is a algorithm-dependent parameter, therefore, the final step is to bound s Γ by an algorithm independent term by write Γ as Γ = log ϵΓ min Ntotal β ν 2 2 , log Ntotal Active Multi-Task Representation Learning So we have . ν Γ 0,γ = |{m : νm > γ max{β ν 2 2 Ntotal , βM To further simply this, notice that, for any ϵ < ϵi. ν i 0,γ = |{m : νm > γϵi}| |{m : νm > γϵ }| So we further have ν Γ 0,γ = |{m : νm > γ ν 2 2 Ntotal }| := ν 0,γ Finally, by union bound Esource, Etarget 1, Etarget 2 and Ei,m relevance on all epochs, we show that all the lemmas holds with probability at least 1 δ. E.2. Auxiliary Lemmas Lemma E.5 (Convergence on estimated model ˆBi ˆWi). For any fixed i, given Ei source, we have ˆ i 2 F 1.3σ2 KM + Kd log(( X m ni m)/M) + log 10Γ And therefore, when β = 3000K2R2(KM +Kd log(Ntotal/M)+log(MΓ/δ)/σ6. Therefore, above can be upper bounded by ϵi/24, we have 2 F σ2ϵi 4K2R2 Proof. Denote m as the m-th column of ˆ i. m=1 Xm m 2 2 = m=1 m X m Xm m 0.9 min m ni m m=1 m 2 2 = 0.9 min m ni m ˆ i 2 F Recall that our definition on ni m = max βˆν2 i (m)ϵ 2 i , βϵ 1 i and also use the upper bound derived in Claim C.1, we finish the proof. Lemma E.6 (minimum singular value guarantee for ˆBi ˆWi). 
Lemma E.6 (Minimum singular value guarantee for $\hat B_i\hat W_i$). For all $i$, we can guarantee that
$$\sigma_{\min}\big(\hat B_i\hat W_i\big) \ge \sigma_{\min}(W^*)/2.$$
Also, because $M \ge K$, there is always a feasible solution for $\hat\nu_i$.

Proof. Because $B^*$ is orthonormal, $\sigma_{\min}(B^*W^*) = \sigma_{\min}(W^*)$. Also, from Lemma E.5 and Weyl's theorem stated below, we have
$$\big|\sigma_{\min}\big(\hat B_i\hat W_i\big) - \sigma_{\min}\big(B^*W^*\big)\big| \le \big\|\hat B_i\hat W_i - B^*W^*\big\|_F \le \underline\sigma/2.$$
Combining these two inequalities, we easily get the result.

Theorem E.7 (Weyl's inequality for singular values). Let $M$ be a $p\times n$ matrix with $1 \le p \le n$. Its singular values $\sigma_k(M)$ are the $p$ largest eigenvalues of the $(p+n)\times(p+n)$ Hermitian augmented matrix
$$\begin{pmatrix} 0 & M \\ M^\top & 0\end{pmatrix}.$$
Therefore, Weyl's eigenvalue perturbation inequality for Hermitian matrices extends naturally to the perturbation of singular values: for an additive perturbation $\Delta$ of $M$,
$$|\sigma_k(M + \Delta) - \sigma_k(M)| \le \sigma_1(\Delta) \le \|\Delta\|_F.$$

Lemma E.8. For any two matrices $M_1\in\mathbb R^{K\times M}$ and $M_2\in\mathbb R^{K}$ with $K \le M$, suppose $\mathrm{rank}(M_1) = K$ and define
$$\nu = \operatorname*{arg\,min}_{\nu\in\mathbb R^M}\ \|\nu\|_2^2 \quad \text{s.t.}\quad M_1\nu = M_2.$$
Then $\nu = M_1^\top\big(M_1M_1^\top\big)^{-1}M_2$.

Proof. We prove this using the KKT conditions. Define the Lagrangian
$$L(\nu, \lambda) = \|\nu\|_2^2 + \lambda^\top\big(M_1\nu - M_2\big).$$
Setting $\nabla_\nu L = 0$ gives $\nu = -M_1^\top\lambda/2$. Substituting this into the constraint, we have
$$M_1M_1^\top\lambda = -2M_2 \implies \lambda = -2\big(M_1M_1^\top\big)^{-1}M_2,$$
and therefore $\nu = M_1^\top\big(M_1M_1^\top\big)^{-1}M_2$.

Lemma E.9 (A closed-form expression for $\hat\nu_i$). For any epoch $i$, given the estimated representation $\hat B_i$, we have
$$\hat\nu_{i+1} = \operatorname*{arg\,min}_\nu\ \|\nu\|_2^2 \quad\text{s.t.}\quad \hat W_i\nu = \hat w^i_{M+1}, \qquad \hat w^i_{M+1} = \alpha^i_{M+1}\Big(\hat\Sigma^i_{M+1}B^*w^*_{M+1} + \frac{1}{n^i_{M+1}}\big(X^i_{M+1}\big)^\top Z_{M+1}\Big),$$
where $\alpha^i_m = \big(\hat B_i^\top\hat\Sigma^i_m\hat B_i\big)^{-1}\hat B_i^\top \in \mathbb R^{K\times d}$.

Proof. For any epoch $i$ and its estimated representation $\hat B_i$, by the least-squares argument, each task head satisfies
$$\hat w^i_m = \operatorname*{arg\,min}_w\ \big\|X^i_m\hat B_i w - Y_m\big\|^2 = \Big(\big(X^i_m\hat B_i\big)^\top X^i_m\hat B_i\Big)^{-1}\big(X^i_m\hat B_i\big)^\top Y_m$$
$$= \Big(\big(X^i_m\hat B_i\big)^\top X^i_m\hat B_i\Big)^{-1}\big(X^i_m\hat B_i\big)^\top X^i_m B^*w^*_m + \Big(\big(X^i_m\hat B_i\big)^\top X^i_m\hat B_i\Big)^{-1}\big(X^i_m\hat B_i\big)^\top Z_m$$
$$= \underbrace{\big(\hat B_i^\top\hat\Sigma^i_m\hat B_i\big)^{-1}\hat B_i^\top}_{\alpha^i_m\,\in\,\mathbb R^{K\times d}}\hat\Sigma^i_m B^*w^*_m + \alpha^i_m\,\frac{1}{n^i_m}\big(X^i_m\big)^\top Z_m.$$
In particular, for the target task,
$$\hat w^i_{M+1} = \alpha^i_{M+1}\Big(\hat\Sigma^i_{M+1}B^*w^*_{M+1} + \frac{1}{n^i_{M+1}}\big(X^i_{M+1}\big)^\top Z_{M+1}\Big).$$
Recalling the definition of $\hat\nu_{i+1}$ as $\min_\nu \|\nu\|_2^2$ s.t. $\hat W_i\nu = \hat w^i_{M+1}$, we obtain the closed form by substituting the values of $\hat W_i\nu$ and $\hat w^i_{M+1}$ computed above.
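The following sketch ties Lemmas E.8 and E.9 together on a toy linear-representation instance: each task head is estimated by least squares in the learned representation, and $\hat\nu$ is then the minimum-norm solution computed via the closed form of Lemma E.8. All dimensions, the noise level, and the construction of `B_hat` are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

# Lemma E.9's two-stage construction of nu_hat on a toy model
# y = x^T B* w*_m + noise, with B_hat standing in for a learned representation.
rng = np.random.default_rng(1)
d, K, M, n = 30, 4, 10, 500
B_star = np.linalg.qr(rng.normal(size=(d, K)))[0]       # orthonormal B*
W_star = rng.normal(size=(K, M))                        # source task heads
w_target = W_star @ rng.normal(size=M) / M              # target head in the span
B_hat = np.linalg.qr(B_star + 0.01 * rng.normal(size=(d, K)))[0]

def head_estimate(B, w_true):
    # Least squares in the learned representation, as in Lemma E.9's proof:
    # w_hat = argmin_w ||X B w - y||^2.
    X = rng.normal(size=(n, d))
    y = X @ (B_star @ w_true) + 0.1 * rng.normal(size=n)
    return np.linalg.lstsq(X @ B, y, rcond=None)[0]

W_hat = np.column_stack([head_estimate(B_hat, W_star[:, m]) for m in range(M)])
w_hat_target = head_estimate(B_hat, w_target)

# Minimum-norm solution of W_hat nu = w_hat_target via Lemma E.8's closed form.
nu_hat = W_hat.T @ np.linalg.solve(W_hat @ W_hat.T, w_hat_target)
assert np.allclose(W_hat @ nu_hat, w_hat_target)                  # feasible
assert np.allclose(nu_hat, np.linalg.pinv(W_hat) @ w_hat_target)  # = pinv solution
print(np.round(nu_hat, 3))
```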
Lemma E.10 (Difference of the inverse covariance matrices). For any fixed $i$, $m$ and any matrices $M_1, M_2, M_3, M_4$ of compatible dimensions,
$$M_1\Big[\big(\hat B_i\hat W_i(\hat B_i\hat W_i)^\top\big)^\dagger - \big(B^*W^*(B^*W^*)^\top\big)^\dagger\Big]B^*M_2 = M_1\big(B^*W^*(B^*W^*)^\top\big)^\dagger\Big[\hat\Delta_i\hat\Delta_i^\top + \hat\Delta_i(B^*W^*)^\top + (B^*W^*)\hat\Delta_i^\top\Big]\big(B^*W^*(B^*W^*)^\top\big)^\dagger B^*M_2$$
and
$$M_3(B^*)^\top\Big[\big(\hat B_i\hat W_i(\hat B_i\hat W_i)^\top\big)^\dagger - \big(B^*W^*(B^*W^*)^\top\big)^\dagger\Big]M_4 = M_3(B^*)^\top\big(B^*W^*(B^*W^*)^\top\big)^\dagger\Big[\hat\Delta_i\hat\Delta_i^\top + \hat\Delta_i(B^*W^*)^\top + (B^*W^*)\hat\Delta_i^\top\Big]\big(B^*W^*(B^*W^*)^\top\big)^\dagger M_4.$$

Proof. First, we relate the two matrices whose pseudo-inverses appear:
$$\hat B_i\hat W_i\hat W_i^\top\hat B_i^\top - B^*W^*(B^*W^*)^\top = \big(B^*W^* + \hat\Delta_i\big)\big(B^*W^* + \hat\Delta_i\big)^\top - B^*W^*(B^*W^*)^\top = \hat\Delta_i\hat\Delta_i^\top + \hat\Delta_i(B^*W^*)^\top + (B^*W^*)\hat\Delta_i^\top.$$
In order to connect the first pseudo-inverse with the second, we use the generalized inverse of matrices theorem stated below.

Theorem E.11 (Theorem from (Kovanic, 1979)). If $V$ is an $n\times n$ symmetric matrix and $X$ is an $n\times q$ arbitrary real matrix, then
$$\big(V + XX^\top\big)^\dagger = V^\dagger - V^\dagger X\big(I + X^\top V^\dagger X\big)^{-1}X^\top V^\dagger + \big((X_\perp)^\dagger\big)^\top X_\perp^\dagger, \qquad X_\perp = \big(I - VV^\dagger\big)X.$$

It is easy to see that we should take $V := B^*W^*(B^*W^*)^\top$, and we can decompose $\hat\Delta_i\hat\Delta_i^\top + \hat\Delta_i(B^*W^*)^\top + (B^*W^*)\hat\Delta_i^\top$ into the form $XX^\top$ for a suitable $X$. Therefore, we can write the difference of the two pseudo-inverses as
$$-V^\dagger X\big(I + X^\top V^\dagger X\big)^{-1}X^\top V^\dagger + \big((X_\perp)^\dagger\big)^\top X_\perp^\dagger.$$
Next we show that $\big((X_\perp)^\dagger\big)^\top X_\perp^\dagger B^* = 0$ and $(B^*)^\top\big((X_\perp)^\dagger\big)^\top X_\perp^\dagger = 0$. Let $UDQ^\top$ be the singular value decomposition of $B^*W^*$. Then $VV^\dagger = UD^2U^\top\big(UD^2U^\top\big)^\dagger = UU^\top$, and because the columns of $B^*$ are contained in the column space of $B^*W^*$,
$$X_\perp^\top B^* = \big((I - UU^\top)X\big)^\top B^* = X^\top\big(I - UU^\top\big)B^* = 0.$$
Therefore, we conclude that
$$\Big|M_1\Big[\big(\hat B_i\hat W_i(\hat B_i\hat W_i)^\top\big)^\dagger - \big(B^*W^*(B^*W^*)^\top\big)^\dagger\Big]B^*M_2\Big| \le \Big|M_1\underbrace{\big(B^*W^*(B^*W^*)^\top\big)^\dagger\big[\hat\Delta_i\hat\Delta_i^\top + \hat\Delta_i(B^*W^*)^\top + (B^*W^*)\hat\Delta_i^\top\big]\big(B^*W^*(B^*W^*)^\top\big)^\dagger}_{V^\dagger XX^\top V^\dagger}B^*M_2\Big|,$$
and the same holds for the other equation.

Lemma E.12. Given $\mathcal E^i_{source}$, for any fixed $i$, we have
$$\Big\|\big(B^*W^*(B^*W^*)^\top\big)^\dagger\big[\hat\Delta_i\hat\Delta_i^\top + \hat\Delta_i(B^*W^*)^\top + (B^*W^*)\hat\Delta_i^\top\big]\big(B^*W^*(B^*W^*)^\top\big)^\dagger\Big\|_F \le 3\,\big\|(W^*)^\dagger\big\|_F\,\big\|\big(W^*(W^*)^\top\big)^\dagger\big\|_F\cdot\frac{\sigma\,\underline\sigma^3\sqrt{\epsilon_i}}{6KR}.$$

Proof. By the triangle inequality, the target term can be upper bounded by
$$\big\|V^\dagger\hat\Delta_i\hat\Delta_i^\top V^\dagger\big\|_F + \big\|V^\dagger B^*W^*\hat\Delta_i^\top V^\dagger\big\|_F + \big\|V^\dagger\hat\Delta_i(B^*W^*)^\top V^\dagger\big\|_F, \qquad V := B^*W^*(B^*W^*)^\top.$$
Before the final bounding, we first derive an identity that will be used repeatedly:
$$\big\|(B^*W^*)^\top\big(B^*W^*(B^*W^*)^\top\big)^\dagger\big\|_F = \big\|(W^*)^\top(B^*)^\top\big(B^*W^*(W^*)^\top(B^*)^\top\big)^\dagger\big\|_F = \big\|(W^*)^\top\big(W^*(W^*)^\top\big)^\dagger\big\|_F = \big\|(W^*)^\dagger\big\|_F,$$
using the orthonormality of $B^*$. Therefore, we can bound the whole term by
$$\big\|V^\dagger\big\|_F^2\,\|\hat\Delta_i\|_F^2 + 2\,\big\|(W^*)^\dagger\big\|_F\,\|\hat\Delta_i\|_F\,\big\|V^\dagger\big\|_F \le 3\,\big\|(W^*)^\dagger\big\|_F\,\big\|\big(W^*(W^*)^\top\big)^\dagger\big\|_F\,\|\hat\Delta_i\|_F.$$
Recall that given $\mathcal E^i_{source}$ we have $\|\hat\Delta_i\|_F^2 \le \frac{\sigma^2\underline\sigma^6\epsilon_i}{36K^2R^2}$; plugging this in gives the final result.

F. Experiment details

F.1. Other implementation details

We choose $\beta_i = 1/\|\hat\nu\|_2^2$, which is usually $\Theta(1)$ in practice. Instead of choosing $\epsilon_i = 2^{-i}$, we set $\epsilon_i = 1.5^{-i}$ and directly start from $i = 22$. The actual number of samples used in the experiments turns out to be similar to choosing $\beta = \mathrm{poly}(d, K, M)$ and starting from epoch 1 as proposed in the theorem, but our choice makes it easier to adjust parameters and run comparisons. Due to limited computational resources, we run each experiment on each task only once, and we acknowledge that repeating each experiment over more iterations would be better; however, since we have 160 target tasks in total, considering the randomness across tasks, we believe the results are still meaningful. Moreover, recall the lower bound $\beta\epsilon_i^{-1}$ on the per-task sample count in our proposed algorithm. In the linear-model experiments, we find that a small constant number of samples, such as 50 per epoch, is enough to obtain meaningful results, while for the convnet, considering the complexity of the model, we still obey this rule.
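As a concrete reading of the schedule just described, here is one plausible implementation sketch; it is an assumption on our part, not the exact experimental code. Each epoch's budget is split across source tasks in proportion to $\hat\nu^2(m)$, subject to the 50-sample floor used for the linear model; the budgets and the relevance vector below are placeholders.

```python
import numpy as np

# Hypothetical reading of F.1's schedule: split each epoch's source budget in
# proportion to nu_hat(m)^2, with the constant per-task floor of 50 samples
# that replaces the theoretical beta * eps^-1 lower bound for the linear model.
def epoch_allocation(nu_hat, budget, floor=50):
    nu_hat = np.asarray(nu_hat, dtype=float)
    raw = budget * nu_hat**2 / np.sum(nu_hat**2)   # proportional to nu_hat^2
    return np.maximum(raw, floor).astype(int)      # enforce the per-task floor

for epoch, budget in enumerate([2000, 4000, 8000], start=1):
    print(epoch, epoch_allocation(np.array([0.6, 0.3, 0.05, 0.05]), budget))
```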
F.2. More results and analysis for the linear model

F.2.1. MORE COMPREHENSIVE SUMMARY

[Figure 3: two heatmaps; rows are corruption types (e.g., canny_edges, dotted_line, impulse_noise, motion_blur), columns are digit labels 0-9.]

Figure 3. Summary of the performance difference for the linear model (restatement of Figure 1). Left: the prediction difference (in %) between the adaptive and non-adaptive algorithms for all target tasks. Right: the error percentage of the non-adaptive algorithm; note that 10% is the baseline due to the imbalanced dataset, and larger is worse. Please refer to the main paper for further explanation.

F.2.2. WHY OUR ALGORITHM FAILS ON SOME TARGET TASKS?

Here we focus on why our algorithm performs poorly on some tasks. Overall, beyond randomness, this is mainly due to the incompatibility between our theoretical assumption (a realizable linear model) and the complicated structure of real data. Specifically, there may be a subset of source tasks that are informative about the target task under the linear model assumption, while other source tasks are far from satisfying this assumption. Due to this model misspecification, those misleading tasks may receive more samples under the adaptive algorithm. We conjecture that this is the case for target tasks like scale 0 and scale 2. To further support this argument, we analyze the sample-number distributions and the test error across epochs in the next paragraph.

For the scale 0 task, in Figure 4 we observe a non-stationary sample distribution that changes across epochs. Fortunately, the distribution is not extreme: there is still a significant number of samples from the X-0 source tasks (source tasks whose label is 0). This aligns with the test-error curve in Figure 5, which keeps decreasing gradually, although more slowly than for the non-adaptive algorithm. For scale 2, on the other hand, the sample distributions are even worse: we observe that nearly all samples concentrate on the X-5 source tasks. Thus not only do we fail to sample enough informative data, but we also force the model to fit unrelated data. Such misspecification is reflected in the test-error curve.
(You may notice the unstable error curve of the non-adaptive algorithm; we consider this acceptable randomness, because we run each target task only once and the target task itself is not easy to learn.)

[Figure 4: six heatmap panels of per-task sample counts; rows are corruption types, columns are digit labels 0-9.]

Figure 4. Top: sample distribution for target task scale 0; bottom: sample distribution for target task scale 2. We show the sample distribution at epochs 1, 2, and 3.

[Figure 5: two test-error curves against the total sample number (10000-35000), one per target task.]

Figure 5. Test error change after each epoch for target tasks scale 0 and scale 2.

F.2.3. MORE GOOD SAMPLE DISTRIBUTION EXAMPLES

[Figure 6: twenty-four heatmap panels of per-task sample counts; rows are corruption types, columns are digit labels 0-9.]

Figure 6. Good sample distributions. We show the sample distribution at epochs 1, 2, and 3. From top to bottom: glass blur 0, glass blur 1, glass blur 3, impulse noise 4, motion blur 6, motion blur 7, identity 8, identity 9.

F.3. More results and analysis for the convnet model

The convnet gives overall better accuracy than the linear model, except on the translate class, as shown in Figure 7. We therefore argue that it may be harder to obtain as large an improvement as with the linear model, given the convnet's greater expressive power.
[Figure 7: two heatmaps; rows are corruption types, columns are digit labels 0-9.]

Figure 7. Left: summary of the performance difference for the conv model (restatement of Figure 2). Right: the error percentage of the non-adaptive algorithm; note that 10% is the baseline due to the imbalanced dataset, and larger is worse. Please refer to the main paper for further explanation.

F.3.1. WHY OUR ALGORITHM FAILS ON SOME TARGET TASKS?

Here we show scale 5 and shear 9 as representative bad cases. As with the linear model, we again observe non-stationary sample distributions across epochs in Figure 8. For scale 5, at the beginning of epoch 1 the samples fortunately concentrate on the X-5 source tasks, so our adaptive algorithm initially performs better than the non-adaptive one, as shown in Figure 9. Unfortunately, the samples soon diverge to other source tasks, which increases the test error. For shear 9, although some samples do concentrate on the X-9 source tasks, the number of samples from the X-9 source tasks makes up a decreasing proportion of the total number of source samples, so the algorithm performs worse on this task.

[Figure 8: six heatmap panels of per-task sample counts; rows are corruption types, columns are digit labels 0-9.]

Figure 8. Top: sample distribution for target task scale 5; bottom: sample distribution for target task shear 9. We show the sample distribution at epochs 1, 2, and 3.

[Figure 9: two test-error curves against the total sample number (15000-45000), one per target task.]

Figure 9. Test error change after each epoch for target tasks scale 5 and shear 9.
F.3.2. MORE GOOD SAMPLE DISTRIBUTION EXAMPLES

[Figure 10: twenty-seven heatmap panels of per-task sample counts; rows are corruption types, columns are digit labels 0-9.]

Figure 10. Good sample distributions. We show the sample distribution at epochs 1, 2, and 3. From top to bottom: brightness 0, brightness 1, dotted line 3, dotted line 4, identity 5, fog 6, glass blur 7, rotate 8, canny edges 9.